# PROJECT | Natural Language Processing Challenge

- **dataset/data.csv**: dataset of news articles with the following columns:

    - `label`: 0 if the news is fake, 1 if the news is real.
    - `title`: The headline of the news article.
    - `text`: The full content of the article.
    - `subject`: The category or topic of the news.
    - `date`: The publication date of the article.


### Import Libraries

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

### Load and Read dataset

In [None]:
df = pd.read_csv('./dataset/data.csv', encoding="ISO-8859-1")


In [7]:
df.head()

,label,title,text,subject,date
0,1,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,1,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,1,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,1,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [8]:
print("Dataset shape: ", df.shape)
print("\nColumns: ", df.columns.tolist())
df.info()


Dataset shape:  (39942, 5)

Columns:  ['label', 'title', 'text', 'subject', 'date']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39942 entries, 0 to 39941
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    39942 non-null  int64 
 1   title    39942 non-null  object
 2   text     39942 non-null  object
 3   subject  39942 non-null  object
 4   date     39942 non-null  object
dtypes: int64(1), object(4)
memory usage: 1.5+ MB


We can already see that there are no missing values: every column has 39942 non-null entries.
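`df.info()` only reports NaN values, so an empty or whitespace-only string would still count as non-null. The sketch below shows one way to check for both cases. It runs on a made-up toy frame rather than `data.csv`, and the column list `text_cols` is an assumption for illustration.

```python
import pandas as pd

# Toy stand-in for the real dataset (made-up rows, not from data.csv).
toy = pd.DataFrame({
    "label": [1, 0, 0],
    "title": ["A headline", "   ", "Another headline"],
    "text": ["Some body text", "More body text", ""],
})

# NaN counts per column: this is what df.info() reflects.
nan_counts = toy.isna().sum()

# Blank or whitespace-only strings are non-null, so count them separately.
text_cols = ["title", "text"]
blank_counts = toy[text_cols].apply(lambda col: col.str.strip().eq("").sum())

print(nan_counts)
print(blank_counts)
```

On the real data, the same two checks could be run with `df` in place of `toy`.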

In [16]:
print("Examples of titles:")
print(df[['label','title']].sample(5, random_state=42))

print("Examples of subjects:")
print(df['subject'].unique())


Examples of titles:
       label                                              title
6524       1  Oil business seen in strong position as Trump ...
30902      0  WHOA! COLLEGE SNOWFLAKE FREAKS OUT: Screams Fo...
36459      0  CRONY CORRUPT POLITICS: Obama Admin BLOCKED FB...
9801       1  Cruz campaign vetting Fiorina as a possible VP...
25638      0   Minnesota Woman Writes Amazing F*ck Off Lette...
Examples of subjects:
['politicsNews' 'worldnews' 'News' 'politics' 'Government News'
 'left-news']


In [18]:
df[df['subject'] == 'left-news'][['label', 'title']].head(5)

,label,title
37460,0,BARBRA STREISAND Gives Up On Dream Of Impeachi...
37461,0,WATCH: SENATOR LINDSEY GRAHAM DROPS BOMBSHELL...
37462,0,“CONSERVATIVE GAY GUY” BLASTS Pence’s As...
37463,0,WHITE COLLEGE SNOWFLAKES Can “Identify” As...
37464,0,BILL NYE The FAKE Science Guy THREATENS Conser...
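
Reading a UTF-8 file as ISO-8859-1 turns curly quotes and apostrophes, such as those in the titles above, into multi-character sequences beginning with `â`. The sketch below shows one way to undo that, assuming the file really is UTF-8 (the source does not confirm this). `fix_mojibake` is a hypothetical helper, and the example title is made up.

```python
def fix_mojibake(s: str) -> str:
    """Undo UTF-8 text that was decoded as ISO-8859-1 (hypothetical helper)."""
    try:
        return s.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not a Latin-1/UTF-8 mix-up; leave the string untouched.
        return s


# Simulate the garbling on a made-up title, then repair it.
original = "Students Can \u201cIdentify\u201d As Anything"
garbled = original.encode("utf-8").decode("latin-1")
repaired = fix_mojibake(garbled)

print(repaired == original)
```

If the assumption holds, something like `df['title'].map(fix_mojibake)` would apply the repair to a whole column. A genuine Latin-1 string whose bytes happen to form valid UTF-8 would also be changed, so it is worth spot-checking the result.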


Next, we check whether any articles with the 'left-news' subject are labeled as real (`label == 1`).

In [None]:
df[(df['subject'] == 'left-news') & (df['label'] == 1)][['label', 'subject', 'title']].head()

,label,subject,title

The result is empty: no 'left-news' article is labeled as real.


Next, we compute the share of fake (0) and real (1) articles within each subject (each row sums to 1).

In [None]:
pd.crosstab(df['subject'], df['label'], normalize='index')

subject,label 0 (fake),label 1 (real)
Government News,1.0,0.0
News,1.0,0.0
left-news,1.0,0.0
politics,1.0,0.0
politicsNews,0.0,1.0
worldnews,0.0,1.0
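
In the table above, every subject is either entirely fake or entirely real, so in this dataset `subject` alone determines `label`. A model given that column could separate the classes without reading any text, which is worth keeping in mind when choosing features. The sketch below shows one way to confirm this and to get raw counts instead of proportions. It uses a made-up toy frame, not the real data.

```python
import pandas as pd

# Toy frame mirroring the structure of the real data (made-up rows).
toy = pd.DataFrame({
    "subject": ["politicsNews", "politicsNews", "worldnews",
                "News", "left-news", "politics"],
    "label": [1, 1, 1, 0, 0, 0],
})

# Number of distinct labels seen within each subject.
labels_per_subject = toy.groupby("subject")["label"].nunique()

# True when every subject maps to exactly one label.
is_perfect_proxy = bool((labels_per_subject == 1).all())

# Raw counts per subject and label (no normalization).
counts = pd.crosstab(toy["subject"], toy["label"])

print(is_perfect_proxy)
print(counts)
```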


### Next steps

- One-hot encoding
- Lemmatization
- Random forest
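
The imports at the top of the notebook bring in `TfidfVectorizer`, `MultinomialNB`, `RandomForestClassifier`, and the metrics helpers, but no modeling cell appears yet. The sketch below shows one way these pieces could fit together. It is not the author's model: the toy titles, the split sizes, and `n_estimators=50` are all assumptions, and the scores it prints say nothing about the real dataset.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up titles standing in for df['title']; 1 = real, 0 = fake.
toy = pd.DataFrame({
    "title": [
        "Senate passes budget bill after long debate",
        "Central bank holds interest rates steady",
        "Minister announces new trade agreement",
        "Court rules on election funding case",
        "WOW! You WON'T BELIEVE what this senator said",
        "SHOCKING: secret plot EXPOSED by insider",
        "BOMBSHELL video DESTROYS rival campaign",
        "WATCH: celebrity FREAKS OUT over vote",
    ],
    "label": [1, 1, 1, 1, 0, 0, 0, 0],
})

# Hold out a stratified test set so both classes appear in it.
X_train, X_test, y_train, y_test = train_test_split(
    toy["title"], toy["label"],
    test_size=0.25, stratify=toy["label"], random_state=42,
)

# Each pipeline turns raw text into TF-IDF features, then fits a classifier.
models = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "random_forest": make_pipeline(
        TfidfVectorizer(),
        RandomForestClassifier(n_estimators=50, random_state=42),
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```

Wrapping the vectorizer and classifier in one pipeline means the vocabulary is learned from the training split only, so the test titles do not leak into the features.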