# PROJECT | Natural Language Processing Challenge

- **dataset/data.csv**: dataset of news articles with the following columns:

    - `label`: 0 if the news is fake, 1 if the news is real.
    - `title`: The headline of the news article.
    - `text`: The full content of the article.
    - `subject`: The category or topic of the news.
    - `date`: The publication date of the article.


## Phase 1: Data Loading and Exploration

### Import Libraries

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

### Load and Read dataset

In [None]:
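# Assumption: the file is UTF-8 encoded. Reading it as ISO-8859-1 turns
# curly quotes and apostrophes in the titles into stray 'â' characters.
df = pd.read_csv('./dataset/data.csv', encoding="utf-8")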


In [7]:
df.head()

|   | label | title | text | subject | date |
|---|---|---|---|---|---|
| 0 | 1 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 |
| 1 | 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 |
| 2 | 1 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | December 31, 2017 |
| 3 | 1 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | December 30, 2017 |
| 4 | 1 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | December 29, 2017 |


In [8]:
print("Dataset shape: ", df.shape)
print("\nColumns: ", df.columns.tolist())
df.info()


Dataset shape:  (39942, 5)

Columns:  ['label', 'title', 'text', 'subject', 'date']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39942 entries, 0 to 39941
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    39942 non-null  int64 
 1   title    39942 non-null  object
 2   text     39942 non-null  object
 3   subject  39942 non-null  object
 4   date     39942 non-null  object
dtypes: int64(1), object(4)
memory usage: 1.5+ MB


Every column has 39942 non-null entries, so the dataset has no missing (null) values.
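Note that `df.info()` only counts nulls: an empty or whitespace-only string still counts as non-null. The following is a minimal sketch of how blank strings could be checked separately. It uses a small hypothetical frame named `toy` (not the real dataset), so the counts it prints say nothing about `data.csv`.

```python
import pandas as pd

# Hypothetical toy frame standing in for df; not the real dataset.
toy = pd.DataFrame({
    "title": ["Real headline", "  ", "Another headline"],
    "text": ["Body text", "Body text", ""],
})

# Null counts: what the "Non-Null Count" column of df.info() is based on
null_counts = toy.isna().sum()

# Blank counts: strings that are empty once surrounding whitespace is stripped
blank_counts = toy.apply(lambda col: col.str.strip().eq("").sum())

print(null_counts.to_dict())
print(blank_counts.to_dict())
```

On the real data, the same two lines could be applied to `df[['title', 'text']]`.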

In [16]:
print("Examples of titles:")
print(df[['label','title']].sample(5, random_state=42))

print("Examples of subjects:")
print(df['subject'].unique())


Examples of titles:
       label                                              title
6524       1  Oil business seen in strong position as Trump ...
30902      0  WHOA! COLLEGE SNOWFLAKE FREAKS OUT: Screams Fo...
36459      0  CRONY CORRUPT POLITICS: Obama Admin BLOCKED FB...
9801       1  Cruz campaign vetting Fiorina as a possible VP...
25638      0   Minnesota Woman Writes Amazing F*ck Off Lette...
Examples of subjects:
['politicsNews' 'worldnews' 'News' 'politics' 'Government News'
 'left-news']


In [18]:
df[df['subject'] == 'left-news'][['label', 'title']].head(5)

|   | label | title |
|---|---|---|
| 37460 | 0 | BARBRA STREISAND Gives Up On Dream Of Impeachi... |
| 37461 | 0 | WATCH: SENATOR LINDSEY GRAHAM DROPS BOMBSHELL... |
| 37462 | 0 | “CONSERVATIVE GAY GUY” BLASTS Pence’s As... |
| 37463 | 0 | WHITE COLLEGE SNOWFLAKES Can “Identify” As... |
| 37464 | 0 | BILL NYE The FAKE Science Guy THREATENS Conser... |


Next, we check whether any articles with the 'left-news' subject are labeled as real.

In [None]:
df[(df['subject'] == 'left-news') & (df['label'] == 1)][['label', 'subject', 'title']].head()

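|   | label | subject | title |
|---|---|---|---|

The result is empty: no 'left-news' article is labeled as real.

Next, we compute the proportion of fake (0) and real (1) articles within each subject (`normalize='index'` makes each row sum to 1).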

In [None]:
pd.crosstab(df['subject'], df['label'], normalize='index')

| subject | label 0 | label 1 |
|---|---|---|
| Government News | 1.0 | 0.0 |
| News | 1.0 | 0.0 |
| left-news | 1.0 | 0.0 |
| politics | 1.0 | 0.0 |
| politicsNews | 0.0 | 1.0 |
| worldnews | 0.0 | 1.0 |
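Every row of the table is either all fake or all real, so in this dataset `subject` alone determines `label`. A model given `subject` as a feature could score perfectly without reading the text, which is a form of label leakage. Here is a minimal sketch of how that one-to-one mapping could be checked programmatically, using a hypothetical toy frame rather than the real `df`:

```python
import pandas as pd

# Hypothetical toy frame mimicking the subject/label pattern seen above.
toy = pd.DataFrame({
    "subject": ["politicsNews", "worldnews", "News", "left-news", "News"],
    "label":   [1, 1, 0, 0, 0],
})

# Number of distinct labels observed within each subject
labels_per_subject = toy.groupby("subject")["label"].nunique()

# True when every subject maps to exactly one label
subject_determines_label = bool((labels_per_subject == 1).all())
print(subject_determines_label)
```

If the check is also true on the real data, it may be worth training at least one model on the text alone, to see what the text contributes without `subject`.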


Planned next steps: one-hot encode `subject`, stem the text (lemmatization is an alternative), and train a random forest classifier.

## Phase 2: Text Preprocessing

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer("english")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Lain\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Lain\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Define cleaning function

In [None]:
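def preprocess_text(text):
    """Lowercase, strip non-letters, tokenize, drop stopwords and stem."""
    # convert to lower case
    text = text.lower()

    # delete everything except lowercase letters and whitespace
    # (digits and punctuation are removed without inserting a space)
    text = re.sub(r'[^a-z\s]', '', text)

    # split into word tokens
    tokens = word_tokenize(text)

    # drop English stopwords
    tokens = [w for w in tokens if w not in stop_words]

    # reduce each token to its stem
    tokens = [stemmer.stem(w) for w in tokens]
    return ' '.join(tokens)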

In [31]:
print(df['title']+ " " + df['text'])

0        As U.S. budget fight looms, Republicans flip t...
1        U.S. military to accept transgender recruits o...
2        Senior U.S. Republican senator: 'Let Mr. Muell...
3        FBI Russia probe helped by Australian diplomat...
4        Trump wants Postal Service to charge 'much mor...
                               ...                        
39937    THIS IS NOT A JOKE! Soros-Linked Group Has Pla...
39938    THE SMARTEST WOMAN In Politics: “How Trump C...
39939    BREAKING! SHOCKING VIDEO FROM CHARLOTTE RIOTS:...
39940    BREAKING! Charlotte News Station Reports Cops ...
39941    BIG MISTAKE! HILLARY JUST Proved To America Sh...
Length: 39942, dtype: object


In [30]:
df['combined'] = df['title'] + " " + df['text']
df['clean_combined'] = df['combined'].apply(preprocess_text)  # cleaned text, later split into train and test

In [32]:
df['clean_combined']

0        us budget fight loom republican flip fiscal sc...
1        us militari accept transgend recruit monday pe...
2        senior us republican senat let mr mueller job ...
3        fbi russia probe help australian diplomat tipo...
4        trump want postal servic charg much amazon shi...
                               ...                        
39937    joke soroslink group plan destroy trumpwil reg...
39938    smartest woman polit trump knock hillari first...
39939    break shock video charlott riot situat control...
39940    break charlott news station report cop dash ca...
39941    big mistak hillari prove america shes commit k...
Name: clean_combined, Length: 39942, dtype: object
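Tokens such as `soroslink` and `trumpwil` in the output above come from the regex in `preprocess_text`: it deletes punctuation outright, so words joined by a hyphen or by dots are glued together. The sketch below compares that behaviour with one possible alternative, replacing non-letters with a space. The sample string is hypothetical and only the standard `re` module is used.

```python
import re

# Hypothetical sample, already lowercased as in preprocess_text
sample = "soros-linked group says yes...then no"

# Pattern from preprocess_text: non-letters are deleted, gluing neighbours together
deleted = re.sub(r'[^a-z\s]', '', sample)

# Alternative: replace non-letters with a space, then collapse runs of whitespace
spaced = re.sub(r'\s+', ' ', re.sub(r'[^a-z\s]', ' ', sample)).strip()

print(deleted)  # soroslinked group says yesthen no
print(spaced)   # soros linked group says yes then no
```

The alternative has a cost of its own: abbreviations such as "u.s." become the two tokens "u" and "s", so which variant is preferable depends on the data.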

### Subject one-hot encoding

In [36]:
subject_hot = pd.get_dummies(df['subject'], prefix='subject', dtype=int)

df_preprocess = pd.concat([df, subject_hot], axis=1)

In [49]:
df_label = df['label']

In [50]:
df_label

0        1
1        1
2        1
3        1
4        1
        ..
39937    0
39938    0
39939    0
39940    0
39941    0
Name: label, Length: 39942, dtype: int64

In [51]:

df_preprocess.drop(columns=['label', 'title', 'text','subject','date','combined'], inplace=True)

In [52]:
df_preprocessed = pd.concat([df_label, df_preprocess], axis=1)

In [53]:
df_preprocessed

|   | label | clean_combined | subject_Government News | subject_News | subject_left-news | subject_politics | subject_politicsNews | subject_worldnews |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | us budget fight loom republican flip fiscal sc... | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | us militari accept transgend recruit monday pe... | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1 | senior us republican senat let mr mueller job ... | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 1 | fbi russia probe help australian diplomat tipo... | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 1 | trump want postal servic charg much amazon shi... | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 39937 | 0 | joke soroslink group plan destroy trumpwil reg... | 0 | 0 | 1 | 0 | 0 | 0 |
| 39938 | 0 | smartest woman polit trump knock hillari first... | 0 | 0 | 1 | 0 | 0 | 0 |
| 39939 | 0 | break shock video charlott riot situat control... | 0 | 0 | 1 | 0 | 0 | 0 |
| 39940 | 0 | break charlott news station report cop dash ca... | 0 | 0 | 1 | 0 | 0 | 0 |
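The cells above build this frame in several steps: a concat, an in-place drop, then a second concat. As a sketch, the same layout could be produced with a single `pd.concat` that selects only the columns to keep. The frame `toy` and the name `features` are hypothetical stand-ins, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame with the columns relevant at this stage
toy = pd.DataFrame({
    "label": [1, 0],
    "clean_combined": ["us budget fight", "break shock video"],
    "subject": ["politicsNews", "left-news"],
})

# One-hot encode the subject, as in the notebook
subject_hot = pd.get_dummies(toy["subject"], prefix="subject", dtype=int)

# Keep only the needed columns and append the dummies in one step
features = pd.concat([toy[["label", "clean_combined"]], subject_hot], axis=1)
print(features.columns.tolist())
```

Selecting columns explicitly avoids mutating a frame with `inplace=True`, so re-running a cell out of order is less likely to fail.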


## Phase 3: Feature Engineering

In [55]:
# Split train & test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_preprocess, df_label, test_size=0.2, random_state=42)
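After the split, `X_train['clean_combined']` still holds raw strings, which a classifier cannot consume directly. Below is a minimal sketch of one possible next step with the `TfidfVectorizer` and `MultinomialNB` imported at the top of the notebook. The texts and labels are hypothetical toy data, and the vectorizer is fitted on the training texts only so that no test vocabulary leaks into training.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the clean_combined column and labels
train_texts = [
    "senat vote budget bill",
    "break shock video riot",
    "govern pass tax bill",
    "shock video goe viral",
]
train_labels = [1, 0, 1, 0]
test_texts = ["senat pass budget", "shock riot video"]

# Learn the vocabulary and IDF weights from the training texts only
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(train_texts)

# Reuse the fitted vocabulary for the test texts
X_test_vec = vectorizer.transform(test_texts)

# Fit a multinomial Naive Bayes classifier on the TF-IDF features
model = MultinomialNB()
model.fit(X_train_vec, train_labels)

predictions = model.predict(X_test_vec)
print(predictions)
```

In the notebook this would correspond to passing `X_train['clean_combined']` and `X_test['clean_combined']` in place of the toy lists. Passing `stratify=df_label` to `train_test_split` is an option for keeping the fake/real ratio equal across both splits.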