# Problem Statement 2: Fake News
Fake content is everywhere, from social media platforms to news platforms, and the list goes on. Given the advances in NLP, research institutes are putting a lot of sweat, blood, and tears into detecting the fake content generated across these platforms.

Fake news, defined by the New York Times as “a made-up story with an intention to deceive”, often for a secondary gain, is arguably one of the most serious challenges facing the news industry today. In a December Pew Research poll, 64% of US adults said that “made-up news” has caused a “great deal of confusion” about the facts of current events.

## Content
Your goal as a data scientist is to create an NLP model to combat the fake content problem. We believe that these AI technologies hold promise for significantly automating parts of the procedure human fact-checkers use today to determine whether a story is real or a hoax. The dataset has the following columns:
- Text - Raw content from social media/news platforms
- Text_Tag - Different types of content tags
- Labels - Represents various classes of Labels
    - Half-True - 2
    - False - 1
    - Mostly-True - 3
    - True - 5
    - Barely-True - 0
    - Not-Known - 4
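
When reporting results it can help to translate the integer codes back into their names. The snippet below is a minimal sketch that assumes the mapping listed above; `LABEL_NAMES` and `label_name` are illustrative names and are not part of the dataset or of the cells that follow.

```python
# Integer label codes as listed in the problem statement.
LABEL_NAMES = {
    0: "Barely-True",
    1: "False",
    2: "Half-True",
    3: "Mostly-True",
    4: "Not-Known",
    5: "True",
}


def label_name(code):
    """Return the readable name for an integer label code."""
    return LABEL_NAMES[code]


print(label_name(2))
```

With a pandas column, the same dictionary could be applied through `Series.map(LABEL_NAMES)`.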

## References
https://www.machinehack.com/hackathons/fake_news_content_detection_weekend_hackathon_20/overview

In [1]:
import sys
import scipy
import sklearn
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import re
import seaborn as sns
import spacy

from collections import Counter
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay  # replaces plot_confusion_matrix, which was removed in scikit-learn 1.2
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

%matplotlib inline

In [2]:
train_file = 'Data/train.csv'
test_file = 'Data/train.csv'  # NOTE: same path as train_file; point this at the held-out test CSV
train_df = pd.read_csv(train_file)
test_df = pd.read_csv(test_file)

In [3]:
train_df

Unnamed: 0,Labels,Text,Text_Tag
0,1,Says the Annies List political group supports ...,abortion
1,2,When did the decline of coal start? It started...,"energy,history,job-accomplishments"
2,3,"Hillary Clinton agrees with John McCain ""by vo...",foreign-policy
3,1,Health care reform legislation is likely to ma...,health-care
4,2,The economic turnaround started at the end of ...,"economy,jobs"
...,...,...,...
10235,3,There are a larger number of shark attacks in ...,"animals,elections"
10236,3,Democrats have now become the party of the [At...,elections
10237,2,Says an alternative to Social Security that op...,"retirement,social-security"
10238,1,On lifting the U.S. Cuban embargo and allowing...,"florida,foreign-policy"


In [4]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10240 entries, 0 to 10239
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Labels    10240 non-null  int64 
 1   Text      10240 non-null  object
 2   Text_Tag  10238 non-null  object
dtypes: int64(1), object(2)
memory usage: 240.1+ KB


In [5]:
np.where(pd.isnull(train_df))

(array([2142, 9375], dtype=int64), array([2, 2], dtype=int64))

In [6]:
train_df = train_df.drop([train_df.index[2142], train_df.index[9375]])
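
The two rows above are dropped by hard-coded position. A possible alternative, sketched here on a toy frame, is `DataFrame.dropna` with `subset`, which removes whichever rows have a missing `Text_Tag` without needing their positions. `toy_df` and its rows are hypothetical stand-ins for `train_df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (hypothetical rows).
toy_df = pd.DataFrame({
    "Labels": [1, 2, 3],
    "Text": ["a claim", "another claim", "a third claim"],
    "Text_Tag": ["abortion", np.nan, "economy,jobs"],
})

# Drop rows whose Text_Tag is missing; the original index labels are kept.
toy_clean = toy_df.dropna(subset=["Text_Tag"])
print(toy_clean.index.tolist())
```

On the real data the equivalent call would be `train_df.dropna(subset=['Text_Tag'])`.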

In [7]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10238 entries, 0 to 10239
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Labels    10238 non-null  int64 
 1   Text      10238 non-null  object
 2   Text_Tag  10238 non-null  object
dtypes: int64(1), object(2)
memory usage: 319.9+ KB


In [8]:
train_df['Labels'].value_counts()

2    2114
1    1993
3    1962
5    1676
0    1654
4     839
Name: Labels, dtype: int64
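
These counts give a useful reference point for the classifiers: a model that always predicts the most frequent label would already reach a certain accuracy. The sketch below computes that majority-class baseline from the counts printed above. It uses the full training frame rather than the 30% hold-out, so the comparison with later accuracy figures is only approximate; `label_counts` is an illustrative name.

```python
# Class counts copied from the value_counts() output above.
label_counts = {2: 2114, 1: 1993, 3: 1962, 5: 1676, 0: 1654, 4: 839}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)
baseline_accuracy = label_counts[majority_label] / total

print(total, majority_label, round(baseline_accuracy, 4))
# prints: 10238 2 0.2065
```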

In [9]:
nlp = spacy.load('en_core_web_sm')

In [10]:
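def convert_text(text):
    """Drop stop words, punctuation and digits; lemmatize and lowercase the rest."""
    sent = nlp(text)
    # Map entity text -> span. Tokens are matched by their own text, so only
    # single-token entities (e.g. "Texas") keep their original casing;
    # multi-token entities (e.g. "Barack Obama") are lemmatized like any word.
    ents = {x.text: x for x in sent.ents}
    tokens = []
    for w in sent:
        if w.is_stop or w.is_punct or w.is_digit:
            continue
        if w.text in ents:
            tokens.append(w.text)
        else:
            tokens.append(w.lemma_.lower())

    return ' '.join(tokens)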

In [11]:
def clean_text(text):
    """Replace commas and hyphens in a tag string with spaces."""
    return re.sub(r'[,-]', ' ', text)
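
To make the effect of `clean_text` concrete, here is a small standalone check on one of the tag strings shown in the training data. The function is repeated so the snippet runs on its own.

```python
import re


def clean_text(text):
    # Same substitution as the cell above: commas and hyphens become spaces.
    return re.sub(r'[,-]', ' ', text)


print(clean_text('energy,history,job-accomplishments'))
# prints: energy history job accomplishments
```

Each tag word then becomes a separate token if the cleaned string is later passed to a vectorizer.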

In [12]:
train_df['short'] = train_df['Text'].apply(convert_text)
train_df['Text_Tag2'] = train_df['Text_Tag'].apply(clean_text)

In [13]:
train_df.sample(10)

Unnamed: 0,Labels,Text,Text_Tag,short,Text_Tag2
937,3,Much more than 50 percentof parents out there ...,families,percentof parent spanker,families
7109,2,Says Scott Walker gave $6 million in tax break...,"economy,infrastructure,jobs,state-budget,taxes...",say scott walker give $ million tax break corp...,economy infrastructure jobs state budget taxes...
8938,3,"Barack Obama is ""the only candidate who doesn'...",energy,barack obama candidate dime oil company pac lo...,energy
7796,2,The U.S. Postal Service doesnt run on your tax...,"debt,federal-budget",u.s. postal service nt run tax dollar funded s...,debt federal budget
9893,1,Says a gun bill before the Senate is proposing...,"civil-rights,guns",say gun bill Senate propose universal registra...,civil rights guns
5182,5,Says theres no language in the U.S. Constituti...,elections,say s language u.s. constitution prevent run C...,elections
539,3,Says Texas ranks 49th nationally in what we ar...,"education,state-budget,state-finances,states",say Texas rank 49th nationally support pupil i...,education state budget state finances states
9537,0,Says Florida Gov. Rick Scotts cuts to educatio...,"education,state-budget,taxes",say Florida gov. rick scotts cut education pro...,education state budget taxes
9696,1,Says Milwaukee County Executive Chris Abele on...,county-budget,say Milwaukee county executive chris abele pro...,county budget
9569,1,TheAffordable Care Actalters the sensible doct...,health-care,theaffordable care actalters sensible doctor p...,health care


In [14]:
X = train_df['short']
y = train_df['Labels']

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 99)
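
Because the label counts are uneven, one option worth considering is a stratified split, which keeps the class proportions the same in both parts. This is a sketch on toy data, not what the notebook ran: `toy_X` and `toy_y` are hypothetical, and the split uses 25% instead of 30% only so the toy counts come out as whole numbers.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy labels with a 4:1 imbalance (hypothetical data).
toy_X = [f"text {i}" for i in range(40)]
toy_y = [0] * 32 + [1] * 8

# stratify=toy_y keeps the 4:1 ratio in both the train and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=99, stratify=toy_y
)
print(Counter(y_tr), Counter(y_te))
```

In the notebook the equivalent change would be adding `stratify = y` to the `train_test_split` call.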

In [16]:
counts = CountVectorizer()
A = counts.fit_transform(X_train)  # the vectorizer ignores y, so it is not passed
pd.DataFrame(A.todense(), columns = counts.get_feature_names_out()).head()  # get_feature_names() was removed in scikit-learn 1.2

Unnamed: 0,00,000,000new,014,024,029,05,050,054th,07,...,zimmerman,zinn,zip,zippo,zombie,zone,zones,zoning,zoo,zuckerberg
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
classifier = SVC(kernel = 'rbf', probability=True)
classifier.fit(A.toarray(), y_train)
B = counts.transform(X_test)
# Use toarray() (ndarray) rather than todense() (np.matrix), which newer scikit-learn rejects.
predictions = classifier.predict(B.toarray())
print('Accuracy: %.4f' % accuracy_score(y_test, predictions))

Accuracy: 0.2555
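
`SVC` also accepts SciPy sparse matrices directly, so the `.toarray()` conversion is optional as long as training and prediction use the same kind of input. One way to keep the vectorizer and the classifier paired is a scikit-learn `Pipeline`, which can then be scored with `cross_val_score`. The following is a minimal sketch on a tiny hypothetical corpus; it is not the configuration that produced the accuracy above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny hypothetical corpus; in the notebook this would be X_train / y_train.
toy_texts = [
    "tax cut create job", "tax cut hurt budget", "job growth tax plan",
    "budget deficit tax rise", "school fund education cut",
    "education budget teacher pay", "teacher school class size",
    "student school education loan",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# The pipeline fits the vectorizer on each training fold only and feeds
# the sparse count matrix straight to SVC, without .toarray().
model = make_pipeline(CountVectorizer(), SVC(kernel="rbf"))
scores = cross_val_score(model, toy_texts, toy_labels, cv=2)
print(scores.mean())
```

Cross-validation averages over several splits, so it gives a steadier estimate than a single 70/30 hold-out.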


In [18]:
tfidf = TfidfVectorizer()
A = tfidf.fit_transform(X_train)
pd.DataFrame(A.todense(), columns = tfidf.get_feature_names_out()).head()
classifier.fit(A.toarray(), y_train)
B = tfidf.transform(X_test)
# The classifier was fit on a dense array, so the test matrix must be dense too.
# Passing the sparse B raised:
#   ValueError: cannot use sparse input in 'SVC' trained on dense data
predictions_2 = classifier.predict(B.toarray())
print('Accuracy: %.4f' % accuracy_score(y_test, predictions_2))

In [20]:
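# A and B must come from the same vectorizer. After In [18], A holds TF-IDF
# features, so the count matrix is rebuilt here (assumption: count features
# were intended, since B uses `counts`). The recorded accuracy is from the original run.
A = counts.transform(X_train)
classifier = SVC(kernel = 'poly', degree = 6, probability=True)
classifier.fit(A.toarray(), y_train)
B = counts.transform(X_test)
predictions = classifier.predict(B.toarray())
print('Accuracy: %.4f' % accuracy_score(y_test, predictions))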

Accuracy: 0.2223
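
A single accuracy figure hides which classes get confused with each other. `confusion_matrix` is imported at the top but never called, so here is a minimal sketch of how it could be used. The labels below are hypothetical and are not results from this notebook.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, not results from this notebook.
toy_true = [0, 1, 2, 2, 1, 0]
toy_pred = [0, 2, 2, 2, 1, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1, 2])
print(cm)
```

In the notebook the corresponding call would be `confusion_matrix(y_test, predictions)`.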
