# <center> Classifying fake news using supervised learning with NLP

#### What is supervised learning?
- Form of machine learning
    - Problem has predefined training data
    - This data has a label (or outcome) you want the model to learn
    - Classification or Regression problem
    
#### Supervised learning with NLP
- Need to use language instead of geometric features
- scikit-learn: Powerful open-source library
- How to create supervised learning data from text?
    - Use bag-of-words models or tf-idf as features
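
As a rough illustration of what a bag-of-words feature vector looks like, the sketch below counts words by hand with the standard library. The two documents are made-up examples, and this is not how scikit-learn implements it internally (there are no tokenization rules or stop-word removal here). It only shows the idea of mapping each document to counts over a shared vocabulary.

```python
from collections import Counter

# Hypothetical toy documents (not from the course dataset).
docs = ["the alien ship landed", "the detective chased the thief"]

# Build a sorted vocabulary shared by all documents.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc, vocabulary):
    """Return a list of word counts aligned with `vocabulary`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

# One count vector per document; each position corresponds to one vocabulary word.
vectors = [bag_of_words(doc, vocabulary) for doc in docs]

for doc, vector in zip(docs, vectors):
    print(doc, "->", vector)
```

Each vector has one entry per vocabulary word, so documents of different lengths become rows of the same width.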
    
    
#### Supervised learning steps
- Collect and preprocess our data
- Determine a label (Example: Movie genre)
- Split data into training and test sets
- Extract features from the text to help predict the label
    - Bag-of-words vector built into scikit-learn
- Evaluate trained model using the test set

#### Possible features for a text classification problem?
- Number of words in a document.
- Specific named entities.
- Language

# <center> Building word count vectors with scikit-learn
    
#### Predicting movie genre
- Dataset consisting of movie plots and corresponding genre
- Goal: Create bag-of-words vectors for the movie plots
- Can we predict genre based on the words used in the plot summary?
    
#### Count Vectorizer with Python    
    
>In [1]: import pandas as pd

>In [2]: from sklearn.model_selection import train_test_split

>In [3]: from sklearn.feature_extraction.text import CountVectorizer
    
>In [4]: df = ... # Load data into DataFrame
    
>In [5]: y = df['Sci-Fi']

    test_size=0.33 holds out 33% of the data for testing
        X_train: training data
        y_train: training labels
        X_test: test data
        y_test: test labels
>In [6]: X_train, X_test, y_train, y_test = train_test_split(
                                             df['plot'], y, 
                                             test_size=0.33, 
                                             random_state=53)

    Turn text into bag-of-words vectors; stop_words='english' also removes English stop words as a preprocessing step
>In [7]: count_vectorizer = CountVectorizer(stop_words='english')

    fit_transform learns the vocabulary from the training data and creates its bag-of-words vectors
    It generates a mapping of words to IDs, plus vectors counting how many times each word appears in the plot
>In [8]: count_train = count_vectorizer.fit_transform(X_train.values)
    
>In [9]: count_test = count_vectorizer.transform(X_test.values)
    
    
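IMPORTANT: Words that appear in the test data but not in the training data are a problem: the fitted vectorizer has no ID for them, so transform() leaves them out of the test vectors. Options: add more training data or remove the missing words from the test data.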
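
A minimal sketch of this behaviour, assuming a default `CountVectorizer` and made-up sentences: the word `wizard` occurs only in the test sentence, so it gets no column and is left out of the test vector.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentences; "wizard" never appears in the training data.
train_docs = ["the spaceship landed on mars", "aliens attacked the spaceship"]
test_docs = ["the wizard attacked the spaceship"]

vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_counts = vectorizer.transform(test_docs)        # reuses that vocabulary

# The learned vocabulary only contains training words.
print("wizard" in vectorizer.vocabulary_)

# The test vector has the same width as the training vectors,
# and only counts the words the vectorizer already knows.
print(train_counts.shape, test_counts.shape)
print(test_counts.sum())
```

The test sentence has five words, but only four of them ("the" twice, "attacked", "spaceship") contribute to its vector.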
    
#### CountVectorizer for text classification

In [1]:
# Import the necessary modules
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import pandas as pd

# Print the head of df
df=pd.read_csv('datasets/fake_or_real_news.csv')
df.head()

| | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [2]:
# Create a series to store the labels: y
y = df['label']

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['text'],y,test_size=0.33, random_state=53)

# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = CountVectorizer(stop_words='english')

# Transform the training data using only the 'text' column values: count_train 
count_train = count_vectorizer.fit_transform(X_train)

# Transform the test data using only the 'text' column values: count_test 
count_test = count_vectorizer.transform(X_test)

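# Print the first 10 features of the count_vectorizer
# (get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())
print(count_vectorizer.get_feature_names()[:10])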

['00', '000', '0000', '00000031', '000035', '00006', '0001', '0001pt', '000ft', '000km']


#### TfidfVectorizer for text classification

In [3]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize a TfidfVectorizer object: tfidf_vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words="english",max_df=0.7)

# Transform the training data: tfidf_train 
tfidf_train = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data: tfidf_test 
tfidf_test = tfidf_vectorizer.transform(X_test)

# Print the first 10 features
print(tfidf_vectorizer.get_feature_names()[:10])

['00', '000', '0000', '00000031', '000035', '00006', '0001', '0001pt', '000ft', '000km']


In [4]:
# Print the first 5 vectors of the tfidf training data
print(tfidf_train.A[:5])

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
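
The vectors above are almost entirely zeros because each article uses only a small part of the vocabulary. To make the nonzero weights less opaque, here is a minimal sketch of the textbook tf-idf formula (term frequency times log inverse document frequency) on made-up token lists. It is a simplification: `TfidfVectorizer` by default uses a smoothed idf and L2-normalizes each row, so its numbers will differ.

```python
import math

# Hypothetical, already-tokenized documents.
docs = [
    ["alien", "ship", "alien"],
    ["ship", "crew"],
    ["crew", "mutiny"],
]

def tf(term, doc):
    """Term frequency: share of the document's tokens equal to `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(N / number of documents containing `term`)."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

alien_weight = tfidf("alien", docs[0], docs)  # frequent here, absent elsewhere
ship_weight = tfidf("ship", docs[0], docs)    # also appears in another document
print(alien_weight, ship_weight)
```

A word that is frequent in one document and rare in the rest of the corpus gets the largest weight, which is why `alien` outranks `ship` in the first document.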


#### Inspecting the vectors

Using the same data structures created in the previous code (count_train, count_vectorizer, tfidf_train, tfidf_vectorizer):

- The values can be accessed using the .A attribute of count_train and tfidf_train, respectively (.toarray() returns the same dense array).
- The column names can be accessed using the .get_feature_names() method of count_vectorizer and tfidf_vectorizer.

In [5]:
# Create the CountVectorizer DataFrame: count_df
count_df = pd.DataFrame(count_train.A, columns=count_vectorizer.get_feature_names())
# Print the head of count_df
count_df.head()

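| | 00 | 000 | 0000 | 00000031 | 000035 | 00006 | 0001 | 0001pt | 000ft | 000km | ... | حلب | عربي | عن | لم | ما | محاولات | من | هذا | والمرضى | ยงade |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |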


In [6]:
# Create the TfidfVectorizer DataFrame: tfidf_df
tfidf_df = pd.DataFrame(tfidf_train.A, columns=tfidf_vectorizer.get_feature_names())
# Print the head of tfidf_df
tfidf_df.head()

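| | 00 | 000 | 0000 | 00000031 | 000035 | 00006 | 0001 | 0001pt | 000ft | 000km | ... | حلب | عربي | عن | لم | ما | محاولات | من | هذا | والمرضى | ยงade |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |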


In [7]:
# Calculate the difference in columns: difference
difference = set(count_df.columns) - set(tfidf_df.columns)
print(difference)

# Test if the two DataFrames are equivalent by using the .equals()
print(count_df.equals(tfidf_df))

set()
False
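
The empty set and `False` above are consistent with each other: both vectorizers tokenize the same way, so they can end up with the same columns, while the cell values differ (raw counts versus tf-idf weights). A small sketch with made-up sentences, assuming default settings for both vectorizers:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus.
docs = ["alien ship alien", "ship crew", "crew mutiny"]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()

counts = count_vec.fit_transform(docs).toarray()   # integer word counts
weights = tfidf_vec.fit_transform(docs).toarray()  # float tf-idf weights

# Same vocabulary, so the column difference is empty...
same_columns = set(count_vec.vocabulary_) == set(tfidf_vec.vocabulary_)
# ...but the values are not the same, so the tables are not equal.
same_values = np.array_equal(counts, weights)

print(same_columns, same_values)
```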


# <center> Training and testing a classification model with scikit-learn

#### Naive Bayes classifier
- Naive Bayes Model
    - Commonly used for testing NLP classification problems
    - Basis in probability
- Given a particular piece of data, how likely is a particular outcome?
    Examples:
        - If the plot has a spaceship, how likely is it to be sci-fi?
        - Given a spaceship and an alien, how likely now is it sci-fi?
- Each word from CountVectorizer acts as a feature
- Naive Bayes: Simple and effective
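
To make the spaceship example concrete, here is a hand-rolled sketch of the Naive Bayes calculation with add-one (Laplace) smoothing. The four training plots are invented, and the `fit` and `predict_proba` helpers are hypothetical names for this illustration, not scikit-learn's API.

```python
import math
from collections import Counter

# Hypothetical toy training set: (plot tokens, genre label).
train = [
    (["spaceship", "alien", "planet"], "sci-fi"),
    (["alien", "invasion", "spaceship"], "sci-fi"),
    (["detective", "murder", "city"], "other"),
    (["love", "city", "wedding"], "other"),
]

def fit(train):
    """Count documents per label and word occurrences per label."""
    doc_counts = Counter(label for _, label in train)
    word_counts = {label: Counter() for label in doc_counts}
    for tokens, label in train:
        word_counts[label].update(tokens)
    vocab = {word for tokens, _ in train for word in tokens}
    return doc_counts, word_counts, vocab

def predict_proba(tokens, model, alpha=1.0):
    """Posterior probability of each label given the tokens (add-alpha smoothing)."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label, n_docs in doc_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in tokens:
            if word in vocab:  # ignore words never seen in training
                likelihood = (word_counts[label][word] + alpha) / (
                    total_words + alpha * len(vocab)
                )
                score += math.log(likelihood)
        log_scores[label] = score
    # Turn the log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    unnormalized = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(unnormalized.values())
    return {label: value / norm for label, value in unnormalized.items()}

model = fit(train)
p_one = predict_proba(["spaceship"], model)["sci-fi"]           # about 0.75
p_two = predict_proba(["spaceship", "alien"], model)["sci-fi"]  # about 0.90
print(p_one, p_two)
```

Adding a second sci-fi word raises the estimated probability, which is the "how likely now?" step in the bullets above. The "naive" part is that each word's contribution is multiplied in independently of the others.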

#### Example using MultinomialNB: 
- Works well with CountVectorizer output, as it expects integer inputs
- Is also used for classification with more than two labels (multi-class)
- May not work as well with floats, such as tf-idf weighted inputs. Alternatives such as SVMs or linear models may be a better fit there.

#### Count Vectorizer with NB example:

In [8]:
# Import the necessary modules
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Instantiate a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
nb_classifier.fit(count_train,y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(count_test)

In [9]:
# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test,pred)
print(score)

0.893352462936394


#### Confusion Matrix
- Predicted labels are shown across the top
- True labels are shown down the side 

In [10]:
# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test,pred,labels=['FAKE', 'REAL'])
print(cm)

[[ 865  143]
 [  80 1003]]
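
The matrix can be read directly into the usual metrics. The sketch below uses only the four numbers printed above; treating FAKE as the class of interest for precision and recall is a choice made for this illustration.

```python
# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels, both ordered ['FAKE', 'REAL'].
cm = [[865, 143],
      [80, 1003]]

total = sum(sum(row) for row in cm)

# Correct predictions sit on the diagonal.
accuracy = (cm[0][0] + cm[1][1]) / total

# Of everything predicted FAKE (first column), how much was truly FAKE?
fake_precision = cm[0][0] / (cm[0][0] + cm[1][0])

# Of everything truly FAKE (first row), how much was predicted FAKE?
fake_recall = cm[0][0] / (cm[0][0] + cm[0][1])

print(total, accuracy, fake_precision, fake_recall)
```

The accuracy computed this way matches the `accuracy_score` value printed earlier, since both count the diagonal entries over the 2091 test articles.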


#### tfidf with NB example:

In [11]:
# Create a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
nb_classifier.fit(tfidf_train,y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(tfidf_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test,pred)
print(score)

# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test,pred,labels=['FAKE', 'REAL'])
print(cm)

0.8565279770444764
[[ 739  269]
 [  31 1052]]


# <center> Simple NLP, Complex Problems
    
#### Complex problems in NLP
- Translation (Inaccurate translation)
- Sentiment analysis (Complex issues like sarcasm)
- Language Biases (recommended: https://www.youtube.com/watch?v=j7FwpZB1hWc)
    
#### How to improve the model
- Tweaking alpha levels.
- Trying a new classification model.
- Training on a larger dataset.
- Improving text preprocessing.

Example: improving the model by tuning the alpha parameter:

In [12]:
import numpy as np
# Create the list of alphas: alphas
alphas = np.arange(0,1,0.1)

# Define train_and_predict()
def train_and_predict(alpha):
    # Instantiate the classifier: nb_classifier
    nb_classifier = MultinomialNB(alpha=alpha)
    # Fit to the training data
    nb_classifier.fit(tfidf_train, y_train)
    # Predict the labels: pred
    pred = nb_classifier.predict(tfidf_test)
    # Compute accuracy: score
    score = metrics.accuracy_score(y_test,pred)
    return score

# Iterate over the alphas and print the corresponding score
for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()

Alpha:  0.0
[scikit-learn warning, truncated in the original output: ... 'setting alpha = %.1e' % _ALPHA_MIN)]
Score:  0.8813964610234337

Alpha:  0.1
Score:  0.8976566236250598

Alpha:  0.2
Score:  0.8938307030129125

Alpha:  0.30000000000000004
Score:  0.8900047824007652

Alpha:  0.4
Score:  0.8857006217120995

Alpha:  0.5
Score:  0.8842659014825442

Alpha:  0.6000000000000001
Score:  0.874701099952176

Alpha:  0.7000000000000001
Score:  0.8703969392635102

Alpha:  0.8
Score:  0.8660927785748446

Alpha:  0.9
Score:  0.8589191774270684
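
Rather than reading the best value off the printout, the scores can be collected and compared in code. The sketch below simply reuses the numbers printed above. Note that choosing alpha by test-set accuracy gives an optimistic estimate, so in practice a separate validation set or cross-validation would be a fairer basis.

```python
# Alpha -> accuracy, copied from the output above.
scores = {
    0.0: 0.8813964610234337,
    0.1: 0.8976566236250598,
    0.2: 0.8938307030129125,
    0.3: 0.8900047824007652,
    0.4: 0.8857006217120995,
    0.5: 0.8842659014825442,
    0.6: 0.874701099952176,
    0.7: 0.8703969392635102,
    0.8: 0.8660927785748446,
    0.9: 0.8589191774270684,
}

# Pick the alpha with the highest accuracy.
best_alpha = max(scores, key=scores.get)
print(best_alpha, scores[best_alpha])
```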



#### Inspecting the model
Investigate what it has learned.
- Save the class labels as class_labels by accessing the .classes_ attribute of nb_classifier
- Extract the features using the .get_feature_names() method of tfidf_vectorizer

In [13]:
# Get the class labels: class_labels
class_labels = nb_classifier.classes_

# Extract the features: feature_names
feature_names = tfidf_vectorizer.get_feature_names()

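# Zip the feature names together with the coefficient array and sort by weights: feat_with_weights
# (coef_ was removed from MultinomialNB in scikit-learn 1.1; for two classes coef_[0] equals feature_log_prob_[1])
feat_with_weights = sorted(zip(nb_classifier.feature_log_prob_[1], feature_names))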

# Print the first class label and the top 20 feat_with_weights entries
print(class_labels[0], feat_with_weights[:20])

FAKE [(-11.316312804238807, '0000'), (-11.316312804238807, '000035'), (-11.316312804238807, '0001'), (-11.316312804238807, '0001pt'), (-11.316312804238807, '000km'), (-11.316312804238807, '0011'), (-11.316312804238807, '006s'), (-11.316312804238807, '007'), (-11.316312804238807, '007s'), (-11.316312804238807, '008s'), (-11.316312804238807, '0099'), (-11.316312804238807, '00am'), (-11.316312804238807, '00p'), (-11.316312804238807, '00pm'), (-11.316312804238807, '014'), (-11.316312804238807, '015'), (-11.316312804238807, '018'), (-11.316312804238807, '01am'), (-11.316312804238807, '020'), (-11.316312804238807, '023')]


In [14]:
# Print the second class label and the bottom 20 feat_with_weights entries
print(class_labels[1], feat_with_weights[-20:])

REAL [(-7.742481952533027, 'states'), (-7.717550034444668, 'rubio'), (-7.703583809227384, 'voters'), (-7.654774992495461, 'house'), (-7.649398936153309, 'republicans'), (-7.6246184189367, 'bush'), (-7.616556675728881, 'percent'), (-7.545789237823644, 'people'), (-7.516447881078008, 'new'), (-7.448027933291952, 'party'), (-7.411148410203476, 'cruz'), (-7.410910239085596, 'state'), (-7.35748985914622, 'republican'), (-7.33649923948987, 'campaign'), (-7.2854057032685775, 'president'), (-7.2166878130917755, 'sanders'), (-7.108263114902301, 'obama'), (-6.724771332488041, 'clinton'), (-6.5653954389926845, 'said'), (-6.328486029596207, 'trump')]
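
One detail in the FAKE output is that every listed token has exactly the same weight. That is what additive (Laplace) smoothing would produce for tokens with zero total weight in the class the coefficients describe: they all receive the same floor value. The sketch below reproduces the effect with made-up totals. The numbers and the `smoothed_log_probs` helper are hypothetical, not taken from the fitted model.

```python
import math

# Hypothetical summed feature weights for one class.
feature_totals = {"trump": 40.0, "said": 30.0, "0000": 0.0, "0001": 0.0}
alpha = 1.0

def smoothed_log_probs(totals, alpha):
    """Log of (weight + alpha) / (sum of weights + alpha * number of features)."""
    denom = sum(totals.values()) + alpha * len(totals)
    return {feature: math.log((value + alpha) / denom)
            for feature, value in totals.items()}

log_probs = smoothed_log_probs(feature_totals, alpha)

# Sort by weight, lowest first, mirroring the sorted(zip(...)) call above.
ranked = sorted((weight, feature) for feature, weight in log_probs.items())
for weight, feature in ranked:
    print(feature, weight)
```

Tokens with zero weight tie at the bottom and are then ordered alphabetically by the sort, which is why the first twenty entries look like an alphabetical list of numeric tokens.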
