## Project Objective

In this project, we will build a fake news classifier: a model that labels news articles as fake or real using supervised learning with natural language processing (NLP).

## What is Supervised Learning?

Supervised learning is a type of machine learning where a model is trained on labeled data. The goal is to learn a mapping from input variables (also known as features) to output variables (also known as labels) based on the examples in the training set. Once the model is trained, it can be used to make predictions on new, unseen data.

Classification is one of the two main types of supervised learning (the other is regression). In classification, the output variable is categorical, such as "FAKE" or "REAL" in our case.

## What is Supervised Learning with NLP?

Supervised learning with natural language processing (NLP) is the application of supervised learning techniques to problems related to understanding and processing human language.
Here, the input variable (the feature) is the text column of the dataset.

## Scikit-learn

We will use scikit-learn to create features and train a model. Scikit-learn (the name comes from "SciKit", a SciPy Toolkit) is a free and open-source Python library for machine learning. It provides a wide range of tools and algorithms, including those for supervised learning. Scikit-learn is built on top of other popular Python libraries such as NumPy and SciPy, and is designed to be easy to use and integrate with other scientific libraries.

Scikit-learn is a popular choice for machine learning tasks in Python, due to its ease of use, wide range of features, and strong community support.

## Supervised Learning steps

1. Collect and preprocess the data
2. Determine a label
3. Split the data into training and test sets
4. Feature engineering - extract features from the text to help predict the label
5. Train the model on the training set
6. Evaluate the trained model using the test set
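
The steps above can be sketched end to end on a tiny, made-up corpus. This is only an illustration under stated assumptions: the texts, labels, and variable names (`toy_texts`, `toy_labels`, `toy_accuracy`) are hypothetical and are not part of the project's dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Steps 1-2: a hypothetical labeled corpus (not the project's data)
toy_texts = [
    "shocking miracle cure doctors hate",
    "secret miracle trick exposed",
    "you will not believe this shocking secret",
    "miracle diet exposed by insiders",
    "senate passes budget bill after vote",
    "president signs budget bill",
    "committee schedules vote on trade bill",
    "senate committee debates budget",
]
toy_labels = ["FAKE", "FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL", "REAL"]

# Step 3: split into training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels
)

# Step 4: turn text into bag-of-words features (fit on training data only)
toy_vectorizer = CountVectorizer()
train_features = toy_vectorizer.fit_transform(X_tr)
test_features = toy_vectorizer.transform(X_te)

# Step 5: train the model
toy_model = MultinomialNB()
toy_model.fit(train_features, y_tr)

# Step 6: evaluate on the held-out test set
toy_pred = toy_model.predict(test_features)
toy_accuracy = accuracy_score(y_te, toy_pred)
print(toy_accuracy)
```

With so few documents the score itself is not meaningful; the point is only the order of the steps.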

## Dataset

The dataset "fake_or_real_news.csv" is downloaded from [DataCamp](https://app.datacamp.com/) - an online learning platform.

In [2]:
# Import the necessary modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Load the dataset
fake_or_real_news = pd.read_csv('/Users/brindhamanivannan/NLP/Fake_News/fake_or_real_news.csv')
fake_or_real_news.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


## Explore the dataset

In [3]:
fake_or_real_news.shape

(6335, 4)

The dataset has 6335 rows and 4 columns.

In [4]:
fake_or_real_news.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6335 entries, 0 to 6334
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  6335 non-null   int64 
 1   title       6335 non-null   object
 2   text        6335 non-null   object
 3   label       6335 non-null   object
dtypes: int64(1), object(3)
memory usage: 198.1+ KB


Note that there are no null values.

In [6]:
# Label
# Create a series to store the label: y

y = fake_or_real_news.label
y

0       FAKE
1       FAKE
2       REAL
3       FAKE
4       REAL
        ... 
6330    REAL
6331    FAKE
6332    FAKE
6333    REAL
6334    REAL
Name: label, Length: 6335, dtype: object

In [7]:
type(y)

pandas.core.series.Series

In [8]:
# Feature

fake_or_real_news["text"]

0       Daniel Greenfield, a Shillman Journalism Fello...
1       Google Pinterest Digg Linkedin Reddit Stumbleu...
2       U.S. Secretary of State John F. Kerry said Mon...
3       — Kaydee King (@KaydeeKing) November 9, 2016 T...
4       It's primary day in New York and front-runners...
                              ...                        
6330    The State Department told the Republican Natio...
6331    The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...
6332     Anti-Trump Protesters Are Tools of the Oligar...
6333    ADDIS ABABA, Ethiopia —President Obama convene...
6334    Jeb Bush Is Suddenly Attacking Trump. Here's W...
Name: text, Length: 6335, dtype: object

In [9]:
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(fake_or_real_news["text"], y, test_size=0.33, random_state=53)

In [10]:
X_train.shape

(4244,)

In [11]:
X_train.head()

2576                                                     
1539    Report Copyright Violation Do you think there ...
5163    The election in 232 photos, 43 numbers and 131...
2615    Email Ever wonder what’s on the mind of today’...
4270    Wells Fargo is Rotting from the Top Down Wells...
Name: text, dtype: object

In [12]:
y_train.shape

(4244,)

In [13]:
y_train.head()

2576    FAKE
1539    FAKE
5163    REAL
2615    FAKE
4270    FAKE
Name: label, dtype: object

In [14]:
X_test.shape

(2091,)

In [15]:
X_test.head()

4221    Donald Trump threatened to sue the New York Ti...
1685    Planned Parenthood: Abortion pill usage now ri...
3348    In a last dash, final "hail mary" attempt to e...
2633    Washington (CNN) Donald Trump and Ben Carson n...
975     The Obama administration announced Friday it w...
Name: text, dtype: object

In [16]:
y_test.shape

(2091,)

In [17]:
y_test.head()

4221    REAL
1685    FAKE
3348    REAL
2633    REAL
975     REAL
Name: label, dtype: object
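
As a side note on splitting, `train_test_split` also accepts a `stratify` argument that keeps the class proportions similar in both sets. The sketch below uses made-up placeholder data (`toy_X`, `toy_y`) and is an optional variation, not what the cells above do.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical balanced data: 10 FAKE and 10 REAL placeholder articles
toy_X = [f"article {i}" for i in range(20)]
toy_y = ["FAKE"] * 10 + ["REAL"] * 10

# stratify=toy_y asks for the same class mix in the training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=53, stratify=toy_y
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```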

## Feature Engineering

How do we turn raw text into data a supervised model can learn from? This step is known as feature engineering, and it can be done using techniques such as bag-of-words, term frequency-inverse document frequency (TF-IDF), or word embeddings.
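
To make the bag-of-words idea concrete, here is a minimal hand-rolled sketch on two made-up sentences. The names `toy_docs`, `vocabulary`, and `bow_vectors` are illustrative only; scikit-learn's vectorizers do the same kind of counting with more careful tokenization.

```python
from collections import Counter

# Two hypothetical documents
toy_docs = ["the cat sat on the mat", "the dog sat"]

# The vocabulary is the sorted set of all distinct tokens in the corpus
vocabulary = sorted({token for doc in toy_docs for token in doc.split()})

# Each document becomes a vector of token counts, one entry per vocabulary term
bow_vectors = []
for doc in toy_docs:
    counts = Counter(doc.split())
    bow_vectors.append([counts[term] for term in vocabulary])

print(vocabulary)
print(bow_vectors)
```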

Let's extract features from the text. 

## CountVectorizer for text classification with Python

In [19]:
# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = CountVectorizer(stop_words='english') # stop words are removed
count_vectorizer

CountVectorizer(stop_words='english')

In [21]:
# Transform the training data using only the 'text' column values: count_train 
count_train = count_vectorizer.fit_transform(X_train.values) # call fit_transform on the training data to create the bag-of-words vectors
count_train

<4244x56922 sparse matrix of type '<class 'numpy.int64'>'
	with 1119820 stored elements in Compressed Sparse Row format>

In [22]:
count_train.toarray() # Convert the sparse matrix to a dense array (same result as the older .A shorthand)

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [23]:
# Transform the test data using only the 'text' column values: count_test 
count_test = count_vectorizer.transform(X_test.values)
count_test

<2091x56922 sparse matrix of type '<class 'numpy.int64'>'
	with 533697 stored elements in Compressed Sparse Row format>

In [24]:
count_test.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 3, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

Note:
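After calling fit_transform on the training data, we call transform on the test data to create bag-of-words vectors using the same vocabulary. Words that appear only in the test data are ignored, because they were not seen when the vocabulary was built.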
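
A small sketch of this behaviour, using made-up sentences (`toy_train`, `toy_test`) rather than the news data: the word "disputed" occurs only in the test sentence, so it gets no column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and test sentences
toy_train = ["election results announced", "election fraud claims"]
toy_test = ["election results disputed"]

toy_vectorizer = CountVectorizer()
# fit_transform learns the vocabulary from the training data and vectorizes it
toy_train_counts = toy_vectorizer.fit_transform(toy_train)
# transform reuses that vocabulary; unseen test words are dropped
toy_test_counts = toy_vectorizer.transform(toy_test)

learned_terms = sorted(toy_vectorizer.vocabulary_)
print(learned_terms)
print(toy_test_counts.toarray())
```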


In [25]:
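# Print the first 10 features of the count_vectorizer
# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
print(count_vectorizer.get_feature_names()[:10]) # The column names can be accessed using .get_feature_names()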

['00', '000', '0000', '00000031', '000035', '00006', '0001', '0001pt', '000ft', '000km']


In [26]:
type(count_vectorizer)

sklearn.feature_extraction.text.CountVectorizer

In [27]:
count_vectorizer

CountVectorizer(stop_words='english')

In [28]:
count_vectorizer.get_feature_names()

['00',
 '000',
 '0000',
 '00000031',
 '000035',
 '00006',
 '0001',
 '0001pt',
 '000ft',
 '000km',
 '001',
 '0011',
 '002',
 '003',
 '004',
 '006',
 '006s',
 '007',
 '007s',
 '008',
 '008s',
 '009',
 '0099',
 '00am',
 '00p',
 '00pm',
 '01',
 '010',
 '013',
 '014',
 '015',
 '016',
 '018',
 '01am',
 '02',
 '020',
 '022',
 '023',
 '024',
 '025',
 '027',
 '028',
 '02welcome',
 '03',
 '031',
 '032',
 '0325',
 '033',
 '034',
 '035',
 '037',
 '039',
 '03eb',
 '04',
 '040',
 '0400',
 '042',
 '044',
 '048',
 '049',
 '04pm',
 '05',
 '0509245d29',
 '052',
 '056',
 '06',
 '062',
 '066',
 '068',
 '06pm',
 '07',
 '0700',
 '075',
 '076',
 '079',
 '07dryempjx',
 '08',
 '080',
 '081',
 '082',
 '084',
 '089',
 '0891',
 '09',
 '098263',
 '09am',
 '09pm',
 '0_jgdktlmn',
 '0a_merrill',
 '0d',
 '0fjjvowyhg8qtskiz',
 '0h4at2yetra17uxetni02ls2jeg0mty45jrcu7mrzsrpcbq464i',
 '0hq3vb2giv',
 '0in',
 '0jsn6pjkan',
 '0oeekvljlt',
 '0pt',
 '0t5',
 '0txrbwvobzz4fi5nksw6k5a6cxzbb3juxthmdiz93cby8gvrqiypzhajvjnt2',
 '0wo

## TfidfVectorizer for text classification with Python

In [29]:
# Creating tf-idf vectors for the documents

# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize a TfidfVectorizer object: tfidf_vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
tfidf_vectorizer

TfidfVectorizer(max_df=0.7, stop_words='english')

Note:
The code creates an instance of the TfidfVectorizer class from the scikit-learn library, and sets the "stop_words" parameter to "english" and the "max_df" parameter to 0.7.

The "stop_words" parameter is used to remove common English words such as "a", "an", "the", etc., which are not useful for text classification tasks.

The "max_df" parameter sets the maximum document frequency a term can have and still be included in the tf-idf matrix. It is used to ignore terms that occur in too many documents of the corpus. A float between 0 and 1 is read as a proportion of documents, so max_df=0.7 drops terms that appear in more than 70% of the documents.

This TfidfVectorizer object can then be used to transform text data into numerical feature vectors, which can be used as input to a machine learning model.
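
The effect of "max_df" can be seen on a tiny made-up corpus (`toy_corpus` is hypothetical). With four documents and max_df=0.7, any term that occurs in three or more documents is dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "news" and "about" each occur in 3 of 4 documents (75%)
toy_corpus = [
    "news about the election",
    "news about the economy",
    "news about sports",
    "weather report today",
]

# Vocabulary without and with the document-frequency cap
all_terms = TfidfVectorizer().fit(toy_corpus).vocabulary_
filtered_terms = TfidfVectorizer(max_df=0.7).fit(toy_corpus).vocabulary_

# Terms removed because they appear in more than 70% of the documents
dropped = sorted(set(all_terms) - set(filtered_terms))
print(dropped)
```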

In [30]:
# Transform the training data: tfidf_train 
tfidf_train = tfidf_vectorizer.fit_transform(X_train.values)
tfidf_train

<4244x56922 sparse matrix of type '<class 'numpy.float64'>'
	with 1119820 stored elements in Compressed Sparse Row format>

In [31]:
tfidf_train.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [32]:
# Transform the test data: tfidf_test 
tfidf_test = tfidf_vectorizer.transform(X_test.values)
tfidf_test

<2091x56922 sparse matrix of type '<class 'numpy.float64'>'
	with 533697 stored elements in Compressed Sparse Row format>

In [33]:
tfidf_test.toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.05719984, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [34]:
tfidf_vectorizer.get_feature_names()

['00',
 '000',
 '0000',
 '00000031',
 '000035',
 '00006',
 '0001',
 '0001pt',
 '000ft',
 '000km',
 ...

In [35]:
# Print the first 10 features of tfidf_vectorizer
print(tfidf_vectorizer.get_feature_names()[:10])

['00', '000', '0000', '00000031', '000035', '00006', '0001', '0001pt', '000ft', '000km']


In [36]:
# Print the first 5 vectors of the tfidf training data
print(tfidf_train[:5].toarray())

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


## Create a dataframe from CountVectorizer and TfidfVectorizer

In [37]:
# Create the CountVectorizer DataFrame: count_df
count_df = pd.DataFrame(count_train.toarray(), columns=count_vectorizer.get_feature_names())
count_df

Unnamed: 0,00,000,0000,00000031,000035,00006,0001,0001pt,000ft,000km,...,حلب,عربي,عن,لم,ما,محاولات,من,هذا,والمرضى,ยงade
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4239,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4240,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4241,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4242,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [38]:
# Create the TfidfVectorizer DataFrame: tfidf_df
tfidf_df = pd.DataFrame(tfidf_train.toarray(), columns=tfidf_vectorizer.get_feature_names())
tfidf_df

Unnamed: 0,00,000,0000,00000031,000035,00006,0001,0001pt,000ft,000km,...,حلب,عربي,عن,لم,ما,محاولات,من,هذا,والمرضى,ยงade
0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4239,0.0,0.014123,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4240,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4241,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4242,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [39]:
# Calculate the difference in columns: difference
difference = set(count_df.columns) - set(tfidf_df.columns)
print(difference)

set()


In [40]:
# Check whether the DataFrames are equal
print(count_df.equals(tfidf_df))

False


## Training and testing a classification model with scikit-learn

Now let us use the features we have extracted above to train and test a supervised classification model.

## Naive Bayes model for text classification

Naive Bayes is a simple and effective method for text classification tasks, such as identifying fake news. It makes the assumption that the features (in this case, the words in the news article) are independent of one another, which simplifies the computation and allows for fast training and classification. Additionally, Naive Bayes has been shown to perform well on a variety of text classification tasks and requires relatively little data to train. These properties make it a good choice for a fake news classifier.

It is not strictly necessary to fully understand the underlying mechanics of the Naive Bayes algorithm in order to build a fake news classifier using it. However, having a general understanding of how the algorithm works can be helpful in interpreting the results of the classifier and making informed decisions about how to improve its performance. It can also help one to understand the strengths and weaknesses of the algorithm, which can inform how to best apply it to a given task.

## Naive Bayes algorithm in simple terms

Naive Bayes is a machine learning algorithm that is commonly used for text classification tasks such as identifying fake news. It is based on Bayes' theorem, which lets us compute the probability of a class given the observed features (here, the probability that an article is fake given the words it contains) from the probability of those features within each class and the overall probability of each class.

The basic idea behind Naive Bayes is that it uses how often words occur in fake and in real articles to classify new articles. It starts by training on a dataset of labeled news articles, and uses that training data to estimate the probability of each word occurring in a fake article and in a real article.

Once the algorithm is trained, when a new article is encountered, it combines the probabilities of the words in the article to score each class, and assigns the article to the class with the highest probability.

The "Naive" part of the name comes from the fact that the algorithm makes the assumption that the words in the article are independent of each other, which is not always the case in natural language. Despite this assumption, Naive Bayes can still perform well on text classification tasks.
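
The description above can be sketched in a few lines of plain Python. This is a simplified illustration on made-up training sentences, not scikit-learn's implementation; the function names `train_naive_bayes` and `classify` and the data in `toy_training` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical labeled examples
toy_training = [
    ("shocking miracle cure", "FAKE"),
    ("miracle secret exposed", "FAKE"),
    ("senate passes budget", "REAL"),
    ("budget vote delayed", "REAL"),
]

def train_naive_bayes(examples, alpha=1.0):
    """Estimate log priors and smoothed per-class word log-probabilities."""
    class_docs = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in class_docs}
    for text, label in examples:
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    log_prior = {c: math.log(n / len(examples)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c, counts in word_counts.items():
        # alpha is the smoothing term, so unseen words never get probability zero
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood

def classify(text, log_prior, log_likelihood):
    """Return the class with the highest log-posterior score."""
    scores = {}
    for c in log_prior:
        # The "naive" step: word log-probabilities are simply added together
        known = [w for w in text.split() if w in log_likelihood[c]]
        scores[c] = log_prior[c] + sum(log_likelihood[c][w] for w in known)
    return max(scores, key=scores.get)

toy_prior, toy_likelihood = train_naive_bayes(toy_training)
toy_prediction = classify("miracle budget cure", toy_prior, toy_likelihood)
print(toy_prediction)
```

Working in log-probabilities avoids multiplying many small numbers together, which is also why the model weights inspected later in this notebook are negative.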

## Naive Bayes with scikit-learn

#### Training and testing the "fake news" model with CountVectorizer

In [42]:
# Import the necessary modules

from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Instantiate a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()
nb_classifier

MultinomialNB()

In [43]:
# Fit the classifier to the training data
nb_classifier.fit(count_train, y_train)

MultinomialNB()

In [44]:
# Create the predicted tags: pred
pred = nb_classifier.predict(count_test)
pred

array(['REAL', 'REAL', 'REAL', ..., 'REAL', 'FAKE', 'REAL'], dtype='<U4')

In [45]:
# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print(score)

0.893352462936394


In [46]:
# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
print(cm)

[[ 865  143]
 [  80 1003]]


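Using the CountVectorizer features, the model reaches an accuracy of about 0.893 on the test set. In the confusion matrix, rows are the true labels and columns are the predicted labels (in the order FAKE, REAL): 865 fake and 1003 real articles are classified correctly, 143 fake articles are predicted as REAL, and 80 real articles are predicted as FAKE.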
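
As a quick check on how these numbers relate, the accuracy can be recomputed from the confusion matrix printed above. The sketch also derives recall and precision for the FAKE class; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the CountVectorizer model above
# (rows: true FAKE, true REAL; columns: predicted FAKE, predicted REAL)
cm_counts = np.array([[865, 143], [80, 1003]])

# Accuracy: correctly classified articles (the diagonal) over all articles
accuracy = np.trace(cm_counts) / cm_counts.sum()

# Recall for FAKE: share of truly fake articles that were predicted FAKE
fake_recall = cm_counts[0, 0] / cm_counts[0].sum()

# Precision for FAKE: share of FAKE predictions that were truly fake
fake_precision = cm_counts[0, 0] / cm_counts[:, 0].sum()

print(accuracy, fake_recall, fake_precision)
```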

#### Training and testing the "fake news" model with TfidfVectorizer

In [48]:
# TfidfVectorizer with a Naive Bayes model

# Fit the classifier to the training data
nb_classifier.fit(tfidf_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(tfidf_test)
print(pred)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print(score)

# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
print(cm)


['REAL' 'REAL' 'REAL' ... 'REAL' 'FAKE' 'REAL']
0.8565279770444764
[[ 739  269]
 [  31 1052]]


## Improving the model

Let us test a few different alpha levels using the Tfidf vectors to determine if there is a better performing combination.

In [49]:
# Create the list of alphas: alphas
# These alphas will be used to adjust the smoothing parameter for the Naive Bayes model
import numpy as np
alphas = np.arange(0, 1, 0.1)
alphas

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

In [50]:
# Define train_and_predict()
def train_and_predict(alpha):
    # Instantiate the classifier: nb_classifier
    nb_classifier = MultinomialNB(alpha=alpha)
    # Fit to the training data
    nb_classifier.fit(tfidf_train, y_train)
    # Predict the labels: pred
    pred = nb_classifier.predict(tfidf_test)
    # Compute accuracy: score
    score = metrics.accuracy_score(y_test, pred)
    return score

In [51]:
# Iterate over the alphas and print the corresponding score
for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()

Alpha:  0.0
Score:  0.8813964610234337

Alpha:  0.1
Score:  0.8976566236250598

Alpha:  0.2
Score:  0.8938307030129125

Alpha:  0.30000000000000004
Score:  0.8900047824007652

Alpha:  0.4
Score:  0.8857006217120995

Alpha:  0.5
Score:  0.8842659014825442

Alpha:  0.6000000000000001
Score:  0.874701099952176

Alpha:  0.7000000000000001
Score:  0.8703969392635102

Alpha:  0.8
Score:  0.8660927785748446

Alpha:  0.9
Score:  0.8589191774270684



Note:
    
This code runs a simple one-parameter grid search to find the best alpha (smoothing) value for the MultinomialNB classifier. The idea is to train the classifier once for each value of alpha, evaluate each model by its accuracy score on the test set, and then select the value of alpha that gives the best performance. In the run above, alpha = 0.1 gives the highest accuracy (about 0.898).
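
An alternative way to run this kind of search is scikit-learn's `GridSearchCV`, which scores each alpha with cross-validation on the training data instead of the test set. The sketch below is a variation on the loop above, not the notebook's method, and it uses a small made-up corpus (`toy_texts`, `toy_labels`).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical corpus with six examples per class
toy_texts = [
    "shocking miracle cure exposed",
    "secret trick doctors hate",
    "you will not believe this miracle",
    "shocking secret finally exposed",
    "miracle diet secret revealed",
    "unbelievable trick exposed online",
    "senate passes budget bill",
    "president signs trade bill",
    "committee schedules budget vote",
    "senate debates trade policy",
    "governor announces budget plan",
    "committee reviews policy bill",
]
toy_labels = ["FAKE"] * 6 + ["REAL"] * 6

# The pipeline refits the vectorizer inside every cross-validation fold
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])

# "nb__alpha" targets the alpha parameter of the pipeline step named "nb"
param_grid = {"nb__alpha": [0.1, 0.5, 1.0]}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(toy_texts, toy_labels)

best_alpha = search.best_params_["nb__alpha"]
print(best_alpha, search.best_score_)
```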

## Inspecting the model

Let us investigate what the model has learned.

In [52]:
# we have the tfidf Naive Bayes classifier as nb_classifier
# and tfidf vectors as tfidf_vectorizer

# Get the class labels: class_labels
class_labels = nb_classifier.classes_ # accessing the .classes_ attribute of nb_classifier
class_labels

array(['FAKE', 'REAL'], dtype='<U4')

In [53]:
# Extract the features: feature_names
feature_names = tfidf_vectorizer.get_feature_names()
feature_names

['00',
 '000',
 '0000',
 '00000031',
 '000035',
 '00006',
 '0001',
 '0001pt',
 '000ft',
 '000km',
 ...

In [55]:
print(len(feature_names))

56922


In [54]:
# Pair each feature name with its weight and sort by weight: feat_with_weights
# The weights are the log-probabilities of each feature given the second class (REAL).
# feature_log_prob_[1] holds the values that older scikit-learn versions
# also exposed as coef_[0] for a two-class MultinomialNB.
feat_with_weights = sorted(zip(nb_classifier.feature_log_prob_[1], feature_names))
feat_with_weights # This creates a list of tuples where each tuple contains a weight and a feature name

[(-11.316312804238807, '0000'),
 (-11.316312804238807, '000035'),
 (-11.316312804238807, '0001'),
 (-11.316312804238807, '0001pt'),
 (-11.316312804238807, '000km'),
 (-11.316312804238807, '0011'),
 (-11.316312804238807, '006s'),
 (-11.316312804238807, '007'),
 (-11.316312804238807, '007s'),
 (-11.316312804238807, '008s'),
 (-11.316312804238807, '0099'),
 (-11.316312804238807, '00am'),
 (-11.316312804238807, '00p'),
 (-11.316312804238807, '00pm'),
 (-11.316312804238807, '014'),
 (-11.316312804238807, '015'),
 (-11.316312804238807, '018'),
 (-11.316312804238807, '01am'),
 (-11.316312804238807, '020'),
 (-11.316312804238807, '023'),
 (-11.316312804238807, '02welcome'),
 (-11.316312804238807, '031'),
 (-11.316312804238807, '032'),
 (-11.316312804238807, '0325'),
 (-11.316312804238807, '033'),
 (-11.316312804238807, '034'),
 (-11.316312804238807, '039'),
 (-11.316312804238807, '03eb'),
 (-11.316312804238807, '0400'),
 (-11.316312804238807, '049'),
 (-11.316312804238807, '04pm'),
 (-11.31631

Note:
    
The resulting list of tuples is sorted by the coefficients using the sorted() function. The sorting is done in ascending order, so the features with the smallest coefficients will be at the beginning of the list and the features with the largest coefficients will be at the end of the list.

In [56]:
print(len(feat_with_weights))

56922


In [57]:
# Print the first class label and the top 20 feat_with_weights entries
print(class_labels[0])


FAKE


In [58]:
print(feat_with_weights[:20])

[(-11.316312804238807, '0000'), (-11.316312804238807, '000035'), (-11.316312804238807, '0001'), (-11.316312804238807, '0001pt'), (-11.316312804238807, '000km'), (-11.316312804238807, '0011'), (-11.316312804238807, '006s'), (-11.316312804238807, '007'), (-11.316312804238807, '007s'), (-11.316312804238807, '008s'), (-11.316312804238807, '0099'), (-11.316312804238807, '00am'), (-11.316312804238807, '00p'), (-11.316312804238807, '00pm'), (-11.316312804238807, '014'), (-11.316312804238807, '015'), (-11.316312804238807, '018'), (-11.316312804238807, '01am'), (-11.316312804238807, '020'), (-11.316312804238807, '023')]


In [59]:
# Print the second class label and the bottom 20 feat_with_weights entries
print(class_labels[1])

REAL


In [60]:
print(feat_with_weights[-20:])

[(-7.742481952533027, 'states'), (-7.717550034444668, 'rubio'), (-7.703583809227384, 'voters'), (-7.654774992495461, 'house'), (-7.649398936153309, 'republicans'), (-7.6246184189367, 'bush'), (-7.616556675728881, 'percent'), (-7.545789237823644, 'people'), (-7.516447881078008, 'new'), (-7.448027933291952, 'party'), (-7.411148410203476, 'cruz'), (-7.410910239085596, 'state'), (-7.35748985914622, 'republican'), (-7.33649923948987, 'campaign'), (-7.2854057032685775, 'president'), (-7.2166878130917755, 'sanders'), (-7.108263114902301, 'obama'), (-6.724771332488041, 'clinton'), (-6.5653954389926845, 'said'), (-6.328486029596207, 'trump')]


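This allows you to see which features carry the smallest and largest weights in the model. The weights used here are the log-probabilities of each word given the REAL class, so the last entries (such as "trump", "said", and "clinton") are the words the model considers most likely in real articles. The first entries all share the same value (about -11.32), which is consistent with words that do not appear in the real training articles and receive only the smoothing weight.
This is useful for analyzing and interpreting the model, as it gives insight into which words the model relies on to make its predictions, and how strongly it weighs each one.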
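
One possible refinement, sketched here on made-up data, is to rank words by the difference between their log-probabilities under the two classes, so that both ends of the list are directly comparable. The corpus and names (`toy_texts`, `log_ratio`, `most_fake`, `most_real`) are hypothetical, and this is an optional variation on the inspection above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical corpus where the two classes use different words
toy_texts = [
    "miracle cure shocking secret",
    "shocking miracle exposed",
    "senate budget vote",
    "budget committee vote senate",
]
toy_labels = ["FAKE", "FAKE", "REAL", "REAL"]

toy_vectorizer = TfidfVectorizer()
toy_features = toy_vectorizer.fit_transform(toy_texts)
toy_model = MultinomialNB(alpha=0.1).fit(toy_features, toy_labels)

# vocabulary_ maps each term to its column index; order the terms by column
terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

# classes_ is sorted, so row 0 is FAKE and row 1 is REAL
log_ratio = toy_model.feature_log_prob_[1] - toy_model.feature_log_prob_[0]
order = np.argsort(log_ratio)

most_fake = [terms[i] for i in order[:3]]   # most negative: favour FAKE
most_real = [terms[i] for i in order[-3:]]  # most positive: favour REAL
print(most_fake, most_real)
```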
