# Building a "fake news" classifier
>  You'll apply the basics of what you've learned along with some supervised machine learning to build a "fake news" detector. You'll begin by learning the basics of supervised machine learning, and then move forward by choosing a few important features and testing ideas to identify and classify fake news articles.

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Python, Datacamp, Machine Learning]
- image: images/datacamp/1_supervised_learning_with_scikit_learn/2_regression.png

> Note: This is a summary of the chapter 4 exercises from the course "Introduction to Natural Language Processing in Python" at DataCamp. <br>[GitHub repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (8, 8)

## Classifying fake news using supervised learning with NLP

### Which possible features?

<p>Which of the following are possible features for a text classification problem?</p>

<pre>
Possible Answers

Number of words in a document.

Specific named entities.

Language.

<b>All of the above.</b>

</pre>

### Training and testing

<p>What datasets are needed for supervised learning?</p>

<pre>
Possible Answers

Training data.

Testing data.

<b>Both training and testing data.</b>

A label or outcome.

</pre>

## Building word count vectors with scikit-learn

### CountVectorizer for text classification

<div class=""><p>It's time to begin building your text classifier! The <a href="https://s3.amazonaws.com/assets.datacamp.com/production/course_3629/fake_or_real_news.csv" target="_blank" rel="noopener noreferrer">data</a> has been loaded into a DataFrame called <code>df</code>. Explore it in the IPython Shell to investigate what columns you can use. The <code>.head()</code> method is particularly informative.</p>
<p>In this exercise, you'll use <code>pandas</code> alongside scikit-learn to create a sparse text vectorizer you can use to train and test a simple supervised model. To begin, you'll set up a <code>CountVectorizer</code> and investigate some of its features.</p></div>

In [25]:
df = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/12-introduction-to-natural-language-processing-in-python/datasets/fake_or_real_news.csv')

Instructions
<ul>
<li>Import <code>CountVectorizer</code> from <code>sklearn.feature_extraction.text</code> and <code>train_test_split</code> from <code>sklearn.model_selection</code>.</li>
<li>Create a Series <code>y</code> to use for the labels by assigning the <code>.label</code> attribute of <code>df</code> to <code>y</code>.</li>
<li>Using <code>df["text"]</code> (features) and <code>y</code> (labels), create training and test sets using <code>train_test_split()</code>. Use a <code>test_size</code> of <code>0.33</code> and a <code>random_state</code> of <code>53</code>.</li>
<li>Create a <code>CountVectorizer</code> object called <code>count_vectorizer</code>. Ensure you specify the keyword argument <code>stop_words="english"</code> so that stop words are removed.</li>
<li>Fit and transform the training data <code>X_train</code> using the <code>.fit_transform()</code> method of your <code>CountVectorizer</code> object. Do the same with the test data <code>X_test</code>, except using the <code>.transform()</code> method.</li>
<li>Print the first 10 features of the <code>count_vectorizer</code> using its <code>.get_feature_names()</code> method.</li>
</ul>

In [26]:
# Import the necessary modules
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Print the head of df
print(df.head())

# Create a series to store the labels: y
y = df.label

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(df.text, y, test_size=0.33, random_state=53)

# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = CountVectorizer(stop_words="english")

# Transform the training data using only the 'text' column values: count_train 
count_train = count_vectorizer.fit_transform(X_train)

# Transform the test data using only the 'text' column values: count_test 
count_test = count_vectorizer.transform(X_test)

# Print the first 10 features of the count_vectorizer
print(count_vectorizer.get_feature_names()[:10])

   Unnamed: 0  ... label
0        8476  ...  FAKE
1       10294  ...  FAKE
2        3608  ...  REAL
3       10142  ...  FAKE
4         875  ...  REAL

[5 rows x 4 columns]
['00', '000', '0000', '00000031', '000035', '00006', '0001', '0001pt', '000ft', '000km']


### TfidfVectorizer for text classification

<div class=""><p>Similar to the sparse <code>CountVectorizer</code> created in the previous exercise, you'll work on creating tf-idf vectors for your documents. You'll set up a <code>TfidfVectorizer</code> and investigate some of its features.</p>
<p>In this exercise, you'll use <code>pandas</code> and <code>sklearn</code> along with the same <code>X_train</code>, <code>y_train</code>, <code>X_test</code>, and <code>y_test</code> Series you created in the last exercise.</p></div>

Instructions
<ul>
<li>Import <code>TfidfVectorizer</code> from <code>sklearn.feature_extraction.text</code>.</li>
<li>Create a <code>TfidfVectorizer</code> object called <code>tfidf_vectorizer</code>. When doing so, specify the keyword arguments <code>stop_words="english"</code> and <code>max_df=0.7</code>.</li>
<li>Fit and transform the training data. </li>
<li>Transform the test data.</li>
<li>Print the first 10 features of <code>tfidf_vectorizer</code>.</li>
<li>Print the first 5 vectors of the tfidf training data using slicing on the <code>.A</code> (or array) <strong><em>attribute</em></strong> of <code>tfidf_train</code>.</li>
</ul>

In [27]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize a TfidfVectorizer object: tfidf_vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)

# Transform the training data: tfidf_train 
tfidf_train = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data: tfidf_test 
tfidf_test = tfidf_vectorizer.transform(X_test)

# Print the first 10 features
print(tfidf_vectorizer.get_feature_names()[:10])

# Print the first 5 vectors of the tfidf training data
print(tfidf_train.A[:5])

['00', '000', '0000', '00000031', '000035', '00006', '0001', '0001pt', '000ft', '000km']
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


### Inspecting the vectors

<div class=""><p>To get a better idea of how the vectors work, you'll investigate them by converting them into <code>pandas</code> DataFrames.</p>
<p>Here, you'll use the same data structures you created in the previous two exercises (<code>count_train</code>, <code>count_vectorizer</code>, <code>tfidf_train</code>, <code>tfidf_vectorizer</code>) as well as <code>pandas</code>, which is imported as <code>pd</code>.</p></div>

Instructions
<ul>
<li>Create the DataFrames <code>count_df</code> and <code>tfidf_df</code> by using <code>pd.DataFrame()</code> and specifying the values as the first argument and the columns (or features) as the second argument.<ul>
<li>The values can be accessed by using the <code>.A</code> attribute of, respectively, <code>count_train</code> and <code>tfidf_train</code>.</li>
<li>The columns can be accessed using the <code>.get_feature_names()</code> methods of <code>count_vectorizer</code> and <code>tfidf_vectorizer</code>.</li></ul></li>
<li>Print the head of each DataFrame to investigate their structure. <em>This has been done for you.</em></li>
<li>Test if the column names are the same for each DataFrame by creating a new object called <code>difference</code> to see the difference between the columns that <code>count_df</code> has from <code>tfidf_df</code>. Columns can be accessed using the <code>.columns</code> attribute of a DataFrame. Subtract the set of <code>tfidf_df.columns</code> from the set of <code>count_df.columns</code>.</li>
<li>Test if the two DataFrames are equivalent by using the <code>.equals()</code> method on <code>count_df</code> with <code>tfidf_df</code> as the argument.</li>
</ul>

In [28]:
# Create the CountVectorizer DataFrame: count_df
count_df = pd.DataFrame(count_train.A, columns=count_vectorizer.get_feature_names())

# Create the TfidfVectorizer DataFrame: tfidf_df
tfidf_df = pd.DataFrame(tfidf_train.A, columns=tfidf_vectorizer.get_feature_names())

# Print the head of count_df
print(count_df.head())

# Print the head of tfidf_df
print(tfidf_df.head())

# Calculate the difference in columns: difference
difference = set(count_df.columns) - set(tfidf_df.columns)
print(difference)

# Check whether the DataFrames are equal
print(count_df.equals(tfidf_df))

   00  000  0000  00000031  000035  00006  ...  ما  محاولات  من  هذا  والمرضى  ยงade
0   0    0     0         0       0      0  ...   0        0   0    0        0      0
1   0    0     0         0       0      0  ...   0        0   0    0        0      0
2   0    0     0         0       0      0  ...   0        0   0    0        0      0
3   0    0     0         0       0      0  ...   0        0   0    0        0      0
4   0    0     0         0       0      0  ...   0        0   0    0        0      0

[5 rows x 56922 columns]
    00  000  0000  00000031  000035  ...  محاولات   من  هذا  والمرضى  ยงade
0  0.0  0.0   0.0       0.0     0.0  ...      0.0  0.0  0.0      0.0    0.0
1  0.0  0.0   0.0       0.0     0.0  ...      0.0  0.0  0.0      0.0    0.0
2  0.0  0.0   0.0       0.0     0.0  ...      0.0  0.0  0.0      0.0    0.0
3  0.0  0.0   0.0       0.0     0.0  ...      0.0  0.0  0.0      0.0    0.0
4  0.0  0.0   0.0       0.0     0.0  ...      0.0  0.0  0.0      0.0    0.0

[5 rows

## Training and testing a classification model with scikit-learn

### Text classification models

<p>Which of the below is the most reasonable model to use when training a new supervised model using text vector data?</p>

<pre>
Possible Answers

Random Forests

<b>Naive Bayes</b>

Linear Regression

Deep Learning

</pre>

### Training and testing the "fake news" model with CountVectorizer

<div class=""><p>Now it's your turn to train the "fake news" model using the features you identified and extracted. In this first exercise you'll train and test a Naive Bayes model using the <code>CountVectorizer</code> data.</p>
<p>The training and test sets have been created, and <code>count_vectorizer</code>, <code>count_train</code>, and <code>count_test</code> have been computed.</p></div>

Instructions
<ul>
<li>Import the <code>metrics</code> module from <code>sklearn</code> and <code>MultinomialNB</code> from <code>sklearn.naive_bayes</code>.</li>
<li>Instantiate a <code>MultinomialNB</code> classifier called <code>nb_classifier</code>.</li>
<li>Fit the classifier to the training data.</li>
<li>Compute the predicted tags for the test data.</li>
<li>Calculate and print the accuracy score of the classifier.</li>
<li>Compute the confusion matrix. To make it easier to read, specify the keyword argument <code>labels=['FAKE', 'REAL']</code>.</li>
</ul>

In [30]:
# Import the necessary modules
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB

# Instantiate a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
nb_classifier.fit(count_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(count_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print(score)

# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
print(cm)

0.893352462936394
[[ 865  143]
 [  80 1003]]


### Training and testing the "fake news" model with TfidfVectorizer

<div class=""><p>Now that you have evaluated the model using the <code>CountVectorizer</code>, you'll do the same using the <code>TfidfVectorizer</code> with a Naive Bayes model.</p>
<p>The training and test sets have been created, and <code>tfidf_vectorizer</code>, <code>tfidf_train</code>, and <code>tfidf_test</code> have been computed. Additionally, <code>MultinomialNB</code> and <code>metrics</code> have been imported from, respectively, <code>sklearn.naive_bayes</code> and <code>sklearn</code>.</p></div>

Instructions
<ul>
<li>Instantiate a <code>MultinomialNB</code> classifier called <code>nb_classifier</code>.</li>
<li>Fit the classifier to the training data.</li>
<li>Compute the predicted tags for the test data.</li>
<li>Calculate and print the accuracy score of the classifier.</li>
<li>Compute the confusion matrix. As in the previous exercise, specify the keyword argument <code>labels=['FAKE', 'REAL']</code> so that the resulting confusion matrix is easier to read.</li>
</ul>

In [31]:
# Create a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
nb_classifier.fit(tfidf_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(tfidf_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print(score)

# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
print(cm)

0.8565279770444764
[[ 739  269]
 [  31 1052]]
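
To see where the two models differ, the confusion matrices printed above can be turned back into accuracy and per-class recall. The sketch below copies the numbers from the two outputs; the helper `summarize()` is a hypothetical name introduced here. In scikit-learn's layout, rows are true labels and columns are predicted labels, both in the order `['FAKE', 'REAL']`.

```python
import numpy as np

# Confusion matrices copied from the CountVectorizer and TfidfVectorizer outputs
cm_count = np.array([[865, 143], [80, 1003]])
cm_tfidf = np.array([[739, 269], [31, 1052]])

def summarize(cm):
    """Return (accuracy, per-class recall) for a square confusion matrix."""
    accuracy = np.trace(cm) / cm.sum()      # correct predictions / all predictions
    recall = np.diag(cm) / cm.sum(axis=1)   # correct per true class / size of that class
    return accuracy, recall

for name, cm in [("count", cm_count), ("tfidf", cm_tfidf)]:
    accuracy, (fake_recall, real_recall) = summarize(cm)
    print(f"{name}: accuracy={accuracy:.4f}, "
          f"FAKE recall={fake_recall:.4f}, REAL recall={real_recall:.4f}")
```

The accuracies reproduce the scores printed by each cell. Read this way, the lower accuracy of the tf-idf model comes mostly from FAKE articles predicted as REAL (269 versus 143), while it labels slightly more REAL articles correctly (1052 versus 1003).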


## Simple NLP, complex problems

### Improving the model

<pre>
Possible Answers

Tweaking alpha levels.

Trying a new classification model.

Training on a larger dataset.

Improving text preprocessing.

<b>All of the above.</b>

</pre>

### Improving your model

<div class=""><p>Your job in this exercise is to test a few different alpha levels using the <code>Tfidf</code> vectors to determine if there is a better performing combination.</p>
<p>The training and test sets have been created, and <code>tfidf_vectorizer</code>, <code>tfidf_train</code>, and <code>tfidf_test</code> have been computed.</p></div>

Instructions
<ul>
<li>Create a list of alphas to try using <code>np.arange()</code>. Values should range from <code>0</code> to <code>1</code> with steps of <code>0.1</code>.</li>
<li>Create a function <code>train_and_predict()</code> that takes in one argument: <code>alpha</code>. The function should:<ul>
<li>Instantiate a <code>MultinomialNB</code> classifier with <code>alpha=alpha</code>.</li>
<li>Fit it to the training data.</li>
<li>Compute predictions on the test data.</li>
<li>Compute and return the accuracy score.</li></ul></li>
<li>Using a <code>for</code> loop, print the <code>alpha</code>, <code>score</code> and a newline in between. Use your <code>train_and_predict()</code> function to compute the <code>score</code>. Does the score change along with the alpha? What is the best alpha?</li>
</ul>

In [32]:
# Create the list of alphas: alphas
alphas = np.arange(0, 1, 0.1)

# Define train_and_predict()
def train_and_predict(alpha):
    # Instantiate the classifier: nb_classifier
    nb_classifier = MultinomialNB(alpha=alpha)
    # Fit to the training data
    nb_classifier.fit(tfidf_train, y_train)
    # Predict the labels: pred
    pred = nb_classifier.predict(tfidf_test)
    # Compute accuracy: score
    score = metrics.accuracy_score(y_test, pred)
    return score

# Iterate over the alphas and print the corresponding score
for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()

Alpha:  0.0
Score:  0.8813964610234337

Alpha:  0.1
Score:  0.8976566236250598

Alpha:  0.2
Score:  0.8938307030129125

Alpha:  0.30000000000000004
Score:  0.8900047824007652

Alpha:  0.4
Score:  0.8857006217120995

Alpha:  0.5
Score:  0.8842659014825442

Alpha:  0.6000000000000001
Score:  0.874701099952176

Alpha:  0.7000000000000001
Score:  0.8703969392635102

Alpha:  0.8
Score:  0.8660927785748446

Alpha:  0.9
Score:  0.8589191774270684
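
To answer the question about the best alpha without scanning the printout by eye, the scores can be collected and the maximum looked up. The sketch below copies the accuracy values from the output above rather than retraining anything; the names `recorded_scores`, `best_index`, and `best_alpha` are introduced here for illustration. It also rounds the alphas, which avoids labels such as `0.30000000000000004` that come from floating-point steps.

```python
import numpy as np

# Same grid as in the exercise, rounded to one decimal for clean labels
alphas = np.round(np.arange(0, 1, 0.1), 1)

# Accuracy scores copied from the output above, in the same order as alphas
recorded_scores = [
    0.8813964610234337, 0.8976566236250598, 0.8938307030129125,
    0.8900047824007652, 0.8857006217120995, 0.8842659014825442,
    0.874701099952176, 0.8703969392635102, 0.8660927785748446,
    0.8589191774270684,
]

# Position of the highest score, then the alpha at that position
best_index = int(np.argmax(recorded_scores))
best_alpha = float(alphas[best_index])
print(f"Best alpha: {best_alpha}, accuracy: {recorded_scores[best_index]:.4f}")
# Best alpha: 0.1, accuracy: 0.8977
```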



### Inspecting your model

<div class=""><p>Now that you have built a "fake news" classifier, you'll investigate what it has learned. You can map the important vector weights back to actual words using some simple inspection techniques.</p>
<p>You have your well-performing tfidf Naive Bayes classifier available as <code>nb_classifier</code>, and the vectorizer as <code>tfidf_vectorizer</code>.</p></div>

Instructions
<ul>
<li>Save the class labels as <code>class_labels</code> by accessing the <code>.classes_</code> attribute of <code>nb_classifier</code>.</li>
<li>Extract the features using the <code>.get_feature_names()</code> method of <code>tfidf_vectorizer</code>.</li>
<li>Create a zipped array of the classifier coefficients with the feature names and sort them by the coefficients. To do this, first use <code>zip()</code> with the arguments <code>nb_classifier.coef_[0]</code> and <code>feature_names</code>. Then, use <code>sorted()</code> on this.</li>
<li>Print the <em>top</em> 20 weighted features for the first label of <code>class_labels</code> and print the bottom 20 weighted features for the second label of <code>class_labels</code>. <em>This has been done for you.</em></li>
</ul>

In [33]:
# Get the class labels: class_labels
class_labels = nb_classifier.classes_

# Extract the features: feature_names
feature_names = tfidf_vectorizer.get_feature_names()

# Zip the feature names together with the coefficient array and sort by weights: feat_with_weights
feat_with_weights = sorted(zip(nb_classifier.coef_[0], feature_names))

# Print the first class label and the first 20 (lowest-weighted) feat_with_weights entries
print(class_labels[0], feat_with_weights[:20])

# Print the second class label and the last 20 (highest-weighted) feat_with_weights entries
print(class_labels[1], feat_with_weights[-20:])

FAKE [(-11.316312804238807, '0000'), (-11.316312804238807, '000035'), (-11.316312804238807, '0001'), (-11.316312804238807, '0001pt'), (-11.316312804238807, '000km'), (-11.316312804238807, '0011'), (-11.316312804238807, '006s'), (-11.316312804238807, '007'), (-11.316312804238807, '007s'), (-11.316312804238807, '008s'), (-11.316312804238807, '0099'), (-11.316312804238807, '00am'), (-11.316312804238807, '00p'), (-11.316312804238807, '00pm'), (-11.316312804238807, '014'), (-11.316312804238807, '015'), (-11.316312804238807, '018'), (-11.316312804238807, '01am'), (-11.316312804238807, '020'), (-11.316312804238807, '023')]
REAL [(-7.742481952533027, 'states'), (-7.717550034444668, 'rubio'), (-7.703583809227384, 'voters'), (-7.654774992495461, 'house'), (-7.649398936153309, 'republicans'), (-7.6246184189367, 'bush'), (-7.616556675728881, 'percent'), (-7.545789237823644, 'people'), (-7.516447881078008, 'new'), (-7.448027933291952, 'party'), (-7.411148410203476, 'cruz'), (-7.410910239085596, 'st