## **Building a fake news classifier**

1. Here, we shall build (learn) bag-of-words vectors from the fake or real news dataset.
2. In short, the dataset contains news articles (a title and the full text), each labeled as `FAKE` or `REAL`.
3. We wish to create bag-of-words vectors for these articles to see whether we can predict the label from the words used in the article text.
4. To do so, we shall follow these steps using scikit-learn:
   * Load the data.
   * Define the label, y.
   * Split the data into train and test sets.
   * Create the CountVectorizer object, which turns the text into **bags of words**; this is similar to a Gensim corpus. NOTE: as a pre-processing step, **ensure English stop words are removed during the formation of the bag of words**.
   * Each token will now act as a feature for the classifier.
   * Use the .fit_transform() method of the CountVectorizer on the training data to create the bag-of-words vectors.
   * Generally, fit_transform() builds the bag-of-words vocabulary (dictionary) and the vector for each document from the training data.
   * Apply the same fitted transformation to the test data as well, using .transform().
  
**Feature Extraction**: This chapter also deals with feature extraction. Simply put, feature extraction is how text is transformed into numerical data, using techniques such as bag-of-words models or TF-IDF, so that machine learning algorithms can process it.

**Scikit-learn for Model Training**: We will also see how to use scikit-learn, a powerful machine learning library, to train a model on textual data.

**NOTE**: Possible features for text classification include the number of words, the named entities, or the language of the document.

**DATA**: Recall that the data we are using is the fake or real news dataset: news articles with labels indicating whether each article is `FAKE` or `REAL`.
     

### **CountVectorizer for text classification**

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

df = pd.read_csv("/kaggle/input/fake-or-real-news/fake_or_real_news.csv")         #load data
print(df.head(5))
print(df.columns)

   Unnamed: 0                                              title  \
0        8476                       You Can Smell Hillary’s Fear   
1       10294  Watch The Exact Moment Paul Ryan Committed Pol...   
2        3608        Kerry to go to Paris in gesture of sympathy   
3       10142  Bernie supporters on Twitter erupt in anger ag...   
4         875   The Battle of New York: Why This Primary Matters   

                                                text label  
0  Daniel Greenfield, a Shillman Journalism Fello...  FAKE  
1  Google Pinterest Digg Linkedin Reddit Stumbleu...  FAKE  
2  U.S. Secretary of State John F. Kerry said Mon...  REAL  
3  — Kaydee King (@KaydeeKing) November 9, 2016 T...  FAKE  
4  It's primary day in New York and front-runners...  REAL  
Index(['Unnamed: 0', 'title', 'text', 'label'], dtype='object')


> Import CountVectorizer from sklearn.feature_extraction.text and train_test_split from sklearn.model_selection.
>
> Create a Series y to use for the labels by assigning the .label attribute of df to y.
>
> Using df["text"] (features) and y (labels), create training and test sets using train_test_split(). Use a test_size of 0.33 and a random_state of 53.
>
> Create a CountVectorizer object called count_vectorizer. Ensure you specify the keyword argument stop_words="english" so that stop words are removed.
>
> Fit and transform the training data X_train using the .fit_transform() method of your CountVectorizer object. Do the same with the test data X_test, except using the .transform() method.
>
> Print the first 10 features of the count_vectorizer using its .get_feature_names_out() method.




In [8]:
# Create a series to store the labels: y
y = df.label

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(df["text"], y, test_size = 0.33, random_state = 53)

# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = CountVectorizer(stop_words = "english")

# Transform the training data using only the 'text' column values: count_train 
count_train = count_vectorizer.fit_transform(X_train)

# Transform the test data using only the 'text' column values: count_test 
count_test = count_vectorizer.transform(X_test)

# Print the first 10 features of the count_vectorizer
print(count_vectorizer.get_feature_names_out()[:10])

['00' '000' '0000' '00000031' '000035' '00006' '0001' '0001pt' '000ft'
 '000km']


### **TfidfVectorizer for text classification**

> Similar to the sparse CountVectorizer created in the previous exercise, we'll work on creating tf-idf vectors for our documents here.
>
> We'll set up a TfidfVectorizer and investigate some of its features.

> Import TfidfVectorizer from sklearn.feature_extraction.text.
>
>  Create a TfidfVectorizer object called tfidf_vectorizer. When doing so, specify the keyword arguments stop_words="english" and max_df=0.7.
>
> Fit and transform the training data.
>  Transform the test data.
> Print the first 10 features of tfidf_vectorizer.
>
> Print the first few vectors of the tfidf training data by slicing the dense array returned by the .toarray() method of tfidf_train (the cell below prints the first 15 rows).


In [9]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix

# Initialize a TfidfVectorizer object: tfidf_vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_df = 0.7)

# Transform the training data: tfidf_train 
tfidf_train = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data: tfidf_test 
tfidf_test = tfidf_vectorizer.transform(X_test)

# Print the first 10 features
print(tfidf_vectorizer.get_feature_names_out()[:10])


# Print the first 15 vectors of the tfidf training data
print(tfidf_train.toarray()[:15])


['00' '000' '0000' '00000031' '000035' '00006' '0001' '0001pt' '000ft'
 '000km']
[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.05687994 0.         0.         ... 0.         0.         0.        ]]


In [7]:
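# Print the first 10 count vectors as a dense array
print(count_train.toarray()[:10])
# The feature names belong to the vectorizer, not to the sparse matrix it returns
print(count_vectorizer.get_feature_names_out()[:10])

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]]
['00' '000' '0000' '00000031' '000035' '00006' '0001' '0001pt' '000ft'
 '000km']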

We shall now compare the CountVectorizer and TfidfVectorizer outputs by inspecting the vectors as DataFrames.

### **Exercise - Inspecting the vectors**

In [13]:
# Create the CountVectorizer DataFrame: count_df
count_df = pd.DataFrame(count_train.toarray(), 
columns=count_vectorizer.get_feature_names_out())

# Create the TfidfVectorizer DataFrame: tfidf_df
tfidf_df = pd.DataFrame(tfidf_train.toarray(), columns = tfidf_vectorizer.get_feature_names_out())

# Print the head of count_df
print(count_df.head())

# Print the head of tfidf_df
print(tfidf_df.head())

# Calculate the difference in columns: difference
difference = set(tfidf_df.columns) - set(count_df.columns)
print(difference)

# Check whether the DataFrames are equal
print(count_df.equals(tfidf_df))


   00  000  0000  00000031  000035  00006  0001  0001pt  000ft  000km  ...  \
0   0    0     0         0       0      0     0       0      0      0  ...   
1   0    0     0         0       0      0     0       0      0      0  ...   
2   0    0     0         0       0      0     0       0      0      0  ...   
3   0    0     0         0       0      0     0       0      0      0  ...   
4   0    0     0         0       0      0     0       0      0      0  ...   

   حلب  عربي  عن  لم  ما  محاولات  من  هذا  والمرضى  ยงade  
0    0     0   0   0   0        0   0    0        0      0  
1    0     0   0   0   0        0   0    0        0      0  
2    0     0   0   0   0        0   0    0        0      0  
3    0     0   0   0   0        0   0    0        0      0  
4    0     0   0   0   0        0   0    0        0      0  

[5 rows x 56922 columns]
    00  000  0000  00000031  000035  00006  0001  0001pt  000ft  000km  ...  \
0  0.0  0.0   0.0       0.0     0.0    0.0   0.0     0.0    

### **Building a Naive Bayes classifier**

Here we shall be building a supervised machine learning model for "fake news" detection, focusing on the use of the Naive Bayes model for text classification. 

> The Naive Bayes algorithm is based on probability: it estimates how likely each class is given the words observed in a document and predicts the most probable class. This method is particularly effective for natural language processing (NLP) classification problems due to its simplicity and efficiency, despite the availability of more complex models and algorithms.

> Recall that the data we are using is the fake or real news dataset, with labels indicating whether each article is `FAKE` or `REAL`.
>
> NOTE: The multinomial Naive Bayes model in scikit-learn is designed for integer feature counts, so it works well with the output of the CountVectorizer. In practice, fractional counts such as tf-idf values may also work.
>
> NOTE: Below, while evaluating the confusion matrix, we specify labels. If we don't specify labels, scikit-learn uses the sorted order of the labels that appear in y_true or y_pred.

**Evaluating Model Performance**: The accuracy of the model will be assessed using the accuracy_score function from scikit-learn's metrics module. Additionally, we shall interpret the confusion matrix to understand the model's performance in classifying 'FAKE' and 'REAL' news articles.
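
The effect of the `labels` argument can be seen on a small example. The two label lists below are hand-written for illustration and are not model output; the sketch only shows how the row and column order of the confusion matrix follows `labels`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for five articles.
y_true = ["FAKE", "FAKE", "REAL", "REAL", "REAL"]
y_pred = ["FAKE", "REAL", "REAL", "REAL", "FAKE"]

# Fraction of predictions that match the true label.
acc = accuracy_score(y_true, y_pred)

# Rows are true labels and columns are predicted labels,
# both in the order given by `labels`.
cm_fake_first = confusion_matrix(y_true, y_pred, labels=["FAKE", "REAL"])
cm_real_first = confusion_matrix(y_true, y_pred, labels=["REAL", "FAKE"])

# Without `labels`, the sorted label order is used ("FAKE" before "REAL").
cm_default = confusion_matrix(y_true, y_pred)

print(acc)
print(cm_fake_first)
print(cm_real_first)
```

Passing `labels` explicitly keeps the matrix readable even if the label strings would sort in an unexpected order.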

#### **Training and testing the "fake news" model with CountVectorizer**

In [16]:
# Import the necessary modules
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB

# Instantiate a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
nb_classifier.fit(count_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(count_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print(score)

# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test, pred, labels = ['FAKE', 'REAL'])
print(cm)


0.893352462936394
[[ 865  143]
 [  80 1003]]


#### **Training and testing the "fake news" model with TfidfVectorizer**

In [17]:
from sklearn.naive_bayes import MultinomialNB
# Create a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
nb_classifier.fit(tfidf_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(tfidf_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print(score)

# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test, pred,labels=['FAKE', 'REAL'])
print(cm)


0.8565279770444764
[[ 739  269]
 [  31 1052]]
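
The two confusion matrices printed above can be compared directly. The sketch below copies them by hand, recomputes the accuracy as the diagonal sum divided by the total, and adds the per-class recall (the share of each true class that was labeled correctly). The helper name `summarize` is only illustrative.

```python
import numpy as np

# Confusion matrices copied from the outputs above
# (rows: true FAKE, true REAL; columns: predicted FAKE, predicted REAL).
cm_count = np.array([[865, 143], [80, 1003]])
cm_tfidf = np.array([[739, 269], [31, 1052]])

def summarize(cm):
    """Return overall accuracy and per-class recall for a confusion matrix."""
    accuracy = np.trace(cm) / cm.sum()
    recall = np.diag(cm) / cm.sum(axis=1)
    return accuracy, recall

acc_count, recall_count = summarize(cm_count)
acc_tfidf, recall_tfidf = summarize(cm_tfidf)

print("CountVectorizer:", acc_count, recall_count)
print("TfidfVectorizer:", acc_tfidf, recall_tfidf)
```

With the default settings, the tf-idf model labels 269 FAKE articles as REAL compared with 143 for the count model, which accounts for its lower overall accuracy even though it makes fewer mistakes on REAL articles.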


### **Improving your model**

**Here, we shall investigate the effect of one of the hyperparameters of the Naive Bayes method: the smoothing parameter alpha.**

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

df = pd.read_csv("/kaggle/input/fake-or-real-news/fake_or_real_news.csv")         #load data
#print(df.head(5))

# Create a series to store the labels: y
y = df["label"]                # The label column

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(df["text"], y, test_size = 0.33, random_state = 53)

In [9]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
#from scipy.sparse import csr_matrix

# Initialize a TfidfVectorizer object: tfidf_vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_df = 0.7)

# Transform the training data: tfidf_train 
tfidf_train = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data: tfidf_test 
tfidf_test = tfidf_vectorizer.transform(X_test)

# Print the first 10 features
print(tfidf_vectorizer.get_feature_names_out()[:10])


# Print the first 15 vectors of the tfidf training data
print(tfidf_train.toarray()[:15])


['00' '000' '0000' '00000031' '000035' '00006' '0001' '0001pt' '000ft'
 '000km']
[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.05687994 0.         0.         ... 0.         0.         0.        ]]


In [12]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import numpy as np
# Create the list of alphas: alphas

alphas = np.arange(0, 1, 0.1)

# Define train_and_predict()
def train_and_predict(alpha):
    # Instantiate the classifier: nb_classifier
    nb_classifier = MultinomialNB(alpha = alpha)
    
    # Fit to the training data
    nb_classifier.fit(tfidf_train, y_train)
    
    # Predict the labels: pred
    pred = nb_classifier.predict(tfidf_test)
    
    # Compute accuracy: score
    score = accuracy_score(y_test, pred)
    return score

# Iterate over the alphas and print the corresponding score
for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()


Alpha:  0.0
Score:  0.8813964610234337

Alpha:  0.1
Score:  0.8976566236250598

Alpha:  0.2
Score:  0.8938307030129125

Alpha:  0.30000000000000004
Score:  0.8900047824007652

Alpha:  0.4
Score:  0.8857006217120995

Alpha:  0.5
Score:  0.8842659014825442

Alpha:  0.6000000000000001
Score:  0.874701099952176

Alpha:  0.7000000000000001
Score:  0.8703969392635102

Alpha:  0.8
Score:  0.8660927785748446

Alpha:  0.9
Score:  0.8589191774270684
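
For context on what `alpha` controls: in `MultinomialNB` it is an additive smoothing term. Each per-class token probability is estimated as (token count in the class + alpha) / (total count in the class + alpha × number of features), so larger values pull the estimates toward a uniform distribution. The sketch below checks this against scikit-learn on a made-up count matrix; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrix: 4 documents, 3 features.
X = np.array([
    [2, 1, 0],
    [3, 0, 0],
    [0, 1, 4],
    [0, 2, 3],
])
y = np.array(["FAKE", "FAKE", "REAL", "REAL"])

alpha = 0.5
clf = MultinomialNB(alpha=alpha).fit(X, y)

# Sum the counts per class, add alpha to every entry, then normalize each row.
counts = np.vstack([X[y == c].sum(axis=0) for c in clf.classes_])
smoothed = counts + alpha
manual_log_prob = np.log(smoothed) - np.log(smoothed.sum(axis=1, keepdims=True))

print(manual_log_prob)
print(clf.feature_log_prob_)
```

The smoothing keeps a token that never occurs in a class from receiving a probability of exactly zero.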



**We can now map the important vector weights back to actual words using some simple inspection techniques.** 

Print the 20 lowest-weighted features next to the first label of class_labels and the 20 highest-weighted features next to the second label of class_labels.

In [19]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix
# Create a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB(alpha = 0.5)

# Fit the classifier to the training data
nb_classifier.fit(tfidf_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(tfidf_test)

# Calculate the accuracy score: score
score = accuracy_score(y_test, pred)
print(score)

# Calculate the confusion matrix: cm
cm = confusion_matrix(y_test, pred,labels=['FAKE', 'REAL'])
print(cm)


0.8842659014825442
[[ 808  200]
 [  42 1041]]


In [21]:
# Get the class labels: class_labels
class_labels = nb_classifier.classes_

# Extract the features: feature_names
feature_names = tfidf_vectorizer.get_feature_names_out()

# Zip the class-0 log probabilities together with the feature names and sort by weight: feat_with_weights
# NOTE: feature_log_prob_[0] holds the weights for class_labels[0] only
feat_with_weights = sorted(zip(nb_classifier.feature_log_prob_[0], feature_names))

# Print the first class label and the 20 lowest-weighted entries (sorted() is ascending)
print(class_labels[0], feat_with_weights[:20])

# Print the second class label and the 20 highest-weighted entries
# (these are still class_labels[0] weights, so read this pairing with care)
print(class_labels[1], feat_with_weights[-20:])


FAKE [(-11.529191688198756, '00000031'), (-11.529191688198756, '00006'), (-11.529191688198756, '000ft'), (-11.529191688198756, '001'), (-11.529191688198756, '002'), (-11.529191688198756, '003'), (-11.529191688198756, '006'), (-11.529191688198756, '008'), (-11.529191688198756, '010'), (-11.529191688198756, '013'), (-11.529191688198756, '025'), (-11.529191688198756, '027'), (-11.529191688198756, '035'), (-11.529191688198756, '037'), (-11.529191688198756, '040'), (-11.529191688198756, '044'), (-11.529191688198756, '048'), (-11.529191688198756, '066'), (-11.529191688198756, '068'), (-11.529191688198756, '075')]
REAL [(-7.611760822717352, 'president'), (-7.596887245110436, 'american'), (-7.587846888505773, 'media'), (-7.582180995208038, 'donald'), (-7.581029757693777, 'october'), (-7.563695582092016, 'government'), (-7.5026656393893925, 'like'), (-7.495597360162555, 'war'), (-7.4884547344946775, 'new'), (-7.4814927738232395, 'world'), (-7.4572092141138455, 'just'), (-7.328307731326699, 'sai
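
One possible alternative for tying tokens to a specific class is to compare the two rows of `feature_log_prob_`. This is not what the cell above does; it is a sketch on made-up counts, and the names `log_odds`, `most_first_class`, and `most_second_class` are illustrative only.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical features and counts: "aliens" appears only in FAKE documents,
# "senate" only in REAL ones, and "budget" equally in both.
feature_names = np.array(["aliens", "senate", "budget"])
X = np.array([
    [3, 0, 1],
    [2, 0, 1],
    [0, 3, 1],
    [0, 2, 1],
])
y = np.array(["FAKE", "FAKE", "REAL", "REAL"])

clf = MultinomialNB(alpha=1.0).fit(X, y)

# Difference of per-class log probabilities: positive values favour
# clf.classes_[1], negative values favour clf.classes_[0].
log_odds = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]
order = np.argsort(log_odds)

top_n = 1
most_first_class = feature_names[order[:top_n]]
most_second_class = feature_names[order[-top_n:][::-1]]

print(clf.classes_[0], most_first_class)
print(clf.classes_[1], most_second_class)
```

A token that is equally likely under both classes gets a difference of zero, so it ranks at neither end of the list.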