# Importing Libraries and Data



In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from sklearn.svm import SVC, LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report, accuracy_score, precision_score, average_precision_score, precision_recall_curve, f1_score

In [None]:
training = pd.read_csv('https://raw.githubusercontent.com/mehdimerbah/COVID19_fake_news_detection/main/preprocessing/processed_training_data.csv')
validation = pd.read_csv('https://raw.githubusercontent.com/mehdimerbah/COVID19_fake_news_detection/main/preprocessing/processed_validation_data.csv')
testing = pd.read_csv('https://raw.githubusercontent.com/mehdimerbah/COVID19_fake_news_detection/main/preprocessing/processed_testing_data.csv')

# Feature Extraction


## The Count Vectorizer
A **_Count Vectorizer_** is a simple way to tokenize a collection of text documents and build a dictionary of the known words in these documents. It takes in a text or a collection of texts and encodes every distinct word with a number. We do this by calling the `fit()` function on our text.

In [62]:
count_vectorizer = CountVectorizer()
text = ['This is a test to test the vectorizer. We are working on a datamining project project']
count_vectorizer.fit(text)
print(dict(list(count_vectorizer.vocabulary_.items())[0:10]))

{'this': 7, 'is': 2, 'test': 5, 'to': 8, 'the': 6, 'vectorizer': 9, 'we': 10, 'are': 0, 'working': 11, 'on': 3}


We can see that we get a dictionary of words and their respective encodings. We say that the `count_vectorizer` has **_learned the text vocabulary_**. The encoding numbers follow the alphabetical order of the words, and the single-letter word _a_ is left out because, by default, `CountVectorizer` ignores tokens shorter than two characters.
Now we can use this vocabulary to transform the text into an array (vector) of word counts. Each word count is stored at the position given by the word's encoding number from the previous step. To do this, we call the `transform()` function of our `count_vectorizer` object on the text.

In [None]:
vector = count_vectorizer.transform(text)
print(vector.toarray())

[[1 1 1 1 2 2 1 1 1 1 1 1]]


We now have a vector of word counts. All words have a count of **1** except for _project_ and _test_, at positions 4 and 5 respectively, which have a count of 2 since they were repeated.
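To make the mechanics concrete, here is a minimal pure-Python sketch of roughly what `fit()` and `transform()` do on this example. It assumes the default `CountVectorizer` settings (lowercasing, tokens of two or more word characters, alphabetically ordered vocabulary); the helper names `tokenize`, `fit_vocabulary` and `transform_counts` are made up for illustration and are not part of scikit-learn.

```python
import re
from collections import Counter

# Assumed default tokenization: lowercase, then keep tokens made of
# two or more word characters (so the single-letter "a" is dropped).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(document):
    return TOKEN_PATTERN.findall(document.lower())

def fit_vocabulary(documents):
    """Map each distinct token to its index in alphabetical order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform_counts(documents, vocabulary):
    """Return one dense count vector per document; unknown tokens are ignored."""
    ordered_tokens = sorted(vocabulary, key=vocabulary.get)
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in ordered_tokens])
    return rows

text = ['This is a test to test the vectorizer. We are working on a datamining project project']
vocabulary = fit_vocabulary(text)
vectors = transform_counts(text, vocabulary)
print(vocabulary)
print(vectors)
```

The real `CountVectorizer` returns a sparse matrix rather than nested lists, which matters once the vocabulary grows to thousands of words.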

Now let's apply this process to our dataset.

In [None]:
count_vectorizer = CountVectorizer()
count_vectorizer.fit(training['tweet'])
vectorized = count_vectorizer.transform(training['tweet'])
print(count_vectorizer.vocabulary_)
print('The transformed data matrix dimensions:', vectorized.shape)

The transformed data matrix dimensions: (6420, 14122)


So now we have a data matrix with 6420 tweets (rows) and 14122 distinct words (columns). Each `vectorized[i, j]` entry is the count of the word at encoded position `j` in tweet `i`.
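As a small illustration of how to read one entry of such a matrix, the sketch below looks up a word's column through `vocabulary_` and indexes the sparse matrix with a `[row, column]` pair. The two-tweet corpus is a made-up stand-in for `training['tweet']`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for training['tweet'].
tweets = [
    "covid cases rise as covid testing expands",
    "new vaccine trial begins",
]
count_vectorizer = CountVectorizer()
vectorized = count_vectorizer.fit_transform(tweets)

# Column index assigned to the word during fitting.
j = count_vectorizer.vocabulary_["covid"]
print(vectorized.shape)
print(vectorized[0, j])  # count of "covid" in the first tweet
print(vectorized[1, j])  # count of "covid" in the second tweet
```

Note the single pair of brackets: on a sparse matrix, `vectorized[i]` selects a whole row as another matrix, so chained indexing does not pick out a single count.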


## TF-IDF Transformer


The Term Frequency - Inverse Document Frequency (TF-IDF) transformer reweights the word counts in our text using two statistics:
**Term Frequency** summarizes how often a word appears in a text.
**Inverse Document Frequency** down-weights words that appear in many of the given texts, since such words say little about any single text.
We can apply this transformation using the `fit`, `transform` function sequence. If we were starting from raw text, we could use a `TfidfVectorizer`, which does the counting and the TF-IDF weighting in one step; but since we already have a count matrix `vectorized`, we can just normalize it directly with a `TfidfTransformer`.

In [None]:
tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(vectorized)
transformed_tfidf = tfidf_transformer.transform(vectorized)
print(transformed_tfidf.shape)

(6420, 14122)
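For reference, the weighting applied above can be sketched by hand. The example below assumes the transformer's default settings: raw counts as term frequency, a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The function name `tfidf` and the toy count matrix are illustrative only.

```python
import math

def tfidf(count_rows):
    """Apply smoothed IDF weighting and L2 row normalization to raw counts."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed IDF (assumed default): ln((1 + n) / (1 + df)) + 1.
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]
    weighted_rows = []
    for row in count_rows:
        weighted = [count * idf_j for count, idf_j in zip(row, idf)]
        norm = math.sqrt(sum(w * w for w in weighted))
        weighted_rows.append([w / norm if norm else 0.0 for w in weighted])
    return idf, weighted_rows

# Hypothetical toy counts: 3 documents (rows), 3 terms (columns).
counts = [
    [1, 1, 0],
    [1, 0, 2],
    [1, 0, 0],
]
idf, weights = tfidf(counts)
print([round(v, 3) for v in idf])
for row in weights:
    print([round(v, 3) for v in row])
```

The first term appears in every document, so it gets the smallest possible weight, while the two rarer terms are boosted relative to it.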



Since we will be repeating these steps for both the training and validation data sets, we can chain them in a `Pipeline` object, fit it once on the training data, and then apply it to whichever data set we want.

In [None]:
pipeline = Pipeline([
        ('count_vectorizer', CountVectorizer()),  
        ('tfidf_transformer', TfidfTransformer()),  
        ('classifier', LinearSVC())
    ])

# Model Fitting and Prediction

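The `Pipeline.fit()` function runs both `fit()` and `transform()` for the `CountVectorizer` and the `TfidfTransformer`, and then fits the classifier on the transformed data. `Pipeline.predict()` only calls `transform()` on these two steps, so the validation tweets are encoded with the vocabulary and weights learned from the training data.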

In [None]:
pipeline.fit(training['tweet'], training['label'])
prediction = pipeline.predict(validation['tweet'])
print(prediction)

['fake' 'real' 'fake' ... 'fake' 'real' 'real']
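To see what the pipeline is doing, the following sketch compares it with the same steps written out by hand. The texts and labels are invented for illustration and are not taken from the tweet data; `random_state=0` is set here only so that both classifiers are fitted identically.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data, not taken from the tweet dataset.
train_texts = [
    "miracle cure kills the virus overnight",
    "secret cure hidden by doctors",
    "health ministry reports new confirmed cases",
    "officials report updated testing numbers",
]
train_labels = ["fake", "fake", "real", "real"]
new_texts = ["ministry reports new testing numbers", "miracle cure hidden"]

pipeline = Pipeline([
    ("count_vectorizer", CountVectorizer()),
    ("tfidf_transformer", TfidfTransformer()),
    ("classifier", LinearSVC(random_state=0)),
])
pipeline.fit(train_texts, train_labels)
pipeline_prediction = pipeline.predict(new_texts)

# The same steps written out by hand.
count_vectorizer = CountVectorizer()
tfidf_transformer = TfidfTransformer()
classifier = LinearSVC(random_state=0)
train_counts = count_vectorizer.fit_transform(train_texts)
train_tfidf = tfidf_transformer.fit_transform(train_counts)
classifier.fit(train_tfidf, train_labels)
# At prediction time only transform() is called: nothing is re-fitted.
new_tfidf = tfidf_transformer.transform(count_vectorizer.transform(new_texts))
manual_prediction = classifier.predict(new_tfidf)

print(pipeline_prediction)
print(manual_prediction)
```

Both approaches should give the same labels; the pipeline simply removes the risk of accidentally re-fitting the vectorizer on validation or test data.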


In [None]:
print(classification_report(validation['label'], prediction))

              precision    recall  f1-score   support

        fake       0.92      0.94      0.93      1020
        real       0.95      0.93      0.94      1120

    accuracy                           0.93      2140
   macro avg       0.93      0.94      0.93      2140
weighted avg       0.93      0.93      0.93      2140
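The per-class columns of this report can be reproduced by hand from counts of true positives, false positives and false negatives. The sketch below uses a short made-up list of labels rather than the validation data, and the helper `per_class_scores` is hypothetical.

```python
def per_class_scores(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels, not the validation data.
y_true = ["fake", "fake", "fake", "real", "real", "real", "real", "real"]
y_pred = ["fake", "fake", "real", "real", "real", "real", "fake", "real"]

for label in ("fake", "real"):
    precision, recall, f1 = per_class_scores(y_true, y_pred, label)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In the report, `support` is the number of true examples of each class, `macro avg` is the unweighted mean of the per-class scores, and `weighted avg` weights each class by its support.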

