## News Classification using Gensim Word Vectors


In [32]:
# So first we load Google News Word2vec model from gensim library:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')

In [33]:
dir(wv)

['__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_adapt_by_suffix',
 '_load_specials',
 '_log_evaluate_word_analogies',
 '_save_specials',
 '_smart_save',
 '_upconvert_old_d2vkv',
 '_upconvert_old_vocab',
 'add_lifecycle_event',
 'add_vector',
 'add_vectors',
 'allocate_vecattrs',
 'closer_than',
 'cosine_similarities',
 'distance',
 'distances',
 'doesnt_match',
 'evaluate_word_analogies',
 'evaluate_word_pairs',
 'expandos',
 'fill_norms',
 'get_index',
 'get_normed_vectors',
 'get_vecattr',
 'get_vector',
 'has_index_for',
 'index2entity',
 'index2word',
 'index_to_key',
 'init_sims',
 'intersect_word2vec_

In [2]:
# Again to simple check similarity between two words:
wv.similarity(w1="great", w2="good")

0.72915095

In [3]:
# The lenght for each vector is 300:
wv_great = wv["great"]
wv_good = wv["good"]

In [4]:
wv_great.shape

(300,)

In [5]:
# To see the vector:
wv_great

array([ 7.17773438e-02,  2.08007812e-01, -2.84423828e-02,  1.78710938e-01,
        1.32812500e-01, -9.96093750e-02,  9.61914062e-02, -1.16699219e-01,
       -8.54492188e-03,  1.48437500e-01, -3.34472656e-02, -1.85546875e-01,
        4.10156250e-02, -8.98437500e-02,  2.17285156e-02,  6.93359375e-02,
        1.80664062e-01,  2.22656250e-01, -1.00585938e-01, -6.93359375e-02,
        1.04427338e-04,  1.60156250e-01,  4.07714844e-02,  7.37304688e-02,
        1.53320312e-01,  6.78710938e-02, -1.03027344e-01,  4.17480469e-02,
        4.27246094e-02, -1.10351562e-01, -6.68945312e-02,  4.19921875e-02,
        2.50000000e-01,  2.12890625e-01,  1.59179688e-01,  1.44653320e-02,
       -4.88281250e-02,  1.39770508e-02,  3.55529785e-03,  2.09960938e-01,
        1.52343750e-01, -7.32421875e-02,  2.16796875e-01, -5.76171875e-02,
       -2.84423828e-02, -3.60107422e-03,  1.52343750e-01, -2.63671875e-02,
        2.13623047e-02, -1.51367188e-01,  1.04003906e-01,  3.18359375e-01,
       -1.85546875e-01,  

### Fake vs Real News Classification Using This Word2Vec Embeddings
* Fake news refers to misinformation or disinformation in the country which is spread through word of mouth and more recently through digital communication such as What's app messages, social media posts, etc.

* Fake news spreads faster than real news and creates problems and fear among groups and in society.

* We are going to address these problems using classical NLP techniques and going to classify whether a given message/ text is **Real or Fake Message**.

* We will use **glove embeddings** from spacy which is trained on massive wikipedia dataset to pre-process and text vectorization and apply different classification algorithms.

### Dataset
Credits: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset

* This data consists of two columns. - Text - label
* Text is the statements or messages regarding a particular event/situation.
* label feature tells whether the given text is Fake or Real.
* As there are only 2 classes, this problem comes under the Binary Classification.

In [6]:
# So let's import pandas and read the dataset:
import pandas as pd

df = pd.read_csv("fake_and_real_news.csv")
df.sample(5)

Unnamed: 0,Text,label
5445,"Factbox: Trump on Twitter (Sept 25) - NASCAR, ...",Real
1775,"Trump slaps travel restrictions on N.Korea, Ve...",Real
8581,"Republican Senators Corker, Toomey reach deal ...",Real
2155,Trump Just Started Following ‘Emergency Kitte...,Fake
8445,Former Trump aide nomination to be Singapore e...,Real


In [7]:
# To see the shape of the dataset:
print(df.shape)

(9900, 2)


In [8]:
# To check the imbalance of dataset:
df['label'].value_counts()

Fake    5000
Real    4900
Name: label, dtype: int64

In [9]:
# So as we see both classes samples are almost similar, so no need for further processing.
# Next we change the label from text to numbers by creating a new column:

df['label_num'] = df['label'].map({'Fake' : 0, 'Real': 1})
df.head(5)

Unnamed: 0,Text,label,label_num
0,Top Trump Surrogate BRUTALLY Stabs Him In The...,Fake,0
1,U.S. conservative leader optimistic of common ...,Real,1
2,"Trump proposes U.S. tax overhaul, stirs concer...",Real,1
3,Court Forces Ohio To Allow Millions Of Illega...,Fake,0
4,Democrats say Trump agrees to work on immigrat...,Real,1


**Now we will convert the text into a vector using gensim's word2vec embeddings.**

**We will do this in three steps,**
   1. Preprocess the text to remove stop words, punctuations and get lemma for each word.
   2. Get word vectors for each of the words in a pre-processed sentece.
   3. Take a mean of all word vectors to derive the numeric representation of the entire news article.
First let's explore get_mean_vector api of gensim to see how it works

In [28]:
# Now let's write the function that can do preprocessing and vectorization both:

import spacy
nlp = spacy.load("en_core_web_lg") 

def preprocess_and_vectorize(text):
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    return filtered_tokens

In [29]:
# To check the function, if it works or not:
preprocess_and_vectorize("Don't worry if you don't understand")

['worry', 'understand']

In [26]:
# Next we use 'get_mean_vector' to return the mean of all filtered tokens. The reason why we do this is, as we saw that the 
# filtered tokens 'filtered_tokens' is a list of words in a sentence and we're interested in the embedding of the entire 
# sentence.
'''
import spacy
nlp = spacy.load("en_core_web_lg") 

def preprocess_and_vectorize(text):
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)

    return wv.get_mean_vector(filtered_tokens)
    '''



'\nimport spacy\nnlp = spacy.load("en_core_web_lg") \n\ndef preprocess_and_vectorize(text):\n    doc = nlp(text)\n    filtered_tokens = []\n    for token in doc:\n        if token.is_stop or token.is_punct:\n            continue\n        filtered_tokens.append(token.lemma_)\n\n    return wv.get_mean_vector(filtered_tokens)\n    '

In [27]:
# v = preprocess_and_vectorize("Don't worry if you don't understand")
# v.shape

In [30]:
# Next we want to do vectorization. We create a new column 'vector' which will have a vectors for the 'Text' column 
df['vector'] = df['Text'].apply(lambda text: preprocess_and_vectorize(text))

In [31]:
# Now to see the first five rows:
df.head()

Unnamed: 0,Text,label,label_num,vector
0,Top Trump Surrogate BRUTALLY Stabs Him In The...,Fake,0,"[ , Trump, Surrogate, BRUTALLY, Stabs, Patheti..."
1,U.S. conservative leader optimistic of common ...,Real,1,"[U.S., conservative, leader, optimistic, commo..."
2,"Trump proposes U.S. tax overhaul, stirs concer...",Real,1,"[Trump, propose, U.S., tax, overhaul, stir, co..."
3,Court Forces Ohio To Allow Millions Of Illega...,Fake,0,"[ , Court, Forces, Ohio, allow, million, illeg..."
4,Democrats say Trump agrees to work on immigrat...,Real,1,"[Democrats, Trump, agree, work, immigration, b..."


In [None]:
# So next we split the dataset into train and test samples:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.vector.values, 
    df.label_num, 
    test_size=0.2,
    random_state=2022,
    stratify=df.label_num
)

In [None]:
# Next we reshape the X_train and X_test so as to fit for models:
print("Shape of X_train before reshaping: ", X_train.shape)
print("Shape of X_test before reshaping: ", X_test.shape)


X_train_2d = np.stack(X_train)
X_test_2d =  np.stack(X_test)

print("Shape of X_train after reshaping: ", X_train_2d.shape)
print("Shape of X_test after reshaping: ", X_test_2d.shape)

In [None]:
# Next we train the model.
# Here we tried to train Gradient Boosting Classifier as it probably will get best result comparing with RF, NB and DT.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

clf = GradientBoostingClassifier()
clf.fit(X_train_2d, y_train)

y_pred = clf.predict(X_test_2d)
print(classification_report(y_test, y_pred))

In [None]:
# Let's make some prediction:
test_news = [
    "Michigan governor denies misleading U.S. House on Flint water (Reuters) - Michigan Governor Rick Snyder denied Thursday that he had misled a U.S. House of Representatives committee last year over testimony on Flintâ€™s water crisis after lawmakers asked if his testimony had been contradicted by a witness in a court hearing. The House Oversight and Government Reform Committee wrote Snyder earlier Thursday asking him about published reports that one of his aides, Harvey Hollins, testified in a court hearing last week in Michigan that he had notified Snyder of an outbreak of Legionnairesâ€™ disease linked to the Flint water crisis in December 2015, rather than 2016 as Snyder had testified. â€œMy testimony was truthful and I stand by it,â€ Snyder told the committee in a letter, adding that his office has provided tens of thousands of pages of records to the committee and would continue to cooperate fully.  Last week, prosecutors in Michigan said Dr. Eden Wells, the stateâ€™s chief medical executive who already faced lesser charges, would become the sixth current or former official to face involuntary manslaughter charges in connection with the crisis. The charges stem from more than 80 cases of Legionnairesâ€™ disease and at least 12 deaths that were believed to be linked to the water in Flint after the city switched its source from Lake Huron to the Flint River in April 2014. Wells was among six current and former Michigan and Flint officials charged in June. The other five, including Michigan Health and Human Services Director Nick Lyon, were charged at the time with involuntary manslaughter",
    " WATCH: Fox News Host Loses Her Sh*t, Says Investigating Russia For Hacking Our Election Is Unpatriotic This woman is insane.In an incredibly disrespectful rant against President Obama and anyone else who supports investigating Russian interference in our election, Fox News host Jeanine Pirro said that anybody who is against Donald Trump is anti-American. Look, it s time to take sides,  she began.",
    " Sarah Palin Celebrates After White Man Who Pulled Gun On Black Protesters Goes Unpunished (VIDEO) Sarah Palin, one of the nigh-innumerable  deplorables  in Donald Trump s  basket,  almost outdid herself in terms of horribleness on Friday."
]

test_news_vectors = [preprocess_and_vectorize(n) for n in test_news]
clf.predict(test_news_vectors)

In [None]:
# Confusion matrix:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

from matplotlib import pyplot as plt
import seaborn as sn
plt.figure(figsize = (10,7))
sn.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Prediction')
plt.ylabel('Truth')