In [32]:
#Make necessary imports

import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


# TfidfTransformer vs. TfidfVectorizer
In summary, the main differences between the two classes are as follows:

With TfidfTransformer, you first compute word counts using CountVectorizer, then compute the Inverse Document Frequency (IDF) values, and only then compute the tf-idf scores.

With TfidfVectorizer, by contrast, you do all three steps at once. Under the hood, it computes the word counts, IDF values, and tf-idf scores all using the same dataset.

When to use what?
So now you may be wondering why you should use more steps than necessary if you can get everything done in one step. Well, there are cases where you want to use TfidfTransformer over TfidfVectorizer, and it is sometimes not that obvious. Here is a general guideline:

If you need the term frequency (term count) vectors for different tasks, use TfidfTransformer.
If you need to compute tf-idf scores on documents within your “training” dataset, use TfidfVectorizer.
If you need to compute tf-idf scores on documents outside your “training” dataset, use either one; both will work.
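
The two routes described above can be compared side by side. The following is a minimal sketch on a made-up three-sentence corpus (the `docs` list and all variable names are assumptions for illustration, not part of the news dataset). It relies on both classes being left at their default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# Route 1: word counts first, then IDF weighting on top of the counts.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)  # term-count matrix, reusable for other tasks
two_step = TfidfTransformer().fit_transform(counts)

# Route 2: counts, IDF values and tf-idf scores in a single call.
tfidf_vec = TfidfVectorizer()
one_step = tfidf_vec.fit_transform(docs)

# With identical settings, both routes learn the same vocabulary and scores.
same_vocabulary = count_vec.vocabulary_ == tfidf_vec.vocabulary_
matrices_match = np.allclose(two_step.toarray(), one_step.toarray())
print(same_vocabulary, matrices_match)
```

The practical difference is that Route 1 leaves the raw `counts` matrix available, which is the reason given above for choosing TfidfTransformer when term counts are needed elsewhere.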

In [33]:
#Read the DataFrame
df = pd.read_csv('C:\\Data Science And CS\\Fake News Detection Project_1\\news.csv')

#get head and shape
df.shape
df.head(10)


Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
5,6903,"Tehran, USA","\nI’m not an immigrant, but my grandparents ...",FAKE
6,7341,Girl Horrified At What She Watches Boyfriend D...,"Share This Baylee Luciani (left), Screenshot o...",FAKE
7,95,‘Britain’s Schindler’ Dies at 106,A Czech stockbroker who saved more than 650 Je...,REAL
8,4869,Fact check: Trump and Clinton at the 'commande...,Hillary Clinton and Donald Trump made some ina...,REAL
9,2909,Iran reportedly makes new push for uranium con...,Iranian negotiators reportedly have made a las...,REAL


In [34]:
#Example of a fake news article; its contents are obviously gibberish and make no sense
df['text'][1]

'Google Pinterest Digg Linkedin Reddit Stumbleupon Print Delicious Pocket Tumblr \nThere are two fundamental truths in this world: Paul Ryan desperately wants to be president. And Paul Ryan will never be president. Today proved it. \nIn a particularly staggering example of political cowardice, Paul Ryan re-re-re-reversed course and announced that he was back on the Trump Train after all. This was an aboutface from where he was a few weeks ago. He had previously declared he would not be supporting or defending Trump after a tape was made public in which Trump bragged about assaulting women. Suddenly, Ryan was appearing at a pro-Trump rally and boldly declaring that he already sent in his vote to make him President of the United States. It was a surreal moment. The figurehead of the Republican Party dosed himself in gasoline, got up on a stage on a chilly afternoon in Wisconsin, and lit a match. . @SpeakerRyan says he voted for @realDonaldTrump : “Republicans, it is time to come home” ht

In [35]:
#Get the labels
labels = df.label
labels.head()

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object

In [36]:
#Splitting of dataset into training and test

X_train, X_test, y_train, y_test = train_test_split(df['text'], labels, test_size=0.3, random_state=42)

TF-IDF stands for Term Frequency-Inverse Document Frequency.

It is an algorithm that transforms text into a meaningful numerical representation that an ML algorithm can use for prediction.

TF-IDF(t,d) = TF(t,d) x IDF(t)
where:
TF(t,d) = term frequency (number of times term t appears in a document d)
IDF(t) = inverse document frequency = log[(1+n)/(1+df(t))] + 1 (scikit-learn's default smoothed form, using the natural log)
where n = number of documents, df(t) = document frequency of the term t (number of documents that contain t)

Stop words such as 'is' are removed from the analysis.
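
To make the formula above concrete, here is a small worked sketch in plain Python. The tokenised corpus and the helper names (`tf`, `df`, `idf`, `tfidf`) are hypothetical and exist only for this illustration. It computes the unnormalised score, whereas TfidfVectorizer additionally L2-normalises each document row by default.

```python
import math

# Hypothetical toy corpus, already tokenised, for illustration only.
docs = [
    ["fake", "news", "spreads", "fast"],
    ["real", "news", "travels", "slowly"],
    ["fake", "claims", "spread", "fast"],
]
n = len(docs)  # number of documents


def tf(term, doc):
    """Term frequency: how many times the term appears in the document."""
    return doc.count(term)


def df(term):
    """Document frequency: how many documents contain the term."""
    return sum(1 for doc in docs if term in doc)


def idf(term):
    """Smoothed inverse document frequency: log[(1+n)/(1+df)] + 1."""
    return math.log((1 + n) / (1 + df(term))) + 1


def tfidf(term, doc):
    """Unnormalised tf-idf score of a term in a document."""
    return tf(term, doc) * idf(term)


common = tfidf("news", docs[0])    # "news" occurs in 2 of the 3 documents
rare = tfidf("spreads", docs[0])   # "spreads" occurs in only 1 document
print(f"news: {common:.4f}, spreads: {rare:.4f}")
```

Both terms appear once in the first document, so the rarer term receives the higher score purely because of its larger IDF.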

In [37]:
#Initialize the TfidfVectorizer: drop English stop words and ignore terms found in more than 70% of documents
tfidf = TfidfVectorizer(stop_words = 'english', max_df = 0.7)
#Initialize a separate CountVectorizer
#analyzer='word' builds the vocabulary from word tokens and stop_words='english' removes English stop words
countvectorizer = CountVectorizer(analyzer='word', stop_words = 'english')

#convert the documents into sparse matrices
count_wm = countvectorizer.fit_transform(X_train)
tfidf_wm = tfidf.fit_transform(X_train)

#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
count_tokens = countvectorizer.get_feature_names_out()
tfidf_tokens = tfidf.get_feature_names_out()

df_countvect = pd.DataFrame(data = count_wm.toarray(), columns = count_tokens)
df_tfidfvect = pd.DataFrame(data = tfidf_wm.toarray(), columns = tfidf_tokens)

#CountVectorizer gives the raw count of each vocabulary term in each document
print("Count Vectorizer\n") 
print(df_countvect)
#TF-IDF weights each count by how rare the term is across all documents
print("\nTFIDF Vectorizer\n")
print(df_tfidfvect)

#Fit and transform training set and transform test set #TF-IDF
tfidf_train = tfidf.fit_transform(X_train)
tfidf_test = tfidf.transform(X_test)



Count Vectorizer

      00  000  0000  000000031  00000031  0001  0002  000billion  000ft  \
0      0    0     0          0         0     0     0           0      0   
1      0    0     0          0         0     0     0           0      0   
2      0    0     0          0         0     0     0           0      0   
3      0    4     0          0         0     0     0           0      0   
4      0    0     0          0         0     0     0           0      0   
...   ..  ...   ...        ...       ...   ...   ...         ...    ...   
4429   0    0     0          0         0     0     0           0      0   
4430   0    0     0          0         0     0     0           0      0   
4431   0    0     0          0         0     0     0           0      0   
4432   0    0     0          0         0     0     0           0      0   
4433   0    0     0          0         0     0     0           0      0   

      000km  ...  שתי  תאמצנה  תוצאה  תחל  תיירות  תנותק  תעודת  תתרכז  \
0      

In [38]:
#Fit and transform training set and transform test set #CountVectorizer
terms = countvectorizer.fit_transform(X_train)
terms_vector = countvectorizer.transform(X_test)

print("Sparse Matrix form of test data : \n")
print(terms_vector.todense())

Sparse Matrix form of test data : 

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [39]:
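#Convert the CountVectorizer sparse matrix to TF-IDF using TfidfTransformer

from sklearn.feature_extraction.text import TfidfTransformer

tfidf_trans = TfidfTransformer(norm ='l2')

#Note: the IDF is fitted here on the test counts (terms_vector), while tfidf above
#learned its IDF from X_train (and was built with max_df = 0.7, which countvectorizer lacks).
#The two matrices printed below therefore differ slightly; they tally only if the
#transformer is fitted on the training counts (terms) and both vectorizers share the same settings.
tfidf_trans.fit(terms_vector)

tf_idf_matrix = tfidf_trans.transform(terms_vector)

print("\nVector of idf \n")
print(tfidf_trans.idf_)
print("\nFinal tf-idf vectorizer matrix form using 3 steps approach starting from countvectorizer :\n")
print(tf_idf_matrix.todense())
print("\nFinal tf-idf vectorizer matrix form using tf-idfVectorizer :\n")
print(tfidf_test.todense())
print("\nBoth methods tally only when the IDF is fitted on the same training data with the same settings.\n")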


Vector of idf 

[5.7174479  2.63445918 8.55066124 ... 8.55066124 8.55066124 8.55066124]

Final tf-idf vectorizer matrix form using 3 steps approach starting from countvectorizer :

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.00823125 0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.01490196 0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]

Final tf-idf vectorizer matrix form using tf-idfVectorizer :

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.00810119 0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0
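
The two matrices printed above are close but not identical (for example 0.00823125 against 0.00810119). One likely reason is that the TfidfTransformer was fitted on the test counts while the TfidfVectorizer learned its IDF from the training set. The sketch below, on a made-up corpus (`train_docs`, `test_docs` and every other name here are assumptions for illustration), shows how the two approaches can be made to agree: share the same vectorizer settings and fit the IDF on the training counts only.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy data, used only for illustration.
train_docs = [
    "the senate passed the bill",
    "aliens endorse the senate candidate",
    "the bill failed in the house",
]
test_docs = ["the house passed the bill", "aliens passed the senate"]

# Use the SAME settings for both vectorizers so the vocabularies match.
settings = dict(stop_words="english", max_df=0.7)

# Three-step route: counts -> IDF fitted on TRAIN counts -> transform test counts.
counter = CountVectorizer(**settings)
train_counts = counter.fit_transform(train_docs)
test_counts = counter.transform(test_docs)
transformer = TfidfTransformer(norm="l2").fit(train_counts)
three_step_test = transformer.transform(test_counts)

# One-step route: TfidfVectorizer fitted on the training documents.
vectorizer = TfidfVectorizer(**settings).fit(train_docs)
one_step_test = vectorizer.transform(test_docs)

tallies = np.allclose(three_step_test.toarray(), one_step_test.toarray())

# For contrast: an IDF fitted on the test counts is a different vector.
refit = TfidfTransformer(norm="l2").fit(test_counts)
idf_differs = not np.allclose(refit.idf_, transformer.idf_)
print(tallies, idf_differs)
```

Fitting the IDF on training data only also keeps test-set statistics out of the features, which is the usual reason for calling `fit` on the training split and `transform` on the test split.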

In [40]:
#Initialize a PassiveAggressiveClassifier
pac = PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train,y_train)

#Predict on the test set and calculate accuracy
y_pred = pac.predict(tfidf_test)
score = accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred, labels=['FAKE','REAL']))

Accuracy: 93.53%
              precision    recall  f1-score   support

        FAKE       0.93      0.94      0.94       968
        REAL       0.94      0.93      0.93       933

    accuracy                           0.94      1901
   macro avg       0.94      0.94      0.94      1901
weighted avg       0.94      0.94      0.94      1901

[[910  58]
 [ 65 868]]
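
The figures in the classification report can be recovered by hand from the confusion matrix. The short sketch below copies the four counts printed above and assumes scikit-learn's convention that rows are true labels and columns are predicted labels, in the order ['FAKE', 'REAL']. The variable names are chosen for illustration.

```python
# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels, order ['FAKE', 'REAL'].
cm = [[910, 58],
      [65, 868]]

total = sum(sum(row) for row in cm)

# Accuracy: correctly classified articles (the diagonal) over all articles.
accuracy = (cm[0][0] + cm[1][1]) / total

# Precision: of the articles predicted as a class, the share that truly belong to it.
precision_fake = cm[0][0] / (cm[0][0] + cm[1][0])
precision_real = cm[1][1] / (cm[1][1] + cm[0][1])

# Recall: of the articles truly in a class, the share that were predicted as it.
recall_fake = cm[0][0] / (cm[0][0] + cm[0][1])
recall_real = cm[1][1] / (cm[1][1] + cm[1][0])

print(f"Accuracy: {round(accuracy * 100, 2)}%")
print(f"FAKE precision={precision_fake:.2f} recall={recall_fake:.2f}")
print(f"REAL precision={precision_real:.2f} recall={recall_real:.2f}")
```

Read this way, 58 fake articles were labelled REAL and 65 real articles were labelled FAKE, so the two kinds of error are roughly balanced.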
