<a href="https://colab.research.google.com/github/jbiancamano/Projects/blob/main/Assignment5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 5
## Differentiating real news from fake news
Group members: Tim Brady, Kayla Shamayev, Justin Biancamano

In [None]:
# preamble to be able to run notebooks in Jupyter and Colab
try:
    from google.colab import drive
    import sys
    
    drive.mount('/content/drive')
    notes_home = "/content/drive/Shared drives/CSC310/ds/notes/"
    user_home = "/content/drive/My Drive/"
    
    sys.path.insert(1,notes_home) # let the notebook access the notes folder

except ModuleNotFoundError:
    notes_home = "" # running native Jupyter environment -- notes home is the same as the notebook
    user_home = ""  # under Jupyter we assume the user directory is the same as the notebook

Mounted at /content/drive


In [None]:
# setup
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from assets.confint import classification_confint
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer
from assets.treeviz import tree_print
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.naive_bayes import MultinomialNB 

In [None]:
# access fake news data

# URL pointing to the RAW content of a GitHub CSV file. Here is 
# a nice article describing this:
# https://projectosyo.wixsite.com/datadoubleconfirm/single-post/2019/04/15/Reading-csv-data-from-Github---Python

url = 'https://raw.githubusercontent.com/lutzhamel/fake-news/master/data/fake_or_real_news.csv'
fake_news_df = pd.read_csv(url)

# Look at the first 10 rows of the data
fake_news_df.head(n=10)

Unnamed: 0,id,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
5,6903,"Tehran, USA","\nI’m not an immigrant, but my grandparents ...",FAKE
6,7341,Girl Horrified At What She Watches Boyfriend D...,"Share This Baylee Luciani (left), Screenshot o...",FAKE
7,95,‘Britain’s Schindler’ Dies at 106,A Czech stockbroker who saved more than 650 Je...,REAL
8,4869,Fact check: Trump and Clinton at the 'commande...,Hillary Clinton and Donald Trump made some ina...,REAL
9,2909,Iran reportedly makes new push for uranium con...,Iranian negotiators reportedly have made a las...,REAL


In [None]:
# Determine the number of rows and columns
fake_news_df.shape

(6335, 4)

In [None]:
# Determine how many fake and real labels there are
fake_news_df['label'].value_counts()

REAL    3171
FAKE    3164
Name: label, dtype: int64

The data has 4 columns and 6335 rows. The classes are almost perfectly balanced: there are 3171 REAL and 3164 FAKE articles, a difference of 7 articles, or about 0.001 (0.1%) of the data set. The 'id' column is the only numerical variable, and it carries no useful information for classification.
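
The balance claim can be checked with a little arithmetic. The following is a minimal sketch that hard-codes the counts printed by `value_counts()` instead of reloading the CSV; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
counts = {"REAL": 3171, "FAKE": 3164}

total = sum(counts.values())                  # number of articles
gap = abs(counts["REAL"] - counts["FAKE"])    # absolute difference between the classes
gap_fraction = gap / total                    # difference as a share of the data set

for name, n in counts.items():
    print("{}: {} ({:.1%})".format(name, n, n / total))
print("difference: {} articles ({:.2%} of the data)".format(gap, gap_fraction))
```
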
## 1. Use the vector model and text processing techniques to construct a training data set

In [None]:
text = fake_news_df["text"]
label = fake_news_df["label"]

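# process documents
# NOTE: this cell vectorizes the label column, not the article text, so the
# coordinate system printed below is just ['fake', 'real']; the article text
# itself is vectorized in the stemming cell further down
vectorizer = CountVectorizer(analyzer = "word", binary = True)
docarray = vectorizer.fit_transform(label).toarray()

# print out the coordinate system
# NOTE: sklearn filters out single character words -- it drops 'a'
print("Coordinates:")
# scikit-learn >= 1.0; older versions use get_feature_names()
coords = vectorizer.get_feature_names_out().tolist()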
print(coords)

print("\nfakenews:")
term = pd.DataFrame(data=docarray,index=text,columns=coords)
print(term)

# print pairwise distances between documents
distances = euclidean_distances(term)
distances_df = pd.DataFrame(data=distances, index=text, columns=text)
print("\nPairwise Distances:")
print(distances_df)

Coordinates:
['fake', 'real']

fakenews:
                                                    fake  real
text                                                          
Daniel Greenfield, a Shillman Journalism Fellow...     1     0
Google Pinterest Digg Linkedin Reddit Stumbleup...     1     0
U.S. Secretary of State John F. Kerry said Mond...     0     1
— Kaydee King (@KaydeeKing) November 9, 2016 Th...     1     0
It's primary day in New York and front-runners ...     0     1
...                                                  ...   ...
The State Department told the Republican Nation...     0     1
The ‘P’ in PBS Should Stand for ‘Plutocratic’ o...     1     0
 Anti-Trump Protesters Are Tools of the Oligarc...     1     0
ADDIS ABABA, Ethiopia —President Obama convened...     0     1
Jeb Bush Is Suddenly Attacking Trump. Here's Wh...     0     1

[6335 rows x 2 columns]

Pairwise Distances:
text                                                                                          

In [None]:
#cats = ['text', 'label']
#fakenews_train = fake_news_df(subset='train', categories=cats)

In [None]:
print("******** docarray **********")

# build the stemmer object
stemmer = PorterStemmer()

# build a new default analyzer using CountVectorizer that only uses words: [a-zA-Z]+
# also eliminate stop words
analyzer= CountVectorizer(analyzer = "word", 
                          stop_words = 'english',
                          token_pattern = "[a-zA-Z]+").build_analyzer()

# build a new analyzer that stems using the default analyzer to create the words to be stemmed
def stemmed_words(doc):
    return [stemmer.stem(w) for w in analyzer(doc)]

# build docarray
vectorizer = CountVectorizer(analyzer=stemmed_words,
                             #analyzer=analyzer,
                             binary=True,
                             min_df=2) # each word has to appear in at least two documents
docarray = vectorizer.fit_transform(fake_news_df['text']).toarray()
docarray.shape
# scikit-learn >= 1.0; older versions use get_feature_names()
doc_df = pd.DataFrame(docarray, columns=list(vectorizer.get_feature_names_out()))
doc_df.head()

******** docarray **********


Unnamed: 0,aa,aaa,aab,aadmi,aaib,aam,aamaq,aap,aaron,aarp,ab,aba,abaaoud,ababa,aback,abadi,abandon,abat,abba,abbar,abbey,abbi,abbott,abbottabad,abbrevi,abc,abcnew,abcpolit,abd,abdallah,abdel,abdelhamid,abdeslam,abdic,abdollahi,abdomen,abduct,abducte,abductor,abdul,...,zika,zilch,zimbabw,zimbabwean,zimmer,zimmerman,zimr,zinc,zing,zinger,zinn,zion,zionism,zionist,zip,zirconium,zirp,zodiac,zoe,zoellick,zolil,zombi,zone,zoo,zoolog,zoom,zor,zoroastrian,zu,zucker,zuckerberg,zuckerburg,zuckerman,zuess,zukowski,zulu,zurich,zvezda,zvz,zwick
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


# Decision Tree

In [None]:
print("******** model **********")


# Decision Tree
model = DecisionTreeClassifier()

# grid search
param_grid = {'max_depth': list(range(1,30)), 'criterion':['gini','entropy']}
grid = GridSearchCV(model, param_grid, cv=2, verbose=10, n_jobs=-1)
grid.fit(docarray, label)
print("Grid Search: best parameters: {}".format(grid.best_params_))
tree_print(grid.best_estimator_,doc_df)

******** model **********
Fitting 2 folds for each of 58 candidates, totalling 116 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    5.1s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    9.3s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   27.9s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:   44.0s
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  2.8min
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done  57 tasks      | elapsed:  4.7min
[Parallel(n_jobs=-1)]: Done  68 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done  81 tasks      | elapsed:  6.1min
[Parallel(n_jobs=-1)]: Done  94 tasks      | elapsed:  7.2min
[Parallel(n_jobs=-1)]: Done 109 tasks      | elapsed:  8.6min
[Parallel(n_jobs=-1)]: Done 116 out of 116 | elapsed:  9.1min finished


Grid Search: best parameters: {'criterion': 'entropy', 'max_depth': 26}
if republican =< 0.5: 
  |then if octob =< 0.5: 
  |  |then if obama =< 0.5: 
  |  |  |then if said =< 0.5: 
  |  |  |  |then if com =< 0.5: 
  |  |  |  |  |then if verdict =< 0.5: 
  |  |  |  |  |  |then if novemb =< 0.5: 
  |  |  |  |  |  |  |then if waterg =< 0.5: 
  |  |  |  |  |  |  |  |then if sen =< 0.5: 
  |  |  |  |  |  |  |  |  |then if share =< 0.5: 
  |  |  |  |  |  |  |  |  |  |then if number =< 0.5: 
  |  |  |  |  |  |  |  |  |  |  |then if demo =< 0.5: 
  |  |  |  |  |  |  |  |  |  |  |  |then if washburn =< 0.5: 
  |  |  |  |  |  |  |  |  |  |  |  |  |then if campaign =< 0.5: 
  |  |  |  |  |  |  |  |  |  |  |  |  |  |then if parti =< 0.5: 
  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |then if wisconsin =< 0.5: 
  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |then if natur =< 0.5: 
  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |then if administr =< 0.5: 
  |  |  |  |  |  |  |  |  |  |  |  |  |  |

# Confusion Matrix

In [None]:
print("******** Accuracy **********")

# accuracy of best model with confidence interval
best_model = grid.best_estimator_
predict_y = best_model.predict(docarray)
acc = accuracy_score(label, predict_y)
lb,ub = classification_confint(acc,docarray.shape[0])
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc,lb,ub))

******** Accuracy **********
Accuracy: 0.99 (0.99,0.99)


In [None]:
print("******** confusion matrix **********")

# build the confusion matrix
cats = ['REAL','FAKE']
cm = confusion_matrix(label, predict_y, labels=cats)
cm_df = pd.DataFrame(cm, index=cats, columns=cats)
print("Confusion Matrix:\n{}".format(cm_df))

******** confusion matrix **********
Confusion Matrix:
      REAL  FAKE
REAL  3103    68
FAKE     0  3164


# Naive Bayes Model

In [None]:
print("******** model **********")


# Naive Bayes
model = MultinomialNB()
# NOTE: MultinomialNB is fit with its default settings (e.g. smoothing alpha=1.0) - no grid search over a parameter space here
model.fit(docarray, label)


print("******** Accuracy **********")

# accuracy of best model with confidence interval
best_model = model
predict_y = best_model.predict(docarray)
acc = accuracy_score(label, predict_y)
lb,ub = classification_confint(acc,docarray.shape[0])
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc,lb,ub))

print("******** confusion matrix **********")

# build the confusion matrix
cats = ['REAL','FAKE']
cm = confusion_matrix(label, predict_y, labels=cats)
cm_df = pd.DataFrame(cm, index=cats, columns=cats)
print("Confusion Matrix:\n{}".format(cm_df))

******** model **********
******** Accuracy **********
Accuracy: 0.92 (0.91,0.93)
******** confusion matrix **********
Confusion Matrix:
      REAL  FAKE
REAL  2974   197
FAKE   308  2856


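The Naive Bayes model is less accurate than the decision tree: it classifies only 92% of the articles correctly, compared with 99% for the tree. The difference is statistically significant because the confidence intervals, (0.91, 0.93) and (0.99, 0.99), do not overlap. Both accuracies were measured on the same data the models were trained on.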
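
The comparison rests on the intervals returned by the course helper `classification_confint`, whose source is not shown in this notebook. The sketch below assumes that the helper computes a 95% normal-approximation interval; `approx_confint` is a hypothetical stand-in, not the helper itself. The accuracies are recomputed from the two confusion matrices.

```python
import math

def approx_confint(acc, n, z=1.96):
    """95% normal-approximation interval for an accuracy measured on n samples.

    Assumption: this mirrors what the course helper classification_confint
    computes; the helper's source is not shown in this notebook.
    """
    half_width = z * math.sqrt(acc * (1.0 - acc) / n)
    return acc - half_width, acc + half_width

n = 6335  # number of articles

# Accuracies recomputed from the diagonals of the two confusion matrices
tree_acc = (3103 + 3164) / n
nb_acc = (2974 + 2856) / n

tree_lb, tree_ub = approx_confint(tree_acc, n)
nb_lb, nb_ub = approx_confint(nb_acc, n)

print("Decision tree: {:3.2f} ({:3.2f},{:3.2f})".format(tree_acc, tree_lb, tree_ub))
print("Naive Bayes:   {:3.2f} ({:3.2f},{:3.2f})".format(nb_acc, nb_lb, nb_ub))

# Two intervals are disjoint when one upper bound lies below the other lower bound
intervals_overlap = not (nb_ub < tree_lb or tree_ub < nb_lb)
print("Intervals overlap:", intervals_overlap)
```

Under this assumption the rounded bounds agree with the intervals reported by the notebook, which suggests, but does not prove, that the helper uses the same formula.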

# Extra Credit:

Use the vector model and text processing techniques to construct a training data set:
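
To make the vector model concrete, here is a minimal plain-Python sketch of binary bag-of-words vectors and Euclidean distances. It only approximates what `CountVectorizer(binary=True)` and `euclidean_distances` do: the tokenizer is a simplification, and the third title is made up for contrast.

```python
import math
import re

# Two titles from the data set preview, plus one made-up title for contrast
docs = [
    "Kerry to go to Paris in gesture of sympathy",
    "The Battle of New York: Why This Primary Matters",
    "Kerry to go to New York",  # hypothetical, not in the data set
]

def tokenize(doc):
    # Lowercase alphabetic tokens of two or more letters (a simplification
    # of CountVectorizer's default token pattern)
    return re.findall(r"[a-z]{2,}", doc.lower())

# The coordinate system: one dimension per distinct word
vocab = sorted({w for d in docs for w in tokenize(d)})

def to_binary_vector(doc):
    # 1 if the word occurs in the document, 0 otherwise
    present = set(tokenize(doc))
    return [1 if w in present else 0 for w in vocab]

vectors = [to_binary_vector(d) for d in docs]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        print("d({}, {}) = {:.3f}".format(i, j, euclidean(vectors[i], vectors[j])))
```

With binary vectors, the squared distance between two documents is the number of words that occur in exactly one of them, so titles that share more words end up closer together.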

In [None]:
title = fake_news_df["title"]
label = fake_news_df["label"]

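# process documents
# NOTE: this cell vectorizes the label column, not the titles, so the
# coordinate system printed below is just ['fake', 'real']; the titles
# themselves are vectorized in the stemming cell further down
vectorizer = CountVectorizer(analyzer = "word", binary = True)
docarray = vectorizer.fit_transform(label).toarray()

# print out the coordinate system
# NOTE: sklearn filters out single character words -- it drops 'a'
print("Coordinates:")
# scikit-learn >= 1.0; older versions use get_feature_names()
coords = vectorizer.get_feature_names_out().tolist()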
print(coords)

print("\nfakenews:")
term = pd.DataFrame(data=docarray,index=title,columns=coords)
print(term)

# print pairwise distances between documents
distances = euclidean_distances(term)
distances_df = pd.DataFrame(data=distances, index=title, columns=title)
print("\nPairwise Distances:")
print(distances_df)

Coordinates:
['fake', 'real']

fakenews:
                                                    fake  real
title                                                         
You Can Smell Hillary’s Fear                           1     0
Watch The Exact Moment Paul Ryan Committed Poli...     1     0
Kerry to go to Paris in gesture of sympathy            0     1
Bernie supporters on Twitter erupt in anger aga...     1     0
The Battle of New York: Why This Primary Matters       0     1
...                                                  ...   ...
State Department says it can't find emails from...     0     1
The ‘P’ in PBS Should Stand for ‘Plutocratic’ o...     1     0
Anti-Trump Protesters Are Tools of the Oligarch...     1     0
In Ethiopia, Obama seeks progress on peace, sec...     0     1
Jeb Bush Is Suddenly Attacking Trump. Here's Wh...     0     1

[6335 rows x 2 columns]

Pairwise Distances:
title                                               You Can Smell Hillary’s Fear  ...  Jeb Bus

In [None]:
print("******** docarray **********")

# build the stemmer object
stemmer = PorterStemmer()

# build a new default analyzer using CountVectorizer that only uses words: [a-zA-Z]+
# also eliminate stop words
analyzer= CountVectorizer(analyzer = "word", 
                          stop_words = 'english',
                          token_pattern = "[a-zA-Z]+").build_analyzer()

# build a new analyzer that stems using the default analyzer to create the words to be stemmed
def stemmed_words(doc):
    return [stemmer.stem(w) for w in analyzer(doc)]

# build docarray
vectorizer = CountVectorizer(analyzer=stemmed_words,
                             #analyzer=analyzer,
                             binary=True,
                             min_df=2) # each word has to appear in at least two documents
docarray = vectorizer.fit_transform(fake_news_df['title']).toarray()
docarray.shape
# scikit-learn >= 1.0; older versions use get_feature_names()
doc_df = pd.DataFrame(docarray, columns=list(vectorizer.get_feature_names_out()))
doc_df.head()

******** docarray **********


Unnamed: 0,abandon,abc,abdullah,abedin,abil,aboard,abolish,abort,absente,absolut,abstain,absurd,abus,accept,access,accid,accident,accomplish,accord,account,accus,achiev,acknowledg,acquit,acquitt,acr,act,action,activ,activist,actual,ad,add,adderal,addict,address,adelson,adhd,admin,administr,...,worst,worth,wouldn,wound,wow,wreck,wreckag,write,writer,wrong,wsj,wtf,ww,wwiii,wyom,x,xl,y,yale,ye,year,yemen,yemeni,yesterday,yield,york,yorker,young,youth,youtub,yr,z,zakharova,zealand,zero,zika,zionist,zone,zuckerberg,zuess
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## Decision Tree

In [None]:
print("******** model **********")


# Decision Tree
model = DecisionTreeClassifier()

# grid search
param_grid = {'max_depth': list(range(1,30)), 'criterion':['gini','entropy']}
grid = GridSearchCV(model, param_grid, cv=2, verbose=10, n_jobs=-1)
grid.fit(docarray, label)
print("Grid Search: best parameters: {}".format(grid.best_params_))
tree_print(grid.best_estimator_,doc_df)

******** model **********
Fitting 2 folds for each of 58 candidates, totalling 116 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    2.2s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    5.1s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:    7.8s
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:   14.8s
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:   21.6s
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   35.0s
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:   48.2s
[Parallel(n_jobs=-1)]: Done  57 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  68 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done  81 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done  94 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 109 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 116 out of 116 | elapsed:  2.3min finished


Grid Search: best parameters: {'criterion': 'gini', 'max_depth': 28}
if gop =< 0.5: 
  |then if obama =< 0.5: 
  |  |then if sander =< 0.5: 
  |  |  |then if republican =< 0.5: 
  |  |  |  |then if cruz =< 0.5: 
  |  |  |  |  |then if debat =< 0.5: 
  |  |  |  |  |  |then if hous =< 0.5: 
  |  |  |  |  |  |  |then if jeb =< 0.5: 
  |  |  |  |  |  |  |  |then if iran =< 0.5: 
  |  |  |  |  |  |  |  |  |then if trump =< 0.5: 
  |  |  |  |  |  |  |  |  |  |then if marriag =< 0.5: 
  |  |  |  |  |  |  |  |  |  |  |then if shoot =< 0.5: 
  |  |  |  |  |  |  |  |  |  |  |  |then if comment =< 0.5: 
  |  |  |  |  |  |  |  |  |  |  |  |  |then if rubio =< 0.5: 
  |  |  |  |  |  |  |  |  |  |  |  |  |  |then if suprem =< 0.5: 
  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |then if polar =< 0.5: 
  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |then if primari =< 0.5: 
  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |then if attack =< 0.5: 
  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |th

## Confusion Matrix

In [None]:
print("******** Accuracy **********")

# accuracy of best model with confidence interval
best_model = grid.best_estimator_
predict_y = best_model.predict(docarray)
acc = accuracy_score(label, predict_y)
lb,ub = classification_confint(acc,docarray.shape[0])
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc,lb,ub))

******** Accuracy **********
Accuracy: 0.79 (0.77,0.80)


In [None]:
print("******** confusion matrix **********")

# build the confusion matrix
cats = ['REAL','FAKE']
cm = confusion_matrix(label, predict_y, labels=cats)
cm_df = pd.DataFrame(cm, index=cats, columns=cats)
print("Confusion Matrix:\n{}".format(cm_df))

******** confusion matrix **********
Confusion Matrix:
      REAL  FAKE
REAL  2018  1153
FAKE   209  2955


## Naive Bayes Model

In [None]:
print("******** model **********")


# Naive Bayes
model = MultinomialNB()
# NOTE: MultinomialNB is fit with its default settings (e.g. smoothing alpha=1.0) - no grid search over a parameter space here
model.fit(docarray, label)


print("******** Accuracy **********")

# accuracy of best model with confidence interval
best_model = model
predict_y = best_model.predict(docarray)
acc = accuracy_score(label, predict_y)
lb,ub = classification_confint(acc,docarray.shape[0])
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc,lb,ub))

print("******** confusion matrix **********")

# build the confusion matrix
cats = ['REAL','FAKE']
cm = confusion_matrix(label, predict_y, labels=cats)
cm_df = pd.DataFrame(cm, index=cats, columns=cats)
print("Confusion Matrix:\n{}".format(cm_df))

******** model **********
******** Accuracy **********
Accuracy: 0.90 (0.89,0.90)
******** confusion matrix **********
Confusion Matrix:
      REAL  FAKE
REAL  2876   295
FAKE   355  2809


When "title" is used as the training text instead of "text", the resulting classifiers are less accurate. The decision tree reaches 79% accuracy on titles versus 99% on the full text, and the difference is statistically significant because the confidence intervals do not overlap: (0.77, 0.80) vs (0.99, 0.99). Naive Bayes also drops, from 92% (0.91, 0.93) to 90% (0.89, 0.90).
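
The accuracy figure can be traced back to the confusion matrix. The sketch below hard-codes the matrix printed for the title-based decision tree and derives the overall accuracy and the per-class recall; the variable names are illustrative, and the row/column reading follows scikit-learn's convention for `confusion_matrix`.

```python
# Confusion matrix of the title-based decision tree, copied from the output above.
# Rows are the true labels and columns the predicted labels (sklearn's convention).
cats = ["REAL", "FAKE"]
cm = [[2018, 1153],
      [209, 2955]]

total = sum(sum(row) for row in cm)

# Accuracy: share of articles on the diagonal (true label == predicted label)
accuracy = sum(cm[i][i] for i in range(len(cats))) / total

# Recall per class: correctly predicted share of each true label
recall = {cat: cm[i][i] / sum(cm[i]) for i, cat in enumerate(cats)}

print("Accuracy: {:3.2f}".format(accuracy))
for cat in cats:
    print("Recall {}: {:3.2f}".format(cat, recall[cat]))
```

Most of the errors are REAL articles predicted as FAKE (1153 of them), so the loss in accuracy is concentrated in the REAL class.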