# 605.745 Reasoning Under Uncertainty
## Final Draft of Project Code
## Lindsay Zetlmeisl

### Import required libraries and download NLTK packages

In [1]:
import numpy as np  # needed later for np.append and np.array
import pandas as pd
import nltk
import pickle
import matplotlib.pyplot as plt
from pomegranate import *
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Joseph\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Joseph\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### Read in data from files and combine into one DataFrame

For this project, I used one dataset for fake news and another for real news. The fake news dataset focuses on the period around the 2016 election; although it is a few years old, it was the best fake news dataset I could find. The dataset is called "Getting Real About Fake News" and was found on Kaggle (https://www.kaggle.com/mrisdal/fake-news). The code below reads this data from a CSV file into a DataFrame. Some of the articles in this dataset are not in English, and since this project focuses on language processing, I filtered out everything that was not English. The code also keeps only rows whose `ord_in_thread` value is 0 (the first post in each thread), renames the columns to match the real news dataset, and adds a "fake" label.

In [2]:
fake = pd.read_csv("fake.csv")
fake = fake[fake["language"] == "english"]
fake = fake[fake["ord_in_thread"] == 0]
fake.rename(columns={'published': 'date', 'text': 'content'}, inplace=True)
fake = fake[["date", "title", "content", "author"]]
fake["label"] = "fake"
fake

index,date,title,content,author,label
0,2016-10-26T21:41:00.000+03:00,Muslims BUSTED: They Stole Millions In Gov’t B...,Print They should pay all the back all the mon...,Barracuda Brigade,fake
1,2016-10-29T08:47:11.259+03:00,Re: Why Did Attorney General Loretta Lynch Ple...,Why Did Attorney General Loretta Lynch Plead T...,reasoning with facts,fake
2,2016-10-31T01:41:49.479+02:00,BREAKING: Weiner Cooperating With FBI On Hilla...,Red State : \nFox News Sunday reported this mo...,Barracuda Brigade,fake
3,2016-11-01T05:22:00.000+02:00,PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...,Email Kayla Mueller was a prisoner and torture...,Fed Up,fake
4,2016-11-01T21:56:00.000+02:00,FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal...,Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ...,Fed Up,fake
...,...,...,...,...,...
12908,2016-10-26T21:28:00.000+03:00,Tesla Earnings Smash Expectations After Dramat...,"Oct 26, 2016 4:26 PM 0 SHARES \nThere was a su...",,fake
12909,2016-10-26T23:53:43.161+03:00,Rules For Rulers (Or How The World Really Works),"The following video is a must watch, particula...",Tyler Durden,fake
12910,2016-10-26T23:53:49.879+03:00,Fact Check: Trump Is Right that Clinton Might ...,She explains : \nHillary Clinton wants to star...,George Washington,fake
12911,2016-10-27T00:20:00.111+03:00,Caught On Tape: ISIS Destroys Iraqi Abrams Wit...,"YHC-FTSE Oct 26, 2016 5:14 PM \nWould have bee...",Tyler Durden,fake


The second dataset that I used is the "All the News" dataset (https://www.kaggle.com/snapcrack/all-the-news), which is the source of the real news for this project. It contains articles from many different publications, from which I selected six that are known to be credible (The Atlantic, CNN, New York Times, NPR, Reuters, and the Washington Post). This dataset also covers a much larger date range than the fake news dataset, so I filtered it down to the period around the 2016 election (after September 15, 2016 and before December 25, 2016).

In [3]:
true = pd.read_csv("articles1.csv")
true = pd.concat([true, pd.read_csv("articles2.csv"), pd.read_csv("articles3.csv")], ignore_index=True)

In [4]:
true["date"] = pd.to_datetime(true["date"])

#filter on credible news sources and dates
true = true[true["publication"].isin(["Atlantic", "CNN", "New York Times", "NPR", "Reuters", "Washington Post"])]
true = true[(true['date'] > '2016-09-15') & (true['date'] < '2016-12-25')]

true = true[["date", "title", "content", "author"]]
true["label"] = "real"
true

index,date,title,content,author,label
2547,2016-12-21,U.S. Plans to Step Up Military Campaign Agains...,"ABU DHABI, United Arab Emirates — The Obama...",Michael S. Schmidt and Eric Schmitt,real
2551,2016-12-15,272 Slaves Were Sold to Save Georgetown. What ...,WASHINGTON — The human cargo was loaded on ...,Rachel L. Swarns,real
2561,2016-10-14,"Among Travelers and Commuters, the Homeless St...",Wilson Silva said he knew the homeless situati...,Corey Kilgannon,real
2575,2016-10-19,Bus Bombing in Jerusalem Wounds 21 - The New Y...,JERUSALEM — A bomb exploded on a bus in Jer...,Isabel Kershner,real
2581,2016-10-20,Syria Cease-Fire Crumbles as Bombings Kill Doz...,"BEIRUT, Lebanon — For 38 straight days, the...",Anne Barnard,real
...,...,...,...,...,...
142469,2016-12-23,This man is on a mission to fix the way we sleep,James Proud is a man on a mission to fix ...,Hayley Tsukayama,real
142470,2016-12-24,8 overnight casseroles and easy scones for you...,Let’s be honest: Chances are a cold bowl of ce...,Becky Krystal,real
142471,2016-12-19,They aren’t just for backpackers. ‘Poshtels’ b...,I’ve been in my room at the Hollander in...,Kate Silver,real
142472,2016-12-20,My answer to the holiday sugar glut: Pomegrana...,Candy and sugar have once again invaded our w...,Casey Seidenberg,real


### Iterate through rows and perform text pre-processing

Now that the data is read in and stored in a DataFrame, the text needs to be pre-processed to prepare it for use in a model. The code below completes the following steps:
- Append the title of the article to the body of the article (so that both are included in the processed data) and then word-tokenize the text (i.e. split it on spaces and punctuation).
- Convert all of the tokens to lowercase and remove any remaining punctuation and any tokens that are numeric. 
- Remove stop words from the tokens (using the NLTK English stop word list)
- Perform part of speech tagging (because part of speech is needed for lemmatization)
- Reduce words to their root form using lemmatization

In [None]:
# combine real and fake news into one DataFrame
news_df = pd.concat([true, fake])
news_df

In [None]:
processed_text = []
labels = []
stop_words = set(stopwords.words('english')) #build the stop word set once instead of once per token
for i, row in news_df.iterrows():
    tokens = nltk.word_tokenize(str(row["title"]) + " " + str(row["content"])) #split text into tokens
    tokens = [word.lower() for word in tokens if word.isalpha()] #convert to lowercase, remove punctuation and numbers
    filtered_tokens = [word for word in tokens if word not in stop_words] #remove stop words
    #perform part of speech tagging and lemmatization
    #note: Penn Treebank adjective tags start with 'J', so adjectives are not matched below and are left unlemmatized
    lemmatized_words = []
    for word, tag in nltk.pos_tag(filtered_tokens):
        pos = tag[0].lower()
        pos = pos if pos in ['a', 'r', 'n', 'v'] else None
        if not pos:
            lemma = word
        else:
            lemma = lemmatizer.lemmatize(word, pos)
        lemmatized_words.append(lemma)
    processed_text.append(lemmatized_words)
    labels.append(row["label"])

This code saves the pre-processed text and the corresponding labels (real/fake) to a pickle file so that the task does not need to be repeated in the future. I did this simply because the text pre-processing is a bit time consuming and it's simpler to save the text in its processed state.

In [None]:
text_dict = {"processed_text": processed_text, "labels": labels}
with open('processed_text.pickle', 'wb') as file:
    pickle.dump(text_dict, file)

### Convert documents to vectors using bag-of-words method

This step involves converting the processed text into numerical vectors using the bag-of-words method. 

The first block of code loads the pre-processed text and the labels from the pickle file, so that the text does not need to be re-processed.

In [5]:
# load processed text from pickle file so that pre-processing doesn't need to be redone
with open("processed_text.pickle", "rb") as file:
    contents = pickle.load(file)
    processed_text = contents["processed_text"]
    labels = contents["labels"]

The next two code blocks join the tokens of each document back into a single string and then run the scikit-learn CountVectorizer on the data. This creates a matrix with one row per document, where each entry is the occurrence count of a word in that document. This code also converts the labels "real" and "fake" to the numerical values 0 and 1: 1 corresponds to "fake" and 0 corresponds to "real".

In [6]:
for i, document in enumerate(processed_text):
    processed_text[i] = " ".join(document)

In [7]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(processed_text)
y = [1 if l == "fake" else 0 for l in labels ]

In [8]:
print(len(vectorizer.get_feature_names()))

115377


The code block above prints the number of unique features in the data, which is the same as the length of each document vector. This is a very large number, and a Bayesian network cannot effectively handle this many features. For that reason, the next step involves selecting the best features.
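
To illustrate what the bag-of-words vectors built in this section look like, here is a small standard-library sketch on two made-up documents. The function name `bag_of_words` and the toy documents are assumptions for illustration only. It is a simplified stand-in for CountVectorizer, which additionally applies its own tokenization rules and returns a sparse matrix.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, count vectors) for whitespace-separated documents."""
    # One column per distinct word, in alphabetical order
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    vectors = []
    for doc in documents:
        counts = Counter(doc.split())
        # Each document becomes a vector of word counts over the vocabulary
        vectors.append([counts.get(word, 0) for word in vocabulary])
    return vocabulary, vectors


# Toy documents (already pre-processed into space-separated tokens)
docs = ["trump say trump", "company say percent"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['company', 'percent', 'say', 'trump']
print(vectors)     # [[0, 0, 1, 2], [1, 1, 1, 0]]
```

Every document vector has the same length as the vocabulary, which is why the vector length grows to the full feature count reported above.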

### Find best features using chi<sup>2</sup> score

The code below instantiates a feature selector that uses the chi<sup>2</sup> score to find the 10 best features in the data. Each of those features (words) is then printed out along with its chi<sup>2</sup> score and p-value. The purpose of this code is just to test the selector and to see what some of the top words are. Later code will involve iteratively increasing the number of features to keep and then fitting a model on the data with only those features.

The chi<sup>2</sup> score is higher for features whose counts depend more strongly on the class label, i.e. words that occur far more often in one class than would be expected if the word and the label were independent. This means that the top features found using the chi<sup>2</sup> score will be the words that most clearly divide the documents into "real" and "fake" news categories.

In [9]:
pd.set_option('display.max_rows', 500)

# Select features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=10)
chi2_selector.fit(X, y)

# Look at scores returned from the selector for each feature
chi2_scores = pd.DataFrame(list(zip(vectorizer.get_feature_names(), chi2_selector.scores_, chi2_selector.pvalues_)), columns=['word', 'score', 'pval'])

chi2_scores.nlargest(10, "score")

index,word,score,pval
89148,say,41219.58951,0.0
104288,trump,14388.194406,0.0
113408,year,5097.927263,0.0
44880,hillary,4853.44588,0.0
76225,percent,4361.260516,0.0
110708,wednesday,3715.079188,0.0
19959,company,3536.620276,0.0
18408,city,3294.16934,0.0
113733,york,3261.625956,0.0
69577,new,3254.233992,0.0


These top 10 words are interesting because they are words that we would expect to see in news articles, but they are not necessarily expected to clearly distinguish between real and fake news. The top two words, at least, appear to be a result of this particular dataset. Looking back at the portion of the DataFrame that was output after the data was read in, we can see that the last few entries repeat the phrase "21st Century Wire says". If there are a lot of articles from this one source, and they frequently start their articles with that phrase, that would explain why the word "say" is considered to be the best word for distinguishing between real and fake news.
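
As a rough illustration of how a chi<sup>2</sup> score like the ones in this section can be computed for a single word, the sketch below compares the observed total count of a word in each class with the count expected if the word were spread across classes in proportion to the number of documents. The function name `chi2_score` and the toy counts are assumptions. It is a simplified stand-in for, not a copy of, the scikit-learn `chi2` function used above.

```python
def chi2_score(counts, labels):
    """Chi-squared statistic of one word-count feature against class labels.

    Assumes the word occurs at least once, so the expected counts are non-zero.
    """
    total = sum(counts)
    n_docs = len(labels)
    score = 0.0
    for cls in set(labels):
        # Observed: total occurrences of the word in documents of this class
        observed = sum(c for c, l in zip(counts, labels) if l == cls)
        # Expected: share of documents in this class times total occurrences
        class_prob = sum(1 for l in labels if l == cls) / n_docs
        expected = class_prob * total
        score += (observed - expected) ** 2 / expected
    return score


labels = [1, 1, 0, 0]                        # two "fake" and two "real" documents
skewed = chi2_score([3, 2, 0, 1], labels)    # word mostly in "fake" documents
even = chi2_score([1, 1, 1, 1], labels)      # word spread evenly
print(skewed, even)
```

The skewed word scores 8/3 (observed counts of 5 and 1 against an expected 3 and 3), while the evenly spread word scores 0, which is why such a word would never be selected.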

### Fit and evaluate Bayesian network on different numbers of features

Now that we've established a method to select the top k features, we'll run several experiments where we fit a Bayesian network on varying numbers of features and evaluate the resulting model by making predictions. This fits models on 10, 50, 100, 250, 500, 750, and 1000 features. For each of those sets of features, the following steps are completed:

- Initialize a k-best feature selector for the current number of features, and transform the data using that selector
- Append the class labels to the end of the feature vector for each document, since the Bayes net requires the label to be one of the nodes in the network in order to use inference to predict it
- Split the data into 80% training data and 20% test data
- Fit the Bayesian network on the training data, using the pomegranate library
- Remove the labels from the feature vectors of the test data and then use the model to predict the label of those documents
- Compare the predicted value to the true labels (for the test set) and tally up the number of true positives (i.e. label is 1 and prediction is 1), true negatives (i.e. label is 0 and prediction is 0), false positives (i.e. label is 0 and prediction is 1), and false negatives (i.e. label is 1 and prediction is 0).
- Use those tallies to compute accuracy, precision, recall, and the F1 score
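
The evaluation steps at the end of the list above can be sketched on their own as a small helper. The function name `evaluate` and the toy inputs are hypothetical. Unlike the experiment code, this sketch returns 0 instead of raising an error when a denominator is zero, which is a design choice made here for illustration.

```python
def evaluate(predictions, true_values):
    """Tally the confusion-matrix cells for binary labels (1 = fake, 0 = real)
    and derive accuracy, precision, recall, and F1 from them."""
    tp = tn = fp = fn = 0
    for prediction, true_value in zip(predictions, true_values):
        if prediction == 1 and true_value == 1:
            tp += 1
        elif prediction == 0 and true_value == 0:
            tn += 1
        elif prediction == 1 and true_value == 0:
            fp += 1
        else:
            fn += 1

    accuracy = (tp + tn) / len(true_values)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example: one true positive, one true negative, one false positive, one false negative
metrics = evaluate([1, 0, 1, 0], [1, 0, 0, 1])
print(metrics)
```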

In [14]:
num_features_list = [10, 50, 100, 250, 500, 750, 1000]
for num_features in num_features_list:
    chi2_selector = SelectKBest(chi2, k=num_features) # initialize feature selector
    X_new = chi2_selector.fit_transform(X, y).toarray() # fit selector and transform the data
    
    # combine features and labels into one matrix for Bayes net training
    y_new = [[label] for label in y]
    X_and_y = np.append(X_new, y_new, axis=1)
    
    # perform 80/20 split for training and test data
    training_data, test_data = train_test_split(X_and_y, test_size=0.2, random_state=42)
    
    # fit the model
    model = BayesianNetwork.from_samples(training_data, algorithm="chow-liu", n_jobs=3)
    
    # remove labels from test data and make predictions
    test_data_no_labels = []
    for entry in test_data:
        test_data_no_labels.append(list(entry))
        test_data_no_labels[-1][-1] = None
    test_data_no_labels = np.array(test_data_no_labels)
    predictions = model.predict(test_data_no_labels, check_input=False)
    
    # evaluate predictions by computing accuracy, precision, recall, and f1 score
    true_positives = 0
    true_negatives = 0
    false_positives = 0
    false_negatives = 0
    for prediction, true_value in zip(predictions, test_data):
        if prediction[-1] == 1 and true_value[-1] == 1:
            true_positives += 1
        elif prediction[-1] == 0 and true_value[-1] == 0:
            true_negatives += 1
        elif prediction[-1] == 1 and true_value[-1] == 0:
            false_positives += 1
        else:
            false_negatives += 1

    accuracy = (true_positives + true_negatives) / len(test_data)
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * ((precision * recall) / (precision + recall))

    print("Bayesian network with " + str(num_features) + " features")
    print("Accuracy: ", accuracy)
    print("Precision: ", precision)
    print("Recall: ", recall)
    print("F1 Score: ", f1)
    print()

Bayesian network with 10 features
Accuracy:  0.7432403661911858
Precision:  0.7180871577323563
Recall:  0.7967479674796748
F1 Score:  0.7553752535496957

Bayesian network with 50 features
Accuracy:  0.7466467958271237
Precision:  0.7349446947972142
Recall:  0.7676508344030809
F1 Score:  0.7509418166596903

Bayesian network with 100 features
Accuracy:  0.7577176921439217
Precision:  0.7474205530334296
Recall:  0.7749251176722294
F1 Score:  0.7609243697478991

Bayesian network with 250 features
Accuracy:  0.7594209069618906
Precision:  0.7492771581990912
Recall:  0.7762088147197261
F1 Score:  0.762505254308533

Bayesian network with 500 features
Accuracy:  0.7651692569725357
Precision:  0.7514262428687857
Recall:  0.789045785194694
F1 Score:  0.7697766645794197

Bayesian network with 750 features
Accuracy:  0.7645305514157973
Precision:  0.748887990295188
Recall:  0.7924689773213521
F1 Score:  0.7700623700623701

Bayesian network with 1000 features
Accuracy:  0.7632531403023206
Precision

### Fit and evaluate naive Bayes classifier on the same numbers of features

Finally, we run the same experiments using a naive Bayes classifier to compare its performance to that of the Bayesian network. I specifically used the multinomial naive Bayes algorithm because it is known to do well on text classification tasks.
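
For readers unfamiliar with the multinomial model, the sketch below shows the idea on toy word counts: each class gets a smoothed probability for every word, and a document is assigned to the class with the highest log prior plus count-weighted log word probabilities. The function names, the toy data, and the smoothing value `alpha=1.0` are assumptions for illustration. The experiments themselves use scikit-learn's `MultinomialNB`, not this code.

```python
import math


def train_multinomial_nb(X, y, alpha=1.0):
    """Fit class priors and smoothed per-class word probabilities (in log space)."""
    n_features = len(X[0])
    classes = sorted(set(y))
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each word over all documents of this class
        totals = [sum(row[j] for row in rows) for j in range(n_features)]
        denom = sum(totals) + alpha * n_features  # additive (Laplace) smoothing
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in totals]
    return classes, log_prior, log_likelihood


def predict(model, x):
    """Return the class with the highest log posterior score for count vector x."""
    classes, log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(n * ll for n, ll in zip(x, log_likelihood[c]))
        for c in classes
    }
    return max(scores, key=scores.get)


# Toy data: two word-count features, label 1 = fake, 0 = real
X_toy = [[3, 0], [2, 1], [0, 3], [1, 2]]
y_toy = [1, 1, 0, 0]
model = train_multinomial_nb(X_toy, y_toy)
print(predict(model, [2, 0]), predict(model, [0, 2]))
```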

In [10]:
num_features_list = [10, 50, 100, 250, 500, 750, 1000]
for num_features in num_features_list:
    chi2_selector = SelectKBest(chi2, k=num_features) # initialize feature selector
    X_new = chi2_selector.fit_transform(X, y).toarray() # fit selector and transform the data
    
    # perform 80/20 split for training and test data
    X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2, random_state=42)
    
    # fit the model
    nb_model = MultinomialNB().fit(X_train, y_train)
    
    # make predictions on test set
    predictions = nb_model.predict(X_test)
    
    # evaluate predictions by computing accuracy, precision, recall, and f1 score
    true_positives = 0
    true_negatives = 0
    false_positives = 0
    false_negatives = 0
    for prediction, true_value in zip(predictions, y_test):
        if prediction == 1 and true_value == 1:
            true_positives += 1
        elif prediction == 0 and true_value == 0:
            true_negatives += 1
        elif prediction == 1 and true_value == 0:
            false_positives += 1
        else:
            false_negatives += 1

    accuracy = (true_positives + true_negatives) / len(y_test)
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * ((precision * recall) / (precision + recall))

    print("Naive Bayes with " + str(num_features) + " features")
    print("Accuracy: ", accuracy)
    print("Precision: ", precision)
    print("Recall: ", recall)
    print("F1 Score: ", f1)
    print()

Naive Bayes with 10 features
Accuracy:  0.6940600383223334
Precision:  0.7432432432432432
Recall:  0.5883611467693625
F1 Score:  0.6567948411750656

Naive Bayes with 50 features
Accuracy:  0.76218863104109
Precision:  0.8028798411122146
Recall:  0.6919127086007703
F1 Score:  0.7432774074925305

Naive Bayes with 100 features
Accuracy:  0.7879497551628699
Precision:  0.8454404945904173
Recall:  0.7021822849807445
F1 Score:  0.767180925666199

Naive Bayes with 250 features
Accuracy:  0.7960400255482223
Precision:  0.8604286461055933
Recall:  0.7043217800599059
F1 Score:  0.7745882352941177

Naive Bayes with 500 features
Accuracy:  0.8162657015116032
Precision:  0.8699799196787149
Recall:  0.7415489944373128
F1 Score:  0.8006468006468008

Naive Bayes with 750 features
Accuracy:  0.8290398126463701
Precision:  0.8763493621197253
Recall:  0.7642276422764228
F1 Score:  0.8164571428571429

Naive Bayes with 1000 features
Accuracy:  0.8358526719182456
Precision:  0.8782608695652174
Recall:  0.77

### Visualize Bayesian network with 10 features

In [11]:
num_features = 10
chi2_selector = SelectKBest(chi2, k=num_features) # initialize feature selector
X_new = chi2_selector.fit_transform(X, y).toarray() # fit selector and transform the data

# get the names of the selected features in the same column order as X_new
# (SelectKBest keeps the original column order, not the score ranking)
feature_names = list(np.array(vectorizer.get_feature_names())[chi2_selector.get_support()])
feature_names.append("*label*")
    
# combine features and labels into one matrix for Bayes net training
y_new = [[label] for label in y]
X_and_y = np.append(X_new, y_new, axis=1)

# perform 80/20 split for training and test data
training_data, test_data = train_test_split(X_and_y, test_size=0.2, random_state=42)
    
# fit the model
model = BayesianNetwork.from_samples(training_data, algorithm="chow-liu", root=0, n_jobs=3, state_names=feature_names)
model.freeze()

In [13]:
plt.figure(figsize=(14, 10))
model.plot(filename="Bayes_net_10_features.pdf")
plt.show()

<Figure size 1008x720 with 0 Axes>