# NLP Final Project
## Marissa Beaty

For my NLP Final Project, I will be using a "True" and "Fake" news dataset to topic model and train an LSTM classifier. The goal of this project is to showcase my understanding of topic modelling, but also show an early introduction to what can be done with this type of data in terms of classification. For the purpose of this project, the topic modeling and classifying will be done using the news article titles. 

The results of this training (i.e. accuracy, top words, etc.) will be discussed under the results section of my paper. 

In [1]:
#importing necessary packages, taken from NLP Lecture 5.1.2
import numpy as np  
from keras.preprocessing import sequence   
from keras.models import Sequential        
from keras.layers import Dense, Dropout, Activation    
from gensim.models.keyedvectors import KeyedVectors
import nltk
from gensim.models import KeyedVectors
from sklearn.model_selection import train_test_split
import pandas as pd
from random import shuffle
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

Using TensorFlow backend.


## Preparing and Reading the Data into Python

This section takes our two separate datasets, labels them based on their fake and true status, reduces their size, and combines them into one dataframe that is saved and read back into this jupyter notebook. Once the data has been saved, I do a quick check to confirm the distribution of true versus fake articles is near equal.

In [2]:
#read the fake and true news articles into dataframes
full_fake_df = pd.read_csv("Data/Fake.csv")
full_true_df = pd.read_csv("Data/True.csv")

In [3]:
#reduce the size of the full dataframes by taking a random sample
fake_df = full_fake_df.sample(frac=0.05, replace=True, random_state=1)
true_df = full_true_df.sample(frac=0.05, replace=True, random_state=1)

In [4]:
#add a new column in this dataframe called "label" and give all fake articles a label value of 0 and all true a label of 1.
fake_df.loc[:, "label"] = 0
true_df.loc[:, "label"] = 1

In [5]:
#create a new dataframe with all fake and true news articles
news_dataframe = true_df.append(fake_df, ignore_index=True)

  news_dataframe = true_df.append(fake_df, ignore_index=True)


In [6]:
#save that new dataframe as a .csv
file_name = "combined_news_dataset"
news_dataframe.to_csv("data/" + file_name + ".csv", sep = "\t", index = False)

In [7]:
#read the new dataset back into the jupyter notebook
combined_news_dataset = pd.read_csv("Data/" + file_name + ".csv",sep = "\t")

In [8]:
#checking the distribution of fake to true news articles
combined_news_dataset.label.value_counts()

0    1174
1    1071
Name: label, dtype: int64

## Tokenizing the Data

In this section, I define a tokenizer to clean up my data. First, I replace a term to better fit the dataset. Then I remove all special characters, and finally remove all stop words before saving the new set of words in a list. 

In [9]:
#import stop words from the NLTK library: https://www.nltk.org/search.html?q=stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/marissabeaty/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [10]:
#defining a tokenizer to clean up and tokenize my data

def my_tokenizer(text):
    #changing one term so the tokenizer does not remove important information
    article_titles = combined_news_dataset["title"].str.replace("U.S.", "unitedstates", case = False)
    
    #removing all stop characters and adding the regexed titles back into a new list
    stop_char = "[^A-Za-z0-9]+"

    regexed_titles = []
    for title in article_titles:
        regexed_articles = re.sub(stop_char, ' ', str(title).lower()).strip()
        regexed_titles.append(regexed_articles)
        
    stop_words = stopwords.words('english')

    #removing all stop words and putting them into a new list
    go_words = []
    for titles in regexed_titles:
        titles = titles.split()
        for words in titles:
            if words not in stop_words:
                go_words.append(words)
    return go_words

## Topic Modeling with Article Titles

For the topic modelling section of this project, I will be using a Latent Semantic Analysis model. Because I am interested in how topics are similar or different between the True and Fake news articles, I have first split up my dataset again to look at the topics individually and comparatively. This split is only used for the topic modelling analysis. My classifier will be trained on the whole dataset created earlier. 

In [13]:
#separating the data so I can look at the Fake news titles in comparison to the True news titles
fake_news_titles = combined_news_dataset.loc[combined_news_dataset['label'] == 0, 'title']
true_news_titles = combined_news_dataset.loc[combined_news_dataset['label'] == 1, 'title']

In [14]:
#setting up TF-IDF to use my tokenizer as defined above
#the modeling of my LSA model is based on our Lecture 2.2 material
tfidf_vectoriser = TfidfVectorizer(tokenizer=my_tokenizer)

In [15]:
#importing TruncatedSVD to use later
from sklearn.decomposition import TruncatedSVD

### Topic Modelling the Fake News Titles

In [16]:
#applying my tokenizer to the dataset and using it to create a variable holding all tokenized words
#printing out the shape of the dataset and the number of tokenized words

fake_tfidf = tfidf_vectoriser.fit_transform(fake_news_titles)
fake_vocab = tfidf_vectoriser.get_feature_names_out()
fake_tfidf_df = pd.DataFrame(fake_tfidf.todense(), columns = fake_vocab)
print(fake_tfidf.todense().shape)

  article_titles = combined_news_dataset["title"].str.replace("U.S.", "unitedstates", case = False)


(1174, 5996)


In [17]:
fake_tfidf_df = fake_tfidf_df - fake_tfidf_df.mean()

In [18]:
#setting up variables and number of topics for use later

fake_num_topics = 5
pd.options.display.max_columns=fake_num_topics
fake_labels = ['topic{}'.format(i) for i in range(fake_num_topics)]

In [19]:
#applying TruncatedSVD to generate a matrix from the data

fake_svd = TruncatedSVD(n_components = fake_num_topics, n_iter = 100) 
fake_svd_topic_vectors = fake_svd.fit_transform(fake_tfidf_df.values)

  self.explained_variance_ratio_ = exp_var / full_var
  self.explained_variance_ratio_ = exp_var / full_var


In [20]:
#defining the topic_weights variable and confirming the topic weights is set up properly by testing it on a sample of the data.

fake_topic_weights = pd.DataFrame(fake_svd.components_.T, index=fake_vocab, columns=fake_labels)
fake_topic_weights.sample(10)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4
tounitedstatespromises,-0.001199,-1.9e-05,0.00076,0.000306,0.000567
sh,-0.006167,0.000932,0.003472,0.01621,-0.016598
given,-0.002398,-3.9e-05,0.001521,0.000611,0.001135
meryl,-0.001199,-1.9e-05,0.00076,0.000306,0.000567
ite,-0.001199,-1.9e-05,0.00076,0.000306,0.000567
stores,-0.001199,-1.9e-05,0.00076,0.000306,0.000567
apologizes,-0.002398,-3.9e-05,0.001521,0.000611,0.001135
jaw,-0.001199,-1.9e-05,0.00076,0.000306,0.000567
spirit,-0.002398,-3.9e-05,0.001521,0.000611,0.001135
baiting,-0.001199,-1.9e-05,0.00076,0.000306,0.000567


In [21]:
#pull out 20 terms from the top 5 topics as defined earlier

num_terms = 20
for i in range(fake_num_topics):
    print("___topic " + str(i) + "___")
    fake_topicName = "topic" + str(i)
    fake_weightedlist = fake_topic_weights.get(fake_topicName).sort_values()[-num_terms:]
    print(fake_weightedlist.index.values)

___topic 0___
['debate' 'voters' 'gop' 'house' 'tweets' 'north' 'eu' 'chief' 'deal'
 'bill' 'clinton' 'media' 'vote' 'black' 'president' 'says' 'watch'
 'white' 'obama' 'unitedstates']
___topic 1___
['lives' 'american' 'general' 'committee' 'wife' 'korean' 'media' 'obama'
 'donald' 'foreign' 'two' 'korea' 'police' 'anti' 'hillary' 'president'
 'house' 'new' 'watch' '000']
___topic 2___
['world' 'race' 'wall' 'rights' 'help' 'stop' 'gets' 'donald' 'clinton'
 'people' 'syria' 'muslim' 'campaign' 'war' 'speech' 'news' 'republican'
 'new' 'obama' 'court']
___topic 3___
['report' 'fight' 'south' 'woman' 'man' 'poll' 'tells' 'americans'
 'minister' 'sanctions' 'putin' 'tweets' 'north' 'white' 'obama' 'donald'
 'republican' 'news' 'video' 'unitedstates']
___topic 4___
['healthcare' 'claims' 'saudi' 'department' 'killed' 'pm' 'week' 'left'
 'brexit' 'probe' 'gop' 'unitedstates' 'china' '000' 'trump' 'video'
 'bill' 'clinton' '111' 'court']


### Topic Modelling the True News Articles

In [22]:
#applying my tokenizer to the dataset and using it to create a variable holding all tokenized words
#printing out the shape of the dataset and the number of tokenized words

true_tfidf = tfidf_vectoriser.fit_transform(true_news_titles)
true_vocab = tfidf_vectoriser.get_feature_names_out()
true_tfidf_df = pd.DataFrame(true_tfidf.todense(), columns = true_vocab)
print(true_tfidf.todense().shape)

  article_titles = combined_news_dataset["title"].str.replace("U.S.", "unitedstates", case = False)


(1071, 5996)


In [23]:
true_tfidf_df = true_tfidf_df - true_tfidf_df.mean()

In [24]:
#setting up variables and number of topics for use later

true_num_topics = 5
pd.options.display.max_columns=true_num_topics
true_labels = ['topic{}'.format(i) for i in range(true_num_topics)]

In [25]:
#applying TruncatedSVD to generate a matrix from the data

true_svd = TruncatedSVD(n_components = true_num_topics, n_iter = 100) 
true_svd_topic_vectors = true_svd.fit_transform(true_tfidf_df.values)

  self.explained_variance_ratio_ = exp_var / full_var
  self.explained_variance_ratio_ = exp_var / full_var


In [26]:
#defining the topic_weights variable and confirming the topic weights is set up properly by testing it on a sample of the data.

true_topic_weights = pd.DataFrame(true_svd.components_.T, index=true_vocab, columns=true_labels)
true_topic_weights.sample(10)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4
stresses,-0.002484,9.9e-05,1.7e-05,1.7e-05,2.6e-05
voted,-0.003109,4e-06,8e-06,3.6e-05,-1.8e-05
investigators,-0.002484,9.9e-05,1.7e-05,1.7e-05,2.6e-05
unitedstatesretired,-0.001242,4.9e-05,8e-06,8e-06,1.3e-05
bewilder,-0.001242,4.9e-05,8e-06,8e-06,1.3e-05
changing,-0.002484,9.9e-05,1.7e-05,1.7e-05,2.6e-05
eviscerates,-0.001242,4.9e-05,8e-06,8e-06,1.3e-05
tried,-0.004968,0.000197,3.4e-05,3.3e-05,5.3e-05
pepper,-0.001242,4.9e-05,8e-06,8e-06,1.3e-05
fees,-0.001242,4.9e-05,8e-06,8e-06,1.3e-05


In [27]:
#pull out 20 terms from the top 5 topics as defined earlier

num_terms = 20
for i in range(true_num_topics):
    print("___topic " + str(i) + "___")
    true_topicName = "topic" + str(i)
    true_weightedlist = true_topic_weights.get(true_topicName).sort_values()[-num_terms:]
    print(true_weightedlist.index.values)

___topic 0___
['factbox' 'runitedstatesan' 'bill' 'gop' 'house' 'deal' 'eu' 'chief'
 'tweets' 'clinton' 'north' 'vote' 'black' 'media' 'says' 'president'
 'obama' 'white' 'watch' 'unitedstates']
___topic 1___
['says' 'gets' 'attack' 'runitedstatesan' 'factbox' 'obama' 'new'
 'hillary' 'media' '05' 'court' '104' '100k' '11' 'video' 'news'
 'republican' '108' '1' '000']
___topic 2___
['latest' 'want' 'students' 'action' 'general' 'islamic' 'korean' 'order'
 'years' 'way' 'gets' 'two' 'foreign' 'hillary' 'unitedstates' 'video'
 'trump' '05' '100k' '1']
___topic 3___
['campaign' 'speech' 'war' 'syria' 'people' 'new' 'donald' 'court'
 'hillary' 'republican' 'news' '02' 'video' '04' '05' '100k' '1' '106'
 '10' '10th']
___topic 4___
['war' 'speech' 'campaign' 'tweets' 'north' 'clinton' 'vote' 'black'
 'gets' 'obama' 'says' 'watch' 'white' 'trump' '04' '000' '02' '05' '100'
 '100k']


## Training a Classifier based on Article Titles

I have elected to use an LSTM classifier on my data due to the ease and speed of training. Due to time constraints, I have used the classifying method as laid out in NLP Lecture 5.2.

In [28]:
#looking at my dataset
combined_news_dataset.head()

Unnamed: 0,title,text,subject,date,label
0,U.S. House committee 'may reconsider' WHO canc...,LONDON (Reuters) - U.S. congressional committe...,politicsNews,"December 8, 2017",1
1,'Congratulations': EU moves to Brexit phase tw...,BRUSSELS (Reuters) - The European Union agreed...,worldnews,"December 15, 2017",1
2,White House aides told to preserve materials i...,WASHINGTON (Reuters) - The White House counsel...,politicsNews,"March 2, 2017",1
3,U.S. 'very concerned' by violence around Iraq'...,WASHINGTON (Reuters) - The U.S. State Departme...,worldnews,"October 16, 2017",1
4,Obama's move on gender pay gap seen as first s...,NEW YORK (Reuters) - Advocates fighting to clo...,politicsNews,"February 5, 2016",1


In [29]:
#dropping unnecessary aspects of our dataset for training the model
classifying_dataset = combined_news_dataset.drop(["text", "subject", "date"], axis=1)
classifying_dataset.head()

Unnamed: 0,title,label
0,U.S. House committee 'may reconsider' WHO canc...,1
1,'Congratulations': EU moves to Brexit phase tw...,1
2,White House aides told to preserve materials i...,1
3,U.S. 'very concerned' by violence around Iraq'...,1
4,Obama's move on gender pay gap seen as first s...,1


In [41]:
#Modelling my training model after NLP Lecture 5.1.2
from nltk.tokenize.casual import casual_tokenize

In [37]:
#uploading the GoogleNews Vectors to assist with vectorizing my data

embeddings_file = "GoogleNews-vectors-negative300.bin"
wv = KeyedVectors.load_word2vec_format(embeddings_file, binary=True, limit=200000)

In [38]:
#defining a tokenizer and vectorizer

word_vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, limit=200000)
def tokenize_and_vectorize(dataset):
    vectorized_data = []
    for sample in dataset:
        tokens = casual_tokenize(sample)
        sample_vecs = []
        for token in tokens:
            try:
                sample_vecs.append(word_vectors[token])
            except KeyError:
                pass  # No matching token in the Google w2v vocab
        vectorized_data.append(sample_vecs)

    return vectorized_data

In [39]:
#defining a function to make all vectors the same length and shape

def pad_trunc(data, maxlen):
    new_data = []

    zero_vector = []
    for _ in range(len(data[0][0])):
        zero_vector.append(0.0)

    for sample in data:
 
        if len(sample) > maxlen:
            temp = sample[:maxlen]
        elif len(sample) < maxlen:
            temp = sample
            additional_elems = maxlen - len(sample)
            for _ in range(additional_elems):
                temp.append(zero_vector)
        else:
            temp = sample
        new_data.append(temp)
    return new_data

In [44]:
#titles_dataset = classifying_dataset.sample(frac = 1) 
features = tokenize_and_vectorize(classifying_dataset["title"])
x_train, x_test, y_train, y_test = train_test_split(features, classifying_dataset["label"], test_size=0.3, random_state=0)

In [45]:
maxlen = 50
embedding_dims = 300 

In [46]:
#getting the shape of my training and test sets
np.array(x_train).shape,np.array(x_test).shape

  np.array(x_train).shape,np.array(x_test).shape


((1571,), (674,))

In [47]:
#applying the pad function to the training and test sets
x_train = pad_trunc(x_train, maxlen)
x_test = pad_trunc(x_test, maxlen)
x_train = np.reshape(x_train, (len(x_train), maxlen, embedding_dims))
y_train = np.array(y_train)
x_test = np.reshape(x_test, (len(x_test), maxlen, embedding_dims))
y_test = np.array(y_test)

In [48]:
#displaying the new shape of the training and test sets
np.array(x_train).shape,np.array(x_test).shape

((1571, 50, 300), (674, 50, 300))

In [50]:
batch_size = 32       
num_neurons = 10     
epochs = 3   

In [56]:
#setting up the model to be trained

from tensorflow.python.keras import Sequential
from tensorflow.python.keras.layers import Dense, Dropout, Flatten, SimpleRNN, LSTM

print('Build model...')
model = Sequential()

model.add(LSTM(num_neurons, return_sequences=False, input_shape=(maxlen, embedding_dims)))
model.add(Dropout(.2))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile('rmsprop', 'binary_crossentropy',  metrics=['accuracy'])
print(model.summary())

Build model...
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 10)                12440     
_________________________________________________________________
dropout (Dropout)            (None, 10)                0         
_________________________________________________________________
flatten (Flatten)            (None, 10)                0         
_________________________________________________________________
dense (Dense)                (None, 1)                 11        
Total params: 12,451
Trainable params: 12,451
Non-trainable params: 0
_________________________________________________________________
None


In [57]:
#training the LSTM model on my dataset

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test))
model_structure = model.to_json()
with open("simplernn_model2.json", "w") as json_file:
    json_file.write(model_structure)

model.save_weights("simplernn_weights2.h5")
print('Model saved.')

Epoch 1/3
Epoch 2/3
Epoch 3/3
Model saved.
