# HW 10: Natural Language Processing

## Setup

### Module Import

In [1]:
from google.colab import drive
drive.mount('/content/drive')
%cd "/content/drive/MyDrive/CSC 310"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/CSC 310


In [2]:
import pandas as pd
import numpy as np

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score

from nltk.stem import PorterStemmer

In [3]:
# Function for computing confidence intervals from class lecture

def classification_confint(acc, n):
  '''
  Compute the 95% confidence interval for a classification problem.
    acc -- classification accuracy
    n -- number of samples
  Returns a tuple (lb, ub) where lb is the lower bound and ub is the upper bound
  '''
  import math
  interval = 1.96 * math.sqrt((acc * (1 - acc)) / n)
  lb = max(0, acc - interval)
  ub = min(1.0, acc + interval)
  return (lb, ub)
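As a quick sanity check of the formula, the sketch below evaluates the same normal-approximation interval, acc ± 1.96 * sqrt(acc * (1 - acc) / n), for a hypothetical accuracy of 0.90. The helper is re-declared under the name `confint_sketch` (not a name from the lecture) so the block runs on its own, and the inputs are illustrative values rather than results from this notebook.

```python
import math

def confint_sketch(acc, n):
    """95% normal-approximation interval for an accuracy measured on n samples."""
    half_width = 1.96 * math.sqrt(acc * (1 - acc) / n)
    return max(0.0, acc - half_width), min(1.0, acc + half_width)

# Hypothetical inputs, chosen only to illustrate the arithmetic
lb, ub = confint_sketch(0.90, 1000)
print("n = 1000: [{:.3f}, {:.3f}]".format(lb, ub))

# The half-width shrinks like 1/sqrt(n): quadrupling n halves the interval
lb4, ub4 = confint_sketch(0.90, 4000)
print("n = 4000: [{:.3f}, {:.3f}]".format(lb4, ub4))
```

With over 6,000 samples, as in this assignment, the interval is narrow almost regardless of the accuracy value.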

### Data Import

In [4]:
url = "https://raw.githubusercontent.com/lutzhamel/fake-news/master/data/fake_or_real_news.csv"
df = pd.read_csv(url)

In [5]:
print(df.shape)
df.head()

(6335, 4)


Unnamed: 0,id,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


The data source is a real vs. fake news dataset provided in the assignment description. It has a good balance between the two classes, with 3171 samples of real news and 3164 samples of fake news.

In [6]:
df.label.value_counts()

label
REAL    3171
FAKE    3164
Name: count, dtype: int64

## Text Model

### Minimal Preprocessing

In this initial setup, the vectorizer only splits the text into words; no further processing is applied.

In [7]:
# Construct training data set

vectorizer = CountVectorizer(analyzer = "word", binary = True)
docarray = vectorizer.fit_transform(df.text).toarray()

In [8]:
# Dimensions of vector model

print("doc array shape: {}".format(docarray.shape))
print("first 10 features: {}".format(vectorizer.get_feature_names_out()[:10]))

doc array shape: (6335, 67659)
first 10 features: ['00' '000' '0000' '000000031' '00000031' '000035' '00006' '0001' '0001pt'
 '0002']


With minimal processing, the docarray has 67,659 features, meaning the model will have to work in this many dimensions. Looking at the first 10 features, one can see that even in a small sample there are plenty of features that may not necessarily be meaningful.

In [9]:
# Construct and fit Naive Bayes classifier

model = MultinomialNB()
model.fit(docarray, df.label)

In [10]:
# Print accuracy and confidence interval

pred = model.predict(docarray)
accuracy = accuracy_score(df.label, pred)
lb, ub = classification_confint(accuracy, docarray.shape[0])
print("Accuracy: {:.2f}\nConfidence Interval: [{:.2f}, {:.2f}]".format(accuracy, lb, ub))

Accuracy: 0.95
Confidence Interval: [0.94, 0.95]


Even with some potentially 'noisy' features included, the model classifies docs as 'real' or 'fake' with 95% accuracy. The 95% confidence interval is tight, spanning only 94% to 95%. Note, however, that this accuracy is measured on the same documents the model was fit on, so it may overstate how well the model would do on unseen articles.
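The accuracy above comes from calling `model.predict` on the same `docarray` used for fitting. One common way to estimate performance on unseen documents is a held-out split. The following is a minimal sketch of that idea on a made-up eight-document corpus; the toy data, the `holdout_` names, and the 75/25 split are all assumptions, not part of the assignment.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for df.text and df.label (hypothetical data)
toy_text = [
    "shocking secret exposed by insiders",
    "shocking truth they hide",
    "secret plot exposed",
    "insiders reveal shocking plot",
    "senate passes budget bill",
    "president signs budget bill",
    "senate votes on trade bill",
    "president visits trade summit",
]
toy_label = ["FAKE"] * 4 + ["REAL"] * 4

# Hold out 25% of the documents, keeping the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    toy_text, toy_label, test_size=0.25, stratify=toy_label, random_state=0)

# Fit the vocabulary on the training documents only, then reuse it for the test set
holdout_vectorizer = CountVectorizer(analyzer="word", binary=True)
train_array = holdout_vectorizer.fit_transform(X_train)
test_array = holdout_vectorizer.transform(X_test)

holdout_model = MultinomialNB().fit(train_array, y_train)
train_acc = accuracy_score(y_train, holdout_model.predict(train_array))
test_acc = accuracy_score(y_test, holdout_model.predict(test_array))
print("train accuracy: {:.2f}, held-out accuracy: {:.2f}".format(train_acc, test_acc))
```

On the real data, the confidence interval would then be computed from the held-out accuracy and the number of held-out samples.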

### More Preprocessing

This time, the text will go through much more preprocessing. The text will initially be split into words as before, but now standard English stop words will be removed, so that there are fewer features that contribute little meaning to the text. Additionally, only alphabetical tokens will be considered, which drops some of the potentially noisy numeric features seen in the last docarray.

In [11]:
# Construct training data set

# Create stemmer
stemmer = PorterStemmer()

# Create analyzer that keeps only alphabetical tokens and removes English stop words
analyzer = CountVectorizer(analyzer = "word",
                           stop_words = "english",
                           token_pattern = "[a-zA-Z]+").build_analyzer()

Defined here is a function from lecture which takes a document, passes it through the analyzer for pre-filtering, and stems each word down to its base in order to reduce dimensionality while retaining most of the meaning in the words.

In [12]:
# Function for stemming words in a doc

def stemmed_words(doc):
  return [stemmer.stem(w) for w in analyzer(doc)]
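To make the stemming step concrete, this small sketch runs the Porter stemmer on a few illustrative words (chosen by hand, not drawn from the dataset). The name `demo_stemmer` is only used here.

```python
from nltk.stem import PorterStemmer

demo_stemmer = PorterStemmer()

# Inflected forms of the same word collapse onto one stem,
# which is how stemming reduces the number of features
for word in ["running", "runs", "absolutely", "abilities"]:
    print(word, "->", demo_stemmer.stem(word))
```

Stems are not necessarily dictionary words, which is why truncated-looking features show up in the vocabularies of the stemmed models.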

Finally, the filtered and stemmed words are vectorized in binary fashion, with the requirement that a word appear in at least two documents to be included (min_df = 2).

In [13]:
# Create vectorizer that stems the doc and keeps only words found in at least 2 documents
vectorizer = CountVectorizer(analyzer = stemmed_words,
                             binary = True,
                             min_df = 2)

docarray = vectorizer.fit_transform(df.text).toarray()

In [14]:
# Dimensions of vector model

print("doc array shape: {}".format(docarray.shape))
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10]))
print("{:3.2f}%".format(24014 / 67659 * 100))

doc array shape: (6335, 24014)
first 10 coords: ['aa' 'aaa' 'aab' 'aadmi' 'aaib' 'aam' 'aamaq' 'aap' 'aaron' 'aarp']
35.49%


There are still clearly some 'noisy' features, but this time the docarray has 24,014 features, or just 35% of the un-preprocessed vector model.

In [15]:
# Construct and fit Naive Bayes classifier

model = MultinomialNB()
model.fit(docarray, df.label)

In [16]:
# Print accuracy and confidence interval

pred = model.predict(docarray)
accuracy = accuracy_score(df.label, pred)
lb, ub = classification_confint(accuracy, docarray.shape[0])
print("Accuracy: {:.2f}\nConfidence Interval: [{:.2f}, {:.2f}]".format(accuracy, lb, ub))

Accuracy: 0.92
Confidence Interval: [0.91, 0.93]


Interestingly, this model is actually worse than the un-preprocessed model! Its accuracy is 3 percentage points lower, and the confidence intervals of the two models don't overlap at all. More analysis would be needed to figure out exactly why, since several changes were made at once.  
Perhaps there was useful numerical information that should've been kept, or perhaps the stemming removed some important contextual meaning from some features.

## Title Model

### Minimal Preprocessing

For the next two models, the construction and vectorization will be similar to the text-based models, but these models will make title-based predictions instead. The first will again use minimal preprocessing, just splitting the titles into words.

In [17]:
# Construct training data set

vectorizer = CountVectorizer(analyzer = "word", binary = True)
docarray = vectorizer.fit_transform(df.title).toarray()

In [18]:
# Dimensions of vector model

print("doc array shape: {}".format(docarray.shape))
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10]))

doc array shape: (6335, 10071)
first 10 coords: ['00' '000' '00pm' '01' '04' '05' '06' '08' '10' '100']


Using all title words, again we get some 'noisy' looking features, but this time there are only 10,071 features, less than half that of even the lower-dimensional model from above.

In [19]:
# Construct and fit Naive Bayes classifier

model = MultinomialNB()
model.fit(docarray, df.label)

In [20]:
# Print accuracy and confidence interval

pred = model.predict(docarray)
accuracy = accuracy_score(df.label, pred)
lb, ub = classification_confint(accuracy, docarray.shape[0])
print("Accuracy: {:.2f}\nConfidence Interval: [{:.2f}, {:.2f}]".format(accuracy, lb, ub))

Accuracy: 0.94
Confidence Interval: [0.93, 0.94]


Using just the titles and no preprocessing, this model was nearly as successful as the better of the two previous text-based models. It may in fact be just as good, since the rounded confidence intervals of this model and the better text-based model meet at 94%.
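The overlap comparisons used in this notebook can be written as a one-line check. The helper below is a hypothetical convenience function (`intervals_overlap` is not from the lecture), applied to the rounded intervals printed in the cells above. Because those values are rounded to two decimals, intervals that only touch at a shared endpoint may or may not overlap at full precision.

```python
def intervals_overlap(a, b):
    """Return True if closed intervals a = (lb, ub) and b = (lb, ub) share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Rounded intervals as printed earlier in this notebook
text_minimal = (0.94, 0.95)
text_preprocessed = (0.91, 0.93)
title_minimal = (0.93, 0.94)

print(intervals_overlap(text_minimal, title_minimal))      # True: they touch at 0.94
print(intervals_overlap(text_minimal, text_preprocessed))  # False
```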

### More Preprocessing

This title-based model will be preprocessed in the same way as the 'worse' text-based model, with a stricter 'pre-filter' and word stemming.

In [21]:
# Construct training data set

# Pre-filter using only alphabetical words, and dropping stop words
analyzer = CountVectorizer(analyzer = "word",
                           stop_words = "english",
                           token_pattern = "[a-zA-Z]+").build_analyzer()

# Redefining the vectorizer
vectorizer = CountVectorizer(analyzer = stemmed_words,
                             binary = True,
                             min_df = 2)

docarray = vectorizer.fit_transform(df.title).toarray()

In [22]:
# Dimensions of vector model

print("doc array shape: {}".format(docarray.shape))
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10]))

doc array shape: (6335, 3787)
first 10 coords: ['abandon' 'abc' 'abdullah' 'abedin' 'abil' 'aboard' 'abolish' 'abort'
 'absente' 'absolut']


With shorter texts (titles are shorter than the articles they accompany) reduced even further by stop-word removal, stemming, and the document-frequency cutoff, this docarray contains only 3,787 dimensions, barely over a third as many as the other title-based model.

In [23]:
# Construct and fit Naive Bayes classifier

model = MultinomialNB()
model.fit(docarray, df.label)

In [24]:
# Print accuracy and confidence interval

pred = model.predict(docarray)
accuracy = accuracy_score(df.label, pred)
lb, ub = classification_confint(accuracy, docarray.shape[0])
print("Accuracy: {:.2f}\nConfidence Interval: [{:.2f}, {:.2f}]".format(accuracy, lb, ub))

Accuracy: 0.90
Confidence Interval: [0.89, 0.90]


Once again, the pre-processed model actually turns out less accurate than the model based on un-preprocessed data. With a tight confidence interval well below that of the other model, we can be fairly sure that the other model is better. My guess is that there's some valuable numerical information that the model no longer gets to work with in our current preprocessing pipeline.

## Allowing Numeric Features

Again I'll initialize a stemmer and an analyzer, but this time the analyzer will accept numeric characters. The rest of the pipeline is unchanged, and the model will be text-based.

In [25]:
# Construct training data set

# Create stemmer
stemmer = PorterStemmer()

# Create analyzer that keeps alphanumeric tokens and removes English stop words
analyzer = CountVectorizer(analyzer = "word",
                           stop_words = "english",
                           token_pattern = "[a-zA-Z0-9]+").build_analyzer()
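The effect of the two token patterns can be previewed with Python's `re` module, which approximates the tokenization step by finding all matches of the pattern in the lowercased text. The sample sentence is made up for illustration, and stop-word removal (which would later drop "in") is not applied here.

```python
import re

sample = "Raised $3.5 million in 2016"  # illustrative sentence, not from the dataset

# CountVectorizer lowercases by default before applying token_pattern
lowered = sample.lower()

alpha_tokens = re.findall(r"[a-zA-Z]+", lowered)
alnum_tokens = re.findall(r"[a-zA-Z0-9]+", lowered)

print(alpha_tokens)  # ['raised', 'million', 'in']
print(alnum_tokens)  # ['raised', '3', '5', 'million', 'in', '2016']
```

Note that punctuation still splits tokens, so a figure like 3.5 becomes two separate numeric features.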

In [26]:
# Create vectorizer that stems the doc and keeps only words found in at least 2 documents
vectorizer = CountVectorizer(analyzer = stemmed_words,
                             binary = True,
                             min_df = 2)

docarray = vectorizer.fit_transform(df.text).toarray()

In [27]:
# Dimensions of vector model

print("doc array shape: {}".format(docarray.shape))
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10]))

doc array shape: (6335, 25015)
first 10 coords: ['0' '00' '000' '0000' '000ft' '001' '003' '004' '005' '006']


This time our preprocessed vector model has 25,015 features, 1,001 more than the non-numeric vector model.

In [28]:
# Construct and fit Naive Bayes classifier

model = MultinomialNB()
model.fit(docarray, df.label)

In [29]:
# Print accuracy and confidence interval

pred = model.predict(docarray)
accuracy = accuracy_score(df.label, pred)
lb, ub = classification_confint(accuracy, docarray.shape[0])
print("Accuracy: {:.2f}\nConfidence Interval: [{:.2f}, {:.2f}]".format(accuracy, lb, ub))

Accuracy: 0.93
Confidence Interval: [0.92, 0.93]


Accuracy has increased by 1 percentage point over the last preprocessed text-based model, but this is not statistically significant since the confidence intervals overlap.  
Evidently the numerical info accounted for some, but not all, of the difference in accuracy. It's likely that each of the preprocessing techniques contributed in ways that will be difficult to untangle, but this does make me curious whether there is a 'grid-search' sort of method for finding the best way to set up a Naive Bayes model.
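scikit-learn does offer a grid-search utility, `GridSearchCV`, which can search over vectorizer and classifier settings together when they are chained in a `Pipeline`. What follows is only a sketch of how that might look: the eight-document corpus is made up, and the particular parameters and values in `param_grid` are assumptions chosen for illustration, not a recommended configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-ins for df.text and df.label (hypothetical data)
toy_text = [
    "shocking secret exposed by insiders",
    "shocking truth they hide",
    "secret plot exposed",
    "insiders reveal shocking plot",
    "senate passes budget bill",
    "president signs budget bill",
    "senate votes on trade bill",
    "president visits trade summit",
]
toy_label = ["FAKE"] * 4 + ["REAL"] * 4

pipeline = Pipeline([
    ("vect", CountVectorizer(analyzer="word")),
    ("clf", MultinomialNB()),
])

# Each key is <step name>__<parameter>; the values are the options to try
param_grid = {
    "vect__binary": [True, False],
    "vect__stop_words": [None, "english"],
    "clf__alpha": [0.1, 1.0],
}

# cv=2 only because the toy corpus is tiny; more folds are usual on real data
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(toy_text, toy_label)

print(search.best_params_)
print("cross-validated accuracy: {:.2f}".format(search.best_score_))
```

Because every combination is scored by cross-validation, the comparison is based on held-out folds rather than on the training documents, and the stemming analyzer could be added to the grid as another option for the vectorizer's `analyzer` parameter.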