## ML Final Project: First Attempt at an ML Model

Use the n-grams code from the sentiment analysis notebook to make a first pass at an ML model.


First, I did a lot of pre-processing that got the data into a usable form. You can see the code for pre-processing here: https://colab.research.google.com/drive/1IVmYII4bwYTvOjXig5POtmhTEMWgaBC3

In [1]:
# Import data (for the code that produced this data, see notebook: https://colab.research.google.com/drive/1IVmYII4bwYTvOjXig5POtmhTEMWgaBC3)
import gdown
import pandas as pd

# https://drive.google.com/open?id=10Te32ZdtwxPbKdqd5YZw6Ztoa4g7FYDB
gdown.download('https://drive.google.com/uc?authuser=0&id=10Te32ZdtwxPbKdqd5YZw6Ztoa4g7FYDB&export=download',
               'finalprojectdata.csv',
               quiet=False)
df = pd.read_csv('finalprojectdata.csv', header=0, delimiter=',')
df

Downloading...
From: https://drive.google.com/uc?authuser=0&id=10Te32ZdtwxPbKdqd5YZw6Ztoa4g7FYDB&export=download
To: /content/finalprojectdata.csv
100%|██████████| 344k/344k [00:00<00:00, 47.3MB/s]


Unnamed: 0,id,label,text
0,1_0,0,find out more here
1,2_0,0,i had a long battle with anorexia
2,3_0,0,those thoughts telling me that if i just lost...
3,4_0,0,the trouble is that never happened
4,5_0,0,there was never a magic number
...,...,...,...
3364,2131_1,1,the last pro ana diet comes with a twist in at...
3365,2132_1,1,"in this diet, you can hardly eat any carbs bu..."
3366,2133_1,1,"with this diet, you will see a drastic loss i..."
3367,2134_1,1,"well, these were some of the best pro ana diet..."


There are a lot of weird symbols that come from having a list of strings, but they will all be removed when we clean the data. 

Then, convert the text to a bag-of-words representation using scikit-learn's built-in [count vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
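
To make the representation concrete, here is a minimal pure-Python sketch of a binary bag of words. It assumes plain whitespace tokenization, so it will not match `CountVectorizer`'s default tokenizer exactly; the function name `binary_bag_of_words` and the toy sentences are made up for illustration.

```python
def binary_bag_of_words(docs, min_df=1):
    """Return (vocab, matrix) where matrix[i][j] is 1 if word j occurs in doc i."""
    tokenized = [set(doc.lower().split()) for doc in docs]
    # Document frequency: in how many documents does each word appear?
    doc_freq = {}
    for words in tokenized:
        for w in words:
            doc_freq[w] = doc_freq.get(w, 0) + 1
    # Keep only words that appear in at least min_df documents
    vocab = sorted(w for w, c in doc_freq.items() if c >= min_df)
    matrix = [[1 if w in words else 0 for w in vocab] for words in tokenized]
    return vocab, matrix

# Hypothetical toy documents, not taken from the project data
toy_docs = ["count the calories", "recovery takes time", "count the days"]
vocab, matrix = binary_bag_of_words(toy_docs, min_df=2)
print(vocab)
print(matrix)
```

Raising `min_df` shrinks the vocabulary, which is why the real data ends up with only a few hundred columns at `min_df = 20`.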

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array(df['label'])

dfX_train, dfX_test, y_train, y_test = train_test_split(df['text'], y)
print("df_train.shape",dfX_train.shape)
print("y_train.shape",y_train.shape)
print("dfX_test.shape",dfX_test.shape)
print("y_test.shape",y_test.shape)

vectorizer = CountVectorizer(binary=True, min_df = 20) #convert a collection of text documents into a matrix of token counts
vectorizer.fit(dfX_train) #learn a vocabulary dictionary of all tokens in the raw documents

X_train = vectorizer.transform(dfX_train).toarray() #transform to a dense document-term array
X_test = vectorizer.transform(dfX_test).toarray()
print("X_test.shape",X_test.shape)
print("X_train.shape", X_train.shape)


df_train.shape (2526,)
y_train.shape (2526,)
dfX_test.shape (843,)
y_test.shape (843,)
X_test.shape (843, 260)
X_train.shape (2526, 260)


We also split the data into training and test sets right from the start. We'll check to make sure our data is organized properly: the word "calories" should occur more frequently in pro-ana text (label 1) than in pro-recovery text (label 0).

In [3]:
# Looking at a paragraph to make sure things work
reviews_wrapped = dfX_train.str.wrap(80)
calories_index = list(vectorizer.get_feature_names_out()).index('calories') # get_feature_names() was removed in scikit-learn 1.2
print("calories occurs in", X_train[y_train==1, calories_index].mean(), "for Y=1")
print("calories occurs in", X_train[y_train==0, calories_index].mean(), "for Y=0")
print(reviews_wrapped.iloc[1]) # Just in case you want to read a random paragraph

calories occurs in 0.05040957781978576 for Y=1
calories occurs in 0.004259850905218318 for Y=0
 the first wave of pro-ana websites was observed in the 1990s


### Fitting the Parameters of the Model & Making Predictions



In [0]:
def fit_nb_model(X, y):
    X_1 = np.asarray(X[y == 1, :]) # all pro-ana paragraphs (label 1)
    X_0 = np.asarray(X[y == 0, :]) # all pro-recovery paragraphs (label 0)
    return y.mean(), 1 - y.mean(), X_1.mean(axis=0), X_0.mean(axis=0)

def get_nb_predictions(p_y_1, p_y_0, p_x_y_1, p_x_y_0, X):
    """ Predict the labels for the data X given the Naive Bayes model """
    log_odds_ratios = np.zeros(X.shape[0])
    for i in range(X.shape[0]): # loop over data points
        if i%(X.shape[0]/10) == 0: print("progress", i/X.shape[0])
        log_odds_ratios[i] += np.log(p_y_1) - np.log(p_y_0)
        for j in range(X.shape[1]): #loop over words
            if X[i, j] == 1: #if this example includes word j
                log_odds_ratios[i] += np.log(p_x_y_1[j]) - np.log(p_x_y_0[j])
            else: 
                log_odds_ratios[i] += np.log(1 - p_x_y_1[j]) - np.log(1 - p_x_y_0[j])
    return (log_odds_ratios >= 0).astype(float)
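
As a small worked illustration of the decision rule in `get_nb_predictions`, the sketch below computes the log odds for a single document by hand. The helper name `bernoulli_log_odds` and all of the probabilities are made up for illustration; they are not estimates from the project data.

```python
import math

def bernoulli_log_odds(x, p_y_1, p_x_y_1, p_x_y_0):
    """Log odds of Y=1 versus Y=0 for one binary feature vector x."""
    log_odds = math.log(p_y_1) - math.log(1 - p_y_1)  # class prior term
    for x_j, p1, p0 in zip(x, p_x_y_1, p_x_y_0):
        if x_j == 1:  # word j is present in this document
            log_odds += math.log(p1) - math.log(p0)
        else:         # word j is absent from this document
            log_odds += math.log(1 - p1) - math.log(1 - p0)
    return log_odds

# Hypothetical numbers: two words, equal class priors
toy_log_odds = bernoulli_log_odds([1, 0], p_y_1=0.5,
                                  p_x_y_1=[0.8, 0.1], p_x_y_0=[0.2, 0.4])
toy_prediction = 1 if toy_log_odds >= 0 else 0
print(toy_log_odds, toy_prediction)
```

Note that absent words contribute to the score too, which is what distinguishes this (Bernoulli-style) model from a multinomial one.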

In [5]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB(alpha=1) 
model.fit(X_train, y_train)
y_pred = model.predict(X_test[:100,:])
np.mean(y_pred == y_test[:100])

0.85

### Model including Laplace smoothing

In [6]:
def fit_nb_model_smooth(X, y,alpha):
    X_1 = np.asarray(X[y == 1, :])
    X_0 = np.asarray(X[y == 0, :])
    N_1,V_1 = X_1.shape
    N_0,V_0 = X_0.shape #should actually be the same size in our case
    # add alpha to the per-word document counts for both classes
    return y.mean(), 1 - y.mean(), np.divide(X_1.sum(axis=0)+alpha,N_1), np.divide(X_0.sum(axis=0)+alpha,N_0) 


# Code to call and run your new fitting with alpha =1
p_y_1, p_y_0, p_x_y_1, p_x_y_0 = fit_nb_model_smooth(X_train, y_train,1) #Model with smoothing
y_pred = get_nb_predictions(p_y_1, p_y_0, p_x_y_1, p_x_y_0, X_test[:100,:]) #Only looking at first 100 X_test
print("accuracy is", (y_pred == y_test[:100]).astype(float).mean()) #also only need to compare first 100 y_test

progress 0.0
progress 0.1
progress 0.2
progress 0.3
progress 0.4
progress 0.5
progress 0.6
progress 0.7
progress 0.8
progress 0.9
accuracy is 0.79
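
One common form of Laplace (add-alpha) smoothing for binary features also adds `2 * alpha` to the denominator: one pseudo-count for "word present" and one for "word absent". That keeps every estimate strictly between 0 and 1, so both `log(p)` and `log(1 - p)` stay finite. The sketch below illustrates that variant; it is not the function that produced the accuracy above, and `smoothed_word_probs` and the counts are hypothetical.

```python
def smoothed_word_probs(doc_counts, n_docs, alpha=1.0):
    """Estimate P(word present | class) with add-alpha smoothing for binary features."""
    return [(c + alpha) / (n_docs + 2 * alpha) for c in doc_counts]

# Hypothetical counts: words seen in 0, 3, and all 10 of 10 documents
toy_probs = smoothed_word_probs([0, 3, 10], n_docs=10, alpha=1.0)
print(toy_probs)
```

Even the word that appears in every document gets a probability below 1, so the "word absent" branch of the prediction loop never takes `log(0)`.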


In [7]:
X_0 = np.asarray(X_train[y_train == 0, :])
print(np.shape(X_0))
print(np.shape(p_x_y_0))

(939, 260)
(260,)


### Clean Text


In [8]:
import collections
from nltk.util import ngrams

import re
# Essentially the same as above, but putting it into a function for later
def clean_text(s):
  s = s.lower() # Convert to lowercase
  s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s) # Replace all non alphanumeric characters with spaces
  s = re.sub(' +',' ',s) # Replace series of spaces with single space
  return s


for i in range(10):
  origtext = list(dfX_train)[i]
  print("Original text: ",origtext)
  cleaned = clean_text(origtext)
  print("Cleaned  text: ",cleaned)
  tokens = [token for token in cleaned.split(" ") if token != ""]
  bigramWords = list(ngrams(tokens, 2))
  bigramFreq = collections.Counter(bigramWords)
  print("Bigrams: ", bigramFreq.most_common(10))
  print('\n')


Original text:  through acceptance and commitment therapy, a person with an eating disorder can learn to break destructive cycles of negative thoughts
Cleaned  text:  through acceptance and commitment therapy a person with an eating disorder can learn to break destructive cycles of negative thoughts
Bigrams:  [(('through', 'acceptance'), 1), (('acceptance', 'and'), 1), (('and', 'commitment'), 1), (('commitment', 'therapy'), 1), (('therapy', 'a'), 1), (('a', 'person'), 1), (('person', 'with'), 1), (('with', 'an'), 1), (('an', 'eating'), 1), (('eating', 'disorder'), 1)]


Original text:   the first wave of pro-ana websites was observed in the 1990s
Cleaned  text:   the first wave of pro ana websites was observed in the 1990s
Bigrams:  [(('the', 'first'), 1), (('first', 'wave'), 1), (('wave', 'of'), 1), (('of', 'pro'), 1), (('pro', 'ana'), 1), (('ana', 'websites'), 1), (('websites', 'was'), 1), (('was', 'observed'), 1), (('observed', 'in'), 1), (('in', 'the'), 1)]


Original text:  if the

#### Now, let's clean all the data

In [9]:
def clean_series_data(sdata):
  sdata = list(sdata)
  for i in range(len(sdata)):
    sdata[i] = clean_text(sdata[i])
    if i%(len(sdata)/5)==0:
      print(sdata[i]) #Printing occasional text can be helpful for making sure that your cleaning is working how you want it to. Or you can comment this out.
  return sdata

dfX_train = clean_series_data(dfX_train)
dfX_test = clean_series_data(dfX_test)


through acceptance and commitment therapy a person with an eating disorder can learn to break destructive cycles of negative thoughts
 you need to believe that now


### Notebook Exercise 3

Find the most common n-grams for the pro-ana and pro-recovery paragraphs (the code below uses 4-grams and prints the top 30).

In [10]:
# A solution (though there are likely faster, cleaner ways to do this)
wordsProAna = list() # all tokens from pro-ana paragraphs
wordsProRecovery = list() # all tokens from pro-recovery paragraphs
n = 4 # n-gram size (2 would give bigrams)

for i in range(len(y_train)): # iterate over training examples
  # split the cleaned text into tokens, dropping empty strings
  tokens = [token for token in dfX_train[i].split(" ") if token != ""] 
  if y_train[i]==1: # label 1 is pro-ana
    wordsProAna.extend(tokens) # add tokens to the pro-ana word list
  else:
    wordsProRecovery.extend(tokens) # add tokens to the pro-recovery word list

# use nltk to make n-grams from wordsProAna, then put them in a list
bigramWordsProAna = list(ngrams(wordsProAna, n)) 
# count how often each pro-ana n-gram occurs
bigramFreqProAna = collections.Counter(bigramWordsProAna)
# make n-grams from wordsProRecovery and put them in a list
bigramWordsProRecovery = list(ngrams(wordsProRecovery, n))
# count how often each pro-recovery n-gram occurs
bigramFreqProRecovery = collections.Counter(bigramWordsProRecovery)

print("pro-ana bigrams:")
bgfp = bigramFreqProAna.most_common(30) # the 30 most common n-grams
for bg in bgfp: # iterate over the n-grams
  print(bg) # print each n-gram with its count
print('\n')
print("pro-recovery bigrams")
bgfp = bigramFreqProRecovery.most_common(30)
for bg in bgfp:
  print(bg)

pro-ana bigrams:
(('the', 'pro', 'ana', 'diet'), 9)
(('plan', 'you', 'have', 'to'), 8)
(('pro', 'ana', 'tips', 'and'), 7)
(('pro', 'ana', 'diet', 'plan'), 6)
(('ana', 'tips', 'and', 'tricks'), 6)
(('pro', 'ana', 'diet', 'plans'), 5)
(('stay', 'safe', 'stay', 'strong'), 5)
(('safe', 'stay', 'strong', 'stay'), 5)
(('stay', 'strong', 'stay', 'skinny'), 5)
(('pro', 'ana', 'is', 'a'), 5)
(('the', 'pro', 'ana', 'lifestyle'), 5)
(('this', 'diet', 'plan', 'you'), 5)
(('let', 'us', 'begin', 'with'), 5)
(('if', 'you', 'want', 'to'), 5)
(('diet', 'plan', 'you', 'have'), 5)
(('tips', 'and', 'tricks', 'for'), 5)
(('that', 'you', 'do', 'not'), 5)
(('in', 'front', 'of', 'the'), 5)
(('that', 'you', 'don', 't'), 4)
(('don', 't', 'want', 'to'), 4)
(('you', 'don', 't', 'want'), 4)
(('pro', 'ana', 'tips', 'that'), 4)
(('the', 'food', 'that', 'you'), 4)
(('want', 'to', 'lose', 'weight'), 4)
(('day', 'of', 'the', 'cycle'), 4)
(('it', 'will', 'make', 'you'), 4)
(('you', 'want', 'to', 'eat'), 4)
(('this', 'wi

#### Try a different sized n-gram (instead of a bigram / 2-gram)
Not shockingly, a plain bigram list is not very exciting: the most frequent pairs are mostly standard ones (e.g. "of the"). If we look at a greater number of bigrams (e.g., the top 100), we can eventually start to find something relevant among mostly trite pairings.

It is more interesting to look at a different-sized n-gram than the bigram, which is why the output above uses n = 4. **Try something in the n = 4 to 7 range.**
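
To see what changing n does, here is a small sketch that builds n-grams without nltk and counts them for two values of n. The helper `simple_ngrams` and the toy token list are assumptions for illustration; on the real data you would keep using `nltk.util.ngrams` as in the cell above.

```python
from collections import Counter

def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest slice, so no partial n-grams are produced
    return list(zip(*(tokens[i:] for i in range(n))))

# Hypothetical toy tokens
toy_tokens = "stay safe stay strong stay safe stay strong".split()
for n in (2, 4):
    print(n, Counter(simple_ngrams(toy_tokens, n)).most_common(2))
```

Longer n-grams are rarer, so their counts drop quickly, but the ones that do repeat tend to be more distinctive phrases.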


## Classifying pro-ana vs. pro-recovery text with bigrams


Let's revisit our Naïve Bayes model, but now using bigrams as our features instead of single words.


#### Use CountVectorizer to get top bigrams and then classify the text
The code below gives a black box approach to classifying with n-grams.

Setting `ngram_range=(2,2)` makes our code use bigrams.

Note that our data now has a different shape and is stored in sparse form (we no longer convert it to a dense matrix). With this many features, storing the data in dense form may lead to RAM errors. We could combat that by limiting the number of n-grams in our CountVectorizer: setting `max_features=10000` limits the total number of features.


In [11]:
ngramvectorizer = CountVectorizer(ngram_range=(2,2))
ngramvectorizer.fit(dfX_train) #learn a vocabulary dictionary of all tokens in the raw documents

X_train_ngram = ngramvectorizer.transform(dfX_train)
X_test_ngram = ngramvectorizer.transform(dfX_test)
print("X_train_ngram.shape", X_train_ngram.shape)
print("X_test_ngram.shape", X_test_ngram.shape) 

X_train_ngram.shape (2526, 23340)
X_test_ngram.shape (843, 23340)
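
For a rough sense of what the dense form would cost, the sketch below estimates the memory needed for matrices of the shapes printed above. It assumes 8 bytes per entry (a 64-bit count); the actual dtype and overhead may differ, so treat the result as a back-of-the-envelope figure.

```python
def dense_bytes(n_rows, n_cols, bytes_per_entry=8):
    """Approximate memory for a dense n_rows x n_cols matrix."""
    return n_rows * n_cols * bytes_per_entry

# Shapes taken from the output above; 8 bytes per entry is an assumption
train_mb = dense_bytes(2526, 23340) / 1e6
test_mb = dense_bytes(843, 23340) / 1e6
print(f"dense train matrix: ~{train_mb:.0f} MB, dense test matrix: ~{test_mb:.0f} MB")
```

The cost grows linearly with the vocabulary size, so larger corpora or longer n-grams make the dense form impractical much sooner.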


In [12]:
# Actually run the model and print results
model = MultinomialNB(alpha=1)
model.fit(X_train_ngram, y_train)

y_pred_train = model.predict(X_train_ngram)
print("Training accuracy: ", np.mean(y_pred_train == y_train))
y_pred = model.predict(X_test_ngram)
print("Testing accuracy: ",np.mean(y_pred == y_test))


Training accuracy:  0.9901029295328583
Testing accuracy:  0.8327402135231317


## Predicting the next words with bigrams
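Use bigrams to generate a predicted sequence of words when given a single word. I turned this code into a function, which I then used several times with different texts and different starting words. Note that the n-gram lists passed in below were built with n = 4, so the function only uses the first two words of each n-gram.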

In [14]:
import random
def generate_text(bigramList, startingWord):
  bigramLookup = {}

  for i in range(len(bigramList)): # visit every n-gram, including the last one
      w1 = bigramList[i][0]
      w2 = bigramList[i][1]
      #print(w1,w2)
      if  w1 not in bigramLookup.keys():
        bigramLookup[w1] = {w2:1}
      elif w2 not in bigramLookup[w1].keys():
        bigramLookup[w1][w2] = 1
      else:
        bigramLookup[w1][w2] = bigramLookup[w1][w2] + 1

  curr_sequence = startingWord 
  output = curr_sequence
  for i in range(50):
      if curr_sequence not in bigramLookup.keys():
        print("not in my keys, choosing seed word ")
        output += '. '
        curr_sequence = 'the'
        output += curr_sequence
      else: 
        possible_words = list(bigramLookup[curr_sequence].keys())
        next_word = possible_words[random.randrange(len(possible_words))] #Randomly choose a word
        output += ' ' + next_word
        curr_sequence = next_word
        

  print(output)

generate_text(bigramWordsProAna, "food")
generate_text(bigramWordsProRecovery, "food")

food many bites making sure that thinspiration pictures to weight please don t use stored food that makes me know before eating almost half the babies 200 7 keep reading to some inspirational lyrics such rewards such a charm indicates that then gained  pro ana songs16 pro anna individuals see people
food plan or push aside deep feelings food groups 2000 15 8 577 590 your information will to create increased anxiety as psychiatrists psychotherapists and beginning to understand what your treatment options here the risk period in here at camp that rang so hard to realise that no question that take
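
The function above picks uniformly among the distinct words that ever followed the current word, so the stored counts are not used. One possible variation, sketched below under the assumption that more frequent successors should be chosen more often, samples in proportion to those counts. The helper names and toy tokens are hypothetical.

```python
import random
from collections import Counter, defaultdict

def build_successor_counts(tokens):
    """Map each word to a Counter of the words that follow it."""
    successors = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        successors[w1][w2] += 1
    return successors

def sample_next_word(successors, word, rng=random):
    """Sample a successor of `word`, weighted by count; None if word is unseen."""
    options = successors.get(word)
    if not options:
        return None
    words = list(options.keys())
    weights = list(options.values())
    return rng.choices(words, weights=weights, k=1)[0]

# Hypothetical toy tokens
toy_tokens = "stay safe stay strong stay safe".split()
toy_successors = build_successor_counts(toy_tokens)
print(sample_next_word(toy_successors, "stay", random.Random(0)))
```

Passing a seeded `random.Random` instance makes a run repeatable, which helps when comparing generated text across the two classes.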
