Credit for most of this notebook goes to Chitipolu Sri Sudheera. See her version here: https://www.kaggle.com/srisudheera/nlp-bag-of-words-meets-bags-of-popcorn.  The model validation code at the end was developed separately.

Google Colab notebook with more comments:  https://colab.research.google.com/drive/1IRVt-OXQs0deuCmU-95f3WFmvNHYp3EO

In [0]:
## Set up Google drive
from google.colab import drive
drive.mount('/content/drive')

# Install a bunch of libraries
!pip install -U -q PyDrive
import pandas as pd 
import numpy as np
import os
import matplotlib.pyplot as plt
import nltk
nltk.download('stopwords')
%matplotlib inline

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

from nltk.corpus import stopwords
from bs4 import BeautifulSoup
import re

print("\nImported things sucessfully!")


Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.

Imported things sucessfully!


# Getting the Data

In [0]:
# load in the training data
#data_file_path = "/content/drive/My Drive/Documents/SCHOOL/Watson CI/Parks Project - Watson CI/Training Data/Labelbox Best 4.csv"

data_file_path = "/content/drive/My Drive/Documents/SCHOOL/Watson CI/Parks Project - Watson CI/Training Data/Flipped Data.csv"
proanti_file   = "/content/drive/My Drive/Documents/SCHOOL/Watson CI/Parks Project - Watson CI/Training Data/Non-Neutral ProAnti.csv"
sentiment_file = "/content/drive/My Drive/Documents/SCHOOL/Watson CI/Parks Project - Watson CI/Training Data/Non-Neutral Sentiment.csv"

train=pd.read_csv(data_file_path, delimiter=',')
train_sentiment = pd.read_csv(sentiment_file, delimiter=',')
train_proanti = pd.read_csv(proanti_file, delimiter=',')

#shuffle the data - optional but makes me feel better
train = train.sample(frac=1).reset_index(drop=True)
train_sentiment = train_sentiment.sample(frac=1).reset_index(drop=True)
train_proanti   = train_proanti.sample(frac=1).reset_index(drop=True)


#Define the column names in case you changed them later
DataColName = "Labeled Data"  #The name of the column with the text data to analyze
ExternalID = "External ID"
# ClassifierColName = "Sentiment" #Sentiment or Pro/anti park #not really needed any more
Sentiment = "Sentiment"
ProAnti = "Pro Anti Park"

# examine the loaded data
print(train.shape)
train[:10][:]


(288, 4)


Unnamed: 0,Labeled Data,External ID,Sentiment,Pro Anti Park
0,I had a really nice time in Joshua Tree this w...,/bignosebug/status/1120147383138967553,0,0
1,Sunrise in Joshua Tree\n.\nCan't wait to go ba...,/EMP_Creative/status/1071457436056346625,0,0
2,The Perfect Joshua Tree Day Trip: A One Day It...,/InAfricaNBeyond/status/1139961543557124096,0,0
3,So how did the latest government shutdown do t...,/swvacavalier/status/1140981246828392448,1,0
4,Here are a few of the beauties I've spotted in...,/englishinmin/status/1111344469163438080,0,0
5,THE JOSHUA TREE CANNOT AFFECT MY LIFE SO.........,/lmbs83/status/1142489839272714240,1,1
6,The inspiration for my first tattoo was the Jo...,/Marky_Boy1968/status/1177585075497635841,0,0
7,My mate @LewisHamilton-very happy tue see U ta...,/brent_otter7/status/1075146367234965504,0,0
8,Ready to have some fun in Joshua Tree,/CatherineViola/status/1111290392127823872,0,0
9,Love this crew and their willingness to explor...,/BrentWalls39/status/1054229278592651265,0,0


Clean the tweets into word lists


In [0]:
# Clean the data
words_to_keep = ['no', 'not', 'nor', 'against', 'why', 'dont', "don't", 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
extra_words_to_remove = ["gmail", "com", "or", "of", "I'll"]  #Remove the words that we used as keywords for finding the tweets?

mystopwords = [w for w in stopwords.words("english") and extra_words_to_remove if not w in words_to_keep]

#String that determines what characters we keep.  Currently getting rid of all numbers and non a-z characters except what you add in
letters_only_sub_string = "[^a-zA-Z?!:']"

# Define the function that cleans the tweet and makes it into lowercase words and joins it back together
def review_to_words(original_tweet):
    review_text=BeautifulSoup(original_tweet).get_text()
    no_html = re.sub(r'http\S+', '', review_text)             #Get rid of any http URLs
    no_html = re.sub(r'pic.twitter.com\S+', '', no_html)      #Get rid of any twitter picture urls
    letters_only=re.sub(letters_only_sub_string," ",no_html)  #Get rid of numbers and special characters using the string key above

    words=letters_only.lower().split()  #Convert to lowercase and split into individual words
    meaningful_words=[w for w in words if not w in mystopwords]
    return(' '.join(meaningful_words))

#Test it on a few tweets
example=BeautifulSoup(train[DataColName][0])
print("Original:  " + example.get_text())
print("Cleaned:   " + review_to_words(example.get_text()))
example=BeautifulSoup(train[DataColName][1])
print("Original:  " + example.get_text())
print("Cleaned:   " + review_to_words(example.get_text()))
example=BeautifulSoup(train[DataColName][2])
print("Original:  " + example.get_text())
print("Cleaned:   " + review_to_words(example.get_text()))

# Clean our tweets - general
num_tweets=train[DataColName].size
clean_train_review=[]
for i in range(0,num_tweets):
    clean_train_review.append(review_to_words(train[DataColName][i]))
print("Cleaned " + str(num_tweets) + " tweets and stored in clean_train_review")

#Clean our sentiment / proanti tweets
clean_sentiment = []
clean_proanti   = []
for i in range(0,train_sentiment[DataColName].size):
    clean_sentiment.append(review_to_words(train_sentiment[DataColName][i]))
for i in range(0,train_proanti[DataColName].size):
    clean_proanti.append(review_to_words(train_proanti[DataColName][i]))


Original:  so sad that i'm missing the snow in joshua tree
Cleaned:   so sad that i'm missing the snow in joshua tree
Original:  The Ukranian crochet museum at Joshua Tree is amazing! I recommend everyone check it out!
Cleaned:   the ukranian crochet museum at joshua tree is amazing! i recommend everyone check it out!
Original:  This shit makes me so mad. I can't read any more articles about the treatment of Joshua Tree during the government shut down or I'll implode with rage.
Cleaned:   this shit makes me so mad i can't read any more articles about the treatment joshua tree during the government shut down i'll implode with rage
Cleaned 288 tweets and stored in clean_train_review


# Model optimizing and batch testing of variation

In [0]:
#OPTIONAL
# This section was for my testing to find the best compbination of parameters.  Please skip it for real stuff

#Generate a set of x_data based on a variety of max_features
max_features_list = [100, 250, 500, 750, 1000, 1500, 3000]
X_data = []

print(len(max_features_list))

i = 0
while i < len(max_features_list):
  vectorizer=CountVectorizer(analyzer='word',tokenizer=None,preprocessor = None, stop_words = None, max_features = max_features_list[i])
  train_data_features=vectorizer.fit_transform(clean_train_review)
  train_data_features=train_data_features.toarray()
  X_data.append(train_data_features)
  i += 1

# X_data

In [0]:
#OPTIONAL
# WARNING This takes a long time.  Like 15 minutes per set of X_data
# This section was for my testing to find the best compbination of parameters.  Please skip it for real stuff


# Try to find the best parameters for our model
# Source:  https://medium.com/@hjhuney/implementing-a-random-forest-classification-model-in-python-583891c99652
# Also need to look at this:  https://www.datacamp.com/community/tutorials/random-forests-classifier-python

# WARNING This takes a long time.  Like 15 minutes per set of X_data

X_final = train_data_features
y = train[Sentiment]


from sklearn.model_selection import RandomizedSearchCV
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)] # number of trees in random forest (default num=10)
max_features = ['auto', 'sqrt']  # number of features at every split

# max depth
max_depth = [int(x) for x in np.linspace(50, 500, num = 11)] 
max_depth.append(None)

# create random grid
random_grid = {
 'n_estimators': n_estimators,
 'max_features': max_features,
 'max_depth': max_depth
 }
# Random search of parameters
rfc_random = RandomizedSearchCV(estimator = RandomForestClassifier(), param_distributions = random_grid, n_iter = 100, cv = 3, verbose=10, random_state=42, n_jobs = -1)

j = 0
for this_X_Data in X_data:
  # Fit the model
  rfc_random.fit(this_X_Data, y)
  # print results
  print("For a max features of " + str(max_features_list[j]))
  print(rfc_random.best_params_)
  #to-do - run/test the model with these params
  j += 1


## Jacob saving the results of running it
# Vectorizer Max Features = 100 -->  'n_estimators': 200,  'max_features': 'sqrt', 'max_depth': 95
# Vectorizer Max Features = 250 -->  'n_estimators': 1800, 'max_features': 'sqrt', 'max_depth': 95
# Vectorizer Max Features = 500 -->  'n_estimators': 2000, 'max_features': 'auto', 'max_depth': 50
# Vectorizer Max Features = 750 -->  'n_estimators': 200,  'max_features': 'sqrt', 'max_depth': 365
# Vectorizer Max Features = 1000 --> 'n_estimators': 200,  'max_features': 'auto', 'max_depth': 320
# Vectorizer Max Features = 1500 --> 'n_estimators': 1400, 'max_features': 'auto', 'max_depth': 50
# Vectorizer Max Features = 3000 --> 'n_estimators': 800, 'max_features': 'sqrt', 'max_depth': 185




# Main Training Code

In [0]:
# Setup
vectorization_max_features = 1000
Logistic_max_iterations = 500
rf_estimators = 200
rf_max_features = "sqrt"
rf_max_depth = 100
scoring_method = 'roc_auc' # from this list  https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

verbose_num = 0
return_train = True
k_folds = 5

# Vectorization
from sklearn.feature_extraction.text import CountVectorizer
vectorizer=CountVectorizer(analyzer='word',tokenizer=None,preprocessor = None, stop_words = None, max_features = 500)
train_data_features       = (vectorizer.fit_transform(clean_train_review)).toarray()
train_sentiment_features  = (vectorizer.fit_transform(clean_sentiment)).toarray()
train_proanti_features    = (vectorizer.fit_transform(clean_proanti)).toarray()

# Vectorizer Max Features = 100 -->  'n_estimators': 200,  'max_features': 'sqrt', 'max_depth': 95
# Vectorizer Max Features = 250 -->  'n_estimators': 1800, 'max_features': 'sqrt', 'max_depth': 95
# Vectorizer Max Features = 500 -->  'n_estimators': 2000, 'max_features': 'auto', 'max_depth': 50
# Vectorizer Max Features = 750 -->  'n_estimators': 200,  'max_features': 'sqrt', 'max_depth': 365
# Vectorizer Max Features = 1000 --> 'n_estimators': 200,  'max_features': 'auto', 'max_depth': 320
# Vectorizer Max Features = 1500 --> 'n_estimators': 1400, 'max_features': 'auto', 'max_depth': 50
# Vectorizer Max Features = 3000 --> 'n_estimators': 800,  'max_features': 'sqrt', 'max_depth': 185


#Set up our models for Random Forest and Logistic
lin = LogisticRegression(solver='liblinear', max_iter=Logistic_max_iterations) # may need to tune C: inverse regularization strength
rf = RandomForestClassifier(n_estimators=rf_estimators, max_features=rf_max_features, max_depth=rf_max_depth)
X_final = train_data_features

# Logistic Regression doc:    https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
# Random Forest Classifier:   https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
# Word2vec notebook:          https://github.com/tthustla/twitter_sentiment_analysis_part11/blob/master/Capstone_part11.ipynb
# Another word2vec Keras twitter sentiment analyzer:  https://www.ahmedbesbes.com/blog/sentiment-analysis-with-keras-and-word-2-vec


### Model Validation
# Define a confidence interval function
from scipy.stats import t
def conf_int(ar):
  mean = np.mean(ar)
  sem = np.std(ar) / np.sqrt(len(ar))
  t_score = np.abs(t.ppf(0.05, len(ar)-1))
  # print("T-Score:  " + str(t_score))  #This value is just based on the size of the dataset so it will be the same for all models 
  return (mean - t_score * sem, mean + t_score * sem )

### SENTIMENT
### Train based on Sentiment (Positive/Negative)
X_final = train_sentiment_features
y = train_sentiment[Sentiment]

#Linear Model
cv_lin_sentiment = cross_validate(lin, X_final, y, cv=k_folds, scoring=scoring_method, return_train_score=return_train, verbose=verbose_num)
#Random forest model
cv_rf_sentiment = cross_validate(rf, X_final, y, cv=k_folds, scoring=scoring_method, return_train_score=return_train, verbose=verbose_num)
#Get error of each model in dollars(?)
print("Logistic Regression model Sentiment scores confidence interval: (%0.3f, %0.3f)" % conf_int(cv_lin_sentiment['test_score']))
print("Random Forest model Sentiment scores confidence interval: (%0.3f, %0.3f)" % conf_int(cv_rf_sentiment['test_score']))

sentiment_model_trained_rf = rf.fit(X_final, y)  #look into random forest "out of bag error" 
sentiment_model_trained_lin = lin.fit(X_final, y)
sorted_word_idx_sentiment = np.argsort(sentiment_model_trained_lin.coef_)


### Train a new model based on Pro/Anti Park
X_final = train_proanti_features
y = train_proanti[ProAnti]

#Linear Model
cv_lin_proanti = cross_validate(lin, X_final, y, cv=k_folds, scoring=scoring_method, return_train_score=return_train, verbose=verbose_num)
#Random forest model
cv_rf_proanti = cross_validate(rf, X_final, y, cv=k_folds, scoring=scoring_method, return_train_score=return_train, verbose=verbose_num)
#Get error of each model in dollars(?)
print("Logistic Regression model Pro/Anti scores confidence interval: (%0.3f, %0.3f)" % conf_int(cv_lin_proanti['test_score']))
print("Random Forest model Pro/Anti scores confidence interval: (%0.3f, %0.3f)" % conf_int(cv_rf_proanti['test_score']))

proanti_model_trained_rf = rf.fit(X_final, y)
proanti_model_trained_lin = lin.fit(X_final, y)
sorted_word_idx_proanti = np.argsort(proanti_model_trained_lin.coef_)


# #temp - F1 
# Logistic Regression model Sentiment scores confidence interval: (0.289, 0.519)
# Random Forest model Sentiment scores confidence interval: (0.343, 0.515)
# Logistic Regression model Pro/Anti scores confidence interval: (-0.026, 0.319)
# Random Forest model Pro/Anti scores confidence interval: (-0.026, 0.319)


#vs roc_auc:

Logistic Regression model Sentiment scores confidence interval: (0.743, 0.833)
Random Forest model Sentiment scores confidence interval: (0.703, 0.780)
Logistic Regression model Pro/Anti scores confidence interval: (0.619, 0.780)
Random Forest model Pro/Anti scores confidence interval: (0.568, 0.724)


# Which words predict positive and negative sentiment?

In [0]:
#Only works if you use the linear / logistic model

# Most negative words
[vectorizer.get_feature_names()[ix] for ix in sorted_word_idx_sentiment[0,:30]]

['road',
 'because',
 'river',
 'whole',
 'do',
 'before',
 'need',
 'shut',
 'grew',
 'why',
 'wtf',
 'small',
 'drive',
 'many',
 'canyon',
 'your',
 'damn',
 'years',
 'garden',
 'being',
 'travel',
 'wildlife',
 'been',
 'palm',
 'shutdown',
 'art',
 'world',
 'sunrise',
 'lol',
 'outside']

In [0]:
# Most positive words
[vectorizer.get_feature_names()[ix] for ix in sorted_word_idx_sentiment[0,-30:]][::-1]

['ll',
 'got',
 'way',
 'also',
 'beautiful',
 'summer',
 'keep',
 'ever',
 'issues',
 'perfect',
 'galaxy',
 'two',
 'beauty',
 'quiet',
 'first',
 'away',
 'nationalparks',
 'her',
 'wanna',
 'already',
 'say',
 'especially',
 'air',
 'visited',
 'be',
 'since',
 'nationalpark',
 'old',
 'bliss',
 'how']

Test the models on new data

In [0]:

# load in the testing data
data_file_path = "/content/drive/My Drive/Documents/SCHOOL/Watson CI/Parks Project - Watson CI/Training Data/Best 4 Test Data.csv"
#Dumb bad test - test it on the training data!
# data_file_path = "/content/drive/My Drive/Documents/SCHOOL/Watson CI/Parks Project - Watson CI/Training Data/Labelbox Best 4.csv"

test = pd.read_csv(data_file_path, delimiter=',')

#Define the column names in case you changed them later
DataColName = "Labeled Data"  #The name of the column with the text data to analyze
ExternalID  = "External ID"
Sentiment   = "Sentiment"
ProAnti     = "Pro Anti Park"

# examine the loaded data
test[:][:10]


In [0]:
# Create an empty list and append the clean reviews one by one
num_tweets = len(test[ClassifierColName])
clean_test_tweets= [] 
print("Cleaning and parsing the test set of tweets...\n")
for i in range(0,num_tweets):
    clean_tweet = review_to_words( test[DataColName][i] )
    clean_test_tweets.append( clean_tweet )

# Get a bag of words for the test set, and convert to a numpy array
test_data_features = vectorizer.transform(clean_test_tweets)
test_data_features = test_data_features.toarray()

# Use the random forest to make sentiment label predictions
sentiment_result_rf   = sentiment_model_trained_rf.predict(test_data_features)
sentiment_result_lin  = sentiment_model_trained_lin.predict(test_data_features)
proanti_result_rf     = proanti_model_trained_rf.predict(test_data_features)
proanti_result_lin    = proanti_model_trained_lin.predict(test_data_features)

# Copy the results to a pandas dataframe
output = pd.DataFrame( data={DataColName:test[DataColName],  
                            #  ExternalID:test[ExternalID], 
                             "Sent RF":sentiment_result_rf, "Sent Lin:":sentiment_result_lin,   "Exp. Sent":test[Sentiment],
                             "Pro/ant RF":proanti_result_rf,  "ProAnt Lin":proanti_result_lin,    "Exp. ProAnti":test[ProAnti],
                             } )

#Optional - Write the result to a csv
output_file = "/content/drive/My Drive/Documents/SCHOOL/Watson CI/Parks Project - Watson CI/Training Data/Most Recent Saved Output.csv"
output.to_csv(output_file, index=False)

#Display the results here
output


Cleaning and parsing the test set of tweets...



Unnamed: 0,Labeled Data,Sent RF,Sent Lin:,Exp. Sent,Pro/ant RF,ProAnt Lin,Exp. ProAnti
0,"""The Joshua Tree""\nHeaded up for one last go b...",0,0,1,0,0,1
1,Joshua Tree Weekend 2 > Coachella Weekend 2 Ca...,0,0,1,0,0,1
2,Bruh Joshua Tree Is Gonna Be Fkn Litty Bitchhh!!!,0,0,1,0,0,1
3,I'm thrilled about the all NEW location for my...,0,0,1,0,0,1
4,@KeeleyDonovan Many happy returns of the day -...,0,0,1,0,0,1
5,It's #SundayMorning and I am sending my happy ...,0,0,1,0,0,1
6,A few shots from the wonderful experience of m...,0,0,1,0,0,1
7,I always see cool people going to Joshua tree ...,1,1,0,1,1,0
8,I was just at a wedding in Joshua Tree in Octo...,0,0,0,0,0,1
9,Joshua Tree is losing it's Joshua trees! https...,0,0,0,0,0,1
