# PART A: Basic Content Analysis with Twitter Data

Scraping qualitative data from the web is less intimidating than it sounds, but you will need the right tools to begin your analysis. In this section, we will run through the basics of Tweet scraping using the Twitter API Search Endpoint and discuss how to tailor your results to your specific needs.


<br>

Please note that you will not be able to execute the commands in this first section. An output file (ca_prop_tweets.csv) is provided in the data folder for use in Section 2.

<br>


There are two reasons why you cannot follow along in this first section. First, our class Python environment is missing several key packages needed to interact with Twitter and make meaning from tweets. These include tweepy (one of the more widely used packages for interacting with Twitter's API endpoints) and textblob (a natural language processing package with several useful text analysis methods). We could easily install these in the class environment, but doing so would mean taking a few unnecessary risks. Second, to interact with Twitter via its API, you need consumer and access tokens so that Twitter can monitor your search/stream requests. Unfortunately, we do not have time to create Twitter developer accounts today, but you can find more information at the link below.

<br>

https://developer.twitter.com/en


<br>

Other useful links:

<br>


*   Twitter Search API parameters: https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/api-reference/get-search-tweets
*   Tweet object data dictionary: https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/tweet-object
*   tweepy reference for interacting with Twitter API: http://docs.tweepy.org/en/v3.5.0/api.html#tweepy-api-twitter-api-wrapper
*   textblob reference for NLP processes: https://textblob.readthedocs.io/en/dev/api_reference


## 1: A brief introduction to web scraping

In [None]:
from tweepy import API
from tweepy import Cursor
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import re
from textblob import TextBlob
import numpy as np
import pandas as pd
from datetime import datetime

In [None]:
# paste your own credentials between the quotes (never share or commit them)
consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''
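
Hard-coding secrets in a shared notebook is risky. One possible alternative, sketched below, reads the four credentials from environment variables. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_twitter_credentials` are hypothetical choices for this illustration; neither tweepy nor Twitter requires them.

```python
import os

# Hypothetical environment variable names; pick whatever names suit your setup.
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_secret": "TWITTER_ACCESS_SECRET",
}


def load_twitter_credentials(environ=os.environ):
    """Return a dict of the four credentials, raising if any variable is unset."""
    missing = [var for var in CREDENTIAL_VARS.values() if not environ.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}
```

Calling `load_twitter_credentials()` would then give you a dictionary whose values can be assigned to `consumer_key`, `consumer_secret`, `access_token` and `access_secret`.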

In [None]:
# in order to interact with any Twitter API endpoint, you will need to prove your identity...
# the tweepy package initiates this process via the OAuthHandler class

class TwitterAuthenticator():

    def authenticate_twitter_app(self):
        auth = OAuthHandler(consumer_key,
                            consumer_secret)
        auth.set_access_token(access_token,
                              access_secret)
        return auth

### 1.1: Accessing the Twitter Search API

In [None]:
# the two primary tasks you are likely to use with tweepy are (a) streaming live tweets and (b) searching for existing tweets
# the primary function of this class is to interact with the SEARCH endpoint

class TwitterClient():

    # __init__() is a special method that runs automatically whenever an instance of the class is created
    # here, we are saying hello to the Twitter API

    def __init__(self, twitter_user=None):
        self.auth = TwitterAuthenticator().authenticate_twitter_app()
        self.twitter_client = API(self.auth)

        self.twitter_user = twitter_user

    # this method returns the authenticated API client so that other code can use it directly

    def get_twitter_client_api(self):
      return self.twitter_client

    # this function is the one to pay attention to...alter it using the tweepy documentation above
    # here, we are applying the search method using tweepy's Cursor object to search for tweets
    # our criteria are limited to a search query, geographic information and the number of tweets we'd like to request (see tweepy documentation)

    def get_search_results(self, query, geo, num_tweets):
        tweets = []
        for tweet in Cursor(self.twitter_client.search,
                            q = query,
                            lang = 'en',
                            geocode = geo,
                            count = num_tweets).items(num_tweets):
          tweets.append(tweet)
        return tweets

In [None]:
# tweet data comes in an efficient, but difficult-to-read format called JSON
# this class allows us to run a basic sentiment analysis and create a legible pandas DataFrame for our tweets

class TweetAnalyzer():

  # these two functions set us up to run a basic sentiment analysis method from textblob
  # we clean the data with a regular expression (removing @mentions, punctuation and URLs) and then ask
  # textblob to estimate the polarity (pos/neg/neutral) of each tweet; its default analyzer scores words against a sentiment lexicon

  def clean_tweet(self, tweet):
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

  def analyze_sentiment(self, tweet):
        analysis = TextBlob(self.clean_tweet(tweet))

        if analysis.sentiment.polarity > 0:
            return 1
        elif analysis.sentiment.polarity == 0:
            return 0
        else:
            return -1

  # this function creates a pandas dataframe and uses list comprehension code to derive
  # values from the list of tweets we generate. We can access other aspects of a tweet using
  # different root-level attributes. See Tweet Object data dictionary link above

  def tweets_to_data_frame(self, tweets):
    df = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=['tweets'])

    df['date'] = np.array([tweet.created_at.strftime("%m/%d/%Y %H:%M:%S") for tweet in tweets])

    df['date'] = df['date'].astype(object)

    df['len'] = np.array([len(tweet.text) for tweet in tweets])
    df['latlon'] = np.array([tweet.coordinates for tweet in tweets])
    df['user_loc'] = np.array([tweet.user.location for tweet in tweets])
    df['user_handle'] = np.array([tweet.user.screen_name for tweet in tweets])
    df['followers'] = np.array([tweet.user.followers_count for tweet in tweets])
    df['favorites'] = np.array([tweet.favorite_count for tweet in tweets])
    df['retweets'] = np.array([tweet.retweet_count for tweet in tweets])

    df = df.set_index(np.array([tweet.id for tweet in tweets]))

    return df
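
To see what the cleaning step does without needing tweepy or textblob, here is a minimal standalone sketch. It reuses the same regular expression as `clean_tweet` and the same thresholds as `analyze_sentiment`. The function names `clean_text` and `polarity_to_label` and the sample tweet are made up for illustration.

```python
import re

# Same pattern as clean_tweet above: @mentions, non-alphanumeric characters, and URLs.
TWEET_NOISE = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)")


def clean_text(text):
    """Replace each match with a space, then collapse runs of whitespace."""
    return " ".join(TWEET_NOISE.sub(" ", text).split())


def polarity_to_label(polarity):
    """Map a polarity score to 1, 0, or -1 using the same cut-offs as analyze_sentiment."""
    if polarity > 0:
        return 1
    if polarity == 0:
        return 0
    return -1


sample = "@Delta Loved the flight!! https://t.co/abc123 #travel"
print(clean_text(sample))  # prints: Loved the flight travel
```

Note that the hashtag symbol is stripped but the hashtag's word is kept, which may or may not be what you want for your research question.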

In [None]:
# don't worry too much about this if statement...it tells the python interpreter to run
# the following statements only when this code is executed directly, not when it is imported as a module

if __name__ == '__main__':

  # Question for you: what does it look like we are searching for here?

  query = '"delta" -filter:retweets'
  geo = '34.042201,-118.245854,100km'
  num_tweets = 50

  twitter_client = TwitterClient()
  tweet_analyzer = TweetAnalyzer()

  tweets = twitter_client.get_search_results(query, geo, num_tweets)
  df = tweet_analyzer.tweets_to_data_frame(tweets)

  # Another question for you: what are we appending to our dataframe and why might these
  # pieces of information be interesting to us?

  df['tb_sentiment'] = np.array([tweet_analyzer.analyze_sentiment(tweet) for tweet in df['tweets']])
  df['query_term'] = query
  df['scrape_time'] = datetime.now().strftime("%m/%d/%Y %H:%M:%S")

  df.index.name = 'id'

# note the parentheses: print(df.head) would print the method itself rather than the first rows
print(df.head())

                                                                tweets  \
id                                                                       
1314325575608098816  @nancyjdavidson @Delta I say book it! Mine cur...   
1314325180148113408  @hklegacy @Delta Thanks Hann, glad you liked m...   
1314324111246594049  @morgfair @GeorgeTakei @Delta  you need to ban...   
1314323830072987649  Delta Airlines flight #DAL335 spotted at 38,00...   
1314323312013725698  Hurricane #Delta mini-series:\nPart 1 - small ...   

**Sample search results:**


                                                                tweets  \
id                                                                       
1313596177841975296  Human Rights Watch is against CA Prop 25 (2020...   
1313535404172247041  Prop 25 is very contentious. I'm going to dig ...   
1313382184024043521  @CoCoSouthLA @LAPaysAttention "'with Propositi...   
1313381072760041472  @CoCoSouthLA @LAPaysAttention Prop 25 essentia...   
1313305141609549824  Very important read. I trust HRW on all things...   


                                    date  len latlon                user_loc  \
id                                                                             
1313596177841975296  10/06/2020 21:45:08   96   None         Los Angeles, CA   
1313535404172247041  10/06/2020 17:43:39  133   None        Venice Beach, CA   
1313382184024043521  10/06/2020 07:34:48  139   None         Los Angeles, CA   
1313381072760041472  10/06/2020 07:30:23  140   None         Los Angeles, CA   
1313305141609549824  10/06/2020 02:28:40   97   None             Los Angeles   
   

                         user_handle  followers  favorites  retweets  \
id                                                                     
1313596177841975296  CalvinStarnesOG       4039          0         0   
1313535404172247041      antifa_chad       1541          2         0   
1313382184024043521      its_a_lotte        912          0         0   
1313381072760041472      its_a_lotte        912          0         0   
1313305141609549824     Benjaminlear        869          3         0   


                     tb_sentiment                         query_term  \
id                                                                     
1313596177841975296             0  "proposition 25" -filter:retweets   
1313535404172247041             1  "proposition 25" -filter:retweets   
1313382184024043521             0  "proposition 25" -filter:retweets   
1313381072760041472             0  "proposition 25" -filter:retweets   
1313305141609549824             1  "proposition 25" -filter:retweets   


                             scrape_time  
id                                        
1313596177841975296  10/06/2020 17:44:53  
1313535404172247041  10/06/2020 17:44:53  
1313382184024043521  10/06/2020 17:44:53  
1313381072760041472  10/06/2020 17:44:53  
1313305141609549824  10/06/2020 17:44:53  

### 1.2: Scraping Tweets and saving data to .csv



In [None]:
# only run this line of code if you are creating a new file for your new search.
# running this twice after creating a new csv will replace the information you assigned the first time

#df.to_csv('../data/ca_prop_tweets.csv', index=True, index_label='id', encoding='utf-8')

In [None]:
# here we are adding new search results to the .csv file we created. Thanks pandas!
# we are dropping duplicate records in case our subsequent searches garner the same tweets

def append_new_tweets(master_file, new_tweets):
  master_file = pd.concat([master_file, new_tweets])
  master_file = master_file.drop_duplicates(subset=['date', 'tweets'], keep='first')
  master_file.to_csv('ca_prop_tweets.csv',
                     index=True,
                     index_label='id',
                     encoding='utf-8')
  # return the combined dataframe so the caller can keep working with the updated records
  return master_file

In [None]:
master_file = pd.read_csv('ca_prop_tweets.csv', index_col='id')

master_file = append_new_tweets(master_file, df)

In [None]:
# and voila! Let's take a look at our cleaned up .csv dataframe in the next section

master_file

id,tweets,date,len,latlon,user_loc,user_handle,followers,favorites,retweets,tb_sentiment,query_term,scrape_time
1313626138246168577,@p1rat3girl08 Thanks for the input! So just to...,10/06/2020 23:44:12,118,,"Los Angeles, CA",thealdywaldy,3528,0,0,1,"""prop 22"" -filter:retweets",10/06/2020 16:45:30
1313625012310429696,https://t.co/8QN3R7Jo8Z @KNOCKdotLA have a gre...,10/06/2020 23:39:43,140,,"Los Angeles, California",theduncanbo,717,0,0,1,"""prop 22"" -filter:retweets",10/06/2020 16:45:30
1313624353523736576,@struthioniforme @mattyglesias Isn't Prop 22 a...,10/06/2020 23:37:06,83,,"Pasadena, CA",VATVSLPR,183,0,0,0,"""prop 22"" -filter:retweets",10/06/2020 16:45:30
1313619501980635136,"just a reminder, if prop 22 PASSES, it will ta...",10/06/2020 23:17:49,139,,"Los Angeles, CA",katiemcvay,4828,8,0,0,"""prop 22"" -filter:retweets",10/06/2020 16:45:30
1313617735113306112,If any other Uber/Lyft engineers want to come ...,10/06/2020 23:10:48,140,,SF/LA,alicec47,1005,11,2,-1,"""prop 22"" -filter:retweets",10/06/2020 16:45:30
...,...,...,...,...,...,...,...,...,...,...,...,...
1311831516343795712,@MinorityJustNow @JusticeLANow Human Rights Wa...,10/02/2020 00:53:00,140,,Los Angeles,MamaSiobhan,188,1,0,-1,"""proposition 25"" -filter:retweets",10/06/2020 17:44:53
1311831196704231425,@CarboneLukas Like @isaacscher mentioned below...,10/02/2020 00:51:44,140,,Los Angeles,MamaSiobhan,188,0,0,0,"""proposition 25"" -filter:retweets",10/06/2020 17:44:53
1311410931751030785,"@stopprop25 @banales_adrian Actually, Proposit...",09/30/2020 21:01:45,140,,"Los Angeles, CA",pjrodriguez,884,0,0,1,"""proposition 25"" -filter:retweets",10/06/2020 17:44:53
1311383565192572929,Endorsement: Yes on Proposition 25 to end bail...,09/30/2020 19:13:00,94,,"El Segundo, CA",latimesopinion,20311,1,1,0,"""proposition 25"" -filter:retweets",10/06/2020 17:44:53


## 2: Examining & manipulating unstructured text data

In [None]:
import nltk
import pandas as pd
import numpy as np
import random
import os
import re
import seaborn as sns
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from nltk.classify import ClassifierI
from statistics import mode
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [None]:
# let's look at our gorgeous dataframe...
# take 1 minute and see if anything strikes you...

df = pd.read_csv('../data/ca_prop_tweets.csv', index_col='id').sample(frac = 1)

df.head(10)

In [None]:
# look familiar?

type(df)

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
df['tweets'].head(10)

In [None]:
# let's look at one more column and expand our visibility to see more of our tweets...
# take two minutes and consider what the textblob sentiment analysis is capturing (1 = positive, -1 = negative, 0 = neutral)

pd.options.display.max_colwidth = 300

df[['tweets', 'tb_sentiment']].head(10)

One thing to consider: is my sentiment analysis providing meaningful information regarding my research question?

<br>

What kind of research questions can we address with this data?

In [None]:
# How many occurrences of no and yes are in our tweets?

for t in df['tweets'][:10]:
  print(t.lower().count("no"), "mention(s) of NO and", t.lower().count("yes"), "mention(s) of YES")

In [None]:
# let's create two variables to explore trends in support/opposition
# notice anything fishy? You might or you might not...

df['num_yes'] = df['tweets'].str.lower().str.count("yes")
df['num_no'] = df['tweets'].str.lower().str.count('no')

df[['num_yes', 'num_no', 'tweets']].head(10)
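
One possible answer to the "anything fishy?" question: a plain substring count also matches "no" inside words such as "not" or "know". The sketch below counts whole words only, using a regular expression with word boundaries. The helper name `count_word` and the example sentence are made up for illustration.

```python
import re


def count_word(text, word):
    """Count case-insensitive whole-word occurrences of `word` in `text`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


example = "I do not know, but NO on 22 and yes on 25."
print(example.lower().count("no"), count_word(example, "no"))  # prints: 3 1
```

To apply this to the dataframe, something like `df['tweets'].map(lambda t: count_word(t, 'no'))` should work, as should passing the same word-boundary pattern to `str.count`.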

At this stage, we need to make a few assumptions about our data to begin to examine our RQ...those assumptions might be:


*   records with more than 3 instances of NO and/or YES are likely voter guides
*   records with 0 instances of NO AND YES will require further examination (that we do not currently have time for). We can drop these cases.
*   records with equivalent NO and YES counts will also require further investigation and can be dropped for right now.
*   records with a higher NO count than YES count likely signal opposition to the proposition, and vice versa (this is not always the case and further investigation is required before we can make this kind of assumption)
*   Any others?
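
The assumptions above can be sketched as a single decision function over one tweet's YES and NO counts. This is only an illustration of the logic, not a replacement for the pandas masks; the function name `classify_stance` and the labels "guide" and "unclear" are hypothetical.

```python
def classify_stance(num_yes, num_no):
    """Apply the listed assumptions to one tweet's YES/NO counts."""
    if num_yes + num_no > 3:
        return "guide"      # more than 3 mentions in total: likely a voter guide
    if num_yes == num_no:
        return "unclear"    # covers both the 0/0 case and equal non-zero counts
    return "Support" if num_yes > num_no else "Oppose"


for counts in [(2, 0), (0, 1), (1, 1), (0, 0), (3, 1)]:
    print(counts, classify_stance(*counts))
```

Writing the rules out this way makes the order of the checks explicit: a tweet is first screened as a possible guide, then for an unclear position, and only then assigned a stance.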






In [None]:
# Filtering out guides...

mask_not_guide = (df['num_no'] < 4) & (df['num_yes'] < 4) & ((df['num_no'] + df['num_yes']) < 4)

# Filtering out cases that do not state support or opposition with YES/NO

mask_clear_position = (df['num_no'] > 0) | (df['num_yes'] > 0)

# Filtering out cases with no clear position

mask_unclear_position = (df['num_no'] == df['num_yes'])

mask = mask_not_guide & mask_clear_position & ~mask_unclear_position

df[mask].shape

(707, 14)

In [None]:
# any better?

# .copy() gives us an independent dataframe, so adding columns later will not touch df
df_new = df[mask].copy()
df_new.head(10)

In [None]:
# let's create a new variable, "user_stance", to ID supporters and opposers based on our criteria

pd.set_option('mode.chained.assignment', None)

mask_support = df_new['num_no'] < df_new['num_yes']

df_new.loc[mask_support, 'user_stance'] = "Support"
df_new.loc[~mask_support, 'user_stance'] = "Oppose"

In [None]:
df_new[['tweets', 'user_stance']].head(10)

## 3: Visualizing trends from content analysis

In [None]:
# aaaaand one more piece of housekeeping: let's create a field to identify our propositions
# pandas is loop averse...if anyone has any ideas for doing this operation more efficiently...please.

mask_14 = df_new['query_term'].str.contains('14') == True
mask_15 = df_new['query_term'].str.contains('15') == True
mask_16 = df_new['query_term'].str.contains('16') == True
mask_17 = df_new['query_term'].str.contains('17') == True
mask_18 = df_new['query_term'].str.contains('18') == True
mask_19 = df_new['query_term'].str.contains('19') == True
mask_20 = df_new['query_term'].str.contains('20') == True
mask_21 = df_new['query_term'].str.contains('21') == True
mask_22 = df_new['query_term'].str.contains('22') == True
mask_23 = df_new['query_term'].str.contains('23') == True
mask_24 = df_new['query_term'].str.contains('24') == True
mask_25 = df_new['query_term'].str.contains('25') == True

df_new.loc[mask_14, 'proposition'] = "Proposition 14"
df_new.loc[mask_15, 'proposition'] = "Proposition 15"
df_new.loc[mask_16, 'proposition'] = "Proposition 16"
df_new.loc[mask_17, 'proposition'] = "Proposition 17"
df_new.loc[mask_18, 'proposition'] = "Proposition 18"
df_new.loc[mask_19, 'proposition'] = "Proposition 19"
df_new.loc[mask_20, 'proposition'] = "Proposition 20"
df_new.loc[mask_21, 'proposition'] = "Proposition 21"
df_new.loc[mask_22, 'proposition'] = "Proposition 22"
df_new.loc[mask_23, 'proposition'] = "Proposition 23"
df_new.loc[mask_24, 'proposition'] = "Proposition 24"
df_new.loc[mask_25, 'proposition'] = "Proposition 25"
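
In response to the request for a more efficient approach: one option is to pull the number straight out of the query term with a regular expression. The sketch below assumes every query term contains the proposition number as its first run of digits, which holds for the terms shown in this notebook. The helper name `proposition_label` is made up for illustration.

```python
import re


def proposition_label(query_term):
    """Return 'Proposition N' for the first number in the query term, else None."""
    match = re.search(r"\d+", query_term)
    return f"Proposition {match.group()}" if match else None


print(proposition_label('"prop 22" -filter:retweets'))         # prints: Proposition 22
print(proposition_label('"proposition 25" -filter:retweets'))  # prints: Proposition 25
```

With pandas, `df_new['query_term'].map(proposition_label)` would apply this to every row at once. A fully vectorized variant is `'Proposition ' + df_new['query_term'].str.extract(r'(\d+)', expand=False)`.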

In [None]:
# nice...

df_new[['tweets', 'user_stance', 'proposition']].head(10)

In [None]:
# so how are Twitter users speaking about the CA props?

df_new = df_new.sort_values(by=['proposition'])

ax = sns.countplot(x=df_new['proposition'],
                   hue=df_new['user_stance'],
                   alpha=0.8)
ax.set_xticklabels(ax.get_xticklabels(),
                   rotation=45,
                   horizontalalignment='right')
ax.set_xlabel('Nov 2020 California Propositions')
ax.set_ylabel('Number of Unique Tweets')
ax.set_title('Twitter User Stances on CA Props (October 2020)')

# show every legend entry (one per stance) rather than slicing the handle list
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles=handles, labels=labels, title='User stance')

In [None]:
# Are influencers behaving any differently on Twitter?

mask_influencer = df_new['followers'] >= 10000
df_influencer = df_new[mask_influencer]

ax = sns.countplot(x=df_influencer['proposition'],
                   hue=df_influencer['user_stance'],
                   alpha=0.8)
ax.set_xticklabels(ax.get_xticklabels(),
                   rotation=45,
                   horizontalalignment='right')
ax.set_xlabel('Nov 2020 California Propositions')
ax.set_ylabel('Number of Unique Tweets')
ax.set_title('INFLUENCER Stances on CA Props (October 2020)')

# show every legend entry (one per stance) rather than slicing the handle list
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles=handles, labels=labels, title='User stance')

# PART B: Basic Sentiment Analysis with Movie Reviews

Now let's move on to another form of natural language processing using machine learning. In this section, we recreate from scratch a process similar to the textblob NLP method we applied to our Tweets earlier in this notebook. Together we will train an algorithm to identify positive and negative sentiment from ANY text using movie review data. In theory, you could use the algorithm we are training to study polarity in any text...but should you?

<br>

There are many different approaches to sentiment analysis we might take, but for today, let's explore a basic categorical approach that treats texts as bags of words (BOW). A BOW approach to sentiment analysis focuses on the sentiment carried by the individual words in our texts, but ignores higher-level information such as grammar, word order, or sarcasm. For simplicity's sake, we will use a popular positive/negative sentiment lexicon developed by Hu & Liu (2004) to compare with our movie reviews and to train our algorithm!
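
To make the bag-of-words idea concrete, here is a minimal sketch that counts words and scores a text against a tiny, made-up lexicon (not the Hu & Liu lists). The names `bag_of_words` and `lexicon_score` are hypothetical. The second example shows the main weakness of the approach: "not good" still counts as a positive word.

```python
from collections import Counter

# Tiny, made-up lexicon for illustration only.
POSITIVE = {"good", "great", "moving"}
NEGATIVE = {"bad", "boring", "dull"}


def bag_of_words(text):
    """Count each lower-cased, whitespace-separated token; word order is discarded."""
    return Counter(text.lower().split())


def lexicon_score(text):
    """Number of positive-word occurrences minus negative-word occurrences."""
    bag = bag_of_words(text)
    pos = sum(n for word, n in bag.items() if word in POSITIVE)
    neg = sum(n for word, n in bag.items() if word in NEGATIVE)
    return pos - neg


print(lexicon_score("a great cast and a moving story"))  # prints: 2
print(lexicon_score("not good just boring and dull"))    # prints: -1
```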

## 4: Cleaning Text Data for Machine Learning

In [None]:
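# let's ask nltk to download a few important resources (they are saved to nltk's data directory)
# 'punkt' provides the tokenizer models that word_tokenize() relies on later in this notebook

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')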

In [None]:
pos_folder = os.listdir('../data/train/pos')
neg_folder = os.listdir('../data/train/neg')

In [None]:
print(len(pos_folder))
print(len(neg_folder))

In [None]:
print(pos_folder[0:5])
print(neg_folder[0:5])

In [None]:
type(neg_folder[0])

In [None]:
# You can open your .txt files in python to examine sentence structure, punctuation issues, etc.
# These text files are pretty clean! We still have some grooming to do before we can use them to model...

positive_review = open('../data/train/pos/'+pos_folder[0], 'r').read()
negative_review = open('../data/train/neg/'+neg_folder[0], 'r').read()

print('POSITIVE REVIEW:', positive_review)
print('NEGATIVE REVIEW:', negative_review)

In [None]:
# you can also shuffle through folders using the random package
# you might use this method to acquaint yourself with large datasets like these

random.shuffle(pos_folder)
random.shuffle(neg_folder)

positive_review = open('../data/train/pos/'+pos_folder[0], 'r').read()
negative_review = open('../data/train/neg/'+neg_folder[0], 'r').read()

print('RANDOM POSITIVE REVIEW:', positive_review)
print('RANDOM NEGATIVE REVIEW:', negative_review)

In [None]:
# let's take random samples from the negative and positive review folders to save processing time later on

random.shuffle(pos_folder)
pos_folder = pos_folder[:1500]

random.shuffle(neg_folder)
neg_folder = neg_folder[:1500]

print(len(pos_folder))
print(len(neg_folder))

In [None]:
# now let's create a list containing the text of every movie review...
# we build each file path by appending the .txt file name to a path string, then read the file's contents

files_positive = []
files_negative = []

for file in pos_folder:
  files_positive.append(open('../data/train/pos/'+file, 'r').read())

for file in neg_folder:
  files_negative.append(open('../data/train/neg/'+file, 'r').read())

In [None]:
# aaaand here is a sample of our list

print(files_positive[0])
print(files_positive[1])
print(files_positive[0:2])

In [None]:
# removing all non-alphabetical content allows us to tokenize words if that is part of our NLP process.
# here we are using something called a regular expression to clean our reviews...
# this is an advanced topic, so do not stress the syntax

no_punctuation = []

for review in files_positive:
  no_punctuation.append(re.sub(r'[^a-zA-Z\s]','', review))

no_punctuation[:10]

In [None]:
# and here we transformed all upper-case letters to lower-case letters

pos_cleaned = []

for string in no_punctuation:
  pos_cleaned.append(string.lower())

pos_cleaned[:10]

In [None]:
# ditto for our negative reviews

neg_cleaned = []

for review in files_negative:
  neg_cleaned.append(re.sub(r'[^a-zA-Z\s]','',review.lower()))

neg_cleaned[:10]

In [None]:
all_cleaned = pos_cleaned + neg_cleaned

## 5: Training, Validation, & Test Sets

### 5.1: Cleaning, tokenizing, and creating a lexicon

In [None]:
# Tokenizers are used to split strings into lists of substrings
# word_tokenize() from the nltk package splits text into word and punctuation tokens;
# because we already stripped punctuation, here it mostly splits our reviews on whitespace

tokenized = []

for review in pos_cleaned:
    review_tokens = word_tokenize(review)
    for word in review_tokens:
      tokenized.append(word)

print('There are', len(tokenized), 'words in my batch of positive reviews')
print(tokenized[:10])

In [None]:
# stop words are common 'empty' words that we can filter out to create space for words/phrases with sentimental weight
# here we are using the stopwords corpus that we downloaded earlier from NLTK, but you can create your own list as well
# storing them in a set makes the membership checks in the next cell fast

stop_words = set(stopwords.words('english'))
sorted(stop_words)

In [None]:
no_stops = []

for word in tokenized:
    if word not in stop_words:
      no_stops.append(word)

print('I removed', (len(tokenized)-len(no_stops)), 'stop words!!')
print(no_stops[:10])

In [None]:
# here we are asking nltk to tag all of our tokenized words with a semantic part of speech identifier

part_of_speech_positive = nltk.pos_tag(no_stops)

part_of_speech_positive[:10]

In [None]:
# allowed_word_types = ["J","R","V"]
# J = adjectives, R = adverbs, and V = verbs

adjectives = ["J"]

In [None]:
all_pos_adjectives = []

for word in part_of_speech_positive:
  if word[1][0] in adjectives:
    all_pos_adjectives.append(word[0])

all_pos_adjectives[:10]

In [None]:
all_neg_adjectives = []
tokenizedb = []
no_stopsb = []

for review in neg_cleaned:
    review_tokens = word_tokenize(review)
    for word in review_tokens:
      tokenizedb.append(word)


for word in tokenizedb:
    if word not in stop_words:
      no_stopsb.append(word)

part_of_speech_negative = nltk.pos_tag(no_stopsb)

for word in part_of_speech_negative:
  if word[1][0] in adjectives:
    all_neg_adjectives.append(word[0])

all_neg_adjectives[:10]

In [None]:
# and voila!

lex = all_pos_adjectives + all_neg_adjectives

### 5.2: Constructing your feature sets

In [None]:
# now onto the cool stuff...to begin our analysis, let's create a list of tuples with
# our text information and pos/neg attributes on either side of each pair.

documents = []

for review in files_positive:
  pos_docs = re.sub(r'[^a-zA-Z\s]', '',review)
  documents.append((pos_docs.lower(), "pos"))

for review in files_negative:
  neg_docs = re.sub(r'[^a-zA-Z\s]', '',review)
  documents.append((neg_docs.lower(), "neg"))

documents[:5]

In [None]:
# in another module, you might take the lexicon we developed earlier, ID the
# most common words and use those strings to train our algorithm.
# let's instead use a tried and true sentiment lexicon...now we are not limited to adjectives

pos_file = open('../data/lexicon/pos.txt', 'r').read()
neg_file = open('../data/lexicon/neg.txt', encoding = 'ISO-8859-1').read()

pos_words = word_tokenize(pos_file)
random.shuffle(pos_words)
neg_words = word_tokenize(neg_file)
random.shuffle(neg_words)

all_sent_words = pos_words + neg_words

In [None]:
len(all_sent_words)

In [None]:
random.shuffle(all_sent_words)
all_sent_words[:10]

In [None]:
# this function builds one 'feature' dictionary per text file, recording
# whether each lexicon word is present in that file.

def find_features(document):
    # a set makes each "w in words" check fast, and the result is the same as with a list
    words = set(word_tokenize(document))
    features = {}
    for w in all_sent_words:
        features[w] = (w in words)
    return features
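
To see what one of these feature dictionaries looks like, here is a toy version that needs no downloads. It assumes whitespace tokenization via `str.split()` instead of `word_tokenize`, and a three-word made-up lexicon standing in for `all_sent_words`. The name `find_features_sketch` is hypothetical.

```python
def find_features_sketch(document, lexicon):
    """Map every lexicon word to True/False depending on whether it occurs in the document."""
    words = set(document.split())  # whitespace tokenization, for illustration only
    return {w: (w in words) for w in lexicon}


toy_lexicon = ["good", "bad", "boring"]  # made-up stand-in for all_sent_words
features = find_features_sketch("a good but boring film", toy_lexicon)
print(features)  # prints: {'good': True, 'bad': False, 'boring': True}
```

The dictionary has one entry per lexicon word no matter how short the document is, so the total work grows with the size of the lexicon multiplied by the number of documents.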

In [None]:
featuresets = [(find_features(rev), category) for (rev, category) in documents]

In [None]:
# if the above cell takes too long for you...use the following code to take a smaller random sample
# of our sentiment lexicon. Any idea why it took so long?

#random.shuffle(all_sent_words)
#all_sent_words = all_sent_words[:5000]

#featuresets = [(find_features(rev), category) for (rev, category) in documents]

In [None]:
# this should explain why the prior cell took so long...

featuresets[:2]

In [None]:
random.shuffle(featuresets)

In [None]:
# and now let's separate our data into training and test cases...

training_set = featuresets[:2500]
testing_set = featuresets[2500:]
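
The shuffle-then-slice pattern above can be wrapped in a small helper, sketched here with plain lists. The function name `split_dataset` and the `seed` parameter are additions for illustration; fixing the seed makes the split reproducible between runs, which the cells above do not do.

```python
import random


def split_dataset(items, train_size, seed=None):
    """Shuffle a copy of `items` and return (training, testing) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


# 3000 stand-in items, mirroring the 1500 positive + 1500 negative reviews
train, test = split_dataset(range(3000), train_size=2500, seed=0)
print(len(train), len(test))  # prints: 2500 500
```

Because every item lands in exactly one of the two lists, the classifier is never scored on a review it was trained on.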

## 6: Sentiment analysis with machine learning

In [None]:
# So how did we do?
# let's discuss what we've done here...

classifier = nltk.NaiveBayesClassifier.train(training_set)

print("Classifier accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)

classifier.show_most_informative_features(25)

Our classifier works pretty well, although your mileage may vary. Depending on the number of lexicon words you were able to incorporate, your accuracy (the percentage of test cases predicted correctly) should be around 80%.
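
Accuracy is simple enough to compute by hand. The sketch below shows the calculation on five made-up predictions; the function name `accuracy` and the example labels are for illustration only and are not taken from the classifier's real output.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


predicted = ["pos", "neg", "pos", "neg", "pos"]  # made-up classifier output
actual = ["pos", "neg", "neg", "neg", "pos"]     # made-up true labels
print(accuracy(predicted, actual) * 100)
```

Four of the five predictions match here, so the function returns 0.8. Keep in mind that accuracy alone can be misleading when one class is much more common than the other, which is not the case for our balanced sample of reviews.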

<br>

There are many other classifying algorithms to try out, but we like this one for today. Let's figure out how to pack it away so we can test new data with it without having to retrain it over and over again. One way to do this is through a process called 'pickling'...yum!

In [None]:
# pickling allows us to store python objects like our classifier in byte format for simple recall
# let's dump our classifier object into a pickle file

import pickle

# wb = write in bytes as opposed to strings...
# the with statement closes the file for us once the classifier has been written
with open("../data/my_algorithm.pickle", "wb") as save_my_algorithm:
    pickle.dump(classifier, save_my_algorithm)

In [None]:
# rb = read bytes in the pickle file
# note: this opens naivebayes.pickle; point the path at my_algorithm.pickle to reload the classifier saved above
with open("../data/naivebayes.pickle", "rb") as my_algorithm:
    classifier = pickle.load(my_algorithm)
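
If you would like to try pickling without a trained classifier, the round trip below uses a plain dictionary as a stand-in and writes to a temporary folder. The object contents and the file name `toy_model.pickle` are made up for illustration.

```python
import os
import pickle
import tempfile

# stand-in for a trained classifier; any picklable python object works the same way
model = {"name": "toy classifier", "weights": [0.1, 0.9]}

path = os.path.join(tempfile.mkdtemp(), "toy_model.pickle")

with open(path, "wb") as f:   # the file is closed automatically when the block ends
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # prints: True
```

Only unpickle files you created yourself or received from a source you trust, since loading a pickle can execute arbitrary code.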

In [None]:
# let's test our algorithm with individual text snippets to see how it performs...

text = all_cleaned[0]
feature = find_features(text)

#and let's read the text to get a feel
print(text)

In [None]:
# last but not least, let's classify the individual feature with our pickled algorithm
# how did it do for you?

classifier.classify(feature)