<h1>Table of contents</h1>

<h4>A. Brief intro to NLP (intro and sources to learn)</h4>

- What is NLP?
- How is this done with programming?
- What is sentiment analysis?
- How is sentiment analysis useful?

<h4>B. Project: Tweet sentiment analysis with NLTK:</h4>

SET UP THE SENTIMENT ANALYSIS WITH NLTK and Naive Bayes classifier
- Explain our approach
- Install packages and explain workspace structure
- Get the data
- Text Classification
- Creating Features
- Train the dataset using Naive Bayes Classifier
- Use the sentiment
- Evaluation


GET LIVE TWEETS FROM TWITTER API
- Applying for Twitter API (tweepy)
- Run live tweets through our sentiment module
- Graph live tweets using Matplotlib

<h1>A. Brief intro to NLP</h1>

NLP is a specialized field of computer science and artificial intelligence with roots in computational linguistics. The idea of Natural Language Processing is to do some form of analysis, or processing, where the machine can understand, at least to some level, what the text means, says, or implies.


Since the start of the digital age, an enormous amount of data has accumulated on the internet, much of it text that contains a wealth of information. However, because of the sheer volume of text and the inherent complexity of processing it, there is a limit to what humans can analyze by hand. These unstructured sources are a potential gold mine, for example in fields from marketing to politics, where it's extremely important to know public opinion.


This is where programming and sentiment analysis come in handy. Sentiment analysis is the extraction of sentiment from text, for example deciding whether a product review leans positive, negative, or neutral. The same classification idea is used to decide whether an email is spam, not spam, or important. Feel free to take a moment to think of other applications you might have come across in your daily life.


The simplest version of sentiment analysis is a binary classification task: <i>classify</i> whether a piece of text expresses a positive or a negative opinion. Consider a movie review: words like <i>great, boring, wonderful, mediocre</i> are extremely helpful hints about where the review is going.


Imagine we have enough of these informative words, each labeled with its class, such as ('great', 'positive'), ('boring', 'negative'), ('wonderful', 'positive'). We could have the computer store this information and calculate the probability that a review like 'I'm so bored I couldn't finish the whole thing' is positive or negative. This is exactly what we are attempting to do in this tutorial.
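
To make the idea concrete, here is a minimal sketch, assuming a tiny hand-made lexicon. The `LABELED_WORDS` dictionary and the `guess_sentiment` function are made up for illustration and are not part of the project. The sketch simply counts how many known positive and negative words appear in a review and picks the majority; the real project replaces this counting with a Naive Bayes classifier.

```python
# Toy lexicon (assumption): each informative word is stored with its label.
LABELED_WORDS = {
    "great": "positive",
    "boring": "negative",
    "wonderful": "positive",
    "bored": "negative",
}

def guess_sentiment(review):
    """Count labeled words in the review and return the majority label."""
    votes = {"positive": 0, "negative": 0}
    for raw_word in review.lower().split():
        word = raw_word.strip(".,!?\"'")  # drop surrounding punctuation
        label = LABELED_WORDS.get(word)
        if label is not None:
            votes[label] += 1
    if votes["positive"] == votes["negative"]:
        return "unknown"  # no evidence either way
    return max(votes, key=votes.get)

print(guess_sentiment("I'm so bored I couldn't finish the whole thing"))
```

A simple vote like this treats every word as equally important, which is one reason a probabilistic classifier usually does better.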


But first, let's review the standard process of an NLP project to get a sense of what we'll be going through:

![NLP workflow](https://miro.medium.com/max/1000/1*BiVCmiQtCBIdBNcaOKjurg.png)

<i> source: https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72 </i>

If you love this topic, here are some great resources I found super helpful for us beginners: 
- An entry-level project: https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72

- A tutorial series: https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/

- Linguistics crash course: (video series) https://www.youtube.com/watch?v=eDop3FDoUzk


<h1>B. Project: Tweet sentiment analysis with NLTK:</h1>

As stated above, we're going to build a basic binary sentiment analyzer. Our module will be able to extract the sentiment of a sentence or a small paragraph of text, and decide if it's positive (classification: pos) or negative (classification: neg). 
    
Next, we get live tweets (how exciting!), run those tweets through our module and, at the same time, visualize the pos/neg trend of the tweets in a live graph.

<h3>EXPLAIN OUR APPROACH</h3>  

We'll be given a labeled dataset of roughly 10,000 positive and negative movie reviews. 

For each category (positive or negative), we'll break the reviews into a bag-of-words. There'll be 2 bags in total, 1 positive and 1 negative. Each bag is an unordered collection of words that keeps only how often each word occurs in the documents; word positions are ignored for now.

![NLP bags-of-words](https://northeastern-my.sharepoint.com/:i:/r/personal/pham_phuon_northeastern_edu/Documents/nlp-intro-pics/bags-of-words.png?csf=1&web=1&e=Lrwihd)

<i> source: https://web.stanford.edu/~jurafsky/slp3/4.pdf </i>
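
Here is a small sketch of the two-bags idea, using Python's built-in `collections.Counter`. The four mini reviews are made up for illustration (they are not from the real dataset), and the helper name `bag_of_words` is my own.

```python
from collections import Counter

# Two made-up mini "corpora", one per category (assumption: not the real dataset).
positive_reviews = ["a wonderful and funny film", "funny , warm and wonderful"]
negative_reviews = ["a boring and flat film", "boring , loud and flat"]

def bag_of_words(reviews):
    """Return word -> frequency, ignoring word order and position."""
    bag = Counter()
    for review in reviews:
        bag.update(review.split())
    return bag

pos_bag = bag_of_words(positive_reviews)
neg_bag = bag_of_words(negative_reviews)

# "wonderful" only shows up in the positive bag, "and" shows up equally in both.
print(pos_bag["wonderful"], neg_bag["wonderful"])
print(pos_bag["and"], neg_bag["and"])
```

Words like "and" that are equally common in both bags tell us nothing about the category, while words like "wonderful" that lean heavily to one side are the useful ones.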

We'll pick out and mark the most informative words, and send the whole thing through the Naive Bayes Classifier to train our model. The classifier basically checks whether each word shows up in pos, neg, or both. If a word appears significantly more often in one category, it's very likely a strong signal for that category. We'll use that for our prediction.

Alright, let's get to work.

<h3>Install things and introduce file structure</h3>  

Obviously, we'll use Python for this.

The following modules will be needed: Jupyter, Matplotlib, NLTK.

Usually, a command line { pip install jupyterlab nltk matplotlib } or { pip3 install jupyterlab nltk matplotlib } will do.

If it doesn't work for you, please refer to the installation guides:
- jupyterlab: https://jupyter.org/install
- nltk: https://www.nltk.org/install.html
- matplotlib: https://matplotlib.org/users/installing.html

And here's a brief introduction of our file structure:

- intro-nltk.ipynb : We use this Jupyter notebook to explore the dataset, set up features and train our model.
- sentiment_mod.py : Our trained model will live in here. This will act as a module, so that we can import it and use it in other projects. 
- stream.py : The code to connect to the Twitter API. This will run constantly as we stream live tweets in.
- graph.py : The code for graphing. This file will run in parallel with stream.py to graph the trend of live tweets.

Alright, ready to pack our bags of words?

In your working folder, open Jupyter Notebook and create a new file. Let's call it <b>intro-nltk.ipynb</b> and import nltk to start with.

In [53]:
#intro-nltk.ipynb

import nltk

<h3>Get the data</h3>  

I mentioned we are given a labeled dataset. Yes, here you go: https://northeastern-my.sharepoint.com/:f:/g/personal/pham_phuon_northeastern_edu/Ep1e6LFSz6tAk95KQKmZz-sBFux8oYSOtggonMRSU1kOUg?e=U6JTTy

In your working folder, create a folder called "dataset" and put those files in there

In [75]:
with open("dataset/positive.txt", "r") as pos_file:
    short_pos = pos_file.read()
with open("dataset/negative.txt", "r") as neg_file:
    short_neg = neg_file.read()

<h3>Text Classification</h3>

The goals here are to:
- Set up the dataset or featuresets for training and testing: collect all reviews, each review needs to be marked with either "pos" or "neg"
- Prep for feature selection: pick out the 5000 most frequent words.

Let's first create empty variables as holders:

In [76]:
#intro-nltk.ipynb

#a list of tuples. Each tuple contains the actual review and the label, for example: ("This is a nice movie", "pos")
documents = []

#all single words in our dataset
all_words = [] 

#Filter only Adjectives
allowed_word_types = ["J"] 

Note: here I chose to use adjectives only, but feel free to combine them with other word types, like "N" for nouns or "V" for verbs; or just delete the filter and take all word types.

In [77]:
#intro-nltk.ipynb

for p in short_pos.split("\n"):
    documents.append( (p, "pos") )
    words = nltk.word_tokenize(p) #break into words only
    pos = nltk.pos_tag(words) #part-of-speech tagging - tagging the type of each word, ex: Adjective, Verb, Noun, ...
    for w in pos:
        if w[1][0] in allowed_word_types:
            all_words.append(w[0].lower())
            
for n in short_neg.split("\n"):
    documents.append((n, "neg"))
    words = nltk.word_tokenize(n)
    neg = nltk.pos_tag(words)
    for w in neg:
        if w[1][0] in allowed_word_types:
            all_words.append(w[0].lower())
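
The filtering step in the loops above can be seen in isolation. This is a small sketch that uses a hand-written list in the same (word, tag) shape that `nltk.pos_tag` returns, so it runs without NLTK. The tags follow the Penn Treebank convention, where adjective tags start with "J" (JJ, JJR, JJS). The sample sentence, its tags, and the helper name `keep_allowed` are my own assumptions, not real tagger output.

```python
# Hand-written (word, tag) pairs, mimicking the shape of nltk.pos_tag output.
tagged = [
    ("The", "DT"), ("plot", "NN"), ("is", "VBZ"),
    ("Predictable", "JJ"), ("but", "CC"), ("the", "DT"),
    ("acting", "NN"), ("is", "VBZ"), ("better", "JJR"),
]

allowed_word_types = ["J"]  # first letter of the tags we want to keep

def keep_allowed(tagged_words, allowed):
    """Keep lowercased words whose tag starts with an allowed letter."""
    return [word.lower() for word, tag in tagged_words if tag[0] in allowed]

print(keep_allowed(tagged, allowed_word_types))  # adjectives only
print(keep_allowed(tagged, ["J", "N"]))          # adjectives and nouns
```

Checking only the first letter of the tag is what lets a single "J" cover plain, comparative, and superlative adjectives at once.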

In [78]:
#intro-nltk.ipynb

len(documents) 

10664

In [79]:
#intro-nltk.ipynb

documents[0]

('the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . ',
 'pos')

In [80]:
#intro-nltk.ipynb

len(all_words)

26287

 <h3>Creating Features</h3>
 

Remember Zak's explanation of features in the first meetup? They are basically characteristics of objects that our model can rely on for classification tasks. For fish, it can be their weight, color, eye color, ... But what do we use as features in NLP, when all we have is text? It turns out that we can do a lot with text as features. In this project in particular, we can select the most frequent words, say 5000 of them, count their frequencies, and compute for each word the fraction of times it appears among all words in all documents of each category.
 

For example, say "wonderful" is among the top 5000 most frequent words, and our training spotted it in 10 positive reviews but only 1 negative review. If we then come across a piece of text like "the storyline is wonderful!", it's very likely a positive comment.
 


Now let's use nltk to find out the frequency of all words

In [84]:
#intro-nltk.ipynb

#count how many times each word appears, and store the counts in a dictionary-like FreqDist.
all_words = nltk.FreqDist(all_words)

#find out the most common 15 words
all_words.most_common(15)

[('good', 369),
 ('more', 331),
 ('little', 265),
 ('funny', 245),
 ('much', 234),
 ('bad', 234),
 ('best', 208),
 ('new', 206),
 ('own', 185),
 ('many', 183),
 ('most', 167),
 ('other', 167),
 ('great', 160),
 ('big', 156),
 ('few', 139)]

Curious about any particular word? Just check them out

In [83]:
#intro-nltk.ipynb

# check the number of times a specific word appears
all_words["ridiculous"]

18

Now back to our features: let's go on and pick out the 5000 most frequent words.

In [85]:
#intro-nltk.ipynb

#create a list to put those features in
word_features = []
for i in all_words.most_common(5000):
    word_features.append(i[0])

Take a look in there to make sure they are good stuff. 

In [46]:
#intro-nltk.ipynb

word_features[:10]

['good',
 'more',
 'little',
 'funny',
 'much',
 'bad',
 'best',
 'new',
 'own',
 'many']

Alright, cool. Now let's look for those word features in each document and mark whether they appear, so that later our classifier can learn whether they show up in the pos or neg category.

In [87]:
#intro-nltk.ipynb

#send a document in, and get back a dictionary that says, for each feature word, whether it appears in the document
def find_features(document):
    words = set(nltk.word_tokenize(document)) #a set makes each lookup fast
    features = {} 
    for word in word_features:
        features[word] = (word in words) #True/False 
    return features

featuresets = [(find_features(rev), category) for (rev,category) in documents]


Now we have a collection of featuresets: a list of tuples. 
Each featureset is a 2-element tuple: 
- the first element is the dictionary returned from find_features
- the second element is the label: "pos"/"neg"
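
To see that structure at a glance, here is a toy version that runs on its own. The names `toy_word_features`, `toy_documents` and `toy_find_features` are hypothetical stand-ins for the real objects, and a plain whitespace split replaces `nltk.word_tokenize` so that nothing needs to be downloaded.

```python
# Hypothetical stand-ins for the real word_features and documents.
toy_word_features = ["good", "bad", "funny"]
toy_documents = [("a good and funny movie", "pos"), ("a bad movie", "neg")]

def toy_find_features(document, feature_words):
    """Map every feature word to True/False depending on whether it occurs."""
    words = set(document.split())  # whitespace split instead of nltk.word_tokenize
    return {word: (word in words) for word in feature_words}

# One (features dictionary, label) tuple per document.
toy_featuresets = [
    (toy_find_features(text, toy_word_features), label)
    for text, label in toy_documents
]

for features, label in toy_featuresets:
    print(features, label)
```

Every dictionary has the same keys (the feature words), so the classifier can compare reviews feature by feature.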

Let's just pick one out and see.

In [88]:
#intro-nltk.ipynb

featuresets[0]

({'good': False,
  'more': False,
  'little': False,
  'funny': False,
  'much': False,
  'bad': False,
  'best': False,
  'new': True,
  'own': False,
  'many': False,
  'most': False,
  'other': False,
  'great': False,
  'big': False,
  'few': False,
  'first': False,
  'real': False,
  'i': False,
  'better': False,
  'full': False,
  'such': False,
  'romantic': False,
  'american': False,
  'old': False,
  'same': False,
  'original': False,
  'human': False,
  'hard': False,
  '[': False,
  'interesting': False,
  'young': False,
  'enough': False,
  'emotional': False,
  'least': False,
  'long': False,
  'last': False,
  'cinematic': False,
  'true': False,
  'entertaining': False,
  'high': False,
  'special': False,
  'predictable': False,
  ']': False,
  'visual': False,
  'familiar': False,
  'whole': False,
  'comic': False,
  'enjoyable': False,
  'sweet': False,
  'narrative': False,
  'less': False,
  'short': False,
  'worst': False,
  'strong': False,
  'only': False

Looks good enough. And that's it, our featuresets.

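Now one last important step: let's shuffle them before sending them through the classifier. Right now the list holds all the positive reviews first and then all the negative ones, so without shuffling, a split into training and testing sets would leave only negative reviews in the testing part.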

In [89]:
#intro-nltk.ipynb

import random
random.shuffle(featuresets)

<h3>Train the dataset using Naive Bayes Classifier</h3>

There are many Bayes classifiers, and the Naive Bayes Classifier is probably the simplest one, as you can guess from the name.

Naive Bayes is a probabilistic classifier. At a high level, the classifier takes a text as input, breaks it down into single words, computes the likelihood of each word (from the bags-of-words) independently, and then combines them into the likelihood of the whole text to give us a result: either pos or neg. 

The computation behind Naive Bayes classifiers is fairly simple, so it's fast and considered easy to scale. This material is by far the best in-depth explanation of the Naive Bayes Classifier I've found: http://web.stanford.edu/~jurafsky/slp3/4.pdf (Thanks Zak for sharing!)
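
If you'd like to see the arithmetic, here is a minimal from-scratch sketch of a word-count Naive Bayes with add-one smoothing, in the spirit of the chapter linked above. It is not how NLTK implements its classifier (NLTK works on the True/False feature dictionaries and uses its own smoothing), and the four training reviews and the function names are made up for illustration.

```python
import math
from collections import Counter

# Made-up training data (assumption: four tiny reviews, not the real dataset).
train = [
    ("wonderful funny film", "pos"),
    ("warm and wonderful", "pos"),
    ("boring flat film", "neg"),
    ("loud and boring", "neg"),
]

def train_naive_bayes(examples):
    """Collect per-class document counts and per-class word counts."""
    class_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for text, label in examples:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.split():
            counts[word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary

def classify(text, class_counts, word_counts, vocabulary):
    """Return the class with the highest log prior + summed log likelihoods."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log P(class)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocabulary:
                continue  # unknown words carry no evidence
            # add-one (Laplace) smoothing avoids log(0) for unseen pairs
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

model = train_naive_bayes(train)
print(classify("a wonderful film", *model))
print(classify("so boring and loud", *model))
```

The "naive" part is the loop over words: each word adds its own log likelihood to the score without looking at its neighbours. Working with logs keeps the numbers from shrinking to zero when many small probabilities are multiplied.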

Back to our work. Let's split our ready-to-go featuresets into a training set and a testing set. 

In [90]:
#intro-nltk.ipynb

#rule of thumb 80/20
training_set = featuresets[:8530]
testing_set = featuresets[8530:]

<b>training_set</b> is what the machine trains on. Here, the machine knows whether each review is negative or positive.

<b>testing_set</b> is what the machine is tested on. Here, the machine doesn't see the labels.
The accuracy method gives the machine the reviews without their labels, lets it guess each one,
then compares the guesses against the true labels and returns the fraction it got right.
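
As a rough sketch of what happens in this step, the split and the accuracy check can be written by hand. The helper names `split_dataset` and `accuracy`, the made-up number data, and the `always_pos` predictor are all assumptions for illustration; they only mirror the idea, not NLTK's code.

```python
def split_dataset(items, train_ratio=0.8):
    """Cut a (shuffled) list into a training part and a testing part."""
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]

def accuracy(predict, labeled_items):
    """Fraction of items whose predicted label matches the true label."""
    correct = sum(1 for features, label in labeled_items if predict(features) == label)
    return correct / len(labeled_items)

# Made-up labeled data: the "feature" is a number, the label is its sign.
data = [(n, "pos" if n > 0 else "neg") for n in [3, -1, 4, -1, 5, -9, 2, -6, 5, -3]]
training, testing = split_dataset(data)

def always_pos(features):
    """A deliberately lazy, hypothetical predictor: calls everything 'pos'."""
    return "pos"

print(len(training), len(testing))
print(accuracy(always_pos, testing))
```

Computing the cut from a ratio instead of hard-coding an index means the split keeps working if you later swap in a dataset of a different size.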

In [91]:
#intro-nltk.ipynb

classifier = nltk.NaiveBayesClassifier.train(training_set)
accuracy = nltk.classify.accuracy(classifier, testing_set)

Let's find out how our model does in terms of accuracy. Note that the accuracy can change slightly each time we run it, because the shuffle puts different reviews in the training and testing sets.

In [92]:
#intro-nltk.ipynb

accuracy_percentage = accuracy*100
accuracy_percentage

73.33645735707591

This is called supervised machine learning, because we're showing the machine data, and telling it "hey, this data is positive," or "this data is negative." Then, after that training is done, we show the machine some new data and ask the computer, based on what we taught the computer before, what the computer thinks the category of the new data is.

In [93]:
#intro-nltk.ipynb

#check the most informative words in deciding whether a review is either positive or negative. interesting.
classifier.show_most_informative_features(20)

Most Informative Features
              engrossing = True              pos : neg    =     20.3 : 1.0
                    warm = True              pos : neg    =     18.9 : 1.0
                  boring = True              neg : pos    =     17.1 : 1.0
                powerful = True              pos : neg    =     13.8 : 1.0
                 routine = True              neg : pos    =     13.7 : 1.0
                    loud = True              neg : pos    =     13.0 : 1.0
                    flat = True              neg : pos    =     12.6 : 1.0
                  unique = True              pos : neg    =     12.3 : 1.0
              unexpected = True              pos : neg    =     11.6 : 1.0
              delightful = True              pos : neg    =     11.6 : 1.0
              refreshing = True              pos : neg    =     11.6 : 1.0
               wonderful = True              pos : neg    =     10.6 : 1.0
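
To get a feel for where ratios like these come from, here is a rough sketch that compares how often a word shows up in each category. The counts are hypothetical and the helper `likelihood_ratio` is my own; NLTK smooths its probability estimates, so its numbers will not match a plain division like this exactly.

```python
def likelihood_ratio(count_pos, total_pos, count_neg, total_neg):
    """Ratio of the larger per-class rate to the smaller, plus which class wins.

    Assumes both counts are non-zero, otherwise the division would fail.
    """
    rate_pos = count_pos / total_pos
    rate_neg = count_neg / total_neg
    if rate_pos >= rate_neg:
        return "pos : neg", rate_pos / rate_neg
    return "neg : pos", rate_neg / rate_pos

# Hypothetical counts: a word seen in 40 of 4000 positive and 4 of 4000 negative reviews.
direction, ratio = likelihood_ratio(40, 4000, 4, 4000)
print(f"{direction} = {ratio:.1f} : 1.0")
```

Read a row of the table the same way: a word with "pos : neg = 20.3 : 1.0" is about twenty times more likely to appear in a positive review than in a negative one.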
                   minor = True              pos : neg    =      9.6 : 1.0

<h3>Test our model manually</h3>

Let's create a function that takes any text as input, and see what it returns.

In [94]:
#intro-nltk.ipynb

def sentiment(text):
    feats = find_features(text)
    return classifier.classify(feats)

In [95]:
#intro-nltk.ipynb

sentiment("It was awesome!")

'pos'

In [96]:
#intro-nltk.ipynb

sentiment("I hate it")

'neg'

In [97]:
#intro-nltk.ipynb

sentiment("AMZN do you like it")

'neg'

<h4>Some comments</h4>

- the method relies on whether specific words appear more often in neg or pos reviews, and predicts based on those words alone
- neutral sentences tend to be classified as negative

<h3>Save feature sets and module</h3>

Now we have our module all set up. The next important thing is to save all of the above steps in a .py file, so that we can later import it and use it somewhere else. 


Luckily, we won't have to copy and rerun everything in the .py file. We can skip some steps thanks to pickle, a Python module for object serialization. Pickle lets us save our progress in a .pickle file, so that we can load it quickly later. 


Pickle is part of Python's standard library, so there is nothing to install: a plain import pickle is enough.


The objects we'll pickle here are:
- documents
- word_features

In your working folder, create a folder "pickled" to save the .pickle files.

In [98]:
#intro-nltk.ipynb

import pickle

#create and open a pickle file for writing in binary mode; "with" closes it when done
with open("pickled/documents.pickle", "wb") as save_documents:
    pickle.dump(documents, save_documents) #put the documents in the pickle file

with open("pickled/word_features.pickle", "wb") as save_word_features:
    pickle.dump(word_features, save_word_features)
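
If you want to convince yourself that pickling really gives back the same object, here is a self-contained round trip. It writes to a throwaway temporary folder instead of the real "pickled" folder, and `sample_features` is just a made-up stand-in for word_features.

```python
import os
import pickle
import tempfile

# Stand-in for the real word_features list (assumption: just three words).
sample_features = ["good", "bad", "funny"]

# Use a throwaway directory so the sketch does not touch the real "pickled" folder.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "word_features.pickle")

    # "wb" = write bytes; the with-block closes the file even if dump fails.
    with open(path, "wb") as handle:
        pickle.dump(sample_features, handle)

    # "rb" = read bytes; load rebuilds an equal Python object.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == sample_features)
```

The restored list is a new object with the same content, which is exactly what we need when another script loads our saved work.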

That's all for our intro-nltk.ipynb file.

In your working folder, create a file sentiment_mod.py.

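Go ahead and put in the following code; it's just the work we did above. After that, we're done saving the module in sentiment_mod.py too. Note that the module rebuilds the featuresets and retrains the classifier each time it is imported, so the import takes a moment.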

In [None]:
#sentiment_mod.py

import pickle
import nltk
import random

#load the labeled reviews
with open("pickled/documents.pickle", "rb") as documents_f:
    documents = pickle.load(documents_f)

#load the 5000 most frequent words
with open("pickled/word_features.pickle", "rb") as word_features5k_f:
    word_features = pickle.load(word_features5k_f)


def find_features(document):
    words = set(nltk.word_tokenize(document)) #a set makes each lookup fast
    features = {}
    for word in word_features:
        features[word] = (word in words) 
    
    return features

featuresets = [(find_features(rev), category) for (rev,category) in documents]
random.shuffle(featuresets)

#split into a training set and a testing set (roughly 80/20)
training_set = featuresets[:8530]
testing_set = featuresets[8530:]

classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Original Naive Bayes Algo accuracy: ", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

def sentiment(text):
    feats = find_features(text)
    return classifier.classify(feats)

print(sentiment("It was awesome!"))

<h3>GET LIVE TWEETS FROM TWITTER API</h3>

The Twitter streaming API lets us connect to about 1% of all tweets. That is 1% of roughly 500 million tweets per day, which is still a lot of tweets. 

First, you need to connect to the API using your Twitter account and get the following info:
- ckey (consumer key)
- csecret (consumer secret)
- atoken (access token)
- asecret (access token secret)

Unfortunately, I can't share with you my keys. But here's a helpful tutorial to help you get yours: https://pythonprogramming.net/twitter-api-streaming-tweets-python-tutorial/

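When you're done, create a file called stream.py in your working folder and insert the following code.

Don't run this code in your Jupyter notebook; run it as a regular script instead. The script will:

- Connect to the Twitter API to get access to live tweets
- Get the tweets that contain a given keyword and run each one through our sentiment module
- Save the sentiment of each tweet in a .txt file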
In [129]:
# stream.py
#note: StreamListener is part of tweepy 3.x; it was removed in tweepy 4

from tweepy import Stream, OAuthHandler
from tweepy.streaming import StreamListener
import json
import sentiment_mod as s

#replace these placeholders with your own keys and tokens, and keep them private
ckey="YOUR_CONSUMER_KEY"
csecret="YOUR_CONSUMER_SECRET"
atoken="YOUR_ACCESS_TOKEN"
asecret="YOUR_ACCESS_TOKEN_SECRET"


class listener(StreamListener):

    def on_data(self, data):
        #parse the raw JSON message sent by the stream
        all_data = json.loads(data)
        tweet = all_data.get("text")
        if tweet is None:
            return True #not a tweet (e.g. a notice), so skip it and keep listening

        #run the tweet through our sentiment module
        sentiment_value = s.sentiment(tweet)
        print(tweet)
        print(sentiment_value)

        #append the sentiment (not the tweet itself) to a .txt file
        with open("out.txt","a") as output:
            output.write(sentiment_value)
            output.write('\n')

        return True

    def on_error(self, status):
        print(status)

auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)

twitterStream = Stream(auth, listener())
twitterStream.filter(track=["election"]) #<----- change the word to any word that you're curious about.
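
One thing to keep in mind: not every message that arrives on the stream is a tweet, and some are notices without a "text" field. Here is a small sketch of reading the text defensively with `dict.get`. The function name `extract_text` and the two sample payloads are made up for illustration; real messages carry many more fields.

```python
import json

def extract_text(raw_message):
    """Return the tweet text, or None when the message is not a tweet."""
    payload = json.loads(raw_message)
    return payload.get("text")

# Made-up payloads in the general shape the stream sends (assumption).
tweet_message = '{"text": "what an election night", "id": 1}'
notice_message = '{"limit": {"track": 12}}'

print(extract_text(tweet_message))
print(extract_text(notice_message))
```

Returning None for non-tweets lets the caller skip them instead of crashing the stream with a KeyError.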

Now that we've got our stream going, here are some notes you should pay attention to:
- Twitter does limit your connection; it's a free API after all. Just google the error code it gives you. Most of the time, I just waited for a while and ran stream.py again. 
- Our next step is to visualize the results we get from the stream, so you will want to have stream.py up and running while we also run graph.py, which we're going to talk about now.
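
The "wait a while and try again" habit from the first note can be automated. Below is a generic sketch of retrying with a growing wait. It is not tweepy-specific: `run_with_backoff` and `flaky_connect` are hypothetical names, `ConnectionError` is just a stand-in for whatever failure your stream reports, and the sleep function is swapped for a list so the example finishes instantly.

```python
import time

def run_with_backoff(connect, max_attempts=5, first_wait=1.0, sleep=time.sleep):
    """Call connect(); on failure wait, double the wait, and try again."""
    wait = first_wait
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError as error:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            print(f"attempt {attempt} failed ({error}); waiting {wait} s")
            sleep(wait)
            wait *= 2

# Hypothetical connection that fails twice before succeeding.
calls = {"count": 0}

def flaky_connect():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("rate limited")
    return "connected"

waits = []  # record the waits instead of really sleeping
print(run_with_backoff(flaky_connect, sleep=waits.append))
```

Doubling the wait after each failure is a common way to back off politely, so we don't hammer a service that has already asked us to slow down.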

<h2>Graphing Live Twitter Sentiment Analysis</h2>

We'll use Matplotlib, a very popular tool for visualization. I won't explain much about how it works here, but the idea is that we have the results saved in the out.txt file, with each result on its own line. Just open your out.txt file and check if you need to.

Our plan to graph the trend is like this:
- if positive, the line will go up 1. 
- if negative, the line will go down 1.
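
The plan above amounts to keeping a running total. Here it is as a small standalone function, without any plotting, so you can check the numbers before drawing them. The name `running_score` and the sample labels are made up for illustration.

```python
def running_score(labels):
    """Turn a list of 'pos'/'neg' labels into cumulative y values."""
    xs, ys = [], []
    total = 0
    for position, label in enumerate(labels, start=1):
        if "pos" in label:
            total += 1
        elif "neg" in label:
            total -= 1
        xs.append(position)
        ys.append(total)
    return xs, ys

# Made-up content in the shape of out.txt: one label per line.
sample = "pos\npos\nneg\npos\nneg\nneg\nneg"
xs, ys = running_score(sample.split("\n"))
print(xs)
print(ys)
```

A line that climbs means positive tweets are winning, and a line that falls means negative ones are.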

Simple enough. Go ahead and create a file called graph.py in your working folder and insert the following code.

<b>Keep stream.py running in parallel with graph.py</b>

In [31]:
#graph.py

import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib import style
import time

style.use("ggplot")

fig = plt.figure()
ax1 = fig.add_subplot(1,1,1)

def animate(i):
    #read the latest results; "with" closes the file after reading
    with open("out.txt","r") as graph_data:
        pullData = graph_data.read()
    lines = pullData.split('\n')

    xar = []
    yar = []

    x = 0
    y = 0

    for l in lines[-200:]:
        x += 1
        if "pos" in l:
            y += 1
        elif "neg" in l:
            y -= 1

        xar.append(x)
        yar.append(y)
        
    ax1.clear()
    ax1.plot(xar,yar)
    
ani = animation.FuncAnimation(fig, animate, interval=1000)
plt.show()

<Figure size 640x480 with 1 Axes>

Phew. My part's done, and there are several things we can elaborate on based on this simple project. Here are some of my ideas for us:
- Use other datasets
- Mix and test some variables: for example, word types in the featuresets, the ratio of training set and testing set
- Use some other classifiers in scikit-learn (if you are familiar with scikit-learn), test the result for each classifier, and maybe build a combined classifier?
- ...
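
For the combined classifier idea in the list above, one simple option is a majority vote. This is only a sketch: the three "classifiers" are hypothetical plain functions rather than trained models, and `majority_vote` is my own helper name. In a real version each one would be a trained classifier's classify call.

```python
from collections import Counter

def majority_vote(classifiers, text):
    """Ask every classifier for a label; return the winner and its vote share."""
    votes = Counter(classify(text) for classify in classifiers)
    label, count = votes.most_common(1)[0]
    return label, count / len(classifiers)

# Three hypothetical stand-in classifiers (plain functions, not trained models).
def says_pos_if_good(text):
    return "pos" if "good" in text else "neg"

def says_neg_if_bad(text):
    return "neg" if "bad" in text else "pos"

def says_pos_if_exclaims(text):
    return "pos" if "!" in text else "neg"

label, confidence = majority_vote(
    [says_pos_if_good, says_neg_if_bad, says_pos_if_exclaims], "a good movie"
)
print(label, round(confidence, 2))
```

Using an odd number of voters avoids ties between pos and neg, and the vote share gives a rough confidence we could use to ignore uncertain tweets.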