# Data Science - Introduction to Sentiment Analysis Workshop

This workshop covers an important and interesting field in data science: sentiment analysis! Sentiment analysis has many applications, including customer support analysis, market research, and brand monitoring. The big question is: how can we extract useful information and insights from seemingly plain text?

### What we'll cover
- What is sentiment analysis? Why sentiment analysis?
- Getting started (importing libraries, dataset)
- Data preprocessing (cleaning, transforming)
- Classification Modeling
- Evaluation
- Implementation

## Import libraries

Let's start by importing all necessary libraries. The main ones we'll need are pandas, nltk, and scikit-learn.

In [1]:
import pandas as pd #helps us view, store, and process our data
import nltk #helpful NLP-specific functions and libraries
import sklearn #helps us set up, train, and evaluate our model

Let's adjust the display settings so we can see the full text within the dataframe.

In [2]:
pd.set_option('display.max_columns', None)  # or 1000
pd.set_option('display.max_rows', None)  # or 1000
pd.set_option('display.max_colwidth', None)  # or 199

## Now let's import our dataset

Today, we'll be predicting the sentiment of IMDB movie reviews.

In [3]:
data = pd.read_csv("movie.csv")

Let's see what our dataset looks like

In [4]:
data.head()

Unnamed: 0,text,label
0,"I grew up (b. 1965) watching and loving the Thunderbirds. All my mates at school watched. We played ""Thunderbirds"" before school, during lunch and after school. We all wanted to be Virgil or Scott. No one wanted to be Alan. Counting down from 5 became an art form. I took my children to see the movie hoping they would get a glimpse of what I loved as a child. How bitterly disappointing. The only high point was the snappy theme tune. Not that it could compare with the original score of the Thunderbirds. Thankfully early Saturday mornings one television channel still plays reruns of the series Gerry Anderson and his wife created. Jonatha Frakes should hand in his directors chair, his version was completely hopeless. A waste of film. Utter rubbish. A CGI remake may be acceptable but replacing marionettes with Homo sapiens subsp. sapiens was a huge error of judgment.",0
1,"When I put this movie in my DVD player, and sat down with a coke and some chips, I had some expectations. I was hoping that this movie would contain some of the strong-points of the first movie: Awsome animation, good flowing story, excellent voice cast, funny comedy and a kick-ass soundtrack. But, to my disappointment, not any of this is to be found in Atlantis: Milo's Return. Had I read some reviews first, I might not have been so let down. The following paragraph will be directed to those who have seen the first movie, and who enjoyed it primarily for the points mentioned.<br /><br />When the first scene appears, your in for a shock if you just picked Atlantis: Milo's Return from the display-case at your local videoshop (or whatever), and had the expectations I had. The music feels as a bad imitation of the first movie, and the voice cast has been replaced by a not so fitting one. (With the exception of a few characters, like the voice of Sweet). The actual drawings isnt that bad, but the animation in particular is a sad sight. The storyline is also pretty weak, as its more like three episodes of Schooby-Doo than the single adventurous story we got the last time. But dont misunderstand, it's not very good Schooby-Doo episodes. I didnt laugh a single time, although I might have sniggered once or twice.<br /><br />To the audience who haven't seen the first movie, or don't especially care for a similar sequel, here is a fast review of this movie as a stand-alone product: If you liked schooby-doo, you might like this movie. If you didn't, you could still enjoy this movie if you have nothing else to do. And I suspect it might be a good kids movie, but I wouldn't know. It might have been better if Milo's Return had been a three-episode series on a cartoon channel, or on breakfast TV.",0
2,"Why do people who do not know what a particular time in the past was like feel the need to try to define that time for others? Replace Woodstock with the Civil War and the Apollo moon-landing with the Titanic sinking and you've got as realistic a flick as this formulaic soap opera populated entirely by low-life trash. Is this what kids who were too young to be allowed to go to Woodstock and who failed grade school composition do? ""I'll show those old meanies, I'll put out my own movie and prove that you don't have to know nuttin about your topic to still make money!"" Yeah, we already know that. The one thing watching this film did for me was to give me a little insight into underclass thinking. The next time I see a slut in a bar who looks like Diane Lane, I'm running the other way. It's child abuse to let parents that worthless raise kids. It's audience abuse to simply stick Woodstock and the moonlanding into a flick as if that ipso facto means the film portrays 1969.",0
3,"Even though I have great interest in Biblical movies, I was bored to death every minute of the movie. Everything is bad. The movie is too long, the acting is most of the time a Joke and the script is horrible. I did not get the point in mixing the story about Abraham and Noah together. So if you value your time and sanity stay away from this horror.",0
4,"Im a die hard Dads Army fan and nothing will ever change that. I got all the tapes, DVD's and audiobooks and every time i watch/listen to them its brand new. <br /><br />The film. The film is a re run of certain episodes, Man and the hour, Enemy within the gates, Battle School and numerous others with a different edge. Introduction of a new General instead of Captain Square was a brilliant move - especially when he wouldn't cash the cheque (something that is rarely done now).<br /><br />It follows through the early years of getting equipment and uniforms, starting up and training. All in all, its a great film for a boring Sunday afternoon. <br /><br />Two draw backs. One is the Germans bogus dodgy accents (come one, Germans cant pronounced the letter ""W"" like us) and Two The casting of Liz Frazer instead of the familiar Janet Davis. I like Liz in other films like the carry ons but she doesn't carry it correctly in this and Janet Davis would have been the better choice.",1


# Data Preprocessing

Data preprocessing is an important step for any data science task. To make our data usable by our model, we need to go through a few data cleaning and transforming steps.

First, let's import the data processing modules we need from our libraries.

In [22]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.tokenize import RegexpTokenizer

Now let's create a RegexpTokenizer. Its pattern keeps only runs of letters and digits, which helps us get rid of any special characters (#, $, &, etc.) in our text.

In [6]:
token = RegexpTokenizer(r'[a-zA-Z0-9]+')

In [20]:
token

RegexpTokenizer(pattern='[a-zA-Z0-9]+', gaps=False, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL)
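
To see what this pattern actually keeps, here is a minimal sketch that uses Python's built-in `re` module as a stand-in. With `gaps=False`, the tokenizer returns every match of the pattern, so `re.findall` should give the same tokens here. The helper name `simple_tokenize` and the sample sentence are made up for illustration.

```python
import re

# Same pattern the workshop passes to RegexpTokenizer: runs of letters and digits.
PATTERN = r'[a-zA-Z0-9]+'

def simple_tokenize(text):
    """Hypothetical stand-in for token.tokenize: return every run of letters/digits."""
    return re.findall(PATTERN, text)

# Made-up sample with punctuation, symbols, and the HTML line breaks seen in the dataset.
sample = "Great film!<br /><br />Loved it: 10/10 & worth $5."
tokens = simple_tokenize(sample)
print(tokens)
# ['Great', 'film', 'br', 'br', 'Loved', 'it', '10', '10', 'worth', '5']
```

Notice that the punctuation and symbols are gone, but the `br` from each `<br />` tag survives as a token. If that matters for your use case, you may want to strip HTML tags before tokenizing.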

Let's instantiate a CountVectorizer, passing in our tokenizer from earlier.

In [25]:
cv = CountVectorizer(stop_words='english', ngram_range=(1, 1), tokenizer=token.tokenize)
# tfidf = TfidfVectorizer(stop_words='english', ngram_range=(1, 1), tokenizer=token.tokenize)

Now that we have our data and count vectorizer all set up, let's vectorize our data! This converts each review into a row of integer word counts, with one column per word in the vocabulary, which is a form the model can work with.

In [26]:
text_counts = cv.fit_transform(data['text'])
# text_counts_tfidf = tfidf.fit_transform(data['text'])

In [35]:
len(text_counts[0].toarray()[0])  # number of columns, i.e. the size of the vocabulary

92082

The last step in data preprocessing is splitting our data into training and test sets. We can use a very helpful function from scikit-learn to accomplish this.

In [27]:
#Splitting the data into training and testing
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, data['label'], test_size=0.25, random_state=5)
# X_train, X_test, Y_train, Y_test = train_test_split(text_counts_tfidf, data['label'], test_size=0.25, random_state=5)
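
Here is a small sketch of what `test_size` and `random_state` do, using eight made-up samples instead of our review matrix. All the `toy_` names are placeholders for illustration.

```python
from sklearn.model_selection import train_test_split

# Eight made-up samples with alternating labels.
toy_X = [[i] for i in range(8)]
toy_y = [0, 1, 0, 1, 0, 1, 0, 1]

# test_size=0.25 holds out a quarter of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.25, random_state=5)
print(len(X_tr), len(X_te))  # 6 2

# Reusing the same random_state reproduces exactly the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(toy_X, toy_y, test_size=0.25, random_state=5)
print(X_te == X_te2)  # True
```

Fixing `random_state` is what lets everyone in the workshop get the same training and test sets, and therefore comparable scores.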

# Modeling

Now that we have our data all preprocessed and split, we are ready to start modeling! 

We can start by importing and instantiating our model. Today, we'll be using the __Logistic Regression__ model.

In [29]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

With our model and data all prepared, we can start training!

In [30]:
model.fit(X_train, Y_train)

LogisticRegression()
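
If you're curious what the trained model does with a row of word counts, here is a rough sketch of the idea behind logistic regression: multiply each count by a learned weight, add everything up, and squash the total into a probability with the sigmoid function. The weights below are made up for illustration; they are not the ones our model learned.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: positive values push toward "Positive", negative toward "Negative".
weights = {"fun": 1.2, "love": 0.9, "boring": -1.5, "waste": -1.1}
bias = 0.0  # hypothetical intercept

def positive_probability(word_counts):
    """Weighted sum of word counts, passed through the sigmoid."""
    score = bias + sum(weights.get(word, 0.0) * count for word, count in word_counts.items())
    return sigmoid(score)

p_good = positive_probability({"fun": 1, "love": 1})      # score = 2.1
p_bad = positive_probability({"boring": 2, "waste": 1})   # score = -4.1
print(p_good > 0.5, p_bad < 0.5)  # True True
```

Training is the process of finding the weights that make these probabilities agree with the labels in the training set as closely as possible.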

# Evaluation

Now that our model is all trained, how do we see how good it is? Let's first import the metrics module from sklearn, which will help us evaluate the results of our model.

In [31]:
#Calculating the accuracy score of the model
from sklearn import metrics

We can now obtain predictions from our model. Remember, the model has never seen this data, so the results should be a good indicator of model performance.

In [32]:
predicted = model.predict(X_test)

Now that we have our predictions, we can compare them with the true values and see how well we did.

In [33]:
accuracy_score = metrics.accuracy_score(Y_test, predicted)  # true labels first, then predictions

In [34]:
print("Accuracy Score: ",accuracy_score)

Accuracy Score:  0.8928
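
Accuracy is simply the fraction of predictions that match the true labels. Here is a minimal sketch that computes it by hand on made-up labels; the `accuracy` helper and the `toy_` lists are placeholders for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Made-up labels: 6 of the 8 predictions are correct.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]

toy_accuracy = accuracy(toy_true, toy_pred)
print(toy_accuracy)  # 0.75
```

Read the same way, our score of 0.8928 means the model labeled about 89% of the test reviews correctly.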


# Implementation

We have our well-performing model, but how do we actually use this model in a practical sense? In other words, how can we efficiently input some text and get its sentiment back?

Let's create a function to help us with that.

In [17]:
def getPrediction(text, vectorizer, model):
    '''
    Takes in a sequence of texts, the pre-fit CountVectorizer, and the trained model, and returns the model's
    predictions on the texts in the form of a dataframe.
    '''
    textCounts = vectorizer.transform(text)
    predictions = model.predict(textCounts)
    sentiments = ["Positive" if x == 1 else "Negative" for x in predictions]
    #return a df
    return pd.DataFrame({"text": text, "predictions": sentiments})

Let's test out our function! Let's say we're building a movie rating website (like IMDB), and we have user-input reviews.

Try it yourself! Find an IMDB review for your favorite movie. **Make sure you can detect the sentiment yourself**. Assign the text of the review to the `review` variable below.

In [18]:
#I used a review from the new Minions movie
review = "Its a fun movie. Minion scenes are more than Gru and thankfully so. The chemistry between minions characters is epic. Love Kevin, Bob and Stuart. There is no deep meaning or higher message, a simple carefree movie which is supposed to be seen that way."

In [19]:
getPrediction([review], cv, model)

Unnamed: 0,text,predictions
0,"Its a fun movie. Minion scenes are more than Gru and thankfully so. The chemistry between minions characters is epic. Love Kevin, Bob and Stuart. There is no deep meaning or higher message, a simple carefree movie which is supposed to be seen that way.",Positive
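
Because the function takes a sequence of texts, the same approach works for several reviews at once. Here is a sketch of one possible variant that also reports how confident the model is, using `predict_proba`. To keep it runnable on its own, it trains a tiny model on four made-up reviews; the function name `get_prediction_with_confidence`, the `prob_positive` column, and all the `toy_` names are assumptions for illustration, not part of the workshop code.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set, only so this sketch runs on its own.
toy_texts = ["great fun movie", "loved it great acting",
             "terrible boring movie", "awful boring waste"]
toy_labels = [1, 1, 0, 0]

toy_cv = CountVectorizer()
toy_model = LogisticRegression()
toy_model.fit(toy_cv.fit_transform(toy_texts), toy_labels)

def get_prediction_with_confidence(texts, vectorizer, model):
    """Hypothetical variant of getPrediction that adds the probability of label 1."""
    counts = vectorizer.transform(texts)
    predictions = model.predict(counts)
    # Find which predict_proba column belongs to the positive label (1).
    positive_col = list(model.classes_).index(1)
    prob_positive = model.predict_proba(counts)[:, positive_col]
    sentiments = ["Positive" if x == 1 else "Negative" for x in predictions]
    return pd.DataFrame({"text": texts,
                         "predictions": sentiments,
                         "prob_positive": prob_positive})

# Several made-up reviews scored in one call.
new_reviews = ["great fun", "boring waste of time"]
result = get_prediction_with_confidence(new_reviews, toy_cv, toy_model)
print(result)
```

A probability close to 0.5 is a hint that the model is unsure, which can be useful if you want to flag borderline reviews for a human to check.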


# That's it!

Congratulations! You just built your own sentiment classifier.