# Data Science - Introduction to Sentiment Analysis Workshop

This workshop covers an important and interesting field in data science: sentiment analysis! Sentiment analysis has a wide range of applications, including customer support analysis, market research, and brand monitoring. The big question is: how can we extract useful information and insights from seemingly plain text?

### What we'll cover
- What is sentiment analysis? Why sentiment analysis?
- Getting started (importing libraries, dataset)
- Data preprocessing (cleaning, transforming)
- Classification Modeling
- Evaluation

## Import libraries

Let's start by importing all necessary libraries. The main ones we'll need are pandas, nltk, and scikit-learn.

In [3]:
import pandas as pd #helps us view, store, and process our data
import nltk #helpful NLP-specific functions and libraries
import sklearn #helps us set up, train, and evaluate our model

Next, we adjust a few display settings so we can see the full text within the dataframe:

In [4]:
pd.set_option('display.max_columns', None)  # or 1000
pd.set_option('display.max_rows', None)  # or 1000
pd.set_option('display.max_colwidth', None)  # or 199

## Now let's import our dataset

Today, we'll be predicting the sentiment of IMDB movie reviews.

In [5]:
#type here

Let's see what our dataset looks like

In [None]:
#type here
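
If you want to see the shape of this step before filling in the cells above, here is a minimal sketch on made-up data. Everything in it is an assumption for illustration: the column names `review` and `sentiment`, the 0/1 labels, and the in-memory CSV text that stands in for the real dataset file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real dataset file:
# one column of review text and one column of 0/1 sentiment labels.
csv_text = io.StringIO(
    "review,sentiment\n"
    '"A wonderful, moving film.",1\n'
    '"Dull plot and wooden acting.",0\n'
    '"I loved every minute of it!",1\n'
)

# pd.read_csv accepts a file path or any file-like object
df = pd.read_csv(csv_text)

# Look at the first rows and at how many reviews fall under each label
print(df.head())
print(df["sentiment"].value_counts())
```

With the real dataset you would pass the path of the file to `pd.read_csv` instead of the in-memory text.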

# Data Preprocessing

Data preprocessing is an important step for any data science task. In order to make our data useable for our model (your computer), we need to go through a few data cleaning/transforming steps. 

First, let's import the data processing modules we need from our libraries.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer

Now let's create a RegexpTokenizer. This will help us get rid of any special characters (#, $, &, etc.) in our text.

In [8]:
#type here
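
As a rough illustration of what a `RegexpTokenizer` does, here is a small sketch on a made-up sentence. The pattern `[a-zA-Z0-9]+` and the variable name `token` are assumptions chosen for this example; the pattern keeps runs of letters and digits and drops everything else.

```python
from nltk.tokenize import RegexpTokenizer

# Assumed pattern: keep runs of letters and digits, discard all other characters
token = RegexpTokenizer(r"[a-zA-Z0-9]+")

sample = "Great movie!!! Worth the $12 ticket & the popcorn #mustsee"
tokens = token.tokenize(sample)
print(tokens)
# ['Great', 'movie', 'Worth', 'the', '12', 'ticket', 'the', 'popcorn', 'mustsee']
```

Note that the special characters disappear entirely, while the words and numbers around them survive as separate tokens.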

Let's instantiate a CountVectorizer, passing in our tokenizer from earlier.

In [9]:
#type here

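Now that we have our data and count vectorizer all set up, let's vectorize our data! Fitting the CountVectorizer on the reviews and transforming them converts our text into a matrix of integer counts: one row per review, one column per word, with each entry recording how often that word appears in that review. This numeric form is something the model can work with.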

In [10]:
#type here
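
To make the word-count matrix concrete, here is a minimal sketch on two made-up sentences. The tokenizer pattern, the variable names, and the toy documents are assumptions for illustration, not the workshop's actual data.

```python
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import CountVectorizer

# Assumed tokenizer pattern: runs of letters and digits only
token = RegexpTokenizer(r"[a-zA-Z0-9]+")

# Pass the tokenizer's tokenize method so CountVectorizer splits text our way
cv = CountVectorizer(lowercase=True, tokenizer=token.tokenize)

docs = ["Good movie, good cast!", "Bad movie."]

# fit_transform learns the vocabulary and returns a sparse count matrix
counts = cv.fit_transform(docs)

# Vocabulary words ordered by their column index in the matrix
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
print(vocab)             # ['bad', 'cast', 'good', 'movie']
print(counts.toarray())  # [[0 1 2 1]
                         #  [1 0 0 1]]
```

Each row is one document and each column is one vocabulary word, so the `2` in the first row says that "good" appears twice in the first sentence. scikit-learn may print a warning that `token_pattern` is unused when a custom tokenizer is supplied; that is expected here.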

The last step in data preprocessing is splitting our data into training and test sets. We can use a very helpful function from scikit-learn to accomplish this.

In [12]:
#Splitting the data into training and testing
from sklearn.model_selection import train_test_split

In [13]:
#type here
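
Here is a minimal sketch of how `train_test_split` behaves on toy data. The 80/20 split, the `random_state` value, and the use of `stratify` are assumptions for illustration; pick whatever proportions suit your dataset.

```python
from sklearn.model_selection import train_test_split

# Toy features and labels: 10 samples, evenly split between two classes
X = [[i] for i in range(10)]
y = [0, 1] * 5

# Hold out 20% of the samples for testing.
# random_state makes the shuffle reproducible;
# stratify=y keeps the class proportions the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_test))  # 8 2
```

In the workshop, `X` would be the count matrix produced by the CountVectorizer and `y` the column of sentiment labels.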

# Modeling

Now that we have our data all preprocessed and split, we are ready to start modeling! 

We can start by importing and instantiating our model. Today, we'll be using the multinomial Naive Bayes classifier, `MultinomialNB`.

In [14]:
from sklearn.naive_bayes import MultinomialNB
#type here

With our model and data all prepared, we can start training!

In [15]:
#type here

MultinomialNB()
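
To show what training looks like in isolation, here is a small sketch that fits `MultinomialNB` on a hand-made count matrix. The two "word" columns and the 0/1 labels are assumptions invented for this example.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix: column 0 counts the word "good", column 1 counts "bad"
X_train = np.array([[2, 0], [3, 0], [0, 2], [0, 3]])
# Labels: 1 = positive review, 0 = negative review
y_train = np.array([1, 1, 0, 0])

model = MultinomialNB()
model.fit(X_train, y_train)

# One review that only says "good", one that only says "bad"
preds = model.predict(np.array([[1, 0], [0, 1]]))
print(preds)  # [1 0]
```

The model learns which words are more frequent in each class, so a review dominated by "good" is assigned the positive label.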

# Evaluation

Now that our model is all trained, how do we see how good it is? Let's first import the metrics module from sklearn, which will help us evaluate the results of our model.

In [16]:
#Calculating the accuracy score of the model
from sklearn import metrics

We can now obtain predictions from our model on the test set. Remember, the model has never seen this data, so the results should be a good indicator of model performance.

In [18]:
#type here

Now that we have our predictions, we can compare them with the true values and see how well we did.

In [19]:
#type here

In [None]:
print("Accuracy Score: ",accuracy_score)
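
As a quick illustration of how the comparison works, here is a sketch using hand-written labels in place of real model output. The lists `y_true` and `y_pred` and the variable name `acc` are made up for this example.

```python
from sklearn import metrics

# Made-up true labels and predictions (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

# Accuracy is the fraction of predictions that match the true labels: 4 of 5 here
acc = metrics.accuracy_score(y_true, y_pred)
print("Accuracy Score: ", acc)  # 0.8

# The confusion matrix breaks the result down by class:
# rows are true labels (0, 1), columns are predicted labels (0, 1)
cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)  # [[2 0]
           #  [1 2]]
```

Accuracy alone can hide which class the model struggles with, which is why looking at the confusion matrix as well can be useful.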

# Implementation

We have our well-performing model, but how do we actually use it in a practical sense? In other words, how can we efficiently input text and receive a sentiment for it?

Let's create a function to help us with that.

In [21]:
def getPrediction(text, cv, model):
    '''
    Takes in a sequence of texts, the pre-fit CountVectorizer, and the trained model, and returns the model's predictions
    on the texts in the form of a dataframe.
    '''
    textCounts = cv.transform(text)
    predictions = model.predict(textCounts)
    sentiments = list(map(lambda x: "Positive" if x == 1 else "Negative", predictions))
    #return a df
    return pd.DataFrame({"text": text, "predictions": sentiments})

Let's test out our function! Let's say we're building a movie rating website (like IMDB), and we have user-input reviews.

Try it yourself! Find an IMDB review for your favorite movie. **Make sure you can detect the sentiment yourself**. Assign the text of the review to the `review` variable below.

In [1]:
review = "REPLACE THIS WITH A REVIEW FROM YOUR FAVORITE MOVIE"

In [None]:
getPrediction([review], cv, model)

# That's it

Congratulations! You built your own sentiment classifier!