# Lab 3: Sentiment Analysis on Movie Reviews 🤩

Working on this lab should be a **collaborative effort**. We encourage you to work together with your group. If you do not work on your own notebook, make sure you demo to the TA/instructor as a group and share your work across the group after the lab.

> Remember to indicate the names of your group members if you use some of the collectively developed code in a future homework.

## Learning Objectives
1. Experience the full data science workflow, from data acquisition and pre-processing to building a model and presenting the results. 
![DSworkflow](utility/pics/DSworkflow.png)
2. Work with free-form text data.
3. Learn and understand two approaches to sentiment analysis. 
4. Explore model evaluation techniques and analyze errors. 

## Outline

0. [DS Use-Case: Analyzing Customer Feedback](#DS-Use-Case:-Analyzing-Customer-Feedback)
1. [Rule-Based Sentiment Prediction](#1.-Rule-Based-Sentiment-Prediction)
    1. [Toy Example](#A.-Toy-Example)
    2. [Movie Reviews: Test Yourself](#B.-Movie-Reviews:-Test-Yourself)
    3. [Evaluation](#C.-Evaluation)
2. [Limitations and Introduction to Machine Learning](#2.-Limitations-and-Introduction-to-Machine-Learning)
    1. [Quick Introduction to Sentiment Classification and Scikit-Learn](#A.-Quick-Introduction-to-Sentiment-Classification-and-Scikit-Learn)
    2. [Coding Task: Evaluate the Sentiment Classifier](#B.-Coding-Task:-Evaluate-the-Sentiment-Classifier)
3. [Communicate your Results](#3.-Communicate-your-Results) 
4. [[Optional] More Things to Try](#[Optional]-More-Things-to-Try)

## DS Use-Case: Analyzing Customer Feedback

Today we want to look at ways to help businesses make the most of their customers' feedback, which often comes as textual reviews or comments. To analyze this form of data we can use _sentiment analysis_. Its main goal is to categorize attitudes towards something. This is quite relevant today: Amazon, for example, sells products of all kinds, and those who purchase these items are able to leave reviews and comments. Besides the ratings that are given (which are often noisy, can easily be created by bots, or are systematically biased), how would a company be able to tell which products are well-liked and which ones should be removed?

An easy way is through _sentiment analysis_, where the goal is to predict the sentiment or positivity/negativity of a product or service solely based on the text provided as comments and reviews. In this lab, we will explore two different ways to predict and understand the sentiment of text data. First, we will work through a simple **rule-based algorithm**, looking at positive and negative words to determine the classification of reviews. Following this, we will work through a more sophisticated **machine learning-based approach**, allowing us to _learn_ which words are more commonly found in positive versus negative reviews.

## 1. Rule-Based Sentiment Prediction

Rule-based sentiment prediction is the easier of the two algorithms to learn and implement. In short, we have a list of positive words and a list of negative words, both of which will be used to calculate a "sentiment score" for the review.

### A. Toy Example

For example, let's say we have two lists of words, `positive_words` and `negative_words`:

In [1]:
positive_words = ['great', 'awesome', 'happy', 'good', 'exciting', 'love']
negative_words = ['bad', 'dislike', 'sad', 'boring', 'awful', 'poor']

We also have a set of reviews or text that we want to analyze. Here we have three example movie reviews:

In [2]:
reviews = ['I thought the movie was great! I was very happy I could see it.',
           'I did not like the movie; boring acting, poor attitudes, bad lighting.',
           'The movie was pretty exciting overall, but the sound quality was bad.']

We then go through each review and add to or subtract from the sentiment score based on the **number of positive** or **negative words**. First, we split the strings by spaces using `split()`. Then, for each word that is in the list of positive words, we add one to the score; for each word in the list of negative words, we subtract one.

In [3]:
sentiment_scores = []
for review in reviews:
    sentiment_score = 0
    for word in review.split(' '):
        if word in positive_words:
            sentiment_score += 1
        if word in negative_words:
            sentiment_score -= 1
    sentiment_scores.append(sentiment_score)

We can print out these results to see the overall scores in order of the reviews.

In [4]:
print(sentiment_scores)

[1, -3, 1]


If we count by hand, we see that the scores don't add up correctly. Why is this? When **tokenizing** the reviews into words, we split by spaces. Take the first review for example. If we split it by spaces and look at the words, we see that the word `great` still has the exclamation point attached!

In [5]:
first_review = reviews[0]
first_review_words = first_review.split(' ')
print(first_review_words)

['I', 'thought', 'the', 'movie', 'was', 'great!', 'I', 'was', 'very', 'happy', 'I', 'could', 'see', 'it.']


Having the words split only by spaces causes some words to include punctuation, which is something we don't want. We won't go into much detail here, but preprocessing the text so that each token matches the form used in our word lists can greatly improve accuracy. Removing punctuation and standardizing to lowercase gives us much more control over the text data at hand.

**[🐍 Python Feature 🐍]: String Functions**  

We will use some string functions for text preprocessing. Here are a couple of useful examples: 

> `lower()` changes all characters to lowercase.

> `translate(str.maketrans(input, output, delete))` replaces each character in `input` with the corresponding character in `output` and deletes every character in `delete`. For example, `translate(str.maketrans("aeiou", "12345", "!"))` replaces vowels with their respective digits and deletes all exclamation marks. This is useful if you want to delete or replace a bunch of characters all at once. 
      
> `split(' ')` splits the string into a list of words at each ' ', or space.
 
> `replace(target, new)` will replace all matches of the `target` string with the `new` string.

> `string.punctuation` gives you all ASCII punctuation symbols: `` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ``
You will need to import `string` for this.

In [6]:
import string

new_first_words = first_review.lower().translate(str.maketrans("", "", string.punctuation)).split(" ")
print(new_first_words)

['i', 'thought', 'the', 'movie', 'was', 'great', 'i', 'was', 'very', 'happy', 'i', 'could', 'see', 'it']
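
One detail worth checking when tokenizing: `split(" ")` cuts at every single space, so double spaces produce empty tokens and line breaks glue two words together, while `split()` without an argument cuts at any run of whitespace. The sketch below illustrates the difference; `tokenize` and `messy_review` are hypothetical names that do not appear elsewhere in this lab.

```python
import string

def tokenize(text):
    """Lowercase, strip punctuation, and split on any run of whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

# Made-up review with a double space and a line break
messy_review = "Great  movie!\nReally   good."

tokens_single_space = messy_review.lower().translate(
    str.maketrans("", "", string.punctuation)).split(" ")
tokens_any_space = tokenize(messy_review)

print(tokens_single_space)  # contains empty strings and a token with a newline in it
print(tokens_any_space)     # clean list of four words
```

For the toy reviews above both variants give the same tokens. On real review files the two may differ slightly, so reference scores produced with `split(" ")` may not be reproduced exactly with `split()`.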


**Try this!** Update the tokenization part in the code from above and re-run it on the reviews to see the corrected scores.

In [8]:
sentiment_scores = []
for review in reviews:
    tokens = review.lower().translate(str.maketrans("", "", string.punctuation)).split(" ")
        
    sentiment_score = 0
    for word in tokens: 
        if word in positive_words:
            sentiment_score += 1
        if word in negative_words:
            sentiment_score -= 1
    sentiment_scores.append(sentiment_score)

print(sentiment_scores)

[2, -3, 0]


Great! We now have working code to assign sentiment scores to reviews. The final step is to assign a sentiment label to each review. There are several ways to approach this, depending on what the user is attempting to do. We could treat this as binary classification, where each review is either positive or negative and cannot be anything else. For this, we would assign "Negative" to any review with a score less than zero, and "Positive" to every other review.

In [9]:
review_sentiments = []

for score in sentiment_scores:
    if score >= 0:
        review_sentiments.append("Positive")
    if score < 0:
        review_sentiments.append("Negative")
        
print(review_sentiments)

['Positive', 'Negative', 'Positive']


However, we could also use multi-class classification, including a "Neutral" class for the reviews that have a score of zero.

In [10]:
review_sentiments = []

for score in sentiment_scores:
    if score > 0:
        review_sentiments.append("Positive")
    if score < 0:
        review_sentiments.append("Negative")
    if score == 0:
        review_sentiments.append("Neutral")
        
print(review_sentiments)

['Positive', 'Negative', 'Neutral']


With all of this in mind, there are no limits to the number of classes or splits that could be made for text data. We could adjust the range for neutral to be any reviews between -1 and 1, or perhaps add in more classes ("Slightly Positive", "Slightly Negative", "Very Positive", "Very Negative", etc...). 
> **Caution**: The only challenge with this is that you will need to define the thresholds on the scores. This adds another set of **hard-coded "rules"** to your approach. 
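
To make the role of the thresholds concrete, here is a minimal sketch of a labeling helper with an adjustable neutral range. The function name `label_from_score` and the parameter `neutral_band` are assumptions made for illustration; they are not part of the lab code.

```python
def label_from_score(score, neutral_band=0):
    """Map a sentiment score to a label.

    Scores between -neutral_band and neutral_band (inclusive) are "Neutral".
    """
    if score > neutral_band:
        return "Positive"
    if score < -neutral_band:
        return "Negative"
    return "Neutral"

example_scores = [2, -3, 0]
default_labels = [label_from_score(s) for s in example_scores]
wide_labels = [label_from_score(s, neutral_band=2) for s in example_scores]

print(default_labels)  # only a score of exactly zero is neutral
print(wide_labels)     # scores from -2 to 2 are neutral
```

Note how the first review changes from "Positive" to "Neutral" just because the threshold moved, which is exactly the kind of hard-coded choice the caution above refers to.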

#### Bottom line:

* As long as the data is preprocessed correctly and you have a good set of positive and negative words and thresholds, you will be able to run sentiment analysis easily on the majority of text files.


* Rule-Based Sentiment Analysis is also _easy to implement_! 


* But there are several drawbacks that can make this method unreliable. This method fails to correctly handle:
    * misspellings
    * context
    * negations 
    
> Take the two following reviews for example: "_The movie was not good, it was bad_" and "_The movie was not bad, it was good_" 

> Both of these reviews would end up with the same sentiment score, but are clearly different reviews. This is partly due to the nature of the method; we are only looking at one word at a time, and not pairs of words. We also didn't implement **negation handling**. We will not look at this specifically, but more elaborate text pre-processing and tokenizing, as well as looking at pairs of words or groups of three words (called bi-grams and tri-grams, or in general n-grams), can help alleviate mistakes in our analysis.

* Rule-Based Sentiment Analysis also does not take into account the length of the review. If we have a very long review that uses a mix of positive and negative words, it may end up being classified as something it is not. Likewise, a short but very strongly opinionated review may not receive the same sentiment as a longer, equally opinionated review.
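
Returning to the negation example above, a very simple form of negation handling can be sketched as follows: flip the sign of a sentiment word when the word directly before it is a negation. This is only one possible heuristic, not the method used in this lab; the list `negation_words` and the function `score_with_negation` are assumptions made for illustration.

```python
import string

positive_words = ['great', 'awesome', 'happy', 'good', 'exciting', 'love']
negative_words = ['bad', 'dislike', 'sad', 'boring', 'awful', 'poor']
negation_words = ['not', 'never', 'no']  # assumed list, extend as needed

def score_with_negation(text):
    """Count sentiment words, flipping the sign of a word that directly follows a negation."""
    tokens = text.lower().translate(str.maketrans("", "", string.punctuation)).split()
    score = 0
    for i, word in enumerate(tokens):
        value = 0
        if word in positive_words:
            value = 1
        elif word in negative_words:
            value = -1
        # Look one token back for a negation word
        if i > 0 and tokens[i - 1] in negation_words:
            value = -value
        score += value
    return score

score_a = score_with_negation("The movie was not good, it was bad")
score_b = score_with_negation("The movie was not bad, it was good")
print(score_a, score_b)
```

With plain word counting both reviews score 0; with the one-word look-back they end up on opposite sides of zero. The heuristic still misses negations that are further away ("not a very good movie").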

### B. Movie Reviews: Test Yourself

Work through the following code blocks to analyze a dataset of real-life **movie reviews**! Some of the code is written for you and some you will have to fill in.


In [66]:
# Setup - This cell block is needed to set up everything for this testing section
# No need to edit this cell

import os
import string
import zipfile
import shutil

# Unzip folder with negative reviews
if not os.path.exists('utility/data/neg'):
    zip_ref = zipfile.ZipFile('utility/data/neg.zip', 'r')
    zip_ref.extractall('utility/data/')
    zip_ref.close()
    print('Unzipped Negative')

# Unzip folder with positive reviews
if not os.path.exists('utility/data/pos'):
    zip_ref = zipfile.ZipFile('utility/data/pos.zip', 'r')
    zip_ref.extractall('utility/data/')
    zip_ref.close()
    print('Unzipped Positive')
    
# Create folder for testing
pos_test = ['357_10p.txt', '347_10p.txt', '1697_10p.txt', '13_10p.txt']  
neg_test = ['1919_1n.txt', '54_1n.txt', '1819_1n.txt', '7_1n.txt'] 

if not os.path.exists('utility/data/test'):
    os.mkdir('utility/data/test')
    
    for rev in pos_test:
        shutil.copy('utility/data/pos/'+rev,'utility/data/test')

    for rev in neg_test:
        shutil.copy('utility/data/neg/'+rev,'utility/data/test')
    print('Created test folder.')
    
# Create list of negative words from given file
with open('utility/data/negative-words.txt') as f:
    negative_words = [word.strip() for word in f.readlines() if word[0] not in [';', '\n']]
    print('Created list of negative words: negative_words')

# Create list of positive words from given file
with open('utility/data/positive-words.txt') as f:
    positive_words = [word.strip() for word in f.readlines() if word[0] not in [';', '\n']]
    print('Created list of positive words: positive_words')

Unzipped Negative
Unzipped Positive
Created test folder.
Created list of negative words: negative_words
Created list of positive words: positive_words


**Try this!** The bulk of the code will be executed in the following function. Fill in what needs to be filled in to perform rule-based sentiment prediction and test the function on a small number of reviews. 

> **[🐍 Python Feature 🐍] Reading text from files:** There are different ways to read data from text files: next to `f.readlines()`, there is `f.readline()` and also `f.read()`.
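
The difference between the three calls is easiest to see on a tiny file. The sketch below writes a two-line temporary file (the file name and content are made up for this illustration) and reads it back in each of the three ways.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a small two-line file to read back
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("first line\nsecond line\n")

    with open(demo_path, encoding="utf-8") as f:
        whole_text = f.read()        # one string with the full content
    with open(demo_path, encoding="utf-8") as f:
        first_line = f.readline()    # only the next line, including the '\n'
    with open(demo_path, encoding="utf-8") as f:
        all_lines = f.readlines()    # list of lines, each ending in '\n'

print(repr(whole_text))
print(repr(first_line))
print(all_lines)
```

For a review stored in a single file, `f.read()` is the convenient choice because it returns the whole text as one string that can be lowercased and split directly.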

In [67]:
def get_sentiment_scores(path2folder,test_mode=False):

    # Create a blank sentiment_scores list
    sentiment_scores = []

    print('...computing sentiment scores on '+path2folder +'...')
    
    # get the filenames 
    testfiles = os.listdir(path2folder)
    
    # sort test files (only in test-mode) 
    if test_mode:
        testfiles.sort()
    
    
    for file in testfiles:
        
        path_start = path2folder + '/'
    
        # Create the sentiment_score variable for this review, and set it to zero
        # your code here 
        sentiment_score = 0

        with open(path_start + file, encoding = "utf-8") as f:

            # Pull the words into a words array
            # The reviews include the string "<br />" quite a few times; the data looks cleaner if replaced
            # with a space!
            # Hint: Remember to read, lower, replace, translate, and split!
            # your code here 
            words = f.read().lower().replace("<br />", " ").translate(str.maketrans("", "", string.punctuation)).split(" ")

            # Loop through the words to generate the sentiment score
            # your code here 
            for word in words: 
                if word in positive_words:
                    sentiment_score += 1
                if word in negative_words:
                    sentiment_score -= 1

            # Append the sentiment_score to the sentiment_scores array!
            # your code here 
            sentiment_scores.append(sentiment_score)

        
    print('Done Running \n')
    return sentiment_scores

# executing in test-mode on test folder
test_scores = get_sentiment_scores('utility/data/test',test_mode=True)
print(test_scores)

...computing sentiment scores on utility/data/test...
Done Running 

[-3, 2, -2, -2, -6, 5, 6, -19]


> **Check your Code**: The sentiment scores you should get for those test reviews are `[-3, 2, -2, -2, -6, 5, 6, -19]` (note: they may not be in this order, but as long as the same scores are present you should be good).

**Try this!** Once the code is running correctly, perform rule-based sentiment prediction by calling `get_sentiment_scores()` on all the **_positive reviews_**! Running this function will take a little while as it needs to go through all of the reviews and count the positive and negative words in order to get the sentiment score.  

> **Hint**: The data is stored in the folder `data` under `utility`, with two subfolders, `neg` and `pos`

> **Hint**: Provide only the path as an argument to `get_sentiment_scores()` (to trigger _non-test mode_ execution)
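
One likely reason the full run takes a while is that `positive_words` and `negative_words` are Python lists, so each `word in ...` test scans the list from the start. If the word lists are long, converting them to sets once makes each lookup much cheaper without changing the scores. The sketch below uses toy lists and hypothetical names (`toy_positive_words`, `score_tokens`) to show that both variants give the same result.

```python
toy_positive_words = ['great', 'awesome', 'happy', 'good', 'exciting', 'love']
toy_negative_words = ['bad', 'dislike', 'sad', 'boring', 'awful', 'poor']

# Build the sets once, outside of any loop over reviews
toy_positive_set = set(toy_positive_words)
toy_negative_set = set(toy_negative_words)

def score_tokens(tokens, positives, negatives):
    """Add one per positive token and subtract one per negative token."""
    score = 0
    for word in tokens:
        if word in positives:
            score += 1
        if word in negatives:
            score -= 1
    return score

tokens = "i thought the movie was great i was very happy i could see it".split()
list_score = score_tokens(tokens, toy_positive_words, toy_negative_words)
set_score = score_tokens(tokens, toy_positive_set, toy_negative_set)
print(list_score, set_score)
```

Because only membership is tested, duplicates in the original lists do not matter and the scores stay identical.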


In [70]:
scores_pos_reviews = None

# your code here 
scores_pos_reviews = get_sentiment_scores('utility/data/pos')            


len(scores_pos_reviews)

...computing sentiment scores on utility/data/pos...


KeyboardInterrupt: 

**Try this!** Repeat the sentiment score computation for all **_negative reviews_**.

In [69]:
scores_neg_reviews = None

# your code here 
scores_neg_reviews = get_sentiment_scores('utility/data/neg')

len(scores_neg_reviews)

...computing sentiment scores on utility/data/neg...
Done Running 



2000

### C. Evaluation
Now, we can see how our approach predicts the sentiment for those reviews. This phase is a crucial part in the data science workflow as it will tell us how well our model or approach works.  

#### Accuracy 
What is the overall performance of our rule-based sentiment predictor? 

**Try this!** Compute the percentage of correctly predicted reviews over *all* reviews (this measure is also called _accuracy_), and the percentage of incorrectly predicted reviews over *all* reviews (this measure is also called _error rate_). As a sanity check, make sure both measures add up to 100%.  

In [37]:
# your code here 

# Positive reviews: count a prediction as correct only if the score is above zero
num_true = 0
num_false = 0
for score in scores_pos_reviews: 
    if score > 0: 
        num_true += 1
    else: 
        num_false += 1
        
print("POSITIVE DATA")
print("percent correct: ")
print(100*num_true/len(scores_pos_reviews))

print("percent incorrect")
print(100*num_false/len(scores_pos_reviews))
    
# Negative reviews: count a prediction as correct only if the score is below zero
# Note: in this cell a score of exactly zero counts as incorrect for both classes
neg_true = 0
neg_false = 0
for score in scores_neg_reviews: 
    if score < 0: 
        neg_true += 1
    else: 
        neg_false += 1
print("NEGATIVE DATA")
print("percent correct: ")
print(100*neg_true/len(scores_neg_reviews))

print("percent incorrect")
print(100*neg_false/len(scores_neg_reviews))
    
    

POSITIVE DATA
percent correct: 
79.1
percent incorrect
20.9
NEGATIVE DATA
percent correct: 
72.7
percent incorrect
27.3
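
The task asks for accuracy over *all* reviews, whereas the cell above reports each class separately. A minimal sketch of the combined measure, using the same rule as that cell (score above zero for positive reviews, below zero for negative reviews) and made-up toy scores; the function name `overall_accuracy` is an assumption for illustration.

```python
def overall_accuracy(pos_scores, neg_scores):
    """Percentage of all reviews whose score sign matches the true label."""
    correct = (sum(1 for s in pos_scores if s > 0)
               + sum(1 for s in neg_scores if s < 0))
    total = len(pos_scores) + len(neg_scores)
    return 100 * correct / total

# Toy scores, not taken from the real data
toy_pos_scores = [3, 1, 0, -2]
toy_neg_scores = [-4, -1, 2, -3]

accuracy = overall_accuracy(toy_pos_scores, toy_neg_scores)
error_rate = 100 - accuracy
print(accuracy, error_rate)
```

On the real data you would pass `scores_pos_reviews` and `scores_neg_reviews` instead of the toy lists. Accuracy and error rate add up to 100% by construction, which is the sanity check asked for above.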


#### [🐍 Python Feature 🐍] Quick Intro to Formatted Printing:
> **ProTip**: Use _formatted printing_ to get nice print statements and save yourself from doing `str` conversions all the time.  

> The general rules are: `%[flags][width][.precision]type`, where `%` indicates that we want to format something at this point in the string. After the string, add another `%` followed by the comma-separated variables you want to format, surrounded by `(` `)`. You can add multiple inputs that get formatted in different locations in your string that way. 

Example: 

In [38]:
my_sentiment = 1
my_score = 0.33333
print("Class : %5d, Score : %5.2f" % (my_sentiment, my_score))

Class :     1, Score :  0.33


Let's break it down: `d` means that the first input is treated as an integer, the `5` in both formatting instructions means to use a width of 5 characters (even if the displayed string is shorter), and `.2f` means to include 2 decimal places and to treat the input as a `float` (we did not use `flags`). 
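
The same width and precision codes also work in f-strings (available since Python 3.6), which some people find easier to read. A short sketch for comparison; `percent_correct` is a made-up value for illustration.

```python
my_sentiment = 1
my_score = 0.33333
percent_correct = 84.4  # hypothetical value

# Same format codes as above, written after a colon inside the braces
fstring_line = f"Class : {my_sentiment:5d}, Score : {my_score:5.2f}"
percent_line = f"{percent_correct:.2f}% correct"

print(fstring_line)
print(percent_line)
```

Inside an f-string a literal `%` needs no escaping, so `%%` is only required with the `%` operator style.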


> **Note**: the `%%` in the cell below escapes the '%' symbol that we  want to print out.


#### True Positives, False Positives, True Negatives, and False Negatives

Let's look at the different possible errors we can make on positive versus negative reviews.  

In [39]:
# Positive Predicted Reviews:
percent_pos = sum([1 for score in scores_pos_reviews if score >= 0]) / len(scores_pos_reviews)*100
print("%.2f%% true positive reviews (those are predicted correctly)" % (percent_pos))

# Negative Predicted Reviews:
percent_neg = sum([1 for score in scores_pos_reviews if score < 0]) / len(scores_pos_reviews)*100
print("%.2f%% false negative reviews (those are actually positive reviews)" % (percent_neg))

84.40% true positive reviews (those are predicted correctly)
15.60% false negative reviews (those are actually positive reviews)


What do these numbers mean? Explain whether our approach works well or not.

Let's look at the negative reviews: 

In [40]:
# Positive Predicted Reviews:
percent_pos = sum([1 for score in scores_neg_reviews if score >= 0]) / len(scores_neg_reviews)*100
print("%.2f%% false positive reviews (those are actually negative reviews)" % (percent_pos))

# Negative Predicted Reviews:
percent_neg = sum([1 for score in scores_neg_reviews if score < 0]) / len(scores_neg_reviews)*100
print("%.2f%% true negative reviews (those are predicted correctly)" % (percent_neg))

27.30% false positive reviews (those are actually negative reviews)
72.70% true negative reviews (those are predicted correctly)


> **Good to Know:** The error values we computed are the standard way to evaluate binary classification models. The values can be summarized in what is called the _confusion matrix_: 

<img src="utility/pics/confusion_matrix.png" alt="Drawing" style="width: 350px;"/>
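
The four cells of the confusion matrix can be counted directly from the two score lists. The sketch below uses the binary rule from the cells above (a score of zero or more is predicted as positive) on made-up toy scores; `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(pos_scores, neg_scores):
    """Return (TP, FN, FP, TN) counts for the rule 'score >= 0 means Positive'."""
    tp = sum(1 for s in pos_scores if s >= 0)  # positive reviews predicted positive
    fn = len(pos_scores) - tp                  # positive reviews predicted negative
    fp = sum(1 for s in neg_scores if s >= 0)  # negative reviews predicted positive
    tn = len(neg_scores) - fp                  # negative reviews predicted negative
    return tp, fn, fp, tn

# Toy scores, not taken from the real data
toy_pos_scores = [3, 1, 0, -2]
toy_neg_scores = [-4, -1, 2, -3]
tp, fn, fp, tn = confusion_counts(toy_pos_scores, toy_neg_scores)

print("                 predicted pos  predicted neg")
print(f"actual positive  {tp:13d}  {fn:13d}")
print(f"actual negative  {fp:13d}  {tn:13d}")
```

Each row sums to the number of reviews in that class, so dividing a row by its total gives the percentages printed in the cells above.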

**Write-up!** Compare the results for negative reviews with the ones for positive ones above. Is our approach better in predicting positive reviews correctly or negative ones? 

**Your response here:**

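Our approach is better at predicting positive reviews: 84.40% of the positive reviews are classified correctly, compared with 72.70% of the negative reviews. 
(This is confusing at first, but easier to think about once you place the true/false values in the confusion matrix above.)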


## 2. Limitations and Introduction to Machine Learning
The rule-based sentiment predictor has clear advantages, chief among them that it is simple to implement. With just a couple of extensions to our version (such as negation handling) we could actually make this production ready. However, the main drawback of this approach is that we need **hand-engineered** lists of positive and negative expressions, which are non-trivial to create and also static. That means they don't adapt automatically to the domain they are being used for. For example, an expression used in formal language might have a different meaning in a colloquial context. 

#### Rule-Based: 
![rule-based](utility/pics/rule-based1.png)

How can we overcome this problem? Can we maybe learn what expressions are used in a positive versus a negative review? The answer is '_yes - we can!_'

### A. Quick Introduction to Sentiment Classification and Scikit-Learn

Instead of working with lists of positive and negative expressions we will now look at reviews with known ratings and use them to learn what positive versus negative reviews are. With the known set of positive and negative reviews, we can build a model just like so: 

#### Training an ML approach:
![machine-learning](utility/pics/ml_train1.png)

We can then apply the model to new comments and reviews to determine a customer's attitude. This **machine learning** approach is commonly known as _classification_: 

#### Use Trained Classifier:
![predict](utility/pics/ml_predict1.png)

Okay, let's do it. We will be using the Scikit-Learn `sklearn` Python package (https://scikit-learn.org/stable/). For further reading/reference, check [**PDSH**] Ch5 for a quick introduction to Scikit-Learn (p343-359).  

> It's okay to **not understand** every single statement we are doing in the following cells. Just enjoy the flow and watch how things evolve. Concentrate on the output and the bigger picture: which method is better _rule-based_ or _ML-based_? 


> **We will spend a lot more time on machine learning and the** `sklearn` **library in the upcoming weeks!**

In [41]:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Remove test folder 
if os.path.exists('utility/data/test'):
    shutil.rmtree('utility/data/test')  
    
# Load data (each subfolder is treated as one class; target values are 0, 1, ...)    
data_folder = "utility/data/"
dataset = load_files(data_folder, shuffle=False)
docs_raw = dataset.data

## Text preprocessing
docs_all = []
for doc in docs_raw:
    docs_all.append(doc.decode('utf-8', errors='replace')) # prevent UnicodeDecodeError
y_all = dataset.target

# Text tokenizing; min_df=5 drops words that occur in fewer than 5 documents
count_vect = CountVectorizer(min_df=5)  
X_all_counts = count_vect.fit_transform(docs_all)

# Number of docs and number of words
print("Number of documents: " + str(X_all_counts.shape[0])) 
print("Number of words: " + str(X_all_counts.shape[1])) 
    # X_all_counts data representation (* = occurrence count):
    #    - - - - -
    #  |
    #  |  *        <- document
    #  |
    #  |
    #     ^
    #    word index

Number of documents: 4000
Number of words: 8870
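
To make the layout of `X_all_counts` concrete, here is a rough pure-Python analogue of a document-by-word count matrix on three made-up documents. It is only a sketch of the idea and not how `CountVectorizer` is implemented: the real class, for example, lowercases text, ignores single-character tokens by default, applies the `min_df` filter, and stores the result as a sparse matrix.

```python
from collections import Counter

toy_docs = ["good movie good acting", "bad movie", "good plot bad sound"]

# Vocabulary: every distinct word, sorted, mapped to a column index
all_words = sorted({word for doc in toy_docs for word in doc.split()})
vocabulary = {word: idx for idx, word in enumerate(all_words)}

# One row per document, one column per vocabulary word, entries are counts
count_matrix = []
for doc in toy_docs:
    counts = Counter(doc.split())
    count_matrix.append([counts.get(word, 0) for word in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

In the real data the matrix has 4000 rows (documents) and 8870 columns (words), and most entries are zero because each review uses only a small part of the vocabulary.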


After **preprocessing** the text documents, we **split** our data into two parts: one for building the model (_training set_) 
and one for testing/evaluating it (_test/evaluation set_). Then we will **build the model** using the _training set_ and use the model to **predict** the sentiment of the documents in the _testing set_. 

In [42]:
# Split the data into two parts 
X_train, X_test, y_train, y_test = train_test_split(X_all_counts, y_all, train_size = .8, test_size = .2, random_state = 16)

print("Size of the training set: " + str(X_train.shape[0]))
print("Size of the test/evaluation set: " + str(X_test.shape[0]))

# Build the model using a linear classification model
model = LogisticRegression(max_iter=1000).fit(X_train,y_train)

# Use the classification model for predictions
predicted_target = model.predict(X_test)

Size of the training set: 3200
Size of the test/evaluation set: 800


### B. Coding Task: Evaluate the Sentiment Classifier
Write a function that will go through all the test data and compare the predicted class and the actual class. If an entry is put into the wrong class by the model, this function will add one to the respective variable: `fneg_error_count` if it was a _false negative_, `fpos_error_count` if it is a _false positive_. From these values you can compute  
* the total _number of mistakes made_, 
* the _error rate_, and 
* the _accuracy_ 

of the machine learning approach. 


Then, this function will print out how many total errors, how many _false negatives_, and how many _false positives_ were found, as well as the rates (which are important to get a relative measure based on the number of positive/negative test examples). 

The inputs for this function are the **predicted classifications** for each review generated by the model and the **actual classifications** from the dataset.

In [58]:
def test_predictions(predictions, actual):
    
    fneg_error_count = 0
    fpos_error_count = 0

    mistakes = 0
    error_rate = 0
    accuracy = 0
    
    num_pos = np.sum(actual==1)
    num_neg = np.sum(actual==0)
    
    
    
    # your code here 
    for i in range(len(actual)): 
        if predictions[i] != actual[i]: 
            mistakes +=1
            if actual[i]==1: 
                fneg_error_count +=1
            else: 
                fpos_error_count +=1
        else: 
            accuracy +=1 
            

    error_rate = 100*mistakes/len(actual)
    accuracy = 100*accuracy/len(actual)
            

    
    print("There were " + str(fneg_error_count) + " false negative errors")
    print("There were " + str(fpos_error_count) + " false positive errors")
    print("There were a total of " + str(mistakes) + " errors out of " + str(len(predictions)) + " testpoints.\n")
    
    
    # false negative and false positive rates
    fnr = fneg_error_count/num_pos *100
    fpr = fpos_error_count/num_neg *100
    
    print("%5.2f%% true positive reviews (those are predicted correctly)" % (100-fnr))
    print("%5.2f%% false negative reviews (those are actually positive reviews)" % (fnr))
    
    print("%5.2f%% false positive reviews (those are actually negative reviews)" % (fpr))
    print("%5.2f%% true negative reviews (those are predicted correctly)\n" % (100-fpr))
    
    print("The algorithm was correct in %.2f%% of the test cases." % (accuracy) )
    print("The algorithm was wrong in %.2f%% of the test cases." % (error_rate) )

Now, we can call this function using our predicted sentiments and the ground truth sentiments as input: 

In [59]:
test_predictions(predicted_target, y_test)

There were 30 false negative errors
There were 47 false positive errors
There were a total of 77 errors out of 800 testpoints.

92.27% true positive reviews (those are predicted correctly)
 7.73% false negative reviews (those are actually positive reviews)
11.41% false positive reviews (those are actually negative reviews)
88.59% true negative reviews (those are predicted correctly)

The algorithm was correct in 90.38% of the test cases.
The algorithm was wrong in 9.62% of the test cases.


## 3. Communicate your Results

Now, it is time to summarize your findings by comparing the two approaches and their results, as well as to critically review your work. Did you find a satisfying solution to the given business problem: _analyzing customer feedback_? What are the strengths and limitations of the approach(es)? What could be improved in the future?  

**Group Discussion:** Compare the two approaches **rule-based sentiment prediction** versus **sentiment classification**. What are the main differences in terms of... 
* required data?
* quality of the results? 
* efficiency of the computation?
* possibilities to extend the basic algorithms? 

**Write-up!** Complete the given table: 

| Approach      | Required data          | Results | Efficiency     | Extensions        | Other things                |
| ---           | ---                    | ---     | ---            | ---               | ---                         |
| rule-based SA | pos/neg list + reviews | label   | ok             | give arrays       |                             |
| ML-based SA   | new reviews/model      | label   | more efficient | training approach | this one is probably better |


**Write-up!** Collect the main _pros_ and _cons_ for both approaches and give a recommendation on which one to use to analyze your company's customer feedback data. This will likely be very helpful for the decision makers discussing the integration of SA into your company's business processes. 

**Your response here:** 

ML 
pros: more accurate
cons: need more data

rule based
pros: small set of data needed 
cons: less accurate


## [Optional] More Things to Try

There are a lot of (ad-hoc) decisions we have made for you with respect to the machine learning pipeline above. We encourage you to modify some of these to see if and how the results will be affected. E.g.,

* Play with the train/test split sizes: we used an 80/20 split, but you can change this and see if it has an effect on the results. 
* Play with the random seed to create different train/test splits. How does this affect the results? 
* Use a different classifier: for example, Naive Bayes or a Support Vector Machine (SVM). Code examples are below - **replace** the model computation in the cell above with the respective lines to train these different models. Do these models produce different (better/worse) results?

In [60]:
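# Option 1: Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB().fit(X_train, y_train)

# Option 2: linear Support Vector Machine (running both options keeps only this model)
from sklearn.svm import LinearSVC
model = LinearSVC(max_iter=5000).fit(X_train,y_train)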

So, it turns out that this performs quite well. Of course, we can do fancier things with the text data, instead of only counting word occurrences. 

[**Challenge**] In practice people also use the counts of _pairs of words_ (so-called _bi-grams_) or even _n-grams_ (counts of tuples of n words), or a feature called _TF-IDF_, which is very powerful in practice. If you still have time, check out this tutorial explaining how to compute those: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html. Adapt the features used, create a new train/test split, train the model again, and evaluate the performance using your new features. 
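
As a possible starting point for the challenge, the sketch below builds TF-IDF features over single words and bi-grams with scikit-learn's `TfidfVectorizer` on three made-up sentences. The toy documents and variable names are assumptions for illustration; on the lab data you would pass `docs_all` and probably keep a `min_df` filter as before.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["the movie was not good", "the movie was good", "not bad at all"]

# ngram_range=(1, 2) extracts single words and pairs of adjacent words
tfidf_vect = TfidfVectorizer(ngram_range=(1, 2))
X_toy_tfidf = tfidf_vect.fit_transform(toy_docs)

print(X_toy_tfidf.shape)               # (number of documents, number of features)
print(sorted(tfidf_vect.vocabulary_))  # features now include pairs such as 'not good'
```

Because "not good" and "not bad" become features of their own, a classifier trained on such a matrix has a chance to learn negations that single-word counts cannot express.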

### Clean-up
Please run the following cell in order to clean up some of the files on your computer. While not mandatory, it will certainly save some space (over 4000 files are already unzipped, this will clear space).

In [62]:
# Run this to clean folders (unless you want to keep several thousand text files on your computer!)

if os.path.exists('utility/data/neg'):
    shutil.rmtree('utility/data/neg')
if os.path.exists('utility/data/pos'):
    shutil.rmtree('utility/data/pos')  