# Sentiment Analysis with Python
This is an example of training a sentiment classifier with Python. 

The aims of this hands-on experiment are to present:
- the basics of data analysis
- how to pre-process a dataset and why it is important
- the (very) basics of supervised machine learning
- analysis of a classifier's results

We will use the [Women's E-commerce Clothing Review](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews) dataset, which pairs each review with a rating score.
Rating scores go from 1 to 5, where 1 is the worst rating and 5 the best.

## You're using a notebook!
This is a 'notebook', and it lets us work with a programming language called Python. The notebook has cells. Some cells (like this one) are text. Others (like the one below) are code. It can be a little confusing! You can run code by clicking on the cell, then clicking the 'play' button on the left-hand side (or pressing Ctrl-Enter). Try it for the cell below:

In [None]:
food = "chocolate"
print("My favourite food is " + food + ".")

## Getting started
Let's get started! We need to get a few things set up first. You can just run the next cell and move on: it only loads the libraries that we need later.

In [None]:
#by the way, in python text after a hash (#) is a comment! Like this!

#loading some libraries
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import numpy as np
import re
import matplotlib.pyplot as plt
from urllib.request import urlopen
%matplotlib inline

In [None]:
#download and read the dataset
dataset = pd.read_csv("https://raw.githubusercontent.com/lionfish0/discover_stem/master/Womens%20Clothing%20E-Commerce%20Reviews.csv")
dataset

### Exercise 1
Try to print only the 'Rating' column. 

In [None]:
##exercise 1 area
#Here we print out their ages by adding ['Age'] to the dataset variable:
dataset['Age']

## Data Analysis

In [None]:
#playing with graphs (here are just some basic configurations)
plot_size = plt.rcParams["figure.figsize"] 
plot_size[0] = 10
plot_size[1] = 10
plt.rcParams["figure.figsize"] = plot_size 

In [None]:
#plotting the distribution of ratings
dataset['Rating'].value_counts().plot(kind='pie', autopct='%1.0f%%')

#if you want a bar graph, you could uncomment (remove the #) from these three lines instead.
#dataset['Rating'].value_counts().plot(kind='bar')
#plt.xlabel('Rating')
#plt.ylabel('Number of reviews')

### Exercise 2
Try making the pie chart show the information in a different column, for example 'Department Name' or 'Class Name', by modifying the code above. (Tip: try replacing 'Rating' with 'Department Name'.)

In [None]:
##exercise 2 area
#dataset['Rating'].value_counts().plot(kind='pie', autopct='%1.0f%%')

In [None]:
#why can this graph be misleading?
clothes_sentiment = dataset.groupby(['Department Name', 'Rating'])['Rating'].count().unstack()
clothes_sentiment.plot(kind='bar')

In [None]:
##Analysing according to the counts of ratings
clothes_sentiment_count = dataset.groupby(['Department Name', 'Rating'])['Rating'].count().unstack()
print("** Number of each rating per department **")
print(clothes_sentiment_count)
#sum per department
print("** Sum of ratings per department **")
dept_sum = clothes_sentiment_count.sum(axis=1)
print(dept_sum)
#percentage
clothes_sentiment_perc = (clothes_sentiment_count.transpose()/dept_sum).transpose()
print("** Percentage of each rating per department **")
print(clothes_sentiment_perc)
clothes_sentiment_perc.plot(kind='bar')
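
Raw counts can mislead because large departments dominate the chart, which is why the cell above divides each row by its total. As a rough sketch of the same idea, `pd.crosstab` with `normalize="index"` produces per-department shares in one call. The `toy` DataFrame and its values are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Toy data standing in for the real reviews (values are invented for illustration)
toy = pd.DataFrame({
    "Department Name": ["Tops", "Tops", "Tops", "Tops", "Jackets", "Jackets"],
    "Rating": [5, 5, 4, 1, 5, 1],
})

# Raw counts: the bigger department has larger numbers in every column
counts = pd.crosstab(toy["Department Name"], toy["Rating"])

# Row-normalised shares: each department's row sums to 1
shares = pd.crosstab(toy["Department Name"], toy["Rating"], normalize="index")

print(counts)
print(shares)
```

In this toy example both departments give a rating of 5 to half of their reviews, even though 'Tops' has twice as many 5-star reviews in absolute terms.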

## Pre-processing

In [None]:
#analysing the text: any ideas of potential problems? 
dataset['Review Text'][2] #this is review number 2, try changing the '2', to read other reviews.

In [None]:
#transforming the review column into an array of reviews
features = np.array(dataset['Review Text'])
features

In [None]:
#creating the vector with the labels
#we are using the 'Rating' column as our labels, so we have 5 classes 
labels = [int(l) for l in dataset['Rating']]
labels

In [None]:
#splitting the data into training, validation and test sets
#training = data used to train the classifiers
#validation = data used to tune the classifiers' parameters (it will make more sense later)
#test = data used to test the classifiers
raw_train, raw_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)
raw_train, raw_val, y_train, y_val = train_test_split(raw_train, y_train, train_size = 0.8)

In [None]:
#Pre-processing the reviews
#normalise words, remove punctuation, remove extra spaces, etc
def pre_proc(features):
    processed_features = []
    for sentence in range(0, len(features)):
        # Remove all tags (like <br />)
        processed_feature = re.sub(r'<.*?>', ' ', str(features[sentence]))

        #Remove all special characters
        processed_feature = re.sub(r'[^a-zA-Z0-9]', ' ', processed_feature)

        # Substituting multiple spaces with single space
        processed_feature = re.sub(r'\s+', ' ', processed_feature, flags=re.I)

        # Removing prefixed 'b'
        processed_feature = re.sub(r'^b\s+', '', processed_feature)
        
        # Removing everything that has numbers 
        #processed_feature = re.sub(r'\w*\d\w*', '', processed_feature)

        # Converting to Lowercase
        processed_feature = processed_feature.lower()

        processed_features.append(processed_feature)
    return processed_features

In [None]:
#apply pre_proc() function to all data splits
proc_train = pre_proc(raw_train)
proc_val = pre_proc(raw_val)
proc_test = pre_proc(raw_test)

### Exercise 3
Change the 'i' variable to see different examples of the original and pre-processed data

In [None]:
i = 5
print("ORIGINAL: %s" % raw_train[i])
print()
print("PRE-PROCESSED: %s" % proc_train[i])

## Feature Extraction
We will use bag-of-words as features for training our classifiers. In a bag-of-words approach, an algorithm counts the number of times each word appears in a document. Each word in the entire collection of documents (the corpus) becomes a feature in the feature vector. Since most documents contain only a small fraction of all the words, the resulting vectors are sparse (mostly zeros).

Instead of counting the number of times a word appears in a document, we can also use a binary approach (whether or not a word appears in a document). Any other ideas? 
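
To make the difference between counts and the binary approach concrete, here is a minimal sketch on a two-document toy corpus. The `docs` list and the variable names are invented for this illustration; only `CountVectorizer` and its `binary` parameter come from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two tiny made-up "reviews"
docs = ["great dress great fit", "poor fit"]

# Count-based bag-of-words: how many times each word occurs
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()

# Binary bag-of-words: 1 if the word occurs at all, 0 otherwise
binary_vec = CountVectorizer(binary=True)
binary = binary_vec.fit_transform(docs).toarray()

# Words ordered by their column index in the matrices
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)

print(vocab)
print(counts)
print(binary)
```

The word 'great' appears twice in the first document, so its count is 2 in `counts` but only 1 in `binary`.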

In [None]:
def extract_features(binary=True, max_df=1.0, min_df=0.0, ngram_range=(1,1), sw=False):
#By default, we are using a binary bag-of-words approach: 
#if a word appears in a document it will receive 1 (0 otherwise)

    stop_words=[]
    if sw:
        f = urlopen("https://raw.githubusercontent.com/lionfish0/discover_stem/master/stopwords.txt").read()
        #the download is raw bytes: decode it to text and split into one word per entry
        stop_words = f.decode("utf-8").split()

    #pass the 'binary' argument through (it was previously hard-coded to True)
    cv = CountVectorizer(binary=binary, max_df=max_df, min_df=min_df, ngram_range=ngram_range, stop_words=stop_words)
    cv.fit(proc_train)

    #check the features outputted below
    #each possible word in our pre-processed vector became a feature
    #can we do better? 
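    #vocabulary_ maps each word to its column index and works across scikit-learn versions
    #(get_feature_names() was removed in scikit-learn 1.2)
    print("** Vocabulary size: %d" % len(cv.vocabulary_))
    print("** Words:")
    print(sorted(cv.vocabulary_))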

    #apply the model to all data splits
    #can you think of any problems? 
    X_train = cv.transform(proc_train)
    X_val = cv.transform(proc_val)
    X_test = cv.transform(proc_test)
    return X_train, X_val, X_test

In [None]:
#call the above function to extract features
X_train, X_val, X_test = extract_features(binary=True, max_df=1.0, min_df=0.0, ngram_range=(1,1), sw=False)

In [None]:
print(X_train)

## Training classifiers

In [None]:
#just a function to print a nice confusion matrix
def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax
class_names = ['1', '2', '3', '4', '5']
np.set_printoptions(precision=2)

### Baseline: majority class classifier
Predicts all instances as the majority class 
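
As a small sketch of what this baseline does, the example below fits a `DummyClassifier` on made-up labels. The arrays `X_toy` and `y_toy` are assumptions for illustration only. The point is that the baseline ignores the features entirely, so any real classifier should beat its score.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: rating 5 is the most common class
y_toy = np.array([5, 5, 5, 4, 3, 1])
# Placeholder features: this strategy never looks at them
X_toy = np.zeros((len(y_toy), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

# Every prediction is the majority class seen during fit
preds = baseline.predict(X_toy)
accuracy = (preds == y_toy).mean()

print(preds)
print(accuracy)
```

Here the accuracy equals the share of the majority class (3 out of 6), which shows why accuracy alone can look decent on imbalanced data.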

In [None]:
#training the majority class classifier
dc = DummyClassifier(strategy="most_frequent")
dc.fit(X_train, y_train)

In [None]:
#evaluation function
def evaluate_classifier(cls, X_test, y_test):
    
    preds = cls.predict(X_test)
    print(classification_report(y_test, preds))
    ax = plot_confusion_matrix(y_test, preds, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')
    plt.show()

In [None]:
#evaluating it
evaluate_classifier(dc, X_test, y_test)

### First experiment: K-Nearest Neighbors


In [None]:
def train_knn(X_train, y_train, n_neighbors=3):  
    #training the classifier
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)
    return knn


In [None]:
knn = train_knn(X_train, y_train, n_neighbors=3)
evaluate_classifier(knn, X_test, y_test)

### Exercise 4
Vary the value of 'n_neighbors' and see if you can improve the performance of the classifier.
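
One way to explore this systematically is to loop over several values and compare them on held-out data. The sketch below uses synthetic data from `make_classification` so that it runs on its own; `X_demo`, `X_tr`, `X_va` and the list of `k` values are assumptions for illustration. In the notebook you would use `X_train`, `y_train`, `X_val` and `y_val` instead.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the review dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Train one classifier per candidate k and score it on the validation split
val_scores = {}
for k in [1, 3, 5, 10, 20]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    val_scores[k] = accuracy_score(y_va, model.predict(X_va))

# Pick the k with the highest validation accuracy
best_k = max(val_scores, key=val_scores.get)

print(val_scores)
print("best k:", best_k)
```

Comparing candidates on the validation split, rather than the test split, keeps the test set untouched for the final evaluation.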

In [None]:
knn = train_knn(X_train, y_train, n_neighbors=10)
evaluate_classifier(knn, X_test, y_test)

### Exercise 5
Vary the value of 'i' to see more examples of the data and their predicted and true ratings

In [None]:
#print some samples of the data and compare predicted and true values
#compute the KNN predictions for the test set first
knn_preds = knn.predict(X_test)
i = 1
print(raw_test[i])
print()
print("** Predicted value by KNN: %d" % knn_preds[i])
print("** True value: %d" % y_test[i])

### Second experiment: Logistic Regression

In [None]:
def optimise_C(X_train, y_train, X_val, y_val, C=[0.01, 0.025, 0.05, 0.25, 0.5, 1.0]):
    #optimise the parameter C using the validation data
    best_acc = 0.
    best_c = 0.
    accuracies = []
    for c in C:
    
        lr = LogisticRegression(C=c, multi_class='auto', solver='liblinear')
        lr.fit(X_train, y_train)
        cur_acc = accuracy_score(y_val, lr.predict(X_val))
        print ("Accuracy for C=%s: %s" % (c, cur_acc))
        accuracies.append(cur_acc)
        if cur_acc > best_acc:
            best_c = c
            best_acc = cur_acc

    print ("*** Best accuracy = %f, best C = %f" % (best_acc, best_c))
    plt.plot(np.array(C).astype('str'), accuracies, 'ro')
    return best_c

    

In [None]:
best_c = optimise_C(X_train, y_train, X_val, y_val, C=[0.01, 0.025, 0.05, 0.25, 0.5, 1.0])

### Exercise 6
Vary the values in the C list (each value must be a positive float) and see how the validation accuracy changes.
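
In scikit-learn's `LogisticRegression`, `C` is the inverse of the regularisation strength, so smaller values mean stronger regularisation. The sketch below, on synthetic data, shows one visible effect: the learned coefficients shrink as `C` gets smaller. The data, the grid `C_grid` and the use of the default solver are assumptions for illustration and differ from the notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the review dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate values spaced by powers of ten, a common way to scan C
C_grid = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

# Record the size of the coefficient vector for each C
coef_norms = []
for c in C_grid:
    model = LogisticRegression(C=c, max_iter=1000).fit(X_demo, y_demo)
    coef_norms.append(float(np.linalg.norm(model.coef_)))

for c, norm in zip(C_grid, coef_norms):
    print("C=%g -> coefficient norm %.3f" % (c, norm))
```

Scanning values on a logarithmic scale first, then refining around the best one, is often more informative than trying closely spaced values.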

In [None]:
best_c = optimise_C(X_train, y_train, X_val, y_val, C=[0.01, 0.025, 0.05, 0.25, 0.5, 1.0, 2.0])

In [None]:
def train_lr(X_train, y_train, best_c=1.0):
    #training the model with the best C
    lr = LogisticRegression(C=best_c, multi_class='auto', solver='liblinear')
    lr.fit(X_train, y_train)
    return lr

In [None]:
lr = train_lr(X_train, y_train, best_c=best_c)
evaluate_classifier(lr, X_test, y_test)

### Exercise 7
Try to change the features and re-train the KNN and Logistic Regression classifiers. 
- What happens if you filter out words that are too frequent or too rare? 
- What happens if we use bigrams or trigrams? 
- What happens if we use a stop-word list?
- What happens if we use counts of words instead of the binary approach?
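
Before re-training, it can help to see how each option changes the vocabulary. The sketch below counts the features produced by `CountVectorizer` on a three-document toy corpus. The `docs` list and the helper `vocab_size` are hypothetical names introduced for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three tiny made-up "reviews"
docs = [
    "the dress is great",
    "the fit is poor",
    "the colour is great",
]

def vocab_size(**kwargs):
    """Fit a CountVectorizer with the given options and return its vocabulary size."""
    return len(CountVectorizer(**kwargs).fit(docs).vocabulary_)

print(vocab_size())                           # all unigrams
print(vocab_size(max_df=0.9))                 # drop words found in more than 90% of documents
print(vocab_size(min_df=2))                   # keep words found in at least 2 documents
print(vocab_size(ngram_range=(1, 2)))         # unigrams plus bigrams
print(vocab_size(stop_words=["the", "is"]))   # remove a small stop-word list
```

Note that a float `max_df` or `min_df` is read as a proportion of documents, while an integer is read as an absolute document count.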

In [None]:
#exercise 7 area

#change the parameters below to solve the exercise
#binary: whether or not we use the binary approach
#max_df: maximum frequency
#min_df: minimum frequency
#ngram_range: n-grams considered 
#sw: whether or not we should use a stopwords list
X_train, X_val, X_test = extract_features(binary=False, max_df=1.0, min_df=0.0, ngram_range=(1,1), sw=False)


In [None]:
##KNN - you can also change n_neighbors values
knn = train_knn(X_train, y_train, n_neighbors=3)
evaluate_classifier(knn, X_test, y_test)

In [None]:
##LR - you can also change C values
best_c = optimise_C(X_train, y_train, X_val, y_val, C=[0.01, 0.025, 0.05, 0.25, 0.5, 1.0])
lr = train_lr(X_train, y_train, best_c=best_c)
evaluate_classifier(lr, X_test, y_test)