# Homework 10 - Sentiment Analysis with TF-IDF Vectorization
In this assignment, we will apply NLP concepts from lecture and use TF-IDF vectorization. You will need the sentiment dataset linked from the Canvas assignment page, so download it before working through this guide. As always, we first need to import a few libraries.

Complete the missing parts in this guide.

### Step 1: Load Data
You can load the data from the provided TSV file using `pandas`.

### Step 2: Preprocess
 - Clean the data by removing stop words, punctuation, emoticons, etc.

### Step 3: Train and test a model to predict the sentiment of each sentence
 - Train and test the model on this data using `TfidfVectorizer`, `Pipeline`, and `LogisticRegression`.
 - Print `best_params_`, `best_score_`, and `score`.

### Step 4: Repeat for all the datasets
 - 'amazon_cells_labelled.tsv'
 - 'yelp_labelled.tsv'
 - 'imdb_labelled.tsv'

## Dataset Overview
The dataset, originally obtained from https://archive.ics.uci.edu/dataset/331/sentiment+labelled+sentences, contains sentences labeled with sentiment. Each sentence is associated with a sentiment label (positive or negative). The dataset is split into three parts, each containing sentences from a different source: Amazon, Yelp, and IMDb.
The score is either 1 (positive) or 0 (negative).


## Submission Guidelines

- Submit your completed notebook as an HTML export or a PDF file.

To export to HTML, if you are on Jupyter, select `File` > `Export Notebook As` > `HTML`.

If you are on VS Code, you can use the `Jupyter: Export to HTML` command:
 - Open the command palette (Ctrl+Shift+P or Cmd+Shift+P on Mac).
 - Search for `Jupyter: Export to HTML`.
 - Save the HTML file to your computer and submit it via Canvas.


In [1]:
import pandas as pd
import numpy as np
import string
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import nltk
import re
%matplotlib inline

The datasets have two columns: `sentence` and `score`. The `sentence` column contains the text of the sentence, and the `score` column contains the sentiment label (1 for positive, 0 for negative).

In [2]:
df = pd.read_csv("../Datasets/yelp_labelled.tsv", sep="\t")
df.head()

Unnamed: 0,sentence,score
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1
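
The `Unnamed: 0` column above suggests that the TSV stores the row index as a first column with an empty header. A minimal sketch, assuming that layout, of how `index_col=0` would keep it out of the columns; the two-row `tsv_text` below is a hypothetical stand-in for the real file:

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the real TSV (assumed layout:
# an unnamed index column, then `sentence` and `score`).
tsv_text = (
    "\tsentence\tscore\n"
    "0\tWow... Loved this place.\t1\n"
    "1\tCrust is not good.\t0\n"
)

# Default read: the unnamed first column becomes 'Unnamed: 0'.
df_default = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# With index_col=0 the first column is used as the row index instead.
df_indexed = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

Either way the `sentence` and `score` columns are unchanged, so the rest of this guide works with or without `index_col=0`.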


In [3]:
# Debugging
df.loc[0, 'sentence']

'Wow... Loved this place.'

Great! Now we need to clean up the dataframe by removing stop words and punctuation. Fill in the code for the `remove_punctuation()` and `remove_stopwords()` functions as described in lecture.

Note: In addition to `remove_punctuation` and `remove_stopwords`, you can also convert the text to lower case. You can also tokenize the text, stem the tokens, and then join the stemmed tokens back into a string.
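
The optional stemming step from the note could look roughly like the sketch below. It assumes NLTK's `PorterStemmer` and a plain whitespace split as the tokenizer (so no extra NLTK downloads are needed); the name `stem_text` is hypothetical and not part of the assignment template.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def stem_text(text):
    """Lowercase, split on whitespace, stem each token, and rejoin."""
    tokens = text.lower().split()
    return " ".join(stemmer.stem(token) for token in tokens)


# Example call on a short string
print(stem_text("Running runs"))
```

If you try this, it would most naturally be applied after `remove_punctuation` and `remove_stopwords`, for example as one more `.apply(stem_text)` in the chain.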

In [4]:
nltk.download('stopwords')
stop = stopwords.words('english')

def remove_punctuation(text):
    # Your Code Here
    '''Replace each punctuation mark in the text with a space.'''
    # Deleting the punctuation outright would merge words separated only by punctuation:
    # translator = str.maketrans('', '', string.punctuation)
    # Mapping each mark to a space preserves the separation between words
    translator = str.maketrans({key: " " for key in string.punctuation})
    # return the text with punctuation marks replaced by spaces
    return text.translate(translator)


def remove_stopwords(text):
    # Your Code Here
    '''Remove stop words from the text and lowercase the remaining words.'''
    # keep only the words that are not stop words, lowercased
    text = [word.lower() for word in text.split() if word.lower() not in stop]
    # joining the list of words with space separator
    return " ".join(text)

df['sentence'] = df['sentence'].apply(remove_punctuation).apply(remove_stopwords)
df.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\EUSRIOM\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,sentence,score
0,wow loved place,1
1,crust good,0
2,tasty texture nasty,0
3,stopped late may bank holiday rick steve recom...,1
4,selection menu great prices,1


In [5]:
# Debugging
df.loc[0, 'sentence']

'wow loved place'

Split the cleaned dataset into training and test sets using `train_test_split`.

In [6]:
# Your code here
from sklearn.model_selection import train_test_split

X = df['sentence']
y = df['score']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Define a pipeline with the required steps (TF-IDF and logistic regression in our case).

`TfidfVectorizer()` can take `stop_words='english'` as a parameter, but we leave it out here because `remove_stopwords` has already removed the stop words.

In [7]:
# Libraries
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# not including stop_words as that was already done (by remove_stopwords) 

pipe = Pipeline([('tfidf', TfidfVectorizer()),                  # Text feature extraction
                 ('lr', LogisticRegression(random_state=42)) ]) # Classification

Next, define a parameter grid for the hyperparameter search, then pass the pipeline to `GridSearchCV`, which finds the best parameters and refits the model with them.

In [8]:
# Hyperparameters to tune

param_grid = {
    'tfidf__max_df': [0.5, 0.7, 0.9, 1.0],
    'tfidf__min_df': [1, 5, 10],
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    # 'tfidf__use_idf': [True],   # Default
    # 'tfidf__smooth_idf': [True],# Default
    'lr__C': [0.01, 0.1, 1, 10, 100],
    # 'lr__penalty': ['l2'],      # Default
    # 'lr__max_iter': [100],      # Default
    'lr__solver': ['liblinear', 'lbfgs', 'saga']
    }

In [9]:
# Your code for GridSearchCV
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)


In [10]:
# Your code to fit the model
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 540 candidates, totalling 2700 fits
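
The 540 candidates in the log are the product of the grid sizes (4 × 3 × 3 × 5 × 3), and 5-fold cross-validation multiplies that by 5. A quick way to check the count before launching a long search is scikit-learn's `ParameterGrid`, sketched here with the same grid as above:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'tfidf__max_df': [0.5, 0.7, 0.9, 1.0],
    'tfidf__min_df': [1, 5, 10],
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'lr__C': [0.01, 0.1, 1, 10, 100],
    'lr__solver': ['liblinear', 'lbfgs', 'saga'],
}

n_candidates = len(ParameterGrid(param_grid))  # number of parameter combinations
n_folds = 5                                    # matches cv=5 above

print(n_candidates, n_candidates * n_folds)
```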


Write code to print `best_params_`, `best_score_`, and `score`.

In [11]:

# Your code here
print("Best Parameters:", grid_search.best_params_)
print(f"Best Cross-Validation Accuracy: {grid_search.best_score_:.4f}")

Best Parameters: {'lr__C': 1, 'lr__solver': 'liblinear', 'tfidf__max_df': 0.5, 'tfidf__min_df': 1, 'tfidf__ngram_range': (1, 1)}
Best Cross-Validation Accuracy: 0.7863
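
The instructions also ask for `score`, which the cell above does not print. A minimal sketch, assuming `score` means accuracy on the held-out test split: `GridSearchCV.score` evaluates the refitted best model on data it has not seen. The sentences below are toy examples written for this sketch, not assignment data, and the grid is deliberately tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Toy, hand-written sentences (NOT from the assignment datasets)
sentences = [
    "loved place", "great food", "excellent service", "great prices",
    "tasty food loved", "excellent place great",
    "crust bad", "nasty texture", "terrible service", "bad food",
    "terrible place nasty", "bad prices terrible",
]
labels = [1] * 6 + [0] * 6

X_train, X_test, y_train, y_test = train_test_split(
    sentences, labels, test_size=0.25, random_state=42, stratify=labels
)

pipe = Pipeline([('tfidf', TfidfVectorizer()),
                 ('lr', LogisticRegression(random_state=42))])

# Small grid and cv=2 only because the toy data is tiny
grid_search = GridSearchCV(pipe, {'lr__C': [0.1, 1, 10]}, cv=2, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Accuracy of the refitted best estimator on the held-out test split
test_accuracy = grid_search.score(X_test, y_test)
print(f"Test Accuracy: {test_accuracy:.4f}")
```

In the notebook itself, the equivalent would be a single extra line calling `grid_search.score(X_test, y_test)` after the fit.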


## Now repeat the above steps for the remaining datasets
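
Since the same split, pipeline, and search are repeated for each file, one option is to wrap them in a helper and call it once per dataset. The sketch below is only an illustration: `fit_sentiment_model` is a hypothetical name, and it is demonstrated on a toy dataframe (not assignment data) with a reduced grid so that it runs quickly.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline


def fit_sentiment_model(df, param_grid, cv=5):
    """Split one cleaned dataframe, run the grid search, and collect the scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        df['sentence'], df['score'], test_size=0.2, random_state=42
    )
    pipe = Pipeline([('tfidf', TfidfVectorizer()),
                     ('lr', LogisticRegression(random_state=42))])
    search = GridSearchCV(pipe, param_grid, cv=cv, scoring='accuracy')
    search.fit(X_train, y_train)
    return {
        'best_params': search.best_params_,
        'best_score': search.best_score_,
        'test_score': search.score(X_test, y_test),
    }


# Toy stand-in for a cleaned dataset (hand-written, NOT assignment data)
toy_df = pd.DataFrame({
    'sentence': [
        "loved place", "great food", "excellent service", "great prices",
        "tasty food loved", "excellent place great",
        "crust bad", "nasty texture", "terrible service", "bad food",
        "terrible place nasty", "bad prices terrible",
    ],
    'score': [1] * 6 + [0] * 6,
})

result = fit_sentiment_model(toy_df, {'lr__C': [1, 10]}, cv=2)
print(result)
```

With the real data, the same helper could be called in a loop over the three file names after loading and cleaning each dataframe.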

## amazon_cells_labelled.tsv

In [12]:
# amazon_cells_labelled.tsv

file_name = 'amazon_cells_labelled.tsv'

print('\n'+file_name)
df = pd.read_csv("../Datasets/"+file_name, sep="\t")
df.head()


amazon_cells_labelled.tsv


Unnamed: 0,sentence,score
0,So there is no way for me to plug it in here i...,0
1,"Good case, Excellent value.",1
2,Great for the jawbone.,1
3,Tied to charger for conversations lasting more...,0
4,The mic is great.,1


In [13]:
# Debugging
df.loc[3, 'sentence']

'Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!'

In [14]:
# Removing punctuation and stop words

df['sentence'] = df['sentence'].apply(remove_punctuation).apply(remove_stopwords)
df.head()

Unnamed: 0,sentence,score
0,way plug us unless go converter,0
1,good case excellent value,1
2,great jawbone,1
3,tied charger conversations lasting 45 minutes ...,0
4,mic great,1


In [15]:
# Debugging
df.loc[3, 'sentence']

'tied charger conversations lasting 45 minutes major problems'
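
The original sentence has no space after the period in `minutes.MAJOR`, which makes it a handy case for comparing the two translators considered in `remove_punctuation`. The sketch below applies both to that sentence; the variable names are chosen here for illustration.

```python
import string

sentence = "Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!"

# Option used in this notebook: map each punctuation mark to a space
to_space = str.maketrans({key: " " for key in string.punctuation})
# Commented-out alternative: delete punctuation marks outright
to_blank = str.maketrans('', '', string.punctuation)

# Lowercase and collapse repeated whitespace so the two results are easy to compare
spaced = " ".join(sentence.translate(to_space).lower().split())
blanked = " ".join(sentence.translate(to_blank).lower().split())

print(spaced)
print(blanked)
```

Mapping to a space keeps `minutes` and `major` as separate tokens, while deleting the period fuses them into a single token that TF-IDF would treat as a different word.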

In [16]:
# Train test split

X = df['sentence']
y = df['score']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [17]:
# Pipeline

# not including stop_words as that was already done (by remove_stopwords) 

pipe = Pipeline([('tfidf', TfidfVectorizer()),                  # Text feature extraction
                 ('lr', LogisticRegression(random_state=42)) ]) # Classification

In [18]:
# Hyperparameters to tune

param_grid = {
    'tfidf__max_df': [0.5, 0.7, 0.9, 1.0],
    'tfidf__min_df': [1, 5, 10],
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    # 'tfidf__use_idf': [True],   # Default
    # 'tfidf__smooth_idf': [True],# Default
    'lr__C': [0.01, 0.1, 1, 10, 100],
    # 'lr__penalty': ['l2'],      # Default
    # 'lr__max_iter': [100],      # Default
    'lr__solver': ['liblinear', 'lbfgs', 'saga']
    }


In [19]:
# GridSearchCV

grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)

In [20]:
# Your code to fit the model
grid_search.fit(X_train, y_train)


Fitting 5 folds for each of 540 candidates, totalling 2700 fits


In [21]:
# Print best parameters, score

print("Best Parameters:", grid_search.best_params_)
print(f"Best Cross-Validation Accuracy: {grid_search.best_score_:.4f}")

Best Parameters: {'lr__C': 10, 'lr__solver': 'liblinear', 'tfidf__max_df': 0.5, 'tfidf__min_df': 1, 'tfidf__ngram_range': (1, 1)}
Best Cross-Validation Accuracy: 0.8125


## imdb_labelled.tsv

In [22]:
# imdb_labelled.tsv

file_name = 'imdb_labelled.tsv'

print('\n'+file_name)
df = pd.read_csv("../Datasets/"+file_name, sep="\t")
df.head()


imdb_labelled.tsv


Unnamed: 0,sentence,score
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1


In [23]:
# Removing punctuation and stop words

df['sentence'] = df['sentence'].apply(remove_punctuation).apply(remove_stopwords)
df.head()

Unnamed: 0,sentence,score
0,slow moving aimless movie distressed drifting ...,0
1,sure lost flat characters audience nearly half...,0
2,attempting artiness black white clever camera ...,0
3,little music anything speak,0
4,best scene movie gerardo trying find song keep...,1


In [24]:
# Train test split

X = df['sentence']
y = df['score']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [25]:
# Pipeline

# not including stop_words as that was already done (by remove_stopwords) 

pipe = Pipeline([('tfidf', TfidfVectorizer()),                  # Text feature extraction
                 ('lr', LogisticRegression(random_state=42)) ]) # Classification

In [26]:
# Hyperparameter tuning

param_grid = {
    'tfidf__max_df': [0.5, 0.7, 0.9, 1.0],
    'tfidf__min_df': [1, 5, 10],
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    # 'tfidf__use_idf': [True],   # Default
    # 'tfidf__smooth_idf': [True],# Default
    'lr__C': [0.01, 0.1, 1, 10, 100],
    # 'lr__penalty': ['l2'],      # Default
    # 'lr__max_iter': [100],      # Default
    'lr__solver': ['liblinear', 'lbfgs', 'saga']
    }

In [27]:
# GridSearchCV

grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)

In [28]:
# Your code to fit the model
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 540 candidates, totalling 2700 fits


In [29]:
# Print best parameters, score

print("Best Parameters:", grid_search.best_params_)
print(f"Best Cross-Validation Accuracy: {grid_search.best_score_:.4f}")

Best Parameters: {'lr__C': 10, 'lr__solver': 'liblinear', 'tfidf__max_df': 0.5, 'tfidf__min_df': 1, 'tfidf__ngram_range': (1, 2)}
Best Cross-Validation Accuracy: 0.7676


## Results Summary

**Replacing punctuation with a space (' ')**

|File|Best parameters|Best Cross-Validation Accuracy|
|---|---|:-:|
|yelp_labelled.tsv|{'lr__C': 1, 'lr__solver': 'liblinear',<br/>'tfidf__max_df': 0.5, 'tfidf__min_df': 1, 'tfidf__ngram_range': (1, 1)}|0.7863|
|amazon_cells_labelled.tsv|{'lr__C': 10, 'lr__solver': 'liblinear',<br/>'tfidf__max_df': 0.5, 'tfidf__min_df': 1, 'tfidf__ngram_range': (1, 1)}|0.8125|
|imdb_labelled.tsv|{'lr__C': 10, 'lr__solver': 'liblinear',<br/>'tfidf__max_df': 0.5, 'tfidf__min_df': 1, 'tfidf__ngram_range': (1, 2)}|0.7676|


**Replacing punctuation with an empty string ('')**

|File|Best parameters|Best Cross-Validation Accuracy|
|---|---|:-:|
|yelp_labelled.tsv|{'lr__C': 1, 'lr__solver': 'liblinear',<br/>'tfidf__max_df': 0.5, 'tfidf__min_df': 1, 'tfidf__ngram_range': (1, 1)}|0.7913|
|amazon_cells_labelled.tsv|{'lr__C': 10, 'lr__solver': 'liblinear',<br/>'tfidf__max_df': 0.5, 'tfidf__min_df': 1, 'tfidf__ngram_range': (1, 1)}|0.8163|
|imdb_labelled.tsv|{'lr__C': 100, 'lr__solver': 'saga',<br/>'tfidf__max_df': 0.7, 'tfidf__min_df': 1, 'tfidf__ngram_range': (1, 2)}|0.7593|

