<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 8.4: Sentiment Analysis

This lab performs sentiment analysis on sentiment-labelled sentences using two types of feature extraction - a count vectoriser and TF-IDF vectoriser.

Based on the video tutorial **Text Classification with Machine Learning,SpaCy and Scikit(Sentiment Analysis)** by **Jesse E. Agbe (JCharis)**.

## Data Source: UCI
### UCI - Machine Learning Repository
- Center for Machine Learning and Intelligent Systems

The [**UCI Machine Learning Repository**](http://archive.ics.uci.edu/about) is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.

### Dataset
- [Sentiment Labelled Sentences Data Set](http://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences)

### Abstract
The dataset contains sentences labelled with positive or negative sentiment.

- Data Set Characteristics: Text
- Number of Instances: 3000
- Area: N/A
- Attribute Characteristics: N/A
- Number of Attributes: N/A
- Date Donated: 2015-05-30
- Associated Tasks: Classification
- Missing Values? N/A

### Source
Dimitrios Kotzias dkotzias '@' ics.uci.edu

### Data Set Information
This dataset was created for the Paper 'From Group to Individual Labels using Deep Features', Kotzias et. al,. KDD 2015

Please cite the paper if you want to use it :)

It contains sentences labelled with positive or negative sentiment.

### Format
sentence &lt;tab&gt; score &lt;newline&gt;

### Details
Score is either 1 (for positive) or 0 (for negative)

The sentences come from three different websites/fields:
- imdb.com
- amazon.com
- yelp.com

For each website, there exist **500 positive** and **500 negative** sentences. Those were selected randomly for larger datasets of reviews.

We attempted to select sentences that have a clearly positive or negative connotation, the goal was for no neutral sentences to be selected.

For the full datasets look:

- **imdb**: Maas et. al., 2011 _Learning word vectors for sentiment analysis_
- **amazon**: McAuley et. al., 2013 _Hidden factors and hidden topics: Understanding rating dimensions with review text_
- **yelp**: [Yelp dataset challenge](http://www.yelp.com/dataset_challenge)


### Attribute Information
The attributes are text sentences, extracted from reviews of products, movies, and restaurants

### Relevant Papers
**From Group to Individual Labels using Deep Features**, Kotzias et. al,. KDD 2015

### Citation Request
**From Group to Individual Labels using Deep Features**, Kotzias et. al,. KDD 2015

## Import libraries

In [1]:
## Import Libraries
import pandas as pd

import regex as re
import spacy

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split

from sklearn.svm import SVC

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve
from sklearn.metrics import auc

import warnings
warnings.filterwarnings('ignore')

## Load data

Load Yelp, Amazon and Imdb Data into dataframes. Create three column names 'text', 'sentiment' and 'source' (equal to one of 'yelp', 'imdb' or 'amazon' for each dataframe).

Hint: Source is separated by tabs and has no headers.

In [2]:
yelp_text = 'yelp_labelled.txt'
imdb_text = 'imdb_labelled_fixed.txt'
amazon_text = 'amazon_cells_labelled.txt'

# ANSWER

In [3]:

# Define file paths to the datasets
yelp_text = 'yelp_labelled.txt'        
imdb_text = 'imdb_labelled_fixed.txt'  
amazon_text = 'amazon_cells_labelled.txt'  

# Load the Yelp data
yelp_df = pd.read_csv(yelp_text, sep='\t', header=None, names=['text', 'sentiment'])
yelp_df['yelp_labelled.txt'] = 'yelp'  # Set source to 'yelp'

# Load the IMDb data
imdb_df = pd.read_csv(imdb_text, sep='\t', header=None, names=['text', 'sentiment'])
imdb_df['source'] = 'imdb'  # Set source to 'imdb'

# Load the Amazon data
amazon_df = pd.read_csv(amazon_text, sep='\t', header=None, names=['text', 'sentiment'])
amazon_df['source'] = 'amazon'  # Set source to 'amazon'

# Optionally, combine all datasets into a single DataFrame if needed
combined_df = pd.concat([yelp_df, imdb_df, amazon_df], ignore_index=True)

# Display the first few rows of each dataframe
print("Yelp DataFrame:\n", yelp_df.head())
print("\nIMDb DataFrame:\n", imdb_df.head())
print("\nAmazon DataFrame:\n", amazon_df.head())
print("\nCombined DataFrame:\n", combined_df.head())


Yelp DataFrame:
                                                 text  sentiment  \
0                           Wow... Loved this place.          1   
1                                 Crust is not good.          0   
2          Not tasty and the texture was just nasty.          0   
3  Stopped by during the late May bank holiday of...          1   
4  The selection on the menu was great and so wer...          1   

  yelp_labelled.txt  
0              yelp  
1              yelp  
2              yelp  
3              yelp  
4              yelp  

IMDb DataFrame:
                                                 text  sentiment source
0  A very, very, very slow-moving, aimless movie ...          0   imdb
1  Not sure who was more lost - the flat characte...          0   imdb
2  Attempting artiness with black & white and cle...          0   imdb
3       Very little music or anything to speak of.            0   imdb
4  The best scene in the movie was when Gerardo i...          1   imdb

Ama

## Inspect the data

Check your datasets.

In [4]:
# Function to inspect DataFrames
def inspect_data(df, source_name):
    print(f"{source_name} DataFrame Inspection")
    print("-" * 40)
    
    # Display the shape of the DataFrame
    print(f"Shape: {df.shape}")
    
    # Display the first few rows of the DataFrame
    print("First 5 Rows:")
    print(df.head())
    
    # Display summary statistics
    print("\nSummary Statistics:")
    print(df.describe())
    
    # Check for any null values
    print("\nMissing Values:")
    print(df.isnull().sum())
    
    print("\nUnique Sentiment Labels:")
    print(df['sentiment'].unique())
    
    print("=" * 40)

# Inspecting each DataFrame
inspect_data(yelp_df, 'Yelp')
inspect_data(imdb_df, 'IMDb')
inspect_data(amazon_df, 'Amazon')


Yelp DataFrame Inspection
----------------------------------------
Shape: (1000, 3)
First 5 Rows:
                                                text  sentiment  \
0                           Wow... Loved this place.          1   
1                                 Crust is not good.          0   
2          Not tasty and the texture was just nasty.          0   
3  Stopped by during the late May bank holiday of...          1   
4  The selection on the menu was great and so wer...          1   

  yelp_labelled.txt  
0              yelp  
1              yelp  
2              yelp  
3              yelp  
4              yelp  

Summary Statistics:
        sentiment
count  1000.00000
mean      0.50000
std       0.50025
min       0.00000
25%       0.00000
50%       0.50000
75%       1.00000
max       1.00000

Missing Values:
text                 0
sentiment            0
yelp_labelled.txt    0
dtype: int64

Unique Sentiment Labels:
[1 0]
IMDb DataFrame Inspection
---------------------------

## Merge the data

Merge all three datasets.

In [5]:
# ANSWER

# Merge all three datasets
merged_df = pd.concat([yelp_df, imdb_df, amazon_df], ignore_index=True)

# Display the shape of the merged DataFrame
print("Merged DataFrame Shape:", merged_df.shape)

# Display the first few rows of the merged DataFrame
print(merged_df.head())

# Optionally, check unique sources to verify they're correctly represented
print("\nUnique Sources in the Merged DataFrame:")
print(merged_df['source'].unique())

# Optionally, check counts of each sentiment in the merged DataFrame
print("\nCounts of each sentiment in the Merged DataFrame:")
print(merged_df['sentiment'].value_counts())


Merged DataFrame Shape: (3000, 4)
                                                text  sentiment  \
0                           Wow... Loved this place.          1   
1                                 Crust is not good.          0   
2          Not tasty and the texture was just nasty.          0   
3  Stopped by during the late May bank holiday of...          1   
4  The selection on the menu was great and so wer...          1   

  yelp_labelled.txt source  
0              yelp    NaN  
1              yelp    NaN  
2              yelp    NaN  
3              yelp    NaN  
4              yelp    NaN  

Unique Sources in the Merged DataFrame:
[nan 'imdb' 'amazon']

Counts of each sentiment in the Merged DataFrame:
sentiment
1    1500
0    1500
Name: count, dtype: int64


## Prepare the stage
- Load spaCy

In [6]:
nlp = spacy.load('en_core_web_sm')

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.

In [7]:
!pip install spacy




In [9]:
!python -m spacy download en_core_web_sm


Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 148, in _get_module_details
  File "<frozen runpy>", line 112, in _get_module_details
  File "/opt/anaconda3/lib/python3.12/site-packages/spacy/__init__.py", line 13, in <module>
    from . import pipeline  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/spacy/pipeline/__init__.py", line 1, in <module>
    from .attributeruler import AttributeRuler
  File "/opt/anaconda3/lib/python3.12/site-packages/spacy/pipeline/attributeruler.py", line 8, in <module>
    from ..language import Language
  File "/opt/anaconda3/lib/python3.12/site-packages/spacy/language.py", line 43, in <module>
    from .pipe_analysis import analyze_pipes, print_pipe_analysis, validate_attrs
  File "/opt/anaconda3/lib/python3.12/site-packages/spacy/pipe_analysis.py", line 6, in <module>
    from .tokens import Doc, Span, Token
  File "/opt/anaconda3/lib

In [10]:
pip uninstall spacy pydantic -y


Found existing installation: spacy 3.5.0
Uninstalling spacy-3.5.0:
  Successfully uninstalled spacy-3.5.0
Found existing installation: pydantic 1.9.0
Uninstalling pydantic-1.9.0:
  Successfully uninstalled pydantic-1.9.0
Note: you may need to restart the kernel to use updated packages.


In [11]:
pip install spacy==3.0.0 
pip install pydantic==1.7.3

SyntaxError: invalid syntax (2181000152.py, line 1)

In [12]:
!pip install spacy==3.0.0


Collecting spacy==3.0.0
  Downloading spacy-3.0.0.tar.gz (7.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.0/7.0 MB[0m [31m3.8 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25h  Installing build dependencies ... [?25lerror
  [1;31merror[0m: [1msubprocess-exited-with-error[0m
  
  [31m×[0m [32mpip subprocess to install build dependencies[0m did not run successfully.
  [31m│[0m exit code: [1;36m1[0m
  [31m╰─>[0m [31m[651 lines of output][0m
  [31m   [0m Collecting setuptools
  [31m   [0m   Using cached setuptools-75.1.0-py3-none-any.whl.metadata (6.9 kB)
  [31m   [0m Collecting cython>=0.25
  [31m   [0m   Downloading Cython-3.0.11-cp312-cp312-macosx_10_9_x86_64.whl.metadata (3.2 kB)
  [31m   [0m Collecting cymem<2.1.0,>=2.0.2
  [31m   [0m   Using cached cymem-2.0.8-cp312-cp312-macosx_10_9_x86_64.whl.metadata (8.4 kB)
  [31m   [0m Collecting preshed<3.1.0,>=3.0.2
  [31m   [0m   Using cached preshed-3.0.9-cp312-cp312-macosx_1

In [13]:
!pip install pydantic==1.7.3


Collecting pydantic==1.7.3
  Downloading pydantic-1.7.3-py3-none-any.whl.metadata (84 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.9/84.9 kB[0m [31m1.2 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hDownloading pydantic-1.7.3-py3-none-any.whl (107 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m107.7/107.7 kB[0m [31m1.5 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hInstalling collected packages: pydantic
  Attempting uninstall: pydantic
    Found existing installation: pydantic 2.5.3
    Uninstalling pydantic-2.5.3:
      Successfully uninstalled pydantic-2.5.3
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
thinc 8.2.5 requires pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4, but you have pydantic 1.7.3 which is incompatible.
weasel 0.4.1 requires pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4, but you have pydant

In [14]:
!python -m spacy download en_core_web_sm


Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 148, in _get_module_details
  File "<frozen runpy>", line 112, in _get_module_details
  File "/opt/anaconda3/lib/python3.12/site-packages/spacy/__init__.py", line 13, in <module>
    from . import pipeline  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/spacy/pipeline/__init__.py", line 1, in <module>
    from .attributeruler import AttributeRuler
  File "/opt/anaconda3/lib/python3.12/site-packages/spacy/pipeline/attributeruler.py", line 8, in <module>
    from ..language import Language
  File "/opt/anaconda3/lib/python3.12/site-packages/spacy/language.py", line 43, in <module>
    from .pipe_analysis import analyze_pipes, print_pipe_analysis, validate_attrs
  File "/opt/anaconda3/lib/python3.12/site-packages/spacy/pipe_analysis.py", line 6, in <module>
    from .tokens import Doc, Span, Token
  File "/opt/anaconda3/lib

In [18]:
nlp = spacy.load('en_core_web_sm')

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.

## Prepare the text
All the text handling and preparation concerned with the changes and modifications from the raw source text to a format that will be used for the actual processing, things like:
- handle encoding
- handle extraneous and international characters
- handle symbols
- handle metadata and embedded information
- handle repetitions (such multiple spaces or newlines)

Clean text.

In [17]:
def clean_text(text):
    # reduce multiple spaces and newlines to only one
    text = re.sub(r'(\s\s+|\n\n+)', r'\1', text)
    # remove double quotes
    text = re.sub(r'"', '', text)

    return text

In [None]:
# Apply the clean_text function to your dataset.
# ANSWER

## Work the text
Using techniques learned in previous labs, remove StopWords, punctuation, and digits. Entities can be retained. Return the lemmatised form of any remaining words in lower case form.

This removes meaningless information.

In [None]:
# Complete the function
def convert_text(text):
    '''
    Use techniques learned in previous labs.
    1) Remove StopWords, Punctuation and digits.
    2) Retain entities.
    3) Return the lemmatised form of remaining words in lower case form.
    '''
    return text

In [None]:
%%time
df['short'] = df['text'].apply(convert_text)

In [None]:
df.sample(10)

## Split the dataset

In [None]:
# Features and Labels
X = df['short']
y = df['sentiment']

# Apply a train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

## Create a Bag-of-Words Model

In [None]:
# create a matrix of word counts from the text
counts = CountVectorizer()

In [None]:
# do the actual counting
A = counts.fit_transform(X_train, y_train)

In [None]:
# create a classifier using SVC
classifier = SVC(kernel='linear', probability=True)

In [None]:
# train the classifier with the training data
classifier.fit(A, y_train)

In [None]:
# do the transformation for the test data
# NOTE: use `transform()` instead of `fit_transform()`
B = counts.transform(X_test)

In [None]:
# make predictions based on the test data
predictions = classifier.predict(B)

# store probabilities of predictions being 1
probabilities = classifier.predict_proba(B)[:, 1]

In [None]:
# check the accuracy
print('Accuracy: %.4f' % accuracy_score(y_test, predictions))

## Repeat using TF-IDF
TF-IDF = Term Frequency - Inverse Document Frequency

In [None]:
# create a matrix of word counts from the text
# use TF-IDF
tfidf = TfidfVectorizer()
# do the actual counting
A = tfidf.fit_transform(X_train, y_train)

# train the classifier with the training data
classifier.fit(A, y_train)

# do the transformation for the test data
# NOTE: use `transform()` instead of `fit_transform()`
B = tfidf.transform(X_test)

# make predictions based on the test data
predictions = classifier.predict(B)

# store probabilities of predictions being 1
probabilities = classifier.predict_proba(B)[:, 1]

# check the accuracy
print('Accuracy: %.4f' % accuracy_score(y_test, predictions))

## Defining a helper function to show results and charts

In [None]:

def show_summary_report(actual, prediction, probabilities):

    if isinstance(actual, pd.Series):
        actual = actual.values.astype(int)
    prediction = prediction.astype(int)

    accuracy_ = accuracy_score(actual, prediction)
    precision_ = precision_score(actual, prediction)
    recall_ = recall_score(actual, prediction)
    roc_auc_ = roc_auc_score(actual, probabilities)

    print('Accuracy : %.4f [TP / N] Proportion of predicted labels that match the true labels. Best: 1, Worst: 0' % accuracy_)
    print('Precision: %.4f [TP / (TP + FP)] Not to label a negative sample as positive.        Best: 1, Worst: 0' % precision_)
    print('Recall   : %.4f [TP / (TP + FN)] Find all the positive samples.                     Best: 1, Worst: 0' % recall_)
    print('ROC AUC  : %.4f                                                                     Best: 1, Worst: < 0.5' % roc_auc_)
    print('-' * 107)
    print('TP: True Positives, FP: False Positives, TN: True Negatives, FN: False Negatives, N: Number of samples')

    # Confusion Matrix
    mat = confusion_matrix(actual, prediction)

    # Precision/Recall
    precision, recall, _ = precision_recall_curve(actual, probabilities)
    average_precision = average_precision_score(actual, probabilities)

    # Compute ROC curve and ROC area
    fpr, tpr, _ = roc_curve(actual, probabilities)
    roc_auc = auc(fpr, tpr)


    # plot
    fig, ax = plt.subplots(1, 3, figsize = (18, 6))
    fig.subplots_adjust(left = 0.02, right = 0.98, wspace = 0.2)

    # Confusion Matrix
    sns.heatmap(mat.T, square = True, annot = True, fmt = 'd', cbar = False, cmap = 'Blues', ax = ax[0])

    ax[0].set_title('Confusion Matrix')
    ax[0].set_xlabel('True label')
    ax[0].set_ylabel('Predicted label')

    # Precision/Recall
    step_kwargs = {'step': 'post'}
    ax[1].step(recall, precision, color = 'b', alpha = 0.2, where = 'post')
    ax[1].fill_between(recall, precision, alpha = 0.2, color = 'b', **step_kwargs)
    ax[1].set_ylim([0.0, 1.0])
    ax[1].set_xlim([0.0, 1.0])
    ax[1].set_xlabel('Recall')
    ax[1].set_ylabel('Precision')
    ax[1].set_title('2-class Precision-Recall curve')

    # ROC
    ax[2].plot(fpr, tpr, color = 'darkorange', lw = 2, label = 'ROC curve (AUC = %0.2f)' % roc_auc)
    ax[2].plot([0, 1], [0, 1], color = 'navy', lw = 2, linestyle = '--')
    ax[2].set_xlim([0.0, 1.0])
    ax[2].set_ylim([0.0, 1.0])
    ax[2].set_xlabel('False Positive Rate')
    ax[2].set_ylabel('True Positive Rate')
    ax[2].set_title('Receiver Operating Characteristic')
    ax[2].legend(loc = 'lower right')

    plt.show()

    return (accuracy_, precision_, recall_, roc_auc_)

## Repeating it all for comparison
Repeat the whole lot in one big block using the show_summary_report function.

Find 'Accuracy', 'Precision', 'Recall', 'ROC_AUC' using CountVectorizer and TfidfVectorizer and keep the result in a dataframe.

In [None]:
# ANSWER



---



---



> > > > > > > > > © 2024 Institute of Data


---



---



