
# Lab 3: Mari McMurtrie

**References**

*   https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes : dataset
*   https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html : scikit-learn TfidfVectorizer documentation
*   https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html : scikit-learn classification_report documentation



In [20]:
# Install NLTK and download the resources used below
!pip install nltk
import nltk
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')




[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [30]:
# Imports
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report


**Download the Rotten Tomatoes dataset from Hugging Face**


In [22]:
!pip install datasets
from datasets import load_dataset

# Load the Rotten Tomatoes dataset from Hugging Face
dataset = load_dataset("rotten_tomatoes")

# Convert the train and test splits to pandas DataFrames (used in the cells below)
df_train = dataset['train'].to_pandas()
df_test = dataset['test'].to_pandas()



In [23]:
# Examine the training and test sets.
# Columns:
#   text:  the review text (string).
#   label: the class label, 0 = negative (neg), 1 = positive (pos).
print(df_train.head())
print(df_test.head())

                                                text  label
0  the rock is destined to be the 21st century's ...      1
1  the gorgeously elaborate continuation of " the...      1
2                     effective but too-tepid biopic      1
3  if you sometimes like to go to the movies to h...      1
4  emerges as something rare , an issue movie tha...      1
                                                text  label
0  lovingly photographed in the manner of a golde...      1
1              consistently clever and suspenseful .      1
2  it's like a " big chill " reunion of the baade...      1
3  the story gives ample opportunity for large-sc...      1
4                  red dragon " never cuts corners .      1


In [24]:
# Pre-process the text first. This function is reused from Lab 2.
def normalize_text(corpus: list[str], lemmatizer: WordNetLemmatizer) -> list[str]:
    # Build the stop-word set once instead of once per sentence.
    stop_words = set(stopwords.words('english'))
    normalized_corpus: list[str] = []
    for sentence in corpus:
        # Drop every character that is not a letter, digit, or whitespace, then lowercase.
        cleaned_sentence = re.sub(r'[^a-zA-Z0-9\s]', '', sentence).lower()
        # Tokenize and remove stop words.
        words = word_tokenize(cleaned_sentence)
        filtered_words = [word for word in words if word not in stop_words]
        # Lemmatize each remaining token and rejoin the tokens into one string.
        lemmatized_sentence = " ".join(lemmatizer.lemmatize(word) for word in filtered_words)
        normalized_corpus.append(lemmatized_sentence)
    return normalized_corpus

lemmatizer = WordNetLemmatizer()
corpus: list[str] = df_train['text'].tolist()
df_train['normalized'] = normalize_text(corpus, lemmatizer)
df_train.tail()

|      | text | label | normalized |
|------|------|-------|------------|
| 8525 | any enjoyment will be hinge from a personal th... | 0 | enjoyment hinge personal threshold watching sa... |
| 8526 | if legendary shlockmeister ed wood had ever ma... | 0 | legendary shlockmeister ed wood ever made movi... |
| 8527 | hardly a nuanced portrait of a young woman's b... | 0 | hardly nuanced portrait young woman breakdown ... |
| 8528 | interminably bleak , to say nothing of boring . | 0 | interminably bleak say nothing boring |
| 8529 | things really get weird , though not particula... | 0 | thing really get weird though particularly sca... |
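The character filter in `normalize_text` deletes punctuation outright rather than replacing it with a space, so hyphenated and apostrophized words are fused into one token. The stdlib-only sketch below applies just the regex and lowercasing steps (no NLTK) to two visible snippets from the dataset preview; `strip_and_lower` is a hypothetical helper name used only for this illustration.

```python
import re


def strip_and_lower(sentence: str) -> str:
    """Hypothetical helper: only the regex + lowercase steps of normalize_text."""
    return re.sub(r'[^a-zA-Z0-9\s]', '', sentence).lower()


# Visible prefixes of two reviews shown in the training-set preview.
samples = [
    "effective but too-tepid biopic",
    "the rock is destined to be the 21st century's",
]
cleaned = [strip_and_lower(s) for s in samples]
for before, after in zip(samples, cleaned):
    # Hyphens and apostrophes vanish, so "too-tepid" becomes "tootepid".
    print(f"{before!r} -> {after!r}")
```

Replacing the removed characters with a space (`' '` instead of `''`) would keep `too` and `tepid` as separate tokens. Whether that helps this classifier is not tested here.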


**Vectorize text with TF-IDF (max_features set to 100)**

In [25]:
tfidfVectorizer = TfidfVectorizer(max_features=100)
Xt = tfidfVectorizer.fit_transform(df_train['normalized'])
print(Xt)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 19076 stored elements and shape (8530, 100)>
  Coords	Values
  (0, 63)	0.6418251160934867
  (0, 54)	0.5199763702681223
  (0, 24)	0.5636356045481611
  (3, 47)	0.3830753529695407
  (3, 36)	0.526402634263817
  (3, 60)	0.3043878702003812
  (3, 32)	0.5271541613273428
  (3, 37)	0.4534424497745196
  (4, 47)	0.3221958154418456
  (4, 60)	0.25601359442620286
  (4, 80)	0.43844552935925524
  (4, 85)	0.43608154106241914
  (4, 19)	0.420156698962546
  (4, 29)	0.410969285603583
  (4, 66)	0.32065230065855604
  (5, 24)	0.5898585361687921
  (5, 30)	0.3861795564566983
  (5, 38)	0.7091771693192875
  (7, 37)	0.4434352194181682
  (7, 69)	0.5147852119781714
  (7, 25)	0.5377214580961454
  (7, 53)	0.4992164111996975
  (9, 83)	1.0
  (10, 30)	0.4988883860113645
  (10, 93)	0.8666662438926392
  :	:
  (8522, 17)	0.49206968428424386
  (8522, 16)	0.5307813254331092
  (8522, 26)	0.544719305453322
  (8523, 47)	0.8703589733784818
  (8523, 90)	0.492417767205403

**Fit a logistic regression model**




In [26]:
lr_model = LogisticRegression(solver='liblinear', random_state=42)
lr_model.fit(Xt, df_train['label'])

**Inference on test**

In [34]:
# Note: the vectorizer was fit on the normalized training text, but the raw
# test text is transformed here, so train and test preprocessing differ.
predictions = lr_model.predict(tfidfVectorizer.transform(df_test['text']))
print(f"LogisticRegression predicted labels for test data: {predictions = }")
print(f"Number of test texts: {len(df_test['text'])}")
print(f"Number of label predictions: {len(predictions)}")

LogisticRegression predicted labels for test data: predictions = array([1, 0, 0, ..., 1, 0, 0])
Number of test texts: 1066
Number of label predictions: 1066
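In the cell above the vectorizer was fit on `df_train['normalized']` but is applied to the raw `df_test['text']`. A possible way to keep both sides consistent is to put the preprocessing inside the model object, so the same function runs at fit and predict time. The sketch below is an alternative arrangement, not the notebook's method: it uses a hypothetical regex-only `simple_normalize` in place of the NLTK-based `normalize_text`, and made-up toy data.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def simple_normalize(text: str) -> str:
    """Hypothetical stand-in for normalize_text: strip punctuation and lowercase."""
    return re.sub(r'[^a-zA-Z0-9\s]', '', text).lower()


# Made-up data; the raw strings still contain punctuation and capitals.
toy_train_texts = ["Good, fun film!", "Good story.", "Bad, dull film!", "Bad story."]
toy_train_labels = [1, 1, 0, 0]
toy_test_texts = ["GOOD cast", "bad... cast"]

# The vectorizer calls simple_normalize on every document it sees,
# during fit on the training texts and during predict on the test texts.
toy_pipeline = make_pipeline(
    TfidfVectorizer(preprocessor=simple_normalize, max_features=100),
    LogisticRegression(solver='liblinear', random_state=42),
)
toy_pipeline.fit(toy_train_texts, toy_train_labels)
toy_predictions = toy_pipeline.predict(toy_test_texts)
print(toy_predictions)
```

With the notebook's own function, the simpler equivalent would be to call `normalize_text` on `df_test['text'].tolist()` before `tfidfVectorizer.transform`. The reported metrics would then need to be recomputed.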


**Print classification report**

In [32]:
classification_report(df_test['label'], predictions, output_dict=True)

{'0': {'precision': 0.5671641791044776,
  'recall': 0.6416510318949343,
  'f1-score': 0.602112676056338,
  'support': 533.0},
 '1': {'precision': 0.5874730021598272,
  'recall': 0.5103189493433395,
  'f1-score': 0.5461847389558233,
  'support': 533.0},
 'accuracy': 0.575984990619137,
 'macro avg': {'precision': 0.5773185906321524,
  'recall': 0.575984990619137,
  'f1-score': 0.5741487075060807,
  'support': 1066.0},
 'weighted avg': {'precision': 0.5773185906321524,
  'recall': 0.575984990619137,
  'f1-score': 0.5741487075060806,
  'support': 1066.0}}

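**Interpretation of the report**

The test set is balanced (533 reviews per class), so random guessing would score about 0.50 accuracy. This model reaches 0.576, only modestly better than chance. Recall is 0.64 for the negative class (0) but 0.51 for the positive class (1), so the model finds negative reviews more reliably and misses roughly half of the positive ones. Precision is similar for the two classes (0.57 and 0.59), and the macro and weighted averages coincide because both classes have equal support. Two choices in this notebook likely limit performance: the vocabulary is capped at 100 features, and the test text is vectorized without the normalization that was applied to the training text.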
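The per-class numbers in the report can be reproduced from four confusion counts. The sketch below recovers those counts from the reported recall and support values (recall × support), then recomputes precision, F1, and accuracy with plain arithmetic. The variable names are illustrative; the counts are derived from the report, not read from a separate confusion matrix.

```python
support = 533  # test reviews per class, from the report

# Reported recall per class, copied from the classification report.
recall_0 = 0.6416510318949343
recall_1 = 0.5103189493433395

# Correct predictions per class = recall * support.
true_neg = round(recall_0 * support)   # label 0 predicted as 0
true_pos = round(recall_1 * support)   # label 1 predicted as 1
false_pos = support - true_neg         # label 0 predicted as 1
false_neg = support - true_pos         # label 1 predicted as 0

# Precision = correct predictions of a class / all predictions of that class.
precision_0 = true_neg / (true_neg + false_neg)
precision_1 = true_pos / (true_pos + false_pos)

# F1 is the harmonic mean of precision and recall.
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

accuracy = (true_neg + true_pos) / (2 * support)

print(f"counts: TN={true_neg} FP={false_pos} FN={false_neg} TP={true_pos}")
print(f"precision: {precision_0:.4f} (0), {precision_1:.4f} (1)")
print(f"f1:        {f1_0:.4f} (0), {f1_1:.4f} (1)")
print(f"accuracy:  {accuracy:.4f}")
```

The counts also show the lean toward the negative class: 603 of the 1066 test reviews are predicted as 0 and 463 as 1.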