
In [1]:
text_1 = "The quick brown fox jumps over the lazy dog."
text_2 = "My dog is quick and can jump over fences."
text_3 = "Your dog is so lazy that it sleeps all the day."
corpus = [text_1, text_2, text_3]

In [2]:
from sklearn.feature_extraction import text

vectorizer = text.CountVectorizer(binary=True)
vectorizer.fit(corpus)
vectorized_text = vectorizer.transform(corpus)
print(vectorized_text.todense())

[[0 0 1 0 0 1 0 1 0 0 0 1 1 0 1 1 0 0 0 1 0]
 [0 1 0 1 0 1 1 0 1 0 1 0 0 1 1 1 0 0 0 0 0]
 [1 0 0 0 1 1 0 0 1 1 0 0 1 0 0 0 1 1 1 1 1]]


In [3]:
print(vectorizer.vocabulary_)

{'the': 19, 'quick': 15, 'brown': 2, 'fox': 7, 'jumps': 11, 'over': 14, 'lazy': 12, 'dog': 5, 'my': 13, 'is': 8, 'and': 1, 'can': 3, 'jump': 10, 'fences': 6, 'your': 20, 'so': 17, 'that': 18, 'it': 9, 'sleeps': 16, 'all': 0, 'day': 4}
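
The mapping printed above can be rebuilt by hand, which makes the matrix columns easier to read. What follows is a minimal sketch, not scikit-learn's implementation. It assumes the default `CountVectorizer` behavior of lowercasing the text and keeping only tokens of two or more word characters. The helper names `tokenize_simple` and `binary_bow` are made up for illustration.

```python
import re

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "My dog is quick and can jump over fences.",
    "Your dog is so lazy that it sleeps all the day.",
]

def tokenize_simple(doc):
    # Lowercase, then keep runs of 2+ word characters (assumed to mirror
    # CountVectorizer's default token pattern).
    return re.findall(r"\b\w\w+\b", doc.lower())

# Sort the unique tokens alphabetically and number them from 0.
unique_tokens = sorted({tok for doc in corpus for tok in tokenize_simple(doc)})
vocabulary = {word: idx for idx, word in enumerate(unique_tokens)}

def binary_bow(doc):
    # 1 if the word occurs in the document at least once, otherwise 0.
    row = [0] * len(vocabulary)
    for tok in tokenize_simple(doc):
        row[vocabulary[tok]] = 1
    return row

rows = [binary_bow(doc) for doc in corpus]
print(vocabulary["dog"])
print(rows[0])
```

Each column index corresponds to one word of the alphabetically sorted vocabulary, so column 5 is "dog" and is set in every row.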


In [4]:
text_4 = "A black dog just passed by but my dog is brown."
corpus.append(text_4)
vectorizer = text.CountVectorizer()
vectorizer.fit(corpus)
vectorized_text = vectorizer.transform(corpus)
print(vectorized_text.todense()[-1])

[[0 0 1 1 1 1 0 0 2 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0]]


In [5]:
tfidf = text.TfidfTransformer(norm="l1")
tfidf_mtx = tfidf.fit_transform(vectorized_text)

phrase = 3  # choose a number from 0 to 3

# Convert the selected row to a dense array once, outside the loop
row = tfidf_mtx.toarray()[phrase]

total = 0
for word, pos in vectorizer.vocabulary_.items():
    value = row[pos]
    if value != 0.0:
        print(f"{word:7s}: {value:0.3f}")
        total += value
print(f"\nSummed values of a phrase: {total:0.1f}")

brown  : 0.095
dog    : 0.126
my     : 0.095
is     : 0.077
black  : 0.121
just   : 0.121
passed : 0.121
by     : 0.121
but    : 0.121

Summed values of a phrase: 1.0
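
The weights printed above can be checked with a few lines of plain Python. This is a sketch under stated assumptions rather than the library's code: it assumes the default smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, raw counts as term frequency, and the L1 normalization requested with `norm="l1"`. The names `tokenize_simple` and `tfidf_l1` are hypothetical helpers.

```python
import math
import re
from collections import Counter

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "My dog is quick and can jump over fences.",
    "Your dog is so lazy that it sleeps all the day.",
    "A black dog just passed by but my dog is brown.",
]

def tokenize_simple(doc):
    # Lowercase and keep tokens of 2+ word characters (assumed default).
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized = [tokenize_simple(doc) for doc in corpus]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears.
df = Counter(word for tokens in tokenized for word in set(tokens))

# Smoothed inverse document frequency (assumed to match smooth_idf=True).
idf = {word: math.log((1 + n_docs) / (1 + count)) + 1
       for word, count in df.items()}

def tfidf_l1(tokens):
    # Raw count times IDF, then divide by the row sum (L1 normalization).
    counts = Counter(tokens)
    raw = {word: tf * idf[word] for word, tf in counts.items()}
    total = sum(raw.values())
    return {word: value / total for word, value in raw.items()}

weights = tfidf_l1(tokenized[3])
for word, value in weights.items():
    print(f"{word:7s}: {value:0.3f}")
```

"dog" occurs twice in the phrase but appears in all four documents, so its IDF is 1 and its weight stays close to that of the rare words that occur only once.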


In [6]:
bigrams = text.CountVectorizer(ngram_range=(2, 2))
print(bigrams.fit(corpus).vocabulary_)

{'the quick': 30, 'quick brown': 24, 'brown fox': 3, 'fox jumps': 9, 'jumps over': 15, 'over the': 21, 'the lazy': 29, 'lazy dog': 17, 'my dog': 19, 'dog is': 7, 'is quick': 11, 'quick and': 23, 'and can': 1, 'can jump': 6, 'jump over': 14, 'over fences': 20, 'your dog': 31, 'is so': 12, 'so lazy': 26, 'lazy that': 18, 'that it': 27, 'it sleeps': 13, 'sleeps all': 25, 'all the': 0, 'the day': 28, 'black dog': 2, 'dog just': 8, 'just passed': 16, 'passed by': 22, 'by but': 5, 'but my': 4, 'is brown': 10}
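
A bigram is simply a pair of adjacent tokens. The sketch below shows the sliding window on the last sentence of the corpus. It assumes the same tokenization as the default `CountVectorizer` (lowercase, tokens of two or more word characters), and `word_ngrams` is a hypothetical helper, not a scikit-learn function.

```python
import re

def word_ngrams(doc, n):
    # Tokenize, then join every run of n consecutive tokens with a space.
    tokens = re.findall(r"\b\w\w+\b", doc.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams_text_4 = word_ngrams(
    "A black dog just passed by but my dog is brown.", 2)
print(bigrams_text_4)
```

The single-character word "A" is dropped by the tokenizer, which is why the first bigram is "black dog" and not "a black".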


In [7]:
import nltk
nltk.download("punkt_tab")
nltk.download("stopwords")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [8]:
from sklearn.feature_extraction import text

from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

stemmer = PorterStemmer()
stop_words = stopwords.words("english")

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

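def tokenize(document):
    # Split into tokens, drop English stop words, then stem what remains.
    # The parameter is not called `text` to avoid shadowing the sklearn module.
    tokens = word_tokenize(document)
    tokens = [token for token in tokens if token not in stop_words]
    stems = stem_tokens(tokens, stemmer)
    return stems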

vocab = ["Sam loves swimming so he swims all the time"]
vect = text.CountVectorizer(tokenizer=tokenize)
vec = vect.fit(vocab)

sentence1 = vec.transform(["George loves swimming too! "])

print(vec.get_feature_names_out())
print(sentence1.toarray())

['love' 'sam' 'swim' 'time']
[[1 0 1 0]]




In [9]:
import pandas as pd

repository = (
    "https://github.com/lmassaron/ml4dummies_3ed/")
release = "releases/download/v1.0/"
filename = repository + release + "imdb_50k.csv"
reviews = pd.read_csv(filename)

In [10]:
reviews.sentiment.value_counts()

sentiment,count
1,25000
0,25000


In [11]:
print(reviews.review.sample(1).values[0])

Can't liberals like Alec Baldwin get it through their heads that they lost the elections of 2000 and 2004? The ridiculous. lame swipes at WalMart, non-union workers, George W Bush and the stock market not to mention the intentional GWB accent that Balwin's character uses in the film just makes him look silly and bitter on screen. As the credits roll the sour grapes continue as "Special Thanks" are given to Ken Lay, and other CEOs from Enron, Tyco, WorldCom, and IMClone. Let me clue you in to something - if you put all your money into one company's stock YOU'RE AN IDIOT. We don't need this excuse for a movie to tell us that. What a waste of Jim Carrey's talent - from the trailer I expected a completely different movie - what I got was a 90 minute DNC commercial on how to scare people into not investing for their own future, keep them stupid, and keep them dependent on government. No wonder Hollywood is in trouble and can't make a decent movie anymore - maybe you guys could get an origin

In [12]:
from sklearn.model_selection import train_test_split

train, temp = train_test_split(reviews, test_size=0.4, random_state=0)
valid, test = train_test_split(temp, test_size=0.5, random_state=0)

print(f"Train size: {len(train)}")
print(f"Validation size: {len(valid)}")
print(f"Test size: {len(test)}")

Train size: 30000
Validation size: 10000
Test size: 10000


In [13]:
import os
os.environ["KERAS_BACKEND"] = "jax"

In [14]:
import keras

maxlen = 256
vocab_size_limit = 10000

text_vectorization = keras.layers.TextVectorization(
    max_tokens=vocab_size_limit,
    output_mode='int',
    output_sequence_length=maxlen,
    pad_to_max_tokens=True)

text_vectorization.adapt(train.review.values)

def vectorize_text_data(df, vectorizer):
    sequences = vectorizer(df.review.values)
    return sequences, df.sentiment.values

X, y = vectorize_text_data(train, text_vectorization)
Xv, yv = vectorize_text_data(valid, text_vectorization)
Xt, yt = vectorize_text_data(test, text_vectorization)
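
To make the integer output mode concrete, here is a simplified pure-Python sketch of what such a vectorizer does: words are ranked by frequency, mapped to integer ids, and every sequence is truncated or padded to a fixed length. It is not the Keras implementation. The reserved ids (0 for padding, 1 for out-of-vocabulary words), the punctuation stripping, and the helper names `standardize`, `build_vocab`, and `to_sequence` are assumptions made for illustration.

```python
import re
from collections import Counter

PAD, OOV = 0, 1  # assumed reserved ids: padding and unknown words

def standardize(doc):
    # Lowercase, drop punctuation, then split on whitespace.
    return re.sub(r"[^\w\s]", "", doc.lower()).split()

def build_vocab(docs, max_tokens):
    counts = Counter(tok for doc in docs for tok in standardize(doc))
    # The most frequent words get the smallest ids; two slots are reserved.
    most_common = counts.most_common(max_tokens - 2)
    return {word: idx + 2 for idx, (word, _) in enumerate(most_common)}

def to_sequence(doc, vocab, length):
    # Map words to ids, truncate to `length`, then pad with zeros.
    ids = [vocab.get(tok, OOV) for tok in standardize(doc)][:length]
    return ids + [PAD] * (length - len(ids))

docs = ["great movie great cast", "great dull movie"]  # toy training texts
vocab = build_vocab(docs, max_tokens=4)
sequence = to_sequence("A great but dull movie!", vocab, length=8)
print(vocab)
print(sequence)
```

With `max_tokens=4` only the two most frequent words keep their own id, so every other word, including "dull", collapses to the out-of-vocabulary id.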

In [15]:
keras.utils.set_random_seed(0)

model = keras.models.Sequential()
vocab_size = text_vectorization.vocabulary_size()
embedding_dim = 64

model.add(keras.layers.Input(shape=(maxlen,)))
model.add(keras.layers.Embedding(input_dim=vocab_size,
                                 output_dim=embedding_dim))
model.add(keras.layers.Bidirectional(
    keras.layers.LSTM(32, return_sequences=True)))
model.add(keras.layers.Bidirectional(
    keras.layers.LSTM(32, return_sequences=False)))
model.add(keras.layers.Dropout(0.25))
model.add(keras.layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.summary()

In [16]:
history = model.fit(X, y, epochs=2, batch_size=8,
                    validation_data=(Xv, yv))

Epoch 1/2
3750/3750 ━━━━━━━━━━━━━━━━━━━━ 180s 47ms/step - accuracy: 0.6772 - loss: 0.5865 - val_accuracy: 0.8276 - val_loss: 0.4191
Epoch 2/2
3750/3750 ━━━━━━━━━━━━━━━━━━━━ 176s 46ms/step - accuracy: 0.8698 - loss: 0.3177 - val_accuracy: 0.8712 - val_loss: 0.3098


In [17]:
from sklearn.metrics import accuracy_score

predictions = (model.predict(Xt) >= 0.5).astype(int)
test_accuracy = accuracy_score(yt, predictions)
print(f"Accuracy on test set: {test_accuracy}")

313/313 ━━━━━━━━━━━━━━━━━━━━ 5s 16ms/step
Accuracy on test set: 0.8686


In [18]:
from datasets import Dataset
from transformers import AutoTokenizer

MODEL_CHECKPOINT = "answerdotai/ModernBERT-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

def tokenize_function(examples):
    return tokenizer(examples["text"],
                     padding="max_length",
                     truncation=True,
                     max_length=256)

def tokenize_dataset(data):
    data_dict = {
        "text": data["review"].values,
        "labels": data["sentiment"].values,
    }
    dataset = Dataset.from_dict(data_dict)
    return dataset.map(tokenize_function, batched=True)

tokenized_train_dataset = tokenize_dataset(train)
tokenized_valid_dataset = tokenize_dataset(valid)
tokenized_test_dataset = tokenize_dataset(test)

In [19]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(MODEL_CHECKPOINT,
                                                           num_labels=2)

Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [20]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    weight_decay=0.01,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_valid_dataset
)

train_result = trainer.train()

W0531 11:07:54.441000 354 torch/_inductor/utils.py:1137] [1/0] Not enough SMs to use max_autotune_gemm mode


Step,Training Loss
500,0.3508
1000,0.2563
1500,0.2374
2000,0.2331
2500,0.2257
3000,0.2074
3500,0.1824


In [21]:
import numpy as np
from sklearn.metrics import accuracy_score

predictions = trainer.predict(tokenized_test_dataset)
predicted_labels = np.argmax(predictions.predictions, axis=1)
test_accuracy = accuracy_score(test['sentiment'].values, predicted_labels)
print(f"Accuracy on test set: {test_accuracy}")

Accuracy on test set: 0.9439
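
The classifier head returns one raw score (logit) per class, and `np.argmax` picks the class with the larger score. The sketch below, written in plain Python with made-up logit values, shows that turning the logits into probabilities with a softmax does not change the chosen label, because softmax preserves the ordering of the scores. The `softmax` helper is illustrative and not part of the `transformers` API.

```python
import math

def softmax(logits):
    # Subtract the maximum first for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three reviews: [negative score, positive score]
logits = [[-1.2, 2.3], [0.8, -0.4], [0.1, 0.3]]

probs = [softmax(row) for row in logits]
labels_from_logits = [row.index(max(row)) for row in logits]
labels_from_probs = [row.index(max(row)) for row in probs]

print(labels_from_logits)
print([round(p[1], 3) for p in probs])
```

Computing probabilities is only needed when a confidence value is wanted in addition to the predicted label.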
