<a href="https://colab.research.google.com/github/lsw4cq/llm-data-project/blob/main/days_1_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import Statements

In [None]:
!pip install datasets

# Standard library
import string

# Third-party
import nltk
import pandas as pd
from datasets import load_dataset, Dataset, DatasetDict
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from transformers import pipeline

# Project Selection
I chose sentiment analysis for my project. After reading through the ideas, this one seemed the most interesting.

# Day 1: Data
I had a choice between the Yelp and IMDb datasets for my project. I decided to go with IMDb because I am familiar with movies and can tell at a glance how the model is performing.

In [None]:
ds = load_dataset('imdb')

## Staging the Data

In [None]:
ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [None]:
ds['train'][0:3]


{'text': ['I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far b

### Initial Takeaways

Looking through the data, and knowing that it comes from user-written IMDb reviews, I can see a mix of lower-case and upper-case words. There will be slang and internet lingo as well. I will clean that up in the next section.

In [None]:
ds['train'].features


{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['neg', 'pos'], id=None)}

In [None]:
# Convert each Hugging Face split to a pandas DataFrame for cleaning
ds_train = ds['train'].to_pandas()
ds_test = ds['test'].to_pandas()

In [None]:
# assign the splits
train = Dataset.from_pandas(ds_train)
test = Dataset.from_pandas(ds_test)
# reconstruct both datasets into a Dataset Dict object
new_ds = DatasetDict(
    {
        'train': train,
        'test': test
    }
)
# view the resulting dataset dict object
new_ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
})

# Day 2: Cleaning
There seem to be standard cleaning steps that should happen when dealing with text. I will remove punctuation marks and stopwords from both the training and test sets.

In [None]:
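def replace_punc(review):
  """Replace ellipses with a space, then strip all remaining punctuation."""
  # Swap '...' for a space first so the words on either side are not joined
  review = review.replace('...', ' ')
  review = ''.join([char for char in review if char not in string.punctuation])
  return review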

In [None]:
ds_train['review_nopunc'] = ds_train['text'].apply(lambda x: replace_punc(x))

In [None]:
ds_test['review_nopunc'] = ds_test['text'].apply(lambda x: replace_punc(x))

Tokenizing the data with no punctuation

In [None]:
def tokenize(review):
  tokens = review.lower().split()
  return tokens

In [None]:
ds_train['review_tokens'] = ds_train['review_nopunc'].apply(lambda x: tokenize(x))

In [None]:
ds_test['review_tokens'] = ds_test['review_nopunc'].apply(lambda x: tokenize(x))

Removing stopwords

In [None]:

nltk.download('stopwords')
# A set gives constant-time membership checks when filtering tokens
stop_words = set(stopwords.words('english'))


In [None]:
def remove_stopwords(review):
  review = [word for word in review if word not in stop_words]
  return review

In [None]:
ds_train['review_nostop'] = ds_train['review_tokens'].apply(lambda x: remove_stopwords(x))
ds_test['review_nostop'] = ds_test['review_tokens'].apply(lambda x: remove_stopwords(x))
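To see the three cleaning steps working together, here is a minimal sketch that runs them on the opening words of the first training review. The name `clean_review` and the tiny `SAMPLE_STOPWORDS` set are hypothetical and used only for illustration; the notebook itself uses the full NLTK English stopword list.

```python
import string

# Hypothetical, hand-picked subset of English stopwords (assumption:
# the real NLTK list is much longer and also contains these words).
SAMPLE_STOPWORDS = {"i", "am", "from", "my", "is", "a", "and"}

def clean_review(review, stop_words):
    """Strip punctuation, lowercase and split, then drop stopwords."""
    review = review.replace('...', ' ')
    no_punc = ''.join(ch for ch in review if ch not in string.punctuation)
    tokens = no_punc.lower().split()
    return [word for word in tokens if word not in stop_words]

sample = "I rented I AM CURIOUS-YELLOW from my video store"
cleaned = clean_review(sample, SAMPLE_STOPWORDS)
print(cleaned)
```

The hyphen in "CURIOUS-YELLOW" is removed rather than replaced, which is why the two words merge into a single token.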

## Reviewing the tokenized data

In [None]:
ds_train.head()

|  | text | label | review_nopunc | review_tokens | review_nostop |
|---|---|---|---|---|---|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 | I rented I AM CURIOUSYELLOW from my video stor... | [i, rented, i, am, curiousyellow, from, my, vi... | [rented, curiousyellow, video, store, controve... |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 | I Am Curious Yellow is a risible and pretentio... | [i, am, curious, yellow, is, a, risible, and, ... | [curious, yellow, risible, pretentious, steami... |
| 2 | If only to avoid making this type of film in t... | 0 | If only to avoid making this type of film in t... | [if, only, to, avoid, making, this, type, of, ... | [avoid, making, type, film, future, film, inte... |
| 3 | This film was probably inspired by Godard's Ma... | 0 | This film was probably inspired by Godards Mas... | [this, film, was, probably, inspired, by, goda... | [film, probably, inspired, godards, masculin, ... |
| 4 | Oh, brother...after hearing about this ridicul... | 0 | Oh brother after hearing about this ridiculous... | [oh, brother, after, hearing, about, this, rid... | [oh, brother, hearing, ridiculous, film, umpte... |


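## Special Note
Stripping punctuation can join words that were only separated by punctuation, with no space between them. Row 4 of `ds_train` contains an ellipsis ("brother...after"), so deleting the dots would create a word that doesn't exist ("brotherafter"). To adjust for this, I added a step to my punctuation removal function that replaces `...` with a space before the rest of the punctuation is stripped.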
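The effect is easy to reproduce in isolation. The sketch below compares a plain punctuation strip (a hypothetical helper named `strip_only`, not part of the notebook) with the same steps as `replace_punc`. The last line is my own observation rather than something the notebook addresses: the `<br />` tags visible in the raw reviews are glued to neighbouring words in a similar way.

```python
import string

def strip_only(review):
    """Hypothetical baseline: delete punctuation with no special cases."""
    return ''.join(ch for ch in review if ch not in string.punctuation)

def replace_punc(review):
    """Same steps as the notebook's function: ellipsis to space, then strip."""
    review = review.replace('...', ' ')
    return ''.join(ch for ch in review if ch not in string.punctuation)

sample = "Oh, brother...after hearing about this"
naive = strip_only(sample)
fixed = replace_punc(sample)
print(naive)  # Oh brotherafter hearing about this
print(fixed)  # Oh brother after hearing about this

# HTML line breaks from the raw text are not handled by either version.
html_case = replace_punc("myself.<br /><br />The plot")
print(html_case)  # myselfbr br The plot
```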

In [None]:
ds_train['tokens_nostop_string'] = ds_train['review_nostop'].apply(lambda x: ' '.join(x))
ds_test['tokens_nostop_string'] = ds_test['review_nostop'].apply(lambda x: ' '.join(x))

## Splitting and Fitting

In [None]:
X_train, X_test, y_train, y_test = train_test_split(ds_train['tokens_nostop_string'], ds_train['label'], test_size=0.2)

In [None]:
n_words = 100
# Keep only the 100 most frequent terms as features
tfidf_vectorizer = TfidfVectorizer(max_features=n_words)
# Fit the vocabulary and IDF weights on the training split only, then reuse them
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

I decided to go with TF-IDF because I am working with informal writing. Raw word counts alone are not going to be enough for interpretation, especially since slang is going to be used: "bad" can mean bad or good.

In [None]:
pd.DataFrame(X_train_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

|  | acting | actors | actually | also | another | around | back | bad | best | better | ... | want | watch | watching | way | well | work | world | would | years | young |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.136695 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.165937 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.173621 | 0.000000 | 0.157361 | 0.000000 |
| 1 | 0.176749 | 0.000000 | 0.000000 | 0.163574 | 0.000000 | 0.217420 | 0.000000 | 0.341878 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.170220 | 0.157871 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.449907 |
| 2 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.153782 | 0.000000 | 0.000000 | ... | 0.193024 | 0.000000 | 0.000000 | 0.000000 | 0.142025 | 0.000000 | 0.000000 | 0.266580 | 0.000000 | 0.000000 |
| 3 | 0.000000 | 0.000000 | 0.226705 | 0.059859 | 0.074797 | 0.159128 | 0.000000 | 0.187663 | 0.067160 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.062291 | 0.173316 | 0.076250 | 0.000000 | 0.054219 | 0.000000 | 0.000000 |
| 4 | 0.000000 | 0.210573 | 0.000000 | 0.000000 | 0.210950 | 0.000000 | 0.000000 | 0.176423 | 0.000000 | 0.193520 | ... | 0.000000 | 0.000000 | 0.206195 | 0.000000 | 0.000000 | 0.215050 | 0.000000 | 0.152915 | 0.000000 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19995 | 0.205995 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.199223 | 0.000000 | 0.000000 | ... | 0.250061 | 0.204951 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.345353 | 0.000000 | 0.000000 |
| 19996 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 19997 | 0.000000 | 0.000000 | 0.000000 | 0.252029 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.131135 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 19998 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.200322 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
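
To make the numbers in a TF-IDF matrix like this one less opaque, here is a minimal pure-Python sketch on a toy corpus. It assumes the weighting scikit-learn documents as its default (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per row). The three documents and the function name `toy_tfidf` are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the IMDb data
docs = [
    "bad acting bad plot",
    "great acting great story",
    "bad story",
]

def toy_tfidf(documents):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in documents]
    vocab = sorted({word for doc in tokenized for word in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears
    df = Counter(word for doc in tokenized for word in set(doc))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = toy_tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

In the first document, "plot" and "acting" each occur once, but "plot" gets the larger weight because it appears in fewer documents. That is the property that separates TF-IDF from plain counts.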


## Testing Models
I went with `LogisticRegression()` because my labels are binary: positive or negative.
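
As a rough sketch of why this model suits a binary label: logistic regression computes a weighted sum of the features and passes it through a sigmoid, which yields a probability between 0 and 1 that can be thresholded at 0.5. The weights, bias, and feature values below are hypothetical, chosen only to illustrate the mechanics; they are not the coefficients learned in this notebook.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for three TF-IDF features (assumption)
weights = {"bad": -2.0, "best": 1.5, "well": 0.5}
bias = 0.0

def predict_proba(features):
    """Probability of the positive class for a dict of feature weights."""
    score = bias + sum(weights.get(word, 0.0) * value
                       for word, value in features.items())
    return sigmoid(score)

# A made-up review that leans on the word "bad"
review = {"bad": 0.34, "well": 0.16}
prob_positive = predict_proba(review)
label = int(prob_positive >= 0.5)  # 1 = positive, 0 = negative
print(round(prob_positive, 3), label)
```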

In [None]:
clf = LogisticRegression()
clf.fit(X_train_tfidf, y_train)
y_pred_tfidf = clf.predict(X_test_tfidf)
print(classification_report(y_test, y_pred_tfidf))

              precision    recall  f1-score   support

           0       0.74      0.72      0.73      2469
           1       0.73      0.75      0.74      2531

    accuracy                           0.73      5000
   macro avg       0.73      0.73      0.73      5000
weighted avg       0.73      0.73      0.73      5000
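
The f1-score column can be checked by hand, since F1 is the harmonic mean of precision and recall. The sketch below uses the rounded precision and recall values from the report above, so it only confirms the scores to two decimal places.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values taken from the classification report
f1_neg = f1_from(0.74, 0.72)
f1_pos = f1_from(0.73, 0.75)
print(round(f1_neg, 2), round(f1_pos, 2))
```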



## Initial Takeaways

`LogisticRegression()` performed better than I thought it would, reaching 0.73 accuracy on the held-out 20% of the training data. Since this is a "simple" model, I imagine the LLM will be much more accurate.

# Day 3: Pre-Trained Model and Preprocessing

I went to Hugging Face and found a few "sentiment-analysis" models to work with. I opted to go with the one that had the most likes.

In production, I would take more time to find the best pretrained model for my situation.

In [None]:
pipe = pipeline("text-classification", model="juliensimon/reviews-sentiment-analysis")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [None]:
data = ds['train'][0]['text']
preds = pipe(data)
preds

[{'label': 'LABEL_0', 'score': 0.8783227205276489}]
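
The pipeline returns generic label names, so a small lookup makes the result easier to read. The mapping below is an assumption: it follows the IMDb convention shown earlier (0 = `neg`, 1 = `pos`) and should be checked against the model card before relying on it. The `readable` helper is hypothetical.

```python
# Output copied from the cell above
preds = [{'label': 'LABEL_0', 'score': 0.8783227205276489}]

# Assumed mapping; verify against the model's documentation
LABEL_NAMES = {'LABEL_0': 'negative', 'LABEL_1': 'positive'}

def readable(prediction):
    """Return a (sentiment, score) pair with the score rounded for display."""
    name = LABEL_NAMES.get(prediction['label'], prediction['label'])
    return name, round(prediction['score'], 3)

result = readable(preds[0])
print(result)
```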

## Initial Takeaways
Being able to run everything through a pipeline is very helpful. I can see that this can be dangerous too if you don't know what steps were taken to preprocess the data the model was trained on. Since my data is unique, there may be things that still need to be done before putting it through the pipeline.

# Day 4: Final Modelling
The following is from the Hugging Face Task Guide for Text Classification.

In [None]:
!pip install transformers datasets evaluate accelerate
!pip install huggingface-hub
from huggingface_hub import notebook_login
notebook_login()

In [None]:
from datasets import load_dataset

imdb = load_dataset("imdb")

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

In [None]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

In [None]:
tokenized_imdb = imdb.map(preprocess_function, batched=True)

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
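
Conceptually, a padding collator pads each batch only to the length of its longest sequence instead of a fixed global length. The sketch below imitates that idea in plain Python; it is not the library's implementation, and the token ids and the `pad_batch` helper are hypothetical.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one in this batch."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids))
                      for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for two reviews of different lengths
batch = [[101, 2023, 3185, 102], [101, 2919, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Because the padding length depends on each batch, short reviews grouped together cost less compute than they would if every review were padded to the model's maximum length.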

In [None]:
import evaluate

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1", average="weighted")
evaluator = evaluate.combine({"accuracy": accuracy, "f1": f1})

In [None]:
import numpy as np

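def compute_metrics(eval_pred):
    # eval_pred holds the raw model outputs (logits) and the true labels
    predictions, labels = eval_pred
    # Pick the class with the highest logit for each example
    predictions = np.argmax(predictions, axis=1)
    return evaluator.compute(predictions=predictions, references=labels)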
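
The core of this function is turning logits into class ids with an argmax and comparing them with the labels. A dependency-free sketch with made-up logits shows the same two steps; the values and the `argmax` helper are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

# Hypothetical logits for four reviews: [negative_score, positive_score]
logits = [[2.1, -1.3], [0.2, 0.9], [-0.5, 0.4], [1.0, 0.8]]
labels = [0, 1, 0, 0]

predicted = [argmax(row) for row in logits]
accuracy_value = sum(p == y for p, y in zip(predicted, labels)) / len(labels)
print(predicted, accuracy_value)
```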

In [None]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.2221 | 0.218138 | 0.91464 | 0.910366 |
| 2 | 0.149 | 0.233342 | 0.9316 | 0.932019 |



TrainOutput(global_step=3126, training_loss=0.20687891044299417, metrics={'train_runtime': 685.9003, 'train_samples_per_second': 72.897, 'train_steps_per_second': 4.558, 'total_flos': 6556904415524352.0, 'train_loss': 0.20687891044299417, 'epoch': 2.0})
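
The `global_step` value can be sanity-checked with a little arithmetic. This sketch assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept; the helper name `total_steps` is hypothetical.

```python
import math

train_rows = 25000   # size of the IMDb train split
batch_size = 16      # per_device_train_batch_size

def total_steps(rows, batch, epochs):
    """Optimizer steps when the last partial batch of each epoch is kept."""
    return math.ceil(rows / batch) * epochs

steps_two_epochs = total_steps(train_rows, batch_size, 2)
steps_four_epochs = total_steps(train_rows, batch_size, 4)
print(steps_two_epochs, steps_four_epochs)
```

The first number matches the 3126 steps reported here, and the second matches the 6252 steps reported for the 4-epoch run.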

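## Training my model with my data

This second run keeps the same tokenized IMDb splits but changes the training arguments: it adds 500 warmup steps, trains for 4 epochs instead of 2, and selects the best checkpoint by accuracy. The `model` object passed to the `Trainer` is the one already fine-tuned in the previous run, so training continues from those weights rather than restarting from the base checkpoint.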

In [None]:
training_args = TrainingArguments(
    output_dir="my_tuned_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps = 500,
    num_train_epochs=4,
    weight_decay=0.01,
    eval_strategy="epoch",
    metric_for_best_model="accuracy",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.1287 | 0.352982 | 0.89652 | 0.889307 |
| 2 | 0.1327 | 0.271117 | 0.92684 | 0.928215 |
| 3 | 0.0722 | 0.315059 | 0.93064 | 0.931506 |
| 4 | 0.0358 | 0.368491 | 0.93032 | 0.93078 |



TrainOutput(global_step=6252, training_loss=0.09140520216331068, metrics={'train_runtime': 1368.1705, 'train_samples_per_second': 73.09, 'train_steps_per_second': 4.57, 'total_flos': 1.3106947433758944e+16, 'train_loss': 0.09140520216331068, 'epoch': 4.0})
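
The `warmup_steps = 500` setting changes how the learning rate evolves during this run. Assuming the Trainer's default linear scheduler, the rate climbs from 0 to `2e-5` over the first 500 steps and then decays linearly to 0 at the final step. The function below is a sketch of that shape under those assumptions, not the library's code.

```python
import math

def linear_warmup_lr(step, base_lr=2e-5, warmup_steps=500, total_steps=6252):
    """Learning rate at a given step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 250, 500, 3376, 6252):
    print(step, linear_warmup_lr(step))
```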

In [None]:
trainer.evaluate()

{'eval_loss': 0.3150591552257538,
 'eval_accuracy': 0.93064,
 'eval_f1': 0.9315057671038078,
 'eval_runtime': 86.3986,
 'eval_samples_per_second': 289.356,
 'eval_steps_per_second': 18.091,
 'epoch': 4.0}
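
These evaluation numbers match epoch 3 of the training table rather than the final epoch, which is what I would expect from `load_best_model_at_end=True` combined with `metric_for_best_model="accuracy"`. The sketch below redoes that selection by hand from the per-epoch results; the variable names are hypothetical.

```python
# Per-epoch results copied from the 4-epoch training run
history = [
    {"epoch": 1, "val_loss": 0.352982, "accuracy": 0.89652},
    {"epoch": 2, "val_loss": 0.271117, "accuracy": 0.92684},
    {"epoch": 3, "val_loss": 0.315059, "accuracy": 0.93064},
    {"epoch": 4, "val_loss": 0.368491, "accuracy": 0.93032},
]

best_by_accuracy = max(history, key=lambda row: row["accuracy"])
best_by_loss = min(history, key=lambda row: row["val_loss"])
print(best_by_accuracy["epoch"], best_by_loss["epoch"])
```

Selecting by validation loss would have picked epoch 2 instead. Training loss keeps falling while validation loss rises after that point, which suggests the later epochs are starting to overfit.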

In [None]:
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/landoncodes/my_awesome_model/commit/fd968e4a1fc25cc5dfdf6c06882de0b0693641ff', commit_message='End of training', commit_description='', oid='fd968e4a1fc25cc5dfdf6c06882de0b0693641ff', pr_url=None, repo_url=RepoUrl('https://huggingface.co/landoncodes/my_awesome_model', endpoint='https://huggingface.co', repo_type='model', repo_id='landoncodes/my_awesome_model'), pr_revision=None, pr_num=None)