<a href="https://colab.research.google.com/github/maryamteimouri/TextClassification/blob/main/TextClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Import All

In [None]:
!pip install --quiet datasets

In [None]:
import datasets
import random
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV


print("All modules have been imported")

All modules have been imported


### Use the datasets module to download the "emotion" dataset


In [None]:
# Download the 'emotion' dataset
emotion_data = datasets.load_dataset('emotion')




### Analyze the dataset

We analyze the dataset to get familiar with its splits and label distribution.

In [None]:
def analyze_dataset(dataset_name):
    # Load the dataset
    dataset = datasets.load_dataset(dataset_name)
    builder = datasets.load_dataset_builder(dataset_name)

    # Print the dataset description
    print("Dataset description:", builder.info.description)

    # Get the sizes of the 'train', 'validation', and 'test' subsets
    train_size = len(dataset['train'])
    validation_size = len(dataset['validation']) if 'validation' in dataset else 0
    test_size = len(dataset['test'])

    # Calculate the percentage of each subset relative to the total size
    total_size = train_size + validation_size + test_size
    train_pct = round(train_size / total_size * 100)
    validation_pct = round(validation_size / total_size * 100)
    test_pct = round(test_size / total_size * 100)

    # Print the subset sizes
    print("Subset sizes:")
    print(f"train: {train_pct}%, validation: {validation_pct}%, test: {test_pct}%")

    # Get the distribution of labels in the 'train' subset.
    # Label names are read from the dataset's ClassLabel feature rather than
    # hard-coded, so the id-to-name mapping always matches the dataset.
    label_names = dataset['train'].features['label'].names

    train_labels = dataset['train']['label']
    label_counts = Counter(train_labels)
    total_labels = len(train_labels)

    # Print the label distribution
    print(f"Label distribution in the '{dataset_name}' 'train' subset:")
    for label, count in label_counts.items():
        # Ids outside the ClassLabel range (e.g. -1) mark unlabeled examples
        label_name = label_names[label] if 0 <= label < len(label_names) else 'unlabeled'
        pct = round(count / total_labels * 100)
        print(f"{label_name}: {pct}%")
#-----------------------------------

analyze_dataset('emotion')


Dataset description: Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. For more detailed information please refer to the paper.

Subset sizes:
train: 80%, validation: 10%, test: 10%
Label distribution in the 'emotion' 'train' subset:
sadness: 29%
anger: 13%
love: 8%
surprise: 4%
fear: 12%
joy: 34%


### Preprocessing

Preprocessing is an essential step in natural language processing tasks, including text classification. It transforms raw text into a format suitable for machine learning algorithms. Common steps are listed below; the code in this section applies steps 1 to 4 and does not perform lemmatization or stemming:

1. Text Cleaning: This step removes unnecessary characters from the text, such as punctuation, symbols, and (optionally) numbers.

2. Tokenization: Tokenization is the process of splitting the text into individual words or tokens. It helps in breaking down the text into meaningful units for further analysis. The tokens act as the input features for the machine learning model.

3. Lowercasing: Converting all text to lowercase can help in treating words with different capitalizations as the same and reducing the dimensionality of the feature space.

4. Stopword Removal: Stopwords are common words that do not carry much meaning, such as "a," "an," "the," etc. Removing stopwords can reduce noise in the data and improve computational efficiency.

5. Lemmatization or Stemming: Lemmatization and stemming are techniques used to reduce words to their base or root forms. Lemmatization aims to obtain the base form of a word by considering its part of speech, while stemming applies simpler rules to remove prefixes and suffixes. These techniques help in standardizing words and reducing vocabulary size.
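
As a rough illustration of steps 1 to 4, the following sketch runs the same kind of pipeline using only the Python standard library. The tiny `STOP_WORDS` set and the whitespace tokenizer are simplifying assumptions made for this example; they stand in for NLTK's stopword list and `word_tokenize` and will not match their output on every input.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "the", "i", "so", "not", "very", "now"}

def simple_preprocess(text):
    # 1. Text cleaning: drop everything except word characters and whitespace
    cleaned = re.sub(r"[^\w\s]", "", text)
    # 2. Tokenization: split on whitespace (a stand-in for word_tokenize)
    tokens = cleaned.split()
    # 3. Lowercasing
    tokens = [token.lower() for token in tokens]
    # 4. Stopword removal
    tokens = [token for token in tokens if token not in STOP_WORDS]
    return " ".join(tokens)

print(simple_preprocess("I'm feeling rather rotten, so I'm NOT very ambitious right now!"))
# prints: im feeling rather rotten im ambitious right
```

Note how removing "not" drops the negation from the sentence, which is a known trade-off of stopword removal for emotion and sentiment tasks.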

In [None]:
nltk.download('punkt')
nltk.download('stopwords')

# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Text Cleaning: remove punctuation and other non-word characters
    cleaned_text = re.sub(r'[^\w\s]', '', text)

    # Tokenization
    tokens = word_tokenize(cleaned_text)

    # Lowercasing
    tokens = [token.lower() for token in tokens]

    # Stopword Removal
    tokens = [token for token in tokens if token not in STOP_WORDS]

    # Join tokens back into a single string
    return ' '.join(tokens)



# Preprocess the text in the dataset
preprocessed_dataset = emotion_data.map(lambda example: {'text': preprocess_text(example['text'])})

# Print the preprocessed text of the first example
print("Example of preprocessed text: ",preprocessed_dataset['train'][0]['text'])


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.



Example of preprocessed text:  didnt feel humiliated


### Train a machine learning method on the training set, evaluating performance on the validation set

To train a machine learning method on the preprocessed training set and evaluate its performance on the validation set, we follow these steps:

1. Select the training and validation splits of the preprocessed dataset.
2. Extract the texts (features) and labels from each split.
3. Train a machine learning model on the training set.
4. Evaluate the model's performance on the validation set.


The emotion dataset already ships with separate training and validation splits, so we use them directly instead of calling scikit-learn's train_test_split. We extract the preprocessed texts and labels from both splits.

After that, we define a pipeline consisting of a TF-IDF vectorizer and a logistic regression classifier. The TF-IDF vectorizer converts the text data into numerical features, and the logistic regression classifier is used as the machine learning model.

We train the pipeline on the training set by calling the fit method and passing the preprocessed training texts and labels.

Finally, we evaluate the pipeline on the validation set by making predictions on the validation texts using the predict method and calculating the accuracy score using the accuracy_score function from scikit-learn. The accuracy score is a common metric for classification tasks.
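
To make the feature extraction and the metric concrete, here is a minimal pure-Python sketch of TF-IDF weighting and accuracy. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` that scikit-learn documents as the `TfidfVectorizer` default, and it omits the L2 normalization that the vectorizer applies to each row, so the numbers are not identical to scikit-learn's output. The toy corpus and function names are made up for this example.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document (unnormalized TF-IDF)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    # Weight = raw term count in the document * idf of the term
    return [{term: tf * idf[term] for term, tf in Counter(tokens).items()}
            for tokens in tokenized]

def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that equal the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

docs = ["feel happy", "feel sad", "feel sad sad"]
weights = tfidf_weights(docs)
print(weights[0])                            # 'happy' outweighs the ubiquitous 'feel'
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 3 of 4 correct -> 0.75
```

A term that appears in every document, such as "feel" here, gets the minimum IDF of 1, while rarer terms receive larger weights.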

In [None]:
# Select the training and validation splits
train_data = preprocessed_dataset['train']
val_data = preprocessed_dataset['validation']

# Extract the features and labels from the training set
train_texts = [example['text'] for example in train_data]
train_labels = train_data['label']

# Extract the features and labels from the validation set
val_texts = [example['text'] for example in val_data]
val_labels = val_data['label']

# Define a pipeline with a TF-IDF vectorizer and a logistic regression classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', LogisticRegression())
])

# Train the pipeline on the training set
pipeline.fit(train_texts, train_labels)

# Evaluate the pipeline on the validation set
val_predictions = pipeline.predict(val_texts)
accuracy = accuracy_score(val_labels, val_predictions)
print("Validation Accuracy:", accuracy)


Validation Accuracy: 0.873


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### Perform hyperparameter optimization

To perform hyperparameter optimization, we can use scikit-learn's GridSearchCV or RandomizedSearchCV classes. GridSearchCV exhaustively evaluates every combination in a grid of hyperparameter values, while RandomizedSearchCV samples a fixed number of combinations from given distributions. We use GridSearchCV in this project.
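
A grid search is conceptually a loop over the Cartesian product of the candidate values, with each combination scored by cross-validation. The sketch below only enumerates the combinations and counts the resulting model fits; it uses the same grid values as the cell that follows, and the helper name `expand_grid` is made up for this illustration.

```python
from itertools import product

def expand_grid(param_grid):
    """Yield one {name: value} dict per combination in the grid."""
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "classifier__C": [0.01, 0.1, 1, 10, 100],
}
cv_folds = 5

combinations = list(expand_grid(param_grid))
print(len(combinations))             # 3 n-gram ranges * 5 values of C = 15
print(len(combinations) * cv_folds)  # 15 combinations * 5 folds = 75 fits
print(combinations[0])
```

With the default `refit=True`, GridSearchCV also refits the best combination once on the full training data, so the cost grows quickly as values are added to the grid.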

In [None]:
# Select the training and validation splits
train_data = preprocessed_dataset['train']
val_data = preprocessed_dataset['validation']

# Extract the features and labels from the training set
train_texts = [example['text'] for example in train_data]
train_labels = train_data['label']

# Extract the features and labels from the validation set
val_texts = [example['text'] for example in val_data]
val_labels = val_data['label']

# Define a pipeline with a TF-IDF vectorizer and a logistic regression classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Define the hyperparameters to search over
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],  # different n-gram ranges
    'classifier__C': [0.01, 0.1, 1, 10, 100]  # different values of C for the logistic regression classifier
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(train_texts, train_labels)

# Get the best hyperparameters and the corresponding mean cross-validated accuracy
# (best_score_ is averaged over the 5 folds of the training set, not the validation split)
best_params = grid_search.best_params_
best_accuracy = grid_search.best_score_

print("Best Hyperparameters:", best_params)
print("Best Cross-Validation Accuracy:", best_accuracy)


Best Hyperparameters: {'classifier__C': 100, 'tfidf__ngram_range': (1, 2)}
Best Cross-Validation Accuracy: 0.89925


### Evaluate your final model on the test set

To evaluate the final model on the test set, we train a new pipeline on the combined training and validation sets and then compute its accuracy on the test set.

Note that the code below sets ngram_range=(1, 1) and C=0.1, which differ from the values selected by the grid search (ngram_range=(1, 2) and C=100), so the reported test accuracy does not reflect the tuned configuration.
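
One way to carry tuned values into a final model without retyping them is to pass `best_params_` to `set_params`. The following is a minimal sketch on a made-up toy corpus, not the notebook's data; the texts, labels, and the small grid are assumptions chosen so the example runs quickly.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data (hypothetical): 0 = negative feeling, 1 = positive feeling
texts = ["feel sad", "feel awful", "feel terrible today", "so sad today",
         "feel happy", "feel great", "feel wonderful today", "so happy today"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "classifier__C": [0.1, 10],
}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)

# Build a fresh pipeline configured with the selected hyperparameters
final_model = clone(pipeline).set_params(**search.best_params_)
final_model.fit(texts, labels)

print(search.best_params_)
print(final_model.predict(["feel happy today"]))
```

In the notebook itself, the analogous step would be to apply the grid search's `best_params_` to the pipeline before fitting it on the combined training and validation texts.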

In [None]:
# Select the training, validation, and test splits
train_data = preprocessed_dataset['train']
val_data = preprocessed_dataset['validation']
test_data = preprocessed_dataset['test']

# Combine the training and validation sets
train_val_texts = train_data['text'] + val_data['text']
train_val_labels = train_data['label'] + val_data['label']

# Extract the features and labels from the test set
test_texts = test_data['text']
test_labels = test_data['label']

# Define a pipeline with a TF-IDF vectorizer and a logistic regression classifier
# Hard-coded values; the grid search above selected ngram_range=(1, 2) and C=100
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 1))),
    ('classifier', LogisticRegression(C=0.1, max_iter=1000))
])

# Train the final model on the combined training and validation sets
pipeline.fit(train_val_texts, train_val_labels)

# Evaluate the final model on the test set
predictions = pipeline.predict(test_texts)
accuracy = accuracy_score(test_labels, predictions)

print("Test Accuracy:", accuracy)


Test Accuracy: 0.688


In [None]:
# Label ids follow the dataset's ClassLabel order
label_dict = {0: 'sadness', 1: 'joy', 2: 'love', 3: 'anger', 4: 'fear', 5: 'surprise'}

for index in range(0, 10):
    print(test_texts[index], '\n** prediction:', label_dict[predictions[index]],
          ', label: ', label_dict[test_labels[index]], '\n')

im feeling rather rotten im ambitious right 
** prediction: sadness , label:  sadness 

im updating blog feel shitty 
** prediction: sadness , label:  sadness 

never make separate ever want feel like ashamed 
** prediction: sadness , label:  sadness 

left bouquet red yellow tulips arm feeling slightly optimistic arrived 
** prediction: joy , label:  joy 

feeling little vain one 
** prediction: sadness , label:  sadness 

cant walk shop anywhere feel uncomfortable 
** prediction: joy , label:  fear 

felt anger end telephone call 
** prediction: joy , label:  anger 

explain clung relationship boy many ways immature uncommitted despite excitement feeling getting accepted masters program university virginia 
** prediction: joy , label:  joy 

like breathless feeling reader eager see happen next 
** prediction: joy , label:  joy 

jest feel grumpy tired pre menstrual probably week im fit walrus vacation summer 
** prediction: joy , label:  anger 



------------------------------------------------------------------------------

In [None]:
pip install transformers datasets evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import numpy as np
import tensorflow as tf



In [None]:
from datasets import load_dataset

# Load the 'emotion' dataset
emotion_data = load_dataset('emotion')

emotion_data["test"][0]





{'text': 'im feeling rather rotten so im not very ambitious right now',
 'label': 0}

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [None]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_emotion = emotion_data.map(preprocess_function, batched=True)






In [None]:
import evaluate

accuracy = evaluate.load("accuracy")

In [None]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

In [None]:
# Label ids follow the dataset's ClassLabel order
id2label = {0: 'sadness', 1: 'joy', 2: 'love', 3: 'anger', 4: 'fear', 5: 'surprise'}
label2id = {name: idx for idx, name in id2label.items()}

In [None]:
!pip install transformers[torch]


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=6, id2label=id2label, label2id=label2id
)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'pre_classifier.we

In [None]:
pip install accelerate


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)#, return_tensors="tf")

In [None]:
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
training_args = TrainingArguments(
    output_dir="classification_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_emotion["train"],
    eval_dataset=tokenized_emotion["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

Cloning https://huggingface.co/mtebad/classification_model into local empty directory.


You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.2461 | 0.196414 | 0.9265 |
| 2 | 0.1464 | 0.166869 | 0.937 |


TrainOutput(global_step=2000, training_loss=0.31886674880981447, metrics={'train_runtime': 232.0126, 'train_samples_per_second': 137.924, 'train_steps_per_second': 8.62, 'total_flos': 390202276358016.0, 'train_loss': 0.31886674880981447, 'epoch': 2.0})

In [None]:
trainer.push_to_hub()

Upload file runs/Jun17_13-06-08_952811e718fd/events.out.tfevents.1687007248.952811e718fd.3451.0: 100%|########…

To https://huggingface.co/mtebad/classification_model
   8486f57..f059c3f  main -> main

To https://huggingface.co/mtebad/classification_model
   f059c3f..5357e41  main -> main



'https://huggingface.co/mtebad/classification_model/commit/f059c3fd925dea623fd60eb4b0860a02c0e25ca5'

In [None]:

trainer.predict(tokenized_emotion["test"])



PredictionOutput(predictions=array([[ 6.155718  , -0.75724536, -1.5658146 , -0.9352906 , -1.5542933 ,
        -2.342891  ],
       [ 6.236772  , -0.9788001 , -1.6186336 , -0.8419975 , -1.5057111 ,
        -2.2893734 ],
       [ 6.2045794 , -1.0326067 , -1.5238668 , -1.2160791 , -1.2748581 ,
        -2.2339344 ],
       ...,
       [-1.4128406 ,  6.231103  , -0.56595236, -1.7813299 , -2.2774913 ,
        -1.7730592 ],
       [-1.1865782 ,  6.015909  , -1.0668901 , -1.7895515 , -1.5586472 ,
        -1.6562362 ],
       [-1.8966075 , -1.6872445 , -2.019763  , -1.9924353 ,  3.1283295 ,
         3.0167706 ]], dtype=float32), label_ids=array([0, 0, 0, ..., 1, 1, 4]), metrics={'test_loss': 0.1948271095752716, 'test_accuracy': 0.921, 'test_runtime': 3.552, 'test_samples_per_second': 563.057, 'test_steps_per_second': 35.191})
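
The `predictions` array holds raw logits, one row per test example and one column per class. As an illustration, the sketch below converts the first row shown above into probabilities with a softmax and picks the most likely class id. It uses only the standard library, and the function name is made up for this example.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# First row of the predictions array printed above
first_row = [6.155718, -0.75724536, -1.5658146, -0.9352906, -1.5542933, -2.342891]

probs = softmax(first_row)
predicted_id = max(range(len(probs)), key=probs.__getitem__)

print(predicted_id)          # 0, matching label_ids[0]
print(round(sum(probs), 6))  # 1.0
```

Because softmax preserves the ordering of the logits, taking the argmax of the raw logits gives the same class id.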

In [None]:
!pip3 install torch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import torch
tokenizer = AutoTokenizer.from_pretrained("mtebad/classification_model")
model = AutoModelForSequenceClassification.from_pretrained("mtebad/classification_model")

# Read the label names from the dataset so ids map to the correct emotions
label_names = emotion_data["test"].features["label"].names

for index in range(0, 10):
    example = emotion_data["test"][index]
    inputs = tokenizer(example["text"], return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_class_id = logits.argmax().item()

    print(example["text"], '\n** prediction:', label_names[predicted_class_id],
          ', label: ', label_names[example["label"]], '\n')


im feeling rather rotten so im not very ambitious right now 
** prediction: sadness , label:  sadness 

im updating my blog because i feel shitty 
** prediction: sadness , label:  sadness 

i never make her separate from me because i don t ever want her to feel like i m ashamed with her 
** prediction: sadness , label:  sadness 

i left with my bouquet of red and yellow tulips under my arm feeling slightly more optimistic than when i arrived 
** prediction: joy , label:  joy 

i was feeling a little vain when i did this one 
** prediction: sadness , label:  sadness 

i cant walk into a shop anywhere where i do not feel uncomfortable 
** prediction: fear , label:  fear 

i felt anger when at the end of a telephone call 
** prediction: anger , label:  anger 

i explain why i clung to a relationship with a boy who was in many ways immature and uncommitted despite the excitement i should have been feeling for getting accepted into the masters program at the university of virginia 
** predictio