<a href="https://colab.research.google.com/github/mathewsrc/Natural-Language-Processing-in-Python/blob/master/predict_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Predicting sentiment**

Using supervised models to predict sentiment.

In [6]:
%%capture
!pip install polars
!pip install transformers datasets
!pip install evaluate

In [4]:
import polars as pl

### Inspect the dataset

In [7]:
from datasets import load_dataset_builder
ds_builder = load_dataset_builder("rotten_tomatoes")
ds_builder.info.features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['neg', 'pos'], id=None)}

### Load dataset from Hugging Face

In [8]:
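from datasets import get_dataset_split_names, load_dataset

# Take a random 1,000-example sample of the training split.
# shuffle() is unseeded here, so the sample changes between runs; pass seed=... to make it reproducible.
dataset = load_dataset("rotten_tomatoes", split="train").shuffle().select(range(1000))

# List the splits available for this dataset.
print(get_dataset_split_names("rotten_tomatoes"))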

['train', 'validation', 'test']


### Use a transformer model to predict sentiment and evaluate it

In [10]:
import evaluate
from evaluate import evaluator
from evaluate.visualization import radar_plot 
from transformers import pipeline

# Benchmark (baseline) model: the default sentiment-analysis pipeline
classifier = pipeline("sentiment-analysis")
metric = evaluate.load("accuracy")
task_evaluator = evaluator("text-classification")

results = task_evaluator.compute(model_or_pipeline=classifier,
                                 data=dataset,
                                 input_column="text",
                                 label_column="label",
                                 metric=metric,
                                 label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
                                 strategy="bootstrap",
                                 n_resamples=200,
                                 random_state=42)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
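
The `label_mapping` argument is needed because the pipeline returns string labels (`NEGATIVE`/`POSITIVE`), while the dataset stores integer class ids (0 for `neg`, 1 for `pos`). The sketch below shows that conversion followed by a plain accuracy computation. The predictions and references are made-up values for illustration, not real model output, and the code is a simplified stand-in for what the evaluator does rather than its actual implementation.

```python
# Hypothetical pipeline outputs: one dict per input text (values are illustrative).
predictions = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.91},
    {"label": "POSITIVE", "score": 0.67},
    {"label": "NEGATIVE", "score": 0.88},
]
# Hypothetical gold labels in the dataset's integer encoding (0 = neg, 1 = pos).
references = [1, 0, 0, 0]

label_mapping = {"NEGATIVE": 0, "POSITIVE": 1}

# Convert string labels to integer ids so they are comparable with the references.
predicted_ids = [label_mapping[p["label"]] for p in predictions]

# Accuracy is the fraction of predictions that match the references.
accuracy = sum(p == r for p, r in zip(predicted_ids, references)) / len(references)
print(predicted_ids, accuracy)
```

Without the mapping, string predictions could not be compared with the integer labels in the `label` column.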


In [17]:
import json
print(json.dumps(results, indent=4))

{
    "accuracy": {
        "confidence_interval": [
            0.875504983757101,
            0.912
        ],
        "standard_error": 0.00921351285316722,
        "score": 0.896
    },
    "total_time_in_seconds": 84.14236229800008,
    "samples_per_second": 11.884619978440613,
    "latency_in_seconds": 0.08414236229800008
}
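
The confidence interval and standard error above come from the `strategy="bootstrap"` setting with `n_resamples=200`. As a rough illustration of the idea, the sketch below resamples per-example correctness flags with replacement and reads the interval off the empirical quantiles. It is a plain percentile bootstrap written with the standard library only; the interval method used internally by the library may differ, and the function name `bootstrap_accuracy` and the synthetic flags are assumptions made for this example.

```python
import random
import statistics


def bootstrap_accuracy(correct, n_resamples=200, confidence=0.95, seed=42):
    """Percentile-bootstrap estimate of accuracy from per-example 0/1 correctness flags."""
    rng = random.Random(seed)
    n = len(correct)
    score = sum(correct) / n
    # Resample the examples with replacement and recompute accuracy each time.
    resampled = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    # Use the empirical quantiles of the resampled scores as the interval bounds.
    alpha = (1 - confidence) / 2
    low = resampled[int(alpha * (n_resamples - 1))]
    high = resampled[int((1 - alpha) * (n_resamples - 1))]
    return {
        "score": score,
        "confidence_interval": (low, high),
        "standard_error": statistics.stdev(resampled),
    }


# Synthetic flags: 896 correct out of 1000, matching the score reported above.
flags = [1] * 896 + [0] * 104
result = bootstrap_accuracy(flags)
print(result)
```

The timing fields in the output are related by simple arithmetic: 84.14 seconds over 1,000 examples gives a latency of about 0.084 seconds per example, or roughly 11.88 examples per second.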


### Train a logistic regression model to predict sentiment

In [117]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
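
Before logistic regression can be trained, each review has to become a numeric vector. `CountVectorizer` does this with a bag-of-words representation: it learns a vocabulary from the training text and counts how often each vocabulary word occurs in a document. The standard-library sketch below mimics that behavior in a simplified way (lowercasing, tokens of two or more word characters, alphabetical column order). The helper names and example sentences are hypothetical, and the real class returns a sparse matrix and has many more options.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_vocabulary(texts):
    # Map each distinct training token to a column index (alphabetical order).
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(texts, vocabulary):
    # Count known tokens; tokens missing from the vocabulary are ignored.
    rows = []
    for text in texts:
        counts = Counter(tok for tok in tokenize(text) if tok in vocabulary)
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


train_texts = ["a good good movie", "a bad movie"]  # made-up examples
test_texts = ["good but boring movie"]

vocabulary = fit_vocabulary(train_texts)
train_matrix = transform(train_texts, vocabulary)
test_matrix = transform(test_texts, vocabulary)
print(vocabulary)
print(train_matrix)
print(test_matrix)
```

Words that appear only in the test text ("but", "boring") have no column, which is why the vocabulary is fitted on the training data and only applied to the test data.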

In [126]:
# Sample 1,000 training and 400 test examples and convert them to pandas DataFrames.
# Both shuffles are unseeded, so the exact scores vary between runs.
train_ds = load_dataset("rotten_tomatoes", split="train").shuffle().select(range(1000)).to_pandas()
X_train = train_ds["text"]
y_train = train_ds["label"]

test_ds = load_dataset("rotten_tomatoes", split="test").shuffle().select(range(400)).to_pandas()
X_test = test_ds["text"]
y_test = test_ds["label"]

# Bag-of-words features: learn the vocabulary on the training text only,
# then reuse it to encode the test text.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

lr_classifier = LogisticRegression(random_state=42).fit(X_train, y_train)

y_predict = lr_classifier.predict(X_test)

print('Accuracy on train set: ', lr_classifier.score(X_train, y_train))
print('Accuracy on test set: ', lr_classifier.score(X_test, y_test))
# Dividing by the number of test examples turns counts into fractions of the test set.
print('Confusion matrix on test set: ', confusion_matrix(y_test, y_predict)/len(y_test))



Accuracy on train set:  0.999
Accuracy on test set:  0.675
Confusion matrix on test set:  [[0.365 0.135]
 [0.19  0.31 ]]
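
The normalized confusion matrix contains more than the overall accuracy. As a worked example, the sketch below derives per-class precision and recall from the values printed above. It assumes scikit-learn's layout (rows are true classes, columns are predicted classes, in label order 0 = `neg`, 1 = `pos`); the variable names are chosen for this example only.

```python
# Normalized confusion matrix copied from the output above.
# Assumed layout: rows = true class, columns = predicted class (0 = neg, 1 = pos).
cm = [[0.365, 0.135],
      [0.19, 0.31]]

labels = ["neg", "pos"]

# Accuracy is the total mass on the diagonal.
accuracy = cm[0][0] + cm[1][1]

metrics = {}
for i, name in enumerate(labels):
    row_total = sum(cm[i])                 # share of examples whose true class is i
    col_total = sum(row[i] for row in cm)  # share of examples predicted as class i
    metrics[name] = {
        "recall": cm[i][i] / row_total,
        "precision": cm[i][i] / col_total,
    }

print(f"accuracy: {accuracy:.3f}")
for name, m in metrics.items():
    print(f"{name}: precision={m['precision']:.3f} recall={m['recall']:.3f}")
```

The diagonal sums to 0.675, which matches the reported test accuracy. Recall is 0.73 for `neg` and 0.62 for `pos`, so in this sample the model misses positive reviews more often than negative ones. The gap between 0.999 accuracy on the training set and 0.675 on the test set suggests the model is overfitting the 1,000 training examples.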
