[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/unboxai/examples-gallery/blob/main/text-classification/sklearn/sentiment-analysis/sentiment-sklearn.ipynb)


# Sentiment analysis using sklearn

This notebook illustrates how sklearn models can be uploaded to the Unbox platform.

In [None]:
!curl "https://raw.githubusercontent.com/unboxai/examples-gallery/main/text-classification/sklearn/requirements.txt" --output "requirements.txt"

In [None]:
!pip install -r requirements.txt

## Importing the modules and loading the dataset

In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

We have stored the dataset in the following S3 bucket. If you get an error reading the CSV files directly from it, copy the URLs into your browser and download the files manually. You can also find the original datasets in [this Kaggle dataset](https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset?select=testdata.manual.2009.06.14.csv). The training set in this example corresponds to the first 20,000 lines of the original training set.

In [2]:
TRAINING_SET_URL = "https://unbox-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/text-classification/sentiment-analysis/sentiment_training_set_sample.csv"
VALIDATION_SET_URL = "https://unbox-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/text-classification/sentiment-analysis/sentiment_validation_set.csv"

In [3]:
columns = ['polarity', 'tweetid', 'date', 'query_name', 'user', 'text']
df_train = pd.read_csv(TRAINING_SET_URL,
                       header=None,
                       encoding='ISO-8859-1',
                       index_col=0)

df_test = pd.read_csv(VALIDATION_SET_URL,
                      header=None,
                      encoding='ISO-8859-1')
df_train.columns = columns
df_test.columns = columns
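
If reading directly from the bucket fails, one option is to download the files by hand and fall back to the local copies. The helper below is a minimal sketch of that idea; `read_csv_with_fallback` and the local file name in the usage note are hypothetical names, not part of the original notebook.

```python
import pandas as pd


def read_csv_with_fallback(url, local_path, **read_csv_kwargs):
    """Read a CSV from `url`; if that fails, read the copy saved at `local_path`."""
    try:
        return pd.read_csv(url, **read_csv_kwargs)
    except OSError:
        # Network errors (URLError, HTTPError) and FileNotFoundError
        # are all subclasses of OSError
        return pd.read_csv(local_path, **read_csv_kwargs)
```

With this in place, the training set could be loaded with something like `read_csv_with_fallback(TRAINING_SET_URL, "sentiment_training_set_sample.csv", header=None, encoding="ISO-8859-1", index_col=0)`, assuming the file was saved under that name in the working directory.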

In [4]:
df_train.head()

| | polarity | tweetid | date | query_name | user | text |
|---|---|---|---|---|---|---|
| 234777 | 0 | 1979653827 | Sun May 31 03:58:02 PDT 2009 | NO_QUERY | Missscribbler | Just went back from lunch and some small shopp... |
| 1040566 | 4 | 1956965983 | Thu May 28 23:09:01 PDT 2009 | NO_QUERY | mike_online | @tbbs: Sarah Connor: So badass, even men from ... |
| 1444362 | 4 | 2062251235 | Sat Jun 06 22:43:07 PDT 2009 | NO_QUERY | JennysMyName | @KayleenDuhh Nope. I don't see it. |
| 932855 | 4 | 1771210840 | Mon May 11 23:33:47 PDT 2009 | NO_QUERY | Scyranth | know ur flippin enemy! |
| 1259486 | 4 | 1998127816 | Mon Jun 01 18:02:16 PDT 2009 | NO_QUERY | heatherzajac | the ice cream truck came today!!! first "... |


## Training and evaluating the model's performance

In [5]:
sklearn_model = Pipeline([("count_vect", 
                           CountVectorizer(min_df=100, 
                                           ngram_range=(1, 2), 
                                           stop_words="english"),),
                          ("lr", LogisticRegression()),])
sklearn_model.fit(df_train.text, df_train.polarity)

Pipeline(steps=[('count_vect',
                 CountVectorizer(min_df=100, ngram_range=(1, 2),
                                 stop_words='english')),
                ('lr', LogisticRegression())])

In [6]:
x_test, y_test = df_test.text, df_test.polarity
print(classification_report(y_test, sklearn_model.predict(x_test)))

              precision    recall  f1-score   support

           0       0.61      0.49      0.54       177
           2       0.00      0.00      0.00       139
           4       0.43      0.83      0.56       182

    accuracy                           0.48       498
   macro avg       0.34      0.44      0.37       498
weighted avg       0.37      0.48      0.40       498
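
The row for label `2` (neutral) is all zeros because the training sample contains only labels `0` and `4`, so the model can never predict `2`. One way to get a report that reflects only the classes the model knows is to drop the rows with unseen labels before scoring. The sketch below shows the idea on made-up arrays that stand in for `y_test` and the model's predictions.

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical stand-ins for y_test and sklearn_model.predict(x_test)
y_true = np.array([0, 2, 4, 4, 0, 2])
y_pred = np.array([0, 4, 4, 4, 4, 0])

# Labels seen during training; in the notebook these could be read from
# sklearn_model.named_steps["lr"].classes_
trained_labels = np.array([0, 4])

# Keep only the rows whose true label the model could have predicted
mask = np.isin(y_true, trained_labels)
report = classification_report(
    y_true[mask], y_pred[mask], labels=trained_labels, zero_division=0
)
print(report)
```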





## Uploading to Unbox

### Installing unboxapi

In [None]:
!pip install unboxapi

### Instantiating the client

In [7]:
import unboxapi

client = unboxapi.UnboxClient("YOUR_API_KEY_HERE")

### Creating a project on the platform

In [8]:
from unboxapi import TaskType

project = client.create_project(name="Sentiment Analysis",
                                task_type=TaskType.TextClassification,
                                description="Sklearn Sentiment Analysis with Unbox")

Created your project. Check out https://unbox.ai/projects!


### Uploading the validation set

In [10]:
# Drop 'neutral' rows (label 2), since that class isn't in the training dataset
df_test = df_test[df_test['polarity'] != 2].copy()
# Map the labels to consecutive integers: 0 (negative) and 1 (positive)
df_test['polarity'] = df_test['polarity'].replace(4, 1)
df_train['polarity'] = df_train['polarity'].replace(4, 1)

In [11]:
dataset = project.add_dataframe(
    df=df_test,
    class_names=['negative', 'positive'],
    label_column_name='polarity',
    text_column_name='text',
    commit_message='this is my sentiment test dataset',
)

Uploading dataset to Unbox! Check out https://unbox.ai/datasets to have a look!


### Uploading the model

First, create a `predict_proba` function, which is how Unbox interacts with your model.

In [12]:
def predict_proba(model, text_list):
    return model.predict_proba(text_list)

Let's test the `predict_proba` function to make sure the input-output format is consistent with what Unbox expects:

In [13]:
predict_proba(sklearn_model, ['good', 'bad'])

array([[0.30904471, 0.69095529],
       [0.78541812, 0.21458188]])
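
The check above can also be automated. The sketch below assumes that the expected output is one row per input text and one column per class, with each row summing to 1; that is how sklearn's `predict_proba` behaves, but it is an assumption about what Unbox requires. `check_probabilities` and the toy model are hypothetical and only here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def check_probabilities(probs, n_inputs, n_classes):
    """Raise ValueError unless `probs` has one probability row per input."""
    probs = np.asarray(probs)
    if probs.shape != (n_inputs, n_classes):
        raise ValueError(f"expected shape {(n_inputs, n_classes)}, got {probs.shape}")
    if not np.allclose(probs.sum(axis=1), 1.0):
        raise ValueError("each row should sum to 1")
    return probs


# Toy model trained on the original 0/4 labels (made-up data)
toy_model = Pipeline([("count_vect", CountVectorizer()), ("lr", LogisticRegression())])
toy_model.fit(["good great", "bad awful", "great fun", "awful boring"], [4, 0, 4, 0])

texts = ["good", "bad"]
probs = check_probabilities(toy_model.predict_proba(texts), len(texts), n_classes=2)

# predict_proba columns follow the order of classes_ (sorted labels), so
# column 0 is label 0 ('negative') and column 1 is label 4 ('positive')
classes = toy_model.named_steps["lr"].classes_
print(classes)
```

The column order is why `class_names` is given as `['negative', 'positive']`: the first column holds the probability of the smaller label.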

Now, we can upload the model:

In [15]:
from unboxapi.models import ModelType

model = project.add_model(
    function=predict_proba, 
    model=sklearn_model,
    model_type=ModelType.sklearn,
    class_names=['negative', 'positive'],
    name='05.15.2021.sentiment_analyzer',
    commit_message='this is my sklearn sentiment model',
    requirements_txt_file='requirements.txt'
)

Bundling model and artifacts...
Uploading model to Unbox! Check out https://unbox.ai/models to have a look!


