[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/unboxai/examples-gallery/blob/main/text-classification/sentiment-sklearn.ipynb)


# Sentiment analysis using sklearn

This notebook illustrates how sklearn models can be uploaded to the Unbox platform.

## Importing the modules and loading the dataset

In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

In [3]:
%%bash

if [ ! -d ./data ]; then
    mkdir ./data
fi

if [ ! -f ./data/trainingandtestdata.zip ]; then
    curl -sS -L -o ./data/trainingandtestdata.zip http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
fi

unzip -n ./data/trainingandtestdata.zip -d ./data
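Where `curl` and `unzip` are not available, the same download-and-extract step can be sketched in Python with the standard library alone. The helper name `fetch_and_extract` is hypothetical and not part of this notebook's dependencies. It skips the download when the archive already exists and, like `unzip -n`, never overwrites files that are already extracted.

```python
import os
import urllib.request
import zipfile


def fetch_and_extract(url, zip_path, dest_dir):
    """Download `url` to `zip_path` if it is missing, then extract new files.

    Returns the list of archive members that were extracted in this call.
    """
    os.makedirs(dest_dir, exist_ok=True)

    # Only hit the network when the archive is not already on disk.
    if not os.path.isfile(zip_path):
        urllib.request.urlretrieve(url, zip_path)

    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            target = os.path.join(dest_dir, name)
            # Mirror `unzip -n`: leave existing files untouched.
            if not os.path.exists(target):
                archive.extract(name, dest_dir)
                extracted.append(name)
    return extracted


# Usage for this notebook (commented out so the sketch runs offline):
# fetch_and_extract(
#     "http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip",
#     "./data/trainingandtestdata.zip",
#     "./data",
# )
```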

In [4]:
columns = ['polarity', 'tweetid', 'date', 'query_name', 'user', 'text']
df_train_file_path = './data/training.1600000.processed.noemoticon.csv'
df_train_name = 'training.1600000.processed.noemoticon'
df_train = pd.read_csv(df_train_file_path,
                      header=None,
                      encoding='ISO-8859-1')

df_test_file_path = './data/testdata.manual.2009.06.14.csv'
df_test_name = 'testdata.manual.2009.06.14'
df_test = pd.read_csv(df_test_file_path,
                     header=None,
                     encoding='ISO-8859-1')
df_train.columns = columns
df_test.columns = columns

## Training and evaluating the model's performance

In [5]:
sklearn_model = Pipeline([
    ("count_vect", CountVectorizer(min_df=100,
                                   ngram_range=(1, 2),
                                   stop_words="english")),
    ("lr", LogisticRegression()),
])
sklearn_model.fit(df_train.text, df_train.polarity)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Pipeline(steps=[('count_vect',
                 CountVectorizer(min_df=100, ngram_range=(1, 2),
                                 stop_words='english')),
                ('lr', LogisticRegression())])

In [7]:
x_test, y_test = df_test.text, df_test.polarity
print(classification_report(y_test, sklearn_model.predict(x_test)))

              precision    recall  f1-score   support

           0       0.75      0.82      0.78       177
           2       0.00      0.00      0.00       139
           4       0.52      0.88      0.66       182

    accuracy                           0.61       498
   macro avg       0.43      0.57      0.48       498
weighted avg       0.46      0.61      0.52       498
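The zero scores for class `2` (neutral) are what one would expect here: the training set contains no neutral tweets, so the model never predicts that label. The sketch below reproduces the effect on made-up labels using only the standard library. The function `per_class_precision_recall` is a hypothetical helper for illustration, not a scikit-learn API.

```python
from collections import Counter


def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} for every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    pred_count = Counter(y_pred)
    true_count = Counter(y_true)

    scores = {}
    for label in labels:
        # A label that is never predicted has no defined precision; report 0.0.
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_pos[label] / true_count[label] if true_count[label] else 0.0
        scores[label] = (precision, recall)
    return scores


# Made-up labels: the "model" only ever predicts 0 or 4, never 2.
y_true = [0, 0, 2, 2, 4, 4]
y_pred = [0, 4, 0, 4, 4, 4]
scores = per_class_precision_recall(y_true, y_pred)
print(scores[2])  # precision and recall for the never-predicted class
```

Every neutral example is forced into one of the other two classes, which also drags down the precision of those classes.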





## Uploading to Unbox

### Instantiating the client

In [8]:
import unboxapi

client = unboxapi.UnboxClient("YOUR_API_KEY_HERE")  # replace with your own API key
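An API key written directly into a notebook is easy to leak when the notebook is shared. One option, sketched below, is to read the key from an environment variable. Both the variable name `UNBOX_API_KEY` and the helper `load_api_key` are assumptions for illustration, not something `unboxapi` defines.

```python
import os


def load_api_key(env_var="UNBOX_API_KEY"):
    """Read the API key from the environment and fail loudly if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before creating the client."
        )
    return key


# Usage with the client from this notebook (commented out; needs a real key):
# client = unboxapi.UnboxClient(load_api_key())
```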

### Creating a project on the platform

In [9]:
project = client.create_project(name="Sentiment Analysis",
                                description="Sklearn Sentiment Analysis with Unbox")

Creating project on Unbox! Check out https://unbox.ai/projects to have a look!


### Uploading the validation set

In [13]:
# Drop the 'neutral' rows (polarity 2), since that class isn't in the training dataset
df_test = df_test[df_test['polarity'] != 2].copy()
# Map the remaining labels to consecutive integers: 0 (negative), 1 (positive)
df_test['polarity'] = df_test['polarity'].replace(4, 1)
df_train['polarity'] = df_train['polarity'].replace(4, 1)

In [14]:
from unboxapi.tasks import TaskType

dataset = project.add_dataframe(
    df=df_test,
    class_names=['negative', 'positive'],
    label_column_name='polarity',
    text_column_name='text',
    name=df_test_name,
    description='this is my sentiment test dataset',
    task_type=TaskType.TextClassification
)

Uploading dataset to Unbox! Check out https://unbox.ai/datasets to have a look!


### Uploading the model

First, define a `predict_proba` function, which is how Unbox interacts with your model:

In [15]:
def predict_proba(model, text_list):
    return model.predict_proba(text_list)

Let's test the `predict_proba` function to make sure the input-output format is consistent with what Unbox expects:

In [17]:
predict_proba(sklearn_model, ['good', 'bad'])

array([[0.30857194, 0.69142806],
       [0.71900947, 0.28099053]])
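The exact format Unbox expects is not spelled out in this notebook. Judging from the output above, a reasonable assumption is one row per input text, one column per class name, and each row summing to 1. The helper below is a minimal sketch of that check using only the standard library; `check_proba_output` is a hypothetical name, not part of `unboxapi`.

```python
import math


def check_proba_output(probs, n_samples, n_classes, tol=1e-6):
    """Validate a predict_proba-style result (nested lists or a 2-D array).

    Assumed contract: shape (n_samples, n_classes), values in [0, 1],
    and every row sums to 1 within `tol`.
    """
    rows = [[float(p) for p in row] for row in probs]
    if len(rows) != n_samples:
        raise ValueError(f"expected {n_samples} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        if len(row) != n_classes:
            raise ValueError(f"row {i}: expected {n_classes} columns, got {len(row)}")
        if any(p < 0.0 or p > 1.0 for p in row):
            raise ValueError(f"row {i}: probabilities must lie in [0, 1]")
        if not math.isclose(sum(row), 1.0, abs_tol=tol):
            raise ValueError(f"row {i}: probabilities sum to {sum(row)}, not 1")
    return True


# The values printed above for ['good', 'bad'] pass the check.
example = [[0.30857194, 0.69142806], [0.71900947, 0.28099053]]
is_valid = check_proba_output(example, n_samples=2, n_classes=2)
```

In the notebook, the same check could be applied directly to `predict_proba(sklearn_model, ['good', 'bad'])` with `n_classes` equal to the length of `class_names`.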

Now, we can upload the model:

In [20]:
from unboxapi.models import ModelType

model = project.add_model(
    function=predict_proba, 
    model=sklearn_model,
    model_type=ModelType.sklearn,
    class_names=['negative', 'positive'],
    name='05.15.2021.sentiment_analyzer',
    description='this is my sklearn sentiment model',
    task_type=TaskType.TextClassification
)

Bundling model and artifacts...
Uploading model to Unbox! Check out https://unbox.ai/models to have a look!


