# IMDB Sentiment Analysis

This notebook uses the IMDB movie review dataset from Kaggle (https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews).

We will follow these steps:

1. Import the dataset
2. Explore the data
3. Prepare the data for modeling
4. Create the training and test sets
5. Train models using different algorithms
6. Evaluate the models
7. Select the best model for this dataset
8. Deploy the model to Watson Machine Learning



In [None]:
!pip install nltk

In [1]:
%matplotlib inline 

import pandas as pd
import numpy as np
import matplotlib as mlp
import matplotlib.pyplot as plt
import seaborn as sns
import json

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import preprocessing
from sklearn import tree
from sklearn import svm
from sklearn import ensemble
from sklearn import neighbors
from sklearn import linear_model
from sklearn import metrics

In [2]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /home/wsuser/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/wsuser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Importing the dataset

In [53]:
# The code was removed by Watson Studio for sharing.
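
The loading cell above was stripped on export, so `df_data_1` has no visible definition. Below is a minimal sketch of how it could be recreated, assuming the Kaggle CSV has been downloaded locally. The file name `IMDB Dataset.csv` and the helper `load_reviews` are assumptions for illustration and do not come from the notebook.

```python
import io
import pandas as pd

def load_reviews(source):
    """Read the reviews CSV and check the two columns the notebook relies on."""
    frame = pd.read_csv(source)
    missing = {"review", "sentiment"} - set(frame.columns)
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    return frame

# Tiny in-memory stand-in so the sketch runs without the real file.
sample_csv = io.StringIO(
    "review,sentiment\n"
    '"A wonderful little production.",positive\n'
    '"Basically a waste of time.",negative\n'
)
df_data_1 = load_reviews(sample_csv)
print(df_data_1.shape)

# With the real download it would be something like:
# df_data_1 = load_reviews("IMDB Dataset.csv")  # assumed file name
```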

## Exploring the data

In [4]:
df = df_data_1
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [5]:
df.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


## Preparing the data

We now prepare the review text with the following steps:

    1. Tokenization
    2. Stop-word removal
    3. Stemming
    4. Rejoining the tokens into a single string

Because the input is free text, these steps normalize the corpus before feature extraction.

In [6]:
stop_words = set(stopwords.words('english'))  # a set makes membership tests O(1)
porter_stemmer = PorterStemmer()

In [7]:
def identify_tokens(row):
    # Tokenize the raw review and keep only purely alphabetic tokens
    source = row['review']
    tokens = word_tokenize(source)
    token_words = [w for w in tokens if w.isalpha()]
    return token_words

In [8]:
def remove_stops(row):
    # Drop tokens that appear in the English stop-word list
    source_tokenization = row['text1']
    stop = [w for w in source_tokenization if w not in stop_words]
    return stop

In [9]:
def stem_porter(row):
    # Reduce each token to its Porter stem
    my_list = row['text1']
    stemmed_list = [porter_stemmer.stem(word) for word in my_list]
    return stemmed_list

In [10]:
def rejoin_words(row):
    # Join the processed tokens back into one space-separated string
    my_list = row['text1']
    joined_words = " ".join(my_list)
    return joined_words

In [11]:
def pre_processing(df):
    print('Tokenization')
    df['text1'] = df.apply(identify_tokens, axis=1)
    print('Remove stop words')
    df['text1'] = df.apply(remove_stops, axis=1)
    print('Stemming')
    df['text1'] = df.apply(stem_porter, axis=1)
    print('Rejoin words')
    df['tidy_text'] = df.apply(rejoin_words, axis=1)
    
    return df

In [12]:
df = pre_processing(df)

df['tidy_text'] = df['tidy_text'].str.lower()
df.head()

Tokenization
Remove stop words
Stemming
Rejoin words


Unnamed: 0,review,sentiment,text1,tidy_text
0,One of the other reviewers has mentioned that ...,positive,"[one, review, mention, watch, Oz, episod, hook...",one review mention watch oz episod hook they r...
1,A wonderful little production. <br /><br />The...,positive,"[A, wonder, littl, product, br, br, the, film,...",a wonder littl product br br the film techniqu...
2,I thought this was a wonderful way to spend ti...,positive,"[I, thought, wonder, way, spend, time, hot, su...",i thought wonder way spend time hot summer wee...
3,Basically there's a family where a little boy ...,negative,"[basic, famili, littl, boy, jake, think, zombi...",basic famili littl boy jake think zombi closet...
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"[petter, mattei, love, time, money, visual, st...",petter mattei love time money visual stun film...
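
The output above still contains tokens such as `br`, `A`, and `I`. The `<br />` tags are not stripped before tokenization, and stop-word removal runs before lowercasing, so capitalized stop words do not match the lowercase list. The sketch below shows one way to handle both. It is an approximation, not the notebook's method: `clean_review` is a hypothetical helper, the stop-word set is a small illustrative sample rather than NLTK's list, a regular expression stands in for `word_tokenize`, and stemming is omitted.

```python
import re

# Small illustrative stop-word set; the notebook uses NLTK's English list instead.
DEMO_STOP_WORDS = {"a", "i", "the", "this", "was", "to", "of"}

TAG_PATTERN = re.compile(r"<[^>]+>")   # matches HTML tags such as <br />
WORD_PATTERN = re.compile(r"[a-z]+")   # runs of lowercase ASCII letters

def clean_review(text, stop_words=DEMO_STOP_WORDS):
    """Strip HTML tags, lowercase, tokenize, and drop stop words."""
    no_tags = TAG_PATTERN.sub(" ", text)
    tokens = WORD_PATTERN.findall(no_tags.lower())
    return [tok for tok in tokens if tok not in stop_words]

cleaned = clean_review("A wonderful little production. <br /><br />The filming technique")
print(cleaned)
# ['wonderful', 'little', 'production', 'filming', 'technique']
```

Lowercasing before the stop-word filter is the key ordering change; the tag pattern only covers simple inline tags.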


## Creating the train/test split

We split the data into a training set (70%) and a test set (30%), stratified by label so that both sets keep the same class balance.

In [13]:
X = df['tidy_text']
Y = df['sentiment']

print(X.shape)
print(Y.shape)

(50000,)
(50000,)


In [14]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, stratify=Y)
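
To make the effect of `stratify` concrete, here is a small sketch on synthetic labels (the data is invented for illustration). It also passes `random_state`, which the call above does not; the value 42 is arbitrary and only serves to make the split repeatable.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# 100 placeholder reviews with a 50/50 label balance, like the real dataset.
texts = [f"review {i}" for i in range(100)]
labels = ["positive"] * 50 + ["negative"] * 50

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.3, stratify=labels, random_state=42
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts)  # 35 of each class
print(test_counts)   # 15 of each class
```

Without a fixed seed, every run of the notebook produces a different split, so accuracy figures will vary slightly between runs.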

Machine learning and deep learning models expect numeric input features "X". Because we are working with text, we use TF-IDF to convert each review into a numeric vector.
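
For reference, this is the weighting scikit-learn's `TfidfVectorizer` applies when `sublinear_tf=True` is set and the defaults `smooth_idf=True` and `norm='l2'` are left unchanged. Here $f_{t,d}$ is the raw count of term $t$ in document $d$, $n$ is the number of training documents, and $\mathrm{df}(t)$ is the number of training documents containing $t$:

$$
\mathrm{tf}(t,d) = 1 + \ln f_{t,d} \quad (f_{t,d} > 0), \qquad
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1
$$

$$
w_{t,d} = \frac{\mathrm{tf}(t,d)\,\mathrm{idf}(t)}{\sqrt{\sum_{t'} \left(\mathrm{tf}(t',d)\,\mathrm{idf}(t')\right)^2}}
$$

The second line is the L2 normalization, so every non-empty document vector has unit length.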

In [15]:
tfidf = TfidfVectorizer(max_features=2000, ngram_range=(2,3), sublinear_tf=True)

X_train_tf = tfidf.fit_transform(X_train)
X_test_tf = tfidf.transform(X_test)

print(Y.value_counts().shape)
print(X_train_tf.shape)

(2,)
(35000, 2000)
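
Note that `ngram_range=(2,3)` keeps only two-word and three-word sequences, so single words never become features. The notebook does not say whether this is intentional. The sketch below, on a made-up sentence, shows the difference from a range that also includes unigrams:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good movie great cast"]

# Same range as the notebook: bigrams and trigrams only.
bigrams_trigrams = TfidfVectorizer(ngram_range=(2, 3)).fit(docs)
# Alternative range that also keeps single words.
with_unigrams = TfidfVectorizer(ngram_range=(1, 3)).fit(docs)

features_2_3 = sorted(bigrams_trigrams.vocabulary_)
features_1_3 = sorted(with_unigrams.vocabulary_)
print(features_2_3)
print(features_1_3)
```

Since individual words often carry sentiment, comparing both ranges on the real data may be worthwhile.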


In [16]:
le = preprocessing.LabelEncoder()

Y_train_le = le.fit_transform(list(Y_train))
Y_test_le = le.transform(list(Y_test))
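
`LabelEncoder` assigns integer codes in sorted order of the class names, which is what lets us read the 0/1 predictions later on. A short illustration on a toy label list:

```python
from sklearn import preprocessing

encoder = preprocessing.LabelEncoder()
encoded = encoder.fit_transform(["positive", "negative", "positive"])

classes = encoder.classes_.tolist()
codes = encoded.tolist()
decoded = encoder.inverse_transform([0, 1]).tolist()

print(classes)  # ['negative', 'positive']
print(codes)    # [1, 0, 1]
print(decoded)  # ['negative', 'positive']
```

So in this notebook 0 means negative and 1 means positive.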

## Building and training the models

In [19]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

In [None]:
# binary classifiers
# GradientBoostingClassifier
gradient_boost = GradientBoostingClassifier()
gradient_boost.fit(X_train_tf, Y_train_le)
Y_predict_gradient_boost = gradient_boost.predict(X_test_tf)
print('Gradient Boosting Classifier DONE!')

# SVC
svc_model = SVC(gamma='auto', kernel='sigmoid', C=1.8, probability=True)
svc_model.fit(X_train_tf, Y_train_le)
Y_predict_svm = svc_model.predict(X_test_tf)
print('Support Vector Machine(SVM) DONE!')

# RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=10)
random_forest.fit(X_train_tf, Y_train_le)
Y_predict_random_forest = random_forest.predict(X_test_tf)
print('Random Forest Classifier DONE!')

# KNeighborsClassifier
k_neighbors = KNeighborsClassifier()
k_neighbors.fit(X_train_tf, Y_train_le)
Y_predict_k_neighbors = k_neighbors.predict(X_test_tf)
print('K Nearest Neighbor Classifier DONE!')

# LogisticRegression
logistic_regression = LogisticRegression(solver='lbfgs', penalty='l2', C=1.5)
logistic_regression.fit(X_train_tf, Y_train_le)
Y_predict_logistic_regression = logistic_regression.predict(X_test_tf)
print('Logistic Regression DONE!')

In [None]:
print('Gradient Boosting Classifier:  ', metrics.accuracy_score(Y_test_le, Y_predict_gradient_boost))
print('Support Vector Machine(SVM):   ', metrics.accuracy_score(Y_test_le, Y_predict_svm))
print('Random Forest Classifier:      ', metrics.accuracy_score(Y_test_le, Y_predict_random_forest))
print('K Nearest Neighbor Classifier: ', metrics.accuracy_score(Y_test_le, Y_predict_k_neighbors))
print('Logistic Regression:           ', metrics.accuracy_score(Y_test_le, Y_predict_logistic_regression))

## Model evaluation

### Support Vector Machines

In [None]:
svm_svc_conf_matrix = metrics.confusion_matrix(Y_test_le, Y_predict_svm)
sns.heatmap(svm_svc_conf_matrix, annot=True,  fmt='');
title = 'SVM'
plt.title(title);

### Random Forest

In [None]:
random_forest_conf_matrix = metrics.confusion_matrix(Y_test_le, Y_predict_random_forest)
sns.heatmap(random_forest_conf_matrix, annot=True,  fmt='');
title = 'Random Forest'
plt.title(title);

### Logistic Regression

In [None]:
logistic_regression_conf_matrix = metrics.confusion_matrix(Y_test_le, Y_predict_logistic_regression)
sns.heatmap(logistic_regression_conf_matrix, annot=True,  fmt='');
title = 'Logistic Regression'
plt.title(title);
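
When reading the heatmaps above, recall that `metrics.confusion_matrix` puts true labels on the rows and predicted labels on the columns. A tiny example with invented labels:

```python
from sklearn import metrics

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

cm = metrics.confusion_matrix(y_true, y_pred)
# For binary labels, ravel() yields the cells in this order.
tn, fp, fn, tp = (int(v) for v in cm.ravel())

print(cm.tolist())     # [[1, 1], [0, 2]]
print(tn, fp, fn, tp)  # 1 1 0 2
```

Here the top-right cell is a negative review predicted as positive, and the bottom-left cell is a positive review predicted as negative.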

## Classification report

In [None]:
print('Support vector machine(SVM):\n {}\n'.format(metrics.classification_report(Y_test_le, Y_predict_svm)))
print('Random Forest Classifier:\n {}\n'.format(metrics.classification_report(Y_test_le, Y_predict_random_forest)))
print('Logistic Regression:\n {}\n'.format(metrics.classification_report(Y_test_le, Y_predict_logistic_regression)))
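
The per-class columns in `classification_report` follow directly from the confusion-matrix counts. A minimal sketch of the arithmetic, using hypothetical counts that do not come from any model in this notebook:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts for illustration only.
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.80 recall=0.67 f1=0.73
```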

## Final model selection

In [17]:
X_train_final = tfidf.fit_transform(X)
Y_train_final = le.fit_transform(list(Y))

print(X_train_final.shape)

(50000, 2000)


In [20]:
lrc = LogisticRegression(solver='lbfgs', penalty='l2', C=1.5)
lrc.fit(X_train_final, Y_train_final)

LogisticRegression(C=1.5)
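
One alternative worth considering is to bundle the vectorizer and the classifier in a scikit-learn `Pipeline`, so the vocabulary and the model are always fit together and the fitted object accepts raw text. The sketch below uses an invented four-document corpus; whether the deployment runtime accepts such a pipeline with text input is an assumption that would need to be verified.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus for illustration: 1 = positive, 0 = negative.
toy_texts = ["love great film", "great act love", "bad bore plot", "bore bad act"]
toy_labels = [1, 1, 0, 0]

text_clf = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("clf", LogisticRegression(C=1.5)),
])
text_clf.fit(toy_texts, toy_labels)

toy_predictions = text_clf.predict(["great film", "bad plot"]).tolist()
print(toy_predictions)  # [1, 0]
```

With this layout there is no separately stored feature matrix that can drift out of sync with the model.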

## Deploying to Watson Machine Learning



To authenticate with Watson Machine Learning on IBM Cloud, you need the api_key and location of your service.

You can obtain them with the [IBM Cloud CLI](https://cloud.ibm.com/docs/cli/index.html) or directly from the IBM Cloud portal.

Using the IBM Cloud CLI:

```
ibmcloud login
ibmcloud iam api-key-create API_KEY_NAME
```

NOTE: You can find the service URL in the [Endpoint URLs section of the Watson Machine Learning docs](https://cloud.ibm.com/apidocs/machine-learning).

In [51]:
api_key = 'YOUR_API_KEY'
location = 'YOUR_LOCATION'  # passed as the service "url" below, e.g. 'https://us-south.ml.cloud.ibm.com'

In [22]:
wml_credentials = {
    "apikey": api_key,
    "url": location
}
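
To avoid leaving an API key in a shared notebook, the same dictionary can be built from environment variables. This is only a sketch: the helper `load_wml_credentials` and the variable names `WML_API_KEY` and `WML_URL` are hypothetical and are not part of the Watson Machine Learning client.

```python
import os

def load_wml_credentials(environ=os.environ):
    """Build the credentials dict from environment variables (names are hypothetical)."""
    try:
        return {"apikey": environ["WML_API_KEY"], "url": environ["WML_URL"]}
    except KeyError as exc:
        raise RuntimeError(f"missing environment variable: {exc.args[0]}") from exc

# Demo with a stand-in mapping; real code would call load_wml_credentials() with no argument.
demo_env = {
    "WML_API_KEY": "YOUR_API_KEY",
    "WML_URL": "https://us-south.ml.cloud.ibm.com",
}
wml_credentials_demo = load_wml_credentials(demo_env)
print(sorted(wml_credentials_demo))  # ['apikey', 'url']
```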

### Installing the Watson Machine Learning library

NOTE: The documentation is available [here](http://ibm-wml-api-pyclient.mybluemix.net/)

In [23]:
!pip install -U ibm-watson-machine-learning

Requirement already up-to-date: ibm-watson-machine-learning in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (1.0.45)


In [24]:
from ibm_watson_machine_learning import APIClient

client = APIClient(wml_credentials)
print(client.version)

1.0.45


### Creating the deployment space

First, create a deployment space that will host the model. If you have not created one yet, follow these steps:

    Click New Deployment Space
    Create a new empty space
    Select Cloud Object Storage
    Select Watson Machine Learning and click Create
    Copy the space_id and paste it below

In [25]:
space_id = 'YOUR_SPACE_ID'

In [26]:
client.spaces.list(limit=10)

------------------------------------  ----------------------------------  ------------------------
ID                                    NAME                                CREATED
2321bbed-e41c-42e4-a03e-68be4655a18a  imdb-sentiment-analyses-deployment  2021-02-02T17:40:39.531Z
------------------------------------  ----------------------------------  ------------------------


In [27]:
client.set.default_space(space_id)

'SUCCESS'

In [28]:
software_spec_uid = client.software_specifications.get_id_by_name("default_py3.7")
metadata = {
            client.repository.ModelMetaNames.NAME: 'Logistic Regression model to predict IMDB reviews',
            client.repository.ModelMetaNames.TYPE: 'scikit-learn_0.23',
            client.repository.ModelMetaNames.SOFTWARE_SPEC_UID: software_spec_uid
}

published_model = client.repository.store_model(
    model=lrc,
    meta_props=metadata)

In [29]:
published_model_uid = client.repository.get_model_uid(published_model)
model_details = client.repository.get_details(published_model_uid)
print(json.dumps(model_details, indent=2))

{
  "entity": {
    "software_spec": {
      "id": "e4429883-c883-42b6-87a8-f419d64088cd",
      "name": "default_py3.7"
    },
    "type": "scikit-learn_0.23"
  },
  "metadata": {
    "created_at": "2021-02-04T18:11:21.270Z",
    "id": "680cc61b-ec5a-4d79-812d-d2fefd809b7b",
    "modified_at": "2021-02-04T18:11:23.700Z",
    "name": "Logistic Regression model to predict IMDB reviews",
    "owner": "IBMid-5500082K8Y",
    "space_id": "2321bbed-e41c-42e4-a03e-68be4655a18a"
  },
  "system": {
  }
}


In [30]:
client.repository.list_models()

------------------------------------  -------------------------------------------------  ------------------------  -----------------
ID                                    NAME                                               CREATED                   TYPE
680cc61b-ec5a-4d79-812d-d2fefd809b7b  Logistic Regression model to predict IMDB reviews  2021-02-04T18:11:21.002Z  scikit-learn_0.23
------------------------------------  -------------------------------------------------  ------------------------  -----------------


In [31]:
# client.repository.delete('GUID of stored model')

In [32]:
metadata = {
    client.deployments.ConfigurationMetaNames.NAME: "Deployment of IMDB reviews",
    client.deployments.ConfigurationMetaNames.ONLINE: {}
}

created_deployment = client.deployments.create(published_model_uid, meta_props=metadata)



#######################################################################################

Synchronous deployment creation for uid: '680cc61b-ec5a-4d79-812d-d2fefd809b7b' started

#######################################################################################


initializing
ready


------------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_uid='caaf0bf6-c8cc-41cf-8e32-f62c64c7e314'
------------------------------------------------------------------------------------------------




In [33]:
# Get deployment UID and show details on the deployment
deployment_uid = client.deployments.get_uid(created_deployment)
client.deployments.get_details(deployment_uid)

{'entity': {'asset': {'id': '680cc61b-ec5a-4d79-812d-d2fefd809b7b'},
  'custom': {},
  'deployed_asset_type': 'model',
  'hardware_spec': {'id': 'Not_Applicable', 'name': 'S', 'num_nodes': 1},
  'name': 'Deployment of IMDB reviews',
  'online': {},
  'space_id': '2321bbed-e41c-42e4-a03e-68be4655a18a',
  'status': {'online_url': {'url': 'https://us-south.ml.cloud.ibm.com/ml/v4/deployments/caaf0bf6-c8cc-41cf-8e32-f62c64c7e314/predictions'},
   'state': 'ready'}},
 'metadata': {'created_at': '2021-02-04T18:11:26.957Z',
  'id': 'caaf0bf6-c8cc-41cf-8e32-f62c64c7e314',
  'modified_at': '2021-02-04T18:11:26.957Z',
  'name': 'Deployment of IMDB reviews',
  'owner': 'IBMid-5500082K8Y',
  'space_id': '2321bbed-e41c-42e4-a03e-68be4655a18a'}}

In [34]:
client.deployments.list()

------------------------------------  --------------------------  -----  ------------------------
GUID                                  NAME                        STATE  CREATED
caaf0bf6-c8cc-41cf-8e32-f62c64c7e314  Deployment of IMDB reviews  ready  2021-02-04T18:11:26.957Z
------------------------------------  --------------------------  -----  ------------------------


In [35]:
#client.deployments.delete('GUID of deployed model')

## Scoring the deployed model

Now we send data to the web service using the WML score method.

In [36]:
# get scoring end point
scoring_endpoint = client.deployments.get_scoring_href(created_deployment)
print(scoring_endpoint)

https://us-south.ml.cloud.ibm.com/ml/v4/deployments/caaf0bf6-c8cc-41cf-8e32-f62c64c7e314/predictions


In [37]:
# add some test data
scoring_payload = {"input_data": [
    {'values': X_test_tf.toarray()
    }]}

In [38]:
# score the model
predictions = client.deployments.score(deployment_uid, scoring_payload)
print('prediction',json.dumps(predictions, indent=2))

prediction {
  "predictions": [
    {
      "fields": [
        "prediction",
        "probability"
      ],
      "values": [
        [
          1,
          [
            0.23422804618282556,
            0.7657719538171744
          ]
        ],
        [
          1,
          [
            0.46593115935913687,
            0.5340688406408631
          ]
        ],
        [
          0,
          [
            0.7542509694893551,
            0.24574903051064492
          ]
        ],
        [
          0,
          [
            0.6486852599552931,
            0.35131474004470686
          ]
        ],
        [
          1,
          [
            0.454181147575738,
            0.545818852424262
          ]
        ],
        [
          0,
          [
            0.8713206086547348,
            0.1286793913452652
          ]
        ],
        [
          0,
          [
            0.9954243892257998,
            0.004575610774200209
          ]
        ],
        [
          0,

In [50]:
Y_predict_final_model = []
for y in predictions['predictions'][0]['values']:
    Y_predict_final_model.append(y[0])
    
print('Final Model WML:\n {}\n'.format(metrics.classification_report(Y_test_le, Y_predict_final_model)))

Final Model WML:
               precision    recall  f1-score   support

           0       0.56      0.63      0.60      7500
           1       0.58      0.51      0.54      7500

    accuracy                           0.57     15000
   macro avg       0.57      0.57      0.57     15000
weighted avg       0.57      0.57      0.57     15000
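
A possible explanation for the low scores above, offered as a hypothesis rather than a confirmed diagnosis: `X_test_tf` was produced by `tfidf` in cell 15, but cell 17 later refit the same `tfidf` object on the full dataset before `lrc` was trained. Refitting can change which n-grams are kept and which column each one occupies, so the columns of `X_test_tf` may no longer line up with the features `lrc` learned. The toy corpus below (invented for illustration) shows how a refit shifts column indices:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["great film", "boring plot"]
all_docs = train_docs + ["awful acting"]

vec = TfidfVectorizer()

vec.fit(train_docs)
vocab_before = dict(vec.vocabulary_)  # term -> column index after the first fit

vec.fit(all_docs)                     # refit on a larger corpus
vocab_after = dict(vec.vocabulary_)   # term -> column index after the refit

col_before = int(vocab_before["great"])
col_after = int(vocab_after["great"])
print(col_before, col_after)  # 2 4
```

If that is the cause here, transforming `X_test` again with the refit vectorizer before scoring would realign the columns. Note also that the final model was trained on all 50,000 reviews, so the test reviews are no longer held out in this evaluation.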


