# IMDB Reviews Sentiment Analysis

This notebook uses the Kaggle dataset "IMDB Dataset of 50K Movie Reviews" (https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews).

1. Import the Dataset
2. Analyze the Data
3. Prepare the Data to Build the Model
4. Create the Training and Test Datasets
5. Train the Model Using Different Algorithms
6. Model Evaluation
7. Best Model Selection
8. Deploy the Model to Watson Machine Learning
9. Final Model Evaluation

In [1]:
!pip install nltk

Collecting nltk
  Downloading nltk-3.7-py3-none-any.whl (1.5 MB)
Installing collected packages: nltk
Successfully installed nltk-3.7


In [2]:
%matplotlib inline 

import pandas as pd
import numpy as np
import matplotlib as mlp
import matplotlib.pyplot as plt
import seaborn as sns
import json

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import preprocessing
from sklearn import tree
from sklearn import svm
from sklearn import ensemble
from sklearn import neighbors
from sklearn import linear_model
from sklearn import metrics

In [3]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /home/wsuser/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /home/wsuser/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# 1 - Import the Dataset

In [4]:
# The code was removed by IBM Watson Studio for sharing.

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


# 2 - Analyze the Data

In [5]:
df = df_data_1
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [6]:
df.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


# 3 - Prepare the Data to Build the Model

The data is prepared using the following text feature engineering techniques:

    1. Tokenization
    2. Stop-word removal
    3. Stemming
    4. Rejoining the tokens into a single string
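The four steps above can be sketched on a single sentence without NLTK. This is only an illustration under stated assumptions: `STOP_WORDS` is a tiny hypothetical stop list, the regex tokenizer stands in for `word_tokenize`, and `naive_stem` is a crude suffix stripper, not the Porter algorithm used in this notebook.

```python
import re

# Hypothetical, deliberately tiny stop list (NLTK's English list is much longer).
STOP_WORDS = {"a", "the", "was", "this", "of", "and", "is"}

def tokenize(text):
    """Keep alphabetic runs only, similar in spirit to filtering with str.isalpha()."""
    return re.findall(r"[A-Za-z]+", text)

def remove_stops(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Crude stand-in for a stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def tidy(text):
    """Tokenize, remove stop words, stem, and rejoin into one string."""
    tokens = tokenize(text)
    tokens = remove_stops(tokens)
    tokens = [naive_stem(t) for t in tokens]
    return " ".join(tokens)

example = tidy("this was a wonderful production and the actors performed well")
print(example)
```

The regex tokenizer and the suffix stripper will not reproduce NLTK's output exactly; they only show how the stages chain together.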

In [7]:
stop_words = stopwords.words('english')
porter_stemmer = PorterStemmer()

In [8]:
def identify_tokens(row):
    # Tokenize the raw review and keep alphabetic tokens only.
    source = row['review']
    tokens = word_tokenize(source)
    token_words = [w for w in tokens if w.isalpha()]
    return token_words

In [9]:
def remove_stops(row):
    # Drop tokens found in the NLTK English stop-word list.
    source_tokenization = row['text1']
    stop = [w for w in source_tokenization if w not in stop_words]
    return stop

In [10]:
def stem_porter(row):
    # Reduce each token to its Porter stem.
    my_list = row['text1']
    stemmed_list = [porter_stemmer.stem(word) for word in my_list]
    return stemmed_list

In [11]:
def rejoin_words(row):
    # Join the processed tokens back into a single string.
    my_list = row['text1']
    joined_words = " ".join(my_list)
    return joined_words

In [12]:
def pre_processing(df):
    print('Tokenization')
    df['text1'] = df.apply(identify_tokens, axis=1)
    print('Remove stop words')
    df['text1'] = df.apply(remove_stops, axis=1)
    print('Stemming')
    df['text1'] = df.apply(stem_porter, axis=1)
    print('Rejoin words')
    df['tidy_text'] = df.apply(rejoin_words, axis=1)
    print('DONE!')
    
    return df

In [13]:
df = pre_processing(df)

df['tidy_text'] = df['tidy_text'].str.lower()
df.head()

Tokenization
Remove stop words
Stemming
Rejoin words


Unnamed: 0,review,sentiment,text1,tidy_text
0,One of the other reviewers has mentioned that ...,positive,"[one, review, mention, watch, oz, episod, hook...",one review mention watch oz episod hook they r...
1,A wonderful little production. <br /><br />The...,positive,"[a, wonder, littl, product, br, br, the, film,...",a wonder littl product br br the film techniqu...
2,I thought this was a wonderful way to spend ti...,positive,"[i, thought, wonder, way, spend, time, hot, su...",i thought wonder way spend time hot summer wee...
3,Basically there's a family where a little boy ...,negative,"[basic, famili, littl, boy, jake, think, zombi...",basic famili littl boy jake think zombi closet...
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"[petter, mattei, love, time, money, visual, st...",petter mattei love time money visual stun film...
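Tokens such as `a`, `the` and `i` survive in the output above. One likely explanation is that NLTK's English stop list is lowercase, so capitalized tokens like `A` or `The` pass the filter and are only lowercased later by the stemmer. A minimal sketch of a case-insensitive filter follows; the small `stop_words` set is a hypothetical stand-in for `stopwords.words('english')`.

```python
# Hypothetical subset of a lowercase stop list.
stop_words = {"a", "the", "i", "this", "was"}

tokens = ["A", "wonderful", "little", "production", "The", "filming"]

# Case-sensitive check, as in remove_stops: capitalized stop words slip through.
case_sensitive = [w for w in tokens if w not in stop_words]

# Case-insensitive check: compare the lowercased token against the stop list.
case_insensitive = [w for w in tokens if w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Lowercasing the review text before tokenization would be another way to get the same effect.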


# 4 - Create the Training and Test Datasets
NOTE: The test (30%) and training (70%) sets are split in a balanced (stratified) way.

In [16]:
X = df['tidy_text']
Y = df['sentiment']

print(X.shape)
print(Y.shape)

(50000,)
(50000,)


In [17]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, stratify=Y)
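To make the effect of `stratify=Y` concrete, here is a small plain-Python sketch of a stratified split. It is not scikit-learn's implementation; `stratified_split` is a hypothetical helper that shuffles the indices of each class separately and sends the same fraction of every class to the test set.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx) keeping each class's share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels: 10 positive and 10 negative reviews.
labels = ["positive"] * 10 + ["negative"] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.3)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts)
print(test_counts)
```

Both parts keep the 50/50 class balance of the toy data, which is the property the stratified split is meant to preserve for the sentiment labels.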

NOTE: Machine learning and deep learning models expect numeric values as the input "X". The TF-IDF text feature engineering (TFE) technique is used to convert the text into numeric values.
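As a rough illustration of what TF-IDF computes, the sketch below weights terms by hand on three toy documents. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1, a sublinear term frequency of 1 + ln(tf), and L2 normalization, and it uses single words rather than the 2- and 3-grams configured in this notebook.

```python
import math
from collections import Counter

docs = [
    "great film great cast",
    "bad film",
    "great story",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    """Return a term -> weight mapping for one document."""
    counts = Counter(tokens)
    weights = {}
    for term, tf in counts.items():
        sublinear_tf = 1.0 + math.log(tf)                  # dampened term count
        idf = math.log((1 + n_docs) / (1 + df[term])) + 1  # smoothed idf
        weights[term] = sublinear_tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 normalization
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the second document, `bad` appears in fewer documents than `film`, so it receives the larger weight.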

In [18]:
tfidf = TfidfVectorizer(max_features=2000, ngram_range=(2,3), sublinear_tf=True)

X_train_tf = tfidf.fit_transform(X_train)
X_test_tf = tfidf.transform(X_test)

print(Y.value_counts().shape)
print(X_train_tf.shape)

(2,)
(30000, 2000)


In [19]:
le = preprocessing.LabelEncoder()

Y_train_le = le.fit_transform(list(Y_train))
Y_test_le = le.transform(list(Y_test))
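`LabelEncoder` assigns integer codes to the sorted class labels, so with this dataset `negative` should map to 0 and `positive` to 1. That is how the rows `0` and `1` in the classification reports can be read. A plain-Python sketch of the same mapping (not the scikit-learn implementation):

```python
labels = ["positive", "negative", "positive", "negative", "negative"]

# Sorted unique labels play the role of LabelEncoder.classes_.
classes = sorted(set(labels))
to_code = {label: code for code, label in enumerate(classes)}

encoded = [to_code[label] for label in labels]
decoded = [classes[code] for code in encoded]  # analogous to inverse_transform

print(classes)
print(encoded)
```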

# 5 - Train the Model Using Different Algorithms

In [20]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

In [21]:
# Binary classifiers
# GradientBoostingClassifier
gradient_boost = GradientBoostingClassifier()
gradient_boost.fit(X_train_tf, Y_train_le)
Y_predict_gradient_boost = gradient_boost.predict(X_test_tf)
print('Gradient Boosting Classifier DONE!')

# SVC
svc_model = SVC(gamma='auto', kernel='sigmoid', C=1.8, probability=True)
svc_model.fit(X_train_tf, Y_train_le)
Y_predict_svm = svc_model.predict(X_test_tf)
print('Support Vector Machine(SVM) DONE!')

# RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=10)
random_forest.fit(X_train_tf, Y_train_le)
Y_predict_random_forest = random_forest.predict(X_test_tf)
print('Random Forest Classifier DONE!')

# KNeighborsClassifier
k_neighbors = KNeighborsClassifier()
k_neighbors.fit(X_train_tf, Y_train_le)
Y_predict_k_neighbors = k_neighbors.predict(X_test_tf)
print('K Nearest Neighbor Classifier DONE!')

# LogisticRegression
logistic_regression = LogisticRegression(solver='lbfgs', penalty='l2', C=1.5)
logistic_regression.fit(X_train_tf, Y_train_le)
Y_predict_logistic_regression = logistic_regression.predict(X_test_tf)
print('Logistic Regression DONE!')

Gradient Boosting Classifier DONE!
Support Vector Machine(SVM) DONE!
Random Forest Classifier DONE!
K Nearest Neighbor Classifier DONE!
Logistic Regression DONE!


In [22]:
print('Gradient Boosting Classifier:  ', metrics.accuracy_score(Y_test_le, Y_predict_gradient_boost))
print('Support Vector Machine(SVM):   ', metrics.accuracy_score(Y_test_le, Y_predict_svm))
print('Random Forest Classifier:      ', metrics.accuracy_score(Y_test_le, Y_predict_random_forest))
print('K Nearest Neighbor Classifier: ', metrics.accuracy_score(Y_test_le, Y_predict_k_neighbors))
print('Logistic Regression:           ', metrics.accuracy_score(Y_test_le, Y_predict_logistic_regression))

Gradient Boosting Classifier:   0.70165
Support Vector Machine(SVM):    0.74775
Random Forest Classifier:       0.72405
K Nearest Neighbor Classifier:  0.5168
Logistic Regression:            0.78075


# 6 - Model Evaluation

## 6.1 - Classification summary
NOTE: Only models with accuracy >= 0.70 are reported here.

In [26]:
print('Gradient Boosting Classifier:\n {}\n'.format(metrics.classification_report(Y_test_le, Y_predict_gradient_boost)))
print('Support vector machine(SVM):\n {}\n'.format(metrics.classification_report(Y_test_le, Y_predict_svm)))
print('Random Forest Classifier:\n {}\n'.format(metrics.classification_report(Y_test_le, Y_predict_random_forest)))
print('Logistic Regression:\n {}\n'.format(metrics.classification_report(Y_test_le, Y_predict_logistic_regression)))

Support vector machine(SVM):
               precision    recall  f1-score   support

           0       0.78      0.68      0.73     10000
           1       0.72      0.81      0.76     10000

    accuracy                           0.75     20000
   macro avg       0.75      0.75      0.75     20000
weighted avg       0.75      0.75      0.75     20000


Random Forest Classifier:
               precision    recall  f1-score   support

           0       0.70      0.79      0.74     10000
           1       0.76      0.66      0.71     10000

    accuracy                           0.72     20000
   macro avg       0.73      0.72      0.72     20000
weighted avg       0.73      0.72      0.72     20000


Logistic Regression:
               precision    recall  f1-score   support

           0       0.79      0.77      0.78     10000
           1       0.78      0.79      0.78     10000

    accuracy                           0.78     20000
   macro avg       0.78      0.78      0.78    
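Each f1-score in these reports is the harmonic mean of precision and recall. The check below recomputes it from the rounded SVM values shown above; because the inputs are already rounded, agreement is only expected to two decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision/recall taken from the SVM report above.
svm_class_0 = f1_score(0.78, 0.68)
svm_class_1 = f1_score(0.72, 0.81)

print(round(svm_class_0, 2))
print(round(svm_class_1, 2))
```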

## 6.3 - Logistic Regression
NOTE: Confusion matrix of the best model

In [None]:
logistic_regression_conf_matrix = metrics.confusion_matrix(Y_test_le, Y_predict_logistic_regression)
sns.heatmap(logistic_regression_conf_matrix, annot=True, fmt='');
title = 'Logistic Regression'
plt.title(title);
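For readers unfamiliar with the layout of the matrix being plotted, here is a small hand-rolled version on toy labels. It follows the convention of rows for true classes and columns for predicted classes; the `confusion_matrix` function below is a simplified stand-in, not the scikit-learn implementation.

```python
def confusion_matrix(y_true, y_pred, n_classes=2):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Toy encoded labels: 0 = negative, 1 = positive.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
(tn, fp), (fn, tp) = cm

# Accuracy is the share of predictions on the diagonal.
accuracy = (tn + tp) / len(y_true)
print(cm)
print(accuracy)
```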

# 7 - Best Model Selection

In [27]:
X_train_final = tfidf.fit_transform(X)
Y_train_final = le.fit_transform(list(Y))

print(X_train_final.shape)

(50000, 2000)


In [28]:
lrc = LogisticRegression(solver='lbfgs', penalty='l2', C=1.5)
lrc.fit(X_train_final, Y_train_final)

LogisticRegression(C=1.5)
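Calling `tfidf.fit_transform(X)` refits the vectorizer, so the selected n-grams and their column positions may differ from the ones learned on `X_train` alone. A matrix built earlier, such as `X_test_tf`, would then no longer line up with a model trained on `X_train_final`. The sketch below mimics only the vocabulary selection step (top-k terms by count, then sorted alphabetically); `top_k_vocabulary` is a hypothetical helper, not scikit-learn code.

```python
from collections import Counter

def top_k_vocabulary(docs, k):
    """Map the k most frequent terms to column indices in alphabetical order."""
    counts = Counter(term for doc in docs for term in doc.split())
    # Break ties alphabetically so the result is deterministic.
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    return {term: col for col, term in enumerate(sorted(ranked[:k]))}

train_docs = ["good film", "good cast", "bad film"]
extra_docs = ["awful plot", "awful plot twist", "awful ending"]

vocab_train = top_k_vocabulary(train_docs, k=2)
vocab_full = top_k_vocabulary(train_docs + extra_docs, k=2)

print(vocab_train)
print(vocab_full)
```

In this toy case `film` moves from column 0 to column 1 and `good` drops out entirely. If something similar happened here, it would be consistent with the lower scores in section 9; re-transforming the test text with the refit vectorizer (`tfidf.transform(X_test)`) before scoring would keep the columns aligned. Note also that the final model is trained on all 50,000 reviews, so test rows drawn from `X` are no longer unseen data.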

# 8 - Deploy the Model to Watson Machine Learning

To authenticate with Watson Machine Learning on IBM Cloud, you need the api_key and location of your service.

You can use the [IBM Cloud CLI](https://cloud.ibm.com/docs/cli/index.html) or the IBM Cloud portal directly.

Using the IBM Cloud CLI:

```
ibmcloud login
ibmcloud iam api-key-create API_KEY_NAME
```

NOTE: You can find the service URL in the [Endpoint URLs section of the Watson Machine Learning docs](https://cloud.ibm.com/apidocs/machine-learning).

In [29]:
api_key = 'API_KEY'
location = 'LOCATION'

In [30]:
wml_credentials = {
    "apikey": api_key,
    "url": location
}
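Rather than pasting the key into the notebook, the credentials could be read from environment variables. This is a sketch only: `WML_API_KEY` and `WML_URL` are hypothetical variable names chosen for the example, and `load_wml_credentials` is not part of the Watson Machine Learning client.

```python
import os

def load_wml_credentials(env=os.environ):
    """Build the credentials dict from environment variables.

    WML_API_KEY and WML_URL are hypothetical names used for this sketch.
    """
    api_key = env.get("WML_API_KEY")
    url = env.get("WML_URL")
    if not api_key or not url:
        raise RuntimeError("Set WML_API_KEY and WML_URL before running this cell")
    return {"apikey": api_key, "url": url}

# Example with an explicit mapping instead of the real environment.
example = load_wml_credentials({
    "WML_API_KEY": "API_KEY",
    "WML_URL": "https://us-south.ml.cloud.ibm.com",
})
print(sorted(example))
```

The returned dictionary has the same two keys as `wml_credentials` above, so it could be passed to the client in the same way.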

## 8.1 - Installing the Watson Machine Learning library

NOTE: The documentation can be found [here](http://ibm-wml-api-pyclient.mybluemix.net/)

In [31]:
!pip install -U ibm-watson-machine-learning

Collecting ibm-watson-machine-learning
  Downloading ibm_watson_machine_learning-1.0.206-py3-none-any.whl (1.7 MB)
Installing collected packages: ibm-watson-machine-learning
  Attempting uninstall: ibm-watson-machine-learning
    Found existing installation: ibm-watson-machine-learning 1.0.204
    Uninstalling ibm-watson-machine-learning-1.0.204:
      Successfully uninstalled ibm-watson-machine-learning-1.0.204
Successfully installed ibm-watson-machine-learning-1.0.206


In [32]:
from ibm_watson_machine_learning import APIClient

client = APIClient(wml_credentials)
print(client.version)

1.0.206


## 8.2 - Creating the deployment space

Create a deployment space in the Watson Studio UI; it will be used to deploy the model.

    1. Click "New Deployment Space"
    2. Select Cloud Object Storage
    3. Select Watson Machine Learning and click "Create"
    4. Copy the "space_id" and paste it below

In [33]:
space_id = 'SPACE_ID'

In [34]:
client.spaces.list(limit=10)

------------------------------------  -----------  ------------------------
ID                                    NAME         CREATED
3d781a74-42e1-466d-9de5-83b16953498d  imdb-kaggle  2022-03-28T18:24:28.322Z
------------------------------------  -----------  ------------------------


In [35]:
client.set.default_space(space_id)

'SUCCESS'

In [42]:
software_spec_uid = client.software_specifications.get_id_by_name("runtime-22.1-py3.9")
metadata = {
            client.repository.ModelMetaNames.NAME: 'Logistic Regression model to predict IMDB reviews',
            client.repository.ModelMetaNames.TYPE: 'scikit-learn_1.0',
            client.repository.ModelMetaNames.SOFTWARE_SPEC_UID: software_spec_uid
}

published_model = client.repository.store_model(
    model=lrc,
    meta_props=metadata)

In [43]:
published_model_uid = client.repository.get_model_id(published_model)
model_details = client.repository.get_details(published_model_uid)
print(json.dumps(model_details, indent=2))

{
  "entity": {
    "hybrid_pipeline_software_specs": [],
    "software_spec": {
      "id": "12b83a17-24d8-5082-900f-0ab31fbfd3cb",
      "name": "runtime-22.1-py3.9"
    },
    "type": "scikit-learn_1.0"
  },
  "metadata": {
    "created_at": "2022-05-04T13:22:49.464Z",
    "id": "22256147-d87a-4f65-ab9a-068269b9bb90",
    "modified_at": "2022-05-04T13:22:53.017Z",
    "name": "Logistic Regression model to predict IMDB reviews",
    "owner": "IBMid-5500082K8Y",
    "resource_key": "c7dd49f1-7471-4c28-b969-59eb9c4c38bc",
    "space_id": "3d781a74-42e1-466d-9de5-83b16953498d"
  },
  "system": {
  }
}


In [44]:
client.repository.list_models()

------------------------------------  -------------------------------------------------  ------------------------  -----------------
ID                                    NAME                                               CREATED                   TYPE
22256147-d87a-4f65-ab9a-068269b9bb90  Logistic Regression model to predict IMDB reviews  2022-05-04T13:22:49.002Z  scikit-learn_1.0
78a3f18d-3d11-44e1-9818-64dd0ef24b99  Logistic Regression model to predict IMDB reviews  2022-05-04T13:20:09.002Z  scikit-learn_1.0
992c8b1c-0a79-4606-bd7b-003e89d8953d  imdb AutoAI - P3 Logistic Regression               2022-03-30T14:55:22.002Z  wml-hybrid_0.1
c09b5466-7ed3-4916-a2da-cd0baf8b5856  Logistic Regression model to predict IMDB reviews  2022-03-28T19:09:36.002Z  scikit-learn_0.23
------------------------------------  -------------------------------------------------  ------------------------  -----------------


In [45]:
# client.repository.delete('ID of stored model')

In [46]:
metadata = {
    client.deployments.ConfigurationMetaNames.NAME: "Deployment of IMDB reviews",
    client.deployments.ConfigurationMetaNames.ONLINE: {}
}

created_deployment = client.deployments.create(published_model_uid, meta_props=metadata)



#######################################################################################

Synchronous deployment creation for uid: '22256147-d87a-4f65-ab9a-068269b9bb90' started

#######################################################################################


initializing
Note: online_url is deprecated and will be removed in a future release. Use serving_urls instead.

ready


------------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_uid='c993b5d7-4bda-4a4c-9979-0c712036d200'
------------------------------------------------------------------------------------------------




In [47]:
# Get deployment UID and show details on the deployment
deployment_uid = client.deployments.get_uid(created_deployment)
client.deployments.get_details(deployment_uid)

Note: online_url is deprecated and will be removed in a future release. Use serving_urls instead.


{'entity': {'asset': {'id': '22256147-d87a-4f65-ab9a-068269b9bb90'},
  'custom': {},
  'deployed_asset_type': 'model',
  'hardware_spec': {'id': 'e7ed1d6c-2e89-42d7-aed5-863b972c1d2b',
   'name': 'S',
   'num_nodes': 1},
  'name': 'Deployment of IMDB reviews',
  'online': {},
  'space_id': '3d781a74-42e1-466d-9de5-83b16953498d',
  'status': {'online_url': {'url': 'https://us-south.ml.cloud.ibm.com/ml/v4/deployments/c993b5d7-4bda-4a4c-9979-0c712036d200/predictions'},
   'serving_urls': ['https://us-south.ml.cloud.ibm.com/ml/v4/deployments/c993b5d7-4bda-4a4c-9979-0c712036d200/predictions'],
   'state': 'ready'}},
 'metadata': {'created_at': '2022-05-04T13:24:03.670Z',
  'id': 'c993b5d7-4bda-4a4c-9979-0c712036d200',
  'modified_at': '2022-05-04T13:24:03.670Z',
  'name': 'Deployment of IMDB reviews',
  'owner': 'IBMid-5500082K8Y',
  'space_id': '3d781a74-42e1-466d-9de5-83b16953498d'},
    'message': 'online_url is deprecated and will be removed in a future release. Use serving_urls instead

In [48]:
client.deployments.list()

------------------------------------  --------------------------  -----  ------------------------
GUID                                  NAME                        STATE  CREATED
c993b5d7-4bda-4a4c-9979-0c712036d200  Deployment of IMDB reviews  ready  2022-05-04T13:24:03.670Z
------------------------------------  --------------------------  -----  ------------------------


In [49]:
# client.deployments.delete('GUID of deployed model')

# 9 - Final Model Evaluation

NOTE: Testing the API created with WML.

In [50]:
# get scoring end point
scoring_endpoint = client.deployments.get_scoring_href(created_deployment)
print(scoring_endpoint)

https://us-south.ml.cloud.ibm.com/ml/v4/deployments/c993b5d7-4bda-4a4c-9979-0c712036d200/predictions


In [51]:
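# add some test data
# NOTE: X_test_tf was built by the vectorizer fitted on X_train (cell 18);
# tfidf was refit on the full dataset in section 7, so these columns may not
# match the features the deployed model was trained on.
scoring_payload = {"input_data": [
    {'values': X_test_tf.toarray()
    }]}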
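Sending the whole dense test matrix in one request can produce a very large payload. One option, sketched below under the assumption that the service accepts several smaller requests, is to split the rows into batches and build one payload per batch. `build_payloads` and `batch_size` are hypothetical names; the payload shape follows the `input_data`/`values` structure used above.

```python
def build_payloads(rows, batch_size):
    """Yield scoring payloads holding at most batch_size rows each."""
    for start in range(0, len(rows), batch_size):
        yield {"input_data": [{"values": rows[start:start + batch_size]}]}

# Toy feature rows standing in for X_test_tf.toarray().tolist().
rows = [[0.0, 0.1], [0.2, 0.0], [0.0, 0.0], [0.5, 0.5], [0.3, 0.7]]

payloads = list(build_payloads(rows, batch_size=2))
print(len(payloads))
print(payloads[-1])
```

Each payload could then be scored separately with `client.deployments.score(deployment_uid, payload)` and the returned `values` lists concatenated in order.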

In [52]:
# score the model
predictions = client.deployments.score(deployment_uid, scoring_payload)
print('prediction',json.dumps(predictions, indent=2))

prediction {
  "predictions": [
    {
      "fields": [
        "prediction",
        "probability"
      ],
      "values": [
        [
          1,
          [
            0.3604623260784797,
            0.6395376739215203
          ]
        ],
        [
          0,
          [
            0.8904179800449684,
            0.10958201995503161
          ]
        ],
        [
          1,
          [
            0.481278193748623,
            0.518721806251377
          ]
        ],
        [
          0,
          [
            0.9046420917965481,
            0.09535790820345191
          ]
        ],
        [
          0,
          [
            0.6529084398381808,
            0.34709156016181925
          ]
        ],
        [
          1,
          [
            0.22643264764480076,
            0.7735673523551992
          ]
        ],
        [
          1,
          [
            0.2693004740684488,
            0.7306995259315512
          ]
        ],
        [
          1,
 

In [53]:
# The first element of each returned row is the predicted class.
Y_predict_final_model = [y[0] for y in predictions['predictions'][0]['values']]
    
print('Final Model WML:\n {}\n'.format(metrics.classification_report(Y_test_le, Y_predict_final_model)))

Final Model WML:
               precision    recall  f1-score   support

           0       0.61      0.61      0.61     10000
           1       0.61      0.61      0.61     10000

    accuracy                           0.61     20000
   macro avg       0.61      0.61      0.61     20000
weighted avg       0.61      0.61      0.61     20000
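The scoring response printed earlier pairs each predicted class with its class probabilities. A small sketch of unpacking it into readable rows follows, using two rows copied from that output. The `class_names` list is an assumption based on `LabelEncoder` sorting the labels, so that 0 is `negative` and 1 is `positive`.

```python
# Response shaped like the scoring output, truncated to two rows.
predictions = {
    "predictions": [
        {
            "fields": ["prediction", "probability"],
            "values": [
                [1, [0.3604623260784797, 0.6395376739215203]],
                [0, [0.8904179800449684, 0.10958201995503161]],
            ],
        }
    ]
}

# Assumed mapping from LabelEncoder: sorted labels, so 0 = negative, 1 = positive.
class_names = ["negative", "positive"]

result = predictions["predictions"][0]
rows = [dict(zip(result["fields"], values)) for values in result["values"]]

for row in rows:
    label = class_names[row["prediction"]]
    # Probability assigned to the predicted class.
    confidence = row["probability"][row["prediction"]]
    print(label, round(confidence, 3))
```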


