<img src="https://github.com/pmservice/ai-openscale-tutorials/raw/master/notebooks/images/banner.png" align="left" alt="banner">

# Tutorial on generating an explanation for a text-based model on Watson OpenScale


This notebook walks through creating a text-based Watson Machine Learning model, creating a subscription, configuring explainability, and finally generating an explanation for a transaction.

### Contents
- [1. Setup](#setup)
- [2. Create and deploy a text-based model](#)
- [3. Subscriptions](#subscription)
- [4. Explainability](#explainability)

**Note**: If you use Watson Studio, try running this notebook on at least the 'Default Python 3.5 XS' runtime for faster results.

<a id="setup"></a>
## 1. Setup

### 1.1 Install the Watson OpenScale and WML packages

In [1]:
!pip install --upgrade ibm-ai-openscale --no-cache | tail -n 1

Successfully installed ibm-ai-openscale-2.2.1


In [2]:
!pip install --upgrade watson-machine-learning-client --no-cache | tail -n 1

Successfully installed watson-machine-learning-client-1.0.378


Note: Restart the kernel to make sure the newly installed libraries are used.

### 1.2 Configure credentials

Obtain the Watson OpenScale `apikey` by going to the [Bluemix console](https://console.bluemix.net/) and clicking `Manage -> Account -> Users`. Select "Platform API Keys" in the sidebar, then click the "Create" button.

Obtain the Watson OpenScale `instance_id` (guid) by going to the [cloud console](https://cloud.ibm.com/resources), clicking `Services`, and then clicking anywhere on the Watson OpenScale service tile except the service link. The GUID appears in the sidebar that opens on the right.

In [3]:
AIOS_CREDENTIALS = {
   
    "instance_guid": "",
    "apikey": "", 
    "url": ""
}
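
A blank credential is easy to miss until a client call fails later in the notebook. As a minimal sketch, the dictionary can be checked before it is used. The helper name `missing_credentials` and the example values are hypothetical and are not part of either SDK.

```python
def missing_credentials(credentials, required=("instance_guid", "apikey", "url")):
    """Return the required keys that are absent or blank in `credentials`."""
    return [key for key in required if not str(credentials.get(key) or "").strip()]


# Hypothetical example values, for illustration only.
example = {"instance_guid": "", "apikey": "my-api-key", "url": "https://example.invalid"}
print(missing_credentials(example))
```

An empty list means every required field has a value; it does not confirm that the values are valid.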

Generate or look up the WML credentials by clicking Credentials in the sidebar of the provisioned WML page.

In [4]:
WML_CREDENTIALS = {
  
}

## 2. Create and deploy a text-based model

The data set used is the UCI-ML SMS Spam Collection, available at https://archive.ics.uci.edu/ml/machine-learning-databases/00228/. It is a binary classification data set with the labels 'ham' and 'spam'.

### 2.1 Loading the training data

In [10]:
# The training data is downloaded and saved as 'SMSSpam.csv' in this step

!pip install pandas
!rm smsspamcollection.zip
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip --no-check-certificate
!unzip smsspamcollection.zip



rm: cannot remove ‘smsspamcollection.zip’: No such file or directory
--2020-04-02 07:22:45--  https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
  Issued certificate has expired.
HTTP request sent, awaiting response... 200 OK
Length: 203415 (199K) [application/x-httpd-php]
Saving to: ‘smsspamcollection.zip’


2020-04-02 07:22:46 (1.69 MB/s) - ‘smsspamcollection.zip’ saved [203415/203415]

Archive:  smsspamcollection.zip
  inflating: SMSSpamCollection       
  inflating: readme                  


In [11]:
import os
!ls

readme	SMSSpamCollection  smsspamcollection.zip


In [12]:
import pandas as pd

pd.read_csv('SMSSpamCollection',sep="\t",header=None, encoding="utf-8").to_csv("SMSSpam.csv", header=["label", "text"], sep=",", index=False)
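
The raw file has no header, and each row holds a label and a message separated by a tab. The sketch below performs the same kind of conversion on an in-memory sample using only the standard library. The function name `tsv_to_csv` and the sample rows are illustrative, and treating quote characters literally is an assumption that may differ from the pandas defaults.

```python
import csv
import io


def tsv_to_csv(tsv_text):
    """Convert header-less tab-separated (label, text) rows to CSV with a header."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["label", "text"])
    for label, text in reader:
        writer.writerow([label, text])
    return out.getvalue()


# Illustrative sample rows in the same layout as the raw file.
sample = "ham\tOk lar... Joking wif u oni...\nspam\tFree entry, text FA to 87121\n"
print(tsv_to_csv(sample))
```

Messages that contain a comma are quoted on output, so the resulting CSV still parses back into two columns.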

In [13]:
!rm SMSSpamCollection
!rm readme
!rm smsspamcollection.zip

### 2.2 Creating a model

**Note**: Skip the pyspark installation step below if you are using a Spark environment in Watson Studio.

In [14]:
!pip install pyspark==2.3.1

Collecting pyspark==2.3.1
  Downloading https://files.pythonhosted.org/packages/ee/2f/709df6e8dc00624689aa0a11c7a4c06061a7d00037e370584b9f011df44c/pyspark-2.3.1.tar.gz (211.9MB)
Collecting py4j==0.10.7 (from pyspark==2.3.1)
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Stored in directory: /home/dsxuser/.cache/pip/wheels/37/48/54/f1b63f0dbb729e20c92f1bbcf1c53c03b300e0b93ca1781526
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.7 pyspark-2.3.1


**Note**: When running this notebook locally, if the SparkSession import below fails, set the 'SPARK_HOME' environment variable to the path of the pyspark installation.
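
One way to set the variable from inside the notebook is sketched below. The helper `set_spark_home` is hypothetical, and the path is a placeholder that must be replaced with the actual pyspark installation directory.

```python
import os


def set_spark_home(path):
    """Point SPARK_HOME at `path` if it is an existing directory; return success."""
    if os.path.isdir(path):
        os.environ["SPARK_HOME"] = path
        return True
    return False


# Placeholder location; replace with the directory of your pyspark installation.
set_spark_home("/path/to/pyspark")
```

The variable should be set before `SparkSession` is imported, and the helper leaves the environment untouched when the directory does not exist.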

In [15]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv(path="SMSSpam.csv", header=True, multiLine=True, escape='"')
df.show(5, truncate = False)

+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|label|text                                                                                                                                                       |
+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|ham  |Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...                                            |
|ham  |Ok lar... Joking wif u oni...                                                                                                                              |
|spam |Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's|
|ham  |U dun say

In [16]:
train_df, test_df = df.randomSplit([0.8, 0.2], seed=12345)
print("Total count of data set: {}".format(df.count()))
print("Total count of training data set: {}".format(train_df.count()))
print("Total count of test data set: {}".format(test_df.count()))

Total count of data set: 5572
Total count of training data set: 4420
Total count of test data set: 1152
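
The counts are not exactly 80/20: 4420 of 5572 rows is about 79.3%. That is what one would expect if each row is assigned to a side independently at random, with the weights acting as probabilities rather than exact quotas. The pure-Python sketch below illustrates that idea under this assumption; it is not Spark's implementation and will not reproduce the counts above.

```python
import random


def bernoulli_split(rows, train_fraction, seed):
    """Assign each row to train with probability `train_fraction`, independently."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


train, test = bernoulli_split(range(5572), 0.8, seed=12345)
# Sizes are close to, but usually not exactly, an 80/20 split.
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, which is why the notebook passes `seed=12345`.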


In [17]:
!pip install nltk
from pyspark.ml.feature import StringIndexer, IndexToString, CountVectorizer, Tokenizer, IDF, StopWordsRemover
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline, Model
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')
stop_words = list(set(stopwords.words('english')))

stringIndexer_label = StringIndexer(inputCol="label", outputCol="label_ix").fit(df)
tokenizer = Tokenizer(inputCol="text", outputCol="words")
stopword_remover = StopWordsRemover(inputCol="words", outputCol="filtered_words").setStopWords(stop_words)
count = CountVectorizer(inputCol="filtered_words", outputCol="rawFeatures")
idf = IDF(inputCol="rawFeatures", outputCol="features")
nb = GBTClassifier(labelCol="label_ix")
labelConverter = IndexToString(inputCol="prediction", outputCol="predictionLabel", labels=stringIndexer_label.labels)



[nltk_data] Downloading package punkt to /home/dsxuser/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/dsxuser/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
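
To make the feature stages concrete, here is a plain-Python approximation of what `Tokenizer`, `StopWordsRemover`, `CountVectorizer` and `IDF` compute together. It assumes lowercase whitespace tokenization and the smoothed inverse document frequency `log((N + 1) / (df + 1))`. The function names, the two sample messages and the one-word stop list are illustrative, and the sketch returns dictionaries instead of Spark's sparse vectors.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace, so punctuation stays attached to words."""
    return text.lower().split()


def tf_idf(docs, stop_words):
    """Return one {term: count * idf} dict per document."""
    filtered = [[w for w in tokenize(d) if w not in stop_words] for d in docs]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for words in filtered for term in set(words))
    n_docs = len(docs)
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(words).items()} for words in filtered]


docs = ["SIX chances to win CASH!", "Ok lar... Joking wif u oni..."]
weights = tf_idf(docs, stop_words={"to"})
print(sorted(weights[0]))
```

Because tokens are split only on whitespace, `cash!` keeps its exclamation mark and is a different term from `cash`.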


In [18]:
pipeline = Pipeline(stages=[stringIndexer_label, tokenizer, stopword_remover, count, idf, nb, labelConverter])
model = pipeline.fit(train_df)
predictions = model.transform(test_df)
evaluator = BinaryClassificationEvaluator(labelCol="label_ix", rawPredictionCol="prediction", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)

print("Area under ROC curve = %g" % auc)

Area under ROC curve = 0.846312
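
Note that the evaluator is pointed at the `prediction` column, which holds hard 0/1 labels and not continuous scores. When scores take only two values, the area under the ROC curve reduces to the mean of the true-positive and true-negative rates, so the figure above is probably best read as a balanced accuracy. The sketch below checks that identity on made-up labels; it does not use Spark, and the data is not from the notebook.

```python
def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 1, 0, 0, 0, 0]
hard_labels = [1, 1, 0, 0, 0, 0, 1]  # made-up 0/1 predictions
tpr = 2 / 3  # 2 of 3 positives predicted as 1
tnr = 3 / 4  # 3 of 4 negatives predicted as 0
print(auc(labels, hard_labels), (tpr + tnr) / 2)
```

Evaluating on a score column such as `rawPrediction` would measure ranking quality across all thresholds and would generally give a different number.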


In [22]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient

MODEL_NAME = "Text Binary Classifier"
wml_client = WatsonMachineLearningAPIClient(WML_CREDENTIALS)

model_props = {
    wml_client.repository.ModelMetaNames.NAME: "{}".format(MODEL_NAME),
}

# publish model 
published_model_details = wml_client.repository.store_model(model=model, meta_props=model_props, training_data=train_df, pipeline=pipeline)

!rm SMSSpam.csv

rm: cannot remove ‘SMSSpam.csv’: No such file or directory


In [23]:
model_uid = wml_client.repository.get_model_uid(published_model_details)
print(model_uid)

a33e348f-8ab1-4c15-beb8-151a8e4ae6ab


### 2.3 Deploying the model

In [24]:
deployment = wml_client.deployments.create(model_uid, MODEL_NAME + " deployment")



#######################################################################################

Synchronous deployment creation for uid: 'a33e348f-8ab1-4c15-beb8-151a8e4ae6ab' started

#######################################################################################


INITIALIZING
DEPLOY_SUCCESS


------------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_uid='726e1aea-745b-462b-9cb3-8fe9a9512368'
------------------------------------------------------------------------------------------------




In [25]:
scoring_url = wml_client.deployments.get_scoring_url(deployment)
print(scoring_url)

https://us-south.ml.cloud.ibm.com/v3/wml_instances/197aa151-8d29-4a35-af88-cfc689f25d87/deployments/726e1aea-745b-462b-9cb3-8fe9a9512368/online


## 3. Subscriptions

### 3.1 Configuring AIOS

In [26]:
from ibm_ai_openscale import APIClient
from ibm_ai_openscale.engines import WatsonMachineLearningAsset

aios_client = APIClient(AIOS_CREDENTIALS)
aios_client.version

'2.2.1'

**Note**: Re-run the previous cell if it does not work the first time.

In [27]:
aios_client.data_mart.bindings.list()

0,1,2,3
197aa151-8d29-4a35-af88-cfc689f25d87,WML instance,watson_machine_learning,2020-03-28T20:29:48.546Z


### 3.2 Subscribing the asset

In [28]:
from ibm_ai_openscale.supporting_classes import *

subscription = aios_client.data_mart.subscriptions.add(WatsonMachineLearningAsset(
    model_uid,
    label_column='label',
    problem_type=ProblemType.BINARY_CLASSIFICATION,
    input_data_type=InputDataType.UNSTRUCTURED_TEXT,
    feature_columns = ["text"],
    categorical_columns = ["text"],
    prediction_column='predictionLabel',
    probability_column='probability'
))

### 3.3 Get a subscription

In [29]:
aios_client.data_mart.subscriptions.list()

0,1,2,3,4
84b03cb6-3c61-40df-aded-0921451e9bd6,Text Binary Classifier,model,197aa151-8d29-4a35-af88-cfc689f25d87,2020-04-02T08:03:39.708Z
827b02c8-3b2a-483f-ba3c-b46b0d736b44,MNIST Model,model,197aa151-8d29-4a35-af88-cfc689f25d87,2020-04-01T19:05:10.255Z
eadd87e5-de57-4a35-8ee5-0a9530080c2c,Scikit German Risk Model,model,197aa151-8d29-4a35-af88-cfc689f25d87,2020-04-01T14:03:18.183Z


In [30]:
subscription.get_details()

{'entity': {'asset': {'asset_id': 'a33e348f-8ab1-4c15-beb8-151a8e4ae6ab',
   'asset_type': 'model',
   'created_at': '2020-04-02T08:02:41.313Z',
   'name': 'Text Binary Classifier',
   'url': 'https://us-south.ml.cloud.ibm.com/v3/wml_instances/197aa151-8d29-4a35-af88-cfc689f25d87/published_models/a33e348f-8ab1-4c15-beb8-151a8e4ae6ab'},
  'asset_properties': {'categorical_fields': ['text'],
   'feature_fields': ['text'],
   'input_data_schema': {'fields': [{'metadata': {'measure': 'discrete',
       'modeling_role': 'feature'},
      'name': 'text',
      'nullable': True,
      'type': 'string'}],
    'type': 'struct'},
   'input_data_type': 'unstructured_text',
   'label_column': 'label',
   'model_type': 'mllib-2.3',
   'output_data_schema': {'fields': [{'metadata': {'columnInfo': {'columnLength': 64},
       'measure': 'discrete',
       'modeling_role': 'feature'},
      'name': 'text',
      'nullable': True,
      'type': 'string'},
     {'metadata': {},
      'name': 'prediction

### 3.4 Score the model and get a transaction-id

In [31]:
text = "SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info"
payload = {"fields": ["text"], "values": [[text]]}

response = wml_client.deployments.score(scoring_url=scoring_url, payload=payload)

In [32]:
print(response)

{'fields': ['text', 'label_ix', 'words', 'filtered_words', 'rawFeatures', 'features', 'rawPrediction', 'probability', 'prediction', 'predictionLabel'], 'values': [['SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info', 0.0, ['six', 'chances', 'to', 'win', 'cash!', 'from', '100', 'to', '20,000', 'pounds', 'txt>', 'csh11', 'and', 'send', 'to', '87575.', 'cost', '150p/day,', '6days,', '16+', 'tsandcs', 'apply', 'reply', 'hl', '4', 'info'], ['six', 'chances', 'win', 'cash!', '100', '20,000', 'pounds', 'txt>', 'csh11', 'send', '87575.', 'cost', '150p/day,', '6days,', '16+', 'tsandcs', 'apply', 'reply', 'hl', '4', 'info'], [11750, [9, 18, 38, 85, 288, 397, 457, 510, 654, 1010, 1793, 2164, 2545, 5462, 5760, 5779, 6414, 6923, 8930, 9354, 11642], [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]], [11750, [9, 18, 38, 85, 288, 397, 457, 510, 654, 1010, 1793
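
The response pairs a list of column names in `fields` with one row per scored input in `values`. A small helper makes individual columns easier to read. The helper name `rows_as_dicts` is hypothetical, and the shortened response below is hand-written with made-up probability values; it only mirrors the shape of the real output.

```python
def rows_as_dicts(response):
    """Pair each row of `values` with the column names in `fields`."""
    return [dict(zip(response["fields"], row)) for row in response["values"]]


# Shortened, hand-written response in the same shape as the scoring output.
example_response = {
    "fields": ["text", "probability", "prediction", "predictionLabel"],
    "values": [["SIX chances to win CASH!", [0.18, 0.82], 1.0, "spam"]],
}
row = rows_as_dicts(example_response)[0]
print(row["predictionLabel"], row["probability"])
```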

**Note**: Please wait for a few seconds before running the cell below.

In [44]:
import time

# Give payload logging a moment to record the scoring request.
time.sleep(10)

In [41]:
transaction_id = subscription.payload_logging.get_table_content().scoring_id[0]
print(transaction_id)

7dc75a64a729bcff48e8ae953b999821-1
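
A fixed sleep may be too short on a slow day and needlessly long on a fast one. As an alternative sketch, the lookup can be retried until a record appears. The helper `wait_for` and the stand-in `fetch` function are hypothetical; in the notebook, `fetch` would wrap the payload table lookup and return `None` while the table is still empty.

```python
import time


def wait_for(fetch, attempts=10, delay=2.0, sleep=time.sleep):
    """Call `fetch()` until it returns a non-None value or attempts run out."""
    for attempt in range(attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < attempts - 1:
            sleep(delay)
    raise TimeoutError("no result after {} attempts".format(attempts))


# Stand-in for a lookup that finds nothing twice and then succeeds.
pending = iter([None, None, "example-scoring-id"])
print(wait_for(lambda: next(pending), delay=0.0))
```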


## 4. Explainability

### 4.1 Configure explainability

In [42]:
subscription.explainability.enable()
subscription.explainability.get_details()

{'enabled': True,
 'monitor_definition': {'entity': {'applies_to': {'input_data_type': ['structured',
     'unstructured_image',
     'unstructured_text'],
    'problem_type': ['binary', 'multiclass', 'regression']},
   'description': 'Provides explanations to the predictions made by a Machine Learning model.',
   'metrics': [],
   'monitor_runtime': {'type': 'service'},
   'name': 'Explainability',
   'parameters_schema': {'properties': {'tokenizer': {'$id': '#/properties/tokenizer',
      'items': {'$id': '#/properties/tokenizer/items',
       'properties': {'enabled': {'$id': '#/properties/tokenizer/items/properties/enabled',
         'type': 'boolean'},
        'language': {'$id': '#/properties/tokenizer/items/properties/language',
         'type': 'string'},
        'part_of_speech': {'$id': '#/properties/tokenizer/items/properties/part_of_speech',
         'items': {'oneOf': [{'type': 'string'}]},
         'type': 'array'}},
       'required': ['enabled'],
       'title': 'The to

### 4.2 Get explanations for the transaction

In [43]:
subscription.explainability.run(transaction_id, background_mode=False)




 Looking for explanation for 7dc75a64a729bcff48e8ae953b999821-1 




finished

---------------------------
 Successfully finished run 
---------------------------




{'entity': {'perturbed': False,
  'explanation_type': 'lime',
  'asset': {'problem_type': 'binary',
   'type': 'text',
   'id': 'a33e348f-8ab1-4c15-beb8-151a8e4ae6ab',
   'input_data_type': 'unstructured_text',
   'deployment': {'name': 'Text Binary Classifier deployment',
    'id': '726e1aea-745b-462b-9cb3-8fe9a9512368'},
   'name': 'Text Binary Classifier'},
  'predictions': [{'probability': 0.8171882652921649,
    'value': 'spam',
    'explanation_features': [{'positions': [[15, 18]],
      'feature_value': 'win',
      'weight': 0.39503435397824244},
     {'positions': [[4, 11]],
      'feature_value': 'chances',
      'weight': 0.07487966798553222},
     {'positions': [[127, 129]],
      'feature_value': 'HL',
      'weight': 0.07373243353059876},
     {'positions': [[25, 29]],
      'feature_value': 'From',
      'weight': 0.07041854004165561},
     {'positions': [[115, 120]],
      'feature_value': 'apply',
      'weight': 0.0702060248446271},
     {'positions': [[81, 85]],
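
Each entry in `explanation_features` carries a `positions` pair that appears to be a start and an end-exclusive character offset into the scored text. The sketch below checks that reading against the five complete entries shown above and marks those spans in the message. Weights are rounded, and the `highlight` helper is hypothetical.

```python
text = ("SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. "
        "Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info")

# (feature_value, [start, end], weight) copied from the explanation output, weights rounded.
features = [
    ("win", [15, 18], 0.395),
    ("chances", [4, 11], 0.075),
    ("HL", [127, 129], 0.074),
    ("From", [25, 29], 0.070),
    ("apply", [115, 120], 0.070),
]


def highlight(text, features):
    """Wrap each explained span in [brackets], assuming end-exclusive offsets."""
    out = text
    # Apply right to left so earlier offsets stay valid after each insertion.
    for _, (start, end), _ in sorted(features, key=lambda f: f[1][0], reverse=True):
        out = out[:start] + "[" + out[start:end] + "]" + out[end:]
    return out


for value, (start, end), weight in features:
    assert text[start:end] == value  # the offsets index the scored text directly
print(highlight(text, features))
```

The word `win` carries by far the largest weight toward the `spam` prediction, roughly five times that of the next feature.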
    