<header style="padding:10px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />
  
# Advanced Analytics Capabilities in the Database 
# (In-Database)

## Case 3: Text Classification
    
![Slide](images/Diapositiva4.PNG)

![Slide](images/Diapositiva8.PNG)

![Slide](images/Diapositiva5.PNG)

![Slide](images/Diapositiva6.PNG)


## **Install the Libraries**

In [None]:
#!pip install teradataml==17.20.0.4 kds==0.1.3 lightgbm==4.0.0 nyoka==4.3.0

## **Load the Modules**

In [1]:
import pandas as pd
import numpy as np
import getpass as gp
import plotly.express as px
from sklearn.metrics import roc_auc_score, classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
import kds

from teradataml import *
from teradataml.analytics.valib import *
configure.val_install_location = "val"

In [None]:
con=create_context(host = "20.172.147.24", username="pocuser", password = gp.getpass())

![Slide](images/Diapositiva10.PNG)

## **Initial Read from the Database**

In [3]:
# Read the model development data
tdf = DataFrame("caso3_data_texto")

In [4]:
# First records
tdf.head(10)

detalle,target
"301036BARRA CONST.5/8 X 9 MTS""",G
![CDATA[CARTUCHO FILTRO 2091 POLVO P100 3M]],G
"![CDATA[GUANTE QUIM. LATEX NEGRO Y NARANJA 12\ C35 T.9 MASTER]]""",G
"![CDATA[PROTECTOR T COPA ADAPT CASCO H9P3E PELTOR OPTIME 98 3M]] ![CDATA[GUANTE BADANA BLANCO S REFUERZO GC14 T.9]] ![CDATA[GUANTE QUIM. NITRILO 18\ 37-185 T.9 VERDE ANSELL EDMON SOLVEX]] ![CDATA[LENTE ANTIFOG TURBINE L. OSCURA STEELPRO]] ![CDATA[LENTE SIMPLE VISION L. CLARA CLUTE]] ![CDATA[TRAJE CONTRA PARTIC. Y LIQ. S110 T.XL STEELPRO]] ![CDATA[LENTE SOBRELENTE OTG L. CLARA CLUTE]] ![CDATA[MPU ACERO WERK T.38]]""",G
![CDATA[RESPIRADOR DESCART. POLVO KN95 MAYFIELD X10UD]],G
#TRIPTOFANO (BOLSA),G
![CDATA[RESPIRADOR DESCART. POLVO KN95 MAYFIELD X10UD]],G
"ETIQ.AUTOAD.160MM X 70MM POLIPROPILENO IMPRESO \"" 10KG NAZCA VALLEY\""""",C
"301036BARRA CONST.3/8 X 9 MTS""",G
"301036BARRA CONST.3/4 X 9 MTS""",G


In [5]:
# Check the dimensions (rows, columns)
tdf.shape

(192734, 2)

![Slide](images/Diapositiva6.PNG)

## **Data Exploration**

In [6]:
# Descriptive statistics per column
objcs = ColumnSummary(data=tdf,target_columns=['detalle', 'target'])
objcs.result.head()

ColumnName,Datatype,NonNullCount,NullCount,BlankCount,ZeroCount,PositiveCount,NegativeCount,NullPercentage,NonNullPercentage
target,VARCHAR(15000) CHARACTER SET LATIN,192734,0,0,,,,0.0,100.0
detalle,VARCHAR(15000) CHARACTER SET LATIN,192734,0,0,,,,0.0,100.0


In [7]:
objv = valib.Values(data=tdf, columns=['detalle', 'target'])
objv.result

xdb,xtbl,xcol,xtype,xcnt,xnull,xunique,xblank,xzero,xpos,xneg
POCUSER,caso3_data_texto,target,VARCHAR(15000) CHARACTER SET LATIN,192734.0,0.0,19.0,0.0,,,
POCUSER,caso3_data_texto,detalle,VARCHAR(15000) CHARACTER SET LATIN,192734.0,0.0,96506.0,0.0,,,


In [8]:
# Appending show_query() shows that the code is indeed translated to SQL automatically before it runs
objcs.result.show_query()

'select * from "POCUSER"."ml__td_sqlmr_out__1697330259626198"'

In [9]:
## Compute the proportion of each target class
tResp = valib.Frequency(data=tdf, columns="target")
tResp.result.head(19)

xtbl,xcol,xval,xcnt,xpct
caso3_data_texto,target,E,690.0,0.3580063714757126
caso3_data_texto,target,H,26848.0,13.930079799101351
caso3_data_texto,target,N,2571.0,1.3339628711073293
caso3_data_texto,target,Q,8974.0,4.656158228439196
caso3_data_texto,target,M,753.0,0.3906939097408864
caso3_data_texto,target,F,1521.0,0.7891705666877665
caso3_data_texto,target,S,985.0,0.5110670665269231
caso3_data_texto,target,G,101849.0,52.84433467888385
caso3_data_texto,target,L,322.0,0.1670696400219992
caso3_data_texto,target,P,330.0,0.1712204385318625


In [12]:
con.execute('CREATE VIEW caso3 AS (SELECT ROW_NUMBER() OVER (ORDER BY detalle) - 1 AS doc_id, detalle, target FROM caso3_data_texto);')

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f2d8255c6d8>

In [13]:
tdf = DataFrame("caso3")

### Sample Partitioning

We select the relevant columns and split the data into two samples: 50% train / 50% test.

In [14]:
## Select the relevant columns; sample() adds a sampleid column that splits the table into training and test samples
tbl_sample = tdf[['doc_id', 'detalle', 'target']].sample(frac = [0.5, 0.5])

In [15]:
## Store the training sample in the database and reference it with a teradataml DataFrame
copy_to_sql(tbl_sample[tbl_sample.sampleid == "1"].drop("sampleid", axis = 1), schema_name="pocuser", table_name="TrainModel", if_exists="replace")
tbl_train = DataFrame("TrainModel")
tbl_train.shape

(96367, 3)

In [16]:
## Store the test sample in the database and reference it with a teradataml DataFrame
copy_to_sql(tbl_sample[tbl_sample.sampleid == "2"].drop("sampleid", axis = 1), schema_name="pocuser", table_name="TestModel", if_exists="replace")
tbl_test = DataFrame("TestModel")
tbl_test.shape

(96367, 3)
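
The two shapes add up to the 192734 rows of the source table. The idea of tagging rows with a sample id can be sketched in plain Python as follows. This is only an illustration: `split_ids`, its `seed` parameter and the shuffle-and-cut strategy are assumptions, not a description of how teradataml's `sample()` works internally.

```python
import random

def split_ids(doc_ids, fractions=(0.5, 0.5), seed=42):
    """Shuffle ids once and cut them into consecutive, non-overlapping samples keyed 1, 2, ..."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    samples, start = {}, 0
    for sample_id, frac in enumerate(fractions, start=1):
        size = round(frac * len(ids))
        samples[sample_id] = ids[start:start + size]
        start += size
    return samples

samples = split_ids(range(192734))
train_ids, test_ids = samples[1], samples[2]
print(len(train_ids), len(test_ids))
```

Passing `fractions=(0.7, 0.3)` to this sketch would produce a 70/30 split instead.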

In [17]:
## Verify that the DataFrame references the table created in the database
tbl_test.show_query()

'select * from "TestModel"'

In [18]:
stopwords = DataFrame("stopwords")

### Data Cleaning and Tokenization

In [19]:
TextParserTrain = TextParser(data=tbl_train,
                            text_column="detalle",
                            punctuation="\"<>!#$%&[]()*+,-./:;?@\\^_`{|}~''",
                            object=stopwords,
                            remove_stopwords=True,
                            accumulate=["doc_id","target"])

In [20]:
TextParserTrain.result.head()

doc_id,target,token
0,G,mts
0,G,4
0,G,x
0,G,3
2,G,8
2,G,x
2,G,5
0,G,9
0,G,const
0,G,301036barra


Each document is now split into lowercase tokens, one row per token, with doc_id and target carried along.
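
The cleaning step can be approximated in plain Python to see what kind of tokens come out. This is a rough sketch under assumptions: the `tokenize` helper is hypothetical, the stopword set is made up (the real list lives in the `stopwords` table), and the exact behavior of `TextParser` may differ.

```python
# Same punctuation characters passed to TextParser in the cell above
PUNCTUATION = "\"<>!#$%&[]()*+,-./:;?@\\^_`{|}~'"
# Hypothetical stopwords, only for illustration
STOPWORDS = {"de", "para", "con"}

def tokenize(text, punctuation=PUNCTUATION, stopwords=STOPWORDS):
    """Replace punctuation with spaces, lowercase, split, and drop stopwords."""
    table = str.maketrans({ch: " " for ch in punctuation})
    cleaned = text.translate(table).lower()
    return [tok for tok in cleaned.split() if tok not in stopwords]

tokens = tokenize('301036BARRA CONST.5/8 X 9 MTS"')
print(tokens)
```

For this description the sketch yields the same set of tokens that the parser output lists for the "5/8" bar (doc_id 2).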

### Model Training

In [22]:
## Train the text classification model
NaiveBayesTextClassifierTrainer_out = NaiveBayesTextClassifierTrainer(data=TextParserTrain.result,
                                                                     token_column="token",
                                                                     doc_id_columns = 'doc_id',
                                                                     doc_category_column="target",
                                                                     model_type = "MULTINOMIAL",
                                                                     data_partition_column = "target")

In [23]:
## Model statistics: probability of each token by category
NaiveBayesTextClassifierTrainer_out.result.head()

token,category,prob
0,N,0.0001833410529566
0,E,0.0003248546275541
0,H,0.0003378074047383
0,Q,0.004451213289266
0,J,4.763129566650472e-05
0,G,0.0098369166073285
0,A,0.0006629192055866
0,F,1.863811307743204e-05
0,I,4.221119103096613e-05
0,O,0.0005592111394858


In [24]:
NaiveBayesTextClassifierTrainer_out.model_data.head()

token,category,prob
0,J,4.763129566650472e-05
0,N,0.0001833410529566
0,Q,0.004451213289266
0,O,0.0005592111394858
0,A,0.0006629192055866
0,I,4.221119103096613e-05
0,G,0.0098369166073285
0,C,0.0088840437042882
0,H,0.0003378074047383
0,E,0.0003248546275541
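
The model is a table of token probabilities per category. A minimal sketch of how a multinomial Naive Bayes model can estimate such a table is shown below. The `train_multinomial_nb` function, the Laplace smoothing constant `alpha` and the toy rows are assumptions for illustration; the exact estimator used by `NaiveBayesTextClassifierTrainer` is not shown in this notebook.

```python
from collections import Counter, defaultdict

def train_multinomial_nb(rows, alpha=1.0):
    """rows: iterable of (category, token). Returns {(token, category): prob}."""
    counts = defaultdict(Counter)
    vocab = set()
    for category, token in rows:
        counts[category][token] += 1
        vocab.add(token)
    model = {}
    for category, token_counts in counts.items():
        # Smoothed denominator: tokens seen in the category plus alpha per vocabulary entry
        denom = sum(token_counts.values()) + alpha * len(vocab)
        for token in vocab:
            model[(token, category)] = (token_counts[token] + alpha) / denom
    return model

# Toy (category, token) rows, hypothetical
rows = [("G", "guante"), ("G", "lente"), ("G", "guante"),
        ("C", "etiq"), ("C", "impreso")]
model = train_multinomial_nb(rows)
print(model[("guante", "G")])
```

With smoothing, a token never seen in a category still gets a small nonzero probability, which keeps the log-likelihoods finite at scoring time.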


In [25]:
TextParserTest = TextParser(data=tbl_test,
                            text_column="detalle",
                            punctuation="\"<>!#$%&[]()*+,-./:;?@\\^_`{|}~''",
                            object=stopwords,
                            remove_stopwords=True,
                            accumulate=["doc_id","target"])

In [26]:
TextParserTest.result.head()

doc_id,target,token
2,G,8
2,G,mts
2,G,301036barra
2,G,const
7,G,kn95
7,G,mayfield
7,G,cdata
2,G,x
2,G,9
2,G,5


### Model Scoring

In [51]:
nbt_predict_out = NaiveBayesTextClassifierPredict(object = NaiveBayesTextClassifierTrainer_out.result,
                                                      newdata = TextParserTest.result,
                                                      input_token_column = 'token',
                                                      doc_id_columns = 'doc_id',
                                                      model_type = "MULTINOMIAL",
                                                      model_token_column = 'token',
                                                      model_category_column = 'category',
                                                      accumulate = 'target',
                                                      responses = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S'],
                                                      output_prob = True,
                                                      model_prob_column = 'prob',
                                                      newdata_partition_column = 'doc_id')

In [53]:
nbt_predict_out.result.head()

doc_id,prediction,loglik_A,loglik_B,loglik_C,loglik_D,loglik_E,loglik_F,loglik_G,loglik_H,loglik_I,loglik_J,loglik_K,loglik_L,loglik_M,loglik_N,loglik_O,loglik_P,loglik_Q,loglik_R,loglik_S,prob_A,prob_B,prob_C,prob_D,prob_E,prob_F,prob_G,prob_H,prob_I,prob_J,prob_K,prob_L,prob_M,prob_N,prob_O,prob_P,prob_Q,prob_R,prob_S,target
8,G,-86.20437860899557,-87.2670107149929,-87.57999625026703,-87.60765733060346,-83.74798682379388,-86.17298306978813,-80.599177061873,-89.52948336219275,-86.37002775330112,-86.16293543745284,-88.87890849913725,-86.90912778958126,-87.14513022118244,-86.16605427033318,-86.17288170305046,-86.7512133604433,-86.60671913491979,-86.22336396161107,-86.70916258130586,0.0033995355012297,0.0011746928380623,0.0008590064603315,0.0008355710325025,0.0396476068824628,0.0035079588503534,0.9241185387707684,0.0001222770449886,0.0028805742329045,0.0035433831985207,0.0002343613590968,0.0016801610037993,0.0013269554632658,0.0035323491940456,0.003508314458721,0.001967579394647,0.0022734494942961,0.0033356029325192,0.0020520818874837,G
11,G,-813.3709155683786,-889.0120709115386,-766.1637791750784,-867.942285035776,-873.8933339948995,-874.0808857447917,-725.0667188944251,-847.7876896292075,-888.6135509086205,-883.0648937953495,-899.0027408255249,-897.7429649454391,-889.5573154221089,-815.2539346221323,-821.4463041591288,-899.5192516918333,-884.6419828631858,-911.2837185145876,-872.7752867547405,4.466575255149541e-39,6.301417943424727e-72,1.4183176181201245e-18,8.911073503964687e-63,2.319649146160424e-65,1.9229571707282195e-65,1.0,5.046143493513181e-54,9.38670833186494e-72,2.411373532531345e-69,2.8876561042011835e-76,1.0177923546884703e-75,3.6529321281620847e-72,6.79500641755103e-40,1.3895620715266492e-42,1.7227713947653268e-76,4.981308167721244e-70,1.339629271138814e-81,7.095517057638907e-65,G
14,C,-231.9955719319169,-292.9704081783676,-196.21736634599907,-292.49343490079565,-269.9055753675675,-284.4065207783412,-200.5517361617148,-254.11591207987527,-280.8126102984064,-289.70483940386623,-294.2624152861088,-292.8240907093784,-292.2154402281189,-266.76427463915326,-271.57380759444766,-270.1532403126463,-274.34648849072227,-289.848417248872,-283.64581904278464,2.8580253883293193e-16,9.44128625882688e-43,0.9870595181710964,1.5211709675863315e-42,9.81659290978396e-33,4.9462971945716856e-39,0.0129404818289035,7.068453711200376e-26,1.7992673407120318e-37,2.4731454836505886e-41,2.593699200266972e-43,1.09288895018297e-42,2.008674565405633e-42,2.270964543765653e-31,1.851214446134859e-33,7.663042996444562e-33,1.1569023875586222e-34,2.1423705950329108e-41,1.058397421636148e-38,G
16,C,-382.6947801548552,-474.6514068302638,-322.8003891494965,-475.4404884046533,-434.98308347020895,-453.4546231094449,-328.99296953170284,-396.9343779561372,-435.5292903216534,-464.79468555073504,-482.9576125923514,-473.3162368889169,-470.4821224509865,-408.22224518672624,-416.2040547248789,-428.33484248461946,-420.46289246975175,-473.64555038284544,-472.9378724522124,9.71201761917986e-27,1.1247446136992611e-66,0.9979596273968584,5.1092852337068354e-67,1.9001499022791466e-49,1.805926729751536e-57,0.0020403726031415,6.355221796689713e-33,1.10045741760129e-49,2.146711008553275e-62,2.7778871193662815e-70,4.2747513680952935e-66,7.273628595834811e-65,7.959252111057628e-38,2.7190451758098625e-41,1.465825383616775e-46,3.84438338270381e-43,3.075330721735036e-66,6.240687657874302e-66,G
21,H,-62.98574485257002,-64.41107766350683,-58.75229383460648,-64.76555458051988,-64.21353101957222,-63.00608474150343,-60.439186216035054,-57.422542301587015,-63.00550156488935,-63.04001853102159,-66.05851885614042,-64.02968746515937,-64.28252386701618,-60.76626629449373,-63.00622568491373,-63.856339041249214,-63.09684338607873,-62.9822272136399,-63.80938295339105,0.0027732050102182,0.0006667573341061,0.1912257586859496,0.0004677571475639,0.0008123837334891,0.0027173681114954,0.0353947084897259,0.7228532629298958,0.0027189532792013,0.0026267044937949,0.0001283787616869,0.0009763453452105,0.0007582248361532,0.0255205051565615,0.0027169851433557,0.0011611483891663,0.0024816040891631,0.0027829773217784,0.0012169717414835,G
23,G,-153.78733191766804,-158.2755795745322,-139.3383249434161,-162.84548675458362,-161.17333118163762,-152.89821155893287,-132.62842473641987,-143.5396708195526,-141.66368662731455,-147.60558995312897,-166.11121492001092,-152.8444251072551,-161.26984962649956,-144.30242418016704,-159.32238089171364,-157.34190422869125,-147.91807159738408,-159.30478360649948,-161.71672772250503,6.459677862676842e-10,7.260886865920808e-12,0.0012171240908922,7.521495479559618e-14,4.004166584487953e-13,1.5716322293381741e-09,0.9986366316324038,1.822693009938662e-05,0.0001189715475696,3.125412066106037e-07,2.870889501875124e-15,1.6584794150021782e-09,3.63575582872046e-13,8.50068557067445e-06,2.548998818898089e-12,1.8470545442782592e-11,2.2866422870988362e-07,2.594251270523462e-12,2.32550896378628e-13,G
19,G,-109.09662583206864,-137.35758850245227,-97.92279870573924,-141.82254244791622,-144.24684211996123,-135.12323477715543,-93.73423733952092,-121.799459150956,-114.3287552396626,-143.27708052297103,-145.92988260662935,-137.86940397559653,-144.30164610659813,-134.5190697772324,-126.4208230032208,-143.9883991584285,-145.38140850702243,-139.25103201630515,-143.95861165109287,2.097304066998735e-07,1.117072236289649e-19,0.014941453986875,1.2852168107454473e-21,1.137932306575879e-22,1.0433950006681969e-18,0.9850583351616652,6.381032904043109e-13,1.1204086902500381e-09,3.001086669865501e-22,2.1143711838789832e-23,6.695802300631503e-20,1.0772471616963431e-22,1.909124614783607e-18,6.278532753228268e-15,1.473522526622462e-22,3.659152471054517e-23,1.6817800180318172e-20,1.5180753542886705e-22,I
10,I,-96.02193601798028,-98.69497724073597,-97.1552807972784,-99.02870870564524,-98.51352149116752,-97.75643223393048,-89.44539547414762,-102.71547484643328,-82.74445163980101,-97.72439389066848,-100.28910332063568,-98.34884795179222,-98.57643339826558,-97.71465571175588,-97.75620971211885,-98.19865052004032,-98.36165700934032,-97.84393233559663,-98.15905239526326,1.7105110500915575e-06,1.1809699068390473e-07,5.507068777259642e-07,8.458650824498575e-08,1.415937428488171e-07,3.018865391486972e-07,0.001228234104523,2.11913637646987e-09,0.9987667668403464,3.1171508823117145e-07,2.398383740184028e-08,1.669401776768695e-07,1.3296023309999488e-07,3.147654539275509e-07,3.0195372296295626e-07,1.9399511192628896e-07,1.6481546814759148e-07,2.765941161451432e-07,2.01831075424694e-07,I
7,G,-86.20437860899557,-87.2670107149929,-87.57999625026703,-87.60765733060346,-83.74798682379388,-86.17298306978813,-80.59917706187298,-89.52948336219275,-86.37002775330112,-86.16293543745284,-88.87890849913725,-86.90912778958126,-87.14513022118244,-86.16605427033318,-86.17288170305046,-86.7512133604433,-86.60671913491979,-86.22336396161107,-86.70916258130586,0.0033995355012297,0.0011746928380623,0.0008590064603315,0.0008355710325025,0.0396476068824623,0.0035079588503533,0.9241185387707692,0.0001222770449886,0.0028805742329045,0.0035433831985207,0.0002343613590968,0.0016801610037993,0.0013269554632658,0.0035323491940456,0.003508314458721,0.001967579394647,0.002273449494296,0.0033356029325192,0.0020520818874837,G
2,G,-66.98777067346543,-86.57386353443297,-59.52641281693583,-86.22136296948355,-83.00265389006336,-81.45448419849303,-52.79800760437804,-59.15735745321956,-79.13153091240676,-75.82209658794622,-88.87890849913725,-86.90912778958126,-85.75883586006256,-71.94543066622728,-69.39740709539133,-82.18686516897546,-76.17386883761927,-86.22336396161107,-86.70916258130586,6.85795418605941e-07,2.1382699267864267e-15,0.0011929469306844,3.0419465456880136e-15,7.603598393088243e-14,3.575861787693428e-13,0.9970808607796512,0.0017254398818105,3.649442439712017e-12,9.988392123908923e-11,2.133016521085729e-16,1.5291817699805605e-15,4.8308610887021625e-15,4.820701066080351e-09,6.161731617146289e-08,1.719141840038407e-13,7.026237676965986e-11,3.035865720468404e-15,1.867681850579529e-15,G
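
In the rows above, the `prob_*` columns are consistent with a normalized exponential (softmax) of the `loglik_*` columns, and `prediction` is the class with the highest value. The sketch below checks this on the row for doc_id 21. The `loglik_to_prob` helper is hypothetical; subtracting the maximum before exponentiating is a standard numerical-stability step, and it is needed here because values such as -800 would underflow to zero otherwise.

```python
import math

def loglik_to_prob(loglik):
    """Convert per-class log-likelihoods to probabilities with a stable softmax."""
    peak = max(loglik.values())
    weights = {c: math.exp(v - peak) for c, v in loglik.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

classes = "ABCDEFGHIJKLMNOPQRS"
# loglik_A .. loglik_S copied from the doc_id 21 row of the output above
values = [
    -62.98574485257002, -64.41107766350683, -58.75229383460648,
    -64.76555458051988, -64.21353101957222, -63.00608474150343,
    -60.439186216035054, -57.422542301587015, -63.00550156488935,
    -63.04001853102159, -66.05851885614042, -64.02968746515937,
    -64.28252386701618, -60.76626629449373, -63.00622568491373,
    -63.856339041249214, -63.09684338607873, -62.9822272136399,
    -63.80938295339105,
]
probs = loglik_to_prob(dict(zip(classes, values)))
prediction = max(probs, key=probs.get)
print(prediction, round(probs[prediction], 4))
```

The sketch picks class H with a probability of about 0.72, matching `prediction` and `prob_H` in that row, while the observed `target` is G.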


### Evaluating the Classifications

In [57]:
predicted_data = ConvertTo(data = nbt_predict_out.result,
                               target_columns = ["target", "prediction"],
                               target_datatype = ["VARCHAR(charlen=20,charset=UNICODE,casespecific=NO)"])

In [59]:
ClassificationEvaluator_obj = ClassificationEvaluator(data=predicted_data.result,
                                                          observation_column='target',
                                                          prediction_column='prediction',
                                                          labels=['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S'])

In [None]:
ClassificationEvaluator_obj.result.head(22).sort('SeqNum')

In [None]:
ClassificationEvaluator_obj.output_data.sort('SeqNum')
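
The outputs of the two evaluation cells are not included here. As a rough sketch of what a confusion matrix and accuracy measure, the plain-Python example below uses only the ten scored rows displayed in the scoring section. The `evaluate` helper is hypothetical, and ten rows from `head()` are not representative of the full test set.

```python
from collections import Counter

def evaluate(observed, predicted):
    """Return a confusion Counter keyed by (observed, predicted) and the accuracy."""
    pairs = Counter(zip(observed, predicted))
    correct = sum(n for (obs, pred), n in pairs.items() if obs == pred)
    return pairs, correct / len(observed)

# target and prediction for doc_id 8, 11, 14, 16, 21, 23, 19, 10, 7, 2
observed = ["G", "G", "G", "G", "G", "G", "I", "I", "G", "G"]
predicted = ["G", "G", "C", "C", "H", "G", "G", "I", "G", "G"]

confusion, accuracy = evaluate(observed, predicted)
print(accuracy)
for (obs, pred), n in sorted(confusion.items()):
    print(obs, pred, n)
```

Off-diagonal pairs such as (G, C) show which classes the model confuses most often.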

In [62]:
copy_to_sql(predicted_data.result, schema_name="pocuser", table_name="matriz_clas", if_exists="replace")

## **Finishing the Demo**

In [None]:
con.execute("DROP VIEW pocuser.CASO3;")

In [None]:
con.execute("DROP TABLE pocuser.TrainModel;")

In [None]:
con.execute("DROP TABLE pocuser.TestModel;")

In [None]:
con.execute("DROP TABLE pocuser.pmml_models;")

In [44]:
con.execute("DROP TABLE pocuser.matriz_clas;")

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f2d82474668>

In [None]:
## Finish the notebook and clean up the environment
remove_context()

![Slide](images/Diapositiva13.PNG)

![Slide](images/Diapositiva14.PNG)

Copyright 2023. Prepared by Luis Cajachahua under the MIT license