## **Classification with unsupervised methods**

#### Classifying work logs with pending tasks using unsupervised methods (k-means, LDA)

The goal is to identify the affected assets for which tasks were left pending.

**Prepare the notebook**

In [None]:
## Connect the notebook to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
## Import the required libraries
import nltk
import pandas as pd
import numpy as np
import re
import codecs
import matplotlib.pyplot as plt
from collections import Counter
from wordcloud import WordCloud
from nltk.util import ngrams
from nltk import bigrams
from gensim.corpora import Dictionary
from gensim.models.ldamulticore import LdaMulticore
import multiprocessing as mp
import time
import spacy

In [None]:
## Download the NLTK resources for the tokenizer and the stopwords
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Import the stopwords
from nltk.corpus import stopwords
stop_words_nltk = set(stopwords.words('spanish'))

from nltk.probability import ConditionalFreqDist
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [None]:
# Download the spaCy language model
!python -m spacy download es_core_news_lg
#!python -m spacy download es_core_news_md
nlp = spacy.load("es_core_news_lg")
#nlp = spacy.load("es_core_news_md")  # sm: no word vectors; md: reduced word vectors; lg: large word vectors; trf: transformer pipeline without static word vectors

2022-12-03 16:36:09.565811: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting es-core-news-lg==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_lg-3.4.0/es_core_news_lg-3.4.0-py3-none-any.whl (568.0 MB)
Installing collected packages: es-core-news-lg
Successfully installed es-core-news-lg-3.4.0
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_lg')


**Load the pending-task data**

In [None]:
df = pd.read_csv('/content/drive/MyDrive/ProyectoIntegrador/Datos/datos_pendiente.csv', encoding='latin-1')
# Enable the following line to run the project locally
#df = pd.read_csv('/content/datos_pendiente.csv', encoding='latin-1')
df.shape

(10468, 12)

In [None]:
## Recover the tokens (as lists) that were extracted during text preparation and were loaded back as plain strings when the .csv file was read
df['tokens_proc'] = df['tokens_proc'].apply(lambda x: re.sub(r"[\[\]']+", '', str(x)))
df['tokens_proc'] = df['tokens_proc'].apply(lambda x: x.split(', '))
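
The regex-and-split approach above works as long as no token contains a comma, a bracket, or a quote. A minimal alternative sketch, assuming each cell holds a Python list literal exactly as pandas wrote it to the CSV; `parse_tokens` is a hypothetical helper that is not part of the original notebook.

```python
import ast

def parse_tokens(cell):
    """Turn a stringified list such as "['poda', 'linea']" back into a list."""
    value = ast.literal_eval(str(cell))
    # Guard against cells that hold a single scalar instead of a list.
    return list(value) if isinstance(value, (list, tuple)) else [str(value)]

print(parse_tokens("['poda', 'varios', 'ramales']"))
```

One difference worth noting: for an empty list literal this helper returns `[]`, whereas stripping the brackets and splitting yields a list with one empty string.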

In [None]:
distancias = pd.read_csv('/content/drive/MyDrive/ProyectoIntegrador/Datos/distancias_pendientes.csv', encoding='latin-1')
distancias.shape

(10398, 3)

In [None]:
percentile90 = distancias['distancia'].quantile(0.90)
percentile90

8604.775329458535

In [None]:
outliers = distancias[distancias['distancia']>percentile90]
outliers.shape

(1040, 3)

In [None]:
df = df[~df['WORKLOGID'].isin(outliers['WORKLOGID'])].reset_index(drop=True)
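
The cells above drop every record whose distance exceeds the 90th percentile. The same idea can be sketched in plain Python on toy data; `quantile_linear` and the `records` list are hypothetical stand-ins, and the interpolation mirrors the linear rule that pandas applies by default in `quantile`.

```python
import math

def quantile_linear(values, q):
    """Quantile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q
    lower, upper = math.floor(pos), math.ceil(pos)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac

# Hypothetical (record id, distance) pairs standing in for `distancias`.
records = [(i, float(i)) for i in range(1, 12)]

threshold = quantile_linear([d for _, d in records], 0.90)
outlier_ids = {rid for rid, d in records if d > threshold}
kept = [rid for rid, _ in records if rid not in outlier_ids]

print(threshold, sorted(outlier_ids), kept)
```

By construction, a 0.90 threshold flags roughly 10% of the rows, which matches the 1040 outliers found among the 10398 distances above.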

In [None]:
# Define the list of additional stop words
otras_vacias = ['quedo','pendiente','hoy','debe']

In [None]:
## Define a function that takes the DataFrame column of token lists and removes the words listed in otras_vacias
def remover_palabra(tokens):
  # Build a new Series instead of assigning into the column element by element
  return tokens.apply(lambda doc: [w for w in doc if w not in otras_vacias])

In [None]:
df['tokens_proc'] = remover_palabra(df['tokens_proc'])



In [None]:
# Select the columns of interest for the analysis
datos = df[['WORKLOGID','procesado', 'tokens_proc', 'tokensunidos']].copy()  # copy so new columns can be added without SettingWithCopyWarning
datos.head(5)

Unnamed: 0,WORKLOGID,procesado,tokens_proc,tokensunidos
0,2789392,se realizo poda en varios ramales secundarios ...,"[poda, varios, ramales, secundarios, fusible, ...",poda varios ramales secundarios fusible transf...
1,2857541,se encuentra linea primarias de en el suelo se...,"[linea, primarias, suelo, procede, repara, dan...",linea primarias suelo procede repara dano caus...
2,2437852,se visita la ruta y se inspeccionan dos poste ...,"[ruta, inspeccionan, dos, poste, reportados, c...",ruta inspeccionan dos poste reportados cuales ...
3,2685049,se abrio puente agua abajo de transformador qu...,"[abrio, puente, agua, abajo, transformador, re...",abrio puente agua abajo transformador quedo pe...
4,2854713,se visita direccion y se encuentra linea pri m...,"[linea, pri, caida, arbol, habre, corte, repar...",linea pri caida arbol habre corte quedo pendie...


**POS Tagging**

In [None]:
# POS tagger (spaCy) for a list of tokens
def token_tagger(text, nlp):
    doc = ' '.join(text)
    doc = nlp(doc)
    return [(w.text, w.pos_) for w in doc]


# POS tagger (spaCy) for raw text
def spacy_tagger(text, nlp):
    doc = nlp(text)
    return [(w.text, w.pos_) for w in doc]

In [None]:
datos['procesadotagged'] = datos['procesado'].apply(lambda x: spacy_tagger(x,nlp))



In [None]:
datos['tokenstagged'] = datos['tokens_proc'].apply(lambda x: token_tagger(x,nlp))



In [None]:
datos.head(5)

Unnamed: 0,WORKLOGID,procesado,tokens_proc,tokensunidos,procesadotagged,tokenstagged
0,2789392,se realizo poda en varios ramales secundarios ...,"[poda, varios, ramales, secundarios, fusible, ...",poda varios ramales secundarios fusible transf...,"[(se, PRON), (realizo, VERB), (poda, NOUN), (e...","[(poda, VERB), (varios, DET), (ramales, NOUN),..."
1,2857541,se encuentra linea primarias de en el suelo se...,"[linea, primarias, suelo, procede, repara, dan...",linea primarias suelo procede repara dano caus...,"[(se, PRON), (encuentra, VERB), (linea, NOUN),...","[(linea, NOUN), (primarias, ADJ), (suelo, NOUN..."
2,2437852,se visita la ruta y se inspeccionan dos poste ...,"[ruta, inspeccionan, dos, poste, reportados, c...",ruta inspeccionan dos poste reportados cuales ...,"[(se, PRON), (visita, VERB), (la, DET), (ruta,...","[(ruta, NOUN), (inspeccionan, VERB), (dos, NUM..."
3,2685049,se abrio puente agua abajo de transformador qu...,"[abrio, puente, agua, abajo, transformador, re...",abrio puente agua abajo transformador quedo pe...,"[(se, PRON), (abrio, VERB), (puente, NOUN), (a...","[(abrio, ADJ), (puente, NOUN), (agua, NOUN), (..."
4,2854713,se visita direccion y se encuentra linea pri m...,"[linea, pri, caida, arbol, habre, corte, repar...",linea pri caida arbol habre corte quedo pendie...,"[(se, PRON), (visita, VERB), (direccion, NOUN)...","[(linea, PROPN), (pri, PROPN), (caida, ADJ), (..."


In [None]:
## Define the list of POS tags of interest
tags_interes = ['NOUN']

## Define a function that extracts the words of interest according to their tag
def ExtractInterest(text, tags):
  interesting = [k for k,v in text if v in tags]
  return interesting
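
To make the filtering step concrete, here is a small self-contained sketch on toy data. The tagged pairs are invented for illustration and `extract_by_tag` is a hypothetical name for the same logic as the function above.

```python
def extract_by_tag(tagged_tokens, tags):
    """Keep only the tokens whose POS tag is in `tags`."""
    wanted = set(tags)
    return [token for token, tag in tagged_tokens if tag in wanted]

# Toy (token, tag) pairs in the same shape that the taggers return.
tagged = [("poda", "VERB"), ("varios", "DET"), ("ramales", "NOUN"), ("linea", "NOUN")]

print(extract_by_tag(tagged, ["NOUN"]))
```

Because the tagger sees tokens without their original sentence context, a noun such as "poda" can be tagged as a verb and filtered out, so the keyword lists depend on tagging quality.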

In [None]:
datos['palabras_clave'] = datos['tokenstagged'].apply(lambda x: ExtractInterest(x,tags_interes))
datos.head(5)



Unnamed: 0,WORKLOGID,procesado,tokens_proc,tokensunidos,procesadotagged,tokenstagged,palabras_clave
0,2789392,se realizo poda en varios ramales secundarios ...,"[poda, varios, ramales, secundarios, fusible, ...",poda varios ramales secundarios fusible transf...,"[(se, PRON), (realizo, VERB), (poda, NOUN), (e...","[(poda, VERB), (varios, DET), (ramales, NOUN),...","[ramales, poda, poda, linea]"
1,2857541,se encuentra linea primarias de en el suelo se...,"[linea, primarias, suelo, procede, repara, dan...",linea primarias suelo procede repara dano caus...,"[(se, PRON), (encuentra, VERB), (linea, NOUN),...","[(linea, NOUN), (primarias, ADJ), (suelo, NOUN...","[linea, suelo, caida, circuito, cuchillas, rec..."
2,2437852,se visita la ruta y se inspeccionan dos poste ...,"[ruta, inspeccionan, dos, poste, reportados, c...",ruta inspeccionan dos poste reportados cuales ...,"[(se, PRON), (visita, VERB), (la, DET), (ruta,...","[(ruta, NOUN), (inspeccionan, VERB), (dos, NUM...","[ruta, poste, coco, mitad, terminal, viento, v..."
3,2685049,se abrio puente agua abajo de transformador qu...,"[abrio, puente, agua, abajo, transformador, re...",abrio puente agua abajo transformador quedo pe...,"[(se, PRON), (abrio, VERB), (puente, NOUN), (a...","[(abrio, ADJ), (puente, NOUN), (agua, NOUN), (...","[puente, agua, transformador, aisladero, arbol..."
4,2854713,se visita direccion y se encuentra linea pri m...,"[linea, pri, caida, arbol, habre, corte, repar...",linea pri caida arbol habre corte quedo pendie...,"[(se, PRON), (visita, VERB), (direccion, NOUN)...","[(linea, PROPN), (pri, PROPN), (caida, ADJ), (...","[corte, linea]"


## **LDA on the full BoW**

In [None]:
# Build the BoW vocabulary as a gensim Dictionary
dictionary = Dictionary(datos.tokens_proc)

In [None]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in datos.tokens_proc]

In [None]:
print(corpus[:5])
print(dictionary)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 3), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)], [(4, 2), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 2), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1)], [(24, 1), (26, 1), (27, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 2), (48, 1), (49, 1), (50, 1), (51, 1), (52, 2), (53, 2)], [(1, 1), (10, 2), (15, 1), (27, 1), (50, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 2), (62, 1), (63, 1), (64, 1)], [(4, 2), (15, 1), (16, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1)]]
Dictionary(6685 unique tokens: ['caer', 'cuadrilla', 'fusible', 'grandes', 'linea']...)
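
The printed corpus is a list of `(token_id, count)` pairs per document. The sketch below reproduces that shape in plain Python to show what the representation encodes. It is a simplified stand-in and not gensim's implementation; `build_vocabulary` and `doc_to_bow` are hypothetical names.

```python
from collections import Counter

def build_vocabulary(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            token2id.setdefault(token, len(token2id))
    return token2id

def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, ignoring out-of-vocabulary tokens."""
    counts = Counter(t for t in doc if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())

docs = [["poda", "linea", "poda"], ["linea", "poste"]]
vocab = build_vocabulary(docs)
bows = [doc_to_bow(d, vocab) for d in docs]

print(vocab)
print(bows)
```

Word order is discarded: only which tokens occur and how often survives, which is the input that LDA works with.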


In [None]:
# Parallelize the BoW conversion

t0 = time.time()
pool = mp.Pool(mp.cpu_count())
doc_term_matrix = pool.map(dictionary.doc2bow, [sentence for sentence in datos.tokens_proc])
pool.close()
print(time.time()-t0)

0.3956429958343506


In [None]:
# Set the number of topics for LDA
# The optimal K is computed later in the notebook, which arrives at the value XXX.
num_topics = 11

# Define and train the model
t0 = time.time()
lda_model = LdaMulticore(doc_term_matrix, num_topics=num_topics, id2word = dictionary, passes=10, workers=10)
print(time.time()-t0)

48.39826583862305


In [None]:
for i,topic in lda_model.show_topics(formatted=True, num_topics=num_topics, num_words=5):
    print(str(i)+": "+ topic)
    print()

0: 0.031*"transformador" + 0.023*"fase" + 0.019*"repara" + 0.017*"voltaje" + 0.017*"requiere"

1: 0.047*"aisladero" + 0.018*"abajo" + 0.017*"aguas" + 0.016*"poste" + 0.015*"red"

2: 0.033*"aguas" + 0.028*"abajo" + 0.026*"linea" + 0.025*"poste" + 0.025*"transformador"

3: 0.039*"vehiculo" + 0.029*"transformador" + 0.028*"caja" + 0.020*"poste" + 0.017*"acometida"

4: 0.093*"poste" + 0.039*"riesgo" + 0.038*"secundario" + 0.020*"red" + 0.014*"transformador"

5: 0.076*"linea" + 0.065*"arbol" + 0.049*"poda" + 0.031*"secundaria" + 0.028*"transformador"

6: 0.052*"medidor" + 0.050*"poste" + 0.016*"transformador" + 0.015*"normal" + 0.010*"prepago"

7: 0.063*"primario" + 0.056*"transformador" + 0.048*"fusible" + 0.031*"aisladero" + 0.027*"poste"

8: 0.092*"transformador" + 0.032*"arbol" + 0.031*"fusible" + 0.023*"quemado" + 0.020*"aisladero"

9: 0.044*"transformador" + 0.032*"trenza" + 0.031*"ruta" + 0.029*"cuadrilla" + 0.026*"poste"

10: 0.041*"primaria" + 0.033*"repara" + 0.030*"linea" + 0.022

In [None]:
# Function that extracts the dominant topic
def assigntopic(doc):
    vector = lda_model[dictionary.doc2bow(doc)]

    # Keep the (topic, probability) pair with the highest probability for the document
    vector = max(vector, key=lambda item: item[1])
    return vector

# Function that extracts the topic number
def ext_topic(line):
    lista = list(line)
    topic = lista[0]
    return topic
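
The selection step can be illustrated without a trained model. This sketch assumes the model returns a list of `(topic_id, probability)` pairs per document, as in the `main_topic` column shown further down; the distribution here is made up and `dominant_topic` is a hypothetical name.

```python
def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])

# Invented topic distribution for one document.
distribution = [(1, 0.12), (5, 0.61), (9, 0.27)]

topic_id, prob = dominant_topic(distribution)
print(topic_id, prob)
```

Unpacking the tuple directly gives both the topic number and its probability, so the probability can be kept as a confidence score for the assignment.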

In [None]:
# Assign the dominant topic to each record
datos['main_topic'] = datos.apply(lambda row: assigntopic(row['tokens_proc']), axis=1)



In [None]:
# Assign the number of the dominant topic to each record
datos['num_topic'] = datos.apply(lambda row: ext_topic(row['main_topic']), axis=1)



In [None]:
datos.head()

Unnamed: 0,WORKLOGID,procesado,tokens_proc,tokensunidos,procesadotagged,tokenstagged,palabras_clave,main_topic,num_topic
0,2789392,se realizo poda en varios ramales secundarios ...,"[poda, varios, ramales, secundarios, fusible, ...",poda varios ramales secundarios fusible transf...,"[(se, PRON), (realizo, VERB), (poda, NOUN), (e...","[(poda, VERB), (varios, DET), (ramales, NOUN),...","[ramales, poda, poda, linea]","(5, 0.60738915)",5
1,2857541,se encuentra linea primarias de en el suelo se...,"[linea, primarias, suelo, procede, repara, dan...",linea primarias suelo procede repara dano caus...,"[(se, PRON), (encuentra, VERB), (linea, NOUN),...","[(linea, NOUN), (primarias, ADJ), (suelo, NOUN...","[linea, suelo, caida, circuito, cuchillas, rec...","(1, 0.72176915)",1
2,2437852,se visita la ruta y se inspeccionan dos poste ...,"[ruta, inspeccionan, dos, poste, reportados, c...",ruta inspeccionan dos poste reportados cuales ...,"[(se, PRON), (visita, VERB), (la, DET), (ruta,...","[(ruta, NOUN), (inspeccionan, VERB), (dos, NUM...","[ruta, poste, coco, mitad, terminal, viento, v...","(4, 0.96047205)",4
3,2685049,se abrio puente agua abajo de transformador qu...,"[abrio, puente, agua, abajo, transformador, re...",abrio puente agua abajo transformador quedo pe...,"[(se, PRON), (abrio, VERB), (puente, NOUN), (a...","[(abrio, ADJ), (puente, NOUN), (agua, NOUN), (...","[puente, agua, transformador, aisladero, arbol...","(9, 0.5714224)",9
4,2854713,se visita direccion y se encuentra linea pri m...,"[linea, pri, caida, arbol, habre, corte, repar...",linea pri caida arbol habre corte quedo pendie...,"[(se, PRON), (visita, VERB), (direccion, NOUN)...","[(linea, PROPN), (pri, PROPN), (caida, ADJ), (...","[corte, linea]","(5, 0.75161874)",5


In [None]:
datos[['WORKLOGID','num_topic']].to_csv('/content/drive/MyDrive/ProyectoIntegrador/Datos/main_topic.csv')
## Enable the following line to run the project locally
# datos[['WORKLOGID','num_topic']].to_csv('/content/main_topic.csv')


In [None]:
datos.num_topic.describe()

count    9428.000000
mean        5.411010
std         3.010718
min         0.000000
25%         3.000000
50%         5.000000
75%         8.000000
max        10.000000
Name: num_topic, dtype: float64

## Document representation

In [None]:
datos2 = datos.copy()  # independent copy, so columns added below do not alter datos

In [None]:
# Function that joins a list of tokens into a single text
def list_to_text (lista):
  text = ' '.join(lista)
  return text

In [None]:
# Apply list_to_text to the keyword lists
datos2['text'] = datos2['palabras_clave'].apply(lambda x: list_to_text(x))



In [None]:
## Collect the full list of (already processed) tokens
tokens = []
for t in datos2['palabras_clave']:
    tokens.extend(t)
print('Total number of tokens =', len(tokens))
## Frequency distribution of the tokens; its length is the BoW size
fdist_tokens = nltk.FreqDist(tokens)
print('BoW size =', len(fdist_tokens))

Total number of tokens = 40917
BoW size = 1809


In [None]:
df_frecuencia = pd.DataFrame([[key, fdist_tokens[key]] for key in fdist_tokens.keys()], columns=['palabra', 'frecuencia'])

In [None]:
df_frecuencia.describe()

Unnamed: 0,frecuencia
count,1809.0
mean,22.618574
std,137.091805
min,1.0
25%,1.0
50%,2.0
75%,7.0
max,3650.0


In [None]:
datos2['text'] 

0                                 ramales poda poda linea
1       linea suelo caida circuito cuchillas reconecta...
2       ruta poste coco mitad terminal viento viento e...
3       puente agua transformador aisladero arbol post...
4                                             corte linea
                              ...                        
9423           incuentra circuito aguas poste medio poste
9424                      puente estructura puente correo
9425                       puente oficiales riesgo riesgo
9426                            conector apoyo aguas caja
9427                                      red poda cancha
Name: text, Length: 9428, dtype: object

In [None]:
# Build the list of texts
texts = datos2['text'].tolist()

**Bit vector representation**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Generate a binary bag-of-words (bit vector) representation of the dataset

bitvector = CountVectorizer(max_features=500, binary=True)  # keep only the 500 most frequent terms
features_bv = bitvector.fit_transform(texts)
df_bitvector = pd.DataFrame(
    features_bv.todense(),
    columns=bitvector.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
)
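
What the binary representation stores can be shown with a plain-Python sketch. This is a simplified stand-in for `CountVectorizer(binary=True)`, not its implementation: it splits on whitespace and, as an assumption of the sketch, ranks terms by document frequency with alphabetical tie-breaking. `binary_vectors` is a hypothetical name.

```python
from collections import Counter

def binary_vectors(texts, max_features=None):
    """One 0/1 row per text over a vocabulary of the most frequent terms."""
    docs = [set(text.split()) for text in texts]
    doc_freq = Counter(tok for doc in docs for tok in doc)
    # Rank by document frequency, then alphabetically (assumption of this sketch).
    ranked = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))
    vocab = sorted(ranked[:max_features])
    rows = [[1 if term in doc else 0 for term in vocab] for doc in docs]
    return vocab, rows

texts = ["ramales poda poda linea", "corte linea", "red poda cancha"]
vocab, rows = binary_vectors(texts, max_features=3)

print(vocab)
print(rows)
```

Each cell only records presence or absence, so the repeated "poda" in the first text counts once.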



In [None]:
df_bitvector2 = datos2[['WORKLOGID']].merge(df_bitvector, left_on=None, right_on=None, left_index=True , right_index=True)
df_bitvector2

Unnamed: 0,WORKLOGID,abeja,acceso,accidente,aceite,acercamiento,acero,acometidas,acompanamiento,actividad,...,vendaval,veredas,vez,viaje,vias,vida,viento,vientos,voltaje,zona
0,2789392,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2857541,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2437852,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,2685049,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2854713,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9423,3108241,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9424,3038792,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9425,2943207,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9426,3070258,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# Export the matrix to review the document representation
#df_bitvector2.to_csv('/content/drive/MyDrive/ProyectoIntegrador/Datos/matrizbitvector.csv')

**TF-IDF representation**

In [None]:
# Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Use the default tokenizer in TfidfVectorizer
tfidf = TfidfVectorizer(min_df=1, max_df=0.5, ngram_range=(1, 1))
features_tfidf = tfidf.fit_transform(texts)
df_tfidf = pd.DataFrame(
    features_tfidf.todense(),
    columns=tfidf.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
)
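
The weights in this matrix can be reproduced by hand on a toy corpus. The sketch below assumes scikit-learn's documented defaults (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2-normalized rows). It leaves out the `max_df=0.5` filter and the default token pattern, and `tfidf_rows` is a hypothetical name.

```python
import math
from collections import Counter

def tfidf_rows(texts):
    """TF-IDF rows with smoothed idf and L2 normalization."""
    docs = [text.split() for text in texts]
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    vocab = sorted(doc_freq)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

texts = ["ramales poda poda linea", "corte linea", "red poda cancha"]
vocab, rows = tfidf_rows(texts)

for row in rows:
    print([round(w, 3) for w in row])
```

In the second text, "corte" appears in one document and "linea" in two, so "corte" receives the larger weight. Because rows have unit length, Euclidean distances between them relate directly to cosine similarity, which is convenient for k-means.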




In [None]:
df_tfidf.head(5)

Unnamed: 0,abajos,abeja,abertura,abriaqui,abrio,acceso,accidente,aceite,acera,acercamiento,...,villa,vision,volcamiento,voltaje,voz,wilmer,xlpe,yugo,zona,zorra
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# Export the matrix to review the document representation
#df_tfidf.to_csv('/content/drive/MyDrive/ProyectoIntegrador/Datos/matriztfidf.csv')

## **K-means**

In [None]:
from sklearn.cluster import KMeans
modelkmeans = KMeans(n_clusters=15, init='k-means++', n_init=100)
modelkmeans.fit(df_tfidf)

KMeans(n_clusters=15, n_init=100)
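
For readers unfamiliar with the algorithm behind the call above, here is a minimal sketch of the two alternating steps of k-means (Lloyd's algorithm) in plain Python. It starts from caller-supplied centroids, whereas the notebook relies on scikit-learn's `k-means++` initialization with 100 restarts; the function and the toy points are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Plain Lloyd iterations from caller-supplied starting centroids."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
            for point in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(list(old))  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centroids = kmeans(points, centroids=[[0.0, 0.0], [10.0, 10.0]])

print(labels)
print(centroids)
```

The result depends on the starting centroids, which is why running many initializations (`n_init=100`) and keeping the best one makes the clustering more stable.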

In [None]:
#%% Label the DataFrame with the cluster assigned to each record.
labels = modelkmeans.predict(df_tfidf)


In [None]:
datos2['label'] = labels



In [None]:
datos2.head(5)

Unnamed: 0,WORKLOGID,procesado,tokens_proc,tokensunidos,procesadotagged,tokenstagged,palabras_clave,main_topic,num_topic,text,label
0,2789392,se realizo poda en varios ramales secundarios ...,"[poda, varios, ramales, secundarios, fusible, ...",poda varios ramales secundarios fusible transf...,"[(se, PRON), (realizo, VERB), (poda, NOUN), (e...","[(poda, VERB), (varios, DET), (ramales, NOUN),...","[ramales, poda, poda, linea]","(5, 0.60738915)",5,ramales poda poda linea,9
1,2857541,se encuentra linea primarias de en el suelo se...,"[linea, primarias, suelo, procede, repara, dan...",linea primarias suelo procede repara dano caus...,"[(se, PRON), (encuentra, VERB), (linea, NOUN),...","[(linea, NOUN), (primarias, ADJ), (suelo, NOUN...","[linea, suelo, caida, circuito, cuchillas, rec...","(1, 0.72176915)",1,linea suelo caida circuito cuchillas reconecta...,7
2,2437852,se visita la ruta y se inspeccionan dos poste ...,"[ruta, inspeccionan, dos, poste, reportados, c...",ruta inspeccionan dos poste reportados cuales ...,"[(se, PRON), (visita, VERB), (la, DET), (ruta,...","[(ruta, NOUN), (inspeccionan, VERB), (dos, NUM...","[ruta, poste, coco, mitad, terminal, viento, v...","(4, 0.96047205)",4,ruta poste coco mitad terminal viento viento e...,0
3,2685049,se abrio puente agua abajo de transformador qu...,"[abrio, puente, agua, abajo, transformador, re...",abrio puente agua abajo transformador quedo pe...,"[(se, PRON), (abrio, VERB), (puente, NOUN), (a...","[(abrio, ADJ), (puente, NOUN), (agua, NOUN), (...","[puente, agua, transformador, aisladero, arbol...","(9, 0.5714224)",9,puente agua transformador aisladero arbol post...,14
4,2854713,se visita direccion y se encuentra linea pri m...,"[linea, pri, caida, arbol, habre, corte, repar...",linea pri caida arbol habre corte quedo pendie...,"[(se, PRON), (visita, VERB), (direccion, NOUN)...","[(linea, PROPN), (pri, PROPN), (caida, ADJ), (...","[corte, linea]","(5, 0.75161874)",5,corte linea,5


In [None]:
#datos2[['WORKLOGID','label']].to_csv('/content/drive/MyDrive/ProyectoIntegrador/Datos/label_kmeans.csv')
# Enable the following line to run the project locally
#datos2[['WORKLOGID','label']].to_csv('/content/label_kmeans.csv')

In [None]:
datos2['label'].value_counts()

0     3270
3     1069
9      528
6      525
2      453
4      437
5      426
12     395
14     383
8      374
1      363
7      344
10     329
13     292
11     240
Name: label, dtype: int64

In [None]:
## Most frequent keywords in each cluster

# Loop over the clusters and filter the records of each one
for l in datos2['label'].unique().tolist():
  grupo_l = datos2[datos2['label']==l]

  tokens=[]

  # Collect the keywords of every record in the cluster
  for t in grupo_l['palabras_clave']:
    tokens.extend(t)
  # Build the frequency distribution once, after all tokens are collected
  fdist_tokens = nltk.FreqDist(tokens)

  print("grupo "+str(l))
  print(fdist_tokens.most_common(10))


grupo 9
[('poda', 843), ('cuadrilla', 148), ('arbol', 110), ('red', 54), ('aguas', 50), ('transformador', 49), ('circuito', 43), ('apoyo', 39), ('riesgo', 36), ('trabajo', 35)]
grupo 7
[('circuito', 424), ('poda', 89), ('transformador', 53), ('poste', 46), ('linea', 35), ('cuadrilla', 35), ('aguas', 34), ('arbol', 29), ('red', 27), ('trabajo', 23)]
grupo 0
[('poste', 533), ('transformador', 392), ('riesgo', 312), ('trabajo', 242), ('poda', 228), ('puente', 221), ('apoyo', 213), ('ruta', 196), ('cable', 181), ('voltaje', 148)]
grupo 14
[('transformador', 556), ('poste', 46), ('puente', 33), ('poda', 27), ('apoyo', 23), ('aguas', 21), ('red', 20), ('linea', 20), ('ruta', 17), ('conductor', 17)]
grupo 5
[('linea', 530), ('ruta', 110), ('poste', 93), ('poda', 84), ('cuadrilla', 83), ('transformador', 60), ('red', 43), ('puente', 36), ('riesgo', 33), ('peligro', 32)]
grupo 6
[('red', 717), ('poda', 202), ('poste', 195), ('riesgo', 148), ('arbol', 137), ('transformador', 78), ('contacto', 66