# Text processing with spaCy
If you do not have a GPU, do not run any of the cells that install packages with !pip. The following requirements must be met to run spaCy successfully on a GPU with the Turing architecture:
 1. Install spaCy and its Spanish language package in order to process text in that language.
 2. The CUDA version used here is 11.0, which works with NVIDIA GPUs and cuDNN 8.0+. Your machine needs it to allocate memory on the graphics card. It is installed outside of Python: [CUDA](https://developer.nvidia.com/cuda-downloads).
 3. Install the PyTorch build that matches your CUDA version. Remember that spaCy is installed here with the transformers extra, which must be specified so that the classification packages are downloaded.
 4. Thinc GPU Ops and CuPy are required for the machine to recognize the graphics card and make it accessible from Jupyter.
 
With this in place you can start running complex models on your workstation. Remember, as noted above, not to run any line that installs a package on a machine without a GPU.
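Before running the installation cell, it can help to check which of the packages listed above are already importable. The following is a minimal sketch using only the standard library; the list `REQUIRED` and the helper names `missing_packages` and `gpu_status` are illustrative and not part of the original notebook.

```python
import importlib.util

# Import names (not pip names) of the packages mentioned in the list above.
REQUIRED = ["spacy", "torch", "cupy", "thinc"]


def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def gpu_status():
    """Return True/False if PyTorch is installed, or None if it is not."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch  # imported lazily so the check works without PyTorch

    return torch.cuda.is_available()


print("Missing packages:", missing_packages(REQUIRED))
print("CUDA visible to PyTorch:", gpu_status())
```

If `gpu_status()` reports `False` or `None`, the GPU-specific install lines are probably not appropriate for that machine.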

In [1]:
# Unsupervised analysis: workspace setup
# Install the GPU-accelerated packages
!pip install -U pip setuptools wheel
!pip install -U spacy[cuda110,transformers,lookups]
!python -m spacy download es_core_news_sm
!pip install torch===1.7.1+cu110 torchvision===0.8.2+cu110 torchaudio===0.7.2 -f https://download.pytorch.org/whl/torch_stable.html --user
!pip install cupy-cuda110
!pip install thinc-gpu-ops

Collecting es-core-news-sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-3.0.0/es_core_news_sm-3.0.0-py3-none-any.whl (13.9 MB)
[+] Download and installation successful
You can now load the package via spacy.load('es_core_news_sm')


Looking in links: https://download.pytorch.org/whl/torch_stable.html


In [3]:
import spacy
import pandas as pd

In [5]:
# Load the dataset
df = pd.read_excel('estudiante_14.xlsx', sheet_name=0)  # sheet_name also accepts a sheet name, or None to fetch all sheets
mylist = df['Snippet'].tolist()

In [6]:
mylist

['@Bogota Con esa mano de ñeros colombianos y venecos,que se ven desfilando por bogota sin control alguno.podemos decir que esa platica se perdió..',
 '@RamiroLujanG @jsanchezcristo UD es venezolano? No sea metido',
 '@_El__Patriota_ Tengo 6 excelentes trabajadores Venezolanos. A los malos que se los lleven y encarcelan al que la cague. (Seá de donde sea)',
 '@rodolfaso3 @_El__Patriota_ Tu y yo conocemos bien el tema Así es Colombiano siempre con competencias xra el trabajo y no se fueron en un viaje de 2 años, fueron 40 años para construir lo mejor de Vzla. Venezolano siempre fue burócrata',
 '@IsaacMadriid @VezNancy @_El__Patriota_ Es que Venezuela debió en su momento, meterlos a la cárcel, deportarlos, a todos los bandidos que fueron a dañarles el país. Es lo mismo que pido en estos momentos a los venezolanos que vienen a hacer lo mismo acá. Los hijos a que te refieres los he visto que vienen a trabajar',
 'En tanta desesperación se Atreverán los narcos VENEZOLANOS a lanzar un ataqu
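The snippets contain @mentions and may contain hashtags and URLs, which a news-trained pipeline can mistake for entities. One possible way to normalize them before calling the pipeline is sketched below; the function `clean_snippet` and its regular expressions are assumptions for illustration, not something the original notebook does.

```python
import re

MENTION = re.compile(r"@\w+")      # Twitter handles such as @Bogota
URL = re.compile(r"https?://\S+")  # plain http/https links
HASHTAG = re.compile(r"#(\w+)")    # keep the word, drop the '#'
SPACES = re.compile(r"\s+")


def clean_snippet(text):
    """Strip URLs and @mentions, unwrap hashtags, and collapse whitespace."""
    text = URL.sub(" ", text)
    text = MENTION.sub(" ", text)
    text = HASHTAG.sub(r"\1", text)
    return SPACES.sub(" ", text).strip()


# Sample taken from the list printed above.
sample = "@RamiroLujanG @jsanchezcristo UD es venezolano? No sea metido"
print(clean_snippet(sample))  # UD es venezolano? No sea metido
```

A cleaned list could then be built with `[clean_snippet(s) for s in mylist]`. Whether removing handles helps or hurts depends on the analysis, since handles such as politicians' accounts may themselves be informative.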

In [7]:
# Ask spaCy to use the GPU when one is available (note the parentheses: the function must be called)
spacy.prefer_gpu()
# Load the Spanish core model
nlp = spacy.load("es_core_news_sm")
# Run the pipeline: tokenizer, tagger, parser, lemmatizer and entity recognition
docs = list(nlp.pipe(mylist))

In [8]:
for doc in docs:
    print([(ent.text, ent.label_) for ent in doc.ents])

[]
[('UD', 'ORG'), ('No sea metido', 'MISC')]
[('Tengo 6', 'MISC'), ('Venezolanos', 'ORG'), ('A los malos', 'MISC')]
[('Tu y yo conocemos bien el tema Así', 'MISC'), ('Colombiano', 'LOC'), ('Vzla', 'PER'), ('Venezolano', 'LOC')]
[('Venezuela', 'LOC'), ('Los hijos a que te refieres los he visto', 'MISC')]
[('Atreverán', 'LOC'), ('Dicen', 'LOC')]
[('Soberano?', 'MISC'), ('Soberano', 'PER')]
[('Honorable Amigo', 'MISC'), ('Los Miserables', 'MISC'), ('Venezolanos', 'LOC'), ('Colombia', 'LOC'), ('SANTODOMINGO', 'MISC')]
[]
[('COVID-19', 'LOC'), ('Yukpa', 'LOC'), ('El portador del virus', 'MISC'), ('El Escobal', 'LOC')]
[('@IvanDuque @ClaudiaLopez Usted', 'MISC'), ('Pobre', 'LOC')]
[]
[('@MariaFdaCabal Pues', 'LOC'), ('HEDIONDOS', 'MISC'), ('de Colombia', 'LOC')]
[('Yédinson Ned Florez Duart', 'PER'), ('Por medio de su cuenta', 'MISC'), ('Instagram', 'PER'), ('Shannon', 'PER'), ('Lima', 'LOC'), ('Por medio de una transmisión', 'MISC'), ('Instagram', 'LOC'), ('Pipe Bueno', 'PER'), ('Stephanie

[('@petrogustavo Usted', 'PER'), ('de Colombia', 'LOC'), ('Duque', 'PER'), ('Uribe', 'PER')]
[]
[('Bucaramanga', 'LOC'), ('Colombia', 'LOC')]
[('Espacios', 'PER'), ('Código de Policía', 'MISC')]
[('Fácil', 'PER')]
[]
[]
[('Haití', 'LOC'), ('Hospital San Francisco de Asís', 'LOC'), ('Quibdó', 'LOC'), ('Chocó', 'LOC'), ('Colombia', 'LOC')]
[('Quédate en tu casa no mas', 'MISC')]
[('Saquen', 'PER')]
[('Colombia NO MAS', 'MISC'), ('Bucaramanga', 'LOC'), ('@Citytv', 'LOC')]
[('Los venecos', 'MISC'), ('Ése', 'LOC'), ('Qué', 'LOC')]
[('Alguien', 'MISC'), ('Venezolanos', 'LOC'), ('Morir', 'PER')]
[('@petrogustavo Quien', 'PER'), ('Xenófobo', 'PER')]
[]
[('Idiota', 'PER'), ('Venezolanos', 'PER'), ('Sin ánimos de ofender la mayoría de los colombianos', 'MISC'), ('Europa', 'LOC'), ('Australia', 'LOC'), ('Canadá', 'LOC'), ('Chile', 'LOC')]
[('@elqsigueesporqs', 'MISC'), ('@ghitis Hoy subieron un video', 'MISC'), ('Twitter', 'MISC')]
[('@Isa_Calde No', 'MISC'), ('Colombia', 'LOC'), ('Medellín', 'LO

[]
[('🏻\u200d♀', 'PER')]
[]
[('Ojo', 'PER')]
[('de Colombia', 'LOC'), ('Venezolanos', 'PER'), ('Caracas', 'LOC')]
[('@VickyDavilaH @RevistaSemana Sigan', 'PER')]
[('Convivencia de Medellín', 'MISC')]
[('@SilvanoSerranoG Sr', 'MISC'), ('Giros', 'MISC')]
[('@flakajamie', 'LOC'), ('🤔🤔', 'LOC'), ('Saludos', 'LOC')]
[('Venecos', 'ORG')]
[('Capresoca', 'PER')]
[('@Eldiariodelanad Venezolanos', 'MISC'), ('Escriban', 'MISC')]
[('El tema de', 'MISC')]
[('Algo así como el familiar', 'MISC')]
[('@Alcalde_Verde @ClaudiaLopez @Asocapitales', 'MISC'), ('@WRadioColombia', 'LOC')]
[('de Colombia', 'LOC'), ('CARIDAD', 'ORG'), ('CASA', 'ORG')]
[('Ojo', 'PER'), ('Colombianos', 'LOC'), ('Venecos', 'ORG'), ('Uribe', 'PER'), ('COVID', 'LOC')]
[]
[('Los queridos hermanos venezolanos', 'MISC')]
[('@petrogustavo Tampoco', 'PER')]
[]
[('Colombia', 'LOC')]
[]
[('@ClaudiaLopez', 'LOC'), ('@Citytv Están', 'LOC'), ('CuarentenaObligatoriaYa #', 'MISC'), ('ToqueDeQueda', 'MISC')]
[('@AlexLopezMaya', 'LOC'), ('@IvanDu

[('@vcastrogomez', 'ORG'), ('@Citytv Hp', 'MISC')]
[('@noticiassincon1', 'PER')]
[('@Citytv Extranjeros?', 'MISC'), ('Venezolanos', 'LOC'), ('Venecos HDP', 'ORG')]
[('Venezolanos de Colombia', 'LOC')]
[]
[('Colombiano', 'LOC'), ('Venezolano', 'LOC')]
[('@iraidesh', 'MISC'), ('Colombia', 'LOC'), ('Ayer', 'PER'), ('Eso hacen muchos', 'MISC'), ('Eso da asco', 'MISC')]
[]
[('@ImpactoNewsCol Puroo', 'PER')]
[('Cúcuta #NorteDeSantander \u2066@PoliciaNteSder\u2069', 'MISC')]
[('Gallo 21 #RuletonActivo#Animalitos#ResultadosRuletonActivo#VenezolanosEnPeru', 'MISC')]
[('Venezolanos', 'LOC')]
[("Video 'Vendaval", 'MISC'), ('Bogotá #', 'MISC')]
[('Yoimer', 'PER'), ('Los Majupay', 'MISC'), ('Maicao', 'LOC')]
[('FAV', 'ORG'), ('Venezolanos', 'LOC')]
[('Malparidos venecos', 'MISC')]
[('@caroandujar', 'MISC')]
[('Venezolanos', 'LOC')]
[('Mediante', 'PER'), ('Alcalde de #Yopal', 'LOC')]
[('#Colombia', 'LOC'), ('DesdeLaVentana', 'MISC'), ('@gusi', 'MISC'), ('Junio', 'PER')]
[('@MinjusticiaCo @INPEC_Colo

[('@superdeporpasto Mas', 'MISC')]
[]
[('Oso 16 #RuletonActivo#Animalitos#ResultadosRuletonActivo#VenezolanosEnPeru', 'MISC')]
[('Bogotá', 'LOC'), ('Taller Blanco Ediciones', 'ORG')]
[('Por falsear!!', 'MISC')]
[('¿que hacen acá entonces los venezolanos?', 'MISC')]
[('Jajajaja', 'LOC')]
[('@AJGORI Solo', 'LOC')]
[('de Colombia', 'LOC')]
[]
[('@NicolasMaduro Saque', 'LOC'), ('La justicia divina llegará', 'MISC')]
[('@danicasmon', 'MISC'), ('Jajaja', 'MISC')]
[('Colombia', 'LOC')]
[('NoticieroW', 'MISC'), ('El productor venezolano', 'MISC'), ('Pepsi Max', 'MISC')]
[('¿Eres Venezolano', 'MISC')]
[('Ofresco', 'LOC')]
[('#Ultimahora Cúcuta', 'MISC')]
[('Youtuber', 'PER')]
[('Gracias Santi Ramírez', 'PER')]
[('@veneco_sandwich', 'ORG')]
[('@el_pais Necesita', 'PER')]
[]
[('Extranjeros #', 'MISC')]
[('Varadero', 'LOC')]
[('CUIDADO', 'MISC'), ('VENECOS', 'ORG'), ('TOSER A', 'ORG'), ('CONJUNTOS!', 'MISC')]
[('BEO BEO', 'MISC'), ('Transmilenios', 'MISC')]
[('Culebra 36 #RuletonActivo#Animalitos#
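The printed entities are noisy (handles tagged as `LOC`, whole phrases tagged as `MISC`), so a per-label tally can give a quick overview. The sketch below counts labels over three rows copied from the output above; the helper name `label_counts` is illustrative.

```python
from collections import Counter

# Three rows copied verbatim from the printed output above.
rows = [
    [('UD', 'ORG'), ('No sea metido', 'MISC')],
    [('Tengo 6', 'MISC'), ('Venezolanos', 'ORG'), ('A los malos', 'MISC')],
    [('Venezuela', 'LOC'), ('Los hijos a que te refieres los he visto', 'MISC')],
]


def label_counts(rows):
    """Count how often each entity label occurs across all documents."""
    return Counter(label for ents in rows for _, label in ents)


print(label_counts(rows).most_common())  # [('MISC', 4), ('ORG', 2), ('LOC', 1)]
```

With the real data, `rows` could be built as `[[(ent.text, ent.label_) for ent in doc.ents] for doc in docs]`.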

In [9]:
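# spaCy has no text categories by default. TODO: add a text classifier (textcat) to the pipeline
# doc.cats is a dict mapping label -> score, so iterate over its items
for doc in docs:
    print([(label, score) for label, score in doc.cats.items()])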

[]
[]
[]
...

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\JOSE\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [31]:
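# Not needed in Python 3: stopwords.words('spanish') already returns str, and str has no .decode(),
# so running the line below uncommented raises the AttributeError shown.
#stopwords = [word.decode('utf-8') for word in stopwords.words('spanish')]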

AttributeError: 'str' object has no attribute 'decode'
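The error above comes from a Python 2 idiom. In Python 3, `str` is already Unicode text and only `bytes` objects have a `.decode()` method. A small self-contained illustration (the example word is arbitrary):

```python
word = "también"                 # str: already Unicode text in Python 3
encoded = word.encode("utf-8")   # bytes: the UTF-8 representation

print(type(word).__name__, type(encoded).__name__)  # str bytes
print(hasattr(word, "decode"))                      # False
print(encoded.decode("utf-8") == word)              # True
```

This is why the stop-word list can be passed to the vectorizer directly, without any decoding step.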

In [32]:
vectorizer = TfidfVectorizer(stop_words=stopwords.words("spanish"))
X = vectorizer.fit_transform(mylist)
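As a rough illustration of what `TfidfVectorizer` computes with its default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row), here is a pure-Python sketch. The whitespace tokenizer is a simplification of scikit-learn's regex tokenizer, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0
           for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({term: w / norm for term, w in raw.items()})
    return rows


toy_docs = ["colombia gobierno", "colombia gente"]  # hypothetical toy corpus
weights = tfidf(toy_docs)
print(weights[0])
```

In the first toy document, "gobierno" receives a higher weight than "colombia" because "colombia" occurs in both documents and is therefore less distinctive.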

In [33]:
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

KMeans(max_iter=100, n_clusters=2, n_init=1)
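With `n_init=1` and no `random_state`, the fit above depends on a single random initialization, so the clusters may change from run to run. A minimal sketch of a more reproducible setup follows, assuming scikit-learn is installed; the toy corpus and the values `n_init=10` and `random_state=0` are illustrative choices, not settings from the original notebook.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with two obvious topics.
toy_docs = ["gato perro", "perro gato gato", "luna sol", "sol luna luna"]

toy_vectorizer = TfidfVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_docs)

# Several initializations plus a fixed seed make the result repeatable.
toy_model = KMeans(n_clusters=2, init="k-means++", n_init=10,
                   max_iter=100, random_state=0)
labels = toy_model.fit_predict(X_toy)

print(labels)              # cluster index assigned to each document
print(toy_model.inertia_)  # within-cluster sum of squared distances
```

Comparing `inertia_` across different values of `n_clusters` is one common, if informal, way to judge whether two clusters are enough.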

In [34]:
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()  # requires scikit-learn >= 1.0; get_feature_names() was removed in 1.2

In [35]:
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])

Cluster 0:
 venecos
 venezolano
 venezolana
 veneco
 si
 colombia
 país
 pueblo
 ivanduque
 venezuela
Cluster 1:
 venezolanos
 colombia
 país
 ivanduque
 si
 colombianos
 claudialopez
 venezuela
 gobierno
 gente
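The two clusters share several top terms, which suggests they are not sharply separated. One simple way to quantify that is the Jaccard overlap of the two term lists, sketched below with the terms copied from the output above; the helper name `jaccard` is illustrative.

```python
# Top terms copied from the output above.
cluster_0 = ["venecos", "venezolano", "venezolana", "veneco", "si",
             "colombia", "país", "pueblo", "ivanduque", "venezuela"]
cluster_1 = ["venezolanos", "colombia", "país", "ivanduque", "si",
             "colombianos", "claudialopez", "venezuela", "gobierno", "gente"]


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


shared = sorted(set(cluster_0) & set(cluster_1))
print(shared)                                   # ['colombia', 'ivanduque', 'país', 'si', 'venezuela']
print(round(jaccard(cluster_0, cluster_1), 2))  # 0.33
```

Half of each top-10 list is shared. Terms such as "si" and the Twitter handles could be candidates for an extended stop-word list, though that is a judgment call for the analyst.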
