
# Universidad Nacional Autónoma de México
# Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas
# Datos Masivos II
# José de Jesús Tapia López
# Mini-project: Topic modeling with SVD
# October 7, 2020


## Objective

The goal of this mini-project is to identify the topics in a collection of biographies, using singular value decomposition (SVD) to extract them.

The dataset used is: https://www.kaggle.com/sameersmahajan/people-wikipedia-data
The column used is: `text`

CSV contents: URIs, people's names, and the text of each person's Wikipedia page.

In [None]:
#from google.colab import files
#files.upload()

Saving people_wiki.csv to people_wiki.csv


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
#nltk.download('stopwords')

# load the data and show the first 5 rows of the CSV
datos = pd.read_csv('drive/My Drive/people_wiki.csv')
#datos.drop(datos.tail(42500).index,inplace=True) 
datos.head()

| | URI | name | text |
|---|---|---|---|
| 0 | <http://dbpedia.org/resource/Digby_Morrell> | Digby Morrell | digby morrell born 10 october 1979 is a former... |
| 1 | <http://dbpedia.org/resource/Alfred_J._Lewy> | Alfred J. Lewy | alfred j lewy aka sandy lewy graduated from un... |
| 2 | <http://dbpedia.org/resource/Harpdog_Brown> | Harpdog Brown | harpdog brown is a singer and harmonica player... |
| 3 | <http://dbpedia.org/resource/Franz_Rottensteiner> | Franz Rottensteiner | franz rottensteiner born in waidmannsfeld lowe... |
| 4 | <http://dbpedia.org/resource/G-Enka> | G-Enka | henry krvits born 30 december 1974 in tallinn ... |


In [None]:
datos.shape

(42786, 3)

In [None]:
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]


def convert_list_to_string(org_list, separator=' '):
    # join a list of tokens back into a single string
    return separator.join(org_list)


In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
# Preprocess the documents

# replace digits, punctuation and special characters with spaces
# (regex=True is explicit because pandas >= 2.0 treats the pattern as a literal string by default)
datos['clean_text'] = datos['text'].str.replace("[^a-zA-Z#]", " ", regex=True)
#datos['clean_text'] = datos['clean_text'].apply(lambda x: ' '.join([item for item in x if item not in stop_words])) # disabled: stop words are removed in a later cell
datos['clean_text'] = datos['clean_text'].apply(lambda x: x.lower())
datos['clean_text'] = datos.clean_text.apply(lemmatize_text)

In [None]:
# join each list of lemmas back into one string
# (a vectorized apply avoids the slow row-by-row chained assignment)
datos['clean_text'] = datos['clean_text'].apply(convert_list_to_string)

In [None]:
stop_words = set(stopwords.words('english'))  # a set gives fast membership tests

# note: lemmatization ran first, so altered forms such as 'wa' (from 'was')
# no longer match the stop list and remain in the text
tokenized_doc = datos['clean_text'].apply(lambda x: x.split())  # tokenization
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])  # stop-word removal
datos['clean_text'] = tokenized_doc.apply(' '.join)  # detokenization

In [None]:
datos['text'][0]

'digby morrell born 10 october 1979 is a former australian rules footballer who played with the kangaroos and carlton in the australian football league aflfrom western australia morrell played his early senior football for west perth his 44game senior career for the falcons spanned 19982000 and he was the clubs leading goalkicker in 2000 at the age of 21 morrell was recruited to the australian football league by the kangaroos football club with its third round selection in the 2001 afl rookie draft as a forward he twice kicked five goals during his time with the kangaroos the first was in a losing cause against sydney in 2002 and the other the following season in a drawn game against brisbaneafter the 2003 season morrell was traded along with david teague to the carlton football club in exchange for corey mckernan he played 32 games for the blues before being delisted at the end of 2005 he continued to play victorian football league vfl football with the northern bullants carltons vfla

In [None]:
datos['clean_text'][0]

'digby morrell born october former australian rule footballer played kangaroo carlton australian football league aflfrom western australia morrell played early senior football west perth game senior career falcon spanned wa club leading goalkicker age morrell wa recruited australian football league kangaroo football club third round selection afl rookie draft forward twice kicked five goal time kangaroo first wa losing cause sydney following season drawn game brisbaneafter season morrell wa traded along david teague carlton football club exchange corey mckernan played game blue delisted end continued play victorian football league vfl football northern bullants carltons vflaffiliate acted playing assistant coach shifted box hill hawk retiring playing end season morrell wa senior coach strathmore football club essendon district football league leading club premier division premiership since ha coached west coburg football club also edflhe currently teach physical education parade colleg

Because singular value decomposition operates on matrices, the preprocessed documents must first be converted into a matrix.
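
As a minimal sketch of that conversion, the snippet below vectorizes a hypothetical three-document corpus (the documents and the names `toy_docs`, `toy_vectorizer`, and `toy_X` are made up for illustration and are not part of the dataset). Each row of the resulting matrix is a document and each column is a term.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus, used only for illustration
toy_docs = [
    "football club season league",
    "album band music tour",
    "football league coach season",
]

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix: rows = documents, columns = terms

print(toy_X.shape)                          # (3, 9): 3 documents, 9 distinct terms
print(sorted(toy_vectorizer.vocabulary_))   # the terms behind the columns, in column order
print(toy_X.toarray().round(2))             # tf-idf weights; zeros mark absent terms

# With the default settings each document vector is scaled to unit L2 norm
row_norms = np.linalg.norm(toy_X.toarray(), axis=1)
print(row_norms)
```

Terms shared by several documents (such as "football") receive a lower weight than terms unique to one document, which is the effect tf-idf is meant to produce.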

In [None]:
# Build a document-term matrix with TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english',
                             max_features=1000,  # keep at most 1000 terms (features), i.e. 1000 words
                             max_df=0.5,         # ignore terms that appear in more than 50% of the documents
                             smooth_idf=True)

X = vectorizer.fit_transform(datos['clean_text'])
print("Shape of the document-term matrix: ", X.shape)  # inspect the matrix shape
print(X)

Shape of the document-term matrix:  (42786, 1000)
  (0, 558)	0.09157224872188534
  (0, 157)	0.04753191653574068
  (0, 268)	0.0645198350133814
  (0, 904)	0.08234516604056165
  (0, 209)	0.04910754694003716
  (0, 246)	0.06997416400090573
  (0, 697)	0.08590038221884506
  (0, 245)	0.0698384296691039
  (0, 400)	0.08407448122693519
  (0, 152)	0.1365644428118901
  (0, 48)	0.06584758307151911
  (0, 681)	0.11567352624555806
  (0, 611)	0.08016523155735393
  (0, 678)	0.05535037648074053
  (0, 184)	0.06943043181712222
  (0, 274)	0.12617232275787008
  (0, 85)	0.07694438543986452
  (0, 214)	0.0640676289034298
  (0, 815)	0.15485816105732536
  (0, 334)	0.055029479767559564
  (0, 897)	0.08566213770435405
  (0, 922)	0.03835607153027166
  (0, 367)	0.0755989786543026
  (0, 944)	0.08114269817909812
  (0, 253)	0.07911375659655492
  :	:
  (42785, 928)	0.08250912473524345
  (42785, 990)	0.05947696083105489
  (42785, 805)	0.05575118930682469
  (42785, 397)	0.07420051679691327
  (42785, 291)	0.09624884530974578
  (42785, 

TruncatedSVD performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). Unlike PCA, this estimator does not center the data before computing the singular value decomposition, which means it can work efficiently with sparse matrices.

In particular, truncated SVD works on term-count and tf-idf matrices as returned by the vectorizers in sklearn.feature_extraction.text. In that context it is known as latent semantic analysis (LSA).

Semantics: the study of the meaning, sense, or interpretation of linguistic signs such as symbols, words, expressions, or formal representations.

n_components: the desired dimensionality of the output data. It must be strictly less than the number of features. The default value (2) is useful for visualization. For LSA, a value of 100 is recommended.
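
The two properties above (sparse input, no centering) can be checked on a small case. This is only a sketch: the random sparse matrix `toy` is a hypothetical stand-in for a tf-idf matrix, and `algorithm="arpack"` is used here instead of the randomized solver so that the comparison with NumPy's exact SVD is tight.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Hypothetical random sparse matrix standing in for a tf-idf matrix
toy = sparse.random(50, 20, density=0.2, format="csr", random_state=0)

svd = TruncatedSVD(n_components=5, algorithm="arpack", random_state=0)
reduced = svd.fit_transform(toy)  # the sparse matrix is passed in directly

# Exact singular values of the same matrix, with no centering applied
exact = np.linalg.svd(toy.toarray(), compute_uv=False)[:5]

print(reduced.shape)  # (50, 5): one 5-dimensional vector per row
matches_uncentered = np.allclose(svd.singular_values_, exact)
print(matches_uncentered)
```

If the estimator centered the columns first, its singular values would correspond to the centered matrix rather than to `toy` itself.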


In [None]:
# Compute the singular value decomposition of the matrix with TruncatedSVD

from sklearn.decomposition import TruncatedSVD

svd_model = TruncatedSVD(n_components=100,
                         algorithm='randomized',
                         n_iter=100,
                         random_state=122)
# this model will hold the singular values


svd_model.fit(X)

# number of topics in this model
len(svd_model.components_)

100

Attributes

components_ : array of shape (n_components, n_features)

explained_variance_ : array of shape (n_components,). The variance of the training samples transformed by a projection onto each component.

explained_variance_ratio_ : array of shape (n_components,). Percentage of variance explained by each of the selected components.

singular_values_ : array of shape (n_components,). The singular values corresponding to each of the selected components. The singular values are equal to the 2-norms of the n_components variables in the lower-dimensional space.
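
The relationships between these attributes can be verified numerically. The sketch below uses a hypothetical random sparse matrix (`toy`) rather than the biography data, and the `arpack` solver so that the checks hold to floating-point precision.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Hypothetical random sparse matrix, used only for illustration
toy = sparse.random(60, 30, density=0.2, format="csr", random_state=1)

svd = TruncatedSVD(n_components=4, algorithm="arpack", random_state=0)
projected = svd.fit_transform(toy)  # the data in the lower-dimensional space

print(svd.components_.shape)  # (4, 30) = (n_components, n_features)

# singular_values_ equals the 2-norm of each column of the projected data
column_norms = np.linalg.norm(projected, axis=0)
print(np.allclose(column_norms, svd.singular_values_))

# explained_variance_ is the variance of each projected column
print(np.allclose(svd.explained_variance_, projected.var(axis=0)))

# The cumulative ratio shows how much variance the first k components retain
print(svd.explained_variance_ratio_.cumsum())
```

The cumulative sum of `explained_variance_ratio_` is one way to judge whether a chosen `n_components` retains enough of the variance.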



In [None]:
svd_model.singular_values_

array([49.07524902, 32.31128432, 29.73993252, 24.47580897, 23.00026837,
       20.04370716, 19.16582495, 18.61232911, 17.73750161, 17.3195915 ,
       17.04598652, 16.0732717 , 15.46929256, 15.18447347, 14.56434069,
       14.3143898 , 13.88275926, 13.67379564, 13.53257895, 13.31452872,
       13.15663804, 13.11559811, 12.92139241, 12.83363255, 12.66527568,
       12.64765859, 12.29616248, 12.2331827 , 12.12735474, 11.83334241,
       11.73173316, 11.70965812, 11.52490912, 11.45643355, 11.33844473,
       11.28817787, 11.13083963, 11.05395585, 10.93164892, 10.79740442,
       10.72471498, 10.64258231, 10.56767198, 10.49590858, 10.455643  ,
       10.43986879, 10.41826594, 10.31129607, 10.27318216, 10.24453887,
       10.23767067, 10.11519738, 10.06378006,  9.94549469,  9.89194312,
        9.84067331,  9.82992869,  9.76653593,  9.6775405 ,  9.67043229,
        9.5916582 ,  9.54906946,  9.48551877,  9.46802146,  9.42974459,
        9.38829962,  9.33775545,  9.28430082,  9.24625241,  9.19

In [None]:
svd_model.singular_values_.shape

(100,)

According to the TruncatedSVD documentation, with algorithm='randomized' the estimator is essentially a wrapper around sklearn.utils.extmath.randomized_svd, so we can call that function directly:



In [None]:
from sklearn.utils.extmath import randomized_svd

U, Sigma, VT = randomized_svd(X, n_components=100, 
                              n_iter=100, 
                              random_state=122)

print("Matrix U:\n",U)
print("Matrix Sigma: \n",Sigma)
print("Matrix VT:\n",VT)


Matrix U:
 [[ 5.19495625e-03  1.46259823e-02 -2.51288257e-03 ... -2.69407591e-03
   4.37788642e-04  3.70166543e-03]
 [ 3.46725547e-03 -2.70975040e-03 -3.03695719e-03 ... -6.95282221e-04
   3.87736057e-03  2.83557215e-03]
 [ 3.77378456e-03  3.98458019e-04  5.35808705e-03 ... -1.48511798e-03
   4.90369095e-03 -1.14763491e-03]
 ...
 [ 4.57945091e-03  1.01017278e-02 -2.05128313e-03 ... -4.62897853e-03
  -7.33097489e-05  7.69986956e-04]
 [ 4.02938243e-03 -2.60858650e-03 -1.68104542e-03 ... -8.41227538e-03
   1.94235981e-02 -5.19694716e-03]
 [ 4.35013795e-03 -5.02099835e-04 -2.85943233e-03 ... -1.04897970e-03
  -6.59885833e-04  4.50964546e-03]]
Matrix Sigma: 
 [49.07524902 32.31128432 29.73993252 24.47580897 23.00026837 20.04370716
 19.16582495 18.61232911 17.73750161 17.3195915  17.04598652 16.0732717
 15.46929256 15.18447347 14.56434069 14.3143898  13.88275926 13.67379564
 13.53257895 13.31452872 13.15663804 13.11559811 12.92139241 12.83363255
 12.66527568 12.64765859 12.29616248 12.233182
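
The singular values printed by the direct call agree with those stored in `svd_model.singular_values_`. A small check of the same wrapper relationship, assuming a hypothetical random sparse matrix (`toy`) and the same seed for both calls, might look like this:

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.utils.extmath import randomized_svd

# Hypothetical random sparse matrix, used only for illustration
toy = sparse.random(200, 50, density=0.1, format="csr", random_state=0)

# Same number of components, iterations and seed for both calls
model = TruncatedSVD(n_components=5, algorithm="randomized",
                     n_iter=10, random_state=122).fit(toy)
_, sigma, vt = randomized_svd(toy, n_components=5, n_iter=10, random_state=122)

same_singular_values = np.allclose(model.singular_values_, sigma)
print(same_singular_values)

# Components are compared up to sign, because the sign convention applied
# to the singular vectors may differ between the two entry points
print(np.allclose(np.abs(model.components_), np.abs(vt)))
```

Comparing absolute values sidesteps the sign ambiguity of singular vectors: flipping the sign of a left and right singular vector pair leaves the decomposition unchanged.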

This algorithm finds an approximate (usually very good) truncated singular value decomposition, using randomization to speed up the computation. It is particularly fast on large matrices from which only a small number of components is needed. For even more speed, n_iter can be set to 2 or less, at the cost of some precision.
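
The speed and precision trade-off controlled by `n_iter` can be observed on a small example. The sketch below assumes a hypothetical dense Gaussian matrix `A`, chosen because its singular values decay slowly, which is the hard case for randomized methods; the exact error values depend on the matrix and the seed.

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(0)
A = rng.standard_normal((300, 100))  # hypothetical matrix with a slowly decaying spectrum

# Reference: the 5 largest singular values from an exact SVD
exact = np.linalg.svd(A, compute_uv=False)[:5]

errors = {}
for n_iter in (1, 2, 10):
    _, approx, _ = randomized_svd(A, n_components=5, n_iter=n_iter, random_state=0)
    # largest absolute deviation from the exact singular values
    errors[n_iter] = np.abs(approx - exact).max()
    print(n_iter, errors[n_iter])
```

More power iterations cost additional passes over the matrix, so a small `n_iter` is mainly worthwhile when the spectrum decays quickly or when a rough approximation is enough.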

In [None]:
# The components of the model are the topics of the documents

# Retrieve the 1000 terms that make up the columns of the matrix
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
terms = vectorizer.get_feature_names_out()

# Show the most important words in each of the 100 topics
for i, comp in enumerate(svd_model.components_):
    # pair each of the 1000 terms with its weight in this topic
    terms_comp = zip(terms, comp)
    # sort by weight, largest first, and keep the top 5
    sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:5]
    # print the 5 most important words of the topic
    print("Topic " + str(i + 1) + ":")
    for t in sorted_terms:
        print(t[0])



Topic 1:
university
new
film
music
season
Topic 2:
season
league
game
played
team
Topic 3:
album
music
film
band
song
Topic 4:
album
band
party
music
election
Topic 5:
film
party
election
role
minister
Topic 6:
championship
world
team
race
medal
Topic 7:
art
book
museum
new
work
Topic 8:
music
art
orchestra
opera
symphony
Topic 9:
football
club
australian
goal
book
Topic 10:
art
film
album
museum
band
Topic 11:
law
football
court
state
united
Topic 12:
film
coach
book
university
novel
Topic 13:
coach
basketball
band
radio
head
Topic 14:
role
tour
theatre
law
band
Topic 15:
band
tour
film
jazz
new
Topic 16:
new
york
republican
football
city
Topic 17:
law
radio
australian
game
court
Topic 18:
australian
australia
award
game
theatre
Topic 19:
new
york
law
minister
hockey
Topic 20:
canadian
hockey
canada
award
season
Topic 21:
radio
church
canadian
league
canada
Topic 22:
tour
golf
open
rugby
professional
Topic 23:
racing
church
race
season
new
Topic 24:
rugby
coach