![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/banner_1.png)

# Text tokenization

In this notebook you will learn to tokenize text using the specialized libraries scikit-learn (`sklearn`) and [nltk](https://www.nltk.org/).

This notebook is licensed under the [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-sa/3.0/deed.en_US). Special thanks to [Kevin Markham](https://github.com/justmarkham).

## General instructions

Tokenization is a fundamental step in cleaning text data, and it can improve the performance of predictive natural language processing models. In this notebook you will tokenize the text of the UCI online news popularity data set. For more details about the data set, follow this [link](https://archive.ics.uci.edu/ml/datasets/online+news+popularity#).

To complete the activity, follow the instructions that accompany each cell of the notebook.

### Import the data set and libraries

In [None]:
# SUGGESTED: Uncomment the following line of code if you need to install the basic libraries used in this notebook
# If you need additional libraries, you can add them to the file Semana 4\requirements.txt
#!pip install -r requirements.txt

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Import libraries
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

%matplotlib inline

In [4]:
# Load the data from the .csv file
df = pd.read_csv('https://raw.githubusercontent.com/davidzarruk/MIAD_ML_NLP_2025/main/datasets/mashable_texts.csv', index_col=0)
df.head()

Unnamed: 0,author,author_web,shares,text,title,facebo,google,linked,twitte,twitter_followers
0,Seth Fiegerman,http://mashable.com/people/seth-fiegerman/,4900,\nApple's long and controversial ebook case ha...,The Supreme Court smacked down Apple today,http://www.facebook.com/sfiegerman,,http://www.linkedin.com/in/sfiegerman,https://twitter.com/sfiegerman,14300
1,Rebecca Ruiz,http://mashable.com/people/rebecca-ruiz/,1900,Analysis\n\n\n\n\n\nThere is a reason that Don...,Every woman has met a man like Donald Trump,,,,https://twitter.com/rebecca_ruiz,3738
2,Davina Merchant,http://mashable.com/people/568bdab351984019310...,7000,LONDON - Last month we reported on a dog-sized...,Adorable dog-sized rabbit finally finds his fo...,,https://plus.google.com/105525238342980116477?...,,,0
3,Scott Gerber,[],5000,Today's digital marketing experts must have a ...,15 essential skills all digital marketing hire...,,,,,0
4,Josh Dickey,http://mashable.com/people/joshdickey/,1600,"LOS ANGELES — For big, fun, populist popcorn m...",Mashable top 10: 'The Force Awakens' is the be...,,https://plus.google.com/109213469090692520544?...,,https://twitter.com/JLDlite,11200


In [5]:
df.iloc[1]['text']

'Analysis\n\n\n\n\n\nThere is a reason that\xa0Donald Trump\'s\xa0outrageous statements and behavior feel familiar to many women.\xa0\nIt\'s not because they know his declarative style and trademark shrug from reality television or political debates. Nor is it because his outsized role in American business made an unforgettable impression on them.\nSEE ALSO: As Donald Trump targets women, Republicans say he could cost the party everything\nThe eerie familiarity is more personal than that. They know Trump because they\'ve encountered a man like him at home, work, on social media or in a relationship.\xa0\nThis man extols the virtues of women, but has no problem reducing them to sex objects. He casts himself as unflappable, but blames a woman when his weaknesses are revealed. He insists on personal responsibility, but denies, deflects and perhaps even turns violent when accused of wrongdoing.\xa0\n\nThe media is so after me on women  Wow, this is a tough business. Nobody has more respect

### Create the variable of interest

In [9]:
# Separate the variable of interest (y)
y = df.shares
y.describe()


count       82.000000
mean      3090.487805
std       8782.031594
min        437.000000
25%        893.500000
50%       1200.000000
75%       2275.000000
max      63100.000000
Name: shares, dtype: float64

In [7]:
# Bin the variable of interest (y) into four classes
y = pd.cut(y, [0, 893, 1200, 2275, 63200], labels=[0, 1, 2, 3])
y.value_counts()

shares
1    22
0    21
3    21
2    18
Name: count, dtype: int64
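
The bin edges passed to `pd.cut` are close to the quartiles reported by `describe()`. By default `pd.cut` builds right-closed intervals, so a value equal to an edge falls in the lower bin. The sketch below illustrates this with a handful of hypothetical share counts chosen to sit on or next to the edges.

```python
import pandas as pd

# Hypothetical share counts placed on and around the bin edges
toy_shares = pd.Series([437, 893, 894, 1200, 2275, 63100])

# Same edges and labels as the cell above: (0, 893], (893, 1200], (1200, 2275], (2275, 63200]
toy_labels = pd.cut(toy_shares, [0, 893, 1200, 2275, 63200], labels=[0, 1, 2, 3])

print(toy_labels.tolist())
```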

In [11]:
# Add the variable of interest to the dataframe
df['y'] = y
df

Unnamed: 0,author,author_web,shares,text,title,facebo,google,linked,twitte,twitter_followers,y
0,Seth Fiegerman,http://mashable.com/people/seth-fiegerman/,4900,\nApple's long and controversial ebook case ha...,The Supreme Court smacked down Apple today,http://www.facebook.com/sfiegerman,,http://www.linkedin.com/in/sfiegerman,https://twitter.com/sfiegerman,14300,4900
1,Rebecca Ruiz,http://mashable.com/people/rebecca-ruiz/,1900,Analysis\n\n\n\n\n\nThere is a reason that Don...,Every woman has met a man like Donald Trump,,,,https://twitter.com/rebecca_ruiz,3738,1900
2,Davina Merchant,http://mashable.com/people/568bdab351984019310...,7000,LONDON - Last month we reported on a dog-sized...,Adorable dog-sized rabbit finally finds his fo...,,https://plus.google.com/105525238342980116477?...,,,0,7000
3,Scott Gerber,[],5000,Today's digital marketing experts must have a ...,15 essential skills all digital marketing hire...,,,,,0,5000
4,Josh Dickey,http://mashable.com/people/joshdickey/,1600,"LOS ANGELES — For big, fun, populist popcorn m...",Mashable top 10: 'The Force Awakens' is the be...,,https://plus.google.com/109213469090692520544?...,,https://twitter.com/JLDlite,11200,1600
...,...,...,...,...,...,...,...,...,...,...,...
77,Adario Strange,http://mashable.com/people/adario-strange/,1900,"In recent years, a lot of time has been spent ...",Concept design a wraps a bike around your smar...,,,,https://twitter.com/adariostrange,5648,1900
78,Sandra Gonzalez,http://mashable.com/people/sandra-gonzalez/,1200,\n\n\n\nFrom the man whose messed-up brain gav...,Children from 'American Horror Story: Hotel' t...,,https://plus.google.com/100425018667086324110?...,,,0,1200
79,Nick Jaynes,http://mashable.com/people/nickjaynes/,2300,"For the last few years, Tesla has been playing...",Brabus Zero Emission makes the Tesla Model S s...,https://www.facebook.com/app_scoped_user_id/10...,https://plus.google.com/103396638359934416985?...,,https://twitter.com/NickJaynes,1319,2300
80,Stan Schroeder &amp; Brian Ries,[],1100,This story was updated at 5:55 p.m. ET with mo...,Chaos at Croatia's border as thousands of refu...,,,,,0,1100


### Create predictor variables X_A - tokenization without cleaning

In [12]:
# Separate the predictor variable (X); only the text of the article is used
X = df.text

In [13]:
# Build the document-term matrix from X with CountVectorizer
vect_A = CountVectorizer()
X_dtm_A = vect_A.fit_transform(X)
temp_A = X_dtm_A.todense()  # dense copy of the sparse matrix, for inspection only

In [None]:
temp_A.shape  # the tokenized matrix: rows are articles, columns are tokens

(82, 7969)

In [None]:
# Display the vocabulary: each term with its assigned ID
vect_A.vocabulary_  # maps each term to its column index in the matrix (an index, not a count)

{'apple': 682,
 'long': 4303,
 'and': 617,
 'controversial': 1747,
 'ebook': 2401,
 'case': 1307,
 'has': 3367,
 'reached': 5734,
 'its': 3884,
 'final': 2893,
 'chapter': 1383,
 'it': 3878,
 'not': 4883,
 'the': 7054,
 'happy': 3352,
 'ending': 2527,
 'company': 1612,
 'wanted': 7620,
 'supreme': 6865,
 'court': 1809,
 'on': 4969,
 'monday': 4687,
 'rejected': 5841,
 'an': 603,
 'appeal': 673,
 'filed': 2882,
 'by': 1224,
 'to': 7150,
 'overturn': 5075,
 'stinging': 6723,
 'ruling': 6087,
 'that': 7051,
 'led': 4181,
 'broad': 1147,
 'conspiracy': 1706,
 'with': 7748,
 'several': 6303,
 'major': 4374,
 'publishers': 5610,
 'fix': 2927,
 'price': 5483,
 'of': 4935,
 'books': 1088,
 'sold': 6528,
 'through': 7106,
 'online': 4979,
 'bookstore': 1089,
 'decision': 2009,
 'means': 4496,
 'now': 4895,
 'no': 4858,
 'choice': 1437,
 'but': 1215,
 'pay': 5178,
 'out': 5037,
 '400': 223,
 'million': 4611,
 'consumers': 1714,
 'additional': 446,
 '50': 252,
 'in': 3664,
 'legal': 4187,
 'fees'

In [21]:
# Print the dimensions of the document-term matrix: rows are documents and columns are terms (tokens)
X_dtm_A.shape

(82, 7969)

In [22]:
# Display 50 terms from the vocabulary
print(vect_A.get_feature_names_out()[-150:-100])

['ydwnm50jlu' 'ye' 'yeah' 'year' 'years' 'yec' 'yeezy' 'yellow' 'yelp'
 'yep' 'yes' 'yesterday' 'yesweather' 'yet' 'yoga' 'yong' 'york' 'you'
 'young' 'younger' 'youngest' 'your' 'yourself' 'youth' 'youtube'
 'youtubeduck' 'yup' 'yuyuan' 'yücel' 'zach' 'zaxoqbv487' 'zero'
 'zgkymde1lzewlza0l2zkl1n0yxj0dxayljq0mdvhlmpwzwpwcxrodw1ictexnxgxmtujcmujanbn'
 'zgkymde1lzewlza0l2zkl1n0yxj0dxayljq0mdvhlmpwzwpwcxrodw1icteymdb4nji3iwplcwpwzw'
 'zgkymde1lzewlza0l2zkl1n0yxj0dxayljq0mdvhlmpwzwpwcxrodw1icti4ohgxnjijcmujanbn'
 'zgkymde1lzewlza0l2zkl1n0yxj0dxayljq0mdvhlmpwzwpwcxrodw1ictk1mhg1mzqjcmujanbn'
 'zgkymde1lzewlza0l2zkl1n0yxj0dxayljq0mdvhlmpwzwpwcxrodw1ictu2mhg3ntakzqlqcgc'
 'zgkymde1lzewlza0l2zkl1n0yxj0dxayljq0mdvhlmpwzwpwcxrodw1ictywmhgzmzgjcmujanbn'
 'zgkymde1lzewlza0lzm1l2jpcmrfdgfudhj1lmu3zwmzlmpwzwpwcxrodw1ictexnxgxmtujcmujanbn'
 'zgkymde1lzewlza0lzm1l2jpcmrfdgfudhj1lmu3zwmzlmpwzwpwcxrodw1icteymdb4nji3iwplcwpwzw'
 'zgkymde1lzewlza0lzm1l2jpcmrfdgfudhj1lmu3zwmzlmpwzwpwcxrodw1icti4ohgxnjijcm

### Create predictor variables X_B - tokenization preserving upper and lower case

In [None]:
# Build the document-term matrix from X with CountVectorizer, keeping the original case of every word
# through the parameter 'lowercase=False' (the default, lowercase=True, converts all text to lowercase)
vect_B = CountVectorizer(lowercase=False)  # with this parameter, a word that appears both capitalized
                                           # and in lowercase is treated as two different tokens
X_dtm_B = vect_B.fit_transform(X)

In [24]:
# Print the dimensions of the document-term matrix: rows are documents and columns are terms (tokens)
X_dtm_B.shape

(82, 8759)

In [25]:
# Display 50 terms from the vocabulary
print(vect_B.get_feature_names_out()[-150:-100])

['weighed' 'weird' 'welcome' 'welcomed' 'welcoming' 'welfare' 'well'
 'wells' 'went' 'were' 'weren' 'west' 'western' 'what' 'whatever'
 'whatsoever' 'wheel' 'wheelchair' 'wheeliz' 'wheels' 'when' 'where'
 'wherever' 'whether' 'which' 'while' 'whistles' 'white' 'who' 'whole'
 'wholesome' 'whose' 'why' 'wide' 'widely' 'wider' 'widespread' 'widgets'
 'width' 'wife' 'wildest' 'wildly' 'will' 'willing' 'willrahn' 'win'
 'wind' 'window' 'windows' 'windscreen']


### Create predictor variables X_C - tokenization preserving case and using n-grams

In [26]:
# Build the document-term matrix from X with CountVectorizer using n-grams
# through the parameter 'ngram_range=(1, 4)'
vect_C = CountVectorizer(lowercase=False, ngram_range=(1, 4))  # tokens are now n-grams of 1 to 4 words, not only single words
X_dtm_C = vect_C.fit_transform(X)

In [27]:
# Print the dimensions of the document-term matrix: rows are documents and columns are terms (tokens)
X_dtm_C.shape

(82, 116956)

In [28]:
# Display 50 terms from the vocabulary
print(vect_C.get_feature_names_out()[-150:-100])

['your scope of knowledge' 'your score' 'your score the'
 'your score the more' 'your skills' 'your skills in'
 'your skills in variety' 'your skin' 'your skin The'
 'your skin The fabric' 'your smart' 'your smart phone'
 'your smart phone runs' 'your smartphone' 'your smartphone and'
 'your smartphone and go' 'your software' 'your software and'
 'your software and that' 'your sorrows' 'your sorrows in'
 'your sorrows in bag' 'your specific' 'your specific company'
 'your specific company For' 'your startup' 'your startup should'
 'your startup should look' 'your tablet' 'your tablet and'
 'your tablet and start' 'your tasty' 'your tasty souvenirs'
 'your tasty souvenirs South' 'your three' 'your three main'
 'your three main movement' 'your time' 'your time and'
 'your time and should' 'your time to' 'your time to The' 'your toddler'
 'your toddler doesn' 'your toddler doesn have' 'your toolbox'
 'your toolbox If' 'your toolbox If you' 'your tush' 'your tush and']


### Train a prediction model with the different document-term matrices (predictor variables)

In [37]:
# Naive Bayes model to predict the variable 'y' from the predictors X_dtm_A
nb = MultinomialNB()
pd.Series(cross_val_score(nb, X_dtm_A, y, cv=5)).describe()

count    5.000000
mean     0.160294
std      0.073557
min      0.058824
25%      0.117647
50%      0.187500
75%      0.187500
max      0.250000
dtype: float64

In [35]:
# Naive Bayes model to predict the variable 'y' from the predictors X_dtm_B
nb = MultinomialNB()
pd.Series(cross_val_score(nb, X_dtm_B, y, cv=6)).describe()

count    6.000000
mean     0.171245
std      0.060531
min      0.071429
25%      0.145604
50%      0.184066
75%      0.214286
max      0.230769
dtype: float64

In [None]:
# Naive Bayes model to predict the variable 'y' from the predictors X_dtm_C
nb = MultinomialNB()
pd.Series(cross_val_score(nb, X_dtm_C, y, cv=10)).describe()
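
To put cross-validated accuracies in context, it can help to compare them with a trivial baseline. If `y` holds the four binned classes, a classifier that always predicts the most frequent class would score roughly 0.25. The sketch below shows one way to make that comparison; the random count matrix, the balanced labels and the stratified 5-fold split are all assumptions made for illustration, not the notebook's data or setup.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical data: 80 documents, 20 term counts each, four balanced classes
rng = np.random.default_rng(0)
toy_X = rng.integers(0, 5, size=(80, 20))
toy_y = np.repeat([0, 1, 2, 3], 20)

# Use the same folds for both models so the scores are comparable
toy_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

baseline_scores = cross_val_score(DummyClassifier(strategy="most_frequent"), toy_X, toy_y, cv=toy_cv)
nb_scores = cross_val_score(MultinomialNB(), toy_X, toy_y, cv=toy_cv)

print("baseline mean accuracy:", baseline_scores.mean())
print("naive Bayes mean accuracy:", nb_scores.mean())
```

Because the features here are random, Naive Bayes is not expected to beat the baseline. On real data, a model whose mean accuracy stays at or below the baseline has not learned a useful signal from that representation.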

In [38]:
import warnings
warnings.filterwarnings('ignore')

# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeRegressor
import joblib

# Load the data
data = pd.read_csv('https://raw.githubusercontent.com/davidzarruk/MIAD_ML_NLP_2025/main/datasets/dataTrain_Spotify.csv')

# Select the variables (note: this reassigns the X and y used in the tokenization cells above)
columnas = ['danceability', 'energy', 'tempo', 'valence', 'liveness', 'speechiness']
X = data[columnas]
y = data['popularity']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the model
#clf = DecisionTreeClassifier(max_depth=5, random_state=42)
clf = DecisionTreeRegressor(max_depth=5, random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Evaluate the model
#y_pred = clf.predict(X_test)
#accuracy = accuracy_score(y_test, y_pred)
#print(f"Accuracy on test data: {accuracy:.2f}")

# Save the model
joblib.dump(clf, 'ModeloEntrenado.pkl', compress=3)
print("Model saved as 'ModeloEntrenado.pkl'")


Model saved as 'ModeloEntrenado.pkl'
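
A model saved with `joblib.dump` can be restored with `joblib.load`. The sketch below checks the round trip on a small tree trained on hypothetical random data and written to a temporary directory; the `toy_` names and the file name `toy_model.pkl` are illustrative only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: 50 rows with 6 numeric features and an integer target
rng = np.random.default_rng(42)
toy_X = rng.random((50, 6))
toy_y = rng.integers(0, 100, size=50)

toy_model = DecisionTreeRegressor(max_depth=5, random_state=42).fit(toy_X, toy_y)

# Save to a temporary directory and load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(toy_model, model_path, compress=3)
    reloaded_model = joblib.load(model_path)

# The reloaded model should reproduce the original predictions
same_predictions = np.allclose(toy_model.predict(toy_X), reloaded_model.predict(toy_X))
print(same_predictions)
```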


In [42]:
import os

# Check the current working directory
print("Current directory:", os.getcwd())


Current directory: c:\Users\madelgado\Documents\GitHub\MIAD_ML_NLP_2025\Semana 4


In [43]:
data

Unnamed: 0.1,Unnamed: 0,track_id,artists,album_name,track_name,duration_ms,explicit,danceability,energy,key,...,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre,popularity
0,0,7hUhmkALyQ8SX9mJs5XI3D,Love and Rockets,Love and Rockets,Motorcycle,211533,False,0.305,0.84900,9,...,1,0.0549,0.000058,0.056700,0.4640,0.3200,141.793,4,goth,22
1,1,5x59U89ZnjZXuNAAlc8X1u,Filippa Giordano,Filippa Giordano,"Addio del passato - From ""La traviata""",196000,False,0.287,0.19000,7,...,0,0.0370,0.930000,0.000356,0.0834,0.1330,83.685,4,opera,22
2,2,70Vng5jLzoJLmeLu3ayBQq,Susumu Yokota,Symbol,Purple Rose Minuet,216506,False,0.583,0.50900,1,...,1,0.0362,0.777000,0.202000,0.1150,0.5440,90.459,3,idm,37
3,3,1cRfzLJapgtwJ61xszs37b,Franz Liszt;YUNDI,Relajación y siestas,"Liebeslied (Widmung), S. 566",218346,False,0.163,0.03680,8,...,1,0.0472,0.991000,0.899000,0.1070,0.0387,69.442,3,classical,0
4,4,47d5lYjbiMy0EdMRV8lRou,Scooter,Scooter Forever,The Darkside,173160,False,0.647,0.92100,2,...,1,0.1850,0.000939,0.371000,0.1310,0.1710,137.981,4,techno,27
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79795,79795,6mmbWSbU5FElQOocyktyUZ,Amilcare Ponchielli;Gothenburg Symphony Orches...,"Ballet Highlights - The Nutcracker, Romeo & Ju...",La Gioconda / Act 3: Dance Of The Hours,162613,False,0.554,0.00763,4,...,1,0.0502,0.915000,0.000970,0.2210,0.1560,119.502,4,opera,49
79796,79796,0XL75lllKb1jTmEamqwVU6,Sajanka,Time of India,Time of India,240062,False,0.689,0.55400,9,...,1,0.0759,0.091000,0.914000,0.0867,0.1630,148.002,4,trance,30
79797,79797,763FEhIZGILafwlkipdgtI,Frankie Valli & The Four Seasons,Merry Christmas,I Saw Mommy Kissing Santa Claus,136306,False,0.629,0.56000,0,...,0,0.0523,0.595000,0.000000,0.1820,0.8800,118.895,3,soul,0
79798,79798,2VVWWwQ3FiWnmbukTb6Kd3,The Mayries,I Will Wait,I Will Wait,216841,False,0.421,0.10700,6,...,1,0.0335,0.948000,0.000000,0.0881,0.1180,104.218,4,acoustic,44
