<a href="https://colab.research.google.com/github/josemanuelvinhas/MarvelRecomverse/blob/main/MarvelRecomverse_Sistema_de_Recomendaci%C3%B3n.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**0. Recommendation System**

This notebook implements the application's recommendation system.

Section 1 computes the similarity matrix between all the characters in the dataset. It is included only as a test, since the application does not need it.

Section 2 finds the characters most similar to a given person according to two different criteria:
1. Description
2. Description + Comment Rating + Direct Rating (Likes) + Average Comment Rating

The two criteria are used only to see how they affect the final result. The second criterion is the one adopted, because it also takes the users' direct opinions into account.

*NOTE: the dataset used in this notebook is available [here](https://github.com/josemanuelvinhas/MarvelRecomverse/tree/main/datasets)*


#**1. Computing the similarity matrix**

##**1.1 Dataset preprocessing**


Before the data can be used, the descriptions must be preprocessed.

The dataset contains a collection of 281 characters. Each one has two fields:

1.   *name*: the character's name
2.   *description*: the character's description

The *pandas* library is used for the preprocessing. The CSV file with the data is loaded below.


In [1]:
import pandas as pd

originalData = pd.read_csv('marvel.csv')
originalData

Unnamed: 0,name,description
0,A-Bomb (HAS),Rick Jones has been Hulk's best bud since day ...
1,A.I.M.,AIM is a terrorist organization bent on destro...
2,Abomination (Emil Blonsky),"Formerly known as Emil Blonsky, a spy of Sovie..."
3,Adam Warlock,Adam Warlock is an artificially created human ...
4,Agent X (Nijo),Originally a partner of the mind-altering assa...
...,...,...
276,Zarek,Zarek is a member of the Kree race with no sup...
277,Zodiak,"Twelve demons merged with Norman Harrison, who..."
278,Zombie (Simon Garth),War hero Simon Garth was turned into a zombie ...
279,Zuras,Zuras was once the leader of the Eternals.


Next comes the preprocessing itself, which uses three techniques:

1.   ***Tokenization***: splitting the text into words
2.   ***Stopword*** removal
3.   ***Stemming***: reducing each word to its root, which need not be a real word itself.
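As a rough illustration of the three steps on a single description, the sketch below swaps `word_tokenize` for a regular expression and NLTK's English stop-word list for a tiny hand-picked set, so it runs without downloading any NLTK data. Both substitutions, and the names `STOPS` and `preprocess`, are assumptions made for this example and are not what the notebook itself uses.

```python
import re
from nltk.stem import PorterStemmer

# Hypothetical, deliberately tiny stop-word set (the notebook uses NLTK's full English list)
STOPS = {"was", "once", "the", "of"}

def preprocess(text):
    """Tokenize, drop stop words, and stem a single description."""
    stemmer = PorterStemmer()
    tokens = re.findall(r"[A-Za-z0-9]+", text)        # 1. tokenization (simplified)
    kept = [t for t in tokens if t not in STOPS]      # 2. stop-word removal
    return " ".join(stemmer.stem(t) for t in kept)    # 3. stemming

print(preprocess("Zuras was once the leader of the Eternals."))
```

For this sentence the stems should match the `zura leader etern` entry that the notebook's own preprocessing produces for Zuras.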

In [2]:
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

import nltk
nltk.download('punkt')
nltk.download('stopwords')

ps = PorterStemmer()

preprocessedText = []

# Build the stop-word set once instead of on every iteration
stops = set(stopwords.words("english"))

for row in originalData.itertuples():
    # row[0] is the index, row[1] the name and row[2] the description
    text = word_tokenize(row[2])
    # Drop stop words and non-alphanumeric tokens, then stem what is left
    text = [ps.stem(w) for w in text if w not in stops and w.isalnum()]
    text = " ".join(text)
    
    preprocessedText.append(text)

preprocessedData = originalData
preprocessedData['processed_text'] = preprocessedText

preprocessedData

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Unnamed: 0,name,description,processed_text
0,A-Bomb (HAS),Rick Jones has been Hulk's best bud since day ...,rick jone hulk best bud sinc day one friend te...
1,A.I.M.,AIM is a terrorist organization bent on destro...,aim terrorist organ bent destroy world
2,Abomination (Emil Blonsky),"Formerly known as Emil Blonsky, a spy of Sovie...",formerli known emil blonski spi soviet yugosla...
3,Adam Warlock,Adam Warlock is an artificially created human ...,adam warlock artifici creat human born cocoon ...
4,Agent X (Nijo),Originally a partner of the mind-altering assa...,origin partner assassin black swan nijo spi de...
...,...,...,...
276,Zarek,Zarek is a member of the Kree race with no sup...,zarek member kree race superhuman abil special...
277,Zodiak,"Twelve demons merged with Norman Harrison, who...",twelv demon merg norman harrison soon adopt gu...
278,Zombie (Simon Garth),War hero Simon Garth was turned into a zombie ...,war hero simon garth turn zombi secretari layl...
279,Zuras,Zuras was once the leader of the Eternals.,zura leader etern


##**1.2 Building the bag of words (BoW) with TF-IDF**

The starting point is the data stored in "preprocessedData", where each character has a 'processed_text' field containing the preprocessed description.

The goal is to turn every description into a vector of term frequencies (bag of words), weighting those frequencies with TF-IDF.

The sklearn package provides a class called *TfidfVectorizer* that builds the matrix of all the weighted frequency vectors directly from an array of texts (preprocessedData['processed_text']).

To use a bag of words without TF-IDF weighting, the *CountVectorizer* class from the same package can be used instead.
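To show what the weighting does, here is a minimal pure-Python sketch of TF-IDF on three short texts. It assumes the smoothed IDF formula and L2 row normalisation that scikit-learn documents as the `TfidfVectorizer` defaults, and it tokenizes by splitting on whitespace, which is a simplification. The function name `tfidf` and the sample texts are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with L2-normalised TF-IDF weights per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: terms that appear in fewer documents get a larger weight
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

vocab, rows = tfidf(["zura leader etern", "aim terrorist organ", "leader terrorist"])
for term, weight in zip(vocab, rows[0]):
    if weight:
        print(f"{term}: {weight:.3f}")
```

In the first text, `leader` also appears in the third text, so it ends up with a lower weight than `zura` or `etern`, which appear only once in the whole collection.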

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

bagOfWordsModel = TfidfVectorizer()
bagOfWordsModel.fit(preprocessedData['processed_text'])
textsBoW= bagOfWordsModel.transform(preprocessedData['processed_text'])
print("Finished")

Finished


In [4]:
textsBoW.shape

(281, 2283)

In [5]:
print(textsBoW)

  (0, 2137)	0.130889104582123
  (0, 2058)	0.14217279372193212
  (0, 2004)	0.22205823661448335
  (0, 1974)	0.22205823661448335
  (0, 1894)	0.1961840975559547
  (0, 1790)	0.1961840975559547
  (0, 1779)	0.1752944840330487
  (0, 1634)	0.22205823661448335
  (0, 1505)	0.10190245579244313
  (0, 1404)	0.11629865466340347
  (0, 1174)	0.16198034620346324
  (0, 1086)	0.22205823661448335
  (0, 968)	0.1468449451157794
  (0, 840)	0.20692283552679946
  (0, 816)	0.17030995849742606
  (0, 797)	0.14942034497452006
  (0, 695)	0.1752944840330487
  (0, 647)	0.16198034620346324
  (0, 537)	0.1878544852619919
  (0, 496)	0.1878544852619919
  (0, 475)	0.22205823661448335
  (0, 284)	0.20692283552679946
  (0, 251)	0.22205823661448335
  (0, 242)	0.1878544852619919
  (0, 214)	0.1752944840330487
  :	:
  (278, 818)	0.30506443106319997
  (278, 81)	0.2225292018497309
  (279, 2282)	0.6554139206650713
  (279, 1148)	0.49150479114816825
  (279, 669)	0.5734593559066198
  (280, 1993)	0.22286926032113236
  (280, 1726)	0.25226

In [6]:
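# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
bagOfWordsModel.get_feature_names_out().tolist()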

['13',
 '1910',
 '1930',
 '1931',
 '1940',
 '19th',
 '21st',
 '31st',
 '53rd',
 '60',
 'abandon',
 'abduct',
 'abil',
 'abl',
 'abomin',
 'abraham',
 'abruptli',
 'absorb',
 'abus',
 'academ',
 'acceler',
 'access',
 'accident',
 'accomplish',
 'accord',
 'account',
 'accustom',
 'acolyt',
 'acquir',
 'acrobat',
 'across',
 'act',
 'action',
 'activ',
 'actress',
 'ad',
 'adam',
 'adamantium',
 'adapt',
 'add',
 'admiss',
 'adolesc',
 'adopt',
 'adrian',
 'adulthood',
 'advanc',
 'advantag',
 'adventur',
 'aegi',
 'affect',
 'affin',
 'affluent',
 'africa',
 'african',
 'after',
 'age',
 'agenc',
 'agent',
 'ago',
 'agre',
 'aid',
 'aim',
 'air',
 'airdrop',
 'albeit',
 'alberta',
 'alex',
 'alexand',
 'alia',
 'alien',
 'alik',
 'all',
 'allegi',
 'alli',
 'allow',
 'almost',
 'alon',
 'along',
 'alongsid',
 'alpha',
 'alreadi',
 'also',
 'alter',
 'altern',
 'although',
 'alvarez',
 'alway',
 'amahl',
 'amaz',
 'america',
 'american',
 'among',
 'amora',
 'amount',
 'amphibi',
 'amul

In [7]:
bagOfWordsModel.get_feature_names_out().tolist()[1005]

'includ'

#**2. Finding the characters most similar to a person**

##**2.1 Based on the description only**

To do this, the whole previous process has to be repeated with the person and their description included in the data. The distance between the user and every character can then be computed.



In the following cell, enter the name and the description to compare against the characters.

In [8]:
nombre = "Pepe"  # Enter the name
descripcion = "I was once the leader of the Eternals. I am strong as iron and as small as an ant. I am also a terrorist."  # Enter the description

The same process as in section 1 is carried out.

In [9]:
import pandas as pd

originalData = pd.read_csv('marvel.csv')
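# DataFrame.append was removed in pandas 2.0; pd.concat adds the new row instead
newRow = pd.DataFrame([{'name': nombre, 'description': descripcion}])
originalData = pd.concat([originalData, newRow], ignore_index=True)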
originalData

Unnamed: 0,name,description
0,A-Bomb (HAS),Rick Jones has been Hulk's best bud since day ...
1,A.I.M.,AIM is a terrorist organization bent on destro...
2,Abomination (Emil Blonsky),"Formerly known as Emil Blonsky, a spy of Sovie..."
3,Adam Warlock,Adam Warlock is an artificially created human ...
4,Agent X (Nijo),Originally a partner of the mind-altering assa...
...,...,...
277,Zodiak,"Twelve demons merged with Norman Harrison, who..."
278,Zombie (Simon Garth),War hero Simon Garth was turned into a zombie ...
279,Zuras,Zuras was once the leader of the Eternals.
280,Zzzax,"A chain reaction in an atomic reactor, a resul..."


In [10]:
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

import nltk
nltk.download('punkt')
nltk.download('stopwords')

ps = PorterStemmer()

preprocessedText = []

# Build the stop-word set once instead of on every iteration
stops = set(stopwords.words("english"))

for row in originalData.itertuples():
    # row[0] is the index, row[1] the name and row[2] the description
    text = word_tokenize(row[2])
    # Drop stop words and non-alphanumeric tokens, then stem what is left
    text = [ps.stem(w) for w in text if w not in stops and w.isalnum()]
    text = " ".join(text)
    
    preprocessedText.append(text)

preprocessedData = originalData
preprocessedData['processed_text'] = preprocessedText

preprocessedData

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,name,description,processed_text
0,A-Bomb (HAS),Rick Jones has been Hulk's best bud since day ...,rick jone hulk best bud sinc day one friend te...
1,A.I.M.,AIM is a terrorist organization bent on destro...,aim terrorist organ bent destroy world
2,Abomination (Emil Blonsky),"Formerly known as Emil Blonsky, a spy of Sovie...",formerli known emil blonski spi soviet yugosla...
3,Adam Warlock,Adam Warlock is an artificially created human ...,adam warlock artifici creat human born cocoon ...
4,Agent X (Nijo),Originally a partner of the mind-altering assa...,origin partner assassin black swan nijo spi de...
...,...,...,...
277,Zodiak,"Twelve demons merged with Norman Harrison, who...",twelv demon merg norman harrison soon adopt gu...
278,Zombie (Simon Garth),War hero Simon Garth was turned into a zombie ...,war hero simon garth turn zombi secretari layl...
279,Zuras,Zuras was once the leader of the Eternals.,zura leader etern
280,Zzzax,"A chain reaction in an atomic reactor, a resul...",A chain reaction atom reactor result terrorist...


In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

bagOfWordsModel = TfidfVectorizer()
bagOfWordsModel.fit(preprocessedData['processed_text'])
textsBoW= bagOfWordsModel.transform(preprocessedData['processed_text'])

Next, an N x N matrix is created (N = number of characters), where the value at matrix[ i , j ] is the cosine distance between character i and character j.
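To make the metric concrete, here is a minimal pure-Python sketch of cosine distance, the metric passed to `pairwise_distances` in this section. The helper name `cosine_distance`, the toy vectors, and the choice to return 1.0 for an all-zero vector are assumptions of this sketch.

```python
import math

def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # sketch convention: an empty vector is treated as maximally distant
    return 1.0 - dot / (norm_u * norm_v)

# Toy TF-IDF-like vectors over a three-term vocabulary
print(cosine_distance([0.6, 0.0, 0.8], [0.0, 1.0, 0.0]))  # no shared terms
print(cosine_distance([0.6, 0.0, 0.8], [0.6, 0.0, 0.8]))  # identical vectors
```

Because TF-IDF weights are never negative, a distance of 1.0 means the two descriptions share no stems at all, which is why 1.0 appears so often in the distance scores in this section.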

In [12]:
from sklearn.metrics import pairwise_distances

distance_matrix= pairwise_distances(textsBoW,textsBoW ,metric='cosine')


Now all that remains is to get the index of the person that was entered and their distances to every character.

In [13]:
indexOfTitle = preprocessedData[preprocessedData['name']==nombre].index.values[0]
indexOfTitle

distance_scores = list(enumerate(distance_matrix[indexOfTitle]))
distance_scores

[(0, 0.9304965120389753),
 (1, 0.8463144659112246),
 (2, 1.0),
 (3, 1.0),
 (4, 1.0),
 (5, 0.9472380650243366),
 (6, 1.0),
 (7, 1.0),
 (8, 0.9369400967895317),
 (9, 1.0),
 (10, 0.9295431936159904),
 (11, 1.0),
 (12, 1.0),
 (13, 0.9584495194596777),
 (14, 1.0),
 (15, 1.0),
 (16, 1.0),
 (17, 1.0),
 (18, 1.0),
 (19, 0.955394902008685),
 (20, 1.0),
 (21, 1.0),
 (22, 1.0),
 (23, 1.0),
 (24, 1.0),
 (25, 1.0),
 (26, 1.0),
 (27, 1.0),
 (28, 1.0),
 (29, 1.0),
 (30, 1.0),
 (31, 1.0),
 (32, 1.0),
 (33, 1.0),
 (34, 0.9435503891442081),
 (35, 1.0),
 (36, 1.0),
 (37, 1.0),
 (38, 1.0),
 (39, 1.0),
 (40, 0.9365110744732263),
 (41, 0.9451211569127159),
 (42, 0.90127903372154),
 (43, 0.9054165357925118),
 (44, 1.0),
 (45, 1.0),
 (46, 1.0),
 (47, 0.9404947837498154),
 (48, 1.0),
 (49, 1.0),
 (50, 0.9408421280101652),
 (51, 1.0),
 (52, 1.0),
 (53, 1.0),
 (54, 1.0),
 (55, 1.0),
 (56, 1.0),
 (57, 1.0),
 (58, 1.0),
 (59, 1.0),
 (60, 1.0),
 (61, 1.0),
 (62, 1.0),
 (63, 1.0),
 (64, 1.0),
 (65, 1.0),
 (66, 1.0),

Once the distances are available, all that is left is to sort them, keep the shortest ones, and display them in a readable form.
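The selection step can be sketched in isolation as follows. This variant removes the query by its index rather than by its position in the sorted list, so it does not depend on the query sorting first. The function name `nearest` and the sample distances are hypothetical.

```python
def nearest(distances, query_index, k=10):
    """Return the k (index, distance) pairs closest to the query, excluding the query itself."""
    candidates = [(i, d) for i, d in enumerate(distances) if i != query_index]
    return sorted(candidates, key=lambda pair: pair[1])[:k]

row = [0.93, 0.85, 1.0, 0.0, 0.34]       # made-up distances; index 3 is the query itself
print(nearest(row, query_index=3, k=2))  # [(4, 0.34), (1, 0.85)]
```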

In [14]:
ordered_scores = sorted(distance_scores, key=lambda x: x[1])
top_scores = ordered_scores[1:11]
top_indexes = [i[0] for i in top_scores]
preprocessedData['name'].iloc[top_indexes]

279                           Zuras
171                           Thena
1                            A.I.M.
126                          Sprite
42        Iron Man/Tony Stark (MAA)
43      Iron Patriot (James Rhodes)
70                            Sersi
170                The Leader (HAS)
216                         Vampiro
180    Thunderbird (John Proudstar)
Name: name, dtype: object

##**2.2 Based on the description and additional parameters**


**Search based on Description + Comment Rating + Direct Rating (Likes) + Average Comment Rating**

Weighting (every component is expressed as a distance, so lower values mean a better match):

* Description (70%): the similarity between the user's description and each character's description
* Comment Rating (7.5%): the rating of the text comment the user writes about a character. These values come from the rating subsystem. It can be:
  * Good (0)
  * Neutral or no rating (0.5)
  * Bad (1)
* Direct Rating by Likes (15%)
  * Like (0)
  * Dislike (1)
  * None (0.5)
* Average Comment Rating (7.5%): the mean of all users' comment ratings for a character. It can be any value between 0 and 1.
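The weighting above amounts to a weighted sum of four values in the range 0 to 1. A minimal sketch, with the hypothetical names `WEIGHTS` and `weighted_distance` and made-up input values:

```python
# Weights taken from the list above; they add up to 1
WEIGHTS = {"description": 0.70, "comment": 0.075, "like": 0.15, "mean_comment": 0.075}

def weighted_distance(description, comment, like, mean_comment):
    """Combine the four components (each in [0, 1], lower is better) into one score."""
    return (WEIGHTS["description"] * description
            + WEIGHTS["comment"] * comment
            + WEIGHTS["like"] * like
            + WEIGHTS["mean_comment"] * mean_comment)

# Same description distance, but one character was liked and the other disliked
liked = weighted_distance(0.90, comment=0.5, like=0.0, mean_comment=0.5)
disliked = weighted_distance(0.90, comment=0.5, like=1.0, mean_comment=0.5)
print(liked < disliked)  # True: the liked character ranks closer
```

Since the weights add up to 1, the combined score also stays between 0 and 1, so it can be sorted just like the plain description distance.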

To test this part of the system, the comment ratings and the direct ratings are simulated.

The ratings are generated randomly below:

* 0 is a positive rating
* 0.5 is a neutral rating (or no rating)
* 1 is a negative rating

In [15]:
from random import choice, randint

valoracion_comentario = []
valoracion_directa = []
valoracion_media = []

for _ in range(len(preprocessedData)):
  # 0.0 = positive, 0.5 = neutral or none, 1.0 = negative
  valoracion_comentario.append(choice([0.0, 0.5, 1.0]))
  valoracion_directa.append(choice([0.0, 0.5, 1.0]))
  # randint needs integer bounds; dividing by 100 gives a value in [0, 1] with two decimals
  valoracion_media.append(randint(0, 100) / 100)

preprocessedData['valoracion_comentario'] = valoracion_comentario
preprocessedData['valoracion_directa'] = valoracion_directa
preprocessedData['valoracion_media'] = valoracion_media

preprocessedData

Unnamed: 0,name,description,processed_text,valoracion_comentario,valoracion_directa,valoracion_media
0,A-Bomb (HAS),Rick Jones has been Hulk's best bud since day ...,rick jone hulk best bud sinc day one friend te...,0.0,1.0,0.27
1,A.I.M.,AIM is a terrorist organization bent on destro...,aim terrorist organ bent destroy world,0.5,0.5,0.73
2,Abomination (Emil Blonsky),"Formerly known as Emil Blonsky, a spy of Sovie...",formerli known emil blonski spi soviet yugosla...,0.5,0.0,0.43
3,Adam Warlock,Adam Warlock is an artificially created human ...,adam warlock artifici creat human born cocoon ...,0.5,0.5,0.68
4,Agent X (Nijo),Originally a partner of the mind-altering assa...,origin partner assassin black swan nijo spi de...,0.0,1.0,0.77
...,...,...,...,...,...,...
277,Zodiak,"Twelve demons merged with Norman Harrison, who...",twelv demon merg norman harrison soon adopt gu...,0.5,1.0,0.41
278,Zombie (Simon Garth),War hero Simon Garth was turned into a zombie ...,war hero simon garth turn zombi secretari layl...,1.0,0.0,0.80
279,Zuras,Zuras was once the leader of the Eternals.,zura leader etern,0.0,0.0,0.10
280,Zzzax,"A chain reaction in an atomic reactor, a resul...",A chain reaction atom reactor result terrorist...,0.0,0.5,0.65


Next, the distances are computed taking the weighting into account.

In [16]:
distance_scores_final = []

for i in range(len(distance_matrix[indexOfTitle])):
  valoracion = distance_matrix[indexOfTitle][i]*0.7 + preprocessedData['valoracion_comentario'][i]*0.075 + preprocessedData['valoracion_directa'][i]*0.15 + preprocessedData['valoracion_media'][i]*0.075
  distance_scores_final.append((i,valoracion))

distance_scores_final

[(0, 0.8215975584272827),
 (1, 0.7596701261378571),
 (2, 0.7697499999999999),
 (3, 0.8634999999999999),
 (4, 0.90775),
 (5, 0.9203166455170355),
 (6, 0.8815),
 (7, 0.943),
 (8, 0.7968580677526722),
 (9, 0.9999999999999999),
 (10, 0.7031802355311932),
 (11, 0.93775),
 (12, 0.8612499999999998),
 (13, 0.8704146636217743),
 (14, 0.709),
 (15, 0.8199999999999998),
 (16, 0.9369999999999999),
 (17, 0.8837499999999999),
 (18, 0.9137499999999998),
 (19, 0.9297764314060795),
 (20, 0.8222499999999999),
 (21, 0.8214999999999999),
 (22, 0.7479999999999999),
 (23, 0.7434999999999999),
 (24, 0.8589999999999999),
 (25, 0.86125),
 (26, 0.8109999999999999),
 (27, 0.826),
 (28, 0.7637499999999999),
 (29, 0.8905),
 (30, 0.8859999999999999),
 (31, 0.8807499999999999),
 (32, 0.9767499999999999),
 (33, 0.9849999999999999),
 (34, 0.7849852724009455),
 (35, 0.76825),
 (36, 0.8784999999999998),
 (37, 0.7749999999999999),
 (38, 0.8672499999999999),
 (39, 0.8949999999999999),
 (40, 0.7410577521312585),
 (41, 0.91

Next, the most similar characters are kept.

In [17]:
ordered_scores_final = sorted(distance_scores_final, key=lambda x: x[1])
top_scores_final = ordered_scores_final[1:11]
top_indexes_final = [i[0] for i in top_scores_final]
preprocessedData['name'].iloc[top_indexes_final]

279                        Zuras
171                        Thena
220                        Vapor
42     Iron Man/Tony Stark (MAA)
101                       Skreet
253            White Tiger (USM)
100                         Skin
10      Banshee (Theresa Rourke)
65          Scream (Donna Diego)
113     Spider-Girl (May Parker)
Name: name, dtype: object

#**3. Summary of results**

**Results of section 2.1**: the descriptions are shown below to check that there is some similarity

In [18]:
descripcion

'I was once the leader of the Eternals. I am strong as iron and as small as an ant. I am also a terrorist.'

In [19]:
preprocessedData['name'].iloc[top_indexes]

279                           Zuras
171                           Thena
1                            A.I.M.
126                          Sprite
42        Iron Man/Tony Stark (MAA)
43      Iron Patriot (James Rhodes)
70                            Sersi
170                The Leader (HAS)
216                         Vampiro
180    Thunderbird (John Proudstar)
Name: name, dtype: object

In [20]:
preprocessedData['description'].iloc[top_indexes]

279           Zuras was once the leader of the Eternals.
171    Thena, a second generation Eternal, is the eld...
1      AIM is a terrorist organization bent on destro...
126    Sprite is a mischievous Eternal who maintains ...
42     Tony Stark is the genius inventor/billionaire/...
43     U.S. Air Force pilot and Tony Stark's friend w...
70     Sersi is a member of the Eternals, a species d...
170    What the Hulk has in strength, the Leader has ...
216    Vampiro, part of the race known as the Eternal...
180    An exceptionally strong and vigorous athlete i...
Name: description, dtype: object

**Results of section 2.2**: the descriptions are shown below to check that there is some similarity (bear in mind that in this case the description does not carry the full weight of the calculation)

In [21]:
descripcion

'I was once the leader of the Eternals. I am strong as iron and as small as an ant. I am also a terrorist.'

In [22]:
preprocessedData['name'].iloc[top_indexes_final]

279                        Zuras
171                        Thena
220                        Vapor
42     Iron Man/Tony Stark (MAA)
101                       Skreet
253            White Tiger (USM)
100                         Skin
10      Banshee (Theresa Rourke)
65          Scream (Donna Diego)
113     Spider-Girl (May Parker)
Name: name, dtype: object

In [23]:
preprocessedData['description'].iloc[top_indexes_final]

279           Zuras was once the leader of the Eternals.
171    Thena, a second generation Eternal, is the eld...
220    Vapor, along with her brother, was among the s...
42     Tony Stark is the genius inventor/billionaire/...
101    The Proemial God Aegis revealed that Skreet wa...
253    White Tiger takes everything very seriously. A...
100    Angelo Espinosa, a founding member of Generati...
10     The daughter of former X-Men member Sean Cassi...
65     Out of the five alien symbiotes that were forc...
113    May "Mayday" Parker is the daughter of Spider-...
Name: description, dtype: object