# Book Recommender System


>  ABP-2, Group 15

> Authors: Laura Álvarez Iglesias, Pedro Sánchez Uscamaita, José Luis Ruanova Lea, Jesús Velasco Bujeiro

This recommender system works with a dataset of 16558 books. Each book has a 'resume' field that holds a textual description of its plot.

The following cells load the dataset and convert the texts in the 'resume' field into preprocessed texts: each summary is tokenized, stop words and non-alphanumeric tokens are removed, and the remaining tokens are stemmed. The resulting texts are stored in a new column called 'processed_text'.

In [None]:
import pandas as pd

# Load the tab-separated dataset into a DataFrame.
originalData = pd.read_csv('books.csv', sep='\t')
originalData

Unnamed: 0,author,title,date,resume
0,George Orwell,Animal Farm,1945-08-17,"Old Major, the old boar on the Manor Farm, ca..."
1,Anthony Burgess,A Clockwork Orange,1962,"Alex, a teenager living in near-future Englan..."
2,Albert Camus,The Plague,1947,The text of The Plague is divided into five p...
3,David Hume,An Enquiry Concerning Human Understanding,,The argument of the Enquiry proceeds by a ser...
4,Vernor Vinge,A Fire Upon the Deep,,The novel posits that space around the Milky ...
...,...,...,...,...
16554,Colin Meloy,Under Wildwood,2012-09-25,"Prue McKeel, having rescued her brother from ..."
16555,Vince Flynn,Transfer of Power,2000-06-01,The reader first meets Rapp while he is doing...
16556,Jay-Z,Decoded,2010-11-16,The book follows very rough chronological ord...
16557,Stephen Colbert,America Again: Re-becoming The Greatness We Ne...,2012-10-02,Colbert addresses topics including Wall Stree...


In [None]:
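import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Download the tokenizer models and the stop-word lists (only needed once).
nltk.download('punkt')
nltk.download('stopwords')

ps = PorterStemmer()
# Build the stop-word set once instead of on every iteration.
stops = set(stopwords.words("english"))

preprocessedText = []

for row in originalData.itertuples():
    # Tokenize the summary; the column is accessed by name rather than by position.
    text = word_tokenize(row.resume)
    # Drop stop words and non-alphanumeric tokens, then stem what is left.
    # The comparison is case-sensitive, so capitalized stop words such as "The"
    # pass the filter, as the output below shows.
    text = [ps.stem(w) for w in text if w not in stops and w.isalnum()]
    preprocessedText.append(" ".join(text))

# Work on a copy so that originalData is left untouched.
preprocessedData = originalData.copy()
preprocessedData['processed_text'] = preprocessedText

preprocessedData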

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Unnamed: 0,author,title,date,resume,processed_text
0,George Orwell,Animal Farm,1945-08-17,"Old Major, the old boar on the Manor Farm, ca...",old major old boar manor farm call anim farm m...
1,Anthony Burgess,A Clockwork Orange,1962,"Alex, a teenager living in near-future Englan...",alex teenag live england lead gang nightli org...
2,Albert Camus,The Plague,1947,The text of The Plague is divided into five p...,the text the plagu divid five part In town ora...
3,David Hume,An Enquiry Concerning Human Understanding,,The argument of the Enquiry proceeds by a ser...,the argument enquiri proce seri increment step...
4,Vernor Vinge,A Fire Upon the Deep,,The novel posits that space around the Milky ...,the novel posit space around milki way divid c...
...,...,...,...,...,...
16554,Colin Meloy,Under Wildwood,2012-09-25,"Prue McKeel, having rescued her brother from ...",prue mckeel rescu brother dowag gover conclus ...
16555,Vince Flynn,Transfer of Power,2000-06-01,The reader first meets Rapp while he is doing...,the reader first meet rapp covert oper iran di...
16556,Jay-Z,Decoded,2010-11-16,The book follows very rough chronological ord...,the book follow rough chronolog order switch c...
16557,Stephen Colbert,America Again: Re-becoming The Greatness We Ne...,2012-10-02,Colbert addresses topics including Wall Stree...,colbert address topic includ wall street campa...


## Building the bag of words

The starting point is the data stored in "preprocessedData", where each book has a 'processed_text' field containing the preprocessed summary.

The goal is to turn every summary into a vector of term frequencies (bag of words) and to weight those frequencies with TF-IDF.

The sklearn package provides a class called TfidfVectorizer that builds the matrix of weighted frequency vectors directly from an array of texts (preprocessedData['processed_text']).

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Learn the vocabulary and the IDF weights, then turn every summary into a TF-IDF vector.
bagOfWordsModel = TfidfVectorizer()
bagOfWordsModel.fit(preprocessedData['processed_text'])
textsBoW = bagOfWordsModel.transform(preprocessedData['processed_text'])
print("Finished")

Finished


In [None]:
textsBoW.shape

(16559, 88271)
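
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch on three toy texts. It assumes whitespace tokenization and uses the smoothed IDF formula and L2 row normalization that TfidfVectorizer applies by default. The function name `tfidf_vectors` and the toy texts are hypothetical and are not part of the notebook.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return (vocabulary, vectors) with smoothed IDF and L2-normalized rows."""
    # Simplification: split on whitespace instead of using sklearn's tokenizer.
    tokenized = [text.split() for text in texts]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of texts in which each term occurs at least once.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
           for term in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[term] * idf[term] for term in vocabulary]
        # Scale each row to unit Euclidean length.
        norm = math.sqrt(sum(value * value for value in row))
        vectors.append([value / norm if norm else 0.0 for value in row])
    return vocabulary, vectors


# Toy texts (already "preprocessed"), invented for illustration only.
toy_texts = ["hobbit ring dragon", "ring ring king", "farm anim pig"]
vocabulary, vectors = tfidf_vectors(toy_texts)
```

In the first toy text, "ring" receives a lower weight than "hobbit" because it also appears in another text, which is the effect TF-IDF is meant to produce.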

## Computing distances between frequency vectors

The final goal is to build an N x N matrix (N = number of books) in which the value at matrix[i, j] is the distance between book i and book j.

This distance can be computed in several ways. Because the texts are now represented as frequency vectors (textsBoW), standard vector distance measures can be used. The one used in this example is the cosine distance.
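
As an illustration of the measure, the cosine distance between two vectors u and v is 1 minus the cosine of the angle between them, that is, 1 - (u · v) / (‖u‖ ‖v‖). The sketch below is a plain-Python version for dense lists. The function name `cosine_distance` is hypothetical, and returning 1.0 for a zero vector is a convention assumed here.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two equal-length numeric sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption of this sketch: a zero vector is treated as maximally
        # dissimilar to everything.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


same_direction = cosine_distance([1, 2], [2, 4])   # parallel vectors
orthogonal = cosine_distance([1, 0], [0, 1])       # no shared terms
```

Vectors that point in the same direction have a distance close to 0 regardless of their length, and vectors with no terms in common have a distance of 1. Since TF-IDF weights are non-negative, the distances between summaries fall between 0 and 1.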

In [None]:
from sklearn.metrics import pairwise_distances

# Dense N x N matrix of cosine distances between every pair of summaries.
distance_matrix = pairwise_distances(textsBoW, textsBoW, metric='cosine')

## Finding the books most similar to a given one based on the summary

In [None]:
searchTitle = "The Hobbit"  # Base book for the recommendations
indexOfTitle = preprocessedData[preprocessedData['title']==searchTitle].index.values[0]
indexOfTitle

85
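
Only the row `distance_matrix[indexOfTitle]` is needed to recommend books for one title, while the full matrix holds N x N entries (by a rough estimate, about 2 GB for the shape reported above if stored as 64-bit floats). As a possible alternative, the sketch below computes a single row with `pairwise_distances`. The toy texts and the names `toy_bow`, `query_index`, `row` and `full` are hypothetical stand-ins for the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances

# Toy stand-in for preprocessedData['processed_text'] (invented for illustration).
toy_texts = ["hobbit ring dragon", "ring ring king", "farm anim pig"]
toy_bow = TfidfVectorizer().fit_transform(toy_texts)

query_index = 0  # plays the role of indexOfTitle

# Slice with a range so the query keeps its 2-D (1 x n_terms) shape.
query_vector = toy_bow[query_index:query_index + 1]

# Distances from the query book to every book: shape (1, N), flattened to (N,).
row = pairwise_distances(query_vector, toy_bow, metric="cosine").ravel()

# For comparison only: the full matrix, whose matching row should agree.
full = pairwise_distances(toy_bow, metric="cosine")
```

With the real data, `toy_bow` would be `textsBoW` and `query_index` would be `indexOfTitle`. The resulting `row` could then be used in place of `distance_matrix[indexOfTitle]`.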

In [None]:
distance_scores = list(enumerate(distance_matrix[indexOfTitle]))
distance_scores

[(0, 0.9890712884537552),
 (1, 0.9889112235414373),
 (2, 0.981277578438676),
 (3, 0.9879842864025564),
 (4, 0.980568995187431),
 (5, 0.9716679086642707),
 (6, 0.9760855337042229),
 (7, 0.9851825956646076),
 (8, 0.995785495740779),
 (9, 0.9956055242934675),
 (10, 0.9903541474224148),
 (11, 0.9923744933710383),
 (12, 0.9862334636172694),
 (13, 0.9946417793735608),
 (14, 0.9929736683319734),
 (15, 0.9933649272109959),
 (16, 0.9912536769680648),
 (17, 0.9946072124848658),
 (18, 0.9871229064872565),
 (19, 0.9937009831508105),
 (20, 0.9956606540895209),
 (21, 0.9834958112931228),
 (22, 0.987367279022661),
 (23, 0.9891111545149697),
 (24, 0.9858129080653386),
 (25, 0.9929470376323101),
 (26, 0.9850008506215834),
 (27, 0.9892473773034776),
 (28, 0.9944359125211635),
 (29, 0.9921771107600454),
 (30, 0.9889361619519677),
 (31, 0.9782556458976451),
 (32, 0.9623774016636835),
 (33, 0.9737346908457369),
 (34, 0.9871805460912341),
 (35, 0.9920907998758591),
 (36, 0.9643741973671126),
 (37, 0.9887244

In [None]:
ordered_scores = sorted(distance_scores, key=lambda x: x[1])
ordered_scores

[(85, 0.0),
 (15275, 0.7521788574329701),
 (78, 0.7633475665151637),
 (249, 0.8500461329575789),
 (2642, 0.8517862843833618),
 (9933, 0.8563211663405321),
 (4093, 0.8726652846427698),
 (2943, 0.8770053990468247),
 (7696, 0.8820846719967486),
 (8902, 0.8852689004547315),
 (5975, 0.886280949446838),
 (5275, 0.8889815887371664),
 (248, 0.8937258405818118),
 (4476, 0.9047868845368893),
 (9553, 0.9107992130481272),
 (10444, 0.9132897495413819),
 (11626, 0.9148061474981297),
 (10060, 0.9149429053397282),
 (10534, 0.915783632286029),
 (11298, 0.9170964689869118),
 (7134, 0.9189218391684971),
 (5480, 0.9193449982509347),
 (2163, 0.9201845556925635),
 (7156, 0.921510263292914),
 (7002, 0.9229463068415892),
 (6098, 0.9237212899377264),
 (6054, 0.9241571126637872),
 (12539, 0.926532017109063),
 (8085, 0.9270731628344467),
 (4497, 0.9271205555995791),
 (3230, 0.9275651971215642),
 (13285, 0.9279768966782663),
 (4044, 0.92809979198534),
 (7100, 0.9281199629419349),
 (5214, 0.9283096406355675),
 (94

In [None]:
top_scores = ordered_scores[1:11]
top_scores

[(15275, 0.7521788574329701),
 (78, 0.7633475665151637),
 (249, 0.8500461329575789),
 (2642, 0.8517862843833618),
 (9933, 0.8563211663405321),
 (4093, 0.8726652846427698),
 (2943, 0.8770053990468247),
 (7696, 0.8820846719967486),
 (8902, 0.8852689004547315),
 (5975, 0.886280949446838)]

In [None]:
top_indexes = [i[0] for i in top_scores]
top_indexes

[15275, 78, 249, 2642, 9933, 4093, 2943, 7696, 8902, 5975]

In [None]:
preprocessedData['title'].iloc[top_indexes] + ", " + preprocessedData['author'].iloc[top_indexes]

15275     The Fellowship of the Ring, J. R. R. Tolkien
78             The Lord of the Rings, J. R. R. Tolkien
249           The Return of the King, J. R. R. Tolkien
2642     The Princess and the Goblin, George MacDonald
9933              The Lord of the Rings, Shaun McKenna
4093               The Forest of Doom, Ian Livingstone
2943                            Thud!, Terry Pratchett
7696                      The Goblin Wood, Hilari Bell
8902                      The Doom Brigade, Don Perrin
5975                        Twilight Eyes, Dean Koontz
dtype: object
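
The lookup steps above can be gathered into one helper. This is a minimal sketch, not the notebook's own code: the function name `recommend`, the toy titles and the toy distances are hypothetical. Unlike `ordered_scores[1:11]`, it excludes the query book by index rather than by position, so a book with an identical summary (distance 0) cannot cause the query itself to be returned.

```python
def recommend(title, titles, distance_matrix, k=10):
    """Return the k (title, distance) pairs closest to `title`, excluding the book itself."""
    if title not in titles:
        raise ValueError(f"Unknown title: {title!r}")
    # As in the notebook, the first book with a matching title is used.
    query_index = titles.index(title)
    scores = enumerate(distance_matrix[query_index])
    # Drop the query book, then sort the remaining books by increasing distance.
    ranked = sorted(
        (pair for pair in scores if pair[0] != query_index),
        key=lambda pair: pair[1],
    )
    return [(titles[i], distance) for i, distance in ranked[:k]]


# Toy data with made-up distances, for illustration only.
toy_titles = ["The Hobbit", "The Lord of the Rings", "Animal Farm"]
toy_distances = [
    [0.0, 0.25, 0.9],
    [0.25, 0.0, 0.8],
    [0.9, 0.8, 0.0],
]
result = recommend("The Hobbit", toy_titles, toy_distances, k=2)
```

With the notebook's data, a call might look like `recommend("The Hobbit", preprocessedData['title'].tolist(), distance_matrix)`.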
