<a href="https://colab.research.google.com/github/josemanuelvinhas/MarvelRecomverse/blob/main/MarvelRecomverse_Sistema_de_Valoracion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**0. Rating System**

Two prototypes of the rating system are built:

*   Manual rating of a character
*   Automated rating of a character based on comments



Each user's ratings are stored in a dataset with the following fields:

*   Character name (name)
*   Comment (comentario)
*   Comment rating (valoracion_comentario)
*   Direct rating (valoracion_directa)
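
As a rough illustration of that record layout, one row could be modelled as below. The class name `UserRating` and the 0.5 defaults (the neutral value used later in this notebook) are assumptions made for the sketch, not part of the original code.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class UserRating:
    """One row of the per-user ratings dataset (hypothetical helper)."""
    name: str                            # character name
    comentario: Optional[str] = None     # free-text comment, if any
    valoracion_comentario: float = 0.5   # rating inferred from the comment
    valoracion_directa: float = 0.5      # rating given directly by the user


row = UserRating(name="A.I.M.", valoracion_directa=0.0)
print(asdict(row))
```

Fields that have not been filled in yet keep their defaults, so a row can be created after the manual rating and completed later.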



#**1. Manual Rating System**

Enter the rating for the item:

*   Neutral
*   Like
*   Dislike


In [1]:
valoracion = "Like"  # Enter "Like" or "Dislike" here; any other value counts as Neutral

nombre_usuario = "Pepe"  # User name (it becomes part of the name of the output CSV file)

index = 1  # Row index of the character being rated, in this case A.I.M.

The dataset is loaded:

In [2]:
import pandas as pd

originalData = pd.read_csv('marvel.csv')
originalData

Unnamed: 0,name,description
0,A-Bomb (HAS),Rick Jones has been Hulk's best bud since day ...
1,A.I.M.,AIM is a terrorist organization bent on destro...
2,Abomination (Emil Blonsky),"Formerly known as Emil Blonsky, a spy of Sovie..."
3,Adam Warlock,Adam Warlock is an artificially created human ...
4,Agent X (Nijo),Originally a partner of the mind-altering assa...
...,...,...
276,Zarek,Zarek is a member of the Kree race with no sup...
277,Zodiak,"Twelve demons merged with Norman Harrison, who..."
278,Zombie (Simon Garth),War hero Simon Garth was turned into a zombie ...
279,Zuras,Zuras was once the leader of the Eternals.


Next we create the user's dataset and add the rating. Recall that the Recommender System encodes ratings as follows:

  * Like (0)
  * Dislike (1)
  * Neutral or none (0.5)
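
The same encoding can also be written as a dictionary lookup. This is only a sketch: `RATING_SCORES`, `NEUTRAL_SCORE` and `rating_to_score` are hypothetical names, and treating every unknown label as neutral mirrors the comment on the `valoracion` variable in the first cell.

```python
# Hypothetical names; the values are the ones listed above.
RATING_SCORES = {"Like": 0.0, "Dislike": 1.0}
NEUTRAL_SCORE = 0.5


def rating_to_score(label: str) -> float:
    """Return the recommender score for a rating label (neutral if unknown)."""
    return RATING_SCORES.get(label, NEUTRAL_SCORE)


for label in ("Like", "Dislike", "Neutro", ""):
    print(repr(label), "->", rating_to_score(label))
```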

In [3]:
name = originalData.loc[index, "name"]

# Encode the rating the way the recommender system expects it
if valoracion == "Like":
  like = 0.0
elif valoracion == "Dislike":
  like = 1.0
else:
  like = 0.5

data_user = {'name': name, 'valoracion_directa': like}

# DataFrame.append was removed in pandas 2.0; build the frame from a list of rows
dataframe_user = pd.DataFrame([data_user])

dataframe_user

Unnamed: 0,name,valoracion_directa
0,A.I.M.,0.0


The Automated Rating System will add the rest of the information to this dataset.

#**2. Automated Rating System**

**Loading the dataset**

This dataset is available on [GitHub](https://github.com/josemanuelvinhas/MarvelRecomverse/tree/main/datasets)

In [4]:
import pandas as pd

trainingData = pd.read_csv('reddit_data_train.csv', delimiter=',')
##trainingData = trainingData.head(1000) # Uncomment to test with only the first 1000 comments; leave it commented out to use the whole dataset
trainingData

Unnamed: 0,clean_comment,category
0,family mormon have never tried explain them t...,1
1,buddhism has very much lot compatible with chr...,1
2,seriously don say thing first all they won get...,-1
3,what you have learned yours and only yours wha...,0
4,for your own benefit you may want read living ...,1
...,...,...
18805,other option,-1
18806,honestly feel bjp lesser evil congress was abs...,-1
18807,pappu,0
18808,india should blamed for much fud though,1


In [5]:
trainingData['category'].value_counts()

 1    8290
 0    6228
-1    4292
Name: category, dtype: int64

**Preprocessing the training data**


In [6]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

import nltk
nltk.download('punkt')
nltk.download('stopwords')

ps = PorterStemmer()
stops = set(stopwords.words("english"))  # built once, outside the loop

preprocessedText = []

for row in trainingData.itertuples():
    # row[0] is the DataFrame index; row[1] is the column that holds the text
    text = word_tokenize(str(row[1]))
    # Drop stop words and non-alphanumeric tokens, then stem what is left
    text = [ps.stem(w) for w in text if w not in stops and w.isalnum()]
    text = " ".join(text)

    preprocessedText.append(text)

# Note: this is the same object as trainingData, not a copy
preprocessedData = trainingData
preprocessedData['processed_text'] = preprocessedText

preprocessedData

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,clean_comment,category,processed_text
0,family mormon have never tried explain them t...,1,famili mormon never tri explain still stare pu...
1,buddhism has very much lot compatible with chr...,1,buddhism much lot compat christian especi cons...
2,seriously don say thing first all they won get...,-1,serious say thing first get complex explain no...
3,what you have learned yours and only yours wha...,0,learn want teach differ focu goal wrap paper b...
4,for your own benefit you may want read living ...,1,benefit may want read live buddha live christ ...
...,...,...,...
18805,other option,-1,option
18806,honestly feel bjp lesser evil congress was abs...,-1,honestli feel bjp lesser evil congress absolut...
18807,pappu,0,pappu
18808,india should blamed for much fud though,1,india blame much fud though
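
The same tokenize, filter and stem loop is applied to more than one input in this notebook, so it could be factored into a helper. The sketch below is an assumption about how that might look: `preprocess` is a hypothetical name, and the tokenizer, stemmer and stop-word set are passed in as arguments so that the example runs without NLTK data. In the notebook they would be `word_tokenize`, `ps.stem` and `set(stopwords.words("english"))`.

```python
from typing import Callable, Iterable, Set


def preprocess(text: str,
               tokenize: Callable[[str], Iterable[str]],
               stem: Callable[[str], str],
               stops: Set[str]) -> str:
    """Tokenize, drop stop words and non-alphanumeric tokens, stem, and re-join."""
    tokens = tokenize(str(text))
    kept = [stem(w) for w in tokens if w not in stops and w.isalnum()]
    return " ".join(kept)


# Toy stand-ins so the example is self-contained (not the NLTK behaviour).
toy_stops = {"the", "is", "a"}


def toy_stem(word: str) -> str:
    """Strip a trailing 's'; a crude placeholder for a real stemmer."""
    return word[:-1] if word.endswith("s") else word


print(preprocess("the cat chases a mouse !", str.split, toy_stem, toy_stops))
```

With a helper like this, a whole column could be processed with `Series.apply` instead of an explicit `itertuples` loop.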


**Creating the bag of words**


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

bagOfWordsModel = TfidfVectorizer()
bagOfWordsModel.fit(preprocessedData['processed_text'])
textsBoW= bagOfWordsModel.transform(preprocessedData['processed_text'])
print("Finished")

Finished


In [8]:
textsBoW.shape

(18810, 26784)
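
`TfidfVectorizer` weights each term by TF-IDF rather than by raw counts; each row of the resulting matrix is a document and each column a vocabulary term, which is what the shape above reports. To make the weighting concrete, here is a small pure-Python sketch on a toy corpus. It follows the formula that scikit-learn documents for the default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row). The function name `tfidf` and the toy documents are made up for the illustration, and tokenisation is a plain whitespace split rather than scikit-learn's own tokenizer.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) with L2-normalised TF-IDF weights per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}

    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        rows.append([x / norm for x in raw])
    return vocab, rows


vocab, rows = tfidf(["india ban bitcoin", "india import problem", "haha nice"])
for row in rows:
    print({w: round(x, 3) for w, x in zip(vocab, row) if x > 0})
```

Terms that appear in fewer documents (such as "bitcoin" here) receive a larger weight than terms shared by several documents (such as "india").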

**Training a classification algorithm (SVM)**

In [9]:
from sklearn import svm
svc = svm.SVC(kernel='linear')  # Classification model

X_train = textsBoW  # Documents
Y_train = trainingData['category']  # Document labels
svc.fit(X_train, Y_train)  # Training

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
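
`SVC(kernel='linear')` can be slow on a matrix of this size. A commonly used alternative for linear text classification is `LinearSVC`, which optimises a slightly different objective, so its results will not necessarily match the ones reported in this notebook. The sketch below is a substitution, not the method used here, and shows it wrapped in a pipeline on a toy corpus; the texts and labels are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data, invented for illustration only
train_texts = ["love great hero", "great fun love", "hate bad villain",
               "bad boring hate", "play guitar", "walk park"]
train_labels = [1, 1, -1, -1, 0, 0]

# The pipeline fits the vectorizer and the classifier together,
# and reuses the fitted vocabulary at prediction time.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

predicted = model.predict(["love hero", "hate villain"])
print(predicted)
```

Keeping the vectorizer and the classifier in one object also removes the need to call `transform` by hand before every `predict`.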

**Loading and preprocessing the test documents**

In [10]:
testData = pd.read_csv('reddit_data_test.csv', delimiter=',')
##testData = testData.head(100)
testData

Unnamed: 0,clean_comment,category
0,modi liar,0
1,doesn india have more important problems then ...,1
2,well that settles then everyone buy back india...,-1
3,all know ever visit india getting one those he...,0
4,india wont ban bitcoin because they didnt give...,-1
...,...,...
18434,jesus,0
18435,kya bhai pure saal chutiya banaya modi aur jab...,1
18436,downvote karna tha par upvote hogaya,0
18437,haha nice,1


In [11]:
ps = PorterStemmer()
stops = set(stopwords.words("english"))  # built once, outside the loop

preprocessedText = []

for row in testData.itertuples():
    # row[1] is the column that holds the text
    text = word_tokenize(str(row[1]))
    # Drop stop words and non-alphanumeric tokens, then stem what is left
    text = [ps.stem(w) for w in text if w not in stops and w.isalnum()]
    text = " ".join(text)

    preprocessedText.append(text)

preprocessedDataTest = testData
preprocessedDataTest['processed_text'] = preprocessedText

preprocessedDataTest

Unnamed: 0,clean_comment,category,processed_text
0,modi liar,0,modi liar
1,doesn india have more important problems then ...,1,india import problem cryptocurr
2,well that settles then everyone buy back india...,-1,well settl everyon buy back indian kid verifi ...
3,all know ever visit india getting one those he...,0,know ever visit india get one head massag barb...
4,india wont ban bitcoin because they didnt give...,-1,india wont ban bitcoin didnt give everi citize...
...,...,...,...
18434,jesus,0,jesu
18435,kya bhai pure saal chutiya banaya modi aur jab...,1,kya bhai pure saal chutiya banaya modi aur jab...
18436,downvote karna tha par upvote hogaya,0,downvot karna tha par upvot hogaya
18437,haha nice,1,haha nice


In [12]:
testData['category'].value_counts()

 1    7540
 0    6914
-1    3985
Name: category, dtype: int64
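
These counts also give a reference point for the evaluation: a classifier that always predicts the most frequent class. A quick sketch of that baseline, using the counts printed above (the variable names are illustrative):

```python
# Test-set class counts, copied from the output above
test_counts = {1: 7540, 0: 6914, -1: 3985}

total = sum(test_counts.values())
majority_class = max(test_counts, key=test_counts.get)
baseline_accuracy = test_counts[majority_class] / total

print(f"total documents: {total}")
print(f"majority class: {majority_class}")
print(f"always-predict-majority accuracy: {baseline_accuracy:.3f}")
```

A model is only useful on this test set if its accuracy is clearly above this baseline of roughly 0.41.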

In [13]:
textsBoWTest= bagOfWordsModel.transform(preprocessedDataTest['processed_text'])
print("Finished")

Finished


In [14]:
textsBoWTest.shape

(18439, 26784)

**Classifying the test documents**

In [15]:
X_test = textsBoWTest  # Documents

predictions = svc.predict(X_test)  # The classifier's predictions are stored in the predictions array

**Evaluating the predictions**

In [16]:
from sklearn.metrics import classification_report

Y_test = testData['category']  # True labels of the documents

print(classification_report(Y_test, predictions))

              precision    recall  f1-score   support

          -1       0.78      0.64      0.70      3985
           0       0.82      0.91      0.87      6914
           1       0.83      0.83      0.83      7540

    accuracy                           0.82     18439
   macro avg       0.81      0.79      0.80     18439
weighted avg       0.82      0.82      0.82     18439
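
The two summary rows are averages of the per-class rows: the macro average weights every class equally, while the weighted average weights each class by its support. The sketch below recomputes both for the F1 column from the rounded figures in the report, so small rounding differences are possible.

```python
# Per-class F1 and support, copied from the report above (rounded values)
f1 = {-1: 0.70, 0: 0.87, 1: 0.83}
support = {-1: 3985, 0: 6914, 1: 7540}

macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(f"macro avg F1:    {macro_f1:.2f}")
print(f"weighted avg F1: {weighted_f1:.2f}")
```

The macro average is lower than the weighted one because the weakest class (-1) is also the smallest, so it counts for less once support is taken into account.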



**Training and evaluating another classification algorithm: k-NN**

In [17]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)


neigh.fit(X_train, Y_train) 
predictions = neigh.predict(X_test) 

print (classification_report(Y_test, predictions))

              precision    recall  f1-score   support

          -1       0.61      0.05      0.10      3985
           0       0.38      0.97      0.55      6914
           1       0.75      0.06      0.11      7540

    accuracy                           0.40     18439
   macro avg       0.58      0.36      0.25     18439
weighted avg       0.58      0.40      0.27     18439
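
The class 0 row suggests that this k-NN model labels almost everything as neutral. That can be estimated from the report itself: the true positives for a class are recall × support, and the number of predictions for that class is true positives / precision. The figures are rounded to two decimals, so the result is only approximate.

```python
# Class 0 row of the k-NN report above (rounded values)
precision_0 = 0.38
recall_0 = 0.97
support_0 = 6914
total_docs = 18439

true_positives = recall_0 * support_0          # class 0 documents predicted as 0
predicted_as_0 = true_positives / precision_0  # all documents predicted as 0
share = predicted_as_0 / total_docs

print(f"documents predicted as class 0: about {predicted_as_0:.0f}")
print(f"share of all predictions: about {share:.0%}")
```

With an accuracy of 0.40 against 0.82 for the SVM, this k-NN configuration is not a usable replacement for the SVM on this data.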



**Rating a comment about a character**



One or more comments to rate can be entered in the following cell.

In [18]:
comentarios = []
comentarios.append("Fan of superheroes in general and Spider-Man in particular")  # Positive comment
comentarios.append("I hate characters who wear capes")  # Negative comment
comentarios.append("I play the guitar")  # Neutral comment

comentarios.append("This character makes me feel sad.")  # Comment about A.I.M.

The following cell rates the comments entered above.

In [19]:
# DataFrame.append was removed in pandas 2.0; build the frame in one step
comentarioData = pd.DataFrame({'clean_comment': comentarios})

ps = PorterStemmer()
stops = set(stopwords.words("english"))

preprocessedText = []

for row in comentarioData.itertuples():
    text = word_tokenize(str(row[1]))
    text = [ps.stem(w) for w in text if w not in stops and w.isalnum()]
    text = " ".join(text)
    preprocessedText.append(text)

preprocessedDataTest = comentarioData
preprocessedDataTest['processed_text'] = preprocessedText

# Reuse the vectorizer that was fitted on the training data
textsBoWTest = bagOfWordsModel.transform(preprocessedDataTest['processed_text'])

X_test = textsBoWTest  # Documents
predictions = svc.predict(X_test)

for i in range(len(predictions)):
  if predictions[i] == 1:
    print(comentarios[i] + " -> Positive comment")
  elif predictions[i] == 0:
    print(comentarios[i] + " -> Neutral comment")
  else:
    print(comentarios[i] + " -> Negative comment")

Fan of superheroes in general and Spider-Man in particular -> Positive comment
I hate characters who wear capes -> Negative comment
I play the guitar -> Neutral comment
This character makes me feel sad. -> Negative comment


**Storing the comment and its rating**

We add the comment and its rating. Keep in mind that the recommender system uses these ratings, so the values must be the following:


* Good (0.0)
* Neutral or no rating (0.5)
* Bad (1.0)

*NOTE: with the information stored in this dataset, the recommender system can already be applied to this fictional user "Pepe"*

In [20]:
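comment_index = 3  # Position in comentarios of the comment about A.I.M.

dataframe_user.loc[dataframe_user['name'] == name, 'comentario'] = comentarios[comment_index]

# Use the prediction for that same comment instead of a leftover loop variable
if predictions[comment_index] == 1:
  v_comentario = 0.0
elif predictions[comment_index] == 0:
  v_comentario = 0.5
else:
  v_comentario = 1.0

dataframe_user.loc[dataframe_user['name'] == name, 'valoracion_comentario'] = v_comentario

dataframe_user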

Unnamed: 0,name,valoracion_directa,comentario,valoracion_comentario
0,A.I.M.,0.0,This character makes me feel sad.,1.0
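
Because the classifier's categories are -1, 0 and 1 and the recommender's scale is 1.0, 0.5 and 0.0, the `if`/`elif` mapping in the cell above is equivalent to the linear formula `(1 - category) / 2`. A minimal sketch, with `category_to_score` as a hypothetical helper name:

```python
def category_to_score(category: int) -> float:
    """Map a sentiment category (-1, 0, 1) to the recommender scale (1.0, 0.5, 0.0)."""
    if category not in (-1, 0, 1):
        raise ValueError(f"unexpected category: {category!r}")
    return (1 - category) / 2


for category in (1, 0, -1):
    print(category, "->", category_to_score(category))
```

Rejecting unexpected categories explicitly avoids silently storing a "bad" rating when the classifier output is not one of the three known labels.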


Finally, we save the dataset.

In [21]:
dataframe_user.to_csv("userdata_" + nombre_usuario, index=False)
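
To check what was written, the file can be read back with `pd.read_csv`. The sketch below is self-contained, so it rebuilds a one-row frame with the same columns and writes it to a temporary directory rather than to the notebook's working directory; the sample values are copied from the output shown earlier.

```python
import os
import tempfile

import pandas as pd

# One-row frame with the same columns as dataframe_user (values from the earlier output)
sample = pd.DataFrame([{
    "name": "A.I.M.",
    "valoracion_directa": 0.0,
    "comentario": "This character makes me feel sad.",
    "valoracion_comentario": 1.0,
}])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "userdata_" + "Pepe")
    sample.to_csv(path, index=False)   # same call as in the notebook
    restored = pd.read_csv(path)       # no extra index column, since index=False was used

print(restored.columns.tolist())
print(restored.loc[0, "valoracion_comentario"])
```

Note that the file name has no `.csv` extension; `pd.read_csv` reads it regardless, since it does not rely on the extension for plain text files.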