<a href="https://colab.research.google.com/github/josemanuelvinhas/MarvelRecomverse/blob/main/MarvelRecomverse_Sistema_de_Valoracion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**0. Rating System**

Two prototypes of the rating system are built:

*   Manual rating of a character
*   Automated rating of a character from comments

Each user's ratings are stored in a dataset with the following fields:

*   Character name (name)
*   Comment (comentario)
*   Comment rating (valoracion_comentario)
*   Direct rating (valoracion_directa)
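As a rough illustration of the record described above, the four fields could be modelled as a small dataclass. This is only a sketch: the class name `UserRating` and the default values are assumptions made for the example and are not part of the notebook, which stores the data in a pandas DataFrame.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class UserRating:
    """One row of the per-user ratings dataset (hypothetical helper)."""
    name: str                                      # character name
    valoracion_directa: float = 0.5                # direct rating; 0.5 assumed when none is given
    comentario: Optional[str] = None               # free-text comment, if any
    valoracion_comentario: Optional[float] = None  # rating derived from the comment


row = UserRating(name="A.I.M.", valoracion_directa=0.0)
print(asdict(row))
```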



#**1. Manual Rating System**

Enter the rating for the item:

*   Neutral
*   Like
*   Dislike


In [None]:
valoracion = "Like"  # Enter "Like" or "Dislike"; any other value is treated as Neutral

nombre_usuario = "Pepe"  # User name (used in the name of the CSV file that stores the ratings)

index = 1  # Row index of the character being rated, in this case A.I.M.

Load the dataset.

In [None]:
import pandas as pd

originalData = pd.read_csv('marvel.csv')
originalData

Unnamed: 0,name,description
0,A-Bomb (HAS),Rick Jones has been Hulk's best bud since day ...
1,A.I.M.,AIM is a terrorist organization bent on destro...
2,Abomination (Emil Blonsky),"Formerly known as Emil Blonsky, a spy of Sovie..."
3,Adam Warlock,Adam Warlock is an artificially created human ...
4,Agent X (Nijo),Originally a partner of the mind-altering assa...
...,...,...
276,Zarek,Zarek is a member of the Kree race with no sup...
277,Zodiak,"Twelve demons merged with Norman Harrison, who..."
278,Zombie (Simon Garth),War hero Simon Garth was turned into a zombie ...
279,Zuras,Zuras was once the leader of the Eternals.


Next we create the user's dataset and add the rating. Recall that the Recommender System interprets ratings as follows:

  * Like (0)
  * Dislike (1)
  * Neutral or none (0.5)

In [None]:
name = originalData.loc[index, "name"]

# Map the label to the numeric scale used by the Recommender System
if valoracion == "Like":
  like = 0.0
elif valoracion == "Dislike":
  like = 1.0
else:
  like = 0.5

data_user = {'name': name, 'valoracion_directa': like}

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
dataframe_user = pd.DataFrame([data_user])

dataframe_user

Unnamed: 0,name,valoracion_directa
0,A.I.M.,0.0
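The label-to-score mapping used in the cell above can also be written as a small lookup. This is a minimal sketch; the function name `rating_to_score` is hypothetical and does not appear in the notebook.

```python
def rating_to_score(valoracion: str) -> float:
    """Map a manual rating label to the recommender's numeric scale.

    "Like" -> 0.0, "Dislike" -> 1.0, anything else -> 0.5 (neutral or none).
    """
    scores = {"Like": 0.0, "Dislike": 1.0}
    return scores.get(valoracion, 0.5)


for label in ("Like", "Dislike", "Neutro"):
    print(label, "->", rating_to_score(label))
```

A dictionary keeps the three cases in one place, which makes it harder for the mapping to drift if the scale is reused elsewhere.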


The rest of the information is added to this dataset in the Automated Rating System section.

#**2. Automated Rating System**

Several classifiers were tested below. In the end, Ensemble Learning was chosen as the comment classifier.

**Loading the dataset**

This dataset is available on [GitHub](https://github.com/josemanuelvinhas/MarvelRecomverse/tree/main/datasets)

It is a dataset of Reddit comments, split into two parts: one for training and one for testing.

In [None]:
import pandas as pd

trainingData = pd.read_csv('reddit_data_train.csv', delimiter=',')
trainingData = trainingData.head(4000)  # Comment out this line to use the whole dataset
trainingData

Unnamed: 0,clean_comment,category
0,family mormon have never tried explain them t...,1
1,buddhism has very much lot compatible with chr...,1
2,seriously don say thing first all they won get...,-1
3,what you have learned yours and only yours wha...,0
4,for your own benefit you may want read living ...,1
...,...,...
3995,folks just head over times now and watch what ...,1
3996,anybody has link the imgur photo which shows f...,0
3997,the results are expected this delhi election ...,1
3998,bjp should totally buy nano innova kejriwal no...,0


In [None]:
trainingData['category'].value_counts()

 1    1816
 0    1304
-1     880
Name: category, dtype: int64

**Preprocessing the training data**


In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

ps = PorterStemmer()
stops = set(stopwords.words("english"))  # Build the stop-word set once, outside the loop

preprocessedText = []

for comment in trainingData['clean_comment']:
    text = word_tokenize(str(comment))
    # Drop stop words and non-alphanumeric tokens, then stem what remains
    text = [ps.stem(w) for w in text if w not in stops and w.isalnum()]
    preprocessedText.append(" ".join(text))

preprocessedData = trainingData  # Same object as trainingData, not a copy
preprocessedData['processed_text'] = preprocessedText

preprocessedData

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,clean_comment,category,processed_text
0,family mormon have never tried explain them t...,1,famili mormon never tri explain still stare pu...
1,buddhism has very much lot compatible with chr...,1,buddhism much lot compat christian especi cons...
2,seriously don say thing first all they won get...,-1,serious say thing first get complex explain no...
3,what you have learned yours and only yours wha...,0,learn want teach differ focu goal wrap paper b...
4,for your own benefit you may want read living ...,1,benefit may want read live buddha live christ ...
...,...,...,...
3995,folks just head over times now and watch what ...,1,folk head time watch arnab congress spokespers...
3996,anybody has link the imgur photo which shows f...,0,anybodi link imgur photo show front page newsp...
3997,the results are expected this delhi election ...,1,result expect delhi elect remind fight mountai...
3998,bjp should totally buy nano innova kejriwal no...,0,bjp total buy nano innova kejriw gon na give l...
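The notebook applies the same tokenize, filter, and stem steps in several cells, so the pipeline could be factored into one function. The sketch below is an assumption-laden illustration: `preprocess` is a hypothetical name, and its defaults (whitespace tokenizer, no stemming, empty stop-word set) are stand-ins chosen so the example runs without NLTK data. To reproduce the notebook's behaviour one would pass NLTK's `word_tokenize`, `PorterStemmer().stem`, and `set(stopwords.words("english"))`.

```python
def preprocess(text, tokenize=str.split, stem=lambda w: w, stop_words=frozenset()):
    """Tokenize, drop stop words and non-alphanumeric tokens, stem, and rejoin."""
    tokens = tokenize(str(text))
    kept = [stem(w) for w in tokens if w not in stop_words and w.isalnum()]
    return " ".join(kept)


# Toy stop-word set, for illustration only (not NLTK's English list)
toy_stops = {"i", "the"}
print(preprocess("i play the guitar", stop_words=toy_stops))
```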


**Creating the bag of words**


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

bagOfWordsModel = TfidfVectorizer()
bagOfWordsModel.fit(preprocessedData['processed_text'])
textsBoW= bagOfWordsModel.transform(preprocessedData['processed_text'])
print("Finished")

Finished


In [None]:
textsBoW.shape

(4000, 9764)
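To give an intuition for what each of the 9764 columns holds, here is a pure-Python sketch of TF-IDF weighting with a smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row, which matches the documented defaults of `TfidfVectorizer`. The function `tfidf` and the three-document toy corpus are made up for the example, and whitespace splitting is a simplification of the vectorizer's own tokenizer.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) using smoothed idf and L2-normalised rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(w for doc in tokenized for w in set(doc))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in raw])
    return vocab, rows


# Toy corpus, for illustration only
vocab, rows = tfidf(["modi liar", "haha nice", "nice modi"])
print(vocab)
print([round(v, 3) for v in rows[0]])
```

In the first toy document, "liar" appears in only one document while "modi" appears in two, so "liar" receives the larger weight.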

**Training a classification algorithm (SVM)**

SVM classifiers are tried with different kernels and other parameters.

In [None]:
X_train = textsBoW  # Documents
Y_train = trainingData['category']  # Document labels

*   **Kernel = linear**

In [None]:
from sklearn import svm
svc_linear = svm.SVC(kernel='linear')  # Classification model

svc_linear.fit(X_train, Y_train)  # Training

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

*   **Kernel = poly**

With degree 3 (the default)

In [None]:
from sklearn import svm
svc_poly = svm.SVC(kernel='poly')  # Classification model

svc_poly.fit(X_train, Y_train)  # Training

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='poly',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

With degree 10

In [None]:
from sklearn import svm
svc_poly_10 = svm.SVC(kernel='poly', degree=10)  # Classification model

svc_poly_10.fit(X_train, Y_train)  # Training

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=10, gamma='scale', kernel='poly',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

*   **Kernel = rbf**

In [None]:
from sklearn import svm
svc_rbf = svm.SVC(kernel='rbf')  # Classification model

svc_rbf.fit(X_train, Y_train)  # Training

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

*   **Kernel = sigmoid**

With coefficient 0.0 (the default)

In [None]:
from sklearn import svm
svc_sigmoid = svm.SVC(kernel='sigmoid')  # Classification model

svc_sigmoid.fit(X_train, Y_train)  # Training

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='sigmoid',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

With coefficient 1.0

In [None]:
from sklearn import svm
svc_sigmoid_coef_1 = svm.SVC(kernel='sigmoid', coef0=1.0)  # Classification model

svc_sigmoid_coef_1.fit(X_train, Y_train)  # Training

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=1.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='sigmoid',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

**Loading and preprocessing the test documents**

In [None]:
testData = pd.read_csv('reddit_data_test.csv', delimiter=',')
##testData = testData.head(100)
testData

Unnamed: 0,clean_comment,category
0,modi liar,0
1,doesn india have more important problems then ...,1
2,well that settles then everyone buy back india...,-1
3,all know ever visit india getting one those he...,0
4,india wont ban bitcoin because they didnt give...,-1
...,...,...
18434,jesus,0
18435,kya bhai pure saal chutiya banaya modi aur jab...,1
18436,downvote karna tha par upvote hogaya,0
18437,haha nice,1


In [None]:
ps = PorterStemmer()
stops = set(stopwords.words("english"))  # Build the stop-word set once, outside the loop

preprocessedText = []

for comment in testData['clean_comment']:
    text = word_tokenize(str(comment))
    # Drop stop words and non-alphanumeric tokens, then stem what remains
    text = [ps.stem(w) for w in text if w not in stops and w.isalnum()]
    preprocessedText.append(" ".join(text))

preprocessedDataTest = testData  # Same object as testData, not a copy
preprocessedDataTest['processed_text'] = preprocessedText

preprocessedDataTest

Unnamed: 0,clean_comment,category,processed_text
0,modi liar,0,modi liar
1,doesn india have more important problems then ...,1,india import problem cryptocurr
2,well that settles then everyone buy back india...,-1,well settl everyon buy back indian kid verifi ...
3,all know ever visit india getting one those he...,0,know ever visit india get one head massag barb...
4,india wont ban bitcoin because they didnt give...,-1,india wont ban bitcoin didnt give everi citize...
...,...,...,...
18434,jesus,0,jesu
18435,kya bhai pure saal chutiya banaya modi aur jab...,1,kya bhai pure saal chutiya banaya modi aur jab...
18436,downvote karna tha par upvote hogaya,0,downvot karna tha par upvot hogaya
18437,haha nice,1,haha nice


In [None]:
testData['category'].value_counts()

 1    7540
 0    6914
-1    3985
Name: category, dtype: int64
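These class counts give a useful reference point for the accuracy figures reported later in the notebook: a classifier that always predicts the most frequent class already gets that class's share of the test set right. The short sketch below computes that baseline from the counts printed above; the variable names are chosen for the example only.

```python
from collections import Counter

# Class counts copied from the value_counts() output above
test_counts = Counter({1: 7540, 0: 6914, -1: 3985})

total = sum(test_counts.values())
majority_label, majority_count = test_counts.most_common(1)[0]
baseline_accuracy = majority_count / total

print(f"total test documents: {total}")
print(f"always predicting {majority_label}: accuracy = {baseline_accuracy:.2f}")
```

Any model whose accuracy sits close to this value is doing little better than always answering with the majority class.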

In [None]:
textsBoWTest= bagOfWordsModel.transform(preprocessedDataTest['processed_text'])
print("Finished")

Finished


In [None]:
textsBoWTest.shape

(18439, 9764)

**Classifying the test documents**

In [None]:
X_test = textsBoWTest  # Documents

*   **Kernel = linear**

In [None]:
predictions_linear = svc_linear.predict(X_test)  # Store the classifier's predictions in an array

*   **Kernel = poly**

With degree 3 (the default)

In [None]:
predictions_poly = svc_poly.predict(X_test)  # Store the classifier's predictions in an array

With degree 10

In [None]:
predictions_poly_10 = svc_poly_10.predict(X_test)  # Store the classifier's predictions in an array

*   **Kernel = rbf**

In [None]:
predictions_rbf = svc_rbf.predict(X_test)  # Store the classifier's predictions in an array

*   **Kernel = sigmoid**

With coefficient 0.0 (the default)

In [None]:
predictions_sigmoid = svc_sigmoid.predict(X_test)  # Store the classifier's predictions in an array

With coefficient 1.0

In [None]:
predictions_sigmoid_coef_1 = svc_sigmoid_coef_1.predict(X_test)

**Evaluating the SVM predictions**


In [None]:
from sklearn.metrics import classification_report

Y_test = testData['category']  # True labels of the documents

*   **Kernel = linear**

In [None]:
print (classification_report(Y_test, predictions_linear))

              precision    recall  f1-score   support

          -1       0.70      0.45      0.55      3985
           0       0.76      0.82      0.79      6914
           1       0.72      0.80      0.76      7540

    accuracy                           0.73     18439
   macro avg       0.73      0.69      0.70     18439
weighted avg       0.73      0.73      0.73     18439
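The two summary rows of the report average the per-class scores in different ways: the macro average treats every class equally, while the weighted average weights each class by its support. The sketch below recomputes both for the F1 column from the rounded figures of the linear-kernel report above. Because the inputs are already rounded to two decimals, the agreement with the report is only approximate.

```python
# Per-class F1 and support copied from the linear-kernel report above
f1 = {-1: 0.55, 0: 0.79, 1: 0.76}
support = {-1: 3985, 0: 6914, 1: 7540}

total = sum(support.values())

# Macro average: plain mean over classes
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(f"macro avg f1:    {macro_f1:.2f}")
print(f"weighted avg f1: {weighted_f1:.2f}")
```

The macro average is the lower of the two because the negative class (-1), which has the weakest F1, counts as much as the larger classes.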



*   **Kernel = poly**

With degree 3

In [None]:
print (classification_report(Y_test, predictions_poly))

              precision    recall  f1-score   support

          -1       0.89      0.03      0.06      3985
           0       0.88      0.09      0.17      6914
           1       0.43      0.99      0.60      7540

    accuracy                           0.45     18439
   macro avg       0.73      0.37      0.27     18439
weighted avg       0.70      0.45      0.32     18439



With degree 10

In [None]:
print (classification_report(Y_test, predictions_poly_10))

              precision    recall  f1-score   support

          -1       0.97      0.01      0.02      3985
           0       0.97      0.02      0.05      6914
           1       0.41      1.00      0.58      7540

    accuracy                           0.42     18439
   macro avg       0.79      0.34      0.22     18439
weighted avg       0.74      0.42      0.26     18439



*   **Kernel = rbf**

In [None]:
print (classification_report(Y_test, predictions_rbf))

              precision    recall  f1-score   support

          -1       0.84      0.23      0.36      3985
           0       0.75      0.78      0.76      6914
           1       0.64      0.86      0.74      7540

    accuracy                           0.70     18439
   macro avg       0.75      0.63      0.62     18439
weighted avg       0.73      0.70      0.67     18439



*   **Kernel = sigmoid**

With coefficient 0.0

In [None]:
print (classification_report(Y_test, predictions_sigmoid))

              precision    recall  f1-score   support

          -1       0.71      0.43      0.54      3985
           0       0.76      0.82      0.79      6914
           1       0.71      0.81      0.76      7540

    accuracy                           0.73     18439
   macro avg       0.73      0.69      0.69     18439
weighted avg       0.73      0.73      0.72     18439



With coefficient 1.0

In [None]:
print (classification_report(Y_test, predictions_sigmoid_coef_1))

              precision    recall  f1-score   support

          -1       0.85      0.19      0.30      3985
           0       0.70      0.84      0.76      6914
           1       0.66      0.81      0.73      7540

    accuracy                           0.69     18439
   macro avg       0.74      0.61      0.60     18439
weighted avg       0.72      0.69      0.65     18439



**Training and evaluating another classification algorithm: k-NN**

*   **Results with *n_neighbors=3***

In [None]:
from sklearn.neighbors import KNeighborsClassifier
neigh_3 = KNeighborsClassifier(n_neighbors=3)

neigh_3.fit(X_train, Y_train) 
predictions_neigh_3 = neigh_3.predict(X_test) 

print (classification_report(Y_test, predictions_neigh_3))

              precision    recall  f1-score   support

          -1       0.73      0.02      0.04      3985
           0       0.38      0.99      0.55      6914
           1       0.80      0.02      0.04      7540

    accuracy                           0.39     18439
   macro avg       0.64      0.35      0.21     18439
weighted avg       0.63      0.39      0.23     18439



*   **Results with *n_neighbors=2***

In [None]:
neigh_2 = KNeighborsClassifier(n_neighbors=2)

neigh_2.fit(X_train, Y_train) 
predictions_neigh_2 = neigh_2.predict(X_test) 

print (classification_report(Y_test, predictions_neigh_2))

              precision    recall  f1-score   support

          -1       0.55      0.06      0.10      3985
           0       0.38      0.97      0.54      6914
           1       0.89      0.02      0.04      7540

    accuracy                           0.39     18439
   macro avg       0.60      0.35      0.23     18439
weighted avg       0.62      0.39      0.24     18439



*   **Results with *n_neighbors=4***

In [None]:
neigh_4 = KNeighborsClassifier(n_neighbors=4)

neigh_4.fit(X_train, Y_train) 
predictions_neigh_4 = neigh_4.predict(X_test) 

print (classification_report(Y_test, predictions_neigh_4))

              precision    recall  f1-score   support

          -1       0.71      0.02      0.04      3985
           0       0.38      0.99      0.55      6914
           1       0.88      0.01      0.02      7540

    accuracy                           0.38     18439
   macro avg       0.66      0.34      0.20     18439
weighted avg       0.66      0.38      0.22     18439



**Training and classification with Ensemble learning**

Next, a *VotingClassifier* is used, which classifies each comment according to the weighted majority vote of its member classifiers. It combines several of the algorithms tried above, giving more weight to those that performed best.
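As a rough picture of what weighted hard voting does for a single comment, the pure-Python sketch below adds up the weight behind each predicted label and returns the label with the largest total. The function `weighted_hard_vote` and the example votes are hypothetical, and breaking ties in favour of the smallest label is a choice made for this sketch. The weights match the ones passed to the *VotingClassifier* in this notebook.

```python
from collections import defaultdict


def weighted_hard_vote(predicted_labels, weights):
    """Return the label with the largest total weight (ties: smallest label)."""
    totals = defaultdict(float)
    for label, weight in zip(predicted_labels, weights):
        totals[label] += weight
    return min(totals, key=lambda label: (-totals[label], label))


# One weight per member: linear, rbf, sigmoid, poly, k-NN
weights = [3, 3, 3, 1, 1]

# Made-up votes: label 1 collects 3 + 3 = 6, label 0 collects 3 + 1 + 1 = 5
print(weighted_hard_vote([1, 0, 1, 0, 0], weights))

# Made-up votes: label -1 collects 6, label 0 collects 3, label 1 collects 2
print(weighted_hard_vote([-1, -1, 0, 1, 1], weights))
```

With these weights, any two of the three weight-3 classifiers that agree (total 6) outvote the remaining members combined (at most 5).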

In [None]:
from sklearn.ensemble import VotingClassifier

voting_classifier = VotingClassifier(
      estimators=[
                  ('svc_linear', svm.SVC(kernel='linear', probability=True)),
                  ('svc_rbf', svm.SVC(kernel='rbf')),
                  ('svc_sigmoid', svm.SVC(kernel='sigmoid')),
                  ('svc_poly', svm.SVC(kernel='poly')),
                  ('k-NN',KNeighborsClassifier(n_neighbors=3))
                  ],
      voting='hard',
      weights=[3,3,3,1,1])

voting_classifier.fit(X_train, Y_train)





VotingClassifier(estimators=[('svc_linear',
                              SVC(C=1.0, break_ties=False, cache_size=200,
                                  class_weight=None, coef0=0.0,
                                  decision_function_shape='ovr', degree=3,
                                  gamma='scale', kernel='linear', max_iter=-1,
                                  probability=True, random_state=None,
                                  shrinking=True, tol=0.001, verbose=False)),
                             ('svc_rbf',
                              SVC(C=1.0, break_ties=False, cache_size=200,
                                  class_weight=None, coef0=0.0...
                                  decision_function_shape='ovr', degree=3,
                                  gamma='scale', kernel='poly', max_iter=-1,
                                  probability=False, random_state=None,
                                  shrinking=True, tol=0.001, verbose=False)),
                             (

In [None]:
predictions_voting_classifier = voting_classifier.predict(X_test)

In [None]:
print (classification_report(Y_test, predictions_voting_classifier))

              precision    recall  f1-score   support

          -1       0.72      0.43      0.54      3985
           0       0.76      0.82      0.79      6914
           1       0.71      0.81      0.76      7540

    accuracy                           0.73     18439
   macro avg       0.73      0.69      0.69     18439
weighted avg       0.73      0.73      0.72     18439



**Rating a comment about a character**



Based on the test results, the VotingClassifier is used.


One or more comments to be rated can be entered in the following cell.

In [None]:
comentarios = []
comentarios.append("Fan of superheroes in general and Spider-Man in particular")  # Positive comment
comentarios.append("I hate characters who wear capes")  # Negative comment
comentarios.append("I play the guitar")  # Neutral comment

comentarios.append("This character makes me feel sad.")  # Comment about A.I.M.

The following cell rates the comments entered above.

In [None]:
# DataFrame.append was removed in pandas 2.0, so build the frame in one step
comentarioData = pd.DataFrame({'clean_comment': comentarios})

ps = PorterStemmer()
stops = set(stopwords.words("english"))

preprocessedText = []

for comment in comentarioData['clean_comment']:
    text = word_tokenize(str(comment))
    text = [ps.stem(w) for w in text if w not in stops and w.isalnum()]
    preprocessedText.append(" ".join(text))

preprocessedDataTest = comentarioData
preprocessedDataTest['processed_text'] = preprocessedText

# Reuse the vectorizer fitted on the training data
textsBoWTest = bagOfWordsModel.transform(preprocessedDataTest['processed_text'])

X_test_comentario = textsBoWTest  # Documents
predictions = voting_classifier.predict(X_test_comentario)

for i in range(len(predictions)):
  if predictions[i] == 1:
    print(comentarios[i] + " -> Positive comment")
  elif predictions[i] == 0:
    print(comentarios[i] + " -> Neutral comment")
  else:
    print(comentarios[i] + " -> Negative comment")

Fan of superheroes in general and Spider-Man in particular -> Positive comment
I hate characters who wear capes -> Negative comment
I play the guitar -> Neutral comment
This character makes me feel sad. -> Negative comment


**Storing the comment and its rating**

We add the comment and its rating. Note that the recommender system uses these ratings, so the values must be the following:


* Good (0.0)
* Neutral or no rating (0.5)
* Bad (1.0)

*NOTE: with the information stored in this dataset, the recommender system can already be applied to this fictitious user "Pepe".*

In [None]:
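comment_index = 3  # Position in `comentarios` of the comment about A.I.M.

dataframe_user.loc[dataframe_user['name'] == name, 'comentario'] = comentarios[comment_index]

# Map the predicted category to the scale used by the recommender system
if predictions[comment_index] == 1:
  v_comentario = 0.0
elif predictions[comment_index] == 0:
  v_comentario = 0.5
else:
  v_comentario = 1.0

dataframe_user.loc[dataframe_user['name'] == name, 'valoracion_comentario'] = v_comentario

dataframe_user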

Unnamed: 0,name,valoracion_directa,comentario,valoracion_comentario
0,A.I.M.,0.0,This character makes me feel sad.,1.0
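If several comments had to be stored at once, the conversion from predicted category to recommender score could be applied to the whole prediction array. The sketch below assumes a hypothetical helper `category_to_score`; treating an unknown category as 0.5 is an assumption of this example, not something the notebook does. The sample list reproduces the four categories printed earlier (positive, negative, neutral, negative).

```python
# Classifier category -> recommender score (1 = positive, 0 = neutral, -1 = negative)
CATEGORY_TO_SCORE = {1: 0.0, 0: 0.5, -1: 1.0}


def category_to_score(category) -> float:
    """Convert a predicted sentiment category to the recommender's scale."""
    # int() also accepts NumPy integer types returned by predict()
    return CATEGORY_TO_SCORE.get(int(category), 0.5)  # 0.5 for unknown values (assumption)


sample_predictions = [1, -1, 0, -1]
scores = [category_to_score(p) for p in sample_predictions]
print(scores)
```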


Finally, we save the dataset.

In [None]:
dataframe_user.to_csv("userdata_" + nombre_usuario, index=False)