# Content-Based Recommender Systems

### Working dataset
To build this Machine Learning based recommender system, we will use an official IMDb dataset with 1,000 movies.

#### Columns

* Rank: Movie rank order
* Title: The title of the film
* Description: Brief one-sentence movie summary
* Director: The name of the film's director
* Actors: A comma-separated list of the main stars of the film
* Year: The year the film was released, as an integer
* Runtime (Minutes): The duration of the film in minutes
* Rating: User rating for the movie, 0-10
* Votes: Number of votes
* Revenue (Millions): Movie revenue in millions
* Metascore: An aggregated average of critic scores. Values are between 0 and 100. Higher scores represent positive reviews.
* Genre1, Genre2, Genre3: The genres used to classify the film

## Explanatory videos on the content-based recommendation algorithm
* Part 1 https://youtu.be/AB6i8RSFaoM
* What is normalization? https://youtu.be/fczgaWdXr-E
* Part 2 https://youtu.be/-XyjWGa82eA

# Data Cleaning
This dataset has a problem that you must solve before you can use it. The genre classification is stored in three columns (Genre1, Genre2, Genre3), as strings.

This is a problem for our algorithm, which needs each genre as its own binary (0/1) column.

Your first challenge is to solve this problem.

In [1]:
# Load libraries
import pandas as pd
import numpy as np

In [3]:
movies = pd.read_csv("Movies.csv")
movies.head()

Unnamed: 0.1,Unnamed: 0,Rank,Title,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore,Genre1,Genre2,Genre3
0,0,1,Guardians of the Galaxy,A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0,Action,Adventure,Sci-Fi
1,1,2,Prometheus,"Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0,Adventure,Mystery,Sci-Fi
2,2,3,Split,Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0,Horror,Thriller,
3,3,4,Sing,"In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0,Animation,Comedy,Family
4,4,5,Suicide Squad,A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0,Action,Adventure,Fantasy


#### Hints for solving the genre problem in the dataset
* You can list the unique values of the genre columns to see how many genres there are to encode, using **movies["nombrecolumna"].unique()**
* Create a list from each genre column
* Create one array per genre
* Pick one genre, then walk through the three lists: wherever the genre appears, put a 1 in that genre's array, otherwise a 0
* Finally, add that array to the dataset as a column, using **movies["nombrecolumna"]=np.array(arraygenero)**
* Repeat this once for every genre.
* Once the dataset is fully prepared, you can save it with **movies.to_csv("Nombrequequieras.csv")** and start working with it in the Machine Learning part
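
The hints above can be sketched in a vectorized form. This is only an illustration, not the course's reference solution: `movies_sample` is a hypothetical three-row stand-in for `Movies.csv`, and the dummy columns are named after the raw genre strings rather than the lowercase names used later in `Movie_cleaning.csv`.

```python
import pandas as pd

# Hypothetical sample mirroring the Genre1/Genre2/Genre3 layout of Movies.csv.
movies_sample = pd.DataFrame({
    "Title": ["Guardians of the Galaxy", "Split", "Sing"],
    "Genre1": ["Action", "Horror", "Animation"],
    "Genre2": ["Adventure", "Thriller", "Comedy"],
    "Genre3": ["Sci-Fi", None, "Family"],
})

genre_cols = ["Genre1", "Genre2", "Genre3"]

# Unique genres across the three columns, ignoring missing values.
all_genres = sorted(movies_sample[genre_cols].melt()["value"].dropna().unique())

for genre in all_genres:
    # 1 where the genre appears in any of the three columns, otherwise 0.
    movies_sample[genre] = (
        movies_sample[genre_cols].eq(genre).any(axis=1).astype(int)
    )

print(movies_sample.drop(columns=genre_cols))
```

Comparing all three columns at once with `eq(...).any(axis=1)` replaces the manual walk through three lists, but the result is the same kind of 1/0 column per genre.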

#### For the second part of the exercise, here is how to create a small dataset. You will need it to record the user's scores for the movies they have watched.

In [4]:
# Simple example: create a dataset of 3 rows x 2 columns
usuario = pd.DataFrame({'pelicula': ['Trolls', 'Fallen', 'The Founder'], 'Score': [5, 7, 9]})
usuario.head()

Unnamed: 0,pelicula,Score
0,Trolls,5
1,Fallen,7
2,The Founder,9


## Data cleaning: converting the genres into dummy variables, that is, 1s and 0s
* We load the new dataset, Movie_cleaning, which we will work with from here on

In [5]:
MovieCleaning=pd.read_csv("Movie_cleaning.csv")
MovieCleaning.head()

Unnamed: 0.1,Unnamed: 0,Titulo,action,adventure,mistery,comedy,family,sciFi,fantasy,animation,...,thriller,drama,biography,music,western,war,crime,romance,sport,musical
0,0,Guardians of the Galaxy,1,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,Prometheus,0,1,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,Split,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,3,Sing,0,0,0,1,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,4,Suicide Squad,1,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


### Find the movies the user has watched in the dataset
* To do this, we filter with boolean conditions

In [6]:
DataUser=MovieCleaning[(MovieCleaning["Titulo"]=="Trolls") | (MovieCleaning["Titulo"]=="Fallen") | (MovieCleaning["Titulo"]=="The Founder")]

In [7]:
DataUser.head()

Unnamed: 0.1,Unnamed: 0,Titulo,action,adventure,mistery,comedy,family,sciFi,fantasy,animation,...,thriller,drama,biography,music,western,war,crime,romance,sport,musical
23,23,Trolls,0,1,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
43,43,The Founder,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
47,47,Fallen,0,1,0,0,0,0,1,0,...,0,1,0,0,0,0,0,0,0,0
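
The chained `==` conditions grow with every movie the user has watched. As a sketch of an alternative, `Series.isin` can take the titles straight from the `usuario` frame. Here `catalog` is a hypothetical stand-in for `MovieCleaning` with only two genre columns, and `.copy()` is added so that later assignments act on an independent frame rather than on a slice of the original.

```python
import pandas as pd

usuario = pd.DataFrame({"pelicula": ["Trolls", "Fallen", "The Founder"],
                        "Score": [5, 7, 9]})

# Hypothetical stand-in for MovieCleaning.
catalog = pd.DataFrame({
    "Titulo": ["Sing", "Trolls", "The Founder", "Fallen"],
    "adventure": [0, 1, 0, 1],
    "drama": [0, 0, 1, 1],
})

# Keep only the rows whose title appears in the user's list.
seen = catalog[catalog["Titulo"].isin(usuario["pelicula"])].copy()
print(seen)
```

Note that the rows come back in catalog order, not in the order of `usuario`.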


### Use iloc to select the genre columns and multiply them by the user's Score

In [9]:
DataUser.iloc[:1,2:]=DataUser.iloc[:1,2:] * pd.to_numeric(usuario.iloc[0,1])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [10]:
DataUser.iloc[1:2,2:]=DataUser.iloc[1:2,2:] * pd.to_numeric(usuario.iloc[1,1])

In [11]:
DataUser.iloc[2:3,2:]=DataUser.iloc[2:3,2:] * pd.to_numeric(usuario.iloc[2,1])

In [12]:
DataUser.head()

Unnamed: 0.1,Unnamed: 0,Titulo,action,adventure,mistery,comedy,family,sciFi,fantasy,animation,...,thriller,drama,biography,music,western,war,crime,romance,sport,musical
23,23,Trolls,0,5,0,5,0,0,0,5,...,0,0,0,0,0,0,0,0,0,0
43,43,The Founder,0,0,0,0,0,0,0,0,...,0,7,7,0,0,0,0,0,0,0
47,47,Fallen,0,9,0,0,0,0,9,0,...,0,9,0,0,0,0,0,0,0,0
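
The positional `iloc` approach pairs row *n* of `DataUser` with row *n* of `usuario`, but the two frames are ordered differently: in the output above The Founder is weighted by 7 and Fallen by 9, whereas `usuario` gives Fallen a 7 and The Founder a 9. A minimal sketch of matching scores by title instead, assuming a hypothetical `data_user` frame that holds only a few of the genre columns:

```python
import pandas as pd

usuario = pd.DataFrame({"pelicula": ["Trolls", "Fallen", "The Founder"],
                        "Score": [5, 7, 9]})

# Hypothetical stand-in for DataUser (same row order as the filtered frame).
data_user = pd.DataFrame({
    "Titulo": ["Trolls", "The Founder", "Fallen"],
    "adventure": [1, 0, 1],
    "comedy": [1, 0, 0],
    "fantasy": [0, 0, 1],
    "drama": [0, 1, 1],
    "biography": [0, 1, 0],
}, index=[23, 43, 47])

genre_cols = data_user.columns.drop("Titulo")

# Look up each movie's score by title, so row order no longer matters.
scores = data_user["Titulo"].map(usuario.set_index("pelicula")["Score"])

# Multiply every genre flag in a row by that row's score.
weighted = data_user[genre_cols].mul(scores, axis=0)

# Column sums give the user's genre profile.
profile = weighted.sum()
print(profile)
```

With title-based matching the profile differs from the notebook's totals (for example, adventure sums to 12 rather than 14), so the figures in the remaining cells correspond to the positional pairing.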


### Sum the columns to obtain the total value for each genre

In [13]:
DataUser.loc["Total"]=DataUser.iloc[:,2:].sum()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [14]:
DataUser.head()

Unnamed: 0.1,Unnamed: 0,Titulo,action,adventure,mistery,comedy,family,sciFi,fantasy,animation,...,thriller,drama,biography,music,western,war,crime,romance,sport,musical
23,23.0,Trolls,0.0,5.0,0.0,5.0,0.0,0.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
43,43.0,The Founder,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,7.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
47,47.0,Fallen,0.0,9.0,0.0,0.0,0.0,0.0,9.0,0.0,...,0.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Total,,,0.0,14.0,0.0,5.0,0.0,0.0,9.0,5.0,...,0.0,16.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Normalize the values

In [98]:
# Take the first 50 rows of the original movies dataset
df1=MovieCleaning.iloc[:50,:]
df1.shape

(50, 21)

### Normalize to the range 0 to 1

In [99]:
# Min-max scaling, (x - min) / (max - min), with min = 0 and max = 10 * len(df1)
Normalizado=(DataUser.iloc[3:4,2:]-0)/(10*len(df1)-0)

In [100]:
Normalizado.head()

Unnamed: 0,action,adventure,mistery,comedy,family,sciFi,fantasy,animation,horror,thriller,drama,biography,music,western,war,crime,romance,sport,musical
Total,0.0,0.028,0.0,0.01,0.0,0.0,0.018,0.01,0.0,0.0,0.032,0.014,0.0,0.0,0.0,0.0,0.0,0.0,0.0
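
As a worked check of the normalization, the sketch below applies the min-max formula `(x - min) / (max - min)` to the non-zero genre totals from the `Total` row, with `min = 0` and `max = 10 * 50 = 500` as in the cell above. The variable names `lower` and `upper` are introduced here only for illustration.

```python
import pandas as pd

# Non-zero genre totals taken from the "Total" row shown earlier.
totals = pd.Series({
    "adventure": 14, "comedy": 5, "fantasy": 9,
    "animation": 5, "drama": 16, "biography": 7,
})

lower = 0          # smallest possible total
upper = 10 * 50    # 10 * len(df1), with len(df1) == 50

normalized = (totals - lower) / (upper - lower)
print(normalized)
```

For instance, adventure becomes 14 / 500 = 0.028 and drama becomes 16 / 500 = 0.032, matching the `Normalizado` output.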


### Multiply the normalized values by the genre columns of the recommendation dataset

In [101]:
for i in range(19):
    df1.iloc[:,i+2]=pd.to_numeric(Normalizado.iloc[0,i])*df1.iloc[:,i+2]

In [102]:
df1.head(50)

Unnamed: 0.1,Unnamed: 0,Titulo,action,adventure,mistery,comedy,family,sciFi,fantasy,animation,...,thriller,drama,biography,music,western,war,crime,romance,sport,musical
0,0,Guardians of the Galaxy,0.0,0.028,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,Prometheus,0.0,0.028,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,Split,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,Sing,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,Suicide Squad,0.0,0.028,0.0,0.0,0.0,0.0,0.018,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,5,The Great Wall,0.0,0.028,0.0,0.0,0.0,0.0,0.018,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,6,La La Land,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,...,0.0,0.032,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,7,Mindhorn,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,8,The Lost City of Z,0.0,0.028,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.014,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,9,Passengers,0.0,0.028,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.032,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
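
The loop above depends on a hard-coded `range(19)` and on column positions. As an alternative sketch, `DataFrame.mul` with `axis=1` aligns the profile with the genre columns by name. Both `profile` and `catalog` below are hypothetical, reduced to three genres for brevity.

```python
import pandas as pd

# Normalized user profile (three genres only, values from Normalizado).
profile = pd.Series({"adventure": 0.028, "comedy": 0.01, "drama": 0.032})

# Hypothetical slice of the movie catalog with matching genre columns.
catalog = pd.DataFrame({
    "Titulo": ["Prometheus", "La La Land", "Passengers"],
    "adventure": [1, 0, 1],
    "comedy": [0, 1, 0],
    "drama": [0, 1, 1],
})

# Each genre column is scaled by the profile weight with the same label.
weighted = catalog[profile.index].mul(profile, axis=1)
print(weighted)
```

Because alignment is by label, adding or reordering genre columns does not require changing the code.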


### Compute the row sums

In [109]:
df1["total"]=df1.iloc[:,2:].sum(axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


### Sort the dataset by the ***total*** column

In [None]:
df1.sort_values(by=["total"], inplace=True, ascending=False)

# Show the recommendation list

In [125]:
df1.loc[:, ["Titulo","total"]]

Unnamed: 0,Titulo,total
47,Fallen,0.078
20,Gold,0.06
9,Passengers,0.06
36,Interstellar,0.06
29,Assassin's Creed,0.06
26,Bahubali: The Beginning,0.06
23,Trolls,0.048
15,The Secret Life of Pets,0.048
13,Moana,0.048
40,Sausage Party,0.048
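
The list above still contains Fallen and Trolls, which the user has already watched. One possible refinement, sketched here with a hypothetical `scored` frame built from a few of the rows above, is to drop the watched titles before ranking.

```python
import pandas as pd

# A few rows copied from the recommendation list above.
scored = pd.DataFrame({
    "Titulo": ["Fallen", "Gold", "Passengers", "Trolls", "Moana"],
    "total": [0.078, 0.06, 0.06, 0.048, 0.048],
})

seen_titles = ["Trolls", "Fallen", "The Founder"]

# Remove watched movies, then rank the rest by score (stable sort keeps tie order).
recommendations = (
    scored[~scored["Titulo"].isin(seen_titles)]
    .sort_values("total", ascending=False, kind="stable")
    .reset_index(drop=True)
)
print(recommendations)
```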


### Extra material: computing user similarity for recommendation by collaborative filtering

## Euclidean Distance

In [31]:
# Import the distance module directly, since the calls below use distance.euclidean
from scipy.spatial import distance
a = [10,7,8] 
b = [9,0,10] 
c = [10,6,0]
d = [6,8,4]
dst1 = distance.euclidean(a,b)
dst2 = distance.euclidean(a,c)
dst3 = distance.euclidean(a,d)
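
To make explicit what `distance.euclidean` computes, here is a standard-library sketch of the same formula applied to the same vectors. The helper `euclidean` and the variable `closest` are illustrative names, not part of SciPy.

```python
import math

def euclidean(u, v):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

a = [10, 7, 8]
b = [9, 0, 10]
c = [10, 6, 0]
d = [6, 8, 4]

dst1 = euclidean(a, b)  # sqrt(1 + 49 + 4)   = sqrt(54)
dst2 = euclidean(a, c)  # sqrt(0 + 1 + 64)   = sqrt(65)
dst3 = euclidean(a, d)  # sqrt(16 + 1 + 16)  = sqrt(33)

# The smallest distance identifies the user most similar to a.
closest = min([("b", dst1), ("c", dst2), ("d", dst3)], key=lambda p: p[1])[0]
print(dst1, dst2, dst3, closest)
```

A smaller distance means more similar ratings, so under this measure `d` is the closest to `a`.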

## Cosine Similarity

In [27]:
from scipy import spatial
a = [10, 7, 8]
b = [9,0,10]
c = [10,6,0]
d = [6,8,4]
resultab = 1 - spatial.distance.cosine(a,b)
resultac = 1 - spatial.distance.cosine(a,c)
resultad = 1 - spatial.distance.cosine(a,d)
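
`spatial.distance.cosine` returns a distance, which is why the cells above subtract it from 1 to get a similarity. The sketch below computes the similarity directly as the dot product divided by the product of the vector norms, using only the standard library. The helper `cosine_similarity` and the variable `most_similar` are illustrative names.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

a = [10, 7, 8]
b = [9, 0, 10]
c = [10, 6, 0]
d = [6, 8, 4]

resultab = cosine_similarity(a, b)  # 170 / sqrt(213 * 181)
resultac = cosine_similarity(a, c)  # 142 / sqrt(213 * 136)
resultad = cosine_similarity(a, d)  # 148 / sqrt(213 * 116)

# The largest similarity identifies the user most similar to a.
most_similar = max(
    [("b", resultab), ("c", resultac), ("d", resultad)], key=lambda p: p[1]
)[0]
print(resultab, resultac, resultad, most_similar)
```

Unlike Euclidean distance, a larger value here means more similar: cosine similarity compares the direction of the rating vectors and ignores their magnitude.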