# EDA 01 - Data Preparation for Content-Based Filtering

In this _notebook_ we use the following files to prepare the data for building recommender systems based on the content and metadata of the movies:

- **cleaned_movies.csv**: the cleaned data file produced by the EDA00 _notebook_
- **credits.csv**: information about the actors and the director of each movie
- **keywords.csv**: keywords that will be useful for building the recommender system

In [1]:
import pandas as pd
import numpy as np

## Importing the Dataset

In [2]:
from dotenv import load_dotenv
import os

load_dotenv()

DATA_PATH = os.getenv("FILES_LOCATION")
CLEANED_FILE = os.path.join(DATA_PATH, "CSV", "cleaned_movies.csv")
CREDITS_FILE = os.path.join(DATA_PATH, "CSV", "credits.csv")
KEYWORDS_FILE = os.path.join(DATA_PATH, "CSV", "keywords.csv")
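
If `FILES_LOCATION` is not defined, `os.getenv` returns `None` and `os.path.join` fails with a `TypeError` that does not name the missing variable. A minimal sketch of a fail-fast check follows; `resolve_data_file` is a hypothetical helper that is not part of the original notebook, and the path set below is illustrative only.

```python
import os


def resolve_data_file(filename, env_var="FILES_LOCATION"):
    """Hypothetical helper: build <env_var>/CSV/<filename> or fail with a clear message."""
    base = os.getenv(env_var)
    if not base:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return os.path.join(base, "CSV", filename)


# Illustrative value only; in the notebook the location comes from the .env file.
os.environ["FILES_LOCATION"] = os.path.join("tmp", "movies")
cleaned_path = resolve_data_file("cleaned_movies.csv")
print(cleaned_path)
```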

In [3]:
df = pd.read_csv(CLEANED_FILE, low_memory=False)
credits = pd.read_csv(CREDITS_FILE, low_memory=False)
keyw = pd.read_csv(KEYWORDS_FILE, low_memory=False)

In [4]:
df.shape, credits.shape, keyw.shape

((41362, 8), (45476, 3), (46419, 2))

In [5]:
credits.columns, keyw.columns

(Index(['cast', 'crew', 'id'], dtype='object'),
 Index(['id', 'keywords'], dtype='object'))

In [6]:
credits.info(), keyw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45476 entries, 0 to 45475
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   cast    45476 non-null  object
 1   crew    45476 non-null  object
 2   id      45476 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.0+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46419 entries, 0 to 46418
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        46419 non-null  int64 
 1   keywords  46419 non-null  object
dtypes: int64(1), object(1)
memory usage: 725.4+ KB


(None, None)

## Data Cleaning

With the _DataFrames_ loaded, we merge them and drop any rows with a duplicated id. This gives us a single _dataset_ to work with for the rest of the _notebook_.

In [7]:
merged = df.merge(credits, on="id", how="inner")
merged = merged.merge(keyw, on="id", how="inner").drop_duplicates("id")

In [8]:
merged.shape

(41361, 11)
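
The toy example below, built on made-up ids, sketches why `drop_duplicates("id")` follows the inner merges: an id that is repeated on the right-hand side duplicates the matching left row, and an id with no match is dropped.

```python
import pandas as pd

# Toy frames with made-up values; not taken from the real dataset.
movies = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
credits_toy = pd.DataFrame({"id": [1, 2, 2], "cast": ["[]", "[]", "[]"]})

# Inner join: id 3 has no credits and is dropped; id 2 matches twice.
joined = movies.merge(credits_toy, on="id", how="inner")
deduped = joined.drop_duplicates("id")

print(joined["id"].tolist())   # [1, 2, 2]
print(deduped["id"].tolist())  # [1, 2]
```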

In [9]:
merged.head()

Unnamed: 0,genres,id,title,overview,description,popularity,vote_average,vote_count,cast,crew,keywords
0,"['Animation', 'Comedy', 'Family']",862,Toy Story,"Led by Woody, Andy's toys live happily in his ...","Led by Woody, Andy's toys live happily in his ...",21.946943,7.7,5415.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,"['Adventure', 'Fantasy', 'Family']",8844,Jumanji,When siblings Judy and Peter discover an encha...,When siblings Judy and Peter discover an encha...,17.015539,6.9,2413.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,"['Romance', 'Comedy']",15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,A family wedding reignites the ancient feud be...,11.7129,6.5,92.0,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,"['Comedy', 'Drama', 'Romance']",31357,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...","Cheated on, mistreated and stepped on, the wom...",3.859495,6.1,34.0,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,['Comedy'],11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,Just when George Banks has recovered from his ...,8.387519,5.7,173.0,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


As in the previous _notebook_, we use the _literal_eval()_ function to parse the entries as lists of dictionaries. Here we apply it to the _cast_, _crew_ and _keywords_ columns.

In [10]:
import ast

def get_names(x):
    """Parse the stringified list once and return the "name" of each entry."""
    parsed = ast.literal_eval(x)
    return [el["name"] for el in parsed] if isinstance(parsed, list) else []

get_name_cast = get_keys = get_names  # same parsing for cast and keywords

In [11]:
def get_dir(x):
    for el in ast.literal_eval(x):
        if el["job"] == "Director":
            return el["name"]
    return np.nan
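
To make the parsing step concrete, here is a small sketch on made-up strings shaped like the raw _cast_ and _crew_ entries. It falls back to `None` when no director is found, whereas the notebook's `get_dir` returns `np.nan`.

```python
import ast

# Made-up samples in the same shape as the raw `cast` / `crew` columns.
raw_cast = "[{'cast_id': 14, 'name': 'Tom Hanks'}, {'cast_id': 15, 'name': 'Tim Allen'}]"
raw_crew = "[{'job': 'Producer', 'name': 'Jane Doe'}, {'job': 'Director', 'name': 'John Lasseter'}]"

# literal_eval turns the string into a real list of dicts.
parsed_cast = ast.literal_eval(raw_cast)
cast_names = [member["name"] for member in parsed_cast]

# next() with a default returns the first director found, or None if there is none.
director = next(
    (m["name"] for m in ast.literal_eval(raw_crew) if m["job"] == "Director"),
    None,
)

print(cast_names)  # ['Tom Hanks', 'Tim Allen']
print(director)    # John Lasseter
```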

In [12]:
merged["director"] = merged["crew"].apply(get_dir)
merged["cast"] = merged["cast"].fillna("[]").apply(get_name_cast)
merged["cast"] = merged["cast"].apply(lambda x: x[:3])  # keep at most the first three listed actors
merged["keywords"] = merged["keywords"].apply(get_keys)

In [13]:
merged = merged.drop(columns=["crew"])
merged.head()

Unnamed: 0,genres,id,title,overview,description,popularity,vote_average,vote_count,cast,keywords,director
0,"['Animation', 'Comedy', 'Family']",862,Toy Story,"Led by Woody, Andy's toys live happily in his ...","Led by Woody, Andy's toys live happily in his ...",21.946943,7.7,5415.0,"[Tom Hanks, Tim Allen, Don Rickles]","[jealousy, toy, boy, friendship, friends, riva...",John Lasseter
1,"['Adventure', 'Fantasy', 'Family']",8844,Jumanji,When siblings Judy and Peter discover an encha...,When siblings Judy and Peter discover an encha...,17.015539,6.9,2413.0,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]","[board game, disappearance, based on children'...",Joe Johnston
2,"['Romance', 'Comedy']",15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,A family wedding reignites the ancient feud be...,11.7129,6.5,92.0,"[Walter Matthau, Jack Lemmon, Ann-Margret]","[fishing, best friend, duringcreditsstinger, o...",Howard Deutch
3,"['Comedy', 'Drama', 'Romance']",31357,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...","Cheated on, mistreated and stepped on, the wom...",3.859495,6.1,34.0,"[Whitney Houston, Angela Bassett, Loretta Devine]","[based on novel, interracial relationship, sin...",Forest Whitaker
4,['Comedy'],11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,Just when George Banks has recovered from his ...,8.387519,5.7,173.0,"[Steve Martin, Diane Keaton, Martin Short]","[baby, midlife crisis, confidence, aging, daug...",Charles Shyer


The _keywords_ column contains words that may or may not recur across movies. A keyword that appears only once in the whole dataset cannot link two movies, so we keep only the keywords that appear at least twice, which gives the recommender system a better basis for comparison.

In [14]:
kw_df = merged["keywords"].explode().dropna()  # one keyword per row; empty lists yield NaN and are dropped
kw_count = kw_df.value_counts()

In [15]:
kw_count = kw_count[kw_count > 1]
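
The frequency filter can be illustrated on a toy frame. This is a sketch on made-up data that uses `Series.explode` to flatten the keyword lists before counting.

```python
import pandas as pd

# Toy data: each row holds one movie's keyword list (made-up values).
toy = pd.DataFrame({"keywords": [["toy", "friendship"], ["toy", "board game"], []]})

# explode() puts one keyword per row; value_counts() ignores the NaN from the empty list.
counts = toy["keywords"].explode().value_counts()
frequent = counts[counts > 1]

# Keep only the keywords that occur at least twice across all movies.
toy["keywords"] = toy["keywords"].apply(
    lambda kws: [k for k in kws if k in frequent.index]
)

print(toy["keywords"].tolist())  # [['toy'], ['toy'], []]
```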

We could skip _stemming_ the keywords and see what results the recommender system gives without it. However, if one movie has a keyword in the singular and another has it in the plural, it is better to reduce both to a common stem so that the metadata comparison is more robust. To do so we use the _nltk_ library.

In [16]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def kw_filter(x):
    filtered = []
    for word in x:
        if word in kw_count:
            filtered.append(word)
    return filtered

In [17]:
merged["keywords"] = merged["keywords"].transform(func=kw_filter)
merged["keywords"] = merged["keywords"].transform(func=lambda x: [stemmer.stem(w) for w in x])
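
A short sketch of what the stemmer does to individual keywords, assuming _nltk_ is installed. The sample words are taken from the keyword lists shown in this notebook.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Singular and plural forms collapse to the same stem.
print(stemmer.stem("friend"), stemmer.stem("friends"))  # friend friend

# Stems are not always dictionary words.
print(stemmer.stem("jealousy"))  # jealousi

# A multi-word keyword is passed as one string, so only its ending is stemmed.
print(stemmer.stem("board game"))  # board gam
```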

In [18]:
merged.head(3)

Unnamed: 0,genres,id,title,overview,description,popularity,vote_average,vote_count,cast,keywords,director
0,"['Animation', 'Comedy', 'Family']",862,Toy Story,"Led by Woody, Andy's toys live happily in his ...","Led by Woody, Andy's toys live happily in his ...",21.946943,7.7,5415.0,"[Tom Hanks, Tim Allen, Don Rickles]","[jealousi, toy, boy, friendship, friend, rival...",John Lasseter
1,"['Adventure', 'Fantasy', 'Family']",8844,Jumanji,When siblings Judy and Peter discover an encha...,When siblings Judy and Peter discover an encha...,17.015539,6.9,2413.0,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]","[board gam, disappear, based on children's boo...",Joe Johnston
2,"['Romance', 'Comedy']",15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,A family wedding reignites the ancient feud be...,11.7129,6.5,92.0,"[Walter Matthau, Jack Lemmon, Ann-Margret]","[fish, best friend, duringcreditssting]",Howard Deutch


To finish the data preparation, we create a metadata column that combines the following features:

- Actors
- Director
- Genres
- Keywords

To build this new column we first have to "standardize" the data. We apply a series of transformations so that every token is lowercase and the tokens are separated by a single space. We also join the first and last name of each director into a single word (and do the same for actors and for multi-word keywords and genres), so that the recommender system does not treat two movies as similar just because their directors share a first name but not a surname.

In [19]:
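# Wrap each director in a one-element list. Movies without a director get an empty list
# instead of the literal token "nan", which would look like a director shared by all of them.
merged["director"] = merged["director"].apply(lambda x: [x.lower().replace(" ", "")] if isinstance(x, str) else [])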
for col in ("keywords", "cast"):
    merged[col] = merged[col].transform(func=lambda x: [str.lower(w.replace(" ", "")) for w in x])
merged["genres"] = merged["genres"].transform(func=lambda x: [w.lower().replace(" ", "") for w in ast.literal_eval(x)])

In [20]:
merged["metadata"] = merged["keywords"] + merged["cast"] + merged["director"] + merged["genres"]
merged["metadata"] = merged["metadata"].transform(func=lambda x: " ".join(x))

In [21]:
merged.head(3)

Unnamed: 0,genres,id,title,overview,description,popularity,vote_average,vote_count,cast,keywords,director,metadata
0,"[animation, comedy, family]",862,Toy Story,"Led by Woody, Andy's toys live happily in his ...","Led by Woody, Andy's toys live happily in his ...",21.946943,7.7,5415.0,"[tomhanks, timallen, donrickles]","[jealousi, toy, boy, friendship, friend, rival...",[johnlasseter],jealousi toy boy friendship friend rivalri boy...
1,"[adventure, fantasy, family]",8844,Jumanji,When siblings Judy and Peter discover an encha...,When siblings Judy and Peter discover an encha...,17.015539,6.9,2413.0,"[robinwilliams, jonathanhyde, kirstendunst]","[boardgam, disappear, basedonchildren'sbook, n...",[joejohnston],boardgam disappear basedonchildren'sbook newho...
2,"[romance, comedy]",15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,A family wedding reignites the ancient feud be...,11.7129,6.5,92.0,"[waltermatthau, jacklemmon, ann-margret]","[fish, bestfriend, duringcreditssting]",[howarddeutch],fish bestfriend duringcreditssting waltermatth...
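
The effect of joining names into single tokens can be sketched with two made-up movies whose directors share a first name. The helpers `normalize`, `soup` and `shared_words` are hypothetical and exist only for this illustration.

```python
def normalize(tokens):
    """Lowercase each token and remove inner spaces so it stays a single word."""
    return [t.lower().replace(" ", "") for t in tokens]


# Made-up movies whose directors share a first name but not a surname.
movie_a = {"cast": ["Tom Hanks"], "director": ["John Lasseter"], "genres": ["Animation"]}
movie_b = {"cast": ["Al Pacino"], "director": ["John Woo"], "genres": ["Action"]}


def soup(movie, join_names):
    """Build the space-separated metadata string, with or without joined names."""
    tokens = movie["cast"] + movie["director"] + movie["genres"]
    tokens = normalize(tokens) if join_names else [t.lower() for t in tokens]
    return " ".join(tokens)


def shared_words(a, b):
    """Words that two metadata strings have in common."""
    return set(a.split()) & set(b.split())


print(shared_words(soup(movie_a, False), soup(movie_b, False)))  # {'john'}
print(shared_words(soup(movie_a, True), soup(movie_b, True)))    # set()
```

Without joining, the two unrelated movies share the word `john`; with joined names they share nothing, which is the behavior the metadata column is meant to have.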


In [22]:
merged = merged.drop_duplicates("id")
merged = merged.drop_duplicates("title")

In [23]:
merged.shape

(41361, 12)

In [24]:
merged.to_csv(os.path.join(DATA_PATH, "CSV", "cleaned_content_based.csv"), index=False)