# 1. Preprocessing

## 1.1 Imports and dataset loading

This notebook starts by running every import it uses and loading the dataset on which the model will be built.

In [1]:
# Imports and dataset loading

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import pandas as pd
import nltk
import math
import ast
import io
import re
import warnings
warnings.filterwarnings('ignore')
import numpy as np

nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")

#pd.set_option("display.max_rows", None)

# read_csv opens the file itself, so no explicit open() is needed
df_raw = pd.read_csv("movies_dataset/movies_metadata.csv", encoding="utf-8")
print("Dataset loaded")

[nltk_data] Downloading package stopwords to /home/hector/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/hector/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/hector/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Dataset loaded


## 1.2 Removing unused columns

We now go through every column of the dataset and build a new dataframe.

The columns we consider unnecessary are:

* ``homepage``
* ``imdb_id``
* ``original_title``
* ``poster_path``
* ``status``
* ``video``
* ``popularity``

The ``id`` column is kept so that each row of the preprocessed data can still be traced back to its movie.

In [2]:
exclude = ["homepage", "imdb_id", "original_title", "poster_path", "status", "video", "popularity"] 
df = df_raw.loc[:, df_raw.columns.difference(exclude)]
print("Columns excluded")

Columns excluded


With these columns removed, we check which columns remain to be processed and which of them contain null values.

In [3]:
#Null-value check

print(pd.isnull(df).any())
print()
print("Total rows:", len(df))
print("Rows with at least one null value:", df.isnull().any(axis=1).sum())

adult                    False
belongs_to_collection     True
budget                   False
genres                   False
id                       False
original_language         True
overview                  True
production_companies      True
production_countries      True
release_date              True
revenue                   True
runtime                   True
spoken_languages          True
tagline                   True
title                     True
vote_average              True
vote_count                True
dtype: bool

Total rows: 45466
Rows with at least one null value: 42894


## 1.3 Column processing

### 1.3.1 Adult

Here we see that some values are not booleans and must be removed. Movies with value True are also very rare compared with those with value False (9 versus 45454), so the column carries almost no information. We therefore keep only the rows whose value is False, which discards both the anomalous rows and the few True rows, and then remove the ``adult`` column.

In [4]:
print(df["adult"].value_counts())
print()

drop_list = []
for i in range(len(df)):
    if df["adult"][i] != "False":
        drop_list.append(i)

df.drop(index=drop_list, inplace=True)

print(df["adult"].value_counts())

df.drop(["adult"], axis=1, inplace=True)

#The column is not carried over to the new dataframe

False                                                                                                                             45454
True                                                                                                                                  9
 Avalanche Sharks tells the story of a bikini contest that turns into a horrifying affair when it is hit by a shark avalanche.        1
 - Written by Ørnås                                                                                                                   1
 Rune Balot goes to a casino connected to the October corporation to try to wrap up her case once and for all.                        1
Name: adult, dtype: int64

False    45454
Name: adult, dtype: int64
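
The same filter can be expressed without an explicit loop. The sketch below is only an illustration on a small made-up frame (the rows are not taken from the dataset, and `toy` and `clean` are hypothetical names); it assumes, as in the output above, that the values are stored as strings.

```python
import pandas as pd

# Toy data standing in for the raw column: the values are strings, and one
# row holds stray text instead of "True"/"False".
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "adult": ["False", "True", " some stray overview text", "False"],
})

# Keep only the rows whose value is exactly the string "False",
# then drop the column, since it is now constant.
clean = toy[toy["adult"] == "False"].drop(columns=["adult"])

print(clean["id"].tolist())
```

A boolean mask keeps the original index labels, just like `drop(index=...)`, so later steps that rely on labels behave the same way.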


### 1.3.2 Title

Before processing this column, we define a few helper functions for the features that consist of natural-language text.

In [5]:
# Helper functions for natural-language processing

replaces = [("i'm", "i am"), ("you're", "you are"), ("we're", "we are"), ("they're", "they are"),
            ("that's", "that is"), ("there's", "there is"), ("who's", "who is"), ("what's", "what is"),
            
            ("isn't", "is not"), ("aren't", "are not"), ("can't", "cannot"), ("haven't", "have not"),
            ("won't", "will not"), ("doesn't", "does not"), ("don't", "do not"), ("didn't", "did not"),
            ("couldn't", "could not"), ("shouldn't", "should not"), ("wasn't", "was not"), ("weren't", "were not"),
            ("hasn't", "has not"), ("hadn't", "had not"),
            
            ("we'd", "we would"), ("he'd", "he would"), ("you'd", "you would"), ("they'd", "they would"), 
            ("who'd", "who would"), ("wouldn't", "would not"),
            
            ("i'll", "i will"), ("you'll", "you will"), ("he'll", "he will"), ("she'll", "she will"),
            ("it'll", "it will"), ("we'll", "we will"), ("they'll", "they will"),
            
            ("i've", "i have"), ("you've", "you have"), ("we've", "we have"), ("they've", "they have"),
            ("&", " and ")
           ]

def sub_contractions(text):
    new_text = text
    for r in replaces:
        new_text = re.sub(f"{r[0]}", f"{r[1]}", new_text)
    return new_text

lemmatizer = WordNetLemmatizer()
def preprocess_text(text):
    if type(text) is float:  # Null values arrive as a float NaN
        ret_text = []
    else:
        new_text = text.lower()
        new_text = sub_contractions(new_text)
        new_text = re.sub("[^A-Za-zÀ-ÖØ-öø-ÿ0-9 ]", "", new_text)
        
        tok_text = word_tokenize(new_text)
        stop_words = list(stopwords.words("english"))
        tok_text = [word for word in tok_text if word not in stop_words]
        tok_text = [lemmatizer.lemmatize(word) for word in tok_text]
        ret_text = tok_text
        
        """ Processing feedback
        joint_text = " ".join(tok_text)
        print(new_text)
        print(joint_text)
        print() #"""
        
    return ret_text
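
The contraction step can be tried in isolation, without NLTK. The following is a minimal sketch, not the notebook's implementation: `CONTRACTIONS` is a shortened, hypothetical table, and the single compiled pattern with word boundaries is our own choice.

```python
import re

# Hypothetical, shortened contraction table (the notebook's list is longer).
CONTRACTIONS = {
    "i'm": "i am",
    "can't": "cannot",
    "won't": "will not",
    "they've": "they have",
}

# One compiled pattern; \b keeps a short key such as "i'm" from matching
# inside a longer token.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b"
)

def expand_contractions(text: str) -> str:
    """Lower-case the text and expand the contractions listed above."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(1)], text.lower())

print(expand_contractions("They've said I'm late, but I can't run"))
```

Expanding contractions before stripping punctuation matters, because once the apostrophe is gone "can't" becomes "cant" and no longer matches any stop word.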

In [6]:
'''TITLE'''

df["title"] = df["title"].apply(preprocess_text)
df["title"].head()

0                 [toy, story]
1                    [jumanji]
2         [grumpier, old, men]
3            [waiting, exhale]
4    [father, bride, part, ii]
Name: title, dtype: object

### 1.3.3 Tagline

The ``tagline`` column is also free text, so it goes through the same natural-language processing.



In [7]:
'''TAGLINE'''

df["tagline"] = df["tagline"].apply(preprocess_text)
df["tagline"].head()

0                                                   []
1                    [roll, dice, unleash, excitement]
2    [still, yelling, still, fighting, still, ready...
3            [friend, people, let, never, let, forget]
4            [world, back, normal, he, surprise, life]
Name: tagline, dtype: object

### 1.3.4 Overview

The ``overview`` column is processed in the same way.

In [8]:
'''OVERVIEW'''

df["overview"] = df["overview"].apply(preprocess_text)
df["overview"].head()

0    [led, woody, andys, toy, live, happily, room, ...
1    [sibling, judy, peter, discover, enchanted, bo...
2    [family, wedding, reignites, ancient, feud, ne...
3    [cheated, mistreated, stepped, woman, holding,...
4    [george, bank, recovered, daughter, wedding, r...
Name: overview, dtype: object

### 1.3.5 Vote count

We remove the rows with null values. A vote count below 1 is not useful either, since a movie without votes tells us nothing about its vote average.

In [9]:
'''VOTE_COUNT'''

df = df.dropna(subset=["vote_count"])  # drop rows without a vote count
df = df[df["vote_count"] >= 1]         # keep only movies with at least one vote
df["vote_count"].head()

0    5415.0
1    2413.0
2      92.0
3      34.0
4     173.0
Name: vote_count, dtype: float64
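
One pandas detail is worth keeping in mind when removing rows: assigning a filtered column back to the same frame realigns it on the index, so no row disappears, whereas rebinding the frame does remove them. A minimal sketch on made-up values (`toy`, `realigned` and `filtered` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame({"vote_count": [10.0, float("nan"), 0.0, 3.0]})

# Assigning a filtered column back realigns it on the index, so the row that
# was filtered out simply gets NaN again and the frame keeps all its rows.
realigned = toy.copy()
realigned["vote_count"] = realigned.dropna(subset=["vote_count"])["vote_count"]

# Rebinding the frame itself is what actually removes rows.
filtered = toy.dropna(subset=["vote_count"])
filtered = filtered[filtered["vote_count"] >= 1]

print(len(realigned), len(filtered))
```

The NaN that comes back after realignment is also why the column stays `float64` even when a cast to an integer type is applied before the assignment.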

### 1.3.6 Vote average

We remove the rows with null values in the ``vote_average`` column. Because the previous step kept only movies with at least one vote, every remaining average is meaningful (some rows had an average of 0 but no votes at all, so they were not valid).

In [10]:
'''VOTE_AVERAGE'''

df = df.dropna(subset=["vote_average"])  # drop rows without a vote average
df["vote_average"].head()

0    7.7
1    6.9
2    6.5
3    6.1
4    5.7
Name: vote_average, dtype: float64

### 1.3.7 Release date

The date column was stored as strings (probably because of the erroneous values present at the start), so we convert it to a datetime column. Rows with a null date are not actually removed by the cell below: they become ``NaT``, which is why ``release_month`` later shows up as a float.

The dates could be left as they are, or the years could be grouped to draw conclusions. Here we keep only the month and the day of the week.

In [11]:
'''RELEASE_DATE'''

# Assigning a filtered column back realigns it on the index, so null dates are
# not dropped here; they simply become NaT after the conversion.
df["release_date"] = pd.to_datetime(df["release_date"])
df["release_date"].head()

#df_years = preprocessed['release_date'].groupby(preprocessed.release_date.dt.year)
#print(df_years.first())

0   1995-10-30
1   1995-12-15
2   1995-12-22
3   1995-12-22
4   1995-02-10
Name: release_date, dtype: datetime64[ns]

In [12]:
df["release_month"] = df["release_date"].dt.month
df["release_day"] = df["release_date"].dt.day_name()
df.drop(["release_date"], axis=1, inplace=True)
df.head()

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,production_companies,production_countries,revenue,runtime,spoken_languages,tagline,title,vote_average,vote_count,release_month,release_day
0,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,en,"[led, woody, andys, toy, live, happily, room, ...","[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",[],"[toy, story]",7.7,5415.0,10.0,Monday
1,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,en,"[sibling, judy, peter, discover, enchanted, bo...","[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...","[roll, dice, unleash, excitement]",[jumanji],6.9,2413.0,12.0,Friday
2,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,en,"[family, wedding, reignites, ancient, feud, ne...","[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]","[still, yelling, still, fighting, still, ready...","[grumpier, old, men]",6.5,92.0,12.0,Friday
3,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",31357,en,"[cheated, mistreated, stepped, woman, holding,...",[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]","[friend, people, let, never, let, forget]","[waiting, exhale]",6.1,34.0,12.0,Friday
4,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",11862,en,"[george, bank, recovered, daughter, wedding, r...","[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]","[world, back, normal, he, surprise, life]","[father, bride, part, ii]",5.7,173.0,2.0,Friday
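
The month and weekday extraction can be tried on a handful of values. In this sketch the first two dates come from the output above, the other two entries are made up, and `errors="coerce"` is our assumption (the notebook calls `pd.to_datetime` without it).

```python
import pandas as pd

# Two dates from the output above, one unparsable string and one missing value.
dates = pd.Series(["1995-10-30", "1995-12-15", "not a date", None])

# errors="coerce" turns anything unparsable into NaT instead of raising.
parsed = pd.to_datetime(dates, errors="coerce")

features = pd.DataFrame({
    "release_month": parsed.dt.month,     # NaT maps to NaN here
    "release_day": parsed.dt.day_name(),  # e.g. "Monday"
})
print(features)
```

Rows whose date could not be parsed keep missing values in both derived columns, so they can still be dropped afterwards with `dropna` if that is the intended behaviour.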


### 1.3.8 Runtime

We remove the rows with null values, and also those with a runtime of 0, since a movie cannot last 0 minutes.

In [13]:
'''RUNTIME'''

df = df.dropna(subset=["runtime"])  # drop rows without a runtime
df = df[df["runtime"] >= 1]         # a 0-minute movie makes no sense
df["runtime"].head()

0     81.0
1    104.0
2    101.0
3    127.0
4    106.0
Name: runtime, dtype: float64

### 1.3.9 Original language

To choose the valid languages, we keep the 6 most frequent ones and group all the others into an ``other`` category, relabelling every row whose language is not in that set.

In [14]:
'''ORIGINAL LANGUAGE'''

idiomas_validos = {'en', 'fr', 'it', 'ja', 'de', 'es'}

def preprocess_language(text):
    if text in idiomas_validos:
        return text
    else:
        return "other"

In [15]:
df["original_language"] = df["original_language"].apply(preprocess_language)
df["original_language"].value_counts()

en       29572
other     5195
fr        2226
ja        1282
it        1134
de         927
es         848
Name: original_language, dtype: int64
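
The grouping can also be written as a single vectorised expression. This is a sketch on a made-up sample, using the same six codes as the notebook; `VALID_LANGUAGES`, `langs` and `grouped` are hypothetical names.

```python
import pandas as pd

VALID_LANGUAGES = {"en", "fr", "it", "ja", "de", "es"}  # same set as the notebook

# Made-up sample, including a missing value.
langs = pd.Series(["en", "ru", "fr", None, "zh", "en"])

# Keep the value where it is in the valid set, otherwise replace it.
grouped = langs.where(langs.isin(VALID_LANGUAGES), "other")

print(grouped.value_counts().to_dict())
```

Missing values fail the `isin` test, so they end up in `other` as well, which matches what `preprocess_language` does.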

For ``spoken_languages``, a movie can list several languages. We take the 7 most frequent ones and one-hot encode them, flagging any other language in a ``spoken_lang_other`` column.

In [16]:
def language_counts(rows):
    languages_dict = {}
    
    for row in rows:
        try:
            languages = ast.literal_eval(row)
            
            for lang in languages:
                code = lang["iso_639_1"]
                
                if code in languages_dict.keys():
                    languages_dict[code] += 1
                else:
                    languages_dict[code] = 1
        except Exception as e:
            continue
            
    sorted_counts = sorted(languages_dict.items(), key=lambda x: x[1], reverse=True)
    return sorted_counts



def treat_languages(df, best_languages):
    for lang in best_languages:
        df[f'spoken_lang_{lang}'] = 0
    df['spoken_lang_other'] = 0
    
    for i, row in df.iterrows():
        try:
            language_list = ast.literal_eval(row['spoken_languages'])
            for language_dict in language_list:
                code = language_dict['iso_639_1']
                if code in best_languages:
                    # i is an index label (rows were dropped earlier), so use .loc
                    df.loc[i, f'spoken_lang_{code}'] = 1
                else:
                    df.loc[i, 'spoken_lang_other'] = 1
        except Exception:
            continue


In [17]:
lang_counts = language_counts(df["spoken_languages"])
lang_valid = [lang_counts[i][0] for i in range(7)]
lang_valid.append("other")

In [18]:
treat_languages(df, lang_valid)
df.drop(['spoken_languages'], axis=1, inplace=True)
df.head()

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,production_companies,production_countries,revenue,runtime,...,release_month,release_day,spoken_lang_en,spoken_lang_fr,spoken_lang_de,spoken_lang_es,spoken_lang_it,spoken_lang_ja,spoken_lang_ru,spoken_lang_other
0,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,en,"[led, woody, andys, toy, live, happily, room, ...","[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",373554033.0,81.0,...,10.0,Monday,1,0,0,0,0,0,0,0
1,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,en,"[sibling, judy, peter, discover, enchanted, bo...","[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",262797249.0,104.0,...,12.0,Friday,1,1,0,0,0,0,0,0
2,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,en,"[family, wedding, reignites, ancient, feud, ne...","[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",0.0,101.0,...,12.0,Friday,1,0,0,0,0,0,0,0
3,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",31357,en,"[cheated, mistreated, stepped, woman, holding,...",[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",81452156.0,127.0,...,12.0,Friday,1,0,0,0,0,0,0,0
4,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",11862,en,"[george, bank, recovered, daughter, wedding, r...","[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",76578911.0,106.0,...,2.0,Friday,1,0,0,0,0,0,0,0
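
Because earlier steps filter rows, the index labels no longer match the row positions. The sketch below, on a made-up frame with hypothetical names, illustrates why the label-based `.loc` is the accessor that pairs with `iterrows()`.

```python
import pandas as pd

toy = pd.DataFrame({"spoken": ["en", "fr", "de", "en"]})
toy = toy[toy["spoken"] != "fr"].copy()   # rows labelled 0, 2, 3 remain
toy["is_en"] = 0

for label, row in toy.iterrows():
    if row["spoken"] == "en":
        # iterrows yields index labels, so .loc (label-based) is the matching
        # accessor; .iloc would treat the label as a position.
        toy.loc[label, "is_en"] = 1

print(toy["is_en"].tolist())
```

With `.iloc`, label 3 would point past the end of this three-row frame and raise an `IndexError`, and label 2 would write to the wrong row.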


### 1.3.10 Genre

To choose the valid genres, we take the 7 most frequent ones and group the rest into an ``other`` category.

In [19]:
"""

#print(pd.isnull(df).any())
print(df["genres"].value_counts(dropna=False))

#Change rows with invalid genres
generos_validos = {'1', '2', '3', '4', '5'}

for nodo in df:
   if genres not in generos_validos 
      genres = other

"""

def genre_counts(rows):
    genre_dict = {}
    
    for row in rows:
        try:
            genres = ast.literal_eval(row)
            
            for genre in genres:
                name = genre["name"]
                
                if name in genre_dict.keys():
                    genre_dict[name] += 1
                else:
                    genre_dict[name] = 1
        except Exception as e:
            continue
            
    sorted_counts = sorted(genre_dict.items(), key=lambda x: x[1], reverse=True)
    return sorted_counts



def treat_genres(df, valid_genres):
    for genre in valid_genres:
        df[f'genre_{genre}'] = 0
    
    df['genre_other'] = 0
    
    for i, row in df.iterrows():
        try:
            genre_list = ast.literal_eval(row['genres'])
            for genre_dict in genre_list:
                name = genre_dict['name']
                
                if name in valid_genres:
                    # i is an index label (rows were dropped earlier), so use .loc
                    df.loc[i, f'genre_{name}'] = 1
                else:
                    df.loc[i, 'genre_other'] = 1
        except Exception:
            continue

In [20]:
genres = genre_counts(df["genres"])
genre_valid = [genres[i][0] for i in range(7)]
genre_valid.append("other")

In [21]:
treat_genres(df, genre_valid)
df.drop(['genres'], axis=1, inplace=True)
df.head()

Unnamed: 0,belongs_to_collection,budget,id,original_language,overview,production_companies,production_countries,revenue,runtime,tagline,...,spoken_lang_ru,spoken_lang_other,genre_Drama,genre_Comedy,genre_Thriller,genre_Romance,genre_Action,genre_Horror,genre_Crime,genre_other
0,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,862,en,"[led, woody, andys, toy, live, happily, room, ...","[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",373554033.0,81.0,[],...,0,0,0,1,0,0,0,0,0,1
1,,65000000,8844,en,"[sibling, judy, peter, discover, enchanted, bo...","[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",262797249.0,104.0,"[roll, dice, unleash, excitement]",...,0,0,0,0,0,0,0,0,0,1
2,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,15602,en,"[family, wedding, reignites, ancient, feud, ne...","[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",0.0,101.0,"[still, yelling, still, fighting, still, ready...",...,0,0,0,1,0,1,0,0,0,0
3,,16000000,31357,en,"[cheated, mistreated, stepped, woman, holding,...",[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",81452156.0,127.0,"[friend, people, let, never, let, forget]",...,0,0,1,1,0,1,0,0,0,0
4,"{'id': 96871, 'name': 'Father of the Bride Col...",0,11862,en,"[george, bank, recovered, daughter, wedding, r...","[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",76578911.0,106.0,"[world, back, normal, he, surprise, life]",...,0,0,0,1,0,0,0,0,0,0
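
The counting and flagging steps above can be sketched more compactly with `collections.Counter`. This is an illustration on made-up rows in the same stringified-list format, not the notebook's implementation; `parse_names`, `flags` and the cut-off of 2 are hypothetical and chosen only to fit the toy data.

```python
import ast
from collections import Counter

# Made-up rows in the same stringified-list format as the ``genres`` column.
rows = [
    "[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name': 'Comedy'}]",
    "[{'id': 18, 'name': 'Drama'}]",
    "[{'id': 27, 'name': 'Horror'}, {'id': 35, 'name': 'Comedy'}]",
    "[]",
]

def parse_names(cell):
    """Return the genre names stored in one cell (empty list if unparsable)."""
    try:
        return [item["name"] for item in ast.literal_eval(cell)]
    except (ValueError, SyntaxError, TypeError, KeyError):
        return []

counts = Counter(name for cell in rows for name in parse_names(cell))
top = [name for name, _ in counts.most_common(2)]  # k=2 only for the toy data

def flags(cell):
    """Multi-hot flags for one cell: one per top genre, plus 'other'."""
    names = parse_names(cell)
    out = {f"genre_{g}": int(g in names) for g in top}
    out["genre_other"] = int(any(n not in top for n in names))
    return out

print(top, flags(rows[2]))
```

Since a movie can belong to several genres, the result is a multi-hot encoding: several flags can be 1 in the same row, as in the table above.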


In [22]:
#We still have to decide whether to leave the budget as it is or to bin it into 4 categories, which was an open question
#(for example: blockbuster, large production, regular production and low-budget production)

### 1.3.11 Production companies

The `production_companies` column holds a list of dictionaries, one per company. We take the 'id' of every production company, count how many movies each company appears in, and then replace the column with the highest count (its popularity) among the companies involved in each movie.

In [23]:
'''PRODUCTION_COMPANIES'''

def popularity_companies(rows):
    companies_popularity = {}
    for row in rows:
        try:
            companies = ast.literal_eval(row)
            for company in companies:
                if company['id'] in companies_popularity.keys():
                    companies_popularity[company['id']] += 1
                else:
                    companies_popularity[company['id']] = 1
        except Exception:
            continue
    return companies_popularity

companies_popularity = popularity_companies(df['production_companies'])

def treat_companies(row):
    max_popularity = 0
    try:
        companies = ast.literal_eval(row)
        for company in companies:
            if int(companies_popularity[company['id']]) > max_popularity:
                max_popularity = companies_popularity[company['id']]
        return max_popularity
    except Exception:
        return max_popularity

In [24]:
df['production_companies'] = df['production_companies'].apply(treat_companies)
df['production_companies'].head(20)

0       52
1      196
2     1184
3      812
4      224
5     1184
6      972
7      261
8      819
9      270
10     420
11     420
12     819
13      83
14     993
15     819
16     427
17     182
18    1184
19     420
Name: production_companies, dtype: int64
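
The "maximum popularity" feature can be illustrated on a few made-up rows. This is a sketch under our own naming (`company_ids`, `popularity`, `max_popularity` are hypothetical), using `collections.Counter` for the counts.

```python
import ast
from collections import Counter

# Made-up rows in the same format as the column, plus one missing value.
rows = [
    "[{'name': 'A', 'id': 1}, {'name': 'B', 'id': 2}]",
    "[{'name': 'A', 'id': 1}]",
    "[{'name': 'C', 'id': 3}]",
    float("nan"),
]

def company_ids(cell):
    """Return the company ids stored in one cell (empty list if unparsable)."""
    try:
        return [company["id"] for company in ast.literal_eval(cell)]
    except (ValueError, SyntaxError, TypeError, KeyError):
        return []

# Number of movies each company id appears in.
popularity = Counter(cid for cell in rows for cid in company_ids(cell))

def max_popularity(cell):
    """Highest popularity among the companies of one movie, 0 if none."""
    return max((popularity[cid] for cid in company_ids(cell)), default=0)

print([max_popularity(cell) for cell in rows])
```

The first movie gets 2 because company 1 appears in two movies, even though company 2 appears only once; a movie with no parsable companies gets 0.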

### 1.3.12 Production countries

The `production_countries` column holds a list of dictionaries, one per country. We take the ISO code (`iso_3166_1`) of each country, keep the seven countries that appear most often, and one-hot encode them, flagging any other country in `production_other`.

In [25]:
'''PRODUCTION_COUNTRIES'''

def popularity_countries(rows):
    countries_popularity = {}
    
    for row in rows:
        try:
            countries = ast.literal_eval(row)
            for country in countries:
                if country['iso_3166_1'] in countries_popularity.keys():
                    countries_popularity[country['iso_3166_1']] += 1
                else:
                    countries_popularity[country['iso_3166_1']] = 1
        except:
            continue
            
    countries = sorted(countries_popularity.items(), key=lambda x: x[1], reverse=True)[:7]
    best_countries = []
    
    for country in countries:
        best_countries.append(country[0])
        
    return best_countries


def treat_countries(df, best_countries):
    for country in best_countries:
        df[f'production_{country}'] = 0
    df['production_other'] = 0
        
    for i, row in df.iterrows():
        try:
            countries = ast.literal_eval(row['production_countries'])
            for country in countries:
                if country['iso_3166_1'] in best_countries:
                    # i is an index label (rows were dropped earlier), so use .loc
                    df.loc[i, f'production_{country["iso_3166_1"]}'] = 1
                else:
                    df.loc[i, 'production_other'] = 1
        except Exception:
            continue

In [26]:
popular_countries = popularity_countries(df['production_countries'])

treat_countries(df, popular_countries)
df.drop(['production_countries'], axis=1, inplace=True)
df.head()

Unnamed: 0,belongs_to_collection,budget,id,original_language,overview,production_companies,revenue,runtime,tagline,title,...,genre_Crime,genre_other,production_US,production_GB,production_FR,production_DE,production_IT,production_CA,production_JP,production_other
0,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,862,en,"[led, woody, andys, toy, live, happily, room, ...",52,373554033.0,81.0,[],"[toy, story]",...,0,1,1,0,0,0,0,0,0,0
1,,65000000,8844,en,"[sibling, judy, peter, discover, enchanted, bo...",196,262797249.0,104.0,"[roll, dice, unleash, excitement]",[jumanji],...,0,1,1,0,0,0,0,0,0,0
2,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,15602,en,"[family, wedding, reignites, ancient, feud, ne...",1184,0.0,101.0,"[still, yelling, still, fighting, still, ready...","[grumpier, old, men]",...,0,0,1,0,0,0,0,0,0,0
3,,16000000,31357,en,"[cheated, mistreated, stepped, woman, holding,...",812,81452156.0,127.0,"[friend, people, let, never, let, forget]","[waiting, exhale]",...,0,0,1,0,0,0,0,0,0,0
4,"{'id': 96871, 'name': 'Father of the Bride Col...",0,11862,en,"[george, bank, recovered, daughter, wedding, r...",224,76578911.0,106.0,"[world, back, normal, he, surprise, life]","[father, bride, part, ii]",...,0,0,1,0,0,0,0,0,0,0


### 1.3.13 Budget

In [27]:
df = df.dropna(subset=["budget"])  # drop rows without a budget
df["budget"].head()

0    30000000
1    65000000
2           0
3    16000000
4           0
Name: budget, dtype: object

### 1.3.14 Belongs to collection

In [28]:
def preprocess_collection(coleccion):
    id_coleccion = -1
    if type(coleccion) != float:
        id_coleccion = re.search("'id' *: *([0-9]+)", coleccion)[1]
    return id_coleccion

In [29]:
df["belongs_to_collection"] = df["belongs_to_collection"].apply(preprocess_collection)
df["belongs_to_collection"]

0         10194
1            -1
2        119050
3            -1
4         96871
          ...  
45459        -1
45460        -1
45461        -1
45462        -1
45463        -1
Name: belongs_to_collection, Length: 41184, dtype: object
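
The id extraction can be checked on a single value. The sketch below reuses the regular expression from the cell above but, as our own variation, returns an `int` in every case and tolerates strings without an id; `collection_id` is a hypothetical name.

```python
import re

ID_PATTERN = re.compile(r"'id' *: *([0-9]+)")

def collection_id(cell):
    """Return the collection id as an int, or -1 when there is none."""
    if not isinstance(cell, str):      # NaN (a float) marks "no collection"
        return -1
    match = ID_PATTERN.search(cell)
    return int(match.group(1)) if match else -1

print(collection_id("{'id': 10194, 'name': 'Toy Story Collection'}"))
```

Returning integers throughout would give the column a numeric dtype instead of the `object` dtype shown above, where the ids are strings and the placeholder is the integer -1.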

### 1.3.15 Revenue

A large share of the values for this feature are missing (recorded as 0), and we use that to build two datasets: one containing only the rows with a known, positive ``revenue``, used for training, and another with the rows whose ``revenue`` is 0, used for testing.

In [30]:
'''REVENUE'''

revenue_valor = df[df["revenue"] > 0]
revenue_cero = df[df["revenue"] == 0]

In [31]:
print(len(revenue_valor))
print(len(revenue_cero))

7349
33835
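
The split can be illustrated on a few made-up rows (the frame and the names `toy`, `training` and `testing` are hypothetical; two of the revenue figures are borrowed from the tables above):

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "revenue": [373554033.0, 0.0, float("nan"), 81452156.0],
})

training = toy[toy["revenue"] > 0]    # revenue known
testing = toy[toy["revenue"] == 0]    # revenue recorded as 0

# A NaN revenue fails both comparisons, so that row lands in neither subset.
print(len(training), len(testing))
```

If rows with a null revenue should also go to the testing set, the second mask would need an explicit `isna()` term.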


## 1.4 Saving processed datasets

Once all the columns have been preprocessed, we create the datasets that our prediction system will use: one with known revenue but fewer samples, and another without it.

In [32]:
revenue_valor.to_csv("movies_dataset/preprocessed_training.csv", index = False)

In [33]:
revenue_cero.to_csv("movies_dataset/preprocessed_testing.csv", index = False)