# <h1 align=center> **ETL - DATASET ANOMALIES** </h1>
<h1 align=center> (Extract, Transform, Load) </h1>

Before the ETL is applied, the libraries used throughout the process are imported:

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from scipy.spatial import distance

# <h1 align=left>**`Extract`**</h1>

In [9]:
# Forward slashes avoid invalid escape sequences such as "\D" and also work on Windows.
df_anomalies = pd.read_csv("../Data/movies_dataset_anomalies.csv")

# <h1 align=left>**`Transform`**</h1>

The DataFrame is inspected to identify its structure and the information it contains:

In [10]:
df_anomalies.head(10)

Unnamed: 0.1,Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,19730,- Written by Ørnås,0.065736,/ff9qCepilowshEtG2GYWwzt2bs4.jpg,"[{'name': 'Carousel Productions', 'id': 11176}...","[{'iso_3166_1': 'CA', 'name': 'Canada'}, {'iso...",,0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}]",...,1,,,,,,,,,
1,29503,Rune Balot goes to a casino connected to the ...,1.931659,/zV8bHuSL6WXoD6FWogP9j4x80bL.jpg,"[{'name': 'Aniplex', 'id': 2883}, {'name': 'Go...","[{'iso_3166_1': 'US', 'name': 'United States o...",,0,68.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",...,12,,,,,,,,,
2,35587,Avalanche Sharks tells the story of a bikini ...,2.185485,/zaSf5OG7V8X8gqFvly88zDdRm46.jpg,"[{'name': 'Odyssey Media', 'id': 17161}, {'nam...","[{'iso_3166_1': 'CA', 'name': 'Canada'}]",,0,82.0,"[{'iso_639_1': 'en', 'name': 'English'}]",...,22,,,,,,,,,


Because the variables in this Dataset are shifted 8 positions relative to the original `movies` Dataset, the trailing filler columns are renamed first, so that the leading columns can then be given their real names:

In [11]:
df_anomalies.rename(columns={'revenue':'revenue2',
                            'runtime':'runtime2',
                            'spoken_languages':'spoken_languages2',
                            'status':'status2',
                            'tagline':'tagline2',
                            'title':'title2',
                            'video':'video2',
                            'vote_average':'vote_average2',
                            'vote_count':'vote_count2'},
                            inplace=True)
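
The mapping above is written out by hand. As a minimal sketch, the same kind of mapping can be built from the trailing column names. The toy frame and the `n_filler` parameter are assumptions made for illustration, not part of the original notebook:

```python
import pandas as pd

# Toy frame standing in for df_anomalies (column names are illustrative).
toy = pd.DataFrame(
    {"adult": ["x"], "release_date": [1], "revenue": [None], "runtime": [None]}
)

# Hypothetical parameter: how many trailing columns are filler.
n_filler = 2
filler = list(toy.columns[-n_filler:])

# Build a name -> name + "2" mapping, as the explicit rename does.
toy_renamed = toy.rename(columns={name: f"{name}2" for name in filler})
print(list(toy_renamed.columns))
```

This only helps when the filler columns really are the last ones in the frame, so it is worth checking `toy.columns` before trusting the slice.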

The DataFrame is inspected to validate the changes.

In [12]:
df_anomalies.head()

Unnamed: 0.1,Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,release_date,revenue2,runtime2,spoken_languages2,status2,tagline2,title2,video2,vote_average2,vote_count2
0,19730,- Written by Ørnås,0.065736,/ff9qCepilowshEtG2GYWwzt2bs4.jpg,"[{'name': 'Carousel Productions', 'id': 11176}...","[{'iso_3166_1': 'CA', 'name': 'Canada'}, {'iso...",,0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}]",...,1,,,,,,,,,
1,29503,Rune Balot goes to a casino connected to the ...,1.931659,/zV8bHuSL6WXoD6FWogP9j4x80bL.jpg,"[{'name': 'Aniplex', 'id': 2883}, {'name': 'Go...","[{'iso_3166_1': 'US', 'name': 'United States o...",,0,68.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",...,12,,,,,,,,,
2,35587,Avalanche Sharks tells the story of a bikini ...,2.185485,/zaSf5OG7V8X8gqFvly88zDdRm46.jpg,"[{'name': 'Odyssey Media', 'id': 17161}, {'nam...","[{'iso_3166_1': 'CA', 'name': 'Canada'}]",,0,82.0,"[{'iso_639_1': 'en', 'name': 'English'}]",...,22,,,,,,,,,


The variables are now given their real names, according to the information they contain and the name of the corresponding variable in the original Dataset, `movies`.

In [13]:
df_anomalies.rename(columns={'adult':'original_title',
                            'belongs_to_collection':'popularity',
                            'budget':'poster_path',
                            'genres':'production_companies',
                            'homepage':'production_countries',
                            'id':'release_date',
                            'imdb_id':'revenue',
                            'original_language':'runtime',
                            'original_title':'spoken_languages',
                            'overview':'status',
                            'popularity':'tagline',
                            'poster_path':'title',
                            'production_companies':'video',
                            'production_countries':'vote_average',
                            'release_date':'vote_count'},
                            inplace=True)

df_anomalies.head()

Unnamed: 0.1,Unnamed: 0,original_title,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,...,vote_count,revenue2,runtime2,spoken_languages2,status2,tagline2,title2,video2,vote_average2,vote_count2
0,19730,- Written by Ørnås,0.065736,/ff9qCepilowshEtG2GYWwzt2bs4.jpg,"[{'name': 'Carousel Productions', 'id': 11176}...","[{'iso_3166_1': 'CA', 'name': 'Canada'}, {'iso...",,0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}]",...,1,,,,,,,,,
1,29503,Rune Balot goes to a casino connected to the ...,1.931659,/zV8bHuSL6WXoD6FWogP9j4x80bL.jpg,"[{'name': 'Aniplex', 'id': 2883}, {'name': 'Go...","[{'iso_3166_1': 'US', 'name': 'United States o...",,0,68.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",...,12,,,,,,,,,
2,35587,Avalanche Sharks tells the story of a bikini ...,2.185485,/zaSf5OG7V8X8gqFvly88zDdRm46.jpg,"[{'name': 'Odyssey Media', 'id': 17161}, {'nam...","[{'iso_3166_1': 'CA', 'name': 'Canada'}]",,0,82.0,"[{'iso_639_1': 'en', 'name': 'English'}]",...,22,,,,,,,,,
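
When columns are displaced by a constant number of positions, a rename mapping can be derived from a reference column order instead of being typed by hand. The sketch below uses a toy frame. The helper `shifted_names` and the `offset` value are hypothetical, and the approach only applies if the displacement really is uniform, which should be verified against the data first:

```python
import pandas as pd

def shifted_names(reference_cols, offset):
    """Map each name to the name `offset` positions later.

    Names with no later counterpart are treated as filler and get a
    "2" suffix, so the renamed frame has no duplicate column labels.
    """
    n = len(reference_cols)
    mapping = {}
    for i, name in enumerate(reference_cols):
        if i + offset < n:
            mapping[name] = reference_cols[i + offset]
        else:
            mapping[name] = f"{name}2"
    return mapping

# Toy data: the values 1, 2, 3 belong under "c", "d", "e".
reference_cols = ["a", "b", "c", "d", "e"]
toy = pd.DataFrame([[1, 2, 3, None, None]], columns=reference_cols)

# rename applies the whole mapping at once, so "c" -> "e" does not
# interfere with "a" -> "c".
toy = toy.rename(columns=shifted_names(reference_cols, offset=2))
print(list(toy.columns))
```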


The missing columns are created and placed directly in their corresponding positions, so that the DataFrame resembles the one used with the `movies` Dataset. These variables are filled with 0 by default.

In [14]:
df_anomalies.insert(0, 'adult', 0)
df_anomalies.insert(1, 'belongs_to_collection', 0)
df_anomalies.insert(2, 'budget', 0)
df_anomalies.insert(3, 'genres', 0)
df_anomalies.insert(4, 'homepage', 0)
df_anomalies.insert(5, 'id', 0)
df_anomalies.insert(6, 'imdb_id', 0)
df_anomalies.insert(7, 'original_language', 0)
df_anomalies.insert(9, 'overview', 0)
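
The repeated `insert` calls can also be written as a loop over the placeholder names. This is a sketch on a toy frame, with an illustrative subset of names:

```python
import pandas as pd

# Toy frame standing in for df_anomalies.
toy = pd.DataFrame({"original_title": ["t"], "popularity": [0.5]})

# Illustrative subset of the placeholder columns, in their target order.
placeholders = ["adult", "budget", "id"]

for position, name in enumerate(placeholders):
    # insert modifies the frame in place and raises ValueError
    # if a column with that name already exists.
    toy.insert(position, name, 0)

print(list(toy.columns))
```

Because `insert` refuses duplicate names by default, running the cell twice fails loudly instead of silently adding a second copy of each column.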

The DataFrame is inspected to validate the changes.

In [15]:
df_anomalies.head()

Unnamed: 0.1,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,Unnamed: 0,overview,...,vote_count,revenue2,runtime2,spoken_languages2,status2,tagline2,title2,video2,vote_average2,vote_count2
0,0,0,0,0,0,0,0,0,19730,0,...,1,,,,,,,,,
1,0,0,0,0,0,0,0,0,29503,0,...,12,,,,,,,,,
2,0,0,0,0,0,0,0,0,35587,0,...,22,,,,,,,,,


Inspecting the DataFrame showed that the `release_date` variable did not retain its original data. Since little information is missing, the values are filled in manually, based on the information in the original `movies` Dataset.

In [16]:
df_anomalies.loc[0, 'release_date']="1997-08-20"
df_anomalies.loc[1, 'release_date']="2012-09-29"
df_anomalies.loc[2, 'release_date']="2014-01-01"
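
With only three rows, typing the dates by hand is reasonable. For a larger number of rows, one possible alternative is to look the values up in the original data. The sketch below uses toy frames and assumes, purely for illustration, that `title` identifies a film uniquely, which may not hold in the real Dataset:

```python
import pandas as pd

# Toy stand-ins for the original movies data and the anomalous rows.
movies = pd.DataFrame(
    {"title": ["Film A", "Film B"], "release_date": ["2001-01-01", "2002-02-02"]}
)
anomalies = pd.DataFrame(
    {"title": ["Film B", "Film A"], "release_date": [None, None]}
)

# Series indexed by title; duplicates are dropped so map() is unambiguous.
lookup = movies.drop_duplicates("title").set_index("title")["release_date"]

# Titles missing from the lookup would come back as NaN.
anomalies["release_date"] = anomalies["title"].map(lookup)
print(anomalies)
```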

The leftover columns are now dropped.

In [18]:
df_anomalies.drop(['revenue2','runtime2','spoken_languages2','status2','tagline2','title2','video2','vote_average2','vote_count2','Unnamed: 0'], axis=1, inplace=True)
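
Since the filler columns all carry the `2` suffix, the list of names to drop can also be collected by suffix. This is a sketch on a toy frame. It assumes no legitimate column name ends in `2`:

```python
import pandas as pd

# Toy frame with one real column, two filler columns and a saved index.
toy = pd.DataFrame(columns=["title", "title2", "video2", "Unnamed: 0"])

# Filler columns by suffix, plus the leftover index column.
to_drop = [name for name in toy.columns if name.endswith("2")] + ["Unnamed: 0"]

toy = toy.drop(columns=to_drop)
print(list(toy.columns))
```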

The DataFrame is inspected to validate the changes.

In [19]:
df_anomalies.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,overview,original_title,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,0,0,0,0,0,0,0,0,0,- Written by Ørnås,...,1997-08-20,0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Midnight Man,False,6.0,1
1,0,0,0,0,0,0,0,0,0,Rune Balot goes to a casino connected to the ...,...,2012-09-29,0,68.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",Released,,Mardock Scramble: The Third Exhaust,False,7.0,12
2,0,0,0,0,0,0,0,0,0,Avalanche Sharks tells the story of a bikini ...,...,2014-01-01,0,82.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Beware Of Frost Bites,Avalanche Sharks,False,4.3,22


Since the `id` column is the primary key of all the Datasets, a different `id` is assigned to each row, after checking that the values do not coincide with those already present in the original Dataset.

In [20]:
# Row 0 keeps the placeholder id 0; integers keep the column numeric.
df_anomalies.loc[1, 'id'] = 1
df_anomalies.loc[2, 'id'] = 4
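
The check that the new ids do not collide with existing ones can be made explicit. The sketch below uses made-up ids in place of the original `movies` ids. The helper `find_clashes` is hypothetical:

```python
import pandas as pd

# Stand-in for the id column of the original Dataset (values are made up).
existing_ids = pd.Series([2, 3, 5, 6])

# Ids intended for the anomalous rows.
candidate_ids = [0, 1, 4]

def find_clashes(candidates, existing):
    """Return the candidate ids that already appear in `existing`."""
    existing_set = set(existing.tolist())
    return sorted(c for c in candidates if c in existing_set)

clashes = find_clashes(candidate_ids, existing_ids)
has_duplicates = len(set(candidate_ids)) != len(candidate_ids)
print(clashes, has_duplicates)
```

An empty `clashes` list together with `has_duplicates` being `False` means the candidates can serve as primary keys alongside the existing ids.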

The DataFrame is inspected to validate the changes.

In [21]:
df_anomalies.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,overview,original_title,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,0,0,0,0,0,0,0,0,0,- Written by Ørnås,...,1997-08-20,0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Midnight Man,False,6.0,1
1,0,0,0,0,0,1,0,0,0,Rune Balot goes to a casino connected to the ...,...,2012-09-29,0,68.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",Released,,Mardock Scramble: The Third Exhaust,False,7.0,12
2,0,0,0,0,0,4,0,0,0,Avalanche Sharks tells the story of a bikini ...,...,2014-01-01,0,82.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Beware Of Frost Bites,Avalanche Sharks,False,4.3,22


# <h1 align=left>**`Load`**</h1>

With the DataFrame ready, a new Dataset is created to be combined with the original Dataset and continue the ETL process.

In [22]:
# Forward slashes avoid invalid escape sequences such as "\D" and "\m".
df_anomalies.to_csv("../Data/movies_newdataset_anomalies.csv")
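
By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference on a toy frame, using in-memory buffers instead of files. Whether the index should be kept depends on what the later ETL steps expect:

```python
import io
import pandas as pd

toy = pd.DataFrame({"id": [0, 1, 4], "title": ["a", "b", "c"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```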