### **<ins>Index_clean</ins>:**

 0. [Library imports](#importacion_lib)
 1. [Actor](#actor)
 2. [Film](#film)
 3. [Category](#category)
 4. [Inventory](#inventory)
 5. [Language](#language)
 6. [Old HDD](#old_hdd)
 7. [Rental](#rental)

### 0 - Library imports <a name="importacion_lib"/>
***

In [51]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')  
pd.set_option('display.max_columns', None) 

### 1 - Actor <a name="actor"/>
***

In [52]:
act = pd.read_csv('../data/raw/actor.csv')
act.head()

Unnamed: 0,actor_id,first_name,last_name,last_update
0,1,PENELOPE,GUINESS,2006-02-15 04:34:33
1,2,NICK,WAHLBERG,2006-02-15 04:34:33
2,3,ED,CHASE,2006-02-15 04:34:33
3,4,JENNIFER,DAVIS,2006-02-15 04:34:33
4,5,JOHNNY,LOLLOBRIGIDA,2006-02-15 04:34:33


In [53]:
# Table size
act.shape

(200, 4)

In [54]:
# Check for duplicates on first and last name
act[['first_name','last_name']].duplicated().sum()

1

In [55]:
# Drop the duplicate record and fix the indices
act_index = act[act[['first_name','last_name']].duplicated()].index
act = act.drop(index= act_index).reset_index(drop= True)
act['actor_id'] = range(1, act.shape[0] + 1)  # renumber ids consecutively
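
The same deduplication can be written more compactly with `DataFrame.drop_duplicates`. The sketch below uses a small made-up frame (`toy`, not the real actor table) and assumes, as the cell above does, that the first occurrence of each name is the one to keep.

```python
import pandas as pd

# Hypothetical toy data standing in for the actor table
toy = pd.DataFrame({
    'actor_id': [1, 2, 3, 4],
    'first_name': ['ANN', 'BOB', 'ANN', 'CAT'],
    'last_name': ['LEE', 'RAY', 'LEE', 'FOX'],
})

# Keep the first occurrence of each (first_name, last_name) pair
deduped = (
    toy.drop_duplicates(subset=['first_name', 'last_name'], keep='first')
       .reset_index(drop=True)
)

# Renumber the ids consecutively, as the cell above does
deduped['actor_id'] = range(1, len(deduped) + 1)
print(deduped)
```

Renumbering changes the id of every row after the dropped one, so any table that references `actor_id` may need the same mapping applied.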

In [56]:
act.last_update.unique()  # last_update gives us no information

array(['2006-02-15 04:34:33'], dtype=object)

In [57]:
act.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199 entries, 0 to 198
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   actor_id     199 non-null    int64 
 1   first_name   199 non-null    object
 2   last_name    199 non-null    object
 3   last_update  199 non-null    object
dtypes: int64(1), object(3)
memory usage: 6.3+ KB


From the information above, we can apply the following transformations:
* **<ins>actor_id</ins>:** Cast it to `np.int16` to reduce memory usage
* **<ins>first_name / last_name</ins>:** Apply the `capitalize` string method for consistent formatting
* **<ins>last_update</ins>:** Drop it; it holds a single unique value and provides no useful information
* Renumber the indices so the format is consistent with the other tables

In [58]:
act.actor_id = act.actor_id.astype(np.int16)
act['first_name'] = act.first_name.str.capitalize()
act['last_name'] = act.last_name.str.capitalize()
act = act.drop(columns='last_update')
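
To see what the `np.int16` cast saves, one can compare memory use before and after. This is a minimal sketch on a made-up series of 199 ids. `pd.to_numeric(..., downcast='integer')` is shown only as an alternative that picks the smallest integer type automatically; it is not what the cell above uses.

```python
import numpy as np
import pandas as pd

# Made-up ids 1..199, stored as int64 like the raw column
ids = pd.Series(range(1, 200), dtype='int64')

# Explicit cast, as in the cell above
as_int16 = ids.astype(np.int16)

# Alternative: let pandas choose the smallest integer type that fits
auto = pd.to_numeric(ids, downcast='integer')

print(ids.memory_usage(index=False))       # bytes used as int64
print(as_int16.memory_usage(index=False))  # bytes used as int16
print(auto.dtype)
```

`np.int16` holds values up to 32767, which is ample for the 199 ids in this table.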

In [59]:
act.head()

Unnamed: 0,actor_id,first_name,last_name
0,1,Penelope,Guiness
1,2,Nick,Wahlberg
2,3,Ed,Chase
3,4,Jennifer,Davis
4,5,Johnny,Lollobrigida


In [60]:
act.to_csv('../data/clean/actor_clean.csv',index=False)

In [115]:
(act.first_name + ' ' + act.last_name).unique().size

199

### 2 - Film <a name="film"/> 
***

In [42]:
film = pd.read_csv('../data/raw/film.csv')
film.head()

Unnamed: 0,film_id,title,description,release_year,language_id,original_language_id,rental_duration,rental_rate,length,replacement_cost,rating,special_features,last_update
0,1,ACADEMY DINOSAUR,A Epic Drama of a Feminist And a Mad Scientist...,2006,1,,6,0.99,86,20.99,PG,"Deleted Scenes,Behind the Scenes",2006-02-15 05:03:42
1,2,ACE GOLDFINGER,A Astounding Epistle of a Database Administrat...,2006,1,,3,4.99,48,12.99,G,"Trailers,Deleted Scenes",2006-02-15 05:03:42
2,3,ADAPTATION HOLES,A Astounding Reflection of a Lumberjack And a ...,2006,1,,7,2.99,50,18.99,NC-17,"Trailers,Deleted Scenes",2006-02-15 05:03:42
3,4,AFFAIR PREJUDICE,A Fanciful Documentary of a Frisbee And a Lumb...,2006,1,,5,2.99,117,26.99,G,"Commentaries,Behind the Scenes",2006-02-15 05:03:42
4,5,AFRICAN EGG,A Fast-Paced Documentary of a Pastry Chef And ...,2006,1,,6,2.99,130,22.99,G,Deleted Scenes,2006-02-15 05:03:42


In [37]:
film.shape

(1000, 13)

In [43]:
film.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   film_id               1000 non-null   int64  
 1   title                 1000 non-null   object 
 2   description           1000 non-null   object 
 3   release_year          1000 non-null   int64  
 4   language_id           1000 non-null   int64  
 5   original_language_id  0 non-null      float64
 6   rental_duration       1000 non-null   int64  
 7   rental_rate           1000 non-null   float64
 8   length                1000 non-null   int64  
 9   replacement_cost      1000 non-null   float64
 10  rating                1000 non-null   object 
 11  special_features      1000 non-null   object 
 12  last_update           1000 non-null   object 
dtypes: float64(3), int64(5), object(5)
memory usage: 101.7+ KB


In [83]:
film.release_year.unique()

array([2006], dtype=int64)

In [114]:
film.title.unique().size

1000
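
Several columns seen so far carry a single value (`release_year`, `last_update`) or no values at all (`original_language_id`). Below is a minimal sketch of a helper that lists such columns. `constant_columns` is a hypothetical name and the frame is made-up data, not the real film table.

```python
import numpy as np
import pandas as pd

def constant_columns(df):
    """Return the columns with at most one distinct non-null value."""
    counts = df.nunique(dropna=True)
    return counts[counts <= 1].index.tolist()

# Hypothetical toy data with one constant and one all-null column
toy = pd.DataFrame({
    'film_id': [1, 2, 3],
    'release_year': [2006, 2006, 2006],
    'original_language_id': [np.nan, np.nan, np.nan],
})

print(constant_columns(toy))
```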

### 3 - Category <a name="category"/> 
***

In [91]:
cat = pd.read_csv('../data/raw/category.csv')
cat.head()

Unnamed: 0,category_id,name,last_update
0,1,Action,2006-02-15 04:46:27
1,2,Animation,2006-02-15 04:46:27
2,3,Children,2006-02-15 04:46:27
3,4,Classics,2006-02-15 04:46:27
4,5,Comedy,2006-02-15 04:46:27


In [92]:
cat.shape

(16, 3)

In [93]:
cat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   category_id  16 non-null     int64 
 1   name         16 non-null     object
 2   last_update  16 non-null     object
dtypes: int64(1), object(2)
memory usage: 512.0+ bytes


In [95]:
cat.last_update.unique()

array(['2006-02-15 04:46:27'], dtype=object)

### 4 - Inventory <a name="inventory"/> 
***

In [48]:
inv = pd.read_csv('../data/raw/inventory.csv')
inv.head()

Unnamed: 0,inventory_id,film_id,store_id,last_update
0,1,1,1,2006-02-15 05:09:17
1,2,1,1,2006-02-15 05:09:17
2,3,1,1,2006-02-15 05:09:17
3,4,1,1,2006-02-15 05:09:17
4,5,1,2,2006-02-15 05:09:17


In [49]:
inv.shape

(1000, 4)

In [50]:
inv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   inventory_id  1000 non-null   int64 
 1   film_id       1000 non-null   int64 
 2   store_id      1000 non-null   int64 
 3   last_update   1000 non-null   object
dtypes: int64(3), object(1)
memory usage: 31.4+ KB


In [98]:
inv.last_update.unique()

array(['2006-02-15 05:09:17'], dtype=object)

In [101]:
inv.query('film_id == 4')

Unnamed: 0,inventory_id,film_id,store_id,last_update
15,16,4,1,2006-02-15 05:09:17
16,17,4,1,2006-02-15 05:09:17
17,18,4,1,2006-02-15 05:09:17
18,19,4,1,2006-02-15 05:09:17
19,20,4,2,2006-02-15 05:09:17
20,21,4,2,2006-02-15 05:09:17
21,22,4,2,2006-02-15 05:09:17
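
The query above lists the copies of a single film. To get the same count for every film and store at once, a `groupby` is one option. The sketch below runs on made-up rows that mirror the `inventory` columns.

```python
import pandas as pd

# Hypothetical toy rows with the same columns as the inventory table
toy_inv = pd.DataFrame({
    'inventory_id': [1, 2, 3, 4, 5],
    'film_id': [1, 1, 1, 4, 4],
    'store_id': [1, 1, 2, 1, 2],
})

# Count copies per (film, store) and pivot the stores into columns
copies = (
    toy_inv.groupby(['film_id', 'store_id'])
           .size()
           .unstack(fill_value=0)
)
print(copies)
```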


### 5 - Language <a name="language"/> 
***

In [80]:
lan = pd.read_csv('../data/raw/language.csv')
lan.head(10)

Unnamed: 0,language_id,name,last_update
0,1,English,2006-02-15 05:02:19
1,2,Italian,2006-02-15 05:02:19
2,3,Japanese,2006-02-15 05:02:19
3,4,Mandarin,2006-02-15 05:02:19
4,5,French,2006-02-15 05:02:19
5,6,German,2006-02-15 05:02:19


In [72]:
lan.shape

(6, 3)

In [73]:
lan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   language_id  6 non-null      int64 
 1   name         6 non-null      object
 2   last_update  6 non-null      object
dtypes: int64(1), object(2)
memory usage: 272.0+ bytes


In [85]:
lan.last_update.unique()

array(['2006-02-15 05:02:19'], dtype=object)

### 6 - Old HDD <a name="old_hdd"/> 
***

In [74]:
ohdd = pd.read_csv('../data/raw/old_HDD.csv')
ohdd.head()

Unnamed: 0,first_name,last_name,title,release_year,category_id
0,PENELOPE,GUINESS,ACADEMY DINOSAUR,2006,6
1,PENELOPE,GUINESS,ANACONDA CONFESSIONS,2006,2
2,PENELOPE,GUINESS,ANGELS LIFE,2006,13
3,PENELOPE,GUINESS,BULWORTH COMMANDMENTS,2006,10
4,PENELOPE,GUINESS,CHEAPER CLYDE,2006,14


In [75]:
ohdd.shape

(1000, 5)

In [76]:
ohdd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   first_name    1000 non-null   object
 1   last_name     1000 non-null   object
 2   title         1000 non-null   object
 3   release_year  1000 non-null   int64 
 4   category_id   1000 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 39.2+ KB


In [106]:
ohdd['complet_name'] = ohdd.first_name + ' ' + ohdd.last_name

In [112]:
# Only the last expression in a cell is displayed, so the output is the title count
ohdd.complet_name.unique().size
ohdd.title.unique().size

614

### 7 - Rental <a name="rental"/> 
***

In [77]:
rent = pd.read_csv('../data/raw/rental.csv')
rent.head()

Unnamed: 0,rental_id,rental_date,inventory_id,customer_id,return_date,staff_id,last_update
0,1,2005-05-24 22:53:30,367,130,2005-05-26 22:04:30,1,2006-02-15 21:30:53
1,2,2005-05-24 22:54:33,1525,459,2005-05-28 19:40:33,1,2006-02-15 21:30:53
2,3,2005-05-24 23:03:39,1711,408,2005-06-01 22:12:39,1,2006-02-15 21:30:53
3,4,2005-05-24 23:04:41,2452,333,2005-06-03 01:43:41,2,2006-02-15 21:30:53
4,5,2005-05-24 23:05:21,2079,222,2005-06-02 04:33:21,1,2006-02-15 21:30:53


In [78]:
rent.shape

(1000, 7)

In [79]:
rent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   rental_id     1000 non-null   int64 
 1   rental_date   1000 non-null   object
 2   inventory_id  1000 non-null   int64 
 3   customer_id   1000 non-null   int64 
 4   return_date   1000 non-null   object
 5   staff_id      1000 non-null   int64 
 6   last_update   1000 non-null   object
dtypes: int64(4), object(3)
memory usage: 54.8+ KB


In [82]:
rent.inventory_id.unique().size

1000

In [104]:
rent.customer_id.unique().size

485

In [117]:
rent.rental_date.max()

'2005-05-31 00:46:31'

In [118]:
rent.rental_date.min()

'2005-05-24 22:53:30'
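
`rental_date` is stored as `object` (strings), so `max()` and `min()` above compare text. That gives the right answer here only because the timestamps use the sortable `YYYY-MM-DD HH:MM:SS` layout. Below is a sketch of parsing the columns into datetimes, using the first two rows shown in `rent.head()`. The column name `rental_days` is an assumption introduced for illustration.

```python
import pandas as pd

# First two rows of rent.head(), date columns only
toy_rent = pd.DataFrame({
    'rental_date': ['2005-05-24 22:53:30', '2005-05-24 22:54:33'],
    'return_date': ['2005-05-26 22:04:30', '2005-05-28 19:40:33'],
})

# Parse the strings into datetime64 values
for col in ['rental_date', 'return_date']:
    toy_rent[col] = pd.to_datetime(toy_rent[col], format='%Y-%m-%d %H:%M:%S')

# Whole days between rental and return (hypothetical derived column)
toy_rent['rental_days'] = (toy_rent['return_date'] - toy_rent['rental_date']).dt.days

print(toy_rent.dtypes)
print(toy_rent['rental_date'].max() - toy_rent['rental_date'].min())
```

With datetime columns, differences such as the span between the first and last rental come out as `Timedelta` values rather than requiring string handling.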
