### Data Loading and Cleaning

#### Importing all libraries

In [None]:
import pandas as pd 
import numpy as np

#### Loading and cleaning the data

In [2]:
df = pd.read_csv("../data/desafio_indicium_imdb.csv")
df.head(5)

Unnamed: 0.1,Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,1,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
1,2,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
2,3,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
3,4,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000
4,5,The Lord of the Rings: The Return of the King,2003,U,201 min,"Action, Adventure, Drama",8.9,Gandalf and Aragorn lead the World of Men agai...,94.0,Peter Jackson,Elijah Wood,Viggo Mortensen,Ian McKellen,Orlando Bloom,1642758,377845905


Right away we can see that the "Unnamed: 0" column can simply be dropped, since it carries no relevant information.

In [3]:
df = df.drop(columns=['Unnamed: 0'])

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Series_Title   999 non-null    object 
 1   Released_Year  999 non-null    object 
 2   Certificate    898 non-null    object 
 3   Runtime        999 non-null    object 
 4   Genre          999 non-null    object 
 5   IMDB_Rating    999 non-null    float64
 6   Overview       999 non-null    object 
 7   Meta_score     842 non-null    float64
 8   Director       999 non-null    object 
 9   Star1          999 non-null    object 
 10  Star2          999 non-null    object 
 11  Star3          999 non-null    object 
 12  Star4          999 non-null    object 
 13  No_of_Votes    999 non-null    int64  
 14  Gross          830 non-null    object 
dtypes: float64(2), int64(1), object(12)
memory usage: 117.2+ KB


This reveals a few more problems: the dataset contains null values, and the "Released_Year", "Runtime", and "Gross" columns are stored as object when they should be numeric. Let's fix that.
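
Because `errors='coerce'` silently turns unparseable entries into NaN, it can help to look at which raw values fail to parse before converting. The sketch below does this on a hypothetical toy column; the values in `years` are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy column: one entry is not a year
years = pd.Series(["1972", "2008", "PG", "1995"], name="Released_Year")

# Coerce to numbers; anything unparseable becomes NaN
parsed = pd.to_numeric(years, errors="coerce")

# Values that were present in the raw column but failed to parse
bad_values = years[parsed.isna() & years.notna()]
print(bad_values.tolist())
```

Listing the offending values first makes it easier to decide whether each one should be corrected by hand or left as missing.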

In [5]:
df['Released_Year'] = pd.to_numeric(df['Released_Year'], errors='coerce')
df['Released_Year'] = df['Released_Year'].astype('Int64')

In [6]:
df['Runtime'] = df['Runtime'].str.replace('min', '').astype(int)

In [7]:
df['Gross'] = df['Gross'].str.replace(',', '').astype(float)
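
As a slightly more defensive variant of the two conversions above, the sketch below extracts the digits from the runtime explicitly and strips the thousands separators as literal text. The `raw` frame is hypothetical toy data that only mimics the raw string formats.

```python
import pandas as pd

# Hypothetical toy data mimicking the raw string formats
raw = pd.DataFrame({
    "Runtime": ["175 min", "152 min", "96 min"],
    "Gross": ["134,966,411", None, "4,360,000"],
})

# Pull out the digits explicitly instead of deleting the "min" suffix
runtime_min = raw["Runtime"].str.extract(r"(\d+)", expand=False).astype(int)

# Remove thousands separators as literal text (regex=False), then convert;
# missing entries stay missing (NaN) in the float result
gross = pd.to_numeric(raw["Gross"].str.replace(",", "", regex=False))

print(runtime_min.tolist())
print(gross.tolist())
```

Extracting the digits avoids depending on the exact spacing around the unit, which is an assumption about how the column might vary rather than something observed in this dataset.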

In [8]:
df.isna().sum()

Series_Title       0
Released_Year      1
Certificate      101
Runtime            0
Genre              0
IMDB_Rating        0
Overview           0
Meta_score       157
Director           0
Star1              0
Star2              0
Star3              0
Star4              0
No_of_Votes        0
Gross            169
dtype: int64

Another problem: the dataset is small and has many null values (169 in "Gross", 157 in "Meta_score", and 101 in "Certificate", out of 999 rows). Dropping every row with a null would discard a large share of the data. To limit that loss without distorting the dataset too much, I think the best approach is to replace the nulls with sensible values. Let's explore that.
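
One way to quantify this trade-off is to compare the share of missing values per column with the number of rows a blanket `dropna()` would keep. This is a minimal sketch on a hypothetical toy frame (`toy` and its values are invented for illustration).

```python
import pandas as pd
import numpy as np

# Hypothetical toy frame with gaps in two columns
toy = pd.DataFrame({
    "Certificate": ["A", None, "U", "UA"],
    "Gross": [1.0, np.nan, np.nan, 4.0],
    "IMDB_Rating": [9.2, 9.0, 9.0, 8.9],
})

# Percentage of missing values per column
null_pct = toy.isna().mean().mul(100).round(1)

# Rows that would survive dropping every row with any null
rows_kept = len(toy.dropna())

print(null_pct)
print(rows_kept, "of", len(toy), "rows kept")
```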

##### Exploring "Gross", "Meta_score", "Certificate", and "Released_Year" before treating them

In [9]:
print("MEDIA: ", df['Gross'].mean())
print("MEDIANA: ", df['Gross'].median())
print("MAX: ", df['Gross'].max())
print("MIN: ", df['Gross'].min())

MEDIA:  68082574.10481928
MEDIANA:  23457439.5
MAX:  936662225.0
MIN:  1305.0


The gap between the maximum and the minimum is very large, and the mean (about 68 million) sits far above the median (about 23.5 million), so for "Gross" I find it more sensible to replace the nulls with the median.
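
The reasoning can be illustrated with a small sketch: in a right-skewed column a single large value pulls the mean far from the typical value, while the median barely moves. The numbers below are hypothetical and chosen only to make the effect visible.

```python
import pandas as pd
import numpy as np

# Hypothetical right-skewed toy values: one blockbuster dominates
gross = pd.Series([1.0, 2.0, 3.0, 4.0, 990.0, np.nan])

mean_value = gross.mean()      # pulled up by the outlier
median_value = gross.median()  # stays at the middle value

# Impute the gap with the median, as done for "Gross"
filled = gross.fillna(median_value)

print(mean_value, median_value)
print(filled.tolist())
```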

In [10]:
df['Gross'] = df['Gross'].fillna(df['Gross'].median())

In [11]:
print("MEDIA: ", df['Meta_score'].mean())
print("MEDIANA: ", df['Meta_score'].median())
print("MAX: ", df['Meta_score'].max())
print("MIN: ", df['Meta_score'].min())

MEDIA:  77.96912114014252
MEDIANA:  79.0
MAX:  100.0
MIN:  28.0


Here the mean and the median are close, so I will use the mean.

In [12]:
df['Meta_score'] = df['Meta_score'].fillna(df['Meta_score'].mean())

In [13]:
df['Certificate'].value_counts(dropna=False)

Certificate
U           234
A           196
UA          175
R           146
NaN         101
PG-13        43
PG           37
Passed       34
G            12
Approved     11
TV-PG         3
GP            2
TV-14         1
Unrated       1
TV-MA         1
16            1
U/A           1
Name: count, dtype: int64

Here I will simply remove the rows with null values; I am wary that filling in a value in this case could alter the results of the analyses.
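
If the intent is to drop rows only when the certificate is missing, `dropna` accepts a `subset` argument that makes this explicit. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd
import numpy as np

# Hypothetical toy frame: one missing certificate, one missing score
toy = pd.DataFrame({
    "Certificate": ["A", None, "U"],
    "Meta_score": [100.0, 84.0, np.nan],
})

# Drop only the rows missing a certificate; other gaps are left alone
kept = toy.dropna(subset=["Certificate"])

print(kept)
```

With `subset`, the result does not depend on whether the other columns have already been filled.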

In [14]:
df[df['Released_Year'].isna()]

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
965,Apollo 13,,U,140,"Adventure, Drama, History",7.6,NASA must devise a strategy to return Apollo 1...,77.0,Ron Howard,Tom Hanks,Bill Paxton,Kevin Bacon,Gary Sinise,269197,173837933.0


This case was simple: I looked it up online and found that the film was released in 1995.
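
Since the year is known for one specific film, a targeted assignment with `.loc` is an alternative to filling every null in the column; it would not touch other rows if more gaps appeared later. The sketch uses a hypothetical two-row frame.

```python
import pandas as pd

# Hypothetical toy frame with one missing year
films = pd.DataFrame({
    "Series_Title": ["Apollo 13", "Heat"],
    "Released_Year": pd.array([pd.NA, 1995], dtype="Int64"),
})

# Fill only the row whose title matches and whose year is missing
mask = (films["Series_Title"] == "Apollo 13") & films["Released_Year"].isna()
films.loc[mask, "Released_Year"] = 1995

print(films)
```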

In [15]:
df['Released_Year'] = df['Released_Year'].fillna(1995)

In [16]:
df.isna().sum()

Series_Title       0
Released_Year      0
Certificate      101
Runtime            0
Genre              0
IMDB_Rating        0
Overview           0
Meta_score         0
Director           0
Star1              0
Star2              0
Star3              0
Star4              0
No_of_Votes        0
Gross              0
dtype: int64

In [17]:
df = df.dropna()

In [18]:
df.shape

(898, 15)

This way we cleaned the data while keeping losses and distortion low (101 of 999 rows removed), which should give us a more consistent analysis later on.

In [19]:
df.duplicated(subset=["Series_Title", "Released_Year"]).sum()

np.int64(0)

In [20]:
df.head(5)

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,The Godfather,1972,A,175,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411.0
1,The Dark Knight,2008,UA,152,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444.0
2,The Godfather: Part II,1974,A,202,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000.0
3,12 Angry Men,1957,U,96,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000.0
4,The Lord of the Rings: The Return of the King,2003,U,201,"Action, Adventure, Drama",8.9,Gandalf and Aragorn lead the World of Men agai...,94.0,Peter Jackson,Elijah Wood,Viggo Mortensen,Ian McKellen,Orlando Bloom,1642758,377845905.0


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 898 entries, 0 to 996
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Series_Title   898 non-null    object 
 1   Released_Year  898 non-null    Int64  
 2   Certificate    898 non-null    object 
 3   Runtime        898 non-null    int64  
 4   Genre          898 non-null    object 
 5   IMDB_Rating    898 non-null    float64
 6   Overview       898 non-null    object 
 7   Meta_score     898 non-null    float64
 8   Director       898 non-null    object 
 9   Star1          898 non-null    object 
 10  Star2          898 non-null    object 
 11  Star3          898 non-null    object 
 12  Star4          898 non-null    object 
 13  No_of_Votes    898 non-null    int64  
 14  Gross          898 non-null    float64
dtypes: Int64(1), float64(3), int64(2), object(9)
memory usage: 113.1+ KB


The data is now correctly typed and fully populated.<br>
Next I will create a dictionary for "Certificate", because from what I could tell some of these ratings mean the same thing but come from different countries.

In [22]:
print("temos: ", df["Certificate"].value_counts())

mapa_etaria = {
    'U': 0,
    'G': 0,
    'Approved': 0,
    'Passed': 0,
    'GP': 0,
    'UA': 1,
    'U/A': 1,
    'PG': 1,
    'PG-13': 1,
    'TV-PG': 1,
    'A': 2,
    'R': 2,
    'TV-14': 2,
    '16': 2,
    'TV-MA': 2,
    'Unrated': 2
}

df['Faixa_etaria'] = df['Certificate'].map(mapa_etaria)
df['Faixa_etaria_texto'] = df['Faixa_etaria'].map({0:'Livre', 1:'Adolescentes', 2:'Adulto'}) 

temos:  Certificate
U           234
A           196
UA          175
R           146
PG-13        43
PG           37
Passed       34
G            12
Approved     11
TV-PG         3
GP            2
TV-14         1
Unrated       1
TV-MA         1
16            1
U/A           1
Name: count, dtype: int64
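
A dictionary-based `map` returns NaN for any label missing from the dictionary, so a quick check for unmapped categories can catch gaps in the mapping. The sketch below uses a hypothetical subset of the mapping and a toy column containing one label that is deliberately not covered.

```python
import pandas as pd

# Hypothetical subset of the mapping and a toy column with an unseen label
mapping = {"U": 0, "UA": 1, "A": 2}
certs = pd.Series(["U", "A", "NC-17", "UA"], name="Certificate")

groups = certs.map(mapping)

# Labels absent from the dictionary come back as NaN
unmapped = sorted(certs[groups.isna()].unique())

print(unmapped)
```

An empty list here would indicate that every category present in the column has an entry in the dictionary.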


In [23]:
df.head(5)

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross,Faixa_etaria,Faixa_etaria_texto
0,The Godfather,1972,A,175,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411.0,2,Adulto
1,The Dark Knight,2008,UA,152,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444.0,1,Adolescentes
2,The Godfather: Part II,1974,A,202,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000.0,2,Adulto
3,12 Angry Men,1957,U,96,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000.0,0,Livre
4,The Lord of the Rings: The Return of the King,2003,U,201,"Action, Adventure, Drama",8.9,Gandalf and Aragorn lead the World of Men agai...,94.0,Peter Jackson,Elijah Wood,Viggo Mortensen,Ian McKellen,Orlando Bloom,1642758,377845905.0,0,Livre


In [24]:
df.to_csv("../data/cinema_processed.csv", index=False)