**Ifood Challenge**

The dataset consists of customers of the company Ifood, with data on:

* Customer profiles
* Product preferences
* Campaign successes/failures
* Channel performance

**Objective:**
Perform an exploratory analysis of the data.

* How much data do we have? Rows and columns

* Which columns are numeric?

* Are there duplicates in our dataset? If so, remove them

* Are there null values in this dataset? Do they indicate anything? What should we do with them?

* What are the mean, median, 25th percentile, 75th percentile, minimum and maximum of each numeric column?

In [34]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('mkt_data.csv')

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,Income,Kidhome,Teenhome,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,...,education_Graduation,education_Master,education_PhD,MntTotal,MntRegularProds,AcceptedCmpOverall,marital_status,education_level,kids,expenses
0,0,58138.0,0,0,58,635,88,546,172,88,...,3.0,,,1529,1441,0,Single,Graduation,0,1529
1,1,46344.0,1,1,38,11,1,6,2,1,...,3.0,,,21,15,0,Single,Graduation,2,21
2,2,71613.0,0,0,26,426,49,127,111,21,...,3.0,,,734,692,0,Together,Graduation,0,734
3,3,26646.0,1,0,26,11,4,20,10,3,...,3.0,,,48,43,0,Together,Graduation,1,48
4,4,58293.0,1,0,94,173,43,118,46,27,...,,,5.0,407,392,0,Married,PhD,1,407




**1) How much data do we have? Rows and columns**

44 columns and 2205 rows

In [28]:
df.shape

(2205, 44)

**2) Which columns are numeric?**

All columns in the dataset except marital_status and education_level

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2205 entries, 0 to 2204
Data columns (total 44 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Unnamed: 0            2205 non-null   int64  
 1   Income                2205 non-null   float64
 2   Kidhome               2205 non-null   int64  
 3   Teenhome              2205 non-null   int64  
 4   Recency               2205 non-null   int64  
 5   MntWines              2205 non-null   int64  
 6   MntFruits             2205 non-null   int64  
 7   MntMeatProducts       2205 non-null   int64  
 8   MntFishProducts       2205 non-null   int64  
 9   MntSweetProducts      2205 non-null   int64  
 10  MntGoldProds          2205 non-null   int64  
 11  NumDealsPurchases     2205 non-null   int64  
 12  NumWebPurchases       2205 non-null   int64  
 13  NumCatalogPurchases   2205 non-null   int64  
 14  NumStorePurchases     2205 non-null   int64  
 15  NumWebVisitsMonth    
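
Reading the dtypes off `df.info()` works, but the numeric columns can also be listed programmatically. The following is a minimal sketch on a small toy frame (the values are illustrative, not taken from `mkt_data.csv`), assuming that "numeric" means any integer or float dtype:

```python
import pandas as pd

# Toy frame with hypothetical values; the column names mirror a few of those in the dataset
toy = pd.DataFrame({
    "Income": [58138.0, 46344.0],
    "Kidhome": [0, 1],
    "marital_status": ["Single", "Together"],
})

# select_dtypes keeps (or drops) columns according to their dtype
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```

Applied to `df`, the same two calls should separate the text columns (marital_status and education_level) from the rest.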

**3) Are there duplicates in our dataset? If so, remove them**

No. There are no fully duplicated rows.

In [None]:
dados_duplicados = df.duplicated()
soma_duplicados = dados_duplicados.sum()
print('Number of duplicated rows: ', soma_duplicados)

Number of duplicated rows:  0
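
One caveat: `duplicated()` compares entire rows, so a row-identifier column such as `Unnamed: 0` makes every row unique by construction. The sketch below uses a toy frame with hypothetical values to show how duplicates could be counted and removed while ignoring such a column. Whether this changes the result for `mkt_data.csv` is not checked here.

```python
import pandas as pd

# Toy frame with hypothetical values: rows 0 and 1 differ only in the identifier column
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Income": [100.0, 100.0, 200.0],
    "Kidhome": [0, 0, 1],
})

# Comparing whole rows: the identifier makes every row distinct
dup_all = int(toy.duplicated().sum())

# Comparing only the feature columns (assumption: "Unnamed" columns are identifiers)
feature_cols = [c for c in toy.columns if not c.startswith("Unnamed")]
dup_features = int(toy.duplicated(subset=feature_cols).sum())

# Keep the first occurrence of each duplicated group and rebuild the index
deduped = toy.drop_duplicates(subset=feature_cols).reset_index(drop=True)

print(dup_all, dup_features, len(deduped))
```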


**4) Are there null values in this dataset? Do they indicate anything? What should we do with them?**

Yes. They occur in the columns derived from the categorical variables (marital_* and education_*) and may indicate that the attribute does not apply to the customer, for example whether they are divorced or married.



In [None]:
df.isnull().sum()

Unnamed: 0                 0
Income                     0
Kidhome                    0
Teenhome                   0
Recency                    0
MntWines                   0
MntFruits                  0
MntMeatProducts            0
MntFishProducts            0
MntSweetProducts           0
MntGoldProds               0
NumDealsPurchases          0
NumWebPurchases            0
NumCatalogPurchases        0
NumStorePurchases          0
NumWebVisitsMonth          0
AcceptedCmp3               0
AcceptedCmp4               0
AcceptedCmp5               0
AcceptedCmp1               0
AcceptedCmp2               0
Complain                   0
Z_CostContact              0
Z_Revenue                  0
Response                   0
Age                        0
Customer_Days              0
marital_Divorced        1975
marital_Married         1351
marital_Single          1728
marital_Together        1637
marital_Widow           2129
education_2n Cycle      2007
education_Basic         2151
education_Grad

In [29]:
# count the occurrences of each value in the marital_Divorced column
df.marital_Divorced.value_counts()

1.0    230
Name: marital_Divorced, dtype: int64

The marital_Divorced column contains only the value 1.0, which appears 230 times. The rest is null. This suggests that a null means the person does not have that particular feature.

We can therefore turn these columns into boolean flags: 1 if the value is present and 0 if it is null.
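
One way to test this reading is to confirm that each row has exactly one non-null marital_* column. The sketch below does so on a toy frame with hypothetical values. Running the same check on `df` is left as an assumption about how the data was encoded.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) mimicking the layout of the marital_* columns
toy = pd.DataFrame({
    "marital_Divorced": [1.0, np.nan, np.nan],
    "marital_Married": [np.nan, 5.0, np.nan],
    "marital_Single": [np.nan, np.nan, 4.0],
})

marital_cols = [c for c in toy.columns if c.startswith("marital_")]

# Number of non-null marital_* entries in each row
non_null_per_row = toy[marital_cols].notna().sum(axis=1)

# True when every row has exactly one category filled in
one_category_each = bool((non_null_per_row == 1).all())
print(one_category_each)
```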

In [35]:
colunas_com_nulos = [
    "marital_Divorced",
    "marital_Married",
    "marital_Single",
    "marital_Together",
    "marital_Widow",
    "education_2n Cycle",
    "education_Basic",
    "education_Graduation",
    "education_Master",
    "education_PhD",
]

# For each column, create a flag column: 0 where the value is null, 1 otherwise
for item in colunas_com_nulos:
    df["booleano" + str(item)] = np.where(df[item].isnull(), 0, 1)
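
The loop above can also be written without explicit iteration. The following sketch, on a toy frame with hypothetical values, builds the same kind of 0/1 flags with `notna()` and keeps the `booleano` prefix used in the notebook. It is an alternative formulation, not a change to the result.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; NaN means the category does not apply
toy = pd.DataFrame({
    "marital_Divorced": [1.0, np.nan, np.nan],
    "education_PhD": [np.nan, 5.0, 5.0],
})

cols = ["marital_Divorced", "education_PhD"]

# notna() gives True/False per cell; astype(int) turns that into 1/0
flags = toy[cols].notna().astype(int).add_prefix("booleano")

# Attach the new flag columns next to the original ones
toy = pd.concat([toy, flags], axis=1)
print(toy)
```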

**5) What are the mean, median, 25th percentile, 75th percentile, minimum and maximum of each numeric column?**

In [None]:
# mean, 25th percentile, 75th percentile, minimum and maximum (the 50% row is the median)
df.describe()

Unnamed: 0.1,Unnamed: 0,Income,Kidhome,Teenhome,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,...,education_2n Cycle,education_Basic,education_Graduation,education_Master,education_PhD,MntTotal,MntRegularProds,AcceptedCmpOverall,kids,expenses
count,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,...,198.0,54.0,1113.0,364.0,476.0,2205.0,2205.0,2205.0,2205.0,2205.0
mean,1102.0,51622.094785,0.442177,0.506576,49.00907,306.164626,26.403175,165.312018,37.756463,27.128345,...,1.0,2.0,3.0,4.0,5.0,562.764626,518.707483,0.29932,0.948753,562.764626
std,636.672993,20713.063826,0.537132,0.54438,28.932111,337.493839,39.784484,217.784507,54.824635,41.130468,...,0.0,0.0,0.0,0.0,0.0,575.936911,553.847248,0.68044,0.749231,575.936911
min,0.0,1730.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,2.0,3.0,4.0,5.0,4.0,-283.0,0.0,0.0,4.0
25%,551.0,35196.0,0.0,0.0,24.0,24.0,2.0,16.0,3.0,1.0,...,1.0,2.0,3.0,4.0,5.0,56.0,42.0,0.0,0.0,56.0
50%,1102.0,51287.0,0.0,0.0,49.0,178.0,8.0,68.0,12.0,8.0,...,1.0,2.0,3.0,4.0,5.0,343.0,288.0,0.0,1.0,343.0
75%,1653.0,68281.0,1.0,1.0,74.0,507.0,33.0,232.0,50.0,34.0,...,1.0,2.0,3.0,4.0,5.0,964.0,884.0,0.0,1.0,964.0
max,2204.0,113734.0,2.0,2.0,99.0,1493.0,199.0,1725.0,259.0,262.0,...,1.0,2.0,3.0,4.0,5.0,2491.0,2458.0,4.0,3.0,2491.0
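
Since the 50% row of `describe()` is the median, all six requested statistics can be collected into a single table. This is a minimal sketch on a toy frame with hypothetical values. The row selection and the renaming to `median` are choices made for this example.

```python
import pandas as pd

# Toy frame with hypothetical values; the text column is ignored by describe()
toy = pd.DataFrame({
    "Income": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Recency": [1, 2, 3, 4, 5],
    "marital_status": ["Single", "Married", "Single", "Together", "Widow"],
})

# Keep only the requested statistics and label the 50% row as the median
summary = (
    toy.describe()
    .loc[["mean", "50%", "25%", "75%", "min", "max"]]
    .rename(index={"50%": "median"})
)
print(summary)
```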


In [None]:
# median (numeric_only=True skips the text columns marital_status and education_level)
df.median(numeric_only=True)

  


Unnamed: 0               1102.0
Income                  51287.0
Kidhome                     0.0
Teenhome                    0.0
Recency                    49.0
MntWines                  178.0
MntFruits                   8.0
MntMeatProducts            68.0
MntFishProducts            12.0
MntSweetProducts            8.0
MntGoldProds               25.0
NumDealsPurchases           2.0
NumWebPurchases             4.0
NumCatalogPurchases         2.0
NumStorePurchases           5.0
NumWebVisitsMonth           6.0
AcceptedCmp3                0.0
AcceptedCmp4                0.0
AcceptedCmp5                0.0
AcceptedCmp1                0.0
AcceptedCmp2                0.0
Complain                    0.0
Z_CostContact               3.0
Z_Revenue                  11.0
Response                    0.0
Age                        50.0
Customer_Days            2515.0
marital_Divorced            1.0
marital_Married             5.0
marital_Single              4.0
marital_Together            3.0
marital_