# Exercises - Entering the cycle

It's time to solve some problems using the data science techniques we have accumulated so far!

The idea is for you to practice (ideally in groups) the **pipeline of a data science project**.

Work through the complete pipeline (including the data exploration steps!), but give special focus to the modeling step, aiming to improve the **evaluation metrics** that you judge most appropriate!

<img src="https://www.abgconsultoria.com.br/blog/wp-content/uploads/img33-768x242.png" width=700>

___

For each of the datasets below (some of which we already know), answer:

- 1 - what is the problem to be solved?
- 2 - what is the response variable (target)?
- 3 - is the problem a classification or a regression problem?
- 4 - perform EDA on the data!! Get to know the data!
- 5 - build a model that achieves the best evaluation metric (discuss which metric makes the most sense)
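
To make item 5 concrete, here is a minimal sketch of how common classification metrics can disagree on the same predictions. The labels are made up for illustration and do not come from any of the datasets.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels, invented only for this illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]  # one positive is missed (a false negative)

acc = accuracy_score(y_true, y_pred)    # 7 correct out of 8
prec = precision_score(y_true, y_pred)  # 3 true positives / 3 predicted positives
rec = recall_score(y_true, y_pred)      # 3 true positives / 4 actual positives
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall

print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

For regression problems, the analogous discussion would involve metrics such as `mean_absolute_error` or `r2_score` from `sklearn.metrics`.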

Notes:

> use the estimators/hypotheses we have seen so far;

> if any member of the group knows other estimators/hypotheses, the group may use these tools **as long as the member who knows them shares the essence of the estimator with the other colleagues**
_____

In [4]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
# from sklearn.svm import SVC, LinearSVC
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.naive_bayes import GaussianNB
# from sklearn.linear_model import Perceptron
# from sklearn.linear_model import SGDClassifier
# from sklearn.tree import DecisionTreeClassifier

____
____
____

### Problem 1: Titanic

Dataset `titanic.csv` in the `/datasets` folder

>what sorts of people were more likely to survive?

In [5]:
df_titanic = pd.read_csv('../datasets/titanic.csv', sep=',')
df_titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"


In [6]:
df_titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   pclass     1309 non-null   int64 
 1   survived   1309 non-null   int64 
 2   name       1309 non-null   object
 3   sex        1309 non-null   object
 4   age        1309 non-null   object
 5   sibsp      1309 non-null   int64 
 6   parch      1309 non-null   int64 
 7   ticket     1309 non-null   object
 8   fare       1309 non-null   object
 9   cabin      1309 non-null   object
 10  embarked   1309 non-null   object
 11  boat       1309 non-null   object
 12  body       1309 non-null   object
 13  home.dest  1309 non-null   object
dtypes: int64(4), object(10)
memory usage: 143.3+ KB


In [7]:
# replace non-numeric ages ('?') with NaN
df_titanic['age'] = df_titanic['age'].replace('?', np.nan)

In [8]:
# drop rows with missing age and convert the remaining ages from str to float
df_titanic = df_titanic.dropna(subset=['age']).copy()
df_titanic['age'] = df_titanic['age'].astype(float)
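
An alternative to replacing `'?'` after loading is to declare it as a missing-value marker when reading the file. The sketch below uses a tiny in-memory CSV (invented rows, not the real `titanic.csv`) to show that `age` then arrives as `float64` without a manual conversion.

```python
import io
import pandas as pd

# Tiny made-up CSV standing in for titanic.csv; '?' marks missing values
csv_text = """pclass,survived,sex,age
1,1,female,29.0
1,0,male,?
3,0,male,0.9167
"""

# na_values='?' makes pandas parse '?' as NaN, so numeric columns keep a numeric dtype
df_demo = pd.read_csv(io.StringIO(csv_text), na_values='?')

print(df_demo.dtypes)
n_missing = int(df_demo['age'].isna().sum())
print(n_missing)

# Drop only the rows where age is missing
df_demo = df_demo.dropna(subset=['age'])
```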

In [9]:
# encode passenger sex as 1 (male) or 0 (female)
df_titanic['sex'] = df_titanic['sex'].map({'male': 1, 'female': 0})

In [10]:
df_titanic.describe()

Unnamed: 0,pclass,survived,sex,age,sibsp,parch
count,1046.0,1046.0,1046.0,1046.0,1046.0,1046.0
mean,2.207457,0.408222,0.629063,29.881135,0.502868,0.42065
std,0.841497,0.49174,0.483287,14.4135,0.912167,0.83975
min,1.0,0.0,0.0,0.1667,0.0,0.0
25%,1.0,0.0,0.0,21.0,0.0,0.0
50%,2.0,0.0,1.0,28.0,0.0,0.0
75%,3.0,1.0,1.0,39.0,1.0,1.0
max,3.0,1.0,1.0,80.0,8.0,6.0


In [51]:
df_model = df_titanic.select_dtypes(include = np.number)
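
Note that `select_dtypes(include=np.number)` keeps only numeric columns, so a column such as `fare`, which `info()` reports as `object`, is left out of `df_model`. Below is a minimal sketch of one way to recover such a column, assuming its non-numeric entries are `'?'` placeholders as in `age` (an assumption, since the `fare` values were not inspected above). The frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame mimicking the situation: 'fare' was read as strings
df_demo = pd.DataFrame({
    'pclass': [1, 3, 2],
    'fare': ['211.3375', '?', '13.0'],
})

# errors='coerce' turns anything that cannot be parsed as a number into NaN
df_demo['fare'] = pd.to_numeric(df_demo['fare'], errors='coerce')

numeric_cols = df_demo.select_dtypes(include=np.number).columns.tolist()
print(numeric_cols)
```

The resulting NaN values would still need to be dropped or imputed before fitting a model.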

In [53]:
df_model.head(10)

Unnamed: 0,pclass,survived,sex,age,sibsp,parch
0,1,1,0,29.0,0,0
1,1,1,1,0.9167,1,2
2,1,0,0,2.0,1,2
3,1,0,1,30.0,1,2
4,1,0,0,25.0,1,2
5,1,1,1,48.0,0,0
6,1,1,0,63.0,1,0
7,1,0,1,39.0,0,0
8,1,1,0,53.0,2,0
9,1,0,1,71.0,0,0


In [54]:
X = df_model.drop(columns=['survived'])
X

Unnamed: 0,pclass,sex,age,sibsp,parch
0,1,0,29.0000,0,0
1,1,1,0.9167,1,2
2,1,0,2.0000,1,2
3,1,1,30.0000,1,2
4,1,0,25.0000,1,2
...,...,...,...,...,...
1301,3,1,45.5000,0,0
1304,3,0,14.5000,1,0
1306,3,1,26.5000,0,0
1307,3,1,27.0000,0,0


In [55]:
y = df_model['survived']
y

0       1
1       1
2       0
3       0
4       0
       ..
1301    0
1304    0
1306    0
1307    0
1308    0
Name: survived, Length: 1046, dtype: int64

In [56]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
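
Because the target is binary and somewhat imbalanced (about 41% survivors according to `describe()` above), it may be worth keeping the class proportions the same in both splits. A minimal sketch with synthetic data, using the `stratify` parameter of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 40% positive class
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 400 + [0] * 600)

# stratify=y_demo preserves the class ratio in the train and test subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```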

In [57]:
X_train.shape

(836, 5)

In [58]:
X_test.shape

(210, 5)

In [59]:
y_train.shape

(836,)

In [60]:
y_test.shape

(210,)

In [61]:
X_test.columns.values

array(['pclass', 'sex', 'age', 'sibsp', 'parch'], dtype=object)

What are the data types for various features?

In [19]:
X_train.info()
print('='*40)
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1047 entries, 772 to 1126
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   pclass     1047 non-null   int64 
 1   name       1047 non-null   object
 2   sex        1047 non-null   object
 3   age        1047 non-null   object
 4   sibsp      1047 non-null   int64 
 5   parch      1047 non-null   int64 
 6   ticket     1047 non-null   object
 7   fare       1047 non-null   object
 8   cabin      1047 non-null   object
 9   embarked   1047 non-null   object
 10  boat       1047 non-null   object
 11  body       1047 non-null   object
 12  home.dest  1047 non-null   object
dtypes: int64(3), object(10)
memory usage: 114.5+ KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 262 entries, 1148 to 199
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   pclass     262 non-null    int64 
 1   name       262 non-null

What is the distribution of numerical feature values across the samples?

What is the distribution of categorical features?
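
One possible way to answer the two questions above is `describe()` for numeric features and `value_counts()` for categorical ones. The sketch below runs on a small made-up frame, not on the Titanic data.

```python
import pandas as pd

# Small made-up sample with one numeric and one categorical feature
df_demo = pd.DataFrame({
    'age': [29.0, 2.0, 30.0, 25.0, 48.0],
    'sex': ['female', 'male', 'female', 'male', 'male'],
})

# Numeric features: count, mean, std, min, quartiles, max
numeric_summary = df_demo.describe()

# Categorical features: frequency of each category
sex_counts = df_demo['sex'].value_counts()

print(numeric_summary)
print(sex_counts)
```

Plots such as `sns.histplot` for numeric columns and `sns.countplot` for categorical ones can complement these tables.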

In [64]:
from sklearn.linear_model import LogisticRegression

logit = LogisticRegression().fit(X_train, y_train)

In [65]:
logit.intercept_

array([4.99897556])

In [66]:
logit.feature_names_in_

array(['pclass', 'sex', 'age', 'sibsp', 'parch'], dtype=object)

In [67]:
logit.coef_

array([[-1.14725353, -2.63003455, -0.03980415, -0.3288364 ,  0.06152164]])
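
The intercept and coefficients above fully determine the model's predicted probabilities: the probability of class 1 is the sigmoid of `intercept + coef · x`. As a sanity check, this sketch plugs in the printed (rounded) values together with the first two rows of `X_test` as displayed in the notebook, and also converts the coefficients to odds ratios.

```python
import numpy as np

# Values copied from the fitted model's printed output (rounded as shown)
intercept = 4.99897556
coef = np.array([-1.14725353, -2.63003455, -0.03980415, -0.3288364, 0.06152164])
features = ['pclass', 'sex', 'age', 'sibsp', 'parch']

# First two rows of X_test as displayed in the notebook
rows = np.array([
    [3, 0, 26.0, 0, 0],
    [1, 1, 21.0, 0, 1],
])

# Logistic regression: P(survived=1) = sigmoid(intercept + coef . x)
z = intercept + rows @ coef
p_survived = 1.0 / (1.0 + np.exp(-z))
print(p_survived)

# exp(coef) is the multiplicative change in the odds per unit increase of a feature
odds_ratios = dict(zip(features, np.exp(coef)))
print(odds_ratios)
```

The two probabilities come out at roughly 0.628 and 0.610, which agrees with the second column of the first two rows in the probability array printed further down. The negative `sex` coefficient (with male encoded as 1) means the estimated odds of survival for men are roughly 0.07 times those for women, holding the other features fixed.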

In [69]:
X_test

Unnamed: 0,pclass,sex,age,sibsp,parch
860,3,0,26.0,0,0
317,1,1,21.0,0,1
688,3,1,29.0,1,0
357,2,1,42.0,0,0
1259,3,1,36.0,0,0
...,...,...,...,...,...
745,3,0,30.0,0,0
674,3,1,32.0,0,0
360,2,1,26.0,1,1
1258,3,0,29.0,0,2


In [71]:
X_test.values

array([[ 3.,  0., 26.,  0.,  0.],
       [ 1.,  1., 21.,  0.,  1.],
       [ 3.,  1., 29.,  1.,  0.],
       ...,
       [ 2.,  1., 26.,  1.,  1.],
       [ 3.,  0., 29.,  0.,  2.],
       [ 2.,  1., 28.,  0.,  1.]])

array([[0.37231292, 0.62768708],
       [0.38999768, 0.61000232],
       [0.92797399, 0.07202601],
       [0.83164634, 0.16835366],
       [0.92454526, 0.07545474],
       [0.42218888, 0.57781112],
       [0.83309787, 0.16690213],
       [0.12233264, 0.87766736],
       [0.73111325, 0.26888675],
       [0.28488688, 0.71511312],
       [0.19460946, 0.80539054],
       [0.49310969, 0.50689031],
       [0.87087379, 0.12912621],
       [0.21899732, 0.78100268],
       [0.50306026, 0.49693974],
       [0.68166916, 0.31833084],
       [0.92172112, 0.07827888],
       [0.88165713, 0.11834287],
       [0.54300601, 0.45699399],
       [0.33592132, 0.66407868],
       [0.89165198, 0.10834802],
       [0.06552228, 0.93447772],
       [0.28488688, 0.71511312],
       [0.5130084 , 0.4869916 ],
       [0.09117597, 0.90882403],
       [0.70700157, 0.29299843],
       [0.0788136 , 0.9211864 ],
       [0.72462905, 0.27537095],
       [0.93846763, 0.06153237],
       [0.85277792, 0.14722208],
       ...

ValueError: Classification metrics can't handle a mix of binary and multilabel-indicator targets
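
The cell that raised this error is not shown, so the following is an assumption about its cause: scikit-learn raises this `ValueError` when a classification metric receives a 1-D binary `y_true` together with a 2-D array of 0/1 values, which is what rounding or thresholding the full `predict_proba` output produces. A minimal sketch on synthetic data that reproduces the mismatch and shows two ways to obtain 1-D predictions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

model = LogisticRegression().fit(X_tr, y_tr)
proba = model.predict_proba(X_te)  # shape (n_samples, 2): one column per class

# Passing the rounded 2-D array is expected to trigger the ValueError
try:
    accuracy_score(y_te, proba.round())
except ValueError as err:
    print("ValueError:", err)

# Option 1: let the model return hard class labels (1-D)
y_pred = model.predict(X_te)

# Option 2: threshold only the probability of class 1 (second column)
y_pred_from_proba = (proba[:, 1] > 0.5).astype(int)

acc = accuracy_score(y_te, y_pred)
print(acc)
```

Option 2 is useful when a threshold other than 0.5 suits the problem better, for example to trade precision for recall.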

___
___
___

### Problem 2 - Tips

Dataset `tips.csv` in the `/datasets` folder

___
___
___

### Problem 3: house prices

Dataset `house_prices.csv` in the `/datasets` folder

___
___
___

### Problem 4 - Iris

Dataset `iris.csv` in the `/datasets` folder

___
___
___

### Problem 5 - breast cancer

Dataset `breast_cancer.csv` in the `/datasets` folder

___
___
___

### Problem 6 - YOU CHOOSE!

Go to [Kaggle](https://www.kaggle.com/), or obtain **supervised** data (with the desired target) from any other source, and do what we did above!

___
___
___