## Class 9: Feature Engineering - Pet Adoption Prediction

https://www.kaggle.com/c/petfinder-adoption-prediction/data


**AdoptionSpeed** - Categorical speed of adoption. Lower is faster. This is the value to predict. See below section for more info.

Type - Type of animal (1 = Dog, 2 = Cat)

Age - Age of pet when listed, in months

Breed1 - Primary breed of pet (Refer to BreedLabels dictionary)

Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)

Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)

Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)

Fee - Adoption fee (0 = Free)

In [320]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings('ignore')

### Exercise 1

1) Load the "train" dataset.

2) Work only with the features "'Type', 'Age', 'Breed1', 'Breed2', 'Gender', 'Health', 'Fee', 'AdoptionSpeed'".

3) Format **Age**, **Fee** and **AdoptionSpeed** as numeric and the rest as categorical (pd.Categorical).

In [321]:
data = pd.read_csv('train.csv')
# Keep only the columns used in this class; .copy() avoids assigning into a view of `data`.
cols = ['Type', 'Age', 'Breed1', 'Breed2', 'Gender', 'Health', 'Fee', 'AdoptionSpeed']
df_train = data[cols].copy()
df_train.head()

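|   | Type | Age | Breed1 | Breed2 | Gender | Health | Fee | AdoptionSpeed |
|---|------|-----|--------|--------|--------|--------|-----|---------------|
| 0 | 2 | 3 | 299 | 0 | 1 | 1 | 100 | 2 |
| 1 | 1 | 4 | 307 | 0 | 2 | 1 | 150 | 2 |
| 2 | 1 | 1 | 307 | 0 | 1 | 1 | 0 | 2 |
| 3 | 2 | 3 | 266 | 0 | 2 | 1 | 0 | 2 |
| 4 | 2 | 12 | 264 | 264 | 1 | 1 | 300 | 1 |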


In [322]:
# Encode the nominal columns as categoricals.
for col in ['Type', 'Breed1', 'Breed2', 'Gender', 'Health']:
    df_train[col] = pd.Categorical(df_train[col])

# Make sure the quantitative columns are numeric, as the exercise asks.
for col in ['Age', 'Fee', 'AdoptionSpeed']:
    df_train[col] = pd.to_numeric(df_train[col])
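
The same formatting step can also be sketched with a single `astype` call that takes a column-to-dtype mapping. The frame below is a hypothetical three-row sample (the values are made up, and `Age` is deliberately stored as text to show the conversion); it illustrates the idea and is not the notebook's own data.

```python
import pandas as pd

# Hypothetical sample shaped like a few PetFinder columns; values are illustrative only.
toy = pd.DataFrame({
    "Type": [2, 1, 1],
    "Age": ["3", "4", "1"],   # numeric data that arrived as strings
    "Gender": [1, 2, 1],
    "Fee": [100, 150, 0],
})

# One mapping describes the target dtype of every column.
toy = toy.astype({
    "Type": "category",
    "Gender": "category",
    "Age": "int64",
    "Fee": "int64",
})

print(toy.dtypes)
```

Checking `dtypes` right after the conversion is a cheap way to confirm that every column ended up in the intended group before any encoding is applied.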

### Exercise 2
1) Create a new feature that flags animals with two **Breeds**. (np.where)

2) Use one-hot encoding for the variables **Gender** and **Health** (pd.get_dummies) and remove the original variables.

3) Choose a criterion to quantize the variables **Age** and **Fee**.

4) Separate the **AdoptionSpeed** column into another dataframe. (X, y = df.iloc[:, :], df.AdoptionSpeed)

5) Use both datasets to fit a linear regression and compare the results. The higher the score (R², as returned by LinearRegression.score), the better the fit.

In [323]:
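# Flag animals with two breeds: both Breed1 and Breed2 must be non-zero (the code treats 0 as "no breed").
df_train['Two_breeds'] = np.where((df_train['Breed1'] == 0) | (df_train['Breed2'] == 0), 0, 1)
# One-hot encode Gender and Health; get_dummies drops the original columns.
# dtype=int keeps the 0/1 values shown in the output (recent pandas versions default to booleans).
df_train = pd.get_dummies(df_train, columns=['Gender'], prefix="Gender", dtype=int)
df_train = pd.get_dummies(df_train, columns=['Health'], prefix="Health", dtype=int)
df_train.head()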

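|   | Type | Age | Breed1 | Breed2 | Fee | AdoptionSpeed | Two_breeds | Gender_1 | Gender_2 | Gender_3 | Health_1 | Health_2 | Health_3 |
|---|------|-----|--------|--------|-----|---------------|------------|----------|----------|----------|----------|----------|----------|
| 0 | 2 | 3 | 299 | 0 | 100 | 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 4 | 307 | 0 | 150 | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 1 | 1 | 307 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 2 | 3 | 266 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 4 | 2 | 12 | 264 | 264 | 300 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |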
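
The data dictionary lists `0 = Not Specified` for **Health**, yet the output has no `Health_0` column, which suggests that value does not occur in the selected rows. If a fixed set of dummy columns is wanted regardless of which values appear, one option is to declare the categories explicitly. This is a minimal sketch on made-up values, not the notebook's method:

```python
import pandas as pd

# Illustrative Health codes; 0 ("Not Specified") never occurs in this sample.
health = pd.Series(
    pd.Categorical([1, 2, 1, 3], categories=[0, 1, 2, 3]),
    name="Health",
)

# get_dummies emits one column per declared category, observed or not.
dummies = pd.get_dummies(health, prefix="Health", dtype=int)
print(dummies.columns.tolist())
```

Declaring the categories up front keeps the column layout identical between a training set and any later dataset, even when a rare code is missing from one of them.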


In [324]:
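ages = df_train['Age']
# Alternative (not used): min-max scaling of Age to [0, 1].
# ages_scaled = (ages - ages.min()) / (ages.max() - ages.min())
# Quantize Age into quartiles; labels=False returns the bin index (0 to 3).
df_train['Age'] = pd.qcut(ages, 4, labels=False)
# One-hot encode the quartile index.
df_train = pd.get_dummies(df_train, columns=['Age'], prefix="Age", dtype=int)
df_train.head()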

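|   | Type | Breed1 | Breed2 | Fee | AdoptionSpeed | Two_breeds | Gender_1 | Gender_2 | Gender_3 | Health_1 | Health_2 | Health_3 | Age_0 | Age_1 | Age_2 | Age_3 |
|---|------|--------|--------|-----|---------------|------------|----------|----------|----------|----------|----------|----------|-------|-------|-------|-------|
| 0 | 2 | 299 | 0 | 100 | 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 307 | 0 | 150 | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1 | 307 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 2 | 266 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 2 | 264 | 264 | 300 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |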
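
To see which age ranges the quartile indices stand for, `pd.qcut` can also return the bin edges through `retbins=True`. The ages below are made-up values in months, used only to illustrate the call:

```python
import pandas as pd

# Illustrative ages in months (not taken from train.csv).
ages = pd.Series([1, 2, 3, 3, 4, 6, 12, 24, 36, 60, 96, 120])

# codes: bin index per row; edges: the 5 boundaries that delimit the 4 bins.
codes, edges = pd.qcut(ages, 4, labels=False, retbins=True)

print(edges)
print(codes.value_counts().sort_index())
```

Because the edges are quantiles, repeated values can make the bins hold slightly different numbers of rows, so it is worth inspecting the counts instead of assuming an even split.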


In [325]:
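fees = df_train['Fee']
# Min-max scale Fee to [0, 1] instead of quantizing it.
fees_scaled = (fees - fees.min()) / (fees.max() - fees.min())
# Alternative (not used): quartile binning, dropping duplicate bin edges.
# q_cuts = pd.qcut(fees, 4, labels=False, duplicates='drop')
df_train['Fee'] = fees_scaled
df_train.head()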

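|   | Type | Breed1 | Breed2 | Fee | AdoptionSpeed | Two_breeds | Gender_1 | Gender_2 | Gender_3 | Health_1 | Health_2 | Health_3 | Age_0 | Age_1 | Age_2 | Age_3 |
|---|------|--------|--------|-----|---------------|------------|----------|----------|----------|----------|----------|----------|-------|-------|-------|-------|
| 0 | 2 | 299 | 0 | 0.033333 | 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 307 | 0 | 0.05 | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1 | 307 | 0 | 0.0 | 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 2 | 266 | 0 | 0.0 | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 2 | 264 | 264 | 0.1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |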
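
Quantile binning is awkward for a column like **Fee** when most values are identical (for example, many free adoptions), because several quantiles collapse onto the same edge. The sketch below uses made-up fees to show the failure, what `duplicates='drop'` does about it, and an alternative with hand-picked edges. The thresholds 0 and 100 are hypothetical choices for illustration, not values from the notebook.

```python
import pandas as pd

# Illustrative fees: mostly free adoptions, a few paid ones.
fees = pd.Series([0, 0, 0, 0, 0, 0, 0, 50, 100, 300])

# Plain quartile binning fails because the first three edges are all 0.
try:
    pd.qcut(fees, 4, labels=False)
    qcut_failed = False
except ValueError:
    qcut_failed = True

# duplicates='drop' merges the repeated edges, leaving fewer than 4 bins.
dropped = pd.qcut(fees, 4, labels=False, duplicates='drop')

# Alternative: explicit edges (hypothetical): free, up to 100, above 100.
manual = pd.cut(fees, bins=[-1, 0, 100, float("inf")], labels=False)

print(qcut_failed, dropped.nunique(), manual.tolist())
```

With explicit edges the number of bins is under the analyst's control, at the cost of having to justify the thresholds.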


In [327]:
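# Alternative (not used): select the first columns by position.
# X, y = df_train.iloc[:, :6], df_train.AdoptionSpeed
# Features: every column except the target. Target: AdoptionSpeed.
X, y = df_train.drop(columns=['AdoptionSpeed']), df_train['AdoptionSpeed']
X.head()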

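|   | Type | Breed1 | Breed2 | Fee | Two_breeds | Gender_1 | Gender_2 | Gender_3 | Health_1 | Health_2 | Health_3 | Age_0 | Age_1 | Age_2 | Age_3 |
|---|------|--------|--------|-----|------------|----------|----------|----------|----------|----------|----------|-------|-------|-------|-------|
| 0 | 2 | 299 | 0 | 0.033333 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 307 | 0 | 0.05 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1 | 307 | 0 | 0.0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 2 | 266 | 0 | 0.0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 2 | 264 | 264 | 0.1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |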


https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

In [328]:
# Baseline: 0.038
# Fit on the full training set; score() returns R^2 computed on the same data.
modelo_lr = LinearRegression().fit(X, y)
modelo_lr.score(X, y)

0.10413978784314304
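
The comparison asked for in step 5 can be sketched as fitting the same model on two versions of the features and reading both scores side by side. The example below uses a synthetic `Age` column and a synthetic target with a deliberately non-linear relationship; nothing in it comes from the PetFinder data, and `training_r2` is a hypothetical helper. As in the notebook, the score is computed on the training data itself.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic ages in months and a synthetic target that is low for mid-range ages.
age = pd.Series(rng.integers(1, 121, size=500), name="Age")
middle_aged = (age >= 30) & (age < 90)
y = np.where(middle_aged, 1.0, 3.0) + rng.normal(0.0, 0.5, size=500)


def training_r2(X, y):
    """Fit a linear regression and return R^2 on the same data (hypothetical helper)."""
    return LinearRegression().fit(X, y).score(X, y)


# Version 1: the raw numeric column. Version 2: one-hot encoded quartiles.
X_raw = age.to_frame()
X_binned = pd.get_dummies(pd.qcut(age, 4, labels=False), prefix="Age", dtype=int)

score_raw = training_r2(X_raw, y)
score_binned = training_r2(X_binned, y)
print(f"raw Age:    R^2 = {score_raw:.3f}")
print(f"binned Age: R^2 = {score_binned:.3f}")
```

In this constructed case the binned version scores higher because a straight line cannot follow the non-monotonic pattern, while per-bin indicators can. On real data the gain depends on how well the chosen bins match the actual relationship.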