
## Notebook Configuration

In [1]:
# Mount Google Drive to obtain the data
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Import libraries
import pandas as pd
import numpy as np

In [3]:
# Read data
df = pd.read_csv('/content/drive/MyDrive/ML - Project/data/pets_outliers_removed.csv')

## Helper Functions

In [4]:
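def apply_get_dummies_for_k_most_frequent_categories(df, column, k):
  """One-hot encode the k most frequent values of `column`; rows with any other value get all zeros."""
  k_most_frequent_categories = df[column].value_counts().index[:k]
  dummies = pd.get_dummies(pd.Categorical(df[column], categories=k_most_frequent_categories), prefix=column)
  dummies.index = df.index  # align with df so concat cannot misalign rows on a non-default index
  df = pd.concat([df.drop(columns=[column]), dummies], axis=1)
  return df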
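
To make the helper's behavior concrete, here is a minimal sketch on a made-up column. The toy DataFrame and the name `top_k_dummies` are illustrative assumptions; the function only restates the helper above in compact form so that the block runs on its own.

```python
import pandas as pd

def top_k_dummies(df, column, k):
    # Compact restatement of the helper above (hypothetical name, for illustration only)
    top_k = df[column].value_counts().index[:k]
    dummies = pd.get_dummies(pd.Categorical(df[column], categories=top_k), prefix=column)
    dummies.index = df.index
    return pd.concat([df.drop(columns=[column]), dummies], axis=1)

# Toy data (made up): 'a' appears three times, 'b' twice, 'c' once
toy = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6],
                    'letter': ['a', 'b', 'a', 'c', 'b', 'a']})
encoded = top_k_dummies(toy, 'letter', 2)

# Only the two most frequent values get a column
print(list(encoded.columns))  # ['id', 'letter_a', 'letter_b']

# The row holding 'c' has no matching column, so it sums to zero
print(encoded[['letter_a', 'letter_b']].astype(int).sum(axis=1).tolist())  # [1, 1, 1, 0, 1, 1]
```

Values outside the top k are not grouped into an "other" column; they simply produce an all-zero row.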

## Feature Engineering

Observations from previous analyses:
- [X] PetID should be dropped, as it is a unique identifier
- [X] The label identifier columns can be dropped (e.g. Breed1, Breed2, Color1, Color2)
- [X] Breed1Type and Breed2Type can be dropped, keeping only Type, since they carry the same information
- [X] Color3Name can be dropped, since 70% of its values are null
- [X] Name has too many distinct values, so we can derive a feature indicating whether the pet has a name
- [X] Additionally, we can use the X% most common names as features to see whether they improve the model
- [X] Instead of breed names, we can use features such as "Mixed Breed", "Domestic" and "Pure Breed"
- [X] We can use the X% most common rescuer IDs as features to see whether the rescuer influences adoption speed
- [X] StateName has too many distinct values, so we can use the X% most frequent ones as features
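
Observations like these typically come from two simple diagnostics: the share of missing values per column and the number of distinct values per column. The sketch below shows how such checks might look. The frame and its values are invented for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up frame that mimics three of the columns discussed above
toy = pd.DataFrame({
    'PetID': ['p1', 'p2', 'p3', 'p4'],
    'Color3Name': [np.nan, 'White', np.nan, np.nan],
    'Name': ['Milo', np.nan, 'Luna', 'Milo'],
})

# Fraction of missing values per column
null_share = toy.isna().mean()
print(null_share['Color3Name'])  # 0.75

# Number of distinct non-null values per column
distinct_values = toy.nunique()

# A column with one distinct value per row behaves like a unique identifier
print(distinct_values['PetID'] == len(toy))  # True
```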

In [5]:
# Drop columns
df = df.drop(columns=['PetID','Breed1Type','Breed2Type','Breed1','Breed2','Color1','Color2','Color3','Color3Name','State'])

In [6]:
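# Handle names
df['Name'] = df['Name'].str.lower()

# Substrings that indicate a placeholder rather than a real name (e.g. "puppy", "no name")
unnamed_list_regex = 'pup|kitt|dog|cat|name|unknown'

# Named = 0 for placeholder or missing names (na=True), 1 otherwise
df["Named"] = np.where(df['Name'].str.contains(unnamed_list_regex, regex=True, na=True), 0, 1)

# Blank out placeholder names so they are not one-hot encoded
df.loc[df['Named'] == 0, "Name"] = np.nan

# One-hot encode the 5 most frequent remaining names
df = apply_get_dummies_for_k_most_frequent_categories(df, 'Name', 5)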
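
As a rough illustration of how the `Named` flag behaves, the sketch below applies the same list of substrings to a handful of made-up names. The names are assumptions for demonstration only, and the pattern is written without capture groups.

```python
import numpy as np
import pandas as pd

# Made-up names for illustration, lowercased as in the cell above
names = pd.Series(['Milo', 'Puppy 3', 'No Name', np.nan, 'Catherine']).str.lower()

placeholder_regex = 'pup|kitt|dog|cat|name|unknown'

# na=True treats a missing name like a placeholder
is_placeholder = names.str.contains(placeholder_regex, regex=True, na=True)
named = np.where(is_placeholder, 0, 1)

print(named.tolist())  # [1, 0, 0, 0, 0]
```

Because the match is on substrings, a genuine name such as "catherine" is also flagged as unnamed since it contains "cat". Whether that trade-off is acceptable depends on how often such names occur in the data.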

In [7]:
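# Handle breeds
# Mixed_Breed: the primary breed is recorded as 'Mixed Breed'
df["Mixed_Breed"] = np.where(df['Breed1Name'] == 'Mixed Breed', 1, 0)
# Domestic: the primary breed name contains 'Domestic' (na=False keeps missing breeds at 0)
df["Domestic"] = np.where(df['Breed1Name'].str.contains('Domestic', na=False), 1, 0)
# Pure_Breed: the primary breed is not 'Mixed Breed' and there is no secondary breed
df["Pure_Breed"] = np.where((df["Breed1Name"]!='Mixed Breed') & (df['Breed2Name'].isna()), 1, 0)
df = df.drop(columns=['Breed1Name','Breed2Name'])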
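
The three breed flags can be checked on a small made-up frame. The breed values below are illustrative assumptions, and `na=False` is added here so that a missing primary breed would not count as domestic.

```python
import numpy as np
import pandas as pd

# Made-up breed combinations for illustration
toy = pd.DataFrame({
    'Breed1Name': ['Mixed Breed', 'Domestic Short Hair', 'Poodle', 'Poodle'],
    'Breed2Name': [np.nan, np.nan, np.nan, 'Terrier'],
})

toy['Mixed_Breed'] = np.where(toy['Breed1Name'] == 'Mixed Breed', 1, 0)
toy['Domestic'] = np.where(toy['Breed1Name'].str.contains('Domestic', na=False), 1, 0)
toy['Pure_Breed'] = np.where((toy['Breed1Name'] != 'Mixed Breed') & (toy['Breed2Name'].isna()), 1, 0)

# One row per pet, in the order Mixed_Breed, Domestic, Pure_Breed
print(toy[['Mixed_Breed', 'Domestic', 'Pure_Breed']].values.tolist())
# [[1, 0, 0], [0, 1, 1], [0, 0, 1], [0, 0, 0]]
```

The flags are not mutually exclusive: a domestic breed with no secondary breed is marked both `Domestic` and `Pure_Breed`, while a pet with two named breeds gets none of the three.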

In [8]:
# RescuerID
df = apply_get_dummies_for_k_most_frequent_categories(df, 'RescuerID', 5)

In [9]:
# StateName
# df['StateName'].value_counts()
df = apply_get_dummies_for_k_most_frequent_categories(df, 'StateName', 2)

## Writing results to a CSV file

In [10]:
df.to_csv('/content/drive/MyDrive/ML - Project/data/pets_feature_engineering.csv', index=False)