# Brief overview of this study
The data comes from the Austin Animal Center, a shelter, and spans from October 1, 2013, to March 2016.

### Objective

The task is to predict the fate of each animal based on available information. It's essentially a classification task. The classes are: Adoption, Died, Euthanasia, Return to owner, Transfer. 

We consider all classes equally important regardless of their representation in the dataset. Therefore, the prediction quality is assessed using the macro-averaged F1 score.
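
To make the metric concrete, here is a minimal pure-Python sketch of macro-averaged F1. The helper name `macro_f1` and the toy labels are assumptions made for illustration; the notebook itself relies on scikit-learn's `f1_score` with `average='macro'`.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[t] += 1  # true class gets a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (hypothetical): the rare class 4 is always missed.
y_true = [0, 0, 0, 0, 1, 1, 1, 4]
y_pred = [0, 0, 0, 0, 1, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}, macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this toy case accuracy stays high while macro F1 drops sharply, because the missed rare class contributes a full zero to the average. That is the behaviour we want when all classes count equally.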

---

**Assignment**

Using the exact scheme proposed in this template is optional, but within this notebook, you should develop:

- Clear and clean,
- Well-commented,
- Reproducible (fix all possible random seeds),
- Motivated

**code** that **generates your best solution**. See competition rules for further information.


### Methods

`TODO: Describe your feature preprocessing techniques`

`TODO: List the models and parameters you have tried`


### Results

`TODO: Share observations, success stories, and futile efforts; what interesting things can you say about the dataset? what conclusions can you draw?`

---


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import random
import os

from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

# Set fixed random seeds for reproducibility
SEED = 42
np.random.seed(SEED)
random.seed(SEED)

## Configurations and Constants
(avoid magic numbers in your code)

In [2]:
OUTCOME2LABEL = {
    "Adoption": 0,
    "Transfer": 1,
    "Return_to_owner": 2,
    "Euthanasia": 3,
    "Died": 4,
}
LABEL2OUTCOME = {v: k for k, v in OUTCOME2LABEL.items()}

## Libraries
(all imports should ideally be placed here)

In [3]:
import sklearn
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score, make_scorer

Let's download and examine the data.

#### Questions to Consider:
- What kind of data transformations might we need?
- What are the potential pitfalls in the data preprocessing stage?

In [4]:
# Load the data
df_train = pd.read_csv("/kaggle/input/animal-shelter-log/train.csv", encoding="utf-8")
df_test = pd.read_csv("/kaggle/input/animal-shelter-log/test.csv", encoding="utf-8")
df_train.head(5)

Unnamed: 0,Name,SexuponOutcome,AnimalType,AgeuponOutcome,Breed,Color,DateTime,Outcome,ID
0,Socks,Neutered Male,Cat,2 months,Domestic Shorthair Mix,Black/White,2014-06-11 14:36:00,0,0
1,Vera,Intact Female,Cat,1 month,Domestic Shorthair Mix,Tortie/White,2014-07-18 08:10:00,3,1
2,Biscuit,Neutered Male,Dog,3 months,Chihuahua Shorthair Mix,Yellow,2016-01-02 17:28:00,2,2
3,Kitten,Spayed Female,Cat,2 years,Domestic Shorthair Mix,Calico,2014-02-19 17:27:00,0,3
4,,Neutered Male,Cat,2 months,Domestic Shorthair Mix,Orange Tabby,2014-07-21 17:34:00,0,4


In general, the name carries no information useful for predicting the target variable.

In [5]:
df_train.drop('Name', axis=1, inplace=True)

In [6]:
df_test.drop('Name', axis=1, inplace=True)

In [7]:
df_test.head(5)

Unnamed: 0,SexuponOutcome,AnimalType,AgeuponOutcome,Breed,Color,DateTime,ID
0,Intact Female,Cat,3 weeks,Domestic Shorthair Mix,Torbie,2015-08-21 15:11:00,0
1,Spayed Female,Cat,3 months,Domestic Shorthair Mix,Blue Tabby,2014-08-12 15:27:00,1
2,Neutered Male,Cat,2 months,Domestic Shorthair Mix,Black,2014-12-21 19:09:00,2
3,Neutered Male,Cat,4 months,Domestic Shorthair Mix,Black,2014-11-14 13:42:00,3
4,Spayed Female,Cat,2 months,Domestic Shorthair Mix,Calico,2014-06-03 15:17:00,4


### Exploratory Data Analysis

In [8]:
# TODO: explore the data, plot graphs, seek valuable insights, ...

## Feature Preparation

#### Dates

Convert date columns into a numerical format. What format is most suitable and why?

In [9]:
def pandas_dates2number(date_series: pd.Series):
    return pd.to_datetime(date_series).values.astype(np.int64) // 10**9


pandas_dates2number(pd.Series(["2020-12-10"]))

array([1607558400])
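
Unix seconds preserve ordering, but they hide periodic structure such as time of day or day of week. As a hedged alternative, the sketch below splits a timestamp into calendar parts using only the standard library. The function name `date_features` and the choice of features are assumptions; whether they actually help on this dataset would have to be checked on validation data.

```python
from datetime import datetime, timezone


def date_features(timestamp: str) -> dict:
    """Split a 'YYYY-MM-DD HH:MM:SS' string into Unix seconds and calendar parts."""
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return {
        "unix_seconds": int(dt.timestamp()),
        "hour": dt.hour,
        "weekday": dt.weekday(),  # Monday == 0
        "month": dt.month,
    }


print(date_features("2014-06-11 14:36:00"))
```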

#### Other Features

Based on your EDA, preprocess other features.

To avoid performing every transformation twice, I decided to combine train and test into a single dataset.

In [10]:
df = pd.concat([df_train.drop('Outcome', axis=1), df_test], keys=['train', 'test'])

In [11]:
df

Unnamed: 0,Unnamed: 1,SexuponOutcome,AnimalType,AgeuponOutcome,Breed,Color,DateTime,ID
train,0,Neutered Male,Cat,2 months,Domestic Shorthair Mix,Black/White,2014-06-11 14:36:00,0
train,1,Intact Female,Cat,1 month,Domestic Shorthair Mix,Tortie/White,2014-07-18 08:10:00,1
train,2,Neutered Male,Dog,3 months,Chihuahua Shorthair Mix,Yellow,2016-01-02 17:28:00,2
train,3,Spayed Female,Cat,2 years,Domestic Shorthair Mix,Calico,2014-02-19 17:27:00,3
train,4,Neutered Male,Cat,2 months,Domestic Shorthair Mix,Orange Tabby,2014-07-21 17:34:00,4
...,...,...,...,...,...,...,...,...
test,8014,Spayed Female,Dog,8 months,Pit Bull/Queensland Heeler,Brown Brindle/White,2015-04-11 17:30:00,8014
test,8015,Intact Female,Dog,9 years,Chihuahua Shorthair Mix,Tan,2015-10-12 14:16:00,8015
test,8016,Neutered Male,Dog,3 years,Pit Bull Mix,Yellow Brindle/Blue,2014-12-17 16:25:00,8016
test,8017,Intact Male,Cat,3 weeks,Domestic Shorthair Mix,Brown Tabby,2014-09-10 18:48:00,8017


Check the columns for null values.

In [12]:
df.isna().any()

SexuponOutcome     True
AnimalType        False
AgeuponOutcome     True
Breed             False
Color             False
DateTime          False
ID                False
dtype: bool

Fill the null values with the mode, since these columns are essentially categorical.

In [13]:
df['SexuponOutcome'] = df['SexuponOutcome'].fillna(df['SexuponOutcome'].mode()[0])

In [14]:
df['AgeuponOutcome'] = df['AgeuponOutcome'].fillna(df['AgeuponOutcome'].mode()[0])
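
The cells above take the mode over the combined train and test rows. A stricter variant, sketched here on toy frames, computes the fill value from the training rows only and reuses it for test. The frames `train_part` and `test_part` are hypothetical stand-ins, not the notebook's data.

```python
import pandas as pd

# Hypothetical toy frames standing in for the train and test parts.
train_part = pd.DataFrame(
    {"SexuponOutcome": ["Neutered Male", None, "Neutered Male", "Spayed Female"]}
)
test_part = pd.DataFrame({"SexuponOutcome": [None, "Intact Male"]})

# Learn the most frequent value from the training rows only ...
fill_value = train_part["SexuponOutcome"].mode()[0]

# ... and apply the same value to both parts.
train_filled = train_part["SexuponOutcome"].fillna(fill_value)
test_filled = test_part["SexuponOutcome"].fillna(fill_value)

print(fill_value)
print(test_filled.tolist())
```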

In [15]:
df.isna().any()

SexuponOutcome    False
AnimalType        False
AgeuponOutcome    False
Breed             False
Color             False
DateTime          False
ID                False
dtype: bool

Convert 'AgeuponOutcome' to a numeric format (number of days).

In [16]:
df['AgeuponOutcome'] = df['AgeuponOutcome'].map(lambda x: 
                              int(str(x).split()[0]) if 'day' in str(x).split()[1] else 
                             int(str(x).split()[0])*7 if 'week' in str(x).split()[1] else
                             int(str(x).split()[0])*30 if 'month' in str(x).split()[1] else
                             int(str(x).split()[0])*365)
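
The same conversion can be sketched as a small named helper, which is easier to test than a nested conditional expression. The names `DAYS_PER_UNIT` and `age_to_days` are assumptions. Unlike the lambda above, which treats every unrecognised unit as years, this version raises a `KeyError` on an unexpected unit.

```python
# Approximate day counts per unit, matching the factors used in the lambda above.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}


def age_to_days(age: str) -> int:
    """Convert strings such as '2 months' or '1 year' to an approximate day count."""
    count, unit = age.split()
    unit = unit.rstrip("s")  # 'months' -> 'month', 'years' -> 'year'
    return int(count) * DAYS_PER_UNIT[unit]


for example in ["3 weeks", "1 month", "2 months", "2 years", "9 years"]:
    print(example, "->", age_to_days(example))
```

With pandas this would be applied as `df['AgeuponOutcome'].map(age_to_days)`.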

In [17]:
df

Unnamed: 0,Unnamed: 1,SexuponOutcome,AnimalType,AgeuponOutcome,Breed,Color,DateTime,ID
train,0,Neutered Male,Cat,60,Domestic Shorthair Mix,Black/White,2014-06-11 14:36:00,0
train,1,Intact Female,Cat,30,Domestic Shorthair Mix,Tortie/White,2014-07-18 08:10:00,1
train,2,Neutered Male,Dog,90,Chihuahua Shorthair Mix,Yellow,2016-01-02 17:28:00,2
train,3,Spayed Female,Cat,730,Domestic Shorthair Mix,Calico,2014-02-19 17:27:00,3
train,4,Neutered Male,Cat,60,Domestic Shorthair Mix,Orange Tabby,2014-07-21 17:34:00,4
...,...,...,...,...,...,...,...,...
test,8014,Spayed Female,Dog,240,Pit Bull/Queensland Heeler,Brown Brindle/White,2015-04-11 17:30:00,8014
test,8015,Intact Female,Dog,3285,Chihuahua Shorthair Mix,Tan,2015-10-12 14:16:00,8015
test,8016,Neutered Male,Dog,1095,Pit Bull Mix,Yellow Brindle/Blue,2014-12-17 16:25:00,8016
test,8017,Intact Male,Cat,21,Domestic Shorthair Mix,Brown Tabby,2014-09-10 18:48:00,8017


Since 'Breed' and 'Color' contain a very large number of distinct values, it is better to use LabelEncoder than OneHotEncoder.

In [18]:
categorical_features = ['Breed', 'Color']
for feature in categorical_features:
    encoder = LabelEncoder()
    df[feature] = encoder.fit_transform(df[feature])
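
For intuition, the sketch below mimics in plain Python what label encoding does: each distinct value gets an integer code in sorted order. The helper `fit_label_mapping` is hypothetical and is not scikit-learn's implementation. The resulting integer order is arbitrary, and because the encoder above is fitted on train and test together, no test value is unseen.

```python
def fit_label_mapping(values):
    """Map each distinct value to an integer code, in sorted order of the values."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


colors = ["Black/White", "Tortie/White", "Yellow", "Calico", "Black/White"]
mapping = fit_label_mapping(colors)
encoded = [mapping[color] for color in colors]

print(mapping)
print(encoded)
```

This produces one column regardless of cardinality, whereas one-hot encoding would add one column per distinct value.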

For the 'AnimalType' and 'SexuponOutcome' columns it makes sense to use get_dummies, since the values have no clear hierarchy and there are only a few of them.

In [19]:
dummies = pd.get_dummies(df[["AnimalType", "SexuponOutcome"]])
df[list(dummies.columns)] = dummies

In [20]:
df.drop(['AnimalType', 'SexuponOutcome'], axis=1, inplace=True)
df["DateTime"] = pandas_dates2number(df["DateTime"])

Split the dataset back into train and test.

In [21]:
train = df.loc[['train']].copy()  # copy so that later column assignments do not touch df

In [22]:
test = df.loc[['test']].copy()  # copy so that later column assignments do not touch df

Apply StandardScaler to the data.

In [23]:
scaler = StandardScaler()
train[['AgeuponOutcome', 'DateTime', 'Breed', 'Color']] = scaler.fit_transform(train[['AgeuponOutcome', 'DateTime', 'Breed', 'Color']])
test[['AgeuponOutcome', 'DateTime', 'Breed', 'Color']] = scaler.transform(test[['AgeuponOutcome', 'DateTime', 'Breed', 'Color']])
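
The arithmetic behind this step can be sketched in plain Python: the mean and the population standard deviation are estimated on the training values only, then reused unchanged for the test values. The helper names and the toy ages are assumptions for illustration.

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Return the mean and population standard deviation of the training values."""
    return fmean(values), pstdev(values)


def transform(values, mean, std):
    """Standardise values with statistics learned elsewhere."""
    return [(v - mean) / std for v in values]


train_age = [60, 30, 90, 730, 60]  # toy ages in days
test_age = [21, 90]

mean, std = fit_scaler(train_age)
train_scaled = transform(train_age, mean, std)
test_scaled = transform(test_age, mean, std)

print(train_scaled)
print(test_scaled)
```

Because the test values reuse the training statistics, an age of 90 days maps to the same scaled value in both parts.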

In [24]:
train.head()

Unnamed: 0,Unnamed: 1,AgeuponOutcome,Breed,Color,DateTime,ID,AnimalType_Cat,AnimalType_Dog,SexuponOutcome_Intact Female,SexuponOutcome_Intact Male,SexuponOutcome_Neutered Male,SexuponOutcome_Spayed Female,SexuponOutcome_Unknown
train,0,-0.675006,-0.182735,-1.047185,-0.772149,0,True,False,False,False,True,False,False
train,1,-0.702585,-0.182735,1.316515,-0.623791,1,True,False,True,False,False,False,False
train,2,-0.647426,-0.830486,1.775143,1.530525,2,False,True,False,False,True,False,False
train,3,-0.059061,-0.182735,-0.156388,-1.22403,3,True,False,False,False,False,True,False
train,4,-0.675006,-0.182735,0.522735,-0.610093,4,True,False,False,False,True,False,False


In [25]:
from sklearn.model_selection import train_test_split

Split the training set into training and validation parts, stratifying by the target variable.

In [26]:
X_train, X_val, y_train, y_val = train_test_split(
    train.drop(["ID"], axis=1),
    df_train['Outcome'],
    test_size=0.25,
    random_state=SEED,
    stratify=df_train['Outcome'],
)

Combine everything, ensure the same preprocessing for train and test.

*HINT: use sklearn Pipeline of OneHotEncoder, ...* 
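
One possible shape for such a pipeline is sketched below on a toy frame. The column `AgeDays`, the toy rows, and the small forest are assumptions for illustration and are not the solution used in this notebook. The encoder and scaler are fitted only on the data passed to `fit`, so the same transformation is applied at prediction time.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data; AgeDays is a hypothetical name for the age already converted to days.
X_toy = pd.DataFrame({
    "AnimalType": ["Cat", "Dog", "Cat", "Dog", "Cat", "Dog"],
    "SexuponOutcome": ["Neutered Male", "Intact Female", "Spayed Female",
                       "Neutered Male", "Intact Male", "Spayed Female"],
    "AgeDays": [60, 30, 730, 90, 21, 1095],
})
y_toy = [0, 1, 0, 2, 1, 0]

preprocess = ColumnTransformer([
    # Categories unseen during fit are encoded as all zeros instead of failing.
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["AnimalType", "SexuponOutcome"]),
    ("scale", StandardScaler(), ["AgeDays"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=10, random_state=42)),
])
model.fit(X_toy, y_toy)

# A new row with a category that did not appear in the toy training data.
X_new = pd.DataFrame({
    "AnimalType": ["Cat"],
    "SexuponOutcome": ["Unknown"],
    "AgeDays": [240],
})
pred = model.predict(X_new)
print(pred)
```

A pipeline like this can be passed directly to `GridSearchCV`, in which case the preprocessing is refitted inside each fold.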

# Model

**TODO:**
* train-val split
* cross-validation
* advanced models and ensembling
* hyperparameter tuning

Let's look at the class ratios (inverse class frequencies).

In [27]:
1/(df_train.Outcome.value_counts()/df_train.shape[0])

Outcome
0      2.482091
1      2.836998
2      5.585075
3     17.180900
4    135.579710
Name: count, dtype: float64
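
Rather than copying these numbers by hand, the weights can be derived from the labels. The sketch below does so with the standard library on toy labels; the helper name `inverse_frequency_weights` is an assumption. For comparison, scikit-learn's `class_weight="balanced"` option uses `n_samples / (n_classes * count)`, which is proportional to the weights computed here.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by total sample count divided by its own count."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / count for label, count in sorted(counts.items())}


# Toy label list (hypothetical): four classes with decreasing frequency.
toy_outcomes = [0] * 4 + [1] * 3 + [2] * 2 + [3]
weights = inverse_frequency_weights(toy_outcomes)
print(weights)
```

The resulting dictionary has the form expected by the `class_weight` parameter of `RandomForestClassifier`.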

Apply GridSearchCV with a random forest (some values in param_grid are not searched over, because earlier runs showed that they always came out as optimal).

In [28]:
param_grid = {
    "n_estimators": [190],
    "min_samples_leaf": [7],
    "max_samples": [0.3],
    # Inverse class frequencies taken from the output above
    "class_weight": [{
        0: 2.482091,
        1: 2.836998,
        2: 5.585075,
        3: 17.180900,
        4: 135.579710,
    }],
    "min_samples_split": [2],
    "max_depth": [None, 3, 15, 20, 25],
}

To evaluate the models in GridSearchCV, we specify the macro-averaged F1 metric.

In [29]:
f1 = make_scorer(f1_score, average='macro')

clf = GridSearchCV(RandomForestClassifier(n_jobs=-1, random_state=SEED),
                   param_grid, scoring=f1, verbose=1, cv=5)

clf.fit(X_train, y_train)

print("Best params on dev set:")
print(clf.best_params_)
    
print("Scores on development set:")
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']

for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    
# GridSearchCV refits the best estimator on all of X_train by default (refit=True),
# so no additional fit call is needed here.
best_model = clf.best_estimator_

y_true, y_pred = y_val, best_model.predict(X_val)
    
print(classification_report(y_true, y_pred))
print()

Fitting 5 folds for each of 5 candidates, totalling 25 fits
Best params on dev set:
{'class_weight': {0: 2.482091, 1: 2.836998, 2: 5.585075, 3: 17.1809, 4: 135.57971}, 'max_depth': None, 'max_samples': 0.3, 'min_samples_leaf': 7, 'min_samples_split': 2, 'n_estimators': 190}
Scores on development set:
0.439 (+/-0.014) for {'class_weight': {0: 2.482091, 1: 2.836998, 2: 5.585075, 3: 17.1809, 4: 135.57971}, 'max_depth': None, 'max_samples': 0.3, 'min_samples_leaf': 7, 'min_samples_split': 2, 'n_estimators': 190}
0.330 (+/-0.018) for {'class_weight': {0: 2.482091, 1: 2.836998, 2: 5.585075, 3: 17.1809, 4: 135.57971}, 'max_depth': 3, 'max_samples': 0.3, 'min_samples_leaf': 7, 'min_samples_split': 2, 'n_estimators': 190}
0.436 (+/-0.014) for {'class_weight': {0: 2.482091, 1: 2.836998, 2: 5.585075, 3: 17.1809, 4: 135.57971}, 'max_depth': 15, 'max_samples': 0.3, 'min_samples_leaf': 7, 'min_samples_split': 2, 'n_estimators': 190}
0.438 (+/-0.013) for {'class_weight': {0: 2.482091, 1: 2.836998, 2:

Finally, save the test predictions of the best model to CSV.

In [30]:
clf.best_params_

{'class_weight': {0: 2.482091,
  1: 2.836998,
  2: 5.585075,
  3: 17.1809,
  4: 135.57971},
 'max_depth': None,
 'max_samples': 0.3,
 'min_samples_leaf': 7,
 'min_samples_split': 2,
 'n_estimators': 190}

In [31]:
preds = best_model.predict(test.drop('ID', axis=1))

In [32]:
# Create a submission from the best model's predictions
submission = pd.DataFrame({"ID": test["ID"], "Outcome": preds})

# Save the submission
submission.to_csv("submission_4.csv", index=False)

### Place for the feedback or meme

...