# Ghouls, Goblins, and Ghosts... Boo!
*Mon 31 Oct 2016 – Thu 1 Dec 2016*

https://www.kaggle.com/c/ghouls-goblins-and-ghosts-boo

Can you classify monsters haunting Kaggle?

Get out your dowsing rods, electromagnetic sensors, … and gradient boosting machines. Kaggle is haunted and we need your help. After a month of making scientific observations and taking careful measurements, we’ve determined that 900 ghouls, ghosts, and goblins are infesting our halls and frightening our data scientists. When trying garlic, asking politely, and using reverse psychology didn't work, it became clear that machine learning is the only answer to banishing our unwanted guests.

So now the hour has come to put the data we’ve collected in your hands. We’ve managed to identify 371 of the ghastly creatures, but need your help to vanquish the rest. And only an accurate classification algorithm can thwart them. Use bone length measurements, severity of rot, extent of soullessness, and other characteristics to distinguish (and extinguish) the intruders. Are you ghost-busters up for the challenge?

## A Haunted (random) Forest is, of course, the obvious solution to this spooky contest

In [1]:
# Basics
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

### The Data

In [2]:
train_df = pd.read_csv("G:/Kaggle_Boo/train.csv")
test_df = pd.read_csv("G:/Kaggle_Boo/test.csv")

In [4]:
train_df.head()

| | id | bone_length | rotting_flesh | hair_length | has_soul | color | type |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.354512 | 0.350839 | 0.465761 | 0.781142 | clear | Ghoul |
| 1 | 1 | 0.57556 | 0.425868 | 0.531401 | 0.439899 | green | Goblin |
| 2 | 2 | 0.467875 | 0.35433 | 0.811616 | 0.791225 | black | Ghoul |
| 3 | 4 | 0.776652 | 0.508723 | 0.636766 | 0.884464 | black | Ghoul |
| 4 | 5 | 0.566117 | 0.875862 | 0.418594 | 0.636438 | green | Ghost |


The target variable: type of creature (of the night)

In [5]:
train_outcome = train_df["type"]
train_df.drop(["type"], axis=1, inplace=True)

In [6]:
train_outcome.value_counts()

Ghoul     129
Goblin    125
Ghost     117
Name: type, dtype: int64
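
To put later accuracy figures in context, here is a minimal sketch that turns the counts printed above into class shares and a majority-class baseline. The names `counts`, `shares`, and `majority_baseline` are introduced here for illustration and are not part of the original notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({"Ghoul": 129, "Goblin": 125, "Ghost": 117}, name="type")

# Share of each class in the training set
shares = counts / counts.sum()

# Accuracy of a model that always predicts the most frequent class
majority_baseline = shares.max()

print(shares.round(3))
print("Majority-class baseline: {:.1%}".format(majority_baseline))
```

The three classes are close to balanced, so plain accuracy is a reasonable first metric here.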

Encoding the categorical data: the color of the apparition, using the complete set (train and test) so that both end up with the same columns

In [7]:
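# Stack the id/color pairs from both files so every color gets a column in both sets
conjunto = pd.concat([train_df[["id", "color"]], test_df[["id", "color"]]])
conjunto_encoded = pd.get_dummies(conjunto, columns=["color"])
# Attach the one-hot columns by id; note that train_file and test_file are not used again below
train_file = train_df.merge(conjunto_encoded, on="id", how="left")
test_file = test_df.merge(conjunto_encoded, on="id", how="left")
# Drop the text column; the forest below is fit on train_df, i.e. the numeric features only
train_df.drop(["color"], axis=1, inplace=True)
test_df.drop(["color"], axis=1, inplace=True)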
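
The encoding step above can be sketched on toy data as follows. The frames `toy_train` and `toy_test` and their values are made up for illustration, and the sketch assumes, as the notebook's merge does, that ids are unique across the two files.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (values are invented for illustration)
toy_train = pd.DataFrame({
    "id": [0, 1, 2],
    "hair_length": [0.47, 0.53, 0.81],
    "color": ["clear", "green", "black"],
})
toy_test = pd.DataFrame({
    "id": [3, 6],
    "hair_length": [0.40, 0.62],
    "color": ["white", "green"],
})

# Encode on the combined set so a color seen in only one file
# still gets a column in both
combined = pd.concat([toy_train[["id", "color"]], toy_test[["id", "color"]]])
encoded = pd.get_dummies(combined, columns=["color"])

# Merge the dummy columns back by id, then drop the original text column
toy_train_enc = toy_train.merge(encoded, on="id", how="left").drop(columns=["color"])
toy_test_enc = toy_test.merge(encoded, on="id", how="left").drop(columns=["color"])

print(list(toy_train_enc.columns))
print(list(toy_test_enc.columns))
```

Here "white" appears only in the toy test set, yet the toy training frame still gets a `color_white` column, so a model fit on one frame can score the other without a column mismatch.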

Separating the IDs (we don't really need the IDs of the training set)

In [8]:
train_id = train_df[["id"]]
test_id = test_df[["id"]]
train_df.drop(["id"], axis=1, inplace=True)
test_df.drop(["id"], axis=1, inplace=True)

### The Model

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

First, let's train a small(ish) haunted forest to get some metrics 

In [10]:
# Hold out 20% for validation (no random_state is set, so the split and the metrics vary between runs)
X_train, X_val, y_train, y_val = train_test_split(train_df, train_outcome, test_size=0.2)
forest = RandomForestClassifier(n_estimators=200, n_jobs=4)
forest.fit(X_train, y_train)
y_pred_val = forest.predict(X_val)

In [11]:
from sklearn.metrics import classification_report, accuracy_score
print(classification_report(y_val, y_pred_val))
print("Haunting accuracy: {:.1%}".format(accuracy_score(y_val, y_pred_val)))

             precision    recall  f1-score   support

      Ghost       0.92      0.76      0.83        29
      Ghoul       0.55      0.71      0.62        17
     Goblin       0.59      0.59      0.59        29

avg / total       0.70      0.68      0.69        75

Haunting accuracy: 68.0%
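
A single hold-out of 75 creatures gives a fairly noisy estimate. As a hedged alternative, the sketch below scores the same kind of forest with stratified 5-fold cross-validation. It runs on synthetic data from `make_classification` as a stand-in for the competition features, so its numbers say nothing about the real data; `X`, `y`, `cv`, `forest_cv`, and `scores` are names introduced here, not taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with the same shape as the training data:
# 371 rows, 4 numeric features, 3 classes
X, y = make_classification(
    n_samples=371,
    n_features=4,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

# Stratified folds keep the class proportions similar in every fold;
# fixed seeds make the run repeatable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
forest_cv = RandomForestClassifier(n_estimators=200, random_state=0)

# One accuracy value per fold
scores = cross_val_score(forest_cv, X, y, cv=cv, scoring="accuracy")
print("Mean accuracy: {:.1%} (std {:.1%})".format(scores.mean(), scores.std()))
```

With the real data, `X` and `y` would be replaced by `train_df` and `train_outcome`.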


Now, we'll train the Haunted Random Forest with the complete training set...

In [None]:
forest = RandomForestClassifier(n_estimators=500, n_jobs=4)
forest.fit(train_df, train_outcome)
y_pred = forest.predict(test_df)

... and we'll record the predicted type of each creature haunting the forest

In [None]:
results = pd.read_csv("../input/sample_submission.csv")
results["type"] = y_pred

Finally, we'll save the results to haunt the leaderboard

In [None]:
results.to_csv("submission.csv", index=False)
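
The cells above fill `sample_submission.csv`, which relies on its rows being in the same order as `test.csv`. A possible alternative, sketched here on toy values, pairs the predictions directly with the ids set aside earlier. `toy_test_id` and `toy_pred` are made-up stand-ins for `test_id` and `y_pred`.

```python
import pandas as pd

# Stand-ins for test_id and for the output of forest.predict(test_df)
toy_test_id = pd.DataFrame({"id": [3, 6, 9]})
toy_pred = ["Ghoul", "Goblin", "Ghost"]

# Predictions come back in the same row order as the features they were
# computed from, so they can be attached to the matching ids directly
toy_results = toy_test_id.copy()
toy_results["type"] = toy_pred

# Render the CSV as text here; passing a file name would write it to disk instead
csv_text = toy_results.to_csv(index=False)
print(csv_text)
```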

### *0.71267 score on the leaderboard! Booo!*