[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lyeskhalil/mlbootcamp2022/blob/main/AgeDataset.ipynb)

In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

First, let's load the dataset and take a look at what it contains in raw form:

In [None]:
from pathlib import Path
from os import system

# Download the zipped dataset only if it is not already present
if not Path('AgeDataset-V1.csv.zip').exists():
    system('wget --no-check-certificate --content-disposition https://github.com/lyeskhalil/mlbootcamp2022/raw/main/AgeDataset-V1.csv.zip')

# Extract the CSV only if it has not been extracted yet
if not Path('AgeDataset-V1.csv').exists():
    system('unzip AgeDataset-V1.csv.zip')

In [None]:
df = pd.read_csv('AgeDataset-V1.csv')

In [None]:
df.head()

Our task is to predict how long each of these people lived, based on the other features provided: occupation, birth year, country, manner of death, and so on.

Some rows have empty entries (represented by NaN, or "Not a Number"). These rows are less useful because they are missing data. I am going to drop them, but you can try imputing the missing values to see if it improves performance.

In [None]:
df = df.dropna()
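If you want to try imputation instead of dropping rows, here is a minimal sketch on a made-up toy frame (the values, the `"Unknown"` placeholder, and the choice of median are all assumptions for illustration, not something the dataset prescribes). Keep in mind that imputing instead of dropping changes which rows exist, so the train/test split further down would no longer contain the same rows.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps similar in kind to the real data (values are made up)
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Birth year": [1900.0, 1850.0, np.nan, 1920.0],
    "Age of death": [70.0, 64.0, 81.0, 55.0],
})

imputed = toy.copy()
# Categorical gap: fill with an explicit placeholder category
imputed["Gender"] = imputed["Gender"].fillna("Unknown")
# Numeric gap: fill with the median of the observed values
imputed["Birth year"] = imputed["Birth year"].fillna(imputed["Birth year"].median())

print(imputed)
```

In a real experiment you would probably compute fill values such as the median from the training rows only, so that no information from the test set leaks into training.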

Don't change this next part: we are separating the data into X (features) and y (target) variables, then splitting both into the training and testing sets that everyone will use.

In [None]:
X = df[['Name','Short description','Gender','Country','Occupation','Birth year','Manner of death']]
y = df['Age of death']

X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.9,random_state=0)

Some of the variables here are categorical, so an easy way to get started is to transform them into one-hot vectors. Luckily, Pandas provides a convenience function to do this for us:

In [None]:
X_dummies = pd.get_dummies(data=X[['Gender','Country','Occupation','Manner of death']])
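To see what `pd.get_dummies` produces, here is a small sketch on a made-up toy frame (the rows are invented for illustration). Each distinct value in a column becomes its own indicator column, named `<column>_<value>`.

```python
import pandas as pd

# Toy data: two categorical columns with a handful of made-up values
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Manner of death": ["natural causes", "accident", "natural causes"],
})

toy_dummies = pd.get_dummies(data=toy)

# One indicator column per (column, value) pair
print(toy_dummies.columns.tolist())
print(toy_dummies.shape)
```

With thousands of distinct occupations or countries, the same call yields thousands of columns, which is worth keeping in mind when looking at the width of `X_dummies`.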

And now that we have those one-hot vectors for all entries, we can transform the training and test data to use them instead:

In [None]:
X_train_dummies = X_dummies.loc[X_train.index] #Get the one-hot vectors corresponding to indices that are in the training set
X_train_concat = pd.concat([X_train['Birth year'], X_train_dummies],axis=1) #Join the one-hot vectors with the other features

X_test_dummies = X_dummies.loc[X_test.index]
X_test_concat = pd.concat([X_test['Birth year'], X_test_dummies],axis=1)

X_train_concat.head()

Without doing any further preprocessing (do you think we might have too many manners of death that occurred for only a single person?), let's try training a couple of models and see what we get:

In [None]:
tree = DecisionTreeRegressor().fit(X_train_concat,y_train)
tree.score(X_test_concat,y_test)

In [None]:
et = ExtraTreesRegressor(n_jobs=-1).fit(X_train_concat,y_train) #the n_jobs=-1 parameter tells Scikit-Learn to use all cores of your processor
et.score(X_test_concat,y_test)
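For regressors in Scikit-Learn, `.score` returns the coefficient of determination (R²), which can be hard to interpret on its own. The sketch below uses synthetic data (the numbers and the `X_demo`/`y_demo` names are made up for illustration) to show how you might also report the mean absolute error, which is expressed in the same units as the target (here, years).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for (birth year -> age of death); not the real dataset
rng = np.random.default_rng(0)
X_demo = rng.uniform(1800, 1950, size=(200, 1))
y_demo = 60 + 0.05 * (X_demo[:, 0] - 1800) + rng.normal(0, 5, size=200)

# Fit on the first 150 rows, evaluate on the remaining 50
model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo[:150], y_demo[:150])
pred = model.predict(X_demo[150:])

r2 = r2_score(y_demo[150:], pred)              # same quantity as model.score(...)
mae = mean_absolute_error(y_demo[150:], pred)  # average error in years
print(f"R^2: {r2:.3f}, MAE: {mae:.2f} years")
```

On the real data, the equivalent would be calling `mean_absolute_error(y_test, tree.predict(X_test_concat))` after fitting.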

You can surely do better! If you're looking for places to start, here are some suggestions:

1. Try imputing some of the missing values in the data rather than just discarding those rows.
2. I only tried regression with two kinds of tree-based model. There are a vast number of other options in Scikit-Learn, not to mention using a neural network.
3. The country feature contains a list of multiple countries for some people. In the code above, each such list gets encoded as its own feature, rather than setting the values of multiple countries to 1!
4. There are a ton of manners of death which are only listed for one or two people - is it better to just replace these with "other"?
5. We have a lot of rich text which we could take advantage of - how could we do that?
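As a starting point for suggestions 3 and 4, here is a rough sketch on a made-up toy frame. The `"; "` separator between countries and the `min_count` threshold are assumptions for illustration; check how the real column is formatted and tune the threshold yourself.

```python
import pandas as pd

# Toy data (made up); the "; " separator is an assumption about the format
toy = pd.DataFrame({
    "Country": ["France", "France; Germany", "Germany", "Italy"],
    "Manner of death": ["natural causes", "natural causes", "accident", "duel"],
})

# Suggestion 4: replace manners of death seen fewer than min_count times with "other"
min_count = 2  # hypothetical threshold
counts = toy["Manner of death"].value_counts()
rare = counts[counts < min_count].index
toy["Manner grouped"] = toy["Manner of death"].where(
    ~toy["Manner of death"].isin(rare), "other"
)

# Suggestion 3: multi-hot encoding, one column per individual country,
# with a 1 in every country that appears in the person's list
country_multi_hot = toy["Country"].str.get_dummies(sep="; ")

print(toy["Manner grouped"].tolist())
print(country_multi_hot)
```

Whether grouping rare categories actually helps is an empirical question, so compare the test score with and without it.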
