## Salary prediction for Kaggle Data Scientists

In this exercise, we will explore the data from the 2017 Kaggle survey. The goal is to find which criterion matters most for the salary of a data scientist. The data is available [here](https://kermorvant.github.io/csexed-ml/data/kaggle2017.xlsx).

**Questions:**
> * Load the Kaggle dataset with `pd.read_excel('kaggle2017.xlsx')` 
> * Print the first few rows of the dataset (`df.head()`) and compute basic statistics (`df.describe()`)


In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_excel(None) # YOUR CODE HERE
df.head()

In [None]:
df.describe()

The categorical columns are not summarized by `df.describe()`, which by default only covers the numeric columns.
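To make this concrete, here is a minimal sketch on a made-up toy frame (the values are illustrative and not taken from the survey). It assumes that excluding numeric dtypes in `describe` is an acceptable way to summarize the remaining text columns.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data (values are made up).
toy = pd.DataFrame({
    "Age": [25, 32, 41, 29],
    "GenderSelect": ["Male", "Female", "Male", None],
    "Country": ["France", "India", "France", "France"],
})

# Default call: only the numeric column is summarized.
numeric_summary = toy.describe()

# Excluding numeric dtypes summarizes the text columns instead
# (rows: count, unique, top, freq).
categorical_summary = toy.describe(exclude=[np.number])

print(numeric_summary.columns.tolist())
print(categorical_summary)
```

For text columns, the summary reports the number of non-missing values, the number of distinct values, the most frequent value, and its frequency.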

**Questions**
> * Plot the distribution of each categorical column (`GenderSelect`, `Country`, `EmploymentStatus`, `CurrentJobTitleSelect`, `MajorSelect`) with `sns.countplot(data=df,y='GenderSelect')`


In [None]:
sns.countplot(data=df,y='GenderSelect')

Categorical data cannot be used directly by most scikit-learn models. Their values must first be encoded, for example by associating an integer with each distinct category.
> * Encode the categorical data with a [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html). Missing values must be filled before fitting the encoder (use `fillna('UNK')` on each column).
> * Print a few lines of the dataset to see the effect of the encoder.
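As an illustration of the encoding step, the sketch below applies a `LabelEncoder` to a small made-up column (the values are hypothetical, not taken from the survey). Missing entries are replaced by `'UNK'` first, as suggested above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up column with one missing value.
status = pd.Series(["Employed", None, "Student", "Employed"])

encoder = LabelEncoder()
# Fill the missing value, then map each category to an integer.
codes = encoder.fit_transform(status.fillna("UNK"))

print(list(encoder.classes_))  # ['Employed', 'Student', 'UNK']
print(list(codes))             # [0, 2, 1, 0]
```

The integer assigned to each category is its position in the sorted list `classes_`, so the numeric order carries no meaning of its own.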


In [None]:
from sklearn import preprocessing
label_encoder = None # YOUR CODE HERE
for c in ['GenderSelect','Country','EmploymentStatus','CurrentJobTitleSelect','MajorSelect']:
    df[c+'_label'] = label_encoder.fit_transform(df[c].fillna('UNK'))
    # print the encoded values
    print(c)
    print(list(enumerate(label_encoder.classes_)))
    print()
# YOUR CODE HERE to print a few lines of the dataframe

A decision tree can be used both for classification and for regression. In our case, since we want to predict the salary of a data scientist, we will use a decision tree for regression.
> * Define the features X and the target y (= Salary)
> * Fit a [DecisionTreeRegressor](http://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html) with maximum depth = 2 to predict the Salary
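The general pattern for fitting a regression tree can be sketched on synthetic data. Everything below (`X_toy`, `y_toy`, the salary values) is invented for illustration and is not the solution to the exercise on the Kaggle data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends only on the first feature.
rng = np.random.RandomState(0)
X_toy = rng.randint(0, 10, size=(200, 3))
y_toy = np.where(X_toy[:, 0] < 5, 30000.0, 80000.0)

# max_depth limits how many successive splits the tree may make.
toy_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
toy_tree.fit(X_toy, y_toy)

print(toy_tree.get_depth())
print(toy_tree.predict([[2, 0, 0], [7, 0, 0]]))
```

A shallow tree is easy to read, which is why a small `max_depth` is convenient when the goal is to inspect the rules rather than to maximize accuracy.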


In [None]:
X = df[['Age','GenderSelect_label','Country_label','EmploymentStatus_label','CurrentJobTitleSelect_label','MajorSelect_label']]
y = df['Salary']

In [None]:
from sklearn.tree import DecisionTreeRegressor
regr = None # YOUR CODE HERE
regr.fit(None, None) # YOUR CODE HERE


The tree can be displayed to see which rules have been learned. The following code plots the decision tree trained previously.
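If graphviz is not available, the rules can also be read as plain text. The sketch below uses `sklearn.tree.export_text` on a toy tree; the data and the feature names passed to it are placeholders chosen for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data: the target depends only on the first column.
rng = np.random.RandomState(0)
X_toy = rng.randint(0, 10, size=(200, 2))
y_toy = np.where(X_toy[:, 0] < 5, 30000.0, 80000.0)

toy_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, y_toy)

# One line per node; the first line is the root split.
rules = export_text(toy_tree, feature_names=["Age", "Country_label"])
print(rules)
```

The first line of the printed rules corresponds to the first splitting node of the tree.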

**Question**:
> * What is the first splitting node?

In [None]:
import graphviz
from sklearn import tree
dot_data = tree.export_graphviz(regr, out_file=None,feature_names=['Age','GenderSelect_label','Country_label','EmploymentStatus_label','CurrentJobTitleSelect_label','MajorSelect_label'])
graph = graphviz.Source(dot_data)  
graph 

A random forest model can also be used to find the most important features in classification or regression problems. In scikit-learn, the importance of a feature reflects how much the splits on that feature reduce the training criterion, accumulated over the trees of the forest and normalized so that the importances sum to 1.
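A small sketch on synthetic data shows how these importances behave. The data is invented so that the first column dominates the target; `ExtraTreesRegressor` is used here as one tree-ensemble choice, and `n_estimators=50` is an arbitrary setting for the example.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.RandomState(0)
X_toy = rng.rand(300, 3)
# Target driven mostly by column 0, slightly by column 1; column 2 is unused.
y_toy = 10.0 * X_toy[:, 0] + 1.0 * X_toy[:, 1]

toy_forest = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)
importances = toy_forest.feature_importances_

print(importances)                  # one value per feature, summing to 1
print(int(np.argmax(importances)))  # index of the most important feature
```

Importances computed this way are measured on the training data, so they are best read as a ranking rather than as exact effect sizes.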

The following code (adapted from [this code](http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html)) plots the feature importances on the Kaggle problem. It uses `ExtraTreesRegressor`, a randomized variant of the random forest.

**Question**
> * What is the most important feature for predicting the salary?

In [None]:
from sklearn.ensemble import ExtraTreesRegressor
forest = ExtraTreesRegressor()
feature_names=['Age','GenderSelect_label','Country_label','EmploymentStatus_label','CurrentJobTitleSelect_label','MajorSelect_label']
forest.fit(X, y)
importances = forest.feature_importances_
# Use `est` as the loop variable to avoid shadowing the `tree` module imported earlier
std = np.std([est.feature_importances_ for est in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. feature %d %s(%f)" % (f + 1, indices[f],feature_names[indices[f]], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
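# Label the bars with feature names rather than column indices
plt.xticks(range(X.shape[1]), [feature_names[i] for i in indices], rotation=90)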
plt.xlim([-1, X.shape[1]])
plt.show()