# 2019 Kaggle ML & DS Survey
EXPLORATORY DATA ANALYSIS 
& MODEL EXPLAINABILITY NOTEBOOK

**Data Overview**

- This survey received 19,717 usable responses from 171 countries and territories. If a country or territory had fewer than 50 respondents, we grouped it into a category named “Other” for anonymity.
- Most of the respondents were found primarily through Kaggle channels, like our email list, discussion forums and social media channels.
- The survey was live from October 8th to October 28th. We allowed respondents to complete the survey at any time during that window. 
- Not every question was shown to every respondent. In general, respondents with more experience were asked more questions and respondents with less experience were asked fewer questions.
- To protect the respondents’ identity, the answers to multiple choice questions have been placed in a separate data file from the open-ended responses.
- Multiple choice single response questions fit into individual columns, whereas multiple choice multiple response questions were split into multiple columns.
- Text responses were encoded to protect user privacy, and countries with fewer than 50 respondents were grouped into the category "Other".

(full data description: https://www.kaggle.com/c/kaggle-survey-2019/data)
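
To make the single-response vs. multiple-response layout concrete, here is a minimal sketch on a toy frame. The column names (`Q9_Part_1`, `Q9_OTHER_TEXT`, ...) are an assumption about the naming pattern, and `multi_response_columns` is a hypothetical helper, not part of the dataset or of this notebook.

```python
import pandas as pd

# Toy stand-in for the responses file; column names are assumed, not read from disk.
toy = pd.DataFrame({
    "Q5": ["Student", "Data Scientist", "Student"],        # single response: one column
    "Q9_Part_1": ["Analyze data", None, "Analyze data"],   # multiple response: one column per option
    "Q9_Part_2": [None, "Build ML service", "Build ML service"],
    "Q9_OTHER_TEXT": [-1, -1, 0],                          # encoded free-text column
})

def multi_response_columns(df, question):
    """Hypothetical helper: columns holding the options of one multi-response question."""
    prefix = f"{question}_Part_"
    return [col for col in df.columns if col.startswith(prefix)]

q9_cols = multi_response_columns(toy, "Q9")
# A non-missing cell means the respondent selected that option.
selections_per_respondent = toy[q9_cols].notna().sum(axis=1)
print(q9_cols)
print(selections_per_respondent.tolist())
```

Counting non-missing cells per row is one simple way to turn the split columns back into "how many options did this respondent pick".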

In [None]:
import numpy as np 
import pandas as pd
import pandas_profiling
import matplotlib.pyplot as plt
import seaborn as sns
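# The 'seaborn' style was renamed to 'seaborn-v0_8' in matplotlib 3.6
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')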


import warnings
warnings.filterwarnings('ignore')

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split as train_valid_split
from sklearn.metrics import classification_report

import eli5

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Questions
This section lists the full text of each survey question.

*Author Notes: Survey schema data analysis will be re-added in future versions of this notebook*

In [None]:
questions_only = pd.read_csv('/kaggle/input/kaggle-survey-2019/questions_only.csv')
with pd.option_context('display.max_colwidth', 10000):
    display(questions_only.T)

# Multiple Choice Responses
This section provides quick data profiling and analysis of the multiple choice responses data. 

*Author Notes: Other Text Responses data analysis will be re-added to future versions of this notebook*

In [None]:
# The first data row holds the question text, so [1:] skips it
multiple_choice = pd.read_csv('/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv')[1:]
multiple_choice.profile_report(title='Multiple Choice Responses',style={'full_width':True})
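
The `[1:]` slice in the cell above drops the first data row, which holds the question wording rather than an answer. A minimal sketch of that layout on an in-memory toy CSV (the two question strings here are placeholders, not the exact survey text):

```python
import io
import pandas as pd

# Toy CSV with the assumed layout: header row, one row of question text, then answers.
raw = io.StringIO(
    "Time from Start to Finish (seconds),Q5\n"
    "Duration (in seconds),Select the title most similar to your current role\n"
    "510,Student\n"
    "423,Data Scientist\n"
)
full = pd.read_csv(raw)
question_text = full.iloc[0]                       # first data row = question wording
responses = full.iloc[1:].reset_index(drop=True)   # actual answers

# The text row forces every column to be read as strings, so numeric columns need a cast.
duration_col = "Time from Start to Finish (seconds)"
responses[duration_col] = responses[duration_col].astype(float)
print(question_text["Q5"])
print(responses)
```

Keeping `question_text` around can be handy for labelling plots, and it also explains why the duration column has to be cast to float later on.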

# Current Title/Role
Since we want to build a model on this data and apply a model explainability technique, I decided to use the current title/role (Q5) as the target variable.

In [None]:
multiple_choice['Q5'].value_counts().sort_values(ascending=True).plot(kind='barh')

We can see here that Student and Data Scientist are the most common roles. We can do a simple binary classification using these top 2 roles as classes.

*Author Notes: The classification model can be extended to multi-class in future versions*
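
Instead of hard-coding the two role names, the top 2 classes could also be read off the counts. A minimal sketch on a toy `Q5` column (the values are made up for illustration):

```python
import pandas as pd

# Toy Q5 column; the real one comes from the survey responses.
roles = pd.Series(
    ["Student", "Data Scientist", "Student", "Software Engineer",
     "Data Scientist", "Student", "Data Analyst"],
    name="Q5",
)

# The two most frequent roles, most frequent first.
top_two = roles.value_counts().nlargest(2).index.tolist()
# Boolean mask selecting only respondents in those two roles.
condition = roles.isin(top_two)
print(top_two)
print(condition.sum())
```

This version would also carry over to a multi-class setup by changing the `2` in `nlargest`.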

In [None]:
condition = (multiple_choice['Q5']=='Data Scientist') | (multiple_choice['Q5']=='Student')
multiple_choice[condition].profile_report(title='Multiple Choice Responses',style={'full_width':True})

# Modeling: Students vs Data Scientist
In this section we will create a simple classification model to differentiate a Student from a Data Scientist. We will also include model explainability to understand what the model learned.

In [None]:
condition = (multiple_choice['Q5']=='Data Scientist') | (multiple_choice['Q5']=='Student')
df = multiple_choice[condition].reset_index(drop=True)
other_text_cols = [col for col in df.columns if 'OTHER_TEXT' in col]
df = df.drop(other_text_cols,axis=1)
df = df.rename(columns={'Time from Start to Finish (seconds)':'Duration'})
df.head()

In [None]:
def cat_encoding(df,map_dict):
    # Replace each categorical value with its integer code
    for col in map_dict.keys():
        df[col] = df[col].map(map_dict[col])
    return df

# Duration was read as text because of the question-text row, so cast it
df['Duration'] = df['Duration'].astype(float)
cat_cols = df.select_dtypes('object').columns

# Build one {label: code} mapping per categorical column
cat_mapping = {}
for col in cat_cols:
    values = list(df[col].unique())
    LE = LabelEncoder().fit(values)
    cat_mapping[col] = dict(zip(LE.classes_, LE.transform(LE.classes_)))

df = cat_encoding(df,cat_mapping)
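
The cell above builds one label-to-integer mapping per column; missing answers are handled later with `fillna(-999)`. As a rough alternative sketch (not what this notebook uses), pandas category codes give a similar integer encoding and mark missing values as `-1` directly. The toy column values are made up.

```python
import pandas as pd

# Toy frame with missing answers; column names and values are illustrative only.
toy = pd.DataFrame({
    "Q1": ["18-21", "25-29", None, "18-21"],
    "Q2": ["Male", "Female", "Female", None],
})

encoded = toy.copy()
mappings = {}
for col in encoded.columns:
    cat = encoded[col].astype("category")
    # Keep a code -> label lookup so encoded values can be read back later.
    mappings[col] = dict(enumerate(cat.cat.categories))
    # Integer codes follow the sorted category order; missing values become -1.
    encoded[col] = cat.cat.codes

print(encoded)
print(mappings)
```

Either way the codes are arbitrary integers, which a tree-based model can split on but which carry no real ordering.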

For our model we want to predict Q5, so that we can see how the model differentiates a Student from a Data Scientist.

In [None]:
y_col = 'Q5'
y = df[y_col]
Xs = df.drop(y_col,axis=1).fillna(-999)

X_train,X_valid,y_train,y_valid = train_valid_split(Xs, y, test_size = .2,
                                                    random_state=0)
X_train.shape,X_valid.shape
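
The split above is a plain random split. If the two classes are imbalanced, a stratified split keeps the class ratio the same in both parts. A minimal sketch on synthetic data (the feature and the 70/30 class ratio are made up; `stratify` is a standard `train_test_split` parameter):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 30% of them in class 1.
rng = np.random.default_rng(0)
X = pd.DataFrame({"feature": rng.normal(size=100)})
y = pd.Series([0] * 70 + [1] * 30, name="Q5")

# stratify=y preserves the class proportions in both splits.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(X_train.shape, X_valid.shape)
print(y_train.mean(), y_valid.mean())
```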

### Modeling

In [None]:
%%time
model = RandomForestClassifier(n_estimators=100,
                               random_state=0,n_jobs=-1)
model.fit(X_train,y_train)

### Model Evaluation

In [None]:
preds = model.predict_proba(X_train)[:,1]
plt.hist(preds,bins=100)
plt.show();
print('train_report',classification_report(y_train,np.round(preds)))

preds = model.predict_proba(X_valid)[:,1]
plt.hist(preds,bins=100)
plt.show();
print('valid_report',classification_report(y_valid,np.round(preds)))
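
In the cell above, column 1 of `predict_proba` is the probability of the second entry in `model.classes_` (presumably Student here, since `LabelEncoder` sorts labels alphabetically), and `np.round` acts as a 0.5 threshold. A minimal sketch on synthetic data that makes both steps explicit:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Synthetic binary problem; features and labels are made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# predict_proba columns follow the order of model.classes_.
positive_class = model.classes_[1]
proba = model.predict_proba(X)[:, 1]

# Explicit threshold instead of np.round; the cut-off is now easy to change.
threshold = 0.5
labels = np.where(proba > threshold, model.classes_[1], model.classes_[0])
print(classification_report(y, labels))
```

Making the threshold a named variable helps if one class turns out to matter more than the other.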

# Model Explainability: Students vs Data Scientist

The features with the highest overall weights in the model are:
- Q6: What is the size of the company where you are employed? 
- Q8: Does your current employer incorporate machine learning methods into their business?
- Q7: Approximately how many individuals are responsible for data science workloads at your place of business?
- Q11: Approximately how much money have you spent on machine learning and/or cloud computing products at your work in the past 5 years?
- Q10: What is your current yearly compensation (approximate $USD)?

It seems that the model relies on employment-related questions to differentiate a Student from a Data Scientist.

*Author Notes: variables that are highly related to the class difference will be removed in future versions*
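
One possible way to act on the author note above is to drop the employment-related questions before training. This is only a sketch: `drop_questions` is a hypothetical helper, the question ids are taken from the list above, and the `_Part_` column pattern is an assumption.

```python
import pandas as pd

# Question ids from the list above; treating them as leaky is a judgment call.
EMPLOYMENT_QUESTIONS = ["Q6", "Q7", "Q8", "Q10", "Q11"]

def drop_questions(df, questions):
    """Hypothetical helper: drop '<id>' and every '<id>_...' column for the given ids."""
    to_drop = [
        col for col in df.columns
        if any(col == q or col.startswith(q + "_") for q in questions)
    ]
    return df.drop(columns=to_drop)

# Toy frame with a mix of columns to keep and to drop.
toy = pd.DataFrame(
    [[0, 1, 2, 3, 4, 5]],
    columns=["Q1", "Q6", "Q10", "Q11", "Q12_Part_1", "Q8"],
)
remaining = drop_questions(toy, EMPLOYMENT_QUESTIONS).columns.tolist()
print(remaining)
```

Matching on the exact id or the `id_` prefix avoids accidentally dropping `Q1` when the list contains `Q10`.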

In [None]:
eli5.show_weights(model,feature_names=list(X_train.columns))
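
As far as I understand, the weights eli5 reports for a random forest are based on the impurity-based feature importances of the trees. A rough sketch of a comparable table using only scikit-learn and pandas, on synthetic data (feature names are reused from the survey only as labels):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data where "Q6" drives the label most, "Q8" less, the rest not at all.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=["Q6", "Q8", "Q1", "Q2"])
y = (X["Q6"] + 0.5 * X["Q8"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# One importance vector per tree; mean and spread across trees per feature.
per_tree = np.array([tree.feature_importances_ for tree in model.estimators_])
importance = pd.DataFrame(
    {"mean": per_tree.mean(axis=0), "std": per_tree.std(axis=0)},
    index=X.columns,
).sort_values("mean", ascending=False)
print(importance)
```

The standard deviation column gives a feel for how stable each feature's importance is from tree to tree.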

### Predicting a Data Scientist based on survey answers

Top questions that differentiate this respondent from a Student:
- Q11: Approximately how much money have you spent on machine learning and/or cloud computing products at your work in the past 5 years?
- Q8: Does your current employer incorporate machine learning methods into their business?

In [None]:
X_valid.loc[2569,:]

In [None]:
eli5.show_prediction(model,X_valid.loc[2569,:],feature_names=list(X_train.columns), top=20)
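
The row label used above is hard-coded, so it only works while that label ends up in the validation split. A sketch of picking a row by predicted probability instead, on synthetic data (the feature names and the index offset are made up):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with a non-default index, similar to a frame after a random split.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["Q6", "Q8", "Q11"])
X.index = X.index + 1000
y = (X["Q6"] > 0).astype(int)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Probability of the first class, keyed by the original index labels.
proba = pd.Series(model.predict_proba(X)[:, 0], index=X.index)
row_label = proba.idxmax()    # label of the row most confidently in class 0
row = X.loc[[row_label]]      # double brackets keep a one-row DataFrame
print(row_label, model.predict(row)[0])
```

The selected `row_label` can then be passed to `.loc` in place of a fixed number.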

### Predicting a Student based on survey answers

Top questions that differentiate this respondent from a Data Scientist:
- Q6: What is the size of the company where you are employed?
- Q11: Approximately how much money have you spent on machine learning and/or cloud computing products at your work in the past 5 years?

In [None]:
X_valid.loc[5683,:]

In [None]:
eli5.show_prediction(model,X_valid.loc[5683,:],feature_names=list(X_train.columns), top=20)

# Notebook in progress
### Do UPVOTE if this notebook is helpful to you in some way :) <br/> Comment below any suggestions that can help improve this notebook. TIA
