Let's import the modules and data we need.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns
from collections import defaultdict
#import util_functions as uf
from util_functions import *
%matplotlib inline

df = pd.read_csv('./2019/survey_results_public.csv')
schema = pd.read_csv('./2019/survey_results_schema.csv')
df.shape

(88883, 85)

Let's see if we can predict whether someone describes their occupation as a data scientist or machine learning specialist based on some of the columns in the data. We have to be careful about which columns we select, because the number of columns will explode if we one-hot encode columns with many distinct values. These may not be the best columns to select; however, for simplicity and ease of training our model, we will use them going forward.
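To make the column explosion concrete, here is a minimal sketch on toy data. The values and the `City` column are hypothetical and are not taken from the survey; only `Hobbyist` is a real column name from this notebook. With `drop_first=True`, each categorical column adds one dummy column per distinct value, minus one.

```python
import pandas as pd

# Toy data for illustration only; "City" is a hypothetical column.
toy = pd.DataFrame({
    "Hobbyist": ["Yes", "No", "Yes", "No"],
    "City": ["Lima", "Oslo", "Cairo", "Perth"],
})

# Estimate the encoded width before encoding:
# (number of distinct values - 1) per categorical column.
expected_cols = sum(toy[c].nunique() - 1 for c in toy.columns)

# One-hot encode both columns, dropping the first level of each.
encoded = pd.get_dummies(toy, columns=list(toy.columns), drop_first=True)

print(expected_cols, encoded.shape[1])
```

Running `nunique()` on a candidate column before encoding it is a cheap way to decide whether it is worth including.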

In [2]:
cols = ['EdLevel', 'UndergradMajor', 'Age', 'Hobbyist', 'DevType', 'WorkWeekHrs', 'WorkRemote', 'BetterLife']

df = df[cols]
df.shape

(88883, 8)

We need to do some data manipulation in order to utilize the dataframe in a machine learning algorithm:

1. Drop all the rows with no dev types
2. For each numeric variable, fill missing values with the column mean
3. Create y as the DevType column
4. Set any value in y containing 'Data scientist or machine learning specialist' to 1, otherwise 0
5. Convert the y column to an integer
6. Create X to contain all columns excluding DevType
7. Create dummy columns for all the categorical variables and drop the original columns

In [3]:

# Drop rows with missing DevType values; copy so later assignments
# don't act on a slice of the original dataframe
df = df[df.DevType.notnull()].copy()

# Fill missing values in numeric columns with the column mean
num_vars = df.select_dtypes(include=['float', 'int']).columns
df[num_vars] = df[num_vars].fillna(df[num_vars].mean())

# Set y = 1 for any row whose DevType contains
# 'Data scientist or machine learning specialist'; otherwise, set y = 0
target = 'Data scientist or machine learning specialist'
y = df['DevType'].str.contains(target, regex=False).astype('int32')

# Build the X matrix by dropping the DevType column
X = df.drop(['DevType'], axis=1)

# One-hot encode the categorical variables, dropping the first level of each
cat_vars = X.select_dtypes(include=['object']).columns
X = pd.get_dummies(X, columns=cat_vars, prefix_sep='_', drop_first=True)

# Sanity checks: 6460 rows should have a DevType containing 'Data scientist or machine learning specialist',
# and the column count of X should not have exploded after one-hot encoding
print('y values equal to 1: {}\nShape of X: {}'.format(y.sum(), X.shape))

y values equal to 1: 6460
Shape of X: (81335, 29)


Split the X and y data into training and test sets.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .20, random_state=101)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(65068, 29) (16267, 29) (65068,) (16267,)


Let's try AdaBoost with a decision tree as the base estimator, using default parameter values, to see what we can come up with quickly.

In [17]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

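# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`,
# and the old name was removed in 1.4
ada_model = AdaBoostClassifier(base_estimator=DecisionTreeClassifier())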

ada_model.fit(X_train, y_train)
y_test_preds = ada_model.predict(X_test)

print(ada_model.score(X_test,y_test))
print(mean_squared_error(y_test, y_test_preds))

0.89764554005
0.10235445995
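The two numbers printed above sum to 1, and that is not a coincidence. With 0/1 labels, the squared error of a prediction is 1 when it is wrong and 0 when it is right, so the mean squared error is simply the misclassification rate. A minimal sketch with made-up labels (not survey data):

```python
# Hypothetical labels and predictions, for illustration only.
y_true = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
y_pred = [0, 0, 0, 0, 1, 0, 1, 1, 0, 0]

n = len(y_true)

# Fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

# Mean of squared differences; each wrong 0/1 prediction contributes exactly 1.
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

print(accuracy, mse, accuracy + mse)
```

So the second number in each output pair carries no information beyond the accuracy score.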


Let's use a grid search over a few different parameters to see if we can get better results than with the default values.

In [18]:
from sklearn.model_selection import GridSearchCV
param_grid = {"base_estimator__criterion" : ["gini", "entropy"],
              "base_estimator__splitter" :   ["best", "random"],
              "base_estimator__max_depth" :   [2, 4],
              "n_estimators": [2,4,8,16,32]
             }

DTC = DecisionTreeClassifier()
ABC = AdaBoostClassifier(base_estimator = DTC)

# run grid search
grid_search = GridSearchCV(ABC, param_grid=param_grid)
grid_search.fit(X_train, y_train)
grid_search.best_params_

{'base_estimator__criterion': 'entropy',
 'base_estimator__max_depth': 2,
 'base_estimator__splitter': 'best',
 'n_estimators': 4}
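To get a feel for how much work this grid search does, we can count the parameter combinations in the grid. The sketch below repeats the grid from the cell above. The value of `cv_folds` is an assumption: the notebook does not set `cv`, and the default number of folds depends on the scikit-learn version (3 in older releases, 5 from 0.22 onward).

```python
from itertools import product

param_grid = {
    "base_estimator__criterion": ["gini", "entropy"],
    "base_estimator__splitter": ["best", "random"],
    "base_estimator__max_depth": [2, 4],
    "n_estimators": [2, 4, 8, 16, 32],
}

# Every combination of one value per parameter is one candidate model.
combinations = list(product(*param_grid.values()))
n_candidates = len(combinations)

# Assumed number of cross-validation folds (not set in the notebook).
cv_folds = 5

# Each candidate is fitted once per fold, before the final refit.
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)
```

Adding one more value to any parameter list multiplies the candidate count, which is why we keep the grid small here.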

Now we can take the output from the grid search and plug in the parameter combination it reported as the best.

In [19]:
DTC = DecisionTreeClassifier(criterion='entropy', splitter='best', max_depth=2)
ABC = AdaBoostClassifier(base_estimator=DTC, n_estimators=4)

ABC.fit(X_train, y_train)
y_test_preds = ABC.predict(X_test)

print(ABC.score(X_test,y_test))
print(mean_squared_error(y_test, y_test_preds))

0.918669699391
0.0813303006086


We got pretty good results from AdaBoost with a decision tree classifier, but let's take a quick look at an SVC. We'll stick to the default settings because SVC takes much longer to train than AdaBoost with a decision tree.

In [20]:
from sklearn.svm import SVC

svc_model = SVC() 
svc_model.fit(X_train, y_train)
y_test_preds = svc_model.predict(X_test) 

print(svc_model.score(X_test,y_test))
print(mean_squared_error(y_test, y_test_preds)) 

0.920022130694
0.079977869306


Default AdaBoost with a decision tree wasn't horrible (about 0.898 accuracy), but we did noticeably better after a grid search over a subset of parameters (about 0.919). What was interesting is that default SVC was slightly better still (about 0.920). We could potentially improve SVC with a grid search; however, it takes substantially longer to train. If training time is a constraint when choosing the algorithm, AdaBoost with a decision tree might be the more practical choice; on accuracy alone, SVC comes out marginally ahead.
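One caveat when reading these accuracy scores: only 6460 of the 81335 rows are labelled 1, so the classes are heavily imbalanced. A useful reference point is the accuracy of a trivial model that always predicts 0. The sketch below computes it from the counts printed earlier. It uses the full dataset, so the rate in the test split may differ slightly.

```python
# Counts taken from the sanity-check output earlier in the notebook.
n_rows = 81335      # rows in X
n_positive = 6460   # rows with y == 1

# Share of rows labelled as data scientist / ML specialist.
positive_rate = n_positive / n_rows

# Accuracy of a trivial model that predicts 0 for every row.
baseline_accuracy = 1 - positive_rate

print(round(positive_rate, 4), round(baseline_accuracy, 4))
```

Since this baseline is about 0.92, the scores above are best read relative to it, and metrics such as precision and recall on the positive class would likely separate the models more clearly than accuracy does.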