# HW6:  Analyzing Data Science in the StackOverflow Tech Job Survey
## Due Sunday June 4, at 11:55pm

For this problem, extract part of the data involving <i>data science</i> and do two things:
(1) build a regression model to predict salary, and
(2) build a classifier model to predict job satisfaction.

Jennifer MacDonald 604501712

CS249 -- Spring 2017 -- D.S. Parker &copy; 2017

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np
from patsy import dmatrices

import pandas as pd
from pandas import Series, DataFrame

import statsmodels.api as sm

from sklearn import datasets, svm

# Libraries for preprocessing, modeling, and calculating metrics on the data
from sklearn import ensemble, linear_model, preprocessing, metrics 
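# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split # For splitting the data into train and test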
from sklearn.metrics import mean_squared_error # For calculating mean squared error



## Part 0:  Getting the Survey data

In [2]:
JobSurvey = pd.read_csv('2016 Stack Overflow Survey Responses.csv') 

# Fill nulls in the star_wars_vs_star_trek column with '' so that dropna() below does not discard those rows (so many are null)
JobSurvey['star_wars_vs_star_trek'] = JobSurvey['star_wars_vs_star_trek'].fillna('') 
JobSurvey = JobSurvey.dropna() # drop all rows with null values from JobSurvey

In [3]:
df = JobSurvey.copy() # Copy dataframe to manipulate without changing the original

# Extract a subset of the data involving data science
df1 = df[df.occupation == 'Business intelligence or data warehousing expert']
df2 = df[df.occupation == 'Data scientist']
df3 = df[df.occupation == 'Developer with a statistics or mathematics background']
df4 = df[df.occupation == 'Machine learning developer']

frames = [df1,df2,df3,df4]

DataScience = pd.concat(frames) 
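
The four per-occupation filters above can also be written as a single membership test. The following is a minimal sketch on a small made-up DataFrame rather than the survey file; `toy`, `ds_titles`, and `subset` are hypothetical names, and the rows are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the survey data (invented rows, not real responses)
toy = pd.DataFrame({
    'occupation': ['Data scientist', 'Student',
                   'Machine learning developer', 'Data scientist'],
    'salary_midpoint': [85000.0, 5000.0, 95000.0, 65000.0],
})

# The four data science occupations selected in the cell above
ds_titles = [
    'Business intelligence or data warehousing expert',
    'Data scientist',
    'Developer with a statistics or mathematics background',
    'Machine learning developer',
]

# Keep only the rows whose occupation appears in the list
subset = toy[toy['occupation'].isin(ds_titles)]
print(subset)
```

Unlike concatenating one frame per title, this keeps the rows in their original order instead of grouping them by occupation; the set of selected rows is the same.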

In [4]:
le = preprocessing.LabelEncoder() # Preprocess data by changing all categorical variables to numerical
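
To see what this encoder does when it is later applied column by column with `DataFrame.apply(le.fit_transform)`, here is a small sketch on invented data. The column names and values in `toy` are assumptions chosen for illustration, not taken from the survey.

```python
import pandas as pd
from sklearn import preprocessing

le = preprocessing.LabelEncoder()

# Invented categorical data standing in for survey columns
toy = pd.DataFrame({
    'country': ['Germany', 'Brazil', 'Germany'],
    'gender': ['Female', 'Male', 'Female'],
})

# apply() calls le.fit_transform once per column, refitting the encoder each time.
# Within a column, the sorted distinct values are mapped to 0, 1, 2, ...
encoded = toy.apply(le.fit_transform)
print(encoded)
```

Two consequences are worth keeping in mind. The integer codes impose an arbitrary ordering on categories, which tree-based models tolerate better than linear ones. Also, because the same encoder is refitted for every column, `le` only remembers the mapping of the last column it saw.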

## Part 1:  Predicting Salary

The texts in this course have presented a number of regression models for predicting numeric values.

Develop a "regression" model that predicts the <b>salary_midpoint</b> value (i.e., column number 15).

You should use "MSE" (Mean Squared Error) as the accuracy measure.
Develop a model that reduces this error measure.

This is asking you to produce the best model you can for each of the two datasets
-- with the highest possible accuracy.
In other words, you are asked to produce two models, and report the accuracy of each of them.
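
As a quick illustration of the error measure, MSE is the average of the squared differences between true and predicted values. The sketch below computes it by hand and with scikit-learn on three invented salary values; `y_true` and `y_pred` are hypothetical numbers, not survey data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Invented true and predicted salaries
y_true = np.array([50000.0, 70000.0, 90000.0])
y_pred = np.array([52000.0, 65000.0, 90000.0])

# Manual computation: mean of squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)

# Same quantity via scikit-learn
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)
```

Because the residuals are squared, MSE is expressed in squared salary units and is dominated by the largest errors.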

In [5]:
clf = ensemble.GradientBoostingRegressor() # Use Gradient Boosting for regression model 

### Regression model for the JobSurvey dataset

In [6]:
JobSurvey_regr = JobSurvey.copy()

JS_X_regr_pp = JobSurvey_regr.drop('salary_midpoint', axis=1) # Drop salary_midpoint (used as target instead)
JS_X_regr =  JS_X_regr_pp.apply(le.fit_transform) # Apply preprocessing to data
JS_y_regr = JobSurvey['salary_midpoint'] # Use salary_midpoint for target

In [7]:
# Split train and test data
JS_X_regr_train, JS_X_regr_test, JS_y_regr_train, JS_y_regr_test = train_test_split(JS_X_regr, JS_y_regr) 

# Fit training data
JS_regr_fit = clf.fit(JS_X_regr_train, JS_y_regr_train)
# Make predictions from the training data onto the testing data
JS_y_regr_pred = JS_regr_fit.predict(JS_X_regr_test)

# Calculate mean squared error for the true vs. predicted values
print('MSE for the JobSurvey dataset:', mean_squared_error(JS_y_regr_test, JS_y_regr_pred))

MSE for the JobSurvey dataset: 12366231.1939


### Regression model for the DataScience dataset

In [8]:
# Repeat the process for the DataScience dataset

In [9]:
DataScience_regr = DataScience.copy()

DS_X_regr_pp = DataScience_regr.drop('salary_midpoint', axis=1)
DS_X_regr =  DS_X_regr_pp.apply(le.fit_transform)
DS_y_regr = DataScience['salary_midpoint']

In [10]:
DS_X_regr_train, DS_X_regr_test, DS_y_regr_train, DS_y_regr_test = train_test_split(DS_X_regr, DS_y_regr)

DS_regr_fit = clf.fit(DS_X_regr_train, DS_y_regr_train)
DS_y_regr_pred = DS_regr_fit.predict(DS_X_regr_test)

print('MSE for the DataScience dataset:', mean_squared_error(DS_y_regr_test, DS_y_regr_pred))

MSE for the DataScience dataset: 20407383.593
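
The two MSE values are easier to read after taking the square root, which puts the error back into the units of `salary_midpoint`. This is a small sketch that uses the numbers printed above; since `train_test_split` is called without a fixed `random_state`, a rerun of the notebook would likely give somewhat different values.

```python
import math

# MSE values as printed by the two regression cells above
mse_job_survey = 12366231.1939
mse_data_science = 20407383.593

# Root mean squared error, in the same units as salary_midpoint
rmse_job_survey = math.sqrt(mse_job_survey)
rmse_data_science = math.sqrt(mse_data_science)

print('RMSE for the JobSurvey dataset:', round(rmse_job_survey, 2))
print('RMSE for the DataScience dataset:', round(rmse_data_science, 2))
```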


## Part 2:  Predicting Job Satisfaction

All of the tools covered in this course provide a large number of classifiers.

Develop a classifier model that predicts the <b>job satisfaction</b> value (i.e., column number 27).

More specifically, predict whether the value is <tt>"I love my job"</tt>.

Please use "accuracy rate" (percentage of correct predictions) as the measure of accuracy for this analysis.
For each of the two datasets, develop the best model you can -- with the highest possible accuracy.
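
The sketch below shows the two ingredients of this task on invented data: turning the answer into a binary target and computing the accuracy rate. The non-matching answer strings and the predictions in `y_pred` are made up for illustration.

```python
import pandas as pd
from sklearn import metrics

# Invented responses; only 'I love my job' is the positive class
answers = pd.Series(['I love my job', 'some other answer',
                     'I love my job', 'another answer'])

# 1 where the answer is exactly 'I love my job', 0 otherwise
y_true = (answers == 'I love my job').astype(int)

# Hypothetical classifier output for the same four rows
y_pred = [1, 1, 1, 0]

# Fraction of rows where the prediction matches the true label
acc = metrics.accuracy_score(y_true, y_pred)
print('Accuracy:', acc)
```

When judging an accuracy rate, it can help to compare it with the share of the larger class in the target, since a model that always predicts that class already reaches that figure.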


In [11]:
logistic = linear_model.LogisticRegression() # Use Logistic Regression for classification model 

### Classification model for the JobSurvey dataset

In [12]:
JobSurvey_cls = JobSurvey.copy()
# Change target data ('I love my job') to boolean 1 and all other responses to boolean 0
JobSurvey_cls['job_satisfaction'] = (JobSurvey_cls['job_satisfaction'] == 'I love my job').astype(int) 

JS_X_cls_pp = JobSurvey_cls.drop('job_satisfaction', axis=1) # Drop job_satisfaction (use as target instead)
JS_X_cls = JS_X_cls_pp.apply(le.fit_transform)

JS_y_cls = JobSurvey_cls['job_satisfaction'] # Use job_satisfaction for target

In [13]:
JS_X_cls_train, JS_X_cls_test, JS_y_cls_train, JS_y_cls_test = train_test_split(JS_X_cls, JS_y_cls)

JS_cls_fit = logistic.fit(JS_X_cls_train, JS_y_cls_train)
JS_y_cls_pred = JS_cls_fit.predict(JS_X_cls_test)

# Calculate percentage of correct predictions from the true y values of the test data
print('Accuracy for the JobSurvey dataset:', metrics.accuracy_score(JS_y_cls_test, JS_y_cls_pred))

Accuracy for the JobSurvey dataset: 0.744902205576


### Classification model for the DataScience dataset

In [14]:
DataScience_cls = DataScience.copy()
DataScience_cls['job_satisfaction'] = (DataScience_cls['job_satisfaction'] == 'I love my job').astype(int)

DS_X_cls_pp = DataScience_cls.drop('job_satisfaction', axis=1)
DS_X_cls = DS_X_cls_pp.apply(le.fit_transform)

DS_y_cls = DataScience_cls['job_satisfaction']

In [15]:
DS_X_cls_train, DS_X_cls_test, DS_y_cls_train, DS_y_cls_test = train_test_split(DS_X_cls, DS_y_cls)

DS_cls_fit = logistic.fit(DS_X_cls_train, DS_y_cls_train)
DS_y_cls_pred = DS_cls_fit.predict(DS_X_cls_test)

print('Accuracy for the DataScience dataset:', metrics.accuracy_score(DS_y_cls_test, DS_y_cls_pred))

Accuracy for the DataScience dataset: 0.681818181818
