# Download data

See https://www.kaggle.com/stackoverflow/so-survey-2017 for an overview of the Stack Overflow dataset. It is a survey with about 64,000 responses from Stack Overflow users. We are going to use it to predict what kind of developer each respondent is.


Download the survey results data to /content/datalab/workspace/structured_data_classification_stackoverflow. You can use another location, but then you have to change the workspace path below.

In [1]:
WORKSPACE_PATH = '/content/datalab/workspace/structured_data_classification_stackoverflow'

In [2]:
!ls $WORKSPACE_PATH

survey_results_public.csv  survey_results_schema.csv


# Clean up the data

We have to decide how to model the data. For each column, we have to ask whether it represents a numerical value, one categorical value, or many categories. This is not always clear, because there can be many ways to use a column in a model.

For example, consider the 'disagree, somewhat disagree, somewhat agree, agree' type of question in a linear model. There are at least two ways to use those columns. We could encode each option as a categorical value. If we do this, we lose the natural ordering (disagree seems like it should have a smaller value than agree), so the model has to learn this relationship. There is also a variable for each categorical value, which makes the linear model large and easy to overfit. Another option is to convert these values into a numerical column (using, say, disagree=-2, somewhat disagree=-1, somewhat agree=1, agree=2), so that the linear model has to learn just one weight. However, the spacing between categories now matters: is it correct that 'agree' is weighted twice as strongly as 'somewhat agree'? Picking how to encode data columns is part of feature engineering, and it is domain and problem specific.
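
To make the trade-off concrete, here is a minimal sketch of both encodings using pandas. The toy `answers` series and the `ordinal_scale` mapping are illustrative assumptions, not values taken from the survey file.

```python
import pandas as pd

# Hypothetical responses to one agree/disagree question (not real survey rows).
answers = pd.Series(['Agree', 'Somewhat agree', 'Disagree', 'Agree'])

# Option 1: categorical encoding. One indicator column per distinct answer,
# so a linear model learns a separate weight for each answer.
one_hot = pd.get_dummies(answers)

# Option 2: numerical encoding. A single column, so a linear model learns
# one weight, but the spacing between answers is fixed by this mapping.
ordinal_scale = {
    'Disagree': -2,
    'Somewhat disagree': -1,
    'Somewhat agree': 1,
    'Agree': 2,
}
ordinal = answers.map(ordinal_scale)

print(one_hot.shape)     # (4, 3): four rows, three distinct answers
print(ordinal.tolist())  # [2, 1, -2, 2]
```

The one-hot frame grows by one column for every distinct answer, while the ordinal series stays a single column whatever the number of answer options.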

In this notebook, we will do the simplest thing:

* columns with one categorical response will be encoded with a one-hot vector
* columns with multiple categorical responses will be encoded with bag-of-words vector
* columns with numerical values will be encoded as numbers with no transformation
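
As a rough illustration of the first two encodings, the sketch below builds a one-hot vector and a bag-of-words vector in plain Python. The vocabularies and helper functions are made up for this example; how MLWorkbench represents these vectors internally may differ.

```python
def one_hot(value, vocabulary):
    """Return a vector with a single 1 at the position of `value`."""
    return [1 if value == term else 0 for term in vocabulary]


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary term occurs among space-separated tokens."""
    tokens = text.split()
    return [tokens.count(term) for term in vocabulary]


# Hypothetical vocabularies for a single-answer and a multi-answer column.
tabs_vocab = ['Both', 'Spaces', 'Tabs']
language_vocab = ['JavaScript', 'Python', 'Ruby', 'SQL']

print(one_hot('Spaces', tabs_vocab))               # [0, 1, 0]
print(bag_of_words('Python SQL', language_vocab))  # [0, 1, 0, 1]
```

A one-hot vector always has exactly one non-zero entry, while a bag-of-words vector can have several, which is why it suits columns where a respondent may pick many options.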


In [57]:
import os
import csv
import re
import pandas as pd
import six
import string
import random
import numpy as np
import json

In [73]:
survey_results_path = os.path.join(WORKSPACE_PATH, 'survey_results_public.csv')
survey_schema_path = os.path.join(WORKSPACE_PATH, 'survey_results_schema.csv')

# Clean data 
train_data_path = os.path.join(WORKSPACE_PATH, 'train.csv')
eval_data_path = os.path.join(WORKSPACE_PATH, 'eval.csv')
schema_path = os.path.join(WORKSPACE_PATH, 'schema.json')
transform_path = os.path.join(WORKSPACE_PATH, 'transforms.json')

# For analyze step
analyze_output = os.path.join(WORKSPACE_PATH, 'analyze_output')

# For the transform step
transform_output = os.path.join(WORKSPACE_PATH, 'transform_output')
transformed_train_pattern = os.path.join(transform_output, 'features_train*')
transformed_eval_pattern = os.path.join(transform_output, 'features_eval*')

# For the training step
training_output = os.path.join(WORKSPACE_PATH, 'training_output')

# For the prediction steps
batch_predict_output = os.path.join(WORKSPACE_PATH, 'batch_predict_output')
evaluation_model = os.path.join(training_output, 'evaluation_model')
regular_model = os.path.join(training_output, 'model')

In [5]:
if not os.path.isfile(survey_results_path) or not os.path.isfile(survey_schema_path):
  print('Error: the data files are missing!')

In [6]:
# Get CSV headers as a list of column names.
with open(survey_schema_path, 'r') as f:
  reader = csv.reader(f)
  next(reader) # skip header
  headers = [r[0] for r in reader]


To use the data with MLWorkbench, it needs to be cleaned in a few ways:

* missing values should be empty in the csv file, not 'NA'.
* for multiple categorical columns, the data has each value separated by a semicolon, but MLWorkbench separates tokens by spaces.
* some columns have non-ascii values, but only ascii is supported.
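
The first point is largely covered by pandas defaults, as the following sketch suggests: `read_csv` parses 'NA' as a missing value, and `to_csv` writes missing values as empty fields. The two-row input is a hypothetical extract, not real survey data.

```python
import io
import pandas as pd

# Hypothetical two-row extract; 'NA' marks a missing answer in the raw file.
raw = io.StringIO('Respondent,Country,Salary\n1,NA,NA\n2,Canada,50000\n')
df = pd.read_csv(raw)

# read_csv treats 'NA' as missing by default, so both cells become NaN.
print(df.isnull().sum().tolist())  # [0, 1, 1]

# to_csv writes missing values as empty fields by default (na_rep='').
out = df.to_csv(header=False, index=False)
print(out.splitlines()[0])  # 1,,
```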

In [26]:
def update_multi_label_cols(v):
  """Make labels 1 token long.
  Example:
      Before: Stock options; Annual bonus; Vacation/days off; Equipment; Meals
      After: Stock_options Annual_bonus Vacation/days_off Equipment Meals
  """
  if isinstance(v, float):
    return v
  v = v.replace('; ', ';')
  v = v.replace(' ', '_')
  v = v.replace(';', ' ')
  return v

def convert_to_ascii(v):
  """Keep only printable ASCII characters; numbers pass through unchanged."""
  if isinstance(v, (float, int)):
    return v
  # join() is needed because filter() returns an iterator on Python 3.
  return ''.join(ch for ch in v if ch in string.printable)

In [8]:
single_label_cols = []
numerical_cols = []
multi_label_cols = []
key_cols = []
target_col = None

In [9]:
key_cols.append('Respondent')
target_col = 'Professional'
single_label_cols.append('ProgramHobby')
single_label_cols.append('Country')
single_label_cols.append('University')
single_label_cols.append('EmploymentStatus')
single_label_cols.append('FormalEducation')
single_label_cols.append('MajorUndergrad')
single_label_cols.append('HomeRemote')
single_label_cols.append('CompanySize') # bucket range
single_label_cols.append('CompanyType')
single_label_cols.append('YearsProgram') # bucket range
single_label_cols.append('YearsCodedJob') # bucket range
single_label_cols.append('YearsCodedJobPast') # bucket range
multi_label_cols.append('DeveloperType')
single_label_cols.append('WebDeveloperType')
multi_label_cols.append('MobileDeveloperType')
multi_label_cols.append('NonDeveloperType')
numerical_cols.append('CareerSatisfaction')
numerical_cols.append('JobSatisfaction')
single_label_cols.append('ExCoderReturn')
single_label_cols.append('ExCoderNotForMe')
single_label_cols.append('ExCoderBalance')
single_label_cols.append('ExCoder10Years')
single_label_cols.append('ExCoderBelonged')
single_label_cols.append('ExCoderSkills')
single_label_cols.append('ExCoderWillNotCode')
single_label_cols.append('ExCoderActive')
single_label_cols.append('PronounceGIF')
single_label_cols.append('ProblemSolving')
single_label_cols.append('BuildingThings')
single_label_cols.append('LearningNewTech')
single_label_cols.append('BoringDetails')
single_label_cols.append('JobSecurity')
single_label_cols.append('DiversityImportant')
single_label_cols.append('AnnoyingUI')
single_label_cols.append('FriendsDevelopers')
single_label_cols.append('RightWrongWay')
single_label_cols.append('UnderstandComputers')
single_label_cols.append('SeriousWork')
single_label_cols.append('InvestTimeTools')
single_label_cols.append('WorkPayCare')
single_label_cols.append('KinshipDevelopers')
single_label_cols.append('ChallengeMyself')
single_label_cols.append('CompetePeers')
single_label_cols.append('ChangeWorld')
single_label_cols.append('JobSeekingStatus')
numerical_cols.append('HoursPerWeek')
single_label_cols.append('LastNewJob') # bucket range
single_label_cols.append('AssessJobIndustry')
single_label_cols.append('AssessJobRole')
single_label_cols.append('AssessJobExp')
single_label_cols.append('AssessJobDept')
single_label_cols.append('AssessJobTech')
single_label_cols.append('AssessJobProjects')
single_label_cols.append('AssessJobCompensation')
single_label_cols.append('AssessJobOffice')
single_label_cols.append('AssessJobCommute')
single_label_cols.append('AssessJobRemote')
single_label_cols.append('AssessJobLeaders')
single_label_cols.append('AssessJobProfDevel')
single_label_cols.append('AssessJobDiversity')
single_label_cols.append('AssessJobProduct')
single_label_cols.append('AssessJobFinances')
multi_label_cols.append('ImportantBenefits')
single_label_cols.append('ClickyKeys')
multi_label_cols.append('JobProfile')
single_label_cols.append('ResumePrompted')
single_label_cols.append('LearnedHiring')
single_label_cols.append('ImportantHiringAlgorithms')
single_label_cols.append('ImportantHiringTechExp')
single_label_cols.append('ImportantHiringCommunication')
single_label_cols.append('ImportantHiringOpenSource')
single_label_cols.append('ImportantHiringPMExp')
single_label_cols.append('ImportantHiringCompanies')
single_label_cols.append('ImportantHiringTitles')
single_label_cols.append('ImportantHiringEducation')
single_label_cols.append('ImportantHiringRep')
single_label_cols.append('ImportantHiringGettingThingsDone')
single_label_cols.append('Currency')
single_label_cols.append('Overpaid')
single_label_cols.append('TabsSpaces')
single_label_cols.append('EducationImportant')
multi_label_cols.append('EducationTypes')
multi_label_cols.append('SelfTaughtTypes')
single_label_cols.append('TimeAfterBootcamp')
multi_label_cols.append('CousinEducation')
single_label_cols.append('WorkStart')
multi_label_cols.append('HaveWorkedLanguage')
multi_label_cols.append('WantWorkLanguage')
multi_label_cols.append('HaveWorkedFramework')
multi_label_cols.append('WantWorkFramework')
multi_label_cols.append('HaveWorkedDatabase')
multi_label_cols.append('WantWorkDatabase')
multi_label_cols.append('HaveWorkedPlatform')
multi_label_cols.append('WantWorkPlatform')
multi_label_cols.append('IDE')
single_label_cols.append('AuditoryEnvironment')
multi_label_cols.append('Methodology')
single_label_cols.append('VersionControl')
single_label_cols.append('CheckInCode')
single_label_cols.append('ShipIt')
single_label_cols.append('OtherPeoplesCode')
single_label_cols.append('ProjectManagement')
single_label_cols.append('EnjoyDebugging')
single_label_cols.append('InTheZone')
single_label_cols.append('DifficultCommunication')
single_label_cols.append('CollaborateRemote')
multi_label_cols.append('MetricAssess')
single_label_cols.append('EquipmentSatisfiedMonitors')
single_label_cols.append('EquipmentSatisfiedCPU')
single_label_cols.append('EquipmentSatisfiedRAM')
single_label_cols.append('EquipmentSatisfiedStorage')
single_label_cols.append('EquipmentSatisfiedRW')
single_label_cols.append('InfluenceInternet')
single_label_cols.append('InfluenceWorkstation')
single_label_cols.append('InfluenceHardware')
single_label_cols.append('InfluenceServers')
single_label_cols.append('InfluenceTechStack')
single_label_cols.append('InfluenceDeptTech')
single_label_cols.append('InfluenceVizTools')
single_label_cols.append('InfluenceDatabase')
single_label_cols.append('InfluenceCloud')
single_label_cols.append('InfluenceConsultants')
single_label_cols.append('InfluenceRecruitment')
single_label_cols.append('InfluenceCommunication')
single_label_cols.append('StackOverflowDescribes')
numerical_cols.append('StackOverflowSatisfaction')
multi_label_cols.append('StackOverflowDevices')
single_label_cols.append('StackOverflowFoundAnswer')
single_label_cols.append('StackOverflowCopiedCode')
single_label_cols.append('StackOverflowJobListing')
single_label_cols.append('StackOverflowCompanyPage')
single_label_cols.append('StackOverflowJobSearch')
single_label_cols.append('StackOverflowNewQuestion')
single_label_cols.append('StackOverflowAnswer')
single_label_cols.append('StackOverflowMetaChat')
single_label_cols.append('StackOverflowAdsRelevant')
single_label_cols.append('StackOverflowAdsDistracting')
single_label_cols.append('StackOverflowModeration')
single_label_cols.append('StackOverflowCommunity')
single_label_cols.append('StackOverflowHelpful')
single_label_cols.append('StackOverflowBetter')
single_label_cols.append('StackOverflowWhatDo')
single_label_cols.append('StackOverflowMakeMoney')
single_label_cols.append('Gender')
single_label_cols.append('HighestEducationParents')
multi_label_cols.append('Race')
single_label_cols.append('SurveyLong')
single_label_cols.append('QuestionsInteresting')
single_label_cols.append('QuestionsConfusing')
single_label_cols.append('InterestedAnswers')
numerical_cols.append('Salary')
numerical_cols.append('ExpectedSalary')


In [10]:
# Check we didn't miss a column
assert len(single_label_cols + multi_label_cols + numerical_cols + key_cols + [target_col]) == len(headers)

In [27]:
def cleanup_df(df):
  for col in headers:
    df[col] = df[col].apply(convert_to_ascii)
    
    if col in multi_label_cols:
      df[col] = df[col].apply(update_multi_label_cols)

df_all = pd.read_csv(survey_results_path, header=0, names=headers)
cleanup_df(df_all)      

In [33]:
random.random()

0.6232773056004386

In [45]:

random_index = []
for i in range(len(df_all)):
  if random.random() < 0.8:
    random_index.append(True)
  else:
    random_index.append(False)
  


#df_all.to_csv('/content/datalab/so/cleaned_train.csv', header=False, index=False)

In [53]:
x = np.array(random_index)
df_train = df_all[x]
df_eval = df_all[np.logical_not(x)]
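
The split above draws a fresh random mask on every run, so the train and eval files change each time. If a repeatable split is wanted, one option is a seeded NumPy generator. This is a sketch of an alternative, not what the notebook does; the toy frame and the seed value are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_all; the seed value 0 is an arbitrary choice.
df_toy = pd.DataFrame({'Respondent': range(1, 11)})
rng = np.random.RandomState(0)

# True marks a training row; about 80% of rows on average.
train_mask = rng.rand(len(df_toy)) < 0.8
df_toy_train = df_toy[train_mask]
df_toy_eval = df_toy[~train_mask]

# Every row lands in exactly one of the two sets.
print(len(df_toy_train) + len(df_toy_eval) == len(df_toy))  # True
```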

In [55]:

df_train.to_csv(train_data_path, header=False, index=False)
df_eval.to_csv(eval_data_path, header=False, index=False)

Build the schema file.

In [58]:
schema = []
for h in headers:
  entry = {'name': h}
  if h in numerical_cols:
    entry['type']= 'FLOAT'
  elif h in key_cols:
    entry['type'] = 'INTEGER'
  else:
    entry['type'] = 'STRING'
  schema.append(entry)
  

with open(schema_path, 'w') as f:
  f.write(json.dumps(schema))

Build the transforms file.

In [60]:
transforms = {}
for h in headers:
  if h in numerical_cols:
    transform = 'scale'
  elif h in key_cols:
    transform = 'key'
  elif h == target_col:
    transform = 'target'
  elif h in multi_label_cols:
    transform = 'bag_of_words'
  elif h in single_label_cols:
    transform = 'one_hot'
  else:
    print('Error: %s is an unknown label' % h)
    break
  
  transforms[h] = {'transform': transform}
  
with open(transform_path, 'w') as f:
  f.write(json.dumps(transforms))
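
For orientation, the schema and transforms files have roughly the following shape. The three-column subset below is a hypothetical stand-in for the full header list, and the consistency check is an extra suggestion rather than something MLWorkbench requires.

```python
import json

# Hypothetical subset of columns, mirroring the structure built above.
toy_schema = [
    {'name': 'Respondent', 'type': 'INTEGER'},
    {'name': 'Country', 'type': 'STRING'},
    {'name': 'Salary', 'type': 'FLOAT'},
]
toy_transforms = {
    'Respondent': {'transform': 'key'},
    'Country': {'transform': 'one_hot'},
    'Salary': {'transform': 'scale'},
}

# Every column in the schema should have exactly one transform entry.
schema_names = {column['name'] for column in toy_schema}
assert schema_names == set(toy_transforms)

print(json.dumps(toy_schema))
print(json.dumps(toy_transforms, sort_keys=True))
```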
    
  

# Analyze

In [63]:
import google.datalab.contrib.mlworkbench.commands



In [64]:
%%ml analyze
output: $analyze_output
training_data:
    csv: $train_data_path
    schema: $schema
features: $transforms

Expanding any file patterns...
file list computed.
Analyzing file /content/datalab/workspace/structured_data_classification_stackoverflow/train.csv...
file /content/datalab/workspace/structured_data_classification_stackoverflow/train.csv analyzed.


In [None]:
!ls $analyze_output

# Transform

Each transform step should take at most a few minutes.

In [None]:
!rm -r -f $transform_output

In [67]:
%%ml transform
output: $transform_output
analysis: $analyze_output
prefix: features_train
training_data:
    csv: $train_data_path

2017-07-13 21:02:34.845079: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-13 21:02:34.845427: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-13 21:02:34.846021: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-13 21:02:34.846057: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-13 21:02:34.846074: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't

In [71]:
%%ml transform
output: $transform_output
analysis: $analyze_output
prefix: features_eval
training_data:
    csv: $eval_data_path

2017-07-13 21:07:55.577418: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-13 21:07:55.577467: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-13 21:07:55.578654: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-13 21:07:55.578737: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-13 21:07:55.578766: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't

In [72]:
!wc -l $transform_output/errors*

0 /content/datalab/workspace/structured_data_classification_stackoverflow/transform_output/errors_features_eval-00000-of-00001.txt
0 /content/datalab/workspace/structured_data_classification_stackoverflow/transform_output/errors_features_train-00000-of-00001.txt
0 total


# Training

In [74]:
%%ml train
output: $training_output
analysis: $analyze_output
training_data:
    transformed: $transformed_train_pattern
evaluation_data:
    transformed: $transformed_eval_pattern
model_args:
    model: dnn_classification
    hidden-layer-size1: 100
    max-steps: 5000
    top-n: 2
    save-checkpoints-secs: 60

INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_task_type': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f75a83b7710>, '_model_dir': '/content/datalab/workspace/structured_data_classification_stackoverflow/training_output/train', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_session_config': None, '_tf_random_seed': None, '_environment': 'local', '_num_worker_replicas': 0, '_task_id': 0, '_save_summary_steps': 100, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_evaluation_master': '', '_master': ''}
Instructions for updating:
Monitors are deprecated. Please use tf.train.SessionRunHook.
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope th

# Prediction

In [None]:
!head -n 3 $eval_data_path

In [None]:
headers_string = ','.join(headers)

In [None]:
%%ml predict --model /content/datalab/so/model/evaluation_model --headers $headers_string
prediction_data:
- 2,Student,Yes both,United Kingdom,Yes full-time,Employed part-time,Some college/university study without earning a bachelors degree,Computer science or software engineering,More than half but not all the time,20 to 99 employees,Privately-held limited company not in startup mode,9.0,,,,,,,,,,,,,,,,,With a hard g like gift,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,No,Other,,Some other way,Important,Important,Important,Important,Somewhat important,Somewhat important,Not very important,Somewhat important,Not very important,Very important,British pounds sterling (),,Spaces,,Online course  Self-taught  Hackathon  Open source contributions,Official documentation  Stack Overflow Q&A  Other,,,10:00 AM,JavaScript  Python  Ruby  SQL,Java  Python  Ruby  SQL,.NET Core,.NET Core,MySQL  SQLite,MySQL  SQLite,Amazon Web Services (AWS),Linux Desktop  Raspberry Pi  Amazon Web Services (AWS),Atom  Notepad++  Vim  PyCharm  RubyMine  Visual Studio  Visual Studio Code,Put on some ambient sounds (e.g. whale songs forest sounds),,Git,Multiple times a day,Agree,Disagree,Strongly disagree,Agree,Somewhat agree,Disagree,Strongly disagree,Customer satisfaction  On time/in budget  Peers rating  Self-rating,Not very satisfied,Satisfied,Satisfied,Satisfied,Somewhat satisfied,Satisfied,No influence at all,No influence at all,No influence at all,No influence at all,No influence at all,No influence at all,No influence at all,No influence at all,No influence at all,No influence at all,No influence at all,I have created a CV or Developer Story on Stack Overflow,8.0,Desktop  iOS browser  iOS app  Android browser  Android app,Several times,Several times,Once or twice,Once or twice,Once or twice,Havent done at all,Several times,At least once each week,Disagree,Strongly disagree,Strongly disagree,Strongly agree,Agree,Strongly agree,Strongly agree,-2.0,Male,A masters degree,White or of European descent,Somewhat agree,Somewhat agree,Disagree,Strongly agree,,37500.0
- 4,Professional non-developer who sometimes writes code,Yes both,United States,No,Employed full-time,Doctoral degree,A non-computer-focused engineering discipline,Less than half the time but at least one day each week,10000 or more employees,Non-profit/non-governmental organization or private school/university,14.0,9.0,,,,,Data scientist,6.0,3.0,,,,,,,,,With a soft g like jiff,2.0,2.0,2.0,-1.0,0.0,1.0,1.0,1.0,0.0,-2.0,2.0,1.0,-1.0,2.0,2.0,0.0,1.0,I am actively looking for a job,5.0,Between 2 and 4 years ago,Somewhat important,Somewhat important,Somewhat important,Important,Important,Very important,Important,Very important,Important,Somewhat important,Not very important,Very important,Important,Very important,Very important,Stock options  Annual bonus  Health benefits  Equipment  Private office,Yes,LinkedIn  Other,,A friend family member or former colleague told me,Somewhat important,Somewhat important,Very important,Very important,Somewhat important,Somewhat important,Not very important,Not very important,Important,Very important,,,Spaces,,,,,,9:00 AM,Matlab  Python  R  SQL,Matlab  Python  R  SQL,React,Hadoop  Node.js  React,MongoDB  Redis  SQL Server  MySQL  SQLite,MongoDB  Redis  SQL Server  MySQL  SQLite,Windows Desktop  Linux Desktop  Mac OS  Amazon Web Services (AWS),Windows Desktop  Linux Desktop  Mac OS  Amazon Web Services (AWS),Notepad++  Sublime Text  TextMate  Vim  IPython / Jupyter  NetBeans  PyCharm  Xcode,Turn on some music,Agile,Git,Multiple times a day,Somewhat agree,Agree,Somewhat agree,Somewhat agree,Strongly agree,Disagree,Somewhat agree,,,,,,,,,,,,,,,,,,,I have created a CV or Developer Story on Stack Overflow,10.0,Desktop  iOS browser  iOS app,At least once each week,Several times,At least once each week,Several times,At least once each week,Several times,At least once each day,At least once each day,Agree,Strongly disagree,Strongly disagree,Strongly agree,Strongly agree,Agree,Strongly agree,-1.0,Male,A doctoral degree,White or of European descent,Agree,Agree,Somewhat agree,Strongly agree,,
- 5,Professional developer,Yes I program as a hobby,Switzerland,No,Employed full-time,Masters degree,Computer science or software engineering,Never,10 to 19 employees,Privately-held limited company not in startup mode,21.0,10.0,,Mobile developer  Graphics programming  Desktop applications developer,,,,6.0,8.0,,,,,,,,,With a soft g like jiff,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Satisfied,Satisfied,Satisfied,Satisfied,Satisfied,Satisfied,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

In [None]:
%%ml batch_predict --model /content/datalab/so/model/evaluation_model  --output_dir /content/datalab/so/predict --output_format csv
prediction_data:
  csv_file_pattern: /content/datalab/so/cleaned_eval.csv

In [None]:
!ls /content/datalab/so/predict/

In [None]:
import pandas as pd
import json

with open('/content/datalab/so/predict/predict_results_schema.json', 'r') as f:
  predict_schema = json.load(f)
  
df = pd.read_csv('/content/datalab/so/predict/predict_results_cleaned_eval.csv', header=None, names=[x['name'] for x in predict_schema])
correct = sum([1 if row['predicted'] == row['target'] else 0 for index, row in df.iterrows()])
accuracy = correct / float(len(df.index))
print('accuracy = %f' % accuracy)
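
The row-by-row loop above can also be written as a vectorized comparison, which tends to be faster on large result files. The sketch below uses a toy frame with made-up rows; only the `predicted` and `target` column names are taken from the cell above.

```python
import pandas as pd

# Toy prediction results; the rows are invented for illustration.
df_toy = pd.DataFrame({
    'predicted': ['Student', 'Professional developer', 'Student'],
    'target': ['Student', 'Professional developer', 'Professional developer'],
})

# Element-wise comparison gives a boolean Series; its mean is the accuracy.
accuracy = (df_toy['predicted'] == df_toy['target']).mean()
print('accuracy = %f' % accuracy)  # accuracy = 0.666667
```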

In [None]:
predict_schema

In [None]:
from google.datalab.ml import ConfusionMatrix

cm = ConfusionMatrix.from_csv('/content/datalab/so/predict/predict_results_cleaned_eval.csv', headers=[x['name'] for x in predict_schema])
cm.plot()