# About the data

In this sequence of notebooks, we will build a classification model using the StackOverflow 2017 survey dataset. This data comes from a survey of StackOverflow users and contains many interesting features about the types of software developers who use the site. The dataset does not represent a real-world prediction problem, as new data is unlikely to be generated. However, it is a nice example for showing how to process different feature types.

One of the columns in the data describes the class of software developer of the survey respondent: "Student", "Professional developer", "Used to be a professional developer", and "Professional non-developer who sometimes writes code". We will use the other columns in the data to predict this developer class label.

# Download data

See https://www.kaggle.com/stackoverflow/so-survey-2017 for a longer overview of the StackOverflow dataset. The survey has about 64,000 responses from StackOverflow users. Note that the data was taken from StackOverflow and is licensed under the ODbL. See the Kaggle website for more information.


Download the survey results data and schema to /content/datalab/workspace/structured_data_classification_stackoverflow or another location. If you use a different location, you have to change the workspace path below.

In [2]:
WORKSPACE_PATH = '/content/datalab/workspace/structured_data_classification_stackoverflow'

In [3]:
!ls $WORKSPACE_PATH

analyze_output	      schema.json		 training_output
batch_predict_output  survey_results_public.csv  transform_output
clean_input	      survey_results_schema.csv  transforms.json
eval.csv	      train.csv


# Clean up the data

We have to decide how we are going to model the data. For each column, we have to ask whether it represents a numerical value, one categorical value, or many categories. This is not always clear, as there can be many ways to use a column in a model.

For example, consider the "disagree, somewhat disagree, somewhat agree, agree" type questions in a linear model. There are at least two ways we could use those columns. We could encode each option as a categorical value. If we do this, we lose the natural ordering (disagree seems like it should have a smaller value than agree), and so our model has to learn this relationship. Also, there is a variable for each categorical value, making the linear model large and easy to overfit. Another option is to convert these values into a numerical column (using, say, disagree=-2, somewhat disagree=-1, somewhat agree=1, agree=2), and now the linear model has to learn just one weight. However, the distance between two categories is now important. Is it correct that 'agree' is weighted twice as strongly as 'somewhat agree'? In general, categorical values should be encoded as categories, because this gives the model more variables with which to learn relationships. Picking how to encode data columns is part of feature engineering, and it is domain and problem specific.
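The two options can be compared on a toy column. The sketch below is illustrative only: the `answers` series is made up, and pandas is used for the encoding rather than ML Workbench.

```python
import pandas as pd

# Hypothetical answers to a single agree/disagree question.
answers = pd.Series(['agree', 'somewhat disagree', 'disagree', 'somewhat agree'])

# Option 1: categorical encoding, one indicator column per distinct answer.
one_hot = pd.get_dummies(answers)

# Option 2: numerical encoding, using the mapping suggested in the text.
scale = {'disagree': -2, 'somewhat disagree': -1, 'somewhat agree': 1, 'agree': 2}
numeric = answers.map(scale)

print(one_hot.shape)     # one column per distinct answer
print(numeric.tolist())  # a single numerical column
```

With the categorical encoding, a linear model learns one weight per answer; with the numerical encoding, it learns a single weight for the whole column.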

In this notebook, we will do the simplest thing:

* columns with one categorical response will be encoded with a one-hot vector. Example: encode the day of the week with a vector of length 7. The value 'Monday' is encoded as [1, 0, 0, 0, 0, 0, 0].
* columns with multiple categorical responses will be encoded with a bag-of-words vector. Example: encode which programming languages are used from the list ['Java', 'Python', 'C++', 'JavaScript'] as a vector of length 4. The value 'Java C++' is encoded as [1, 0, 1, 0].
* columns with numerical values will be encoded as numbers with no transformation.
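The first two encodings can be reproduced by hand. This is a minimal sketch for illustration; the helper names `one_hot` and `bag_of_words` are hypothetical, and in the actual workflow ML Workbench performs these transforms itself.

```python
def one_hot(value, vocabulary):
    """Return a vector with a 1 at the position of `value` in `vocabulary`."""
    return [1 if value == item else 0 for item in vocabulary]


def bag_of_words(value, vocabulary):
    """Count how often each vocabulary item appears among the space-separated tokens."""
    tokens = value.split()
    return [tokens.count(item) for item in vocabulary]


days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
languages = ['Java', 'Python', 'C++', 'JavaScript']

print(one_hot('Monday', days))                # [1, 0, 0, 0, 0, 0, 0]
print(bag_of_words('Java C++', languages))    # [1, 0, 1, 0]
```

Note that the bag-of-words helper matches whole tokens, so 'Java' does not also match 'JavaScript'.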


In [4]:
import os
import csv
import re
import pandas as pd
import six
import string
import random
import numpy as np
import json

In [5]:
survey_results_path = os.path.join(WORKSPACE_PATH, 'survey_results_public.csv')
survey_schema_path = os.path.join(WORKSPACE_PATH, 'survey_results_schema.csv')

# Clean data 
clean_folder = os.path.join(WORKSPACE_PATH, 'clean_input')
train_data_path = os.path.join(clean_folder, 'train.csv')
eval_data_path = os.path.join(clean_folder, 'eval.csv')
schema_path = os.path.join(clean_folder, 'schema.json')
transform_path = os.path.join(clean_folder, 'transforms.json')

In [6]:
!mkdir -p $clean_folder

In [7]:
if not os.path.isfile(survey_results_path) or not os.path.isfile(survey_schema_path):
    print('Error: the data files are missing!')
    print('Download the data and schema files from https://www.kaggle.com/stackoverflow/so-survey-2017')
    print('and put them into the folder ' + WORKSPACE_PATH)

In [8]:
# Get CSV headers as a list of column names.
with open(survey_schema_path, 'r') as f:
    reader = csv.reader(f)
    next(reader) # skip header
    headers = [r[0] for r in reader]

To use the data with ML Workbench, the data needs to be cleaned in a few ways:

* missing values should be empty fields in the csv file, not 'NA'.
* for multiple categorical columns, the data has each value separated by a semicolon, but ML Workbench separates tokens by spaces.
* some columns have non-ascii values, but only ascii is supported.

We will use the two functions below to fix the ascii and multiple categorical encoding issues. The missing/NA issue is handled by pandas: it reads 'NA' as a missing value and writes missing values as empty fields when the data is saved.
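The pandas behavior for missing values can be checked on a tiny in-memory file. This is a sketch written for Python 3, and the two-row CSV is made up for the example.

```python
import io

import pandas as pd

# A made-up two-row CSV in which one salary is recorded as 'NA'.
raw = io.StringIO('Respondent,Salary\n1,NA\n2,50000\n')

# 'NA' is one of the strings read_csv treats as missing by default.
df = pd.read_csv(raw)

# to_csv writes missing values as empty fields by default (na_rep='').
out = io.StringIO()
df.to_csv(out, header=False, index=False)

print(df['Salary'].isna().tolist())
print(out.getvalue())
```

The first row is written back with an empty salary field instead of the literal text 'NA'.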

In [9]:
def update_multi_label_cols(v):
    """Make labels 1 token long.
    Example:
        Before: Stock options; Annual bonus; Vacation/days off; Equipment; Meals
        After: Stock_options Annual_bonus Vacation/days_off Equipment Meals
    """
    if isinstance(v, float):
        return v
    v = v.replace('; ', ';')
    v = v.replace(' ', '_')
    v = v.replace(';', ' ')
    return v

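# Build the set of allowed characters once rather than once per character.
PRINTABLE_CHARS = set(string.printable)

def convert_to_ascii(v):
    """Remove characters that are not printable ascii."""
    if isinstance(v, (float, int)):
        return v
    # Join explicitly: in Python 3, filter() returns an iterator, not a string.
    return ''.join(ch for ch in v if ch in PRINTABLE_CHARS)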

In [10]:
# We divide the data columns into how we will transform them.
single_label_cols = []
numerical_cols = []
multi_label_cols = []
key_cols = []
target_col = None

In [11]:
key_cols.append('Respondent')
target_col = 'Professional'
single_label_cols.append('ProgramHobby')
single_label_cols.append('Country')
single_label_cols.append('University')
single_label_cols.append('EmploymentStatus')
single_label_cols.append('FormalEducation')
single_label_cols.append('MajorUndergrad')
single_label_cols.append('HomeRemote')
single_label_cols.append('CompanySize') # bucket range
single_label_cols.append('CompanyType')
single_label_cols.append('YearsProgram') # bucket range
single_label_cols.append('YearsCodedJob') # bucket range
single_label_cols.append('YearsCodedJobPast') # bucket range
multi_label_cols.append('DeveloperType')
single_label_cols.append('WebDeveloperType')
multi_label_cols.append('MobileDeveloperType')
multi_label_cols.append('NonDeveloperType')
numerical_cols.append('CareerSatisfaction')
numerical_cols.append('JobSatisfaction')
single_label_cols.append('ExCoderReturn')
single_label_cols.append('ExCoderNotForMe')
single_label_cols.append('ExCoderBalance')
single_label_cols.append('ExCoder10Years')
single_label_cols.append('ExCoderBelonged')
single_label_cols.append('ExCoderSkills')
single_label_cols.append('ExCoderWillNotCode')
single_label_cols.append('ExCoderActive')
single_label_cols.append('PronounceGIF')
single_label_cols.append('ProblemSolving')
single_label_cols.append('BuildingThings')
single_label_cols.append('LearningNewTech')
single_label_cols.append('BoringDetails')
single_label_cols.append('JobSecurity')
single_label_cols.append('DiversityImportant')
single_label_cols.append('AnnoyingUI')
single_label_cols.append('FriendsDevelopers')
single_label_cols.append('RightWrongWay')
single_label_cols.append('UnderstandComputers')
single_label_cols.append('SeriousWork')
single_label_cols.append('InvestTimeTools')
single_label_cols.append('WorkPayCare')
single_label_cols.append('KinshipDevelopers')
single_label_cols.append('ChallengeMyself')
single_label_cols.append('CompetePeers')
single_label_cols.append('ChangeWorld')
single_label_cols.append('JobSeekingStatus')
numerical_cols.append('HoursPerWeek')
single_label_cols.append('LastNewJob') # bucket range
single_label_cols.append('AssessJobIndustry')
single_label_cols.append('AssessJobRole')
single_label_cols.append('AssessJobExp')
single_label_cols.append('AssessJobDept')
single_label_cols.append('AssessJobTech')
single_label_cols.append('AssessJobProjects')
single_label_cols.append('AssessJobCompensation')
single_label_cols.append('AssessJobOffice')
single_label_cols.append('AssessJobCommute')
single_label_cols.append('AssessJobRemote')
single_label_cols.append('AssessJobLeaders')
single_label_cols.append('AssessJobProfDevel')
single_label_cols.append('AssessJobDiversity')
single_label_cols.append('AssessJobProduct')
single_label_cols.append('AssessJobFinances')
multi_label_cols.append('ImportantBenefits')
single_label_cols.append('ClickyKeys')
multi_label_cols.append('JobProfile')
single_label_cols.append('ResumePrompted')
single_label_cols.append('LearnedHiring')
single_label_cols.append('ImportantHiringAlgorithms')
single_label_cols.append('ImportantHiringTechExp')
single_label_cols.append('ImportantHiringCommunication')
single_label_cols.append('ImportantHiringOpenSource')
single_label_cols.append('ImportantHiringPMExp')
single_label_cols.append('ImportantHiringCompanies')
single_label_cols.append('ImportantHiringTitles')
single_label_cols.append('ImportantHiringEducation')
single_label_cols.append('ImportantHiringRep')
single_label_cols.append('ImportantHiringGettingThingsDone')
single_label_cols.append('Currency')
single_label_cols.append('Overpaid')
single_label_cols.append('TabsSpaces')
single_label_cols.append('EducationImportant')
multi_label_cols.append('EducationTypes')
multi_label_cols.append('SelfTaughtTypes')
single_label_cols.append('TimeAfterBootcamp')
multi_label_cols.append('CousinEducation')
single_label_cols.append('WorkStart')
multi_label_cols.append('HaveWorkedLanguage')
multi_label_cols.append('WantWorkLanguage')
multi_label_cols.append('HaveWorkedFramework')
multi_label_cols.append('WantWorkFramework')
multi_label_cols.append('HaveWorkedDatabase')
multi_label_cols.append('WantWorkDatabase')
multi_label_cols.append('HaveWorkedPlatform')
multi_label_cols.append('WantWorkPlatform')
multi_label_cols.append('IDE')
single_label_cols.append('AuditoryEnvironment')
multi_label_cols.append('Methodology')
single_label_cols.append('VersionControl')
single_label_cols.append('CheckInCode')
single_label_cols.append('ShipIt')
single_label_cols.append('OtherPeoplesCode')
single_label_cols.append('ProjectManagement')
single_label_cols.append('EnjoyDebugging')
single_label_cols.append('InTheZone')
single_label_cols.append('DifficultCommunication')
single_label_cols.append('CollaborateRemote')
multi_label_cols.append('MetricAssess')
single_label_cols.append('EquipmentSatisfiedMonitors')
single_label_cols.append('EquipmentSatisfiedCPU')
single_label_cols.append('EquipmentSatisfiedRAM')
single_label_cols.append('EquipmentSatisfiedStorage')
single_label_cols.append('EquipmentSatisfiedRW')
single_label_cols.append('InfluenceInternet')
single_label_cols.append('InfluenceWorkstation')
single_label_cols.append('InfluenceHardware')
single_label_cols.append('InfluenceServers')
single_label_cols.append('InfluenceTechStack')
single_label_cols.append('InfluenceDeptTech')
single_label_cols.append('InfluenceVizTools')
single_label_cols.append('InfluenceDatabase')
single_label_cols.append('InfluenceCloud')
single_label_cols.append('InfluenceConsultants')
single_label_cols.append('InfluenceRecruitment')
single_label_cols.append('InfluenceCommunication')
single_label_cols.append('StackOverflowDescribes')
numerical_cols.append('StackOverflowSatisfaction')
multi_label_cols.append('StackOverflowDevices')
single_label_cols.append('StackOverflowFoundAnswer')
single_label_cols.append('StackOverflowCopiedCode')
single_label_cols.append('StackOverflowJobListing')
single_label_cols.append('StackOverflowCompanyPage')
single_label_cols.append('StackOverflowJobSearch')
single_label_cols.append('StackOverflowNewQuestion')
single_label_cols.append('StackOverflowAnswer')
single_label_cols.append('StackOverflowMetaChat')
single_label_cols.append('StackOverflowAdsRelevant')
single_label_cols.append('StackOverflowAdsDistracting')
single_label_cols.append('StackOverflowModeration')
single_label_cols.append('StackOverflowCommunity')
single_label_cols.append('StackOverflowHelpful')
single_label_cols.append('StackOverflowBetter')
single_label_cols.append('StackOverflowWhatDo')
single_label_cols.append('StackOverflowMakeMoney')
single_label_cols.append('Gender')
single_label_cols.append('HighestEducationParents')
multi_label_cols.append('Race')
single_label_cols.append('SurveyLong')
single_label_cols.append('QuestionsInteresting')
single_label_cols.append('QuestionsConfusing')
single_label_cols.append('InterestedAnswers')
numerical_cols.append('Salary')
numerical_cols.append('ExpectedSalary')


In [12]:
# Check we didn't miss a column
assert len(single_label_cols + multi_label_cols + numerical_cols + key_cols + [target_col]) == len(headers)
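A length comparison cannot tell which column is missing, and it still passes when one column is listed twice while another is left out. A stricter check is sketched below; the helper `check_column_assignment` and the toy column lists are hypothetical and not part of the original notebook.

```python
from collections import Counter


def check_column_assignment(headers, *groups):
    """Return (missing, duplicated) column names across all column groups."""
    assigned = [col for group in groups for col in group]
    counts = Counter(assigned)
    missing = sorted(set(headers) - set(assigned))
    duplicated = sorted(col for col, n in counts.items() if n > 1)
    return missing, duplicated


# Toy example: 'Country' is listed twice and 'Salary' is forgotten,
# so the total number of assigned names still equals len(headers_demo).
headers_demo = ['Respondent', 'Professional', 'Country', 'Salary']
missing, duplicated = check_column_assignment(
    headers_demo, ['Respondent'], ['Professional'], ['Country', 'Country'])

print(missing, duplicated)
```

In the notebook, the same helper could be called with `headers` and the five column groups, asserting that both returned lists are empty.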

In [13]:
def cleanup_df(df):
    """Cleans the columns of the dataframe in place."""
    for col in headers:
        if col == 'Currency':
            df[col] = df[col].apply(convert_to_ascii)
    
        if col == 'Race':
            df[col] = df[col].apply(convert_to_ascii)
    
        if col in multi_label_cols:
            df[col] = df[col].apply(update_multi_label_cols)

df_all = pd.read_csv(survey_results_path, header=0, names=headers)
cleanup_df(df_all)      

In [14]:
# Randomly split the data into roughly 80% for training and 20% for evaluation.
random_index = []
for i in range(len(df_all)):
    if random.random() < 0.8:
        random_index.append(True)
    else:
        random_index.append(False)

In [15]:
x = np.array(random_index)
df_train = df_all[x]
df_eval = df_all[np.logical_not(x)]
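The split above changes every time the notebook is run because the random generator is not seeded. If a repeatable split is preferred, one option is a seeded NumPy generator. This is an alternative sketch, not the method used in this notebook; `split_train_eval` and its parameters are hypothetical names.

```python
import numpy as np
import pandas as pd


def split_train_eval(df, train_fraction=0.8, seed=0):
    """Randomly split df into (train, eval) frames, repeatable for a given seed."""
    rng = np.random.RandomState(seed)
    # Each row goes to the training set with probability train_fraction.
    mask = rng.rand(len(df)) < train_fraction
    return df[mask], df[~mask]


# Toy dataframe standing in for df_all.
df_demo = pd.DataFrame({'Respondent': range(1000)})
train_demo, eval_demo = split_train_eval(df_demo)

print(len(train_demo), len(eval_demo))
```

As in the original cells, the 80/20 ratio is only approximate, since each row is assigned independently.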

In [16]:
# Save the new data sets
df_train.to_csv(train_data_path, header=False, index=False)
df_eval.to_csv(eval_data_path, header=False, index=False)

Figure out the schema. Categorical columns must be strings.

In [17]:
schema = []
for h in headers:
    entry = {'name': h}
    if h in numerical_cols:
        entry['type'] = 'FLOAT'
    elif h in key_cols:
        entry['type'] = 'INTEGER'
    else:
        entry['type'] = 'STRING'
    schema.append(entry)

with open(schema_path, 'w') as f:
    f.write(json.dumps(schema))

Figure out the transforms. As an exercise, try changing some transforms. For example, replace 'bag_of_words' with 'tfidf', or 'one_hot' with 'embedding'. Run '%%ml analyze --help' in a cell to see the list of transforms.

In [18]:
transforms = {}
for h in headers:
    if h in numerical_cols:
        transform = 'scale'
    elif h in key_cols:
        transform = 'key'
    elif h == target_col:
        transform = 'target'
    elif h in multi_label_cols:
        transform = 'bag_of_words'
    elif h in single_label_cols:
        transform = 'one_hot'
    else:
        print('Error: %s is an unknown label' % h)
        break
    transforms[h] = {'transform': transform}
  
with open(transform_path, 'w') as f:
    f.write(json.dumps(transforms))
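For the exercise suggested above, swapping one transform for another amounts to rewriting entries of the dictionary before it is saved. The sketch below works on a small made-up dictionary with the same shape as `transforms`; the helper `replace_transform` is hypothetical, and whether a given transform needs extra parameters should be checked with '%%ml analyze --help'.

```python
import json

# A small stand-in for the transforms dictionary built above.
transforms_demo = {
    'Respondent': {'transform': 'key'},
    'Professional': {'transform': 'target'},
    'Country': {'transform': 'one_hot'},
    'HaveWorkedLanguage': {'transform': 'bag_of_words'},
    'Salary': {'transform': 'scale'},
}


def replace_transform(transforms, old, new):
    """Return a copy of transforms with every `old` transform replaced by `new`."""
    return {
        name: {'transform': new} if spec['transform'] == old else dict(spec)
        for name, spec in transforms.items()
    }


updated = replace_transform(transforms_demo, 'bag_of_words', 'tfidf')
print(json.dumps(updated, indent=2, sort_keys=True))
```

The helper returns a new dictionary, so the original mapping stays available for comparison.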
    
  

# Next steps

Now that we have a cleaned version of the csv file, we are ready to begin the standard workflow for building and deploying TensorFlow models with Datalab's ML Workbench. Please proceed to the next notebook in this sequence.