# File Overview

This file builds multiple binary classifiers for each country, varying:

* the model type (linear regression with rounding, linear SVM, and random forest)
* the minimum non-NA cutoff for feature inclusion

# Outputs / Assets

* saves the result set in a pickle file for subsequent analysis

In [12]:
ROW_SAMPLE_RATE = 1
COUNTRY_MIN_SAMPLE_CUTOFF = 1000
DUMMY_NA = True
NUMERIC_IMPUTATION_TECHNIQUE="Mean"
CUTOFFS = [5000, 2500, 1000, 500, 250, 100]

In [2]:
import pandas as pd
import pickle

# Developer note: if you modify the local modules, the changes are not picked up until you
# restart the kernel. Dynamic reloading should fix this, but I could not get it working.
# See https://stackoverflow.com/questions/437589/how-do-i-unload-reload-a-python-module

pd.options.display.max_rows = 2000
pd.options.display.max_columns = 1000
pd.options.display.max_colwidth = 255

raw_df = pd.read_csv('./assets/survey_results_public.csv')
schema = pd.read_csv('./assets/survey_results_schema.csv')
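
The developer note in the import cell mentions dynamic reloads. A minimal sketch of one way to do this with the standard library's `importlib.reload` is shown below. The module name `demo_module` and its `VALUE` attribute are hypothetical and exist only for this illustration; they are not part of the `country_classifier` package.

```python
import importlib
import pathlib
import sys
import tempfile

# Create a throwaway module on disk (hypothetical name: demo_module).
module_dir = pathlib.Path(tempfile.mkdtemp())
module_path = module_dir / "demo_module.py"
module_path.write_text("VALUE = 1\n")

# Make the directory importable and import the module once.
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()
import demo_module

value_before = demo_module.VALUE

# Edit the module's source, then reload it without restarting the interpreter.
module_path.write_text("VALUE = 22\n")
demo_module = importlib.reload(demo_module)
value_after = demo_module.VALUE
```

Names bound with `from module import name` keep pointing at the old object after a reload, so they need to be imported again afterwards.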

# Data Preparation

* drop several columns based on bias
* do not drop any columns based on lack of values
* transform "near numeric" columns to numeric
* impute numeric columns using the mean
* encode categorical columns as binary dummy columns, including a "no response" column

## Sample Data Set

To speed up iteration during development, optionally sample the data. With `ROW_SAMPLE_RATE = 1`, all rows are kept (in shuffled order).

In [3]:
df = raw_df.sample(frac=ROW_SAMPLE_RATE, axis=0)

Set `Respondent` as the index

In [3]:
if 'Respondent' in df.columns:
    # set_index returns a new DataFrame, so assign the result back
    df = df.set_index('Respondent')

## Drop Columns

* `CompTotal` is a free-form number that could be hourly, weekly, monthly, or annual compensation, in any currency.
* `CompFreq` may be biased towards specific countries. While useful for predicting country, we want to predict it from developer traits and preferences, not from payment frequency.
* `CurrencyDesc`, `CurrencySymbol`, and `Ethnicity` correlate directly with country. While useful for predicting country, we want to predict it from developer traits and preferences, not from currency or ethnicity.


In [4]:
# Drop the raw compensation columns and the columns that act as direct proxies for country.
# errors='ignore' skips any column that is already absent.
columns_to_drop = ['CompTotal', 'CompFreq', 'CurrencyDesc', 'CurrencySymbol', 'Ethnicity']
df = df.drop(columns=columns_to_drop, errors='ignore')

## Transform "Near Numeric" Columns

(based on analysis performed in [1_basics.ipynb](./1_basics.ipynb))

The `YearsCode`, `YearsCodePro`, and `Age1stCode` columns are strings, but they contain mostly numbers, with a few inequality strings to represent the boundaries.
Convert them to numeric so we can treat them as quantitative variables.

This loses some information (for example, "someone over 50" is converted to exactly 51), but that is an acceptable cost when weighed against the benefits of treating these values as numeric rather than categorical:

* a categorical encoding would add one dummy column per distinct value to the dataset
* if treated as categorical, the models lose the relative proximity of values. In other words, 2 is close to 3, but '2' and '3' are unrelated (from a computing perspective)
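
The conversion described above can be sketched roughly as follows. This is not the code in `country_classifier.convert_age_series_to_numeric`, which is not shown here; `to_numeric_years` is a hypothetical stand-in, and the exact inequality strings ("More than 50 years", "Less than 1 year") are assumptions about how the survey words its boundary answers.

```python
import math
import re


def to_numeric_years(value):
    """Hypothetical converter: map one survey answer to a float."""
    # Missing answers stay missing so they can be imputed later.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return math.nan
    text = str(value).strip().lower()
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        return math.nan
    number = float(match.group())
    # Assumed boundary phrasings: step one unit past the stated bound.
    if text.startswith(("less than", "younger than")):
        return number - 1
    if text.startswith(("more than", "older than")):
        return number + 1
    return number


examples = ["7", "More than 50 years", "Less than 1 year", None]
converted = [to_numeric_years(v) for v in examples]
```

Like the cell below, a function of this shape works on one value at a time, so it would be applied with `Series.map`.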


In [5]:
from country_classifier.convert_age_series_to_numeric import convert_age_series_to_numeric
df['YearsCode'] = df['YearsCode'].map(convert_age_series_to_numeric)
df['YearsCodePro'] = df['YearsCodePro'].map(convert_age_series_to_numeric)
df['Age1stCode'] = df['Age1stCode'].map(convert_age_series_to_numeric)

## Transform "Choose all that apply" survey responses

* the models we chose cannot work with categorical data, nor can they work with missing values
* there are many "choose all that apply" questions on the survey. The responses to these are stored in the dataset as a single string, with the individual answers separated by a ';'
* convert these responses into multiple columns, one column for each individual answer, with a 0 indicating the respondent did not choose that answer and a 1 indicating they did
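
As an illustration of the idea only, pandas can split a ';'-separated column into 0/1 indicator columns with `Series.str.get_dummies`. The real `convert_choose_all_that_apply_responses` function lives in the local module and may work differently; the column name and answers below are made up for the example.

```python
import pandas as pd

# Made-up responses; None stands for a respondent who skipped the question.
responses = pd.Series(
    ["Python;SQL", "SQL", None],
    name="LanguageWorkedWith",  # assumed column name, for illustration only
)

# One indicator column per distinct answer, prefixed with the question name.
indicators = responses.str.get_dummies(sep=";").add_prefix(responses.name + "_")
```

A skipped question produces a row of zeros, so this step also removes the missing values for these columns.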

In [6]:
from country_classifier.convert_choose_all_that_apply_responses import convert_choose_all_that_apply_responses
df = convert_choose_all_that_apply_responses(df)

## Impute numeric columns

* we can see (in [1_basics.ipynb](./1_basics.ipynb)) that all of the numeric columns have at least a 50% response rate, so we will not drop any of them
* I built models using both mean and mode as the numeric imputation technique and did not see a measurable difference

Therefore, for this analysis I will use the mean of each data series to impute its missing numerical values.

In [7]:
def fill_mean(col): return col.fillna(col.mean())
def fill_mode(col): return col.fillna(col.mode()[0])
fill = fill_mean if NUMERIC_IMPUTATION_TECHNIQUE == "Mean" else fill_mode

for column in df.select_dtypes(include=['int64', 'float64']).columns:
    df[column] = fill(df[column])

## Impute categorical columns

* we can see (in [1_basics.ipynb](./1_basics.ipynb)) that all of the categorical columns have at least a 50% response rate, so we will not drop any of them
* our total column count is less than 10% of the number of rows, so we can "afford" to include a "no response" dummy column

Therefore, for this analysis I will use standard dummy-column encoding to represent the categorical values, and I will include a "no response" column for each.

In [8]:
for column in df.select_dtypes(include='object').columns:
    if column == 'Country':
        continue
    # for each categorical column, add dummy variables and drop the original column
    df = pd.concat([df.drop(column, axis=1), pd.get_dummies(
        df[column], prefix=column, prefix_sep='_', drop_first=True, dummy_na=DUMMY_NA)], axis=1)
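
To make the effect of `drop_first=True` and `dummy_na=True` concrete, here is a small sketch on a made-up column (the name `Hobbyist` and its values are used purely for illustration).

```python
import pandas as pd

# Made-up categorical answers with one missing response.
answers = pd.Series(["Yes", "No", None, "Yes"], name="Hobbyist")

encoded = pd.get_dummies(
    answers,
    prefix="Hobbyist",
    prefix_sep="_",
    drop_first=True,  # the first category ("No") becomes the implicit baseline
    dummy_na=True,    # missing answers get their own indicator column
).astype(int)
```

The result has two columns: `Hobbyist_Yes` and `Hobbyist_nan`. A "No" answer is represented by zeros in both.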

# Data Modelling

* choose the countries for which to build a classifier. We want enough positives to avoid building a "naive" classifier; 1000 seems like a round and reasonable number.
* for each country and for each "min value in column" cutoff, build 3 models: linear regression, linear SVM, and random forest
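
How the cutoff is applied is decided inside `predict_country`, which is not shown in this notebook. One plausible reading, given that every feature is numeric or a 0/1 dummy by this point, is that a column is kept only if its values sum to more than the cutoff. The sketch below illustrates that assumption; `select_columns_by_cutoff` is a hypothetical helper and the data is made up.

```python
import pandas as pd


def select_columns_by_cutoff(X, cutoff):
    """Hypothetical helper: keep columns whose values sum to more than `cutoff`."""
    return X.loc[:, X.sum() > cutoff]


# Made-up indicator columns, for illustration only.
X = pd.DataFrame({
    "Hobbyist_Yes": [1, 1, 1, 0],
    "DevType_Student": [0, 1, 0, 0],
    "LanguageWorkedWith_COBOL": [0, 0, 0, 0],
})

kept_by_cutoff = {
    cutoff: list(select_columns_by_cutoff(X, cutoff).columns)
    for cutoff in [2, 0]
}
```

Under this reading, a higher cutoff leaves fewer, more commonly populated features, and a lower cutoff admits rarer ones.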

## Why these models

* *linear_regression*: it is the only one covered so far in the course material, and it is "white box" in that I can query the coefficients and see which inputs it favors
* *linear support vector machine*: a brief and incomplete survey of recommendations on the Internet suggests that SVMs make good classifiers but are expensive to train, so use the linear version
* *random forest*: a brief and incomplete survey of recommendations on the Internet suggests that random forest provides a classifier that balances performance and cost

## On linear regression model

The linear regression model returns a continuous value (usually, but not necessarily, between 0 and 1), and this analysis calls for a binary 0 or 1, so I round the response.
This discards the "confidence" of the model, but this basic analysis does not use confidence. Follow-up work could use the confidence to help overcome the imbalanced nature of the dataset.
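
The rounding step can be sketched with plain NumPy as follows. This is an illustration on made-up data, not the code inside `predict_country`. In general a least-squares fit on 0/1 targets can produce scores outside the 0 to 1 range, so the sketch thresholds at 0.5 rather than calling `round` directly; that choice is an assumption of this example.

```python
import numpy as np

# Made-up single feature and binary target.
x = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
y = np.array([0, 0, 1, 1, 1])

# Ordinary least squares with an intercept term.
design = np.column_stack([np.ones_like(x), x])
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
raw_scores = design @ coefficients

# Threshold at 0.5 so that every score maps to exactly 0 or 1,
# even when a raw score is above 1.5 or below -0.5.
labels = (raw_scores >= 0.5).astype(int)
```

Here the raw score for the last row is greater than 1, which is why a hard threshold is safer than rounding if downstream code expects strictly binary labels.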

In [9]:
countries = df['Country'].value_counts()
countries_with_enough_samples = countries[countries > (COUNTRY_MIN_SAMPLE_CUTOFF * ROW_SAMPLE_RATE)].index.values
countries_with_enough_samples

array(['United States', 'India', 'United Kingdom', 'Germany', 'Canada',
       'France', 'Brazil', 'Netherlands', 'Poland', 'Australia', 'Spain',
       'Italy', 'Russian Federation'], dtype=object)

In [None]:
from country_classifier.predict_country import predict_country

results = []
for country in countries_with_enough_samples:
    # Line continuations are implicit inside parentheses, so no backslashes are needed.
    country_results = predict_country(
        rowset_label='all_responses',
        columnset_label='all',
        imputed_df=df,
        cutoffs=CUTOFFS,
        country_to_classify=country,
        row_sample_rate=ROW_SAMPLE_RATE,
    )
    results.extend(country_results)

## Save model results

This is done as a developer aid to make it easier to iterate on the evaluation without rerunning the expensive train and test phases

In [11]:
results_df = pd.DataFrame(data=results).set_index(['country', 'cutoff'])
for to_round in ['TPR', 'TNR', 'PPV', 'NPV', 'ACC', 'duration']:
    results_df[to_round] = results_df[to_round].map(lambda x: round(x,3))

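# Note: drop() returns a new DataFrame and the result is not assigned here, so the
# pickles written below still contain the 'X_columns' and 'coefficients' columns.
results_df.drop(columns=['X_columns', 'coefficients'])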

with open('./pickles/latest_results.pkl', 'wb') as f:
    pickle.dump(results_df, f)

cutoff_string = '-'.join(map(str, CUTOFFS))
dataset_name = 'full_country_row_sample_rate_%s_country_min_sample_cutoff_%s_cutoffs_%s_results' % (ROW_SAMPLE_RATE, COUNTRY_MIN_SAMPLE_CUTOFF, cutoff_string)
with open('./pickles/%s.pkl' % dataset_name, 'wb') as f:
    pickle.dump(results_df, f)
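
A later notebook can read the saved results back with `pd.read_pickle`, which also reads files written with `pickle.dump`. The sketch below shows the round trip on a made-up stand-in for `results_df`, written to a temporary directory; the values are illustrative only and are not real model results.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for results_df; values are illustrative only.
demo_results = pd.DataFrame(
    {"country": ["Canada", "Canada"], "cutoff": [5000, 2500], "ACC": [0.9, 0.91]}
).set_index(["country", "cutoff"])

# Write to a temporary location, then read the frame back.
pickle_path = os.path.join(tempfile.mkdtemp(), "demo_results.pkl")
demo_results.to_pickle(pickle_path)
reloaded = pd.read_pickle(pickle_path)
```

The reloaded frame keeps the `country` and `cutoff` index levels, so it can be filtered with `.loc` without any further setup.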

# Data Evaluation

See [6_visualise_binary_classifier_results.ipynb](./6_visualise_binary_classifier_results.ipynb)