# Background

Hans is the coach of the Swedish Women’s National Biathlon team, which is currently training for the upcoming winter season. He couldn’t travel with his team to their training camp in Canada, and he suspects that some team members are cheating on the agreed-upon training schedule, which is meant to ensure the athletes improve consistently leading up to the first competition. To track progress on their rifle shooting, the athletes have to write their name on each target board. This week, Hans’s assistant sent him the scanned reports from Canada, but many of them don’t have the athletes’ names on the target boards, so Hans can’t judge the progress of his team! He turns to you for help in building a classifier, based on the named reports, that he can use to generate predictions for the reports without names. He keeps some reports with names as test data and, depending on the accuracy of your classifier on the test data, Hans will invite you to the World Cup finale this winter. Please send back a JSON file with the same format, where each empty name string is replaced with the name of a team member, as well as a Jupyter notebook that contains documentation and an explanation of your approach.

# Approach

- Goal

  Develop a classifier model using accuracy as the model metric.

- Data Provided

  - `12traits_biathlon_data.json`

# Exploratory Analysis

## Housekeeping

In [1]:
import json
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np
import turicreate as tc

## Reading the Data

In [2]:
with open('12traits_biathlon_data.json') as f:
    data = json.load(f)
data = data['silhouette_targets']

Observations
  - Variables: `name`, `shots` (x, y coordinates on silhouette targets)
  - Records are not assumed to be in chronological order (timestamp)
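
To make the structure concrete, here is a minimal sketch of how such a file can be inspected. The keys (`silhouette_targets`, `name`, `shots`, `x`, `y`) are the ones this notebook reads; the two records and their values are hypothetical and only stand in for the real data.

```python
import json

# Hypothetical two-record sample mirroring the keys the notebook reads
# ('silhouette_targets', 'name', 'shots', 'x', 'y'); the values are made up.
sample = json.loads("""
{"silhouette_targets": [
    {"name": "Berger", "shots": [{"x": 0.5, "y": -1.0}, {"x": 2.0, "y": 0.0}]},
    {"name": "",       "shots": [{"x": -1.5, "y": 0.5}]}
]}
""")

targets = sample["silhouette_targets"]
# A record counts as labeled when its name is a non-empty string.
labeled = [t for t in targets if t["name"] != ""]
unlabeled = [t for t in targets if t["name"] == ""]
shots_per_target = [len(t["shots"]) for t in targets]

print(len(labeled), len(unlabeled), shots_per_target)
```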

Possible Approaches
  - Identify features that can discriminate between the labels (team members)
    - How many shots in a session
    - How good the shooting is compared to team members (precision)
    - Spread of the shooting
    - How often each team member practices, estimated from the labeled data (this may require assumptions about how likely each team member is to label her target)

## Tidy the data

In [3]:
def tranforming_variables(idx):
    output = pd.DataFrame.from_dict(data[idx]['shots'])
    output['name'] = data[idx]['name']
    output['silhouette'] = idx
    output = output[['silhouette','name','x','y']]
    return output

In [4]:
tidy_data = pd.concat([ tranforming_variables(idx) for idx in range(len(data)) ])

In [5]:
pd.DataFrame.from_dict({'name': [ record['name'] for record in data ]})['name'].value_counts()

             5000
Persson       173
Berger        167
Dahlmeier     160
Name: name, dtype: int64

All three members have about the same number of labeled silhouette targets. This is consistent with the assumption that the likelihood of not labeling a silhouette target is constant and equal for all team members.
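
The balance can be quantified directly from the counts above. The sketch below computes each member's share of the labeled targets and, as an illustrative addition that is not part of the original analysis, the accuracy of a naive baseline that always predicts the most frequent name. That baseline may be a useful reference point when reading classifier accuracies.

```python
# Labeled-target counts taken from the value_counts() output above
labeled_counts = {"Persson": 173, "Berger": 167, "Dahlmeier": 160}

total = sum(labeled_counts.values())
shares = {name: n / total for name, n in labeled_counts.items()}

# Accuracy of always predicting the most frequent name (a naive baseline)
majority_baseline = max(shares.values())

for name, share in shares.items():
    print(f"{name}: {share:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```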

To measure the quality of a record, we consider the following measures:
  - Count (number of shots on the silhouette)
  - Spread (Frobenius norm of the matrix of pairwise Euclidean distances between the recorded shots)
  - Precision (mean and standard deviation of the Euclidean distance from the origin)
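
For concreteness, these measures can be written out as follows. Let $p_1, \dots, p_n$ be the shots on one target. The definitions assume that the origin is the target center, that the standard deviation is the population form (NumPy's default), and that the norm is the Frobenius norm (the default of `np.linalg.norm` for a matrix).

$$
\begin{aligned}
\text{count} &= n,\\
\mu &= \frac{1}{n}\sum_{i=1}^{n} \lVert p_i \rVert_2,\\
\sigma &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\lVert p_i \rVert_2 - \mu\bigr)^2},\\
\text{spread} &= \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n} \lVert p_i - p_j \rVert_2^{2}}.
\end{aligned}
$$

As a hypothetical example, the three shots $(3, 4)$, $(0, 0)$ and $(-3, -4)$ lie at distances $5$, $0$ and $5$ from the origin. This gives $\text{count} = 3$, $\mu = 10/3 \approx 3.33$, $\sigma = \sqrt{50/9} \approx 2.36$ and $\text{spread} = \sqrt{300} \approx 17.32$.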

## Engineer features from variables

In [6]:
def features_engineering(df):
    xy        = df[['x', 'y']].values
    count     = len(df)
    # Distance of every shot from the origin
    distances = euclidean_distances(xy, [[0, 0]])
    mu        = distances.mean()
    sigma     = distances.std()
    # Frobenius norm of the pairwise distance matrix between the shots
    spread    = np.linalg.norm(euclidean_distances(xy))
    # .iloc[0] takes the first row by position, independent of the index labels
    output    = pd.DataFrame({'silhouette': [df['silhouette'].iloc[0]],
                              'name': [df['name'].iloc[0]],
                              'count': [count],
                              'mu': [mu],
                              'sigma': [sigma],
                              'spread': [spread]},
                             columns=['silhouette', 'name', 'count',
                                      'mu', 'sigma', 'spread'])
    return output

In [7]:
features = tc.SFrame(pd.concat(
    [features_engineering(x[1]) for x in tidy_data.groupby(['silhouette'])]))

# Modeling Stage

## Generate the training and test sets

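The labeled records form the training set, and the unlabeled records are the ones to predict. No separate cross-validation is run: the validation accuracy reported below comes from the validation subset that Turi Create holds out of the training set when each model is created.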

In [8]:
train_set = features[features['name'] != '']
test_set = features[features['name'] == '']
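
Because only about 500 targets are labeled, any subset held out for validation is small, and an accuracy measured on it is noisy. The sketch below illustrates the effect with the binomial standard error. The accuracy of 0.8 and the set sizes are hypothetical values chosen for illustration, not measurements from this data.

```python
import math

def accuracy_standard_error(p, n):
    """Binomial standard error of an accuracy estimate p measured on n records."""
    return math.sqrt(p * (1.0 - p) / n)

# Hypothetical numbers for illustration: an accuracy of 0.8
# measured on held-out sets of different sizes.
for n in (25, 100, 500):
    se = accuracy_standard_error(0.8, n)
    print(f"n={n:3d}  standard error = {se:.3f}")
```

With a few dozen validation records, differences of several percentage points between models or seeds can arise by chance alone.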

## Explore basic classification models

In [9]:
def training_and_validation_statistics(seed):
    model_dt = tc.decision_tree_classifier.create(train_set,
                                              target = 'name',
                                              features = ['count','mu','sigma','spread'],
                                              seed = seed,
                                              metric = 'accuracy',
                                              verbose = False)
    model_rf = tc.random_forest_classifier.create(train_set,
                                              target = 'name',
                                              features = ['count','mu','sigma','spread'],
                                              seed = seed,
                                              metric = 'accuracy',
                                              verbose = False)
    model_bt = tc.boosted_trees_classifier.create(train_set,
                                              target = 'name',
                                              features = ['count','mu','sigma','spread'],
                                              seed = seed,
                                              metric = 'accuracy',
                                              verbose = False)
    # Labels follow the same order as the accuracy lists below (dt, rf, bt)
    output = pd.DataFrame.from_dict(
        {'model': ['decision tree','random forest','boosted trees'],
         'training_accuracy': [model_dt.training_accuracy,
                               model_rf.training_accuracy,
                               model_bt.training_accuracy],
         'validation_accuracy': [model_dt.validation_accuracy,
                                 model_rf.validation_accuracy,
                                 model_bt.validation_accuracy]})
    output['seed'] = seed
    return output

In [10]:
exploration = pd.concat([ training_and_validation_statistics(seed) for seed in range(20) ])

In [11]:
exploration[['model','training_accuracy','validation_accuracy']].groupby('model').agg(
    ['min', 'max', 'median', 'mean'])

| model | training min | training max | training median | training mean | validation min | validation max | validation median | validation mean |
|---|---|---|---|---|---|---|---|---|
| boosted trees | 0.924528 | 0.952891 | 0.937636 | 0.938658 | 0.615385 | 0.956522 | 0.83046 | 0.789372 |
| decision tree | 0.853503 | 0.90795 | 0.89298 | 0.886254 | 0.576923 | 0.947368 | 0.787856 | 0.774719 |
| random forest | 0.868365 | 0.903766 | 0.8813 | 0.882245 | 0.653846 | 0.916667 | 0.77807 | 0.781524 |

Overall, boosted trees appear to be the best-performing model: they have the highest mean (0.789) and median (0.830) validation accuracy across the 20 seeds, although the gap in mean validation accuracy to the random forest (0.782) and the decision tree (0.775) is small.

# Prediction

In [12]:
boosted_trees = exploration[exploration['model'] == 'boosted trees']
# Reuse the seed of the run with the highest validation accuracy, as a plain int
seed = int(boosted_trees.sort_values('validation_accuracy')['seed'].iloc[-1])
model = tc.boosted_trees_classifier.create(train_set,
                                           target = 'name',
                                           features = ['count','mu','sigma','spread'],
                                           seed = seed,
                                           metric = 'accuracy',
                                           verbose = False)

In [13]:
predictions = model.classify(test_set)['class']

In [14]:
with open('12traits_biathlon_data.json') as f:
    output = json.load(f)

In [15]:
# Write each prediction back to the record it was computed for, using the
# silhouette index stored in test_set rather than assuming a record order
for idx, name in zip(test_set['silhouette'], predictions):
    output['silhouette_targets'][idx]['name'] = name

In [16]:
with open('fangfang_lee.json', 'w') as outfile:
    json.dump(output, outfile)
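
Before sending the file back, it may be worth checking that no empty name strings remain. The sketch below is one way to do that. The helper `count_unlabeled` and the demo file are hypothetical and exist only so the example runs on its own.

```python
import json
import os
import tempfile

def count_unlabeled(path):
    """Return how many targets in the JSON file at `path` still have an empty name."""
    with open(path) as f:
        targets = json.load(f)["silhouette_targets"]
    return sum(1 for target in targets if target["name"] == "")

# Hypothetical stand-in for the submission file, so the sketch runs on its own.
demo = {"silhouette_targets": [
    {"name": "Persson", "shots": [{"x": 1.0, "y": 2.0}]},
    {"name": "Berger", "shots": [{"x": -0.5, "y": 0.0}]},
]}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_submission.json")
    with open(demo_path, "w") as outfile:
        json.dump(demo, outfile)
    remaining = count_unlabeled(demo_path)

print(remaining)
```

Applied to the real output, `count_unlabeled('fangfang_lee.json')` would be expected to return 0 if every empty name was replaced.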