# 2019 IMFS Data Science Competition - Netflix for Bonds
## Week 0 - Setting Up
Welcome to the 2019 IMFS Competition!

### What (The Challenge)
**Your challenge is to build a recommendation system that matches similar bonds based on the revealed preferences of an expert trader.**

**Backstory:** In an alternative universe, Vanguard employs several robotic crewmembers (manufactured by the Tyrell corporation) to perform certain parts of the investment lifecycle. One of the best robots, TRACE_Y, is very good at picking complementary bonds for various portfolios with an inexplicable knack for predicting which bonds will perform best in a diversified portfolio. Nobody has been able to replicate her work.
Unfortunately, not only is TRACE_Y very good at picking bonds, she is also very good at picking the winning lottery numbers and just retired to a beach shack in Key West, where she plans to go off grid. This leaves you, the portfolio manager, with no way to pick bonds unless you can recreate the mind of the robot!
All you have is a historical set of bonds that TRACE_Y looked at and her assessment of the ISIN of the nearest bond. Not every bond is labelled and the characteristics of the bond change from one day to the next.
You have no idea what bond characteristics TRACE_Y weighed most heavily, but know it must be one or more of the fields presented in the attached files. Thankfully, you took Coursera classes on machine learning and think you might have what it takes to replicate her mind.
You need to act fast – you have been given until July 31st to come up with an algorithm that can predict the nearest ISIN for several bonds. If you correctly predict the “nearest” bond (or even identify a bond in the top 10), you are confident that you will be able to continue to operate.

**A slightly more serious explanation of the problem:** We’ve taken several months of data on various bonds and created a secret algorithm that ranks bonds by their similarity. It is based on knowledge of characteristics of the bonds and somewhat resembles decisions made by true portfolio managers. However, this ground truth is artificial, intentionally made up to provide a positive Kaggle experience. So while some knowledge of bonds and their characteristics may help you, a deep knowledge of fixed income markets will not help. **You are not predicting the true nearest bond (if you believe you know what that is), but the nearest bond as predicted by our algorithm, which remains secret.**

### Getting Started
You must join Microsoft Teams to stay up-to-date with the latest announcements for this competition.  

Please ensure that you have joined the [Teams channel](https://teams.microsoft.com/l/channel/19%3af5fff6bb2bbf4493ae2ce674cea3b0d6%40thread.skype/General?groupId=e045e8fc-51f3-4d91-bbdc-583ff955ef24&tenantId=d3a74ac8-efe4-4fe8-b707-b1bf8c6a25bd) (refer to welcome email from Steve Lawrence).  

If located in Australia, where Teams has not been deployed yet in the Vanguard networks, make sure you sign up to the IMFS slack channel for this competition (refer to welcome email from Steve Lawrence).  

You will be working out of Kaggle, a platform for data science related competitions, to build and run your recommendation models. Therefore, you must create a Kaggle account and share your Kaggle handle with us (via Teams or Slack).  

While you can do all the work in Kaggle, we also provide a GitHub repository in case you want to work on your own machine. GitHub hosts the main repository for this competition, with all the materials you will use in this challenge.

Please remember that you can always work in Kaggle; you do not need to clone the GitHub repository locally.

### Your teammates
John Kraynak  
Mahesh Thummati  
Krunal Patel  
Janeta Blagoeva  
Hemant Sojitra    

### Your timeline
Week 0 (starting July 1st) – Setting Up/Onboarding + Meet your IMFS mentor  
Week 1 (starting July 8th) – Understand the challenge + Tutorial on exploring your dataset in your Kaggle Kernel  
Week 2 (starting July 15th) – Tutorial on preprocessing dataset  
Week 3 (starting July 22nd) – Tutorial on building classification models and tuning the parameters  
Week 4 (starting July 29th) – Tutorial on submitting your results for grading  
Wednesday July 31st – Last date for submission of competition results  
Friday August 2nd – Award ceremony and announcement of winning teams  


### The Dataset
One year of historical pricing and duration data for a portfolio of securities.

### The Rules
1. **Participation**:
   1. You will work in small teams of four to five people.
   2. IMFS data scientist mentors will be hosting "competition office hours" where you can ask questions. The mentors will NOT do the work for you.
   3. Your team will be assigned a mentor to help. She/he will check in with you once a week to track your progress.
   
2. **How to approach the problem**  
Here’s a brief rundown of what you need to do:
   1. Join the competition on Microsoft Teams [HERE](https://teams.microsoft.com/l/channel/19%3af5fff6bb2bbf4493ae2ce674cea3b0d6%40thread.skype/General?groupId=e045e8fc-51f3-4d91-bbdc-583ff955ef24&tenantId=d3a74ac8-efe4-4fe8-b707-b1bf8c6a25bd).  We are using Microsoft TEAMS as it is widely available at Vanguard (US and Malvern). For our participants in Australia, communicate via Slack using the IMFS slack channel for this competition (refer to Steve Lawrence’s welcome email). Please remember to submit your GitHub and Kaggle account handles.
   2. Use the dataset provided to train your model. Your model must take an ISIN as input and return the top 10 securities most similar to that ISIN in the test set. 
   3. Follow along with the tutorial in your assigned competition kernel in Kaggle. A Kaggle kernel is a cloud computational environment that enables reproducible and collaborative analysis.  For this competition, the kernels are set up as Jupyter notebooks. Jupyter notebooks consist of a sequence of cells, where each cell is formatted in either Markdown (for writing text) or in a programming language of your choice (for writing code). For more details on kernels, read [this article](https://www.kaggle.com/docs/kernels) by Kaggle.
   4. When you are ready to submit your work, run your model for each of the 1000 test securities in the "***test_set.csv***" dataset.  
   5. At the end of your notebook, save your recommendations for each security, see the function generate_output below.
   6. Post "Done with project - yourteamname" in the competition's "**#Microsoft Teams**" channel or the **Slack channel** if in Australia.
   
3. **Ground truth and grading**  
The ground truth consists of a set of bonds, their characteristics, and labels for the ISIN of the nearest bond. Information on the characteristics of this nearest bond is provided in the training file. As mentioned in the backstory, this ground truth is artificially generated to provide for an interesting competition and may not reflect the truly optimal bond for a particular situation. A good algorithm will correctly predict the ISIN of the nearest bond as labelled in the training dataset.

We will grade your result in the following way: for each target ISIN in the test set, as long as your top 10 predictions contain the NearestISIN of the target ISIN, you successfully recommend it, otherwise you don't.
   1. Your score is the number of target ISINs whose NearestISIN appears in your top 10 predictions, divided by 1000. You get a perfect score only if every NearestISIN is covered.
   2. The winning teams will be determined based on:
      1. First Place (Performance based)
      2. Best coded (as judged by Chuqi Yang)
      3. Most innovative solution (As judged by IMFS team)
      4. Most engaged (As judged by IMFS interns)
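
The grading rule described above can be sketched in a few lines. This is only an illustration of the stated rule, not the organizers' grading script; the function name `top10_hit_rate` and the toy identifiers are made up for the example.

```python
def top10_hit_rate(predictions, truth):
    """Fraction of target ISINs whose labelled NearestISIN is among the predictions.

    predictions: dict mapping target ISIN -> list of predicted ISINs (at most 10 are used)
    truth: dict mapping target ISIN -> its labelled NearestISIN
    """
    hits = 0
    for isin, nearest in truth.items():
        # A target counts as a hit if the label appears anywhere in its top 10.
        if nearest in predictions.get(isin, [])[:10]:
            hits += 1
    return hits / len(truth)


# Toy data with made-up identifiers (not real ISINs).
predictions = {"AAA": ["BBB", "CCC"], "DDD": ["EEE", "FFF"]}
truth = {"AAA": "CCC", "DDD": "GGG"}
score = top10_hit_rate(predictions, truth)
print(score)
```

Note that the position of the correct ISIN within the top 10 does not matter under this rule; only its presence does.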

### Help Desk
#### IMFS Data Science Blog for tips and tutorials
Access blog in Microsoft Teams Blog Page [HERE](https://thevanguardgroup.sharepoint.com/sites/2019imfsdatasciencecompetition/_layouts/15/news.aspx)   
Ask questions in Microsoft Teams Channel [HERE](https://teams.microsoft.com/l/channel/19%3af5fff6bb2bbf4493ae2ce674cea3b0d6%40thread.skype/General?groupId=e045e8fc-51f3-4d91-bbdc-583ff955ef24&tenantId=d3a74ac8-efe4-4fe8-b707-b1bf8c6a25bd)

#### Mentorship
Your IMFS mentor is Yuhan (Flora) Huang (yuhan_huang@vanguard.com)    

Her office hours:
1. Week 1: Mon-Fri, 11AM-2:45PM EST
2. Week 2: Mon-Fri, 11AM-2:45PM EST
3. Week 3: Mon-Fri, 11AM-2:45PM EST
4. Week 4: Mon-Fri, 11AM-2:45PM EST




## 1.0 - Import packages and Load the Data

In this section, you will import the necessary modules and packages you may need to use in the competition and load the data. The definition of modules and packages is here: https://www.learnpython.org/en/Modules_and_Packages

Please feel free to import and use any modules and packages you think may help you in this competition.

In [None]:
import pickle
import pandas as pd
import numpy as np
import random
import time
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier
import math

In [None]:
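#loading train and test datasets
# pd.DataFrame.from_csv was removed in pandas 1.0; read_csv with these arguments matches its defaults
train_raw = pd.read_csv('../input/dataset_final.csv', index_col=0, parse_dates=True)
test_raw = pd.read_csv('../input/test_set.csv', index_col=0, parse_dates=True)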


In [None]:
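#loading validation dataset
# read_csv with index_col=0 and parse_dates=True matches the old DataFrame.from_csv defaults
val_test = pd.read_csv('../input/dataset_valtest.csv', index_col=0, parse_dates=True)
val_ans = pd.read_csv('../input/dataset_valans.csv', index_col=0, parse_dates=True)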

In [None]:
train_raw.reset_index(inplace = True)

## Let's first take a look at train_raw
From the cells below, we can see that each row of "train_raw" holds one ISIN and its features on a specific day. There are 1397138 rows in train_raw. However, some of them are not labeled; we will use those in prediction rather than in training.

In [None]:
train_raw.head()

In [None]:
train_raw.shape

We can see that 1237625 rows have a NearestISIN; these can be used for supervised learning.

In [None]:
train_raw[~train_raw['NearestISIN'].isnull()].shape

In [None]:
#attributes we have
train_raw.columns

## Let's take a look at the test_raw
From the cells below, we can see that in "test_raw" each ISIN on a given day has 10 identical rows whose "Your Prediction" is None. You should replace these None values with the ISINs of the top 10 most similar bonds.

In [None]:
test_raw.head(15)

In [None]:
test_raw.tail(15)

# 2.0 Understand Challenge
Before preprocessing and training the model, we should first understand the challenge.

In [None]:
total_ISIN = len(train_raw.ISIN.unique())
print(total_ISIN)

## Can we directly train Machine Learning given the data we have?

As shown above, there are 6739 distinct ISINs. If we gave the model only the features of one ISIN and asked it to predict the most similar bond, we would have to train either a single model with 6739 possible classes or 6739 separate models. Neither is a good choice given the time limit, the sparsity of the data, and the class imbalance. You can try both if time and resources allow, but this tutorial skips that experimental stage and instead proposes a preprocessing step that transforms the challenge into a different machine learning problem.

The approach used in this tutorial (you are free to change it or use your own method): instead of using the dataframe as it is, we use the absolute difference between the features of two ISINs as the new features (for categorical features, whether the two values match). The label is 1 or 0: 1 means the second ISIN is the NearestISIN of the first, 0 means it is not. With this data, we can train a binary classification model!

In the prediction stage, we estimate how similar two bonds are by feeding the model the feature differences between two ISINs, and then pick the top 10 most similar bonds for each ISIN. 

The rest of this tutorial guides you through preparing and preprocessing the training and test data for this method.
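
As a minimal sketch of the pairwise transformation described above, the snippet below builds the feature row for one pair of bonds. The column names (`coupon`, `duration`, `sector`), the identifiers, and the helper `pair_features` are hypothetical and chosen only for illustration; the real dataset has its own fields.

```python
import pandas as pd


def pair_features(df, i, j, numerical_columns, categorical_columns):
    """Describe how the bonds in rows i and j differ."""
    num = df[numerical_columns]
    cat = df[categorical_columns]
    # Numerical features: absolute difference between the two bonds.
    num_diff = (num.loc[i] - num.loc[j]).abs()
    # Categorical features: True when both bonds share the same category.
    cat_same = cat.loc[i] == cat.loc[j]
    return pd.concat([num_diff, cat_same])


# Toy data with made-up columns and identifiers.
df = pd.DataFrame({
    "ISIN": ["AAA", "BBB", "CCC"],
    "coupon": [2.0, 2.5, 6.0],
    "duration": [5.0, 5.5, 1.0],
    "sector": ["Bank", "Bank", "Energy"],
})

close = pair_features(df, 0, 1, ["coupon", "duration"], ["sector"])
far = pair_features(df, 0, 2, ["coupon", "duration"], ["sector"])
print(close)
print(far)
```

A pair that is labelled as nearest neighbours would get `Response = 1`, and a randomly chosen pair would get `Response = 0`.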

# 3.0 Preprocessing

Following the discussion above, each row of the training set becomes the difference in features between two rows of the original dataframe. 

For the test set, each row becomes the difference in features between a target ISIN and one of the other ISINs in the training set.

In [None]:
def clean_train(df):
    #Create index mapping
    
    start = time.time()
    
    ## in this competition tutorial, we only use 10 random days of data as an example;
    ## you can use more data if needed.
    index_dic = {}
    random_ten_days = random.sample(list(set(df['date'])),10)
    df_indate = df[df['date'].isin(random_ten_days)]
    df_indate = df_indate.reset_index(drop=True)
    date_column, isin_column = df_indate['date'], df_indate['ISIN']
    for i in range(len(date_column)):
        index_dic[(date_column[i], isin_column[i])] = i
    
    #type of columns
    IsCategorical = df_indate.dtypes == object
    categorical_columns = [x for x in df_indate.columns[IsCategorical].tolist() if x not in ['date','ISIN','NearestISIN']]
    numerical_columns = [x for x in df_indate.columns[~IsCategorical] if x not in ['date value','Keep']]
    
    #subset dataframe
    categorical_df = df_indate[categorical_columns]
    numerical_df = df_indate[numerical_columns]
    
    #nearestISIN dataframe (label 1)
    numerical_sim = []
    categorical_sim = []
    date_sim = []
    isin_sim = []
    nearest_isin_sim = []
    for row_index, row in df_indate.iterrows():
        date, isin, nearest_isin = row['date'], row['ISIN'], row['NearestISIN']
        if type(nearest_isin) == str:
            nearest_isin_index = index_dic[(date, nearest_isin)]  
            numerical_sim.append(abs(numerical_df.iloc[row_index]-numerical_df.iloc[nearest_isin_index]))
            categorical_sim.append(categorical_df.iloc[row_index]==categorical_df.iloc[nearest_isin_index])
            date_sim.append(date)
            isin_sim.append(isin)
            nearest_isin_sim.append(nearest_isin)
        else:
            continue
    sim_df = pd.concat([pd.DataFrame(np.column_stack([date_sim,isin_sim,nearest_isin_sim]), columns=['date', 'ISIN', 'NearestISIN']),
                        pd.DataFrame(numerical_sim), pd.DataFrame(categorical_sim)], axis=1)
    sim_df['Response'] = 1
    
    #random not nearestISIN dataframe (label 0)
    numerical_diff = []
    categorical_diff = []
    date_diff = []
    isin_diff = []
    nearest_isin_diff = []
    indexes = [x for x in range(df_indate.shape[0])]
    for row_index, row in df_indate.iterrows():
        date, isin, nearest_isin = row['date'], row['ISIN'], row['NearestISIN']
        if type(nearest_isin) == str:
            nearest_isin_index = index_dic[(date, nearest_isin)]
            
            random_index = random.choice(indexes)
            while  random_index ==row_index or random_index==nearest_isin_index:
                random_index = random.choice(indexes)
            numerical_diff.append(abs(numerical_df.iloc[row_index]-numerical_df.iloc[random_index]))
            categorical_diff.append(categorical_df.iloc[row_index]==categorical_df.iloc[random_index])
            date_diff.append(date)
            isin_diff.append(isin)
            nearest_isin_diff.append(nearest_isin)
        else:
            continue
    diff_df = pd.concat([pd.DataFrame(np.column_stack([date_diff,isin_diff,nearest_isin_diff]), columns=['date', 'ISIN', 'NearestISIN']),
                        pd.DataFrame(numerical_diff), pd.DataFrame(categorical_diff)], axis=1)
    diff_df['Response'] = 0
    
    #output df
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    output_df = pd.concat([sim_df, diff_df], ignore_index=True)
    end = time.time()
    print("clean_train took {:.1f} seconds".format(end - start))
    return output_df

In [None]:
train_df = clean_train(train_raw)

In the next step, we preprocess the test set. In theory, each test ISIN generates about 6739 rows in the test set, so the following step takes a comparatively long time to run (possibly longer than 20 minutes).
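
To get a feel for why this step is slow, here is a rough back-of-the-envelope estimate of the size of the pairwise test matrix. It assumes 1000 test ISINs, 6739 candidate ISINs, and 8-byte floats; the feature count of 20 is a placeholder, not the real number of columns, and the number of ISINs available on the test date may be smaller than 6739.

```python
def pair_matrix_size(n_test, n_candidates, n_features, bytes_per_value=8):
    """Rows and bytes needed when every test ISIN is paired with every other ISIN."""
    # Each test ISIN is compared with all candidates except itself.
    rows = n_test * (n_candidates - 1)
    bytes_needed = rows * n_features * bytes_per_value
    return rows, bytes_needed


# n_features=20 is an assumed placeholder value.
rows, bytes_needed = pair_matrix_size(n_test=1000, n_candidates=6739, n_features=20)
print(rows)
print(round(bytes_needed / 1e9, 2), "GB")
```

If memory becomes a problem, processing the test ISINs in smaller batches is one option to consider.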

In [None]:
def clean_test(train_raw, test_raw):
    
    #Create index mapping
    # Find last date, as this is the test sample
    last_date = test_raw.date[0]
    
    # Subset the dataframe for those in the last date, and reset index
    df_indate = train_raw[train_raw['date']==last_date]
    df_indate = df_indate.reset_index(drop=True)
    
    # Take dates and columns to be used later for dictionary
    date_column, isin_column = df_indate['date'], df_indate['ISIN']
    
    # As in previous function
    IsCategorical = df_indate.dtypes == object
    categorical_columns = [x for x in df_indate.columns[IsCategorical].tolist() if x not in ['date','ISIN','NearestISIN']]
    numerical_columns = [x for x in df_indate.columns[~IsCategorical] if x not in ['date value','Keep']]
    
    # subset dataframe
    categorical_df = df_indate[categorical_columns]
    numerical_df = df_indate[numerical_columns]
    
    # Sample categorical for numerical categories
    categorical_mat = np.zeros(np.shape(categorical_df))

    # Convert categories into numbers
    for k, column in enumerate(categorical_df):
        category_set = categorical_df[column].unique()
        num_val = np.zeros(len(category_set))
        for j, cat in enumerate(category_set):
            categorical_mat[categorical_df[column]==cat,k] = j + 1

    # Create overall matrix
    feature_mat = np.hstack((categorical_mat, numerical_df.values))
    
    # Create array of ISINs to match to feature_mat
    all_isin = df_indate['ISIN'].values
    
    # Test ISIN is all ISINs that we need to solve for
    test_ISIN = set(test_raw.ISIN.unique())
    
    # List of all these isins
    test_indate_df = df_indate[df_indate.ISIN.isin(test_ISIN)]
    test_isin_list = test_indate_df['ISIN'].values
   
    # Set sample size differently if testing for a sub sample - otherwise leave
    sample = len(test_isin_list)

    # -1 since 1 - total isins for each possible isin
    step_isin = np.shape(feature_mat)[0]-1
        
    # Preallocate matrix of difference [npossibleIsin x nFeatures x nTestIsin]
    diff_mat = np.zeros((step_isin,np.shape(feature_mat)[1], len(test_isin_list[:sample])))
    isin_target_mat = (step_isin * len(test_isin_list[:sample])) * [None]
    isin_test_mat = (step_isin * len(test_isin_list[:sample])) * [None]
    
    # Loop through each test isin and create matrix of differences
    for k, isin in enumerate(test_isin_list[:sample]):
        isin_idx = all_isin == isin
        diff_mat[:,:,k] = abs(feature_mat[~isin_idx,:] - feature_mat[isin_idx,:])
        isin_target_mat[step_isin*k:(step_isin*k)+step_isin] = step_isin * [isin]
        isin_test_mat[step_isin*k:(step_isin*k)+step_isin] = isin_column[~isin_idx]       
        
    # Reshape into 2-D matrix by stacking third dimension
    diff_mat_2d = np.reshape(np.transpose(diff_mat, (2,0,1)),(sample * step_isin, np.shape(feature_mat)[1]), order = 'C')
    
    # Date is the same for all
    date = np.shape(diff_mat_2d)[0] * [last_date]
    
    # For all categoricals turn to true if match, false otherwise
    categorical_bool = diff_mat_2d[:,:len(categorical_df.columns)] == 0
    
    # Take numerical matrix for appropriate part
    num_mat = diff_mat_2d[:,len(categorical_df.columns):]
    
    # Combine into one output dataframe for testing
    diff_df = pd.concat([pd.DataFrame(np.column_stack([date,isin_target_mat,isin_test_mat]), columns=['date', 'ISIN','ThisISIN']),
                        pd.DataFrame(num_mat, columns = numerical_columns), pd.DataFrame(categorical_bool, columns = categorical_columns)], axis=1)
    
    # Reset Index
    output_df = diff_df.reset_index(drop=True)
    
    return output_df

In [None]:
test_df = clean_test(train_raw, test_raw)

## Split X and y
After getting train_df and test_df, we need to split the dataset into features and labels.

In [None]:
xtrain = train_df.drop(['date', 'ISIN', 'NearestISIN', 'level_0', 'index', 'Response'], axis = 1)
ytrain = train_df['Response']

In [None]:
x_train, y_train = shuffle(xtrain, ytrain)

In [None]:
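# Drop the same helper columns as in xtrain and x_val so the feature columns line up
x_test = test_df.drop(['date', 'ISIN', 'ThisISIN', 'level_0', 'index'], axis = 1, errors = 'ignore')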

# 4.0 Fit data into a model
Here we use a random forest to show how to feed the data into a model, along with some of the tuning tools available in the sklearn package. Random forest is only one of the models in the machine learning universe; there are many other models and many other ways to tune them. Enjoy it :)

In [None]:
def randomforest_proba(xtrain,xtest,ytrain):
    categorical_features = xtrain.dtypes == "object"
    numerical_features = ~categorical_features
    
    # make_column_transformer takes (transformer, columns) pairs in scikit-learn 0.22 and later
    preprocess_scale = make_column_transformer(
        (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), numerical_features),
        (make_pipeline(SimpleImputer(fill_value='missing', strategy='constant'),
                       OneHotEncoder(handle_unknown="ignore")), categorical_features))
    
    randomforest_scale = Pipeline([('preprocess', preprocess_scale), 
                 ('clf', RandomForestClassifier(warm_start=True,n_jobs=-1))])
    
    max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
    random_grid = {'clf__max_depth': max_depth}
    grid_rf = GridSearchCV(randomforest_scale, random_grid, cv=10, scoring='roc_auc',n_jobs=-1)
    print("===Start of grid search===")
    grid_rf.fit(xtrain, ytrain)
    print("===End of grid search===")
  
    cross_score = grid_rf.best_score_
    y_preds = grid_rf.predict_proba(xtest)[:,1]
    print("Cross-validated training score is:")
    print(cross_score)
    
    return y_preds

In [None]:
rf_pred_proba_d = randomforest_proba(x_train,x_test,y_train)
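
Random forest is not the only option. As a hedged sketch of swapping in a different classifier, the snippet below fits a logistic regression on synthetic data. The data is randomly generated to mimic the idea that nearest pairs have small feature differences; it is not the competition data, and logistic regression is shown only as one possible alternative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for pairwise difference features:
# label 1 pairs have small differences, label 0 pairs have large ones.
X_pos = np.abs(rng.normal(0.0, 0.2, size=(200, 3)))
X_neg = np.abs(rng.normal(0.0, 2.0, size=(200, 3)))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 200 + [0] * 200)

# Scale the features, then fit a logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# Probability that each pair belongs to class 1, used later for ranking.
proba = model.predict_proba(X)[:, 1]
print(proba[:5])
```

Any classifier that exposes `predict_proba` can be plugged into the same ranking step, since only the class-1 probability is used.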

# 5.0 Test on validation test set
Because we recast the nearest-bond search as a binary classification problem, we provide a validation set with answers to give you a sense of how well the model will do on the real test dataset. The accuracy score can differ from the cross-validation score you just got, so you may need to keep improving your model to make sure it also predicts the 10 nearest ISINs well! :D

Please note that the result on the validation set is not your score!

In [None]:
val_df = clean_test(train_raw, val_test)

In [None]:
x_val = val_df.drop(['date', 'ISIN', 'ThisISIN', 'level_0', 'index'], axis = 1)

In [None]:
val_preds = randomforest_proba(x_train,x_val,y_train)

In [None]:
def transfer_test(test_df, val_test, y_preds, y_true):
    #create dataframe
    test_df['Class_Prob'] = y_preds
    sel_ISIN = list(test_df.ISIN.unique())
    nearest_10ISIN = {}
    for isin in sel_ISIN:
        temp = test_df[test_df.ISIN == isin]
        temp_sort = temp.sort_values(['Class_Prob'],ascending=False).head(10)
        nearest_10ISIN[isin] = temp_sort['ThisISIN'].tolist()
    for idx in range(val_test.shape[0]):
        isin = val_test.loc[idx, 'ISIN']
        val_test.loc[idx,'Your Prediction'] = nearest_10ISIN[isin][0]
        nearest_10ISIN[isin].pop(0)

    correct = 0
    for idx in range(y_true.shape[0]):
        pred_list = val_test[val_test['ISIN']==y_true.loc[idx,'ISIN']]['Your Prediction'].tolist()
        if y_true.loc[idx,'NearestISIN'] in pred_list:
            correct+=1
    
    print("Test accuracy score is:")
    print(correct/len(sel_ISIN))

In [None]:
transfer_test(val_df, val_test, val_preds, val_ans)

# 6.0 Generate results


The model above gives, for each pair, the probability that the two ISINs are NearestISINs. For each ISIN in the test set, we then pick the top 10 most similar bonds.  
After finding the top 10 similar bonds, we write them into test_raw and save test_raw to a csv file named 'TeamX_output.csv' (X is your team number).


In [None]:
def generate_output(test_df, test_raw, y_preds, filename):
    #create dataframe
    test_df['Class_Prob'] = y_preds
    sel_ISIN = list(test_df.ISIN.unique())
    nearest_10ISIN = {}
    for isin in sel_ISIN:
        temp = test_df[test_df.ISIN == isin]
        temp_sort = temp.sort_values(['Class_Prob'],ascending=False).head(10)
        nearest_10ISIN[isin] = temp_sort['ThisISIN'].tolist()
    for idx in range(test_raw.shape[0]):
        isin = test_raw.loc[idx, 'ISIN']
        test_raw.loc[idx,'Your Prediction'] = nearest_10ISIN[isin][0]
        nearest_10ISIN[isin].pop(0)
    import os
    print(os.getcwd())
    test_raw.to_csv(filename)
    print("Output written to file "+filename)

In [None]:
generate_output(test_df, test_raw, rf_pred_proba_d, 'TeamX_output.csv')
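
The per-ISIN loop used to pick the top candidates can also be written with a pandas `groupby`. This is a sketch of an alternative, not a replacement for `generate_output`; the helper name `top_k_per_isin` and the toy identifiers are made up, while the column names `ISIN`, `ThisISIN` and `Class_Prob` follow the ones used above.

```python
import pandas as pd


def top_k_per_isin(pairs, k=10):
    """Return {target ISIN: [k candidate ISINs with the highest Class_Prob]}."""
    # Sort all pairs once, best first.
    ranked = pairs.sort_values("Class_Prob", ascending=False)
    # Keep the first k rows of each target ISIN, which are its k best candidates.
    top = ranked.groupby("ISIN", sort=False).head(k)
    return top.groupby("ISIN", sort=False)["ThisISIN"].apply(list).to_dict()


# Toy data with made-up identifiers and probabilities.
pairs = pd.DataFrame({
    "ISIN": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "ThisISIN": ["X1", "X2", "X3", "Y1", "Y2"],
    "Class_Prob": [0.2, 0.9, 0.5, 0.1, 0.7],
})
top2 = top_k_per_isin(pairs, k=2)
print(top2)
```

Sorting once and grouping avoids filtering the full dataframe separately for every test ISIN, which may help when the pairwise frame has millions of rows.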

<a href="TeamX_output.csv"> Download File </a>