### Problem Statement

> Online judges provide a platform where many users solve problems every day to improve their programming skills. Users range from beginners to experts in competitive programming. Some users may be good at a specific category of problems (e.g. greedy, graph algorithms, dynamic programming) while others are beginners in the same category. The machine learning task is to identify patterns in this activity and model user behaviour from them. The goal of this challenge is to predict the range of attempts a user will make to solve a given problem, given user and problem details. Finding these patterns can help the programming committee suggest relevant problems and automatically provide hints for problems on which users are likely to get stuck.

### Evaluation Metric
The model is evaluated with the F1 score between the predicted and actual values, with the `average` parameter set to ‘weighted’.
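
As a minimal sketch of what the weighted average means, the function below computes a per-class F1 score and weights each class by its share of the true labels. The function name `weighted_f1` and the sample labels are illustrative assumptions, not part of the competition data; in the notebook itself the score comes from `sklearn.metrics.f1_score`.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 scores, weighted by each class's support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = true_pos / n_pred if n_pred else 0.0
        recall = true_pos / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score


# Hypothetical labels for illustration only.
y_true = [1, 1, 1, 2, 2, 3]
y_pred = [1, 1, 2, 2, 2, 1]
print(round(weighted_f1(y_true, y_pred), 4))
```

Because most submissions fall in range 1, the weighted score is dominated by how well the frequent classes are predicted.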

### Data Files

### train_submissions.csv 
This file contains 155,295 submissions selected randomly from 221,850 submissions. It has 3 columns (‘user_id’, ‘problem_id’, ‘attempts_range’). The variable ‘attempts_range’ denotes the range in which the number of attempts the user made to get the solution accepted lies.

|attempts_range | No. of attempts            |
|---------------|----------------------------|
|1              |                         1-1|
|2              |                         2-3|
|3              |                         4-5|
|4              |                         6-7|
|5              |                         8-9|
|6              |                        >=10|
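
The table above can be written as a small function. This is only a sketch: the name `attempts_to_range` is hypothetical, and the raw attempt counts are not part of the provided files, which already contain the range number.

```python
def attempts_to_range(attempts: int) -> int:
    """Map a raw attempt count to the attempts_range bucket from the table."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    # 1 -> 1, 2-3 -> 2, 4-5 -> 3, 6-7 -> 4, 8-9 -> 5, 10 or more -> 6
    return min(6, attempts // 2 + 1)


for n in (1, 3, 5, 7, 9, 25):
    print(n, attempts_to_range(n))
```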




### user_data.csv - This file contains user data. It has the following features:
- user_id - unique ID assigned to each user
- submission_count - total number of user submissions
- problem_solved - total number of accepted user submissions
- contribution - user contribution to the judge
- country - location of user
- follower_count - number of users who follow this user
- last_online_time_seconds - time when user was last seen online
- max_rating - maximum rating of user
- rating - rating of user
- rank - one of ‘beginner’, ‘intermediate’, ‘advanced’, ‘expert’
- registration_time_seconds - time when user was registered

### problem_data.csv - This file contains problem data. It has the following features:
- problem_id - unique ID assigned to each problem
- level_type - the difficulty level of the problem, from ‘A’ to ‘N’
- points - amount of points for the problem
- tags - problem tag(s) like greedy, graphs, DFS etc.

In [1]:
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

In [2]:
if IN_COLAB:
    !pip install xgboost --upgrade
    !pip install yellowbrick
    !pip install mlxtend==0.17.2
    !pip install tpot

In [3]:
import numpy as np

import pandas as pd

from mlxtend.preprocessing import TransactionEncoder

from pathlib import Path

In [4]:
### script level variables
if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    data_path = 'drive/My Drive/datasource/userproblem/{file}'
    source_path = Path.cwd()
else:
    data_path='{file}'
    source_path =Path.cwd().joinpath('datasource')
    
train_file = source_path.joinpath(data_path.format(file='train_submissions.csv'))
test_file = source_path.joinpath(data_path.format(file='test_submissions.csv'))
user_file  = source_path.joinpath(data_path.format(file='user_data.csv'))
problem_file  = source_path.joinpath(data_path.format(file='problem_data.csv'))

In [5]:
def load_data(train_file,test_file,user_file,problem_file):
    train_df = pd.read_csv(train_file)
    test_df = pd.read_csv(test_file)
    user_df = pd.read_csv(user_file)
    problem_df = pd.read_csv(problem_file)
    return train_df,test_df,user_df,problem_df
    

In [6]:
def extended_describe(dataframe):
    extended_describe_df = dataframe.describe(include='all').T 
    extended_describe_df['null_count']= dataframe.isnull().sum()
    return extended_describe_df

In [7]:
def merge_df(left_df,right_df,how,on,suffixes):
    if not on:
        raise ValueError("Unable to join dataframes as join cols not specified")
    how = how or 'left'
    suffixes = suffixes or ('_left','_right')
    return  left_df.merge(right=right_df,how =how,on =on,suffixes=suffixes)
    

In [8]:
train_df,test_df,user_df,problem_df = load_data(train_file,test_file,user_file,problem_file)

### Data Exploration

In [9]:
extended_describe(train_df)

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max,null_count
user_id,155295,3529.0,user_1009,105.0,,,,,,,,0
problem_id,155295,5776.0,prob_5071,1365.0,,,,,,,,0
attempts_range,155295,,,,1.75503,1.07845,1.0,1.0,1.0,2.0,6.0,0


In [10]:
extended_describe(test_df)

Unnamed: 0,count,unique,top,freq,null_count
ID,66555,66555,user_1690_prob_1394,1,0
user_id,66555,3501,user_2744,52,0
problem_id,66555,4716,prob_5071,602,0


In [11]:
extended_describe(user_df)

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max,null_count
user_id,3571,3571.0,user_1871,1.0,,,,,,,,0
submission_count,3571,,,,299.481,366.103,1.0,66.5,169.0,390.0,4570.0,0
problem_solved,3571,,,,267.894,344.14,0.0,53.0,146.0,349.0,4476.0,0
contribution,3571,,,,4.10249,16.5523,-64.0,0.0,0.0,0.0,171.0,0
country,2418,79.0,India,619.0,,,,,,,,1153
follower_count,3571,,,,46.6906,211.495,0.0,4.0,13.0,40.0,10575.0,0
last_online_time_seconds,3571,,,,1502680000.0,5114850.0,1484240000.0,1502690000.0,1505050000.0,1505550000.0,1505600000.0,0
max_rating,3571,,,,390.374,92.4288,303.899,317.661,355.791,444.954,983.085,0
rating,3571,,,,350.166,106.593,0.0,279.243,329.702,413.418,911.124,0
rank,3571,4.0,beginner,1509.0,,,,,,,,0


In [12]:
extended_describe(problem_df)

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max,null_count
problem_id,6544,6544.0,prob_4451,1.0,,,,,,,,0
level_type,6411,14.0,A,1042.0,,,,,,,,133
points,2627,,,,1452.38,789.542,-1.0,1000.0,1500.0,2000.0,5000.0,3917
tags,3060,882.0,implementation,297.0,,,,,,,,3484


### A few intuitions
- user_df: country has nulls
- user_df has last_online_time_seconds and registration_time_seconds, which we can try to convert to dates
- problem_df: level_type, points and tags have nulls
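
As a minimal sketch of the date conversion idea, the snippet below turns two epoch-second timestamps into UTC datetimes and counts the whole days between them, using only the standard library. The helper name `days_between` is hypothetical; the two timestamps are the registration and last-online values of user_3311 in user_data.csv.

```python
from datetime import datetime, timezone


def days_between(registration_seconds: int, last_online_seconds: int) -> int:
    """Whole days between registration and last activity (both epoch seconds)."""
    registered = datetime.fromtimestamp(registration_seconds, tz=timezone.utc)
    last_online = datetime.fromtimestamp(last_online_seconds, tz=timezone.utc)
    return (last_online - registered).days


registered_at = 1466686436   # 2016-06-23 12:53:56 UTC
last_seen_at = 1504111645    # 2017-08-30 16:47:25 UTC
print(days_between(registered_at, last_seen_at))  # 433
```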

In [13]:
rank_map = {'beginner':1,'intermediate':2,'advanced':3,'expert':4}


In [14]:
# Derive date, tenure, rank and rating features
user_df['registration'] = pd.to_datetime(user_df['registration_time_seconds'], unit='s')
user_df['last_online'] = pd.to_datetime(user_df['last_online_time_seconds'], unit='s')
user_df['days_spent'] = (user_df['last_online'] -user_df['registration']).dt.days 
user_df['rank'] = user_df['rank'].map(rank_map)
user_df['rating_diff'] = user_df['max_rating']- user_df['rating']
user_df['rating_ratio'] = user_df['rating']/ user_df['max_rating']

In [15]:
# isna() catches every missing value, unlike an identity check against np.nan
user_df['country_not_specified'] = user_df['country'].isna().astype(int)

In [16]:
user_df.head()

Unnamed: 0,user_id,submission_count,problem_solved,contribution,country,follower_count,last_online_time_seconds,max_rating,rating,rank,registration_time_seconds,registration,last_online,days_spent,rating_diff,rating_ratio,country_not_specified
0,user_3311,47,40,0,,4,1504111645,348.337,330.849,2,1466686436,2016-06-23 12:53:56,2017-08-30 16:47:25,433,17.488,0.949796,1
1,user_3028,63,52,0,India,17,1498998165,405.677,339.45,2,1441893325,2015-09-10 13:55:25,2017-07-02 12:22:45,660,66.227,0.836749,0
2,user_2268,226,203,-8,Egypt,24,1505566052,307.339,284.404,1,1454267603,2016-01-31 19:13:23,2017-09-16 12:47:32,593,22.935,0.925376,0
3,user_480,611,490,1,Ukraine,94,1505257499,525.803,471.33,3,1350720417,2012-10-20 08:06:57,2017-09-12 23:04:59,1788,54.473,0.8964,0
4,user_650,504,479,12,Russia,4,1496613433,548.739,486.525,3,1395560498,2014-03-23 07:41:38,2017-06-04 21:57:13,1169,62.214,0.886624,0


In [17]:
user_df.drop(['country','last_online_time_seconds','registration_time_seconds','registration','last_online'],axis=1,inplace=True)

In [18]:
def preprocess_problem(problem_df):
    # Flag problems where both points and tags are missing
    problem_df['points_&_tags_none'] = (problem_df['points'].isna() & problem_df['tags'].isna()).astype(int)

    # Impute missing points from the difficulty level; 2500 is the fallback
    # for levels without an entry and for problems with no level at all
    level_points = {'A': 500, 'B': 1000, 'C': 1500, 'D': 2000,
                    'E': 2500, 'F': 2750, 'G': 3000, 'H': 3000}
    problem_df['points'] = problem_df['points'].fillna(problem_df['level_type'].map(level_points)).fillna(2500)

    # Encode levels as integers in alphabetical order; missing levels become -1
    level_type_map = {k:i for i,k in enumerate(sorted(problem_df['level_type'].dropna().unique()))}
    problem_df['level_type'] = problem_df['level_type'].map(lambda x:level_type_map.get(x,-1))

    # Assign back instead of calling fillna(inplace=True) on a column selection
    problem_df['tags'] = problem_df['tags'].fillna('')
    return problem_df


def process_tags(problem_df):
    te = TransactionEncoder()
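    # Literal replacements use regex=False; the character class needs regex=True and a raw string
    text_array = problem_df['tags'].str.replace('*', '', regex=False).str.replace('2-', '', regex=False).str.replace(r'[\s-]', '_', regex=True).str.split(pat=',').values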
    te_ary = te.fit(text_array).transform(text_array)
    text_df =pd.DataFrame(te_ary.astype("int"), columns=te.columns_)
    text_df.drop([''],inplace=True,axis=1) 
    problem_df =pd.concat([problem_df, text_df], axis= 1)
    return problem_df

In [19]:
problem_df = preprocess_problem(problem_df)
problem_df = process_tags(problem_df)
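
The tag encoding performed by `process_tags` can be sketched in plain Python to show what each step does. This is an illustration rather than the notebook's implementation: `clean_tags` and `multi_hot` are hypothetical helpers, and the raw tag strings are made-up examples in the same style as the data.

```python
import re


def clean_tags(raw: str) -> list:
    """Apply the same cleaning steps as process_tags and split on commas."""
    cleaned = raw.replace('*', '').replace('2-', '')
    cleaned = re.sub(r'[\s-]', '_', cleaned)
    return [tag for tag in cleaned.split(',') if tag]


def multi_hot(tag_lists):
    """Return sorted tag names and one 0/1 row per problem."""
    columns = sorted({tag for tags in tag_lists for tag in tags})
    rows = [[int(col in tags) for col in columns] for tags in tag_lists]
    return columns, rows


# Hypothetical raw tag strings; the empty string stands for a problem without tags.
raw_tags = ['2-sat,graphs', '*special,divide and conquer', '']
columns, rows = multi_hot([clean_tags(raw) for raw in raw_tags])
print(columns)
for row in rows:
    print(row)
```

A problem without tags ends up as an all-zero row, which is why the notebook also keeps the `points_&_tags_none` indicator.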

In [20]:
train_df['ID']= train_df['user_id']+'_'+train_df['problem_id']

In [21]:
train_df.head()

Unnamed: 0,user_id,problem_id,attempts_range,ID
0,user_232,prob_6507,1,user_232_prob_6507
1,user_3568,prob_2994,3,user_3568_prob_2994
2,user_1600,prob_5071,1,user_1600_prob_5071
3,user_2256,prob_703,1,user_2256_prob_703
4,user_2321,prob_356,1,user_2321_prob_356


In [22]:
full_df = pd.concat([train_df,test_df],ignore_index=True)

In [23]:
full_df = merge_df(full_df,user_df,how ='left',on =['user_id'],suffixes=('_left', '_right'))

In [24]:
full_df = merge_df(full_df,problem_df,how ='left',on =['problem_id'],suffixes=('_left', '_right'))

In [25]:
full_df.head()

Unnamed: 0,user_id,problem_id,attempts_range,ID,submission_count,problem_solved,contribution,follower_count,max_rating,rating,...,sat,schedules,shortest_paths,sortings,special,string_suffix_structures,strings,ternary_search,trees,two_pointers
0,user_232,prob_6507,1.0,user_232_prob_6507,53,47,0,1,307.913,206.709,...,0,0,0,0,0,0,1,0,0,0
1,user_3568,prob_2994,3.0,user_3568_prob_2994,133,118,0,0,324.255,235.378,...,0,0,0,0,0,0,0,0,0,0
2,user_1600,prob_5071,1.0,user_1600_prob_5071,50,44,0,7,343.177,229.358,...,0,0,0,0,0,0,0,0,0,0
3,user_2256,prob_703,1.0,user_2256_prob_703,271,233,23,40,436.927,399.083,...,0,0,0,0,0,0,0,0,0,0
4,user_2321,prob_356,1.0,user_2321_prob_356,155,135,0,80,492.546,472.19,...,0,0,0,0,0,0,0,0,1,0


In [26]:
train_df = full_df[full_df['attempts_range'].notna()]
test_df = full_df[full_df['attempts_range'].isna()]

In [27]:
train_df = train_df.drop(['user_id','problem_id','tags'],axis=1)

In [28]:
test_df = test_df.drop(['user_id','problem_id','attempts_range','tags'],axis=1)

In [29]:
X = train_df.drop(['attempts_range','ID'], axis=1)
y = train_df['attempts_range'].astype(int)
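
One caveat, stated as an assumption about the installed library version: newer XGBoost releases expect class labels to start at 0, while `attempts_range` runs from 1 to 6. If the classifier rejects the labels, a shift by one before fitting and back after predicting is enough. The helper names below are hypothetical.

```python
def encode_labels(labels):
    """Shift attempts_range values 1..6 to zero-based class indices 0..5."""
    return [label - 1 for label in labels]


def decode_labels(encoded):
    """Shift zero-based predictions back to attempts_range values 1..6."""
    return [value + 1 for value in encoded]


sample = [1, 3, 6, 2]            # hypothetical attempts_range values
encoded = encode_labels(sample)  # what the classifier would be trained on
print(encoded, decode_labels(encoded))
```

With pandas the same shift is `y - 1` before `fit` and `+ 1` on the predictions written to the submission file.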

In [30]:
X.head()

Unnamed: 0,submission_count,problem_solved,contribution,follower_count,max_rating,rating,rank,days_spent,rating_diff,rating_ratio,...,sat,schedules,shortest_paths,sortings,special,string_suffix_structures,strings,ternary_search,trees,two_pointers
0,53,47,0,1,307.913,206.709,1,827,101.204,0.671323,...,0,0,0,0,0,0,1,0,0,0
1,133,118,0,0,324.255,235.378,1,550,88.877,0.725904,...,0,0,0,0,0,0,0,0,0,0
2,50,44,0,7,343.177,229.358,1,361,113.819,0.668337,...,0,0,0,0,0,0,0,0,0,0
3,271,233,23,40,436.927,399.083,2,664,37.844,0.913386,...,0,0,0,0,0,0,0,0,0,0
4,155,135,0,80,492.546,472.19,3,783,20.356,0.958672,...,0,0,0,0,0,0,0,0,1,0


In [31]:
from sklearn.model_selection import train_test_split
X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.8, random_state=42)

In [32]:
import xgboost as xgb

In [33]:
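params = {'max_depth':8,
          'objective':'multi:softmax',
          'n_estimators':500,
          # not a recognized XGBoost parameter (the native name is 'num_class');
          # XGBClassifier infers the number of classes from y
          'num_classes':6
         }

if IN_COLAB:
    params['tree_method'] = "gpu_hist"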


In [34]:
clf = xgb.XGBClassifier(**params)

In [35]:
clf.fit(X_train,y_train)

Parameters: { num_classes } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




KeyboardInterrupt: 

In [None]:
y_pred = clf.predict(X_validation)

In [None]:
from sklearn.metrics import f1_score
f1_score(y_validation, y_pred, average='weighted')

In [None]:
y_test = clf.predict(test_df.drop(['ID'],axis=1))

In [None]:
from tpot import TPOTClassifier
tpot = TPOTClassifier(verbosity=2, max_time_mins=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_validation, y_validation))

In [None]:
submission = pd.DataFrame()
submission['ID'] = test_df['ID']
submission['attempts_range']=y_test

In [None]:
import datetime
FORMAT = '%Y%m%d%H%M%S'
timestamp=datetime.datetime.now().strftime(FORMAT)
filename ="Submission_xgboost_"+timestamp+"_out.csv"
if IN_COLAB:
    out_path = '/content/drive/My Drive/datasource/userproblem/{filename}'
else:
    out_path ="{filename}"
    
out_path = out_path.format(filename= filename)

submission.to_csv(out_path,index=False)