### OBJECTIVE: Recommend the next three challenges to the users

#### DataSets

1. train.csv: Contains the sequence of 13 challenges attempted by each user
2. challenge_data.csv: Contains attributes related to each challenge
3. test.csv: Contains the first 10 challenges solved by a new set of users (not present in train).
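
Because train holds 13 challenges per user and test holds only the first 10, one way to build a local validation set is to split each train sequence the same way. The sketch below uses made-up toy data; `split_history_and_targets` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for train.csv: one user with 13 challenges in sequence.
# Column names follow train.csv; the challenge IDs are made up.
toy_train = pd.DataFrame({
    "user_id": [1] * 13,
    "challenge_sequence": list(range(1, 14)),
    "challenge": [f"C{i}" for i in range(1, 14)],
})

def split_history_and_targets(df, n_history=10):
    """Return (first n_history rows per user, remaining rows per user)."""
    history = df[df["challenge_sequence"] <= n_history]
    targets = df[df["challenge_sequence"] > n_history]
    return history, targets

history, targets = split_history_and_targets(toy_train)
print("History rows:", len(history), "Target rows:", len(targets))
```

The last three rows then play the role of the three challenges to be recommended.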

#### Notebook Objective: Detailed exploration of the data

IMPORTS and SETTING PATH

In [47]:
import pandas as pd
import os
import plotly.express as px
import numpy as np

In [22]:
print(os.getcwd())
data_path = "Personal Projects/AV/rec_eng/data/"
train_data = pd.read_csv(data_path+"train.csv")
test_data = pd.read_csv(data_path+"test.csv")

/Users/pbhat


In [19]:
train_data.head()

Unnamed: 0,user_sequence,user_id,challenge_sequence,challenge
0,4576_1,4576,1,CI23714
1,4576_2,4576,2,CI23855
2,4576_3,4576,3,CI24917
3,4576_4,4576,4,CI23663
4,4576_5,4576,5,CI23933


In [23]:
test_data.head()

Unnamed: 0,user_sequence,user_id,challenge_sequence,challenge
0,4577_1,4577,1,CI23855
1,4577_2,4577,2,CI23933
2,4577_3,4577,3,CI24917
3,4577_4,4577,4,CI24915
4,4577_5,4577,5,CI23714


In [25]:
total_users_in_train = len(train_data['user_id'].unique())
total_users_in_test = len(test_data['user_id'].unique())

print("Total Number of users in Train:",total_users_in_train)
print("Total Number of users in Test:",total_users_in_test)


Total Number of users in Train: 69532
Total Number of users in Test: 39732
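
The dataset description says the test users do not appear in train. A minimal sketch of how that could be checked with a set intersection, shown on made-up user IDs rather than the real files:

```python
import pandas as pd

# Toy stand-ins for train_data and test_data (user IDs are made up).
toy_train = pd.DataFrame({"user_id": [4576, 4576, 4578]})
toy_test = pd.DataFrame({"user_id": [4577, 4577, 4579]})

train_users = set(toy_train["user_id"])
test_users = set(toy_test["user_id"])

# Users that appear in both sets; expected to be empty if test users are new.
shared_users = train_users & test_users
print("Users present in both Train and Test:", len(shared_users))
```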


In [53]:
#Count of challenges per user
train_user_challenge_count = train_data.groupby('user_id')['challenge_sequence'].count()
test_user_challenge_count = test_data.groupby('user_id')['challenge_sequence'].count()
# Every user in train has attempted exactly 13 challenges
print("Number of challenges per user in Train:",np.unique(train_user_challenge_count.values))
# Every user in test has attempted exactly 10 challenges
print("Number of challenges per user in Test:",np.unique(test_user_challenge_count.values))

Number of challenges per user in Train: [13]
Number of challenges per user in Test: [10]


In [57]:
print("Number of unique challenges in Train:",len(train_data['challenge'].unique()))
print("Number of unique challenges in Test:", len(test_data['challenge'].unique()))

Number of unique challenges in Train: 5348
Number of unique challenges in Test: 4477
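
The two counts alone do not say how much the challenge sets overlap. A small sketch, on made-up challenge IDs, of how to find challenges that appear in test but never in train (these would be cold-start items for any model fitted on train only):

```python
import pandas as pd

# Toy stand-ins for train_data and test_data (challenge IDs are made up).
toy_train = pd.DataFrame({"challenge": ["CI1", "CI2", "CI3", "CI2"]})
toy_test = pd.DataFrame({"challenge": ["CI2", "CI3", "CI9"]})

train_challenges = set(toy_train["challenge"])
test_challenges = set(toy_test["challenge"])

# Challenges seen in both sets, and challenges seen only in test.
common_challenges = train_challenges & test_challenges
test_only_challenges = test_challenges - train_challenges

print("Challenges in both:", len(common_challenges))
print("Challenges only in Test:", len(test_only_challenges))
```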


In [60]:
challenge_data = pd.read_csv(data_path+"challenge_data.csv")

In [61]:
challenge_data

Unnamed: 0,challenge_ID,programming_language,challenge_series_ID,total_submissions,publish_date,author_ID,author_gender,author_org_ID,category_id
0,CI23478,2,SI2445,37.0,06-05-2006,AI563576,M,AOI100001,
1,CI23479,2,SI2435,48.0,17-10-2002,AI563577,M,AOI100002,32.0
2,CI23480,1,SI2435,15.0,16-10-2002,AI563578,M,AOI100003,
3,CI23481,1,SI2710,236.0,19-09-2003,AI563579,M,AOI100004,70.0
4,CI23482,2,SI2440,137.0,21-03-2002,AI563580,M,AOI100005,
...,...,...,...,...,...,...,...,...,...
5601,CI29079,1,SI2864,,17-06-2010,AI567059,M,AOI101717,29.0
5602,CI29080,1,SI2865,,25-06-2010,AI567060,F,AOI101718,29.0
5603,CI29081,1,SI2865,,25-06-2010,AI566257,M,AOI100108,29.0
5604,CI29082,1,SI2865,,25-06-2010,AI563777,M,AOI100108,29.0


In [66]:
#How many programming languages are allowed in the portal
print("Programming Languages:",challenge_data['programming_language'].unique())

Programming Languages: [2 1 3]


In [69]:
#How many Challenge Series are present in the data
print("Number of Challenge Series:",len(challenge_data['challenge_series_ID'].unique()))

Number of Challenge Series: 436


In [70]:
#How many challenge categories are present
print("Number of Challenge Categories:",len(challenge_data['category_id'].unique()))

Number of Challenge Categories: 195
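
Note that `len(series.unique())` counts NaN as a value. The missing-value summary later in this notebook shows nulls in both `challenge_series_ID` and `category_id`, so the 436 and 195 above each likely include one NaN entry. A minimal sketch on made-up data showing the difference from `nunique()`:

```python
import numpy as np
import pandas as pd

# Toy column with two real categories and two missing values.
toy = pd.DataFrame({"category_id": [32.0, np.nan, 70.0, np.nan, 32.0]})

# unique() keeps NaN as one distinct value.
count_with_nan = len(toy["category_id"].unique())

# nunique() drops NaN by default (dropna=True).
count_without_nan = toy["category_id"].nunique()

print("Including NaN:", count_with_nan)
print("Excluding NaN:", count_without_nan)
```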


Objective: Recommend the next three challenges for the users
1. How many different challenge categories do users pursue?
2. Do users stick to a single category?
3. Do users stick to a single series?
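
One possible way to approach these questions is to join each user's challenges to the challenge attributes and count distinct categories and series per user. The sketch below uses made-up toy frames that mirror the column names of train.csv and challenge_data.csv; it is an illustration, not the analysis performed in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data (IDs are made up).
toy_train = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "challenge": ["CI1", "CI2", "CI3", "CI1", "CI2", "CI4"],
})

# Toy stand-in for challenge_data, including a missing category.
toy_challenges = pd.DataFrame({
    "challenge_ID": ["CI1", "CI2", "CI3", "CI4"],
    "category_id": [32.0, 32.0, 70.0, np.nan],
    "challenge_series_ID": ["SI1", "SI1", "SI2", "SI2"],
})

# Attach challenge attributes to every attempted challenge.
merged = toy_train.merge(
    toy_challenges, left_on="challenge", right_on="challenge_ID", how="left"
)

# Distinct categories and series per user (missing values are ignored).
per_user = merged.groupby("user_id").agg(
    n_categories=("category_id", "nunique"),
    n_series=("challenge_series_ID", "nunique"),
)
print(per_user)
```

Users with `n_categories == 1` or `n_series == 1` would be the ones who stay within a single category or series. Since roughly a third of `category_id` values are missing, the category counts should be read with care.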


In [77]:
#Which categories have the most challenges (top 20)
top_20_challenge_category = challenge_data.groupby('category_id')['challenge_ID'].count().sort_values(ascending=False).head(20)
px.bar(top_20_challenge_category)

In [78]:
#Challenge count for every category, sorted from most to fewest challenges
challenge_count_by_category = challenge_data.groupby('category_id')['challenge_ID'].count().sort_values(ascending=False)
px.bar(challenge_count_by_category)

In [91]:
challenge_data_na_stat = pd.DataFrame([(challenge_data.isnull().sum()/len(challenge_data))*100,challenge_data.isnull().sum()]).T
challenge_data_na_stat.columns = ["na_percentage","na_record_count"]
challenge_data_na_stat

Unnamed: 0,na_percentage,na_record_count
challenge_ID,0.0,0.0
programming_language,0.0,0.0
challenge_series_ID,0.214056,12.0
total_submissions,6.278987,352.0
publish_date,0.0,0.0
author_ID,0.695683,39.0
author_gender,1.730289,97.0
author_org_ID,4.423832,248.0
category_id,32.839814,1841.0
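
The transposed construction above turns the record counts into floats (for example 12.0). As an alternative sketch, building the summary from a dict of Series keeps the counts as integers; it is shown here on a made-up frame, with `toy` as an illustrative stand-in for `challenge_data`.

```python
import numpy as np
import pandas as pd

# Toy frame: column "a" is complete, column "b" has one missing value.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1.0, np.nan, 3.0, 4.0],
})

# isna().mean() gives the fraction of missing rows per column.
na_stat = pd.DataFrame({
    "na_percentage": toy.isna().mean() * 100,
    "na_record_count": toy.isna().sum(),
})
print(na_stat)
```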


In [106]:
#Features on Challenge data
#Parse publish_date (day-month-year) so a recency feature can be derived from it
challenge_data['publish_date'] = pd.to_datetime(challenge_data['publish_date'], format= "%d-%m-%Y")
print("Publishing date ranges from {} to {}".format(challenge_data['publish_date'].min(), challenge_data['publish_date'].max()))

Publishing date ranges from 1999-08-26 00:00:00 to 2010-06-25 00:00:00


In [109]:
challenge_data['day_since_published'] = (challenge_data['publish_date'].max() - challenge_data['publish_date']).dt.days

In [110]:
challenge_data

Unnamed: 0,challenge_ID,programming_language,challenge_series_ID,total_submissions,publish_date,author_ID,author_gender,author_org_ID,category_id,day_since_published
0,CI23478,2,SI2445,37.0,2006-05-06,AI563576,M,AOI100001,,1511
1,CI23479,2,SI2435,48.0,2002-10-17,AI563577,M,AOI100002,32.0,2808
2,CI23480,1,SI2435,15.0,2002-10-16,AI563578,M,AOI100003,,2809
3,CI23481,1,SI2710,236.0,2003-09-19,AI563579,M,AOI100004,70.0,2471
4,CI23482,2,SI2440,137.0,2002-03-21,AI563580,M,AOI100005,,3018
...,...,...,...,...,...,...,...,...,...,...
5601,CI29079,1,SI2864,,2010-06-17,AI567059,M,AOI101717,29.0,8
5602,CI29080,1,SI2865,,2010-06-25,AI567060,F,AOI101718,29.0,0
5603,CI29081,1,SI2865,,2010-06-25,AI566257,M,AOI100108,29.0,0
5604,CI29082,1,SI2865,,2010-06-25,AI563777,M,AOI100108,29.0,0
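
The `day_since_published` column could be turned into the recency score mentioned above in several ways. The sketch below assumes a simple linear scaling to the range 0 to 1 (1 for the newest challenge, 0 for the oldest); the `recency_score` column and the scaling choice are assumptions, not something the notebook defines.

```python
import pandas as pd

# Toy stand-in for challenge_data using three dates from the range above.
toy = pd.DataFrame({"publish_date": ["26-08-1999", "06-05-2006", "25-06-2010"]})
toy["publish_date"] = pd.to_datetime(toy["publish_date"], format="%d-%m-%Y")

# Days between each challenge and the most recently published one.
latest = toy["publish_date"].max()
toy["day_since_published"] = (latest - toy["publish_date"]).dt.days

# Assumed linear scaling; requires at least two distinct dates (max_days > 0).
max_days = toy["day_since_published"].max()
toy["recency_score"] = 1 - toy["day_since_published"] / max_days
print(toy)
```

A decaying function (for example exponential) might suit the data better if only very recent challenges matter; that is left as a modelling choice.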
