---
---

<center><h1> The Approach </h1></center>

---    
    
* One way to approach this problem is to treat it as an NLP-style sequence prediction problem, where you predict the next word given the previous words. Here the "words" are challenges and each user's history is a "sentence".

* There are many ways to make predictions under this framing, for example by estimating conditional probabilities.

* Here we use a simple co-occurrence matrix, which is then used to predict the next challenge.

* We use both the train and test data to build the co-occurrence matrix.
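
To make the conditional-probability alternative mentioned above concrete, here is a minimal sketch on made-up data. The sequences, the challenge names `A` to `D`, and the helper `next_challenge_probs` are all hypothetical; it only estimates P(next | previous) from adjacent pairs and is not the method used in the rest of this notebook.

```python
from collections import Counter, defaultdict

# Hypothetical toy sequences; the challenge names are made up for illustration.
toy_sequences = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["B", "C", "A"],
]

# Count how often each challenge is immediately followed by another one.
next_counts = defaultdict(Counter)
for seq in toy_sequences:
    for prev, nxt in zip(seq, seq[1:]):
        next_counts[prev][nxt] += 1


def next_challenge_probs(prev):
    """Return P(next | prev) estimated from the adjacent-pair counts."""
    counts = next_counts[prev]
    total = sum(counts.values())
    return {nxt: c / total for nxt, c in counts.items()}


probs_after_b = next_challenge_probs("B")
print(probs_after_b)
```

The co-occurrence approach used below differs in that it looks at a whole window of earlier challenges rather than only the immediately preceding one.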
    
----
    
#### `Importing the Required Libraries`

----

In [1]:
# import the required libraries

from collections import Counter
from scipy import sparse
import numpy as np
import pandas as pd
import pickle

In [2]:
# reading train and test file

train = pd.read_csv("dataset/train.csv")
test = pd.read_csv("dataset/test.csv")

In [3]:
train.head()

Unnamed: 0,user_sequence,user_id,challenge_sequence,challenge
0,4576_1,4576,1,CI23714
1,4576_2,4576,2,CI23855
2,4576_3,4576,3,CI24917
3,4576_4,4576,4,CI23663
4,4576_5,4576,5,CI23933


#### `Creating dataset in the required form for Co-occurrence matrix`

---

In [4]:
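# convert train from long format to wide format (one row per user, one column per step)
# each (user_id, challenge_sequence) pair holds a single challenge, so "first" simply picks it

wide_train = train.pivot_table(index="user_id", columns="challenge_sequence", values="challenge", aggfunc="first").reset_index()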

In [5]:
wide_train.head(20)

challenge_sequence,user_id,1,2,3,4,5,6,7,8,9,10,11,12,13
0,4576,CI23714,CI23855,CI24917,CI23663,CI23933,CI25135,CI23975,CI25126,CI24915,CI24957,CI24958,CI23667,CI23691
1,4580,CI23663,CI23855,CI23933,CI23975,CI24530,CI23714,CI23648,CI23781,CI23667,CI25135,CI24915,CI25727,CI26051
2,4581,CI26155,CI26156,CI26157,CI26158,CI26159,CI26160,CI26161,CI26162,CI26164,CI26165,CI26163,CI26166,CI26167
3,4582,CI23855,CI24915,CI24917,CI23933,CI23663,CI24958,CI23975,CI23714,CI24953,CI24944,CI25135,CI26051,CI24957
4,4585,CI23855,CI23975,CI24917,CI25135,CI23848,CI23714,CI23663,CI23933,CI24958,CI24915,CI24530,CI24187,CI25126
5,4587,CI23933,CI25727,CI26051,CI25125,CI25124,CI25633,CI23663,CI26050,CI23667,CI24915,CI24031,CI23855,CI28240
6,4590,CI23848,CI23855,CI23975,CI25135,CI23929,CI23714,CI23913,CI23663,CI25298,CI24917,CI23691,CI25733,CI25142
7,4591,CI23855,CI23933,CI24530,CI23663,CI23714,CI24534,CI24915,CI24917,CI25135,CI24527,CI24958,CI24261,CI23648
8,4592,CI23855,CI24917,CI25135,CI23848,CI23714,CI23975,CI23663,CI25142,CI23913,CI25126,CI23769,CI24958,CI24187
9,4593,CI26155,CI26157,CI26158,CI26159,CI26160,CI26161,CI26162,CI26164,CI26165,CI26163,CI26166,CI26167,CI26168
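
The long-to-wide reshape can be tried on a miniature frame. This is a sketch under the assumption that every `(user_id, challenge_sequence)` pair occurs exactly once; in that case `DataFrame.pivot` is an alternative that needs no aggregation function. The toy values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of train.csv with the same column names.
toy_long = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "challenge_sequence": [1, 2, 3, 1, 2, 3],
    "challenge": ["CI1", "CI2", "CI3", "CI2", "CI3", "CI4"],
})

# pivot() needs no aggregation function, but it raises a ValueError
# if a (user_id, challenge_sequence) pair occurs more than once.
toy_wide = toy_long.pivot(
    index="user_id", columns="challenge_sequence", values="challenge"
).reset_index()
print(toy_wide)
```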


In [6]:
# dropping the user_id, since we won't be needing those for our co-occurrence matrix

wide_train.drop(["user_id"], axis =1, inplace = True)

In [7]:
# convert each row for a user into a string

rows = []
for index, row in wide_train.iterrows():
    r = " ".join(row.map(str))
    rows.append(r)

In [9]:
# converting test to wide format
# each (user_id, challenge_sequence) pair holds a single challenge, so "first" simply picks it

wide_test = test.pivot_table(index="user_id", columns="challenge_sequence", values="challenge", aggfunc="first").reset_index()

In [10]:
wide_test.shape

(39732, 11)

In [11]:
# saving test user_id for future use

test_ids = wide_test['user_id']

In [12]:
# dropping user_id from wide test

wide_test.drop(["user_id"], axis =1, inplace = True)

----

#### `Appending the test strings into the train strings (vertically)`

----

In [13]:
for index, row in wide_test.iterrows():
    r = " ".join(row.map(str))
    rows.append(r)

In [14]:
len(rows)

109264

In [16]:
# creating a corpus
thefile = open("corpus.txt","w")

In [17]:
for element in rows:
    thefile.write("%s\n"%element)

In [19]:
thefile.close()
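
The corpus-writing steps can also be sketched with a `with` block, which closes the file automatically. The frame below is hypothetical, and the `dropna()` call is an assumption: it only matters if some users have fewer challenges than others, in which case it keeps the literal string `nan` out of the corpus.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical wide frame; the second user has a shorter history (NaN in step 3).
toy_wide = pd.DataFrame({1: ["CI1", "CI2"], 2: ["CI2", "CI3"], 3: ["CI3", np.nan]})

# One space-separated string per user, skipping missing steps.
toy_rows = [" ".join(row.dropna().astype(str)) for _, row in toy_wide.iterrows()]

# The with-block closes the file even if a write fails.
toy_path = os.path.join(tempfile.mkdtemp(), "toy_corpus.txt")
with open(toy_path, "w") as fh:
    for line in toy_rows:
        fh.write(line + "\n")

with open(toy_path) as fh:
    toy_lines = fh.read().splitlines()
print(toy_lines)
```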

----

#### `Creating co-occurrence matrix from the corpus`

----

In [20]:
# reading the corpus

corpus = open("corpus.txt","r")

In [21]:
# creating a dictionary with key = challenge_name and value = frequency
vocab = Counter()

In [22]:
# updating the vocab dictionary with each line in the corpus
for line in corpus:
    tokens = line.strip().split()
    vocab.update(tokens)

In [23]:
# modifying the vocab dictionary to begin creating a mapping of challenge_id to integers

vocab = {word: (i, freq) for i, (word, freq) in enumerate(vocab.items())}

In [4]:
#vocab

#### `Creating a reverse mapping from integer to challenge_id to decode the predictions made.`

----

In [25]:
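# note: enumerate() adds an extra index in front of each (word, (i, freq)) item,
# so despite its name id2word ends up mapping challenge_id -> integer id;
# it is inverted again before being used to label the matrix
id2word = dict((i, word) for word, (i, _) in enumerate(vocab.items()))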

In [3]:
#id2word

In [27]:
vocab_size = len(vocab)
print(vocab_size)

5502
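
The two mappings can be checked on a tiny corpus. This is a sketch with hypothetical names (`toy_corpus`, `toy_vocab`, `toy_id2word`); iterating `toy_vocab.items()` directly, without `enumerate`, is one way to obtain an integer-to-challenge dictionary.

```python
from collections import Counter

# Hypothetical two-line corpus standing in for corpus.txt.
toy_corpus = ["CI1 CI2 CI3", "CI2 CI3 CI4"]

toy_counts = Counter()
for line in toy_corpus:
    toy_counts.update(line.strip().split())

# challenge_id -> (integer id, frequency), in first-seen order.
toy_vocab = {word: (i, freq) for i, (word, freq) in enumerate(toy_counts.items())}

# integer id -> challenge_id, read straight from the stored ids.
toy_id2word = {i: word for word, (i, _) in toy_vocab.items()}

print(toy_vocab)
print(toy_id2word)
```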


----

#### `Creating a square co-occurrence matrix`

----

In [28]:
cooccurrences = sparse.lil_matrix((vocab_size, vocab_size),dtype=np.float64)
cooccurrences

<5502x5502 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in List of Lists format>

In [29]:
# context window size

window_size = 10

In [30]:
corpus = open("corpus.txt","r")

In [31]:
# Tuneable parameters : window_size, distance

for i, line in enumerate(corpus):
    tokens = line.strip().split()
    token_ids = [vocab[word][0] for word in tokens]
    
    for center_i, center_id in enumerate(token_ids):
        # Collect all word IDs in left window of center word
        context_ids = token_ids[max(0, center_i - window_size) : center_i]
        contexts_len = len(context_ids)

        for left_i, left_id in enumerate(context_ids):
            # Distance from center word
            
            distance = contexts_len - left_i

            # Weight by inverse of distance between words
            increment = 1.0 / float(distance)

            # Build co-occurrence matrix symmetrically (pretend we
            # are calculating right contexts as well)
            cooccurrences[center_id, left_id] += increment
            cooccurrences[left_id, center_id] += increment
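
To see what the inverse-distance weighting does, the same update rule can be run on a single three-challenge sequence. This is a sketch using a dense NumPy array and a hypothetical helper `toy_cooccurrence`; it mirrors the loop above but is only meant for small examples.

```python
import numpy as np


def toy_cooccurrence(token_ids, vocab_size, window_size):
    """Symmetric co-occurrence counts weighted by 1 / distance (dense, for illustration)."""
    matrix = np.zeros((vocab_size, vocab_size))
    for center_i, center_id in enumerate(token_ids):
        start = max(0, center_i - window_size)
        for left_i in range(start, center_i):
            # Number of positions between the left word and the center word.
            distance = center_i - left_i
            weight = 1.0 / distance
            matrix[center_id, token_ids[left_i]] += weight
            matrix[token_ids[left_i], center_id] += weight
    return matrix


# One hypothetical sequence of three distinct challenges with ids 0, 1, 2.
toy_matrix = toy_cooccurrence([0, 1, 2], vocab_size=3, window_size=10)
print(toy_matrix)
```

Adjacent challenges receive a weight of 1, while challenges two steps apart receive 0.5, so nearer neighbours dominate the counts. With `window_size=1` only adjacent pairs would be counted.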

In [32]:
# If set to anything other than None, challenges whose frequency is below this value are excluded.

min_count = None
#min_count = 20
print(min_count)

None


In [33]:
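# filling the values in a dense matrix

co_matrix = np.zeros([vocab_size, vocab_size])

# frequency of each challenge, indexed by its integer id
# (vocab stores (id, frequency), so the frequency is the second element)
freqs = np.zeros(vocab_size)
for idx, freq in vocab.values():
    freqs[idx] = freq

for i, (row, data) in enumerate(zip(cooccurrences.rows, cooccurrences.data)):
    # skip rows that belong to challenges rarer than min_count
    if min_count is not None and freqs[i] < min_count:
        continue

    for data_idx, j in enumerate(row):
        # skip columns that belong to challenges rarer than min_count
        if min_count is not None and freqs[j] < min_count:
            continue

        co_matrix[i, j] = data[data_idx]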
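
The same dense conversion and frequency filter can be sketched without explicit loops. The example below assumes a hypothetical array `toy_freqs` holding the frequency of each integer id; `toarray()` converts the sparse matrix and a boolean mask zeroes out rare challenges.

```python
import numpy as np
from scipy import sparse

# Hypothetical 3-challenge example: sparse counts plus per-id frequencies.
toy_sparse = sparse.lil_matrix((3, 3), dtype=np.float64)
toy_sparse[0, 1] = toy_sparse[1, 0] = 2.0
toy_sparse[0, 2] = toy_sparse[2, 0] = 0.5
toy_freqs = np.array([30, 25, 4])  # frequency of ids 0, 1, 2
toy_min_count = 20

toy_dense = toy_sparse.toarray()
if toy_min_count is not None:
    keep = toy_freqs >= toy_min_count
    # Zero out every row and column that belongs to a rare challenge.
    toy_dense[~keep, :] = 0.0
    toy_dense[:, ~keep] = 0.0
print(toy_dense)
```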

---

#### `Co-occurrence Matrix`

---

In [34]:
co_matrix

array([[   0.        , 1218.48412698,  952.68690476, ...,    0.        ,
           0.        ,    0.        ],
       [1218.48412698,    0.        , 1221.0265873 , ...,    0.        ,
           0.        ,    0.        ],
       [ 952.68690476, 1221.0265873 ,    0.        , ...,    0.        ,
           0.        ,    0.        ],
       ...,
       [   0.        ,    0.        ,    0.        , ...,    0.        ,
           0.        ,    0.        ],
       [   0.        ,    0.        ,    0.        , ...,    0.        ,
           0.        ,    0.        ],
       [   0.        ,    0.        ,    0.        , ...,    0.        ,
           0.        ,    0.        ]])

In [35]:
# saving the mapping to a pickle file
pickle_path = "./vocab_mapping.pkl"
pickle_mapping = open(pickle_path,"wb")
pickle.dump(id2word, pickle_mapping)
pickle_mapping.close()
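
A saved mapping can be checked with a quick round trip. This sketch uses a hypothetical two-entry dictionary and a temporary path; the `with` blocks close the files automatically.

```python
import os
import pickle
import tempfile

toy_mapping = {0: "CI1", 1: "CI2"}  # hypothetical mapping
toy_pickle_path = os.path.join(tempfile.mkdtemp(), "toy_mapping.pkl")

# Write the dictionary, then read it back to confirm nothing was lost.
with open(toy_pickle_path, "wb") as fh:
    pickle.dump(toy_mapping, fh)

with open(toy_pickle_path, "rb") as fh:
    restored = pickle.load(fh)
print(restored)
```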

In [36]:
# saving the co-occurrence matrix as a dataframe

co_occurence_dataframe = pd.DataFrame(co_matrix)

In [37]:
co_occurence_dataframe.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,5492,5493,5494,5495,5496,5497,5498,5499,5500,5501
0,0.0,1218.484127,952.686905,1065.586905,985.600397,800.447222,690.27619,986.594841,522.625,260.590476,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1218.484127,0.0,1221.026587,1252.965476,1092.545238,968.419048,926.743651,641.663095,721.213492,335.470635,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,952.686905,1221.026587,0.0,847.772222,969.061508,647.253571,509.603175,471.850397,797.802381,520.581746,...,0.0,0.0,0.0,0.0,0.142857,0.5,0.0,0.0,0.0,0.0
3,1065.586905,1252.965476,847.772222,0.0,976.407143,746.009921,705.35119,529.753571,646.126984,274.274603,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,985.600397,1092.545238,969.061508,976.407143,0.0,538.253175,487.259127,520.790476,588.089683,348.027778,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [38]:
res = {v:k for k,v in id2word.items()}

In [39]:
co_occurence_dataframe =co_occurence_dataframe.rename(columns=res)

In [40]:
co_occurence_dataframe = co_occurence_dataframe.rename(index=res)

In [41]:
co_occurence_dataframe.to_csv("co_matrix_with_window_size_1.csv", index = False)

In [42]:
co_occurence_dataframe.head()

Unnamed: 0,CI23714,CI23855,CI24917,CI23663,CI23933,CI25135,CI23975,CI25126,CI24915,CI24957,...,CI27326,CI29005,CI25760,CI28335,CI25962,CI25968,CI27314,CI27334,CI25342,CI28218
CI23714,0.0,1218.484127,952.686905,1065.586905,985.600397,800.447222,690.27619,986.594841,522.625,260.590476,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CI23855,1218.484127,0.0,1221.026587,1252.965476,1092.545238,968.419048,926.743651,641.663095,721.213492,335.470635,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CI24917,952.686905,1221.026587,0.0,847.772222,969.061508,647.253571,509.603175,471.850397,797.802381,520.581746,...,0.0,0.0,0.0,0.0,0.142857,0.5,0.0,0.0,0.0,0.0
CI23663,1065.586905,1252.965476,847.772222,0.0,976.407143,746.009921,705.35119,529.753571,646.126984,274.274603,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CI23933,985.600397,1092.545238,969.061508,976.407143,0.0,538.253175,487.259127,520.790476,588.089683,348.027778,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [43]:
wide_test.head()

challenge_sequence,1,2,3,4,5,6,7,8,9,10
0,CI23855,CI23933,CI24917,CI24915,CI23714,CI23663,CI24958,CI25135,CI25727,CI24530
1,CI23663,CI23855,CI24917,CI23933,CI23975,CI23714,CI25135,CI24915,CI24958,CI23781
2,CI26939,CI26940,CI26941,CI26942,CI26943,CI26944,CI26945,CI26947,CI26948,CI26954
3,CI23663,CI23855,CI23975,CI23714,CI23848,CI23933,CI25135,CI23781,CI24530,CI23667
4,CI23855,CI23975,CI25135,CI23848,CI23714,CI24917,CI23929,CI25733,CI25126,CI23913


In [44]:
wide_test.shape

(39732, 10)

----

#### `Making predictions with the co-occurrence matrix based on the last attempted/predicted challenge`

----

In [None]:
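final_predictions = []

for i in range(len(wide_test)):
    # start from the 10th (last attempted) challenge of the user
    predictions = [wide_test.loc[i, 10]]
    counter = 0
    # the list grows while we iterate, so each new prediction becomes the next stimulus
    for stimulus in predictions:
        predictions.append(co_occurence_dataframe[stimulus].idxmax())
        counter += 1
        if counter == 3:
            break

    # drop the seed challenge and keep the three predicted ones
    final_predictions.append(predictions[1:])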

In [46]:
# making predictions with the co-occurrence matrix based on the 10th challenge only
final_predictions_new = []

for i in range(len(wide_test)):
    stimulus = wide_test.loc[i, 10]

    # the three challenges that co-occur most strongly with the stimulus
    final_predictions_new.append(list(co_occurence_dataframe[stimulus].nlargest(3).index))
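
The difference between the two prediction strategies shows up on a small, made-up matrix. The labels `A` to `D`, the values, and the helpers `chained_top` and `direct_top` are hypothetical; the sketch only illustrates the two lookups used above.

```python
import pandas as pd

labels = ["A", "B", "C", "D"]
# Hypothetical symmetric co-occurrence scores.
toy_co = pd.DataFrame(
    [[0.0, 5.0, 1.0, 2.0],
     [5.0, 0.0, 4.0, 1.0],
     [1.0, 4.0, 0.0, 3.0],
     [2.0, 1.0, 3.0, 0.0]],
    index=labels, columns=labels,
)


def chained_top(co, start, steps=3):
    """Follow the strongest link repeatedly: each prediction seeds the next."""
    preds, current = [], start
    for _ in range(steps):
        current = co[current].idxmax()
        preds.append(current)
    return preds


def direct_top(co, start, k=3):
    """Take the k strongest co-occurrences of the starting challenge only."""
    return list(co[start].nlargest(k).index)


print(chained_top(toy_co, "A"))
print(direct_top(toy_co, "A"))
```

In this toy case the chained version bounces between `A` and `B` and even returns the starting challenge, whereas the `nlargest` version yields three distinct candidates.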

In [47]:
largest_3 = pd.DataFrame(final_predictions_new)

In [48]:
largest_3['user_id'] = test_ids

In [49]:
largest_3.head()

Unnamed: 0,0,1,2,user_id
0,CI23691,CI23714,CI23663,4577
1,CI23663,CI23714,CI23933,4578
2,CI26953,CI26955,CI26951,4579
3,CI23648,CI23663,CI23933,4583
4,CI23714,CI23855,CI25142,4584


In [51]:
largest_3_long = pd.melt(largest_3,id_vars="user_id",var_name="sequence", value_name="challenge" )

In [52]:
largest_3_long.head()

Unnamed: 0,user_id,sequence,challenge
0,4577,0,CI23691
1,4578,0,CI23663
2,4579,0,CI26953
3,4583,0,CI23648
4,4584,0,CI23714


In [53]:
largest_3_long['sequence'] = largest_3_long['sequence'].map({0:'11',1:'12',2:"13"})

In [54]:
largest_3_long['user_sequence'] = largest_3_long['user_id'].map(str)+"_"+largest_3_long['sequence'].map(str)
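
The reshaping into the submission layout can be verified on two hypothetical users. This sketch adds 11 to the rank instead of using a lookup dictionary, which is an equivalent choice under the assumption that ranks 0, 1, 2 always stand for sequence positions 11, 12, 13.

```python
import pandas as pd

# Hypothetical top-3 predictions for two users.
toy_top3 = pd.DataFrame({0: ["CI1", "CI4"], 1: ["CI2", "CI5"], 2: ["CI3", "CI6"]})
toy_top3["user_id"] = [101, 102]

toy_long = pd.melt(toy_top3, id_vars="user_id", var_name="sequence", value_name="challenge")

# Ranks 0, 1, 2 correspond to sequence positions 11, 12, 13.
toy_long["sequence"] = toy_long["sequence"].astype(int) + 11
toy_long["user_sequence"] = (
    toy_long["user_id"].astype(str) + "_" + toy_long["sequence"].astype(str)
)
toy_long = toy_long.sort_values(["user_id", "sequence"]).reset_index(drop=True)
print(toy_long[["user_sequence", "challenge"]])
```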

----

#### `Submission File`

----

In [55]:
largest_3_long[['user_sequence','challenge']].to_csv("submission.csv", index = False)