# Collaborative Filtering Using GraphLab for an Implicit Dataset
## (Part 1: Current Users Via Segment - Pearson vs Jaccard vs Factorization)

#### Objective: Apply and compare different collaborative filtering models and select the best recommender model

#### Step 1 of 5: Import Relevant Libraries

Note: GraphLab's library requires signing up for an academic license and runs on Python 2 only.

In [17]:
import pandas as pd
import graphlab as gl

#### Step 2 of 5: Load & Prepare Dataset
Note: Print the first rows to check that the dataset has been imported correctly.

In [18]:
# Raw string keeps the Windows-style backslash from being read as an escape sequence
df = pd.read_csv(r'dataset\dataset02_master.csv', sep=',')
print(df.head(5))  # check the first 5 rows

     Type    Card_ID  SegmentNo Gender  Age Age_Grp  Length_of_Membership_MTH  \
0  Active  104829316          5      F   40   35-44                         0   
1  Active  101480021          5      M   44   35-44                         0   
2  Active  104219628          5      M   21   15-24                         0   
3  Active  104219628          5      M   21   15-24                         0   
4  Active  106272169          5      M   29   25-34                         0   

  Membership_Grp           pdt_type  total_count  
0        <= 1 YR    MP3Players_high          2.0  
1        <= 1 YR    MP3Players_high          2.0  
2        <= 1 YR  MP3Players_medium          2.0  
3        <= 1 YR    MP3Players_high          1.0  
4        <= 1 YR    Hardware_medium          1.0  


<b> Note 2: Filter the dataset to the segment of interest and keep only the relevant columns. <br>
Note 3: Convert the pandas DataFrame to an SFrame, a requirement for GraphLab. </b>

In [19]:
# Keep segment 2 only, then keep just the user and item columns
df_seg2_growable = df[df.SegmentNo == 2]
df_seg2_growable = df_seg2_growable[['Card_ID', 'pdt_type']]
# Convert the pandas DataFrame into an SFrame and save a copy as CSV
df_seg2_growable_SFrame = gl.SFrame(df_seg2_growable)
df_seg2_growable_SFrame.save(r'dataset\dataset_recommender_2.csv', format='csv')

#### Step 3 of 5: Create Train and Test Dataset
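Note: The split is done per user. With `item_test_proportion = 0.5`, about half of each sampled user's items are held out for the test dataset and the rest stay in the train dataset.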

In [20]:
train_s2, test_s2 = gl.recommender.util.random_split_by_user (df_seg2_growable_SFrame,\
                    user_id = 'Card_ID', item_id = 'pdt_type', \
                    item_test_proportion = 0.5, random_seed = 2017)
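
To make the idea of a per-user split concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not GraphLab's implementation: the function `split_by_user` and the toy interactions are hypothetical, and GraphLab's own helper may differ in details such as how many users it samples for testing.

```python
import random
from collections import defaultdict

def split_by_user(pairs, item_test_proportion=0.5, seed=2017):
    """Hold out a share of each user's items for testing (illustrative only)."""
    rng = random.Random(seed)

    # Group the items by user so the split is made user by user.
    items_by_user = defaultdict(list)
    for user, item in pairs:
        items_by_user[user].append(item)

    train, test = [], []
    for user in sorted(items_by_user):
        for item in items_by_user[user]:
            # Each (user, item) row goes to the test set with the given probability.
            if rng.random() < item_test_proportion:
                test.append((user, item))
            else:
                train.append((user, item))
    return train, test

# Hypothetical toy interactions, not taken from the dataset above.
toy_pairs = [
    (1, "A"), (1, "B"), (1, "C"), (1, "D"),
    (2, "A"), (2, "C"),
    (3, "B"), (3, "D"),
]
train, test = split_by_user(toy_pairs)
```

Every interaction ends up in exactly one of the two sets, and fixing the seed makes the split reproducible, which is the role `random_seed = 2017` plays in the cell above.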

#### Step 4 of 5: Create 3 Candidate Models

In [21]:
# Step 4: Create Model 1 - Pearson Similarity Score

train_s2_model_pearson = gl.recommender.item_similarity_recommender.create \
                (train_s2, user_id='Card_ID', item_id='pdt_type',\
                similarity_type='pearson')


# Step 4: Create Model 2 - Jaccard Similarity Score
train_s2_model_jaccard = gl.recommender.item_similarity_recommender.create \
                (train_s2, user_id='Card_ID', item_id='pdt_type',\
                similarity_type='jaccard')

# Step 4: Create Model 3 - Factorization

train_s2_model_factorization = gl.recommender.ranking_factorization_recommender.\
                create(train_s2,\
                user_id='Card_ID', item_id='pdt_type',\
                 random_seed = 2017, solver = 'ials')
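
For intuition on what the Jaccard model measures, the sketch below computes item-to-item Jaccard similarity by hand: the number of users who have both items divided by the number of users who have either one. It is a simplified illustration only; the function `jaccard_item_similarity` and the toy data are hypothetical, and GraphLab's recommender does more than this (for example, it turns the similarities into ranked recommendations).

```python
from collections import defaultdict

def jaccard_item_similarity(pairs):
    """Return {(item_a, item_b): Jaccard similarity} for every item pair."""
    # Collect the set of users who interacted with each item.
    users_by_item = defaultdict(set)
    for user, item in pairs:
        users_by_item[item].add(user)

    items = sorted(users_by_item)
    sims = {}
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            shared = len(users_by_item[a] & users_by_item[b])
            either = len(users_by_item[a] | users_by_item[b])
            # Every item has at least one user, so the union is never empty.
            sims[(a, b)] = float(shared) / either
    return sims

# Hypothetical toy interactions, not taken from the dataset above.
toy_pairs = [
    (1, "A"), (1, "B"),
    (2, "A"), (2, "B"),
    (3, "A"), (3, "C"),
]
sims = jaccard_item_similarity(toy_pairs)
```

Because only the presence of an interaction matters, this measure suits implicit data such as the `Card_ID` / `pdt_type` pairs used here, where there is no rating column.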


#### Step 5 of 5: Compare the Precision and Recall of the 3 Models
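
The comparison reports precision and recall at each cutoff. As a rough sketch of what those two numbers mean for a single user, assuming the usual definitions (the function `precision_recall_at_k` and the example lists are hypothetical, not part of GraphLab):

```python
def precision_recall_at_k(recommended, held_out, k):
    """Precision and recall of the top-k recommendations for one user."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(held_out))
    # Precision: share of the k recommended items that the user really has.
    precision = float(hits) / k
    # Recall: share of the user's held-out items that were recommended.
    recall = float(hits) / len(held_out) if held_out else 0.0
    return precision, recall

# Hypothetical ranked recommendations and held-out test items for one user.
recommended = ["A", "B", "C", "D"]
held_out = ["B", "D", "E"]
p, r = precision_recall_at_k(recommended, held_out, k=2)
```

Averaging these per-user values over all test users gives figures of the kind shown in the `mean_precision` and `mean_recall` columns. Precision tends to fall and recall tends to rise as the cutoff grows, which is the pattern visible in the output.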

In [22]:
x2 = gl.recommender.util.compare_models(test_s2 , \
    [train_s2_model_pearson, train_s2_model_jaccard, train_s2_model_factorization ], model_names=["m1", "m2", "m3"])

PROGRESS: Evaluate model m1

Precision and recall summary statistics by cutoff
+--------+-----------------+----------------+
| cutoff |  mean_precision |  mean_recall   |
+--------+-----------------+----------------+
|   1    |  0.255319148936 | 0.22695035461  |
|   2    |  0.20780141844  | 0.375177304965 |
|   3    |  0.191489361702 | 0.50780141844  |
|   4    |  0.174822695035 | 0.617730496454 |
|   5    |  0.157730496454 | 0.69219858156  |
|   6    |  0.139716312057 | 0.73475177305  |
|   7    |  0.12462006079  | 0.762411347518 |
|   8    |  0.113120567376 | 0.79219858156  |
|   9    |  0.104176516942 | 0.823404255319 |
|   10   | 0.0971631205674 | 0.856737588652 |
+--------+-----------------+----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model m2

Precision and recall summary statistics by cutoff
+--------+-----------------+----------------+
| cutoff |  mean_precision |  mean_recall   |
+--------+-----------------+----------------+
|   1    |  0.175886524823 | 0.149645

## Select the Final Model and Create the Results Dataset

In [23]:
train_s2_model_final= gl.recommender.item_similarity_recommender.create \
                (df_seg2_growable_SFrame, user_id='Card_ID', item_id='pdt_type',\
                similarity_type='jaccard')

In [24]:
## Output the final recommendations dataset for visualization
recs_final = train_s2_model_final.recommend()
recs_final.save(r'dataset\dataset_final_recs2.csv', format='csv')