Find out what the popularity recommendation model does

## 1. Import modules
* `pandas` and `numpy` for data manipulation
* `turicreate` for performing model selection and evaluation
* `sklearn` for splitting the data into train and test set
* `xlrd` for Excel import
* `sudo apt-get install libatlas-base-dev` to install the missing system package

In [None]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import time
import turicreate as tc
from sklearn.model_selection import train_test_split

import sys
sys.path.append("..")
import scripts.data_layer as data_layer

## 2. Load data
A single dataset from the database, which can be found in the `data` folder:
* `Lyb data QUEST JAN WITH PURQTY 10k` (the 10k version, used to avoid a memory error)
* XLSX format
* Possible errors are expected due to the difference between expected purchase frequency and purchase quantity



In [None]:
s=time.time()

data=pd.read_excel('../data/Lyb data QUEST JAN WITH PURQTY 10k.xlsx')

print("Import time:", round((time.time()-s)/60,2), "minutes")

print(data.shape)
data.head(2)

## 3. Split train and test set
* Splitting the data into training and testing sets is an important part of evaluating a predictive model, in this case a collaborative filtering model. Typically, a larger portion of the data is used for training and a smaller portion for testing.
* We use an 80:20 ratio for the train and test set sizes.
* The training portion is used to develop the predictive model, while the test portion is used to evaluate the model's performance.
* This notebook works with the single purchase-count dataset loaded above, so that is the dataset we split.
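
The idea behind the 80:20 split can be sketched without any library as a seeded shuffle followed by a slice. This is only an illustration of the concept, not how `train_test_split` is implemented; the helper name `split_rows` and the toy rows are hypothetical.

```python
import random

def split_rows(rows, test_size=0.2, seed=42):
    """Shuffle rows with a fixed seed, then hold out the last `test_size` share as the test set."""
    shuffled = list(rows)                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)      # seeded shuffle makes the split repeatable
    n_test = int(round(len(shuffled) * test_size))
    n_train = len(shuffled) - n_test
    return shuffled[:n_train], shuffled[n_train:]

# Toy (user, item) rows, made up for illustration
rows = [(f"user{i % 4}", f"item{i}") for i in range(10)]
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

In the notebook itself, passing a `random_state` value to `train_test_split` makes the split repeatable in the same way.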

In [None]:
train, test = train_test_split(data, test_size = .2)
print(train.shape, test.shape)

In [None]:
# Using turicreate library, we convert dataframe to SFrame - this will be useful in the modeling part

train_data = tc.SFrame(train)
test_data = tc.SFrame(test)

In [None]:
train_data

In [None]:
test_data

## 5. Baseline Model
Before running a more complicated approach such as collaborative filtering, we set up a baseline model against which other models can be compared and evaluated. Since a baseline typically uses a very simple approach, a more complex technique should be chosen only if its gain in accuracy justifies the added complexity.
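
One way to make "better accuracy" concrete for a recommender is precision and recall at k, computed per user from the held-out test purchases. The sketch below assumes that metric; the notebook does not say which metric it uses, and the function name and toy lists are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations against the held-out relevant items."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))     # recommended items the user actually bought
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: 5 recommended items, 3 items actually purchased in the test set
recommended = ["A", "B", "C", "D", "E"]
relevant = ["B", "E", "Z"]
precision, recall = precision_recall_at_k(recommended, relevant, k=5)
print(precision, recall)
```

Averaging these values over all test users gives one number per model, so the baseline and a more complex model can be compared on the same scale.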

### 5.1. Using a Popularity model as a baseline
* The popularity model recommends the most popular items. These are the products with the highest number of sales across customers.
* We use `turicreate` library for running and evaluating both baseline and collaborative filtering models below
* Training data is used for model selection
* The math behind the `turicreate` popularity model has yet to be examined
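
As a starting point for that examination, here is a minimal pure-Python sketch of one plausible popularity rule. It assumes each item is scored by its mean purchase quantity and that items a user has already bought are excluded from that user's recommendations. Both are assumptions for illustration, not confirmed `turicreate` behavior, and the function names and toy data are hypothetical.

```python
from collections import defaultdict

def fit_popularity(rows):
    """rows: iterable of (user, item, qty). Returns item -> mean qty and user -> set of bought items."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    seen = defaultdict(set)
    for user, item, qty in rows:
        totals[item] += qty
        counts[item] += 1
        seen[user].add(item)
    # Assumption: popularity score = mean quantity per purchase record of the item
    scores = {item: totals[item] / counts[item] for item in totals}
    return scores, seen

def recommend(scores, seen, user, k=5):
    """Top-k items by score, skipping items the user already bought (assumption)."""
    known = seen.get(user, set())
    candidates = [(item, score) for item, score in scores.items() if item not in known]
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))  # highest score first, ties by item id
    return candidates[:k]

# Toy purchase records, made up for illustration
toy = [
    ("u1", "A", 4), ("u2", "A", 2),
    ("u1", "B", 1), ("u3", "B", 1),
    ("u2", "C", 5),
    ("u3", "D", 2),
]
scores, seen = fit_popularity(toy)
print(recommend(scores, seen, "u1", k=2))
```

Because the scores do not depend on the user, every customer sees the same ranking apart from the items filtered out as already bought.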

#### Using purchase counts

In [None]:
# variables to define field names
user_id = 'LYBID'
item_id = 'ITEMID'
target = 'TotalQtyPurchased'
users_to_recommend = list(data[user_id].unique())
n_rec = 5 # number of items to recommend
n_display = 30

In [None]:
popularity_model = tc.popularity_recommender.create(train_data, 
                                                    user_id=user_id, 
                                                    item_id=item_id, 
                                                    target=target)

In [None]:
# Get recommendations for every user found in the data file
# Print the first 30 rows: 5 recommendations each for the first 6 customers

popularity_recomm = popularity_model.recommend(users=users_to_recommend, k=n_rec)
popularity_recomm.print_rows(n_display)