# Step 1

What this step does:
* Converts implicit feedback to ratings
* Fixes the BookID padding problem
* Produces intermediate files for the next step

In [1]:
import pandas as pd
import pickle
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import clear_output

## Loading Files

In [2]:
books = pd.read_csv('./data/BX-Books.csv', sep=';', quotechar='"', escapechar='\\', header=0)
users = pd.read_csv('./data/BX-Users.csv')
ratings_train = pd.read_csv('./data/BX_train.csv',
                            header=None, names=['UserID', 'BookID', 'Rating'])
ratings_test = pd.read_csv('./data/BX_test.csv',
                           header=None, names=['UserID', 'BookID', 'Rating'])

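Rebuilding train/test: rows in the test file with a `Rating` of 55 are set aside for submission, the remaining test rows are added to the explicit training ratings, and training rows with a `Rating` of 0 are treated as implicit feedback.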

In [3]:
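# Rows with a rating of 55 are the ones to predict for submission.
test_submission = ratings_test[ratings_test.Rating == 55]
# The remaining test rows join the explicit training data.
tmp = ratings_test[ratings_test.Rating != 55]
train_explicit = ratings_train[ratings_train.Rating != 0]
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
train_explicit = pd.concat([train_explicit, tmp])
# A rating of 0 marks implicit feedback.
train_implicit = ratings_train[ratings_train.Rating == 0]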

## Converting implicit feedbacks to ratings

Suppose we want to convert the implicit feedback of user u on item i into a rating, and let I be the set of other users who explicitly rated item i. Then:

$$ R_{ui} = \mu_{u} + \frac{\sum_{s \in I}{(R_{si} - \mu_{s})}}{|I|} + \lambda $$

* The first term is the average explicit rating given by user u.
* The second term is the unbiased goodness of item i: the average, over the users in I, of the difference between each user's explicit rating for item i and that user's own average rating. If nobody has rated the item explicitly, the code below uses 0 for this term.
* The third term (lambda) is the implicit rating constant. It can be set to 1 on a 10-point rating scale. The code below adds it only when the first two terms sum to less than 9, and clips the result to the range 1 to 10.
* For a new user with no explicit ratings, the first term falls back to the average of all explicit ratings (~7.57).
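
To make the formula above concrete, here is a minimal sketch of the conversion for a single (user, item) pair. The function name `convert_implicit`, its parameters, and the sample numbers are illustrative assumptions, not part of the notebook; the fallback of 7.57 for unknown users comes from the list above. The sketch applies the plain formula, without the clipping done later in the notebook.

```python
def convert_implicit(user_mean, item_offsets, lam=1.0, global_mean=7.57):
    """Estimate a rating for an implicit interaction (hypothetical helper).

    user_mean    -- mean explicit rating of user u, or None for a new user
    item_offsets -- list of (R_si - mu_s) values for users s who rated item i
    lam          -- the implicit rating constant (lambda)
    global_mean  -- fallback first term for new users
    """
    term1 = global_mean if user_mean is None else user_mean
    # An item with no explicit ratings contributes no offset.
    term2 = sum(item_offsets) / len(item_offsets) if item_offsets else 0.0
    return term1 + term2 + lam


# The user averages 6.0; two other users rated the item 1.0 above
# and 0.5 below their own averages.
example = convert_implicit(6.0, [1.0, -0.5])
print(example)  # 6.0 + 0.25 + 1.0
```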

In [4]:
train_explicit_grouped_by_user = train_explicit[['UserID', 'Rating']].groupby('UserID')
adjusted_ratings = train_explicit.copy()
adjusted_ratings['Rating'] = train_explicit['Rating'] - \
                            train_explicit_grouped_by_user.transform('mean')['Rating']

In [5]:
train_explicit_item_mean = adjusted_ratings.groupby('BookID')['Rating'].mean().to_dict()
train_explicit_user_mean = train_explicit_grouped_by_user['Rating'].mean().to_dict()

In [6]:
train_implicit_conversion = train_implicit.copy()
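# pd.np was removed from pandas; use the numpy import directly.
train_implicit_conversion['Rating'] = train_implicit_conversion.Rating.astype(np.float64)
counter = 1
max_counter = len(train_implicit_conversion)
# Integer step so the modulo check in the progress report can hit zero.
update_step = max(1, max_counter // 10)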
baseline_rating = train_explicit.Rating.mean()

This cell takes a minute or two to execute:

In [7]:
for i, row in train_implicit_conversion.iterrows():
    # First term: the user's mean rating, or the global mean for new users.
    term1 = train_explicit_user_mean.get(row.UserID, baseline_rating)
    # Second term: the item's mean user-centered rating, or 0 if unseen.
    term2 = train_explicit_item_mean.get(row.BookID, 0)
    # Third term: add the implicit constant only when the sum is below 9.
    term3 = 1 if (term1 + term2) < 9 else 0
    # Clamp to the 1-10 rating scale.
    rating = min(max(term1 + term2 + term3, 1), 10)
    # DataFrame.set_value was removed from pandas; .at is the replacement.
    train_implicit_conversion.at[i, 'Rating'] = rating
    if counter % update_step == 0:
        clear_output()
        print(str(counter * 100 // max_counter) + "%")
    counter = counter + 1

100%
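
The row-by-row loop above can also be written with vectorized pandas operations, which usually run much faster than `iterrows`. The following is a minimal sketch on made-up toy data, not the notebook's own code; the frame names `explicit`, `implicit`, and `converted` and all sample values are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-ins for the notebook's data (values are made up).
explicit = pd.DataFrame({
    'UserID': [1, 1, 2, 2],
    'BookID': ['a', 'b', 'a', 'c'],
    'Rating': [8.0, 6.0, 9.0, 10.0],
})
implicit = pd.DataFrame({
    'UserID': [1, 3, 2],
    'BookID': ['c', 'a', 'zzz'],
    'Rating': [0.0, 0.0, 0.0],
})

# Per-user mean, and per-item mean of the user-centered ratings.
user_mean = explicit.groupby('UserID')['Rating'].mean()
centered = explicit['Rating'] - explicit.groupby('UserID')['Rating'].transform('mean')
item_mean = centered.groupby(explicit['BookID']).mean()

# term1: user mean, falling back to the global mean for unseen users.
term1 = implicit['UserID'].map(user_mean).fillna(explicit['Rating'].mean())
# term2: item offset, falling back to 0 for unseen items.
term2 = implicit['BookID'].map(item_mean).fillna(0.0)
# term3: add 1 only when the first two terms sum to less than 9.
term3 = ((term1 + term2) < 9).astype(float)

converted = implicit.copy()
converted['Rating'] = (term1 + term2 + term3).clip(lower=1, upper=10)
print(converted)
```

The `map` calls play the role of the dictionary lookups in the loop, and `fillna` supplies the same fallbacks for unseen users and items.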


In [8]:
train_all = pd.concat([train_explicit, train_implicit_conversion])

## Fixing Book ID padding problem

In [9]:
# test_submission is a slice of ratings_test; copy it so the column
# assignments below do not trigger SettingWithCopyWarning.
test_submission = test_submission.copy()
train_explicit['BookID_org'] = train_explicit['BookID']
train_explicit['BookID'] = train_explicit['BookID'].str.zfill(10)
train_all['BookID_org'] = train_all['BookID']
train_all['BookID'] = train_all['BookID'].str.zfill(10)
test_submission['BookID_org'] = test_submission['BookID']
test_submission['BookID'] = test_submission['BookID'].str.zfill(10)
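
The notebook does not describe the padding problem in detail. One plausible reading, assumed here, is that some BookIDs are 10-character ISBNs whose leading zeros were dropped, so the same book can appear under two different IDs. The sample IDs below are made up; the sketch only shows how left-padding with `zfill(10)` brings such IDs back to a common form.

```python
import pandas as pd

# Hypothetical IDs: the first and third have lost their leading zero.
ids = pd.Series(['345339703', '0345339703', '67187043X'])

# astype(str) guards against IDs that were parsed as integers.
padded = ids.astype(str).str.zfill(10)
print(padded.tolist())

# Before padding the first two entries look like different books.
print(ids.nunique(), padded.nunique())
```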



## Saving results for the next step

In [14]:
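# Context managers make sure each file is flushed and closed.
with open('./data/input_1.pcl', 'wb') as f:
    pickle.dump(train_explicit, f)
with open('./data/input_2.pcl', 'wb') as f:
    pickle.dump(train_all, f)
with open('./data/input_3.pcl', 'wb') as f:
    pickle.dump(test_submission, f)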
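
The next step presumably reads these files back with `pickle.load`. The sketch below shows a save and load round trip on a made-up one-row frame written to a temporary directory; the frame contents and the temporary path are assumptions for illustration, and only the file name `input_1.pcl` is taken from the cell above.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy frame standing in for train_explicit (values are made up).
frame = pd.DataFrame({'UserID': [1], 'BookID': ['0345339703'], 'Rating': [8.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'input_1.pcl')
    # The with-block closes the file handle even if dump() raises.
    with open(path, 'wb') as f:
        pickle.dump(frame, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# The restored frame should match the original, including the padded ID.
print(restored.equals(frame))
```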