============ Datathon INSTRUCTIONS and INFORMATION ============

Challenge: SkyHigh Books wants to know how readers will rate the books it publishes in 2018. You are given data from 2017: which books each reader read, and the average rating each reader gave across all the books they read.

Your job is fourfold:

1. Predict each reader's average rating in 2018 across all the books they read in 2017 and 2018 combined. Use the books readers have already read in 2017 and the wishlist of books they want to read in 2018 as a starting point. SkyHigh Books wants this information so it can market directly to readers whose potential average score across all books is either high or low.

2. Explain why some books in 2017 or 2018 have good or bad overall ratings, and predict which books those are. Is it certain words in the book? The genre of the book? The price in 2017 or 2018? Where it was sold? Why do some books have a higher average rating than others?

3. SkyHigh Books wants to maintain a good online presence on book review sites (which is why we predict user averages), so tell them how they can lift overall reader satisfaction globally, not just for each individual user.

4. Provide confidence bounds on your predictions. How confident are you, as a percentage, that your predictions are correct? Are the results plausible? Does the data seem plausible (i.e., do the word counts follow some power-law distribution)?
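
One possible way to approach the confidence-bound task is a percentile bootstrap interval around a user's predicted average. The sketch below is only an illustration under stated assumptions: `per_book_preds` is a made-up list of per-book predicted ratings for one user, and `bootstrap_mean_ci` is a hypothetical helper, not part of the provided material. It captures variation across a user's books, not uncertainty in the model itself.

```python
import numpy as np

def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values` (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return values.mean(), lower, upper

# Made-up per-book predictions for one user, for illustration only.
per_book_preds = [3.1, 4.2, 2.8, 3.9, 3.5, 4.0]
mean, lower, upper = bootstrap_mean_ci(per_book_preds)
print(f"mean={mean:.2f}, 95% interval=({lower:.2f}, {upper:.2f})")
```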

**Datasets**

You are provided with 6 datasets:

1. **Books Information**
2. **Genres Mapping**
3. **User Data**
4. **Words in Book**
5. **Words Mapping**
6. **Example Submissions**

------

1. **Books Information** Details of each book: Book ID, barcode, difficulty (average *perceived* reading difficulty of the book, where 1 = easy and 5 = hard).
2. **Genres Mapping** Maps genre IDs to genre names (e.g. Science).
3. **User Data** Each user's average rating, the books they read in 2017, and their wishlist for 2018.
4. **Words in Book** The words that appear in each book.
5. **Words Mapping** Maps word IDs to the actual words.
6. **Example Submissions** An example of the output you need to provide.

In [1]:
import pandas as pd
import numpy as np
from collections import defaultdict
from sklearn import datasets, metrics
from sklearn.preprocessing import MultiLabelBinarizer, LabelBinarizer
from surprise.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from surprise import SVDpp, Reader, Dataset, accuracy
from surprise.model_selection import train_test_split, cross_validate
from xgboost import XGBRegressor
from xgboost import plot_importance
from matplotlib import pyplot


In [2]:
## loading data
Books_Information = pd.read_csv('Books information.csv')
User_Data = pd.read_csv('User Data.csv')

# words_in_book = pd.read_csv('Words in Books Data.csv')
# words_map = pd.read_csv('Words Mapping.csv')

In [23]:
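# Column positions in User Data, inferred from the column names used later in this notebook:
# 0 = User ID, 1 = User Difficulty Choice, 2 = User Read Books (2017),
# 3 = User Read Books (2018), 4 = Average Rating (2017)
# One row per (user, book read in 2017); every row carries the user's 2017 average rating.
user_2017 = User_Data.iloc[:, [0, 1, 2, 4]]
arr_2017 = [[row[0], row[1], int(book), float(row[3])] for row in user_2017.values for book in row[2].split(', ')]
user_2017 = pd.DataFrame(arr_2017, columns=user_2017.columns)

# One row per (user, book on the 2018 wishlist); no rating is known yet.
user_2018 = User_Data.iloc[:, [0, 1, 3]]
arr_2018 = [[row[0], row[1], int(book)] for row in user_2018.values for book in row[2].split(', ')]
user_2018 = pd.DataFrame(arr_2018, columns=user_2018.columns)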
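
The cell above turns each comma-separated book list into one row per (user, book). As a sketch of the same reshaping with pandas' `str.split` and `explode`, here is a toy frame whose values are made up for illustration; the column names mirror those used in this notebook.

```python
import pandas as pd

# Toy stand-in for User Data; the values are made up for illustration.
toy = pd.DataFrame({
    "User ID": ["ID1", "ID2"],
    "User Read Books (2017)": ["10, 11, 12", "11"],
    "Average Rating (2017)": [4.0, 2.5],
})

long_2017 = (
    # Split each string into a list, then emit one row per list element.
    toy.assign(**{"User Read Books (2017)": toy["User Read Books (2017)"].str.split(", ")})
       .explode("User Read Books (2017)")
       .astype({"User Read Books (2017)": int})
       .reset_index(drop=True)
)
print(long_2017)
```

Each exploded row keeps the user's average rating, which matches what the list comprehension does.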

In [25]:
train_df = user_2017.merge(Books_Information, left_on='User Read Books (2017)', right_on='Book ID')
test_df = user_2018.merge(Books_Information, left_on='User Read Books (2018)', right_on='Book ID')
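
Both merges use pandas' default inner join, so any (user, book) row whose book ID is absent from Books Information is dropped without warning. A minimal sketch for counting such rows, using a left join with `indicator=True`, follows; the two toy frames and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins; the values are made up for illustration.
reads = pd.DataFrame({"User ID": ["ID1", "ID1", "ID2"], "Book": [10, 11, 99]})
books = pd.DataFrame({"Book ID": [10, 11], "Book Genre": [17, 18]})

# A left join keeps every read; the `_merge` column records whether a match was found.
checked = reads.merge(books, left_on="Book", right_on="Book ID", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(checked)} rows have no book information")
```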


In [5]:
train_df = train_df[["User ID", "User Difficulty Choice", "User Read Books (2017)", "Book Genre", "Difficulty (Reader suggested)", "Number Of Words", "Price (2017)", "Most Sold At", "Number Sold", "Average Rating (2017)"]].rename(columns={'User Read Books (2017)': 'Book', 'Price (2017)': 'Price', 'Average Rating (2017)': "Rating"})
test_df = test_df[["User ID", "User Difficulty Choice", "User Read Books (2018)", "Book Genre", "Difficulty (Reader suggested)", "Number Of Words", "Price (2018)", "Most Sold At", "Number Sold"]].rename(columns={'User Read Books (2018)': 'Book', 'Price (2018)': 'Price'})


In [22]:
train_df.head()

Unnamed: 0,User ID,User Difficulty Choice,Book,Book Genre,Difficulty (Reader suggested),Number Of Words,Price,Most Sold At,Number Sold,Rating
0,ID790145788,1,6254,17,2,4522,34.506476,PythonBooks,3262,3.115447
1,ID523524439,2,6254,17,2,4522,34.506476,PythonBooks,3262,5.934029
2,ID720659947,4,6254,17,2,4522,34.506476,PythonBooks,3262,4.463691
3,ID345893462,Not specified,6254,17,2,4522,34.506476,PythonBooks,3262,1.525557
4,ID714509174,3,6254,17,2,4522,34.506476,PythonBooks,3262,1.22653


In [15]:
test_df.head()


Unnamed: 0,User ID,User Difficulty Choice,Book,Book Genre,Difficulty (Reader suggested),Number Of Words,Price,Most Sold At,Number Sold
0,ID790145788,1,7180,18,1,637,8.067764,DHC-Online,1596
1,ID675457711,5,7180,18,1,637,8.067764,DHC-Online,1596
2,ID788479335,3,7180,18,1,637,8.067764,DHC-Online,1596
3,ID567773609,3,7180,18,1,637,8.067764,DHC-Online,1596
4,ID677346269,1,7180,18,1,637,8.067764,DHC-Online,1596


In [None]:
test_df

In [None]:
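# `trainset` is not defined anywhere above; as an assumption, build it from the 2017 (user, book, rating) rows
trainset = Dataset.load_from_df(train_df[['User ID', 'Book', 'Rating']], Reader(rating_scale=(train_df['Rating'].min(), train_df['Rating'].max()))).build_full_trainset()
algo = SVDpp(n_factors=100, n_epochs=300, lr_all=0.01, reg_all=0.2)
algo.fit(trainset)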
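
For reference, the standard SVD++ prediction rule, as documented for Surprise's `SVDpp`, is given below. Here $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, $p_u$ and $q_i$ are latent factor vectors of length `n_factors`, $I_u$ is the set of items rated by user $u$, and $y_j$ are the implicit item factors. Note that in this notebook the "rating" of each (user, book) pair is the user's 2017 average, not a per-book rating.

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top}\left(p_u + |I_u|^{-\frac{1}{2}} \sum_{j \in I_u} y_j\right)
```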

In [None]:
## blend each user's known 2017 average with the SVD++ estimates for their 2018 books
final_df = User_Data[['User ID', 'User Read Books (2017)', 'User Read Books (2018)', 'Average Rating (2017)']]
final_ar = [[a[0], (len(a[1].split(', ')) * float(a[3]) + sum([algo.predict(a[0], int(b)).est for b in a[2].split(', ')])) / (len(a[1].split(', ')) + len(a[2].split(', ')))] for a in final_df.values]
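
The one-line expression above is a count-weighted average: the known 2017 average counts once per 2017 book, and each 2018 book contributes its estimated rating. A small worked sketch with made-up numbers (`blended_average` is a hypothetical helper name):

```python
def blended_average(avg_2017, n_2017, preds_2018):
    """Average over 2017 and 2018 books: known 2017 average plus per-book 2018 estimates."""
    total = n_2017 * avg_2017 + sum(preds_2018)
    return total / (n_2017 + len(preds_2018))

# Made-up numbers: 3 books read in 2017 averaging 4.0, plus two 2018 estimates.
result = blended_average(avg_2017=4.0, n_2017=3, preds_2018=[3.0, 5.0])
print(result)  # (3 * 4.0 + 3.0 + 5.0) / 5 = 4.0
```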


In [None]:
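## user difficulty encoding (multi-hot over levels 1-5)
user_diff_ar = User_Data['User Difficulty Choice'].values
mlb = MultiLabelBinarizer(classes=[1, 2, 3, 4, 5])
# "Not specified" (or any other non-level value) is encoded with all five levels set.
# Compare against an explicit set: `a in '12345'` is a substring test that also accepts '' or '12'.
levels = {'1', '2', '3', '4', '5'}
user_diff_code = mlb.fit_transform([[int(a)] if str(a) in levels else (1, 2, 3, 4, 5) for a in user_diff_ar])
dic_user_diff = dict(zip(User_Data['User ID'].values, user_diff_code))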
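
To make the encoding concrete, the sketch below reproduces the intended result with plain NumPy: a known level switches on one position, and anything else (such as "Not specified") switches on all five. `encode_difficulty` and `LEVELS` are hypothetical names used only for this illustration.

```python
import numpy as np

LEVELS = [1, 2, 3, 4, 5]

def encode_difficulty(choice):
    """Multi-hot vector over LEVELS; unknown choices switch every level on."""
    text = str(choice).strip()
    if text in {str(level) for level in LEVELS}:
        return np.array([int(int(text) == level) for level in LEVELS])
    return np.ones(len(LEVELS), dtype=int)

print(encode_difficulty("2"))              # [0 1 0 0 0]
print(encode_difficulty("Not specified"))  # [1 1 1 1 1]
```

Encoding an unknown choice as all ones treats the user as compatible with every level; an all-zeros vector or a separate "unknown" column would be alternative conventions.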

In [None]:
## book difficulty encoding (one-hot over levels 1-5)
book_diff_ar = Books_Information['Difficulty (Reader suggested)'].values
mlb = MultiLabelBinarizer(classes=[1, 2, 3, 4, 5])
book_diff_code = mlb.fit_transform([[int(a)] for a in book_diff_ar])
dic_book_diff = dict(zip(Books_Information['Book ID'].values, book_diff_code))

In [None]:
## book genre encoding (one-hot)
book_genre_ar = Books_Information['Book Genre'].values
lb = LabelBinarizer()
book_genre_code = lb.fit_transform(book_genre_ar)
dic_book_genre = dict(zip(Books_Information['Book ID'].values, book_genre_code))

In [None]:
## encoding of the store where each book sold most (one-hot)
book_store_ar = Books_Information['Most Sold At'].values
lb = LabelBinarizer()
book_store_code = lb.fit_transform(book_store_ar)
dic_book_store = dict(zip(Books_Information['Book ID'].values, book_store_code))
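
An alternative to `LabelBinarizer` for the genre and store columns is `pandas.get_dummies`, which keeps readable column names that can help when inspecting feature importance later. The sketch below uses a made-up three-book frame; the store names are taken from the sample output earlier in this notebook, and `dic_store` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for Books Information; the rows are made up for illustration.
books = pd.DataFrame({
    "Book ID": [10, 11, 12],
    "Most Sold At": ["PythonBooks", "DHC-Online", "PythonBooks"],
})

# One column per store, named store_<name>, holding 0/1 integers.
store_onehot = pd.get_dummies(books["Most Sold At"], prefix="store", dtype=int)
store_onehot.index = books["Book ID"]

# Same shape of lookup as the dictionaries above: Book ID -> one-hot vector.
dic_store = {book_id: row.to_numpy() for book_id, row in store_onehot.iterrows()}
print(store_onehot.columns.tolist())
print(dic_store[10])
```

One practical difference: `LabelBinarizer` returns a single column when there are exactly two classes, whereas `get_dummies` returns one column per class.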

In [None]:
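## book words encoding (multi-hot over the words in each book)
# the load is commented out in the data-loading cell, so read the file here
words_in_book = pd.read_csv('Words in Books Data.csv')
book_words_ar = words_in_book['Words in Book'].values
mlb = MultiLabelBinarizer()
book_words_code = mlb.fit_transform([a.split('|') for a in book_words_ar])
dic_book_words = dict(zip(words_in_book['Book ID'].values, book_words_code))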

In [None]:
## second model and training
# `user_book_rate` is not defined above; as an assumption, use the (user, book, rating) rows of train_df
user_book_rate = train_df[['User ID', 'Book', 'Rating']]
# X_train = [np.concatenate((user_latent[algo.trainset.to_inner_uid(a[0])], book_latent[algo.trainset.to_inner_iid(a[1])], dic_user_diff[a[0]], dic_book_diff[a[1]], dic_book_genre[a[1]], dic_book_store[a[1]])) for a in user_book_rate.values]
X_train = [np.concatenate((dic_user_diff[a[0]], dic_book_diff[a[1]], dic_book_genre[a[1]], dic_book_store[a[1]])) for a in user_book_rate.values]
# X_train = pd.DataFrame(X_train, columns=['user_id_'+str(a) for a in range(3)] + ['book_id_'+str(a) for a in range(3)] +  ['dic_user_diff_'+str(a) for a in range(1,6)] + ['dic_book_diff_'+str(a) for a in range(1,6)] + ['dic_book_genre_'+str(a) for a in range(31)] + ['store_'+str(a) for a in range(1,6)])
# the target is the residual: observed rating minus the SVD++ estimate
y_train = [a[2] - algo.predict(a[0], a[1]).est for a in user_book_rate.values]
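
The second model is trained on residuals, i.e. on what the first model missed, and its output is later added back to the first model's estimate. The sketch below illustrates that idea with made-up data and, as a deliberate simplification, ordinary least squares in place of `XGBRegressor`; `base_pred` stands in for the SVD++ estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 200 rows, 3 binary side features (stand-ins for the one-hot columns).
X = rng.integers(0, 2, size=(200, 3)).astype(float)
base_pred = np.full(200, 3.5)                      # stand-in for the SVD++ estimates
y = 3.5 + X @ np.array([0.4, -0.3, 0.1]) + rng.normal(0, 0.05, 200)

# Stage 2 learns only what stage 1 missed: target = rating - base prediction.
residual = y - base_pred
X1 = np.column_stack([np.ones(len(X)), X])         # add an intercept column
coef, *_ = np.linalg.lstsq(X1, residual, rcond=None)

final_pred = base_pred + X1 @ coef                 # base estimate + learned correction
rmse_base = np.sqrt(np.mean((y - base_pred) ** 2))
rmse_final = np.sqrt(np.mean((y - final_pred) ** 2))
print(f"base RMSE={rmse_base:.3f}, stacked RMSE={rmse_final:.3f}")
```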

In [None]:
model = XGBRegressor()
model.fit(X_train, y_train)

In [None]:
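y_pred = model.predict(X_train)
# R^2 on the training rows (an optimistic estimate); named r2_train so it does not shadow surprise's `accuracy` module
r2_train = metrics.r2_score(y_train, y_pred)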
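
An R² score computed on the same rows the model was fitted on tends to be optimistic, so a held-out estimate is usually more informative. The sketch below shows the comparison on made-up data with a polynomial fit standing in for the real model; `r2` is a hypothetical helper that follows the usual definition of the coefficient of determination.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = 2.0 * x + rng.normal(0, 0.5, 60)

# Hold out the last 20 rows; fit a deliberately flexible model on the first 40.
x_tr, y_tr, x_te, y_te = x[:40], y[:40], x[40:], y[40:]
coefs = np.polyfit(x_tr, y_tr, deg=6)

r2_fit = r2(y_tr, np.polyval(coefs, x_tr))
r2_holdout = r2(y_te, np.polyval(coefs, x_te))
print(f"train R^2={r2_fit:.3f}, held-out R^2={r2_holdout:.3f}")
```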

In [None]:
## prediction with the second model
# `test_ar` is not defined above; as an assumption, use the (user, book) pairs of test_df
test_ar = test_df[['User ID', 'Book']].values
# X_test = [np.concatenate((user_latent[algo.trainset.to_inner_uid(a[0])], book_latent[algo.trainset.to_inner_iid(a[1])], dic_user_diff[a[0]], dic_book_diff[a[1]], dic_book_genre[a[1]], dic_book_store[a[1]])) for a in test_ar]
X_test = [np.concatenate((dic_user_diff[a[0]], dic_book_diff[a[1]], dic_book_genre[a[1]], dic_book_store[a[1]])) for a in test_ar]
y_pred_new = model.predict(X_test)
# nested lookup: dic_con[user_id][book_id] -> predicted correction
dic_con = defaultdict(dict)
for (user_id, book_id), correction in zip(test_ar, y_pred_new):
    dic_con[user_id][book_id] = correction
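
Because `dic_con` is a `defaultdict(dict)`, indexing a missing user silently creates an empty entry, and indexing a missing book then raises `KeyError`. A small sketch of a lookup that avoids both, using made-up values (`corrections` and `correction_for` are hypothetical names):

```python
from collections import defaultdict

corrections = defaultdict(dict)
corrections["ID1"][10] = 0.25          # made-up correction for (user ID1, book 10)

def correction_for(user_id, book_id, default=0.0):
    """Look up a correction without raising or creating entries for unseen users."""
    # dict.get does not trigger the defaultdict factory, so no empty entry is added.
    return corrections.get(user_id, {}).get(book_id, default)

print(correction_for("ID1", 10))   # 0.25
print(correction_for("ID1", 99))   # 0.0
print(correction_for("ID9", 10))   # 0.0
```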

In [None]:
## final result: SVD++ estimate plus the second model's correction for each 2018 book
final_df_2 = User_Data[['User ID', 'User Read Books (2017)', 'User Read Books (2018)', 'Average Rating (2017)']]
# .get(..., 0.0) applies no correction to a book that was dropped by the merge with Books_Information
final_ar_2 = [[a[0], (len(a[1].split(', ')) * float(a[3]) + sum([algo.predict(a[0], int(b)).est + dic_con[a[0]].get(int(b), 0.0) for b in a[2].split(', ')])) / (len(a[1].split(', ')) + len(a[2].split(', ')))] for a in final_df_2.values]
rel = pd.DataFrame(final_ar_2, columns=['User ID', 'Average Rating (2018)'])