# Feature Engineering and Pre-Processing #

The main pre-processing step for this type of model is to drop users with few reviews (there shouldn't be many of these). I considered dropping books with few reviews as well, but the vast majority of the books in the dataset have only one review, and any one of those could be the difference between a good recommendation and a bad one, so I'm keeping them. I'll also add a standardized rating column in case it helps the model perform better when I get to that step.
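
As a rough sketch of how the "most books have one review" claim could be checked, the snippet below counts reviews per book and reports the share with exactly one. The `toy_ratings` frame is invented for illustration and only borrows the column names used in this notebook; it is not the real data.

```python
import pandas as pd

# Toy ratings, invented for illustration (not taken from the dataset).
toy_ratings = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3],
    'book_id': [10, 11, 10, 12, 13],
    'user_rating': [5, 3, 4, 2, 5],
})

# Number of reviews each book received.
reviews_per_book = toy_ratings['book_id'].value_counts()

# Share of books that exactly one user reviewed.
single_review_share = (reviews_per_book == 1).mean()

print(reviews_per_book)
print(f'{single_review_share:.0%} of books have exactly one review')
```

Running the same two lines against the real ratings frame would show how much of the catalogue a minimum-review threshold on books would remove.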

In [1]:
import pandas as pd
import sqlite3
import numpy as np
import os
from pathlib import Path
import seaborn as sns
import matplotlib.pyplot as plt

In [28]:
# Absolute path to ../data/processed/books.db, relative to the working directory
database_path = os.path.abspath(
    os.path.join(os.pardir, 'data', 'processed', 'books.db')
)

conn = sqlite3.connect(database_path)

In [29]:
query = 'SELECT user_id, book_id, user_rating FROM Ratings'
raw_ratings_df = pd.read_sql_query(query, conn)

### Standardizing Ratings ###
Some users are much harsher and others much more generous with their ratings. Z-score standardization within each user's own ratings may help the models account for this.

In [30]:
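def standardize_ratings(x):
    # Z-score one user's ratings: subtract that user's mean, then divide by
    # their sample standard deviation (pandas' std() defaults to ddof=1)
    return (x - x.mean()) / x.std()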

In [35]:
zstd_ratings_df = raw_ratings_df.copy()
zstd_ratings_df['standardized_rating'] = zstd_ratings_df.groupby('user_id')['user_rating'].transform(standardize_ratings)
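
To see what the per-user transform does, here is a small sketch on invented data. It also shows one edge case: a user whose ratings are all identical has a standard deviation of 0, so the plain z-score comes out as NaN. The `zscore_or_zero` helper is a hypothetical guard, not part of this notebook, and mapping such users to 0 is an assumption that may or may not suit the model.

```python
import pandas as pd

# Toy data, invented for illustration: user 1 varies, user 2 always rates 2.
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2],
    'user_rating': [5, 4, 3, 2, 2, 2],
})

def zscore(x):
    # Plain z-score using the sample standard deviation (ddof=1).
    return (x - x.mean()) / x.std()

def zscore_or_zero(x):
    # Hypothetical guard: if the std is 0 (constant ratings) or NaN
    # (a single rating), return 0.0 for every rating instead of NaN.
    std = x.std()
    if pd.isna(std) or std == 0:
        return pd.Series(0.0, index=x.index)
    return (x - x.mean()) / std

grouped = toy.groupby('user_id')['user_rating']
toy['plain'] = grouped.transform(zscore)
toy['guarded'] = grouped.transform(zscore_or_zero)
print(toy)
```

User 1's ratings of 5, 4, 3 become 1, 0, -1 in both columns. User 2's constant ratings are NaN in `plain` and 0 in `guarded`.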

In [36]:
zstd_ratings_df.sort_values(by='user_id')

index,user_id,book_id,user_rating,standardized_rating
0,1,1,5,0.699862
3513,1,76,5,0.699862
3512,1,75,5,0.699862
3493,1,74,4,-0.258853
3443,1,73,3,-1.217568
...,...,...,...,...
20869,154524237,3283,1,0.333333
20868,154524237,3282,0,-0.416667
20867,154524237,3281,0,-0.416667
20872,154524237,3286,0,-0.416667


In [37]:
conn.close()

### Dropping Users ###
I scraped the shelves of the 300 most popular users. That said, some users have only a few reviews in the dataset. I can drop them to shrink my matrix a little.

In [62]:
zstd_ratings_df['user_id'].value_counts()

user_id
1            100
32726092     100
22888935     100
2026178      100
18384692     100
            ... 
4642710      100
3098682       99
91822086      12
597461        10
154524237      9
Name: count, Length: 298, dtype: int64

I can drop the users with fewer than 99 reviews. That removes only 3 users and leaves me with 295.

In [70]:
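min_ratings = 99
# Count reviews per user and keep the ids that meet the threshold
ratings_per_user = zstd_ratings_df['user_id'].value_counts()
filter_users = ratings_per_user[ratings_per_user >= min_ratings].index.to_list()

# .copy() avoids a SettingWithCopyWarning if columns are assigned later
filtered_zstd_ratings_df = zstd_ratings_df[zstd_ratings_df['user_id'].isin(filter_users)].copy()
filtered_zstd_ratings_df['user_id'].value_counts()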

user_id
1           100
6678151     100
22888935    100
2026178     100
18384692    100
           ... 
4866450     100
4851964     100
4642710     100
60147675    100
3098682      99
Name: count, Length: 295, dtype: int64
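
The same filter can also be written without building an intermediate list of ids. The sketch below, on invented data with a toy threshold of 2, compares the `value_counts` plus `isin` approach used above with a `groupby(...).transform('size')` variant and checks that they keep the same rows.

```python
import pandas as pd

# Toy data, invented for illustration: user 2 has only one review.
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 3, 3],
    'book_id': [10, 11, 12, 10, 11, 13],
})
min_ratings = 2  # toy threshold; the notebook uses 99

# Approach used above: count rows per user, keep ids that meet the threshold.
counts = toy['user_id'].value_counts()
keep_ids = counts[counts >= min_ratings].index
via_isin = toy[toy['user_id'].isin(keep_ids)]

# Alternative: broadcast each user's row count back onto that user's rows.
rows_per_user = toy.groupby('user_id')['user_id'].transform('size')
via_transform = toy[rows_per_user >= min_ratings]

# Both approaches select identical rows.
print(via_isin.equals(via_transform))
print(sorted(via_transform['user_id'].unique().tolist()))
```

Either form works. The `transform` version keeps the per-row count available as a Series, which is convenient if the threshold is likely to change.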