# Getting started
This notebook is for me, Rakin, to analyze the data. I used a notebook rather than a .py file so that I can write down my thoughts and ideas alongside the code. I will try to keep it as clean as possible so that anyone can jump in and add their ideas to the project.

## Read & Clean the data
Here the data is loaded into pandas and checked for outliers, strange values, etc. before preprocessing starts.

In [1]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys
import random
import math

# Convert the Data CSV files to pandas
print("Reading Books Data...")
books_data = pd.read_csv("Data/BX-Books.csv", sep=';', on_bad_lines='skip', encoding="latin")
print("Reading Users Data...")
users_data = pd.read_csv("Data/BX-Users.csv", sep=';', on_bad_lines='skip', encoding="latin")
print("Reading Ratings Data...")
book_ratings = pd.read_csv("Data/BX-Book-Ratings.csv", sep=';', on_bad_lines='skip', encoding="latin")
print("Done!")

Reading Books Data...
Reading Users Data...
Reading Ratings Data...
Done!
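
Before cleaning, it can help to see which columns hold missing values or more than one Python type. The following is a minimal sketch on a tiny made-up CSV that only mimics the `;`-separated layout of the real files; `summarize_columns` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import io
import pandas as pd

# Tiny stand-in for a BX-style CSV (made-up rows, same ';' separator).
sample_csv = io.StringIO(
    '"ISBN";"Book-Title";"Book-Author";"Year-Of-Publication"\n'
    '"000100000X";"Title A";"Author A";"2001"\n'
    '"000200000X";"Title B";;"0"\n'
    '"000300000X";"Title C";"Author C";"unknown"\n'
)
sample = pd.read_csv(sample_csv, sep=";")

def summarize_columns(df):
    """Return, per column, the number of missing values and the Python types present."""
    rows = []
    for name in df.columns:
        col = df[name]
        # Collect the type names of all non-missing values in this column.
        type_names = sorted({type(v).__name__ for v in col.dropna()})
        rows.append({
            "column": name,
            "missing": int(col.isna().sum()),
            "types": ", ".join(type_names),
        })
    return pd.DataFrame(rows).set_index("column")

summary = summarize_columns(sample)
print(summary)
```

Calling `summarize_columns(books_data)` on the real frame should point at the columns that need type filtering, assuming the frame fits comfortably in memory.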


### Cleaning books data
Doing a thorough cleaning of `books_data` (loaded from BX-Books.csv):
1. Removing all the books that have no ISBN
2. Removing all the books that have no author
3. Removing all the books that have no title
4. Removing all the books that have no publisher
5. Removing all the books that have no year of publication or 0 as year of publication
6. Removing all the books that have no image url
7. Removing all the books that have no image url small
8. Removing all the books that have no image url medium
9. Removing all the books that have no image url large
10. Removing all books that have no description
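
The steps above can be sketched on a toy frame as follows. This is an illustrative alternative, not the cell used in this notebook: it assumes that years stored as numeric strings should be kept (they are coerced with `pd.to_numeric`), whereas a strict `isinstance(x, int)` check would drop them. `clean_books` and the toy rows are hypothetical.

```python
import pandas as pd

def clean_books(books, year_col="Year-Of-Publication"):
    """Drop rows with missing fields or an unusable publication year."""
    cleaned = books.copy()
    # Coerce years to numbers; anything non-numeric becomes NaN.
    cleaned[year_col] = pd.to_numeric(cleaned[year_col], errors="coerce")
    # Drop rows with any missing value (ISBN, title, author, publisher, ...).
    cleaned = cleaned.dropna()
    # Drop rows whose year of publication is 0.
    cleaned = cleaned[cleaned[year_col] != 0].copy()
    cleaned[year_col] = cleaned[year_col].astype(int)
    return cleaned.reset_index(drop=True)

# Made-up rows: C3 has no title, D4 has year 0, B2 stores its year as a string.
toy_books = pd.DataFrame({
    "ISBN": ["A1", "B2", "C3", "D4"],
    "Book-Title": ["T1", "T2", None, "T4"],
    "Book-Author": ["X", "Y", "Z", "W"],
    "Year-Of-Publication": [1999, "2003", 2001, 0],
    "Publisher": ["P1", "P2", "P3", "P4"],
})
cleaned_books = clean_books(toy_books)
print(cleaned_books)
```

Because `dropna()` is called without a `subset`, a missing value in any column removes the row, which covers the image URL columns as well.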

In [2]:
# Report the size of the dataset before cleaning
print("Total number of rows in the dataset is: ", len(books_data))
print("Total number of unique ISBNs is: ", len(books_data['ISBN'].unique()))

# Some columns contain values of an unexpected type (e.g. non-string authors, non-integer years). Keep only the rows whose values have the expected type
books_data = books_data.loc[books_data['Book-Author'].apply(lambda x: isinstance(x, str)), :]
books_data = books_data.loc[books_data['Year-Of-Publication'].apply(lambda x: isinstance(x, int)), :]
books_data = books_data.loc[books_data['Publisher'].apply(lambda x: isinstance(x, str)), :]
books_data = books_data.loc[books_data['Image-URL-S'].apply(lambda x: isinstance(x, str)), :]
books_data = books_data.loc[books_data['Image-URL-M'].apply(lambda x: isinstance(x, str)), :]
books_data = books_data.loc[books_data['Image-URL-L'].apply(lambda x: isinstance(x, str)), :]
books_data.dropna(inplace=True)
books_data.reset_index(drop=True, inplace=True)
print("Total number of rows in the dataset is: ", len(books_data))

# Drop the ratings whose ISBN was removed from books_data (isin is vectorized, unlike a per-row lambda)
print("Delete all the ISBNs that have been removed from the dataset from the ratings dataset...")
book_ratings = book_ratings.loc[book_ratings['ISBN'].isin(books_data['ISBN']), :]
book_ratings.reset_index(drop=True, inplace=True)
print("Total number of rows in book rating dataset is: ", len(book_ratings))

# Keep only the books that have been rated at least 20 times
print("Delete all the ISBNs that have been rated less than 20 times from the dataset...")
ISBNs = book_ratings['ISBN'].value_counts()
ISBNs = ISBNs[ISBNs >= 20].index
books_data = books_data.loc[books_data['ISBN'].isin(ISBNs), :]
books_data.reset_index(drop=True, inplace=True)
print("Books that have been rated at least 20 times: ", len(books_data))

Total number of rows in the dataset is:  271360
Total number of unique ISBNs is:  271360
Total number of rows in the dataset is:  205821
Delete all the ISBNs that have been removed from the dataset from the ratings dataset...
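
The "rated at least N times" filter can also be written as one reusable helper. This is a minimal sketch on made-up ratings; `keep_frequent` is a hypothetical name, and the toy frame only borrows the `User-ID` and `ISBN` column names from the real data.

```python
import pandas as pd

def keep_frequent(df, column, min_count):
    """Keep the rows whose value in `column` appears at least `min_count` times."""
    # Map every row to the number of times its value occurs in the column.
    counts = df[column].map(df[column].value_counts())
    return df[counts >= min_count].reset_index(drop=True)

# Made-up ratings: book A has 3 ratings, B has 2, C has 1.
toy_ratings = pd.DataFrame({
    "User-ID": [1, 2, 3, 1, 2, 3],
    "ISBN": ["A", "A", "A", "B", "B", "C"],
})
popular = keep_frequent(toy_ratings, "ISBN", min_count=2)
print(popular)
```

With the real data, `keep_frequent(book_ratings, "ISBN", 20)` would be the corresponding call; note that it filters the ratings frame directly rather than `books_data`.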


### Cleaning users data

In [None]:
# Drop the ratings whose ISBN is no longer in books_data after the popularity filter
print("Delete all the ISBNs that have been removed from the dataset from the book_ratings dataset again...")
book_ratings = book_ratings.loc[book_ratings['ISBN'].isin(books_data['ISBN']), :]


# Keep only the users that have rated more than 5 books
print("Removing all the users that have rated 5 books or fewer from the book_ratings dataset...")
users = book_ratings['User-ID'].value_counts()
users = users[users > 5].index
book_ratings = book_ratings.loc[book_ratings['User-ID'].isin(users), :]
book_ratings.reset_index(drop=True, inplace=True)

# Keep only the books in books_data that still have at least one rating in book_ratings
ISBNs = book_ratings['ISBN'].unique()
books_data = books_data.loc[books_data['ISBN'].isin(ISBNs), :]
books_data.reset_index(drop=True, inplace=True)
print("Total number of rows in the dataset is: ", len(books_data))
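
One thing to keep in mind: removing users can push some books back below the rating threshold. If both thresholds should hold at the same time, the two filters can be repeated until nothing changes. This is a sketch of that idea under my own assumptions, not what the cells above do; `filter_until_stable` is a hypothetical helper and both thresholds are inclusive minimums.

```python
import pandas as pd

def filter_until_stable(ratings, min_book_ratings, min_user_ratings):
    """Alternate the book and user filters until no more rows are removed."""
    current = ratings
    while True:
        before = len(current)
        # Drop ratings of books with too few ratings.
        book_counts = current["ISBN"].map(current["ISBN"].value_counts())
        current = current[book_counts >= min_book_ratings]
        # Drop ratings from users with too few ratings.
        user_counts = current["User-ID"].map(current["User-ID"].value_counts())
        current = current[user_counts >= min_user_ratings]
        if len(current) == before:
            return current.reset_index(drop=True)

# Made-up ratings where a single pass is not enough:
# book C only reaches 2 ratings thanks to user 3, who is removed later.
toy_ratings = pd.DataFrame({
    "User-ID": [1, 2, 1, 2, 3, 3, 1],
    "ISBN": ["A", "A", "B", "B", "C", "D", "C"],
})
stable = filter_until_stable(toy_ratings, min_book_ratings=2, min_user_ratings=2)
print(stable)
```

For the thresholds used in this notebook (at least 20 ratings per book, more than 5 per user) the call would be `filter_until_stable(book_ratings, 20, 6)`.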

In [None]:
# How many users are left
print("Total number of users in the dataset is: ", len(book_ratings['User-ID'].unique()))