# Getting started
This notebook is for me, Rakin, to analyze the data. I used a notebook rather than a .py file so that I can write down my thoughts and ideas alongside the code. I will try to keep it clean enough for anyone to jump in and add their own ideas to the project.

## Read & Clean the data 
Here the data is loaded into pandas DataFrames and checked for outliers, strange values, etc. before the preprocessing starts.

In [9]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys
import random
import math

# Convert the Data CSV files to pandas
print("Reading Books Data...")
books_data = pd.read_csv("Data/BX-Books.csv", sep=';', on_bad_lines='skip', encoding="latin")
print("Reading Users Data...")
users_data = pd.read_csv("Data/BX-Users.csv", sep=';', on_bad_lines='skip', encoding="latin")
print("Reading Ratings Data...")
Book_Ratings = pd.read_csv("Data/BX-Book-Ratings.csv", sep=';', on_bad_lines='skip', encoding="latin")
print("Done!")

Reading Books Data...


  books_data = pd.read_csv("Data/BX-Books.csv", sep=';', on_bad_lines='skip', encoding="latin")


Reading Users Data...
Reading Ratings Data...
Done!


In [12]:
Book_Ratings

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6
...,...,...,...
1149775,276704,1563526298,9
1149776,276706,0679447156,0
1149777,276709,0515107662,10
1149778,276721,0590442449,10


In [13]:
Book_Ratings[Book_Ratings['Book-Rating']==0]

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
2,276727,0446520802,0
5,276733,2080674722,0
10,276746,0425115801,0
11,276746,0449006522,0
...,...,...,...
1149769,276704,059032120X,0
1149770,276704,0679752714,0
1149772,276704,080410526X,0
1149774,276704,0876044011,0


### Finding which columns need to be cleaned
Some columns in books_data contain mixed types. This is explored below.

In [2]:
# Find out which columns have mixed types
print("The following has mixed types.")
for col in books_data.columns:
    weird = (books_data[[col]].applymap(type) != books_data[[col]].iloc[0].apply(type)).any(axis=1)
    if len(books_data[weird]) > 0:
        print(col)


The following has mixed types.
Book-Author
Year-Of-Publication
Publisher
Image-URL-L
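
The check above compares every value against the type of the first row. As a minimal sketch of the same idea, the helper below collects the set of Python types found in each column and reports only the columns with more than one. The function name `mixed_type_columns` and the toy DataFrame are assumptions made for illustration; they are not part of this notebook's data.

```python
import pandas as pd


def mixed_type_columns(df: pd.DataFrame) -> dict:
    """Map each column holding more than one Python type to the sorted type names found."""
    report = {}
    for col in df.columns:
        # Collect the name of the Python type of every value in the column.
        type_names = set(df[col].map(lambda value: type(value).__name__))
        if len(type_names) > 1:
            report[col] = sorted(type_names)
    return report


# Hypothetical toy data that mimics the kinds of problems seen in books_data.
toy = pd.DataFrame({
    "author": ["Mark Twain", float("nan"), "Jane Austen"],  # NaN is a float
    "year": [1995, "2001", "Gallimard"],                    # ints mixed with strings
    "isbn": ["034545104X", "0155061224", "0446520802"],     # strings only
})

report = mixed_type_columns(toy)
print(report)
```

Reporting the type names as well as the column makes it easier to tell a missing value (a `float` NaN among strings) from a genuinely misplaced value (a string among integers).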


In [3]:
# Looking at the mixed types at Book-Author
print("Looking for rows with mixed types in Book-Author then printing them.")
weird = (books_data[['Book-Author']].applymap(type) != books_data[['Book-Author']].iloc[0].apply(type)).any(axis=1)
print(books_data[weird])
print("Done!")

# Convert NaN to the string value "Unknown"
books_data['Book-Author'] = books_data['Book-Author'].fillna('Unknown')

# Convert the column Book-Author to have the type string
books_data['Book-Author'] = books_data['Book-Author'].astype(str)

Looking for rows with mixed types in Book-Author then printing them.
              ISBN                                         Book-Title  \
187689  9627982032  The Credit Suisse Guide to Managing Your Perso...   

       Book-Author Year-Of-Publication                       Publisher  \
187689         NaN                1995  Edinburgh Financial Publishing   

                                              Image-URL-S  \
187689  http://images.amazon.com/images/P/9627982032.0...   

                                              Image-URL-M  \
187689  http://images.amazon.com/images/P/9627982032.0...   

                                              Image-URL-L  
187689  http://images.amazon.com/images/P/9627982032.0...  
Done!


### Cleaning Publication columns

In [4]:
# Clean the year of publication column
print("Find string values in column Year-Of-Publication and replace them with 0.")
books_data['Year-Of-Publication'] = books_data['Year-Of-Publication'].replace(['DK Publishing Inc', 'Gallimard'], 0)
print("How many rows have 0 in Year-Of-Publication?")
print(books_data[books_data['Year-Of-Publication'] == 0].shape[0])

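# Convert the Year-Of-Publication column to int.
# astype returns a new Series, so the result must be assigned back.
print("Convert the Year-Of-Publication column to int.")
books_data['Year-Of-Publication'] = books_data['Year-Of-Publication'].astype(int)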
print("Done!")


Find string values in column Year-Of-Publication and replace them with 0.
How many rows have 0 in Year-Of-Publication?
3573
Convert the Year-Of-Publication column to int.
Done!
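
The cell above replaces two known publisher names by hand. As a hedged alternative sketch, `pd.to_numeric` with `errors="coerce"` turns every value that cannot be parsed as a number into NaN, which lets the bad values be listed before they are replaced. The toy Series below is an assumption standing in for the real column, and this approach would also flag values that were already missing.

```python
import pandas as pd

# Hypothetical stand-in for books_data['Year-Of-Publication'].
years = pd.Series(
    [2002, "1995", "DK Publishing Inc", "Gallimard", 0],
    name="Year-Of-Publication",
)

# Values that cannot be parsed as numbers become NaN instead of raising.
numeric = pd.to_numeric(years, errors="coerce")

# Inspect which original values failed to parse before overwriting them.
bad_values = years[numeric.isna()].unique().tolist()
print("Non-numeric values:", bad_values)

# Use 0 as the placeholder for an unknown year, as in the cell above.
cleaned = numeric.fillna(0).astype(int)
print(cleaned.tolist())
```

Listing the unparseable values first avoids hard-coding strings that might not cover every bad row.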


In [5]:
# Find the mixed types in the column Publisher

# Convert NaN to the string value "Unknown"
books_data['Publisher'] = books_data['Publisher'].fillna('Unknown')

# Look at the mixed types in the column Publisher
print("Looking for rows with mixed types in Publisher then printing them.")
weird = (books_data[['Publisher']].applymap(type) != books_data[['Publisher']].iloc[0].apply(type)).any(axis=1)
print(books_data[weird])

# Convert the Publisher column to have the type String
books_data['Publisher'] = books_data['Publisher'].astype(str)
print("Done!")


Looking for rows with mixed types in Publisher then printing them.
Empty DataFrame
Columns: [ISBN, Book-Title, Book-Author, Year-Of-Publication, Publisher, Image-URL-S, Image-URL-M, Image-URL-L]
Index: []
Done!


### Cleaning image_url columns

In [6]:
# Convert NaN to the string value "Unknown" in Image-URL-L
books_data['Image-URL-L'] = books_data['Image-URL-L'].fillna('Unknown')

# Look for mixed types in Image-URL-L
print("Looking for rows with mixed types in Image-URL-L then printing them.")
weird = (books_data[['Image-URL-L']].applymap(type) != books_data[['Image-URL-L']].iloc[0].apply(type)).any(axis=1)
print(books_data[weird])

# Convert the column Image-URL-L to have the type String
books_data['Image-URL-L'] = books_data['Image-URL-L'].astype(str)
print("Done!")


Looking for rows with mixed types in Image-URL-L then printing them.
Empty DataFrame
Columns: [ISBN, Book-Title, Book-Author, Year-Of-Publication, Publisher, Image-URL-S, Image-URL-M, Image-URL-L]
Index: []
Done!


### Checking if everything is now correct in the books_data

In [7]:
# Convert the object columns to the pandas string dtype and the year to int.
# Each column is converted from itself so that its contents are preserved.
print("Objects to other types...")
books_data['ISBN'] = books_data['ISBN'].astype('string')
books_data['Book-Title'] = books_data['Book-Title'].astype('string')
books_data['Book-Author'] = books_data['Book-Author'].astype('string')
books_data['Publisher'] = books_data['Publisher'].astype('string')
books_data['Year-Of-Publication'] = books_data['Year-Of-Publication'].astype(int)
books_data['Image-URL-S'] = books_data['Image-URL-S'].astype('string')
books_data['Image-URL-M'] = books_data['Image-URL-M'].astype('string')
books_data['Image-URL-L'] = books_data['Image-URL-L'].astype('string')
print("Done!")

books_data.dtypes


Objects to other types...
Done!


ISBN                   string
Book-Title             string
Book-Author            string
Year-Of-Publication     int32
Publisher              string
Image-URL-S            string
Image-URL-M            string
Image-URL-L            string
dtype: object
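
A possible shorter form of this conversion, sketched under the assumption that every column should keep its own values, is to pass a column-to-dtype mapping to `DataFrame.astype`. The toy DataFrame and the name `dtype_map` are illustrative only.

```python
import pandas as pd

# Hypothetical miniature version of books_data.
toy_books = pd.DataFrame({
    "ISBN": ["034545104X", "0155061224"],
    "Book-Title": ["Title A", "Title B"],
    "Year-Of-Publication": ["2002", 1995],  # numeric string mixed with an int
})

# One mapping from column name to target dtype; columns not listed are left alone.
dtype_map = {
    "ISBN": "string",
    "Book-Title": "string",
    "Year-Of-Publication": int,
}

converted = toy_books.astype(dtype_map)
print(converted.dtypes)
```

Because the mapping names each column once, there is no right-hand side that could accidentally point at a different column.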

In [8]:
# Look for mixed types in users_data and books_data
print("Looking for rows with mixed types in users_data then printing them.")
weird = (users_data[['User-ID']].applymap(type) != users_data[['User-ID']].iloc[0].apply(type)).any(axis=1)
print(users_data[weird])

print("Looking for rows with mixed types in books_data then printing them.")
weird = (books_data[['ISBN']].applymap(type) != books_data[['ISBN']].iloc[0].apply(type)).any(axis=1)
print(books_data[weird])

# Look for mixed types in books_data
print("Looking for rows with mixed types in books_data then printing them.")
weird = (books_data[['Book-Title']].applymap(type) != books_data[['Book-Title']].iloc[0].apply(type)).any(axis=1)
print(books_data[weird])


Looking for rows with mixed types in users_data then printing them.
Empty DataFrame
Columns: [User-ID, Location, Age]
Index: []
Looking for rows with mixed types in books_data then printing them.
Empty DataFrame
Columns: [ISBN, Book-Title, Book-Author, Year-Of-Publication, Publisher, Image-URL-S, Image-URL-M, Image-URL-L]
Index: []
Looking for rows with mixed types in books_data then printing them.
Empty DataFrame
Columns: [ISBN, Book-Title, Book-Author, Year-Of-Publication, Publisher, Image-URL-S, Image-URL-M, Image-URL-L]
Index: []


# Investigating the data
The data now appears to be clean. The next step is to look for outliers and strange values and to understand the data better: how it is distributed, which values are present, and how collaborative filtering can be used to recommend books to users.
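
To make the collaborative filtering idea concrete, here is a minimal user-based sketch on invented toy ratings. It is not the method this project has settled on: the toy data, the use of cosine similarity, and treating a missing rating as 0 are all assumptions made for illustration. The toy data uses only ratings of 1 or higher, so a 0 in the matrix always means "not rated" here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings with the same column names as Book_Ratings.
ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 2, 3, 3],
    "ISBN": ["A", "B", "A", "C", "B", "C"],
    "Book-Rating": [8, 6, 9, 7, 5, 10],
})

# User-item matrix: one row per user, one column per book, 0 where unrated.
matrix = ratings.pivot_table(
    index="User-ID", columns="ISBN", values="Book-Rating"
).fillna(0)

# Cosine similarity between users: normalise each row, then take dot products.
values = matrix.to_numpy(dtype=float)
unit = values / np.linalg.norm(values, axis=1, keepdims=True)
similarity = pd.DataFrame(unit @ unit.T, index=matrix.index, columns=matrix.index)

# Score books for one user as the similarity-weighted average of other users' ratings.
target = 1
others = similarity.loc[target].drop(target)
scores = matrix.loc[others.index].mul(others, axis=0).sum() / others.sum()

# Recommend the best-scoring book the target user has not rated yet.
target_row = matrix.loc[target]
unseen = target_row[target_row == 0].index
best = scores[unseen].idxmax()
print("Recommended ISBN for user", target, "is", best)
```

On the real data the user-item matrix would be far larger and sparser, so a sparse representation and a filter on users or books with very few ratings would probably be needed first.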