In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from datetime import datetime

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [4]:
# Load the dataset
file_path = '/content/drive/MyDrive/ML_Model/book_rating_system/data/raw/df_cleaned.csv'
df = pd.read_csv(file_path)

In [5]:
df_cleaned = df.copy()

In [6]:
df_cleaned.head()

Unnamed: 0,title,authors,publisher,published_date,description,page_count,categories,average_rating,ratings_count,language
0,Deep Learning,"Ian Goodfellow, Yoshua Bengio, Aaron Courville",MIT Press,2016-11-18,An introduction to a broad range of topics in ...,801.0,Computers,3.5,6.0,en
1,Deep Learning for Coders with fastai and PyTorch,"Jeremy Howard, Sylvain Gugger",O'Reilly Media,2020-06-29,Deep learning is often viewed as the exclusive...,624.0,Computers,4.5,2.0,en
2,The Principles of Deep Learning Theory,"Daniel A. Roberts, Sho Yaida, Boris Hanin",Cambridge University Press,2022-05-26,This volume develops an effective theory appro...,473.0,Computers,5.0,2.0,en
3,Advances in Deep Learning,"M. Arif Wani, Farooq Ahmad Bhat, Saduf Afzal, ...",Springer,2019-03-14,This book introduces readers to both basic and...,149.0,Technology & Engineering,1.0,1.0,en
4,Introduction to Deep Learning,Sandro Skansi,Springer,2018-02-04,"This textbook presents a concise, accessible a...",191.0,Computers,5.0,1.0,en


In [7]:
df_cleaned.shape

(5144, 10)

In [8]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5144 entries, 0 to 5143
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           5144 non-null   object 
 1   authors         5144 non-null   object 
 2   publisher       5144 non-null   object 
 3   published_date  5144 non-null   object 
 4   description     5144 non-null   object 
 5   page_count      5144 non-null   float64
 6   categories      5144 non-null   object 
 7   average_rating  5144 non-null   float64
 8   ratings_count   5144 non-null   float64
 9   language        5144 non-null   object 
dtypes: float64(3), object(7)
memory usage: 402.0+ KB


In [9]:
df_cleaned.describe()

Unnamed: 0,page_count,average_rating,ratings_count
count,5144.0,5144.0,5144.0
mean,377.615863,4.435848,4.786353
std,1413.71095,0.871908,24.338186
min,0.0,1.0,1.0
25%,212.0,4.0,1.0
50%,324.0,5.0,1.0
75%,464.0,5.0,2.0
max,99998.0,5.0,182.0


Step 1: Data Cleaning and Checking for Duplicate Values
Explanation

Data cleaning puts the dataset in a reliable state for analysis and modeling. It involves handling missing values, correcting inconsistencies, and removing duplicate entries. Duplicate rows give the repeated books extra weight in summary statistics and in model training, so they need to be identified and handled explicitly.


Steps

Identify Duplicate Values: Check for duplicate rows in the dataset.
Handle Duplicates: Decide whether to remove duplicates or handle them differently based on the context.
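
The two steps above can be sketched on a toy frame. The sample rows and variable names here are hypothetical and only illustrate the difference between exact duplicates and duplicates on a key such as title and authors.

```python
import pandas as pd

# Hypothetical toy data: row 2 is an exact copy of row 0,
# row 3 repeats the same title/authors with a different page count.
toy = pd.DataFrame({
    "title": ["Book A", "Book B", "Book A", "Book A"],
    "authors": ["X", "Y", "X", "X"],
    "page_count": [100.0, 200.0, 100.0, 120.0],
})

# Exact duplicates: every column must match an earlier row
n_exact = int(toy.duplicated().sum())

# Key-based duplicates: only title and authors must match;
# keep=False marks every member of a duplicate group for inspection
key_dupes = toy[toy.duplicated(subset=["title", "authors"], keep=False)]

# Drop exact duplicates, keeping the first occurrence
deduped = toy.drop_duplicates()
```

Inspecting the key-based groups before dropping anything helps decide whether near-duplicates are different editions or repeated records.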

In [10]:
df_cleaned.duplicated().sum()

278

In [11]:
# Rows whose title and authors repeat an earlier row
df_cleaned[df_cleaned.duplicated(subset=['title', 'authors'])]

Unnamed: 0,title,authors,publisher,published_date,description,page_count,categories,average_rating,ratings_count,language
335,Advances and Applications in Deep Learning,Marco Antonio Aceves-Fernandez,BoD – Books on Demand,2020,Artificial Intelligence (AI) has attracted the...,124.0,Computers,5.0,1.0,en
665,Recent Advances in Natural Language Processing,"Ruslan Mitkov, Nicolas Nicolov",John Benjamins Publishing,1997,This volume is based on contributions from the...,0.0,Language Arts & Disciplines,5.0,1.0,en
1254,Bayesian Statistics,Donald L. Meyer,Wiley,1970,This book introduces the Bayesian approach to ...,0.0,Education,5.0,1.0,en
1376,Why We Read Fiction,Lisa Zunshine,Ohio State University Press,2006,Why We Read Fiction offers a lucid overview of...,210.0,Literary Criticism,3.0,1.0,en
1397,Science Fiction by Scientists,Michael Brotherton,Springer,2016-11-15,This anthology contains fourteen intriguing st...,214.0,Science,5.0,1.0,en
...,...,...,...,...,...,...,...,...,...,...
5082,Neural Networks and Analog Computation,Hava Siegelmann,Birkhäuser,2012-10-21,The theoretical foundations of Neural Networks...,181.0,Computers,4.5,2.0,en
5085,Practical Time Series Analysis,Aileen Nielsen,O'Reilly Media,2019-09-20,Time series data analysis is increasingly impo...,500.0,Computers,4.5,2.0,en
5086,Tattooed Bodies,"James Martell, Erik Larsen",Palgrave Macmillan,2023-02-04,The essays collected in Tattooed Bodies draw o...,0.0,Social Science,5.0,1.0,en
5089,Journal of the American Statistical Association,Raymond Bernard Cattell,"Trotman, Limited",2007,A scientific and educational journal not only ...,764.0,Electronic journals,4.0,2.0,en


In [12]:
df_cleaned[df_cleaned['title']=="Architects of Intelligence"]

Unnamed: 0,title,authors,publisher,published_date,description,page_count,categories,average_rating,ratings_count,language
641,Architects of Intelligence,Raymond Bernard Cattell,"Trotman, Limited",2018,Collects the 77 papers presented during the No...,760.0,Artificial intelligence,4.0,2.0,en
5102,Architects of Intelligence,Raymond Bernard Cattell,"Trotman, Limited",2018,Collects the 77 papers presented during the No...,760.0,Artificial intelligence,4.0,2.0,en


In [13]:
# Remove exact duplicate rows, keeping the first occurrence
df_cleaned = df_cleaned.drop_duplicates()

# Verify that duplicates have been removed
num_duplicate_rows_after = df_cleaned.duplicated().sum()

# Display the number of remaining duplicate rows (should be zero)
num_duplicate_rows_after


0

With the 278 exact duplicates removed (4,866 of the original 5,144 rows remain), we can proceed with further data cleaning steps, such as handling any remaining missing values, correcting inconsistencies, and preparing the data for feature engineering.

Step 2: Handling Remaining Missing Values

Explanation
Handling missing values is crucial for building robust machine learning models. We can use various strategies to fill or remove missing values, ensuring the dataset is complete and suitable for analysis.
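
As a minimal sketch of the strategies mentioned above (drop, fill with a statistic, fill with a placeholder), the example below uses hypothetical toy data. Which strategy suits which column is an assumption here, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "title": ["Book A", "Book B", None],
    "publisher": ["Springer", None, "Wiley"],
    "page_count": [100.0, np.nan, 300.0],
})

# Strategy 1: drop rows that lack an essential identifier
filled = toy.dropna(subset=["title"]).copy()

# Strategy 2: fill a numeric column with its median
filled["page_count"] = filled["page_count"].fillna(filled["page_count"].median())

# Strategy 3: fill a text column with an explicit placeholder
filled["publisher"] = filled["publisher"].fillna("Unknown")
```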

In [14]:
# Check for remaining missing values in the dataset
df_cleaned.isnull().sum()

title             0
authors           0
publisher         0
published_date    0
description       0
page_count        0
categories        0
average_rating    0
ratings_count     0
language          0
dtype: int64

Step 3: Checking for Inconsistencies and Data Types

Explanation

Before proceeding with feature engineering, it's important to ensure that all data is consistent and the data types are appropriate for analysis.

This step includes:


Checking Data Types: Ensuring each column has the correct data type (e.g., numeric columns are numeric, categorical columns are categorical).
Correcting Inconsistencies: Identifying and correcting any inconsistencies in the data, such as incorrect data entries, outliers, or formatting issues.
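
A data type check can be sketched as follows. The toy column values are hypothetical; the point is that a numeric column stored as text shows up as non-numeric and can be converted with `pd.to_numeric`.

```python
import pandas as pd

# Hypothetical toy data: 'page_count' was read as text
toy = pd.DataFrame({
    "page_count": ["801", "624", "n/a"],
    "average_rating": [3.5, 4.5, 5.0],
})

# List the columns pandas currently treats as numeric
numeric_cols = [c for c in toy.columns if pd.api.types.is_numeric_dtype(toy[c])]

# Convert the text column; values that cannot be parsed become NaN
toy["page_count"] = pd.to_numeric(toy["page_count"], errors="coerce")
```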

In [15]:
# Check data types of each column
df_cleaned.dtypes

title              object
authors            object
publisher          object
published_date     object
description        object
page_count        float64
categories         object
average_rating    float64
ratings_count     float64
language           object
dtype: object

Check for Inconsistencies:

Verify that published_date values are valid dates.

Check for any outliers or anomalies in numeric columns.

Ensure consistency in categorical columns (e.g., categories, language).
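
One possible way to run the numeric and categorical checks is sketched below on hypothetical toy data. The 1.5 × IQR rule is a common convention chosen for illustration, not the method this notebook uses later.

```python
import pandas as pd

# Hypothetical toy data with one extreme page count and inconsistent language codes
toy = pd.DataFrame({
    "page_count": [200.0, 250.0, 300.0, 350.0, 99998.0],
    "language": ["en", "en", "EN ", "de", "en"],
})

# Numeric check: flag values outside 1.5 * IQR of the quartiles
q1 = toy["page_count"].quantile(0.25)
q3 = toy["page_count"].quantile(0.75)
iqr = q3 - q1
outliers = toy[(toy["page_count"] < q1 - 1.5 * iqr) | (toy["page_count"] > q3 + 1.5 * iqr)]

# Categorical check: raw value counts expose spelling and whitespace variants
raw_levels = toy["language"].value_counts()
```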

In [16]:
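# Convert 'published_date' to datetime format; values that cannot be parsed become NaT
df_cleaned['published_date'] = pd.to_datetime(df_cleaned['published_date'],errors='coerce')

# Check for any conversion issues: rows where parsing failed are now NaT
invalid_dates = df_cleaned[df_cleaned['published_date'].isnull()]['published_date']

# Display the number of invalid dates
# (use len(), not .count(): .count() skips NaT and would always report 0 here)
len(invalid_dates)

1780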
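
The dataset mixes full dates such as 2016-11-15 with year-only values such as 2020, and a single `pd.to_datetime` call with `errors='coerce'` can turn one of the two forms into NaT, depending on the pandas version. A minimal sketch of a two-pass parse is shown below; the sample strings are hypothetical, and other forms (for example year-month) would need an extra pass.

```python
import pandas as pd

# Hypothetical sample covering the formats seen in the preview rows
raw = pd.Series(["2016-11-18", "2020", "1997", "not a date"])

# Pass 1: full ISO dates
full = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Pass 2: year-only strings, mapped to January 1 of that year
year_only = pd.to_datetime(raw, format="%Y", errors="coerce")

# Prefer the full date, fall back to the year-only parse
parsed = full.fillna(year_only)

# Count what is still unparsed after both passes
n_unparsed = int(parsed.isna().sum())
```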

In [17]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4866 entries, 0 to 5143
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   title           4866 non-null   object        
 1   authors         4866 non-null   object        
 2   publisher       4866 non-null   object        
 3   published_date  3086 non-null   datetime64[ns]
 4   description     4866 non-null   object        
 5   page_count      4866 non-null   float64       
 6   categories      4866 non-null   object        
 7   average_rating  4866 non-null   float64       
 8   ratings_count   4866 non-null   float64       
 9   language        4866 non-null   object        
dtypes: datetime64[ns](1), float64(3), object(6)
memory usage: 418.2+ KB


In [18]:
# Summary statistics of numeric columns
df_cleaned.describe()

Unnamed: 0,published_date,page_count,average_rating,ratings_count
count,3086,4866.0,4866.0,4866.0
mean,2016-06-14 08:07:09.293583872,376.426839,4.43845,4.671599
min,1948-03-28 00:00:00,0.0,1.0,1.0
25%,2014-02-24 06:00:00,208.0,4.0,1.0
50%,2018-05-31 00:00:00,321.0,5.0,1.0
75%,2020-11-23 00:00:00,461.0,5.0,2.0
max,2024-09-10 00:00:00,99998.0,5.0,182.0
std,,1452.546808,0.871888,23.953394


In [19]:
df_cleaned['page_count'].describe()

count     4866.000000
mean       376.426839
std       1452.546808
min          0.000000
25%        208.000000
50%        321.000000
75%        461.000000
max      99998.000000
Name: page_count, dtype: float64

In [20]:
# Handle outliers in 'page_count' by capping values above the threshold

page_count_threshold = 2000
df_cleaned['page_count'] = df_cleaned['page_count'].clip(upper=page_count_threshold)

# Verify the change by displaying summary statistics again
numeric_summary_updated = df_cleaned.describe()


# Check consistency in categorical columns
categories_unique = df_cleaned['categories'].unique()
languages_unique = df_cleaned['language'].unique()

numeric_summary_updated , categories_unique, languages_unique

(                      published_date   page_count  average_rating  \
 count                           3086  4866.000000     4866.000000   
 mean   2016-06-14 08:07:09.293583872   354.560008        4.438450   
 min              1948-03-28 00:00:00     0.000000        1.000000   
 25%              2014-02-24 06:00:00   208.000000        4.000000   
 50%              2018-05-31 00:00:00   321.000000        5.000000   
 75%              2020-11-23 00:00:00   461.000000        5.000000   
 max              2024-09-10 00:00:00  2000.000000        5.000000   
 std                              NaN   249.933574        0.871888   
 
        ratings_count  
 count    4866.000000  
 mean        4.671599  
 min         1.000000  
 25%         1.000000  
 50%         1.000000  
 75%         2.000000  
 max       182.000000  
 std        23.953394  ,
 array(['Computers', 'Technology & Engineering', 'Science', 'Mathematics',
        'Business & Economics', 'Education', 'Medical', 'Machine learning',
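
The summary still shows a minimum page count of 0. If zero is a placeholder for "unknown" (an assumption, not something the data confirms), one option is to mark it as missing before capping, sketched here on hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical page counts: a zero placeholder and one extreme value
pages = pd.Series([0.0, 149.0, 801.0, 99998.0])

# Treat 0 as missing (assumption), then cap at 2000 pages
cleaned = pages.replace(0.0, np.nan).clip(upper=2000)
```

Marking zeros as missing keeps them out of the mean and lets a later imputation step handle them explicitly.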


In [21]:
df_cleaned.columns

Index(['title', 'authors', 'publisher', 'published_date', 'description',
       'page_count', 'categories', 'average_rating', 'ratings_count',
       'language'],
      dtype='object')

In [22]:
# Step 1: Capitalize the first letter of each word in the 'title' column
df_cleaned['title'] = df_cleaned['title'].str.title()

# Step 2: Remove leading and trailing spaces
df_cleaned['title'] = df_cleaned['title'].str.strip()

# Verify the changes
unique_titles = df_cleaned['title'].unique()[:5]  # Display the first 5 unique titles for verification
unique_titles


array(['Deep Learning',
       'Deep Learning For Coders With Fastai And Pytorch',
       'The Principles Of Deep Learning Theory',
       'Advances In Deep Learning', 'Introduction To Deep Learning'],
      dtype=object)

In [23]:
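# Standardize text in the 'publisher' column: title case, trim, keep only ASCII letters and whitespace
# (note: this also strips apostrophes, hyphens, and accented letters, e.g. "O'Reilly" becomes "OReilly")
df_cleaned['publisher'] = df_cleaned['publisher'].str.title().str.strip().str.replace(r'[^a-zA-Z\s]', '', regex=True)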

# Verify the changes
unique_publishers = df_cleaned['publisher'].unique()[:5]  # Display the first 5 unique publishers for verification
unique_publishers


array(['Mit Press', 'OReilly Media', 'Cambridge University Press',
       'Springer', 'AddisonWesley Professional'], dtype=object)
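
The output above shows that the letters-only pattern changes names such as O'Reilly and Addison-Wesley. A gentler normalization is sketched below as an alternative; the kept character set (word characters, whitespace, apostrophe, ampersand, hyphen) is an assumption about what matters in publisher names, and the sample strings are illustrative.

```python
import pandas as pd

# Sample names with punctuation, an accented letter, and stray whitespace
publishers = pd.Series([
    "  O'Reilly Media ",
    "Addison-Wesley Professional",
    "Birkhäuser",
    "BoD – Books on Demand",
])

normalized = (
    publishers
    .str.strip()
    # Drop everything except word characters, whitespace, ', & and -
    .str.replace(r"[^\w\s'&-]", "", regex=True)
    # Collapse runs of whitespace left behind by removed characters
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)
```

Because `\w` is Unicode-aware for Python strings, accented letters survive, and collapsing whitespace afterwards avoids double spaces where a symbol was removed.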

In [24]:
# Standardize text in the 'description' column
df_cleaned['description'] = df_cleaned['description'].str.title().str.strip().str.replace(r'[^a-zA-Z\s]', '', regex=True)

# Verify the changes
unique_descriptions = df_cleaned['description'].unique()[:10]  # Display the first 10 unique descriptions for verification
unique_descriptions


array(['An Introduction To A Broad Range Of Topics In Deep Learning Covering Mathematical And Conceptual Background Deep Learning Techniques Used In Industry And Research Perspectives Written By Three Experts In The Field Deep Learning Is The Only Comprehensive Book On The Subject Elon Musk Cochair Of Openai Cofounder And Ceo Of Tesla And Spacex Deep Learning Is A Form Of Machine Learning That Enables Computers To Learn From Experience And Understand The World In Terms Of A Hierarchy Of Concepts Because The Computer Gathers Knowledge From Experience There Is No Need For A Human Computer Operator To Formally Specify All The Knowledge That The Computer Needs The Hierarchy Of Concepts Allows The Computer To Learn Complicated Concepts By Building Them Out Of Simpler Ones A Graph Of These Hierarchies Would Be Many Layers Deep This Book Introduces A Broad Range Of Topics In Deep Learning The Text Offers Mathematical And Conceptual Background Covering Relevant Concepts In Linear Algebra Proba

In [25]:
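# Standardize text in the 'categories' column: title case, trim, keep only ASCII letters and whitespace
# (note: removing '&' leaves a double space, e.g. 'Technology  Engineering', and 'Self-Help' becomes 'SelfHelp')
df_cleaned['categories'] = df_cleaned['categories'].str.title().str.strip().str.replace(r'[^a-zA-Z\s]', '', regex=True)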

# Verify the changes
unique_categories = df_cleaned['categories'].unique()[:10]  # Display the first 10 unique categories for verification
unique_categories


array(['Computers', 'Technology  Engineering', 'Science', 'Mathematics',
       'Business  Economics', 'Education', 'Medical', 'Machine Learning',
       'SelfHelp', 'Deep Learning Machine Learning'], dtype=object)

In [26]:
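# Standardize text in the 'language' column: title case, trim, keep only ASCII letters and whitespace
# (note: this rewrites language codes, e.g. 'en' becomes 'En' and a hyphenated code becomes 'ZhCn')
df_cleaned['language'] = df_cleaned['language'].str.title().str.strip().str.replace(r'[^a-zA-Z\s]', '', regex=True)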

# Verify the changes
unique_languages = df_cleaned['language'].unique()[:10]  # Display the first 10 unique languages for verification
unique_languages


array(['En', 'De', 'Fr', 'Es', 'El', 'It', 'ZhCn', 'Ta', 'Da', 'Hi'],
      dtype=object)
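
Language values are codes rather than free text, so title-casing and stripping hyphens makes them harder to match against standard code lists. A minimal alternative, assuming the raw values are codes such as en or zh-CN, only trims whitespace and lowercases:

```python
import pandas as pd

# Hypothetical raw language codes with inconsistent case and whitespace
langs = pd.Series(["en", " EN", "zh-CN", "de "])

# Trim and lowercase; the hyphen in regional codes is preserved
codes = langs.str.strip().str.lower()
```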

In [27]:
# Extract the year from 'published_date' (already datetime; NaT yields NaN)
df_cleaned['published_year'] = df_cleaned['published_date'].dt.year

# Rows whose date failed to parse have no year; use 0 as a sentinel so the column can be cast to int
df_cleaned['published_year'] = df_cleaned['published_year'].fillna(0).astype(int)

# Create the 'book_age(Years)' column based on the current year (0 when the year is unknown)
current_year = datetime.now().year
df_cleaned['book_age(Years)'] = df_cleaned['published_year'].apply(lambda x: current_year - x if x > 0 else 0)

# Verify the changes
df_cleaned[['published_date', 'published_year', 'book_age(Years)']].head()


Unnamed: 0,published_date,published_year,book_age(Years)
0,2016-11-18,2016,8
1,2020-06-29,2020,4
2,2022-05-26,2022,2
3,2019-03-14,2019,5
4,2018-02-04,2018,6
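
Using 0 for unknown years makes a missing date look like a book published this year. One alternative, sketched here on hypothetical data, is pandas' nullable `Int64` dtype, which keeps missing years as `<NA>`. The fixed `reference_year` is an assumption made so the example is reproducible.

```python
import pandas as pd

# Hypothetical dates with one missing value
dates = pd.to_datetime(pd.Series(["2016-11-18", None, "2020-06-29"]))

# Fixed reference year (assumption) instead of datetime.now().year
reference_year = 2024

# Nullable integer year: missing dates stay <NA> instead of becoming 0
year = dates.dt.year.astype("Int64")

# Age propagates <NA> for rows without a year
age = reference_year - year
```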


In [28]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4866 entries, 0 to 5143
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   title            4866 non-null   object        
 1   authors          4866 non-null   object        
 2   publisher        4866 non-null   object        
 3   published_date   3086 non-null   datetime64[ns]
 4   description      4866 non-null   object        
 5   page_count       4866 non-null   float64       
 6   categories       4866 non-null   object        
 7   average_rating   4866 non-null   float64       
 8   ratings_count    4866 non-null   float64       
 9   language         4866 non-null   object        
 10  published_year   4866 non-null   int64         
 11  book_age(Years)  4866 non-null   int64         
dtypes: datetime64[ns](1), float64(3), int64(2), object(6)
memory usage: 494.2+ KB


In [29]:
df_cleaned.head()

Unnamed: 0,title,authors,publisher,published_date,description,page_count,categories,average_rating,ratings_count,language,published_year,book_age(Years)
0,Deep Learning,"Ian Goodfellow, Yoshua Bengio, Aaron Courville",Mit Press,2016-11-18,An Introduction To A Broad Range Of Topics In ...,801.0,Computers,3.5,6.0,En,2016,8
1,Deep Learning For Coders With Fastai And Pytorch,"Jeremy Howard, Sylvain Gugger",OReilly Media,2020-06-29,Deep Learning Is Often Viewed As The Exclusive...,624.0,Computers,4.5,2.0,En,2020,4
2,The Principles Of Deep Learning Theory,"Daniel A. Roberts, Sho Yaida, Boris Hanin",Cambridge University Press,2022-05-26,This Volume Develops An Effective Theory Appro...,473.0,Computers,5.0,2.0,En,2022,2
3,Advances In Deep Learning,"M. Arif Wani, Farooq Ahmad Bhat, Saduf Afzal, ...",Springer,2019-03-14,This Book Introduces Readers To Both Basic And...,149.0,Technology Engineering,1.0,1.0,En,2019,5
4,Introduction To Deep Learning,Sandro Skansi,Springer,2018-02-04,This Textbook Presents A Concise Accessible An...,191.0,Computers,5.0,1.0,En,2018,6
