### 01. Data Preparation

### **Audible Insights: Intelligent Book Recommendations**

### This notebook handles the initial data loading, inspection, and merging of the two Audible datasets.
 
### Objectives:
### - Load and inspect both CSV files
### - Understand the structure and quality of data
### - Merge datasets on common keys (Book Name, Author)
### - Save merged dataset for future use
 
### Datasets:
### - **Audible_Catlog.csv**: Basic book information (6368 rows, 5 columns)
### - **Audible_Catlog_Advanced_Features.csv**: Extended features (4464 rows, 8 columns)



In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

## 1. Load and Inspect Dataset 1: Basic Catalog

In [None]:



# Load the basic catalog dataset
df_basic = pd.read_csv('/Users/priyankamalavade/Desktop/Audible_Insights_Project/data/Audible_Catlog.csv')

print(" BASIC CATALOG DATASET OVERVIEW")
print("-" * 40)
print(f"Shape: {df_basic.shape}")
print(f"Columns: {list(df_basic.columns)}")
print()

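# Display basic info
# Note: info() prints its report directly and returns None,
# so wrapping it in print() also emits a trailing "None" line
print(" Dataset Info:")
print(df_basic.info())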
print()

# Display first few rows
print(" First 5 rows:")
print(df_basic.head())
print()

# Check for missing values
print(" Missing values:")
print(df_basic.isnull().sum())
print()

# Basic statistics
print(" Numerical columns statistics:")
print(df_basic.describe())

 BASIC CATALOG DATASET OVERVIEW
----------------------------------------
Shape: (6368, 5)
Columns: ['Book Name', 'Author', 'Rating', 'Number of Reviews', 'Price']

 Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6368 entries, 0 to 6367
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Book Name          6368 non-null   object 
 1   Author             6368 non-null   object 
 2   Rating             6368 non-null   float64
 3   Number of Reviews  5737 non-null   float64
 4   Price              6365 non-null   float64
dtypes: float64(3), object(2)
memory usage: 248.9+ KB
None

 First 5 rows:
                                           Book Name          Author  Rating  \
0  Think Like a Monk: The Secret of How to Harnes...      Jay Shetty     4.9   
1  Ikigai: The Japanese Secret to a Long and Happ...   Héctor García     4.6   
2  The Subtle Art of Not Giving a F*ck: A Counter...     Mark Ma

## 2. Load and Inspect Dataset 2: Advanced Features

In [4]:
# Load the advanced features dataset
df_advanced = pd.read_csv('/Users/priyankamalavade/Desktop/Audible_Insights_Project/data/Audible_Catlog_Advanced_Features.csv')

print(" ADVANCED FEATURES DATASET OVERVIEW")
print("-" * 40)
print(f"Shape: {df_advanced.shape}")
print(f"Columns: {list(df_advanced.columns)}")
print()

# Display basic info
print(" Dataset Info:")
print(df_advanced.info())
print()

# Display first few rows
print(" First 5 rows:")
print(df_advanced.head())
print()

# Check for missing values
print(" Missing values:")
print(df_advanced.isnull().sum())
print()

# Sample of Description and Genre columns
print(" Sample Description:")
print(df_advanced['Description'].iloc[0][:200] + "...")
print()
print(" Sample Ranks and Genre:")
print(df_advanced['Ranks and Genre'].iloc[0])


 ADVANCED FEATURES DATASET OVERVIEW
----------------------------------------
Shape: (4464, 8)
Columns: ['Book Name', 'Author', 'Rating', 'Number of Reviews', 'Price', 'Description', 'Listening Time', 'Ranks and Genre']

 Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4464 entries, 0 to 4463
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Book Name          4464 non-null   object 
 1   Author             4464 non-null   object 
 2   Rating             4464 non-null   float64
 3   Number of Reviews  4043 non-null   float64
 4   Price              4464 non-null   int64  
 5   Description        4458 non-null   object 
 6   Listening Time     4464 non-null   object 
 7   Ranks and Genre    4464 non-null   object 
dtypes: float64(2), int64(1), object(5)
memory usage: 279.1+ KB
None

 First 5 rows:
                                           Book Name          Author  Rating  \
0  Think Like 

## 3. Data Quality Assessment


In [5]:
# Check overlapping books between datasets
basic_books = set(df_basic['Book Name'].str.strip().str.lower())
advanced_books = set(df_advanced['Book Name'].str.strip().str.lower())

print(f"Books in basic dataset: {len(basic_books)}")
print(f"Books in advanced dataset: {len(advanced_books)}")
print(f"Common books: {len(basic_books.intersection(advanced_books))}")
print(f"Unique to basic: {len(basic_books - advanced_books)}")
print(f"Unique to advanced: {len(advanced_books - basic_books)}")
print()


Books in basic dataset: 5394
Books in advanced dataset: 4006
Common books: 3348
Unique to basic: 2046
Unique to advanced: 658
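
The counts above compare titles only, while the merge later in this notebook joins on both title and author. A minimal sketch of the same overlap check on normalized (title, author) pairs follows. The toy frames and the `key_pairs` helper are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

# Toy frames standing in for df_basic / df_advanced (all values are made up).
basic = pd.DataFrame({
    "Book Name": ["Dune ", "dune", "Emma", "Ulysses"],
    "Author": ["Frank Herbert", "Frank Herbert", "Jane Austen", "James Joyce"],
})
advanced = pd.DataFrame({
    "Book Name": ["DUNE", "Emma", "Emma"],
    "Author": ["frank herbert", "Jane Austen", "Another Author"],
})


def key_pairs(df: pd.DataFrame) -> set:
    """Return the set of normalized (title, author) pairs in df (hypothetical helper)."""
    titles = df["Book Name"].str.strip().str.lower()
    authors = df["Author"].str.strip().str.lower()
    return set(zip(titles, authors))


basic_keys = key_pairs(basic)
advanced_keys = key_pairs(advanced)
common_keys = basic_keys & advanced_keys

print(f"Distinct keys in basic: {len(basic_keys)}")
print(f"Distinct keys in advanced: {len(advanced_keys)}")
print(f"Common keys: {len(common_keys)}")
```

Comparing pairs rather than titles keeps this check aligned with the join keys, so two different books that share a title are not counted as one.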



In [6]:
# Check for exact duplicates within each dataset
print(" Duplicate rows check:")
print(f"Basic dataset duplicates: {df_basic.duplicated().sum()}")
print(f"Advanced dataset duplicates: {df_advanced.duplicated().sum()}")
print()


 Duplicate rows check:
Basic dataset duplicates: 929
Advanced dataset duplicates: 168
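
`duplicated()` with no arguments only flags rows that match in every column. A sketch of how exact duplicates differ from duplicates on the normalized merge keys, and how the latter could be dropped before merging, might look like this. The data is invented and the keep-first policy is an assumption, not something the notebook does.

```python
import pandas as pd

# Toy data (made up): rows 0 and 1 are exact duplicates, row 2 differs
# only by whitespace, letter case, and rating.
df = pd.DataFrame({
    "Book Name": ["Dune", "Dune", "Dune ", "Emma"],
    "Author": ["Frank Herbert", "Frank Herbert", "frank herbert", "Jane Austen"],
    "Rating": [4.7, 4.7, 4.6, 4.5],
})

# Exact duplicates: every column must match.
exact_dupes = int(df.duplicated().sum())

# Key duplicates: only the normalized title and author must match.
key = pd.DataFrame({
    "title": df["Book Name"].str.strip().str.lower(),
    "author": df["Author"].str.strip().str.lower(),
})
key_dupes = int(key.duplicated().sum())

# Assumed policy: keep the first row seen for each key.
deduped = df.loc[~key.duplicated(keep="first")].reset_index(drop=True)

print(f"Exact duplicates: {exact_dupes}")
print(f"Key duplicates: {key_dupes}")
print(f"Rows after key de-duplication: {len(deduped)}")
```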



In [7]:
# Check price ranges to identify potential issues
print(" Price range analysis:")
print(f"Basic - Price range: ${df_basic['Price'].min():.2f} to ${df_basic['Price'].max():.2f}")
print(f"Advanced - Price range: ${df_advanced['Price'].min():.2f} to ${df_advanced['Price'].max():.2f}")


 Price range analysis:
Basic - Price range: $0.00 to $18290.00
Advanced - Price range: $0.00 to $18290.00
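
A min/max range alone does not show whether the top value is a lone outlier or part of a long tail. One possible follow-up, sketched here on invented prices, counts zero-priced rows and flags values above the usual 1.5 × IQR upper fence. The fence rule is an assumed choice, not part of the original analysis.

```python
import pandas as pd

# Illustrative prices only; they are not a sample of the real files.
prices = pd.Series([0.0, 199.0, 501.0, 836.0, 18290.0], name="Price")

# Rows priced at zero may be free titles or missing prices stored as 0.
free_count = int((prices == 0).sum())

# Upper fence of the 1.5 * IQR rule (assumed outlier criterion).
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = prices[prices > upper_fence]

print(f"Zero-priced rows: {free_count}")
print(f"Upper fence: {upper_fence:.2f}")
print(f"Rows above the fence: {len(outliers)}")
```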


## 4. Prepare Data for Merging


In [8]:
# Create clean versions for merging
df_basic_clean = df_basic.copy()
df_advanced_clean = df_advanced.copy()


In [9]:
# Standardize book names and authors for better matching
df_basic_clean['Book_Name_Clean'] = df_basic_clean['Book Name'].str.strip().str.lower()
df_basic_clean['Author_Clean'] = df_basic_clean['Author'].str.strip().str.lower()


In [10]:
df_advanced_clean['Book_Name_Clean'] = df_advanced_clean['Book Name'].str.strip().str.lower()
df_advanced_clean['Author_Clean'] = df_advanced_clean['Author'].str.strip().str.lower()


In [11]:
# Add source identifier to track origin
df_basic_clean['Source'] = 'basic'
df_advanced_clean['Source'] = 'advanced'


In [12]:
print(" Data prepared for merging")
print(f"Basic dataset ready: {df_basic_clean.shape}")
print(f"Advanced dataset ready: {df_advanced_clean.shape}")


 Data prepared for merging
Basic dataset ready: (6368, 8)
Advanced dataset ready: (4464, 11)
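
The same three preparation steps are applied to both frames, so they could be folded into one function. The sketch below assumes the column names used above. `add_clean_keys` is a hypothetical helper name and the toy frame is made up.

```python
import pandas as pd


def add_clean_keys(df: pd.DataFrame, source: str) -> pd.DataFrame:
    """Return a copy of df with normalized merge keys and a Source tag (hypothetical helper)."""
    out = df.copy()
    out["Book_Name_Clean"] = out["Book Name"].str.strip().str.lower()
    out["Author_Clean"] = out["Author"].str.strip().str.lower()
    out["Source"] = source
    return out


# Toy frame standing in for df_basic (values are made up).
toy = pd.DataFrame({"Book Name": ["  Dune "], "Author": ["Frank HERBERT"]})
toy_clean = add_clean_keys(toy, "basic")
print(toy_clean)
```

Because the helper works on a copy, the input frame is left unchanged, matching the `.copy()` calls in the cells above.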


## 5. Merge Datasets

In [13]:
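# Perform outer merge to keep all books from both datasets
# Note: neither frame is de-duplicated first, so a (title, author) key that
# repeats on both sides matches many-to-many and multiplies rows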
merged_df = pd.merge(
    df_basic_clean,
    df_advanced_clean,
    on=['Book_Name_Clean', 'Author_Clean'],
    how='outer',
    suffixes=('_basic', '_advanced')
)
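
The `validate` argument of `pd.merge` can guard against unintended row multiplication. The sketch below uses toy frames with repeated keys (invented data) and shows the merge being rejected when a one-to-one relationship is requested.

```python
import pandas as pd

# Toy frames (made up): key "a" repeats on both sides.
left = pd.DataFrame({"key": ["a", "a", "b"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "a", "c"], "y": [10, 20, 30]})

try:
    pd.merge(left, right, on="key", how="outer", validate="one_to_one")
    is_one_to_one = True
except pd.errors.MergeError as err:
    # Raised because the merge keys are not unique.
    is_one_to_one = False
    print(f"Merge rejected: {err}")

# Without validation the same merge succeeds but multiplies rows:
# "a" contributes 2 x 2 = 4 rows, "b" and "c" one row each.
unchecked = pd.merge(left, right, on="key", how="outer")
print(f"Rows without validation: {len(unchecked)}")
```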

In [14]:
print(f" Merged dataset shape: {merged_df.shape}")
print()

 Merged dataset shape: (7576, 17)



In [17]:
# Analyze the merge results
print(" Merge analysis:")
both_sources = merged_df['Source_basic'].notna() & merged_df['Source_advanced'].notna()
only_basic = merged_df['Source_basic'].notna() & merged_df['Source_advanced'].isna()
only_advanced = merged_df['Source_basic'].isna() & merged_df['Source_advanced'].notna()
print(f"Books in both datasets: {both_sources.sum()}")
print(f"Books only in basic: {only_basic.sum()}")
print(f"Books only in advanced: {only_advanced.sum()}")


 Merge analysis:
Books in both datasets: 4259
Books only in basic: 2568
Books only in advanced: 749
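
The 4,259 rows reported for both datasets exceed the 3,348 common titles found earlier. One likely explanation is that repeated keys match many-to-many, so these figures count rows rather than books. A sketch with `indicator=True` on invented data shows how row counts and distinct-key counts can diverge.

```python
import pandas as pd

# Toy frames (made up): key "a" repeats on both sides.
left = pd.DataFrame({"key": ["a", "a", "b"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "a", "c"], "y": [10, 20, 30]})

# indicator=True adds a "_merge" column: left_only, right_only, or both.
merged = pd.merge(left, right, on="key", how="outer", indicator=True)

# Counting rows overstates the overlap when keys repeat.
row_counts = merged["_merge"].value_counts()

# Counting each key once gives the number of distinct matched keys.
key_counts = merged.drop_duplicates(subset="key")["_merge"].value_counts()

print("Rows per category:")
print(row_counts)
print("Distinct keys per category:")
print(key_counts)
```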


## 6. Create Final Consolidated Dataset


In [18]:
# Create the final dataset with best available information
final_df = pd.DataFrame()


In [19]:
# Use book name and author from available source
final_df['Book_Name'] = merged_df['Book Name_basic'].fillna(merged_df['Book Name_advanced'])
final_df['Author'] = merged_df['Author_basic'].fillna(merged_df['Author_advanced'])


In [20]:
# For numerical columns, prefer advanced dataset if available, else use basic
final_df['Rating'] = merged_df['Rating_advanced'].fillna(merged_df['Rating_basic'])
final_df['Number_of_Reviews'] = merged_df['Number of Reviews_advanced'].fillna(merged_df['Number of Reviews_basic'])
final_df['Price'] = merged_df['Price_advanced'].fillna(merged_df['Price_basic'])
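
Preferring one source with `fillna` silently discards the other source's value whenever both are present. A small sketch on invented ratings shows the coalescing step next to a count of rows where the two sources disagree. The disagreement check is an assumed extra, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy merged frame (values are made up).
merged = pd.DataFrame({
    "Rating_basic": [4.5, 4.0, np.nan],
    "Rating_advanced": [4.6, np.nan, 3.9],
})

# Prefer the advanced value and fall back to the basic one.
rating = merged["Rating_advanced"].fillna(merged["Rating_basic"])

# combine_first gives the same result for Series that share an index.
rating_alt = merged["Rating_advanced"].combine_first(merged["Rating_basic"])

# Rows where both sources have a value but the values differ.
both_present = merged["Rating_basic"].notna() & merged["Rating_advanced"].notna()
disagree = both_present & (merged["Rating_basic"] != merged["Rating_advanced"])

print(rating.tolist())
print(f"Rows where the sources disagree: {int(disagree.sum())}")
```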


In [21]:
# Add advanced features (only available in advanced dataset)
final_df['Description'] = merged_df['Description']
final_df['Listening_Time'] = merged_df['Listening Time']
final_df['Ranks_and_Genre'] = merged_df['Ranks and Genre']


In [22]:
# Add data availability flags
final_df['Has_Basic_Data'] = merged_df['Source_basic'].notna()
final_df['Has_Advanced_Data'] = merged_df['Source_advanced'].notna()
final_df['Data_Source'] = 'both'
final_df.loc[only_basic, 'Data_Source'] = 'basic_only'
final_df.loc[only_advanced, 'Data_Source'] = 'advanced_only'
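
The label assignment above depends on the `only_basic` and `only_advanced` masks created in an earlier cell. As an alternative sketch, the label can be derived from the two boolean flags alone with `np.select`, which keeps the cell self-contained. The toy frame and the `"unknown"` default are assumptions.

```python
import numpy as np
import pandas as pd

# Toy flags (made up) covering the three possible cases.
flags = pd.DataFrame({
    "Has_Basic_Data": [True, True, False],
    "Has_Advanced_Data": [True, False, True],
})

# Conditions are checked in order; the first match wins.
conditions = [
    flags["Has_Basic_Data"] & flags["Has_Advanced_Data"],
    flags["Has_Basic_Data"],
    flags["Has_Advanced_Data"],
]
labels = ["both", "basic_only", "advanced_only"]

flags["Data_Source"] = np.select(conditions, labels, default="unknown")
print(flags)
```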


In [23]:
print(f" Final consolidated dataset shape: {final_df.shape}")
print()

 Final consolidated dataset shape: (7576, 11)



In [24]:
# Display summary of final dataset
print(" Final dataset overview:")
print(final_df.info())
print()
print("Data source distribution:")
print(final_df['Data_Source'].value_counts())


 Final dataset overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7576 entries, 0 to 7575
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Book_Name          7576 non-null   object 
 1   Author             7576 non-null   object 
 2   Rating             7576 non-null   float64
 3   Number_of_Reviews  6832 non-null   float64
 4   Price              7574 non-null   float64
 5   Description        5001 non-null   object 
 6   Listening_Time     5008 non-null   object 
 7   Ranks_and_Genre    5008 non-null   object 
 8   Has_Basic_Data     7576 non-null   bool   
 9   Has_Advanced_Data  7576 non-null   bool   
 10  Data_Source        7576 non-null   object 
dtypes: bool(2), float64(3), object(6)
memory usage: 547.6+ KB
None

Data source distribution:
Data_Source
both             4259
basic_only       2568
advanced_only     749
Name: count, dtype: int64


## 7. Data Quality Check on Final Dataset


In [25]:
# Check missing values
print(" Missing values in final dataset:")
missing_summary = final_df.isnull().sum()
print(missing_summary[missing_summary > 0])
print()


 Missing values in final dataset:
Number_of_Reviews     744
Price                   2
Description          2575
Listening_Time       2568
Ranks_and_Genre      2568
dtype: int64



In [26]:
# Check for any remaining duplicates
duplicates = final_df.duplicated(subset=['Book_Name', 'Author']).sum()
print(f" Duplicate book-author combinations: {duplicates}")
print()

 Duplicate book-author combinations: 1511
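
The 1,511 repeated book-author combinations remain in the saved file unless they are removed. One possible clean-up, sketched on invented rows, keeps the most complete row for each combination. The completeness rule is an assumed policy, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy final dataset (made up): "Dune" appears twice, once without a description.
final = pd.DataFrame({
    "Book_Name": ["Dune", "Dune", "Emma"],
    "Author": ["Frank Herbert", "Frank Herbert", "Jane Austen"],
    "Rating": [4.7, 4.7, 4.5],
    "Description": [np.nan, "A desert planet epic.", np.nan],
})

# Score each row by how many fields are filled in.
completeness = final.notna().sum(axis=1)

deduped = (
    final.assign(_completeness=completeness)
    # Stable sort so ties keep their original order.
    .sort_values("_completeness", ascending=False, kind="mergesort")
    .drop_duplicates(subset=["Book_Name", "Author"], keep="first")
    .drop(columns="_completeness")
    .sort_index()
    .reset_index(drop=True)
)

print(deduped)
```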



In [27]:
# Display sample of final dataset
print("Sample of final dataset:")
print(final_df[['Book_Name', 'Author', 'Rating', 'Number_of_Reviews', 'Price', 'Data_Source']].head(10))


Sample of final dataset:
                                           Book_Name                   Author  \
0  "Don't You Know Who I Am?": How to Stay Sane i...  Ramani S. Durvasula PhD   
1  "Don't You Know Who I Am?": How to Stay Sane i...  Ramani S. Durvasula PhD   
2                                          #Girlboss           Sophia Amoruso   
3                                          #Girlboss           Sophia Amoruso   
4  #TheRealCinderella: #BestFriendsForever Series...           Yesenia Vargas   
5                 10 Bedtime Stories For Little Kids                     div.   
6                 10 Bedtime Stories For Little Kids                     div.   
7                  10 Essential Pieces of Literature            Khalil Gibran   
8                  10 Essential Pieces of Literature            Khalil Gibran   
9  10 Essential Success Mantras from the Bhagavad...              Vimla Patil   

   Rating  Number_of_Reviews  Price Data_Source  
0     4.8              170.0  836

## 8. Save Merged Dataset

In [28]:
# Save the merged dataset
output_filename = '/Users/priyankamalavade/Desktop/Audible_Insights_Project/data/merged_audible_dataset.csv'
final_df.to_csv(output_filename, index=False)

print(f" Merged dataset saved as: {output_filename}")
print(" Final dataset statistics:")
print(f"  - Total books: {len(final_df):,}")
print(f"  - Books with basic data only: {(final_df['Data_Source'] == 'basic_only').sum():,}")
print(f"  - Books with advanced data only: {(final_df['Data_Source'] == 'advanced_only').sum():,}")
print(f"  - Books with both datasets: {(final_df['Data_Source'] == 'both').sum():,}")
print(f"  - Books with descriptions: {final_df['Description'].notna().sum():,}")
print(f"  - Books with genre info: {final_df['Ranks_and_Genre'].notna().sum():,}")

print()

 Merged dataset saved as: /Users/priyankamalavade/Desktop/Audible_Insights_Project/data/merged_audible_dataset.csv
 Final dataset statistics:
  - Total books: 7,576
  - Books with basic data only: 2,568
  - Books with advanced data only: 749
  - Books with both datasets: 4,259
  - Books with descriptions: 5,001
  - Books with genre info: 5,008
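
A quick round-trip read can confirm that the saved file matches the frame in memory. The sketch below writes a toy frame to a temporary directory rather than the project path used above. The frame contents are made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for final_df (values are made up).
final = pd.DataFrame({
    "Book_Name": ["Dune", "Emma"],
    "Rating": [4.7, 4.5],
    "Has_Basic_Data": [True, False],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "merged_audible_dataset.csv"
    final.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Shape and column order should survive the round trip.
same_shape = reloaded.shape == final.shape
same_columns = list(reloaded.columns) == list(final.columns)

print(f"Shape preserved: {same_shape}")
print(f"Columns preserved: {same_columns}")
```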



### Key Findings:
### - Basic dataset: 6,368 rows with core information (929 exact duplicate rows)
### - Advanced dataset: 4,464 rows with detailed features (168 exact duplicate rows)
### - Merged dataset: 7,576 rows, of which 1,511 repeat an earlier book-author combination
### - 4,259 rows have both basic and advanced information; 2,568 are basic only and 749 are advanced only
### - 5,001 rows have a description and 5,008 have genre information