Books Recommendation System

Project Motivation
The Arabic book market is large, with many readers worldwide. Finding the right book to read is therefore a challenge, especially because few Arabic-specific recommendation systems are available. The goal of this project is to help users discover their next read based on their interests, preferences, and reading history by building an intelligent recommendation system that uses the Jamalon Arabic Books Dataset.

students' names: رغد المطيري | شوق القريشي | هيفاء السديري | بتول الفوزان | نوره العريفي |

The goal of the dataset:

The primary goal of using the Jamalon Arabic Books Dataset is to develop a robust, AI-driven, personalized recommendation system for Arabic books that categorizes books more efficiently and improves the overall user experience. By leveraging this dataset, which includes rich metadata such as titles, authors, and genres, the system will integrate machine learning and generative AI to provide personalized suggestions with explanations, classify and categorize content efficiently, and process user inputs.

The goal of this project is to develop a recommendation system for Arabic books available on Jamalon, an online bookstore. The objective is to suggest books to users based on attributes such as genre, price, and ratings, along with personal preferences.

In Phase 1, we focused on understanding the problem, exploring the dataset, and performing initial data preprocessing. This included handling missing values, encoding categorical features, and visualizing the dataset to understand the relationships between key attributes.

The source of the dataset:
We are using the Jamalon Arabic Books Dataset, sourced from [Kaggle - Jamalon Arabic Books Dataset](https://www.kaggle.com/datasets/dareenalharthi/jamalon-arabic-books-dataset?resource=download).

General information:
In the Jamalon Arabic Books Dataset, each row represents a book. The dataset consists of 11 columns (variables) and approximately 8980 observations (books).  

Dataset Variables:
- Unique ID: A unique identifier for each book (Numerical).
- Title: The name of the book (Text).
- Author: The author's name (Text).
- Description: A brief description of the book (Text).
- Pages: The total number of pages in the book (Numerical).
- Publication Year: The year the book was published (Numerical).
- Publisher: The name of the publisher (Categorical).
- Cover: The cover type, such as Paperback or Hardcover (Categorical).
- Category: The main category of the book (e.g., Literature, Islamic Books) (Categorical).
- Subcategory: A more specific classification under each category (Categorical).
- Price: The price of the book (Numerical).

The Category and Subcategory columns act as classification labels, organizing books into different genres. These labels are useful for building the recommendation model based on user preferences.


Summary of the dataset:

In [None]:
# 1. Importing Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np 
from langdetect import detect
import re
from IPython.display import display
from tabulate import tabulate

In [None]:
# 2. Loading the Dataset
file_path = r'C:\Users\Raghad\Downloads\BooksDB\jamalon dataset.csv'
df = pd.read_csv(file_path)

In [None]:
df.head(10) 
#Sample of the Dataset

In [None]:
# Count missing values per column
missing_values = df.isnull().sum().reset_index()
missing_values.columns = ['Column', 'Missing Count']
missing_values['Missing Percentage'] = (missing_values['Missing Count'] / len(df)) * 100

# Display as a table
missing_values

In [None]:
plt.figure(figsize=(10, 5))
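# Assigning `hue` together with `palette` matches the later plots and avoids seaborn's palette-without-hue warning
sns.barplot(data=missing_values, x='Column', y='Missing Count', hue='Column', palette='viridis', legend=False)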
plt.xticks(rotation=90)
plt.ylabel("Count of Missing Values")
plt.title("Missing Values per Column")
plt.show()

In [None]:
# Count, mean, standard deviation, minimum, maximum, and variance

In [None]:
# Compute statistics for numerical columns only
numeric_df = df.select_dtypes(include=[np.number])

summary_stats = numeric_df.describe().T  # Summary statistics
summary_stats['Variance'] = numeric_df.var()  # Compute variance only for numerical columns

summary_stats

The following preprocessing steps were performed on the dataset:

1- Handling Missing Values: We checked for any null values in the dataset and deleted the rows containing them. Removing rows with missing data helps maintain data integrity and ensures that analysis and models are based on complete information.

2- Duplicate Detection and Removal: We identified duplicate entries (books with the same title and publisher) and kept the most relevant one using the following deduplication strategy: the entry with the lowest non-zero price is kept; if all prices are zero, the most recent publication year is prioritized; if the publication years are the same, the entry with the longest description is retained. Removing duplicates avoids data redundancy, improves data quality, and ensures accurate analysis.

3- Unnecessary Column Removal: We removed columns like Unnamed: 0 and Cover as they were not needed for analysis. The Unnamed: 0 column was removed because it simply numbered the rows, which is unnecessary since the dataset already has automatic indexing. The Cover column was removed because it only indicated the type of book cover (e.g., paperback, hardcover, electronic), which was not relevant to our analysis.

4- Language Filtering: We used a regular expression (regex) to detect titles containing Latin letters and removed them from the dataset. Since the focus is on Arabic books, removing English titles ensures the dataset is relevant to the project’s objectives.

5- Category Mapping: We mapped each of the 13 book categories to a 13-digit binary-like code with a single 1 (e.g., "الأدب والخيال" → 0000000000001). This mapping standardizes categories, making them easier to process in machine learning models.

6- Encoding Categorical Features: We applied Label Encoding to the Author, Publisher, and Subcategory columns to convert text data into numerical values. Most machine learning models require numerical input; encoding categorical features ensures compatibility with these models.

7- Discretization: We discretized the Pages column into bins (e.g., 0–50, 50–100, etc.) to categorize books based on page ranges. Discretization helps in analyzing trends across different ranges and simplifies complex numerical data.

8- Text Cleaning (Titles & Descriptions): We cleaned the text data by removing Arabic diacritics (Tashkeel), special characters, punctuation marks, and extra spaces. Cleaning text data improves consistency and quality, which is especially important for text analysis and natural language processing (NLP) tasks in Phase 2.


In [None]:
# Adjust display settings for better formatting
pd.set_option('display.max_columns', None)  # Display all columns
pd.set_option('display.max_colwidth', None) # Display full content in each cell
pd.set_option('display.expand_frame_repr', False) # Prevent line breaks in tables
pd.set_option('display.width', 1000)       # Adjust the table width


# Check for duplicates
def print_duplicate_stats(df, column):
    # Count the occurrences of each value in the specified column
    duplicates = df[column].value_counts()

    # Print basic duplicate analysis statistics
    print(f"\nDuplicate analysis for {column}:")
    print(f"Total rows: {len(df)}")  # Total number of rows in the dataframe
    print(f"Unique values: {df[column].nunique()}")  # Number of unique values in the specified column
    print(f"Number of duplicated values: {len(df[df[column].duplicated()])}")  # Number of duplicated values


    # If there are any duplicated values, display more details
    if len(duplicates[duplicates > 1]) > 0:
        print("\nMost common duplicates:")
        # Display the top duplicate values in a neat table format
        display(duplicates[duplicates > 1].head().to_frame('Count').reset_index().rename(columns={'index': column}))

        # Show detailed information for some duplicated entries
        print("\nSample of duplicated entries (first duplicates):")
        for title in duplicates[duplicates > 1].head(1).index:
            print(f"\nAll entries for title: {title}")
            
            # Display detailed data for each duplicated title with a clean table style
            duplicated_data = df[df[column] == title][['Title', 'Author', 'Publisher', 'Publication year', 'Price']]
            styled_table = duplicated_data.style.set_table_styles(
                [{'selector': 'th', 'props': [('background-color', '#f7f7f7'), ('font-weight', 'bold')]},
                 {'selector': 'td', 'props': [('text-align', 'center')]}]
            ).set_properties(**{'border': '1px solid black', 'padding': '8px'})
            
            display(styled_table)  # Display the table with better formatting

# Check for duplicated rows in the entire dataframe
print("Check for duplicates:\t" + str(df.duplicated().sum()))

# Perform duplicate analysis for the 'Title' column
print_duplicate_stats(df, 'Title')

In [None]:
# Value counts and unique values per column
for i in df.columns:
    print(i + " column")
    print(df[i].value_counts())
print("Unique values per column:\n", df.nunique())

In [None]:
# Summary statistics of pages, publication year, price
print(df.describe())

In [None]:
# 4. Data Cleaning
clean_df = df.copy()

In [None]:
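# Function to detect English titles (any value containing at least one Latin letter)
ENGLISH_PATTERN = re.compile(r'[a-zA-Z]')  # compiled once and reused for every title

def is_english(text):
    # str() also handles non-string values such as numbers or NaN
    return bool(ENGLISH_PATTERN.search(str(text)))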
        
# Function to print removal stats
def print_removal_stats(df_before, df_after, step_name):
    rows_removed = len(df_before) - len(df_after)
    print(f"\n{step_name}:")
    print(f"Rows removed: {rows_removed}")
    print(f"Rows remaining: {len(df_after)}")
    if rows_removed > 0:
        removed_df = df_before[~df_before.index.isin(df_after.index)]
        print(removed_df['Title'].head())

In [None]:
# Function to choose which duplicate to keep
def choose_best_duplicate(group):
    if len(group) == 1:
        return group.iloc[0]  # No duplicates, return as is

    # Keep the one with the lowest non-zero price
    non_zero_prices = group[group['Price'] > 0]
    if not non_zero_prices.empty:
        return non_zero_prices.sort_values(by='Price').iloc[0]

    # If all prices are zero, keep the most recent publication
    if group['Publication year'].nunique() > 1:
        return group.sort_values(by='Publication year', ascending=False).iloc[0]

    # If publication years are the same, keep the longest description.
    # Lengths are computed without adding a column to `group`, so every
    # returned row has the same fields.
    desc_length = group['Description'].fillna('').astype(str).str.len()
    return group.loc[desc_length.idxmax()]


The columns Unnamed: 0 and Cover were removed because they do not provide any meaningful information relevant to the book recommendation system.
Unnamed: 0 is an automatically generated index column that adds no analytical value,
while Cover does not contribute to a recommendation process based on attributes such as Description and Category.
Removing these unnecessary columns simplifies the dataset, improves data processing efficiency, and keeps the focus on relevant features.

In [None]:
# Removing unnecessary columns
#Drop unnamed and Cover columns
clean_df = clean_df.drop(columns=['Unnamed: 0', 'Cover'], errors='ignore')

In [None]:
# Remove duplicate entries while keeping the best version
# This process ensures that for books with the same title and same publisher, we retain the most relevant entry
print("\nRemoving duplicates...")
df_before = clean_df.copy()  # Create a copy of the dataset to compare before and after removal

# Group data by both 'Title' and 'Publisher' to identify duplicates correctly
clean_df = clean_df.groupby(['Title', 'Publisher'], as_index=False).apply(choose_best_duplicate).reset_index(drop=True)

# Print removal statistics to show how many rows were deleted and how many remain
print_removal_stats(df_before, clean_df, "Deduplication removal")
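
The deduplication rules can also be expressed as a sort followed by `drop_duplicates`, which keeps the original row index so that removed rows can be listed afterwards. The following is a minimal sketch on made-up toy data, not the notebook's method: `dedupe_keep_best` and the helper columns `_price_key` and `_desc_len` are hypothetical names, and ties are resolved slightly differently (equal non-zero prices fall through to publication year and then description length).

```python
import pandas as pd

# Toy data: column names mirror the notebook, values are invented for illustration.
books = pd.DataFrame({
    "Title": ["A", "A", "B", "B", "C"],
    "Publisher": ["P1", "P1", "P2", "P2", "P3"],
    "Price": [12.0, 9.0, 0.0, 0.0, 5.0],
    "Publication year": [2010, 2015, 2001, 2018, 2020],
    "Description": ["short", "a longer text", "x", "y", "z"],
})

def dedupe_keep_best(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per (Title, Publisher) and preserve the original index."""
    ranked = df.assign(
        # Non-positive prices become NaN so they sort after every real price.
        _price_key=df["Price"].where(df["Price"] > 0),
        _desc_len=df["Description"].fillna("").astype(str).str.len(),
    ).sort_values(
        by=["_price_key", "Publication year", "_desc_len"],
        ascending=[True, False, False],
    )
    kept = ranked.drop_duplicates(subset=["Title", "Publisher"], keep="first")
    return kept.drop(columns=["_price_key", "_desc_len"]).sort_index()

deduped = dedupe_keep_best(books)
removed = books.loc[~books.index.isin(deduped.index)]
print(deduped)
print("Removed row labels:", removed.index.tolist())
```

Because the index labels survive, an index-based comparison between the data before and after deduplication reports exactly the rows that were dropped.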

In [None]:
# Identify and remove English titles as the dataset is likely focused on Arabic books
print("\nBefore English title removal:")
print("Total rows:", len(clean_df))  # Display the total number of rows before removal

# Apply the 'is_english' function to detect titles written in English
english_titles = clean_df[clean_df['Title'].apply(is_english)]
print("Found English titles:", len(english_titles))  # Show the count of English titles found

# If English titles are present, display a sample of them
if len(english_titles) > 0:
    print("Sample of English titles to be removed:")
    print(english_titles['Title'].head())  # Display the first few English titles

# Remove English titles from the dataset
# The tilde (~) operator negates the condition to keep only non-English titles
df_before = clean_df.copy()  # Copy the dataset before removal for comparison
clean_df = clean_df[~clean_df['Title'].apply(is_english)]  # Filter out English titles

# Print statistics to show how many English titles were removed
print_removal_stats(df_before, clean_df, "Removing English titles")
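
Note that `is_english` flags any title containing at least one Latin letter, so an Arabic title that includes a Latin acronym is removed as well. If that turns out to be too strict, one possible refinement is to compare the share of Arabic and Latin letters. The sketch below is an assumption-laden illustration: `arabic_ratio`, `is_mostly_arabic`, and the 0.5 threshold are hypothetical and not part of the notebook.

```python
import re

ARABIC_CHAR = re.compile(r"[\u0621-\u064A]")  # basic Arabic letter range
LATIN_CHAR = re.compile(r"[A-Za-z]")

def arabic_ratio(text) -> float:
    """Share of Arabic characters among all Arabic and Latin letters in `text`."""
    s = str(text)
    arabic = len(ARABIC_CHAR.findall(s))
    latin = len(LATIN_CHAR.findall(s))
    total = arabic + latin
    return arabic / total if total else 0.0

def is_mostly_arabic(text, threshold: float = 0.5) -> bool:
    """True when at least `threshold` of the letters are Arabic (threshold is arbitrary)."""
    return arabic_ratio(text) >= threshold

for title in ["مقدمة في البرمجة بلغة C", "Clean Code", ""]:
    print(repr(title), is_mostly_arabic(title))
```

With a column of titles, the same function could be applied as `clean_df['Title'].apply(is_mostly_arabic)` to build a keep-mask instead of a remove-mask.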

In [None]:
#Clean Description
# Compute the stripped text once and reuse it for both the report and the filter
desc_stripped = clean_df['Description'].astype(str).str.strip()
invalid_desc = clean_df['Description'].isna() | desc_stripped.isin(['None', 'nan', ''])

print("\nBefore Description cleaning:")
print("Null values:", clean_df['Description'].isna().sum())
print("'None' string values (including spaces):", 
      desc_stripped.isin(['None', 'nan', '']).sum())

df_before = clean_df.copy()
clean_df = clean_df[~invalid_desc]
print_removal_stats(df_before, clean_df, "Removing invalid descriptions")

In [None]:
# Text Cleaning Function to remove diacritics, symbols, and punctuation
def clean_arabic_text(text):
    # Remove Arabic diacritics (Tashkeel)
    text = re.sub(r'[\u0617-\u061A\u064B-\u0652]', '', text)
    
    # Remove punctuations and special characters
    text = re.sub(r"[!\"#\$%&'\(\)\*\+,\-./:;<=>?@\[\]\\^_`{|}~،؛؟«»]", '', text)
    
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Apply cleaning to the Description column
clean_df['Description'] = clean_df['Description'].astype(str).apply(clean_arabic_text)

# Display a sample of the cleaned data
print(clean_df[['Description']].head())

In [None]:
# Apply cleaning to the Title column
clean_df['Title'] = clean_df['Title'].astype(str).apply(clean_arabic_text)

# Display a sample of the cleaned Title data
print(clean_df[['Title']].head())
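
As a cross-check on the diacritic regex, the same idea can be sketched with the standard library's `unicodedata` module, which classifies Arabic diacritics as nonspacing marks (category `Mn`). This is an alternative technique, not the notebook's function: `strip_combining_marks` is a hypothetical helper, and it is broader than the regex because it removes every nonspacing mark, including ones outside the ranges used above (for example the superscript alef, U+0670).

```python
import unicodedata

def strip_combining_marks(text: str) -> str:
    """Drop every character whose Unicode category is Mn (nonspacing mark)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# The word "kitab" with diacritics: kaf + kasra, ta + fatha, alef, ba + dammatan
vocalized = "\u0643\u0650\u062A\u064E\u0627\u0628\u064C"
plain = strip_combining_marks(vocalized)

print(vocalized, "->", plain)
print(len(vocalized), "->", len(plain))
```

Comparing the output of both approaches on a sample of titles is a quick way to spot diacritics that the regex ranges miss.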

In [None]:
#Clean Author field
df_before = clean_df.copy()
author_stripped = clean_df['Author'].astype(str).str.strip()
# 'لا يوجد' is the Arabic placeholder for "none"
invalid_author = clean_df['Author'].isna() | author_stripped.isin(['لا يوجد', 'None', 'nan', ''])
clean_df = clean_df[~invalid_author]
print_removal_stats(df_before, clean_df, "Removing invalid authors")

In [None]:
# Clean Publication Year
df_before = clean_df.copy()
clean_df = clean_df[
    (clean_df['Publication year'] != 0) & 
    clean_df['Publication year'].notna() & 
    (clean_df['Publication year'] >= 1800) & 
    (clean_df['Publication year'] <= 2024)
]
print_removal_stats(df_before, clean_df, "Removing invalid years")

In [None]:
# 5. Category Mapping
category_map = {
    "الأدب والخيال": "0000000000001",
    "الكتب الإسلامية": "0000000000010",
    "الاقتصاد والأعمال": "0000000000100",
    "الفلسفة": "0000000001000",
    "الصحافة والإعلام": "0000000010000",
    "الكتب السياسية": "0000000100000",
    "العلوم والطبيعة": "0000001000000",
    "الأسرة والطفل": "0000010000000",
    "السير والمذكرات": "0000100000000",
    "الفنون": "0001000000000",
    "التاريخ والجغرافيا": "0010000000000",
    "الرياضة والتسلية": "0100000000000",
    "الشرع والقانون": "1000000000000"
}

clean_df['Category'] = clean_df['Category'].map(category_map)
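
`Series.map` returns NaN for any category that is missing from the dictionary, so it can be useful to check for unmapped labels. The sketch below uses two entries copied from the mapping above plus one made-up label (`"غير معروف"`, "unknown"); `category_map_sample`, `toy`, and `bit_index` are hypothetical names. It also shows how the position of the single 1 in each code can be recovered as an integer.

```python
import pandas as pd

# Two real entries from the notebook's mapping; the third row is an invented label.
category_map_sample = {
    "الأدب والخيال": "0000000000001",
    "الفلسفة": "0000000001000",
}
toy = pd.DataFrame({"Category": ["الأدب والخيال", "الفلسفة", "غير معروف"]})

codes = toy["Category"].map(category_map_sample)

# Labels that were not found in the dictionary end up as NaN after map().
unmapped = toy.loc[codes.isna(), "Category"].unique().tolist()
print("Unmapped categories:", unmapped)

# Position of the single "1", counted from the right (0-based).
bit_index = codes.dropna().map(lambda code: len(code) - 1 - code.index("1"))
print(bit_index.tolist())
```

Running such a check before overwriting the `Category` column makes it easier to see whether rows would silently lose their category.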


In [None]:
from sklearn.preprocessing import LabelEncoder
# Encode Authors using Label Encoding
label_encoder = LabelEncoder()
clean_df['Author'] = label_encoder.fit_transform(clean_df['Author'].astype(str))

# Display the number of unique authors
unique_authors = clean_df['Author'].unique()
print(f"Number of unique authors: {len(unique_authors)}")

# Display a sample of the encoded data
print(clean_df[['Author']].head())

In [None]:
# Encode Publishers using Label Encoding (a separate encoder keeps the Author mapping available)
publisher_encoder = LabelEncoder()
clean_df['Publisher'] = publisher_encoder.fit_transform(clean_df['Publisher'].astype(str))

# Display the number of unique publishers
unique_publishers = clean_df['Publisher'].unique()
print(f"Number of unique publishers: {len(unique_publishers)}")

# Display a sample of the encoded data
print(clean_df[['Publisher']].head())


In [None]:
# Encode Subcategories using Label Encoding
subcategory_encoder = LabelEncoder()
clean_df['Subcategory'] = subcategory_encoder.fit_transform(clean_df['Subcategory'].astype(str))

# Display the number of unique subcategories
unique_subcategories = clean_df['Subcategory'].unique()
print(f"Number of unique subcategories: {len(unique_subcategories)}")

# Display a sample of the encoded data
print(clean_df[['Subcategory']].head())
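
Refitting a single `LabelEncoder` on another column replaces its learned classes, so earlier codes can no longer be translated back to text. A minimal sketch of keeping one encoder per column, on toy data with made-up values (`toy`, `encoders`, and the `author_*`/`pub_*` strings are illustrative assumptions):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "Author": ["author_b", "author_a", "author_b"],
    "Publisher": ["pub_x", "pub_y", "pub_x"],
})

# One fitted encoder per column, stored by column name.
encoders = {}
for col in ["Author", "Publisher"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col].astype(str))
    encoders[col] = enc

# The stored encoder can map the integer codes back to the original strings.
original_authors = encoders["Author"].inverse_transform(toy["Author"])
print(toy)
print(list(original_authors))
```

Keeping the encoders around is mainly useful later, when recommendations need to be shown with readable author and publisher names.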

In [None]:
# 6. Final verification
print("\nFinal verification:")
print("Checking for English titles:")
remaining_english = clean_df[clean_df['Title'].apply(is_english)]
if len(remaining_english) > 0:
    print("WARNING: Still found English titles:")
    print(remaining_english['Title'].head())
else:
    print("No English titles remaining")

print("\nChecking for 'None' values in Description:")
none_desc = clean_df[clean_df['Description'].astype(str).str.strip().isin(['None', 'nan', ''])]
if len(none_desc) > 0:
    print("WARNING: Still found rows with 'None' in Description:")
    print(none_desc[['Title', 'Description']].head())
else:
    print("No 'None' values found in Description")

print("Checking for duplicates")
print_duplicate_stats(clean_df, 'Title')

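# Clean any remaining whitespace in text columns
# (non-string values such as NaN are left untouched instead of becoming the string 'nan')
for col in clean_df.columns:
    if clean_df[col].dtype == object:
        clean_df[col] = clean_df[col].apply(lambda v: v.strip() if isinstance(v, str) else v)

# Save the cleaned dataset with explicit encoding and quoting
clean_df.to_csv("~/Downloads/outputnotebook.csv", 
                index=False, 
                encoding='utf-8-sig',
                quoting=1)  # 1 is csv.QUOTE_ALL: every field is written in quotes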


# Verify the saved file
print("\nVerifying saved file:")
verification_df = pd.read_csv("~/Downloads/outputnotebook.csv")
print("Final row count:", len(verification_df))
print("\nSample of final data:")
print(verification_df[['Title', 'Author', 'Description']].head(2).to_string())
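
One thing to watch when reading the file back: the category codes consist only of digits, so `pd.read_csv` will normally infer a numeric type for that column even when the fields are quoted, which drops the leading zeros. A small sketch of preserving the codes by passing `dtype`, using an in-memory buffer and a made-up one-row frame (`toy`, `buffer`, and `csv_text` are illustrative names):

```python
import io
import pandas as pd

toy = pd.DataFrame({"Title": ["كتاب"], "Category": ["0000000000010"]})

# Write to an in-memory buffer with the same quoting setting as the notebook.
buffer = io.StringIO()
toy.to_csv(buffer, index=False, quoting=1)  # 1 is csv.QUOTE_ALL
csv_text = buffer.getvalue()

default_read = pd.read_csv(io.StringIO(csv_text))
string_read = pd.read_csv(io.StringIO(csv_text), dtype={"Category": str})

print("Default dtype:", default_read["Category"].dtype)
print("With dtype=str:", string_read["Category"].iloc[0])
```

The same `dtype={"Category": str}` argument could be passed when loading the cleaned dataset in a later phase.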

In [None]:
# 7. Relationships visualization

# **Book Count by Category**
# Do we have enough books in each category, or is a category overrepresented?
plt.figure(figsize=(8, 6))
sns.countplot(data=clean_df, x='Category', palette='viridis', hue='Category', legend=False)
plt.title("Book Count by Category")
plt.xlabel("Category")
plt.ylabel("Count")
plt.xticks(rotation=90) 
plt.show()

In [None]:
# **Year vs. Price**
# Are newer books more expensive?
plt.figure(figsize=(8, 5))
sns.lineplot(data=clean_df, x='Publication year', y='Price', marker='o', hue='Category')
plt.title("Publication Year vs. Price Trend")
plt.xlabel("Publication Year")
plt.ylabel("Price ($)")
plt.show()

In [None]:
# **Price vs Category**
plt.figure(figsize=(8, 6))
df_grouped = clean_df.groupby('Category')['Price'].mean().reset_index()
sns.barplot(data=df_grouped, x='Category', y='Price', hue='Category', palette='Set2', legend=False)
plt.title("Average Price by Category")
plt.xlabel("Category")
plt.ylabel("Average Price ($)")
plt.xticks(rotation=90)  
plt.show()

In [None]:
# **Price vs Pages** 
bins = [0, 50, 100, 150, 200, float('inf')]  
labels = ['0-50', '50-100', '100-150', '150-200', '200+']
clean_df['Page Range'] = pd.cut(clean_df['Pages'], bins=bins, labels=labels, right=False)

df_grouped = clean_df.groupby('Page Range', observed=False)['Price'].mean().reset_index()
plt.figure(figsize=(8, 5))
sns.barplot(data=df_grouped, x='Page Range', y='Price', palette='viridis', hue='Page Range', legend=False)
plt.title("Average Price by Page Range")
plt.xlabel("Page Range")
plt.ylabel("Average Price ($)")
plt.show()
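
Because `pd.cut` is called with `right=False`, each bin includes its left edge and excludes its right edge, so a 50-page book lands in `50-100` rather than `0-50`. A short illustration with the same bins and labels on made-up page counts (`pages` and `page_range` are illustrative names):

```python
import pandas as pd

bins = [0, 50, 100, 150, 200, float("inf")]
labels = ["0-50", "50-100", "100-150", "150-200", "200+"]

# Page counts chosen to sit on or next to the bin edges.
pages = pd.Series([0, 49, 50, 199, 200, 1200])
page_range = pd.cut(pages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"Pages": pages, "Page Range": page_range}))
```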

In [None]:
# Get all column names in the DataFrame
columns_list = clean_df.columns.tolist()

# Display the list of columns
columns_list

In [None]:
from IPython.display import display

In [None]:
clean_df.reset_index(drop=True, inplace=True)
display(clean_df.head())

In [None]:
# Save the cleaned dataset as CSV
clean_df.to_csv('Book_Cleaned_Dataset_.csv', index=False, encoding='utf-8-sig')

# Also save as Excel (requires an Excel writer engine such as openpyxl)
clean_df.to_excel('Book_Cleaned_Dataset_.xlsx', index=False)

print("The cleaned dataset has been saved successfully!")
