# Problem Definition & Objective

### Selected project track
**Project Track:** Artificial Intelligence & Machine Learning (Recommendation Systems)

### Clear problem statement
In the digital age, readers are overwhelmed by the vast number of books available. Finding a book that matches one's specific interests can be time-consuming and frustrating. Users often rely on general popularity lists which may not align with their personal taste. The problem is to build a system that can filter through thousands of books and suggest titles that are most relevant to a user based on a book they already like.

### Real-world relevance and motivation
Recommendation systems are ubiquitous in modern tech (Netflix, Amazon, Spotify). A personalized book recommender improves user engagement, increases sales for retailers, and helps readers discover hidden gems they might have missed. This project demonstrates how Content-Based Filtering can be used to solve this information overload problem.

# Data Understanding & Preparation

### Dataset source
We are using a subset of the **Book-Crossing Dataset**. 
- **Source:** Publicly available dataset, collected from the BookCrossing community in 2004.
- **Files:** `BX-Books.csv` (contains Title, Author, Publisher, etc.).

### Data loading and exploration
We will load the dataset and inspect its structure.

In [None]:
import pandas as pd
import numpy as np

# Load the books data.
# on_bad_lines='skip' drops malformed rows in the CSV (requires pandas >= 1.3).
try:
    books_df = pd.read_csv(
        'data/BX-Books.csv', sep=';', encoding='latin-1',
        on_bad_lines='skip', low_memory=False,
    )
except FileNotFoundError:
    print("Error: 'data/BX-Books.csv' not found. Please ensure the data directory exists.")
    raise  # stop here: every later cell needs books_df

print("Dataset loaded successfully.")
print(f"Initial shape: {books_df.shape}")

# Display first few rows
books_df.head(3)

### Cleaning, preprocessing, feature engineering
We need to handle missing values and create a unified text feature for our model. The core idea is to combine **Title**, **Author**, and **Publisher** into a single string (bag of words) that describes the book.

In [None]:
# 1. Select relevant columns
books_df = books_df[['Book-Title', 'Book-Author', 'Publisher']]

# 2. Handling missing values (fill with empty string)
books_df = books_df.fillna('')

# 3. Feature Engineering: Combine columns
books_df['features'] = books_df['Book-Title'] + " " + books_df['Book-Author'] + " " + books_df['Publisher']

# 4. Preprocessing: Convert to lowercase to ensure consistency
# (TfidfVectorizer also lowercases by default; doing it here keeps the printed example consistent)
books_df['features'] = books_df['features'].str.lower()

print("Feature engineering complete. Example feature:")
print(books_df['features'].iloc[0])

# Model / System Design

### AI technique used
We are using **Content-Based Filtering** with **Natural Language Processing (NLP)** techniques.

### Architecture or pipeline explanation
1.  **Input:** User provides a book title they like.
2.  **Vectorization (TF-IDF):** The system converts each book's combined feature string (title, author, publisher) into a numerical vector using Term Frequency-Inverse Document Frequency. This up-weights terms that appear in few books (such as an author's surname or a distinctive title word) and down-weights terms that appear in many.
3.  **Similarity Calculation (Cosine Similarity):** We calculate the cosine of the angle between the input book's vector and every other book's vector.
4.  **Output:** The system returns the top $N$ books with the highest similarity scores.
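
The four steps above can be sketched on a toy corpus. The feature strings below are made up for illustration and are not rows from `BX-Books.csv`; `top_n` is a hypothetical helper name, not a function from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy "title author publisher" strings (illustrative only)
toy_features = [
    "dragon quest alice smith orbit",
    "dragon war alice smith orbit",
    "garden cooking bob jones penguin",
    "dragon garden carol white orbit",
]

def top_n(query_idx, features, n=2):
    """Return indices of the n items most similar to features[query_idx]."""
    # Step 2: turn every feature string into a TF-IDF vector
    vectors = TfidfVectorizer().fit_transform(features)
    # Step 3: cosine similarity between the query row and every row
    scores = cosine_similarity(vectors[query_idx:query_idx + 1], vectors).ravel()
    # Step 4: rank by score, dropping the query item itself
    ranked = [int(i) for i in scores.argsort()[::-1] if i != query_idx]
    return ranked[:n]

print(top_n(0, toy_features))
```

The second string shares four of its five terms with the first, so it ranks above the fourth string, which shares only two.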

### Justification of design choices
-   **Why Content-Based?** We don't have user purchase history for this specific implementation, so we cannot use Collaborative Filtering. Content-based is excellent for recommending items similar to what a user specifically asks for.
-   **Why TF-IDF?** It is efficient and effective for text-based similarity comparisons without needing heavy deep learning compute power.
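
For reference, the weighting and similarity behind steps 2 and 3 of the pipeline can be written out. This assumes scikit-learn's `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`); the symbols $n$, $\mathrm{tf}$, and $\mathrm{df}$ are notation introduced here for illustration.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad
w_{t,d} = \mathrm{tf}(t, d)\,\mathrm{idf}(t)
$$

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert_2\,\lVert\mathbf{b}\rVert_2}
$$

Here $n$ is the number of books, $\mathrm{df}(t)$ is the number of books whose feature string contains term $t$, and $\mathrm{tf}(t, d)$ is the count of $t$ in book $d$. Under these defaults each book's weight vector is scaled to unit length, so the cosine of two books reduces to their dot product.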

# Core Implementation

### Model training / inference logic
Here we implement the vectorization and similarity matrix computation.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Initialize TF-IDF Vectorizer
# removing english stop words like 'the', 'a', 'an'
tfidf = TfidfVectorizer(stop_words='english')

# Construct the TF-IDF Matrix
print("Training TF-IDF Model...")
tfidf_matrix = tfidf.fit_transform(books_df['features'])
print(f"TF-IDF Matrix Shape: {tfidf_matrix.shape}")

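# Compute Cosine Similarity Matrix
# TfidfVectorizer L2-normalizes each row by default, so the plain dot product
# from linear_kernel equals cosine similarity and skips the re-normalization.
# Note: the result is a dense N x N array, so this only fits in memory for a subset.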
print("Computing Cosine Similarity Matrix...")
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
print("Model Training Complete.")
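
The claim that `linear_kernel` can stand in for `cosine_similarity` here can be checked directly. The snippet below is a small sanity check on made-up strings (not dataset rows), assuming the vectorizer's default `norm='l2'` setting.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Illustrative strings only
sample = [
    "ancient myths jane doe north press",
    "river stories john roe south press",
    "ancient rivers jane doe north press",
]
X = TfidfVectorizer().fit_transform(sample)

# Every TF-IDF row should have unit L2 norm under the default settings
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1))).ravel()

# With unit-length rows, the dot product and the cosine coincide
same = np.allclose(linear_kernel(X, X), cosine_similarity(X, X))
print(row_norms, same)
```

If the vectorizer were created with `norm=None`, the two functions would no longer agree and `cosine_similarity` would be the one to use.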

### Recommendation or prediction pipeline
We define a function that takes a book title and returns the best matches.

In [None]:
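# Create a reverse mapping of titles to indices for fast lookup.
# Keep the first row for each title. Series.drop_duplicates() would compare the
# (already unique) row numbers and leave duplicate titles in place.
indices = pd.Series(books_df.index, index=books_df['Book-Title'])
indices = indices[~indices.index.duplicated(keep='first')]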

def recommend_book(title, cosine_sim=cosine_sim):
    try:
        # 1. Get the index of the book
        if title not in indices:
            return [f"Book '{title}' not found in database."]
        
        idx = indices[title]
        
        # Handle slight edge case if duplicate titles still exist
        if isinstance(idx, pd.Series):
            idx = idx.iloc[0]

        # 2. Get the pairwise similarity scores
        sim_scores = list(enumerate(cosine_sim[idx]))

        # 3. Sort the books based on the similarity scores
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

        # 4. Keep the 5 most similar books, excluding the query book by index
        #    (it is not guaranteed to sort first when other rows also score 1.0)
        sim_scores = [s for s in sim_scores if s[0] != idx][:5]

        # 5. Get the book indices
        book_indices = [i[0] for i in sim_scores]

        # 6. Return the top 5 most similar books
        return books_df['Book-Title'].iloc[book_indices].tolist()
    except Exception as e:
        return [f"An error occurred: {str(e)}"]
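
The lookup above needs the title to match exactly, including case and punctuation. One possible way to soften that is to suggest near matches with the standard library's `difflib`. This is a sketch only: `closest_titles` is a hypothetical helper and the three-title catalog is illustrative; in the notebook the candidate list would come from `indices.index`.

```python
import difflib

def closest_titles(query, known_titles, n=3, cutoff=0.6):
    """Suggest known titles that approximately match a user-typed query."""
    # Compare case-insensitively, then map matches back to the original spelling
    lowered = {}
    for title in known_titles:
        lowered.setdefault(title.lower(), title)
    matches = difflib.get_close_matches(query.lower(), list(lowered), n=n, cutoff=cutoff)
    return [lowered[m] for m in matches]

catalog = ["Classical Mythology", "Clara Callan", "Decision in Normandy"]  # illustrative
print(closest_titles("clasical mythology", catalog))
```

`get_close_matches` compares the query against every candidate, so on a large catalog it is better suited to a fallback after an exact lookup fails than to the main lookup path.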

# Evaluation & Analysis

### Metrics used 
Since this is an unsupervised learning task (there is no "correct" label), we use **Qualitative Evaluation**. We check if the recommendations make intuitive sense. Because the features are title, author, and publisher, a sensible result for a Harry Potter book would be other books in the series or by the same author.

### Sample outputs / predictions
Let's test the system with a few popular titles.

In [None]:
test_books = [
    "Classical Mythology",
    "The Fellowship of the Ring (The Lord of the Rings, Part 1)"
]

for book in test_books:
    print(f"\nRecommendations for '{book}':")
    recommendations = recommend_book(book)
    for i, rec in enumerate(recommendations, 1):
        print(f"{i}. {rec}")

### Performance analysis and limitations
-   **Performance:** `linear_kernel` is efficient, but the full similarity matrix needs $O(N^2)$ time and memory for $N$ books. At 100,000 books a dense `float64` matrix already takes about 80 GB, so the approach only works on a subset unless similarities are computed per query.
-   **Limitations:** 
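    -   **Catalog Coverage:** The lookup needs an exact match on `Book-Title`, so a book that is missing from the dataset, or typed with a different spelling or casing, returns no recommendations.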
    -   **Lack of Personalization:** It only looks at content, not user history.
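
One way to sidestep the $O(N^2)$ memory cost noted above is to compute similarities for a single query row at a time. The sketch below assumes a TF-IDF matrix like the one built earlier; `top_k_for_row` is a hypothetical helper and the four strings are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def top_k_for_row(matrix, idx, k=5):
    """Top-k rows most similar to row idx, using one 1 x N product."""
    scores = linear_kernel(matrix[idx:idx + 1], matrix).ravel()
    scores[idx] = -np.inf                      # exclude the query row itself
    k = min(k, scores.size - 1)
    if k <= 0:
        return []
    top = np.argpartition(-scores, k - 1)[:k]  # k best, in arbitrary order
    return top[np.argsort(-scores[top])].tolist()  # order those k by score

docs = ["red fox tale", "red fox story", "blue whale tale", "green tea guide"]  # illustrative
tfidf_small = TfidfVectorizer().fit_transform(docs)
print(top_k_for_row(tfidf_small, 0, k=2))
```

Each call allocates only a length-$N$ score vector, at the cost of repeating the product for every query instead of reading a precomputed row.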

# Ethical Considerations & Responsible AI

### Bias and fairness considerations
-   **Representation Bias:** The dataset (`Book-Crossing`) relies on a community of users who may have specific demographic leanings. If the dataset lacks diverse authors, the recommender will fail to suggest them.

### Responsible use of AI tools
-   This tool is a supportive discovery aid. It should not be presented as an objective authority on "good" literature, but rather as a "search by similarity" tool.

### Dataset limitations
-   The original Book-Crossing dump was collected in 2004, so books published after that, including current bestsellers, are absent and can never be recommended.

# Conclusion & Future Scope

### Summary of results
We built a Content-Based Recommendation System that suggests books based on Title, Author, and Publisher similarity. It runs on the provided subset, and its recommendations were checked qualitatively on a small number of test titles rather than with a quantitative metric.

### Possible improvements and extensions
1.  **Hybrid Model:** Integrate user ratings to prioritize higher-rated books among the similar ones.
2.  **Better NLP:** Use Description/Summary text (if available) with embeddings like BERT for deeper semantic understanding instead of just keyword matching.
3.  **Deploy as Web App:** The logic here can be wrapped in a Streamlit app for easy user interaction (as done in `app.py`).
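
As a rough illustration of the hybrid idea in item 1, similar books could be re-ordered by their mean user rating. Everything in this sketch is hypothetical: `rerank_by_rating` is not part of the notebook, and the titles, similarity scores, and ratings are invented.

```python
import pandas as pd

def rerank_by_rating(candidates, avg_rating, min_similarity=0.0):
    """Re-order (title, similarity) candidates by mean rating, then similarity."""
    df = pd.DataFrame(candidates, columns=["title", "similarity"])
    df = df[df["similarity"] >= min_similarity].copy()
    # Titles with no rating map to NaN and are sorted to the end
    df["rating"] = df["title"].map(avg_rating)
    df = df.sort_values(["rating", "similarity"], ascending=False, na_position="last")
    return df["title"].tolist()

# Invented inputs for illustration
candidates = [("Book A", 0.91), ("Book B", 0.85), ("Book C", 0.80)]
avg_rating = {"Book A": 6.5, "Book B": 8.2}
print(rerank_by_rating(candidates, avg_rating))
```

Sorting purely by rating can push a barely related but popular book to the top, so a `min_similarity` threshold or a weighted blend of the two scores may be a safer starting point.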