# Pet Adoption Recommender System

## Introduction

**Goal**: Develop a recommender system to suggest pets to users based on their preferences and adoption history, enhancing pet adoption outcomes.

**Datasets**:

- `pet_adoption_data.csv`: Pet attributes (e.g., type, breed, size, age).
- `synthetic_user_data.csv`: User profiles (e.g., preferred pet type, living space).
- `adoption_history.csv`: Historical user-pet adoptions.

---

## 1. Data Exploration (Exploratory Data Analysis)

**Goal**: Understand the datasets to inform feature selection and modeling.

**Methodology**:

- Load datasets and inspect their structure.
- Visualize distributions of key features (e.g., pet types, user preferences).
- Check for missing values and anomalies.

**Code**:

In [69]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load datasets
pet_df_raw = pd.read_csv('content/pet_adoption_data.csv')
user_df_raw = pd.read_csv('content/synthetic_user_data.csv')
adoption_df_raw = pd.read_csv('content/adoption_history.csv')

# Work with copies for preprocessing
pet_df = pet_df_raw.copy()
user_df = user_df_raw.copy()
adoption_df = adoption_df_raw.copy()

# Pet type distribution
plt.figure(figsize=(8, 6))
sns.countplot(data=pet_df, x='PetType')
plt.title('Distribution of Pet Types')
plt.savefig('fig/pet_type_dist.png')
plt.close()

# User preferred pet type
plt.figure(figsize=(8, 6))
sns.countplot(data=user_df, x='PreferredPetType')
plt.title('User Preferred Pet Types')
plt.savefig('fig/user_pref_dist.png')
plt.close()

# Check missing values
print("Missing values in pet_df:", pet_df.isnull().sum().sum())
print("Missing values in user_df:", user_df.isnull().sum().sum())
print("Missing values in adoption_df:", adoption_df.isnull().sum().sum())

Missing values in pet_df: 0
Missing values in user_df: 308
Missing values in adoption_df: 0


---

## 2. Data Preprocessing

**Goal**: Prepare data for the recommendation engine by standardizing features.

**Methodology**:

- Encode categorical variables (e.g., `PetType`, `Size`) using `LabelEncoder`.
- Normalize numerical features (e.g., `AgeMonths`, `WeightKg`) with `StandardScaler`.
- Ensure consistency between user and pet feature spaces.

**Code**:

In [70]:
from sklearn.preprocessing import LabelEncoder, StandardScaler
import pandas as pd

def preprocess_data(pet_df, user_df, adoption_df):
    # Validate adoption_df
    duplicates = adoption_df.duplicated(subset=['UserID', 'PetID'], keep=False)
    if duplicates.sum() > 0:
        print(f"Warning: {duplicates.sum()} duplicate UserID-PetID pairs found. Removing duplicates...")
        adoption_df = adoption_df.drop_duplicates(subset=['UserID', 'PetID'], keep='first')
    
    invalid_adoptions = adoption_df[
        (~adoption_df['UserID'].isin(user_df['UserID'])) |
        (~adoption_df['PetID'].isin(pet_df['PetID']))
    ]
    if not invalid_adoptions.empty:
        raise ValueError(f"Invalid adoptions detected: {len(invalid_adoptions)}")

    # Encode categorical variables
    label_encoders = {}
    categorical_cols = ['PetType', 'Breed', 'Size', 'HealthCondition', 'EnergyLevel']
    for col in categorical_cols:
        label_encoders[col] = LabelEncoder()
        pet_df[col] = label_encoders[col].fit_transform(pet_df[col].astype(str))
    
    user_categorical_cols = ['PreferredPetType', 'LivingSpace', 'Allergies', 'ActivityLevel', 'PastPetExperience']
    for col in user_categorical_cols:
        label_encoders[col] = LabelEncoder()
        user_df[col] = label_encoders[col].fit_transform(user_df[col].astype(str))
    
    # Normalize numerical features
    scaler = StandardScaler()
    numeric_cols = ['AgeMonths', 'WeightKg', 'TimeInShelterDays', 'AdoptionFee']
    pet_df[numeric_cols] = scaler.fit_transform(pet_df[numeric_cols])
    
    # Create user-pet interaction matrix
    adoption_df['Interaction'] = 1  # Indicate adoption
    user_pet_matrix = adoption_df.pivot(index='UserID', columns='PetID', values='Interaction').fillna(0)
    
    # Print matrix stats
    density = user_pet_matrix.values.sum() / (user_pet_matrix.shape[0] * user_pet_matrix.shape[1]) * 100
    print(f"User-Pet Matrix: {user_pet_matrix.shape[0]} users, {user_pet_matrix.shape[1]} pets, {density:.2f}% density")
    
    return pet_df, user_df, user_pet_matrix, label_encoders, scaler

pet_df, user_df, user_pet_matrix, label_encoders, scaler = preprocess_data(pet_df, user_df, adoption_df)

User-Pet Matrix: 1000 users, 1000 pets, 2.39% density


**Notes**:
- Assumes `adoption_df` has an `AdoptionStatus` column (1 for adopted, 0 otherwise). If not, add:
  ```python
  adoption_df['AdoptionStatus'] = 1
  ```
- Non-numeric columns like 'Color' are excluded from normalization but retained for display.

---

## 3. Content-Based Filtering

**Goal**: Recommend pets based on user preferences and pet attributes.

**Methodology**:

- Build user profiles from preferences and adoption history.
- Compute cosine similarity between user profiles and pet features.
- Recommend top-k similar pets.

**Code**:

In [71]:
from sklearn.metrics.pairwise import cosine_similarity
from IPython.display import display, HTML

def format_output_as_table(recommendations_df, original_pet_df):
    """
    Formats recommendation results as an HTML table with original pet attributes.

    Args:
        recommendations_df (pd.DataFrame): DataFrame of recommendations,
                                           must include 'PetID' and 'SimilarityScore'.
        original_pet_df (pd.DataFrame): The original, unprocessed pet DataFrame.
    """
    if recommendations_df.empty:
        print("No recommendations to display.")
        return

    # Select PetIDs and SimilarityScore from recommendations
    recs_to_display = recommendations_df[['PetID', 'SimilarityScore']].copy()

    # Merge with original pet data to get readable attributes
    merged_recs = pd.merge(recs_to_display, original_pet_df, on='PetID', how='left')

    # Select and order columns for display
    # Add or remove columns as needed
    display_cols = [
        'PetID', 'PetType', 'Breed', 'Size', 'AgeMonths', 
        'WeightKg', 'HealthCondition', 'EnergyLevel', 'SimilarityScore'
    ]
    
    # Ensure only existing columns are selected
    display_cols = [col for col in display_cols if col in merged_recs.columns]
    
    formatted_df = merged_recs[display_cols]
    
    # Display as HTML table
    display(HTML(formatted_df.to_html(index=False)))

def content_based_recommendations(user_id, pet_df, adoption_df, top_k=5):
    # Get user adoption history
    user_adoptions = adoption_df[adoption_df['UserID'] == user_id]
    adopted_pet_ids = user_adoptions['PetID'].values
    
    # User profile: average features of adopted pets
    adopted_pets = pet_df[pet_df['PetID'].isin(adopted_pet_ids)]
    user_profile = adopted_pets[['PetType', 'Breed', 'Size', 'AgeMonths', 'WeightKg']].mean().values.reshape(1, -1)
    
    # Available pets (not yet adopted by this user)
    available_pets = pet_df[~pet_df['PetID'].isin(adopted_pet_ids)].copy()
    pet_features = available_pets[['PetType', 'Breed', 'Size', 'AgeMonths', 'WeightKg']].values
    
    # Compute similarity
    similarity_scores = cosine_similarity(user_profile, pet_features)[0]
    
    # Get top-k recommendations
    available_pets['SimilarityScore'] = similarity_scores
    recommendations = available_pets.sort_values(by='SimilarityScore', ascending=False).head(top_k)
    
    return recommendations

# Example
user_id = 42
content_recs = content_based_recommendations(user_id, pet_df, adoption_df)
# Check if content_recs is not empty before formatting
if not content_recs.empty:
    format_output_as_table(content_recs, pet_df_raw)
else:
    print(f"No content-based recommendations generated for User {user_id}.")

PetID,PetType,Breed,Size,AgeMonths,WeightKg,HealthCondition,EnergyLevel,SimilarityScore
69,Dog,Poodle,Medium,38,16.23,Special Needs,High,0.999646
819,Dog,Poodle,Medium,32,11.54,Healthy,High,0.999641
348,Dog,Poodle,Medium,28,13.72,Minor Issues,Medium,0.999638
320,Dog,Rottweiler,Small,30,14.75,Healthy,High,0.999618
872,Dog,Siberian Husky,Small,34,14.51,Healthy,Low,0.999597


---

## 4. Item-Based Collaborative Filtering

**Goal**: Recommend pets based on similarities between pets adopted by users.

**Methodology**:

- Create a user-pet interaction matrix.
- Compute cosine similarity between pets.
- Score and recommend top-k pets.

**Code**:

In [72]:
def item_based_recommendations(user_id, user_pet_matrix, pet_df, top_k=5):
    # Get pets adopted by the user
    user_row = user_pet_matrix.loc[user_id]
    adopted_pet_ids = user_row[user_row == 1].index
    
    # Compute similarity between all pets
    pet_similarity = cosine_similarity(user_pet_matrix.T)
    pet_similarity_df = pd.DataFrame(pet_similarity, index=user_pet_matrix.columns, columns=user_pet_matrix.columns)
    
    # Score pets based on similarity to adopted pets
    scores = pet_similarity_df.loc[adopted_pet_ids].mean(axis=0)
    
    # Filter out already adopted pets
    available_pet_scores = scores[~scores.index.isin(adopted_pet_ids).copy()]
    
    # Get top-k recommendations
    top_pet_ids = available_pet_scores.nlargest(top_k).index
    recommendations = pet_df[pet_df['PetID'].isin(top_pet_ids)].copy()
    recommendations['SimilarityScore'] = available_pet_scores[top_pet_ids].values
    
    return recommendations

# Example
item_recs = item_based_recommendations(user_id, user_pet_matrix, pet_df)

# Check if item_recs is not empty before formatting
if not item_recs.empty:
    format_output_as_table(item_recs, pet_df_raw)
else:
    print(f"No item-based recommendations generated for User {user_id}.")


PetID,PetType,Breed,Size,AgeMonths,WeightKg,HealthCondition,EnergyLevel,SimilarityScore
224,Dog,Labrador,Large,4,10.29,Healthy,Low,0.060792
238,Dog,German Shepherd,Medium,68,13.09,Healthy,Low,0.0605
477,Dog,Dachshund,Large,41,8.94,Healthy,Low,0.053552
673,Dog,Golden Retriever,Large,20,13.7,Healthy,Low,0.053283
703,Dog,Labrador,Medium,5,14.21,Minor Issues,Medium,0.052864


**Notes**:
- Returns pet details from `pet_df` for interpretability.
- Handles new users or users with no history by recommending popular pets.
- `NaN` values in `SimilarityScore` mean the collaborative filtering fell back to the popularity method

---

## 5. Visualization of Results

**Goal**: Illustrate recommendation quality and data insights for both methods.

**Methodology**:

- Plot similarity scores for top recommendations.
- Visualize pet attribute distributions.

**Code**:

In [73]:
# Top recommendations for a sample user
plt.figure(figsize=(10, 6))
sns.barplot(data=content_recs, x='PetID', y='SimilarityScore')
plt.title(f'Content-Based Top 5 Recommended Pets for User {user_id}')
plt.savefig('fig/content_top_recommendations.png')
plt.close()

# Pet attribute distribution
plt.figure(figsize=(10, 6))
sns.histplot(data=pet_df, x='AgeMonths', bins=20)
plt.title('Distribution of Pet Ages')
plt.savefig('fig/pet_age_dist.png')
plt.close()

---