# HDB Recommendation System - Scoring Implementation

This notebook implements the scoring logic for the HDB flat recommendation system described in Section 5.

## 📋 Overview

The recommendation system uses **content-based filtering with multi-criteria scoring** to rank HDB flats based on user preferences. It combines 5 independent scores with predefined weights to generate personalized recommendations.

## 🔄 Processing Pipeline

1. **Hard Filtering** → Reduce 254K flats to candidates matching essential criteria (budget, type, location)
2. **XGBoost Prediction** → Predict fair market value for all candidates
3. **Score Calculation** → Compute 5 independent scores for each candidate
4. **Weighted Ranking** → Combine scores using fixed weights (35% travel, 25% value, 20% budget, 15% amenity, 5% space)
5. **Top-N Selection** → Return top 10 recommendations with explanations

## 🎯 Scoring Components

| Score | Weight | Purpose | Range |
|-------|--------|---------|-------|
| **Travel Convenience** | 35% | Weighted distance to work/frequent destinations | 0-100 |
| **Value Efficiency** | 25% | Predicted price per sqm (space efficiency) | 0-100 |
| **Budget Comfort** | 20% | How comfortably price fits within budget | 0-100 |
| **Amenity Access** | 15% | Proximity to MRT, schools, malls, hawkers | 0-100 |
| **Space Adequacy** | 5% | Alignment with desired floor area | 0-100 |

**Final Score Formula:**
```
final_score = 0.35×travel + 0.25×value + 0.20×budget + 0.15×amenity + 0.05×space
```
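
As a quick sanity check of the formula, the snippet below combines five example sub-scores using the weights from the table. The sub-score values are made up for illustration, and `WEIGHTS` and `combine_scores` are hypothetical names that do not appear elsewhere in the notebook.

```python
# Weights taken from the scoring table; they should sum to 1.0
WEIGHTS = {
    "travel": 0.35,
    "value": 0.25,
    "budget": 0.20,
    "amenity": 0.15,
    "space": 0.05,
}


def combine_scores(scores, weights=WEIGHTS):
    """Return the weighted sum of 0-100 sub-scores."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(weights[name] * scores[name] for name in weights)


# Illustrative sub-scores for a single flat (made-up values)
example_scores = {"travel": 68.0, "value": 80.0, "budget": 90.0, "amenity": 60.0, "space": 100.0}
final_score = combine_scores(example_scores)
print(round(final_score, 2))  # 23.8 + 20 + 18 + 9 + 5 = 75.8
```

Keeping the weights in one dictionary makes it harder to change one weight and forget to rebalance the others, since the sum check fails loudly.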

## 📂 File Requirements

Before running, ensure you have:
- ✅ `HDB_model_ready.csv` - Main dataset (254K+ flats)
- ✅ `town_code_map.csv` - Town name mappings
- ✅ `flat_type_int_map.csv` - Flat type mappings
- ✅ `flat_model_code_map.csv` - Flat model mappings
- ✅ `xgboost_model.pkl` - Trained price prediction model from Section 4

## 🚀 Quick Start

1. Run cells 1-2 to load the data and model
2. Run cells 3-6 to define the helper, scoring, filtering, and recommendation functions
3. Test with the sample input in cell 7
4. Copy the Flask API code from cell 8 into `app.py` for production

## 🔧 Customization Points

- **Travel score**: Adjust `frequency_weights` dictionary (cell 4)
- **Score weights**: Modify weights in `generate_recommendations()` (cell 6)
- **Max distances**: Change `max_distance` thresholds in scoring functions
- **Top N**: Set `top_n` parameter when calling `generate_recommendations()`

## 📖 For New Users

Each function includes:
- Detailed docstring explaining purpose and parameters
- Inline comments explaining algorithm logic
- Example usage with sample inputs/outputs
- Edge case handling notes

---

## 1. Setup and Imports

In [None]:
# ============================================================================
# DEPENDENCIES
# ============================================================================
import pandas as pd
import numpy as np
import pickle
from math import radians, cos, sin, asin, sqrt
import json

# ============================================================================
# LOAD DATASET
# ============================================================================
# Load the main HDB dataset with 254K+ flat records
# This contains all historical resale transactions with features engineered in Section 4
df = pd.read_csv(r'HDB Clone/HDB-Resale-Price-Prediction-and-Recommendation-main/Model_Building/HDB_model_ready.csv')

# ============================================================================
# LOAD MAPPING FILES
# ============================================================================
# These CSV files map encoded values to human-readable names
# town_code_map.csv: Maps town_code (0-25) to town names (e.g., 0 = 'ANG MO KIO')
# flat_type_int_map.csv: Maps flat_type_int (1-7) to flat types (e.g., 4 = '4 ROOM')
# flat_model_code_map.csv: Maps flat_model_code (0-10) to model names (e.g., 2 = 'Improved')
town_map = pd.read_csv(r'HDB Clone/HDB-Resale-Price-Prediction-and-Recommendation-main/Model_Building/mappings_csv/town_code_map.csv')
flat_type_map = pd.read_csv(r'HDB Clone/HDB-Resale-Price-Prediction-and-Recommendation-main/Model_Building/mappings_csv/flat_type_int_map.csv')
flat_model_map = pd.read_csv(r'HDB Clone/HDB-Resale-Price-Prediction-and-Recommendation-main/Model_Building/mappings_csv/flat_model_code_map.csv')

print(f"Dataset loaded: {len(df)} flats")
print(f"Columns: {df.columns.tolist()}")

## 2. Load Trained XGBoost Model

In [None]:
# ============================================================================
# LOAD TRAINED ML MODEL
# ============================================================================
# Load the XGBoost price prediction model trained in Section 4
# This model predicts fair market value based on flat features
# 
# TODO: Update 'path/to/your/xgboost_model.pkl' with your actual model file path
# Example: 'HDB Clone/.../Model_Building/xgboost_model.pkl'
#
with open('path/to/your/xgboost_model.pkl', 'rb') as f:
    xgb_model = pickle.load(f)

print("XGBoost model loaded successfully")

# ============================================================================
# OPTIONAL: Load preprocessing objects if you used them during training
# ============================================================================
# If you applied StandardScaler, LabelEncoder, or other transformers in Section 4,
# load them here to ensure consistent preprocessing
#
# Example:
# with open('path/to/scaler.pkl', 'rb') as f:
#     scaler = pickle.load(f)
# with open('path/to/label_encoder.pkl', 'rb') as f:
#     label_encoder = pickle.load(f)

## 3. Helper Functions

In [None]:
# ============================================================================
# HELPER FUNCTION: Calculate Geographic Distance
# ============================================================================
def haversine_distance(lat1, lon1, lat2, lon2):
    """
    Calculate the great-circle distance between two points on Earth using the Haversine formula.
    
    This function is used to calculate distances from flats to user-specified destinations
    (work, school, parents' homes, etc.) for the travel convenience score.
    
    Parameters:
    -----------
    lat1, lon1 : float
        Latitude and longitude of the first point (e.g., flat location)
    lat2, lon2 : float
        Latitude and longitude of the second point (e.g., workplace)
    
    Returns:
    --------
    float
        Distance in kilometers
    
    Example:
    --------
    >>> distance = haversine_distance(1.3521, 103.8198, 1.2897, 103.8501)  # central Singapore to Marina Bay
    >>> print(f"{distance:.2f} km")
    7.71 km
    """
    # Convert decimal degrees to radians (required for trigonometric functions)
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    
    # Haversine formula components
    dlat = lat2 - lat1  # Difference in latitude
    dlon = lon2 - lon1  # Difference in longitude
    
    # Calculate arc length
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    
    # Multiply by Earth's radius to get distance in kilometers
    km = 6371 * c  # Earth's radius = 6371 km
    return km


# ============================================================================
# HELPER FUNCTION: Get Predicted Price from ML Model
# ============================================================================
def get_predicted_price(flat_row, model):
    """
    Predict fair market value for a single flat using the trained XGBoost model.
    
    This function extracts the required features from a flat record and passes them
    to the ML model to get a predicted price. The prediction is used in the 
    value efficiency score calculation.
    
    Parameters:
    -----------
    flat_row : pandas Series
        A single row from the HDB dataset containing all flat features
    model : xgboost.XGBRegressor
        Trained XGBoost model loaded from pickle file
    
    Returns:
    --------
    float
        Predicted price in SGD
    
    Notes:
    ------
    - Feature order MUST match the order used during model training in Section 4
    - If you used different features or feature engineering, update this list
    - The 16 features used here are the standard set from HDB_model_ready.csv
    
    Example:
    --------
    >>> predicted_price = get_predicted_price(df.iloc[0], xgb_model)
    >>> print(f"Predicted: ${predicted_price:,.0f}")
    Predicted: $485,000
    """
    # Extract features in the EXACT same order as used during training
    # If you modified features in Section 4, update this list accordingly
    features = flat_row[[
        'floor_area_sqm',                           # Size of the flat
        'lease_commence_year',                      # Year lease started (affects remaining lease)
        'distance_to_nearest_primary_school_km',    # School proximity
        'distance_to_nearest_high_value_school_km', # Top-tier school proximity
        'distance_to_nearest_mrt_km',               # Public transport access
        'distance_to_nearest_hawker_km',            # Food center proximity
        'distance_to_nearest_mall_km',              # Shopping convenience
        'distance_to_cbd_km',                       # Distance to Central Business District
        'year',                                     # Transaction year
        'month_num',                                # Transaction month (1-12)
        'quarter',                                  # Transaction quarter (1-4)
        'region_code',                              # Region encoding
        'flat_type_int',                            # Flat type (1-7: 1 ROOM to MULTI-GEN)
        'flat_model_code',                          # Flat model (0-10: Apartment, DBSS, etc.)
        'town_code',                                # Town encoding (0-25)
        'floor_level'                               # Floor height category
    ]].to_numpy(dtype=float).reshape(1, -1)  # 2D float array (1 sample × 16 features); the cast avoids object dtype
    
    # Get prediction from model
    predicted_price = model.predict(features)[0]
    return predicted_price
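
Calling `haversine_distance` once per flat is fine for a handful of candidates, but scoring thousands of rows is usually faster with array operations. The sketch below is one possible NumPy variant of the same formula; `haversine_distance_vec` is a hypothetical name and is not used elsewhere in the notebook.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # same mean Earth radius as haversine_distance


def haversine_distance_vec(lats, lons, dest_lat, dest_lon):
    """Great-circle distance in km from many points to a single destination."""
    lats = np.radians(np.asarray(lats, dtype=float))
    lons = np.radians(np.asarray(lons, dtype=float))
    dest_lat, dest_lon = np.radians(dest_lat), np.radians(dest_lon)

    dlat = dest_lat - lats
    dlon = dest_lon - lons
    a = np.sin(dlat / 2) ** 2 + np.cos(lats) * np.cos(dest_lat) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Two sample points measured against the second point itself
distances = haversine_distance_vec(
    [1.3521, 1.2897], [103.8198, 103.8501], dest_lat=1.2897, dest_lon=103.8501
)
print(np.round(distances, 2))
```

With a DataFrame that has `latitude` and `longitude` columns, the same call could take `df['latitude']` and `df['longitude']` directly and return one distance per row.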

## 4. Scoring Functions

In [None]:
# ============================================================================
# SCORE 1: Travel Convenience Score (35% weight)
# ============================================================================
def calculate_travel_score(flat_row, destinations):
    """
    Calculate travel convenience score based on weighted average distance to user destinations.
    
    This score prioritizes daily commutes (work) over occasional visits (parents, gym).
    Daily destinations get 5x weight compared to weekly visits, ensuring work proximity
    is the primary driver of location suitability.
    
    Parameters:
    -----------
    flat_row : pandas Series
        Row containing flat data including coordinates (latitude, longitude)
    destinations : list of dict
        User-specified destinations with visit frequencies
        Format: [
            {'name': 'Work (CBD)', 'lat': 1.2833, 'lon': 103.8511, 'frequency': 'daily'},
            {'name': 'Parents Home', 'lat': 1.3521, 'lon': 103.9448, 'frequency': 'weekly'},
            ...
        ]
    
    Returns:
    --------
    float
        Travel score from 0-100 (higher = better location for user's travel patterns)
        - 100: Excellent - very close to all frequent destinations
        - 75-99: Good - reasonable commute times
        - 50-74: Average - moderate travel distances
        - 25-49: Poor - long commutes required
        - 0-24: Very poor - excessive travel distances
    
    Algorithm:
    ----------
    1. Calculate distance from flat to each destination
    2. Apply frequency weights (daily=5.0, weekly=1.0, etc.)
    3. Compute weighted average distance
    4. Convert to 0-100 score (lower distance = higher score)
    
    Example:
    --------
    >>> destinations = [
    ...     {'name': 'Office', 'lat': 1.28, 'lon': 103.85, 'frequency': 'daily'},  # 8km away
    ...     {'name': 'Gym', 'lat': 1.35, 'lon': 103.87, 'frequency': '2-3_per_week'}  # 3km away
    ... ]
    >>> score = calculate_travel_score(flat_row, destinations)
    >>> # Weighted avg = (8*5 + 3*2.5) / (5+2.5) = 6.33km → score ≈ 68/100
    """
    # ========================================================================
    # Define frequency-to-weight mapping
    # ========================================================================
    # These weights ensure daily commutes dominate the score
    # A daily commute (weight=5) has same impact as 5 weekly visits (weight=1)
    frequency_weights = {
        'daily': 5.0,           # 5 days/week (e.g., work commute)
        '2-3_per_week': 2.5,    # 2.5 days/week (e.g., gym, part-time work)
        'weekly': 1.0,          # 1 day/week (e.g., visiting parents)
        '1-2_per_month': 0.25,  # ~0.25 days/week (e.g., medical appointments)
        'rarely': 0.05          # ~once every 3 months (e.g., rare meetups)
    }
    
    # ========================================================================
    # Handle edge case: No destinations specified
    # ========================================================================
    if not destinations:
        return 50  # Return neutral score if user didn't specify any destinations
    
    # ========================================================================
    # Calculate weighted average distance
    # ========================================================================
    total_weighted_distance = 0  # Sum of (distance × weight)
    total_weight = 0             # Sum of weights
    
    # TODO: Add latitude/longitude columns to your dataset if not present
    # You may need to geocode addresses or use town centroids as approximations
    # Example: Add 'latitude' and 'longitude' columns to HDB_model_ready.csv
    flat_lat = flat_row.get('latitude', None)  
    flat_lon = flat_row.get('longitude', None)
    
    # If coordinates are missing (column absent or value is NaN), return a neutral score
    if pd.isna(flat_lat) or pd.isna(flat_lon):
        print("Warning: Flat coordinates not found. Add 'latitude' and 'longitude' columns to dataset.")
        return 50
    
    # Loop through each destination and calculate weighted distance
    for dest in destinations:
        # Calculate straight-line distance (Haversine formula)
        distance = haversine_distance(flat_lat, flat_lon, dest['lat'], dest['lon'])
        
        # Get weight for this destination's visit frequency
        weight = frequency_weights.get(dest['frequency'], 1.0)  # Default to 1.0 if unknown frequency
        
        # Accumulate weighted distance
        total_weighted_distance += distance * weight
        total_weight += weight
    
    # Calculate weighted average distance
    weighted_avg_distance = total_weighted_distance / total_weight
    
    # ========================================================================
    # Convert distance to 0-100 score
    # ========================================================================
    # Assume maximum acceptable average distance is 20km
    # 0 km → score of 100 (perfect location)
    # 20 km or more → score of 0 (too far)
    max_distance = 20  # kilometers
    travel_score = max(0, 100 - (weighted_avg_distance / max_distance * 100))
    
    return round(travel_score, 1)
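
`generate_recommendations()` in cell 6 also calls `calculate_value_score`, `calculate_budget_score`, `calculate_amenity_score`, and `calculate_space_score`, which this section does not define. The sketch below shows one way they could look, matching the call signatures used in cell 6 and the descriptions in the scoring table. The formulas, the `_linear_score` helper, and the `AMENITY_MAX_KM` thresholds are all assumptions, not the notebook's actual implementation.

```python
def _linear_score(value, worst):
    """Map value in [0, worst] linearly onto [100, 0], clamping outside that range."""
    return max(0.0, min(100.0, 100.0 * (1 - value / worst)))


def calculate_value_score(flat_row, predicted_price, candidates_df):
    """Lower predicted price per sqm, relative to the other candidates, scores higher."""
    ppsm = predicted_price / flat_row['floor_area_sqm']
    lo = min(candidates_df['predicted_price_per_sqm'])
    hi = max(candidates_df['predicted_price_per_sqm'])
    if hi == lo:
        return 50.0  # every candidate costs the same per sqm: neutral score
    return round(max(0.0, min(100.0, 100.0 * (hi - ppsm) / (hi - lo))), 1)


def calculate_budget_score(predicted_price, min_budget, max_budget):
    """100 at or below min_budget, falling linearly to 0 at max_budget."""
    if max_budget <= min_budget:
        return 100.0 if predicted_price <= max_budget else 0.0
    return round(_linear_score(predicted_price - min_budget, max_budget - min_budget), 1)


# Assumed distance (km) beyond which an amenity contributes nothing to the score
AMENITY_MAX_KM = {
    'distance_to_nearest_mrt_km': 2.0,
    'distance_to_nearest_primary_school_km': 2.0,
    'distance_to_nearest_mall_km': 3.0,
    'distance_to_nearest_hawker_km': 2.0,
}


def calculate_amenity_score(flat_row):
    """Unweighted mean of the per-amenity proximity scores."""
    scores = [_linear_score(flat_row[col], max_km) for col, max_km in AMENITY_MAX_KM.items()]
    return round(sum(scores) / len(scores), 1)


def calculate_space_score(floor_area, min_area, max_area):
    """100 inside the desired range, dropping linearly with the shortfall or excess."""
    if min_area <= floor_area <= max_area:
        return 100.0
    gap = min_area - floor_area if floor_area < min_area else floor_area - max_area
    width = max(max_area - min_area, 1.0)  # guard against a zero-width range
    return round(_linear_score(gap, width), 1)


# Plain dicts stand in for a DataFrame row and the candidate table
flat = {
    'floor_area_sqm': 100,
    'distance_to_nearest_mrt_km': 0.5,
    'distance_to_nearest_primary_school_km': 1.0,
    'distance_to_nearest_mall_km': 1.5,
    'distance_to_nearest_hawker_km': 0.5,
}
candidates = {'predicted_price_per_sqm': [4000, 5000, 6000]}

print(calculate_value_score(flat, 500_000, candidates))   # 50.0
print(calculate_budget_score(500_000, 400_000, 600_000))  # 50.0
print(calculate_amenity_score(flat))                      # 62.5
print(calculate_space_score(75, 80, 110))                 # 83.3
```

The functions only index by column name, so they accept a pandas row as well as the plain dicts used here. For large candidate sets, the per-sqm minimum and maximum would be better computed once outside the scoring loop.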

## 5. Hard Filtering

In [None]:
def apply_hard_filters(df, user_input):
    """
    Apply hard constraints to filter candidate flats.
    
    Parameters:
    -----------
    df : pandas DataFrame
        Full HDB dataset
    user_input : dict
        User preferences from frontend
    
    Returns:
    --------
    pandas DataFrame : Filtered candidate flats
    """
    filtered_df = df.copy()
    
    # Budget filter (using predicted price)
    # Note: You'll need to predict prices for all flats or filter on historical resale_price
    if 'min_budget' in user_input and 'max_budget' in user_input:
        # Using resale_price as proxy - in production, pre-compute predicted prices
        filtered_df = filtered_df[
            (filtered_df['resale_price'] >= user_input['min_budget']) &
            (filtered_df['resale_price'] <= user_input['max_budget'])
        ]
    
    # Flat type filter
    if 'flat_types' in user_input and len(user_input['flat_types']) > 0:
        flat_type_codes = flat_type_map[
            flat_type_map['flat_type'].isin(user_input['flat_types'])
        ]['flat_type_int'].tolist()
        filtered_df = filtered_df[filtered_df['flat_type_int'].isin(flat_type_codes)]
    
    # Flat model filter
    if 'flat_models' in user_input and len(user_input['flat_models']) > 0:
        flat_model_codes = flat_model_map[
            flat_model_map['flat_model_grouped'].isin(user_input['flat_models'])
        ]['flat_model_code'].tolist()
        filtered_df = filtered_df[filtered_df['flat_model_code'].isin(flat_model_codes)]
    
    # Floor area filter
    if 'min_area' in user_input and 'max_area' in user_input:
        filtered_df = filtered_df[
            (filtered_df['floor_area_sqm'] >= user_input['min_area']) &
            (filtered_df['floor_area_sqm'] <= user_input['max_area'])
        ]
    
    # Town filter
    if 'towns' in user_input and len(user_input['towns']) > 0:
        town_codes = town_map[
            town_map['town'].isin(user_input['towns'])
        ]['town_code'].tolist()
        filtered_df = filtered_df[filtered_df['town_code'].isin(town_codes)]
    
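    # Remaining-lease filter: 'min_lease_year' is the minimum number of lease years left
    if 'min_lease_year' in user_input:
        current_year = 2025  # Hard-coded reference year; update when re-running later
        # Remaining lease = 99-year lease minus the years elapsed since it commenced
        remaining_lease = 99 - (current_year - filtered_df['lease_commence_year'])
        filtered_df = filtered_df[remaining_lease >= user_input['min_lease_year']]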
    
    # Storey filter
    if 'storey_ranges' in user_input and len(user_input['storey_ranges']) > 0:
        # Convert floor_level to storey ranges if needed
        filtered_df = filtered_df[filtered_df['floor_level'].isin(user_input['storey_ranges'])]
    
    # Strict amenity filters (if enabled)
    if user_input.get('strict_mrt', False) and 'max_mrt_distance' in user_input:
        filtered_df = filtered_df[
            filtered_df['distance_to_nearest_mrt_km'] <= user_input['max_mrt_distance']
        ]
    
    if user_input.get('strict_mall', False) and 'max_mall_distance' in user_input:
        filtered_df = filtered_df[
            filtered_df['distance_to_nearest_mall_km'] <= user_input['max_mall_distance']
        ]
    
    if user_input.get('strict_school', False) and 'max_school_distance' in user_input:
        filtered_df = filtered_df[
            filtered_df['distance_to_nearest_primary_school_km'] <= user_input['max_school_distance']
        ]
    
    print(f"Candidates after filtering: {len(filtered_df)}")
    return filtered_df
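
The remaining-lease filter is the only hard filter that involves arithmetic, so a worked example may help. The sketch below reproduces the calculation for single values; `remaining_lease_years` is a hypothetical helper, and defaulting to the current calendar year is an assumption (the filter itself hard-codes 2025).

```python
from datetime import date

HDB_LEASE_YEARS = 99  # lease length used by the filter


def remaining_lease_years(lease_commence_year, as_of_year=None):
    """Years of lease left as of `as_of_year` (defaults to the current year)."""
    if as_of_year is None:
        as_of_year = date.today().year
    return HDB_LEASE_YEARS - (as_of_year - lease_commence_year)


# With min_lease_year = 70 and a reference year of 2025:
print(remaining_lease_years(1995, as_of_year=2025))  # 69 -> filtered out
print(remaining_lease_years(1996, as_of_year=2025))  # 70 -> kept
```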

## 6. Main Recommendation Function

In [None]:
def generate_recommendations(user_input, top_n=10):
    """
    Main function to generate flat recommendations.
    
    Parameters:
    -----------
    user_input : dict
        User preferences from frontend JSON
    top_n : int
        Number of recommendations to return
    
    Returns:
    --------
    dict : {'total_candidates': int, 'recommendations': list of top-N flats with scores},
           or {'error': str} if no flat passes the hard filters
    """
    # Step 1: Apply hard filters (copy so new columns can be added without SettingWithCopyWarning)
    candidates_df = apply_hard_filters(df, user_input).copy()
    
    if len(candidates_df) == 0:
        return {'error': 'No flats match your criteria. Please relax some filters.'}
    
    # Step 2: Get predicted prices for all candidates
    print("Calculating predicted prices...")
    candidates_df['predicted_price'] = candidates_df.apply(
        lambda row: get_predicted_price(row, xgb_model), axis=1
    )
    candidates_df['predicted_price_per_sqm'] = (
        candidates_df['predicted_price'] / candidates_df['floor_area_sqm']
    )
    
    # Step 3: Calculate all scores
    print("Calculating scores...")
    results = []
    
    for idx, row in candidates_df.iterrows():
        # Calculate individual scores
        travel_score = calculate_travel_score(row, user_input.get('destinations', []))
        value_score = calculate_value_score(row, row['predicted_price'], candidates_df)
        budget_score = calculate_budget_score(
            row['predicted_price'], 
            user_input['min_budget'], 
            user_input['max_budget']
        )
        amenity_score = calculate_amenity_score(row)
        space_score = calculate_space_score(
            row['floor_area_sqm'],
            user_input.get('min_area', 0),
            user_input.get('max_area', 200)
        )
        
        # Calculate weighted final score
        final_score = (
            0.35 * travel_score +
            0.25 * value_score +
            0.20 * budget_score +
            0.15 * amenity_score +
            0.05 * space_score
        )
        
        # Get town name (fall back to 'Unknown' if the code is not in the mapping)
        town_match = town_map.loc[town_map['town_code'] == row['town_code'], 'town']
        town_name = town_match.iloc[0] if not town_match.empty else 'Unknown'
        
        # Get flat type name (same fallback)
        flat_type_match = flat_type_map.loc[
            flat_type_map['flat_type_int'] == row['flat_type_int'], 'flat_type'
        ]
        flat_type_name = flat_type_match.iloc[0] if not flat_type_match.empty else 'Unknown'
        
        # Store result
        results.append({
            'flat_id': idx,
            'town': town_name,
            'flat_type': flat_type_name,
            'floor_area_sqm': float(row['floor_area_sqm']),
            'predicted_price': float(row['predicted_price']),
            'lease_commence_year': int(row['lease_commence_year']),
            'floor_level': int(row['floor_level']),
            'scores': {
                'travel_score': float(travel_score),
                'value_score': float(value_score),
                'budget_score': float(budget_score),
                'amenity_score': float(amenity_score),
                'space_score': float(space_score),
                'final_score': float(round(final_score, 2))
            },
            'distances': {
                'mrt_km': float(row['distance_to_nearest_mrt_km']),
                'school_km': float(row['distance_to_nearest_primary_school_km']),
                'mall_km': float(row['distance_to_nearest_mall_km']),
                'hawker_km': float(row['distance_to_nearest_hawker_km']),
                'cbd_km': float(row['distance_to_cbd_km'])
            }
        })
    
    # Step 4: Sort by final score and return top N
    results_sorted = sorted(results, key=lambda x: x['scores']['final_score'], reverse=True)
    
    return {
        'total_candidates': len(candidates_df),
        'recommendations': results_sorted[:top_n]
    }
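
Step 2 above calls the model once per row through `DataFrame.apply`, which can be slow for large candidate sets. A possible alternative is to build the feature matrix once and call `predict` a single time, as sketched below. `add_predicted_prices` and `StandInModel` are hypothetical; the stand-in only mimics a scikit-learn-style `predict()` so the example runs without the trained model, and `FEATURE_COLUMNS` is shortened to two columns here where the real code would list all 16 training features in order.

```python
import pandas as pd

# Shortened for the demo; in practice use the full, ordered training feature list
FEATURE_COLUMNS = ['floor_area_sqm', 'lease_commence_year']


class StandInModel:
    """Placeholder for the trained model; pretends price = 5000 SGD per sqm."""

    def predict(self, X):
        return X[:, 0] * 5000.0


def add_predicted_prices(candidates_df, model, feature_columns):
    """Return a copy of candidates_df with predicted price columns added in one batch."""
    out = candidates_df.copy()
    X = out[feature_columns].to_numpy(dtype=float)  # shape: (n_candidates, n_features)
    out['predicted_price'] = model.predict(X)
    out['predicted_price_per_sqm'] = out['predicted_price'] / out['floor_area_sqm']
    return out


demo = pd.DataFrame({'floor_area_sqm': [90.0, 110.0], 'lease_commence_year': [1995, 2010]})
scored = add_predicted_prices(demo, StandInModel(), FEATURE_COLUMNS)
print(scored[['predicted_price', 'predicted_price_per_sqm']])
```

The batch version produces the same two columns that the scoring loop reads, so it could replace the `apply` call without touching the rest of the function.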

## 7. Test with Sample Input

In [None]:
# Sample user input (matches what your React frontend will send)
sample_user_input = {
    'min_budget': 400000,
    'max_budget': 600000,
    'flat_types': ['4 ROOM', '5 ROOM'],
    'flat_models': ['Improved', 'Model A', 'New Generation'],
    'min_area': 80,
    'max_area': 110,
    'towns': ['BISHAN', 'ANG MO KIO', 'TOA PAYOH'],
    'min_lease_year': 70,
    'destinations': [
        {'name': 'CBD', 'lat': 1.2833, 'lon': 103.8511, 'frequency': 'daily'},
        {'name': 'Jurong East', 'lat': 1.3330, 'lon': 103.7436, 'frequency': 'daily'},
        {'name': 'Parents Home', 'lat': 1.3521, 'lon': 103.9448, 'frequency': 'weekly'}
    ],
    'strict_mrt': False,
    'max_mrt_distance': 1.0,
    'strict_mall': False,
    'max_mall_distance': 1.5
}

# Generate recommendations
recommendations = generate_recommendations(sample_user_input, top_n=10)

# Display results
print(json.dumps(recommendations, indent=2))

## 8. Flask API Implementation

In [None]:
# Save this as a separate file: app.py

"""
from flask import Flask, request, jsonify
from flask_cors import CORS
import pandas as pd
import pickle

app = Flask(__name__)
CORS(app)  # Enable CORS for React frontend

# Load data and model at startup
df = pd.read_csv('path/to/HDB_model_ready.csv')
town_map = pd.read_csv('path/to/town_code_map.csv')
flat_type_map = pd.read_csv('path/to/flat_type_int_map.csv')
flat_model_map = pd.read_csv('path/to/flat_model_code_map.csv')

with open('path/to/xgboost_model.pkl', 'rb') as f:
    xgb_model = pickle.load(f)

# Include all the scoring functions here (copy from above cells)

@app.route('/api/recommend', methods=['POST'])
def recommend():
    try:
        user_input = request.get_json(silent=True)
        
        # Validate input (get_json returns None for a missing or malformed JSON body)
        if not user_input or 'min_budget' not in user_input or 'max_budget' not in user_input:
            return jsonify({'error': 'Budget range is required'}), 400
        
        # Generate recommendations
        recommendations = generate_recommendations(user_input, top_n=10)
        
        return jsonify(recommendations)
    
    except Exception as e:
        return jsonify({'error': str(e)}), 500


@app.route('/api/health', methods=['GET'])
def health():
    return jsonify({'status': 'healthy', 'total_flats': len(df)})


if __name__ == '__main__':
    # debug=True is for local development only; turn it off before deploying
    app.run(debug=True, host='0.0.0.0', port=5000)
"""

print("Flask API code above - save as app.py")
print("\nTo run: python app.py")
print("API will be available at: http://localhost:5000/api/recommend")
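
The endpoint above only checks that the budget keys are present. A slightly fuller check could run before `generate_recommendations()` is called, along the lines of the sketch below. `validate_user_input` and `ALLOWED_FREQUENCIES` are hypothetical names, and rejecting unknown frequencies is a design choice made for this example (`calculate_travel_score` instead falls back to a weight of 1.0).

```python
# Frequency labels understood by calculate_travel_score
ALLOWED_FREQUENCIES = {'daily', '2-3_per_week', 'weekly', '1-2_per_month', 'rarely'}


def validate_user_input(user_input):
    """Return a list of problems found; an empty list means the input looks usable."""
    if not isinstance(user_input, dict):
        return ['Request body must be a JSON object']

    errors = []
    min_b, max_b = user_input.get('min_budget'), user_input.get('max_budget')
    if not isinstance(min_b, (int, float)) or not isinstance(max_b, (int, float)):
        errors.append('min_budget and max_budget must be numbers')
    elif min_b > max_b:
        errors.append('min_budget must not exceed max_budget')

    for i, dest in enumerate(user_input.get('destinations', [])):
        if not isinstance(dest, dict):
            errors.append(f'destinations[{i}] must be an object')
            continue
        if 'lat' not in dest or 'lon' not in dest:
            errors.append(f'destinations[{i}] needs lat and lon')
        if dest.get('frequency') not in ALLOWED_FREQUENCIES:
            errors.append(f'destinations[{i}] has an unknown frequency')
    return errors


print(validate_user_input({'min_budget': 400000, 'max_budget': 600000}))  # []
print(validate_user_input({'min_budget': 700000, 'max_budget': 600000}))
```

In the Flask route, a non-empty result could be returned as a 400 response before any filtering or prediction work starts.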

## 9. Frontend Integration Example

In [None]:
# Example React frontend code to call your API

"""
// In your React component:

const getRecommendations = async (userPreferences) => {
  try {
    const response = await fetch('http://localhost:5000/api/recommend', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(userPreferences)
    });
    
    const data = await response.json();
    
    if (data.error) {
      console.error('Error:', data.error);
      return;
    }
    
    // Display recommendations
    console.log('Total candidates:', data.total_candidates);
    console.log('Top recommendations:', data.recommendations);
    
    // Update your UI state
    setRecommendations(data.recommendations);
    
  } catch (error) {
    console.error('API call failed:', error);
  }
};

// Call when user submits the form
const handleSubmit = (formData) => {
  const userPreferences = {
    min_budget: formData.minBudget,
    max_budget: formData.maxBudget,
    flat_types: formData.selectedFlatTypes,
    flat_models: formData.selectedModels,
    towns: formData.selectedTowns,
    destinations: formData.destinations,
    // ... other fields
  };
  
  getRecommendations(userPreferences);
};
"""

print("React integration example above")