# XGBoost Learning to Rank Model - Feature Importance Analysis

This notebook analyzes the feature importance of the XGBoost Learning to Rank (LTR) model used for property search ranking. The model was trained on search interaction data and uses 45 features.

We'll visualize which features have the most influence on the ranking decisions made by the model.

## 1. Import Required Libraries

First, let's import the necessary libraries for our analysis:

In [None]:
import json
import xgboost as xgb
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

# Set Matplotlib style for better visuals
plt.style.use('ggplot')
sns.set(style="whitegrid")

## 2. Load the XGBoost Model

Next, we'll load the XGBoost model from the JSON file and the feature names from the metadata file:

In [None]:
# Define paths to model and metadata files
model_path = "models/xgboost_ltr_model.json"
model_metadata_path = "models/ltr_model_metadata.json"

# Load the XGBoost model
print(f"Loading model from {model_path}...")
ranker = xgb.XGBRanker()
ranker.load_model(model_path)
print("Model loaded successfully!")

# Load feature names from metadata
with open(model_metadata_path, 'r') as f:
    metadata = json.load(f)
    feature_names = metadata.get('features', [])
    print(f"Loaded {len(feature_names)} feature names from metadata")
    
# Display metadata information
print("\nModel Metadata:")
for key, value in metadata.items():
    if key != 'features':  # Skip printing the full feature list for now
        print(f"- {key}: {value}")
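
The cell above falls back to an empty list when the metadata has no `features` key, which would silently break the name mapping later on. A minimal sketch of a stricter loader follows; `load_feature_names` is a hypothetical helper, and the demo file and feature names are invented for illustration:

```python
import json
import os
import tempfile


def load_feature_names(metadata_path):
    """Read the 'features' list from a metadata JSON file, failing loudly if absent."""
    with open(metadata_path, "r") as f:
        metadata = json.load(f)
    names = metadata.get("features")
    if not isinstance(names, list) or not names:
        raise ValueError(f"No 'features' list found in {metadata_path}")
    if len(set(names)) != len(names):
        raise ValueError("Duplicate feature names in metadata")
    return names


# Demonstration with a temporary file; the names are invented for illustration
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_metadata.json")
    with open(demo_path, "w") as f:
        json.dump({"features": ["bm25_title", "geo_distance"]}, f)
    demo_names = load_feature_names(demo_path)

print(demo_names)
```

With the real model, it may also be worth comparing `len(feature_names)` against `ranker.get_booster().num_features()` to confirm the metadata matches the model.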

## 3. Plot Feature Importance

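Now, let's use XGBoost's built-in `plot_importance` function to visualize the top 10 features. We use `importance_type='weight'`, which counts how many times each feature is used to split the data across all trees: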

In [None]:
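# Create the figure and axes explicitly; without `ax`, plot_importance
# opens its own figure and the figsize set here would be ignored
fig, ax = plt.subplots(figsize=(12, 8))

# Plot feature importance using XGBoost's built-in function
xgb.plot_importance(ranker, ax=ax, max_num_features=10, importance_type='weight')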

# Add title and adjust layout
plt.title('Top 10 Most Important Features', fontsize=16)
plt.tight_layout()

# Show the plot
plt.show()
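
The `weight` importance counts splits. `get_score` also accepts other types such as `gain` and `cover`, and the resulting rankings can differ. A minimal sketch of comparing two rankings, using plain dictionaries shaped like `get_score` output; the scores are made-up values, and `rank_features` and `rank_shift` are hypothetical helpers, not part of XGBoost:

```python
def rank_features(scores):
    """Return feature keys ordered from highest to lowest score."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ordered]


def rank_shift(scores_a, scores_b):
    """Map each shared feature to (position under a) - (position under b).

    Positions are 0-based, so a positive value means the feature ranks
    higher (closer to the top) under scores_b than under scores_a.
    """
    pos_a = {name: i for i, name in enumerate(rank_features(scores_a))}
    pos_b = {name: i for i, name in enumerate(rank_features(scores_b))}
    return {name: pos_a[name] - pos_b[name] for name in pos_a if name in pos_b}


# Hypothetical scores shaped like the dict returned by
# booster.get_score(importance_type=...); values are invented for illustration
weight_scores = {"f0": 120.0, "f1": 45.0, "f2": 80.0}
gain_scores = {"f0": 2.5, "f1": 9.1, "f2": 4.0}

by_weight = rank_features(weight_scores)
by_gain = rank_features(gain_scores)
shifts = rank_shift(weight_scores, gain_scores)

print(by_weight)
print(by_gain)
print(shifts)
```

With the loaded model, the two dictionaries would come from `ranker.get_booster().get_score(importance_type='weight')` and the same call with `importance_type='gain'`.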

## 4. Analyze Feature Importance

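Let's extract the feature importance values from the model and create a DataFrame to analyze them in more detail. We'll map the feature indices to their actual names from the metadata. Features that are never used in a split do not appear in the `get_score` output, so the DataFrame may have fewer rows than the model has features: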

In [None]:
# Get feature importance from the model
importance = ranker.get_booster().get_score(importance_type='weight')

# Create a mapping from feature indices to feature names
feature_map = {f"f{i}": name for i, name in enumerate(feature_names)}

# Create a DataFrame with importance values and feature names
importance_df = pd.DataFrame({
    'Feature': list(importance.keys()),
    'Importance': list(importance.values())
})

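# Map feature indices to actual feature names; fall back to the raw key
# when the model already stores real feature names instead of f0, f1, ...
importance_df['Feature Name'] = importance_df['Feature'].map(feature_map).fillna(importance_df['Feature'])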

# Sort by importance value in descending order
importance_df = importance_df.sort_values('Importance', ascending=False).reset_index(drop=True)

# Display the DataFrame with all features
pd.set_option('display.max_rows', None)  # Show all rows
print(f"Total number of features with importance scores: {len(importance_df)}")
importance_df
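
Because `get_score` leaves out features that no tree ever splits on, it can help to list which features are missing from the result. A minimal sketch, assuming the importance dictionary is keyed either by XGBoost's default `f<index>` names or by the real feature names; `unused_features` is a hypothetical helper and the example data is invented:

```python
def unused_features(feature_names, importance):
    """List features that have no importance score.

    Accepts keys in either form: XGBoost's default 'f<index>' names or
    the real feature names.
    """
    unused = []
    for i, name in enumerate(feature_names):
        if f"f{i}" not in importance and name not in importance:
            unused.append(name)
    return unused


# Hypothetical feature names and scores, invented for illustration
example_names = ["bm25_title", "price_diff", "geo_distance", "click_rate"]
example_importance = {"f0": 30.0, "f2": 12.0}

missing = unused_features(example_names, example_importance)
print(missing)
```

In this notebook the call would be `unused_features(feature_names, importance)`.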

In [None]:
# Calculate the percentage of total importance for each feature
total_importance = importance_df['Importance'].sum()
importance_df['Percentage'] = (importance_df['Importance'] / total_importance * 100).round(2)

# Create a dataframe with only the top 10 features
top_10_features = importance_df.head(10).copy()

# Calculate the cumulative percentage of importance for the top 10 features
top_10_features['Cumulative Percentage'] = top_10_features['Percentage'].cumsum()

# Display the statistics for the top 10 features
print(f"Top 10 features account for {top_10_features['Percentage'].sum():.2f}% of total importance")
top_10_features[['Feature Name', 'Importance', 'Percentage', 'Cumulative Percentage']]
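
The cumulative percentage can also be read the other way round: how many features are needed to reach a given share of total importance. A minimal sketch in plain Python; `features_to_reach` is a hypothetical helper and the numbers are invented for illustration:

```python
def features_to_reach(importances, threshold_pct):
    """Return how many of the largest values are needed for their sum
    to reach at least threshold_pct percent of the total."""
    values = sorted(importances, reverse=True)
    total = sum(values)
    if total <= 0:
        return 0
    running = 0.0
    for count, value in enumerate(values, start=1):
        running += value
        # Compare cross-multiplied values to avoid dividing by the total
        if running * 100 >= threshold_pct * total:
            return count
    return len(values)


# Invented importance scores that sum to 100 for easy checking
example_scores = [50, 30, 10, 6, 4]
n_for_80 = features_to_reach(example_scores, 80)
n_for_90 = features_to_reach(example_scores, 90)
print(n_for_80, n_for_90)
```

The function only iterates over its input, so passing `importance_df['Importance']` should work as well.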

## 5. Custom Feature Importance Visualization

Now let's create a custom visualization of feature importance with better formatting, labels, and color coding:

In [None]:
# Create a better visualized horizontal bar chart for the top 10 features
plt.figure(figsize=(14, 10))

# Get the top 10 features
top_features = importance_df.head(10)

# Assign one viridis color per bar by rank (the DataFrame is already sorted by importance)
colors = plt.cm.viridis(np.linspace(0, 0.8, len(top_features)))

# Create horizontal bar plot
bars = plt.barh(top_features['Feature Name'], top_features['Importance'], color=colors)

# Add percentage labels to the right of each bar
for i, bar in enumerate(bars):
    percentage = top_features.iloc[i]['Percentage']
    plt.text(bar.get_width() + bar.get_width() * 0.01, 
             bar.get_y() + bar.get_height() / 2, 
             f"{percentage:.2f}%", 
             va='center',
             fontweight='bold')

# Set labels and title
plt.xlabel('Importance Score', fontsize=14)
plt.ylabel('Feature Name', fontsize=14)
plt.title('Top 10 Most Important Features in the XGBoost LTR Model', fontsize=16)

# Reverse y-axis to have highest importance at the top
plt.gca().invert_yaxis()

# Add a grid for better readability
plt.grid(axis='x', linestyle='--', alpha=0.7)

# Adjust layout, then save before showing: after plt.show() the current
# figure may be empty, so a later savefig can write a blank image
plt.tight_layout()
plt.savefig('models/feature_importance_custom.png', dpi=300, bbox_inches='tight')
print("Custom feature importance plot saved to models/feature_importance_custom.png")
plt.show()

In [None]:
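# Group features by category based on substrings of their names.
# Rules are checked in order, so the first match wins
# (e.g. a name containing both 'query' and 'price' is 'Query-related').
def categorize_feature(feature_name):
    # Normalize to a lowercase string so unmapped (NaN) names fall into 'Other'
    feature_name = str(feature_name).lower()
    if 'position' in feature_name: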
        return 'Position-related'
    elif 'query' in feature_name:
        return 'Query-related'
    elif 'geo' in feature_name or 'distance' in feature_name or 'neighborhood' in feature_name:
        return 'Geographic'
    elif 'semantic' in feature_name:
        return 'Semantic'
    elif 'bm25' in feature_name:
        return 'BM25 Relevance'
    elif 'interaction' in feature_name or 'click' in feature_name or 'view' in feature_name or 'engagement' in feature_name:
        return 'User Interaction'
    elif 'price' in feature_name or 'tax' in feature_name or 'fee' in feature_name or 'square' in feature_name:
        return 'Property Attributes'
    elif 'bedroom' in feature_name or 'bathroom' in feature_name:
        return 'Room-related'
    elif 'search' in feature_name or 'template' in feature_name:
        return 'Search Metadata'
    else:
        return 'Other'

# Add category to the dataframe
importance_df['Category'] = importance_df['Feature Name'].apply(categorize_feature)

# Calculate importance by category
category_importance = importance_df.groupby('Category')['Importance'].sum().sort_values(ascending=False)
category_percentage = (category_importance / category_importance.sum() * 100).round(2)

# Display importance by category
category_df = pd.DataFrame({
    'Category': category_importance.index,
    'Importance': category_importance.values,
    'Percentage': category_percentage.values
})

print("Feature Importance by Category:")
category_df
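
The `if`/`elif` chain above depends on the order of its branches. A table-driven variant makes that precedence explicit and easier to adjust. A minimal sketch, using only a subset of the categories; `CATEGORY_RULES`, `categorize`, and the example feature names are hypothetical:

```python
# Ordered (category, keywords) rules; earlier rules take precedence.
# This is an illustrative subset, not the full rule set used above.
CATEGORY_RULES = [
    ("Position-related", ("position",)),
    ("Query-related", ("query",)),
    ("Geographic", ("geo", "distance", "neighborhood")),
    ("Property Attributes", ("price", "tax", "fee", "square")),
]


def categorize(feature_name, rules=CATEGORY_RULES, default="Other"):
    """Return the first category whose keywords appear in the feature name."""
    name = str(feature_name).lower()
    for category, keywords in rules:
        if any(keyword in name for keyword in keywords):
            return category
    return default


# Invented feature names for illustration
cat_mixed = categorize("query_price_match")
cat_price = categorize("listing_price")
cat_other = categorize("year_built")
print(cat_mixed, cat_price, cat_other, sep=" | ")
```

Because matching is by substring, a short keyword such as `fee` or `geo` can match unrelated names, so it is worth reviewing which features end up in each category.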

In [None]:
# Visualize importance by category using a pie chart
plt.figure(figsize=(12, 8))

# Create a pie chart with percentage labels
plt.pie(category_df['Percentage'], 
        labels=category_df['Category'], 
        autopct='%1.1f%%',
        startangle=90, 
        shadow=True, 
        explode=[0.05] * len(category_df),
        colors=plt.cm.tab10(np.arange(len(category_df))))

# Add title
plt.title('Feature Importance by Category', fontsize=16)

# Ensure the pie chart is drawn as a circle
plt.axis('equal')

# Save the category chart before showing it: after plt.show() the current
# figure may be empty, so a later savefig can write a blank image
plt.tight_layout()
plt.savefig('models/feature_importance_by_category.png', dpi=300, bbox_inches='tight')
print("Category importance plot saved to models/feature_importance_by_category.png")
plt.show()

## Conclusion

This notebook has analyzed the feature importance of the XGBoost Learning to Rank model used for property search ranking. We've:

1. Loaded the model and metadata
2. Visualized the top 10 most important features using XGBoost's built-in function
3. Created a detailed DataFrame of feature importance with percentages
4. Built a custom visualization with better formatting and labels
5. Categorized features and analyzed importance by category

The analysis helps understand which features have the most influence on the ranking decisions made by the model, which can guide further model improvements and search optimization.

To use these insights:
- Focus on improving the quality of the most important features
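- Consider removing or downweighting less important features, after confirming with an offline ranking metric that doing so does not hurt, since `weight` measures how often a feature is used in splits rather than how much it improves the ranking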
- Use the category analysis to ensure a balanced feature set across different aspects of relevance