# Flight Fare Prediction - Data Exploration

**Objective**: Explore the Bangladesh Flight Price Dataset and understand its structure, quality, and patterns.

**Dataset**: Flight_Price_Dataset_of_Bangladesh.csv  
**Source**: Kaggle  
**Date**: 2026-02-06

**Note**: All implementation logic lives in the `scripts/` directory. This notebook contains only function calls.

## 1. Setup and Configuration

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Import custom modules
from scripts import data_loader, data_quality, visualizations, analysis  # type: ignore

# Configure display and plotting
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("Setup complete!")

## 2. Load Data

In [None]:
# Load the flight price dataset
df = data_loader.load_flight_data()

## 3. Initial Data Inspection

In [None]:
# Display first few rows
print("First 5 rows:")
df.head()

In [None]:
# Display dataset information
data_loader.display_basic_info(df)

In [None]:
# Statistical summary
print("Statistical Summary (Numerical Columns):")
df.describe()

## 4. Data Quality Assessment

In [None]:
# Check missing values
missing_df = data_quality.check_missing_values(df)
missing_df
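
The body of `data_quality.check_missing_values` lives in `scripts/` and is not shown here. As a rough illustration of what such a check usually computes, here is a minimal sketch. The function name `missing_value_report`, its output column names, and the toy data are all assumptions made for this example, not the project's actual code.

```python
import pandas as pd


def missing_value_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for data_quality.check_missing_values."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually contain gaps, worst first
    report = report[report["missing_count"] > 0]
    return report.sort_values("missing_count", ascending=False)


# Toy data for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    "Airline": ["A", None, "B", "C"],
    "Total Fare (BDT)": [100.0, 200.0, None, None],
})
missing_report = missing_value_report(toy)
print(missing_report)
```

Reporting both a count and a percentage makes it easier to judge whether a column can be imputed or should be dropped.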

In [None]:
# Check for duplicates
duplicate_stats = data_quality.check_duplicates(df)

In [None]:
# Display data types and unique values
data_quality.display_data_types(df)

In [None]:
# Verify fare calculations
fare_mismatches = data_quality.verify_fare_calculation(df)
if len(fare_mismatches) > 0:
    display(fare_mismatches)
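
A fare check of this kind typically compares the recorded total against the sum of its components. The sketch below assumes that `Total Fare (BDT)` should equal `Base Fare (BDT)` plus `Tax & Surcharge (BDT)`; that relationship, the helper name `find_fare_mismatches`, the `tolerance` parameter, and the `fare_difference` column are assumptions for illustration. The real logic is in `scripts/data_quality`.

```python
import pandas as pd


def find_fare_mismatches(frame: pd.DataFrame, tolerance: float = 0.01) -> pd.DataFrame:
    """Hypothetical check: rows where total != base + tax (beyond a tolerance)."""
    expected = frame["Base Fare (BDT)"] + frame["Tax & Surcharge (BDT)"]
    difference = (frame["Total Fare (BDT)"] - expected).abs()
    flagged = frame.assign(fare_difference=difference)
    # A small tolerance absorbs rounding in the stored values
    return flagged[flagged["fare_difference"] > tolerance]


# Toy data for illustration only; the third row is deliberately inconsistent
toy = pd.DataFrame({
    "Base Fare (BDT)": [1000.0, 2000.0, 1500.0],
    "Tax & Surcharge (BDT)": [150.0, 300.0, 225.0],
    "Total Fare (BDT)": [1150.0, 2300.0, 1800.0],
})
mismatches = find_fare_mismatches(toy)
print(mismatches)
```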

## 5. Target Variable Analysis

In [None]:
# Plot total fare distribution
visualizations.plot_target_distribution(df)

## 6. Categorical Variables Analysis

In [None]:
# Analyze airlines
analysis.analyze_categorical_variable(df, 'Airline', top_n=10)

In [None]:
# Average fare by airline
airline_stats = visualizations.plot_average_fare_by_category(
    df, 'Airline', title='Average Fare by Airline'
)
airline_stats
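
The table returned by `plot_average_fare_by_category` is presumably a per-category aggregate of the fare column. A minimal sketch of that aggregation, without the plotting, might look like the following. The helper name `fare_stats_by_category`, the choice of statistics, and the toy airline labels are assumptions.

```python
import pandas as pd


def fare_stats_by_category(
    frame: pd.DataFrame, column: str, target: str = "Total Fare (BDT)"
) -> pd.DataFrame:
    """Hypothetical aggregate: mean, median, and count of the fare per category."""
    stats = frame.groupby(column)[target].agg(["mean", "median", "count"])
    # Most expensive category first
    return stats.sort_values("mean", ascending=False)


# Toy data for illustration only
toy = pd.DataFrame({
    "Airline": ["X", "X", "Y", "Z", "Z", "Z"],
    "Total Fare (BDT)": [100.0, 300.0, 500.0, 50.0, 100.0, 150.0],
})
stats = fare_stats_by_category(toy, "Airline")
print(stats)
```

Including `count` alongside the mean helps flag categories whose average rests on only a handful of rows.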

In [None]:
# Analyze class distribution
analysis.analyze_categorical_variable(df, 'Class')
class_stats = analysis.analyze_fare_by_category(df, 'Class', top_n=3)

In [None]:
# Create and analyze routes
df = analysis.create_route_feature(df)
route_analysis = analysis.analyze_routes(df, top_n=10)
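
The final summary cell reads `Source`, `Destination`, and `Route` columns, so the route feature is most likely built from the first two. A minimal sketch of that step, assuming a simple string concatenation, is below. The helper name `add_route_column`, the separator, and the toy values are assumptions.

```python
import pandas as pd


def add_route_column(frame: pd.DataFrame, separator: str = " -> ") -> pd.DataFrame:
    """Hypothetical stand-in for analysis.create_route_feature."""
    out = frame.copy()  # leave the caller's frame untouched
    out["Route"] = (
        out["Source"].astype(str) + separator + out["Destination"].astype(str)
    )
    return out


# Toy data for illustration only
toy = pd.DataFrame({
    "Source": ["DAC", "DAC", "CGP"],
    "Destination": ["CGP", "DXB", "DAC"],
})
routed = add_route_column(toy)
route_counts = routed["Route"].value_counts()
print(route_counts)
```

In this sketch a route is directional, so `DAC -> CGP` and `CGP -> DAC` count as two different routes.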

In [None]:
# Analyze stopovers
analysis.analyze_categorical_variable(df, 'Stopovers')
stopover_stats = analysis.analyze_fare_by_category(df, 'Stopovers', top_n=3)

In [None]:
# Analyze seasonality
analysis.analyze_categorical_variable(df, 'Seasonality')
seasonal_stats = analysis.analyze_fare_by_category(df, 'Seasonality', top_n=4)

## 7. Temporal Analysis

In [None]:
# Booking lead time distribution
visualizations.plot_booking_lead_time(df)

In [None]:
# Booking lead time vs fare
visualizations.plot_scatter_with_trend(
    df, 
    'Days Before Departure', 
    'Total Fare (BDT)',
    sample_size=5000,
    title='Booking Lead Time vs Total Fare'
)
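
The trend line in a plot like this is often a least-squares fit on a random sample of rows. The sketch below shows only that computation, without the plotting. It assumes a straight-line fit via `numpy.polyfit`; the helper name `linear_trend` and the `seed` parameter are assumptions, and the toy fares are synthetic numbers, not findings about the dataset.

```python
import numpy as np
import pandas as pd


def linear_trend(
    frame: pd.DataFrame, x_col: str, y_col: str,
    sample_size: int = 5000, seed: int = 42,
) -> tuple[float, float]:
    """Hypothetical helper: slope and intercept of a least-squares line."""
    # Sample only when the frame is larger than the requested size
    if len(frame) > sample_size:
        frame = frame.sample(n=sample_size, random_state=seed)
    x = frame[x_col].to_numpy(dtype=float)
    y = frame[y_col].to_numpy(dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(slope), float(intercept)


# Synthetic toy data: fare falls by 50 BDT per extra day of lead time
toy = pd.DataFrame({"Days Before Departure": [1, 10, 20, 30, 60]})
toy["Total Fare (BDT)"] = 9000 - 50 * toy["Days Before Departure"]

slope, intercept = linear_trend(toy, "Days Before Departure", "Total Fare (BDT)")
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

Sampling keeps the scatter plot readable and fast on a large frame; a fixed seed makes the sample reproducible between runs.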

## 8. Fare Component Analysis

In [None]:
# Calculate tax percentage
df = analysis.calculate_tax_percentage(df)
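
The notebook does not say how the tax percentage is defined. One plausible reading is tax and surcharge as a share of the base fare, sketched below. That definition, the helper name `add_tax_percentage`, and the output column `Tax Percentage` are assumptions; the real function in `scripts/analysis` may use the total fare as the denominator instead.

```python
import numpy as np
import pandas as pd


def add_tax_percentage(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical: tax and surcharge as a percentage of the base fare."""
    out = frame.copy()
    # A zero base fare would divide by zero, so treat it as missing
    base = out["Base Fare (BDT)"].replace(0, np.nan)
    out["Tax Percentage"] = out["Tax & Surcharge (BDT)"] / base * 100
    return out


# Toy data for illustration only; the last row has a zero base fare
toy = pd.DataFrame({
    "Base Fare (BDT)": [1000.0, 2000.0, 0.0],
    "Tax & Surcharge (BDT)": [150.0, 500.0, 10.0],
})
with_tax = add_tax_percentage(toy)
print(with_tax)
```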

## 9. Correlation Analysis

In [None]:
# Correlation heatmap
numerical_cols = [
    'Duration (hrs)', 
    'Base Fare (BDT)', 
    'Tax & Surcharge (BDT)', 
    'Total Fare (BDT)', 
    'Days Before Departure'
]

corr_matrix = visualizations.plot_correlation_heatmap(df, numerical_cols)

print("\nCorrelations with Total Fare:")
print(corr_matrix['Total Fare (BDT)'].sort_values(ascending=False))
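
The ranking printed above includes the target's correlation with itself, which is always 1. A small sketch of the same ranking with that entry dropped is shown below, using pandas' default Pearson correlation on synthetic toy data. The toy values are assumptions chosen to give perfect correlations and say nothing about the real dataset.

```python
import pandas as pd

target = "Total Fare (BDT)"

# Synthetic toy data: total fare rises with base fare, falls with lead time
toy = pd.DataFrame({
    "Base Fare (BDT)": [1000.0, 2000.0, 3000.0, 4000.0],
    "Days Before Departure": [40, 30, 20, 10],
})
toy[target] = toy["Base Fare (BDT)"] * 1.15

corr = toy.corr()  # Pearson correlation by default
# Drop the target's self-correlation before ranking the other columns
ranked = corr[target].drop(target).sort_values(ascending=False)
print(ranked)
```

If the total fare is simply the sum of the base fare and the tax, its correlation with those two columns will be high by construction, so they would leak the target if used as model features.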

## 10. Business Insights

In [None]:
# Generate key business insights
insights = analysis.generate_business_insights(df, top_n=3)

## 11. Final Data Quality Summary

In [None]:
# Generate comprehensive quality summary
quality_summary = data_quality.generate_quality_summary(df)

print("\nFlight Details:")
print(f"   Unique Airlines: {df['Airline'].nunique()}")
print(f"   Unique Routes: {df['Route'].nunique()}")
print(f"   Unique Sources: {df['Source'].nunique()}")
print(f"   Unique Destinations: {df['Destination'].nunique()}")

print("\n" + "=" * 60)
print("Initial exploration complete! Ready for preprocessing.")
print("=" * 60)

## Next Steps

1. **Data Preprocessing**: Handle missing values, outliers, and data type conversions
2. **Feature Engineering**: Create temporal features, encode categoricals, engineer new features
3. **Model Development**: Train baseline and advanced models
4. **Evaluation**: Compare model performance and select the best approach