#### **Project Name** - Voyage Analytics: Integrating MLOps in Travel (Productionization of ML Systems)

##### **Project Type** - EDA/Regression/Classification/Recommender
##### **Contribution** - Individual
##### **Individual Name** - Manasvi Save

#### **Project Summary**

This capstone project demonstrates the end-to-end application of data analytics, machine learning, and MLOps within the travel and tourism domain. Using three core datasets (users, flights, and hotels), the project aims to extract meaningful insights, build predictive and recommendation models, and deploy them using modern, production-grade tools and workflows.

The project begins with a strong foundation in data understanding and exploratory data analysis to uncover patterns in user behavior, travel routes, pricing trends, and accommodation preferences. These insights guide feature selection and model design decisions across multiple machine learning use cases.

A primary focus of the project is the development of a regression model to predict flight prices based on factors such as distance, duration, flight type, and agency. This model is exposed through a REST API built using Flask, enabling real-time predictions. To ensure portability and scalability, the application is containerized using Docker and deployed on Kubernetes, allowing it to efficiently handle varying workloads.

Beyond model development, the project emphasizes real-world MLOps practices. Automated workflows are implemented using Apache Airflow to orchestrate data processing and model-related tasks. A CI/CD pipeline built with Jenkins ensures seamless integration, testing, and deployment of model updates. MLflow is used to track experiments, manage model versions, and maintain reproducibility throughout the model lifecycle.

In addition to price prediction, the project includes a gender classification model to categorize users and a travel recommendation system that suggests hotels based on user preferences and historical behavior. These recommendations and insights are presented through an interactive Streamlit web application, providing a user-friendly interface for exploration and decision support.

Overall, this project delivers a comprehensive, production-ready machine learning pipeline that combines predictive modeling, recommendation systems, scalable deployment, and automated operations. It showcases the practical application of machine learning and MLOps concepts to solve real-world challenges in the travel and tourism industry.

In [None]:
# EDA.ipynb

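# 1. Imports
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display  # explicit import so display() also works outside a notebook cell
from ydata_profiling import ProfileReport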

# Set plot style
sns.set(style="whitegrid")

# 2. Load Data (adjust paths if needed)
users = pd.read_csv("C:/Users/LENOVO/voyage-analytics/data/users.csv")
flights = pd.read_csv("C:/Users/LENOVO/voyage-analytics/data/flights.csv")
hotels = pd.read_csv("C:/Users/LENOVO/voyage-analytics/data/hotels.csv")

print("Data Loaded")
print("Users:", users.shape)
print("Flights:", flights.shape)
print("Hotels:", hotels.shape)

# 3. Quick Glimpse
display(users.head())
display(flights.head())
display(hotels.head())

# 4. Profile Flights Dataset (example)
profile = ProfileReport(flights, title="Flights Dataset Report", explorative=True)

try:
    # Try Jupyter-friendly interactive widgets first
    profile.to_widgets()
except Exception as e:
    # Fall back to a standalone HTML report if widgets cannot be rendered
    print(f"⚠ Widget rendering failed ({e}). Exporting to HTML instead.")
    profile.to_file("flights_report.html")
    print("✅ Report saved as flights_report.html; open it in your browser.")


# 5. Basic Visualizations
plt.figure(figsize=(8,5))
sns.histplot(flights["price"], bins=50, kde=True)
plt.title("Flight Price Distribution")
plt.show()

plt.figure(figsize=(8,5))
sns.scatterplot(x="distance", y="price", data=flights)
plt.title("Distance vs Price")
plt.show()

# 6. Missing Values Heatmap
plt.figure(figsize=(10,6))
sns.heatmap(flights.isnull(), cbar=False)
plt.title("Missing Values in Flights Dataset")
plt.show()

**1. Importing Required Libraries**

In this step, we import essential Python libraries required for data analysis and visualization. Pandas is used for data manipulation, Seaborn and Matplotlib are used for creating visualizations, and ydata_profiling is used to generate an automated and detailed data profiling report.

Setting the Seaborn style ensures that all plots have a clean and consistent appearance.

**2. Loading the Datasets**

Three datasets are loaded into the environment:

- Users dataset
- Flights dataset
- Hotels dataset

Each dataset is read from a CSV file using Pandas. After loading, the shape of each dataset is printed to verify the number of rows and columns, ensuring the data has been imported correctly.
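The shape check can be extended into a small guard that fails early when a file is empty or lacks expected columns. The following is a minimal sketch, not the notebook's actual loading code: the helper name `load_csv` is hypothetical, and the in-memory CSV with made-up values stands in for `flights.csv` (only the `distance` and `price` column names come from the notebook).

```python
import io

import pandas as pd


def load_csv(source, required_columns):
    """Read a CSV and fail early if it is empty or lacks expected columns."""
    df = pd.read_csv(source)
    missing = set(required_columns) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    if df.empty:
        raise ValueError("Dataset has no rows")
    return df


# Tiny in-memory stand-in for flights.csv (values are made up for illustration).
sample_csv = io.StringIO(
    "distance,price\n"
    "300.5,520.0\n"
    "650.0,980.5\n"
    "120.0,310.0\n"
)

flights_sample = load_csv(sample_csv, required_columns=["distance", "price"])
print("Flights sample:", flights_sample.shape)
```

In the real notebook, `source` would be the path to each CSV file; a relative path such as one built with `pathlib.Path` would make the notebook easier to run on another machine.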

**3. Initial Data Inspection**

The first few rows of each dataset are displayed using the head() function. This helps in understanding:

- Column names
- Apparent data types (as suggested by the displayed values)
- Sample values
- Overall structure of the datasets

This quick glimpse is useful for identifying obvious data quality issues at an early stage.
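Because `head()` shows values rather than the types pandas actually inferred, it can help to check `dtypes` as well. The sketch below uses a made-up frame; the `date` column and the text-encoded prices are assumptions chosen to illustrate the problem, not facts about the project's data.

```python
import pandas as pd

# Made-up sample: prices arrive as text and dates as plain strings.
inspection_sample = pd.DataFrame({
    "distance": [300.5, 650.0, 120.0],
    "price": ["520.0", "980.5", "310.0"],
    "date": ["2021-01-05", "2021-02-11", "2021-03-20"],  # hypothetical column
})

# The rows look numeric in head(), but dtypes reveals text columns.
print(inspection_sample.head())
print(inspection_sample.dtypes)

# Convert to proper types; values that cannot be parsed as numbers become NaN.
inspection_sample["price"] = pd.to_numeric(inspection_sample["price"], errors="coerce")
inspection_sample["date"] = pd.to_datetime(inspection_sample["date"])
print(inspection_sample.dtypes)
```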

**4. Automated Data Profiling**

An automated profiling report is generated for the flights dataset using ydata_profiling. This report provides:

- Summary statistics
- Distribution of numerical features
- Correlation analysis
- Missing value analysis
- Detection of potential outliers

If the interactive widget cannot be rendered in the environment, the report is exported as an HTML file that can be viewed in a web browser.
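When the full report is too heavy or unavailable, part of it can be approximated with plain pandas. This is a rough sketch under stated assumptions: the helper `iqr_outlier_counts` is hypothetical, the 1.5 × IQR rule is one common outlier convention (not necessarily what ydata_profiling uses), and the data is made up.

```python
import pandas as pd


def iqr_outlier_counts(df):
    """Count values per numeric column that fall outside 1.5 * IQR of the quartiles."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    below = numeric.lt(q1 - 1.5 * iqr, axis="columns")
    above = numeric.gt(q3 + 1.5 * iqr, axis="columns")
    return (below | above).sum()


# Made-up sample with one implausibly long distance.
profile_sample = pd.DataFrame({
    "distance": [100, 120, 130, 150, 160, 170, 180, 2000],
    "price": [300, 320, 310, 350, 360, 340, 330, 345],
})

print(profile_sample.describe())  # summary statistics per numeric column
outlier_counts = iqr_outlier_counts(profile_sample)
print(outlier_counts)
```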

**5. Basic Data Visualizations**

Basic visualizations are created to understand key patterns in the flights dataset:

- A histogram of flight prices to analyze price distribution
- A scatter plot of distance versus price to understand their relationship

These plots help in identifying trends, skewness, and potential anomalies in the data.
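What the plots suggest visually can also be put into numbers. The sketch below, on made-up data, computes the skewness of `price` and two correlation measures between `distance` and `price`; the rank-based (Spearman) value is obtained by correlating the ranks, which avoids any dependency beyond pandas.

```python
import pandas as pd

# Made-up sample: price rises with distance, with one very expensive flight.
trips = pd.DataFrame({
    "distance": [100, 200, 300, 400, 500, 600],
    "price": [150, 260, 340, 470, 540, 1500],
})

# Positive skewness indicates a long right tail in the price histogram.
price_skew = trips["price"].skew()

# Pearson measures linear association; it is sensitive to the extreme price.
pearson = trips["distance"].corr(trips["price"])

# Spearman is the Pearson correlation of the ranks (monotonic association).
spearman = trips["distance"].rank().corr(trips["price"].rank())

print(f"skew={price_skew:.2f}, pearson={pearson:.2f}, spearman={spearman:.2f}")
```

A large gap between the two correlation values is a hint that a few extreme points dominate the scatter plot and may deserve a closer look.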

**6. Missing Values Analysis**

A heatmap is used to visualize missing values across the flights dataset. This allows easy identification of columns with missing data and helps decide whether data cleaning or imputation is required in later steps.
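A heatmap shows where values are missing, but a per-column count is easier to act on. The following sketch uses a hypothetical helper, `missing_summary`, on made-up data (the `agency` values are placeholders) to list only the columns that contain gaps.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return missing-value counts and percentages for columns that have gaps."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns with at least one gap, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Made-up sample with gaps in two numeric columns.
bookings = pd.DataFrame({
    "distance": [300.5, np.nan, 120.0, 410.0],
    "price": [520.0, 980.5, np.nan, np.nan],
    "agency": ["A", "B", "C", "A"],
})

missing_report = missing_summary(bookings)
print(missing_report)
```

Columns with a small share of missing values are often candidates for imputation, while heavily incomplete columns may be better dropped; the right threshold depends on the modeling task.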