# Project: Advanced Machine Learning and MLOps  
## Tourism Package Prediction


## Problem Statement

“Visit with Us” has recently introduced a **Wellness Tourism Package** to expand its service offerings. However, identifying customers who are most likely to purchase this package remains a significant challenge. The existing manual approach for customer identification is time-consuming, inconsistent, and highly dependent on individual judgment, which can lead to inefficient targeting and missed business opportunities.

The objective of this project is to build a predictive model that can accurately determine whether a customer will purchase the Wellness Tourism Package before initiating sales outreach.

---

## Business Context

“Visit with Us” is a leading travel and tourism organization that leverages customer data to enhance marketing effectiveness and customer engagement. With increasing competition and a diverse customer base, the company aims to adopt data-driven decision-making to improve conversion rates and optimize marketing efforts.

By predicting customer purchase intent in advance, the organization can:
- Prioritize high-probability customers for sales outreach  
- Improve campaign efficiency and customer satisfaction  
- Reduce operational costs associated with ineffective targeting  

To support this initiative, an **end-to-end MLOps pipeline** is implemented to automate data preprocessing, model training, evaluation, and deployment. Continuous Integration and Continuous Deployment (CI/CD) practices using GitHub Actions ensure scalability, consistency, and timely model updates.

---

## Data Description

The dataset contains customer demographic details and interaction-related attributes collected during sales and marketing engagements. Each record represents a unique customer, along with information indicating whether the customer purchased the Wellness Tourism Package.

### Target Variable
- **ProdTaken**: Indicates whether the customer purchased the Wellness Tourism Package  
  - `0` → No  
  - `1` → Yes  

### Customer Demographic Attributes
- **Age**: Age of the customer  
- **Gender**: Gender of the customer  
- **MaritalStatus**: Marital status of the customer  
- **Occupation**: Customer’s occupation  
- **MonthlyIncome**: Gross monthly income of the customer  
- **CityTier**: City category based on development level  
- **OwnCar**: Indicates whether the customer owns a car  
- **Passport**: Indicates whether the customer holds a valid passport  

### Customer Interaction Attributes
- **TypeofContact**: Mode of contact (`Company Invited` or `Self Enquiry`)  
- **ProductPitched**: Type of tourism product pitched to the customer  
- **PitchSatisfactionScore**: Customer’s satisfaction score for the sales pitch  
- **NumberOfFollowups**: Number of follow-ups made after the pitch  
- **DurationOfPitch**: Duration of the sales pitch  
- **PreferredPropertyStar**: Preferred hotel rating  
- **NumberOfTrips**: Average number of trips taken annually  
- **NumberOfPersonVisiting**: Total number of people traveling with the customer  
- **NumberOfChildrenVisiting**: Number of children below age five accompanying the customer  

The dataset consists of both **numerical and categorical variables**. Any missing values, whether real nulls or placeholder strings, must be addressed during data preprocessing before model training.
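As a rough illustration of how mixed numerical and categorical columns can be prepared, the sketch below combines imputation, scaling, and one-hot encoding in a scikit-learn `ColumnTransformer`. The toy rows, the choice of columns, and the imputation strategies (median, most frequent) are assumptions made for this example, not the project's confirmed settings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows mimicking a few of the columns described above (values are made up).
toy = pd.DataFrame({
    "Age": [41.0, np.nan, 37.0, 33.0],
    "MonthlyIncome": [20993.0, 20130.0, np.nan, 17909.0],
    "Occupation": ["Salaried", "Salaried", np.nan, "Free Lancer"],
    "Gender": ["Female", "Male", "Male", np.nan],
})

numeric_cols = ["Age", "MonthlyIncome"]
categorical_cols = ["Occupation", "Gender"]

# Numeric columns: fill gaps with the median, then standardize.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent label, then one-hot encode.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

features = preprocessor.fit_transform(toy)
print(features.shape)
```

Keeping the imputers inside the pipeline means their statistics are learned from the training split only when the pipeline is fitted after `train_test_split`.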


## Data Registration

### Folder Structure Creation
As part of the data registration process, a master project directory was created with a dedicated subfolder named **`data`** to store the dataset used for this project. This ensures organized data management and reproducibility.

**Project Structure:**



The dataset file **`tourism.csv`** is placed inside the `data` directory and is used as the single source of truth for model training and evaluation.
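The folder layout described above could be created with a few lines of Python. This is only a sketch: the helper name `create_project_layout` and the root folder name `tourism_project` are placeholders, since the actual master directory name is not given here.

```python
from pathlib import Path


def create_project_layout(root):
    """Create <root>/data and return the expected location of the dataset file."""
    data_dir = Path(root) / "data"
    # parents=True builds the master directory too; exist_ok=True makes reruns safe.
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir / "tourism.csv"


if __name__ == "__main__":
    # "tourism_project" is a hypothetical name for the master directory.
    dataset_path = create_project_layout("tourism_project")
    print(dataset_path)
```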

---

### Hugging Face Dataset Registration
To enable dataset versioning, accessibility, and reproducibility, the dataset was registered on the **Hugging Face Datasets Hub** as a public dataset. Registering the data on Hugging Face allows seamless integration with machine learning pipelines and ensures that the dataset can be reused or updated in the future.

**Steps followed:**
1. Created a Hugging Face dataset repository.
2. Uploaded the `tourism.csv` file to the dataset repository.
3. Verified successful dataset hosting and accessibility.

The dataset can now be accessed programmatically or via the Hugging Face interface, supporting scalable and collaborative MLOps workflows.

This completes the data registration requirement for the project.
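The registration steps above can also be scripted. The sketch below assumes the `huggingface_hub` package is installed and that an access token is available (for example through the `HF_TOKEN` environment variable); the repository ID `your-username/tourism-package-prediction` and the helper name `register_dataset` are placeholders, not the project's actual values.

```python
def register_dataset(api, repo_id, csv_path):
    """Create the dataset repository if needed, then upload the CSV into it."""
    # exist_ok=True lets the script be rerun without failing on an existing repo.
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    api.upload_file(
        path_or_fileobj=csv_path,
        path_in_repo="tourism.csv",
        repo_id=repo_id,
        repo_type="dataset",
    )


def main():
    # Imported here so the helper above can be reused or tested without the package.
    from huggingface_hub import HfApi

    api = HfApi()  # picks up the token from HF_TOKEN or a cached login
    # Hypothetical repository ID; replace with the real namespace and name.
    register_dataset(api, "your-username/tourism-package-prediction", "data/tourism.csv")
```

Calling `main()` performs the upload; it is left uncalled here because it needs network access and valid credentials.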


## Importing Important Libraries


In [4]:
# Data manipulation and numerical computation
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning utilities
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Model
from sklearn.ensemble import RandomForestClassifier

# Model evaluation
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    f1_score
)

# Model persistence
import joblib

# Display settings
pd.set_option("display.max_columns", 200)


### Observation
The required libraries for data processing, visualization, machine learning, evaluation, and model persistence have been successfully imported. These libraries will be used throughout the project to build and deploy the end-to-end MLOps pipeline.
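To show how the imported pieces fit together, here is a minimal sketch of a train, evaluate, and persist cycle. It runs on synthetic data generated with NumPy, and the hyperparameters (`n_estimators=100`, a 20% test split) are assumptions for the example rather than the project's tuned values.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic features and a binary target loosely tied to the first two features.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

# Stratify so both splits keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# ROC AUC uses predicted probabilities; F1 uses hard class predictions.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
f1 = f1_score(y_test, model.predict(X_test))
print(f"ROC AUC: {auc:.3f}, F1: {f1:.3f}")

# Persist the fitted model and confirm the reloaded copy predicts identically.
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.joblib")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

same_predictions = bool((restored.predict(X_test) == model.predict(X_test)).all())
```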


## Model Building


In [6]:
# Load dataset
df = pd.read_csv("../data/tourism.csv")
df.head()


Unnamed: 0.1,Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,5,200005,0,32.0,Company Invited,1,8.0,Salaried,Male,3,3.0,Basic,3.0,Single,1.0,0,5,1,1.0,Executive,18068.0


**Basic sanity checks**

In [7]:
df.shape

(4128, 21)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4128 entries, 0 to 4127
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                4128 non-null   int64  
 1   CustomerID                4128 non-null   int64  
 2   ProdTaken                 4128 non-null   int64  
 3   Age                       4128 non-null   float64
 4   TypeofContact             4128 non-null   object 
 5   CityTier                  4128 non-null   int64  
 6   DurationOfPitch           4128 non-null   float64
 7   Occupation                4128 non-null   object 
 8   Gender                    4128 non-null   object 
 9   NumberOfPersonVisiting    4128 non-null   int64  
 10  NumberOfFollowups         4128 non-null   float64
 11  ProductPitched            4128 non-null   object 
 12  PreferredPropertyStar     4128 non-null   float64
 13  MaritalStatus             4128 non-null   object 
 14  NumberOf

In [10]:
df.isna().sum().sort_values(ascending=False).head(15)

Unnamed: 0                  0
ProductPitched              0
Designation                 0
NumberOfChildrenVisiting    0
OwnCar                      0
PitchSatisfactionScore      0
Passport                    0
NumberOfTrips               0
MaritalStatus               0
PreferredPropertyStar       0
NumberOfFollowups           0
CustomerID                  0
NumberOfPersonVisiting      0
Gender                      0
Occupation                  0
dtype: int64

### Observations
- The dataset contains both numerical and categorical features.
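- `df.isna().sum()` reports 0 missing values for every column shown; because the counts are sorted in descending order, no column contains nulls at this stage. Placeholder strings such as `"nan"` or `"null"` are still checked for during cleaning.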
- The target variable `ProdTaken` indicates whether a customer purchased the tourism package.
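A further sanity check that is often useful at this point is the class balance of the target. The sketch below is not part of the original notebook; the helper name `target_balance` is hypothetical, and the toy frame simply reuses the five `ProdTaken` values visible in the `df.head()` output.

```python
import pandas as pd


def target_balance(frame, target="ProdTaken"):
    """Return the count and share of each class in the target column."""
    counts = frame[target].value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# The five ProdTaken values shown by df.head(); far too few rows to draw conclusions.
toy = pd.DataFrame({"ProdTaken": [1, 0, 1, 0, 0]})
summary = target_balance(toy)
print(summary)
```

On the full dataset, `target_balance(df)` would indicate whether stratified splitting or class weighting is worth considering.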

**Remove duplicates & clean strings**

In [14]:
df = df.drop_duplicates().copy()

for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype(str).str.strip()
    df.loc[df[col].str.lower().isin(["nan", "none", "null", ""]), col] = np.nan

df.shape


(4128, 21)
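The effect of the string cleaning step is easier to see on a small example. The sketch below wraps the same logic in a hypothetical helper, `clean_text_columns`, and applies it to made-up values; it is an illustration of the technique rather than the notebook's exact code.

```python
import numpy as np
import pandas as pd


def clean_text_columns(frame, columns):
    """Strip whitespace and turn placeholder strings into real NaN values."""
    placeholders = {"nan", "none", "null", ""}
    out = frame.copy()
    for col in columns:
        out[col] = out[col].astype(str).str.strip()
        # Case-insensitive match, so "NULL" and "None" are caught as well.
        out.loc[out[col].str.lower().isin(placeholders), col] = np.nan
    return out


# Made-up values: padded text, a placeholder string, a real null, and a clean label.
toy = pd.DataFrame({"Occupation": [" Salaried ", "NULL", None, "Free Lancer"]})
cleaned = clean_text_columns(toy, ["Occupation"])
print(cleaned)
```

Because the shape is unchanged at `(4128, 21)` after this cell, `drop_duplicates()` removed no rows from this dataset.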

**Drop ID column**

In [16]:
if "CustomerID" in df.columns:
    df = df.drop(columns=["CustomerID"])

df.head()


Unnamed: 0.1,Unnamed: 0,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,0,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,1,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,2,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,3,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,5,0,32.0,Company Invited,1,8.0,Salaried,Male,3,3.0,Basic,3.0,Single,1.0,0,5,1,1.0,Executive,18068.0


### Observation
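`CustomerID` is a unique identifier with no predictive value. It has been removed so that the model cannot memorize individual customers, which supports generalization. The `Unnamed: 0` column still visible in the output appears to be a leftover row index from an earlier CSV export and is likewise not a meaningful feature.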
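The `Unnamed: 0` column in the output above is typically what pandas produces when a CSV was saved with its row index. A possible extension of the drop step, sketched here with a hypothetical helper `drop_identifier_columns` and a toy frame, removes such columns together with `CustomerID`.

```python
import pandas as pd


def drop_identifier_columns(frame, id_columns=("CustomerID",)):
    """Drop explicit ID columns plus any 'Unnamed: N' index columns left by a CSV export."""
    unnamed = [c for c in frame.columns if str(c).startswith("Unnamed:")]
    to_drop = [c for c in id_columns if c in frame.columns] + unnamed
    return frame.drop(columns=to_drop)


# Toy frame with the same kinds of columns as the dataset (values from df.head()).
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "CustomerID": [200000, 200001],
    "ProdTaken": [1, 0],
    "Age": [41.0, 49.0],
})
trimmed = drop_identifier_columns(toy)
print(list(trimmed.columns))
```

Alternatively, passing `index_col=0` to `pd.read_csv` would load that column as the index instead of as a feature.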
