# Problem Statement

## **Business Context**

"Visit with Us," a leading travel company, is revolutionizing the tourism industry by leveraging data-driven strategies to optimize operations and customer engagement. While introducing a new package offering, such as the Wellness Tourism Package, the company faces challenges in targeting the right customers efficiently. The manual approach to identifying potential customers is inconsistent, time-consuming, and prone to errors, leading to missed opportunities and suboptimal campaign performance.

To address these issues, the company aims to implement a scalable and automated system that integrates customer data, predicts potential buyers, and enhances decision-making for marketing strategies. By utilizing an MLOps pipeline, the company seeks to achieve seamless integration of data preprocessing, model development, deployment, and CI/CD practices for continuous improvement. This system will ensure efficient targeting of customers, timely updates to the predictive model, and adaptation to evolving customer behaviors, ultimately driving growth and customer satisfaction.


## **Objective**

As an MLOps Engineer at "Visit with Us," your responsibility is to design and deploy an MLOps pipeline on GitHub to automate the end-to-end workflow for predicting customer purchases. The primary objective is to build a model that predicts whether a customer will purchase the newly introduced Wellness Tourism Package before contacting them. The pipeline will include data cleaning, preprocessing, transformation, model building, training, evaluation, and deployment, ensuring consistent performance and scalability. By leveraging GitHub Actions for CI/CD integration, the system will enable automated updates, streamline model deployment, and improve operational efficiency. This robust predictive solution will empower policymakers to make data-driven decisions, enhance marketing strategies, and effectively target potential customers, thereby driving customer acquisition and business growth.

## **Data Description**

The dataset contains customer and interaction data that serve as key attributes for predicting the likelihood of purchasing the Wellness Tourism Package. The detailed attributes are:

**Customer Details**
- **CustomerID:** Unique identifier for each customer.
- **ProdTaken:** Target variable indicating whether the customer has purchased a package (0: No, 1: Yes).
- **Age:** Age of the customer.
- **TypeofContact:** The method by which the customer was contacted (Company Invited or Self Inquiry).
- **CityTier:** The city category based on development, population, and living standards (Tier 1 > Tier 2 > Tier 3).
- **Occupation:** Customer's occupation (e.g., Salaried, Freelancer).
- **Gender:** Gender of the customer (Male, Female).
- **NumberOfPersonVisiting:** Total number of people accompanying the customer on the trip.
- **PreferredPropertyStar:** Preferred hotel rating by the customer.
- **MaritalStatus:** Marital status of the customer (Single, Married, Divorced).
- **NumberOfTrips:** Average number of trips the customer takes annually.
- **Passport:** Whether the customer holds a valid passport (0: No, 1: Yes).
- **OwnCar:** Whether the customer owns a car (0: No, 1: Yes).
- **NumberOfChildrenVisiting:** Number of children below age 5 accompanying the customer.
- **Designation:** Customer's designation in their current organization.
- **MonthlyIncome:** Gross monthly income of the customer.

**Customer Interaction Data**
- **PitchSatisfactionScore:** Score indicating the customer's satisfaction with the sales pitch.
- **ProductPitched:** The type of product pitched to the customer.
- **NumberOfFollowups:** Total number of follow-ups by the salesperson after the sales pitch.-
- **DurationOfPitch:** Duration of the sales pitch delivered to the customer.


# Model Building

In [79]:
# Create a master folder to keep all files created when executing the below code cells
import os
os.makedirs("tourism_project", exist_ok=True)

In [80]:
# Create a folder for storing the model building files
os.makedirs("tourism_project/model_building", exist_ok=True)

## Data Registration

In [81]:
os.makedirs("tourism_project/data", exist_ok=True)

Once the **data** folder created after executing the above cell, please upload the **tourism.csv** in to the folder

In [83]:
%%writefile tourism_project/model_building/data_register.py
from huggingface_hub.utils import RepositoryNotFoundError, HfHubHTTPError
from huggingface_hub import HfApi, create_repo
import os


repo_id = "huzaifa-sr/tourism-project"
repo_type = "dataset"

# Initialize API client
api = HfApi(token=os.getenv("HF_TOKEN"))

# Step 1: Check if the space exists
try:
    api.repo_info(repo_id=repo_id, repo_type=repo_type)
    print(f"Space '{repo_id}' already exists. Using it.")
except RepositoryNotFoundError:
    print(f"Space '{repo_id}' not found. Creating new space...")
    create_repo(repo_id=repo_id, repo_type=repo_type, private=False)
    print(f"Space '{repo_id}' created.")

api.upload_folder(
    folder_path="tourism_project/data",
    repo_id=repo_id,
    repo_type=repo_type,
)

Overwriting tourism_project/model_building/data_register.py


## Data Preparation

In [9]:
df["Gender"].value_counts()

Unnamed: 0_level_0,count
Gender,Unnamed: 1_level_1
Male,2463
Female,1510
Fe Male,155


1. **Imports Necessary Libraries**:

2. **Dataset Loading**:
   - The script defines a path to a dataset stored on Hugging Face and reads it into a Pandas DataFrame.

3. **Data Preparation**:
   - The code creates matrices for predictors (features) and the target variable.
   - It splits the dataset into training and testing sets, reserving 20% of the data for testing. This is done to evaluate the model's performance later.

4. **Saving Prepared Data**:
   - After splitting, the script saves the training and testing datasets (features and target) as CSV files.

5. **Uploading Files**:
   - Finally, it uploads these CSV files back to the Hugging Face Hub, ensuring that they are properly stored in the specified repository.

In [17]:
%%writefile tourism_project/model_building/prep.py
# for data manipulation
import pandas as pd
import sklearn
# for creating a folder
import os
# for data preprocessing and pipeline creation
from sklearn.model_selection import train_test_split
# for converting text data in to numerical representation
from sklearn.preprocessing import LabelEncoder
# for hugging face space authentication to upload files
from huggingface_hub import login, HfApi
import numpy as np

# Define constants for the dataset and output paths
api = HfApi(token=os.getenv("HF_TOKEN"))
DATASET_PATH = "hf://datasets/huzaifa-sr/tourism-project/tourism.csv"
df = pd.read_csv(DATASET_PATH)
print("Dataset loaded successfully.")

# Drop columns not required
df = df.drop(columns=["CustomerID"])
print("Dropped CustomerID column")
first_col = df.columns[0]
if first_col == "" or str(first_col).lower().startswith("unnamed"):
    df = df.drop(columns=[first_col])
    print(f"Dropped unnamed index column: {first_col}")

# Ensure target exists
TARGET = "ProdTaken"
if TARGET not in df.columns:
    raise KeyError(f"Expected target column '{TARGET}' not found in dataset columns: {df.columns.tolist()}")

# Fix Gender typos/variants seen in CSV (e.g. 'Fe Male', 'Fe Male ', 'FeMale')
if "Gender" in df.columns:
    df["Gender"] = df["Gender"].astype(str).str.strip().str.lower()
    df.loc[df["Gender"].str.contains(r"fe|fem", na=False), "Gender"] = "Female"
    df.loc[df["Gender"].str.contains(r"male", na=False) & ~df["Gender"].str.contains(r"fe|fem", na=False), "Gender"] = "Male"
    df.loc[~df["Gender"].isin(["Male", "Female"]) , "Gender"] = np.nan


# Columns to treat as categorical for encoding
categorical_cols = [
    c for c in [
        "TypeofContact",
        "Occupation",
        "Gender",
        "ProductPitched",
        "MaritalStatus",
        "Designation",
    ]
    if c in df.columns
]

# Numeric columns detection (excluding target)
numeric_cols = [c for c in df.columns if c not in categorical_cols + [TARGET]]

# Convert numeric-like columns to numeric dtype where possible
for c in numeric_cols:
    df[c] = pd.to_numeric(df[c], errors="coerce")

# Impute missing values: numeric -> median, categorical -> mode
for c in numeric_cols:
    if df[c].isna().any():
        med = df[c].median()
        df[c] = df[c].fillna(med)
        print(f"Imputed numeric column {c} with median={med}")

for c in categorical_cols:
    if df[c].isna().any():
        mode_val = df[c].mode(dropna=True)
        if not mode_val.empty:
            mode_val = mode_val[0]
            df[c] = df[c].fillna(mode_val)
            print(f"Imputed categorical column {c} with mode='{mode_val}'")
        else:
            # if mode cannot be determined, fill with string 'Unknown'
            df[c] = df[c].fillna("Unknown")
            print(f"Filled categorical column {c} with 'Unknown' (no mode available)")

# One-hot encode categorical columns using pandas get_dummies (drop first to avoid collinearity)
if categorical_cols:
    df = pd.get_dummies(df, columns=categorical_cols, prefix_sep="__", drop_first=True)
    print(f"One-hot encoded columns: {categorical_cols}")

# Final check: ensure no missing values remain
missing_after = df.isna().sum().sum()
print(f"Total missing values after imputation/encoding: {missing_after}")

# Split into X and y
X = df.drop(columns=[TARGET])
y = df[TARGET]

# Ensure output directory exists
out_dir = "tourism_project/data/prepared"
os.makedirs(out_dir, exist_ok=True)

# Train-test split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y if len(y.unique())>1 else None)

# Save CSVs
Xtrain.to_csv(os.path.join(out_dir, "Xtrain.csv"), index=False)
Xtest.to_csv(os.path.join(out_dir, "Xtest.csv"), index=False)
ytrain.to_csv(os.path.join(out_dir, "ytrain.csv"), index=False)
ytest.to_csv(os.path.join(out_dir, "ytest.csv"), index=False)

print(f"Saved prepared files to {out_dir}")

# Optional: upload to Hugging Face dataset repo if HF_TOKEN present in env
hf_token = os.getenv("HF_TOKEN")
if hf_token:
    api = HfApi(token=hf_token)
    repo_id = "huzaifa-sr/tourism-project"
    try:
        for filename in ["Xtrain.csv","Xtest.csv","ytrain.csv","ytest.csv"]:
            path = os.path.join(out_dir, filename)
            api.upload_file(path_or_fileobj=path, path_in_repo=filename, repo_id=repo_id, repo_type="dataset")
            print(f"Uploaded {filename} to Hugging Face dataset {repo_id}")
    except Exception as e:
        print(f"Failed to upload to HF: {e}")

print("Data preparation completed.")


Overwriting tourism_project/model_building/prep.py


## Model Training and Registration with Experimentation Tracking

In [12]:
!pip install mlflow==3.0.1 pyngrok==7.2.12 -q

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m69.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m69.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m147.8/147.8 kB[0m [31m11.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m114.9/114.9 kB[0m [31m9.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m85.0/85.0 kB[0m [31m7.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m718.4/718.4 kB[0m [31m44.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m203.4/203.4 kB[0m [31m16.4 MB/s[0m eta [36m0:00:00[0m
[?25h


1. **Set Ngrok Authentication**: Authenticates Ngrok using a personal token to enable secure tunneling from local to public network.

2. **Launch MLflow UI**: Starts the MLflow Tracking UI as a background process on local port 5000 for experiment visualization and tracking.

3. **Create Public Tunnel**: Uses Ngrok to expose the local MLflow UI to the internet, generating a public URL that can be accessed remotely.

4. **Display Public URL**: Prints the Ngrok-generated URL, allowing users to open and interact with the MLflow UI in their browser.

In [13]:
from pyngrok import ngrok
import subprocess
import mlflow

# Set your auth token here (replace with your actual token)
ngrok.set_auth_token("33dZWyEHkHqzh0JaqbQ3YrpLvmL_7nj21cUh1dG37JVAj1hH7")

# Start MLflow UI on port 5000
process = subprocess.Popen(["mlflow", "ui", "--port", "5000"])

# Create public tunnel
public_url = ngrok.connect(5000).public_url
print("MLflow UI is available at:", public_url)

MLflow UI is available at: https://calamitean-yelena-acrogynous.ngrok-free.dev


In [14]:
# Set the tracking URL for MLflow
mlflow.set_tracking_uri(public_url)

# Set the name for the experiment
mlflow.set_experiment("MLOps_experiment")

2025/10/05 09:44:23 INFO mlflow.tracking.fluent: Experiment with name 'MLOps_experiment' does not exist. Creating a new experiment.


<Experiment: artifact_location='mlflow-artifacts:/719673877582777321', creation_time=1759657463945, experiment_id='719673877582777321', last_update_time=1759657463945, lifecycle_stage='active', name='MLOps_experiment', tags={}>

In [19]:
import pandas as pd
import sklearn
# for creating a folder
import os
# for data preprocessing and pipeline creation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
# for model training, tuning, and evaluation
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, recall_score
# for model serialization
import joblib
import mlflow
import numpy as np

# Prefer prepared data (created by prep.py)
DATASET_PATH = "tourism_project/data/tourism.csv"
df = pd.read_csv(DATASET_PATH)
print("Dataset loaded successfully.")

# Drop columns not required
df = df.drop(columns=["CustomerID"])
print("Dropped CustomerID column")
first_col = df.columns[0]
if first_col == "" or str(first_col).lower().startswith("unnamed"):
    df = df.drop(columns=[first_col])
    print(f"Dropped unnamed index column: {first_col}")

# Ensure target exists
TARGET = "ProdTaken"
if TARGET not in df.columns:
    raise KeyError(f"Expected target column '{TARGET}' not found in dataset columns: {df.columns.tolist()}")

# Fix Gender typos/variants seen in CSV (e.g. 'Fe Male', 'Fe Male ', 'FeMale')
if "Gender" in df.columns:
    df["Gender"] = df["Gender"].astype(str).str.strip().str.lower()
    df.loc[df["Gender"].str.contains(r"fe|fem", na=False), "Gender"] = "Female"
    df.loc[df["Gender"].str.contains(r"male", na=False) & ~df["Gender"].str.contains(r"fe|fem", na=False), "Gender"] = "Male"
    df.loc[~df["Gender"].isin(["Male", "Female"]) , "Gender"] = np.nan


# Columns to treat as categorical for encoding
categorical_cols = [
    c for c in [
        "TypeofContact",
        "Occupation",
        "Gender",
        "ProductPitched",
        "MaritalStatus",
        "Designation",
    ]
    if c in df.columns
]

# Numeric columns detection (excluding target)
numeric_cols = [c for c in df.columns if c not in categorical_cols + [TARGET]]

# Convert numeric-like columns to numeric dtype where possible
for c in numeric_cols:
    df[c] = pd.to_numeric(df[c], errors="coerce")

# Impute missing values: numeric -> median, categorical -> mode
for c in numeric_cols:
    if df[c].isna().any():
        med = df[c].median()
        df[c] = df[c].fillna(med)
        print(f"Imputed numeric column {c} with median={med}")

for c in categorical_cols:
    if df[c].isna().any():
        mode_val = df[c].mode(dropna=True)
        if not mode_val.empty:
            mode_val = mode_val[0]
            df[c] = df[c].fillna(mode_val)
            print(f"Imputed categorical column {c} with mode='{mode_val}'")
        else:
            # if mode cannot be determined, fill with string 'Unknown'
            df[c] = df[c].fillna("Unknown")
            print(f"Filled categorical column {c} with 'Unknown' (no mode available)")

# One-hot encode categorical columns using pandas get_dummies (drop first to avoid collinearity)
if categorical_cols:
    df = pd.get_dummies(df, columns=categorical_cols, prefix_sep="__", drop_first=True)
    print(f"One-hot encoded columns: {categorical_cols}")

# Final check: ensure no missing values remain
missing_after = df.isna().sum().sum()
print(f"Total missing values after imputation/encoding: {missing_after}")

# Split into X and y
X = df.drop(columns=[TARGET])
y = df[TARGET]

# Ensure output directory exists
out_dir = "tourism_project/data/prepared"
os.makedirs(out_dir, exist_ok=True)

# Train-test split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y if len(y.unique())>1 else None)

# Save CSVs
Xtrain.to_csv(os.path.join(out_dir, "Xtrain.csv"), index=False)
Xtest.to_csv(os.path.join(out_dir, "Xtest.csv"), index=False)
ytrain.to_csv(os.path.join(out_dir, "ytrain.csv"), index=False)
ytest.to_csv(os.path.join(out_dir, "ytest.csv"), index=False)

#if os.path.exists(Xtrain_path) and os.path.exists(ytrain_path):
print('Loading prepared data from', out_dir)
Xtrain_path = os.path.join(out_dir, "Xtrain.csv")
Xtest_path = os.path.join(out_dir, "Xtest.csv")
ytrain_path = os.path.join(out_dir, "ytrain.csv")
ytest_path = os.path.join(out_dir, "ytest.csv")

Xtrain = pd.read_csv(Xtrain_path)
Xtest = pd.read_csv(Xtest_path)
ytrain = pd.read_csv(ytrain_path).iloc[:, 0]
ytest = pd.read_csv(ytest_path).iloc[:, 0]

print(f'Xtrain shape: {Xtrain.shape}, Xtest shape: {Xtest.shape}')
print(f'ytrain distribution:\n{ytrain.value_counts(normalize=True)}')

# Determine preprocessing: if Xtrain columns are all numeric we only scale; otherwise scale numeric and one-hot encode categoricals
numeric_features = Xtrain.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = Xtrain.select_dtypes(exclude=[np.number]).columns.tolist()

print(f'Numeric features ({len(numeric_features)}): {numeric_features[:10]}')
print(f'Categorical features ({len(categorical_features)}): {categorical_features[:10]}')

transformers = []
if numeric_features:
    transformers.append((StandardScaler(), numeric_features))
if categorical_features:
    transformers.append((OneHotEncoder(handle_unknown='ignore'), categorical_features))

if transformers:
    preprocessor = make_column_transformer(*transformers)
else:
    preprocessor = None

# Compute scale_pos_weight for XGBoost to help with class imbalance (neg/pos)
neg = (ytrain == 0).sum()
pos = (ytrain == 1).sum()
scale_pos_weight = neg / pos if pos > 0 else 1.0
print(f'scale_pos_weight={scale_pos_weight:.3f} (neg={neg}, pos={pos})')

# Define base XGBoost model
xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)
# set scale_pos_weight
xgb_model.set_params(scale_pos_weight=scale_pos_weight)

# Build pipeline
if preprocessor is not None:
    model_pipeline = make_pipeline(preprocessor, xgb_model)
else:
    model_pipeline = make_pipeline(xgb_model)

# Parameter grid (kept small)
# Note: when using pipeline, estimator step name becomes 'xgbclassifier' in GridSearchCV param keys if pipeline includes the classifier
param_grid = {
    'xgbclassifier__n_estimators': [50, 100],
    'xgbclassifier__max_depth': [3, 5],
    'xgbclassifier__learning_rate': [0.05, 0.1],
}

with mlflow.start_run():
    grid_search = GridSearchCV(model_pipeline, param_grid, cv=3, n_jobs=-1, scoring='f1', verbose=1)
    grid_search.fit(Xtrain, ytrain)

    results = grid_search.cv_results_
    for i in range(len(results['params'])):
        param_set = results['params'][i]
        mean_score = results['mean_test_score'][i]
        std_score = results['std_test_score'][i]
        with mlflow.start_run(nested=True):
            mlflow.log_params(param_set)
            mlflow.log_metric('mean_test_score', mean_score)
            mlflow.log_metric('std_test_score', std_score)

    mlflow.log_params(grid_search.best_params_)

    best_model = grid_search.best_estimator_

    classification_threshold = 0.45
    try:
        y_pred_train_proba = best_model.predict_proba(Xtrain)[:, 1]
        y_pred_train = (y_pred_train_proba >= classification_threshold).astype(int)
        y_pred_test_proba = best_model.predict_proba(Xtest)[:, 1]
        y_pred_test = (y_pred_test_proba >= classification_threshold).astype(int)
    except Exception:
        y_pred_train = best_model.predict(Xtrain)
        y_pred_test = best_model.predict(Xtest)

    train_report = classification_report(ytrain, y_pred_train, output_dict=True)
    test_report = classification_report(ytest, y_pred_test, output_dict=True)

    mlflow.log_metrics({
        'train_accuracy': train_report['accuracy'],
        'train_precision': train_report.get('1', {}).get('precision', 0),
        'train_recall': train_report.get('1', {}).get('recall', 0),
        'train_f1-score': train_report.get('1', {}).get('f1-score', 0),
        'test_accuracy': test_report['accuracy'],
        'test_precision': test_report.get('1', {}).get('precision', 0),
        'test_recall': test_report.get('1', {}).get('recall', 0),
        'test_f1-score': test_report.get('1', {}).get('f1-score', 0),
    })

# Save the best model
model_dir = 'tourism_project/model_building/artifacts'
os.makedirs(model_dir, exist_ok=True)
model_path = os.path.join(model_dir, 'xgb_best_model.joblib')
joblib.dump(best_model, model_path)
print(f'Saved best model to {model_path}')

Dataset loaded successfully.
Dropped CustomerID column
Dropped unnamed index column: Unnamed: 0
One-hot encoded columns: ['TypeofContact', 'Occupation', 'Gender', 'ProductPitched', 'MaritalStatus', 'Designation']
Total missing values after imputation/encoding: 0
Loading prepared data from tourism_project/data/prepared
Xtrain shape: (3302, 27), Xtest shape: (826, 27)
ytrain distribution:
ProdTaken
0    0.806784
1    0.193216
Name: proportion, dtype: float64
Numeric features (12): ['Age', 'CityTier', 'DurationOfPitch', 'NumberOfPersonVisiting', 'NumberOfFollowups', 'PreferredPropertyStar', 'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar']
Categorical features (15): ['TypeofContact__Self Enquiry', 'Occupation__Large Business', 'Occupation__Salaried', 'Occupation__Small Business', 'ProductPitched__Deluxe', 'ProductPitched__King', 'ProductPitched__Standard', 'ProductPitched__Super Deluxe', 'MaritalStatus__Married', 'MaritalStatus__Single']
scale_pos_weight=4.176 (neg=2664, po

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


🏃 View run secretive-donkey-208 at: https://calamitean-yelena-acrogynous.ngrok-free.dev/#/experiments/719673877582777321/runs/5dc40be57e6d4096ba83418e5dbcc798
🧪 View experiment at: https://calamitean-yelena-acrogynous.ngrok-free.dev/#/experiments/719673877582777321
🏃 View run nebulous-foal-963 at: https://calamitean-yelena-acrogynous.ngrok-free.dev/#/experiments/719673877582777321/runs/a2a0bb5bbbb843feb8b3f9a639b84262
🧪 View experiment at: https://calamitean-yelena-acrogynous.ngrok-free.dev/#/experiments/719673877582777321
🏃 View run secretive-auk-413 at: https://calamitean-yelena-acrogynous.ngrok-free.dev/#/experiments/719673877582777321/runs/036d38bea59a47d1a306e62bcace5bf4
🧪 View experiment at: https://calamitean-yelena-acrogynous.ngrok-free.dev/#/experiments/719673877582777321
🏃 View run wistful-goat-714 at: https://calamitean-yelena-acrogynous.ngrok-free.dev/#/experiments/719673877582777321/runs/ea2fc91b4a9447278d11322486615889
🧪 View experiment at: https://calamitean-yelena-acrog

In [26]:
import pandas as pd
import sklearn
# for creating a folder
import os
# for data preprocessing and pipeline creation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
# for model training, tuning, and evaluation
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, recall_score
# for model serialization
import joblib
import mlflow
import numpy as np
from pyngrok import ngrok
import subprocess
import time

# Prefer prepared data (created by prep.py)
DATASET_PATH = "tourism_project/data/tourism.csv"
df = pd.read_csv(DATASET_PATH)
print("Dataset loaded successfully.")

# Drop columns not required
df = df.drop(columns=["CustomerID"])
print("Dropped CustomerID column")
first_col = df.columns[0]
if first_col == "" or str(first_col).lower().startswith("unnamed"):
    df = df.drop(columns=[first_col])
    print(f"Dropped unnamed index column: {first_col}")

# Ensure target exists
TARGET = "ProdTaken"
if TARGET not in df.columns:
    raise KeyError(f"Expected target column '{TARGET}' not found in dataset columns: {df.columns.tolist()}")

# Fix Gender typos/variants seen in CSV (e.g. 'Fe Male', 'Fe Male ', 'FeMale')
if "Gender" in df.columns:
    df["Gender"] = df["Gender"].astype(str).str.strip().str.lower()
    df.loc[df["Gender"].str.contains(r"fe|fem", na=False), "Gender"] = "Female"
    df.loc[df["Gender"].str.contains(r"male", na=False) & ~df["Gender"].str.contains(r"fe|fem", na=False), "Gender"] = "Male"
    df.loc[~df["Gender"].isin(["Male", "Female"]) , "Gender"] = np.nan


# Columns to treat as categorical for encoding
categorical_cols = [
    c for c in [
        "TypeofContact",
        "Occupation",
        "Gender",
        "ProductPitched",
        "MaritalStatus",
        "Designation",
    ]
    if c in df.columns
]

# Numeric columns detection (excluding target)
numeric_cols = [c for c in df.columns if c not in categorical_cols + [TARGET]]

# Convert numeric-like columns to numeric dtype where possible
for c in numeric_cols:
    df[c] = pd.to_numeric(df[c], errors="coerce")

# Impute missing values: numeric -> median, categorical -> mode
for c in numeric_cols:
    if df[c].isna().any():
        med = df[c].median()
        df[c] = df[c].fillna(med)
        print(f"Imputed numeric column {c} with median={med}")

for c in categorical_cols:
    if df[c].isna().any():
        mode_val = df[c].mode(dropna=True)
        if not mode_val.empty:
            mode_val = mode_val[0]
            df[c] = df[c].fillna(mode_val)
            print(f"Imputed categorical column {c} with mode='{mode_val}'")
        else:
            # if mode cannot be determined, fill with string 'Unknown'
            df[c] = df[c].fillna("Unknown")
            print(f"Filled categorical column {c} with 'Unknown' (no mode available)")

# One-hot encode categorical columns using pandas get_dummies (drop first to avoid collinearity)
if categorical_cols:
    df = pd.get_dummies(df, columns=categorical_cols, prefix_sep="__", drop_first=True)
    print(f"One-hot encoded columns: {categorical_cols}")

# Final check: ensure no missing values remain
missing_after = df.isna().sum().sum()
print(f"Total missing values after imputation/encoding: {missing_after}")
if missing_after > 0:
    print("Columns with missing values after imputation/encoding:")
    print(df.isna().sum()[df.isna().sum() > 0])


# Split into X and y
X = df.drop(columns=[TARGET])
y = df[TARGET]

# Ensure output directory exists
out_dir = "tourism_project/data/prepared"
os.makedirs(out_dir, exist_ok=True)

# Train-test split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y if len(y.unique())>1 else None)

# Save CSVs
Xtrain.to_csv(os.path.join(out_dir, "Xtrain.csv"), index=False)
Xtest.to_csv(os.path.join(out_dir, "Xtest.csv"), index=False)
ytrain.to_csv(os.path.join(out_dir, "ytrain.csv"), index=False)
ytest.to_csv(os.path.join(out_dir, "ytest.csv"), index=False)

#if os.path.exists(Xtrain_path) and os.path.exists(ytrain_path):
print('Loading prepared data from', out_dir)
Xtrain_path = os.path.join(out_dir, "Xtrain.csv")
Xtest_path = os.path.join(out_dir, "Xtest.csv")
ytrain_path = os.path.join(out_dir, "ytrain.csv")
ytest_path = os.path.join(out_dir, "ytest.csv")

Xtrain = pd.read_csv(Xtrain_path)
Xtest = pd.read_csv(Xtest_path)
ytrain = pd.read_csv(ytrain_path).iloc[:, 0]
ytest = pd.read_csv(ytest_path).iloc[:, 0]

print(f'Xtrain shape: {Xtrain.shape}, Xtest shape: {Xtest.shape}')
print(f'ytrain distribution:\n{ytrain.value_counts(normalize=True)}')

# Determine preprocessing: if Xtrain columns are all numeric we only scale; otherwise scale numeric and one-hot encode categoricals
numeric_features = Xtrain.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = Xtrain.select_dtypes(exclude=[np.number]).columns.tolist()

print(f'Numeric features ({len(numeric_features)}): {numeric_features[:10]}')
print(f'Categorical features ({len(categorical_features)}): {categorical_features[:10]}')

# Set the clas weight to handle class imbalance
class_weight = ytrain.value_counts()[0] / ytrain.value_counts()[1]
class_weight

# Define the preprocessing steps
preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OneHotEncoder(handle_unknown='ignore'), categorical_features)
)

# Define base XGBoost model
xgb_model = xgb.XGBClassifier(scale_pos_weight=class_weight, random_state=42)


# Define hyperparameter grid
param_grid = {
    'xgbclassifier__n_estimators': [50, 75, 100],
    'xgbclassifier__max_depth': [2, 3, 4],
    'xgbclassifier__colsample_bytree': [0.4, 0.5, 0.6],
    'xgbclassifier__colsample_bylevel': [0.4, 0.5, 0.6],
    'xgbclassifier__learning_rate': [0.01, 0.05, 0.1],
    'xgbclassifier__reg_lambda': [0.4, 0.5, 0.6],
}

# Model pipeline
model_pipeline = make_pipeline(preprocessor, xgb_model)

# Restart ngrok tunnel and set MLflow tracking URI
try:
    ngrok.kill()
except Exception:
    pass

# Set your auth token here (replace with your actual token)
ngrok.set_auth_token("33dZWyEHkHqzh0JaqbQ3YrpLvmL_7nj21cUh1dG37JVAj1hH7")

# Start MLflow UI on port 5000
process = subprocess.Popen(["mlflow", "ui", "--port", "5000"])

# Add a small delay to allow MLflow UI to start
time.sleep(5)

# Check if the process is still running
if process.poll() is not None:
    print("MLflow UI process failed to start.")
else:
    # Create public tunnel
    public_url = ngrok.connect(5000).public_url
    print("MLflow UI is available at:", public_url)

    mlflow.set_tracking_uri(public_url)
    mlflow.set_experiment("MLOps_experiment")

    with mlflow.start_run():
        # Hyperparameter tuning
        grid_search = GridSearchCV(model_pipeline, param_grid, cv=5, n_jobs=-1)
        grid_search.fit(Xtrain, ytrain)

        # Log best parameters
        mlflow.log_params(grid_search.best_params_)

        # Store and evaluate the best model
        best_model = grid_search.best_estimator_

        classification_threshold = 0.45

        y_pred_train_proba = best_model.predict_proba(Xtrain)[:, 1]
        y_pred_train = (y_pred_train_proba >= classification_threshold).astype(int)

        y_pred_test_proba = best_model.predict_proba(Xtest)[:, 1]
        y_pred_test = (y_pred_test_proba >= classification_threshold).astype(int)

        train_report = classification_report(ytrain, y_pred_train, output_dict=True)
        test_report = classification_report(ytest, y_pred_test, output_dict=True)

        mlflow.log_metrics({
            "train_accuracy": train_report['accuracy'],
            "train_precision": train_report['1']['precision'],
            "train_recall": train_report['1']['recall'],
            "train_f1-score": train_report['1']['f1-score'],
            "test_accuracy": test_report['accuracy'],
            "test_precision": test_report['1']['precision'],
            "test_recall": test_report['1']['recall'],
            "test_f1-score": test_report['1']['f1-score']
        })

    # Save the best model
    model_dir = 'tourism_project/model_building/artifacts'
    os.makedirs(model_dir, exist_ok=True)
    model_path = os.path.join(model_dir, 'xgb_best_model.joblib')
    joblib.dump(best_model, model_path)
    print(f'Saved best model to {model_path}')

Dataset loaded successfully.
Dropped CustomerID column
Dropped unnamed index column: Unnamed: 0
One-hot encoded columns: ['TypeofContact', 'Occupation', 'Gender', 'ProductPitched', 'MaritalStatus', 'Designation']
Total missing values after imputation/encoding: 0
Loading prepared data from tourism_project/data/prepared
Xtrain shape: (3302, 27), Xtest shape: (826, 27)
ytrain distribution:
ProdTaken
0    0.806784
1    0.193216
Name: proportion, dtype: float64
Numeric features (12): ['Age', 'CityTier', 'DurationOfPitch', 'NumberOfPersonVisiting', 'NumberOfFollowups', 'PreferredPropertyStar', 'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar']
Categorical features (15): ['TypeofContact__Self Enquiry', 'Occupation__Large Business', 'Occupation__Salaried', 'Occupation__Small Business', 'ProductPitched__Deluxe', 'ProductPitched__King', 'ProductPitched__Standard', 'ProductPitched__Super Deluxe', 'MaritalStatus__Married', 'MaritalStatus__Single']
MLflow UI is available at: https://c

In [1]:
%%writefile tourism_project/model_building/train.py
# for data manipulation
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
# for model training, tuning, and evaluation
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, recall_score
# for model serialization
import joblib
# for creating a folder
import os
# for hugging face space authentication to upload files
from huggingface_hub import login, HfApi, create_repo
from huggingface_hub.utils import RepositoryNotFoundError, HfHubHTTPError
import mlflow
import numpy as np

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("mlops-tourism-project")

api = HfApi()


Xtrain_path = "hf://datasets/huzaifa-sr/tourism-project/Xtrain.csv"
Xtest_path = "hf://datasets/huzaifa-sr/tourism-project/Xtest.csv"
ytrain_path = "hf://datasets/huzaifa-sr/tourism-project/ytrain.csv"
ytest_path = "hf://datasets/huzaifa-sr/tourism-project/ytest.csv"

Xtrain = pd.read_csv(Xtrain_path)
Xtest = pd.read_csv(Xtest_path)
ytrain = pd.read_csv(ytrain_path)
ytest = pd.read_csv(ytest_path)

Xtrain = pd.read_csv(Xtrain_path)
Xtest = pd.read_csv(Xtest_path)
ytrain = pd.read_csv(ytrain_path).iloc[:, 0]
ytest = pd.read_csv(ytest_path).iloc[:, 0]

print(f'Xtrain shape: {Xtrain.shape}, Xtest shape: {Xtest.shape}')
print(f'ytrain distribution:\n{ytrain.value_counts(normalize=True)}')

# Determine preprocessing: if Xtrain columns are all numeric we only scale; otherwise scale numeric and one-hot encode categoricals
numeric_features = Xtrain.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = Xtrain.select_dtypes(exclude=[np.number]).columns.tolist()

print(f'Numeric features ({len(numeric_features)}): {numeric_features[:10]}')
print(f'Categorical features ({len(categorical_features)}): {categorical_features[:10]}')

# Set the clas weight to handle class imbalance
class_weight = ytrain.value_counts()[0] / ytrain.value_counts()[1]
class_weight

# Define the preprocessing steps
preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OneHotEncoder(handle_unknown='ignore'), categorical_features)
)

# Define base XGBoost model
xgb_model = xgb.XGBClassifier(scale_pos_weight=class_weight, random_state=42)


# Define hyperparameter grid
param_grid = {
    'xgbclassifier__n_estimators': [50, 75, 100],
    'xgbclassifier__max_depth': [2, 3, 4],
    'xgbclassifier__colsample_bytree': [0.4, 0.5, 0.6],
    'xgbclassifier__colsample_bylevel': [0.4, 0.5, 0.6],
    'xgbclassifier__learning_rate': [0.01, 0.05, 0.1],
    'xgbclassifier__reg_lambda': [0.4, 0.5, 0.6],
}

# Model pipeline
model_pipeline = make_pipeline(preprocessor, xgb_model)

# Start MLflow run
with mlflow.start_run():
    # Hyperparameter tuning
    grid_search = GridSearchCV(model_pipeline, param_grid, cv=5, n_jobs=-1)
    grid_search.fit(Xtrain, ytrain)

    # Log all parameter combinations and their mean test scores
    results = grid_search.cv_results_
    for i in range(len(results['params'])):
        param_set = results['params'][i]
        mean_score = results['mean_test_score'][i]
        std_score = results['std_test_score'][i]

        # Log each combination as a separate MLflow run
        with mlflow.start_run(nested=True):
            mlflow.log_params(param_set)
            mlflow.log_metric("mean_test_score", mean_score)
            mlflow.log_metric("std_test_score", std_score)

    # Log best parameters separately in main run
    mlflow.log_params(grid_search.best_params_)

    # Store and evaluate the best model
    best_model = grid_search.best_estimator_

    classification_threshold = 0.45

    y_pred_train_proba = best_model.predict_proba(Xtrain)[:, 1]
    y_pred_train = (y_pred_train_proba >= classification_threshold).astype(int)

    y_pred_test_proba = best_model.predict_proba(Xtest)[:, 1]
    y_pred_test = (y_pred_test_proba >= classification_threshold).astype(int)

    train_report = classification_report(ytrain, y_pred_train, output_dict=True)
    test_report = classification_report(ytest, y_pred_test, output_dict=True)

    # Log the metrics for the best model
    mlflow.log_metrics({
        "train_accuracy": train_report['accuracy'],
        "train_precision": train_report['1']['precision'],
        "train_recall": train_report['1']['recall'],
        "train_f1-score": train_report['1']['f1-score'],
        "test_accuracy": test_report['accuracy'],
        "test_precision": test_report['1']['precision'],
        "test_recall": test_report['1']['recall'],
        "test_f1-score": test_report['1']['f1-score']
    })

    # Save the model locally
    model_path = "tourism_model_v1.joblib"
    joblib.dump(best_model, model_path)

    # Log the model artifact
    mlflow.log_artifact(model_path, artifact_path="model")
    print(f"Model saved as artifact at: {model_path}")

    # Upload to Hugging Face
    repo_id = "huzaifa-sr/tourism-project"
    repo_type = "model"

    # Step 1: Check if the space exists
    try:
        api.repo_info(repo_id=repo_id, repo_type=repo_type)
        print(f"Space '{repo_id}' already exists. Using it.")
    except RepositoryNotFoundError:
        print(f"Space '{repo_id}' not found. Creating new space...")
        create_repo(repo_id=repo_id, repo_type=repo_type, private=False)
        print(f"Space '{repo_id}' created.")

    # create_repo("churn-model", repo_type="model", private=False)
    api.upload_file(
        path_or_fileobj="tourism_model_v1.joblib",
        path_in_repo="tourism_model_v1.joblib",
        repo_id=repo_id,
        repo_type=repo_type,
    )

Overwriting tourism_project/model_building/train.py


# Deployment

## Dockerfile

In [30]:
os.makedirs("tourism_project/deployment", exist_ok=True)

In [31]:
%%writefile tourism_project/deployment/Dockerfile
# Use a minimal base image with Python 3.9 installed
FROM python:3.9

# Set the working directory inside the container to /app
WORKDIR /app

# Copy all files from the current directory on the host to the container's /app directory
COPY . .

# Install Python dependencies listed in requirements.txt
RUN pip3 install -r requirements.txt

RUN useradd -m -u 1000 user
USER user
ENV HOME=/home/user \
	PATH=/home/user/.local/bin:$PATH

WORKDIR $HOME/app

COPY --chown=user . $HOME/app

# Define the command to run the Streamlit app on port "8501" and make it accessible externally
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0", "--server.enableXsrfProtection=false"]

Writing tourism_project/deployment/Dockerfile


## Streamlit App

Please ensure that the web app script is named `app.py`.

In [32]:
%%writefile tourism_project/deployment/app.py
import streamlit as st
import pandas as pd
import numpy as np
import os
import joblib

st.set_page_config(page_title='Tourism Purchase Predictor', layout='wide')

# Paths
MODEL_PATH = 'tourism_project/model_building/artifacts/xgb_best_model.joblib'
PREPARED_X_PATH = 'tourism_project/data/prepared/Xtrain.csv'
RAW_CSV_CANDIDATES = ['tourism_project/data/tourism.csv', 'tourism.csv']

# Load model if available
model = None
if os.path.exists(MODEL_PATH):
    try:
        model = joblib.load(MODEL_PATH)
        st.info(f'Loaded model from {MODEL_PATH}')
    except Exception as e:
        st.error(f'Failed to load model at {MODEL_PATH}: {e}')
else:
    st.warning(f'No trained model found at {MODEL_PATH}. Please run training first.')

# Load raw CSV (for populating select options)
raw_df = None
for p in RAW_CSV_CANDIDATES:
    if os.path.exists(p):
        raw_df = pd.read_csv(p)
        break

if raw_df is None:
    st.error('Could not find tourism.csv in repo. Place it at repo root or tourism_project/data.')
    st.stop()

# Quick cleanup matching prep.py behavior
# Drop unnamed index column if present
first_col = raw_df.columns[0]
if first_col == '' or str(first_col).lower().startswith('unnamed'):
    raw_df = raw_df.drop(columns=[first_col])

# Drop CustomerID if present
if 'CustomerID' in raw_df.columns:
    raw_df = raw_df.drop(columns=['CustomerID'])

# Ensure target exists
TARGET = 'ProdTaken'
if TARGET not in raw_df.columns:
    st.error(f"Expected target column '{TARGET}' not found in dataset.")
    st.stop()

# Candidate input fields (based on dataset)
numeric_inputs = [
    'Age', 'DurationOfPitch', 'NumberOfPersonVisiting', 'NumberOfFollowups',
    'PreferredPropertyStar', 'NumberOfTrips', 'PitchSatisfactionScore',
    'NumberOfChildrenVisiting', 'MonthlyIncome'
]
categorical_inputs = [
    'TypeofContact', 'CityTier', 'Occupation', 'Gender', 'ProductPitched',
    'MaritalStatus', 'Designation', 'Passport', 'OwnCar'
]
# Make sure fields exist in raw_df
numeric_inputs = [c for c in numeric_inputs if c in raw_df.columns]
categorical_inputs = [c for c in categorical_inputs if c in raw_df.columns]

st.title('Tourism Package Purchase Predictor')
st.write('Enter customer and interaction details below and click Predict.')

with st.form('input_form'):
    cols = st.columns(3)

    inputs = {}
    # Numeric inputs
    for i, col in enumerate(numeric_inputs):
        c = cols[i % 3]
        mean = float(raw_df[col].dropna().mean()) if not raw_df[col].dropna().empty else 0.0
        inputs[col] = c.number_input(col, value=mean)

    # Categorical / binary inputs
    for col in categorical_inputs:
        if col in ['Passport', 'OwnCar']:
            # binary
            unique_vals = sorted(raw_df[col].dropna().unique().tolist())
            default = 1 if 1 in unique_vals else (0 if 0 in unique_vals else unique_vals[0])
            inputs[col] = st.selectbox(col, options=unique_vals, index=0)
        elif col == 'CityTier':
            # CityTier numeric but categorical-like
            options = sorted(raw_df[col].dropna().unique().tolist())
            inputs[col] = st.selectbox(col, options=options, index=0)
        else:
            options = sorted(raw_df[col].dropna().astype(str).unique().tolist())
            inputs[col] = st.selectbox(col, options=options, index=0)

    submitted = st.form_submit_button('Predict')

# Normalize Gender similar to prep.py
if 'Gender' in inputs:
    g = str(inputs['Gender']).strip().lower()
    if any(x in g for x in ['fe', 'fem']):
        inputs['Gender'] = 'Female'
    elif 'male' in g:
        inputs['Gender'] = 'Male'
    else:
        inputs['Gender'] = inputs['Gender']

# Build single-row DataFrame from inputs using the original raw columns order
input_df = pd.DataFrame([inputs])

# Apply same one-hot encoding used in prep.py (pandas.get_dummies with drop_first and prefix_sep='__')
cat_for_dummies = [c for c in categorical_inputs if c in input_df.columns and input_df[c].dtype == object or isinstance(input_df[c].iloc[0], str)]
if cat_for_dummies:
    input_dummies = pd.get_dummies(input_df, columns=cat_for_dummies, prefix_sep='__', drop_first=True)
else:
    input_dummies = input_df.copy()

# If prepared Xtrain exists, align columns to it (this ensures dummies match training features)
if os.path.exists(PREPARED_X_PATH):
    prepared_cols = pd.read_csv(PREPARED_X_PATH, nrows=0).columns.tolist()
    # Reindex to prepared columns, filling missing with 0
    input_prepared = input_dummies.reindex(columns=prepared_cols, fill_value=0)
else:
    # If no prepared Xtrain file, pass input_dummies as-is and hope pipeline handles raw columns
    input_prepared = input_dummies.copy()

st.subheader('Prepared input (what will be fed to the model)')
st.dataframe(input_prepared.transpose())

if submitted:
    if model is None:
        st.error('No model loaded; cannot predict. Train model and save to the expected path.')
    else:
        try:
            # Ensure columns order matches model training
            # If model is a pipeline expecting a full feature set, the input_prepared must match that
            X_in = input_prepared
            # Some sklearn pipelines expect numpy arrays - convert accordingly
            if hasattr(model, 'predict_proba'):
                proba = model.predict_proba(X_in)[:, 1][0]
                pred = int(proba >= 0.5)
            else:
                pred = int(model.predict(X_in)[0])
                proba = None

            st.markdown('## Prediction')
            if proba is not None:
                st.metric('Probability of purchase (ProdTaken=1)', f'{proba:.3f}')
                st.write('Predicted class:', 'Will Purchase (1)' if pred == 1 else 'Will Not Purchase (0)')
            else:
                st.write('Predicted class:', 'Will Purchase (1)' if pred == 1 else 'Will Not Purchase (0)')

        except Exception as e:
            st.error(f'Prediction failed: {e}')

# Optionally show a few sample rows from dataset and model prediction
st.sidebar.markdown('## Dataset samples')
if st.sidebar.button('Show 5 random samples'):
    st.sidebar.dataframe(raw_df.sample(5))

st.sidebar.markdown('Model & data paths')
st.sidebar.text(MODEL_PATH)
st.sidebar.text(PREPARED_X_PATH)

Writing tourism_project/deployment/app.py


## Dependency Handling

Please ensure that the dependency handling file is named `requirements.txt`.

In [34]:
%%writefile tourism_project/deployment/requirements.txt
pandas==2.2.2
huggingface_hub==0.32.6
streamlit==1.43.2
joblib==1.5.1
scikit-learn==1.6.0
xgboost==2.1.4
mlflow==3.0.1

Writing tourism_project/deployment/requirements.txt


# Hosting

In [35]:
os.makedirs("tourism_project/hosting", exist_ok=True)

In [37]:
%%writefile tourism_project/hosting/hosting.py
from huggingface_hub import HfApi
import os

api = HfApi(token=os.getenv("HF_TOKEN"))
api.upload_folder(
    folder_path="tourism_project/deployment",     # the local folder containing your files
    repo_id="huzaifa-sr/tourism-project",          # the target repo
    repo_type="space",                      # dataset, model, or space
    path_in_repo="",                          # optional: subfolder path inside the repo
)

Overwriting tourism_project/hosting/hosting.py


# MLOps Pipeline with Github Actions Workflow

**Note:**

1. Before running the file below, make sure to add the HF_TOKEN to your GitHub secrets to enable authentication between GitHub and Hugging Face.
2. The below code is for a sample YAML file that can be updated as required to meet the requirements of this project.

```
name: Tourism pipeline

on:
  push:
    branches:
      - main  # Automatically triggers on push to the main branch

jobs:

  register-dataset:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install Dependencies
        run: pip install -r tourism_project/requirements.txt
      - name: Upload Dataset to Hugging Face Hub
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: python tourism_project/model_building/data_register.py

  data-prep:
    needs: register-dataset
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install Dependencies
        run: pip install -r tourism_project/requirements.txt
      - name: Run Data Preparation
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: python tourism_project/model_building/prep.py


  model-traning:
    needs: data-prep
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install Dependencies
        run: pip install -r tourism_project/requirements.txt
      - name: Start MLflow Server
        run: |
          nohup mlflow ui --host 0.0.0.0 --port 5000 &  # Run MLflow UI in the background
          sleep 5  # Wait for a moment to let the server starts
      - name: Model Building
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: python tourism_project/model_building/train.py


  deploy-hosting:
    runs-on: ubuntu-latest
    needs: [model-traning,data-prep,register-dataset]
    steps:
      - uses: actions/checkout@v3
      - name: Install Dependencies
        run: pip install -r tourism_project/requirements.txt
      - name: Push files to Frontend Hugging Face Space
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: python tourism_project/hosting/hosting.py

```

**Note:** To use this YAML file for our use case, we need to

1. Go to the GitHub repository for the project
2. Create a folder named ***.github/workflows/***
3. In the above folder, create a file named ***pipeline.yml***
4. Copy and paste the above content for the YAML file into the ***pipeline.yml*** file

## Requirements file for the Github Actions Workflow

In [38]:
%%writefile tourism_project/requirements.txt
huggingface_hub==0.32.6
datasets==3.6.0
pandas==2.2.2
scikit-learn==1.6.0
xgboost==2.1.4
mlflow==3.0.1

Writing tourism_project/requirements.txt


## Github Authentication and Push Files

* Before moving forward, we need to generate a secret token to push files directly from Colab to the GitHub repository.
* Please follow the below instructions to create the GitHub token:
    - Open your GitHub profile.
    - Click on ***Settings***.
    - Go to ***Developer Settings***.
    - Expand the ***Personal access tokens*** section and select ***Tokens (classic)***.
    - Click ***Generate new token***, then choose ***Generate new token (classic)***.
    - Add a note and select all required scopes.
    - Click ***Generate token***.
    - Copy the generated token and store it safely in a notepad.

In [89]:
# Install Git
!apt-get install git

# Set your Git identity (replace with your details)
!git config --global user.email "huzaifa.sr@gmail.com"
!git config --global user.name "huzaifasr"

# Clone your GitHub repository
!git clone https://github.com/huzaifasr/tourism-project.git

# Move your folder to the repository directory
!mv /content/tourism_project/ /content/tourism-project

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git is already the newest version (1:2.34.1-1ubuntu1.15).
0 upgraded, 0 newly installed, 0 to remove and 38 not upgraded.
fatal: destination path 'tourism-project' already exists and is not an empty directory.


In [90]:
!git add data/*

In [85]:
!git add deployment/*

In [92]:
!git add hosting/*

In [72]:
!git add model_building/*

In [93]:
!git add deployment/*

In [94]:
# Commit the changes
!git commit -m "Add tourism_project files1"

On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	[31mtourism-project/[m
	[31mtourism_project/[m

nothing added to commit but untracked files present (use "git add" to track)


In [76]:
# Change directory to the cloned repository
# %cd tourism-project/ # Removed this line as it's causing an error and the directory should already be correct

# Add the new folder to Git
!git add .

# Commit the changes
!git commit -m "Add tourism_project files"

# Push to GitHub using the token in the URL (replace YOUR_GITHUB_PERSONAL_ACCESS_TOKEN)
!git push https://huzaifasr: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@github.com/huzaifasr/tourism-project.git main

error: 'tourism-project/' does not have a commit checked out
fatal: adding files failed
On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	[31mtourism-project/[m

nothing added to commit but untracked files present (use "git add" to track)
Enumerating objects: 22, done.
Counting objects: 100% (22/22), done.
Delta compression using up to 2 threads
Compressing objects: 100% (19/19), done.
Writing objects: 100% (21/21), 188.45 KiB | 3.43 MiB/s, done.
Total 21 (delta 0), reused 0 (delta 0), pack-reused 0
To https://github.com/huzaifasr/tourism-project.git
   046c204..e1d9a95  main -> main


# Output Evaluation

- GitHub (link to repository, screenshot of folder structure and executed workflow)

- Streamlit on Hugging Face (link to HF space, screenshot of Streamlit app)

<font size=6 color="navyblue">Power Ahead!</font>
___