# MLOps Project: Wellness Tourism Package Prediction

## Business Context
"Visit with Us" aims to predict whether a customer will purchase the newly introduced Wellness Tourism Package. This project implements an MLOps pipeline to automate the workflow from data preparation to deployment.

## Objective
Design and deploy an MLOps pipeline on GitHub to automate the end-to-end workflow for predicting customer purchases.

## 1. Data Registration
**Goal:** Register the dataset on the Hugging Face Hub as a dataset repository.

import pandas as pd
from huggingface_hub import HfApi, HfFolder

# Configuration
DATASET_PATH = "data/Tourism.csv" # Ensure this file exists in your 'data' folder
HF_USERNAME = "YOUR_HF_USERNAME" # Replace with your Hugging Face username
HF_DATASET_REPO = f"{HF_USERNAME}/tourism-package-prediction"

# Initialize API
api = HfApi()
token = HfFolder.get_token()

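# A cached token only shows that a login happened; it does not confirm the account name.
if token:
    print(f"Hugging Face token found. Uploads will target {HF_DATASET_REPO}.")
else:
    print("No token found. Run `huggingface-cli login` in your terminal first.")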
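
The goal of this section is registration, but the cell above only sets up configuration. A minimal sketch of the upload step follows, assuming the raw CSV should be stored in the same dataset repository under a `raw/` prefix. The helper name `register_raw_dataset` and the `raw/Tourism.csv` path are illustrative assumptions, not part of the original notebook; `create_repo` and `upload_file` are the same `HfApi` methods this notebook uses for the processed files.

```python
import os


def register_raw_dataset(api, csv_path, repo_id, path_in_repo="raw/Tourism.csv"):
    """Upload the raw CSV to a Hugging Face dataset repository.

    `api` is expected to be a `huggingface_hub.HfApi` instance; the
    `raw/` prefix in `path_in_repo` is an assumption for illustration.
    Returns the path inside the repository on success.
    """
    if not os.path.isfile(csv_path):
        raise FileNotFoundError(f"Raw dataset not found: {csv_path}")

    # Create the dataset repository if it does not exist yet.
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)

    # Upload the untouched source file so later steps can be reproduced from it.
    api.upload_file(
        path_or_fileobj=csv_path,
        path_in_repo=path_in_repo,
        repo_id=repo_id,
        repo_type="dataset",
    )
    return path_in_repo
```

With the configuration above, the call would be `register_raw_dataset(api, DATASET_PATH, HF_DATASET_REPO)`, guarded by the same `if token:` check.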

## 2. Data Preparation
**Goal:** Load, clean, split, and upload processed data.

from sklearn.model_selection import train_test_split
import os

def prepare_data():
    # Load Data
    try:
        df = pd.read_csv(DATASET_PATH)
        print("Data loaded successfully.")
    except FileNotFoundError:
        print("Dataset not found. Please ensure 'data/Tourism.csv' exists.")
        return

    # Data Cleaning
    if 'CustomerID' in df.columns:
        df = df.drop(columns=['CustomerID'])
    
    # Handle Missing Values (Simple Imputation)
    # 'number' covers every numeric dtype, not only float64 and int64
    num_cols = df.select_dtypes(include='number').columns
    for col in num_cols:
        df[col] = df[col].fillna(df[col].median())
        
    cat_cols = df.select_dtypes(include=['object']).columns
    for col in cat_cols:
        modes = df[col].mode()
        if not modes.empty:  # an all-missing column has no mode to fill with
            df[col] = df[col].fillna(modes.iloc[0])
        
    # Split Data
    train, test = train_test_split(df, test_size=0.2, random_state=42)
    
    # Save Locally
    os.makedirs("data/processed", exist_ok=True)
    train.to_csv("data/processed/train.csv", index=False)
    test.to_csv("data/processed/test.csv", index=False)
    print("Data split and saved locally.")
    
    # Upload to Hugging Face
    if token:
        api.create_repo(repo_id=HF_DATASET_REPO, repo_type="dataset", exist_ok=True)
        api.upload_file(path_or_fileobj="data/processed/train.csv", path_in_repo="train.csv", repo_id=HF_DATASET_REPO, repo_type="dataset")
        api.upload_file(path_or_fileobj="data/processed/test.csv", path_in_repo="test.csv", repo_id=HF_DATASET_REPO, repo_type="dataset")
        print("Data uploaded to Hugging Face.")

prepare_data()
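
`prepare_data()` imputes on the full dataset before splitting, so the test rows contribute to the medians and modes used for filling. One alternative is to split first, with stratification, and then fill both parts using statistics computed from the training part only. The sketch below assumes the target column is `ProdTaken` (as in the training step) and that it has no missing values. `split_then_impute` is a hypothetical helper name.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_then_impute(df, target="ProdTaken", test_size=0.2, random_state=42):
    """Stratified split followed by imputation fitted on the training part only."""
    train, test = train_test_split(
        df,
        test_size=test_size,
        random_state=random_state,
        stratify=df[target],  # keep the class ratio equal in both parts
    )
    train, test = train.copy(), test.copy()

    feature_cols = [c for c in df.columns if c != target]
    for col in feature_cols:
        if pd.api.types.is_numeric_dtype(train[col]):
            fill_value = train[col].median()
        else:
            modes = train[col].mode()
            # An all-missing column has no mode; leave it untouched.
            if modes.empty:
                continue
            fill_value = modes.iloc[0]
        # The same training-derived value fills both parts.
        train[col] = train[col].fillna(fill_value)
        test[col] = test[col].fillna(fill_value)
    return train, test
```

The returned frames can be written to `data/processed/` exactly as before; only the order of splitting and imputing changes.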

## 3. Model Building & Experiment Tracking
**Goal:** Train a model, tune parameters, log experiments, and register the best model.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import joblib

HF_MODEL_REPO = f"{HF_USERNAME}/tourism-prediction-model"

def train_and_evaluate():
    try:
        train_df = pd.read_csv("data/processed/train.csv")
        test_df = pd.read_csv("data/processed/test.csv")
    except FileNotFoundError:
        print("Processed data not found.")
        return

    X_train = train_df.drop('ProdTaken', axis=1)
    y_train = train_df['ProdTaken']
    X_test = test_df.drop('ProdTaken', axis=1)
    y_test = test_df['ProdTaken']
    
    # Preprocessing (One-Hot Encoding)
    X_train = pd.get_dummies(X_train)
    X_test = pd.get_dummies(X_test)
    X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)
    
    # Model Training
    model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
    model.fit(X_train, y_train)
    
    # Evaluation
    predictions = model.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, predictions))
    print(classification_report(y_test, predictions))
    
    # Save Model
    os.makedirs("models", exist_ok=True)
    joblib.dump(model, "models/model.joblib")
    joblib.dump(list(X_train.columns), "models/features.joblib")
    
    # Register Model
    if token:
        api.create_repo(repo_id=HF_MODEL_REPO, exist_ok=True)
        api.upload_file(path_or_fileobj="models/model.joblib", path_in_repo="model.joblib", repo_id=HF_MODEL_REPO)
        api.upload_file(path_or_fileobj="models/features.joblib", path_in_repo="features.joblib", repo_id=HF_MODEL_REPO)
        print("Model registered on Hugging Face.")

train_and_evaluate()
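
The goal mentions tuning and experiment logging, while `train_and_evaluate()` fits a single fixed configuration. One way to add both is sketched below, assuming scikit-learn's `GridSearchCV` for the search and a plain JSON file as the experiment log. The grid values, the `f1` scoring choice, and the `experiments.json` path are illustrative assumptions; a dedicated experiment tracker could replace the JSON file.

```python
import json

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def tune_and_log(X_train, y_train, param_grid=None, log_path="experiments.json"):
    """Grid-search a random forest and write every trial to a JSON log."""
    if param_grid is None:
        # Illustrative grid; widen or narrow it to fit the compute budget.
        param_grid = {"n_estimators": [100, 200], "max_depth": [5, 10]}

    search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid,
        scoring="f1",  # assumption: the positive class (a purchase) matters most
        cv=3,
    )
    search.fit(X_train, y_train)

    # One record per parameter combination, with its mean cross-validated score.
    runs = [
        {"params": params, "mean_cv_f1": float(score)}
        for params, score in zip(
            search.cv_results_["params"], search.cv_results_["mean_test_score"]
        )
    ]
    with open(log_path, "w") as fh:
        json.dump({"best_params": search.best_params_, "runs": runs}, fh, indent=2)

    # best_estimator_ has already been refit on the full training data.
    return search.best_estimator_, search.best_params_
```

The returned estimator can be saved with `joblib.dump` and uploaded in place of the fixed model, and the JSON log can be uploaded next to it so each registered model carries its search history.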

## 4. Model Deployment
**Goal:** Deploy the model using Streamlit and Docker.

### Streamlit App (`app.py`)
The `app.py` file contains the code to load the model from Hugging Face and create a user interface for predictions.
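
Because training one-hot encodes the inputs and saves the resulting column list to `features.joblib`, the app has to rebuild the same columns before calling `predict`. The contents of `app.py` are not shown here, so the following is only a sketch of that alignment step. `build_model_input` is a hypothetical helper, and the field names in the usage note are placeholders.

```python
import pandas as pd


def build_model_input(record, feature_columns):
    """Turn one raw input record (a dict) into a single-row frame whose
    columns match the training-time feature list exactly."""
    frame = pd.get_dummies(pd.DataFrame([record]))
    # Add columns the model expects but this record did not produce,
    # drop any unseen ones, and restore the training column order.
    return frame.reindex(columns=feature_columns, fill_value=0)
```

In the app, `feature_columns` would come from `joblib.load` on the downloaded `features.joblib`, and the returned frame would be passed to `model.predict`. For example, `build_model_input({"Age": 30, "Gender": "Male"}, feature_columns)` yields one row with every expected dummy column present.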

### Dockerfile
The `Dockerfile` defines the environment for running the Streamlit app.

## 5. MLOps Pipeline with GitHub Actions
**Goal:** Automate the workflow.

The `.github/workflows/pipeline.yml` file defines the CI/CD pipeline that triggers on push to the main branch. It installs dependencies, runs data preparation, and trains the model.
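
On a CI runner there is no interactive `huggingface-cli login`, so the token has to reach the scripts another way, typically through a repository secret exposed as an environment variable. The sketch below makes that lookup explicit. It assumes the variable is named `HF_TOKEN`; the workflow file is not shown here, so that name is an assumption, and `resolve_hf_token` is a hypothetical helper.

```python
import os


def resolve_hf_token(env=None, fallback=None):
    """Return the Hugging Face token for the current run, or None.

    Looks for the HF_TOKEN environment variable first (assumed to be set
    from a CI secret), then calls `fallback` if one is given, for example
    a function that reads the locally cached login token.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if token:
        return token
    if fallback is not None:
        return fallback() or None
    return None
```

In the notebook, `token = resolve_hf_token(fallback=HfFolder.get_token)` would keep local behavior unchanged, and the result can be passed explicitly with `HfApi(token=token)`.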

## Conclusion
This notebook demonstrates the end-to-end MLOps pipeline for the "Visit with Us" project, covering data management, model training, and automation setup.