<a id="top"></a>

## Table of Contents

* [0. Header Section](#0)
    - [0.1 Project title, chosen domain, and team members](#0.1)
    - [0.2 Business motivation and problem statement](#0.2)
    - [0.3 Dataset description and target categories](#0.3)

* [1. Setup and Configuration](#1)
    - [1.1 Import required libraries (pandas, sklearn, nltk, xgboost, torch)](#1.1)
    - [1.2 Environment configuration and random seeds](#1.2)
    - [1.3 Helper functions for preprocessing, visualization, and evaluation](#1.3)

* [2. Data Understanding and Preprocessing](#2)
    - [2.1 Load and inspect the dataset `jason23322/high-accuracy-email-classifier`](#2.1)
    - [2.2 Clean text (remove HTML, punctuation, stopwords, lowercasing)](#2.2)
    - [2.3 Lemmatization / Tokenization with NLTK or spaCy](#2.3)
    - [2.4 Convert text to TF-IDF features](#2.4)
    - [2.5 Dimensionality Reduction with PCA for visualization](#2.5)

* [3. Exploratory Data Analysis (EDA)](#3)
    - [3.1 Analyze class distribution across 6 email categories](#3.1)
    - [3.2 Keyword frequency, message length, and term correlation](#3.2)
    - [3.3 Visualize TF-IDF and PCA projections in 2D space](#3.3)

* [4. Unsupervised Learning (Clustering)](#4)
    - [4.1 Apply K-Means clustering on TF-IDF features](#4.1)
    - [4.2 Determine optimal `k` using Elbow, Silhouette, Davies–Bouldin](#4.2)
    - [4.3 Visualize and interpret clusters (PCA / t-SNE)](#4.3)

* [5. Supervised Machine Learning Models](#5)
    - [5.1 Decision Tree Classifier (baseline)](#5.1)
    - [5.2 Random Forest (Bagging Ensemble)](#5.2)
    - [5.3 XGBoost (Boosting Ensemble)](#5.3)
    - [5.4 Stacking Ensemble (meta-learner over RF, XGB, etc.)](#5.4)
    - [5.5 Evaluation: Accuracy, Precision, Recall, F1, ROC-AUC](#5.5)
    - [5.6 Feature importance / SHAP](#5.6)

* [6. Deep Learning Model (Neural Network)](#6)
    - [6.1 Build Feed-Forward / 1D-CNN / LSTM (PyTorch)](#6.1)
    - [6.2 Inputs: TF-IDF or embeddings](#6.2)
    - [6.3 Train/validate and visualize loss/accuracy](#6.3)
    - [6.4 Compare NN vs. ensembles (incl. Stacking)](#6.4)

* [7. Dimensionality Reduction and Visualization](#7)
    - [7.1 PCA on high-dimensional TF-IDF](#7.1)
    - [7.2 Explained variance plots](#7.2)
    - [7.3 t-SNE for non-linear structure](#7.3)

* [8. Integration of LLM / Generative AI (Optional)](#8)
    - [8.1 LLM assistance (cluster summaries, error analysis)](#8.1)
    - [8.2 Synthetic email generation for data balance](#8.2)
    - [8.3 Compare manual vs. LLM-augmented preprocessing](#8.3)

* [9. Results and Discussion](#9)
    - [9.1 Performance comparison: DT, RF, XGB, **Stacking**, NN](#9.1)
    - [9.2 Confusion matrices and error analysis](#9.2)
    - [9.3 Cluster–label alignment insights](#9.3)
    - [9.4 Limitations and future work](#9.4)

* [10. Business Insights and Recommendations](#10)
    - [10.1 Productivity gains from auto-categorization](#10.1)
    - [10.2 Inbox/CRM workflow automation](#10.2)
    - [10.3 Governance & explainability](#10.3)

* [11. Deployment (FastAPI + Streamlit)](#11)
    - [11.1 FastAPI `/predict` endpoint for inference](#11.1)
    - [11.2 Streamlit UI (text box → predicted category + probabilities)](#11.2)
    - [11.3 Live demo pipeline: input → TF-IDF → model → category](#11.3)

* [12. Appendices and Deliverables](#12)
    - [12.1 Source notebooks, trained models, config](#12.1)
    - [12.2 API URLs/keys and dataset files](#12.2)
    - [12.3 Slides and references](#12.3)


In [3]:
from huggingface_hub import login
import pandas as pd
login()



In [None]:
# Login using e.g. `huggingface-cli login` to access this dataset
splits = {'train': 'train.json', 'test': 'test.json'}
train_df = pd.read_json("hf://datasets/jason23322/high-accuracy-email-classifier/" + splits["train"])
test_df = pd.read_json("hf://datasets/jason23322/high-accuracy-email-classifier/" + splits["test"])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [5]:
# Create the OpenAI client
import os
from openai import OpenAI

OPENAI_MODEL_NAME = 'gpt-4o-mini'

def create_secure_openai_client():
    """
    Create OpenAI client with secure API key handling.

    This function:
    1. Looks for OPENAI_API_KEY in environment variables (loading .env if available)
    2. Falls back to Colab Secrets when the variable is not set
    3. Returns the client, or None if no key is found or client creation fails
    """

    try:
        from dotenv import load_dotenv
        load_dotenv()  # Load .env file if it exists
    except ImportError:
        pass  # python-dotenv not installed, that's okay

    api_key = os.getenv('OPENAI_API_KEY')

    if not api_key:
        try:
            from google.colab import userdata
            api_key = userdata.get('OPENAI_API_KEY')
            print("✅ API key loaded from Colab Secrets")
        except ImportError:
            pass
        except Exception as e:
            print(f"🔍 Colab Secret not found: {e}")

    if not api_key:
        print("⚠️ No OpenAI API key found.")
        print("💡 Set environment variable: OPENAI_API_KEY=your_key")
        print("💡 Or create .env file with: OPENAI_API_KEY=your_key")
        return None

    try:
        client = OpenAI(api_key=api_key)
        print("✅ OpenAI client created successfully")
        return client
    except Exception as e:
        print(f"❌ OpenAI client creation failed: {e}")
        print("🔍 Check your API key and internet connection")
        return None

# Initialize the client
client = create_secure_openai_client()

✅ API key loaded from Colab Secrets
✅ OpenAI client created successfully


In [None]:
filtered_train_df = train_df[train_df['category'] != 'spam']


In [None]:
# View categories
print(filtered_train_df['category'].value_counts())



category
forum           1800
verify_code     1800
promotions      1796
social_media    1796
updates         1794
Name: count, dtype: int64


# Email Feature Extractor

In [6]:
from pydantic import BaseModel, Field, field_validator
from typing import Optional, List
from datetime import datetime, date, time
from enum import Enum

class UrgencyLevel(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

class RecurrencePattern(str, Enum):
    NONE = "none"
    DAILY = "daily"
    WEEKLY = "weekly"
    MONTHLY = "monthly"
    YEARLY = "yearly"
    CUSTOM = "custom"

class EventType(str, Enum):
    APPOINTMENT = "appointment"
    MEETING = "meeting"
    DEADLINE = "deadline"
    MAINTENANCE = "maintenance"
    PAYMENT = "payment"
    VERIFICATION = "verification"
    NOTIFICATION = "notification"
    REMINDER = "reminder"
    FINAL = "final"
    OTHER = "other"

class ActionRequirement(str, Enum):
    CONFIRM = "confirm"
    REPLY = "reply"
    PAY = "pay"
    VERIFY = "verify"
    CLICK = "click"
    DOWNLOAD = "download"
    COMPLETE = "complete"
    REVIEW = "review"
    NONE = "none"

class LocationType(str, Enum):
    PHYSICAL = "physical"
    VIRTUAL = "virtual"
    HYBRID = "hybrid"
    NONE = "none"

class EmailFeatures(BaseModel):
    """Pydantic model for extracting structured features from email content."""

    email_text: Optional[str] = Field(None, description="Full email text")

    # Date/Time fields
    scheduled_datetime: Optional[datetime] = Field(None, description="Extracted date and time")
    date_text: Optional[str] = Field(None, description="Raw date/time text")
    date_from: Optional[date] = Field(None, description="Start date (YYYY-MM-DD)")
    date_to: Optional[date] = Field(None, description="End date (YYYY-MM-DD)")
    time_from: Optional[time] = Field(None, description="Start time (HH:MM:SS 24-hour)")
    time_to: Optional[time] = Field(None, description="End time (HH:MM:SS 24-hour)")
    has_complete_datetime: bool = Field(False, description="True if both date and time present")

    # Location
    location: Optional[str] = Field(None, description="Meeting location")
    meeting_url: Optional[str] = Field(None, description="Virtual meeting URL")
    maps_url: Optional[str] = Field(None, description="Maps URL")
    coordinates: Optional[str] = Field(None, description="Coordinates")
    location_type: Optional[LocationType] = Field(None, description="Location type")  # ← Changed to Optional

    # Event
    event_type: Optional[EventType] = Field(None, description="Event type")  # ← Changed to Optional
    event_confidence: Optional[float] = Field(None, ge=0.0, le=1.0, description="Event confidence")  # ← Changed to Optional

    # Urgency
    urgency_level: Optional[UrgencyLevel] = Field(None, description="Urgency level")  # ← Changed to Optional
    urgency_score: Optional[float] = Field(None, ge=0.0, le=1.0, description="Urgency score")  # ← Changed to Optional
    urgency_indicators: List[str] = Field(default_factory=list, description="Urgency phrases")

    # Recurrence
    recurrence_pattern: Optional[RecurrencePattern] = Field(None, description="Recurrence")  # ← Changed to Optional
    recurrence_text: Optional[str] = Field(None, description="Recurrence text")

    # Action
    action_required: Optional[ActionRequirement] = Field(None, description="Action required")  # ← Changed to Optional
    action_deadline: Optional[datetime] = Field(None, description="Action deadline")
    action_confidence: Optional[float] = Field(None, ge=0.0, le=1.0, description="Action confidence")  # ← Changed to Optional
    action_phrases: List[str] = Field(default_factory=list, description="Action phrases")

    # Metadata
    contains_links: bool = Field(False, description="Contains links")
    contains_attachments: bool = Field(False, description="Contains attachments")
    financial_amount: Optional[str] = Field(None, description="Financial amounts")

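    # Validators that replace an explicit None (e.g. JSON null) with a default.
    # Note: Pydantic does not run validators on field defaults unless
    # validate_default=True, so fields omitted from the input stay None.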
    @field_validator('location_type', mode='before')
    @classmethod
    def set_location_type_default(cls, v):
        return v if v is not None else LocationType.NONE

    @field_validator('event_type', mode='before')
    @classmethod
    def set_event_type_default(cls, v):
        return v if v is not None else EventType.OTHER

    @field_validator('urgency_level', mode='before')
    @classmethod
    def set_urgency_level_default(cls, v):
        return v if v is not None else UrgencyLevel.LOW

    @field_validator('recurrence_pattern', mode='before')
    @classmethod
    def set_recurrence_pattern_default(cls, v):
        return v if v is not None else RecurrencePattern.NONE

    @field_validator('action_required', mode='before')
    @classmethod
    def set_action_required_default(cls, v):
        # Handle boolean True being passed (OpenAI bug)
        if v is True:
            return ActionRequirement.NONE
        return v if v is not None else ActionRequirement.NONE

    @field_validator('event_confidence', 'urgency_score', 'action_confidence', mode='before')
    @classmethod
    def set_score_default(cls, v):
        return v if v is not None else 0.0

    # Pydantic v2 configuration: store enum fields as their plain string values.
    # datetime/date/time already serialize to ISO 8601 in JSON mode, so the
    # deprecated v1-style `json_encoders` are not needed.
    model_config = {"use_enum_values": True}
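
A subtlety of the "default when None" validators above is easiest to see on a tiny model. The sketch below assumes Pydantic v2 (implied by the use of `field_validator`); the `Level`, `Lenient`, and `Eager` names are hypothetical and exist only for this illustration. It shows that a `mode='before'` validator fires for an explicit `None` but not for an omitted field, unless `validate_default=True` is set on the field. This may be why some rows in the results further down print `None` for event type and urgency.

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field, field_validator


class Level(str, Enum):
    LOW = "low"
    HIGH = "high"


class Lenient(BaseModel):
    """Same pattern as EmailFeatures: Optional field plus a 'before' validator."""
    level: Optional[Level] = Field(None)

    @field_validator("level", mode="before")
    @classmethod
    def fill_level(cls, v):
        return v if v is not None else Level.LOW


class Eager(BaseModel):
    """Identical, except the default itself is also passed through validation."""
    level: Optional[Level] = Field(None, validate_default=True)

    @field_validator("level", mode="before")
    @classmethod
    def fill_level(cls, v):
        return v if v is not None else Level.LOW


print(Lenient().level)            # field omitted: validator is skipped
print(Lenient(level=None).level)  # explicit None: validator substitutes LOW
print(Eager().level)              # default is validated, so LOW is substituted
```

If the intent is for an empty `EmailFeatures()` to carry the same defaults as a parsed response, setting `validate_default=True` (per field or in the model config) is one way to get there.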

In [7]:

#  Define the extraction function

import json

def extract_email_features(email_text: str, subject: str = "") -> EmailFeatures:
    """Extract structured features from email using OpenAI API."""

    if not client:
        raise ValueError("OpenAI client not initialized.")

    full_text = f"Subject: {subject}\n\nBody: {email_text}" if subject else email_text

    system_prompt =  """You are an expert email analyzer. Extract structured information from emails and return it in the specified JSON format.

Focus on identifying:
1. Scheduled dates/times (appointments, deadlines, events) - extract date ranges and time ranges
2. Urgency indicators (urgent, asap, now, today, deadline, final notice, etc.)
3. Event types (meetings, payments, verifications, etc.)
4. Required actions (confirm, reply, pay, verify, etc.)
5. Recurrence patterns (daily, weekly, monthly, etc.)
6. Financial amounts and deadlines
7. Location information (physical addresses, venue names, virtual meeting URLs, coordinates)

For dates and times:
- Extract start and end dates separately (date_from and date_to in YYYY-MM-DD format)
- Extract start and end times separately (time_from and time_to in HH:MM:SS 24-hour format)
- If only one date mentioned, use same value for both date_from and date_to
- If only one time mentioned, use same value for both time_from and time_to
- Convert 12-hour format to 24-hour (1 PM = 13:00:00, 2:30 PM = 14:30:00, etc.)
- Set has_complete_datetime to true only if BOTH date AND time are present

Return valid JSON matching the EmailFeatures schema exactly."""

    user_prompt = f"""Analyze this email and extract structured features:

{full_text}

Return a JSON object with these fields:

DATE AND TIME FIELDS (NEW - IMPORTANT):
- date_from: start date in YYYY-MM-DD format (e.g., "2025-11-15"), null if no date
- date_to: end date in YYYY-MM-DD format (same as date_from if single date), null if no date
- time_from: start time in HH:MM:SS 24-hour format (e.g., "13:00:00" for 1 PM), null if no time
- time_to: end time in HH:MM:SS 24-hour format (same as time_from if single time), null if no time
- has_complete_datetime: boolean - true ONLY if both date and time are present, false otherwise

LEGACY DATE/TIME FIELDS:
- scheduled_datetime: ISO datetime string if specific date/time mentioned, null otherwise
- date_text: raw text containing date/time info, null if none

URGENCY:
- urgency_level: one of [low, medium, high, critical]
- urgency_score: float 0.0-1.0
- urgency_indicators: array of urgency phrases found

LOCATION:
- location: meeting location, address, or venue name, null if none
- meeting_url: virtual meeting URL (Zoom, Teams, etc.), null if none
- maps_url: Google Maps or other map service URL, null if none
- coordinates: geographic coordinates (latitude, longitude), null if none
- location_type: one of [physical, virtual, hybrid, none]

EVENT:
- event_type: one of [appointment, meeting, deadline, maintenance, payment, verification, notification, reminder, final, other]
- event_confidence: float 0.0-1.0

RECURRENCE:
- recurrence_pattern: one of [none, daily, weekly, monthly, yearly, custom]
- recurrence_text: raw recurrence text, null if none

ACTION:
- action_required: one of [confirm, reply, pay, verify, click, download, complete, review, none]
- action_deadline: ISO datetime for action deadline, null if none
- action_confidence: float 0.0-1.0
- action_phrases: array of action-indicating phrases

METADATA:
- contains_links: boolean
- contains_attachments: boolean
- financial_amount: string of any monetary amounts, null if none

EXAMPLES OF TIME CONVERSION:
- "1 PM" or "13" → "13:00:00"
- "2:30 PM" → "14:30:00"
- "9 AM" → "09:00:00"
- "midnight" → "00:00:00"
- "noon" → "12:00:00"

EXAMPLES OF DATE EXTRACTION:
- "Meeting on Nov 15, 2025" → date_from: "2025-11-15", date_to: "2025-11-15"
- "Conference from Dec 1-3" → date_from: "2025-12-01", date_to: "2025-12-03"
- "this week in 2021" → extract specific date if possible, otherwise null

EXAMPLES OF has_complete_datetime:
- Has date "Nov 15" and time "2 PM" → has_complete_datetime: true
- Has only date "Nov 15" → has_complete_datetime: false
- Has only time "2 PM" → has_complete_datetime: false
- No date or time → has_complete_datetime: false"""
    try:
        response = client.chat.completions.create(
            model=OPENAI_MODEL_NAME,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0.1,
            max_tokens=1500,
            response_format={"type": "json_object"}
        )

        result_dict = json.loads(response.choices[0].message.content)
        return EmailFeatures(**result_dict)

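    except Exception as e:
        # Any API, JSON, or validation error is swallowed here and an empty
        # EmailFeatures is returned, so callers never see an exception.
        print(f"Error: {e}")
        return EmailFeatures()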
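
Because `extract_email_features` catches every exception and returns an empty `EmailFeatures()`, a caller cannot tell a failed extraction from an email that simply has no features. One possible alternative is to return the error alongside the result. The sketch below is a standard-library illustration only: `try_extract` and `fake_extract` are hypothetical names, and `fake_extract` stands in for the real API-backed extractor.

```python
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def try_extract(extract: Callable[[str], T], text: str) -> Tuple[Optional[T], Optional[str]]:
    """Run `extract` and return (result, error) instead of hiding the failure."""
    try:
        return extract(text), None
    except Exception as exc:  # broad on purpose: API, JSON, and validation errors
        return None, f"{type(exc).__name__}: {exc}"


def fake_extract(text: str) -> dict:
    """Hypothetical stand-in for the real extractor: fails on empty input."""
    if not text.strip():
        raise ValueError("empty email")
    return {"contains_links": "http" in text}


results, failed = [], []
for i, text in enumerate(["see http://example.com", "   "]):
    features, error = try_extract(fake_extract, text)
    if error is not None:
        failed.append((i, error))  # keep the position and the reason
    results.append(features)

print(results)
print(failed)
```

With this shape, a success/failure tally reflects real failures rather than only exceptions raised outside the extractor.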


In [None]:
#  Process emails and save results


# Process 100 emails
subset_df = filtered_train_df.head(100)
print(f"Extracting features for {len(subset_df)} emails...\n")

all_features = []
failed_extractions = []

# Count processed rows with enumerate: the DataFrame index is not guaranteed
# to be contiguous after the spam rows have been filtered out.
for n, (idx, row) in enumerate(subset_df.iterrows(), start=1):
    try:
        # Extract features
        features = extract_email_features(row['body'], row.get('subject', ''))
        features.email_text = f"Subject: {row.get('subject', '')}\n\nBody: {row.get('body', '')}"
        all_features.append(features)

        # Progress
        if n % 20 == 0:
            print(f"Processed {n}/{len(subset_df)} emails")

    except Exception as e:
        print(f"Failed email {idx}: {e}")
        failed_extractions.append(idx)
        all_features.append(EmailFeatures(
            email_text=f"Subject: {row.get('subject', '')}\n\nBody: {row.get('body', '')}"
        ))

print(f"\n{'='*60}")
print(f"Completed! Success: {len(all_features) - len(failed_extractions)}, Failed: {len(failed_extractions)}")
print(f"{'='*60}\n")

# Display 5 samples
print("Sample extracted features (first 5):\n")
for i in range(min(5, len(all_features))):
    f = all_features[i]
    print(f"{'='*60}")
    print(f"Email {i+1}")
    print(f"{'='*60}")
    print(f"Date: {f.date_from} to {f.date_to}")
    print(f"Time: {f.time_from} to {f.time_to}")
    print(f"Complete DateTime: {f.has_complete_datetime}")
    print(f"Location: {f.location or 'N/A'}")
    print(f"Event: {f.event_type} (confidence: {f.event_confidence})")
    print(f"Urgency: {f.urgency_level} (score: {f.urgency_score})")
    print(f"Action: {f.action_required}")
    print(f"Recurrence: {f.recurrence_pattern}")
    print(f"Financial: {f.financial_amount or 'N/A'}")
    print(f"Links: {f.contains_links} | Attachments: {f.contains_attachments}")
    print(f"{'='*60}\n")

# Save results
print("Saving results")
features_data = [f.model_dump() for f in all_features]
df_features = pd.DataFrame(features_data)

df_features.to_csv('email_features.csv', index=False)
df_features.to_parquet('email_features.parquet', index=False)
print("Saved to email_features.csv and email_features.parquet")

if failed_extractions:
    with open('failed_extractions.json', 'w') as f:
        json.dump(failed_extractions, f)
    print(f"Saved {len(failed_extractions)} failed extraction indices")

Extracting features for 100 emails...

Processed 20/100 emails
Processed 40/100 emails
Processed 60/100 emails
Processed 80/100 emails
Processed 100/100 emails

Completed! Success: 100, Failed: 0

Sample extracted features (first 5):

Email 1
Date: None to None
Time: None to None
Complete DateTime: False
Location: N/A
Event: None (confidence: None)
Urgency: None (score: None)
Action: None
Recurrence: None
Financial: N/A
Links: False | Attachments: False

Email 2
Date: None to None
Time: None to None
Complete DateTime: False
Location: N/A
Event: other (confidence: 0.5)
Urgency: low (score: 0.0)
Action: click
Recurrence: none
Financial: N/A
Links: True | Attachments: False

Email 3
Date: None to None
Time: None to None
Complete DateTime: False
Location: N/A
Event: notification (confidence: 0.5)
Urgency: low (score: 0.0)
Action: none
Recurrence: none
Financial: N/A
Links: True | Attachments: False

Email 4
Date: None to None
Time: None to None
Complete DateTime: False
Location: Downtown C

In [None]:
import json
import pickle

print("\nSaving in additional formats...")

# Already saved: CSV and Parquet

# Save as JSON (human-readable)
df_features.to_json('email_features.json', orient='records', indent=2, date_format='iso')
print("Saved as JSON: email_features.json")

# Save as Pickle: preserves the original Pydantic objects, best for loading back in Python
with open('email_features.pkl', 'wb') as f:
    pickle.dump(all_features, f)
print("Saved as Pickle: email_features.pkl")

print(f"\nTotal records saved: {len(df_features)}")
print(f"Files created:")
print(f"email_features.csv (for Excel)")
print(f"email_features.parquet (for Python/Pandas)")
print(f"email_features.json (human-readable)")
print(f"email_features.pkl (Pydantic objects)")
print(f"failed_extractions.json (error log)")


Saving in additional formats...
Saved as JSON: email_features.json
Saved as Pickle: email_features.pkl

Total records saved: 100
Files created:
email_features.csv (for Excel)
email_features.parquet (for Python/Pandas)
email_features.json (human-readable)
email_features.pkl (Pydantic objects)
failed_extractions.json (error log)


In [None]:
# Load and check the data

import pandas as pd
import pickle
import json
from datetime import datetime, date, time

# Method 1: Load from CSV
df = pd.read_csv('email_features.csv')
print("Loaded from CSV")
print(f"Shape: {df.shape}")
print(f"\nColumns:\n{df.columns.tolist()}")
print(f"\nFirst few rows:\n{df.head()}")

# Method 2: Load from Parquet (recommended; preserves data types)
df_parquet = pd.read_parquet('email_features.parquet')
print("\n" + "="*60)
print("Loaded from Parquet")
print(f"Shape: {df_parquet.shape}")

# Method 3: Load original Pydantic objects from Pickle
with open('email_features.pkl', 'rb') as f:
    loaded_features = pickle.load(f)
print("\n" + "="*60)
print("Loaded Pydantic objects from Pickle")
print(f"Total objects: {len(loaded_features)}")
print(f"First object type: {type(loaded_features[0])}")

Loaded from CSV
Shape: (100, 27)

Columns:
['email_text', 'scheduled_datetime', 'date_text', 'date_from', 'date_to', 'time_from', 'time_to', 'has_complete_datetime', 'location', 'meeting_url', 'maps_url', 'coordinates', 'location_type', 'event_type', 'event_confidence', 'urgency_level', 'urgency_score', 'urgency_indicators', 'recurrence_pattern', 'recurrence_text', 'action_required', 'action_deadline', 'action_confidence', 'action_phrases', 'contains_links', 'contains_attachments', 'financial_amount']

First few rows:
                                          email_text scheduled_datetime  \
0  Subject: Anniversary Special: Buy one get one ...                NaN   
1  Subject: Digital Ritual Experience Creation\n\...                NaN   
2  Subject: Your post was moved to "Programming H...                NaN   
3  Subject: Memories from this week in 2021\n\nBo...                NaN   
4  Subject: Two-step verification code: 426706\n\...                NaN   

           date_text date
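
The remark that Parquet preserves data types matters most for the list-valued columns (`urgency_indicators`, `action_phrases`): after a CSV round trip they come back as plain strings. The sketch below uses a small made-up frame rather than the real extraction results, and shows one way to restore the lists with `ast.literal_eval`.

```python
import ast
import io

import pandas as pd

# Toy stand-in for df_features (values are invented for the illustration)
original = pd.DataFrame({
    "urgency_indicators": [["urgent", "asap"], []],
    "contains_links": [True, False],
})

# Round-trip through CSV in memory instead of writing a file
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

# The list column is now text such as "['urgent', 'asap']"
print(type(from_csv["urgency_indicators"][0]))

# literal_eval safely parses Python literals back into lists
restored = from_csv["urgency_indicators"].apply(ast.literal_eval)
print(restored.tolist())
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for this kind of recovery.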

In [None]:
# Analyze the extracted features

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
df = pd.read_parquet('email_features.parquet')

print("DATA ANALYSIS")
print("="*60)

# 1. Basic statistics
print("\n1. Dataset Overview:")
print(f"   Total emails: {len(df)}")
print(f"   Total columns: {len(df.columns)}")

# 2. Check new date/time fields
print("\nDate/Time Field Coverage:")
print(f"   Emails with date_from: {df['date_from'].notna().sum()}")
print(f"   Emails with date_to: {df['date_to'].notna().sum()}")
print(f"   Emails with time_from: {df['time_from'].notna().sum()}")
print(f"   Emails with time_to: {df['time_to'].notna().sum()}")
print(f"   Emails with complete datetime: {df['has_complete_datetime'].sum()}")

# 3. Event type distribution
print("\nEvent Type Distribution:")
print(df['event_type'].value_counts())

# 4. Urgency level distribution
print("\nUrgency Level Distribution:")
print(df['urgency_level'].value_counts())

# 5. Action required distribution
print("\nAction Required Distribution:")
print(df['action_required'].value_counts())

# 6. Location type distribution
print("\nLocation Type Distribution:")
print(df['location_type'].value_counts())

# 7. Additional metadata
print("\nAdditional Metadata:")
print(f"   Emails with links: {df['contains_links'].sum()}")
print(f"   Emails with attachments: {df['contains_attachments'].sum()}")
print(f"   Emails with financial amounts: {df['financial_amount'].notna().sum()}")

# 8. Check for missing values in key fields
print("\nMissing Values in Key Fields:")
key_fields = ['date_from', 'date_to', 'time_from', 'time_to', 'has_complete_datetime',
              'event_type', 'urgency_level', 'action_required']
missing_data = df[key_fields].isna().sum()
print(missing_data)

DATA ANALYSIS

1. Dataset Overview:
   Total emails: 100
   Total columns: 27

Date/Time Field Coverage:
   Emails with date_from: 13
   Emails with date_to: 13
   Emails with time_from: 14
   Emails with time_to: 14
   Emails with complete datetime: 9

Event Type Distribution:
event_type
notification    52
verification    16
other           13
maintenance      4
meeting          2
reminder         2
appointment      2
final            1
payment          1
Name: count, dtype: int64

Urgency Level Distribution:
urgency_level
low       66
medium    19
high       8
Name: count, dtype: int64

Action Required Distribution:
action_required
none        41
click       31
verify      11
review       3
reply        3
confirm      2
pay          1
complete     1
Name: count, dtype: int64

Location Type Distribution:
location_type
none        77
virtual     13
physical     3
Name: count, dtype: int64

Additional Metadata:
   Emails with links: 72
   Emails with attachments: 0
   Emails with finan

Check date and time

In [None]:
import re

# Example patterns for date and time
date_pattern = r'\b(?:\d{1,2}[/-]\d{1,2}(?:[/-]\d{2,4})?|\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s*\d{1,2}\b)\b'
time_pattern = r'\b(?:[01]?\d|2[0-3]):[0-5]\d\b|\b(?:1[0-2]|0?[1-9]) ?(?:AM|PM|am|pm)\b'

# Detect date and time separately
has_date = df['email_text'].str.contains(date_pattern, case=False, regex=True, na=False)
has_time = df['email_text'].str.contains(time_pattern, case=False, regex=True, na=False)

# Detect both
has_both = has_date & has_time

# Count and preview
print("Rows with both date and time:", has_both.sum())
filtered_train_df_withTimeAndDate = df[has_both]

Rows with both date and time: 6
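
The regex check finds 6 rows with both a date and a time, while the LLM marked 9 emails as having a complete datetime. A cross-tabulation is one way to see where the two signals disagree. The sketch below uses invented boolean values, not the notebook's data, and `regex_has_both` is a hypothetical column name for the `has_both` mask.

```python
import pandas as pd

# Toy flags for five emails (made up for the illustration)
toy = pd.DataFrame({
    "has_complete_datetime": [True, True, False, False, True],  # LLM output
    "regex_has_both":        [True, False, False, True, True],  # regex mask
})

# Rows: LLM flag, columns: regex flag
agreement = pd.crosstab(toy["has_complete_datetime"], toy["regex_has_both"])
print(agreement)

# The rows worth reading by hand are the ones where the two disagree
disagreements = toy[toy["has_complete_datetime"] != toy["regex_has_both"]]
print(disagreements)
```

On the real data, the same two lines would apply to `df['has_complete_datetime']` and `has_both`.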


In [None]:
print(filtered_train_df_withTimeAndDate['email_text'].iloc[2])

Subject: Appointment reminder: Eye Exam, Aug 15 3-5AM GMT

Body: New features: Security updates. Install now: service.com/status Changelog included.
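
To see what the two patterns actually latch onto, it helps to print the matched substrings rather than only a boolean. The sketch below reuses the patterns from the cell above with the standard `re` module; `find_date_time` is a hypothetical helper, and the second test string is invented to show that the date pattern is a heuristic that also accepts fractions.

```python
import re

# Same patterns as in the notebook cell above
date_pattern = r'\b(?:\d{1,2}[/-]\d{1,2}(?:[/-]\d{2,4})?|\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s*\d{1,2}\b)\b'
time_pattern = r'\b(?:[01]?\d|2[0-3]):[0-5]\d\b|\b(?:1[0-2]|0?[1-9]) ?(?:AM|PM|am|pm)\b'


def find_date_time(text):
    """Return the first date-like and time-like substrings (or None for each)."""
    d = re.search(date_pattern, text, flags=re.IGNORECASE)
    t = re.search(time_pattern, text, flags=re.IGNORECASE)
    return (d.group(0) if d else None, t.group(0) if t else None)


subject = "Appointment reminder: Eye Exam, Aug 15 3-5AM GMT"
print(find_date_time(subject))  # ('Aug 15', '5AM')

# Invented example: a fraction is picked up as a date
print(find_date_time("Answered 3/4 questions correctly"))  # ('3/4', None)
```

The sample email above also shows the limits of text-level detection: the subject carries a date and time, but the body is about an unrelated security update.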
