# Case Study: Zomato Real-Time ETA Prediction

## Introduction
This case study examines the design and implementation of a real-time Estimated Time of Arrival (ETA) prediction system for Zomato, a food delivery service. The focus is on the data flow from inception to deployment, highlighting the various components and considerations needed to ensure accurate and timely predictions for food delivery times. Factors influencing ETA calculations include restaurant busy times, traffic conditions, order complexity, and rider availability.

---

## Key Factors in ETA Prediction
### Influencing Factors
The accuracy of ETA predictions can be influenced by various dynamic factors, including:
1. **Restaurant Busy Times**: The volume of orders affecting preparation times, particularly during peak hours.
2. **Traffic Conditions**: Variations in congestion and road conditions affecting delivery times.
3. **Order Complexity**: Specific menu items may require longer preparation or delivery durations than others.
4. **Rider Availability and Performance**: The number of active riders and their historical performance play a role in determining delivery speed.

---

## End-to-End Data Inception to Deployment

### Functional Requirements
1. **Accurate ETA Prediction**: Essential for customer satisfaction by providing reliable delivery time estimates.
2. **Real-Time Updates**: The system should revise the ETA as conditions change and push the updated estimate to the customer promptly.
3. **Dynamic Updates**: Updates should respond to various factors:
   - Changes in travel time
   - Restaurant preparation times
   - Order complexities
   - Rider availability
4. **Integration**: Seamless data flow among restaurants, riders, and customers to maintain efficient operations.

### Non-Functional Requirements
1. **Scalability**: The system must handle a growing user base and peak-hour order spikes without degraded performance.
2. **Availability**: High system uptime is crucial to maintain reliable service.
3. **Low Latency**: Minimal delay in ETA updates is critical to user experience.
4. **Reliability**: Consistent performance and dependability in predictions are a must.
5. **Data Security and Privacy**: Safeguarding personally identifiable information (PII) to ensure user trust and compliance with regulations.

---

## Data Sources and Structures
1. **Orders Data**: Contains critical details about each order.
   - **Attributes**: Order ID, items ordered, customer ID, restaurant ID, payment information, delivery address, special instructions.
   - **Structure**: Semi-structured (combination of tabular, text, and JSON data).

2. **Restaurant Data**: Details about the restaurants involved in the delivery.
   - **Attributes**: Restaurant ID (primary key), name, address, location (latitude and longitude), star ratings, type of cuisine, average preparation time (with nested structure for varied cuisines), operational hours, number of orders.
   - **Structure**: Structured.

3. **Rider Data**: Information on delivery riders.
   - **Attributes**: Rider ID, availability, historical delivery times, vehicle type, vehicle number, phone number, ratings, earnings.
   - **Structure**: Structured.

4. **Customer Data**: Information about customers using the app.
   - **Attributes**: Locations, names, interests, addresses, phone numbers, order history.
   - **Structure**: Structured.

5. **Traffic Data**: External factors that can influence ETA calculations.
   - **Attributes**: History of traffic, weather conditions, road network data, congestion times, average speed.
   - **Source**: Typically obtained from external APIs like Google Maps.
   - **Structure**: Semi-structured (API responses are typically JSON), with a format that varies by provider.
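
To make the difference between structured and semi-structured records concrete, here is a minimal sketch of two of the sources above as Python dataclasses. The field names are assumptions derived from the attribute lists, not Zomato's actual schema; the nested `items` list and the per-cuisine `avg_prep_minutes` mapping are what make these records awkward to store as flat tables.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Order:
    # Field names are illustrative assumptions based on the attribute list.
    order_id: str
    customer_id: str
    restaurant_id: str
    items: list                      # nested: [{"name": ..., "qty": ...}, ...]
    delivery_address: str
    special_instructions: str = ""   # free text


@dataclass
class Restaurant:
    restaurant_id: str               # primary key
    name: str
    latitude: float
    longitude: float
    # Nested structure: average preparation time in minutes per cuisine.
    avg_prep_minutes: dict = field(default_factory=dict)


order = Order(
    order_id="o-1",
    customer_id="c-7",
    restaurant_id="r-42",
    items=[{"name": "chicken biryani", "qty": 2}],
    delivery_address="221B Example Street",
    special_instructions="less spicy",
)

# Serialising to JSON is a typical step before landing the record in a data lake.
payload = json.dumps(asdict(order))
```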

---

## Storage Solutions
1. **Raw Data Lake**: **Amazon S3**
   - **General Storage**: S3 Standard holds raw data of any type (JSON events, logs, files).
   - **Glacier Storage**: Cold storage for rarely accessed data, with storage classes for instant retrieval, flexible retrieval, and deep archiving.

2. **Structured Data**: **Amazon Redshift (data warehouse) or Amazon RDS (relational database)**
   - For storing structured datasets that can be joined and queried effectively.

3. **Frequently Accessed Real-Time Data**: **DynamoDB, Firestore**
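   - Low-latency key-value and document stores (Amazon DynamoDB on AWS, Firestore on Google Cloud) for data that is read on every prediction, such as rider availability and location.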

4. **Metadata Store**: **AWS Glue Data Catalog**
   - A central catalog of table schemas and data locations, which makes datasets easy to discover and govern.

---

## Data Processing and Feature Engineering
### ETL Process
- **Extract, Transform, Load (ETL)**:
   - In industry, extraction, transformation, and loading are often handled together by a single comprehensive tool, such as:
     - **Informatica**
     - **Alteryx**
     - **Dataiku DSS**
     - **AWS Glue**: A serverless service that can perform all three steps.

### Data Cleaning and Transformation
- Data cleaning involves handling missing values, identifying outliers, and correcting inconsistencies in the datasets.
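
As an illustration of the first two steps, the sketch below fills missing preparation times with the median and clips outliers to the usual 1.5 × IQR fences. The function name and the sample values are hypothetical, and median imputation with IQR clipping is only one of several reasonable choices.

```python
import statistics


def clean_prep_times(values):
    """Fill missing prep times with the median, then clip outliers to the IQR fences."""
    present = [v for v in values if v is not None]
    median = statistics.median(present)

    # Quartiles of the observed values; fences sit 1.5 * IQR beyond Q1 and Q3.
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    filled = [median if v is None else v for v in values]
    return [min(max(v, low), high) for v in filled]


# Hypothetical preparation times in minutes: one missing value, one outlier.
raw = [10, 12, None, 11, 13, 120, 12]
cleaned = clean_prep_times(raw)
```

In a production pipeline the median and fences would normally be computed on historical data and reused at prediction time, rather than recomputed per batch.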

### Feature Engineering Techniques
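- **Distance Calculations**: Using the Haversine formula to compute the great-circle (straight-line) distance between two coordinates, such as restaurant and customer. Actual road distance is usually longer, so this serves as a baseline feature.
- Designing features that incorporate: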
  - Restaurant characteristics such as type and average preparation times.
  - Rider-specific metrics, including workload and past delivery performance.
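
The Haversine distance mentioned above can be sketched as follows. The function name is an assumption, and the Earth radius of 6371 km is the commonly used mean value; the result is a great-circle distance, not a road distance.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)

    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical restaurant and customer coordinates.
distance = haversine_km(28.4595, 77.0266, 28.4089, 77.0420)
```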

### Timing for ETL
- The ETL process should run daily, or when significant events occur, such as:
   - New or updated objects in S3
   - Rider location changes that affect ETA predictions
  
#### How It Works
- **AWS Lambda** functions can be invoked by these events to trigger the ETL, and they scale automatically as order volume fluctuates. For a single order, the model may run many times so that the prediction stays current in real time.
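
A minimal sketch of such a trigger is shown below, assuming the Lambda function is subscribed to S3 event notifications and starts an AWS Glue job for each new object. The job name `eta-feature-etl` and the `--source_path` argument are hypothetical; the optional `glue_client` parameter exists only so the handler can be exercised without AWS credentials.

```python
import urllib.parse

GLUE_JOB_NAME = "eta-feature-etl"  # hypothetical Glue job name


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        s3 = record.get("s3")
        if not s3:
            continue
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(s3["object"]["key"])
        objects.append((bucket, key))
    return objects


def lambda_handler(event, context, glue_client=None):
    if glue_client is None:
        import boto3  # provided by the Lambda Python runtime
        glue_client = boto3.client("glue")

    run_ids = []
    for bucket, key in extract_s3_objects(event):
        response = glue_client.start_job_run(
            JobName=GLUE_JOB_NAME,
            Arguments={"--source_path": f"s3://{bucket}/{key}"},  # hypothetical argument
        )
        run_ids.append(response["JobRunId"])
    return {"started_runs": run_ids}
```

High-frequency events such as rider location updates would more likely flow through a streaming path than through one Glue job per event; this sketch covers only the S3-update case.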

---

## Feature Encoding Techniques
1. **One-Hot Encoding**: Converts categorical variables into a binary matrix format.
2. **Target Encoding**: Encodes categorical variables based on the mean of the target variable.
3. **Ordinal Encoding**: Assigns integer values to ordinal categories.
4. **Pre-trained Model Embeddings**: Dense vector representations for high-cardinality features. With approximately 1 million restaurants in Zomato’s database, one-hot encoding each restaurant is impractical, so encoding restaurant names appropriately is crucial.
   - **Example**: Embeddings can place QSR (Quick Service Restaurant) outlets and biryani-specific outlets close together if they have similar preparation times.
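
The first three techniques are simple enough to sketch in plain Python. The function names and sample data below are illustrative assumptions; in practice a library such as scikit-learn or pandas would usually be used instead.

```python
from collections import defaultdict


def one_hot(values):
    """Return one binary row per value, plus the column order used."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


def ordinal(values, order):
    """Map each value to its position in an explicit category order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]


def target_mean(values, targets):
    """Replace each category with the mean target observed for that category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[v] for v in values]


# Hypothetical training rows.
vehicle = ["bike", "scooter", "bike", "cycle"]
order_size = ["small", "large", "medium", "small"]
delivery_minutes = [20, 25, 30, 40]

vehicle_rows, vehicle_columns = one_hot(vehicle)
size_codes = ordinal(order_size, ["small", "medium", "large"])
vehicle_target = target_mean(vehicle, delivery_minutes)
```

Target encoding should be fitted on training data only and then applied to validation or live data; otherwise the target leaks into the features.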

### Geo-Hashing
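- **Explanation**: Geo-hashing encodes latitude and longitude into a short string, making it easier to manage geographic data. Each additional character narrows the cell, and nearby locations usually share a common prefix.
   - **Example**: A 5-character geohash covers a cell of roughly 5 km by 5 km, while a 9-character geohash covers roughly 5 m by 5 m. Amazon may use 5-character precision while Swiggy might opt for 9-character precision, depending on the accuracy needed in densely populated areas.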
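
The encoding itself is a short bit-interleaving routine. The sketch below follows the standard geohash algorithm (alternating longitude and latitude bisection, five bits per base-32 character); the function name is an assumption, and a maintained library would normally be preferred in production.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=9):
    """Encode a latitude/longitude pair as a geohash of the given length."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate, starting with longitude

    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid   # keep the upper half
        else:
            bits = bits << 1
            rng[1] = mid   # keep the lower half
        use_lon = not use_lon

        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0

    return "".join(chars)


coarse = geohash_encode(57.64911, 10.40744, precision=5)
fine = geohash_encode(57.64911, 10.40744, precision=9)
```

A shorter hash is always a prefix of the longer hash for the same point, which is what makes prefix matching useful for grouping nearby restaurants, riders, and customers.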

---

## Model Selection and Training Phase
### Model Training Options
- **Amazon SageMaker**: A managed platform to build, train, and deploy machine learning models, supporting various model types such as:
  - Linear Regression
  - XGBoost (Extreme Gradient Boosting)
  - LightGBM (Light Gradient-Boosting Machine)
  - Neural Networks (NN)
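
As a point of reference for the simplest model in this list, the sketch below fits a one-feature linear regression (ETA against distance) by ordinary least squares in plain Python. It is a local illustration with made-up data, not SageMaker code; a real model would use many features and one of the libraries above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: delivery distance (km) and observed ETA (minutes).
distance_km = [1.0, 2.0, 3.0, 4.0]
eta_minutes = [12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(distance_km, eta_minutes)


def predict_eta(distance):
    """Predicted ETA in minutes for a given distance."""
    return intercept + slope * distance
```

Here the intercept plays the role of a fixed preparation time and the slope the minutes added per kilometre, which is why linear regression is a useful, interpretable baseline before moving to gradient-boosted trees.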

### Deployment
- Once training is completed, models are deployed via **endpoints**, allowing them to be exposed as REST APIs for real-time predictions.

---

## AWS Lambda Function for Predictions
1. **Pre-process Incoming Data**: Handles encoding and necessary transformations of incoming requests.
2. **Invoke SageMaker Endpoint**: Triggers the prediction process using the pre-processed data.
3. **Post-Processing**: Finalizes prediction output to return meaningful results to users.
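
The three steps above can be sketched as a single handler. The endpoint name, feature order, and vehicle codes are hypothetical, and the sketch assumes the model container accepts a CSV row and returns a single number as text; the actual request and response format depends on how the model was packaged. The `runtime_client` parameter exists only so the handler can be tested without AWS access.

```python
ENDPOINT_NAME = "eta-predictor"  # hypothetical SageMaker endpoint name
VEHICLE_CODES = {"cycle": 0, "scooter": 1, "bike": 2}  # hypothetical ordinal encoding


def preprocess(request):
    """Turn an incoming request into the CSV row the model is assumed to expect."""
    features = [
        float(request["distance_km"]),
        float(request["prep_minutes"]),
        VEHICLE_CODES[request["vehicle_type"]],
    ]
    return ",".join(str(f) for f in features)


def postprocess(raw_body):
    """Convert the raw model output into a user-facing ETA in whole minutes."""
    minutes = float(raw_body.decode("utf-8").strip())
    return {"eta_minutes": max(1, round(minutes))}


def lambda_handler(event, context, runtime_client=None):
    if runtime_client is None:
        import boto3  # provided by the Lambda Python runtime
        runtime_client = boto3.client("sagemaker-runtime")

    payload = preprocess(event)
    response = runtime_client.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="text/csv",
        Body=payload,
    )
    return postprocess(response["Body"].read())
```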

---

## Inferential Statistics
### Statistical Methods Used
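1. **Confidence Intervals**: To provide users with a range of possible ETAs, indicating prediction uncertainty. Strictly speaking, a range for a single delivery is a prediction interval, which is wider than a confidence interval for the average ETA.
   - Example: If the predicted ETA is 30 minutes with an interval of ±5 minutes, the user can expect delivery between 25 and 35 minutes.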
   
2. **Hypothesis Testing**:
   - **Model Performance Comparison**:
     - **Null Hypothesis (H0)**: No significant difference exists between Model A and Model B's prediction errors.
     - **Alternative Hypothesis (H1)**: A significant difference exists in prediction accuracy.
   - **Impact of New Features**:
     - Evaluate the significance of any new features added to the model.
     - **H0**: Adding the new feature does not improve prediction accuracy.
     - **H1**: Adding the new feature significantly improves prediction accuracy.
   - **A/B Testing for System Changes**:
     - Comparing two versions of the ETA prediction system to analyze improvements.
     - **H0**: No significant difference in accuracy or user satisfaction between the old and new systems.
     - **H1**: The new system significantly improves accuracy or user satisfaction.
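
Both methods above can be sketched with the standard library. The interval is built from historical residuals (actual minus predicted minutes), and the model comparison is a paired test on absolute errors from the same orders. Both functions assume approximately normal errors and use a normal approximation; for small samples a t-distribution would be more appropriate. The function names and sample numbers are illustrative.

```python
import math
import statistics


def eta_interval(predicted, residuals, confidence=0.95):
    """Range around a predicted ETA, based on the spread of past residuals."""
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    spread = z * statistics.stdev(residuals)
    center = predicted + statistics.mean(residuals)  # corrects for average bias
    return center - spread, center + spread


def paired_z_test(errors_a, errors_b):
    """Compare two models' absolute errors on the same orders.

    Returns the test statistic and a two-sided p-value (normal approximation).
    """
    diffs = [abs(a) - abs(b) for a, b in zip(errors_a, errors_b)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    stat = statistics.mean(diffs) / std_err
    p_value = 2 * (1 - statistics.NormalDist().cdf(abs(stat)))
    return stat, p_value


# Hypothetical residuals in minutes from past deliveries.
low, high = eta_interval(30, [-2, -1, 0, 1, 2])

# Hypothetical per-order errors for Model A and Model B on the same eight orders.
stat, p_value = paired_z_test([5, 6, 7, 5, 6, 7, 5, 6], [3, 4, 5, 4, 3, 5, 4, 3])
```

A small p-value is evidence against H0 (no difference between the models); a positive statistic here means Model A's errors are larger on average than Model B's.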

---

## Flowchart Diagram

```mermaid
flowchart TD
    A[Start] --> B{Extract Data}
    B --> C[Orders Data]
    B --> D[Restaurant Data]
    B --> E[Rider Data]
    B --> F[Customer Data]
    B --> G[Traffic Data]
    C --> H[ETL Processing]
    D --> H
    E --> H
    F --> H
    G --> H
    H --> I[Feature Engineering]
    I --> J[Model Training]
    J --> K[Model Deployment]
    K --> L[Prediction via REST API]
    L --> M{User Request}
    M --> N[Update ETA]
    M --> O[Provide Real-time Data]
    O --> P[End]
```

---

## Conclusion
The design described in this case study outlines a comprehensive framework for real-time ETA prediction tailored for Zomato. This system incorporates multi-faceted data sources, cloud technologies, and statistical methodologies to ensure timely and accurate delivery estimates. Such a design not only meets functional requirements but also aligns with non-functional benchmarks to provide a robust and trustworthy service to Zomato’s users.
