# **Predicting Disneyland Ride Popularity and Designing an Optimal Park Day Plan**

### **DSC 102 – Assignment 2**

**Author:**  
Kyle Le  

**Project Overview:**  
This project uses historical Disneyland wait-time data to predict ride popularity tiers (1–5) with machine learning models. The predicted tiers are then used to build a ride plan that fits as many high-tier rides as possible into a single park day.


## **1. Predictive Task**

### **Goal**

The goal of this project is to predict the **popularity tier** of each Disneyland ride using historical wait-time data.  
Each ride is assigned a popularity tier from **1 to 5**, where:

- **5** = most popular / in highest demand  
- **1** = least popular / lowest demand  

These tiers are derived from a personal tier list that I created (see Section 2.2).

---

### **Problem Formulation**

This task is formulated as a **multiclass classification problem**:

- **Input features (X):**  
  Quantitative attributes of a ride, such as:  
  - Average wait time  
  - Maximum wait time  

- **Output label (y):**  
  A discrete popularity tier in **{1, 2, 3, 4, 5}**

The goal is to learn a function:

$$f : X \rightarrow y$$

that accurately predicts the popularity tier from ride statistics.

---

### **Models to Be Used**

To align with course content, the project includes:

- A **majority-class baseline**  
- A **wait-time heuristic baseline** (ranking rides by average wait time)  
- A **multinomial logistic regression model**  
  - This is the primary ML model used for classification  
  - It outputs predicted probabilities for each tier  
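The three models above can be sketched as follows. This is a minimal illustration, not the project's final code: the feature values are made up, the use of scikit-learn is an assumption, and `wait_time_heuristic` is a hypothetical helper that shows only one possible reading of the heuristic (rank rides by average wait, then cut the ranking into five roughly equal groups).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical ride-level features: [average wait, maximum wait] in minutes.
# These numbers are invented for illustration and are not project data.
X = np.array([
    [5, 15], [8, 20],
    [15, 30], [18, 35],
    [25, 50], [28, 52], [30, 55],
    [40, 75], [45, 80],
    [60, 110], [70, 120],
], dtype=float)
y = np.array([1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5])

# Baseline 1: always predict the most frequent tier in the training labels.
majority = DummyClassifier(strategy="most_frequent").fit(X, y)
majority_pred = majority.predict(X)


def wait_time_heuristic(avg_wait, n_tiers=5):
    """Assumed heuristic: rank rides by average wait and split the
    ranking into n_tiers roughly equal groups (1 = shortest waits)."""
    ranks = np.argsort(np.argsort(avg_wait))  # 0 = shortest average wait
    return ranks * n_tiers // len(avg_wait) + 1


# Baseline 2: tiers assigned from the average-wait ranking alone.
heuristic_pred = wait_time_heuristic(X[:, 0])

# Primary model: logistic regression on standardized features.
# With more than two classes, scikit-learn's default lbfgs solver
# fits a multinomial (softmax) model.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# One row per ride, one column per tier in model.classes_ order.
tier_probabilities = model.predict_proba(X)
```

Standardizing the features first is a design choice rather than a requirement; it puts average and maximum wait times on a comparable scale, which tends to help the solver converge.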

---

### **Evaluation Strategy**

Model performance will be assessed using:

- **Accuracy**  
- **Macro F1-score**  
- **Confusion matrix visualization**  
  - Helps reveal whether errors occur mostly between adjacent tiers (e.g., 3 ↔ 4)

The logistic regression model will be compared against both baselines to check whether it adds predictive value beyond them.
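As a rough sketch of how these metrics could be computed, assuming scikit-learn is used: the `y_true` and `y_pred` lists below are hypothetical, and `adjacent_share` is an illustrative extra that measures how many errors fall between neighboring tiers.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

TIERS = [1, 2, 3, 4, 5]

# Hypothetical true and predicted tiers for eight rides (not project results).
y_true = [1, 2, 3, 3, 4, 4, 5, 5]
y_pred = [1, 2, 3, 4, 4, 3, 5, 4]

# Fraction of rides whose tier is predicted exactly.
accuracy = accuracy_score(y_true, y_pred)

# Macro F1 averages the per-tier F1 scores, so every tier counts equally
# no matter how many rides it contains.
macro_f1 = f1_score(y_true, y_pred, labels=TIERS, average="macro")

# Rows are true tiers and columns are predicted tiers, both in TIERS order.
cm = confusion_matrix(y_true, y_pred, labels=TIERS)

# Share of misclassifications that are off by exactly one tier.
errors = [(t, p) for t, p in zip(y_true, y_pred) if t != p]
adjacent_share = sum(abs(t - p) == 1 for t, p in errors) / len(errors)
```

For the visualization, the matrix `cm` can be passed to `sklearn.metrics.ConfusionMatrixDisplay` with `display_labels=TIERS`.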


## **2. Dataset Description and Label Construction**

### **2.1 Dataset Context**

This project uses historical Disneyland attraction wait-time data obtained from an online API.  
Each observation describes a specific ride on a given date and includes:

- **ride name**
- **average wait time**
- **maximum wait time**
- **date**

These wait-time statistics serve as quantitative features for predicting ride popularity.

---

### **2.2 Ride Popularity Tiers**

To create supervised labels for the prediction task, I constructed a **custom ride tier list** based on personal experience and familiarity with Disneyland attractions.

The tier list includes five categories:

| Tier Name | Numeric Label | Meaning |
|----------|---------------|---------|
| **Love**   | **5** | Top-tier, highly popular rides |
| **Like**   | **4** | Strongly preferred rides |
| **Neutral**| **3** | Average or mid-tier rides |
| **Dislike**| **2** | Below-average rides |
| **Loathe** | **1** | Lowest-tier rides |

The visual tier list used to define these categories is shown below:

![Disneyland ride tier list](ride_tier_list.png)

---

### **2.3 Label Mapping**

Each ride name in the dataset is mapped to one of the five popularity tiers shown above.  
This mapping produces the target variable **y**, which takes values:

- **1** = Loathe  
- **2** = Dislike  
- **3** = Neutral  
- **4** = Like  
- **5** = Love  

The input features **X** are the ride-level statistics (e.g., average wait time and maximum wait time).  
The goal is to learn a model that maps features **X** to a popularity tier **y**.

This forms a **supervised multiclass classification problem**, where wait-time data is used to predict the assigned popularity tier.
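A minimal sketch of this mapping is shown below. The ride names are placeholders, since the actual assignments come from the tier list image, and `RIDE_TIERS` and `tier_label` are hypothetical names introduced for illustration.

```python
# Numeric label for each tier name, as defined in the table above.
TIER_LABELS = {"Loathe": 1, "Dislike": 2, "Neutral": 3, "Like": 4, "Love": 5}

# Placeholder ride -> tier-name assignments (not the real tier list).
RIDE_TIERS = {
    "Example Ride A": "Love",
    "Example Ride B": "Neutral",
    "Example Ride C": "Loathe",
}


def tier_label(ride_name):
    """Return the numeric tier for a ride, or None if it is not in the tier list."""
    tier_name = RIDE_TIERS.get(ride_name)
    return TIER_LABELS.get(tier_name)


# Build the label for each ride that appears in the tier list.
y_by_ride = {ride: tier_label(ride) for ride in RIDE_TIERS}
```

Returning `None` for unlisted rides makes it easy to filter them out before modeling.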

---

### **2.4 Preprocessing Steps**

Before modeling, the dataset will be cleaned and prepared by:

- Removing rows with missing or invalid wait times  
- Filtering out attractions not included in the tier list  
- Aggregating data at the **ride level** using summary statistics such as:  
  - Mean of the daily average wait time  
  - Mean of the daily maximum wait time  

These aggregated features form the input matrix **X**, and the tier labels form the output vector **y** used to train the predictive model.
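The steps above might look like the following pandas sketch. The column names (`ride_name`, `avg_wait`, `max_wait`), the sample rows, and the rule for what counts as an "invalid" wait time (negative values, or a maximum below the average) are all assumptions made for illustration; `build_ride_table` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical raw observations, one row per ride per date (not project data).
raw = pd.DataFrame({
    "ride_name": ["Example Ride A", "Example Ride A", "Example Ride B",
                  "Example Ride B", "Example Ride C", "Untiered Ride"],
    "date": ["2024-01-01", "2024-01-02", "2024-01-01",
             "2024-01-02", "2024-01-01", "2024-01-01"],
    "avg_wait": [40.0, 50.0, 10.0, None, -5.0, 20.0],
    "max_wait": [80.0, 90.0, 25.0, 30.0, 15.0, 45.0],
})

# Placeholder numeric tiers for the rides in the tier list.
tier_by_ride = {"Example Ride A": 5, "Example Ride B": 3, "Example Ride C": 1}


def build_ride_table(raw, tier_by_ride):
    """Clean the raw rows and aggregate them to one row per ride."""
    # 1. Remove rows with missing wait times.
    df = raw.dropna(subset=["avg_wait", "max_wait"])
    # 2. Remove invalid rows (assumed rule: negative wait, or max below average).
    df = df[(df["avg_wait"] >= 0) & (df["max_wait"] >= df["avg_wait"])]
    # 3. Keep only attractions that appear in the tier list.
    df = df[df["ride_name"].isin(list(tier_by_ride))]
    # 4. Aggregate to the ride level.
    rides = df.groupby("ride_name", as_index=False).agg(
        mean_avg_wait=("avg_wait", "mean"),
        mean_max_wait=("max_wait", "mean"),
    )
    # 5. Attach the tier label for each remaining ride.
    rides["tier"] = rides["ride_name"].map(tier_by_ride)
    return rides


rides = build_ride_table(raw, tier_by_ride)
X = rides[["mean_avg_wait", "mean_max_wait"]].to_numpy()
y = rides["tier"].to_numpy()
```

Note that a ride whose rows are all invalid drops out of the table entirely, so the final set of labeled rides can be smaller than the tier list.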
