# CIS 5450 Project: Difficulty Topics
**Group Members:**
* **Jessica Zhang**
* **Tongxun Hu**
* **Tingyu Lu**


## Topic 1: Imbalanced Data

### **Why we used this concept**
Our target variable, `is_cancel`, is highly imbalanced, with cancellations representing only a small fraction of all invoices.  
Models trained on such data tend to predict the majority class (“not cancelled”) almost exclusively. This yields high overall accuracy but poor performance where it matters most: **identifying cancelled invoices**.

Earlier in the project, we mitigated imbalance using:

- **Stratified splitting**, and  
- **`class_weight='balanced'`** in Logistic Regression and Random Forest.

To deepen our treatment of this issue and satisfy the Difficulty requirement, we implemented a **data-level strategy: random oversampling of the minority class**.  
Unlike class weighting, which adjusts the model's loss function, oversampling directly modifies the training distribution so that cancellation cases become equally represented during learning.

This allowed us to compare two different imbalance-handling philosophies:
- Emphasizing minority errors through **class weighting**, and  
- Increasing minority presence through **resampling**.

### **How we implemented it**
##### **1. Stratified Train/Test Split**

During data splitting, we used stratification to maintain the original cancellation proportion in both training and test sets.  
This prevents pathological splits where:

- the minority class might be underrepresented in training (hurting learning), or  
- nearly absent in the test set (making evaluation unreliable).

This ensured all models were trained and evaluated on representative distributions.
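
As a minimal sketch of the stratified split, the snippet below uses synthetic data with a 10% positive rate. The 80/20 split ratio, the random seed, and the synthetic features are assumptions for illustration, not the project's actual settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows, 10% labelled as cancelled (1).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 100 + [0] * 900)

# stratify=y keeps the cancellation proportion the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_rate = y_train.mean()
test_rate = y_test.mean()
print(f"train cancel rate: {train_rate:.3f}, test cancel rate: {test_rate:.3f}")
```

Without `stratify=y`, the two rates can drift apart by chance, which matters most when the minority class is small.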

##### **2. Class Weighting in Logistic Regression and Random Forest**

We used `class_weight="balanced"` in both model families.  
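This weights each class inversely to its frequency in the training data, so that misclassified cancellations incur a higher penalty (through the loss for Logistic Regression and through the split criterion for Random Forest).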

This method modifies *how the model learns* without altering the data.  
It encourages the model to pay more attention to minority-class errors while preserving the true class distribution.
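
To make the effect of `class_weight="balanced"` concrete, the sketch below computes the weights scikit-learn would assign on a synthetic 90/10 label vector. The label counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 900 non-cancelled (0) and 100 cancelled (1).
y = np.array([0] * 900 + [1] * 100)
classes = np.array([0, 1])

# "balanced" weights are n_samples / (n_classes * count_of_class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

With these counts, an error on a cancelled invoice is weighted nine times as heavily as an error on a non-cancelled one.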

##### **3. Oversampling for Logistic Regression**

To study a data-level approach, we created a separate Logistic Regression model trained on an oversampled version of the training data:

- We combined `X_train` and `y_train` into one DataFrame.
- We split it into majority (`is_cancel=0`) and minority (`is_cancel=1`) subsets.
- We upsampled the minority subset *with replacement* until both classes were equally represented.
- We reconstructed a balanced training set and trained Logistic Regression without class weights.
- Critically, we evaluated on the original, non-resampled test set so that the reported metrics reflect the real class distribution.
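
The steps above can be sketched roughly as follows. The feature names (`amount`, `n_items`), the synthetic data, and the use of `sklearn.utils.resample` are assumptions for illustration; the project's actual code may differ.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Synthetic stand-in for the training data (feature names are hypothetical).
rng = np.random.default_rng(0)
X_train = pd.DataFrame({
    "amount": rng.normal(size=1000),
    "n_items": rng.integers(1, 20, size=1000),
})
y_train = pd.Series([1] * 100 + [0] * 900, name="is_cancel")

# Combine features and label, then separate the two classes.
train = pd.concat([X_train, y_train], axis=1)
majority = train[train["is_cancel"] == 0]
minority = train[train["is_cancel"] == 1]

# Upsample the minority class with replacement to match the majority size.
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

# Rebuild a shuffled, balanced training set and fit without class weights.
balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=42)
X_bal = balanced.drop(columns="is_cancel")
y_bal = balanced["is_cancel"]
model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
```

Only the training rows are resampled; any test set would be left untouched so that evaluation reflects the real class distribution.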

### **Results & Interpretation**
The oversampled Logistic Regression produced:

- **AUC:** 0.9692  
- **Precision (cancelled):** 0.6789  
- **Recall (cancelled):** 0.9398  
- **F1 (cancelled):** 0.7883  
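
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the two values above:

```python
# Reported metrics for the cancelled class (taken from the list above).
precision = 0.6789
recall = 0.9398

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.7883
```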

Oversampling fundamentally changed the model's behavior:

- It dramatically **increased recall** on cancelled invoices (fewer missed risky orders).  
- Precision decreased, meaning the model raised more false alarms.  

This illustrates a key trade-off in imbalanced learning:

- If the goal is to avoid missing potential cancellations → **oversampling is better**.  
- If the goal is to reduce unnecessary manual review → **class weighting may be preferable**.

Although oversampling slightly improved F1 compared to the weighted Logistic Regression, the **tuned Random Forest** still delivered the best combination of AUC, recall, and stable overall performance. Thus, Random Forest remained our final model.

## Topic 2: Hyperparameter Tuning (Random Forest)
### **Why we used this concept**
Random Forests contain many hyperparameters that directly influence their
capacity, generalization behavior, and stability. Parameters such as the number
of trees, maximum depth, leaf size, and the number of features used at each
split can dramatically change:

- how much structure the forest can learn,
- whether it overfits noisy patterns,
- how well it separates the cancelled vs. non-cancelled invoices.

Our baseline Random Forest used manually chosen settings, which produced strong
performance but did not guarantee that we were operating near the model’s
optimal configuration. Because Random Forest was the strongest model family in
earlier experiments, carefully tuning it was essential both for maximizing
predictive performance and for demonstrating a non-trivial depth of modeling
effort consistent with the Difficulty requirement.

### **How we implemented it**
We applied **RandomizedSearchCV** to explore the interaction of the most
impactful Random Forest hyperparameters:

- **`n_estimators`** (number of trees): affects variance reduction  
- **`max_depth`**: controls model complexity and overfitting  
- **`min_samples_split` / `min_samples_leaf`**: regulate how deep branches can grow  
- **`max_features`**: balances tree diversity vs. strength  

We kept **`class_weight='balanced'`** to remain consistent with our imbalance
strategy, and we used **ROC-AUC** as the tuning objective because it measures
how well the model ranks cancelled invoices above non-cancelled ones across all
decision thresholds.

The tuning process involved:

1. Defining a search space that spans both shallow and deep tree structures.  
2. Running a 3-fold cross-validated randomized search over 20 sampled
   hyperparameter configurations.  
3. Selecting the best-performing configuration based on validation AUC.  
4. Evaluating the tuned model on the untouched test set.
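
A minimal sketch of this tuning loop is shown below. It matches the settings described above (20 sampled configurations, 3-fold CV, ROC-AUC scoring, `class_weight='balanced'`), but the search ranges, the synthetic dataset, and the random seeds are assumptions, and the tree counts are deliberately kept small so the example runs quickly.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic imbalanced data standing in for the invoice features.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Illustrative search space covering shallow and deep trees (assumed ranges).
param_distributions = {
    "n_estimators": randint(20, 80),
    "max_depth": [4, 8, 12, None],
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 5),
    "max_features": ["sqrt", 0.5, 0.8],
}

search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_distributions=param_distributions,
    n_iter=20,          # 20 sampled configurations
    cv=3,               # 3-fold cross-validation
    scoring="roc_auc",  # tuning objective
    random_state=0,
)
search.fit(X_train, y_train)

# Evaluate the refitted best model once on the held-out test set.
test_scores = search.best_estimator_.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_scores)
print(search.best_params_, round(search.best_score_, 4), round(test_auc, 4))
```

Because `refit=True` is the default, `best_estimator_` is retrained on the full training set with the winning configuration before the test evaluation.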

### **Results & Interpretation**
The tuned Random Forest achieved:

- **Best CV AUC:** 0.9793  
- **Test AUC:** 0.9811 (vs. 0.979 for the baseline)

The improvement is modest but consistent. The main gains came from:

- shallower, more regularized trees (`max_depth=8` instead of 10 or None),  
- slightly larger leaf size (`min_samples_leaf=2`),  
- a tuned feature subsampling rate (`max_features=0.5`),  
- more trees (`n_estimators=400`) providing variance reduction.

The nearly overlapping ROC curves indicate that the baseline model was already
well-configured, and tuning refined it rather than fundamentally altering its
behavior.

Nevertheless, the small but consistent gain shows that
**hyperparameter choices do affect Random Forest performance**, and the tuned
model is the best configuration we found for our dataset.



## Topic 3: Feature Engineering for High-Cardinality Geography  
### **Why we used this concept**

Our dataset includes dozens of countries, many with very few invoices.  
This creates a **high-cardinality categorical feature** whose categories are very unevenly populated.  
Using raw one-hot encoding would produce:

- extremely sparse features,  
- unstable coefficients for rare countries,  
- overfitting due to low sample counts, and  
- noisy model behavior.

At the same time, exploratory analysis suggested that **country may influence cancellation likelihood**.  
Rather than guessing how to encode geography, we applied a *statistical* approach to determine whether country differences were meaningful enough to include in the model.

This led us to combine three advanced ideas:

1. **Hypothesis testing** to determine whether geographic patterns are real.  
2. **Grouping strategy** to reduce high-cardinality noise.  
3. **Target encoding** to preserve information while avoiding sparsity.

### **How we implemented it**

##### **1. Hypothesis Testing: Do UK and Non-UK Behave Differently?**

We used a **two-proportion z-test** to formally assess whether the cancellation rate in the United Kingdom (the dominant country) differs from that of all other countries.

- Null hypothesis: UK and non-UK cancellation rates are equal  
- Result:  
  - *z* ≈ -3.65  
  - *p* ≈ 0.00026  

Because the p-value is far below the conventional 0.05 significance level, we reject the null hypothesis and conclude that **UK cancellation behavior differs significantly** from non-UK.
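
For reference, a pooled two-proportion z-test can be written in a few lines of standard-library Python. The helper name `two_proportion_ztest` and the invoice counts below are hypothetical and are not the project's data; only the final line uses a value from the text, the reported z statistic.

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for H0: p1 == p2, using the pooled SE."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: cancelled / total invoices for UK and non-UK.
z, p = two_proportion_ztest(x1=300, n1=20000, x2=60, n2=2500)
print(f"z = {z:.2f}, p = {p:.5f}")

# Two-sided p-value implied by the reported statistic z = -3.65.
p_reported = math.erfc(3.65 / math.sqrt(2))
print(f"{p_reported:.5f}")
```

The last line evaluates to roughly 0.00026, which agrees with the p-value reported above.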

This statistical insight guided our feature engineering decisions.

##### **2. Reducing High Cardinality: UK vs Other**

Since many countries appear only a handful of times, modeling them individually would:

- introduce sparse dummy variables,  
- amplify noise,  
- weaken generalization.

Guided by the hypothesis test, we grouped:

- **United Kingdom** → its own category  
- **All other countries** → `"Other"`

This approach preserves meaningful signal while avoiding instability from low-frequency categories.
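
In pandas, this grouping amounts to a one-line mapping. The column names `Country` and `country_group` and the sample rows are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical invoices; only the country column matters here.
invoices = pd.DataFrame(
    {"Country": ["United Kingdom", "France", "United Kingdom", "Germany", "Spain"]}
)

# Keep the UK as its own category and collapse every other country into "Other".
invoices["country_group"] = np.where(
    invoices["Country"] == "United Kingdom", "United Kingdom", "Other"
)
print(invoices["country_group"].value_counts())
```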

##### **3. Target Encoding to Capture Numerical Geographic Signal**

To retain finer-grained patterns without one-hot encoding, we computed a numerical feature:
$$
\text{country_cancel_rate}(c)
= \frac{\text{Cancelled Invoices in } c}{\text{Total Invoices in } c}
$$
This feature represents, for each country:

- the proportion of past invoices that were cancelled,  
- a stable estimate of cancellation tendency,  
- a smooth numeric value instead of sparse dummy variables.

We then merged this numeric feature back into each invoice.

Target encoding required careful design to avoid leakage:

- We used historical averages, not invoice-level outcomes.  
- No information from the current invoice's label was used to compute the encoded value.  
- Rare categories gained stable estimates through grouping.
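
One common way to keep target encoding free of test-set leakage is to estimate the rates from training rows only and then map them onto every split. The sketch below assumes that approach, with hypothetical column names and toy data; the global-rate fallback for unseen countries is also an assumption, not something stated in the project.

```python
import pandas as pd

# Toy training and test invoices (hypothetical data).
train = pd.DataFrame({
    "Country": ["United Kingdom"] * 4 + ["France"] * 2,
    "is_cancel": [0, 0, 0, 1, 1, 0],
})
test = pd.DataFrame({"Country": ["United Kingdom", "France", "Spain"]})

# Per-country cancellation rate, estimated from training rows only.
rates = train.groupby("Country")["is_cancel"].mean()
global_rate = train["is_cancel"].mean()

# Map the rates onto each invoice; unseen countries fall back to the global rate.
train["country_cancel_rate"] = train["Country"].map(rates)
test["country_cancel_rate"] = test["Country"].map(rates).fillna(global_rate)
print(test)
```

In this simplified version, a training row's own label still contributes to its country's mean. Excluding it strictly would require an out-of-fold or time-ordered variant of the same computation.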

### **Results & Interpretation**
The combined effect of hypothesis testing, grouping, and target encoding provided:

- **Improved AUC and recall**, as geographic tendencies helped the model distinguish cancellation risk.  
- **Stability**, since the UK/Other grouping prevented noisy behavior caused by rare categories.  
- **Interpretability**, as `country_cancel_rate` emerged as a meaningful predictor of cancellation likelihood.  
- **Reduced overfitting**, because we avoided dozens of sparse dummy variables.


