## ____Stack Overflow Survey ML Project - Learning Journal____

# RAW DATA
#  └──> EDA (distributions, correlations, plots, intuition)
#       └──> Preprocessing (handle missing, encode categoricals, bin experience, log salary)
#            └──> Transformers & Pipeline (reusable, scalable, leak-proof)
#                 └──> Scaling (StandardScaler)
#                      └──> Modeling (linear regression on log(salary))
#                           └──> Evaluation & Interpretation
#                                └──> Journal documentation & insights
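
The flow sketched above could be wired together roughly as follows. This is a minimal sketch on made-up toy data, not the project's actual pipeline: the column choice, the imputation strategies, and the use of `TransformedTargetRegressor` for the log target are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the survey (values are invented for illustration).
X = pd.DataFrame({
    "WorkExp": [1.0, 3.0, np.nan, 10.0, 15.0, 7.0],
    "EdLevel": ["Bachelor", "Master", "Bachelor", np.nan, "Master", "Bachelor"],
})
y = pd.Series([40_000, 60_000, 55_000, 110_000, 130_000, 85_000])

# Numeric branch: fill gaps, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical branch: fill gaps, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["WorkExp"]),
    ("cat", categorical, ["EdLevel"]),
])

# Fit on log1p(salary) and map predictions back to dollars with expm1.
model = TransformedTargetRegressor(
    regressor=Pipeline([("preprocess", preprocess), ("linreg", LinearRegression())]),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, y)
predictions = model.predict(X)
```

Keeping every fitted step inside one estimator means imputation values, scaling statistics, and encoder categories are learned only from the data passed to `fit`, which is what makes the setup leak-proof under cross-validation.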

#### ___Project Overview___
- Dataset: Stack Overflow 2023 Survey
- Goal: __Predicting Yearly Salary (ConvertedCompYearly)__
- Key Learning Objectives: Apply Chapter 2 concepts, feature engineering practice, robust pipeline building.

#### ___Target Selection___
- **Decision**: ConvertedCompYearly as regression target
- **Why**: 
  - 48K samples, reasonable distribution
  - Median $75K aligns with industry knowledge
  - Rich feature set for prediction
- **Challenges identified**: 
  - Extreme outliers need handling
  - ~46% missing values
  - Need currency/location normalization strategy
- **Next**: Explore feature relationships and outlier handling
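
As a starting point for the challenges listed above, here is a minimal sketch of how the target's missing share and candidate outlier bounds might be inspected with pandas. The tiny DataFrame is invented, and the 1st/99th percentile cut-offs are an assumption, not a decision recorded in this journal.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey's target column (values are invented).
df = pd.DataFrame({
    "ConvertedCompYearly": [52_000, 75_000, np.nan, 98_000, np.nan, 5_000_000, 61_000, 1_200],
})
target = df["ConvertedCompYearly"]

# Share of respondents who did not report a salary.
missing_share = target.isna().mean()

# Hypothetical outlier bounds: keep salaries between the 1st and 99th percentiles.
lower, upper = target.quantile([0.01, 0.99])
in_range = target.between(lower, upper)  # NaN rows evaluate to False

# Rows with a reported salary inside the bounds.
clean = df[in_range]

print(f"missing share: {missing_share:.0%}")
print(f"kept {len(clean)} of {target.notna().sum()} reported salaries")
```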

#### ___Documentation___

#### __Data Exploration__
- **Decision**: Checked the dataset for missing values and reviewed summary statistics for each variable.  
- **Why**: To identify a reasonable target variable. 
- **Alternatives considered**: N/A
- **Outcome**: Most of the variables are object/categorical. Careful preprocessing is needed to keep only the variables relevant to the target, and careful feature engineering after that should help build a strong model.
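
The checks described above might look roughly like this in pandas. It is a sketch on a made-up three-column frame; the real survey has many more columns, and the values here are placeholders.

```python
import numpy as np
import pandas as pd

# Invented sample rows using a few of the survey's column names.
df = pd.DataFrame({
    "EdLevel": ["Bachelor", "Master", np.nan, "Bachelor"],
    "DevType": ["Backend", "Frontend", "Backend", np.nan],
    "WorkExp": [3.0, 10.0, np.nan, 7.0],
})

# Fraction of missing values per column, highest first.
missing = df.isna().mean().sort_values(ascending=False)

# Split columns into numeric and non-numeric (treated here as categorical).
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Distinct values per categorical column (NaN is not counted).
cardinality = df[categorical_cols].nunique()

print(missing)
print(cardinality)
```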

#### __Feature Selection__
- **Features**: EdLevel, YearsCode, YearsCodePro, DevType, OrgSize, TechList, LanguageHaveWorkedWith, PlatformHaveWorkedWith, WebframeHaveWorkedWith, ToolsTechHaveWorkedWith, WorkExp, Industry, ProfessionalTech.
- **Statistics**: 12 out of 13 features are object/categorical, and most are high-cardinality. 
- **Choice**: Based on tech-domain intuition, these features are the most likely drivers of tech salaries. 
- **Challenges**: None so far.

#### ✨ **Feature Engineering Strategy**

<details>
  <summary><strong>📂 Click to expand details</strong></summary>

---

## 🎯 Topic: Advanced Feature Engineering for High-Cardinality Categorical Variables

---

### 📝 **Problem Context**
Working with the Stack Overflow survey data:
- **13 variables** (12 categorical, 1 numerical)
- **~89,000 rows**

The main challenge:  
👉 High-cardinality categorical variables like `TechList` and `LanguageHaveWorkedWith` could explode into **thousands of sparse features** if naïvely one-hot encoded.

---

### 🔍 **Key Insights Discovered**

**1️⃣ The Sparse Matrix Strategy**
- **Problem:**  
  One-hot encoding all 12 categoricals creates potentially **10K+ features** with **99%+ zeros**.
- **Solution:**  
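  Use sklearn’s sparse matrix support, with `get_feature_names_out()` to recover column names for interpretability.
- **Why it works:**  
  Sparse matrices store only the non-zero values and their positions, so memory grows with the number of non-zeros rather than the full matrix size (an estimated **90-95%** saving here).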

---

**2️⃣ Hierarchical Feature Engineering Approach**
Instead of flat one-hot encoding:
- 🏗️ **Stack-level features:** Group technologies into meaningful categories (frontend, backend, data science).
- 🔬 **Technology-level features:** Preserve granular signals for high-impact individual technologies.
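
One way the two levels could be combined is sketched below. The stack mapping and the list of "high-impact" technologies are hypothetical placeholders, and the semicolon separator is an assumption about how the multi-select columns are stored.

```python
from typing import Dict, Optional

# Hypothetical technology -> stack mapping; the real grouping would come from domain review.
STACKS = {
    "frontend": {"JavaScript", "TypeScript", "HTML/CSS"},
    "backend": {"Java", "Go", "C#"},
    "data_science": {"Python", "R", "SQL"},
}
# Hypothetical technologies kept as their own binary features.
HIGH_IMPACT = ["Go", "Python"]


def stack_features(raw: Optional[str]) -> Dict[str, int]:
    """Turn a semicolon-separated technology string into stack- and technology-level features."""
    techs = set(raw.split(";")) if raw else set()
    features = {}
    # Stack level: how many technologies from each stack the respondent uses.
    for stack, members in STACKS.items():
        features[f"stack_{stack}"] = len(techs & members)
    # Technology level: binary flags for selected individual technologies.
    for tech in HIGH_IMPACT:
        features[f"tech_{tech}"] = int(tech in techs)
    return features


row = stack_features("Python;SQL;JavaScript")
print(row)
```

This replaces one column per technology with a handful of stack counts plus a short list of flags, so the feature count is set by the mapping rather than by the raw cardinality.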

---

**3️⃣ Salary-Proportional Weighting Strategy**
- **Core Problem:**  
  Not all technologies within a stack equally impact salary.
- **Solution:**  
  Weight each technology's feature **proportionally to its salary impact**.

- **Rationale:**  
Retains genuine salary signals from both high-paying outliers and common technologies.
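
The journal does not record the exact weighting formula, so the following is only one possible reading: weight each technology by the difference between its users' median salary and the overall median, then average the weights of a respondent's technologies. The data, the column name `lang_salary_score`, and the choice of medians are assumptions for illustration.

```python
import pandas as pd

# Invented training rows.
train = pd.DataFrame({
    "LanguageHaveWorkedWith": ["Python;SQL", "JavaScript", "Python;Go", "JavaScript;SQL"],
    "ConvertedCompYearly": [90_000, 60_000, 120_000, 70_000],
})

# One row per (respondent, technology) pair.
long = train.assign(tech=train["LanguageHaveWorkedWith"].str.split(";")).explode("tech")

# Weight = median salary of a technology's users minus the overall median.
overall_median = train["ConvertedCompYearly"].median()
tech_weight = long.groupby("tech")["ConvertedCompYearly"].median() - overall_median


def weighted_score(raw: str) -> float:
    """Average the weights of a respondent's technologies; unseen technologies count as 0."""
    techs = raw.split(";")
    return sum(tech_weight.get(t, 0.0) for t in techs) / len(techs)


train["lang_salary_score"] = train["LanguageHaveWorkedWith"].map(weighted_score)
print(tech_weight)
print(train)
```

Because the weights are derived from the target, they should be estimated on training data only (or within cross-validation folds); computing them on the full dataset would leak salary information into the features.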

---

### ⚙️ **Technical Decisions Made**

- **Weighting Approach:**  
Chose raw salary differences over RBF smoothing for now, to maintain interpretability.  
➡️ Will revisit after performance tests.

- **Experimental Plan:**  
1. Analyze technology → salary relationships  
2. Implement proportional weighting  
3. Benchmark against simple stack groups  
4. Measure correlation & regression metrics  
5. Then build a robust transformer for the production pipeline.
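
For step 5, a production-style transformer might wrap the weighting logic in scikit-learn's `fit`/`transform` interface, so the weights are learned only from the data passed to `fit`. `TechSalaryEncoder` is a hypothetical name and this is a sketch under the same assumptions as the exploration (semicolon-separated strings, median-difference weights), not a finished implementation.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TechSalaryEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: scores a 1-D technology column by learned salary weights."""

    def __init__(self, sep=";"):
        self.sep = sep

    def fit(self, X, y):
        # X: iterable of technology strings; y: salaries aligned with X.
        salaries = np.asarray(y, dtype=float)
        long = pd.DataFrame({
            "tech": pd.Series(list(X)).str.split(self.sep),
            "salary": salaries,
        }).explode("tech")
        self.baseline_ = float(np.median(salaries))
        # Weight = median salary of a technology's users minus the overall median.
        self.weights_ = (long.groupby("tech")["salary"].median() - self.baseline_).to_dict()
        return self

    def transform(self, X):
        # Average the learned weights; technologies not seen during fit contribute 0.
        scores = [
            np.mean([self.weights_.get(t, 0.0) for t in str(raw).split(self.sep)])
            for raw in X
        ]
        return np.asarray(scores, dtype=float).reshape(-1, 1)


encoder = TechSalaryEncoder()
encoder.fit(
    ["Python;SQL", "JavaScript", "Python;Go", "JavaScript;SQL"],
    [90_000, 60_000, 120_000, 70_000],
)
scores = encoder.transform(["Python;Rust", "Go"])
print(scores)
```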

---

### 📚 **Learning Links to Chapter 2 Concepts**

- 🛠️ **Feature Engineering Pipelines:**  
Following Aurélien’s Ch.2 by separating **experimental exploration from production steps**.

- 🎭 **Handling Categorical Variables:**  
Moving beyond basic one-hots to **domain-driven encodings** that capture business realities.

- 🚀 **Taming the Curse of Dimensionality:**  
Recognizing how high-cardinality categoricals inflate feature space and strategically compressing it.

---

### 🚀 **Next Steps**
1. Implement salary analysis per technology  
2. Build proportional weighting system  
3. Create experimental features & test correlations  
4. Compare against baseline stack grouping  
5. Document insights before pipeline production.

---

### 🌟 **Key Takeaway**
> 🧠 The best feature engineering blends **domain expertise** (understanding real tech clusters)  
> with **statistical rigor** (weighting by salary impact).  
> This is where **human judgment becomes irreplaceable** in an ML pipeline.

---

</details>


#### __EDA__

### 📌 **Insight**
After exploratory scatter plots of `YearsCode` vs `ConvertedCompYearly`, we observed extreme variance and no clear linear relationship.  
This suggests that raw years of experience do **not translate directly into salary**, likely because of hidden confounders (role, location, negotiation, industry).

---

### 📌 **Decision**
- **1. Log-transform salary**  
  To stabilize variance and let coefficients be read as approximate percentage changes, we applied:

  `LogSalary = log1p(ConvertedCompYearly)`

  This reduces the influence of extreme salary outliers and tends to make relationships more linear.

- **2. Bucketize years of experience**  
  Because experience shows diminishing returns, we plan to group `YearsCode` into meaningful categories (e.g., 0-2 yrs, 3-5 yrs, etc.).  
  This captures non-linear experience effects and limits distortion from outliers.
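
Both decisions can be sketched in a few lines of pandas. The bucket edges and labels below are hypothetical (only the 0-2 and 3-5 ranges are mentioned above), the data is invented, and the sketch assumes `YearsCode` has already been converted to numbers.

```python
import numpy as np
import pandas as pd

# Invented rows; YearsCode is assumed to be numeric already.
df = pd.DataFrame({
    "YearsCode": [1, 4, 8, 15, 30],
    "ConvertedCompYearly": [40_000, 65_000, 90_000, 130_000, 150_000],
})

# log1p(x) = log(1 + x); np.expm1 inverts it when mapping predictions back to dollars.
df["LogSalary"] = np.log1p(df["ConvertedCompYearly"])

# Hypothetical right-inclusive bucket edges: 0-2, 3-5, 6-10, 11-20, 21+ years.
edges = [-1, 2, 5, 10, 20, np.inf]
labels = ["0-2", "3-5", "6-10", "11-20", "21+"]
df["YearsCodeBucket"] = pd.cut(df["YearsCode"], bins=edges, labels=labels)

print(df)
```

With a log target, a coefficient β on a feature corresponds to a relative salary change of about 100·(e^β − 1) percent, which is close to 100·β percent when β is small.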

---

### 📌 **Next step**
- Define logical buckets for `YearsCode` and `YearsCodePro` based on both:
  - **Domain intuition** (typical junior, mid, senior ranges), and
  - **Actual data distributions** (observed quantiles).
- Explore pivot plots of average `LogSalary` vs experience buckets to confirm expected patterns.
- Encode these buckets as categorical features in our modeling pipeline.
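
For the data-driven half of this plan, quantile-based buckets and a per-bucket summary might look like the sketch below. The data is invented, and using four equal-frequency buckets is an assumption; the tabular summary stands in for the planned pivot plot.

```python
import numpy as np
import pandas as pd

# Invented rows; YearsCodePro is assumed to be numeric already.
df = pd.DataFrame({
    "YearsCodePro": [0, 1, 2, 3, 5, 8, 12, 20],
    "ConvertedCompYearly": [35_000, 42_000, 50_000, 60_000, 80_000, 95_000, 120_000, 140_000],
})
df["LogSalary"] = np.log1p(df["ConvertedCompYearly"])

# Data-driven alternative to hand-picked edges: four equal-frequency buckets from observed quartiles.
df["ExpQuartile"] = pd.qcut(df["YearsCodePro"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Mean log salary per bucket; a steadily rising series supports the expected pattern.
summary = df.groupby("ExpQuartile", observed=True)["LogSalary"].mean()
print(summary)
```

Comparing these quartile edges with the domain-based junior/mid/senior ranges is a quick way to see whether the two bucketing schemes roughly agree.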

---



#### __HEADING_HERE__

- __Decision:__
- ___Why?___
- __Challenges:__
