## ____Stack Overflow Survey ML Project - Learning Journal____

In [None]:
# RAW DATA
  # └──> EDA (distributions, correlations, plots, intuition) - DONE
    #     └──> Preprocessing (handle missing, encode categoricals, bin experience, log salary) - DONE
      #        └──> Transformers & Pipeline (reusable, scalable, leak-proof) - IN-PROGRESS
       #             └──> Scaling (StandardScaler)
        #                  └──> Modeling (linear regression on log(salary))
         #                       └──> Evaluation & Interpretation
          #                            └──> Journal documentation & insights

In [None]:
# RAW DATA
  # └──> train_test_split
    #     ├──> fit transformers only on train
      #   └──> apply transformations on train & test separately

#### ___Project Overview___
- Dataset: Stack Overflow 2023 Survey
- Goal: __Predicting Yearly Salary (ConvertedCompYearly)__
- Key Learning Objectives: Apply Chapter 2 concepts, feature engineering practice, robust pipeline building.

#### ___Target Selection___
- **Decision**: ConvertedCompYearly as regression target
- **Why**: 
  - 48K samples, reasonable distribution
  - Median $75K aligns with industry knowledge
  - Rich feature set for prediction
- **Challenges identified**: 
  - Extreme outliers need handling
  - ~46% missing values
  - Need currency/location normalization strategy
- **Next**: Explore feature relationships and outlier handling

####  ___Documentation___

#### __Data Exploration__
- **Decision**: Checked the dataset for missing values and reviewed summary statistics for each variable.
- **Why**: To identify a reasonable target variable.
- **Alternatives considered**: N/A
- **Outcome**: Most of the variables are object/categorical. The model will need careful preprocessing to keep only the variables relevant to the target, followed by careful feature engineering to build a strong model.

#### __Feature Selection__
- **Features**: EdLevel, YearsCode, YearsCodePro, DevType, OrgSize, TechList, LanguageHaveWorkedWith, PlatformHaveWorkedWith, WebframeHaveWorkedWith, ToolsTechHaveWorkedWith, WorkExp, Industry, ProfessionalTech.
- **Statistics**: 12 of the 13 features are object/categorical, and most are high-cardinality.
- **Choice**: Based on tech-domain intuition, these features are the most likely drivers of tech salaries.
- **Challenges**: Careful bucketization is needed for the categorical features.

### ___Exploratory Data Analysis___

#### ____Handling Multi-Label Categorical Fields____

- **Problem:**  
  Several columns (e.g., `LanguageHaveWorkedWith`, `TechList`, `PlatformHaveWorkedWith`) stored multi-label data as semi-colon separated strings, leading to thousands of misleading unique entries.

- **Solution:**  
  Used a systematic loop to apply `str.get_dummies(sep=';')` to each multi-label column, expanding them into individual binary features.

- **Result:**  
  Discovered actual unique counts such as:
  - Languages: ~43
  - TechList items: ~58
  - Platforms: ~9
  - Web frameworks: ~7
  - Tools/Tech: ~16
  - ProfessionalTech: ~5

- **Why this matters:**  
  This preserves meaningful multi-label signals (e.g. working with both Python and SQL), while preventing explosion of spurious categories due to string combinations.
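The expansion loop described above can be sketched roughly as follows. The toy rows are made up for illustration, and the names `multi_label_cols`, `naive_unique`, `true_unique`, and `dummies` are assumptions rather than the notebook's actual variables.

```python
import pandas as pd

# Toy rows imitating the survey's semicolon-separated answers (made-up data).
df = pd.DataFrame({
    "LanguageHaveWorkedWith": [
        "Python;SQL", "SQL;Python", "JavaScript",
        "Python;JavaScript;SQL", "Python", None,
    ],
    "PlatformHaveWorkedWith": [
        "AWS", "AWS;Heroku", None, "Heroku", "Heroku;AWS", "AWS",
    ],
})

multi_label_cols = ["LanguageHaveWorkedWith", "PlatformHaveWorkedWith"]

naive_unique = {}  # unique raw strings (misleadingly large)
true_unique = {}   # unique individual labels after splitting
dummies = {}       # one binary indicator frame per column

for col in multi_label_cols:
    naive_unique[col] = df[col].nunique()
    # Split on ';' and create one 0/1 column per individual label.
    expanded = df[col].str.get_dummies(sep=";")
    true_unique[col] = expanded.shape[1]
    dummies[col] = expanded.add_prefix(f"{col}_")

print(naive_unique)
print(true_unique)
```

Rows with a missing answer simply get zeros in every indicator column, so no separate imputation is needed at this step.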

#### __Analyzing remaining categorical variables - Cell 46__

- __Decision:__ Variables such as Country and company size were analysed to see how they contribute to total salary. The 10 most salary-generating countries were kept and the remaining 171 countries were dropped to keep the model very specific. A further per-country salary analysis was done to find the relationship between those countries and salaries.
- ___Why?___ There were 181 countries in total. Including such a long list would fit the model with too many attributes that do not actually carry much weight in predicting the target.
- __Challenges:__ Described below.

#### __YearsCodePro__

- __Decision:__ YearsCodePro is of object type. To convert it to a numeric type, the missing values (20,000+) have to be handled. Scikit-learn's SimpleImputer is being used with strategy='mean'.
- ___Why?___ YearsCodePro has to be converted to integer before its missing values can be imputed with SimpleImputer. Most of its values are numeric, apart from a few text labels ('Less than a year'), which are being dropped after analyzing their respective impact on the salary.
- __Challenges:__ Dropping a label from the column is a bit tricky because there are a lot of rows and 52 unique values. Each low-performing value can be dropped to reduce noise in the dataset and for the model.
- __Solution:__ Converted "Less than a year" to 0 and dropped "More than 50 years" because of its very low correlation with the target. SimpleImputer then fills the NaN values with the median rather than the mean, since the YearsCodePro column is skewed.
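A minimal sketch of the solution above, on made-up values. The label strings are copied from this journal and may differ from the exact wording in the survey file. The imputer is fitted on the whole toy frame only for brevity; in the leak-proof flow it would be fitted on the training split alone.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up sample of the raw object-typed column.
raw = pd.DataFrame({
    "YearsCodePro": ["3", "Less than a year", "12", None,
                     "More than 50 years", "7", None, "25"]
})

# Drop the "More than 50 years" rows, map "Less than a year" to 0.
df = raw[~raw["YearsCodePro"].eq("More than 50 years")].copy()
df["YearsCodePro"] = pd.to_numeric(
    df["YearsCodePro"].replace("Less than a year", "0"),
    errors="coerce",  # anything non-numeric becomes NaN
)

# Median imputation is less sensitive to a skewed distribution than the mean.
imputer = SimpleImputer(strategy="median")
df["YearsCodePro"] = imputer.fit_transform(df[["YearsCodePro"]]).ravel()

print(imputer.statistics_)
print(df["YearsCodePro"].tolist())
```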

#### __Employment__

- __Decision:__ Minimize Employment to certain types to reduce noise and make further analysis possible.  
- ___Why?___ There are 106 unique values in the column, many of them ';'-separated combinations that can easily be dropped or bucketed into a few types, reducing the number of categories and allowing further correlation analysis with the target.
- __Challenges:__ Identifying labels and bucketizing them.
- __Solution:__ Separating the ';' values first, then bucketizing them into the 5 most relevant categories. We prioritize full-time employment over other statuses when someone has multiple employment types (like "Employed, full-time;Independent contractor"), since full-time employment is likely their primary income source.

__Categories:__

- Full-time Employed - Traditional full-time jobs
- Student - Full-time or part-time students
- Freelancer/Self-employed - Independent workers
- Part-time Employed - Part-time traditional jobs
- Other - Unemployed, retired, not looking, etc.

#### __Bucketizing PlatformHaveWorkedWith & WebframeHaveWorkedWith__

- __Decision:__ Both columns hold ';'-separated values. They have been split, keeping only the first value in each row as the main value. Missing values have been imputed with the most_frequent value.
- ___Why?___ To reduce unnecessary noise and make analysis easier. The values will be bucketized by tech stack (front-end, back-end, etc.). This can also be done relative to domain importance.
- __Challenges:__ N/A

#### __Bucketizing WorkExp, EdLevel, DevType, & ProfessionalTech (High Salary Impact)__

- __Decision:__ Bucketizing these columns based on domain intuition.
- ___Why?___ To reduce noise and to simplify tech stacks into fewer groups whose values carry equal weight.
- __Challenges:__ Challenge described below.

#### __ISSUE: ProfessionalTech bucket "None" with 52272 values__

- __Problem:__ After bucketization, the "ProfessionalTech" column has 52272 values in the 'None' bucket. These are not NaN values.
- ___Why?___ The bucketization rules have to be revised, since they may be throwing legitimate values into the "None" bucket.
- __Solution:__ ProfessionalTech has been bucketed into 7 different buckets. The original column with ';'-separated values is used to extract valuable skills and place them into buckets. A strict string-matching function is used to classify the 'None' bucket, which contains all the null (unanswered) values and the 'None of these' answers. Unfortunately this makes up more than 50% of the dataset, but the remaining buckets should still work properly in the model.

#### __Mini-Pipeline__

- __Decision:__ A function has been developed that resolves ';'-separated values using a priority approach. It loops through the whole string and selects keys based on priority from the given bucket list. After bucketization, the column is imputed with the most_frequent value to fill missing values and is saved as a feature column ending with _Bucket.
- ___Why?___ Survey data has many categorical variables with ';'-separated values, which need to follow the Separation > Bucketization > Imputation flow. This function does all of that at once by taking the dataset, column name, and bucket list as inputs.
- __Challenges:__ Previously the process was tiring. This function automates the whole process, subsequently making EDA faster. LanguagesHaveWorkedWith & ToolsTechHaveWorkedWith have been processed properly with very minimal 'None' & 'Other' values.
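The Separation > Bucketization > Imputation flow could look something like the sketch below. The function name `bucketize_column`, its signature, the keyword lists, and the priority order beyond "full-time first" are all assumptions for illustration, not the notebook's actual helper.

```python
import pandas as pd

def bucketize_column(df, column, buckets, sep=";", other="Other"):
    """Split, bucket by priority, then impute with the most frequent bucket.

    `buckets` is a dict {bucket_name: [keywords]} whose insertion order
    defines the priority (first match wins).
    """
    def pick_bucket(raw):
        if pd.isna(raw):
            return None  # left missing here, imputed below
        parts = [p.strip().lower() for p in str(raw).split(sep)]
        for bucket, keywords in buckets.items():
            if any(k.lower() in p for p in parts for k in keywords):
                return bucket
        return other

    out = df.copy()
    new_col = f"{column}_Bucket"
    out[new_col] = out[column].map(pick_bucket)

    # most_frequent imputation for rows that had no answer
    mode = out[new_col].mode(dropna=True)
    if not mode.empty:
        out[new_col] = out[new_col].fillna(mode.iloc[0])
    return out

# Hypothetical priority list: full-time wins over every other status.
employment_buckets = {
    "Full-time Employed": ["Employed, full-time"],
    "Freelancer/Self-employed": ["Independent contractor"],
    "Part-time Employed": ["Employed, part-time"],
    "Student": ["Student"],
}

survey = pd.DataFrame({"Employment": [
    "Employed, full-time;Independent contractor, freelancer, or self-employed",
    "Student, full-time;Employed, part-time",
    "Retired",
    None,
    "Employed, full-time",
]})

result = bucketize_column(survey, "Employment", employment_buckets)
print(result["Employment_Bucket"].tolist())
```

Imputing inside the helper is convenient for EDA, but for the final model the most frequent bucket would ideally be learned from the training split only.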

#### __Age, Country, OrgSize & Industry__

- __Decision:__ For Country, only the top 10 are being kept (based on unique counts) and the rest are labelled "Other". The ~2500 NaN values are handled the same way. Age is already in clean age brackets, which makes it a natural use case for ordinal encoding: each age group is assigned a number (e.g. '18-24 years old' to '5'). Care is needed so that the model does not simply treat 7 as higher than 6, because developers above 50 do not necessarily earn more than those under 40.

- OrgSize has more than __~25000 missing values__. The basic intuition is that big organizations tend to pay more, so filling the missing values with the most_frequent one could make the model misclassify genuine cases. After analysis, the OrgSize column has mostly 'Not Specified' values, followed by '20-99' employees. To keep the effect of OrgSize, the NaN values are imputed as 'Not Specified' so the model can recognize any patterns in those values.

- The Industry column has been dropped to reduce redundant data in the model and to eliminate unnecessary noise. There are already sufficient categorical variables to weigh salaries based on experience and expertise.

- ___Why?___ The importance of these variables cannot be understated in the tech domain. For most of the domain they are important salary drivers and can really affect an individual developer. This is why a safer approach has been taken for OrgSize, analysing its values and their correlation with annual salary.

- __Challenges:__ N/A
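As a rough illustration of the Country and OrgSize handling above, the helper `group_top_n` below is a hypothetical name and the data is made up. In the leak-proof version the list of top countries would be derived from the training split and then reused on the test split.

```python
import pandas as pd

def group_top_n(series, n=10, other="Other"):
    """Keep the n most frequent values; everything else (incl. NaN) -> other."""
    top = series.value_counts().nlargest(n).index
    return series.where(series.isin(top), other)

# Made-up sample; n=2 here only to keep the toy example small.
country = pd.Series(["USA", "USA", "Germany", "India", "Germany", "Peru", None])
country_grouped = group_top_n(country, n=2)

# Missing OrgSize becomes its own explicit category instead of most_frequent.
org_size = pd.Series(["20 to 99 employees", None, "10,000 or more employees", None])
org_size_filled = org_size.fillna("Not Specified")

print(country_grouped.tolist())
print(org_size_filled.tolist())
```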


---

### ___FIT ON TRAIN, TRANSFORM ON TEST!___

#### **Feature Engineering Strategy**

##### **Problem & Solution**
- **Challenge**: 11 high-cardinality categorical variables could create thousands of sparse features with naive one-hot encoding
- **Approach**: Hierarchical bucketization (✅ completed) + salary-impact weighted encoding (🔄 current focus)
- **Result**: Meaningful tech stack categories with proper salary-proportional weights

##### **Salary-Proportional Encoding Strategy**
- **Core Principle**: Each category value gets weighted based on its actual salary impact, not equal treatment
- **Method**: Calculate mean salary per category → encode with salary-impact weights instead of binary 0/1
- **Why Critical**: A "Machine Learning Engineer" in DevType_Bucket should have higher weight than "Student" - encoding must be salary-sensitive
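One way to sketch the "mean salary per category" idea as a scikit-learn compatible transformer is shown below. The class name `MeanTargetEncoder` and the optional `smoothing` parameter are assumptions added for illustration; with the default `smoothing=0.0` the encoding is the plain per-category mean described above. The means are learned in `fit`, so calling it on the training split only keeps the test salaries out of the encoding.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MeanTargetEncoder(BaseEstimator, TransformerMixin):
    """Replace each category with the mean target value seen during fit."""

    def __init__(self, smoothing=0.0):
        self.smoothing = smoothing  # 0.0 = plain category mean

    def fit(self, X, y):
        X = pd.DataFrame(X)
        y = pd.Series(np.asarray(y, dtype=float), index=X.index)
        self.global_mean_ = float(y.mean())
        self.mappings_ = {}
        for col in X.columns:
            stats = y.groupby(X[col]).agg(["mean", "count"])
            # Optional shrinkage of rare categories toward the global mean.
            weight = stats["count"] / (stats["count"] + self.smoothing)
            self.mappings_[col] = (
                weight * stats["mean"] + (1 - weight) * self.global_mean_
            )
        return self

    def transform(self, X):
        X = pd.DataFrame(X).copy()
        for col in X.columns:
            # Categories never seen in training fall back to the global mean.
            X[col] = X[col].map(self.mappings_[col]).fillna(self.global_mean_)
        return X

# Made-up training and test rows.
X_train = pd.DataFrame({"DevType_Bucket": ["ML Engineer", "ML Engineer", "Student", "Student"]})
y_train = [150_000, 130_000, 20_000, 30_000]
X_test = pd.DataFrame({"DevType_Bucket": ["Student", "ML Engineer", "Designer"]})

encoder = MeanTargetEncoder().fit(X_train, y_train)
encoded = encoder.transform(X_test)
print(encoded)
```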

##### **Current Implementation Phase**
- **Status**: Bucketization complete, moving to transformers with weighted categorical encoding
- **Next**: Build robust encoders that assign proper weights to each category value based on salary analysis
- **Future**: RBF kernel smoothing for advanced salary impact curves (after baseline pipeline works)

#### __Encoding Strategy__

- __Ordinal Encoding:__ `EdLevel_Bucket`
- __OneHotEncoding:__ `'Country_Grouped', 'Employment_Category_Bucket'`
- __Target Encoding:__ `'DevType_Bucket', 'PlatformHaveWorkedWith_Bucket', 'WebframeHaveWorkedWith_Bucket', 'ProfessionalTech_Bucket', 'LanguagesWorkedWith_Bucket', 'ToolsTechHaveWorkedWith_Bucket'`


- ___Why?___ __Ordinal Encoding__ is used for variables with a defined rank, e.g. Education (Bachelors, Masters, and so on), so that the model can properly learn how these rank-based features affect the target without being over-influenced by other variables' weights. __OneHotEncoding__ is used for variables with many more values but without a clear ranking system. These variables impact the target variable on a different scale, something that will be interesting to capture with Gaussian RBF. __Target Encoding__ is a target-sensitive encoder used for variables which define the major characteristics of a developer's salary, e.g. Languages, Platforms, Professional Tech Stack, etc.


- __Challenges:__ Implementing all the transformers in a robust way to ensure a robust pipeline.

- __Solution:__ Used ColumnTransformer for encoding all other variables and a custom TargetEncoder for encoding the target-sensitive variables. All the transformers run in a pipeline.
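A minimal sketch of the ordinal and one-hot parts of this encoding strategy with `ColumnTransformer` follows. The education bucket labels and their order in `ed_order` are hypothetical, and the custom target encoder is left out to keep the example short.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical rank order for the education buckets (lowest to highest).
ed_order = [["Less than Bachelor", "Bachelor", "Master", "Doctorate"]]

preprocessor = ColumnTransformer(
    transformers=[
        ("ordinal",
         OrdinalEncoder(categories=ed_order,
                        handle_unknown="use_encoded_value", unknown_value=-1),
         ["EdLevel_Bucket"]),
        ("onehot",
         OneHotEncoder(handle_unknown="ignore"),
         ["Country_Grouped", "Employment_Category_Bucket"]),
    ],
    remainder="drop",
    sparse_threshold=0,  # always return a dense array
)

train = pd.DataFrame({
    "EdLevel_Bucket": ["Bachelor", "Master", "Doctorate"],
    "Country_Grouped": ["USA", "Germany", "Other"],
    "Employment_Category_Bucket": ["Full-time Employed", "Student", "Full-time Employed"],
})
test = pd.DataFrame({
    "EdLevel_Bucket": ["Master"],
    "Country_Grouped": ["India"],  # unseen in train -> all-zero one-hot block
    "Employment_Category_Bucket": ["Student"],
})

preprocessor.fit(train)              # fit on train only
encoded_test = preprocessor.transform(test)
print(encoded_test)
```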

#### __Pipeline & Scaling__

- __Decision:__ A robust pipeline has been built to run all the transformers automatically. All the variables (except the one-hot-encoded ones) have been scaled with StandardScaler, and the target has been log-transformed.
- ___Why?___ To bring all the variables onto the same scale for the machine learning models.
- __Challenges:__ N/A
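Scaling plus a log-transformed target can be wired together as in the sketch below, which uses synthetic data. Using `TransformedTargetRegressor` with `np.log1p`/`np.expm1` is one possible way to do this and is an assumption here; the notebook may apply the log transform manually.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed features and salary target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
salary = np.exp(11 + 0.3 * X[:, 0] + rng.normal(scale=0.1, size=200))

model = TransformedTargetRegressor(
    regressor=Pipeline([
        ("scale", StandardScaler()),   # fitted on the training rows only
        ("reg", LinearRegression()),
    ]),
    func=np.log1p,          # train on log(1 + salary)
    inverse_func=np.expm1,  # predictions come back in dollars
)

model.fit(X[:150], salary[:150])
pred = model.predict(X[150:])

# RMSE in dollars, comparable to the figures reported in this journal.
rmse = float(np.sqrt(np.mean((pred - salary[150:]) ** 2)))
print(round(rmse, 2))
```

Because the inverse transform is applied automatically, the RMSE is reported on the original salary scale rather than on the log scale.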

## Model Performance Progress


#### __Linear Regression, Random Forest & GridSearchCV On TrainTestSplit__

- __Insights:__ The Stack Overflow salary prediction model achieved significant improvement through data cleaning and hyperparameter tuning. Initial models suffered from extreme outliers, with RMSE reaching $432,861, but cleaning the data to a realistic $10k-$500k salary range dramatically improved performance. Random Forest ($54,444 RMSE) slightly outperformed Linear Regression ($54,857 RMSE) on the cleaned dataset. GridSearchCV optimization with 108 parameter combinations across 3-fold cross-validation (324 total training runs) further improved the Random Forest model to a final RMSE of $53,347. The optimal hyperparameters were: 300 estimators, max depth of 10, min samples split of 10, and min samples leaf of 2. 
- ___Next Steps:___
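The 108 combinations and 324 training runs mentioned above follow directly from the grid size times the number of folds. The grid values below are hypothetical (chosen only so that they contain the reported optimum and multiply out to 108); the actual grid used in the notebook is not shown in this journal.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Hypothetical grid: 3 x 3 x 3 x 4 = 108 combinations.
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4, 8],
}

cv_folds = 3
n_combinations = len(ParameterGrid(param_grid))
total_fits = n_combinations * cv_folds
print(n_combinations, total_fits)

# The search object itself (not fitted here to keep the example fast).
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=cv_folds,
    scoring="neg_root_mean_squared_error",
)
```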

#### __Linear Regression, Random Forest & GridSearchCV On StratifiedSplit__

- __Insights:__ The results were very similar. The performance measure used was RMSE.
- ___Next Steps:___ Try different performance measures. Explore advanced feature engineering and use advanced models.

#### __Feature Engineering__

- __Decision:__ New features such as Experience Consistency, Professional Experience Factor, Experience Skill ratio, Senior Experience Match, and a boolean Is_Senior Role have been engineered. 
- ___Why?___ To bring the RMSE down from $53,347. The engineered features will be fed to advanced models and evaluated using RMSE.

#### __Advanced Models Comparison__

- __Decision:__ Tested 6 advanced models: XGBoost, Gradient Boosting, Extra Trees, Ridge, ElasticNet, and SVR to improve upon Random Forest's $53,347 RMSE.
- ___Why?___ To explore if ensemble methods and regularized models could capture more complex salary patterns and push below the $53k threshold.
- __Outcome:__ **Gradient Boosting achieved new best RMSE of $52,951** ($396 improvement). Extra Trees ($54,323) and Ridge ($54,644) also performed well. XGBoost underperformed expectations ($54,811), while ElasticNet ($61,366) and SVR ($68,123) struggled with categorical complexity.
- __Challenges:__ XGBoost needs hyperparameter tuning. The linear model (Ridge) is surprisingly competitive, which suggests the feature engineering is effective.
- __Next Steps:__ GridSearchCV on Gradient Boosting, explore ensemble combining top 3 models, consider feature selection.

#### __Pipeline Integration Issue - Feature Engineering Not Applied__

**Problem:** Advanced models showed no improvement despite extensive feature engineering. Root cause: Pipeline was processing data BEFORE feature engineering, dropping all 6 engineered features.

**Discovery:** 
- Pipeline expected 34 features, but `all_feature_names` had 40 (34 + 6 engineered)
- Shape mismatch error revealed pipeline was ignoring engineered features
- Models were running on original dataset without any feature engineering

**Solution:**
- Created `AdvancedFeatureEngineer` custom transformer inheriting from `BaseEstimator` and `TransformerMixin`
- Integrated feature engineering directly into pipeline as first step
- Fixed data flow: Raw data → Feature Engineering → Preprocessing → Models

**Result:** All 6 engineered features now properly included in model training. Feature engineering finally applied to advanced models.

**Key Learning:** Pipeline order matters. Feature engineering must be integrated into pipeline, not applied separately beforehand.
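A stripped-down sketch of a transformer like `AdvancedFeatureEngineer` placed as the first pipeline step is shown below. The two feature formulas and the seniority threshold are hypothetical stand-ins; the journal lists the engineered feature names but not how they are computed.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class AdvancedFeatureEngineer(BaseEstimator, TransformerMixin):
    """Adds engineered columns inside the pipeline (formulas are illustrative)."""

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def transform(self, X):
        X = X.copy()
        # Hypothetical: share of coding years spent coding professionally.
        X["Professional_Experience_Factor"] = (
            X["YearsCodePro"] / X["YearsCode"].clip(lower=1)
        )
        # Hypothetical threshold for a senior flag.
        X["Is_Senior_Role"] = (X["YearsCodePro"] >= 10).astype(int)
        return X

raw = pd.DataFrame({"YearsCode": [10, 15, 0], "YearsCodePro": [5, 12, 0]})

# Raw data -> Feature Engineering -> Preprocessing (-> Models)
pipeline = Pipeline([
    ("features", AdvancedFeatureEngineer()),
    ("scale", StandardScaler()),
])

engineered = AdvancedFeatureEngineer().transform(raw)
processed = pipeline.fit_transform(raw)
print(list(engineered.columns), processed.shape)
```

With the engineering step inside the pipeline, every downstream step sees the extra columns, so a mismatch like 34 versus 40 features cannot arise silently.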

---

#### __The Great Debugging Saga: When 42 Really Was The Answer__

**The Setup:** After achieving $52,951 RMSE with Gradient Boosting, decided to optimize further through feature selection based on feature importance analysis.

**Feature Importance Insights:** 
- OrgSize_Encoded (27.8%) dominated - surprisingly realistic given company size/salary correlation
- YearsCodePro (22.7%) → LanguagesWorkedWith (17.1%) → WebFrames (6.8%) → DevType (6.5%) perfectly matched domain intuition
- Employment Category (0.000000-0.007873%) and ProfessionalTech (0.9%) identified as noise/redundant

**The Trap:** Removed Country, Employment Category, and ProfessionalTech features expecting performance improvement.

**The Disaster:** Performance mysteriously tanked to $62,056 (9k worse!). Added Employment back → even worse at $67,307. All models degraded significantly.

**The Investigation:** 
- Pipeline structure checked ✓
- Feature engineering verified ✓  
- ColumnTransformer configuration confirmed ✓
- Then realized: Multiple `train_test_split` calls without `random_state` throughout development

**The Revelation:** Different train/test splits each time = comparing apples to oranges. Models may have indirectly seen test data across splits (subtle data leakage).

**The Solution:** Complete environment restart (VS Code, kernel, dataset reload) with single `train_test_split` using `random_state=42`.

**The Irony:** After sophisticated feature engineering and advanced modeling, the breakthrough came from Douglas Adams' famous number - 42, "The Answer to Life, the Universe, and Everything."

**Key Learning:** Always set `random_state` on the FIRST train/test split and reuse that one split for every experiment. Data integrity trumps model complexity every time.
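A small check on synthetic arrays shows why the seed matters: with a fixed `random_state` the split is identical on every call, so RMSE values from different experiments refer to the same test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and target.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# One split, made once, with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Calling it again with the same seed reproduces exactly the same rows.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = bool(np.array_equal(X_test, X_test_again))
print(same_split)
```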

---


#### __Post-42 Debugging Session__

**Issue 1: Salary Cap Adjustment**
- **Decision:** Increased salary range from $50k-$500k to $50k-$750k
- **Why:** Accommodate higher-paying roles without extreme outliers
- **Result:** Best model performance: Gradient Boosting ($64,280 RMSE)

**Issue 2: Feature Importance Magnitude Bias**
- **Problem:** OrgSize (0.28) and YearsCodePro (0.23) dominating feature importance due to numerical scale
- **Initial Hypothesis:** Features getting inflated importance from higher numerical values
- **Solution Attempted:** Moved YearsCodePro to target encoder

**Issue 3: Data Analysis Revelation**
- **Discovery:** Plotted mean salary vs. both features - revealed true patterns
- **OrgSize:** Perfect monotonic relationship (larger companies = higher salaries, $160k→$75k)
- **YearsCodePro:** Extreme outlier with 45+ years claiming $650k salary (n=10 sample size)
- **Conclusion:** OrgSize importance was legitimate; YearsCodePro had data quality issues

**Issue 4: Outlier Removal**
- **Decision:** Capped YearsCodePro at 30 years maximum
- **Why:** 45+ years professional coding experience in 2023 unrealistic
- **Result:** YearsCodePro importance dropped to 0.19, model improved to $64,087 RMSE
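The salary range and experience cap from Issues 1 and 4 might be applied as in this sketch on made-up rows. Reading "capped" as clipping values to 30 (rather than dropping those rows) is an assumption.

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "YearsCodePro": [2, 15, 45, 30, 50],
    "ConvertedCompYearly": [60_000, 120_000, 650_000, 200_000, 40_000],
})

# Keep only salaries inside the $50k-$750k range (bounds inclusive).
in_range = df["ConvertedCompYearly"].between(50_000, 750_000)
df = df[in_range].copy()

# Cap professional experience at 30 years.
df["YearsCodePro"] = df["YearsCodePro"].clip(upper=30)

print(df)
```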

**Issue 5: OrgSize Strategic Binning**
- **Observation:** Freelancers showing highest salary ($160k) but small sample size
- **Next:** Implement business-logic binning to preserve salary hierarchy while grouping similar company sizes

#### __OrgSize ISSUE__

- __Decision:__ OrgSize was already encoded before the binning was applied to it, so it ends up double-encoded once it is transformed again in the pipeline.
- ___Why?___ 
- __Challenges:__

#### __The OrgSize Double-Encoding Bug - BREAKTHROUGH!__

- **Bug Discovery:** OrgSize was pre-encoded to integers (0,1,2,3,4) early in project, then custom `OrgSizeBinner()` transformer was binning these encoded integers instead of raw categorical values.
- **Impact:** OrgSize showed artificially high feature importance (27.8%) because models were learning from meaningless encoded numbers, not actual organization size patterns.
- **Fix:** Passed raw categorical OrgSize data to custom binner transformer instead of pre-encoded integers.
- **Result:** **RMSE dropped below 52K threshold!** Major breakthrough in model performance.
- **Key Learning:** Always verify data flow through pipeline - encoding should happen once, at the right stage.
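One way to guard against this class of bug is to have the binner reject numeric input, as in the sketch below. The bin labels in `DEFAULT_BINS`, the raw category strings, and the `OrgSize_Binned` output column are hypothetical; only the class name `OrgSizeBinner` comes from this journal.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical mapping from raw survey answers to business-logic bins.
DEFAULT_BINS = {
    "Just me - freelancer, sole proprietor, etc.": "Solo",
    "2 to 9 employees": "Small",
    "20 to 99 employees": "Small",
    "100 to 499 employees": "Medium",
    "10,000 or more employees": "Large",
}

class OrgSizeBinner(BaseEstimator, TransformerMixin):
    """Bins raw OrgSize strings; refuses already-encoded integers."""

    def __init__(self, bins=None):
        self.bins = bins

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        raw = X["OrgSize"]
        if pd.api.types.is_numeric_dtype(raw):
            # Numeric input means the column was encoded earlier.
            raise TypeError("OrgSize must be raw categories, not encoded integers")
        bins = self.bins if self.bins is not None else DEFAULT_BINS
        X["OrgSize_Binned"] = raw.map(bins).fillna("Not Specified")
        return X

raw_df = pd.DataFrame({"OrgSize": ["2 to 9 employees", None, "10,000 or more employees"]})
binned = OrgSizeBinner().transform(raw_df)
print(binned["OrgSize_Binned"].tolist())

# Pre-encoded integers are rejected instead of being silently binned.
try:
    OrgSizeBinner().transform(pd.DataFrame({"OrgSize": [0, 1, 2]}))
    rejected = False
except TypeError:
    rejected = True
print(rejected)
```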

#### __Post-Fix Model Results - All Models Improved!__

- **New Best:** Gradient Boosting $52,647 (vs. $52,951 before)
- **Biggest Winner:** SVR $56,011 (12K improvement from $68,123!)
- **XGBoost Breakthrough:** $52,962 (finally competitive)
- **Overall:** All 6 models now clustered in $52-58K range vs. previous $52-68K spread
- **Next:** GridSearchCV on Gradient Boosting to push below $52K threshold

#### __Final GridSearchCV Optimization__

- **Target:** Gradient Boosting ($52,647 baseline)
- **Grid:** 108 combinations across n_estimators, max_depth, learning_rate, and min_samples_split
- **Goal:** Break $52K barrier for project finale