## ____Stack Overflow Survey ML Project - Learning Journal____

In [None]:
# RAW DATA
  # └──> EDA (distributions, correlations, plots, intuition) - DONE
    #     └──> Preprocessing (handle missing, encode categoricals, bin experience, log salary) - DONE
      #        └──> Transformers & Pipeline (reusable, scalable, leak-proof) - IN-PROGRESS
       #             └──> Scaling (StandardScaler)
        #                  └──> Modeling (linear regression on log(salary))
         #                       └──> Evaluation & Interpretation
          #                            └──> Journal documentation & insights

In [None]:
# RAW DATA
  # └──> train_test_split
    #     ├──> fit transformers only on train
      #   └──> apply transformations on train & test separately
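
A minimal sketch of the split-then-fit order outlined above, assuming scikit-learn. The toy data, the column values, and the single-feature model are made up for illustration; they are not the project's actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the survey; all values are made up.
rng = np.random.default_rng(0)
X = pd.DataFrame({"YearsCodePro": rng.integers(0, 30, size=200).astype(float)})
salary = 40_000 + 3_000 * X["YearsCodePro"] + rng.normal(0, 5_000, size=200)
y = np.log1p(salary)  # model log(salary), as planned in the roadmap

# Split first, so nothing about the test rows leaks into the transformers.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training rows only; at score/predict
# time the test rows are transformed with those training statistics.
model = Pipeline([
    ("scale", StandardScaler()),
    ("reg", LinearRegression()),
])
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)
```

Wrapping the scaler and the regressor in one `Pipeline` means `fit` can never see test rows by accident, which is the point of the "fit transformers only on train" step.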

#### ___Project Overview___
- Dataset: Stack Overflow 2023 Survey
- Goal: __Predicting Yearly Salary (ConvertedCompYearly)__
- Key Learning Objectives: Apply Chapter 2 concepts, feature engineering practice, robust pipeline building.

#### ___Target Selection___
- **Decision**: ConvertedCompYearly as regression target
- **Why**: 
  - 48K samples, reasonable distribution
  - Median $75K aligns with industry knowledge
  - Rich feature set for prediction
- **Challenges identified**: 
  - Extreme outliers need handling
  - ~46% missing values
  - Need currency/location normalization strategy
- **Next**: Explore feature relationships and outlier handling

####  ___Documentation___

#### __Data Exploration__
- **Decision**: Checked the dataset for missing values and reviewed the summary statistics of each variable.  
- **Why**: To identify a reasonable target variable. 
- **Alternatives considered**: N/A
- **Outcome**: Most of the variables are object/categorical. The model will need careful preprocessing to keep only the variables relevant to the target, followed by careful feature engineering to build a strong model.

#### __Feature Selection__
- **Features**: EdLevel, YearsCode, YearsCodePro, DevType, OrgSize, TechList, LanguageHaveWorkedWith, PlatformHaveWorkedWith, WebframeHaveWorkedWith, ToolsTechHaveWorkedWith, WorkExp, Industry, ProfessionalTech.
- **Statistics**: 12 of the 13 features are object/categorical, and most are high-cardinality. 
- **Choice**: Based on tech-domain intuition, these features are the most likely drivers of tech salaries. 
- **Challenges**: The categorical features need careful bucketization.

### ___Exploratory Data Analysis___

#### ____Handling Multi-Label Categorical Fields____

- **Problem:**  
  Several columns (e.g., `LanguageHaveWorkedWith`, `TechList`, `PlatformHaveWorkedWith`) stored multi-label data as semi-colon separated strings, leading to thousands of misleading unique entries.

- **Solution:**  
  Used a systematic loop to apply `str.get_dummies(sep=';')` to each multi-label column, expanding them into individual binary features.

- **Result:**  
  Discovered actual unique counts such as:
  - Languages: ~43
  - TechList items: ~58
  - Platforms: ~9
  - Web frameworks: ~7
  - Tools/Tech: ~16
  - ProfessionalTech: ~5

- **Why this matters:**  
  This preserves meaningful multi-label signals (e.g. working with both Python and SQL), while preventing explosion of spurious categories due to string combinations.
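
The expansion loop described above could look roughly like the sketch below. The sample rows are hypothetical; only the column names and the `str.get_dummies(sep=';')` call come from the journal.

```python
import pandas as pd

# Hypothetical rows; the real survey columns hold far more combinations.
df = pd.DataFrame({
    "LanguageHaveWorkedWith": ["Python;SQL", "JavaScript", "Python;JavaScript;SQL", None],
    "PlatformHaveWorkedWith": ["AWS", "AWS;Heroku", None, "Heroku"],
})

multi_label_cols = ["LanguageHaveWorkedWith", "PlatformHaveWorkedWith"]

dummies = {}
for col in multi_label_cols:
    # Each distinct token becomes one 0/1 column; missing rows become all zeros.
    dummies[col] = df[col].str.get_dummies(sep=";")
    print(col, "| raw unique strings:", df[col].nunique(),
          "| actual labels:", dummies[col].shape[1])
```

Comparing `nunique()` on the raw strings with the number of dummy columns is what exposes the gap between the misleading unique counts and the real label counts.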

#### __Analyzing remaining categorical variables - Cell 46__

- __Decision:__ Variables such as Country and company size were analyzed to see how they contribute to total salary. The 10 countries that generate the most salary were chosen, and the remaining 171 countries were dropped to keep the model focused. A further per-country salary analysis was done to find the relationship between those countries and salaries. 
- ___Why?___ There were 181 countries in total. Including such a long list would fit the model with too many attributes that do not actually carry much weight in predicting the target.
- __Challenges:__ Described below.

#### __YearsCodePro__

- __Decision:__ YearsCodePro is of object type. To convert it to a numeric type, the missing values (20,000+) have to be handled. Scikit-learn's SimpleImputer is used for this, with strategy='mean' as the starting point. 
- ___Why?___ YearsCodePro must be an integer column before SimpleImputer can fill its missing values. Most of its values are already numeric; the few text labels (such as 'Less than a year') are handled after analyzing their respective impact on salary. 
- __Challenges:__ Dropping a label from the column is tricky because there are many rows and 52 unique values. Low-performing values can be dropped to reduce noise in the dataset and for the model. 
- __Solution:__ Converted "Less than a year" to 0 and dropped "More than 50 years" because of its very low correlation with the target. SimpleImputer then replaces the NaN values with the median rather than the mean, since the YearsCodePro distribution is skewed.
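
A small sketch of the cleaning steps listed in the Solution, assuming pandas and scikit-learn. The six sample values are hypothetical, and in the real workflow the imputer would be fit on the training split only.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical sample of the raw column (object dtype, text labels, missing values).
df = pd.DataFrame({
    "YearsCodePro": ["3", "Less than a year", None, "10", "More than 50 years", "25"]
})

# Drop the "More than 50 years" rows and map "Less than a year" to 0.
df = df[df["YearsCodePro"] != "More than 50 years"].copy()
df["YearsCodePro"] = df["YearsCodePro"].replace("Less than a year", "0")

# Everything left is numeric text or missing, so the column can become numeric.
df["YearsCodePro"] = pd.to_numeric(df["YearsCodePro"], errors="coerce")

# Median imputation is less sensitive to a skewed distribution than the mean.
imputer = SimpleImputer(strategy="median")
df["YearsCodePro"] = imputer.fit_transform(df[["YearsCodePro"]]).ravel()
```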

#### __Employment__

- __Decision:__ Reduce Employment to a few types to cut noise and make further analysis possible.  
- ___Why?___ The column has 106 unique values, many of them ';'-separated combinations that can be dropped or bucketed into a few types, which reduces the number of categories and allows further correlation analysis with the target. 
- __Challenges:__ Identifying the labels and bucketizing them. 
- __Solution:__ Separate the ';'-separated values first, then bucketize them into the 5 most relevant categories. Full-time employment takes priority over other statuses when someone has multiple employment types (like "Employed, full-time;Independent contractor"), since full-time employment is likely their primary income source. 

__Categories:__

- Full-time Employed - Traditional full-time jobs
- Student - Full-time or part-time students
- Freelancer/Self-employed - Independent workers
- Part-time Employed - Part-time traditional jobs
- Other - Unemployed, retired, not looking, etc.
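
The priority rule can be sketched as an ordered list of rules, as below. Only "full-time employment wins" comes from the journal; the order of the remaining rules, the matching keywords, and sending missing answers to "Other" are assumptions made for illustration.

```python
import pandas as pd

# Priority-ordered (keyword, bucket) rules; earlier rules win.
# Assumption: everything after the first rule is an illustrative ordering.
EMPLOYMENT_RULES = [
    ("Employed, full-time", "Full-time Employed"),
    ("Independent contractor", "Freelancer/Self-employed"),
    ("Employed, part-time", "Part-time Employed"),
    ("Student", "Student"),
]

def bucket_employment(value):
    """Map a ';'-separated Employment string to a single bucket."""
    if pd.isna(value):
        return "Other"  # assumption: unanswered rows fall into "Other"
    parts = [p.strip() for p in value.split(";")]
    for keyword, bucket in EMPLOYMENT_RULES:
        # startswith avoids matching "Not employed, ..." on the word "employed"
        if any(p.startswith(keyword) for p in parts):
            return bucket
    return "Other"

employment = pd.Series([
    "Employed, full-time;Independent contractor, freelancer, or self-employed",
    "Student, full-time;Employed, part-time",
    "Retired",
    None,
])
employment_bucket = employment.map(bucket_employment)
```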

#### __Bucketizing PlatformHaveWorkedWith & WebframeHaveWorkedWith__

- __Decision:__ Both columns hold ';'-separated values. Only the first value of each row is kept as the main value. Missing values are imputed with the most_frequent value. 
- ___Why?___ To reduce unnecessary noise and make analysis easier. The values will be bucketized by tech stack (front-end, back-end, etc.); this can also be done by domain importance.
- __Challenges:__ NA
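
A minimal sketch of "keep the first value, then fill with the most frequent one". The sample rows and the output column name `Webframe_Main` are hypothetical, and pandas `mode()` stands in here for an imputer with the `most_frequent` strategy.

```python
import pandas as pd

# Hypothetical sample rows for one of the two columns.
df = pd.DataFrame({
    "WebframeHaveWorkedWith": ["React;Node.js", "Django", None, "React", "Flask;Django"]
})

# Keep only the first ';'-separated entry of each row; missing rows stay missing.
first_value = df["WebframeHaveWorkedWith"].str.split(";").str[0]

# Fill the gaps with the most frequent first value.
most_frequent = first_value.mode().iloc[0]
df["Webframe_Main"] = first_value.fillna(most_frequent)
```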

#### __Bucketizing WorkExp, EdLevel, DevType, & ProfessionalTech (High Salary Impact)__

- __Decision:__ Bucketizing these columns based on domain intuition.
- ___Why?___ To reduce noise and simplify the tech stacks into fewer groups, with the values inside each group weighted equally.
- __Challenges:__ Described below.

#### __ISSUE: ProfessionalTech bucket "None" with 52272 values__

- __Problem:__ After bucketization, the "ProfessionalTech" column has 52272 values in the 'None' bucket. These are not NaN values. 
- ___Why?___ The bucketization rules have to be revised; they may be putting legitimate values into the "None" bucket.
- __Solution:__ ProfessionalTech has been bucketed into 7 different buckets. The original column with ';'-separated values is used to extract valuable skills and place them into buckets. A strict string-matching function now assigns the 'None' bucket, which contains all the null (unanswered) values and the 'None of these' answers. Unfortunately this makes up more than 50% of the dataset, but the remaining buckets should still work properly in the model.
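
One way to audit what lands in the 'None' bucket is sketched below. The function name, the sample answers, and the placeholder rule for non-'None' rows are all hypothetical; the journal does not list the actual 7 bucket rules.

```python
import pandas as pd

def bucket_professional_tech(value):
    """Strict matching: only missing answers or exactly 'None of these' map to 'None'."""
    if pd.isna(value):
        return "None"
    tokens = [t.strip() for t in value.split(";")]
    if tokens == ["None of these"]:
        return "None"
    # Placeholder for the real bucket rules, which are not given in the journal.
    return tokens[0]

raw = pd.Series([
    "DevOps function;Microservices", None, "None of these", "Automated testing", None
])
buckets = raw.map(bucket_professional_tech)

# Audit: which raw answers ended up in the 'None' bucket, and how often?
in_none = raw[buckets == "None"]
audit = in_none.value_counts(dropna=False)
```

If `audit` shows anything other than missing values and 'None of these', the rules are still sending legitimate answers to the wrong bucket.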

#### __Mini-Pipeline__

- __Decision:__ A function has been developed that resolves ';'-separated values using a priority approach. It loops through the whole string and selects keys by priority from the given bucket list. After bucketization, the column is imputed with the most_frequent value to fill missing values and is saved as a new feature column whose name ends with _Bucket. 
- ___Why?___ Survey data has many categorical variables with ';'-separated values that need to follow the Separation > Bucketization > Imputation flow. This function does all of that in one pass, taking the dataset, column name, and bucket list as inputs. 
- __Challenges:__ Previously the process was tedious; this function automates it and makes EDA faster. LanguageHaveWorkedWith & ToolsTechHaveWorkedWith have been processed properly, with very few 'None' & 'Other' values. 
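
A sketch of what such a helper might look like, under stated assumptions: the name `bucketize_column`, the shape of `bucket_rules`, the 'Other' fallback, and the example buckets are all hypothetical, and pandas `mode()` stands in for the `most_frequent` imputation.

```python
import pandas as pd

def bucketize_column(df, column, bucket_rules, sep=";"):
    """Separation -> Bucketization -> Imputation for one multi-value column.

    bucket_rules is an ordered list of (bucket_name, keywords); earlier
    entries have higher priority. The result is stored in f"{column}_Bucket".
    """
    def pick_bucket(value):
        if pd.isna(value):
            return None  # left missing here, imputed below
        tokens = {t.strip() for t in value.split(sep)}
        for bucket_name, keywords in bucket_rules:
            if tokens & set(keywords):
                return bucket_name
        return "Other"

    out = df.copy()
    bucketed = out[column].map(pick_bucket)
    most_frequent = bucketed.mode().iloc[0]  # mode() ignores missing values
    out[f"{column}_Bucket"] = bucketed.fillna(most_frequent)
    return out

# Hypothetical bucket list and rows, for illustration only.
language_rules = [
    ("Data/ML", ["Python", "R"]),
    ("Web", ["JavaScript", "TypeScript", "HTML/CSS"]),
]
df = pd.DataFrame({
    "LanguageHaveWorkedWith": ["JavaScript;Python", "HTML/CSS", None, "Python;R", "Cobol"]
})
df = bucketize_column(df, "LanguageHaveWorkedWith", language_rules)
```

Because the rules are an ordered list, moving a bucket up or down changes which label wins for rows that match several buckets, without touching the function itself.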

#### __Age, Country, OrgSize & Industry__

- __Decision:__ For Country, only the top 10 (based on value counts) are kept and the rest are labeled "Other"; the ~2500 NaN values are handled the same way. Age is already in clean age brackets, which makes it a good use case for ordinal encoding: each age group is assigned a number (e.g. '18-24 years old' becomes '5'). The caveat is that the model should not always treat 7 as higher than 6, since developers above 50 do not necessarily earn more than those under 40. 

- OrgSize has more than __~25000 missing values__. The basic intuition is that big organizations tend to pay more, so filling the missing values with the most_frequent one could make the model misclassify genuine cases. After analysis, the OrgSize column is mostly 'Not Specified', followed by '20-99' employees. To keep the effect of OrgSize, the NaN values are imputed with 'Not Specified' so the model can recognize any patterns in those rows.

- The Industry column has been dropped to reduce redundant data in the model and to eliminate unnecessary noise. There are already sufficient categorical variables to weigh salaries based on experience and expertise.

- ___Why?___ The importance of these features cannot be underestimated in the tech domain: for most of the field they are important salary drivers and can really affect an individual developer. This is why a safer approach was taken for OrgSize, analyzing its values and their correlation with annual salary. 

- __Challenges:__ N/A
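
The three decisions above can be sketched as follows, assuming pandas and scikit-learn's `OrdinalEncoder`. The sample rows, the bracket labels in `age_order`, the output column names, and `TOP_N = 2` (instead of the journal's 10) are assumptions chosen to keep the example small.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample rows.
df = pd.DataFrame({
    "Country": ["Germany", "India", "Germany", None, "Brazil", "India", "Germany"],
    "Age": ["25-34 years old", "18-24 years old", "35-44 years old", "25-34 years old",
            "45-54 years old", "18-24 years old", "25-34 years old"],
    "OrgSize": ["20 to 99 employees", None, "10,000 or more employees", None,
                "20 to 99 employees", None, None],
})

# Country: keep the most frequent countries; everything else, including NaN, is "Other".
TOP_N = 2  # the journal keeps 10; 2 keeps this toy example readable
top_countries = df["Country"].value_counts().nlargest(TOP_N).index
df["Country_Bucket"] = df["Country"].where(df["Country"].isin(top_countries), "Other")

# Age: explicit bracket order so the integer codes follow age, not the alphabet.
age_order = ["Under 18 years old", "18-24 years old", "25-34 years old",
             "35-44 years old", "45-54 years old", "55-64 years old",
             "65 years or older"]
encoder = OrdinalEncoder(categories=[age_order])
df["Age_Encoded"] = encoder.fit_transform(df[["Age"]]).ravel()

# OrgSize: keep missingness visible as its own category instead of most_frequent.
df["OrgSize"] = df["OrgSize"].fillna("Not Specified")
```

An ordinal code tells a linear model that each step up in age adds the same amount, which is exactly the concern raised above for developers over 50. One-hot encoding the brackets is a common alternative when that assumption does not hold.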


---

### ___FIT ON TRAIN, TRANSFORM ON TEST!___

#### **Feature Engineering Strategy**

##### **Problem & Solution**
- **Challenge**: 11 high-cardinality categorical variables could create thousands of sparse features with naive one-hot encoding
- **Approach**: Hierarchical bucketization (✅ completed) + salary-impact weighted encoding (🔄 current focus)
- **Result**: Meaningful tech stack categories with proper salary-proportional weights

##### **Salary-Proportional Encoding Strategy**
- **Core Principle**: Each category value gets weighted based on its actual salary impact, not equal treatment
- **Method**: Calculate mean salary per category → encode with salary-impact weights instead of binary 0/1
- **Why Critical**: A "Machine Learning Engineer" in DevType_Bucket should have higher weight than "Student" - encoding must be salary-sensitive
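
The method described here is essentially mean target encoding. A minimal sketch follows, assuming a small hand-written class; `MeanSalaryEncoder`, the column `DevType_Weight`, the salary figures, and the global-mean fallback for unseen categories are all hypothetical choices, not the project's implementation.

```python
import pandas as pd

class MeanSalaryEncoder:
    """Replace each category with its mean salary, learned from training rows only."""

    def fit(self, categories, salary):
        self.means_ = salary.groupby(categories).mean()
        self.default_ = salary.mean()  # assumption: unseen categories get the global mean
        return self

    def transform(self, categories):
        return categories.map(self.means_).fillna(self.default_)

# Hypothetical training and test rows.
train = pd.DataFrame({
    "DevType_Bucket": ["ML Engineer", "Student", "ML Engineer", "Backend"],
    "salary": [150_000.0, 20_000.0, 130_000.0, 90_000.0],
})
test = pd.DataFrame({"DevType_Bucket": ["Student", "ML Engineer", "Designer"]})

# Fit on train, transform on test: the test salaries never influence the weights.
encoder = MeanSalaryEncoder().fit(train["DevType_Bucket"], train["salary"])
test["DevType_Weight"] = encoder.transform(test["DevType_Bucket"])
```

Because the weights are computed from the target itself, categories with few rows can overfit. Smoothing toward the global mean or cross-fitted encoding are the usual mitigations and may be worth adding once the baseline works.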

##### **Current Implementation Phase**
- **Status**: Bucketization complete, moving to transformers with weighted categorical encoding
- **Next**: Build robust encoders that assign proper weights to each category value based on salary analysis
- **Future**: RBF kernel smoothing for advanced salary impact curves (after baseline pipeline works)

#### __Encoding Strategy__

- __Ordinal Encoding:__ `EdLevel_Bucket, ` 
- ___Why?___
- __Challenges:__
