# Notebook 02: Feature Engineering

## Introduction
- This notebook performs feature engineering on the cleaned dataset: it transforms existing columns and creates new features intended to improve the performance of our machine learning models.

### 1- Importing Necessary Libraries

In [None]:
import pandas as pd 
import numpy as np

In [None]:
df = pd.read_parquet(r"D:\my_projcts\job-salary-prediction\data\processed\eda_cleaned_data.parquet", engine='fastparquet')

print(df.shape)
print(df.columns)
print(df.dtypes)
display(df.head())

### 2- Target Transformation
- The target variable in our dataset is the salary, which is a continuous variable. We will apply a log transformation to the salary variable to reduce skewness and make it more suitable for modeling.

In [None]:
df['salary_log'] = np.log1p(df['salary_in_usd'])
display(df[['salary_in_usd', 'salary_log']].head())
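
A minimal sketch of why the log transform helps and how to undo it, using toy salaries rather than the real dataset (the values below are illustrative assumptions). `np.expm1` is the exact inverse of `np.log1p`, so predictions made on `salary_log` can be converted back to USD.

```python
import numpy as np
import pandas as pd

# Toy salaries (illustrative values, not taken from the dataset)
salaries = pd.Series([30_000, 60_000, 90_000, 400_000], name="salary_in_usd")

salary_log = np.log1p(salaries)   # log(1 + x), the same transform as above
recovered = np.expm1(salary_log)  # exp(x) - 1, the inverse transform

# Sample skewness before and after the transform
skew_before = salaries.skew()
skew_after = salary_log.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

On this toy sample the skewness drops after the transform; on the real data it is worth checking with `df['salary_in_usd'].skew()` and `df['salary_log'].skew()`.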

### 3- Experience Level Encoding

In [None]:
experience_map = {
    'Entry': 0,
    'Mid': 1,
    'Senior': 2,
    'Executive': 3
}

df['experience_encoded'] = df['experience_level'].map(experience_map)

In [None]:
# Show the original labels next to their encoded values
display(df[['experience_level', 'experience_encoded']].head())
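
`Series.map` returns `NaN` for any label missing from the dictionary, and the later cast to `int8` would then fail. The sketch below shows one way to detect such labels on a toy column; the `"Intern"` label is a hypothetical example, not a value known to be in the dataset.

```python
import pandas as pd

experience_map = {"Entry": 0, "Mid": 1, "Senior": 2, "Executive": 3}

# Toy column; "Intern" is a hypothetical label the mapping does not cover
levels = pd.Series(["Entry", "Senior", "Intern", "Mid"], name="experience_level")
encoded = levels.map(experience_map)

# Labels that .map() turned into NaN
unmapped = sorted(levels[encoded.isna()].unique())
print("unmapped labels:", unmapped)
```

The same check applies to the company size mapping further down.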

### 4- Encode Employment Type

In [None]:
# Binary flag, because full-time jobs dominate the data
df['is_full_time'] = (df['employment_type'] == 'Full-time').astype(int)

display(df['is_full_time'].head())

### 5- Group Job Titles

In [None]:
df.columns

In [None]:
# Keep the most common job titles and group everything else as 'Other'
top_titles = ['Data Scientist', 'Data Engineer', 'Data Analyst', 'Software Engineer', 
              'Engineer', 'Machine Learning Engineer', 'Manager']
df['job_title'] = df['job_title'].apply(
    lambda x: x if x in top_titles else 'Other'
)
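
The list above is hand-picked. As a hedged alternative, the top categories could be derived from the data with `value_counts`, and the grouping done with a vectorized `where`/`isin` instead of `apply`. The toy titles and `top_n = 2` are assumptions for illustration only.

```python
import pandas as pd

# Toy job titles (illustrative, not the real distribution)
titles = pd.Series(
    ["Data Scientist", "Data Engineer", "Data Scientist", "Data Analyst",
     "Research Fellow", "Data Engineer", "Data Scientist"],
    name="job_title",
)

top_n = 2  # illustrative; the cell above keeps 7 hand-picked titles
top_titles = titles.value_counts().nlargest(top_n).index

# Keep a title if it is in the top list, otherwise replace it with 'Other'
grouped = titles.where(titles.isin(top_titles), "Other")
print(grouped.value_counts())
```

The same pattern would work for `company_location` in the next step.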

### 6- Group Company Location

In [None]:
top_locations = ['US', 'CA', 'GB', 'AU']
df['company_location'] = df['company_location'].apply(
    lambda x: x if x in top_locations else 'Other'
)

### 7- Remote Work Features

In [None]:
df['is_remote'] = (df['remote_ratio'] == 100).astype(int)
df['is_hybrid'] = (df['remote_ratio'] == 50).astype(int)

print(df['is_remote'].value_counts())
print(df['is_hybrid'].value_counts())
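
With these two flags, on-site work (`remote_ratio == 0`) becomes the implicit baseline where both flags are 0. A small sketch on toy values, assuming `remote_ratio` only takes the values 0, 50, and 100:

```python
import pandas as pd

# Toy remote ratios (assumed to be limited to 0, 50, 100)
remote_ratio = pd.Series([0, 50, 100, 100, 0], name="remote_ratio")

is_remote = (remote_ratio == 100).astype(int)
is_hybrid = (remote_ratio == 50).astype(int)

# On-site rows are those where neither flag is set
is_onsite = (is_remote + is_hybrid) == 0
print(is_onsite.tolist())
```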


### 8- Encode Company Size

In [None]:
size_map = {
    'Small': 0,
    'Medium': 1,
    'Large': 2
}

df['company_size_encoded'] = df['company_size'].map(size_map)

print(df['company_size_encoded'].value_counts())

### 9- One-Hot Encoding for Job Title and Company Location

In [None]:
df = pd.get_dummies(df, columns=['job_title', 'company_location'], 
                     drop_first=False)
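
With `drop_first=False`, the dummy columns of a single categorical sum to 1 in every row. Tree-based models are generally unaffected by this, but in an unregularized linear model with an intercept these columns are perfectly collinear. The sketch below compares both settings on a toy frame (the values are illustrative).

```python
import pandas as pd

# Toy frame with one already-grouped categorical column
toy = pd.DataFrame({"company_location": ["US", "CA", "Other", "US"]})

full = pd.get_dummies(toy, columns=["company_location"], drop_first=False)
reduced = pd.get_dummies(toy, columns=["company_location"], drop_first=True)

print(list(full.columns))     # one column per category
print(list(reduced.columns))  # first category (alphabetically) dropped

# Every row has exactly one active dummy when all categories are kept
row_sums = full.sum(axis=1)
```

Which setting is appropriate depends on the model trained in the next notebook.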

### 10- Drop Unnecessary Columns

In [None]:
columns_to_drop = [
    'salary',
    'salary_currency',
    'salary_in_usd',
    'experience_level',
    'employment_type',
    'company_size',
    'employee_residence',
    'remote_ratio'
]

df_model = df.drop(columns=columns_to_drop)


### 11- Change Columns to Proper Types

In [None]:
num_columns  = [
    "experience_encoded" , "is_full_time",
    "is_remote" , "is_hybrid",
    "company_size_encoded"
]
for col in num_columns:
    df_model[col] = df_model[col].astype("int8")

In [None]:
# Convert bool columns (the one-hot dummies) to int8 for model compatibility and memory efficiency
bool_columns = df_model.select_dtypes(include='bool').columns
df_model[bool_columns] = df_model[bool_columns].astype('int8')


In [None]:
df_model.dtypes
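
To illustrate the memory effect of the `int8` conversion, here is a small sketch on a synthetic column of 1,000 zeros (an assumption for illustration, not the real data). An `int64` value takes 8 bytes and an `int8` value takes 1 byte.

```python
import numpy as np
import pandas as pd

# Synthetic flag column stored as int64
flags = pd.DataFrame({"is_remote": np.zeros(1_000, dtype="int64")})

# Bytes used by the column data, excluding the index
before = flags.memory_usage(index=False, deep=True).sum()
after = flags.astype("int8").memory_usage(index=False, deep=True).sum()

print(f"int64: {before} bytes, int8: {after} bytes")
```

`int8` only holds values from -128 to 127, which is enough for the binary and ordinal features used here.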

### 12- Saving the Engineered Dataset

In [None]:
df_model.to_parquet(
    r"D:\my_projcts\job-salary-prediction\data\processed\feature_engineered_data.parquet",
    engine='fastparquet',
    index=False
)

# Feature Engineering Summary

## 1. Target Transformation
- Created `salary_log` using `np.log1p()` to reduce the right skew of the salary distribution
- Intended to stabilize variance and make the target easier to model

## 2. Experience Level Encoding
- Mapped to ordinal values: Entry=0, Mid=1, Senior=2, Executive=3
- Preserves natural seniority progression

## 3. Employment Type
- Created binary feature `is_full_time` (1 = Full-time, 0 = Other)
- Full-time roles dominate the dataset and pay significantly more

## 4. Job Title Grouping
- Kept top 7 most frequent titles, rest grouped as 'Other'
- Reduced from 100+ unique titles to 8 categories
- Prevents overfitting from rare titles

## 5. Company Location Grouping
- Kept top 4 locations (US, CA, GB, AU), rest grouped as 'Other'
- Reduced from 70+ countries to 5 categories
- Removes noise from low-frequency locations

## 6. Remote Work Features
- Created binary flags: `is_remote` (100%) and `is_hybrid` (50%)
- Simpler signal than the raw 0/50/100 values; on-site (0%) is the implicit baseline where both flags are 0

## 7. Company Size Encoding
- Mapped to ordinal values: Small=0, Medium=1, Large=2

## 8. One-Hot Encoding
- Applied to grouped `job_title` and `company_location` categories
- `drop_first=False` to keep all categories

## 9. Data Type Optimization
- Converted all binary and ordinal features to `int8`
- Reduces memory usage and speeds up computation

## 10. Dropped Columns
Removed raw/unprocessed columns:
- `salary`, `salary_currency`, `salary_in_usd`
- `experience_level`, `employment_type`, `company_size`
- `employee_residence`, `remote_ratio`

---

**Final dataset**: `df_model` with 31,942 rows × engineered features

**Next step**: Model training