This notebook covers the third step of the machine learning process: preprocessing the data.


The data comes from the [Wages By Education](https://www.kaggle.com/datasets/asaniczka/wages-by-education-in-the-usa-1973-2022) dataset on Kaggle.

In this notebook:

1. [Setup](#setup)
2. [Melting the data](#melt)<br>
3. [Transforming Cat to Numeric](#data.trans)<br>
4. [Preparing modeling data](#data.prep)<br>
    4.1. [Splitting the data](#data.split)<br>
    4.2. [Scaling the data](#data.scale)<br>


<a id='setup'></a>
## 1. Setup

In [1]:
import pandas as pd

# modeling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
df = pd.read_csv('wages_by_education.csv')

In the EDA stage, I did some feature engineering, which I am going to reuse here. However, instead of encoding the categories by hand, I will use the `get_dummies` function of `pandas`.

First, I need to change the data format from wide to a long format.

<a id='melt'></a>

## 2. `melt`ing the data

Before melting, I set "year" as the index of the dataframe, so that it is kept for every row of the long format.

In [3]:
df.set_index('year', inplace= True)

In [4]:
df_melt = df.melt(var_name = 'Education', value_name = 'Hourly.Salary', ignore_index = False)

In [5]:
df_melt.head()

| year | Education | Hourly.Salary |
|---|---|---|
| 2022 | less_than_hs | 16.52 |
| 2021 | less_than_hs | 16.74 |
| 2020 | less_than_hs | 17.02 |
| 2019 | less_than_hs | 16.11 |
| 2018 | less_than_hs | 15.94 |
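As a small illustration of the wide-to-long reshaping done above, here is a minimal sketch on a toy table. The two wage columns and all numbers are made up for the example; only the `melt` arguments mirror the cell above.

```python
import pandas as pd

# Toy wide table: column names imitate the dataset's style, values are invented.
wide = pd.DataFrame(
    {
        "year": [2022, 2021],
        "high_school": [21.0, 22.0],
        "men_high_school": [24.0, 25.0],
    }
).set_index("year")

# ignore_index=False keeps "year" as the index, repeated once per melted column.
long = wide.melt(var_name="Education", value_name="Hourly.Salary", ignore_index=False)
print(long)
```

Each original column becomes a block of rows, so a table with 2 years and 2 wage columns turns into 4 rows, each tagged with the column it came from.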


Now, I split the labels in the "Education" column into three separate columns: "Sex", "Race", and the level of education.

In [6]:
# Labels starting with 'men' or 'women' map to a sex; every other label gets None.
df_melt['Sex'] = df_melt['Education'].apply(lambda x: 'male' if x.startswith('men') 
                                            else 'female' if x.startswith('women') 
                                            else None)

In [7]:
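# Labels starting with 'white' or 'black' map to that race;
# every other label (including ones with no race prefix) falls through to 'hispanic'.
df_melt['Race'] = df_melt['Education'].apply(lambda x: 'white' if x.startswith('white')
                                             else 'black' if x.startswith('black')
                                             else 'hispanic')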

In [8]:
# Define a list of the actual education levels
education_levels = [
    'less_than_hs',
    'high_school',
    'some_college',
    'bachelors_degree',
    'advanced_degree'
]

# Function to extract matching education level from label
def extract_education(label):
    for level in education_levels:
        if label.endswith(level):
            return level
    return None  # fallback if no match found

# Apply to our column
df_melt['Education'] = df_melt['Education'].apply(extract_education)
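The cells above match only on the start of each label, so a label with no race prefix is assigned `'hispanic'`, and a label that begins with a race gets `None` for "Sex". As a possible alternative, the sketch below parses all three parts in one pass. The helper name `split_label` is hypothetical, and combined labels such as `white_men_high_school` are an assumption about how the CSV names its columns, so check them against the actual header before relying on this.

```python
EDUCATION_LEVELS = [
    "less_than_hs",
    "high_school",
    "some_college",
    "bachelors_degree",
    "advanced_degree",
]
RACES = {"white", "black", "hispanic"}


def split_label(label):
    """Split a column label into (race, sex, education); missing parts become None.

    Hypothetical helper: assumes labels look like '[race_][men_|women_]<education>'.
    """
    education = next((lvl for lvl in EDUCATION_LEVELS if label.endswith(lvl)), None)
    # Whatever precedes the education level holds the optional race and sex tokens.
    prefix = label[: -len(education)] if education else label
    tokens = [t for t in prefix.split("_") if t]
    race = next((t for t in tokens if t in RACES), None)
    sex = "male" if "men" in tokens else "female" if "women" in tokens else None
    return race, sex, education


print(split_label("white_men_high_school"))
print(split_label("less_than_hs"))
```

With this approach, rows that aggregate over race or sex keep an explicit `None` instead of being folded into one of the categories, which leaves the decision of how to treat them to a later step.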


Now, I move the dependent variable to the last column.

In [9]:
df_melt = df_melt[['Sex', 'Race', 'Education', 'Hourly.Salary']]

<a id='data.trans'></a>

## 3. Transforming Cat to Numeric

Using the `get_dummies` function of `pandas`, I transform the categorical data to numeric below.

In [10]:
df_num = pd.get_dummies(df_melt, drop_first = True)

In [11]:
df_num.head()

| year | Hourly.Salary | Sex_male | Race_hispanic | Race_white | Education_bachelors_degree | Education_high_school | Education_less_than_hs | Education_some_college |
|---|---|---|---|---|---|---|---|---|
| 2022 | 16.52 | False | True | False | False | False | True | False |
| 2021 | 16.74 | False | True | False | False | False | True | False |
| 2020 | 17.02 | False | True | False | False | False | True | False |
| 2019 | 16.11 | False | True | False | False | False | True | False |
| 2018 | 15.94 | False | True | False | False | False | True | False |
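To make the effect of `drop_first = True` easier to see, here is a minimal sketch on a toy frame (the rows are made up). It shows that the alphabetically first category is dropped and serves as the baseline, which is why the output above has no `Sex_female`, `Race_black`, or `Education_advanced_degree` column. It also shows that a missing value is not given its own column by default, so it ends up with the same encoding as the baseline.

```python
import pandas as pd

# Toy frame: one row per case, including a missing "Sex" value.
toy = pd.DataFrame(
    {
        "Sex": ["male", "female", None],
        "Hourly.Salary": [20.0, 18.0, 19.0],
    }
)

# 'female' sorts first, so it is dropped and becomes the implicit baseline.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded)
```

In this sketch the row with the missing "Sex" value has a false `Sex_male`, exactly like the `'female'` row. If that distinction matters for modeling, passing `dummy_na=True` to `get_dummies` is one way to keep it.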


<a id='data.prep'></a>
## 4. Preparing modeling data

To avoid data leakage, we first split the data, then `fit` the scaler on the training set only, and finally `transform` both the training and test sets with that same fitted scaler.

<a id='data.split'></a>
### 4.1. Splitting the data

With "Hourly.Salary" as the dependent variable to be predicted, I first define `X` and `y`, which are then split into training and test sets.

For the split, I use 25% of the data as the test set.

In [13]:
X = df_num.drop('Hourly.Salary', axis =1)
y = df_num['Hourly.Salary']

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.25,
                                                    random_state = 42)

<a id='data.scale'></a>

### 4.2. Scaling the data



In [16]:
scaler = StandardScaler()
scaler.fit(X_train)

scaled_X_train = scaler.transform(X_train)
scaled_X_test = scaler.transform(X_test)
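The following sketch repeats the same fit-on-train, transform-both pattern on synthetic data (the arrays are random numbers, not the wage data) to show what "fitting on the training set only" means in practice: the scaler's stored means are those of the training rows, so the scaled training set is centered at zero while the scaled test set generally is not.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y; values are random and purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(40, 2))
y = rng.normal(size=40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Statistics are learned from the training rows only.
scaler = StandardScaler().fit(X_train)
scaled_train = scaler.transform(X_train)
scaled_test = scaler.transform(X_test)

print(scaler.mean_)               # equals X_train.mean(axis=0)
print(scaled_train.mean(axis=0))  # approximately zero
print(scaled_test.mean(axis=0))   # usually close to, but not exactly, zero
```

Because the test rows never influence `scaler.mean_` or `scaler.scale_`, no information from the test set reaches the preprocessing step.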

<hr style="border: 1px solid #333;" />


Finally, I write the clean dataset to a new `.csv` file, so that it can be used for modeling later.

In [19]:
df_num.to_csv('clean_data.csv')
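Since "year" is the index, `to_csv` writes it as the first column of the file. A minimal sketch of reading such a file back is shown below. It uses a toy frame and an in-memory buffer instead of `clean_data.csv`, and the values are illustrative only.

```python
import io

import pandas as pd

# Toy frame shaped like the clean data: "year" index plus a couple of columns.
toy = pd.DataFrame(
    {"Hourly.Salary": [20.0, 21.5], "Sex_male": [False, True]},
    index=pd.Index([2022, 2021], name="year"),
)

# Write to an in-memory buffer; a file path works the same way.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col="year" restores the index instead of loading it as a regular column.
restored = pd.read_csv(buffer, index_col="year")
print(restored)
```

Without `index_col`, "year" would come back as an ordinary feature column, which is easy to overlook when the file is loaded in the modeling notebook.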