## **Prepare environment, load and prepare data**

Required environment

*  Python 3.12.12
*  numpy: 2.0.2
*  pandas: 2.2.2
*  seaborn: 0.13.2
*  matplotlib: 3.10.0

In [1]:
!python --version

Python 3.12.12


In [2]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
print("pandas:",pd.__version__)
print("numpy:",np.__version__)
print("seaborn:",sns.__version__)
print("matplotlib:",plt.matplotlib.__version__)

pandas: 2.2.2
numpy: 2.0.2
seaborn: 0.13.2
matplotlib: 3.10.0


In [3]:
# import dataset

df = pd.read_csv('https://data.ontario.ca/dataset/1f14addd-e4fc-4a07-9982-ad98db07ef86/resource/4cc07c1b-62ed-4ece-a2a4-d05d0f45081c/download/img-wage-rate-by-edu-age-sex-ft-pt-ca-on-2006-24.csv')

In [4]:
# Rename the ' Men' column to 'Men_Wage' and ' Women' to 'Women_Wage' to remove
# the leading space and add clarity
df_renamed = df.rename(columns={' Men': 'Men_Wage', ' Women': 'Women_Wage'})

# Display the first few rows to show the renamed columns
display(df_renamed.head())

Unnamed: 0,YEAR,GEOGRAPHY,IMMIGRANT,TYPE OF WORK,WAGE RATE,EDUCATION,AGE GROUP,Both sexes,Men_Wage,Women_Wage
0,2006,Canada,Total,All employees,Median hourly wage,"Total, all education levels",15 +,17.5,19.2,16.0
1,2006,Canada,Total,All employees,Median hourly wage,"Total, all education levels",25 +,19.4,21.5,17.5
2,2006,Canada,Total,All employees,Median hourly wage,"Total, all education levels",25 - 34,18.0,19.0,16.8
3,2006,Canada,Total,All employees,Median hourly wage,"Total, all education levels",25 - 54,19.5,21.5,17.8
4,2006,Canada,Total,All employees,Median hourly wage,"Total, all education levels",25 - 64,19.5,21.5,17.6


In [5]:
# Flag suppressed values (recorded as 0.0 in the dataset) and set them to NaN
for col in ['Men_Wage', 'Women_Wage']:
    # Compare stripped string forms so the check works for float or text columns
    suppressed = df_renamed[col].astype(str).str.strip() == "0.0"
    # Create a new column to flag suppressed values for each gender
    df_renamed[f'{col}_is_suppressed'] = suppressed
    # Set flagged entries to NaN; replace('0.0', ...) would only match strings
    df_renamed[col] = df_renamed[col].mask(suppressed)

# Display the first few rows to show the new columns and replaced values
display(df_renamed.head())

Unnamed: 0,YEAR,GEOGRAPHY,IMMIGRANT,TYPE OF WORK,WAGE RATE,EDUCATION,AGE GROUP,Both sexes,Men_Wage,Women_Wage,Men_Wage_is_suppressed,Women_Wage_is_suppressed
0,2006,Canada,Total,All employees,Median hourly wage,"Total, all education levels",15 +,17.5,19.2,16.0,False,False
1,2006,Canada,Total,All employees,Median hourly wage,"Total, all education levels",25 +,19.4,21.5,17.5,False,False
2,2006,Canada,Total,All employees,Median hourly wage,"Total, all education levels",25 - 34,18.0,19.0,16.8,False,False
3,2006,Canada,Total,All employees,Median hourly wage,"Total, all education levels",25 - 54,19.5,21.5,17.8,False,False
4,2006,Canada,Total,All employees,Median hourly wage,"Total, all education levels",25 - 64,19.5,21.5,17.6,False,False
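
Whether the suppressed entries actually become NaN depends on how pandas parsed the wage columns. The sketch below uses a small hypothetical float Series (not the real dataset) to show the difference between replacing the string `'0.0'` and masking on a boolean flag. It assumes the wage columns are read as floats, which the displayed values suggest but do not confirm.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: 0.0 stands in for a suppressed value
wages = pd.Series([19.2, 0.0, 21.5], name="Men_Wage")

# Flag suppressed entries by comparing stripped string forms (works for any dtype)
is_suppressed = wages.astype(str).str.strip() == "0.0"

# replace() with the string '0.0' finds no match in a float column
replaced = wages.replace("0.0", np.nan)

# mask() sets every flagged entry to NaN regardless of dtype
masked = wages.mask(is_suppressed)

print("NaN count after replace:", int(replaced.isna().sum()))
print("NaN count after mask:   ", int(masked.isna().sum()))
```

Running `df_renamed[['Men_Wage', 'Women_Wage']].dtypes` on the real data shows which case applies.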


In [6]:
# Filter for Ontario
df_ontario = df_renamed[df_renamed['GEOGRAPHY'] == ' Ontario']
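
The filter above matches `' Ontario'` with a leading space, so it silently returns no rows if the label is ever written without one. A hedged alternative is to strip whitespace from the text columns first. The sketch below uses a hypothetical toy frame, not the real dataset. Note that the leading spaces in IMMIGRANT may encode the category hierarchy in the source file, so stripping them is a trade-off rather than a pure cleanup.

```python
import pandas as pd

# Hypothetical toy frame imitating the padded labels in the source CSV
demo = pd.DataFrame({
    "GEOGRAPHY": [" Ontario", "Canada", " Ontario"],
    "IMMIGRANT": ["Total", " Born in Canada", "   Established immigrants, 10+ years"],
})

# Strip leading/trailing whitespace from each text column
text_cols = ["GEOGRAPHY", "IMMIGRANT"]
demo[text_cols] = demo[text_cols].apply(lambda s: s.str.strip())

# The filter no longer depends on the exact padding
ontario = demo[demo["GEOGRAPHY"] == "Ontario"]
print(ontario)
```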

# **Research Question 3**

Examine wage trends across the different immigrant subgroups for men and women to determine the influence of immigration status on the gender wage gap.

Method: Apply a decision tree model to analyze gender wage gaps across the immigrant subgroups, and validate the model using time-series cross-validation (scikit-learn's TimeSeriesSplit).
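
The validation code in this section uses scikit-learn's `TimeSeriesSplit`, which differs from shuffled K-fold in that every training set contains only rows that come before the test rows. The sketch below illustrates this on a hypothetical array of 12 time-ordered rows. The names `X_demo` and `tscv_demo` are placeholders and are not part of the analysis.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical stand-in: 12 rows already sorted by time
X_demo = np.arange(12).reshape(-1, 1)

tscv_demo = TimeSeriesSplit(n_splits=3)
splits = list(tscv_demo.split(X_demo))

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    # The training window grows with each fold; test rows always follow it
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Because the splitter works on row positions, the data must be sorted chronologically before it is passed in.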

In [7]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np
import pandas as pd

# Filter out suppressed wage values for both men and women
df_filtered = df_ontario[
    (df_ontario['Men_Wage_is_suppressed'] == False) &
    (df_ontario['Women_Wage_is_suppressed'] == False)
].copy()

# Create a 'wage_gap' target variable
df_filtered['wage_gap'] = df_filtered['Men_Wage'] - df_filtered['Women_Wage']

# Sort by YEAR to respect temporal order
df_filtered = df_filtered.sort_values(by='YEAR')

# Select features
features = ['YEAR', 'IMMIGRANT']
target = 'wage_gap'

X = df_filtered[features]
y = df_filtered[target]

# Identify categorical columns for encoding
categorical_features = ['IMMIGRANT']

# Create a column transformer for one-hot encoding
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='passthrough'  # Pass YEAR through unchanged
)

# Create a pipeline with the preprocessor and the Decision Tree Regressor
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', DecisionTreeRegressor(random_state=42))])

# Define TimeSeriesSplit for cross-validation
n_splits = 5
tscv = TimeSeriesSplit(n_splits=n_splits)

# Perform TimeSeries Cross-Validation
scores = cross_val_score(pipeline, X, y, scoring='neg_mean_squared_error', cv=tscv)
mse_scores = -scores

# Print the results
print(f"TimeSeriesSplit results (Mean Squared Error) for Decision Tree Regressor with {n_splits} folds:")
print(f"Individual MSE scores: {mse_scores}")
print(f"Mean MSE: {np.mean(mse_scores):.4f}")
print(f"Standard deviation of MSE: {np.std(mse_scores):.4f}")

# Fit the model on the entire dataset to examine feature importances
pipeline.fit(X, y)

# Get feature importances from the trained model
encoded_feature_names = pipeline.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out(categorical_features)
passthrough_features = [col for col in features if col not in categorical_features]
all_feature_names = np.concatenate([encoded_feature_names, passthrough_features])

feature_importances = pd.Series(pipeline.named_steps['regressor'].feature_importances_, index=all_feature_names)
sorted_feature_importances = feature_importances.sort_values(ascending=False)

print("\nFeature Importances (Top 10) for predicting Wage Gap:")
print(sorted_feature_importances.head(10))


TimeSeriesSplit results (Mean Squared Error) for Decision Tree Regressor with 5 folds:
Individual MSE scores: [ 7.19946221  8.81555354 10.16805953 16.51382545 12.15107548]
Mean MSE: 10.9696
Standard deviation of MSE: 3.2134

Feature Importances (Top 10) for predicting Wage Gap:
YEAR                                                   0.512584
IMMIGRANT_ Non-landed immigrants                       0.129646
IMMIGRANT_ Born in Canada                              0.115430
IMMIGRANT_Total                                        0.108738
IMMIGRANT_   Recent immigrants, 5+ to 10 years         0.084950
IMMIGRANT_  Very recent immigrants, 5 years or less    0.029091
IMMIGRANT_  Recent immigrants 5+ years                 0.011512
IMMIGRANT_ Total Landed Immigrants                     0.004403
IMMIGRANT_   Established immigrants, 10+ years         0.003647
dtype: float64


The decision tree's feature importances show that YEAR is by far the most important feature (about 0.51). Among the immigrant-status indicators, 'Non-landed immigrants' (0.13), 'Born in Canada' (0.12), and 'Total' (0.11) carry some importance, while the remaining subcategories each contribute less than 0.09.

The mean MSE of approximately 10.97 corresponds to a root-mean-squared error of about 3.3, in the same units as the wage gap. The standard deviation of 3.21 shows that performance varies noticeably across the time folds, with the per-fold MSE ranging from 7.20 to 16.51; some variability is expected with time series data.
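
To put the error on the scale of the wage gap itself, the per-fold MSE values can be converted to RMSE. This is a minimal sketch with the scores copied from the cell output above. The variable names are illustrative.

```python
import numpy as np

# Per-fold MSE values copied from the TimeSeriesSplit output above
mse_scores = np.array([7.19946221, 8.81555354, 10.16805953, 16.51382545, 12.15107548])

# RMSE per fold, in the same units as the wage gap
rmse_per_fold = np.sqrt(mse_scores)

# RMSE corresponding to the mean MSE
mean_mse = float(mse_scores.mean())
overall_rmse = float(np.sqrt(mean_mse))

print("RMSE per fold:", np.round(rmse_per_fold, 2))
print(f"Mean MSE: {mean_mse:.4f}, RMSE: {overall_rmse:.2f}")
```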

However, the time series CV is based on only 18 yearly points, which is a very small sample. Because each fold covers so few years, the validation set may not capture meaningful patterns, and a single unusual year (e.g., a recession or policy change) could dominate a fold and lead to misleading performance estimates.
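
A related point is that `TimeSeriesSplit` cuts by row position, so when several rows share a YEAR, one year can land on both sides of a fold boundary. One possible way to keep each year intact is to split the distinct years and map them back to rows. The sketch below assumes this approach and uses a hypothetical toy frame. The names `demo`, `year_splitter`, and `folds` are illustrative and do not appear in the analysis.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy frame: three rows per year, six years
demo = pd.DataFrame({
    "YEAR": np.repeat(np.arange(2006, 2012), 3),
    "wage_gap": np.linspace(3.0, 4.7, 18),
})

# Split the distinct years, not the rows
years = np.sort(demo["YEAR"].unique())
year_splitter = TimeSeriesSplit(n_splits=2)

folds = []
for train_pos, test_pos in year_splitter.split(years):
    # Map year positions back to row positions
    train_rows = np.flatnonzero(demo["YEAR"].isin(years[train_pos]).to_numpy())
    test_rows = np.flatnonzero(demo["YEAR"].isin(years[test_pos]).to_numpy())
    folds.append((train_rows, test_rows))

for train_rows, test_rows in folds:
    print(demo["YEAR"].iloc[train_rows].unique(), "->", demo["YEAR"].iloc[test_rows].unique())
```

A list of `(train, test)` index pairs like `folds` can be passed as the `cv` argument of `cross_val_score`, provided the frame has been reset to a default positional index first.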