# Sprint 3: Model Experiments for Computational Modeling Approaches

This notebook sets up experiments for computational modeling approaches, based on the associations between industries and job roles found during EDA. It loads the processed data and measures the industry-job role association, building on the high-correlation features identified in Task 3.1 (e.g., strong associations in IT such as 'Head of Product' and 'Head of Software Engineering', and healthcare roles such as 'Registered Nurses'). It then prepares for pattern recognition systems: data preprocessing, feature engineering for industry-job role pairs, and placeholders for Level 3 computational models.

Note: This notebook contains setup code only. Do not execute or run experiments.

In [7]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette('husl')

In [8]:
# Load the processed data
data_path = '../data/processed/cleaned_s_user_jobrole.csv'
df = pd.read_csv(data_path)
print(f"Data loaded successfully. Shape: {df.shape}")
print("Columns in DataFrame:", df.columns.tolist())
print(df.head())
print(df.info())

Data loaded successfully. Shape: (4328, 34)
Columns in DataFrame: ['id', 'industries', 'department', 'sub_department', 'department_id', 'jobrole', 'description', 'jobrole_category', 'performance_expectation', 'status', 'related_jobrole', 'required_skill_experience', 'location', 'salary_range', 'company_information', 'responsibilities', 'benefits', 'keyword_tags', 'job_posting_date', 'application_deadline', 'contact_information', 'internal_tracking', 'education', 'experience', 'training', 'sub_institute_id', 'created_by', 'updated_by', 'deleted_by', 'created_at', 'updated_at', 'deleted_at', 'required_skill_experience_missing', 'jobrole_missing']
   id   industries          department      sub_department  department_id  \
0   1  Accountancy           Assurance           Assurance            838   
1   2  Accountancy           Assurance           Assurance            838   
2   3  Accountancy           Assurance           Assurance            838   
3   4  Accountancy           Assurance 

In [9]:
required_cols = ["industries", "jobrole"]

df_clean = df[required_cols].copy()

for col in required_cols:
    df_clean[col] = (
        df_clean[col]
        .astype(str)
        .str.strip()
        .str.lower()
        .replace({"": np.nan, "nan": np.nan})
    )

df_clean = df_clean.dropna()

print("After cleaning:", df_clean.shape)


After cleaning: (4327, 2)
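
The `"nan"` entry in the replacement map matters because `.astype(str)` can turn a missing value into the literal string `"nan"`, which `dropna()` would otherwise keep. A small sketch on made-up rows (the toy frame is an assumption, not the real data) shows what the cleaning chain does:

```python
import numpy as np
import pandas as pd

# Made-up rows: padded text, an empty string, and missing values
toy = pd.DataFrame({
    "industries": ["  Information Technology ", "", np.nan, "Air Transport"],
    "jobrole": ["Head of Product", "Data Engineer", "Product Manager", np.nan],
})

cleaned = toy.copy()
for col in ["industries", "jobrole"]:
    cleaned[col] = (
        cleaned[col]
        .astype(str)       # missing values may become the string "nan"
        .str.strip()       # drop surrounding whitespace
        .str.lower()       # make matching case-insensitive
        .replace({"": np.nan, "nan": np.nan})  # turn placeholders back into NaN
    )
cleaned = cleaned.dropna()

rows = cleaned.values.tolist()
print(rows)  # only the first row has both fields filled in
```

Lower-casing also merges labels that differ only by case, which may be one reason the cleaned category counts differ from the raw ones.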


In [10]:
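# Data preprocessing
# Missing values were handled in the previous cell (df_clean).
# Note: the encoders below are fit on the raw df, so the counts
# can include labels that the cleaning step would merge or drop.

# Encode categorical variables for correlation analysis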
le_industries = LabelEncoder()
le_jobrole = LabelEncoder()

df['industries_encoded'] = le_industries.fit_transform(df['industries'])
df['jobrole_encoded'] = le_jobrole.fit_transform(df['jobrole'])

print("Data preprocessing completed.")
print(f"Unique industries: {len(le_industries.classes_)}")
print(f"Unique job roles: {len(le_jobrole.classes_)}")

Data preprocessing completed.
Unique industries: 46
Unique job roles: 2843


In [11]:
contingency_table = pd.crosstab(df["industries"], df["jobrole"])
chi2, p, dof, expected = chi2_contingency(contingency_table)

n = contingency_table.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(contingency_table.shape) - 1)))

print("Chi-square:", round(chi2, 3))
print("p-value:", round(p, 6))
print("Degrees of freedom:", dof)
print("Cramer's V:", round(cramers_v, 3))
print("Cells with expected freq < 5:", (expected < 5).sum(), "/", expected.size)


Chi-square: 163495.973
p-value: 0.0
Degrees of freedom: 125004
Cramer's V: 0.927
Cells with expected freq < 5: 127890 / 127890
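
Every cell in the table above has an expected frequency below 5, so the chi-square approximation is unreliable here, and with thousands of job roles spread over 4,327 rows an uncorrected Cramér's V is likely inflated. One common remedy is the bias-corrected estimator proposed by Bergsma (2013). A minimal NumPy sketch follows; `cramers_v` and the toy table are illustrative and not part of the notebook:

```python
import numpy as np

def cramers_v(table, bias_corrected=False):
    """Cramér's V for a 2-D table of counts (hypothetical helper).

    Assumes every row and column total is positive, as in a crosstab.
    """
    observed = np.asarray(table, dtype=float)
    n = observed.sum()
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals @ col_totals / n          # counts under independence
    chi2 = ((observed - expected) ** 2 / expected).sum()
    r, k = observed.shape
    phi2 = chi2 / n
    if not bias_corrected:
        return float(np.sqrt(phi2 / (min(r, k) - 1)))
    # Bergsma (2013): shrink phi^2 and the table dimensions
    phi2_corr = max(0.0, phi2 - (r - 1) * (k - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    return float(np.sqrt(phi2_corr / (min(r_corr, k_corr) - 1)))

# Toy table: 3 "industries" x 4 "job roles", each role seen in one industry only
toy = np.array([[2, 1, 0, 0],
                [0, 0, 3, 0],
                [0, 0, 0, 2]])

v_plain = cramers_v(toy)
v_corrected = cramers_v(toy, bias_corrected=True)
print(round(v_plain, 3), round(v_corrected, 3))
```

When each role occurs in a single industry, the plain statistic reaches its maximum regardless of sample size, while the corrected value is lower. On the notebook's data the helper could be called as `cramers_v(contingency_table.to_numpy(), bias_corrected=True)`; that result is not shown because it has not been run here.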


In [23]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import warnings

warnings.filterwarnings("ignore")

# Load data
data_path = "../data/processed/cleaned_s_user_jobrole.csv"
df = pd.read_csv(data_path)

# Clean categorical columns
df = df[["industries", "jobrole"]].copy()
for col in ["industries", "jobrole"]:
    df[col] = (
        df[col]
        .astype(str)
        .str.strip()
        .str.lower()
        .replace({"": np.nan, "nan": np.nan})
    )
df = df.dropna()

# Chi-square test
contingency_table = pd.crosstab(df["industries"], df["jobrole"])
chi2, p, dof, expected = chi2_contingency(contingency_table)

n = contingency_table.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(contingency_table.shape) - 1)))

print("Chi-square:", round(chi2, 3))
print("p-value:", round(p, 6))
print("Degrees of freedom:", dof)
print("Cramer's V:", round(cramers_v, 3))
print("Cells with expected freq < 5:", (expected < 5).sum(), "/", expected.size)

# Pearson residuals: (observed - expected) / sqrt(expected).
# |residual| > 2 flags cells that deviate notably from independence.
expected_df = pd.DataFrame(
    expected,
    index=contingency_table.index,
    columns=contingency_table.columns
)
std_residuals = (contingency_table - expected_df) / np.sqrt(expected_df)

strong_pairs = (
    std_residuals
    .stack()
    .reset_index()
    .rename(columns={0: "std_residual"})
)
strong_pairs = strong_pairs[strong_pairs["std_residual"].abs() > 2]
strong_pairs = strong_pairs.sort_values("std_residual", ascending=False)

print("\nStrong industry–jobrole associations:")
print(strong_pairs.head(10))

# ML: predict jobrole from industry only (no leakage)
le_jobrole = LabelEncoder()
y = le_jobrole.fit_transform(df["jobrole"])

ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
X = ohe.fit_transform(df[["industries"]])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(
    max_iter=1000,
    # n_jobs=-1,
    # multi_class="multinomial",
    solver='lbfgs'
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("\nClassification Report:")
print(
    classification_report(
        y_test,
        y_pred,
        labels=np.unique(y_test),
        target_names=le_jobrole.inverse_transform(np.unique(y_test)),
        digits=2
    )
)


Chi-square: 163495.973
p-value: 0.0
Degrees of freedom: 125004
Cramer's V: 0.927
Cells with expected freq < 5: 127890 / 127890

Strong industry–jobrole associations:
                         industries  \
23105   carbon services and trading   
23107   carbon services and trading   
23106   carbon services and trading   
23110   carbon services and trading   
23109   carbon services and trading   
23108   carbon services and trading   
23382   carbon services and trading   
24026   carbon services and trading   
11292                 air transport   
127493  workplace safety and health   

                                                  jobrole  std_residual  
23105                                   carbon accountant     23.213721  
23107                        carbon investment specialist     23.213721  
23106                                      carbon auditor     23.213721  
23110                                     carbon verifier     23.213721  
23109                           

In [35]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import warnings

warnings.filterwarnings("ignore")

# Load data
df = pd.read_csv("../data/processed/cleaned_s_user_jobrole.csv")
df = df[["industries", "jobrole"]].copy()
for col in ["industries", "jobrole"]:
    df[col] = df[col].astype(str).str.strip().str.lower().replace({"": np.nan, "nan": np.nan})
df = df.dropna()

# Keep the 5 most frequent job roles to avoid very small classes
top_roles = df["jobrole"].value_counts().nlargest(5).index
df = df[df["jobrole"].isin(top_roles)]

# Encode labels
le_jobrole = LabelEncoder()
y = le_jobrole.fit_transform(df["jobrole"])

# Encode industry feature
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
X = ohe.fit_transform(df[["industries"]])

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train model (no multi_class keyword for compatibility)
model = LogisticRegression(max_iter=1000, solver='lbfgs')
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Report only on labels that actually occur in the test set
unique_labels = np.unique(y_test)
target_names = le_jobrole.inverse_transform(unique_labels)

print("\nClassification Report:\n")
print(classification_report(y_test, y_pred, labels=unique_labels, target_names=target_names, digits=2))



Classification Report:

                              precision    recall  f1-score   support

business development manager       0.40      1.00      0.57         2
               data engineer       0.00      0.00      0.00         2
head of software engineering       0.00      0.00      0.00         2
             product manager       0.33      0.50      0.40         2
              vice president       1.00      1.00      1.00         2

                    accuracy                           0.50        10
                   macro avg       0.35      0.50      0.39        10
                weighted avg       0.35      0.50      0.39        10



In [13]:
print(std_residuals.abs().describe())


jobrole  1st assistant cameraman / focus puller (specialty camera operation)  \
count                                            45.000000                     
mean                                              0.236912                     
std                                               0.731095                     
min                                               0.042998                     
25%                                               0.080443                     
50%                                               0.105324                     
75%                                               0.153535                     
max                                               5.008066                     

jobrole  2d artist (concept art / background art / character art, storyboarding)  \
count                                            45.000000                         
mean                                              0.236912                         
std                        
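
The residual values printed in this notebook can be reproduced by hand, which helps when reading them. The sketch below assumes a job role that occurs exactly once, inside an industry with 8 rows out of the 4,327 cleaned rows. Both counts are assumptions chosen because they reproduce the printed 23.21 and 0.043 figures; they have not been checked against the data:

```python
import math

def pearson_residual(observed, row_total, col_total, n):
    """(observed - expected) / sqrt(expected) for one contingency-table cell."""
    expected = row_total * col_total / n
    return (observed - expected) / math.sqrt(expected)

n = 4327        # rows after cleaning, from the notebook output
row_total = 8   # ASSUMPTION: number of rows in the industry
col_total = 1   # ASSUMPTION: the job role appears once overall

hit = pearson_residual(1, row_total, col_total, n)   # the cell holding the role
miss = pearson_residual(0, row_total, col_total, n)  # an empty cell of that size

print(round(hit, 4), round(miss, 6))
```

Under these assumptions the largest residuals come from single-occurrence roles in small industries, so a high value signals rarity as much as a strong industry-role link.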

In [None]:
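def normalize_role(role):
    """Collapse a lower-cased job role into a coarse group by substring match.

    Rules are checked in order; the first match wins.
    """
    if "data scientist" in role:
        return "data scientist"
    if "software" in role or "developer" in role:
        return "software engineer"
    if "analyst" in role:
        return "analyst"
    if "manager" in role:
        return "manager"
    # No rule matched: keep the original role name
    return role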

df["jobrole_grouped"] = df["jobrole"].apply(normalize_role)


                   industries                       jobrole  count
11292           air transport                vice president     11
72194  information technology               head of product      6
72213  information technology  head of software engineering      6
73037  information technology               product manager      6
73111  information technology    quality assurance engineer      6
73112  information technology     quality assurance manager      6
73125  information technology              quality engineer      6
73130  information technology   quality engineering manager      6
73540  information technology            software architect      6
73543  information technology  software engineering manager      6
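
Applying `normalize_role` to a few of the role names listed above shows how the substring rules group them. The function is copied from the cell so the sketch is self-contained; the `samples` list is only an illustration:

```python
def normalize_role(role):
    """Same rules as the notebook cell: first matching substring wins."""
    if "data scientist" in role:
        return "data scientist"
    if "software" in role or "developer" in role:
        return "software engineer"
    if "analyst" in role:
        return "analyst"
    if "manager" in role:
        return "manager"
    return role

samples = [
    "head of product",
    "head of software engineering",
    "product manager",
    "quality assurance manager",
    "software engineering manager",
]
grouped = {role: normalize_role(role) for role in samples}
for role, group in grouped.items():
    print(f"{role!r} -> {group!r}")
```

Leadership roles such as "head of software engineering" and "software engineering manager" fall into "software engineer" because the "software" rule is tested before the "manager" rule. If seniority should be preserved, the rule order or the keywords would need to change.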


## Summary

This notebook has been initialized with:
- Data loading from `data/processed/cleaned_s_user_jobrole.csv`
- Correlation computation between industries and job roles using contingency tables and Cramer's V
- Identification of high-correlation features based on EDA insights (e.g., IT roles, healthcare roles)
- Feature engineering for industry-job role pairs
- Placeholders for Level 3 computational models (e.g., neural networks for pattern recognition)

Next steps: Execute the cells to run the setup, then develop and test specific models based on the correlations identified.