## Education & Training Alignment

This section examines how well education and formal training match what the data science job market actually asks for.  
It explores three research questions:

### Research Questions
1. **Education vs. Salary/Seniority**  
   - How does the required education level correlate with salary ranges and seniority level in data science roles?  
   - Example: Do master’s or PhD holders consistently earn more or secure higher-level positions compared to those with bachelor’s degrees or bootcamp training?

2. **Skills Gap in Formal Education**  
   - Are there technical skills or tools that are **highly demanded in job postings** but are **rarely taught** in formal education programs?  
   - Example: Cloud platforms (AWS, GCP, Azure), version control (Git/GitHub), or modern ML frameworks (TensorFlow, PyTorch).

3. **Job Postings vs. Curricula**  
   - How well do job postings align with the skills listed by educational institutions?  
   - Which **emerging skills** are appearing in postings but not yet common in curricula?  
   - Example: Generative AI tools, MLOps, or advanced data visualization libraries.

---

### Expected Outcomes
- Identify whether higher education is a strong predictor of salary and role seniority.  
- Highlight **gaps between industry demand and academic curricula**.  
- Provide recommendations for how training programs (bootcamps, universities, online courses) can better align with real-world employer expectations.

---

### Data Needed
- Education requirements from job postings.  
- Salary data and seniority levels (junior, mid, senior).  
- Curriculum data from universities, bootcamps, and online programs.  
- Frequency of technical skills mentioned in both postings and curricula.

In [25]:
import pandas as pd 
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

## Dataset Loading and Initial Exploration

We begin by loading the AI job dataset into a pandas DataFrame.  
To understand the dataset’s structure and quality, we perform three quick checks:  

1. **`info()`** – Displays the number of rows/columns, data types, and non-null counts.  
2. **`describe()`** – Summarizes numeric columns (count, mean, std, min, quartiles, max).  
3. **`head()`** – Previews the first 5 rows to confirm the data loaded correctly.  

These steps give us a foundational understanding of the dataset before deeper analysis.

In [26]:
# Load the AI job dataset from the datasets folder into a DataFrame called 'aijobs'
aijobs = pd.read_csv("../datasets/ai_job_dataset.csv")

# Display general information about the dataset:
# number of rows/columns, column names, data types, and non-null counts
aijobs.info()

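# Generate basic descriptive statistics for numeric columns
# (count, mean, std, min, quartiles, max)
# Note: a notebook cell renders only its last expression, so this result
# is not displayed here; wrap it in print() or display() to see it
aijobs.describe()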

# Preview the first 5 rows of the dataset
# This helps confirm the data loaded correctly and shows a sample of the values
aijobs.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 19 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   job_id                  15000 non-null  object 
 1   job_title               15000 non-null  object 
 2   salary_usd              15000 non-null  int64  
 3   salary_currency         15000 non-null  object 
 4   experience_level        15000 non-null  object 
 5   employment_type         15000 non-null  object 
 6   company_location        15000 non-null  object 
 7   company_size            15000 non-null  object 
 8   employee_residence      15000 non-null  object 
 9   remote_ratio            15000 non-null  int64  
 10  required_skills         15000 non-null  object 
 11  education_required      15000 non-null  object 
 12  years_experience        15000 non-null  int64  
 13  industry                15000 non-null  object 
 14  posting_date            15000 non-null

|  | job_id | job_title | salary_usd | salary_currency | experience_level | employment_type | company_location | company_size | employee_residence | remote_ratio | required_skills | education_required | years_experience | industry | posting_date | application_deadline | job_description_length | benefits_score | company_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AI00001 | AI Research Scientist | 90376 | USD | SE | CT | China | M | China | 50 | Tableau, PyTorch, Kubernetes, Linux, NLP | Bachelor | 9 | Automotive | 2024-10-18 | 2024-11-07 | 1076 | 5.9 | Smart Analytics |
| 1 | AI00002 | AI Software Engineer | 61895 | USD | EN | CT | Canada | M | Ireland | 100 | Deep Learning, AWS, Mathematics, Python, Docker | Master | 1 | Media | 2024-11-20 | 2025-01-11 | 1268 | 5.2 | TechCorp Inc |
| 2 | AI00003 | AI Specialist | 152626 | USD | MI | FL | Switzerland | L | South Korea | 0 | Kubernetes, Deep Learning, Java, Hadoop, NLP | Associate | 2 | Education | 2025-03-18 | 2025-04-07 | 1974 | 9.4 | Autonomous Tech |
| 3 | AI00004 | NLP Engineer | 80215 | USD | SE | FL | India | M | India | 50 | Scala, SQL, Linux, Python | PhD | 7 | Consulting | 2024-12-23 | 2025-02-24 | 1345 | 8.6 | Future Systems |
| 4 | AI00005 | AI Consultant | 54624 | EUR | EN | PT | France | S | Singapore | 100 | MLOps, Java, Tableau, Python | Master | 0 | Media | 2025-04-15 | 2025-06-23 | 1989 | 6.6 | Advanced Robotics |


### Check Dataset Dimensions

The `.shape` attribute tells us the size of the dataset in terms of rows and columns.  
This helps confirm the overall scale of the data we’re working with.  
Here, we also format the output to clearly state the row and column counts.

In [27]:
# Report the number of rows and columns in the dataset
# .shape[0] = number of rows, .shape[1] = number of columns
print(f"The dataset contains {aijobs.shape[0]:,} rows and {aijobs.shape[1]} columns.")


The dataset contains 15,000 rows and 19 columns.


### Check for Missing Values

Before analysis, it’s important to identify any missing data.  
The `.isnull().sum()` method chain counts the null values in each column.  
This helps us decide if we need to clean, impute, or drop certain columns/rows.

As a safeguard, we then remove every row that contains a null value using `.dropna()`.  
We confirm the cleanup by recomputing the total number of missing values.  
A result of `0` means the dataset has no remaining nulls.

In [28]:
# Check how many null (missing) values are present in each column
aijobs.isnull().sum()

job_id                    0
job_title                 0
salary_usd                0
salary_currency           0
experience_level          0
employment_type           0
company_location          0
company_size              0
employee_residence        0
remote_ratio              0
required_skills           0
education_required        0
years_experience          0
industry                  0
posting_date              0
application_deadline      0
job_description_length    0
benefits_score            0
company_name              0
dtype: int64

In [29]:
# Check the total number of missing values across the entire dataset
aijobs.isnull().sum().sum()

0

In [30]:
# Create a cleaned version of the dataset by dropping rows with missing values
aijobs_clean = aijobs.dropna()

# Verify that no missing values remain
aijobs_clean.isnull().sum().sum()

0
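
Because this dataset contains no nulls, `.dropna()` leaves it unchanged here. A minimal sketch on a small made-up frame shows what the same calls do when nulls are present. The `toy` DataFrame and its values are hypothetical and are not taken from the AI job dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT from the AI job dataset: two cells are missing
toy = pd.DataFrame({
    "salary_usd": [90000, np.nan, 152000],
    "education_required": ["Bachelor", "Master", None],
})

# Per-column null counts, mirroring the check above
null_counts = toy.isnull().sum()
print(null_counts)

# Drop every row that has at least one null, then re-check the total
toy_clean = toy.dropna()
remaining_nulls = int(toy_clean.isnull().sum().sum())
print(f"{len(toy)} rows before, {len(toy_clean)} rows after, "
      f"{remaining_nulls} nulls remaining")
```

Note that `.dropna()` removes a whole row even when only one cell is missing, so on data with many scattered nulls it may discard more rows than expected.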

### Create Subset for Education & Training Alignment

For the education and training research questions, we only need a subset of the dataset.  
We filter the cleaned DataFrame to keep the most relevant columns:  

- `job_id` → unique job identifier  
- `salary_usd` → standardized salary measure  
- `experience_level` → seniority of the role  
- `education_required` → education level listed in posting  
- `years_experience` → required years of experience  
- `required_skills` → technical/soft skills required  
- `posting_date` → posting date, used to track emerging skills and trends over time  
- `industry` → industry context for the role 
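
The `required_skills` column stores each posting's skills as a single comma-separated string, so counting skill demand needs a split step first. A minimal sketch of that step, using the `required_skills` values of the five preview rows shown earlier. The names `sample`, `skills`, and `skill_counts` are illustrative and not part of the notebook.

```python
import pandas as pd

# required_skills values copied from the five preview rows of the dataset
sample = pd.DataFrame({
    "job_id": ["AI00001", "AI00002", "AI00003", "AI00004", "AI00005"],
    "required_skills": [
        "Tableau, PyTorch, Kubernetes, Linux, NLP",
        "Deep Learning, AWS, Mathematics, Python, Docker",
        "Kubernetes, Deep Learning, Java, Hadoop, NLP",
        "Scala, SQL, Linux, Python",
        "MLOps, Java, Tableau, Python",
    ],
})

# Split each string into a list, give every skill its own row,
# then trim the whitespace left over after each comma
skills = (
    sample["required_skills"]
    .str.split(",")
    .explode()
    .str.strip()
)

# Count how many postings mention each skill
skill_counts = skills.value_counts()
print(skill_counts.head())
```

Five rows are only enough to check that the split works. On the full subset the same chain would be applied to the `required_skills` column of the filtered DataFrame.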

In [31]:
# Keep only the columns relevant to the Education & Training Alignment analysis
education_cols = ["job_id", "salary_usd", "experience_level", "education_required",
                  "years_experience", "required_skills", "posting_date", "industry"]
# .copy() returns an independent DataFrame, so later edits do not affect aijobs_clean
education_jobs = aijobs_clean[education_cols].copy()
print(education_jobs)



        job_id  salary_usd experience_level education_required  \
0      AI00001       90376               SE           Bachelor   
1      AI00002       61895               EN             Master   
2      AI00003      152626               MI          Associate   
3      AI00004       80215               SE                PhD   
4      AI00005       54624               EN             Master   
...        ...         ...              ...                ...   
14995  AI14996       38604               EN           Bachelor   
14996  AI14997       57811               EN             Master   
14997  AI14998      189490               EX          Associate   
14998  AI14999       79461               EN                PhD   
14999  AI15000       56481               MI                PhD   

       years_experience                                  required_skills  \
0                     9         Tableau, PyTorch, Kubernetes, Linux, NLP   
1                     1  Deep Learning, AWS, Mathematic

In [32]:
# Save the subset to a new CSV file
education_jobs.to_csv('education_jobs.csv', index=False)
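
As a possible next step toward the first research question, salaries can be summarized per education level. The sketch below runs on the five preview rows shown earlier, which are far too few to support any conclusion. On the real data, `sample` would be replaced by `education_jobs`. The ranking Associate < Bachelor < Master < PhD is an assumption about how the levels order, not something the dataset states, and the names `sample`, `education_order`, and `salary_by_education` are illustrative.

```python
import pandas as pd

# Values copied from the five preview rows of the dataset
sample = pd.DataFrame({
    "job_id": ["AI00001", "AI00002", "AI00003", "AI00004", "AI00005"],
    "salary_usd": [90376, 61895, 152626, 80215, 54624],
    "experience_level": ["SE", "EN", "MI", "SE", "EN"],
    "education_required": ["Bachelor", "Master", "Associate", "PhD", "Master"],
})

# Assumed ranking of education levels (not stated in the dataset itself)
education_order = ["Associate", "Bachelor", "Master", "PhD"]
sample["education_required"] = pd.Categorical(
    sample["education_required"], categories=education_order, ordered=True
)

# Median salary and number of postings per education level, in ranked order
salary_by_education = (
    sample.groupby("education_required", observed=True)["salary_usd"]
    .agg(["median", "count"])
)
print(salary_by_education)
```

The median is used rather than the mean because salary distributions tend to be skewed by a few very high values. Reporting the count next to it makes thinly populated groups easy to spot.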