
### Questions to be Answered

1. What are the most in-demand skills for the three most popular data roles?

2. How are in-demand skills trending for Data Analysts?

3. How well do jobs and skills pay for Data Analysts?

4. What is the optimal skill for Data Analysts to learn?


In [None]:
# Importing Libraries
import ast
import pandas as pd
import seaborn as sns
from datasets import load_dataset
import matplotlib.pyplot as plt

# Loading Data
dataset = load_dataset('lukebarousse/data_jobs')
df = dataset['train'].to_pandas()



In [None]:
df.head()


In [None]:
print(type(df.job_posted_date[0]))
print(type(df.job_skills[1]))

##### Data Cleanup

job_posted_date and job_skills are both stored as strings.

Convert job_posted_date to datetime and parse job_skills into a list.
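
As a minimal illustration of the job_skills conversion, the sketch below parses a stringified list with `ast.literal_eval` and passes non-string (missing) values through unchanged. The `parse_skills` helper and the sample values are hypothetical and are not rows from the dataset.

```python
import ast

def parse_skills(raw):
    """Parse a stringified list; return non-string values (e.g. None) unchanged."""
    if isinstance(raw, str):
        return ast.literal_eval(raw)
    return raw

# Hypothetical sample values, not taken from the dataset
samples = ["['sql', 'python']", None]
parsed = [parse_skills(s) for s in samples]
print(parsed)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text that comes from an external dataset.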

In [None]:
# Data Cleanup
# Change job_posted_date from string to datetime
df['job_posted_date'] = pd.to_datetime(df['job_posted_date'])

# Parse the stringified skill list into a real Python list;
# missing values are returned unchanged
def clean_list(skill_list):
  if pd.notna(skill_list):
    return ast.literal_eval(skill_list)
  return skill_list

df['job_skills'] = df['job_skills'].apply(clean_list)

Since I am based in Canada, the analysis focuses on Data Analyst roles in Canada.



In [None]:
df_DA_CAN = df[(df['job_country'] == 'Canada') & (df['job_title_short'] == 'Data Analyst')]

In [None]:
df_DA_CAN.head()

### Exploratory Data Analysis

Things to check:

1. What are the top 10 job locations in Canada for Data Analysts?
2. How many jobs have benefits?
3. Which 10 companies post the most Data Analyst jobs in Canada?


In [None]:
# Top 10 job locations
df_DA_CAN.job_location.value_counts().head(10)

While exploring the job_location field, I observed that many postings list the location simply as "Canada" rather than a specific city or province.

These entries lack the geographic detail required for city-level analysis.

To better understand these entries, I compared job_location with the job_work_from_home field.

If a posting had:

- job_location = "Canada"
- job_work_from_home = True

it was interpreted as a Canada-wide remote position.

If job_location = "Canada" but the remote flag was False or missing, the location was considered unspecified.
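
The rule above can be sketched as a small pure-Python function. This is only an illustration under stated assumptions: the `classify_location` helper and the label strings "Remote (Canada-wide)" and "Unspecified" are hypothetical names, not part of the dataset or the notebook.

```python
def classify_location(job_location, work_from_home):
    """Label a posting's location using the Canada / remote-flag rule."""
    # A specific city or province is kept as-is
    if job_location != "Canada":
        return job_location
    # `== True` is False for False, None, and NaN, so missing flags fall through
    if work_from_home == True:
        return "Remote (Canada-wide)"
    return "Unspecified"

# Hypothetical example postings: (job_location, job_work_from_home)
examples = [
    ("Toronto, ON, Canada", False),
    ("Canada", True),
    ("Canada", False),
    ("Canada", None),
]
labels = [classify_location(loc, wfh) for loc, wfh in examples]
print(labels)
```

If a per-row label were wanted in the DataFrame, a function like this could be applied row-wise with `DataFrame.apply(..., axis=1)`.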

In [None]:
# Breakdown of the remote flag for postings whose location is just "Canada"
df_DA_CAN.loc[df_DA_CAN['job_location'] == 'Canada', 'job_work_from_home'].value_counts(dropna=False)

A further breakdown shows that none of these postings carry a remote indicator.

As such, they are categorized as location unspecified.

To ensure that the “Top 10 Job Locations” visualization reflects only postings with clearly defined geographic information, rows labeled only as Canada were excluded from that specific analysis.

In [None]:
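# Postings listed only as "Canada" and not flagged as remote are location unspecified;
# `!= True` also catches a missing remote flag
unspecified = (df_DA_CAN['job_location'] == 'Canada') & (df_DA_CAN['job_work_from_home'] != True)
df_DA_CAN1 = df_DA_CAN[~unspecified]

df_DA_CAN1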

Plot the top 10 job locations as a horizontal bar plot.

In [None]:
df_plot = df_DA_CAN1.job_location.value_counts().head(10).to_frame()

sns.set_theme(style='ticks')

sns.barplot(data=df_plot, x='count', y='job_location', hue='count', palette='dark:b_r', legend=False)
# Call despine after plotting so it applies to the bar plot's axes
sns.despine()

plt.title('Top 10 Job Locations for Data Analysts in Canada')
plt.xlabel('Number of Jobs')
plt.ylabel('')
plt.show()



### Benefits Analysis for Data Analyst Jobs

Plot the share of jobs that:
- offer work from home
- offer health insurance


In [None]:
df_DA_CAN1[['job_health_insurance', 'job_work_from_home']].value_counts()

In [None]:
fig, ax = plt.subplots(1, 2)

dic_column = {
    'job_work_from_home' : 'Work From Home',
    'job_health_insurance' : 'Health Insurance'
}

# Loop over the columns and their titles;
# enumerate supplies the subplot index
for i, (column, title) in enumerate(dic_column.items()):
  counts = df_DA_CAN1[column].value_counts()
  # Take labels from the counts index so each slice is labelled with its own value
  ax[i].pie(counts, startangle=90, autopct='%1.1f%%', labels=counts.index.astype(str))
  ax[i].set_title(title)

plt.tight_layout()
plt.show()

### Number of Companies Hiring

In [None]:
df_plot = df_DA_CAN1.company_name.value_counts().head(10).to_frame()

sns.set_theme(style='ticks')

sns.barplot(data=df_plot, x='count', y='company_name', hue='count', palette='dark:b_r', legend=False)
# Call despine after plotting so it applies to the bar plot's axes
sns.despine()

plt.title('Top 10 Companies for Data Analysts in Canada')
plt.xlabel('Number of Jobs')
plt.ylabel('')
plt.show()


