# 📊 AI/ML Job Market Analysis — Real LinkedIn Dataset Project

This is my first full data analysis project where I tried to make sense of 800+ real AI/ML job listings scraped from LinkedIn (via Kaggle).  
I didn’t want to just run code — I wanted to understand **what the hiring scene actually looks like** for people entering this field.

I broke the whole process into 6 versions — each one focusing on cleaning, exploring, and asking better questions than the last.  
It’s not perfect, but I learned more with each step.

> I’ve documented everything version by version — including challenges, fixes, learnings, and what surprised me.

## Notebook Version: v1  
**Focus**: Dataset loading and basic structural preview  

This notebook is part of a versioned project exploring trends in AI/ML job postings in the U.S.  
This version focuses on loading the dataset, checking its structure, and identifying surface-level issues.


In [None]:
#importing the necessary libraries
import numpy as np 
import pandas as pd 

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Dataset Overview

- Source: Kaggle – AI and ML Job Listings USA  
- File path: `/kaggle/input/ai-and-ml-job-listings-usa/ai_ml_jobs_linkedin.csv`

## Load and Preview Data

Load the dataset into a DataFrame and preview its structure to understand the basic layout.


In [None]:
# Load the dataset
us_jobs_df = pd.read_csv('/kaggle/input/ai-and-ml-job-listings-usa/ai_ml_jobs_linkedin.csv')

# Create a working copy
jobs_df = us_jobs_df.copy()

In [None]:
# Preview first 2 rows
jobs_df.head(2)


In [None]:
# Check dataset shape
print(f"Rows: {jobs_df.shape[0]}, Columns: {jobs_df.shape[1]}")

# Data types and non-null info
jobs_df.info()

In [None]:
# Summary stats for numeric columns
jobs_df.describe()

## Initial Observations and Notes

- The dataset contains **862 rows** and **10 columns**.
- Some columns such as `companyName`, `publishedAt`, and `sector` contain missing values.
- Columns like `applicationsCount` and `publishedAt` may need data type conversions in the next version.
- No immediate data loading issues were encountered.


## Notebook Version: v2  
**Focus**: Data Cleaning and Formatting  
 
This version focuses on cleaning the dataset, handling missing values, renaming columns, correcting data types, and preparing the data for analysis.


In [None]:
#previewing the data again
jobs_df.head(3)

## Handling missing/null values

In [None]:
#checking for null values if any
jobs_df.isna().sum()

In [None]:
#handling null values

#filling the null values in columns 'companyName' and 'sector' with 'Unknown'
jobs_df[['companyName','sector']] = jobs_df[['companyName','sector']].fillna('Unknown')

#filling the null values in 'publishedAt' with ffill(), assuming each listing was posted close to the date of the row before it
#(fillna(method='ffill') is deprecated in recent pandas versions)
jobs_df['publishedAt'] = jobs_df['publishedAt'].ffill()

#check if null values still exist
jobs_df.isna().sum()


#### NOTE:
Filled `publishedAt` using forward fill to maintain temporal continuity, assuming each listing was published close to the record before it.

## Data Type Fix

In [None]:
#checking for the datatypes 
jobs_df.info()

In [None]:
# converting 'publishedAt' into datetime data type  
jobs_df['publishedAt'] = pd.to_datetime(jobs_df['publishedAt'])

# converting 'applicationsCount' into integer data type
#first we need to extract the count of the applications 
jobs_df['applicationsCount'] = jobs_df['applicationsCount'].str.extract(r'(\d+)')[0]

#now convert the 'applicationsCount' dtype to numeric
jobs_df['applicationsCount'] = pd.to_numeric(jobs_df['applicationsCount'])
jobs_df.info()
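A small sketch of how the extraction step behaves, using made-up strings. I'm assuming the raw values look something like "Over 200 applicants"; the real column may be formatted differently, and the `is_capped` flag is my own illustrative addition.

```python
import pandas as pd

# Hypothetical raw values; the real strings in the dataset may differ
raw = pd.Series([
    "Over 200 applicants",
    "Be among the first 25 applicants",
    "No applicants yet",
])

# Pull out the first run of digits; rows without digits become NaN
digits = raw.str.extract(r"(\d+)")[0]

# Convert to numbers; a NaN in the column forces a float dtype
counts = pd.to_numeric(digits)
print(counts)

# Flag capped values so "Over 200" is not read as exactly 200
is_capped = raw.str.contains("over", case=False)
print(is_capped)
```

If the column really does contain capped values like this, sums and means built on it are lower bounds rather than exact figures.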


## Removing the columns that are not useful for my analysis

In [None]:
#making a new df to store only the columns that are useful for my analysis

updated_jobs = jobs_df.drop(columns=['description','sector','workType'])
updated_jobs

## Duplicate Check and Removal

In [None]:
# checking the duplicate values that exists (based on all columns)
# updated_jobs.duplicated().sum()
updated_jobs[updated_jobs.duplicated()]

# dropping the duplicated values
updated_jobs.drop_duplicates(inplace=True)

# check whether any rows share the same title, companyName, location and publishedAt
updated_jobs.duplicated(subset=['title', 'companyName', 'location', 'publishedAt']).sum()

# removing the duplicated values
duplicate_vals = updated_jobs.duplicated(subset=['title', 'companyName', 'location', 'publishedAt'])
updated_jobs = updated_jobs[~duplicate_vals].copy()

#resetting the index after dropping the duplicate values
updated_jobs.reset_index(drop=True,inplace=True)
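To show why the key-based check is needed on top of the exact-row check, here is a minimal sketch on a made-up frame (company names and values are invented). It also shows that `drop_duplicates(subset=...)` can do the mask-and-filter step in one call.

```python
import pandas as pd

# Toy listings (made up): rows 0 and 1 are the same posting captured twice
# with different application counts
toy = pd.DataFrame({
    "title": ["ML Engineer", "ML Engineer", "Data Scientist"],
    "companyName": ["Acme", "Acme", "Acme"],
    "location": ["Austin, TX", "Austin, TX", "Austin, TX"],
    "publishedAt": ["2024-01-05", "2024-01-05", "2024-01-05"],
    "applicationsCount": [25, 40, 25],
})

key_cols = ["title", "companyName", "location", "publishedAt"]

# Exact-row check misses the repeat because applicationsCount differs
exact_dupes = toy.duplicated().sum()

# Key-based check catches it
key_dupes = toy.duplicated(subset=key_cols).sum()

# Keeps the first row per key and resets the index in one chain
deduped = toy.drop_duplicates(subset=key_cols).reset_index(drop=True)

print(exact_dupes, key_dupes)
print(deduped)
```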

## Whitespace Stripping 



In [None]:
#stripping the whitespaces if any, from the string based columns

for col in ['title','companyName','location','experienceLevel','contractType']:
    updated_jobs[col] = updated_jobs[col].str.strip()



## Category Cleaning

In [None]:
#1. title
# for consistency I am converting the titles into title case
updated_jobs['title'] = updated_jobs['title'].str.title()

#check if the function is applied properly
updated_jobs['title'].head(3)

#2. location
# I will be splitting the location column into two parts: one is for city and other is for state
location_split = updated_jobs['location'].str.split(',',n=1,expand=True)

#adding the city column
updated_jobs['city'] = location_split[0].str.strip()

#adding the state column
updated_jobs['state'] = location_split[1].str.strip()

#check if any null value has been added due to the above two columns
updated_jobs.isna().sum()

#handle the null values
updated_jobs['state'] = updated_jobs['state'].fillna('Unknown')

#rechecking for null values
updated_jobs.isna().sum()

#removing the location column as it is no more useful
updated_jobs.drop('location',axis='columns',inplace=True)

#3. publishedAt
# I will be splitting this column into two parts as well: year and month (day is not useful)

updated_jobs['year'] = updated_jobs['publishedAt'].dt.year
updated_jobs['month'] = updated_jobs['publishedAt'].dt.month

#dropping the publishedAt column because it is not useful
updated_jobs.drop('publishedAt',axis='columns',inplace=True)
updated_jobs.columns

#4. companyName
# converting the company's name into title case so that it remains consistent throughout
updated_jobs['companyName'] = updated_jobs['companyName'].str.title()

#check if the change has been made properly
updated_jobs['companyName'].head(5)
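A quick sketch of how the `location` split behaves when a value has no comma. The location strings below are made up, and the `no_comma` mask is just an illustrative way to find rows worth a second look.

```python
import pandas as pd

# Made-up location strings, including values with no comma
loc = pd.Series(["Austin, TX", "United States", "New York, NY", "San Francisco Bay Area"])

# n=1 splits on the first comma only, so there are at most two parts
parts = loc.str.split(",", n=1, expand=True)

city = parts[0].str.strip()
# Rows without a comma have no second part, so state is missing there
state = parts[1].str.strip().fillna("Unknown")

# Rows without a comma end up entirely in the city column and may need review
no_comma = ~loc.str.contains(",")

print(pd.DataFrame({"city": city, "state": state, "needs_review": no_comma}))
```

In this toy data, "United States" lands in `city`, which is the kind of value that can be easy to miss when only `state` is checked for nulls.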

## Dataset Cleaning and Structuring Summary

In this version, I focused on cleaning and structuring the dataset to prepare it for meaningful analysis. The original dataset had multiple inconsistencies and mixed-format fields which could hinder exploration and insights.

## 🚀 Key actions
- Selected 7 relevant columns for the analysis.
- Cleaned categorical columns (`title`, `companyName`, `location`) for consistency.
- Split complex fields like `location` and `publishedAt` into simpler, analyzable components (city, state, year, month).
- Handled missing values in `state` by filling with "Unknown".

## ⚠️ Challenges
- Some job titles were overly specific or inconsistent (e.g., different casing, role modifiers). I resolved this with title casing but might need more grouping later.
- The `location` field didn’t follow a uniform format in all rows — some were missing state info, which led to NaN values after splitting.
- The `publishedAt` field contained full timestamps, which were more detail than needed at this stage, so I kept only the useful components (year/month).

## 🎯 Learnings
- Even basic string cleaning and formatting (like `.str.title()` or `.str.strip()`) can greatly improve consistency in the dataset.
- Breaking down complex columns (like `location` and `publishedAt`) can make future analysis smoother and more insightful.
- It's important to analyze columns one by one instead of applying generic cleaning — each column may need unique handling.



## Notebook Version: v3  
**Focus**: Exploratory Data Analysis (EDA)

This version focuses on asking structured and slightly deeper questions to understand the dataset better.  
I'm primarily focusing on categorical patterns, hiring distributions, and application behavior.  
We'll go from basic univariate counts to intermediate bivariate groupings (without visuals, which are reserved for v4).

> Note: `title` and `sector` are *not* taken up in this version intentionally.  
> - `title` is too noisy to analyze meaningfully without cleanup — we’ll handle that in **v5**.  
> - `sector` is reserved for **v5** as well, to avoid overloading this version and to keep v3 beginner-friendly.



### 1. Exploring Unique Values in Categorical Columns

In [None]:
#finding number of unique values in categorical columns
print('Unique values in categorical column:')
print(updated_jobs[['contractType','experienceLevel','month','year']].nunique())

### 2. Top 10 Most Common Job Titles

In [None]:
#top 10 most common job titles
common_title = updated_jobs['title'].value_counts().head(10)
print('Top 10 most common job titles')
print(common_title)

This gives a sense of which roles are being advertised the most — though detailed title analysis will be taken in v5.


###  3. Companies Posting the Most Jobs

In [None]:
#companies that have posted the most job listings
print('Companies that have posted the most job listings')
print(updated_jobs['companyName'].value_counts().head(10))

Companies with the most job listings often reflect dominant hiring brands in the market.


###  4. Top 5 Hiring Cities (with Cleaning)

In [None]:
#cities that are hiring the most(top 5)

#NOTE: a lot of records have 'United States' as the city, which is not a real city, so replacing it with 'Unknown'
updated_jobs['city'] = updated_jobs['city'].replace('United States','Unknown')
updated_jobs['city'].value_counts()

#NOTE: majority of companies have not entered the city, thus we will be ignoring it and will show the actual city names
hiring_cities = updated_jobs['city'][updated_jobs['city'] != 'Unknown']
hiring_cities.reset_index(drop=True, inplace=True)
print('Top 5 cities in US that are hiring the most')
print(hiring_cities.value_counts().head())

I noticed that many job listings have 'United States' or missing city data, so I cleaned it for more realistic counts.
(Probably missed in v2 while cleaning the data)

###  5. Most Common Contract Type & Experience Level

In [None]:
#most common contractType
print('Most common contract type: ')
print(updated_jobs['contractType'].value_counts().head(1).index[0])
print()

#experience level that is highest in demand
print('Experience level that is highest in demand ')
print(updated_jobs['experienceLevel'].value_counts())

Basic univariate checks to understand the dominant job types and demanded experience levels.


###  6. Average Application Count

In [None]:
#average of application counts
print('Average of application counts:')
print(updated_jobs['applicationsCount'].mean())

This gives a rough sense of how competitive the market is, though a high mean can be driven by a small number of listings with very large application counts.
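Because a mean is sensitive to a few large values, it can help to look at the median and quartiles next to it. A minimal sketch with made-up application counts (not taken from the dataset):

```python
import pandas as pd

# Made-up application counts: mostly small, with two large values
apps = pd.Series([25, 25, 25, 30, 40, 60, 200, 200])

mean_apps = apps.mean()
median_apps = apps.median()
quartiles = apps.quantile([0.25, 0.5, 0.75])

print("mean:  ", mean_apps)
print("median:", median_apps)
print(quartiles)
```

Here the mean is more than double the median, which is the pattern to watch for before reading the average as "typical" competition.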


### 7. Company with Highest/Lowest Application Count

In [None]:
#company that received highest and lowest number of applications

#grouping the company with the total number of application count
job_app_count = updated_jobs.groupby('companyName').agg({'applicationsCount':'sum'})

#extracting the max and min count
max_val = job_app_count['applicationsCount'].max()
min_val = job_app_count['applicationsCount'].min()

#extracting the company names
print("Company that received highest applications")
highest_company = job_app_count[job_app_count['applicationsCount'] == max_val]
print(highest_company)
print()

lowest_company = job_app_count[job_app_count['applicationsCount'] == min_val]
print("Top 5 companies that got lowest application count")
print(lowest_company.head())
print()
#NOTE: many companies are tied at the minimum (25) number of applications, so only 5 of them are shown

This indicates which companies attract more attention from applicants — maybe due to reputation or role type.
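The same highest/lowest lookup can be written more compactly with `idxmax` and `nsmallest`. This is only a sketch on invented data; `keep="all"` is used so that every company tied at the minimum is returned instead of an arbitrary one.

```python
import pandas as pd

# Made-up companies and counts
toy = pd.DataFrame({
    "companyName": ["Acme", "Acme", "Globex", "Initech", "Umbrella"],
    "applicationsCount": [200, 150, 25, 25, 60],
})

# Total applications per company
totals = toy.groupby("companyName")["applicationsCount"].sum()

# Label of the company with the largest total
top_company = totals.idxmax()

# keep="all" returns every company tied for the smallest total
lowest = totals.nsmallest(1, keep="all")

print(top_company)
print(lowest)
```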


### 8. Application Count by Experience Level

In [None]:
#application count distribution by experience level
print('Application count distribution by experience level')
exp_app_count = updated_jobs.groupby('experienceLevel').agg({'applicationsCount':'sum'})
print(exp_app_count)

Helpful to understand if juniors or seniors are attracting more applications.


### 9. Contract Types Getting the Most Applications

In [None]:
#contractType that are getting the highest number of applications
print('Contract type that are getting the highest number of applications')
contract_app_count = updated_jobs.groupby('contractType').agg({'applicationsCount':'sum'})
highest_count = contract_app_count['applicationsCount'].max()
highest_contract_val = contract_app_count[contract_app_count['applicationsCount']==highest_count]
print(highest_contract_val)

Full-time listings receive the most applications in total compared to other contract types. Since this is a sum, it also reflects how many listings each contract type has.

### 10. Experience Level vs Contract Type 

In [None]:
#which experience level are linked more with which contract type
exp_contract = updated_jobs.groupby(['experienceLevel', 'contractType']).size().unstack().fillna(0)
print("Experience vs Contract Type distribution:")
print(exp_contract)

Cross-sectional view of how contract types are distributed across experience levels.
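Raw counts can be hard to compare when one experience level has far more listings than another. One option is `pd.crosstab` with `normalize="index"`, which turns each row into shares. A sketch on made-up labels (the real category names in the dataset may differ):

```python
import pandas as pd

# Made-up experience levels and contract types
toy = pd.DataFrame({
    "experienceLevel": [
        "Entry level", "Entry level",
        "Mid-Senior level", "Mid-Senior level", "Mid-Senior level",
        "Internship",
    ],
    "contractType": [
        "Full-time", "Contract",
        "Full-time", "Full-time", "Contract",
        "Internship",
    ],
})

# Raw counts, equivalent to groupby(...).size().unstack().fillna(0)
counts = pd.crosstab(toy["experienceLevel"], toy["contractType"])

# Row-wise shares: each experience level sums to 1
shares = pd.crosstab(toy["experienceLevel"], toy["contractType"], normalize="index").round(2)

print(counts)
print(shares)
```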

### 11. Underperforming Contract Types

In [None]:
#Contract types with high postings but fewer average applications:

avg_app_per_contract = updated_jobs.groupby('contractType').agg({
    'applicationsCount': 'mean',
    'contractType': 'count'
}).rename(columns={'contractType': 'jobCount'}).sort_values(by='applicationsCount')
print("Avg applications per contract type vs job count:")
print(avg_app_per_contract)

This analysis shows which contract types might be oversupplied or under-attractive — useful for recruiters or job portals.
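The mean-plus-count summary can also be written with pandas named aggregation, which names the output columns directly and avoids the rename step. A sketch on made-up data; the column name `avgApplications` is my own choice.

```python
import pandas as pd

# Made-up listings
toy = pd.DataFrame({
    "contractType": ["Full-time", "Full-time", "Full-time", "Contract", "Part-time"],
    "applicationsCount": [200, 100, 60, 30, 25],
})

# One output column per keyword: name=(source column, aggregation)
summary = (
    toy.groupby("contractType")
       .agg(avgApplications=("applicationsCount", "mean"),
            jobCount=("applicationsCount", "size"))
       .sort_values("avgApplications")
)

print(summary)
```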


# Exploratory Data Analysis Summary – Version 3

In this version, I focused on exploring the dataset more deeply using beginner to intermediate level EDA (without any visuals).  
The goal was to understand how different categorical features like experience level, contract type, city, and applications behave — both on their own and together.


## 🔍 What I Did

- Checked unique values for key categorical columns: `contractType`, `experienceLevel`, `month`, and `year`.
- Found out which companies posted the most jobs and which cities are hiring the most.
- Looked at average application counts, plus which companies got the highest and lowest applications, and which contract types are getting the most interest.
- Grouped experience levels and contract types to see how they relate to each other.
- Identified which job types might be oversaturated (that is, lots of postings but not many applications on average).


## ⚠️ What I Didn’t Cover

- Skipped the `title` column for now because it’s just too messy — planning to clean and analyze job titles in v5 when I dive into deeper title/trend analysis.
- Left out the `sector` column to keep this version beginner-friendly and focused. That'll be part of v5 too.


## 🎯 What I Learned

- You can pull out a lot of insights just by grouping and aggregating columns — no fancy plots needed.
- Application counts alone tell a lot about which job types are getting traction, and which ones most people are ignoring.



# Notebook Version: v4  
**Focus**: Visual Exploration & Useful Hiring Trends

This version builds on the cleaned dataset and focuses entirely on meaningful visualizations using Matplotlib and Seaborn.

The aim here is not to plot everything possible — but to highlight the kind of trends that would actually help:
- **Job seekers**: spot hiring hotspots, application behavior, and target companies
- **Recruiters**: benchmark trends, see market competition, and hiring patterns

The charts cover:
- Top hiring companies
- Contract type and experience level preferences
- Sector-wise demand (basic view)
- State-wise hiring trends
- Job posting activity over time
- Application count patterns
- Company + state combinations to show where hiring is happening

> ⚠️ Note: I’ve skipped anything related to `title` and detailed `sector` analysis for now — both need cleanup, and that’ll be handled in **Version 5**.


In [None]:
#importing the libraries for visualisations
import matplotlib.pyplot as plt
import seaborn as sns

### 1.  Top Hiring Companies

In [None]:
# which 10 companies are hiring the most

top_hiring = updated_jobs['companyName'].value_counts().head(10).reset_index()
top_hiring.columns = ['companyName','noOfJobs']

#plotting the chart
ax = sns.barplot(data=top_hiring,x='noOfJobs',y='companyName',palette='crest')
plt.title('Top 10 hiring companies')
plt.xlabel('No of Jobs Offered by Companies')
plt.ylabel('Company Names')
for container in ax.containers:
    ax.bar_label(container,padding=1)
plt.show()

- Helps job seekers target high-opportunity employers

- Helps recruiters benchmark hiring volume

### 2. Contract Type Trends

In [None]:
# how are jobs distributed across different contract types (e.g., Full-time, Contract)?

ax = sns.countplot(data=updated_jobs,x='contractType',palette='crest')

plt.title('Contract Type Trends')
plt.xlabel('Type of Contract')
plt.ylabel('No of Job Listings')

for container in ax.containers:
    ax.bar_label(container, padding=1)
plt.show()

- Shows the nature of job availability in the market

- Recruiters can benchmark contract vs full-time usage

### 3. Experience Level Demand

In [None]:
# which experience levels are most in demand?

ax = sns.countplot(data=updated_jobs, y='experienceLevel',palette='crest')
plt.title('Experience Level Demand')
plt.xlabel('Number of Job Listings')
plt.ylabel('Experience Levels')

for container in ax.containers:
    ax.bar_label(container,padding=1)

plt.show()

- Job seekers can position themselves accordingly

- Recruiters can validate if they're aligned with market

### 4. Sector-wise Job Distribution

In [None]:
# which 10 sectors have the highest number of listings?

sector_jobs = jobs_df['sector'].value_counts().head(10).reset_index()
sector_jobs.columns = ['sector','jobCount']

ax = sns.barplot(data=sector_jobs, x='jobCount',y='sector',palette='crest')
plt.title('Sector-wise Job Distribution')
plt.ylabel('Sectors')
plt.xlabel('Number of Job Listings')

for container in ax.containers:
    ax.bar_label(container,padding=1)
    
plt.show()

- Job seekers see which industries have the most job openings

- Recruiters get a sense of how active their industry is in hiring

### 5. Job Postings Over Time

In [None]:
# how has the number of listings changed over months/years?

#extracting the month and year from the date
jobs_df['year_month'] = jobs_df['publishedAt'].dt.to_period('M').astype('str')

# counting the values
postings = jobs_df['year_month'].value_counts().sort_index()
postings = postings.reset_index()
postings.columns = ['year_month','jobPostings']

sns.lineplot(data=postings, x='year_month', y='jobPostings', marker='o')
plt.title('Job Postings Over Time')
plt.xlabel('Time period')
plt.ylabel('Number of Job Listings')
plt.grid(True)
plt.xticks(rotation = 45)
plt.show()

- Job seekers can time their applications better

- Recruiters can plan campaigns around hiring peaks
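One thing to watch with monthly counts: a month with zero postings simply does not appear in `value_counts()`, so a line chart would connect its neighbours and hide the gap. A minimal sketch, on made-up dates, of filling those months with 0 before plotting:

```python
import pandas as pd

# Made-up posting dates with no postings in February
dates = pd.to_datetime(pd.Series([
    "2024-01-03", "2024-01-20", "2024-03-11", "2024-03-15", "2024-03-30",
]))

# Count postings per calendar month
monthly = dates.dt.to_period("M").value_counts().sort_index()

# Reindex over every month in the range so quiet months show up as 0
all_months = pd.period_range(monthly.index.min(), monthly.index.max(), freq="M")
monthly = monthly.reindex(all_months, fill_value=0)

# String labels are convenient for the x-axis
labels = monthly.index.astype(str).tolist()

print(monthly)
```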

### 6. Top 10 Hiring Locations (States Only)

In [None]:
# the state column has 'Unknown' and 'United States' as values, which need to be handled first
updated_jobs.state.unique()

# removing the 'unknown' and 'united states' value from the column
state_jobs = updated_jobs[~updated_jobs['state'].isin(['Unknown', 'United States'])].copy()
top_states = state_jobs['state'].value_counts().head(10)
top_states = top_states.reset_index()
top_states.columns = ['state','jobCount']

#plotting the chart
ax = sns.barplot(data=top_states, x='state', y='jobCount', palette='crest')
plt.title('Top 10 Hiring Locations(state-wise)')
plt.ylabel('Number of Job Listing')
plt.xlabel('States')

for container in ax.containers:
    ax.bar_label(container,padding=1)

plt.show()

- Job seekers can focus on high-demand regions

- Recruiters can identify competitive hiring zones

### 7. Application Count Distribution

In [None]:
# what is the overall spread of application counts across listings?
sns.histplot(data=updated_jobs, x='applicationsCount', bins=40, kde=True)

plt.title('Distribution of Application Counts per Listing')
plt.xlabel('Number of Applications')
plt.ylabel('Number of Listings')
plt.grid(True)

plt.show()

- Job seekers get a sense of competition

- Recruiters learn which roles get more attention

### 8. Total Applications per Sector

In [None]:
# Which sectors receive more applications in total?

sector_app = jobs_df.groupby('sector',as_index=False).agg({'applicationsCount':'sum'})
sector_app = sector_app.sort_values(by='applicationsCount', ascending=False).head(10).reset_index(drop=True)

ax = sns.barplot(data=sector_app, y='sector', x='applicationsCount', palette='crest')
plt.title('Total applications per sector')
plt.xlabel('Number of applications')
plt.ylabel('Sectors')

for container in ax.containers:
    ax.bar_label(container,padding=1)

plt.show()

- For job seekers, it reveals which industries may be more competitive to break into.
- For recruiters, it helps benchmark talent interest and assess if their sector is attracting enough visibility.

### 9. Companies Hiring in Each State

In [None]:
#Which companies are active in which states?

company_state_jobs = updated_jobs.groupby(['companyName', 'state']).size().reset_index(name='jobCount')

company_state_jobs = company_state_jobs.sort_values(by='jobCount', ascending=False)

top_company_state = company_state_jobs.head(10)

ax = sns.barplot(data=top_company_state, x='jobCount', y='companyName', hue='state', dodge=False, palette='crest')

plt.title('Top Company-State Combinations by Job Count')
plt.xlabel('Number of Listings')
plt.ylabel('Company')

for container in ax.containers:
    ax.bar_label(container, padding=3)

plt.show()


- Job seekers understand which companies are hiring in which states — useful for location-specific targeting

- Recruiters analyze regional hiring activity across major competitors



# Visual Exploration Summary – Version 4

In this version, I focused on turning the cleaned dataset into meaningful and useful visualizations using Seaborn and Matplotlib.  
The goal was not to make it look fancy — but to bring out trends that actually matter to either job seekers or recruiters.

## 🔍 What I Did

- Plotted the **top 10 hiring companies** to see who’s hiring the most.
- Looked at the **contract type** and **experience level** distributions to understand what kind of roles are being posted.
- Explored **sector-wise demand** (basic view) to see which industries are putting out the most jobs.
- Checked **job posting activity over time** to spot any hiring patterns or slowdowns.
- Visualized how **applications are spread** across listings — is competition high or balanced?
- Plotted **total applications by sector** to see which industries are getting the most attention overall.
- Added a **company + state combination chart** to see where hiring is happening geographically — who’s hiring and in which states.

## ⚠️ What I Didn’t Cover

- I skipped the `title` column in this version since the data is too messy and inconsistent. I’ll clean and explore it properly in **v5**.
- I also skipped deeper analysis of the `sector` column (like cross-analysis with experience or applications) for now — that’ll also be part of the next version after cleanup.

## 🎯 What I Learned

- Even without touching the messy parts, you can still pull out 8–9 strong insights just by using visual tools smartly.
- Adding visual context makes patterns easier to understand — especially when looking at competition levels, top companies, and hiring locations.
- It's better to skip noisy columns than force weak charts — the real value comes from **clean, readable trends**.


# Notebook Version: v5

**Focus** : Title & Sector Cleanup + Role-Based Insight Exploration

**Overview:**

In this version, I shift focus to two key fields that were previously too messy to analyze meaningfully: **title** and **sector**.
Both of these columns carry crucial signals — job role, industry, and hiring focus — but in their raw form, they’re inconsistent, redundant, and often noisy.

**Goal 🎯:**

The goal of this version is twofold:
- Clean and categorize job titles into standard, analysis-ready buckets
- Group sectors into simplified categories for clearer industry-level insights

**Analysis Plan:**

Once cleaned, I’ll generate targeted visualizations to understand:
- Which AI/ML roles are most in demand
- Which sectors are hiring most actively
- How roles and sectors intersect in real-world postings

This version is not about excessive plotting — it’s about preparing the data for smart role/sector trend analysis and enabling powerful filters in the next version.

### 1. Cleaning and Handling the Title column

In [None]:
# converting all the titles into lower case

jobs_df['title'] = jobs_df['title'].str.lower()

# Grouping the titles into broad categories; titles that match none of them are labelled 'Other'

def clean_title(title):
    if 'intern' in title:
        return 'Intern'
    elif 'ml ops' in title or 'mlops' in title or 'ops' in title:
        return 'MLOps Engineer'
    elif 'ai/ml' in title or 'ml/ai' in title or 'ai /ml' in title:
        return 'AI/ML Engineer'
    elif 'data scientist' in title or 'data' in title:
        return 'Data Scientist'
    elif 'deep learning' in title or 'dl' in title:
        return 'Deep Learning Engineer'
    elif 'nlp' in title:
        return 'NLP Engineer'
    elif 'computer vision' in title or 'cv' in title:
        return 'Computer Vision Engineer'
    elif 'research' in title:
        return 'Research Engineer'
    elif 'ai' in title or 'artificial intelligence' in title:
        return 'AI Engineer'
    elif 'ml' in title or 'machine learning' in title or 'machine' in title: 
        return 'Machine Learning Engineer'
    elif 'python' in title:
        return 'Python Engineer'
    elif 'software' in title:
        return 'Software Engineer (AI/ML)'
    elif 'robots' in title or 'robotics' in title:
        return 'Robotics Engineer'
    elif 'front' in title or 'back' in title or 'web' in title:
        return 'Web Development Engineer'
    elif 'r&d' in title:  # titles are lower-cased, so 'RD'/'Research' could never match; 'research' is handled above
        return 'Research and Development Engineer'
    else:
        return 'Other'

# applying the function to the titles
cleaned_title = jobs_df['title'].apply(clean_title)

# changing the titles into cleaned titles
jobs_df['title'] = cleaned_title

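# the labels returned by clean_title are already cased correctly, so .str.title() is not applied here
# (it would turn 'MLOps Engineer' into 'Mlops Engineer' and 'AI/ML Engineer' into 'Ai/Ml Engineer')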
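A caveat with plain substring checks such as `'ai' in title`: they also match inside longer words (for example "trainer" contains "ai", and "internal" contains "intern"). One possible alternative, sketched below, is an ordered list of regex rules with word boundaries. The rule list is shortened and illustrative, and `bucket_title` is a hypothetical helper, not the function used in this notebook.

```python
import re

# Ordered (pattern, label) rules; the first match wins, like an if/elif chain.
# Shortened, illustrative list only.
TITLE_RULES = [
    (r"\bintern(ship)?\b", "Intern"),
    (r"\bml\s?ops\b", "MLOps Engineer"),
    (r"\bnlp\b", "NLP Engineer"),
    (r"\b(cv|computer vision)\b", "Computer Vision Engineer"),
    (r"\b(ai|artificial intelligence)\b", "AI Engineer"),
    (r"\b(ml|machine learning)\b", "Machine Learning Engineer"),
]

def bucket_title(title: str) -> str:
    """Return the label of the first rule whose pattern matches the title."""
    title = title.lower()
    for pattern, label in TITLE_RULES:
        if re.search(pattern, title):
            return label
    return "Other"

for t in ["Senior AI Engineer", "Sales Trainer", "Internal Tools Developer", "MLOps Engineer"]:
    print(t, "->", bucket_title(t))
```

With word boundaries, "Sales Trainer" and "Internal Tools Developer" fall through to "Other" instead of being picked up by the `ai` and `intern` rules.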


### 2. Cleaning and Handling the Sector Column

In [None]:
# converting all the sector values into lower case

jobs_df['sector'] = jobs_df['sector'].str.lower()

# checking if sector column consists of any null values
jobs_df['sector'].isna().sum()

# categorizing the sectors into broad categories

def clean_sector(sector):
    
    if 'tech' in sector or 'software' in sector or 'it' in sector or 'computer' in sector or 'data' in sector or 'telecommunication' in sector:
        return 'Tech'

    elif 'finance' in sector or 'bank' in sector or 'investment' in sector or 'insurance' in sector:
        return 'Finance'

    elif 'health' in sector or 'hospital' in sector or 'biotech' in sector or 'pharma' in sector or 'medical' in sector:
        return 'Healthcare'

    elif 'education' in sector or 'university' in sector or 'e-learning' in sector or 'school' in sector or 'coaching' in sector:
        return 'Education'

    elif 'government' in sector or 'public' in sector or 'defense' in sector:
        return 'Government'

    elif 'retail' in sector or 'ecommerce' in sector or 'consumer' in sector:
        return 'Retail'

    elif 'consult' in sector or 'advisory' in sector or 'services' in sector:
        return 'Consulting'
        
    elif 'internet' in sector:
        return 'Internet Publishing'

    elif 'manufacturing' in sector or 'manufacturer' in sector:
        return 'Manufacturing'

    elif 'translation' in sector or 'localise' in sector:
        return 'Translation and Localisation' 

    elif 'entertainment' in sector or 'entertaining' in sector or 'media' in sector:
        return 'Entertainment'

    elif 'engineer' in sector or 'engineering' in sector:
        return 'Engineering'
        
    elif 'oil' in sector or 'gas' in sector:
        return 'Energy'
        
    else:
        return 'Other'

#applying the function to each value in the sector
cleaned_sector = jobs_df['sector'].apply(clean_sector) 

#checking the counts of different sectors
cleaned_sector.value_counts()

#adding the cleaned sector column to the dataframe
jobs_df['cleanedSector'] = cleaned_sector

#converting the cleaned sectors name into title format
jobs_df['cleanedSector'] = jobs_df.cleanedSector.str.title()

#converting the sectors name into title format
jobs_df['sector'] = jobs_df.sector.str.title()
jobs_df
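Short keywords like `'it'` are risky as substring checks because they appear inside many unrelated words. A sketch, on made-up sector strings, comparing a substring check with a word-boundary check, followed by a simple audit that lists which raw sectors ended up in each bucket:

```python
import pandas as pd

# Made-up sector strings, already lower-cased
sectors = pd.Series([
    "it services and it consulting",
    "hospitality",
    "utilities",
    "software development",
])

# Plain substring check: "it" also hides inside "hospitality" and "utilities"
substring_hit = sectors.apply(lambda s: "it" in s)

# Word-boundary check only matches "it" as a standalone word
word_hit = sectors.str.contains(r"\bit\b", regex=True)

# Audit: list the raw sectors that landed in each bucket under the substring rule
labels = substring_hit.map({True: "Tech", False: "Other"})
audit = sectors.groupby(labels).unique()

print(pd.DataFrame({"sector": sectors, "substring": substring_hit, "word": word_hit}))
print(audit)
```

Running the same kind of audit on the real data (grouping `sector` by `cleanedSector`) is a cheap way to spot buckets that absorbed sectors they should not have.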



### TOP 10 JOB TITLES

In [None]:
# extracting the top 10 job titles from the df
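top_titles = jobs_df.title.value_counts().head(10).reset_index()
# naming the columns explicitly, since the default names depend on the pandas version
top_titles.columns = ['title','count']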

# plotting the chart
ax = sns.barplot(data=top_titles,x='count',y='title',palette='crest')
plt.title('Top Ten Job Titles')
plt.xlabel('Job Count')
plt.ylabel('Job Titles')

for container in ax.containers:
    ax.bar_label(container)
    
plt.show()

#### Insight:

Data Scientist, AI/ML Engineer, and MLOps Engineer dominate the hiring landscape.
These roles alone make up a significant share of all job postings, showing where demand is concentrated.

### SECTOR WISE JOB DISTRIBUTION

In [None]:
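sec = jobs_df.cleanedSector.value_counts().reset_index()
# naming the columns explicitly, since the default names depend on the pandas version
sec.columns = ['cleanedSector','count']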

ax = sns.barplot(data=sec,y='cleanedSector',x='count',palette='crest')
plt.title('Sector Wise Job Distribution')
plt.xlabel('Job post for each sector')
plt.ylabel('Sectors')

for container in ax.containers:
    ax.bar_label(container)
    
plt.show()

#### Insight:

Tech sector is leading AI/ML hiring by a wide margin, followed by Consulting and Internet Publishing.
This reflects the deep integration of AI into core tech products.

### JOB ROLES ACROSS TOP 5 HIRING SECTORS

In [None]:
top_titles = jobs_df['title'].value_counts().head().index
top_sectors = jobs_df['cleanedSector'].value_counts().head().index

filtered_df = jobs_df[
    jobs_df['title'].isin(top_titles) & 
    jobs_df['cleanedSector'].isin(top_sectors)
]

filtered_ct = pd.crosstab(filtered_df['cleanedSector'], filtered_df['title'])

plt.figure(figsize=(10,5))
sns.heatmap(filtered_ct, annot=True, cmap="crest", linewidths=0.5)
plt.title("Top Roles vs Top Sectors")
plt.xlabel("Job Title")
plt.ylabel("Sector")
plt.tight_layout()
plt.show()

#### Insight:

The Tech sector shows the broadest role diversity, especially for Machine Learning Engineers and AI/ML Engineers.

# Title & Sector Exploration Summary – Version 5

In this version, I tackled two of the messiest but most important fields in the dataset: **title** and **sector**.  
Both carry critical signals about what kinds of roles are in demand and which industries are hiring — but their raw formats were too inconsistent for any meaningful analysis.

## 🔍 What I Did

- Cleaned and standardized the title field using a custom rule-based function — grouped similar job titles into 15+ clear categories.
- Simplified and grouped the sector field into broader categories like Tech, Finance, Healthcare, Government, etc.
- Created a new column, `cleanedSector`, to hold the grouped sector labels.
- Visualized the Top 10 job roles based on cleaned titles.
- Explored sector-wise job distribution to see which industries are hiring the most.
- Built a focused heatmap of top 5 sectors vs top 5 job roles to explore how hiring varies across industries.
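
For readers who want a feel for what a rule-based title cleaner can look like, here is a minimal sketch. The patterns, bucket names, and the `clean_title` helper are hypothetical examples and not the exact rules used in this notebook.

```python
import re

# Hypothetical rules as (regex pattern, bucket) pairs.
# Order matters: the first matching rule wins.
TITLE_RULES = [
    (r"\bintern(ship)?\b", "Intern"),
    (r"\bmlops\b", "MLOps Engineer"),
    (r"\bdeep learning\b", "Deep Learning Engineer"),
    (r"\bcomputer vision\b", "Computer Vision Engineer"),
    (r"\bdata scientist\b", "Data Scientist"),
    (r"\b(machine learning|ml)\b.*\bengineer\b", "ML Engineer"),
]


def clean_title(raw_title: str) -> str:
    """Map a raw job title to a bucket; unmatched titles fall into 'Other'."""
    text = raw_title.lower()
    for pattern, bucket in TITLE_RULES:
        if re.search(pattern, text):
            return bucket
    return "Other"


for title in ["Senior ML/AI Research Intern", "Machine Learning Engineer II",
              "Staff MLOps Engineer", "Product Manager"]:
    print(f"{title!r} -> {clean_title(title)}")
```

Putting the "Intern" rule first means an internship posting lands in the "Intern" bucket even when the title also mentions ML or research. On a DataFrame, a function like this could be applied with `jobs_df['title'].apply(clean_title)`.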

## ⚠️ What I Didn’t Cover

- Didn’t go into advanced cross-analysis such as title vs application counts or experience level vs title.
- Skipped visualizing lower-ranked or rare titles/sectors to keep charts readable and insights focused.
  
## 🎯 What I Learned

- Cleaning messy categorical text fields (like title/sector) unlocks meaningful analysis and smarter visuals.
- Titles like "Senior ML/AI Research Intern" can now be grouped confidently under buckets like "Intern" or "ML Engineer", making analysis way more reliable.
- Focused visualizations (like heatmaps on filtered data) are much more powerful than trying to show everything at once — **clarity > completeness.**


# Notebook Version: v6

**Focus**: Job Roles, Experience Level & Real-World Application Pressure

**Overview:**

After cleaning and organizing job titles in the previous version, I wanted to take it a step further — not just see what roles exist, but understand **how they behave in the real market**.

In this version, I'm focusing on 3 things that matter when you're actually looking for a job:
- What kind of roles are open to freshers?
- Which roles are getting overcrowded with applicants?
- And where's that sweet spot — roles that exist in good numbers but aren't flooded?

**Goal 🎯:**

The goal here is to connect **roles**, **experience level**, and **application data** to see what's really going on underneath the surface.

I want this to help someone decide:
> “Is this job worth applying to?”

### 1. Job Role Distribution Across Experience Levels

In [None]:
top_roles = jobs_df['title'].value_counts().nlargest(10).index
filtered_df = jobs_df[jobs_df['title'].isin(top_roles)]

# catplot returns a FacetGrid (one panel per experience level), not a single Axes
g = sns.catplot(
    data=filtered_df, y='title', kind='count',
    col='experienceLevel', col_wrap=4,
    order=filtered_df['title'].value_counts().index, palette='crest',
)

g.set_titles("{col_name}")
g.set_axis_labels("Job Count", "Job Role")
plt.show()

### Insights: 

- Most roles ask for **Mid-Senior level** or **Entry level** experience.
- **Associate-level** postings are mostly limited to AI/ML Engineer roles, whether counted as a combined group or individually.
- **ML Engineer** roles appear across **all experience levels**, showing they’re in demand at every stage.
- Most roles cluster at the **Mid-level**, which is consistent with companies preferring 2–4 years of experience in technical fields.
- Almost every role has postings that do not specify an experience level, so candidates without prior experience can still apply to them.
- **Executive** and **Director** levels are the least requested by companies.
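
The count plots show absolute numbers, so roles with many postings dominate visually. A row-normalized crosstab gives the share of each experience level within a role instead. This is a sketch on toy rows (not the real dataset); the `title` and `experienceLevel` column names come from the notebook, and the level labels are assumed.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "title": ["ML Engineer", "ML Engineer", "ML Engineer",
              "Data Scientist", "Data Scientist", "Intern"],
    "experienceLevel": ["Entry level", "Mid-Senior level", "Mid-Senior level",
                        "Mid-Senior level", "Associate", "Internship"],
})

# normalize="index" divides each row by its total, so every row sums to 1.
level_share = pd.crosstab(toy["title"], toy["experienceLevel"], normalize="index")
print(level_share.round(2))
```

Reading across a row then answers "what fraction of this role's postings target each level?", which is easier to compare across roles than raw counts.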


### 2. Average Applications per Job Role

In [None]:
# Average applicationsCount per cleaned job title
title_application = jobs_df[['title','applicationsCount']]
grouping = title_application.groupby('title').agg({'applicationsCount':'mean'})
grouping = grouping.sort_values(by='applicationsCount', ascending=False)

grouping_reset = grouping.reset_index()

ax = sns.barplot(data=grouping_reset, y='title', x='applicationsCount',palette='crest')
plt.title("Average Applications per Job Role")
plt.xlabel("Avg Application Count")
plt.ylabel("Job Role")

for container in ax.containers:
    ax.bar_label(container)

plt.show()

### Insights

- **Internships** receive the most applications (~200) — by far the most competitive.
- **Software Engineer (AI/ML)** and **Web Dev** roles are more applied to than core ML roles, likely due to broader appeal.
- **MLOps** and **AI/ML Engineers** are less applied to — these may be smart targets for qualified candidates.
- **Deep Learning Engineers** receive the **least** attention (~66 apps), despite high relevance — a potential opportunity.
- **"Other"** roles draw few applications, which suggests that clear, recognizable job titles matter.
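
One caveat with averages: a few heavily applied postings can pull a role's mean up. A quick way to check is to look at the median and the number of postings next to the mean. This is a sketch on made-up numbers (not the real dataset); the `title` and `applicationsCount` column names come from the notebook.

```python
import pandas as pd

# Toy numbers for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "title": ["Intern", "Intern", "Intern",
              "ML Engineer", "ML Engineer", "ML Engineer"],
    "applicationsCount": [40, 50, 510, 90, 100, 110],
})

# Named aggregation: one output column per statistic.
summary = (
    toy.groupby("title")["applicationsCount"]
       .agg(postings="count", mean_apps="mean", median_apps="median")
       .sort_values("mean_apps", ascending=False)
)
print(summary)
```

In this toy example one outlier posting drives the "Intern" mean well above its median, while the "ML Engineer" mean and median agree. A large gap between the two on the real data would be a hint to read the average with some caution.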


### 3. Job Count vs Avg Applications per Role

In [None]:
# Job count per role
job_counts = jobs_df['title'].value_counts()

# Avg applications per role
avg_apps = jobs_df.groupby('title')['applicationsCount'].mean()

# Combine into one DataFrame (wide format: one row per role)
comparison_df = pd.DataFrame({
    'Job Count': job_counts,
    'Avg Applications': avg_apps
})

# Keep only top 10 most common roles to reduce clutter
top_roles = job_counts.sort_values(ascending=False).head(10).index
comparison_df = comparison_df.loc[top_roles]

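# Reshape into long format (one row per role/metric pair) for seaborn's hue
comparison_df_long = (
    comparison_df.rename_axis('title')  # make sure the index is named 'title' before reset_index()
    .reset_index()
    .melt(id_vars='title', value_vars=['Job Count', 'Avg Applications'],
          value_name='Value', var_name='Metric')
)

# Plot with seaborn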
plt.figure(figsize=(12,6))
sns.barplot(data=comparison_df_long, x='title', y='Value', hue='Metric',palette='crest')
plt.title("Job Count vs Avg Applications per Role")
plt.xlabel("Job Role")
plt.ylabel("Value")
plt.xticks(rotation=45)
plt.legend(title='Metric')
plt.tight_layout()
plt.show()



### Insights:

- **Machine Learning Engineer** roles have the **highest job count**, but applications do not keep pace: the average application count is less than half the number of postings.
- **AI/ML Engineer** roles show a more balanced ratio of postings to average applications than the other roles, indicating a healthy market position.
- Most other roles, including **Software Engineer**, **Computer Vision Engineer**, **NLP and Research Engineer**, and **Data Scientist**, have a higher average application count than number of postings.
- The **Intern** role receives the most applications despite not having the highest number of openings, showing intense competition at the fresher level.
- Roles like **Deep Learning Engineer** and **MLOps Engineer** draw relatively **fewer applications** than others even when the job count is decent, making them **hidden opportunities** for skilled applicants.
- This comparison highlights where a **supply-demand mismatch** exists, helping job seekers avoid wasting energy on over-applied roles.

**This chart suggests that recruiters are posting more ML roles while job seekers are applying more heavily to other roles, which creates an imbalance between openings and applications.**
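
Job count and average applications are measured in different units, so reading them off a shared axis takes some care. One alternative is to flag "sweet spot" roles with an explicit rule. The sketch below uses made-up numbers (not the real dataset), and the median-based thresholds are an assumption for illustration, not a rule taken from the analysis above.

```python
import pandas as pd

# Toy postings for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "title": ["ML Engineer"] * 4 + ["Data Scientist"] * 3
             + ["Intern"] * 2 + ["MLOps Engineer"],
    "applicationsCount": [80, 90, 100, 110, 120, 140, 160, 180, 200, 60],
})

# Per-role supply (number of postings) and pressure (average applications).
role_stats = toy.groupby("title")["applicationsCount"].agg(
    job_count="count", avg_apps="mean"
)

# Hypothetical rule: at least the median number of postings,
# and at most the median average application count.
role_stats["sweet_spot"] = (
    (role_stats["job_count"] >= role_stats["job_count"].median())
    & (role_stats["avg_apps"] <= role_stats["avg_apps"].median())
)
print(role_stats.sort_values("job_count", ascending=False))
```

With these toy numbers only "ML Engineer" is flagged: it has many postings and a below-median average application count. Roles with very few postings are excluded even when their averages look low, since a single posting says little about competition.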


# Summary – Version 6

After cleaning job titles in the last version, I wanted to understand **how these roles behave in the real world** — not just in volume, but in terms of experience requirements and application competition.

## 🔍 What I Did

- Explored how job roles are distributed across different **experience levels** (Entry, Mid, Senior, etc.)
- Analyzed **average applications received per job title** to identify which roles are most and least competitive
- Compared **job supply vs application demand** to uncover market imbalances and hidden opportunities

## 📊 Charts I Built

1. **Experience Level Distribution by Job Role** — to see which roles are fresher-friendly, mid-level focused, or spread across all levels  
2. **Average Applications per Role** — to measure competition pressure across job types  
3. **Job Count vs Avg Applications (Top Roles)** — a supply vs demand view to spot saturation vs opportunities

## 🎯 What I Learned

- Most roles fall under **Mid-level hiring**, reflecting companies’ strong preference for candidates with 2–4 years of experience.
- **Internships** are the most applied-to roles (~200 apps), even without being the most available — proving just how tough the entry-level market is.
- Roles like **MLOps Engineer** and **Deep Learning Engineer** get **very few applications**, despite relevance — making them smart targets for niche-skilled candidates.
- Surprisingly, roles like **Software Engineer (AI/ML)** and **Web Development Engineer** receive more applications than core ML roles — possibly due to broader appeal or clearer job expectations.
- When comparing job availability vs application demand, a clear imbalance shows up — **some roles are crowded, while others are overlooked**, despite healthy posting numbers.

## 🧠 Big Takeaway

There’s a noticeable **mismatch between what companies are hiring for and what job seekers are chasing**.  
If you’re blindly applying to popular roles, chances are you’re just one of hundreds.  
But if you align your skills with under-applied, high-supply roles — you can massively improve your visibility and chances.

## ✅ Why This Matters

This version wasn’t about flashy charts — it was about **real questions** that job seekers have:
- “Where do I have the best shot?”
- “Am I applying where companies are actually hiring?”
- “Should I even be targeting this role at my level?”

That’s why this version is all about helping candidates **apply smarter**, not harder.


# Final Project Wrap-Up✨

This project was never about building the flashiest dashboard or running complex ML models.

It was about starting with messy, real-world data and asking the kind of questions that **job seekers actually care about**:
- What roles are companies hiring for?
- Who are they hiring — freshers or experienced folks?
- Where is the competition intense, and where are the hidden opportunities?

Over six versions, I cleaned messy job titles, grouped chaotic sectors, simplified noisy experience levels, and tied everything back to real-world hiring trends.


## 💡 What I Learned

- **Raw data doesn’t give answers — it gives friction.** But if you work through it, the clarity that follows is worth it.
- Cleaning and structuring text fields like job titles is painful — but it's the key to every meaningful insight that followed.
- One targeted chart, if designed with purpose, can be more powerful than ten generic ones.
- EDA isn’t about quantity of code — it’s about **quality of thought**.


## 🎯 What This Project Really Was

This wasn’t a textbook exercise.  
It was me, sitting with a messy dataset, trying to figure out how to **make it speak**.

And in the process, I didn’t just learn about the job market — I learned how to:
- Think more critically
- Clean more confidently
- Ask better questions
- And extract value where there was once noise


## ⚡ Closing Thought 

This project wasn’t polished from the start. It was built version by version, through trial, error, edits, and late realizations.

But that’s what made it valuable.

I didn’t just run through a dataset — I stayed with it long enough to understand it, clean it, question it, and turn it into something that makes sense.

It’s not perfect. But it’s mine — and I walked away from it knowing more than when I began.

That, to me, is a win.

