# Exploring AI & ML Job Trends in the U.S.

#### Notebook Version: v1  
**Focus**: Dataset loading and basic structural preview  

This notebook is part of a versioned project exploring trends in AI/ML job postings in the U.S.  
This version focuses on loading the dataset, checking its structure, and identifying surface-level issues.


In [None]:
#importing the necessary libraries
import numpy as np 
import pandas as pd 

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Dataset Overview

- Source: Kaggle – AI and ML Job Listings USA  
- File path: `/kaggle/input/ai-and-ml-job-listings-usa/ai_ml_jobs_linkedin.csv`

## Load and Preview Data

Load the dataset into a DataFrame and preview its structure to understand the basic layout.


In [None]:
# Load the dataset
us_jobs_df = pd.read_csv('/kaggle/input/ai-and-ml-job-listings-usa/ai_ml_jobs_linkedin.csv')

# Create a working copy
jobs_df = us_jobs_df.copy()

In [None]:
# Preview first 2 rows
jobs_df.head(2)


In [None]:
# Check dataset shape
print(f"Rows: {jobs_df.shape[0]}, Columns: {jobs_df.shape[1]}")

# Data types and non-null info
jobs_df.info()

In [None]:
# Summary stats for numeric columns
jobs_df.describe()

## Initial Observations and Notes

- The dataset contains **862 rows** and **10 columns**.
- Some columns such as `companyName`, `publishedAt`, and `sector` contain missing values.
- Columns like `applicationsCount` and `publishedAt` may need data type conversions in the next version.
- No immediate data loading issues were encountered.
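
The missing-value observation above can be made more concrete with a per-column summary. This is a minimal sketch on a made-up toy DataFrame (the values are invented, not taken from the real dataset), and `null_summary` is a hypothetical helper name of my own:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; all values are invented for illustration.
toy_df = pd.DataFrame({
    "companyName": ["Acme", None, "Globex", "Initech"],
    "publishedAt": ["2024-01-05", "2024-01-06", None, None],
    "sector": ["Tech", "Finance", "Tech", np.nan],
    "title": ["ML Engineer", "Data Scientist", "AI Researcher", "ML Engineer"],
})


def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return null count and null percentage per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually contain nulls.
    return summary[summary["null_count"] > 0].sort_values(
        "null_count", ascending=False
    )


toy_summary = null_summary(toy_df)
print(toy_summary)
```

On the real data the same helper could be called as `null_summary(jobs_df)`; seeing the percentage next to the count helps decide whether to fill a column or drop it.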


#### Notebook Version: v2  
**Focus**: Data Cleaning and Formatting  
 
This version focuses on cleaning the dataset, handling missing values, renaming columns, correcting data types, and preparing the data for analysis.


In [None]:
#previewing the data again
jobs_df.head(3)

## Handling missing/null values

In [None]:
#checking for null values if any
jobs_df.isna().sum()

## Data Type Fix

In [None]:
#handling null values

#filling the null values in columns 'companyName' and 'sector' with 'Unknown'
jobs_df[['companyName','sector']] = jobs_df[['companyName','sector']].fillna('Unknown')

#handling the null value for column 'publishedAt' using ffill(), assuming that the post was published close to the date of the previous row
#(Series.ffill() is used because fillna(method='ffill') is deprecated in recent pandas versions)
jobs_df['publishedAt'] = jobs_df['publishedAt'].ffill()

#check if null value still exists
jobs_df.isna().sum()


#### NOTE:
Filled `publishedAt` using forward fill, which copies the date from the previous row. This assumes that neighbouring rows were posted at around the same time.
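
To see what forward fill does in practice, here is a small sketch with invented dates. The `publishedAt_imputed` flag column is my own addition for illustration and is not part of the notebook:

```python
import pandas as pd

# Hypothetical rows; the dates are invented to show the mechanics.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "publishedAt": ["2024-03-01", None, "2024-01-15", None],
})

# Forward fill copies the last non-null value from the rows above.
filled = toy["publishedAt"].ffill()
print(filled.tolist())
# ['2024-03-01', '2024-03-01', '2024-01-15', '2024-01-15']

# Recording which rows were imputed makes it easy to exclude them later.
toy["publishedAt_imputed"] = toy["publishedAt"].isna()
toy["publishedAt"] = filled
print(toy)
```

Note that the filled value depends entirely on row order: if the rows are not sorted by date, row B simply inherits whatever date happens to sit above it.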

In [None]:
#checking for the datatypes 
jobs_df.info()

In [None]:
# converting 'publishedAt' into datetime data type  
jobs_df['publishedAt'] = pd.to_datetime(jobs_df['publishedAt'])

# converting 'applicationsCount' into integer data type
#first we need to extract the count of the applications 
jobs_df['applicationsCount'] = jobs_df['applicationsCount'].str.extract(r'(\d+)')[0]

#now convert the 'applicationsCount' dtype to numeric
jobs_df['applicationsCount'] = pd.to_numeric(jobs_df['applicationsCount'])
jobs_df.info()
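
A minimal sketch of the extraction step on its own. The raw strings below are assumptions about what the column might look like, not values copied from the dataset:

```python
import pandas as pd

# Assumed examples of raw values; the real column may be formatted differently.
raw = pd.Series([
    "Over 200 applicants",
    "Be among the first 25 applicants",
    "73 applicants",
    None,
])

# expand=False returns a Series holding the first run of digits in each string.
counts = raw.str.extract(r"(\d+)", expand=False)

# errors="coerce" turns anything unparseable into NaN; the nullable Int64
# dtype keeps whole numbers while still allowing missing values.
counts = pd.to_numeric(counts, errors="coerce").astype("Int64")
print(counts)
```

If the raw text looks like "Over 200 applicants", the extracted number is a lower bound rather than an exact count, which is worth remembering when reading averages later.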


## Removing the columns that are not useful for my analysis

In [None]:
#making a new df to store only the columns that are useful for my analysis

updated_jobs = jobs_df.drop(columns=['description','sector','workType'])
updated_jobs

## Duplicate Check and Removal

In [None]:
# checking the duplicate values that exists (based on all columns)
# updated_jobs.duplicated().sum()
updated_jobs[updated_jobs.duplicated()]

# dropping the duplicated values
updated_jobs.drop_duplicates(inplace=True)

# check if any rows share the same title, companyName, location and publishedAt
updated_jobs.duplicated(subset=['title', 'companyName', 'location', 'publishedAt']).sum()

# removing the duplicated rows (the first occurrence is kept)
duplicate_vals = updated_jobs.duplicated(subset=['title', 'companyName', 'location', 'publishedAt'])
updated_jobs = updated_jobs[~duplicate_vals].copy()

#resetting the index after dropping the duplicate rows
updated_jobs.reset_index(drop=True,inplace=True)
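
Before dropping subset-based duplicates, it can help to look at every row in each duplicate group. A small sketch on invented rows, where `key_cols` is a name I picked for the example:

```python
import pandas as pd

# Invented rows: the first two share title, company and location.
toy = pd.DataFrame({
    "title": ["ML Engineer", "ML Engineer", "Data Scientist", "ML Engineer"],
    "companyName": ["Acme", "Acme", "Acme", "Globex"],
    "location": ["Austin, TX", "Austin, TX", "Austin, TX", "Austin, TX"],
    "applicationsCount": [25, 40, 25, 25],
})

key_cols = ["title", "companyName", "location"]

# keep=False marks every member of a duplicate group, so both rows are visible.
dupe_mask = toy.duplicated(subset=key_cols, keep=False)
print(toy[dupe_mask])

# drop_duplicates keeps the first occurrence by default.
deduped = toy.drop_duplicates(subset=key_cols).reset_index(drop=True)
print(len(toy), "->", len(deduped))
```

In this toy case the dropped row had a different `applicationsCount` (40 vs 25), so which copy survives changes the totals. Sorting before deduplicating is one way to control that.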

## Whitespace Stripping 



In [None]:
#stripping the whitespaces if any, from the string based columns

for col in ['title','companyName','location','experienceLevel','contractType']:
    updated_jobs[col] = updated_jobs[col].str.strip()



## Category Cleaning

In [None]:
#1. title
# for consistency I am converting the titles into title case
updated_jobs['title'] = updated_jobs['title'].str.title()

#check if the function is applied properly
updated_jobs['title'].head(3)

#2. location
# I will be splitting the location column into two parts: one is for city and other is for state
location_split = updated_jobs['location'].str.split(',',n=1,expand=True)

#adding the city column
updated_jobs['city'] = location_split[0].str.strip()

#adding the state column
updated_jobs['state'] = location_split[1].str.strip()

#check if any null value has been added due to the above two columns
updated_jobs.isna().sum()

#handle the null values
updated_jobs['state'] = updated_jobs['state'].fillna('Unknown')

#rechecking for null values
updated_jobs.isna().sum()

#removing the location column as it is no longer useful
updated_jobs.drop('location',axis='columns',inplace=True)

#3. publishedAt
# I will be splitting this column into two parts as well: year and month (day is not useful)

updated_jobs['year'] = updated_jobs['publishedAt'].dt.year
updated_jobs['month'] = updated_jobs['publishedAt'].dt.month

#dropping the publishedAt column because it is not useful
updated_jobs.drop('publishedAt',axis='columns',inplace=True)
updated_jobs.columns

#4. companyName
# converting the company's name into title case so that it remains consistent throughout
updated_jobs['companyName'] = updated_jobs['companyName'].str.title()

#check if the change has been made properly
updated_jobs['companyName'].head(5)

## Dataset Cleaning and Structuring Summary

In this version, I focused on cleaning and structuring the dataset to prepare it for meaningful analysis. The original dataset had multiple inconsistencies and mixed-format fields which could hinder exploration and insights.

## 🚀 Key actions
- Selected 7 relevant columns for the analysis.
- Cleaned categorical columns (`title`, `companyName`, `location`) for consistency.
- Split complex fields like `location` and `publishedAt` into simpler, analyzable components (city, state, year, month).
- Handled missing values in `state` by filling with "Unknown".

## ⚠️ Challenges
- Some job titles were overly specific or inconsistent (e.g., different casing, role modifiers). I resolved this with title casing but might need more grouping later.
- The `location` field didn’t follow a uniform format in all rows — some were missing state info, which led to NaN values after splitting.
- The `publishedAt` field contained full timestamps, which were not useful at this stage. I took care to isolate only the useful components (year/month) without losing meaning.
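
The location challenge can be reproduced on a few made-up values. This sketch shows where the NaN comes from and how a country-only entry ends up in the city column:

```python
import pandas as pd

# Invented location strings; the last one has no comma at all.
locations = pd.Series([
    "San Francisco, CA",
    "New York, NY",
    "United States",
])

# n=1 splits on the first comma only; expand=True returns separate columns.
parts = locations.str.split(",", n=1, expand=True)

city = parts[0].str.strip()
# Rows without a comma have nothing in column 1, so they become NaN here.
state = parts[1].str.strip().fillna("Unknown")

# A country-only value lands in the city column and needs its own handling.
city = city.replace("United States", "Unknown")

print(pd.DataFrame({"city": city, "state": state}))
```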

## 🎯 Learnings
- Even basic string cleaning and formatting (like `.str.title()` or `.str.strip()`) can greatly improve consistency in the dataset.
- Breaking down complex columns (like `location` and `publishedAt`) can make future analysis smoother and more insightful.
- It's important to analyze columns one by one instead of applying generic cleaning — each column may need unique handling.
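
One caveat with `.str.title()` is that it lowercases everything after the first letter of each word, so acronyms get mangled. A hedged sketch of the effect and one possible fix; `restore_acronyms` and the `ACRONYMS` list are hypothetical names of my own, and the titles are invented:

```python
import pandas as pd

# Invented titles containing acronyms.
titles = pd.Series([
    "AI/ML engineer",
    "senior NLP scientist",
    "machine learning engineer",
])

title_cased = titles.str.title()
print(title_cased.tolist())
# ['Ai/Ml Engineer', 'Senior Nlp Scientist', 'Machine Learning Engineer']

# Assumed list of acronyms worth protecting; extend as needed.
ACRONYMS = ["AI", "ML", "NLP"]


def restore_acronyms(series: pd.Series, acronyms: list) -> pd.Series:
    """Put known acronyms back into upper case after title casing."""
    fixed = series
    for acronym in acronyms:
        # \b keeps the match to whole words, e.g. 'Ai' but not 'Aim'.
        pattern = rf"\b{acronym.title()}\b"
        fixed = fixed.str.replace(pattern, acronym, regex=True)
    return fixed


restored = restore_acronyms(title_cased, ACRONYMS)
print(restored.tolist())
# ['AI/ML Engineer', 'Senior NLP Scientist', 'Machine Learning Engineer']
```

The same issue applies to company names written in capitals (for example, "IBM" becomes "Ibm"), so a similar pass may be worth considering for `companyName`.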



#### Notebook Version: v3  
**Focus**: Exploratory Data Analysis (EDA)

This version focuses on asking structured and slightly deeper questions to understand the dataset better.  
I'm primarily focusing on categorical patterns, hiring distributions, and application behavior.  
We'll go from basic univariate counts to intermediate bivariate groupings (without visuals, which are reserved for v4).

> Note: `title` and `sector` are *not* taken up in this version intentionally.  
> - `title` is too noisy to analyze meaningfully without cleanup — we’ll handle that in **v5**.  
> - `sector` is reserved for **v5** as well, to avoid overloading this version and to keep v3 beginner-friendly.



### 1. Exploring Unique Values in Categorical Columns

In [None]:
#finding number of unique values in categorical columns
print('Unique values in categorical column:')
print(updated_jobs[['contractType','experienceLevel','month','year']].nunique())

### 2. Top 10 Most Common Job Titles

In [None]:
#top 10 most common job titles
common_title = updated_jobs['title'].value_counts().head(10)
print('Top 10 most common job titles')
print(common_title)

This gives a sense of which roles are advertised the most, though detailed title analysis will be taken up in v5.


###  3. Companies Posting the Most Jobs

In [None]:
#companies that have posted the most job listings
print('Companies that have posted the most job listings')
print(updated_jobs['companyName'].value_counts().head(10))

Companies with the most job listings often reflect dominant hiring brands in the market.


###  4. Top 5 Hiring Cities (with Cleaning)

In [None]:
#cities that are hiring the most(top 5)

#NOTE: a lot of records have 'United States' as the city, which is a country and not a city, so replacing it with 'Unknown'
updated_jobs['city'] = updated_jobs['city'].replace('United States','Unknown')
updated_jobs['city'].value_counts()

#NOTE: the majority of listings have no city entered, so 'Unknown' is excluded and only actual city names are counted
hiring_cities = updated_jobs['city'][updated_jobs['city'] != 'Unknown']
hiring_cities.reset_index(drop=True, inplace=True)
print('Top 5 cities in US that are hiring the most')
print(hiring_cities.value_counts().head())

I noticed that many job listings have 'United States' or missing city data, so I cleaned it for more realistic counts.
(This was probably missed in v2 while cleaning the data.)

###  5. Most Common Contract Type & Experience Level

In [None]:
#most common contractType
print('Most common contract type: ')
print(updated_jobs['contractType'].value_counts().head(1).index[0])
print()

#experience level that is highest in demand
print('Experience level that is highest in demand ')
print(updated_jobs['experienceLevel'].value_counts())

Basic univariate checks to understand the dominant job types and demanded experience levels.


###  6. Average Application Count

In [None]:
#average of application counts
print('Average of application counts:')
print(updated_jobs['applicationsCount'].mean())

This gives a rough sense of how competitive the postings are. A mean that sits far above the median would suggest that a few postings attract a large share of the applications.
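
Since a mean is easily pulled up by a few very popular postings, it can be read alongside the median. A small sketch with invented counts:

```python
import pandas as pd

# Invented application counts with two very popular postings.
apps = pd.Series([25, 25, 25, 30, 40, 200, 200])

summary = apps.agg(["mean", "median", "min", "max"])
print(summary)

# Share of postings sitting at the maximum value; a large share may mean the
# source caps the count (an assumption to check against the raw text).
share_at_max = (apps == apps.max()).mean()
print(f"Share of postings at the max: {share_at_max:.2f}")
```

Here the median is 30 while the mean is well above it, which is the pattern to look for in `updated_jobs['applicationsCount']`.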


### 7. Company with Highest/Lowest Application Count

In [None]:
#companies that received the highest and lowest number of applications

#grouping the company with the total number of application count
job_app_count = updated_jobs.groupby('companyName').agg({'applicationsCount':'sum'})

#extracting the max and min count
max_val = job_app_count['applicationsCount'].max()
min_val = job_app_count['applicationsCount'].min()

#extracting the company names
print("Company that received highest applications")
highest_company = job_app_count[job_app_count['applicationsCount'] == max_val]
print(highest_company)
print()

lowest_company = job_app_count[job_app_count['applicationsCount'] == min_val]
print("Top 5 companies that got lowest application count")
print(lowest_company.head())
print()
#NOTE: a lot of companies share the minimum (25) number of applications, so only 5 of them are shown

This indicates which companies attract more attention from applicants — maybe due to reputation or role type.
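
An alternative way to get the top and bottom companies is `nlargest` and `nsmallest` on the grouped totals. This is only a sketch on invented data, not a replacement for the cell above:

```python
import pandas as pd

# Invented postings; Acme has two listings, the others one each.
toy = pd.DataFrame({
    "companyName": ["Acme", "Acme", "Globex", "Initech", "Umbrella"],
    "applicationsCount": [200, 150, 25, 25, 60],
})

totals = toy.groupby("companyName")["applicationsCount"].sum()

# Company with the highest total.
top = totals.nlargest(1)
print(top)

# keep="all" returns every company tied for the lowest total.
bottom = totals.nsmallest(1, keep="all")
print(bottom)
```

Keep in mind that a total per company grows with the number of postings, so a company can rank highest simply because it posted more often.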


### 8. Application Count by Experience Level

In [None]:
#application count distribution by experience level
print('Application count distribution by experience level')
exp_app_count = updated_jobs.groupby('experienceLevel').agg({'applicationsCount':'sum'})
print(exp_app_count)

Helpful to understand if juniors or seniors are attracting more applications.
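
Because a sum depends on how many postings each level has, it may help to show postings, total and mean side by side. A sketch using named aggregation on invented data (the output column names are my own choice):

```python
import pandas as pd

# Invented postings across three experience levels.
toy = pd.DataFrame({
    "experienceLevel": [
        "Entry level", "Entry level",
        "Mid-Senior level", "Mid-Senior level", "Mid-Senior level",
        "Director",
    ],
    "applicationsCount": [200, 100, 50, 50, 50, 25],
})

# Named aggregation: each keyword becomes an output column.
by_level = toy.groupby("experienceLevel").agg(
    postings=("applicationsCount", "size"),
    total_applications=("applicationsCount", "sum"),
    mean_applications=("applicationsCount", "mean"),
).sort_values("mean_applications", ascending=False)

print(by_level)
```

In this toy case "Mid-Senior level" has the most postings but a lower mean than "Entry level", a difference the sum alone would blur.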


### 9. Contract Types Getting the Most Applications

In [None]:
#contractType that are getting the highest number of applications
print('Contract type that are getting the highest number of applications')
contract_app_count = updated_jobs.groupby('contractType').agg({'applicationsCount':'sum'})
highest_count = contract_app_count['applicationsCount'].max()
highest_contract_val = contract_app_count[contract_app_count['applicationsCount']==highest_count]
print(highest_contract_val)

By total application count, full-time roles receive more applications than the other contract types.

### 10. Experience Level vs Contract Type 

In [None]:
#which experience level are linked more with which contract type
exp_contract = updated_jobs.groupby(['experienceLevel', 'contractType']).size().unstack().fillna(0)
print("Experience vs Contract Type distribution:")
print(exp_contract)

A cross-tabulation showing how contract types are distributed across experience levels.
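
The same table can also be built with `pd.crosstab`, which additionally offers row-wise shares. A minimal sketch on invented rows:

```python
import pandas as pd

# Invented postings; category labels are assumptions for the example.
toy = pd.DataFrame({
    "experienceLevel": [
        "Entry level", "Entry level",
        "Mid-Senior level", "Mid-Senior level",
    ],
    "contractType": ["Full-time", "Internship", "Full-time", "Full-time"],
})

# Raw counts; missing combinations are filled with 0 automatically.
counts = pd.crosstab(toy["experienceLevel"], toy["contractType"])
print(counts)

# normalize="index" turns each row into shares that sum to 1.
shares = pd.crosstab(
    toy["experienceLevel"], toy["contractType"], normalize="index"
)
print(shares)
```

Row shares make levels with very different posting volumes comparable, which raw counts do not.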

### 11. Underperforming Contract Types

In [None]:
#Contract types with high postings but fewer average applications:

avg_app_per_contract = updated_jobs.groupby('contractType').agg({
    'applicationsCount': 'mean',
    'contractType': 'count'
}).rename(columns={'contractType': 'jobCount'}).sort_values(by='applicationsCount')
print("Avg applications per contract type vs job count:")
print(avg_app_per_contract)

This analysis shows which contract types might be oversupplied or under-attractive — useful for recruiters or job portals.


# Exploratory Data Analysis Summary – Version 3

In this version, I focused on exploring the dataset more deeply using beginner to intermediate level EDA (without any visuals).  
The goal was to understand how features like experience level, contract type, city, and application counts behave, both on their own and together.


### 🔍 What I Did

- Checked unique values for key categorical columns: `contractType`, `experienceLevel`, `month`, and `year`.
- Found out which companies posted the most jobs and which cities are hiring the most.
- Looked at average application counts, plus which companies got the highest and lowest applications, and which contract types are getting the most interest.
- Grouped experience levels and contract types to see how they relate to each other.
- Identified which job types might be oversaturated (that is, lots of postings but not many applications on average).


### ⚠️ What I Didn’t Cover

- Skipped the `title` column for now because it’s just too messy — planning to clean and analyze job titles in v5 when I dive into deeper title/trend analysis.
- Left out the `sector` column to keep this version beginner-friendly and focused. That'll be part of v5 too.


### 🎯 What I Learned

- You can pull out a lot of insights just by grouping and aggregating columns — no fancy plots needed.
- Application counts alone tell a lot about which job types are getting traction, and which ones most people are ignoring.

