# People Analytics: LinkedIn Job Market Analysis

This project focuses on **LinkedIn People Analytics**, combining large-scale **professional profile data** and **job postings with skill requirements** to extract actionable insights about the modern labor market.  
By integrating multiple datasets from Kaggle, the project aims to bridge the gap between **who professionals are**, **what skills they possess**, and **what the market is demanding**.

The analysis supports **talent intelligence, workforce planning, career guidance, and job recommendation systems** by uncovering trends in job titles, industries, locations, companies, and skill demand.

The project emphasizes **real-world data cleaning, exploratory analysis, feature engineering, and applied analytics**, closely mirroring challenges faced in people analytics and labor market intelligence roles.

---

## **Project Goals**
- Understand **professional profile patterns** across industries, locations, and experience levels.
- Identify **high-demand roles, skills, and companies** in the global job market.
- Build data-driven insights that support **recruitment, upskilling, and career mobility**.
- Lay the foundation for a **job recommendation system** connecting user profiles to relevant job opportunities.

---

## **Key Objectives**

### 1. **Data Preparation**
- Load and process LinkedIn datasets, including:
  - `LinkedIn_company_data.csv`
  - `LinkedIn_people_profiles.csv`
  - `job_skills.csv`
  - `job_summary.csv`
  - `linkedin_job_postings.csv`
- Handle missing values, inconsistent formatting, duplicates, and nested text fields.
- Normalize location, job titles, companies, and skill names for analysis.
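
As a minimal sketch of the skill-name normalization step, the snippet below splits a comma-separated `job_skills` string into a clean, de-duplicated list. The helper name `split_skills` and the sample rows are hypothetical, and a plain comma split is an assumption that will mis-split any skill whose name itself contains a comma.

```python
import pandas as pd

def split_skills(raw):
    """Turn a comma-separated skills string into a clean, de-duplicated list."""
    if pd.isna(raw):
        return []  # missing skills become an empty list
    seen, out = set(), []
    for part in str(raw).split(","):
        # Trim, collapse repeated whitespace, and lowercase each skill
        skill = " ".join(part.split()).lower()
        if skill and skill not in seen:
            seen.add(skill)
            out.append(skill)
    return out

# Hypothetical rows shaped like job_skills.csv
sample = pd.DataFrame({
    "job_link": ["a", "b", "c"],
    "job_skills": ["Customer service,  Food  Safety, customer service", None, "SQL, Python"],
})
sample["skills_list"] = sample["job_skills"].apply(split_skills)
print(sample[["job_link", "skills_list"]])
```

Keeping the original `job_skills` column alongside `skills_list` makes it easy to spot-check the normalization against the raw text.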

### 2. **Data Exploration and Analysis**
- Analyze the distribution of:
  - Job titles, industries, and seniority levels
  - Geographic demand by city, country, and region
  - Company hiring trends across roles and locations
- Identify patterns in professional experience and education backgrounds.

### 3. **Skills Demand Analysis**
- Determine the **most in-demand skills** across job categories and industries.
- Compare skill requirements by:
  - Job title
  - Industry
  - Location
- Identify emerging skills and declining skill trends.
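
One possible way to count skill demand overall and per group is to join postings to skills on `job_link`, explode the skills into one row per skill, and count. The frames below are tiny hypothetical stand-ins for `linkedin_job_postings.csv` and `job_skills.csv`; only the column names come from the datasets.

```python
import pandas as pd

# Hypothetical stand-ins for the postings and skills tables
postings = pd.DataFrame({
    "job_link": ["a", "b", "c"],
    "job_level": ["Mid senior", "Associate", "Mid senior"],
})
skills = pd.DataFrame({
    "job_link": ["a", "b", "c"],
    "job_skills": ["Python, SQL", "SQL, Excel", "Python, SQL, Communication"],
})

merged = postings.merge(skills, on="job_link", how="inner")

# One row per (posting, skill), with skill names trimmed and lowercased
long = (
    merged.assign(skill=merged["job_skills"].str.split(","))
    .explode("skill", ignore_index=True)
    .assign(skill=lambda d: d["skill"].str.strip().str.lower())
)

overall = long["skill"].value_counts()  # demand across all postings
by_level = (
    long.groupby(["job_level", "skill"]).size().rename("postings").reset_index()
)
print(overall.head())
print(by_level)
```

The same `groupby` works with `job_title` or `job_location` in place of `job_level` for the other comparisons listed above.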

### 4. **Company & Industry Insights**
- Identify top companies hiring for specific roles.
- Explore industry-specific hiring patterns and skill requirements.
- Analyze job levels (entry, mid, senior) across industries.
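
A rough sketch of these company-level views, assuming the `company`, `search_position`, and `job_level` columns of the postings table. The helper `top_companies` and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows shaped like linkedin_job_postings.csv
postings = pd.DataFrame({
    "company": ["Acme", "Acme", "Globex", "Acme", "Globex"],
    "search_position": ["Account Executive"] * 3 + ["Data Analyst"] * 2,
    "job_level": ["Mid senior", "Associate", "Mid senior", "Mid senior", "Mid senior"],
})

def top_companies(df, position, n=5):
    """Companies with the most postings for one search position."""
    subset = df[df["search_position"] == position]
    return subset["company"].value_counts().head(n)

top_ae = top_companies(postings, "Account Executive")

# Share of each job level within every company (rows sum to 1)
level_mix = pd.crosstab(postings["company"], postings["job_level"], normalize="index")
print(top_ae)
print(level_mix)
```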

### 5. **Job Recommendation System (Foundational)**
- Match professional profiles to relevant job postings using:
  - Job titles
  - Skills overlap
  - Location compatibility
  - Experience level
- Generate ranked job recommendations based on profile-job similarity.
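
The matching idea can be sketched with a weighted similarity score. This is an illustrative baseline, not the project's final method: the weights, the dictionary layout, and the function names are all assumptions, and experience level is left out for brevity.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections, 0.0 when both are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def score(profile, job, w_skills=0.6, w_title=0.3, w_location=0.1):
    """Weighted mix of skills overlap, title-word overlap, and exact location match."""
    skills = jaccard(profile["skills"], job["skills"])
    title = jaccard(profile["title"].lower().split(), job["title"].lower().split())
    location = 1.0 if profile["location"] == job["location"] else 0.0
    return w_skills * skills + w_title * title + w_location * location

def recommend(profile, jobs, top_n=3):
    """Return (job_link, score) pairs, best match first."""
    ranked = sorted(jobs, key=lambda j: score(profile, j), reverse=True)
    return [(j["job_link"], round(score(profile, j), 3)) for j in ranked[:top_n]]

# Hypothetical profile and postings
profile = {"title": "Data Analyst", "skills": {"sql", "python", "excel"}, "location": "New York, NY"}
jobs = [
    {"job_link": "job-1", "title": "Senior Data Analyst", "skills": {"sql", "python", "tableau"}, "location": "New York, NY"},
    {"job_link": "job-2", "title": "Account Executive", "skills": {"sales", "crm"}, "location": "New York, NY"},
    {"job_link": "job-3", "title": "Data Engineer", "skills": {"python", "spark", "sql"}, "location": "Austin, TX"},
]
recs = recommend(profile, jobs)
print(recs)
```

Jaccard overlap is cheap and easy to explain, which suits a foundational version; text embeddings or TF-IDF on job summaries would be natural later refinements.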

### 6. **Market Gap & Workforce Insights**
- Identify **skill gaps** between available talent and job market demand.
- Highlight opportunities for:
  - Educational programs
  - Corporate upskilling initiatives
  - Career transition pathways
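
One simple way to express a skill gap is the difference between the share of job postings that ask for a skill and the share of profiles that show it. The sketch below assumes skills have already been extracted as lists on both sides. The profiles data has no obvious dedicated skills column, so the supply side would need to be derived first, for example from positions or certifications. All names and sample values are hypothetical.

```python
from collections import Counter

def skill_share(skill_lists):
    """Fraction of records (jobs or profiles) that mention each skill."""
    counts = Counter(skill for skills in skill_lists for skill in set(skills))
    total = len(skill_lists) or 1  # avoid division by zero on empty input
    return {skill: n / total for skill, n in counts.items()}

def skill_gaps(job_skills, profile_skills):
    """Demand share minus supply share per skill, largest shortage first."""
    demand = skill_share(job_skills)
    supply = skill_share(profile_skills)
    gaps = {s: demand.get(s, 0.0) - supply.get(s, 0.0) for s in set(demand) | set(supply)}
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical skills per posting and per profile
job_skills = [["python", "sql"], ["python", "aws"], ["sql", "excel"], ["python"]]
profile_skills = [["excel", "sql"], ["excel"], ["python", "excel"], ["sql"]]
gaps = skill_gaps(job_skills, profile_skills)
print(gaps)
```

Positive values indicate skills the market asks for more often than profiles show them; negative values indicate relative oversupply.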

---

## **Dataset Overview**

### **LinkedIn Professional Profiles Dataset**
- **Source:** Kaggle [LinkedIn Professional Profiles Dataset](https://www.kaggle.com/datasets/manishkumar7432698/linkedinuserprofiles)
- **Key Features:**
  - Name
  - Title & Position
  - Current Company
  - Experience History
  - Education
  - Location
  - Profile Metadata (e.g., avatar URL)

### **1.3M LinkedIn Jobs & Skills Dataset (2024)**
- **Source:** Kaggle [1.3M Linkedin Jobs & Skills (2024)](https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024?select=job_summary.csv)
- **Key Features:**
  - Job titles and summaries
  - Required skills
  - Company names
  - Job locations
  - Industry and role classifications

---

## **Expected Outcome**
By the end of this project, the analysis will deliver:
- Clear insights into **job market trends and skill demand**
- Data-backed recommendations for **talent development and hiring**
- A scalable framework for **people analytics and job recommendation systems**


## Environment Setup and Required Libraries

In [1]:
# Core utilities, data handling, and plotting
import time

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats as st

# scikit-learn: models, pipelines, model selection, metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import (
    GridSearchCV, KFold, cross_val_predict, cross_val_score,
    learning_curve, train_test_split,
)
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle

# Gradient boosting libraries
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

# TensorFlow / Keras
import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical



## 1. **Data Preparation**

In [2]:
from pathlib import Path

# Folder that holds the Kaggle CSV files; adjust to your local setup
DATA_DIR = Path('/Users/tjscott/Desktop/Pet Projects/LinkedIn')

df_jb_skills = pd.read_csv(DATA_DIR / 'job_skills.csv')
df_com_data = pd.read_csv(DATA_DIR / 'LinkedIn_company_data.csv')
df_profiles = pd.read_csv(DATA_DIR / 'LinkedIn_people_profiles.csv')
df_job_sum = pd.read_csv(DATA_DIR / 'job_summary.csv')
df_job_post = pd.read_csv(DATA_DIR / 'linkedin_job_postings.csv')

df_jb_skills.head()

| | job_link | job_skills |
|---|---|---|
| 0 | https://www.linkedin.com/jobs/view/housekeeper... | Building Custodial Services, Cleaning, Janitor... |
| 1 | https://www.linkedin.com/jobs/view/assistant-g... | Customer service, Restaurant management, Food ... |
| 2 | https://www.linkedin.com/jobs/view/school-base... | Applied Behavior Analysis (ABA), Data analysis... |
| 3 | https://www.linkedin.com/jobs/view/electrical-... | Electrical Engineering, Project Controls, Sche... |
| 4 | https://www.linkedin.com/jobs/view/electrical-... | Electrical Assembly, Point to point wiring, St... |


In [3]:
df_job_post.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1348454 entries, 0 to 1348453
Data columns (total 14 columns):
 #   Column               Non-Null Count    Dtype 
---  ------               --------------    ----- 
 0   job_link             1348454 non-null  object
 1   last_processed_time  1348454 non-null  object
 2   got_summary          1348454 non-null  object
 3   got_ner              1348454 non-null  object
 4   is_being_worked      1348454 non-null  object
 5   job_title            1348454 non-null  object
 6   company              1348443 non-null  object
 7   job_location         1348435 non-null  object
 8   first_seen           1348454 non-null  object
 9   search_city          1348454 non-null  object
 10  search_country       1348454 non-null  object
 11  search_position      1348454 non-null  object
 12  job_level            1348454 non-null  object
 13  job_type             1348454 non-null  object
dtypes: object(14)
memory usage: 144.0+ MB


In [4]:
df_job_sum.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1297332 entries, 0 to 1297331
Data columns (total 2 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   job_link     1297332 non-null  object
 1   job_summary  1297332 non-null  object
dtypes: object(2)
memory usage: 19.8+ MB


In [5]:
df_jb_skills.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296381 entries, 0 to 1296380
Data columns (total 2 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   job_link    1296381 non-null  object
 1   job_skills  1294296 non-null  object
dtypes: object(2)
memory usage: 19.8+ MB


In [6]:
df_profiles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 26 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   timestamp                   1000 non-null   object 
 1   id                          1000 non-null   object 
 2   name                        1000 non-null   object 
 3   city                        949 non-null    object 
 4   country_code                998 non-null    object 
 5   region                      456 non-null    object 
 6   current_company:company_id  572 non-null    object 
 7   current_company:name        833 non-null    object 
 8   position                    982 non-null    object 
 9   following                   481 non-null    float64
 10  about                       498 non-null    object 
 11  posts                       417 non-null    object 
 12  groups                      110 non-null    object 
 13  current_company             1000 n

In [7]:
df_com_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 31 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   timestamp              1000 non-null   object 
 1   id                     1000 non-null   object 
 2   name                   1000 non-null   object 
 3   country_code           999 non-null    object 
 4   locations              1000 non-null   object 
 5   formatted_locations    1000 non-null   object 
 6   followers              1000 non-null   int64  
 7   employees_in_linkedin  847 non-null    float64
 8   about                  998 non-null    object 
 9   specialties            765 non-null    object 
 10  company_size           1000 non-null   object 
 11  organization_type      930 non-null    object 
 12  industries             1000 non-null   object 
 13  website                969 non-null    object 
 14  crunchbase_url         15 non-null     object 
 15  found

In [8]:
df_job_post.describe(include='all')

| | job_link | last_processed_time | got_summary | got_ner | is_being_worked | job_title | company | job_location | first_seen | search_city | search_country | search_position | job_level | job_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1348454 | 1348454 | 1348454 | 1348454 | 1348454 | 1348454 | 1348443 | 1348435 | 1348454 | 1348454 | 1348454 | 1348454 | 1348454 | 1348454 |
| unique | 1348454 | 722748 | 2 | 2 | 2 | 584544 | 90605 | 29153 | 6 | 1018 | 4 | 1993 | 2 | 3 |
| top | https://www.linkedin.com/jobs/view/account-exe... | 2024-01-19 09:45:09.215838+00 | t | t | f | LEAD SALES ASSOCIATE-FT | Health eCareers | New York, NY | 2024-01-14 | Baytown | United States | Account Executive | Mid senior | Onsite |
| freq | 1 | 625540 | 1297877 | 1296401 | 1346978 | 7325 | 41598 | 13436 | 460035 | 10052 | 1149342 | 19468 | 1204445 | 1337633 |



In [9]:
df_jb_skills.describe(include='all')  

| | job_link | job_skills |
|---|---|---|
| count | 1296381 | 1294296 |
| unique | 1296381 | 1287101 |
| top | https://www.linkedin.com/jobs/view/housekeeper... | Front Counter, DriveThru, Outside Order Taker,... |
| freq | 1 | 169 |


In [10]:
df_job_sum.describe(include='all')

| | job_link | job_summary |
|---|---|---|
| count | 1297332 | 1297332 |
| unique | 1297332 | 958192 |
| top | https://www.linkedin.com/jobs/view/restaurant-... | Dollar General Corporation has been delivering... |
| freq | 1 | 4571 |


In [11]:
df_com_data.describe(include='all') 

Unnamed: 0,timestamp,id,name,country_code,locations,formatted_locations,followers,employees_in_linkedin,about,specialties,...,similar,sphere,url,type,updates,slogan,affiliated,funding,stock_info,investors
count,1000,1000,1000,999,1000,1000,1000.0,847.0,998,765,...,881,1000,1000,1000,491,778,49,15,0.0,9
unique,31,1000,1000,134,995,995,,,998,765,...,881,125,1000,8,491,778,49,15,,9
top,2023-08-11,be-nijs-business-development,Be Nijs * Business- & Concept Development,US,"New York, NY, US","New York, NY, US",,,"Voor Startups, ondernemers en bedrijven die he...","Expansie bedrijfsactiviteiten, Marketing strat...",...,"[{""Links"":""https://www.linkedin.com/company/ro...",IT Services and IT Consulting,https://www.linkedin.com/company/be-nijs-busin...,Privately Held,"[{""likes_count"":7,""text"":""Vandaag een nieuw co...","Voor Startups, ondernemers en bedrijven die he...","[{""Links"":""https://vn.linkedin.com/company/arm...","{""last_round_date"":""2017-08-01T00:00:00.000Z"",...",,MassMutual
freq,260,1,1,279,3,3,,,1,1,...,1,74,1,528,1,1,1,1,,1
mean,,,,,,,1231.787,34.005903,,,...,,,,,,,,,,
std,,,,,,,12003.115391,202.467214,,,...,,,,,,,,,,
min,,,,,,,0.0,1.0,,,...,,,,,,,,,,
25%,,,,,,,10.0,2.0,,,...,,,,,,,,,,
50%,,,,,,,62.0,5.0,,,...,,,,,,,,,,
75%,,,,,,,335.25,15.0,,,...,,,,,,,,,,


In [12]:
df_profiles.describe(include='all')

Unnamed: 0,timestamp,id,name,city,country_code,region,current_company:company_id,current_company:name,position,following,...,people_also_viewed,educations_details,education,avatar,languages,certifications,recommendations,recommendations_count,volunteer_experience,—Åourses
count,1000,1000,1000,949,998,456,572,833,982,481.0,...,823,682,720,810,255,184,173,173.0,116,7
unique,324,1000,1000,671,82,5,553,813,925,,...,823,647,720,516,198,184,173,,116,7
top,2023-01-13,catherinemcilkenny,"Catherine Fitzpatrick (McIlkenny), B.A",United States,US,EU,amazon,Amazon,--,,...,"[{""profile_link"":""https://ca.linkedin.com/in/l...",Georgia State University,"[{""degree"":""Bachelor of Arts (B.A.) Honours"",""...",https://static-exp1.licdn.com/sc/h/244xhbkr7g4...,"[{""subtitle"":""-"",""title"":""English""}]","[{""meta"":""Issued Jun 2013"",""subtitle"":""Van der...",Menno H. Poort ‚ÄúIk werk al jaren prettig met M...,,"[{""cause"":"""",""duration"":""Sep 2010 Jul 2020 9 y...","[{""subtitle"":""-"",""title"":""Masters work in Comp..."
freq,43,1,1,16,356,209,4,4,57,,...,1,3,1,198,29,1,1,,1,1
mean,,,,,,,,,,144.241164,...,,,,,,,,3.67052,,
std,,,,,,,,,,169.527407,...,,,,,,,,4.797177,,
min,,,,,,,,,,1.0,...,,,,,,,,1.0,,
25%,,,,,,,,,,12.0,...,,,,,,,,1.0,,
50%,,,,,,,,,,57.0,...,,,,,,,,2.0,,
75%,,,,,,,,,,226.0,...,,,,,,,,4.0,,


In [13]:
# Map a short label to each DataFrame so the checks below can loop over them
linkedin = {'company': df_com_data, 'profiles': df_profiles,
    'skills': df_jb_skills, 'summaries': df_job_sum, 'postings': df_job_post}

# Per-column null counts for each dataset (one Series per label)
nulls = {key: df.isna().sum() for key, df in linkedin.items()}
nulls

{'company': timestamp                   0
 id                          0
 name                        0
 country_code                1
 locations                   0
 formatted_locations         0
 followers                   0
 employees_in_linkedin     153
 about                       2
 specialties               235
 company_size                0
 organization_type          70
 industries                  0
 website                    31
 crunchbase_url            985
 founded                     4
 company_id                  0
 employees                 152
 headquarters                1
 image                       0
 logo                        0
 similar                   119
 sphere                      0
 url                         0
 type                        0
 updates                   509
 slogan                    222
 affiliated                951
 funding                   985
 stock_info               1000
 investors                 991
 dtype: int64,
 'profiles': 

In [14]:
# Count fully duplicated rows in each dataset
dups = {key: int(df.duplicated().sum()) for key, df in linkedin.items()}
dups

{'company': 0, 'profiles': 0, 'skills': 0, 'summaries': 0, 'postings': 0}

### 🧼 Data Cleaning Summary

#### Total Rows (Across Datasets)

| df_com_data | df_profiles | df_jb_skills | df_job_sum | df_job_post |
|--------|--------|-----------|--------|---------|
| 1000 | 1000 | 1296381 | 1297332 | 1348454 |
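
Because the three job tables have different row counts, not every posting can have a matching skills or summary row. A left merge with `indicator=True` is one way to measure that coverage before joining for analysis. The frames below are hypothetical miniatures, and `validate="one_to_one"` relies on `job_link` being unique in both tables, as the `describe` output suggests.

```python
import pandas as pd

# Hypothetical miniatures of the postings and skills tables
postings = pd.DataFrame({"job_link": ["a", "b", "c", "d"],
                         "job_title": ["t1", "t2", "t3", "t4"]})
skills = pd.DataFrame({"job_link": ["a", "b", "c"],
                       "job_skills": ["x", "y", None]})

# Left join keeps every posting; _merge records whether a skills row was found
coverage = postings.merge(skills, on="job_link", how="left",
                          indicator=True, validate="one_to_one")

missing_rows = int((coverage["_merge"] == "left_only").sum())  # no skills row at all
null_skills = int((coverage["_merge"].eq("both") & coverage["job_skills"].isna()).sum())
print(f"postings without a skills row: {missing_rows}")
print(f"postings with a skills row but null skills: {null_skills}")
```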

---
#### 🔢 Null Value Totals (Per Dataset)

| df_com_data | Column    | Total Nulls  | Notes        |
|---|----------------|-----------|----------------|
| | country_code | 1 | 0.1% of rows |
| | employees_in_linkedin | 153 | 15.3% of rows |
| | about | 2 | 0.2% of rows |
| | specialties | 235 | 23.5% of rows |
| | organization_type | 70 | 7.0% of rows |
| | website | 31 | 3.1% of rows |
| | crunchbase_url | 985 | 98.5% of rows |
| | founded | 4 | 0.4% of rows |
| | employees | 152 | 15.2% of rows |
| | headquarters | 1 | 0.1% of rows |
| | similar | 119 | 11.9% of rows |
| | updates | 509 | 50.9% of rows |
| | slogan | 222 | 22.2% of rows |
| | affiliated | 951 | 95.1% of rows |
| | funding | 985 | 98.5% of rows |
| | stock_info | 1000 | 100% of rows (column is entirely empty) |
| | investors | 991 | 99.1% of rows |

 
| df_profiles | Column    | Total Nulls  | Notes        |
|---|----------------|-----------|----------------|
|| brand          | 15M+  | Moderate nulls (14%)      |


| df_jb_skills | Column    | Total Nulls  | Notes        |
|---|----------------|-----------|----------------|
|| user_session   | <50     | Very rare nulls          |


| df_job_sum | Column    | Total Nulls  | Notes        |
|---|----------------|-----------|----------------|
|||||


| df_job_post | Column    | Total Nulls  | Notes        |
|---|----------------|-----------|----------------|
|||||

---

#### Monthly Summary (Means and Medians)

| Data | Month-Year | Count (approx) | Mean Price | Median Price |
||------------|----------------|------------|--------------|
| Oct 2019   | 42+ million    | ~290‚Äì310   | ~150‚Äì180     |
| Nov 2019   | 67+ million    | ~280‚Äì320   | ~150‚Äì190     |

---


df_job_post
    1348454 entries
 6   company              1348443 non-null  object
 7   job_location         1348435 non-null  object

df_jb_skills
    1296381 entries
 1   job_skills  1294296 non-null  object
 

 df_profiles
 RangeIndex: 1000 entries, 0 to 999
Data columns (total 26 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 
 3   city                        949 non-null    object 
 4   country_code                998 non-null    object 
 5   region                      456 non-null    object 
 6   current_company:company_id  572 non-null    object 
 7   current_company:name        833 non-null    object 
 8   position                    982 non-null    object 
 9   following                   481 non-null    float64
 10  about                       498 non-null    object 
 11  posts                       417 non-null    object 
 12  groups                      110 non-null    object 
 14  experience                  906 non-null    object 
 16  people_also_viewed          823 non-null    object 
 17  educations_details          682 non-null    object 
 18  education                   720 non-null    object 
 19  avatar                      810 non-null    object 
 20  languages                   255 non-null    object 
 21  certifications              184 non-null    object 
 22  recommendations             173 non-null    object 
 23  recommendations_count       173 non-null    float64
 24  volunteer_experience        116 non-null    object 
 25  —Åourses                     7 non-null      object 

 df_com_data

 country_code                1
 employees_in_linkedin     153
 about                       2
 specialties               235
 organization_type          70
 website                    31
 crunchbase_url            985
 founded                     4
 employees                 152
 headquarters                1
 similar                   119
 updates                   509
 slogan                    222
 affiliated                951
 funding                   985
 stock_info               1000
 investors                 991

## 2. **Data Exploration and Analysis**

## 3. **Skills Demand Analysis**

## 4. **Company & Industry Insights**

## 5. **Job Recommendation System (Foundational)**

## 6. **Market Gap & Workforce Insights**