Problem Statement:

Which jobs are most at risk of automation by AI, and what factors influence this risk?

•	Predict: which jobs are at risk of automation using datasets like O*NET and others.

•	Output: Risk scores for various jobs, factors influencing automation.

Tasks:

•	Data collection: Gather data on job types, skills required, and current automation levels.

•	Exploratory Data Analysis: Analyze which factors (skills, education, industry) correlate with higher automation risk.

•	Predictive Modeling: Build a model to predict the automation risk for different job titles.

•	Visualization: Create dashboards or infographics showing which jobs are at risk and why.
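
As a rough illustration of the Exploratory Data Analysis task, the sketch below ranks numeric factors by their correlation with automation risk. The six-row frame `toy` and all of its values are invented for illustration; only the column names mirror the dataset loaded later.

```python
import pandas as pd

# Hypothetical toy data; the values are invented purely for illustration.
toy = pd.DataFrame({
    "Experience Required (Years)": [1, 3, 5, 8, 12, 15],
    "Remote Work Ratio (%)": [80, 20, 60, 40, 10, 50],
    "Automation Risk (%)": [90, 75, 60, 45, 30, 10],
})

# Pearson correlation of every numeric column with the target column,
# sorted from most negative to most positive.
risk_corr = (
    toy.corr(numeric_only=True)["Automation Risk (%)"]
    .drop("Automation Risk (%)")
    .sort_values()
)
print(risk_corr)
```

Correlation only captures linear association, so a value near zero does not rule out a non-linear relationship.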


# Import Python libraries and dataset

In [2]:
import pandas as pd
import plotly.express as px
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

In [20]:
df=pd.read_csv(r"G:\Project Ai and Future\Data\try 11\ai_job_trends_dataset_enriched_2020_2025.csv")
df.head()

Unnamed: 0,Job Title,Industry,Job Status,AI Impact Level,Median Salary (USD),Required Education,Experience Required (Years),Job Openings (2024),Projected Openings (2030),Remote Work Ratio (%),Automation Risk (%),Location,Gender Diversity (%),Automation Risk 2020 (%),Automation Risk 2021 (%),Automation Risk 2022 (%),Automation Risk 2023 (%),Automation Risk 2024 (%)
0,Investment analyst,IT,Increasing,Moderate,42109.76,Master’s Degree,5,1515,6342,55.96,28.28,UK,44.63,21.95,21.96,25.3,28.86,26.23
1,"Journalist, newspaper",Manufacturing,Increasing,Moderate,132298.57,Master’s Degree,15,1243,6205,16.81,89.71,USA,66.39,84.73,93.13,90.53,86.14,91.18
2,Financial planner,Finance,Increasing,Low,143279.19,Bachelor’s Degree,4,3338,1154,91.82,72.97,Canada,41.13,67.06,67.89,71.34,64.23,65.77
3,Legal secretary,Healthcare,Increasing,High,97576.13,Associate Degree,15,7173,4060,1.89,99.94,Australia,65.76,93.51,92.26,99.01,94.3,92.81
4,Aeronautical engineer,IT,Increasing,Low,60956.63,Master’s Degree,13,5944,7396,53.76,37.65,Germany,72.57,36.25,32.86,34.68,31.8,35.17


## Basic Exploratory Data Analysis (EDA)

In [21]:
df.shape

(30000, 18)

In [22]:
df.dropna(inplace=True)

In [23]:
print(df.shape)

(30000, 18)


In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 18 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Job Title                    30000 non-null  object 
 1   Industry                     30000 non-null  object 
 2   Job Status                   30000 non-null  object 
 3   AI Impact Level              30000 non-null  object 
 4   Median Salary (USD)          30000 non-null  float64
 5   Required Education           30000 non-null  object 
 6   Experience Required (Years)  30000 non-null  int64  
 7   Job Openings (2024)          30000 non-null  int64  
 8   Projected Openings (2030)    30000 non-null  int64  
 9   Remote Work Ratio (%)        30000 non-null  float64
 10  Automation Risk (%)          30000 non-null  float64
 11  Location                     30000 non-null  object 
 12  Gender Diversity (%)         30000 non-null  float64
 13  Automation Risk 

In [25]:
df.describe()

Unnamed: 0,Median Salary (USD),Experience Required (Years),Job Openings (2024),Projected Openings (2030),Remote Work Ratio (%),Automation Risk (%),Gender Diversity (%),Automation Risk 2020 (%),Automation Risk 2021 (%),Automation Risk 2022 (%),Automation Risk 2023 (%),Automation Risk 2024 (%)
count,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0
mean,90119.965639,10.051433,5039.640833,5074.2181,49.836431,50.154229,49.97966,45.588113,46.514112,47.367414,48.266634,49.209233
std,34412.013953,6.060678,2861.009654,2866.550722,28.966688,28.754889,17.274665,28.7464,28.798043,28.808522,28.806938,28.81806
min,30001.86,0.0,100.0,100.0,0.0,0.0,20.0,0.0,0.0,0.0,0.0,0.0
25%,60500.7025,5.0,2570.0,2586.75,24.57,25.4,35.07,20.55,21.57,22.35,23.49,24.4775
50%,90274.115,10.0,5034.0,5106.5,49.57,50.02,49.885,45.23,46.25,47.12,47.98,49.015
75%,119454.71,15.0,7527.0,7573.0,75.1,75.03,64.91,70.2925,71.24,72.19,73.0825,74.03
max,149998.5,20.0,10000.0,10000.0,100.0,99.99,80.0,100.0,100.0,100.0,100.0,100.0
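
In the summary above, the mean of the yearly columns rises from about 45.6 (2020) to about 49.2 (2024). One way to work with that trend is to reshape the yearly columns into long format. This is a minimal sketch on a hypothetical two-row frame `sample`; the real notebook would use `df` and all five yearly columns.

```python
import pandas as pd

# Hypothetical two-row sample with invented values.
sample = pd.DataFrame({
    "Job Title": ["Job A", "Job B"],
    "Automation Risk 2020 (%)": [20.0, 80.0],
    "Automation Risk 2021 (%)": [22.0, 82.0],
    "Automation Risk 2022 (%)": [25.0, 85.0],
})

year_cols = [c for c in sample.columns if c.startswith("Automation Risk 20")]

# Wide -> long: one row per (job, year) pair.
long_df = sample.melt(id_vars="Job Title", value_vars=year_cols,
                      var_name="Column", value_name="Risk (%)")

# Pull the four-digit year out of the original column name.
long_df["Year"] = long_df["Column"].str.extract(r"(\d{4})", expand=False).astype(int)

yearly_mean = long_df.groupby("Year")["Risk (%)"].mean()
print(yearly_mean)
```

The long format also feeds directly into a line chart with the year on the x-axis.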


In [26]:
df.columns.tolist()

['Job Title',
 'Industry',
 'Job Status',
 'AI Impact Level',
 'Median Salary (USD)',
 'Required Education',
 'Experience Required (Years)',
 'Job Openings (2024)',
 'Projected Openings (2030)',
 'Remote Work Ratio (%)',
 'Automation Risk (%)',
 'Location',
 'Gender Diversity (%)',
 'Automation Risk 2020 (%)',
 'Automation Risk 2021 (%)',
 'Automation Risk 2022 (%)',
 'Automation Risk 2023 (%)',
 'Automation Risk 2024 (%)']

# Calculate the sum of each column

In [27]:
df.isnull()

Unnamed: 0,Job Title,Industry,Job Status,AI Impact Level,Median Salary (USD),Required Education,Experience Required (Years),Job Openings (2024),Projected Openings (2030),Remote Work Ratio (%),Automation Risk (%),Location,Gender Diversity (%),Automation Risk 2020 (%),Automation Risk 2021 (%),Automation Risk 2022 (%),Automation Risk 2023 (%),Automation Risk 2024 (%)
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
29996,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
29997,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
29998,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [28]:
df.isnull().sum()

Job Title                      0
Industry                       0
Job Status                     0
AI Impact Level                0
Median Salary (USD)            0
Required Education             0
Experience Required (Years)    0
Job Openings (2024)            0
Projected Openings (2030)      0
Remote Work Ratio (%)          0
Automation Risk (%)            0
Location                       0
Gender Diversity (%)           0
Automation Risk 2020 (%)       0
Automation Risk 2021 (%)       0
Automation Risk 2022 (%)       0
Automation Risk 2023 (%)       0
Automation Risk 2024 (%)       0
dtype: int64

In [29]:
df.nunique()

Job Title                        639
Industry                           8
Job Status                         2
AI Impact Level                    3
Median Salary (USD)            29968
Required Education                 5
Experience Required (Years)       21
Job Openings (2024)             9439
Projected Openings (2030)       9410
Remote Work Ratio (%)           9466
Automation Risk (%)             9519
Location                           8
Gender Diversity (%)            5965
Automation Risk 2020 (%)        9306
Automation Risk 2021 (%)        9344
Automation Risk 2022 (%)        9389
Automation Risk 2023 (%)        9456
Automation Risk 2024 (%)        9469
dtype: int64
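
The unique counts above separate columns with a handful of categories (for example `Industry` with 8) from near-continuous ones (for example `Median Salary (USD)` with 29968). A small sketch of using that split to pick a summary method follows; the frame `mini` and the cut-off `MAX_CATEGORIES` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical miniature frame standing in for `df`.
mini = pd.DataFrame({
    "Industry": ["IT", "Finance", "IT", "Retail", "IT", "Finance"],
    "Median Salary (USD)": [42109.76, 132298.57, 143279.19,
                            97576.13, 60956.63, 88000.00],
})

MAX_CATEGORIES = 3  # assumed cut-off; tune it for the real data

unique_counts = mini.nunique()
low_card = unique_counts[unique_counts <= MAX_CATEGORIES].index.tolist()
high_card = unique_counts[unique_counts > MAX_CATEGORIES].index.tolist()

print("Summarise with value_counts():", low_card)
print("Summarise with describe() or a histogram:", high_card)
```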

In [30]:
df.sum()

Job Title                      Investment analystJournalist, newspaperFinanci...
Industry                       ITManufacturingFinanceHealthcareITEducationMan...
Job Status                     IncreasingIncreasingIncreasingIncreasingIncrea...
AI Impact Level                ModerateModerateLowHighLowLowHighHighHighModer...
Median Salary (USD)                                                2703598969.16
Required Education             Master’s DegreeMaster’s DegreeBachelor’s Degre...
Experience Required (Years)                                               301543
Job Openings (2024)                                                    151189225
Projected Openings (2030)                                              152226543
Remote Work Ratio (%)                                                 1495092.94
Automation Risk (%)                                                   1504626.88
Location                       UKUSACanadaAustraliaGermanyUSAUKCanadaAustrali...
Gender Diversity (%)        

In [31]:
# Job title
job_title_counts = df["Job Title"].value_counts()

# Display each job title and its count
for title, count in job_title_counts.items():
    print(f"{title} = {count}")
# Display the number of unique job titles and the total row count
print("Total unique job titles:", job_title_counts.count())
print("Total sum of job titles:", job_title_counts.sum())
# ---------------------------
#  # 📊 Graph #
#  ---------------------------
# Reset index to convert Series into a DataFrame
job_title_df = job_title_counts.reset_index()
job_title_df.columns = ["Job Title", "Count"]

# Line graph
fig = px.line(job_title_df, y="Job Title", x="Count",
              title="Job Title Frequency",
              markers=True)  # markers=True to show points
fig.show()

Surveyor, insurance = 75
Counselling psychologist = 70
Charity officer = 68
Surveyor, land/geomatics = 67
Hydrogeologist = 66
Insurance claims handler = 65
Engineer, production = 64
Tax inspector = 64
Television production assistant = 63
International aid/development worker = 63
Environmental education officer = 62
Horticultural therapist = 62
Contracting civil engineer = 62
Print production planner = 61
Designer, television/film set = 61
Administrator, sports = 61
Sound technician, broadcasting/film/video = 61
Marine scientist = 61
Quantity surveyor = 61
Garment/textile technologist = 61
Air traffic controller = 61
Journalist, magazine = 60
Ceramics designer = 60
Telecommunications researcher = 60
Armed forces training and education officer = 60
Therapist, nutritional = 60
Senior tax professional/tax inspector = 60
Product manager = 60
Psychotherapist = 59
Historic buildings inspector/conservation officer = 59
Merchandiser, retail = 58
Commercial horticulturist = 58
Theatre manager = 

In [32]:
industry_counts = df["Industry"].value_counts()

In [33]:
industry_counts

Industry
Entertainment     3895
Manufacturing     3855
Healthcare        3771
Finance           3721
Education         3714
Retail            3702
IT                3681
Transportation    3661
Name: count, dtype: int64

In [34]:
# Count each industry
industry_counts = df["Industry"].value_counts()

# Total rows across all industries
total_rows = industry_counts.sum()

# Display each industry, its count, and percentage
for industry, count in industry_counts.items():
    percent = (count / total_rows) * 100
    print(f"{industry} = {count}  ({percent:.2f}%)")

# Totals
print("\nTotal unique industries:", industry_counts.count())
print("Total rows (all industries):", total_rows)
# ---------------------------
#  # 📊 Graph #
#  ---------------------------
industry_counts = df["Industry"].value_counts().reset_index()
industry_counts.columns = ["Industry", "Count"]

fig = px.pie(
    industry_counts,
    names="Industry",
    values="Count",
    title="Industry Distribution",
    hole=0.0,  # 🔹 set between 0 (full pie) and 1 (donut)
)

fig.update_traces(textinfo="percent+label")  # show % and label inside/outside

fig.show()

Entertainment = 3895  (12.98%)
Manufacturing = 3855  (12.85%)
Healthcare = 3771  (12.57%)
Finance = 3721  (12.40%)
Education = 3714  (12.38%)
Retail = 3702  (12.34%)
IT = 3681  (12.27%)
Transportation = 3661  (12.20%)

Total unique industries: 8
Total rows (all industries): 30000
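
The same count-and-percentage table can be produced without an explicit loop, because `value_counts` accepts `normalize=True`. This is a minimal sketch; the four-element Series `industry` is a hypothetical stand-in for `df["Industry"]`.

```python
import pandas as pd

# Hypothetical stand-in for df["Industry"].
industry = pd.Series(["IT", "IT", "Finance", "Retail"], name="Industry")

summary = pd.DataFrame({
    "Count": industry.value_counts(),
    # normalize=True returns fractions that sum to 1.
    "Percent": (industry.value_counts(normalize=True) * 100).round(2),
})
print(summary)
```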


In [35]:
# Counting Job Status
job_status_counts = df["Job Status"].value_counts()

# Display each job status and its count
for status, count in job_status_counts.items():
    print(f"{status} = {count}")

# Display total number of unique job statuses
print("Total unique job statuses:", job_status_counts.count())
print("Total sum of job statuses:", job_status_counts.sum())

job_status_df = job_status_counts.reset_index()
job_status_df.columns = ["Job Status", "Count"]
# ---------------------------
#  # 📊 Graph #
#  ---------------------------
# Bar chart
fig = px.bar(job_status_df, y="Job Status", x="Count",
             title="Job Status Distribution",
             text="Count")  # show counts on bars
fig.show()

Increasing = 15136
Decreasing = 14864
Total unique job statuses: 2
Total sum of job statuses: 30000


In [36]:
# Counting AI Impact Level
ai_impact_counts = df["AI Impact Level"].value_counts()

# Display each AI impact level and its count
for impact, count in ai_impact_counts.items():
    print(f"{impact} = {count}")
# Display the number of unique AI impact levels and the total row count
print("Total unique AI impacts:", ai_impact_counts.count())
print("Total sum of AI impacts:", ai_impact_counts.sum())
# ---------------------------
#  # 📊 Graph #
#  ---------------------------
ai_impact_df = ai_impact_counts.reset_index()
ai_impact_df.columns = ["AI Impact Level", "Count"]

# Line chart
fig = px.line(ai_impact_df, x="AI Impact Level", y="Count",
              title="AI Impact Level Distribution",
              markers=True)  # markers=True to show data points
fig.show()


Moderate = 10042
High = 10005
Low = 9953
Total unique AI impacts: 3
Total sum of AI impacts: 30000


In [37]:
# Counting Median Salary (USD)
median_salary_counts = df["Median Salary (USD)"].value_counts()

# Display each median salary count
for salary, count in median_salary_counts.items():
    print(f"{salary} = {count}")

# Display total number of unique median salaries
print("Total unique median salaries:", median_salary_counts.count())

# Display total number of rows (sum of all counts)
print("Total rows:", median_salary_counts.sum())

# 💡 Gross total of all salary values (numerical sum of the column)
print("Gross total salary (USD):", df["Median Salary (USD)"].sum())
# ---------------------------
#  # 📊 Graph #
#  ---------------------------
median_salary_df = median_salary_counts.reset_index()
median_salary_df.columns = ["Median Salary (USD)", "Count"]

# Sort by salary for a logical flow (optional if salaries are numeric)
median_salary_df = median_salary_df.sort_values("Median Salary (USD)")

# Stream-like (area) chart
fig = px.area(median_salary_df,
              x="Median Salary (USD)",
              y="Count",
              title="Median Salary Distribution (Stream Chart Style)")
fig.show()

119301.98 = 2
139154.33 = 2
126634.05 = 2
77847.53 = 2
108901.43 = 2
59213.22 = 2
117711.88 = 2
40648.37 = 2
49693.9 = 2
149674.55 = 2
55639.94 = 2
119250.96 = 2
44159.55 = 2
78551.67 = 2
128406.45 = 2
52923.33 = 2
59324.16 = 2
82943.69 = 2
101962.53 = 2
122965.46 = 2
130237.82 = 2
128917.77 = 2
109282.35 = 2
140722.06 = 2
91690.6 = 2
117221.32 = 2
142478.74 = 2
39917.38 = 2
73920.93 = 2
133155.54 = 2
33841.64 = 2
109363.8 = 2
37329.71 = 1
115129.6 = 1
30506.42 = 1
143578.35 = 1
139768.78 = 1
46828.72 = 1
63193.72 = 1
68891.88 = 1
62093.25 = 1
107318.63 = 1
101322.09 = 1
109546.76 = 1
142660.31 = 1
122569.29 = 1
122668.89 = 1
60874.45 = 1
49705.11 = 1
59923.23 = 1
52060.54 = 1
110077.96 = 1
102393.43 = 1
73318.63 = 1
72953.11 = 1
34082.78 = 1
41523.13 = 1
92165.61 = 1
62229.93 = 1
106122.32 = 1
142549.39 = 1
76323.91 = 1
79665.42 = 1
41105.21 = 1
121435.0 = 1
124663.97 = 1
41067.95 = 1
148115.7 = 1
38486.99 = 1
35799.45 = 1
54495.6 = 1
65875.53 = 1
60038.64 = 1
50955.44 = 1
113064.81 =
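
Because almost every salary value is unique, counting exact values mostly yields counts of 1 or 2. Grouping the values into bins is one way to see the shape of the distribution instead. The sketch below uses synthetic salaries drawn from roughly the observed range; the bin width of USD 20,000 is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic salaries for illustration only (not the real column).
rng = np.random.default_rng(0)
salaries = pd.Series(rng.uniform(30_000, 150_000, size=1_000),
                     name="Median Salary (USD)")

# Six equal-width bins of USD 20,000 each.
bin_edges = np.arange(30_000, 150_001, 20_000)
binned = pd.cut(salaries, bins=bin_edges, include_lowest=True)

# Rows per bin, in salary order rather than frequency order.
bin_counts = binned.value_counts().sort_index()
print(bin_counts)
```

On the real data, `px.histogram(df, x="Median Salary (USD)")` would give a comparable picture directly.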

In [38]:
# Required Education
required_education_counts = df["Required Education"].value_counts()

# Display each required education and its count
for education, count in required_education_counts.items():
    print(f"{education} = {count}")
# Display the number of unique education levels and the total row count
print("Total unique required education:", required_education_counts.count())
print("Total sum of required education:", required_education_counts.sum())

required_education_df = required_education_counts.reset_index()
required_education_df.columns = ["Required Education", "Count"]
# ---------------------------
#  # 📊 Graph #
#  ---------------------------
# Radar (polar) chart
fig = px.line_polar(required_education_df,
                    r="Count",
                    theta="Required Education",
                    line_close=True,      # closes the radar shape
                    title="Required Education Distribution")
fig.show()

Bachelor’s Degree = 6146
Master’s Degree = 6097
Associate Degree = 6003
High School = 5900
PhD = 5854
Total unique required education: 5
Total sum of required education: 30000


In [39]:
# Experience Required (Years)
experience_required_counts = df["Experience Required (Years)"].value_counts()

# Display each experience required count
for experience, count in experience_required_counts.items():
    print(f"{experience} = {count}")
# Display the number of unique experience values and the total row count
print("Total unique experience required:", experience_required_counts.count())
print("Total sum of experience required:", experience_required_counts.sum())
# 💡 Gross total of required experience in years (numerical sum of the column)
print("Gross Experience Required:", df["Experience Required (Years)"].sum())
# ---------------------------
#  # 📊 Graph #
#  ---------------------------

19 = 1476
8 = 1475
0 = 1469
16 = 1464
11 = 1453
13 = 1449
20 = 1448
15 = 1447
7 = 1446
18 = 1430
6 = 1428
10 = 1423
12 = 1419
14 = 1416
5 = 1416
17 = 1407
2 = 1403
1 = 1400
3 = 1391
9 = 1388
4 = 1352
Total unique experience required: 21
Total sum of experience required: 30000
Gross Experience Required: 301543
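
The listing above is ordered by frequency, which scatters the year values. For an ordinal column such as years of experience, sorting the counts by their index is often easier to read. This is a minimal sketch with a hypothetical Series `years` standing in for the real column.

```python
import pandas as pd

# Hypothetical stand-in for df["Experience Required (Years)"].
years = pd.Series([5, 0, 5, 2, 0, 5, 1], name="Experience Required (Years)")

by_frequency = years.value_counts()            # most common value first
by_value = years.value_counts().sort_index()   # ascending by years

print(by_value)
```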


In [40]:
# Job Openings (2024)
job_openings_counts = df["Job Openings (2024)"].value_counts()

# Display each job openings value and how many rows share it
for openings, count in job_openings_counts.items():
    print(f"{openings} = {count}")
# Display the number of unique job openings values and the total row count
print("Total unique job openings values:", job_openings_counts.count())
print("Total sum of job openings counts:", job_openings_counts.sum())
# 💡 Gross total of job openings in 2024 (numerical sum of the column)
print("Gross job openings (2024):", df["Job Openings (2024)"].sum())
# ---------------------------
#  # 📊 Graph #
#  ---------------------------
job_openings_df = job_openings_counts.reset_index()
job_openings_df.columns = ["Job Openings (2024)", "Count"]

# Sort for better line chart visualization
job_openings_df = job_openings_df.sort_values("Job Openings (2024)")

# Line chart
fig = px.line(job_openings_df,
              x="Job Openings (2024)",
              y="Count",
              title="Job Openings (2024) Distribution",
              markers=True,   # show points on the line
              labels={"Count": "Number of Jobs", "Job Openings (2024)": "Job Openings (2024)"})

fig.show()

4256 = 12
6783 = 12
3032 = 11
9811 = 10
2855 = 10
6343 = 10
8207 = 10
1233 = 10
4685 = 10
3883 = 10
2190 = 10
5471 = 10
6926 = 10
3162 = 9
2053 = 9
5856 = 9
8256 = 9
9352 = 9
6279 = 9
6534 = 9
3421 = 9
8688 = 9
3978 = 9
7311 = 9
7462 = 9
5953 = 9
4544 = 9
3585 = 9
8490 = 9
5541 = 9
8928 = 9
9708 = 9
1303 = 9
1637 = 9
7927 = 9
8789 = 9
6774 = 9
2129 = 9
3006 = 9
8021 = 9
9023 = 9
6060 = 9
6592 = 9
4343 = 9
762 = 8
9837 = 8
824 = 8
3347 = 8
2109 = 8
5893 = 8
4826 = 8
658 = 8
2649 = 8
7352 = 8
3645 = 8
4527 = 8
5553 = 8
9789 = 8
1473 = 8
4456 = 8
2387 = 8
1058 = 8
207 = 8
7574 = 8
6896 = 8
4045 = 8
6283 = 8
1454 = 8
761 = 8
2530 = 8
1169 = 8
9241 = 8
7747 = 8
7039 = 8
2766 = 8
2932 = 8
9749 = 8
5919 = 8
9089 = 8
2341 = 8
5788 = 8
6616 = 8
7408 = 8
8689 = 8
5136 = 8
3734 = 8
9420 = 8
7610 = 8
3307 = 8
3647 = 8
9957 = 8
5702 = 8
735 = 8
5590 = 8
6586 = 8
1033 = 8
144 = 8
9427 = 8
9910 = 8
1294 = 8
5404 = 8
1496 = 8
1219 = 8
3207 = 8
9020 = 8
9153 = 8
8295 = 8
6806 = 8
385 = 8
9193 = 8
9025 

In [41]:
# Projected Openings (2030)
project_openings_counts = df["Projected Openings (2030)"].value_counts()

# Display each projected openings value and how many rows share it
for openings, count in project_openings_counts.items():
    print(f"{openings} = {count}")
# Display the number of unique projected openings values and the total row count
print("Total unique projected openings values:", project_openings_counts.count())
print("Total sum of projected openings counts:", project_openings_counts.sum())
# 💡 Gross total of projected openings in 2030 (numerical sum of the column)
print("Gross projected openings (2030):", df["Projected Openings (2030)"].sum())
# ---------------------------
#  # 📊 Graph #
#  ---------------------------
project_openings_df = project_openings_counts.reset_index()
project_openings_df.columns = ["Projected Openings (2030)", "Count"]

# Scatter plot with size representing count
fig = px.scatter(project_openings_df,
                 x="Projected Openings (2030)",
                 y="Count",
                 size="Count",            # bubble size represents count
                 color="Count",           # color gradient by count
                 title="Projected Job Openings (2030)",
                 text="Count",
                 color_continuous_scale=px.colors.sequential.Viridis)

fig.update_traces(marker=dict(opacity=0.8, line=dict(width=1, color='DarkSlateGrey')))

fig.show()

3843 = 10
1426 = 10
1122 = 10
6943 = 10
3272 = 10
3587 = 10
1427 = 10
1571 = 9
5136 = 9
4913 = 9
8115 = 9
8256 = 9
5601 = 9
5386 = 9
6885 = 9
2747 = 9
9086 = 9
7920 = 9
3015 = 9
7479 = 9
5031 = 9
6732 = 9
6884 = 9
4619 = 9
6578 = 9
4893 = 9
1686 = 9
4198 = 9
3970 = 9
410 = 9
6097 = 9
6430 = 9
7595 = 9
6741 = 9
5983 = 9
1062 = 9
3932 = 9
950 = 9
6229 = 9
665 = 8
1154 = 8
8218 = 8
4072 = 8
5575 = 8
1931 = 8
2951 = 8
4588 = 8
8051 = 8
3666 = 8
6574 = 8
7364 = 8
7799 = 8
6221 = 8
7314 = 8
2500 = 8
7088 = 8
6212 = 8
3655 = 8
9096 = 8
1466 = 8
3453 = 8
6761 = 8
1937 = 8
4669 = 8
1529 = 8
780 = 8
4070 = 8
6296 = 8
1330 = 8
9013 = 8
6104 = 8
6055 = 8
4207 = 8
8416 = 8
6232 = 8
2074 = 8
2385 = 8
3322 = 8
1792 = 8
1980 = 8
1459 = 8
568 = 8
4293 = 8
8482 = 8
5484 = 8
8938 = 8
7343 = 8
528 = 8
5013 = 8
4237 = 8
389 = 8
5781 = 8
681 = 8
9488 = 8
183 = 8
6091 = 8
7103 = 8
7638 = 8
2458 = 8
8553 = 8
7969 = 8
2436 = 8
4499 = 8
6310 = 8
7796 = 8
6738 = 8
7169 = 8
5148 = 8
9427 = 8
4112 = 8
5508 = 8
532

In [42]:
# Remote Work Ratio (%)
remote_work_counts = df["Remote Work Ratio (%)"].value_counts()

# Display each remote work ratio and how many rows share it
for ratio, count in remote_work_counts.items():
    print(f"{ratio} = {count}")
# Display the number of unique ratios and the total row count
print("Total unique remote work ratios:", remote_work_counts.count())
print("Total sum of remote work ratio counts:", remote_work_counts.sum())
# 💡 Gross total of remote work ratio values (numerical sum of the column)
print("Gross Remote Work Ratio (%):", df["Remote Work Ratio (%)"].sum())
# ---------------------------
#  # 📊 Graph #
#  ---------------------------

remote_work_df = remote_work_counts.reset_index()
remote_work_df.columns = ["Remote Work Ratio (%)", "Count"]

# Dark-themed bar chart with blue bars
fig = px.bar(remote_work_df,
             x="Remote Work Ratio (%)",
             y="Count",
             title="Remote Work Ratio Distribution",
             text="Count",
             color="Count",  # keep color mapped to Count for gradient
             color_continuous_scale=px.colors.sequential.Blues)

# Dark background
fig.update_layout(
    template="plotly_dark",
    title_font=dict(size=20),
    xaxis_title="Remote Work Ratio (%)",
    yaxis_title="Count"
)

fig.show()


17.04 = 11
65.81 = 11
85.62 = 10
66.82 = 10
58.55 = 10
15.3 = 10
15.59 = 10
65.51 = 10
77.28 = 10
34.19 = 10
6.41 = 9
23.89 = 9
57.45 = 9
36.5 = 9
1.63 = 9
80.67 = 9
22.07 = 9
93.22 = 9
22.16 = 9
1.51 = 9
75.62 = 9
4.8 = 9
86.22 = 9
70.08 = 9
80.04 = 9
89.45 = 9
44.7 = 9
44.41 = 9
55.63 = 9
31.06 = 9
41.15 = 9
27.13 = 9
67.74 = 9
10.24 = 9
36.1 = 9
2.34 = 9
31.48 = 9
24.17 = 9
9.85 = 8
20.37 = 8
85.8 = 8
18.65 = 8
60.81 = 8
63.96 = 8
14.62 = 8
92.87 = 8
13.26 = 8
21.09 = 8
5.12 = 8
99.91 = 8
78.97 = 8
9.24 = 8
32.06 = 8
44.85 = 8
57.91 = 8
43.84 = 8
59.24 = 8
73.16 = 8
43.37 = 8
89.29 = 8
43.0 = 8
75.73 = 8
31.1 = 8
95.14 = 8
25.43 = 8
66.0 = 8
6.88 = 8
78.77 = 8
7.47 = 8
50.43 = 8
93.24 = 8
98.06 = 8
20.93 = 8
94.57 = 8
7.48 = 8
66.21 = 8
57.77 = 8
1.78 = 8
27.16 = 8
88.36 = 8
21.64 = 8
66.98 = 8
75.15 = 8
20.33 = 8
3.6 = 8
68.42 = 8
18.54 = 8
0.31 = 8
8.98 = 8
74.29 = 8
46.88 = 8
96.21 = 8
33.7 = 8
87.66 = 8
43.1 = 8
77.43 = 8
10.87 = 8
49.85 = 8
21.34 = 8
52.38 = 8
16.52 = 8
58.57 =

In [43]:
# Automation Risk (%)
automation_risk_counts = df["Automation Risk (%)"].value_counts()

# Display each automation risk value and how many rows share it
for risk, count in automation_risk_counts.items():
    print(f"{risk} = {count}")
# Display the number of unique risk values and the total row count
print("Total unique Automation Risk:", automation_risk_counts.count())
print("Total sum of Automation Risk:", automation_risk_counts.sum())
# 💡 Gross total of automation risk values (numerical sum of the column)
print("Gross Automation Risk(%):", df["Automation Risk (%)"].sum())
# Mean automation risk across all rows
actual_risk = df["Automation Risk (%)"].mean()
print("Mean Automation Risk (%):", actual_risk)
# ---------------------------
#  # 📊 Graph #
#  ---------------------------
automation_risk_df = automation_risk_counts.reset_index()
automation_risk_df.columns = ["Automation Risk (%)", "Count"]

# Sort by risk percentage for proper area chart
automation_risk_df = automation_risk_df.sort_values("Automation Risk (%)")

# Simple area chart
fig = px.area(automation_risk_df,
              x="Automation Risk (%)",
              y="Count",
              title="Automation Risk (%) Distribution",
              labels={"Count": "Number of Jobs", "Automation Risk (%)": "Automation Risk (%)"})

fig.show()



78.62 = 12
53.3 = 11
32.85 = 10
57.86 = 10
5.01 = 10
56.13 = 10
43.42 = 10
11.56 = 10
96.16 = 10
10.75 = 10
59.2 = 10
60.19 = 10
20.18 = 10
83.72 = 9
28.84 = 9
41.42 = 9
28.95 = 9
63.28 = 9
1.61 = 9
61.4 = 9
56.91 = 9
96.98 = 9
94.24 = 9
36.35 = 9
50.77 = 9
13.8 = 9
40.39 = 9
0.44 = 9
78.18 = 9
58.52 = 9
11.57 = 9
1.74 = 9
10.83 = 9
43.49 = 9
17.77 = 9
23.28 = 9
85.74 = 9
36.94 = 9
98.03 = 8
33.02 = 8
15.75 = 8
39.8 = 8
22.95 = 8
70.99 = 8
16.53 = 8
82.65 = 8
13.75 = 8
19.55 = 8
9.09 = 8
79.75 = 8
32.62 = 8
59.23 = 8
12.88 = 8
11.79 = 8
22.75 = 8
39.59 = 8
90.65 = 8
6.12 = 8
93.5 = 8
4.88 = 8
80.01 = 8
79.88 = 8
74.7 = 8
89.83 = 8
32.74 = 8
6.33 = 8
81.18 = 8
85.15 = 8
89.8 = 8
99.57 = 8
79.36 = 8
73.07 = 8
82.11 = 8
20.76 = 8
79.31 = 8
24.47 = 8
64.16 = 8
66.26 = 8
30.05 = 8
73.1 = 8
19.13 = 8
61.83 = 8
46.19 = 8
71.96 = 8
93.9 = 8
5.03 = 8
35.22 = 8
16.14 = 8
51.69 = 8
24.43 = 8
99.86 = 8
20.94 = 8
81.81 = 8
98.8 = 8
61.08 = 8
82.6 = 8
46.96 = 8
72.0 = 8
0.27 = 8
83.83 = 8
71.16 = 8
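
Exact percentage values are too fine-grained to read as a distribution. Grouping them into a few risk bands gives output closer to the "risk scores for various jobs" the problem statement asks for. The sketch below is illustrative only: the frame `jobs` is a hypothetical sample and the band thresholds (33 and 66) are assumptions, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical sample; the real notebook would use df instead.
jobs = pd.DataFrame({
    "Industry": ["IT", "IT", "Finance", "Finance", "Retail"],
    "Automation Risk (%)": [28.28, 37.65, 72.97, 99.94, 50.0],
})

# Assumed thresholds: up to 33 is Low, 33 to 66 is Medium, above 66 is High.
jobs["Risk Band"] = pd.cut(
    jobs["Automation Risk (%)"],
    bins=[0, 33, 66, 100],
    labels=["Low", "Medium", "High"],
    include_lowest=True,  # keep a risk of exactly 0 in the Low band
)

# sort_index() orders the bands Low -> Medium -> High.
band_counts = jobs["Risk Band"].value_counts().sort_index()
print(band_counts)
```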


In [23]:
# Location
# Count each country
country_counts = df["Location"].value_counts()

# Display each country and its count
for country, count in country_counts.items():
    print(f"{country} = {count}")

# Display total number of unique countries
print("Total unique countries:", country_counts.count())
print("Total sum of country:", country_counts.sum())
# ---------------------------
#  # 📊 Graph #
#  ---------------------------
# Convert Series to DataFrame
country_df = country_counts.reset_index()
country_df.columns = ["Location", "Count"]

# Box plot of row counts per location (each location contributes a single point)
fig = px.box(country_df, y="Count", x="Location",
             title="Record Count by Location (Box Plot)",
             points="all")  # show all individual points
fig.show()

Australia = 3802
UK = 3784
Canada = 3775
China = 3763
Germany = 3741
Brazil = 3728
USA = 3713
India = 3694
Total unique countries: 8
Total sum of country: 30000
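
Row counts per location show how the records are spread, not how risk differs between locations. A grouped aggregate is one way to look at the latter. This is a minimal sketch on a hypothetical five-row frame `sample`; on the real data the same call would run on `df`.

```python
import pandas as pd

# Hypothetical sample with invented values.
sample = pd.DataFrame({
    "Location": ["UK", "USA", "UK", "USA", "Canada"],
    "Automation Risk (%)": [28.28, 89.71, 40.0, 70.29, 72.97],
})

# Mean risk and row count per location, highest mean first.
risk_by_location = (
    sample.groupby("Location")["Automation Risk (%)"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(risk_by_location)
```

Swapping `"Location"` for `"Industry"` or `"Required Education"` gives the same summary for the other categorical factors.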


In [44]:
# Gender Diversity (%)
gender_diversity_counts = df["Gender Diversity (%)"].value_counts()

# Display each gender diversity value and how many rows share it
for diversity, count in gender_diversity_counts.items():
    print(f"{diversity} = {count}")
# Display the number of unique diversity values and the total row count
print("Total unique Gender Diversity:", gender_diversity_counts.count())
print("Total sum of Gender Diversity:", gender_diversity_counts.sum())
# 💡 Gross total of Gender Diversity (numerical sum of the column)
print("Gross Gender Diversity(%):", df["Gender Diversity (%)"].sum())
# ---------------------------
#  # 📊 Graph #
#  ---------------------------

# Convert Series to DataFrame
gender_diversity_df = gender_diversity_counts.reset_index()
gender_diversity_df.columns = ["Gender Diversity (%)", "Count"]

# Bubble chart
fig = px.scatter(gender_diversity_df,
                 x="Gender Diversity (%)",
                 y="Count",
                 size="Count",           # bubble size represents count
                 color="Gender Diversity (%)",
                 title="Gender Diversity Distribution (Bubble Chart)")
fig.show()


79.85 = 14
54.35 = 14
20.86 = 13
27.52 = 13
29.39 = 13
70.08 = 13
29.73 = 13
76.9 = 13
74.93 = 13
41.79 = 13
20.15 = 13
46.33 = 13
36.63 = 13
65.01 = 13
39.46 = 13
74.67 = 13
70.95 = 13
39.48 = 12
66.67 = 12
42.42 = 12
32.78 = 12
58.87 = 12
34.46 = 12
79.3 = 12
76.56 = 12
58.97 = 12
56.55 = 12
32.81 = 12
50.99 = 12
47.18 = 12
75.92 = 12
52.21 = 11
52.5 = 11
52.62 = 11
54.87 = 11
75.81 = 11
57.88 = 11
25.47 = 11
50.55 = 11
60.64 = 11
59.27 = 11
46.14 = 11
76.67 = 11
53.04 = 11
69.22 = 11
59.29 = 11
39.63 = 11
72.37 = 11
40.75 = 11
30.8 = 11
31.8 = 11
59.25 = 11
30.03 = 11
69.94 = 11
46.06 = 11
62.47 = 11
49.24 = 11
31.52 = 11
42.54 = 11
72.54 = 11
37.8 = 11
41.82 = 11
70.54 = 11
75.49 = 11
49.22 = 11
20.98 = 11
23.13 = 11
43.77 = 11
31.24 = 11
35.55 = 11
43.61 = 11
24.15 = 10
70.05 = 10
57.39 = 10
58.77 = 10
55.21 = 10
68.69 = 10
55.78 = 10
33.82 = 10
51.62 = 10
30.23 = 10
49.14 = 10
64.41 = 10
44.11 = 10
67.54 = 10
77.16 = 10
60.52 = 10
29.27 = 10
53.01 = 10
52.65 = 10
41.0 = 10
26.65 

# Model Implementation 
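
The modeling cell in this section scores each regressor with R², RMSE, and MAE. As a reminder of what those numbers measure, here is a small sketch that computes them by hand with NumPy. The arrays `y_true` and `y_pred` are toy values invented for illustration, not model output.

```python
import numpy as np

# Toy values, invented for illustration.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 39.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))                    # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))             # penalises large errors more
ss_res = np.sum(errors ** 2)                     # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
r2 = 1 - ss_res / ss_tot                         # share of variation explained

print(f"R²={r2:.4f} | RMSE={rmse:.4f} | MAE={mae:.4f}")
```

MAE and RMSE are in the same units as the target (percentage points of automation risk), while R² is unitless.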

In [50]:
# ===============================================================
# 📘 AI Job Automation Risk — Prediction-Focused Version (2020–2025)
# ===============================================================

import os
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sentence_transformers import SentenceTransformer

# ===============================================================
# STEP 1: LOAD DATA
# ===============================================================
file_path = r"G:\Project Ai and Future\Data\try 11\ai_job_trends_dataset_enriched_2020_2025.csv"
df = pd.read_csv(file_path)
print(f"✅ Dataset Loaded: {df.shape[0]} rows, {df.shape[1]} columns")

# ===============================================================
# STEP 2: FEATURE ENGINEERING
# ===============================================================
df['Growth Rate (%)'] = ((df['Projected Openings (2030)'] - df['Job Openings (2024)'])
                         / df['Job Openings (2024)'].replace(0, np.nan)) * 100
df['Experience_to_Salary_Ratio'] = df['Experience Required (Years)'] / df['Median Salary (USD)']
df['Remote_Opportunity_Score'] = df['Remote Work Ratio (%)'] * df['Growth Rate (%)']
df = df.replace([np.inf, -np.inf], np.nan).fillna(0)

# ===============================================================
# STEP 3: TEXT EMBEDDINGS (keep readable job titles)
# ===============================================================
text_cols = ['Job Title', 'Industry', 'AI Impact Level']
bert = SentenceTransformer('all-MiniLM-L6-v2')

print("🔹 Encoding textual job info using a sentence-transformer model (all-MiniLM-L6-v2)...")
combined_text = df[text_cols].astype(str).agg(" | ".join, axis=1).tolist()
embeddings = bert.encode(combined_text, show_progress_bar=True)

# ===============================================================
# STEP 4: DEFINE FEATURES & TARGET
# ===============================================================
time_cols = [f"Automation Risk {y} (%)" for y in range(2020, 2025)]
numeric_cols = [
    'Median Salary (USD)', 'Experience Required (Years)',
    'Job Openings (2024)', 'Projected Openings (2030)',
    'Remote Work Ratio (%)', 'Gender Diversity (%)',
    'Growth Rate (%)', 'Experience_to_Salary_Ratio', 'Remote_Opportunity_Score'
]

X_numeric = df[time_cols + numeric_cols].copy()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_numeric)

# Combine scaled numeric + text embeddings
X_final = np.hstack([X_scaled, embeddings])
y = df['Automation Risk (%)']  # target (2025)
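
# ---------------------------------------------------------------
# Illustrative sketch (not called by the pipeline): the scaler above is fit on
# all rows before the train/test split. A leakage-free variant splits first and
# fits the scaler on the training rows only. Tree ensembles are largely
# insensitive to feature scaling, so the effect here is probably small. The
# helper name `split_then_scale` and its arguments are hypothetical.
# ---------------------------------------------------------------
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def split_then_scale(numeric, text_emb, target, test_size=0.2, seed=42):
    """Split rows first, then scale numeric features using training statistics only."""
    numeric = np.asarray(numeric, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    target = np.asarray(target, dtype=float)
    idx = np.arange(len(target))
    train_idx, test_idx = train_test_split(idx, test_size=test_size, random_state=seed)
    sc = StandardScaler().fit(numeric[train_idx])  # statistics from training rows only
    X_tr = np.hstack([sc.transform(numeric[train_idx]), text_emb[train_idx]])
    X_te = np.hstack([sc.transform(numeric[test_idx]), text_emb[test_idx]])
    return X_tr, X_te, target[train_idx], target[test_idx]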

# ===============================================================
# STEP 5: TRAIN MODEL (Random Forest + XGBoost)
# ===============================================================
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.2, random_state=42)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=42),
    "XGBoost": XGBRegressor(n_estimators=200, learning_rate=0.05, objective='reg:squarederror', random_state=42)
}

def evaluate_model(name, model, y_true, y_pred):
    r2 = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    print(f"{name} → R²={r2:.4f} | RMSE={rmse:.4f} | MAE={mae:.4f}")
    return {"Model": name, "R2": r2, "RMSE": rmse, "MAE": mae}
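
# ---------------------------------------------------------------
# Illustrative sketch (not called by the pipeline): because the 2020-2024 risk
# columns are model inputs, a useful reference point is a naive baseline that
# predicts the target from the most recent year alone. The helper name
# `persistence_baseline` is hypothetical; it reports the same three metrics.
# ---------------------------------------------------------------
import numpy as np

def persistence_baseline(y_true, last_year_risk):
    """Score 'target = last observed year' with R2, RMSE and MAE (NumPy only)."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(last_year_risk, dtype=float)
    err = y_true - y_hat
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # assumes y_true is not constant
    return {
        "Model": "Persistence (2024 value)",
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
    }

# Possible usage (column names taken from the dataset):
# persistence_baseline(df['Automation Risk (%)'], df['Automation Risk 2024 (%)'])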

results = []
predictions = {}

for name, model in models.items():
    print(f"\n🚀 Training {name} ...")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results.append(evaluate_model(name, model, y_test, y_pred))
    # Note: X_final includes the training rows, so about 80% of these predictions are in-sample
    predictions[name] = model.predict(X_final)
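
# ---------------------------------------------------------------
# Illustrative sketch (not called by the pipeline): predicting on X_final scores
# rows the model was trained on. Out-of-fold predictions give every row a
# prediction from a model that never saw it. The helper name
# `out_of_fold_predictions` is hypothetical; cross_val_predict and KFold are
# scikit-learn's.
# ---------------------------------------------------------------
from sklearn.model_selection import KFold, cross_val_predict

def out_of_fold_predictions(estimator, X, y, n_splits=5, seed=42):
    """Return one held-out prediction per row using shuffled K-fold CV."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_predict(estimator, X, y, cv=cv)

# Possible usage with the objects defined in this cell:
# df["Predicted Risk (RF, out-of-fold)"] = out_of_fold_predictions(models["Random Forest"], X_final, y)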

# ===============================================================
# STEP 6: SAVE PREDICTIONS (READABLE JOB TITLES)
# ===============================================================
df["Predicted Risk (RF)"] = predictions["Random Forest"]
df["Predicted Risk (XGB)"] = predictions["XGBoost"]

# Risk Category (human-readable)
def categorize_risk(x):
    """Map a risk percentage to Low (<30), Medium (30 to <70), or High (>=70)."""
    if x < 30:
        return "Low"
    if x < 70:
        return "Medium"
    return "High"

df["Predicted Risk Category"] = df["Predicted Risk (RF)"].apply(categorize_risk)
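
# ---------------------------------------------------------------
# Illustrative sketch (not called by the pipeline): the same thresholds applied
# in one vectorized step with pandas.cut instead of a row-wise apply. The helper
# name `categorize_risk_vectorized` is hypothetical.
# ---------------------------------------------------------------
import numpy as np
import pandas as pd

def categorize_risk_vectorized(risk):
    """Bin risk percentages into Low (<30), Medium (30 to <70), High (>=70)."""
    return pd.cut(
        pd.Series(risk, dtype="float64"),
        bins=[-np.inf, 30, 70, np.inf],
        labels=["Low", "Medium", "High"],
        right=False,  # left-closed bins: [30, 70) is Medium, matching `x < 70`
    )

# Possible usage: categorize_risk_vectorized(df["Predicted Risk (RF)"])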

output_path = r"G:\Project Ai and Future\Data\Predicted_Job_Automation_Risk.csv"
df.to_csv(output_path, index=False)
print(f"\n📁 Saved: {output_path}")

# ===============================================================
# STEP 7: INDUSTRY & JOB TITLE SUMMARY
# ===============================================================
summary = df.groupby(["Industry", "Job Title"]).agg({
    "Predicted Risk (RF)": "mean",
    "Predicted Risk Category": lambda x: x.value_counts().index[0],
    "Median Salary (USD)": "mean",
    "Experience Required (Years)": "mean",
    "Remote Work Ratio (%)": "mean",
    "Gender Diversity (%)": "mean"
}).reset_index()

summary.rename(columns={"Predicted Risk (RF)": "Avg Predicted Risk (%)"}, inplace=True)
summary_path = r"G:\Project Ai and Future\Data\Industry_Job_Risk_Summary.csv"
summary.to_csv(summary_path, index=False)

print("\n📊 Saved: Industry_Job_Risk_Summary.csv — Average Automation Risk by Job & Industry")

# ===============================================================
# STEP 8: SHOW TOP 15 JOBS AT HIGHEST RISK
# ===============================================================
top_risky = summary.sort_values("Avg Predicted Risk (%)", ascending=False).head(15)
print("\n🔥 Top 15 Jobs Most at Risk of Automation:\n")
print(top_risky[["Job Title", "Industry", "Avg Predicted Risk (%)", "Predicted Risk Category"]])


✅ Dataset Loaded: (30000, 18) rows
🔹 Encoding textual job info using BERT...



🚀 Training Random Forest ...
Random Forest → R²=0.9964 | RMSE=1.7441 | MAE=1.3701

🚀 Training XGBoost ...
XGBoost → R²=0.9972 | RMSE=1.5514 | MAE=1.2143

📁 Saved: G:\Project Ai and Future\Data\Predicted_Job_Automation_Risk.csv

📊 Saved: Industry_Job_Risk_Summary.csv — Average Automation Risk by Job & Industry

🔥 Top 15 Jobs Most at Risk of Automation:

                                         Job Title        Industry  \
2045                   Conservator, museum/gallery      Healthcare   
1735  Production designer, theatre/television/film         Finance   
500                         Recruitment consultant       Education   
2628                               Careers adviser              IT   
4915                               Product manager  Transportation   
1021                               Mining engineer   Entertainment   
1768             Radiation protection practitioner         Finance   
2308                              Network engineer      Healthcare   
4964          