## Transform ##

In the previous step I carried out a quality check on the data and found several issues that need to be addressed:
- Unwanted columns need to be removed.
- Some rows are duplicated.
- Some columns contain null values.
- The date column is stored with the wrong data type.
- The data in the salary columns uses different scales.
- Some job listings don't contain values for salary.



In [164]:
import pandas as pd
df_raw = pd.read_csv('reed_api_data.csv', index_col=0)
df_raw.iloc[0]

jobId                                                           54047135
employerId                                                        409522
employerName                                                         WTW
employerProfileId                                                    NaN
employerProfileName                                                  NaN
jobTitle                                           Senior Data Scientist
locationName                                                      London
minimumSalary                                                        NaN
maximumSalary                                                        NaN
currency                                                             NaN
expirationDate                                                30/12/2024
date                                                          18/11/2024
jobDescription         We are looking for a Data Scientist, with expe...
applications                                       

First, I will drop the employerProfileId and employerProfileName columns, as these don't contain any useful data. I will also remove the locationName column, as it contains a mixture of postcodes and city names, which would be difficult to analyse. Instead, I will use the city column, which was created when the data was extracted and contains only city names.

In [165]:
df_raw = df_raw.drop(columns=["employerProfileId", "employerProfileName", "locationName"])
df_raw.iloc[0]

jobId                                                      54047135
employerId                                                   409522
employerName                                                    WTW
jobTitle                                      Senior Data Scientist
minimumSalary                                                   NaN
maximumSalary                                                   NaN
currency                                                        NaN
expirationDate                                           30/12/2024
date                                                     18/11/2024
jobDescription    We are looking for a Data Scientist, with expe...
applications                                                     14
jobUrl            https://www.reed.co.uk/jobs/senior-data-scient...
city                                                         London
Name: 0, dtype: object

I will now convert the date column to datetime format. This will help with sorting and analysing the data later on.

In [166]:
df_raw.dtypes

jobId               int64
employerId          int64
employerName       object
jobTitle           object
minimumSalary     float64
maximumSalary     float64
currency           object
expirationDate     object
date               object
jobDescription     object
applications        int64
jobUrl             object
city               object
dtype: object

In [167]:
print(df_raw['date'].max())
df_raw['date'].dtype

31/10/2024


dtype('O')
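
The maximum printed above, 31/10/2024, is earlier than other dates in the data (the first row is dated 18/11/2024). This is because an object column is compared as text, character by character. A minimal sketch, using a toy series rather than the real dataframe, shows the difference that parsing makes:

```python
import pandas as pd

# Toy dates in DD/MM/YYYY format, chosen for illustration only
dates = pd.Series(["18/11/2024", "31/10/2024", "03/11/2020"])

# As strings, comparison is lexicographic, so "31/..." sorts above "18/..."
string_max = dates.max()

# After parsing with dayfirst=True, comparison is chronological
parsed = pd.to_datetime(dates, dayfirst=True)
datetime_max = parsed.max()

print(string_max)    # 31/10/2024
print(datetime_max)  # 2024-11-18 00:00:00
```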

In [168]:
df_raw['date'] = pd.to_datetime(df_raw['date'], dayfirst=True)
df_raw.dtypes

jobId                      int64
employerId                 int64
employerName              object
jobTitle                  object
minimumSalary            float64
maximumSalary            float64
currency                  object
expirationDate            object
date              datetime64[ns]
jobDescription            object
applications               int64
jobUrl                    object
city                      object
dtype: object

In [169]:
df_raw.head()

Unnamed: 0,jobId,employerId,employerName,jobTitle,minimumSalary,maximumSalary,currency,expirationDate,date,jobDescription,applications,jobUrl,city
0,54047135,409522,WTW,Senior Data Scientist,,,,30/12/2024,2024-11-18,"We are looking for a Data Scientist, with expe...",14,https://www.reed.co.uk/jobs/senior-data-scient...,London
1,53989684,501640,Vitality,Lead Data Scientist,,,,06/12/2024,2024-11-08,About The Role Team – &nbsp;Data Science Worki...,29,https://www.reed.co.uk/jobs/lead-data-scientis...,London
2,54032986,543104,Jobheron,Data Scientist,40000.0,55000.0,GBP,27/12/2024,2024-11-15,"A Data Scientist, who must have a PhD&nbsp; qu...",55,https://www.reed.co.uk/jobs/data-scientist/540...,London
3,53929241,472032,Proactive Appointments,Data Scientist,,,,10/12/2024,2024-10-29,Data Scientist -&nbsp; Remote Working Data Sci...,245,https://www.reed.co.uk/jobs/data-scientist/539...,London
4,54054640,524441,INTEC SELECT LIMITED,Data Scientist,450.0,500.0,GBP,01/01/2025,2024-11-20,Data Scientist – 450-500pd PAYE – 7 month cont...,27,https://www.reed.co.uk/jobs/data-scientist/540...,London


In [170]:
print(df_raw['date'].max())
print(df_raw['date'].min())

2024-11-21 00:00:00
2020-11-03 00:00:00


In [171]:
df_raw[df_raw['date'] == df_raw['date'].min()]

Unnamed: 0,jobId,employerId,employerName,jobTitle,minimumSalary,maximumSalary,currency,expirationDate,date,jobDescription,applications,jobUrl,city
3217,41308637,1990,Gregory Martin International Limited,Cost Consultant,35000.0,65000.0,GBP,29/11/2024,2020-11-03,Cost Consultant / Cost Engineer Our client is ...,60,https://www.reed.co.uk/jobs/cost-consultant/41...,Southampton


Now that the date column is in datetime format, I will sort the data by date, newest first.

In [172]:
df_raw = df_raw.sort_values(by='date', ascending=False)
df_raw = df_raw.reset_index(drop=True)
df_raw.head(10)

Unnamed: 0,jobId,employerId,employerName,jobTitle,minimumSalary,maximumSalary,currency,expirationDate,date,jobDescription,applications,jobUrl,city
0,54061747,470824,Eames Consulting,Senior Backend Python Developer - FastAPI / Te...,650.0,750.0,GBP,02/01/2025,2024-11-21,Senior Backend Python Developer - Python Stack...,0,https://www.reed.co.uk/jobs/senior-backend-pyt...,London
1,54062533,390934,Jonathan Lee Recruitment,Systems Developer,40000.0,45000.0,GBP,02/01/2025,2024-11-21,**Unlock Your Potential as a Systems Developer...,1,https://www.reed.co.uk/jobs/systems-developer/...,Wolverhampton
2,54061875,121426,Henderson Scott,Software Project Lead - Low Level,65000.0,75000.0,GBP,02/01/2025,2024-11-21,Software Project Lead Location: Stevenage (Rel...,1,https://www.reed.co.uk/jobs/software-project-l...,Luton
3,54064282,106910,SF Recruitment,IT Helpdesk Engineer,28000.0,32000.0,GBP,02/01/2025,2024-11-21,IT Support Technician Location: Coleshill &amp...,0,https://www.reed.co.uk/jobs/it-helpdesk-engine...,Birmingham
4,54061731,10470,E Personnel Recruitment,Systems Engineer,35000.0,39001.0,GBP,02/01/2025,2024-11-21,SYSTEMS ENGINEER - BASED IN EPSOM - KT18 5AP -...,0,https://www.reed.co.uk/jobs/systems-engineer/5...,London
5,54062999,563926,Akkodis,Mid level .net web developer,40000.0,50000.0,GBP,02/01/2025,2024-11-21,C# Software Developer Leicester /Hybrid Role O...,1,https://www.reed.co.uk/jobs/mid-level-net-web-...,Leicester
6,54062921,563926,Akkodis,Senior C# Developer Microsoft Developer Role L...,50000.0,60000.0,GBP,02/01/2025,2024-11-21,Senior C# Developer Microsoft Developer Role L...,0,https://www.reed.co.uk/jobs/senior-c-developer...,Leicester
7,54061255,391063,Opus Recruitment Solutions Ltd,AWS DevOps Engineer AI Integration Project O...,500.0,550.0,GBP,02/01/2025,2024-11-21,I am currently supporting a FinTech client tha...,21,https://www.reed.co.uk/jobs/aws-devops-enginee...,Wolverhampton
8,54064282,106910,SF Recruitment,IT Helpdesk Engineer,28000.0,32000.0,GBP,02/01/2025,2024-11-21,IT Support Technician Location: Coleshill &amp...,0,https://www.reed.co.uk/jobs/it-helpdesk-engine...,Wolverhampton
9,54062843,2030,Futures Manufacturing,Electronic Systems Engineer,40000.0,60000.0,GBP,02/01/2025,2024-11-21,Do you have experience of systems integration ...,0,https://www.reed.co.uk/jobs/electronic-systems...,Sheffield


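I will use the duplicated method on the jobId column to check for duplicate rows of data. Passing keep=False marks every occurrence of a repeated jobId, not just the later copies.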

In [173]:
duplicated_rows = df_raw[df_raw.duplicated(subset=['jobId'], keep=False)]
duplicated_rows

Unnamed: 0,jobId,employerId,employerName,jobTitle,minimumSalary,maximumSalary,currency,expirationDate,date,jobDescription,applications,jobUrl,city
0,54061747,470824,Eames Consulting,Senior Backend Python Developer - FastAPI / Te...,650.0,750.0,GBP,02/01/2025,2024-11-21,Senior Backend Python Developer - Python Stack...,0,https://www.reed.co.uk/jobs/senior-backend-pyt...,London
1,54062533,390934,Jonathan Lee Recruitment,Systems Developer,40000.0,45000.0,GBP,02/01/2025,2024-11-21,**Unlock Your Potential as a Systems Developer...,1,https://www.reed.co.uk/jobs/systems-developer/...,Wolverhampton
2,54061875,121426,Henderson Scott,Software Project Lead - Low Level,65000.0,75000.0,GBP,02/01/2025,2024-11-21,Software Project Lead Location: Stevenage (Rel...,1,https://www.reed.co.uk/jobs/software-project-l...,Luton
3,54064282,106910,SF Recruitment,IT Helpdesk Engineer,28000.0,32000.0,GBP,02/01/2025,2024-11-21,IT Support Technician Location: Coleshill &amp...,0,https://www.reed.co.uk/jobs/it-helpdesk-engine...,Birmingham
5,54062999,563926,Akkodis,Mid level .net web developer,40000.0,50000.0,GBP,02/01/2025,2024-11-21,C# Software Developer Leicester /Hybrid Role O...,1,https://www.reed.co.uk/jobs/mid-level-net-web-...,Leicester
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9988,50056290,471259,MBDA,Algorithms Engineer,,,,26/11/2024,2023-03-21,"Stevenage As an Algorithms Engineer, you will ...",34,https://www.reed.co.uk/jobs/algorithms-enginee...,Luton
9989,49897629,1990,Gregory Martin International Limited,Senior Analyst Modeller,40000.0,70000.0,GBP,31/12/2024,2023-02-28,"Senior Analyst - Operational Analysis, Python,...",75,https://www.reed.co.uk/jobs/senior-analyst-mod...,Southampton
9990,49897629,1990,Gregory Martin International Limited,Senior Analyst Modeller,40000.0,70000.0,GBP,31/12/2024,2023-02-28,"Senior Analyst - Operational Analysis, Python,...",75,https://www.reed.co.uk/jobs/senior-analyst-mod...,Southampton
9993,45901240,582327,ITOL Recruitment,Cyber Security Trainee,24000.0,37000.0,GBP,04/12/2024,2022-02-25,Cyber Security Placement Programme - No Experi...,200,https://www.reed.co.uk/jobs/cyber-security-tra...,Coventry


We can see that some jobs have been duplicated multiple times.

In [174]:
duplicated_job = df_raw[df_raw['jobId'] == 54032986]
duplicated_job

Unnamed: 0,jobId,employerId,employerName,jobTitle,minimumSalary,maximumSalary,currency,expirationDate,date,jobDescription,applications,jobUrl,city
1747,54032986,543104,Jobheron,Data Scientist,40000.0,55000.0,GBP,27/12/2024,2024-11-15,"A Data Scientist, who must have a PhD&nbsp; qu...",55,https://www.reed.co.uk/jobs/data-scientist/540...,London
1752,54032986,543104,Jobheron,Data Scientist,40000.0,55000.0,GBP,27/12/2024,2024-11-15,"A Data Scientist, who must have a PhD&nbsp; qu...",55,https://www.reed.co.uk/jobs/data-scientist/540...,London
1757,54032986,543104,Jobheron,Data Scientist,40000.0,55000.0,GBP,27/12/2024,2024-11-15,"A Data Scientist, who must have a PhD&nbsp; qu...",55,https://www.reed.co.uk/jobs/data-scientist/540...,London
1879,54032986,543104,Jobheron,Data Scientist,40000.0,55000.0,GBP,27/12/2024,2024-11-15,"A Data Scientist, who must have a PhD&nbsp; qu...",55,https://www.reed.co.uk/jobs/data-scientist/540...,London
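
To see how often each job is repeated without inspecting one jobId at a time, the rows per jobId can be counted. This is only a sketch on a toy frame with invented ids; on the real data the same call would be made on df_raw['jobId'].

```python
import pandas as pd

# Toy frame standing in for df_raw; the jobIds are invented
toy = pd.DataFrame({"jobId": [101, 101, 101, 102, 103, 103]})

# Number of rows per jobId, most frequent first
counts = toy["jobId"].value_counts()

# Keep only the ids that occur more than once
repeated = counts[counts > 1]
print(repeated)
```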


I will now remove rows that are exact duplicates of another row from the dataframe.

In [175]:
df = df_raw.drop_duplicates()
print("df_raw length:", len(df_raw))
print("df length:", len(df))

df_raw length: 9997
df length: 5651
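
Called without arguments, drop_duplicates compares every column, so a job listed under two different cities keeps both rows even though the jobId is the same. The sketch below uses invented rows to contrast this with de-duplicating on jobId alone. Which behaviour is wanted depends on the analysis.

```python
import pandas as pd

# Toy rows, invented for illustration: job 1 appears twice identically,
# job 2 appears once per city
toy = pd.DataFrame({
    "jobId": [1, 1, 2, 2],
    "city": ["London", "London", "Derby", "Stoke-on-Trent"],
})

full_row = toy.drop_duplicates()                   # compares every column
by_job_id = toy.drop_duplicates(subset=["jobId"])  # compares jobId only

print(len(full_row))   # 3
print(len(by_job_id))  # 2
```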


Some jobs have also been posted multiple times. I will treat rows with identical values in the jobDescription and city columns as the same job and remove the repeats, keeping the entry with the oldest date.

In [176]:
df = df.sort_values(by=['jobDescription', 'city', 'date'], ascending=[True, True, True])
df = df.drop_duplicates(subset=['jobDescription', 'city'], keep='first')
len(df)


4915
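
The sort-then-drop pattern relies on keep='first' retaining the first row of each group after sorting by date in ascending order. A small sketch on invented rows checks that the oldest posting is the one that survives:

```python
import pandas as pd

# Invented rows: the same description posted twice in Leeds and once in York
toy = pd.DataFrame({
    "jobDescription": ["Analyst role", "Analyst role", "Analyst role"],
    "city": ["Leeds", "Leeds", "York"],
    "date": pd.to_datetime(["2024-11-10", "2024-10-30", "2024-11-01"]),
})

# Ascending sort puts the oldest date first within each description/city group
toy = toy.sort_values(by=["jobDescription", "city", "date"])

# keep="first" then retains that oldest row
oldest = toy.drop_duplicates(subset=["jobDescription", "city"], keep="first")
print(oldest[["city", "date"]])
```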

The minimumSalary and maximumSalary columns mix different scales: some values appear to be yearly salaries, while others are daily rates.

In [177]:
df['maximumSalary'].value_counts()

maximumSalary
50000.00    373
45000.00    339
60000.00    277
40000.00    238
65000.00    224
           ... 
36698.00      1
14.40         1
11.50         1
1200.00       1
20.19         1
Name: count, Length: 341, dtype: int64
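
One possible way to separate the scales is to bucket values by size. The sketch below is an assumption rather than the method used in this project: the cut-offs (100 and 2000) and the "hourly" label are guesses made by eye from values such as 14.40, 1200 and 50000 in the output above, and would need checking against the job descriptions.

```python
import pandas as pd

# A few maximumSalary values of the kinds seen in the output above
salaries = pd.Series([50000.0, 1200.0, 14.40, 550.0, 36698.0])

# Hypothetical cut-offs, chosen by eye and not validated against the data
bins = [0, 100, 2000, float("inf")]
labels = ["hourly", "daily", "annual"]

# Each value is assigned to the interval it falls in: (0, 100], (100, 2000], (2000, inf)
scale = pd.cut(salaries, bins=bins, labels=labels)
print(scale.tolist())  # ['annual', 'daily', 'hourly', 'daily', 'annual']
```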

In [178]:
top_salaries = df.sort_values(by='maximumSalary', ascending=False)[['jobId', 'maximumSalary']]
top_salaries.head(10)

Unnamed: 0,jobId,maximumSalary
1624,54044633,960000.0
1936,54026785,850000.0
2262,54017901,720001.0
7627,53869686,500000.0
2318,54017540,450000.0
2452,54017540,450000.0
838,54051404,437879.0
6488,53902297,300000.0
5337,53935002,250000.0
5266,53935002,250000.0


In [179]:
max_salary = df[df['jobId'] == 54044633]
max_salary

Unnamed: 0,jobId,employerId,employerName,jobTitle,minimumSalary,maximumSalary,currency,expirationDate,date,jobDescription,applications,jobUrl,city
1624,54044633,633103,Crimson,Solution Architect - ERP,840000.0,960000.0,GBP,30/12/2024,2024-11-18,Solution Architect - ERP Hybrid x2-3 days per ...,7,https://www.reed.co.uk/jobs/solution-architect...,Manchester


The job with the highest value for maximumSalary shows 960,000, which seems incredibly high. Looking further at the jobDescription, we see "70-80k" mentioned, which shows that the value in the maximumSalary column has been entered incorrectly.
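
The manual check against the description can be sketched with a regular expression. The description string below is hypothetical apart from the "70-80k" fragment, and the pattern only handles ranges written in that exact style. Comparing the extracted figures with the stored ones shows that both stored values are exactly 12 times the advertised ones, which may point to an annual figure being treated as a monthly one when the listing was entered.

```python
import re

# Hypothetical description text; only the "70-80k" fragment comes from the listing
description = "Solution Architect - ERP, hybrid, 70-80k plus benefits"

# Match ranges such as "70-80k" and convert both ends to pounds
match = re.search(r"(\d+)\s*-\s*(\d+)\s*k\b", description, flags=re.IGNORECASE)
low, high = (int(g) * 1000 for g in match.groups())

# Values stored in minimumSalary and maximumSalary for this job
stored_min, stored_max = 840000.0, 960000.0

print(low, high)                            # 70000 80000
print(stored_min / low, stored_max / high)  # 12.0 12.0
```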

In [180]:
highest_paid = df[df['maximumSalary'] > 100000].sort_values(by='maximumSalary', ascending=False)
len(highest_paid)

140

In [181]:
highest_paid.head(20)

Unnamed: 0,jobId,employerId,employerName,jobTitle,minimumSalary,maximumSalary,currency,expirationDate,date,jobDescription,applications,jobUrl,city
1624,54044633,633103,Crimson,Solution Architect - ERP,840000.0,960000.0,GBP,30/12/2024,2024-11-18,Solution Architect - ERP Hybrid x2-3 days per ...,7,https://www.reed.co.uk/jobs/solution-architect...,Manchester
1936,54026785,412685,Nigel Frank International,D365 CE Technical Lead,70000.0,850000.0,GBP,26/12/2024,2024-11-14,Job Description An excellent opportunity to wo...,4,https://www.reed.co.uk/jobs/d365-ce-technical-...,Leeds
2262,54017901,121426,Henderson Scott,Lead Platform Engineer,720000.0,720001.0,GBP,25/12/2024,2024-11-13,Lead Platform Engineer - Hampshire (Hybrid) - ...,10,https://www.reed.co.uk/jobs/lead-platform-engi...,Southampton
7627,53869686,409660,Huxley,FX Software Engineering Manager,100000.0,500000.0,GBP,28/11/2024,2024-10-17,FX Software Engineering Manager C#.NET Finance...,21,https://www.reed.co.uk/jobs/fx-software-engine...,London
2318,54017540,472689,Page Personnel Finance,FP&A Analyst hybrid,40000.0,450000.0,GBP,25/12/2024,2024-11-13,Fabulous opportunity for someone who has a rea...,11,https://www.reed.co.uk/jobs/fp-a-analyst-hybri...,Derby
2452,54017540,472689,Page Personnel Finance,FP&A Analyst hybrid,40000.0,450000.0,GBP,25/12/2024,2024-11-13,Fabulous opportunity for someone who has a rea...,11,https://www.reed.co.uk/jobs/fp-a-analyst-hybri...,Stoke-on-Trent
838,54051404,520034,Sanderson,Business Analyst,35029.0,437879.0,GBP,31/12/2024,2024-11-19,Business Analyst Who are Diligenta? Diligenta'...,17,https://www.reed.co.uk/jobs/business-analyst/5...,Glasgow
6488,53902297,300264,Client Server Ltd.,C++ Developer - Template Metaprogramming,150000.0,300000.0,GBP,21/11/2024,2024-10-24,C Developer / Software Engineer (TMP C 20 / 23...,21,https://www.reed.co.uk/jobs/c-developer-templa...,London
5266,53935002,634813,Ortus PSR,Financial Planner,100000.0,250000.0,GBP,11/12/2024,2024-10-30,Join an Elite Wealth Management Team as a Fina...,3,https://www.reed.co.uk/jobs/financial-planner/...,Leeds
5337,53935002,634813,Ortus PSR,Financial Planner,100000.0,250000.0,GBP,11/12/2024,2024-10-30,Join an Elite Wealth Management Team as a Fina...,3,https://www.reed.co.uk/jobs/financial-planner/...,York
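
Several rows in this table have a maximum that is far larger than the minimum (for example 70,000 to 850,000), which suggests an extra digit in the maximum. A rough sketch of one way to flag such rows for manual review follows. The 10x threshold is a hypothetical choice, not a rule taken from the data.

```python
import pandas as pd

# minimumSalary / maximumSalary pairs copied from the table above
pairs = pd.DataFrame({
    "minimumSalary": [840000.0, 70000.0, 720000.0, 100000.0, 40000.0, 35029.0],
    "maximumSalary": [960000.0, 850000.0, 720001.0, 500000.0, 450000.0, 437879.0],
})

# Hypothetical rule: a maximum more than 10x the minimum is worth a manual look
pairs["ratio"] = pairs["maximumSalary"] / pairs["minimumSalary"]
suspicious = pairs[pairs["ratio"] > 10]
print(suspicious)
```

This rule would not catch listings where both bounds are inflated together, such as the 840,000 to 960,000 entry, so it can only complement a check on absolute values.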
