## Transform ##

Now that the data is saved in a pandas dataframe, I can start to transform it. My first task is to check for and remove any duplicated data, as I suspect some of the searches returned the same job postings.

I will use the duplicated method to find rows that share a jobId with another row.

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('reed_api_data.csv', index_col=0)
df.iloc[0]

jobId                                                           54047135
employerId                                                        409522
employerName                                                         WTW
employerProfileId                                                    NaN
employerProfileName                                                  NaN
jobTitle                                           Senior Data Scientist
locationName                                                      London
minimumSalary                                                        NaN
maximumSalary                                                        NaN
currency                                                             NaN
expirationDate                                                30/12/2024
date                                                          18/11/2024
jobDescription         We are looking for a Data Scientist, with expe...
applications                                                          14
jobUrl                 https://www.reed.co.uk/jobs/senior-data-scient...
city                                                              London
Name: 0, dtype: object

In [4]:
duplicated_rows = df[df.duplicated(subset=['jobId'], keep=False)]
duplicated_rows

Unnamed: 0,jobId,employerId,employerName,employerProfileId,employerProfileName,jobTitle,locationName,minimumSalary,maximumSalary,currency,expirationDate,date,jobDescription,applications,jobUrl,city
2,54032986,543104,Jobheron,,,Data Scientist,London,40000.0,55000.0,GBP,27/12/2024,15/11/2024,"A Data Scientist, who must have a PhD&nbsp; qu...",55,https://www.reed.co.uk/jobs/data-scientist/540...,London
4,54054640,524441,INTEC SELECT LIMITED,,,Data Scientist,EC3N2EX,450.0,500.0,GBP,01/01/2025,20/11/2024,Data Scientist – 450-500pd PAYE – 7 month cont...,27,https://www.reed.co.uk/jobs/data-scientist/540...,London
6,54056851,400289,Huntress,,,Data Scientist,London,50.0,55.0,GBP,27/11/2024,20/11/2024,Data Scientist- London/Remote- 12 Months- 50- ...,31,https://www.reed.co.uk/jobs/data-scientist/540...,London
8,54064151,429392,Searchability,,,Data Scientist,London,50000.0,60000.0,GBP,02/01/2025,21/11/2024,Data Scientist Opportunity within a Tech4Good ...,1,https://www.reed.co.uk/jobs/data-scientist/540...,London
12,53917216,377106,Robert Walters,,,Data Scientist,London,60000.0,80000.0,GBP,09/12/2024,28/10/2024,Our client is in the midst of a crucial growth...,159,https://www.reed.co.uk/jobs/data-scientist/539...,London
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9991,54049736,300264,Client Server Ltd.,,,Senior C# Developer .Net API AWS - Games,SR12JR,65000.0,80000.0,GBP,17/12/2024,19/11/2024,Senior C# Developer / Software Engineer (C# .N...,8,https://www.reed.co.uk/jobs/senior-c-developer...,Newcastle
9992,54030987,375315,Spectrum IT Recruitment,,,Senior PHP Software Engineer,NE11AD,50000.0,60000.0,GBP,29/11/2024,15/11/2024,We're looking for a dynamic Senior PHP Softwar...,19,https://www.reed.co.uk/jobs/senior-php-softwar...,Newcastle
9993,53905046,300264,Client Server Ltd.,,,Senior Software Developer C# .Net API AWS,SR12JR,70000.0,80000.0,GBP,21/11/2024,24/10/2024,Senior Software Developer / Engineer (C# .Net ...,18,https://www.reed.co.uk/jobs/senior-software-de...,Newcastle
9994,53987043,300264,Client Server Ltd.,,,Senior Software Developer C# .Net API AWS,SR12JR,65000.0,80000.0,GBP,06/12/2024,08/11/2024,Senior Software Developer / Engineer (C# .Net ...,12,https://www.reed.co.uk/jobs/senior-software-de...,Newcastle
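
The keep=False argument used above marks every row in a duplicate group, whereas the default keep='first' leaves the first occurrence unmarked. A minimal sketch of the difference, using a small made-up dataframe (the values are invented for illustration and are not from the Reed data):

```python
import pandas as pd

# Toy data, invented for illustration: jobId 1 appears three times.
toy = pd.DataFrame({
    "jobId": [1, 2, 1, 3, 1],
    "city": ["London", "Leeds", "Leeds", "York", "York"],
})

# Default keep="first": the first occurrence of each jobId is NOT flagged.
first_mask = toy.duplicated(subset=["jobId"])

# keep=False: every row belonging to a duplicate group is flagged.
all_mask = toy.duplicated(subset=["jobId"], keep=False)

print(first_mask.tolist())
print(all_mask.tolist())
```

Using keep=False is convenient for inspection because it shows all copies of a posting side by side.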


We can see that some jobs have been duplicated multiple times. For example, jobId 54032986 appears four times:

In [5]:
duplicated_job = df[df['jobId'] == 54032986]
duplicated_job

Unnamed: 0,jobId,employerId,employerName,employerProfileId,employerProfileName,jobTitle,locationName,minimumSalary,maximumSalary,currency,expirationDate,date,jobDescription,applications,jobUrl,city
2,54032986,543104,Jobheron,,,Data Scientist,London,40000.0,55000.0,GBP,27/12/2024,15/11/2024,"A Data Scientist, who must have a PhD&nbsp; qu...",55,https://www.reed.co.uk/jobs/data-scientist/540...,London
2193,54032986,543104,Jobheron,,,Data Scientist,London,40000.0,55000.0,GBP,27/12/2024,15/11/2024,"A Data Scientist, who must have a PhD&nbsp; qu...",55,https://www.reed.co.uk/jobs/data-scientist/540...,London
9097,54032986,543104,Jobheron,,,Data Scientist,London,40000.0,55000.0,GBP,27/12/2024,15/11/2024,"A Data Scientist, who must have a PhD&nbsp; qu...",55,https://www.reed.co.uk/jobs/data-scientist/540...,London
9219,54032986,543104,Jobheron,,,Data Scientist,London,40000.0,55000.0,GBP,27/12/2024,15/11/2024,"A Data Scientist, who must have a PhD&nbsp; qu...",55,https://www.reed.co.uk/jobs/data-scientist/540...,London
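
Rather than looking up one jobId at a time, the number of copies of each posting could be counted in one step. A sketch of that idea with value_counts, again on made-up IDs rather than the real dataset:

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({"jobId": [10, 20, 10, 30, 10, 20]})

# Count how many rows share each jobId (sorted by count, descending).
counts = toy["jobId"].value_counts()

# Keep only the IDs that occur more than once.
repeated = counts[counts > 1]
print(repeated)
```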


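I will now remove the duplicated jobs from the dataframe. Called without arguments, drop_duplicates removes rows that are identical in every column and keeps the first occurrence of each.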

In [6]:
df_no_duplicates = df.drop_duplicates()
print("df length:", len(df))
print("df_no_duplicates length:", len(df_no_duplicates))

df length: 9997
df_no_duplicates length: 5651
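
Because drop_duplicates with no arguments compares whole rows, two rows that share a jobId but differ in any other column would both be kept. Whether that happens in this dataset is not shown here, so the following is only a sketch, on made-up data, of how the two approaches differ and how uniqueness of jobId could be checked afterwards:

```python
import pandas as pd

# Toy data, invented for illustration: jobId 1 appears under two cities.
toy = pd.DataFrame({
    "jobId": [1, 1, 1, 2],
    "city": ["London", "London", "Leeds", "Leeds"],
})

# Whole-row comparison: only the exact repeat (1, London) is removed.
full_row = toy.drop_duplicates()

# Comparison on jobId only: one row per posting is kept.
by_id = toy.drop_duplicates(subset=["jobId"], keep="first")

print(len(full_row), full_row["jobId"].is_unique)
print(len(by_id), by_id["jobId"].is_unique)
```

If is_unique returns False after a whole-row drop, passing subset=['jobId'] is one way to keep a single row per posting.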


I will drop the columns I don't need to reduce the amount of data I am working with.

In [7]:
df_no_duplicates = df_no_duplicates.drop(columns=["employerProfileId", "employerProfileName", "locationName"])
df_no_duplicates.iloc[0]

jobId                                                      54047135
employerId                                                   409522
employerName                                                    WTW
jobTitle                                      Senior Data Scientist
minimumSalary                                                   NaN
maximumSalary                                                   NaN
currency                                                        NaN
expirationDate                                           30/12/2024
date                                                     18/11/2024
jobDescription    We are looking for a Data Scientist, with expe...
applications                                                     14
jobUrl            https://www.reed.co.uk/jobs/senior-data-scient...
city                                                         London
Name: 0, dtype: object

To analyse the data by posting date, the date column needs to be converted from a string to a datetime object.

In [9]:
df_no_duplicates.dtypes

jobId               int64
employerId          int64
employerName       object
jobTitle           object
minimumSalary     float64
maximumSalary     float64
currency           object
expirationDate     object
date               object
jobDescription     object
applications        int64
jobUrl             object
city               object
dtype: object

In [15]:
print(df_no_duplicates['date'].max())
df_no_duplicates['date'].dtype

2024-11-21 00:00:00


dtype('<M8[ns]')

In [10]:
df_no_duplicates['date'] = pd.to_datetime(df_no_duplicates['date'], dayfirst=True)
df_no_duplicates.dtypes

jobId                      int64
employerId                 int64
employerName              object
jobTitle                  object
minimumSalary            float64
maximumSalary            float64
currency                  object
expirationDate            object
date              datetime64[ns]
jobDescription            object
applications               int64
jobUrl                    object
city                      object
dtype: object
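
The dates here are in day/month/year order, which is why dayfirst=True matters: without it, a value such as 03/11/2020 could be read as 11 March. An alternative is to state the format explicitly. The sketch below uses a few made-up strings, and errors="coerce" is an optional extra that turns unparseable values into NaT instead of raising:

```python
import pandas as pd

# Toy strings, invented for illustration; the last one is deliberately invalid.
dates = pd.Series(["18/11/2024", "03/11/2020", "not a date"])

# An explicit format removes any day/month ambiguity.
parsed = pd.to_datetime(dates, format="%d/%m/%Y", errors="coerce")

print(parsed)
print("unparseable values:", parsed.isna().sum())
```

The expirationDate column is still stored as object in the output above, so the same conversion could be applied to it if it is needed for analysis.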

In [11]:
df_no_duplicates[:4]

Unnamed: 0,jobId,employerId,employerName,jobTitle,minimumSalary,maximumSalary,currency,expirationDate,date,jobDescription,applications,jobUrl,city
0,54047135,409522,WTW,Senior Data Scientist,,,,30/12/2024,2024-11-18,"We are looking for a Data Scientist, with expe...",14,https://www.reed.co.uk/jobs/senior-data-scient...,London
1,53989684,501640,Vitality,Lead Data Scientist,,,,06/12/2024,2024-11-08,About The Role Team – &nbsp;Data Science Worki...,29,https://www.reed.co.uk/jobs/lead-data-scientis...,London
2,54032986,543104,Jobheron,Data Scientist,40000.0,55000.0,GBP,27/12/2024,2024-11-15,"A Data Scientist, who must have a PhD&nbsp; qu...",55,https://www.reed.co.uk/jobs/data-scientist/540...,London
3,53929241,472032,Proactive Appointments,Data Scientist,,,,10/12/2024,2024-10-29,Data Scientist -&nbsp; Remote Working Data Sci...,245,https://www.reed.co.uk/jobs/data-scientist/539...,London


In [18]:
print(df_no_duplicates['date'].max())
print(df_no_duplicates['date'].min())

2024-11-21 00:00:00
2020-11-03 00:00:00


In [17]:
df_no_duplicates[df_no_duplicates['date'] == df_no_duplicates['date'].min()]

Unnamed: 0,jobId,employerId,employerName,jobTitle,minimumSalary,maximumSalary,currency,expirationDate,date,jobDescription,applications,jobUrl,city
3217,41308637,1990,Gregory Martin International Limited,Cost Consultant,35000.0,65000.0,GBP,29/11/2024,2020-11-03,Cost Consultant / Cost Engineer Our client is ...,60,https://www.reed.co.uk/jobs/cost-consultant/41...,Southampton
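
The earliest posting is dated November 2020, roughly four years before the most recent one, so it may be worth checking how postings are spread over time. A sketch of counting postings per year and filtering by a cutoff date, using made-up dates and a hypothetical cutoff that is not part of the original analysis:

```python
import pandas as pd

# Toy dates, invented for illustration.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-11-03", "2024-10-29", "2024-11-18", "2024-11-21"]),
})

# Number of postings per calendar year, in chronological order.
per_year = toy["date"].dt.year.value_counts().sort_index()
print(per_year)

# Hypothetical cutoff: keep only postings dated on or after 1 January 2024.
cutoff = pd.Timestamp("2024-01-01")
recent = toy[toy["date"] >= cutoff]
print(len(recent))
```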
