# Assignment 1: Data Parsing, Cleansing and Integration
## Task 3
#### Student Name: Harold Davies
#### Student ID: 3997902

Date: 22/04/2024

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used: 
* pandas
* re
* numpy

## Introduction
This third part of the data parsing, cleansing and integration project involved exploring a second dataset (DS2) to discover how its schema and data differed from those of the first dataset (DS1), adjusting DS2 so that the two could be merged, and addressing any issues with the merged dataset before exporting it to CSV format. DS2 turned out to have very similar features to DS1: it lacked only the Id and SourceName columns, but every column besides Category had a different title, and many columns' values were not consistent with those of DS1. Of particular interest was that the Contract Type column in DS2 corresponded to the ContractTime column in DS1. Examples of inconsistent values include slightly different category names ('IT Jobs' in DS1 versus 'Information Technology' in DS2) and DS2 listing salaries per month rather than per year. Differing value names were fixed by mapping them to the names used in DS1, with missing values replaced by 'non-specified' at the same time, and salaries were fixed by multiplying those in DS2 by 12. The dates in DS2 were in a slightly different format, so these were converted, and an Id column running from 10000001 to 10005000 was added to DS2, which does not overlap with any of the Ids in DS1. Excluding only the Id and SourceName columns, no duplicates were found. However, on the assumption that two rows whose open and close dates differ only in day and time are effectively duplicates, new columns omitting day and time were added, and 4 duplicates were identified and removed. Finally, Id was identified as an appropriate global key and the dataset was exported to a CSV file.


##  Import libraries 

In [214]:
import pandas as pd
import numpy as np
import re

### 1. Loading and Examining Data 
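As shown below, the second dataset has only 9 columns and 5000 rows. Three of its columns contain missing values: Organisation, Contract Type and Full-Time Equivalent (FTE).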

In [215]:
#save datasets to df variables
df2 = pd.read_csv('data_to_integrate.csv')
df1 = pd.read_csv('clean_data.csv')

In [216]:
df2.shape

(5000, 9)

In [217]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 9 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Job Title                   5000 non-null   object 
 1   Organisation                4937 non-null   object 
 2   Monthly Payment             5000 non-null   float64
 3   Contract Type               4312 non-null   object 
 4   Category                    5000 non-null   object 
 5   Closing                     5000 non-null   object 
 6   Location                    5000 non-null   object 
 7   Full-Time Equivalent (FTE)  643 non-null    object 
 8   Opening                     5000 non-null   object 
dtypes: float64(1), object(8)
memory usage: 351.7+ KB


In [218]:
df2.head()

Unnamed: 0,Job Title,Organisation,Monthly Payment,Contract Type,Category,Closing,Location,Full-Time Equivalent (FTE),Opening
0,Deputy Manager (RGN) Nursing Home Bristol,Liquid Personnel Ltd,2791.67,,Healthcare and Nursing,13/01/2013 15:00,Bristol,FULL TIME,14/11/2012 15:00
1,RGN or RMN Wigan,Eden Brown,2240.0,,Healthcare and Nursing,24/01/2013 15:00,Wigan,FULL TIME,25/11/2012 15:00
2,Staff Nurse in Elderly (RGN) Bromley,Aaron Millar Recruitment,2000.0,,Healthcare and Nursing,1/02/2012 12:00,Bromley,FULL TIME,2/01/2012 12:00
3,Practice Nurses Band 5 Northallerton,Hays Healthcare,2880.0,,Healthcare and Nursing,21/07/2012 0:00,Northallerton,FULL TIME,22/04/2012 0:00
4,RGN or RMN Liverpool,Eden Brown,1996.75,,Healthcare and Nursing,6/04/2013 0:00,Liverpool,FULL TIME,7/03/2013 0:00


In [219]:
df2['Contract Type'].value_counts()

Contract Type
Permanent              3720
Fixed Term Contract     592
Name: count, dtype: int64

In [220]:
df2['Full-Time Equivalent (FTE)'].value_counts()

Full-Time Equivalent (FTE)
FULL TIME    583
0.2 FTE       19
0.4 FTE       16
0.8 FTE       14
0.6 FTE       11
Name: count, dtype: int64

In [221]:
df2['Category'].value_counts()

Category
Information Technology           1286
Engineering                      1066
Healthcare and Nursing            655
Finance and Accounting            637
Sales                             630
Hospitality and Catering          275
Teaching                          234
PR, Advertising and Marketing     217
Name: count, dtype: int64

### 2. Resolving schema conflicts
Conflicts found in two schemas and steps to resolve them:

| Id | Dataset 1 Column | Dataset 2 Column | Conflict | Resolution |
|----|------------------|------------------|----------|------------|
| 1  | Id           | NA                         | No Id in dataset 2 (DS2)                                | Generate 8-digit unique Ids for DS2                    |
| 2  | Title        | Job Title                  | Different column headings                               | Change column heading in DS2 to Title                  |
| 3  | All          | All                        | Different column order                                  | Change DS2 column order to match DS1                   |
| 4  | Company      | Organisation               | Different column headings                               | Change column heading in DS2 to Company                |
| 5  | ContractTime | Contract Type              | Column in DS2 corresponds to ContractTime column in DS1 | Change column heading in DS2 to ContractTime           |
| 6  | ContractType | Full-Time Equivalent (FTE) | Different column headings                               | Change column heading in DS2 to ContractType           |
| 7  | ContractType | Full-Time Equivalent (FTE) | Different names for full time vs part time              | Change values in DS2 to match those of DS1             |
| 8  | ContractTime | Contract Type              | Different names for permanent vs contract               | Change values in DS2 to match those of DS1             |
| 9  | Category     | Category                   | Same categories but different names                     | Update category names in DS2 to match those of DS1     |
| 10 | Salary       | Monthly Payment            | Different column headings                               | Change column heading in DS2 to Salary                 |
| 11 | Salary       | Monthly Payment            | Values annual vs monthly                                | Multiply values in DS2 by 12                           |
| 12 | OpenDate     | Opening                    | Different column headings                               | Change column heading in DS2 to OpenDate               |
| 13 | CloseDate    | Closing                    | Different column headings                               | Change column heading in DS2 to CloseDate              |
| 14 | SourceName   | NA                         | No source name in DS2                                   | Create column in DS2 and populate with 'non-specified' |
| 15 | Company      | Organisation               | Missing values                                          | Replace missing values with 'non-specified'            |
| 16 | ContractType | Full-Time Equivalent (FTE) | Missing values                                          | Replace missing values with 'non-specified'            |
| 17 | ContractTime | Contract Type              | Missing values                                          | Replace missing values with 'non-specified'            |
| 18 | OpenDate     | Opening                    | Wrong data type                                         | Change to datetime format yyyy-mm-dd hh:mm:ss          |
| 19 | CloseDate    | Closing                    | Wrong data type                                         | Change to datetime format yyyy-mm-dd hh:mm:ss          |

#### Conflicts 2, 3, 4, 5, 6, 10, 12, 13 and 14: Column Names and Order
These conflicts all relate to inconsistent column names or column order between the 2 datasets. Fixing them only requires renaming the columns in DS2 appropriately, adding the missing columns and re-ordering the columns. Conflict 14, which is that DS2 (df2) has no SourceName column, is fixed here as well by adding that column and populating it with 'non-specified' values. An empty Id column is also added as a placeholder, to be filled in under conflict 1.

In [222]:
#rename columns to match df1
df2.rename(columns={
    'Job Title': 'Title', 'Organisation': 'Company', 'Monthly Payment': 'Salary', 'Contract Type': 'ContractTime',
    'Closing': 'CloseDate', 'Full-Time Equivalent (FTE)': 'ContractType', 'Opening': 'OpenDate'
}, inplace=True)

In [223]:
#add columns to match df1
df2['SourceName'], df2['Id'] = 'non-specified', ''

In [224]:
#re-order columns to match df1
new_order = ['Id', 'Title', 'Location', 'Company', 'ContractType', 'ContractTime', 'Category', 'Salary', 'OpenDate', 'CloseDate', 'SourceName']
df2 = df2[new_order]
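
As a sanity check on the renaming and re-ordering, a sketch along the following lines could confirm that the two frames end up with one schema. The frames `ds1_demo` and `ds2_demo` and the variable `schemas_match` are hypothetical stand-ins introduced for illustration; only the column names and the single sample row come from the outputs above.

```python
import pandas as pd

# Hypothetical stand-in for df1: an empty frame carrying only the DS1 column names
ds1_demo = pd.DataFrame(columns=[
    'Id', 'Title', 'Location', 'Company', 'ContractType', 'ContractTime',
    'Category', 'Salary', 'OpenDate', 'CloseDate', 'SourceName'])

# Hypothetical stand-in for the raw df2: one row in the DS2 layout
ds2_demo = pd.DataFrame({
    'Job Title': ['RGN or RMN Wigan'], 'Organisation': ['Eden Brown'],
    'Monthly Payment': [2240.0], 'Contract Type': [None],
    'Category': ['Healthcare and Nursing'], 'Closing': ['24/01/2013 15:00'],
    'Location': ['Wigan'], 'Full-Time Equivalent (FTE)': ['FULL TIME'],
    'Opening': ['25/11/2012 15:00'],
})

# Same renaming as in the cell above
ds2_demo = ds2_demo.rename(columns={
    'Job Title': 'Title', 'Organisation': 'Company', 'Monthly Payment': 'Salary',
    'Contract Type': 'ContractTime', 'Closing': 'CloseDate',
    'Full-Time Equivalent (FTE)': 'ContractType', 'Opening': 'OpenDate'})

# Add the two columns DS2 lacks, then adopt the DS1 column order
ds2_demo['SourceName'] = 'non-specified'
ds2_demo['Id'] = pd.NA
ds2_demo = ds2_demo[list(ds1_demo.columns)]

# True only if names and order are identical
schemas_match = list(ds2_demo.columns) == list(ds1_demo.columns)
print(schemas_match)
```

Taking the column order from the DS1 frame itself, rather than from a hand-typed list, is one way to avoid the two orders drifting apart.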

#### Conflict 1: Missing Ids
To generate Ids with consistent formatting, I will number the rows from 1 to len(df2) and pad each number with leading zeros to 7 digits. Because the Ids are stored as integers, which cannot retain leading zeros, I will prefix each padded number with a 1 to give 8 digits. Given the shape of df2, the new Ids will therefore run from 10000001 to 10005000.

In [225]:
#generate ids
df2['Id'] = range(1, len(df2) + 1)

In [226]:
df2.dtypes

Id                int64
Title            object
Location         object
Company          object
ContractType     object
ContractTime     object
Category         object
Salary          float64
OpenDate         object
CloseDate        object
SourceName       object
dtype: object

In [227]:
#check for conflicts
print(f"df1 minimum id is: {df1['Id'].min()} and df2 maximum id is: {df2['Id'].max() + 10000000}")

df1 minimum id is: 12612628 and df2 maximum id is: 10005000


In [228]:
#pad with 0's
df2['Id'] = df2['Id'].apply(lambda x: int("1" + "0"*(7-len(str(x))) + str(x)))
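
For row numbers below 10,000,000, prefixing a 1 to a zero-padded 7-digit number is arithmetically the same as adding 10,000,000. The sketch below shows that alternative together with an explicit overlap check. The name `ID_OFFSET` and the two small frames are assumptions made for illustration, and apart from 12612628 (the DS1 minimum reported above) the DS1 Ids are invented.

```python
import pandas as pd

ID_OFFSET = 10_000_000  # assumed constant: yields 8-digit Ids beginning with 1

# Hypothetical miniature frames standing in for df1 and df2
ds1_demo = pd.DataFrame({'Id': [12612628, 12612830, 68997528]})
ds2_demo = pd.DataFrame({'Title': ['a', 'b', 'c']})

# Row numbers 1..n shifted by the offset
ds2_demo['Id'] = range(1 + ID_OFFSET, len(ds2_demo) + 1 + ID_OFFSET)

# Any Id present in both frames would appear in this set
overlap = set(ds1_demo['Id']) & set(ds2_demo['Id'])
print(ds2_demo['Id'].tolist())
print(overlap)
```

A set intersection checks every Id, not just the extremes, so it would still hold if the DS1 Ids were not all larger than the generated ones.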

#### Conflicts 7 & 16: Inconsistent ContractType value names and missing values
In order to make df2 consistent with df1 I will re-map the value names for the ContractType column in df2 (previously Full-Time Equivalent (FTE)) from FULL TIME to full_time and from 0.# FTE to part_time. 

In [229]:
#review values from dataset 1
df1['ContractType'].value_counts()

ContractType
non-specified    37430
full_time        11758
part_time         1513
Name: count, dtype: int64

In [230]:
#review values from dataset 2
df2['ContractType'].value_counts()

ContractType
FULL TIME    583
0.2 FTE       19
0.4 FTE       16
0.8 FTE       14
0.6 FTE       11
Name: count, dtype: int64

In [231]:
#create mapping for new values
contract_type_mapping = {'FULL TIME': 'full_time', '0.2 FTE': 'part_time', '0.4 FTE': 'part_time', '0.6 FTE': 'part_time', '0.8 FTE': 'part_time'}
#update values and replace missing values with 'non-specified'
df2['ContractType'] = df2['ContractType'].map(contract_type_mapping).fillna('non-specified')
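
One property of `Series.map` with a dictionary is that any value missing from the dictionary becomes NaN, which `fillna` would then silently turn into 'non-specified'. A minimal sketch of a guard against that, on a hypothetical sample series `demo`, might look like this:

```python
import pandas as pd

mapping = {'FULL TIME': 'full_time', '0.2 FTE': 'part_time', '0.4 FTE': 'part_time',
           '0.6 FTE': 'part_time', '0.8 FTE': 'part_time'}

# Hypothetical sample including one genuinely missing value
demo = pd.Series(['FULL TIME', '0.2 FTE', None, '0.8 FTE'])

# Non-missing values with no entry in the mapping; should be empty before mapping
unmapped = set(demo.dropna()) - set(mapping)

# Map known values, then label the missing ones
mapped = demo.map(mapping).fillna('non-specified')
print(unmapped)
print(mapped.tolist())
```

If `unmapped` were non-empty, the mapping would need extending before the `fillna` step, otherwise real values would be recorded as 'non-specified'.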

#### Conflicts 8 & 17: Inconsistent ContractTime value names and missing values
In order to make df2 consistent with df1 I will re-map the value names for the ContractTime column in df2 (previously Contract Type) from Permanent to permanent and from Fixed Term Contract to contract. 

In [232]:
#review values from dataset 1
df1['ContractTime'].value_counts()

ContractTime
permanent        30331
non-specified    11514
contract          5572
Name: count, dtype: int64

In [233]:
#review values from dataset 2
df2['ContractTime'].value_counts()

ContractTime
Permanent              3720
Fixed Term Contract     592
Name: count, dtype: int64

In [234]:
#create mapping for new values
contract_time_mapping = {'Permanent': 'permanent', 'Fixed Term Contract': 'contract'}
#update values and replace missing values with 'non-specified'
df2['ContractTime'] = df2['ContractTime'].map(contract_time_mapping).fillna('non-specified')

#### Conflict 9: Inconsistent Category value names
In order to make df2 consistent with df1 I will re-map the value names for the Category column in df2. 

In [235]:
#review values from dataset 1
df1['Category'].value_counts()

Category
IT Jobs                             13122
Healthcare & Nursing Jobs            8185
Engineering Jobs                     7199
Accounting & Finance Jobs            6808
Sales Jobs                           4747
Hospitality & Catering Jobs          4530
Teaching Jobs                        3558
PR, Advertising & Marketing Jobs     2552
Name: count, dtype: int64

In [236]:
#review values from dataset 2
df2['Category'].value_counts()

Category
Information Technology           1286
Engineering                      1066
Healthcare and Nursing            655
Finance and Accounting            637
Sales                             630
Hospitality and Catering          275
Teaching                          234
PR, Advertising and Marketing     217
Name: count, dtype: int64

In [237]:
#create mapping for new values
category_mapping = {
    'Information Technology': 'IT Jobs',
    'Engineering': 'Engineering Jobs',
    'Healthcare and Nursing': 'Healthcare & Nursing Jobs',
    'Finance and Accounting': 'Accounting & Finance Jobs',
    'Sales': 'Sales Jobs',
    'Hospitality and Catering': 'Hospitality & Catering Jobs',
    'Teaching': 'Teaching Jobs',
    'PR, Advertising and Marketing': 'PR, Advertising & Marketing Jobs'
}
#update values (Category has no missing values, so no fill is needed)
df2['Category'] = df2['Category'].map(category_mapping)
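
Because no `fillna` follows this mapping, a misspelt key or target would leave NaN or a stray category in the merged data. The sketch below checks both directions. The category names are copied from the value counts above, while `ds1_categories`, `demo` and the result variables are hypothetical names.

```python
import pandas as pd

# Category names as listed for dataset 1 above
ds1_categories = {
    'IT Jobs', 'Healthcare & Nursing Jobs', 'Engineering Jobs',
    'Accounting & Finance Jobs', 'Sales Jobs', 'Hospitality & Catering Jobs',
    'Teaching Jobs', 'PR, Advertising & Marketing Jobs'}

category_mapping = {
    'Information Technology': 'IT Jobs',
    'Engineering': 'Engineering Jobs',
    'Healthcare and Nursing': 'Healthcare & Nursing Jobs',
    'Finance and Accounting': 'Accounting & Finance Jobs',
    'Sales': 'Sales Jobs',
    'Hospitality and Catering': 'Hospitality & Catering Jobs',
    'Teaching': 'Teaching Jobs',
    'PR, Advertising and Marketing': 'PR, Advertising & Marketing Jobs'}

# Hypothetical sample of DS2 categories
demo = pd.Series(['Sales', 'Teaching', 'Engineering'])
mapped = demo.map(category_mapping)

# Every target name must already exist in DS1 ...
targets_valid = set(category_mapping.values()) <= ds1_categories
# ... and every source value must have found a mapping
all_mapped = bool(mapped.notna().all())
print(targets_valid, all_mapped)
```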

#### Conflict 11: Annual vs Monthly Salaries
In order to make df2 consistent with df1 I will multiply the salaries in df2 by 12 to convert them to annual salaries, rounding to 2 decimal places.

In [238]:
#convert monthly salaries to annual and round to 2 decimal places
df2['Salary'] = (df2['Salary'] * 12).round(2)
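
As a worked example of the conversion, the first three monthly payments shown in the `df2.head()` output above can be annualised as follows. The series name `monthly` is an assumption, and `Series.round` is used here as the vectorised form of the rounding.

```python
import pandas as pd

# Monthly payments taken from the first three rows of df2.head()
monthly = pd.Series([2791.67, 2240.0, 2000.0])

# Twelve monthly payments make one annual salary; round to whole pence
annual = (monthly * 12).round(2)
print(annual.tolist())
```

The rounding matters mainly for values such as 2791.67, where floating-point multiplication can leave trailing digits beyond the second decimal place.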

#### Conflict 15: Missing company values in dataset 2
In order to make df2 consistent with df1 I will replace missing values in the Company (previously Organisation) column with 'non-specified' values.  

In [239]:
#replace missing values with 'non-specified'
df2['Company'] = df2['Company'].fillna('non-specified')

In [253]:
#check value counts
df2['Company'].value_counts().head()

Company
UKStaffsearch                  72
non-specified                  63
Flame Health Associates LLP    42
CHERRY RED RECRUITMENT         39
Jobsite Jobs                   37
Name: count, dtype: int64

#### Conflicts 18 & 19: Wrong date formats
In order to make df2 consistent with df1 I will change the format of the dates and times in df2 from objects (strings) to datetime, and then to the desired date-time format.  

In [241]:
#convert date columns to date-time
df2['OpenDate'] = pd.to_datetime(df2['OpenDate'], format='%d/%m/%Y %H:%M')
df2['CloseDate'] = pd.to_datetime(df2['CloseDate'], format='%d/%m/%Y %H:%M')
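
To illustrate what the explicit format string does, the sketch below parses three Closing values copied from the `df2.head()` output, including ones with single-digit days and hours, and renders them in the yyyy-mm-dd hh:mm:ss layout. The names `raw`, `parsed` and `formatted` are hypothetical.

```python
import pandas as pd

# Closing values as they appear in DS2 (day first, no seconds)
raw = pd.Series(['13/01/2013 15:00', '1/02/2012 12:00', '6/04/2013 0:00'])

# Day-first parsing; without an explicit format '1/02/2012' could be read as 2 January
parsed = pd.to_datetime(raw, format='%d/%m/%Y %H:%M')

# Render in the target layout
formatted = parsed.dt.strftime('%Y-%m-%d %H:%M:%S').tolist()
print(formatted)
# ['2013-01-13 15:00:00', '2012-02-01 12:00:00', '2013-04-06 00:00:00']
```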

### 3. Merging data

Now that the 2 datasets have been modified to make their formatting consistent, they can be combined.

In [244]:
df_merged = pd.concat([df1, df2], ignore_index=True)
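
The effect of `ignore_index=True` can be seen on a small hypothetical pair of frames: the rows of the second frame are renumbered to continue after the first, which is why rows originating from df2 carry large index values in the merged frame.

```python
import pandas as pd

# Hypothetical miniature stand-ins for df1 and df2
ds1_demo = pd.DataFrame({'Id': [12612628], 'Title': ['x']})
ds2_demo = pd.DataFrame({'Id': [10000001, 10000002], 'Title': ['y', 'z']})

# Stack the frames and build a fresh 0..n-1 index
merged_demo = pd.concat([ds1_demo, ds2_demo], ignore_index=True)

# No rows should be lost or gained by the concatenation
rows_preserved = len(merged_demo) == len(ds1_demo) + len(ds2_demo)
print(merged_demo.index.tolist(), rows_preserved)
```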

### 4. Resolving data conflicts:
Most of the data conflicts were resolved above before merging the datasets, however I have not yet addressed duplicate rows. 

#### Conflict 1: Duplications
The biggest challenge in finding duplicate rows was that the exact dates and times of listings were too specific, resulting in rows which were effectively duplicates having different OpenDate and CloseDate values. To overcome this challenge I created 2 new columns containing just the year and month from OpenDate and CloseDate. I am making the assumption that if 2 job advertisements are the same apart from the source and the day and time of the opening and closing of the ad, then they are duplicate job ads. Considering that the job title, location, company, contract type, contract time, job category, salary, and month and year of the open and close dates all match, I believe this to be a reasonable assumption. This process identified 4 duplicate rows from the second dataset (non-specified SourceName), which were removed from the merged dataset.

In [246]:
#columns to exclude from duplicate search
exclude_columns = ['Id', 'SourceName']

#duplicates excluding the specified columns
duplicate_mask = df_merged.duplicated(subset=[col for col in df_merged.columns if col not in exclude_columns])
df_merged[duplicate_mask]

Unnamed: 0,Id,Title,Location,Company,ContractType,ContractTime,Category,Salary,OpenDate,CloseDate,SourceName


In [247]:
#reformat the time-date values back to datetime format for manipulation
df_merged['OpenDate'] = pd.to_datetime(df_merged['OpenDate'], format='%Y-%m-%d %H:%M:%S')
df_merged['CloseDate'] = pd.to_datetime(df_merged['CloseDate'], format='%Y-%m-%d %H:%M:%S')

#create new columns with modified date formats (without day and time)
df_merged['OpenDateOnly'] = df_merged['OpenDate'].dt.strftime('%Y-%m')
df_merged['CloseDateOnly'] = df_merged['CloseDate'].dt.strftime('%Y-%m')

In [248]:
#columns to exclude from duplicate search
exclude_columns = ['Id', 'SourceName', 'OpenDate', 'CloseDate']

# Check for duplicates excluding the specified columns
duplicate_mask = df_merged.duplicated(subset=[col for col in df_merged.columns if col not in exclude_columns])
df_merged[duplicate_mask]

Unnamed: 0,Id,Title,Location,Company,ContractType,ContractTime,Category,Salary,OpenDate,CloseDate,SourceName,OpenDateOnly,CloseDateOnly
50752,10000052,IT Infrastructure and Desktop Support,Berkshire,non-specified,non-specified,contract,IT Jobs,38880.0,2013-05-06 12:00:00,2013-05-20 12:00:00,non-specified,2013-05,2013-05
52385,10001685,Occupational Therapist Community Norfolk,King's Lynn,Service Care Solutions Ltd,non-specified,contract,Healthcare & Nursing Jobs,32640.0,2013-03-16 00:00:00,2013-03-30 00:00:00,non-specified,2013-03,2013-03
52609,10001909,Chef de Partie ****AA Rosette Restaurant Str...,Wales,James Webber Recruitment,non-specified,non-specified,Hospitality & Catering Jobs,15360.0,2013-04-14 00:00:00,2013-04-28 00:00:00,non-specified,2013-04,2013-04
52693,10001993,Supply IT Technical Trainers,UK,"Aspire, Achieve, Advance Limited",non-specified,contract,IT Jobs,38400.0,2013-09-04 12:00:00,2013-09-18 12:00:00,non-specified,2013-09,2013-09
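
The year-month comparison can be sketched on a small invented frame. All rows, the `OpenMonth` and `CloseMonth` column names and the variable names below are hypothetical; the technique is the same `strftime('%Y-%m')` truncation followed by `duplicated` on a column subset.

```python
import pandas as pd

# Invented rows: the first two differ only in Id, source, and day/time of the dates
demo = pd.DataFrame({
    'Id': [1, 2, 3],
    'Title': ['Chef', 'Chef', 'Nurse'],
    'SourceName': ['a.com', 'non-specified', 'b.com'],
    'OpenDate': pd.to_datetime(['2013-04-02 12:00', '2013-04-14 00:00', '2013-04-14 00:00']),
    'CloseDate': pd.to_datetime(['2013-04-16 12:00', '2013-04-28 00:00', '2013-04-28 00:00']),
})

# Keep only year and month of each date
demo['OpenMonth'] = demo['OpenDate'].dt.strftime('%Y-%m')
demo['CloseMonth'] = demo['CloseDate'].dt.strftime('%Y-%m')

# Compare on everything except identifiers, source and the exact timestamps
excluded = ('Id', 'SourceName', 'OpenDate', 'CloseDate')
key_cols = [c for c in demo.columns if c not in excluded]
mask = demo.duplicated(subset=key_cols)
print(mask.tolist())
# [False, True, False]
```

By default `duplicated` flags every occurrence after the first, so the earlier row is the one that is kept.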


The four duplicate rows identified above are removed from the merged dataset, together with the two ancillary year-month columns that were only needed for the duplicate search.

In [249]:
#remove duplicate rows
indices_to_delete = [50752, 52385, 52609, 52693]
df_merged = df_merged.drop(indices_to_delete)
#remove ancillary columns
df_merged = df_merged.drop(['OpenDateOnly', 'CloseDateOnly'], axis=1)
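
An alternative to listing the row indices by hand is to filter with the boolean mask directly, so the removal stays correct if the input data changes. The following is a minimal sketch on an invented frame, not the cell actually run above; `demo`, `mask` and `cleaned` are hypothetical names.

```python
import pandas as pd

# Invented frame with one helper column used only for the duplicate search
demo = pd.DataFrame({
    'Id': [1, 2, 3],
    'Title': ['Chef', 'Chef', 'Nurse'],
    'OpenMonth': ['2013-04', '2013-04', '2013-04'],
})

# Flag later occurrences of the same (Title, OpenMonth) pair
mask = demo.duplicated(subset=['Title', 'OpenMonth'])

# Keep unflagged rows, then drop the helper column
cleaned = demo[~mask].drop(columns=['OpenMonth'])
print(cleaned['Id'].tolist(), list(cleaned.columns))
```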

#### Finding global key for the data
A proper global key for the integrated data should have only unique values and no missing values. As the Id column served this purpose in the first dataset, and the unique Ids generated for the second dataset do not overlap with the existing Ids, this column should serve well as a global key. Below I confirm that the values are indeed unique and that there are no missing values.

In [250]:
#Id column contains only unique values (True)
df_merged['Id'].is_unique

True

In [251]:
#Id column contains missing values (False)
df_merged['Id'].isnull().any()

False
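
The two checks above can be wrapped into one reusable test. The helper `is_valid_key` below is a hypothetical function written for illustration and is not part of pandas; it only combines `Series.is_unique` with `isnull().any()`.

```python
import pandas as pd

def is_valid_key(frame: pd.DataFrame, column: str) -> bool:
    """Return True if the column has no missing values and no repeats."""
    col = frame[column]
    return bool(col.is_unique and not col.isnull().any())

# Invented examples: one good key, one with a repeat, one with a gap
good = pd.DataFrame({'Id': [10000001, 10000002, 12612628]})
repeated = pd.DataFrame({'Id': [10000001, 10000001]})
missing = pd.DataFrame({'Id': [10000001, None]})

results = [is_valid_key(good, 'Id'), is_valid_key(repeated, 'Id'), is_valid_key(missing, 'Id')]
print(results)
# [True, False, False]
```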

### 5. Saving the integrated and reshaped data
Now that the datasets have been successfully integrated and duplicates have been removed, it is time to save the resulting data frame as a CSV file.

In [252]:
#export merged csv
df_merged.to_csv('final_data.csv', index=False)
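
One point worth keeping in mind when the exported file is read back is that CSV stores dates as text, so the datetime type has to be restored on load. The round trip below uses an in-memory buffer in place of a file, and the single row is invented for illustration.

```python
import io
import pandas as pd

# Invented one-row frame with a datetime column
demo = pd.DataFrame({
    'Id': [10000001],
    'OpenDate': pd.to_datetime(['2012-11-14 15:00']),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates asks read_csv to convert the named column back to datetime
restored = pd.read_csv(buffer, parse_dates=['OpenDate'])
is_datetime = pd.api.types.is_datetime64_any_dtype(restored['OpenDate'])
print(list(restored.columns), is_datetime)
```

Passing `index=False`, as the export cell does, keeps the row index out of the file, so the restored frame has exactly the original columns.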

## Summary of the Assessment Task
This third part of the assessment task involved massaging a second dataset into the right shape and form to be consistent with the first, merging the two and fixing data inconsistencies. It was a challenge of conflicts, first finding and fixing the schema conflicts, followed by the data conflicts. Once this was completed, by creating columns which removed days and times from the date columns, four duplicate rows were identified and removed. Finally, the Id column was determined to be a candidate key and the dataset was exported. This dataset would hypothetically be ready for the next phase of a project with a broader scope than this one, such as a data science project. 