# Part 2: Data Cleaning, Reduction & Enrichment

To prepare the data scientist jobs data that I scraped from glassdoor.co.uk for analysis, I will:

- **Enrich** the data: by expanding/filling-in parts of the advertised job location using an Ordnance Survey API

- **Clean** the data: after importing the CSV file into a pandas DataFrame, I'll remove duplicate jobs, and check and clean the data column-by-column

- **Reduce** the data: by eliminating invalid jobs and transforming the data types where possible so that they take up less memory


## Setup

### Import Packages & Modules

In [1]:
# import packages and modules
import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np
import os
import warnings
import random
import re

### Display Settings

In [2]:
# ensure full column contents and all rows will be displayed if/when you print the dataframe
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

# ensure all figures will have a white background in this notebook
%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}

# ignore filter warnings
warnings.filterwarnings('ignore')


### Import Data

I'll import the jobs data CSV file, reading it in as a pandas DataFrame.

In [3]:
# directory containing the scraped and checked glassdoor jobs data
path = './data/'

# provide glassdoor scrape date, e.g. '14Dec2020'
scrapedate = '14Dec2020'

# build the path to the scraped and checked jobs data file
filename = os.path.join(path, f"gdjobs_df_{scrapedate}_checked.csv")

# read the data scientist jobs data (CSV file) into a dataframe
gdjobs = pd.read_csv(filename, index_col=0)

# display dataframe info to check that it's what you expected
gdjobs.info()


<class 'pandas.core.frame.DataFrame'>
Index: 727 entries, Senior Data Scientist to Researcher/Data Scientist - QMUL
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   salary_estimate         453 non-null    object 
 1   job_description         724 non-null    object 
 2   rating                  587 non-null    float64
 3   company_name            727 non-null    object 
 4   location                727 non-null    object 
 5   size                    629 non-null    object 
 6   founded                 508 non-null    float64
 7   type_of_ownership       629 non-null    object 
 8   industry                553 non-null    object 
 9   sector                  555 non-null    object 
 10  revenue                 629 non-null    object 
 11  rating_culturevalues    572 non-null    float64
 12  rating_worklifebalance  583 non-null    float64
 13  rating_diversity        451 non-null    float64
 14

## Enrichment: getting the full job location

When browsing jobs on glassdoor.co.uk, I noticed that the level of detail in the job locations varies (e.g. 'Greater Manchester' vs 'Farnborough, Hampshire, South East England, England').

In [4]:
# print a sample of the job locations scraped from glassdoor.co.uk
gdjobs['location'].value_counts().sample(10)

Alva, Scotland                                     1
Bromley, England                                   1
Skipton, England                                   1
Bristol, England                                   9
Swindon, Wiltshire, South West England, England    2
Colchester, England                                2
Birmingham, England                                5
Dundee, Scotland                                   1
Bury St Edmunds, England                           1
Cambridgeshire                                     1
Name: location, dtype: int64
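
One rough way to quantify how much the level of detail varies is to count the comma-separated parts in each location string. The sketch below uses a few made-up values in the formats seen above rather than the scraped data, and the names `locations` and `n_parts` are only illustrative.

```python
import pandas as pd

# Toy location strings in the formats seen above (not the real data)
locations = pd.Series([
    "Greater Manchester",
    "Bristol, England",
    "Swindon, Wiltshire, South West England, England",
    "Cambridgeshire",
])

# Number of comma-separated parts: a rough measure of how detailed a location is
n_parts = locations.str.split(",").str.len()
print(n_parts.value_counts().sort_index())
```

Locations with a single part give the least to work with, which is where parsing with an external API is likely to help most.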

I'm interested in being able to look at jobs by region, city, etc. To make this possible, I will enrich the data set using the Ordnance Survey API to parse the location given by each employer, so that all parts of the job location are recorded in the DataFrame. For this purpose I have written the function `get_locations`, which is in the `function_locationapi.py` script.

In [5]:
from function_locationapi import get_locations

gdjobs_loc = get_locations(
    scrapedate='14Dec2020', 
    path='./data/')

# display dataframe info to check that it's what you expected
gdjobs_loc.info()

100%|██████████| 727/727 [01:30<00:00,  8.00it/s]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 727 entries, 0 to 726
Data columns (total 23 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   job_title               727 non-null    object 
 1   salary_estimate         453 non-null    object 
 2   job_description         724 non-null    object 
 3   rating                  587 non-null    float64
 4   company_name            727 non-null    object 
 5   location                727 non-null    object 
 6   size                    629 non-null    object 
 7   founded                 508 non-null    float64
 8   type_of_ownership       629 non-null    object 
 9   industry                553 non-null    object 
 10  sector                  555 non-null    object 
 11  revenue                 629 non-null    object 
 12  rating_culturevalues    572 non-null    float64
 13  rating_worklifebalance  583 non-null    float64
 14  rating_diversity        451 non-null    fl




`get_locations()` has added 5 new columns after using the Ordnance Survey API to parse the locations:

In [6]:
gdjobs_loc[['location', 'api_citytownvilham', 'api_region', 'api_country', 'uk', 'remote']].sample(10)

| | location | api_citytownvilham | api_region | api_country | uk | remote |
|---|---|---|---|---|---|---|
| 196 | Staines, England | Staines-upon-Thames | South East | England | False | False |
| 590 | Greater London | London | London | England | False | False |
| 512 | London, England | London | London | England | False | False |
| 275 | Stockport, England | Stockport | North West | England | False | False |
| 135 | Derby, England | Derby | East Midlands | England | False | False |
| 627 | London, England | London | London | England | False | False |
| 503 | Newcastle upon Tyne, England | Newcastle upon Tyne | North East | England | False | False |
| 181 | London, England | London | London | England | False | False |
| 198 | Cheltenham, England | Cheltenham | South West | England | False | False |
| 54 | United Kingdom | | | | True | False |


## Cleaning: checking and cleaning each column as needed

Now I'll look at the values in each column to decide on, and then implement, any necessary cleaning of the data.

### Job title

In [7]:
# check what the job titles look like
print(gdjobs_loc['job_title'].value_counts().sample(20))


Data Scientist (KTP Associate)                              1
NGS Product Integration Scientist                           1
Data Scientist / Operational Researcher                     2
Senior Data Scientist - Product                             1
Principal Applied Scientist                                 1
Lead Data Scientist, Performance Marketing (Belfast, UK)    2
Senior Data Scientist - Crop Modeller R&D                   1
Commercial Data Analyst                                     1
Customer Data Scientist                                     1
Data Scientist - Reinforcement Learning                     1
Data Scientist Python PhD                                   1
Data Science Manager                                        6
Artificial Intelligence – Data Scientist                    1
Chemical Development Scientist                              1
Data Science Communicator                                   1
Data Scientist - Defence                                    1
Analytic

The job titles seem to have been scraped appropriately; there are no cleaning requirements.

### Salary estimate

In [8]:
# check what the salary estimates look like
print(gdjobs_loc['salary_estimate'].value_counts().sample(10))

£24K-£44K (Glassdoor Est.)     1
£39K-£44K (Glassdoor Est.)     1
£24K-£32K (Glassdoor Est.)     1
£40K-£52K (Glassdoor Est.)    46
£66K-£92K (Glassdoor Est.)     2
£33K-£51K (Glassdoor Est.)     2
£35K-£64K (Glassdoor Est.)     1
£26K-£32K (Glassdoor Est.)     2
£43K-£70K (Glassdoor Est.)     1
£48K-£60K (Glassdoor Est.)     1
Name: salary_estimate, dtype: int64


To be able to analyse the salary estimates I will need to isolate the numbers.

It looks like all scraped salary estimates:
- are given as ranges
- are in GBP ('£'),
- are per annum salaries (implied by the use of 'K' to denote thousands), and 
- end with '(Glassdoor Est.)'

Before I go ahead with the cleaning, I will check whether my assumptions (above) are true to spot and address any exceptions.


In [9]:
# check if all salaries include a '-' (hyphen) indicating a range 
if (gdjobs_loc['salary_estimate'].dropna().str.contains('-').all()):
    print('All salaries are given as a range; no exceptions to deal with')
else:
    # calculate what proportion of salary estimates are given as a range
    ppn_range = gdjobs_loc['salary_estimate'].dropna().str.contains('-').mean()
    print(f"The vast majority of Glassdoor salary estimates ({ppn_range:.0%}) are given as a range")
    print("Salaries given as a single value will be used for the salary estimate midpoint directly")


All salaries are given as a range; no exceptions to deal with


In [10]:
# check if all salaries are in GBP (£)
all_gbp = gdjobs_loc['salary_estimate'].dropna().str.contains('£', regex=False).all()

if not all_gbp:
    nongbp_se = gdjobs_loc[gdjobs_loc["salary_estimate"].str.contains('£', regex=False) == False]['salary_estimate']
    print(f"Not all salaries are in GBP:\n\n{nongbp_se}\n")
    gdjobs_loc = gdjobs_loc.drop(labels=nongbp_se.index, axis='index')
    print("Jobs with non-GBP salary estimates removed")
else:
    print("All salaries are in GBP")


Not all salaries are in GBP:

227     $86K-$142K (Glassdoor Est.)
471    $120K-$253K (Glassdoor Est.)
Name: salary_estimate, dtype: object

Jobs with non-GBP salary estimates removed


In [11]:
# check if all salaries are in thousands (K) 
all_k = gdjobs_loc['salary_estimate'].dropna().astype(str).str.contains('K').all()

if not all_k:
    nonk_se = gdjobs_loc[gdjobs_loc["salary_estimate"].astype(str).str.contains('K') == False]['salary_estimate']
    print(f"Not all salaries are in thousands (K; indicating annual salary):\n\n{nonk_se}\n")
    gdjobs_loc.loc[nonk_se.index, "salary_estimate"] = np.nan
    print("These have been converted to 'np.nan'")
else:
    print("All salaries are in thousands, denoted with a 'K'")

All salaries are in thousands, denoted with a 'K'


In [12]:
# check if all salaries end in "(Glassdoor Est.)"
# (str.endswith matches the text literally, so the brackets and full stop are not treated as regex)
all_gdest = gdjobs_loc['salary_estimate'].dropna().astype(str).str.endswith('(Glassdoor Est.)').all()

if all_gdest:
    print('All salary estimates end in "(Glassdoor Est.)"')
else:
    nongdest_se = gdjobs_loc[gdjobs_loc["salary_estimate"].str.endswith('(Glassdoor Est.)') == False]['salary_estimate']
    print(f'Not all salaries end in "(Glassdoor Est.)":\n\n{nongdest_se}\n')
    

All salary estimates end in "(Glassdoor Est.)"
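
The four assumptions checked above could also be tested in one pass with a single regular expression. This is only a sketch on made-up values, not the scraped data; the names `salaries`, `pattern`, `is_expected` and `unexpected` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy values in the formats seen above, plus a missing value (not the real data)
salaries = pd.Series([
    "£40K-£52K (Glassdoor Est.)",
    "$86K-$142K (Glassdoor Est.)",
    None,
])

# '£', digits, 'K', a hyphen, the same again, then the literal suffix
pattern = r"£\d+K-£\d+K \(Glassdoor Est\.\)"

# fullmatch requires the whole string to fit; na=False marks missing values as non-matching
is_expected = salaries.str.fullmatch(pattern, na=False)

# non-missing values that break at least one assumption
unexpected = salaries[salaries.notna() & ~is_expected]
print(unexpected)
```

A single pattern is compact, but the separate checks have the advantage of showing which assumption failed.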


To be able to analyse the salary estimates more easily, I will:
- remove all instances of "£", "K" and "(Glassdoor Est.)", 
- split salary ranges into min and max salary, and convert these data to numerical values
- calculate the midpoint of the salary range by taking the mean of the min and max

In [13]:
# remove "£" and "K" from the salary_estimate column
gdjobs_loc['salary_estimate'] = gdjobs_loc['salary_estimate'].str.replace('[K£]', '', regex=True)

# remove "(Glassdoor Est.)" by splitting string on "(" and keeping only the first part
gdjobs_loc["salary_estimate"] = gdjobs_loc["salary_estimate"].apply(
    lambda x: x if pd.isna(x) else x.split(" (")[0])

# check if any instances of "£", "K", or "(Glassdoor Est.)" remain
print(gdjobs_loc['salary_estimate'].str.contains('£', regex=False).any())
print(gdjobs_loc['salary_estimate'].str.contains('K', regex=False).any())
print(gdjobs_loc['salary_estimate'].str.contains('(Glassdoor Est.)', regex=False).any())

# check how the values look now
print(gdjobs_loc['salary_estimate'].dropna().head(10))


False
False
False
0     54-69
1     58-80
2     26-27
10    35-40
11    26-32
13    46-51
14    58-77
15    29-38
16    61-69
17    51-70
Name: salary_estimate, dtype: object


In [14]:
# extract the min, max and mid-point of the salary estimate, where a range is given
# min; split string on "-" and take first part
gdjobs_loc["salary_min"] = gdjobs_loc["salary_estimate"].apply(
    lambda x: x if (pd.isna(x) or ("-" not in x)) else x.split("-")[0]
)

# max; split string on "-" and take second part
gdjobs_loc["salary_max"] = gdjobs_loc["salary_estimate"].apply(
    lambda x: x if (pd.isna(x) or ("-" not in x)) else x.split("-")[1]
)

# convert the min and max to numerical values and get the midpoint
gdjobs_loc["salary_mid"] = gdjobs_loc.apply(
    lambda x: x["salary_estimate"] if pd.isna(x["salary_estimate"]) else np.mean(
        pd.to_numeric([x["salary_min"], x["salary_max"]])
    ), axis=1
)

# check for the expected output
print(gdjobs_loc[[
    "salary_estimate", 
    "salary_min", 
    "salary_max", 
    "salary_mid"]].dropna().sample(10)
)

    salary_estimate salary_min salary_max  salary_mid
633           47-67         47         67        57.0
15            29-38         29         38        33.5
182           61-91         61         91        76.0
16            61-69         61         69        65.0
526           39-56         39         56        47.5
552           45-62         45         62        53.5
89            34-45         34         45        39.5
275           37-50         37         50        43.5
378           36-47         36         47        41.5
616           42-52         42         52        47.0
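
As an alternative to the row-by-row `apply` calls, the same three columns could be derived with vectorised string methods. The sketch below runs on made-up ranges rather than the scraped data; `salary_estimate` and `bounds` are illustrative names, and unlike the cell above it stores the min and max as numbers rather than strings.

```python
import pandas as pd

# Toy cleaned salary ranges in thousands of GBP (not the real data)
salary_estimate = pd.Series(["54-69", "26-27", None, "47-67"])

# capture the numbers either side of the hyphen; missing values stay NaN
bounds = salary_estimate.str.extract(
    r"^(?P<salary_min>\d+)-(?P<salary_max>\d+)$"
).astype(float)

# midpoint of each range
bounds["salary_mid"] = bounds[["salary_min", "salary_max"]].mean(axis=1)
print(bounds)
```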


### Job Description

To answer the questions I am interested in, I need all job ads in my data set to have a description.

In [15]:
# check for jobs that lack a description
gdjobs_loc['job_description'].isna().value_counts()

False    722
True       3
Name: job_description, dtype: int64

In [16]:
# remove jobs that don't have a description
gdjobs_loc = gdjobs_loc[gdjobs_loc['job_description'].notnull()]

### Company name

In [17]:
# check what the scraped company names look like
print(gdjobs_loc["company_name"].sample(10))

305                       infarm\n4.1
717                   G-Research\n4.7
297     British American Tobacco\n4.0
42                         Logic Plum
211                   Kelkoo LTD\n4.4
436                        Abcam\n4.8
99     Amida Recruitment Limited\n4.4
281       Next Phase Recruitment\n4.0
384                    Taylorollinson
386                    causaLens\n4.5
Name: company_name, dtype: object


When a company has a Glassdoor rating, it appears with the company's name on the website. My scraping tool has captured both the company's name and rating together, separated by a new line. 

Since the company rating has already been scraped separately and recorded in its own column (`rating`), I'll simply remove it from the `company_name` column. I'll check each job for a rating, and when there is one, the last 4 characters of the company name (the new line plus a rating such as "4.1") will be dropped.

In [18]:
# remove the company rating from the company name
gdjobs_loc["company_name"] = gdjobs_loc.apply(
    lambda x: x["company_name"] if pd.isna(x["rating"]) else x["company_name"][:-4], axis=1
)
print(gdjobs_loc["company_name"].sample(10))


91              GlaxoSmithKline
409                 esure Group
445                   CitizenMe
348                   Sartorius
695                       Tesco
2                   BioGrad Ltd
342                       Ipsos
310      Novation Solutions Ltd
115                 AstraZeneca
260    Public Sector Resourcing
Name: company_name, dtype: object
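
Dropping the last 4 characters relies on every rating being exactly three characters long after the new line. A slightly more defensive variant, sketched here on made-up names rather than the scraped data, keeps whatever comes before the first new line; `company_name` and `cleaned` are illustrative names.

```python
import pandas as pd

# Toy company names in the scraped format (not the real data)
company_name = pd.Series(["infarm\n4.1", "Logic Plum", "G-Research\n4.7"])

# keep only the text before the first new line; names without a rating are unchanged
cleaned = company_name.str.split("\n").str[0].str.strip()
print(cleaned.tolist())
```

This version does not need the `rating` column at all, so it would still work if a rating were missing there but present in the name.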


### Size

In [19]:
# check what the company sizes look like
print(gdjobs_loc["size"].value_counts(dropna=False).sort_index())


1 to 50 Employees          118
10000+ Employees           157
1001 to 5000 Employees      92
201 to 500 Employees        54
5001 to 10000 Employees     21
501 to 1000 Employees       34
51 to 200 Employees        124
Unknown                     25
NaN                         97
Name: size, dtype: int64


There are 3 issues with the `size` column:
- Each size bracket ends with the word "Employees", which is unnecessary and should be removed.
- The sizes are bins of employee counts, i.e. ordered categories, but the column is stored as pandas object (Python strings), which takes up more memory than a categorical. I'll change the data type from object to an ordered category, so that the values also sort and plot in the correct order.
- Some companies have a size of "Unknown": I'll convert these to NaN values so that they are excluded from analyses and plots.

In [20]:
# check memory usage before changing to categorical data type
gdjobs_loc["size"].memory_usage(deep=True)

55855

In [21]:
# turn "Unknown" entries to nan values and remove " Employees" (10 chars) from the end
gdjobs_loc["size"] = gdjobs_loc["size"].apply(
    lambda x: np.nan if (x == "Unknown" or pd.isna(x)) else x[:-10]
)

# create an ordered categorical data type to apply to the 'size' column
size_cat_type = CategoricalDtype( # CategoricalDtype allows ordering
    categories=[
        '1 to 50',
        '51 to 200',
        '201 to 500',
        '501 to 1000',
        '1001 to 5000',
        '5001 to 10000',
        '10000+'
    ],
    ordered=True
)

# change the data type to size_cat_type
gdjobs_loc["size"] = gdjobs_loc["size"].astype(size_cat_type)



In [22]:
# check memory usage after changing to categorical data type
gdjobs_loc["size"].memory_usage(deep=True)

7265

The data are now in the appropriate format and use far less memory (7,265 bytes, down from 55,855), which can also speed up operations such as sorting and grouping.
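
To illustrate what the ordering buys, here is a small sketch with three of the size bands and made-up values; `size_type` and `sizes` are illustrative names.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

size_type = CategoricalDtype(
    categories=["1 to 50", "51 to 200", "201 to 500"], ordered=True
)
sizes = pd.Series(["201 to 500", "1 to 50", None, "51 to 200"]).astype(size_type)

# sorting follows the declared category order, not alphabetical order
print(sizes.sort_values().tolist())

# ordered categories can be compared against a category label
print((sizes > "1 to 50").tolist())
```

As plain strings, "201 to 500" would sort before "51 to 200", which is why the explicit order matters for plots.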

### Founded

In [23]:
# check what the founded year values look like
print(gdjobs_loc["founded"].sample(10))

348    1870.0
14        NaN
504    1987.0
668       NaN
107    1993.0
696    2017.0
217    1991.0
238    2017.0
497    2015.0
242    2007.0
Name: founded, dtype: float64


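The years in which the companies were founded are stored as floating point numbers, because the column contains missing values. I'll convert them to pandas' nullable integer type (`Int64`), which can hold integers alongside missing values.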

In [24]:
# change the data type of the founded column
gdjobs_loc["founded"] = gdjobs_loc["founded"].astype('Int64')
print(gdjobs_loc["founded"].sample(10))

96     2007
517    <NA>
223    1987
410    <NA>
684    2005
592    <NA>
425    1992
569    2010
327    <NA>
266    <NA>
Name: founded, dtype: Int64
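
The capital "I" in `'Int64'` matters. A minimal sketch on made-up values (`founded` and `founded_int` are illustrative names) shows the difference between NumPy's `int64` and the nullable `Int64` extension type:

```python
import numpy as np
import pandas as pd

founded = pd.Series([1870.0, np.nan, 2017.0])

# NumPy's int64 cannot represent a missing value, so this conversion fails
try:
    founded.astype("int64")
except (ValueError, TypeError) as err:
    print(f"int64 failed: {err}")

# the nullable extension type stores missing values as <NA>
founded_int = founded.astype("Int64")
print(founded_int)
```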


### Type of ownership

In [25]:
# check what the type of ownership values look like
print(gdjobs_loc["type_of_ownership"].value_counts(dropna=False).sort_index())

College / University               15
Company - Private                 385
Company - Public                  174
Contract                            1
Government                          8
Hospital                            2
Nonprofit Organization              8
Other Organization                  2
Private Practice / Firm             1
Subsidiary or Business Segment     25
Unknown                             4
NaN                                97
Name: type_of_ownership, dtype: int64


I will take the valid values found in the jobs data set scraped on 14 Dec 2020 to create a categorical data type using CategoricalDtype. Existing np.nan values stay as they are, and anything not listed in the CategoricalDtype is turned into np.nan as well. Changing the data type of the `type_of_ownership` column from pandas object/Python string to categorical will also reduce memory usage.

In [26]:
# create a categorical data type to apply to the 'type_of_ownership' column
too_cat_type = CategoricalDtype( 
    categories=[
        'Company - Private',
        'Company - Public',
        'Subsidiary or Business Segment',
        'College / University',
        'Government',
        'Nonprofit Organization',
        'Hospital',
        'Contract',
        'Private Practice / Firm',
    ],
    ordered=False
)

# change the data type of gdjobs_loc["type_of_ownership"] to too_cat_type
gdjobs_loc["type_of_ownership"] = gdjobs_loc["type_of_ownership"].astype(too_cat_type)
gdjobs_loc["type_of_ownership"].value_counts(dropna=False)

Company - Private                 385
Company - Public                  174
NaN                               103
Subsidiary or Business Segment     25
College / University               15
Government                          8
Nonprofit Organization              8
Hospital                            2
Contract                            1
Private Practice / Firm             1
Name: type_of_ownership, dtype: int64

The 'Unknown' and 'Other Organization' values, which are not included in the categories of the `too_cat_type` CategoricalDtype, were automatically converted to NaN values when `too_cat_type` was applied to `gdjobs_loc["type_of_ownership"]`. This accounts for the NaN count rising from 97 to 103.
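
The same behaviour can be seen in isolation with a sketch on made-up values; `ownership_type`, `ownership` and `converted` are illustrative names.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

ownership_type = CategoricalDtype(categories=["Company - Private", "Government"])
ownership = pd.Series(["Company - Private", "Unknown", None, "Government"])

# values that are not in the category list become NaN; existing NaN stay NaN
converted = ownership.astype(ownership_type)
print(converted.isna().tolist())
```

Because the conversion is silent, it is worth comparing the NaN counts before and after, as done above, to catch misspelt or unexpected category labels.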

### Industry

In [27]:
# check what the industry values look like
print(gdjobs_loc["industry"].value_counts(dropna=False).sort_index())

Accounting                                     7
Advertising & Marketing                        7
Aerospace & Defense                            8
Airlines                                       1
Architectural & Engineering Services           7
Banks & Credit Unions                         15
Beauty & Personal Accessories Stores           1
Biotech & Pharmaceuticals                     68
Brokerage Services                             3
Cable, Internet & Telephone Providers          2
Camping & RV Parks                             1
Chemical Manufacturing                         1
Colleges & Universities                       15
Commercial Equipment Repair & Maintenance      1
Computer Hardware & Software                  20
Construction                                   1
Consulting                                    33
Consumer Products Manufacturing                4
Department, Clothing, & Shoe Stores           10
Drug & Health Stores                           3
Education Training S

I'll switch the data type from pandas object to 'category' to reduce the memory usage when working with this column going forward.

In [28]:
# # if there are any 'Unknown' industries, convert these to np.nan
# gdjobs_loc["industry"] = gdjobs_loc.apply(lambda x: np.nan if (x["industry"] == "Unknown") else x["industry"], axis=1)

# change the industry column's data type from object to category
gdjobs_loc["industry"] = gdjobs_loc["industry"].astype('category')
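
The memory saving from this kind of conversion can be checked with `memory_usage(deep=True)`, as was done for the `size` column. The sketch below uses a made-up column with a few repeated labels, not the scraped data, and the variable names are illustrative.

```python
import pandas as pd

# Toy column with a few labels repeated many times (not the real data)
industry = pd.Series(["Consulting", "Accounting", "Airlines"] * 200, dtype="object")

bytes_object = industry.memory_usage(deep=True)
bytes_category = industry.astype("category").memory_usage(deep=True)
print(f"object: {bytes_object} bytes, category: {bytes_category} bytes")
```

The saving comes from storing each distinct label once and keeping only small integer codes per row, so it is largest when a column has few distinct values relative to its length.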


### Sector

In [29]:
# check what the sector values look like
print(gdjobs_loc["sector"].value_counts(dropna=False).sort_index())


Accounting & Legal                      7
Aerospace & Defense                     8
Agriculture & Forestry                  1
Arts, Entertainment & Recreation        5
Biotech & Pharmaceuticals              68
Business Services                     158
Construction, Repair & Maintenance      2
Education                              20
Finance                                50
Government                              6
Health Care                             9
Information Technology                126
Insurance                              17
Manufacturing                          12
Media                                  13
Mining & Metals                         1
Non-Profit                              2
Oil, Gas, Energy & Utilities            7
Real Estate                             1
Retail                                 27
Telecommunications                      5
Transportation & Logistics              3
Travel & Tourism                        3
NaN                               

I'll switch the data type from pandas object to 'category' to reduce the memory usage when working with this column going forward.

In [30]:
# # if there are any 'Unknown' sectors, convert these to np.nan
# gdjobs_loc["sector"] = gdjobs_loc.apply(lambda x: np.nan if (x["sector"] == "Unknown") else x["sector"], axis=1)

# change the sector column's data type from object to category
gdjobs_loc["sector"] = gdjobs_loc["sector"].astype('category')


### Revenue

In [31]:
# check what the revenue values look like
print(gdjobs_loc["revenue"].value_counts(dropna=False))

Unknown / Non-Applicable            218
NaN                                  97
Less than $1 million (USD)           92
$10+ billion (USD)                   80
$100 to $500 million (USD)           48
$5 to $10 billion (USD)              36
$25 to $50 million (USD)             27
$10 to $25 million (USD)             25
$2 to $5 billion (USD)               25
$50 to $100 million (USD)            21
$500 million to $1 billion (USD)     21
$1 to $5 million (USD)               19
$1 to $2 billion (USD)                7
$5 to $10 million (USD)               6
Name: revenue, dtype: int64


The 'Unknown / Non-Applicable' values need to be converted to NaN values. I'll convert the `revenue` values from string objects to CategoricalDtype with an ordered list of valid values, so everything else (in this case, 'Unknown / Non-Applicable') is changed to a NaN value. Changing the data type of the `revenue` column from pandas object/python string to CategoricalDtype will also reduce memory usage and make plotting the data more straightforward, since the categories will be correctly ordered in any plots.

In [32]:
# if there are any "Unknown / Non-Applicable" revenues, convert these to np.nan
gdjobs_loc["revenue"] = gdjobs_loc["revenue"].mask(gdjobs_loc["revenue"] == "Unknown / Non-Applicable")

revenue_cat_type = CategoricalDtype(
    categories=[
        'Less than $1 million (USD)',
        '$1 to $5 million (USD)',
        '$5 to $10 million (USD)',
        '$10 to $25 million (USD)',
        '$25 to $50 million (USD)',
        '$50 to $100 million (USD)',
        '$100 to $500 million (USD)',
        '$500 million to $1 billion (USD)',
        '$1 to $2 billion (USD)',
        '$2 to $5 billion (USD)',
        '$5 to $10 billion (USD)',
        '$10+ billion (USD)',
    ],
    ordered=True
)

gdjobs_loc["revenue"] = gdjobs_loc["revenue"].astype(revenue_cat_type)
gdjobs_loc["revenue"].value_counts(dropna=False).sort_index()


Less than $1 million (USD)           92
$1 to $5 million (USD)               19
$5 to $10 million (USD)               6
$10 to $25 million (USD)             25
$25 to $50 million (USD)             27
$50 to $100 million (USD)            21
$100 to $500 million (USD)           48
$500 million to $1 billion (USD)     21
$1 to $2 billion (USD)                7
$2 to $5 billion (USD)               25
$5 to $10 billion (USD)              36
$10+ billion (USD)                   80
NaN                                 315
Name: revenue, dtype: int64

The former 'Unknown / Non-Applicable' values are now correctly included with the np.nan ('NaN') values (218 + 97 = 315).

### Subratings

There were up to 6 different subratings for each company when the Glassdoor website was scraped (14 Dec 2020). There is a column for each in the `gdjobs_loc` DataFrame.

In [33]:
# create a list of the subrating column headers
subratings = [
    "rating_culturevalues",
    "rating_worklifebalance",
    "rating_diversity",
    "rating_seniormgmt",
    "rating_compbenefits",
    "rating_careerops",
]

In [34]:
# get info on the subrating columns in the jobs dataset dataframe
gdjobs_loc[subratings].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 722 entries, 0 to 726
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   rating_culturevalues    567 non-null    float64
 1   rating_worklifebalance  578 non-null    float64
 2   rating_diversity        446 non-null    float64
 3   rating_seniormgmt       577 non-null    float64
 4   rating_compbenefits     578 non-null    float64
 5   rating_careerops        578 non-null    float64
dtypes: float64(6)
memory usage: 39.5 KB


In [35]:
# check what the subratings values look like
gdjobs_loc[subratings].head()

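| | rating_culturevalues | rating_worklifebalance | rating_diversity | rating_seniormgmt | rating_compbenefits | rating_careerops |
|---|---|---|---|---|---|---|
| 0 | 4.0 | 3.9 | NaN | 3.7 | 3.8 | 4.1 |
| 1 | 4.7 | 4.6 | 4.5 | 4.4 | 4.4 | 4.3 |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2.9 | 2.6 | 3.4 | 2.7 | 3.6 | 3.5 |
| 4 | 3.2 | 3.1 | 3.4 | 2.8 | 3.3 | 2.9 |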


In [36]:
# check that all subratings are either NaN or a value above 0 and up to 5 with 1 d.p.
# (round(x, 1) == x avoids floating point surprises such as 0.7*10 != 7.0)
for i in subratings:
    print(
        np.all(
            gdjobs_loc[i].apply(
                lambda x: pd.isna(x) or ((round(x, 1) == x) and (0 < x <= 5))
            )
        )
    )

True
True
True
True
True
True


Since the data type for every subrating column is float64, and the values are either NaN or numbers greater than 0 and no more than 5, rounded to 1 d.p., no cleaning is necessary in order to analyse and plot these data.
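
The same validity check can be written without a loop by working on the whole DataFrame at once. This is a sketch on made-up ratings, not the scraped data; the column `made_up_invalid` is deliberately wrong to show what a failure looks like, and all names are illustrative.

```python
import numpy as np
import pandas as pd

ratings = pd.DataFrame({
    "rating_culturevalues": [4.0, 4.7, np.nan, 2.9],
    "made_up_invalid": [4.25, 6.0, np.nan, 3.0],  # too many decimals, out of range
})

in_range = ratings.gt(0) & ratings.le(5)   # 0 < x <= 5
one_dp = ratings.round(1).eq(ratings)      # unchanged by rounding to 1 d.p.
valid = ratings.isna() | (in_range & one_dp)

# one True/False per column
print(valid.all())
```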

### Parsed job location columns

In [37]:
parsed_loc_cols = [
    'api_citytownvilham', 
    'api_region', 
    'api_country', 
    'uk', 
    'remote'
    ]

In [38]:
# get info on the parsed_loc_cols columns in the jobs dataset dataframe
gdjobs_loc[parsed_loc_cols].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 722 entries, 0 to 726
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   api_citytownvilham  689 non-null    object
 1   api_region          689 non-null    object
 2   api_country         693 non-null    object
 3   uk                  722 non-null    bool  
 4   remote              722 non-null    bool  
dtypes: bool(2), object(3)
memory usage: 24.0+ KB


In [39]:
# check what the values in each location column look like
for i in parsed_loc_cols:
    print(gdjobs_loc[i].value_counts().head(),'\n')

London        375
Cambridge      47
Manchester     22
Edinburgh      17
Reading        13
Name: api_citytownvilham, dtype: int64 

London             386
East of England     80
South East          56
North West          40
Scotland            30
Name: api_region, dtype: int64 

England             641
Scotland             30
Northern Ireland     12
Wales                10
Name: api_country, dtype: int64 

False    708
True      14
Name: uk, dtype: int64 

False    710
True      12
Name: remote, dtype: int64 



The value counts of the columns created from parsing the scraped locations using the Ordnance Survey API look as expected. However, I'll turn the `api_region` and `api_country` columns into categorical data to reduce their memory usage.


In [40]:
# change the data type of the api_region and api_country columns to category
gdjobs_loc["api_region"] = gdjobs_loc["api_region"].astype("category")
gdjobs_loc["api_country"] = gdjobs_loc["api_country"].astype("category")

# get info on the parsed_loc_cols columns in the jobs dataset dataframe
gdjobs_loc[parsed_loc_cols].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 722 entries, 0 to 726
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   api_citytownvilham  689 non-null    object  
 1   api_region          689 non-null    category
 2   api_country         693 non-null    category
 3   uk                  722 non-null    bool    
 4   remote              722 non-null    bool    
dtypes: bool(2), category(2), object(1)
memory usage: 14.7+ KB


## Remove irrelevant jobs based on job titles

Before we begin our exploratory data analysis, we want to make sure we're analysing only data scientist roles. 

When you search glassdoor.co.uk for "data scientist" roles, jobs with wide-ranging titles are returned:

In [41]:
# run the code below as many times as you like 
# to get an impression of the wide-ranging job titles

# print 20 random job titles from the glassdoor data set
print(gdjobs_loc["job_title"].sample(20))

561                                            Data Scientist
338       Associate Laboratory Scientist - Clinical Pathology
351                                             Data Engineer
190          Data Scientist Degree Apprenticeship, Ware, 2021
456                                      Head of Data Science
559                                            Data Scientist
186    Transport Data Scientist, real world logistic analysis
544                                            Data Scientist
104                Senior Data Scientist - R&D Remote Sensing
208                       Senior Data Scientist (Forecasting)
556                                            Data Scientist
240                                     Senior Data Scientist
19                 Enzymology & Protein Engineering Scientist
95                                               Data Analyst
267                    Senior Data Scientist - Bioinformatics
680                                   Curious Data Scientists
371     

Having scanned the job titles (above), I have identified 6 categories of job titles returned by Glassdoor when searching for "data scientist":

1. Data Scientist
2. Data Analyst
3. Data Engineer
4. Machine Learning/AI Engineer or Specialist
5. Researcher
6. Intern/Apprentice

While all of these roles overlap, I want to focus my analysis on jobs that best match what is commonly considered to be the role of a data scientist: an interdisciplinary role that extracts knowledge and insights from structured and unstructured data using scientific methods, processes, and algorithms (often involving data mining, machine learning and big data).

I will use regular expressions to identify whether each job title in the dataset matches any of the 6 title categories identified above, and create boolean columns to record the match results for each so that they can be used as masks to filter the data. I will use the boolean masking to select and scan the descriptions of jobs in each of the title categories (listed above) to see if they reflect a distinct job role, and whether they fit the core data scientist role that I am interested in.
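
As a minimal sketch of this approach, the example below builds two boolean columns on a small made-up DataFrame and uses one of them as a mask. The titles and the `toy` name are hypothetical, and `case=False` is used here as an alternative to passing `flags=re.IGNORECASE`.

```python
import pandas as pd

# hypothetical job titles, for illustration only
toy = pd.DataFrame({"job_title": [
    "Senior Data Scientist",
    "Data Engineer",
    "DATA ANALYST",
    "Data Scientist / Data Engineer",
]})

# one boolean column per title category
toy["title_datascientist"] = toy["job_title"].str.contains(
    r"data scientist", case=False, regex=True)
toy["title_dataengineer"] = toy["job_title"].str.contains(
    r"data engineer", case=False, regex=True)

# a boolean column can be used directly as a row mask
print(toy.loc[toy["title_datascientist"], "job_title"].tolist())
```

Note that a single title can be True in more than one category, as the last title is here.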


In [42]:
def sample_description(jobsdf, col=None):
    """Print the title and description of a randomly selected job
    
    :param jobsdf: jobs data with 'job_title' and 'job_description' columns
    :type jobsdf: pandas.core.frame.DataFrame
    :param col: label of boolean column within jobsdf if you want a sample 
    job description from a subset of jobs (default None)
    :type col: str
    """
    # restrict to jobs that are True in the boolean column, if one is given;
    # otherwise sample from all jobs
    subset = jobsdf[jobsdf[col]] if col is not None else jobsdf
    
    # sample a single row so the title and description come from the same job
    job = subset.sample(1).iloc[0]
    
    print(f"\nDescription of randomly selected job with title, '{job['job_title']}':\n")
    print(job["job_description"])
       
    

### 1. Data Scientist

In [43]:
# use a regular expression to identify "data scientist" job titles and 
# create a boolean mask for this category
gdjobs_loc["title_datascientist"] = gdjobs_loc["job_title"].str.contains(
    r"data scientist", regex=True, flags=re.IGNORECASE)

# how many job titles indicate "data scientist" role? ("True" count)
print(
    f"{sum(gdjobs_loc.title_datascientist)} ({round((np.mean(gdjobs_loc.title_datascientist))*100)}%)",
    "job titles indicate 'data scientist' role"
)


474 (66%) job titles indicate 'data scientist' role


In [44]:
# display a random sample of job titles that include the string, "data scientist"
gdjobs_loc.loc[gdjobs_loc['title_datascientist'],'job_title'].sample(20)

149                                              Senior Data Scientist
75     Data Scientist | Python | Tensorflow | Deep Learning | Contract
656                                                     Data Scientist
190                   Data Scientist Degree Apprenticeship, Ware, 2021
659                                            Data Scientist, Digital
601                Power, Transmission and Distribution Data Scientist
130                                              Senior Data Scientist
718          Research Data Scientist, Intern - Infrastructure Strategy
421                                           Principal Data Scientist
600                                 Data Scientist - Marketing Science
196                                              Senior Data Scientist
266                                            Data Scientist (Senior)
156                                              Senior Data Scientist
455                   Senior Data Scientist - Electrical Power Systems
170   

In [45]:
# scan job descriptions; keep running this cell until you've reviewed enough samples
sample_description(jobsdf=gdjobs_loc, col="title_datascientist")


Description of randomly selected job with title, ' Senior Data Scientist':

We are notonthehighstreet.

We’re home to 5,000 phenomenal small creative businesses that we are proud to call our Partners. But, now more than ever, this community needs our support. So we’re doing all we can to shine a light on these dynamic entrepreneurs, waving the flag for small businesses and generally championing their socks off.

On top of our brilliant Partners, products and customers (not to mention our incredible team who have been busy beavering away from home since March), the last 12 months has seen great progress with our tech platform and customer experience. We are now looking to build on this momentum to drive our business to the next level. And that’s where you come in.

What we need

We’re looking for a Data Scientist to join our team to help us and our Partners continue to grow by using machine learning to improve the user experience of our site and our marketing.

Reporting into the Head 

Having looked at many random samples of job titles that include "data scientist", I found that the term is often used as a general label alongside a specialism. For example, "Data Scientist/Engineer" indicates a data engineer role, "Remote Data Scientist / Machine Learning Engineer" specifies a machine learning engineer role, "AI Ops Data Scientist" suggests a DevOps role specialising in AI, and "Data Scientist / Software Developer" would require someone with significant software development skills.

### 2. Data Analyst

In [46]:
# use a regular expression to identify "data analyst" job titles and 
# create a boolean mask for this category; note that "analy" also matches
# "analytics" and "analytical", so this category is broad
gdjobs_loc["title_dataanalyst"] = gdjobs_loc["job_title"].str.contains(
    r"analy", regex=True, flags=re.IGNORECASE)

# how many job titles indicate "data analyst" role? ("True" count)
print(
    f"{sum(gdjobs_loc.title_dataanalyst)} ({round((np.mean(gdjobs_loc.title_dataanalyst))*100)}%)",
    "job titles indicate a 'data analyst' role"
)

66 (9%) job titles indicate a 'data analyst' role


In [47]:
# display the job titles that fit the "data analyst" category
print(gdjobs_loc.loc[gdjobs_loc["title_dataanalyst"],'job_title'].sample(20))


55                                                     Data Analyst
504                                Analytical Outsourcing Scientist
128                                                    DATA ANALYST
466                         Data Analytics apprenticeship programme
407          Senior Data Scientist - Innovation, Advanced Analytics
627                               Data Scientist, Product Analytics
48         Data Scientist, Data Analyst, Data Visualisation, Python
497                                     Risk & Control Data Analyst
95                                                     Data Analyst
13                    Finance Data Analyst - Growth and Forecasting
375                                         Business Analyst (Data)
462                             Data & Analytics Consultant (m/f/d)
573                        Data Science Lead, Reliability Analytics
328              Scientific Data Analyst - Machine Learning, Python
429                          eCommerce Data Anal

In [48]:
# scan job descriptions; keep running this cell until you've reviewed enough samples
sample_description(jobsdf=gdjobs_loc, col="title_dataanalyst")



Description of randomly selected job with title, ' Senior Data Scientist - Innovation, Advanced Analytics':

Location
12 Endeavour Square, London, E20 1JN
Division
Strategy & Competition
The Role
There are few jobs where you can make a real difference to the 40 million consumers of financial products, the 2 million people who work in the UK Financial Services industry, and the stability of our economy as a whole. The FCA has three objectives. It is responsible for:
ensuring that markets operate with integrity;
promoting effective competition; and
protecting consumers of financial services.
The RegTech & Advanced Analytics department is a newly formed function in the FCA leading the development of an organisation-wide capability to support a more analytics-led regulatory approach. Its principle objectives are delivering business value through the application of pioneering advanced analytical techniques and championing a disruptive innovation culture across the FCA. You will be at the f

Having scanned the descriptions of the data analyst roles returned by this Glassdoor search for "data scientist" jobs, I found considerable overlap with the "data scientist" roles. However, they don't seem to mention machine learning as often as the "data scientist" roles do.
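
This impression could be checked roughly by comparing the share of descriptions that mention "machine learning" in each group. The sketch below does this on made-up data; the `toy` DataFrame, its descriptions and the `mentions_ml` column are hypothetical, and with the real data `gdjobs_loc` would take the place of `toy`.

```python
import pandas as pd

# hypothetical jobs, for illustration only
toy = pd.DataFrame({
    "title_dataanalyst": [True, True, False, False],
    "job_description": [
        "Build dashboards and reports in SQL.",
        "Analyse sales data; some machine learning experience is a plus.",
        "Develop machine learning models in Python.",
        "Deploy Machine Learning pipelines to production.",
    ],
})

# flag descriptions that mention machine learning (case-insensitive)
toy["mentions_ml"] = toy["job_description"].str.contains(
    r"machine learning", case=False, regex=True)

# the mean of a boolean column is the proportion of True values
analyst_share = toy.loc[toy["title_dataanalyst"], "mentions_ml"].mean()
other_share = toy.loc[~toy["title_dataanalyst"], "mentions_ml"].mean()

print(f"analyst titles: {analyst_share:.0%}, other titles: {other_share:.0%}")
```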

### 3. Data Engineer

In [49]:
# use a regular expression to identify "data engineer" job titles and 
# create a boolean column to record whether or not each job title fits this category
gdjobs_loc["title_dataengineer"] = gdjobs_loc["job_title"].str.contains(
    r"data engineer|devops", regex=True, flags=re.IGNORECASE)

# how many job titles indicate "data engineer" role? ("True" count)
print(
    f"{sum(gdjobs_loc.title_dataengineer)} ({round((np.mean(gdjobs_loc.title_dataengineer))*100)}%)",
    "job titles indicate a 'data engineer' role\n"
)


60 (8%) job titles indicate a 'data engineer' role



In [50]:
# display the job titles that fit the "data engineer" category
print(gdjobs_loc.loc[gdjobs_loc["title_dataengineer"],'job_title'].sample(20))

288                       Data Engineer
284                       Data Engineer
365                       Data Engineer
478        Data Engineer - Test Analyst
393    Data Engineer (Machine Learning)
398                       DATA ENGINEER
183                 Cloud Data Engineer
364                       Data Engineer
333                       Data Engineer
499                       Data Engineer
276                       Data Engineer
418     Big Data Engineer (Python Team)
270                       Data Engineer
436        Data Engineer (12 month FTC)
498                       Data Engineer
388                Senior Data Engineer
287                       Data Engineer
339                       Data Engineer
440                       Data Engineer
275                       Data Engineer
Name: job_title, dtype: object


In [51]:
# scan job descriptions; keep running this cell until you've reviewed enough samples
sample_description(jobsdf=gdjobs_loc, col="title_dataengineer")



Description of randomly selected job with title, ' Data Engineer':

DATA ENGINEER
READING (REMOTE WORKING)
UP TO £60,000
Harnham are partnered with a global FMCG company who are looking for a data engineer to join their R&D team to help productionise their ML models.
THE COMPANY
This company are a global leader in selling consumer products. They have offices across the world that work closely with each other to produce consistent results. This company ingests huge volumes of data due to their size and require a data engineer to help build pipelines to ingest, transform and distribute data. Their UK R&D team is based in Reading and are working with Computer Vision to detect faults with products on their supply chains.
THE ROLE
You will sit within the R&D team, working with cutting-edge technology on Data Science specific projects.
You will be building data pipelines to ingest, transform and distribute data according to R&D initiatives
You will be deploying ML models and solutions devel

Scanning the description of the "Data Engineer" jobs indicates that this role is distinct from a data scientist job; data engineers know how to build an effective data architecture, streamline data processing, and maintain large-scale data systems. In addition to working with Python or R, they likely also work with other languages to create data engineering pipelines, automate common file system tasks, and build high-performance databases. They also need to know how to use cloud and big data tools such as AWS Boto, PySpark, Spark SQL, and MongoDB, to create and query databases, wrangle data, and configure schedules to run pipelines. They need database, scripting, and process skills. We will exclude these positions from our analysis of data science jobs.

### 4. Machine Learning/AI Specialist or Engineer

In [52]:
# use a regular expression to identify "machine learning/AI specialist/engineer" job titles and 
# create a boolean column to record whether or not each job title fits this category
# ("machine learning" already covers "machine learning engineer" and "machine learning scientist")
gdjobs_loc["title_mlai"] = gdjobs_loc["job_title"].str.contains(
    r"machine learning|ml engineer|\bai\b|artificial intelligence", 
    regex=True, 
    flags=re.IGNORECASE
)

# how many job titles indicate "machine learning/AI specialist/engineer" role? ("True" count)
print(
    f"{sum(gdjobs_loc.title_mlai)} ({round((np.mean(gdjobs_loc.title_mlai))*100)}%)",
    "job titles indicate a 'machine learning/AI specialist/engineer' role\n"
)


42 (6%) job titles indicate a 'machine learning/AI specialist/engineer' role



In [53]:
# display the job titles that fit the "machine learning/AI specialist/engineer" category
print(gdjobs_loc.loc[gdjobs_loc["title_mlai"], 'job_title'].sample(20))

412                                               Data and ML Engineer
40                                  Applied Machine Learning Scientist
403                                Machine Learning Scientist (London)
268                  Senior Data Scientist/ Machine Learning Developer
254                                      Visiting Scientist, AI (EMEA)
368               Applied Scientist in Machine Learning for Simulation
481                         AI Scientist - Natural Language Processing
7       Senior Data Scientist- (Machine Learning & Advanced Analytics)
225    Outside IR35 | Data Scientist | AI | Python | Contract | London
228                     Data Scientist - Machine Learning (Python/SQL)
419                           Data Science - Machine Learning Research
515                     Principal Research Scientist: Machine Learning
328                 Scientific Data Analyst - Machine Learning, Python
39                                   Data Scientist - Machine Learning
618   

In [54]:
# scan job descriptions; keep running this cell until you've reviewed enough samples
sample_description(jobsdf=gdjobs_loc, col="title_mlai")


Description of randomly selected job with title, ' Senior AI Data Scientist/Engineer':

Senior AI Data Scientist/Engineer London (current remote)
Permanent
Up to £80,000 perm annum

I am working with a leading data consultancy who have recently gone through a huge period of growth allowing them to invest in their AI and Data Science practices.

You will have the chance to work alongside a a team of talented software engineers as well as developing team of data science and AI professionals. You will be able to get stuck into some really exciting projects with a focus on those core data science principles.

As a Senior in the team you will have a say in projects planning and work closely with the technical architects and lead to successfully deliver projects across a range of technologies in your space. You will ideally be a seasoned data science professional who is ready to take that next step in their career.

Skills:
Proficient in designing, building, testing and maintaining producti

The descriptions of jobs with titles that include 'machine learning' or 'artificial intelligence', sometimes alongside the term 'engineer', tend to involve researching, designing, building, testing and optimising machine learning/AI algorithms, models and systems that can learn and be used to make predictions. However, since our Glassdoor search was for UK 'data scientist' jobs, the jobs in this category almost always overlap with the general data scientist roles that don't mention ML/AI in the title, many of which still ask for machine learning knowledge and skills.
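
One rough way to see how much two title categories overlap is to cross-tabulate their boolean columns. The sketch below uses made-up titles and a shortened ML/AI pattern, so the counts are purely illustrative; it only measures overlap in titles, not in the job descriptions.

```python
import pandas as pd

# hypothetical job titles, for illustration only
toy = pd.DataFrame({"job_title": [
    "Data Scientist - Machine Learning",
    "Machine Learning Engineer",
    "Senior Data Scientist",
    "AI Data Scientist",
]})
toy["title_datascientist"] = toy["job_title"].str.contains(
    r"data scientist", case=False, regex=True)
toy["title_mlai"] = toy["job_title"].str.contains(
    r"machine learning|\bai\b", case=False, regex=True)

# rows: title_mlai, columns: title_datascientist; cells are job counts
overlap = pd.crosstab(toy["title_mlai"], toy["title_datascientist"])
print(overlap)

# number of titles that fall into both categories
both = (toy["title_mlai"] & toy["title_datascientist"]).sum()
print(f"{both} titles match both categories")
```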

### 5. Data Science or Machine Learning Research Scientist

In [55]:
# use a regular expression to identify "data science or ML research scientist" job titles 
# and create a boolean column to record whether or not each job title fits this category
gdjobs_loc["title_research"] = gdjobs_loc["job_title"].str.contains(
    r"research", regex=True, flags=re.IGNORECASE)

# how many job titles indicate "data science or ML research scientist" role? ("True" count)
print(
    f"{sum(gdjobs_loc.title_research)} ({round((np.mean(gdjobs_loc.title_research))*100)}%)",
    "job titles indicate a 'data science or ML research scientist' role\n"
)


25 (3%) job titles indicate a 'data science or ML research scientist' role



In [56]:
# display the job titles that fit the "data science or ML research scientist" category
print(gdjobs_loc.loc[gdjobs_loc["title_research"],"job_title"].sample(20))


73                                                                  Research Officer and Data Scientist
493                                                                         Engineer Research Scientist
41                                                              Data Scientist / Operational Researcher
410                                                      Senior Data Scientist / Operational Researcher
726                                                                    Researcher/Data Scientist - QMUL
447    Research Associate III/Senior Research Associate Data Collection Peri- and Post Approval Studies
465                                                            Research Scientist Video Compression R&D
25                                                                Research Scientist Evidence Synthesis
419                                                            Data Science - Machine Learning Research
477                                         Imaging Research Sci

In [57]:
# scan job descriptions; keep running this cell until you've reviewed enough samples
sample_description(jobsdf=gdjobs_loc, col="title_research")


Description of randomly selected job with title, ' Research Associate III/Senior Research Associate Data Collection Peri- and Post Approval Studies':

Research Associate III (RAIII)/Senior Research Associate (SRA)- Data Collection– Peri- and Post Approval Studies
*We are looking to fill this role in our London, UK office; we will consider other locations based on the candidates’ experience and qualifications
The Team
Evidera has been providing epidemiology, data analytics, and outcomes research services to clients in the biopharmaceutical industry for over 19 years. The Peri- and Post Approval Studies team supports pharmaceutical/ biotechnology/ medical device companies in the design and conduct of real-world observational studies throughout the product lifecycle, from early pre-launch planning to launch and post-marketing management. Our focus is on helping our clients identify evidence gaps and rapidly build epidemiologic and economic evidence to demonstrate the effectiveness, safet

Jobs that include 'research' in the title tend to be more specialised and technical, often require the candidate to have a PhD, and involve researching and devising novel solutions/algorithms for difficult machine learning/deep learning/AI problems. This category also includes jobs that require in-depth knowledge or experience of a particular domain, e.g. operations research or biomedical data. Some are university positions; others sit within interdisciplinary data science teams in businesses.

### 6. Data Science Intern/Apprentice

In [58]:
# use a regular expression to identify "data science intern/apprentice" job titles and 
# create a boolean column to record whether or not each job title fits this category
gdjobs_loc["title_internapprentice"] = gdjobs_loc["job_title"].str.contains(
    r"\binternship|\bintern\b|\bapprentic", regex=True, flags=re.IGNORECASE)

# how many job titles indicate "data science intern/apprentice" role? ("True" count)
print(
    f"{sum(gdjobs_loc.title_internapprentice)} ({round((np.mean(gdjobs_loc.title_internapprentice))*100)}%)",
    f"job titles indicate a 'data science intern/apprentice' role\n"
)


17 (2%) job titles indicate a 'data science intern/apprentice' role
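
The word boundaries (`\b`) in the pattern above matter because "intern" is also the start of unrelated words. The sketch below compares the bounded pattern with a looser one on a few made-up titles; the titles and the `pattern_loose` variant are hypothetical and only serve to show the difference.

```python
import pandas as pd

# hypothetical job titles, for illustration only
titles = pd.Series([
    "Data Science Intern",
    "International Data Scientist",
    "Data Scientist Degree Apprenticeship",
    "Internal Audit Data Analyst",
])

pattern_bounded = r"\binternship|\bintern\b|\bapprentic"  # pattern used above
pattern_loose = r"intern|apprentic"                       # no word boundaries

bounded = titles.str.contains(pattern_bounded, case=False, regex=True)
loose = titles.str.contains(pattern_loose, case=False, regex=True)

# the loose pattern also flags "International" and "Internal"
print(pd.DataFrame({"job_title": titles, "bounded": bounded, "loose": loose}))
```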



In [59]:
# display the job titles that fit the "data science intern/apprentice" category
print(gdjobs_loc.loc[gdjobs_loc["title_internapprentice"],'job_title'].head())

43                                              Internship- Data Science
159           Data Scientist Degree Apprenticeship, Barnard Castle, 2021
179                Data Scientist Degree Apprenticeship, Stevenage, 2021
188    Data Scientist Degree Apprenticeship, GSK House (Brentford), 2021
190                     Data Scientist Degree Apprenticeship, Ware, 2021
Name: job_title, dtype: object


In [60]:
# scan job descriptions; keep running this cell until you've reviewed enough samples
sample_description(jobsdf=gdjobs_loc, col="title_internapprentice")


Description of randomly selected job with title, ' Data Scientist Degree Apprenticeship, GSK House (Brentford), 2021':

Site Name: UK - London - Brentford
Posted Date: Nov 25 2020

Exciting minds with our Data Science (R&D) Apprenticeship, Brentford

Education required:
5 GCSEs or equivalent grade 9-5 (A*-C) including Maths and English Language (not Literature).
Must have/be predicted 96 UCAS points from your top 3 A levels, each at grade C and above.
Must have A level Maths at grade C (or above) alternatively A level Computer Science at grade B (or above) plus Maths GCSE at grade 7-9
Start date: September 2021

Assessment centre dates: March 2021

We accept ongoing applications and will close this vacancy once we have enough applications, so its best to apply as soon as possible as we do not want you to miss out!

We want you to be motivated and passionate for the Apprenticeship you apply to we will only accept ONE Apprenticeship application per candidate each year. Please do your re

The data science internships and apprenticeships that come up in our search for data science jobs fall into 3 types of role:
- essentially a full data scientist role for new graduates, likely so the company can trial the graduate before hiring
- mid-degree industry placement roles
- apprenticeships for people leaving school with science A-levels (including a company-sponsored data science degree)


Having scanned many of the job descriptions in the Glassdoor "data scientist" jobs data set, I believe the Data Scientist role to be best represented by the job titles falling into the following categories *only*:

- Data Scientist
- Data Analyst
- Machine Learning/AI Specialist or Engineer



But not those that fall under the following:

- Data Engineer
- Researcher
- Intern/Apprentice

In [61]:
# create a boolean mask for jobs that have titles that fall into the following categories: 
# "data scientist", "data analyst", "machine learning/ai specialist or engineer",
# but not into "data engineer", "researcher" or "intern/apprentice"
keep_mask = (
    gdjobs_loc["title_datascientist"]
    | gdjobs_loc["title_mlai"]
    | gdjobs_loc["title_dataanalyst"]
)
drop_mask = (
    gdjobs_loc["title_internapprentice"]
    | gdjobs_loc["title_dataengineer"]
    | gdjobs_loc["title_research"]
)
gdjobs_loc["datascience_role"] = keep_mask & ~drop_mask

# how many job titles indicate a data scientist role? ("True" count)
print(
    f"{sum(gdjobs_loc.datascience_role)} ({round((np.mean(gdjobs_loc.datascience_role))*100)}%)",
    f"job titles indicate a data scientist role (data scientists, analysts and ML/AI specialists/engineers)\n"
)

# create a dataframe of data science roles only
# (.copy() makes dsjobs independent of gdjobs_loc rather than a slice of it)
dsjobs = gdjobs_loc[gdjobs_loc["datascience_role"]].copy()


511 (71%) job titles indicate a data scientist role (data scientists, analysts and ML/AI specialists/engineers)



In [62]:
dsjobs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 511 entries, 0 to 725
Data columns (total 33 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   job_title               511 non-null    object  
 1   salary_estimate         322 non-null    object  
 2   job_description         511 non-null    object  
 3   rating                  399 non-null    float64 
 4   company_name            511 non-null    object  
 5   location                511 non-null    object  
 6   size                    415 non-null    category
 7   founded                 348 non-null    Int64   
 8   type_of_ownership       428 non-null    category
 9   industry                381 non-null    category
 10  sector                  383 non-null    category
 11  revenue                 271 non-null    category
 12  rating_culturevalues    389 non-null    float64 
 13  rating_worklifebalance  394 non-null    float64 
 14  rating_diversity        29

## Save Data

Now that the data have been cleaned, they are ready for feature engineering and analysis! Since some jobs were removed from the data set during the cleaning process, the DataFrame's index needs to be reset before it is saved.

In [63]:
# reset index; 'drop' stops old index being inserted as a column
dsjobs = dsjobs.reset_index(drop=True)

In [64]:
# save the cleaned dataframe 
# as a .csv file
dsjobs.to_csv(os.path.join(path, f'dsjobs_df_{scrapedate}_postclean.csv'), encoding='utf-8')
# as a .pkl file which preserves data types (better for processing steps)
dsjobs.to_pickle(os.path.join(path, f'dsjobs_df_{scrapedate}_postclean.pkl'))
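
The difference between the two formats can be seen in a small round trip. The sketch below writes a made-up one-column DataFrame to a temporary directory in both formats and reads it back; the `toy` data and file names are hypothetical.

```python
import os
import tempfile

import pandas as pd

# hypothetical data with a categorical column
toy = pd.DataFrame({"region": pd.Series(["London", "Wales", "London"], dtype="category")})

with tempfile.TemporaryDirectory() as tmpdir:
    csv_path = os.path.join(tmpdir, "toy.csv")
    pkl_path = os.path.join(tmpdir, "toy.pkl")

    # save in both formats, as above
    toy.to_csv(csv_path, encoding="utf-8")
    toy.to_pickle(pkl_path)

    # read both files back in
    from_csv = pd.read_csv(csv_path, index_col=0)
    from_pkl = pd.read_pickle(pkl_path)

# the CSV round trip loses the category dtype; the pickle keeps it
print("CSV dtype:   ", from_csv["region"].dtype)
print("pickle dtype:", from_pkl["region"].dtype)
```

When reading the CSV back, the data types would need to be set again, for example via the `dtype` argument of `pd.read_csv`.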
