# Capstone Project 3: 
## Part I: Extracting Relevant Information from Scraped LinkedIn Jobs Data

### Project motivation and details about the data used

Searching for a job has become a reality for many people in the current economic environment. An efficient and smart job search is critical for one's success, since it can save a lot of time and provide a significant edge over the competition. Although the data we have collected and will be working with here could be used to answer various questions, the focus of this project is whether we can predict which postings will attract few applicants while still meeting our predefined selection criteria.

The four main stages of our application process are as follows:
- Process and analyze job postings from the past week, collected over several consecutive weeks
- Use this data to create a model that predicts the positions with the smallest number of applicants per day
- Do a daily search and use the above model to narrow the results to postings which meet our criteria and are expected to attract the fewest applicants in the near future
- Do due diligence and apply only to these positions.

This will save us an enormous amount of time and will give us a significant edge over the competition.

In this project, we will complete only the first two steps. In Part I of the project, in particular, we will extract relevant information from the LinkedIn job postings, which have been scraped with a home-built Selenium web scraper. The scraper code is posted in the current repository.

The data we will be working with is the result of scraping LinkedIn job postings. The criteria for the job search are given below:
- Job title: Data Scientist
- Time of post: Past Week
- Location: separate searches are performed for 16 USA metropolitan areas
- Job type: full-time
- Experience level: separate searches are performed for each of the three possible seniority levels - entry, associate, and senior

The collected data is an accumulation of scraped job postings for five consecutive weeks starting Feb. 5, 2021.

In [1]:
# import relevant libraries and packages

import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set(style = 'whitegrid', font_scale = 1.8)

In [2]:
# ignore warnings

import warnings
warnings.filterwarnings('ignore')

In [3]:
# load scraped data

file_read = 'data/jobs_ds_s1_s5'

data = pd.read_excel(file_read + '.xlsx')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10167 entries, 0 to 10166
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Job Title                 10167 non-null  object
 1   Company Name              10167 non-null  object
 2   Location                  10167 non-null  object
 3   Metro Area                10167 non-null  object
 4   Time Posted               10167 non-null  object
 5   Number of Applicants      10167 non-null  object
 6   Industry Info             10167 non-null  object
 7   Education-Bachelor        10167 non-null  int64 
 8   Education-Master          10167 non-null  int64 
 9   Education-Doctor          10167 non-null  int64 
 10  Experience                10167 non-null  object
 11  Data Science Terms Count  10167 non-null  int64 
 12  Seniority Level           10167 non-null  object
dtypes: int64(4), object(9)
memory usage: 1.0+ MB


In [4]:
data.head()

Unnamed: 0,Job Title,Company Name,Location,Metro Area,Time Posted,Number of Applicants,Industry Info,Education-Bachelor,Education-Master,Education-Doctor,Experience,Data Science Terms Count,Seniority Level
0,Data Scientist,Honeywell,"Atlanta, GA",ATL,2 days ago,40 applicants,"['Seniority level', 'Entry level', 'Employment...",0,1,1,['2 years'],7,entry
1,ENTRY LEVEL (Data Analyst/Data Scientist),SynergisticIT,"Atlanta, GA",ATL,4 hours ago,Be among the first 25 applicants,"['Seniority level', 'Entry level', 'Employment...",1,1,0,['10 years'],2,entry
2,Data Scientist,Inspire Brands,"Atlanta, GA",ATL,7 hours ago,25 applicants,"['Seniority level', 'Entry level', 'Employment...",0,0,0,['1-3 years'],4,entry
3,Associate Data Scientist,The Home Depot,"Atlanta, GA",ATL,22 hours ago,29 applicants,"['Seniority level', 'Entry level', 'Employment...",0,1,0,['2+ years'],3,entry
4,Associate Data Scientist,Edelman Data & Intelligence (DxI),"Atlanta, GA",ATL,1 hour ago,Be among the first 25 applicants,"['Seniority level', 'Entry level', 'Employment...",0,0,0,[],3,entry


In [5]:
data.tail()

Unnamed: 0,Job Title,Company Name,Location,Metro Area,Time Posted,Number of Applicants,Industry Info,Education-Bachelor,Education-Master,Education-Doctor,Experience,Data Science Terms Count,Seniority Level
10162,Senior Machine Learning Engineer,Harnham,San Francisco Bay Area,SF,4 days ago,49 applicants,"['Seniority level', 'Mid-Senior level', 'Emplo...",0,0,0,['5+ years'],9,senior
10163,Senior Machine Learning Engineer,Harnham,San Francisco Bay Area,SF,3 days ago,74 applicants,"['Seniority level', 'Mid-Senior level', 'Emplo...",0,1,0,[],15,senior
10164,Data Engineer,Harnham,San Francisco Bay Area,SF,4 days ago,104 applicants,"['Seniority level', 'Mid-Senior level', 'Emplo...",0,0,0,[],0,senior
10165,Senior Bioinformatics Scientist,Harnham,San Francisco Bay Area,SF,6 days ago,Be among the first 25 applicants,"['Seniority level', 'Mid-Senior level', 'Emplo...",0,0,1,[],1,senior
10166,Senior Data Engineer,Harnham,San Francisco Bay Area,SF,18 hours ago,42 applicants,"['Seniority level', 'Mid-Senior level', 'Emplo...",0,0,0,[],0,senior


The data consists of 10167 records. There are five features which contain unprocessed and ambiguous data: 'Job Title', 'Time Posted', 'Number of Applicants', 'Industry Info', and 'Experience'. We need to extract the relevant information from all of these features.

- **Extracting Information from 'Job Title'**

In [6]:
# get 'Job Title' count as percentage

count_title = data['Job Title'].value_counts(normalize = True) * 100
print(round(count_title, 2))

Data Engineer                                                         7.86
Data Scientist                                                        5.58
FT Real Estate Researcher (Work From Home)                            4.80
Senior Data Engineer                                                  4.18
Senior Data Scientist                                                 3.07
                                                                      ... 
Research Scientist I - GMP                                            0.01
Data Scientist - Multiple Positions (2)                               0.01
Sr. Scientist, Precision Informatics                                  0.01
Sr. Research Scientist - AI/ML Computer Graphics & Computer Vision    0.01
Data Engineer - Los Angeles                                           0.01
Name: Job Title, Length: 3636, dtype: float64


As expected, there are a large number of different job titles (3636) which is not suitable for our analysis and modeling. We will replace the current titles with more conventional and common titles where we can find a correct match.
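The matching idea can be sketched on a toy column. The three-row `toy` frame and the two-item `common` list below are made up for illustration and are not part of the scraped data; every title that contains a common title, ignoring case, is collapsed to that common title.

```python
import re
import pandas as pd

# hypothetical toy data, not taken from the scraped file
toy = pd.DataFrame({'Job Title': ['Senior Data Scientist',
                                  'Data Engineer - Los Angeles',
                                  'Usability Researcher']})
common = ['Data Scientist', 'Data Engineer']

for title in common:
    # case-insensitive substring match; re.escape guards against regex metacharacters
    mask = toy['Job Title'].str.contains(re.escape(title), case=False, regex=True)
    toy.loc[mask, 'Job Title'] = title

print(toy['Job Title'].tolist())
```

Titles without a match are left unchanged. Note that the order of the list matters: an earlier item claims a row first, and a row is overwritten again only if its new value still contains a later item.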

In [7]:
# print top 50 to create list of titles

n_pts = 50
print(round(count_title.iloc[0:n_pts], 2))

Data Engineer                                                           7.86
Data Scientist                                                          5.58
FT Real Estate Researcher (Work From Home)                              4.80
Senior Data Engineer                                                    4.18
Senior Data Scientist                                                   3.07
Machine Learning Engineer                                               1.46
Senior Machine Learning Engineer                                        0.93
Research Scientist                                                      0.81
Lead Data Scientist                                                     0.72
UX Researcher                                                           0.70
Sr. Data Engineer                                                       0.68
Azure Data Engineer                                                     0.68
Sr. Data Scientist                                                      0.66

In [8]:
# create list of common job titles
# 'Architect' will later be replaced with 'Data Architect'!

list_titles = ['Data Scientist', 'Data Science Researcher', 'Data Science Manager', 'Data Science Engineer', 
               'Data Science Lead', 'Data Engineer', 'Data Specialist', 'Machine Learning Engineer', 
               'Machine Learning Researcher', 'Machine Learning Scientist', 'Research Scientist', 
               'Principal Scientist', 'Senior Scientist', 'Laboratory Scientist', 'Applied Scientist', 
               'UX Researcher', 'User Experience Researcher', 'User Researcher', 'Design Researcher', 
               'Quantitative Researcher', 'Business Researcher', 'Market Researcher', 'Real Estate Researcher', 
               'Senior Researcher', 'Lead Researcher', 'Software Engineer', 'Software Developer', 'Analyst', 
               'Consultant', 'Manager', 'Director', 'Architect']

In [9]:
# define function to replace values of a given data feature --> use a case-insensitive regex match to check for each item
# keep in mind that we will be using this function for several different features

def replace_feat_values(data, data_feature, list_items):
    
    for item in list_items:
        
        # rows whose current value contains the item (case-insensitive)
        mask = data[data_feature].str.contains(item, case = False, regex = True)
        # .loc avoids chained assignment, which may fail to modify the original data
        data.loc[mask, data_feature] = item

In [10]:
# replace titles

data_feature = 'Job Title'
list_items = list_titles

replace_feat_values(data, data_feature, list_items)

In [11]:
# get the new 'Job Title' count as percentage

count_title = data['Job Title'].value_counts(normalize = True) * 100
print(round(count_title, 2))

Data Engineer                                     32.54
Data Scientist                                    26.91
Research Scientist                                 6.56
Machine Learning Engineer                          5.37
Real Estate Researcher                             4.80
                                                  ...  
"Usability Researcher" or "User Research"          0.01
Associate Artificial Intelligence Researcher       0.01
Researcher, Digital UX                             0.01
Post-Doctoral Researcher, Biology - PA14760798     0.01
Development Associate III - San Diego, CA          0.01
Name: Job Title, Length: 781, dtype: float64


After replacing the original titles with more common titles, the number of unique titles was reduced to 781. However, there are still too many titles which are specific to a single posting. Let's look again in more detail at the top 50.

In [12]:
# print the top 50

n_pts = 50
print(round(count_title.iloc[0:n_pts], 2))

Data Engineer                            32.54
Data Scientist                           26.91
Research Scientist                        6.56
Machine Learning Engineer                 5.37
Real Estate Researcher                    4.80
UX Researcher                             2.09
Manager                                   1.99
Analyst                                   1.41
Architect                                 1.11
Consultant                                0.70
User Experience Researcher                0.67
Software Engineer                         0.57
Quantitative Researcher                   0.53
Senior Scientist                          0.39
Design Researcher                         0.37
Laboratory Scientist                      0.33
User Researcher                           0.32
Data Science Engineer                     0.30
Principal Scientist                       0.27
Director                                  0.27
Applied Scientist                         0.26
Postdoctoral 

Some records have near-duplicate titles (for example 'User Experience Researcher' and 'UX Researcher'), and some could still be replaced with more common titles.

In [13]:
# replace some of the remaining titles

dict_titles = {'Data Science Lead': 'Data Scientist', 
               'User Experience Researcher':'UX Researcher', 
               'User Researcher':'UX Researcher', 
               'Architect':'Data Architect'}

# assign the result back instead of using inplace on a selected column
data['Job Title'] = data['Job Title'].replace(dict_titles)

In [14]:
# get again the count of new job titles as percentage

count_title = data['Job Title'].value_counts(normalize = True) * 100
print(round(count_title, 2))

Data Engineer                                                           32.54
Data Scientist                                                          27.14
Research Scientist                                                       6.56
Machine Learning Engineer                                                5.37
Real Estate Researcher                                                   4.80
                                                                        ...  
Solution Associate - Electric Power and Natural Gas, Power Solutions     0.01
"Usability Researcher" or "User Research"                                0.01
Associate Artificial Intelligence Researcher                             0.01
Researcher, Digital UX                                                   0.01
Development Associate III - San Diego, CA                                0.01
Name: Job Title, Length: 778, dtype: float64


We still have too many titles. We will need to set a certain representation threshold and examine the number of titles above and below that threshold. Here, we will set this threshold at 0.1%.

In [15]:
# print the titles at or above the 0.1% threshold
# use the same comparison (>=) for the listing, the count, and the percentage

titles_above = count_title[count_title >= 0.1]

print(round(titles_above, 2))
print('\n')
print('Number of titles above the 0.1% threshold: ', len(titles_above))
print('Percentage of all records: ', round(titles_above.sum(), 2))

Data Engineer                  32.54
Data Scientist                 27.14
Research Scientist              6.56
Machine Learning Engineer       5.37
Real Estate Researcher          4.80
UX Researcher                   3.08
Manager                         1.99
Analyst                         1.41
Data Architect                  1.11
Consultant                      0.70
Software Engineer               0.57
Quantitative Researcher         0.53
Senior Scientist                0.39
Design Researcher               0.37
Laboratory Scientist            0.33
Data Science Engineer           0.30
Principal Scientist             0.27
Director                        0.27
Applied Scientist               0.26
Postdoctoral Researcher         0.24
Researcher                      0.23
Senior Researcher               0.22
Market Researcher               0.21
Machine Learning Scientist      0.21
Machine Learning Researcher     0.20
Name: Job Title, dtype: float64


Number of titles above the 0.1% threshold

The number of titles above the 0.1% threshold is 25. These titles represent 89.3% of the data and all of them are distinct and sufficiently relevant to be included in our analysis as individual values. The rest of the job titles will be combined in one category denoted as 'Other'.
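A minimal sketch of lumping rare categories into 'Other', shown on a made-up series (the `toy` data and the 10% threshold are assumptions chosen so that the effect is visible in 20 rows; the notebook uses 0.1% on 'Job Title'):

```python
import pandas as pd

# hypothetical toy column: 'A' dominates, 'C' appears once in 20 rows (5%)
toy = pd.Series(['A'] * 15 + ['B'] * 4 + ['C'], name='label')

threshold = 10  # percent; illustrative only
share = toy.value_counts(normalize=True) * 100
rare = share[share < threshold].index

# keep values at or above the threshold, replace the rest with 'Other'
lumped = toy.where(~toy.isin(rare), 'Other')
print(lumped.value_counts().to_dict())
```

Using `where` returns a new series, so the original column stays intact until the result is assigned back.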

In [16]:
# create a filter for titles below the 0.1% threshold

titles_other = count_title[count_title.values < 0.1].index

mask_titles_other = data['Job Title'].isin(titles_other)

# replace all titles under this filter with 'Other'
data.loc[mask_titles_other, 'Job Title'] = 'Other'

In [17]:
# print the new job title count as percentage

count_title = data['Job Title'].value_counts(normalize = True) * 100
print(round(count_title, 2))
print('\n')
print('Number of remaining titles: ', len(count_title))

Data Engineer                  32.54
Data Scientist                 27.14
Other                          10.74
Research Scientist              6.56
Machine Learning Engineer       5.37
Real Estate Researcher          4.80
UX Researcher                   3.08
Manager                         1.99
Analyst                         1.41
Data Architect                  1.11
Consultant                      0.70
Software Engineer               0.57
Quantitative Researcher         0.53
Senior Scientist                0.39
Design Researcher               0.37
Laboratory Scientist            0.33
Data Science Engineer           0.30
Principal Scientist             0.27
Director                        0.27
Applied Scientist               0.26
Postdoctoral Researcher         0.24
Researcher                      0.23
Senior Researcher               0.22
Machine Learning Scientist      0.21
Market Researcher               0.21
Machine Learning Researcher     0.20
Name: Job Title, dtype: float64


Numb

There are now 26 remaining titles which is an appropriate number for the rest of our analysis and modeling. The one drawback we can think of here is that the postings with job title 'Other' comprise 10.7% of all data. Since these records include a large number of diverse and likely incompatible positions, this could potentially cause problems with our modeling of the data. However, in the data processing and EDA section there will be additional data filtering which could possibly reduce the number of such records to a smaller portion of the data.

- **Extracting Information from 'Time Posted'**

In [18]:
# print 'Time Posted' value count

print(data['Time Posted'].value_counts())

4 days ago        2007
3 days ago        1452
5 days ago        1346
2 days ago        1183
6 days ago        1086
                  ... 
24 minutes ago       1
59 minutes ago       1
48 minutes ago       1
35 minutes ago       1
17 minutes ago       1
Name: Time Posted, Length: 71, dtype: int64


In [19]:
# print subsets of the 'Time Posted' value count

n_1 = 0
n_2 = 20
print(data['Time Posted'].value_counts().iloc[n_1:n_2])

4 days ago      2007
3 days ago      1452
5 days ago      1346
2 days ago      1183
6 days ago      1086
1 day ago        635
7 days ago       618
6 hours ago      129
4 hours ago      115
3 hours ago      113
7 hours ago      110
5 hours ago      109
8 hours ago       92
2 hours ago       87
12 hours ago      81
16 hours ago      79
9 hours ago       74
13 hours ago      73
18 hours ago      72
10 hours ago      69
Name: Time Posted, dtype: int64


In [20]:
n_1 = 20
n_2 = 40
print(data['Time Posted'].value_counts().iloc[n_1:n_2])

11 hours ago      65
19 hours ago      62
14 hours ago      60
17 hours ago      58
15 hours ago      55
22 hours ago      54
23 hours ago      52
20 hours ago      44
1 hour ago        41
24 hours ago      38
21 hours ago      37
1 week ago        12
44 minutes ago     4
25 minutes ago     3
57 minutes ago     3
30 minutes ago     3
55 minutes ago     3
21 minutes ago     2
58 minutes ago     2
29 minutes ago     2
Name: Time Posted, dtype: int64


In [21]:
print(data['Time Posted'].value_counts().iloc[n_2:])

46 minutes ago    2
42 minutes ago    2
45 minutes ago    2
32 minutes ago    2
10 minutes ago    2
49 minutes ago    2
31 minutes ago    1
6 minutes ago     1
56 minutes ago    1
18 minutes ago    1
28 minutes ago    1
36 minutes ago    1
9 minutes ago     1
27 minutes ago    1
34 minutes ago    1
3 weeks ago       1
54 minutes ago    1
47 minutes ago    1
39 minutes ago    1
40 minutes ago    1
41 minutes ago    1
51 minutes ago    1
16 minutes ago    1
4 weeks ago       1
22 minutes ago    1
20 minutes ago    1
24 minutes ago    1
59 minutes ago    1
48 minutes ago    1
35 minutes ago    1
17 minutes ago    1
Name: Time Posted, dtype: int64


'Time Posted' contains values expressed in 'minute(s)', 'hour(s)', 'day(s)', and 'week(s)'. We will convert each value to a number of days as follows:
- minutes - 0 days
- hours - 1 day
- days - kept as is
- weeks - 7 days per week
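The conversion rules can be sketched as a small standalone function. The name `time_posted_to_days` and the example strings are illustrative assumptions; the strings follow the formats seen in the value counts above.

```python
import re

def time_posted_to_days(text):
    """Map a 'Time Posted' string such as '3 days ago' to whole days."""
    count = int(text.split(' ')[0])
    if re.search('minute', text, flags=re.IGNORECASE):
        return 0
    if re.search('hour', text, flags=re.IGNORECASE):
        return 1
    if re.search('week', text, flags=re.IGNORECASE):
        return count * 7
    return count  # 'N day(s) ago' stays as N

examples = ['17 minutes ago', '6 hours ago', '1 day ago', '4 days ago', '3 weeks ago']
print([time_posted_to_days(t) for t in examples])
```

A function like this could also be applied to the whole column with `Series.apply`, which would avoid the explicit row loop.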

In [22]:
# convert time posted to days
# use .loc for assignment to avoid chained indexing (data[col][i] = ...)

for i in range(len(data)):
    time_posted = data.loc[i, 'Time Posted']
    if re.search('minute', time_posted, flags = re.IGNORECASE):
        data.loc[i, 'Time Posted'] = 0
    elif re.search('hour', time_posted, flags = re.IGNORECASE):
        data.loc[i, 'Time Posted'] = 1
    elif re.search('week', time_posted, flags = re.IGNORECASE):
        data.loc[i, 'Time Posted'] = int(time_posted.split(' ')[0]) * 7
    else:
        data.loc[i, 'Time Posted'] = int(time_posted.split(' ')[0])

In [23]:
# check 'Time Posted' count again

print(data['Time Posted'].value_counts())

1     2404
4     2007
3     1452
5     1346
2     1183
6     1086
7      630
0       57
28       1
21       1
Name: Time Posted, dtype: int64


'Time Posted' has been successfully converted to days. Note that the scraping was performed for jobs posted in the past week. However, there appear to be two records (21 and 28 days) which were posted several weeks ago; these will be eliminated.

In [24]:
# eliminate postings with more than 7 days
data = data[data['Time Posted'] < 8]

data.reset_index(inplace = True, drop = True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10165 entries, 0 to 10164
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Job Title                 10165 non-null  object
 1   Company Name              10165 non-null  object
 2   Location                  10165 non-null  object
 3   Metro Area                10165 non-null  object
 4   Time Posted               10165 non-null  object
 5   Number of Applicants      10165 non-null  object
 6   Industry Info             10165 non-null  object
 7   Education-Bachelor        10165 non-null  int64 
 8   Education-Master          10165 non-null  int64 
 9   Education-Doctor          10165 non-null  int64 
 10  Experience                10165 non-null  object
 11  Data Science Terms Count  10165 non-null  int64 
 12  Seniority Level           10165 non-null  object
dtypes: int64(4), object(9)
memory usage: 1.0+ MB


- **Extracting Information from 'Number of Applicants'**

In [25]:
# get count
print(data['Number of Applicants'].value_counts())

Be among the first 25 applicants    7776
Over 200 applicants                  385
26 applicants                         62
33 applicants                         51
35 applicants                         49
                                    ... 
136 applicants                         1
191 applicants                         1
147 applicants                         1
179 applicants                         1
185 applicants                         1
Name: Number of Applicants, Length: 177, dtype: int64


'Number of Applicants' contains two irregular cases:
- "Be among the first 25 applicants", which accounts for an overwhelming majority of the data (7776 of 10165 records). Note that a posting in this group means something very different after one day than after five days. Since we cannot be sure what the exact value in each case is, and since this is one of the two most critical pieces of information for us, we will treat these values as incomplete data. We will temporarily assign a negative numerical value (-10) to them, so that we can easily separate them or include them in comparison plots, if necessary, in the data analysis and exploration section later.
- "Over 200 applicants", which could be any number above 200. For the sake of simplicity, we will assign a value of 200 here.
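The two special cases and the regular case can be sketched as one function. The name `parse_applicants` and its `placeholder` and `cap` parameters are hypothetical; the values -10 and 200 are the ones chosen in the text.

```python
def parse_applicants(text, placeholder=-10, cap=200):
    """Turn a 'Number of Applicants' string into an integer."""
    first = text.split(' ')[0]
    if first == 'Be':      # 'Be among the first 25 applicants' -> incomplete data
        return placeholder
    if first == 'Over':    # 'Over 200 applicants' -> capped at 200
        return cap
    return int(first)      # 'N applicants' -> N

examples = ['Be among the first 25 applicants', 'Over 200 applicants', '40 applicants']
print([parse_applicants(t) for t in examples])
```

Keeping the placeholder negative makes these records easy to filter out with a simple comparison such as `>= 0`.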

In [26]:
# extract number of applicants as integers
# at this point assign -10 for 'Be among the first 25 applicants'

for i in range(len(data)):
    applicants = data.loc[i, 'Number of Applicants'].split(' ')[0]
    
    if applicants == 'Be':
        data.loc[i, 'Number of Applicants'] = -10 # important to assign a reasonable negative value here
    elif applicants == 'Over':
        data.loc[i, 'Number of Applicants'] = 200
    else:
        data.loc[i, 'Number of Applicants'] = int(applicants)

In [27]:
# check again their count
print(data['Number of Applicants'].value_counts())

-10     7776
 200     388
 26       62
 33       51
 35       49
        ... 
 136       1
 188       1
 191       1
 147       1
 194       1
Name: Number of Applicants, Length: 176, dtype: int64


In [28]:
# check
data.head()

Unnamed: 0,Job Title,Company Name,Location,Metro Area,Time Posted,Number of Applicants,Industry Info,Education-Bachelor,Education-Master,Education-Doctor,Experience,Data Science Terms Count,Seniority Level
0,Data Scientist,Honeywell,"Atlanta, GA",ATL,2,40,"['Seniority level', 'Entry level', 'Employment...",0,1,1,['2 years'],7,entry
1,Data Scientist,SynergisticIT,"Atlanta, GA",ATL,1,-10,"['Seniority level', 'Entry level', 'Employment...",1,1,0,['10 years'],2,entry
2,Data Scientist,Inspire Brands,"Atlanta, GA",ATL,1,25,"['Seniority level', 'Entry level', 'Employment...",0,0,0,['1-3 years'],4,entry
3,Data Scientist,The Home Depot,"Atlanta, GA",ATL,1,29,"['Seniority level', 'Entry level', 'Employment...",0,1,0,['2+ years'],3,entry
4,Data Scientist,Edelman Data & Intelligence (DxI),"Atlanta, GA",ATL,1,-10,"['Seniority level', 'Entry level', 'Employment...",0,0,0,[],3,entry


The numbers of applicants have been extracted successfully.

- **Extracting Information from 'Industry Info'**

In [29]:
# display the full information from several records
data['Industry Info'][0]

"['Seniority level', 'Entry level', 'Employment type', 'Full-time', 'Job function', 'EngineeringInformation Technology', 'Industries', 'Information Technology and ServicesMachineryStaffing and Recruiting']"

In [31]:
data['Industry Info'][3000]

"['Seniority level', 'Mid-Senior level', 'Employment type', 'Full-time', 'Job function', 'EngineeringProduct ManagementScience', 'Industries', 'Legal Services']"

In [32]:
data['Industry Info'][6000]

"['Seniority level', 'Associate', 'Employment type', 'Full-time', 'Job function', 'Other', 'Industries', 'Marketing and AdvertisingComputer SoftwareInternet']"

In [33]:
data['Industry Info'][10000]

"['Seniority level', 'Mid-Senior level', 'Employment type', 'Full-time', 'Job function', 'EngineeringInformation Technology', 'Industries', 'Information Technology and ServicesComputer SoftwareInternet']"

Of the information contained in 'Industry Info', we already have 'Seniority level' as a separate feature because the search was performed for a specific seniority level. In addition, the search was performed for full-time positions only. Therefore, the two remaining pieces of information we need to extract from 'Industry Info' are 'Job function' and 'Industries'.
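One possible way to read these fields is to parse the stored string back into a list and look values up by their label. This is a sketch under the assumption that the list always alternates label and value, as in the four records displayed above; the helper name `parse_industry_info` is hypothetical.

```python
import ast

def parse_industry_info(text):
    """Parse the stringified list and pair each label with the value after it."""
    items = ast.literal_eval(text)  # safe evaluation of a Python literal
    return dict(zip(items[0::2], items[1::2]))

# record 3000 as displayed above
sample = ("['Seniority level', 'Mid-Senior level', 'Employment type', 'Full-time', "
          "'Job function', 'EngineeringProduct ManagementScience', "
          "'Industries', 'Legal Services']")

info = parse_industry_info(sample)
print(info.get('Job function', 'Other'))
print(info.get('Industries', 'Other'))
```

Looking up by label rather than by position means that a record with a missing field falls back to 'Other' instead of picking up a neighboring value.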

- 'Job Function' Information

In [34]:
# create new column 'Job Function' and assign 'Other' value
data['Job Function'] = 'Other'

data.head()

Unnamed: 0,Job Title,Company Name,Location,Metro Area,Time Posted,Number of Applicants,Industry Info,Education-Bachelor,Education-Master,Education-Doctor,Experience,Data Science Terms Count,Seniority Level,Job Function
0,Data Scientist,Honeywell,"Atlanta, GA",ATL,2,40,"['Seniority level', 'Entry level', 'Employment...",0,1,1,['2 years'],7,entry,Other
1,Data Scientist,SynergisticIT,"Atlanta, GA",ATL,1,-10,"['Seniority level', 'Entry level', 'Employment...",1,1,0,['10 years'],2,entry,Other
2,Data Scientist,Inspire Brands,"Atlanta, GA",ATL,1,25,"['Seniority level', 'Entry level', 'Employment...",0,0,0,['1-3 years'],4,entry,Other
3,Data Scientist,The Home Depot,"Atlanta, GA",ATL,1,29,"['Seniority level', 'Entry level', 'Employment...",0,1,0,['2+ years'],3,entry,Other
4,Data Scientist,Edelman Data & Intelligence (DxI),"Atlanta, GA",ATL,1,-10,"['Seniority level', 'Entry level', 'Employment...",0,0,0,[],3,entry,Other


In [35]:
# fill 'Job Function' values using 'Industry Info'
# after splitting on single quotes, the job function is the sixth element from the end

for i in range(len(data)):
    job_func = data.loc[i, 'Industry Info']
    data.loc[i, 'Job Function'] = job_func.split("'")[-6]

In [36]:
# get 'Job Function' count as percentage

count_job_func = data['Job Function'].value_counts(normalize = True) * 100
print(round(count_job_func, 2))

Information Technology                         30.00
EngineeringInformation Technology              20.69
Other                                          11.76
ResearchAnalystInformation Technology          10.42
Information TechnologyEngineering               2.16
                                               ...  
FinanceEngineeringScience                       0.01
AnalystFinanceLegal                             0.01
ResearchDesignScience                           0.01
ManagementEngineeringInformation Technology     0.01
Health Care ProviderInformation Technology      0.01
Name: Job Function, Length: 396, dtype: float64


In [37]:
# print the ones above the 0.1% threshold

print(round(count_job_func[count_job_func >= 0.1], 2))

Information Technology                                30.00
EngineeringInformation Technology                     20.69
Other                                                 11.76
ResearchAnalystInformation Technology                 10.42
Information TechnologyEngineering                      2.16
Engineering                                            2.06
Business DevelopmentSales                              1.18
General Business                                       1.00
Research                                               0.90
Full-time                                              0.68
ResearchScience                                        0.65
ConsultingInformation TechnologyEngineering            0.59
Analyst                                                0.58
FinanceSales                                           0.53
Strategy/PlanningInformation Technology                0.52
ResearchScienceEngineering                             0.50
Strategy/PlanningAnalystInformation Tech

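Although the information is mixed up and overlapping (several job functions are concatenated without a separator, as in 'EngineeringInformation Technology'), we will extract from each record the most specific term for a particular job function.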
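To make the overlap visible, a concatenated value can be split back into its parts with a known vocabulary. This is an illustrative alternative, not the approach used in the notebook; the `vocabulary` list is a made-up subset and `split_functions` is a hypothetical helper.

```python
import re

# hypothetical subset of job function labels
vocabulary = ['Engineering', 'Information Technology', 'Research', 'Analyst', 'Science']

def split_functions(text, vocabulary):
    """Return the known labels found in a concatenated string, in order of appearance."""
    # try longer labels first so that a longer label wins over a shorter one
    ordered = sorted(vocabulary, key=len, reverse=True)
    pattern = '|'.join(re.escape(label) for label in ordered)
    return re.findall(pattern, text)

print(split_functions('EngineeringInformation Technology', vocabulary))
print(split_functions('ResearchAnalystInformation Technology', vocabulary))
```

Once the parts are separated, choosing a single representative label becomes an explicit decision, for example by taking the part that ranks highest in a priority list.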

In [38]:
# create a list of specific job functions

list_functions = ['Art', 'Human Resources', 'Health Care', 'Sales', 'Finance', 'Business Development', 
                 'Marketing', 'Advertising', 'Accounting', 'Manufacturing', 'General Business', 'Supply Chain', 
                 'Venture Capital', 'Network Security', 'Analyst', 'Education', 'Research', 'Science', 'Design', 
                 'Consulting', 'Management', 'Engineering', 'Information Technology', 'Other']

In [39]:
# replace using the function 'replace_feat_values'

data_feature = 'Job Function'
list_items = list_functions

replace_feat_values(data, data_feature, list_items)

In [40]:
# get the new count

count_job_func = data['Job Function'].value_counts(normalize = True) * 100
print(round(count_job_func, 2))
print('\n')
print('Number of different job functions: ', len(count_job_func))

Information Technology    30.63
Engineering               25.78
Analyst                   13.35
Other                     11.76
Research                   4.62
Sales                      2.50
Consulting                 1.60
Marketing                  1.20
Management                 1.14
Finance                    1.14
General Business           1.14
Science                    0.97
Full-time                  0.68
Education                  0.56
Design                     0.54
Art                        0.53
Human Resources            0.33
Health Care                0.31
Manufacturing              0.26
Business Development       0.23
Supply Chain               0.20
Legal                      0.14
Advertising                0.12
Accounting                 0.09
Quality Assurance          0.08
Administrative             0.04
Associate                  0.02
Production                 0.01
Customer Service           0.01
Strategy/Planning          0.01
Name: Job Function, dtype: float64


Num

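We have successfully reduced the job functions from the original 396 mixed and overlapping values to 30 values which we will use in further data analysis and modeling. Note that a few of them, such as 'Full-time' and 'Associate', are not job functions and likely come from records where the 'Job function' entry is missing from 'Industry Info'.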

- 'Industries' Information

In [41]:
# create new feature 'Industry' and assign 'Other' value

data['Industry'] = 'Other'

data.head()

Unnamed: 0,Job Title,Company Name,Location,Metro Area,Time Posted,Number of Applicants,Industry Info,Education-Bachelor,Education-Master,Education-Doctor,Experience,Data Science Terms Count,Seniority Level,Job Function,Industry
0,Data Scientist,Honeywell,"Atlanta, GA",ATL,2,40,"['Seniority level', 'Entry level', 'Employment...",0,1,1,['2 years'],7,entry,Engineering,Other
1,Data Scientist,SynergisticIT,"Atlanta, GA",ATL,1,-10,"['Seniority level', 'Entry level', 'Employment...",1,1,0,['10 years'],2,entry,Information Technology,Other
2,Data Scientist,Inspire Brands,"Atlanta, GA",ATL,1,25,"['Seniority level', 'Entry level', 'Employment...",0,0,0,['1-3 years'],4,entry,Engineering,Other
3,Data Scientist,The Home Depot,"Atlanta, GA",ATL,1,29,"['Seniority level', 'Entry level', 'Employment...",0,1,0,['2+ years'],3,entry,Engineering,Other
4,Data Scientist,Edelman Data & Intelligence (DxI),"Atlanta, GA",ATL,1,-10,"['Seniority level', 'Entry level', 'Employment...",0,0,0,[],3,entry,Analyst,Other


In [42]:
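# fill in 'Industry' values from 'Industry Info'
# the industries are the last quoted item in the 'Industry Info' string

for i in range(len(data)):
    industry = data['Industry Info'][i]
    # .loc writes directly into the data frame (no chained indexing)
    data.loc[i, 'Industry'] = industry.split("'")[-2]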
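Splitting on the single-quote character works as long as the last item itself contains no apostrophe. As a hedged alternative, and assuming each 'Industry Info' value is the text form of a valid Python list, the string can be parsed with `ast.literal_eval`. The two sample strings and the helper `last_item` below are made up for illustration.

```python
import ast

# Made-up 'Industry Info' strings in the same list-as-text form as the scraped column
plain = repr(['Seniority level', 'Entry level', 'Industries', 'Computer Software'])
tricky = repr(['Seniority level', 'Entry level', 'Industries', "Women's Health"])

def last_item(info):
    """Parse the text as a Python list literal and return its last element."""
    return ast.literal_eval(info)[-1]

# Compare the quote-splitting approach with literal parsing
print(plain.split("'")[-2], '|', last_item(plain))
print(tricky.split("'")[-2], '|', last_item(tricky))
```

The two approaches agree on the first string and differ on the second, where the apostrophe inside the item breaks the quote-based split.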

In [43]:
# display first five records
data.head()

Unnamed: 0,Job Title,Company Name,Location,Metro Area,Time Posted,Number of Applicants,Industry Info,Education-Bachelor,Education-Master,Education-Doctor,Experience,Data Science Terms Count,Seniority Level,Job Function,Industry
0,Data Scientist,Honeywell,"Atlanta, GA",ATL,2,40,"['Seniority level', 'Entry level', 'Employment...",0,1,1,['2 years'],7,entry,Engineering,Information Technology and ServicesMachinerySt...
1,Data Scientist,SynergisticIT,"Atlanta, GA",ATL,1,-10,"['Seniority level', 'Entry level', 'Employment...",1,1,0,['10 years'],2,entry,Information Technology,Information Technology and ServicesComputer So...
2,Data Scientist,Inspire Brands,"Atlanta, GA",ATL,1,25,"['Seniority level', 'Entry level', 'Employment...",0,0,0,['1-3 years'],4,entry,Engineering,Information Technology and ServicesComputer So...
3,Data Scientist,The Home Depot,"Atlanta, GA",ATL,1,29,"['Seniority level', 'Entry level', 'Employment...",0,1,0,['2+ years'],3,entry,Engineering,ConstructionInformation Technology and Service...
4,Data Scientist,Edelman Data & Intelligence (DxI),"Atlanta, GA",ATL,1,-10,"['Seniority level', 'Entry level', 'Employment...",0,0,0,[],3,entry,Analyst,Public Relations and CommunicationsResearchMar...


In [44]:
# get 'Industry' count as percentage
count_industry = data['Industry'].value_counts(normalize = True) * 100
print(round(count_industry, 2))

Information Technology and ServicesComputer SoftwareFinancial Services        14.34
Information Technology and ServicesComputer SoftwareInternet                   6.16
Staffing and RecruitingInsuranceReal Estate                                    4.80
Information Technology and Services                                            3.24
Computer SoftwareInformation Technology and ServicesInternet                   3.04
                                                                              ...  
Electrical/Electronic ManufacturingResearchBiotechnology                       0.01
Hospital & Health CareHealth, Wellness and Fitness                             0.01
Information Technology and ServicesFinancial ServicesInvestment Management     0.01
Capital MarketsCommercial Real EstateBiotechnology                             0.01
Financial ServicesInvestment BankingCapital Markets                            0.01
Name: Industry, Length: 994, dtype: float64


In [45]:
# print the ones above the 1% threshold

print(round(count_industry[count_industry >= 1], 2))

Information Technology and ServicesComputer SoftwareFinancial Services           14.34
Information Technology and ServicesComputer SoftwareInternet                      6.16
Staffing and RecruitingInsuranceReal Estate                                       4.80
Information Technology and Services                                               3.24
Computer SoftwareInformation Technology and ServicesInternet                      3.04
Marketing and AdvertisingComputer SoftwareInternet                                2.52
Marketing and AdvertisingInformation Technology and ServicesComputer Software     2.51
Computer Software                                                                 2.49
Information Technology and ServicesDefense & SpaceComputer Software               2.38
Management Consulting                                                             1.66
Marketing and AdvertisingInformation Technology and ServicesInternet              1.64
Biotechnology                              

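As with 'Job Function', the 'Industry' feature contains mixed and overlapping values: each posting lists several industries concatenated into a single string, which produces 994 distinct combinations. We will extract a single, more specific value for each posting in the same way as before.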
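The function `replace_feat_values` is defined earlier in the notebook and is not reproduced here. The stand-in below is only a sketch of why the ordering of the list matters. It assumes that the first list item found inside a value wins. The name `replace_first_match` and the shortened priority list are hypothetical.

```python
import pandas as pd

def replace_first_match(df, feature, items):
    """Replace each value of df[feature] with the first entry of `items` it contains.

    Values that contain none of the items are left unchanged.
    """
    def pick(value):
        for item in items:
            if item in value:
                return item
        return value

    df[feature] = df[feature].apply(pick)
    return df

# Concatenated industry strings of the kind shown in the counts above
demo = pd.DataFrame({'Industry': [
    'Information Technology and ServicesComputer SoftwareFinancial Services',
    'Staffing and RecruitingInsuranceReal Estate',
    'Luxury Goods & Jewelry',
]})

# Shortened priority list: more specific industries come first
priority = ['Real Estate', 'Insurance', 'Staffing and Recruiting',
            'Financial Services', 'Computer Software',
            'Information Technology and Services']

replace_first_match(demo, 'Industry', priority)
print(demo['Industry'].tolist())
```

Under this assumption, placing broad labels such as 'Information Technology and Services' at the end of the list keeps them from absorbing postings that also name a more specific industry.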

In [46]:
# create a list of specific industries

# create 'Industry Specifics' list - important to pay attention to the ordering of the items in the list
list_industry = ['Real Estate', 'Insurance', 'Staffing and Recruiting', 'Marketing and Advertising', 'Health Care', 
                 'Accounting', 'Banking', 'Investment', 'Market Research', 'Financial Services', 'Biotechnology', 
                 'Pharmaceutical', 'Manufacturing', 'Transportation', 'Automotive', 'Defense', 'Retail', 
                 'Management Consulting', 'Network Security', 'Wellness and Fitness', 'Medical Devices', 'Education', 
                 'Food & Beverages', 'Professional Training', 'Construction', 'Venture Capital', 'Apparel & Fashion', 
                 'Architecture & Planning', 'Semiconductors', 'Research', 'Telecommunications', 'Online Media', 'Sports', 
                 'Environment', 'Mining', 'Consumer Goods', 'Consumer Services', 'Consumer Electronics', 'Entertainment', 
                 'Supply', 'Facilities',  'Energy', 'Aerospace', 'Wireless', 'Engineering', 'Automation', 'Legal Services', 
                 'Nonprofit', 'Computer Games', 'Internet', 'Computer Software', 'Information Technology and Services']

In [47]:
# replace the values using the function 'replace_feat_values'

data_feature = 'Industry'
list_items = list_industry

replace_feat_values(data, data_feature, list_items)

In [48]:
# get the new 'Industry' count as percentage
count_industry = data['Industry'].value_counts(normalize = True) * 100
print(round(count_industry, 2))

Financial Services           19.75
Internet                     11.52
Marketing and Advertising     9.93
Health Care                   8.50
Real Estate                   5.33
                             ...  
Legal                         0.01
Luxury Goods & Jewelry        0.01
Plastics                      0.01
Venture Capital               0.01
E-Learning                    0.01
Name: Industry, Length: 85, dtype: float64


We are left with 85 unique industry values. Since some of these are well below 0.1% of the records, it is reasonable to combine them under the label 'Other', in the same way as we did with 'Job Title'.

In [49]:
# print the industries above the 0.1% threshold

print(round(count_industry[count_industry.values >= 0.1], 2))
print('\n')
print('Number of industries above the 0.1% threshold: ', len(count_industry[count_industry.values >= 0.1]))
print('Percentage of all records: ', round(sum(count_industry[count_industry.values >= 0.1].values), 2))

Financial Services                     19.75
Internet                               11.52
Marketing and Advertising               9.93
Health Care                             8.50
Real Estate                             5.33
Biotechnology                           4.79
Computer Software                       4.79
Staffing and Recruiting                 4.51
Management Consulting                   3.47
Information Technology and Services     3.32
Defense                                 3.28
Banking                                 2.68
Insurance                               1.95
Research                                1.53
Network Security                        1.28
Education                               1.27
Manufacturing                           1.14
Telecommunications                      1.05
Retail                                  0.91
Automotive                              0.83
Pharmaceutical                          0.67
Online Media                            0.52
Constructi

The number of industries above the 0.1% threshold is 42, and together they represent 98.9% of all records. Thus, we can combine the rest under a single label, 'Other', without expecting issues in the subsequent analysis and modeling.
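Since the same thresholding step is applied to more than one feature, it can be wrapped in a small reusable helper. The sketch below is one possible way to do this on a toy series. The function name `group_rare` and the 20% threshold are chosen only for illustration.

```python
import pandas as pd

def group_rare(series, threshold_pct, label='Other'):
    """Return a copy of `series` where values rarer than `threshold_pct` percent become `label`."""
    share = series.value_counts(normalize=True) * 100
    rare = share[share < threshold_pct].index
    # keep values that are not rare, replace the rest with the label
    return series.where(~series.isin(rare), label)

# Toy data: 'A' is 60%, 'B' is 30%, 'C' is 10% of the records
toy = pd.Series(['A'] * 6 + ['B'] * 3 + ['C'])
grouped = group_rare(toy, threshold_pct=20)
print(grouped.value_counts())
```

The helper returns a new series instead of modifying the data frame in place, so the original values stay available for comparison.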

In [50]:
# create a filter for industries below the 0.1% threshold

industry_other = count_industry[count_industry < 0.1].index

mask_industry_other = data['Industry'].isin(industry_other)

# replace all industries under this filter with 'Other'
# .loc writes directly into the data frame (no chained indexing)
data.loc[mask_industry_other, 'Industry'] = 'Other'

In [51]:
# print the new count as percentage
count_industry = data['Industry'].value_counts(normalize = True) * 100
print(round(count_industry, 2))
print('\n')
print('Number of industries: ', len(count_industry))

Financial Services                     19.75
Internet                               11.52
Marketing and Advertising               9.93
Health Care                             8.50
Real Estate                             5.33
Biotechnology                           4.79
Computer Software                       4.79
Staffing and Recruiting                 4.51
Management Consulting                   3.47
Information Technology and Services     3.32
Defense                                 3.28
Banking                                 2.68
Insurance                               1.95
Research                                1.53
Network Security                        1.28
Education                               1.27
Other                                   1.15
Manufacturing                           1.14
Telecommunications                      1.05
Retail                                  0.91
Automotive                              0.83
Pharmaceutical                          0.67
Online Med

We have successfully extracted the 'Industry' information, leaving 43 distinct industry values (42 industries plus 'Other').

- **Extracting and Processing Experience Information**


The last feature we need to extract information from before proceeding is 'Experience'. Because different positions state their requirements differently (for example, '2 years', '1-3 years', or '2+ years'), we believe the most suitable common denominator is the minimum number of years of experience.
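The idea can be sketched on a few sample strings in the same form as the 'Experience' column: collect every integer in the text and keep the smallest, falling back to 0 when none is found. The helper name `min_years` is hypothetical.

```python
import re
import pandas as pd

def min_years(text):
    """Return the smallest integer found in `text`, or 0 if there is none."""
    numbers = [int(n) for n in re.findall(r'\d+', str(text))]
    return min(numbers) if numbers else 0

# Sample values in the same form as the 'Experience' column
experience = pd.Series(["['2 years']", "['10 years']", "['1-3 years']", "['2+ years']", "[]"])
print(experience.apply(min_years).tolist())  # [2, 10, 1, 2, 0]
```

Note that a missing requirement and an explicit requirement of zero years both map to 0 under this scheme.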

In [52]:
# create new column 'Minimum Experience'
data['Minimum Experience'] = 0
data.head()

Unnamed: 0,Job Title,Company Name,Location,Metro Area,Time Posted,Number of Applicants,Industry Info,Education-Bachelor,Education-Master,Education-Doctor,Experience,Data Science Terms Count,Seniority Level,Job Function,Industry,Minimum Experience
0,Data Scientist,Honeywell,"Atlanta, GA",ATL,2,40,"['Seniority level', 'Entry level', 'Employment...",0,1,1,['2 years'],7,entry,Engineering,Staffing and Recruiting,0
1,Data Scientist,SynergisticIT,"Atlanta, GA",ATL,1,-10,"['Seniority level', 'Entry level', 'Employment...",1,1,0,['10 years'],2,entry,Information Technology,Staffing and Recruiting,0
2,Data Scientist,Inspire Brands,"Atlanta, GA",ATL,1,25,"['Seniority level', 'Entry level', 'Employment...",0,0,0,['1-3 years'],4,entry,Engineering,Financial Services,0
3,Data Scientist,The Home Depot,"Atlanta, GA",ATL,1,29,"['Seniority level', 'Entry level', 'Employment...",0,1,0,['2+ years'],3,entry,Engineering,Financial Services,0
4,Data Scientist,Edelman Data & Intelligence (DxI),"Atlanta, GA",ATL,1,-10,"['Seniority level', 'Entry level', 'Employment...",0,0,0,[],3,entry,Analyst,Research,0


In [53]:
# use regex re.findall(r'\d+', test_string) to fill 'Minimum Experience' values from 'Experience'
# the records with missing 'Experience' values ([]) will be filled with 0

for i in range(len(data)):
    # all integers mentioned in the experience string, e.g. "['1-3 years']" -> [1, 3]
    res = np.array(re.findall(r'\d+', data['Experience'][i])).astype(int)
    
    if res.size == 0: # need to account for cases where the result is empty []
        data.loc[i, 'Minimum Experience'] = 0
    else:
        data.loc[i, 'Minimum Experience'] = res.min()

In [54]:
# check
data.head()

Unnamed: 0,Job Title,Company Name,Location,Metro Area,Time Posted,Number of Applicants,Industry Info,Education-Bachelor,Education-Master,Education-Doctor,Experience,Data Science Terms Count,Seniority Level,Job Function,Industry,Minimum Experience
0,Data Scientist,Honeywell,"Atlanta, GA",ATL,2,40,"['Seniority level', 'Entry level', 'Employment...",0,1,1,['2 years'],7,entry,Engineering,Staffing and Recruiting,2
1,Data Scientist,SynergisticIT,"Atlanta, GA",ATL,1,-10,"['Seniority level', 'Entry level', 'Employment...",1,1,0,['10 years'],2,entry,Information Technology,Staffing and Recruiting,10
2,Data Scientist,Inspire Brands,"Atlanta, GA",ATL,1,25,"['Seniority level', 'Entry level', 'Employment...",0,0,0,['1-3 years'],4,entry,Engineering,Financial Services,1
3,Data Scientist,The Home Depot,"Atlanta, GA",ATL,1,29,"['Seniority level', 'Entry level', 'Employment...",0,1,0,['2+ years'],3,entry,Engineering,Financial Services,2
4,Data Scientist,Edelman Data & Intelligence (DxI),"Atlanta, GA",ATL,1,-10,"['Seniority level', 'Entry level', 'Employment...",0,0,0,[],3,entry,Analyst,Research,0


### Save relevant features from cleaned data

At this point, we will select the features with which we will continue working and save the cleaned data in a new file.
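Before selecting columns for export, it can help to confirm that every requested feature is present, so that a misspelled name fails with a clear message. The sketch below uses a made-up two-row frame, and the helper `select_features` is hypothetical.

```python
import pandas as pd

def select_features(df, features):
    """Return a copy of `df` restricted to `features`, raising if any are missing."""
    missing = [col for col in features if col not in df.columns]
    if missing:
        raise KeyError(f'Missing columns: {missing}')
    return df[features].copy()

# Made-up miniature version of the cleaned data
toy_data = pd.DataFrame({
    'Job Title': ['Data Scientist', 'Data Scientist'],
    'Industry Info': ['[...]', '[...]'],
    'Minimum Experience': [2, 0],
})

subset = select_features(toy_data, ['Job Title', 'Minimum Experience'])
print(subset.columns.tolist())
```

Returning a copy means later edits to the selected frame do not touch the full data set.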

In [55]:
data.columns

Index(['Job Title', 'Company Name', 'Location', 'Metro Area', 'Time Posted',
       'Number of Applicants', 'Industry Info', 'Education-Bachelor',
       'Education-Master', 'Education-Doctor', 'Experience',
       'Data Science Terms Count', 'Seniority Level', 'Job Function',
       'Industry', 'Minimum Experience'],
      dtype='object')

In [56]:
# select all relevant features we will continue to work with and create a new data set, save_data
save_features = ['Job Title', 'Company Name', 'Industry', 'Job Function', 'Metro Area', 
                 'Education-Bachelor', 'Education-Master', 'Education-Doctor', 'Minimum Experience', 
                 'Seniority Level', 'Data Science Terms Count', 'Time Posted', 'Number of Applicants']

save_data = data[save_features]
save_data.head()

Unnamed: 0,Job Title,Company Name,Industry,Job Function,Metro Area,Education-Bachelor,Education-Master,Education-Doctor,Minimum Experience,Seniority Level,Data Science Terms Count,Time Posted,Number of Applicants
0,Data Scientist,Honeywell,Staffing and Recruiting,Engineering,ATL,0,1,1,2,entry,7,2,40
1,Data Scientist,SynergisticIT,Staffing and Recruiting,Information Technology,ATL,1,1,0,10,entry,2,1,-10
2,Data Scientist,Inspire Brands,Financial Services,Engineering,ATL,0,0,0,1,entry,4,1,25
3,Data Scientist,The Home Depot,Financial Services,Engineering,ATL,0,1,0,2,entry,3,1,29
4,Data Scientist,Edelman Data & Intelligence (DxI),Research,Analyst,ATL,0,0,0,0,entry,3,1,-10


In [57]:
# save data as an Excel file

file_save = file_read + '_clean.xlsx'

save_data.to_excel(file_save, index = False)