# Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import os

# Files Loading

We load every file in the "data_jobs" folder and concatenate them into a single DataFrame:

In [2]:
folder = 'data_jobs'
files = os.listdir(folder)

dfs = []

headers = ['Job','URL','Company','Location','Salary','Posted','ID','Job_Description']

for file in files:
    path = os.path.join(folder, file)
    
    df = pd.read_excel(path, names=headers)
    
    dfs.append(df)

Jobs = pd.concat(dfs)

# Data Cleaning

First, we remove any duplicated postings. The "ID" column is the best key for this, since it identifies each posting:

In [3]:
Jobs = Jobs.drop_duplicates(subset='ID')

Let's see what the DataFrame looks like:

In [4]:
Jobs.describe()

Unnamed: 0,Job,URL,Company,Location,Salary,Posted,ID,Job_Description
count,44594,44594,44563,44594,17979,38903,44594,44594
unique,22136,44594,17760,14781,8089,58,44594,44034
top,Financial Analyst,https://www.indeed.com/rc/clk?jk=dd21611605e12...,"JPMorgan Chase Bank, N.A.",Remote,"$100,000 - $120,000 por año",PostedHoy,job_dd21611605e12f1c,Ir directamente al contenido principal_x000D_\...
freq,1358,1,278,2479,115,13838,1,14


In [5]:
Jobs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44594 entries, 0 to 987
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Job              44594 non-null  object
 1   URL              44594 non-null  object
 2   Company          44563 non-null  object
 3   Location         44594 non-null  object
 4   Salary           17979 non-null  object
 5   Posted           38903 non-null  object
 6   ID               44594 non-null  object
 7   Job_Description  44594 non-null  object
dtypes: object(8)
memory usage: 3.1+ MB


Almost every column is fully populated; the exceptions are Salary and Posted (Company is also missing 31 values). The Posted column is not critical: it only states how long before the extraction date the posting was published. It doesn't seem very useful, so we can drop it.

As for the Salary column, only about 40% of the rows (17,979 of 44,594) have a non-null value.
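The share of populated salaries can be checked directly from the counts that `Jobs.info()` reports. This is a small arithmetic sketch; the variable names are illustrative and not part of the notebook.

```python
# Counts taken from the Jobs.info() output above
total_rows = 44594
salary_non_null = 17979

# Fraction of postings that include a salary
share = salary_non_null / total_rows
print(f"{share:.1%} of postings include a salary")

# On the live DataFrame the same figure would be Jobs['Salary'].notna().mean()
```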
#### Could we estimate the Salary null values?
Let's see it later!

In [6]:
# drop() returns a new DataFrame; without reassignment, Jobs itself keeps the 'Posted' column
Jobs.drop(columns=['Posted'])

Unnamed: 0,Job,URL,Company,Location,Salary,ID,Job_Description
0,Enterprise Business Intelligence Analyst I,https://www.indeed.com/rc/clk?jk=dd21611605e12...,BOK Financial,"Tulsa, OK 74101",,job_dd21611605e12f1c,Ir directamente al contenido principal_x000D_\...
1,"Business Analyst, Commercial Strategy",https://www.indeed.com/rc/clk?jk=64d9fb0d5e335...,Live Nation,"Chicago, IL",,job_64d9fb0d5e3351ae,Ir directamente al contenido principal_x000D_\...
2,Business Analyst I,https://www.indeed.com/rc/clk?jk=3f63ec9a3f8d6...,Amex,"Phoenix, AZ","$55,000 - $105,000 por año",job_3f63ec9a3f8d6a9e,Ir directamente al contenido principal_x000D_\...
3,"Sr. Analyst, Business Optimization (Remote)",https://www.indeed.com/rc/clk?jk=f78f07684adbc...,Staples,Remote,"$57,300 - $86,400 por año",job_f78f07684adbc4c4,Ir directamente al contenido principal_x000D_\...
4,Business Risk Analyst I - Commercial Risk,https://www.indeed.com/rc/clk?jk=72945a28e3458...,M&T Bank,"Buffalo, NY",,job_72945a28e345817a,Ir directamente al contenido principal_x000D_\...
...,...,...,...,...,...,...,...
969,Property Controller- Graduate East Lansing,https://www.indeed.com/rc/clk?jk=a88f74a87f1c6...,Schulte Companies,"East Lansing, MI 48823+1 ubicación",,job_a88f74a87f1c6e48,Ir directamente al contenido principal_x000D_\...
972,Assistant Controller,https://www.indeed.com/rc/clk?jk=6fe37642cf957...,"Home State Insurance Group, Inc.","Waco, TX 76710+1 ubicación",,job_6fe37642cf95726f,Ir directamente al contenido principal_x000D_\...
980,Plant Controller,https://www.indeed.com/rc/clk?jk=094beaf0996c9...,Dometic Corporation,"Cerritos, CA 90703",,job_094beaf0996c95fe,Ir directamente al contenido principal_x000D_\...
981,Controller,https://www.indeed.com/rc/clk?jk=f3027b081096f...,"The Talance Group, LP","Houston, TX 77002 (Downtown area)",,job_f3027b081096ff4c,Ir directamente al contenido principal_x000D_\...


All the dtypes are object. The Salary column, however, would be more useful as a numeric column.

## Salary

Let's see what it looks like:

In [7]:
Jobs['Salary'].head(50).values.tolist()

[nan,
 nan,
 '$55,000 - $105,000 por año',
 '$57,300 - $86,400 por año',
 nan,
 '$67,000 - $100,000 por año',
 nan,
 nan,
 nan,
 '$60,000 - $132,000 por año',
 nan,
 nan,
 nan,
 nan,
 '$70 por hora',
 '$70,785 - $107,640 por año',
 '$65,490 - $88,660 por año',
 nan,
 nan,
 '$65,000 - $105,000 por año',
 nan,
 nan,
 '$154,523 - $169,854 por año',
 '$66,300 - $137,700 por año',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '$68,395 - $97,933 por año',
 '$54,000 - $134,000 por año',
 nan,
 nan,
 nan,
 nan,
 nan,
 '$63,118 - $87,346 por año',
 nan,
 nan,
 nan]

It looks horrible! We have single values and ranges, and salaries quoted per year (año), month (mes), week (semana), day (día) and hour (hora). Moreover, every value carries the $ symbol, and some start with the words Desde (From) or Hasta (Up to). Let's clean it!

First of all, the pay frequency is useful for standardizing the salaries. Let's split it into a new column:

In [8]:
Jobs['Frecuency_Salary'] = Jobs['Salary'].str.split(' por ').str.get(1)

In [9]:
Jobs['Frecuency_Salary'].unique()

array([nan, 'año', 'hora', 'mes', 'semana', 'día'], dtype=object)

In [10]:
Jobs.head()

Unnamed: 0,Job,URL,Company,Location,Salary,Posted,ID,Job_Description,Frecuency_Salary
0,Enterprise Business Intelligence Analyst I,https://www.indeed.com/rc/clk?jk=dd21611605e12...,BOK Financial,"Tulsa, OK 74101",,PostedPublicado hace 2 días,job_dd21611605e12f1c,Ir directamente al contenido principal_x000D_\...,
1,"Business Analyst, Commercial Strategy",https://www.indeed.com/rc/clk?jk=64d9fb0d5e335...,Live Nation,"Chicago, IL",,PostedPublicado hace más de 30 días,job_64d9fb0d5e3351ae,Ir directamente al contenido principal_x000D_\...,
2,Business Analyst I,https://www.indeed.com/rc/clk?jk=3f63ec9a3f8d6...,Amex,"Phoenix, AZ","$55,000 - $105,000 por año",PostedPublicado hace más de 30 días,job_3f63ec9a3f8d6a9e,Ir directamente al contenido principal_x000D_\...,año
3,"Sr. Analyst, Business Optimization (Remote)",https://www.indeed.com/rc/clk?jk=f78f07684adbc...,Staples,Remote,"$57,300 - $86,400 por año",PostedPublicado hace 3 días,job_f78f07684adbc4c4,Ir directamente al contenido principal_x000D_\...,año
4,Business Risk Analyst I - Commercial Risk,https://www.indeed.com/rc/clk?jk=72945a28e3458...,M&T Bank,"Buffalo, NY",,PostedPublicado hace 6 días,job_72945a28e345817a,Ir directamente al contenido principal_x000D_\...,


Nice! Now let's delete the words "Desde " and "Hasta ":

In [11]:
Jobs['Salary'] = Jobs['Salary'].str.replace('Desde ','')
Jobs['Salary'] = Jobs['Salary'].str.replace('Hasta ','')

In [12]:
Jobs.head()

Unnamed: 0,Job,URL,Company,Location,Salary,Posted,ID,Job_Description,Frecuency_Salary
0,Enterprise Business Intelligence Analyst I,https://www.indeed.com/rc/clk?jk=dd21611605e12...,BOK Financial,"Tulsa, OK 74101",,PostedPublicado hace 2 días,job_dd21611605e12f1c,Ir directamente al contenido principal_x000D_\...,
1,"Business Analyst, Commercial Strategy",https://www.indeed.com/rc/clk?jk=64d9fb0d5e335...,Live Nation,"Chicago, IL",,PostedPublicado hace más de 30 días,job_64d9fb0d5e3351ae,Ir directamente al contenido principal_x000D_\...,
2,Business Analyst I,https://www.indeed.com/rc/clk?jk=3f63ec9a3f8d6...,Amex,"Phoenix, AZ","$55,000 - $105,000 por año",PostedPublicado hace más de 30 días,job_3f63ec9a3f8d6a9e,Ir directamente al contenido principal_x000D_\...,año
3,"Sr. Analyst, Business Optimization (Remote)",https://www.indeed.com/rc/clk?jk=f78f07684adbc...,Staples,Remote,"$57,300 - $86,400 por año",PostedPublicado hace 3 días,job_f78f07684adbc4c4,Ir directamente al contenido principal_x000D_\...,año
4,Business Risk Analyst I - Commercial Risk,https://www.indeed.com/rc/clk?jk=72945a28e3458...,M&T Bank,"Buffalo, NY",,PostedPublicado hace 6 días,job_72945a28e345817a,Ir directamente al contenido principal_x000D_\...,


Let's work on the range values. A useful approach is to create two new columns: one for the low end of the range and another for the high end:

In [13]:
salary_split = Jobs['Salary'].str.split('-', expand=True)
Jobs['Low_Salary'] = salary_split[0].str.strip()
Jobs['High_Salary'] = salary_split[1].str.strip()

With this split, non-range values are populated only in the Low_Salary column. In these cases, let's copy the low value into High_Salary so both columns hold the same value:

In [14]:
Jobs['High_Salary'] = Jobs['High_Salary'].fillna(Jobs['Low_Salary'])

In [15]:
Jobs.head()

Unnamed: 0,Job,URL,Company,Location,Salary,Posted,ID,Job_Description,Frecuency_Salary,Low_Salary,High_Salary
0,Enterprise Business Intelligence Analyst I,https://www.indeed.com/rc/clk?jk=dd21611605e12...,BOK Financial,"Tulsa, OK 74101",,PostedPublicado hace 2 días,job_dd21611605e12f1c,Ir directamente al contenido principal_x000D_\...,,,
1,"Business Analyst, Commercial Strategy",https://www.indeed.com/rc/clk?jk=64d9fb0d5e335...,Live Nation,"Chicago, IL",,PostedPublicado hace más de 30 días,job_64d9fb0d5e3351ae,Ir directamente al contenido principal_x000D_\...,,,
2,Business Analyst I,https://www.indeed.com/rc/clk?jk=3f63ec9a3f8d6...,Amex,"Phoenix, AZ","$55,000 - $105,000 por año",PostedPublicado hace más de 30 días,job_3f63ec9a3f8d6a9e,Ir directamente al contenido principal_x000D_\...,año,"$55,000","$105,000 por año"
3,"Sr. Analyst, Business Optimization (Remote)",https://www.indeed.com/rc/clk?jk=f78f07684adbc...,Staples,Remote,"$57,300 - $86,400 por año",PostedPublicado hace 3 días,job_f78f07684adbc4c4,Ir directamente al contenido principal_x000D_\...,año,"$57,300","$86,400 por año"
4,Business Risk Analyst I - Commercial Risk,https://www.indeed.com/rc/clk?jk=72945a28e3458...,M&T Bank,"Buffalo, NY",,PostedPublicado hace 6 días,job_72945a28e345817a,Ir directamente al contenido principal_x000D_\...,,,


It looks better, but both columns still contain non-numeric characters. Let's strip them out:

In [16]:
# Strip the $ symbol, the thousands separators and the ' por <period>' suffix.
# A raw string avoids the invalid '\$' escape warning.
non_numeric = r'[\$,]| por (?:hora|día|año|mes|semana)'
for col in ['Low_Salary', 'High_Salary']:
    Jobs[col] = Jobs[col].replace(non_numeric, '', regex=True)
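As an alternative to splitting and stripping step by step, the amounts can be pulled out with a single regular expression. The following is only a sketch on toy data, not the notebook's method: the `Desde $20 por hora` string is an assumed format (the exact shape of the "Desde" values is not shown above), and the names `salary`, `amounts` and `parsed` are illustrative.

```python
import pandas as pd

# Sample strings in the formats seen in the Salary column
# ('Desde $20 por hora' is an assumed example of the "Desde" form)
salary = pd.Series([
    '$55,000 - $105,000 por año',
    '$70 por hora',
    'Desde $20 por hora',
    None,
])

# Extract every dollar amount, drop thousands separators, convert to float
amounts = (
    salary.str.extractall(r'\$([\d,]+(?:\.\d+)?)')[0]
    .str.replace(',', '', regex=False)
    .astype(float)
)

# Minimum and maximum per original row; single values give low == high,
# and rows without any amount come back as NaN after reindexing
parsed = pd.DataFrame({
    'Low_Salary': amounts.groupby(level=0).min(),
    'High_Salary': amounts.groupby(level=0).max(),
}).reindex(salary.index)
print(parsed)
```

Because the pattern only captures what follows a `$`, the words Desde and Hasta and the ` por ...` suffix never need to be removed explicitly.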

In [17]:
Jobs.head()

Unnamed: 0,Job,URL,Company,Location,Salary,Posted,ID,Job_Description,Frecuency_Salary,Low_Salary,High_Salary
0,Enterprise Business Intelligence Analyst I,https://www.indeed.com/rc/clk?jk=dd21611605e12...,BOK Financial,"Tulsa, OK 74101",,PostedPublicado hace 2 días,job_dd21611605e12f1c,Ir directamente al contenido principal_x000D_\...,,,
1,"Business Analyst, Commercial Strategy",https://www.indeed.com/rc/clk?jk=64d9fb0d5e335...,Live Nation,"Chicago, IL",,PostedPublicado hace más de 30 días,job_64d9fb0d5e3351ae,Ir directamente al contenido principal_x000D_\...,,,
2,Business Analyst I,https://www.indeed.com/rc/clk?jk=3f63ec9a3f8d6...,Amex,"Phoenix, AZ","$55,000 - $105,000 por año",PostedPublicado hace más de 30 días,job_3f63ec9a3f8d6a9e,Ir directamente al contenido principal_x000D_\...,año,55000.0,105000.0
3,"Sr. Analyst, Business Optimization (Remote)",https://www.indeed.com/rc/clk?jk=f78f07684adbc...,Staples,Remote,"$57,300 - $86,400 por año",PostedPublicado hace 3 días,job_f78f07684adbc4c4,Ir directamente al contenido principal_x000D_\...,año,57300.0,86400.0
4,Business Risk Analyst I - Commercial Risk,https://www.indeed.com/rc/clk?jk=72945a28e3458...,M&T Bank,"Buffalo, NY",,PostedPublicado hace 6 días,job_72945a28e345817a,Ir directamente al contenido principal_x000D_\...,,,


Good! Now we need to standardize the salaries to a single frequency. Salaries are most commonly quoted per year, so let's convert all non-yearly salaries to yearly ones:

In [18]:
Jobs['Low_Salary'] = Jobs['Low_Salary'].astype(float)
Jobs['High_Salary'] = Jobs['High_Salary'].astype(float)

In [19]:
# Annualization factors: 12 months, 49 weeks, 230 working days, 1840 hours (= 230 x 8)
def transform_salary(row):
    if row['Frecuency_Salary'] == 'año':
        factor = 1
    elif row['Frecuency_Salary'] == 'mes':
        factor = 12
    elif row['Frecuency_Salary'] == 'semana':
        factor = 49
    elif row['Frecuency_Salary'] == 'día':
        factor = 230
    elif row['Frecuency_Salary'] == 'hora':
        factor = 1840
    else:
        factor = 1
    row['Low_Salary'] *= factor
    row['High_Salary'] *= factor
    return row

Jobs = Jobs.apply(transform_salary, axis=1)
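A row-wise `apply` over tens of thousands of rows can be slow. The same conversion can be sketched in vectorized form with `Series.map`, shown here on a toy frame. The multipliers are copied from `transform_salary`; the `demo` rows (in particular the monthly one) and the name `FACTORS` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same multipliers as transform_salary
FACTORS = {'año': 1, 'mes': 12, 'semana': 49, 'día': 230, 'hora': 1840}

# Toy rows: yearly, hourly, monthly, and a posting without salary
demo = pd.DataFrame({
    'Frecuency_Salary': ['año', 'hora', 'mes', np.nan],
    'Low_Salary': [55000.0, 70.0, 4000.0, np.nan],
    'High_Salary': [105000.0, 70.0, 5000.0, np.nan],
})

# Unknown or missing frequencies fall back to a factor of 1
factor = demo['Frecuency_Salary'].map(FACTORS).fillna(1)

# Multiply both salary columns row by row
cols = ['Low_Salary', 'High_Salary']
demo[cols] = demo[cols].mul(factor, axis=0)
print(demo)
```

Keeping the factors in one dictionary also makes the working-time assumptions easy to adjust in a single place.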

In [20]:
Jobs.head()

Unnamed: 0,Job,URL,Company,Location,Salary,Posted,ID,Job_Description,Frecuency_Salary,Low_Salary,High_Salary
0,Enterprise Business Intelligence Analyst I,https://www.indeed.com/rc/clk?jk=dd21611605e12...,BOK Financial,"Tulsa, OK 74101",,PostedPublicado hace 2 días,job_dd21611605e12f1c,Ir directamente al contenido principal_x000D_\...,,,
1,"Business Analyst, Commercial Strategy",https://www.indeed.com/rc/clk?jk=64d9fb0d5e335...,Live Nation,"Chicago, IL",,PostedPublicado hace más de 30 días,job_64d9fb0d5e3351ae,Ir directamente al contenido principal_x000D_\...,,,
2,Business Analyst I,https://www.indeed.com/rc/clk?jk=3f63ec9a3f8d6...,Amex,"Phoenix, AZ","$55,000 - $105,000 por año",PostedPublicado hace más de 30 días,job_3f63ec9a3f8d6a9e,Ir directamente al contenido principal_x000D_\...,año,55000.0,105000.0
3,"Sr. Analyst, Business Optimization (Remote)",https://www.indeed.com/rc/clk?jk=f78f07684adbc...,Staples,Remote,"$57,300 - $86,400 por año",PostedPublicado hace 3 días,job_f78f07684adbc4c4,Ir directamente al contenido principal_x000D_\...,año,57300.0,86400.0
4,Business Risk Analyst I - Commercial Risk,https://www.indeed.com/rc/clk?jk=72945a28e3458...,M&T Bank,"Buffalo, NY",,PostedPublicado hace 6 días,job_72945a28e345817a,Ir directamente al contenido principal_x000D_\...,,,


Nice! The last step here is to create a single salary column. Since some salaries are ranges and others are single values, one figure per posting makes them easier to compare and analyze. A simple approach is a mean salary column, the midpoint of the low and high values:

In [21]:
Jobs['Mean_Salary'] = (Jobs['Low_Salary'] + Jobs['High_Salary']) / 2

In [22]:
Jobs.head()

Unnamed: 0,Job,URL,Company,Location,Salary,Posted,ID,Job_Description,Frecuency_Salary,Low_Salary,High_Salary,Mean_Salary
0,Enterprise Business Intelligence Analyst I,https://www.indeed.com/rc/clk?jk=dd21611605e12...,BOK Financial,"Tulsa, OK 74101",,PostedPublicado hace 2 días,job_dd21611605e12f1c,Ir directamente al contenido principal_x000D_\...,,,,
1,"Business Analyst, Commercial Strategy",https://www.indeed.com/rc/clk?jk=64d9fb0d5e335...,Live Nation,"Chicago, IL",,PostedPublicado hace más de 30 días,job_64d9fb0d5e3351ae,Ir directamente al contenido principal_x000D_\...,,,,
2,Business Analyst I,https://www.indeed.com/rc/clk?jk=3f63ec9a3f8d6...,Amex,"Phoenix, AZ","$55,000 - $105,000 por año",PostedPublicado hace más de 30 días,job_3f63ec9a3f8d6a9e,Ir directamente al contenido principal_x000D_\...,año,55000.0,105000.0,80000.0
3,"Sr. Analyst, Business Optimization (Remote)",https://www.indeed.com/rc/clk?jk=f78f07684adbc...,Staples,Remote,"$57,300 - $86,400 por año",PostedPublicado hace 3 días,job_f78f07684adbc4c4,Ir directamente al contenido principal_x000D_\...,año,57300.0,86400.0,71850.0
4,Business Risk Analyst I - Commercial Risk,https://www.indeed.com/rc/clk?jk=72945a28e3458...,M&T Bank,"Buffalo, NY",,PostedPublicado hace 6 días,job_72945a28e345817a,Ir directamente al contenido principal_x000D_\...,,,,


PERFECT!

## Jobs Group

Since the job titles are quite heterogeneous, we need to group them into a few job groups. Let's do it with the dictionary categories.xlsx:

In [23]:
category_df = pd.read_excel('master_data/categories.xlsx')

category_df.head()

Unnamed: 0,keyword,category
0,data engineer,Data Engineer
1,data scientist,Data Scientist
2,data science,Data Scientist
3,ds,Data Scientist
4,machine learning,ML/AI Engineer


In [24]:
categories = category_df.set_index('keyword')['category'].to_dict()

def assign_category(job):
    job = job.lower()
    for keyword, category in categories.items():
        if keyword in job:
            return category
    return 'Others'

Jobs['Jobs_Group'] = Jobs['Job'].apply(assign_category)
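Because `assign_category` uses a plain substring test, a short keyword such as `ds` also matches inside longer words. A possible refinement, sketched below, is to require whole-word matches with `\b` boundaries. The function name `assign_category_strict`, the three-entry dictionary and the job titles are hypothetical and only serve the illustration.

```python
import re

# A few keyword -> category pairs in the style of categories.xlsx
categories = {
    'data engineer': 'Data Engineer',
    'data scientist': 'Data Scientist',
    'ds': 'Data Scientist',
}

def assign_category_strict(job, categories=categories):
    """Return the first category whose keyword appears as a whole word or phrase."""
    job = job.lower()
    for keyword, category in categories.items():
        # \b anchors the keyword at word boundaries; re.escape guards special characters
        if re.search(rf'\b{re.escape(keyword)}\b', job):
            return category
    return 'Others'

print(assign_category_strict('Senior Data Engineer'))    # Data Engineer
print(assign_category_strict('Hedge Funds Accountant'))  # Others ('ds' is only part of 'funds')
print(assign_category_strict('DS Lead'))                 # Data Scientist
```

As in the original function, the first matching keyword wins, so the order of rows in the dictionary still matters.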

In [25]:
category_counts = Jobs['Jobs_Group'].value_counts()

print(category_counts)

Financial Analyst            7823
Business Analyst             6935
Data Analyst                 5629
Data Engineer                4403
Controller                   3856
Data Scientist               3360
Business Intelligence        2877
CFO                          2263
Finance                      2077
Analyst                      1870
Operations Analyst           1242
ML/AI Engineer                955
Others                        918
Statistician/Mathemathics     386
Name: Jobs_Group, dtype: int64


Nice!

## Location

The next step is to clean the Location column. It would be nice to have separate City and State columns. We can extract both with a regular expression: everything before the comma is the city, and the two capital letters after the comma are the state. Let's work on it!

In [26]:
Jobs[['City', 'State']] = Jobs['Location'].str.extract(r'^(.+),\s([A-Z]{2})', expand=True)
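To see what this pattern does, here it is applied to a handful of Location values taken from the rows shown earlier. The names `locations` and `parts` are illustrative only.

```python
import pandas as pd

# Location strings copied from the sample rows above
locations = pd.Series([
    'Tulsa, OK 74101',
    'Remote',
    'East Lansing, MI 48823+1 ubicación',
    'Houston, TX 77002 (Downtown area)',
])

# Group 1: everything before the comma; group 2: two capital letters after it
parts = locations.str.extract(r'^(.+),\s([A-Z]{2})', expand=True)
parts.columns = ['City', 'State']
print(parts)
```

Values without a "City, ST" shape, such as `Remote`, produce NaN in both columns; ZIP codes and trailing notes after the state code are ignored.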

In [27]:
Jobs.head()

Unnamed: 0,Job,URL,Company,Location,Salary,Posted,ID,Job_Description,Frecuency_Salary,Low_Salary,High_Salary,Mean_Salary,Jobs_Group,City,State
0,Enterprise Business Intelligence Analyst I,https://www.indeed.com/rc/clk?jk=dd21611605e12...,BOK Financial,"Tulsa, OK 74101",,PostedPublicado hace 2 días,job_dd21611605e12f1c,Ir directamente al contenido principal_x000D_\...,,,,,Business Intelligence,Tulsa,OK
1,"Business Analyst, Commercial Strategy",https://www.indeed.com/rc/clk?jk=64d9fb0d5e335...,Live Nation,"Chicago, IL",,PostedPublicado hace más de 30 días,job_64d9fb0d5e3351ae,Ir directamente al contenido principal_x000D_\...,,,,,Business Analyst,Chicago,IL
2,Business Analyst I,https://www.indeed.com/rc/clk?jk=3f63ec9a3f8d6...,Amex,"Phoenix, AZ","$55,000 - $105,000 por año",PostedPublicado hace más de 30 días,job_3f63ec9a3f8d6a9e,Ir directamente al contenido principal_x000D_\...,año,55000.0,105000.0,80000.0,Business Analyst,Phoenix,AZ
3,"Sr. Analyst, Business Optimization (Remote)",https://www.indeed.com/rc/clk?jk=f78f07684adbc...,Staples,Remote,"$57,300 - $86,400 por año",PostedPublicado hace 3 días,job_f78f07684adbc4c4,Ir directamente al contenido principal_x000D_\...,año,57300.0,86400.0,71850.0,Business Analyst,,
4,Business Risk Analyst I - Commercial Risk,https://www.indeed.com/rc/clk?jk=72945a28e3458...,M&T Bank,"Buffalo, NY",,PostedPublicado hace 6 días,job_72945a28e345817a,Ir directamente al contenido principal_x000D_\...,,,,,Business Analyst,Buffalo,NY


In [28]:
Jobs['City'].values

array(['Tulsa', 'Chicago', 'Phoenix', ..., 'Cerritos', 'Houston',
       'Marinette'], dtype=object)

In [29]:
Jobs['State'].values

array(['OK', 'IL', 'AZ', ..., 'CA', 'TX', 'WI'], dtype=object)

Nice! A minor problem remains: some locations are just the state name (e.g. California). For these cases, let's create a function that uses the dictionary states.xlsx:

In [30]:
states_df = pd.read_excel('master_data/states.xlsx')

states_df.head()

Unnamed: 0,State,Code
0,Alabama,AL
1,Alaska,AK
2,Arizona,AZ
3,Arkansas,AR
4,California,CA


In [31]:
states_dict = states_df.set_index('State')['Code'].to_dict()

def assign_state(row):
    if pd.isnull(row['State']):
        for state, code in states_dict.items():
            if state.lower() in row['Location'].lower():
                return code
    return row['State']

Jobs['State'] = Jobs.apply(assign_state, axis=1)

Perfect!
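One caveat of substring matching in `assign_state`: if the dictionary is iterated alphabetically (as the head of states.xlsx suggests, though the full file is not shown), "Virginia" would be found inside "West Virginia" before the longer name is tried. A minimal sketch of one way around this is to test longer names first. The three-state dictionary and the helper `state_from_location` are hypothetical.

```python
# Subset of a State -> Code mapping in the style of states.xlsx
states_dict = {'Virginia': 'VA', 'Washington': 'WA', 'West Virginia': 'WV'}

# Check longer names first so 'West Virginia' is tested before 'Virginia'
states_by_length = sorted(states_dict.items(), key=lambda item: len(item[0]), reverse=True)

def state_from_location(location):
    """Return the code of the longest state name found in the location text, else None."""
    text = location.lower()
    for state, code in states_by_length:
        if state.lower() in text:
            return code
    return None

print(state_from_location('West Virginia'))  # WV
print(state_from_location('Virginia'))       # VA
print(state_from_location('Remote'))         # None
```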

## Remote

Another interesting column is whether a job is remote, hybrid or neither. To build it, we can use a dictionary of keywords:

In [32]:
remote_df = pd.read_excel('master_data/remote.xlsx')

remote_df.head()

Unnamed: 0,Value,Category
0,Remote,Remote
1,remote,Remote
2,Hybrid,Hybrid
3,hybrid,Hybrid
4,Work From Home,Remote


In [33]:
remote_dict = remote_df.set_index('Value')['Category'].to_dict()

def assign_remote(row):
    for col in ['Job', 'Location', 'Job_Description']:
        for value, category in remote_dict.items():
            if value.lower() in row[col].lower():
                return category
    return None

Jobs['Remote'] = Jobs.apply(assign_remote, axis=1)

In [34]:
remote_counts = Jobs['Remote'].value_counts()

print(remote_counts)

Remote    13058
Hybrid     4680
Name: Remote, dtype: int64


Looks great!

## Profile

Another interesting column is the profile, which shows whether it's a Lead, Senior or Junior position. As we did for the Remote column, we will use a dictionary:

In [35]:
profile_df = pd.read_excel('master_data/profile.xlsx')

profile_df.head()

Unnamed: 0,Value,Profile
0,junior,Junior
1,Junior,Junior
2,Jr,Junior
3,Jr.,Junior
4,senior,Senior


In [36]:
profile_dict = profile_df.set_index('Value')['Profile'].to_dict()

def assign_profile(row):
    for col in ['Job']:
        for value, profile in profile_dict.items():
            if value.lower() in row[col].lower():
                return profile
    return None

Jobs['Profile'] = Jobs.apply(assign_profile, axis=1)

In [37]:
profile_counts = Jobs['Profile'].value_counts()

print(profile_counts)

Senior    8749
Lead      6746
Junior     524
Name: Profile, dtype: int64


Perfect!

## Skills

Now, let's dig into the job descriptions. It would be nice to have a Skills column. We can build it using the dictionary skills.xlsx:

In [38]:
skills_df = pd.read_excel('master_data/skills.xlsx')

skills_df.head()

Unnamed: 0,Skills,Group
0,Access,Access
1,Adobe Analytics,Adobe Analytics
2,AWS,AWS
3,Amazon Web Services,AWS
4,Amazon Web,AWS


In [39]:
skills_dict = skills_df.set_index('Skills')['Group'].to_dict()

# Map every keyword found in the description to its skill group; the set removes duplicate groups
Jobs['Skills'] = Jobs['Job_Description'].apply(lambda x: list({skills_dict[skill] for skill in skills_dict if skill in x}))

In [40]:
skills_freq = Jobs['Skills'].explode().value_counts()

skills_percent = skills_freq / len(Jobs) * 100

top_skills = skills_percent.sort_values(ascending=False)

skills_df_freq = pd.DataFrame({'Percentage': top_skills, 'Count': skills_freq[top_skills.index]})

skills_df_freq.style.format({'Percentage': '{:.2f}%'})

Unnamed: 0,Percentage,Count
Bachelor,65.38%,29155
Office,32.20%,14360
SQL,30.09%,13417
Excel,28.37%,12653
Python,19.54%,8715
Master,18.59%,8292
Tableau,13.51%,6023
Power BI,12.98%,5790
PowerPoint,12.41%,5532
Word,12.15%,5417


Perfect!
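The percentage calculation relies on `explode`, which turns each list element into its own row, so that `value_counts` counts postings per skill. A toy example, with made-up skill lists, shows the mechanics:

```python
import pandas as pd

# Four hypothetical postings; the third one lists no skills
demo = pd.DataFrame({'Skills': [['SQL', 'Excel'], ['SQL'], [], ['Python', 'SQL']]})

# One row per (posting, skill); empty lists become NaN and are ignored by value_counts
counts = demo['Skills'].explode().value_counts()

# Share of all postings (including those without skills) that mention each skill
percent = counts / len(demo) * 100
print(percent)
```

Dividing by `len(demo)` rather than by the number of exploded rows is what makes the result read as "percentage of postings", which is why the percentages can add up to more than 100.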

# Generating the csv file

The last part is to generate the final file. We are going to generate a CSV file (plus an Excel copy). But first, we need to reorder the columns:

In [41]:
Jobs_csv = Jobs[['ID','Job','Jobs_Group','Profile','Remote','Company','Location','City','State','Salary','Frecuency_Salary','Low_Salary','High_Salary','Mean_Salary','Skills']]

In [42]:
Jobs_csv

Unnamed: 0,ID,Job,Jobs_Group,Profile,Remote,Company,Location,City,State,Salary,Frecuency_Salary,Low_Salary,High_Salary,Mean_Salary,Skills
0,job_dd21611605e12f1c,Enterprise Business Intelligence Analyst I,Business Intelligence,,,BOK Financial,"Tulsa, OK 74101",Tulsa,OK,,,,,,"[Office, Power BI, Bachelor]"
1,job_64d9fb0d5e3351ae,"Business Analyst, Commercial Strategy",Business Analyst,,Remote,Live Nation,"Chicago, IL",Chicago,IL,,,,,,"[PowerPoint, Tableau, Excel, Bachelor]"
2,job_3f63ec9a3f8d6a9e,Business Analyst I,Business Analyst,,Hybrid,Amex,"Phoenix, AZ",Phoenix,AZ,"$55,000 - $105,000 por año",año,55000.0,105000.0,80000.0,"[Master, Python, Excel, SQL, Tableau, VBA]"
3,job_f78f07684adbc4c4,"Sr. Analyst, Business Optimization (Remote)",Business Analyst,Senior,Remote,Staples,Remote,,,"$57,300 - $86,400 por año",año,57300.0,86400.0,71850.0,[Office]
4,job_72945a28e345817a,Business Risk Analyst I - Commercial Risk,Business Analyst,,,M&T Bank,"Buffalo, NY",Buffalo,NY,,,,,,[Bachelor]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
969,job_a88f74a87f1c6e48,Property Controller- Graduate East Lansing,Controller,,,Schulte Companies,"East Lansing, MI 48823+1 ubicación",East Lansing,MI,,,,,,"[Office, Bachelor]"
972,job_6fe37642cf95726f,Assistant Controller,Controller,,,"Home State Insurance Group, Inc.","Waco, TX 76710+1 ubicación",Waco,TX,,,,,,"[Office, Word, Excel, Bachelor]"
980,job_094beaf0996c95fe,Plant Controller,Controller,,,Dometic Corporation,"Cerritos, CA 90703",Cerritos,CA,,,,,,"[Master, Dynamics 365, Excel, PowerPoint, Qlik..."
981,job_f3027b081096ff4c,Controller,Controller,,Remote,"The Talance Group, LP","Houston, TX 77002 (Downtown area)",Houston,TX,,,,,,[CPA]


In [43]:
Jobs_csv.to_csv('jobs.csv', index=False, encoding="utf-16")
Jobs_csv.to_excel('jobs.xlsx', index=False)
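Note that the Skills column holds Python lists, which are written to the file as their text representation (for example `['Office', 'Bachelor']`). If the file will be read by other tools, one option is to join each list into a delimited string before exporting. This is a sketch on toy data; the `'; '` separator and the `demo` frame are assumptions.

```python
import io
import pandas as pd

# Two hypothetical postings with list-valued skills
demo = pd.DataFrame({
    'ID': ['job_a', 'job_b'],
    'Skills': [['Office', 'Bachelor'], ['CPA']],
})

# Join each list into one delimited string so it reads cleanly outside pandas
demo['Skills'] = demo['Skills'].apply(lambda skills: '; '.join(skills))

# Write to an in-memory buffer instead of a file for the demonstration
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
print(buffer.getvalue())
```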

# THANK YOU!!