# Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import os

# Loading Files

We load all the files in the folder "data_jobs":

In [2]:
folder = 'data_jobs'
files = os.listdir(folder)

dfs = []

headers = ['Job','URL','Company','Location','Salary','Posted','ID','Job_Description']

for file in files:
    path = os.path.join(folder, file)
    
    df = pd.read_excel(path, names=headers)
    
    dfs.append(df)

Jobs = pd.concat(dfs)
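
One detail worth knowing here: by default `pd.concat` keeps each file's own row labels, so the combined frame can contain repeated index values. The sketch below uses two tiny made-up frames (not the real files) to show the difference; `ignore_index=True` is an optional alternative, not what this notebook uses.

```python
import pandas as pd

# Two tiny stand-in frames (hypothetical data) playing the role of two Excel files
a = pd.DataFrame({'ID': ['job_1', 'job_2']})
b = pd.DataFrame({'ID': ['job_3', 'job_4']})

kept = pd.concat([a, b])                      # row labels 0, 1, 0, 1
fresh = pd.concat([a, b], ignore_index=True)  # row labels 0, 1, 2, 3

print(kept.index.tolist())
print(fresh.index.tolist())
```

Repeated labels are harmless for the steps in this notebook, but they matter if rows are later selected with `.loc`.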

# Data Cleaning

First of all, we need to delete any duplicated rows. The best column for this is "ID", since it identifies each posting:

In [3]:
Jobs = Jobs.drop_duplicates(subset='ID')

Let's see what the DataFrame looks like:

In [4]:
Jobs.describe()

Unnamed: 0,Job,URL,Company,Location,Salary,Posted,ID,Job_Description
count,105756,105756,105707,105756,44423,96571,105756,105756
unique,47228,105756,32606,27175,15373,176,105756,104375
top,Financial Analyst,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,JPMorgan Chase & Co,Remote,"$100,000 - $120,000 por año",PostedHoy,sj_1e37379f40861c74,Ir directamente al contenido principal_x000D_\...
freq,2968,1,492,5949,328,35517,1,19


In [5]:
Jobs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 105756 entries, 0 to 987
Data columns (total 8 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   Job              105756 non-null  object
 1   URL              105756 non-null  object
 2   Company          105707 non-null  object
 3   Location         105756 non-null  object
 4   Salary           44423 non-null   object
 5   Posted           96571 non-null   object
 6   ID               105756 non-null  object
 7   Job_Description  105756 non-null  object
dtypes: object(8)
memory usage: 7.3+ MB


Almost all the columns are fully populated. The exceptions are Salary and Posted (Company is also missing 49 values). The Posted column is not critical, since it only shows when the job posting was published relative to the extraction date. It doesn't seem very useful, so we can drop it.

With regard to the Salary column, about 42% of the values are non-null (44,423 of 105,756).
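
The share of non-null values can be computed directly instead of estimated by eye. A minimal sketch, using a small made-up Series in place of `Jobs['Salary']` plus the counts printed by `Jobs.info()`:

```python
import pandas as pd
import numpy as np

# Hypothetical sample: 2 of 5 postings carry a salary
salary = pd.Series(['$80,000 - $110,000 por año', np.nan, np.nan,
                    '$36 - $40 por hora', np.nan])

share = salary.notna().mean()  # fraction of non-null entries
print(f'{share:.0%}')

# Same calculation with the counts reported by Jobs.info()
info_share = round(44423 / 105756 * 100, 1)
print(info_share)
```
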
#### Could we estimate the Salary null values?
Let's see it later!

In [6]:
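# Note: drop() returns a new DataFrame; the result is not assigned, so Jobs itself still keeps 'Posted'
Jobs.drop(columns=['Posted'])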

Unnamed: 0,Job,URL,Company,Location,Salary,ID,Job_Description
0,Business Analyst,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,CyberCoders,"Torrington, CT 06790","$80,000 - $110,000 por año",sj_1e37379f40861c74,Ir directamente al contenido principal_x000D_\...
1,RPA Business Systems Analyst,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,Amerihealth,"Philadelphia, PA 19107 (City Center East area)...",,sj_a2789bdbc24f4aed,Ir directamente al contenido principal_x000D_\...
2,Quantitive Business Analyst - Strategic Data S...,https://www.indeed.com/rc/clk?jk=15e7be7c9bf65...,Apple,"Austin, TX+1 location",,job_15e7be7c9bf658e3,Ir directamente al contenido principal_x000D_\...
3,Business Line Product Lifecycle Management (PL...,https://www.indeed.com/rc/clk?jk=e8519e1ec2d60...,NXP Semiconductors,"Austin, TX (West Oak Hill area)",,job_e8519e1ec2d60a16,Ir directamente al contenido principal_x000D_\...
4,Global Markets Operations Asset Services Ops S...,https://www.indeed.com/rc/clk?jk=0545baf656087...,Bank of America,"Jacksonville, FL 32246 (Windy Hill area)+4 loc...",,job_0545baf6560877d1,Ir directamente al contenido principal_x000D_\...
...,...,...,...,...,...,...,...
969,Property Controller- Graduate East Lansing,https://www.indeed.com/rc/clk?jk=a88f74a87f1c6...,Schulte Companies,"East Lansing, MI 48823+1 ubicación",,job_a88f74a87f1c6e48,Ir directamente al contenido principal_x000D_\...
972,Assistant Controller,https://www.indeed.com/rc/clk?jk=6fe37642cf957...,"Home State Insurance Group, Inc.","Waco, TX 76710+1 ubicación",,job_6fe37642cf95726f,Ir directamente al contenido principal_x000D_\...
980,Plant Controller,https://www.indeed.com/rc/clk?jk=094beaf0996c9...,Dometic Corporation,"Cerritos, CA 90703",,job_094beaf0996c95fe,Ir directamente al contenido principal_x000D_\...
981,Controller,https://www.indeed.com/rc/clk?jk=f3027b081096f...,"The Talance Group, LP","Houston, TX 77002 (Downtown area)",,job_f3027b081096ff4c,Ir directamente al contenido principal_x000D_\...


All the dtypes are object. However, the Salary column would be more useful as a numeric column.

## Salary

Let's see what it looks like:

In [7]:
Jobs['Salary'].head(50).values.tolist()

['$80,000 - $110,000 por año',
 nan,
 nan,
 nan,
 nan,
 nan,
 '$70,000 - $80,000 por año',
 nan,
 '$36 - $40 por hora',
 'Hasta $176,800 por año',
 nan,
 nan,
 '$70,000 - $129,000 por año',
 nan,
 '$55,000 - $75,000 por año',
 '$56,400 - $94,000 por año',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '$110,000 - $130,000 por año',
 nan,
 '$101,250 - $148,500 por año',
 nan,
 nan,
 '$78,592 - $122,459 por año',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '$85,000 - $100,000 por año',
 nan,
 nan,
 nan,
 '$100,000 - $120,000 por año',
 nan,
 '$140,000 - $190,000 por año',
 nan,
 nan,
 nan,
 'Desde $58 por hora',
 nan]

It looks messy! We have single values and ranges, and salaries quoted per year (año), month (mes), week (semana), day (día), and hour (hora). Moreover, every value carries the $ symbol, and some start with the words Desde (From) or Hasta (Up to). Let's clean it!

First of all, we need the pay frequency in order to standardize the salaries. Let's split it into a new column:

In [8]:
Jobs['Frecuency_Salary'] = Jobs['Salary'].str.split(' por ').str.get(1)

In [9]:
Jobs['Frecuency_Salary'].unique()

array(['año', nan, 'hora', 'mes', 'día', 'semana'], dtype=object)
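
As an alternative to `split` plus `get`, the pay period can also be captured with a regular expression. This is only a sketch on a few values copied from the Salary column above, not the method used in this notebook:

```python
import pandas as pd
import numpy as np

# Sample values copied from the Salary column above
salary = pd.Series(['$80,000 - $110,000 por año', np.nan,
                    '$36 - $40 por hora', 'Desde $58 por hora'])

# Capture the word that follows " por " at the end of the string
period = salary.str.extract(r' por (\w+)$', expand=False)
print(period.tolist())
```

Rows without a salary stay missing, just as with the split-based version.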

In [10]:
Jobs.head()

Unnamed: 0,Job,URL,Company,Location,Salary,Posted,ID,Job_Description,Frecuency_Salary
0,Business Analyst,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,CyberCoders,"Torrington, CT 06790","$80,000 - $110,000 por año",PostedRecién publicado,sj_1e37379f40861c74,Ir directamente al contenido principal_x000D_\...,año
1,RPA Business Systems Analyst,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,Amerihealth,"Philadelphia, PA 19107 (City Center East area)...",,PostedRecién publicado,sj_a2789bdbc24f4aed,Ir directamente al contenido principal_x000D_\...,
2,Quantitive Business Analyst - Strategic Data S...,https://www.indeed.com/rc/clk?jk=15e7be7c9bf65...,Apple,"Austin, TX+1 location",,PostedRecién publicado,job_15e7be7c9bf658e3,Ir directamente al contenido principal_x000D_\...,
3,Business Line Product Lifecycle Management (PL...,https://www.indeed.com/rc/clk?jk=e8519e1ec2d60...,NXP Semiconductors,"Austin, TX (West Oak Hill area)",,PostedRecién publicado,job_e8519e1ec2d60a16,Ir directamente al contenido principal_x000D_\...,
4,Global Markets Operations Asset Services Ops S...,https://www.indeed.com/rc/clk?jk=0545baf656087...,Bank of America,"Jacksonville, FL 32246 (Windy Hill area)+4 loc...",,PostedRecién publicado,job_0545baf6560877d1,Ir directamente al contenido principal_x000D_\...,


Nice! Now let's delete the words "Desde " and "Hasta ":

In [11]:
Jobs['Salary'] = Jobs['Salary'].str.replace('Desde ','')
Jobs['Salary'] = Jobs['Salary'].str.replace('Hasta ','')

In [12]:
Jobs.head()

Unnamed: 0,Job,URL,Company,Location,Salary,Posted,ID,Job_Description,Frecuency_Salary
0,Business Analyst,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,CyberCoders,"Torrington, CT 06790","$80,000 - $110,000 por año",PostedRecién publicado,sj_1e37379f40861c74,Ir directamente al contenido principal_x000D_\...,año
1,RPA Business Systems Analyst,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,Amerihealth,"Philadelphia, PA 19107 (City Center East area)...",,PostedRecién publicado,sj_a2789bdbc24f4aed,Ir directamente al contenido principal_x000D_\...,
2,Quantitive Business Analyst - Strategic Data S...,https://www.indeed.com/rc/clk?jk=15e7be7c9bf65...,Apple,"Austin, TX+1 location",,PostedRecién publicado,job_15e7be7c9bf658e3,Ir directamente al contenido principal_x000D_\...,
3,Business Line Product Lifecycle Management (PL...,https://www.indeed.com/rc/clk?jk=e8519e1ec2d60...,NXP Semiconductors,"Austin, TX (West Oak Hill area)",,PostedRecién publicado,job_e8519e1ec2d60a16,Ir directamente al contenido principal_x000D_\...,
4,Global Markets Operations Asset Services Ops S...,https://www.indeed.com/rc/clk?jk=0545baf656087...,Bank of America,"Jacksonville, FL 32246 (Windy Hill area)+4 loc...",,PostedRecién publicado,job_0545baf6560877d1,Ir directamente al contenido principal_x000D_\...,


Let's work on the range values. A useful approach is to create two new columns, one for the low end of the range and another for the high end:

In [13]:
salary_split = Jobs['Salary'].str.split('-', expand=True)
Jobs['Low_Salary'] = salary_split[0].str.strip()
Jobs['High_Salary'] = salary_split[1].str.strip()

With this split, non-range values only populate the Low_Salary column. In these cases, let's copy the low value into High_Salary so both columns hold the same value:

In [14]:
Jobs['High_Salary'] = Jobs['High_Salary'].fillna(Jobs['Low_Salary'])

In [15]:
Jobs.head()

Unnamed: 0,Job,URL,Company,Location,Salary,Posted,ID,Job_Description,Frecuency_Salary,Low_Salary,High_Salary
0,Business Analyst,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,CyberCoders,"Torrington, CT 06790","$80,000 - $110,000 por año",PostedRecién publicado,sj_1e37379f40861c74,Ir directamente al contenido principal_x000D_\...,año,"$80,000","$110,000 por año"
1,RPA Business Systems Analyst,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,Amerihealth,"Philadelphia, PA 19107 (City Center East area)...",,PostedRecién publicado,sj_a2789bdbc24f4aed,Ir directamente al contenido principal_x000D_\...,,,
2,Quantitive Business Analyst - Strategic Data S...,https://www.indeed.com/rc/clk?jk=15e7be7c9bf65...,Apple,"Austin, TX+1 location",,PostedRecién publicado,job_15e7be7c9bf658e3,Ir directamente al contenido principal_x000D_\...,,,
3,Business Line Product Lifecycle Management (PL...,https://www.indeed.com/rc/clk?jk=e8519e1ec2d60...,NXP Semiconductors,"Austin, TX (West Oak Hill area)",,PostedRecién publicado,job_e8519e1ec2d60a16,Ir directamente al contenido principal_x000D_\...,,,
4,Global Markets Operations Asset Services Ops S...,https://www.indeed.com/rc/clk?jk=0545baf656087...,Bank of America,"Jacksonville, FL 32246 (Windy Hill area)+4 loc...",,PostedRecién publicado,job_0545baf6560877d1,Ir directamente al contenido principal_x000D_\...,,,


It looks better, but we still need to strip the non-numeric characters from both columns:

In [16]:
# Strip "$", thousands separators and the " por <period>" suffix from both columns
pattern = r'[\$,]| por (?:hora|día|año|mes|semana)'
for col in ['Low_Salary', 'High_Salary']:
    Jobs[col] = Jobs[col].replace(pattern, '', regex=True)
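
The split, fill and clean-up steps above can also be done in a single pass with `str.extract`. The sketch below is an alternative under the assumption that every salary follows the "$number" or "$number - $number" pattern seen in the sample; it runs on a few values copied from the column, not on the full dataset.

```python
import pandas as pd
import numpy as np

salary = pd.Series(['$80,000 - $110,000 por año', '$36 - $40 por hora',
                    'Hasta $176,800 por año', np.nan])

# First "$<number>" is the low bound; an optional second one is the high bound
bounds = salary.str.extract(r'\$([\d,.]+)(?:\s*-\s*\$([\d,.]+))?')
bounds.columns = ['Low_Salary', 'High_Salary']

# Drop thousands separators and convert to float
bounds = bounds.apply(lambda s: s.str.replace(',', '', regex=False).astype(float))

# Single values: reuse the low bound as the high bound
bounds['High_Salary'] = bounds['High_Salary'].fillna(bounds['Low_Salary'])
print(bounds)
```

Because the numbers are captured directly, the words "Desde" and "Hasta" and the " por ..." suffix never need to be removed.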

In [17]:
Jobs.head()

Unnamed: 0,Job,URL,Company,Location,Salary,Posted,ID,Job_Description,Frecuency_Salary,Low_Salary,High_Salary
0,Business Analyst,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,CyberCoders,"Torrington, CT 06790","$80,000 - $110,000 por año",PostedRecién publicado,sj_1e37379f40861c74,Ir directamente al contenido principal_x000D_\...,año,80000.0,110000.0
1,RPA Business Systems Analyst,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,Amerihealth,"Philadelphia, PA 19107 (City Center East area)...",,PostedRecién publicado,sj_a2789bdbc24f4aed,Ir directamente al contenido principal_x000D_\...,,,
2,Quantitive Business Analyst - Strategic Data S...,https://www.indeed.com/rc/clk?jk=15e7be7c9bf65...,Apple,"Austin, TX+1 location",,PostedRecién publicado,job_15e7be7c9bf658e3,Ir directamente al contenido principal_x000D_\...,,,
3,Business Line Product Lifecycle Management (PL...,https://www.indeed.com/rc/clk?jk=e8519e1ec2d60...,NXP Semiconductors,"Austin, TX (West Oak Hill area)",,PostedRecién publicado,job_e8519e1ec2d60a16,Ir directamente al contenido principal_x000D_\...,,,
4,Global Markets Operations Asset Services Ops S...,https://www.indeed.com/rc/clk?jk=0545baf656087...,Bank of America,"Jacksonville, FL 32246 (Windy Hill area)+4 loc...",,PostedRecién publicado,job_0545baf6560877d1,Ir directamente al contenido principal_x000D_\...,,,


Good! Now we need to standardize the salaries to a single frequency. Salaries are most commonly quoted per year, so let's convert all the non-yearly salaries to yearly:

In [18]:
Jobs['Low_Salary'] = Jobs['Low_Salary'].astype(float)
Jobs['High_Salary'] = Jobs['High_Salary'].astype(float)

In [19]:
def transform_salary(row):
    if row['Frecuency_Salary'] == 'año':
        factor = 1
    elif row['Frecuency_Salary'] == 'mes':
        factor = 12
    elif row['Frecuency_Salary'] == 'semana':
        factor = 49
    elif row['Frecuency_Salary'] == 'día':
        factor = 230
    elif row['Frecuency_Salary'] == 'hora':
        factor = 1840
    else:
        factor = 1
    row['Low_Salary'] *= factor
    row['High_Salary'] *= factor
    return row

Jobs = Jobs.apply(transform_salary, axis=1)
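
Applying a function row by row can be slow on a frame of this size. A vectorized sketch of the same conversion is shown below, using the same multipliers as `transform_salary` (note that 1840 equals 230 days times 8 hours). The small frame and its monthly figures are made up for illustration.

```python
import pandas as pd
import numpy as np

# Same multipliers as transform_salary above
factors = {'año': 1, 'mes': 12, 'semana': 49, 'día': 230, 'hora': 1840}

df = pd.DataFrame({
    'Frecuency_Salary': ['año', 'hora', 'mes', np.nan],
    'Low_Salary': [80000.0, 36.0, 5000.0, np.nan],
    'High_Salary': [110000.0, 40.0, 6000.0, np.nan],
})

# Look up one multiplier per row; unknown or missing periods fall back to 1
factor = df['Frecuency_Salary'].map(factors).fillna(1)
df['Low_Salary'] = df['Low_Salary'] * factor
df['High_Salary'] = df['High_Salary'] * factor
print(df)
```
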

In [20]:
Jobs.head()

Unnamed: 0,Job,URL,Company,Location,Salary,Posted,ID,Job_Description,Frecuency_Salary,Low_Salary,High_Salary
0,Business Analyst,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,CyberCoders,"Torrington, CT 06790","$80,000 - $110,000 por año",PostedRecién publicado,sj_1e37379f40861c74,Ir directamente al contenido principal_x000D_\...,año,80000.0,110000.0
1,RPA Business Systems Analyst,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,Amerihealth,"Philadelphia, PA 19107 (City Center East area)...",,PostedRecién publicado,sj_a2789bdbc24f4aed,Ir directamente al contenido principal_x000D_\...,,,
2,Quantitive Business Analyst - Strategic Data S...,https://www.indeed.com/rc/clk?jk=15e7be7c9bf65...,Apple,"Austin, TX+1 location",,PostedRecién publicado,job_15e7be7c9bf658e3,Ir directamente al contenido principal_x000D_\...,,,
3,Business Line Product Lifecycle Management (PL...,https://www.indeed.com/rc/clk?jk=e8519e1ec2d60...,NXP Semiconductors,"Austin, TX (West Oak Hill area)",,PostedRecién publicado,job_e8519e1ec2d60a16,Ir directamente al contenido principal_x000D_\...,,,
4,Global Markets Operations Asset Services Ops S...,https://www.indeed.com/rc/clk?jk=0545baf656087...,Bank of America,"Jacksonville, FL 32246 (Windy Hill area)+4 loc...",,PostedRecién publicado,job_0545baf6560877d1,Ir directamente al contenido principal_x000D_\...,,,


Nice! The last step here is to create a single salary column. This makes it possible to compare and analyze postings, since some salaries are ranges and others are not. A simple approach is a mean salary column, the midpoint of the low and high values:

In [21]:
Jobs['Mean_Salary'] = (Jobs['Low_Salary'] + Jobs['High_Salary']) / 2

In [22]:
Jobs.head()

Unnamed: 0,Job,URL,Company,Location,Salary,Posted,ID,Job_Description,Frecuency_Salary,Low_Salary,High_Salary,Mean_Salary
0,Business Analyst,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,CyberCoders,"Torrington, CT 06790","$80,000 - $110,000 por año",PostedRecién publicado,sj_1e37379f40861c74,Ir directamente al contenido principal_x000D_\...,año,80000.0,110000.0,95000.0
1,RPA Business Systems Analyst,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,Amerihealth,"Philadelphia, PA 19107 (City Center East area)...",,PostedRecién publicado,sj_a2789bdbc24f4aed,Ir directamente al contenido principal_x000D_\...,,,,
2,Quantitive Business Analyst - Strategic Data S...,https://www.indeed.com/rc/clk?jk=15e7be7c9bf65...,Apple,"Austin, TX+1 location",,PostedRecién publicado,job_15e7be7c9bf658e3,Ir directamente al contenido principal_x000D_\...,,,,
3,Business Line Product Lifecycle Management (PL...,https://www.indeed.com/rc/clk?jk=e8519e1ec2d60...,NXP Semiconductors,"Austin, TX (West Oak Hill area)",,PostedRecién publicado,job_e8519e1ec2d60a16,Ir directamente al contenido principal_x000D_\...,,,,
4,Global Markets Operations Asset Services Ops S...,https://www.indeed.com/rc/clk?jk=0545baf656087...,Bank of America,"Jacksonville, FL 32246 (Windy Hill area)+4 loc...",,PostedRecién publicado,job_0545baf6560877d1,Ir directamente al contenido principal_x000D_\...,,,,


PERFECT!

## Jobs Group

Since the job titles are quite heterogeneous, we need to group them into a few job groups. Let's do it with the dictionary categories.xlsx:

In [23]:
category_df = pd.read_excel('master_data/categories.xlsx')

category_df.head()

Unnamed: 0,keyword,category
0,data engineer,Data Engineer
1,data scientist,Data Scientist
2,data science,Data Scientist
3,ds,Data Scientist
4,machine learning,ML/AI Engineer


In [24]:
categories = category_df.set_index('keyword')['category'].to_dict()

def assign_category(job):
    job = job.lower()
    for keyword, category in categories.items():
        if keyword in job:
            return category
    return 'Others'

Jobs['Jobs_Group'] = Jobs['Job'].apply(assign_category)
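
One thing to keep in mind: `keyword in job` is a plain substring test, so a short keyword such as `ds` also matches inside longer words. The sketch below shows a whole-word variant using `re`. The helper `assign_category_words` and the sample titles are hypothetical, and whether any titles in the dataset are actually affected has not been checked here.

```python
import re

# Hypothetical subset of the keyword dictionary shown above
categories = {'data scientist': 'Data Scientist', 'ds': 'Data Scientist'}

def assign_category_words(job, categories):
    """Return the first category whose keyword appears as a whole word or phrase."""
    job = job.lower()
    for keyword, category in categories.items():
        if re.search(r'\b' + re.escape(keyword) + r'\b', job):
            return category
    return 'Others'

substring_hit = 'ds' in 'hedge funds accountant'  # plain substring test
print(substring_hit)
print(assign_category_words('Hedge Funds Accountant', categories))
print(assign_category_words('DS Lead', categories))
```
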

In [25]:
category_counts = Jobs['Jobs_Group'].value_counts()

print(category_counts)

Financial Analyst            18143
Business Analyst             16346
Data Analyst                 12413
Data Engineer                10061
Controller                    9868
Data Scientist                7048
Business Intelligence         6088
CFO                           5472
Analyst                       5352
Finance                       5346
Operations Analyst            3508
Others                        2624
ML/AI Engineer                2589
Statistician/Mathemathics      898
Name: Jobs_Group, dtype: int64


Nice!

## Location

The next step is to clean the Location column. It would be nice to have City and State columns. The simplest way is to split at the comma: everything before it becomes the city, and the two capital letters after it become the state. Let's work on it!

In [26]:
Jobs[['City', 'State']] = Jobs['Location'].str.extract(r'^(.+),\s([A-Z]{2})', expand=True)
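
To see what this pattern does, here is the same `str.extract` call on a few Location values taken from the outputs above. Values without a "City, ST" shape, such as "Remote", produce missing values in both columns.

```python
import pandas as pd

locations = pd.Series(['Torrington, CT 06790', 'Austin, TX+1 location',
                       'Philadelphia, PA 19107 (City Center East area)', 'Remote'])

# Group 1: everything before the comma; group 2: the two capital letters after it
parts = locations.str.extract(r'^(.+),\s([A-Z]{2})', expand=True)
parts.columns = ['City', 'State']
print(parts)
```
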

In [27]:
Jobs.head()

Unnamed: 0,Job,URL,Company,Location,Salary,Posted,ID,Job_Description,Frecuency_Salary,Low_Salary,High_Salary,Mean_Salary,Jobs_Group,City,State
0,Business Analyst,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,CyberCoders,"Torrington, CT 06790","$80,000 - $110,000 por año",PostedRecién publicado,sj_1e37379f40861c74,Ir directamente al contenido principal_x000D_\...,año,80000.0,110000.0,95000.0,Business Analyst,Torrington,CT
1,RPA Business Systems Analyst,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,Amerihealth,"Philadelphia, PA 19107 (City Center East area)...",,PostedRecién publicado,sj_a2789bdbc24f4aed,Ir directamente al contenido principal_x000D_\...,,,,,Business Analyst,Philadelphia,PA
2,Quantitive Business Analyst - Strategic Data S...,https://www.indeed.com/rc/clk?jk=15e7be7c9bf65...,Apple,"Austin, TX+1 location",,PostedRecién publicado,job_15e7be7c9bf658e3,Ir directamente al contenido principal_x000D_\...,,,,,Business Analyst,Austin,TX
3,Business Line Product Lifecycle Management (PL...,https://www.indeed.com/rc/clk?jk=e8519e1ec2d60...,NXP Semiconductors,"Austin, TX (West Oak Hill area)",,PostedRecién publicado,job_e8519e1ec2d60a16,Ir directamente al contenido principal_x000D_\...,,,,,Business Analyst,Austin,TX
4,Global Markets Operations Asset Services Ops S...,https://www.indeed.com/rc/clk?jk=0545baf656087...,Bank of America,"Jacksonville, FL 32246 (Windy Hill area)+4 loc...",,PostedRecién publicado,job_0545baf6560877d1,Ir directamente al contenido principal_x000D_\...,,,,,Operations Analyst,Jacksonville,FL


In [28]:
Jobs['City'].values

array(['Torrington', 'Philadelphia', 'Austin', ..., 'Cerritos', 'Houston',
       'Marinette'], dtype=object)

In [29]:
Jobs['State'].values

array(['CT', 'PA', 'TX', ..., 'CA', 'TX', 'WI'], dtype=object)

Nice! A minor problem is that some Locations are just the state name (e.g. California). For these cases, let's create a function that uses the dictionary states.xlsx:

In [30]:
states_df = pd.read_excel('master_data/states.xlsx')

states_df.head()

Unnamed: 0,State,Code
0,Alabama,AL
1,Alaska,AK
2,Arizona,AZ
3,Arkansas,AR
4,California,CA


In [31]:
states_dict = states_df.set_index('State')['Code'].to_dict()

def assign_state(row):
    if pd.isnull(row['State']):
        for state, code in states_dict.items():
            if state.lower() in row['Location'].lower():
                return code
    return row['State']

Jobs['State'] = Jobs.apply(assign_state, axis=1)
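
Because the lookup is a substring test in dictionary order, a state whose name contains another state's name could be mislabeled (for example "Virginia" inside "West Virginia"). This assumes states.xlsx lists both names alphabetically, which is not confirmed here. A hedged sketch of one way around it is to try the longest names first; `match_state` is a hypothetical helper.

```python
# Hypothetical subset of the state dictionary
states_dict = {'Virginia': 'VA', 'West Virginia': 'WV'}

def match_state(location, states_dict):
    """Return the code of the longest state name contained in location, else None."""
    for state in sorted(states_dict, key=len, reverse=True):
        if state.lower() in location.lower():
            return states_dict[state]
    return None

print(match_state('West Virginia', states_dict))
print(match_state('Virginia', states_dict))
print(match_state('Remote', states_dict))
```
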

Perfect!

## Remote

Another interesting column is whether a job is remote, hybrid, or neither. To build it, we can use a dictionary of keywords:

In [32]:
remote_df = pd.read_excel('master_data/remote.xlsx')

remote_df.head()

Unnamed: 0,Value,Category
0,Remote,Remote
1,remote,Remote
2,Hybrid,Hybrid
3,hybrid,Hybrid
4,Work From Home,Remote


In [33]:
remote_dict = remote_df.set_index('Value')['Category'].to_dict()

def assign_remote(row):
    for col in ['Job', 'Location', 'Job_Description']:
        for value, category in remote_dict.items():
            if value.lower() in row[col].lower():
                return category
    return None

Jobs['Remote'] = Jobs.apply(assign_remote, axis=1)
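
The function scans Job, then Location, then Job_Description, and the first keyword hit decides the label. The sketch below repeats the same logic on three made-up postings to show that order of precedence. Since both sides are lowercased, one spelling per keyword is enough in this reduced dictionary.

```python
import pandas as pd

# Hypothetical subset of remote.xlsx, already lowercased
remote_dict = {'remote': 'Remote', 'hybrid': 'Hybrid', 'work from home': 'Remote'}

def assign_remote(row):
    # Columns are scanned in order; the first keyword hit decides the label
    for col in ['Job', 'Location', 'Job_Description']:
        for value, category in remote_dict.items():
            if value in row[col].lower():
                return category
    return None

rows = pd.DataFrame({
    'Job': ['Data Analyst (Hybrid)', 'Controller', 'Plant Controller'],
    'Location': ['Austin, TX', 'Remote', 'Cerritos, CA 90703'],
    'Job_Description': ['Build dashboards.', 'Close the books.', 'On site role.'],
})
labels = rows.apply(assign_remote, axis=1)
print(labels.tolist())
```

Keep in mind that this is keyword matching only: a description saying that remote work is not offered would still be labeled Remote.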

In [34]:
remote_counts = Jobs['Remote'].value_counts()

print(remote_counts)

Remote    29981
Hybrid    10993
Name: Remote, dtype: int64


Looks great!

## Profile

Another interesting column is the profile. This column shows whether it's a Lead, Senior, or Junior position. As we did for the Remote column, we will use a dictionary:

In [35]:
profile_df = pd.read_excel('master_data/profile.xlsx')

profile_df.head()

Unnamed: 0,Value,Profile
0,junior,Junior
1,Junior,Junior
2,Jr,Junior
3,Jr.,Junior
4,senior,Senior


In [36]:
profile_dict = profile_df.set_index('Value')['Profile'].to_dict()

def assign_profile(row):
    for col in ['Job']:
        for value, profile in profile_dict.items():
            if value.lower() in row[col].lower():
                return profile
    return None

Jobs['Profile'] = Jobs.apply(assign_profile, axis=1)

In [37]:
profile_counts = Jobs['Profile'].value_counts()

print(profile_counts)

Senior    20848
Lead      16572
Junior     1293
Name: Profile, dtype: int64


Perfect!

## Skills

Now, let's dig into the job descriptions. It would be nice to have a Skills column. We can build it using the dictionary skills.xlsx:

In [38]:
skills_df = pd.read_excel('master_data/skills.xlsx')

skills_df.head()

Unnamed: 0,Skills,Group
0,Access,Access
1,Adobe Analytics,Adobe Analytics
2,AWS,AWS
3,Amazon Web Services,AWS
4,Amazon Web,AWS


In [39]:
skills_dict = skills_df.set_index('Skills')['Group'].to_dict()

Jobs['Skills'] = Jobs['Job_Description'].apply(lambda x: list(set([skills_dict[skill] for skill in skills_dict.keys() if skill in x])))

In [40]:
skills_freq = Jobs['Skills'].explode().value_counts()

skills_percent = skills_freq / len(Jobs) * 100

top_skills = skills_percent.sort_values(ascending=False)

skills_df_freq = pd.DataFrame({'Percentage': top_skills, 'Count': skills_freq[top_skills.index]})

skills_df_freq.style.format({'Percentage': '{:.2f}%'})

Unnamed: 0,Percentage,Count
Bachelor,65.49%,69263
Office,33.81%,35752
Excel,28.63%,30275
SQL,28.42%,30058
Python,18.80%,19882
Master,18.49%,19559
Tableau,12.77%,13501
Word,12.58%,13307
Power BI,12.51%,13227
PowerPoint,12.47%,13193
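
To make the counting step easier to follow, here is the same `explode` plus `value_counts` idea on a made-up Skills column with four postings. Each list is expanded to one row per skill, the skills are tallied, and the tally is divided by the number of postings.

```python
import pandas as pd

# Hypothetical Skills column for four postings
skills = pd.Series([['SQL', 'Python'], ['SQL'], [], ['Excel', 'SQL']])

counts = skills.explode().value_counts()  # one row per (posting, skill), then tally
percent = counts / len(skills) * 100      # share of postings mentioning each skill
print(percent)
```

The percentages do not add up to 100, since one posting can mention several skills.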


Perfect!

# Generating the CSV file

The last part is to generate the final file. We are going to write it in CSV format, along with an Excel copy. But first of all, we need to reorganize the columns:

In [41]:
Jobs_csv = Jobs[['ID','Job','Jobs_Group','Profile','Remote','Company','Location','City','State','Salary','Frecuency_Salary','Low_Salary','High_Salary','Mean_Salary','Skills']]

In [42]:
Jobs_csv

Unnamed: 0,ID,Job,Jobs_Group,Profile,Remote,Company,Location,City,State,Salary,Frecuency_Salary,Low_Salary,High_Salary,Mean_Salary,Skills
0,sj_1e37379f40861c74,Business Analyst,Business Analyst,,,CyberCoders,"Torrington, CT 06790",Torrington,CT,"$80,000 - $110,000 por año",año,80000.0,110000.0,95000.0,[]
1,sj_a2789bdbc24f4aed,RPA Business Systems Analyst,Business Analyst,,,Amerihealth,"Philadelphia, PA 19107 (City Center East area)...",Philadelphia,PA,,,,,,"[Office, SQL, Bachelor]"
2,job_15e7be7c9bf658e3,Quantitive Business Analyst - Strategic Data S...,Business Analyst,,,Apple,"Austin, TX+1 location",Austin,TX,,,,,,"[Python, SQL, Bachelor]"
3,job_e8519e1ec2d60a16,Business Line Product Lifecycle Management (PL...,Business Analyst,Junior,,NXP Semiconductors,"Austin, TX (West Oak Hill area)",Austin,TX,,,,,,[Bachelor]
4,job_0545baf6560877d1,Global Markets Operations Asset Services Ops S...,Operations Analyst,Senior,,Bank of America,"Jacksonville, FL 32246 (Windy Hill area)+4 loc...",Jacksonville,FL,,,,,,[Excel]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
969,job_a88f74a87f1c6e48,Property Controller- Graduate East Lansing,Controller,,,Schulte Companies,"East Lansing, MI 48823+1 ubicación",East Lansing,MI,,,,,,"[Office, Bachelor]"
972,job_6fe37642cf95726f,Assistant Controller,Controller,,,"Home State Insurance Group, Inc.","Waco, TX 76710+1 ubicación",Waco,TX,,,,,,"[Excel, Word, Office, Bachelor]"
980,job_094beaf0996c95fe,Plant Controller,Controller,,,Dometic Corporation,"Cerritos, CA 90703",Cerritos,CA,,,,,,"[Excel, Access, Office, Bachelor, PowerPoint, ..."
981,job_f3027b081096ff4c,Controller,Controller,,Remote,"The Talance Group, LP","Houston, TX 77002 (Downtown area)",Houston,TX,,,,,,[CPA]


In [43]:
Jobs_csv.to_csv('jobs.csv', index=False, encoding="utf-16")
Jobs_csv.to_excel('jobs.xlsx', index=False)
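
Two things are worth checking when the CSV is read back: the same `encoding` has to be passed to `pd.read_csv`, and list columns such as Skills come back as plain text. A small round-trip sketch with a one-row made-up frame written to a temporary folder:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'ID': ['job_1'], 'Frecuency_Salary': ['año'],
                   'Skills': [['SQL', 'Python']]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'jobs.csv')
    df.to_csv(path, index=False, encoding='utf-16')
    back = pd.read_csv(path, encoding='utf-16')  # same encoding is needed to read it

print(back.loc[0, 'Frecuency_Salary'])
print(back.loc[0, 'Skills'])  # the list is now its text representation
```
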

# THANK YOU!!