# EDA of the crawled data

This notebook explores the crawled data from the last notebook and cleans it for the next step: data modelling. 

The next step uses machine learning algorithms to answer the following questions:

- Q1: How many jobs are there for data scientist, data engineer and data analyst positions?

- Q2: What skills (both hard/technical and soft/communicational) do the three kinds of jobs require?

I explore the data while bearing the two questions in mind.

## #1 How much data is there?

How many days of crawl are there?

In [1]:
from glob import glob

csv_paths = sorted(glob("../data/crawled_jobs_????-??-??.csv"))
csv_paths

['../data/crawled_jobs_2023-10-22.csv',
 '../data/crawled_jobs_2023-10-28.csv',
 '../data/crawled_jobs_2023-10-29.csv',
 '../data/crawled_jobs_2023-10-30.csv',
 '../data/crawled_jobs_2023-10-31.csv',
 '../data/crawled_jobs_2023-11-01.csv',
 '../data/crawled_jobs_2023-11-02.csv',
 '../data/crawled_jobs_2023-11-03.csv',
 '../data/crawled_jobs_2023-11-04.csv']

### Observations #1

- The crawl covers eight consecutive days, 2023-10-28 to 2023-11-04, plus one isolated day, 2023-10-22. Plotting the statistics of all days would leave a gap between 2023-10-22 and 2023-10-28, so I skip the isolated day.

- New data will keep arriving for the plotting demo, so I need a continuous pipeline that validates and deploys each new batch.
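
One way to validate incoming batches is to check that the crawl dates form an unbroken run. The sketch below is a plain-Python illustration of that idea; the helper names `crawl_date` and `longest_consecutive_run` are hypothetical and not part of the original notebook, and the file-name pattern is assumed to match the listing above.

```python
import re
from datetime import date, timedelta

# Assumed file-name convention, taken from the glob pattern used above.
DATE_RE = re.compile(r"crawled_jobs_(\d{4})-(\d{2})-(\d{2})\.csv$")

def crawl_date(path):
    """Parse the crawl date out of a name such as crawled_jobs_2023-10-28.csv."""
    match = DATE_RE.search(path)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    year, month, day = map(int, match.groups())
    return date(year, month, day)

def longest_consecutive_run(paths):
    """Return the longest run of paths whose crawl dates are one day apart."""
    best, current = [], []
    for path in sorted(paths, key=crawl_date):
        if current and crawl_date(path) - crawl_date(current[-1]) == timedelta(days=1):
            current.append(path)
        else:
            current = [path]  # a gap starts a new run
        if len(current) > len(best):
            best = list(current)
    return best

example_paths = [
    "../data/crawled_jobs_2023-10-22.csv",
    "../data/crawled_jobs_2023-10-28.csv",
    "../data/crawled_jobs_2023-10-29.csv",
    "../data/crawled_jobs_2023-10-30.csv",
]
run = longest_consecutive_run(example_paths)
print([crawl_date(p).isoformat() for p in run])
# ['2023-10-28', '2023-10-29', '2023-10-30']
```

A new batch could then be rejected, or flagged for review, whenever it does not extend the current run by exactly one day.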

## #2: What do each day's job postings look like?

Pick a day and peek at the data. 

Let's first peek at the data from 2023-10-22.

In [2]:
import pandas as pd

_data0 = pd.read_csv(csv_paths[0])
print(csv_paths[0])
_data0

../data/crawled_jobs_2023-10-22.csv


Unnamed: 0,job_title,required_skills,job_type_1,job_type_2,linkedin_url,company,company_linkedin_url,location,posted_date,applicant_count,job_description,benefits
0,RWE Scientist / Epidemiologist,"Customer Relationship Management (CRM), Epidem...",Hybrid,Full-time,https://www.linkedin.com/jobs/view/3737909849/...,MedEngine,https://www.linkedin.com/company/medengine/life,"MedEngine · Helsinki, Uusimaa, Finland 1 week...",1 week ago,0,About the job\nMedEngine is a digitally minded...,
1,Data Engineer,"Data Engineering, Git, Python (Programming Lan...",Hybrid,Full-time,https://www.linkedin.com/jobs/view/3736532279/...,Suomen Palloliitto - Football Association of F...,https://www.linkedin.com/company/football-asso...,Suomen Palloliitto - Football Association of F...,1 week ago,0,About the job\nDATA ENGINEER\n\nSUOMEN PALLOLI...,
2,Senior Game Analyst,"Analytical Skills, Data Analysis, Economics, M...",Hybrid,Full-time,https://www.linkedin.com/jobs/view/3717037977/...,"Next Games, A Netflix Game Studio",https://www.linkedin.com/company/next-games/life,"Next Games, A Netflix Game Studio · Helsinki, ...",Reposted 6 days ago,0,About the job\nNext Games is a Netflix Game St...,
3,Data Scientist,"Data Analysis, Data Science, Machine Learning,...",Hybrid,Full-time,https://www.linkedin.com/jobs/view/3735986015/...,MedEngine,https://www.linkedin.com/company/medengine/life,"MedEngine · Helsinki, Uusimaa, Finland 1 week...",1 week ago,0,About the job\nMedEngine is a digitally minded...,
4,Data Science - Machine Learning Engineer,"Artificial Intelligence (AI), Computer Science...",Remote,Full-time,https://www.linkedin.com/jobs/view/3629670334/...,Wolt,https://www.linkedin.com/company/wolt-oy/life,"Wolt · Helsinki, Uusimaa, Finland Reposted 2 ...",Reposted 2 weeks ago,0,About the job\nJob Description\n\nTeam purpose...,
...,...,...,...,...,...,...,...,...,...,...,...,...
307,Remote Work (Finnish Speakers) - Internet Ads ...,English and Finnish,Remote,,https://www.linkedin.com/jobs/view/3728766620/...,TELUS International AI Data Solutions,https://www.linkedin.com/company/telusinternat...,TELUS International AI Data Solutions · Finlan...,3 weeks ago,0,About the job\nOur Company \n\nTELUS Internati...,
308,Work From Home - Finnish Speakers (Internet Ad...,English and Finnish,Remote,,https://www.linkedin.com/jobs/view/3731394329/...,TELUS International AI Data Solutions,https://www.linkedin.com/company/telusinternat...,TELUS International AI Data Solutions · Finlan...,2 weeks ago,0,About the job\nOur Company \n\nTELUS Internati...,
309,"Senior Engineering Manager (Bangkok based, rel...","Software DevelopmentC#, Engineering Management...",Hybrid,Full-time,https://www.linkedin.com/jobs/view/3616699365/...,Agoda,https://www.linkedin.com/company/agoda/life,"Agoda · Helsinki, Uusimaa, Finland Reposted 3...",Reposted 3 days ago,0,About the job\nAbout Agoda\n\nAgoda is an onli...,
310,"Commercial Director Nordics, Transport Reagents","SwedishBusiness Planning, Commodities, Indirec...",On-site,Full-time,https://www.linkedin.com/jobs/view/3736132257/...,Yara Suomi,https://www.linkedin.com/company/yarasuomi/life,"Yara Suomi · Espoo, Uusimaa, Finland 1 week a...",1 week ago,0,About the job\nAbout Yara Industrial Solutions...,


### Observations #2 

- `required_skills` is a comma-delimited list of skills stored as a single string, so it needs tokenizing.
- `applicant_count` and `benefits` are not informative (in every row shown above, `applicant_count` is 0 and `benefits` is empty) and should be dropped.
- `company_linkedin_url` and `posted_date` are not useful for answering the research questions (Q1 or Q2) and should be dropped.
- `location` mixes in irrelevant information, such as the company name and the number of applicants. It should also be dropped.
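
The tokenizing mentioned in the first observation could look roughly like the sketch below. It is only an illustration: `tokenize_skills` and `PROTECTED_SKILLS` are hypothetical names, and the list of skill names that themselves contain a comma is an assumption that would need to be extended after inspecting the data.

```python
# Skill names that contain a comma themselves (assumed list; extend as needed).
PROTECTED_SKILLS = ("Extract, Transform, Load (ETL)",)
_PLACEHOLDER = "\x00"  # stands in for ", " inside protected names

def tokenize_skills(raw, protected=PROTECTED_SKILLS):
    """Split a comma-delimited skills string into a list of skill names."""
    if not isinstance(raw, str):  # missing values arrive as NaN (a float)
        return []
    # Hide the commas inside protected names so they survive the split.
    for name in protected:
        raw = raw.replace(name, name.replace(", ", _PLACEHOLDER))
    skills = []
    for part in raw.split(", "):
        part = part.strip()
        if part.startswith("and "):  # the last item is often written "and X"
            part = part[len("and "):]
        if part:
            skills.append(part.replace(_PLACEHOLDER, ", "))
    return skills

print(tokenize_skills("Data Engineering, Git, and SQL"))
# ['Data Engineering', 'Git', 'SQL']
print(tokenize_skills("Extract, Transform, Load (ETL), and Python (Programming Language)"))
# ['Extract, Transform, Load (ETL)', 'Python (Programming Language)']
print(tokenize_skills(float("nan")))
# []
```

Two cases are deliberately left alone: strings such as "English and Finnish" (splitting on a bare "and" would also break legitimate names), and fused tokens such as "Software DevelopmentC#", which look like crawl artifacts.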


### Aggregate all the dates' data

In [3]:
from datetime import datetime, timedelta
import pandas as pd

start_date = datetime(2023, 10, 29)
end_date = datetime(2023, 11, 4)

date = start_date
dfs = []
while date <= end_date:
    csv_path = f"../data/crawled_jobs_{date.strftime('%Y-%m-%d')}.csv"

    new_daily_data = pd.read_csv(csv_path)
    new_daily_data.loc[:,"crawl_date"] = date
    dfs.append(new_daily_data)

    date += timedelta(days=1)

df = pd.concat(dfs, axis=0, ignore_index=True)
df

Unnamed: 0,job_title,required_skills,job_type_1,job_type_2,linkedin_url,company,company_linkedin_url,location,posted_date,applicant_count,job_description,benefits,crawl_date
0,Data Scientist,"Data Analysis, Data Science, Machine Learning,...",Hybrid,Full-time,https://www.linkedin.com/jobs/view/3735986015/...,MedEngine,https://www.linkedin.com/company/medengine/life,"MedEngine · Helsinki, Uusimaa, Finland 2 week...",2 weeks ago,0,About the job\nMedEngine is a digitally minded...,,2023-10-29
1,"Senior Data Analyst, Ads","Data Analysis, Python (Programming Language), ...",Hybrid,Full-time,https://www.linkedin.com/jobs/view/3702242885/...,Rovio Entertainment Corporation,https://www.linkedin.com/company/rovio/life,"Rovio Entertainment Corporation · Helsinki, Uu...",Reposted 1 week ago,0,About the job\nAt Rovio you will get to work w...,,2023-10-29
2,JVM Performance and Tuning Engineer,"Business Logic, Garbage Collection, Honeycomb,...",Remote,Full-time,https://www.linkedin.com/jobs/view/3734708994/...,RELEX Solutions,https://www.linkedin.com/company/relexsolution...,RELEX Solutions · Finland 2 weeks ago · 10 a...,2 weeks ago,0,About the job\nRELEX Solutions create cutting-...,,2023-10-29
3,Data Engineer (Level Up),"Data Warehousing, Finnish, and SQLData Visuali...",Hybrid,Full-time,https://www.linkedin.com/jobs/view/3744740320/...,Loihde Advance,https://www.linkedin.com/company/loihdeadvance...,"Loihde Advance · Uusimaa, Finland 1 week ago ...",1 week ago,0,About the job\nOnko sinulle jo kertynyt jo väh...,,2023-10-29
4,Data Science - Machine Learning Engineer,"Artificial Intelligence (AI), Computer Science...",Remote,Full-time,https://www.linkedin.com/jobs/view/3629670334/...,Wolt,https://www.linkedin.com/company/wolt-oy/life,"Wolt · Helsinki, Uusimaa, Finland Reposted 6 ...",Reposted 6 hours ago,0,About the job\nJob Description\n\nTeam purpose...,,2023-10-29
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2588,Data Engineer,"Data Engineering, Git, Python (Programming Lan...",Hybrid,Full-time,https://www.linkedin.com/jobs/view/3736532279/...,Suomen Palloliitto - Football Association of F...,https://www.linkedin.com/company/football-asso...,Suomen Palloliitto - Football Association of F...,3 weeks ago,0,About the job\nDATA ENGINEER\n\nSUOMEN PALLOLI...,,2023-11-04
2589,AI Engineer Academy - Finland,"Artificial Intelligence (AI), Computer Science...",Full-time,,https://www.linkedin.com/jobs/view/3746215896/...,Avanade,https://www.linkedin.com/company/avanade/life,"Avanade · Helsinki, Uusimaa, Finland 1 week a...",1 week ago,0,About the job\nJob Description\n\nAre you inte...,,2023-11-04
2590,Lead Competitive Intelligence Analyst,"Analytical Skills, Data Analysis, Data Analyti...",Remote,Full-time,https://www.linkedin.com/jobs/view/3747370503/...,Huuuge Games,https://www.linkedin.com/company/huuugegames/life,Huuuge Games · Finland 1 week ago · 33 appli...,1 week ago,0,About the job\nIntroduction:\nJoin an industry...,,2023-11-04
2591,Senior Data Analyst,"Data Analysis, Python (Programming Language), ...",Hybrid,Full-time,https://www.linkedin.com/jobs/view/3750109074/...,Musopia,https://www.linkedin.com/company/musopia/life,Musopia · Helsinki Metropolitan Area 5 days a...,5 days ago,0,About the job\nHow about a backstage pass to t...,,2023-11-04


## #3 Are there duplicate postings in a day's crawl?

`linkedin_url` should identify each posting uniquely.

If there are no duplicates, the number of rows in a crawl equals the number of unique `linkedin_url` values.

In [4]:
with pd.option_context("display.max_colwidth", 200):
    display(df.linkedin_url)

0       https://www.linkedin.com/jobs/view/3735986015/?eBP=CwEAAAGLfEIr4Dx1-Kfr79dkVyiabtHzAgIzxq_6BKxl_Arj-rgRU4jHm6HBFOg536sM2oYAdDv41PFPygF1nNPP42uBq_qrqcduMR2elGTRHtCL-_BbUkas-X1kKbbNMvLgnvPHdTgjvx4EF...
1       https://www.linkedin.com/jobs/view/3702242885/?eBP=CwEAAAGLfEIr4Gfd6iss7Zh09ip4bKC81ieEpBLVCOztyL7qqFrGaW7A9ZV8UmZ5_4r103273MaRWG3YG8aVNO-wmSNUQEU6H2uWkIWXytlVuJP2awC9rYIl7Hto15kn7AmaAnno1xydiXV9_...
2       https://www.linkedin.com/jobs/view/3734708994/?eBP=CwEAAAGLfEIr4DFFImZ4W08zEPBGiLRMzXTtFoQMjZAr4PJDWS2dxuMWMYc4OYq6e5QliLsD7_Y7Wp9IHEYtENCP5QIYkerLC3kMnGGmrPDyllqjr9VyglRHSvZIavbQ6biF2oAAMDCgjqFOs...
3       https://www.linkedin.com/jobs/view/3744740320/?eBP=CwEAAAGLfEIr4AtXNJbIzHgii3096t-f-WLgW-C0Fe4uf73CfyqfcvB9mDAsWJfrEqSbtYAvKZdUQVn5fTeDP6R5zTAip8gcjfXfrMplSndXkmSmSKc4ZHige0ma7RW6rPFRRGAU-udREKqFz...
4                          https://www.linkedin.com/jobs/view/3629670334/?eBP=JOB_SEARCH_ORGANIC&refId=P8zLgWdnQx7b0E65HbIzDw%3D%3D&trackingId=yGhQYMf%2FznekMB3O%2FLW%2

In [5]:
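# Standardize LinkedIn URLs so that they can be used as job IDs
import re

def standardize_job_url(url):
    # Keep only the stable prefix and drop the tracking query string.
    # A raw string avoids the invalid-escape warning for "\d".
    return re.match(r"https://www\.linkedin\.com/jobs/view/\d+", url).group(0)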

df.linkedin_url.map(standardize_job_url)

0       https://www.linkedin.com/jobs/view/3735986015
1       https://www.linkedin.com/jobs/view/3702242885
2       https://www.linkedin.com/jobs/view/3734708994
3       https://www.linkedin.com/jobs/view/3744740320
4       https://www.linkedin.com/jobs/view/3629670334
                            ...                      
2588    https://www.linkedin.com/jobs/view/3736532279
2589    https://www.linkedin.com/jobs/view/3746215896
2590    https://www.linkedin.com/jobs/view/3747370503
2591    https://www.linkedin.com/jobs/view/3750109074
2592    https://www.linkedin.com/jobs/view/3744561692
Name: linkedin_url, Length: 2593, dtype: object

### Observation #3

Usually not. In any case, deduplication should be part of the preprocessing pipeline.
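
To make the duplicate check explicit, the sketch below counts how often each standardized job URL occurs in one crawl. It uses plain Python and made-up URLs so that it runs without the crawl data; `job_id` and `duplicated_ids` are hypothetical helper names.

```python
import re
from collections import Counter

JOB_URL_RE = re.compile(r"https://www\.linkedin\.com/jobs/view/\d+")

def job_id(url):
    """Strip the tracking parameters so that the URL identifies the posting."""
    match = JOB_URL_RE.match(url)
    if match is None:
        raise ValueError(f"not a job posting URL: {url}")
    return match.group(0)

def duplicated_ids(urls):
    """Return {job_id: count} for postings that occur more than once."""
    counts = Counter(job_id(url) for url in urls)
    return {jid: n for jid, n in counts.items() if n > 1}

# Made-up crawl of one day: the first two URLs differ only in tracking parameters.
one_day = [
    "https://www.linkedin.com/jobs/view/1111111111/?eBP=abc&refId=x",
    "https://www.linkedin.com/jobs/view/1111111111/?eBP=def&refId=y",
    "https://www.linkedin.com/jobs/view/2222222222/?eBP=ghi",
]
print(duplicated_ids(one_day))
# {'https://www.linkedin.com/jobs/view/1111111111': 2}
```

On the aggregated DataFrame, the same comparison per day would be `df.groupby("crawl_date")["linkedin_url"].agg(["size", "nunique"])`, applied after the URLs have been standardized.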

In [6]:
# For later use: pre-processing pipeline
steps = []

def standardize_job_urls(df):
    df.loc[:,"linkedin_url"] = df.loc[:, "linkedin_url"].map(standardize_job_url)
    return df

steps.append(standardize_job_urls)
steps

[<function __main__.standardize_job_urls(df)>]

## #4 Descriptive statistics

- How many unique job titles are there? 
    * Unique jobs can be identified with unique job posting IDs (`linkedin_url`).

- How many distinct `required_skills` are there? How many companies?

- How many jobs are full-time? How many are on-site, hybrid or remote? Is this information always indicated in `job_type_1` or `job_type_2`?

In [7]:
# Show if every posting has a job_type
df.loc[df.job_type_1.isnull(),:]

Unnamed: 0,job_title,required_skills,job_type_1,job_type_2,linkedin_url,company,company_linkedin_url,location,posted_date,applicant_count,job_description,benefits,crawl_date


In [8]:
df.job_type_1.value_counts()

job_type_1
Hybrid       1120
On-site       652
Remote        534
Full-time     269
Contract       18
Name: count, dtype: int64

In [9]:
df.job_type_2.value_counts()

job_type_2
Full-time    2050
Temporary      90
Contract       51
Name: count, dtype: int64
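
The two counts suggest that `job_type_1` and `job_type_2` mix two kinds of information: the working mode and the employment type. "Full-time" and "Contract" also appear in `job_type_1`, apparently when no working mode is given. The sketch below separates the two. `split_job_types` is a hypothetical helper, and "Part-time" and "Internship" are assumed labels that do not occur in the counts above.

```python
WORK_MODES = {"Remote", "On-site", "Hybrid"}
# "Part-time" and "Internship" are assumptions; only the other three were observed.
EMPLOYMENT_TYPES = {"Full-time", "Contract", "Temporary", "Part-time", "Internship"}

def split_job_types(job_type_1, job_type_2):
    """Return (work_mode, employment_type); either is None when not stated."""
    work_mode = employment_type = None
    for value in (job_type_1, job_type_2):
        if value in WORK_MODES:
            work_mode = value
        elif value in EMPLOYMENT_TYPES:
            employment_type = value
    return work_mode, employment_type

print(split_job_types("Hybrid", "Full-time"))    # ('Hybrid', 'Full-time')
print(split_job_types("Full-time", None))        # (None, 'Full-time')
print(split_job_types("Remote", float("nan")))   # ('Remote', None)
```

Looking a value up regardless of which column holds it is the same idea as the `is_*` flags in this notebook, which test both columns.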

In [10]:
# Most required skills
with pd.option_context("display.min_rows", 400):
    display(
        df.required_skills.dropna(
        ).apply(lambda _x: [skill if not skill.startswith("and ") else skill[4:] for skill in _x.split(", ")]
        ).explode(
        ).value_counts()
    )

required_skills
Communication                                            639
Data Analytics                                           436
Transform                                                347
Extract                                                  339
Data Science                                             327
Problem Solving                                          300
Databases                                                293
Data Warehousing                                         291
Data Engineering                                         261
Analytical Skills                                        249
Data Modeling                                            249
Data Analysis                                            241
English                                                  209
Computer Science                                         207
Analytics                                                202
Presentations                                            182
Artifici

In [11]:
# Extend the pipeline
def _has_job_type(df, value):
    # True where either job_type column equals `value` (NaN compares as False).
    return (df["job_type_1"] == value) | (df["job_type_2"] == value)

def is_remote(df):
    df["is_remote"] = _has_job_type(df, "Remote")
    return df

def is_on_site(df):
    df["is_on_site"] = _has_job_type(df, "On-site")
    return df

def is_hybrid(df):
    df["is_hybrid"] = _has_job_type(df, "Hybrid")
    return df

def is_fulltime(df):
    df["is_fulltime"] = _has_job_type(df, "Full-time")
    return df

steps.append(is_remote)
steps.append(is_on_site)
steps.append(is_hybrid)
steps.append(is_fulltime)
steps


[<function __main__.standardize_job_urls(df)>,
 <function __main__.is_remote(df)>,
 <function __main__.is_on_site(df)>,
 <function __main__.is_hybrid(df)>,
 <function __main__.is_fulltime(df)>]

In [12]:
def drop_useless_columns(df):
    _show_columns = [x for x in df.columns 
                    if (x not in [
                        "applicant_count", "benefits","company_linkedin_url", "location", "posted_date", "job_type_1", "job_type_2"
                        ])]
    return df.loc[:,_show_columns]

steps.append(drop_useless_columns)
steps

[<function __main__.standardize_job_urls(df)>,
 <function __main__.is_remote(df)>,
 <function __main__.is_on_site(df)>,
 <function __main__.is_hybrid(df)>,
 <function __main__.is_fulltime(df)>,
 <function __main__.drop_useless_columns(df)>]

In [13]:
for step in steps:
    df = df.pipe(step)
    print(step, "done")

<function standardize_job_urls at 0x7f67eb349630> done
<function is_remote at 0x7f67eb3497e0> done
<function is_on_site at 0x7f67eb349c60> done
<function is_hybrid at 0x7f67eb349bd0> done
<function is_fulltime at 0x7f67eb349b40> done
<function drop_useless_columns at 0x7f67eb349870> done


In [14]:
df.describe(include="all")

Unnamed: 0,job_title,required_skills,linkedin_url,company,job_description,crawl_date,is_remote,is_on_site,is_hybrid,is_fulltime
count,2593,2574,2593,2593,2593,2593,2593,2593,2593,2593
unique,545,589,699,245,587,,2,2,2,2
top,Data Engineer,Artificial Intelligence (AI) and Defining Requ...,https://www.linkedin.com/jobs/view/3749553229,Tietoevry,About the job\nPosition: Data Contributor \nP...,,False,False,False,True
freq,61,53,39,253,51,,2059,1941,1473,2319
mean,,,,,,2023-11-01 00:43:18.997300,,,,
min,,,,,,2023-10-29 00:00:00,,,,
25%,,,,,,2023-10-30 00:00:00,,,,
50%,,,,,,2023-11-01 00:00:00,,,,
75%,,,,,,2023-11-03 00:00:00,,,,
max,,,,,,2023-11-04 00:00:00,,,,


In [15]:
# Check the statistics of the unique jobs
df.drop_duplicates(subset=["linkedin_url"]).describe(include="all")

Unnamed: 0,job_title,required_skills,linkedin_url,company,job_description,crawl_date,is_remote,is_on_site,is_hybrid,is_fulltime
count,699,688,699,699,699,699,699,699,699,699
unique,540,573,699,245,575,,2,2,2,2
top,Data Engineer,Artificial Intelligence (AI) and Defining Requ...,https://www.linkedin.com/jobs/view/3735986015,Nordea,About the job\nPosition: Data Contributor \nP...,,False,False,False,True
freq,18,10,1,61,10,,553,506,428,621
mean,,,,,,2023-10-30 15:33:13.133047,,,,
min,,,,,,2023-10-29 00:00:00,,,,
25%,,,,,,2023-10-29 00:00:00,,,,
50%,,,,,,2023-10-30 00:00:00,,,,
75%,,,,,,2023-11-01 00:00:00,,,,
max,,,,,,2023-11-04 00:00:00,,,,


In [16]:
# Most required skills of the unique jobs
with pd.option_context("display.min_rows", 400):
    display(
        df.drop_duplicates(subset=["linkedin_url"]
        ).required_skills.dropna(
        ).apply(lambda _x: [skill if not skill.startswith("and ") else skill[4:] for skill in _x.split(", ")]
        ).explode(
        ).value_counts()
    )

required_skills
Communication                                    176
Data Analytics                                    95
Transform                                         86
Problem Solving                                   85
Extract                                           83
Data Science                                      73
Databases                                         69
Data Warehousing                                  63
Data Engineering                                  60
Analytical Skills                                 59
Presentations                                     56
Analytics                                         51
Data Modeling                                     49
Data Analysis                                     45
English                                           44
Computer Science                                  44
Load (ETL)                                        41
Teamwork                                          31
Data Privacy                  

### Observations #4

- Across the one week of data collection there are 2593 postings but only 699 unique jobs on the market (or at least open on LinkedIn).

- `job_title`s vary widely: there are 545 unique titles among the 2593 postings, and 540 among the 699 unique jobs.

- The skills named by recruiters vary as much as the job titles do: there are 573 distinct `required_skills` strings among the unique jobs.

- Hybrid is the most common working mode (271 of the 699 unique jobs), followed by on-site (193) and remote (146).

- Most jobs are full-time (621 of 699), leaving only 78 positions for contractors, interns and part-timers.

- Many `required_skills` specified by the recruiters are too general to be informative: terms like "Problem Solving" or "Data Analytics" say little about the job. More advanced keyword extraction is needed to pull out specific skills such as "Teamwork", "Presentations", "English", "Python" or "Spark".
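
A first step towards such keyword extraction could be a vocabulary match against the job description, as sketched below. This is a hedged illustration only: `SKILL_VOCABULARY` is a made-up list, and `compile_skill_patterns` and `extract_skills` are hypothetical helpers. A real vocabulary would have to be curated or learned from the data.

```python
import re

# Hypothetical vocabulary; not derived from the crawled data.
SKILL_VOCABULARY = ["Python", "SQL", "Spark", "English", "Finnish", "C#"]

def compile_skill_patterns(vocabulary):
    """Build one case-insensitive pattern per skill name."""
    # The lookarounds act as word boundaries that also work for names
    # ending in a non-word character, such as "C#".
    return {
        skill: re.compile(r"(?<!\w)" + re.escape(skill) + r"(?!\w)", re.IGNORECASE)
        for skill in vocabulary
    }

def extract_skills(text, patterns):
    """Return the vocabulary skills mentioned in the text, in vocabulary order."""
    return [skill for skill, pattern in patterns.items() if pattern.search(text)]

patterns = compile_skill_patterns(SKILL_VOCABULARY)
jd = "We use Python and Apache Spark daily. Fluent English is required; C# is a plus."
print(extract_skills(jd, patterns))
# ['Python', 'Spark', 'English', 'C#']
```

The lookarounds keep "Spark" from matching inside a longer word such as "sparkling", which a plain substring test would not.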

## #5 How many jobs require Finnish? English? Swedish? Or both Fin & Eng?

1. Detect the languages of each job posting, paragraph by paragraph (a posting can contain more than one language).

2. Should we translate the non-English postings to enrich the data (instead of dropping them)?

In [17]:
! pip install fasttext-langdetect



In [29]:
# Sample some JDs to decide on the standardization strategy
for _ in range(3):
    print(df.job_description.sample(1).values[0])
    print("==================")
    print()

About the job
Job Description

Position Title: Payroll Specialist, Finland

Department: Corporate Finance

Espoo,Finland

Are you a payroll professional seeking a new challenge in a growing international business?

If so, we have an exciting opportunity for you to join a new trans-European payroll function, where you will have the autonomy and opportunity to assist in shaping our payroll processes for Finland and Estonia.

Reporting into our International Payroll Owner and working closely with internal and external stakeholders, you will ensure our employees across Finland and Estonia are paid correctly and on time including annual pay reviews as well as annual bonus and monthly commission schemes.

Alongside this hugely important task, you will work with your colleagues on high-impact payroll projects; including the migration of payroll onto a new payroll system. And as we grow as a business there’s a likelihood that you will also lead on TUPEE payroll projects and folding acquired bu

In [54]:
from ftlangdetect import detect

def get_jd_langs(jd):
    """Return the set of languages detected in the lines of a job description."""
    # A set has .add(); collections.Counter does not.
    langs = set()
    for line in jd.split("\n"):
        # Skip empty lines and very short lines (4 words or fewer), which
        # covers the prologue "About the job" and the epilogue "See more".
        clean_line = line.strip().lower()
        if not clean_line or len(clean_line.split()) <= 4:
            continue

        detected_lang = detect(clean_line)["lang"]
        langs.add(detected_lang)
    return langs

df["jd_lang"] = df.job_description.map(get_jd_langs)
df.jd_lang.value_counts()

jd_lang
{en}        2312
{fi}         231
{}            13
{fi, en}      13
{nl, en}       7
{fr, en}       4
{sv, en}       4
{de, en}       4
{es, en}       3
{pl, en}       2
Name: count, dtype: int64

Index([{'nl', 'en'}, {'fr', 'en'}, {'sv', 'en'}, {'de', 'en'}, {'es', 'en'},
       {'pl', 'en'}],
      dtype='object', name='jd_lang')

In [67]:
# Copy so that adding a column does not trigger SettingWithCopyWarning
dedup_df = df.drop_duplicates(subset=["linkedin_url"]).copy()

dedup_df["jd_lang"] = dedup_df.job_description.map(get_jd_langs)
dedup_df.jd_lang.value_counts()

jd_lang
{en}        628
{fi}         51
{}            7
{fi, en}      6
{nl, en}      2
{fr, en}      1
{sv, en}      1
{de, en}      1
{es, en}      1
{pl, en}      1
Name: count, dtype: int64

In [53]:
# Check the long tail to see if the language detection is accurate
from pprint import pprint

def check_anomaly_langs(langs: set):
    jd_sample = df.loc[df.jd_lang == langs,"job_description"].sample(1).values[0]
    pprint(jd_sample)
    print()

    for line in jd_sample.split("\n"):
        # Skip empty lines and very short lines, which covers the
        # prologue "About the job" and the epilogue "See more".
        # Same threshold as in get_jd_langs.
        clean_line = line.strip().lower()
        if not clean_line or len(clean_line.split()) <= 4:
            continue
        
        detected_lang = detect(clean_line)["lang"]
        if detected_lang != "en":
            print(detected_lang.upper(), clean_line)

    print('======================')
    print()
    
check_anomaly_langs({"fr","en"})

('About the job\n'
 'RELEX Solutions create cutting-edge optimisation software to help retailers '
 'and consumer brands drive profitable growth. With growth comes '
 'opportunities, and we embrace both. Within our platforms, our teams are '
 'driving change, working with international colleagues and the latest tech '
 'stack to develop solutions that transform into a pioneering end product; '
 'it’s tangible and impactful – for our customers and the world.\n'
 '\n'
 '\n'
 '\n'
 'Our Technology team enjoy a challenge. They’re hungry to learn, and don’t '
 'hesitate to ask what, why, and how. They get to work with various '
 'technologies to create high quality scalable applications (just imagine, '
 'thousands of stores, millions of products, and billions of rows of raw '
 'data!). Their expertise positively impacts the environment and business '
 'processes around the world; alongside international colleagues, they drive '
 'change and develop solutions that become our pioneering end 

In [62]:
check_anomaly_langs(set())

'About the job\nSee more'




In [94]:
# Check anomalies 
idx_lang_anomalies = df.jd_lang.isin(df.jd_lang.value_counts().index[4:])
with pd.option_context("display.max_colwidth", 500):
    display(df.loc[idx_lang_anomalies, ["jd_lang","job_description"]].drop_duplicates(subset="job_description"))

Unnamed: 0,jd_lang,job_description
70,"{nl, en}","About the job\nType: Full-time / Permanent\nLocation: Keilaniementie 1 02150 Espoo \nApplication DL: 6.11.\n\nEtlia is a technical forerunner and fast-growing data engineering company that helps customers create business value from data by leveraging major business process platforms and other data sources. The company has ambitious growth targets, and is therefore now looking for an experienced, senior levelto support its growth journey.\n\n\n\n\n\n\n\nSee more"
94,"{fr, en}","About the job\nRELEX Solutions create cutting-edge optimisation software to help retailers and consumer brands drive profitable growth. With growth comes opportunities, and we embrace both. Within our platforms, our teams are driving change, working with international colleagues and the latest tech stack to develop solutions that transform into a pioneering end product; it’s tangible and impactful – for our customers and the world.\n\n\n\nOur Technology team enjoy a challenge. They’re hungry..."
457,"{sv, en}","About the job\nBy clicking the “Apply” button, I understand that my employment application process with Takeda will commence and that the information I provide in my application will be processed in line with Takeda’s Privacy Notice and Terms of Use. I further attest that all information I submit in my employment application is true to the best of my knowledge.\n\nJob Description\n\nare committed colleagues. We offer interested people numerous opportunities and strongly believe in, and promo..."
459,"{nl, en}","About the job\nWhy join Ericsson?\n\nWe are a world leader in the rapidly changing environment of communications technology – by providing hardware, software, and services to enable the full value of connectivity.\n\nYou’ll play a part in using your skills and creativity to push the boundaries of what’s possible. To build never-seen-before solutions to some of the world’s toughest problems.\n\nJoin a team of like-minded innovators driven to go beyond the status quo to build what’s next.\n\nA..."
590,"{de, en}","About the job\nYou may apply to Tietoevry by selecting Apply and fill your application details to the form. You may also Apply by using LinkedIn and populate details to your application from your LinkedIn profile.\n\nAbout Us In Germany\n\nTieto Germany is a subsidiary of Tietoevry, the largest Nordic IT services company with 24,000 employees worldwide. TietoEVRY has been present in Germany since 2000, and we tackle a variety of challenges with our advanced IT services and solutions, ranging..."
1322,"{es, en}","About the job\nTake your next career step at ABB with a global team that is energizing the transformation of society and industry to achieve a more productive, sustainable future.\n\nAt ABB, we have the clear goal of driving diversity and inclusion across all dimensions: gender, LGBTQ+, abilities, ethnicity and generations. Together, we are embarking on a journey where each and every one of us, individually and collectively, welcomes and celebrates individual differences.\n\nThe Manager Appl..."
1708,"{pl, en}","About the job\nTake your next career step at ABB with a global team that is energizing the transformation of society and industry to achieve a more productive, sustainable future.\n\nAt ABB, we have the clear goal of driving diversity and inclusion across all dimensions: gender, LGBTQ+, abilities, ethnicity and generations. Together, we are embarking on a journey where each and every one of us, individually and collectively, welcomes and celebrates individual differences.\n\nLocation: (DE (D..."


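Some job descriptions are empty: they contain only the prologue and epilogue (`'About the job\nSee more'`). They make up only a small part of the postings (13 of 2593).

The remaining anomalies are misdetections by the language detector: a few lines were labelled as another language, but the postings are all in English.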
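
Instead of correcting such misdetections by hand, a language could be kept only if it accounts for a minimum share of a posting's lines. The sketch below shows this idea under stated assumptions: it expects the per-line labels to be collected in a list (rather than a set), and `robust_langs` and its `min_share` threshold of 0.2 are hypothetical choices, not tuned on the data.

```python
from collections import Counter

def robust_langs(line_langs, min_share=0.2):
    """Keep a language only if at least `min_share` of the lines carry it."""
    counts = Counter(line_langs)
    total = sum(counts.values())
    return {lang for lang, n in counts.items() if n / total >= min_share}

print(robust_langs(["en"] * 9 + ["fr"]))              # {'en'}
print(sorted(robust_langs(["fi"] * 5 + ["en"] * 4)))  # ['en', 'fi']
print(robust_langs([]))                               # set()
```

A single stray line in a long English posting is dropped, while a genuinely bilingual posting keeps both languages. Very short postings remain ambiguous, since one misdetected line out of three already exceeds the threshold.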

In [105]:
# Clean the `jd_lang` column

df.loc[idx_lang_anomalies, ["jd_lang"]] = df.loc[idx_lang_anomalies, ["jd_lang"]].apply(lambda _: {"en"}, axis=1)
df.jd_lang.value_counts()

jd_lang
{en}        2336
{fi}         231
{}            13
{fi, en}      13
Name: count, dtype: int64

## 3. Job Classification - Data scientist, engineer or analyst (or None of those)?

- Naive approach - Match strings from the job title.
- More advanced approach - Cluster job descriptions.
- Some jobs are not related to any of the above. Drop those.

### 3.1 Naive approach

In [9]:
len(_data0.loc[_data0.job_title.str.lower().str.contains("engineer"),:])



In [10]:
len(_data0.loc[_data0.job_title.str.lower().str.contains("scientist"),:])



In [11]:
len(_data0.loc[_data0.job_title.str.lower().str.contains("analyst"),:])



In [18]:
_none_of_three = ~(_data0.job_title.str.lower().str.contains("engineer") |
                   _data0.job_title.str.lower().str.contains("scientist") |
                    _data0.job_title.str.lower().str.contains("analyst"))
_data0.loc[_none_of_three,:]
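
The three substring checks above can be folded into one helper that labels a title directly. The sketch below is a minimal illustration of the naive approach; `ROLE_KEYWORDS` and `classify_title` are hypothetical names, and the example titles are taken from the rows shown earlier in this notebook.

```python
# One keyword tuple per role; more keywords can be added per role.
ROLE_KEYWORDS = {
    "data scientist": ("scientist",),
    "data engineer": ("engineer",),
    "data analyst": ("analyst",),
}

def classify_title(job_title):
    """Return the roles whose keyword occurs in the title (none, one or several)."""
    title = job_title.lower()
    return [
        role
        for role, keywords in ROLE_KEYWORDS.items()
        if any(keyword in title for keyword in keywords)
    ]

for title in [
    "Data Engineer",
    "Senior Game Analyst",
    "Data Science - Machine Learning Engineer",
    "Commercial Director Nordics, Transport Reagents",
]:
    print(title, "->", classify_title(title))
# Data Engineer -> ['data engineer']
# Senior Game Analyst -> ['data analyst']
# Data Science - Machine Learning Engineer -> ['data engineer']
# Commercial Director Nordics, Transport Reagents -> []
```

The examples also show the limits of string matching: "Data Science - Machine Learning Engineer" is labelled only as an engineer, and a title such as "Senior Engineering Manager" would match "engineer" as well.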

## 4. Explore Job type: remote, on-site or hybrid? 

Can we always tell? Are there any missing values?

## 5. Explore Job type: full-time, contract, part-time or internship?

Can we always tell? Are there any missing values?