# EDA of the crawled data

This notebook explores the crawled data from the last notebook and cleans it for the next step: data modelling. 

The next step uses machine learning algorithms to answer the following questions:

- Q1: How many jobs are there for data scientist, data engineer and data analyst positions?

- Q2: What skills (both hard/technical and soft/communication) do the three kinds of jobs require?

I explore the data while bearing the two questions in mind.

## #1 How much data is there?

How many days of crawl are there?

In [1]:
from glob import glob

csv_paths = sorted(glob("../data/crawled_jobs_????-??-??.csv"))
csv_paths

['../data/crawled_jobs_2023-10-22.csv',
 '../data/crawled_jobs_2023-10-28.csv',
 '../data/crawled_jobs_2023-10-29.csv',
 '../data/crawled_jobs_2023-10-30.csv',
 '../data/crawled_jobs_2023-10-31.csv',
 '../data/crawled_jobs_2023-11-01.csv',
 '../data/crawled_jobs_2023-11-02.csv',
 '../data/crawled_jobs_2023-11-03.csv',
 '../data/crawled_jobs_2023-11-04.csv']

### Observations #1

- The crawls cover eight consecutive days, 2023-10-28 to 2023-11-04, plus one isolated day, 2023-10-22. Including the isolated day in the plotting demo would leave a gap in the time series, so I skip it.

- New data will keep arriving for the plotting demo, so I need a continuous pipeline that validates and deploys each new batch.
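
One way to skip isolated days automatically is to keep only the most recent unbroken run of crawl dates. The following is a minimal sketch, assuming the file names keep the `crawled_jobs_YYYY-MM-DD.csv` pattern seen above; `latest_consecutive_run` is a hypothetical helper, not something the notebook defines.

```python
import re
from datetime import date, timedelta

DATE_RE = re.compile(r"crawled_jobs_(\d{4})-(\d{2})-(\d{2})\.csv$")

def latest_consecutive_run(paths):
    """Return the paths of the most recent unbroken run of daily crawls."""
    dated = []
    for path in paths:
        match = DATE_RE.search(path)
        if match:  # skip files that do not follow the naming pattern
            dated.append((date(*map(int, match.groups())), path))
    dated.sort()

    run = []
    for day, path in dated:
        # Start a new run whenever the previous crawl is not exactly one day earlier
        if run and day - run[-1][0] != timedelta(days=1):
            run = []
        run.append((day, path))
    return [path for _, path in run]

paths = [
    "../data/crawled_jobs_2023-10-22.csv",
    "../data/crawled_jobs_2023-10-28.csv",
    "../data/crawled_jobs_2023-10-29.csv",
]
print(latest_consecutive_run(paths))
```

With the three example paths, only the 2023-10-28 and 2023-10-29 files are kept. A validation step for new batches could reuse the same date parsing to reject files with unexpected names.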

## #2 What do each day's job postings look like?

Pick a day and peek at the data. Let's start with 2023-10-22.

In [2]:
import pandas as pd

_data0 = pd.read_csv(csv_paths[0])
print(csv_paths[0])
_data0

../data/crawled_jobs_2023-10-22.csv


Unnamed: 0,job_title,required_skills,job_type_1,job_type_2,linkedin_url,company,company_linkedin_url,location,posted_date,applicant_count,job_description,benefits
0,RWE Scientist / Epidemiologist,"Customer Relationship Management (CRM), Epidem...",Hybrid,Full-time,https://www.linkedin.com/jobs/view/3737909849/...,MedEngine,https://www.linkedin.com/company/medengine/life,"MedEngine · Helsinki, Uusimaa, Finland 1 week...",1 week ago,0,About the job\nMedEngine is a digitally minded...,
1,Data Engineer,"Data Engineering, Git, Python (Programming Lan...",Hybrid,Full-time,https://www.linkedin.com/jobs/view/3736532279/...,Suomen Palloliitto - Football Association of F...,https://www.linkedin.com/company/football-asso...,Suomen Palloliitto - Football Association of F...,1 week ago,0,About the job\nDATA ENGINEER\n\nSUOMEN PALLOLI...,
2,Senior Game Analyst,"Analytical Skills, Data Analysis, Economics, M...",Hybrid,Full-time,https://www.linkedin.com/jobs/view/3717037977/...,"Next Games, A Netflix Game Studio",https://www.linkedin.com/company/next-games/life,"Next Games, A Netflix Game Studio · Helsinki, ...",Reposted 6 days ago,0,About the job\nNext Games is a Netflix Game St...,
3,Data Scientist,"Data Analysis, Data Science, Machine Learning,...",Hybrid,Full-time,https://www.linkedin.com/jobs/view/3735986015/...,MedEngine,https://www.linkedin.com/company/medengine/life,"MedEngine · Helsinki, Uusimaa, Finland 1 week...",1 week ago,0,About the job\nMedEngine is a digitally minded...,
4,Data Science - Machine Learning Engineer,"Artificial Intelligence (AI), Computer Science...",Remote,Full-time,https://www.linkedin.com/jobs/view/3629670334/...,Wolt,https://www.linkedin.com/company/wolt-oy/life,"Wolt · Helsinki, Uusimaa, Finland Reposted 2 ...",Reposted 2 weeks ago,0,About the job\nJob Description\n\nTeam purpose...,
...,...,...,...,...,...,...,...,...,...,...,...,...
307,Remote Work (Finnish Speakers) - Internet Ads ...,English and Finnish,Remote,,https://www.linkedin.com/jobs/view/3728766620/...,TELUS International AI Data Solutions,https://www.linkedin.com/company/telusinternat...,TELUS International AI Data Solutions · Finlan...,3 weeks ago,0,About the job\nOur Company \n\nTELUS Internati...,
308,Work From Home - Finnish Speakers (Internet Ad...,English and Finnish,Remote,,https://www.linkedin.com/jobs/view/3731394329/...,TELUS International AI Data Solutions,https://www.linkedin.com/company/telusinternat...,TELUS International AI Data Solutions · Finlan...,2 weeks ago,0,About the job\nOur Company \n\nTELUS Internati...,
309,"Senior Engineering Manager (Bangkok based, rel...","Software DevelopmentC#, Engineering Management...",Hybrid,Full-time,https://www.linkedin.com/jobs/view/3616699365/...,Agoda,https://www.linkedin.com/company/agoda/life,"Agoda · Helsinki, Uusimaa, Finland Reposted 3...",Reposted 3 days ago,0,About the job\nAbout Agoda\n\nAgoda is an onli...,
310,"Commercial Director Nordics, Transport Reagents","SwedishBusiness Planning, Commodities, Indirec...",On-site,Full-time,https://www.linkedin.com/jobs/view/3736132257/...,Yara Suomi,https://www.linkedin.com/company/yarasuomi/life,"Yara Suomi · Espoo, Uusimaa, Finland 1 week a...",1 week ago,0,About the job\nAbout Yara Industrial Solutions...,


### Observations #2 

- `required_skills` is a comma-delimited list of skills stored as a single string, so it needs tokenizing.
- The `applicant_count` and `benefits` columns are not informative and should be dropped.
- `company_linkedin_url` and `posted_date` are not useful for answering the research questions (Q1 or Q2) and should be dropped.
- `location` mixes in irrelevant information, such as the company name and the applicant count, so it should also be dropped.
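
The tokenizing mentioned in the first bullet could look like the sketch below. It assumes the two formats visible in the preview ("A, B, and C" and "A and B"); `split_skills` is a hypothetical helper name.

```python
import re

def split_skills(raw):
    """Split a skill string such as "A, B, and C" into a list of skill names."""
    if not isinstance(raw, str):  # missing values (NaN) become an empty list
        return []
    if "," not in raw:
        # Two-item lists are joined with " and " and contain no comma
        parts = raw.split(" and ")
    else:
        parts = raw.split(",")
    skills = []
    for part in parts:
        # Drop the "and " that precedes the last item of a comma-separated list
        skill = re.sub(r"^and\s+", "", part.strip())
        if skill:
            skills.append(skill)
    return skills

print(split_skills("Data Analysis, Python (Programming Language), and SQL"))
print(split_skills("English and Finnish"))
```

Two caveats: a single skill whose name contains " and " would be split in two, and a skill that itself contains commas, such as "Extract, Transform, Load (ETL)", is broken into pieces. Both cases would need a curated exception list.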


### Aggregate all the dates' data

In [3]:
from datetime import datetime, timedelta
import pandas as pd

start_date = datetime(2023, 10, 29)
end_date = datetime(2023, 11, 4)

date = start_date
dfs = []
while date <= end_date:
    csv_path = f"../data/crawled_jobs_{date.strftime('%Y-%m-%d')}.csv"

    new_daily_data = pd.read_csv(csv_path)
    new_daily_data.loc[:,"crawl_date"] = date
    dfs.append(new_daily_data)

    date += timedelta(days=1)

df = pd.concat(dfs, axis=0, ignore_index=True)
df

Unnamed: 0,job_title,required_skills,job_type_1,job_type_2,linkedin_url,company,company_linkedin_url,location,posted_date,applicant_count,job_description,benefits,crawl_date
0,Data Scientist,"Data Analysis, Data Science, Machine Learning,...",Hybrid,Full-time,https://www.linkedin.com/jobs/view/3735986015/...,MedEngine,https://www.linkedin.com/company/medengine/life,"MedEngine · Helsinki, Uusimaa, Finland 2 week...",2 weeks ago,0,About the job\nMedEngine is a digitally minded...,,2023-10-29
1,"Senior Data Analyst, Ads","Data Analysis, Python (Programming Language), ...",Hybrid,Full-time,https://www.linkedin.com/jobs/view/3702242885/...,Rovio Entertainment Corporation,https://www.linkedin.com/company/rovio/life,"Rovio Entertainment Corporation · Helsinki, Uu...",Reposted 1 week ago,0,About the job\nAt Rovio you will get to work w...,,2023-10-29
2,JVM Performance and Tuning Engineer,"Business Logic, Garbage Collection, Honeycomb,...",Remote,Full-time,https://www.linkedin.com/jobs/view/3734708994/...,RELEX Solutions,https://www.linkedin.com/company/relexsolution...,RELEX Solutions · Finland 2 weeks ago · 10 a...,2 weeks ago,0,About the job\nRELEX Solutions create cutting-...,,2023-10-29
3,Data Engineer (Level Up),"Data Warehousing, Finnish, and SQLData Visuali...",Hybrid,Full-time,https://www.linkedin.com/jobs/view/3744740320/...,Loihde Advance,https://www.linkedin.com/company/loihdeadvance...,"Loihde Advance · Uusimaa, Finland 1 week ago ...",1 week ago,0,About the job\nOnko sinulle jo kertynyt jo väh...,,2023-10-29
4,Data Science - Machine Learning Engineer,"Artificial Intelligence (AI), Computer Science...",Remote,Full-time,https://www.linkedin.com/jobs/view/3629670334/...,Wolt,https://www.linkedin.com/company/wolt-oy/life,"Wolt · Helsinki, Uusimaa, Finland Reposted 6 ...",Reposted 6 hours ago,0,About the job\nJob Description\n\nTeam purpose...,,2023-10-29
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2588,Data Engineer,"Data Engineering, Git, Python (Programming Lan...",Hybrid,Full-time,https://www.linkedin.com/jobs/view/3736532279/...,Suomen Palloliitto - Football Association of F...,https://www.linkedin.com/company/football-asso...,Suomen Palloliitto - Football Association of F...,3 weeks ago,0,About the job\nDATA ENGINEER\n\nSUOMEN PALLOLI...,,2023-11-04
2589,AI Engineer Academy - Finland,"Artificial Intelligence (AI), Computer Science...",Full-time,,https://www.linkedin.com/jobs/view/3746215896/...,Avanade,https://www.linkedin.com/company/avanade/life,"Avanade · Helsinki, Uusimaa, Finland 1 week a...",1 week ago,0,About the job\nJob Description\n\nAre you inte...,,2023-11-04
2590,Lead Competitive Intelligence Analyst,"Analytical Skills, Data Analysis, Data Analyti...",Remote,Full-time,https://www.linkedin.com/jobs/view/3747370503/...,Huuuge Games,https://www.linkedin.com/company/huuugegames/life,Huuuge Games · Finland 1 week ago · 33 appli...,1 week ago,0,About the job\nIntroduction:\nJoin an industry...,,2023-11-04
2591,Senior Data Analyst,"Data Analysis, Python (Programming Language), ...",Hybrid,Full-time,https://www.linkedin.com/jobs/view/3750109074/...,Musopia,https://www.linkedin.com/company/musopia/life,Musopia · Helsinki Metropolitan Area 5 days a...,5 days ago,0,About the job\nHow about a backstage pass to t...,,2023-11-04


## #3 Are there duplicate postings in the day's crawl?

`linkedin_url` should identify unique postings. 

If there are no duplicates, the number of rows in the crawl should equal the number of unique `linkedin_url` values.

In [4]:
with pd.option_context("display.max_colwidth", 200):
    display(df.linkedin_url)

0       https://www.linkedin.com/jobs/view/3735986015/?eBP=CwEAAAGLfEIr4Dx1-Kfr79dkVyiabtHzAgIzxq_6BKxl_Arj-rgRU4jHm6HBFOg536sM2oYAdDv41PFPygF1nNPP42uBq_qrqcduMR2elGTRHtCL-_BbUkas-X1kKbbNMvLgnvPHdTgjvx4EF...
1       https://www.linkedin.com/jobs/view/3702242885/?eBP=CwEAAAGLfEIr4Gfd6iss7Zh09ip4bKC81ieEpBLVCOztyL7qqFrGaW7A9ZV8UmZ5_4r103273MaRWG3YG8aVNO-wmSNUQEU6H2uWkIWXytlVuJP2awC9rYIl7Hto15kn7AmaAnno1xydiXV9_...
2       https://www.linkedin.com/jobs/view/3734708994/?eBP=CwEAAAGLfEIr4DFFImZ4W08zEPBGiLRMzXTtFoQMjZAr4PJDWS2dxuMWMYc4OYq6e5QliLsD7_Y7Wp9IHEYtENCP5QIYkerLC3kMnGGmrPDyllqjr9VyglRHSvZIavbQ6biF2oAAMDCgjqFOs...
3       https://www.linkedin.com/jobs/view/3744740320/?eBP=CwEAAAGLfEIr4AtXNJbIzHgii3096t-f-WLgW-C0Fe4uf73CfyqfcvB9mDAsWJfrEqSbtYAvKZdUQVn5fTeDP6R5zTAip8gcjfXfrMplSndXkmSmSKc4ZHige0ma7RW6rPFRRGAU-udREKqFz...
4                          https://www.linkedin.com/jobs/view/3629670334/?eBP=JOB_SEARCH_ORGANIC&refId=P8zLgWdnQx7b0E65HbIzDw%3D%3D&trackingId=yGhQYMf%2FznekMB3O%2FLW%2

In [5]:
# Standardize LinkedIn URLs so that they can be used as job IDs
import re

def standardize_job_url(url):
    # Keep only the prefix up to the numeric job ID; drop the tracking parameters
    return re.match(r"https://www\.linkedin\.com/jobs/view/\d+", url).group(0)

df.linkedin_url.map(standardize_job_url)

0       https://www.linkedin.com/jobs/view/3735986015
1       https://www.linkedin.com/jobs/view/3702242885
2       https://www.linkedin.com/jobs/view/3734708994
3       https://www.linkedin.com/jobs/view/3744740320
4       https://www.linkedin.com/jobs/view/3629670334
                            ...                      
2588    https://www.linkedin.com/jobs/view/3736532279
2589    https://www.linkedin.com/jobs/view/3746215896
2590    https://www.linkedin.com/jobs/view/3747370503
2591    https://www.linkedin.com/jobs/view/3750109074
2592    https://www.linkedin.com/jobs/view/3744561692
Name: linkedin_url, Length: 2593, dtype: object

### Observation #3

Usually not. In any case, deduplication should be part of the preprocessing pipeline.
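
To check this per day, the row count can be compared with the number of unique job IDs for each crawl date. This is a sketch on toy data; `duplicates_per_day` is a hypothetical helper, and it assumes the URLs have already been standardized, since raw URLs carry tracking parameters that make identical postings look different.

```python
import pandas as pd

def duplicates_per_day(df, id_col="linkedin_url", date_col="crawl_date"):
    """Compare the row count with the number of unique job IDs per crawl date."""
    summary = df.groupby(date_col)[id_col].agg(postings="size", unique_jobs="nunique")
    summary["duplicates"] = summary["postings"] - summary["unique_jobs"]
    return summary

# Toy data with already-standardized URLs; the real input would be `df`
toy = pd.DataFrame({
    "crawl_date": ["2023-10-29", "2023-10-29", "2023-10-29", "2023-10-30"],
    "linkedin_url": [
        "https://www.linkedin.com/jobs/view/1",
        "https://www.linkedin.com/jobs/view/1",
        "https://www.linkedin.com/jobs/view/2",
        "https://www.linkedin.com/jobs/view/1",
    ],
})
print(duplicates_per_day(toy))
```

A day with a non-zero `duplicates` value contains repeated postings. Deduplicating on `["crawl_date", "linkedin_url"]` would remove them while keeping one row per job per day.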

In [6]:
# For later use: pre-processing pipeline
steps = []

def standardize_job_urls(df):
    df.loc[:,"linkedin_url"] = df.loc[:, "linkedin_url"].map(standardize_job_url)
    return df

steps.append(standardize_job_urls)
steps

[<function __main__.standardize_job_urls(df)>]

## #4 Descriptive statistics

- How many unique job titles are there?
    * Unique jobs can be identified by their posting IDs (`linkedin_url`).

- How many distinct `required_skills` are there? How many companies?

- How many jobs are full-time? How many are on-site, hybrid or remote? Is this information always given in the `job_type_*` columns?

In [7]:
# Show if every posting has a job_type
df.loc[df.job_type_1.isnull(),:]

Unnamed: 0,job_title,required_skills,job_type_1,job_type_2,linkedin_url,company,company_linkedin_url,location,posted_date,applicant_count,job_description,benefits,crawl_date


In [8]:
df.job_type_1.value_counts()

job_type_1
Hybrid       1120
On-site       652
Remote        534
Full-time     269
Contract       18
Name: count, dtype: int64

In [9]:
df.job_type_2.value_counts()

job_type_2
Full-time    2050
Temporary      90
Contract       51
Name: count, dtype: int64

In [10]:
# Most required skills
with pd.option_context("display.min_rows", 400):
    display(
        df.required_skills.dropna(
        ).apply(lambda _x: [skill if not skill.startswith("and ") else skill[4:] for skill in _x.split(", ")]
        ).explode(
        ).value_counts()
    )

required_skills
Communication                                            639
Data Analytics                                           436
Transform                                                347
Extract                                                  339
Data Science                                             327
Problem Solving                                          300
Databases                                                293
Data Warehousing                                         291
Data Engineering                                         261
Analytical Skills                                        249
Data Modeling                                            249
Data Analysis                                            241
English                                                  209
Computer Science                                         207
Analytics                                                202
Presentations                                            182
Artifici

In [11]:
# Extend the pipeline
# Each flag is True when either job_type column carries the label.
# Missing values compare as False.
def is_remote(df):
    df["is_remote"] = (df["job_type_1"] == "Remote") | (df["job_type_2"] == "Remote")
    return df


def is_on_site(df):
    df["is_on_site"] = (df["job_type_1"] == "On-site") | (df["job_type_2"] == "On-site")
    return df


def is_hybrid(df):
    df["is_hybrid"] = (df["job_type_1"] == "Hybrid") | (df["job_type_2"] == "Hybrid")
    return df


def is_fulltime(df):
    df["is_fulltime"] = (df["job_type_1"] == "Full-time") | (df["job_type_2"] == "Full-time")
    return df

steps.append(is_remote)
steps.append(is_on_site)
steps.append(is_hybrid)
steps.append(is_fulltime)
steps


[<function __main__.standardize_job_urls(df)>,
 <function __main__.is_remote(df)>,
 <function __main__.is_on_site(df)>,
 <function __main__.is_hybrid(df)>,
 <function __main__.is_fulltime(df)>]

In [12]:
def drop_useless_columns(df):
    _show_columns = [x for x in df.columns 
                    if (x not in [
                        "applicant_count", "benefits","company_linkedin_url", "location", "posted_date", "job_type_1", "job_type_2"
                        ])]
    return df.loc[:,_show_columns]

steps.append(drop_useless_columns)
steps

[<function __main__.standardize_job_urls(df)>,
 <function __main__.is_remote(df)>,
 <function __main__.is_on_site(df)>,
 <function __main__.is_hybrid(df)>,
 <function __main__.is_fulltime(df)>,
 <function __main__.drop_useless_columns(df)>]

In [13]:
for step in steps:
    df = df.pipe(step)
    print(step, "done")

<function standardize_job_urls at 0x7f00199b1900> done
<function is_remote at 0x7effe8b51630> done
<function is_on_site at 0x7effe8b517e0> done
<function is_hybrid at 0x7effe8b51bd0> done
<function is_fulltime at 0x7effe8b51b40> done
<function drop_useless_columns at 0x7effe8b51ab0> done


In [14]:
df.describe(include="all")

Unnamed: 0,job_title,required_skills,linkedin_url,company,job_description,crawl_date,is_remote,is_on_site,is_hybrid,is_fulltime
count,2593,2574,2593,2593,2593,2593,2593,2593,2593,2593
unique,545,589,699,245,587,,2,2,2,2
top,Data Engineer,Artificial Intelligence (AI) and Defining Requ...,https://www.linkedin.com/jobs/view/3749553229,Tietoevry,About the job\nPosition: Data Contributor \nP...,,False,False,False,True
freq,61,53,39,253,51,,2059,1941,1473,2319
mean,,,,,,2023-11-01 00:43:18.997300,,,,
min,,,,,,2023-10-29 00:00:00,,,,
25%,,,,,,2023-10-30 00:00:00,,,,
50%,,,,,,2023-11-01 00:00:00,,,,
75%,,,,,,2023-11-03 00:00:00,,,,
max,,,,,,2023-11-04 00:00:00,,,,


In [15]:
# Check the statistics of the unique jobs
df.drop_duplicates(subset=["linkedin_url"]).describe(include="all")

Unnamed: 0,job_title,required_skills,linkedin_url,company,job_description,crawl_date,is_remote,is_on_site,is_hybrid,is_fulltime
count,699,688,699,699,699,699,699,699,699,699
unique,540,573,699,245,575,,2,2,2,2
top,Data Engineer,Artificial Intelligence (AI) and Defining Requ...,https://www.linkedin.com/jobs/view/3735986015,Nordea,About the job\nPosition: Data Contributor \nP...,,False,False,False,True
freq,18,10,1,61,10,,553,506,428,621
mean,,,,,,2023-10-30 15:33:13.133047,,,,
min,,,,,,2023-10-29 00:00:00,,,,
25%,,,,,,2023-10-29 00:00:00,,,,
50%,,,,,,2023-10-30 00:00:00,,,,
75%,,,,,,2023-11-01 00:00:00,,,,
max,,,,,,2023-11-04 00:00:00,,,,


In [16]:
# Most required skills of the unique jobs
with pd.option_context("display.min_rows", 400):
    display(
        df.drop_duplicates(subset=["linkedin_url"]
        ).required_skills.dropna(
        ).apply(lambda _x: [skill if not skill.startswith("and ") else skill[4:] for skill in _x.split(", ")]
        ).explode(
        ).value_counts()
    )

required_skills
Communication                                    176
Data Analytics                                    95
Transform                                         86
Problem Solving                                   85
Extract                                           83
Data Science                                      73
Databases                                         69
Data Warehousing                                  63
Data Engineering                                  60
Analytical Skills                                 59
Presentations                                     56
Analytics                                         51
Data Modeling                                     49
Data Analysis                                     45
English                                           44
Computer Science                                  44
Load (ETL)                                        41
Teamwork                                          31
Data Privacy                  

### Observations #4

- Across the week of data collection there are 2593 postings but only 699 unique jobs on the market (or at least open ones on LinkedIn).

- `job_title`s vary widely: there are 545 unique titles among the 2593 postings, and 540 among the 699 unique jobs.

- The skills named by recruiters vary as much as the job titles.

- Hybrid is the most common working mode (271 of the 699 unique jobs), ahead of on-site (193) and remote (146).

- Most jobs are full-time (621 of the 699 unique jobs), leaving only 78 positions for contractors, interns and part-timers.

- Many of the `required_skills` specified by recruiters are general and uninformative: terms like "Problem Solving" or "Data Analytics" say little. More advanced keyword extraction is needed to pull out specific skills such as "Teamwork", "Presentations", "English", "Python" or "Spark".
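
As a first step towards such keyword extraction, specific skills can be matched against the job description text with a fixed vocabulary. This is only a sketch: `SKILL_VOCAB` and `extract_skills` are hypothetical, and a real vocabulary would need to be curated or learned from the data.

```python
import re

# Hypothetical vocabulary for illustration only
SKILL_VOCAB = ["Python", "Spark", "SQL", "Power BI", "English"]

def extract_skills(text, vocab=SKILL_VOCAB):
    """Return the vocabulary terms that occur in `text` as whole words."""
    found = []
    for term in vocab:
        # The lookarounds act as word boundaries, so "SQL" does not match "NoSQLite"
        pattern = rf"(?<!\w){re.escape(term)}(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found

jd = "We use Python and Apache Spark; fluent English is required."
print(extract_skills(jd))
```

Matching on `job_description` rather than `required_skills` avoids depending on the recruiter-chosen labels, at the cost of maintaining the vocabulary.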

## #5 How many jobs require Finnish? English? Swedish? Or both Finnish and English?

1. Detect the languages of each job posting, paragraph by paragraph, since a posting can contain more than one language.

2. Should we translate the non-English postings to enrich the data (instead of dropping them)?

In [17]:
! pip install fasttext-langdetect



In [18]:
# Sample some JDs to decide on the standardization strategy
for _ in range(3):
    print(df.job_description.sample(1).values[0])
    print("==================")
    print()

About the job
Duell is one of the fastest growing Powersports gear importer and distributor in Europe. We have the leading position in the Northern European market and our presence is rapidly growing in Europe. Our portfolio covers motorcycle, ATV, snowmobile, bicycle, and marine products, including technical spare parts and accessories as well as personal equipment. Duell is a Finnish group headquartered in Mustasaari (Vaasa area) with subsidiaries in six European countries. Duell Group has a net sale of EUR 124 million and employs more than 200 people. At Duell, we are passionate about the sports we represent and to grow our business. We base our business on partnerships and trust.











See more

About the job
Join Nordcloud and make your mark on the European IT industry. Help our clients thrive on their cloud journey in solution areas such as infrastructure, migration, data, and security.

Currently, we are looking for a Google Cloud Architect for our team in Finland.

Your da

In [19]:
from ftlangdetect import detect

def get_jd_langs(jd):
    langs = set()
    for line in jd.split("\n"):
        # Skip empty lines and short lines (four words or fewer), which covers
        # the prologue "About the job" and the epilogue "See more".
        clean_line = line.strip().lower()
        if len(clean_line.split()) <= 4:
            continue

        detected_lang = detect(clean_line)["lang"]
        langs.add(detected_lang)
    return langs

df["jd_langs"] = df.job_description.map(get_jd_langs)
df.jd_langs.value_counts()



jd_langs
{en}        2312
{fi}         231
{}            13
{en, fi}      13
{nl, en}       7
{fr, en}       4
{sv, en}       4
{de, en}       4
{es, en}       3
{pl, en}       2
Name: count, dtype: int64

In [20]:
# .copy() gives an independent frame, which avoids pandas' SettingWithCopyWarning
dedup_df = df.drop_duplicates(subset=["linkedin_url"]).copy()

dedup_df["jd_langs"] = dedup_df.job_description.map(get_jd_langs)
dedup_df.jd_langs.value_counts()


jd_langs
{en}        628
{fi}         51
{}            7
{en, fi}      6
{nl, en}      2
{fr, en}      1
{sv, en}      1
{de, en}      1
{es, en}      1
{pl, en}      1
Name: count, dtype: int64

In [21]:
# Check the long tail to see if the language detection is accurate
from pprint import pprint

def check_anomaly_langs(langs: set):
    jd_sample = df.loc[df.jd_langs == langs,"job_description"].sample(1).values[0]
    pprint(jd_sample)
    print()

    for line in jd_sample.split("\n"):
        # Skip empty and short lines with the same threshold as get_jd_langs,
        # which covers the prologue "About the job" and the epilogue "See more".
        clean_line = line.strip().lower()
        if len(clean_line.split()) <= 4:
            continue
        
        detected_lang = detect(clean_line)["lang"]
        if detected_lang != "en":
            print(detected_lang.upper(), clean_line)

    print('======================')
    print()
    
check_anomaly_langs({"fr","en"})

('About the job\n'
 'RELEX Solutions create cutting-edge optimisation software to help retailers '
 'and consumer brands drive profitable growth. With growth comes '
 'opportunities, and we embrace both. Within our platforms, our teams are '
 'driving change, working with international colleagues and the latest tech '
 'stack to develop solutions that transform into a pioneering end product; '
 'it’s tangible and impactful – for our customers and the world.\n'
 '\n'
 '\n'
 '\n'
 'Our Technology team enjoy a challenge. They’re hungry to learn, and don’t '
 'hesitate to ask what, why, and how. They get to work with various '
 'technologies to create high quality scalable applications (just imagine, '
 'thousands of stores, millions of products, and billions of rows of raw '
 'data!). Their expertise positively impacts the environment and business '
 'processes around the world; alongside international colleagues, they drive '
 'change and develop solutions that become our pioneering end 

In [22]:
check_anomaly_langs(set())

'About the job\nSee more'




In [23]:
# Check anomalies 
idx_lang_anomalies = df.jd_langs.isin(df.jd_langs.value_counts().index[4:])
with pd.option_context("display.max_colwidth", 500):
    display(df.loc[idx_lang_anomalies, ["jd_langs","job_description"]].drop_duplicates(subset="job_description"))

Unnamed: 0,jd_langs,job_description
70,"{nl, en}","About the job\nType: Full-time / Permanent\nLocation: Keilaniementie 1 02150 Espoo \nApplication DL: 6.11.\n\nEtlia is a technical forerunner and fast-growing data engineering company that helps customers create business value from data by leveraging major business process platforms and other data sources. The company has ambitious growth targets, and is therefore now looking for an experienced, senior levelto support its growth journey.\n\n\n\n\n\n\n\nSee more"
94,"{fr, en}","About the job\nRELEX Solutions create cutting-edge optimisation software to help retailers and consumer brands drive profitable growth. With growth comes opportunities, and we embrace both. Within our platforms, our teams are driving change, working with international colleagues and the latest tech stack to develop solutions that transform into a pioneering end product; it’s tangible and impactful – for our customers and the world.\n\n\n\nOur Technology team enjoy a challenge. They’re hungry..."
457,"{sv, en}","About the job\nBy clicking the “Apply” button, I understand that my employment application process with Takeda will commence and that the information I provide in my application will be processed in line with Takeda’s Privacy Notice and Terms of Use. I further attest that all information I submit in my employment application is true to the best of my knowledge.\n\nJob Description\n\nare committed colleagues. We offer interested people numerous opportunities and strongly believe in, and promo..."
459,"{nl, en}","About the job\nWhy join Ericsson?\n\nWe are a world leader in the rapidly changing environment of communications technology – by providing hardware, software, and services to enable the full value of connectivity.\n\nYou’ll play a part in using your skills and creativity to push the boundaries of what’s possible. To build never-seen-before solutions to some of the world’s toughest problems.\n\nJoin a team of like-minded innovators driven to go beyond the status quo to build what’s next.\n\nA..."
590,"{de, en}","About the job\nYou may apply to Tietoevry by selecting Apply and fill your application details to the form. You may also Apply by using LinkedIn and populate details to your application from your LinkedIn profile.\n\nAbout Us In Germany\n\nTieto Germany is a subsidiary of Tietoevry, the largest Nordic IT services company with 24,000 employees worldwide. TietoEVRY has been present in Germany since 2000, and we tackle a variety of challenges with our advanced IT services and solutions, ranging..."
1322,"{es, en}","About the job\nTake your next career step at ABB with a global team that is energizing the transformation of society and industry to achieve a more productive, sustainable future.\n\nAt ABB, we have the clear goal of driving diversity and inclusion across all dimensions: gender, LGBTQ+, abilities, ethnicity and generations. Together, we are embarking on a journey where each and every one of us, individually and collectively, welcomes and celebrates individual differences.\n\nThe Manager Appl..."
1708,"{pl, en}","About the job\nTake your next career step at ABB with a global team that is energizing the transformation of society and industry to achieve a more productive, sustainable future.\n\nAt ABB, we have the clear goal of driving diversity and inclusion across all dimensions: gender, LGBTQ+, abilities, ethnicity and generations. Together, we are embarking on a journey where each and every one of us, individually and collectively, welcomes and celebrates individual differences.\n\nLocation: (DE (D..."


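Some job descriptions are empty: they contain only the "About the job" and "See more" boilerplate. These make up only a small part of the postings (13 of 2593).

The remaining anomalies are language-detection errors: a few lines were misclassified as Dutch, French, Swedish, German, Spanish or Polish, but the postings are all in English.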
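
As an alternative to manual correction, a language could be kept only if it is detected on a minimum share of the usable lines. The sketch below is one possible approach, not the notebook's method. `dominant_langs`, `min_words` and `min_share` are hypothetical, and the detector is passed in as a function so the logic can be tried without the fastText model. With `ftlangdetect` it would be `lambda line: detect(line)["lang"]`.

```python
from collections import Counter

def dominant_langs(jd, detect_fn, min_words=5, min_share=0.2):
    """Keep only languages detected on at least `min_share` of the usable lines."""
    counts = Counter()
    for line in jd.split("\n"):
        clean_line = line.strip().lower()
        if len(clean_line.split()) < min_words:  # skip empty and very short lines
            continue
        counts[detect_fn(clean_line)] += 1
    total = sum(counts.values())
    return {lang for lang, n in counts.items() if n / total >= min_share}

def fake_detect(line):
    # Stand-in detector for the demo: flags one English line as Dutch by mistake
    return "nl" if "senior level" in line else "en"

jd = "\n".join(
    ["About the job"]
    + ["we build data platforms for our customers"] * 9
    + ["looking for an experienced senior level engineer"]
)
print(dominant_langs(jd, fake_detect))
```

In the demo, the single misdetected line accounts for 10% of the usable lines and is dropped. The 20% threshold is an assumption: a genuinely bilingual posting with only a short section in the second language would be filtered out as well, so the threshold needs checking against the real data.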

In [24]:
# Manually correct the `jd_langs` column

df.loc[idx_lang_anomalies, ["jd_langs"]] = df.loc[idx_lang_anomalies, ["jd_langs"]].apply(lambda _: {"en"}, axis=1)
df.jd_langs.value_counts()

jd_langs
{en}        2336
{fi}         231
{}            13
{en, fi}      13
Name: count, dtype: int64

In [25]:
# Check the JDs containing both English and Finnish
with pd.option_context("display.max_colwidth",500):
    display(df.loc[df.jd_langs=={"en","fi"},["job_title","job_description"]])

Unnamed: 0,job_title,job_description
116,Avoin hakemus / Open application,"About the job\nMiltä kuulostaisi monipuolinen työ kansainvälisessä kaapelialan yrityksessä teknologiateollisuuden parissa? Prysmian Group on maailman johtava kaapeleiden toimittaja. Yli 100 vuoden kokemuksella Prysmian Group tarjoaa markkinoiden laajimman kaapelivalikoiman Suomessa. Prysmian Group Finland Oy:llä on tehtaat Kirkkonummen Pikkalassa ja Oulun Ruskossa. Meillä pääset työskentelemään energiamurroksen eturintamassa ja kehittämään osaamistasi mukavassa ja osaavassa työyhteisössä, jo..."
227,Senior Power BI Developer,"About the job\nJob Description\n\nOletko datan visualisoinnin ja analytiikan ammattilainen, joka haluaa toimia dynaamisessa ja kansainvälisessä ympäristössä? Meillä saattaisi olla sopiva paikka juuri sinulle!\n\nEtsimme tiimiimme Power BI:n parissa toimivaa kokenutta konsulttia työskentelemään mielenkiintoisissa projekteissa Data & Analytiikka -tiimissämme. Tarjoamme sinulle joustavan työpaikan, jossa saat mahdollisuuden vaikuttaa työaikaasi ja -paikkaasi, laajan tarjoaman erilaisia koulutus..."
458,Process Engineer,"About the job\nPrysmian Group on maailman suurin energia- ja telekaapeleiden sekä kaapelijärjestelmien toimittaja, jonka liikevaihto on yli 12 miljardia euroa. Prysmian Group on aidosti globaali yritys, jolla on tytäryhtiöitä yli 50 maassa, 112 tehdasta, 25 tutkimus- ja kehityskeskusta ja noin 30 000 työntekijää. Suomessa Prysmian Group Finlandilla on henkilöstöä 650 ja tuotantolaitokset Oulussa ja Pikkalassa. Lisätietoja yhtiöstä: www.prysmiangroup.fi\n\nInvestointien myötä toimintamme kasv..."
491,Avoin hakemus / Open application,"About the job\nMiltä kuulostaisi monipuolinen työ kansainvälisessä kaapelialan yrityksessä teknologiateollisuuden parissa? Prysmian Group on maailman johtava kaapeleiden toimittaja. Yli 100 vuoden kokemuksella Prysmian Group tarjoaa markkinoiden laajimman kaapelivalikoiman Suomessa. Prysmian Group Finland Oy:llä on tehtaat Kirkkonummen Pikkalassa ja Oulun Ruskossa. Meillä pääset työskentelemään energiamurroksen eturintamassa ja kehittämään osaamistasi mukavassa ja osaavassa työyhteisössä, jo..."
905,Senior Power BI Developer,"About the job\nJob Description\n\nOletko datan visualisoinnin ja analytiikan ammattilainen, joka haluaa toimia dynaamisessa ja kansainvälisessä ympäristössä? Meillä saattaisi olla sopiva paikka juuri sinulle!\n\nEtsimme tiimiimme Power BI:n parissa toimivaa kokenutta konsulttia työskentelemään mielenkiintoisissa projekteissa Data & Analytiikka -tiimissämme. Tarjoamme sinulle joustavan työpaikan, jossa saat mahdollisuuden vaikuttaa työaikaasi ja -paikkaasi, laajan tarjoaman erilaisia koulutus..."
1237,Process Engineer,"About the job\nHaluatko rakentaa uraa kansainvälisessä kaapeliyhtiössä teknologia-alalla? Prysmian Group on maailman suurin kaapelivalmistaja, jolla on Suomessakin jo yli 110 vuoden kokemus alalta. Prysmian Group investoi yli 220 miljoonaa euroa Kirkkonummella sijaitsevan Pikkalan tehtaan tuotantokapasiteetin kasvattamiseen vuoteen 2025 mennessä. Nämä investoinnit ovat keskeinen osa Prysmian Groupin vastausta uusiutuvan energian maailmanlaajuisen kysynnän nopeaan kasvuun. \n\n\n\n(See job ad..."
1283,Senior Power BI Developer,"About the job\nJob Description\n\nOletko datan visualisoinnin ja analytiikan ammattilainen, joka haluaa toimia dynaamisessa ja kansainvälisessä ympäristössä? Meillä saattaisi olla sopiva paikka juuri sinulle!\n\nEtsimme tiimiimme Power BI:n parissa toimivaa kokenutta konsulttia työskentelemään mielenkiintoisissa projekteissa Data & Analytiikka -tiimissämme. Tarjoamme sinulle joustavan työpaikan, jossa saat mahdollisuuden vaikuttaa työaikaasi ja -paikkaasi, laajan tarjoaman erilaisia koulutus..."
1454,"Data Architect, Digital Society","About the job\nPosition Description\n\nVuonna 1976 perustettu CGI on maailman ja Suomen suurimpia IT- ja liiketoimintakonsultoinnin palveluyhtiöitä – ja tuhansien erilaisten uratarinoiden talo. Lisätietoa www.cgi.fi.\n\n\n\nHaluatko työskennellä kehittyvässä ympäristössä, jossa mielenkiintoiset ongelmanratkaisutilanteet ovat arkipäivää? Jos vastasit kyllä, paikkasi voisi olla tiimissämme!\n\n\n\nData Arkkitehdin tehtävässä voit olla esimerkiksi toteuttamassa asiakkaamme modernia tietovarasto..."
1662,Maintenance Technician,"About the job\nLocation: Vantaa, Finland\n\nThales people architect identity management and data protection solutions at the heart of digital security. Business and governments rely on us to bring trust to the billons of digital interactions they have with people. Our technologies and services help banks exchange funds, people cross borders, energy become smarter and much more. More than 30,000 organizations already rely on us to verify the identities of people and things, grant access to di..."
1835,"Data Architect, Digital Society","About the job\nPosition Description\n\nVuonna 1976 perustettu CGI on maailman ja Suomen suurimpia IT- ja liiketoimintakonsultoinnin palveluyhtiöitä – ja tuhansien erilaisten uratarinoiden talo. Lisätietoa www.cgi.fi.\n\n\n\nHaluatko työskennellä kehittyvässä ympäristössä, jossa mielenkiintoiset ongelmanratkaisutilanteet ovat arkipäivää? Jos vastasit kyllä, paikkasi voisi olla tiimissämme!\n\n\n\nData Arkkitehdin tehtävässä voit olla esimerkiksi toteuttamassa asiakkaamme modernia tietovarasto..."


In [26]:
# Language distribution of the unique job postings.
# Keep the four most frequent language sets and overwrite every rarer combination with {"en"}.
_anomaly_idx = dedup_df.jd_langs.isin(dedup_df.jd_langs.value_counts().index[4:])
dedup_df.loc[_anomaly_idx, "jd_langs"] = dedup_df.loc[_anomaly_idx, ["jd_langs"]].apply(lambda _: {"en"}, axis=1)
dedup_df.jd_langs.value_counts()

jd_langs
{en}        635
{fi}         51
{}            7
{en, fi}      6
Name: count, dtype: int64

### Observation #5

About 8% of the unique postings contain Finnish (57 of 699: 51 in Finnish only and 6 mixing English and Finnish). That is a sizeable share.

To extract skill keywords or classify jobs, I should either use a multilingual language model that handles both languages, or translate the Finnish postings first.
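Either option needs the Finnish postings separated from the rest first. A minimal sketch of that routing step, assuming `jd_langs` holds a set of language codes per posting as in the output above; the `toy` frame and its rows are made up for illustration:

```python
import pandas as pd

# Toy stand-in for dedup_df (hypothetical rows); the real frame has more columns.
toy = pd.DataFrame({
    "job_title": ["Data Engineer", "Data Analyst", "Data Scientist", "Process Engineer"],
    "jd_langs": [{"en"}, {"fi"}, {"en", "fi"}, set()],
})

# True where at least one detected language is Finnish.
has_finnish = toy["jd_langs"].map(lambda langs: "fi" in langs)

needs_translation = toy[has_finnish]   # translate these, or send to a multilingual model
english_or_unknown = toy[~has_finnish]  # can go through an English-only pipeline

print(f"{has_finnish.sum()} of {len(toy)} postings contain Finnish")
```

Postings with an empty language set fall into the second group here; whether that is the right default depends on why detection failed for them.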

## #6 Job Classification - Data scientist, engineer or analyst (or None of those)?

- Naive approach: match strings in the job title.
- More advanced approach: cluster the job descriptions.
- Some jobs belong to none of the three groups. Drop those.

1. Naive approach

In [27]:
len(df)

2593

In [28]:
idx_engineer = df.job_title.str.lower().str.contains("engineer")
display(df.loc[idx_engineer,:])
print(df.loc[idx_engineer,:].job_title.unique())

Unnamed: 0,job_title,required_skills,linkedin_url,company,job_description,crawl_date,is_remote,is_on_site,is_hybrid,is_fulltime,jd_langs
2,JVM Performance and Tuning Engineer,"Business Logic, Garbage Collection, Honeycomb,...",https://www.linkedin.com/jobs/view/3734708994,RELEX Solutions,About the job\nRELEX Solutions create cutting-...,2023-10-29,True,False,False,True,{en}
3,Data Engineer (Level Up),"Data Warehousing, Finnish, and SQLData Visuali...",https://www.linkedin.com/jobs/view/3744740320,Loihde Advance,About the job\nOnko sinulle jo kertynyt jo väh...,2023-10-29,False,False,True,True,{fi}
4,Data Science - Machine Learning Engineer,"Artificial Intelligence (AI), Computer Science...",https://www.linkedin.com/jobs/view/3629670334,Wolt,About the job\nJob Description\n\nTeam purpose...,2023-10-29,True,False,False,True,{en}
5,Data Engineer,"Data Analytics, Data Engineering, Data Science...",https://www.linkedin.com/jobs/view/3750477070,The Hub,About the job\nAbout Huuva\n\nHuuva Kitchens t...,2023-10-29,False,False,True,True,{en}
7,Data Platform Engineer,"Cor, Information Technology, Information and C...",https://www.linkedin.com/jobs/view/3733236101,COR Group Oy,About the job\nCor Group -konsernin liiketoimi...,2023-10-29,False,False,True,True,{fi}
...,...,...,...,...,...,...,...,...,...,...,...
2581,AI Engineer Academy - Finland,"Artificial Intelligence (AI), Computer Science...",https://www.linkedin.com/jobs/view/3746215896,Avanade,About the job\nJob Description\n\nAre you inte...,2023-11-04,False,False,False,True,{en}
2585,Data Warehouse Engineer,"Computer Science, Data Warehousing, Databases,...",https://www.linkedin.com/jobs/view/3749884540,Schibsted Finland,About the job\nTHE OPPORTUNITY IN A NUTSHELL\n...,2023-11-04,False,False,True,True,{en}
2587,Machine Learning Engineer - MLOps,"Artificial Intelligence (AI), Data Mining, Dat...",https://www.linkedin.com/jobs/view/3750358338,Wolt,About the job\nJob Description\n\nTeam purpose...,2023-11-04,True,False,False,True,{en}
2588,Data Engineer,"Data Engineering, Git, Python (Programming Lan...",https://www.linkedin.com/jobs/view/3736532279,Suomen Palloliitto - Football Association of F...,About the job\nDATA ENGINEER\n\nSUOMEN PALLOLI...,2023-11-04,False,False,True,True,{fi}


['JVM Performance and Tuning Engineer' 'Data Engineer (Level Up)'
 'Data Science - Machine Learning Engineer' 'Data Engineer'
 'Data Platform Engineer' 'Software Engineer, Data Platform'
 'Cloud Data Engineer' 'Machine Learning Engineer - MLOps'
 'Senior Data Engineer (Data Platform)'
 '(Senior) Data Engineer - Tietoevry Tech Services'
 'Data Engineer, OP Life Insurance' 'DATA ENGINEER' 'Senior Data Engineer'
 'Databricks data engineer' 'AI Engineer Academy - Finland'
 'Azure Data Engineer' 'Software Engineer - Data Infrastructure - Kafka'
 'Azure Data Engineers & Data Architects'
 'Automation Engineer / Research Scientist, Hydrogen Applications'
 'Software Development Engineer II, Computational Geometry'
 'Software Engineer - Data Infrastructure - OpenSearch/ElasticSearch'
 'Cloud Engineer'
 'Senior Machine Learning Engineer, Core Performance Foundations'
 'Senior Threat Detection Engineer - 5G Cybersecurity Platform'
 'Engineering Intern' 'DevOps Engineer for PKI Services (m/f/d)'
 '

In [29]:
idx_scientist = df.job_title.str.lower().str.contains("scientist")
print(len(df.loc[idx_scientist,:]))
print(df.loc[idx_scientist,:].job_title.unique())

72
['Data Scientist' 'Data Scientist / Senior Data Scientist'
 'Automation Engineer / Research Scientist, Hydrogen Applications'
 'Lead Data Scientist, multiple domains'
 'Senior Data Scientist, Consumer, AdTech'
 'Senior Data Scientist, Forecasting (Retail Platforms)'
 'Senior Data Scientist, Merchant'
 'Data Scientist / Machine Learning/AI Expert & Startup Founder'
 'Data Scientist, Data Platform & AI' 'Data Scientist - Helsinki / Salo'
 'Senior Data Scientist' 'Senior Data Scientist - Merchant Analytics'
 'Senior Data Scientist, Pricing (Retail Platforms)'
 'Senior Data Scientist (m/f/d)'
 'Senior Data Scientist, Consumer, Pricing' 'Junior Data Scientist'
 'Senior or Principal Scientist, AI-Driven Corporate Foresight & Strategy'
 'Data Scientist, Pohjola Insurance Analytics, Helsinki'
 'Data Scientist, Pohjola Vakuutuksen analytiikka, Helsinki']


In [30]:
idx_analyst = df.job_title.str.lower().str.contains("analyst")
print(len(df.loc[idx_analyst,:]))
print(df.loc[idx_analyst,:].job_title.unique())

273
['Senior Data Analyst, Ads' 'Data Analyst, Dream Blast'
 'Quantitative Risk Analyst'
 '2024, Commercial Banking, Full Time Analyst, Helsinki (Copenhagen, Stockholm, Oslo)'
 'Data Analyst, Pohjola Vakuutuksen Korvauspalvelut'
 'Senior Quantitative Risk Analyst'
 'Lead Quantitative Risk Analyst (Data Analytics Team)'
 'Senior SOC Analyst'
 '(Senior) Quantitative Risk Analyst (Data Analytics Team)'
 'Customer Insight Analyst'
 'Data Analyst Trainee, Henkilöasiakasrahoitus ja asumisen palvelut'
 'Senior Game Analyst'
 'Business Analyst to Integrated Risk Management Application (IRMA)'
 'Senior IT Analyst, Adobe Experience Platform'
 'Senior Credit Risk Analyst' 'Senior Financial Analyst, New Build'
 'Analyst (Bangkok Based, relocation provided)'
 'Flight Marketing Analyst (Bangkok Based, relocation provided)'
 'Senior Data Analyst (Product Team, Bangkok-based, Relocation provided)'
 'Senior Data Analyst (Bangkok Based, relocation provided)'
 'Senior Analyst (Bangkok Based, relocation p

In [31]:
_none_of_three = ~(idx_engineer | idx_scientist | idx_analyst)
display(df.loc[_none_of_three,:])
print(df.loc[_none_of_three,:].job_title.unique())

Unnamed: 0,job_title,required_skills,linkedin_url,company,job_description,crawl_date,is_remote,is_on_site,is_hybrid,is_fulltime,jd_langs
6,Data Platform Lead Architect - Tietoevry Care ...,"Analytics, Big Data, Cloud Computing, Data Ana...",https://www.linkedin.com/jobs/view/3717021115,Tietoevry,About the job\nYou may apply to Tietoevry by s...,2023-10-29,False,False,True,True,{en}
8,ETL Specialist,"Data Warehousing, English, Extract, Transform,...",https://www.linkedin.com/jobs/view/3743271896,Gazelle Global,About the job\nETL Specialist\n\n A great oppo...,2023-10-29,False,False,True,True,{en}
9,Analytics Solution Owner,"Analytical Skills, Data Warehousing, Extract, ...",https://www.linkedin.com/jobs/view/3730342817,Framery Acoustics,About the job\nFramery is constantly beating t...,2023-10-29,False,False,True,True,{en}
10,Digital Cloud Solution Architect - Azure - Dan...,"Cloud Computing, Computer Science, and Data An...",https://www.linkedin.com/jobs/view/3732569049,Microsoft,"About the job\nIn SMC and Digital Sales, we ha...",2023-10-29,False,False,True,True,{en}
12,Data Officer for Group Risk,"Data Management, Databases, and EnglishCommuni...",https://www.linkedin.com/jobs/view/3742624994,Nordea,About the job\nJob ID: 19795 \nWould you like ...,2023-10-29,False,True,False,True,{en}
...,...,...,...,...,...,...,...,...,...,...,...
2516,"Associate Director, Marketing Analytics (Bangk...","Analytical SkillsBenchmarking, Executive Manag...",https://www.linkedin.com/jobs/view/3702379056,Agoda,About the job\nAbout Agoda\n\nAgoda is an onli...,2023-11-04,False,False,False,True,{en}
2517,"Manager/ Senior Manager, Social MediaTeam (Ban...","Digital Marketing ChannelsDigital Marketing, E...",https://www.linkedin.com/jobs/view/3702377220,Agoda,About the job\nAbout Agoda\n\nAgoda is an onli...,2023-11-04,False,False,False,True,{en}
2518,"Senior Social Media Manager (Bangkok-based, re...","Content Marketing, Digital Marketing, Influenc...",https://www.linkedin.com/jobs/view/3687639088,Agoda,About the job\nAbout Agoda\n\nAgoda is an onli...,2023-11-04,False,True,False,True,{en}
2519,"Associate Director, Performance Marketing (Ban...","Analytical SkillsBenchmarking, Campaigns, Coop...",https://www.linkedin.com/jobs/view/3702376532,Agoda,About the job\nAbout Agoda\n\nAgoda is an onli...,2023-11-04,False,True,False,True,{en}


['Data Platform Lead Architect - Tietoevry Care Data and Analytics'
 'ETL Specialist' 'Analytics Solution Owner'
 'Digital Cloud Solution Architect - Azure - Danish, Finnish or Norwegian speakers'
 'Data Officer for Group Risk' 'Data specialist'
 'Data Platform and AI Trainees'
 'Digital Analytics Specialist - Data Activation'
 'Senior Threat Intelligence Officer, Nordic' 'Data Architect'
 'Research Coordinator to the Department of Computer Science'
 'Helsinki Associate Program - Privacy (Cyber & Digital Risk)'
 'Data Manager' 'Project Manager, Data & Analytics - Tietoevry Care'
 'XDR Specialist' 'Data Warehouse Consultant - Southern Finland'
 'Juristi-trainee, Data ja tietosuoja' 'IT Architect - Helsinki'
 'Data Center Critical Facilities IV'
 'QA Tools Specialist in Quality Research & Development unit'
 'AI Designer / Consultant' 'Manager, Product & Solutions, N&B'
 'Data Migration Owner' 'Sr. Data Consultant'
 'Manager, Data Platform & Analytics'
 'Build the Future - Graduate Progra

In [32]:
df.loc[_none_of_three,:].job_title.value_counts()

job_title
Remote Data Contributor – Image collection                   57
(Senior) Data Architect - Tietoevry Tech Services Finland    48
Google Cloud Architect                                       45
Serbian Transcriber ONSITE in Tampere, Finland               32
NAS Architect                                                29
                                                             ..
FP&A Manager, Partner Retail                                  1
Project Manager                                               1
Chief Information Security Officer                            1
Security Lead                                                 1
Lead/Expert Java Developer (Payments)                         1
Name: count, Length: 354, dtype: int64

### Observation #6

Matching substrings of the job title does not classify the jobs accurately. Titles can be inconsistent with the postings, and some titles do not state which group the job belongs to. Among the titles that match "engineer", some are real data engineer positions and some are not, for example 'Peroxides Process Engineer - Europe 1' and 'Manager, Process Engineering'.
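One way to reduce these false positives is to match whole phrases instead of single words. The sketch below is an assumption, not the notebook's method: the patterns in `TITLE_PATTERNS` and the helper `label_from_title` are hypothetical, and a title gets a label only when exactly one pattern matches.

```python
import re

# Hypothetical phrase patterns; word boundaries avoid matching inside longer words.
TITLE_PATTERNS = {
    "data engineer": re.compile(r"\bdata\s+engineers?\b", re.IGNORECASE),
    "data scientist": re.compile(r"\bdata\s+scientists?\b", re.IGNORECASE),
    "data analyst": re.compile(r"\bdata\s+analysts?\b", re.IGNORECASE),
}


def label_from_title(title):
    """Return the single matching label, or None if no pattern or several patterns match."""
    matches = [label for label, pattern in TITLE_PATTERNS.items() if pattern.search(title)]
    return matches[0] if len(matches) == 1 else None


for title in [
    "Senior Data Engineer (Data Platform)",
    "Peroxides Process Engineer - Europe 1",
    "Manager, Process Engineering",
    "Machine Learning Engineer - MLOps",
]:
    print(f"{title!r} -> {label_from_title(title)}")
```

On the real frame this could be applied with `df.job_title.map(label_from_title)`. The trade-off is recall: stricter patterns also leave titles such as 'Machine Learning Engineer - MLOps' unlabelled, so those still need a decision from the description text.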

To classify the jobs, we can use the postings whose titles clearly name one of the three groups to train a machine learning model, or use them as a test set for unsupervised algorithms. The remaining unknown postings should then take the label that the classifier assigns.
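The split into a labelled seed set and an unlabelled pool can be sketched as follows. This is only an illustration: `title_label` is a hypothetical column holding a title-derived label, or `None` where the title is ambiguous, and the `toy` rows are made up.

```python
import pandas as pd

# Toy stand-in for df; title_label is a hypothetical column, not part of the crawled data.
toy = pd.DataFrame({
    "job_title": ["Data Engineer", "Senior Data Scientist", "ETL Specialist", "Cloud Engineer"],
    "job_description": ["About the job ...", "About the job ...", "About the job ...", "About the job ..."],
    "title_label": ["data engineer", "data scientist", None, None],
})

is_labelled = toy["title_label"].notna()

seed = toy[is_labelled]      # train a supervised model on these, or hold them out as a test set
unknown = toy[~is_labelled]  # these receive whatever label the classifier predicts

print(seed["title_label"].value_counts())
print(f"{len(unknown)} postings still need a label")
```

Checking the class balance of the seed set before training is worthwhile, since the title counts above suggest the three groups are far from evenly represented.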

All the preprocessing steps are collected in `utils.py` for reuse.