In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import string
from src.getalljobs import *

# Scraping Jobs.ch

As a job seeker, you have to search through job portals to find the jobs most relevant to your profile. In this exercise, your goal is to find all jobs on jobs.ch related to the keywords “Data Scientist”, “Data Analyst”, “Python Developer”, “Data Engineer”, “Data Manager”, “Data Architect”, “Big Data Analyst” and “Data Python”.
1. Download all necessary information (including job title, date, company name, location…) for all webpages.
2. Using the information obtained, perform a descriptive analysis of this data that answers questions such as:
   - How many jobs are shared between these categories?
   - How much do the keywords “Data Analyst” and “Big Data Analyst” overlap?
   - Are some companies hiring more than average?
   - How many jobs are there in the different cantons?
   - Does the “machine learning” keyword appear more often in data scientist or data analyst jobs?
   - What is the distribution of most common keywords between and across categories?
3. Produce a report in the form of a clean notebook (or jupyter slides), with commented code and markdown cells for structuring and interpretations.

## Web Scraping

The file `src/getalljobs` contains the necessary functions to pull information from https://www.jobs.ch/en/vacancies/. The scraper works as follows:
- It receives a list of job positions in natural language.
- The function `clean_job_keywords` transforms those positions into search keywords by replacing white spaces with `%20` characters.
- Once the keywords are ready, the function `df_full_data` pulls the information for each job position as follows:
  - Get the number of available pages for each job position.
  - For each available page, scrape each individual job text box with the function `get_data_one_job` and concatenate the results with the function `df_all_jobs`.
  - If no job postings are found, an error is printed (see the example below).
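
The control flow described above can be sketched as follows. This is a minimal sketch and not the code in `src/getalljobs`: the `_sketch` names, the injected `get_page_count` and `get_page_jobs` callables, and the list-of-dicts result are assumptions made so that the example runs without network access.

```python
from urllib.parse import quote


def clean_job_keywords_sketch(job_positions):
    """Percent-encode each position so spaces become %20 (hypothetical helper)."""
    return [quote(position.strip()) for position in job_positions]


def full_data_sketch(job_positions, get_page_count, get_page_jobs):
    """Collect postings per position and record positions without postings.

    get_page_count(keyword) and get_page_jobs(keyword, page) are hypothetical
    callables standing in for the real HTTP requests and HTML parsing.
    """
    results, errors = [], []
    keywords = clean_job_keywords_sketch(job_positions)
    for position, keyword in zip(job_positions, keywords):
        n_pages = get_page_count(keyword)
        if n_pages == 0:
            # Report the miss and remember it, as the notebook output does
            print(f"{position} not found")
            errors.append(position)
            continue
        for page in range(1, n_pages + 1):
            for job in get_page_jobs(keyword, page):
                results.append({**job, "keyword": position.lower()})
    return {"results": results, "errors": errors}


# Toy stand-ins: one position with two result pages, everything else empty
fake_pages = {"Data%20Engineer": 2}
demo = full_data_sketch(
    ["Data Engineer", "this job shouldnt exist"],
    get_page_count=lambda keyword: fake_pages.get(keyword, 0),
    get_page_jobs=lambda keyword, page: [{"title": f"toy job on page {page}"}],
)
print(demo["errors"])
```

Passing the fetchers in as arguments keeps the loop testable offline; in the notebook the results come back as a DataFrame, which `pd.DataFrame(demo["results"])` would reproduce for this toy data.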

In [2]:
# Key words to be searched
job_positions = ["Data Engineer", "Data Scientist", "Data Analyst", "Python Developer", "Data Manager", "Data Architect", "Big Data Analyst", "Data Python"]

In [3]:
# Run the function to get both the results and the errors
df_all = df_full_data(job_positions)

# In this case we should not have errors
errors = df_all["errors"]
errors

[]

In [4]:
# Print the found jobs
df_jobs = df_all["results"]
df_jobs.head(10)

| | title | publication_date | location | workload | job_type | company | job_link | keyword |
|---|---|---|---|---|---|---|---|---|
| 0 | Data Engineer temp. 24 months (w/m/d) | 25 April 2023 | Baden | 100% | Temporary | Axpo Group | https://www.jobs.ch/en/vacancies/detail/3fa23b... | data engineer |
| 1 | Head Engineering & Development (m/w/d) - Digit... | 11 Mai 2023 | Oftringen AG | 100% | Unlimited employment | Mercuri Urval AG | https://www.jobs.ch/en/vacancies/detail/54aa6c... | data engineer |
| 2 | DevOps Engineer 80-100% (w/m/d) | 11 Mai 2023 | Bendern | 80% – 100% | Unlimited employment | LGT | https://www.jobs.ch/en/vacancies/detail/c8f7e0... | data engineer |
| 3 | Application Support Engineer | 20 April 2023 | Unterentfelden | 100% | Unlimited employment | Hexagon Manufacturing Intelligence | https://www.jobs.ch/en/vacancies/detail/8dd0c8... | data engineer |
| 4 | Produktentwicklungsingenieur* | 17 April 2023 | Biel | 100% | Unlimited employment | HARTING AG | https://www.jobs.ch/en/vacancies/detail/6b99d0... | data engineer |
| 5 | Experienced Machine Learning Engineer (f/m/d) | 12 Mai 2023 | Heerbrugg | 80% – 100% | Unlimited employment | Hexagon Technology Center GmbH | https://www.jobs.ch/en/vacancies/detail/1db76b... | data engineer |
| 6 | Bauingenieur/-in als Projektleiter/-in Wasserv... | 03 April 2023 | Sursee | 60% – 100% | Unlimited employment | Kost + Partner AG Sursee | https://www.jobs.ch/en/vacancies/detail/29b656... | data engineer |
| 7 | Field Test / R&D Engineer | 05 Mai 2023 | Kriens | 100% | Unlimited employment | ANDRITZ | https://www.jobs.ch/en/vacancies/detail/c7b03a... | data engineer |
| 8 | Service Engineer (m/w/d) 80-100 % | 12 April 2023 | Winterthur | 80% – 100% | Unlimited employment | Swiss Birdradar Solution AG | https://www.jobs.ch/en/vacancies/detail/cf9b05... | data engineer |
| 9 | Ingénieur Service Client Laboratoire 80-100% (... | 26 April 2023 | Renens, CH-VD | 80% – 100% | Unlimited employment | Siemens Healthineers International AG | https://www.jobs.ch/en/vacancies/detail/bbdf25... | data engineer |


In [3]:
# This should return no data and report the error
df_error = df_full_data(["this job shouldnt exist"])

# The empty data frame
print(df_error["results"])

# Returns the jobs that were not found
print(df_error["errors"])

this job shouldnt exist not found
Empty DataFrame
Columns: []
Index: []
['this job shouldnt exist']


## Data Cleaning

Let's do some additional data cleaning. For 83 job postings, the company and the job_type are inverted. That issue could be fixed automatically if the scraper located each field by its tag instead of by its index.

In [23]:
# There is an index problem for some cases and therefore some job types do not make sense
df_jobs.job_type.unique()

array(['Temporary', 'Unlimited employment', 'Freelance', 'Internship',
       'Jet Aviation AG', 'Sensirion AG', 'Apprenticeship', 'KPMG',
       'Universität Basel', 'COOP', 'Supplementary income'], dtype=object)

In [25]:
# Company names that ended up in the job_type column
misplaced_companies = ['Jet Aviation AG', 'Sensirion AG', 'KPMG', 'Universität Basel', 'COOP']
is_incorrect = df_jobs["job_type"].isin(misplaced_companies)

# Keep the positions where the job type is correct
df_jobs_correct = df_jobs[~is_incorrect].copy()

# Keep the positions where the job type is incorrect
df_jobs_incorrect = df_jobs[is_incorrect].copy()

In [27]:
# For those rows, move the value into company and leave job_type empty
df_jobs_incorrect["company"] = df_jobs_incorrect["job_type"]
df_jobs_incorrect["job_type"] = ""

In [28]:
df_jobs_clean = pd.concat([df_jobs_correct, df_jobs_incorrect], ignore_index = True)
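
The same correction can also be written without listing company names by hand. The sketch below is an alternative, not the approach used above: it assumes that the six legitimate values seen in `job_type.unique()` form the complete set of valid job types, and it treats every other value as a misplaced company name. The helper name `fix_swapped_job_type` and the toy rows are hypothetical.

```python
import pandas as pd

# Assumption: these are the only legitimate job types on the portal
VALID_JOB_TYPES = [
    "Temporary", "Unlimited employment", "Freelance",
    "Internship", "Apprenticeship", "Supplementary income",
]


def fix_swapped_job_type(df):
    """Move misplaced company names out of job_type and blank the job type."""
    fixed = df.copy()
    swapped = ~fixed["job_type"].isin(VALID_JOB_TYPES)
    fixed.loc[swapped, "company"] = fixed.loc[swapped, "job_type"]
    fixed.loc[swapped, "job_type"] = ""
    return fixed


# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "job_type": ["Temporary", "KPMG"],
    "company": ["Axpo Group", "Unlimited employment"],
})
toy_fixed = fix_swapped_job_type(toy)
print(toy_fixed)
```

A whitelist keeps working when a new scrape surfaces other company names, at the cost of misclassifying any job type that is missing from the list.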

## Analysis

How many jobs are shared between these categories?

In [47]:
# The job id is the second-to-last segment of the job link
df_jobs_clean["job_id"] = df_jobs_clean["job_link"].str.split("/").str[-2]
# Count the distinct keywords under which each job appears
df_duplicated = df_jobs_clean.groupby(by = "job_id").keyword.nunique().reset_index()
# Keep the jobs found under more than one keyword
df_duplicated = df_duplicated[df_duplicated["keyword"] > 1].sort_values(by = "keyword", ascending=False)
df_duplicated.shape[0]

1124
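
To make the counting logic concrete, here is a small sketch on toy data. The links are invented and assume, as the `[-2]` index suggests, that each job link ends with a trailing slash so the id is the second-to-last segment. A job counts as shared only when it appears under more than one distinct keyword; repeated rows under the same keyword do not count.

```python
import pandas as pd

# Toy data: job "aaa" appears under two keywords, "ccc" twice under one keyword
toy = pd.DataFrame({
    "job_link": [
        "https://www.jobs.ch/en/vacancies/detail/aaa/",
        "https://www.jobs.ch/en/vacancies/detail/aaa/",
        "https://www.jobs.ch/en/vacancies/detail/bbb/",
        "https://www.jobs.ch/en/vacancies/detail/ccc/",
        "https://www.jobs.ch/en/vacancies/detail/ccc/",
    ],
    "keyword": [
        "data analyst", "big data analyst",
        "data engineer", "data engineer", "data engineer",
    ],
})

# Extract the id, then count distinct keywords per job
toy["job_id"] = toy["job_link"].str.split("/").str[-2]
keywords_per_job = toy.groupby("job_id")["keyword"].nunique()
shared_jobs = keywords_per_job[keywords_per_job > 1]
n_shared = len(shared_jobs)
print(n_shared)
```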