# Job posting gathering

**Note: The methods have been refactored and updated, so this notebook is partially outdated.**

In [32]:
import numpy as np
import json
import requests
from bs4 import BeautifulSoup
import re
import time
from selenium import webdriver

import methods.scrape as sfuncs
import methods.urls as urlfuncs

## Website: karriere.at

In [27]:
from methods import sites

karriere_at = sites.KarriereATScraper()
postings = karriere_at.gather_data(descriptions=True, verbose=False)

A summary of what we have gathered:

In [39]:
# Collect the non-empty descriptions once and reuse them for every statistic
descriptions = [posting["description"] for posting in postings.values() if posting["description"]]
char_counts = [len(description) for description in descriptions]
word_counts = [len(description.split()) for description in descriptions]

num_postings = len(postings)
num_postings_with_description = len(descriptions)
description_length_sum = np.sum(char_counts)
description_length_avg = np.mean(char_counts)
description_words_sum = np.sum(word_counts)
description_words_avg = np.mean(word_counts)

print(f"Number of postings: {num_postings}")
print(f"Number of postings with description: {num_postings_with_description}")
print(f"Total number of characters: {description_length_sum}")
print(f"Average number of characters (excluding 0-length descriptions): {description_length_avg}")
print(f"Total number of words: {description_words_sum}")
print(f"Average number of words (excluding 0-length descriptions): {description_words_avg}")

Number of postings: 462
Number of postings with description: 460
Total number of characters: 1783171
Average number of characters (excluding 0-length descriptions): 3876.458695652174
Total number of words: 226275
Average number of words (excluding 0-length descriptions): 491.9021739130435


Knowing the number of characters we will send to AWS services (e.g. Comprehend), we can roughly estimate the cost of using these models.<br>
Amazon Comprehend bills by text length: *"NLP requests are measured in units of 100 characters, with a 3 unit (300 character) minimum charge per request"*.<br>The relevant rate is: *"Key Phrase Extraction/Translation: $0.0001 per unit (under 10M units)"*.

Because the minimum charge stays the same for any request under 300 characters, we detect the language of a text from a 299-character sample, which costs no more than that minimum. The same minimum is the reason not to call the `detect_dominant_language` method separately for each keyword: every short request would still be billed as 3 units.
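The billing rule quoted above can be sketched as a small helper. This is a minimal illustration, not code from the `methods` package: `billed_units` and `language_sample` are hypothetical names, and rounding partial units up is an assumption (it matches the 39-unit figure used in this notebook).

```python
import math

UNIT_CHARS = 100  # one unit = 100 characters (from the quoted pricing)
MIN_UNITS = 3     # minimum charge per request (from the quoted pricing)


def billed_units(num_chars: int) -> int:
    """Units charged for one request; assumes partial units are rounded up."""
    return max(MIN_UNITS, math.ceil(num_chars / UNIT_CHARS))


def language_sample(text: str, max_chars: int = 299) -> str:
    """Leading slice of a description used for language detection."""
    return text[:max_chars]


print(billed_units(12))    # a single short keyword still costs the minimum
print(billed_units(299))   # the 299-character sample costs the same minimum
print(billed_units(3876))  # an average-length description
```

Under these assumptions, a single keyword and a 299-character sample are billed identically, which is why one language check per description is cheaper than one per keyword.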

What are our expected costs? On average, our job descriptions are 3876.46 characters long, which rounds up to 39 units.<br>
Aside from small-scale translation (which is cheap), the costs come from key phrase extraction. Its approximate cost is 39 units per description * 460 job postings * $0.0001/unit ≈ 1.80 USD per run.
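The same estimate as a short calculation, using only the figures stated in this notebook. `estimate_cost` is a hypothetical helper, and treating every posting as average-length is a simplification.

```python
import math

PRICE_PER_UNIT_USD = 0.0001  # key phrase extraction rate quoted above
UNIT_CHARS = 100             # one unit = 100 characters


def estimate_cost(avg_chars: float, num_postings: int) -> float:
    """Rough cost of one run, treating every posting as average-length."""
    units_per_posting = math.ceil(avg_chars / UNIT_CHARS)
    return units_per_posting * num_postings * PRICE_PER_UNIT_USD


cost = estimate_cost(3876.46, 460)
print(f"Estimated cost per run: {cost:.2f} USD")
```

A more exact figure would sum the rounded-up units of each description individually, since rounding the average is not the same as averaging the rounded values.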


To reduce costs, let's filter out less relevant postings:

In [60]:
# Lowercase substrings that mark a title as out of scope (matched case-insensitively)
banned_words = ["manager", "management", "professor", "team leader", "teamleader", "teamleiter", "team leiter",
                "internship", "jurist", "lawyer", "audit", "legal", "advisor", "owner", "officer", "controller"]
banned_words += ["praktikant", "praktikantin", "praktikum", "trainee", "intern", "senior", "administrator"]
# Abbreviations that must match with their exact capitalization
capital_banned_words = ["SAP", "HR"]
postings_filtered = {key: value for key, value in postings.items()
                     if not any(banned_word in value["title"].lower() for banned_word in banned_words)}
postings_filtered = {key: value for key, value in postings_filtered.items()
                     if not any(banned_word in value["title"] for banned_word in capital_banned_words)}
len(postings_filtered), len(postings)

(352, 462)

We reduced costs by roughly 24% overall (110 of the 462 postings were filtered out)! The new costs are approximately 1.37 USD per run.
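One caveat of the substring filter: a banned word such as "intern" also matches titles containing "international" or "internal". If that proves too aggressive, a word-boundary variant could look like the sketch below. `build_banned_pattern` and `title_is_banned` are hypothetical helpers that are not part of this notebook, and the word list is shortened for illustration.

```python
import re


def build_banned_pattern(words):
    """Compile one case-insensitive regex that matches any banned word as a whole word."""
    # Longer alternatives first, so multi-word entries are tried before shorter ones
    escaped = sorted((re.escape(word) for word in words), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


pattern = build_banned_pattern(["intern", "senior", "team leader"])


def title_is_banned(title: str) -> bool:
    """True if the title contains a banned word delimited by word boundaries."""
    return pattern.search(title) is not None


print(title_is_banned("Senior Data Engineer (m/f/d)"))  # True
print(title_is_banned("International Data Analyst"))    # False
```

The trade-off is that whole-word matching no longer catches inflected or compound forms (for example a gendered suffix appended to "teamleiter"), so those would have to be listed explicitly. Switching would also change the filtered count reported above.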

In [2]:
titles = [posting["title"] for posting in postings.values()]
titles[:10]

['Consultant Technology Strategy & Advisory (all genders)',
 'IT-Consultant Artificial Intelligence (AI) (all genders)',
 'Senior Data Engineer (m/f/d)',
 'Conversational AI Specialist (all humans)',
 'Assistant Professor with tenure track to establish a Research Group for Artificial Intelligence / Machine Learning in the Life Sciences',
 'Games Analyst (f/m/d)',
 'Senior Full Stack Data Scientist_in',
 'Data Scientist_in',
 'DevOps Engineer (w/m/x)',
 'KI-Experte für Softwarelösungen (m/w/d) - AI Solution Architect']

In [41]:
date_today = time.strftime("%Y-%m-%d")

with open(f"source/save/postings_{date_today}.json", "w") as f:
    json.dump(postings, f)

The code for analyzing job postings is stored in the `comprehend_keywords.py` file.

## TieTalent

**Work in progress**

In [79]:
session = requests.Session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

url = 'https://tietalent.com/en/jobs?search=positions%5B0%5D%3DData_Engineer_5%26positions%5B1%5D%3DData_Analyst_36%26positions%5B2%5D%3DData_Scientist_37%26positions%5B3%5D%3DMachine_Learning_7%26positions%5B4%5D%3DBusiness_Intelligence_39%26positions%5B5%5D%3DNLP_14%26locations%5B0%5D%3DVienna_Vienna_Austria_304'
response = session.get(url, headers=headers)

This website does not require scrolling to load results, but pages 2, 3, ... will also need to be checked.
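A possible way to generate the URLs for further pages is sketched below. The `page` query parameter is purely an assumption: the parameter the site actually uses has not been verified here and should be checked in the browser first. `with_page` is a hypothetical helper, and the base URL is a shortened version of the search URL above.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def with_page(url: str, page: int, param: str = "page") -> str:
    """Return `url` with the (assumed) pagination parameter set to `page`."""
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))  # decode the existing query string
    query[param] = str(page)              # add or overwrite the page number
    return urlunparse(parts._replace(query=urlencode(query)))


# Shortened search URL, for illustration only
base_url = "https://tietalent.com/en/jobs?search=positions%5B0%5D%3DData_Engineer_5"
page_urls = [with_page(base_url, page) for page in range(1, 4)]
for page_url in page_urls:
    print(page_url)
```

Each generated URL could then be fetched with the same `session.get(url, headers=headers)` call, stopping once a page returns no postings.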

In [80]:
soup = BeautifulSoup(response.content, 'html.parser')