# LinkedIn Job Vacancy Analysis

**Research goal:** visualize information about the European job market for data analysis specialists, using data from the LinkedIn social network.

**Tasks:**

1. Examine the data and bring it into a format suitable for analysis.
2. Parse the resulting CSV file with BeautifulSoup 4, creating the following features:
    - job title, city, country, workplace type (online, hybrid, on-site);
    - company and company size (number of employees);
    - the company's field of activity;
    - required hard skills;
    - vacancy publication date;
    - number of applicants for the vacancy.
3. Build a dashboard in Power BI containing the following visuals:
    - filters by country and by workplace type;
    - number of vacancies (absolute values): indicator;
    - number of vacancies by country (relative values): stacked bar chart;
    - workplace type: pie chart;
    - list of hiring companies with their number of vacancies, sorted in descending order: heat map;
    - top 10 company fields of activity that hire analysts: bar chart;
    - company size and number of vacancies: pie chart;
    - hard skills: bar chart.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import re
import datetime

In [2]:
df = pd.read_csv('masterskaya_parsing_LinkedIn_2023_05_23.csv')

In [3]:
# get acquainted with the data
display(df.head(1))
df.info()
df.columns

Unnamed: 0.1,Unnamed: 0,html
0,0,"\n <div>\n <div class=""\n jobs-deta..."


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 998 entries, 0 to 997
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  998 non-null    int64 
 1   html        998 non-null    object
dtypes: int64(1), object(1)
memory usage: 15.7+ KB


Index(['Unnamed: 0', 'html'], dtype='object')

The `Unnamed: 0` column is clearly redundant and carries no useful information, so we drop it:

In [4]:
df.drop('Unnamed: 0', axis=1, inplace=True)

# check the result
df.columns

Index(['html'], dtype='object')

To find the tags needed for parsing, let's render one row of the data as HTML:

In [5]:
df.head(1).style

Unnamed: 0,html
0,"Data Analyst  PharmiWeb.Jobs: Global Life Science Jobs  Basel, Basel, Switzerland  On-site  1 week ago  47 applicants  Full-time · Entry level  11-50 employees · Staffing and Recruiting  See how you compare to 47 applicants. Try Premium for free  Apply  Save  Save Data Analyst at PharmiWeb.Jobs: Global Life Science Jobs  Share  Show more options  Data Analyst  PharmiWeb.Jobs: Global Life Science Jobs  Basel, Basel, Switzerland  On-site  Apply  Save  Save Data Analyst at PharmiWeb.Jobs: Global Life Science Jobs  Show more options  About the job  What You Will Achieve This position will apply advanced manufacturing, science, and technology to support business and process improvements for the manufacture of small and / or large volume parenteral products. You will be a member of the Transformation and Strategy team. As a Data Analyst you will be responsible for mining, retrieving, organizing, and analyzing data to support the operations of a large manufacturing facility. Using the data, you will help to develop key performance indicators to demonstrate the effectiveness of processes and systems against business strategies. Your knowledge of manufacturing operations and computer systems/tools will make you a critical member of the team. Your strong business processes and workflow skills will help facilitate required gatherings for building and enhancing business process maps and strategies. Your innovative use of communication tools and techniques will facilitate in explaining difficult issues, establishing consensus between teams, and will create a collaborative teaming environment for your colleagues. 
Main Responsibilities Interpret data, analyze results using statistical techniques and provide ongoing reportsDevelop and implement databases, data collection systems, data analytics and other strategies that optimize statistical efficiency and qualityAcquire data from primary or secondary data sources and maintain databases/data systemsIdentify, analyze, and interpret trends or patterns in complex data sets Filter and “clean” data by reviewing reports, printouts, and performance indicatorsWork with management to prioritize business and information needs Identify and define new process improvement opportunities Must-Haves A Bachelor’s degree with at least three years of experience; OR a Master’s degree with more than one year of experience.Prior pharmaceutical and/or manufacturing experience requiredTechnical expertise regarding data models, database design development, data mining and segmentation techniques Knowledge of and experience with reporting packages (Business Objects etc) and databases (SQL etc)Knowledge of statistics and experience using statistical packages for analyzing datasets (Excel, SPSS, SAS etc)Knowledge of SAP (ERP materials planning systems). You will be joining an organisation with determined to bring about considerable change to the global industry, in an environment that promotes self-development and personal success while driving for company growth. Job Title: Data Analyst Location: Basel, Switzerland Job Type: Contract Aerotek, an Allegis Group company. Allegis Group AG, Aeschengraben 20, CH-4051 Basel, Switzerland. Registration No. CHE-101.865.121. Aerotek and Actalent Services are companies within the Allegis Group network of companies (collectively referred to as ""Allegis Group""). Aerotek, Actalent Services, Aston Carter, EASi, TEKsystems, Stamford Consultants and The Stamford Group are Allegis Group brands. 
If you apply, your personal data will be processed as described in the Allegis Group Online Privacy Notice available at https://www.allegisgroup.com/en-gb/privacy-notices. To access our Online Privacy Notice, which explains what information we may collect, use, share, and store about you, and describes your rights and choices about this, please go to https://www.allegisgroup.com/en-gb/privacy-notices. We are part of a global network of companies and as a result, the personal data you provide will be shared within Allegis Group and transferred and processed outside the UK, Switzerland and European Economic Area subject to the protections described in the Allegis Group Online Privacy Notice. We store personal data in the UK, EEA, Switzerland and the USA. If you would like to exercise your privacy rights, please visit the ""Contacting Us"" section of our Online Privacy Notice at https://www.allegisgroup.com/en-gb/privacy-notices for details on how to contact us. To protect your privacy and security, we may take steps to verify your identity, such as a password and user ID if there is an account associated with your request, or identifying information such as your address or date of birth, before proceeding with your request. If you are resident in the UK, EEA or Switzerland, we will process any access request you make in accordance with our commitments under the UK Data Protection Act, EU-U.S. Privacy Shield or the Swiss-U.S. Privacy Shield  About the company  PharmiWeb.Jobs: Global Life Science Jobs  64,551 followers  Follow  Staffing & Recruiting  11-50 employees  18 on LinkedIn  PharmiWeb.Jobs is PharmiWeb's dedicated Life Science job board. PharmiWeb.jobs is the largest dedicated niche pharma job board in Europe. Follow us for the latest jobs in pharma, industry employment news, career tips, Contact us to advertise your life science vacancies  …  show more  Show more"


The job title is in the `h2` tag. Let's store it in the `job_title` column:

In [6]:
df['job_title'] = df['html'].apply(lambda x: BeautifulSoup(x, 'lxml').find('h2').text.strip())

# check the result
df.head(5)

Unnamed: 0,html,job_title
0,"\n <div>\n <div class=""\n jobs-deta...",Data Analyst
1,"\n <div>\n <div class=""\n jobs-deta...",Data Analyst - Logistics
2,"\n <div>\n <div class=""\n jobs-deta...",Data Analyst - Logistics
3,"\n <div>\n <div class=""\n jobs-deta...",Data Analyst (Space & Planning)
4,"\n <div>\n <div class=""\n jobs-deta...",Data Analyst


Let's convert the values in the `job_title` column to lower case and look at the unique job titles:

In [7]:
df['job_title'] = df['job_title'].str.lower()
sorted(df['job_title'].unique())

['(esg) data analyst (w/m/d)',
 '(junior) application specialist ecommerce (m/f/d)',
 '(junior) business analyst (f/m/d)',
 '(junior) business analyst (w/m/d)',
 '(junior) consumer business intelligence analyst (m/w/d)',
 '(junior) data scientist',
 '(junior) project analyst (m/f/d)',
 '(junior) project coordinator',
 '(junior) systems business analyst (f/m/d)',
 '(senior) consultant transaction analytics (f/m/d)',
 '360 data product owner',
 '[alt] data analyst assistant - h/f',
 'accountant',
 'afc data analyst',
 'aftersales reporting specialist',
 'alternance - assistance data analyst et reporting rh f/h',
 'alternance - data analyst - data etl & visualisation (h/f/d)',
 'analista (h/m) financiero / fp&a reporting',
 'analista dati di geodesia - categoria protetta',
 'analista de datos',
 'analista de datos bi',
 'analista de datos dept de elaborados (guissona)',
 'analista de proyecto digitalización',
 'analista de software',
 'analista ecommerce, web y app',
 'analista funzionale

According to the brief, we are only interested in `Data Analyst` and `BI Analyst` vacancies. Because the titles appear in several European languages, to be safe we build `job_title_list` by hand: starting from the list of unique values, we exclude the clearly irrelevant positions and keep the titles that correspond to these two roles:

In [8]:
job_title_list = ['(esg) data analyst (w/m/d)', 
 '[alt] data analyst assistant - h/f',
 'afc data analyst',
 'alternance - assistance data analyst et reporting rh f/h',
 'alternance - data analyst - data etl & visualisation (h/f/d)',
 'analista de datos',
 'analista de datos bi',
 'analista de datos dept de elaborados (guissona)',
 'analista superior de datos',
 'analityk danych',
 'analityk danych internetowych (wszystko.pl)',
 'analyst (m/f/d) business intelligence & analytics',
 'analyste data -(h/f)',
 'analyste de données pièces de rechange automobile',
 'analyste des donnees (h/f)',
 'asset data analyst',
 'assistant, hr data analytics',
 'bi analyst',
 'bi analyst (m/w/d)',
 'bi analyst (m/w/d) marketing',
 'bi analyst (pricing)',
 'bi analyst - tableau or other bi tools',
 'bi analyst, power bi champion, cee based in warsaw',
 'bi-analyst (m/w/d)',
 'business intelligence analyst',
 'business intelligence analyst (m/w/d)',
 'business intelligence analyst junior',
 'business intelligence associate',
 'business intelligence-analyst:in',
 'cdi - chef de projet data/analytique f/h',
 'cdi - data analyst f/h',
 'cdi - data analyst h/f - yves rocher',
 'client data analyst (client intelligence specialist - fully remote) )',
 'client insights data analyst - sql, python, data bricks',
 'danish language data analyst - barcelona',
 'danish language data analyst in barcelona',
 'data & analytics analyst - bari, roma',
 'data & analytics consultant',
 'data & analytics senior analyst',
 'data analist',
 'data analist - startersfunctie (dutch speaking)',
 'data analist cbr',
 'data analysis and reporting team lead',
 'data analyst',
 'data analyst  h/f',
 'data analyst (9 months ftc)',
 'data analyst (assortment)',
 'data analyst (bangkok based, relocation provided)',
 'data analyst (engineer)',
 'data analyst (f/h)',
 'data analyst (fraud)*',
 'data analyst (ft)',
 'data analyst (h/f)',
 'data analyst (h/f) - cdi',
 'data analyst (legal services)',
 'data analyst (m/f)',
 'data analyst (m/f/d)',
 'data analyst (m/f/d) - global sea logistics systems',
 'data analyst (m/f/x)',
 'data analyst (m/w/d)',
 'data analyst (m/w/d) im controlling',
 'data analyst (marketing & comms)',
 'data analyst (mobile)',
 'data analyst (product data analyst)',
 'data analyst (slovakia) irc183410',
 'data analyst (space & planning)',
 'data analyst (w/m/d) - kurse',
 'data analyst - alternance - boursorama-(h/f)',
 'data analyst - analyste de données',
 'data analyst - bseu raw materials planning',
 'data analyst - client insight',
 'data analyst - confirmé.e',
 'data analyst - customer management domain',
 'data analyst - digital marketing (all genders)',
 'data analyst - edtech',
 'data analyst - finance',
 'data analyst - flanders digital',
 'data analyst - global marketing agency',
 'data analyst - global marketing h/f',
 'data analyst - h/f',
 'data analyst - hybrid',
 'data analyst - hybrid - permanent',
 'data analyst - hybrid working',
 'data analyst - lisboa e porto - campo grande',
 'data analyst - logistics',
 'data analyst - marketing & communications insight',
 'data analyst - marketing - e-commerce',
 'data analyst - milano',
 'data analyst - operations',
 'data analyst - pilotage transformation cloud-(h/f)',
 'data analyst - poland',
 'data analyst - scores & etudes',
 'data analyst - transportation',
 'data analyst - €60,- per hour - amsterdam based',
 'data analyst / data scientist (m/f/d)',
 'data analyst / decision scientist, growth',
 'data analyst / decision scientist, marketing',
 'data analyst / mathematiker / statistiker (m/w/d)',
 'data analyst / risques de crédit - boursorama-(h/f)',
 'data analyst and business development specialist',
 'data analyst assessor',
 'data analyst associate',
 'data analyst costing',
 'data analyst delivery operations',
 'data analyst débutant(e)',
 'data analyst en alternance (94) dcf/ab - f/h',
 'data analyst en alternance (h/f) - boulogne-billancourt',
 'data analyst export',
 'data analyst f/h',
 'data analyst für customer analytics (m/w/d)',
 'data analyst h/f',
 'data analyst h/f - alternance 12 ou 24 mois',
 'data analyst h/f _ cdd',
 'data analyst h/f h/f',
 'data analyst ii',
 'data analyst im bereich iiot / ki / predictive maintenan ...',
 'data analyst im bereich iiot / ki / predictive maintenance (m/w/d)',
 'data analyst in the area of iiot / ki / predictive maintenance (m/f/d)',
 'data analyst it',
 'data analyst job in overseas',
 'data analyst m/f',
 'data analyst marketing stratégique - boursorama-(h/f)',
 'data analyst pilotage operationnel h/f',
 'data analyst power bi (m/w/d)',
 'data analyst power bi f/h',
 'data analyst professional programme',
 'data analyst reporting',
 'data analyst return solutions',
 'data analyst sas - h/f',
 'data analyst tableau',
 'data analyst telco',
 'data analyst | deals (m&a) | cdi | h/f',
 'data analyst – web & app (m/w/d)',
 'data analyst, metrics & reporting',
 'data analyst, product intelligence #swx',
 'data analyst, sql/ssis {finance',
 'data analyst- 6 month',
 'data analyst-(h/f)',
 'data analyst/etl developer',
 'data analyst:in',
 'data analyst_frosinone (fr)',
 'data analyste',
 'ecommerce web analyst',
 'junior data analist',
 'junior data analyst',
 'junior data analyst / business intelligence expert (gn)',
 'junior data analyst power bi - contrato 6 meses',
 'junior data analyst sustainability',
 'junior data analyst – aerospace',
 'junior data analyst/associate (m/f/d)',
 'nrs13700 grade v, data analyst',
 'online data analyst',
 'online data analyst (m,f,d)',
 'online data analyst - france',
 'power bi / data analyst (m/w/d)',
 'principal data analyst - growth',
 'privacy data analyst - fluent german',
 'product data analyst',
 'product data analyst with english',
 'quality data analyst (m|w|d)',
 'reliability data analyst',
 'sales data analyst',
 'senior data analyst',
 'senior data analyst (bangkok based, relocation provided)',
 'senior data analyst (f/m/d) - onsite or remote / home office',
 'sql/data analyst (m/w/d)',
 'stage - data analyst h/f',
 'stage | data analyst',
 'statistical data analyst',
 'web analyst f/h en cdi',
 'web data analyst f/h']
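
A keyword filter can help cross-check a hand-made list like the one above by shortlisting titles for manual review. The sketch below is only an aid for that step, not the method used in this notebook; the keyword list and the name `looks_relevant` are assumptions made for illustration.

```python
import re

# Hypothetical keyword list covering a few of the languages seen above;
# it is an assumption and would need extending for a real run.
KEYWORDS = [
    r'data anal[iy]st',          # data analyst, data analist
    r'analista de datos',
    r'analityk danych',
    r'analyste de donn[ée]es',
    r'\bbi[- ]analyst',
    r'business intelligence',
]
PATTERN = re.compile('|'.join(KEYWORDS))

def looks_relevant(title):
    """Return True if the lower-cased job title contains any keyword."""
    return PATTERN.search(title.lower()) is not None

titles = ['data analyst (m/w/d)', 'accountant', 'analityk danych', 'bi-analyst (m/w/d)']
shortlist = [t for t in titles if looks_relevant(t)]
print(shortlist)
```

Titles that pass the filter still need a human look: a title such as `data analysis and reporting team lead` contains relevant words but may not be the role we want.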

To speed up parsing, let's remove the unneeded rows from the dataset:

In [9]:
df = df.query('job_title in @job_title_list')

# reset the index
df.reset_index(drop=True, inplace=True)

# check the result
df.index

RangeIndex(start=0, stop=466, step=1)

Let's continue parsing and try to find the city and the country:

In [10]:
for i in range(len(df['html'])):
    html = df['html'][i]
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find('span', class_ = 'jobs-unified-top-card__bullet').text.strip())

Basel, Basel, Switzerland
Coventry, England, United Kingdom
Coventry, England, United Kingdom
South Molton, England, United Kingdom
Lugano, Ticino, Switzerland
Southampton, England, United Kingdom
Leeds, England, United Kingdom
Nuneaton, England, United Kingdom
Paris, Île-de-France, France
Cambridge, England, United Kingdom
West Midlands, England, United Kingdom
Chester, England, United Kingdom
Cambridge, England, United Kingdom
Craven Arms, England, United Kingdom
Dublin, County Dublin, Ireland
Belfast, Northern Ireland, United Kingdom
Sunderland, England, United Kingdom
Montévrain, Île-de-France, France
Bristol, England, United Kingdom
South Molton, England, United Kingdom
Bristol, England, United Kingdom
Solihull, England, United Kingdom
Dublin, County Dublin, Ireland
Blackpool, England, United Kingdom
Cracow, Małopolskie, Poland
Dijon, Bourgogne-Franche-Comté, France
Alsónémedi, Pest, Hungary
Dublin, County Dublin, Ireland
Manchester, England, United Kingdom
Elliniko-Argyroupoli, A

Most often the list contains the city first, the region second, and the country third. Sometimes only the city and the country are given, rarely only the country, and very rarely only the region. Let's write a function that returns the city and the country according to these rules (a single value is treated as the country):

In [11]:
def location(data):
    soup = BeautifulSoup(data, 'lxml')
    x = soup.find('span', class_ = 'jobs-unified-top-card__bullet').text.strip().split(',')
    if len(x) == 3:
        # city, region, country
        city = x[0].strip()
        country = x[2].strip()
    elif len(x) == 2:
        # city, country
        city = x[0].strip()
        country = x[1].strip()
    elif len(x) == 1:
        # a single value is treated as the country
        city = None
        country = x[0].strip()
    else:
        # unexpected number of parts
        city = None
        country = None
    return pd.Series([city, country])

In [12]:
df[['city','country']] = df['html'].apply(location)

# check the result
df.head(1)

Unnamed: 0,html,job_title,city,country
0,"\n <div>\n <div class=""\n jobs-deta...",data analyst,Basel,Switzerland
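
The splitting rule can also be kept separate from the HTML lookup, which makes it easy to try on plain strings. The sketch below is a variant, not the function used above: `split_location` is a hypothetical name, and taking the first part as the city and the last part as the country for any string with two or more parts is an assumption.

```python
def split_location(text):
    """Split 'City, Region, Country' into a (city, country) tuple."""
    parts = [p.strip() for p in text.split(',') if p.strip()]
    if not parts:
        return None, None
    if len(parts) == 1:
        # a single value is treated as the country, as in the notebook
        return None, parts[0]
    # assumption: the first part is the city and the last part is the country
    return parts[0], parts[-1]

print(split_location('Basel, Basel, Switzerland'))
print(split_location('Germany'))
```

With this variant a string with more than three parts still yields a city and a country instead of two missing values.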


Let's check whether the data need further processing. First, the unique city values:

In [13]:
df['city'].unique()

array(['Basel', 'Coventry', 'South Molton', 'Lugano', 'Southampton',
       'Leeds', 'Nuneaton', 'Paris', 'Cambridge', 'West Midlands',
       'Chester', 'Craven Arms', 'Dublin', 'Belfast', 'Sunderland',
       'Montévrain', 'Bristol', 'Solihull', 'Blackpool', 'Cracow',
       'Dijon', 'Alsónémedi', 'Manchester', 'Elliniko-Argyroupoli', None,
       'Umeå', 'North Holland', 'Vilnius', 'Milan', 'Stockholm County',
       'Luxembourg', 'Roubaix', 'Munich', 'Zaventem', 'West Malling',
       'Lille', 'Egham', 'Karlstad', 'Madrid', 'Oxford', 'Brussels',
       'Taibon', 'Epsom', 'Amsterdam', 'Spinea', 'Brindisi',
       'Boulogne-Billancourt', 'Wolfsburg', 'Nantes', 'Derby', 'Lund',
       'Garwolin', 'Stockholm', 'Massy', 'Prague', 'Middlesbrough',
       'Viladecans', 'Barcelona', 'Eindhoven', 'Warsaw', 'Budapest',
       'London', 'Hamburg', 'The Hague', 'Chappes', 'Sintra', 'Riga',
       'Coimbra', 'Tartu', 'Île-de-France', 'Issy-les-Moulineaux',
       'Hawthorn', 'Lyon', 'Valletta',

At first glance the cities look fine. The main problems are expected in the country column:

In [14]:
sorted(df['country'].unique())

['Austria',
 'Belgium',
 'Berlin Metropolitan Area',
 'Brussels Metropolitan Area',
 'Bulgaria',
 'Cologne Bonn Region',
 'Czechia',
 'Denmark',
 'Eindhoven Area',
 'Estonia',
 'France',
 'Germany',
 'Greater Banska Bystrica Area',
 'Greater Barcelona Metropolitan Area',
 'Greater Munster Area',
 'Greater Nuremberg Metropolitan Area',
 'Greater Palma de Mallorca Metropolitan Area',
 'Greater Paris Metropolitan Region',
 'Greater Pau Area',
 'Greece',
 'Hungary',
 'Iasi Metropolitan Area',
 'Ireland',
 'Italy',
 'Krakow Metropolitan Area',
 'Latvia',
 'Lithuania',
 'Luxembourg',
 'Malta',
 'Monaco',
 'Netherlands',
 'Poland',
 'Portugal',
 'Romania',
 'Rotterdam and The Hague',
 'Spain',
 'Sweden',
 'Switzerland',
 'United Kingdom',
 'Warsaw Metropolitan Area',
 'Wroclaw Metropolitan Area']

As expected, many names of metropolitan areas ended up in the country column. Let's map each of them to its country in a dictionary and clean up the data:

In [15]:
dictionary = {'Berlin Metropolitan Area' : 'Germany', 'Brussels Metropolitan Area' : 'Belgium',
'Cologne Bonn Region' : 'Germany', 'Eindhoven Area' : 'Netherlands', 'Greater Banska Bystrica Area' : 'Slovakia',
'Greater Barcelona Metropolitan Area' : 'Spain', 'Greater Munster Area' : 'Germany',
'Greater Nuremberg Metropolitan Area' : 'Germany', 'Greater Palma de Mallorca Metropolitan Area' : 'Spain',
'Greater Paris Metropolitan Region' : 'France', 'Greater Pau Area' : 'France', 'Iasi Metropolitan Area' : 'Romania',
'Krakow Metropolitan Area' : 'Poland', 'Rotterdam and The Hague' : 'Netherlands', 'Warsaw Metropolitan Area' : 'Poland', 
'Wroclaw Metropolitan Area' : 'Poland'}

df = df.replace({'country': dictionary})

In [16]:
# check the result
sorted(df['country'].unique())

['Austria',
 'Belgium',
 'Bulgaria',
 'Czechia',
 'Denmark',
 'Estonia',
 'France',
 'Germany',
 'Greece',
 'Hungary',
 'Ireland',
 'Italy',
 'Latvia',
 'Lithuania',
 'Luxembourg',
 'Malta',
 'Monaco',
 'Netherlands',
 'Poland',
 'Portugal',
 'Romania',
 'Slovakia',
 'Spain',
 'Sweden',
 'Switzerland',
 'United Kingdom']
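
If the source data change, new metropolitan-area names may appear that the dictionary does not cover. A small check against a set of expected country names can surface them. The sketch below assumes the set of countries from the output above; `unmapped_values` is a hypothetical helper name.

```python
# Country names taken from the cleaned output above.
KNOWN_COUNTRIES = {
    'Austria', 'Belgium', 'Bulgaria', 'Czechia', 'Denmark', 'Estonia',
    'France', 'Germany', 'Greece', 'Hungary', 'Ireland', 'Italy', 'Latvia',
    'Lithuania', 'Luxembourg', 'Malta', 'Monaco', 'Netherlands', 'Poland',
    'Portugal', 'Romania', 'Slovakia', 'Spain', 'Sweden', 'Switzerland',
    'United Kingdom',
}

def unmapped_values(values, known=KNOWN_COUNTRIES):
    """Return the sorted values that are not recognised country names."""
    return sorted({v for v in values if v is not None} - set(known))

sample = ['Germany', 'Eindhoven Area', 'Poland', None, 'Greater Pau Area']
print(unmapped_values(sample))
```

In the notebook the same helper could be called as `unmapped_values(df['country'].unique())`; an empty list means every value is a known country.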

Now the workplace type. A lambda is awkward when missing values raise errors, so let's write a regular function:

In [17]:
def w_type(data):
    soup = BeautifulSoup(data, 'lxml')
    try:
        x = soup.find('span', class_ = 'jobs-unified-top-card__workplace-type').text.strip()
    except AttributeError:
        # find() returned None: the tag is missing from this vacancy
        x = None
    return x

Let's apply it and check the result:

In [18]:
df['workplace_type'] = df['html'].apply(w_type)
df.head(1)

Unnamed: 0,html,job_title,city,country,workplace_type
0,"\n <div>\n <div class=""\n jobs-deta...",data analyst,Basel,Switzerland,On-site
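
Several fields in this notebook follow the same pattern: find a tag by class, take its text, and fall back to `None` when the tag is absent. The sketch below shows one way to factor that pattern out. `safe_text` is a hypothetical helper, not part of the notebook, and it uses the built-in `html.parser` only so that the example runs without `lxml`.

```python
from bs4 import BeautifulSoup

def safe_text(html, tag, class_name, parser='html.parser'):
    """Return the stripped text of the first matching tag, or None if absent."""
    node = BeautifulSoup(html, parser).find(tag, class_=class_name)
    return node.get_text(strip=True) if node is not None else None

# toy HTML fragment, made up for illustration
html = '<div><span class="jobs-unified-top-card__workplace-type"> On-site </span></div>'
print(safe_text(html, 'span', 'jobs-unified-top-card__workplace-type'))
print(safe_text(html, 'span', 'no-such-class'))
```

Checking for `None` explicitly avoids a `try`/`except` block and makes it clear which failure is expected.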


Now the company name:

In [19]:
df['company_name'] = df['html'].apply(lambda x: BeautifulSoup(x, 'lxml').find('span', 
                                                            class_ = 'jobs-unified-top-card__company-name').text.strip())
# check the result
df.head(1)

Unnamed: 0,html,job_title,city,country,workplace_type,company_name
0,"\n <div>\n <div class=""\n jobs-deta...",data analyst,Basel,Switzerland,On-site,PharmiWeb.Jobs: Global Life Science Jobs


Now the company size (number of employees):

In [20]:
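def emp_qvt(data):
    soup = BeautifulSoup(data, 'lxml')
    try:
        # the company-size line is the sibling that follows the first job-insight item
        text = soup.find('li', class_ = 'jobs-unified-top-card__job-insight').find_next_sibling().text.strip()
        # matches ranges such as '11-50' or '1,001-5,000' and open-ended values such as '10,001+'
        x = ''.join(re.findall(r'[\d,]+(?:-[\d,]+)+|\d+(?:,[\d+]+)', text))
        if x == '':
            x = None
    except AttributeError:
        # the tag or its sibling is missing
        x = None
    return x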

In [21]:
df['employees_qvt'] = df['html'].apply(emp_qvt)

# check the result
df.head(1)

Unnamed: 0,html,job_title,city,country,workplace_type,company_name,employees_qvt
0,"\n <div>\n <div class=""\n jobs-deta...",data analyst,Basel,Switzerland,On-site,PharmiWeb.Jobs: Global Life Science Jobs,11-50


In [22]:
df['employees_qvt'].unique()

array(['11-50', None, '501-1,000', '51-200', '10,001+', '1,001-5,000',
       '201-500', '5,001-10,000', '1-10'], dtype=object)

As we can see, there are 8 company-size categories in total (plus missing values). The data are parsed with the `object` dtype, which fully suits us given the tasks set out in the brief.
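
Because the size categories are stored as strings, a plain sort orders them lexicographically, which would put `'1,001-5,000'` before `'11-50'`. If a natural order is needed, for example for a chart axis, the lower bound of each range can serve as a sort key. This is a minimal sketch; `size_sort_key` is a hypothetical helper name.

```python
import re

def size_sort_key(size):
    """Sort key: lower bound of a size range such as '1,001-5,000' or '10,001+'."""
    lower = size.split('-')[0]
    return int(re.sub(r'\D', '', lower))

sizes = ['11-50', '501-1,000', '51-200', '10,001+', '1,001-5,000',
         '201-500', '5,001-10,000', '1-10']
ordered = sorted(sizes, key=size_sort_key)
print(ordered)
```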

The company's field of activity is in the same tag, after the '·' separator. Let's use this to extract it:

In [23]:
def activity_field(data):
    soup = BeautifulSoup(data, 'lxml')
    try:
        x = soup.find('li', class_ = 'jobs-unified-top-card__job-insight').find_next_sibling().text.strip().split('·', 1)[1].strip()
    except (AttributeError, IndexError):
        # the tag is missing or the line has no '·' separator
        x = None
    return x

In [24]:
df['activity_field'] = df['html'].apply(activity_field)

# check the result
df.head(1)

Unnamed: 0,html,job_title,city,country,workplace_type,company_name,employees_qvt,activity_field
0,"\n <div>\n <div class=""\n jobs-deta...",data analyst,Basel,Switzerland,On-site,PharmiWeb.Jobs: Global Life Science Jobs,11-50,Staffing and Recruiting
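
The separator logic can be illustrated on the plain text of that line. The sketch below assumes the size part comes first and contains the word `employees`; `split_insight` is a hypothetical name and the sample strings are modelled on the row shown above.

```python
def split_insight(text):
    """Split '11-50 employees · Staffing and Recruiting' into (size, field)."""
    left, _, right = text.partition('·')
    # assumption: the left part is a company size only if it mentions 'employees'
    size = left.replace('employees', '').strip() if 'employees' in left else None
    field = right.strip() or None
    return size, field

print(split_insight('11-50 employees · Staffing and Recruiting'))
print(split_insight('10,001+ employees'))
```

`str.partition` returns an empty right-hand side when the separator is missing, so no exception handling is needed for lines without a field of activity.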


Let's find the number of applicants for the vacancy:

In [25]:
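def applicants_qvt(data):
    soup = BeautifulSoup(data, 'lxml')
    try:
        # keep only the digits, e.g. '47 applicants' -> 47
        x = int(''.join(filter(str.isdigit, soup.find('span', class_ = 'jobs-unified-top-card__applicant-count').text.strip())))
    except (AttributeError, ValueError):
        # the tag is missing or contains no digits
        x = None
    return x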

In [26]:
df['applicants_qvt'] = df['html'].apply(applicants_qvt)

# check the result
df.head(1)

Unnamed: 0,html,job_title,city,country,workplace_type,company_name,employees_qvt,activity_field,applicants_qvt
0,"\n <div>\n <div class=""\n jobs-deta...",data analyst,Basel,Switzerland,On-site,PharmiWeb.Jobs: Global Life Science Jobs,11-50,Staffing and Recruiting,47.0
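
The digit filter is easy to check on plain strings. In the sketch below `parse_applicants` is a hypothetical name and the sample strings are invented for illustration. Note that any number in the text is read as the count, so a phrase that mentions a threshold rather than an exact figure would also be taken literally.

```python
def parse_applicants(text):
    """Return the digits found in the text as an int, or None if there are none."""
    digits = ''.join(ch for ch in text if ch.isdigit())
    return int(digits) if digits else None

print(parse_applicants('47 applicants'))
print(parse_applicants('1,234 applicants'))
print(parse_applicants('no count shown'))
```

The value appears as `47.0` in the table above because a column that mixes integers and missing values is stored by pandas as floating point.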


Let's compute the vacancy publication date, given that the source file is dated 23 May 2023:

In [27]:
SNAPSHOT_DATE = datetime.date(2023, 5, 23)  # date of the source file

def date(data):
    soup = BeautifulSoup(data, 'lxml')
    x = soup.find('span', class_ = 'jobs-unified-top-card__posted-date').text.strip().split()
    if 'minutes' in x:
        # posted minutes ago: same day as the snapshot
        return SNAPSHOT_DATE
    elif 'day' in x or 'days' in x:
        return SNAPSHOT_DATE - datetime.timedelta(days=int(x[0]))
    elif 'week' in x or 'weeks' in x:
        return SNAPSHOT_DATE - datetime.timedelta(weeks=int(x[0]))
    # any other unit (for example hours or months) is not handled and yields None
    return None

In [28]:
df['date'] = df['html'].apply(date)

# check the result
df.head(1)

Unnamed: 0,html,job_title,city,country,workplace_type,company_name,employees_qvt,activity_field,applicants_qvt,date
0,"\n <div>\n <div class=""\n jobs-deta...",data analyst,Basel,Switzerland,On-site,PharmiWeb.Jobs: Global Life Science Jobs,11-50,Staffing and Recruiting,47.0,2023-05-16
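
The conversion from a relative age to a date can be sketched on plain text with a regular expression. This is an illustrative variant, not the function used above: `posted_date` and `UNIT_DAYS` are hypothetical names, hours and minutes are mapped to the snapshot day, and treating a month as 30 days is an assumption.

```python
import datetime
import re

SNAPSHOT = datetime.date(2023, 5, 23)  # date of the source file, as stated above

# days per unit; treating a month as 30 days is an assumption
UNIT_DAYS = {'minute': 0, 'hour': 0, 'day': 1, 'week': 7, 'month': 30}

def posted_date(text, snapshot=SNAPSHOT):
    """Convert text such as '2 weeks ago' into an approximate date, or None."""
    match = re.search(r'(\d+)\s+(minute|hour|day|week|month)s?\b', text)
    if match is None:
        return None
    count, unit = int(match.group(1)), match.group(2)
    return snapshot - datetime.timedelta(days=count * UNIT_DAYS[unit])

print(posted_date('1 week ago'))
print(posted_date('5 hours ago'))
```

Searching for the number and the unit together means that extra leading words in the text do not break the conversion.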


Let's try to extract the required hard skills from the vacancy text. To do this, we create a list of the most in-demand skills that occur in vacancies and write a function that extracts them from the job description:

In [29]:
skills = ['a/b testing', 'ab testing', 'actian', 'adobe analytics', 'adobe audience manager',
    'adobe experience platform', 'adobe launch', 'adobe target', 'ai', 'airflow',
    'alooma', 'alteryx', 'amazon machine learning', 'amazon web services', 'aml',
    'amplitude', 'ansible', 'apache camel', 'apache nifi', 'apache spark',
    'api', 'asana', 'auth0', 'aws', 'aws glue', 'azure', 'azure data factory',
    'basecamp', 'bash', 'beats', 'big query', 'bigquery', 'birst', 'bitbucket',
    'blendo', 'bootstrap', 'business objects bi', 'c#', 'c++', 'caffe', 'cassandra',
    'cdata sync', 'chronograf', 'ci/cd', 'cicd', 'clickhouse', 'cloudera', 'cluvio',
    'cntk', 'cognos', 'composer', 'computer vision', 'conda', 'confluence',
    'couchbase', 'css', 'd3.js', 'dash', 'dashboard', 'data factory', 'data fusion',
    'data mining', 'data studio', 'data warehouse', 'databricks', 'dataddo',
    'dataflow', 'datahub', 'dataiku', 'datastage', 'dbconvert', 'dbeaver', 'dbt',
    'deep learning', 'dl/ml', 'docker', 'domo', 'dune', 'dv360', 'dynamodb',
    'elasticsearch', 'elt', 'erwin', 'etl', 'etleap', 'excel', 'facebook business manager',
    'fivetran', 'fuzzy', 'ga360', 'gcp', 'gensim', 'ggplot', 'git', 'github', 'gitlab',
    'google ads', 'google analytics', 'google cloud platform', 'google data flow',
    'google optimize', 'google sheets', 'google tag manager', 'google workspace',
    'grafana', 'hadoop', 'hana', 'hanagrafana', 'hbase', 'hdfs', 'hevo data', 'hightouch',
    'hive', 'hivedatabricks', 'html', 'hubspot', 'ibm coremetrics', 'inetsoft',
    'influxdb', 'informatica', 'integrate.io', 'iri voracity', 'izenda', 'java',
    'java script', 'javascript', 'jenkins', 'jira', 'jmp', 'julia', 'jupyter',
    'k2view', 'kafka', 'kantar', 'kapacitor', 'keras', 'kibana', 'kubernetes',
    'lambda', 'linux', 'logstash', 'looker', 'lstm', 'luidgi', 'matillion', 'matlab',
    'matplotlib', 'mendix', 'metabase', 'microsoft sql', 'microsoft sql server',
    'microstrategy', 'miro', 'mixpanel', 'ml', 'ml flow', 'mlflow', 'mongodb', 'mxnet',
    'mysql', 'natural language processing', 'neo4j', 'nlp', 'nltk', 'nosql', 'numpy',
    'oauth', 'octave', 'omniture', 'omnituregitlab', 'openshift', 'openstack',
    'optimizely', 'oracle', 'oracle business intelligence', 'oracle data integrator',
    'pandas', 'panorama', 'pentaho', 'plotly', 'postgre', 'postgresql', 'posthog',
    'power amc', 'power bi', 'power point', 'powerbi', 'powerpivot', 'powerpoint',
    'powerquery', 'pyspark', 'python', 'pytorch', 'pytorchhevo data', 'qlik',
    'qlik sense', 'qlikview', 'querysurge', 'r', 'raphtory', 'rapidminer', 'redash',
    'redis', 'redshift', 'retool', 'rivery', 'rust', 's3', 'sa360', 'salesforce', 'sap',
    'sap business objects', 'sas', 'sas visual analytics', 'scala', 'scikit-learn',
    'scipy', 'seaborn', 'segment', 'selenium', 'sem rush', 'semrush', 'shell', 'shiny',
    'singer', 'sisense', 'skyvia', 'snowflake', 'spacy', 'spark', 'sparkml', 'splunk',
    'spotfire', 'spreadsheet', 'spss', 'sql', 'ssis', 'sssr', 'stambia', 'statistics',
    'statsbot', 'stitch', 'streamlit', 'streamsets', 'svn', 't-sql', 'tableau', 'talend',
    'targit', 'tealium', 'telegraf', 'tensorflow', 'terraapi', 'terraform', 'theano',
    'thoughtspot', 'timeseries', 'trello', 'unix', 'vba', 'vtom', 'webfocus', 'wfh',
    'xplenty', 'xtract.io', 'yellowfin']

In [30]:
def skills_finder(cells, skill_list_2=skills):
    soup = BeautifulSoup(cells, 'lxml')
    # lower-case the description once instead of on every iteration
    cell = soup.find('div', class_ = 'jobs-box__html-content jobs-description-content__text t-14 t-normal jobs-description-content__text--stretch').text.lower()
    matched_skills_list = []
    for i in skill_list_2:
        if i == 'c++':
            # 'c++' ends in non-word characters, so require a non-word character on both sides
            if re.search(r'\Wc\+\+\W', cell):
                matched_skills_list.append(i)
        else:
            # match the skill as written or with its spaces removed ('power bi' / 'powerbi'),
            # delimited by a word boundary or a non-word character
            pattern = (
                r'(\b|\W)' + re.escape(i) + r'(\b|\W)'
                + '|'
                + r'(\b|\W)' + re.escape(i.replace(' ', '')) + r'(\b|\W)'
            )
            if re.search(pattern, cell):
                matched_skills_list.append(i)
    return matched_skills_list
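
The delimiters matter because short skill names such as `r` or `sql` occur inside longer words. The sketch below shows the idea on plain text. It swaps in lookaround assertions (`(?<!\w)` and `(?!\w)`) instead of the pattern used above, which lets one rule cover names like `c++` as well; `find_skills` is a hypothetical name and the sample sentence is invented.

```python
import re

def find_skills(text, skills):
    """Return the skills that occur in the text as standalone tokens."""
    text = text.lower()
    found = []
    for skill in skills:
        # not preceded and not followed by a word character
        pattern = r'(?<!\w)' + re.escape(skill) + r'(?!\w)'
        if re.search(pattern, text):
            found.append(skill)
    return found

sample = 'Experience with MySQL, R and C++ required'
print(find_skills(sample, ['sql', 'mysql', 'r', 'c++', 'excel']))
```

Here `sql` is not reported, because it only appears inside `mysql`, and the `r` in `required` is ignored because it is followed by another letter.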

In [31]:
df['hard_skills'] = df['html'].apply(skills_finder)

In [32]:
df.head(1)

Unnamed: 0,html,job_title,city,country,workplace_type,company_name,employees_qvt,activity_field,applicants_qvt,date,hard_skills
0,"\n <div>\n <div class=""\n jobs-deta...",data analyst,Basel,Switzerland,On-site,PharmiWeb.Jobs: Global Life Science Jobs,11-50,Staffing and Recruiting,47.0,2023-05-16,"[data mining, excel, sap, sas, spss, sql, stat..."


Finishing touches: we drop the `html` column, since we no longer need it:

In [33]:
df.drop('html', axis=1, inplace=True)
df.head(1)

Unnamed: 0,job_title,city,country,workplace_type,company_name,employees_qvt,activity_field,applicants_qvt,date,hard_skills
0,data analyst,Basel,Switzerland,On-site,PharmiWeb.Jobs: Global Life Science Jobs,11-50,Staffing and Recruiting,47.0,2023-05-16,"[data mining, excel, sap, sas, spss, sql, stat..."


In [34]:
df.columns

Index(['job_title', 'city', 'country', 'workplace_type', 'company_name',
       'employees_qvt', 'activity_field', 'applicants_qvt', 'date',
       'hard_skills'],
      dtype='object')

Parsing is complete. We now move on to preparing the data for visualization. Let's check the dataset for duplicates. Since the last column holds lists, which are not hashable, we exclude it from the check:

In [35]:
df.duplicated(subset=['job_title', 'city', 'country', 'workplace_type', 'company_name', 'employees_qvt', 'activity_field',
                      'applicants_qvt', 'date']).sum()

66

Let's inspect the duplicates visually:

In [36]:
df[df.duplicated(subset=['job_title', 'city', 'country', 'workplace_type', 'company_name', 'employees_qvt', 'activity_field',
                      'applicants_qvt', 'date'], keep=False)].head(5)

Unnamed: 0,job_title,city,country,workplace_type,company_name,employees_qvt,activity_field,applicants_qvt,date,hard_skills
1,data analyst - logistics,Coventry,United Kingdom,On-site,Resolute Recruitment,,,,2023-05-16,[]
2,data analyst - logistics,Coventry,United Kingdom,On-site,Resolute Recruitment,,,,2023-05-16,[wfh]
221,data analyst (m/w/d),,Germany,On-site,Charisma-Tec GmbH,1-10,Human Resources Services,21.0,2023-05-18,"[excel, python, sql]"
222,data analyst (m/w/d),,Germany,On-site,Charisma-Tec GmbH,1-10,Human Resources Services,21.0,2023-05-18,"[excel, python, sql]"
223,data analyst (m/w/d),,Germany,On-site,Charisma-Tec GmbH,1-10,Human Resources Services,21.0,2023-05-18,"[excel, python, sql]"


Duplicates are indeed present. We assume that a company posts a single advertisement per position, even if it is hiring several people for it. We therefore remove the duplicates found:

In [37]:
df.drop_duplicates(subset=['job_title', 'city', 'country', 'workplace_type', 'company_name', 'employees_qvt', 'activity_field',
                      'applicants_qvt', 'date'], inplace=True)
# reset the index
df.reset_index(drop=True, inplace=True)

# check the result
df.duplicated(subset=['job_title', 'city', 'country', 'workplace_type', 'company_name', 'employees_qvt', 'activity_field',
                      'applicants_qvt', 'date']).sum()

0

The duplicates have been removed.\
Now let's turn the `hard_skills` lists into table rows. We keep `index` as the identifier for the subsequent Power BI dashboard:

In [38]:
df_explode = df.explode('hard_skills')
df_explode.head(10)

Unnamed: 0,job_title,city,country,workplace_type,company_name,employees_qvt,activity_field,applicants_qvt,date,hard_skills
0,data analyst,Basel,Switzerland,On-site,PharmiWeb.Jobs: Global Life Science Jobs,11-50,Staffing and Recruiting,47.0,2023-05-16,data mining
0,data analyst,Basel,Switzerland,On-site,PharmiWeb.Jobs: Global Life Science Jobs,11-50,Staffing and Recruiting,47.0,2023-05-16,excel
0,data analyst,Basel,Switzerland,On-site,PharmiWeb.Jobs: Global Life Science Jobs,11-50,Staffing and Recruiting,47.0,2023-05-16,sap
0,data analyst,Basel,Switzerland,On-site,PharmiWeb.Jobs: Global Life Science Jobs,11-50,Staffing and Recruiting,47.0,2023-05-16,sas
0,data analyst,Basel,Switzerland,On-site,PharmiWeb.Jobs: Global Life Science Jobs,11-50,Staffing and Recruiting,47.0,2023-05-16,spss
0,data analyst,Basel,Switzerland,On-site,PharmiWeb.Jobs: Global Life Science Jobs,11-50,Staffing and Recruiting,47.0,2023-05-16,sql
0,data analyst,Basel,Switzerland,On-site,PharmiWeb.Jobs: Global Life Science Jobs,11-50,Staffing and Recruiting,47.0,2023-05-16,statistics
1,data analyst - logistics,Coventry,United Kingdom,On-site,Resolute Recruitment,,,,2023-05-16,
2,data analyst (space & planning),South Molton,United Kingdom,On-site,Mole Valley Farmers,,,,2023-05-16,excel
3,data analyst,Lugano,Switzerland,On-site,FORFIRM,,,,2023-05-09,aws


Now let's merge the duplicate skill spellings that we deliberately included in the search list:

In [39]:
df_explode['hard_skills'] = df_explode['hard_skills'].replace({'powerbi': 'power bi', 'powerpoint': 'power point', 'javascript': 'java script'})
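
One thing to watch: the search pattern also matches each skill with its spaces removed, so a description containing `powerbi` is matched by both `power bi` and `powerbi`. After the replacement above, such a vacancy can carry the same skill twice. A possible remedy, sketched below under that assumption, is to normalize and de-duplicate each list before exploding it; `normalize_skills` is a hypothetical helper name.

```python
# same mapping as in the replacement above
SYNONYMS = {'powerbi': 'power bi', 'powerpoint': 'power point', 'javascript': 'java script'}

def normalize_skills(skills, synonyms=SYNONYMS):
    """Map alternative spellings to one form and drop repeats, keeping order."""
    canonical = [synonyms.get(s, s) for s in skills]
    return list(dict.fromkeys(canonical))

print(normalize_skills(['excel', 'power bi', 'powerbi', 'sql']))
```

In the notebook this could be applied as `df['hard_skills'] = df['hard_skills'].apply(normalize_skills)` before calling `df.explode('hard_skills')`, so that each vacancy contributes every skill at most once.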

The dataset for building the dashboard is now ready. All that remains is to save it to the local disk:

In [40]:
df_explode.to_csv('df_explode_linkedin.csv', sep=',')