In [220]:
import sqlite3

import pandas as pd
import pandas.io.sql as psql
import psycopg2 as pg

import esco_utils as eu

con = sqlite3.connect('data/cache.db')

# Summary of DB tables
- `jobs`: non-Arabic job vacancies from Azuna, provided to us by SkillLab
- `data/crawl.db`: Arabic jobs crawled by us from `forsana.com, bayt.com, fu1sa.com, adwhit.com`
- `data/cache.db`: ESCO tags, tasks, projects, and occupations and skills
    - `tags`: The tag table contains individual tags inserted by users.
    - `task_data`: Contains all tasks that are not cancelled. 
    - `projects` and `project_labelers`: project name and id, plus labelers in each project
    - Occupations:
        - `occupations`: contains the title, alternative titles, and description of all occupations (each mapped to an ISCO group) in all EU languages.
        - `ISCOGroups`: contains the groups of occupations, which form the tree taxonomy of occupations
    - Skills:
        - `skill`: skills data for all languages (in `lang`). Skill types: 'skill/competence', 'knowledge', 'skill/competence knowledge'
        - `skillGroups`: skill groups for all languages (in `lang`)
    - Relations: 
        - `occupationSkillRelations`: relationships between skills and occupations, as a bipartite graph
        - `skillSkillRelations`: relations between skills as a graph
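
To check which of these tables actually exist in a given SQLite file, one option is to query SQLite's built-in `sqlite_master` catalogue. The sketch below is only illustrative: `list_tables` is a hypothetical helper, and the in-memory database with two empty tables stands in for `data/cache.db`.

```python
import sqlite3

def list_tables(con):
    """Return the names of all user tables in a SQLite connection, sorted."""
    rows = con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

# In-memory stand-in; with the real file use sqlite3.connect('data/cache.db').
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE tags (task_id TEXT)")
demo.execute("CREATE TABLE skill (conceptUri TEXT)")

tables = list_tables(demo)
print(tables)
```

The same helper can be pointed at `data/crawl.db` to list the crawled job tables.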

## Skill/Occupation graph  
In each language we have the following graph:
- Vertices:
    - skill
        - skill/competence/knowledge
            - skill/competence
            - knowledge
        - skillGroup
    - occupation
        - occupation
        - occupationGroup
- Edges
    - non-hierarchical:
        - occupation-skill
            - essential: e.g. software engineer (occ) -> programming (skill) is essential
            - optional: e.g. software engineer (occ) -> linear algebra (skill) is optional
        - skill-skill
            - essential
            - optional
    - hierarchical
        - occupation -> broader occupation (parent)
        - skill -> broader skill (parent) 
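
As a rough illustration of this structure, the typed edges can be held in a nested dictionary keyed by source vertex and edge kind. This is a toy sketch, not the representation used in the project: the two occupation-skill edges come from the examples above, while the `broader` edge and its group name are made up for the example.

```python
from collections import defaultdict

# (source, target, kind) triples; kind is 'essential', 'optional' or 'broader'.
edges = [
    ("software engineer", "programming", "essential"),
    ("software engineer", "linear algebra", "optional"),
    # Hypothetical hierarchical edge, for illustration only.
    ("software engineer", "toy occupation group", "broader"),
]

# graph[source][kind] -> list of target vertices
graph = defaultdict(lambda: defaultdict(list))
for source, target, kind in edges:
    graph[source][kind].append(target)

essential = graph["software engineer"]["essential"]
optional = graph["software engineer"]["optional"]
print(essential, optional)
```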

# Jobs 
### Azuna
All job postings _except Arabic_ are in a PostgreSQL DB running at `localhost:5678`, in table `jobs`.
You can retrieve the database connection using the helper function `con, _ = eu.load_skillLab_DB()`. The important columns in this table are:
- `title`: title of the job
- `description`: description of the job posting
- `location_country`: country where the job was posted, e.g. `NL` for the Netherlands or `GB` for Great Britain

In [227]:
jobs_con, _ = eu.load_skillLab_DB()

psql.read_sql("""
        SELECT title, description, location_country
        FROM jobs 
        WHERE location_country='GB' 
        LIMIT 3
        """, jobs_con)

Unnamed: 0,title,description,location_country
0,Trainee Personal Trainer at PureGym - Galashie...,Trainee Personal Trainer at PureGym Guaranteed...,GB
1,Become a Level 3 qualified Personal Trainer - ...,Become a Level 3 qualified Personal Trainer Al...,GB
2,Become a Level 3 qualified Personal Trainer - ...,Become a Level 3 qualified Personal Trainer Al...,GB


### `fu1sa_proc`
`9915` jobs crawled from `fu1sa.com`. Columns: 
- `title` and `description`: title and description of the job 
- `date` and `place`: date and location at which the job was posted
- `views`: number of times the job was viewed

In [226]:
crawl_db = sqlite3.connect('data/crawl.db')
pd.read_sql("SELECT * FROM fu1sa_proc LIMIT 3", crawl_db)

Unnamed: 0,post,title,views,date,place,description
0,24,اعلان موعد الاختبار التحريري للمتقدمين على وظا...,165,29/07/2019 - 08:12 AM,مدينة : الخرج,تعلن الهيئة العامة للأصاد وحماية البيئة عن تحد...
1,25,سدافكو توفّر وظائف شاغرة لحملة الثانوية العامة...,80,30/07/2019 - 09:52 AM,مدينة : الخرج,أعلنت الشركة السعودية سدافكو (حليب السعودية) ع...
2,27,اعلان وظائف على البنود في الشؤون الصحية بالقريات,69,31/07/2019 - 07:15 AM,مدينة : الخرج,أعلنت المديرية العامة للشؤون الصحية بمحافظة ال...
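
The `date` and `place` values are stored as raw strings. A minimal sketch for normalising them follows, assuming every row uses the `DD/MM/YYYY - HH:MM AM/PM` layout and the `label : value` layout seen in the three sample rows; `parse_posted_at` and `strip_place_prefix` are hypothetical helper names.

```python
from datetime import datetime

def parse_posted_at(raw):
    """Parse strings such as '29/07/2019 - 08:12 AM' into a datetime."""
    return datetime.strptime(raw.strip(), "%d/%m/%Y - %I:%M %p")

def strip_place_prefix(raw):
    """Keep the text after the first ':'; return the input unchanged if there is none."""
    _, sep, value = raw.partition(":")
    return value.strip() if sep else raw.strip()

posted = parse_posted_at("29/07/2019 - 08:12 AM")
place = strip_place_prefix("مدينة : الخرج")
print(posted.isoformat(), place)
```

Rows that deviate from the assumed layout will raise `ValueError` in `parse_posted_at`, which makes format drift easy to spot.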


### `adwhit_proc`
`22594` jobs crawled from `adwhit.com`. Columns:
- `title` and `description`: the title and description of the job posting
- `meta`: unprocessed meta-data, containing date, company name, and place

In [228]:
crawl_db = sqlite3.connect('data/crawl.db')
pd.read_sql("SELECT * FROM adwhit_proc LIMIT 3", crawl_db)

Unnamed: 0,url,meta,title,description
0,/مطلوب-انسات-لوظيفة-مدخل-بيانات,الشركة المقدمة للعمل:\nadwhit\nتاريخ النشر: 18...,مطلوب انسات لوظيفة مدخل بيانات,"( الرجاء قراءة الاعلان كاملا )\nمطلوب انسات ""ح..."
1,/مطلوب-موظفة-كول-سنتر-لشركة-سياحية-في-تقسيم-اس...,الشركة المقدمة للعمل:\nRoyal mark group\nتاريخ...,مطلوب موظفة كول سنتر لشركة سياحية في تقسيم اسط...,السلام عليكم مطلوب موظفة كول سنتر لشركة سياحية...
2,/مطلوب-video-grapher-لشركة-في-اسطنبول-32240,الشركة المقدمة للعمل:\nAKD İNVEST\nتاريخ النشر...,مطلوب Video Grapher لشركة في اسطنبول,Nitelikler ve İş Tanımı\nÖnde gelen bir emlak ...


# Tasks and Tags

### `tags` table 
If annotator `A1` confirms tags `1,2,3` in response to `task1`, they appear as three rows.
Important columns:
- `task_id`: the task
- `occupation_id` & `occupation_title`: the id & title of the occupation tag chosen by the user
- `inserted_at` & `label_time`: insertion timestamp and labeling time of each tag

In [162]:
pd.read_sql("SELECT * FROM tags LIMIT 1",con)

Unnamed: 0,index,inserted_at,label_time,labeler_id,lang,task_id,occupation_id,occupation_title
0,0,"Wed, 12 Jan 2022 14:04:25 GMT",8347,61dda30da33565ec1ce31aab,,61dde40c527776b760a85fa0,1864,sales assistant
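
Because each confirmed tag is its own row, agreement between annotators can be estimated by counting distinct labelers per task and occupation. The following is a sketch on toy data that reuses the column names from the output above; the rows themselves are invented.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE tags (task_id TEXT, labeler_id TEXT, "
    "occupation_id INTEGER, occupation_title TEXT)"
)
# Toy rows: A1 confirms tags 1, 2, 3 for task1; A2 confirms only tag 1.
demo.executemany(
    "INSERT INTO tags VALUES (?, ?, ?, ?)",
    [
        ("task1", "A1", 1, "occ 1"),
        ("task1", "A1", 2, "occ 2"),
        ("task1", "A1", 3, "occ 3"),
        ("task1", "A2", 1, "occ 1"),
    ],
)

# Number of distinct labelers who chose each occupation for each task.
agreement = demo.execute(
    """
    SELECT task_id, occupation_id, COUNT(DISTINCT labeler_id) AS n_labelers
    FROM tags
    GROUP BY task_id, occupation_id
    ORDER BY n_labelers DESC, occupation_id
    """
).fetchall()
print(agreement)
```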


### `task_data` 
Important columns: 
- `project_id` and `project_name`: id & name of the project for this task
- `_id`: the task id 
- `title` and `description`: the title and description of the task. 

In [163]:
pd.read_sql("SELECT * FROM task_data LIMIT 1",con)

Unnamed: 0,project_name,project_id,_id,created_at,status,task-type,total_labels,updated_at,description,title
0,United Kingdom,61ddbe23527776b760a84bdd,61dddc3e527776b760a85c18,"Tue, 11 Jan 2022 19:36:30 GMT",pending,esco-text-tagging,0,"Tue, 11 Jan 2022 19:36:30 GMT",We're pleased to announce that SThree is looki...,Credit Controller - DACH
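
Since `tags.task_id` refers to the task and `task_data._id` is the task id, the two tables can presumably be joined on those columns. The sketch below uses toy rows and a `LEFT JOIN` so that tasks without any tag still appear; the join key is an assumption based on the column descriptions.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE task_data (_id TEXT, title TEXT, project_name TEXT)")
demo.execute("CREATE TABLE tags (task_id TEXT, occupation_title TEXT)")
demo.executemany(
    "INSERT INTO task_data VALUES (?, ?, ?)",
    [
        ("t1", "Credit Controller - DACH", "United Kingdom"),
        ("t2", "Toy untagged task", "United Kingdom"),
    ],
)
demo.execute("INSERT INTO tags VALUES ('t1', 'toy occupation tag')")

# One row per (task, tag); untagged tasks get NULL for the tag title.
tagged_tasks = demo.execute(
    """
    SELECT t._id, t.title, g.occupation_title
    FROM task_data AS t
    LEFT JOIN tags AS g ON g.task_id = t._id
    ORDER BY t._id
    """
).fetchall()
print(tagged_tasks)
```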


### `projects` and `project_labelers`
Project name and id, plus the labelers in each project.

In [164]:
pd.read_sql("SELECT * FROM projects LIMIT 1",con)

Unnamed: 0,created_at,icon_id,labels_per_task,model_id,project_id,project_name,updated_at,lang,total_complete_tasks,total_labels,total_tasks,total_tasks_labeled
0,"Tue, 11 Jan 2022 14:14:51 GMT",61dd99d3527776b760a8497a,3,7693d764-72e5-11ec-83e4-0242c0a8d007,61dd90dbce271774889fbc02,GB,"Tue, 11 Jan 2022 14:14:51 GMT",en,0,1,300,0


In [165]:
pd.read_sql("SELECT * FROM project_labelers LIMIT 1",con)

Unnamed: 0,project_id,labeler_status,_id,email
0,61dd90dbce271774889fbc02,active_labelers,,ari@gmail.com
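
Both tables share a `project_id` column, so counting labelers per project is a simple join. A sketch with made-up rows follows; it counts `email` because `_id` is empty in the sample row above, which is an assumption about how labelers are best identified.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE projects (project_id TEXT, project_name TEXT)")
demo.execute(
    "CREATE TABLE project_labelers (project_id TEXT, labeler_status TEXT, email TEXT)"
)
demo.executemany("INSERT INTO projects VALUES (?, ?)", [("p1", "GB"), ("p2", "NL")])
demo.executemany(
    "INSERT INTO project_labelers VALUES (?, ?, ?)",
    [
        ("p1", "active_labelers", "a@example.com"),
        ("p1", "active_labelers", "b@example.com"),
    ],
)

# LEFT JOIN keeps projects that have no labelers yet (count 0).
labelers_per_project = demo.execute(
    """
    SELECT p.project_name, COUNT(l.email) AS n_labelers
    FROM projects AS p
    LEFT JOIN project_labelers AS l ON l.project_id = p.project_id
    GROUP BY p.project_id, p.project_name
    ORDER BY p.project_name
    """
).fetchall()
print(labelers_per_project)
```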


# Occupations 

### `occupations`: title, description, and ISCO codes in all languages
Important columns:
- `occupation_id`: same `occupation_id` used in the `tags` table
- `iscoGroup`: the ISCO group of this occupation, same as `code` in the `ISCOGroups` table
- `preferredLabel`: title
- `altLabels`: alternative titles, separated by `\n`
- `description`: description
- `lang`: the language for the record
- `conceptUri`: the unique identifier of this occupation  

In [276]:
lang = 'en'
pd.read_sql(f"SELECT * FROM occupations WHERE lang='{lang}' LIMIT 1",con)

Unnamed: 0,conceptType,conceptUri,iscoGroup,preferredLabel,altLabels,hiddenLabels,status,modifiedDate,regulatedProfessionNote,scopeNote,definition,inScheme,description,lang,esco_version,occupation_id,external_id
0,Occupation,http://data.europa.eu/esco/occupation/00030d09...,2166,technical director,technical and operations director\nhead of tec...,,released,2016-07-05T13:58:41Z,http://data.europa.eu/esco/regulated-professio...,,,http://data.europa.eu/esco/concept-scheme/occu...,Technical directors realise the artistic visio...,en,v1.0.3,2,http://data.europa.eu/esco/occupation/00030d09...
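
Since `altLabels` packs several titles into one newline-separated string, it usually has to be split before matching against job titles. A minimal sketch, assuming the separator is a real newline character and that missing values arrive as `None` or an empty string; `split_alt_labels` is a hypothetical helper and the input is a toy string.

```python
def split_alt_labels(alt_labels):
    """Split a newline-separated altLabels value into a clean list of titles."""
    if not alt_labels:
        return []
    return [label.strip() for label in alt_labels.split("\n") if label.strip()]

labels = split_alt_labels("label one\nlabel two\n")
print(labels)
```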


### `ISCOGroups`: the groups of occupations 
Columns: 
- `conceptUri`: the unique identifier of this resource
- `code`: code of this group, corresponds to the `iscoGroup` column in the `occupations` table
- `preferredLabel`: the title 
- `altLabels`: the alternative title

__Hierarchy__: the hierarchy of ISCO groups is preserved in their `code`: a group's parent has the same code with the last digit removed. Example: the ISCO group with code `22` is "Health professionals", which is the parent of "Medical doctors" with code `221`, which is in turn the parent of "Specialist medical practitioners" with code `2212`.
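
Because the parent of a group is encoded as a prefix of its code, the chain of ancestors can be derived from the code alone. The sketch below assumes codes are handled as strings; if they are stored as integers, leading zeros (as in the armed forces groups) would be lost and need restoring first. `isco_ancestors` is a hypothetical helper name.

```python
def isco_ancestors(code):
    """Return the ancestor codes of an ISCO group code, nearest parent first."""
    code = str(code)
    # Drop one trailing digit at a time: '2212' -> '221' -> '22' -> '2'.
    return [code[:i] for i in range(len(code) - 1, 0, -1)]

ancestors = isco_ancestors("2212")
print(ancestors)  # ['221', '22', '2']
```

In SQL the reverse lookup, all descendants of a group, can be written as a prefix match such as `code LIKE '22%'`.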

In [283]:
pd.read_sql(f"SELECT * FROM ISCOGroups WHERE lang='{lang}' LIMIT 1",con)

Unnamed: 0,conceptType,conceptUri,code,preferredLabel,altLabels,inScheme,description,lang,esco_version
0,ISCOGroup,http://data.europa.eu/esco/isco/C0,0,Armed forces occupations,,http://data.europa.eu/esco/concept-scheme/occu...,Armed forces occupations include all jobs held...,en,v1.0.3


# Skills
### `skill` table
There are `13502` skills in each of the EU languages. 
Important columns: 
- `conceptUri`: the unique identifier of this skill 
- `preferredLabel`, `altLabels`, `description`: main title, alternative titles, and description of this skill
- `skillType`: 'skill/competence', 'knowledge', or both ('skill/competence knowledge')
- `lang`: the language 

In [248]:
pd.read_sql(f"SELECT * FROM skill WHERE lang='{lang}' LIMIT 1",con)

Unnamed: 0,conceptType,conceptUri,preferredLabel,altLabels,hiddenLabels,status,modifiedDate,scopeNote,inScheme,description,lang,skillType,reuseLevel,definition,originalSkillUri,originalSkillType,relationType,relatedSkillType,relatedSkillUri,esco_version
0,KnowledgeSkillCompetence,http://data.europa.eu/esco/skill/0005c151-5b5a...,manage musical staff,manage staff of music\ncoordinate duties of mu...,,released,2016-12-20T17:43:43Z,,http://data.europa.eu/esco/concept-scheme/skil...,Assign and manage staff tasks in areas such as...,en,skill/competence,sector-specific,,,,,,,v1.0.3
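
To see how skills are distributed over `skillType` for one language, a grouped count is enough. The sketch below runs on invented rows and passes the language as a bound parameter (`?`) rather than formatting it into the SQL string, which is a design choice of this example, not something the notebook does.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE skill (conceptUri TEXT, lang TEXT, skillType TEXT)")
demo.executemany(
    "INSERT INTO skill VALUES (?, ?, ?)",
    [
        ("toy:skill/1", "en", "skill/competence"),
        ("toy:skill/2", "en", "knowledge"),
        ("toy:skill/3", "en", "knowledge"),
        ("toy:skill/1", "de", "skill/competence"),
    ],
)

lang = "en"
counts = demo.execute(
    """
    SELECT skillType, COUNT(*) AS n_skills
    FROM skill
    WHERE lang = ?
    GROUP BY skillType
    ORDER BY skillType
    """,
    (lang,),
).fetchall()
print(counts)
```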


### `skillGroups`: groups of skills  

In [250]:
pd.read_sql(f"SELECT * FROM skillGroups WHERE lang='{lang}' LIMIT 1",con)

Unnamed: 0,conceptType,conceptUri,preferredLabel,altLabels,hiddenLabels,status,modifiedDate,scopeNote,inScheme,description,lang,esco_version
0,SkillGroup,http://data.europa.eu/esco/skill/25a26ff6-af18...,values,scruples\nbeliefs\nmorals,,,2017-02-14T12:01:25Z,Use for personal principles of behaviour. For ...,http://data.europa.eu/esco/concept-scheme/skil...,"Principles or standards of behaviour, revealin...",en,v1.0.3


# Relations

### `occupationSkillRelations`: occupation & skills graph
There are `114402` edges between the `occupations` and `skill` tables. Important columns:
- `occupationUri`: corresponds to `conceptUri` in the `occupations` table
- `skillUri`: corresponds to `conceptUri` in the `skill` table
- `relationType`: can be either 'essential' or 'optional'

In [254]:
pd.read_sql("SELECT * FROM occupationSkillRelations LIMIT 1", con)

Unnamed: 0,occupationUri,relationType,skillType,skillUri,esco_version
0,http://data.europa.eu/esco/occupation/00030d09...,essential,knowledge,http://data.europa.eu/esco/skill/fed5b267-73fa...,v1.0.3
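
The relation rows carry only URIs and no `lang` column, so human-readable labels come from joining back to `occupations` and `skill` in a chosen language. The sketch below shows that join on toy tables; the URIs and rows are invented, and the labels reuse the software engineer example from the graph section.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE occupations (conceptUri TEXT, preferredLabel TEXT, lang TEXT)")
demo.execute("CREATE TABLE skill (conceptUri TEXT, preferredLabel TEXT, lang TEXT)")
demo.execute(
    "CREATE TABLE occupationSkillRelations "
    "(occupationUri TEXT, relationType TEXT, skillUri TEXT)"
)
demo.execute("INSERT INTO occupations VALUES ('toy:occ/1', 'software engineer', 'en')")
demo.executemany(
    "INSERT INTO skill VALUES (?, ?, ?)",
    [
        ("toy:skill/1", "programming", "en"),
        ("toy:skill/2", "linear algebra", "en"),
        ("toy:skill/1", "Programmierung", "de"),
    ],
)
demo.executemany(
    "INSERT INTO occupationSkillRelations VALUES (?, ?, ?)",
    [
        ("toy:occ/1", "essential", "toy:skill/1"),
        ("toy:occ/1", "optional", "toy:skill/2"),
    ],
)

# Skills linked to one occupation, with labels in the occupation's language.
skills_for_occupation = demo.execute(
    """
    SELECT s.preferredLabel, r.relationType
    FROM occupations AS o
    JOIN occupationSkillRelations AS r ON r.occupationUri = o.conceptUri
    JOIN skill AS s ON s.conceptUri = r.skillUri AND s.lang = o.lang
    WHERE o.lang = ? AND o.preferredLabel = ?
    ORDER BY r.relationType, s.preferredLabel
    """,
    ("en", "software engineer"),
).fetchall()
print(skills_for_occupation)
```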


### `skillSkillRelations`: graph between skills 
There are `5971` edges between skills.

Important columns: 
- `originalSkillUri`: first skill URI, corresponding to `conceptUri` in the `skill` table
- `relatedSkillUri`: second skill URI, corresponding to `conceptUri` in the `skill` table
- `relationType`: relation between the two skills, can be 'optional' or 'essential'

In [255]:
pd.read_sql("SELECT * FROM skillSkillRelations LIMIT 1", con)

Unnamed: 0,originalSkillUri,originalSkillType,relationType,relatedSkillType,relatedSkillUri,esco_version
0,http://data.europa.eu/esco/skill/00064735-8fad...,knowledge,optional,knowledge,http://data.europa.eu/esco/skill/d4a0744a-508b...,v1.0.3


### `broaderRelationsOccPillar`: occupation hierarchical edges
Columns:
- `conceptUri`: the identifier of the occupation or ISCO group
- `broaderUri`: the identifier of its parent

In [278]:
pd.read_sql("SELECT * FROM broaderRelationsOccPillar LIMIT 3",con)

Unnamed: 0,conceptType,conceptUri,broaderType,broaderUri,esco_version
0,ISCOGroup,http://data.europa.eu/esco/isco/C01,ISCOGroup,http://data.europa.eu/esco/isco/C0,v1.0.3
1,ISCOGroup,http://data.europa.eu/esco/isco/C011,ISCOGroup,http://data.europa.eu/esco/isco/C01,v1.0.3
2,ISCOGroup,http://data.europa.eu/esco/isco/C0110,ISCOGroup,http://data.europa.eu/esco/isco/C011,v1.0.3
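
Following `broaderUri` repeatedly yields the path from a concept up to the root of the taxonomy. The sketch below walks the three rows shown above; it assumes each concept has at most one parent in the mapping, which may not hold for every row of the real tables, and `broader_chain` is a hypothetical helper.

```python
def broader_chain(concept_uri, broader_of):
    """Follow conceptUri -> broaderUri links and return the ancestors in order."""
    chain = []
    seen = {concept_uri}
    current = concept_uri
    while current in broader_of:
        current = broader_of[current]
        if current in seen:  # guard against accidental cycles
            break
        chain.append(current)
        seen.add(current)
    return chain

base = "http://data.europa.eu/esco/isco/"
# conceptUri -> broaderUri, taken from the three rows in the output above.
broader_of = {
    base + "C01": base + "C0",
    base + "C011": base + "C01",
    base + "C0110": base + "C011",
}

chain = broader_chain(base + "C0110", broader_of)
print(chain)
```

With the real data, `broader_of` could be built from the `conceptUri` and `broaderUri` columns of the table.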


### `broaderRelationsSkillPillar`: skill hierarchical edges
- `conceptUri`: the identifier of the skill or skill group
- `broaderUri`: the identifier of its parent

In [279]:
pd.read_sql("SELECT * FROM broaderRelationsSkillPillar LIMIT 3",con)

Unnamed: 0,conceptType,conceptUri,broaderType,broaderUri,esco_version
0,SkillGroup,http://data.europa.eu/esco/skill/045f71e6-0699...,SkillGroup,http://data.europa.eu/esco/skill/8f18f987-33e2...,v1.0.3
1,SkillGroup,http://data.europa.eu/esco/skill/09e28145-e205...,SkillGroup,http://data.europa.eu/esco/skill/8f18f987-33e2...,v1.0.3
2,SkillGroup,http://data.europa.eu/esco/skill/0ab5084c-cbd0...,SkillGroup,http://data.europa.eu/esco/skill/a90b12b6-248d...,v1.0.3
