<a href="https://colab.research.google.com/github/macro-mancer/data_analysis_scripts/blob/main/jserd_soft_skills_data_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install dython, which is not preinstalled on Google Colab
!pip install dython



In [25]:
import statistics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from dython.nominal import associations

### Loading Data from Job Descriptions:

In [39]:
# If running outside Google Colab, replace the path below with the location of the CSV file
soft_skills_br = pd.read_csv("/content/soft_skills_data_br.csv", sep=";")
soft_skills_br = soft_skills_br.drop(["text", "ID"], axis="columns")

soft_skills_br.head()

Unnamed: 0,title,org_name,seniority,hard_skills,soft_skills,company_size
0,Analista de QA,EXAME,Mid-level,"[""test script"", ""customer insight"", ""API"", ""co...","[""Collaboration"", ""Analytical"", ""Organization""...",Large
1,ANALISTA DE TESTES JÚNIOR - ATUAÇÃO REMOTA OU ...,Fortics,Entry level,"[""software quality"", ""quality assurance"", ""sof...","[""Communication (generic)""]",Large
2,QA Pleno (Remoto),Beyond Soluções,Mid-level,"[""android studio"", ""azure devops"", ""visual art""]","[""Planning""]",Medium
3,Analista da Garantia da Qualidade Júnior,EMS,Entry level,"[""job description"", ""quality assurance"", ""good...","[""Investigative"", ""Planning"", ""Innovation""]",Large
4,Development QA Analyst,Kokku,Entry level,"[""quality assurance"", ""quality assurance"", ""te...","[""Collaboration"", ""Planning"", ""Leadership"", ""C...",Small
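
The load-and-drop steps are the same for each country's file, so they could be wrapped in one helper. The sketch below is not part of the original notebook: the name `load_soft_skills` and its parameters are hypothetical, and the optional parsing step assumes the list columns are stored as JSON-style strings, as the `head()` output suggests.

```python
import json

import pandas as pd


def load_soft_skills(path, parse_lists=False,
                     list_columns=("hard_skills", "soft_skills")):
    """Hypothetical helper: read one job-ad CSV and drop unused columns."""
    df = pd.read_csv(path, sep=";")
    # Drop the raw ad text and the ID column if they are present.
    df = df.drop(columns=["text", "ID"], errors="ignore")
    if parse_lists:
        for col in list_columns:
            if col in df.columns:
                # Assumes each cell holds a JSON-style list such as ["Planning"].
                df[col] = df[col].apply(json.loads)
    return df
```

With the default `parse_lists=False` the columns stay as strings, so later cells that call `value_counts()` on `soft_skills` or convert it to lists themselves keep working unchanged.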


In [36]:
# If running outside Google Colab, replace the path below with the location of the CSV file
soft_skills_us = pd.read_csv("/content/soft_skills_data_us.csv", sep=";")
soft_skills_us = soft_skills_us.drop(["text", "ID"], axis="columns")

soft_skills_us.head()

Unnamed: 0,title,org_name,seniority,hard_skills,soft_skills,company_size
0,Quality Assurance/Quality Control (QA/QC) Analyst,LaBella Associates,Entry level,"[""quality assurance"", ""quality control"", ""prog...","[""Leadership"", ""Planning"", ""Proactive"", ""Self ...",Large
1,Quality Assurance/Quality Control (QA/QC) Analyst,LaBella Associates,Entry level,"[""quality assurance"", ""quality control"", ""prog...","[""Leadership"", ""Planning"", ""Proactive"", ""Self ...",Large
2,Tester /QA Analyst,"Donato Technologies, Inc.",Senior,"[""job description"", ""verbal communication skil...","[""Not mention""]",Medium
3,Quality Assurance Tester,Insight Global,Mid-level,"[""test case"", ""user story"", ""hp quality center...","[""Not mention""]",Large
4,"QA Manual Tester, FULLTIME","Conch Technologies, Inc",Mid-level,"[""manual testing"", ""computer science"", ""softwa...","[""Not mention""]",Medium


### Checking Data

In [29]:
soft_skills_br.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2164 entries, 0 to 2163
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   title         2164 non-null   object
 1   org_name      2164 non-null   object
 2   seniority     2164 non-null   object
 3   hard_skills   2164 non-null   object
 4   soft_skills   2164 non-null   object
 5   company_size  2164 non-null   object
dtypes: object(6)
memory usage: 101.6+ KB


In [37]:
soft_skills_us.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1658 entries, 0 to 1657
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   title         1658 non-null   object
 1   org_name      1658 non-null   object
 2   seniority     1658 non-null   object
 3   hard_skills   1658 non-null   object
 4   soft_skills   1658 non-null   object
 5   company_size  1657 non-null   object
dtypes: object(6)
memory usage: 77.8+ KB
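
The `info()` output for the US data reports 1657 non-null `company_size` values out of 1658 rows, so one ad has no company size. The first two rows of the US `head()` output also look identical. A minimal sketch of how missing values and repeated rows could be located, using a small made-up frame rather than the real data (the helper name `rows_with_missing` is hypothetical):

```python
import pandas as pd


def rows_with_missing(df, column):
    """Return the rows whose value in `column` is missing (NaN/None)."""
    return df[df[column].isna()]


# Toy data, invented for illustration only.
toy = pd.DataFrame(
    {
        "title": ["QA Analyst", "QA Analyst", "Tester", "QA Engineer"],
        "company_size": ["Large", "Large", None, "Medium"],
    }
)

missing = rows_with_missing(toy, "company_size")
print(missing)

# Number of rows that exactly repeat an earlier row.
num_duplicates = int(toy.duplicated().sum())
print(f"Duplicated rows: {num_duplicates}")
```

`duplicated()` needs hashable cell values, so on the real data it would have to run while the skill columns are still strings, before they are converted to lists.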


#### Looking at an Overview of the Job Descriptions from Brazilian Companies

In [30]:
soft_skills_br["seniority"].value_counts()

seniority
Entry level      845
Not mentioned    582
Mid-level        537
Senior           200
Name: count, dtype: int64

In [31]:
soft_skills_br["company_size"].value_counts()

company_size
Large     1099
Medium     418
Micro      351
Small      296
Name: count, dtype: int64

In [32]:
soft_skills_br["soft_skills"].value_counts()

soft_skills
["Not mention"]                                                                                186
["Communication (generic)"]                                                                    105
["Planning"]                                                                                    70
["Collaboration"]                                                                               39
["Innovation", "Communication (generic)"]                                                       37
                                                                                              ... 
["Creativity", "Leadership", "Investigative", "Team", "Interpersonal"]                           1
["Innovation", "Curiosity", "Leadership", "Negotiation", "Communication (generic)", "Team"]      1
["Creativity", "Communication (generic)", "Communication (written)"]                             1
["Decision making", "Innovation", "Collaboration", "Planning"]                                   

In [40]:
import ast

# Convert the string representation of each list into a real Python list.
# ast.literal_eval only parses literals, so it is safer than eval on file contents.
soft_skills_br['soft_skills'] = soft_skills_br['soft_skills'].apply(ast.literal_eval)

# Calculate the average number of soft skills mentioned in the job ads (from Brazil),
# skipping ads that do not mention any soft skill.
num_of_soft_skills_in_ad = []
for ss_list in soft_skills_br["soft_skills"]:
  if ss_list[0] != "Not mention":
    num_of_soft_skills_in_ad.append(len(ss_list))

print(f"The job advertisement [from a Brazilian company] that lists the largest number of soft skills mentions {max(num_of_soft_skills_in_ad)} soft skills")
print(f"On average, job ads [in Brazil] list {statistics.mean(num_of_soft_skills_in_ad)} soft skills")

The job advertisement [from a Brazilian company] that lists the largest number of soft skills mentions 11 soft skills
On average, job ads [in Brazil] list 3.3806875631951465 soft skills
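
The same statistics can be computed without an explicit loop. This is a sketch on made-up data, assuming `soft_skills` already holds Python lists and that ads without soft skills are marked with `"Not mention"`, as in the data above:

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame(
    {
        "soft_skills": [
            ["Planning", "Team"],
            ["Not mention"],
            ["Collaboration", "Leadership", "Innovation", "Curiosity"],
        ]
    }
)

# Keep only ads that mention at least one soft skill.
has_skills = toy["soft_skills"].apply(lambda skills: "Not mention" not in skills)
mentions = toy.loc[has_skills, "soft_skills"]

# Number of soft skills listed in each remaining ad.
counts = mentions.str.len()

print(f"max: {counts.max()}, mean: {counts.mean()}, median: {counts.median()}")
```

Keeping the per-ad counts in a Series also makes it easy to inspect the whole distribution, for example with `counts.describe()`.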


## Data Analysis

A look at the companies (or recruitment agencies) that posted the job advertisements:

In [42]:
pd.set_option("display.max_rows", None)
soft_skills_br["org_name"].value_counts()

org_name
IEL Paraná                                                     62
Netvagas                                                       51
Bluelight Consulting | DevOps & Software Development           38
BairesDev                                                      37
Volkswagen do Brasil                                           23
CI&T                                                           21
GeekHunter                                                     18
TSA - Tecnologia de Sistemas de Automação S/A                  18
Grupo Cimed                                                    17
Mollica IT                                                     16
Luxoft                                                         13
USIMINAS                                                       12
Atlântico                                                      12
TAKING                                                         12
Randstad                                                       11
W

In [45]:
num_companies = soft_skills_br["org_name"].nunique()
print(f"Number of companies {num_companies}")

Number of companies 982


In [46]:
def extract_all_soft_skills(df):
    """Count how many times each soft skill appears across all job ads in df."""
    soft_skills_dict = {}
    for skills in df["soft_skills"]:
        for skill in skills:
            soft_skills_dict[skill] = soft_skills_dict.get(skill, 0) + 1
    return soft_skills_dict

ss_d = extract_all_soft_skills(soft_skills_br)
# Display the counts sorted from least to most frequent.
dict(sorted(ss_d.items(), key=lambda item: item[1]))

{'Critical thinking': 3,
 'Enthusiasm': 3,
 'Communication(written)': 3,
 'Organization': 4,
 'Dynamism': 6,
 'Self management': 7,
 'Criativity': 9,
 'Flexibility': 14,
 'Self motivated': 17,
 'Communication (oral)': 20,
 'Empathy': 34,
 'Assertive': 37,
 'Self disciplined': 39,
 'Proactive': 42,
 'Cooperation': 52,
 'Mentoring': 66,
 'Diversity': 72,
 'Adaptable': 83,
 'Negotiation': 98,
 'Resilience': 102,
 'Investigative': 114,
 'Decision making': 135,
 'Interpersonal': 170,
 'Not mention': 186,
 'Creativity': 194,
 'Team': 205,
 'Self': 216,
 'Analytical': 227,
 'Curiosity': 246,
 'Problem solving': 258,
 'Leadership': 295,
 'Communication (written)': 356,
 'Collaboration': 700,
 'Innovation': 761,
 'Planning': 896,
 'Communication (generic)': 1203}
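
The manual dictionary counting can also be expressed with standard tools. The sketch below shows two equivalent alternatives on made-up data, assuming `soft_skills` holds Python lists: `collections.Counter` over the flattened lists, and pandas `explode()` followed by `value_counts()`.

```python
from collections import Counter

import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame(
    {"soft_skills": [["Planning", "Team"], ["Planning"], ["Not mention"]]}
)

# Option 1: Counter over every skill in every ad.
counter = Counter(skill for skills in toy["soft_skills"] for skill in skills)

# Option 2: one row per (ad, skill) pair, then count the rows per skill.
exploded_counts = toy["soft_skills"].explode().value_counts()

print(counter.most_common())
print(exploded_counts)
```

Both options count mentions, so a skill listed twice in the same ad would be counted twice, just as in the loop-based version.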

In [48]:
# "Not mention" stays in ss_d so that the share of ads without soft skills is also printed;
# it is excluded from the count of distinct skills below.
print(f"We found {len(ss_d) - 1} different soft skills in job advertisements")
print("The percentages indicate the percentage of advertisements that referred to a skill" +
      "\n(since an advertisement can list multiple skills, the total exceeds 100%). ")

# Calculate percentages relative to the number of job ads,
# not to the total number of skill mentions (sum(ss_d.values())).
number_of_job_ads = len(soft_skills_br.index)
for k, v in ss_d.items():
    pct = v * 100.0 / number_of_job_ads
    print(k, pct)

We found 35 different soft skills in job advertisements
The percentages indicate the percentage of advertisements that referred to a skill
(since an advertisement can list multiple skills, the total exceeds 100%). 
Collaboration 32.34750462107209
Analytical 10.489833641404806
Organization 0.18484288354898337
Communication (generic) 55.59149722735675
Empathy 1.5711645101663585
Self motivated 0.7855822550831792
Problem solving 11.922365988909426
Planning 41.40480591497227
Investigative 5.2680221811460255
Innovation 35.16635859519408
Leadership 13.632162661737523
Communication (written) 16.45101663585952
Curiosity 11.367837338262477
Creativity 8.964879852125692
Team 9.473197781885398
Self 9.981515711645102
Not mention 8.595194085027726
Interpersonal 7.855822550831793
Diversity 3.3271719038817005
Decision making 6.238447319778189
Cooperation 2.402957486136784
Assertive 1.7097966728280962
Adaptable 3.8354898336414047
Mentoring 3.0499075785582255
Negotiation 4.5286506469500925
Resilience 4.7
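
Each percentage above is a skill's mention count divided by the number of ads. That equals the share of ads only if no ad lists the same skill twice, which the notebook does not verify. A sketch on made-up data that counts each skill at most once per ad before dividing:

```python
import pandas as pd

# Toy data, invented for illustration only; the second ad repeats a skill.
toy = pd.DataFrame(
    {
        "soft_skills": [
            ["Planning", "Team"],
            ["Planning", "Planning"],
            ["Not mention"],
            ["Team", "Innovation"],
        ]
    }
)

# One row per (ad index, skill) pair.
exploded = toy["soft_skills"].explode().reset_index()
# Drop repeated pairs so a skill counts once per ad.
per_ad = exploded.drop_duplicates()

ads_per_skill = per_ad["soft_skills"].value_counts()
pct = (ads_per_skill * 100.0 / len(toy)).round(1).sort_values(ascending=False)
print(pct)
```

Because an ad can list several skills, the percentages still add up to more than 100%.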

In [49]:
# Print the soft skills in alphabetical order
for i in sorted(ss_d.keys()):
    print(i)

Adaptable
Analytical
Assertive
Collaboration
Communication (generic)
Communication (oral)
Communication (written)
Communication(written)
Cooperation
Creativity
Criativity
Critical thinking
Curiosity
Decision making
Diversity
Dynamism
Empathy
Enthusiasm
Flexibility
Innovation
Interpersonal
Investigative
Leadership
Mentoring
Negotiation
Not mention
Organization
Planning
Proactive
Problem solving
Resilience
Self
Self disciplined
Self management
Self motivated
Team
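
The alphabetical listing exposes a few labels that look like variants of the same skill, for example `Communication(written)` next to `Communication (written)`, and `Criativity` next to `Creativity`. If these are indeed the same skills (an assumption; the data does not say so), they could be merged before counting. The mapping `LABEL_FIXES` and the function `normalize_skills` below are hypothetical:

```python
# Hypothetical mapping from variant spellings to a canonical label.
LABEL_FIXES = {
    "Communication(written)": "Communication (written)",
    "Criativity": "Creativity",
}


def normalize_skills(skills):
    """Replace known variants and drop repeats while keeping the order."""
    normalized = []
    for skill in skills:
        canonical = LABEL_FIXES.get(skill, skill)
        if canonical not in normalized:
            normalized.append(canonical)
    return normalized


print(normalize_skills(["Criativity", "Creativity", "Communication(written)"]))
# -> ['Creativity', 'Communication (written)']
```

Applied to the data, this would look like `soft_skills_br["soft_skills"].apply(normalize_skills)`, run before the skills are counted.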
