In [None]:
!pip install langchain

In [None]:
!pip install observable_jupyter

This project analyzes the skills listed in data scientist job posts, groups those skills into categories, and visualizes the result as a Sunburst chart.

In [234]:
from observable_jupyter import embed
embed('@khuyentran1401/data-science-skills', cells=['chart', 'data'])

To do this, we will:
- Obtain a dataset of data science job posts using Diffbot
- Preprocess the acquired data to extract relevant information and clean the text
- Utilize a GPT model to categorize the extracted skills from the job posts
- Transform the processed data into the required format for creating a Sunburst chart
- Use Observable to create a Sunburst chart based on the processed data

## Get the data

[Diffbot](https://www.diffbot.com/) is a knowledge graph of the web, billed as the world's largest, that gives you access to a trillion connected facts.

Read [this article](https://towardsdatascience.com/build-and-analyze-knowledge-graphs-with-diffbot-2af83065ade0) to learn more about how to use Diffbot.

The following query is used to extract all job posts with the title "data scientist":
![](diffbot.png)

For this demo, the data is available for free as a GitHub Gist:

In [235]:
import pandas as pd

# Load the data
df = pd.read_csv(
    "https://gist.githubusercontent.com/khuyentran1401/901cbe3b830cb2b4a0250ae99804611f/raw/6bbed328966c2484d0084a93da1aaa801fb02443/data-science-jobs-05-18-2023.csv"
)
df.head(10)


Unnamed: 0,name,employer_name,date_str,date_timestamp,allUris,jobCategories_name,employer_summary,requirements,tasks,remote_normalizedValue,skills_name,description
0,"Lead, Data Scientist Regulatory Intelligence a...",Takeda,d2023-04-30,1682812800000,jobs.wivb.com/jobs/lead-data-scientist-regulat...,"Senior,Engineering, IT and Software Developmen...",Japanese pharmaceutical company,"BSc, Advanced scientific-related degree prefer...",Serve as Global Regulatory Intelligence contac...,REMOTE,European Union,"By clicking the ""Apply"" button, I understand t..."
1,"Senior Research Data Scientist, School of Comp...",University College Dublin,d2023-04-25,1682380800000,timeshighereducation.com/unijobs/listing/33803...,"Senior,Artificial Intelligence,Engineering, IT...","University located in Dublin, Ireland",,,NOT_REMOTE,"analytics,data science,application,European Un...",Applications are invited for a temporary post ...
2,Data Scientist - Intern,Baylor College of Medicine,d2023-04-20,1681948800000,jobs.bcm.edu/job/Houston-Data-Scientist-Intern...,"Intern,Data Science,Engineering, IT and Softwa...","Private medical school in Houston, Texas, Unit...","College Student (undergraduate or graduate).,N...",Develops predictive models and algorithms to s...,NOT_REMOTE,"data science,student,data visualization,machin...",Baylor College of Medicine's Pediatrics divisi...
3,Data Scientist - Senior Consultant,Guidehouse,d2023-04-16,1681603200000,jobs.whnt.com/jobs/data-scientist-senior-consu...,"Senior,Engineering, IT and Software Developmen...",,The successful candidate must not be subject t...,Provide the full spectrum of data science serv...,NOT_REMOTE,"UiPath,information system,computer science,dat...",Job Family:\nData Science Consulting\nTravel R...
4,Data Scientist - Senior Consultant,Guidehouse,d2023-04-16,1681603200000,jobs.whnt.com/jobs/data-scientist-senior-consu...,"Senior,Engineering, IT and Software Developmen...",,The successful candidate must not be subject t...,Provide the full spectrum of data science serv...,NOT_REMOTE,"UiPath,information system,computer science,dat...",Job Family:\nData Science Consulting\nTravel R...
5,Data scientist jobs,NTT Data,d2023-04-15,1681516800000,sg.talent.com/jobs?k=data+scientist&l=,"Data Science,Engineering, IT and Software Deve...","Software company based in Sendai-shi, Miyagi P...",,,NOT_REMOTE,"Highcharts,machine learning,research,artificia...","These squads deliver products services in AI, ..."
6,"David Grönberg, data scientist på Axfood",Axfood,d2023-04-14,1681430400000,axfood.se/karriar/mot-vara-medarbetare/david-g...,"Engineering, IT and Software Development,Data ...","Retailer based in Stockholm, Stockholm County,...",,,NOT_REMOTE,,Direkt efter avslutade studier i datavetenskap...
7,Data Scientist - AI/ML (Genetic Medicine),Eli Lilly and Company,d2023-04-13,1681344000000,biospace.com/job/2733552/data-scientist-aiml-g...,"Engineering, IT and Software Development,Artif...",American pharmaceutical company,"Ph.D. in Computer Science, Applied Mathematics...","As a key contributor to this team, you will br...",NOT_REMOTE,"applied mathematics,engineering","At Lilly, we unite caring with discovery to ma..."
8,"Youcef Alouani, data scientist på Axfood",Axfood,d2023-04-13,1681344000000,axfood.se/karriar/mot-vara-medarbetare/youcef-...,"Engineering, IT and Software Development,Data ...","Retailer based in Stockholm, Stockholm County,...",,,NOT_REMOTE,,"""Om du vill spela i det vinnande laget – kom h..."
9,"Data Scientist, AI/ML drug discovery",Nitto BioPharma Inc,d2023-04-12,1681257600000,biospace.com/job/2732803/data-scientist-aiml-d...,"Artificial Intelligence,Engineering, IT and So...",,,,NOT_REMOTE,"education,data science,machine learning,core c...","Job Description\nPosition: Data Scientist, AI/..."


## Process the Data

In [3]:
# Convert the date_timestamp column to datetime format
df["datetime"] = pd.to_datetime(df["date_timestamp"], unit="ms")

# Get only the data from 2019 onwards
df = df.loc[df['datetime'].dt.year >= 2019]
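
The conversion and filter above can be checked on a few values. The sketch below uses two `date_timestamp` values from the table above plus one hypothetical pre-2019 value (`1514764800000`) added only to show the filter dropping a row; `toy` and `recent` are illustrative names.

```python
import pandas as pd

# Two timestamps taken from the table above, plus one hypothetical
# pre-2019 value so that the year filter has something to remove.
toy = pd.DataFrame(
    {"date_timestamp": [1682812800000, 1682380800000, 1514764800000]}
)

# The column stores milliseconds since the Unix epoch, hence unit="ms"
toy["datetime"] = pd.to_datetime(toy["date_timestamp"], unit="ms")

# Keep only rows from 2019 onwards
recent = toy.loc[toy["datetime"].dt.year >= 2019]

print(toy["datetime"].dt.strftime("%Y-%m-%d").tolist())
print(len(recent))
```

The first two dates should agree with the `date_str` column of the corresponding rows.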

In [224]:
from typing import List, Union

def split_on_comma(row: Union[str, float]) -> List[str]:
    if isinstance(row, float):
        return []
    return row.split(",")

# Turn skills into a list of strings
df["skills"] = df.skills_name.apply(split_on_comma)

# Explode the lists so that each skill gets its own row
skills = df[["skills"]].explode('skills')

Unnamed: 0,skills
0,European Union
1,analytics
1,data science
1,application
1,European Union
1,computer vision
1,machine learning
1,artificial intelligence
2,data science
2,student
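
To see what the split-and-explode step does to missing values, here is a minimal sketch on a three-row toy frame (the data is made up). It reuses the `split_on_comma` helper from above and shows that a post without skills becomes a single `NaN` row after `explode`.

```python
from typing import List, Union

import pandas as pd


def split_on_comma(row: Union[str, float]) -> List[str]:
    # Missing values are read from the CSV as float NaN
    if isinstance(row, float):
        return []
    return row.split(",")


# Toy data: the second post has no skills listed
toy = pd.DataFrame(
    {"skills_name": ["analytics,data science", float("nan"), "SAS"]}
)
toy["skills"] = toy["skills_name"].apply(split_on_comma)

# Each list element becomes its own row; the original index is repeated
exploded = toy[["skills"]].explode("skills")
print(exploded)
```

The repeated index values tell you which job post each skill came from, as in the output above.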


In [225]:
# Standardize the "skills" column by replacing each skill
# with its standardized version,
# e.g. "Master of Science" becomes "master's degree"
map_skills = {"Master of Science": "master's degree", "master s degree": "master's degree", "Doctor of Philosophy": "phd", "doctorate": "phd"}
skills["skills"] = skills["skills"].map(map_skills).fillna(skills['skills'])
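
The `map(...).fillna(...)` pattern works because `map` returns `NaN` for every value that is not a key of the dictionary, and `fillna` then restores the original value in those positions. A small sketch with made-up values:

```python
import pandas as pd

# A shortened version of the mapping used above
map_skills = {"Master of Science": "master's degree", "doctorate": "phd"}

toy = pd.Series(["Master of Science", "doctorate", "Python"])

# "Python" is not a key, so map() yields NaN there and fillna() puts it back
standardized = toy.map(map_skills).fillna(toy)
print(standardized.tolist())
```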

In [226]:
# Get count of each skill
skills = skills.value_counts().to_frame().reset_index()

# Keep the 100 most frequent skills
skills = skills.nlargest(columns='count', n=100)
print(skills.head(10))

# Get unique skills
unique_skills = skills.skills.values

Unnamed: 0,skills,count
0,machine learning,140
1,statistics,100
2,computer science,93
3,master's degree,79
4,mathematics,64
5,analytics,64
6,artificial intelligence,61
7,data science,50
8,engineering,46
9,SAS,36
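
The cell above relies on the counts column being named `count`, which is the behavior of pandas 2.0 and later; on older versions the column is, to my knowledge, unnamed and shows up as `0`. A sketch of a variant that names the column explicitly, and so should not depend on the pandas version, is shown on toy data:

```python
import pandas as pd

# Made-up skills, one per row, as produced by explode()
toy = pd.DataFrame(
    {"skills": ["SAS", "statistics", "SAS", "Python", "SAS", "statistics"]}
)

# Name the resulting Series explicitly before turning it into a frame
counts = toy.value_counts().rename("count").reset_index()

# Keep the two most frequent skills
top = counts.nlargest(n=2, columns="count")
print(top)
```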


## Categorize Skills with GPT

[LangChain](https://github.com/hwchase17/langchain) is a framework for developing applications powered by language models. We will use LangChain to ask a chat model to categorize skills. 

In [116]:
import os

# Read the API key from the environment rather than hard-coding it in the notebook
openai_api_key = os.environ["OPENAI_API_KEY"]

In [117]:
from langchain.schema import HumanMessage, SystemMessage
from langchain.chat_models import ChatOpenAI

# Instantiate the ChatOpenAI model 
model_name = "gpt-3.5-turbo"
model = ChatOpenAI(openai_api_key=openai_api_key, model=model_name)

# Create messages to pass to the model
messages = [
    SystemMessage(content="You are a helpful assistant that returns a dictionary whose keys are categories and values are skills. Strings are enclosed in double quotes."),
    HumanMessage(content=f"Categorize the following skills {unique_skills}")
]

# Send messages to the model and receive results
result = model(messages)
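
One detail worth knowing about the prompt: interpolating a NumPy array into an f-string inserts the array's printed form, which has no commas between items and wraps long arrays across lines. The model usually copes with this, but a comma-separated list is less ambiguous. The sketch below compares both on a made-up three-skill array; `as_repr` and `as_list` are illustrative names.

```python
import numpy as np

# Stand-in for the array returned by skills.skills.values
unique_skills = np.array(["machine learning", "statistics", "SAS"], dtype=object)

# What the f-string in the cell above produces
as_repr = f"Categorize the following skills {unique_skills}"

# An explicit, comma-separated alternative
as_list = "Categorize the following skills: " + ", ".join(unique_skills)

print(as_repr)
print(as_list)
```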


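The response is a string, so we need to parse it into a dictionary. Because the string is generated by a model, we use `ast.literal_eval`, which only accepts Python literals, instead of the built-in `eval` function, which would execute any code the string contains.

In [None]:
import ast

# Turn the string into a dictionary
categories = ast.literal_eval(result.content)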
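
Model replies are not guaranteed to have the same shape every time, so it can help to validate the parsed result before saving it. The following is a minimal sketch, not part of the original notebook: `parse_categories` is a hypothetical helper that tries JSON first, falls back to a Python literal, and never executes the text.

```python
import ast
import json
from typing import Dict, List


def parse_categories(text: str) -> Dict[str, List[str]]:
    """Parse a reply into {category: [skills]} without executing it.

    Raises ValueError (or SyntaxError for malformed text) on bad input.
    """
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to a Python literal, e.g. a dict with single quotes
        parsed = ast.literal_eval(text)

    if not isinstance(parsed, dict):
        raise ValueError("expected a dictionary of categories")
    for category, members in parsed.items():
        if not isinstance(members, list):
            raise ValueError(f"expected a list of skills for {category!r}")
    return parsed


# Made-up reply in the format requested by the system message
reply = '{"data science": ["machine learning", "analytics"], "degrees": ["phd"]}'
print(parse_categories(reply))
```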

In [160]:
import json 

# Save the dictionary of categories to a file
with open('skill_categories.json', 'w') as f:
    json.dump(categories, f)

## Prepare Data for Visualization

In [219]:
# Read in the skill categories from the file
with open('skill_categories.json', 'r') as f:
    categories = json.load(f)

In [220]:
# Turn {'category1': ['skill1', 'skill2'], 'category2': ['skill3', 'skill4']}
# into {'skill1': 'category1', 'skill2': 'category1', 'skill3': 'category2', 'skill4': 'category2'}
skills_to_categories = {
    skill: category for category, skills in categories.items() for skill in skills
}

# Add the categories column to the skills table
skills['categories'] = skills['skills'].map(skills_to_categories)

In [227]:
# Drop rows whose categories are in the list of categories to ignore
categories_to_ignore = ['other', 'expertise', 'companies', 'science']
skills = skills[~skills['categories'].isin(categories_to_ignore)]
skills.head(10)

Unnamed: 0,skills,count,categories
0,machine learning,140,data science
1,statistics,100,mathematics
2,computer science,93,computer science
3,master's degree,79,degrees
4,mathematics,64,mathematics
5,analytics,64,data science
6,artificial intelligence,61,data science
7,data science,50,data science
9,SAS,36,Programming Tools
10,Amazon Web Services,35,cloud services
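
The two steps above (inverting the dictionary, then filtering) can be traced on toy data. The sketch below uses made-up categories and counts and highlights one subtlety: a skill that the model did not place in any category gets `NaN`, and `isin` does not match `NaN`, so such a row survives the filter.

```python
import pandas as pd

# Made-up model output: category -> list of skills
categories = {
    "data science": ["machine learning", "analytics"],
    "other": ["student"],
}

# Invert to skill -> category
skills_to_categories = {
    skill: category
    for category, members in categories.items()
    for skill in members
}

# "Rust" is deliberately missing from the categories above
toy = pd.DataFrame(
    {"skills": ["machine learning", "student", "Rust"], "count": [140, 20, 3]}
)
toy["categories"] = toy["skills"].map(skills_to_categories)

# Drop ignored categories; the uncategorized row is kept
kept = toy[~toy["categories"].isin(["other"])]
print(kept)
```

Rows with a missing category are excluded later by `groupby`, which skips `NaN` keys by default, so they do not reach the chart.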


In [228]:
"""This code turns a dataframe with the following columns:

	skills	count	categories
0	machine learning	140	data science
1	statistics	100	mathematics
2	computer science	93	computer science
3	analytics	64	data science
4	mathematics	64	mathematics

into a list of dictionaries with the following structure:

[
    {
        "name": "data science",
        "children": [
            {"name": "machine learning", "value": 140},
            {"name": "analytics", "value": 64}
        ]
    },
    {
        "name": "mathematics",
        "children": [
            {"name": "statistics", "value": 100},
            {"name": "mathematics", "value": 64}
        ]
    },
    {
        "name": "computer science",
        "children": [
            {"name": "computer science", "value": 93}
        ]
    }
]
"""

# Group by 'categories' and convert each group to a list of {"name", "value"} records
grouped_df = skills.rename(columns={'skills': 'name', 'count': 'value'})
grouped_df = grouped_df.groupby('categories').apply(lambda x: x[['name', 'value']].to_dict(orient='records')).reset_index()
grouped_df.columns = ['name', 'children']
result_dict = grouped_df.to_dict(orient='records')

print(result_dict)


[{'name': 'Programming Tools', 'children': [{'name': 'SAS', 'value': 36}, {'name': 'Python', 'value': 20}, {'name': 'pandas', 'value': 17}, {'name': 'NumPy', 'value': 17}, {'name': 'Microsoft SQL Server', 'value': 10}, {'name': 'Microsoft Power BI', 'value': 10}, {'name': 'scikit-learn', 'value': 10}, {'name': 'Git', 'value': 10}, {'name': 'Microsoft Excel', 'value': 9}, {'name': 'Project Jupyter', 'value': 9}, {'name': 'Apache Hive', 'value': 9}, {'name': 'Oracle Database', 'value': 7}, {'name': 'Qlik', 'value': 7}, {'name': 'Matplotlib', 'value': 7}, {'name': 'Statistical Package for the Social Sciences', 'value': 6}, {'name': 'Databricks', 'value': 5}, {'name': 'Tableau Software', 'value': 5}, {'name': 'Keras', 'value': 5}, {'name': 'Docker', 'value': 5}, {'name': 'Visual Basic for Applications', 'value': 4}, {'name': 'Looker', 'value': 4}]}, {'name': 'business', 'children': [{'name': 'marketing', 'value': 23}, {'name': 'quantitative research', 'value': 14}, {'name': 'business intel
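
The same nested structure can be built without `groupby(...).apply(...)`, which some readers find easier to follow. This is an alternative sketch in plain Python on made-up rows, not the notebook's method; with the real data, the rows could come from `skills.itertuples(index=False)`.

```python
from collections import defaultdict

# Made-up (skill, count, category) rows
rows = [
    ("machine learning", 140, "data science"),
    ("statistics", 100, "mathematics"),
    ("analytics", 64, "data science"),
]

# Collect the children of each category
children = defaultdict(list)
for name, value, category in rows:
    children[category].append({"name": name, "value": value})

# Sort by category name, mirroring groupby's default ordering
result = [
    {"name": category, "children": kids}
    for category, kids in sorted(children.items())
]
print(result)
```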

Save the data to a JSON file.

In [229]:
categories_dict = {'name': 'data science skills', 'children': result_dict}

with open('categories.json', 'w') as f:
    json.dump(categories_dict, f, indent=4)
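
Before uploading the file, a quick sanity check on the hierarchy can catch empty categories or missing values. The sketch below walks a nested dictionary of the same shape and sums the leaf values; `total_value` is a hypothetical helper and the data is made up.

```python
def total_value(node: dict) -> int:
    """Sum the "value" of every leaf below (and including) this node."""
    if "children" in node:
        return sum(total_value(child) for child in node["children"])
    return node["value"]


# Made-up hierarchy in the same shape as categories_dict
tree = {
    "name": "data science skills",
    "children": [
        {
            "name": "data science",
            "children": [
                {"name": "machine learning", "value": 140},
                {"name": "analytics", "value": 64},
            ],
        },
        {"name": "mathematics", "children": [{"name": "statistics", "value": 100}]},
    ],
}

# Total per top-level category, then the grand total
per_category = {child["name"]: total_value(child) for child in tree["children"]}
print(per_category)
print(total_value(tree))
```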

## Visualize the Data with Observable

Finally, we will use [this Observable notebook](https://observablehq.com/@khuyentran1401/data-science-skills) to visualize the data. To use your own data, simply replace the current data file with a new file:

![](observable.png)

And you will see the following output:

In [236]:
from observable_jupyter import embed
embed('@khuyentran1401/data-science-skills', cells=['chart', 'data'])