<center><a target="_blank" href="https://academy.constructor.org/"><img src=https://lh3.googleusercontent.com/d/1EmH3Jks5CpJy0zK3JbkvJZkeqWtVcxhB width="500" style="background:none; border:none; box-shadow:none;" /></a> </center>
<hr />

# <h1 align="center"> Helper Notebook: Projecting Word Embeddings </h1>

<hr />
<center>Constructor Academy, 2025</center>

# Helper Notebook: Projecting Word Embeddings
We will work with job ads from job.ch. A dataset of 10000 English job ads is provided.

The goal of this exercise will be to develop a working understanding of Word2vec and use t-sne as a way to analyze word embeddings

Like any classical NLP task the steps in this analysis will be

- Clean data
- Build a corpus
- Train word2vec
- Visualize using t-sne

In [1]:
! pip install umap-learn



In [2]:
!pip install gensim



In [3]:
import re

import nltk
import numpy as np
import pandas as pd
import umap
from gensim.models import word2vec
from matplotlib import pyplot as plt
from sklearn.manifold import TSNE

import nltk
nltk.download('stopwords')
nltk.download('punkt')

%matplotlib inline

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Download dataset

In [4]:
!curl -L -o job_ads_eng.csv "http://drive.google.com/uc?export=download&id=1IGCgrq7AqygIaLcjiFwlqgcNoQd1OAqo"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 32.1M  100 32.1M    0     0  6271k      0  0:00:05  0:00:05 --:--:-- 8360k


## Data preparation
### Load the data set

In [5]:
data = pd.read_csv("job_ads_eng.csv")  # .sample(50000, random_state=23)
data.head(3)

Unnamed: 0,Keywords,Job title,Date published,Alive until,Company name,Location,Company type,Contract type,Occupation,Job rank,Content,Industry,Official website,Canton initials,Canton name
0,ICT System Engineer,System Engineer,2019-05-27 00:00:00,2019-06-05 00:00:00,Harvey Nash AG,Zürich,Consultants,Unlimited employment,100,Employee,System EngineerJob Description Overview of b...,"Recruitment agency, Staffing",http://www.harveynash.com/ch,ZH,Zurich
1,Automation Engineer,Automation Engineer with DeltaV,2019-04-23 00:00:00,2019-05-03 00:00:00,Spring Professional Engineering,Sion,Consultants,Unlimited employment,100,Position with responsibilities,Ihre Herausforderung You plan and implement A...,"Recruitment agency, Staffing",https://www.springprofessional.ch/,VS,Valais
2,Development Engineer,Junior Development Engineer 100% (m/f/d),2019-05-08 00:00:00,2019-05-24 00:00:00,Zentra AG Ihr Jobprofi,Canton of Zug,Consultants,Unlimited employment,Temporary,Position with responsibilities,Since 1989 - more than a quarter of a centur...,"Recruitment agency, Staffing",http://www.zentra.ch,ZG,Zug


### Data cleaning

In [6]:
!pip install contractions

Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.2-py3-none-any.whl.metadata (1.5 kB)
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Downloading anyascii-0.3.2-py3-none-any.whl (289 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m289.9/289.9 kB[0m [31m6.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pyahocorasick-2.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (113 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m113.9/113.9 kB[0m 

In [7]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [55]:
import contractions

custom_stopwords = {
    "job", "jobs", "team", "company", "experience", "skills", "manager",
    "position", "role", "support", "requirements", "development", "projects",
    "business", "apply", "please", "email", "contact", "work", "knowledge",
    "strong", "excellent", "join", "looking", "good", "ability", "environment",
    "background", "responsibilities", "send", "application", "communication",
    "teams", "organization", "working", "department", "phone", "submission",

    # Non-HTML/CSS job site noise:
    "ad", "online", "add", "cart", "display", "original", "friend", "interested",
    "privacy", "rights", "protected", "create", "signup", "click", "enter",
    "start", "referring", "questions", "tell", "person", "referral", "account",
    "might", "us", "information", "description", "submit", "posting", "posted",
    "view", "share", "details", "opening"
}


STOP_WORDS = set(nltk.corpus.stopwords.words("english")).union(custom_stopwords)


def clean_sentence(sentence):
    # remove special characters
    sentence = re.sub(r'[^a-zA-Z0-9\s]', '', sentence, flags=re.I|re.A)
    sentence = re.sub(r'<[^>]+>', '', sentence)  # Removes tags like <div>, <br>, etc.
    sentence = re.sub(r'style[\w\d\\]+', '', sentence)
    sentence = re.sub(r'\b(?:3c|5c|26|27)[a-zA-Z0-9]+', '', sentence)
    # lower case
    sentence = sentence.lower()
    # strip whitespaces
    sentence = sentence.strip()
    # fix contractions
    sentence = contractions.fix(sentence)
    return sentence

def remove_stopwords(sentence):
    """
    remove stopwords
    """
    # tokenize
    tokens = nltk.word_tokenize(sentence)
    #filter stopwords out of
    filtered_tokens = [token for token in tokens if token not in STOP_WORDS]
    # re-create  from filtered tokens
    sentence = ' '.join(filtered_tokens)
    # TODO
    return sentence

In [56]:
data = data.dropna(subset=["Content"])  # remove rows without content
data["Content"] = data["Content"].apply(clean_sentence)
data["Content"] = data["Content"].apply(remove_stopwords)
data.head(3)

Unnamed: 0,Keywords,Job title,Date published,Alive until,Company name,Location,Company type,Contract type,Occupation,Job rank,Content,Industry,Official website,Canton initials,Canton name
0,ICT System Engineer,System Engineer,2019-05-27 00:00:00,2019-06-05 00:00:00,Harvey Nash AG,Zürich,Consultants,Unlimited employment,100,Employee,system engineerjob overview area project resou...,"Recruitment agency, Staffing",http://www.harveynash.com/ch,ZH,Zurich
1,Automation Engineer,Automation Engineer with DeltaV,2019-04-23 00:00:00,2019-05-03 00:00:00,Spring Professional Engineering,Sion,Consultants,Unlimited employment,100,Position with responsibilities,ihre herausforderung plan implement automation...,"Recruitment agency, Staffing",https://www.springprofessional.ch/,VS,Valais
2,Development Engineer,Junior Development Engineer 100% (m/f/d),2019-05-08 00:00:00,2019-05-24 00:00:00,Zentra AG Ihr Jobprofi,Canton of Zug,Consultants,Unlimited employment,Temporary,Position with responsibilities,since 1989 quarter century zentra ag stands re...,"Recruitment agency, Staffing",http://www.zentra.ch,ZG,Zug


In [11]:
# Let's have a look at an example text
data[data["Job title"].str.contains("Data")]["Content"].values[1]

'data integration data store data analysis data dictionary sql javac pythondata science developer responsibilities setup development large groupwide data store containing wide range different data used hr etc organizational data service management data infrastructure data etc data integration data staging analysis stored data setup data dictionary development itsm suite focussing ucmdb development java python work longterm project different stages taking responsibilities delivery objects parts applications requirements experience data integration etl processes larger environment strong sql skills intermediate knowledge objectoriented programming java c python fluent english language skills nice experience data science statistics r dwh understanding interest service management itil infrastructure processes knowledge angular german language skills personality good communication skills strong customer focused company working mainly top 500 companies pay great attention training developmen

### Create the corpus

In [57]:
# Create a list of lists containing the words of each description

# TODO
corpus = []

for my_list in data["Content"]:
    corpus.append(my_list.split())
corpus[0:2][0]

['system',
 'engineerjob',
 'overview',
 'area',
 'project',
 'resource',
 'strategic',
 'programs',
 'enterprise',
 'services',
 'entire',
 'chief',
 'technology',
 'office',
 'within',
 'bank',
 'key',
 'deliveries',
 'based',
 'servicenow',
 'platform',
 'midserver',
 'infrastructure',
 'consists',
 'configuration',
 'management',
 'database',
 'cmdb',
 'multiple',
 'automatic',
 'semiautomatic',
 'integrations',
 'various',
 'tools',
 'applications',
 'key',
 'exciting',
 'opportunity',
 'lead',
 'servicenow',
 'data',
 'integration',
 'engineering',
 'stream',
 'including',
 'maintenance',
 'servicenow',
 'midserver',
 'infrastructure',
 'executing',
 'changes',
 'analysis',
 'incidents',
 'problem',
 'tickets',
 'includes',
 'deep',
 'troubleshooting',
 'complex',
 'issues',
 'code',
 'level',
 'open',
 'responsibility',
 'subject',
 'matter',
 'expert',
 'lead',
 'point',
 'etl',
 'data',
 'transformation',
 'scripts',
 'built',
 'perl',
 'get',
 'chance',
 'become',
 'part',
 '

## Create word embeddings
We use word2vec of the gensim package.

In [58]:
from collections import Counter

all_words = [word for sentence in corpus for word in sentence]
word_counts = Counter(all_words)

print(f"Total unique terms: {len(word_counts)}")
print(word_counts.most_common(10))  # Show top 10 frequent terms

Total unique terms: 75603
[('management', 20000), ('project', 12944), ('data', 12360), ('global', 10542), ('english', 9228), ('new', 8866), ('services', 8860), ('years', 8443), ('solutions', 8374), ('product', 7752)]


In [45]:
from gensim.models import word2vec

In [59]:
# Set values for various parameters
feature_size = 100    # Word vector dimensionality  every word -> [......] -> vector size of 15 float numbers
window_context = 5  # Context window size (looking at surrounding words)
min_word_count = 100  # Minimum word count
sg = 1               # skip-gram model if sg = 1 and CBOW if sg = 0

w2v_model = word2vec.Word2Vec(corpus,            #corpus needs to be a list of lists
                              vector_size=feature_size,
                              window=window_context,
                              min_count = min_word_count,
                              sg=sg, epochs=20)
w2v_model

<gensim.models.word2vec.Word2Vec at 0x7bb539af9d10>

In [61]:
vector = w2v_model.wv['engineer']
print(vector.shape)  # Output: (100,) — a 100-dimensional embedding
w2v_model.wv.most_similar('data', topn=10)

(100,)


[('visualization', 0.6835758090019226),
 ('sets', 0.6295908093452454),
 ('databases', 0.6200709342956543),
 ('collection', 0.620064377784729),
 ('analysis', 0.6093461513519287),
 ('sources', 0.6092392802238464),
 ('analytics', 0.5953243374824524),
 ('database', 0.5878210067749023),
 ('datasets', 0.5875391960144043),
 ('intelligence', 0.5858895182609558)]

## Project and plot embeddings
Let's use t-SNE or umap to project the embeddings into a 2 or 3-dim space. For plotting we use an interactive plotly plot.

In [62]:
from plotly import express as px


def plot_embeddings(model, projection="tsne", dim=2, wordlist=None, **kwargs):

    vectors_proj, lables = project_embeddings(
        model, projection=projection, dim=dim, wordlist=wordlist, **kwargs
    )

    if dim == 2:
        plot_2d(vectors_proj, lables)
    elif dim == 3:
        plot_3d(vectors_proj, lables)
    else:
        raise ValueError("Dimension of input vectors has to be 2 or 3.")


def project_embeddings(model, projection="tsne", dim=2, wordlist=None, **kwargs):
    if not wordlist:
        wordlist = model.wv.key_to_index

    lables = [word for word in wordlist]
    vectors = [model.wv[word] for word in wordlist]

    if projection == "tsne":
        vectors_proj = call_tsne(vectors, n_components=dim, **kwargs)
    elif projection == "umap":
        vectors_proj = call_umap(vectors, n_components=dim, **kwargs)
    return vectors_proj, lables


def call_tsne(vectors, n_components, **kwargs):
    arguments = dict(perplexity=40, init="pca", n_iter=2500, random_state=23)
    arguments.update(kwargs)
    tsne_model = TSNE(n_components=n_components, **arguments)
    vectors_proj = tsne_model.fit_transform(vectors)
    return vectors_proj


def call_umap(vectors, n_components, **kwargs):
    arguments = dict(n_neighbors=15, min_dist=0.1, metric="euclidean")
    arguments.update(kwargs)
    umap_model = umap.UMAP(random_state=42, n_components=n_components, **arguments)
    vectors_proj = umap_model.fit_transform(vectors)
    return vectors_proj


def plot_2d(vectors_proj, lables=None):
    x = [vec[0] for vec in vectors_proj]
    y = [vec[1] for vec in vectors_proj]

    fig = px.scatter(x=x, y=y, text=lables)
    fig.update_traces(textposition="top center", textfont_size=10)
    fig.update_layout(height=800, title_text="2d projection of word embeddings")
    fig.show()


def plot_3d(vectors_proj, lables=None):
    x = [vec[0] for vec in vectors_proj]
    y = [vec[1] for vec in vectors_proj]
    z = [vec[2] for vec in vectors_proj]

    fig = px.scatter_3d(x=x, y=y, z=z, text=lables)
    fig.update_traces(textposition="top center", textfont_size=10, marker_size=3)
    fig.update_layout(height=800, title_text="3d projection of word embeddings")
    fig.show()

In [87]:
plot_embeddings(w2v_model, projection="umap", dim=2)


'force_all_finite' was renamed to 'ensure_all_finite' in 1.6 and will be removed in 1.8.


n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.



## Export the word embeddings
This allows us to visualize them at https://projector.tensorflow.org/

![tensorflow_projector_job_adds.gif](attachment:tensorflow_projector_job_adds.gif)

In [65]:
vector = w2v_model.wv['engineer']
print(vector.shape)  # Output: (100,) — a 100-dimensional embedding
w2v_model.wv.most_similar('data', topn=10)

<gensim.models.keyedvectors.KeyedVectors at 0x7bb54223d4d0>

In [71]:
#w2v_model.wv.index_to_key[2:50]
w2v_model.wv["data"]

array([ 0.07165404, -0.11688826,  0.15513936,  0.11122396,  0.44265756,
        0.3198273 , -0.12158367,  0.00801954,  0.17572054,  0.18726271,
       -0.0610051 , -0.13422775,  0.55296904, -0.21588844,  0.09231534,
       -0.145919  ,  0.21636055,  0.24664466,  0.35606107, -0.19366135,
        0.04104408,  0.23611079, -0.0056497 , -0.08578701,  0.03692646,
       -0.17124839, -0.05632437,  0.34226802,  0.17723985, -0.03142549,
        0.3646654 , -0.40713528, -0.3783603 , -0.05301993, -0.46172547,
       -0.15165934, -0.17979036, -0.05640126,  0.13890052,  0.1783857 ,
       -0.10186966, -0.3692162 ,  0.08679207,  0.276292  ,  0.1858998 ,
        0.12024935,  0.12382403, -0.27014053, -0.033753  ,  0.16455454,
        0.22164427, -0.1774511 , -0.09056724, -0.03497285, -0.25621098,
       -0.11302306,  0.05144621,  0.05947344, -0.2787156 , -0.13261327,
       -0.030711  , -0.03452719, -0.08896441,  0.09571179, -0.24071881,
        0.30119544,  0.07850942,  0.18207826,  0.11923852,  0.17

In [76]:
list(w2v_model.wv.index_to_key[0])

['m', 'a', 'n', 'a', 'g', 'e', 'm', 'e', 'n', 't']

In [77]:
lables [:5]

['management', 'project', 'data', 'global', 'english']

In [78]:
# TODO
lables = w2v_model.wv.index_to_key
vectors = w2v_model.wv[lables]

In [84]:
pd.DataFrame(lables).to_csv("lables.tsv", sep="\t", index=False, header=False)
pd.DataFrame(vectors).to_csv("vectors.tsv", sep="\t", index=False, header=False)

Go to https://projector.tensorflow.org/. Click `Load` for uploading the vectors and the labels files.

## Word embeddings - Try different parameters

In [88]:
# A more selective model, word has to be at least 1000 times in the corpus
# Set values for various parameters
feature_size = 200    # Word vector dimensionality  every word -> [......] -> vector size of 15 float numbers
window_context = 5  # Context window size (looking at surrounding words)
min_word_count = 1000  # Minimum word count
sg = 1               # skip-gram model if sg = 1 and CBOW if sg = 0

w2v_model_selective = word2vec.Word2Vec(corpus,            #corpus needs to be a list of lists
                              vector_size=feature_size,
                              window=window_context,
                              min_count = min_word_count,
                              sg=sg, epochs=30)
w2v_model_selective

<gensim.models.word2vec.Word2Vec at 0x7bb4f656f950>

In [93]:
plot_embeddings(w2v_model_selective, projection="umap", dim=2)


'force_all_finite' was renamed to 'ensure_all_finite' in 1.6 and will be removed in 1.8.


n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.



In [91]:
# TODO
lables_sel = w2v_model.wv.index_to_key
vectors_sel = w2v_model.wv[lables_sel]

In [92]:
pd.DataFrame(lables_sel).to_csv("lables_sel.tsv", sep="\t", index=False, header=False)
pd.DataFrame(vectors_sel).to_csv("vectors_sel.tsv", sep="\t", index=False, header=False)

In [94]:
# Creat word embeddings with 300 components

# A more selective model, word has to be at least 1000 times in the corpus
# Set values for various parameters
feature_size = 300    # Word vector dimensionality  every word -> [......] -> vector size of 15 float numbers
window_context = 5  # Context window size (looking at surrounding words)
min_word_count = 1000  # Minimum word count
sg = 1               # skip-gram model if sg = 1 and CBOW if sg = 0

w2v_model_300 = word2vec.Word2Vec(corpus,            #corpus needs to be a list of lists
                              vector_size=feature_size,
                              window=window_context,
                              min_count = min_word_count,
                              sg=sg, epochs=20)
w2v_model_300

<gensim.models.word2vec.Word2Vec at 0x7bb4f606add0>

In [96]:
plot_embeddings(w2v_model_300, projection="umap", dim=2)


'force_all_finite' was renamed to 'ensure_all_finite' in 1.6 and will be removed in 1.8.


n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.



## Let's find some similar words to our query
Create the word embeddings.

In [None]:
# TODO
model = ??

Define a search word and find the most similar words.

In [100]:
# plot the most similiar words
search_word = "python"



# TODO
m_similar = w2v_model.wv.most_similar(search_word, topn=30)
m_similar
#wordlist = ??
# add the word itself
#wordlist.append(search_word)

[('programming', 0.8669495582580566),
 ('java', 0.8001464605331421),
 ('matlab', 0.7845606803894043),
 ('sql', 0.7627591490745544),
 ('c', 0.7345274090766907),
 ('scala', 0.7327268719673157),
 ('r', 0.6926185488700867),
 ('javascript', 0.6920092105865479),
 ('scripting', 0.6761025786399841),
 ('nosql', 0.6617627143859863),
 ('relational', 0.6401916146278381),
 ('angular', 0.6291892528533936),
 ('git', 0.6272966861724854),
 ('coding', 0.6233757138252258),
 ('hadoop', 0.6219736933708191),
 ('linux', 0.6086064577102661),
 ('libraries', 0.6069173216819763),
 ('vba', 0.6034427285194397),
 ('oracle', 0.6009383201599121),
 ('css', 0.5977500677108765),
 ('docker', 0.5941783785820007),
 ('cc', 0.5929298400878906),
 ('html5', 0.5921828746795654),
 ('frontend', 0.5896836519241333),
 ('languages', 0.5890989303588867),
 ('typescript', 0.5837157964706421),
 ('html', 0.5829545259475708),
 ('software', 0.5828973650932312),
 ('databases', 0.582223653793335),
 ('backend', 0.5758131742477417)]

Plot the search word together with the similar words.

In [101]:
plot_embeddings(w2v_model_300, projection="umap", dim=2, wordlist=wordlist)

NameError: name 'wordlist' is not defined

Print the list of similar words.

In [None]:
m_similar