# Algorithms for Massive Data


Project: Finding Similar Items


Michela Mazzaglia, academic year 2023/2024

## Importing libraries

In [1]:
pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=8e690586d661ede5513eb456a6a8e09c1bb6d2188459934c2ec906e24f6021be
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [2]:
pip install findspark

Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1


In [3]:
import pyspark
import findspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark

In [4]:
! pip install -q kaggle

In [5]:
pip install nltk



In [6]:
import numpy as np
import pandas as pd
import re, math
import nltk
from pyspark.sql.functions import udf, length, expr, regexp_extract, collect_list
from pyspark.sql.types import StructType, StructField, StringType, LongType
from pyspark.sql.functions import monotonically_increasing_id, regexp_replace, col, split, size, concat_ws
from pyspark.ml.linalg import Vectors, DenseVector, VectorUDT
from pyspark.ml.feature import MinHashLSH, HashingTF
import random
from sympy import nextprime

## Uploading the dataset

In [None]:
from google.colab import files

files.upload() # upload your kaggle api key

In [8]:
! mkdir ~/.kaggle

! cp kaggle.json ~/.kaggle/

In [9]:
! chmod 600 ~/.kaggle/kaggle.json

In [10]:
!kaggle datasets download -d asaniczka/1-3m-linkedin-jobs-and-skills-2024

Dataset URL: https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024
License(s): ODC Attribution License (ODC-By)
Downloading 1-3m-linkedin-jobs-and-skills-2024.zip to /content
100% 1.88G/1.88G [00:29<00:00, 67.9MB/s]


In [11]:
!unzip -q ./1-3m-linkedin-jobs-and-skills-2024.zip -d .

In [12]:
sc = spark.sparkContext

## Data Cleaning & Preprocessing

In [13]:
df_0 = pd.read_csv('/content/job_summary.csv')

In [14]:
df_0

|  | job_link | job_summary |
|---|---|---|
| 0 | https://www.linkedin.com/jobs/view/restaurant-... | Rock N Roll Sushi is hiring a Restaurant Manag... |
| 1 | https://www.linkedin.com/jobs/view/med-surg-re... | Schedule\n: PRN is required minimum 12 hours p... |
| 2 | https://www.linkedin.com/jobs/view/registered-... | Description\nIntroduction\nAre you looking for... |
| 3 | https://uk.linkedin.com/jobs/view/commercial-a... | Commercial account executive\nSheffield\nFull ... |
| 4 | https://www.linkedin.com/jobs/view/store-manag... | Address:\nUSA-CT-Newington-44 Fenn Road\nStore... |
| ... | ... | ... |
| 1297327 | https://www.linkedin.com/jobs/view/roofing-sup... | We are currently seeking experienced commercia... |
| 1297328 | https://www.linkedin.com/jobs/view/service-cen... | Overview\nStable and growing organization\nCom... |
| 1297329 | https://www.linkedin.com/jobs/view/flight-qual... | Rôle et responsabilités\nJob Description:\nFli... |
| 1297330 | https://www.linkedin.com/jobs/view/global-sour... | Job Description\nAre You Ready to Make It Happ... |


In [15]:
df_0['job_summary'][2]

'Description\nIntroduction\nAre you looking for a place to deliver excellent care patients deserve? At StoneSprings Hospital Center we support our colleagues in their positions. Join our Team as a(an) Registered Nurse Cath Lab and access programs to assist with every stage of your career.\nBenefits\nStoneSprings Hospital Center, offers a total rewards package that supports the health, life, career and retirement of our colleagues. The available plans and programs include:\nComprehensive medical coverage that covers many common services at no cost or for a low copay. Plans include prescription drug and behavioral health coverage as well as free telemedicine services and free AirMed medical transportation.\nAdditional options for dental and vision benefits, life and disability coverage, flexible spending accounts, supplemental health protection plans (accident, critical illness, hospital indemnity), auto and home insurance, identity theft protection, legal counseling, long-term care cove

Loading the data with Spark

In [17]:
schema = StructType([
    StructField("job_link", StringType(), True),
    StructField("job_summary", StringType(), True)
])

In [18]:
spark_df = spark.read.csv(
    '/content/job_summary.csv',
    header=True,
    schema=schema,
    sep=',',       # Specify the delimiter
    quote='"',     # Handle quotes properly
    escape='\\',   # Handle escape characters
    multiLine=True # Handle multiline fields
)

spark_df.show()
spark_df.printSchema()

+--------------------+--------------------+
|            job_link|         job_summary|
+--------------------+--------------------+
|https://www.linke...|Rock N Roll Sushi...|
|https://www.linke...|Schedule\n: PRN i...|
|https://www.linke...|"Description\nInt...|
|HCA Healthcare Co...|                NULL|
|If growth and con...| we encourage you...|
|Unlock the possib...|                NULL|
|We are an equal o...|            religion|
|           Show more|                NULL|
|          Show less"|                NULL|
|https://uk.linked...|Commercial accoun...|
|https://www.linke...|Address:\nUSA-CT-...|
|https://www.linke...|Description\nOur\...|
|https://www.linke...|Company Descripti...|
|https://uk.linked...|An exciting oppor...|
|https://www.linke...|Job Details:\nJob...|
|https://www.linke...|Our\nRestaurant T...|
|https://www.linke...|Our General Manag...|
|https://www.linke...|Earning potential...|
|https://www.linke...|Dollar General Co...|
|https://au.linked...|Restaurant

In [19]:
spark_df = spark_df.dropna(subset=['job_summary'])

In [20]:
spark_df = spark_df.withColumn('job_summary', regexp_replace(col('job_summary'), '\n', ' '))

In [21]:
spark_df = spark_df.withColumn('job_words', split(col('job_summary'), ' '))

In [22]:
spark_df = spark_df.filter(size(col('job_words')) > 4)

In [23]:
spark_df = spark_df.select('job_words').withColumn('doc_id', monotonically_increasing_id())

In [24]:
spark_df.show()

+--------------------+------+
|           job_words|doc_id|
+--------------------+------+
|[Rock, N, Roll, S...|     0|
|[Schedule, :, PRN...|     1|
|["Description, In...|     2|
|[, we, encourage,...|     3|
|[Commercial, acco...|     4|
|[Address:, USA-CT...|     5|
|[Description, Our...|     6|
|[Company, Descrip...|     7|
|[An, exciting, op...|     8|
|[Job, Details:, J...|     9|
|[Our, Restaurant,...|    10|
|[Our, General, Ma...|    11|
|[Earning, potenti...|    12|
|[Dollar, General,...|    13|
|[Restaurant, Desc...|    14|
|[Who, We, Are, We...|    15|
|[A, Place, Where,...|    16|
|[Description, The...|    17|
|["Overview, Descr...|    18|
|[, seat, them, at...|    19|
+--------------------+------+
only showing top 20 rows



In [25]:
df_scaled = spark_df.sample(withReplacement=True, fraction=0.00005, seed=42)

In [26]:
df_scaled = df_scaled.withColumn('job_words', concat_ws(' ', col('job_words')))

## Text Preprocessing

In [27]:
nltk.download('stopwords')
nltk.download('punkt')
stopwords = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [28]:
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)  # Remove extra whitespace
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    tokens = nltk.word_tokenize(text)
    tokens = [word for word in tokens if word not in stopwords]  # Remove stopwords
    return ' '.join(tokens)

In [29]:
df_udf = udf(preprocess_text, StringType())

In [30]:
rdd = df_scaled.select('doc_id', 'job_words').rdd.map(lambda row: (row['doc_id'], preprocess_text(row['job_words'])))

In [31]:
rdd.toDF().show()

+------+--------------------+
|    _1|                  _2|
+------+--------------------+
| 13395|grass greener sec...|
| 34993|client childrens ...|
| 47677|possess valid cla...|
| 64146|demonstrates inte...|
| 69905|new adventure awa...|
|104317|details signon bo...|
|143336|parttime intermit...|
|152380|conference center...|
|188603|scheduled nclex w...|
|218886|northern tier hig...|
|242951|experience corps ...|
|252827|role sr serviceno...|
|273921|us join us winco ...|
|278072|win sport school ...|
|304884|job details job l...|
|328483|department econom...|
|451330|zoom take great c...|
|455372|immediate start e...|
|459748|northrop grumman ...|
|459967|position interest...|
+------+--------------------+
only showing top 20 rows



### Shingles

In [32]:
import binascii

In [33]:
k = 7

In [34]:
shingles_rdd = rdd.flatMap(lambda doc: [(doc[0], doc[1][i:i+k]) for i in range(len(doc[1]) - k + 1)])

In [35]:
shingles_rdd.take(7)

[(13395, 'grass g'),
 (13395, 'rass gr'),
 (13395, 'ass gre'),
 (13395, 'ss gree'),
 (13395, 's green'),
 (13395, ' greene'),
 (13395, 'greener')]
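
The shingling step can also be tried outside Spark. The following is a minimal plain-Python sketch of character k-shingles and the Jaccard similarity they are meant to support; the two example strings and the helper names `char_shingles` and `jaccard` are illustrative assumptions, not part of the notebook.

```python
def char_shingles(text: str, k: int = 7) -> set:
    """Return the set of all length-k character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B| (taken as 0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


doc_a = "grass greener second shift"  # hypothetical example text
doc_b = "grass greener first shift"   # hypothetical example text

shingles_a = char_shingles(doc_a)
shingles_b = char_shingles(doc_b)
similarity = jaccard(shingles_a, shingles_b)
print(sorted(shingles_a)[:3], round(similarity, 3))
```

Texts shorter than `k` characters produce an empty shingle set, so very short documents cannot match anything.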

In [36]:
shingles_list = (rdd
                 .groupByKey()
                 .map(lambda x: (x[0], list(x[1]))))

In [37]:
shingles_list.toDF().show()

+------+--------------------+
|    _1|                  _2|
+------+--------------------+
| 13395|[grass greener se...|
| 34993|[client childrens...|
| 47677|[possess valid cl...|
| 64146|[demonstrates int...|
| 69905|[new adventure aw...|
|104317|[details signon b...|
|143336|[parttime intermi...|
|152380|[conference cente...|
|188603|[scheduled nclex ...|
|218886|[northern tier hi...|
|242951|[experience corps...|
|252827|[role sr servicen...|
|273921|[us join us winco...|
|278072|[win sport school...|
|304884|[job details job ...|
|328483|[department econo...|
|451330|[zoom take great ...|
|455372|[immediate start ...|
|459748|[northrop grumman...|
|459967|[position interes...|
+------+--------------------+
only showing top 20 rows



Characteristic matrix

In [38]:
def hash_shingle(shingle):
    # CRC32 of the UTF-8 bytes, masked to an unsigned 32-bit integer
    return binascii.crc32(shingle.encode('utf-8')) & 0xffffffff

In [39]:
hash_rdd = shingles_rdd.mapValues(hash_shingle).distinct()

In [40]:
hash_rdd.take(7)

[(13395, 1487558144),
 (13395, 1311898750),
 (13395, 1309835623),
 (13395, 3628940306),
 (13395, 1375209922),
 (13395, 1081128875),
 (13395, 745903883)]
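
As a small illustration of this step, the sketch below hashes a few shingles with CRC32 and keeps the distinct values as a set. Each such set can be read as one column of the characteristic matrix: it lists the rows (shingle hashes) where that document holds a 1. The input list is a hypothetical example. In Python 3, `binascii.crc32` already returns an unsigned value, so the `& 0xffffffff` mask is a harmless no-op and is left out here.

```python
import binascii


def hash_shingle(shingle: str) -> int:
    """Map a shingle to an unsigned 32-bit integer with CRC32."""
    return binascii.crc32(shingle.encode("utf-8"))


shingles = ["grass g", "rass gr", "ass gre", "grass g"]  # note the repeated shingle

# Distinct hashed shingles of one document: the non-zero rows of its column
# in the (sparse) characteristic matrix.
column = {hash_shingle(s) for s in shingles}
print(len(column), all(0 <= h < 2**32 for h in column))
```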

In [41]:
df = hash_rdd.toDF(["doc_id", "hashed_shingle"]) \
    .groupBy("doc_id") \
    .agg(collect_list("hashed_shingle").alias("hashed_shingles"))

In [42]:
df.show()

+-------+--------------------+
| doc_id|     hashed_shingles|
+-------+--------------------+
| 608814|[223365557, 29428...|
| 304884|[150254450, 40235...|
|1663266|[1282415320, 2631...|
|1618038|[1809481451, 9370...|
|1152253|[1054767446, 5483...|
|  64146|[2843215090, 2482...|
| 773878|[4093678808, 3639...|
| 218886|[944620615, 29134...|
|  69905|[83770995, 469571...|
| 909604|[19356233, 303282...|
| 872160|[3793157859, 2783...|
| 684947|[2956170883, 3798...|
|1460189|[3459875643, 2048...|
|1388218|[2936512205, 2973...|
| 898066|[1944298662, 1054...|
| 273921|[1718515538, 1888...|
|  47677|[3428590799, 3060...|
| 525916|[1060420630, 2895...|
|1450243|[2818205730, 5881...|
|1621468|[2650994303, 3919...|
+-------+--------------------+
only showing top 20 rows



In [43]:
def collect_shingles(a, b):
    return a + b

In [44]:
hashed_shingles_rdd = hash_rdd.map(lambda x: (x[0], [x[1]])).reduceByKey(collect_shingles)

In [None]:
hashed_shingles_rdd.take(7)

In [46]:
hashed_shingles_list = hashed_shingles_rdd.flatMap(lambda x: x[1]).collect()

### MinHash


---
MinHash gives a fast approximation of the Jaccard similarity coefficient between any two finite sets.
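
To make the idea concrete, here is a minimal plain-Python sketch, independent of the Spark cells that follow. It builds MinHash signatures for two sets and compares the fraction of matching signature positions with the exact Jaccard similarity. The modulus `PRIME` (the smallest prime above 2**32), the 256 hash functions, the seed, and the synthetic input sets are all assumptions chosen for illustration.

```python
import random
import zlib

PRIME = 4294967311  # smallest prime above 2**32 (assumed modulus for this sketch)


def make_hash_params(n: int, seed: int = 0) -> list:
    """Draw n (a, b) pairs for hash functions h(x) = (a*x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def signature(items: set, hash_params: list) -> list:
    """MinHash signature: for each hash function, the minimum hash over the set."""
    return [min((a * x + b) % PRIME for x in items) for a, b in hash_params]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions on which the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Two hypothetical sets of 32-bit shingle hashes sharing 200 of 400 elements,
# so the true Jaccard similarity is 0.5 (barring CRC32 collisions).
set_a = {zlib.crc32(f"shingle-{i}".encode()) for i in range(0, 300)}
set_b = {zlib.crc32(f"shingle-{i}".encode()) for i in range(100, 400)}

hash_params = make_hash_params(256)
estimate = estimate_jaccard(signature(set_a, hash_params), signature(set_b, hash_params))
true_value = len(set_a & set_b) / len(set_a | set_b)
print(f"true={true_value:.3f} estimate={estimate:.3f}")
```

The estimate fluctuates around the true value, and the spread shrinks as the number of hash functions grows.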




In [65]:
h_functions = 100
b_bands = 10

In [48]:
def multiple(x, a, b, c):
    # Linear hash function h(x) = (a * x + b) mod c
    return (a * x + b) % c

In [66]:
params = []

for _ in range(h_functions):
    # Random coefficients for h(x) = (a * x + b) % c
    a = random.randint(1, 10000)
    b = random.randint(1, 10000)
    max_ab = max(a, b)
    c = nextprime(max_ab + 1)  # modulus: a prime larger than both a and b
    params.append({"a": a, "b": b, "c": c})

params[:7]

[{'a': 1629, 'b': 6850, 'c': 6857},
 {'a': 463, 'b': 973, 'c': 977},
 {'a': 9714, 'b': 1647, 'c': 9719},
 {'a': 5532, 'b': 6526, 'c': 6529},
 {'a': 5065, 'b': 8326, 'c': 8329},
 {'a': 2597, 'b': 3460, 'c': 3463},
 {'a': 7110, 'b': 3968, 'c': 7121}]
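
One property of this hash family worth keeping in mind is that the modulus `c` caps the number of distinct hash values at `c`. The sketch below compares the second parameter set printed above (`a=463, b=973, c=977`) with the same `a` and `b` under an assumed prime modulus just above 2**32. The input range is a hypothetical stand-in for 5000 distinct hashed shingles.

```python
def linear_hash(x: int, a: int, b: int, c: int) -> int:
    """Same form as `multiple` in the notebook: (a*x + b) mod c."""
    return (a * x + b) % c


inputs = range(5000)  # hypothetical stand-in for 5000 distinct hashed shingles

# Parameters taken from the second entry printed above.
small = [linear_hash(x, a=463, b=973, c=977) for x in inputs]

# Same a and b, with an assumed prime modulus just above 2**32.
large = [linear_hash(x, a=463, b=973, c=4294967311) for x in inputs]

print(len(set(small)), min(small))  # 977 0
print(len(set(large)), min(large))  # 5000 973
```

With the small modulus, many inputs collide and the minimum over a large set tends toward 0. With the larger modulus, these inputs keep distinct hash values.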

In [67]:
def enum_shingles(e, hashed_shingles_list):
    doc_id, shingle = e
    # Emit one (doc_id, hash-function index, full shingle list) row per hash function
    return [(doc_id, h, hashed_shingles_list) for h in range(h_functions)]

In [68]:
minhash_matrix = hash_rdd.flatMap(lambda e: enum_shingles(e, hashed_shingles_list))

In [69]:
minhash_matrix.toDF().show()

+-----+---+--------------------+
|   _1| _2|                  _3|
+-----+---+--------------------+
|13395|  0|[1487558144, 1311...|
|13395|  1|[1487558144, 1311...|
|13395|  2|[1487558144, 1311...|
|13395|  3|[1487558144, 1311...|
|13395|  4|[1487558144, 1311...|
|13395|  5|[1487558144, 1311...|
|13395|  6|[1487558144, 1311...|
|13395|  7|[1487558144, 1311...|
|13395|  8|[1487558144, 1311...|
|13395|  9|[1487558144, 1311...|
|13395| 10|[1487558144, 1311...|
|13395| 11|[1487558144, 1311...|
|13395| 12|[1487558144, 1311...|
|13395| 13|[1487558144, 1311...|
|13395| 14|[1487558144, 1311...|
|13395| 15|[1487558144, 1311...|
|13395| 16|[1487558144, 1311...|
|13395| 17|[1487558144, 1311...|
|13395| 18|[1487558144, 1311...|
|13395| 19|[1487558144, 1311...|
+-----+---+--------------------+
only showing top 20 rows



In [70]:
def minhash_map(docId_hashedShingles):
    doc_id, hashed_shingles = docId_hashedShingles
    minhashes = []
    for h in range(h_functions):
        min_h = math.inf
        for shingle in hashed_shingles:
            hash_value = multiple(shingle, **params[h])
            if hash_value < min_h:
                min_h = hash_value
        minhashes.append(min_h)
    return (doc_id, minhashes)

In [71]:
sig_matrix_rdd = hashed_shingles_rdd.map(minhash_map)

In [72]:
signature_df = sig_matrix_rdd.toDF(["doc_id", "minhashes"])

In [73]:
signature_df.show()

+------+--------------------+
|doc_id|           minhashes|
+------+--------------------+
| 13395|[1, 0, 1, 1, 0, 3...|
| 34993|[0, 0, 1, 0, 1, 3...|
| 47677|[38, 1, 7, 23, 5,...|
| 64146|[71, 15, 108, 3, ...|
| 69905|[0, 0, 1, 0, 0, 7...|
|104317|[1, 0, 1, 6, 4, 0...|
|143336|[47, 26, 455, 81,...|
|152380|[272, 111, 199, 3...|
|188603|[332, 43, 428, 18...|
|218886|[2, 0, 2, 2, 4, 1...|
|242951|[3, 0, 12, 3, 3, ...|
|252827|[3, 0, 14, 1, 8, ...|
|273921|[0, 0, 2, 0, 3, 0...|
|278072|[9, 1, 16, 1, 1, ...|
|304884|[3, 0, 0, 0, 4, 0...|
|328483|[1, 0, 3, 0, 2, 1...|
|451330|[0, 0, 2, 2, 5, 0...|
|455372|[0, 2, 13, 0, 4, ...|
|459748|[4, 2, 3, 2, 4, 0...|
|459967|[5, 0, 1, 2, 0, 0...|
+------+--------------------+
only showing top 20 rows



### Locality-Sensitive Hashing

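Define the threshold. With b bands of r rows each, the S-curve of the banding scheme rises most steeply around a similarity of (1/b)^(1/r), which serves as the approximate cutoff for candidate pairs.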

In [74]:
# having h_functions = 100 and b_bands = 10
n_per_bands = h_functions // b_bands
threshold = (1/b_bands) ** (1/n_per_bands)

In [75]:
print("Threshold for candidate pairs: ", threshold)

Threshold for candidate pairs:  0.7943282347242815
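
To see what this threshold means in practice, the sketch below evaluates the standard banding S-curve, 1 - (1 - s^r)^b, for a few similarity values. The function name `candidate_probability` and the sample similarities are illustrative assumptions; `bands` and `rows` mirror `b_bands` and `n_per_bands` from the cell above.

```python
def candidate_probability(s: float, bands: int, rows: int) -> float:
    """Probability that two documents with Jaccard similarity s agree on
    every row of at least one band, and so become a candidate pair."""
    return 1.0 - (1.0 - s ** rows) ** bands


bands, rows = 10, 10  # b_bands and n_per_bands from the cell above
threshold = (1 / bands) ** (1 / rows)

for s in (0.5, 0.7, threshold, 0.9):
    print(f"s={s:.3f}  P(candidate)={candidate_probability(s, bands, rows):.3f}")
```

At the threshold itself, s^r equals 1/b, so the probability is 1 - (1 - 1/b)^b, which is about 0.65 for b = 10.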


Using the MinHashLSH provided by PySpark

In [76]:
array_to_vector_udf = udf(lambda x: DenseVector(x), VectorUDT())
signature_df = signature_df.withColumn("minhash_vector", array_to_vector_udf(col("minhashes")))  # convert each signature array to a DenseVector

In [77]:
signature_df = signature_df.drop("minhashes")

In [78]:
mh = MinHashLSH(inputCol="minhash_vector", outputCol="hashes", numHashTables=b_bands*n_per_bands, seed=56)

In [79]:
model = mh.fit(signature_df)

In [80]:
lsh_df = model.transform(signature_df)
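
For comparison, the banding scheme that the threshold formula assumes can be sketched directly on signatures in plain Python. This is not how `MinHashLSH` is implemented; it is a hedged illustration only, and the helper name `lsh_candidates` and the toy signatures are assumptions. The band contents are used as the bucket key directly instead of being hashed again.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidates(signatures: dict, bands: int) -> set:
    """Return pairs of doc ids whose signatures agree on every row of at least one band."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        rows = len(sig) // bands
        for band in range(bands):
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[(band, chunk)].add(doc_id)  # bucket key = band index + band contents

    pairs = set()
    for docs in buckets.values():
        pairs.update(combinations(sorted(docs), 2))
    return pairs


# Hypothetical 6-value signatures split into 3 bands of 2 rows each.
signatures = {
    "doc_a": [1, 4, 2, 7, 5, 9],
    "doc_b": [1, 4, 3, 8, 6, 0],  # matches doc_a on band 0
    "doc_c": [2, 5, 3, 9, 7, 1],  # matches nobody on a full band
}
print(lsh_candidates(signatures, bands=3))  # {('doc_a', 'doc_b')}
```

Only pairs that share a bucket are compared further, which is what avoids the quadratic all-pairs comparison.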

## Finding similar jobs

In [81]:
def find_similar_jobs(lsh_df, model, threshold):
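    # Self-join: returns pairs whose Jaccard distance is below `threshold`
    similar_items = model.approxSimilarityJoin(lsh_df, lsh_df, threshold, distCol="JaccardDistance")

    # Keep each unordered pair once (doc_id_A < doc_id_B), which also drops self-pairs;
    # the distance filter below repeats the cutoff already applied by the join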
    similar_pairs = (similar_items
                     .select(
                         col("datasetA.doc_id").alias("doc_id_A"),
                         col("datasetB.doc_id").alias("doc_id_B"),
                         col("JaccardDistance"))
                     .filter(col("doc_id_A") < col("doc_id_B"))
                     .filter(col("JaccardDistance") <= threshold)
                     .rdd
                     .map(lambda row: (row["doc_id_A"], row["doc_id_B"], row["JaccardDistance"]))
                     .collect())

    return similar_pairs

In [82]:
similar_pairs = find_similar_jobs(lsh_df, model, threshold)

In [None]:
similar_pairs

### Some results

In [83]:
similar_df = pd.DataFrame(similar_pairs, columns=['doc_id1', 'doc_id2', 'jaccard_distance'])

In [84]:
similar_df

|  | doc_id1 | doc_id2 | jaccard_distance |
|---|---|---|---|
| 0 | 47677 | 278072 | 0.153061 |
| 1 | 47677 | 950765 | 0.212121 |
| 2 | 188603 | 1152253 | 0.370000 |
| 3 | 273921 | 1371266 | 0.580247 |
| 4 | 451330 | 584862 | 0.191919 |
| ... | ... | ... | ... |
| 2770 | 1038199 | 1572773 | 0.380952 |
| 2771 | 1129756 | 1750168 | 0.306122 |
| 2772 | 464892 | 1651499 | 0.651685 |
| 2773 | 912495 | 1331267 | 0.525253 |
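
When reading the `jaccard_distance` column, note that Jaccard distance is 1 minus Jaccard similarity, so smaller values mean more similar pairs. For example, a distance of 0.153061 corresponds to a similarity of 0.846939. The `threshold` argument of `approxSimilarityJoin` is likewise a distance cutoff, so a minimum similarity t would correspond to a maximum distance of 1 - t. The sketch below illustrates the conversion; the helper names and the toy sets are assumptions.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance 1 - |A & B| / |A | B| (taken as 0.0 for two empty sets)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def max_distance_for(similarity_threshold: float) -> float:
    """Distance cutoff equivalent to a minimum-similarity cutoff."""
    return 1.0 - similarity_threshold


a = {1, 2, 3, 4}  # hypothetical sets of hashed shingles
b = {3, 4, 5, 6}
print(jaccard_distance(a, b))  # 1 - 2/6
print(round(max_distance_for(0.7943282347242815), 4))
```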


In [86]:
for pair in similar_pairs[:2]:
    doc1 = pair[0]
    doc2 = pair[1]
    dist = pair[2]

    # Filter DataFrame to retrieve text for doc1 and doc2
    doc1_text = df_scaled.filter(col('doc_id') == doc1).select('job_words').first()[0]
    doc2_text = df_scaled.filter(col('doc_id') == doc2).select('job_words').first()[0]

    # Display job summaries for doc1 and doc2
    print("Document 1:")
    print(doc1_text)
    print("\nDocument 2:")
    print(doc2_text)
    print("\nJaccard Distance: {:.4f}".format(dist))
    print("\n---")


Document 1:
 AND possess a valid Class C California driver's license with a safe driving record or driver's license from another state with a safe driving record.

Document 2:
WIN Sport School Laval , école de management du sport, forme les passionnés de sport et sportifs de haut niveau au management, marketing, événementiel sportif, sponsoring et développement commercial. Vous pourrez également renforcer votre employabilité grâce à l’alternance. Nous recherchons pour une entreprise partenaire, un Apprenti Manager Sport H/F . Vous souhaitez vous orienter vers une formation professionnalisante ? Dès septembre 2022, profitez des avantages d’un contrat d'apprentissage pour vous former et acquérir un BAC+3 Bachelor Bachelor Marketing sportif ( Titre certifié de niveau 6, reconnu par l’Etat et validant 180 crédits ECTS) . MISSIONS Veille concurrentielle Développement et prospection Recherche de partenaires Gestion des réseaux sociaux Support dans la mise en place de stratégie de communicati