# Algorithms for Massive Data


Project: Finding Similar Items


Michela Mazzaglia, academic year 2023/2024

## Importing libraries

In [None]:
pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.0/317.0 MB 3.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=feb75d9b66993a85e2f14c37b4cbe6bef5bf469b3c0801b8e0825e0d94451e4c
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [None]:
pip install findspark

Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1


In [None]:
import pyspark
import findspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark

In [None]:
! pip install -q kaggle

In [None]:
pip install pyspark nltk



In [None]:
import numpy as np
import pandas as pd
import re, math
import nltk
from pyspark.sql.functions import udf, length, expr, regexp_extract, collect_list
from pyspark.sql.types import StructType, StructField, StringType, LongType
from pyspark.sql.functions import monotonically_increasing_id, regexp_replace, col, split, size, concat_ws
from pyspark.ml.linalg import Vectors, DenseVector, VectorUDT
from pyspark.ml.feature import MinHashLSH, HashingTF
import random
from sympy import nextprime

## Uploading the dataset

In [None]:
from google.colab import files

files.upload() # upload your kaggle api key

In [None]:
! mkdir ~/.kaggle

! cp kaggle.json ~/.kaggle/

In [None]:
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets download -d asaniczka/1-3m-linkedin-jobs-and-skills-2024

Dataset URL: https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024
License(s): ODC Attribution License (ODC-By)
Downloading 1-3m-linkedin-jobs-and-skills-2024.zip to /content
100% 1.87G/1.88G [00:30<00:00, 87.5MB/s]
100% 1.88G/1.88G [00:30<00:00, 65.4MB/s]


In [None]:
!unzip -q ./1-3m-linkedin-jobs-and-skills-2024.zip -d .

In [None]:
sc = spark.sparkContext

## Data Cleaning & Preprocessing

In [None]:
df_0 = pd.read_csv('/content/job_summary.csv')

In [None]:
df_0

Unnamed: 0,job_link,job_summary
0,https://www.linkedin.com/jobs/view/restaurant-...,Rock N Roll Sushi is hiring a Restaurant Manag...
1,https://www.linkedin.com/jobs/view/med-surg-re...,Schedule\n: PRN is required minimum 12 hours p...
2,https://www.linkedin.com/jobs/view/registered-...,Description\nIntroduction\nAre you looking for...
3,https://uk.linkedin.com/jobs/view/commercial-a...,Commercial account executive\nSheffield\nFull ...
4,https://www.linkedin.com/jobs/view/store-manag...,Address:\nUSA-CT-Newington-44 Fenn Road\nStore...
...,...,...
1297327,https://www.linkedin.com/jobs/view/roofing-sup...,We are currently seeking experienced commercia...
1297328,https://www.linkedin.com/jobs/view/service-cen...,Overview\nStable and growing organization\nCom...
1297329,https://www.linkedin.com/jobs/view/flight-qual...,Rôle et responsabilités\nJob Description:\nFli...
1297330,https://www.linkedin.com/jobs/view/global-sour...,Job Description\nAre You Ready to Make It Happ...


In [None]:
df_0['job_summary'][2]

'Description\nIntroduction\nAre you looking for a place to deliver excellent care patients deserve? At StoneSprings Hospital Center we support our colleagues in their positions. Join our Team as a(an) Registered Nurse Cath Lab and access programs to assist with every stage of your career.\nBenefits\nStoneSprings Hospital Center, offers a total rewards package that supports the health, life, career and retirement of our colleagues. The available plans and programs include:\nComprehensive medical coverage that covers many common services at no cost or for a low copay. Plans include prescription drug and behavioral health coverage as well as free telemedicine services and free AirMed medical transportation.\nAdditional options for dental and vision benefits, life and disability coverage, flexible spending accounts, supplemental health protection plans (accident, critical illness, hospital indemnity), auto and home insurance, identity theft protection, legal counseling, long-term care cove

Spark

In [None]:
schema = StructType([
    StructField("job_link", StringType(), True),
    StructField("job_summary", StringType(), True)
])

In [None]:
spark_df = spark.read.csv(
    '/content/job_summary.csv',
    header=True,
    schema=schema,
    sep=',',       # Specify the delimiter
    quote='"',     # Handle quotes properly
    escape='\\',   # Handle escape characters
    multiLine=True # Handle multiline fields
)

spark_df.show()
spark_df.printSchema()

+--------------------+--------------------+
|            job_link|         job_summary|
+--------------------+--------------------+
|https://www.linke...|Rock N Roll Sushi...|
|https://www.linke...|Schedule\n: PRN i...|
|https://www.linke...|"Description\nInt...|
|HCA Healthcare Co...|                NULL|
|If growth and con...| we encourage you...|
|Unlock the possib...|                NULL|
|We are an equal o...|            religion|
|           Show more|                NULL|
|          Show less"|                NULL|
|https://uk.linked...|Commercial accoun...|
|https://www.linke...|Address:\nUSA-CT-...|
|https://www.linke...|Description\nOur\...|
|https://www.linke...|Company Descripti...|
|https://uk.linked...|An exciting oppor...|
|https://www.linke...|Job Details:\nJob...|
|https://www.linke...|Our\nRestaurant T...|
|https://www.linke...|Our General Manag...|
|https://www.linke...|Earning potential...|
|https://www.linke...|Dollar General Co...|
|https://au.linked...|Restaurant
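
The rows above whose `job_link` holds sentence fragments ("Show more", "We are an equal o...") and whose `job_summary` is NULL suggest that some quoted, multi-line summaries are being split into several records. One plausible cause, stated here as an assumption rather than a verified diagnosis, is that the file escapes embedded quotes by doubling them (the convention Python's `csv` module and pandas use by default), while the reader above is told that the escape character is a backslash. The sketch below uses only the standard library and a hypothetical record to show what the doubled-quote convention looks like on disk.

```python
import csv
import io

# Hypothetical record: the summary contains both a quote and a newline.
row = ["https://example.com/job/1", 'Join our "Team"\nShow more']

# Write it the way csv.writer (and pandas.to_csv) does by default.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
raw = buffer.getvalue()
print(repr(raw))  # embedded quotes appear doubled inside the quoted field

# Reading back with the same convention restores the two original fields.
parsed = next(csv.reader(io.StringIO(raw)))
print(parsed == row)
```

If the dataset does follow this convention, passing `escape='"'` to `spark.read.csv` would be the corresponding setting to try; whether it resolves the split rows here would need to be checked against the data.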

In [None]:
spark_df = spark_df.dropna(subset=['job_summary'])

In [None]:
spark_df = spark_df.withColumn('job_summary', regexp_replace(col('job_summary'), '\n', ' '))

In [None]:
spark_df = spark_df.withColumn('job_words', split(col('job_summary'), ' '))

In [None]:
spark_df = spark_df.filter(size(col('job_words')) > 4)

In [None]:
spark_df = spark_df.select('job_words').withColumn('doc_id', monotonically_increasing_id())

In [None]:
spark_df.show()

+--------------------+------+
|           job_words|doc_id|
+--------------------+------+
|[Rock, N, Roll, S...|     0|
|[Schedule, :, PRN...|     1|
|["Description, In...|     2|
|[, we, encourage,...|     3|
|[Commercial, acco...|     4|
|[Address:, USA-CT...|     5|
|[Description, Our...|     6|
|[Company, Descrip...|     7|
|[An, exciting, op...|     8|
|[Job, Details:, J...|     9|
|[Our, Restaurant,...|    10|
|[Our, General, Ma...|    11|
|[Earning, potenti...|    12|
|[Dollar, General,...|    13|
|[Restaurant, Desc...|    14|
|[Who, We, Are, We...|    15|
|[A, Place, Where,...|    16|
|[Description, The...|    17|
|["Overview, Descr...|    18|
|[, seat, them, at...|    19|
+--------------------+------+
only showing top 20 rows



In [None]:
df_scaled = spark_df.sample(withReplacement=True, fraction=0.00002, seed=42)

In [None]:
df_scaled = df_scaled.limit(2000)

In [None]:
df_scaled = df_scaled.withColumn('job_words', concat_ws(' ', col('job_words')))

## Preprocessing

In [None]:
nltk.download('stopwords')
nltk.download('punkt')
stopwords = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)  # Remove extra whitespace
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    tokens = nltk.word_tokenize(text)
    tokens = [word for word in tokens if word not in stopwords]  # Remove stopwords
    return ' '.join(tokens)

In [None]:
df_udf = udf(preprocess_text, StringType())

In [None]:
rdd = df_scaled.select('doc_id', 'job_words').rdd.map(lambda row: (row['doc_id'], preprocess_text(row['job_words'])))

In [None]:
rdd.toDF().show()

+------+--------------------+
|    _1|                  _2|
+------+--------------------+
| 13395|grass greener sec...|
| 34993|client childrens ...|
| 47677|possess valid cla...|
|104317|details signon bo...|
|143336|parttime intermit...|
|152380|conference center...|
|218886|northern tier hig...|
|242951|experience corps ...|
|278072|win sport school ...|
|455372|immediate start e...|
|459748|northrop grumman ...|
|464892|5000 sign bonus e...|
|525916|job summary irb c...|
|544349|45 pounds heavy c...|
|584862|sensitive compart...|
|608814|llcs development ...|
|684947|looking derivativ...|
|756249|overview job summ...|
|898066|focus remains fir...|
|912495|position features...|
+------+--------------------+
only showing top 20 rows



### Shingles

In [None]:
import binascii

In [None]:
k = 3

In [None]:
shingles_rdd = rdd.flatMap(lambda doc: [(doc[0], doc[1][i:i+k]) for i in range(len(doc[1]) - k + 1)])

In [None]:
shingles_rdd.take(7)

[(13395, 'gra'),
 (13395, 'ras'),
 (13395, 'ass'),
 (13395, 'ss '),
 (13395, 's g'),
 (13395, ' gr'),
 (13395, 'gre')]
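
The same character-level shingling can be sketched without Spark. The helper name `char_shingles` and the sample string are illustrative only; the sketch returns a set, so repeated shingles within a document are counted once.

```python
def char_shingles(text, k=3):
    """Return the set of all length-k character substrings of `text`."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

# Sample text chosen to mirror the first shingles shown above.
shingles = char_shingles("grass greener", k=3)
print(sorted(shingles))

# A text shorter than k produces no shingles at all.
print(char_shingles("ab", k=3))
```

Note that spaces are part of the shingles (for example `'ss '` and `' gr'`), which is why the whitespace normalisation in the preprocessing step matters.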

In [None]:
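# Note: this groups the preprocessed documents in `rdd`, not `shingles_rdd`,
# so each list holds the full document text rather than its shingles.
shingles_list = (rdd
                 .groupByKey()
                 .map(lambda x: (x[0], list(x[1]))))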

In [None]:
shingles_list.toDF().show()

+------+--------------------+
|    _1|                  _2|
+------+--------------------+
| 13395|[grass greener se...|
| 34993|[client childrens...|
| 47677|[possess valid cl...|
|104317|[details signon b...|
|143336|[parttime intermi...|
|152380|[conference cente...|
|218886|[northern tier hi...|
|242951|[experience corps...|
|278072|[win sport school...|
|455372|[immediate start ...|
|459748|[northrop grumman...|
|464892|[5000 sign bonus ...|
|525916|[job summary irb ...|
|544349|[45 pounds heavy ...|
|584862|[sensitive compar...|
|608814|[llcs development...|
|684947|[looking derivati...|
|756249|[overview job sum...|
|898066|[focus remains fi...|
|912495|[position feature...|
+------+--------------------+
only showing top 20 rows



Characteristic matrix

In [None]:
def hash_shingle(shingle):
  return binascii.crc32(shingle.encode('utf-8')) & 0xffffffff

In [None]:
hash_rdd = shingles_rdd.mapValues(hash_shingle).distinct()

In [None]:
hash_rdd.take(7)

[(13395, 2506444301),
 (13395, 501096268),
 (13395, 2068476598),
 (13395, 2525627878),
 (13395, 4020559682),
 (13395, 1404793474),
 (13395, 2450033172)]

In [None]:
df = hash_rdd.toDF(["doc_id", "hashed_shingle"]) \
    .groupBy("doc_id") \
    .agg(collect_list("hashed_shingle").alias("hashed_shingles"))

In [None]:
df.show()

+-------+--------------------+
| doc_id|     hashed_shingles|
+-------+--------------------+
| 608814|[2743074591, 9616...|
|1663266|[210272711, 33932...|
|1152253|[2189397276, 1432...|
| 218886|[3781580864, 4138...|
| 684947|[2179260663, 2228...|
|1388218|[4006809354, 4168...|
| 898066|[2244290826, 3530...|
|  47677|[2161764012, 1909...|
| 525916|[4225294584, 3496...|
|1749766|[2103224315, 1335...|
|1087080|[1689700070, 1016...|
| 104317|[4286418985, 3233...|
| 455372|[1528242610, 1388...|
| 756249|[4168556200, 1633...|
| 544349|[1165784608, 4089...|
|  34993|[1224102204, 9298...|
|  13395|[2506444301, 5010...|
|1090315|[2041764424, 3846...|
|1199836|[822850575, 13000...|
| 950765|[2179260663, 2228...|
+-------+--------------------+
only showing top 20 rows
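
Each row above is effectively one sparse column of the characteristic matrix: the set of hashed shingles present in a document. As a baseline for the approximations that follow, the exact Jaccard similarity of two such sets can be computed directly. This is a minimal pure-Python sketch; `hashed_shingle_set` and `jaccard` are hypothetical helper names, and the two input strings are toy data.

```python
import binascii

def hashed_shingle_set(text, k=3):
    """Hash every length-k character shingle of `text` with CRC32."""
    return {
        binascii.crc32(text[i:i + k].encode("utf-8")) & 0xffffffff
        for i in range(len(text) - k + 1)
    }

def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

doc_a = hashed_shingle_set("abcd")  # shingles: abc, bcd
doc_b = hashed_shingle_set("abce")  # shingles: abc, bce
print(jaccard(doc_a, doc_b))        # one shared shingle out of three distinct ones
```

Computing this for every pair of documents is quadratic in the number of documents, which is the cost MinHash and LSH are meant to avoid.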



In [None]:
def collect_shingles(a, b):
    return a + b

In [None]:
hashed_shingles_rdd = hash_rdd.map(lambda x: (x[0], [x[1]])).reduceByKey(collect_shingles)

In [None]:
hashed_shingles_rdd.take(7)

In [None]:
hashed_shingles_list = hashed_shingles_rdd.flatMap(lambda x: x[1]).collect()

### MinHash


---
MinHash is a fast approximation of the Jaccard similarity coefficient between two finite sets.
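
For reference, the property behind this approximation can be written as follows. The symbols are introduced here for illustration and do not appear in the notebook: $A$ and $B$ are the shingle sets of two documents, $h_1, \dots, h_n$ are the hash functions (each standing in for a random permutation of the shingle universe), and $\hat{J}$ is the resulting estimate.

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|},
\qquad
\Pr\Big[\min_{x \in A} h(x) = \min_{y \in B} h(y)\Big] = J(A, B)
$$

$$
\hat{J}(A, B) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\Big[\min_{x \in A} h_i(x) = \min_{y \in B} h_i(y)\Big]
$$

The equality in the first line holds exactly for a uniformly random permutation; with hash functions of the form $(ax + b) \bmod c$ it holds only approximately.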




In [None]:
h_functions = 120
b_bands = 10

In [None]:
def multiple(x, a, b, c):
    return (a * x + b) % c

In [None]:
params = []

for _ in range(h_functions):
    a = random.randint(1, 10000)
    b = random.randint(1, 10000)
    max_ab = max(a, b)
    c = nextprime(max_ab + 1)
    params.append({"a": a, "b": b, "c": c})

params[:7]

[{'a': 678, 'b': 3428, 'c': 3433},
 {'a': 6255, 'b': 6918, 'c': 6947},
 {'a': 600, 'b': 3255, 'c': 3257},
 {'a': 5671, 'b': 7048, 'c': 7057},
 {'a': 2205, 'b': 2909, 'c': 2917},
 {'a': 4262, 'b': 9938, 'c': 9941},
 {'a': 2792, 'b': 7510, 'c': 7517}]
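
In the parameters above, the modulus `c` is the first prime after `max(a, b)`, so it stays around 10,000 at most, while the CRC32 shingle hashes range up to 2^32 - 1. Many distinct shingles can therefore map to the same value, and the minimum over a few hundred shingles will tend to be a very small number. A common alternative, sketched here as an assumption and not as the notebook's method, is to fix `c` to a prime larger than every possible shingle hash; 4294967311 is the smallest prime above 2^32. The helper `make_hash_params` and the `seed` argument are hypothetical.

```python
import random

PRIME_MODULUS = 4294967311  # smallest prime above 2**32 (assumed choice of modulus)

def make_hash_params(n_functions, seed=0):
    """Draw (a, b) for n hash functions x -> (a*x + b) % PRIME_MODULUS."""
    rng = random.Random(seed)
    return [
        {
            "a": rng.randint(1, PRIME_MODULUS - 1),  # a must be non-zero
            "b": rng.randint(0, PRIME_MODULUS - 1),
            "c": PRIME_MODULUS,
        }
        for _ in range(n_functions)
    ]

def multiple(x, a, b, c):
    return (a * x + b) % c

params_large = make_hash_params(120)
# With a prime modulus above the input range, distinct inputs never collide.
print(multiple(1, **params_large[0]) != multiple(2, **params_large[0]))
```

Because `a` is non-zero and every input is smaller than the prime modulus, each function is a one-to-one mapping on the shingle hashes, which is closer to the random permutation that MinHash assumes.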

In [None]:
def enum_shingles(e, hashed_shingles_list):
    doc_id, shingle = e
    return [((doc_id), (h), (hashed_shingles_list)) for h in range(h_functions)]

In [None]:
minhash_matrix = hash_rdd.flatMap(lambda e: enum_shingles(e, hashed_shingles_list))

In [None]:
minhash_matrix.toDF().show()

+-----+---+--------------------+
|   _1| _2|                  _3|
+-----+---+--------------------+
|13395|  0|[2506444301, 5010...|
|13395|  1|[2506444301, 5010...|
|13395|  2|[2506444301, 5010...|
|13395|  3|[2506444301, 5010...|
|13395|  4|[2506444301, 5010...|
|13395|  5|[2506444301, 5010...|
|13395|  6|[2506444301, 5010...|
|13395|  7|[2506444301, 5010...|
|13395|  8|[2506444301, 5010...|
|13395|  9|[2506444301, 5010...|
|13395| 10|[2506444301, 5010...|
|13395| 11|[2506444301, 5010...|
|13395| 12|[2506444301, 5010...|
|13395| 13|[2506444301, 5010...|
|13395| 14|[2506444301, 5010...|
|13395| 15|[2506444301, 5010...|
|13395| 16|[2506444301, 5010...|
|13395| 17|[2506444301, 5010...|
|13395| 18|[2506444301, 5010...|
|13395| 19|[2506444301, 5010...|
+-----+---+--------------------+
only showing top 20 rows



In [None]:
def minhash_map(docId_hashedShingles):
    doc_id, hashed_shingles = docId_hashedShingles
    minhashes = []
    for h in range(h_functions):
        min_h = math.inf
        for shingle in hashed_shingles:
            hash_value = multiple(shingle, **params[h])
            if hash_value < min_h:
                min_h = hash_value
        minhashes.append(min_h)
    return (doc_id, minhashes)

In [None]:
sig_matrix_rdd = hashed_shingles_rdd.map(minhash_map)

In [None]:
signature_df = sig_matrix_rdd.toDF(["doc_id", "minhashes"])

In [None]:
signature_df.show()

+------+--------------------+
|doc_id|           minhashes|
+------+--------------------+
| 13395|[2, 7, 1, 10, 6, ...|
| 34993|[2, 7, 4, 10, 11,...|
| 47677|[4, 84, 11, 55, 6...|
|104317|[2, 7, 1, 2, 5, 4...|
|143336|[72, 80, 7, 256, ...|
|152380|[102, 164, 160, 2...|
|218886|[4, 7, 1, 16, 1, ...|
|242951|[4, 84, 6, 1, 20,...|
|278072|[4, 4, 14, 2, 9, ...|
|455372|[0, 23, 4, 20, 5,...|
|459748|[0, 2, 1, 1, 5, 2...|
|464892|[2, 7, 4, 20, 28,...|
|525916|[2, 7, 0, 10, 6, ...|
|544349|[24, 449, 413, 27...|
|584862|[45, 80, 83, 225,...|
|608814|[4, 45, 6, 29, 89...|
|684947|[0, 12, 4, 1, 11,...|
|756249|[2, 24, 1, 10, 13...|
|898066|[9, 102, 4, 20, 1...|
|912495|[2, 2, 0, 10, 5, ...|
+------+--------------------+
only showing top 20 rows
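
To see how a signature relates to the similarity it approximates, the following pure-Python sketch builds signatures for two synthetic sets with a known Jaccard similarity of 0.5 and compares the fraction of matching signature positions against it. Everything here is an illustrative assumption: the large prime modulus, the number of hash functions, the seeds, and the helper names.

```python
import random

PRIME_MODULUS = 4294967311  # prime larger than any 32-bit value (assumed modulus)

def make_params(n_functions, seed=0):
    rng = random.Random(seed)
    return [(rng.randint(1, PRIME_MODULUS - 1), rng.randint(0, PRIME_MODULUS - 1))
            for _ in range(n_functions)]

def signature(hashed_shingles, params):
    """One minimum per hash function, as in minhash_map above."""
    return [min((a * x + b) % PRIME_MODULUS for x in hashed_shingles)
            for a, b in params]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two documents agree."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)

# Two synthetic sets sharing 100 of 200 distinct elements.
rng = random.Random(42)
universe = rng.sample(range(2 ** 32), 200)
set_a, set_b = set(universe[:150]), set(universe[50:])

hash_params = make_params(500)
sig_a, sig_b = signature(set_a, hash_params), signature(set_b, hash_params)
true_jaccard = len(set_a & set_b) / len(set_a | set_b)
print(true_jaccard, estimated_jaccard(sig_a, sig_b))
```

The estimate fluctuates around the true value, and the spread shrinks as the number of hash functions grows.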



### Locality-Sensitive Hashing

Define the threshold

In [None]:
# having h_functions = 120 and b_bands = 10
n_per_bands = h_functions // b_bands
threshold = (1/b_bands) ** (1/n_per_bands)

In [None]:
print("Threshold for candidate pairs: ", threshold)

Threshold for candidate pairs:  0.8254041852680184
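
The threshold `(1/b)^(1/r)` is the similarity at which the banding S-curve rises most steeply. With `b` bands of `r` rows each, two documents with similarity `s` become a candidate pair with probability `1 - (1 - s^r)^b`. The sketch below reproduces the threshold printed above and evaluates that curve at a few similarities; the function names are hypothetical.

```python
def lsh_threshold(n_bands, rows_per_band):
    """Approximate similarity at which the S-curve is steepest."""
    return (1 / n_bands) ** (1 / rows_per_band)

def candidate_probability(similarity, n_bands, rows_per_band):
    """Probability that two documents agree on every row of at least one band."""
    return 1 - (1 - similarity ** rows_per_band) ** n_bands

n_hashes, n_bands = 120, 10
rows_per_band = n_hashes // n_bands
t = lsh_threshold(n_bands, rows_per_band)
print("threshold:", t)

for s in (0.5, 0.7, t, 0.9):
    print(round(s, 3), candidate_probability(s, n_bands, rows_per_band))
```

Increasing the number of bands while keeping the signature length fixed lowers the threshold, which yields more candidate pairs at the cost of more false positives.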


Using the MinHashLSH provided by PySpark

In [None]:
array_to_vector_udf = udf(lambda x: DenseVector(x), VectorUDT())
signature_df = signature_df.withColumn("minhash_vector", array_to_vector_udf(col("minhashes"))) # converted with dense vector

In [None]:
signature_df = signature_df.drop("minhashes")

In [None]:
mh = MinHashLSH(inputCol="minhash_vector", outputCol="hashes", numHashTables=b_bands*n_per_bands, seed=56)

In [None]:
model = mh.fit(signature_df)

In [None]:
lsh_df = model.transform(signature_df)
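
Spark's `MinHashLSH` documentation describes its input as binary vectors whose non-zero entries mark set membership, so passing it a vector of MinHash values may not measure what the signature matrix was built for; that is worth verifying against the results. For comparison, the banding technique can be applied to the signatures directly. The following is a minimal pure-Python sketch under that assumption, with a hypothetical helper name and toy signatures.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, n_bands):
    """Return doc-id pairs whose signatures agree on every row of some band.

    `signatures` maps doc_id -> list of minhash values (all the same length).
    """
    rows = len(next(iter(signatures.values()))) // n_bands
    candidates = set()
    for band in range(n_bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            # The band's slice of the signature acts as the bucket key.
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(doc_id)
        for docs in buckets.values():
            candidates.update(combinations(sorted(docs), 2))
    return candidates

# Toy signatures: docs 1 and 2 share their first band, doc 3 shares nothing.
toy_signatures = {1: [1, 2, 3, 4], 2: [1, 2, 9, 9], 3: [7, 7, 7, 7]}
print(lsh_candidate_pairs(toy_signatures, n_bands=2))
```

Candidate pairs found this way would still need their similarity checked, either on the signatures or on the original shingle sets.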

## Finding similar jobs

In [None]:
def find_similar_jobs(lsh_df, model, threshold):
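    # Self-join to find candidate pairs. Note that approxSimilarityJoin expects
    # a *distance* threshold: it returns pairs whose Jaccard distance is below it.
    similar_items = model.approxSimilarityJoin(lsh_df, lsh_df, threshold, distCol="JaccardDistance")

    # Keep each unordered pair once (doc_id_A < doc_id_B also drops self-pairs)
    # and keep only pairs whose JaccardDistance is at most the threshold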
    similar_pairs = (similar_items
                     .select(
                         col("datasetA.doc_id").alias("doc_id_A"),
                         col("datasetB.doc_id").alias("doc_id_B"),
                         col("JaccardDistance"))
                     .filter(col("doc_id_A") < col("doc_id_B"))
                     .filter(col("JaccardDistance") <= threshold)
                     .rdd
                     .map(lambda row: (row["doc_id_A"], row["doc_id_B"], row["JaccardDistance"]))
                     .collect())

    return similar_pairs

In [None]:
similar_pairs = find_similar_jobs(lsh_df, model, threshold)

In [None]:
similar_pairs

[(544349, 1553707, 0.15000000000000002),
 (1199836, 1663266, 0.2542372881355932),
 (1331267, 1388218, 0.10169491525423724),
 (13395, 1749766, 0.12871287128712872),
 (47677, 544349, 0.008333333333333304),
 (455372, 1259945, 0.21818181818181814),
 (464892, 544349, 0.10833333333333328),
 (1152253, 1432904, 0.17307692307692313),
 (584862, 1388218, 0.07692307692307687),
 (143336, 242951, 0.06779661016949157),
 (950765, 1331267, 0.18644067796610164),
 (143336, 1331267, 0.050420168067226934),
 (104317, 684947, 0.1875),
 (278072, 1090315, 0.1834862385321101),
 (684947, 1199836, 0.15000000000000002),
 (34993, 1090315, 0.14423076923076927),
 (104317, 1388218, 0.1428571428571429),
 (143336, 1259945, 0.22881355932203384),
 (544349, 1749766, 0.19999999999999996),
 (1087080, 1749766, 0.16822429906542058),
 (34993, 47677, 0.19166666666666665),
 (34993, 152380, 0.18333333333333335),
 (152380, 1259945, 0.2416666666666667),
 (525916, 898066, 0.15000000000000002),
 (1432904, 1750168, 0.20869565217391306)
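
The values in the third position are Jaccard distances, that is, one minus similarity, whereas the threshold computed earlier is a similarity. If the intent is to keep pairs whose similarity is at least that threshold, the corresponding distance cut-off would be `1 - threshold`. The sketch below applies that cut-off to the first five pairs reported above; it is an illustration of the conversion, not a claim about which pairs are truly similar.

```python
similarity_threshold = 0.8254041852680184   # value printed in the threshold cell
max_distance = 1 - similarity_threshold     # equivalent Jaccard-distance cut-off

# First five (doc_id_A, doc_id_B, JaccardDistance) tuples from the output above.
reported_pairs = [
    (544349, 1553707, 0.15000000000000002),
    (1199836, 1663266, 0.2542372881355932),
    (1331267, 1388218, 0.10169491525423724),
    (13395, 1749766, 0.12871287128712872),
    (47677, 544349, 0.008333333333333304),
]

close_pairs = [(a, b, d) for a, b, d in reported_pairs if d <= max_distance]
print(max_distance)
print(close_pairs)
```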

### Some results

In [None]:
similar_df = pd.DataFrame(similar_pairs, columns=['doc_id1', 'doc_id2', 'jaccard_dist'])

In [None]:
similar_df

Unnamed: 0,doc_id1,doc_id2,jaccard_dist
0,544349,1553707,0.150000
1,1199836,1663266,0.254237
2,1331267,1388218,0.101695
3,13395,1749766,0.128713
4,47677,544349,0.008333
...,...,...,...
556,898066,1663266,0.258333
557,13395,1152253,0.128713
558,143336,684947,0.110169
559,459748,1553707,0.168224


In [None]:
for pair in similar_pairs[:2]:
    doc1 = pair[0]
    doc2 = pair[1]

    # Filter DataFrame to retrieve text for doc1 and doc2
    doc1_text = df_scaled.filter(col('doc_id') == doc1).select('job_words').first()[0]
    doc2_text = df_scaled.filter(col('doc_id') == doc2).select('job_words').first()[0]

    # Display job summaries for doc1 and doc2
    print("Document 1:")
    print(doc1_text)
    print("\nDocument 2:")
    print(doc2_text)
    print("\n---")

Document 1:
 45 pounds and over; heavy carrying

Document 2:
As a Unit Supply Specialist for the Army National Guard, you will ensure that your Unit and fellow Soldiers are well supplied and equipped for any mission. In this role, your keen eye and management ability will keep warehouse functions running smoothly. You will oversee the shipping, storage, and supply of Army National Guard equipment. This includes receiving, inspecting, invoicing, storing, and delivering supplies. You will: ensure that all documents are prepared and organized; maintain automated systems; secure and control weapons and ammunition; and schedule and provide maintenance for weapons. Job Duties Issue and receive small arms. Secure and control weapons and ammunition in security areas Schedule and perform preventive and organizational maintenance on weapons Operate unit level computers Some Of The Skills You’ll Learn Procedures for handling medical and food supplies Helpful Skills Interest in mathematics, bookke