# Algorithms for Massive Data


Project: Finding Similar Items


Michela Mazzaglia, academic year 2023/2024

## Importing libraries

In [1]:
pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.0/317.0 MB 2.4 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=e09b438e2f624f5f5a78e687a24386a2d0bda2e507e4d31f2e5f408e55dedf21
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [2]:
pip install findspark

Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1


In [3]:
import pyspark
import findspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark

In [4]:
! pip install -q kaggle

In [5]:
pip install pyspark nltk



In [6]:
import numpy as np
import pandas as pd
import re, math
import nltk
from pyspark.sql.functions import udf, length, expr, regexp_extract, collect_list
from pyspark.sql.types import StructType, StructField, StringType, LongType
from pyspark.sql.functions import monotonically_increasing_id, regexp_replace, col, split, size, concat_ws
from pyspark.ml.linalg import Vectors, DenseVector, VectorUDT
from pyspark.ml.feature import MinHashLSH, HashingTF
import random
from sympy import nextprime

## Uploading the dataset

In [None]:
from google.colab import files

files.upload() # upload your kaggle api key

In [8]:
! mkdir ~/.kaggle

! cp kaggle.json ~/.kaggle/

In [9]:
! chmod 600 ~/.kaggle/kaggle.json

In [10]:
!kaggle datasets download -d asaniczka/1-3m-linkedin-jobs-and-skills-2024

Dataset URL: https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024
License(s): ODC Attribution License (ODC-By)
Downloading 1-3m-linkedin-jobs-and-skills-2024.zip to /content
 99% 1.86G/1.88G [00:35<00:00, 87.5MB/s]
100% 1.88G/1.88G [00:35<00:00, 56.1MB/s]


In [11]:
!unzip -q ./1-3m-linkedin-jobs-and-skills-2024.zip -d .

In [12]:
sc = spark.sparkContext

## Data Cleaning & Preprocessing

In [13]:
df_0 = pd.read_csv('/content/job_summary.csv')

In [14]:
df_0

Unnamed: 0,job_link,job_summary
0,https://www.linkedin.com/jobs/view/restaurant-...,Rock N Roll Sushi is hiring a Restaurant Manag...
1,https://www.linkedin.com/jobs/view/med-surg-re...,Schedule\n: PRN is required minimum 12 hours p...
2,https://www.linkedin.com/jobs/view/registered-...,Description\nIntroduction\nAre you looking for...
3,https://uk.linkedin.com/jobs/view/commercial-a...,Commercial account executive\nSheffield\nFull ...
4,https://www.linkedin.com/jobs/view/store-manag...,Address:\nUSA-CT-Newington-44 Fenn Road\nStore...
...,...,...
1297327,https://www.linkedin.com/jobs/view/roofing-sup...,We are currently seeking experienced commercia...
1297328,https://www.linkedin.com/jobs/view/service-cen...,Overview\nStable and growing organization\nCom...
1297329,https://www.linkedin.com/jobs/view/flight-qual...,Rôle et responsabilités\nJob Description:\nFli...
1297330,https://www.linkedin.com/jobs/view/global-sour...,Job Description\nAre You Ready to Make It Happ...


In [15]:
df_0['job_summary'][2]

'Description\nIntroduction\nAre you looking for a place to deliver excellent care patients deserve? At StoneSprings Hospital Center we support our colleagues in their positions. Join our Team as a(an) Registered Nurse Cath Lab and access programs to assist with every stage of your career.\nBenefits\nStoneSprings Hospital Center, offers a total rewards package that supports the health, life, career and retirement of our colleagues. The available plans and programs include:\nComprehensive medical coverage that covers many common services at no cost or for a low copay. Plans include prescription drug and behavioral health coverage as well as free telemedicine services and free AirMed medical transportation.\nAdditional options for dental and vision benefits, life and disability coverage, flexible spending accounts, supplemental health protection plans (accident, critical illness, hospital indemnity), auto and home insurance, identity theft protection, legal counseling, long-term care cove

Spark

In [16]:
schema = StructType([
    StructField("job_link", StringType(), True),
    StructField("job_summary", StringType(), True)
])

In [17]:
spark_df = spark.read.csv(
    '/content/job_summary.csv',
    header=True,
    schema=schema,
    sep=',',       # Specify the delimiter
    quote='"',     # Handle quotes properly
    escape='\\',   # Handle escape characters
    multiLine=True # Handle multiline fields
)

spark_df.show()
spark_df.printSchema()

+--------------------+--------------------+
|            job_link|         job_summary|
+--------------------+--------------------+
|https://www.linke...|Rock N Roll Sushi...|
|https://www.linke...|Schedule\n: PRN i...|
|https://www.linke...|"Description\nInt...|
|HCA Healthcare Co...|                NULL|
|If growth and con...| we encourage you...|
|Unlock the possib...|                NULL|
|We are an equal o...|            religion|
|           Show more|                NULL|
|          Show less"|                NULL|
|https://uk.linked...|Commercial accoun...|
|https://www.linke...|Address:\nUSA-CT-...|
|https://www.linke...|Description\nOur\...|
|https://www.linke...|Company Descripti...|
|https://uk.linked...|An exciting oppor...|
|https://www.linke...|Job Details:\nJob...|
|https://www.linke...|Our\nRestaurant T...|
|https://www.linke...|Our General Manag...|
|https://www.linke...|Earning potential...|
|https://www.linke...|Dollar General Co...|
|https://au.linked...|Restaurant
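
The `show()` output above contains rows whose `job_link` is a sentence fragment, which suggests that some quoted multi-line summaries were split across records. One possible cause, offered as an assumption rather than a diagnosis, is that the file escapes quotes by doubling them (the default convention of Python's `csv` module) while the reader was configured for backslash escapes. The sketch below uses only the standard library to show how such a field is meant to parse; the sample text and the `example.com` links are hypothetical. In Spark, the analogous setting would be `escape='"'`.

```python
import csv
import io

# Hypothetical two-record file: the first summary spans two lines and
# contains doubled quotes, the RFC 4180 way of escaping a quote.
raw = (
    'job_link,job_summary\n'
    '"https://example.com/jobs/1","Line one\nShe said ""apply now"", then left"\n'
    '"https://example.com/jobs/2","Single line"\n'
)

# csv.reader keeps the embedded newline inside the quoted field and
# turns each doubled quote back into a single quote character.
rows = list(csv.reader(io.StringIO(raw)))
for row in rows:
    print(row)
```

If the parsed row count matches the pandas row count for the same file, that is a reasonable sign the quoting options are aligned.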

In [18]:
spark_df = spark_df.dropna(subset=['job_summary'])

In [19]:
spark_df = spark_df.withColumn('job_summary', regexp_replace(col('job_summary'), '\n', ' '))

In [20]:
spark_df = spark_df.withColumn('job_words', split(col('job_summary'), ' '))

In [21]:
spark_df = spark_df.filter(size(col('job_words')) > 4)

In [22]:
spark_df = spark_df.select('job_words').withColumn('doc_id', monotonically_increasing_id())

In [23]:
spark_df.show()

+--------------------+------+
|           job_words|doc_id|
+--------------------+------+
|[Rock, N, Roll, S...|     0|
|[Schedule, :, PRN...|     1|
|["Description, In...|     2|
|[, we, encourage,...|     3|
|[Commercial, acco...|     4|
|[Address:, USA-CT...|     5|
|[Description, Our...|     6|
|[Company, Descrip...|     7|
|[An, exciting, op...|     8|
|[Job, Details:, J...|     9|
|[Our, Restaurant,...|    10|
|[Our, General, Ma...|    11|
|[Earning, potenti...|    12|
|[Dollar, General,...|    13|
|[Restaurant, Desc...|    14|
|[Who, We, Are, We...|    15|
|[A, Place, Where,...|    16|
|[Description, The...|    17|
|["Overview, Descr...|    18|
|[, seat, them, at...|    19|
+--------------------+------+
only showing top 20 rows



In [24]:
df_scaled = spark_df.sample(withReplacement=True, fraction=0.00005, seed=42)

In [25]:
df_scaled = df_scaled.withColumn('job_words', concat_ws(' ', col('job_words')))

## Preprocessing

In [26]:
nltk.download('stopwords')
nltk.download('punkt')
stopwords = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [27]:
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)  # Remove extra whitespace
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    tokens = nltk.word_tokenize(text)
    tokens = [word for word in tokens if word not in stopwords]  # Remove stopwords
    return ' '.join(tokens)

In [28]:
df_udf = udf(preprocess_text, StringType())

In [29]:
rdd = df_scaled.select('doc_id', 'job_words').rdd.map(lambda row: (row['doc_id'], preprocess_text(row['job_words'])))

In [30]:
rdd.toDF().show()

+------+--------------------+
|    _1|                  _2|
+------+--------------------+
| 13395|grass greener sec...|
| 34993|client childrens ...|
| 47677|possess valid cla...|
| 64146|demonstrates inte...|
| 69905|new adventure awa...|
|104317|details signon bo...|
|143336|parttime intermit...|
|152380|conference center...|
|188603|scheduled nclex w...|
|218886|northern tier hig...|
|242951|experience corps ...|
|252827|role sr serviceno...|
|273921|us join us winco ...|
|278072|win sport school ...|
|304884|job details job l...|
|328483|department econom...|
|451330|zoom take great c...|
|455372|immediate start e...|
|459748|northrop grumman ...|
|459967|position interest...|
+------+--------------------+
only showing top 20 rows



### Shingles

In [31]:
import binascii

In [32]:
k = 3

In [33]:
shingles_rdd = rdd.flatMap(lambda doc: [(doc[0], doc[1][i:i+k]) for i in range(len(doc[1]) - k + 1)])

In [34]:
shingles_rdd.take(7)

[(13395, 'gra'),
 (13395, 'ras'),
 (13395, 'ass'),
 (13395, 'ss '),
 (13395, 's g'),
 (13395, ' gr'),
 (13395, 'gre')]
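
As a small illustration of the shingling step, the following sketch builds character 3-shingles for two short strings in plain Python and compares them with the exact Jaccard similarity. The helper names `char_shingles` and `jaccard` and the sample strings are assumptions made for this example; they are not part of the notebook.

```python
def char_shingles(text, k=3):
    """Return the set of all character k-grams of `text`."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = "grass greener"   # hypothetical sample texts
doc_b = "grass green"

shingles_a = char_shingles(doc_a)
shingles_b = char_shingles(doc_b)

print(sorted(shingles_a))
print(jaccard(shingles_a, shingles_b))
```

Using sets rather than lists mirrors the later `distinct()` call: repeated shingles inside one document do not change the Jaccard similarity.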

In [35]:
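# Note: this groups `rdd` (the preprocessed documents), not `shingles_rdd`,
# so each list holds the full cleaned text of a document rather than its shingles.
shingles_list = (rdd
                 .groupByKey()
                 .map(lambda x: (x[0], list(x[1]))))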

In [36]:
shingles_list.toDF().show()

+------+--------------------+
|    _1|                  _2|
+------+--------------------+
| 13395|[grass greener se...|
| 34993|[client childrens...|
| 47677|[possess valid cl...|
| 64146|[demonstrates int...|
| 69905|[new adventure aw...|
|104317|[details signon b...|
|143336|[parttime intermi...|
|152380|[conference cente...|
|188603|[scheduled nclex ...|
|218886|[northern tier hi...|
|242951|[experience corps...|
|252827|[role sr servicen...|
|273921|[us join us winco...|
|278072|[win sport school...|
|304884|[job details job ...|
|328483|[department econo...|
|451330|[zoom take great ...|
|455372|[immediate start ...|
|459748|[northrop grumman...|
|459967|[position interes...|
+------+--------------------+
only showing top 20 rows



Characteristic matrix

In [37]:
def hash_shingle(shingle):
    # CRC32 of the UTF-8 bytes, masked to an unsigned 32-bit integer
    return binascii.crc32(shingle.encode('utf-8')) & 0xffffffff

In [38]:
hash_rdd = shingles_rdd.mapValues(hash_shingle).distinct()

In [39]:
hash_rdd.take(7)

[(13395, 2506444301),
 (13395, 501096268),
 (13395, 2068476598),
 (13395, 2525627878),
 (13395, 4020559682),
 (13395, 1404793474),
 (13395, 2450033172)]

In [40]:
df = hash_rdd.toDF(["doc_id", "hashed_shingle"]) \
    .groupBy("doc_id") \
    .agg(collect_list("hashed_shingle").alias("hashed_shingles"))

In [41]:
df.show()

+-------+--------------------+
| doc_id|     hashed_shingles|
+-------+--------------------+
| 608814|[2743074591, 9616...|
| 304884|[4225294584, 3496...|
|1663266|[210272711, 33932...|
|1618038|[1811742144, 3119...|
|1152253|[2189397276, 1432...|
|  64146|[2601924841, 3154...|
| 773878|[4225294584, 3496...|
| 218886|[3781580864, 4138...|
|  69905|[1810056261, 3930...|
| 909604|[1767766964, 4231...|
| 872160|[2068476598, 2576...|
| 684947|[2179260663, 2228...|
|1460189|[595022058, 58230...|
|1388218|[4006809354, 4168...|
| 898066|[2244290826, 3530...|
| 273921|[2449963348, 2434...|
|  47677|[2161764012, 1909...|
| 525916|[4225294584, 3496...|
|1450243|[2434037902, 3329...|
|1621468|[1629029770, 1643...|
+-------+--------------------+
only showing top 20 rows



In [42]:
def collect_shingles(a, b):
    return a + b

In [43]:
hashed_shingles_rdd = hash_rdd.map(lambda x: (x[0], [x[1]])).reduceByKey(collect_shingles)

In [None]:
hashed_shingles_rdd.take(7)

In [45]:
hashed_shingles_list = hashed_shingles_rdd.flatMap(lambda x: x[1]).collect()

### MinHash


---
MinHash is a fast approximation of the Jaccard similarity coefficient between two finite sets.
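
The idea can be sketched in a few lines of plain Python: each hash function contributes the minimum hash value of a set, and the fraction of positions on which two signatures agree estimates their Jaccard similarity. Everything named here (`PRIME`, `make_hash_params`, `signature`, `estimate_jaccard`, `make_ids`, the set sizes, and the choice of 400 hash functions) is an assumption for illustration, not the notebook's configuration.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # a Mersenne prime, far above any 32-bit shingle id


def make_hash_params(n, seed=0):
    """Draw n (a, b) pairs for hash functions h(x) = (a * x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]


def signature(items, hash_params):
    """MinHash signature: the minimum hash value of the set under each function."""
    return [min((a * x + b) % PRIME for x in items) for a, b in hash_params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    return sum(u == v for u, v in zip(sig_a, sig_b)) / len(sig_a)


def make_ids(start, stop):
    """Hypothetical 32-bit shingle ids, produced with CRC32."""
    return {zlib.crc32(f"shingle-{i}".encode("utf-8")) for i in range(start, stop)}


set_a, set_b = make_ids(0, 150), make_ids(50, 200)
exact = len(set_a & set_b) / len(set_a | set_b)

hash_params = make_hash_params(400)
sig_a = signature(set_a, hash_params)
sig_b = signature(set_b, hash_params)
approx = estimate_jaccard(sig_a, sig_b)

print(f"exact={exact:.3f} estimate={approx:.3f}")
```

More hash functions tighten the estimate at the cost of longer signatures; the standard error shrinks roughly with the square root of the number of functions.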




In [46]:
h_functions = 120
b_bands = 10

In [47]:
def multiple(x, a, b, c):
    return (a * x + b) % c

In [48]:
params = []

for _ in range(h_functions):
    a = random.randint(1, 10000)
    b = random.randint(1, 10000)
    max_ab = max(a, b)
    c = nextprime(max_ab + 1)  # prime modulus just above max(a, b)
    params.append({"a": a, "b": b, "c": c})

params[:7]

[{'a': 732, 'b': 4666, 'c': 4673},
 {'a': 3334, 'b': 9066, 'c': 9091},
 {'a': 9957, 'b': 7174, 'c': 9967},
 {'a': 3532, 'b': 1874, 'c': 3539},
 {'a': 8573, 'b': 865, 'c': 8581},
 {'a': 2885, 'b': 8199, 'c': 8209},
 {'a': 3872, 'b': 999, 'c': 3877}]
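
The moduli drawn above are at most slightly larger than 10,000, while the CRC32 shingle ids range up to 2^32 - 1, so many distinct shingles share a hash value. The sketch below compares how often two disjoint sets (true Jaccard similarity 0) agree on a MinHash value under the notebook's choice of modulus and under a single prime above the largest id. It rests on the assumption that such a larger modulus is preferable, which is a common convention and not something the notebook states. The helpers `is_prime`, `next_prime`, and `agreement_rate`, and the random ids, are hypothetical stand-ins.

```python
import random


def is_prime(n):
    """Trial-division primality test; adequate for values near 2**32."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def next_prime(n):
    """Smallest prime strictly greater than n (stand-in for sympy.nextprime)."""
    n += 1
    while not is_prime(n):
        n += 1
    return n


def agreement_rate(set_a, set_b, hash_params):
    """Fraction of hash functions under which both sets share the same minimum."""
    hits = 0
    for a, b, c in hash_params:
        min_a = min((a * x + b) % c for x in set_a)
        min_b = min((a * x + b) % c for x in set_b)
        hits += min_a == min_b
    return hits / len(hash_params)


rng = random.Random(0)
ids = rng.sample(range(2**32), 4000)     # hypothetical 32-bit shingle ids
set_a, set_b = ids[:2000], ids[2000:]    # disjoint halves

# Moduli chosen as in the notebook: next prime above max(a, b) + 1.
small = []
for _ in range(200):
    a, b = rng.randint(1, 10000), rng.randint(1, 10000)
    small.append((a, b, next_prime(max(a, b) + 1)))

# Assumed alternative: one shared prime above the largest possible id.
big_prime = next_prime(2**32 - 1)
large = [(rng.randrange(1, big_prime), rng.randrange(big_prime), big_prime)
         for _ in range(200)]

small_rate = agreement_rate(set_a, set_b, small)
large_rate = agreement_rate(set_a, set_b, large)
print(f"small moduli: {small_rate:.3f}, large modulus: {large_rate:.3f}")
```

With a modulus above every id, each hash function is injective on the ids, so disjoint sets can never share a minimum; with small moduli, agreements also arise from collisions and inflate the apparent similarity.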

In [49]:
def enum_shingles(e, hashed_shingles_list):
    doc_id, _ = e  # the shingle itself is not used here
    return [(doc_id, h, hashed_shingles_list) for h in range(h_functions)]

In [50]:
minhash_matrix = hash_rdd.flatMap(lambda e: enum_shingles(e, hashed_shingles_list))

In [51]:
minhash_matrix.toDF().show()

+-----+---+--------------------+
|   _1| _2|                  _3|
+-----+---+--------------------+
|13395|  0|[2506444301, 5010...|
|13395|  1|[2506444301, 5010...|
|13395|  2|[2506444301, 5010...|
|13395|  3|[2506444301, 5010...|
|13395|  4|[2506444301, 5010...|
|13395|  5|[2506444301, 5010...|
|13395|  6|[2506444301, 5010...|
|13395|  7|[2506444301, 5010...|
|13395|  8|[2506444301, 5010...|
|13395|  9|[2506444301, 5010...|
|13395| 10|[2506444301, 5010...|
|13395| 11|[2506444301, 5010...|
|13395| 12|[2506444301, 5010...|
|13395| 13|[2506444301, 5010...|
|13395| 14|[2506444301, 5010...|
|13395| 15|[2506444301, 5010...|
|13395| 16|[2506444301, 5010...|
|13395| 17|[2506444301, 5010...|
|13395| 18|[2506444301, 5010...|
|13395| 19|[2506444301, 5010...|
+-----+---+--------------------+
only showing top 20 rows



In [52]:
def minhash_map(docId_hashedShingles):
    doc_id, hashed_shingles = docId_hashedShingles
    minhashes = []
    for h in range(h_functions):
        min_h = math.inf
        for shingle in hashed_shingles:
            hash_value = multiple(shingle, **params[h])
            if hash_value < min_h:
                min_h = hash_value
        minhashes.append(min_h)
    return (doc_id, minhashes)

In [53]:
sig_matrix_rdd = hashed_shingles_rdd.map(minhash_map)

In [54]:
signature_df = sig_matrix_rdd.toDF(["doc_id", "minhashes"])

In [55]:
signature_df.show()

+------+--------------------+
|doc_id|           minhashes|
+------+--------------------+
| 13395|[6, 12, 4, 5, 8, ...|
| 34993|[6, 12, 4, 5, 3, ...|
| 47677|[15, 337, 64, 33,...|
| 64146|[79, 309, 4, 6, 1...|
| 69905|[6, 17, 4, 3, 4, ...|
|104317|[6, 12, 4, 2, 13,...|
|143336|[17, 22, 40, 8, 2...|
|152380|[15, 588, 10, 76,...|
|188603|[62, 148, 207, 29...|
|218886|[6, 12, 4, 3, 13,...|
|242951|[15, 10, 10, 3, 2...|
|252827|[6, 17, 4, 2, 13,...|
|273921|[6, 8, 4, 3, 4, 0...|
|278072|[6, 17, 10, 6, 13...|
|304884|[2, 17, 4, 3, 3, ...|
|328483|[6, 6, 4, 6, 4, 1...|
|451330|[6, 5, 4, 3, 4, 7...|
|455372|[15, 5, 4, 6, 42,...|
|459748|[6, 17, 4, 6, 42,...|
|459967|[6, 17, 4, 3, 4, ...|
+------+--------------------+
only showing top 20 rows



### Locality-Sensitive Hashing

Define the threshold

In [56]:
# With h_functions = 120 and b_bands = 10, each band has 12 rows
n_per_bands = h_functions // b_bands
# Approximate similarity threshold of the banding scheme: (1/b)^(1/r)
threshold = (1/b_bands) ** (1/n_per_bands)

In [57]:
print("Threshold for candidate pairs: ", threshold)

Threshold for candidate pairs:  0.8254041852680184
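
To make the meaning of this threshold concrete, the sketch below evaluates the standard banding formula: with b bands of r rows each, two documents with Jaccard similarity s become a candidate pair with probability 1 - (1 - s^r)^b. The function name `candidate_probability` and the sampled similarity values are assumptions for illustration.

```python
def candidate_probability(s, rows, bands):
    """Probability that two sets with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands


h_functions = 120
b_bands = 10
rows = h_functions // b_bands
threshold = (1 / b_bands) ** (1 / rows)

# The curve rises steeply around the threshold (the "S-curve" of LSH).
for s in (0.5, 0.7, threshold, 0.9, 0.95):
    print(f"s={s:.3f} -> P(candidate)={candidate_probability(s, rows, b_bands):.3f}")
```

At s equal to the threshold, s^r is exactly 1/b, so the probability is 1 - (1 - 1/b)^b, about 0.65 for b = 10; the threshold marks the steep part of the curve, not a hard cut-off.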


Using the MinHashLSH provided by PySpark

In [58]:
array_to_vector_udf = udf(lambda x: DenseVector(x), VectorUDT())
# Convert the array column of MinHash values into a DenseVector column
signature_df = signature_df.withColumn("minhash_vector", array_to_vector_udf(col("minhashes")))

In [59]:
signature_df = signature_df.drop("minhashes")

In [60]:
mh = MinHashLSH(inputCol="minhash_vector", outputCol="hashes", numHashTables=b_bands*n_per_bands, seed=56)

In [61]:
model = mh.fit(signature_df)

In [62]:
lsh_df = model.transform(signature_df)
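
`MinHashLSH` in PySpark, as its documentation describes it, treats each input vector as a set given by its non-zero entries and applies its own hash functions; `numHashTables` sets how many of them are used. It therefore does not perform the split into `b_bands` bands of `n_per_bands` rows that the threshold above assumes. A minimal sketch of that banding step in plain Python, applied directly to MinHash signatures, could look as follows. The function name `lsh_candidates` and the toy signatures are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidates(signatures, bands):
    """Return pairs of doc ids whose signatures are identical in at least one band.

    signatures: dict mapping doc_id -> list of MinHash values (all the same length).
    """
    n = len(next(iter(signatures.values())))
    rows = n // bands
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            # The slice of the signature belonging to this band is the bucket key.
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(doc_id)
        for docs in buckets.values():
            for a, b in combinations(sorted(docs), 2):
                candidates.add((a, b))
    return candidates


# Toy signatures of length 6, split into 3 bands of 2 rows each.
toy = {
    "d1": [1, 2, 3, 4, 5, 6],
    "d2": [1, 2, 9, 9, 5, 6],
    "d3": [7, 7, 7, 7, 7, 7],
}
print(sorted(lsh_candidates(toy, bands=3)))
```

Candidate pairs found this way would still need a final check against the exact or estimated Jaccard similarity, since sharing one band is only a probabilistic signal.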

## Finding similar jobs

In [63]:
def find_similar_jobs(lsh_df, model, threshold):
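    # Self-join; `threshold` is used here as the maximum Jaccard distance of a returned pair
    similar_items = model.approxSimilarityJoin(lsh_df, lsh_df, threshold, distCol="JaccardDistance")

    # Keep each unordered pair once (doc_id_A < doc_id_B also drops self-pairs)
    # and retain only pairs whose JaccardDistance does not exceed the threshold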
    similar_pairs = (similar_items
                     .select(
                         col("datasetA.doc_id").alias("doc_id_A"),
                         col("datasetB.doc_id").alias("doc_id_B"),
                         col("JaccardDistance"))
                     .filter(col("doc_id_A") < col("doc_id_B"))
                     .filter(col("JaccardDistance") <= threshold)
                     .rdd
                     .map(lambda row: (row["doc_id_A"], row["doc_id_B"], row["JaccardDistance"]))
                     .collect())

    return similar_pairs

In [64]:
similar_pairs = find_similar_jobs(lsh_df, model, threshold)

In [65]:
similar_pairs

[(455372, 459967, 0.15596330275229353),
 (455372, 1750168, 0.15517241379310343),
 (472496, 1331267, 0.14035087719298245),
 (608814, 689406, 0.14406779661016944),
 (608814, 756249, 0.11864406779661019),
 (950765, 1663266, 0.09615384615384615),
 (1129756, 1750168, 0.12727272727272732),
 (1152253, 1618038, 0.15094339622641506),
 (64146, 1618038, 0.17796610169491522),
 (152380, 1129756, 0.18333333333333335),
 (273921, 1214632, 0.10204081632653061),
 (455372, 684947, 0.10526315789473684),
 (459967, 507283, 0.0980392156862745),
 (831098, 1038199, 0.1428571428571429),
 (872160, 1320765, 0.15000000000000002),
 (898066, 1621468, 0.19491525423728817),
 (912495, 1373757, 0.09999999999999998),
 (1129756, 1559819, 0.1694915254237288),
 (1278595, 1320765, 0.2416666666666667),
 (34993, 451330, 0.1009174311926605),
 (69905, 278072, 0.1339285714285714),
 (69905, 584862, 0.12605042016806722),
 (104317, 831098, 0.15887850467289721),
 (252827, 1432904, 0.09433962264150941),
 (273921, 328483, 0.16190476190

### Some results

In [69]:
similar_df = pd.DataFrame(similar_pairs, columns=['doc_id1', 'doc_id2', 'jaccard_distance'])

In [70]:
similar_df

Unnamed: 0,doc_id1,doc_id2,jaccard_distance
0,455372,459967,0.155963
1,455372,1750168,0.155172
2,472496,1331267,0.140351
3,608814,689406,0.144068
4,608814,756249,0.118644
...,...,...,...
2770,455372,1618038,0.120370
2771,472496,1572773,0.121495
2772,546299,878756,0.166667
2773,624623,1559819,0.041667
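
`approxSimilarityJoin` takes a maximum distance, whereas `threshold` was derived as a minimum similarity. Assuming the intent is to keep pairs whose similarity is at least the threshold (an interpretation, not something the notebook states), the cut-off on the distance column would be `1 - threshold`. The sketch below applies that cut-off to a few rows copied from the output above; the variable names are hypothetical.

```python
similarity_threshold = 0.8254041852680184      # value printed by the notebook
distance_threshold = 1.0 - similarity_threshold

# A few (doc_id_A, doc_id_B, JaccardDistance) rows copied from the output above.
pairs = [
    (455372, 459967, 0.15596330275229353),
    (950765, 1663266, 0.09615384615384615),
    (898066, 1621468, 0.19491525423728817),
    (1278595, 1320765, 0.2416666666666667),
]

# Keep only pairs whose distance is within the converted threshold.
kept = [p for p in pairs if p[2] <= distance_threshold]
for doc_a, doc_b, dist in kept:
    print(doc_a, doc_b, f"similarity={1.0 - dist:.3f}")
```

Under this reading, pairs with a distance above roughly 0.175 would be dropped, which is stricter than passing the similarity threshold directly as a distance.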


In [71]:
for doc1, doc2, dist in similar_pairs[:2]:

    # Filter DataFrame to retrieve text for doc1 and doc2
    doc1_text = df_scaled.filter(col('doc_id') == doc1).select('job_words').first()[0]
    doc2_text = df_scaled.filter(col('doc_id') == doc2).select('job_words').first()[0]

    # Display job summaries for doc1 and doc2
    print("Document 1:")
    print(doc1_text)
    print("\nDocument 2:")
    print(doc2_text)
    print("\nJaccard Distance: {:.4f}".format(dist))
    print("\n---")


Document 1:
Immediate start with an extremely progressive company |Full training provided and supportive working culture About Our Client My client are a family run business who have achieved a lot of success over recent years by winning successful contracts which will allow the business to continue to thrive in a competitive industry. They have a supportive working environment and always go the extra mile in ensuring customers receive the best possible service. Job Description The main duties of the role will include:- Answering calls regarding updates on job orders and progress Logging new job orders onto the CRM system Sending out confirmations to customers and following up Providing an excellent standard of service at all times Record all call information onto the system accurately The working hours are Mon to Fri 8 - 5 or Mon to Fri 9 - 6 with a Saturday in every month The Successful Applicant Excellent customer service skills Positive attitude to completing tasks Attention to det