## DOCUMENT SIMILARITY SEARCH WITH PYSPARK

In [11]:
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
        .appName("SimilaritySearch") \
        .master("spark://172.29.15.15:7077") \
        .config("spark.executor.instances", "2") \
        .config("spark.executor.cores", "10") \
        .config("spark.executor.memory", "16g") \
        .config("spark.driver.memory", "8g") \
        .config("spark.driver.maxResultSize", "12g") \
        .getOrCreate()

# Read datafile
file_path = "hdfs://172.29.15.15:9000/khang/dataset/corpus.jsonl"
df = spark.read.json(file_path)

# Keep only the first 500,000 rows because of limited hardware
df = df.limit(500000)

# df.persist()  # optionally cache the dataset in memory to avoid recomputing it
df.show() # data preview

                                                                                

+---+--------------------+--------------------+--------------------+
|_id|            metadata|                text|               title|
+---+--------------------+--------------------+--------------------+
| 12|{https://en.wikip...|Anarchism is a po...|           Anarchism|
| 25|{https://en.wikip...|Autism is a neuro...|              Autism|
| 39|{https://en.wikip...|Albedo ( ) is a m...|              Albedo|
|290|{https://en.wikip...|A (named , plural...|                   A|
|303|{https://en.wikip...|Alabama ( ) is a ...|             Alabama|
|305|{https://en.wikip...|In Greek mytholog...|            Achilles|
|307|{https://en.wikip...|Abraham Lincoln (...|     Abraham Lincoln|
|308|{https://en.wikip...|Aristotle ( ; Gre...|           Aristotle|
|309|{https://en.wikip...|An American in Pa...|An American in Paris|
|316|{https://en.wikip...|The Academy Award...|Academy Award for...|
|324|{https://en.wikip...|The Academy Award...|      Academy Awards|
|330|{https://en.wikip...|Actresse

In [12]:
print("Number of documents: ", df.count())
print(type(df))



Number of documents:  500000
<class 'pyspark.sql.dataframe.DataFrame'>


                                                                                

In [13]:
from pyspark.sql.functions import col

# Check whether the "_id" column is unique
is_unique = df.groupBy("_id").count().filter(col("count") > 1).isEmpty()
if is_unique:
    print("'id' column in dataset is unique")
else:
    print("'id' column in dataset isn't unique")



'id' column in dataset is unique


                                                                                

In [14]:
# Get Spark's configuration information
sc = spark.sparkContext

# Get the driver node's hostname and port
driver_host = sc.getConf().get("spark.driver.host")
driver_port = sc.getConf().get("spark.driver.port")
num_workers = sc.getConf().get("spark.executor.instances")

print("Driver node hostname:", driver_host)
print("Driver node port:", driver_port)
print("Number of worker nodes:", num_workers)

Driver node hostname: master
Driver node port: 34615
Number of worker nodes: 2


### Add Python file to SparkContext

In [15]:
sc.addPyFile("./utils/utilities.py")

## SHINGLING

`Shingling` divides a document into a sequence of contiguous units of text, overlapping or non-overlapping, called `shingles`. The resulting set of shingles is a compact representation of the text that can be used to compare documents and measure their similarity. In this notebook each shingle is a run of 5 consecutive words.

The map function for shingling distributes the documents among the worker nodes and produces `(doc_id, shingles_set, hash_id)` tuples, one per hash function. No reduce function is needed at this step.
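The `Shingling` class lives in `utilities.py` and is not shown here, so the following is only a minimal sketch of word-level shingling in plain Python. The helper name `word_shingles` and the normalisation it applies (lowercasing and dropping punctuation) are assumptions made for illustration, not the notebook's actual implementation.

```python
import re


def word_shingles(text, k=5):
    """Return the set of k-word shingles of `text` (hypothetical helper)."""
    # Lowercase and drop punctuation so that formatting differences do not
    # produce different shingles.
    words = re.sub(r"[^a-z0-9\s]", "", text.lower()).split()
    if len(words) < k:
        # A document shorter than k words yields a single shingle (or none).
        return {" ".join(words)} if words else set()
    # Slide a window of k words over the text, one word at a time.
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


doc = "The quick brown fox jumps over the lazy dog"
shingles = word_shingles(doc, k=5)
print(sorted(shingles))
```

A document of `n` words produces at most `n - k + 1` shingles, so the 9-word example yields 5 shingles.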

In [16]:
from utilities import Shingling
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col


def shingling_map(row):
    out = []
    sh_instance = Shingling(5)
    shingles = sh_instance.get_shingles(row["title"] + " " + row["text"], words=True)
    signature_size = 100
    for i in range(0, signature_size):  # signature size
        out.append((row["_id"], shingles, i))

    # return an iterator to use flatMap => produce more than one key-value pair as output (namely one per hash function)
    return iter(out)


# Define the schema as a list of StructField objects
schema = StructType(
    [
        StructField("doc_id", StringType(), nullable=False),
        StructField("shingles_set", StringType(), nullable=True),
        StructField("hash_id", IntegerType(), nullable=True),
    ]
)

# Apply the map function on the workers and convert the result to a DataFrame with the schema above.
result = df.rdd.flatMap(shingling_map).toDF(schema)
result.filter(col("hash_id") == 0).show(truncate=100)



+------+----------------------------------------------------------------------------------------------------+-------+
|doc_id|                                                                                        shingles_set|hash_id|
+------+----------------------------------------------------------------------------------------------------+-------+
|    12|[that advocates selfgoverned societies based, state to be undesirable unnecessary, defined them m...|      0|
|    25|[with autism reach their developmental, though some children with autism, and nonverbal communica...|      0|
|    39|[radiation to one corresponding to, is a measure for reflectance, one corresponding to a white, b...|      0|
|   290|[in two forms the doublestorey, of a triangle crossed in, and the first vowel of, the first vowel...|      0|
|   303|[us states with a total, among the most of any, east florida and the gulf, of inland waterways al...|      0|
|   305|[warrior of homers iliad his, greek mythology ac

                                                                                

## MIN-HASHING

`Min Hashing` is a technique for estimating the similarity between sets, in this case the sets of shingles extracted from the documents. It works by building a signature matrix in which each document is represented by the minimum hash value of its shingle set under each of several hash functions.

Design a map-reduce task to produce the signature matrix:
- Map task: takes `(doc_id, shingle_set, h_i)` as input, where `h_i` is a hash function from the hash family (`HashFamily` in `utilities`), and produces the minhash value of the set under that hash function. Output: `(doc_id, min_hash)`
- Reduce task: groups the results of the map task by document. Output: `(doc_id, minhash_signature)`
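The map and reduce tasks above can be sketched without Spark. `HashFamily` is defined in `utilities.py` and not shown, so the salted MD5 hash below is a stand-in chosen for illustration, and the names `hash_value` and `minhash_signature` are hypothetical. The sketch keeps the signature ordered by hash index, which is what lets two signatures be compared position by position.

```python
import hashlib


def hash_value(i, item):
    """i-th function of a hypothetical hash family: salt the item with i."""
    digest = hashlib.md5(f"{i}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # 32-bit integer


def minhash_signature(shingles, num_hashes=100):
    """Signature whose i-th entry is the minimum of hash i over the set."""
    return [min(hash_value(i, s) for s in shingles) for i in range(num_hashes)]


a = {"a b c", "b c d", "c d e", "d e f"}
b = {"a b c", "b c d", "c d e", "x y z"}
sig_a = minhash_signature(a)
sig_b = minhash_signature(b)

# The fraction of positions where two signatures agree estimates the Jaccard
# similarity of the underlying sets (the true value here is 3 / 5 = 0.6).
agree = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
print(f"estimated similarity: {agree:.2f}")
```

With only 100 hash functions the estimate fluctuates around the true value; more hash functions reduce the variance at the cost of longer signatures.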


In [18]:
from utilities import HashFamily
import math


def minhash_map(row):
    doc_id = row[0]
    shingles = row[1]
    hash_f = HashFamily(row[2])
    min_h = math.inf
    for el in shingles:
        hash_value = hash_f.get_hash_value(el)
        if hash_value < min_h:
            min_h = hash_value

    return (doc_id, min_h)


# Map task
minhash_map_task = df.rdd.flatMap(shingling_map).map(minhash_map)

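# Reduce task. Format result: (doc_id, minhash_signature)
# Note: groupByKey does not guarantee the order of the grouped values,
# so the position of a value in the list is not tied to its hash_id.
sig_matrix = minhash_map_task.groupByKey().map(lambda x : (x[0], list(x[1])))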

In [21]:
# Print signature matrix
sig_matrix_df = sig_matrix.toDF(["doc_id", "signature"])
sig_matrix_df.show(truncate=100)


+------+----------------------------------------------------------------------------------------------------+
|doc_id|                                                                                           signature|
+------+----------------------------------------------------------------------------------------------------+
|631557|[233623080, 43798276, 11827396, 159118239, 5304397, 253723450, 75083950, 101483230, 93788666, 856...|
|634862|[6747511, 221390547, 397101929, 72733085, 2083994, 206760304, 108648721, 399346433, 116755236, 40...|
|636084|[97585608, 35616189, 421551816, 324054930, 77207479, 116010955, 145904762, 75209766, 89189051, 23...|
|636690|[94081269, 15658246, 224910, 28013445, 713335, 81796388, 26700060, 1503094, 13067487, 19307218, 2...|
|638112|[59891946, 38154720, 175970798, 161035742, 86921245, 78781234, 26373912, 153825846, 61255802, 214...|
|640841|[3163370, 16558871, 3738692, 5323037, 40500658, 13237297, 65642648, 37495668, 14470277, 33076425,...|
|640846|[2

                                                                                

In [23]:
# Print signature matrix dimensions
print("Signature matrix rows length: ", sig_matrix_df.count())
print("Number of minhash signature: ", len(sig_matrix_df.take(1)[0][1]))

                                                                                

Signature matrix rows length:  500000



Number of minhash signature:  100


                                                                                

## LOCALITY SENSITIVE HASHING

LSH is a technique used in data mining and similarity search to efficiently approximate the similarity between high-dimensional data points. The key idea behind LSH is to hash data points in such a way that similar points are mapped to the same or nearby "buckets" with high probability. This reduces the number of candidates that need to be considered when searching for similar data points, which can significantly speed up the process.

Steps:
- Hashing: LSH employs multiple hash functions that map data points to a set of buckets. Similar data points are more likely to be mapped to the same bucket, but this probability decreases as the dissimilarity between data points increases.
- Threshold evaluation: Once the data points are hashed, a similarity threshold is defined. Only data points in the same or nearby buckets are considered candidates for being similar to a given query point.
- Candidate Search: During a search, the system only needs to consider data points in the candidate buckets to identify the nearest neighbors or similar items. This reduces the search space and computational cost.
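The steps above can be sketched in plain Python with the banding scheme commonly used for MinHash signatures. The function name `lsh_candidates` and the toy signatures are made up for illustration. For simplicity the sketch uses each band's tuple of values directly as the bucket key instead of hashing it.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidates(signatures, bands, rows):
    """Return pairs of doc ids that share a bucket in at least one band."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            # Keying on (band, chunk) keeps the bands independent of each other.
            buckets[(band, chunk)].append(doc_id)

    candidates = set()
    for doc_ids in buckets.values():
        for pair in combinations(sorted(doc_ids), 2):
            candidates.add(pair)
    return candidates


signatures = {
    "d1": [1, 2, 3, 4, 5, 6],
    "d2": [1, 2, 3, 9, 9, 9],  # shares the first band with d1
    "d3": [7, 7, 7, 8, 8, 8],  # shares no band with the others
}
print(lsh_candidates(signatures, bands=2, rows=3))  # {('d1', 'd2')}
```

Only the pairs returned here need an exact similarity check, which is where the reduction in search space comes from.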

In [75]:
from utilities import HashFamily

"""
Implementation notes:
    - Split each signature into equal-sized subsets and assign each subset to a band.
    - Hash the subset to get its bucket.
    - Map task output format: key = (band_id, bucket), value = doc_id
    - Reduce task output format: key = (band_id, bucket), value = list of candidate doc_ids
"""


def map_buckets(row):
    # 10 bands of 10 rows each, because every signature holds 100 minhash values
    band_number = 10
    row_number = 10
    doc_id = row[0]
    doc_sign = row[1]
    hash_funct = HashFamily(1)
    out = []

    for i in range(0, band_number):
        band_id = i
        idx = i * row_number
        set_col = " ".join(str(x) for x in doc_sign[idx : idx + row_number])
        bucket = hash_funct.get_hash_value(set_col)
        out.append(((band_id, bucket), doc_id))

    return iter(out)


# Map task
candidate_map = sig_matrix.flatMap(map_buckets)

# Reduce task
candidate_reduce = candidate_map.groupByKey().map(lambda x: (x[0], list(x[1])))
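The choice of `band_number = 10` and `row_number = 10` determines which pairs are likely to become candidates. Under the standard banding analysis, two documents whose signatures agree at a fraction `s` of positions share at least one bucket with probability `1 - (1 - s^r)^b`, and the curve rises most steeply around `(1/b)^(1/r)`, which is about 0.79 for these settings. The sketch below evaluates that formula; it assumes independent hash functions and ignores bucket-hash collisions, and the function name `candidate_probability` is hypothetical.

```python
def candidate_probability(s, bands, rows):
    """Probability that two documents with similarity s share a bucket."""
    return 1.0 - (1.0 - s ** rows) ** bands


bands, rows = 10, 10
threshold = (1.0 / bands) ** (1.0 / rows)  # approximate midpoint of the S-curve
print(f"approximate threshold: {threshold:.3f}")
for s in (0.5, 0.7, 0.8, 0.9):
    p = candidate_probability(s, bands, rows)
    print(f"s={s:.1f} -> P(candidate)={p:.3f}")
```

More bands with fewer rows each lower the threshold and admit more candidates; fewer bands with more rows do the opposite.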

In [25]:
candidate_df = candidate_reduce.toDF(["key", "value"])
candidate_df.show()


+---------------+---------+
|            key|    value|
+---------------+---------+
| {1, 401173258}| [758214]|
|{5, 1753997025}| [761608]|
| {7, 113005060}| [866239]|
|{7, 1733999151}|[1008278]|
|{2, 3954184401}|[1019429]|
|{7, 1036611283}|[1028002]|
|{5, 2912925647}|[1033236]|
|{5, 3109292376}|[1132889]|
|{2, 1199289725}|[1151087]|
|{7, 3337611444}|[1197705]|
|{2, 2856988549}|[1440344]|
| {9, 900253593}|[3417616]|
|{4, 1245531571}|[3518315]|
|{3, 3742912261}|[3977011]|
|{5, 2612624286}|[4548150]|
|{8, 3936323950}| [690682]|
|{1, 3318745282}|[1072153]|
|{5, 1987524866}|[1346496]|
|{6, 3798984574}|[1582667]|
|{6, 3041842877}|[3430501]|
+---------------+---------+
only showing top 20 rows



                                                                                

In [66]:
from itertools import combinations

# Collect the signature matrix to the driver as a list of (doc_id, signature) tuples
sig_matrix_rows = sig_matrix.collect()
sig_matrix_dict = {item[0]: item[1] for item in sig_matrix_rows}

# Calculate similarity
def calculate_similarity(row):
    out = []
    list_doc_id = row[1]
    for pair in combinations(list_doc_id, 2):
        doc_id_1 = pair[0]
        doc_id_2 = pair[1]
        sig_1 = set(sig_matrix_dict[doc_id_1])
        sig_2 = set(sig_matrix_dict[doc_id_2])
        js = len(sig_1.intersection(sig_2)) / len(sig_1.union(sig_2))
        out.append((pair, js))
    return iter(out)


similar_pairs = candidate_reduce.flatMap(calculate_similarity)
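`calculate_similarity` compares two signatures as sets of values. The estimator usually associated with MinHash instead counts the positions at which the two signatures agree, which requires both signatures to list the hash functions in the same order. The sketch below puts the two measures side by side on made-up signatures; the function names are hypothetical and the numbers are illustrative only.

```python
def set_jaccard(sig_1, sig_2):
    """Jaccard similarity of the signature values treated as sets."""
    s1, s2 = set(sig_1), set(sig_2)
    return len(s1 & s2) / len(s1 | s2)


def positional_agreement(sig_1, sig_2):
    """Fraction of positions at which the two signatures hold the same value."""
    matches = sum(a == b for a, b in zip(sig_1, sig_2))
    return matches / len(sig_1)


sig_1 = [11, 42, 7, 93, 5]
sig_2 = [11, 42, 8, 93, 6]
print(set_jaccard(sig_1, sig_2))           # 3 shared values out of 7 distinct
print(positional_agreement(sig_1, sig_2))  # 3 matching positions out of 5
```

The two measures rank pairs similarly but are not equal, so a threshold tuned for one does not transfer directly to the other.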

In [67]:
similar_pairs_df = similar_pairs.toDF()
similar_pairs_df.show()


+------------------+-------------------+
|                _1|                 _2|
+------------------+-------------------+
|{3751877, 3470923}| 0.7543859649122807|
|{1220598, 1218266}| 0.4084507042253521|
|{3995765, 1681273}|                0.0|
|{1042239, 1039893}| 0.5267175572519084|
|{1042239, 1030535}| 0.5267175572519084|
|{1042239, 1039849}| 0.4492753623188406|
|{1042239, 1030590}| 0.5037593984962406|
|{1042239, 1039852}| 0.5384615384615384|
|{1039893, 1030535}| 0.5384615384615384|
|{1039893, 1039849}|0.47058823529411764|
|{1039893, 1030590}| 0.5151515151515151|
|{1039893, 1039852}| 0.5267175572519084|
|{1030535, 1039849}|0.48148148148148145|
|{1030535, 1030590}| 0.5037593984962406|
|{1030535, 1039852}| 0.5503875968992248|
|{1039849, 1030590}|0.45985401459854014|
|{1039849, 1039852}|0.48148148148148145|
|{1030590, 1039852}|0.48148148148148145|
|{7381505, 7377476}| 0.7543859649122807|
|{7381505, 7381344}| 0.6666666666666666|
+------------------+-------------------+
only showing top

                                                                                

In [78]:
THRESHOLD = 0.8

filtered_pairs = similar_pairs.filter(lambda x: x[1] >= THRESHOLD)
print("SIMILARS DOCUMENT NUMBERS: ", filtered_pairs.count())


SIMILARS DOCUMENT NUMBER:  12929


                                                                                

In [84]:
# Take the first two similar pairs as examples
examples = filtered_pairs.take(2)

ex_1_doc_1_id = examples[0][0][0]
ex_1_doc_2_id = examples[0][0][1]

ex_2_doc_1_id = examples[1][0][0]
ex_2_doc_2_id = examples[1][0][1]

                                                                                

In [115]:
print("SIMILAR DOCUMENT EXAMPLE 1: ")
df.filter((col("_id") == ex_1_doc_1_id)).show(truncate=150)
df.filter(col("_id") == ex_1_doc_2_id).show(truncate=150)

SIMILAR DOCUMENT EXAMPLE 1: 
+-------+---------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+
|    _id|                                     metadata|                                                                                                                                                  text|          title|
+-------+---------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+
|3556508|{https://en.wikipedia.org/wiki?curid=3556508}|Chulliyar River is one of the tributaries of the river Gayathripuzha. "Gayathripuzha" is one of the main tributaries of the Bharathapuzha River, th...|Chulliyar River|
+-------+---------------------------------------------+------------------------

                                                                                

+-------+---------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
|    _id|                                     metadata|                                                                                                                                                  text|         title|
+-------+---------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
|3556499|{https://en.wikipedia.org/wiki?curid=3556499}|Meenkarappuzha River is one of the tributaries of the river Gayathripuzha. "Gayathripuzha" is one of the main tributaries of the Bharathapuzha Rive...|Meenkarappuzha|
+-------+---------------------------------------------+---------------------------------------------------------

                                                                                

In [114]:
print("SIMILAR DOCUMENT EXAMPLE 2: ")
df.filter((col("_id") == ex_2_doc_1_id)).show(truncate=150)
df.filter(col("_id") == ex_2_doc_2_id).show(truncate=150)

SIMILAR DOCUMENT EXAMPLE 2: 


                                                                                

+-------+---------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------+
|    _id|                                     metadata|                                                                                                                                                  text|                                        title|
+-------+---------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------+
|3468627|{https://en.wikipedia.org/wiki?curid=3468627}|The modern pentathlon at the 1968 Summer Olympics was represented by two events (both for men): "Individual competition" and "Team competition". As...|Modern pentathlon at the 1968 Summe

## FINAL CODE

Merge all the steps above into a single LSH pipeline for similarity search with PySpark.

In [1]:
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
        .appName("SimilaritySearch") \
        .master("spark://172.29.15.15:7077") \
        .config("spark.executor.instances", "2") \
        .config("spark.executor.cores", "10") \
        .config("spark.executor.memory", "16g") \
        .config("spark.driver.memory", "8g") \
        .config("spark.driver.maxResultSize", "12g") \
        .config("spark.shuffle.io.connectionTimeout", "600s") \
        .getOrCreate()

spark.sparkContext.addPyFile("./utils/utilities.py")

23/11/03 10:51:26 WARN Utils: Your hostname, khangPC resolves to a loopback address: 127.0.0.1; using 172.29.15.15 instead (on interface eth0)
23/11/03 10:51:26 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/03 10:51:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
from utilities import Shingling, HashFamily
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
import math
import time
from itertools import combinations


### DEFINE SUPPORT FUNCTION
def shingling_map(row):
    out = []
    sh_instance = Shingling(5)
    shingles = sh_instance.get_shingles(row["title"] + " " + row["text"], words=True)
    signature_size = 100
    for i in range(0, signature_size):  # signature size
        out.append((row["_id"], shingles, i))

    # return an iterator to use flatMap => produce more than one key-value pair as output (namely one per hash function)
    return iter(out)

def minhash_map(row):
    doc_id = row[0]
    shingles = row[1]
    hash_f = HashFamily(row[2])
    min_h = math.inf
    for el in shingles:
        hash_value = hash_f.get_hash_value(el)
        if hash_value < min_h:
            min_h = hash_value

    return (doc_id, min_h)

def map_buckets(row):
    # 10 bands of 10 rows each, because every signature holds 100 minhash values
    band_number = 10
    row_number = 10
    doc_id = row[0]
    doc_sign = row[1]
    hash_funct = HashFamily(1)
    out = []

    for i in range(0, band_number):
        band_id = i
        idx = i * row_number
        set_col = " ".join(str(x) for x in doc_sign[idx : idx + row_number])
        bucket = hash_funct.get_hash_value(set_col)
        out.append(((band_id, bucket), doc_id))

    return iter(out)

def calculate_similarity(row):
    out = []
    list_doc_id = row[1]
    for pair in combinations(list_doc_id, 2):
        doc_id_1 = pair[0]
        doc_id_2 = pair[1]
        sig_1 = set(sig_matrix_dict[doc_id_1])
        sig_2 = set(sig_matrix_dict[doc_id_2])
        js = len(sig_1.intersection(sig_2)) / len(sig_1.union(sig_2))
        out.append((pair, js))
    return iter(out)

In [1]:
THRESHOLD = 0.8

start_time = time.time()
# Read datafile
file_path = "hdfs://172.29.15.15:9000/khang/dataset/corpus.jsonl"
df = spark.read.json(file_path)

# Keep only the first 1,000,000 rows because of limited hardware
df = df.limit(1000000)

## Step 1: Generate shingles
shingles = df.rdd.flatMap(shingling_map)

## Step 2: Apply min-hashing
# Map Task (stored under a new name so the minhash_map function is not overwritten)
minhash_map_task = shingles.map(minhash_map)
# Reduce Task
sig_matrix = minhash_map_task.groupByKey().map(lambda x : (x[0], list(x[1])))

## Step 3: Split each signature into bands and hash every band to a bucket
# Map Task
candidate_map = sig_matrix.flatMap(map_buckets)
# Reduce Task
candidate_reduce = candidate_map.groupByKey().map(lambda x: (x[0], list(x[1])))

## Step 4: Collect sig_matrix to the driver and convert it to a dict
sig_matrix_rows = sig_matrix.collect()
sig_matrix_dict = {item[0]: item[1] for item in sig_matrix_rows}

## Step 5: Calculate similarity
similar_pairs = candidate_reduce.flatMap(calculate_similarity)

## Step 6: Filter candidate pairs by the similarity threshold
filtered_pairs = similar_pairs.filter(lambda x: x[1] >= THRESHOLD)

# Note: Spark is lazy, so Steps 3, 5 and 6 only run when count() is called below,
# after the timer has stopped. The measured time does not include them.
end_time = time.time()
print("SIMILARS DOCUMENT PAIR NUMBERS: ", filtered_pairs.count())
print(f"Time execution: {end_time-start_time} (s)")

SIMILARS DOCUMENT PAIR NUMBERS: 66322
Time execution: 1823 (s)


In [4]:
filtered_pairs_df = filtered_pairs.toDF(["pair", "jaccard similarity"])
filtered_pairs_df.show()


+------------------+------------------+
|              pair|jaccard similarity|
+------------------+------------------+
|{2568924, 2569147}|0.8018018018018018|
|{2568965, 2569147}|0.8018018018018018|
|{2569147, 2569239}|0.8348623853211009|
|{6557376, 6557914}|0.8181818181818182|
|  {493773, 491812}|0.8018018018018018|
|  {493773, 526980}|0.8018018018018018|
|  {491817, 526982}|0.8018018018018018|
|  {491817, 491896}|0.8181818181818182|
|  {491817, 526980}|0.8181818181818182|
|  {493230, 526980}|0.8181818181818182|
|  {476990, 491896}|0.8018018018018018|
|  {526974, 526980}|0.8018018018018018|
|  {492167, 526980}|0.8018018018018018|
|  {493194, 526980}|0.8181818181818182|
|  {526982, 491812}|0.8018018018018018|
|  {526982, 492468}|0.8181818181818182|
|  {526982, 491896}|0.8181818181818182|
|  {526982, 526971}|0.8181818181818182|
|  {526982, 526980}|0.8518518518518519|
|  {491812, 491896}|0.8018018018018018|
+------------------+------------------+
only showing top 20 rows



                                                                                

In [18]:
sample = filtered_pairs_df.sample(fraction=0.1, seed=69).limit(10).collect()

                                                                                


In [19]:
for i, row in enumerate(sample):
    doc_1_id = row.pair[0]
    doc_2_id = row.pair[1]
    print(f"SAMPLE {i}:")
    df.filter(col("_id") == doc_1_id).show(truncate=150)
    df.filter(col("_id") == doc_2_id).show(truncate=150)
    print("\n")

SAMPLE 0:


                                                                                

+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
|     _id|                                      metadata|                                                                                                                                                  text|         title|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
|11881967|{https://en.wikipedia.org/wiki?curid=11881967}|Niederbrombach is an "Ortsgemeinde" – a municipality belonging to a "Verbandsgemeinde", a kind of collective municipality – in the Birkenfeld distr...|Niederbrombach|
+--------+----------------------------------------------+-----------------------------------------------

                                                                                

+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+
|     _id|                                      metadata|                                                                                                                                                  text|           title|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+
|11881401|{https://en.wikipedia.org/wiki?curid=11881401}|Dambach is an "Ortsgemeinde" – a municipality belonging to a "Verbandsgemeinde", a kind of collective municipality – in the Birkenfeld district in ...|Dambach, Germany|
+--------+----------------------------------------------+---------------------------------------

                                                                                

+--------+----------------------------------------------+------------------------------------------------------------------------------------------+-------------+
|     _id|                                      metadata|                                                                                      text|        title|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------+-------------+
|11825661|{https://en.wikipedia.org/wiki?curid=11825661}|For background information about this competition, please refer to the Amco Cup main page.|1975 Amco Cup|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------+-------------+



                                                                                

+--------+----------------------------------------------+------------------------------------------------------------------------------------------+-------------+
|     _id|                                      metadata|                                                                                      text|        title|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------+-------------+
|11845738|{https://en.wikipedia.org/wiki?curid=11845738}|For background information about this competition, please refer to the Amco Cup main page.|1979 Amco Cup|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------+-------------+



SAMPLE 2:
+--------+----------------------------------------------+-------------------------------------------------------------------------------------------------------------------

                                                                                

+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|     _id|                                      metadata|                                                                                                                                                  text| title|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|12327183|{https://en.wikipedia.org/wiki?curid=12327183}|Tellig is an "Ortsgemeinde" – a municipality belonging to a "Verbandsgemeinde", a kind of collective municipality – in the Cochem-Zell district in ...|Tellig|
+--------+----------------------------------------------+-------------------------------------------------------------------------------

                                                                                

+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+
|     _id|                                      metadata|                                                                                                                                                  text|                    title|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+
|11859633|{https://en.wikipedia.org/wiki?curid=11859633}|Alf is an "Ortsgemeinde" – a municipality belonging to a "Verbandsgemeinde", a kind of collective municipality – in the Cochem-Zell district in Rhi...|Alf, Rhineland-Palatinate|
+--------+----------------------------------------------+---

                                                                                

+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|     _id|                                      metadata|                                                                                                                                                  text|   title|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|12326397|{https://en.wikipedia.org/wiki?curid=12326397}|Haserich is an "Ortsgemeinde" – a municipality belonging to a "Verbandsgemeinde", a kind of collective municipality – in the Cochem-Zell district i...|Haserich|
+--------+----------------------------------------------+-----------------------------------------------------------------------

                                                                                

+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+---------+
|     _id|                                      metadata|                                                                                                                                                  text|    title|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+---------+
|12326415|{https://en.wikipedia.org/wiki?curid=12326415}|Hesweiler is an "Ortsgemeinde" – a municipality belonging to a "Verbandsgemeinde", a kind of collective municipality – in the Cochem-Zell district ...|Hesweiler|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+---------+

+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+---------+
|     _id|                                      metadata|                                                                                                                                                  text|    title|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+---------+
|12327247|{https://en.wikipedia.org/wiki?curid=12327247}|Walhausen is an "Ortsgemeinde" – a municipality belonging to a "Verbandsgemeinde", a kind of collective municipality – in the Cochem-Zell district ...|Walhausen|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+---------+

+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+
|     _id|                                      metadata|                                                                                                                                                  text|      title|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+
|11882294|{https://en.wikipedia.org/wiki?curid=11882294}|Stipshausen is an "Ortsgemeinde" – a municipality belonging to a "Verbandsgemeinde", a kind of collective municipality – in the Birkenfeld district...|Stipshausen|
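+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+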

+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+---------+
|     _id|                                      metadata|                                                                                                                                                  text|    title|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+---------+
|11881652|{https://en.wikipedia.org/wiki?curid=11881652}|Gösenroth is an "Ortsgemeinde" – a municipality belonging to a "Verbandsgemeinde", a kind of collective municipality – in the Birkenfeld district i...|Gösenroth|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+---------+

+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|     _id|                                      metadata|                                                                                                                                                  text| title|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|12325504|{https://en.wikipedia.org/wiki?curid=12325504}|Alflen is an "Ortsgemeinde" – a municipality belonging to a "Verbandsgemeinde", a kind of collective municipality – in the Cochem-Zell district in ...|Alflen|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+------+

+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|     _id|                                      metadata|                                                                                                                                                  text|  title|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|12326457|{https://en.wikipedia.org/wiki?curid=12326457}|Kliding is an "Ortsgemeinde" – a municipality belonging to a "Verbandsgemeinde", a kind of collective municipality – in the Cochem-Zell district in...|Kliding|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+-------+

+-------+---------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|    _id|                                     metadata|                                                                                                                                                  text| title|
+-------+---------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|6301393|{https://en.wikipedia.org/wiki?curid=6301393}|Buttes was a municipality in the district of Val-de-Travers in the canton of Neuchâtel in Switzerland. On 1 January 2009, the former municipalities...|Buttes|
+-------+---------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+------+

+-------+---------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+---------+
|    _id|                                     metadata|                                                                                                                                                  text|    title|
+-------+---------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+---------+
|6301384|{https://en.wikipedia.org/wiki?curid=6301384}|Boveresse was a municipality in the district of Val-de-Travers in the canton of Neuchâtel in Switzerland. On 1 January 2009, the former municipalit...|Boveresse|
+-------+---------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+---------+

+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+----------+
|     _id|                                      metadata|                                                                                                                                                  text|     title|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+----------+
|12326811|{https://en.wikipedia.org/wiki?curid=12326811}|Panzweiler is an "Ortsgemeinde" – a municipality belonging to a "Verbandsgemeinde", a kind of collective municipality – in the Cochem-Zell district...|Panzweiler|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+----------+

+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|     _id|                                      metadata|                                                                                                                                                  text|   title|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|12326397|{https://en.wikipedia.org/wiki?curid=12326397}|Haserich is an "Ortsgemeinde" – a municipality belonging to a "Verbandsgemeinde", a kind of collective municipality – in the Cochem-Zell district i...|Haserich|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+--------+

+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+
|     _id|                                      metadata|                                                                                                                                                  text|                    title|
+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+
|11859633|{https://en.wikipedia.org/wiki?curid=11859633}|Alf is an "Ortsgemeinde" – a municipality belonging to a "Verbandsgemeinde", a kind of collective municipality – in the Cochem-Zell district in Rhi...|Alf, Rhineland-Palatinate|
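+--------+----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+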

Exception in thread "serve-DataFrame" java.net.SocketTimeoutException: Accept timed out
	at java.net.PlainSocketImpl.socketAccept(Native Method)
	at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
	at java.net.ServerSocket.implAccept(ServerSocket.java:560)
	at java.net.ServerSocket.accept(ServerSocket.java:528)
	at org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:65)
