We are given a PySpark DataFrame containing user registration data. Due to manual entry or OCR errors, names and emails may contain minor typos.

Our task is:
1.   To detect potential duplicate users based on similar name and email patterns (e.g., "nyima sherpa" vs. "nima sherpa", or nyima@gmail.com vs. nyima@gmal.com).
2.   To flag pairs where the Levenshtein distance between the names or the emails is at or below a chosen threshold (2 in this notebook).
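
To make the metric concrete, here is a minimal plain-Python sketch of the Levenshtein (edit) distance: the smallest number of single-character insertions, deletions, or substitutions needed to turn one string into another. The helper name `levenshtein_distance` is hypothetical and exists only for illustration; the notebook itself relies on Spark's built-in `levenshtein` column function.

```python
def levenshtein_distance(s: str, t: str) -> int:
    """Edit distance via dynamic programming, keeping one row at a time."""
    # previous[j] = distance between the empty prefix of s and t[:j]
    previous = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        # current[0] = distance between s[:i] and the empty string
        current = [i]
        for j, char_t in enumerate(t, start=1):
            substitution_cost = 0 if char_s == char_t else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_s
                    current[j - 1] + 1,                   # insert char_t
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


print(levenshtein_distance("nyima sherpa", "nima sherpa"))  # 1 (one deletion)
print(levenshtein_distance("john smith", "jhon smith"))     # 2 (a swap costs two edits)
```

Note that a transposition such as "john" vs. "jhon" counts as two edits under this metric, which is why a threshold of 1 would miss it.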



In [6]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col,levenshtein

In [7]:
spark=SparkSession.builder.appName("detect email duplicates").getOrCreate()

In [8]:
data=[
 ("ABC1", "nyima sherpa", "nyima@gmail.com"),
 ("ABC2", "nima sherpa", "nyima@gmal.com"),
 ("ABC3", "john smith", "john@xyz.com"),
 ("ABC4", "jhon smith", "johne@xyz.com"),
 ("ABC5", "Pasang", "pasang.sherpa@abc.com")
]


In [9]:
data

[('ABC1', 'nyima sherpa', 'nyima@gmail.com'),
 ('ABC2', 'nima sherpa', 'nyima@gmal.com'),
 ('ABC3', 'john smith', 'john@xyz.com'),
 ('ABC4', 'jhon smith', 'johne@xyz.com'),
 ('ABC5', 'Pasang', 'pasang.sherpa@abc.com')]

In [15]:
df=spark.createDataFrame(data,["user_id","full_name","Email"])
df

DataFrame[user_id: string, full_name: string, Email: string]

Self join to compare each pair

In [16]:
df1=df.alias("a")

In [17]:
df2=df.alias("b")
df2

DataFrame[user_id: string, full_name: string, Email: string]

#Avoid duplicate comparisons and self-matches: keep only pairs where a.user_id < b.user_id

In [18]:
joined=df1.join(df2,col("a.user_id")<col("b.user_id"))
joined

DataFrame[user_id: string, full_name: string, Email: string, user_id: string, full_name: string, Email: string]
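
As a rough illustration of what the `<` join condition buys us, the plain-Python sketch below counts the pairs a full cross join would produce against the pairs that survive the filter. It uses only the five `user_id` values from this notebook; the variable names are illustrative.

```python
from itertools import product

user_ids = ["ABC1", "ABC2", "ABC3", "ABC4", "ABC5"]

# A full cross join pairs every user with every user, including itself.
all_pairs = list(product(user_ids, repeat=2))

# Keeping only a < b drops self-matches (a == b) and mirrored pairs (b, a).
kept_pairs = [(a, b) for a, b in all_pairs if a < b]

print(len(all_pairs))   # 25
print(len(kept_pairs))  # 10
```

For n users this leaves n(n-1)/2 comparisons instead of n². That still grows quadratically, so on large tables some form of pre-grouping before the join would likely be needed.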

Calculate Levenshtein distances

In [27]:
result=(
    joined.select(
        col("a.user_id").alias("user_1"),
        col("b.user_id").alias("user_2"),
        levenshtein(col("a.full_name"),col("b.full_name")).alias("distance_name"),
        levenshtein(col("a.Email"),col("b.Email")).alias("email_distance"),
    )
    # keep pairs whose names OR emails differ by at most 2 edits
    .filter((col("distance_name")<=2) | (col("email_distance")<=2))
)

In [28]:
result.show()

+------+------+-------------+--------------+
|user_1|user_2|distance_name|email_distance|
+------+------+-------------+--------------+
|  ABC1|  ABC2|            1|             1|
|  ABC3|  ABC4|            2|             1|
+------+------+-------------+--------------+
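
The flagged pairs can be cross-checked without Spark. The sketch below is a plain-Python re-creation of the same logic under the same threshold of 2; `edit_distance` is a hypothetical helper (a memoized recursive Levenshtein, adequate only for short strings), not part of PySpark.

```python
from functools import lru_cache
from itertools import combinations


@lru_cache(maxsize=None)
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via memoized recursion (fine for short strings)."""
    if not s:
        return len(t)
    if not t:
        return len(s)
    if s[0] == t[0]:
        return edit_distance(s[1:], t[1:])
    return 1 + min(
        edit_distance(s[1:], t),      # delete the first character of s
        edit_distance(s, t[1:]),      # insert the first character of t
        edit_distance(s[1:], t[1:]),  # substitute one for the other
    )


users = [
    ("ABC1", "nyima sherpa", "nyima@gmail.com"),
    ("ABC2", "nima sherpa", "nyima@gmal.com"),
    ("ABC3", "john smith", "john@xyz.com"),
    ("ABC4", "jhon smith", "johne@xyz.com"),
    ("ABC5", "Pasang", "pasang.sherpa@abc.com"),
]

THRESHOLD = 2  # same cut-off as the Spark filter
flagged = []
# combinations() yields each unordered pair once, like the a.user_id < b.user_id join
for (id_a, name_a, email_a), (id_b, name_b, email_b) in combinations(users, 2):
    name_d = edit_distance(name_a, name_b)
    email_d = edit_distance(email_a, email_b)
    if name_d <= THRESHOLD or email_d <= THRESHOLD:
        flagged.append((id_a, id_b, name_d, email_d))

print(flagged)
# [('ABC1', 'ABC2', 1, 1), ('ABC3', 'ABC4', 2, 1)]
```

ABC5 is never flagged: its name and email are too far from every other record for either distance to fall within the threshold.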

