# **🔍 Detect Typographical Issues in Customer Data**

This notebook focuses exclusively on identifying potential typos across key customer fields.
The goal is to surface rows that may require manual review or automated correction.

What this notebook does:
- **Flags email** typos (format issues, invalid patterns)
- **Flags state** typos (invalid or non-standard abbreviations)
- **Flags city** typos using Levenshtein distance + best-match suggestion
- **Flags full-name** anomalies (optional targeted checks)

In [1]:
# ------------------------------------------------------------
# PySpark Imports for Typo Detection Notebook
# ------------------------------------------------------------
# SparkSession: create and manage Spark execution context
# functions (F): core column expressions and transformations
# col, lit, lower, when: used for email/state validation
#   (rlike is called as a Column method, so it needs no import;
#   the standalone function only exists in Spark 3.5+)
# concat_ws: optional helper for full-name checks
# levenshtein: compute edit distance for city typo detection
# row_number + Window: select best city match per record
# types (T): define schemas when needed
# ------------------------------------------------------------

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import (
    col, lit, lower, when, concat_ws,
    levenshtein, row_number
)
from pyspark.sql.window import Window
from pyspark.sql import types as T

In [2]:
# -----------------------------
# Load data
# -----------------------------
# Adjust the path if the file is stored elsewhere.
csv_file_path = "Files/csv/customers/CustomersLocation.csv"
df = spark.read.format("csv").option("header","true").load(csv_file_path)
display(df.limit(5))

In [3]:

# ------------------------------------------------------------
# 1. EMAIL VALIDATION
# ------------------------------------------------------------
# email_regex:
#   Defines the allowed structure of an email address using a
#   simplified pattern (not a full RFC 5322 validator).
#   - Ensures presence of username, '@', domain, and TLD
#   - Filters out malformed or incomplete email formats
#   - Used with rlike() to flag invalid email entries
#   - Note: a null email yields a null flag, not True
# ------------------------------------------------------------

email_regex = r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"
df = df.withColumn("invalid_email", ~col("email").rlike(email_regex))

# ------------------------------------------------------------
# 2. VALID STATE CHECK
# ------------------------------------------------------------
# valid_states:
#   Two-letter USPS abbreviations for the 50 U.S. states plus
#   DC. Used as the reference set for validation. Territories
#   such as PR or GU are not included.
#
# invalid_state column:
#   Flags any record where the state value is NOT in the list
#   of valid abbreviations. The match is exact and
#   case-sensitive, so "ca" or " CA" is flagged as well.
#   - Ensures standardized two-letter state codes
#   - Quickly surfaces typos, misspellings, or non-US entries
# ------------------------------------------------------------
valid_states = ["AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN","IA",
                "KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ",
                "NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT",
                "VA","WA","WV","WI","WY","DC"]
df = df.withColumn("invalid_state", ~col("state").isin(valid_states))


# ------------------------------------------------------------
# 3. CITY TYPO DETECTION (LEVENSHTEIN)
# ------------------------------------------------------------
# valid_cities:
#   Reference list of known, standardized U.S. city names.
#   Used as the comparison set for detecting potential typos.
#
# valid_cities_df:
#   Spark DataFrame version of the city list, enabling a
#   cross-join with the input dataset.
#
# Purpose of this section:
#   - Compute Levenshtein distance between each input city
#     and every valid city.
#   - Identify the closest match for each record.
#   - Surface likely typos based on edit-distance thresholds.
# ------------------------------------------------------------
valid_cities = [
    "New York","Los Angeles","Chicago","Houston","Phoenix","Philadelphia","San Antonio",
    "San Diego","Dallas","San Jose","Austin","Jacksonville","Fort Worth","Columbus",
    "Charlotte","San Francisco","Indianapolis","Seattle","Denver","Washington","Miami",
    "Boston","Detroit","El Paso","Nashville","Portland","Memphis","Oklahoma City",
    "Las Vegas","Louisville","Baltimore","Milwaukee","Albuquerque","Tucson","Fresno",
    "Sacramento","Kansas City","Atlanta","Omaha","Raleigh","Long Beach","Virginia Beach",
    "Oakland","Minneapolis","Tampa","Arlington","New Orleans","Anaheim"
]
valid_cities_df = spark.createDataFrame([(c,) for c in valid_cities], ["valid_city"])

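The email and state checks above can be tried outside Spark with plain Python. The sketch below reuses the notebook's email pattern with a shortened state list. The helper names `is_invalid_email` and `is_invalid_state` and the sample values are illustrative assumptions, not part of the notebook. Unlike the Spark expressions, which yield null for a null input, this sketch treats a missing value as invalid.

```python
import re

# Same pattern as the notebook's email_regex.
EMAIL_REGEX = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

# Shortened list for illustration; the notebook uses 50 states plus DC.
VALID_STATES = {"CA", "MI", "NY", "TX", "DC"}


def is_invalid_email(email):
    """Return True when the value is missing or does not match the pattern."""
    if email is None:
        return True  # assumption: missing emails should be reviewed
    return EMAIL_REGEX.search(email) is None


def is_invalid_state(state):
    """Return True unless the value is exactly one of the valid codes."""
    return state not in VALID_STATES


for email in ["jane.doe@example.com", "jane.doe@example",
              "jane doe@example.com", "jane.doe@gmial.com", None]:
    print(repr(email), is_invalid_email(email))

for state in ["MI", "mi", " MI", "Mich", None]:
    print(repr(state), is_invalid_state(state))
```

The pattern only checks the shape of an address, so a misspelt but well-formed domain such as `gmial.com` passes. The state check is exact, so `mi` and ` MI` are flagged unless the column is trimmed and upper-cased first.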

In [4]:
# ------------------------------------------------------------
# Cross-compare each city with all valid cities to determine
# the closest match based on minimum Levenshtein distance.
# ------------------------------------------------------------
# comp:
#   Performs a full cross-join between the input DataFrame (a)
#   and the valid city list (b). For each pair, computes the
#   Levenshtein edit distance between the input city and the
#   candidate valid city.
#   - lower(): ensures case-insensitive comparison
#   - levenshtein(): counts the single-character edits needed
#     to turn one string into the other (lower = more similar)
#
# w (Window):
#   Partitions by record ID and orders candidate matches by
#   ascending edit distance so the closest match appears first.
#
# best_city:
#   Selects the single best match per record (rn = 1), returning:
#     - id: original record identifier
#     - city: original city value
#     - best_city: closest valid city match
#     - city_dist: edit distance score
#
# Final join:
#   Merges best-city results back into the main DataFrame so
#   each record includes its closest valid city and distance.
# ------------------------------------------------------------
comp = (
    df.alias("a")
      .crossJoin(valid_cities_df.alias("b"))
      .withColumn("city_dist", levenshtein(lower(col("a.city")), lower(col("b.valid_city"))))
)

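# Secondary sort on valid_city keeps the pick deterministic when distances tie.
w = Window.partitionBy("a.id").orderBy(col("city_dist").asc(), col("b.valid_city").asc())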
best_city = (
    comp.withColumn("rn", row_number().over(w))
        .where(col("rn") == 1)
        .select(
            col("a.id").alias("id"),
            col("a.city").alias("city"),
            col("b.valid_city").alias("best_city"),
            col("city_dist")
        )
)

df = df.join(best_city, on=["id", "city"], how="left")

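To make the edit-distance step concrete, here is a minimal pure-Python sketch of Levenshtein distance and closest-match selection. The `levenshtein_distance` and `best_match` helpers and the three-city list are illustrative assumptions. The notebook itself relies on Spark's built-in `levenshtein` function and a window over the cross join.

```python
def levenshtein_distance(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep on a match)
            ))
        previous = current
    return previous[-1]


def best_match(city, candidates):
    """Return (candidate, distance) with the smallest case-insensitive distance."""
    return min(
        ((c, levenshtein_distance(city.lower(), c.lower())) for c in candidates),
        key=lambda pair: pair[1],
    )


cities = ["Detroit", "Denver", "Dallas"]  # shortened reference list
print(best_match("Detriot", cities))  # ('Detroit', 2)
```

A swapped pair of letters, as in "Detriot", costs two edits because plain Levenshtein distance has no transposition operation. This is one reason a threshold of 1 would miss common typos.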

In [5]:
# ------------------------------------------------------------
# Flag suspicious city values based on edit distance
# ------------------------------------------------------------
# CITY_DISTANCE_THRESHOLD:
#   Maximum Levenshtein distance at which a non-matching city
#   is still treated as a likely typo of best_city. Larger
#   distances are treated as unrelated values and not flagged.
#   Tunable depending on data quality.
#
# city_suspect column:
#   Flags cases where:
#     - The input city does NOT exactly match the best_city
#     - AND the edit distance is small enough to indicate a
#       likely typo rather than a completely unrelated city.
#   This helps isolate subtle misspellings (e.g., "Detriot").
# ------------------------------------------------------------
CITY_DISTANCE_THRESHOLD = 2
df = df.withColumn(
    "city_suspect",
    (lower(col("city")) != lower(col("best_city"))) & (col("city_dist") <= lit(CITY_DISTANCE_THRESHOLD))
)

# ------------------------------------------------------------
# 4. TARGETED FULL-NAME CHECK (Example: Sarah Hall)
# ------------------------------------------------------------
# full_name:
#   Combines first and last name for comparison against a list
#   of known full names of interest.
#
# known_fullnames:
#   Reference list of specific individuals to detect potential
#   misspellings or variations (extend as needed).
#
# name_comp:
#   Cross-joins each record with the known names list and
#   computes Levenshtein distance to measure similarity.
#
# best_name:
#   Selects the closest matching known name per record.
#
# name_suspect:
#   Flags cases where the full name differs from the closest
#   known name but is within a small edit-distance threshold.
#   Useful for targeted identity checks or VIP monitoring.
# ------------------------------------------------------------
df = df.withColumn("full_name", concat_ws(" ", col("first_name"), col("last_name")))

known_fullnames = ["Sarah Hall"]  # extend as needed
known_names_df = spark.createDataFrame([(n,) for n in known_fullnames], ["known_fullname"])

name_comp = (
    df.alias("a")
      .crossJoin(known_names_df.alias("k"))
      .withColumn("name_dist", levenshtein(lower(col("a.full_name")), lower(col("k.known_fullname"))))
)
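# Secondary sort on known_fullname keeps the pick deterministic when distances tie.
w2 = Window.partitionBy("a.id").orderBy(col("name_dist").asc(), col("k.known_fullname").asc())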
best_name = (
    name_comp.withColumn("rn", row_number().over(w2))
             .where(col("rn") == 1)
             .select(
                 col("a.id").alias("id"),
                 col("a.full_name").alias("full_name"),
                 col("k.known_fullname").alias("best_fullname"),
                 col("name_dist")
             )
)

df = df.join(best_name, on=["id", "full_name"], how="left")
NAME_DISTANCE_THRESHOLD = 2
df = df.withColumn(
    "name_suspect",
    (lower(col("full_name")) != lower(col("best_fullname"))) & (col("name_dist") <= lit(NAME_DISTANCE_THRESHOLD))
)

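The rule behind both `city_suspect` and `name_suspect` can be summarised as a small predicate. The sketch below is a plain-Python illustration. The `is_suspect` helper is hypothetical, and returning `None` for missing inputs is an assumption meant to resemble how a SQL null propagates.

```python
def is_suspect(value, best, distance, threshold=2):
    """Flag values that differ from their closest reference entry by a small edit distance."""
    if value is None or best is None or distance is None:
        return None  # assumption: missing inputs stay undecided, like a SQL null
    return value.lower() != best.lower() and distance <= threshold


print(is_suspect("Detriot", "Detroit", 2))  # True: differs, within threshold
print(is_suspect("detroit", "Detroit", 0))  # False: same city apart from case
print(is_suspect("Tempe", "Tampa", 2))      # True: real city, but two edits from "Tampa"
print(is_suspect("Boise", "Boston", 5))     # False: distance here is an illustrative value
print(is_suspect(None, None, None))         # None
```

The "Tempe" case shows the main limitation of a fixed threshold: a legitimate city that is absent from the reference list and happens to sit within two edits of a listed city is flagged as a typo, so flagged rows still need review before any automated correction.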

In [6]:
# ------------------------------------------------------------
# 5. Aggregate a unified review flag
# ------------------------------------------------------------
# needs_review:
#   Consolidates all individual typo and validation checks into
#   a single boolean indicator. A record is flagged when ANY of:
#     - invalid_email   → email format issues
#     - invalid_state   → non-standard or unknown state codes
#     - city_suspect    → likely city misspellings (edit distance)
#     - name_suspect    → targeted full-name anomalies
#
#   This column provides a simple, top-level signal for downstream
#   workflows, enabling quick filtering, routing, or manual review.
# ------------------------------------------------------------

df = df.withColumn(
    "needs_review",
    col("invalid_email") |
    col("invalid_state") |
    col("city_suspect") |
    col("name_suspect")
)

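Spark evaluates `|` on boolean columns with SQL three-valued logic: null OR true is true, but null OR false is null. A row whose only non-false flag is null therefore gets a null `needs_review`, and a filter such as `needs_review = true` drops it. The sketch below models this in plain Python with `None` standing in for null. `sql_or` and `combine_flags` are hypothetical helpers for illustration only, and the `default_null_to` option mimics wrapping each flag in `F.coalesce(flag, F.lit(False))` before combining.

```python
def sql_or(a, b):
    """OR under SQL three-valued logic; None stands in for null."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def combine_flags(flags, default_null_to=None):
    """OR all flags together, optionally replacing None first (like coalesce)."""
    result = False
    for flag in flags:
        if flag is None and default_null_to is not None:
            flag = default_null_to
        result = sql_or(result, flag)
    return result


# Hypothetical row: the email was null, every other check passed.
row = {"invalid_email": None, "invalid_state": False,
       "city_suspect": False, "name_suspect": False}

print(combine_flags(row.values()))                         # None
print(combine_flags(row.values(), default_null_to=False))  # False
print(combine_flags([None, True]))                         # True
```

Whether a null flag should default to false or to true (for example, treating a missing email as needing review) is a data-quality decision that this sketch leaves open.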

In [9]:
# ------------------------------------------------------------
# 6. Output results
# ------------------------------------------------------------
# Final output step:
#   - Keeps only the records that require review
#     (needs_review = true); rows with a null flag are dropped.
#   - Selects all relevant fields, including original values
#     and all typo-detection flags (email, state, city, name).
#   - Orders the flagged records by id.
#   - coalesce(1): writes a single CSV part file for convenience
#     when exporting results (useful for manual inspection).
#   - Writes the flagged records to the output folder
#     "Files/csv/customers/typos/CustomersLocation_typos.csv".
#     Spark treats this path as a directory despite the .csv
#     suffix.
#
# Final display:
#   The next cell reads the written output back and displays
#   it, providing a quick on-screen summary of detected issues.
# ------------------------------------------------------------
csv_file_typos_dir = "Files/csv/customers/typos/CustomersLocation_typos.csv"
(
    df.filter("needs_review = true")  # keep only flagged records
      .select("id", "first_name", "last_name", "email", "city", "state",
              "invalid_email", "invalid_state", "city_suspect", "city_dist", "best_city",
              "name_suspect", "name_dist", "best_fullname", "needs_review")
      .orderBy(col("id").asc())
      .coalesce(1)  # single file for convenience
      .write.mode("overwrite").option("header", True).csv(csv_file_typos_dir)
)

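Spark's CSV writer treats the target path as a directory and fills it with `part-*.csv` files plus marker files such as `_SUCCESS`, even when the path ends in `.csv`. The sketch below shows one way to locate the single part file after a `coalesce(1)` write. It assumes the output directory is reachable as a local path, which may not hold in every environment. `find_part_file` and the temporary directory are illustrative assumptions.

```python
import glob
import os
import tempfile


def find_part_file(output_dir):
    """Return the only part-*.csv file in a Spark output directory."""
    matches = sorted(glob.glob(os.path.join(output_dir, "part-*.csv")))
    if len(matches) != 1:
        raise ValueError(f"expected exactly one part file, found {len(matches)}")
    return matches[0]


# Simulate a Spark output directory for demonstration purposes.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "CustomersLocation_typos.csv")
    os.makedirs(out_dir)
    for name in ("_SUCCESS", "part-00000-demo.csv"):
        open(os.path.join(out_dir, name), "w").close()
    print(os.path.basename(find_part_file(out_dir)))  # part-00000-demo.csv
```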

In [10]:
df = spark.read.format("csv").option("header","true").load(csv_file_typos_dir)
# df now holds the flagged records read back from "Files/csv/customers/typos/CustomersLocation_typos.csv".
# All columns are strings because no schema is inferred on read.
display(df)
