# Provider Fraud Detection - Data Wrangling

### In this notebook we will focus on collecting, organizing, defining, and cleaning the relevant datasets for the Provider Fraud Detection model

#### 1.) Ingest & Inspect Raw Files
#### 2.) Standardize Column Names & Types
#### 3.) Clean and Dedupe (remove duplications)
#### 4.) Normalize & Enrich
#### 5.) Merge/Join Tables
#### 6.) Aggregate & Snapshot

In [29]:
# loading modules
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import (col, when, sum as sum_, trim, to_date, current_date, date_diff, lit)
from functools import reduce
from pyspark.sql import functions as F
import re
from collections import Counter

from pyspark.sql import Window
from pyspark.sql.functions import row_number

In [2]:
# starting Spark session
spark = SparkSession.builder.appName('ProviderFraudDetection').getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/08/01 20:10:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
# creating a csv path variable
csv_path = "NPPES_Data_Dissemination_July_2025_V2/npidata_pfile_20050523-20250713.csv"


In [4]:
# reading the CSV file into a DataFrame
full_npi_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "false")
    .csv(csv_path)
)

In [5]:
# inspecting the column names
print("Available columns (first 50 shown):")
for c in full_npi_df.columns[:50]:
    print(repr(c))
    

Available columns (first 50 shown):
'NPI'
'Entity Type Code'
'Replacement NPI'
'Employer Identification Number (EIN)'
'Provider Organization Name (Legal Business Name)'
'Provider Last Name (Legal Name)'
'Provider First Name'
'Provider Middle Name'
'Provider Name Prefix Text'
'Provider Name Suffix Text'
'Provider Credential Text'
'Provider Other Organization Name'
'Provider Other Organization Name Type Code'
'Provider Other Last Name'
'Provider Other First Name'
'Provider Other Middle Name'
'Provider Other Name Prefix Text'
'Provider Other Name Suffix Text'
'Provider Other Credential Text'
'Provider Other Last Name Type Code'
'Provider First Line Business Mailing Address'
'Provider Second Line Business Mailing Address'
'Provider Business Mailing Address City Name'
'Provider Business Mailing Address State Name'
'Provider Business Mailing Address Postal Code'
'Provider Business Mailing Address Country Code (If outside U.S.)'
'Provider Business Mailing Address Telephone Number'
'Provider B

In [9]:
# selecting the relevant columns I want to keep
keep_cols = [
    "NPI",
    "Entity Type Code",
    "Provider Business Practice Location Address State Name",
    "Provider Business Practice Location Address Postal Code",
    "Is Organization Subpart",
    "Parent Organization TIN",
    "Parent Organization LBN",        
    "Provider Enumeration Date",
    "Last Update Date",
    "NPI Deactivation Date",
    "NPI Reactivation Date",
    "Is Sole Proprietor",
    "Healthcare Provider Taxonomy Code_1",           
    "Healthcare Provider Primary Taxonomy Switch_1" 
]

In [10]:
# creating a new DataFrame with only the selected columns
npi_df = full_npi_df.select(*keep_cols)


In [11]:
# sanity check: confirming that the right columns loaded, the types are as expected, and the data looks reasonable
print("Schema:")
npi_df.printSchema()
print("Sample:")
npi_df.limit(5).show(truncate=False)

Schema:
root
 |-- NPI: string (nullable = true)
 |-- Entity Type Code: string (nullable = true)
 |-- Provider Business Practice Location Address State Name: string (nullable = true)
 |-- Provider Business Practice Location Address Postal Code: string (nullable = true)
 |-- Is Organization Subpart: string (nullable = true)
 |-- Parent Organization TIN: string (nullable = true)
 |-- Parent Organization LBN: string (nullable = true)
 |-- Provider Enumeration Date: string (nullable = true)
 |-- Last Update Date: string (nullable = true)
 |-- NPI Deactivation Date: string (nullable = true)
 |-- NPI Reactivation Date: string (nullable = true)
 |-- Is Sole Proprietor: string (nullable = true)
 |-- Healthcare Provider Taxonomy Code_1: string (nullable = true)
 |-- Healthcare Provider Primary Taxonomy Switch_1: string (nullable = true)

Sample:
+----------+----------------+------------------------------------------------------+-------------------------------------------------------+------------

In [12]:
# getting the total number of rows in the DataFrame
total_rows = npi_df.count()
print(f'Total rows: {total_rows}')



Total rows: 9026996


                                                                                

In [13]:
# getting the null counts for each column
npi_df.select(
    *[sum_(col(c).isNull().cast('int')).alias(c) for c in npi_df.columns]
).show()



+---+----------------+------------------------------------------------------+-------------------------------------------------------+-----------------------+-----------------------+-----------------------+-------------------------+----------------+---------------------+---------------------+------------------+-----------------------------------+---------------------------------------------+
|NPI|Entity Type Code|Provider Business Practice Location Address State Name|Provider Business Practice Location Address Postal Code|Is Organization Subpart|Parent Organization TIN|Parent Organization LBN|Provider Enumeration Date|Last Update Date|NPI Deactivation Date|NPI Reactivation Date|Is Sole Proprietor|Healthcare Provider Taxonomy Code_1|Healthcare Provider Primary Taxonomy Switch_1|
+---+----------------+------------------------------------------------------+-------------------------------------------------------+-----------------------+-----------------------+-----------------------+-------

                                                                                

### Up to this point, I've selected the columns I need from the NPI data, counted the total number of records, and counted the nulls in each column

In [16]:
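# normalizing the column names to snake_case so later steps can reference them without quoting
def normalize_col(name: str) -> str:
    s = name.strip().lower()
    s = re.sub(r'\(.*?\)', '', s)        # drop parenthetical text, e.g. "(EIN)"
    s = re.sub(r'[^0-9a-z]+', '_', s)    # replace runs of non-alphanumerics with "_"
    s = re.sub(r'_+', '_', s)             # collapse repeated underscores
    return s.strip('_')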
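
To make the effect of `normalize_col` concrete, here is a small standalone check on three of the raw NPPES headers shown earlier. The function body is repeated only so the snippet runs on its own; the `examples` list and `normalized_examples` dict are names I made up for this illustration.

```python
import re

def normalize_col(name: str) -> str:
    # Same logic as the notebook cell, repeated so this snippet is self-contained
    s = name.strip().lower()
    s = re.sub(r'\(.*?\)', '', s)
    s = re.sub(r'[^0-9a-z]+', '_', s)
    s = re.sub(r'_+', '_', s)
    return s.strip('_')

# Raw headers taken from the column listing above
examples = [
    "NPI",
    "Employer Identification Number (EIN)",
    "Healthcare Provider Taxonomy Code_1",
]

normalized_examples = {name: normalize_col(name) for name in examples}
for raw, norm in normalized_examples.items():
    print(f"{raw!r} -> {norm!r}")
```

Note that the parenthetical `(EIN)` is dropped entirely, which is why the collision check in the next cell matters.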


In [17]:
# computing normalized names and checking for collisions
normalized_names = [normalize_col(c) for c in keep_cols]
dupes = [n for n, cnt in Counter(normalized_names).items() if cnt > 1]
if dupes:
    raise RuntimeError(f"Normalized name collision detected: {dupes}")
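
As a rough illustration of why this guard is worth having: because parentheticals are stripped, two headers that differ only by a parenthetical normalize to the same name. The sketch below uses one real header and one hypothetical header (`"Provider Last Name"`, which I am assuming only for demonstration) to trigger a collision.

```python
import re
from collections import Counter

def normalize_col(name: str) -> str:
    # Same logic as the notebook's normalize_col, repeated for a standalone run
    s = name.strip().lower()
    s = re.sub(r'\(.*?\)', '', s)
    s = re.sub(r'[^0-9a-z]+', '_', s)
    s = re.sub(r'_+', '_', s)
    return s.strip('_')

# The second header is hypothetical; it is not claimed to exist in the NPPES file
raw_names = ["Provider Last Name (Legal Name)", "Provider Last Name"]

normalized = [normalize_col(n) for n in raw_names]
collisions = [n for n, cnt in Counter(normalized).items() if cnt > 1]
print(collisions)
```

With the 14 columns in `keep_cols` no collision occurs, so the `RuntimeError` is never raised in this notebook.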

In [18]:
# building a new DataFrame with normalized column names
normalized_npi_df = npi_df.select(
    *[col(orig).alias(norm) for orig, norm in zip(keep_cols, normalized_names)]
)


In [19]:
# doing a quick sanity check to ensure the new DataFrame has the correct columns
normalized_npi_df.printSchema()
normalized_npi_df.show(5, truncate=False)

root
 |-- npi: string (nullable = true)
 |-- entity_type_code: string (nullable = true)
 |-- provider_business_practice_location_address_state_name: string (nullable = true)
 |-- provider_business_practice_location_address_postal_code: string (nullable = true)
 |-- is_organization_subpart: string (nullable = true)
 |-- parent_organization_tin: string (nullable = true)
 |-- parent_organization_lbn: string (nullable = true)
 |-- provider_enumeration_date: string (nullable = true)
 |-- last_update_date: string (nullable = true)
 |-- npi_deactivation_date: string (nullable = true)
 |-- npi_reactivation_date: string (nullable = true)
 |-- is_sole_proprietor: string (nullable = true)
 |-- healthcare_provider_taxonomy_code_1: string (nullable = true)
 |-- healthcare_provider_primary_taxonomy_switch_1: string (nullable = true)

+----------+----------------+------------------------------------------------------+-------------------------------------------------------+-----------------------+----

In [22]:
# casting and standardizing the data types
clean = (normalized_npi_df
    # trim whitespace
    .withColumn("npi", trim(col("npi")))
    .withColumn("entity_type_code", col("entity_type_code").cast("int"))

    # flags: 'Y' maps to 1; every other value (including 'N', 'X', and null) maps to 0
    .withColumn("is_organization_subpart", when(col("is_organization_subpart") == "Y", 1).otherwise(0))
    # NPPES uses 'X' for "not answered", so only 'Y' is counted as a sole proprietor
    .withColumn("is_sole_proprietor", when(col("is_sole_proprietor") == "Y", 1).otherwise(0))

    # parse dates (adjust format if the data varies)
    .withColumn("provider_enumeration_date", to_date(col("provider_enumeration_date"), "MM/dd/yyyy"))
    .withColumn("last_update_date", to_date(col("last_update_date"), "MM/dd/yyyy"))
    .withColumn("npi_deactivation_date", to_date(col("npi_deactivation_date"), "MM/dd/yyyy"))
    .withColumn("npi_reactivation_date", to_date(col("npi_reactivation_date"), "MM/dd/yyyy"))
)

In [23]:
# primary taxonomy extraction: 'Y' marks an explicit primary taxonomy; 'X' falls back to taxonomy code 1 and is flagged as unknown
clean = clean.withColumn(
    "primary_taxonomy",
    when(col("healthcare_provider_primary_taxonomy_switch_1") == "Y", col("healthcare_provider_taxonomy_code_1"))
    .when(col("healthcare_provider_primary_taxonomy_switch_1") == "X", col("healthcare_provider_taxonomy_code_1"))  # fallback
    .otherwise(None)
).withColumn(
    "primary_taxonomy_explicit",
    when(col("healthcare_provider_primary_taxonomy_switch_1") == "Y", 1).otherwise(0)
).withColumn(
    "primary_taxonomy_unknown",
    when(col("healthcare_provider_primary_taxonomy_switch_1") == "X", 1).otherwise(0)
).withColumn(
    "primary_taxonomy_source",
    when(col("healthcare_provider_primary_taxonomy_switch_1") == "Y", lit("explicit"))
    .when(col("healthcare_provider_primary_taxonomy_switch_1") == "X", lit("fallback"))
    .otherwise(lit(None))
)
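
The branching above is easier to unit-test outside Spark. Below is a plain-Python sketch of the same rule for a single row. The function name `resolve_primary_taxonomy` is hypothetical, and the taxonomy code in the example calls is only an illustrative value.

```python
from typing import Optional, Tuple

def resolve_primary_taxonomy(
    switch: Optional[str], code: Optional[str]
) -> Tuple[Optional[str], Optional[str]]:
    """Return (primary_taxonomy, primary_taxonomy_source) for one row."""
    if switch == "Y":
        return code, "explicit"   # provider marked this taxonomy as primary
    if switch == "X":
        return code, "fallback"   # switch is 'X': use code 1 but flag the source
    return None, None             # 'N' or null: no primary taxonomy from slot 1

# Illustrative inputs only
explicit = resolve_primary_taxonomy("Y", "207Q00000X")
fallback = resolve_primary_taxonomy("X", "207Q00000X")
not_primary = resolve_primary_taxonomy("N", "207Q00000X")
missing = resolve_primary_taxonomy(None, None)
print(explicit, fallback, not_primary, missing)
```

Rows with switch `N` or null end up with a null `primary_taxonomy`, which lowers their `completeness_score` later on.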


In [25]:
# lifecycle / derived features
clean = (
    clean
    .withColumn("npi_age_days", date_diff(current_date(), col("provider_enumeration_date")))
    .withColumn(
        "is_active",
        when(
            (col("npi_deactivation_date").isNull()) |
            ((col("npi_reactivation_date").isNotNull()) & (col("npi_reactivation_date") >= col("npi_deactivation_date"))),
            lit(1)
        ).otherwise(lit(0))
    )
    .withColumn("was_reactivated", when(col("npi_reactivation_date").isNotNull(), 1).otherwise(0))
    .withColumn(
        "deactivated_then_reactivated",
        when((col("npi_deactivation_date").isNotNull()) & (col("npi_reactivation_date").isNotNull()), 1).otherwise(0)
    )
    .withColumn(
        "has_location",
        when(col("provider_business_practice_location_address_state_name").isNotNull(), 1).otherwise(0)
    )
    .withColumn(
        "missing_entity_type",
        when(col("entity_type_code").isNull(), 1).otherwise(0)
    )
)
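
The `is_active` condition is the least obvious of these features, so here is a plain-Python sketch of the same rule with a few hand-made cases. The function name `is_active_flag` and all dates are made up for illustration; none come from the NPPES data.

```python
from datetime import date
from typing import Optional

def is_active_flag(deactivated: Optional[date], reactivated: Optional[date]) -> int:
    """Mirror of the is_active rule: 1 if never deactivated, or if
    reactivated on or after the deactivation date; otherwise 0."""
    if deactivated is None:
        return 1
    if reactivated is not None and reactivated >= deactivated:
        return 1
    return 0

# Hypothetical lifecycle cases
never_deactivated = is_active_flag(None, None)
still_deactivated = is_active_flag(date(2020, 1, 1), None)
reactivated_later = is_active_flag(date(2020, 1, 1), date(2021, 6, 1))
deactivated_again = is_active_flag(date(2022, 3, 1), date(2021, 6, 1))
print(never_deactivated, still_deactivated, reactivated_later, deactivated_again)
```

The last case covers an NPI whose most recent event is a deactivation that follows an earlier reactivation; it is treated as inactive.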



In [26]:
# creating a completeness score for when I'm de-duping
clean = clean.withColumn(
    "completeness_score",
    (when(col("primary_taxonomy").isNotNull(), 1).otherwise(0)
     + when(col("provider_business_practice_location_address_state_name").isNotNull(), 1).otherwise(0)
     + when(col("provider_enumeration_date").isNotNull(), 1).otherwise(0)
     + when(col("entity_type_code").isNotNull(), 1).otherwise(0))
)

In [27]:
# validating the NPIs: each one should be exactly 10 digits
bad_npis_count = clean.filter(~col("npi").rlike(r'^\d{10}$')).count()
print(f"Malformed / unexpected NPI formats: {bad_npis_count}")



Malformed / unexpected NPI formats: 0
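
The regex above only checks the shape of the identifier. As an optional further check, an NPI's last digit is a Luhn check digit computed over the first nine digits with the prefix `80840` prepended. The sketch below is a plain-Python version of that check, not something this notebook runs against the full dataset. The function name `npi_check_digit_ok` is my own, and `1234567893` is the commonly cited sample NPI rather than a record from this file.

```python
import re

def npi_check_digit_ok(npi: str) -> bool:
    """Return True if npi is 10 ASCII digits and passes the Luhn check
    when prefixed with '80840'."""
    if not re.fullmatch(r"[0-9]{10}", npi):
        return False
    digits = [int(ch) for ch in "80840" + npi]
    total = 0
    # Walk right to left; double every second digit, starting with the
    # digit immediately left of the check digit
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

valid_sample = npi_check_digit_ok("1234567893")
wrong_check_digit = npi_check_digit_ok("1234567890")
too_short = npi_check_digit_ok("12345")
print(valid_sample, wrong_check_digit, too_short)
```

To apply this at scale in Spark, the function could be wrapped in a UDF, at the cost of slower execution than the built-in `rlike` check.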


                                                                                

In [30]:
# deduplicating: for each NPI, keep the row with the latest "last_update_date", breaking ties by the highest completeness score
w = Window.partitionBy("npi").orderBy(col("last_update_date").desc_nulls_last(), col("completeness_score").desc())
deduped = (
    clean
    .withColumn("rn", row_number().over(w))
    .filter(col("rn") == 1)
    .drop("rn")
)
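
To show what the window ordering selects, here is a plain-Python sketch of the same tie-breaking rule on three hypothetical duplicate rows. All values are invented, and `sort_key` and `best` are illustrative names.

```python
from datetime import date

# Hypothetical duplicates for a single NPI
rows = [
    {"npi": "1234567893", "last_update_date": date(2024, 1, 5), "completeness_score": 2},
    {"npi": "1234567893", "last_update_date": date(2024, 1, 5), "completeness_score": 4},
    {"npi": "1234567893", "last_update_date": None, "completeness_score": 4},
]

def sort_key(row):
    d = row["last_update_date"]
    # Dated rows first (nulls last), newest date first, then highest score first
    return (d is None, -d.toordinal() if d else 0, -row["completeness_score"])

best = {}
for row in sorted(rows, key=sort_key):
    # The first row seen per NPI is the winner, like row_number() == 1
    best.setdefault(row["npi"], row)

print(best["1234567893"])
```

If two rows tie on both the date and the score, the winner is arbitrary in Spark's `row_number()`, so adding a further deterministic tiebreaker to the window may be worth considering.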

In [31]:
# doing sanity checks after cleaning my NPI dataframe values
print("Distinct is_sole_proprietor values:")
deduped.select("is_sole_proprietor").distinct().show()
print("Primary taxonomy switch distinct modes:")
deduped.select("healthcare_provider_primary_taxonomy_switch_1").distinct().show()
print("Null counts after transformation:")
nulls = deduped.select(*[
    F.sum(when(col(c).isNull(), 1).otherwise(0)).alias(c)
    for c in deduped.columns
])
nulls.show(truncate=False)

Distinct is_sole_proprietor values:

+------------------+
|is_sole_proprietor|
+------------------+
|                 1|
|                 0|
+------------------+

Primary taxonomy switch distinct modes:

+---------------------------------------------+
|healthcare_provider_primary_taxonomy_switch_1|
+---------------------------------------------+
|                                            Y|
|                                            N|
|                                            X|
|                                         NULL|
+---------------------------------------------+

Null counts after transformation:

+---+----------------+------------------------------------------------------+-------------------------------------------------------+-----------------------+-----------------------+-----------------------+-------------------------+----------------+---------------------+---------------------+------------------+-----------------------------------+---------------------------------------------+----------------+-------------------------+------------------------+-----------------------+------------+---------+---------------+----------------------------+------------+-------------------+------------------+
|npi|entity_type_code|provider_business_practice_location_address_state_name|provider_business_practice_location_address_postal_code|is_organization_subpart|parent_organization_tin|parent_organization_lbn|provider_enumeration_date|last_update_date|npi_deactivation_date|npi_reactivation_date|is_sole_proprietor|healthcare_provider_taxonomy_code_1|healthcare_provider_primary_taxonomy_switch_1|p

                                                                                