# Provider Fraud Detection - Data Wrangling

### This notebook collects, organizes, defines, and cleans the datasets used by the Provider Fraud Detection model

#### 1.) Ingest & Inspect Raw Files
#### 2.) Standardize Column Names & Types
#### 3.) Clean & Deduplicate (remove duplicate records)
#### 4.) Normalize & Enrich
#### 5.) Merge/Join Tables
#### 6.) Aggregate & Snapshot

In [20]:
# loading modules
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import (
    col, when, sum as sum_, trim, to_date, current_date, date_diff,
    lit, avg, stddev, countDistinct,
)
from functools import reduce
from pyspark.sql import functions as F
import re
from collections import Counter

from pyspark.sql import Window
from pyspark.sql.functions import row_number

from pyspark.sql.types import IntegerType, DoubleType
from pyspark.sql.functions import expr, try_divide

In [2]:
# starting Spark session
spark = SparkSession.builder.appName('ProviderFraudDetection').getOrCreate()



#### NPPES NPI Registry Dataset

In [3]:
# creating a csv path variable
csv_path = "NPPES_Data_Dissemination_July_2025_V2/npidata_pfile_20050523-20250713.csv"


In [4]:
# reading the CSV file into a DataFrame
full_npi_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "false")
    .csv(csv_path)
)

In [5]:
# inspecting the column names
print("Available columns (first 50 shown):")
for c in full_npi_df.columns[:50]:
    print(repr(c))
    

Available columns (first 50 shown):
'NPI'
'Entity Type Code'
'Replacement NPI'
'Employer Identification Number (EIN)'
'Provider Organization Name (Legal Business Name)'
'Provider Last Name (Legal Name)'
'Provider First Name'
'Provider Middle Name'
'Provider Name Prefix Text'
'Provider Name Suffix Text'
'Provider Credential Text'
'Provider Other Organization Name'
'Provider Other Organization Name Type Code'
'Provider Other Last Name'
'Provider Other First Name'
'Provider Other Middle Name'
'Provider Other Name Prefix Text'
'Provider Other Name Suffix Text'
'Provider Other Credential Text'
'Provider Other Last Name Type Code'
'Provider First Line Business Mailing Address'
'Provider Second Line Business Mailing Address'
'Provider Business Mailing Address City Name'
'Provider Business Mailing Address State Name'
'Provider Business Mailing Address Postal Code'
'Provider Business Mailing Address Country Code (If outside U.S.)'
'Provider Business Mailing Address Telephone Number'
'Provider B

In [9]:
# selecting the relevant columns I want to keep
keep_cols = [
    "NPI",
    "Entity Type Code",
    "Provider Business Practice Location Address State Name",
    "Provider Business Practice Location Address Postal Code",
    "Is Organization Subpart",
    "Parent Organization TIN",
    "Parent Organization LBN",        
    "Provider Enumeration Date",
    "Last Update Date",
    "NPI Deactivation Date",
    "NPI Reactivation Date",
    "Is Sole Proprietor",
    "Healthcare Provider Taxonomy Code_1",           
    "Healthcare Provider Primary Taxonomy Switch_1" 
]

In [10]:
# creating a new DataFrame with only the selected columns
npi_df = full_npi_df.select(*keep_cols)


In [11]:
# sanity check: confirm the selected columns, their types, and a sample of the data
print("Schema:")
npi_df.printSchema()
print("Sample:")
npi_df.limit(5).show(truncate=False)

Schema:
root
 |-- NPI: string (nullable = true)
 |-- Entity Type Code: string (nullable = true)
 |-- Provider Business Practice Location Address State Name: string (nullable = true)
 |-- Provider Business Practice Location Address Postal Code: string (nullable = true)
 |-- Is Organization Subpart: string (nullable = true)
 |-- Parent Organization TIN: string (nullable = true)
 |-- Parent Organization LBN: string (nullable = true)
 |-- Provider Enumeration Date: string (nullable = true)
 |-- Last Update Date: string (nullable = true)
 |-- NPI Deactivation Date: string (nullable = true)
 |-- NPI Reactivation Date: string (nullable = true)
 |-- Is Sole Proprietor: string (nullable = true)
 |-- Healthcare Provider Taxonomy Code_1: string (nullable = true)
 |-- Healthcare Provider Primary Taxonomy Switch_1: string (nullable = true)

Sample:
+----------+----------------+------------------------------------------------------+-------------------------------------------------------+------------

In [12]:
# getting the total number of rows in the DataFrame
total_rows = npi_df.count()
print(f'Total rows: {total_rows}')



Total rows: 9026996


                                                                                

In [13]:
# getting the null counts for each column
npi_df.select(
    *[sum_(col(c).isNull().cast('int')).alias(c) for c in npi_df.columns]
).show()



+---+----------------+------------------------------------------------------+-------------------------------------------------------+-----------------------+-----------------------+-----------------------+-------------------------+----------------+---------------------+---------------------+------------------+-----------------------------------+---------------------------------------------+
|NPI|Entity Type Code|Provider Business Practice Location Address State Name|Provider Business Practice Location Address Postal Code|Is Organization Subpart|Parent Organization TIN|Parent Organization LBN|Provider Enumeration Date|Last Update Date|NPI Deactivation Date|NPI Reactivation Date|Is Sole Proprietor|Healthcare Provider Taxonomy Code_1|Healthcare Provider Primary Taxonomy Switch_1|
+---+----------------+------------------------------------------------------+-------------------------------------------------------+-----------------------+-----------------------+-----------------------+-------

                                                                                

#### Up to this point, I have selected the columns I need from the NPI data and checked the total record count and the number of nulls per column

In [16]:
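# normalizing the column names to ensure a smooth workflow
def normalize_col(name: str) -> str:
    s = name.strip().lower()
    s = re.sub(r'\(.*?\)', '', s)        # drop parenthetical text, e.g. "(EIN)"
    s = re.sub(r'[^0-9a-z]+', '_', s)    # replace runs of non-alphanumerics with "_"
    s = re.sub(r'_+', '_', s)            # collapse repeated underscores
    return s.strip('_')                  # trim leading/trailing underscores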


In [17]:
# computing normalized names and checking for collisions
normalized_names = [normalize_col(c) for c in keep_cols]
dupes = [n for n, cnt in Counter(normalized_names).items() if cnt > 1]
if dupes:
    raise RuntimeError(f"Normalized name collision detected: {dupes}")
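
To make the normalization and the collision guard concrete, here is a small plain-Python sketch that runs without Spark. The first three inputs are real NPPES column names from the listing above; the two `Provider Name (...)` inputs are hypothetical names invented only to trigger a collision.

```python
import re
from collections import Counter

def normalize_col(name: str) -> str:
    s = name.strip().lower()
    s = re.sub(r'\(.*?\)', '', s)        # drop parenthetical text
    s = re.sub(r'[^0-9a-z]+', '_', s)    # non-alphanumeric runs -> "_"
    s = re.sub(r'_+', '_', s)            # collapse repeated underscores
    return s.strip('_')

real_columns = [
    "Employer Identification Number (EIN)",
    "Healthcare Provider Taxonomy Code_1",
    "Provider Last Name (Legal Name)",
]
normalized = [normalize_col(c) for c in real_columns]
print(normalized)

# Hypothetical names: they differ only inside the parentheses, so they collide.
colliding = ["Provider Name (Legal)", "Provider Name (Other)"]
collision_names = [normalize_col(c) for c in colliding]
dupes = [n for n, cnt in Counter(collision_names).items() if cnt > 1]
print(dupes)
```

Because the parenthetical text is discarded, any two source columns that differ only in that text would map to the same name, which is the case the `RuntimeError` is there to catch.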

In [18]:
# building a new DataFrame with normalized column names
normalized_npi_df = npi_df.select(
    *[col(orig).alias(norm) for orig, norm in zip(keep_cols, normalized_names)]
)


In [19]:
# doing a quick sanity check to ensure the new DataFrame has the correct columns
normalized_npi_df.printSchema()
normalized_npi_df.show(5, truncate=False)

root
 |-- npi: string (nullable = true)
 |-- entity_type_code: string (nullable = true)
 |-- provider_business_practice_location_address_state_name: string (nullable = true)
 |-- provider_business_practice_location_address_postal_code: string (nullable = true)
 |-- is_organization_subpart: string (nullable = true)
 |-- parent_organization_tin: string (nullable = true)
 |-- parent_organization_lbn: string (nullable = true)
 |-- provider_enumeration_date: string (nullable = true)
 |-- last_update_date: string (nullable = true)
 |-- npi_deactivation_date: string (nullable = true)
 |-- npi_reactivation_date: string (nullable = true)
 |-- is_sole_proprietor: string (nullable = true)
 |-- healthcare_provider_taxonomy_code_1: string (nullable = true)
 |-- healthcare_provider_primary_taxonomy_switch_1: string (nullable = true)

+----------+----------------+------------------------------------------------------+-------------------------------------------------------+-----------------------+----

In [22]:
# casting and standardizing the data types
clean = (normalized_npi_df
    # trim whitespace
    .withColumn("npi", trim(col("npi")))
    .withColumn("entity_type_code", col("entity_type_code").cast("int"))

    # flags: 'X' means "not answered" in the NPPES code values, so only an explicit 'Y' maps to 1
    .withColumn("is_organization_subpart", when(col("is_organization_subpart") == "Y", 1).otherwise(0))
    .withColumn("is_sole_proprietor", when(col("is_sole_proprietor") == "Y", 1).otherwise(0))

    # parse dates (adjust format if the data varies)
    .withColumn("provider_enumeration_date", to_date(col("provider_enumeration_date"), "MM/dd/yyyy"))
    .withColumn("last_update_date", to_date(col("last_update_date"), "MM/dd/yyyy"))
    .withColumn("npi_deactivation_date", to_date(col("npi_deactivation_date"), "MM/dd/yyyy"))
    .withColumn("npi_reactivation_date", to_date(col("npi_reactivation_date"), "MM/dd/yyyy"))
)
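
The flag mappings in the cell above produce a binary 0/1, which cannot distinguish an explicit 'N' from a missing value. If that distinction matters for modeling, one option is a three-state mapping. The sketch below is plain Python, not the notebook's Spark code. The helper name `tri_state_flag` is hypothetical, and it assumes 'X' denotes "not answered".

```python
from typing import Optional

def tri_state_flag(raw: Optional[str]) -> Optional[int]:
    """Map 'Y' -> 1, 'N' -> 0, and anything else (None, '', 'X') -> None."""
    if raw is None:
        return None
    value = raw.strip().upper()
    if value == "Y":
        return 1
    if value == "N":
        return 0
    # ASSUMPTION: 'X' and blanks mean "not answered", so they stay unknown.
    return None

samples = ["Y", "N", "X", None, " y "]
mapped = [tri_state_flag(v) for v in samples]
print(mapped)
```

A separate 0/1 "unknown" indicator, like the `primary_taxonomy_unknown` column in this notebook, would carry the same information while keeping the main flag binary.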

In [23]:
# primary taxonomy extraction: 'Y' is an explicit primary, 'X' is used as a fallback, 'N' and NULL yield NULL
clean = clean.withColumn(
    "primary_taxonomy",
    when(col("healthcare_provider_primary_taxonomy_switch_1") == "Y", col("healthcare_provider_taxonomy_code_1"))
    .when(col("healthcare_provider_primary_taxonomy_switch_1") == "X", col("healthcare_provider_taxonomy_code_1"))  # fallback
    .otherwise(None)
).withColumn(
    "primary_taxonomy_explicit",
    when(col("healthcare_provider_primary_taxonomy_switch_1") == "Y", 1).otherwise(0)
).withColumn(
    "primary_taxonomy_unknown",
    when(col("healthcare_provider_primary_taxonomy_switch_1") == "X", 1).otherwise(0)
).withColumn(
    "primary_taxonomy_source",
    when(col("healthcare_provider_primary_taxonomy_switch_1") == "Y", lit("explicit"))
    .when(col("healthcare_provider_primary_taxonomy_switch_1") == "X", lit("fallback"))
    .otherwise(lit(None))
)
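
The switch handling above can be read as a small truth table over the four values the switch column takes (Y, N, X, NULL). This plain-Python sketch mirrors that logic for a single row. The function name is hypothetical and the taxonomy code is only an illustrative value.

```python
from typing import Optional, Tuple

def resolve_primary_taxonomy(
    switch: Optional[str], code: Optional[str]
) -> Tuple[Optional[str], Optional[str]]:
    """Return (primary_taxonomy, primary_taxonomy_source) for one row."""
    if switch == "Y":
        return code, "explicit"   # provider marked this taxonomy as primary
    if switch == "X":
        return code, "fallback"   # switch not answered: use code 1 anyway
    return None, None             # 'N' or missing: no primary taxonomy

example_code = "207Q00000X"  # illustrative taxonomy code
table = {s: resolve_primary_taxonomy(s, example_code) for s in ["Y", "X", "N", None]}
for switch, result in table.items():
    print(repr(switch), "->", result)
```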


In [25]:
# lifecycle / derived features
clean = (
    clean
    .withColumn("npi_age_days", date_diff(current_date(), col("provider_enumeration_date")))
    .withColumn(
        "is_active",
        when(
            (col("npi_deactivation_date").isNull()) |
            ((col("npi_reactivation_date").isNotNull()) & (col("npi_reactivation_date") >= col("npi_deactivation_date"))),
            lit(1)
        ).otherwise(lit(0))
    )
    .withColumn("was_reactivated", when(col("npi_reactivation_date").isNotNull(), 1).otherwise(0))
    .withColumn(
        "deactivated_then_reactivated",
        when((col("npi_deactivation_date").isNotNull()) & (col("npi_reactivation_date").isNotNull()), 1).otherwise(0)
    )
    .withColumn(
        "has_location",
        when(col("provider_business_practice_location_address_state_name").isNotNull(), 1).otherwise(0)
    )
    .withColumn(
        "missing_entity_type",
        when(col("entity_type_code").isNull(), 1).otherwise(0)
    )
)
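
The `is_active` rule above combines two nullable dates, so it is worth checking each case by hand. The sketch below restates the same rule in plain Python for a single record. The dates are made-up test values.

```python
from datetime import date
from typing import Optional

def is_active(deactivated: Optional[date], reactivated: Optional[date]) -> int:
    """1 if never deactivated, or reactivated on/after the deactivation date."""
    if deactivated is None:
        return 1
    if reactivated is not None and reactivated >= deactivated:
        return 1
    return 0

cases = {
    "never deactivated": is_active(None, None),
    "deactivated only": is_active(date(2020, 1, 1), None),
    "reactivated later": is_active(date(2020, 1, 1), date(2021, 6, 1)),
    "reactivation predates deactivation": is_active(date(2021, 6, 1), date(2020, 1, 1)),
}
for label, value in cases.items():
    print(f"{label}: {value}")
```

The last case (a reactivation date earlier than the deactivation date) is treated as inactive, matching the `>=` comparison in the Spark expression.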



In [26]:
# creating a completeness score (0 to 4) to break ties when deduplicating
clean = clean.withColumn(
    "completeness_score",
    (when(col("primary_taxonomy").isNotNull(), 1).otherwise(0)
     + when(col("provider_business_practice_location_address_state_name").isNotNull(), 1).otherwise(0)
     + when(col("provider_enumeration_date").isNotNull(), 1).otherwise(0)
     + when(col("entity_type_code").isNotNull(), 1).otherwise(0))
)

In [27]:
# validating the NPIs: each should be exactly 10 digits
bad_npis_count = clean.filter(~col("npi").rlike(r'^\d{10}$')).count()
print(f"Malformed / unexpected NPI formats: {bad_npis_count}")



Malformed / unexpected NPI formats: 0
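
The regex above only checks that an NPI has ten digits. NPIs also carry a check digit, computed with the Luhn algorithm over the NPI prefixed with `80840`. The following plain-Python sketch is an optional stricter check and not something the notebook currently runs. The function name is hypothetical.

```python
import re

def npi_check_digit_ok(npi: str) -> bool:
    """Validate a 10-digit NPI with the Luhn check over '80840' + NPI."""
    if not re.fullmatch(r"[0-9]{10}", npi):
        return False
    digits = [int(ch) for ch in "80840" + npi]
    total = 0
    # Walk right to left; double every second digit, starting after the check digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9   # same as summing the two digits of the product
        total += digit
    return total % 10 == 0

print(npi_check_digit_ok("1234567893"))  # commonly cited example NPI
print(npi_check_digit_ok("1234567890"))  # same digits, wrong check digit
print(npi_check_digit_ok("0000000000"))  # all-zero placeholder value
```

In Spark this could be applied as a UDF or rewritten with column expressions. A value such as `0000000000` passes the ten-digit regex but fails the check digit.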


                                                                                

In [30]:
# deduplicating: per NPI, keep the row with the latest last_update_date, breaking ties by the highest completeness score
w = Window.partitionBy("npi").orderBy(col("last_update_date").desc_nulls_last(), col("completeness_score").desc())
deduped = (
    clean
    .withColumn("rn", row_number().over(w))
    .filter(col("rn") == 1)
    .drop("rn")
)
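
To illustrate what the window above selects, here is the same "latest update first, nulls last, then highest completeness" rule in plain Python over a handful of made-up records. The NPI strings and dates are invented for illustration.

```python
from datetime import date

records = [
    {"npi": "1111111111", "last_update_date": date(2024, 1, 1), "completeness_score": 2},
    {"npi": "1111111111", "last_update_date": date(2025, 1, 1), "completeness_score": 1},
    {"npi": "1111111111", "last_update_date": None, "completeness_score": 4},
    {"npi": "2222222222", "last_update_date": date(2023, 5, 1), "completeness_score": 3},
    {"npi": "2222222222", "last_update_date": date(2023, 5, 1), "completeness_score": 4},
]

def rank_key(record):
    """Higher is better: rows with a date beat rows without, then later date, then score."""
    updated = record["last_update_date"]
    return (updated is not None, updated or date.min, record["completeness_score"])

best = {}
for record in records:
    current = best.get(record["npi"])
    if current is None or rank_key(record) > rank_key(current):
        best[record["npi"]] = record

for npi, record in best.items():
    print(npi, record["last_update_date"], record["completeness_score"])
```

The first NPI keeps its 2025 row even though the row without a date has a higher completeness score, because recency is compared first. If two rows tie on both fields, `row_number()` picks one arbitrarily, so a further tie-breaker column would be needed for a fully deterministic result.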

In [None]:
# doing sanity checks after cleaning my NPI dataframe values
print("Distinct is_sole_proprietor values:")
deduped.select("is_sole_proprietor").distinct().show()
print("Primary taxonomy switch distinct modes:")
deduped.select("healthcare_provider_primary_taxonomy_switch_1").distinct().show()
print("Null counts after transformation:")
nulls = deduped.select(*[
    F.sum(when(col(c).isNull(), 1).otherwise(0)).alias(c)
    for c in deduped.columns
])
nulls.show(truncate=False)

Distinct is_sole_proprietor values:


                                                                                

+------------------+
|is_sole_proprietor|
+------------------+
|                 1|
|                 0|
+------------------+

Primary taxonomy switch distinct modes:


                                                                                

+---------------------------------------------+
|healthcare_provider_primary_taxonomy_switch_1|
+---------------------------------------------+
|                                            Y|
|                                            N|
|                                            X|
|                                         NULL|
+---------------------------------------------+

Null counts after transformation:




+---+----------------+------------------------------------------------------+-------------------------------------------------------+-----------------------+-----------------------+-----------------------+-------------------------+----------------+---------------------+---------------------+------------------+-----------------------------------+---------------------------------------------+----------------+-------------------------+------------------------+-----------------------+------------+---------+---------------+----------------------------+------------+-------------------+------------------+
|npi|entity_type_code|provider_business_practice_location_address_state_name|provider_business_practice_location_address_postal_code|is_organization_subpart|parent_organization_tin|parent_organization_lbn|provider_enumeration_date|last_update_date|npi_deactivation_date|npi_reactivation_date|is_sole_proprietor|healthcare_provider_taxonomy_code_1|healthcare_provider_primary_taxonomy_switch_1|p

                                                                                


#### Medicare Physician & Other Practitioners - by Provider and Service Dataset

In [3]:
# creating a csv path variable
csv_path = "Medicare Physician & Other Practitioners - by Provider and Service/MUP_PHY_R25_P05_V20_D23_Prov_Svc.csv"

In [4]:
# reading the CSV file into a DataFrame
full_physician_practitioner_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "false")
    .csv(csv_path)
)

In [5]:
# inspecting the column names
print("Available columns (first 50 shown):")
for c in full_physician_practitioner_df.columns[:50]:
    print(repr(c))

Available columns (first 50 shown):
'Rndrng_NPI'
'Rndrng_Prvdr_Last_Org_Name'
'Rndrng_Prvdr_First_Name'
'Rndrng_Prvdr_MI'
'Rndrng_Prvdr_Crdntls'
'Rndrng_Prvdr_Ent_Cd'
'Rndrng_Prvdr_St1'
'Rndrng_Prvdr_St2'
'Rndrng_Prvdr_City'
'Rndrng_Prvdr_State_Abrvtn'
'Rndrng_Prvdr_State_FIPS'
'Rndrng_Prvdr_Zip5'
'Rndrng_Prvdr_RUCA'
'Rndrng_Prvdr_RUCA_Desc'
'Rndrng_Prvdr_Cntry'
'Rndrng_Prvdr_Type'
'Rndrng_Prvdr_Mdcr_Prtcptg_Ind'
'HCPCS_Cd'
'HCPCS_Desc'
'HCPCS_Drug_Ind'
'Place_Of_Srvc'
'Tot_Benes'
'Tot_Srvcs'
'Tot_Bene_Day_Srvcs'
'Avg_Sbmtd_Chrg'
'Avg_Mdcr_Alowd_Amt'
'Avg_Mdcr_Pymt_Amt'
'Avg_Mdcr_Stdzd_Amt'


In [6]:
# selecting the relevant columns I want to keep
keep_cols_physician_practitioners = [
    "Rndrng_NPI",                 # Rendering provider NPI
    "Rndrng_Prvdr_Ent_Cd",        # Entity type (I/O)
    "Rndrng_Prvdr_State_Abrvtn",  # State abbreviation
    "Rndrng_Prvdr_Zip5",          # ZIP code
    "Rndrng_Prvdr_RUCA",          # Rural-Urban commuting area code
    "Rndrng_Prvdr_RUCA_Desc",     # RUCA description
    "Rndrng_Prvdr_Mdcr_Prtcptg_Ind", # Medicare participation Y/N

    "HCPCS_Cd",                   # Procedure code
    "HCPCS_Desc",                 # Procedure description
    "HCPCS_Drug_Ind",             # Drug vs non-drug flag
    "Place_Of_Srvc",              # Place of service
    "Tot_Benes",                  # Total distinct beneficiaries
    "Tot_Srvcs",                  # Total services provided
    "Tot_Bene_Day_Srvcs",         # Unique beneficiary-day services

    "Avg_Sbmtd_Chrg",             # Avg submitted charge
    "Avg_Mdcr_Alowd_Amt",         # Avg Medicare allowed
    "Avg_Mdcr_Pymt_Amt",          # Avg Medicare payment
    "Avg_Mdcr_Stdzd_Amt"          # Avg standardized payment
]

In [7]:
# creating a new DataFrame with only the selected columns
phys_pract_df = full_physician_practitioner_df.select(*keep_cols_physician_practitioners)

In [8]:
# sanity check: confirm the selected columns, their types, and a sample of the data
print("Schema:")
phys_pract_df.printSchema()
print("Sample:")
phys_pract_df.limit(5).show(truncate=False)

Schema:
root
 |-- Rndrng_NPI: string (nullable = true)
 |-- Rndrng_Prvdr_Ent_Cd: string (nullable = true)
 |-- Rndrng_Prvdr_State_Abrvtn: string (nullable = true)
 |-- Rndrng_Prvdr_Zip5: string (nullable = true)
 |-- Rndrng_Prvdr_RUCA: string (nullable = true)
 |-- Rndrng_Prvdr_RUCA_Desc: string (nullable = true)
 |-- Rndrng_Prvdr_Mdcr_Prtcptg_Ind: string (nullable = true)
 |-- HCPCS_Cd: string (nullable = true)
 |-- HCPCS_Desc: string (nullable = true)
 |-- HCPCS_Drug_Ind: string (nullable = true)
 |-- Place_Of_Srvc: string (nullable = true)
 |-- Tot_Benes: string (nullable = true)
 |-- Tot_Srvcs: string (nullable = true)
 |-- Tot_Bene_Day_Srvcs: string (nullable = true)
 |-- Avg_Sbmtd_Chrg: string (nullable = true)
 |-- Avg_Mdcr_Alowd_Amt: string (nullable = true)
 |-- Avg_Mdcr_Pymt_Amt: string (nullable = true)
 |-- Avg_Mdcr_Stdzd_Amt: string (nullable = true)

Sample:
+----------+-------------------+-------------------------+-----------------+-----------------+---------------------

In [9]:
# getting the total number of rows in the DataFrame
total_rows_phys_pract = phys_pract_df.count()
print(f'Total rows: {total_rows_phys_pract}')



Total rows: 9660647


                                                                                

In [10]:
# getting the null counts for each column
phys_pract_df.select(
    *[sum_(col(c).isNull().cast('int')).alias(c) for c in phys_pract_df.columns]
).show()



+----------+-------------------+-------------------------+-----------------+-----------------+----------------------+-----------------------------+--------+----------+--------------+-------------+---------+---------+------------------+--------------+------------------+-----------------+------------------+
|Rndrng_NPI|Rndrng_Prvdr_Ent_Cd|Rndrng_Prvdr_State_Abrvtn|Rndrng_Prvdr_Zip5|Rndrng_Prvdr_RUCA|Rndrng_Prvdr_RUCA_Desc|Rndrng_Prvdr_Mdcr_Prtcptg_Ind|HCPCS_Cd|HCPCS_Desc|HCPCS_Drug_Ind|Place_Of_Srvc|Tot_Benes|Tot_Srvcs|Tot_Bene_Day_Srvcs|Avg_Sbmtd_Chrg|Avg_Mdcr_Alowd_Amt|Avg_Mdcr_Pymt_Amt|Avg_Mdcr_Stdzd_Amt|
+----------+-------------------+-------------------------+-----------------+-----------------+----------------------+-----------------------------+--------+----------+--------------+-------------+---------+---------+------------------+--------------+------------------+-----------------+------------------+
|         0|                  0|                        0|                0|   

                                                                                

#### As with the NPPES NPI dataset above, I have loaded the Medicare Physician & Other Practitioners dataset, narrowed it to the columns I need, and checked the total row count and the null count per column

In [11]:
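# normalizing the column names to ensure a smooth workflow
# (same helper as in the NPPES section, redefined so this section runs on its own)
def normalize_col(name: str) -> str:
    s = name.strip().lower()
    s = re.sub(r'\(.*?\)', '', s)        # drop parenthetical text
    s = re.sub(r'[^0-9a-z]+', '_', s)    # replace runs of non-alphanumerics with "_"
    s = re.sub(r'_+', '_', s)            # collapse repeated underscores
    return s.strip('_')                  # trim leading/trailing underscores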

In [12]:
# computing normalized names and checking for collisions
normalized_names = [normalize_col(c) for c in keep_cols_physician_practitioners]
dupes = [n for n, cnt in Counter(normalized_names).items() if cnt > 1]
if dupes:
    raise RuntimeError(f"Normalized name collision detected: {dupes}")

In [13]:
# building a new DataFrame with normalized column names
normalized_phys_pract_df = phys_pract_df.select(
    *[col(orig).alias(norm) for orig, norm in zip(keep_cols_physician_practitioners, normalized_names)]
)

In [14]:
# doing a quick sanity check to ensure the new DataFrame has the correct columns
normalized_phys_pract_df.printSchema()
normalized_phys_pract_df.show(5, truncate=False)

root
 |-- rndrng_npi: string (nullable = true)
 |-- rndrng_prvdr_ent_cd: string (nullable = true)
 |-- rndrng_prvdr_state_abrvtn: string (nullable = true)
 |-- rndrng_prvdr_zip5: string (nullable = true)
 |-- rndrng_prvdr_ruca: string (nullable = true)
 |-- rndrng_prvdr_ruca_desc: string (nullable = true)
 |-- rndrng_prvdr_mdcr_prtcptg_ind: string (nullable = true)
 |-- hcpcs_cd: string (nullable = true)
 |-- hcpcs_desc: string (nullable = true)
 |-- hcpcs_drug_ind: string (nullable = true)
 |-- place_of_srvc: string (nullable = true)
 |-- tot_benes: string (nullable = true)
 |-- tot_srvcs: string (nullable = true)
 |-- tot_bene_day_srvcs: string (nullable = true)
 |-- avg_sbmtd_chrg: string (nullable = true)
 |-- avg_mdcr_alowd_amt: string (nullable = true)
 |-- avg_mdcr_pymt_amt: string (nullable = true)
 |-- avg_mdcr_stdzd_amt: string (nullable = true)

+----------+-------------------+-------------------------+-----------------+-----------------+-------------------------------------

In [15]:

# casting and standardizing the data types, starting from normalized_phys_pract_df

clean_phys = (
    normalized_phys_pract_df

    # 1. Trim & validate NPI
    .withColumn("rndrng_npi", trim(col("rndrng_npi")))
    .withColumn(
        "npi_valid",
        when(col("rndrng_npi").rlike(r"^\d{10}$"), lit(1)).otherwise(lit(0))
    )

    # 2. Entity type: I → individual(1), others → org(0)
    .withColumn(
        "is_individual",
        when(col("rndrng_prvdr_ent_cd") == "I", lit(1)).otherwise(lit(0))
    )

    # 3. Trim location strings
    .withColumn("rndrng_prvdr_state_abrvtn", trim(col("rndrng_prvdr_state_abrvtn")))
    .withColumn("rndrng_prvdr_zip5", trim(col("rndrng_prvdr_zip5")))

    # 4. RUCA → integer, keep description as-is or drop
    .withColumn("rndrng_prvdr_ruca", col("rndrng_prvdr_ruca").cast(IntegerType()))

    # 5. Medicare participation flag
    .withColumn(
        "medicare_participation",
        when(col("rndrng_prvdr_mdcr_prtcptg_ind") == "Y", lit(1)).otherwise(lit(0))
    )

    # 6. HCPCS & service place: trim
    .withColumn("hcpcs_cd", trim(col("hcpcs_cd")))
    .withColumn("place_of_srvc", trim(col("place_of_srvc")))

    # 7. Cast volume counts
    .withColumn("tot_benes", col("tot_benes").cast(IntegerType()))
    .withColumn("tot_srvcs", col("tot_srvcs").cast(IntegerType()))
    .withColumn("tot_bene_day_srvcs", col("tot_bene_day_srvcs").cast(IntegerType()))

    # 8. Cast dollar amounts
    .withColumn("avg_sbmtd_chrg", col("avg_sbmtd_chrg").cast(DoubleType()))
    .withColumn("avg_mdcr_alowd_amt", col("avg_mdcr_alowd_amt").cast(DoubleType()))
    .withColumn("avg_mdcr_pymt_amt", col("avg_mdcr_pymt_amt").cast(DoubleType()))
    .withColumn("avg_mdcr_stdzd_amt", col("avg_mdcr_stdzd_amt").cast(DoubleType()))

    # 9. Drug indicator → binary (D=drug, else non-drug)
    .withColumn(
        "is_drug",
        when(col("hcpcs_drug_ind") == "D", lit(1)).otherwise(lit(0))
    )

    # 10. Missingness flags
    .withColumn("missing_state", when(col("rndrng_prvdr_state_abrvtn").isNull(), lit(1)).otherwise(lit(0)))
    .withColumn("missing_zip", when(col("rndrng_prvdr_zip5").isNull(), lit(1)).otherwise(lit(0)))

)

# Sanity check
clean_phys.printSchema()
clean_phys.select(
    "rndrng_npi", "npi_valid", "is_individual", "medicare_participation",
    "tot_benes", "avg_sbmtd_chrg", "is_drug", "missing_state"
).show(5, truncate=False)


root
 |-- rndrng_npi: string (nullable = true)
 |-- rndrng_prvdr_ent_cd: string (nullable = true)
 |-- rndrng_prvdr_state_abrvtn: string (nullable = true)
 |-- rndrng_prvdr_zip5: string (nullable = true)
 |-- rndrng_prvdr_ruca: integer (nullable = true)
 |-- rndrng_prvdr_ruca_desc: string (nullable = true)
 |-- rndrng_prvdr_mdcr_prtcptg_ind: string (nullable = true)
 |-- hcpcs_cd: string (nullable = true)
 |-- hcpcs_desc: string (nullable = true)
 |-- hcpcs_drug_ind: string (nullable = true)
 |-- place_of_srvc: string (nullable = true)
 |-- tot_benes: integer (nullable = true)
 |-- tot_srvcs: integer (nullable = true)
 |-- tot_bene_day_srvcs: integer (nullable = true)
 |-- avg_sbmtd_chrg: double (nullable = true)
 |-- avg_mdcr_alowd_amt: double (nullable = true)
 |-- avg_mdcr_pymt_amt: double (nullable = true)
 |-- avg_mdcr_stdzd_amt: double (nullable = true)
 |-- npi_valid: integer (nullable = false)
 |-- is_individual: integer (nullable = false)
 |-- medicare_participation: integer (
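
The count casts above assume every value parses as an integer. As a plain-Python illustration of "try cast" semantics (return a null instead of failing), the sketch below shows how a fractional or formatted value behaves under integer versus floating-point parsing. The helper names and sample strings are hypothetical. Whether Spark's own `cast` returns NULL or raises on such input depends on its ANSI setting, which this sketch does not model.

```python
from typing import Optional

def try_cast_int(raw: Optional[str]) -> Optional[int]:
    """Return an int, or None when the text is missing or not a whole number."""
    if raw is None:
        return None
    try:
        return int(raw.strip())
    except ValueError:
        return None

def try_cast_float(raw: Optional[str]) -> Optional[float]:
    """Return a float, or None when the text is missing or not numeric."""
    if raw is None:
        return None
    try:
        return float(raw.strip())
    except ValueError:
        return None

samples = ["12", " 7 ", "12.5", "1,234", "", None]
as_int = [try_cast_int(v) for v in samples]
as_float = [try_cast_float(v) for v in samples]
print(as_int)
print(as_float)
```

If any count column holds fractional values, an integer cast would lose them, so comparing null counts before and after the cast is a cheap way to confirm nothing was silently dropped.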

#### This dataset is expected to have multiple rows per provider (one per procedure and place of service), so I will not deduplicate by NPI as I did for NPPES
#### Instead, I will check for true duplicates on the provider, procedure, place-of-service, and drug-flag key

In [16]:
# checking for true duplicates
# A. Re-apply the count casts with try_cast (these columns are already integers in clean_phys, so valid values pass through unchanged)
safe_phys = clean_phys \
    .withColumn("tot_benes", expr("try_cast(tot_benes AS INT)")) \
    .withColumn("tot_srvcs", expr("try_cast(tot_srvcs AS INT)")) \
    .withColumn("tot_bene_day_srvcs", expr("try_cast(tot_bene_day_srvcs AS INT)"))

# B. Now check for *exact* duplicates across the key combination
#    (provider + procedure + place + drug flag)
total_rows = safe_phys.count()

distinct_key_rows = safe_phys \
    .dropDuplicates([
        "rndrng_npi",
        "hcpcs_cd",
        "place_of_srvc",
        "is_drug"
    ]) \
    .count()

print(f"Rows: {total_rows:,}  Unique (prov×proc×place×drug): {distinct_key_rows:,}  "
      f"Exact dupes: {total_rows - distinct_key_rows:,}")



Rows: 9,660,647  Unique (prov×proc×place×drug): 9,660,647  Exact dupes: 0
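
Comparing the row count with the distinct-key count shows how many extra rows exist, but not which keys are repeated. A minimal plain-Python sketch of the same check that also lists the offending keys follows. The rows are made-up (provider, procedure, place, drug flag) tuples, not values from the dataset.

```python
from collections import Counter

# Hypothetical key tuples: (rndrng_npi, hcpcs_cd, place_of_srvc, is_drug)
rows = [
    ("1111111111", "99213", "O", 0),
    ("1111111111", "99213", "F", 0),
    ("1111111111", "J1100", "O", 1),
    ("1111111111", "99213", "O", 0),   # repeats the first key
]

key_counts = Counter(rows)
duplicate_keys = {key: n for key, n in key_counts.items() if n > 1}
extra_rows = len(rows) - len(key_counts)

print(f"Rows: {len(rows)}  Unique keys: {len(key_counts)}  Extra rows: {extra_rows}")
print(duplicate_keys)
```

In Spark, the equivalent would be a `groupBy` on the key columns followed by a filter on `count > 1`, which is only needed when the difference is non-zero.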


                                                                                

In [21]:
# aggregating by provider for modeling purposes

# 1) Cast the measure columns to double and derive the flag columns
clean_phys2 = normalized_phys_pract_df \
    .withColumn("avg_sbmtd_chrg",    col("avg_sbmtd_chrg"   ).cast("double")) \
    .withColumn("avg_mdcr_alowd_amt",col("avg_mdcr_alowd_amt").cast("double")) \
    .withColumn("avg_mdcr_pymt_amt", col("avg_mdcr_pymt_amt").cast("double")) \
    .withColumn("srvcs_d",           col("tot_srvcs"        ).cast("double")) \
    .withColumn("benes_d",           col("tot_benes"        ).cast("double")) \
    .withColumn("bene_days_d",       col("tot_bene_day_srvcs").cast("double")) \
    .withColumn("is_drug",           when(col("hcpcs_drug_ind") == "D", 1).otherwise(0)) \
    .withColumn("missing_zip",       when(col("rndrng_prvdr_zip5").isNull(), 1).otherwise(0))

# 2) Aggregate by provider, using try_divide to guard against zero denominators
provider_agg = (
    clean_phys2
    .groupBy("rndrng_npi")
    .agg(
        sum_("srvcs_d"    ).alias("total_services"),
        # tot_benes is distinct per row, so this sum can count the same beneficiary more than once
        sum_("benes_d"    ).alias("total_beneficiaries"),
        sum_("bene_days_d").alias("total_bene_day_services"),

        # ANSI-safe ratio averages
        avg(try_divide(col("avg_sbmtd_chrg"), col("avg_mdcr_alowd_amt")))
          .alias("avg_charge_allowed_ratio"),
        avg(try_divide(col("avg_mdcr_pymt_amt"), col("avg_mdcr_alowd_amt")))
          .alias("avg_payment_allowed_ratio"),

        countDistinct("hcpcs_cd").alias("num_unique_procedures"),
        stddev(col("avg_sbmtd_chrg")).alias("stddev_submitted_charge"),

        avg(col("is_drug"    )).alias("frac_drug_services"),
        avg(col("missing_zip")).alias("frac_missing_zip")
    )
)

# 3) Inspect
provider_agg.printSchema()
provider_agg.show(5, truncate=False)

root
 |-- rndrng_npi: string (nullable = true)
 |-- total_services: double (nullable = true)
 |-- total_beneficiaries: double (nullable = true)
 |-- total_bene_day_services: double (nullable = true)
 |-- avg_charge_allowed_ratio: double (nullable = true)
 |-- avg_payment_allowed_ratio: double (nullable = true)
 |-- num_unique_procedures: long (nullable = false)
 |-- stddev_submitted_charge: double (nullable = true)
 |-- frac_drug_services: double (nullable = true)
 |-- frac_missing_zip: double (nullable = true)





+----------+--------------+-------------------+-----------------------+------------------------+-------------------------+---------------------+-----------------------+------------------+----------------+
|rndrng_npi|total_services|total_beneficiaries|total_bene_day_services|avg_charge_allowed_ratio|avg_payment_allowed_ratio|num_unique_procedures|stddev_submitted_charge|frac_drug_services|frac_missing_zip|
+----------+--------------+-------------------+-----------------------+------------------------+-------------------------+---------------------+-----------------------+------------------+----------------+
|1003846908|1832.0        |1445.0             |1809.0                 |4.530462815149122       |0.7338022651790301       |16                   |474.3959654350813      |0.0               |0.0             |
|1003890302|1543.0        |1360.0             |1508.0                 |4.8775082431546695      |0.7646488824457998       |7                    |432.7900067612964      |0.0         
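
The ratio features above average per-row ratios, so every procedure row counts equally regardless of volume, and rows with a zero denominator drop out because `try_divide` yields NULL and `avg` skips NULLs. The plain-Python sketch below contrasts that with a service-weighted ratio. The numbers are invented, and the weighted variant is an alternative worth considering, not what the notebook computes.

```python
from typing import Optional

def try_divide(numerator: Optional[float], denominator: Optional[float]) -> Optional[float]:
    """Return numerator / denominator, or None for missing or zero denominators."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator

# Hypothetical rows for one provider: (avg_sbmtd_chrg, avg_mdcr_alowd_amt, tot_srvcs)
rows = [
    (100.0, 25.0, 10),
    (300.0, 100.0, 90),
    (50.0, 0.0, 5),      # zero allowed amount: ratio is undefined
]

# Unweighted: mean of the per-row ratios, ignoring undefined ones.
ratios = [r for r in (try_divide(c, a) for c, a, _ in rows) if r is not None]
mean_of_ratios = sum(ratios) / len(ratios)

# Weighted: total charges over total allowed, each scaled by service volume.
valid = [(c, a, s) for c, a, s in rows if a != 0]
weighted_ratio = try_divide(sum(c * s for c, a, s in valid), sum(a * s for c, a, s in valid))

print(mean_of_ratios, weighted_ratio)
```

Here the low-volume row pulls the unweighted mean up to 3.5, while the weighted ratio stays close to 3 because most services fall on the second row.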

                                                                                

#### 'List of Excluded Individuals and Entities' (LEIE) - OIG Dataset

In [24]:
# creating a csv path variable
csv_path = "Office of Inspector General - Excluded Individuals and Entities/20250710 LEIE.csv"

In [25]:
# reading the CSV file into a DataFrame
full_leie_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "false")
    .csv(csv_path)
)

In [29]:
# inspecting the column names
print("Available columns:")
for c in full_leie_df.columns:
    print(repr(c))

Available columns:
'LASTNAME'
'FIRSTNAME'
'MIDNAME'
'BUSNAME'
'GENERAL'
'SPECIALTY'
'UPIN'
'NPI'
'DOB'
'ADDRESS'
'CITY'
'STATE'
'ZIP'
'EXCLTYPE'
'EXCLDATE'
'REINDATE'
'WAIVERDATE'
'WVRSTATE'


In [30]:
# selecting the relevant columns I want to keep
keep_cols_leie = [
    "NPI",        # the provider’s NPI
    "LASTNAME",   # provider last name
    "FIRSTNAME",  # provider first name
    "BUSNAME",    # organization/business name (if non-individual)
    "ADDRESS",    # mailing address line 1
    "CITY",       # mailing city
    "STATE",      # mailing state
    "ZIP",        # mailing postal code
    "SPECIALTY",  # provider specialty description
    "EXCLTYPE",   # exclusion type code
    "EXCLDATE",   # exclusion effective date
    "REINDATE"    # exclusion end (reinstatement) date
]

In [31]:
# creating a new DataFrame with only the selected columns
leie_df = full_leie_df.select(*keep_cols_leie)

In [33]:
# sanity check: confirm the selected columns, their types, and a sample of the data
print("Schema:")
leie_df.printSchema()
print("Sample:")
leie_df.limit(5).show(truncate=False)

Schema:
root
 |-- NPI: string (nullable = true)
 |-- LASTNAME: string (nullable = true)
 |-- FIRSTNAME: string (nullable = true)
 |-- BUSNAME: string (nullable = true)
 |-- ADDRESS: string (nullable = true)
 |-- CITY: string (nullable = true)
 |-- STATE: string (nullable = true)
 |-- ZIP: string (nullable = true)
 |-- SPECIALTY: string (nullable = true)
 |-- EXCLTYPE: string (nullable = true)
 |-- EXCLDATE: string (nullable = true)
 |-- REINDATE: string (nullable = true)

Sample:
+----------+--------+---------+---------------------------+-----------------------------+----------+-----+-----+------------------+--------+--------+--------+
|NPI       |LASTNAME|FIRSTNAME|BUSNAME                    |ADDRESS                      |CITY      |STATE|ZIP  |SPECIALTY         |EXCLTYPE|EXCLDATE|REINDATE|
+----------+--------+---------+---------------------------+-----------------------------+----------+-----+-----+------------------+--------+--------+--------+
|0000000000|NULL    |NULL     |#1 MARK

In [34]:
# getting the total number of rows in the DataFrame
total_rows_leie = leie_df.count()
print(f'Total rows: {total_rows_leie}')

Total rows: 81774


In [35]:
# getting the null counts for each column
leie_df.select(
    *[sum_(col(c).isNull().cast('int')).alias(c) for c in leie_df.columns]
).show()

+---+--------+---------+-------+-------+----+-----+---+---------+--------+--------+--------+
|NPI|LASTNAME|FIRSTNAME|BUSNAME|ADDRESS|CITY|STATE|ZIP|SPECIALTY|EXCLTYPE|EXCLDATE|REINDATE|
+---+--------+---------+-------+-------+----+-----+---+---------+--------+--------+--------+
|  0|    3358|     3358|  78416|      4|   0|    0|  0|     4088|       0|       0|       0|
+---+--------+---------+-------+-------+----+-----+---+---------+--------+--------+--------+
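
A null count of 0 for `NPI` does not mean every row has a usable NPI: the sample above shows the value `0000000000`, which looks like a placeholder. The plain-Python sketch below shows one way to turn such placeholders into missing values before joining. The helper names are hypothetical. The date handling rests on an unverified assumption that LEIE dates are `yyyyMMdd` strings with all zeros meaning "no date", which should be checked against the file first.

```python
from datetime import date, datetime
from typing import Optional

PLACEHOLDER_NPI = "0000000000"  # value seen in the LEIE sample output

def clean_leie_npi(raw: Optional[str]) -> Optional[str]:
    """Return a trimmed NPI, or None for blanks and the all-zero placeholder."""
    if raw is None:
        return None
    value = raw.strip()
    if value == "" or value == PLACEHOLDER_NPI:
        return None
    return value

def parse_leie_date(raw: Optional[str]) -> Optional[date]:
    """ASSUMPTION: dates are yyyyMMdd text and an all-zero string means no date."""
    if raw is None:
        return None
    value = raw.strip()
    if value == "" or set(value) == {"0"}:
        return None
    try:
        return datetime.strptime(value, "%Y%m%d").date()
    except ValueError:
        return None   # unexpected format: treat as missing rather than fail

print(clean_leie_npi("0000000000"), clean_leie_npi(" 1234567893 "))
print(parse_leie_date("00000000"), parse_leie_date("20250710"))
```

Re-running the null counts after a step like this would show how many LEIE rows can actually be matched to the other datasets by NPI.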



                                                                                