# Manually testing data pipelines

### Validating a data pipeline at "checkpoints"
In this exercise, you'll be working with a data pipeline that extracts tax data from a CSV file, drops rows with missing values, filters rows by average taxable income when a taxable_income column is present, and persists the result to a parquet file.

pandas has been loaded as pd, and the extract(), transform(), and load() functions have already been defined. You'll use these functions to validate the data pipeline at various checkpoints throughout its execution.

In [1]:
import pandas as pd
def extract(file_path):
    """Extract data from a CSV file."""
    return pd.read_csv(file_path)

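def transform(df):
    """Drop rows with missing values, then keep rows at or above the mean taxable income."""
    clean_df = df.copy()

    # Remove rows that contain any missing values
    clean_df = clean_df.dropna()

    # The income filter is applied only when a 'taxable_income' column exists
    if 'taxable_income' in clean_df.columns:
        avg_income = clean_df['taxable_income'].mean()
        clean_df = clean_df[clean_df['taxable_income'] >= avg_income]

    return clean_df

def load(df, output_path):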
    """Load the DataFrame to a parquet file."""
    df.to_parquet(output_path, index=False)
    print(f"Data successfully loaded to {output_path}")

In [3]:
# Extract and transform tax_data
raw_tax_data = extract("../data/raw_tax_data.csv")
clean_tax_data = transform(raw_tax_data)

# Check the shape of the raw_tax_data DataFrame, compare to the clean_tax_data DataFrame
print(f"Shape of raw_tax_data: {raw_tax_data.shape}")
print(f"Shape of clean_tax_data: {clean_tax_data.shape}")

Shape of raw_tax_data: (82, 6)
Shape of clean_tax_data: (82, 6)
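
Comparing shapes is a coarse check. Both shapes above are (82, 6), so transform() removed no rows from this dataset. The same kind of observation can be turned into assertions that fail loudly when a checkpoint is violated. The following is a minimal sketch, not part of the exercise: the check_transform_checkpoint() helper and the toy DataFrame are hypothetical and only illustrate the idea.

```python
import pandas as pd

def check_transform_checkpoint(raw_df, clean_df):
    """Hypothetical helper: sanity checks between the extract and transform steps."""
    # transform() only drops rows, so the row count can never grow
    assert len(clean_df) <= len(raw_df), "transform added rows"
    # dropna() and row filtering should leave the columns unchanged
    assert list(clean_df.columns) == list(raw_df.columns), "columns changed"
    # After dropna(), no missing values should remain
    assert not clean_df.isna().any().any(), "missing values remain"
    return len(raw_df) - len(clean_df)

# Toy data standing in for raw_tax_data (values are made up for illustration)
raw = pd.DataFrame({
    "industry_name": ["A", "B", "C"],
    "taxable_income": [100.0, None, 300.0],
})
clean = raw.dropna()
rows_dropped = check_transform_checkpoint(raw, clean)
print(f"Rows dropped: {rows_dropped}")
```

Returning the number of dropped rows makes it easy to log how much data each run filters out, in addition to asserting the invariants.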


In [6]:
# Load the clean_tax_data to parquet file
load(clean_tax_data, "../data/clean_tax_data.parquet")

# Read in the loaded data, observe the head of each
to_validate = pd.read_parquet("../data/clean_tax_data.parquet")
print(clean_tax_data.head(3))
print(to_validate.head(3))

Data successfully loaded to ../data/clean_tax_data.parquet
       industry_name  number_of_firms  total_taxable_income  total_taxes_paid  \
0  Aerospace/Defense               77               30920.0          5106.376   
1            Apparel               39                5423.0          1112.113   
2       Auto & Truck               31               33360.0          3529.000   

   total_cash_taxes_paid  average_taxable_income  
0               7441.776                 401.561  
1               1479.292                 139.043  
2               2446.896                1076.071  
       industry_name  number_of_firms  total_taxable_income  total_taxes_paid  \
0  Aerospace/Defense               77               30920.0          5106.376   
1            Apparel               39                5423.0          1112.113   
2       Auto & Truck               31               33360.0          3529.000   

   total_cash_taxes_paid  average_taxable_income  
0               7441.776            

In [None]:
# Check that the DataFrames are equal
print(to_validate.equals(clean_tax_data))
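
One caveat with equals(): it returns a single boolean and also compares the index. Because load() writes with index=False, the reloaded DataFrame gets a fresh 0..n-1 index, so if transform() ever drops rows the check can print False even though the values match. A minimal sketch of a more informative comparison follows. It uses made-up frames instead of a real parquet round trip, and it assumes that resetting the index before comparing is acceptable for this pipeline.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Made-up stand-in for clean_tax_data: row 1 was dropped, so the index is [0, 2]
clean = pd.DataFrame(
    {"industry_name": ["A", "B", "C"], "income": [1.0, None, 3.0]}
).dropna()

# Stand-in for the reloaded data: same values, but a fresh 0..n-1 index
reloaded = clean.reset_index(drop=True)

# equals() is sensitive to the index, so compare before and after resetting it
same_before = reloaded.equals(clean)
same_after = reloaded.equals(clean.reset_index(drop=True))
print(same_before, same_after)

# Raises an AssertionError describing the mismatch if the frames differ
assert_frame_equal(reloaded, clean.reset_index(drop=True))
```

Unlike equals(), assert_frame_equal() reports what differs, which helps when a validation step fails.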


### Testing a data pipeline end-to-end
In this exercise, you'll be working with the same data pipeline as before, which extracts, transforms, and loads tax data. You'll practice testing this pipeline end-to-end to ensure it can be run multiple times without duplicating the transformed data in the parquet file.

pandas has been loaded as pd, and the extract(), transform(), and load() functions have already been defined.

In [None]:
# Trigger the data pipeline to run three times
for attempt in range(3):
    print(f"Attempt: {attempt}")
    raw_tax_data = extract("../data/raw_tax_data.csv")
    clean_tax_data = transform(raw_tax_data)
    load(clean_tax_data, "../data/clean_tax_data.parquet")

    # Print the shape of the clean_tax_data DataFrame
    print(f"Shape of clean_tax_data: {clean_tax_data.shape}")

# Read in the loaded data, check the shape
to_validate = pd.read_parquet("../data/clean_tax_data.parquet")
print(f"Final shape of cleaned data: {to_validate.shape}")
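
Repeated runs do not duplicate rows here because to_parquet() writes the whole file on each call instead of appending to it. The contrast between overwriting and appending can be sketched as below. This is an illustration under assumptions: it uses CSV files in a temporary directory as a stand-in for parquet so that it runs without a parquet engine, and the data and file names are made up.

```python
import os
import tempfile

import pandas as pd

# Made-up data standing in for clean_tax_data
clean = pd.DataFrame({"industry_name": ["A", "B"], "income": [1.0, 2.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    overwrite_path = os.path.join(tmp_dir, "overwrite.csv")
    append_path = os.path.join(tmp_dir, "append.csv")

    for attempt in range(3):
        # Overwrite: every run replaces the file, as the pipeline's load() does
        clean.to_csv(overwrite_path, index=False)
        # Append: every run adds the same rows again (header only on first run)
        clean.to_csv(append_path, mode="a", index=False,
                     header=not os.path.exists(append_path))

    overwrite_rows = len(pd.read_csv(overwrite_path))
    append_rows = len(pd.read_csv(append_path))

print(f"Overwrite: {overwrite_rows} rows, append: {append_rows} rows")
```

An end-to-end test for this property amounts to asserting that the final row count equals the row count of a single run.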
