# Jake Singer Stripe Takehome Workbook
## Table of Contents

1. [Initial Setup](#initial-setup)
2. [Data Overview](#data-overview)
3. [Identifying Data Quality Issues](#identifying-issues)
4. [Cleaning Process](#cleaning-process)
   1. Mixed Type
      1. [Merchant Column](#merchant-column)
      2. [Industry Column](#industry-column)
      3. [First Charge Date Column](#first-charge-date)
5. [Validation and Summary](#validation)

## 1. Initial Setup <a name="initial-setup"></a>

In [36]:
import pandas as pd
import numpy as np
import datetime

# Import merchants data
merchants = pd.read_excel('dstakehome_merchants.xlsx', parse_dates=['first_charge_date'])

# Import payments data
payments = pd.read_excel('dstakehome_payments.xlsx', parse_dates=['date'])

# Verify the data was imported correctly
print("Merchants table shape:", merchants.shape)
print("Payments table shape:", payments.shape)

# Display the first few rows of each table
print("\nFirst few rows of Merchants table:")
print(merchants.head())

print("\nFirst few rows of Payments table:")
print(payments.head())

Merchants table shape: (23627, 5)
Payments table shape: (1577887, 6)

First few rows of Merchants table:
   merchant           industry          first_charge_date country  \
0  5d03e714          Education  2032-02-13 00:00:00+00:00      US   
1  da22f154             Others  2031-10-16 00:00:00+00:00      US   
2  687eebc8           Software  2032-07-23 00:00:00+00:00      US   
3  de478470           Software  2033-03-15 00:00:00+00:00      US   
4  1e719b8a  Business services  2035-02-12 00:00:00+00:00      IT   

  business_size  
0        medium  
1         small  
2         small  
3         small  
4         small  

First few rows of Payments table:
                       date  merchant  subscription_volume  checkout_volume  \
0 2041-05-01 00:00:00+00:00  5d03e714                    0                0   
1 2041-05-01 00:00:00+00:00  da22f154                    0                0   
2 2041-05-01 00:00:00+00:00  687eebc8                79400                0   
3 2041-05-01 00:00:00

## 2. Data Overview <a name="data-overview"></a>

In [25]:
def print_dataframe_info(df, name):
    print(f"\n{name} DataFrame:")
    print(df.info())
    print("\nSample data:")
    print(df.sample(10))

print_dataframe_info(payments, "Payments")
print_dataframe_info(merchants, "Merchants")


Payments DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1577887 entries, 0 to 1577886
Data columns (total 6 columns):
 #   Column               Non-Null Count    Dtype              
---  ------               --------------    -----              
 0   date                 1577887 non-null  datetime64[ns, UTC]
 1   merchant             1577887 non-null  object             
 2   subscription_volume  1577887 non-null  int64              
 3   checkout_volume      1577887 non-null  int64              
 4   payment_link_volume  1577887 non-null  int64              
 5   total_volume         1577887 non-null  int64              
dtypes: datetime64[ns, UTC](1), int64(4), object(1)
memory usage: 72.2+ MB
None

Sample data:
                             date  merchant  subscription_volume  \
282783  2041-07-20 00:00:00+00:00  bf59b44d                    0   
1311206 2042-04-18 00:00:00+00:00  fc110f8a                21000   
749971  2041-11-25 00:00:00+00:00  96bedf32              

## 3. Identifying Data Quality Issues <a name="identifying-issues"></a>
The data quality issues I want to check for include:
1. Check for missing data
2. Examine the data types and look for columns with mixed types
3. Look for inconsistencies or anomalies
4. Verify date ranges (we were told to expect dates in 2041-2042)
5. Check for duplicate records
6. Make sure the data is consistent between tables (e.g. each merchant's first_charge_date agrees with the actual payments data)
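
Checks 4 to 6 can be sketched on toy data as follows. This is a minimal illustration, not the method used in this workbook: the `*_demo` frames and the three helper functions are hypothetical, and the sketch assumes the date columns have already been parsed into timezone-aware timestamps (which is not yet true for `first_charge_date` in the raw merchants table).

```python
import pandas as pd

# Toy stand-ins for the real tables; every value here is made up for illustration.
merchants_demo = pd.DataFrame({
    "merchant": ["a1", "b2", "b2"],
    "first_charge_date": pd.to_datetime(
        ["2041-05-01", "2041-06-01", "2041-06-01"], utc=True
    ),
})
payments_demo = pd.DataFrame({
    "date": pd.to_datetime(["2041-05-01", "2041-05-20", "2043-01-01"], utc=True),
    "merchant": ["a1", "b2", "b2"],
    "total_volume": [100, 250, 75],
})

def rows_outside_range(df, column, start, end):
    """Return rows whose timestamp falls outside the half-open range [start, end)."""
    start = pd.Timestamp(start, tz="UTC")
    end = pd.Timestamp(end, tz="UTC")
    return df[(df[column] < start) | (df[column] >= end)]

def duplicate_rows(df, subset):
    """Return every row that shares its key columns with at least one other row."""
    return df[df.duplicated(subset=subset, keep=False)]

def payments_before_first_charge(payments, merchants):
    """Return non-zero payments dated before the merchant's recorded first charge."""
    first = merchants.drop_duplicates("merchant")[["merchant", "first_charge_date"]]
    merged = payments.merge(first, on="merchant", how="left")
    is_early = merged["date"] < merged["first_charge_date"]
    return merged[(merged["total_volume"] > 0) & is_early]

# Check 4: dates expected in 2041-2042
out_of_range = rows_outside_range(payments_demo, "date", "2041-01-01", "2043-01-01")
# Check 5: merchants listed more than once
dupes = duplicate_rows(merchants_demo, ["merchant"])
# Check 6: payments that contradict first_charge_date
early = payments_before_first_charge(payments_demo, merchants_demo)

print(out_of_range)
print(dupes)
print(early)
```

Returning the offending rows rather than a count makes it easier to inspect what went wrong before deciding how to clean it.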

### 3.1 Missing Data 

In [30]:
def check_missing_data(df, table_name):
    print(f"Missing Data in {table_name} table:")
    missing = df.isnull().sum()
    missing_percent = 100 * df.isnull().sum() / len(df)
    missing_table = pd.concat([missing, missing_percent], axis=1, keys=['Missing Values', '% Missing'])
    print(missing_table[missing_table['Missing Values'] > 0])
    print("\n")

# Check missing data in both tables
check_missing_data(payments, "Payments")
check_missing_data(merchants, "Merchants")

Missing Data in Payments table:
Empty DataFrame
Columns: [Missing Values, % Missing]
Index: []


Missing Data in Merchants table:
Empty DataFrame
Columns: [Missing Values, % Missing]
Index: []




Result: neither table contains null values, so there is no data quality issue related to missing data.
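
One caveat: `isnull()` only finds true nulls, so blank strings or tokens such as "N/A" would pass the check above. A possible follow-up is sketched below on toy data. The `PLACEHOLDERS` set and the `count_placeholder_values` helper are assumptions made for illustration; nothing in the data so far suggests such values are present.

```python
import pandas as pd

# Hypothetical placeholder tokens; adjust to whatever the data actually uses.
PLACEHOLDERS = {"", "na", "n/a", "none", "null", "unknown", "-"}

def count_placeholder_values(df):
    """Count, per column, string cells that are blank or match a placeholder token."""
    counts = {}
    for column in df.columns:
        # Normalize strings; non-string cells become None and never match
        as_text = df[column].map(
            lambda v: v.strip().lower() if isinstance(v, str) else None
        )
        counts[column] = int(as_text.isin(PLACEHOLDERS).sum())
    return pd.Series(counts, name="placeholder_count")

# Toy data for illustration only
demo = pd.DataFrame({
    "country": ["US", " ", "N/A", "IT"],
    "business_size": ["small", "medium", "unknown", "small"],
    "total_volume": [0, 10, 20, 30],
})
placeholder_counts = count_placeholder_values(demo)
print(placeholder_counts)
```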

### 3.2 Data Types

In [31]:
def check_mixed_types(df, df_name):
    print(f"\nChecking mixed types in {df_name}:")
    for column in df.columns:
        # Get unique types in the column
        unique_types = df[column].apply(type).unique()
        
        if len(unique_types) > 1:
            print(f"\nColumn '{column}' has mixed types:")
            for dtype in unique_types:
                count = (df[column].apply(type) == dtype).sum()
                print(f"  {dtype.__name__}: {count}")
        else:
            print(f"Column '{column}' only has one type: {unique_types[0].__name__}")

# Run the function on both dataframes
check_mixed_types(payments, "Payments")
check_mixed_types(merchants, "Merchants")


Checking mixed types in Payments:
Column 'date' only has one type: Timestamp

Column 'merchant' has mixed types:
  str: 1536860
  int: 40992
  datetime: 35
Column 'subscription_volume' only has one type: int
Column 'checkout_volume' only has one type: int
Column 'payment_link_volume' only has one type: int
Column 'total_volume' only has one type: int

Checking mixed types in Merchants:

Column 'merchant' has mixed types:
  str: 22930
  int: 696
  datetime: 1

Column 'industry' has mixed types:
  str: 23620
  int: 7

Column 'first_charge_date' has mixed types:
  str: 23601
  int: 26
Column 'country' only has one type: str
Column 'business_size' only has one type: str


**Summary**

Four columns contain mixed types: `merchant` in both tables, plus `industry` and `first_charge_date` in the merchants table. These will need to be cleaned.
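
For a more compact view of the same information, the per-column type counts can be collected into a single table. The sketch below is an alternative presentation on toy data, not a replacement for `check_mixed_types`; the `type_counts` helper and the `demo` frame are hypothetical.

```python
import datetime
import pandas as pd

def type_counts(df):
    """Return a table of Python type names (rows) by column, with cell counts."""
    counts = {
        column: df[column].map(lambda v: type(v).__name__).value_counts()
        for column in df.columns
    }
    # Types absent from a column show up as NaN after alignment; report them as 0
    return pd.DataFrame(counts).fillna(0).astype(int)

# Toy data mimicking the kinds of values seen above
demo = pd.DataFrame({
    "merchant": ["5d03e714", 69922557, datetime.datetime(9082, 12, 6)],
    "industry": ["Software", "Education", 7],
})
summary = type_counts(demo)
print(summary)
```

Any column with more than one non-zero row in this table has mixed types.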

Let's first look at some examples so we know what we're dealing with. First up, the merchant columns in both dataframes:

In [6]:
def investigate_merchant_column(df, df_name):
    print(f"\nInvestigating 'merchant' column in {df_name}:")
    print("\nExample of string type:")
    print(df[df['merchant'].apply(lambda x: isinstance(x, str))]['merchant'].head())
    print("\nExample of int type:")
    print(df[df['merchant'].apply(lambda x: isinstance(x, (int, np.integer)))]['merchant'].head())
    print("\nExample of potential datetime type:")
    print(df[~df['merchant'].apply(lambda x: isinstance(x, (str, int, np.integer)))]['merchant'].head())

investigate_merchant_column(payments, "Payments")
investigate_merchant_column(merchants, "Merchants")


Investigating 'merchant' column in Payments:

Example of string type:
0    5d03e714
1    da22f154
2    687eebc8
3    de478470
4    1e719b8a
Name: merchant, dtype: object

Example of int type:
34                                              69922557
36                                              13314398
83                                              49903842
151                                             54243875
172    2722999999999999891872532342880675763362592535...
Name: merchant, dtype: object

Example of potential datetime type:
34245     9082-12-06 00:00:00
48752     9082-12-06 00:00:00
58729     9082-12-06 00:00:00
62345     9082-12-06 00:00:00
117939    9082-12-06 00:00:00
Name: merchant, dtype: object

Investigating 'merchant' column in Merchants:

Example of string type:
0    5d03e714
1    da22f154
2    687eebc8
3    de478470
4    1e719b8a
Name: merchant, dtype: object

Example of int type:
34                                              69922557
36                      

We have identified the following issues with the merchant columns:
* Ideally, merchant IDs are strings of 8 or fewer characters, typically a mix of letters and digits.
* In some cases, every character is a digit, so the value was loaded as an int. These should simply be cast to strings.
* We see one example of a very long "int", which appears in both tables. We should check whether it is the only such value (an int longer than 8 digits). If it is, we'll just shorten the identifier in both tables.
* We also see some datetime values. The examples above come from the payments table, although the type check also reported one datetime in the merchants table. There appears to be only one such value: "9082-12-06 00:00:00". To validate this, we should count the unique values of this type. If it really is just this one, the correct value *might* be "90821206", converted to a datetime at some point. We can quickly check whether that value appears in the merchants table. If it does, we can confidently replace the datetime with that string.
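
The casting steps described above could look roughly like this. It is a sketch under stated assumptions: `normalize_merchant_id` is a hypothetical helper, and turning a datetime back into an 8-digit ID rests on the unverified guess that "9082-12-06" was originally "90821206".

```python
import datetime
import numbers

def normalize_merchant_id(value):
    """Return a merchant ID as a string, whatever type the loader produced."""
    if isinstance(value, str):
        return value
    if isinstance(value, datetime.datetime):
        # Assumption: the ID was an 8-digit number misread as a date (YYYY-MM-DD)
        return f"{value.year:04d}{value.month:02d}{value.day:02d}"
    if isinstance(value, numbers.Integral):
        # All-digit IDs were loaded as ints; cast them back to strings
        return str(value)
    raise TypeError(f"Unexpected merchant ID type: {type(value).__name__}")

# Examples drawn from the values seen above
print(normalize_merchant_id("5d03e714"))
print(normalize_merchant_id(69922557))
print(normalize_merchant_id(datetime.datetime(9082, 12, 6)))
```

Applied to a dataframe, this would be something like `payments['merchant'] = payments['merchant'].map(normalize_merchant_id)`. Raising on unknown types, rather than silently casting, surfaces any value the investigation above did not anticipate.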

In [10]:
def analyze_merchant_column(df, df_name):
    print(f"\nAnalyzing 'merchant' column in {df_name}:")
    
    # Check data types
    type_counts = df['merchant'].apply(type).value_counts()
    print("Data types in merchant column:")
    print(type_counts)
    
    # Check length distribution
    length_counts = df['merchant'].astype(str).str.len().value_counts().sort_index()
    print("\nLength distribution of merchant IDs:")
    print(length_counts)
    
    # Check for long values (length > 8)
    long_values = df[df['merchant'].astype(str).str.len() > 8]
    print(f"\nNumber of long values (length > 8): {len(long_values)}")
    if len(long_values) > 0:
        print("Sample of long values:")
        print(long_values['merchant'].head())
    
    # Check for datetime-like values
    datetime_values = df[df['merchant'].astype(str).str.contains(':')]
    print(f"\nNumber of datetime-like values: {len(datetime_values)}")
    if len(datetime_values) > 0:
        print("Sample of datetime-like values:")
        print(datetime_values['merchant'].head())

# Analyze both dataframes
analyze_merchant_column(payments, "Payments")
analyze_merchant_column(merchants, "Merchants")

# Check if '90821206' exists in merchants
print("\nChecking for '90821206' in merchants:")
print('90821206' in merchants['merchant'].values)

# Check for the specific long int value we saw earlier
long_int_value = "272299999999999989187253234288067576336259253599999999"
print("\nChecking for the specific long int value:")
print(f"In Payments: {long_int_value in payments['merchant'].values}")
print(f"In Merchants: {long_int_value in merchants['merchant'].values}")


Analyzing 'merchant' column in Payments:
Data types in merchant column:
merchant
<class 'str'>    1577887
Name: count, dtype: int64

Length distribution of merchant IDs:
merchant
1       150
4       114
5       168
6       621
7      2615
       ... 
228     417
239     170
250       1
259       1
282       2
Name: count, Length: 68, dtype: int64

Number of long values (length > 8): 7624
Sample of long values:
172    2722999999999999891872532342880675763362592535...
310    6910999999999999870594406756389217016713121289...
393    4297000000000000335783791370003204432445948705...
409    8870300000000000777024428306674398232244977079...
458    6836000000000000469949036784867651046214128083...
Name: merchant, dtype: object

Number of datetime-like values: 0

Analyzing 'merchant' column in Merchants:
Data types in merchant column:
merchant
<class 'str'>    23627
Name: count, dtype: int64

Length distribution of merchant IDs:
merchant
1       8
4       1
5       3
6       9
7      47
      

In [14]:

def build_merchant_id_map(all_ids):
    # Long numeric IDs (more than 8 characters) are the ones that need shortening
    long_ids = sorted(x for x in all_ids if x.isnumeric() and len(x) > 8)
    # IDs already in use, which a shortened ID must not collide with
    taken = set(all_ids) - set(long_ids)
    id_map = {}

    for long_id in long_ids:
        base = long_id[:8]  # Take the first 8 digits as the base
        unique_id = base
        counter = 1

        # Append a counter until the shortened ID no longer collides
        while unique_id in taken:
            unique_id = f"{base}_{counter}"
            counter += 1

        taken.add(unique_id)
        id_map[long_id] = unique_id

    return id_map

def clean_merchant_column(df, id_map):
    # Replace long numeric IDs with their shortened form; leave other IDs as is
    df['merchant'] = df['merchant'].map(lambda x: id_map.get(x, x))
    return df

# Collect all existing merchant IDs from both dataframes
all_existing_ids = set(payments['merchant']) | set(merchants['merchant'])

# Build one mapping so the same long ID gets the same short ID in both tables
merchant_id_map = build_merchant_id_map(all_existing_ids)

# Clean both dataframes (modified in place, as before)
payments_cleaned = clean_merchant_column(payments, merchant_id_map)
merchants_cleaned = clean_merchant_column(merchants, merchant_id_map)

# Verify cleaning
print("After cleaning:")
analyze_merchant_column(payments_cleaned, "Cleaned Payments")
analyze_merchant_column(merchants_cleaned, "Cleaned Merchants")

# Check if cleaning made merchant IDs inconsistent between dataframes
payments_merchants = set(payments_cleaned['merchant'])
merchants_merchants = set(merchants_cleaned['merchant'])

print("\nMerchant ID Consistency Check:")
print(f"Merchant IDs only in Payments: {len(payments_merchants - merchants_merchants)}")
print(f"Merchant IDs only in Merchants: {len(merchants_merchants - payments_merchants)}")

# Count merchant IDs that appear in both dataframes
shared_ids = payments_merchants & merchants_merchants
print(f"\nNumber of shared Merchant IDs: {len(shared_ids)}")

# Verify uniqueness within each dataframe
print("\nUniqueness Check:")
print(f"Unique merchants in Payments: {payments_cleaned['merchant'].nunique()} (Total: {len(payments_cleaned)})")
print(f"Unique merchants in Merchants: {merchants_cleaned['merchant'].nunique()} (Total: {len(merchants_cleaned)})")


After cleaning:

Analyzing 'merchant' column in Cleaned Payments:
Data types in merchant column:
merchant
<class 'str'>    1577887
Name: count, dtype: int64

Length distribution of merchant IDs:
merchant
1        150
4        114
5        168
6        621
7       2615
8    1574219
Name: count, dtype: int64

Number of long values (length > 8): 0

Number of datetime-like values: 0

Analyzing 'merchant' column in Cleaned Merchants:
Data types in merchant column:
merchant
<class 'str'>    23627
Name: count, dtype: int64

Length distribution of merchant IDs:
merchant
1        8
4        1
5        3
6        9
7       47
8    23559
Name: count, dtype: int64

Number of long values (length > 8): 0

Number of datetime-like values: 0

Merchant ID Consistency Check:
Merchant IDs only in Payments: 0
Merchant IDs only in Merchants: 0

Number of shared Merchant IDs: 23620

Uniqueness Check:
Unique merchants in Payments: 23620 (Total: 1577887)
Unique merchants in Merchants: 23620 (Total: 23627)
