# Data Profiling & Validation Using LLMs

In [1]:
# Load the Gemini API key from a local .env file
from dotenv import load_dotenv
import os

load_dotenv()
gemini_api_key = os.getenv("GEMINI_API_KEY")

In [2]:
from google import genai

client = genai.Client(api_key=gemini_api_key)

model = "gemini-2.0-flash"

In [3]:
import pandas as pd

# Sample dataset
data = {
    'CustomerID': [1001, 1002, 1003, None, 1005],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Email': ['alice@gmail', 'bob@yahoo.com', 'charlie@@mail.com', None, 'eve@gmail.com'],
    'JoinDate': ['2021-01-01', '2021-02-30', '2021-03-15', 'bad_date', '2021/04/01'],
    'Country': ['MY', 'Malaysia', 'MY', 'Singapore', None],
    'Revenue': ['1000', 'Two Thousand', '3000', '-500', '4000']
}
df = pd.DataFrame(data)

# Convert schema to prompt
schema = str(df.dtypes)
prompt = f"""
You are a data quality expert. Analyze the schema below and:
1. Identify 3 potential data quality issues.
2. Suggest data profiling steps.
3. Recommend data standardization or validation actions.

Schema:
{schema}
"""
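
Because `str(df.dtypes)` carries only column names and types, the model never sees actual values such as `'Two Thousand'` or `'bad_date'`. One possible variation, sketched here as an assumption rather than as part of the original notebook, appends null counts and a few sample rows to the prompt. The names `build_profile_prompt` and `sample_df` are hypothetical.

```python
import pandas as pd


def build_profile_prompt(df: pd.DataFrame, n_rows: int = 5) -> str:
    """Build a prompt that includes dtypes, null counts, and sample rows."""
    return (
        "You are a data quality expert. Analyze the dataset summary below.\n\n"
        f"Schema:\n{df.dtypes.to_string()}\n\n"
        f"Null counts:\n{df.isna().sum().to_string()}\n\n"
        f"Sample rows:\n{df.head(n_rows).to_string(index=False)}\n"
    )


# Hypothetical two-column excerpt of the sample dataset
sample_df = pd.DataFrame({
    "JoinDate": ["2021-01-01", "bad_date"],
    "Revenue": ["1000", "Two Thousand"],
})
rich_prompt = build_profile_prompt(sample_df)
print(rich_prompt)
```

Sending sample rows to an external API is only appropriate when the data may be shared, so a real pipeline might mask or synthesize values first.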

In [4]:
# Generate and print the model output
response = client.models.generate_content(model=model, contents=prompt)
print(response.text)

## Model Output:

Okay, let's analyze the provided schema and address the data quality concerns.

**1. Potential Data Quality Issues**

Here are three potential data quality issues, along with explanations:

*   **a. `CustomerID` as Float:**  A `CustomerID` should ideally be a unique identifier and is usually represented as an integer or a string.  Using `float64` can lead to problems:
    *   **Loss of precision:** `float64` represents integers exactly only up to 2^53, so very large IDs can be silently rounded. Smaller IDs survive intact but are displayed with a trailing `.0` (e.g., `1001.0`), which complicates matching and export.
    *   **Inconsistency:**  Having `CustomerID` as a float suggests potentially meaningless decimal places. What does CustomerID 123.5 represent?
    *   **Joining Issues:** Joining with other tables where CustomerID is an integer or string will be problematic without explicit type casting.

*   **b. `JoinDate` as Object:**  Storing dates as generic `object` types (usually strings) is a major data quality issue.  It makes date-based analysis and calculations incredibly difficult and error-prone.
    *   **Format Inconsistencies:**  The `JoinDate` might have various formats like "YYYY-MM-DD", "MM/DD/YYYY", "DD-MMM-YY", etc.
    *   **Sorting/Filtering Problems:** Sorting strings lexicographically will not give the correct chronological order.  Filtering by date ranges will also be incorrect.
    *   **Calculations Impossible:**  Calculating the age of customers or the duration of their membership is impossible without parsing these strings into proper date objects.

*   **c. `Revenue` as Object:** Revenue should be a numerical type (either integer or float). Storing it as `object` (most likely a string) is a serious issue.
    *   **Invalid Characters:** The `Revenue` strings might contain currency symbols ("$", "€"), commas (","), or non-numeric characters.
    *   **Incorrect Aggregations:** You cannot perform sums, averages, or other calculations directly on string data.
    *   **Potential Missing Values:** Placeholders such as "NA", "None", or empty strings might be stored in place of real values.
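
In this notebook, the `float64` type on `CustomerID` is most likely a side effect of the missing value in the sample data: pandas cannot store a missing value in a plain `int64` column, so it falls back to floats. A minimal sketch, assuming the same five IDs as the sample dataset, shows the effect and one possible remedy (pandas' nullable `Int64` dtype):

```python
import pandas as pd

# Same IDs as the sample dataset; the None forces a float64 column
ids = pd.Series([1001, 1002, 1003, None, 1005], name="CustomerID")
print(ids.dtype)

# The nullable Int64 dtype keeps whole numbers and marks the gap as <NA>
ids_nullable = ids.astype("Int64")
print(ids_nullable.dtype)
print(ids_nullable.isna().sum())
```

Whether the row with the missing ID should be kept at all is a business decision that this sketch does not address.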

**2. Data Profiling Steps**

Data profiling is essential to understand the *actual* data characteristics before attempting cleaning or transformation. Here's a suggested set of profiling steps:

*   **a. Descriptive Statistics:**
    *   For *all* columns:
        *   Count of non-null values.
        *   Number of unique values.
        *   Number of missing (null) values.
        *   Data type confirmation.

    *   For `CustomerID`:
        *   Minimum, Maximum (to get the range of IDs)
        *   Check for duplicate CustomerIDs

    *   For `Name`:
        *   Most frequent names.
        *   Average name length.

    *   For `Email`:
        *   Percentage of valid email formats (using regex).
        *   Most frequent domain names.
        *   Number of duplicate emails.

    *   For `JoinDate`:
        *   Most frequent date formats.
        *   Minimum and maximum dates (to identify outliers or impossible dates).
        *   Frequency of dates.
        *   Identify any invalid date entries like "N/A"

    *   For `Country`:
        *   List of unique country values.
        *   Value counts for each country (to identify potential data entry errors or unexpected distributions).

    *   For `Revenue`:
        *   Identify values with non-numeric characters (currency symbols, commas, etc.).
        *   Identify potential missing values (e.g., "NA", "", "None").

*   **b. Value Distribution Analysis:**
    *   Histograms or bar charts for numeric/categorical columns to visualize distributions.
    *   Frequency tables to show the occurrences of each unique value in categorical columns.

*   **c. Pattern Analysis:**
    *   Use regular expressions to identify patterns in `Email`, `Name`, `JoinDate`, and `Revenue` to understand common formats and potential anomalies.  For example, check for common email patterns (e.g., `^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`).

*   **d. Dependency Analysis:**
    *   Are there any relationships between columns? For example, are certain countries associated with specific revenue ranges or email domains?
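
A few of the profiling steps listed above can be sketched directly in pandas. The example below is a minimal illustration on three columns of the sample dataset, not a complete profiling routine; the names `profile`, `EMAIL_PATTERN`, and `valid_share` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "CustomerID": [1001, 1002, 1003, None, 1005],
    "Email": ["alice@gmail", "bob@yahoo.com", "charlie@@mail.com", None, "eve@gmail.com"],
    "Country": ["MY", "Malaysia", "MY", "Singapore", None],
})

# Per-column counts of missing and distinct (non-null) values
profile = pd.DataFrame({
    "nulls": df.isna().sum(),
    "unique": df.nunique(),
})
print(profile)

# Share of emails matching the pattern quoted above; missing values count as invalid
EMAIL_PATTERN = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
email_ok = df["Email"].str.fullmatch(EMAIL_PATTERN, na=False)
valid_share = email_ok.mean()
print(f"Valid emails: {valid_share:.0%}")

# Frequency table that surfaces mixed spellings such as "MY" vs "Malaysia"
country_counts = df["Country"].value_counts(dropna=False)
print(country_counts)
```

On this sample, only `bob@yahoo.com` and `eve@gmail.com` match the pattern, and the frequency table shows `MY` and `Malaysia` as separate values.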

**3. Data Standardization and Validation Actions**

Based on the identified issues and the insights gained from data profiling, here are recommended actions:

*   **a. `CustomerID`:**
    *   **Data Type Conversion:**  Convert `CustomerID` to `int64` if all values are integers and there are no missing values. If there are any non-integer values, investigate and either convert to integer or string.
    *   **Validation:** Ensure `CustomerID` values are unique.  Identify and handle duplicates (e.g., merge records, assign new IDs).
    *   **Format Standardization:** If the IDs have a specific format (e.g., prefixed with "CUST-"), enforce that format.

*   **b. `Name`:**
    *   **Standardization:** Remove leading/trailing whitespace.
    *   **Case Consistency:** Convert to a consistent case (e.g., title case, lowercase).
    *   **Splitting:** Consider splitting the `Name` column into `FirstName` and `LastName` if appropriate.
    *   **Validation:**  Check for and handle special characters or unexpected symbols.

*   **c. `Email`:**
    *   **Validation:** Use regular expressions to validate email format.  Flag or correct invalid emails.
    *   **Domain Standardization:** Normalize domain names (e.g., lowercase them and map known aliases such as "googlemail.com" to "gmail.com" if necessary).
    *   **Deduplication:** Identify and handle duplicate email addresses.
    *   **Consider Email Verification:** Use a third-party service to verify email addresses if real-time validation is important.

*   **d. `JoinDate`:**
    *   **Data Type Conversion:** Convert `JoinDate` to a datetime data type (e.g., `datetime64[ns]` in pandas).
    *   **Format Standardization:** Identify the dominant date format and use `pd.to_datetime(df['JoinDate'], format='%Y-%m-%d', errors='coerce')` to parse the column. The `errors='coerce'` option converts any value that does not match the specified format to `NaT`, so dates written in a different format should be normalized before parsing or they may be lost as `NaT`.
    *   **Missing Value Handling:** Decide how to handle invalid dates (e.g., replace with a default date, impute based on other data).
    *   **Range Validation:** Check for dates that are illogical (e.g., future dates or dates that are historically impossible).

*   **e. `Country`:**
    *   **Standardization:**  Use a standard list of country codes (e.g., ISO 3166-1 alpha-2 or alpha-3) and map inconsistent names to the standard codes.  Remove leading/trailing whitespace.  Handle variations in spelling (e.g., "USA" vs. "United States of America").
    *   **Validation:**  Validate that the country values are valid according to the standard list.

*   **f. `Revenue`:**
    *   **Data Cleaning:** Remove currency symbols (e.g., "$", "€"), commas, and other non-numeric characters, for example with `Series.str.replace` and a regular expression.
    *   **Data Type Conversion:** Convert `Revenue` to a numeric data type (e.g., `float64` or `int64`).
    *   **Missing Value Handling:** Decide how to handle missing or invalid revenue values (e.g., replace with 0, calculate an average, or flag the record).
    *   **Range Validation:** Check for revenue values that are illogical (e.g., negative values if not allowed, or very high outliers).
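
Several of the validation actions above can be expressed as small, testable functions. The sketch below uses only the Python standard library and passes the sample columns as plain lists to stay independent of pandas. The helper names and the `COUNTRY_CODES` mapping are hypothetical; a real project would use a full ISO 3166-1 table and its own business rules.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical mapping for illustration; not a complete ISO 3166-1 table
COUNTRY_CODES = {"my": "MY", "malaysia": "MY", "sg": "SG", "singapore": "SG"}


def parse_join_date(value: Optional[str]) -> Optional[date]:
    """Return a date for strings in %Y-%m-%d format, or None if invalid."""
    if value is None:
        return None
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        return None


def parse_revenue(value: Optional[str]) -> Optional[float]:
    """Strip currency symbols and commas, then convert; None if not numeric."""
    if value is None:
        return None
    cleaned = value.replace("$", "").replace("€", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def normalize_country(value: Optional[str]) -> Optional[str]:
    """Map a country name or code to its alpha-2 code; None if unknown."""
    if value is None:
        return None
    return COUNTRY_CODES.get(value.strip().lower())


join_dates = ["2021-01-01", "2021-02-30", "2021-03-15", "bad_date", "2021/04/01"]
revenues = ["1000", "Two Thousand", "3000", "-500", "4000"]
countries = ["MY", "Malaysia", "MY", "Singapore", None]

parsed_dates = [parse_join_date(v) for v in join_dates]
parsed_revenue = [parse_revenue(v) for v in revenues]
parsed_countries = [normalize_country(v) for v in countries]

# Range validation: flag negative revenue for review instead of dropping it
negative_revenue = [r for r in parsed_revenue if r is not None and r < 0]

print(parsed_dates)
print(parsed_revenue)
print(parsed_countries)
print(negative_revenue)
```

Note that `2021/04/01` is a real date in a different format and is rejected here along with the impossible `2021-02-30`; a fuller version might try several formats before giving up.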

**Important Considerations:**

*   **Data Source Understanding:** Understand the origin of the data and the business rules associated with each field.
*   **Imputation vs. Removal:** Carefully consider whether to impute missing values or remove records with missing values. Imputation can introduce bias, while removal can reduce the sample size.
*   **Data Quality Monitoring:** Implement a process for ongoing data quality monitoring to detect and address issues as they arise.
*   **Document Everything:**  Document all data cleaning and transformation steps to ensure reproducibility and maintainability.
*   **Testing:** Test any changes that are made to the data. Make sure you can revert your steps if you need to.

By following these steps, you can significantly improve the quality and reliability of your data. Remember to tailor these recommendations to the specific characteristics of your data and business requirements.

