# Automated Data Quality Rules Generation using Hugging Face Transformers

## Introduction
This notebook outlines an end-to-end process for generating semantic tags for dataset columns using a model from Hugging Face. These tags are then used to create and apply data quality rules to a sample dataset.

## Setup
First, we will install and import the necessary libraries for our task.

In [9]:
# Uncomment to install the dependencies. The pipeline also needs a backend such as PyTorch.
# !pip install transformers torch
# !pip install pyarrow

import pandas as pd
from transformers import pipeline

## Loading the Model
We load a zero-shot classification pipeline backed by the facebook/bart-large-mnli model. It scores each candidate tag against a column name without any task-specific training.

In [10]:
# Load a zero-shot-classification pipeline
classifier = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')

## Sample Dataset
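We create a small sample dataset to demonstrate how data quality (DQ) rules are generated and applied. It deliberately contains an empty product name, a negative price, and a negative stock count.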

In [11]:
sample_data = {
    'Product_Name': ['T-shirt', 'Jeans', '', 'Socks', 'Jacket'],
    'Price': [19.99, 49.99, 24.99, 5.99, -1],
    'Stock_Count': [120, 90, -5, 200, 60]
}
df = pd.DataFrame(sample_data)

## Generate Semantic Tags
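For each column, we ask the classifier which candidate tag best matches the column name. The function looks only at the name, not at the values stored in the column, and keeps the single highest-scoring tag.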

In [12]:
def generate_semantic_tags(column_name):
    """Return the highest-scoring candidate tag for a column name."""
    # Candidate tags; only the first three have a matching rule in apply_dq_rules
    potential_tags = ['required', 'numeric_positive', 'integer_non_negative', 'unique', 'datetime', 'text']
    # multi_label=True scores each tag independently, so the scores do not sum to 1
    result = classifier(column_name, potential_tags, multi_label=True)
    # Labels come back sorted by descending score; keep only the top one for simplicity
    return result['labels'][0]

# Note: this function calls the classifier pipeline loaded earlier.
# Loading that pipeline requires the transformers library and, on first use, internet access to download the model.
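
Returning only the top label discards the scores of the other tags, even though `multi_label=True` scores each one independently. The sketch below shows one way to keep every tag above a cutoff instead. The names `select_tags`, `threshold`, and `max_tags` are hypothetical and not part of the notebook, and `example_result` is a hand-written stand-in with made-up scores that only mimics the shape of a single pipeline result.

```python
def select_tags(result, threshold=0.5, max_tags=3):
    """Return the labels whose score meets the threshold, best first.

    `result` is assumed to hold parallel 'labels' and 'scores' lists,
    the shape a zero-shot pipeline returns for a single input.
    """
    pairs = sorted(
        zip(result['labels'], result['scores']),
        key=lambda pair: pair[1],
        reverse=True,
    )
    selected = [label for label, score in pairs if score >= threshold]
    return selected[:max_tags]


# Hand-written stand-in for a classifier result; the scores are made up.
example_result = {
    'sequence': 'Stock_Count',
    'labels': ['integer_non_negative', 'required', 'numeric_positive', 'text'],
    'scores': [0.81, 0.64, 0.40, 0.05],
}

print(select_tags(example_result))
# ['integer_non_negative', 'required']
```

A suitable cutoff depends on the model and the tag wording, so the default of 0.5 here is only a placeholder to tune against real column names.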

## Define Data Quality Rules
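Each supported tag maps to one DQ rule. A rule writes its result to a new column named after the original column with a _DQ suffix, and a tag with no matching rule leaves the DataFrame unchanged.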

In [13]:
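def apply_dq_rules(df, column_name, tag):
    """Add a '<column_name>_DQ' column of 'Valid'/'Invalid' flags for the given tag."""
    dq_column = column_name + '_DQ'
    if tag == 'required':
        # Value must be present: not null and not an empty string
        df[dq_column] = df[column_name].apply(lambda x: 'Valid' if pd.notnull(x) and x != '' else 'Invalid')
    elif tag == 'numeric_positive':
        # Value must be strictly greater than zero; NaN compares as False and is flagged Invalid
        df[dq_column] = df[column_name].apply(lambda x: 'Valid' if x > 0 else 'Invalid')
    elif tag == 'integer_non_negative':
        # Value must be a non-null integer that is zero or greater
        df[dq_column] = df[column_name].apply(lambda x: 'Valid' if pd.notnull(x) and isinstance(x, int) and x >= 0 else 'Invalid')
    # Tags without a rule ('unique', 'datetime', 'text') add no DQ column; more rules can be added here
    return df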
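
As more tags gain rules, the if/elif chain grows with them. One alternative, sketched here under assumptions, is a lookup table that maps each tag to a predicate on a single value. `DQ_RULES`, `check_value`, and the `'Not checked'` status are hypothetical names introduced for illustration. The type checks are also slightly stricter than the ones above, since they reject booleans and non-numeric values.

```python
import math


def _is_missing(value):
    """True for None, NaN, and empty strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return value == ''


def _is_number(value):
    """True for ints and floats, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


# One predicate per tag; adding a rule means adding one entry.
DQ_RULES = {
    'required': lambda v: not _is_missing(v),
    'numeric_positive': lambda v: _is_number(v) and not _is_missing(v) and v > 0,
    'integer_non_negative': lambda v: isinstance(v, int) and not isinstance(v, bool) and v >= 0,
}


def check_value(value, tag):
    """Return 'Valid' or 'Invalid', or 'Not checked' when the tag has no rule."""
    rule = DQ_RULES.get(tag)
    if rule is None:
        return 'Not checked'
    return 'Valid' if rule(value) else 'Invalid'


print(check_value('', 'required'))               # Invalid
print(check_value(-1, 'numeric_positive'))       # Invalid
print(check_value(120, 'integer_non_negative'))  # Valid
print(check_value(5.99, 'datetime'))             # Not checked
```

Such a function could be applied per column with `df[column].apply(lambda x: check_value(x, tag))`. Values that arrive as NumPy integer scalars would need a broader type check than `isinstance(x, int)`.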

## Apply Data Quality Rules
We now apply the defined DQ rules to our sample dataset.

In [14]:
# Iterate over a copy of the column list, because apply_dq_rules adds new columns
for column in list(df.columns):
    tag = generate_semantic_tags(column)
    df = apply_dq_rules(df, column, tag)
df

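|   | Product_Name | Price | Stock_Count | Product_Name_DQ | Stock_Count_DQ |
|---|--------------|-------|-------------|-----------------|----------------|
| 0 | T-shirt      | 19.99 | 120         | Valid           | Valid          |
| 1 | Jeans        | 49.99 | 90          | Valid           | Valid          |
| 2 |              | 24.99 | -5          | Invalid         | Valid          |
| 3 | Socks        | 5.99  | 200         | Valid           | Valid          |
| 4 | Jacket       | -1.0  | 60          | Valid           | Valid          |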
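
On a larger dataset, a per-column count of statuses is easier to read than the full table. The following is a minimal sketch that works on plain lists, the shape returned by `df.to_dict('list')`. The helper name `summarize_dq` is hypothetical, and the values are copied from the output above.

```python
from collections import Counter


def summarize_dq(columns):
    """Count the statuses in every column whose name ends in '_DQ'.

    `columns` maps column names to lists of values.
    """
    summary = {}
    for name, values in columns.items():
        if name.endswith('_DQ'):
            summary[name] = dict(Counter(values))
    return summary


# Values copied from the output table.
columns = {
    'Product_Name': ['T-shirt', 'Jeans', '', 'Socks', 'Jacket'],
    'Product_Name_DQ': ['Valid', 'Valid', 'Invalid', 'Valid', 'Valid'],
    'Stock_Count': [120, 90, -5, 200, 60],
    'Stock_Count_DQ': ['Valid', 'Valid', 'Valid', 'Valid', 'Valid'],
}

print(summarize_dq(columns))
# {'Product_Name_DQ': {'Valid': 4, 'Invalid': 1}, 'Stock_Count_DQ': {'Valid': 5}}
```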


## Conclusion
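The final DataFrame displayed above shows the original data along with the results of the DQ rules application. In this run only Product_Name and Stock_Count received a DQ column. Price has none, which means the model assigned it a tag that has no rule, so its negative value went unchecked. The negative Stock_Count value is marked Valid, which is consistent with that column being tagged 'required' rather than 'integer_non_negative'. Tags predicted from column names alone should therefore be reviewed before the resulting rules are trusted. With that check in place, the same steps can be automated and applied to other datasets.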