### Task 1: Data Profiling to Understand Data Quality
**Description**: Use basic statistical methods to profile a dataset and identify potential quality issues.

**Steps**:
1. Load the dataset using pandas in Python.
2. Understand the data by checking its basic statistics.
3. Identify null values.
4. Check unique values for categorical columns.
5. Review outliers using box plots.

In [None]:
# write your code from here
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Load dataset
# Example: the Titanic dataset from seaborn's sample data (downloaded on first use).
# To profile your own data instead, use: df = pd.read_csv('your_file.csv')
df = sns.load_dataset('titanic')

# Step 2: Basic statistics
print("Basic Statistical Summary:\n", df.describe(include='all'))

# Step 3: Identify null values
print("\nMissing Values per Column:\n", df.isnull().sum())

# Step 4: Unique values for categorical columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
for col in categorical_cols:
    print(f"\nUnique values in '{col}': {df[col].unique()}")

# Step 5: Review outliers with box plots for numeric columns
numeric_cols = df.select_dtypes(include=['number']).columns
for col in numeric_cols:
    plt.figure(figsize=(6, 3))
    plt.title(f"Box plot of {col}")
    plt.boxplot(df[col].dropna())
    plt.show()
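
Box plots show outliers visually. A numeric count per column can complement them when there are many columns. The following is a minimal sketch, assuming the common 1.5 × IQR rule (the same multiplier box-plot whiskers use by default in matplotlib). The helper name `iqr_outlier_counts` and the small `sample` frame are hypothetical and use made-up values, not the Titanic data.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparisons against NaN are False, so missing values are never counted.
    outside = (numeric < lower) | (numeric > upper)
    return outside.sum()

# Illustrative data only: one extreme age, no extreme fares
sample = pd.DataFrame({
    "age": [22, 25, 27, 30, 31, 95],
    "fare": [7.5, 8.0, 8.5, 9.0, 9.5, 10.0],
    "name": list("abcdef"),  # non-numeric columns are ignored
})
outlier_counts = iqr_outlier_counts(sample)
print(outlier_counts)
```

To apply the same idea to the dataset loaded above, call `iqr_outlier_counts(df)`. Columns with a nonzero count are the ones worth inspecting in the box plots.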


### Task 2: Implement Simple Data Validation
**Description**: Write a Python script to validate the data types and constraints of each column in a dataset.

**Steps**:
1. Define constraints for each column.
2. Validate each column based on its constraints.

In [None]:
# write your code from here
import pandas as pd

# Sample dataset
data = {
    'age': [25, 30, -1, 40],               # must be a non-negative integer (-1 is invalid)
    'gender': ['M', 'F', 'F', 'O'],        # must be one of {'M', 'F'} ('O' is invalid)
    'income': [50000, 60000, 70000, None]  # must be non-null and positive (None is invalid)
}

df = pd.DataFrame(data)

# Step 1: Define constraints
constraints = {
    'age': lambda x: x.apply(lambda v: isinstance(v, int) and v >= 0),
    'gender': lambda x: x.isin(['M', 'F']),
    'income': lambda x: x.notnull() & (x > 0)
}

# Step 2: Validate each column and collect errors
errors = []
for col, check_func in constraints.items():
    if col in df.columns:
        valid_mask = check_func(df[col])
        if not valid_mask.all():
            invalid_indices = df.index[~valid_mask].tolist()
            errors.append(f"Column '{col}' failed validation at rows: {invalid_indices}")
    else:
        errors.append(f"Column '{col}' is missing in the dataset.")

if errors:
    print("Validation Errors:")
    for err in errors:
        print("-", err)
else:
    print("All columns passed validation.")
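
The task description also mentions validating data types. One way to do this is to check each column's dtype once, separately from the value-level constraints. The sketch below assumes the dtype predicates from `pandas.api.types`. The mapping `expected_types` and the helper `check_dtypes` are hypothetical names introduced for illustration.

```python
import pandas as pd
from pandas.api import types as ptypes

df_example = pd.DataFrame({
    "age": [25, 30, -1, 40],
    "gender": ["M", "F", "F", "O"],
    "income": [50000, 60000, 70000, None],
})

# Hypothetical mapping: column name -> dtype predicate
expected_types = {
    "age": ptypes.is_integer_dtype,
    "gender": ptypes.is_string_dtype,
    "income": ptypes.is_numeric_dtype,
}

def check_dtypes(frame, expected):
    """Return a list of messages for missing columns or unexpected dtypes."""
    problems = []
    for col, predicate in expected.items():
        if col not in frame.columns:
            problems.append(f"{col}: missing")
        elif not predicate(frame[col]):
            problems.append(f"{col}: unexpected dtype {frame[col].dtype}")
    return problems

dtype_problems = check_dtypes(df_example, expected_types)
print(dtype_problems)

# A missing value turns an integer column into float64, which the check reports
with_gap = df_example.assign(age=[25.0, None, -1, 40])
gap_problems = check_dtypes(with_gap, expected_types)
print(gap_problems)
```

Keeping dtype checks separate from value checks makes the error messages easier to read: a wrong dtype is reported once per column instead of once per row.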


### Task 3: Detect Missing Data Patterns
**Description**: Analyze and visualize missing data patterns in a dataset.

**Steps**:
1. Visualize missing data using a heatmap.
2. Identify patterns in missing data.

In [None]:
# write your code from here
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample dataset with missing values
data = {
    'age': [25, None, 35, 40, None],
    'gender': ['M', 'F', None, 'F', 'M'],
    'income': [50000, 60000, None, None, 70000]
}
df = pd.DataFrame(data)

# Step 1: Visualize missing data using heatmap
plt.figure(figsize=(8, 4))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Data Heatmap')
plt.show()

# Step 2: Identify patterns in missing data
missing_summary = df.isnull().sum()
print("Missing Values per Column:\n", missing_summary)

# Optional: Visualize missing value counts as a bar plot
missing_summary.plot(kind='bar')
plt.title('Count of Missing Values per Column')
plt.ylabel('Number of Missing Values')
plt.show()
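
Per-column counts show how much is missing, but not whether columns tend to be missing together. The sketch below looks at row-level patterns and pairwise co-occurrence using only pandas. It rebuilds the same sample data under a different, hypothetical name (`df_missing`) so that it runs on its own.

```python
import pandas as pd

df_missing = pd.DataFrame({
    "age": [25, None, 35, 40, None],
    "gender": ["M", "F", None, "F", "M"],
    "income": [50000, 60000, None, None, 70000],
})

mask = df_missing.isnull()

# Share of missing values per column
missing_fraction = mask.mean()

# Number of missing fields in each row
missing_per_row = mask.sum(axis=1)

# How often each row-level combination of missing/present occurs
pattern_counts = mask.value_counts()

# Pairwise co-occurrence: rows in which both columns are missing
as_int = mask.astype(int)
co_missing = as_int.T @ as_int

print(missing_fraction)
print(missing_per_row)
print(pattern_counts)
print(co_missing)
```

In `co_missing`, the diagonal holds each column's own missing count, and an off-diagonal entry close to the diagonal values suggests two columns are usually missing in the same rows.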


### Task 4: Integrate Automated Data Quality Checks
**Description**: Integrate automated data quality checks using the Great Expectations library for a dataset.

**Steps**:
1. Install Great Expectations.
2. Initialize a Data Context.
3. Create an Expectation Suite and register the sample data as a batch.
4. Add expectations (checks).
5. Validate the data and review the results.

In [None]:
# write your code from here
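# Step 1: Install Great Expectations
# In a notebook cell, uncomment the next line; in a terminal, run it without the leading "!".
# !pip install great_expectations
# Note: the code below uses the legacy (pre-1.0) Great Expectations API.

# Step 2: Initialize Great Expectations and set up the context
import great_expectations as ge

# Load the Data Context from the current directory.
# This expects an existing Great Expectations project (a great_expectations/ folder).
context = ge.data_context.DataContext()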

# Step 3: Create an Expectation Suite
suite_name = "my_data_quality_suite"
suite = context.create_expectation_suite(suite_name, overwrite_existing=True)

# Build a small sample dataset in memory
import pandas as pd
df = pd.DataFrame({
    "age": [25, 30, 35, None, 40],
    "gender": ["M", "F", "F", "M", None],
    "income": [50000, 60000, None, 45000, 70000]
})

# Optional: also save the dataset to CSV.
# The batch request below passes the in-memory DataFrame, so this file is not read.
data_path = "sample_data.csv"
df.to_csv(data_path, index=False)

# Add a pandas datasource with a runtime data connector for in-memory DataFrames (only needed once)
datasource_config = {
    "name": "my_datasource",
    "class_name": "Datasource",
    "execution_engine": {"class_name": "PandasExecutionEngine"},
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
        }
    },
}
context.add_datasource(**datasource_config)

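# Create a runtime batch request that passes the in-memory DataFrame
# (get_validator expects a batch request object here, not a plain dict)
from great_expectations.core.batch import RuntimeBatchRequest

batch_request = RuntimeBatchRequest(
    datasource_name="my_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="sample_data_asset",
    runtime_parameters={"batch_data": df},
    batch_identifiers={"default_identifier_name": "default_identifier"},
)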

# Step 4: Add expectations (checks)
validator = context.get_validator(batch_request=batch_request, expectation_suite_name=suite_name)

# Check for no missing values in 'age' column
validator.expect_column_values_to_not_be_null("age")

# Check for 'gender' to have values in allowed set
validator.expect_column_values_to_be_in_set("gender", ["M", "F"])

# Check for 'income' to be non-null
validator.expect_column_values_to_not_be_null("income")

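# Save the expectation suite.
# By default, expectations that failed on this batch are discarded on save;
# keep them so the suite still contains the 'age' and 'income' null checks.
validator.save_expectation_suite(discard_failed_expectations=False)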

# Step 5: Validate the data and print results
results = validator.validate()
print(results)

# Optionally, review expectation suites and stored validation results in Data Docs,
# the HTML reports that Great Expectations can generate.
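
To see what the three expectations above assert, the same checks can be written in plain pandas. This is not the Great Expectations API, only an illustrative analogue. The helpers `not_null` and `in_set` and the frame name `df_checks` are hypothetical. Null values are skipped in the set check, on the assumption that this mirrors how Great Expectations treats nulls for value-level checks by default.

```python
import pandas as pd

df_checks = pd.DataFrame({
    "age": [25, 30, 35, None, 40],
    "gender": ["M", "F", "F", "M", None],
    "income": [50000, 60000, None, 45000, 70000],
})

def not_null(series):
    """Pass only if the column has no missing values."""
    failing = series.index[series.isnull()].tolist()
    return {"success": not failing, "failing_rows": failing}

def in_set(series, allowed):
    """Pass only if every non-null value is in the allowed set."""
    present = series.dropna()  # nulls are skipped (assumption, see above)
    failing = present.index[~present.isin(allowed)].tolist()
    return {"success": not failing, "failing_rows": failing}

report = {
    "age not null": not_null(df_checks["age"]),
    "gender in {M, F}": in_set(df_checks["gender"], {"M", "F"}),
    "income not null": not_null(df_checks["income"]),
}
for name, outcome in report.items():
    print(name, outcome)
```

With this sample data, the two null checks fail and the set check passes, which is a useful baseline to compare against the `results` object printed by the validator.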
