### Task 1: Data Profiling to Understand Data Quality
**Description**: Use basic statistical methods to profile a dataset and identify potential quality issues.

**Steps**:
1. Load the dataset using pandas in Python.
2. Understand the data by checking its basic statistics.
3. Identify null values.
4. Check unique values for categorical columns.
5. Review outliers using box plots.

In [1]:
# write your code from here
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: Load the dataset
# Replace 'data/sample_data.csv' with the path to your own file
df = pd.read_csv('data/sample_data.csv')

# Step 2: Basic statistics
print("=== Basic Statistical Summary ===")
print(df.describe(include='all'))

# Step 3: Identify null values
print("\n=== Null Values per Column ===")
print(df.isnull().sum())

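# Step 4: Unique values for categorical columns
print("\n=== Unique Values in Categorical Columns ===")
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
for col in categorical_cols:
    print(f"{col}: {df[col].nunique()} unique values")
    # Show the most frequent values (including NaN) to spot typos and inconsistent labels
    print(df[col].value_counts(dropna=False).head(10))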

# Step 5: Review outliers using box plots (only numeric columns)
numeric_cols = df.select_dtypes(include=np.number).columns

print("\n=== Generating Box Plots for Numeric Columns ===")
for col in numeric_cols:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=df[col])
    plt.title(f'Box plot of {col}')
    plt.grid(True)
    plt.show()


FileNotFoundError: [Errno 2] No such file or directory: 'data/sample_data.csv'
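
The cell above stops at the load step because `data/sample_data.csv` does not exist. As a minimal sketch, the same profiling steps can be tried on a small in-memory DataFrame. The column names and values below are hypothetical stand-ins, and the `iqr_outlier_counts` helper is an illustrative addition that applies the 1.5 × IQR rule a box plot uses for its whiskers, so the outlier check also yields numbers rather than only a figure.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; swap in pd.read_csv(...) once your file is available.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", None, "Mumbai", "Delhi"],
    "age": [25, 31, 29, 40, 27, 95],
    "income": [42000.0, 55000.0, np.nan, 61000.0, 48000.0, 52000.0],
})


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include=np.number)
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons against NaN are False, so missing values are never counted as outliers.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


null_counts = df.isnull().sum()
outlier_counts = iqr_outlier_counts(df)

print("Null values per column:")
print(null_counts)
print("\nIQR outliers per numeric column:")
print(outlier_counts)
```

Counting outliers this way is useful when a dataset has too many numeric columns to inspect one box plot at a time.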

### Task 2: Implement Simple Data Validation
**Description**: Write a Python script to validate the data types and constraints of each column in a dataset.

**Steps**:
1. Define constraints for each column.
2. Validate each column based on its constraints.

In [None]:
# write your code from here
import pandas as pd

# Sample dataset
data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, -1, 30],  # Invalid age: -1
    'email': ['alice@example.com', 'bob@example.com', 'not-an-email'],
    'salary': [50000.0, 60000.0, 'unknown']  # Invalid salary: string
}

df = pd.DataFrame(data)

# Step 1: Define column constraints
constraints = {
    'name': {'type': str},
    'age': {'type': int, 'min': 0, 'max': 120},
    'email': {'type': str, 'pattern': '@'},  # simple check for '@' symbol
    'salary': {'type': float}
}

# Step 2: Validate each column
def validate_column(df, column, rules):
    issues = []
    for i, val in enumerate(df[column]):
        # Type check (note: NaN is a float, so missing values fail a str or int check)
        if not isinstance(val, rules['type']):
            issues.append((i, f"Type mismatch: expected {rules['type'].__name__}, got {type(val).__name__}"))
            continue

        # Numeric range check
        if 'min' in rules and val < rules['min']:
            issues.append((i, f"Below minimum: {val} < {rules['min']}"))
        if 'max' in rules and val > rules['max']:
            issues.append((i, f"Above maximum: {val} > {rules['max']}"))

        # Simple pattern check (e.g., for email)
        if 'pattern' in rules and rules['pattern'] not in str(val):
            issues.append((i, f"Pattern missing '{rules['pattern']}': {val}"))

    return issues

# Step 3: Run validation
validation_report = {}
for col, rule in constraints.items():
    issues = validate_column(df, col, rule)
    if issues:
        validation_report[col] = issues

# Step 4: Print report
if validation_report:
    print("Validation Issues Found:")
    for col, issues in validation_report.items():
        for row, issue in issues:
            print(f"- Row {row}, Column '{col}': {issue}")
else:
    print("All data passed validation.")
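
The row-by-row loop above is easy to follow but can be slow on large frames. As an alternative sketch, the same three rules can be written as vectorized boolean masks. It reuses the sample data from the cell above; the mask names (`age_out_of_range` and so on) are illustrative choices, not part of the original script.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, -1, 30],
    "email": ["alice@example.com", "bob@example.com", "not-an-email"],
    "salary": [50000.0, 60000.0, "unknown"],
})

# Coerce salary to numeric: values that cannot be parsed become NaN.
# Genuinely missing salaries would be flagged by this rule as well.
salary_numeric = pd.to_numeric(df["salary"], errors="coerce")

# One boolean column per rule; True marks a row that violates the rule.
violations = pd.DataFrame({
    "age_out_of_range": ~df["age"].between(0, 120),
    "email_missing_at": ~df["email"].astype(str).str.contains("@", regex=False),
    "salary_not_numeric": salary_numeric.isna(),
})

print("Violations per rule:")
print(violations.sum())
print("\nRows with at least one violation:")
print(df[violations.any(axis=1)])
```

Keeping one mask per rule makes it straightforward to count failures per rule or to filter the offending rows for review.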


### Task 3: Detect Missing Data Patterns
**Description**: Analyze and visualize missing data patterns in a dataset.

**Steps**:
1. Visualize missing data using a heatmap.
2. Identify patterns in missing data.

In [None]:
# write your code from here
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Optional: install missingno if not already installed
# !pip install missingno

import missingno as msno

# Step 1: Load your dataset
df = pd.read_csv('your_data.csv')  # Replace with your actual file path

# Step 2: Visualize missing data with a heatmap
print("=== Missing Data Heatmap ===")
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Data Heatmap")
plt.show()

# Step 3: Use missingno to detect patterns more clearly
print("=== Missing Data Matrix ===")
msno.matrix(df)
plt.show()

print("=== Missing Data Bar Chart ===")
msno.bar(df)
plt.show()

print("=== Missing Data Correlation Heatmap ===")
msno.heatmap(df)
plt.show()

# Step 4: Optional - Identify exact patterns
print("\n=== Rows with Multiple Missing Values ===")
print(df[df.isnull().sum(axis=1) > 1])
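
The plots above show patterns visually. A numeric summary can back them up. The sketch below uses only pandas on a hypothetical DataFrame in which `phone` and `address` tend to be missing together; all column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'phone' and 'address' are often missing in the same rows.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "phone": ["555-0100", None, "555-0102", None, "555-0104", None],
    "address": ["12 Elm St", None, "9 Oak Ave", None, "3 Pine Rd", "7 Ash Ln"],
    "score": [0.5, 0.7, np.nan, 0.9, 0.4, 0.8],
})

mask = df.isnull()

# Share of missing values per column.
missing_share = mask.mean()

# How often each distinct row-level missingness pattern occurs.
pattern_counts = mask.value_counts()

# Correlation between missingness indicators. Columns without any missing
# values are dropped because a constant indicator has no defined correlation.
cols_with_gaps = mask.columns[mask.any()]
missing_corr = mask[cols_with_gaps].astype(int).corr()

print("Share missing per column:")
print(missing_share)
print("\nMissingness patterns (True = missing):")
print(pattern_counts)
print("\nCorrelation of missingness:")
print(missing_corr)
```

A high positive correlation between two indicators suggests the columns go missing together, which often points to a shared upstream cause such as an optional form section.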


### Task 4: Integrate Automated Data Quality Checks
**Description**: Integrate automated data quality checks using the Great Expectations library for a dataset.

**Steps**:
1. Install and initialize Great Expectations.
2. Set up Great Expectations for the dataset and define initial expectations.
3. Add further checks and validate.

In [None]:
# write your code from here
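
The cell above is still empty. As a starting point, here is a minimal sketch of expectation-style checks written in plain pandas. It is not the Great Expectations API: that library's interface differs between releases, so check its documentation for the version you install. The helper names (`expect_not_null`, `expect_unique`, `expect_between`, `expect_in_set`) and the sample data are hypothetical and only illustrate the idea of declaring a check and getting a pass/fail result.

```python
import pandas as pd

# Hypothetical data; column names and values are illustrative only.
df = pd.DataFrame({
    "order_id": [101, 102, 103, 103],
    "quantity": [2, 0, 5, 1],
    "status": ["shipped", "pending", "lost", "shipped"],
})


def expect_not_null(frame, column):
    """Pass when the column has no missing values."""
    return bool(frame[column].notna().all())


def expect_unique(frame, column):
    """Pass when the column has no duplicate values."""
    return bool(frame[column].is_unique)


def expect_between(frame, column, low, high):
    """Pass when every value lies in the inclusive range [low, high]."""
    return bool(frame[column].between(low, high).all())


def expect_in_set(frame, column, allowed):
    """Pass when every value belongs to the allowed set."""
    return bool(frame[column].isin(allowed).all())


checks = {
    "order_id is not null": expect_not_null(df, "order_id"),
    "order_id is unique": expect_unique(df, "order_id"),
    "quantity between 1 and 100": expect_between(df, "quantity", 1, 100),
    "status in allowed set": expect_in_set(df, "status", {"shipped", "pending", "delivered"}),
}

for name, passed in checks.items():
    print(f"{'PASS' if passed else 'FAIL'}: {name}")

all_passed = all(checks.values())
print(f"\nAll checks passed: {all_passed}")
```

Once these checks behave as intended, each one can be translated into the corresponding expectation in Great Expectations and run as part of an automated validation step.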