## Check Accuracy & Completeness

**Objective**: Learn to assess data quality by checking for accuracy and completeness using Python.

This exercise uses a sample dataset, `students.csv`, with the following
columns: `ID`, `Name`, `Age`, `Grade`, `Email`.

**Steps**:
1. Check Accuracy
    - Verify Numerical Data Accuracy
    - Validate Email Format
    - Integer Accuracy Check for Age
2. Check Completeness
    - Identify Missing Values
    - List Rows with Missing Data
    - Column-Specific Missing Value Check

In [3]:
import pandas as pd
import numpy as np
import re

# ------------------------------
# Simulate students.csv data
# ------------------------------
data = {
    'ID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', None, 'David', 'Eva'],
    'Age': [20, 21, np.nan, 19, 'twenty-two'],
    'Grade': [85.5, 92.0, 88.0, None, 101],
    'Email': ['alice@example.com', 'bob@example', 'charlie@example.com', '', None]
}

df = pd.DataFrame(data)
print("=== DATAFRAME PREVIEW ===")
print(df)

# ------------------------------
# Functions for Validation
# ------------------------------
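def validate_age(age):
    """Return True if age is a whole number strictly between 0 and 120."""
    try:
        age = float(age)
    except (ValueError, TypeError):
        # None and non-numeric strings such as 'twenty-two' end up here
        return False
    # NaN is not an integer, so missing ages are flagged as invalid
    return age.is_integer() and 0 < age < 120

def validate_grade(grade):
    """Return True if grade is numeric and within the range 0 to 100."""
    try:
        grade = float(grade)
    except (ValueError, TypeError):
        return False
    # Comparisons with NaN are always False, so missing grades are invalid
    return 0 <= grade <= 100

def validate_email(email):
    """Return True if email looks like name@domain.tld (format check only)."""
    if not isinstance(email, str) or not email:
        return False
    # fullmatch requires the whole string to match, so no ^/$ anchors are needed
    email_pattern = r'[\w\.-]+@[\w\.-]+\.\w+'
    return re.fullmatch(email_pattern, email) is not None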

# ------------------------------
# Apply Validations
# ------------------------------
df['age_valid'] = df['Age'].apply(validate_age)
df['grade_valid'] = df['Grade'].apply(validate_grade)
df['email_valid'] = df['Email'].apply(validate_email)

# ------------------------------
# Completeness Check
# ------------------------------
missing_by_column = df.isnull().sum()
rows_with_missing = df[df.isnull().any(axis=1)]

# ------------------------------
# Output Section
# ------------------------------
print("\n=== ACCURACY CHECKS ===")
print(df[['Age', 'age_valid']])
print(df[['Grade', 'grade_valid']])
print(df[['Email', 'email_valid']])

print("\n=== COMPLETENESS CHECKS ===")
print("Missing values per column:")
print(missing_by_column)

print("\nRows with missing data:")
print(rows_with_missing)

# ------------------------------
# Problematic Rows Summary
# ------------------------------
invalid_rows = df[~(df['age_valid'] & df['grade_valid'] & df['email_valid'])]
print("\n=== INVALID DATA ROWS (Accuracy Issues) ===")
print(invalid_rows)

clean_rows = df[(df['age_valid'] & df['grade_valid'] & df['email_valid']) & (~df.isnull().any(axis=1))]
print("\n=== CLEAN DATA ROWS ===")
print(clean_rows)

=== DATAFRAME PREVIEW ===
   ID   Name         Age  Grade                Email
0   1  Alice          20   85.5    alice@example.com
1   2    Bob          21   92.0          bob@example
2   3   None         NaN   88.0  charlie@example.com
3   4  David          19    NaN                     
4   5    Eva  twenty-two  101.0                 None

=== ACCURACY CHECKS ===
          Age  age_valid
0          20       True
1          21       True
2         NaN      False
3          19       True
4  twenty-two      False
   Grade  grade_valid
0   85.5         True
1   92.0         True
2   88.0         True
3    NaN        False
4  101.0        False
                 Email  email_valid
0    alice@example.com         True
1          bob@example        False
2  charlie@example.com         True
3                             False
4                 None        False

=== COMPLETENESS CHECKS ===
Missing values per column:
ID             0
Name           1
Age            1
Grade          1
Email    