### Machine Learning for Data Quality Prediction
**Description**: Use a machine learning model to predict data quality issues.

**Steps**:
1. Create a mock dataset with feature columns and a binary label (0 = good, 1 = quality issue).
2. Train a machine learning model.
3. Evaluate the model performance.

In [3]:
# write your code from here
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# 1. Create mock dataset with some features
np.random.seed(42)
n_samples = 200

data = {
    'missing_values_ratio': np.random.rand(n_samples),         # fraction of nulls in the row (0 to 1)
    'outlier_score': np.random.rand(n_samples) * 5,            # e.g. some anomaly detection score
    'duplicate_flag': np.random.randint(0, 2, size=n_samples), # binary: is_duplicate
    'value_length_avg': np.random.normal(loc=10, scale=3, size=n_samples), # average text field length
    'special_char_ratio': np.random.rand(n_samples),           # proportion of special characters
    # NOTE: labels are drawn independently of the features above, so there is no real signal to learn
    'quality_label': np.random.choice([0, 1], size=n_samples, p=[0.7, 0.3])  # 0 = good, 1 = issue
}

df = pd.DataFrame(data)

# 2. Features and label
X = df.drop('quality_label', axis=1)
y = df['quality_label']

# 3. Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# 4. Train ML model (Random Forest)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 5. Evaluate model
y_pred = model.predict(X_test)

print("\n✅ Classification Report:\n")
print(classification_report(y_test, y_pred))

print("🧩 Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))



✅ Classification Report:

              precision    recall  f1-score   support

           0       0.65      0.85      0.73        26
           1       0.33      0.14      0.20        14

    accuracy                           0.60        40
   macro avg       0.49      0.49      0.47        40
weighted avg       0.54      0.60      0.55        40

🧩 Confusion Matrix:
[[22  4]
 [12  2]]
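
The numbers in the classification report can be re-derived by hand from the confusion matrix, which helps when reading the result. The sketch below copies the matrix printed above and assumes scikit-learn's convention (rows are true classes, columns are predicted classes). The variable names are illustrative only.

```python
import numpy as np

# Confusion matrix copied from the output above.
# scikit-learn convention: rows = true class, columns = predicted class.
cm = np.array([[22, 4],
               [12, 2]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

precision_issue = tp / (tp + fp)   # 2 / 6
recall_issue = tp / (tp + fn)      # 2 / 14
accuracy = (tp + tn) / cm.sum()    # 24 / 40

# Accuracy of always predicting the majority class (0 = good).
majority_baseline = cm[0].sum() / cm.sum()   # 26 / 40

print(f"precision (class 1): {precision_issue:.2f}")
print(f"recall    (class 1): {recall_issue:.2f}")
print(f"accuracy:            {accuracy:.2f}")
print(f"majority baseline:   {majority_baseline:.2f}")
```

The accuracy of 0.60 is below the 0.65 that a constant "good" prediction would reach on this test split, and only 2 of the 14 real issues are caught. That is consistent with a mock label that was sampled at random rather than derived from the features.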

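
For the exercise to show a model actually learning something, the mock label needs to depend on the features. The following is a minimal sketch under stated assumptions, not the notebook's original method: the labelling rule (thresholds of 0.6 and 4.0, plus 5% label noise) is hypothetical, the sample size is increased to 2000, and a `DummyClassifier` is added as a majority-class baseline for comparison.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_samples = 2000

# Mock features, reusing three column names from the notebook.
df = pd.DataFrame({
    "missing_values_ratio": rng.random(n_samples),
    "outlier_score": rng.random(n_samples) * 5,
    "duplicate_flag": rng.integers(0, 2, size=n_samples),
})

# Hypothetical labelling rule (an assumption for illustration only):
# a row has a quality issue if it is mostly missing or a strong outlier.
rule = (df["missing_values_ratio"] > 0.6) | (df["outlier_score"] > 4.0)

# Flip about 5% of the labels so the problem is not perfectly separable.
noise = rng.random(n_samples) < 0.05
y = (rule ^ noise).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline: always predict the most frequent class in the training set.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
model_acc = accuracy_score(y_test, model.predict(X_test))

print(f"baseline accuracy: {baseline_acc:.2f}")
print(f"model accuracy:    {model_acc:.2f}")
```

Comparing against the baseline makes it easier to tell whether a reported accuracy reflects learned structure or just the class balance.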