### Machine Learning for Data Quality Prediction
**Description**: Train a classification model to predict whether a record has a data quality issue, based on per-record quality metrics.

**Steps**:
1. Create a mock dataset with feature columns and a binary label (`quality_issue`: 0 = good, 1 = issue).
2. Train a machine learning model on the training split.
3. Evaluate the model's performance on the held-out test split.

In [1]:
import pandas as pd
import numpy as np

# Seed for reproducibility
np.random.seed(42)

# Generate a mock dataset
n_samples = 1000

# Features
completeness = np.random.uniform(0, 1, n_samples)  # Completeness score as a fraction in [0, 1)
consistency = np.random.uniform(0, 1, n_samples)  # Consistency score (0 = inconsistent, 1 = consistent)
accuracy = np.random.uniform(0, 1, n_samples)  # Accuracy score as a fraction in [0, 1)
duplicates = np.random.randint(0, 2, n_samples)  # Binary duplicate flag (0 = no, 1 = yes)

# Label: 0 = good data quality, 1 = data quality issue
quality_issue = (completeness < 0.7) | (consistency < 0.5) | (accuracy < 0.6) | (duplicates == 1)

# Create a DataFrame
df = pd.DataFrame({
    'completeness': completeness,
    'consistency': consistency,
    'accuracy': accuracy,
    'duplicates': duplicates,
    'quality_issue': quality_issue.astype(int)  # 0 = good quality, 1 = issue
})

print(df.head())
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Split data into features and label
X = df.drop('quality_issue', axis=1)
y = df['quality_issue']

# Split the dataset into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the logistic regression model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)
# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
# Use a separate name so the `accuracy` feature array defined earlier is not overwritten
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {test_accuracy:.2f}")

# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


   completeness  consistency  accuracy  duplicates  quality_issue
0      0.374540     0.185133  0.261706           0              1
1      0.950714     0.541901  0.246979           0              1
2      0.731994     0.872946  0.906255           0              0
3      0.598658     0.732225  0.249546           1              1
4      0.156019     0.806561  0.271950           0              1
Model Accuracy: 0.95

Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        10
           1       0.95      1.00      0.97       190

    accuracy                           0.95       200
   macro avg       0.47      0.50      0.49       200
weighted avg       0.90      0.95      0.93       200
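The report above deserves a second look: class 0 has only 10 of the 200 test rows, and its precision and recall are both 0.00, so the 0.95 accuracy matches what a model that always predicts "issue" would score. The sketch below is one way to make that visible. It rebuilds the same mock dataset and compares the logistic regression against a majority-class baseline. `DummyClassifier`, `balanced_accuracy_score`, and `confusion_matrix` are scikit-learn utilities that the original cell does not use, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Rebuild the mock dataset with the same seed and the same draw order as the cell above.
np.random.seed(42)
n_samples = 1000
completeness = np.random.uniform(0, 1, n_samples)
consistency = np.random.uniform(0, 1, n_samples)
accuracy_feature = np.random.uniform(0, 1, n_samples)
duplicates = np.random.randint(0, 2, n_samples)
quality_issue = (
    (completeness < 0.7) | (consistency < 0.5) | (accuracy_feature < 0.6) | (duplicates == 1)
)

df = pd.DataFrame({
    'completeness': completeness,
    'consistency': consistency,
    'accuracy': accuracy_feature,
    'duplicates': duplicates,
    'quality_issue': quality_issue.astype(int),
})

X = df.drop(columns='quality_issue')
y = df['quality_issue']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Share of each class in the full dataset: this shows how rare "good" rows are.
print(y.value_counts(normalize=True))

# Baseline that always predicts the most frequent training label.
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# Same default logistic regression as in the cell above.
model = LogisticRegression().fit(X_train, y_train)
model_pred = model.predict(X_test)
model_acc = accuracy_score(y_test, model_pred)

# Balanced accuracy averages per-class recall, so ignoring a class is penalised.
model_bal_acc = balanced_accuracy_score(y_test, model_pred)

# Rows are true labels (0, 1); columns are predicted labels (0, 1).
cm = confusion_matrix(y_test, model_pred, labels=[0, 1])

print(f"Baseline accuracy:       {baseline_acc:.2f}")
print(f"Model accuracy:          {model_acc:.2f}")
print(f"Model balanced accuracy: {model_bal_acc:.2f}")
print(cm)
```

If the model's accuracy is no better than the baseline's and the first column of the confusion matrix is empty, the model has not learned to recognise good rows, whatever the headline accuracy says.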

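Because the label rule marks a row as good only when all four conditions hold at once, good rows are rare, and a default classifier can score well on accuracy while never predicting class 0. One possible variation, which is not part of the original exercise, is to stratify the split and reweight the classes. The sketch below assumes scikit-learn's `class_weight='balanced'` option and also tries a shallow `DecisionTreeClassifier`, since a conjunction of threshold rules is the kind of pattern a tree can represent. The model names and the `max_depth=4` setting are illustrative choices, not tuned values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Rebuild the mock features with the same seed and draw order as the original cell.
np.random.seed(42)
n_samples = 1000
features = pd.DataFrame({
    'completeness': np.random.uniform(0, 1, n_samples),
    'consistency': np.random.uniform(0, 1, n_samples),
    'accuracy': np.random.uniform(0, 1, n_samples),
    'duplicates': np.random.randint(0, 2, n_samples),
})

# Same labelling rule as the original: any failing check flags the row as an issue (1).
labels = (
    (features['completeness'] < 0.7)
    | (features['consistency'] < 0.5)
    | (features['accuracy'] < 0.6)
    | (features['duplicates'] == 1)
).astype(int)

# Stratify so the rare "good" class is represented proportionally in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# Illustrative candidates; class_weight='balanced' upweights the rare class during training.
candidates = {
    'logreg_balanced': LogisticRegression(class_weight='balanced', max_iter=1000),
    'decision_tree': DecisionTreeClassifier(
        max_depth=4, class_weight='balanced', random_state=42
    ),
}

results = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    results[name] = {
        # Mean of per-class recall: a model that ignores class 0 scores 0.5 here.
        'balanced_accuracy': balanced_accuracy_score(y_test, pred),
        # Recall on the rare "good" class (label 0).
        'recall_good': recall_score(y_test, pred, pos_label=0),
    }
    print(name, results[name])
```

Comparing balanced accuracy and class 0 recall across the candidates gives a fairer picture than plain accuracy on data this imbalanced. With so few good rows in the test split, the scores will vary noticeably from one random split to another.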
