### AI/ML – Improving Model Performance with Clean Data

**Task 1**: Data Preprocessing for Models

**Objective**: Enhance data quality for better AI/ML outcomes.

**Steps**:
1. Choose a dataset for training an AI/ML model.
2. Identify common data issues such as null values, redundant features, or noisy data.
3. Apply preprocessing methods such as imputation, normalization, or feature engineering.
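
Step 2 can be done programmatically rather than by eye. The following is a minimal sketch, assuming a pandas DataFrame as input; the helper name `summarize_issues` and the `sample` frame are hypothetical and only illustrate the idea of counting missing values and flagging constant (zero-variance) columns.

```python
import pandas as pd

def summarize_issues(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its missing count and a constant-column flag."""
    n_unique = df.nunique(dropna=True)
    return pd.DataFrame({
        "n_missing": df.isna().sum(),   # None and NaN both count as missing
        "n_unique": n_unique,           # distinct non-missing values
        "is_constant": n_unique <= 1,   # no variance, so likely redundant
    })

# Hypothetical sample data, not taken from the exercise
sample = pd.DataFrame({
    "age": [25, 30, None, 22],
    "city": ["A", None, "B", "A"],
    "flag": [1, 1, 1, 1],
})

report = summarize_issues(sample)
print(report)
```

Columns flagged as constant are candidates for dropping, and columns with a non-zero missing count are candidates for imputation.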

In [None]:
# Write your code from here
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Sample dataset with common data issues
data = {
    'age': [25, 30, None, 22, 40, None, 35],
    'income': [50000, 60000, 55000, None, 80000, 75000, None],
    'gender': ['M', 'F', 'F', 'M', 'M', 'F', 'M'],
    'education_level': ['Bachelors', 'Masters', 'PhD', 'Bachelors', None, 'Masters', 'PhD'],
    'redundant_feature': [1, 1, 1, 1, 1, 1, 1]  # This feature has no variance and is redundant
}

df = pd.DataFrame(data)

# Drop redundant feature
df = df.drop(columns=['redundant_feature'])

# Handle missing numerical values with mean imputation
num_cols = ['age', 'income']
imputer = SimpleImputer(strategy='mean')
df[num_cols] = imputer.fit_transform(df[num_cols])

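# Handle missing categorical values with most frequent imputation
# (None is converted to NaN first: SimpleImputer looks for np.nan by default,
# so a Python None in an object column may otherwise be left unimputed)
cat_cols = ['education_level']
cat_imputer = SimpleImputer(strategy='most_frequent')
df[cat_cols] = cat_imputer.fit_transform(df[cat_cols].fillna(value=float('nan')))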

# Encode categorical variables (one-hot encoding for simplicity)
df = pd.get_dummies(df, columns=['gender', 'education_level'], drop_first=True)

# Normalize numerical features
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

print(df)
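
The cell above covers imputation, encoding, and scaling, but not the feature engineering mentioned in step 3. As a rough illustration only, the sketch below derives two new columns from existing ones; the `people` frame, the bin edges, and the labels are assumptions chosen for the example, not part of the exercise.

```python
import pandas as pd

# Hypothetical data with no missing values, so the example stays focused on new features
people = pd.DataFrame({
    "age": [22, 25, 30, 35, 40],
    "income": [48000, 50000, 60000, 70000, 80000],
})

# Bin a continuous feature into ordered groups; intervals are right-inclusive by default
people["age_group"] = pd.cut(
    people["age"],
    bins=[0, 25, 35, 120],
    labels=["young", "mid", "senior"],
)

# Rescale a feature into a more readable unit (thousands)
people["income_k"] = people["income"] / 1000

print(people)
```

Whether such derived features help depends on the model and the data, so they are best checked with the same evaluation used for the other preprocessing steps.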


**Task 2**: Evaluate Model Performance

**Objective**: Assess the impact of data quality improvements on model performance.

**Steps**:
1. Train a simple ML model with and without preprocessing.
2. Analyze and compare model performance metrics to evaluate the impact of data quality strategies.
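
For step 2, accuracy alone can hide differences between models, so it may help to compare several metrics side by side. The sketch below assumes a binary target; the helper `metrics_row` and the prediction lists are hypothetical and exist only to show the comparison.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def metrics_row(y_true, y_pred):
    """Collect common binary-classification metrics in one dictionary."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }

# Hypothetical labels and predictions from two models
y_true = [0, 1, 1, 0, 1]
y_pred_a = [0, 1, 0, 0, 1]  # misses one positive
y_pred_b = [1, 1, 1, 0, 1]  # one false positive

row_a = metrics_row(y_true, y_pred_a)
row_b = metrics_row(y_true, y_pred_b)
for name, row in [("model A", row_a), ("model B", row_b)]:
    print(name, {k: round(v, 3) for k, v in row.items()})
```

In this toy case both models have the same accuracy, while precision and recall show that they make different kinds of errors.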

In [None]:
# Write your code from here
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample dataset with missing and noisy data
data = {
    'age': [25, 30, None, 22, 40, None, 35, 28, 32, 31],
    'income': [50000, 60000, 55000, None, 80000, 75000, None, 62000, 58000, 61000],
    'gender': ['M', 'F', 'F', 'M', 'M', 'F', 'M', 'F', 'F', 'M'],
    'education_level': ['Bachelors', 'Masters', 'PhD', 'Bachelors', None, 'Masters', 'PhD', 'Bachelors', 'Masters', 'PhD'],
    'target': [0, 1, 0, 1, 0, 1, 0, 1, 1, 0]  # Binary classification target
}

df = pd.DataFrame(data)

X = df.drop(columns=['target'])
y = df['target']

# Split original data (without preprocessing)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Model training and evaluation on raw data (handle missing by dropping rows here for simplicity)
X_train_raw_clean = X_train_raw.dropna()
y_train_raw_clean = y_train.loc[X_train_raw_clean.index]
X_test_raw_clean = X_test_raw.dropna()
y_test_raw_clean = y_test.loc[X_test_raw_clean.index]

# Convert categorical to dummy variables for raw data
X_train_raw_enc = pd.get_dummies(X_train_raw_clean, drop_first=True)
# Encode the test set without drop_first: dropping its first category independently
# could remove a column that the training encoding kept
X_test_raw_enc = pd.get_dummies(X_test_raw_clean)

# Align columns to the training encoding (categories missing in test become 0)
X_test_raw_enc = X_test_raw_enc.reindex(columns=X_train_raw_enc.columns, fill_value=0)

model_raw = LogisticRegression(max_iter=1000)
model_raw.fit(X_train_raw_enc, y_train_raw_clean)
y_pred_raw = model_raw.predict(X_test_raw_enc)
accuracy_raw = accuracy_score(y_test_raw_clean, y_pred_raw)

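# Preprocessing pipeline: split first, then fit every transformer on the training
# split only, so that no statistics from the test rows leak into training
X_train_proc, X_test_proc, y_train_proc, y_test_proc = train_test_split(X, y, test_size=0.3, random_state=42)
X_train_proc, X_test_proc = X_train_proc.copy(), X_test_proc.copy()

# Impute numerical columns
num_cols = ['age', 'income']
imputer_num = SimpleImputer(strategy='mean')
X_train_proc[num_cols] = imputer_num.fit_transform(X_train_proc[num_cols])
X_test_proc[num_cols] = imputer_num.transform(X_test_proc[num_cols])

# Impute categorical column (None is converted to NaN so SimpleImputer treats it as missing)
cat_cols = ['education_level']
imputer_cat = SimpleImputer(strategy='most_frequent')
X_train_proc[cat_cols] = imputer_cat.fit_transform(X_train_proc[cat_cols].fillna(value=float('nan')))
X_test_proc[cat_cols] = imputer_cat.transform(X_test_proc[cat_cols].fillna(value=float('nan')))

# Encode categorical variables, then align the test columns to the training columns
X_train_proc = pd.get_dummies(X_train_proc, columns=['gender', 'education_level'], drop_first=True)
X_test_proc = pd.get_dummies(X_test_proc, columns=['gender', 'education_level'])
X_test_proc = X_test_proc.reindex(columns=X_train_proc.columns, fill_value=0)

# Scale numerical columns
scaler = StandardScaler()
X_train_proc[num_cols] = scaler.fit_transform(X_train_proc[num_cols])
X_test_proc[num_cols] = scaler.transform(X_test_proc[num_cols])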

model_proc = LogisticRegression(max_iter=1000)
model_proc.fit(X_train_proc, y_train_proc)
y_pred_proc = model_proc.predict(X_test_proc)
accuracy_proc = accuracy_score(y_test_proc, y_pred_proc)

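# The raw baseline is scored only on test rows without missing values,
# so the two test sets can differ in size
print(f"Accuracy without preprocessing: {accuracy_raw:.3f} (n_test={len(y_test_raw_clean)})")
print(f"Accuracy with preprocessing: {accuracy_proc:.3f} (n_test={len(y_test_proc)})")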
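
With so few rows, a single train/test split gives a very noisy accuracy estimate. One possible alternative, sketched below, bundles the same preprocessing steps into a scikit-learn `Pipeline` with a `ColumnTransformer` and scores it with cross-validation, so the imputers and scaler are refit on each training fold. The `toy` frame reuses the values from the cell above with `np.nan` as the missing marker; the choice of `cv=3` and the step names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "age": [25, 30, np.nan, 22, 40, np.nan, 35, 28, 32, 31],
    "income": [50000, 60000, 55000, np.nan, 80000, 75000, np.nan, 62000, 58000, 61000],
    "gender": ["M", "F", "F", "M", "M", "F", "M", "F", "F", "M"],
    "education_level": ["Bachelors", "Masters", "PhD", "Bachelors", np.nan,
                        "Masters", "PhD", "Bachelors", "Masters", "PhD"],
    "target": [0, 1, 0, 1, 0, 1, 0, 1, 1, 0],
})
X_toy = toy.drop(columns=["target"])
y_toy = toy["target"]

# Numeric columns: mean imputation followed by standardization
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
# Categorical columns: most-frequent imputation followed by one-hot encoding;
# categories unseen during fitting are encoded as all zeros
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["gender", "education_level"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each fold fits the preprocessing on its own training part only
scores = cross_val_score(model, X_toy, y_toy, cv=3)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Even with cross-validation, ten rows are too few to draw conclusions; the value of the pattern is that it scales unchanged to a real dataset.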