# Classification Issues

This session covers five common problems in classification models. Each section describes the issue, its impact, and possible solutions, followed by a code example.

## 1. Imbalanced Datasets
**Issue:** When one class has significantly more data points than the others, the model may become biased toward the majority class.

**Impact:** The classifier may perform well on the majority class but fail to accurately predict the minority class, resulting in misleading accuracy scores.

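**Solution:** Resampling techniques such as oversampling (e.g., SMOTE) or undersampling rebalance the training data. Evaluate with metrics such as precision, recall, and F1-score, which expose poor minority-class performance that overall accuracy hides.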

In [None]:
# Example: Handling Imbalanced Datasets using SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

# Create an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, weights=[0.9, 0.1], random_state=42)

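# Train/test split; stratify=y keeps the 90/10 class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# Train a random forest on the original, imbalanced training set
clf = RandomForestClassifier(random_state=42)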
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print('Classification report without balancing:')
print(classification_report(y_test, y_pred))

# Apply SMOTE for balancing the dataset
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)

# Train the model on the balanced training data (the test set is left untouched)
clf_balanced = RandomForestClassifier(random_state=42)
clf_balanced.fit(X_res, y_res)
y_pred_balanced = clf_balanced.predict(X_test)

print('Classification report after balancing with SMOTE:')
print(classification_report(y_test, y_pred_balanced))
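
To make the "misleading accuracy" point concrete, the sketch below scores a baseline that always predicts the majority class. It is an illustration only: the dataset is regenerated with the same settings as above, and the names `X_imb`, `y_imb`, and `baseline` are introduced here for the example.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Same imbalanced setup as above (roughly 90% class 0, 10% class 1)
X_imb, y_imb = make_classification(n_samples=1000, n_features=20, n_classes=2,
                                   weights=[0.9, 0.1], random_state=42)
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    X_imb, y_imb, test_size=0.3, stratify=y_imb, random_state=42)

# A baseline that ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(Xi_train, yi_train)
y_base = baseline.predict(Xi_test)

baseline_accuracy = accuracy_score(yi_test, y_base)
baseline_minority_recall = recall_score(yi_test, y_base, pos_label=1)

print('Baseline accuracy:', baseline_accuracy)
print('Baseline recall on the minority class:', baseline_minority_recall)
```

The baseline's accuracy is high even though it never identifies a single minority-class sample, which is why recall and F1-score are the more informative numbers here.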

## 2. Overfitting
**Issue:** The model performs well on the training data but fails to generalize to unseen data, as it learns noise or patterns specific to the training set.

**Impact:** Training metrics overstate how well the model will do in practice, so performance drops once the model is applied to new data.

**Solution:** Regularization techniques (L1/L2), cross-validation, pruning (in decision trees), and reducing model complexity can mitigate overfitting.

In [None]:
# Example: Logistic Regression with Regularization
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

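# Logistic Regression with L2 regularization
# (the strength is set by C, which defaults to 1.0; smaller C means stronger regularization)
model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=200)

# 5-fold cross-validation estimates how well the model generalizes to held-out data
cv_scores = cross_val_score(model, X, y, cv=5)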

print('Cross-validation scores with L2 regularization:', cv_scores)
print('Mean CV score:', cv_scores.mean())
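
Cross-validation scores alone do not show whether a model overfits; the usual signal is a gap between training and validation scores. The sketch below compares the two for a few regularization strengths. The `C` values are arbitrary choices for illustration, and `X_iris`, `y_iris`, and `scores` are names introduced here.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X_iris, y_iris = load_iris(return_X_y=True)

# Map each C value to (mean training score, mean validation score)
scores = {}
for C in (0.01, 1.0, 100.0):
    # Smaller C means stronger L2 regularization
    model = LogisticRegression(C=C, max_iter=5000)
    result = cross_validate(model, X_iris, y_iris, cv=5, return_train_score=True)
    scores[C] = (result['train_score'].mean(), result['test_score'].mean())

for C, (train_mean, val_mean) in scores.items():
    print(f'C={C}: train={train_mean:.3f}, validation={val_mean:.3f}, '
          f'gap={train_mean - val_mean:.3f}')
```

A training score well above the validation score suggests overfitting. On a small, clean dataset such as iris the gap is likely to be small.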

## 3. Underfitting
**Issue:** The model is too simple to capture the underlying patterns in the data.

**Impact:** Both training and test accuracy are low, and the model fails to provide useful predictions.

**Solution:** Increase the complexity of the model, add more features, or use more advanced algorithms to better capture the data's complexity.

In [None]:
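# Example: Underfitting with a shallow decision tree
# Note: X_train/X_test come from the make_classification split in Section 1
from sklearn.tree import DecisionTreeClassifier

# A depth-1 tree (a "decision stump") makes only one split, so it is likely to underfit
dt_model = DecisionTreeClassifier(max_depth=1, random_state=42)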
dt_model.fit(X_train, y_train)

print('Training accuracy:', dt_model.score(X_train, y_train))
print('Test accuracy:', dt_model.score(X_test, y_test))

# Now increase the model complexity by increasing max_depth
dt_model_complex = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_model_complex.fit(X_train, y_train)

print('Training accuracy (complex model):', dt_model_complex.score(X_train, y_train))
print('Test accuracy (complex model):', dt_model_complex.score(X_test, y_test))
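
Rather than comparing only two depths, one option is to sweep `max_depth` and look at training and validation accuracy together. This is a minimal sketch: the synthetic dataset, the depth grid, and the names `X_demo`, `y_demo`, and `depths` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with a handful of informative features (illustrative settings)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     n_informative=5, random_state=42)

depths = np.array([1, 2, 4, 8, 16])
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42), X_demo, y_demo,
    param_name='max_depth', param_range=depths, cv=5)

# Average over the 5 cross-validation folds
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for depth, tr, va in zip(depths, train_mean, val_mean):
    print(f'max_depth={depth}: train={tr:.3f}, validation={va:.3f}')
```

Low scores on both curves at small depths point to underfitting. A training score that keeps rising while the validation score stalls or falls points to overfitting.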

## 4. Noisy Data
**Issue:** Real-world data often contains noise, irrelevant features, or mislabeled data points.

**Impact:** Noisy data can confuse the model, leading to poor accuracy and inconsistent predictions.

**Solution:** Preprocessing steps like data cleaning, feature selection, and dimensionality reduction (e.g., using PCA) can help.

In [None]:
# Example: Using PCA to reduce noise
from sklearn.decomposition import PCA

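# Keep the 10 directions of highest variance; the discarded low-variance
# directions often (though not always) carry mostly noise.
# PCA is fit on the training data only, then reused to transform the test data.
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X_train)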

# Train the classifier on reduced data
clf_reduced = RandomForestClassifier()
clf_reduced.fit(X_reduced, y_train)
y_pred_reduced = clf_reduced.predict(pca.transform(X_test))

print('Classification report after PCA:')
print(classification_report(y_test, y_pred_reduced))
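
The number of components does not have to be fixed by hand. One common heuristic is to keep enough components to explain a target share of the variance. The sketch below uses a 95% threshold, which is an arbitrary choice for illustration; the dataset and the names `X_demo`, `cumulative`, and `n_for_95` are also assumptions of the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_demo, _ = make_classification(n_samples=1000, n_features=20, random_state=42)

# PCA is sensitive to feature scale, so standardize first
X_scaled = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components and accumulate the explained variance
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print('Components needed for 95% of the variance:', n_for_95)
```

Passing a float such as `PCA(n_components=0.95)` asks scikit-learn to make the same selection automatically.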

## 5. High Dimensionality
**Issue:** When the dataset has too many features (dimensions), the model may struggle to find meaningful patterns.

**Impact:** This is the 'curse of dimensionality': as the feature space grows, the data become sparse, and the model's performance can degrade unless the number of samples grows as well.

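**Solution:** Techniques like feature selection, feature engineering, and dimensionality reduction (e.g., PCA) can improve model performance. t-SNE also reduces dimensionality, but it is mainly used to visualize high-dimensional data rather than as a preprocessing step for a classifier.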

In [None]:
# Example: Feature Selection with SelectKBest
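from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

# Keep the 10 features with the highest ANOVA F-scores.
# f_classif is used because chi2 requires non-negative features,
# and the make_classification data contains negative values.
selector = SelectKBest(score_func=f_classif, k=10)
X_new = selector.fit_transform(X_train, y_train)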

# Train a model using selected features
clf_selected = RandomForestClassifier()
clf_selected.fit(X_new, y_train)
y_pred_selected = clf_selected.predict(selector.transform(X_test))

print('Classification report after feature selection:')
print(classification_report(y_test, y_pred_selected))
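
When feature selection is combined with cross-validation, the selector should be refit inside each fold so that held-out data never influences which features are kept. A pipeline is one way to arrange this. The sketch below is illustrative: the dataset, `k=10`, `n_estimators=50`, and the names `X_hd`, `y_hd`, and `pipe` are assumptions of the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X_hd, y_hd = make_classification(n_samples=500, n_features=20, random_state=42)

# The selector and the classifier are fit together on each training fold
pipe = make_pipeline(
    SelectKBest(score_func=f_classif, k=10),
    RandomForestClassifier(n_estimators=50, random_state=42),
)

pipe_scores = cross_val_score(pipe, X_hd, y_hd, cv=5)
print('Cross-validation scores with in-fold feature selection:', pipe_scores)
print('Mean CV score:', pipe_scores.mean())
```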