In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

# Set a random seed for reproducibility
np.random.seed(42)

# Load the dataset
data = load_breast_cancer()
X = data.data  # Features
y = data.target  # Labels

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Introduce missing values in the training dataset
mask = np.random.rand(*X_train.shape) < 0.1  # Assuming 10% of data is missing
X_train_missing = X_train.copy()
X_train_missing[mask] = np.nan

# Printing the shape of the training and testing data
print(f"Training Features shape: {X_train_missing.shape}")
print(f"Training Labels shape: {y_train.shape}")
print(f"Testing Features shape: {X_test.shape}")
print(f"Testing Labels shape: {y_test.shape}")


Training Features shape: (455, 30)
Training Labels shape: (455,)
Testing Features shape: (114, 30)
Testing Labels shape: (114,)


## Exercise 1: Imputation of Missing Values

Simple Imputation:
    Use the `SimpleImputer` class from scikit-learn to fill in the missing values in `X_train_missing`. Use the mean strategy (`strategy="mean"`) for imputation.
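
One possible starting point for the imputation step is sketched below. It rebuilds the setup from the first cell so that it runs on its own; the name `X_train_imputed` is an assumption introduced here and is not defined in the original cell.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Recreate the setup from the first cell so this sketch is self-contained
np.random.seed(42)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
mask = np.random.rand(*X_train.shape) < 0.1
X_train_missing = X_train.copy()
X_train_missing[mask] = np.nan

# Replace every NaN with the mean of its column, computed from the observed values only
imputer = SimpleImputer(strategy="mean")
X_train_imputed = imputer.fit_transform(X_train_missing)  # hypothetical variable name

print(f"Missing values before imputation: {np.isnan(X_train_missing).sum()}")
print(f"Missing values after imputation: {np.isnan(X_train_imputed).sum()}")
```

The fitted column means are stored in `imputer.statistics_`, which can be inspected to check what was filled in.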

## Exercise 2: Visualization using t-SNE

t-SNE Application:
    Apply t-SNE (scikit-learn's `TSNE` class) on the imputed data to reduce it to 2 dimensions.

Plotting the Features:
    Plot the two-dimensional features obtained from t-SNE and color the points by their labels to visualize the class separation. Do this for both the (imputed) training set and the test set.
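
A minimal sketch of this exercise might look like the following. It assumes mean imputation as in Exercise 1, and it embeds the training and test sets separately because `TSNE` offers `fit_transform` but no `transform` for unseen data. The names `train_2d` and `test_2d`, the colormap, and the figure layout are illustrative choices, not part of the exercise.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

# Recreate the setup from the first cell so this sketch is self-contained
np.random.seed(42)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
mask = np.random.rand(*X_train.shape) < 0.1
X_train_missing = X_train.copy()
X_train_missing[mask] = np.nan
X_train_imputed = SimpleImputer(strategy="mean").fit_transform(X_train_missing)

# Each set gets its own 2-D embedding; the two plots are not directly comparable
train_2d = TSNE(n_components=2, random_state=42).fit_transform(X_train_imputed)
test_2d = TSNE(n_components=2, random_state=42).fit_transform(X_test)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
panels = [(train_2d, y_train, "Train (imputed)"), (test_2d, y_test, "Test")]
for ax, (embedding, labels, title) in zip(axes, panels):
    scatter = ax.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="coolwarm", s=15)
    ax.set_title(f"t-SNE: {title}")
    ax.set_xlabel("t-SNE dimension 1")
    ax.set_ylabel("t-SNE dimension 2")
# In this dataset, label 0 is malignant and label 1 is benign
fig.colorbar(scatter, ax=axes, label="Label (0 = malignant, 1 = benign)")
plt.show()
```

The features are left unscaled here because the exercise does not ask for scaling; standardizing them first (for example with `StandardScaler`) is a common variation worth trying.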

## Exercise 3: Dimensionality Reduction using PCA

PCA Application:
    Apply PCA on the imputed data.

Elbow Plot:
    Create an elbow plot to visualize the explained variance ratio for each component. Determine the smallest number of principal components that together explain at least 90% of the variance.
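
The sketch below shows one way to approach the PCA and elbow plot steps, assuming mean imputation as in Exercise 1. It fits a full PCA, accumulates `explained_variance_ratio_`, and picks the first component count whose cumulative sum reaches 0.90. The name `n_components_90` is an assumption introduced for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Recreate the setup from the first cell so this sketch is self-contained
np.random.seed(42)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
mask = np.random.rand(*X_train.shape) < 0.1
X_train_missing = X_train.copy()
X_train_missing[mask] = np.nan
X_train_imputed = SimpleImputer(strategy="mean").fit_transform(X_train_missing)

# Fit PCA with all components kept, so the full variance spectrum is available
pca = PCA()
pca.fit(X_train_imputed)
explained = pca.explained_variance_ratio_
cumulative = np.cumsum(explained)

# Index of the first cumulative value >= 0.90, converted to a component count
n_components_90 = int(np.argmax(cumulative >= 0.90) + 1)

components = np.arange(1, len(explained) + 1)
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(components, explained, marker="o", label="Per component")
ax.plot(components, cumulative, marker="s", label="Cumulative")
ax.axhline(0.90, color="gray", linestyle="--", label="90% threshold")
ax.set_xlabel("Principal component")
ax.set_ylabel("Explained variance ratio")
ax.set_title("PCA elbow plot (imputed training data)")
ax.legend()
plt.show()

print(f"Components needed for at least 90% of the variance: {n_components_90}")
```

Because the features are not standardized in this sketch, the components aligned with large-valued features may dominate the plot. Scaling before PCA is a common alternative and will generally change the answer.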

## Exercise 4: Logistic Regression on PCA Features

Logistic Regression:
    Train a logistic regression model on the training data using the PCA features obtained in Exercise 3.

Model Evaluation:
    Make predictions on the testing data, after projecting it with the PCA fitted on the training data, and compute the confusion matrix.

Confusion Matrix:
    Plot the confusion matrix to visualize the model's performance.
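
As a rough sketch of the three steps above, the code below keeps enough principal components to explain 90% of the variance, trains a logistic regression on them, and plots the confusion matrix for the test set. It assumes mean imputation as in Exercise 1. The value `max_iter=1000` and the explicit `svd_solver="full"` are choices made here, not requirements of the exercise.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.model_selection import train_test_split

# Recreate the setup from the first cell so this sketch is self-contained
np.random.seed(42)
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
mask = np.random.rand(*X_train.shape) < 0.1
X_train_missing = X_train.copy()
X_train_missing[mask] = np.nan
X_train_imputed = SimpleImputer(strategy="mean").fit_transform(X_train_missing)

# A float n_components keeps as many components as needed to reach that variance share
pca = PCA(n_components=0.90, svd_solver="full")
X_train_pca = pca.fit_transform(X_train_imputed)
X_test_pca = pca.transform(X_test)  # reuse the projection fitted on the training data

# max_iter is raised above the default to give the solver room to converge
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_pca, y_train)
y_pred = clf.predict(X_test_pca)

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_test, y_pred)
print(cm)

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=data.target_names)
disp.plot(cmap="Blues")
plt.title("Logistic regression on PCA features (imputed training data)")
plt.show()
```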

## Exercise 5: Comparison with the Complete Data

Repeat Exercises 3 and 4, this time with `X_train`, the dataset without missing values, and compare the results with those obtained from the imputed data.
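
One way to organize the comparison is to wrap the PCA and logistic regression steps in a helper and call it once per training set. The sketch below does this; the helper name `pca_logreg_report`, the 90% variance threshold passed to `PCA`, and the use of accuracy as a summary number are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Recreate the setup from the first cell so this sketch is self-contained
np.random.seed(42)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
mask = np.random.rand(*X_train.shape) < 0.1
X_train_missing = X_train.copy()
X_train_missing[mask] = np.nan
X_train_imputed = SimpleImputer(strategy="mean").fit_transform(X_train_missing)


def pca_logreg_report(X_tr, y_tr, X_te, y_te, variance=0.90):
    """Hypothetical helper: fit PCA and logistic regression, evaluate on the test set."""
    pca = PCA(n_components=variance, svd_solver="full")
    clf = LogisticRegression(max_iter=1000)
    clf.fit(pca.fit_transform(X_tr), y_tr)
    y_pred = clf.predict(pca.transform(X_te))
    return pca.n_components_, confusion_matrix(y_te, y_pred), accuracy_score(y_te, y_pred)


# Same pipeline, two training sets: mean-imputed versus the original complete data
n_imputed, cm_imputed, acc_imputed = pca_logreg_report(X_train_imputed, y_train, X_test, y_test)
n_complete, cm_complete, acc_complete = pca_logreg_report(X_train, y_train, X_test, y_test)

print(f"Imputed data:  {n_imputed} components, accuracy {acc_imputed:.3f}")
print(cm_imputed)
print(f"Complete data: {n_complete} components, accuracy {acc_complete:.3f}")
print(cm_complete)
```

Comparing the two confusion matrices cell by cell shows whether imputation shifted errors toward false positives or false negatives, which a single accuracy figure can hide.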