# Machine Learning Experimentation

## 1. Introduction

### 1.1 Hypothesis
Start with a clear hypothesis: What are you trying to prove or discover about the algorithms (Neural Network, KNN, and SVM) across two different datasets? Outline your experimental steps to test this hypothesis.

### 1.2 Datasets
Introduce your datasets:
- **Brain Tumor Classification MRI Dataset:**
  - **Description:** This dataset contains MRI images of brain tumors, classified into different types. It’s interesting from an ML perspective due to its medical application and the challenge of accurately classifying tumor types based on image data.
  - **Preprocessing:** Discuss any preprocessing steps you applied, such as image resizing, normalization, or data augmentation.
  - **Source:** [Kaggle Brain Tumor Classification MRI Dataset](https://www.kaggle.com/datasets/sartajbhuvaji/brain-tumor-classification-mri/data)

- **Cervical Cancer Risk Classification Dataset:**
  - **Description:** This dataset includes risk factors for cervical cancer, which is valuable for predictive modeling in healthcare. It’s challenging due to potential class imbalances and the importance of precision in medical diagnostics.
  - **Preprocessing:** Mention any preprocessing steps, like handling missing values, feature scaling, or encoding categorical variables.
  - **Source:** [Kaggle Cervical Cancer Risk Classification Dataset](https://www.kaggle.com/datasets/loveall/cervical-cancer-risk-classification?select=kag_risk_factors_cervical_cancer.csv)

### 1.3 Experimental Methodology
Discuss your overall methodology:
- **Training and Test Sets:** The datasets are pre-split into training and test sets.
- **Validation Split:** Set aside the test set and further split your training set into training and validation sets for each dataset.
- **Reproducibility:** Ensure results are reproducible by setting seeds (e.g., using `random_state` or `numpy.random.seed`).

## 2. Model Training and Tuning

### 2.1 Algorithms
Describe the three algorithms you are testing:
- **Neural Network**
- **K-Nearest Neighbors (KNN)**
- **Support Vector Machine (SVM)**

### 2.2 Cross-Validation (CV)
Use cross-validation to train your models for each algorithm across both datasets, providing a more robust estimate of model performance.

### 2.3 Learning and Validation Curves
Analyze the bias-variance trade-off using learning curves, and identify overfitting or underfitting with validation curves for each algorithm.

### 2.4 Hyperparameter Tuning
Discuss the process of hyperparameter tuning for each algorithm:
- Use techniques like grid search or random search.
- Adjust hyperparameters iteratively, using the validation set.

## 3. Evaluation

### 3.1 Final Model Evaluation
Once you’re satisfied with your model's performance, evaluate it on the test set, which has been untouched during the training and tuning phases.

### 3.2 Performance Metrics
Evaluate your models using various performance metrics:
- For balanced data: Accuracy, Precision, Recall, F1 Score.
- For imbalanced data: Precision, Recall, F1 Score, ROC Curves, PRAUC Curves, Confusion Matrices, Decision Surfaces.

## 4. Results and Discussion

### 4.1 Isolated Algorithm Results
Present results isolated to each algorithm and dataset:
- How did each algorithm perform on each dataset?
- Any specific observations about algorithmic behavior or hyperparameter interaction?

### 4.2 Comparative Analysis
Compare and contrast the results across algorithms and datasets:
- Discuss how Neural Network, KNN, and SVM performed across both datasets.
- Highlight any notable differences or similarities.

## 5. Conclusion

### 5.1 Summary and Learnings
Conclude by summarizing your findings:
- Reflect on your hypothesis and whether it was supported.
- Link the results back to algorithmic behavior, hyperparameter interactions, and input data.
- Include insights gained from lectures that applied to this assignment.



In [1]:
import os
import numpy as np
from PIL import Image
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

# Define the base directory
base_dir = '/home/ec2-user/SageMaker/ML-Algorithm-Benchmarks/data/brain_tumor/Training'

# Prepare data containers
X = []
y = []

# Define image size and batch size
image_size = (128, 128)
batch_size = 128

# Loop over each label which is a folder in this case
for label in os.listdir(base_dir):
    folder_path = os.path.join(base_dir, label)
    if os.path.isdir(folder_path):
        for image_name in os.listdir(folder_path):
            image_path = os.path.join(folder_path, image_name)
            image = Image.open(image_path)
            image = image.resize(image_size)
            image = np.array(image) / 255.0
            X.append(image)
            y.append(label)

# Convert lists to numpy arrays
X = np.array(X)
y = np.array(y)

# Encode labels
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Shuffle and split data into training and test sets
X, y_encoded = shuffle(X, y_encoded, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

# Initialize the model
model = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=1, warm_start=True, random_state=42)

# Function to train the model in batches
def train_in_batches(model, X_train, y_train, batch_size):
    num_samples = X_train.shape[0]
    for i in range(0, num_samples, batch_size):
        # Select a batch
        X_batch = X_train[i:i+batch_size]
        y_batch = y_train[i:i+batch_size]

        # Fit the model on the batch
        model.fit(X_batch.reshape(X_batch.shape[0], -1), y_batch)

    return model

# Train the model on the training data in batches
model = train_in_batches(model, X_train, y_train, batch_size)

# Evaluate the model on the test set
test_score = model.score(X_test.reshape(X_test.shape[0], -1), y_test)
print("Test Set Score: ", test_score)




Test Set Score:  0.29094076655052264
