<a href="https://colab.research.google.com/github/iftekharchowdhuryJOY/100-Days-of-Python-Code/blob/main/Final_Project_PLDA(1st_version).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Bishop’s University
##Department of Computer Science
## Name: IFTEKHARUL ISLAM CHOWDHURY
##Intelligent systems and Neural Nets – CS504
##FL2024 – Project

Accessing the Dataset:

UCI Machine Learning Repository:

The dataset can be accessed directly from the UCI repository.
Breast Cancer Wisconsin (Diagnostic) Data Set - UCI Repository

In [2]:
import pandas as pd

# Load dataset from UCI repository
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'
column_names = ['ID', 'Diagnosis'] + [f'Feature_{i}' for i in range(1, 31)]
data = pd.read_csv(url, header=None, names=column_names)

# Alternatively, load dataset from a local CSV file
# data = pd.read_csv('path_to_downloaded_csv_file.csv')


Preprocess the Data
1. Drop Unnecessary Columns

In [3]:
data = data.drop(columns=['ID'])

2. Encode Target Variable

In [4]:
data['Diagnosis'] = data['Diagnosis'].map({'M': 1, 'B': 0})


3. Separate Features and Target

In [5]:
X = data.drop(columns=['Diagnosis'])
y = data['Diagnosis']

4. **Split the Data: Divide the dataset into training (60%), validation (20%), and test (20%) sets.**

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)
print("Training set class distribution:", y_train.value_counts())
print("Validation set class distribution:", y_val.value_counts())


Training set class distribution: Diagnosis
0    214
1    127
Name: count, dtype: int64
Validation set class distribution: Diagnosis
0    71
1    43
Name: count, dtype: int64


5. **Feature Scaling: Apply Min-Max scaling to the features.**

In [7]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
import numpy as np

# Clip negative values to zero
X_train_scaled = np.clip(X_train_scaled, 0, None)
X_val_scaled = np.clip(X_val_scaled, 0, None)
X_test_scaled = np.clip(X_test_scaled, 0, None)


print("Minimum value in X_train_scaled:", X_train_scaled.min())
print("Minimum value in X_val_scaled:", X_val_scaled.min())
print("Minimum value in X_test_scaled:", X_test_scaled.min())




Minimum value in X_train_scaled: 0.0
Minimum value in X_val_scaled: 0.0
Minimum value in X_test_scaled: 0.0


6. **Dimensionality Reduction: PCA**

In [8]:
from sklearn.decomposition import PCA

n_components = 10  # Adjust based on your analysis
pca = PCA(n_components=n_components, random_state=42)
X_train_pca = pca.fit_transform(X_train_scaled)
X_val_pca = pca.transform(X_val_scaled)
X_test_pca = pca.transform(X_test_scaled)


In [9]:
from sklearn.decomposition import NMF

nmf = NMF(n_components=n_components, init='random', random_state=42)
X_train_nmf = nmf.fit_transform(X_train_scaled)
X_val_nmf = nmf.transform(X_val_scaled)
X_test_nmf = nmf.transform(X_test_scaled)




**Train Classifiers: Train each classifier individually using cross-validation on the training set, applying default parameters for each model.**

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
import numpy as np

# Define classifiers with default parameters
lr = LogisticRegression(random_state=42)
svm = SVC(random_state=42)


# Initialize dictionary to store cross-validation results
cv_results = {
    'PCA_LR': None,
    'NMF_LR': None,
    'PCA_SVM': None,
    'NMF_SVM': None
}

# Train Logistic Regression with PCA-transformed features
cv_results['PCA_LR'] = cross_val_score(lr, X_train_pca, y_train, cv=5)

# Train Logistic Regression with NMF-transformed features
cv_results['NMF_LR'] = cross_val_score(lr, X_train_nmf, y_train, cv=5)

# Train SVM with PCA-transformed features
cv_results['PCA_SVM'] = cross_val_score(svm, X_train_pca, y_train, cv=5)

# Train SVM with NMF-transformed features
cv_results['NMF_SVM'] = cross_val_score(svm, X_train_nmf, y_train, cv=5)

# Calculate mean cross-validation scores for each model
mean_scores = {model: np.mean(scores) for model, scores in cv_results.items()}
mean_scores


{'PCA_LR': 0.9619778346121057,
 'NMF_LR': 0.9238277919863599,
 'PCA_SVM': 0.9589087809036659,
 'NMF_SVM': 0.9502131287297528}

**Predict Labels: Apply the trained classifiers to the validation set to obtain predicted labels.**

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Train Logistic Regression and SVM with PCA and NMF features
lr_pca = LogisticRegression(C=10, random_state=42).fit(X_train_pca, y_train)
lr_nmf = LogisticRegression(C=10, random_state=42).fit(X_train_nmf, y_train)

svm_pca = SVC(C=10, kernel='rbf', random_state=42).fit(X_train_pca, y_train)
svm_nmf = SVC(C=10, kernel='rbf', random_state=42).fit(X_train_nmf, y_train)

# Predict labels on the validation set
y_val_pred_pca_lr = lr_pca.predict(X_val_pca)
y_val_pred_nmf_lr = lr_nmf.predict(X_val_nmf)
y_val_pred_pca_svm = svm_pca.predict(X_val_pca)
y_val_pred_nmf_svm = svm_nmf.predict(X_val_nmf)

# Display a few predictions for each model
print("Logistic Regression with PCA:", y_val_pred_pca_lr[:5])
print("Logistic Regression with NMF:", y_val_pred_nmf_lr[:5])
print("SVM with PCA:", y_val_pred_pca_svm[:5])
print("SVM with NMF:", y_val_pred_nmf_svm[:5])


Logistic Regression with PCA: [0 0 0 0 0]
Logistic Regression with NMF: [0 0 0 0 0]
SVM with PCA: [0 0 0 0 0]
SVM with NMF: [0 0 0 0 0]


In [12]:
from sklearn.metrics import classification_report

print("Logistic Regression with PCA:\n", classification_report(y_val, y_val_pred_pca_lr))
print("Logistic Regression with NMF:\n", classification_report(y_val, y_val_pred_nmf_lr))
print("SVM with PCA:\n", classification_report(y_val, y_val_pred_pca_svm))
print("SVM with NMF:\n", classification_report(y_val, y_val_pred_nmf_svm))


Logistic Regression with PCA:
               precision    recall  f1-score   support

           0       0.97      1.00      0.99        71
           1       1.00      0.95      0.98        43

    accuracy                           0.98       114
   macro avg       0.99      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114

Logistic Regression with NMF:
               precision    recall  f1-score   support

           0       0.95      1.00      0.97        71
           1       1.00      0.91      0.95        43

    accuracy                           0.96       114
   macro avg       0.97      0.95      0.96       114
weighted avg       0.97      0.96      0.96       114

SVM with PCA:
               precision    recall  f1-score   support

           0       0.96      1.00      0.98        71
           1       1.00      0.93      0.96        43

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       1

**Majority Voting: Implement majority voting to determine unified predicted labels.**

In [13]:
from scipy.stats import mode
import numpy as np

# Stack predictions from each model for majority voting
predictions = np.vstack([
    y_val_pred_pca_lr,
    y_val_pred_nmf_lr,
    y_val_pred_pca_svm,
    y_val_pred_nmf_svm
])

# Perform majority voting (most common value along each column)
y_val_pred_majority = mode(predictions, axis=0).mode.flatten()

# Display the first few predictions as a check
print("Majority Voting Predictions:", y_val_pred_majority[:10])


Majority Voting Predictions: [0 0 0 0 0 0 0 0 0 0]


In [14]:
from sklearn.metrics import classification_report, accuracy_score

# Evaluate majority voting predictions on the validation set
accuracy = accuracy_score(y_val, y_val_pred_majority)
report = classification_report(y_val, y_val_pred_majority)

print("Accuracy:", accuracy)
print("Classification Report:\n", report)


Accuracy: 0.956140350877193
Classification Report:
               precision    recall  f1-score   support

           0       0.93      1.00      0.97        71
           1       1.00      0.88      0.94        43

    accuracy                           0.96       114
   macro avg       0.97      0.94      0.95       114
weighted avg       0.96      0.96      0.96       114



**Compute F-Score: Calculate the F-score using the predicted labels compared to the true labels of the validation set.**

In [15]:
from sklearn.metrics import f1_score

# Calculate the F-score for the majority voting predictions on the validation set
f_score = f1_score(y_val, y_val_pred_majority, average='weighted')

print("Weighted F-score:", f_score)


Weighted F-score: 0.95553257040308


**Iterate with Different Components: Repeat steps 2 to 6 for varying numbers of components (2, 4, 6, 8, and 10).**

In [16]:
from sklearn.decomposition import PCA, NMF
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from scipy.stats import mode
import numpy as np

# Component counts to iterate through
component_counts = [2, 4, 6, 8, 10]
results = {}

for n_components in component_counts:
    print(f"Number of Components: {n_components}")

    # Step 2: Feature Extraction using PCA and NMF
    pca = PCA(n_components=n_components, random_state=42)
    nmf = NMF(n_components=n_components, init='random', random_state=42)

    X_train_pca = pca.fit_transform(X_train_scaled)
    X_val_pca = pca.transform(X_val_scaled)
    X_train_nmf = nmf.fit_transform(X_train_scaled)
    X_val_nmf = nmf.transform(X_val_scaled)

    # Step 3: Train Classifiers with Cross-Validation
    # Define classifiers with default parameters
    lr = LogisticRegression(random_state=42)
    svm = SVC(random_state=42)

    # Fit Logistic Regression on PCA and NMF features
    lr_pca = lr.fit(X_train_pca, y_train)
    lr_nmf = lr.fit(X_train_nmf, y_train)

    # Fit SVM on PCA and NMF features
    svm_pca = svm.fit(X_train_pca, y_train)
    svm_nmf = svm.fit(X_train_nmf, y_train)

    # Step 4: Predict Labels on the Validation Set
    y_val_pred_pca_lr = lr_pca.predict(X_val_pca)
    y_val_pred_nmf_lr = lr_nmf.predict(X_val_nmf)
    y_val_pred_pca_svm = svm_pca.predict(X_val_pca)
    y_val_pred_nmf_svm = svm_nmf.predict(X_val_nmf)

    # Step 5: Majority Voting
    predictions = np.vstack([
        y_val_pred_pca_lr,
        y_val_pred_nmf_lr,
        y_val_pred_pca_svm,
        y_val_pred_nmf_svm
    ])
    y_val_pred_majority = mode(predictions, axis=0).mode.flatten()

    # Step 6: Compute F-Score
    f_score = f1_score(y_val, y_val_pred_majority, average='weighted')
    results[n_components] = f_score

# Display the F-scores for each component count
print("F-scores for different component counts:", results)


Number of Components: 2
Number of Components: 4
Number of Components: 6




Number of Components: 8
Number of Components: 10




F-scores for different component counts: {2: 0.9007452405849058, 4: 0.819535861067334, 6: 0.7401629072681705, 8: 0.8300890092879256, 10: 0.7521781286434897}




**Report Best F-Score: Identify and report the optimal number of components based on the highest F-score.**

In [17]:
# Find the component count with the highest F-score
optimal_components = max(results, key=results.get)
best_f_score = results[optimal_components]

# Report the optimal number of components and the highest F-score
print(f"Optimal Number of Components: {optimal_components}")
print(f"Best F-Score: {best_f_score}")


Optimal Number of Components: 2
Best F-Score: 0.9007452405849058


**9. Model Configuration: You will develop five models: model_2, model_4, model_6, model_8, and model_10, where each model corresponds to a specific number of feature components used during training. For instance, model_4 is trained with four feature components. This model is an ensemble of five different combinations of feature extraction and classifier methods, represented as follows: model_4 = {(PCA_4, LR_4), (NMF_4, LR_4), (PCA_4, SVM_4), (NMF_4, SVM_4)}.**

In [21]:
from sklearn.decomposition import PCA, NMF
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Component counts to iterate through
component_counts = [2, 4, 6, 8, 10]
models = {}

for n_components in component_counts:
    # Initialize the model dictionary for the current component count
    model_name = f"model_{n_components}"
    models[model_name] = {}

    # Step 1: Feature Extraction
    # Initialize PCA and NMF with the current number of components
    pca = PCA(n_components=n_components, random_state=42)
    nmf = NMF(n_components=n_components, init='random', random_state=42)

    # Transform the training set with PCA and NMF
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_train_nmf = nmf.fit_transform(X_train_scaled)

    # Step 2: Train Classifiers
    # Logistic Regression
    lr_pca = LogisticRegression(random_state=42).fit(X_train_pca, y_train)
    lr_nmf = LogisticRegression(random_state=42).fit(X_train_nmf, y_train)

    # SVM
    svm_pca = SVC(random_state=42).fit(X_train_pca, y_train)
    svm_nmf = SVC(random_state=42).fit(X_train_nmf, y_train)


    print("Model_best dictionary successfully created with the best models!")
    # Step 3: Save Trained Models
    # Save each model configuration in the dictionary
    models[model_name][f"PCA_{n_components}_LR"] = lr_pca
    models[model_name][f"NMF_{n_components}_LR"] = lr_nmf
    models[model_name][f"PCA_{n_components}_SVM"] = svm_pca
    models[model_name][f"NMF_{n_components}_SVM"] = svm_nmf

model_best = {
    "PCA_LR": lr_pca,
    "NMF_LR": lr_nmf,
    "PCA_SVM": svm_pca,
    "NMF_SVM": svm_nmf
}
print("Model_best dictionary successfully created with the best models!")
# Output the dictionary keys to confirm model structure

print("Model configuration keys:", list(models.keys()))
for model_name, model_dict in models.items():
    print(f"{model_name} contains models:", list(model_dict.keys()))


Model_best dictionary successfully created with the best models!
Model_best dictionary successfully created with the best models!
Model_best dictionary successfully created with the best models!
Model_best dictionary successfully created with the best models!




Model_best dictionary successfully created with the best models!
Model_best dictionary successfully created with the best models!
Model configuration keys: ['model_2', 'model_4', 'model_6', 'model_8', 'model_10']
model_2 contains models: ['PCA_2_LR', 'NMF_2_LR', 'PCA_2_SVM', 'NMF_2_SVM']
model_4 contains models: ['PCA_4_LR', 'NMF_4_LR', 'PCA_4_SVM', 'NMF_4_SVM']
model_6 contains models: ['PCA_6_LR', 'NMF_6_LR', 'PCA_6_SVM', 'NMF_6_SVM']
model_8 contains models: ['PCA_8_LR', 'NMF_8_LR', 'PCA_8_SVM', 'NMF_8_SVM']
model_10 contains models: ['PCA_10_LR', 'NMF_10_LR', 'PCA_10_SVM', 'NMF_10_SVM']




In [29]:
# Clip negative values to zero after scaling
X_test_scaled = np.clip(X_test_scaled, 0, None)
X_test_scaled += 1e-6  # Add a very small constant to ensure all values are positive

print("Minimum value in X_test_scaled after clipping:", X_test_scaled.min())
print("Minimum value in X_test_scaled:", X_test_scaled.min())
print("Data type of X_test_scaled:", X_test_scaled.dtype)


Minimum value in X_test_scaled after clipping: 1e-06
Minimum value in X_test_scaled: 1e-06
Data type of X_test_scaled: float64


In [30]:
nmf_test = NMF(n_components=2, init='random', random_state=42)
try:
    X_test_nmf = nmf_test.fit_transform(X_test_scaled)
    print("NMF transformation successful.")
except ValueError as e:
    print("Error during NMF transformation:", e)


NMF transformation successful.


In [31]:
from sklearn.decomposition import PCA, NMF
from sklearn.metrics import f1_score
from scipy.stats import mode
import numpy as np

# Step 1: Feature Scaling on the Test Set
# Ensure the test set is scaled using the same Min-Max scaler applied to the training set
X_test_scaled = scaler.transform(X_test)

# Step 2: Add a tiny constant to ensure all values are positive, avoiding NMF issues
X_test_scaled = np.clip(X_test_scaled, 0, None)  # Ensure no negative values remain
X_test_scaled += 1e-6  # Add a small constant to guarantee positivity for NMF

# Step 3: Feature Extraction using PCA and NMF with 2 Components (Best Configuration)

# Apply PCA
pca_best = PCA(n_components=2, random_state=42)
X_train_pca_best = pca_best.fit_transform(X_train_scaled)  # Train PCA on the training set
X_test_pca = pca_best.transform(X_test_scaled)  # Apply PCA transformation to the test set

# Apply NMF
nmf_best = NMF(n_components=2, init='random', random_state=42)
X_train_nmf_best = nmf_best.fit_transform(X_train_scaled)  # Train NMF on the training set
X_test_nmf = nmf_best.transform(X_test_scaled)  # Apply NMF transformation to the test set

# Step 4: Define and Train the Best Models Using the Training Data
# Assuming we haven't already trained them, we train each classifier with 2 components (PCA and NMF)

# Logistic Regression Models
lr_pca_best = LogisticRegression(random_state=42).fit(X_train_pca_best, y_train)
lr_nmf_best = LogisticRegression(random_state=42).fit(X_train_nmf_best, y_train)

# SVM Models
svm_pca_best = SVC(random_state=42).fit(X_train_pca_best, y_train)
svm_nmf_best = SVC(random_state=42).fit(X_train_nmf_best, y_train)

# Store the trained models in `model_best`
model_best = {
    "PCA_LR": lr_pca_best,
    "NMF_LR": lr_nmf_best,
    "PCA_SVM": svm_pca_best,
    "NMF_SVM": svm_nmf_best
}

# Step 5: Make Predictions on the Test Set Using Each Model in `model_best`

# Make predictions on the test set with each model
y_test_pred_pca_lr = model_best['PCA_LR'].predict(X_test_pca)
y_test_pred_nmf_lr = model_best['NMF_LR'].predict(X_test_nmf)
y_test_pred_pca_svm = model_best['PCA_SVM'].predict(X_test_pca)
y_test_pred_nmf_svm = model_best['NMF_SVM'].predict(X_test_nmf)

# Step 6: Perform Majority Voting on the Test Set Predictions

# Stack predictions from each model for majority voting
predictions_test = np.vstack([
    y_test_pred_pca_lr,
    y_test_pred_nmf_lr,
    y_test_pred_pca_svm,
    y_test_pred_nmf_svm
])

# Determine the majority vote for each test sample
y_test_pred_majority = mode(predictions_test, axis=0).mode.flatten()

# Step 7: Compute the F-Score for the Majority Voting Predictions

# Calculate the F-score on the test set by comparing the majority-voted predictions with the true labels
f_score_test = f1_score(y_test, y_test_pred_majority, average='weighted')

print("F-Score on Test Set:", f_score_test)


F-Score on Test Set: 0.9372286827846255
