# Problem Statement
Breast cancer remains one of the leading causes of cancer-related deaths among women globally. Despite advances in medical technology, early diagnosis and accurate classification of breast cancer types remain critical challenges in clinical practice. Misclassification or delayed diagnosis can lead to ineffective treatment plans, increased patient suffering, and higher mortality rates.

Traditional diagnostic methods, such as mammography and biopsy analysis, are often time-consuming, costly, and sometimes prone to human error due to subjective interpretation by medical professionals. Therefore, there is a pressing need for reliable, automated systems that can assist healthcare providers by accurately identifying the type of breast cancer from patient data.

The objective of this project is to develop a machine learning-based classification system that can analyze clinical features extracted from breast tissue samples to distinguish between different types of breast cancer (e.g., benign, malignant, and other subtypes). This system aims to:

- Improve diagnostic accuracy and consistency.

- Reduce the time and cost associated with manual diagnosis.

- Provide a decision-support tool that complements radiologists and pathologists.

- Facilitate early detection to enhance patient prognosis and survival rates.

By leveraging a well-structured dataset with relevant features and applying advanced AI techniques, this project seeks to contribute to more precise breast cancer diagnosis and ultimately improve patient outcomes.

# Dataset Description

The dataset used in this project is the Breast Cancer Wisconsin (Diagnostic) Dataset, sourced from the UCI Machine Learning Repository. It is a widely recognized benchmark dataset in medical diagnostics and machine learning. The dataset comprises clinical attributes extracted from digitized images of fine needle aspirate (FNA) biopsies, representing various morphological characteristics of breast tissue cells.

- The dataset contains multiple numeric features (30 in the original data) derived from digitized images of fine needle aspirate (FNA) of breast masses. This project uses a subset of them.

- It includes a label column, `diagnosis`, with two original classes: malignant (M) and benign (B). A third class, U, is introduced during preprocessing in this notebook.

- The dataset consists of 569 samples, ensuring adequate data for training and validation.

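# Dataset Characteristics
- Number of samples: 569 (before duplicate and outlier removal)

- Number of features: 8 (the columns selected during preprocessing)

- Classes: 3 (B, M, and the added class U)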

# AI Techniques Planned
- Data Preprocessing: Handling missing data, normalization, and feature selection.
  
- Evaluation Metrics: Accuracy, Precision, Recall, F1-Score, and Confusion Matrix to assess model performance.
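As a reference for the metrics listed above, the following minimal sketch computes accuracy and per-class precision, recall, and F1 from a confusion matrix in plain Python. The function names and the matrix values are hypothetical and serve only as an illustration; they are not results from this project.

```python
def per_class_metrics(cm, labels):
    """Return {label: (precision, recall, f1)} from a square confusion matrix.

    cm[i][j] is the number of samples with true class i predicted as class j.
    """
    metrics = {}
    for k, label in enumerate(labels):
        tp = cm[k][k]
        predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
        actual_k = sum(cm[k])                    # row sum: everything that truly is k
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def accuracy_from_cm(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


# Hypothetical 3-class matrix (rows: true B, M, U; columns: predicted B, M, U)
example_cm = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 0, 10],
]
example_labels = ["B", "M", "U"]
example_metrics = per_class_metrics(example_cm, example_labels)
example_accuracy = accuracy_from_cm(example_cm)
print(example_accuracy)
print(example_metrics)
```

In practice `sklearn.metrics.classification_report` reports the same quantities; the sketch only makes the definitions explicit.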



# Import Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import chardet

with open('data.csv', 'rb') as f:
    rawdata = f.read()

result = chardet.detect(rawdata)
encoding = result['encoding']
print(f"Detected encoding: {encoding}")

with open('data.csv', encoding=encoding) as f:
    data = f.read()
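If `chardet` is not available, a simple alternative is to try a short list of candidate encodings in order. The sketch below is an assumption-laden illustration: the helper name `decode_with_fallback` and the candidate list are hypothetical, and `latin-1` is placed last because it accepts any byte sequence.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each encoding in order and return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes, try the next one
    raise ValueError("none of the candidate encodings could decode the data")


# Bytes that are valid Latin-1/cp1252 but not valid UTF-8
sample_bytes = "café,1\n".encode("latin-1")
decoded_text, encoding_used = decode_with_fallback(sample_bytes)
print(encoding_used, repr(decoded_text))
```

The encoding found this way can then be passed to `pd.read_csv(..., encoding=...)`.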

In [2]:
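# Reading with encoding='utf-8' failed with:
#   UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 80-81: invalid continuation byte
# so reuse the encoding that chardet detected above.
df = pd.read_csv('data.csv', encoding=encoding, on_bad_lines='skip')
print(df.head())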

In [None]:
# df=pd.read_csv('data.csv')

In [None]:
df.head(5)

# Data Preprocessing

In [None]:
# Create a third class 'U': relabel the first 10 rows and every row with area_mean < 300
df.loc[:9, 'diagnosis'] = 'U'
df.loc[df['area_mean'] < 300, 'diagnosis'] = 'U'


In [None]:
import numpy as np
# Choose 5% of rows randomly
num_rows = int(0.05 * len(df))
random_indices = np.random.choice(df.index, size=num_rows, replace=False)

# Assign new class 'U' to those rows
df.loc[random_indices, 'diagnosis'] = 'U'
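The random relabelling above changes on every run because no seed is set. A minimal sketch of a reproducible variant, assuming a seeded NumPy generator is acceptable here, is shown below; the helper name `relabel_random_fraction` and the toy DataFrame are hypothetical.

```python
import numpy as np
import pandas as pd


def relabel_random_fraction(frame, column, new_label, fraction, seed):
    """Return a copy of `frame` with `fraction` of the rows in `column` set to `new_label`."""
    rng = np.random.default_rng(seed)  # seeded generator makes the choice repeatable
    n_rows = int(fraction * len(frame))
    chosen = rng.choice(frame.index.to_numpy(), size=n_rows, replace=False)
    out = frame.copy()
    out.loc[chosen, column] = new_label
    return out


# Toy data standing in for the real DataFrame
toy_labels = pd.DataFrame({"diagnosis": ["M", "B"] * 50})
relabelled_a = relabel_random_fraction(toy_labels, "diagnosis", "U", 0.05, seed=0)
relabelled_b = relabel_random_fraction(toy_labels, "diagnosis", "U", 0.05, seed=0)
print((relabelled_a["diagnosis"] == "U").sum())
```

Two calls with the same seed relabel the same rows, so downstream class counts and model scores can be reproduced.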


In [None]:
columns_to_keep = [
    "area_worst",
    "concave points_worst",
    "concave points_mean",
    "radius_worst",
    "perimeter_worst",
    "perimeter_mean",
    "concavity_mean",
    "area_mean",
    "diagnosis"
]

# Take a copy so later column assignments do not trigger SettingWithCopyWarning
df = df[columns_to_keep].copy()


In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
duplicates_all = df[df.duplicated()]
print(duplicates_all)

In [None]:
# drop duplicate data
df.drop_duplicates(inplace=True)
df.shape

In [None]:
# Remove rows containing NaN values
df.dropna(inplace=True)

In [None]:
# Check whether any NaN values remain in the DataFrame
df.isnull().any().any()

In [None]:
# Count the remaining NaN values
df.isnull().sum().sum()

In [None]:
missing = df.isnull().sum()
print("Missing values:\n", missing)

In [None]:
df['area_worst'].unique()

In [None]:
df['concave points_worst'].unique()

In [None]:
df['concave points_mean'].unique()

In [None]:
df['radius_worst'].unique()

In [None]:
df['perimeter_worst'].unique()

In [None]:
df['perimeter_mean'].unique()

In [None]:
df['concavity_mean'].unique()

In [None]:
df['area_mean'].unique()

In [None]:
df['diagnosis'].unique()

In [None]:
print(df['diagnosis'].unique())
print(df['diagnosis'].value_counts())

# Encode Categorical Data

In [None]:
# Encode the 'diagnosis' column: M = 1, B = 0, U = 2
df['diagnosis_encoded'] = df['diagnosis'].map({'M': 1, 'B': 0, 'U': 2})


# Display the class counts next to their encoded values to confirm the mapping
df[['diagnosis', 'diagnosis_encoded']].value_counts()


# Outlier detection

In [None]:
df.isnull().sum()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(16, 5))

# First subplot
plt.subplot(1, 3, 1)  
sns.histplot(df['area_worst'], kde=True, color='blue')  

# Second subplot
plt.subplot(1, 3, 2)  
sns.histplot(df['concave points_worst'], kde=True, color='green')  
# Third subplot
plt.subplot(1, 3, 3)  # Third column
sns.histplot(df['radius_worst'], kde=True, color='red') 

# Show the plots
plt.tight_layout()  
plt.show()


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(16, 5))

# First subplot
plt.subplot(1, 3, 1)  
sns.histplot(df['perimeter_worst'], kde=True, color='blue')  

# Second subplot
plt.subplot(1, 3, 2)  
sns.histplot(df['perimeter_mean'], kde=True, color='green')  
# Third subplot
plt.subplot(1, 3, 3)  # Third column
sns.histplot(df['concavity_mean'], kde=True, color='red') 

# Show the plots
plt.tight_layout()  
plt.show()


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(16, 5))

# First subplot
plt.subplot(1, 3, 1)  
sns.histplot(df['area_mean'], kde=True, color='blue')  


# Show the plots
plt.tight_layout()  
plt.show()


In [None]:
print("Skewness:", df['area_worst'].skew())
print(df['area_worst'].describe())
sns.boxplot(df['area_worst'])

In [None]:
# IQR rule: compute the outlier limits for area_worst
percentile25 = df['area_worst'].quantile(0.25)
percentile75 = df['area_worst'].quantile(0.75)
iqr = percentile75 - percentile25
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr
print("upperlimit", upper_limit)
print("lowerlimit", lower_limit)
print("rows above upper limit:", (df['area_worst'] > upper_limit).sum())
print("rows below lower limit:", (df['area_worst'] < lower_limit).sum())

# Keep only rows below the upper limit (the lower limit is not applied)
df = df[df['area_worst'] < upper_limit]
print(df.shape)

# Histogram and boxplot after outlier removal
plt.figure(figsize=(16, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['area_worst'], kde=True, color='blue')
plt.subplot(1, 2, 2)
sns.boxplot(df['area_worst'])
plt.show()
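The same IQR steps are repeated for each feature in the cells that follow. As a sketch only, the rule could be wrapped in a small helper; the names `iqr_limits` and `drop_upper_outliers` are hypothetical, and the toy values are not taken from the dataset.

```python
import pandas as pd


def iqr_limits(series, k=1.5):
    """Return (lower_limit, upper_limit) using the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_upper_outliers(frame, column, k=1.5):
    """Keep only rows whose `column` value lies below the upper IQR limit."""
    _, upper = iqr_limits(frame[column], k)
    return frame[frame[column] < upper]


# Toy column with one obvious outlier
toy_area = pd.DataFrame({"area_worst": [1.0, 2.0, 3.0, 4.0, 100.0]})
toy_lower, toy_upper = iqr_limits(toy_area["area_worst"])
toy_filtered = drop_upper_outliers(toy_area, "area_worst")
print(toy_lower, toy_upper, len(toy_filtered))
```

Because each filter changes the DataFrame, the quartiles of later columns are computed on the already reduced data, exactly as in the cell-by-cell version.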

In [None]:
print("Skewness:", df['concave points_worst'].skew())
print(df['concave points_worst'].describe())
sns.boxplot(df['concave points_worst'])

In [None]:
print("Skewness:", df['concave points_mean'].skew())
print(df['concave points_mean'].describe())
sns.boxplot(df['concave points_mean'])

In [None]:
# IQR rule: compute the outlier limits for concave points_mean
percentile25 = df['concave points_mean'].quantile(0.25)
percentile75 = df['concave points_mean'].quantile(0.75)
iqr = percentile75 - percentile25
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr
print("upperlimit", upper_limit)
print("lowerlimit", lower_limit)
print("rows above upper limit:", (df['concave points_mean'] > upper_limit).sum())
print("rows below lower limit:", (df['concave points_mean'] < lower_limit).sum())

# Keep only rows below the upper limit (the lower limit is not applied)
df = df[df['concave points_mean'] < upper_limit]
print(df.shape)

# Histogram and boxplot after outlier removal
plt.figure(figsize=(16, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['concave points_mean'], kde=True, color='blue')
plt.subplot(1, 2, 2)
sns.boxplot(df['concave points_mean'])
plt.show()

In [None]:
print("Skewness:", df['radius_worst'].skew())
print(df['radius_worst'].describe())
sns.boxplot(df['radius_worst'])

In [None]:
# IQR rule: compute the outlier limits for radius_worst
percentile25 = df['radius_worst'].quantile(0.25)
percentile75 = df['radius_worst'].quantile(0.75)
iqr = percentile75 - percentile25
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr
print("upperlimit", upper_limit)
print("lowerlimit", lower_limit)
print("rows above upper limit:", (df['radius_worst'] > upper_limit).sum())
print("rows below lower limit:", (df['radius_worst'] < lower_limit).sum())

# Keep only rows below the upper limit (the lower limit is not applied)
df = df[df['radius_worst'] < upper_limit]
print(df.shape)

# Histogram and boxplot after outlier removal
plt.figure(figsize=(16, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['radius_worst'], kde=True, color='blue')
plt.subplot(1, 2, 2)
sns.boxplot(df['radius_worst'])
plt.show()

In [None]:
print("Skewness:", df['perimeter_worst'].skew())
print(df['perimeter_worst'].describe())
sns.boxplot(df['perimeter_worst'])

In [None]:
# IQR rule: compute the outlier limits for perimeter_worst
percentile25 = df['perimeter_worst'].quantile(0.25)
percentile75 = df['perimeter_worst'].quantile(0.75)
iqr = percentile75 - percentile25
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr
print("upperlimit", upper_limit)
print("lowerlimit", lower_limit)
print("rows above upper limit:", (df['perimeter_worst'] > upper_limit).sum())
print("rows below lower limit:", (df['perimeter_worst'] < lower_limit).sum())

# Keep only rows below the upper limit (the lower limit is not applied)
df = df[df['perimeter_worst'] < upper_limit]
print(df.shape)

# Histogram and boxplot after outlier removal
plt.figure(figsize=(16, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['perimeter_worst'], kde=True, color='blue')
plt.subplot(1, 2, 2)
sns.boxplot(df['perimeter_worst'])
plt.show()

In [None]:
print("Skewness:", df['perimeter_mean'].skew())
print(df['perimeter_mean'].describe())
sns.boxplot(df['perimeter_mean'])

In [None]:
# IQR rule: compute the outlier limits for perimeter_mean
percentile25 = df['perimeter_mean'].quantile(0.25)
percentile75 = df['perimeter_mean'].quantile(0.75)
iqr = percentile75 - percentile25
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr
print("upperlimit", upper_limit)
print("lowerlimit", lower_limit)
print("rows above upper limit:", (df['perimeter_mean'] > upper_limit).sum())
print("rows below lower limit:", (df['perimeter_mean'] < lower_limit).sum())

# Keep only rows below the upper limit (the lower limit is not applied)
df = df[df['perimeter_mean'] < upper_limit]
print(df.shape)

# Histogram and boxplot after outlier removal
plt.figure(figsize=(16, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['perimeter_mean'], kde=True, color='blue')
plt.subplot(1, 2, 2)
sns.boxplot(df['perimeter_mean'])
plt.show()

In [None]:
print("Skewness:", df['concavity_mean'].skew())
print(df['concavity_mean'].describe())
sns.boxplot(df['concavity_mean'])

In [None]:
# IQR rule: compute the outlier limits for concavity_mean
percentile25 = df['concavity_mean'].quantile(0.25)
percentile75 = df['concavity_mean'].quantile(0.75)
iqr = percentile75 - percentile25
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr
print("upperlimit", upper_limit)
print("lowerlimit", lower_limit)
print("rows above upper limit:", (df['concavity_mean'] > upper_limit).sum())
print("rows below lower limit:", (df['concavity_mean'] < lower_limit).sum())

# Keep only rows below the upper limit (the lower limit is not applied)
df = df[df['concavity_mean'] < upper_limit]
print(df.shape)

# Histogram and boxplot after outlier removal
plt.figure(figsize=(16, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['concavity_mean'], kde=True, color='blue')
plt.subplot(1, 2, 2)
sns.boxplot(df['concavity_mean'])
plt.show()

In [None]:
print("Skewness:", df['area_mean'].skew())
print(df['area_mean'].describe())
sns.boxplot(df['area_mean'])

In [None]:
# IQR rule: compute the outlier limits for area_mean
percentile25 = df['area_mean'].quantile(0.25)
percentile75 = df['area_mean'].quantile(0.75)
iqr = percentile75 - percentile25
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr
print("upperlimit", upper_limit)
print("lowerlimit", lower_limit)
print("rows above upper limit:", (df['area_mean'] > upper_limit).sum())
print("rows below lower limit:", (df['area_mean'] < lower_limit).sum())

# Keep only rows below the upper limit (the lower limit is not applied)
df = df[df['area_mean'] < upper_limit]
print(df.shape)

# Histogram and boxplot after outlier removal
plt.figure(figsize=(16, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['area_mean'], kde=True, color='blue')
plt.subplot(1, 2, 2)
sns.boxplot(df['area_mean'])
plt.show()

In [None]:
print("Unique classes in original df:", df['diagnosis'].nunique())

# Feature Scaling

# MinMax Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

In [None]:
# df is your original DataFrame and diagnosis is your target column
# (drop 'diagnosis_encoded' too, so the encoded label does not leak into the features)
X = df.drop(columns=['diagnosis', 'diagnosis_encoded'], errors='ignore')
y = df['diagnosis']

In [None]:
# Initialize MinMaxScaler and rescale every feature to the [0, 1] range
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
# Important: set original index to avoid NaNs during concat
minmax_scaled_df = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)

# Now concatenate with y
final_minmax_df = pd.concat([minmax_scaled_df, y], axis=1)

# Display result
print(final_minmax_df.head())

print(final_minmax_df.isna().sum())  # Should be all 0


# Standardization (Z-score Normalization)

In [None]:
from sklearn.preprocessing import StandardScaler

# df is your original DataFrame and diagnosis is your target column
# (drop 'diagnosis_encoded' too, so the encoded label does not leak into the features)
X = df.drop(columns=['diagnosis', 'diagnosis_encoded'], errors='ignore')
y = df['diagnosis']

# Fit and transform scaler on entire X
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# When creating DataFrame, preserve the original index!
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)

# Now concatenate with y, indices align perfectly
final_scaled_df = pd.concat([X_scaled_df, y], axis=1)
# Display result
print(final_scaled_df.head())


In [None]:
print(final_scaled_df.isna().sum()) 
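Fitting the scaler on the full dataset, as above, lets test-set statistics influence the training features. A common alternative is to place the scaler inside a pipeline so that it is fitted on the training split only. The sketch below illustrates this under stated assumptions: it uses scikit-learn's bundled two-class copy of the Wisconsin data rather than `data.csv`, and `LogisticRegression` is only a fast stand-in for the models trained later.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled dataset (30 features, 2 classes), used here only for illustration
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fitted inside the pipeline, i.e. on the training split only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)

# The learned means equal the training-set means, not the full-data means
scaler_in_pipe = pipe.named_steps["standardscaler"]
train_mean_matches = np.allclose(scaler_in_pipe.mean_, X_tr.mean(axis=0))
print(f"Test accuracy: {test_accuracy:.4f}", train_mean_matches)
```

The same pipeline object can be passed to `GridSearchCV`, in which case the scaler is refitted within every cross-validation fold.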

# PCA

In [None]:
from sklearn.decomposition import PCA

In [None]:
# 3. Encode text labels into numbers (needed for `c=...`)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_encoded = le.fit_transform(df['diagnosis'])

In [None]:
pca = PCA(n_components=2)
pca.fit(X_scaled)

In [None]:
X_pca = pca.fit_transform(X_scaled)

In [None]:
X_scaled.shape

In [None]:
X_pca.shape

In [None]:
X_pca

In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_encoded, cmap='plasma')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.clim(-0.5, len(le.classes_)-0.5)
plt.grid(True)
plt.show()
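Before reading much into a two-component scatter plot, it can help to check how much variance the two components retain. The following sketch uses synthetic data (not the project data) to show the `explained_variance_ratio_` attribute of a fitted `PCA`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))

# Three noisy copies of one latent factor plus one independent feature
synthetic = np.hstack(
    [latent + 0.05 * rng.normal(size=(200, 1)) for _ in range(3)]
    + [rng.normal(size=(200, 1))]
)

pca_demo = PCA(n_components=2).fit(synthetic)
variance_ratios = pca_demo.explained_variance_ratio_  # one value per component
cumulative_variance = np.cumsum(variance_ratios)
print(variance_ratios, cumulative_variance[-1])
```

Applied to the notebook's fitted `pca` object, the same attribute indicates how faithfully the 2-D plot represents the scaled features.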

# Data Visualization

# Boxplot

In [None]:
sns.boxplot(x='diagnosis', y='concavity_mean', data=df, palette='pastel')
plt.title('Concavity Mean by Diagnosis')
plt.xlabel('Diagnosis')
plt.ylabel('concavity_mean')
plt.show()


# Violin plot

In [None]:
sns.violinplot(x=df['concave points_mean'], color='lightgreen')
plt.title('Violin Plot of concave points_mean')
plt.xlabel('concave points_mean')
plt.show()


# Histogram

In [None]:
sns.histplot(data=df, x='radius_worst', hue='diagnosis', kde=True, palette='Set2')
plt.title('Radius Worst by Diagnosis')
plt.xlabel('radius_worst')
plt.ylabel('Count')
plt.show()


# Stripplot

In [None]:
sns.stripplot(x='diagnosis', y='perimeter_worst', data=df, jitter=True, palette='coolwarm')
plt.title('Perimeter Worst by Diagnosis')
plt.xlabel('Diagnosis')
plt.ylabel('perimeter_worst')
plt.show()


# Swarmplot

In [None]:
sns.swarmplot(x='diagnosis', y='perimeter_mean', data=df, palette='Set1')
plt.title('Perimeter Mean by Diagnosis')
plt.xlabel('Diagnosis')
plt.ylabel('perimeter_mean')
plt.show()


# KDE Plot (Kernel Density Estimation)

In [None]:
sns.kdeplot(data=df, x='area_mean', hue='diagnosis', fill=True, palette='husl')
plt.title('Density Plot of area_mean by Diagnosis')
plt.xlabel('area_mean')
plt.ylabel('Density')
plt.show()


# Implementation & Evaluation

## Task 1

# MLP Classifier

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Step 1: Split features and target
# (drop 'diagnosis_encoded' too, if present, so the label does not leak into the features)
X = final_scaled_df.drop(columns=['diagnosis', 'diagnosis_encoded'], errors='ignore')
y = final_scaled_df['diagnosis']

# Step 2: Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 3: Define the MLP model with regularization to reduce overfitting
mlp = MLPClassifier(
    hidden_layer_sizes=(30, 15, 20),
    max_iter=1000,
    random_state=42,
    early_stopping=True,
    validation_fraction=0.1,
    n_iter_no_change=10,
    alpha=0.001
)

# Step 4: Train the model
mlp.fit(X_train, y_train)

# Step 5: Make predictions
y_pred = mlp.predict(X_test)

# Step 6: Evaluate performance
accuracy1 = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy1:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Step 7: Confusion matrix (rows and columns follow the class order learned by the model)
class_names = mlp.classes_
cm = confusion_matrix(y_test, y_pred, labels=class_names)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names,
            yticklabels=class_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - MLP Classifier')
plt.show()
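With three classes of unequal size, a plain random split can leave the smallest class under-represented in the test set. The sketch below shows the `stratify` argument of `train_test_split` on made-up labels; the class proportions are hypothetical and chosen only so that the effect is easy to verify.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 % B, 30 % M, 10 % U
demo_labels = ["B"] * 60 + ["M"] * 30 + ["U"] * 10
demo_features = [[i] for i in range(len(demo_labels))]

f_train, f_test, l_train, l_test = train_test_split(
    demo_features, demo_labels, test_size=0.2, random_state=42, stratify=demo_labels
)

# Class proportions are preserved in both splits
train_counts = Counter(l_train)
test_counts = Counter(l_test)
print(train_counts, test_counts)
```

Passing `stratify=y` in the split above would apply the same idea to the notebook's data, at the cost of changing which rows land in the test set.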


# Hyperparameter Tuning (GridSearchCV)

In [None]:
!pip install scikit-optimize


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Define the model
mlp = MLPClassifier(max_iter=1000, random_state=42)


# Define parameter grid to search
param_grid = {
    'hidden_layer_sizes': [(30, 15, 20), (50, 30, 20), (30, 30, 30)],
    'activation': ['relu', 'tanh'],
    'solver': ['adam', 'sgd'],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate_init': [0.001, 0.01]
}

# Set up GridSearchCV
grid_search = GridSearchCV(mlp, param_grid, cv=5, n_jobs=-1, verbose=2)

# Fit to training data (X_train, y_train from your split)
grid_search.fit(X_train, y_train)

# Best parameters and score
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-validation Accuracy:", grid_search.best_score_)

# Use best estimator to predict test set
best_mlp = grid_search.best_estimator_
y_pred = best_mlp.predict(X_test)

from sklearn.metrics import accuracy_score
print("Test Accuracy with best MLP:", accuracy_score(y_test, y_pred))
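To estimate the cost of the grid search before running it, the number of candidates can be counted directly. This is a small worked calculation using the same grid as above; the variable names are chosen here for illustration.

```python
from itertools import product

param_grid_demo = {
    "hidden_layer_sizes": [(30, 15, 20), (50, 30, 20), (30, 30, 30)],
    "activation": ["relu", "tanh"],
    "solver": ["adam", "sgd"],
    "alpha": [0.0001, 0.001, 0.01],
    "learning_rate_init": [0.001, 0.01],
}

# Every combination of one value per parameter is one candidate model
candidates = list(product(*param_grid_demo.values()))
n_candidates = len(candidates)  # 3 * 2 * 2 * 3 * 2
n_folds = 5
n_fits = n_candidates * n_folds  # cross-validation fits, excluding the final refit
print(n_candidates, n_fits)
```

With 72 candidates and 5 folds, the search trains 360 networks plus one final refit on the full training set, which explains the long runtime.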


## Task 2

# Ensemble Model (Bagging)

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Bagging ensemble with your MLP as base estimator
bagging_mlp = BaggingClassifier(
    estimator=mlp,
    n_estimators=10,         # 10 MLP models in the ensemble
    max_samples=0.8,         # Each base model trained on 80% random subset
    max_features=1.0,        # Use all features
    bootstrap=True,          # Sampling with replacement
    random_state=42,
    n_jobs=-1                # Use all CPU cores for parallelism
)

# Train on your scaled training data
bagging_mlp.fit(X_train, y_train)

# Predict on your scaled test data
y_pred = bagging_mlp.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Bagging MLP Test Accuracy: {accuracy:.4f}")

# Detailed classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Plot confusion matrix (rows and columns follow the class order learned by the ensemble)
class_names = bagging_mlp.classes_
cm = confusion_matrix(y_test, y_pred, labels=class_names)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names,
            yticklabels=class_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - Bagging MLP Classifier')
plt.show()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [None]:
# Task 2 ensemble models:

# 1. Random Forest
rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, y_pred_rf)

# 2. Gradient Boosting
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gb_model.fit(X_train, y_train)
y_pred_gb = gb_model.predict(X_test)
gb_accuracy = accuracy_score(y_test, y_pred_gb)

# 3. Bagging (using Decision Trees as base estimators)
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=5),
    n_estimators=100,
    random_state=42
)
bagging_model.fit(X_train, y_train)
y_pred_bag = bagging_model.predict(X_test)
bag_accuracy = accuracy_score(y_test, y_pred_bag)

# 4. Voting Classifier (ensemble of MLP, Random Forest, and Gradient Boosting)
# `mlp` is the MLPClassifier defined earlier; VotingClassifier fits a fresh clone of each estimator
voting_model = VotingClassifier(
    estimators=[
        ('mlp', mlp),
        ('rf', rf_model),
        ('gb', gb_model)
    ],
    voting='hard'  # majority voting
)
voting_model.fit(X_train, y_train)
y_pred_voting = voting_model.predict(X_test)
voting_accuracy = accuracy_score(y_test, y_pred_voting)

# Print accuracy results (accuracy1 is the Task 1 MLP accuracy)
print(f"MLP Accuracy (Task 1): {accuracy1:.4f}")
print(f"Random Forest Accuracy: {rf_accuracy:.4f}")
print(f"Gradient Boosting Accuracy: {gb_accuracy:.4f}")
print(f"Bagging Accuracy: {bag_accuracy:.4f}")
print(f"Voting Classifier Accuracy: {voting_accuracy:.4f}")
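Hard voting, as used by the voting classifier above, simply takes the most frequent label among the base models for each sample. The sketch below shows the idea in plain Python; the function name `hard_vote` and the prediction lists are hypothetical, and tie-breaking here follows `Counter` ordering rather than scikit-learn's rule.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Majority vote over equal-length prediction lists, one list per model."""
    voted = []
    for sample_preds in zip(*predictions_per_model):
        label, _count = Counter(sample_preds).most_common(1)[0]
        voted.append(label)
    return voted


# Made-up predictions from three models for four samples
mlp_preds = ["M", "B", "U", "B"]
rf_preds = ["M", "M", "U", "B"]
gb_preds = ["B", "M", "B", "B"]
ensemble_preds = hard_vote([mlp_preds, rf_preds, gb_preds])
print(ensemble_preds)
```

A sample is classified correctly by the ensemble whenever at least two of the three models agree on the right label, which is why voting can outperform its individual members.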

# Comparison

In [None]:
# Print comparison
print(f"Single MLP Accuracy: {accuracy1:.4f}")
print(f"Bagging Ensemble MLP Accuracy: {accuracy:.4f}")

## Task 3

# CNN

In [None]:
pip install tensorflow scikit-learn pandas matplotlib seaborn


In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout

In [None]:

# Prepare data
# (drop 'diagnosis_encoded' too, if present, so the label does not leak into the features)
X = final_scaled_df.drop(columns=['diagnosis', 'diagnosis_encoded'], errors='ignore').values
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(final_scaled_df['diagnosis'])
num_classes = len(label_encoder.classes_)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Reshape for Conv1D: (samples, features, 1)
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], 1)

# Build CNN model
model = Sequential([
    Conv1D(32, kernel_size=3, activation='relu', padding='same', input_shape=(X_train.shape[1], 1)),
    MaxPooling1D(pool_size=2),
    
    Conv1D(64, kernel_size=3, activation='relu', padding='same'),
    MaxPooling1D(pool_size=2),

    Conv1D(128, kernel_size=3, activation='relu', padding='same'),
    Flatten(),
    
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dense(num_classes, activation='softmax')  # One output per diagnosis class
])

# Compile (labels are integers, so use the sparse categorical loss)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train
history = model.fit(X_train, y_train, epochs=30, batch_size=16, validation_split=0.1, verbose=1)

# Predict: take the most probable class for each sample
y_pred_prob = model.predict(X_test)
y_pred = y_pred_prob.argmax(axis=1)

# Evaluate
print("CNN Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Confusion Matrix (tick labels are the original class names in encoded order)
cm = confusion_matrix(y_test, y_pred, labels=list(range(num_classes)))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=label_encoder.classes_,
            yticklabels=label_encoder.classes_)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - CNN')
plt.show()
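The layer stack above only works if the sequence length survives both pooling steps. The following worked calculation traces the length through the network, assuming 8 input features (the columns kept in preprocessing), stride-1 convolutions with `padding='same'`, and Keras' default pooling stride equal to the pool size. The helper names are hypothetical.

```python
def conv1d_same_length(length):
    """Conv1D with stride 1 and padding='same' keeps the sequence length."""
    return length


def maxpool1d_length(length, pool_size=2):
    """MaxPooling1D with strides=pool_size and padding='valid'."""
    return (length - pool_size) // pool_size + 1


n_features = 8  # assumed number of input features
length_after_block1 = maxpool1d_length(conv1d_same_length(n_features))
length_after_block2 = maxpool1d_length(conv1d_same_length(length_after_block1))
length_after_conv3 = conv1d_same_length(length_after_block2)

# Flatten multiplies the remaining length by the 128 filters of the last Conv1D
flattened_units = length_after_conv3 * 128
print(length_after_block1, length_after_block2, flattened_units)
```

With only 8 features the sequence shrinks to length 2 after the second pooling layer, so adding a third pooling layer would leave a single position per filter.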


## Task 4

# Clustering (Gaussian Mixture Model (GMM))

In [None]:
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score, silhouette_score

In [None]:
# Step 2: Apply Gaussian Mixture Model
gmm = GaussianMixture(n_components=2, random_state=42)
gmm_labels = gmm.fit_predict(X)

In [None]:
# Step 3: Evaluate Clustering
ari = adjusted_rand_score(y_encoded, gmm_labels)
sil_score = silhouette_score(X, gmm_labels)

print(f"\n Adjusted Rand Index (ARI): {ari:.4f}")
print(f" Silhouette Score: {sil_score:.4f}")

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=gmm_labels, palette='Set1')
plt.title("GMM Clustering Visualization")
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.legend(title="GMM Cluster")
plt.grid(True)
plt.show()
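The number of mixture components is fixed at 2 above. One way to check such a choice, sketched here on synthetic data rather than the project data, is to compare the Bayesian Information Criterion (BIC) across several component counts using `GaussianMixture.bic`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic data: two well-separated 2-D Gaussian blobs
blob_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(150, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=1.0, size=(150, 2))
points = np.vstack([blob_a, blob_b])

candidate_k = [1, 2, 3, 4]
bic_scores = []
for k in candidate_k:
    model_k = GaussianMixture(n_components=k, random_state=42).fit(points)
    bic_scores.append(model_k.bic(points))  # lower BIC indicates a better trade-off

best_k = candidate_k[int(np.argmin(bic_scores))]
print(dict(zip(candidate_k, bic_scores)), best_k)
```

Running the same loop on the notebook's feature matrix would show whether 2 components are supported by the data or whether 3 (matching the three diagnosis labels) fits better.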

## Clustering (Agglomerative Clustering)

In [None]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import accuracy_score, adjusted_rand_score
from scipy.cluster.hierarchy import dendrogram, linkage

In [None]:
# Step 3: Plot dendrogram to choose number of clusters
linked = linkage(X_scaled, method='ward')


In [None]:
plt.figure(figsize=(12, 6))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=False)
plt.title('Dendrogram - Hierarchical Clustering')
plt.xlabel('Samples')
plt.ylabel('Distance')
plt.show()


In [None]:
# Step 4: Fit Agglomerative Clustering
# Based on dendrogram, assume 2 clusters (Malignant, Benign)
agg_cluster = AgglomerativeClustering(n_clusters=2, metric='euclidean', linkage='ward')
cluster_labels = agg_cluster.fit_predict(X_scaled)


In [None]:
# Step 5: Evaluate clustering performance
# Map clusters to true labels (optional flip if needed)
# Use adjusted rand index for label-agnostic comparison
ari = adjusted_rand_score(y, cluster_labels)
print(f"Adjusted Rand Index (ARI): {ari:.4f}")
# Silhouette Score (measures cohesion and separation)
sil_score = silhouette_score(X_scaled, cluster_labels)
print(f"Silhouette Score: {sil_score:.4f}")
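The Adjusted Rand Index is used here because cluster IDs are arbitrary: it compares partitions, not label names. A minimal illustration on made-up labels:

```python
from sklearn.metrics import adjusted_rand_score

true_groups = [0, 0, 0, 1, 1, 1]

# Same partition with the cluster IDs swapped
renamed_groups = [1, 1, 1, 0, 0, 0]
# Degenerate clustering that puts every sample in one cluster
single_cluster = [0, 0, 0, 0, 0, 0]

ari_renamed = adjusted_rand_score(true_groups, renamed_groups)
ari_single = adjusted_rand_score(true_groups, single_cluster)
print(ari_renamed, ari_single)
```

A swapped labelling scores a perfect 1.0, while an uninformative clustering scores 0, so no manual mapping of clusters to diagnosis classes is needed before computing the ARI.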

# Compare the Clustering Results

In [None]:
print("Agglomerative Clustering")
ari = adjusted_rand_score(y, cluster_labels)
print(f"Adjusted Rand Index (ARI): {ari:.4f}")
# Silhouette Score (measures cohesion and separation)
sil_score = silhouette_score(X_scaled, cluster_labels)
print(f"Silhouette Score: {sil_score:.4f}")

print("Gaussian Mixture")
ari = adjusted_rand_score(y_encoded, gmm_labels)
sil_score = silhouette_score(X, gmm_labels)
print(f"\n Adjusted Rand Index (ARI): {ari:.4f}")
print(f" Silhouette Score: {sil_score:.4f}")