# Breast Cancer Classification Project

This project uses machine learning to classify breast tumors as malignant or benign from features derived from medical imaging. The key steps are data preprocessing, training several classifiers (K-Nearest Neighbors, Random Forest, Logistic Regression, and Decision Tree), and evaluating each model's accuracy, precision, and recall to judge how well it could support clinical decision-making.

## Step 1: Data Preprocessing

1-Load the Dataset: Read the dataset into a Pandas DataFrame.

In [3]:
import pandas as pd

# Load the dataset
data = pd.read_csv('data.csv')


2-Encode the Target Variable: Convert the 'diagnosis' column (malignant or benign) into numerical values.

In [4]:
from sklearn.preprocessing import LabelEncoder

# Encode the 'diagnosis' column
le = LabelEncoder()
data['diagnosis'] = le.fit_transform(data['diagnosis'])
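
The classification reports later in this notebook label the classes only as 0 and 1, so it helps to know which is which. The following is a minimal sketch, assuming the raw 'diagnosis' column holds the strings 'M' (malignant) and 'B' (benign); `raw_labels` is a hypothetical sample, not the real column. `LabelEncoder` assigns integer codes in sorted label order, and `classes_` exposes that order.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of raw labels; the real column is assumed to hold 'M' and 'B'.
raw_labels = ['M', 'B', 'B', 'M']

le = LabelEncoder()
encoded = le.fit_transform(raw_labels)

# classes_ lists the original labels in the order of their integer codes.
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)
print(list(encoded))
```

Under that assumption, benign maps to 0 and malignant to 1, which is how the 0 and 1 rows of the reports would be read.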


3-Split the Data: Separate the dataset into features (X) and the target variable (y).

In [5]:
# Split the dataset into features and target variable
X = data.drop(['id', 'diagnosis'], axis=1)  # Features
y = data['diagnosis']  # Target variable


4-Handle Missing Values: Impute missing values if necessary. This example assumes the dataset is clean at this stage; the training code in Step 3 nonetheless imputes any remaining gaps with the column mean.
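
Before relying on the assumption of a clean dataset, it may be worth counting the missing values per column. The sketch below uses a small hypothetical table (`X_demo`, with illustrative column names including a deliberately empty `empty_col`) in place of the real features.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the feature table X; column names are illustrative only.
X_demo = pd.DataFrame({
    'radius_mean': [14.2, np.nan, 12.1, 20.5],
    'texture_mean': [19.0, 21.3, 17.8, 25.1],
    'empty_col': [np.nan, np.nan, np.nan, np.nan],
})

# Count missing values in each column.
missing_counts = X_demo.isna().sum()
print(missing_counts)

# Drop columns that contain no data at all; they carry no information.
X_clean = X_demo.dropna(axis=1, how='all')
print(list(X_clean.columns))
```

Any gaps that remain after this check can be imputed once the data has been split, so that the imputation statistics come from the training rows only.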

## Step 2: Train/Test Split and Feature Scaling

5-Split Data into Training and Testing Sets: Split the data into training and testing sets for model evaluation.
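
A minimal sketch of this split, using the same `test_size=0.2` and `random_state=42` that appear in the training code later in this notebook. The arrays `X_demo` and `y_demo` are hypothetical random stand-ins; 569 rows and 30 columns are assumed here so that the 20% test set has 114 rows, as in the reported results.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 569 rows and 30 features are assumptions.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(569, 30))
y_demo = rng.integers(0, 2, size=569)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

With the real data, `X` and `y` from Step 1 would take the place of the stand-in arrays.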

6-Feature Scaling: Standardize the features to ensure all features contribute equally to the model.

In [7]:
from sklearn.preprocessing import StandardScaler

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
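
To illustrate what the scaler does, here is a small sketch on hypothetical two-feature arrays (`X_train_demo` and `X_test_demo` are made up for the example). The scaler learns each feature's mean and standard deviation from the training rows only and then applies those same statistics to the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales.
X_train_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_demo = np.array([[2.0, 400.0]])

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training rows.
X_train_scaled = scaler.fit_transform(X_train_demo)
# transform reuses the training statistics; nothing is learned from the test rows.
X_test_scaled = scaler.transform(X_test_demo)

print(scaler.mean_)
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

After scaling, each training feature has mean 0 and standard deviation 1, so distance-based models such as K-Nearest Neighbors are not dominated by the feature with the largest raw values.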




## Step 3: Model Training

7-Train Multiple Models: Train various classifiers using the training data.

In [9]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the dataset
data = pd.read_csv('data.csv')

# Encode the 'diagnosis' column
le = LabelEncoder()
data['diagnosis'] = le.fit_transform(data['diagnosis'])

# Split the dataset into features and target variable
X = data.drop(['id', 'diagnosis'], axis=1)
y = data['diagnosis']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Handle missing values by imputing with the mean
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize models
knn = KNeighborsClassifier()
rf = RandomForestClassifier()
lr = LogisticRegression()
dt = DecisionTreeClassifier()

# Train the models
knn.fit(X_train, y_train)
rf.fit(X_train, y_train)
lr.fit(X_train, y_train)
dt.fit(X_train, y_train)

# Evaluation of the trained models follows in Step 4




## Step 4: Model Evaluation

8-Evaluate Models: Assess the performance of each model using accuracy and a per-class classification report (precision, recall, and F1-score).

In [10]:
from sklearn.metrics import accuracy_score, classification_report

# Function to evaluate and print metrics
def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    return accuracy, report

# Evaluate each model
models = {'K-Nearest Neighbors': knn,
          'Random Forest': rf,
          'Logistic Regression': lr,
          'Decision Tree': dt}

for name, model in models.items():
    accuracy, report = evaluate_model(model, X_test, y_test)
    print(f"=== {name} ===")
    print(f"Accuracy: {accuracy}")
    print(f"Classification Report:\n{report}\n")


=== K-Nearest Neighbors ===
Accuracy: 0.9473684210526315
Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.96      0.96        71
           1       0.93      0.93      0.93        43

    accuracy                           0.95       114
   macro avg       0.94      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114


=== Random Forest ===
Accuracy: 0.956140350877193
Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.97      0.97        71
           1       0.95      0.93      0.94        43

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114


=== Logistic Regression ===
Accuracy: 0.9736842105263158
Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.99      0.98        71
  