# Breast Cancer Prediction using Logistic Regression

## Introduction
This project aims to predict whether a tumor is malignant (**M**) or benign (**B**) based on clinical and histological data. The dataset used for this analysis is the **Breast Cancer Dataset**, which contains various features extracted from cell nuclei in digital images.

## Workflow Overview
1. **Data Preprocessing**:
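   - Handled missing values (if any); the cells below contain no explicit missing-value step.
   - Encoded the categorical target (`diagnosis`: **M** as 1, **B** as 0).
   - Standardized numerical features with `StandardScaler`. The scaled data is used for the correlation analysis; the model itself is fit on the unscaled columns.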
2. **Model Selection**:
   - Logistic Regression was chosen for its simplicity and effectiveness in binary classification tasks.
3. **Evaluation**:
   - Performance was evaluated on a held-out test set using **Accuracy** and a **Confusion Matrix**.
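
The preprocessing steps listed above can be sketched on a small example. This is a minimal illustration only: the `toy` frame and its values are hypothetical stand-ins for the real CSV, and median imputation is an assumed strategy that the notebook itself does not use.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [20.0, 12.0, np.nan, 18.0],
    "area_mean": [1200.0, 450.0, 500.0, 1000.0],
})
numeric_cols = ["radius_mean", "area_mean"]

# 1. Missing values: count them per column, then fill numeric gaps
#    with the column median (an assumed, illustrative strategy).
missing_counts = toy.isna().sum()
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# 2. Encode the target: malignant -> 1, benign -> 0.
toy["diagnosis"] = toy["diagnosis"].map({"M": 1, "B": 0})

# 3. Standardize the numeric features to zero mean and unit variance.
scaled = StandardScaler().fit_transform(toy[numeric_cols])

print(missing_counts)
print(scaled.round(3))
```

Checking `isna().sum()` first makes it explicit whether any imputation is needed at all.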

## Key Objectives
- Build a predictive model to classify tumors as malignant or benign.
- Assess model performance using various evaluation metrics.
- Interpret the key features contributing to model predictions.


In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


/kaggle/input/breast-cancer-dataset/breast-cancer.csv


In [2]:
# Load the dataset and check which labels the target column contains
file_path = '/kaggle/input/breast-cancer-dataset/breast-cancer.csv'
p = pd.read_csv(file_path)
print(p['diagnosis'].unique())

['M' 'B']


In [3]:
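# Encode the target: malignant (M) -> 1, benign (B) -> 0
p['diagnosis'] = p['diagnosis'].map({'M': 1, 'B': 0})
# Everything except the target; note that the 'id' column is still included here
features = p.drop('diagnosis', axis=1)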

In [4]:
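# Standardize every feature to zero mean and unit variance
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

# Rebuild a DataFrame with the original column names and re-attach the target
scaled_df = pd.DataFrame(scaled_features, columns=features.columns)
scaled_df['diagnosis'] = p['diagnosis']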

In [5]:
# Pearson correlation of every column with the target, strongest first
correlation_matrix = scaled_df.corr()
print(correlation_matrix['diagnosis'].sort_values(ascending=False))

diagnosis                  1.000000
concave points_worst       0.793566
perimeter_worst            0.782914
concave points_mean        0.776614
radius_worst               0.776454
perimeter_mean             0.742636
area_worst                 0.733825
radius_mean                0.730029
area_mean                  0.708984
concavity_mean             0.696360
concavity_worst            0.659610
compactness_mean           0.596534
compactness_worst          0.590998
radius_se                  0.567134
perimeter_se               0.556141
area_se                    0.548236
texture_worst              0.456903
smoothness_worst           0.421465
symmetry_worst             0.416294
texture_mean               0.415185
concave points_se          0.408042
smoothness_mean            0.358560
symmetry_mean              0.330499
fractal_dimension_worst    0.323872
compactness_se             0.292999
concavity_se               0.253730
fractal_dimension_se       0.077972
id                         0
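
The listing above comes from `DataFrame.corr()`, which computes Pearson correlation by default. As a small worked example on hypothetical toy values (not taken from the dataset), the sketch below computes the coefficient by hand against a 0/1 target and shows that standardizing a feature leaves its correlation unchanged, so the ranking would be the same on the unscaled data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; 'feature' and its numbers are illustrative only.
toy = pd.DataFrame({
    "feature": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "diagnosis": [0, 0, 0, 1, 1, 1],
})

x = toy["feature"].to_numpy()
y = toy["diagnosis"].to_numpy()

# Pearson r = covariance / (std_x * std_y), written out with centered sums.
dx, dy = x - x.mean(), y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# The same value as reported by pandas.
r_pandas = toy.corr().loc["feature", "diagnosis"]

# Standardizing x first does not change the correlation.
x_scaled = (x - x.mean()) / x.std()
r_scaled = np.corrcoef(x_scaled, y)[0, 1]

print(round(r_manual, 4), round(r_pandas, 4), round(r_scaled, 4))
# prints: 0.8783 0.8783 0.8783
```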

In [6]:
# The ten features most correlated with the diagnosis, taken from the unscaled frame `p`
new_features = p[[
    "concave points_worst", "perimeter_worst", "concave points_mean",
    "radius_worst", "perimeter_mean", "area_worst", "radius_mean",
    "area_mean", "concavity_mean", "concavity_worst",
]]
new_target = p["diagnosis"]

In [7]:
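# Hold out 33% of the rows for testing.
# No random_state is set, so the split (and the scores below) differ on every run.
xtrain, xtest, ytrain, ytest = train_test_split(new_features.values, new_target, test_size=0.33)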
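
Because the split above is unseeded, reported scores can change between runs. A minimal sketch of a reproducible, class-balanced split follows. It uses synthetic data, and both `random_state=42` and `stratify=y` are illustrative choices that are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 60/40 class balance.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 60 + [1] * 40)

# random_state fixes the shuffle; stratify keeps the class ratio
# roughly equal in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

# Repeating the call with the same seed yields the identical split.
_, X_test_again, _, y_test_again = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
print(np.array_equal(X_test, X_test_again))
```

With a fixed seed, any accuracy quoted in the write-up can be regenerated exactly from the code.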


In [8]:
# Fit a logistic regression with scikit-learn's default settings
log_reg = LogisticRegression()
log_reg.fit(xtrain, ytrain)
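
The model above is fit on unscaled columns. One possible way to combine standardization and fitting, so that the scaler is learned from the training rows only, is a scikit-learn pipeline. The sketch below is an assumption-laden illustration rather than the notebook's method: the data is synthetic (from `make_classification`), and `max_iter=1000` and the seeds are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data standing in for the ten selected features.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)

# The scaler is fit on the training rows only, then reused for the test rows.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)

# On standardized inputs, coefficient magnitudes are comparable across features.
coefficients = model.named_steps["logisticregression"].coef_[0]

print(test_accuracy)
print(coefficients.round(3))
```

Since the inputs are standardized inside the pipeline, the fitted coefficients give a rough view of which features push predictions toward the positive class.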

In [9]:
ypred = log_reg.predict(xtest)
accuracy = accuracy_score(ytest, ypred)

In [10]:
print("Accuracy:", accuracy)
print("Confusion Matrix:")
print(confusion_matrix(ytest, ypred))


Accuracy: 0.9042553191489362
Confusion Matrix:
[[105   4]
 [ 14  65]]



## Model Performance
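The following metrics were calculated on the test set (188 samples) for the run whose output is shown above:

- **Accuracy**: 0.9043 (approximately 90.43%)

Because `train_test_split` is called without a `random_state`, these figures change from run to run. A different run of the same notebook gave an accuracy of 0.9468, with 109 true negatives, 2 false positives, 8 false negatives, and 69 true positives.

## Confusion Matrix
The confusion matrix below breaks down the model's predictions for the run shown above (0 = benign, 1 = malignant):

|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 105          | 4            |
| **Actual: 1** | 14           | 65           |

## Observations
1. The model reached an accuracy of **90.43%** on the test set in this run.
2. The confusion matrix shows:
   - 105 true negatives and 65 true positives.
   - 4 false positives and 14 false negatives.
3. The model separates the two classes well overall, but the 14 false negatives are malignant tumors classified as benign. Recall on the malignant class is 65 / 79 (about 82.3%), noticeably lower than the overall accuracy, and this is usually the more costly error in a diagnostic setting.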
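
Accuracy alone hides how the errors are distributed between the two classes. As a worked example, the sketch below derives the usual per-class metrics directly from the confusion matrix printed by cell 10, `[[105, 4], [14, 65]]`, using scikit-learn's layout (rows are actual labels, columns are predicted labels). The variable names are illustrative.

```python
# Confusion matrix printed by the notebook run: [[TN, FP], [FN, TP]]
cm = [[105, 4], [14, 65]]
tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total      # share of all predictions that are correct
precision = tp / (tp + fp)        # predicted malignant that really are malignant
recall = tp / (tp + fn)           # malignant cases the model actually catches
specificity = tn / (tn + fp)      # benign cases correctly left alone
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy    = {accuracy:.4f}")     # 0.9043
print(f"precision   = {precision:.4f}")    # 0.9420
print(f"recall      = {recall:.4f}")       # 0.8228
print(f"specificity = {specificity:.4f}")  # 0.9633
print(f"f1          = {f1:.4f}")           # 0.8784
```

The recall of about 0.82 reflects the 14 malignant cases predicted as benign, which the headline accuracy does not show.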
