# Breast Cancer Prediction using Logistic Regression

## Introduction
This project aims to predict whether a tumor is malignant (**M**) or benign (**B**) based on clinical and histological data. The dataset used for this analysis is the **Breast Cancer Dataset**, which contains various features extracted from cell nuclei in digital images.

## Workflow Overview
1. **Data Preprocessing**:
   - Handled missing values (if any).
   - Encoded categorical variables (e.g., diagnosis as 0 and 1).
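   - Scaled numerical features with `StandardScaler`. The scaled copy is used only for the correlation analysis; the model itself is trained on the unscaled columns.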
2. **Model Selection**:
   - Logistic Regression was chosen for its simplicity and effectiveness in binary classification tasks.
3. **Evaluation**:
   - Performance was evaluated using **Accuracy**, **Precision**, **Recall**, and the **Confusion Matrix**.
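
The three steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration, not the notebook's exact code: the synthetic data from `make_classification` is a stand-in for the real CSV (an assumption), and the scaler sits inside the pipeline so that it is fitted on the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (assumption): 569 rows, 10 numeric features.
X, y = make_classification(
    n_samples=569, n_features=10, n_informative=5, random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# The pipeline fits the scaler on the training data, then reuses it at predict time.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=10000, random_state=42),
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into preprocessing, which can happen when `fit_transform` is called on the full dataset before splitting.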


In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


/kaggle/input/breast-cancer-dataset/breast-cancer.csv


In [2]:
file_path = '/kaggle/input/breast-cancer-dataset/breast-cancer.csv'  
p = pd.read_csv(file_path)
print(p['diagnosis'].unique())
p

['M' 'B']


Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,926424,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,926682,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,926954,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,927241,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


In [3]:
p['diagnosis'] = p['diagnosis'].map({'M': 1, 'B': 0})
features = p.drop('diagnosis', axis=1)

In [4]:
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

scaled_df = pd.DataFrame(scaled_features, columns=features.columns)
scaled_df['diagnosis'] = p['diagnosis']

In [5]:
correlation_matrix = scaled_df.corr()
print(correlation_matrix['diagnosis'].sort_values(ascending=False))

diagnosis                  1.000000
concave points_worst       0.793566
perimeter_worst            0.782914
concave points_mean        0.776614
radius_worst               0.776454
perimeter_mean             0.742636
area_worst                 0.733825
radius_mean                0.730029
area_mean                  0.708984
concavity_mean             0.696360
concavity_worst            0.659610
compactness_mean           0.596534
compactness_worst          0.590998
radius_se                  0.567134
perimeter_se               0.556141
area_se                    0.548236
texture_worst              0.456903
smoothness_worst           0.421465
symmetry_worst             0.416294
texture_mean               0.415185
concave points_se          0.408042
smoothness_mean            0.358560
symmetry_mean              0.330499
fractal_dimension_worst    0.323872
compactness_se             0.292999
concavity_se               0.253730
fractal_dimension_se       0.077972
id                         0

In [6]:
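# The ten features most strongly correlated with the diagnosis (see the output above)
selected_columns = [
    "concave points_worst", "perimeter_worst", "concave points_mean",
    "radius_worst", "perimeter_mean", "area_worst", "radius_mean",
    "area_mean", "concavity_mean", "concavity_worst",
]
new_features = p[selected_columns]
new_target = p["diagnosis"]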
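
The feature list above was copied by hand from the correlation output. As a hedged alternative, the same selection could be made programmatically. In the sketch below, the helper name `top_correlated_features` and the toy frame are hypothetical, and ranking by absolute correlation is an assumption (the notebook sorts by signed correlation, where the top ten happen to be positive).

```python
import pandas as pd


def top_correlated_features(df, target, k):
    """Return the k column names with the largest absolute Pearson correlation to `target`."""
    corr = df.corr()[target].drop(target)
    return corr.abs().nlargest(k).index.tolist()


# Toy frame with made-up values, only to show the call.
toy = pd.DataFrame({
    "diagnosis": [1, 1, 0, 0, 1, 0],
    "strong": [9.0, 8.5, 1.0, 1.5, 9.5, 0.5],
    "weak": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "inverse": [0.0, 0.5, 5.0, 4.5, 0.2, 5.5],
})

print(top_correlated_features(toy, "diagnosis", k=2))
```

In the notebook this would be called on the frame that still holds the numeric `diagnosis` column, with `id` dropped first since it is an identifier rather than a measurement.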

In [7]:
xtrain, xtest, ytrain, ytest = train_test_split(
    new_features.values, new_target, test_size=0.33, random_state=42
)


In [8]:
log_reg = LogisticRegression(max_iter=10000, random_state=42)
log_reg.fit(xtrain, ytrain)

In [9]:
ypred = log_reg.predict(xtest)
accuracy = accuracy_score(ytest, ypred)

In [10]:
# Precision and recall for the malignant class (label 1)
precision = precision_score(ytest, ypred)
recall = recall_score(ytest, ypred)

# Accuracy over all test samples
accuracy = accuracy_score(ytest, ypred)

print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')


Accuracy: 0.973404255319149
Precision: 0.9558823529411765
Recall: 0.9701492537313433


In [11]:
conf_matrix = confusion_matrix(ytest, ypred)
print('Confusion Matrix:')
print(conf_matrix)

Confusion Matrix:
[[118   3]
 [  2  65]]


## Metrics

| Metric      | Value  |
|-------------|--------|
| **Accuracy** | 97.34% |
| **Precision** | 95.59% |
| **Recall**   | 97.01% |

### Confusion Matrix


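With labels 0 (benign) and 1 (malignant), `confusion_matrix` puts true labels on rows and predicted labels on columns, so the printed array reads `[[TN, FP], [FN, TP]]`:

- **True Negatives (TN)**: 118
- **False Positives (FP)**: 3
- **False Negatives (FN)**: 2
- **True Positives (TP)**: 65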
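
As a cross-check, the reported metrics can be recomputed directly from the printed matrix. scikit-learn's `confusion_matrix` places true labels on rows and predicted labels on columns, so with labels 0 (benign) and 1 (malignant) the array is laid out as `[[TN, FP], [FN, TP]]`. The sketch below hard-codes the matrix from the notebook output and uses only plain Python.

```python
# Confusion matrix as printed by the notebook:
# rows = true label, columns = predicted label, 0 = benign, 1 = malignant.
conf_matrix = [[118, 3],
               [2, 65]]

(tn, fp), (fn, tp) = conf_matrix

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted-malignant cases that are malignant
recall = tp / (tp + fn)     # share of malignant cases that were detected

print(f"Accuracy:  {accuracy:.4f}")   # 0.9734
print(f"Precision: {precision:.4f}")  # 0.9559
print(f"Recall:    {recall:.4f}")     # 0.9701
```

These values match the output of `accuracy_score`, `precision_score`, and `recall_score` shown earlier.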



### Analysis of Recall Importance

Given the nature of the dataset, it is critical to minimize False Negatives (FN), where a malignant tumor is mistakenly classified as benign. Such an error could delay or prevent treatment, potentially endangering a patient's life.

False negatives lower recall, not precision (precision is reduced by false positives). Therefore, achieving a high `recall_score` is of utmost importance in this context, even if it costs some precision.
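
One common way to reduce false negatives is to lower the decision threshold applied to the predicted malignancy probability, accepting more false positives in exchange. The sketch below illustrates the trade-off in plain Python. The labels and probabilities are made-up toy values, and the helper names are hypothetical; in the notebook the probabilities would come from `log_reg.predict_proba(xtest)[:, 1]`.

```python
def predict_with_threshold(probabilities, threshold):
    """Label a sample malignant (1) when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive (malignant) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels and malignancy probabilities, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.95, 0.80, 0.45, 0.35, 0.40, 0.20, 0.10, 0.05]

for threshold in (0.5, 0.3):
    y_pred = predict_with_threshold(probs, threshold)
    precision, recall = precision_recall(y_true, y_pred)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
# threshold=0.5: precision=1.00, recall=0.50
# threshold=0.3: precision=0.80, recall=1.00
```

On real data the threshold would need to be chosen on a validation split rather than on the test set.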
