**PIYUSH CHAUHAN B.TECH CSE 3RD YEAR**

# Objective: Implement a Gaussian Naïve Bayes classifier on the Breast Cancer dataset.

# Q1: Imports

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
from sklearn.preprocessing import StandardScaler

# Q2: Data Analysis

**Load the dataset**

In [2]:
df = pd.read_csv('/kaggle/input/breast-cancer-dataset/breast-cancer.csv')

In [3]:
# Display the first few rows of the dataset
print("First few rows of the dataframe:")
print(df.head())

First few rows of the dataframe:
         id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
0    842302         M        17.99         10.38          122.80     1001.0   
1    842517         M        20.57         17.77          132.90     1326.0   
2  84300903         M        19.69         21.25          130.00     1203.0   
3  84348301         M        11.42         20.38           77.58      386.1   
4  84358402         M        20.29         14.34          135.10     1297.0   

   smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0          0.11840           0.27760          0.3001              0.14710   
1          0.08474           0.07864          0.0869              0.07017   
2          0.10960           0.15990          0.1974              0.12790   
3          0.14250           0.28390          0.2414              0.10520   
4          0.10030           0.13280          0.1980              0.10430   

   ...  radius_worst  texture

In [4]:
# Check for missing values
print("\nMissing values in the dataset:")
print(df.isnull().sum())


Missing values in the dataset:
id                         0
diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64


In [5]:
# Drop rows with missing values (the check above found none, so df is unchanged)
df = df.dropna()

In [6]:
# Separate the features (X) from the target column 'diagnosis' (y)
X = df.drop(columns=['diagnosis'])
y = df['diagnosis']

In [7]:
# Check the structure and summary statistics of the dataset
print("\nData Info:")
print(df.info())
print("\nSummary Statistics:")
print(df.describe())


Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se  

# Q3: Data Preprocessing

**Assign data and labels**

In [8]:
# 'diagnosis' is the target variable: encode malignant ('M') as 1 and benign ('B') as 0
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
# X keeps every other column, including the 'id' identifier column
X = df.drop(columns=['diagnosis'])
y = df['diagnosis']
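
One thing worth checking after the encoding step: `Series.map` with a dictionary silently turns any value that is not a key into `NaN`, so running the mapping cell a second time (when the column already holds 0/1) would blank out the labels. The sketch below uses a small hypothetical label series, not the notebook's data, to show how unmapped values can be detected.

```python
import pandas as pd

# Hypothetical toy labels standing in for df['diagnosis'];
# "X" is a deliberately invalid value added for illustration.
labels = pd.Series(["M", "B", "B", "M", "B", "X"])

mapped = labels.map({"M": 1, "B": 0})

# Values missing from the dictionary become NaN instead of raising an error.
unmapped_values = labels[mapped.isna()].unique().tolist()

# value_counts() ignores NaN by default, so this is the usable class balance.
class_counts = mapped.value_counts().to_dict()

print("Unmapped values:", unmapped_values)
print("Class counts:", class_counts)
```

A check like `mapped.isna().any()` right after the mapping is one simple way to catch this before the labels reach the classifier.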

**Scaling Data**

In [9]:
# Standardize each feature to zero mean and unit variance (fitted on the full dataset, before the split)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

**Splitting Data**

In [10]:
# Hold out 20% of the samples as a test set; random_state=42 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
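
The cells above fit the scaler on all rows and split afterwards. A common alternative is to split first and let a pipeline fit the scaler on the training rows only, so that no test-set statistics enter preprocessing. The sketch below illustrates that ordering. It is not the notebook's method, and it assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the Kaggle CSV (that copy has no `id` column and encodes the target the other way round, 0 = malignant and 1 = benign). All `*_demo` names are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 569 samples, 30 numeric features.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Split first, using the same test_size and random_state as the notebook.
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits StandardScaler on the training rows only, then
# applies the same transformation to the test rows at prediction time.
pipe_demo = make_pipeline(StandardScaler(), GaussianNB())
pipe_demo.fit(X_tr_demo, y_tr_demo)

test_accuracy_demo = pipe_demo.score(X_te_demo, y_te_demo)
print(f"Held-out accuracy: {test_accuracy_demo:.4f}")
```

For Gaussian Naïve Bayes the practical difference is likely small, because the model estimates a separate mean and variance per feature anyway, but the split-then-fit ordering carries over safely to models that are sensitive to scaling.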

# Q4: Model Implementation

**Instantiate the Naïve Bayes classifier**

In [11]:
model = GaussianNB()

**Train the model**

In [12]:
model.fit(X_train, y_train)

**Prediction on the test set**

In [13]:
# Make predictions on the test set
y_pred = model.predict(X_test)

# Q5: Calculate the accuracy, precision, and recall

**Use average='macro' to weight both classes equally**

In [14]:
# Accuracy is the fraction of correct predictions. average='macro' computes precision and
# recall separately for each class (benign and malignant) and takes their unweighted mean.
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
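
As an illustration of what macro averaging does for a two-class problem, the sketch below uses ten hypothetical labels (not the notebook's predictions) and compares the per-class precision, its macro average, and the default binary precision, which scores only the positive class.

```python
from sklearn.metrics import precision_score

# Hypothetical toy labels: 0 = benign, 1 = malignant.
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# average=None returns one precision value per class, ordered by label.
per_class = precision_score(y_true_toy, y_pred_toy, average=None)

# Macro precision is the unweighted mean of the per-class values.
macro = precision_score(y_true_toy, y_pred_toy, average="macro")

# The default (average='binary') reports class 1 only.
binary = precision_score(y_true_toy, y_pred_toy)

print("Per class:", per_class)
print("Macro:    ", macro)
print("Binary:   ", binary)
```

Here class 0 has precision 5/6 and class 1 has precision 3/4, so the macro value is their mean, while the binary value is 3/4. If the malignant class is the one of interest, the binary setting may be the more direct number to report.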


**Display the results**

In [15]:
# Display the results
print(f"Accuracy: {accuracy * 100:.2f}%")
print(f"Precision: {precision * 100:.2f}%")
print(f"Recall: {recall * 100:.2f}%")

Accuracy: 96.49%
Precision: 96.73%
Recall: 95.81%
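
The notebook imports `confusion_matrix` but does not call it. A minimal sketch of how it could complement the three scores is shown below, using hypothetical toy labels rather than the notebook's `y_test` and `y_pred`, so the counts here say nothing about the actual model.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels: 0 = benign, 1 = malignant.
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])

# For a binary problem, ravel() unpacks the counts in this fixed order.
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

In the notebook the equivalent call would be `confusion_matrix(y_test, y_pred)`. The false-negative count is the one to watch in this setting, since it counts malignant cases predicted as benign.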
