
# IT326 – Phase 1: Data Selection
**Project:** Breast Cancer Wisconsin (Diagnostic)

## Project Goal
The goal of this project is to apply data mining techniques to the Breast Cancer Wisconsin (Diagnostic) dataset to support early detection of breast cancer. We first use **classification** to label each tumor as benign or malignant, and later **clustering** to explore underlying patterns in the measurements that can assist medical decision-making and research.

## Dataset Source
**Name:** Breast Cancer Wisconsin (Diagnostic)  
**Source:** UCI Machine Learning Repository  
**Link:** https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic


In [17]:
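import pandas as pd
from IPython.display import display

# Load the raw dataset from the project folder.
data = pd.read_csv('DataSet/Raw_dataset.csv')

# If no column is labelled 'diagnosis', treat the second column as the class label.
if 'diagnosis' not in data.columns:
    data = data.rename(columns={data.columns[1]: 'diagnosis'})

# Separate identifier/index columns (matched by name) from the measurement features.
meta_cols = [c for c in data.columns if 'id' in c.lower() or c.lower().startswith('unnamed')]
feature_cols = [c for c in data.columns if c not in meta_cols + ['diagnosis']]

# Preview the first five rows.
print('Sample of the dataset:')
display(data.head())

print('\nShape (rows, columns):', data.shape)

# Column names, non-null counts, and dtypes.
print('\nDataset info:')
data.info()

print('\nFeature dtypes:\n')
print(data[feature_cols].dtypes)

# Number of records per class label, including missing labels if any.
print('\nClass distribution:')
print(data['diagnosis'].value_counts(dropna=False))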


Sample of the dataset:


|   | 842302 | diagnosis | 17.99 | 10.38 | 122.8 | 1001 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 1 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 2 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 3 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |
| 4 | 843786 | M | 12.45 | 15.7 | 82.57 | 477.1 | 0.1278 | 0.17 | 0.1578 | 0.08089 | ... | 15.47 | 23.75 | 103.4 | 741.6 | 0.1791 | 0.5249 | 0.5355 | 0.1741 | 0.3985 | 0.1244 |
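
The column labels in the sample above (`842302`, `17.99`, `10.38`, ...) look like measurements rather than names. This suggests that `Raw_dataset.csv` has no header row and that `read_csv` used the first record as the header. The UCI description lists 569 instances, one more than the 568 rows reported in this output. The following is a minimal sketch of one way to keep that record, assuming the file follows the UCI layout (ID, diagnosis, then ten mean, ten standard-error, and ten worst values). The column names, the `load_headerless` helper, and the two made-up sample rows are illustrative and not taken from the notebook.

```python
import io

import pandas as pd

# Assumed names, following the UCI attribute order: 10 measurements,
# each reported as mean, standard error ("se"), and worst value.
BASE = ["radius", "texture", "perimeter", "area", "smoothness",
        "compactness", "concavity", "concave_points", "symmetry",
        "fractal_dimension"]
STATS = ["mean", "se", "worst"]
COLUMNS = ["id", "diagnosis"] + [f"{b}_{s}" for s in STATS for b in BASE]


def load_headerless(source):
    """Read a CSV that has no header row, so the first record stays data."""
    return pd.read_csv(source, header=None, names=COLUMNS)


# Two made-up records with 32 fields each (placeholder values, not real data).
sample = "\n".join(
    ",".join([str(i), label] + ["0.5"] * 30)
    for i, label in [(1, "M"), (2, "B")]
)

df = load_headerless(io.StringIO(sample))
naive = pd.read_csv(io.StringIO(sample))  # default: first line becomes the header

print(df.shape)     # (2, 32): both records kept
print(naive.shape)  # (1, 32): first record consumed as column labels
```

With named columns, the identifier can be excluded by name (`id`) instead of relying on a substring match against labels that are really data values. For the actual file, `load_headerless('DataSet/Raw_dataset.csv')` would be the equivalent call, provided the layout assumption holds.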



Shape (rows, columns): (568, 32)

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568 entries, 0 to 567
Data columns (total 32 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   842302     568 non-null    int64  
 1   diagnosis  568 non-null    object 
 2   17.99      568 non-null    float64
 3   10.38      568 non-null    float64
 4   122.8      568 non-null    float64
 5   1001       568 non-null    float64
 6   0.1184     568 non-null    float64
 7   0.2776     568 non-null    float64
 8   0.3001     568 non-null    float64
 9   0.1471     568 non-null    float64
 10  0.2419     568 non-null    float64
 11  0.07871    568 non-null    float64
 12  1.095      568 non-null    float64
 13  0.9053     568 non-null    float64
 14  8.589      568 non-null    float64
 15  153.4      568 non-null    float64
 16  0.006399   568 non-null    float64
 17  0.04904    568 non-null    float64
 18  0.05373    568 non-null    float64
 19  0