# Breast Cancer Data Wrangling

## Contents:

* Introduction
* Imports
* Initial look at data

## Introduction

The dataset is structured with 11 variables for each patient:

* pid: Unique patient identifier.
* age: Age of the patient in years.
* meno: Menopausal status (0 = premenopausal, 1 = postmenopausal).
* size: Tumor size in millimeters.
* grade: Tumor grade (1 to 3 in this dataset); a higher grade indicates a more aggressive tumor.
* nodes: Number of positive lymph nodes, indicating the extent of lymph node involvement.
* pgr: Progesterone receptor level in fmol/l (femtomoles per liter).
* er: Estrogen receptor level in fmol/l.
* hormon: Hormonal therapy (0 = no hormonal therapy, 1 = hormonal therapy received).
* rfstime: Recurrence-free survival time in days, measured to the first of recurrence, death, or last follow-up.
* status: Event indicator (0 = alive without recurrence at last follow-up, 1 = recurrence or death).

The dataset contains records for 686 patients from a clinical trial conducted by the German Breast Cancer Study Group (GBSG) between 1984 and 1989.
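The 0/1 codes in the variable list can be hard to read in tables and plots. The following is a minimal sketch of one way to attach readable labels, assuming the coding described above. The names `CODE_LABELS` and `add_labels` are hypothetical helpers introduced for illustration, and the three example rows reuse values from the first rows of `gbsg.csv`.

```python
import pandas as pd

# Code-to-label mappings taken from the variable descriptions above.
CODE_LABELS = {
    "meno": {0: "premenopausal", 1: "postmenopausal"},
    "hormon": {0: "no hormonal therapy", 1: "hormonal therapy"},
    "status": {0: "alive without recurrence", 1: "recurrence or death"},
}


def add_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a '<col>_label' column for each coded variable."""
    out = df.copy()
    for col, mapping in CODE_LABELS.items():
        if col in out.columns:
            # Codes missing from the mapping become NaN, which makes them easy to spot.
            out[f"{col}_label"] = out[col].map(mapping)
    return out


# Three patients (pid 132, 1575, 130) with their coded values from the file.
example = pd.DataFrame(
    {
        "pid": [132, 1575, 130],
        "meno": [0, 1, 1],
        "hormon": [0, 0, 1],
        "status": [0, 1, 0],
    }
)
labeled = add_labels(example)
print(labeled[["pid", "meno_label", "hormon_label", "status_label"]])
```

Keeping the original integer columns alongside the labels means the numeric coding stays available for modeling.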


## Imports

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

## Initial look at data

In [10]:
breast_cancer_1 = pd.read_csv('../data/gbsg.csv')

In [11]:
breast_cancer_1.head()

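|   | Unnamed: 0 | pid | age | meno | size | grade | nodes | pgr | er | hormon | rfstime | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 132 | 49 | 0 | 18 | 2 | 2 | 0 | 0 | 0 | 1838 | 0 |
| 1 | 2 | 1575 | 55 | 1 | 20 | 3 | 16 | 0 | 0 | 0 | 403 | 1 |
| 2 | 3 | 1140 | 56 | 1 | 40 | 3 | 3 | 0 | 0 | 0 | 1603 | 0 |
| 3 | 4 | 769 | 45 | 0 | 25 | 3 | 1 | 0 | 4 | 0 | 177 | 0 |
| 4 | 5 | 130 | 65 | 1 | 30 | 2 | 5 | 0 | 36 | 1 | 1855 | 0 |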


In [5]:
breast_cancer_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 686 entries, 0 to 685
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  686 non-null    int64
 1   pid         686 non-null    int64
 2   age         686 non-null    int64
 3   meno        686 non-null    int64
 4   size        686 non-null    int64
 5   grade       686 non-null    int64
 6   nodes       686 non-null    int64
 7   pgr         686 non-null    int64
 8   er          686 non-null    int64
 9   hormon      686 non-null    int64
 10  rfstime     686 non-null    int64
 11  status      686 non-null    int64
dtypes: int64(12)
memory usage: 64.4 KB


In [6]:
breast_cancer_1.shape

(686, 12)
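The frame has 12 columns, one more than the 11 documented variables. The extra one is `Unnamed: 0`, which runs from 1 to 686 and looks like a row index that was saved into the CSV. The sketch below shows two common ways to deal with such a column. It assumes the file's header row starts with an empty field, which is how pandas ends up naming the column `Unnamed: 0`. The inline `csv_text` reproduces the first three rows only so that the example is self-contained.

```python
import io

import pandas as pd

# First rows of the file; the header's leading field is assumed to be empty.
csv_text = """,pid,age,meno,size,grade,nodes,pgr,er,hormon,rfstime,status
1,132,49,0,18,2,2,0,0,0,1838,0
2,1575,55,1,20,3,16,0,0,0,403,1
3,1140,56,1,40,3,3,0,0,0,1603,0
"""

# Option 1: read as before, then drop the leftover index column.
raw = pd.read_csv(io.StringIO(csv_text))
cleaned = raw.drop(columns=["Unnamed: 0"])

# Option 2: tell pandas at read time that the first column is the index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(raw.shape, cleaned.shape, indexed.shape)
```

With the real file, the same calls would apply to `'../data/gbsg.csv'` in place of the `io.StringIO` buffer.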

In [8]:
breast_cancer_1.isna().sum()

Unnamed: 0    0
pid           0
age           0
meno          0
size          0
grade         0
nodes         0
pgr           0
er            0
hormon        0
rfstime       0
status        0
dtype: int64
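Beyond missing values, a few other properties follow from the variable descriptions: `pid` should be unique, the coded columns should contain only 0 and 1, and `rfstime` should be positive. The function below is a rough sketch of such checks. `check_integrity` is a hypothetical helper, not part of the original notebook, and the demonstration frame reuses the five rows shown by `head()`.

```python
import pandas as pd

BINARY_COLUMNS = ("meno", "hormon", "status")


def check_integrity(df: pd.DataFrame) -> dict:
    """Run basic sanity checks and return a name -> bool mapping."""
    return {
        "no_missing": not bool(df.isna().any().any()),
        "pid_unique": bool(df["pid"].is_unique),
        "binary_coded": all(bool(df[c].isin([0, 1]).all()) for c in BINARY_COLUMNS),
        "rfstime_positive": bool((df["rfstime"] > 0).all()),
    }


# The five rows displayed by head(), restricted to the columns being checked.
sample = pd.DataFrame(
    {
        "pid": [132, 1575, 1140, 769, 130],
        "meno": [0, 1, 1, 0, 1],
        "hormon": [0, 0, 0, 0, 1],
        "rfstime": [1838, 403, 1603, 177, 1855],
        "status": [0, 1, 0, 0, 0],
    }
)
results = check_integrity(sample)
print(results)
```

On the full data, this would be called as `check_integrity(breast_cancer_1)`; any `False` entry points to a column worth inspecting before further analysis.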

In [15]:
breast_cancer_1.describe()

|   | Unnamed: 0 | pid | age | meno | size | grade | nodes | pgr | er | hormon | rfstime | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 686.0 | 686.0 | 686.0 | 686.0 | 686.0 | 686.0 | 686.0 | 686.0 | 686.0 | 686.0 | 686.0 | 686.0 |
| mean | 343.5 | 966.061224 | 53.052478 | 0.577259 | 29.329446 | 2.116618 | 5.010204 | 109.995627 | 96.252187 | 0.358601 | 1124.489796 | 0.43586 |
| std | 198.175427 | 495.506249 | 10.120739 | 0.494355 | 14.296217 | 0.582808 | 5.475483 | 202.331552 | 153.083963 | 0.47994 | 642.791948 | 0.496231 |
| min | 1.0 | 1.0 | 21.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 8.0 | 0.0 |
| 25% | 172.25 | 580.75 | 46.0 | 0.0 | 20.0 | 2.0 | 1.0 | 7.0 | 8.0 | 0.0 | 567.75 | 0.0 |
| 50% | 343.5 | 1015.5 | 53.0 | 1.0 | 25.0 | 2.0 | 3.0 | 32.5 | 36.0 | 0.0 | 1084.0 | 0.0 |
| 75% | 514.75 | 1340.5 | 61.0 | 1.0 | 35.0 | 2.0 | 7.0 | 131.75 | 114.0 | 1.0 | 1684.75 | 1.0 |
| max | 686.0 | 1819.0 | 80.0 | 1.0 | 120.0 | 3.0 | 51.0 | 2380.0 | 1144.0 | 1.0 | 2659.0 | 1.0 |
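For the 0/1 columns, the `mean` row of this summary is the proportion of patients coded 1, so multiplying it by the row count recovers a patient count. The arithmetic can be sketched as follows, using only the figures reported above. The means are rounded to six decimals in the output, so the products are rounded to the nearest integer.

```python
# Row count and means of the binary columns, copied from the describe() output.
n_patients = 686
binary_means = {"meno": 0.577259, "hormon": 0.358601, "status": 0.43586}

# proportion * n gives the number of patients coded 1 in each column.
counts = {col: round(mean * n_patients) for col, mean in binary_means.items()}
print(counts)  # {'meno': 396, 'hormon': 246, 'status': 299}
```

Read this way, 396 patients were postmenopausal, 246 received hormonal therapy, and 299 had a recurrence or died. On the DataFrame itself the same counts would come from `breast_cancer_1[["meno", "hormon", "status"]].sum()`.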
