# Breast Cancer data


The breast cancer data was downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)). It is also included as [wdbc.data](wdbc.data).

The description of the data can be found in [wdbc.names](wdbc.names).

Try using other functions we learned in W09_2.


<sub> Acknowledgement: Much of this ipynb is from [Link](https://towardsdatascience.com/pandas-for-biologists-f01c8a548b7c) </sub>


---

In [None]:
import numpy as np
import pandas as pd

# With default settings, the first data row is mistaken for the header.
bcd = pd.read_csv('wdbc.data')
bcd


In [None]:
bcd = pd.read_csv('wdbc.data', header = None, usecols=np.arange(12))
bcd

In [None]:
bcd = pd.read_csv('wdbc.data', header = None, usecols=np.arange(12), index_col=0)
bcd
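
The effect of `header=None`, `usecols`, and `index_col` can be checked on a tiny in-memory file. This is only a sketch: the rows below are made-up values in the same "ID, label, numbers" layout, not records from wdbc.data.

```python
import io

import numpy as np
import pandas as pd

# Toy, made-up rows in an "ID,label,values..." layout; NOT real wdbc.data records.
toy_csv = "101,M,1.5,2.5,9.0\n102,B,0.5,1.0,8.0\n103,B,0.7,1.2,7.0\n"

# Default: the first data row is consumed as the header, so one record is lost.
with_default = pd.read_csv(io.StringIO(toy_csv))

# header=None keeps every row, usecols keeps the first four columns,
# and index_col=0 turns the ID column into the row index.
fixed = pd.read_csv(io.StringIO(toy_csv), header=None,
                    usecols=np.arange(4), index_col=0)

print(with_default.shape)    # (2, 5)
print(fixed.shape)           # (3, 3)
print(fixed.index.tolist())  # [101, 102, 103]
```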

In [None]:
bcd.columns = [ 'Diagnosis', 'Radius','Texture', 'Perimeter', 'Area', 'Smoothness',
              'Compactness','Concavity','Concave points','Symmetry','Fractal dimension']
bcd.index.name = 'ID'
bcd.head()

In [None]:
bcd.tail()

In [None]:
bcd.describe()

In [None]:
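bcd.corr(numeric_only=True)  # Skip the text column 'Diagnosis'; recent pandas versions raise an error otherwise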
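
As a rough illustration of what `corr()` returns, the sketch below uses made-up numbers in which one column is an exact multiple of the other, so their Pearson correlation is 1. The column names mirror the notebook, but the values are hypothetical.

```python
import pandas as pd

# Toy, made-up values (not from wdbc.data): Perimeter is exactly 6 * Radius.
toy = pd.DataFrame({
    "Diagnosis": ["M", "B", "B", "M"],
    "Radius": [4.0, 1.0, 2.0, 3.0],
    "Perimeter": [24.0, 6.0, 12.0, 18.0],
})

# numeric_only=True leaves the text column out of the correlation matrix.
c = toy.corr(numeric_only=True)

print(list(c.columns))                        # ['Radius', 'Perimeter']
print(round(c.loc["Radius", "Perimeter"], 6)) # 1.0
```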

In [None]:
bcd.columns

In [None]:
bcd.shape

In [None]:
sub = bcd.loc[bcd["Diagnosis"] == 'M', ["Diagnosis","Compactness", "Perimeter"]]
sub

In [None]:
sub.drop("Compactness", axis=1)

In [None]:
sub
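
The two cells above show that `drop` returns a new DataFrame and leaves `sub` untouched. A minimal sketch of the same behaviour on toy data (made-up values and hypothetical IDs):

```python
import pandas as pd

# Toy, made-up values (not from wdbc.data); the IDs 11 and 12 are hypothetical.
toy = pd.DataFrame(
    {"Diagnosis": ["M", "M"], "Compactness": [0.25, 0.5], "Perimeter": [100.0, 120.0]},
    index=pd.Index([11, 12], name="ID"),
)

dropped = toy.drop("Compactness", axis=1)  # returns a new DataFrame

print(list(dropped.columns))  # ['Diagnosis', 'Perimeter']
print(list(toy.columns))      # ['Diagnosis', 'Compactness', 'Perimeter']

# To keep the change, assign the result back to the name.
toy = toy.drop("Compactness", axis=1)
print(list(toy.columns))      # ['Diagnosis', 'Perimeter']
```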

In [None]:
sub.loc[927241, 'Compactness']    # Preferred: one .loc call with the row label and the column label
                                  # The alternatives below also work, but they are harder to read.

In [None]:
sub.loc[927241]['Compactness']

In [None]:
sub['Compactness'].loc[927241]
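
The three lookups return the same value. The sketch below checks this on toy data (made-up values, hypothetical IDs) and also shows the single `.loc` form used for assignment, which is where it is most useful.

```python
import pandas as pd

# Toy, made-up values (not from wdbc.data); the IDs 11 and 12 are hypothetical.
toy = pd.DataFrame(
    {"Compactness": [0.25, 0.5], "Perimeter": [100.0, 120.0]},
    index=pd.Index([11, 12], name="ID"),
)

a = toy.loc[12, "Compactness"]    # single lookup by row and column label
b = toy.loc[12]["Compactness"]    # row first, then column
c = toy["Compactness"].loc[12]    # column first, then row
print(a == b == c)                # True

# Writing a value is done with the single .loc form as well.
toy.loc[12, "Compactness"] = 0.75
print(toy.loc[12, "Compactness"])  # 0.75
```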

In [None]:
sub.drop(926682, axis=0)  # 926682 is the ID (index label) of the row, not its row number

In [None]:
sub.to_csv('test.csv')

In [None]:
sub.assign(a_name_for_new_column = sub.sum(axis=1, numeric_only=True))  # Sum only the numeric columns of each row

In [None]:
bcd_2 = bcd.copy()
bcd_2['Diagnosis'] = bcd_2['Diagnosis'].replace(to_replace=['M','B'], value=['Malignant', 'Benign'])
bcd_2

In [None]:
bcd
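
Because the relabelling was done on a copy, `bcd` still holds the original codes. The sketch below repeats the idea on a toy column and passes the mapping as a dict, which `replace` also accepts; the label strings are chosen here for illustration.

```python
import pandas as pd

# Toy, made-up column (not from wdbc.data).
toy = pd.DataFrame({"Diagnosis": ["M", "B", "B"]})

# Hypothetical label mapping, given as a dict instead of two parallel lists.
labels = {"M": "Malignant", "B": "Benign"}

toy_2 = toy.copy()  # work on a copy so the original stays as it was
toy_2["Diagnosis"] = toy_2["Diagnosis"].replace(labels)

print(toy_2["Diagnosis"].tolist())  # ['Malignant', 'Benign', 'Benign']
print(toy["Diagnosis"].tolist())    # ['M', 'B', 'B']
```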

In [None]:
bcd_2["Diagnosis"].value_counts()

In [None]:
bcd_2["Diagnosis"].value_counts(normalize = True)
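
With `normalize=True`, each count is divided by the total number of rows, so the results sum to 1. A small sketch on a made-up series:

```python
import pandas as pd

# Toy, made-up labels (not from wdbc.data).
toy = pd.Series(["B", "B", "B", "M"], name="Diagnosis")

counts = toy.value_counts()                    # absolute counts
fractions = toy.value_counts(normalize=True)   # counts divided by len(toy)

print(counts["B"], counts["M"])        # 3 1
print(fractions["B"], fractions["M"])  # 0.75 0.25
```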

In [None]:
bcd.groupby('Diagnosis').describe()

In [None]:
bcd.groupby('Diagnosis')['Radius'].describe()

In [None]:
bcd.groupby('Diagnosis').describe()['Radius']
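
The last two cells give the same table. Selecting the column before calling `describe()` avoids summarising columns that are then thrown away. A sketch on toy data (made-up values) that compares the two results:

```python
import numpy as np
import pandas as pd

# Toy, made-up values (not from wdbc.data).
toy = pd.DataFrame({
    "Diagnosis": ["M", "M", "B", "B"],
    "Radius": [20.0, 18.0, 12.0, 10.0],
    "Texture": [22.0, 20.0, 15.0, 17.0],
})

a = toy.groupby("Diagnosis")["Radius"].describe()   # select first, then summarise
b = toy.groupby("Diagnosis").describe()["Radius"]   # summarise everything, then select

print(list(a.columns) == list(b.columns))           # True
print(np.allclose(a.to_numpy(), b.to_numpy()))      # True
print(a.loc["M", "mean"])                           # 19.0
```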

# Visualization

Pandas uses matplotlib as its default plotting backend.

In [None]:
bcd.plot(kind = 'scatter', x = 'Radius', y = 'Texture')

In [None]:
# Tricks to vary the dot properties in a scatter plot

size = bcd.Radius**2  # The size of dots will follow the square of Radius.
color = np.where(bcd.Diagnosis == 'M',  'blue', 'orange')  # Check the "where()" function in numpy

bcd.plot.scatter(x='Radius', y='Texture', s=size, c=color)   # Dot size follows Radius squared; color marks the diagnosis (blue = M, orange = B)
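
`np.where(condition, a, b)` picks `a` where the condition is true and `b` elsewhere, one element at a time. The sketch below shows the arrays it produces for a toy table (made-up values), without drawing the plot:

```python
import numpy as np
import pandas as pd

# Toy, made-up values (not from wdbc.data).
toy = pd.DataFrame({"Diagnosis": ["M", "B", "M"], "Radius": [20.0, 10.0, 15.0]})

# One color per row: 'blue' where the diagnosis is 'M', 'orange' otherwise.
color = np.where(toy["Diagnosis"] == "M", "blue", "orange")

# One marker size per row: the square of Radius.
size = toy["Radius"] ** 2

print(color.tolist())  # ['blue', 'orange', 'blue']
print(size.tolist())   # [400.0, 100.0, 225.0]
```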

In [None]:
bcd.plot(kind = 'hist', y = 'Radius')

In [None]:
bcd[['Radius', 'Texture']].plot(kind = 'hist', alpha = 0.6)

In [None]:
bcd[['Radius', 'Texture']].plot(kind = 'density', bw_method = 0.5)  # bw_method: smoothing factor

In [None]:
bcd[['Radius', 'Texture']].plot(kind = 'box')

## Seaborn

For more polished figures, try Seaborn. See the [Seaborn example gallery](https://seaborn.pydata.org/examples/index.html).

Below are a couple of examples.

In [None]:
import seaborn as sns

sns.heatmap(bcd.corr(numeric_only=True), cmap='Blues', linewidths=0.5)  # Correlate numeric columns only

In [None]:
sns.heatmap(bcd.groupby('Diagnosis').corr())

In [None]:
sub = bcd.loc[:, ['Diagnosis', 'Radius', 'Perimeter', 'Smoothness']]
sns.pairplot(sub, hue = 'Diagnosis')