# Data Wrangling 

In [1]:
import pandas as pd

## Loading

In [127]:
bcd = pd.read_csv('./data/breast_cancer_data.csv')
bcd

Unnamed: 0,patient_id,clump_thickness,cell_size_uniformity,cell_shape_uniformity,marginal_adhesion,single_ep_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class,doctor_name
0,1000025,5.0,1.0,1,1,2,1,3.0,1.0,1,benign,Dr. Doe
1,1002945,5.0,4.0,4,5,7,10,3.0,2.0,1,benign,Dr. Smith
2,1015425,3.0,1.0,1,1,2,2,3.0,1.0,1,benign,Dr. Lee
3,1016277,6.0,8.0,8,1,3,4,3.0,7.0,1,benign,Dr. Smith
4,1017023,4.0,1.0,1,3,2,1,3.0,1.0,1,benign,Dr. Wong
...,...,...,...,...,...,...,...,...,...,...,...,...
694,776715,3.0,1.0,1,1,3,2,1.0,1.0,1,benign,Dr. Lee
695,841769,2.0,1.0,1,1,2,1,1.0,1.0,1,benign,Dr. Smith
696,888820,5.0,10.0,10,3,7,3,8.0,10.0,2,malignant,Dr. Lee
697,897471,4.0,8.0,6,4,3,4,10.0,6.0,1,malignant,Dr. Lee


## Exploring the Data Frame

In [64]:
bcd.describe()

Unnamed: 0,patient_id,clump_thickness,cell_size_uniformity,cell_shape_uniformity,marginal_adhesion,single_ep_cell_size,bland_chromatin,normal_nucleoli,mitoses
count,699.0,698.0,698.0,699.0,699.0,699.0,695.0,698.0,699.0
mean,1071704.0,4.416905,3.137536,3.207439,2.793991,3.216023,3.447482,2.868195,1.589413
std,617095.7,2.817673,3.052575,2.971913,2.843163,2.2143,2.441191,3.055647,1.715078
min,61634.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,870688.5,2.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0
50%,1171710.0,4.0,1.0,1.0,1.0,2.0,3.0,1.0,1.0
75%,1238298.0,6.0,5.0,5.0,3.5,4.0,5.0,4.0,1.0
max,13454350.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0


In [65]:
bcd.shape

(699, 12)

In [66]:
bcd.dtypes

patient_id                 int64
clump_thickness          float64
cell_size_uniformity     float64
cell_shape_uniformity      int64
marginal_adhesion          int64
single_ep_cell_size        int64
bare_nuclei               object
bland_chromatin          float64
normal_nucleoli          float64
mitoses                    int64
class                     object
doctor_name               object
dtype: object

In [67]:
bcd.nunique()

patient_id               645
clump_thickness           10
cell_size_uniformity      10
cell_shape_uniformity     10
marginal_adhesion         10
single_ep_cell_size       10
bare_nuclei               11
bland_chromatin           10
normal_nucleoli           10
mitoses                    9
class                      2
doctor_name                4
dtype: int64

In [68]:
bcd.columns

Index(['patient_id', 'clump_thickness', 'cell_size_uniformity',
       'cell_shape_uniformity', 'marginal_adhesion', 'single_ep_cell_size',
       'bare_nuclei', 'bland_chromatin', 'normal_nucleoli', 'mitoses', 'class',
       'doctor_name'],
      dtype='object')

## Modifying a text column

Let's look at the column for the doctors' names:

In [99]:
bcd['doctor_name']

0        Dr. Doe
1      Dr. Smith
2        Dr. Lee
3      Dr. Smith
4       Dr. Wong
         ...    
694      Dr. Lee
695    Dr. Smith
696      Dr. Lee
697      Dr. Lee
698     Dr. Wong
Name: doctor_name, Length: 699, dtype: object

The doctor name data are redundant: every entry begins with "Dr. ", but the column name already tells us these are doctors. The entries also contain white space, which makes them awkward to work with. So let's modify this column so that it contains only the doctors' surnames.

One great thing about pandas is that it has versions of many of Python's string methods that operate element-wise on a column of strings. We want to separate the "Dr. " from the actual name, which is exactly what Python's `str.split()` method does. So chances are, pandas has a version of this method that operates element-wise on a Series of strings.
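
As a quick illustration of what "element-wise" means here, the sketch below uses a small made-up Series (not the notebook's data) and two of the string methods available through the `.str` accessor. The variable names are placeholders chosen for this example.

```python
import pandas as pd

# Toy Series, made up for illustration (not the notebook's data).
names = pd.Series(['Dr. Doe', 'Dr. Smith', 'Dr. Lee'])

# A plain Python string method works on one string at a time...
print('Dr. Doe'.upper())

# ...while the .str accessor applies the same method to every element.
upper_names = names.str.upper()
name_lengths = names.str.len()
print(upper_names)
print(name_lengths)
```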

---

#### String Splitting Review:

Let's briefly remind ourselves of splitting up Python strings and extracting bits of them.

In [70]:
# Here's a string of the form: surname, first initial.
myStr = 'SirString, A.'
print(myStr)

SirString, A.


Let's say we wanted to get the surname. We could split this string into a list at the white space like this:

In [71]:
spltStr = myStr.split()
print(spltStr)

['SirString,', 'A.']


We now have a list in which the items contain the text on either side of the split. This is close: the first entry in the list has the surname, but it also has an unwanted comma. 

Let's split the string at the comma instead:

In [72]:
spltStr = myStr.split(',')   # tell Python to split at commas
print(spltStr)

['SirString', ' A.']


And now we can get the surname by indexing:

In [73]:
surname = spltStr[0]
print(surname)

SirString


---

See if you can extract the surname from `myStr` in one line of code:

In [74]:
surname = myStr.split(',')[0]
print(surname)

SirString


---

Alright, time to replace the `bcd['doctor_name']` column values with just the doctors' surnames.

We could do this in one step, but let's break it out for clarity. First, let's pull the name column out into its own Series.

In [128]:
dr_names = bcd['doctor_name']
dr_names

0        Dr. Doe
1      Dr. Smith
2        Dr. Lee
3      Dr. Smith
4       Dr. Wong
         ...    
694      Dr. Lee
695    Dr. Smith
696      Dr. Lee
697      Dr. Lee
698     Dr. Wong
Name: doctor_name, Length: 699, dtype: object

Now let's split all the names in the `doctor_name` column at the whitespace by using the pandas `Series.str.split()` method.

In [129]:
split_dr_names = dr_names.str.split()
split_dr_names

0        [Dr., Doe]
1      [Dr., Smith]
2        [Dr., Lee]
3      [Dr., Smith]
4       [Dr., Wong]
           ...     
694      [Dr., Lee]
695    [Dr., Smith]
696      [Dr., Lee]
697      [Dr., Lee]
698     [Dr., Wong]
Name: doctor_name, Length: 699, dtype: object

Now we have a column of lists, each with two elements. The first element of each list is the "Dr." bit, and the second is the surname we want.

We can get the surnames by using pandas string indexing, `Series.str[index]`.

In [130]:
surnames = split_dr_names.str[1]
surnames

0        Doe
1      Smith
2        Lee
3      Smith
4       Wong
       ...  
694      Lee
695    Smith
696      Lee
697      Lee
698     Wong
Name: doctor_name, Length: 699, dtype: object

In [131]:
bcd['doctor_name'] = surnames

In [132]:
bcd['doctor_name']

0        Doe
1      Smith
2        Lee
3      Smith
4       Wong
       ...  
694      Lee
695    Smith
696      Lee
697      Lee
698     Wong
Name: doctor_name, Length: 699, dtype: object

Success!
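
For comparison, the one-step version mentioned earlier might look like the sketch below. It runs on a small made-up frame (`toy` is a placeholder, not the notebook's `bcd`) and assumes every entry has the form "Dr. Surname".

```python
import pandas as pd

# Toy frame standing in for bcd (made-up rows, same column name).
toy = pd.DataFrame({'doctor_name': ['Dr. Doe', 'Dr. Smith', 'Dr. Lee']})

# Split at whitespace and keep the second piece, all in one line.
toy['doctor_name'] = toy['doctor_name'].str.split().str[1]
print(toy)
```

Note that keeping only the second piece would truncate a surname made of more than one word, so this shortcut is only safe when the names are as regular as they appear to be here.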

## Dealing with missing data

### Finding missing values

Even though this data set isn't all that large:

In [133]:
bcd.shape

(699, 12)

699 rows is a lot to look through in order to find missing values by hand.

We can test for missing values using the `DataFrame.isna()` method. 

In [134]:
bcd.isna()

Unnamed: 0,patient_id,clump_thickness,cell_size_uniformity,cell_shape_uniformity,marginal_adhesion,single_ep_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class,doctor_name
0,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
694,False,False,False,False,False,False,False,False,False,False,False,False
695,False,False,False,False,False,False,False,False,False,False,False,False
696,False,False,False,False,False,False,False,False,False,False,False,False
697,False,False,False,False,False,False,False,False,False,False,False,False


By itself, that doesn't help us much. But it becomes useful if we combine it with summation (remember that `True` counts as 1 and `False` counts as 0):

In [135]:
bcd.isna().sum()

patient_id               0
clump_thickness          1
cell_size_uniformity     1
cell_shape_uniformity    0
marginal_adhesion        0
single_ep_cell_size      0
bare_nuclei              2
bland_chromatin          4
normal_nucleoli          1
mitoses                  0
class                    0
doctor_name              0
dtype: int64

Now we have the counts by variable, and can easily see that there are missing values for a few of the variables.
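
To make the counting trick concrete, here is a minimal sketch on a made-up frame (the column names `a`, `b`, and `c` are placeholders). It also shows that taking the mean of the same Boolean frame gives the fraction of missing values in each column, which can be easier to read on larger data sets.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing entries (made up for illustration).
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': ['x', 'y', None],
    'c': [np.nan, np.nan, 1.0],
})

missing_counts = toy.isna().sum()      # number of missing values per column
missing_fraction = toy.isna().mean()   # share of missing values per column
print(missing_counts)
print(missing_fraction)
```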

The bland chromatin variable has the most missing values. Let's check these and make sure everything else looks normal.

In [148]:
bcd[bcd['bland_chromatin'].isna()]

Unnamed: 0,patient_id,clump_thickness,cell_size_uniformity,cell_shape_uniformity,marginal_adhesion,single_ep_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class,doctor_name
342,814265,2.0,1.0,1,1,2,1,,1.0,1,benign,Lee
343,814911,1.0,1.0,1,1,2,1,,1.0,1,benign,Doe
359,873549,10.0,3.0,5,4,3,7,,5.0,3,malignant,Doe
365,897172,2.0,1.0,1,1,2,1,,1.0,1,benign,Lee


---

In the cell below, check the rows that have missing bare nuclei values.
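
One possible approach reuses the Boolean-mask pattern from the bland chromatin check. The sketch below runs on a small made-up frame so that it is self-contained; the values are invented, and `toy` and `missing_bare` are placeholder names.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; bare_nuclei holds strings because the
# notebook reports that column as dtype object.
toy = pd.DataFrame({
    'bare_nuclei': ['1', np.nan, '10', np.nan],
    'bland_chromatin': [3.0, 2.0, np.nan, 1.0],
})

# Keep only the rows where bare_nuclei is missing.
missing_bare = toy[toy['bare_nuclei'].isna()]
print(missing_bare)
```

With the notebook's data, the same pattern would be `bcd[bcd['bare_nuclei'].isna()]`.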

---

We can also count the missing values in each row by summing across the columns (`axis = 1`). Sorting the result puts the rows with the most missing values last:

In [153]:
bcd.isna().sum(axis = 1).sort_values()

0      0
462    0
463    0
464    0
465    0
      ..
6      1
12     1
365    1
342    1
359    1
Length: 699, dtype: int64

The maximum tells us the most values that any single row is missing:

In [156]:
bcd.isna().sum(axis = 1).max()

1
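
A related question is how many rows contain at least one missing value. The sketch below, on a made-up frame with placeholder names, contrasts the per-row count with a per-row "any missing?" test.

```python
import numpy as np
import pandas as pd

# Toy frame (made up for illustration).
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [np.nan, np.nan, 1.0, 2.0],
})

per_row = toy.isna().sum(axis=1)                   # missing values in each row
rows_with_missing = toy.isna().any(axis=1).sum()   # rows with at least one
print(per_row)
print(rows_with_missing)
```

The second number is the count of rows that dropping incomplete observations would remove.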

### Dealing with missing values

Now that we have determined that there are missing values, we have to determine how to deal with them. 

#### Ignoring missing values elementwise

One way to handle missing values is just to ignore them. Most of the standard math and statistical functions will do that by default.

In [144]:
bcd.mean(numeric_only = True)  # the numeric_only refers to columns, not missing values

clump_thickness          4.416905
cell_size_uniformity     3.137536
cell_shape_uniformity    3.207439
marginal_adhesion        2.793991
single_ep_cell_size      3.216023
bland_chromatin          3.447482
normal_nucleoli          2.868195
mitoses                  1.589413
dtype: float64

That worked. If the patient ID is stored as an integer, though, it gets averaged along with the real measurements, which is meaningless. We can prevent that by converting it to a string:

In [143]:
bcd['patient_id'] = bcd['patient_id'].astype('string')

In [158]:
bcd.mean(numeric_only = True)

clump_thickness          4.416905
cell_size_uniformity     3.137536
cell_shape_uniformity    3.207439
marginal_adhesion        2.793991
single_ep_cell_size      3.216023
bland_chromatin          3.447482
normal_nucleoli          2.868195
mitoses                  1.589413
dtype: float64
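
To see what "ignoring" missing values means in practice, the sketch below compares the default behavior with `skipna = False` on a tiny made-up Series (`s` is a placeholder name).

```python
import numpy as np
import pandas as pd

# Toy Series with one missing value (made up for illustration).
s = pd.Series([2.0, np.nan, 4.0])

mean_skip = s.mean()                 # skipna=True is the default
mean_strict = s.mean(skipna=False)   # a single NaN makes the result NaN
print(mean_skip)
print(mean_strict)
```

With the default, the mean is taken over the two values that are present; with `skipna = False`, the missing value propagates into the result.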

#### Removing missing values

Before we start messing around too much with the values in our data frame, let's make sure we can easily hit the reset button and get back to a nice starting point. To do this, we'll want to

- reload the data
- strip the "Dr. " from the doctor names
- set the patient ID to type `string`

A perfect job for a function!


In [173]:
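def hit_reset():
    """Reload the data and reapply the cleaning steps from above."""
    bcd = pd.read_csv('./data/breast_cancer_data.csv')
    # Treat the patient ID as a label, not a number
    bcd['patient_id'] = bcd['patient_id'].astype('string')
    # Keep only the surname ("Dr. Doe" -> "Doe")
    bcd['doctor_name'] = bcd['doctor_name'].str.split().str[1]

    return bcd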

##### Removing rows with missing values

Rows in which all values are missing won't do us any good, so we can drop them with:

In [145]:
bcd = bcd.dropna(how = 'all')

This drops only the rows in which *all* of the values are missing; rows with at least one value are kept.

Sometimes a case can be made for throwing out all observations (rows) that are incomplete, that is, if they contain *any* missing values.

In [174]:
bcd = bcd.dropna(how = 'any')
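
The difference between `how = 'all'` and `how = 'any'` is easiest to see on a small example. The frame below is made up for illustration, and the variable names are placeholders.

```python
import numpy as np
import pandas as pd

# Toy frame: row 0 is complete, row 1 is partly missing, row 2 is empty.
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [2.0, 5.0, np.nan],
})

no_empty_rows = toy.dropna(how='all')   # drops only the fully empty row
complete_rows = toy.dropna(how='any')   # keeps only the complete row
print(no_empty_rows)
print(complete_rows)
```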

---

In the cell below, check the (new) shape of `bcd`.

In [175]:
bcd.shape

(690, 12)

It should have 9 fewer rows now: there were 9 missing values in total, and no row was missing more than one.

---

And now is a perfect time to test our function! In the cell below, hit the reset button on `bcd`.

In [179]:
bcd = hit_reset()  # reset bcd data frame

Check the shape.

In [180]:
bcd.shape

(699, 12)

Check the type of the `patient_id` column.

In [181]:
bcd.dtypes

patient_id                string
clump_thickness          float64
cell_size_uniformity     float64
cell_shape_uniformity      int64
marginal_adhesion          int64
single_ep_cell_size        int64
bare_nuclei               object
bland_chromatin          float64
normal_nucleoli          float64
mitoses                    int64
class                     object
doctor_name               object
dtype: object

Check the doctor name column.

In [182]:
bcd['doctor_name']

0        Doe
1      Smith
2        Lee
3      Smith
4       Wong
       ...  
694      Lee
695    Smith
696      Lee
697      Lee
698     Wong
Name: doctor_name, Length: 699, dtype: object

---

##### Removing columns with missing values

We could do the same for columns by passing `axis = 1`, though this is done less often.

In [146]:
# bcd = bcd.dropna(axis = 1, how = 'any')
# bcd = bcd.dropna(axis = 1, how = 'all')
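
As a rough sketch of what those two commented-out lines would do, here is the column-wise version on a made-up frame (column and variable names are placeholders).

```python
import numpy as np
import pandas as pd

# Toy frame with one complete, one partly missing, and one empty column.
toy = pd.DataFrame({
    'complete': [1, 2, 3],
    'partial': [1.0, np.nan, 3.0],
    'empty': [np.nan, np.nan, np.nan],
})

drop_any = toy.dropna(axis=1, how='any')   # keeps only fully populated columns
drop_all = toy.dropna(axis=1, how='all')   # drops only fully empty columns
print(drop_any.columns.tolist())
print(drop_all.columns.tolist())
```

Dropping columns with `how = 'any'` can discard a whole variable because of a handful of missing entries, which is one reason it is used less often than dropping rows.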

#### Filling in missing values