# Data Wrangling 

Data wrangling generally refers to the process of getting a data set ready for analysis. Why would we need to do that?

The reason is that real-world data can be messy. Data sets are recorded and assembled by humans, and humans make mistakes. A single data set might be created and updated by multiple people, who may decide to do things in slightly different ways. On a spreadsheet, one person might decide to leave cells with missing data blank, another might enter "NaN", while a third may enter "missing". If the data has many rows, one person might decide to repeat the column headers partway down so they don't have to scroll up to see them. Any of these things means that the data set cannot be analyzed "as is", and wrangling will be required.

Even in a tightly controlled laboratory setting in which data are collected via computer and automatically written out to data files, some data wrangling might be required. There might be a separate data file for each subject or experimental session, meaning that these separate files will have to be combined into a single data set before analysis. 
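As a rough sketch of that combining step, the snippet below stacks two small per-subject tables into one data set using pandas (the library used throughout this notebook). The subject labels, column names, and values are invented for illustration, and `io.StringIO` stands in for real files on disk.

```python
import io
import pandas as pd

# Hypothetical per-subject files; StringIO stands in for files on disk.
subject_files = {
    "s01": io.StringIO("trial,rt\n1,0.52\n2,0.48\n"),
    "s02": io.StringIO("trial,rt\n1,0.61\n2,0.57\n"),
}

frames = []
for subject, handle in subject_files.items():
    df = pd.read_csv(handle)
    df["subject"] = subject  # record which file each row came from
    frames.append(df)

# Stack the per-subject tables into one data set with a fresh index.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Adding the `subject` column before concatenating keeps track of where each row originated, which is easy to lose once the files are merged.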

Our main wrangling tool is pandas, so we can go ahead and import it.

In [1]:
import pandas as pd

## Loading

For our wrangling practice today, we'll look at a data set containing various measurements on breast cancer patients. The file is called `breast_cancer_data.csv`, and it should go in a "data" folder located in the same directory as this notebook.

Let's import it as a pandas dataframe.

In [2]:
bcd = pd.read_csv('./data/breast_cancer_data.csv')
bcd

,patient_id,clump_thickness,cell_size_uniformity,cell_shape_uniformity,marginal_adhesion,single_ep_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class,doctor_name
0,1000025,5.0,1.0,1,1,2,1,3.0,1.0,1,benign,Dr. Doe
1,1002945,5.0,4.0,4,5,7,10,3.0,2.0,1,benign,Dr. Smith
2,1015425,3.0,1.0,1,1,2,2,3.0,1.0,1,benign,Dr. Lee
3,1016277,6.0,8.0,8,1,3,4,3.0,7.0,1,benign,Dr. Smith
4,1017023,4.0,1.0,1,3,2,1,3.0,1.0,1,benign,Dr. Wong
...,...,...,...,...,...,...,...,...,...,...,...,...
694,776715,3.0,1.0,1,1,3,2,1.0,1.0,1,benign,Dr. Lee
695,841769,2.0,1.0,1,1,2,1,1.0,1.0,1,benign,Dr. Smith
696,888820,5.0,10.0,10,3,7,3,8.0,10.0,2,malignant,Dr. Lee
697,897471,4.0,8.0,6,4,3,4,10.0,6.0,1,malignant,Dr. Lee


Before we do any actual wrangling, let's get familiar with the data frame as-is.

## Exploring the Data Frame

We can explore the data frame by looking at its attributes, such as its shape, column names, and data types:

In [None]:
bcd.columns

---

Use the cells below to get the shape and data types (`dtypes`) of our data frame.

In [10]:
bcd.shape

(699, 12)

In [None]:
bcd.dtypes

---

In the cell below, use the `describe()` method to get a summary of the numerical columns.

In [11]:
bcd.describe()

,patient_id,clump_thickness,cell_size_uniformity,cell_shape_uniformity,marginal_adhesion,single_ep_cell_size,bland_chromatin,normal_nucleoli,mitoses
count,699.0,698.0,698.0,699.0,699.0,699.0,695.0,698.0,699.0
mean,1071704.0,4.416905,3.137536,3.207439,2.793991,3.216023,3.447482,2.868195,1.589413
std,617095.7,2.817673,3.052575,2.971913,2.843163,2.2143,2.441191,3.055647,1.715078
min,61634.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,870688.5,2.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0
50%,1171710.0,4.0,1.0,1.0,1.0,2.0,3.0,1.0,1.0
75%,1238298.0,6.0,5.0,5.0,3.5,4.0,5.0,4.0,1.0
max,13454350.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0


---

## Modifying a text column

We'll often want to "tune up" any columns that contain text. We might encounter, for example, a column containing full names that we need to break up into separate columns for the first and last names.

Let's look at the column for the doctors' names:

In [None]:
bcd['doctor_name']

The doctors' name data are redundant; each one has a "Dr. " in front of the actual name, but we already know these are doctors by the column name. And the entries have white space in them, which is just annoying. So let's modify this column so it only contains the actual surnames of the doctors.

One great thing about pandas is that it has versions of many of Python's string methods that operate element-wise on a column of strings. We want to separate the "Dr. " from the actual name, which is exactly the kind of job Python's `str.split()` method does. So chances are, pandas has a version of this method that operates element-wise on a column.

---

#### String Splitting Review:

Let's briefly remind ourselves of splitting up Python strings and extracting bits of them.

In [None]:
# Here's a string of the form: surname, first initial.
myStr = 'SirString, A.'
print(myStr)

Let's say we wanted to get the surname. We could split this string into a list at the white space like this:

In [None]:
spltStr = myStr.split()
print(spltStr)

We now have a list in which the items contain the text on either side of the split. This is close: the first entry in the list has the surname, but it also has an unwanted comma. 

Let's split the string at the comma instead:

In [None]:
spltStr = myStr.split(',')   # tell Python to split at commas
print(spltStr)

And now we can get the surname by indexing:

In [None]:
surname = spltStr[0]
print(surname)

---

See if you can extract the surname from `myStr` in one line of code:

In [None]:
surname = myStr.split(',')[0]
print(surname)

---

Alright, time to replace the `bcd['doctor_name']` column values with just the doctors' surnames.

We could do this in one step, but let's break it out for clarity. First, let's copy the name column out into a new series.

In [None]:
dr_names = bcd['doctor_name']
dr_names

Now let's split all the names in the `doctor_name` column at the whitespace by using the pandas `Series.str.split()` method.

In [None]:
split_dr_names = dr_names.str.split()
split_dr_names

Now we have a column of lists, each with two elements. The first element of each list is the "Dr. " bit, and the second consists of the surnames we want. 

We can get these by using pandas string indexing, `Series.str[index]`. 

In [None]:
surnames = split_dr_names.str[1]
surnames

Finally, let's put the surnames back into the data frame, replacing the original column.

In [None]:
bcd['doctor_name'] = surnames

In [None]:
bcd['doctor_name']

Success!
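For comparison, here is a sketch of the one-step version on a small stand-in series (the values are illustrative, not read from the data file), together with an alternative that removes the literal prefix instead of splitting.

```python
import pandas as pd

# Toy stand-in for the doctor_name column.
names = pd.Series(["Dr. Doe", "Dr. Smith", "Dr. Lee"], name="doctor_name")

# One step: split at whitespace, then take the second piece of each list.
surnames_split = names.str.split().str[1]

# Alternative: remove the literal "Dr. " prefix (regex=False means plain text).
surnames_replace = names.str.replace("Dr. ", "", regex=False)

print(surnames_split.tolist())
print(surnames_replace.tolist())
```

The two approaches agree here, but they would differ for a surname that contains a space: splitting keeps only the first word after "Dr.", while removing the prefix keeps the whole remainder.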

## Converting a column type (and other aggravations)

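Let's take another look at the data types of our columns.

In [None]:
bcd.dtypes

The `bare_nuclei` column holds numbers, but it was imported as text. Let's try converting it to integers.

In [None]:
bcd['bare_nuclei'] = bcd['bare_nuclei'].astype('int64')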

And, argh, we get an error! If we look at the bottom of the error message, it looks like the error involves question marks in the data, which would also explain why this column imported as text rather than numbers in the first place.

Let's check.

In [None]:
bcd[bcd['bare_nuclei'] == '?']

Sure enough. Rather than leaving the cells of missing values empty, somebody has made the poor decision to enter question marks.

When you are dealing with other people's data, you'll find that this sort of thing happens a LOT. It can be very aggravating.

Let's replace the question marks with nothing, so that this column becomes consistent with the rest.

In [None]:
bcd['bare_nuclei'] = bcd['bare_nuclei'].replace('?', '')

And now we can convert the column to numeric values.

In [None]:
bcd['bare_nuclei'] = pd.to_numeric(bcd['bare_nuclei'])

In [None]:
bcd.dtypes
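The same two steps can be tried on a small stand-in series, which makes it easier to see what happens to the placeholder. The values below are made up. The sketch also shows a possible alternative, `errors="coerce"`, which asks `pd.to_numeric()` to turn anything it cannot parse into a missing value; that is convenient, but it would also silently hide other unexpected entries.

```python
import pandas as pd

# Toy stand-in for a numeric column polluted with a '?' placeholder.
raw = pd.Series(["1", "10", "?", "4"], name="bare_nuclei")

# Approach used above: blank out the placeholder, then convert.
cleaned = pd.to_numeric(raw.replace("?", ""))

# Alternative: let to_numeric mark anything unparseable as missing.
coerced = pd.to_numeric(raw, errors="coerce")

print(cleaned)
print(cleaned.isna().sum())  # number of missing values after conversion
```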

Okay! We have now gotten our data somewhat into shape, meaning:

- missing data are actually missing
- columns of numeric data are numeric in type
- the column of doctor names contains only last names

So now we can explore some ways to deal with missing values.

## Dealing with missing data

### Finding missing values

Even though this data set isn't all that large:

In [None]:
bcd.shape

699 rows is a lot to look through in order to find missing values by hand.

We can test for missing values using the `DataFrame.isna()` method. 

In [None]:
bcd.isna()

By itself, that doesn't help us much. But if we combine it with summation (remember that `True` values count as 1 and `False` counts as zero):

In [None]:
bcd.isna().sum()

Now we have the counts by variable, and can easily see that there are missing values for a few of the variables.

The bare nuclei variable we dealt with earlier has the most missing values, with bland chromatin coming a distant second. 
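To see why summing works, here is a small sketch on a toy data frame with a known number of gaps (the column names and values are invented for illustration).

```python
import numpy as np
import pandas as pd

# Toy frame: two gaps in 'a', none in 'b', one in 'c'.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": ["x", None, "z", "w"],
})

missing_mask = toy.isna()                # True wherever a value is missing
missing_per_column = missing_mask.sum()  # True counts as 1, so this counts gaps

print(missing_per_column)
```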

Let's check some of the rows with missing values and make sure everything else looks normal in those rows.

In [None]:
bcd[bcd['bland_chromatin'].isna()]

---

In the cell below, check the rows that have missing values for clump thickness and cell size uniformity. Do this in one go rather than separately.

In [None]:
bcd[bcd['clump_thickness'].isna() | bcd['cell_size_uniformity'].isna()]
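As a sketch of how boolean masks combine, the toy example below (with made-up columns `x` and `y`) contrasts `|` ("either is missing") with `&` ("both are missing"), and shows one alternative way to write the "either" case for a list of columns.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": [np.nan, 2.0, 3.0, np.nan],
})

# Rows where x OR y is missing.
either = toy[toy["x"].isna() | toy["y"].isna()]

# Rows where x AND y are both missing.
both = toy[toy["x"].isna() & toy["y"].isna()]

# Same as 'either', written for a list of columns.
either_alt = toy[toy[["x", "y"]].isna().any(axis=1)]

print(either.index.tolist())
print(both.index.tolist())
```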

---

So far so good. It looks like the rows that have missing values just have one missing value, and everything else seems fine. But let's check that no rows have more than one missing value.

To do this, we can sum the number of missing values across the columns (i.e. for each row), and then see what the maximum number of missing values within a row is.

In [None]:
row_na_totals = bcd.isna().sum(axis = 1)
row_na_totals.max()

---

In the cell below, do the above calculation in one line.

In [None]:
bcd.isna().sum(axis = 1).max()

### Dealing with missing values

Now that we have determined that there are missing values, we have to determine how to deal with them. 

#### Ignoring missing values elementwise

One way to handle missing values is just to ignore them. Most of the standard math and statistical functions will do that by default.

So this:

In [None]:
bcd['clump_thickness'].mean()  

Computes the mean clump thickness ignoring the one missing value.
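The default behaviour is controlled by the `skipna` parameter. A minimal sketch on a toy series (values chosen so the arithmetic is easy to check):

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 4.0, np.nan])

default_mean = s.mean()             # missing value skipped: (2 + 4) / 2
strict_mean = s.mean(skipna=False)  # missing value propagates to the result

print(default_mean)
print(strict_mean)
```

Passing `skipna=False` can be a useful check when a missing value should stop the calculation rather than be ignored quietly.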

We can compute the mean (again ignoring missing values) for all the numeric columns like this:

In [None]:
bcd.mean(numeric_only = True)  # the numeric_only refers to columns, not missing values

That worked, but the output is a little awkward because the patient ID is being treated as a numeric variable. We can fix that by converting the patient ID variable to a string variable.

In [None]:
bcd['patient_id'] = bcd['patient_id'].astype('string')

And now the means should look a little better because we won't have the mean for the ID column in the millions.

In [None]:
bcd.mean(numeric_only = True)

#### Removing missing values

Before we start messing around too much with the values in our data frame, let's make sure we can easily hit the reset button and get back to a nice starting point. To do this, we'll want to

- reload the data
- modify the column of doctor names
- set the patient ID to type string
- set the bare nuclei column to numeric

This is a perfect job for a function!


---

In the cell below, write a function to reset our data frame to the desired starting point.

In [None]:
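def hit_reset():
    """Reload the data and reapply the cleaning steps from above."""
    bcd = pd.read_csv('./data/breast_cancer_data.csv')
    bcd['patient_id'] = bcd['patient_id'].astype('string')      # IDs are labels, not numbers
    bcd['doctor_name'] = bcd['doctor_name'].str.split().str[1]  # keep surnames only
    bcd['bare_nuclei'] = bcd['bare_nuclei'].replace('?', '')    # '?' marks missing values
    bcd['bare_nuclei'] = pd.to_numeric(bcd['bare_nuclei'])      # text to numeric

    return bcd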

---

##### Removing rows with missing values

Obviously, rows in which all values are missing won't do us any good, so we can drop them with:

In [None]:
bcd = bcd.dropna(how = 'all')

This drops rows in which *all* of the values are missing. This code ran without error, but we know it also didn't do anything in this case because we don't have any rows in which all the values are missing.

Sometimes a case can be made for throwing out all observations (rows) that are incomplete, that is, if they contain *any* missing values.

In [None]:
bcd = bcd.dropna(how = 'any')

---

In the cell below, check the (new) shape of `bcd`.

In [None]:
bcd.shape

It should have fewer rows now.
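The difference between `how = 'all'` and `how = 'any'` is easier to see on a toy data frame that actually contains a completely empty row (the values here are invented for illustration).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [4.0, 5.0, np.nan],
})

# Drop only the rows in which every value is missing (row 2 here).
no_empty_rows = toy.dropna(how="all")

# Drop the rows in which at least one value is missing (rows 1 and 2 here).
complete_rows = toy.dropna(how="any")

print(no_empty_rows.index.tolist())
print(complete_rows.index.tolist())
```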

---

And now is a perfect time to test our function! In the cell below, hit the reset button on bcd.

In [None]:
bcd = hit_reset()  # reset bcd data frame

Check the shape.

In [None]:
bcd.shape

Check the data types of the columns.

In [None]:
bcd.dtypes

Check the doctor name column.

In [None]:
bcd['doctor_name']

---

##### Removing columns with missing values

And we could do the same for columns if we wished, though this is less frequently done. We just need to change the axis (direction) over which `DataFrame.dropna()` works.

In [None]:
bcd = bcd.dropna(axis = 1, how = 'any') # drop columns rather than rows

This leaves us with only the complete columns.

In [None]:
bcd.shape

Let's see which they are.

In [None]:
bcd.columns
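Here is a small sketch of the column-wise version on a toy data frame, using hypothetical column names that describe their own contents.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "complete": [1, 2, 3],
    "one_gap": [1.0, np.nan, 3.0],
    "all_gaps": [np.nan, np.nan, np.nan],
})

# axis=1 makes dropna() consider columns instead of rows.
only_complete = toy.dropna(axis=1, how="any")  # keep columns with no gaps
not_empty = toy.dropna(axis=1, how="all")      # drop only fully empty columns

print(list(only_complete.columns))
print(list(not_empty.columns))
```

Dropping a column because of a single gap discards every other value in that variable, which is one reason this is done less often than dropping rows.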

#### Filling in missing values

Occasionally, we may want to fill in missing values. This isn't very common, but might be useful if some other function you are using doesn't handle missing values gracefully.

Before filling in missing values, we need to restore our data frame so it actually has missing values. Good thing we wrote that function!

In [None]:
bcd = hit_reset()

We can fill in missing values with any single value we want, such as a zero.

In [None]:
bcd = bcd.fillna(0)

---

In the cell below, check to see that we no longer have missing values.

In [None]:
bcd.isna().sum()

In the cell below, reset the data and verify that the missing data are back.

In [None]:
bcd = hit_reset()
bcd.isna().sum()

---

In the cell below, fill the missing values in each column with the column mean. (Hint: this is pandas, so this is actually easy!)

In [None]:
bcd = bcd.fillna(bcd.mean(numeric_only = True))

And now verify that there are no more missing values.

In [None]:
bcd.isna().sum()
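To see what filling with column means does, here is a sketch on a toy data frame (invented values). When `fillna()` is given a series, pandas matches the series labels to the column names, so each column gets its own mean. A column with no entry in that series, such as the text column below, is left untouched.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [10.0, 20.0, np.nan],
    "label": ["x", None, "z"],
})

col_means = toy.mean(numeric_only=True)  # one mean per numeric column
filled = toy.fillna(col_means)           # each gap gets its own column's mean

print(filled)
print(filled.isna().sum())
```

So if the check above reports remaining gaps in a non-numeric column, that is expected with this approach rather than a sign that something went wrong.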

---

### Duplicate entries

In [None]:
bcd.nunique()
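The unique counts hint at repeats: a column with fewer unique values than rows contains duplicated values. One possible follow-up, sketched here on a toy data frame with made-up patient IDs rather than on `bcd` itself, is to flag and drop fully repeated rows with `duplicated()` and `drop_duplicates()`.

```python
import pandas as pd

toy = pd.DataFrame({
    "patient_id": ["101", "102", "102", "103"],
    "mitoses": [1, 2, 2, 1],
})

unique_counts = toy.nunique()    # unique values per column
dup_mask = toy.duplicated()      # True for rows that repeat an earlier row
deduped = toy.drop_duplicates()  # keep the first copy of each repeated row

print(unique_counts)
print(dup_mask.tolist())
print(deduped)
```

By default both methods compare entire rows; the `subset` parameter restricts the comparison to chosen columns, for example only the ID column.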