# Data Wrangling 

Data wrangling generally refers to the process of getting a data set ready for analysis. Why would we need to do that?

Real-world data can be messy. Data sets are recorded and assembled by humans, and humans make mistakes. A single data set might be created and updated by multiple people who may decide to do things in slightly different ways. On a spreadsheet, one person might decide to leave cells with missing data blank, another might enter "NaN", while a third may enter "missing". If the data set has a great many rows, one person might decide to repeat the column headers partway down so they don't have to scroll up to see them. Any of these things means that the data set cannot be analyzed "as is" and wrangling will be required.

Even in a tightly controlled laboratory setting in which data are collected via computer and automatically written out to data files, some data wrangling might be required. There might be a separate data file for each subject or experimental session, meaning that these separate files will have to be combined into a single data set before analysis. 
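As a rough illustration of combining per-subject files, here is a minimal sketch using `pd.concat()`. The file contents, the column names (`trial`, `rt`), and the subject labels are all made up for this example, and the "files" are simulated in memory so the cell runs on its own.

```python
import io
import pandas as pd

# Hypothetical per-subject files, simulated in memory so the sketch is self-contained.
subject_files = {
    "subject_01": io.StringIO("trial,rt\n1,0.52\n2,0.48\n"),
    "subject_02": io.StringIO("trial,rt\n1,0.61\n2,0.57\n"),
}

frames = []
for subject, handle in subject_files.items():
    df = pd.read_csv(handle)
    df["subject"] = subject   # record which file each row came from
    frames.append(df)

# Stack the per-subject frames into one data set with a fresh 0..n-1 index.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

With real files, the dictionary values would be file paths rather than in-memory buffers.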

Our main wrangling tool is pandas, so we can go ahead and import it.

In [None]:
import pandas as pd

## Loading

For our wrangling practice today, we'll look at a data set containing various measurements on breast cancer patients. The file is called `breast_cancer_data.csv`; place it in the "data" folder, which should already exist in the same directory as this notebook.

Let's import it as a pandas dataframe.

In [None]:
bcd = pd.read_csv('./data/breast_cancer_data.csv')
bcd

Before we do any actual wrangling, let's get familiar with the data frame in its current form.

## Exploring the Data Frame

We can explore the data frame by looking at its attributes, such as its shape, column names, and data types:

In [None]:
bcd.columns

---

Use the cells below to get the shape and data types (`dtypes`) of our data frame.

In [None]:
bcd.shape

In [None]:
bcd.dtypes

---

In the cell below, use the `describe()` method to get a summary of the numerical columns.

---

## Modifying a text column

We'll often want to "tune up" columns that contain text. We might encounter, for example, a column containing full names that we need to break up into separate columns for the first and last names.

---

Let's look at the column for the doctors' names. Use the cell below to take a peek.

---

The doctors' name data are redundant; each one has a "Dr. " in front of the actual name, but we already know these are doctors by the column name. Further, the entries have white space in them, which can cause us problems down the road. So let's modify this column so it only contains the surnames of the doctors.

One great thing about pandas is that it has versions of many of Python's string methods that operate *element-wise on an entire column of strings*. Here, we want to separate the "Dr. " from the actual name, which is exactly what Python's `str.split()` method does. So chances are, pandas has a version of this method that operates element-wise on a column of strings.
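To make the element-wise idea concrete, here is a small sketch on a toy Series (the city names are made up for illustration). It uses the `.str` accessor to apply `strip()` and `lower()` to every element at once.

```python
import pandas as pd

# Toy data (made up for illustration), with stray spaces and mixed case.
cities = pd.Series(["  Austin", "boston  ", " CHICAGO "])

# Plain Python: one string at a time.
print("  Austin".strip().lower())

# pandas: the same methods, applied to every element via the .str accessor.
cleaned = cities.str.strip().str.lower()
print(cleaned.tolist())
```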

---

#### String Splitting Review:

Let's briefly remind ourselves of splitting up Python strings and extracting bits of them.

In [None]:
# Here's a string of the form: surname, first initial.
myStr = 'SirString, A.'
print(myStr)

Let's say we wanted to get the surname. We could split this string into a Python list at the white space like this:

In [None]:
spltStr = myStr.split()    # split() defaults to splitting at white space
print(spltStr)

We now have a list in which the items contain the text on either side of the split. This is close to what we want: the first entry in the list has the surname, but it also has an unwanted comma. 

Let's split the string at the comma instead:

In [None]:
spltStr = myStr.split(',')   # tell Python to split at commas
print(spltStr)

Now we have isolated the last name, and we can fetch it by indexing:

In [None]:
surname = spltStr[0]
print(surname)

---

In the cell below, see if you can extract the surname from `myStr` in one line of code:

---

Alright, time to replace the `bcd['doctor_name']` column values with just the doctors' last names. 

We could do this in one step, but let's break it out for clarity. First, let's copy the name column out into a new series.

In [None]:
dr_names = bcd['doctor_name']
dr_names

---

***Note***: pandas objects behave like ordinary Python objects. So, strictly speaking, we have not created a new object (pandas Series); rather, *we have created a new label that refers to the "doctor_name" column of `bcd`.*

In the cell below, use the `id()` function to compare the object IDs of `dr_names` and the corresponding column of `bcd`.

---

Now let's split all the names in the `doctor_name` column at the whitespace by using the pandas `Series.str.split()` method.

In [None]:
split_dr_names = dr_names.str.split()
split_dr_names

`Series.str.split()`, however, *does* create a new object.

---

Use the cell below to confirm that the `split()` spawned a new object.

---

Now we have a column of lists, each with two elements. The first element of each list is the "Dr. " bit, and the second consists of the surnames we want. 

We can get these by using pandas string indexing, `Series.str[index]`. 

In [None]:
surnames = split_dr_names.str[1]
surnames

Note that, like the splitting, the string indexing worked on the entire `Series` automatically.

Now we can change the column in our main data frame, `bcd`.

In [None]:
bcd['doctor_name'] = surnames

In [None]:
bcd['doctor_name']

Success!
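The split-and-index steps above can also be chained into a single expression. The sketch below shows this on a toy data frame with made-up names, so it does not touch `bcd`.

```python
import pandas as pd

# Toy stand-in for the doctor_name column (names are made up).
toy = pd.DataFrame({"doctor_name": ["Dr. Doe", "Dr. Smith", "Dr. Lee"]})

# Split at white space, then take element 1 of each resulting list.
toy["doctor_name"] = toy["doctor_name"].str.split().str[1]
print(toy["doctor_name"].tolist())
```

Chaining is more compact, but the step-by-step version is easier to inspect when something goes wrong.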

## Converting a column type (and other aggravations)

Let's look at those data types again.

In [None]:
bcd.dtypes

Notice that "class" and "doctor_name" are of dtype "object", which refers to a general purpose column type, and is how pandas imports text columns by default. Most of the others are numeric (integers or floats), except for "bare_nuclei".

---

In the cell below, take a quick glance at `bcd` again, and see if the "bare_nuclei" column should be a different data type than, say, "marginal_adhesion".

---

It looks like "bare_nuclei" was intended to be a numeric column, so let's try to convert it using the `Series.astype()` converter method.

In [None]:
bcd['bare_nuclei'] = bcd['bare_nuclei'].astype('int64')

And, argh, we get an error! If we look at the bottom of the error message, it seems that the error involves question marks ("?") in the data, which would also explain why this column imported as text rather than numbers in the first place.

Let's check.

---

In the cell below, use logical indexing to show the rows of `bcd` in which `bcd['bare_nuclei']` contains a question mark.

---

Sure enough. Rather than leaving the cells of missing values empty, somebody has made the poor decision to enter question marks instead.

When you are dealing with other people's data, you'll find that this sort of thing happens a LOT. It can be very aggravating, so we need to learn to treat these things as challenging puzzles instead of hassles!

Let's replace the question marks with empty strings, so that this column becomes consistent with the rest. Fortunately, `DataFrame` (and `Series`) objects have a `replace()` method built in, so let's use that.

In [None]:
bcd['bare_nuclei'] = bcd['bare_nuclei'].replace('?', '')

---

In the cell below, confirm that we no longer have question marks in our "bare_nuclei" column.

---

**Note**: As mentioned above, extracting columns or other subsets of data from a pandas `DataFrame` or `Series` does not create a new object but rather a new label to the existing object.

So, for example, `the_IDs = bcd['patient_id']` does not make a new object, but rather creates a second label referring to the original object (consistent with the behavior of base Python).

In general, however, pandas methods (functions) *do* create new objects. Thus, the step of assigning the output of `.replace()` back to the original data frame column is necessary.
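Here is a small sketch of why the assignment matters, using a toy Series rather than `bcd`. Calling `.replace()` leaves the original values untouched and hands back a new Series with the change.

```python
import pandas as pd

original = pd.Series(["1", "?", "3"])   # toy column with a "?" placeholder

# .replace() returns a new Series; the original is left as it was.
replaced = original.replace("?", "")

print(original.tolist())   # the "?" is still here
print(replaced.tolist())   # the "?" has become an empty string
```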

---

In the cell below, confirm that the output of `.replace()` and `bcd['bare_nuclei']` have different IDs.

---

And now we can convert the column to numeric values.

In [None]:
bcd['bare_nuclei'] = pd.to_numeric(bcd['bare_nuclei'])
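As an aside, there are at least two other ways to arrive at the same place, sketched below on toy data (the values and the two-column file contents are invented for illustration). One uses the `errors='coerce'` option of `pd.to_numeric()`, and the other uses the `na_values` argument of `pd.read_csv()`.

```python
import io
import pandas as pd

raw = pd.Series(["1", "10", "?", "4"])   # toy column with a "?" placeholder

# Option 1: coerce anything that cannot be parsed as a number to NaN.
coerced = pd.to_numeric(raw, errors="coerce")
print(coerced)

# Option 2: declare "?" as a missing-value marker at import time.
csv_text = "patient_id,bare_nuclei\n1,1\n2,10\n3,?\n4,4\n"   # hypothetical file contents
df = pd.read_csv(io.StringIO(csv_text), na_values="?")
print(df.dtypes)
```

Note that `errors='coerce'` turns *every* unparseable entry into a missing value, so it can hide typos; checking what the odd entries are first, as we did above, is the safer habit.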

---

In the cell below, check the data types of columns in `bcd`.

---

Okay! We have now gotten our data somewhat into shape, meaning:

- missing data are actually missing
- columns of numeric data are numeric in type
- the column of doctor names contains only last names

So now we can explore some ways to deal with missing values.

## Dealing with missing data

### Finding missing values

Even though this dataset isn't all that large:

In [None]:
bcd.shape

699 rows is a lot to look through "by hand" in order to find missing values.

We can test for missing values using the `DataFrame.isna()` method. 

In [None]:
bcd.isna()

By itself, that doesn't help us much. But if we combine it with summation (remember that `True` values count as 1 and `False` counts as zero):

In [None]:
bcd.isna().sum()

Now we have the counts by variable, and can easily see that there are missing values for a few of the variables.

The "bare_nuclei" variable we dealt with earlier has the most missing values, with "bland_chromatin" coming in a distant second. 

Let's check some of the rows with missing values and make sure everything else looks normal in those rows. Notice above that the output of `.isna()` is Boolean, so we can use it to do logical indexing.

In [None]:
bcd[bcd['bland_chromatin'].isna()]

---

In the cell below, check the rows that have missing values for either clump thickness or cell size uniformity. Do this in one go rather than separately (remember the element-wise or operator, "|").

---

So far so good. It looks like the rows that have missing values just have one missing value, and everything else seems fine. But let's check that no rows have more than one missing value.

To do this, we can sum the number of missing values across the columns (i.e. within each row), and then see what the maximum number of missing values within a row is.

In [None]:
row_na_totals = bcd.isna().sum(axis = 1)
row_na_totals.max()

So we see that no row has more than one missing value.
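A related check, sketched here on a toy data frame with made-up values, is to tally how many rows have each number of missing values and to pull out the incomplete rows using `any(axis=1)`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [5.0, 6.0, np.nan, 8.0],
})

na_per_row = toy.isna().sum(axis=1)        # missing values within each row
print(na_per_row.value_counts())           # how many rows have 0, 1, ... missing values

incomplete = toy[toy.isna().any(axis=1)]   # rows with at least one missing value
print(incomplete)
```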

---

In the cell below, do the above calculation in one line.

---

### Dealing with missing values

Now that we have determined that there are missing values, we have to determine how to deal with them. 

#### Ignoring missing values elementwise

One way to handle missing values is just to ignore them. Most of the standard math and statistical functions will do that by default.

So this:

In [None]:
bcd['clump_thickness'].mean()  

Computes the mean clump thickness ignoring the one missing value.
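To see the default in action, here is a minimal sketch on a toy Series. The `skipna` argument controls whether missing values are ignored (the default) or allowed to propagate into the result.

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, np.nan, 4.0])   # toy data with one missing value

mean_skipping = s.mean()              # NaN ignored: (2 + 4) / 2
mean_strict = s.mean(skipna=False)    # NaN propagates into the result
print(mean_skipping, mean_strict)
```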

We can compute the mean (again ignoring missing values) for all the numeric columns like this:

In [None]:
bcd.mean(numeric_only = True)  # the numeric_only refers to columns, not missing values

That worked, but the output is a little awkward because the patient ID is being treated as a numeric variable. We can fix that by converting the patient ID variable to a string variable.

In [None]:
bcd['patient_id'] = bcd['patient_id'].astype('string')

And now the means should look a little better because we won't have the mean for the ID column in the millions.

---

Recompute the mean for the numeric columns in the cell below.

---

#### Removing missing values

We are about to start learning how to remove missing values from our data frame, *however...*

Before we start messing around too much with the values in our data frame, let's make sure we can easily "hit the reset button" and get back to a nice starting point. To do this, we'll want to

- reload the data
- modify the column of doctor names
- set the patient ID to type string
- remove the question marks from the bare nuclei column
- set the bare nuclei column to numeric

This is a perfect job for a function!


---

In the cell below, finish writing the function to reset our data frame to the desired starting point.

In [None]:
def hit_reset():
    bcd = pd.read_csv('./data/breast_cancer_data.csv')
    bcd['patient_id'] = ...
    ...
    
    return bcd

---

##### *Removing rows with missing values*

Obviously, rows in which all values are missing won't do us any good, so we can drop them with:

In [None]:
bcd = bcd.dropna(how = 'all')

This drops rows in which *all* of the values are missing. This code ran without error, but we know it also didn't do anything in this case because we don't have any rows in which all the values are missing!

Sometimes a case can be made for throwing out all observations (rows) that are incomplete, that is, if they contain *any* missing values.

In [None]:
bcd = bcd.dropna(how = 'any')

---

In the cell below, check the (new) shape of `bcd`.

It should have fewer rows now.
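The difference between `how='all'` and `how='any'` is easiest to see on a tiny example. The sketch below uses a made-up three-row data frame in which one row is partly missing and one is entirely missing.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [4.0, 5.0, np.nan],
})

dropped_all = toy.dropna(how="all")   # removes only the all-missing row (index 2)
dropped_any = toy.dropna(how="any")   # removes every incomplete row (index 1 and 2)
print(len(toy), len(dropped_all), len(dropped_any))
```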

---

And now is a perfect time to test our function! In the cell below, hit the reset button on bcd.

Check the shape.

Check the data types of the columns.

Check the doctor name column.

---

##### *Removing columns with missing values*

And we could do the same for columns if we wished, though this is less frequently done. We just need to change the axis (direction) over which `DataFrame.dropna()` works.

In [None]:
bcd = bcd.dropna(axis = 1, how = 'any') # drop columns rather than rows

This leaves us with only the complete columns.

In [None]:
bcd.shape

Let's see which they are.

In [None]:
bcd.columns

#### Filling in missing values

Occasionally, we may want to fill in missing values. This isn't very common, but might be useful if some other function you are using doesn't handle missing values gracefully.

Before filling in missing values, we need to restore our data frame so it actually has missing values. Good thing we wrote that function!

In [None]:
bcd = hit_reset()

We can fill in missing values with any single value we want, such as a zero.

In [None]:
bcd = bcd.fillna(0)
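One thing to keep in mind is that the fill value becomes part of the data from then on. The toy sketch below (values made up) shows how filling with zero shifts the mean.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])   # toy data with one missing value

filled = s.fillna(0)

print(s.mean())        # missing value ignored: (1 + 3) / 2
print(filled.mean())   # the zero now counts as data: (1 + 0 + 3) / 3
```

So a fill value should be chosen with the downstream analysis in mind.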

---

In the cell below, check to see that we no longer have missing values.

In the cell below, reset the data and verify that the missing data are back.

---

In the cell below, fill the missing values in each column with the column mean. (Hint: this is pandas, so this is actually easy!)

And now verify that there are no more missing values.

---

## Summary

In this tutorial, we learned or remembered how to do some of the foundational data wrangling tasks. These are:

- importing data into pandas from a data file
- cleaning up the data in the columns
- converting columns to the appropriate type
- removing or filling in missing values