# Wrangling III homework

## Preliminaries

As usual, we'll load some libraries we're likely to use.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Now we'll get set up to work by

- loading the cancer data and cleaning it up (as before)
- trimming out some columns so we can look at the data frame more easily
- shortening some of the column names to save ourselves some typing

Let's reuse our function to do the loading and cleaning.

In [2]:
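def bcd_load_clean():
    bcd = pd.read_csv('./data/breast_cancer_data.csv')
    # Treat patient IDs as labels rather than numbers
    bcd['patient_id'] = bcd['patient_id'].astype('string')
    # Keep only the second whitespace-separated token of the doctor's name
    bcd['doctor_name'] = bcd['doctor_name'].str.split().str[1]
    # Swap the '?' placeholder for an empty string, then convert to numeric
    bcd['bare_nuclei'] = bcd['bare_nuclei'].replace('?', '')
    bcd['bare_nuclei'] = pd.to_numeric(bcd['bare_nuclei'])

    return bcd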

In [3]:
bcd = bcd_load_clean()

Make a little version with just two numeric columns to play with.

In [4]:
bcd2 = bcd[['patient_id', 'clump_thickness', 'bland_chromatin', 'class']].copy()

Let's give the columns shorter names to save some typing.

In [5]:
bcd2 = bcd2.rename(columns={'clump_thickness': 'thick',
                            'bland_chromatin': 'chrom',
                            'patient_id': 'id'})

## Duplicate entries

As we have already seen, datasets can contain strange things that we have to overcome prior to analysis. One of the most common issues in a dataset is duplicate entries. These are common in large datasets that have been transcribed by humans at some point. Humans get bored, lose their place, etc.

---

Let's look at the shape of our cancer data frame (remember data frames have a `shape` attribute).

---

Now let's look at the number of unique entries using the `nunique()` data frame method; this will return the number of distinct values in each column.
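
As a small illustration on made-up data (the `toy` frame and its values are hypothetical, not taken from the cancer data), comparing `shape` with `nunique()` shows how repeated values reveal themselves: a column with fewer distinct values than there are rows must contain repeats.

```python
import pandas as pd

# Hypothetical toy data: id 'b' appears twice
toy = pd.DataFrame({'id': ['a', 'b', 'b', 'c'],
                    'thick': [1, 2, 2, 5]})

n_rows, n_cols = toy.shape   # (number of rows, number of columns)
n_unique = toy.nunique()     # Series: distinct values in each column

print(n_rows, n_cols)
print(n_unique)
```

Here there are 4 rows but only 3 distinct ids, so at least one id is repeated.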

---

So we can see that, while there are 699 observations in our data, there are only 645 unique patient IDs. This tells us that several patients have multiple entries. These could be from patients making multiple visits to the doctor, or they could be mistakes, or some combination thereof.

We can find out which rows – which entire observations – are identical with the `duplicated()` method. 

In [None]:
bcd2.duplicated()

That's not terribly helpful by itself, but...

---

In the cell below, count the number of duplicated rows (remember a True is a 1).
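
A quick sketch of why summing works, using a made-up boolean Series (`flags` is a hypothetical name, not part of the homework data): in arithmetic, each `True` counts as 1 and each `False` as 0.

```python
import pandas as pd

# Hypothetical boolean Series, shaped like the output of duplicated()
flags = pd.Series([False, True, False, True, True])

n_true = flags.sum()   # adds up the True values as 1s
print(n_true)
```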

---

We can also use the output of `.duplicated()` to do logical indexing to see the observations that have duplicates. Do that in the cell below.

---

This is promising but, if we look at what is listed, we don't actually see any duplicates. So what is `duplicated()` doing?

---

Use the cell below to get help on `duplicated()` using `help()` or `?`.

---

... we can see that it has a `keep` argument. By default (`keep='first'`), `duplicated()` flags every occurrence of a repeated row *except the first*, so the first instance of each duplicated row is missing from our listing. We can make it flag all the rows with `keep=False`.

Go ahead and do that in the cell below.
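
To see how `keep` changes the result, here is a sketch on made-up data (the `toy` frame is hypothetical). Rows 0, 2 and 3 are identical, and each setting of `keep` flags a different subset of them.

```python
import pandas as pd

# Hypothetical toy data: rows 0, 2 and 3 are identical
toy = pd.DataFrame({'id': ['a', 'b', 'a', 'a'],
                    'thick': [1, 2, 1, 1]})

first = toy.duplicated()             # keep='first': first occurrence is not flagged
last = toy.duplicated(keep='last')   # last occurrence is not flagged
every = toy.duplicated(keep=False)   # every member of a repeated group is flagged

print(first.tolist())   # [False, False, True, True]
print(last.tolist())    # [True, False, True, False]
print(every.tolist())   # [True, False, True, True]
```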

---

Hm. That's somewhat helpful. If we look near the bottom, we see that the last 5 or so duplicates occur in successive rows, perhaps indicating a data entry mistake. Perhaps looking at the data sorted by patient ID would be more helpful.

---

In the cell below, use the `.sort_values()` method to look at our duplicates sorted by ID.

So most of the duplicates occur in adjacent rows, but others do not. Perhaps we should check and see if the same patients occur multiple times with different measurements, indicating multiple visits to the doctor. 

---

Use the cell below and the `subset` argument to `duplicated()` to look at multiple entries for any patients that have them.
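
As a sketch of what `subset` does, consider a made-up frame (hypothetical names and values) in which one id recurs with a different measurement. No full row repeats, but restricting the comparison to the `id` column picks up the repeat.

```python
import pandas as pd

# Hypothetical toy data: id 'a' appears twice with different 'thick' values
toy = pd.DataFrame({'id': ['a', 'b', 'a', 'c'],
                    'thick': [1, 2, 4, 5]})

whole_row = toy.duplicated(keep=False)            # compares all columns
by_id = toy.duplicated(subset='id', keep=False)   # compares only the 'id' column

print(whole_row.tolist())   # [False, False, False, False]
print(by_id.tolist())       # [True, False, True, False]
print(toy[by_id])           # the rows that share an id
```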

---

Now, in the cell below, do the same thing but sort the output by patient ID.

---

So it looks like patients do come in multiple times and the values can change between visits.

We can look at the number of visits for each repeat patient directly if we want. We'll take advantage of the fact that calling `.size()` on a `groupby()` object returns the number of rows in each group.

In [18]:
repeat_patients = bcd2.groupby('id').size().sort_values(ascending=False)

In [6]:
repeat_patients

---

So one patient came in 6 times.

---

Use the cell below to look at the data for the patient with 6 visits.

In [20]:
bcd2[bcd2['id'] == '1182404']

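|     | id      | thick | chrom | class  |
|-----|---------|-------|-------|--------|
| 136 | 1182404 | 4.0   | 2.0   | benign |
| 256 | 1182404 | 3.0   | 1.0   | benign |
| 257 | 1182404 | 3.0   | 2.0   | benign |
| 265 | 1182404 | 5.0   | 3.0   | benign |
| 448 | 1182404 | 1.0   | 1.0   | benign |
| 497 | 1182404 | 4.0   | 1.0   | benign |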
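
The lookup above types the patient ID in by hand. As a sketch of one alternative, shown on made-up data (the `toy` frame and the names `visits`, `top_id` and `top_rows` are hypothetical), `idxmax()` on the per-group sizes returns the label of the largest group, which can then be used in the filter.

```python
import pandas as pd

# Hypothetical toy data: id 'a' has the most rows
toy = pd.DataFrame({'id': ['a', 'b', 'a', 'c', 'a', 'b'],
                    'thick': [1, 2, 4, 5, 3, 2]})

visits = toy.groupby('id').size()     # number of rows per id
top_id = visits.idxmax()              # id with the largest count
top_rows = toy[toy['id'] == top_id]   # all rows for that id

print(top_id)
print(top_rows)
```

If several ids tie for the largest count, `idxmax()` returns only the first of them, so this shortcut is best checked against the full `visits` listing.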


---

So it appears that some patients have multiple legitimate entries in the data frame.

---

If you were put in charge of analyzing these data, what would you do with duplicate observations in this data frame, and why?

---