# Wrangling III homework

## Preliminaries

As usual, we'll load some libraries we're likely to use.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Now we'll get set up to work by

- loading the cancer data and cleaning it up (as before)
- trimming out some columns so we can look at the data frame more easily
- shortening some of the column names to save ourselves some typing

Let's reuse our function to do the loading and cleaning.

In [2]:
def bcd_load_clean():
    bcd = pd.read_csv('./data/breast_cancer_data.csv')
    bcd['patient_id'] = bcd['patient_id'].astype('string')
    bcd['doctor_name'] = bcd['doctor_name'].str.split().str[1]
    bcd['bare_nuclei'] = bcd['bare_nuclei'].replace('?', '')
    bcd['bare_nuclei'] = pd.to_numeric(bcd['bare_nuclei'])
    
    return bcd

In [3]:
bcd = bcd_load_clean()
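As an aside, here is a minimal sketch of another way to handle the `'?'` placeholders. This is not what `bcd_load_clean()` does; it is just an alternative, and the `raw` values below are made up for illustration. Passing `errors='coerce'` to `pd.to_numeric()` turns anything that can't be parsed as a number into `NaN` in one step.

```python
import pandas as pd

# Made-up values standing in for the raw 'bare_nuclei' column
raw = pd.Series(['1', '10', '?', '3'])

# errors='coerce' turns anything that can't be parsed into NaN
nuclei = pd.to_numeric(raw, errors='coerce')

# Count how many entries became NaN
n_missing = nuclei.isna().sum()
print(n_missing)
```

The trade-off is that `errors='coerce'` silently converts *any* unparseable value, not just `'?'`, so it's worth checking the count of missing values afterwards.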

Make a little version with just two numeric columns to play with.

In [4]:
bcd2 = bcd[['patient_id', 'clump_thickness', 'bland_chromatin', 'class']].copy()

Let's give the columns shorter names to save some typing.

In [5]:
bcd2 = bcd2.rename(columns={'clump_thickness': 'thick',
                            'bland_chromatin': 'chrom',
                            'patient_id': 'id'})

---

## Make a mini data set to test with

In [20]:
tiny = pd.DataFrame(dict(a = [1, 2, 3, 3, 4, 5, 5, 5, 6],
                         b = ['a', 'b', 'c', 'c', 'd', 'e', 'e', 'f', 'g']))

In [21]:
tiny

   a  b
0  1  a
1  2  b
2  3  c
3  3  c
4  4  d
5  5  e
6  5  e
7  5  f
8  6  g


In [23]:
tiny.duplicated(keep=False)

0    False
1    False
2     True
3     True
4    False
5     True
6     True
7    False
8    False
dtype: bool

Get the number of rows and columns.

In [33]:
# get the number of rows and columns
nrows, ncols = tiny.shape


Make sure we can index into rows with a for loop.

In [None]:
for i in range(nrows):
    print(tiny.loc[i])

In [41]:
id_with_above = []  # labels of rows that match the row above them (to be dropped)

In [42]:
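for i in range(1, nrows):  # start at 1: row 0 has no row above it
    # count the columns in which this row matches the previous one
    tst = sum(tiny.loc[i] == tiny.loc[i-1])
    if tst == ncols:  # every column matches, so the row is a repeat
        id_with_above.append(i)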

In [43]:
id_with_above

[3, 6]

In [51]:
temp = tiny.drop(id_with_above)

In [52]:
new_tiny = temp.reset_index(drop=True)

In [53]:
new_tiny

   a  b
0  1  a
1  2  b
2  3  c
3  4  d
4  5  e
5  5  f
6  6  g
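The loop above can also be written without explicit iteration. Here's a minimal sketch, assuming the same `tiny` data: `shift()` slides every row down by one, so comparing the frame to its shifted copy checks each row against the row above it. The names `same_as_above` and `new_tiny_vec` are ours, just for illustration.

```python
import pandas as pd

tiny = pd.DataFrame(dict(a=[1, 2, 3, 3, 4, 5, 5, 5, 6],
                         b=['a', 'b', 'c', 'c', 'd', 'e', 'e', 'f', 'g']))

# True where every column matches the row directly above.
# Row 0 is compared against NaN, so it can never match.
same_as_above = (tiny == tiny.shift()).all(axis=1)

# Drop the flagged rows and renumber the index
new_tiny_vec = tiny[~same_as_above].reset_index(drop=True)

print(same_as_above[same_as_above].index.tolist())
```

This flags the same rows as the loop. Like the loop, it only catches a repeat that sits directly below its original, not repeats elsewhere in the frame.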


---

## Duplicate entries

As we have already seen, datasets can contain strange things that we have to deal with prior to analysis. One of the most common issues in a dataset is duplicate entries. These are common in large datasets that have been transcribed by humans at some point. Humans get bored, lose their place, etc.

---

... we can see that it has a "keep" argument. By default (`keep='first'`), `duplicated()` flags every repeat of a row *except* its first occurrence. We can make it flag all the copies, including the first, with `keep=False`.

Go ahead and do that in the cell below.

In [9]:
bcd2.duplicated(keep=False)

0      False
1      False
2      False
3      False
4      False
       ...  
694    False
695    False
696    False
697     True
698     True
Length: 699, dtype: bool
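To see what the different `keep` settings do, here's a small sketch on a toy data frame (the `toy` frame and its values are made up for illustration). Rows 1 and 2 are identical, so they are the only ones that can be flagged.

```python
import pandas as pd

# Made-up frame in which rows 1 and 2 are exact copies
toy = pd.DataFrame({'a': [1, 2, 2, 3],
                    'b': ['x', 'y', 'y', 'z']})

first = toy.duplicated()             # default keep='first': flag all but the first copy
last = toy.duplicated(keep='last')   # flag all but the last copy
both = toy.duplicated(keep=False)    # flag every copy

print(first.tolist())  # [False, False, True, False]
print(last.tolist())   # [False, True, False, False]
print(both.tolist())   # [False, True, True, False]
```

So `keep=False` is the setting to use when we want to *look* at all the copies, while the default is handy when we want to mark the extras for removal.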

In [17]:
sum(bcd2.loc[698] == bcd2.loc[697])

4

In [18]:
bcd2.loc[697:698]

         id  thick  chrom      class
697  897471    4.0   10.0  malignant
698  897471    4.0   10.0  malignant
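Rather than reading through the whole True/False output, we can use it as a mask to pull out just the duplicated rows. Here's a minimal sketch on a stand-in frame: the `demo` values are made up, and only the column names follow `bcd2`. It also repeats the column-by-column comparison from above, where a count equal to the number of columns means the two rows are identical.

```python
import pandas as pd

# Stand-in data with made-up values; only the column names follow bcd2
demo = pd.DataFrame({'id': ['101', '102', '103', '103'],
                     'thick': [5.0, 3.0, 4.0, 4.0],
                     'chrom': [3.0, 2.0, 10.0, 10.0],
                     'class': ['benign', 'benign', 'malignant', 'malignant']})

# Keep only the rows that have an exact copy somewhere in the frame
dupes = demo[demo.duplicated(keep=False)]

# Compare the last two rows column by column and count the matches
n_matching = (demo.loc[3] == demo.loc[2]).sum()
rows_identical = n_matching == demo.shape[1]

print(dupes.index.tolist())
print(rows_identical)
```

The same pattern should work on the real data as `bcd2[bcd2.duplicated(keep=False)]`.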
