# Wrangling III homework

## Deleting invalid rows from the cancer data

The committee has decided how to handle duplicate rows in the (smaller version of the) cancer data set. If a row is identical to the one immediately above it, we'll consider it an accidental entry due to fatigue or whatever. But a row that is *not* identical to the one above it will be considered valid, even if it has a duplicate somewhere else in the data set; we'll assume such duplicates represent separate visits.

So our rule is: **if a row is identical to the one above it, we drop it**.

A small pre-cleaned version of the data with only 4 columns is in **small_cancer_data.csv**, so you can read it in directly without having to clean it up.

Spend a minute or two thinking about how you would approach this problem.

If you are ready to go on your own, then go!

Once you have working code – once you can take **small_cancer_data.csv** and trim out the unwanted rows – then wrap your code into a function, so all you have to do to drop unwanted rows is call your function!

---

Spend some time thinking about and working on the problem. If you get to an impasse and you'd like some hints, read on.

---

## Preliminaries

As usual, we'll load some libraries we'll be likely to use.

In [2]:
import pandas as pd

---

## Make a mini data set for testing

Rather than taking a crack at the whole set, make a small data frame named `tiny` with 10 rows and two columns. Put successive repeated rows in two places (like rows 2 and 3 could repeat, as could rows 6 and 7). Put an additional repeated row on its own.

Something like this:

In [3]:
tiny = pd.DataFrame(dict(a = [1, 2, 3, 3, 4, 5, 5, 5, 6, 3],
                         b = ['a', 'b', 'c', 'c', 'd', 'e', 'e', 'f', 'g', 'c']))

Check our tiny data frame.

In [None]:
# check your test data frame


There should be two rows that need to be dropped, and one (the last) that should be kept even though it's a duplicate.

Just to be sure, check the output of `.duplicated(keep=False)` – it should show back-to-back `True` values in 2 places, and one solo `True` at the end.

In [None]:
# check .duplicated output
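
If you want something to compare against, here is a minimal sketch of that check. It assumes `tiny` was built exactly as in the example cell above; the name `dup_mask` is just an illustrative choice.

```python
import pandas as pd

# Same test frame as in the example cell above
tiny = pd.DataFrame(dict(a=[1, 2, 3, 3, 4, 5, 5, 5, 6, 3],
                         b=['a', 'b', 'c', 'c', 'd', 'e', 'e', 'f', 'g', 'c']))

# keep=False flags every member of a duplicated group,
# not only the second and later copies
dup_mask = tiny.duplicated(keep=False)
print(dup_mask)
```

Rows 2 and 3, rows 5 and 6, and the last row (9) should come back `True`. Note that `.duplicated()` by itself can't tell us whether a duplicate sits directly below its twin, which is why we still need our own comparison.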


---

## Make a plan

There are probably 100 ways to solve this problem. Many are probably very clever and involve using fancy pandas functions.
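
For the curious, one of those "fancy" routes might look like the sketch below. It is only one possibility, not the approach this notebook walks through: it uses `.shift()` to line each row up with the one above it. The names `is_repeat` and `trimmed` are illustrative, and the sketch assumes the data has no missing values (a `NaN` never compares equal to another `NaN`).

```python
import pandas as pd

tiny = pd.DataFrame(dict(a=[1, 2, 3, 3, 4, 5, 5, 5, 6, 3],
                         b=['a', 'b', 'c', 'c', 'd', 'e', 'e', 'f', 'g', 'c']))

# tiny.shift() moves every row down by one, so comparing tiny with it
# compares each row to the row immediately above
is_repeat = (tiny == tiny.shift()).all(axis=1)

# Keep only the rows that are NOT a repeat of the row above
trimmed = tiny[~is_repeat].reset_index(drop=True)
print(trimmed)
```

It can be handy to keep a version like this around as an independent check on the loop-based solution.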

A straightforward plan using things we already know about might be something like

- go through the rows of the data frame with a `for` loop, starting with the second row
- at each row, compare the current row with the previous one
- if they're the same, save the index of the current row
- after the `for` loop, delete the unwanted rows using the saved indexes

---

## Test the parts of the plan

Now that we've got a plan, let's get the pieces of the plan to work before putting the whole plan together.

### Make sure we can get rows

We should be able to get rows of a data frame in a couple of ways. These are

- using `.loc[]` with the row's index value (its name)
- using `.iloc[]` and indexing into the data by position, as if it were a NumPy array

Let's try the `.loc[]` way.

And let's try the `.iloc[]` method.

Looks like either will work!
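
As a rough illustration of the two access styles, the sketch below pulls the same row both ways. It assumes `tiny` has the default 0, 1, 2, ... index, so labels and positions happen to coincide; the variable names are illustrative.

```python
import pandas as pd

tiny = pd.DataFrame(dict(a=[1, 2, 3, 3, 4, 5, 5, 5, 6, 3],
                         b=['a', 'b', 'c', 'c', 'd', 'e', 'e', 'f', 'g', 'c']))

# .loc[] looks a row up by its index label (its name)
row_by_label = tiny.loc[2]

# .iloc[] looks a row up by its integer position
row_by_position = tiny.iloc[2]

print(row_by_label)
print(row_by_position)
```

The two only agree because the labels match the positions here. After dropping rows (and before resetting the index) they can point at different rows.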

### Figure out how to compare rows

We are going to need to compare rows. Let's see how that is going to work.

#### compare the first and second rows - these *should* ***not*** match

In [None]:
# which things in the rows match?


#### compare the third and fourth rows - these *should*  match

In [None]:
# which things in the rows match?


The rows only match if ***all*** the columns match, so we can see if this is the case with the `all()` function.

In [None]:
# do all the columns match?


Now we have a way to compare rows and get a single `True` if the rows are identical, and a `False` if they're not.
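
One way the comparison could look is sketched below, assuming the `tiny` frame from earlier. It uses the Series method `.all()`; Python's built-in `all()` applied to the comparison result should give the same answer. The variable names are illustrative.

```python
import pandas as pd

tiny = pd.DataFrame(dict(a=[1, 2, 3, 3, 4, 5, 5, 5, 6, 3],
                         b=['a', 'b', 'c', 'c', 'd', 'e', 'e', 'f', 'g', 'c']))

# Comparing two rows gives one True/False per column
print(tiny.loc[0] == tiny.loc[1])
print(tiny.loc[2] == tiny.loc[3])

# .all() collapses the per-column results into a single True/False
first_two_match = (tiny.loc[0] == tiny.loc[1]).all()
third_fourth_match = (tiny.loc[2] == tiny.loc[3]).all()
print(first_two_match, third_fourth_match)
```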

And now that we know how to do the row comparison, let's get a `for` loop working.

### Confirm we can get rows with a `for` loop

##### *Loop through the first few rows*

Let's make sure we can index into rows with a for loop. Let's try to get the first few using `.loc[]` and print them. Like
```
for ... :
    print(...)
```

In [None]:
# loop through the first few rows


##### *Loop through **all** the rows*

To loop through all the rows, we first need to get the number of rows. We can do this using the `shape` attribute.

In [41]:
# get the number of rows using shape


In [None]:
# loop through all the rows
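
If the loop is giving you trouble, here is a minimal sketch of both steps. It assumes `tiny` still has its default 0, 1, 2, ... index, so `.loc[i]` finds a row for every `i`; `n_rows` is an illustrative name.

```python
import pandas as pd

tiny = pd.DataFrame(dict(a=[1, 2, 3, 3, 4, 5, 5, 5, 6, 3],
                         b=['a', 'b', 'c', 'c', 'd', 'e', 'e', 'f', 'g', 'c']))

# shape is a (rows, columns) tuple, so the first element is the row count
n_rows = tiny.shape[0]

# Visit every row in order and print it
for i in range(n_rows):
    print(tiny.loc[i])
```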


---

## Putting it all together

Get the number of rows

In [97]:
# get the number of rows using shape


Make an empty list to hold the indexes of the rows we're going to drop

Make a `for` loop that 

- goes from 1 (i.e. the second row) to the end
- tests the current row against previous
- stores index for dropping

Check that we got the correct indexes.

Make a new data frame with the unwanted rows `.drop`ped.

Use `.reset_index()` to make a new sequential index for our data frame.

Marvel at your work!

If you don't like the "index" column with old indexes (sometimes it's useful to have the old indexes – here it's just annoying), you can set `drop=True` when you call `.reset_index()` above.
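
Put together, the steps in this section might look something like the sketch below. It is one possible version under a couple of assumptions: the index labels are unique (true for the default index), and there are no missing values, since a `NaN` never compares equal to another `NaN`. The names `to_drop` and `trimmed` are illustrative.

```python
import pandas as pd

tiny = pd.DataFrame(dict(a=[1, 2, 3, 3, 4, 5, 5, 5, 6, 3],
                         b=['a', 'b', 'c', 'c', 'd', 'e', 'e', 'f', 'g', 'c']))

n_rows = tiny.shape[0]   # number of rows
to_drop = []             # index labels of the rows we want to remove

# Start at 1 (the second row) so there is always a previous row to compare to
for i in range(1, n_rows):
    if (tiny.iloc[i] == tiny.iloc[i - 1]).all():
        to_drop.append(tiny.index[i])

print(to_drop)

# Drop the flagged rows, then rebuild a clean 0, 1, 2, ... index
trimmed = tiny.drop(to_drop).reset_index(drop=True)
print(trimmed)
```

For `tiny`, the flagged rows should be 3 and 6, and the final repeated row should survive.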

---

## Run your code on the cancer data

Try our code on the (small version of the) cancer data!

### Load the data

### Get the number of rows

In [115]:
# get the number of rows using shape


### Make an empty list for indexes

### Run your `for` loop!

### Check the indexes you found

### Drop the unwanted rows

### Reset the row indexes

### Check the shape to confirm the rows were dropped!

---

## Wrapping it all in a function

Once you've got your code running, put it all in a function so it's reusable!

In [88]:
def...



Run your function!

Check the shape to confirm your function worked!
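
If you'd like to compare notes, here is one way the function could be written. The name `drop_consecutive_duplicates` is made up for this sketch, and the same assumptions apply as before: unique index labels and no missing values in the rows being compared.

```python
import pandas as pd


def drop_consecutive_duplicates(df):
    """Return a copy of df without rows identical to the row directly above."""
    to_drop = []
    for i in range(1, df.shape[0]):
        # A row is unwanted only if every column matches the previous row
        if (df.iloc[i] == df.iloc[i - 1]).all():
            to_drop.append(df.index[i])
    # drop() returns a new frame, so the caller's data is left untouched
    return df.drop(to_drop).reset_index(drop=True)


# Try it on the tiny test frame first
tiny = pd.DataFrame(dict(a=[1, 2, 3, 3, 4, 5, 5, 5, 6, 3],
                         b=['a', 'b', 'c', 'c', 'd', 'e', 'e', 'f', 'g', 'c']))
cleaned = drop_consecutive_duplicates(tiny)
print(tiny.shape, cleaned.shape)

# For the real data, something like:
# cancer = pd.read_csv("small_cancer_data.csv")
# cancer_clean = drop_consecutive_duplicates(cancer)
```

Comparing the shapes before and after is a quick sanity check: the row count should shrink by exactly the number of back-to-back repeats.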

### High-five the person closest to you!

Because you deserve a high-five right now.

---