# Data Preparation and Cleaning

---


Unfortunately, a real-world data set is unlikely to be ready for analysis without some preparation and cleaning. Some of the cleaning you may want to do includes, but is not limited to, handling missing values, ordering the entries, and reformatting data. In this chapter we will cover important data cleaning and preparation skills that will help you work around some commonly encountered issues.


# Preparing Our Environment

---

`pandas` has a lot of powerful and easy-to-use functionality for manipulating and cleaning data. We will therefore import the `pandas` toolkit with the standard alias `pd` for this chapter.

```python
import pandas as pd
```

We will also be making some basic plots in some examples to spot flaws in our data set, as we learned to do in the last chapter. To do this we use the Python package `matplotlib.pyplot` under the standard alias `plt`.

```python
import matplotlib.pyplot as plt
```

Lastly, some examples we will be working through require constructing `DataFrames` that are complex, and for convenience I will be using the `numpy` package with the standard alias `np`. You do not need to worry about the details of this package to understand this chapter, but, if you are interested, you can read the documentation for this widely used package here: [`numpy`](https://docs.scipy.org/doc/numpy/).

```python
import numpy as np
```

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# About the Data

---

In previous chapters we explored the medical spending data set and saw some methods for spotting flaws. Now we will discuss how to correct these errors. As a reminder, our subset of data contains the following columns:

| Column |Description|
|:----------|-----------|
| `unique_id`| A unique identifier for a Medicare claim to CMS |
| `doctor_id` | The Unique Identifier of the doctor who <br/> prescribed the medicine  |
| `specialty` | The specialty of the doctor who prescribed the medicine |
| `medication` | The medication prescribed |
| `nb_beneficiaries` | The number of beneficiaries the <br/> medicine was prescribed to  |
| `spending` | The total cost of the medicine prescribed <br/>for the CMS |

The complete data set is available from the Center for Medicare & Medicaid Services website ([`CMS` website](https://www.cms.gov/OpenPayments/Explore-the-Data/Dataset-Downloads.html)).

We will specifically be looking at data sets that are intentionally constructed with some common issues so we can practice the data cleaning and preparation skills we will be covering in this chapter.

The file that we will be using throughout this chapter is named 'spending_clean_ex.csv' and, relative to our current working directory, this file is in the folder 'Data'. We will be reading this `.csv` file and saving its contents into the `DataFrame` named `spending_df`. We also know ahead of time that this file contains the column `unique_id`, which we will want to use to index our `DataFrame`. Please run the code in the cell below to have access to this data set.

In [0]:
spending_df = pd.read_csv('Data/spending_clean_ex.csv', index_col='unique_id')

# Inspecting and Modifying Data Types


---


Please recall the fact that `dtypes` is a `DataFrame` attribute which tells us the `pandas` data type of each column. We previously noted that this was important as some methods only make sense to call on certain data types. Let us again inspect the data types of `spending_df`.

```python
spending_df.dtypes
doctor_id             int64
specialty            object
medication           object
nb_beneficiaries    float64
spending             object
dtype: object
```

Note two issues in this output: first, `doctor_id` is stored as `int64`, and second, `spending` is stored as `object`. The first is a problem because it does not make sense to treat `doctor_id`s as numbers (summing `doctor_id`s, for example, is meaningless). The second is a problem because `pandas` handles `object` entries like `Python` strings, whereas we would like to perform numeric operations on the `spending` entries (summing `spending`, for example, is meaningful).

To solve this issue we need to cast the entries of the columns to a new `pandas` data type that is more appropriate. The method for this job is the `Series` `astype()` method. The `Series` `astype()` method will make a copy of the calling series and cast the new `Series` to the data type specified by the one positional argument, `dtype`.
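The copy behavior described above can be sketched on a small, made-up `Series`. The name `ids` and its values are hypothetical and are not part of the spending data set.

```python
import pandas as pd

# Hypothetical toy Series standing in for a column of numeric identifiers.
ids = pd.Series([101, 102, 103])

# astype() returns a new Series; the calling Series keeps its original dtype.
ids_as_object = ids.astype('object')

print(ids.dtype)            # dtype of the original Series
print(ids_as_object.dtype)  # dtype of the converted copy
```

Because the original `Series` is untouched, the result has to be assigned somewhere for the cast to persist.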

For instance, we are going to want to cast the `doctor_id` entries as `pandas objects` and make this cast permanent. To do this we can call the `Series` `astype()` method using the `spending_df` `doctor_id` column (which remember is a `Series`) and save the result into the same column, `doctor_id`.

```python
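>>> # Assign with [] so the column is replaced; recent pandas versions may keep the old dtype when assigning through .loc[:, ...]
>>> spending_df['doctor_id'] = spending_df['doctor_id'].astype('object')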
>>> spending_df.dtypes['doctor_id']
dtype('O')
```

Note that `dtype('O')` indicates that the data type is the `pandas` *O*bject type. This worked and was easy: the `doctor_id`s are now `pandas objects`.

However, we cannot convert the `spending` column to `float64` in the same way. If we try, we see the following result.

```python
spending_df.loc[:, 'spending'] = spending_df.loc[:, 'spending'].astype('float64')
...
ValueError: could not convert string to float: '$248.95'
```

The values contained in the `spending` `Series` are not compatible with the `float64` type because a `float64` cannot contain the characters "`$`" or ","; we will therefore have to remove them from each entry. Changing `spending` from `object` to `float64` will require `Series` string methods.


In [0]:
spending_df.dtypes

In [0]:
spending_df['doctor_id'] = spending_df['doctor_id'].astype('object')
print(spending_df.dtypes)

# Exercise 5.1: Inspecting and Modifying Data Types

 Consider the following `DataFrame`
 
```python
scoring_2018 = pd.DataFrame({'Rank': {'Tennessee Tech': 1, 'Morehead St.': 2, 'Texas Tech': 3},
                                'G': {'Tennessee Tech': 65, 'Morehead St.': 63, 'Texas Tech': 65},
                              'W.L': {'Tennessee Tech': '53-12','Morehead St.': '37-26','Texas Tech': '45-20'},
                                'R': {'Tennessee Tech': 639, 'Morehead St.': 529, 'Texas Tech': 529},
                               'PG': {'Tennessee Tech': 9.8, 'Morehead St.': 8.4, 'Texas Tech': 8.1}})
```

Which table correctly states the data type of the columns of `scoring_2018` in its current state?


A: 

| Column |Data Type|
|:----------|-----------|
| `Rank` |  int64 |
| `G` |  int64 |
| `W.L` |   object |
| `R` |  int64 |
| `PG` |  float64 |

B: 

| Column |Data Type|
|:----------|-----------|
| `Rank` |  object |
| `G` |  int64 |
| `W.L` |   object |
| `R` |  int64 |
| `PG` |  object |

C:  

| Column |Data Type|
|:----------|-----------|
| `Rank` |  object |
| `G` |  object |
| `W.L` |   object |
| `R` |  object |
| `PG` |  object |


*Hint: Feel free to use the code cell added below to explore the data types of `scoring_2018`*

---

Which line(s) of code will modify the data types of `scoring_2018`, in place, so that it matches the following table?

| Column |Data Type|
|:----------|-----------|
| `Rank` |  object |
| `G` |  object |
| `W.L` |   object |
| `R` |  object |
| `PG` |  object |

A.
```python
scoring_2018['Rank'] = scoring_2018['Rank'].astype('object')
scoring_2018['G'] = scoring_2018['G'].astype('object')
scoring_2018['R'] = scoring_2018['R'].astype('object')
scoring_2018['PQ'] = scoring_2018['PG'].astype('object')
```

B.
```python
scoring_2018 = scoring_2018[['Rank', 'G',  'R', 'PG']].astype('object')
```

C. 
```python
scoring_2018 = scoring_2018.astype('object')
```

D. 

    All of the above


*Hint: Feel free to use the code cell below to try these commands out. For the incorrect options, make note of what is going wrong and/or what errors are being thrown.*

In [17]:
# Exercise 5.1 Scratch code cell

# Series String Methods

---

A `Series` object provides various string processing methods that can be accessed through the `Series`’s `str` attribute. There are too many string methods to cover exhaustively, so we will cover perhaps the most important one, `str.replace()`. You can see all of the `Series` `str` methods and their descriptions [here](https://pandas-docs.github.io/pandas-docs-travis/api.html#string-handling).

The `Series` `str.replace()` method takes two required positional parameters: the first is the pattern we want to replace, and the second is the value to replace it with. Note that `str.replace()` does not operate in place, i.e. it returns a new `Series` with the desired changes and leaves the calling `Series` unchanged. Thus, if we wanted to create a new `Series`, $s_{1}$, from the `Series` of `pandas objects`, $s_{0}$, where every substring of each entry of $s_{0}$ matching the pattern $p$ is replaced with the string $l$, we would type: `s_1 = s_0.str.replace(p, l)`.

In our case we want to remove every dollar sign character, '$\$$'. That seems different from what `str.replace()` does, but notice that removing "$\$$" is equivalent to replacing it with the empty string, `""`. Similarly, removing "," is equivalent to replacing it with the empty string. Therefore `str.replace()` is going to help us. Also recall that we want the modification to be permanent; to accomplish this we assign the result of the `str.replace()` call back to the same column of the original `DataFrame`.
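As a minimal sketch of this behavior, consider a made-up `Series` of price strings (the name `prices` and its values are hypothetical). The `regex=False` argument is an addition that the text above does not use; it asks `pandas` to treat the pattern as literal text, which matters for `$` because that character has a special meaning in regular expressions.

```python
import pandas as pd

# Hypothetical toy Series of price strings.
prices = pd.Series(['$1,200.50', '$35.00'])

# str.replace() returns a new Series and leaves `prices` unchanged.
# regex=False makes "$" a literal character instead of a regex anchor.
no_dollar = prices.str.replace('$', '', regex=False)

print(prices.tolist())     # the original strings
print(no_dollar.tolist())  # the strings without the dollar sign
```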


```python
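>>> # regex=False treats "$" as a literal character rather than a regular-expression anchor
>>> spending_df.loc[:, "spending"] = spending_df.loc[:, "spending"].str.replace("$", "", regex=False)
>>> spending_df.loc[:, "spending"] = spending_df.loc[:, "spending"].str.replace(",", "", regex=False)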
>>> spending_df.head()
            doctor_id          specialty       medication  nb_beneficiaries spending
unique_id                                                                           
BK982218   1750389599  INTERNAL MEDICINE     AZITHROMYCIN              12.0    77.26
CG916968   1952344418         CARDIOLOGY      SIMVASTATIN              85.0   767.83
SA964720   1669522744  INTERNAL MEDICINE  INSULIN DETEMIR              14.0  5409.29
```

We see that the code above made the desired changes and that the entries of the `spending` column now look like valid `float64` values. The final step is to convert the `spending` column to the `float64` data type, now that its entries are float-compatible. Recall that to do this we use the `astype()` `Series` method and pass it the string `'float64'` specifying the data type. Also, since we want this to be permanent, we save the result of the `astype()` method to the original `DataFrame`'s `spending` column.

```python
>>> # Assign with [] so the column is replaced and takes the float64 dtype
>>> spending_df["spending"] = spending_df["spending"].astype("float64")
>>> spending_df.dtypes
doctor_id            object
specialty            object
medication           object
nb_beneficiaries    float64
spending            float64
dtype: object
```

We see from the results above that the `spending` column is now successfully saved as a `pandas` `float64` data type.

The entire procedure could be executed with a single line of code using method chaining as seen in the code below. Remember that we read method chaining from left to right, i.e. the leftmost method is executed first and the returned result is then the caller of the next method in the chain.

```python
>>> spending_df["spending"] = (spending_df["spending"]
                           .str.replace("$", "", regex=False)
                           .str.replace(",", "", regex=False)
                           .astype("float64"))
```

Wrapping the chained method calls in parentheses lets us write the complete expression across multiple lines; the parentheses would not be needed if the whole chain were written on a single line.

In [0]:
spending_df["spending"] = (spending_df["spending"]
                           .str.replace("$", "", regex=False)
                           .str.replace(",", "", regex=False)
                           .astype("float64"))
spending_df.dtypes

# Exercise 5.2: `Series` String Methods

Given the  `Series`, `UH_Series` which can be described by the following table:

| Index |Data |
|:----------|-----------|
| 0 | 'U H Manoa' |
| 1 | 'UH_Manoa' |
| 2 | 'UH Manoa' |
| 3 | 'UH-Manoa' |

And can be constructed using the code shown below:

```python
UH_Series = pd.Series(['U H Manoa',  'UH_Manoa', 'UH Manoa', 'UH-Manoa'])
```

Which line of code will modify `UH_Series`, in place, so that each entry matches the string: 'UHManoa'?

A:
```python
(UH_Series
 .str.replace(" ", "")
 .str.replace("_","")
.str.replace("-",""))
```

B: 
```python
(UH_Series
 .str.replace("_"," ")
.str.replace("-"," "))
```

C:
```python
UH_Series = (UH_Series
 .str.replace(" ", "")
 .str.replace("_","")
.str.replace("-",""))
```

D:
```python
UH_Series = (UH_Series
 .str.replace("_","")
.str.replace("-",""))
```

*Hint: Feel free to use the code cell below to try these commands out. For the incorrect options, make note of what is going wrong and/or what errors are being thrown.*

In [0]:
# Exercise 5.2 Scratch code cell

# Reindexing

---


The next, and very common, cleaning task that we will discuss is reindexing. To motivate it, suppose that you are working with data that is not in the order you would like, that contains entries you are not concerned with, or that does not contain all of the entries you desire. For example, consider the following `DataFrame`, `sine_wave_df`, which is constructed by simulating a sine wave that could represent measurements of the voltage signal coming from an outlet at your house.
(Do not concern yourself with how this `DataFrame` is constructed; the point of this example is to illustrate the value of reindexing.)

```python
>>> times = [60.0 * 2 * np.pi * x / 900 for x in range(30)] 
>>> wave = np.sqrt(2) * 120.0 * np.sin(times) 

>>> sine_wave_df = pd.DataFrame({'wave':wave}, index=times)
>>> sine_wave_df.sort_values(by='wave', inplace=True)
```

The above code constructs a `DataFrame` `sine_wave_df` that is indexed by the time and has the single column 'wave', which could be the voltage measurements of a home outlet. Currently the `DataFrame` `sine_wave_df` is sorted in increasing order by the values in the column `wave`. Let us try and plot this wave to see what is going on.

```python
>>> plt.figure()
>>> sine_wave_df.plot(kind='line')
>>> plt.show()
```

<img src="images/wave_before_reindex.png" width="600">

The above image looks nothing like the sine wave vs. time signal we would expect to see from a home outlet. One solution to the problem would be to sort the `DataFrame` by time, which in this case is the index. Another solution is to use the `reindex()` `DataFrame` method. As we will see, this option is a little more flexible, since we can simultaneously drop and add entries when we reindex, allowing us to do things like decimation and interpolation.

The `DataFrame` `reindex()` method creates a new `DataFrame` whose entries conform to the new index passed to the method. By conform I mean that the new `DataFrame` will have exactly the index passed to the method, with the same labels in the same order. Entries of the calling `DataFrame` whose labels appear in the new index are kept in the new `DataFrame`, while entries whose labels do not appear in the new index are dropped. Furthermore, if the new index contains labels that did not exist in the calling `DataFrame`, those entries are filled with missing values, `NaN`, unless otherwise specified by the caller.

Let us reindex the `sine_wave_df` `DataFrame` by the same `times` list we made when we were constructing `sine_wave_df`; this will simply create a new `DataFrame` that is sorted by increasing time. To do this we call the `reindex()` `DataFrame` method, passing the `times` list.
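The three effects just described (reordering, dropping, and adding labels) can be seen at once in a small sketch. The `DataFrame` `df_small` and its labels are made up for illustration only.

```python
import pandas as pd

# Hypothetical toy DataFrame with labels 'a', 'b', 'c'.
df_small = pd.DataFrame({'value': [10, 20, 30]}, index=['a', 'b', 'c'])

# The new index reorders 'c' and 'a', omits 'b', and introduces 'z'.
new_df = df_small.reindex(['c', 'a', 'z'])

print(new_df)
# 'c' and 'a' keep their values, 'b' is dropped, and 'z' is filled with NaN.
# df_small itself is not modified.
```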

```python
>>> sine_wave_df_reindexed = sine_wave_df.reindex(times)
```

Now, let us plot the reindexed sine wave `DataFrame` to see if it is what we wanted.

<img src="images/sine_wave_reindexed.png" width="600">

Indeed the above plot looks like the expected sine wave. 

As stated, sorting is not the only use of reindexing; we may also add or drop entries using this method. Adding entries to a data set like the sine wave we have been working with is called interpolating. By default, `reindex()` sets new entries to `NaN` to represent missing values, but we may instead set these entries using a filling function. The `reindex()` method implements some common filling functions, which we select using the optional `method` parameter; two of them are *forward fill* and *backward fill*. Forward fill propagates the last valid observation forward until the next valid entry, while backward fill propagates the next valid entry backward. Note that these filling methods may only be used on indices that are monotonically increasing or decreasing.

To see how interpolation is done, let us reindex our sine wave `DataFrame` to conform to an index with times that are twice as close together, that is to say that the time between samples is halved, and forward fill the missing entries that arise. To do this we call `reindex()` on the `sine_wave_df` `DataFrame`, pass a new list with twice the sampling frequency of the `times` list, and set the `method` parameter to `'ffill'`.
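To make the difference between the two filling functions concrete, here is a small sketch on a made-up `Series` (the name `s` and its values are hypothetical). It reindexes the same data three ways: with no filling, with forward fill, and with backward fill.

```python
import pandas as pd

# Hypothetical toy Series sampled at the even labels 0, 2, 4.
s = pd.Series([1.0, 2.0, 3.0], index=[0, 2, 4])
new_index = [0, 1, 2, 3, 4]

plain = s.reindex(new_index)                     # new labels 1 and 3 become NaN
forward = s.reindex(new_index, method='ffill')   # new labels copy the previous value
backward = s.reindex(new_index, method='bfill')  # new labels copy the next value

print(pd.DataFrame({'plain': plain, 'ffill': forward, 'bfill': backward}))
```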

```python
>>> times2 = [60.0 * 2 * np.pi * x / 1800 for x in range(60)]
>>> sine_wave_df = sine_wave_df.reindex(times) 
>>> sine_wave_df_reindexed = sine_wave_df.reindex(times2, method='ffill')
```

To illustrate the results, let's plot both the original `sine_wave_df` `DataFrame` and the new `sine_wave_df_reindexed` `DataFrame`.

```python
>>> sine_wave_df['time'] = times
>>> sine_wave_df_reindexed['time'] = times2
>>> fig, axes = plt.subplots(nrows=2, ncols=1) 
>>> sine_wave_df_reindexed.plot(kind='line', x='time', y='wave', ax=axes[1]) 
>>> sine_wave_df.plot(kind='line', x='time', y='wave', ax=axes[0]) 
>>> plt.show()  
```

<img src="images/interpolated_sine.png" width="600">

We see what look like steps in our new signal; these are caused by the forward fill method used to fill in the missing values. Backward fill would have caused a very similar type of distortion. If these were voltage measurements from a home outlet, this type of distortion would be undesirable. A better approach would be to interpolate linearly, that is, to fill each missing value with the arithmetic mean of the previous valid entry and the next valid entry. A linear method is not an option in the `reindex()` method, but it is in the `interpolate()` `DataFrame` method, which we will discuss in a moment. First, let us change gears and look at how reindexing could be used on the medical spending data set.

## Reindexing Continued: Example with medical spending data set

We saw how reindexing could be used with time series data, now let us take a look at how the `reindex()` method would help us with analyzing the medical spending data set that we are more familiar with.

Let us first take a peek into the `spending_df` `DataFrame` to remind ourselves how it is structured.

```python
>>> spending_df.head(n=3)
            doctor_id          specialty       medication  nb_beneficiaries  spending
unique_id                                                                            
BK982218   1750389599  INTERNAL MEDICINE     AZITHROMYCIN              12.0     77.26
CG916968   1952344418         CARDIOLOGY      SIMVASTATIN              85.0    767.83
SA964720   1669522744  INTERNAL MEDICINE  INSULIN DETEMIR              14.0   5409.29
```

We see that the `spending_df` `DataFrame` is indexed by `unique_id` and has 5 columns labeled `doctor_id`, `specialty`, `medication`, `nb_beneficiaries`, and `spending`. Suppose we were only interested in the medications and not the doctors prescribing them, i.e. we only wanted the 2 columns `medication` and `spending`. Furthermore, suppose we wanted to add the column `opioid_drug` that will, once we fill it with values, tell us whether the medication is classified as an opioid. To do this we could use the `reindex()` method. This time, however, we are reindexing the columns of the `DataFrame`, so instead of passing an index we use the optional `columns` parameter of the `reindex()` method and set it to the column labels we would like our new `DataFrame` to have.

```python
>>> spending_df.reindex(columns=['medication', 'spending', 'opioid_drug']).head(n=3)
                medication  spending  opioid_drug
unique_id                                        
BK982218      AZITHROMYCIN     77.26          NaN
CG916968       SIMVASTATIN    767.83          NaN
SA964720   INSULIN DETEMIR   5409.29          NaN
```

We see that the returned `DataFrame` contains only the columns we specified, in the order we specified them. The new column `opioid_drug` was filled with missing values, `NaN`s. We will learn how to fill these missing values in the coming sections of this chapter.

The `reindex()` `DataFrame` method has many parameters and we cannot cover them all, but we did see its most important uses. I believe you will find the `reindex()` method very useful. The table below summarizes the parameters of the `reindex()` method.


| Parameters |Description|
|:----------|-----------|
| `labels`| New labels / index to conform the axis specified by ‘axis’ to. |
| `index, columns` | New labels / index to conform to. Preferably an Index object to avoid duplicating data |
| `axis` | Axis to target. Can be either the axis name (‘index’, ‘columns’) or number (0, 1). |
| `method` | method to use for filling holes in reindexed DataFrame.|
| `copy` | Return a new object, even if the passed indexes are the same  |
| `level` | Broadcast across a level, matching Index values on the passed MultiIndex level |
| `fill_value` | Value to use for missing values. Defaults to NaN, but can be any “compatible” value |
| `limit` | Maximum number of consecutive elements to forward or backward fill |
| `tolerance` | Maximum distance between original and new labels for inexact matches.|



* More information about the reindexing method is available [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html)
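As one illustration of the parameters in the table, the sketch below combines `columns` with `fill_value` so that a newly added column starts with a placeholder instead of `NaN`. The `DataFrame` `meds` is a hypothetical two-row stand-in for `spending_df`, and the placeholder string `'unknown'` is an arbitrary choice made for this example.

```python
import pandas as pd

# Hypothetical two-row stand-in for the spending data.
meds = pd.DataFrame({'medication': ['AZITHROMYCIN', 'SIMVASTATIN'],
                     'spending': [77.26, 767.83]},
                    index=['BK982218', 'CG916968'])

# Add the column 'opioid_drug'; fill_value replaces the default NaN
# with an arbitrary placeholder for the newly created entries.
with_flag = meds.reindex(columns=['medication', 'spending', 'opioid_drug'],
                         fill_value='unknown')

print(with_flag)
```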

In [0]:
spending_df.reindex(columns=['medication', 'spending', 'opioid_drug']).head(n=3)

# Exercise 5.3: Reindexing

Given the following `DataFrame`, `df`, described in the table below:

|  | color | number |
|:----------|-----------|-----------|
| d | green | 4 |
| a | blue |1|
| c | orange | 3 |

And constructed using the following code:

```python
df = pd.DataFrame({'color':['green','blue', 'orange'], 'number': [4, 1, 3]}, 
                  index = ['d', 'a', 'c'])
```

Use the `reindex()` method to update `df` so that its new index is the list of labels `['a', 'b', 'c', 'd']` and the new values are left as `NaN`s. Since `reindex()` returns a new `DataFrame`, you will need to assign the result back to `df`. The `DataFrame` should have the following information and structure.

|  | color | number |
|:----------|-----------|-----------|
| a | blue |1|
| b | NaN | NaN |
| c | orange | 3 |
| d | green | 4 |

In [0]:
# Type your solution to Exercise 5.3 here

# Handling Missing Data

It is very common for real-world data to come with missing entries, and sometimes, as we saw with reindexing, we introduce missing values ourselves. If these entries are not identified and handled appropriately, your analysis can be flawed.

Handling missing data can be summarized into 3 objectives:

1.   Identifying
2.   Filtering
3.   Filling

For some data sets it may be more suitable to *filter* rows or columns with missing values, and for others it may be better to *fill* the entries. We will work through examples of both possibilities.

## Identifying

We have learned in previous chapters that `pandas` denotes missing values with `NaN`. We have also seen how to load our data and specify what tags represent missing values using `read_table()` and `read_csv()` and the `na_values` parameter; this is the first step to identifying missing values. Next, once all the missing values are properly labeled as `NaN` in the `pandas` `DataFrame` or `Series`, it is often useful to find where all the missing values precisely are, i.e. the locations of the missing values. 
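As a reminder of how `na_values` works, here is a minimal sketch that reads a tiny made-up CSV from memory. The column names and the tags `'missing'` and `'?'` are hypothetical choices for this example, not tags known to appear in the spending file.

```python
import io
import pandas as pd

# Hypothetical CSV content in which 'missing' and '?' mark absent values.
raw = io.StringIO("id,spending\nA1,10.5\nA2,missing\nA3,?\n")

# na_values tells read_csv which extra strings to convert to NaN.
df = pd.read_csv(raw, na_values=['missing', '?'])

print(df)
print(df.dtypes)  # with the tags turned into NaN, 'spending' can be numeric
```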

This is typically achieved with the  `isnull()` `DataFrame` and `Series` methods. `isnull()` returns `True` if a cell contains a `NaN` value, and returns `False` otherwise. There are no positional parameters that we can pass to this method.

Let us see, for instance, the results of calling `isnull()` on the first four entries of the `spending_df` `DataFrame`.

```python
spending_df.isnull().head(n=4)
           doctor_id  specialty  medication  nb_beneficiaries  spending
unique_id                                                              
BK982218       False      False       False             False     False
CG916968       False      False       False             False     False
SA964720       False      False       False             False     False
TR390895       False       True       False             False     False
```

We see that the `specialty` value for `unique_id` TR390895 evaluates `isnull()` to `True`. We will see soon that this is going to be useful information for us. 
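One possible way to turn the boolean output of `isnull()` into locations is sketched below on a made-up `DataFrame` named `toy`. The `any(axis=1)` call is an assumption beyond what this chapter has covered; it reports, for each row, whether at least one entry is `True`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame with one missing value in each column.
toy = pd.DataFrame({'specialty': ['CARDIOLOGY', np.nan, 'INTERNAL MEDICINE'],
                    'spending': [10.0, 20.0, np.nan]},
                   index=['r1', 'r2', 'r3'])

mask = toy.isnull()  # boolean DataFrame with the same shape as toy

# Labels of the rows that contain at least one missing value.
rows_with_missing = toy.index[mask.any(axis=1)].tolist()

# Number of missing values per column (True counts as 1).
missing_per_column = mask.sum()

print(rows_with_missing)
print(missing_per_column)
```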

In [0]:
spending_df.isnull().head(n=4)

## Filtering

Filtering, in the context of handling missing values, refers to dropping the entries with missing values completely. Information is lost when you do this, but in many practical cases this is acceptable. Various approaches can be taken to filtering out missing values; we will cover two: subsetting and the `dropna()` `DataFrame` method.

To filter by subsetting, a skill we learned in chapter 4, we will make use of the fact that the `isnull()` `Series` method returns a boolean `Series` with exactly the same shape as the calling `Series`. Recall that `pandas` allows us to subset `DataFrames` using a boolean `Series`; the result is a subset of the original `DataFrame` containing only the entries whose index aligns with a `True` value. Also recall that `~` is a boolean operator equivalent to 'not', i.e. `True` is mapped to `False` and `False` is mapped to `True`.

For example, we can filter out missing values in the `spending_df` `DataFrame` by first obtaining a boolean `Series` indicating whether each row has a valid entry in the `spending` column, i.e. if the entry is missing, `NaN`, then the corresponding entry in the boolean `Series` is `False`. This is done by calling `isnull()` on the `spending` column of `spending_df` and negating the result with the `~` operator. Then we use the boolean `Series` to subset `spending_df`. We can save our subset to a new `DataFrame` that we will call `filtered_spending_df`.
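The intermediate boolean `Series` are easier to follow on a tiny example. The sketch below uses a made-up single-column `DataFrame` named `toy` and prints the mask before and after applying `~`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame with one missing spending value.
toy = pd.DataFrame({'spending': [5.0, np.nan, 7.5]}, index=['r1', 'r2', 'r3'])

is_missing = toy['spending'].isnull()  # True where the entry is NaN
is_valid = ~is_missing                 # flips every True/False

# Subsetting with the boolean Series keeps only the rows marked True.
kept = toy[is_valid]

print(is_missing.tolist())
print(is_valid.tolist())
print(kept)
```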

```python
>>> filtered_spending_df = spending_df.loc[~ spending_df.loc[:, "spending"].isnull(), : ]
```

To verify our results we can use the `Series` `sum()` method, noting that when `sum()` is called on a boolean `Series`, `True` counts as 1 and `False` counts as 0. Thus, if there are no missing values in the `spending` column of `filtered_spending_df`, then calling `sum()` on the `Series` `filtered_spending_df.loc[:, 'spending'].isnull()` should return 0. Let us see if this is the case.

```python
>>> filtered_spending_df.loc[:, 'spending'].isnull().sum()
0
```

Excellent, the output above shows that there are 0 missing entries in the `spending` column of the `filtered_spending_df` `DataFrame` as we expected.

`DataFrames` also have the `dropna()` method to drop entries with missing values. `dropna()` has the familiar optional parameter `axis`, which determines whether rows or columns containing `NaN` are dropped. If we set `axis=0` or `axis="rows"` then we will drop rows containing `NaN`s. Similarly, if we set `axis=1` or `axis="columns"` we will drop columns containing `NaN`s. Note that the operation does not overwrite the original data, but instead returns a new `DataFrame` with the `NaN`s dropped.

![](images/axis_drop.png)

The image above shows how the `dropna()` method will work with the `axis` parameter set to both 0 and 1 on the basic `DataFrame` with a single missing value, `NaN`.
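The same idea can be reproduced in code. The sketch below builds a made-up 2-by-2 `DataFrame` named `toy` with a single `NaN` and drops along each axis in turn.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2 DataFrame with a single missing value in column 'b'.
toy = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, np.nan]})

rows_dropped = toy.dropna(axis=0)  # removes the row that holds the NaN
cols_dropped = toy.dropna(axis=1)  # removes the column that holds the NaN

print(rows_dropped)
print(cols_dropped)
# toy itself still contains the NaN, since dropna() returns a new DataFrame.
```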

For example, let's make a new `DataFrame`, again called `filtered_spending_df` (overwriting the existing `DataFrame` of that name), that is the same as the `spending_df` `DataFrame` except that all the rows with missing values are dropped. To do this we use the `dropna()` method and set `axis=0`.

```python
>>> filtered_spending_df = spending_df.dropna(axis=0)
```

The output of the `dropna()` method was saved into `filtered_spending_df`; let's verify our results. We expect there to be no missing values at all in the new `DataFrame`. To check this we can call the `sum()` `DataFrame` method on the `filtered_spending_df.isnull()` `DataFrame`. The output of `sum()` will be a `Series` with an index matching the columns of `filtered_spending_df` and entries equal to the sum of each column over its rows.

```python
>>> filtered_spending_df.isnull().sum()
doctor_id           0
specialty           0
medication          0
nb_beneficiaries    0
spending            0
dtype: int64
```

The output tells us that the `dropna()` method worked as we expected: there are 0 missing values in each of the 5 columns of the `filtered_spending_df` `DataFrame`.


In [0]:
filtered_spending_df = spending_df.loc[~ spending_df.loc[:, "spending"].isnull(), : ]
filtered_spending_df.loc[:, 'spending'].isnull().sum()

In [0]:
filtered_spending_df = spending_df.dropna(axis=0)
filtered_spending_df.isnull().sum()

### Filtering Cont.

In addition to the parameter `axis`, `dropna()` has other useful parameters that we can use to customize the way we drop rows or columns from `DataFrames`, all of which are summarized in the table below.

| Parameter | Description |
|:----------:|:------------|
| `how` | `'any'` drops a row or a column if any of its values are `NaN`. <br/> `'all'` drops a row or a column only if all of its values are `NaN` | 
| `thresh` | Defines the minimum number of non-`NaN` values a row or column must contain to be kept. <br/> Useful for dropping variables (columns) with too many `NaN`s |
|`subset`| Defines a list of columns to consider. |
|`inplace`| bool, default False. If True, do operation inplace and return None. |

Notice that the familiar `inplace` parameter is available for `dropna()` and, as usual, defaults to `False`.

If you would like to read more about this method you can visit the `pandas` [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html).
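The effect of `how`, `thresh`, and `subset` can be compared side by side on a small made-up `DataFrame` (the name `toy` and its values are hypothetical). Each call below drops rows, which is the default axis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame with different amounts of missing data per row.
toy = pd.DataFrame({'x': [1.0, np.nan, np.nan, 4.0],
                    'y': [2.0, np.nan, np.nan, np.nan],
                    'z': [3.0, 6.0, np.nan, np.nan]})

only_empty_rows_dropped = toy.dropna(how='all')  # drops rows where every value is NaN
at_least_two_values = toy.dropna(thresh=2)       # keeps rows with 2 or more non-NaN values
x_must_be_present = toy.dropna(subset=['x'])     # only looks at column 'x' for NaN

print(only_empty_rows_dropped.index.tolist())
print(at_least_two_values.index.tolist())
print(x_must_be_present.index.tolist())
```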

# Exercise 5.4: Filtering Missing Values

Consider the `DataFrame`, `flights_df`, described using the table below:


| index |ORIGIN_AIRPORT | DESTINATION_AIRPORT | DEPARTURE_DELAY| ARRIVAL_DELAY |
|----------|-----------|---------|----------|----------|
| 0 | HNL | SFO | 3.0 | -21.0 |
| 1 | LAS | HNL | NaN | -2.0 |
| 2 | HNL | ITO | NaN | NaN |
| 3 | HNL | KOA| -6.0 | -9.0 |


And constructed using the following line of code:

```python
flights_df = pd.DataFrame({'ORIGIN_AIRPORT': ['HNL', 'LAS', 'HNL', 'HNL'], 
                           'DESTINATION_AIRPORT': ['SFO', 'HNL', 'ITO', 'KOA'],
                           'DEPARTURE_DELAY': [ 3., None, None, -6.], 
                           'ARRIVAL_DELAY': [-21.,  -2.,  None,  -9.]})
```

Which of the following statements will create a `DataFrame` that is a subset of `flights_df` containing only ***rows*** that do not have ***both*** a missing `ARRIVAL_DELAY` and a missing `DEPARTURE_DELAY` value? (Do not modify the original `flights_df` `DataFrame`.)

A:
```python
flights_df.dropna(axis='columns')
```

B:
```python
flights_df.dropna(axis='rows')
```

C:
```python
flights_df[ ~ (flights_df.loc[:, 'DEPARTURE_DELAY'].isnull() 
               & flights_df.loc[:, 'ARRIVAL_DELAY'].isnull())]
```

D:
```python
flights_df[ ~ (flights_df.loc[:, 'DEPARTURE_DELAY'].isnull() 
               | flights_df.loc[:, 'ARRIVAL_DELAY'].isnull())]
```

*Hint: Feel free to use the code cell below to try these commands out. For the incorrect options, make note of what is going wrong and/or what errors are being thrown.*

In [23]:
# Exercise 5.4 Scratch code cell

## Filling

In the context of handling missing values, filling refers to replacing the `NaN`s with valid values. This practice is also referred to as *imputation*: when we fill in missing data, we are imputing the missing values. Filling data often requires some understanding of how your data was collected and/or some domain knowledge, so that you can determine the appropriate way to fill the missing data.

Methods for filling missing values fall into two very broad categories:

1.   Filling the value with a constant
2.   Filling the value dynamically

Both approaches can be carried out using the `DataFrame` and `Series` method `fillna()`, and the only difference between the two approaches will be the value we pass to `fillna()`.

### Filling Continued: `fillna()` with Static Values 

First, let's discuss filling with a static, or constant, value. The `fillna()` `DataFrame` method takes either a scalar constant, which replaces all missing values of the calling `DataFrame`, or a dictionary with specific values for each column of the calling `DataFrame`. `fillna()` has the optional parameter `inplace`, which is `False` by default.

If we just wanted to make a new `DataFrame`, named `filled_spending_df`, that is the same as `spending_df` but instead with all missing values set to 0, regardless of the location of the missing entry, then we would just pass the scalar constant 0. 

```python
>>> filled_spending_df = spending_df.fillna(0)
```

We can verify our results by observing that the entries at row positions 3 and 4 of the column at position 1, `specialty`, were originally missing in the `spending_df` `DataFrame`, so when we print those same entries in `filled_spending_df` we should see that they are 0.

```python
>>> filled_spending_df.iloc[[3,4], 1]
unique_id
TR390895    0
JA436080    0
Name: specialty, dtype: object
```

We see that the values that were once missing in `spending_df` are now 0.

Most of the time, it is not best practice to set all the missing values in a `DataFrame` to the same value. In our example, a 0 in the `specialty` column does not make much sense; it would be more suitable to set the specialty to a string value such as `'UNKNOWN'`. We can accomplish column-specific filling with static values by passing a Python dictionary to the `fillna()` method. The dictionary should have keys that are column labels of the calling `DataFrame`, and values that specify what you want to fill each column with.

For example, let us make a new `DataFrame`, again named `filled_spending_df`,  where all missing entries in the `spending` or the `nb_beneficiaries` column are replaced with 0s, and all missing entries in the `specialty` column are replaced with the string `'UNKNOWN'`.  

```python
>>> filled_spending_df = spending_df.fillna( { "specialty": "UNKNOWN", 
                              "nb_beneficiaries": 0, 
                              "spending": 0 } )
```

To check our results, we can take a look at the originally missing values in the `specialty` column of `spending_df`; we should see that they now all hold the value `'UNKNOWN'`.

```python
>>> filled_spending_df[spending_df.loc[:, 'specialty'].isnull()].loc[:, 'specialty']
unique_id
TR390895    UNKNOWN
JA436080    UNKNOWN
DT789371    UNKNOWN
CF887728    UNKNOWN
YN526335    UNKNOWN
ES437458    UNKNOWN
Name: specialty, dtype: object
```

Just as we expected, all the once missing values are now replaced with the string 'UNKNOWN' in the `specialty` column of the `DataFrame` `filled_spending_df`.
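
The two forms of static filling can also be tried on a small, self-contained example. This is only a sketch: `toy_df` and its columns are hypothetical stand-ins for `spending_df`, and the scalar form is shown on a single numeric column so that the fill value matches the column's type.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for spending_df
toy_df = pd.DataFrame({'label': ['x', None, 'z'],
                       'count': [1.0, np.nan, 3.0]})

# Scalar form: every missing entry of the calling Series becomes 0
count_filled = toy_df.loc[:, 'count'].fillna(0)

# Dictionary form: each listed column gets its own fill value
dict_filled = toy_df.fillna({'label': 'UNKNOWN', 'count': 0})

print(count_filled.tolist())          # [1.0, 0.0, 3.0]
print(dict_filled['label'].tolist())  # ['x', 'UNKNOWN', 'z']

# inplace defaults to False, so the original still has its 2 missing entries
print(toy_df.isnull().sum().sum())    # 2
```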

In [0]:
filled_spending_df = spending_df.fillna(0)
filled_spending_df.iloc[[3,4], 1]

In [0]:
filled_spending_df = spending_df.fillna( { "specialty": "UNKNOWN", 
                      "nb_beneficiaries": 0, 
                      "spending": 0 } )
filled_spending_df[spending_df.loc[:, 'specialty'].isnull()].loc[:,'specialty']

# Exercise 5.5: Filling Missing Entries with Static Values

Again let us consider the `flights_df` `DataFrame` that was first introduced in Exercise 5.4 and is described in the table below:

| index | ORIGIN_AIRPORT | DESTINATION_AIRPORT | DEPARTURE_DELAY| ARRIVAL_DELAY |
|:----------|-----------|---------|---------|---------|
| 0 | HNL | SFO | 3.0 | -21.0 |
| 1 | LAS | HNL | NaN | -2.0 |
| 2 | HNL | ITO | NaN | NaN |
| 3 | HNL | KOA| -6.0 | -9.0 |

And constructed using the following line of code:

```python
flights_df = pd.DataFrame({'ORIGIN_AIRPORT': ['HNL', 'LAS', 'HNL', 'HNL'], 
                           'DESTINATION_AIRPORT': ['SFO', 'HNL', 'ITO', 'KOA'],
                           'DEPARTURE_DELAY': [ 3., None, None, -6.], 
                           'ARRIVAL_DELAY': [-21.,  -2.,  None,  -9.]})
```

Write a command using the code cell below to fill the missing entries in both the `DEPARTURE_DELAY` and `ARRIVAL_DELAY` columns with 0. The original `DataFrame` should be modified, i.e. the changes should be done in place.


In [24]:
# Type your solution to Exercise 5.5 here

### Filling Continued: `fillna()` with Dynamic Values 

Dynamic filling refers to filling the missing entries with values that depend on the existing data. For example, a common strategy is filling missing values with a column's mean or median value.

For instance, let us make a new `DataFrame`, which we will again call `filled_spending_df`, that is a copy of `spending_df` except that the missing values in the columns `nb_beneficiaries` and `spending` are replaced with their respective column means. The only difference between this and the static values case is that we first need to compute the values we are going to fill with. To do this we will use the `Series` `mean()` method, and then pass a dictionary to `fillna()` just as before.

```python
>>> average_spending = spending_df.loc[:, "spending"].mean()
>>> average_nb_beneficiaries = spending_df.loc[:, "nb_beneficiaries"].mean()
>>> filled_spending_df = spending_df.fillna({"nb_beneficiaries": average_nb_beneficiaries , 
                      "spending": average_spending})
```

Now let's check our results:

```python
>>> print(average_nb_beneficiaries)
55.6
>>> filled_spending_df.loc[spending_df.loc[:, 'nb_beneficiaries'].isnull(), 'nb_beneficiaries']
unique_id
EQ932492    55.6
ON964391    55.6
BT820276    55.6
AD891213    55.6
Name: nb_beneficiaries, dtype: float64

>>> print(average_spending)
5806.981
>>> filled_spending_df.loc[spending_df.loc[:, 'spending'].isnull(), 'spending']
unique_id
WO822855    5806.981
AD891213    5806.981
ES437458    5806.981
NM108377    5806.981
Name: spending, dtype: float64
```

We see that all the once missing values in the `spending_df` `DataFrame` columns `spending` and `nb_beneficiaries` are now replaced by their respective column means. 
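
When many numeric columns need the same treatment, building the dictionary by hand becomes tedious. One possible shortcut, sketched below on a hypothetical `toy_df`, relies on the fact that `fillna()` also accepts a `Series` whose index labels are column names, which is exactly what `DataFrame.mean()` returns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one text column and two numeric columns
toy_df = pd.DataFrame({'group': ['a', 'b', 'c', 'd'],
                       'x': [1.0, np.nan, 3.0, 5.0],
                       'y': [10.0, 20.0, np.nan, np.nan]})

# Column means, skipping NaN and ignoring the non-numeric column
column_means = toy_df.mean(numeric_only=True)

# Each column listed in column_means is filled with its own mean
filled = toy_df.fillna(column_means)

print(column_means['x'], column_means['y'])  # 3.0 15.0
print(filled['x'].tolist())                  # [1.0, 3.0, 3.0, 5.0]
print(filled['y'].tolist())                  # [10.0, 20.0, 15.0, 15.0]
```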

In [0]:
average_spending = spending_df.loc[:, "spending"].mean()
average_nb_beneficiaries = spending_df.loc[:, "nb_beneficiaries"].mean()
filled_spending_df = spending_df.fillna({"nb_beneficiaries": average_nb_beneficiaries , 
                      "spending": average_spending})
print('--------nb_beneficiaries---------')
print(average_nb_beneficiaries)
print(filled_spending_df.loc[spending_df.loc[:, 'nb_beneficiaries'].isnull(), 'nb_beneficiaries'])
print('------------spending-------------')
print(average_spending)
print(filled_spending_df.loc[spending_df.loc[:, 'spending'].isnull(), 'spending'])

### Filling Continued : `fillna()` with Dynamic Values 2

Very similar to what we saw with reindexing, `fillna()` has a `method` parameter that can be set to either back fill, `method='bfill'`, or forward fill, `method='ffill'`, missing values along an axis. Recall that forward filling propagates the last valid observation forward until the next valid entry, while backward filling propagates the next valid entry backward.

For example, let us forward fill the `DataFrame` `spending_df` and save the result to `ffill_spending_df`. Since it really only makes sense to forward fill down the rows of each column, we will set `axis='rows'`. To verify the results we will also print the head of the `specialty` column of both the `ffill_spending_df` `DataFrame` and the original `spending_df` `DataFrame` for reference.

```python
>>> ffill_spending_df = spending_df.fillna(method='ffill', axis='rows')
>>> print(spending_df.loc[:, 'specialty'].head())
unique_id
BK982218    INTERNAL MEDICINE
CG916968           CARDIOLOGY
SA964720    INTERNAL MEDICINE
TR390895                  NaN
JA436080                  NaN
Name: specialty, dtype: object
>>> print(ffill_spending_df.loc[:, 'specialty'].head())
unique_id
BK982218    INTERNAL MEDICINE
CG916968           CARDIOLOGY
SA964720    INTERNAL MEDICINE
TR390895    INTERNAL MEDICINE
JA436080    INTERNAL MEDICINE
Name: specialty, dtype: object
```

We see in the example above that the resulting `DataFrame` from the `fillna()` method has its missing values filled by the nearest previous valid entry, 'INTERNAL MEDICINE' . 

Let us do the same operation except set `method='bfill'`. We will save the results of the `fillna()` method call to the `DataFrame` `bfill_spending_df` and verify our results in the same way.

```python
>>> bfill_spending_df = spending_df.fillna(method='bfill', axis='rows')
>>> print(spending_df.loc[:, 'specialty'].head())
unique_id
BK982218    INTERNAL MEDICINE
CG916968           CARDIOLOGY
SA964720    INTERNAL MEDICINE
TR390895                  NaN
JA436080                  NaN
Name: specialty, dtype: object
>>> print(bfill_spending_df.loc[:, 'specialty'].head())
unique_id
BK982218            INTERNAL MEDICINE
CG916968                   CARDIOLOGY
SA964720            INTERNAL MEDICINE
TR390895    INTERVENTIONAL CARDIOLOGY
JA436080    INTERVENTIONAL CARDIOLOGY
Name: specialty, dtype: object
```

Now the filled values are different: they are 'INTERVENTIONAL CARDIOLOGY', since backward fill uses the next valid entry and propagates it backward.
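
`pandas` also provides the dedicated methods `ffill()` and `bfill()`, which perform the same forward and backward filling. To my knowledge, recent `pandas` releases (2.1 and later) deprecate the `method` argument of `fillna()` in favor of these methods, so they may be the safer choice on newer installations. The sketch below uses a hypothetical `Series` and also shows an edge case: a missing value with no valid neighbor on the relevant side stays missing.

```python
import pandas as pd

# Hypothetical Series with gaps in the middle and at the end
s = pd.Series(['A', None, None, 'B', None], index=list('abcde'))

# Forward fill: carry the last valid entry forward
ffilled = s.ffill()

# Backward fill: carry the next valid entry backward
bfilled = s.bfill()

print(ffilled.tolist())         # ['A', 'A', 'A', 'B', 'B']
print(bfilled.tolist()[:4])     # ['A', 'B', 'B', 'B']

# The last entry has no later valid value, so bfill leaves it missing
print(pd.isnull(bfilled['e']))  # True
```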

In [0]:
ffill_spending_df = spending_df.fillna(method='ffill', axis='rows')
print(spending_df.loc[:, 'specialty'].head())
print(ffill_spending_df.loc[:, 'specialty'].head())

In [0]:
bfill_spending_df = spending_df.fillna(method='bfill', axis='rows')
print(spending_df.loc[:, 'specialty'].head())
print(bfill_spending_df.loc[:, 'specialty'].head())

# Exercise 5.6: Filling Missing Entries with Dynamic Values

We will use the `flights_df` `DataFrame` one last time. Recall that it contains the information described in the table below:

| index | ORIGIN_AIRPORT | DESTINATION_AIRPORT | DEPARTURE_DELAY| ARRIVAL_DELAY |
|:----------|-----------|---------|---------|---------|
| 0 | HNL | SFO | 3.0 | -21.0 |
| 1 | LAS | HNL | NaN | -2.0 |
| 2 | HNL | ITO | NaN | NaN |
| 3 | HNL | KOA| -6.0 | -9.0 |

And can be constructed using the following line of code:

```python
flights_df = pd.DataFrame({'ORIGIN_AIRPORT': ['HNL', 'LAS', 'HNL', 'HNL'], 
                           'DESTINATION_AIRPORT': ['SFO', 'HNL', 'ITO', 'KOA'],
                           'DEPARTURE_DELAY': [ 3., None, None, -6.], 
                           'ARRIVAL_DELAY': [-21.,  -2.,  None,  -9.]})
```

Write a command using the code cell below to fill the missing entries in both the `DEPARTURE_DELAY` and `ARRIVAL_DELAY` columns with the column medians. The original `DataFrame` should **not** be modified.

In [25]:
# Type your solution to Exercise 5.6 here

### Filling Continued: The `interpolate()` `DataFrame` method

Interpolation is the construction of new data points from surrounding data points. Forward filling and backward filling are examples of interpolation methods, but there are many more. Interpolation is typically useful in time series data, that is, data that is collected periodically and organized in chronological order.

The `interpolate()` `DataFrame` method fills missing values in the calling `DataFrame` using the interpolation method specified with the `method` parameter, along the axis specified by the `axis` parameter. There are multiple options for the `method` parameter, all of which are listed [here](http://pandas.pydata.org/pandas-docs/version/0.16.2/generated/pandas.DataFrame.interpolate.html), but we will focus on `'linear'` interpolation.

To see the value of interpolation we will reuse the same `sine_wave_df` from the reindexing section. Previously, we interpolated the missing values introduced by reindexing by forward filling, which introduced some undesired distortion. Now we will interpolate the missing values using the linear method.

```python
>>> sine_wave_df_reindexed = sine_wave_df.reindex(times2)
>>> sine_wave_df_reindexed.interpolate(method='linear', inplace=True)
```

Now let's again plot both the original `sine_wave_df` `DataFrame` and the new `sine_wave_df_reindexed` `DataFrame`. This time, however, we will use a scatter plot to better illustrate the value of interpolation.

```python
>>> sine_wave_df['time'] = times
>>> sine_wave_df_reindexed['time'] = times2
>>> fig, axes = plt.subplots(nrows=2, ncols=1) 
>>> sine_wave_df_reindexed.plot(kind='scatter', x='time', y='wave', ax=axes[1]) 
>>> sine_wave_df.plot(kind='scatter', x='time', y='wave', ax=axes[0]) 
>>> fig.show()
```

<img src="images/sine_wave_linear_interpolate.png" width="500">

We see in the image above that the resulting waveform is now much smoother. Note that there is still some distortion in this signal and it is not a perfect interpolation, but for this application it is visibly better than forward filling and backward filling.
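
The difference between linear interpolation and forward filling is easy to check numerically on a tiny example. The sketch below uses hypothetical names (`coarse`, `fine_index`) and samples from the straight line `value = 2 * time`, so the correct in-between values are known in advance.

```python
import pandas as pd

# Hypothetical coarse samples of the line value = 2 * time
coarse = pd.DataFrame({'value': [0.0, 4.0, 8.0]}, index=[0, 2, 4])

# Reindex onto a finer grid; the new times have no data, so they become NaN
fine_index = [0, 1, 2, 3, 4]
fine = coarse.reindex(fine_index)

# Linear interpolation places each gap on the line between its neighbors
linear = fine.interpolate(method='linear')

# Forward filling repeats the previous value instead
ffilled = fine.ffill()

print(linear['value'].tolist())   # [0.0, 2.0, 4.0, 6.0, 8.0]
print(ffilled['value'].tolist())  # [0.0, 0.0, 4.0, 4.0, 8.0]
```

Note that `method='linear'` ignores the index and treats the rows as equally spaced, which is appropriate here because the new grid is evenly spaced. For unevenly spaced numeric indexes, `method='index'` uses the index values instead.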

## Handling Missing Data for the Medical Spending Data Set

Let us take a look at how the skills we learned for handling missing data can be applied to our running example, the Medical Spending Data Set. First, to gain some perspective on what we are working with, we should check exactly how many values are missing from each column in the data set. To do this we will call `sum()` on the `DataFrame` returned by `spending_df.isnull()`. Recall that the output of the `sum()` `DataFrame` method is a `Series` whose index matches the columns of the calling `DataFrame` and whose entries are the sums down the rows of each column.

```python
>>> spending_df.isnull().sum()
doctor_id           0
specialty           6
medication          1
nb_beneficiaries    4
spending            4
dtype: int64
```

From the output above, it looks like every column besides `doctor_id` has missing values. Our next task is to decide how we should handle these missing entries. Suppose our study is primarily interested in investigating the spending of each specialty, that is, `spending` and `specialty` are the most valuable features for us. Then one reasonable solution would be to drop the rows where `specialty` or `spending` is missing, replace missing `nb_beneficiaries` values with the average, and replace the missing `medication` entry with a new value, `"UNKNOWN"`.

To do this we will need to do both filtering and filling. An efficient way to accomplish our goal is to use the `dropna()` and `fillna()` `DataFrame` methods. We need to be careful with `dropna()`, since by default it considers every column and drops any row that has a missing value in any of them. If we want `dropna()` to consider only a subset of the columns of the calling `DataFrame`, we can specify it using the `subset` parameter. Since we only want to drop rows that have a missing value in the `specialty` or `spending` columns, we will set `subset=['specialty', 'spending']`. We can then call `fillna()` on the `DataFrame` returned by `dropna()` and pass it a dictionary with the desired static and dynamic values. We want this change to be permanent, so we save the result back into `spending_df`. Finally, to check our results, we can again call `sum()` on the `DataFrame` returned by `spending_df.isnull()`.

```python
>>> spending_df = spending_df.dropna(subset=["specialty","spending"]).fillna({
                    "nb_beneficiaries": spending_df["nb_beneficiaries"].mean(), 
                    "medication": "UNKNOWN" } )
>>> spending_df.isnull().sum()
doctor_id           0
specialty           0
medication          0
nb_beneficiaries    0
spending            0
dtype: int64
```

As desired, there are no more missing entries in our `spending_df` `DataFrame`.
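
The same drop-then-fill pattern can be traced step by step on a small example. This is a sketch on a hypothetical `toy_df` whose columns mimic those of `spending_df`; the values are made up. As in the command above, the mean of `nb_beneficiaries` is computed on the full `DataFrame`, before any rows are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mimicking the columns of spending_df
toy_df = pd.DataFrame({
    'specialty': ['CARDIOLOGY', None, 'UROLOGY', 'UROLOGY'],
    'medication': ['A', 'B', None, 'D'],
    'nb_beneficiaries': [10.0, 20.0, np.nan, 30.0],
    'spending': [100.0, 200.0, 300.0, np.nan]})

# Step 1: count the missing entries in each column
missing_before = toy_df.isnull().sum()

# Step 2: drop rows that are missing specialty or spending (rows 1 and 3)
kept = toy_df.dropna(subset=['specialty', 'spending'])

# Step 3: fill what is left; the mean uses all rows of toy_df, (10+20+30)/3
mean_nb = toy_df['nb_beneficiaries'].mean()
cleaned = kept.fillna({'nb_beneficiaries': mean_nb,
                       'medication': 'UNKNOWN'})

print(missing_before.tolist())       # [1, 1, 1, 1]
print(cleaned.index.tolist())        # [0, 2]
print(cleaned.loc[2].tolist())       # ['UROLOGY', 'UNKNOWN', 20.0, 300.0]
print(cleaned.isnull().sum().sum())  # 0
```

Computing the mean from `kept` instead would use only the rows that survive the drop, which gives a different fill value (10.0 here); which of the two is appropriate depends on the study.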

In [0]:
print(spending_df.isnull().sum())
print("----------------------")
spending_df = spending_df.dropna(subset=["specialty","spending"]).fillna({
                    "nb_beneficiaries": spending_df["nb_beneficiaries"].mean(), 
                    "medication": "UNKNOWN" } )

spending_df.isnull().sum()

# Summary

---

**Reindexing**

* Reindexing a `DataFrame` or `Series` will create a **new** `pandas` object that is conformed to the new index 
* More information about the `reindex()` method is available [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html)

**Inspecting and Modifying Data Types**

* Use the `DataFrame` `dtypes` attribute to inspect the `pandas` data types of each column
* To cast a column of one type to another compatible type we can use the `astype()` `Series` method

**Series String Methods**

* `Series` contains various `string` processing methods that can be accessed using a `Series`’s `str` property.
* You can use .__`TAB`__ to explore these methods.
* Or you can see all of the `Series` `str` methods and descriptions [here](https://pandas-docs.github.io/pandas-docs-travis/api.html#string-handling)

**Handling Missing Data**

* The `isnull()` method is useful for finding precisely where the `NaN`s are in a `pandas` `DataFrame` or `Series`

* **Filtering**

  * You can discard missing values by subsetting with the result of `isnull()`, or by using the `dropna()` method

* **Filling**

  * There are two conventional approaches for filling missing values:

    1. Filling the value with a constant 
    2. Filling the value dynamically with something computed on the fly