# Data Cleaning Challenge - Handling missing values

## Intro

### All days of the challenge:

* [Day 1: Handling missing values](./nb1-data-cleaning-challenge-handling-missing-values.ipynb)
* [Day 2: Scaling and normalization](./nb2-data-cleaning-challenge-scale-and-normalize-data.ipynb)
* [Day 3: Parsing dates](./nb3-data-cleaning-challenge-parsing-dates.ipynb)
* [Day 4: Character encodings](./nb4-data-cleaning-challenge-character-encodings.ipynb)
* [Day 5: Inconsistent Data Entry](./nb5-data-cleaning-challenge-inconsistent-data-entry.ipynb)
___
Welcome to day 1 of the 5-Day Data Challenge! Today, we're going to be looking at how to deal with missing values. To get started, click the blue "Fork Notebook" button in the upper right-hand corner. This will create a private copy of this notebook that you can edit and play with. Once you're finished with the exercises, you can choose to make your notebook public to share with others. :)

> **Your turn!** As we work through this notebook, you'll see some notebook cells (blocks of either code or text) that have "Your Turn!" written in them. These are exercises for you to do to help cement your understanding of the concepts we're talking about. Once you've written the code to answer a specific question, you can run it by clicking inside the cell (the box with the code in it) and then hitting CTRL + ENTER (CMD + ENTER on a Mac). You can also click in a cell and then click the "play" arrow to the left of the code. If you want to run all the code in your notebook, you can use the double "fast forward" arrows at the bottom of the notebook editor.

Here's what we're going to do today:

* [Take a first look at the data](#Take-a-first-look-at-the-data)
* [See how many missing data points we have](#See-how-many-missing-data-points-we-have)
* [Figure out why the data is missing](#Figure-out-why-the-data-is-missing)
* [Drop missing values](#Drop-missing-values)
* [Filling in missing values](#Filling-in-missing-values)

Let's get started!

## Take a first look at the data

The first thing we'll need to do is load in the libraries and datasets we'll be using. For today, I'll be using a dataset of events that occurred in American Football games for demonstration, and you'll be using a dataset of building permits issued in San Francisco.

> **Important!** Make sure you run this cell yourself or the rest of your code won't work!

In [2]:
# modules we'll use
import pandas as pd
import numpy as np
import os

from kaggle_cleaning.config import CLEAN_DATA_DIR, RAW_DATA_DIR


In [3]:
pd.set_option('display.max_columns', None)    # Setting this option will print all columns of a dataframe
pd.set_option('display.max_colwidth', None)   # Setting this option will print all of the data in a feature

# read in all our data
nfl_data =   pd.read_csv("../data/d0-raw/NFL Play by Play 2009-2017 (v4).csv", low_memory=False)
sf_permits = pd.read_csv("../data/d0-raw/Building_Permits.csv", low_memory=False)

# set seed for reproducibility
np.random.seed(0) 

The first thing I do when I get a new dataset is take a look at some of it. This lets me see that it all read in correctly and get an idea of what's going on with the data. In this case, I'm looking for any missing values, which will be represented with `NaN` or `None`.

In [4]:
# look at a few rows of the nfl_data file. I can see a handful of missing data already!
nfl_data.sample(5)

Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,yrdln,yrdline100,ydstogo,ydsnet,GoalToGo,FirstDown,posteam,DefensiveTeam,desc,PlayAttempted,Yards.Gained,sp,Touchdown,ExPointResult,TwoPointConv,DefTwoPoint,Safety,Onsidekick,PuntResult,PlayType,Passer,Passer_ID,PassAttempt,PassOutcome,PassLength,AirYards,YardsAfterCatch,QBHit,PassLocation,InterceptionThrown,Interceptor,Rusher,Rusher_ID,RushAttempt,RunLocation,RunGap,Receiver,Receiver_ID,Reception,ReturnResult,Returner,BlockingPlayer,Tackler1,Tackler2,FieldGoalResult,FieldGoalDistance,Fumble,RecFumbTeam,RecFumbPlayer,Sack,Challenge.Replay,ChalReplayResult,Accepted.Penalty,PenalizedTeam,PenaltyType,PenalizedPlayer,Penalty.Yards,PosTeamScore,DefTeamScore,ScoreDiff,AbsScoreDiff,HomeTeam,AwayTeam,Timeout_Indicator,Timeout_Team,posteam_timeouts_pre,HomeTimeouts_Remaining_Pre,AwayTimeouts_Remaining_Pre,HomeTimeouts_Remaining_Post,AwayTimeouts_Remaining_Post,No_Score_Prob,Opp_Field_Goal_Prob,Opp_Safety_Prob,Opp_Touchdown_Prob,Field_Goal_Prob,Safety_Prob,Touchdown_Prob,ExPoint_Prob,TwoPoint_Prob,ExpPts,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
244485,2014-10-26,2014102607,18,3,1.0,00:39,1,939.0,12.0,TB,49.0,51.0,10,11,0.0,1.0,TB,MIN,(:39) M.Glennon pass short middle to B.Rainey pushed ob at MIN 40 for 11 yards (X.Rhodes).,1,11,0,0,,,,0,0,,Pass,M.Glennon,00-0030520,1,Complete,Short,0,11,0,middle,0,,,,0,,,B.Rainey,00-0029384,1,,,,X.Rhodes,,,,0,,,0,0,,0,,,,0,0.0,10.0,-10.0,10.0,TB,MIN,0,,3,3,3,3,3,0.024134,0.083335,0.000527,0.125624,0.309404,0.003279,0.453696,0.0,0.0,2.980215,0.652199,-0.588101,1.240299,0.225647,0.774353,0.245582,0.754418,0.225647,0.019935,-0.018156,0.038091,2014
115340,2011-11-20,2011112000,22,4,1.0,06:47,7,407.0,44.0,OAK,31.0,69.0,10,24,0.0,1.0,OAK,MIN,(6:47) M.Bush up the middle to OAK 44 for 13 yards (Ty.Johnson).,1,13,0,0,,,,0,0,,Run,,,0,,,0,0,0,,0,,M.Bush,00-0025487,1,middle,,,,0,,,,Ty.Johnson,,,,0,,,0,0,,0,,,,0,27.0,14.0,13.0,13.0,MIN,OAK,0,,3,3,3,3,3,0.246929,0.101102,0.001883,0.149762,0.198142,0.002598,0.299584,0.0,0.0,1.341311,0.871944,,,0.056036,0.943964,0.042963,0.957037,0.943964,0.013073,,,2011
68357,2010-11-14,2010111401,8,2,,00:23,1,1823.0,0.0,CLE,2.0,2.0,0,80,0.0,0.0,NYJ,CLE,"N.Folk extra point is GOOD, Center-T.Purdum, Holder-S.Weatherford.",1,0,1,0,Made,,,0,0,,Extra Point,,,0,,,0,0,0,,0,,,,0,,,,,0,,,,,,,,0,,,0,0,,0,,,,0,16.0,13.0,3.0,3.0,CLE,NYJ,0,,1,2,1,2,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.990795,0.0,0.990795,0.009205,,,0.365307,0.634693,0.384697,0.615303,0.634693,-0.01939,,,2010
368377,2017-09-24,2017092405,24,4,1.0,08:48,9,528.0,8.0,CLE,42.0,58.0,10,25,0.0,1.0,CLE,IND,(8:48) (Shotgun) D.Kizer pass short middle to D.Johnson to IND 33 for 25 yards (M.Farley).,1,25,0,0,,,,0,0,,Pass,D.Kizer,00-0033899,1,Complete,Short,7,18,0,middle,0,,,,0,,,D.Johnson,00-0032257,1,,,,M.Farley,,,,0,,,0,0,,0,,,,0,14.0,31.0,-17.0,17.0,IND,CLE,0,,0,2,0,2,0,0.133814,0.088017,0.000845,0.130768,0.258227,0.002931,0.385398,0.0,0.0,2.297212,1.47774,0.40208,1.07566,0.935995,0.064005,0.921231,0.078769,0.064005,0.014764,0.003866,0.010899,2017
384684,2017-11-05,2017110505,11,2,1.0,09:15,10,2355.0,0.0,DEN,25.0,75.0,10,0,0.0,0.0,DEN,PHI,(9:15) (Shotgun) J.Charles right end to DEN 25 for no gain (B.Graham).,1,0,0,0,,,,0,0,,Run,,,0,,,0,0,0,,0,,J.Charles,00-0026213,1,right,end,,,0,,,,B.Graham,,,,0,,,0,0,,0,,,,0,6.0,24.0,-18.0,18.0,PHI,DEN,0,,2,3,2,3,2,0.170787,0.128766,0.003453,0.191982,0.199771,0.002876,0.302365,0.0,0.0,0.984542,-0.637178,,,0.928474,0.071526,0.934641,0.065359,0.071526,-0.006166,,,2017


Yep, it looks like there are some missing values. What about in the `sf_permits` dataset?

In [5]:
# your turn! Look at a couple of rows from the sf_permits dataset. Do you notice any missing data?

display(sf_permits.sample(5))

Unnamed: 0,Permit Number,Permit Type,Permit Type Definition,Permit Creation Date,Block,Lot,Street Number,Street Number Suffix,Street Name,Street Suffix,Unit,Unit Suffix,Description,Current Status,Current Status Date,Filed Date,Issued Date,Completed Date,First Construction Document Date,Structural Notification,Number of Existing Stories,Number of Proposed Stories,Voluntary Soft-Story Retrofit,Fire Only Permit,Permit Expiration Date,Estimated Cost,Revised Cost,Existing Use,Existing Units,Proposed Use,Proposed Units,Plansets,TIDF Compliance,Existing Construction Type,Existing Construction Type Description,Proposed Construction Type,Proposed Construction Type Description,Site Permit,Supervisor District,Neighborhoods - Analysis Boundaries,Zipcode,Location,Record ID
78651,201503120624,8,otc alterations permit,03/12/2015,4273,029,2986,,26th,St,,,replace 12 windows with fiberglass windows - same glazing and frame design,complete,05/11/2015,03/12/2015,03/12/2015,05/11/2015,03/12/2015,,3.0,3.0,,,03/06/2016,12000.0,12000.0,2 family dwelling,2.0,2 family dwelling,2.0,0.0,,5.0,wood frame (5),5.0,wood frame (5),,9.0,Mission,94110.0,"(37.74972976527956, -122.40963202424835)",1374057173006
177779,201707071220,8,otc alterations permit,07/07/2017,453,004A,950,,Bay,St,0.0,,to comply with nov 201641909soft story retrofit per sfebc chapter 4d engineering criteria 2016 cebc appendix a-4,issued,12/04/2017,07/07/2017,12/04/2017,,12/04/2017,,4.0,4.0,,,12/04/2018,94000.0,140000.0,apartments,18.0,apartments,18.0,2.0,,5.0,wood frame (5),5.0,wood frame (5),,2.0,Russian Hill,94109.0,"(37.80472551510833, -122.42280760820965)",1469546420378
64978,M529127,8,otc alterations permit,10/21/2014,478,011,1290,,Chestnut,St,,,street space,issued,10/21/2014,10/21/2014,10/21/2014,,10/21/2014,,,,,,,,1.0,,,,,,,,,,,,2.0,Russian Hill,94109.0,"(37.80244997614665, -122.42443255018165)",1359656465959
165958,201704073507,8,otc alterations permit,04/07/2017,6507,003A,1134,,Noe,St,,,"remodel kitchen, new cabients, power, counters. remodel bath - shower, new tile, toilet, sink. bolt exisitng foundationwith 7/8 all tgread & epoxy every 32""m add sine a30 clips at post in basement. repair some dryrot.",complete,08/01/2017,04/07/2017,04/07/2017,08/01/2017,04/07/2017,,3.0,3.0,,,04/02/2018,32000.0,35000.0,2 family dwelling,2.0,2 family dwelling,2.0,0.0,,5.0,wood frame (5),5.0,wood frame (5),,8.0,Noe Valley,94114.0,"(37.750866393629146, -122.43209633372915)",1458922436280
147902,201701237639,8,otc alterations permit,01/23/2017,623,001,1755,,Van Ness,Av,,,"unit 502: kitchen remodel in-kind, bathroom remodel in-kind, no changes to wall locations.",issued,01/23/2017,01/23/2017,01/23/2017,,01/23/2017,,6.0,6.0,,,01/18/2018,20000.0,25000.0,apartments,48.0,apartments,48.0,0.0,,5.0,wood frame (5),5.0,wood frame (5),,2.0,Pacific Heights,94109.0,"(37.791925784456105, -122.42306858292103)",1450933235988


## See how many missing data points we have

Ok, now we know that we do have some missing values. Let's see how many we have in each column. 

In [6]:
# get the number of missing data points per column
missing_values_count = nfl_data.isnull().sum()

# look at the # of missing points in the first ten columns
missing_values_count[0:10]

Date                0
GameID              0
Drive               0
qtr                 0
down            61154
time              224
TimeUnder           0
TimeSecs          224
PlayTimeDiff      444
SideofField       528
dtype: int64

That seems like a lot! It might be helpful to see what percentage of the values in our dataset were missing to give us a better sense of the scale of this problem:

In [7]:
# how many total missing values do we have?
total_cells = np.prod(nfl_data.shape)
total_missing = missing_values_count.sum()

# percent of data that is missing
float((total_missing/total_cells) * 100)

27.66722370547874

Wow, more than a quarter of the cells in this dataset are empty! In the next step, we're going to take a closer look at some of the columns with missing values and try to figure out what might be going on with them.
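
To see which columns drive that overall figure, the same idea can be applied column by column. Here is a minimal sketch on a made-up frame (the `toy` data and its values are illustrative, not taken from the NFL file):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a real dataset; the values are illustrative only
toy = pd.DataFrame({
    "down": [1.0, np.nan, 3.0, np.nan],
    "time": ["15:00", "14:53", None, "14:16"],
    "qtr": [1, 1, 1, 1],
})

# isnull() gives a True/False frame; the mean of a boolean column is the
# fraction of True values, so multiplying by 100 yields percent missing
pct_missing_per_column = toy.isnull().mean() * 100

# overall share of empty cells, equivalent to total_missing / total_cells
pct_missing_overall = toy.isnull().to_numpy().mean() * 100

print(pct_missing_per_column.sort_values(ascending=False))
print(f"{pct_missing_overall:.2f}% of all cells are missing")
```

Sorting the per-column percentages makes it easy to spot the handful of columns that are almost entirely empty.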

In [8]:
# your turn! Find out what percent of the sf_permits dataset is missing
empty_values = sf_permits.isnull().sum() # Is the number of missing values **PER COLUMN**
display(empty_values.sort_values(ascending=False)[0:15]) # The top 15 columns of missing values

TIDF Compliance                           198898
Voluntary Soft-Story Retrofit             198865
Unit Suffix                               196939
Street Number Suffix                      196684
Site Permit                               193541
Structural Notification                   191978
Fire Only Permit                          180073
Unit                                      169421
Completed Date                            101709
Permit Expiration Date                     51880
Existing Units                             51538
Proposed Units                             50911
Existing Construction Type                 43366
Existing Construction Type Description     43366
Proposed Construction Type Description     43162
dtype: int64

In [9]:
total_cells = sf_permits.size # using size instead of 'np.prod(sf_permits.shape)'
total_empty_values = empty_values.sum()
pct_empty_values = total_empty_values/total_cells * 100
print(f"In the 'sf_permits' dataframe, {pct_empty_values:.2f}% of cells are empty")

In the 'sf_permits' dataframe, 26.26% of cells are empty


## Figure out why the data is missing

This is the point at which we get into the part of data science that I like to call **"data intuition"**, by which I mean **"really looking at your data and trying to figure out why it is the way it is and how that will affect your analysis"**. It can be a frustrating part of data science, especially if you're newer to the field and don't have a lot of experience. For dealing with missing values, you'll need to use your intuition to figure out why the value is missing. One of the most important questions you can ask yourself to help figure this out is this:

> **Is this value missing because it wasn't recorded or because it doesn't exist?**

If a value is missing because it doesn't exist (like the height of the oldest child of someone who doesn't have any children) then it doesn't make sense to try and guess what it might be. These values you probably do want to keep as `NaN`. On the other hand, if a value is missing because it wasn't recorded, then you can try to guess what it might have been based on the other values in that column and row. (This is called "imputation" and we'll learn how to do it next! :)

Let's work through an example. Looking at the number of missing values in the `nfl_data` dataframe, I notice that the column `TimeSecs` has missing values in it:

In [10]:
# look at the # of missing points in the first ten columns
missing_values_count[0:10]

Date                0
GameID              0
Drive               0
qtr                 0
down            61154
time              224
TimeUnder           0
TimeSecs          224
PlayTimeDiff      444
SideofField       528
dtype: int64

By looking at [the documentation](https://www.kaggle.com/maxhorowitz/nflplaybyplay2009to2016), I can see that this column has information on the number of seconds left in the game when the play was made. This means that these values are probably missing because they were not recorded, rather than because they don't exist. So, it would make sense for us to try and guess what they should be rather than just leaving them as NA's.

On the other hand, there are other fields, like `PenalizedTeam`, that also have a lot of missing values. In this case, though, the field is missing because if there was no penalty then it doesn't make sense to say *which* team was penalized. For this column, it would make more sense to either leave it empty or to add a third value like "neither" and use that to replace the NA's.
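
One way to test that kind of hunch is to check whether the gaps in one column line up with a value in another. The sketch below uses invented rows; only the column names `Accepted.Penalty` and `PenalizedTeam` come from the NFL data, and whether the real columns line up this neatly is an assumption you would need to verify:

```python
import numpy as np
import pandas as pd

# Made-up plays; only the two column names come from the NFL dataset
plays = pd.DataFrame({
    "Accepted.Penalty": [0, 1, 0, 1, 0],
    "PenalizedTeam": [np.nan, "PIT", np.nan, "TEN", np.nan],
})

# Does PenalizedTeam go missing exactly when no penalty was accepted?
missing_team = plays["PenalizedTeam"].isnull()
no_penalty = plays["Accepted.Penalty"] == 0
print(pd.crosstab(no_penalty, missing_team))

# If the two line up, the value does not exist rather than being unrecorded,
# so an explicit label is safer than a guessed team name
if (missing_team == no_penalty).all():
    plays["PenalizedTeam"] = plays["PenalizedTeam"].fillna("neither")

print(plays)
```

If the cross-tabulation showed gaps on plays that did have a penalty, those particular values would look more like "not recorded" and would deserve a closer look.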

> **Tip:** This is a great place to read over the dataset documentation if you haven't already! If you're working with a dataset that you've gotten from another person, you can also try reaching out to them to get more information.

If you're doing very careful data analysis, this is the point at which you'd look at each column individually to figure out the best strategy for filling those missing values. For the rest of this notebook, we'll cover some "quick and dirty" techniques that can help you with missing values but will probably also end up removing some useful information or adding some noise to your data.

## Your turn!

* Look at the columns `Street Number Suffix` and `Zipcode` from the `sf_permits` datasets. Both of these contain missing values. Which, if either, of these are missing because they don't exist? Which, if either, are missing because they weren't recorded?

## Drop missing values

If you're in a hurry or don't have a reason to figure out why your values are missing, one option you have is to just remove any rows or columns that contain missing values. (Note: I don't generally recommend this approach for important projects! It's usually worth it to take the time to go through your data and look at all the columns with missing values one by one to really get to know your dataset.)

If you're sure you want to drop rows with missing values, pandas does have a handy function, `dropna()`, to help you do this. Let's try it out on our NFL dataset!

In [11]:
# remove all the rows that contain a missing value
nfl_data.dropna()

Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,yrdln,yrdline100,ydstogo,ydsnet,GoalToGo,FirstDown,posteam,DefensiveTeam,desc,PlayAttempted,Yards.Gained,sp,Touchdown,ExPointResult,TwoPointConv,DefTwoPoint,Safety,Onsidekick,PuntResult,PlayType,Passer,Passer_ID,PassAttempt,PassOutcome,PassLength,AirYards,YardsAfterCatch,QBHit,PassLocation,InterceptionThrown,Interceptor,Rusher,Rusher_ID,RushAttempt,RunLocation,RunGap,Receiver,Receiver_ID,Reception,ReturnResult,Returner,BlockingPlayer,Tackler1,Tackler2,FieldGoalResult,FieldGoalDistance,Fumble,RecFumbTeam,RecFumbPlayer,Sack,Challenge.Replay,ChalReplayResult,Accepted.Penalty,PenalizedTeam,PenaltyType,PenalizedPlayer,Penalty.Yards,PosTeamScore,DefTeamScore,ScoreDiff,AbsScoreDiff,HomeTeam,AwayTeam,Timeout_Indicator,Timeout_Team,posteam_timeouts_pre,HomeTimeouts_Remaining_Pre,AwayTimeouts_Remaining_Pre,HomeTimeouts_Remaining_Post,AwayTimeouts_Remaining_Post,No_Score_Prob,Opp_Field_Goal_Prob,Opp_Safety_Prob,Opp_Touchdown_Prob,Field_Goal_Prob,Safety_Prob,Touchdown_Prob,ExPoint_Prob,TwoPoint_Prob,ExpPts,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season


Oh dear, it looks like that's removed all our data! 😱 This is because every row in our dataset had at least one missing value. We might have better luck removing all the *columns* that have at least one missing value instead.

In [12]:
# remove all columns with at least one missing value
columns_with_na_dropped = nfl_data.dropna(axis=1)
columns_with_na_dropped.head()

Unnamed: 0,Date,GameID,Drive,qtr,TimeUnder,ydstogo,ydsnet,PlayAttempted,Yards.Gained,sp,Touchdown,Safety,Onsidekick,PlayType,PassAttempt,AirYards,YardsAfterCatch,QBHit,InterceptionThrown,RushAttempt,Reception,Fumble,Sack,Challenge.Replay,Accepted.Penalty,Penalty.Yards,HomeTeam,AwayTeam,Timeout_Indicator,posteam_timeouts_pre,HomeTimeouts_Remaining_Pre,AwayTimeouts_Remaining_Pre,HomeTimeouts_Remaining_Post,AwayTimeouts_Remaining_Post,ExPoint_Prob,TwoPoint_Prob,Season
0,2009-09-10,2009091000,1,1,15,0,0,1,39,0,0,0,0,Kickoff,0,0,0,0,0,0,0,0,0,0,0,0,PIT,TEN,0,3,3,3,3,3,0.0,0.0,2009
1,2009-09-10,2009091000,1,1,15,10,5,1,5,0,0,0,0,Pass,1,-3,8,0,0,0,1,0,0,0,0,0,PIT,TEN,0,3,3,3,3,3,0.0,0.0,2009
2,2009-09-10,2009091000,1,1,15,5,2,1,-3,0,0,0,0,Run,0,0,0,0,0,1,0,0,0,0,0,0,PIT,TEN,0,3,3,3,3,3,0.0,0.0,2009
3,2009-09-10,2009091000,1,1,14,8,2,1,0,0,0,0,0,Pass,1,34,0,0,0,0,0,0,0,0,0,0,PIT,TEN,0,3,3,3,3,3,0.0,0.0,2009
4,2009-09-10,2009091000,1,1,14,8,2,1,0,0,0,0,0,Punt,0,0,0,0,0,0,0,0,0,0,0,0,PIT,TEN,0,3,3,3,3,3,0.0,0.0,2009


In [13]:
# just how much data did we lose?
print("Columns in original dataset: %d \n" % nfl_data.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])

Columns in original dataset: 102 

Columns with na's dropped: 37


We've lost quite a bit of data, but at this point we have successfully removed all the `NaN`'s from our data. 
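
If dropping everything with any gap is too drastic, `dropna()` also takes a `subset` argument (only look at certain columns) and a `thresh` argument (keep rows or columns with at least that many non-missing values). Here is a small sketch on a made-up frame; the rows are invented and only the column names echo the NFL data:

```python
import numpy as np
import pandas as pd

# Invented rows for illustration
toy = pd.DataFrame({
    "GameID": [1, 1, 2, 2],
    "TimeSecs": [3600.0, np.nan, 3580.0, 3555.0],
    "PenalizedTeam": [np.nan, np.nan, np.nan, "PIT"],
})

# Drop a row only when one of the listed columns is missing
rows_with_time = toy.dropna(subset=["TimeSecs"])

# Keep only the columns that have at least 3 non-missing values
mostly_complete_cols = toy.dropna(axis=1, thresh=3)

print(rows_with_time.shape)
print(list(mostly_complete_cols.columns))
```

The threshold of 3 is arbitrary here; on a real dataset it is a judgment call about how sparse a column can be before it stops being useful.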

In [14]:
# Your turn! Try removing all the rows from the sf_permits dataset that contain missing values. How many are left?
sf_permits.dropna(axis=0)

Unnamed: 0,Permit Number,Permit Type,Permit Type Definition,Permit Creation Date,Block,Lot,Street Number,Street Number Suffix,Street Name,Street Suffix,Unit,Unit Suffix,Description,Current Status,Current Status Date,Filed Date,Issued Date,Completed Date,First Construction Document Date,Structural Notification,Number of Existing Stories,Number of Proposed Stories,Voluntary Soft-Story Retrofit,Fire Only Permit,Permit Expiration Date,Estimated Cost,Revised Cost,Existing Use,Existing Units,Proposed Use,Proposed Units,Plansets,TIDF Compliance,Existing Construction Type,Existing Construction Type Description,Proposed Construction Type,Proposed Construction Type Description,Site Permit,Supervisor District,Neighborhoods - Analysis Boundaries,Zipcode,Location,Record ID


In [15]:
# Now try removing all the columns with empty values. Now how much of your data is left?
sf_permits_dropna = sf_permits.dropna(axis=1)
sf_permits_dropna.sample(5)

print(f"Columns in dataset with dropped NaN: {sf_permits_dropna.shape[1]}")
print(f"Columns in original dataset: {sf_permits.shape[1]}")

Columns in dataset with dropped NaN: 12
Columns in original dataset: 43


## Filling in missing values

Another option is to try and fill in the missing values. For this next bit, I'm getting a small sub-section of the NFL data so that it will print well.

In [16]:
# get a small subset of the NFL dataset
subset_nfl_data = nfl_data.loc[:, 'EPA':'Season'].head()
subset_nfl_data

Unnamed: 0,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2.014474,,,0.485675,0.514325,0.546433,0.453567,0.485675,0.060758,,,2009
1,0.077907,-1.068169,1.146076,0.546433,0.453567,0.551088,0.448912,0.546433,0.004655,-0.032244,0.036899,2009
2,-1.40276,,,0.551088,0.448912,0.510793,0.489207,0.551088,-0.040295,,,2009
3,-1.712583,3.318841,-5.031425,0.510793,0.489207,0.461217,0.538783,0.510793,-0.049576,0.106663,-0.156239,2009
4,2.097796,,,0.461217,0.538783,0.558929,0.441071,0.461217,0.097712,,,2009


We can use pandas' `fillna()` function to fill in missing values in a dataframe for us. One option we have is to specify what we want the `NaN` values to be replaced with. Here, I'm saying that I would like to replace all the `NaN` values with 0.

In [17]:
# replace all NA's with 0
subset_nfl_data.fillna(0)

Unnamed: 0,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2.014474,0.0,0.0,0.485675,0.514325,0.546433,0.453567,0.485675,0.060758,0.0,0.0,2009
1,0.077907,-1.068169,1.146076,0.546433,0.453567,0.551088,0.448912,0.546433,0.004655,-0.032244,0.036899,2009
2,-1.40276,0.0,0.0,0.551088,0.448912,0.510793,0.489207,0.551088,-0.040295,0.0,0.0,2009
3,-1.712583,3.318841,-5.031425,0.510793,0.489207,0.461217,0.538783,0.510793,-0.049576,0.106663,-0.156239,2009
4,2.097796,0.0,0.0,0.461217,0.538783,0.558929,0.441071,0.461217,0.097712,0.0,0.0,2009


I could also be a bit more savvy and replace missing values with whatever value comes directly after it in the same column. (This makes a lot of sense for datasets where the observations have some sort of logical order to them.)

In [18]:
# replace each NA with the value that comes directly after it in the same column,
# then replace all the remaining NA's with 0

# In the original code the line is the following but it raises a deprecation warning.
#subset_nfl_data.fillna(method = 'bfill', axis=0).fillna(0) 
# Instead the following is used:
subset_nfl_data.bfill(axis=0).fillna(0) 

Unnamed: 0,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2.014474,-1.068169,1.146076,0.485675,0.514325,0.546433,0.453567,0.485675,0.060758,-0.032244,0.036899,2009
1,0.077907,-1.068169,1.146076,0.546433,0.453567,0.551088,0.448912,0.546433,0.004655,-0.032244,0.036899,2009
2,-1.40276,3.318841,-5.031425,0.551088,0.448912,0.510793,0.489207,0.551088,-0.040295,0.106663,-0.156239,2009
3,-1.712583,3.318841,-5.031425,0.510793,0.489207,0.461217,0.538783,0.510793,-0.049576,0.106663,-0.156239,2009
4,2.097796,0.0,0.0,0.461217,0.538783,0.558929,0.441071,0.461217,0.097712,0.0,0.0,2009
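
The zeros in the last row of the output above come from the trailing `fillna(0)`: a backfill has nothing to copy from once it reaches the end of a column. A small sketch on made-up values (the `ordered` frame is hypothetical) shows the difference between `bfill()` and its mirror image `ffill()`:

```python
import numpy as np
import pandas as pd

# Made-up ordered measurements; the first and last values are missing on purpose
ordered = pd.DataFrame({"airEPA": [np.nan, -1.0, np.nan, 3.0, np.nan]})

backfilled = ordered.bfill()      # each gap takes the next valid value below it
forwardfilled = ordered.ffill()   # each gap takes the last valid value above it

print(backfilled["airEPA"].tolist())
print(forwardfilled["airEPA"].tolist())

# bfill cannot fill the final row, so a fallback such as fillna(0) follows it
filled = ordered.bfill().fillna(0)
print(filled["airEPA"].tolist())
```

Which direction makes sense depends on how the rows are ordered and on whether a later or an earlier observation is the better stand-in.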


Filling in missing values is also known as "imputation", and you can find more exercises on it [in this lesson, also linked under the "More practice!" section](https://www.kaggle.com/dansbecker/handling-missing-values). First, however, why don't you try replacing some of the missing values in the `sf_permits` dataset?

In [19]:
# Your turn! Try replacing all the NaN's in the sf_permits data with the one that
# comes directly after it and then replacing any remaining NaN's with 0
sf_permits=sf_permits.bfill(axis=0).fillna(0)
sf_permits.sample(5)

Unnamed: 0,Permit Number,Permit Type,Permit Type Definition,Permit Creation Date,Block,Lot,Street Number,Street Number Suffix,Street Name,Street Suffix,Unit,Unit Suffix,Description,Current Status,Current Status Date,Filed Date,Issued Date,Completed Date,First Construction Document Date,Structural Notification,Number of Existing Stories,Number of Proposed Stories,Voluntary Soft-Story Retrofit,Fire Only Permit,Permit Expiration Date,Estimated Cost,Revised Cost,Existing Use,Existing Units,Proposed Use,Proposed Units,Plansets,TIDF Compliance,Existing Construction Type,Existing Construction Type Description,Proposed Construction Type,Proposed Construction Type Description,Site Permit,Supervisor District,Neighborhoods - Analysis Boundaries,Zipcode,Location,Record ID
155884,201705025377,8,otc alterations permit,05/02/2017,238,1,275,A,Battery,St,0.0,A,"non structural demo (n) non-structural partitions, lighting, ceiling, millwork, finishes. no change in occupancy, use or area",issued,05/31/2017,05/02/2017,05/31/2017,06/20/2017,05/31/2017,Y,30.0,30.0,Y,Y,05/15/2020,1586552.0,1586552.0,office,0.0,office,0.0,2.0,0,1.0,constr type 1,1.0,constr type 1,Y,3.0,Financial District/South Beach,94104.0,"(37.79386286998276, -122.40037373645143)",146146463386
67171,201411101171,8,otc alterations permit,11/10/2014,514,6,3125,A,Steiner,St,501.0,C,"remodel interior spaces including kitchen, bedrooms, and bathrooms. relocate an interior stair provide opening in wall separating kitchen and dining room. add a new secondary living deck at rear. no expansion or bldg footprint. no change to facade.",filed,11/10/2014,11/10/2014,11/10/2014,10/23/2015,11/10/2014,Y,3.0,3.0,Y,Y,11/05/2015,200000.0,300000.0,1 family dwelling,1.0,1 family dwelling,1.0,2.0,P,5.0,wood frame (5),5.0,wood frame (5),Y,2.0,Marina,94123.0,"(37.798240096614876, -122.43770817950788)",136194772577
131269,201607223130,8,otc alterations permit,07/22/2016,299,2,729,A,Jones,St,0.0,B,"#107- add (1) new head in closet, relocate (1) head at bath, replace existing pendent heads with new, total 18 heads. n/a ordinance #155-13 ref. to pa#201507201972. n/a maher ordinance",issued,09/06/2016,07/22/2016,09/06/2016,09/02/2016,09/06/2016,Y,6.0,6.0,Y,Y,09/01/2017,7000.0,7000.0,apartments,80.0,apartments,80.0,2.0,0,5.0,wood frame (5),5.0,wood frame (5),Y,3.0,Nob Hill,94109.0,"(37.7880133065598, -122.41378779415976)",1431108412946
78305,201503100352,8,otc alterations permit,03/10/2015,340,15,144,A,Taylor,St,0.0,D,"adjustment to path of travel on emergency egress and occupant load for theater 1 & 2, added sound vestibule, adjusted kitchen sizerevision to pa# 2012 11 13 4020",complete,04/06/2015,03/10/2015,03/16/2015,04/06/2015,03/16/2015,Y,2.0,2.0,Y,Y,03/10/2016,1.0,1.0,food/beverage hndlng,27.0,food/beverage hndlng,27.0,2.0,P,5.0,wood frame (5),5.0,wood frame (5),Y,6.0,Tenderloin,94102.0,"(37.78393596981267, -122.4106602245025)",1373684301176
17827,201307081261,8,otc alterations permit,07/08/2013,3047,16,290,A,Maywood,Dr,0.0,BLDG 3,"remove and replace kitchen cabinets, remodel 2 baths, enlarge and remodel master bath for wheelchair access, kitchen approx 15x15, master bath approx 10x8, 2 other approx 8x8",complete,04/14/2014,07/08/2013,07/08/2013,04/14/2014,07/08/2013,Y,1.0,1.0,Y,Y,07/03/2014,75753.0,75753.0,1 family dwelling,1.0,1 family dwelling,1.0,0.0,Y,5.0,wood frame (5),5.0,wood frame (5),Y,7.0,West of Twin Peaks,94127.0,"(37.731407639433264, -122.46180412687058)",1310277150879
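
Note what the blanket backfill did in the sample above: every row now carries a `Street Number Suffix` of `A` and a `Y` in columns such as `Structural Notification`, which were almost entirely missing before. If some of those blanks mean "does not apply", a more selective fill may be preferable. Here is a hedged sketch on a made-up frame, where the choice of which column to fill is an assumption for illustration only:

```python
import numpy as np
import pandas as pd

# Invented rows; only the column names come from the permits data
permits = pd.DataFrame({
    "Street Number Suffix": [np.nan, "A", np.nan, np.nan],
    "Zipcode": [94110.0, np.nan, 94109.0, np.nan],
})

# Hypothetical choice: treat Zipcode as "not recorded" and fill it,
# but leave Street Number Suffix alone because most addresses have none
cols_to_fill = ["Zipcode"]
selective = permits.copy()
selective[cols_to_fill] = selective[cols_to_fill].bfill().fillna(0)

print(selective)
```

Working on a copy keeps the original frame intact, so the filled and unfilled versions can be compared side by side.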


And that's it for today! If you have any questions, be sure to post them in the comments below or [on the forums](https://www.kaggle.com/questions-and-answers). 

Remember that your notebook is private by default, and in order to share it with other people or ask for help with it, you'll need to make it public. First, you'll need to save a version of your notebook that shows your current work by hitting the "Commit & Run" button. (Your work is saved automatically, but versioning your work lets you go back and look at what it was like at the point you saved it. It also lets you share a nice compiled notebook instead of just the raw code.) Then, once your notebook is finished running, you can go to the Settings tab in the panel to the left (you may have to expand it by hitting the [<] button next to the "Commit & Run" button) and set the "Visibility" dropdown to "Public".

## More practice!

If you're looking for more practice handling missing values, check out these extra-credit\* exercises:

* [Handling Missing Values](https://www.kaggle.com/dansbecker/handling-missing-values): In this notebook Dan shows you several approaches to imputing missing data using scikit-learn's imputer. 
* Look back at the `Zipcode` column in the `sf_permits` dataset, which has some missing values. How would you go about figuring out what the actual zipcode of each address should be? (You might try using another dataset. You can search for datasets about San Francisco on the [Datasets listing](https://www.kaggle.com/datasets).) 
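
One possible starting point for the `Zipcode` question, before reaching for an outside dataset, is to borrow the zipcode from other permits that share the same `Block`. This is only a sketch on invented rows, and it assumes (without checking) that a block lies within a single zipcode; `most_common` and `Zipcode_filled` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Invented rows; only the column names come from the permits data
permits = pd.DataFrame({
    "Block": ["0478", "0478", "0478", "6507", "6507"],
    "Zipcode": [94109.0, np.nan, 94109.0, 94114.0, np.nan],
})

def most_common(values: pd.Series) -> float:
    """Return the most frequent non-missing value, or NaN if there is none."""
    modes = values.mode()  # mode() ignores NaN by default
    return modes.iloc[0] if not modes.empty else np.nan

# For every row, look up the most common zipcode within its block
block_zip = permits.groupby("Block")["Zipcode"].transform(most_common)

# Only fill the gaps; recorded zipcodes are left untouched
permits["Zipcode_filled"] = permits["Zipcode"].fillna(block_zip)

print(permits)
```

Rows whose block has no recorded zipcode at all would stay `NaN` under this approach, which is where a second dataset could help.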

The cell below saves the filled-in data as **sf_permits_clean.csv** in the clean data directory, so it is available for further analysis.


In [20]:
clean_file=os.path.join(CLEAN_DATA_DIR,'sf_permits_clean.csv')
sf_permits.to_csv(clean_file)

\* no actual credit is given for completing the challenge, you just learn how to clean data real good :P