In [1]:
# modules we'll use
import pandas as pd
import numpy as np

# read in all our data
nfl_data = pd.read_csv("../input/nflplaybyplay2009to2016/NFL Play by Play 2009-2017 (v4).csv")

# set seed for reproducibility
np.random.seed(0) 

FileNotFoundError: [Errno 2] File b'../input/nflplaybyplay2009to2016/NFL Play by Play 2009-2017 (v4).csv' does not exist: b'../input/nflplaybyplay2009to2016/NFL Play by Play 2009-2017 (v4).csv'

In [None]:
# look at the first five rows of the nfl_data file. 
# I can see a handful of missing data already!
nfl_data.head()

In [None]:
# get the number of missing data points per column
missing_values_count = nfl_data.isnull().sum()

# look at the # of missing points in the first ten columns
missing_values_count[0:10]

In [None]:
# how many total missing values do we have?
total_cells = np.prod(nfl_data.shape)  # np.product was removed in NumPy 2.0
total_missing = missing_values_count.sum()

# percent of data that is missing
percent_missing = (total_missing/total_cells) * 100
print(percent_missing)

In [3]:
'''Figure out why the data is missing
This is the point at which we get into the part of data science that I like to call "data intuition",
by which I mean "really looking at your data and trying to figure out why it is the way it is and 
how that will affect your analysis". It can be a frustrating part of data science, especially if you're 
newer to the field and don't have a lot of experience. For dealing with missing values, you'll need to use your 
intuition to figure out why the value is missing. 
One of the most important questions you can ask yourself to help figure this out is this:'''


Is this value missing because it wasn't recorded or because it doesn't exist?

If a value is missing because it doesn't exist, it makes sense to leave it as NaN. If it is missing because it wasn't recorded, you can try to estimate it from the other values that are present.
This is called imputation.
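
As a minimal sketch of that distinction, here is a toy dataframe (the column names and values are hypothetical, not taken from the NFL data). It assumes the gaps in "yards" were simply not recorded, so they are imputed with the column mean, while the gaps in "penalty_team" mean there was no penalty, so they are left as missing. The mean is only one possible choice of estimate.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the NFL dataset
toy = pd.DataFrame({
    "yards": [5.0, np.nan, 3.0, 8.0],            # assumption: gap was not recorded
    "penalty_team": ["DEN", None, None, "PIT"],  # assumption: gap means no penalty
})

# Impute the unrecorded numeric values with the column mean
imputed = toy.copy()
imputed["yards"] = imputed["yards"].fillna(imputed["yards"].mean())

# The values that don't exist stay missing
print(imputed)
```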

Let's work through an example. Looking at the number of missing values in the nfl_data dataframe, 
I notice that the column "TimeSecs" has a lot of missing values in it:

In [None]:
missing_values_count[0:10]

# Drop missing values

If you're in a hurry or don't have a reason to figure out why your values are missing, one option you have is to 
just remove any rows or columns that contain missing values. (Note: I don't generally recommend this 
approach for important projects! It's usually worth taking the time to go through your data and
look at all the columns with missing values one by one to really get to know your dataset.)

If you're sure you want to drop rows with missing values, pandas has a handy function, dropna(), 
to help you do this. Let's try it out on our NFL dataset!

In [None]:
# remove all the rows that contain a missing value
nfl_data.dropna()

# remove all columns with at least one missing value
columns_with_na_dropped = nfl_data.dropna(axis=1)
columns_with_na_dropped.head()

In [None]:
# just how much data did we lose?
print("Columns in original dataset: %d \n" % nfl_data.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])
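
To see how much each choice can remove, here is a small sketch on a hypothetical toy dataframe (not the NFL data) in which every row has at least one gap. It compares dropping rows, dropping columns, and the `thresh` argument of dropna(), which keeps rows that have at least a given number of non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: every row has a gap, but column "a" is complete
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [np.nan, 5.0, np.nan],
    "c": [7.0, np.nan, np.nan],
})

rows_dropped = toy.dropna()         # drops any row containing a NaN
cols_dropped = toy.dropna(axis=1)   # drops any column containing a NaN
mostly_full = toy.dropna(thresh=2)  # keeps rows with at least 2 non-missing values

print("Rows left after dropna():", rows_dropped.shape[0])
print("Columns left after dropna(axis=1):", cols_dropped.shape[1])
print("Rows left after dropna(thresh=2):", mostly_full.shape[0])
```

Dropping rows is the most destructive option here because a single gap anywhere in a row is enough to remove it.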

# Filling in missing values automatically

Another option is to try and fill in the missing values. For this next bit, I'm getting a 
small sub-section of the NFL data so that it will print well.

In [None]:
# get a small subset of the NFL dataset
subset_nfl_data = nfl_data.loc[:, 'EPA':'Season'].head()
subset_nfl_data

We can use pandas' fillna() function to fill in missing values in a dataframe for us. One option we have is to specify what we want the NaN values to be replaced with. Here, I'm saying that I would like to replace all the NaN values with 0.

In [None]:
# replace all NA's with 0
subset_nfl_data.fillna(0)

I could also be a bit more savvy and replace each missing value with whatever value comes directly after it in the same column. (This makes a lot of sense for datasets where the observations have some sort of logical order to them.)

In [None]:
# replace all NA's with the value that comes directly after it in the same column, 
# then replace all the remaining NA's with 0
# (bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas versions)
subset_nfl_data.bfill(axis=0).fillna(0)
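
To make the two-step fill easier to follow, here is a small sketch on a hypothetical single-column dataframe (the "score" column and its values are made up for illustration). Backfilling copies the next valid value upward into each gap, and the final fillna(0) only affects gaps at the bottom of the column that have no later value to copy.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered observations with gaps at the start, middle, and end
toy = pd.DataFrame({"score": [np.nan, 3.0, np.nan, np.nan, 7.0, np.nan]})

backfilled = toy.bfill()        # each gap takes the next valid value below it
filled = backfilled.fillna(0)   # the trailing gap has nothing below it, so use 0

print(filled["score"].tolist())
```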