### Detect missing values
![Dark](https://user-images.githubusercontent.com/12748752/126914729-75e0fed5-fdaa-4216-81c8-719340e80694.png)
* The **isna()** function is used to detect missing values.
> _**Series.isna(self)**_
* **Returns: Series- Mask of bool values for each element in Series that indicates whether an element is an NA value.**

In [1]:
import numpy as np
import pandas as pd

# pass a dictionary to build a DataFrame
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
df = pd.DataFrame({'age': [6, 7, np.nan],
                   'born': [pd.NaT, pd.Timestamp('1998-04-25'),
                            pd.Timestamp('1940-05-27')],
                   'name': ['Alfred', 'Spiderman', ''],
                   'toy': [None, 'Spidertoy', 'Joker']})
df.head()

Unnamed: 0,age,born,name,toy
0,6.0,NaT,Alfred,
1,7.0,1998-04-25,Spiderman,Spidertoy
2,,1940-05-27,,Joker


In [5]:
df.isna()

Unnamed: 0,age,born,name,toy
0,False,True,False,True
1,False,False,False,False
2,True,False,False,False
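
Note that the empty string in the `name` column is reported as `False`. If empty strings should count as missing in your data (an assumption that depends on the dataset), one possible approach is to convert them to `NaN` first. The sketch below uses a small stand-in frame, `df_demo`, which is hypothetical and only mirrors two columns of the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in mirroring the 'name' and 'toy' columns above
df_demo = pd.DataFrame({'name': ['Alfred', 'Spiderman', ''],
                        'toy': [None, 'Spidertoy', 'Joker']})

# isna() does not flag the empty string in 'name'
print(df_demo.isna())

# Assumption: empty strings should be treated as missing values
df_clean = df_demo.replace('', np.nan)

# Now the last 'name' entry is reported as missing too
print(df_clean.isna())
```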


### How many missing(NA) values each column has 
![Light](https://user-images.githubusercontent.com/12748752/126914730-b5b13ba9-4d20-4ebf-b0ed-231af4c8b984.png)


In [6]:
df.isna().sum()

age     1
born    1
name    0
toy     1
dtype: int64
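
Since `True` counts as 1, the same boolean mask can also give the share of missing values per column or the number of rows with any missing value. This is a minimal sketch on a hypothetical stand-in frame (`df_demo`), not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the DataFrame used above
df_demo = pd.DataFrame({'age': [6, 7, np.nan],
                        'name': ['Alfred', 'Spiderman', ''],
                        'toy': [None, 'Spidertoy', 'Joker']})

# The mean of a boolean mask is the fraction of True values per column
missing_share = df_demo.isna().mean()
print(missing_share * 100)  # percentage of missing values per column

# Number of rows that contain at least one missing value
rows_with_na = df_demo.isna().any(axis=1).sum()
print(rows_with_na)
```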

### Detect existing values in Pandas series
![Dark](https://user-images.githubusercontent.com/12748752/126914729-75e0fed5-fdaa-4216-81c8-719340e80694.png)

* The **notna()** function is used to detect existing (non-missing) values.

> _**Series.notna(self)**_

* **Returns: Series- Mask of bool values for each element in Series that indicates whether an element is not an NA value.**


In [8]:
# Continuation of above DataFrame
df.isna()

Unnamed: 0,age,born,name,toy
0,False,True,False,True
1,False,False,False,False
2,True,False,False,False


In [9]:
df.notna()

Unnamed: 0,age,born,name,toy
0,True,False,True,False
1,True,True,True,True
2,False,True,True,True
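
A common use of the `notna()` mask is boolean indexing, keeping only the rows where a given column has a value. The frame below (`df_demo`) is a hypothetical stand-in chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in with one missing 'toy'
df_demo = pd.DataFrame({'name': ['Alfred', 'Spiderman', 'Joker'],
                        'toy': [None, 'Spidertoy', 'Joker']})

# Keep only the rows where 'toy' is present
has_toy = df_demo[df_demo['toy'].notna()]
print(has_toy)

# notna() is the element-wise negation of isna()
same_mask = (df_demo.notna() == ~df_demo.isna()).all().all()
print(same_mask)
```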


### Drop rows/columns with null values
![Dark](https://user-images.githubusercontent.com/12748752/126914729-75e0fed5-fdaa-4216-81c8-719340e80694.png)
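* The **dropna()** function returns a new object with missing values removed. On a Series it drops the NA entries; on a DataFrame, as in the examples that follow, it drops every row (or column) that contains at least one NA value.
> _**Series.dropna(self, axis=0, inplace=False, **kwargs)**_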

* **Returns: Series- Series with NA entries dropped from it.**

In [14]:
import numpy as np
import pandas as pd

# pass a dictionary to build a DataFrame
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
df = pd.DataFrame({'age': [6, 7, np.nan],
                   'born': [pd.NaT, pd.Timestamp('1998-04-25'),
                            pd.Timestamp('1940-05-27')],
                   'name': ['Alfred', 'Spiderman', ''],
                   'toy': [None, 'Spidertoy', 'Joker']})
df.head()

Unnamed: 0,age,born,name,toy
0,6.0,NaT,Alfred,
1,7.0,1998-04-25,Spiderman,Spidertoy
2,,1940-05-27,,Joker


#### Row wise drop NAs 
![Light](https://user-images.githubusercontent.com/12748752/126914730-b5b13ba9-4d20-4ebf-b0ed-231af4c8b984.png)
> **axis=0** (the default): 0 or 'index' drops the rows that contain missing values.


In [15]:
df.dropna()

Unnamed: 0,age,born,name,toy
1,7.0,1998-04-25,Spiderman,Spidertoy


#### Column wise drop NAs
![Light](https://user-images.githubusercontent.com/12748752/126914730-b5b13ba9-4d20-4ebf-b0ed-231af4c8b984.png)
> **axis=1**: 1 or 'columns' drops the columns that contain missing values.


In [17]:
# An empty string ('') is not treated as NaN, None or NA, so the 'name' column is kept

df.dropna(axis=1)

Unnamed: 0,name
0,Alfred
1,Spiderman
2,
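
By default `dropna()` removes a row or column as soon as it holds a single missing value. The `how` and `subset` parameters of `DataFrame.dropna` make that less aggressive. The sketch below uses a hypothetical stand-in frame (`df_demo`) to show both.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: the last row is missing in every column
df_demo = pd.DataFrame({'age': [6, 7, np.nan, np.nan],
                        'toy': [None, 'Spidertoy', 'Joker', None]})

# Drop a row only when ALL of its values are missing
all_missing_dropped = df_demo.dropna(how='all')
print(all_missing_dropped)

# Only consider the 'age' column when deciding which rows to drop
age_known = df_demo.dropna(subset=['age'])
print(age_known)
```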


### Fill NA/NaN values using the specified method
![Dark](https://user-images.githubusercontent.com/12748752/126914729-75e0fed5-fdaa-4216-81c8-719340e80694.png)
* The **fillna()** function is used to fill NA/NaN values using the specified method.
> **Series.fillna(self, value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)**

* **Returns: Series- Object with missing values filled.**

In [18]:
import numpy as np
import pandas as pd


df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [5, np.nan, np.nan, 6],
                   [np.nan, 4, np.nan, 5]],
                 columns=list('PQRS'))
df.head()

Unnamed: 0,P,Q,R,S
0,,2.0,,0
1,3.0,4.0,,1
2,5.0,,,6
3,,4.0,,5


#### Replace all NaN elements with 0s.
![Light](https://user-images.githubusercontent.com/12748752/126914730-b5b13ba9-4d20-4ebf-b0ed-231af4c8b984.png)


In [19]:
df.fillna(0)

Unnamed: 0,P,Q,R,S
0,0.0,2.0,0.0,0
1,3.0,4.0,0.0,1
2,5.0,0.0,0.0,6
3,0.0,4.0,0.0,5


#### Replace NaN with a different value per column, filling at most two NaN values in each column.
![Light](https://user-images.githubusercontent.com/12748752/126914730-b5b13ba9-4d20-4ebf-b0ed-231af4c8b984.png)


In [21]:
df.fillna({'P': 0, 'Q': 1, 'R': 2, 'S': 3}, limit=2)

Unnamed: 0,P,Q,R,S
0,0.0,2.0,2.0,0
1,3.0,4.0,2.0,1
2,5.0,1.0,,6
3,0.0,4.0,,5
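
The per-column fill values do not have to be typed by hand. As one possible variation (an illustration, not something the original notebook does), they can be computed from the data, for example by passing the column means as the fill value. `df_demo` is a hypothetical stand-in frame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with two numeric columns
df_demo = pd.DataFrame({'P': [np.nan, 3, 5, np.nan],
                        'Q': [2, 4, np.nan, 4]})

# df_demo.mean() is a Series indexed by column name, so each
# column's NaN values are replaced by that column's own mean
filled = df_demo.fillna(df_demo.mean())
print(filled)
```

Whether the mean is a sensible replacement depends on the data; it is used here only to show that `fillna()` accepts a Series as well as a dictionary.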


#### Propagate non-null values forward or backward.
![Light](https://user-images.githubusercontent.com/12748752/126914730-b5b13ba9-4d20-4ebf-b0ed-231af4c8b984.png)


In [22]:
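# Forward: copy the last valid value downward
# (fillna(method='ffill') is deprecated since pandas 2.1; ffill() is the equivalent call)
df.ffill()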

Unnamed: 0,P,Q,R,S
0,,2.0,,0
1,3.0,4.0,,1
2,5.0,4.0,,6
3,5.0,4.0,,5


In [23]:
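# Backward: copy the next valid value upward
# (fillna(method='bfill') is deprecated since pandas 2.1; bfill() is the equivalent call)
df.bfill()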

Unnamed: 0,P,Q,R,S
0,3.0,2.0,,0
1,3.0,4.0,,1
2,5.0,4.0,,6
3,,4.0,,5
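
Forward and backward filling can be capped with `limit`, and chaining the two covers gaps at the start of a column that a forward fill alone cannot reach. The Series below are hypothetical examples chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a run of three consecutive NaN values
s_gap = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Fill at most one consecutive NaN after each valid value
capped = s_gap.ffill(limit=1)
print(capped)

# Hypothetical Series that starts with NaN
s_lead = pd.Series([np.nan, 2.0, np.nan])

# ffill leaves the leading NaN untouched; bfill then fills it
both = s_lead.ffill().bfill()
print(both)
```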


### Fill NA/missing values in a Pandas series
![Dark](https://user-images.githubusercontent.com/12748752/126914729-75e0fed5-fdaa-4216-81c8-719340e80694.png)
* The **interpolate()** function is used to interpolate values according to different methods.

> _**Series.interpolate(self, method='linear', axis=0, limit=None, inplace=False, limit_direction='forward', limit_area=None, downcast=None, **kwargs)**_

* **Returns: Series or DataFrame- Returns the same object type as the caller, interpolated at some or all NaN values.**

* **Notes** The ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’ and ‘akima’ methods are wrappers around the respective SciPy implementations of similar names. These use the actual numerical values of the index.


In [25]:
import numpy as np
import pandas as pd

s = pd.Series([0, 2, np.nan, 5])
s.head()

0    0.0
1    2.0
2    NaN
3    5.0
dtype: float64

In [26]:
s.interpolate()

0    0.0
1    2.0
2    3.5
3    5.0
dtype: float64
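
The default `'linear'` method treats the values as equally spaced and ignores the index. When the index carries real positions, `method='index'` interpolates against the index values instead. A minimal sketch with a hypothetical, unevenly spaced Series:

```python
import numpy as np
import pandas as pd

# Hypothetical Series whose index is unevenly spaced (0, 1, 10)
s_uneven = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

# 'linear' ignores the index: the gap is filled with the midpoint
linear = s_uneven.interpolate()
print(linear)

# 'index' uses the index values: position 1 is one tenth of the way
by_index = s_uneven.interpolate(method='index')
print(by_index)
```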

#### Filling in NaN in a Series by padding, but filling at most two consecutive NaN at a time.

![Light](https://user-images.githubusercontent.com/12748752/126914730-b5b13ba9-4d20-4ebf-b0ed-231af4c8b984.png)


In [27]:
s = pd.Series([np.nan, "single_one", np.nan,
               "fill_two_more", np.nan, np.nan,
               3.71, np.nan])
s

0              NaN
1       single_one
2              NaN
3    fill_two_more
4              NaN
5              NaN
6             3.71
7              NaN
dtype: object

In [28]:
# interpolate(method='pad') is deprecated since pandas 2.1; ffill() pads the same way
s.ffill(limit=2)

0              NaN
1       single_one
2       single_one
3    fill_two_more
4    fill_two_more
5    fill_two_more
6             3.71
7             3.71
dtype: object

## Example on a real Dataset
![Dark](https://user-images.githubusercontent.com/12748752/126914729-75e0fed5-fdaa-4216-81c8-719340e80694.png)


In [29]:
import pandas as pd
import numpy as np

# read in all our data
nfl_data = pd.read_csv("NFL Play by Play 2009-2017 (v4).csv")
nfl_data.head(2)



Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,...,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2009-09-10,2009091000,1,1,,15:00,15,3600.0,0.0,TEN,...,,0.485675,0.514325,0.546433,0.453567,0.485675,0.060758,,,2009
1,2009-09-10,2009091000,1,1,1.0,14:53,15,3593.0,7.0,PIT,...,1.146076,0.546433,0.453567,0.551088,0.448912,0.546433,0.004655,-0.032244,0.036899,2009


#### Finding missing Values
![Light](https://user-images.githubusercontent.com/12748752/126914730-b5b13ba9-4d20-4ebf-b0ed-231af4c8b984.png)


In [30]:
missing_values_count = nfl_data.isnull().sum(axis=0)
missing_values_count

Date             0
GameID           0
Drive            0
qtr              0
down         61154
             ...  
Win_Prob     25009
WPA           5541
airWPA      248501
yacWPA      248762
Season           0
Length: 102, dtype: int64

### Finding the percentage of the missing values
![Light](https://user-images.githubusercontent.com/12748752/126914730-b5b13ba9-4d20-4ebf-b0ed-231af4c8b984.png)


In [31]:
# total number of missing values across all columns
total_missing = missing_values_count.sum()
# total number of cells (rows * columns)
total_cells = nfl_data.shape[0] * nfl_data.shape[1]
percent = (total_missing / total_cells) * 100
percent

24.87214126835169

In [32]:
# how many total missing values do we have?
total_cells = np.prod(nfl_data.shape)  # np.product was removed in NumPy 2.0
total_missing = missing_values_count.sum()

# percent of data that is missing
percent_missing = (total_missing/total_cells) * 100
print(percent_missing)

24.87214126835169
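
The same overall percentage can be obtained without counting cells by hand, because the mean of the boolean mask is the fraction of missing cells. The sketch below runs on a small hypothetical frame (`df_demo`, with made-up values and column names borrowed from the NFL data) since the CSV file is not bundled here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: 8 cells, 2 of them missing
df_demo = pd.DataFrame({'down': [np.nan, 1.0, 2.0, np.nan],
                        'qtr': [1, 1, 1, 1]})

# Explicit version: missing cells divided by total cells
missing_values_count = df_demo.isna().sum()
percent_missing = missing_values_count.sum() / df_demo.size * 100

# Shortcut: mean of the flattened boolean mask
shortcut = df_demo.isna().to_numpy().mean() * 100

print(percent_missing, shortcut)
```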


### Drop missing values
![Light](https://user-images.githubusercontent.com/12748752/126914730-b5b13ba9-4d20-4ebf-b0ed-231af4c8b984.png)
* **Note:** Generally this approach is not recommended for important projects! It's usually worth taking the time to go through the columns with missing values one by one and really get to know your dataset.

* To drop rows with missing values, pandas has a handy function, _**dropna()**_.

In [36]:
# remove every row that contains at least one missing value
nfl_data.dropna()

Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,...,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season


**😱 No rows are left, because every row in our dataset had at least one missing value.**

In [37]:
# remove all columns with at least one missing value; axis : {0 or 'index', 1 or 'columns'}, default 0
columns_with_na_dropped = nfl_data.dropna(axis=1)
columns_with_na_dropped.head()

Unnamed: 0,Date,GameID,Drive,qtr,TimeUnder,ydstogo,ydsnet,PlayAttempted,Yards.Gained,sp,...,Timeout_Indicator,Timeout_Team,posteam_timeouts_pre,HomeTimeouts_Remaining_Pre,AwayTimeouts_Remaining_Pre,HomeTimeouts_Remaining_Post,AwayTimeouts_Remaining_Post,ExPoint_Prob,TwoPoint_Prob,Season
0,2009-09-10,2009091000,1,1,15,0,0,1,39,0,...,0,,3,3,3,3,3,0.0,0.0,2009
1,2009-09-10,2009091000,1,1,15,10,5,1,5,0,...,0,,3,3,3,3,3,0.0,0.0,2009
2,2009-09-10,2009091000,1,1,15,5,2,1,-3,0,...,0,,3,3,3,3,3,0.0,0.0,2009
3,2009-09-10,2009091000,1,1,14,8,2,1,0,0,...,0,,3,3,3,3,3,0.0,0.0,2009
4,2009-09-10,2009091000,1,1,14,8,2,1,0,0,...,0,,3,3,3,3,3,0.0,0.0,2009


In [38]:
# Just how much data did we lose?
print("Columns in original dataset: %d \n" % nfl_data.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])

Columns in original dataset: 102 

Columns with na's dropped: 41
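
Dropping every column that has any missing value discards a lot of data. A gentler option is the `thresh` parameter of `dropna()`, which keeps a column as long as it has at least a given number of non-missing values. The frame and the 75% cutoff below are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with columns of varying completeness
df_demo = pd.DataFrame({'GameID': [1, 2, 3, 4],
                        'down': [1.0, np.nan, 2.0, 3.0],
                        'airWPA': [np.nan, np.nan, np.nan, 0.1]})

# Assumption: a column is worth keeping if at least 75% of it is present
min_non_na = int(0.75 * len(df_demo))

# Keep only the columns with at least min_non_na non-missing values
mostly_complete = df_demo.dropna(axis=1, thresh=min_non_na)
print(mostly_complete.columns.tolist())
```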


### Filling in missing values automatically
![Light](https://user-images.githubusercontent.com/12748752/126914730-b5b13ba9-4d20-4ebf-b0ed-231af4c8b984.png)

* We can use pandas' _**fillna()**_ function to fill in missing values in a DataFrame for us.
* One option we have is to specify what we want the NaN values to be replaced with. 
* Here, I would like to replace all the NaN values with 0.

> _df.fillna({'NameColumn':8,'AddressColumn':0})_ 

> _df[['col1', 'col2']] = df[['col1', 'col2']].fillna(value=0)_
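
When only some columns should be filled, assigning the result back is the dependable route, because `fillna()` on a column selection operates on a copy and leaves the original frame unchanged. A minimal sketch on a hypothetical frame (`df_demo`, with placeholder column names):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with placeholder column names
df_demo = pd.DataFrame({'col1': [1.0, np.nan],
                        'col2': [np.nan, 2.0],
                        'col3': [np.nan, 3.0]})

# Fill only col1 and col2, then assign the result back
df_demo[['col1', 'col2']] = df_demo[['col1', 'col2']].fillna(0)

# col3 was not selected, so its NaN is still there
print(df_demo)
```

Passing a dictionary such as `{'col1': 0, 'col2': 0}` to `fillna()` on the whole frame is an equivalent alternative.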

In [39]:
# get a small subset of the NFL dataset
subset_nfl_data = nfl_data.loc[:, 'EPA':'Season'].head()
subset_nfl_data

Unnamed: 0,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2.014474,,,0.485675,0.514325,0.546433,0.453567,0.485675,0.060758,,,2009
1,0.077907,-1.068169,1.146076,0.546433,0.453567,0.551088,0.448912,0.546433,0.004655,-0.032244,0.036899,2009
2,-1.40276,,,0.551088,0.448912,0.510793,0.489207,0.551088,-0.040295,,,2009
3,-1.712583,3.318841,-5.031425,0.510793,0.489207,0.461217,0.538783,0.510793,-0.049576,0.106663,-0.156239,2009
4,2.097796,,,0.461217,0.538783,0.558929,0.441071,0.461217,0.097712,,,2009


In [40]:
# replace all NA's with 0
subset_nfl_data.fillna(0)

Unnamed: 0,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2.014474,0.0,0.0,0.485675,0.514325,0.546433,0.453567,0.485675,0.060758,0.0,0.0,2009
1,0.077907,-1.068169,1.146076,0.546433,0.453567,0.551088,0.448912,0.546433,0.004655,-0.032244,0.036899,2009
2,-1.40276,0.0,0.0,0.551088,0.448912,0.510793,0.489207,0.551088,-0.040295,0.0,0.0,2009
3,-1.712583,3.318841,-5.031425,0.510793,0.489207,0.461217,0.538783,0.510793,-0.049576,0.106663,-0.156239,2009
4,2.097796,0.0,0.0,0.461217,0.538783,0.558929,0.441071,0.461217,0.097712,0.0,0.0,2009


In [41]:
# replace each NA with the value that comes directly after it in the same column
# (within the columns we selected, 'EPA' to 'Season'),
# then replace all the remaining NAs with 0
# (bfill() replaces fillna(method='bfill'), which is deprecated since pandas 2.1)
subset_nfl_data.bfill(axis=0).fillna(0)

Unnamed: 0,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2.014474,-1.068169,1.146076,0.485675,0.514325,0.546433,0.453567,0.485675,0.060758,-0.032244,0.036899,2009
1,0.077907,-1.068169,1.146076,0.546433,0.453567,0.551088,0.448912,0.546433,0.004655,-0.032244,0.036899,2009
2,-1.40276,3.318841,-5.031425,0.551088,0.448912,0.510793,0.489207,0.551088,-0.040295,0.106663,-0.156239,2009
3,-1.712583,3.318841,-5.031425,0.510793,0.489207,0.461217,0.538783,0.510793,-0.049576,0.106663,-0.156239,2009
4,2.097796,0.0,0.0,0.461217,0.538783,0.558929,0.441071,0.461217,0.097712,0.0,0.0,2009
