<a href="https://colab.research.google.com/github/gopal2812/mlblr/blob/master/pandas2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 2 - Data Preparation Basics
## Segment 2 - Treating missing values

- [Voiceover] Let's talk about treating missing values. By default, missing values are represented in pandas with NaN, which stands for not a number. I want to give you a warning here: if your dataset uses placeholder values such as 0s, 99s, or 999s to stand in for missing data, be sure to either drop or approximate them as you would with a missing value. Let me give you an example of where treating missing values is useful. Imagine you work in the marketing department of a local car dealership. You've been tasked with summarizing recent results from a customer satisfaction survey. You get this dataset and you can see that most of the records have been completed, but Sally and Jim didn't respond with their opinion of the quality of work. You can see that here with the missing values. Nevertheless, Sally and Jim have responded to 75% of the requests for information, so we wouldn't want to drop them from the survey altogether. That said, the other respondents, Rob, Sam, and Jane, did give information about what they thought about the quality of work, so we don't want to drop this variable altogether either. What could we do? Well, we could take the average of the responses we do have, which would be the average of eight, nine, and 10, and fill in the missing values with it in order to generate an approximation that gives your boss a pretty good idea of the customers' actual responses. You'll see later on in the coding demonstration why it's important to try to use approximation rather than just dropping missing values altogether. In the coding demonstration that's coming up, I'm going to show you how to discover what's missing, how to fill in for missing values, how to count missing values, and also how to filter out missing values. All right, so here we are in Jupyter. I've already imported NumPy and pandas, and Series and DataFrame from pandas. Now I'm going to show you how to figure out what data is missing from your dataset.
So let's start by creating a variable called missing and we will set it equal to np.nan, for not a number. And then let's create a series object named series_obj. So we'll call our Series constructor, and we're going to pass in a list of values, one label for each of the rows in our series. In this case, we're going to range from row one to row eight. So I'll just start creating the labels and we're going to name them row 1, row 2. Now, where we would have a row 3, we're going to instead say missing. And then carry on with row 4, row 5, row 6, and then where we would have row 7, we're just going to fill that in with a missing value. So we'll say missing, then create a label that's called row 8, and then we'll print this whole thing out: series_obj. And we've got a series object with eight rows, and there's a missing value at index position two and at index position six. Now let's work with the isnull method. This method returns a boolean value for each element, true or false, that describes whether that element in a pandas object is a null value. So we will just say series_obj.isnull() and run this, and you can see that for the values that were missing in our series object we are now getting back a true value for yes, it is null. All of the values that had a label are returning a false value because they were not missing values, they were not null elements. Let's look now at how to fill in for missing values. To do that, what we're going to do is create a data frame of random numbers. So the first thing we want to do is set our seed so that you get the same numbers on your screen when you type this as I'm getting here. So we're going to say np.random.seed and I'll pass in the number 25. And then we will create a data frame object. So we say DF_obj, and then we will set that equal to the DataFrame constructor.
And we are going to create a series of random numbers. So we'll say np.random.rand. We're going to generate 36 numbers and we want that to be returned in a data frame with six rows and six columns. So we're going to say reshape and then pass in a six and a six. And then we are going to print this out just so we get an idea of what this looks like. Okay, cool. So we have a six by six data frame with random numbers in it. That's great. Now what we're going to do is use the .loc indexer to select rows and columns and then set certain values in this data frame equal to missing. So let's start doing that by just saying DF_obj, and then we will call .loc and open a pair of square brackets. What we want to do is select rows at index position three through five and the column at index position zero. So to do that, we're going to say three colon five, comma, and then column zero. And we're going to set the values at those positions equal to missing. Let's do this again for a different selection of values from the data frame. So we will say DF_obj and then we're going to call .loc, and this time we're going to select rows from index position one through four and the column at index position five, and we're going to also set this equal to missing. And then print it out. Okay cool, so now you can see that we have some missing values in our data frame object, and now I want to show you how to use the fillna method. What this method does is find missing values within a pandas object and fill them in with a value that you specify. So in this case, we're going to use the fillna method to fill our NaN values with a value of zero. To do that, we will say DF_obj and then we will call the fillna method and pass in zero, telling it to fill in all the missing values with zero. And we'll call this whole thing filled_DF and print it out.
I'm going to move this up a bit so we can see the results better. And okay, so you can see that our missing values have now been filled in with zeros instead of NaN. That's pretty cool. You can also pass a dictionary into the fillna method. This will then fill in the missing values in each column series, as designated by the dictionary key, with its own value, as specified by the corresponding dictionary value. So let's try that out. We will create a data frame called filled_DF and then we're going to set that equal to DF_obj, and we're going to call fillna off of that. We're going to create a dictionary, and for the missing values in the column at index position zero, we're going to set those equal to 0.1. And then for the missing values in the column at index position five, we're going to set those equal to 1.25. So let's just print this out. We'll write filled_DF and, as you can see, in the column at index position zero the missing values have been filled with 0.1, and in the column at index position five they've been filled in with 1.25. You may be wondering how this could be useful. Well, imagine that you have a predictive application and it requires you to input data from four variables. Three of your variables are great. You've got all the information you need. But one of them has lots of missing values. You still need to input data from that variable, so what you could do is set those missing values equal to an approximation in order to make your predictive application work. You can also pass in the method equals ffill argument, and the fillna method will fill forward any missing values with the last non-null element in the column series. So, let's try that here. Let's create a data frame called fill_DF and then we'll set that equal to DF_obj, and then we're going to say fillna and pass a parameter that says method equal to ffill. And then we will run this.
And what you can see has happened here is that the null values have been filled in with the last non-null element in the column series. So let's go back up to the original data frame and just look at the column at index position five. You can see we have a bunch of null values, and so with the fill forward method, what we're really doing is taking the non-null element in this first row and filling in the values moving forward down the column. So let's just take a look at what we've gotten here and, as you can see, yeah, these missing values have been filled in with the last non-null element preceding them, which was 0.117376. All right, now let's look at counting missing values. Before we get into any of the coding, I just wanted to explain to you how this could be useful. Sometimes you just want to create a summary statistic of your dataset in order to understand what you've got in there. You can count missing values in order to figure out which variables are most problematic, in other words, which variables in your dataset have the largest number of missing values. What I want to do in this demonstration is reuse the data frame we created earlier, the one with the missing values still present. So go back up here and copy this, and then go ahead and regenerate the missing values in the data frame. So now we've got a new data frame object with missing values, and I'm going to run this. Okay cool. Now, what I want to do is generate a count of how many missing values there are in the data frame per column. So we're going to use the isnull method off of that and then call the sum method in order to create a count. What I'm going to do is go ahead and insert a cell below and move this up a bit.
And okay, so I am going to say DF_obj and then I'm going to call the isnull method. And then I want to call the .sum method off of that. I'll run it and, as you can see, now we have a count of the null values in each column of the data frame. In the column at index position zero we have three null values, at index position five we have four null values, and in between we have no null values in any of the other columns. So now we have generated a count of null values in our data frame. Next I'm going to show you how to filter out missing values. The first thing we're going to do is look at the dropna method, and we're going to use it to drop all rows from a data frame that contain any missing values. So to do that, we would just say DF_obj and then we'll call the dropna method. And then let's just call this whole thing DF_no_NaN, so a data frame with no missing values. And we'll print it out. And as you can see, all of the rows that had any null value were dropped, and so all we have left now is just one row, the one at index position zero, because there were no null values in that row. Now if you wanted to drop the columns that contain any missing values instead of dropping the rows, that's really easy to do. All you would do is call the dropna method and pass in the argument axis equal to one to search the data frame by column instead of by row. So I'll just copy and paste this, and we will go in and say axis equal to one here. And then run it. And now, as you can see, we have got our data frame returned and the columns at index position zero and index position five have both been dropped because they contained null values, but all of the other columns have been retained.

In [0]:
import numpy as np
import pandas as pd 

from pandas import Series, DataFrame

### Figuring out what data is missing

In [0]:
missing = np.nan

series_obj = Series(['row 1', 'row 2', missing, 'row 4', 'row 5', 'row 6', missing, 'row 8'])
series_obj

0    row 1
1    row 2
2      NaN
3    row 4
4    row 5
5    row 6
6      NaN
7    row 8
dtype: object

In [0]:
series_obj.isnull()

0    False
1    False
2     True
3    False
4    False
5    False
6     True
7    False
dtype: bool
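
The voiceover warns that placeholder codes such as 99 or 999 should be handled like missing values. A minimal sketch of one way to do that, assuming a hypothetical series of survey responses in which 99 and 999 are known to mean "no answer" (the data and variable names are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical responses on a 1-10 scale; 99 and 999 are assumed "no answer" codes.
responses = pd.Series([7, 99, 8, 999, 6])

# Convert the placeholder codes to NaN so pandas recognizes them as missing.
cleaned = responses.replace([99, 999], np.nan)

# isnull() now flags the former placeholders.
print(cleaned.isnull())
```

Whether 0 should also be converted depends on the variable: a 0 can be a legitimate value, so it is safer to list only the codes you know are placeholders.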

### Filling in for missing values

In [0]:
np.random.seed(25)
DF_obj = DataFrame(np.random.rand(36).reshape(6,6))
DF_obj

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.870124 | 0.582277 | 0.278839 | 0.185911 | 0.411100 | 0.117376 |
| 1 | 0.684969 | 0.437611 | 0.556229 | 0.367080 | 0.402366 | 0.113041 |
| 2 | 0.447031 | 0.585445 | 0.161985 | 0.520719 | 0.326051 | 0.699186 |
| 3 | 0.366395 | 0.836375 | 0.481343 | 0.516502 | 0.383048 | 0.997541 |
| 4 | 0.514244 | 0.559053 | 0.034450 | 0.719930 | 0.421004 | 0.436935 |
| 5 | 0.281701 | 0.900274 | 0.669612 | 0.456069 | 0.289804 | 0.525819 |


In [0]:
DF_obj.loc[3:5, 0] = missing
DF_obj.loc[1:4, 5] = missing
DF_obj

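|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.870124 | 0.582277 | 0.278839 | 0.185911 | 0.411100 | 0.117376 |
| 1 | 0.684969 | 0.437611 | 0.556229 | 0.367080 | 0.402366 | NaN |
| 2 | 0.447031 | 0.585445 | 0.161985 | 0.520719 | 0.326051 | NaN |
| 3 | NaN | 0.836375 | 0.481343 | 0.516502 | 0.383048 | NaN |
| 4 | NaN | 0.559053 | 0.034450 | 0.719930 | 0.421004 | NaN |
| 5 | NaN | 0.900274 | 0.669612 | 0.456069 | 0.289804 | 0.525819 |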
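
One detail worth checking when you set values with a slice: `.loc` selects by label and includes the end label, whereas `.iloc` selects by position and excludes the end. The sketch below uses two throwaway data frames of zeros (not the notebook's `DF_obj`) to show the difference; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Label-based slice: rows labeled 3, 4 and 5 are all set to NaN.
by_label_df = pd.DataFrame(np.zeros((6, 6)))
by_label_df.loc[3:5, 0] = np.nan
by_label = by_label_df[0].isnull().sum()

# Position-based slice: only positions 3 and 4 are set; the end is excluded.
by_position_df = pd.DataFrame(np.zeros((6, 6)))
by_position_df.iloc[3:5, 0] = np.nan
by_position = by_position_df[0].isnull().sum()

print(by_label, by_position)
```

With the default integer index the labels and positions coincide, which is why `DF_obj.loc[3:5, 0]` above blanks out three rows rather than two.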


In [0]:
filled_DF = DF_obj.fillna(0)
filled_DF

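|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.870124 | 0.582277 | 0.278839 | 0.185911 | 0.411100 | 0.117376 |
| 1 | 0.684969 | 0.437611 | 0.556229 | 0.367080 | 0.402366 | 0.000000 |
| 2 | 0.447031 | 0.585445 | 0.161985 | 0.520719 | 0.326051 | 0.000000 |
| 3 | 0.000000 | 0.836375 | 0.481343 | 0.516502 | 0.383048 | 0.000000 |
| 4 | 0.000000 | 0.559053 | 0.034450 | 0.719930 | 0.421004 | 0.000000 |
| 5 | 0.000000 | 0.900274 | 0.669612 | 0.456069 | 0.289804 | 0.525819 |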


In [0]:
filled_DF = DF_obj.fillna({0: 0.1, 5:1.25})
filled_DF

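|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.870124 | 0.582277 | 0.278839 | 0.185911 | 0.411100 | 0.117376 |
| 1 | 0.684969 | 0.437611 | 0.556229 | 0.367080 | 0.402366 | 1.250000 |
| 2 | 0.447031 | 0.585445 | 0.161985 | 0.520719 | 0.326051 | 1.250000 |
| 3 | 0.100000 | 0.836375 | 0.481343 | 0.516502 | 0.383048 | 1.250000 |
| 4 | 0.100000 | 0.559053 | 0.034450 | 0.719930 | 0.421004 | 1.250000 |
| 5 | 0.100000 | 0.900274 | 0.669612 | 0.456069 | 0.289804 | 0.525819 |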
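
The dictionary form of `fillna` also covers the survey scenario from the voiceover, where missing quality-of-work scores are approximated by the average of the answers that were given. The sketch below is an assumption-laden illustration: the column name `quality_of_work` and the pairing of the scores 8, 9 and 10 with Rob, Sam and Jane are invented, since the original survey table is not reproduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical layout: Rob, Sam and Jane answered; Sally and Jim did not.
survey = pd.DataFrame(
    {'quality_of_work': [8, np.nan, 9, 10, np.nan]},
    index=['Rob', 'Sally', 'Sam', 'Jane', 'Jim'],
)

# mean() skips NaN by default, so this averages only 8, 9 and 10.
column_mean = survey['quality_of_work'].mean()

# Fill just that column; fillna returns a new frame and leaves survey unchanged.
survey_filled = survey.fillna({'quality_of_work': column_mean})
print(survey_filled)
```

Keying the fill value by column keeps the approximation local to the variable it was computed from, which matters once the frame has more than one column with gaps.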


In [0]:
# Forward fill: equivalent to fillna(method='ffill'), which pandas 2.1+ deprecates
fill_DF = DF_obj.ffill()
fill_DF

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.870124 | 0.582277 | 0.278839 | 0.185911 | 0.411100 | 0.117376 |
| 1 | 0.684969 | 0.437611 | 0.556229 | 0.367080 | 0.402366 | 0.117376 |
| 2 | 0.447031 | 0.585445 | 0.161985 | 0.520719 | 0.326051 | 0.117376 |
| 3 | 0.447031 | 0.836375 | 0.481343 | 0.516502 | 0.383048 | 0.117376 |
| 4 | 0.447031 | 0.559053 | 0.034450 | 0.719930 | 0.421004 | 0.117376 |
| 5 | 0.447031 | 0.900274 | 0.669612 | 0.456069 | 0.289804 | 0.525819 |
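
Forward filling can stretch a single observation across a long run of gaps, as column 5 shows. A small sketch of two related options, using a made-up series rather than `DF_obj`: the `limit` argument caps how many consecutive NaN values are filled, and `bfill` fills backward from the next valid value instead.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a run of three missing values.
s = pd.Series([0.1, np.nan, np.nan, np.nan, 0.5])

forward = s.ffill()          # carry the last valid value down the column
capped = s.ffill(limit=1)    # fill at most one consecutive NaN, leave the rest
backward = s.bfill()         # pull the next valid value up instead

print(forward.tolist())
print(capped.tolist())
print(backward.tolist())
```

Which direction makes sense depends on the data; for time-ordered records, forward filling avoids borrowing values from the future.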


### Counting missing values

In [0]:
np.random.seed(25)
DF_obj = DataFrame(np.random.rand(36).reshape(6,6))
DF_obj.loc[3:5, 0] = missing
DF_obj.loc[1:4, 5] = missing
DF_obj

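|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.870124 | 0.582277 | 0.278839 | 0.185911 | 0.411100 | 0.117376 |
| 1 | 0.684969 | 0.437611 | 0.556229 | 0.367080 | 0.402366 | NaN |
| 2 | 0.447031 | 0.585445 | 0.161985 | 0.520719 | 0.326051 | NaN |
| 3 | NaN | 0.836375 | 0.481343 | 0.516502 | 0.383048 | NaN |
| 4 | NaN | 0.559053 | 0.034450 | 0.719930 | 0.421004 | NaN |
| 5 | NaN | 0.900274 | 0.669612 | 0.456069 | 0.289804 | 0.525819 |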


In [0]:
DF_obj.isnull().sum()

0    3
1    0
2    0
3    0
4    0
5    4
dtype: int64
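
The same `isnull` mask supports a few other summaries that help you judge which variables or records are most problematic. The sketch below rebuilds the demonstration frame under the name `df` (chosen here only to keep the example self-contained) and counts missing values per row, as a share of each column, and in total.

```python
import numpy as np
import pandas as pd

# Rebuild the demonstration frame with the same seed and missing cells.
np.random.seed(25)
df = pd.DataFrame(np.random.rand(36).reshape(6, 6))
df.loc[3:5, 0] = np.nan
df.loc[1:4, 5] = np.nan

missing_per_row = df.isnull().sum(axis=1)   # count across the columns of each row
missing_share = df.isnull().mean()          # fraction of each column that is NaN
total_missing = df.isnull().sum().sum()     # every NaN in the frame

print(missing_per_row)
print(missing_share)
print(total_missing)
```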

### Filtering out missing values

In [0]:
DF_no_NaN = DF_obj.dropna()
DF_no_NaN

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.870124 | 0.582277 | 0.278839 | 0.185911 | 0.411100 | 0.117376 |


In [0]:
DF_no_NaN = DF_obj.dropna(axis=1)
DF_no_NaN

|   | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 0 | 0.582277 | 0.278839 | 0.185911 | 0.411100 |
| 1 | 0.437611 | 0.556229 | 0.367080 | 0.402366 |
| 2 | 0.585445 | 0.161985 | 0.520719 | 0.326051 |
| 3 | 0.836375 | 0.481343 | 0.516502 | 0.383048 |
| 4 | 0.559053 | 0.034450 | 0.719930 | 0.421004 |
| 5 | 0.900274 | 0.669612 | 0.456069 | 0.289804 |
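
Dropping every row with any NaN left only one row out of six, which is the kind of loss the voiceover cautions against when respondents have answered most questions. As a hedged sketch, `dropna` also accepts `thresh`, `how` and `subset` for gentler filtering; the frame is rebuilt here as `df`, and the threshold of 5 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the demonstration frame with the same seed and missing cells.
np.random.seed(25)
df = pd.DataFrame(np.random.rand(36).reshape(6, 6))
df.loc[3:5, 0] = np.nan
df.loc[1:4, 5] = np.nan

# Keep rows that have at least 5 non-null values out of 6.
mostly_complete = df.dropna(thresh=5)

# Drop a row only if every value in it is missing (no such row exists here).
not_empty = df.dropna(how='all')

# Consider only column 0 when deciding which rows to drop.
col0_present = df.dropna(subset=[0])

print(mostly_complete.index.tolist())
print(len(not_empty))
print(col0_present.index.tolist())
```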
