# Handling Missing Data

In this section, we will study ways to identify and treat missing data. We will:
- Identify missing data in dataframes
- Treat (delete or impute) missing values

Data can be missing for various reasons, such as human error during data entry or information that the user cannot or does not provide (e.g. the date of birth of certain people). Most often, the reasons are simply unknown.

In pandas, missing data is usually represented by ```NaN``` (Not a Number) or by Python's ```None```. We will not get into the differences between them or how they are stored internally; instead, we will focus on ways to identify and treat missing values in pandas dataframes.
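As a small illustration of why dedicated methods are needed to detect missing values, the sketch below uses a made-up Series (not the Melbourne data). It shows that ```NaN``` cannot be found with an ordinary equality test, while ```isnull()``` flags both ```NaN``` and ```None```.

```python
import numpy as np
import pandas as pd

# NaN never compares equal to anything, including itself
print(np.nan == np.nan)          # False

# A small hypothetical Series mixing both kinds of missing marker
s = pd.Series([1.0, np.nan, None, 4.0])

# In a float Series, pandas stores None as NaN and flags both as missing
missing_mask = s.isnull()
print(missing_mask.tolist())     # [False, True, True, False]
```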

There are four main methods to identify and treat missing data:
- ```isnull()```: Indicates the presence of missing values; returns booleans in the same shape as the input
- ```notnull()```: Opposite of ```isnull()```; returns booleans in the same shape as the input
- ```dropna()```: Drops the rows (or columns) containing missing values and returns the rest
- ```fillna()```: Fills (or imputes) the missing values with a specified value
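The four methods can be tried out on a tiny, hypothetical dataframe before applying them to real data. The column names and values below are invented for illustration, and filling with 0 is only a placeholder choice, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not taken from the Melbourne dataset
toy = pd.DataFrame({
    "rooms": [2, 3, 4],
    "price": [500000.0, np.nan, 750000.0],
})

null_mask = toy.isnull()        # True where a value is missing
present_mask = toy.notnull()    # exact complement of null_mask
dropped = toy.dropna()          # keeps only rows with no missing values
filled = toy.fillna(0)          # replaces every NaN with 0 (illustrative value only)
```

All four calls return new objects, so ```toy``` itself is left unchanged.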


For this exercise, we will use the **Melbourne house pricing dataset**. 


In [4]:
import numpy as np
import pandas as pd

df = pd.read_csv("melbourne.csv")
df.head(3)

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,68 Studley St,2,h,,SS,Jellis,03-09-2016,2.5,3067.0,...,1.0,1.0,126.0,,,Yarra,-37.8014,144.9958,Northern Metropolitan,4019.0
1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,03-12-2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,04-02-2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0


The first few rows contain missing values, represented as NaN.

Let's quickly look at the structure of the data frame, types of columns, etc.

In [2]:
# approx 23k rows, 21 columns
print(df.shape)
df.info()  # info() prints its summary itself and returns None, so no print() is needed

(23547, 21)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23547 entries, 0 to 23546
Data columns (total 21 columns):
Suburb           23547 non-null object
Address          23547 non-null object
Rooms            23547 non-null int64
Type             23547 non-null object
Price            18396 non-null float64
Method           23547 non-null object
SellerG          23547 non-null object
Date             23547 non-null object
Distance         23546 non-null float64
Postcode         23546 non-null float64
Bedroom2         19066 non-null float64
Bathroom         19063 non-null float64
Car              18921 non-null float64
Landsize         17410 non-null float64
BuildingArea     10018 non-null float64
YearBuilt        11540 non-null float64
CouncilArea      15656 non-null object
Lattitude        19243 non-null float64
Longtitude       19243 non-null float64
Regionname       23546 non-null object
Propertycount    23546 non-null float64
dtypes: float64(12), int64(1), object(8)
memory u

## Identifying Missing Values

The methods ```isnull()``` and ```notnull()``` are the most common ways of identifying missing values. 

While handling missing data, you first need to identify the rows and columns containing missing values, count the number of missing values, and then decide how you want to treat them.

It is important that **you treat missing values in each column separately**, rather than implementing a single solution (e.g. replacing NaNs by the mean of a column) for all columns.

```isnull()``` returns a dataframe of booleans (True/False) with the same shape as the original, which can then be used to find the rows or columns containing missing values.

In [6]:
# isnull()
df.isnull().head(10)

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,False,False,False,False,True,False,False,False,False,False,...,False,False,False,True,True,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,True,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,True,False,False,False,False,False,...,False,False,False,True,True,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,True,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,True,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
9,False,False,False,False,True,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False


### Identifying Missing Values in Columns
Let's first count the missing values in each column of the data frame. You can do this with ```df.isnull().sum()```.

In [4]:
# summing up the missing values (column-wise)
# sum() works column-wise by default; pass axis=1 to sum across each row instead
df.isnull().sum()

Suburb               0
Address              0
Rooms                0
Type                 0
Price             5151
Method               0
SellerG              0
Date                 0
Distance             1
Postcode             1
Bedroom2          4481
Bathroom          4484
Car               4626
Landsize          6137
BuildingArea     13529
YearBuilt        12007
CouncilArea       7891
Lattitude         4304
Longtitude        4304
Regionname           1
Propertycount        1
dtype: int64

Note that some columns have an extremely **large number of missing values**, such as Price, Bedroom2, Bathroom, BuildingArea, YearBuilt etc. In such cases, you should handle missing values carefully: if you replace them with a single summary value such as the mean or median, the rest of the analysis may produce unrealistic or unexpected results.

The functions ```any()``` and ```all()``` are quite useful to identify rows and columns having missing values:
- ```any()``` returns ```True``` when at least one value satisfies a condition (equivalent to logical ```or```)
- ```all()``` returns ```True``` when all the values satisfy a condition (equivalent to logical ```and```)
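To make the role of the ```axis``` argument concrete, here is a minimal sketch on a hand-made boolean dataframe. The frame ```flags``` is hypothetical and stands in for the result of ```isnull()```.

```python
import pandas as pd

# Hypothetical boolean frame, standing in for the output of isnull()
flags = pd.DataFrame({
    "a": [True, False, False],
    "b": [True, True, False],
})

any_per_column = flags.any()        # axis=0 by default: one result per column
all_per_column = flags.all()
any_per_row = flags.any(axis=1)     # axis=1: one result per row
all_per_row = flags.all(axis=1)

print(any_per_row.tolist())         # [True, True, False]
print(all_per_row.tolist())         # [True, False, False]
```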

In [5]:
# columns having at least one missing value
df.isnull().any()

# above is equivalent to axis=0 (by default, any() operates on columns)
df.isnull().any(axis=0)

Suburb           False
Address          False
Rooms            False
Type             False
Price             True
Method           False
SellerG          False
Date             False
Distance          True
Postcode          True
Bedroom2          True
Bathroom          True
Car               True
Landsize          True
BuildingArea      True
YearBuilt         True
CouncilArea       True
Lattitude         True
Longtitude        True
Regionname        True
Propertycount     True
dtype: bool

We have identified columns having missing values and have computed the number of missing values in each. Let's do the same for rows.

### Identifying Missing Values in Rows

The methods ```any()``` and ```all()``` can be used to identify rows having **at least one** and **all** missing values respectively. To specify that the operation should be done on rows, you need to use ```axis=1``` as an argument.

In [6]:
# rows having at least one missing value
df.isnull().any(axis=1)

0         True
1         True
2        False
3         True
4        False
5         True
6        False
7         True
8         True
9         True
10        True
11       False
12        True
13        True
14       False
15        True
16        True
17        True
18       False
19        True
20        True
21        True
22        True
23        True
24       False
25       False
26        True
27        True
28        True
29        True
         ...  
23517     True
23518     True
23519     True
23520     True
23521     True
23522     True
23523     True
23524     True
23525     True
23526     True
23527     True
23528     True
23529     True
23530     True
23531     True
23532     True
23533     True
23534     True
23535     True
23536     True
23537     True
23538     True
23539     True
23540     True
23541     True
23542     True
23543     True
23544     True
23545     True
23546     True
Length: 23547, dtype: bool

In [7]:
# rows having all missing values
df.isnull().all(axis=1)

0        False
1        False
2        False
3        False
4        False
5        False
6        False
7        False
8        False
9        False
10       False
11       False
12       False
13       False
14       False
15       False
16       False
17       False
18       False
19       False
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
23517    False
23518    False
23519    False
23520    False
23521    False
23522    False
23523    False
23524    False
23525    False
23526    False
23527    False
23528    False
23529    False
23530    False
23531    False
23532    False
23533    False
23534    False
23535    False
23536    False
23537    False
23538    False
23539    False
23540    False
23541    False
23542    False
23543    False
23544    False
23545    False
23546    False
Length: 23547, dtype: bool

In [8]:
# sum it up to check how many rows have all missing values
df.isnull().all(axis=1).sum()

0

Thus, there are no rows having all missing values (we'd remove them if there were any). 
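If such rows did exist, one way to remove them would be ```dropna(how="all")```. The sketch below uses an invented three-row frame to show the behaviour; it is not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which only the middle row is entirely missing
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [np.nan, np.nan, 6.0],
})

# how="all" drops a row only when every value in it is missing
cleaned = toy.dropna(how="all")
print(cleaned.index.tolist())   # [0, 2]
```

Row 0 survives because only one of its two values is missing.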

Often, you may also want to remove the rows having more than a certain threshold number of missing values. To do that, you need to count the number of missing values in each row using ```sum()```.

In [9]:
# sum of missing values in each row
df.isnull().sum(axis=1)

0        3
1        2
2        0
3        3
4        0
5        2
6        0
7        1
8        2
9        2
10       2
11       0
12       1
13       1
14       0
15       9
16       9
17       2
18       0
19       9
20       1
21       9
22       9
23       2
24       0
25       0
26       7
27       9
28       2
29       2
        ..
23517    2
23518    3
23519    3
23520    2
23521    1
23522    4
23523    3
23524    3
23525    1
23526    2
23527    1
23528    5
23529    4
23530    5
23531    9
23532    1
23533    5
23534    2
23535    3
23536    4
23537    2
23538    1
23539    2
23540    2
23541    1
23542    2
23543    8
23544    4
23545    1
23546    2
Length: 23547, dtype: int64

We have now identified:
- The number of missing values in columns
- The number of missing values in rows

Let's now move ahead and treat the missing values.

## Treating Missing Values

There are broadly two ways to treat missing values:
1. Delete: Remove the rows or columns containing missing values
2. Impute: 
    - Imputing by a simple statistic: Replace the missing values by another value, commonly the mean, median, mode etc. 
    - Predictive techniques: Use statistical models such as k-NN, SVM etc. to predict and impute missing values
   

In general, imputation makes assumptions about the missing values and replaces them with estimates such as the mean or median. It should be used only when you are reasonably confident about those assumptions.

Otherwise, deletion is often safer and recommended. You may lose some data, but will not make any unreasonable assumptions.

**Caution**: Always have a backup of the original data if you're deleting missing values.  
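One simple way to keep such a backup in memory is to copy the dataframe before deleting anything. This is a minimal sketch on invented data; the name ```backup``` is hypothetical, and for large datasets keeping the original file on disk may be the more practical option.

```python
import numpy as np
import pandas as pd

# Hypothetical working data
toy = pd.DataFrame({"price": [1.0, np.nan, 3.0]})

# copy() returns an independent DataFrame (a deep copy by default)
backup = toy.copy()

# Deleting rows from the working frame leaves the backup untouched
toy = toy.dropna()

print(len(toy), len(backup))    # 2 3
```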

<hr>
**Additional Stuff for Nerds**

How you treat missing values should ideally depend on an understanding of why they occur. The reasons are classified into categories such as *missing completely at random*, *missing at random*, and *missingness that depends on the missing value itself*.
 
 
We'll not discuss *why missing values occur*, though you can read this article if interested: http://www.stat.columbia.edu/~gelman/arm/missing.pdf
<hr>

### Treating Missing Values in Columns

Let's now treat missing values in columns. Let's look at the number of NaNs in each column again, this time as the *percentage of missing values in each column*. Notice that we calculate the number of rows as ```len(df.index)```.

In [10]:
# summing up the missing values (column-wise)
# and expressing them as a percentage of the total number of rows
round(100*(df.isnull().sum()/len(df.index)), 2)

Suburb            0.00
Address           0.00
Rooms             0.00
Type              0.00
Price            21.88
Method            0.00
SellerG           0.00
Date              0.00
Distance          0.00
Postcode          0.00
Bedroom2         19.03
Bathroom         19.04
Car              19.65
Landsize         26.06
BuildingArea     57.46
YearBuilt        50.99
CouncilArea      33.51
Lattitude        18.28
Longtitude       18.28
Regionname        0.00
Propertycount     0.00
dtype: float64
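The same percentages can also be obtained a little more compactly: the mean of a boolean column is the fraction of ```True``` values. The sketch below checks on a made-up frame that both routes agree.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column a is 50% missing, column b is 25% missing
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, 3.0, np.nan],
})

pct_via_sum = round(100 * (toy.isnull().sum() / len(toy.index)), 2)
pct_via_mean = (100 * toy.isnull().mean()).round(2)

print(pct_via_mean.tolist())    # [50.0, 25.0]
```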

Notice that there are columns having almost 22%, 19%, 26%, 57% etc. missing values. When dealing with columns, you have two simple choices - either **delete or retain the column.** If you retain the column, you'll have to treat (i.e. delete or impute) the rows having missing values.

If you delete the missing rows, you lose data. If you impute, you introduce bias.

Apart from the number of missing values, the decision to delete or retain a variable depends on various other factors, such as:
- the analysis task at hand
- the usefulness of the variable (based on your understanding of the problem)
- the total size of available data (if you have enough, you can afford to throw away some of it)
- etc.

For example, let's say that we want to build a (linear regression) model to predict house prices in Melbourne. Even though the variable ```Price``` has about 22% missing values, you cannot drop it, since that is what you want to predict.

Similarly, you would expect some other variables such as ```Bedroom2```, ```Bathroom```, ```Landsize``` etc. to be important predictors of ```Price```, and thus cannot remove those columns.

There are others, such as ```BuildingArea```, which seem important but have more than 50% missing values. With so many values missing, you cannot delete the affected rows without losing a lot of data, or impute them without introducing heavy bias.



Thus, for this exercise, let's remove the columns having more than 30% missing values, i.e. ```BuildingArea```, ```YearBuilt```, ```CouncilArea```.
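Rather than reading the column names off the table, the columns above a threshold can also be selected programmatically. The following is a sketch on invented data; the column names and the variable ```threshold``` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with columns that are 20%, 60% and 0% missing
toy = pd.DataFrame({
    "mostly_present": [1.0, 2.0, 3.0, np.nan, 5.0],
    "mostly_missing": [np.nan, np.nan, 3.0, np.nan, 5.0],
    "complete": [1, 2, 3, 4, 5],
})

threshold = 30  # percent of missing values above which a column is dropped
pct_missing = 100 * toy.isnull().mean()
cols_to_drop = pct_missing[pct_missing > threshold].index
reduced = toy.drop(columns=cols_to_drop)

print(list(reduced.columns))    # ['mostly_present', 'complete']
```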



In [11]:
# removing the three columns in a single call
df = df.drop(['BuildingArea', 'YearBuilt', 'CouncilArea'], axis=1)

round(100*(df.isnull().sum()/len(df.index)), 2)

Suburb            0.00
Address           0.00
Rooms             0.00
Type              0.00
Price            21.88
Method            0.00
SellerG           0.00
Date              0.00
Distance          0.00
Postcode          0.00
Bedroom2         19.03
Bathroom         19.04
Car              19.65
Landsize         26.06
Lattitude        18.28
Longtitude       18.28
Regionname        0.00
Propertycount     0.00
dtype: float64

The remaining columns have at most 26% missing values (```Landsize```). Next, we need to treat the rows.

### Treating Missing Values in Rows

Now, we need to either delete or impute the missing values. First, let's see if there are any rows having a significant number of missing values. If so, we can drop those rows, and then take a decision to delete or impute the rest.

After dropping three columns, we now have 18 columns to work with. Just to inspect rows with missing values, let's have a look at the rows having more than 5 missing values.

In [8]:
# inspect (not yet delete) rows which have more than 5 missing values
df[df.isnull().sum(axis=1) > 5].head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
15,Abbotsford,217 Langridge St,3,h,1000000.0,S,Jellis,08-10-2016,2.5,3067.0,...,,,,,,,,,Northern Metropolitan,4019.0
16,Abbotsford,18a Mollison St,2,t,745000.0,S,Jellis,08-10-2016,2.5,3067.0,...,,,,,,,,,Northern Metropolitan,4019.0
19,Abbotsford,403/609 Victoria St,2,u,542000.0,S,Dingle,08-10-2016,2.5,3067.0,...,,,,,,,,,Northern Metropolitan,4019.0
21,Abbotsford,25/84 Trenerry Cr,2,u,760000.0,SP,Biggin,10-12-2016,2.5,3067.0,...,,,,,,,,,Northern Metropolitan,4019.0
22,Abbotsford,106/119 Turner St,1,u,481000.0,SP,Purplebricks,10-12-2016,2.5,3067.0,...,,,,,,,,,Northern Metropolitan,4019.0


Notice an interesting pattern: many rows have multiple columns missing. Since each row represents a house, this indicates that there are houses (observations) for which most of the data has either not been collected or is unavailable. Such observations are unlikely to contribute much to the prediction of prices anyway.

Thus we can remove the rows with (say) more than 5 missing values.
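The ```thresh``` argument of ```dropna()``` offers an equivalent way to express this kind of rule. It specifies the minimum number of non-missing values a row must have to be kept, so "at most k missing" becomes "at least (number of columns - k) present". The sketch below uses an invented frame and a hypothetical cut-off ```max_missing``` of 2.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows have 0, 1 and 3 missing values respectively
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan],
    "c": [1.0, 2.0, np.nan],
    "d": [1.0, 2.0, 4.0],
})

max_missing = 2  # hypothetical cut-off

# Boolean-mask version: keep rows with at most max_missing NaNs
by_mask = toy[toy.isnull().sum(axis=1) <= max_missing]

# dropna version: keep rows with at least (n_columns - max_missing) non-NaN values
by_thresh = toy.dropna(thresh=toy.shape[1] - max_missing)

print(by_thresh.index.tolist())   # [0, 1]
```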

In [13]:
# count the number of rows having > 5 missing values
# use len(df.index)
len(df[df.isnull().sum(axis=1) > 5].index)

4278

In [14]:
# 4278 rows have more than 5 missing values
# calculate the percentage
100*(len(df[df.isnull().sum(axis=1) > 5].index) / len(df.index))

18.16791948018856

Thus, about 18% rows have more than 5 missing values. Let's remove these rows and count the number of missing values remaining.

In [15]:
# retaining the rows having <= 5 NaNs
df = df[df.isnull().sum(axis=1) <= 5]

# look at the summary again
round(100*(df.isnull().sum()/len(df.index)), 2)

Suburb            0.00
Address           0.00
Rooms             0.00
Type              0.00
Price            21.71
Method            0.00
SellerG           0.00
Date              0.00
Distance          0.00
Postcode          0.00
Bedroom2          1.05
Bathroom          1.07
Car               1.81
Landsize          9.65
Lattitude         0.13
Longtitude        0.13
Regionname        0.00
Propertycount     0.00
dtype: float64

Notice that now, we have removed most of the rows where multiple columns (```Bedroom2```, ```Bathroom```, ```Landsize```) were missing. 

We still have about 22% missing values in the column ```Price``` and almost 10% in ```Landsize```. Since ```Price``` is the variable we want to predict, imputing such a large share of its values would introduce heavy bias, so it's a bad idea to impute them.

Thus, let's remove the rows with a missing ```Price``` as well. Notice that you can use ```np.isnan(df['column'])``` to flag the corresponding rows, and ```~``` to negate the condition so that only the rows with a non-missing value are kept.
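For reference, pandas offers two other ways to express the same filter: ```notnull()``` and ```dropna(subset=...)```. The sketch below, on invented data, checks that all three give the same result. One practical difference is that ```np.isnan``` only works on numeric data, whereas ```isnull()``` and ```notnull()``` also handle text columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric and one text column
toy = pd.DataFrame({
    "suburb": ["A", None, "C"],
    "price": [1.0, np.nan, 3.0],
})

via_isnan = toy[~np.isnan(toy["price"])]
via_notnull = toy[toy["price"].notnull()]
via_dropna = toy.dropna(subset=["price"])

# np.isnan would raise a TypeError on the object-typed "suburb" column;
# isnull() handles it without trouble
text_missing = toy["suburb"].isnull()
print(text_missing.tolist())    # [False, True, False]
```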

In [16]:
# removing NaN Price rows
df = df[~np.isnan(df['Price'])]

round(100*(df.isnull().sum()/len(df.index)), 2)


Suburb           0.00
Address          0.00
Rooms            0.00
Type             0.00
Price            0.00
Method           0.00
SellerG          0.00
Date             0.00
Distance         0.00
Postcode         0.00
Bedroom2         1.05
Bathroom         1.07
Car              1.76
Landsize         9.83
Lattitude        0.15
Longtitude       0.15
Regionname       0.00
Propertycount    0.00
dtype: float64

Now, you have ```Landsize``` as the only variable having a significant number of missing values. Let's give this variable a chance and consider imputing the NaNs. 

The decision (whether and how to impute) will depend upon the distribution of the variable. For example, if all the observations lie in a short range (say between 800 sq. ft and 820 sq. ft), you can reasonably impute the missing values by something like the mean or median ```Landsize```.

Let's look at the distribution.

In [17]:
df['Landsize'].describe()
# we won't replace missing values with the mean because the maximum is far larger than the mean

count     13603.000000
mean        558.116371
std        3987.326586
min           0.000000
25%         176.500000
50%         440.000000
75%         651.000000
max      433014.000000
Name: Landsize, dtype: float64

Notice that the minimum is 0, the maximum is 433014, the mean is 558 and the median (50%) is 440. There's also significant variation between the 25th and 75th percentiles (176 to 651).
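To see how strongly a single extreme value can pull the mean away from typical values, consider this small sketch. The numbers are invented and only loosely mimic a long right tail; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical land sizes with one extreme value
sizes = pd.Series([150.0, 400.0, 450.0, 600.0, 400000.0])

print(sizes.mean())     # 80320.0
print(sizes.median())   # 450.0
```

Here the mean is far larger than four of the five observations, so it would be a poor stand-in for a missing value.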

Thus, imputing this with mean/median seems quite biased, and so we should remove the NaNs.

In [18]:
# removing NaNs in Landsize
df = df[~np.isnan(df['Landsize'])]

round(100*(df.isnull().sum()/len(df.index)), 2)

Suburb           0.00
Address          0.00
Rooms            0.00
Type             0.00
Price            0.00
Method           0.00
SellerG          0.00
Date             0.00
Distance         0.00
Postcode         0.00
Bedroom2         0.00
Bathroom         0.01
Car              0.46
Landsize         0.00
Lattitude        0.16
Longtitude       0.16
Regionname       0.00
Propertycount    0.00
dtype: float64

We have reduced the NaNs significantly now. Only the variables ```Bathroom```, ```Car```, ```Lattitude``` and ```Longtitude``` have a small number of missing values (most likely, the same rows will have ```Lattitude``` and ```Longtitude``` missing).

Let's first look at ```Lattitude``` and ```Longtitude```.

In [10]:
# rows having Lattitude and Longitude missing
df[np.isnan(df['Lattitude'])].head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
15,Abbotsford,217 Langridge St,3,h,1000000.0,S,Jellis,08-10-2016,2.5,3067.0,...,,,,,,,,,Northern Metropolitan,4019.0
16,Abbotsford,18a Mollison St,2,t,745000.0,S,Jellis,08-10-2016,2.5,3067.0,...,,,,,,,,,Northern Metropolitan,4019.0
19,Abbotsford,403/609 Victoria St,2,u,542000.0,S,Dingle,08-10-2016,2.5,3067.0,...,,,,,,,,,Northern Metropolitan,4019.0
21,Abbotsford,25/84 Trenerry Cr,2,u,760000.0,SP,Biggin,10-12-2016,2.5,3067.0,...,,,,,,,,,Northern Metropolitan,4019.0
22,Abbotsford,106/119 Turner St,1,u,481000.0,SP,Purplebricks,10-12-2016,2.5,3067.0,...,,,,,,,,,Northern Metropolitan,4019.0


As expected, the same rows have ```Lattitude``` and ```Longtitude``` missing. Let's look at the summary stats of these.

In [20]:
df.loc[:, ['Lattitude', 'Longtitude']].describe()

Unnamed: 0,Lattitude,Longtitude
count,13581.0,13581.0
mean,-37.809204,144.995221
std,0.079257,0.103913
min,-38.18255,144.43181
25%,-37.85682,144.9296
50%,-37.80236,145.0001
75%,-37.7564,145.05832
max,-37.40853,145.52635


Notice that the distribution of both ```Lattitude``` and ```Longtitude``` is quite narrow.

A quick check on whether the mean is a reasonable fill value is to compare it with the median (a small difference suggests little skew) and to look at the spread between the 25th and 75th percentiles (quite small in this case).

Thus, let's impute the missing values by the mean value of ```Lattitude``` and ```Longtitude``` respectively.
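Mean imputation of several columns can also be written with ```fillna()```, which accepts a Series of per-column fill values. This is a sketch on invented coordinates that reuses the dataset's column spellings; it is an alternative formulation, not the code used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical coordinates with one missing value in each column
toy = pd.DataFrame({
    "Lattitude": [-37.80, np.nan, -37.82],
    "Longtitude": [144.99, 145.01, np.nan],
})

cols = ["Lattitude", "Longtitude"]

# toy[cols].mean() is a Series indexed by column name;
# fillna matches each fill value to its column
toy[cols] = toy[cols].fillna(toy[cols].mean())
```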

In [21]:
# imputing Lattitude and Longitude by mean values
df.loc[np.isnan(df['Lattitude']), ['Lattitude']] = df['Lattitude'].mean()
df.loc[np.isnan(df['Longtitude']), ['Longtitude']] = df['Longtitude'].mean()

round(100*(df.isnull().sum()/len(df.index)), 2)

Suburb           0.00
Address          0.00
Rooms            0.00
Type             0.00
Price            0.00
Method           0.00
SellerG          0.00
Date             0.00
Distance         0.00
Postcode         0.00
Bedroom2         0.00
Bathroom         0.01
Car              0.46
Landsize         0.00
Lattitude        0.00
Longtitude       0.00
Regionname       0.00
Propertycount    0.00
dtype: float64

We now have ```Bathroom``` and ```Car``` with 0.01% and 0.46% NaNs respectively. 

Since these account for a very small number of rows, it does not really matter whether you delete or impute. However, let's have a look at the distributions.

In [22]:
df.loc[:, ['Bathroom', 'Car']].describe()

Unnamed: 0,Bathroom,Car
count,13602.0,13540.0
mean,1.534921,1.610414
std,0.691834,0.962244
min,0.0,0.0
25%,1.0,1.0
50%,1.0,2.0
75%,2.0,2.0
max,8.0,10.0


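These two variables hold counts, and thus take whole-number values such as 0, 1, 2 etc. (they are stored as floats here, which is how pandas typically handles numeric columns containing NaNs). Imputing the NaNs by the mean does not make sense (1.53 bathrooms is not a valid value!).

A natural choice is therefore the mode, i.e. the most commonly occurring value.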
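The mode can also be computed directly rather than read off a frequency table. The sketch below uses an invented Series; note that ```mode()``` returns a Series because there can be ties, so the first entry is taken here as a simple, hypothetical tie-breaking rule.

```python
import numpy as np
import pandas as pd

# Hypothetical car-space counts with one missing value
cars = pd.Series([2.0, 1.0, 2.0, np.nan, 0.0, 2.0])

# mode() ignores NaN by default and returns a Series; take its first entry
most_common = cars.mode()[0]
cars_filled = cars.fillna(most_common)

print(most_common)              # 2.0
```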



In [23]:
# converting to type 'category'
df['Car'] = df['Car'].astype('category')

# displaying frequencies of each category
df['Car'].value_counts()

2.0     5606
1.0     5515
0.0     1026
3.0      748
4.0      507
5.0       63
6.0       54
8.0        9
7.0        8
10.0       3
9.0        1
Name: Car, dtype: int64

The most common value of ```Car``` is 2 (the ```int64``` in the output is the dtype of the counts, not of the column), so let's impute the NaNs by that.

In [24]:
# imputing NaNs by 2.0
df.loc[pd.isnull(df['Car']), ['Car']] = 2
round(100*(df.isnull().sum()/len(df.index)), 2)

Suburb           0.00
Address          0.00
Rooms            0.00
Type             0.00
Price            0.00
Method           0.00
SellerG          0.00
Date             0.00
Distance         0.00
Postcode         0.00
Bedroom2         0.00
Bathroom         0.01
Car              0.00
Landsize         0.00
Lattitude        0.00
Longtitude       0.00
Regionname       0.00
Propertycount    0.00
dtype: float64

Similarly for ```Bathroom```:

In [25]:
# converting to type 'category'
df['Bathroom'] = df['Bathroom'].astype('category')

# displaying frequencies of each category
df['Bathroom'].value_counts()

1.0    7517
2.0    4987
3.0     921
4.0     106
0.0      34
5.0      28
6.0       5
8.0       2
7.0       2
Name: Bathroom, dtype: int64

In [26]:
# imputing NaNs by 1
df.loc[pd.isnull(df['Bathroom']), ['Bathroom']] = 1
round(100*(df.isnull().sum()/len(df.index)), 2)

Suburb           0.0
Address          0.0
Rooms            0.0
Type             0.0
Price            0.0
Method           0.0
SellerG          0.0
Date             0.0
Distance         0.0
Postcode         0.0
Bedroom2         0.0
Bathroom         0.0
Car              0.0
Landsize         0.0
Lattitude        0.0
Longtitude       0.0
Regionname       0.0
Propertycount    0.0
dtype: float64

We now have a dataframe with no missing values. Finally, let's look at how many rows we have lost in the process, in addition to the three dropped columns (originally we had 23547 rows):

In [27]:
df.shape

(13603, 18)

In [28]:
# fraction of rows retained (1 minus this is the fraction lost)
len(df.index)/23547

0.5776956724848176

Thus, we have retained about 58% of the rows, i.e. we have lost about 42% of the observations in cleaning the missing values.