The voting dataset from Chapter 1 contained a number of missing values that were handled for you behind the scenes. Now it's time for you to take care of them yourself!

The unprocessed dataset has been loaded into a DataFrame df. Explore it in the IPython Shell with the .head() method. You will see that certain data points are labeled with a '?'. These denote missing values. As you saw in the video, different datasets encode missing values in different ways. Sometimes the marker is a sentinel such as '9999', other times a 0: real-world data can be very messy! If you're lucky, the missing values will already be encoded as NaN. We use NaN because it is an efficient and simple way of representing missing data internally, and it lets us take advantage of pandas methods such as .dropna() and .fillna(), as well as scikit-learn's imputation transformer Imputer() (replaced by SimpleImputer in the sklearn.impute module from scikit-learn 0.20 onward).
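To make the point about differing encodings concrete, here is a minimal sketch on a made-up DataFrame. The column names and values are hypothetical and are not part of the voting data; the sketch only illustrates mapping per-column sentinels to NaN so that .dropna() and .fillna() become usable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each column uses a different missing-value sentinel.
toy = pd.DataFrame({
    "vote": ["y", "?", "n", "y"],      # '?' marks a missing vote
    "age": [34, 9999, 51, 28],         # 9999 marks a missing age
    "income": [52000, 0, 61000, 0],    # 0 marks a missing income
})

# A nested dict restricts each replacement to the named column, so a
# legitimate 0 or 9999 in some other column would be left alone.
cleaned = toy.replace({
    "vote": {"?": np.nan},
    "age": {9999: np.nan},
    "income": {0: np.nan},
})

# Once the sentinels are NaN, the usual pandas tools apply.
print(cleaned.isnull().sum())          # NaN count per column
dropped = cleaned.dropna()             # keep only complete rows
filled = cleaned.fillna({
    "vote": cleaned["vote"].mode()[0],      # most frequent category
    "age": cleaned["age"].mean(),           # column mean
    "income": cleaned["income"].mean(),
})
print(dropped)
print(filled)
```

Whether to drop or fill depends on how much data you can afford to lose; the fill strategies here (mode and mean) are just one common choice.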

In this exercise, your job is to convert the '?'s to NaNs, and then drop the rows that contain them from the DataFrame.

Instructions

* Explore the DataFrame df in the IPython Shell. Notice how the missing value is represented.
* Convert all '?' data points to np.nan.
* Count the total number of NaNs using the .isnull() and .sum() methods. This has been done for you.
* Drop the rows with missing values from df using .dropna().
* Hit 'Submit Answer' to see how many rows were lost by dropping the missing values.

In [1]:
# Import pandas and NumPy
import pandas as pd
import numpy as np
# Read 'house-votes-84.csv' into a DataFrame: df
df = pd.read_csv('house-votes-84.csv')
df.head()

Unnamed: 0,republican,n,y,n.1,y.1,y.2,y.3,n.2,n.3,n.4,y.4,?,y.5,y.6,y.7,n.5,y.8
0,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
1,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
2,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
3,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
4,democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
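The column labels in this output (republican, n, y, n.1, ...) look like data values, which suggests the file has no header row and read_csv() used the first record as the header. A possible alternative, sketched below, passes header=None together with explicit names and lets pandas convert '?' to NaN at load time via na_values. The column names are placeholders invented for this sketch, and the three records are copied from the output above so the example runs without the CSV file.

```python
import io

import pandas as pd

# The first three records as they appear in the output above, inlined so
# the sketch does not depend on 'house-votes-84.csv' being present.
raw = io.StringIO(
    "republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y\n"
    "republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?\n"
    "democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n\n"
)

# Hypothetical placeholder names: 'party' plus 16 numbered vote columns.
names = ["party"] + [f"vote_{i}" for i in range(1, 17)]

# header=None keeps the first record as data; na_values='?' turns the
# '?' markers into NaN while the file is being parsed.
df_all = pd.read_csv(raw, header=None, names=names, na_values="?")

print(df_all.shape)
print(df_all.isnull().sum().sum())
```

With the real file you would pass the filename instead of the StringIO object. The rest of this notebook keeps the original load, so its outputs reflect the first record being treated as the header.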


In [2]:
# Alternative, left commented out: df = df.replace('?', np.nan)

In [3]:
df.head()

Unnamed: 0,republican,n,y,n.1,y.1,y.2,y.3,n.2,n.3,n.4,y.4,?,y.5,y.6,y.7,n.5,y.8
0,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
1,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
2,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
3,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
4,democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y


In [4]:
%%timeit
# Convert '?' to NaN
df[df == '?'] = np.nan

9.1 ms ± 378 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
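Because the timed cell assigns into df in place, only the first loop is likely to find any '?' to replace; later loops mostly time the comparison. The sketch below, on a small hypothetical frame, compares the in-place mask assignment with two non-mutating alternatives, DataFrame.replace() and DataFrame.mask(), which return a new DataFrame instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the voting data.
votes = pd.DataFrame({
    "party": ["republican", "democrat", "democrat"],
    "v1": ["n", "?", "y"],
    "v2": ["?", "y", "?"],
})

# In place: Boolean-mask assignment, as in the cell above, done on a copy
# so the original toy frame stays intact for comparison.
in_place = votes.copy()
in_place[in_place == "?"] = np.nan

# Non-mutating: replace() returns a new DataFrame.
replaced = votes.replace("?", np.nan)

# Non-mutating: mask() sets entries to NaN wherever the condition is True.
masked = votes.mask(votes == "?")

# All three approaches mark the same cells as missing.
print(in_place.isnull().sum().sum())
print(replaced.isnull().sum().sum())
print(masked.isnull().sum().sum())

# The original frame still contains its '?' markers.
print((votes == "?").sum().sum())
```

The non-mutating forms are easier to time and to re-run, since repeating the cell does not change its input.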


In [5]:
df.head()

Unnamed: 0,republican,n,y,n.1,y.1,y.2,y.3,n.2,n.3,n.4,y.4,?,y.5,y.6,y.7,n.5,y.8
0,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
1,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
2,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
3,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y
4,democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y


In [6]:
# Print the Boolean mask of missing values (True where a value is NaN)
print(df.isnull())

     republican      n      y    n.1    y.1    y.2    y.3    n.2    n.3  \
0         False  False  False  False  False  False  False  False  False   
1         False   True  False  False   True  False  False  False  False   
2         False  False  False  False  False   True  False  False  False   
3         False  False  False  False  False  False  False  False  False   
4         False  False  False  False  False  False  False  False  False   
5         False  False  False  False  False  False  False  False  False   
6         False  False  False  False  False  False  False  False  False   
7         False  False  False  False  False  False  False  False  False   
8         False  False  False  False  False  False  False  False  False   
9         False  False  False  False  False  False  False  False  False   
10        False  False  False  False  False  False  False  False  False   
11        False  False  False  False  False  False  False  False  False   
12        False  False  F

In [7]:
# Print the number of NaNs
print(df.isnull().sum())

republican      0
n              12
y              48
n.1            11
y.1            11
y.2            15
y.3            11
n.2            14
n.3            15
n.4            22
y.4             7
?              20
y.5            31
y.6            25
y.7            17
n.5            28
y.8           104
dtype: int64
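The per-column counts above can be aggregated further. As a small sketch on a hypothetical frame (the data is made up), chaining a second .sum() gives the total number of NaNs, and .any(axis=1) identifies the rows that .dropna() would discard.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with NaNs already in place.
toy = pd.DataFrame({
    "party": ["republican", "democrat", "democrat", "democrat"],
    "v1": ["n", np.nan, "y", "y"],
    "v2": [np.nan, np.nan, "y", "n"],
})

per_column = toy.isnull().sum()                  # NaN count in each column
total = per_column.sum()                         # NaN count in the whole frame
rows_with_nan = toy.isnull().any(axis=1).sum()   # rows containing at least one NaN
share_missing = toy.isnull().mean()              # fraction missing per column

print(per_column)
print(total)
print(rows_with_nan)
print(share_missing)
```

Note that the total number of NaNs and the number of affected rows can differ a lot, because a single row may contain several missing values.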


In [8]:
# Print shape of original DataFrame
print("Shape of Original DataFrame: {}".format(df.shape))

# Drop every row that contains at least one missing value
df = df.dropna()

# Print shape of new DataFrame
print("Shape of DataFrame After Dropping All Rows with Missing Values: {}".format(df.shape))

Shape of Original DataFrame: (434, 17)
Shape of DataFrame After Dropping All Rows with Missing Values: (232, 17)
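Going from 434 to 232 rows means 202 rows, roughly 46.5% of the data, were discarded. When that is too costly, .dropna() can be made less strict. The following sketch uses a hypothetical toy frame to compare the default behaviour with the thresh and subset parameters.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; not the voting data.
toy = pd.DataFrame({
    "party": ["republican", "democrat", "democrat", "republican"],
    "v1": ["n", np.nan, "y", "y"],
    "v2": [np.nan, np.nan, "y", "n"],
    "v3": ["y", np.nan, "n", np.nan],
})

# Default: drop every row that has at least one NaN.
strict = toy.dropna()

# thresh=3: keep rows that have at least 3 non-missing values.
lenient = toy.dropna(thresh=3)

# subset: only look at the listed columns when deciding what to drop.
by_column = toy.dropna(subset=["v1"])

rows_lost = len(toy) - len(strict)
print(f"Rows lost with default dropna(): {rows_lost} of {len(toy)} "
      f"({rows_lost / len(toy):.1%})")
print(f"Rows kept with thresh=3: {len(lenient)}")
print(f"Rows kept with subset=['v1']: {len(by_column)}")
```

Rows kept by the lenient variants still contain NaNs, so they would need to be filled (for example with .fillna() or an imputer) before being passed to most scikit-learn estimators.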
