# Value Counts and Missing Data

In this section, we look at the values in each column and identify missing data. From there, we learn how to handle the missing values.

# Setup


First, we need to set up our Python session with NumPy and pandas.

In [1]:
## import numpy and pandas
import pandas as pd
import numpy as np

# Bring in the dataset

The dataset we will load is available via the web or on QTools.

In [3]:
## read in the dataset
## https://raw.githubusercontent.com/Btibert3/is834/master/datasets/pandas-missing-data.csv
users = pd.read_csv("https://raw.githubusercontent.com/Btibert3/is834/master/datasets/pandas-missing-data.csv")

In [4]:
## explore the data - shape
users.shape

(500, 5)

In [6]:
## explore the data - info
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 5 columns):
ID       500 non-null int64
State    450 non-null object
Dice     450 non-null float64
Zip      450 non-null float64
GPA      450 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 19.6+ KB


In [8]:
## look at some rows - head/tail (a cell displays only its last expression, so only tail() is shown)
users.head()
users.tail()

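|     | ID  | State          | Dice | Zip     | GPA  |
|-----|-----|----------------|------|---------|------|
| 495 | 496 | Rhode Island   | 1.0  | 21147.0 | 3.0  |
| 496 | 497 | Illinois       | 5.0  | 61113.0 | 4.0  |
| 497 | 498 | South Carolina | 3.0  | NaN     | 4.0  |
| 498 | 499 | Michigan       | 4.0  | 27724.0 | 3.33 |
| 499 | 500 | Florida        | 5.0  | 61113.0 | 3.33 |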


# Describe the dataset

In [9]:
# describe the dataset
users.describe()

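|       | ID         | Dice     | Zip          | GPA      |
|-------|------------|----------|--------------|----------|
| count | 500.0      | 450.0    | 450.0        | 450.0    |
| mean  | 250.5      | 3.397778 | 46732.935556 | 3.389356 |
| std   | 144.481833 | 1.656329 | 20240.649098 | 0.386859 |
| min   | 1.0        | 1.0      | 21147.0      | 2.33     |
| 25%   | 125.75     | 2.0      | 28321.0      | 3.0      |
| 50%   | 250.5      | 3.0      | 41142.0      | 3.33     |
| 75%   | 375.25     | 5.0      | 61113.0      | 3.67     |
| max   | 500.0      | 6.0      | 90794.0      | 4.0      |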


In [10]:
# describe() summarizes only the numeric columns (State is text) -- now count the missing values in every column
users.isnull().sum()

ID        0
State    50
Dice     50
Zip      50
GPA      50
dtype: int64
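
The same idea can be taken a step further by expressing the counts as a share of rows, or by counting the rows that have at least one missing value. The sketch below runs on a small made-up frame; the `toy` DataFrame and its values are assumptions for illustration, not the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the course dataset
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "State": ["Ohio", None, "Texas", "Maine"],
    "GPA": [3.0, np.nan, np.nan, 4.0],
})

# isna() is an alias of isnull(); both return a True/False frame
missing_counts = toy.isna().sum()    # number of missing values per column
missing_share = toy.isna().mean()    # fraction of rows missing per column

# rows that have at least one missing value in any column
rows_with_any_missing = toy.isna().any(axis=1).sum()

print(missing_counts)
print(missing_share)
print(rows_with_any_missing)
```

Taking the mean of a boolean frame works because `True` counts as 1 and `False` as 0.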


## dropna() - remove any row with a missing value

In [18]:
## make a copy of our original dataset users2 with .copy()
users2 = users.copy()

In [19]:
## dropna() removes every row that has at least one missing value in any column
users2.dropna(inplace = True)

In [20]:
## how many rows were removed between users and users2?
len(users)-len(users2)

165
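
By default `dropna()` removes a row if any column is missing, which is why so many rows disappear above. A few parameters narrow that behavior. The following is a minimal sketch on a made-up frame (the `toy` DataFrame is an assumption for illustration, not the course dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the course dataset
toy = pd.DataFrame({
    "State": ["Ohio", None, "Texas", None],
    "Dice": [1.0, 2.0, np.nan, np.nan],
    "GPA": [3.0, 3.33, 4.0, np.nan],
})

drop_any = toy.dropna()                    # default how="any": drop rows with at least one NaN
drop_all = toy.dropna(how="all")           # drop only rows where every value is missing
drop_state = toy.dropna(subset=["State"])  # look at the State column only
keep_two = toy.dropna(thresh=2)            # keep rows with at least 2 non-missing values

# rows removed by each approach
print(len(toy) - len(drop_any))
print(len(toy) - len(drop_all))
print(len(toy) - len(drop_state))
print(len(toy) - len(keep_two))
```

None of these calls uses `inplace=True`, so `toy` itself is left unchanged and each result can be compared side by side.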

# Value Counts

In [21]:
## let's make another copy, users3 - use copy()
users3 = users.copy()

In [24]:
## print columns again
users3.columns

Index(['ID', 'State', 'Dice', 'Zip', 'GPA'], dtype='object')

In [25]:
## let's look at the values for the Dice column
## note that .column_name returns a Series, and dot access lets <tab> completion list the available methods
users3.Dice.value_counts()

3.0    84
2.0    79
4.0    78
1.0    75
5.0    72
6.0    62
Name: Dice, dtype: int64

In [0]:
## let's look at the options; remember that each column had 50 values randomly missing?
## value_counts() excludes NaN by default, so the counts above sum to 450, not 500


In [26]:
# let's do this for the GPA column, this time keeping NaN in the counts
users3.GPA.value_counts(dropna = False)

 3.33    147
 3.00    111
 3.67     96
 4.00     72
NaN       50
 2.67     19
 2.33      5
Name: GPA, dtype: int64
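
To make the effect of `dropna` easier to see, here is a small sketch on a made-up Series (the `gpa` values are assumptions for illustration). It also shows `normalize=True`, which returns proportions instead of counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the course dataset
gpa = pd.Series([3.33, 3.0, np.nan, 3.33, np.nan, 4.0], name="GPA")

counts_default = gpa.value_counts()              # NaN is excluded
counts_with_na = gpa.value_counts(dropna=False)  # NaN gets its own row
shares = gpa.value_counts(dropna=False, normalize=True)  # proportions instead of counts

print(counts_default)
print(counts_with_na)
print(shares)
```

The difference between the two totals is the number of missing values, which gives a quick cross-check against `isna().sum()`.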

# Handling of missing values

At the simplest level, we can handle missing values in one of two ways:

- Drop the rows where the value is missing
- Replace the missing values with another value


**We will continue to use the dataset `users3`**

## Drop Rows

In [34]:
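## Drop the rows where zip code is missing: build a boolean mask, then keep the rows where it is False
z = users3.Zip.isna()
users3 = users3[~z]

In [0]:
## Drop the rows where state is missing
users3 = users3.dropna(subset=["State"])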

## Replace values

Replacing values is a tad more complex. To keep things simple, let's start by replacing missing values with either the average of the column or its most frequent value (the mode).
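
As a rough illustration of mean replacement, the sketch below fills a made-up column in two ways: with a boolean mask and `.loc`, and with `fillna`. The `toy` frame is an assumption for illustration, not the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the course dataset
toy = pd.DataFrame({"GPA": [3.0, np.nan, 4.0, np.nan]})

gpa_average = toy["GPA"].mean()   # mean() skips NaN by default
missing_gpa = toy["GPA"].isna()   # True where GPA is missing

# Option 1: assign into the missing rows with .loc
via_loc = toy.copy()
via_loc.loc[missing_gpa, "GPA"] = gpa_average

# Option 2: fillna returns a new Series, so assign it back to the column
via_fillna = toy.copy()
via_fillna["GPA"] = via_fillna["GPA"].fillna(gpa_average)

print(via_loc.equals(via_fillna))
```

Both routes give the same result here; `.loc` is the more general tool because the mask can encode any condition, while `fillna` is shorter when the only goal is to fill missing values.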

### Mean value replacement

In [0]:
## isolate the average for the GPA column
gpa_average = users3.GPA.mean()

In [0]:
## find the rows where gpa is missing - .isna() (and .notna() for the opposite)
missing_gpa = users3.GPA.isna()

In [0]:
## look at missing_gpa - head
missing_gpa.head()

In [0]:
## value counts for missing gpa
missing_gpa.value_counts()

In [35]:
## for those rows, replace with .loc
users3.loc[missing_gpa, "GPA"] = gpa_average

In [0]:
## check with isna and sum
users3.GPA.isna().sum()

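In [0]:
## another way, but let's create a copy, users4
users4 = users.copy()

In [0]:
## use fillna on the series for the column (dot selected) -- by default it returns a new Series and leaves users4 unchanged
gpa_mean = users4.GPA.mean()
users4.GPA.fillna(gpa_mean)

In [0]:
## inplace option -- fillna(..., inplace=True) on a selected column does not reliably update users4 in recent pandas versions, so fill at the DataFrame level
users4.fillna({"GPA": gpa_mean}, inplace=True)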

### Most frequent value replacement (Mode)

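In [0]:
## make another copy, users5, with .copy()
users5 = users.copy()

In [0]:
## we will replace the missing Dice values with the most frequent value; let's revisit the counts -- dropna=False
users5.Dice.value_counts(dropna=False)

In [0]:
## get the mode - .mode()
max_dice = users5.Dice.mode()

In [0]:
## check the type
type(max_dice)

In [0]:
## isolate the value, not as a series - mode()[0]
max_dice = max_dice[0]

In [0]:
## fill the missing values with fillna -- inplace=True
users5.fillna({"Dice": max_dice}, inplace=True)

In [0]:
## let's look again -- dropna=False
users5.Dice.value_counts(dropna=False)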
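
The reason `.mode()` returns a Series rather than a single number is that several values can tie for most frequent. A minimal sketch on a made-up Series (the `dice` values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a tie: 1.0 and 6.0 both appear twice
dice = pd.Series([1.0, 1.0, 6.0, 6.0, 3.0, np.nan], name="Dice")

modes = dice.mode()       # Series of all tied values, sorted ascending; NaN is ignored
first_mode = modes[0]     # [0] picks the smallest of the tied values

filled = dice.fillna(first_mode)

print(modes)
print(filled)
```

Picking `[0]` is a simple convention, not the only reasonable choice; when there is a tie it silently favors the smallest value, which is worth keeping in mind before filling a large share of a column.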
