## Missing Data

Some datasets arrive perfectly complete, but many contain missing values, and cleaning can add more (for example, when invalid entries are set to missing). Even when missingness is random, it causes difficulties for analysis: many Python implementations of basic statistical methods such as ANOVA, t-tests, and correlations will either raise an error or return `NaN` if any of the variables involved has a missing value.
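
As a small illustration of the problem, the sketch below computes a correlation with NumPy on two short series that each contain one missing value. The series names and values are made up for this example. Here the failure is quiet: the result is `NaN` rather than an error, and it only becomes a number once the incomplete pairs are excluded.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements; the second entry is missing in both series.
heights = pd.Series([64, np.nan, 71, 66, 68])
weights = pd.Series([140, np.nan, 130, 110, 160])

# np.corrcoef does not skip missing values, so the NaN propagates to the result.
r_raw = np.corrcoef(heights, weights)[0, 1]
print(r_raw)

# Keep only the positions where both values are present, then recompute.
complete = heights.notna() & weights.notna()
r_complete = np.corrcoef(heights[complete], weights[complete])[0, 1]
print(r_complete)
```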

One way to solve this problem is to drop any rows that contain missing values in your variables of interest. The pandas package has the .dropna() [data frame method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) for doing exactly this:

In [11]:
import pandas as pd

# Sample data to play with and clean.
data = {
    'age': [27, 50, 34, None, None, None],
    'gender': ['f', 'f', 'f', 'm', 'm', None],
    'height' : [64, None, 71, 66, 68, None],
    'weight' : [140, None, 130, 110, 160, None],
}
df = pd.DataFrame(data)

print('1. Full dataset.')
print(df)
print('\n')

print('2. Drop all rows that have any missing values in any column.')
print(df.dropna()) 
print('\n')

print('3. Drop only rows where all values are missing.')
print(df.dropna(how='all'))
print('\n')

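print('4. Drop only rows where more than two values are missing.')
# thresh=2 keeps rows with at least two non-missing values; with four columns,
# that drops only rows missing three or more values.
print(df.dropna(thresh=2))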
print('\n')

print("5. Drop all rows that have any missing values in the 'gender' or 'height' columns.")
print(df.dropna(subset=['gender','height']))
print('\n')

print('6. Your turn. Write code below to drop rows where both height and weight are missing and print the result.')
print(df.dropna(how='all', subset=['weight','height']))

1. Full dataset.
    age gender  height  weight
0  27.0      f    64.0   140.0
1  50.0      f     NaN     NaN
2  34.0      f    71.0   130.0
3   NaN      m    66.0   110.0
4   NaN      m    68.0   160.0
5   NaN   None     NaN     NaN


2. Drop all rows that have any missing values in any column.
    age gender  height  weight
0  27.0      f    64.0   140.0
2  34.0      f    71.0   130.0


3. Drop only rows where all values are missing.
    age gender  height  weight
0  27.0      f    64.0   140.0
1  50.0      f     NaN     NaN
2  34.0      f    71.0   130.0
3   NaN      m    66.0   110.0
4   NaN      m    68.0   160.0


4. Drop only rows where more than two values are missing.
    age gender  height  weight
0  27.0      f    64.0   140.0
1  50.0      f     NaN     NaN
2  34.0      f    71.0   130.0
3   NaN      m    66.0   110.0
4   NaN      m    68.0   160.0


5. Drop all rows that have any missing values in the 'gender' or 'height' columns.
    age gender  height  weight
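0  27.0      f    64.0   140.0
2  34.0      f    71.0   130.0
3   NaN      m    66.0   110.0
4   NaN      m    68.0   160.0


6. Your turn. Write code below to drop rows where both height and weight are missing and print the result.
    age gender  height  weight
0  27.0      f    64.0   140.0
2  34.0      f    71.0   130.0
3   NaN      m    66.0   110.0
4   NaN      m    68.0   160.0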

Missing data matter if we believe the missingness will cause:

1.  Loss of statistical power because so many rows have to be thrown out, making it harder to detect effects, or
2.  Bias because certain values are more likely to be missing than others.
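
One rough way to gauge both risks on the sample data is sketched below. It counts missing values per column, counts how many rows dropping incomplete cases would discard, and checks whether missing `age` values are concentrated in one `gender` group. The variable names are chosen for this illustration, and a six-row table is far too small to draw real conclusions from.

```python
import pandas as pd

# Same sample data as in the cells above.
data = {
    'age': [27, 50, 34, None, None, None],
    'gender': ['f', 'f', 'f', 'm', 'm', None],
    'height': [64, None, 71, 66, 68, None],
    'weight': [140, None, 130, 110, 160, None],
}
df = pd.DataFrame(data)

# Number and share of missing values in each column.
missing_counts = df.isna().sum()
missing_share = df.isna().mean()
print(missing_counts)
print(missing_share)

# Power: how many rows would be lost by dropping every incomplete row?
rows_lost = len(df) - len(df.dropna())
print(rows_lost)

# Bias: is age missing more often for one gender than the other?
# Rows with a missing gender are left out of the grouping by default.
age_missing_by_gender = df['age'].isna().groupby(df['gender']).mean()
print(age_missing_by_gender)
```

If the share of missing values differs sharply between groups, as it does between the two gender groups here, that is a hint that the missingness may not be random.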

In cases where we want to keep all the information from all rows, even incomplete ones, we can "guess" what the missing data would have been and fill in that cell with our guess. This approach is called **imputation.**

There are many methods for imputing data, from the simple to the very complex. The most straightforward involves replacing missing values with the mode, mean, or median of the variable.

In [20]:
# Sample data to play with.
data = {
    'age': [27, 50, 34, None, None, None],
    'gender': ['f', 'f', 'f', 'm', 'm', None],
    'height' : [64, None, 71, 66, 68, None],
    'weight' : [140, None, 130, 110, 160, None],
}
dfa = pd.DataFrame(data)
print(dfa)

    age gender  height  weight
0  27.0      f    64.0   140.0
1  50.0      f     NaN     NaN
2  34.0      f    71.0   130.0
3   NaN      m    66.0   110.0
4   NaN      m    68.0   160.0
5   NaN   None     NaN     NaN


In [21]:
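# For each numeric column, replace the missing values with the mean for that column.
# With inplace=True, fillna() modifies dfa directly and returns None, so we
# print dfa itself rather than the return value.
dfa.fillna(dfa.mean(numeric_only=True), inplace=True)
print(dfa)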

    age gender  height  weight
0  27.0      f   64.00   140.0
1  50.0      f   67.25   135.0
2  34.0      f   71.00   130.0
3  37.0      m   66.00   110.0
4  37.0      m   68.00   160.0
5  37.0   None   67.25   135.0


In [17]:
# For each column, replace the missing values with the most common value for that
# column. Useful for filling in missing categorical values.
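# As written, this command will fill in missing values for both numerical and
# categorical columns. When several values tie for most common (as in the
# numeric columns here), the value picked depends on how value_counts() orders
# ties, which may differ between pandas versions.
df = pd.DataFrame(data)
df = df.apply(lambda x: x.fillna(x.value_counts().index[0]))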
print(df)

    age gender  height  weight
0  27.0      f    64.0   140.0
1  50.0      f    68.0   160.0
2  34.0      f    71.0   130.0
3  34.0      m    66.0   110.0
4  34.0      m    68.0   160.0
5  34.0      f    68.0   160.0
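
Because the most common value is rarely a sensible fill for a numeric column, a type-aware variant can be useful. The sketch below is one possible approach, not the only one: it fills numeric columns with their mean and every other column with its mode. The names `numeric_cols`, `other_cols`, and `imputed` are introduced here for illustration.

```python
import pandas as pd

# Same sample data as in the cells above.
data = {
    'age': [27, 50, 34, None, None, None],
    'gender': ['f', 'f', 'f', 'm', 'm', None],
    'height': [64, None, 71, 66, 68, None],
    'weight': [140, None, 130, 110, 160, None],
}
df = pd.DataFrame(data)

# Split the columns by type.
numeric_cols = df.select_dtypes(include='number').columns
other_cols = df.columns.difference(numeric_cols)

imputed = df.copy()

# Numeric columns: fill missing values with the column mean.
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].mean())

# Remaining columns: fill missing values with the most common value.
# mode() ignores missing values and returns the modes in sorted order.
for col in other_cols:
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed)
```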


In [24]:
# Your turn. Try replacing each value with the median, mode, or other statistic
# of your choice.
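# Start from a fresh copy of the sample data so that earlier imputations do not
# affect the result, then fill each numeric column with its median.
df = pd.DataFrame(data)
df = df.fillna(df.median(numeric_only=True))
print(df)

    age gender  height  weight
0  27.0      f    64.0   140.0
1  50.0      f    67.0   135.0
2  34.0      f    71.0   130.0
3  34.0      m    66.0   110.0
4  34.0      m    68.0   160.0
5  34.0   None    67.0   135.0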
