# How to Handle Missing Data
If you want to type along with me, use [this notebook](https://humboldt.cloudbank.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fbethanyj0%2Fdata271_sp24&branch=main&urlpath=tree%2Fdata271_sp24%2Fdemos%2Fdata271_demo35_live.ipynb) instead. 
If you don't want to type and want to follow along just by executing the cells, stay in this notebook. 

In [None]:
import numpy as np
import pandas as pd

In [None]:
# Import a dataset about Lake Mendocino
df = pd.read_csv('coy_wy2024_csvdata.csv')
df

In [None]:
df.isna().sum()

Based on this `.isna()` check, the dataset appears to have no null values. However, a close look at the data shows that some values are clearly missing. This dataset uses `-` in spaces that should be nulls. Some other columns use `0` where there should be nulls.
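
To see why `.isna()` misses these placeholders, here is a minimal sketch on a made-up three-row CSV (the column names and values are hypothetical, not taken from the lake dataset). It also shows one alternative to fixing the data after loading: passing `na_values` to `pd.read_csv` so the dash is treated as missing at read time.

```python
import io

import pandas as pd

# Hypothetical miniature of the data: "-" marks a missing reading.
csv_text = "date,storage,notes\n2024-01-01,100,0\n2024-01-02,-,0\n2024-01-03,120,0\n"

# Read as-is: pandas keeps "-" as an ordinary string, so isna() finds nothing.
raw = pd.read_csv(io.StringIO(csv_text))
raw_missing = int(raw.isna().sum().sum())

# Tell pandas at read time that "-" means missing.
clean = pd.read_csv(io.StringIO(csv_text), na_values=["-"])
clean_missing = int(clean["storage"].isna().sum())

print(raw_missing, clean_missing)
```

Note that the `storage` column is read as strings in `raw` but as floats in `clean`, so treating the dash as missing up front can also avoid a type conversion later.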

**Step 1:** Convert missing values to nulls.

In [None]:
# Replace dashes with nans
df[df=='-'] = np.nan
df

In [None]:
# Use .info to check on the zeros
df.info()

The columns that use `0` as a placeholder are type int.

In [None]:
# Replace 0 with nan in columns where the column name contains the word "notes"
notes_columns = [col for col in df.columns if 'notes' in col]
df[notes_columns] = df[notes_columns].replace(0, np.nan)
df

**Step 2:** Analyze type and amount of missing data.

In [None]:
df.isna().sum()

Some of these columns are almost entirely missing. These tend to be the ones with "notes" in their column name. Maybe those columns are occasionally filled in during abnormal situations, but usually don't have entries. Some of the other columns have one missing data point. We'll treat the columns with many missing values differently from the columns with only a few.
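
Raw counts can be hard to compare, so it can help to look at the fraction missing per column. The sketch below uses a hypothetical toy frame and an illustrative 50% cutoff (an assumption for the example, not a rule) to separate the two groups of columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one nearly complete column, one mostly-empty "notes" column.
toy = pd.DataFrame({
    "storage": [100.0, np.nan, 120.0, 130.0],
    "storage notes": [np.nan, np.nan, np.nan, "gate test"],
})

# isna().mean() gives the fraction of missing values in each column.
frac_missing = toy.isna().mean()

# Split the columns at an illustrative 50% cutoff.
mostly_missing = frac_missing[frac_missing > 0.5].index.tolist()
few_missing = frac_missing[(frac_missing > 0) & (frac_missing <= 0.5)].index.tolist()

print(mostly_missing, few_missing)
```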

**Step 3:** Decide on an approach. 
Some of these columns have a lot of missing data. Let's drop those. Others can be imputed.
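
One detail worth checking before dropping: the `thresh` argument of `dropna` counts the non-null values a column needs in order to be kept, not the number of nulls that triggers a drop. A minimal sketch on a hypothetical four-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 4, 3, and 1 non-null values per column.
toy = pd.DataFrame({
    "full": [1.0, 2.0, 3.0, 4.0],
    "one_gap": [1.0, np.nan, 3.0, 4.0],
    "mostly_empty": [np.nan, np.nan, np.nan, 4.0],
})

# thresh counts NON-null values: keep columns that have at least 3 of them.
kept = toy.dropna(axis=1, thresh=3)

print(list(kept.columns))
```

Here `mostly_empty` is dropped because it has only one non-null value, while `one_gap` survives with its single missing entry.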

In [None]:
# keep only columns (axis = 1) with at least 230 non-null values (thresh = 230); the rest are dropped
df_dropped = df.dropna(axis = 1, thresh = 230)
df_dropped

In [None]:
df_dropped.isna().sum()

Now we just have 4 columns with a missing value. We could drop the rows with those missing values, but that isn't great for time series data. We don't want a gap in our time series (this can be problematic for many time series methods), so let's try to fill in the gaps (impute) instead. Let's first check on the distributions. 
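
To make the "gap" concrete, here is a small sketch on a hypothetical five-day series. Dropping the row with the missing value leaves a two-day jump between consecutive dates, while filling it (here with the mean, purely for illustration) keeps the daily spacing intact.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with one missing reading on the third day.
dates = pd.date_range("2024-01-01", periods=5, freq="D")
toy = pd.DataFrame({"date": dates, "storage": [100.0, 101.0, np.nan, 103.0, 104.0]})

# Option 1: drop the row. The largest step between dates becomes 2 days.
dropped = toy.dropna(subset=["storage"])
max_step_dropped = dropped["date"].diff().max()

# Option 2: fill the value. Every date is still present, one day apart.
filled = toy.fillna({"storage": toy["storage"].mean()})
max_step_filled = filled["date"].diff().max()

print(max_step_dropped, max_step_filled)
```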

In [None]:
# convert types before plotting distributions
df_dropped = df_dropped.assign(cons_high = df['Top of Conservation High (ac-ft)'].astype(float),
              cons = df['Top of Conservation (ac-ft)'].astype(float),
              gross_pool = df['Gross Pool'].astype(float),
              gross_pool_elev = df['Gross Pool(elev)'].astype(float))

df_dropped.drop(columns = ['Top of Conservation High (ac-ft)', 'Top of Conservation (ac-ft)','Gross Pool','Gross Pool(elev)'],inplace=True)
df_dropped

In [None]:
df_dropped.info()

In [None]:
# check distribution to decide how to impute
df_dropped.cons_high.hist();

In [None]:
# check distribution to decide how to impute
df_dropped.cons.hist();

In [None]:
# check distribution to decide how to impute
df_dropped.gross_pool.hist();

In [None]:
# check distribution to decide how to impute
df_dropped.gross_pool_elev.hist();

In [None]:
# Mean impute
gross_mean = df_dropped.gross_pool.mean()
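# assign the result back; fillna(inplace=True) on a selected column triggers a warning in recent pandas
df_dropped['gross_pool'] = df_dropped['gross_pool'].fillna(gross_mean)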
df_dropped

In [None]:
# Mean impute
gross_mean_elev = df_dropped.gross_pool_elev.mean()
# assign the result back instead of filling in place on a selected column
df_dropped['gross_pool_elev'] = df_dropped['gross_pool_elev'].fillna(gross_mean_elev)
df_dropped

In [None]:
df_dropped[df_dropped.cons.isna()]

In [None]:
# Mode impute
# .mode() can return several values; take the first
cons_high_mode = df_dropped.cons_high.mode().values[0]
df_dropped['cons_high'] = df_dropped['cons_high'].fillna(cons_high_mode)
cons_mode = df_dropped.cons.mode().values[0]
df_dropped['cons'] = df_dropped['cons'].fillna(cons_mode)

In [None]:
df_dropped.info()

No more missing values!  
**Step 4:** Evaluate and compare. 
Let's try plotting through time to get a complete time series. 

In [None]:
df_dropped['date'] = pd.to_datetime(df_dropped['ISO 8601 Date Time'].str[:10])

In [None]:
import seaborn as sns

sns.lineplot(data = df_dropped, x = 'date',y = 'cons');

Wait! This graph looks strange. It looks like the point we filled in is way different than what we might expect... Maybe a mode impute was *NOT* the right choice. It could have been better to use the points close in time to approximate our missing value. We will learn how to do that and other techniques in the next lecture. 
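
As a rough preview of what "using the points close in time" could look like, the sketch below compares a mode impute with linear interpolation in time on a hypothetical series. The values are made up, and `Series.interpolate(method="time")` is just one possible technique, assumed here for illustration; it is not necessarily the method covered next.

```python
import numpy as np
import pandas as pd

# Hypothetical series: flat at 50, then rising steadily, with one gap mid-rise.
s = pd.Series(
    [50.0, 50.0, 50.0, 60.0, np.nan, 80.0, 90.0],
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)

# Mode impute: fill the gap with the most common value in the whole series.
mode_filled = s.fillna(s.mode().iloc[0])

# Time-based interpolation: estimate the gap from its neighbors in time.
interp_filled = s.interpolate(method="time")

gap_day = pd.Timestamp("2024-01-05")
print(mode_filled[gap_day], interp_filled[gap_day])
```

In this toy series the mode pulls the filled point back down to the flat stretch, while interpolation places it between its two neighbors, which matches the kind of strange dip seen in the plot above.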