# Missing Data

## Learning Objectives
- Learn how to identify and visualise missing data
 - For numerical, categorical and time-series data
- The different types of missing values
- The appropriate techniques to deal with missing data
 - Dropping data
 - Using the mean (but please don't use the mean)
 - Imputation
- Imputing time series and categorical data
 
In the Data Cleaning lesson, we observed that entries of our data were missing. This **missing data** problem occurs with almost every dataset. Data can go missing for a wide range of reasons. Perhaps the most common reason is a 'faulty' data acquisition process (e.g. defective sensors for measuring temperature data, incomplete patient information etc.), but other reasons include accidental data deletion and human error.

Regardless of how data went missing (well.. not actually regardless - we'll see later on how knowing how data went missing may help our analysis), our job as data scientists is to treat our data based on safe and valid assumptions.

Generally speaking, dealing with and treating missing data follows this pipeline:
- Identify and convert missing values to null values
- Analyse how much data is missing, and the type of missing-ness it is
- Either delete the rows with missing values, or impute the missing values

In this lesson, I will show examples on the two following datasets: <br>
https://archive.ics.uci.edu/ml/datasets/Automobile <br>
https://www.kaggle.com/uciml/pima-indians-diabetes-database

## Identifying Missing Values
A dataset can represent missing values with a variety of different 'placeholders' (even within a single column!). Examples of common placeholders include `NA`, `-` and `UNKNOWN`. The data dictionary/documentation is the first thing you should look at, as it may describe how missing values are stored/formatted in the dataset. We should also perform checks by hand - one way to identify missing values is to return the unique values for a column and sort them.

In [1]:
import pandas as pd
import numpy as np

# From the data documentation:
names = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration", "num-of-doors", "body-style", "drive-wheels", "engine-location", "wheel-base", "length", "width", "height", "curb-weight", "engine-type", "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg", "price"]
auto_df = pd.read_csv("Data/imports-85.data", header=None, names=names)
auto_df

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.40,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.40,8.0,115,5500,18,22,17450
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95,volvo,gas,std,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114,5400,23,28,16845
201,-1,95,volvo,gas,turbo,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,8.7,160,5300,19,25,19045
202,-1,95,volvo,gas,std,four,sedan,rwd,front,109.1,...,173,mpfi,3.58,2.87,8.8,134,5500,18,23,21485
203,-1,95,volvo,diesel,turbo,four,sedan,rwd,front,109.1,...,145,idi,3.01,3.40,23.0,106,4800,26,27,22470


In [None]:
auto_df.info()

In [None]:
# Let's look at missing values for stroke
np.sort(auto_df["stroke"].unique())

Great - we see a "?" which is an indication of a missing value. Let's check a couple of other columns to ensure this value is consistent throughout the dataframe.

In [None]:
print("Unique values in price:", np.sort(auto_df["price"].unique()))
print("Unique values in normalized losses:", np.sort(auto_df["normalized-losses"].unique()))

Question marks in both! Ok, this indicates that this missing value is probably consistent throughout the dataframe (and that more than one type of missing value doesn't exist). From the result of the `.info()` method above, note that although we would expect `stroke` to be of type float, Pandas reports it as type object. This happens because Pandas does not recognise "?" as missing: its native missing values are `np.nan` and `pd.NA`, so the "?" entries are read as strings and the whole column is loaded as text. Now that we know what the missing value is, let's reload the data, this time passing it to the `na_values` argument of the `pd.read_csv()` function. `na_values` takes the string (or list of strings) we provide and replaces every occurrence with `NaN`.

In [3]:
auto_df = pd.read_csv("Data/imports-85.data", header=None, names=names, na_values="?")
auto_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized-losses  164 non-null    float64
 2   make               205 non-null    object 
 3   fuel-type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num-of-doors       203 non-null    object 
 6   body-style         205 non-null    object 
 7   drive-wheels       205 non-null    object 
 8   engine-location    205 non-null    object 
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64  
 14  engine-type        205 non-null    object 
 15  num-of-cylinders   205 non-null    object 
 16  engine-size        205 non

Above, we see that `stroke`, and some other columns have been converted to correct datatypes. 

NOTE: Historically, Pandas did not support `nan` types for integer numbers - which is why we see some of the above numbers that we would expect to be ints as floats. Recently they have introduced the `Int64` (capital I) type which does have support for a first party null type: `pd.NA`. You can read more here: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
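
To make the dtype behaviour described above concrete, here is a minimal sketch on a hypothetical three-row CSV held in memory (the data and the variable names `raw`, `as_text`, `as_float` and `as_int` are made up for illustration, not taken from the Automobile dataset). It shows the column being read as text when "?" is left in place, as `float64` once `na_values` is used, and as the nullable `Int64` type after an explicit conversion.

```python
import io

import pandas as pd

# Hypothetical CSV; "?" marks a missing horsepower value.
raw = "make,horsepower\naudi,102\nvolvo,?\nbmw,121\n"

# Without na_values the "?" forces the whole column to be read as text.
as_text = pd.read_csv(io.StringIO(raw))
print(pd.api.types.is_numeric_dtype(as_text["horsepower"]))  # False

# With na_values the "?" becomes NaN, so the column is numeric (float64).
as_float = pd.read_csv(io.StringIO(raw), na_values="?")
print(as_float["horsepower"].dtype)

# Opting in to the nullable integer dtype keeps the values as integers
# and stores the missing entry as pd.NA.
as_int = as_float["horsepower"].astype("Int64")
print(as_int)
```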

We'll now be working with a diabetes dataset known as Pima. This comes from a study of diabetes among the Pima people, a Native American group. Let's load in the dataset and see if we can identify null values.

In [4]:
pima_df = pd.read_csv("Data/datasets_228_482_diabetes.csv")
print(pima_df.info())
pima_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
None


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


This particular dataset is interesting because missing values aren't always as obvious as an explicit placeholder. If we call the `.describe()` method, we might be able to find some questionable summary statistics.

In [None]:
pima_df.describe()

Is there any *row* which particularly stands out to you as questionable?


The `min` row is particularly interesting because many of the values there are 0. Do you know any living person who has a `BloodPressure` of 0? Or `Glucose`, `SkinThickness`, `Insulin` or `BMI`, for that matter? It is obviously possible to have 0 `Pregnancies` though, so that value of 0 could be considered correct. For the questionable columns we identified, let's check how many 0 values are present.

In [None]:
questionable_columns = ["BloodPressure", "Glucose", "SkinThickness", "Insulin", "BMI"]
zero_counts = {col: 0 for col in questionable_columns}
for col in questionable_columns:
    zero_counts[col] = pima_df[col][pima_df[col] == 0].count()
    
print(zero_counts)

If we had performed visualisation on our data, identifying this would have been easier. Let's plot a histogram (which will be formally introduced later) and use `BloodPressure` as an example. Note that Plotly Express handles a LOT of things under the hood - and 'binning' the values is one of these things. We need to keep such things in mind when using high-level libraries, as otherwise we may make incorrect assumptions about our data.

In [None]:
import plotly.express as px
fig = px.histogram(pima_df, "BloodPressure")
fig.show()

Ok, so we can't pass "0" to the `na_values` argument of `.read_csv()` like we did previously, because some of our 0 values are legitimate. Instead, we'll have to 'manually' replace these values with `np.nan`.

In [None]:
## Replace 0 with np.nan for the questionable columns


## describe the dataframe
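
One possible way to approach the replacement is sketched below on a toy stand-in for `pima_df` (the values and the names `toy_df` and `zero_is_missing` are invented for illustration). It assumes that only the listed columns use 0 as a placeholder, so `Pregnancies` keeps its legitimate zeros.

```python
import numpy as np
import pandas as pd

# Toy stand-in for pima_df; the values are made up.
toy_df = pd.DataFrame({
    "Pregnancies": [0, 2, 1],
    "BloodPressure": [72, 0, 66],
    "BMI": [33.6, 26.6, 0.0],
})

# Columns where a 0 cannot be a real measurement.
zero_is_missing = ["BloodPressure", "BMI"]

# Only the listed columns are touched, so Pregnancies is left alone.
toy_df[zero_is_missing] = toy_df[zero_is_missing].replace(0, np.nan)

print(toy_df.isnull().sum())
print(toy_df.describe())
```

After the replacement, the `min` row of `.describe()` should no longer show 0 for the affected columns, because NaN values are skipped when the summary statistics are computed.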


### The amount of missingness
It can be valuable to know how much of our data is actually missing - in terms of absolute values or percentages. Doing so is a relatively straightforward process which I will demonstrate on our `auto_df`.

In [None]:
auto_df_null = auto_df.isnull() # .isna() is the same as .isnull()
auto_df_null

We've now obtained a dataframe with True/False values - True indicating where there is a missing/null value, and False otherwise. To see the absolute amount of missing values, we can `.sum()` the dataframe. To obtain the percentages, we can simply do a `.mean() * 100`

In [None]:
auto_df_null.sum()

In [None]:
## Using one line only, work out the total percentage of missing values from the pima dataframe
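
As a hedged illustration of the `.mean() * 100` idea, the sketch below uses a hypothetical toy dataframe (`toy_df` is not one of the lesson's datasets). Because `True` counts as 1 and `False` as 0, the mean of the boolean mask is the fraction of missing entries.

```python
import numpy as np
import pandas as pd

# Toy dataframe: column "a" is half missing, column "b" is complete.
toy_df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
})

# Percentage of missing values in each column.
per_column_pct = toy_df.isnull().mean() * 100

# Percentage of missing values over every cell in the dataframe.
overall_pct = toy_df.isnull().to_numpy().mean() * 100

print(per_column_pct)
print(overall_pct)
```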


## Visualising Missing Data
There is a very useful package called `missingno` which allows us to easily visualise where our data is missing. Seeing which rows have missing values can help us to determine whether data has gone missing because of a random error or because of something a bit more systematic.

For example, in the auto_df plot, we see that where 

In [None]:
import missingno as msno

msno.matrix(auto_df)

In [None]:
# msno.bar(pima_df)
msno.matrix(pima_df)

## When (and how) to delete data

There are two types of deletion we can consider:

1. Pairwise Deletion
2. Listwise Deletion

**Pairwise deletion** is when missing values are skipped during the calculation of some statistic. This is almost a non-factor with Pandas because its statistics skip missing values by default (`skipna=True`):

In [None]:
print("Mean of Glucose using native mean method:", pima_df["Glucose"].mean())
print("Mean of Glucose via manual calculation  :", pima_df["Glucose"].sum() / pima_df["Glucose"].count())

**Listwise deletion** is when we drop the whole row of data because of a missing value. This is completely valid to do, but should only be done when the number of rows you are dropping is insignificant compared to the rest of the dataset. From our counts of nullity that we did previously, and working under the knowledge that `Glucose` and `BMI` are missing completely at random (MCAR, meaning the chance of a value being missing does not depend on any other data), we are safe to drop the rows where these missing values are present.

In [None]:
## Drop the rows in Glucose and BMI for which there are missing values


## Plot a missingno matrix of the dataframe
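
Listwise deletion restricted to particular columns can be sketched with `DataFrame.dropna` and its `subset` argument. The dataframe below is a made-up stand-in (`toy_df` and `dropped_df` are illustrative names), and the sketch assumes we only want to drop rows where `Glucose` or `BMI` is missing.

```python
import numpy as np
import pandas as pd

# Toy stand-in with missing values spread over three columns.
toy_df = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, 89.0],
    "BMI": [33.6, 26.6, np.nan, 28.1],
    "Insulin": [np.nan, 94.0, 168.0, np.nan],
})

# Drop a row only if Glucose or BMI is missing; Insulin is ignored.
dropped_df = toy_df.dropna(subset=["Glucose", "BMI"])

print(dropped_df)
```

Rows with a missing `Insulin` value survive here, which is the point of using `subset`: columns with a lot of missingness are left for imputation rather than deletion.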



## Imputing using averages

**Imputation** is the act of predicting the missing data, and can be applied to any of the missing data classifications we are working with. In this section, we will look at imputing data using central tendency measures. We will show why this may not be a great idea, and then introduce ML imputing techniques. Recall the three types of average: Mean, Median and Mode.

The values of these types of imputations are trivial to calculate and we could easily do it ourselves if we wanted to. However, for the sake of introducing you to the API, we will be using sklearn's `SimpleImputer`.

In [None]:
pima_cols = pima_df.columns
pima_cols

In [None]:
from sklearn.impute import SimpleImputer

mode_imputer = SimpleImputer(strategy="most_frequent")
pima_mode_arr = mode_imputer.fit_transform(pima_df)
pima_mode_df = pd.DataFrame(data=pima_mode_arr, columns=pima_cols)
pima_mode_df

In [None]:
## Impute and assign dataframes for strategies of mean and median
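
The same `SimpleImputer` pattern used for the mode can be sketched for the other two strategies. The example below runs on a small made-up dataframe (`toy_df` and the `imputed` dictionary are illustrative names); in the notebook the equivalent results would be assigned to `pima_mean_df` and `pima_median_df`, which the later plotting cells expect.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data: one missing value per column.
toy_df = pd.DataFrame({
    "SkinThickness": [20.0, 30.0, np.nan, 70.0],
    "BMI": [25.0, np.nan, 35.0, 30.0],
})

# Fit one imputer per strategy and wrap the returned array in a dataframe.
imputed = {}
for strategy in ("mean", "median"):
    imputer = SimpleImputer(strategy=strategy)
    arr = imputer.fit_transform(toy_df)
    imputed[strategy] = pd.DataFrame(arr, columns=toy_df.columns)

print(imputed["mean"])
print(imputed["median"])
```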


### So why is this bad?

Central tendency imputations should be avoided because they artificially reduce the variance of the data, which biases any statistics computed from it. Perhaps more intuitively, these types of imputations ignore the relationships between the variables in your data. We can easily see this with visualisations.

In [None]:
# 1 where either SkinThickness or BMI was missing, 0 otherwise
nulls = (pima_df["SkinThickness"].isnull() | pima_df["BMI"].isnull()).astype("int")
px.scatter(pima_mean_df, x="SkinThickness", y="BMI", title="SkinThickness vs BMI (Mean Imputation)",
           color=nulls)

In [None]:
# Visualising the different imputations through subplots
from plotly.subplots import make_subplots

fig_mean = px.scatter(pima_mean_df, x="SkinThickness", y="BMI", color=nulls)

fig_median = px.scatter(pima_median_df, x="SkinThickness", y="BMI", color=nulls)

fig_mode = px.scatter(pima_mode_df, x="SkinThickness", y="BMI", color=nulls)

fig = make_subplots(rows=1, cols=3, shared_xaxes=False, subplot_titles=("Mean Imputation","Median Imputation", "Mode Imputation"))
fig.add_trace(fig_mean['data'][0], row=1, col=1)
fig.add_trace(fig_median['data'][0], row=1, col=2)
fig.add_trace(fig_mode['data'][0], row=1, col=3)
fig.update_layout(title="SkinThickness vs BMI (Imputations)")
fig.show()
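
The loss of variance and of variable relationships can also be checked numerically. The sketch below uses synthetic data rather than the Pima dataset (all names and numbers are assumptions for illustration): `y` depends strongly on `x`, 150 values of `y` are removed at random, and the gaps are filled with the column mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data where y is strongly related to x.
x = rng.normal(size=500)
y = 2 * x + rng.normal(scale=0.5, size=500)
full = pd.DataFrame({"x": x, "y": y})

# Remove 150 values of y at random.
holed = full.copy()
missing_rows = rng.choice(500, size=150, replace=False)
holed.loc[missing_rows, "y"] = np.nan

# Mean imputation: every gap receives the same value.
mean_filled = holed.fillna(holed.mean())

# Pandas skips NaN here, so these use the observed values only.
std_observed = holed["y"].std()
corr_observed = holed["x"].corr(holed["y"])

std_imputed = mean_filled["y"].std()
corr_imputed = mean_filled["x"].corr(mean_filled["y"])

print("std of y  :", std_observed, "->", std_imputed)
print("corr(x, y):", corr_observed, "->", corr_imputed)
```

Both numbers shrink after mean imputation: the imputed points all sit at the mean of `y` regardless of their `x` value, which is the horizontal band of points visible in scatter plots of mean-imputed data.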

### ML based imputation techniques

The alternative to using central tendencies or constant values for imputations is to use ML based imputation methods. We will cover three types of algorithm here which are in popular use: nearest neighbours imputation, tree based imputation and regression based imputation. These methods work by building models for a feature based on the other features of the data.

As you've already covered the underlying algorithms of KNNs, Ensemble Trees and Regression, we won't recap them here - just show how to apply them to our dataset.

We'll start with KNN imputation. Here, the algorithm selects the K nearest/most similar datapoints to a datapoint with a missing value. The missing value is then populated with either a plain average or a distance-weighted average of the K neighbours. This choice is controlled by the `weights` argument of sklearn's [`KNNImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#sklearn.impute.KNNImputer) class.

In [None]:
from sklearn.impute import KNNImputer

# default arguments are:
# n_neighbors = 5
# weights = "uniform"
knn_impute = KNNImputer(n_neighbors=3, weights="distance")
pima_knn_arr = knn_impute.fit_transform(pima_df)
pima_knn_df = pd.DataFrame(data=pima_knn_arr, columns=pima_cols)

fig_knn = px.scatter(pima_knn_df, x="SkinThickness", y="BMI", title="SkinThickness vs BMI (KNN Imputation)",
           color=nulls)
fig_knn

When researching imputation techniques, one common method you'll come across is something known as [**MICE**](https://www.jstatsoft.org/article/view/v045i03) - Multivariate Imputation by Chained Equations. This algorithm performs multiple regressions on random samples of the data, and uses the average of these regressions to impute the missing value. With the sklearn API, the appropriate class to use is [`IterativeImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html), passing `sample_posterior=True` ([source](https://scikit-learn.org/stable/modules/impute.html)).

In [None]:
# IterativeImputer is an experimental feature in sklearn, so we need to enable it prior to using it
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

regression_impute = IterativeImputer(sample_posterior=True)
pima_regression_arr = regression_impute.fit_transform(pima_df)
pima_regression_df = pd.DataFrame(data=pima_regression_arr, columns=pima_cols)

fig_br = px.scatter(pima_regression_df, x="SkinThickness", y="BMI", title="SkinThickness vs BMI (Regression Imputation)",
           color=nulls)
fig_br

The `IterativeImputer` class is highly flexible and allows us to use any estimator object to perform our imputation (instead of just regression). The default estimator that it uses isn't actually vanilla linear regression - it's something known as Bayesian Ridge regression. We won't cover the details here as the important thing to know is that it is just a linear regression variant. If you are curious about a fuller understanding, more details can be found at: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html#sklearn.linear_model.BayesianRidge.

Another, more recent, trend in imputation is using tree based ensembles. Here we will use the `RandomForestRegressor` estimator to impute the missing values:

In [None]:
from sklearn.ensemble import RandomForestRegressor

## Impute the missing values using RandomForestRegressor and IterativeImputer
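
A minimal sketch of tree based imputation is shown below, assuming that `IterativeImputer` is given a `RandomForestRegressor` through its `estimator` argument. It runs on synthetic data (`toy`, `rf_impute` and `toy_rf_df` are illustrative names, and the small `n_estimators` and `max_iter` values are chosen only to keep the example fast, not as recommendations).

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic data: b depends on a, and every tenth value of b is removed.
toy = pd.DataFrame({"a": rng.normal(size=60)})
toy["b"] = 3 * toy["a"] + rng.normal(scale=0.1, size=60)
toy.loc[toy.index[::10], "b"] = np.nan

# Each feature with missing values is modelled from the other features
# using a random forest instead of the default Bayesian Ridge regression.
rf_impute = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    max_iter=5,
    random_state=0,
)
toy_rf_df = pd.DataFrame(rf_impute.fit_transform(toy), columns=toy.columns)

print(toy_rf_df.head())
```

In the notebook, the same pattern would be applied to `pima_df`, and the resulting dataframe passed to `px.scatter` (as was done for `fig_knn` and `fig_br`) to build the `fig_rf` figure that the comparison cell below expects.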


In [None]:
fig = make_subplots(rows=1, cols=3, shared_xaxes=False, subplot_titles=("KNN Imputation", "BR Regression Imputation", "RF Regression Imputation"))
fig.add_trace(fig_knn['data'][0], row=1, col=1)
fig.add_trace(fig_br['data'][0], row=1, col=2)
fig.add_trace(fig_rf['data'][0], row=1, col=3)
fig.update_layout(title="SkinThickness vs BMI (Regression Imputations)")
fig.show()

## Time Series Imputation

Time series data refers to data that has been collected over time, with each datapoint being indexed by a sequential datetime. The most typical example is perhaps stock data. Here, we will look at the Beijing PM2.5 Data available from: https://www.kaggle.com/joshuapaulbarnard/beijing-air-quality-pm25-from-2010-to-2017?select=Beijing+PM2_5+from+2010+to+2017.csv. In the `pd.read_csv()` call, we'll tell Pandas that we want to index on the Date variable.

In [None]:
# infer_datetime_format is deprecated in recent Pandas versions, so it is omitted here
pm_df_org = pd.read_csv("Data/Beijing PM2_5 from 2010 to 2017.csv", index_col="Date", parse_dates=True)

In [None]:
pm_df = pm_df_org[:400000]
pm_df = pm_df.loc[~pm_df.index.duplicated(keep='last')]
pm_df = pm_df.sort_index(axis=0)
pm_df

In [None]:
## Check the percentage of nulls in each column


In [None]:
# Renaming columns so they're easier to refer to
pm_cols = ["city", "country", "season", "pm25", "dew_point", "temperature", "humidity", "pressure", "wind_direction", "wind_speed", "precipitation_hourly", "precipitation_cum"]
pm_df.columns = pm_cols
pm_df.head()

### Fill missing values

We will start with forward and backward filling, which do what the names imply - they fill values where there is a NaN. Historically these were selected by passing a `method` argument to the [`.fillna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) method, and recent Pandas versions expose them directly as `.ffill()` and `.bfill()` (passing `method` to `.fillna()` is deprecated). The two strategies are:
- `ffill`
- `bfill`

These stand for "forward fill" and "backward fill" respectively. If we have a Series with some missing values in it, `ffill` will replace the NaNs with the last non-NaN value we've come across in the Series - until the next non-NaN value is reached. Backward fill reverses this and fills in the NaN values with the *next* non-NaN value we would observe. Seeing this in action will clarify this description:

In [None]:
pm_df["wind_speed"][31230:31245]

In [None]:
# .ffill() is equivalent to the older .fillna(method="ffill"), which recent Pandas versions deprecate
pm_ffill_df = pm_df.ffill()
pm_ffill_df["wind_speed"][31230:31245]

In [None]:
# .bfill() is equivalent to the older .fillna(method="bfill"), which recent Pandas versions deprecate
pm_bfill_df = pm_df.bfill()
pm_bfill_df["wind_speed"][31230:31245]
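
The difference between the two fills is easiest to see on a tiny hypothetical Series (`s` below is made up and is not taken from the Beijing data). Note what happens at the end of the Series: a backward fill has no later value to copy, so a trailing NaN is left in place.

```python
import numpy as np
import pandas as pd

# Two gaps: one in the middle and one at the very end.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()   # copy the last seen value forwards
backward = s.bfill()  # copy the next seen value backwards

print(forward.tolist())   # [1.0, 1.0, 1.0, 4.0, 4.0]
print(backward.tolist())  # [1.0, 4.0, 4.0, 4.0, nan]
```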

### Interpolation

Ok - so this works to fill NA values, but it's not really ideal. (Note that filling values in this way has applications outside of time series data.) The next level up is the [`.interpolate()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html) method. Interpolation is the art of finding a transition from one point to another. `.interpolate()` offers many techniques for interpolating datapoints, but we will focus on the following three. Refer to the documentation for a more comprehensive list and for guidance on when to use each method:
- Linear
- Quadratic
- Nearest

#### Linear
Linear interpolation simply fills values with equidistant points from the first non-NaN value to the next. Visually:
![](../images/linear_interpolation.png)

It is simple to calculate what this 'equidistant' step actually is - we take the first non-NaN value and the following non-NaN value, subtract the first from the second, and divide by the number of intervals between the two points (one more than the number of NaNs between them). With the `wind_speed` data we were looking at, this would be $(2.4 - 1.3)/6 \approx 0.183$.

In [None]:
pm_linear_df = pm_df.interpolate(method="linear")
pm_linear_df["wind_speed"][31230:31245]
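
The step calculation can be verified on a small hypothetical Series (the values below are illustrative and assume five NaNs between 1.3 and 2.4, which gives six equal intervals). Every consecutive pair of filled values should differ by the same step.

```python
import numpy as np
import pandas as pd

# Five NaNs between the two known values -> six equal intervals.
s = pd.Series([1.3, np.nan, np.nan, np.nan, np.nan, np.nan, 2.4])

filled = s.interpolate(method="linear")

# Expected step: the gap divided by (number of NaNs + 1).
step = (2.4 - 1.3) / 6

print(filled.round(3).tolist())
print(round(step, 3))
```

With `method="linear"`, Pandas treats the values as equally spaced and ignores the index, which is why the step depends only on how many NaNs lie between the two known values.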

#### Nearest

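The `nearest` strategy is very similar to `.fillna()` in that it copies existing values rather than computing new ones. The difference is that each NaN is filled with whichever non-NaN value is closest to it, whether that value comes before or after. (Pandas delegates this method, and the quadratic method below, to SciPy, so SciPy must be installed.)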

In [None]:
pm_nearest_df = pm_df.interpolate(method="nearest")
pm_nearest_df["wind_speed"][31230:31245]

#### Quadratic
Quadratic interpolation attempts to fit a quadratic/parabolic curve between the two non-NaN values. Later in this notebook we will show plots of the different interpolation methods, which should give you a better sense of how and where each of them would be useful. We won't be diving into the methodology behind quadratic interpolation, but you can find a brilliant walkthrough here: https://www.youtube.com/watch?v=ifS8LL3qT2g

![](../images/quadratic_interpolation.png)

In [None]:
pm_quad_df = pm_df.interpolate(method="quadratic")
pm_quad_df["wind_speed"][31230:31245]

### Visualising Time-series data


In [None]:
pm_df["wind_speed"][31230:31245]

In [None]:
pd.options.plotting.backend = 'plotly'

fig = make_subplots(rows=6, cols=1, shared_xaxes=False, subplot_titles=("None", "Ffill Interpolation", "Bfill Interpolation", "Linear Interpolation", "Nearest Interpolation", "Quadratic Interpolation"))

default = pm_df["wind_speed"][31230:31245]

fig_none = default.plot()
fig.add_trace(fig_none["data"][0], row=1, col=1)

figs_list = [pm_ffill_df, pm_bfill_df, pm_linear_df, pm_nearest_df, pm_quad_df]
for i, to_fig in enumerate(figs_list):
    fig_ = to_fig["wind_speed"][31230:31245].plot(color_discrete_sequence=["red"])
    fig_.add_trace(px.line(pm_df["wind_speed"][31230:31245]).data[0])
    
    fig.add_trace(fig_["data"][0], row=i+2, col=1)
    fig.add_trace(fig_["data"][1], row=i+2, col=1)

# No width is set so that the figure autosizes to the available width
fig.update_layout(title="Wind Speed Interpolations", height=1600, showlegend=False)
fig.update_traces(mode='markers+lines')
fig.show()

# fig = pm_df.interpolate(method="quadratic")["wind_speed"][31230:31245].plot()
# fig.add_trace(px.line(pm_df["wind_speed"][31230:31245], color_discrete_sequence=["red"]).data[0])
# fig.show()

## Imputing Categorical Variables
A naive approach to impute categorical data is to either use the most frequent/mode method or the `.fillna()` strategy that we looked at earlier. However, it is also possible to impute categorical variables as we would a typical continuous variable. We cannot perform the imputation as directly as before because categorical variables are usually encoded as strings (and therefore we cannot perform any mathematical operations on them). To impute these kinds of variables, we must first encode them as numeric values.

In [None]:
from sklearn.preprocessing import OrdinalEncoder
knn_impute = KNNImputer(n_neighbors=3, weights="distance")

# We will drop city and country because these are categorical variables which have all their values present
# We will keep season even though it has all the values present (just to demonstrate the process over multiple categorical variables)
# We will drop humidity and precipitation_cum as these features are fully NaN
pm_to_impute_cols = ["season", "pm25", "dew_point", "temperature", "pressure", "wind_direction", "wind_speed", "precipitation_hourly"]
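# .copy() so that the column assignments below modify a new dataframe rather than a view of pm_df
pm_to_impute_df = pm_df[pm_to_impute_cols].copy()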
cat_cols = ["season", "wind_direction"]

# key: col, value: dict mapping between category labels and integer codes
category_dict_encode = {}
category_dict_decode = {}
for col in cat_cols:
    pm_to_impute_df[col] = pm_to_impute_df[col].astype("category")
    categories = pm_to_impute_df[col].cat.categories
    
    col_cat_dict = dict(enumerate(categories))
    print("Column, Category dict:", col_cat_dict)
    category_dict_decode[col] = col_cat_dict
    
    col_cat_dict = {v:k for k,v in col_cat_dict.items()}
    print("Inverted Column, Category dict:", col_cat_dict)
    category_dict_encode[col] = col_cat_dict

print()
print("Category dict ENCODE:", category_dict_encode)
print("Category dict DECODE:", category_dict_decode)
pm_to_impute_df.replace(category_dict_encode, inplace=True)
pm_to_impute_df


In [None]:
knn_pm_arr = knn_impute.fit_transform(pm_to_impute_df)
knn_pm_df = pd.DataFrame(knn_pm_arr, columns = pm_to_impute_cols)
knn_pm_df[31230:31245]

In [None]:
for col in cat_cols:
    knn_pm_df[col] = knn_pm_df[col].round()
knn_pm_df.index = pm_df.index
knn_pm_df[31230:31245]

In [None]:
# In the original dataframe we will only replace the values for the categorical cols we imputed
# We need to map the integer codes back to their category labels
knn_pm_df.replace(category_dict_decode, inplace=True)
knn_pm_df[31230:31245]
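
The encode, round and decode steps used above can be sketched in isolation on a toy categorical column. This is an illustrative variant that uses Pandas category codes instead of the explicit dictionaries built in the notebook; the data is made up, and the value `1.4` is only a stand-in for whatever fractional code an imputer might return.

```python
import numpy as np
import pandas as pd

# Toy wind-direction column with one missing entry.
wind = pd.Series(["NE", "NW", None, "SE", "NE"], dtype="category")

# Integer code -> label, e.g. {0: "NE", 1: "NW", 2: "SE"}.
decode = dict(enumerate(wind.cat.categories))

# cat.codes marks missing entries with -1; convert that to NaN so an
# imputer would treat it as missing rather than as a real category.
codes = wind.cat.codes.astype("float").replace(-1.0, np.nan)

# Stand-in for an imputer's output: a fractional code for the gap.
codes_imputed = codes.fillna(1.4)

# Round to the nearest valid code, then map the codes back to labels.
decoded = codes_imputed.round().astype(int).map(decode)

print(decoded.tolist())  # ['NE', 'NW', 'NW', 'SE', 'NE']
```

Rounding treats the integer codes as if they were ordered and evenly spaced, which is only an approximation for unordered categories such as wind direction, so the imputed labels are worth a sanity check.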

In [None]:
imputed_cols_df = knn_pm_df[cat_cols]
imputed_cols_df

In [None]:
pm_imputed_df = pm_df.copy(deep=True)
pm_imputed_df.update(knn_pm_df, overwrite=False)
pm_imputed_df[31230:31245]

In [None]:
nulls = pm_df["wind_speed"][31230:31245].isnull()
print(nulls)

fig = pm_imputed_df["wind_speed"][31230:31245].plot(color_discrete_sequence=["red"])
fig.add_trace(px.line(pm_df["wind_speed"][31230:31245]).data[0])
fig.update_traces(mode='markers+lines')
fig.update_layout(title="Wind Speed Interpolation (KNN Full DF imputation)")
fig["data"][0]["name"] = "Interpolated"
fig["data"][1]["name"] = "Original"
fig.show()