# Imputing values

Imputation is tricky. You sometimes need it when not enough data is available, but keep in mind that you are filling in values yourself rather than observing them.

The following notebook gives an example of a simple imputation and a small improvement.

[The example is adapted from this article.](https://towardsdatascience.com/pandas-tricks-for-imputing-missing-data-63da3d14c0d6)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt


df_base = pd.read_csv("files/winemag-data_first150k.csv", delimiter=";")
df_mean = df_base.copy()
print(df_mean.head())

And where are the missing values?

In [None]:
df_mean.info()

This shows how many non-null values each column has. You can immediately see that "region_2" is pretty empty. Another way is to count the missing values per column directly.

In [None]:
df_mean.isnull().sum()

Let's focus on price. The easiest way would be to fill in the price using the mean price...

In [None]:
# Assign the result back; inplace=True on a column selection may act on a temporary copy.
df_mean['price'] = df_mean['price'].fillna(df_mean['price'].mean())
df_mean.isnull().sum()

The missing prices have been filled in, but not very accurately: every wine without a price now gets the same value. It would be better to estimate the price from other data, such as country. Wouldn't it make more sense to give a wine without a price the average price of its country?
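To make the difference concrete, here is a minimal sketch on a made-up frame. The countries and prices are invented for illustration and are not taken from the wine dataset. It compares the single overall mean with the per-country means.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B"],
    "price": [10.0, None, 100.0, 120.0, None],
})

# mean() skips missing values by default.
overall_mean = toy["price"].mean()

# One mean per country, again ignoring missing prices.
country_means = toy.groupby("country")["price"].mean()

print(overall_mean)
print(country_means)
```

With the overall mean, the unpriced wine from country A would get a price several times higher than the only known price from that country.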

We'll make a fresh copy of the starting dataframe and look for all countries in that dataframe.

In [None]:
from collections import Counter

# Store the copy in a variable; a bare df_base.copy() is thrown away.
df_fresh = df_base.copy()
Counter(df_fresh['country'])

First problem: two wines don't have a country. But do they have a price?

In [None]:

df_base[df_base['country'].isna()]

Yes, so no worries there.

Next, let's loop over all countries, create a dataframe of just that country and use fillna() to fill in the missing values with the mean value _of that country_.

In [None]:
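frames = []
# Skip the missing country: NaN never equals itself, so that selection would be empty.
for country in df_base['country'].dropna().unique():
    df_country = df_base[df_base['country'] == country].copy()
    # Assign the result back instead of using inplace=True on a column selection.
    df_country['price'] = df_country['price'].fillna(df_country['price'].mean())
    frames.append(df_country)

# Keep the wines without a country as they are, so no rows are lost.
frames.append(df_base[df_base['country'].isna()])

# Concatenate once, after the loop, rather than on every iteration.
final_df = pd.concat(frames)

print(final_df.isnull().sum())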
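The same per-country fill can also be sketched without an explicit loop, using `groupby` and `transform`. The snippet below runs on a small made-up frame rather than the wine data: the column names match the notebook, the values are assumptions. Rows with a missing country are left out of the groups by default, so their price is not touched.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", None],
    "price": [10.0, None, 100.0, None, 50.0],
})

# Mean price of each row's own country, aligned to the original index.
country_mean = toy.groupby("country")["price"].transform("mean")

# Only missing prices are replaced; existing prices are kept.
toy["price"] = toy["price"].fillna(country_mean)

print(toy)
```

Unlike concatenating per-country frames, this keeps the rows in their original order.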

Closing thought: is it a good idea to use this dataset to predict prices now? And if we do, what correlation will we likely find?