# Imputing values

Imputation is tricky. Sometimes you need it because too little data is available, but keep in mind that you are filling in values yourself rather than observing them.

The following notebook gives an example of a simple imputation and a small improvement.

[The example is adapted from this article.](https://towardsdatascience.com/pandas-tricks-for-imputing-missing-data-63da3d14c0d6)

In [37]:
import pandas as pd
import matplotlib.pyplot as plt


df_base = pd.read_csv("files/winemag-data_first150k.csv", delimiter=";")
df_mean = df_base.copy()
print(df_mean.head())

  country                           designation  points  price   
0      US                     Martha's Vineyard    96.0  235.0  \
1   Spain  Carodorum Selección Especial Reserva    96.0  110.0   
2      US         Special Selected Late Harvest    96.0   90.0   
3      US                               Reserve    96.0   65.0   
4  France                            La Brûlade    95.0   66.0   

         province           region_1           region_2             variety   
0      California        Napa Valley               Napa  Cabernet Sauvignon  \
1  Northern Spain               Toro                NaN       Tinta de Toro   
2      California     Knights Valley             Sonoma     Sauvignon Blanc   
3          Oregon  Willamette Valley  Willamette Valley          Pinot Noir   
4        Provence             Bandol                NaN  Provence red blend   

                    winery  last_year_points  
0                    Heitz                94  
1  Bodega Carmen Rodríguez        

And where are the missing values?

In [38]:
df_mean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144037 entries, 0 to 144036
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   country           144035 non-null  object 
 1   designation       100211 non-null  object 
 2   points            144032 non-null  float64
 3   price             130641 non-null  float64
 4   province          144030 non-null  object 
 5   region_1          120192 non-null  object 
 6   region_2          58378 non-null   object 
 7   variety           144032 non-null  object 
 8   winery            144032 non-null  object 
 9   last_year_points  144037 non-null  int64  
dtypes: float64(2), int64(1), object(7)
memory usage: 11.0+ MB


This shows how many non-null values each column has. You can immediately see that "region_2" is mostly empty. Another approach simply shows the number of missing values per column.

In [39]:
df_mean.isnull().sum()

country                 2
designation         43826
points                  5
price               13396
province                7
region_1            23845
region_2            85659
variety                 5
winery                  5
last_year_points        0
dtype: int64
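
Raw counts are hard to compare by eye, so it can help to look at the share of missing values per column instead. The following is a minimal sketch on a made-up toy frame (the `toy` data and the `missing_share` name are assumptions for illustration, not part of the wine dataset):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the wine dataset (values are made up).
toy = pd.DataFrame({
    "country": ["US", "Spain", None, "France"],
    "price": [235.0, np.nan, 17.0, np.nan],
    "region_2": ["Napa", None, None, None],
})

# isna() gives a boolean frame; the mean of booleans is the share of True values.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

The same one-liner should work on `df_mean`, and sorting puts the emptiest columns at the top.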

Let's focus on price. The easiest way would be to fill in the price using the mean price...

In [40]:
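# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas.
df_mean['price'] = df_mean['price'].fillna(df_mean['price'].mean())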
df_mean.isnull().sum()

country                 2
designation         43826
points                  5
price                   0
province                7
region_1            23845
region_2            85659
variety                 5
winery                  5
last_year_points        0
dtype: int64
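
One side effect of filling with the overall mean is worth seeing in numbers: the mean stays the same, but the spread shrinks, because every filled value sits exactly on the mean. A small sketch with made-up prices (the `prices` series is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Made-up prices with two missing values.
prices = pd.Series([10.0, 20.0, np.nan, 60.0, np.nan, 30.0])

# Same technique as above: fill the gaps with the mean of the known values.
filled = prices.fillna(prices.mean())

# The mean is unchanged, but the standard deviation drops.
print("mean before/after:", prices.mean(), filled.mean())
print("std before/after: ", prices.std(), filled.std())
```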

The data has been filled in, but not very accurately. It would be better to estimate the price from other data, such as country. Wouldn't it be more reasonable to assume that a wine without a price costs the average for its country?

We'll make a fresh copy of the starting dataframe and count the wines per country in it.

In [41]:
from collections import Counter

# Keep the copy in df; the loop further down works on it.
df = df_base.copy()
Counter(df['country'])

Counter({'US': 59796,
         'Spain': 7402,
         'France': 20527,
         'Italy': 22967,
         'New Zealand': 3149,
         'Bulgaria': 73,
         'Argentina': 4978,
         'Australia': 4496,
         'Portugal': 5251,
         'Israel': 587,
         'South Africa': 2140,
         'Greece': 868,
         'Chile': 5264,
         'Morocco': 12,
         'Romania': 138,
         'Germany': 2371,
         'Canada': 184,
         'Moldova': 71,
         'Hungary': 231,
         'Austria': 2935,
         'Croatia': 87,
         'Slovenia': 94,
         nan: 2,
         'India': 8,
         'Turkey': 51,
         'Macedonia': 16,
         'Lebanon': 37,
         'Serbia': 14,
         'Uruguay': 75,
         'Switzerland': 4,
         'Albania': 2,
         'Bosnia and Herzegovina': 4,
         'Brazil': 23,
         'Cyprus': 23,
         'Lithuania': 8,
         'Japan': 2,
         'China': 3,
         'South Korea': 4,
         'Ukraine': 5,
         'England': 9,
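
pandas can produce the same kind of count without `Counter`. A minimal sketch, assuming a made-up `countries` series; `dropna=False` keeps the missing entries in the result, which `value_counts` would otherwise leave out:

```python
import pandas as pd

# Made-up country column with one missing entry.
countries = pd.Series(["US", "Spain", "US", None, "France", "US"])

# dropna=False adds a row for the missing values instead of hiding them.
counts = countries.value_counts(dropna=False)
print(counts)
```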
       

First problem: two wines don't have a country. But do they have a price?

In [42]:

df_base[df_base['country'].isna()]

Unnamed: 0,country,designation,points,price,province,region_1,region_2,variety,winery,last_year_points
1124,,Askitikos,90.0,17.0,,,,Assyrtiko,Tsililis,88
1427,,Shah,90.0,30.0,,,,Red Blend,Büyülübağ,100


Yes, both have a price, so there is nothing to impute for them.
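
One thing to keep in mind with rows like these: a missing country never compares equal to anything, not even to another missing value, so a filter of the form `df['country'] == something` will not select them. A small sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["US", np.nan, "Spain"],
    "price": [20.0, 17.0, 15.0],
})

# NaN is not equal to itself, so an equality filter finds nothing.
print(np.nan == np.nan)  # False
by_equality = toy[toy["country"] == np.nan]

# isna() is the way to select rows with a missing country.
by_isna = toy[toy["country"].isna()]

print(len(by_equality), len(by_isna))
```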

Next, let's loop over all countries, create a dataframe for just that country, and use fillna() to fill in the missing prices with the mean price _of that country_.

In [47]:
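frames = []
# Skip NaN: a missing country never matches the == filter below.
for country in df['country'].dropna().unique():
    df_country = df[df['country'] == country].copy()
    # Fill missing prices with the mean price of this country only.
    df_country['price'] = df_country['price'].fillna(df_country['price'].mean())
    frames.append(df_country)

# Concatenate once, after the loop, instead of on every iteration.
final_df = pd.concat(frames)

print(final_df.isnull().sum())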

country                 0
designation         43826
points                  5
price                   0
province                7
region_1            23845
region_2            85659
variety                 5
winery                  5
last_year_points        0
dtype: int64
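
The loop can also be expressed with `groupby` and `transform`, which computes the per-country mean for every row in one step and keeps the original row order. This is a sketch of an alternative, not the notebook's method; the `toy` frame and the `price_filled` column are made up for illustration, and the behaviour for the row without a country assumes a reasonably recent pandas version:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["US", "US", "US", "Spain", "Spain", np.nan],
    "price": [10.0, np.nan, 30.0, 8.0, np.nan, 17.0],
})

# transform("mean") returns one value per row: the mean price of that row's country.
# Rows without a country belong to no group and get NaN here.
country_mean = toy.groupby("country")["price"].transform("mean")

# Only the missing prices are replaced; known prices stay untouched.
toy["price_filled"] = toy["price"].fillna(country_mean)
print(toy)
```

Unlike the loop, this keeps the row without a country in the frame, with its original price.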


Closing thought: is it a good idea to use this dataset to predict prices now? And if we do, what correlation will we likely find?