# Handling missing values - Intro

So far, the data have arrived exactly the way we would like them: neat and tidy!

Now that you have learned the basics of pandas DataFrames, we can start introducing you to some real world problems.

In this chapter, you will learn how to handle missing values in your data preparation steps. Namely, you will learn how to:
* eliminate rows with missing values
* impute missing values by padding
* impute missing values using prediction models

# Preparations

In [1]:
import pandas as pd

pd.set_option("display.max_columns", 500)

# Loading a dataset with missing values

In [2]:
# load the following tab-separated text file
df = pd.read_csv("../../data/raw/financial_data_wrong_types.tsv", sep="\t")
df.head()

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,cb_revt,employees
0,14651,2005,British American Tobacco PLC,312230,GBR,2005-12-31,110448107,"32.737,984",UNKNOWN,No,,17.9
1,14651,2006,British American Tobacco PLC,312230,GBR,2006-12-31,110448107,-999,3713.506,No,19317.672,5.274
2,14651,2007,British American Tobacco PLC,312230,GBR,2007-12-31,110448107,-999,4226.559,No,,
3,14651,2008,British American Tobacco PLC,312230,GBR,2008-12-31,110448107,"40.276,807",3591.888,No,17721.152,11.038
4,14651,2009,British American Tobacco PLC,312230,GBR,2009-12-31,110448107,"43.026,854",4386.107,No,22970.074,8.961


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 824 entries, 0 to 823
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   u_company_name_id      824 non-null    int64  
 1   u_year                 824 non-null    int64  
 2   u_company_name         824 non-null    object 
 3   cb_naics               824 non-null    int64  
 4   u_iso3                 824 non-null    object 
 5   u_fye                  824 non-null    object 
 6   cb_cusip               824 non-null    object 
 7   cb_at                  824 non-null    object 
 8   cb_ni                  824 non-null    object 
 9   cb_financial_industry  824 non-null    object 
 10  cb_revt                648 non-null    float64
 11  employees              680 non-null    float64
dtypes: float64(2), int64(3), object(7)
memory usage: 77.4+ KB


# Checking for missing data

## The `isna()` method of DataFrame and Series

In [4]:
# checking each value of a column (Series):
df["employees"].isna()

0      False
1      False
2       True
3      False
4      False
       ...  
819    False
820    False
821    False
822     True
823    False
Name: employees, Length: 824, dtype: bool

In [5]:
# checking the whole DataFrame:
df.isna()

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,cb_revt,employees
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,True
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
819,False,False,False,False,False,False,False,False,False,False,False,False
820,False,False,False,False,False,False,False,False,False,False,False,False
821,False,False,False,False,False,False,False,False,False,False,False,False
822,False,False,False,False,False,False,False,False,False,False,False,True


## The `notna()` method

The inverse of `isna()`: it returns `True` wherever a value is present.

In [6]:
# one column
df["employees"].notna()

0       True
1       True
2      False
3       True
4       True
       ...  
819     True
820     True
821     True
822    False
823     True
Name: employees, Length: 824, dtype: bool

In [7]:
# several columns
df[["employees", "cb_revt"]].notna()

Unnamed: 0,employees,cb_revt
0,True,False
1,True,True
2,False,False
3,True,True
4,True,True
...,...,...
819,True,True
820,True,True
821,True,True
822,False,True


In [8]:
# the whole DataFrame:
df.notna()

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,cb_revt,employees
0,True,True,True,True,True,True,True,True,True,True,False,True
1,True,True,True,True,True,True,True,True,True,True,True,True
2,True,True,True,True,True,True,True,True,True,True,False,False
3,True,True,True,True,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True,True,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...
819,True,True,True,True,True,True,True,True,True,True,True,True
820,True,True,True,True,True,True,True,True,True,True,True,True
821,True,True,True,True,True,True,True,True,True,True,True,True
822,True,True,True,True,True,True,True,True,True,True,True,False
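
The boolean Series returned by `notna()` can also serve as a row filter. The following is a minimal sketch on a small, made-up DataFrame; the `toy` frame and its values are illustrative and not part of the workshop data.

```python
import numpy as np
import pandas as pd

# hypothetical toy data, only for illustration
toy = pd.DataFrame({"company": ["A", "B", "C"], "employees": [17.9, np.nan, 5.3]})

# keep only the rows in which 'employees' is present
has_employees = toy[toy["employees"].notna()]

# `~` negates a boolean Series, so this selects the same rows
same_rows = toy[~toy["employees"].isna()]

print(has_employees)
```

Both expressions keep the rows for companies A and C; which one you use is mostly a matter of readability.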


## Summarizing the missingness of data

You have already learned that the methods `DataFrame.info()` and `DataFrame.describe()` tell you how many non-missing cells are present in a particular column.

Sometimes it is useful to determine the number of missing or non-missing values "manually". Because `isna()` and `notna()` return boolean values (as a `pandas.Series` or `pandas.DataFrame`) and `True` counts as 1 when summed, we can use the `sum()` method to count the number of NA/non-NA values.

### Number of (non-)missing values in a column

In [9]:
print(f"Missing values in 'cb_revt':     {df['cb_revt'].isna().sum()}")
print(f"Non-Missing values in 'cb_revt': {df['cb_revt'].notna().sum()}")

Missing values in 'cb_revt':     176
Non-Missing values in 'cb_revt': 648


In [10]:
# this corresponds to the output of `describe()` and `info()`
print(df[["cb_revt"]].info())
df[["cb_revt"]].describe().transpose()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 824 entries, 0 to 823
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   cb_revt  648 non-null    float64
dtypes: float64(1)
memory usage: 6.6 KB
None


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
cb_revt,648.0,39611.380448,63851.660251,0.0,500.158,11074.421,46209.9655,379136.0
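
The same counting idea extends to a whole DataFrame. As a small sketch on made-up data (the `toy` frame is an assumption for illustration), `isna().sum()` yields one count per column, and `isna().mean()` yields the share of missing values, because the mean of a boolean column is the proportion of `True` values.

```python
import numpy as np
import pandas as pd

# hypothetical toy data, only for illustration
toy = pd.DataFrame(
    {
        "revenue": [np.nan, 19317.7, np.nan, 17721.2],
        "employees": [17.9, 5.3, np.nan, 11.0],
    }
)

missing_per_column = toy.isna().sum()   # True counts as 1
share_missing = toy.isna().mean()       # mean of booleans = share of True

print(missing_per_column)  # revenue: 2, employees: 1
print(share_missing)       # revenue: 0.50, employees: 0.25
```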


### Number of (non-)missing values in rows

So far, we have only summarized the number of (non-)missing values in columns. For analysis, it can also be useful to calculate and inspect the number of missing values per row.

The `sum()` method of `pandas.DataFrame` has an `axis` argument that defaults to 0 (i.e. 'index'). With this default, `sum()` adds up the values down the rows and returns one total per column. With `axis='columns'`, `sum()` instead adds up the values across the columns and returns one total per row. Combined with `isna()`, this lets us count the number of missing values in each row.
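
To make the difference between the two `axis` settings concrete, here is a small sketch on a made-up DataFrame (`toy` and its values are illustrative only):

```python
import pandas as pd

# hypothetical toy data, only for illustration
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

per_column = toy.sum()                # default axis=0 ('index'): one total per column
per_row = toy.sum(axis="columns")     # same as axis=1: one total per row

print(per_column)  # a: 6, b: 60
print(per_row)     # 0: 11, 1: 22, 2: 33
```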

In [11]:
# showing the first few rows again:
df.head(3)

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,cb_revt,employees
0,14651,2005,British American Tobacco PLC,312230,GBR,2005-12-31,110448107,"32.737,984",UNKNOWN,No,,17.9
1,14651,2006,British American Tobacco PLC,312230,GBR,2006-12-31,110448107,-999,3713.506,No,19317.672,5.274
2,14651,2007,British American Tobacco PLC,312230,GBR,2007-12-31,110448107,-999,4226.559,No,,


In [12]:
# applying the `isna()` method and showing the first three rows of the result
df.isna().head(3)

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,cb_revt,employees
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,True


In [13]:
# now using the `DataFrame.sum(axis='columns')` method:
df.isna().head(3).sum(axis="columns")

0    1
1    0
2    2
dtype: int64

In [14]:
# Saving a separate column containing the number of missing values for each row
df["missing_values"] = df.isna().sum(axis="columns")
df.head()

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,cb_revt,employees,missing_values
0,14651,2005,British American Tobacco PLC,312230,GBR,2005-12-31,110448107,"32.737,984",UNKNOWN,No,,17.9,1
1,14651,2006,British American Tobacco PLC,312230,GBR,2006-12-31,110448107,-999,3713.506,No,19317.672,5.274,0
2,14651,2007,British American Tobacco PLC,312230,GBR,2007-12-31,110448107,-999,4226.559,No,,,2
3,14651,2008,British American Tobacco PLC,312230,GBR,2008-12-31,110448107,"40.276,807",3591.888,No,17721.152,11.038,0
4,14651,2009,British American Tobacco PLC,312230,GBR,2009-12-31,110448107,"43.026,854",4386.107,No,22970.074,8.961,0


In [15]:
# Let's see the rows with the most missing values:
df.sort_values("missing_values", ascending=False).head()

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,cb_revt,employees,missing_values
753,69671,2008,ORANGE,517311,FRA,2008-12-31,684060106,-999,5663.641,No,,,2
347,11568,2005,Bayer AG,325,DEU,2005-12-31,072730302,"43.486,192",1891.167,No,,,2
351,11568,2009,Bayer AG,325,DEU,2009-12-31,072730302,-999,1947.719,No,,,2
655,27790,2005,Dolby Laboratories Inc,533110,USA,2005-09-30,25659T107,586277,52.293,Yes,,,2
539,15715,2014,CGG,541360,FRA,2014-12-31,12531Q204,-999,-1154.4,No,,,2


# Eliminating rows with missing values using `DataFrame.dropna()`

## Limit the dropna command to the columns that you actually need!

`DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)`


**Parameters:**
* `axis:` *{0 or 'index', 1 or 'columns'}, default 0*. Determine if rows or columns containing missing values are dropped.
  * 0, or 'index' : Drop rows containing missing values.
  * 1, or 'columns' : Drop columns containing missing values.
* `how:` *{'any', 'all'}, default 'any'*. Determine whether a row or column is dropped when it has at least one NA or only when all of its values are NA.
  * 'any' : If any NA values are present, drop that row or column.
  * 'all' : If all values are NA, drop that row or column.
* `thresh:` *int, optional*. Keep only rows or columns with at least `thresh` non-NA values. In recent pandas versions, `thresh` cannot be combined with `how`.
* `subset:` *array-like, optional*. Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.
* `inplace:` *bool, default `False`*. If `True`, do operation inplace and return `None`.

**Returns:** DataFrame with the NA entries dropped, or `None` if `inplace=True`

[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)

In [16]:
# without any further settings, dropna() removes all rows with a missing value in any of the columns
df_without_na_rows = df.dropna()
df_without_na_rows.head()

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,cb_revt,employees,missing_values
1,14651,2006,British American Tobacco PLC,312230,GBR,2006-12-31,110448107,-999,3713.506,No,19317.672,5.274,0
3,14651,2008,British American Tobacco PLC,312230,GBR,2008-12-31,110448107,"40.276,807",3591.888,No,17721.152,11.038,0
4,14651,2009,British American Tobacco PLC,312230,GBR,2009-12-31,110448107,"43.026,854",4386.107,No,22970.074,8.961,0
8,14651,2013,British American Tobacco PLC,312230,GBR,2013-12-31,110448107,"44.552,569",6470.49,No,25291.923,12.756,0
9,14651,2014,British American Tobacco PLC,312230,GBR,2014-12-31,110448107,"40.762,953",4852.547,No,21764.024,18.184,0


In [17]:
# How many rows did we drop?
print(f"N before: {len(df)}")
print(f"N after:  {len(df_without_na_rows)}")

N before: 824
N after:  536


In [18]:
# The columns have not changed:
print(f"Columns before: {df.columns.values}")
print(f"Columns after: {df_without_na_rows.columns.values}")

Columns before: ['u_company_name_id' 'u_year' 'u_company_name' 'cb_naics' 'u_iso3' 'u_fye'
 'cb_cusip' 'cb_at' 'cb_ni' 'cb_financial_industry' 'cb_revt' 'employees'
 'missing_values']
Columns after: ['u_company_name_id' 'u_year' 'u_company_name' 'cb_naics' 'u_iso3' 'u_fye'
 'cb_cusip' 'cb_at' 'cb_ni' 'cb_financial_industry' 'cb_revt' 'employees'
 'missing_values']


In [19]:
# Alternatively, dropping the columns that have any missing values
df_without_na_cols = df.dropna(axis="columns")
df_without_na_cols.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 824 entries, 0 to 823
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   u_company_name_id      824 non-null    int64 
 1   u_year                 824 non-null    int64 
 2   u_company_name         824 non-null    object
 3   cb_naics               824 non-null    int64 
 4   u_iso3                 824 non-null    object
 5   u_fye                  824 non-null    object
 6   cb_cusip               824 non-null    object
 7   cb_at                  824 non-null    object
 8   cb_ni                  824 non-null    object
 9   cb_financial_industry  824 non-null    object
 10  missing_values         824 non-null    int64 
dtypes: int64(4), object(7)
memory usage: 70.9+ KB


In [20]:
# Which columns did we drop?
print(f"Columns before:  {df.columns.values}")
print(f"Columns after:   {df_without_na_cols.columns.values}")
print(f"Columns dropped: {set(df.columns) - set(df_without_na_cols.columns)}")

Columns before:  ['u_company_name_id' 'u_year' 'u_company_name' 'cb_naics' 'u_iso3' 'u_fye'
 'cb_cusip' 'cb_at' 'cb_ni' 'cb_financial_industry' 'cb_revt' 'employees'
 'missing_values']
Columns after:   ['u_company_name_id' 'u_year' 'u_company_name' 'cb_naics' 'u_iso3' 'u_fye'
 'cb_cusip' 'cb_at' 'cb_ni' 'cb_financial_industry' 'missing_values']
Columns dropped: {'cb_revt', 'employees'}
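
The cells above use `dropna()` with its defaults only. As a sketch of how the `subset`, `how` and `thresh` parameters limit what gets dropped, consider the following made-up DataFrame (`toy` and its column names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# hypothetical toy data, only for illustration
toy = pd.DataFrame(
    {
        "company": ["A", "B", "C", "D"],
        "revenue": [100.0, np.nan, np.nan, 250.0],
        "employees": [10.0, 12.0, np.nan, np.nan],
    }
)

# only rows with missing 'revenue' are dropped; NAs in 'employees' are ignored
needs_revenue = toy.dropna(subset=["revenue"])

# drop a row only if *all* of the listed columns are missing
any_figure = toy.dropna(subset=["revenue", "employees"], how="all")

# keep rows with at least 2 non-missing values (counted over all columns)
at_least_two = toy.dropna(thresh=2)

print(list(needs_revenue["company"]))  # ['A', 'D']
print(list(any_figure["company"]))     # ['A', 'B', 'D']
print(list(at_least_two["company"]))   # ['A', 'B', 'D']
```

Restricting `dropna()` to the columns an analysis actually uses keeps rows that would otherwise be lost because of gaps in irrelevant columns.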


# Exercise 1

1. Load the first sheet of the Excel file "wdi_wrong_types.xlsx" into a pandas DataFrame.
2. Which column has the most missing values?
3. Which rows have the most missing values?
4. How many cells are missing per row on average?
5. Create a copy of the DataFrame, dropping all **rows** that have at least one missing value.
6. Create a copy of the DataFrame, dropping all **columns** that have at least one missing value.
7. Bonus: Create a copy of the DataFrame, dropping all **columns** that have at least 30% missing values.
8. Bonus: Create a copy of the DataFrame, dropping all **rows** that have missing values in the following columns: *countryname* and *SP_URB_TOTL_IN_ZS*.
9. Bonus: Considering **'unusual' missing values** like 'NOT AVAILABLE' (in *SP_DYN_LE00_IN*) and -99 (in *CM_MKT_LCAP_CD*), produce a DataFrame with all **rows** removed that have at least one missing value.



# Replacing 'unusual' missing values

Besides empty cells, pandas recognizes a number of 'typical' placeholders as missing values when reading files, e.g. 'NA', 'N/A', etc. See, for example, the defaults for the argument `na_values` of `pandas.read_csv()` [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html?highlight=na_values#](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html?highlight=na_values#).

When you already know which other placeholders exist in your data, you can supply those to `na_values` and pandas will replace them with the appropriate NA value, typically `np.nan` or `pd.NA`.

In our case, however, we have already loaded the dataset, and a first inspection reveals unusual NA placeholders in two columns:

* *cb_at*: '-999' (note that the column was read as strings, so the placeholder is the string '-999', not the number)
* *cb_ni*: 'UNKNOWN'

We use the `replace` method to change these values into the appropriate NA values:

In [21]:
# before:
df.head()

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,cb_revt,employees,missing_values
0,14651,2005,British American Tobacco PLC,312230,GBR,2005-12-31,110448107,"32.737,984",UNKNOWN,No,,17.9,1
1,14651,2006,British American Tobacco PLC,312230,GBR,2006-12-31,110448107,-999,3713.506,No,19317.672,5.274,0
2,14651,2007,British American Tobacco PLC,312230,GBR,2007-12-31,110448107,-999,4226.559,No,,,2
3,14651,2008,British American Tobacco PLC,312230,GBR,2008-12-31,110448107,"40.276,807",3591.888,No,17721.152,11.038,0
4,14651,2009,British American Tobacco PLC,312230,GBR,2009-12-31,110448107,"43.026,854",4386.107,No,22970.074,8.961,0


In [22]:
# after (note: `missing_values` was computed earlier and does not count the NAs created here):
df["cb_at"] = df["cb_at"].replace("-999", pd.NA)
df["cb_ni"] = df["cb_ni"].replace("UNKNOWN", pd.NA)
df.head()

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,cb_revt,employees,missing_values
0,14651,2005,British American Tobacco PLC,312230,GBR,2005-12-31,110448107,"32.737,984",,No,,17.9,1
1,14651,2006,British American Tobacco PLC,312230,GBR,2006-12-31,110448107,,3713.506,No,19317.672,5.274,0
2,14651,2007,British American Tobacco PLC,312230,GBR,2007-12-31,110448107,,4226.559,No,,,2
3,14651,2008,British American Tobacco PLC,312230,GBR,2008-12-31,110448107,"40.276,807",3591.888,No,17721.152,11.038,0
4,14651,2009,British American Tobacco PLC,312230,GBR,2009-12-31,110448107,"43.026,854",4386.107,No,22970.074,8.961,0
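
Had the placeholders been known before loading, they could have been passed to `na_values` directly. The following sketch reads a tiny, made-up tab-separated text from memory instead of the workshop file; the `raw` string and its values are assumptions for illustration.

```python
import io

import pandas as pd

# hypothetical miniature of a tab-separated file with unusual placeholders
raw = "company\tcb_at\tcb_ni\nA\t-999\t3713.5\nB\t120.5\tUNKNOWN\n"

# a dict maps column names to the extra placeholders to treat as NA in that column
toy = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    na_values={"cb_at": ["-999"], "cb_ni": ["UNKNOWN"]},
)

print(toy.isna().sum())
print(toy.dtypes)
```

Because the placeholders are turned into NA while parsing, both columns in this toy example are read as numeric rather than as strings.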


# Backward and forward filling of missing values

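The methods `bfill()` and `ffill()` perform backward and forward fill operations, respectively: `bfill()` fills a missing value with the next valid value below it, while `ffill()` fills it with the last valid value above it. This is best explained by example: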

In [23]:
# look at the data again
df.head(10)

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,cb_revt,employees,missing_values
0,14651,2005,British American Tobacco PLC,312230,GBR,2005-12-31,110448107,"32.737,984",,No,,17.9,1
1,14651,2006,British American Tobacco PLC,312230,GBR,2006-12-31,110448107,,3713.506,No,19317.672,5.274,0
2,14651,2007,British American Tobacco PLC,312230,GBR,2007-12-31,110448107,,4226.559,No,,,2
3,14651,2008,British American Tobacco PLC,312230,GBR,2008-12-31,110448107,"40.276,807",3591.888,No,17721.152,11.038,0
4,14651,2009,British American Tobacco PLC,312230,GBR,2009-12-31,110448107,"43.026,854",4386.107,No,22970.074,8.961,0
5,14651,2010,British American Tobacco PLC,312230,GBR,2010-12-31,110448107,"42.882,112",4431.357,No,,4.549,1
6,14651,2011,British American Tobacco PLC,312230,GBR,2011-12-31,110448107,,4808.702,No,,13.998,1
7,14651,2012,British American Tobacco PLC,312230,GBR,2012-12-31,110448107,"44.439,167",6246.234,No,,12.07,1
8,14651,2013,British American Tobacco PLC,312230,GBR,2013-12-31,110448107,"44.552,569",6470.49,No,25291.923,12.756,0
9,14651,2014,British American Tobacco PLC,312230,GBR,2014-12-31,110448107,"40.762,953",4852.547,No,21764.024,18.184,0


In [24]:
# backfill
df.bfill().head(10)

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,cb_revt,employees,missing_values
0,14651,2005,British American Tobacco PLC,312230,GBR,2005-12-31,110448107,"32.737,984",3713.506,No,19317.672,17.9,1
1,14651,2006,British American Tobacco PLC,312230,GBR,2006-12-31,110448107,"40.276,807",3713.506,No,19317.672,5.274,0
2,14651,2007,British American Tobacco PLC,312230,GBR,2007-12-31,110448107,"40.276,807",4226.559,No,17721.152,11.038,2
3,14651,2008,British American Tobacco PLC,312230,GBR,2008-12-31,110448107,"40.276,807",3591.888,No,17721.152,11.038,0
4,14651,2009,British American Tobacco PLC,312230,GBR,2009-12-31,110448107,"43.026,854",4386.107,No,22970.074,8.961,0
5,14651,2010,British American Tobacco PLC,312230,GBR,2010-12-31,110448107,"42.882,112",4431.357,No,25291.923,4.549,1
6,14651,2011,British American Tobacco PLC,312230,GBR,2011-12-31,110448107,"44.439,167",4808.702,No,25291.923,13.998,1
7,14651,2012,British American Tobacco PLC,312230,GBR,2012-12-31,110448107,"44.439,167",6246.234,No,25291.923,12.07,1
8,14651,2013,British American Tobacco PLC,312230,GBR,2013-12-31,110448107,"44.552,569",6470.49,No,25291.923,12.756,0
9,14651,2014,British American Tobacco PLC,312230,GBR,2014-12-31,110448107,"40.762,953",4852.547,No,21764.024,18.184,0


In [25]:
# forward fill
df.ffill().head(10)

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,cb_revt,employees,missing_values
0,14651,2005,British American Tobacco PLC,312230,GBR,2005-12-31,110448107,"32.737,984",,No,,17.9,1
1,14651,2006,British American Tobacco PLC,312230,GBR,2006-12-31,110448107,"32.737,984",3713.506,No,19317.672,5.274,0
2,14651,2007,British American Tobacco PLC,312230,GBR,2007-12-31,110448107,"32.737,984",4226.559,No,19317.672,5.274,2
3,14651,2008,British American Tobacco PLC,312230,GBR,2008-12-31,110448107,"40.276,807",3591.888,No,17721.152,11.038,0
4,14651,2009,British American Tobacco PLC,312230,GBR,2009-12-31,110448107,"43.026,854",4386.107,No,22970.074,8.961,0
5,14651,2010,British American Tobacco PLC,312230,GBR,2010-12-31,110448107,"42.882,112",4431.357,No,22970.074,4.549,1
6,14651,2011,British American Tobacco PLC,312230,GBR,2011-12-31,110448107,"42.882,112",4808.702,No,22970.074,13.998,1
7,14651,2012,British American Tobacco PLC,312230,GBR,2012-12-31,110448107,"44.439,167",6246.234,No,22970.074,12.07,1
8,14651,2013,British American Tobacco PLC,312230,GBR,2013-12-31,110448107,"44.552,569",6470.49,No,25291.923,12.756,0
9,14651,2014,British American Tobacco PLC,312230,GBR,2014-12-31,110448107,"40.762,953",4852.547,No,21764.024,18.184,0


In [26]:
# forward fill but not more than one row at a time:
df.ffill(limit=1).head(10)

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,cb_revt,employees,missing_values
0,14651,2005,British American Tobacco PLC,312230,GBR,2005-12-31,110448107,"32.737,984",,No,,17.9,1
1,14651,2006,British American Tobacco PLC,312230,GBR,2006-12-31,110448107,"32.737,984",3713.506,No,19317.672,5.274,0
2,14651,2007,British American Tobacco PLC,312230,GBR,2007-12-31,110448107,,4226.559,No,19317.672,5.274,2
3,14651,2008,British American Tobacco PLC,312230,GBR,2008-12-31,110448107,"40.276,807",3591.888,No,17721.152,11.038,0
4,14651,2009,British American Tobacco PLC,312230,GBR,2009-12-31,110448107,"43.026,854",4386.107,No,22970.074,8.961,0
5,14651,2010,British American Tobacco PLC,312230,GBR,2010-12-31,110448107,"42.882,112",4431.357,No,22970.074,4.549,1
6,14651,2011,British American Tobacco PLC,312230,GBR,2011-12-31,110448107,"42.882,112",4808.702,No,,13.998,1
7,14651,2012,British American Tobacco PLC,312230,GBR,2012-12-31,110448107,"44.439,167",6246.234,No,,12.07,1
8,14651,2013,British American Tobacco PLC,312230,GBR,2013-12-31,110448107,"44.552,569",6470.49,No,25291.923,12.756,0
9,14651,2014,British American Tobacco PLC,312230,GBR,2014-12-31,110448107,"40.762,953",4852.547,No,21764.024,18.184,0
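
The outputs above also show that a forward fill cannot close a gap at the very start of a column (see *cb_ni* in row 0), because there is no earlier value to carry forward. A compact sketch on a made-up Series (`s` is illustrative only) shows the three variants side by side:

```python
import numpy as np
import pandas as pd

# hypothetical toy data, only for illustration
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()          # [nan, 1, 1, 1, 4, 4]   leading gap stays
backward = s.bfill()         # [1, 1, 4, 4, 4, nan]   trailing gap stays
limited = s.ffill(limit=1)   # [nan, 1, 1, nan, 4, 4] at most one fill per gap

print(pd.DataFrame({"s": s, "ffill": forward, "bfill": backward, "ffill_limit_1": limited}))
```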


In [27]:
# now, what happens if we have different companies in our dataset?

# before forward fill:
df.iloc[10:20, :]

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,cb_revt,employees,missing_values
10,14651,2015,British American Tobacco PLC,312230,GBR,2015-12-31,110448107,"46.472,019",6326.034,No,,14.761,1
11,14651,2016,British American Tobacco PLC,312230,GBR,2016-12-31,110448107,"49.067,950",5734.238,No,18198.309,18.682,0
12,14651,2017,British American Tobacco PLC,312230,GBR,2017-12-31,110448107,"190.810,310",,No,27453.047,14.402,0
13,14651,2018,British American Tobacco PLC,312230,GBR,2018-12-31,110448107,"186.512,879",7687.784,No,31215.054,10.44,0
14,14651,2019,British American Tobacco PLC,312230,GBR,2019-12-31,110448107,"186.747,022",7554.378,No,34271.499,,1
15,13722,2005,Bob Evans Farms Inc.,311612,USA,2006-04-30,96761101,"1.209,183",54.774,No,1584.819,,1
16,13722,2006,Bob Evans Farms Inc.,311612,USA,2007-04-30,96761101,"1.196,962",60.542,No,1654.46,,1
17,13722,2007,Bob Evans Farms Inc.,311612,USA,2008-04-30,96761101,,64.876,No,1737.026,2345.0,0
18,13722,2008,Bob Evans Farms Inc.,311612,USA,2009-04-30,96761101,"1.147,648",,No,1750.512,15882.0,0
19,13722,2009,Bob Evans Farms Inc.,311612,USA,2010-04-30,96761101,"1.109,157",,No,1726.804,,1


In [28]:
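# after forward fill: values leak across companies!
# (e.g. 'employees' in rows 15 and 16 is carried over from British American Tobacco)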
df.ffill().iloc[10:20, :]

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,cb_revt,employees,missing_values
10,14651,2015,British American Tobacco PLC,312230,GBR,2015-12-31,110448107,"46.472,019",6326.034,No,21764.024,14.761,1
11,14651,2016,British American Tobacco PLC,312230,GBR,2016-12-31,110448107,"49.067,950",5734.238,No,18198.309,18.682,0
12,14651,2017,British American Tobacco PLC,312230,GBR,2017-12-31,110448107,"190.810,310",5734.238,No,27453.047,14.402,0
13,14651,2018,British American Tobacco PLC,312230,GBR,2018-12-31,110448107,"186.512,879",7687.784,No,31215.054,10.44,0
14,14651,2019,British American Tobacco PLC,312230,GBR,2019-12-31,110448107,"186.747,022",7554.378,No,34271.499,10.44,1
15,13722,2005,Bob Evans Farms Inc.,311612,USA,2006-04-30,96761101,"1.209,183",54.774,No,1584.819,10.44,1
16,13722,2006,Bob Evans Farms Inc.,311612,USA,2007-04-30,96761101,"1.196,962",60.542,No,1654.46,10.44,1
17,13722,2007,Bob Evans Farms Inc.,311612,USA,2008-04-30,96761101,"1.196,962",64.876,No,1737.026,2345.0,0
18,13722,2008,Bob Evans Farms Inc.,311612,USA,2009-04-30,96761101,"1.147,648",64.876,No,1750.512,15882.0,0
19,13722,2009,Bob Evans Farms Inc.,311612,USA,2010-04-30,96761101,"1.109,157",64.876,No,1726.804,15882.0,1


In [29]:
# we can avoid this by partitioning the DataFrame using the `groupby()` method
df.groupby("u_company_name_id").ffill().iloc[10:20, :]

Unnamed: 0,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,cb_revt,employees,missing_values
10,2015,British American Tobacco PLC,312230,GBR,2015-12-31,110448107,"46.472,019",6326.034,No,21764.024,14.761,1
11,2016,British American Tobacco PLC,312230,GBR,2016-12-31,110448107,"49.067,950",5734.238,No,18198.309,18.682,0
12,2017,British American Tobacco PLC,312230,GBR,2017-12-31,110448107,"190.810,310",5734.238,No,27453.047,14.402,0
13,2018,British American Tobacco PLC,312230,GBR,2018-12-31,110448107,"186.512,879",7687.784,No,31215.054,10.44,0
14,2019,British American Tobacco PLC,312230,GBR,2019-12-31,110448107,"186.747,022",7554.378,No,34271.499,10.44,1
15,2005,Bob Evans Farms Inc.,311612,USA,2006-04-30,96761101,"1.209,183",54.774,No,1584.819,,1
16,2006,Bob Evans Farms Inc.,311612,USA,2007-04-30,96761101,"1.196,962",60.542,No,1654.46,,1
17,2007,Bob Evans Farms Inc.,311612,USA,2008-04-30,96761101,"1.196,962",64.876,No,1737.026,2345.0,0
18,2008,Bob Evans Farms Inc.,311612,USA,2009-04-30,96761101,"1.147,648",64.876,No,1750.512,15882.0,0
19,2009,Bob Evans Farms Inc.,311612,USA,2010-04-30,96761101,"1.109,157",64.876,No,1726.804,15882.0,1
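
Note that the result of `groupby(...).ffill()` above no longer contains the grouping column *u_company_name_id*. One way to keep the full DataFrame is to fill only selected columns and assign them back. The sketch below uses made-up data; `toy`, `value_cols` and the column names are assumptions for illustration, and it presumes the rows are already sorted by year within each company.

```python
import numpy as np
import pandas as pd

# hypothetical toy data, only for illustration
toy = pd.DataFrame(
    {
        "company_id": [1, 1, 1, 2, 2],
        "year": [2005, 2006, 2007, 2005, 2006],
        "employees": [10.0, np.nan, 12.0, np.nan, 50.0],
    }
)

value_cols = ["employees"]  # columns that should be forward filled

filled = toy.copy()
# fill within each company only, then write the result back into the copy
filled[value_cols] = toy.groupby("company_id")[value_cols].ffill()

print(filled)
```

The first row of company 2 stays missing, since no earlier value from the same company exists.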


# Filling NA with arbitrary placeholders

This is rarely useful but sometimes required when exporting datasets to other software that does not support 'true' missing values (e.g. SPSS). Then we might have to replace missing values with a 'special' value, such as -9999.

`fillna()`, when provided with a `value`, will perform this task:

In [30]:
# fill missing values with single value
df.fillna(-9999).head(3)

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,cb_revt,employees,missing_values
0,14651,2005,British American Tobacco PLC,312230,GBR,2005-12-31,110448107,"32.737,984",-9999.0,No,-9999.0,17.9,1
1,14651,2006,British American Tobacco PLC,312230,GBR,2006-12-31,110448107,-9999,3713.506,No,19317.672,5.274,0
2,14651,2007,British American Tobacco PLC,312230,GBR,2007-12-31,110448107,-9999,4226.559,No,-9999.0,-9999.0,2


In [31]:
# fill missing values with different single values per column (and only the columns specified)
df.fillna(value={"cb_at": 0, "cb_ni": -9999}).head(3)

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,cb_revt,employees,missing_values
0,14651,2005,British American Tobacco PLC,312230,GBR,2005-12-31,110448107,"32.737,984",-9999.0,No,,17.9,1
1,14651,2006,British American Tobacco PLC,312230,GBR,2006-12-31,110448107,0,3713.506,No,19317.672,5.274,0
2,14651,2007,British American Tobacco PLC,312230,GBR,2007-12-31,110448107,0,4226.559,No,,,2


# Imputing NA with mean/median

In [32]:
# imputation using the grand mean
df["cb_revt"].fillna(df["cb_revt"].mean()).head(3)

0    39611.380448
1    19317.672000
2    39611.380448
Name: cb_revt, dtype: float64

In [33]:
# imputation using a groupwise median
df["cb_revt"].fillna(df.groupby("u_company_name_id")["cb_revt"].transform("median")).head(3)

0    22970.074
1    19317.672
2    22970.074
Name: cb_revt, dtype: float64
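
The groupwise variant works because `transform("median")` returns a Series with the same index as the original column, in which every row holds the median of its own group. `fillna()` then picks the value at the matching index for each gap. A small sketch on made-up data (`toy` and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# hypothetical toy data, only for illustration
toy = pd.DataFrame(
    {
        "company_id": [1, 1, 1, 2, 2, 2],
        "revenue": [10.0, np.nan, 30.0, 100.0, 300.0, np.nan],
    }
)

# one median per company, repeated for every row of that company
group_median = toy.groupby("company_id")["revenue"].transform("median")

# gaps are filled with the median of the row's own company
imputed = toy["revenue"].fillna(group_median)

print(group_median.tolist())  # [20.0, 20.0, 20.0, 200.0, 200.0, 200.0]
print(imputed.tolist())       # [10.0, 20.0, 30.0, 100.0, 300.0, 200.0]
```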

# Outlook: Imputing missing values using prediction models

There are many model-based methods for filling missing values which, unfortunately, are beyond the scope of this workshop. If you would like an overview and some good examples using the popular **`scikit-learn`** package, please refer to this website:
[https://scikit-learn.org/stable/modules/impute.html](https://scikit-learn.org/stable/modules/impute.html)

In [34]:
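from sklearn.impute import KNNImputer

# impute the float columns from the 5 most similar rows (nearest neighbours)
imputer = KNNImputer(n_neighbors=5)
df_imputed = df.copy()
float_cols = df.select_dtypes("float").columns
df_imputed[float_cols] = imputer.fit_transform(df[float_cols])
# rows in which all float columns are missing have no defined distance to any other row;
# KNNImputer then falls back to the column mean (compare cb_revt below with its mean above)
df_imputed.loc[df.missing_values.nlargest(5).index, :]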

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,cb_revt,employees,missing_values
2,14651,2007,British American Tobacco PLC,312230,GBR,2007-12-31,110448107,,4226.559,No,39611.380448,2568.891131,2
49,10231,2012,BT Group PLC,517311,GBR,2013-03-31,05577E101,"37.718,142",3173.818,No,39611.380448,2568.891131,2
73,56071,2007,LVMH Moet Hennessy Louis Vuitton SE,3152,FRA,2007-12-31,502441306,"44.891,082",2957.108,No,39611.380448,2568.891131,2
124,29114,2014,EOG Resources Inc.,2111,USA,2014-12-31,26875P101,"34.762,687",,No,39611.380448,2568.891131,2
158,109031,2013,voxeljet AG,333244,DEU,2013-12-31,92912L107,79802,-3.74,No,39611.380448,2568.891131,2


# Exercise 2

Continue with the data from the previous exercise (i.e. the first sheet of the Excel file "wdi_wrong_types.xlsx" loaded into a pandas DataFrame).

1. Create a copy of the DataFrame, filling missing values in the column *SP_URB_TOTL_IN_ZS* using a backward fill.
2. Create a copy of the DataFrame, filling missing values in the column *SP_URB_TOTL_IN_ZS* using a forward fill.
3. Create a copy of the DataFrame, filling missing values in the column *SP_URB_TOTL_IN_ZS* with the grand mean.
4. Create a copy of the DataFrame, filling missing values in the column *SP_URB_TOTL_IN_ZS* with the country-specific median.