## **What are we talking about? => Missing values**


![Missing value](http://i.imgur.com/45zNc.jpg)


Today we're going to talk about "missing values", also called "missing data". A value is missing when a survey has no data for a respondent, or when the answer to a particular question is unknown for that respondent. This can happen, for example, when the respondent refused to answer or when the researcher failed to record the answer.

They can be represented by an empty cell in your data, a specific symbol like a dash ( - ), a marker such as NA/NaN, or a specific number.

![Mv example](https://www.logianalytics.com/wp-content/uploads/2019/06/Missing-values-2.png)
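
As a small illustration of these different markers, here is a minimal sketch that reads a made-up survey file in which a dash, an empty cell and the number -9 all mean "no answer". The data and column names are invented for the example; `na_values` is the `pd.read_csv` parameter that lists extra markers to treat as missing.

```python
import io
import pandas as pd

# Hypothetical toy survey file: a dash, an empty cell and -9 all mean "no answer"
raw = io.StringIO(
    "respondent,age,vote\n"
    "1,34,1\n"
    "2,-,2\n"
    "3,51,\n"
    "4,-9,1\n"
)

# na_values lists the extra markers that pandas should read as missing.
# Empty cells are already treated as missing by default.
toy = pd.read_csv(raw, na_values=["-", "-9"])

print(toy)
print(toy.isna().sum())
```

Note that `na_values` given as a list applies to every column, so it is only safe when those markers never represent a real answer anywhere in the file.
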

For example, in the ANES dataset, the codebook entries for the variables you want to use show that values below zero usually stand for special answers like "not applicable" or "refused to answer". These negative codes are the missing values.
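
To make the idea of negative codes concrete, here is a minimal sketch on an invented column (the values and the name `toy_question` are made up). It assumes that every negative code really means "missing", which you should always check in the codebook first.

```python
import pandas as pd

# Hypothetical answers: positive codes are real answers, negative codes are special
answers = pd.Series([1, 2, -9, 1, -8, 2, 1], name="toy_question")

# Count how many answers use a negative (special) code
n_special = (answers < 0).sum()

# .where keeps the values that satisfy the condition and sets the rest to NaN
cleaned = answers.where(answers >= 0)

print(n_special)
print(cleaned.isna().sum())
```
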





## **Why do missing values matter?**

Up to now we have largely ignored them. Here we will see why it is better to treat them as meaningful data, how handling them properly helps avoid bias, and how missing values can be turned into useful information.

Not taking care of missing values can be problematic:


*   You may have too many missing values to be able to use your data.
*   You may skew your data by erasing important information.

This is why here we're gonna learn about how to manipulate these missing values, and learn about "imputing".


## How can we deal with them?

We're gonna use the "Who do you think is going to win" variable to illustrate how we can manipulate missing values.

In [None]:
# First we do the usual setup
# Load Pandas
import pandas as pd

# Import the dataset
data_url = "https://raw.githubusercontent.com/datamisc/ts-2020/main/data.csv"
anes_data  = pd.read_csv(data_url, compression='gzip')

In [None]:
# Choose the variables we want to use

my_vars = [
    "V201217",  # betting on a winner / winbet
    "V201200",  # liberal-conservative self-placement / ideology
    ]

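# .copy() gives us an independent dataframe, so that modifying it later
# does not trigger a SettingWithCopyWarning
df = anes_data[my_vars].copy()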

df.columns = ["winbet", "ideology"]
df

In [None]:
# When checking the winbet value we can see that there are missing values
# => -9 and -8 
df["winbet"].value_counts()

In [None]:
# To deal with them, we can use the .replace method
# By replacing the missing values with pd.NA, we are designating these values
# as missing values.

df["winbet"] = df["winbet"].replace(-9, pd.NA)
df["winbet"].value_counts()

In [None]:
# You can use lists to go quicker and directly replace all your missing values

df["winbet"] = df["winbet"].replace([-9, -8], pd.NA)
df["winbet"].value_counts()
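
A side note on the replacement value: depending on your pandas version, replacing values with `pd.NA` in an integer column can leave the column with a generic `object` dtype, which may make numeric summaries such as `.median()` or `.describe()` behave unexpectedly. One possible alternative, sketched here on invented data, is to use `np.nan`, which keeps the column numeric (it becomes `float64`).

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing-value codes
toy = pd.Series([1, 2, -9, 1, -8, 2])

# np.nan is a float, so the column becomes float64 and stays numeric
with_nan = toy.replace([-9, -8], np.nan)

print(with_nan.dtype)
print(with_nan.median())
```
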


In [None]:
# You can also use .replace with pd.NA on the whole dataframe if needed.

# It is useful, but you need to choose carefully which values you are targeting
# For example, here we are targeting 99 from the ideology variable,
# but 99 may very well be a valid value for another variable of your dataset

df = df.replace([-9, -8, 99], pd.NA)
df.describe()
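
If the same number is a missing-value code in one column but a real value in another, one way to stay safe is to pass `.replace` a nested dictionary of the form `{column: {old_value: new_value}}`, so that each code is only replaced in the column it belongs to. Here is a minimal sketch with invented data (the `age` column is hypothetical and is not part of our dataframe):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ideology": [1, 4, 99, 7],   # here 99 is a missing-value code
    "age": [25, 99, 40, 63],     # here 99 is a real age
})

# Nested dict: only look for 99 inside the "ideology" column
cleaned = toy.replace({"ideology": {99: np.nan}})

print(cleaned)
```
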

In [None]:
# You can change the "dropna" parameter to make them appear in your value_counts
# It allows you to quickly compare them to your existing values

df["winbet"].value_counts(dropna=False)

In [None]:
# If you only want to see the number of missing values, use the sum method

df.isna().sum()

## Using .isna

In [None]:
# .isna() allows you to directly check which values are missing.

df["winbet"].isna()

# "True" means it is a missing value, and "False" means the opposite.

In [None]:
df["winbet"].isna().value_counts()

In [None]:
# .isna is useful when trying to target these values
# For example we can now compute the share of missing values (a number between 0 and 1)

df["winbet"].isna().mean()
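
The same trick works on a whole dataframe. As a minimal sketch on invented data, the mean of `.isna()` gives one share per column, and multiplying by 100 turns it into a percentage:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "winbet": [1, 2, np.nan, 1],
    "ideology": [4, np.nan, np.nan, 6],
})

# The mean of a True/False column is the share of True values (between 0 and 1)
share_missing = toy.isna().mean()

# Multiply by 100 to read it as a percentage
percent_missing = (share_missing * 100).round(1)

print(percent_missing)
```
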

## Imputing the mean/median

"Imputing" is the process of replacing missing data with substituted values, to avoid the potential biases created by your missing data. After designating the missing values in Python, you can replace them with values derived from the data you do have, which makes your data more usable. Done right, this should not affect your overall observations.

Here we will talk about how we can impute the median in our variable, but in other cases you may have to impute the mean.

If the variable is discrete, then you want to use the median. If it is continuous, then the mean will make more sense.

Warning: imputing affects your data and may skew it, as we will see later on.
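
To get a feel for this warning, here is a minimal sketch on a tiny invented series. It compares the mean, median and standard deviation before and after imputing the median. In this toy case the median stays the same, while the mean moves towards the median and the spread shrinks, because the imputed rows all receive the same value.

```python
import numpy as np
import pandas as pd

# Hypothetical variable with two missing values
toy = pd.Series([1, 1, 2, 2, 5, np.nan, np.nan])

# Impute the median of the observed values
filled = toy.fillna(toy.median())

# Compare a few summary statistics before and after imputing
summary = pd.DataFrame({
    "before": toy.agg(["mean", "median", "std"]),
    "after": filled.agg(["mean", "median", "std"]),
})

print(summary.round(2))
```
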

In [None]:
# Our missing values can now be imputed with the median as it's a discrete variable

df.describe()

In [None]:
# For this variable we will impute the median.
# First we need to create an object with our median
my_median = df["winbet"].median()
my_median

In [None]:
# Next we need to create a mask with the .isna method to target our missing
# values. Then we assign the median to the rows selected by the mask.
mask = df["winbet"].isna()
df.loc[mask, "winbet"] = my_median

In [None]:
# Now if we check again with the mean, we can see there are no longer
# any missing values in our data.
df.isna().mean()
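
Once the values are imputed, you can no longer tell which rows were originally missing. If that information matters for your analysis, one option is to store it in a separate column before imputing. This is a minimal sketch on invented data, and the column name `winbet_missing` is just a hypothetical choice:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"winbet": [1, 2, np.nan, 1, np.nan]})

# Record which rows were missing BEFORE imputing
toy["winbet_missing"] = toy["winbet"].isna()

# Then impute the median as usual
toy["winbet"] = toy["winbet"].fillna(toy["winbet"].median())

print(toy)
```
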

In [None]:
# You can also use the .fillna method, which lets you turn all the designated
# missing values into something else in one step.
# It can be applied to the whole dataframe if needed: df.fillna(df.median())

# (fillna can also be used with the mean)

 
df["winbet"] = df["winbet"].fillna(df["winbet"].median())
df["winbet"].isna().value_counts()
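
The whole-dataframe version mentioned above can be sketched like this, on invented data. `toy.median()` returns one median per column, and `.fillna` matches each median to its column by name:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "winbet": [1, 2, np.nan, 1],
    "ideology": [2, np.nan, 4, 6],
})

# Each column is filled with its own median
filled = toy.fillna(toy.median())

print(filled)
```

This only makes sense when the median is a reasonable substitute for every column you are filling.
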

In [None]:
# And as you can see the median stayed the same, but the mean was affected.
# As such there is no perfect solution for imputing missing values

df["winbet"].describe()

In [None]:
# You might also just want to delete the rows that contain missing values.
# You can easily do it with .dropna and adjust its parameters to your liking

# how="any" drops a row if at least one of its values is missing,
# how="all" drops a row only if all of its values are missing

df.dropna(how="any")
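
Here is a minimal sketch, on invented data, of how the `how` and `subset` parameters of `.dropna` change which rows are removed. Note that `.dropna` returns a new dataframe and leaves the original untouched.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "winbet": [1, np.nan, np.nan, 2],
    "ideology": [4, 5, np.nan, np.nan],
})

# Drop a row as soon as one of its values is missing
any_dropped = toy.dropna(how="any")

# Drop a row only if all of its values are missing
all_dropped = toy.dropna(how="all")

# Only look at the winbet column when deciding which rows to drop
subset_dropped = toy.dropna(subset=["winbet"])

print(len(toy), len(any_dropped), len(all_dropped), len(subset_dropped))
```
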

## Resources


- [Link to resource 1](https://www.freecodecamp.org/news/the-penalty-of-missing-values-in-data-science-91b756f95a32/#:~:text=Missing%20values%20affect%20our%20performance%20and%20predictive%20capacity.,affects%20our%20statistics.%20Conclusions%20can%20thus%20be%20misleading.): Useful info on why handling missing data matters
- [Link to resource 2](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isna.html): Guide to using .isna
- [Link to resource 3](https://towardsdatascience.com/handling-missing-values-with-pandas-b876bf6f008f): A more complete guide on handling missing values and imputing, with a lot of useful methods
