In [None]:
from lec_utils import *

<div class="alert alert-info" markdown="1">

#### Lecture 7

# EDA, Visualization, and Missing Value Imputation

### EECS 398: Practical Data Science, Winter 2025

<small><a style="text-decoration: none" href="https://practicaldsc.org">practicaldsc.org</a> • <a style="text-decoration: none" href="https://github.com/practicaldsc/wn25">github.com/practicaldsc/wn25</a> • 📣 See latest announcements [**here on Ed**](https://edstem.org/us/courses/69737/discussion/5943734) </small>
    
</div>

### Agenda 📆

- Exploratory data analysis 🔎.
- Visualization 📊.
- Missing value imputation 🕳️.

<div class="alert alert-warning">
    <h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
    
<small>Remember that you can always ask questions anonymously at the link above!</small>

<br>

How long does Homework 3 feel compared to Homework 2?
    
- A. Way shorter.
- B. Shorter.
- C. About the same.
- D. Longer.
- E. Way longer.

</div>

## Exploratory data analysis 🔎

---

### Loading the data 🏦

- Recall from last class that [LendingClub](https://www.lendingclub.com/) is a platform that allows individuals to borrow money – that is, take on **loans**.

- Each row of our dataset contains information about a loan holder, i.e. someone who was already approved for a loan.

In [None]:
# Run this cell to perform the data cleaning steps we implemented last lecture.
def clean_term_column(df):
    return df.assign(
        term=df['term'].str.split().str[0].astype(int)
    )
def clean_date_column(df):
    return (
        df
        .assign(date=pd.to_datetime(df['issue_d'], format='%b-%Y'))
        .drop(columns=['issue_d'])
    )
loans = (
    pd.read_csv('data/loans.csv')
    .pipe(clean_term_column)
    .pipe(clean_date_column)
)
loans

### Lender decision-making

- Larger interest rates make the loan more expensive for the borrower – as a borrower, you want a lower interest rate!

- Even for the same loan amount, different borrowers were approved for different terms and interest rates:

In [None]:
display_df(loans.loc[loans['loan_amnt'] == 3600, ['loan_amnt', 'term', 'int_rate']], rows=17)

- **Why do different borrowers receive different terms and interest rates?**

### Exploratory data analysis (EDA)

- Historically, data analysis was dominated by formal statistics, including tools like confidence intervals, hypothesis tests, and statistical modeling.

- In 1977, John Tukey [coined](https://search.worldcat.org/title/3058187) the term **exploratory data analysis**, which describes a philosophy for approaching data analysis:

    > Exploratory data analysis is actively incisive, rather than passively descriptive, with real emphasis on the discovery of the unexpected.

- Practically, EDA involves, among other things, computing summary statistics and drawing plots to understand the nature of the data at hand.

    > The greatest gains from data come from surprises… The unexpected is best brought to our attention by **pictures**.

- Today, we'll learn how to ask questions, visualize, and deal with missing values.

### Individuals and features

<center><img src="imgs/indiv-feat.png" width=1200></center>

- <span style="color:#6d9eeb"><b>Individual (row)</b></span>: Person/place/thing for which data is recorded. Also called an **observation**.

- <span style="color:#ff9900"><b>Feature (column)</b></span>: Something that is recorded for each individual. Also called a **variable** or **attribute**.<br><small>Here, "variable" doesn't mean Python variable!</small>

- There are two key types of features:
    - **Numerical features**: It makes sense to do arithmetic with the values.
    - **Categorical features**: Values fall into categories, which may or may not have some order to them.

- Be careful: sometimes numerical features are stored as strings, and categorical features are stored as numbers.<br><small>The transformations we discussed last class can help fix these issues.</small>
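- As a minimal sketch of the last point, the example below fixes both kinds of mismatch. The DataFrame and its `'rate'` and `'zip_code'` columns are hypothetical and are not part of `loans`.

```python
import pandas as pd

# Hypothetical data: 'rate' is numerical but stored as strings,
# and 'zip_code' is categorical but stored as integers.
df = pd.DataFrame({
    'rate': ['13.5%', '7.9%', '21.0%'],
    'zip_code': [48104, 48109, 10001],
})

cleaned = df.assign(
    # Strip the percent sign, then convert to floats so arithmetic makes sense.
    rate=df['rate'].str.strip('%').astype(float),
    # Store ZIP codes as strings so nobody is tempted to average them.
    zip_code=df['zip_code'].astype(str),
)

print(cleaned.dtypes)
print(cleaned['rate'].mean())
```

- A quick check of `.dtypes` after loading a dataset is often enough to spot features whose storage type doesn't match their feature type.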

### Feature types

<center><img src="imgs/features.png" width=1200></center>

### Asking questions 🙋

- EDA involves asking questions about the data without any preconceived notions of what the answers may be.

- With practice, you'll learn which questions to ask; we'll demonstrate some of that here.

### Who do we have data for?

- We were told that we're only looking at approved loans. What's the distribution of `'loan_status'`es?

In [None]:
loans.columns 

In [None]:
loans['loan_status'].value_counts() 

- Are the loans in our dataset specific to any particular state?

In [None]:
loans['addr_state'].value_counts().head()

### What is the distribution of loan amounts?

In [None]:
loans['loan_amnt'].describe()

In [None]:
loans['loan_amnt'].plot(kind='hist', nbins=10)

### Are related features in agreement?

- Why are there two columns with credit scores, `'fico_range_low'` and `'fico_range_high'`? What do they both mean?

In [None]:
loans[['fico_range_low', 'fico_range_high']] 

In [None]:
(loans['fico_range_high'] - loans['fico_range_low']).value_counts() 

- Does every `'sub_grade'` align with its related `'grade'`?

In [None]:
loans[['grade', 'sub_grade']] 

In [None]:
# Turns out, the answer is yes!
# The .str accessor allows us to use the [0] operation
# on every string in loans['sub_grade'].
(loans['sub_grade'].str[0] == loans['grade']).all() 

## Visualization 📊

---

### Visualizations complement statistics

<center><img src="imgs/dino.png" width=800></center>

<center>These 13 scatter plots, taken from <a href="https://www.research.autodesk.com/publications/same-stats-different-graphs/">here</a>, all have the same means and <br>standard deviations of $x$ and $y$, and the same correlations. But they look very different!</center>

### Why visualize?

- In this lecture, we will create several visualizations using just a single dataset.

<center><img src="imgs/array.png" width=1300></center>

- The visualizations we want to create will often dictate the data cleaning steps we take.<br><small>For example, we can't plot the average interest rate over time without converting dates to timestamp objects!</small>

- One reason to create visualizations is for **us** to better understand our data.

- Another reason is to **_accurately communicate_ a message to other people**!

<center><img src="imgs/bad-india.jpg" width=600>

</center>

<br>

<center>What's wrong with <a href="https://x.com/JeffDean/status/1291613522942504962">this visualization</a>?</center>

### `plotly`

- We've used `plotly` in lecture briefly, and you've even used it in Homework 1 and Homework 3, but we've never formally discussed it.

- It's a visualization library that enables **interactive** visualizations.

<center><img src="imgs/plotly.png" width=300></center>

- One way to use `plotly` is through the `plotly.express` interface.
    - `plotly` is very flexible but it can be verbose; `plotly.express` allows us to make plots quickly.
    - See the [**documentation here**](https://plotly.com/python/plotly-express) – it's very rich.<br><small>There are good examples for almost everything!</small>

In [None]:
import plotly.express as px

- Alternatively, we can use `plotly` by setting the `pandas` plotting backend to `'plotly'` and using the DataFrame/Series `plot` method.<br><small>By default, the plotting backend is `matplotlib`, which creates non-interactive visualizations.</small>

In [None]:
pd.options.plotting.backend = 'plotly'

### Choosing the correct type of visualization

- The type of visualization we create depends on the types of features we're visualizing.

- We'll directly learn how to produce the **bolded** visualizations below, but the others are also options.<br><small>See more examples of visualization types [**here**](https://learningds.org/ch/10/eda_feature_types.html#the-importance-of-feature-types).</small>

| Feature types | Options |
| --- | --- |
| Single categorical feature | **Bar charts**, pie charts, dot plots |
| Single numerical feature | **Histograms**, **box plots**, density curves,<br>rug plots, **violin plots**  |
| Two numerical features | **Scatter plots**, **line plots**, heat maps,<br> contour plots |
| One categorical and one numerical feature<br><small>It really depends on the nature of the features themselves!</small> | **Side-by-side** histograms, **box plots**, or **bar charts**,<br> overlaid line plots or density curves|

- Note that we use the words "plot", "chart", and "graph" to mean the same thing.

- Now, we're going to look at several examples. Focus on _what_ is being visualized and _why_; read the notebook later for the _how_.

### Bar charts

- Bar charts are used to show:
    - The distribution of a single categorical feature, or
    - The relationship between one categorical feature and one numerical feature.

- Usage: `px.bar` (with `orientation='h'` for horizontal bars) or `df.plot(kind='bar')` / `df.plot(kind='barh')`.<br><small>`'h'` stands for "horizontal."</small>

- Example: What is the distribution of `'addr_state'`s in `loans`?

In [None]:
# Here, we're using the .plot method on loans['addr_state'], which is a Series.
# We prefer horizontal bar charts, since they're easier to read.
(
    loans['addr_state']
    .value_counts()
    .plot(kind='barh')
)

In [None]:
# A little formatting goes a long way!
(
    loans['addr_state']
    .value_counts(normalize=True)
    .head(10)
    .sort_values()
    .plot(kind='barh', title='States of Residence for Successful Loan Applicants')
)

- Example: What is the average `'int_rate'` for each `'home_ownership'` status?

In [None]:
(
    loans
    .groupby('home_ownership')
    ['int_rate']
    .mean()
    .plot(kind='barh', title='Average Interest Rate by Home Ownership Status')
)

In [None]:
# The "ANY" category seems to be an outlier.
loans['home_ownership'].value_counts()

### Side-by-side bar charts

- Instead of just looking at `'int_rate'`s for different `'home_ownership'` statuses, we could group by loan `'term'`s, too. As we'll see, `'term'` impacts `'int_rate'` far more than `'home_ownership'`.

In [None]:
(
    loans
    .groupby('home_ownership')
    .filter(lambda df: df.shape[0] > 1) # Gets rid of the "ANY" category.
    .groupby(['home_ownership', 'term'])
    [['int_rate']]
    .mean()
)
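- The `groupby` `filter` step above can be seen in isolation in the sketch below, which uses a small hypothetical DataFrame rather than `loans`.

```python
import pandas as pd

# Hypothetical data: the 'ANY' category appears only once.
df = pd.DataFrame({
    'home_ownership': ['RENT', 'RENT', 'OWN', 'OWN', 'ANY'],
    'int_rate': [12.0, 14.0, 9.0, 11.0, 30.0],
})

# filter calls the function once per group and keeps all rows
# of the groups for which it returns True.
kept = df.groupby('home_ownership').filter(lambda g: g.shape[0] > 1)

print(kept['home_ownership'].unique())
```

- Note that `filter` returns rows of the original DataFrame, not one row per group, which is why we can group again afterwards.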

- A side-by-side bar chart, which we can create by setting the `color` and `barmode` arguments, makes the pattern clear:

In [None]:
# Annoyingly, the side-by-side bar chart doesn't work properly
# if the column that separates colors (here, 'term')
# isn't made up of strings.
(
    loans
    .assign(term=loans['term'].astype(str) + ' months')
    .groupby('home_ownership')
    .filter(lambda df: df.shape[0] > 1)
    .groupby(['home_ownership', 'term'])
    [['int_rate']]
    .mean()
    .reset_index()
    .plot(kind='bar', 
          y='int_rate', 
          x='home_ownership', 
          color='term', 
          barmode='group',
          title='Average Interest Rate by Home Ownership Status and Loan Term',
          width=800)
)

- **Why do longer loans have higher `'int_rate'`s on average?**

### Histograms

- The previous slide showed the **average** `'int_rate'` for different combinations of `'home_ownership'` status and `'term'`.

- But, the average of a numerical feature is just a single number, and can be misleading!

- Histograms are used to show the distribution of a single numerical feature.

- Usage: `px.histogram` or `df.plot(kind='hist')`.

- Example: What is the distribution of `'int_rate'`?

In [None]:
(
    loans
    .plot(kind='hist', x='int_rate', title='Distribution of Interest Rates')
)
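- To see why a single average can mislead, consider the sketch below. Both arrays are made-up interest rates, chosen so that their means match even though their shapes differ.

```python
import numpy as np

# Two hypothetical sets of interest rates with the same mean.
clustered = np.array([11.0, 12.0, 12.0, 12.0, 13.0])
spread_out = np.array([5.0, 5.0, 6.0, 6.0, 38.0])

# The means agree...
print(clustered.mean(), spread_out.mean())
# ...but the medians don't, because one large value pulls the second mean up.
print(np.median(clustered), np.median(spread_out))
```

- A histogram of each array would make the difference obvious immediately, which is the reason to look at distributions and not just summary statistics.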

- With fewer bins, we see less detail (and less noise) in the shape of the distribution.<br><small>Play with the slider that appears when you run the cell below!</small>

In [None]:
def hist_bins(nbins):
    (
        loans
        .plot(kind='hist', x='int_rate', nbins=nbins, title='Distribution of Interest Rates')
        .show()
    )
interact(hist_bins, nbins=(1, 51));

<div class="alert alert-warning">
    <h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
    
<small>Remember that you can always ask questions anonymously at the link above!</small>
    
Based on the histogram below, what is the relationship between the mean and median interest rate?
    
- A. Mean > median.
- B. Mean $\approx$ median.
- C. Mean < median.

</div>

In [None]:
(
    loans
    .plot(kind='hist', x='int_rate', title='Distribution of Interest Rates', nbins=20)
)

### Box plots and violin plots

- Box plots and violin plots are alternatives to histograms: they also show the distribution of a numerical feature.<br><small>Learn more about box plots [**here**](https://datatab.net/tutorial/box-plot).</small>

- Their main benefit is that they're easily placed side-by-side to compare distributions.

- Example: What is the distribution of `'int_rate'`?

In [None]:
(
    loans
    .plot(kind='box', x='int_rate', title='Distribution of Interest Rates')
)

- Example: What is the distribution of `'int_rate'`, separately for each loan `'term'`?

In [None]:
(
    loans
    .plot(kind='box', y='int_rate', color='term', orientation='v', 
          title='Distribution of Interest Rates by Loan Term')
)

In [None]:
(
    loans
    .plot(kind='violin', y='int_rate', color='term', orientation='v', 
          title='Distribution of Interest Rates by Loan Term')
)

- Overlaid histograms can be used to compare the distribution of a numerical feature across groups, too.

In [None]:
(
    loans
    .plot(kind='hist', x='int_rate', color='term', marginal='box', nbins=20,
          title='Distribution of Interest Rates by Loan Term')
)

### Scatter plots

- Scatter plots are used to show the relationship between two quantitative features.

- Usage: `px.scatter` or `df.plot(kind='scatter')`.

- Example: What is the relationship between `'int_rate'` and debt-to-income ratio, `'dti'`?

In [None]:
(
    loans
    .sample(200, random_state=23)
    .plot(kind='scatter', x='dti', y='int_rate', title='Interest Rate vs. Debt-to-Income Ratio')
)

- There are a multitude of ways that scatter plots can be customized. We can color points based on groups, we can resize points based on another numeric column, we can give them hover labels, etc.

In [None]:
(
    loans
    .assign(term=loans['term'].astype(str))
    .sample(200, random_state=23)
    .plot(kind='scatter', x='dti', y='int_rate', color='term',
          hover_name='id', size='loan_amnt',
          title='Interest Rate vs. Debt-to-Income Ratio')
)

### Line charts

- Line charts are used to show how one quantitative feature changes over time.

- Usage: `px.line` or `df.plot(kind='line')`.

- Example: How many loans were given out each year in our dataset?<br><small>This is likely not true of the market in general, or even LendingClub in general, but just a consequence of where our dataset came from.</small>

In [None]:
(
    loans
    .assign(year=loans['date'].dt.year)
    ['year']
    .value_counts()
    .sort_index()
    .plot(kind='line', title='Number of Loans Given Per Year')
)

- Example: How has the average `'int_rate'` changed over time?

In [None]:
(
    loans
    .resample('6M', on='date')
    ['int_rate']
    .mean()
    .plot(kind='line', title='Average Interest Rate over Time')
)

- Example: How has the average `'int_rate'` changed over time, separately for 36 month and 60 month loans?

In [None]:
(
    loans
    .groupby('term')
    .resample('6M', on='date')
    ['int_rate']
    .mean()
    .reset_index()
    .plot(kind='line', x='date', y='int_rate', color='term',
          title='Average Interest Rate over Time')
)

## Missing value imputation 🕳️

---

In [None]:
loans.info()

<center>Most borrowers did not provide a description (<code>'desc'</code>).</center>

### Who provided loan descriptions?

- The Series `isna` method shows `True` for null values and `False` for non-null values; `notna` does the opposite.

In [None]:
...
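- As a small sketch of how `isna` and `notna` behave, here is a hypothetical Series of descriptions (not the actual `'desc'` column) with two missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical descriptions; np.nan and None are both treated as null.
desc = pd.Series(['Paying off credit cards', np.nan, 'Home repairs', None])

print(desc.isna().tolist())
print(desc.notna().tolist())
# Since True counts as 1, .sum() counts the missing values
# and .mean() gives the proportion that are missing.
print(desc.isna().sum(), desc.isna().mean())
```

- The boolean Series these methods return can also be used to filter rows, as in the next cell.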

In [None]:
# Run this repeatedly to read a random sample of loan descriptions.
for desc in loans.loc[loans['desc'].notna(), 'desc'].sample(3):
    print(desc + '\n')

- It _appears_ that applicants who submitted descriptions with their loan applications were given higher interest rates on average than those who didn't submit descriptions.<br><small>But, this could've happened for a variety of reasons, not just because they submitted a description.</small>

In [None]:
(
    loans
    .assign(submitted_description=loans['desc'].notna())
    .groupby('submitted_description')
    ['int_rate']
    .agg(['mean', 'median'])
)

- **Key idea**: The fact that some values are missing is, itself, information!

<div class="alert alert-danger">
    
#### Reference Slide

### Aside: Series operations with null values
    
</div>

- The `numpy`/`pandas` null value, `np.nan`, is typically ignored when using `numpy`/`pandas` operations.

In [None]:
# Note the NaN at the very bottom.
loans['mths_since_last_delinq']

In [None]:
loans['mths_since_last_delinq'].sum() 

- But, `np.nan`s typically aren't ignored when using regular Python operations.

In [None]:
sum(loans['mths_since_last_delinq']) 

- As an aside, the regular Python null value is `None`.

In [None]:
None

In [None]:
np.nan
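- The sketch below puts these behaviors side by side on a tiny made-up Series, so they can be compared without the full `loans` data.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0])

pandas_total = s.sum()              # Skips the NaN.
strict_total = s.sum(skipna=False)  # NaN, since skipping is turned off.
python_total = sum(s)               # NaN, since 10.0 + nan is nan.

print(pandas_total, strict_total, python_total)
# NaN is not equal to itself, so use isna (or np.isnan) to detect it, not ==.
print(np.nan == np.nan, pd.isna(np.nan), pd.isna(None))
```

- The `skipna` argument is available on most `pandas` aggregation methods, in case you want missing values to propagate instead of being ignored.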

### Intentionally missing values and default replacements

- Sometimes, values are missing **intentionally**, or by design. In these cases, we **can't** fill in the missing values.<br><small>For instance, if we survey students and ask "if you're from Michigan, what high school did you go to?", the students not from Michigan will have missing responses. But, there's **nothing to fill in**, since they're not from Michigan!</small>

- Other times, missing values have a **default** replacement.<br><small>For instance, you automatically get a 0 for all assignments you don't submit in this class. So, when calculating your grades, I'll need to fill in all of your `NaN`s with 0. The DataFrame/Series `fillna` method helps with this.</small>

- **Most situations are more complicated than this, though!**<br><small>**Don't** get in the habit of just automatically filling all null values with 0.</small>
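- Here is a minimal sketch of the default-replacement case, using a hypothetical gradebook. The names and scores are invented, and the "0 for missing work" rule is the one described above.

```python
import numpy as np
import pandas as pd

# Hypothetical gradebook: NaN means the assignment wasn't submitted.
grades = pd.DataFrame({
    'hw1': [95.0, np.nan, 80.0],
    'hw2': [np.nan, 70.0, 90.0],
}, index=['Ana', 'Ben', 'Cy'])

# Without filling, .mean skips NaNs and so ignores unsubmitted work.
print(grades.mean(axis=1))
# With the default replacement of 0 for missing work:
print(grades.fillna(0).mean(axis=1))
```

- Filling with 0 is only justified here because the grading policy says so; for something like a height, 0 would be a nonsensical value.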

### Generally, what do we do with missing data?

- Consider a feature, $Y$.<br><small>Imagine $Y$ is a column in a DataFrame.</small>

- Some of its values, $Y_\text{present}$, are present, while others, $Y_\text{missing}$, are missing.

- **Issue**: $Y_\text{present}$ may **look** different than the full dataset, $Y$.<br><small>Remember, we don't get to see $Y$.</small>

- That is, the mean, median, and variance of $Y_\text{present}$ may be different than that of $Y$.

- Furthermore, the correlations between $Y_\text{present}$ and other features may be different than the correlations between $Y$ and other features.

### Example: Heights

- Below, we load in a dataset containing the heights of parents and their children. Some of the `'child'`'s heights are missing.

- **Aside**: The dataset was collected by [Sir Francis Galton](https://en.wikipedia.org/wiki/Francis_Galton), who developed many key ideas in statistics (including correlation and regression) for the purposes of eugenics, a field he also originated.

In [None]:
heights = pd.read_csv('data/heights-missing.csv')
heights.head()

In [None]:
heights['child'].isna().sum()

- **Goal**: Try and fill the missing values in `heights['child']` using the information **we do have**.

- **Plan**: Discuss several ideas on how to solve this problem.<br><small>In practice, the approach you use depends on the situation.</small>

### Aside: Kernel density estimates

- In this section, we'll need to visualize the distributions of many numerical features.

- To do so, we'll use yet another visualization, a **kernel density estimate** (KDE).<br>Think of a KDE as a smoothed version of a histogram.

In [None]:
heights['child'].plot(kind='hist', nbins=30)

In [None]:
import plotly.figure_factory as ff  # In case lec_utils doesn't already provide ff.
def multiple_kdes(ser_map, title=""):
    values = [ser_map[key].dropna() for key in ser_map]
    labels = list(ser_map.keys())
    fig = ff.create_distplot(
        hist_data=values,
        group_labels=labels,
        show_rug=False,
        show_hist=False,
        colors=px.colors.qualitative.Dark2[: len(ser_map)],
    )
    return fig.update_layout(title=title, width=1000).update_xaxes(title="child")
multiple_kdes({'Before Imputation': heights['child']})

### Idea: Dropping missing values

- One solution is to "drop" all rows with missing values, and do calculations with just the values that we have.

- This is called **listwise deletion**.

In [None]:
heights

In [None]:
heights.dropna()

- **Issue**: We went from 934 to 765 rows, which means we lost 18% of rows for all columns, even columns in which no values were originally missing.

- Most `numpy`/`pandas` methods already ignore missing values when performing calculations, so we don't _need_ to do anything explicit to ignore the missing values when calculating the mean and standard deviation.

In [None]:
...

In [None]:
...
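- The trade-off between dropping rows and letting `pandas` skip missing values can be sketched on a tiny, made-up version of the heights data (the numbers are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical heights: only 'child' has missing values.
df = pd.DataFrame({
    'father': [70.0, 68.0, 72.0, 66.0],
    'child': [71.0, np.nan, 73.0, np.nan],
})

dropped = df.dropna()
print(df.shape[0], dropped.shape[0])
# 'father' had no missing values, yet listwise deletion changes its mean.
print(df['father'].mean(), dropped['father'].mean())
# The 'child' mean is the same either way, since NaNs are skipped by default.
print(df['child'].mean(), dropped['child'].mean())
```

- In other words, `dropna` doesn't change statistics of the column with missing values, but it can change statistics of every other column.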

### Idea: Mean imputation

- Suppose we need all of the missing values to be filled in, or **imputed**, for our future analyses, meaning we can't just drop them.

- A **terrible** idea would be to impute all of the missing values with 0. Why?

In [None]:
heights['child']

In [None]:
# DON'T do this!
...

- A better idea is to impute missing values with the **mean** of the observed values.

In [None]:
heights['child']

In [None]:
heights['child'].mean()

In [None]:
mean_imputed = ...
mean_imputed

- The mean of `mean_imputed` is the **same** as the mean of `'child'` **before** we imputed.<br><small>You proved this in Homework 1!</small>

In [None]:
# Mean before imputation:
heights['child'].mean()

In [None]:
# Mean after imputation:
mean_imputed.mean()

- What do you think a _histogram_ of `mean_imputed` would look like?

In [None]:
mean_imputed.value_counts()

### Mean imputation destroys spread!

- Let's look at the distribution of `heights['child']` <span style="color:#1c9e76"><b>before</b></span> we filled in missing values along with the distribution of `mean_imputed`, <span style="color:#d95f01"><b>after</b></span> we filled in missing values.

In [None]:
multiple_kdes({'Before Imputation': heights['child'], 
               'After Mean Imputation': mean_imputed})

- The standard deviation after imputing with the mean is much lower! The true distribution of `'child'` likely does not look like the distribution after imputation.<br><small>In Homework 1, you found the relationship between the new standard deviation and the old one!</small>

In [None]:
heights['child'].std()

In [None]:
mean_imputed.std()

- This makes it harder to use the imputed `'child'` column in analyses with other columns.
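- Both effects of mean imputation can be checked on a small, made-up Series. The shrink factor below is one way to write the relationship, assuming the sample standard deviation that `pandas` computes by default (`ddof=1`): the imputed values add nothing to the sum of squared deviations, but the sum is divided by a larger count.

```python
import numpy as np
import pandas as pd

# Hypothetical heights with two missing values.
child = pd.Series([60.0, 64.0, np.nan, 70.0, np.nan, 66.0])
mean_filled = child.fillna(child.mean())

n_present = child.notna().sum()
n_total = len(child)
# The squared deviations are unchanged (imputed values add 0),
# but we now divide by n_total - 1 instead of n_present - 1.
shrink = np.sqrt((n_present - 1) / (n_total - 1))

# The mean is unchanged by mean imputation.
print(child.mean(), mean_filled.mean())
# The standard deviation shrinks by exactly the factor above.
print(child.std(), mean_filled.std(), child.std() * shrink)
```

- The more values are missing, the smaller the factor, and so the more the spread is understated.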

### Mean imputation and listwise deletion introduce bias!

- **What if the values that are missing, $Y_\text{missing}$, _are not_ a representative sample of the full dataset, $Y$**?<br><small>Equivalently, what if the values that are **present** are not a representative sample of the full dataset?</small>

- For example, _if_ shorter heights are more likely to be missing than taller heights, then:
    - The mean of the present values will be **too big**.
    
    $$\text{mean}(Y_\text{present}) > \text{mean}(Y)$$
    
    - So, by replacing missing values with the mean, our estimates will all be **too big**.

- Instead of filling all missing values with the same one value, can we do something to prevent this added bias?

### Idea: Conditional mean imputation

- If we have reason to believe the chance that a `'child'` height is missing **depends** on another feature, we can use that other feature to inform how we fill the missing value!

- For example, if we have reason to believe that heights are more likely to be missing for `'female'` children than `'male'` children, we could fill in the missing `'female'` and `'male'` heights separately.

In [None]:
# Here, we're computing the proportion of 'child' heights that are missing per gender.
(
    heights
    .groupby('gender')
    ['child']
    .agg(lambda s: s.isna().mean())
)

- That seems to be the case here, so let's try it. We can use the `groupby` `transform` method.

In [None]:
# The mean 'female' observed 'child' height is 64.03, while
# the mean 'male' observed 'child' height is 69.13.
heights.groupby('gender')['child'].mean()

In [None]:
heights

In [None]:
# Note the first missing 'child' height is filled in with
# 69.13, the mean of the observed 'male' heights, since
# they are a 'male' child!
conditional_mean_imputed = ...
...
conditional_mean_imputed
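- One possible way to do this with `groupby` and `transform` is sketched below, on a small hypothetical DataFrame rather than `heights`. This is an illustration of the idea, not necessarily the solution used in lecture.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing height per gender.
df = pd.DataFrame({
    'gender': ['male', 'male', 'male', 'female', 'female', 'female'],
    'child': [70.0, np.nan, 68.0, 64.0, 62.0, np.nan],
})

# transform returns a Series aligned with the original rows, so each
# missing value is filled with the mean of its own group.
filled = df.groupby('gender')['child'].transform(lambda s: s.fillna(s.mean()))
print(filled.tolist())
```

- Using `transform` (and not `agg`) matters here: we want one value per original row, not one value per group.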

### Pros and cons of conditional mean imputation

- Instead of having a single "spike", the conditionally-imputed distribution has two smaller "spikes".<br><small>In this case, one at the observed `'female'` mean and one at the observed `'male'` mean.</small>

In [None]:
multiple_kdes({'Before Imputation': heights['child'], 
               'After Mean Imputation': mean_imputed,
               'After Conditional Mean Imputation': conditional_mean_imputed})

- **Pro ✅**: The conditionally-imputed column's **mean** is likely to be closer to the **true** mean than if we just dropped all missing values, since we attempted to account for the imbalance in missingness.

In [None]:
# The mean of just our present values.
heights['child'].mean()

In [None]:
# Lower than above, reflecting the fact that we are missing
# more 'female' heights and 'female' heights
# tend to be lower.
conditional_mean_imputed.mean()

- **Con ❌**: The conditionally-imputed column likely still has a lower standard deviation than the true `'child'` column.<br><small>The true `'child'` column likely doesn't look like the conditionally-imputed distribution above.</small>

- **Con ❌**: The chance that `'child'`'s heights are missing may depend on other columns, too, and we didn't account for those. There may still be bias.

### Idea: Regression imputation

- A common solution is to fill in missing values by using other features to **predict** what the missing value would have been.

In [None]:
# There's nothing special about the values passed into .iloc below;
# they're just for illustration.
heights.iloc[[0, 2, 919, 11, 4, 8, 9]]

- We'll learn how to make such predictions in the coming weeks.
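- As a preview, here is a rough sketch of the idea using a straight-line fit from `np.polyfit`. The data is made up, and a simple line is just one assumed choice of prediction method; the techniques covered later in the course may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which 'child' is missing for two rows.
df = pd.DataFrame({
    'father': [64.0, 66.0, 68.0, 70.0, 72.0, 74.0],
    'child': [65.0, 66.0, np.nan, 68.0, 69.0, np.nan],
})

present = df['child'].notna()
# Fit child ≈ slope * father + intercept using only rows where 'child' is observed.
slope, intercept = np.polyfit(
    df.loc[present, 'father'].to_numpy(),
    df.loc[present, 'child'].to_numpy(),
    deg=1,
)

predicted = slope * df['father'] + intercept
# Keep observed values; use predictions only where 'child' is missing.
regression_imputed = df['child'].fillna(predicted)
print(regression_imputed)
```

- Like mean imputation, this fills each missing value with a single "best guess", so it also tends to understate the spread of the true values.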

### Idea: Probabilistic imputation

- Since **we don't know** what the missing values would have been, one could argue our technique for filling in missing values should incorporate this uncertainty.

- We could fill in missing values using a **random sample** of observed values.<br><small>This avoids the key issue with mean imputation, where we fill all missing values with the same one value. It also limits the bias present if the missing values weren't a representative sample, since we're filling them in with a range of different values.</small>

In [None]:
# impute_prob should take in a Series with missing values and return an imputed Series.
def impute_prob(s):
    s = s.copy()
    # Find the number of missing values.
    num_missing = s.isna().sum()
    # Take a sample of size num_missing from the present values.
    sample = np.random.choice(s.dropna(), num_missing)
    # Fill in the missing values with our random sample.
    s.loc[s.isna()] = sample
    return s

- Each time we run the cell below, the missing values in `heights['child']` are filled in with a different sample of present values in `heights['child']`!

In [None]:
# The number at the very top is constantly changing!
prob_imputed = impute_prob(heights['child'])
print('Mean:', prob_imputed.mean())
prob_imputed

- To account for the fact that each run is slightly different, a common strategy is **multiple imputation**.<br><small>This involves performing probabilistic imputation many (> 5) times, performing further analysis on each new dataset (e.g. building a regression model), and aggregating the results.</small>

- Probabilistic imputation can even be done **conditionally**!<br><small>Now, missing `'male'` heights are filled in using a sample of observed `'male'` heights, and missing `'female'` heights are filled in using a sample of observed `'female'` heights!</small>

In [None]:
conditional_prob_imputed = ...
...
conditional_prob_imputed
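- One way the conditional version could be written is sketched below. The helper `impute_from_observed` and the data are hypothetical; the helper mirrors `impute_prob`, but takes a random number generator so that the sketch is reproducible.

```python
import numpy as np
import pandas as pd

def impute_from_observed(s, rng):
    # Fill the missing values in s with a random sample
    # (with replacement) of its observed values.
    s = s.copy()
    missing = s.isna()
    s.loc[missing] = rng.choice(s.dropna().to_numpy(), size=missing.sum())
    return s

# Hypothetical data: one missing 'male' height and two missing 'female' heights.
df = pd.DataFrame({
    'gender': ['male'] * 4 + ['female'] * 4,
    'child': [70.0, 72.0, np.nan, 68.0, 63.0, np.nan, 65.0, np.nan],
})

rng = np.random.default_rng(23)  # Seeded only so the sketch is reproducible.
# Each group's missing values are sampled from that group's observed values.
filled = df.groupby('gender')['child'].transform(lambda s: impute_from_observed(s, rng))
print(filled)
```

- Because sampling happens within each group, a missing `'female'` height can never be filled with an observed `'male'` height, or vice versa.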

### Visualizing imputation strategies

In [None]:
multiple_kdes({'Before Imputation': heights['child'], 
               'After Mean Imputation': mean_imputed,
               'After Conditional Mean Imputation': conditional_mean_imputed,
               'After Probabilistic Imputation': prob_imputed,
               'After Conditional Probabilistic Imputation': conditional_prob_imputed})

<div class="alert alert-danger">
    
#### Reference Slide

### Missingness mechanisms
    
</div>

- There are three key **missingness mechanisms**, which describe _how_ data in a column can be missing.

- **Missing completely at random (MCAR)**: Data are MCAR if the chance that a value is missing is **completely independent** of other columns and the actual missing value.<br><small>Example: Suppose that after the Midterm Exam, I randomly choose 5 scores to delete on Gradescope, meaning that 5 students have missing grades. MCAR is **ideal, but rare!**</small>

- **Missing at random (MAR)**: Data are MAR if the chance that a value is missing **depends on other columns**, but not on the actual missing value itself once those columns are accounted for.<br><small>Example: Suppose that after the Midterm Exam, I randomly choose 5 scores to delete on Gradescope **among** sophomore students. Now, scores are missing at random **dependent on class standing**.</small>

- **Not missing at random (NMAR)**: Data are NMAR if the chance that a value is missing **depends on the actual missing value itself**.<br><small>Example: Suppose that after the Midterm Exam, I randomly delete 5 of the 10 lowest scores on Gradescope. Now, scores are **not** missing at random, since the chance a value is missing depends on how large it is.</small>

- Statistical imputation packages usually assume data are MAR.<br><small>MCAR is usually unrealistic to assume. If data are NMAR, you can't impute missing values, since the other features in your data **can't explain** the missingness.</small>
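- The three mechanisms can be simulated to see how each one affects the mean of the observed values. Everything below is made up for illustration: the class standings, the score distributions, and the missingness probabilities are all assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
standing = rng.choice(['sophomore', 'junior'], size=n)
# Hypothetical scores: sophomores score lower on average.
score = pd.Series(
    np.where(standing == 'sophomore', 70.0, 80.0) + rng.normal(0, 5, size=n)
)

# MCAR: every score has the same 30% chance of being missing.
mcar = score.mask(rng.random(n) < 0.3)
# MAR: only sophomores' scores can go missing (depends on another column).
mar = score.mask((standing == 'sophomore') & (rng.random(n) < 0.6))
# NMAR: low scores are the ones that go missing (depends on the value itself).
nmar = score.mask((score < 70) & (rng.random(n) < 0.6))

# Compare the true mean to the mean of the observed values in each case.
print(score.mean(), mcar.mean(), mar.mean(), nmar.mean())
# Under MAR, conditioning on the other column removes most of the bias.
print(score[standing == 'sophomore'].mean(), mar[standing == 'sophomore'].mean())
```

- In this simulation, the observed mean stays close to the true mean under MCAR and is pulled upward under MAR and NMAR; only in the MAR case does looking within each class standing recover a mean close to the truth.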

<div class="alert alert-danger">
    
#### Reference Slide

### How do we know if data are MCAR?
    
</div>

- It seems that if our data are MCAR, there is no risk to dropping missing values.<br><small>In the MCAR setting, just imagine we're being given a large, random sample of the true dataset.</small>

- If the data are not MCAR, though, then dropping the missing values will introduce bias.<br><small>For instance, suppose we asked people "How much do you give to charity?" People who give little are less likely to respond, so the average response is **biased high**.</small>

- There is no perfect procedure for determining if our data are MCAR, MAR, or NMAR; we mostly have to use our understanding of how the data is generated.

- But, we can _try_ to determine whether $Y_\text{missing}$ is similar to $Y$, using the information we **do have** in other columns.<br><small>We did this earlier, when looking at the proportion of missing `'child'` heights for each `'gender'`.</small>

### Summary of imputation techniques

- Consider whether values are missing intentionally, or whether there's a default replacement.

- Listwise deletion.<br><small>Drop, or ignore, missing values.</small>

- (Conditional) mean imputation.<br><small>Fill in missing values with the mean of observed values. If there's a reason to believe the missingness depends on another categorical column, fill in missing values with the observed mean separately for each category.</small>

- (Conditional) probabilistic imputation.<br><small>Fill in missing values with a random sample of observed values. If there's a reason to believe the missingness depends on another categorical column, fill in missing values with a random sample drawn separately for each category.</small>

- Regression imputation.<br><small>Predict missing values using other features.</small>

### What's next?

- So far, our data has just been given to us as a CSV.<br><small>Sometimes it's messy, and we need to clean it.</small>

- But, what if the data we want is somewhere on the internet?