# Unit Missingness Demo

When handling unit missingness, the most common method is to apply **weight class adjustments**. We break our observations into classes based on characteristics we know, then weight each class so that it counts in proportion to its true share of the population before doing our analysis.

In [1]:
# Import libraries.
import pandas as pd
import numpy as np

# Set random seed.
np.random.seed(42)

# Generate dataframe.
value_score = [min(np.random.poisson(5), 10) if i % 2 == 0 else min(np.random.poisson(6), 10) for i in range(10_000)]
value_score = [value_score[i] if (i % 8 == 0 or (i % 7 != 0 and i % 2 == 1)) else np.nan for i in range(10_000)]
departments = ['finance' if i % 2 == 0 else 'accounting' for i in range(10_000)]
df = pd.DataFrame({
    'dept': departments,
    'score': value_score
})

# Check first five rows.
df.head()

         dept  score
0     finance    5.0
1  accounting    4.0
2     finance    NaN
3  accounting    5.0
4     finance    NaN


In [2]:
# What is the distribution of department?
df['dept'].value_counts(normalize = True)

finance       0.5
accounting    0.5
Name: dept, dtype: float64

In [3]:
# Check for nulls.
df.isnull().sum()

dept        0
score    4464
dtype: int64

In [4]:
# Drop NAs.
df.dropna(inplace = True)

In [6]:
# What proportion of our responses came from accounting?
df['dept'].value_counts(normalize = True)['accounting']

0.7742052023121387

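Accounting supplies about 77% of the observed scores but only 50% of the team, so the raw responses over-represent accounting. A weight class adjustment corrects for this in two steps:

1. Take the full sample (observed and missing) and break it into subgroups based on characteristics we know.
2. Calculate a weight for each group $i$ and assign it to every observation in that group:

$$
\text{weight}_i = \frac{\text{true proportion of the sample in group } i}{\text{proportion of observed responses that come from group } i}
$$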
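The weight formula above can be sketched as a small helper. This is a minimal illustration, not part of the original notebook: the function name `class_weights`, its parameters, and the toy respondents are all hypothetical.

```python
import pandas as pd

def class_weights(groups: pd.Series, true_props: dict) -> pd.Series:
    """Return one weight per row: true share of the row's group / observed share."""
    # Share of the observed rows that fall in each group.
    observed_props = groups.value_counts(normalize=True)
    # One weight per group, aligned on the group label.
    group_weight = pd.Series(true_props) / observed_props
    # Look up each row's group weight.
    return groups.map(group_weight)

# Toy respondents: 3 of 4 come from group 'a', but the population is split 50/50.
respondents = pd.Series(['a', 'a', 'a', 'b'])
weights = class_weights(respondents, {'a': 0.5, 'b': 0.5})
print(weights.tolist())
```

Group `a` is over-represented (75% observed versus 50% true), so its rows get a weight below 1. Group `b` is under-represented, so its single row gets a weight above 1.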

In [7]:
# Calculate and print the weight for accounting.
w_accounting = (1/2) / df['dept'].value_counts(normalize = True)['accounting']

print(f'The weight for each accounting vote is: {w_accounting}.')

The weight for each accounting vote is: 0.645823611759216.


In [8]:
# Calculate and print the weight for finance.
w_finance = (1/2) / df['dept'].value_counts(normalize = True)['finance']

print(f'The weight for each finance vote is: {w_finance}.')

The weight for each finance vote is: 2.2144.


In [9]:
# Let's confirm that the weights times the counts
# yield a 50/50 split.
print(w_accounting * df['dept'].value_counts()['accounting'])
print(w_finance * df['dept'].value_counts()['finance'])

2767.9999999999995
2768.0


In [10]:
# Create column that stores the weights.

df['weights'] = [w_accounting if i == 'accounting' else w_finance for i in df['dept']]

In [11]:
# Confirm counts.

df['weights'].value_counts()

0.645824    4286
2.214400    1250
Name: weights, dtype: int64

In [12]:
# Calculate raw mean of my employee satisfaction score.

np.mean(df['score'])

5.724530346820809

In [13]:
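# Calculate weighted mean of my employee satisfaction score.
# (The weights sum to the number of rows, so the plain mean of
# score * weight equals the weighted mean.)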

np.mean(df['score'] * df['weights'])

5.450634997666867
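As a sanity check on what the weighted mean represents, here is a small sketch on made-up data (the `toy` frame and its scores are hypothetical, not taken from the dataset above). It uses `np.average`, which divides by the sum of the weights explicitly, and compares the result with a 50/50 mix of the two department means.

```python
import numpy as np
import pandas as pd

# Hypothetical respondents: accounting is over-represented 3 to 1.
toy = pd.DataFrame({
    'dept': ['accounting'] * 3 + ['finance'],
    'score': [6.0, 7.0, 8.0, 4.0],
    'weights': [2 / 3] * 3 + [2.0],  # (1/2) / 0.75 and (1/2) / 0.25
})

# Weighted mean: sum(score * weight) / sum(weight).
weighted_mean = np.average(toy['score'], weights=toy['weights'])

# The same quantity, written as the true-proportion mix of group means.
group_means = toy.groupby('dept')['score'].mean()
mix_of_group_means = 0.5 * group_means['accounting'] + 0.5 * group_means['finance']

print(weighted_mean, mix_of_group_means)
```

Both values come out to about 5.5, while the unweighted mean of the toy scores is 6.25. Weighting pulls the estimate toward the under-represented department.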

<details><summary>Our goal with post-weighting is to decrease bias. What should we be concerned about?</summary>
    
- Due to the bias-variance tradeoff, as we decrease bias, we may cause an increase in variance.
- This can be a really big deal, [said the New York Times in 2016](https://www.nytimes.com/2016/10/13/upshot/how-one-19-year-old-illinois-man-is-distorting-national-polling-averages.html).
</details>
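One rough way to gauge the variance cost of weighting is Kish's approximate effective sample size, $(\sum w)^2 / \sum w^2$. This diagnostic is not part of the original demo. The sketch below assumes the weights and counts printed in the cells above, and the function name is hypothetical.

```python
import numpy as np

def kish_effective_sample_size(weights: np.ndarray) -> float:
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    return weights.sum() ** 2 / (weights ** 2).sum()

# Weights and counts as printed earlier: 4286 accounting rows, 1250 finance rows.
weights = np.array([0.645823611759216] * 4286 + [2.2144] * 1250)

n_eff = kish_effective_sample_size(weights)
print(len(weights), round(n_eff))
```

Under this approximation, the 5,536 weighted responses carry roughly the information of about 3,870 equally weighted ones. The more unequal the weights, the smaller the effective sample and the noisier the estimate.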

<details><summary>What might be a situation where we may not be able to use weight class adjustments?</summary>
    
- If we don't know the true distribution of our classes.
- For example, if I didn't know that half of our team was in accounting and half in finance.
- As another example, suppose I wanted to apply this weighting method to estimate the percentage of voters supporting the Democratic candidate in the upcoming election. I don't know how many voters will fall into each of the age groups 18-34, 35-54, and 55+, so I'll have to make a guess. (Hopefully an educated one!)
</details>

#### Have more variables and want to build a sophisticated model?
Pass `df['weights']` to the `sample_weight` argument when fitting your `sklearn` model, and keep the weights column out of the features. [Source](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.fit).
> `model.fit(X_train.drop(columns='weights'), y_train, sample_weight=X_train['weights'])`
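A minimal sketch of fitting with observation weights, assuming `scikit-learn` is installed. The `tenure_years` feature and the randomly generated data are hypothetical and exist only to make the example runnable. The weights go in through the `sample_weight` keyword of `fit` and are kept out of the feature matrix.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data: one made-up feature, a 0-10 score, and a weight per row.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'tenure_years': rng.integers(0, 20, size=200),
    'score': rng.integers(0, 11, size=200),
    'weights': rng.choice([0.65, 2.21], size=200),
})

X_train = toy[['tenure_years']]  # features only, no weights column
y_train = toy['score']

model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X_train, y_train, sample_weight=toy['weights'])

preds = model.predict(X_train)
```

Passing the weights as a feature column instead would let the model split on them, which is not the same as weighting the observations.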