## Data Preprocessing

In the previous session, you successfully acquired and prepared a subset of the ESS10 data.

Keep in mind that we aim to test the following hypothesis:

> **Centrist individuals exhibit lower levels of affective polarization.**

Today, you will clean this dataset to make it analysis-ready. This involves:

1. Loading the subset data you created last time
2. Filtering observations (focusing on France)
3. Recoding variables for analysis
4. Creating relevant variables for the analysis
5. Saving the cleaned dataset

### Tips

- You should adapt code from previous [notebooks](https://github.com/mickaeltemporao/materials/tree/main), such as:
    - `05-data-exploration-rows.ipynb`
    - `06-data-management-existing-values.ipynb`


## Loading the Data

We have prepared a helper function in the `code/preprocess.py` module that will load your subset data. If the subset file doesn't exist, it will recreate it from the raw ESS data.

In [None]:
# Import necessary libraries
import sys
import pandas as pd

# Adding the code directory to path
sys.path.append('../code')  

# Import the preprocessing module
from preprocess import subset

In [None]:
# Load the subset data using our helper function
df = subset()
df.sample(5)

In [None]:
# Quick sanity check before we get started
df.describe()

# Let's Start!
## Filtering French Observations

Since some variables are country-specific, let's filter our dataset to focus on French respondents.

**Task:**
1. Filter the dataset to include only respondents from France (`cntry == 'FR'`)
2. Create a new dataframe called `df_france`
3. Drop the `cntry` column
4. Display the first few rows
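
As a hint, the steps above can be sketched on a small made-up data frame. The `toy` data and its values are assumptions for illustration only; in your notebook you would apply the same pattern to `df`.

```python
import pandas as pd

# Toy data standing in for the ESS subset (values are made up)
toy = pd.DataFrame({
    "cntry": ["FR", "DE", "FR", "IT"],
    "lrscale": [5, 3, 8, 5],
})

# Keep French respondents only; .copy() gives an independent data frame
toy_france = toy[toy["cntry"] == "FR"].copy()

# The country column is now constant, so it can be dropped
toy_france = toy_france.drop(columns="cntry")

# Display the first few rows
print(toy_france.head())
```

Calling `.copy()` after filtering is a design choice that avoids pandas warnings when you add or recode columns later.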

In [None]:
# Your code here:


## Filtering Relevant Observations

Before we can analyze the data, we need to understand how the variables are coded. 

If you check the codebook, you will see that most variables have values that are not applicable to our analysis (66, 77, 88, ...).
For now, we will simply remove irrelevant observations.

> The codebook has been automatically downloaded in `data/raw/ESS10 codebook.html`

**Task:**
1. Filter each variable to include only relevant observations
2. Check the values of the remaining observations
3. Display the first few rows of the dataset
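
One possible way to do this, sketched on made-up data: keep only rows whose values fall inside the valid range. The 0 to 10 range used here is an assumption for the two example variables; check the codebook for the valid range of each variable you filter.

```python
import pandas as pd

# Toy data with some "not applicable" style codes (77, 88, 99)
toy = pd.DataFrame({
    "lrscale": [5, 77, 3, 88, 10],
    "trstprl": [4, 6, 99, 2, 7],
})

# Assumption: valid answers run from 0 to 10 for both variables;
# confirm the range of each variable in the codebook
valid = toy["lrscale"].between(0, 10) & toy["trstprl"].between(0, 10)
toy_clean = toy[valid].copy()

# Check the values of the remaining observations
print(toy_clean["lrscale"].value_counts().sort_index())

# Display the first few rows
print(toy_clean.head())
```

`Series.between` is inclusive on both ends by default, so 0 and 10 are kept.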

In [None]:
# Your code here (you can create multiple code blocks)


## Creating the Centrist Variable 

For our hypothesis, we need to identify "centrist" individuals. 
Therefore, we should add a new variable capturing this concept to our data frame.

**Task:**
- Use the `lrscale` variable to create a centrism variable
- Add the new variable `centrism` to the data frame
- Check the distribution of the newly created variable
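
There is no single correct definition of "centrist". As a minimal sketch, the example below assumes that respondents placing themselves at 4, 5, or 6 on the left-right scale count as centrists. This cutoff is a hypothetical choice for illustration, not a definition taken from the ESS, and the data are made up.

```python
import pandas as pd

# Toy left-right self-placements on a 0-10 scale
toy = pd.DataFrame({"lrscale": [0, 4, 5, 6, 10, 5]})

# Hypothetical definition: centrist = placement of 4, 5, or 6
toy["centrism"] = toy["lrscale"].between(4, 6).astype(int)

# Check the distribution of the new variable
print(toy["centrism"].value_counts())
```

A continuous alternative would be the distance from the midpoint, for example `(toy["lrscale"] - 5).abs()`, where lower values mean more centrist. Whichever you choose, be ready to justify it.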


In [None]:
# Your code here:
# Create the centrism variable


# Check the distribution



## Creating an Affective Scale?

For our hypothesis, we also need to build an affective evaluation scale.

Unfortunately, the ESS does not have direct measures of out-party dislike or feeling thermometers toward parties.
Not having direct observations of what we are trying to test is common when doing research.

We need to find a way around this by building a proxy that is close to our original concept.
One way to build such a proxy is by creating an additive scale combining multiple variables of interest.

That is, we could add variables together to build an "Affective Scale".

**Task:**
- Select some or all trust and satisfaction variables.
- Make sure they are coded in the same direction (higher = more positive evaluation).
- Combine them into an additive scale (sum or average).
- Create a new variable called `aff_eval` in the data frame.
- Check the distribution of `aff_eval` using summary statistics and a histogram.
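
A minimal sketch of an averaged scale on made-up data follows. The three column names are assumptions about which trust and satisfaction items are in your subset, and the sketch assumes they are already cleaned and coded in the same direction.

```python
import pandas as pd

# Toy data: three items on a 0-10 scale, higher = more positive
toy = pd.DataFrame({
    "trstprl": [2, 5, 8],
    "trstplt": [1, 5, 9],
    "stfgov": [3, 5, 7],
})

# Assumed item list; replace with the variables you selected
items = ["trstprl", "trstplt", "stfgov"]

# Row-wise average keeps the scale on the original 0-10 range
toy["aff_eval"] = toy[items].mean(axis=1)

# Summary statistics for the new scale
print(toy["aff_eval"].describe())
```

Averaging rather than summing keeps the result on the same range as the items, which makes it easier to interpret. For the histogram, `toy["aff_eval"].plot.hist()` works if matplotlib is installed.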


In [None]:
# Your code here:
# Create the aff_eval variable


# Check the distribution



## Exploratory Data Analysis

Now let's explore our cleaned data to understand the relationships between variables.
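
As a starting point, one simple way to look at the relationship is to compare the average of the scale across groups. The sketch below uses made-up data and assumes a binary `centrism` variable and a numeric `aff_eval` variable; it is not a result from the ESS.

```python
import pandas as pd

# Toy data: binary centrism indicator and an evaluation scale
toy = pd.DataFrame({
    "centrism": [1, 1, 0, 0],
    "aff_eval": [6.0, 4.0, 3.0, 1.0],
})

# Mean of the scale within each centrism group
group_means = toy.groupby("centrism")["aff_eval"].mean()
print(group_means)

# Number of respondents per group, useful to spot very small groups
print(toy["centrism"].value_counts())
```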

In [None]:
# Summary statistics for key variables


In [None]:
# Create crosstabs and plots to examine the relationships


## Saving the Final Dataset

Let's create our final analysis-ready dataset with the key variables for testing our hypothesis.
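
A minimal sketch of saving and reloading a data frame is shown below. The file name and the temporary directory are hypothetical choices made so the example has no side effects; in your project you would pick a path inside your own data folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the final dataset
toy = pd.DataFrame({"centrism": [1, 0], "aff_eval": [5.0, 2.5]})

# Hypothetical output location inside a temporary directory
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / "ess_france_clean.csv"

# index=False avoids writing the row index as an extra column
toy.to_csv(out_path, index=False)

# Reload to confirm the file was written as expected
reloaded = pd.read_csv(out_path)
print(reloaded.head())
```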

In [None]:
# Save the final cleaned dataset
