## Data Preprocessing

In the previous session, you successfully acquired and prepared a subset of the ESS10 data.

Keep in mind that we aim to test the following hypothesis:

> **Centrist individuals exhibit lower levels of affective polarization.**

Today, you will clean this dataset to make it analysis-ready. This involves:

1. Loading the subset data you created last time
2. Filtering observations (focusing on France)
3. Recoding variables for analysis
4. Creating relevant variables for the analysis
5. Saving the cleaned dataset

### Tips

- You should adapt code from previous [notebooks](https://github.com/mickaeltemporao/materials/tree/main), such as:
    - `05-data-exploration-rows.ipynb`
    - `06-data-management-existing-values.ipynb`


## Loading the Data

We have prepared a helper function in the `code/preprocess.py` module that will load your subset data. If the subset file doesn't exist, it will recreate it from the raw ESS data.

In [131]:
# Import necessary libraries
import sys
import pandas as pd
import numpy as np

# Adding the code directory to path
sys.path.append('../code')  

# Import the preprocessing module
from preprocess import subset

In [132]:
# Load the subset data using our helper function
df = subset()
df.sample(5)

Unnamed: 0,cntry,clsprty,lrscale,polintr,prtclffr,prtdgcl,prtvtefr,vote,trstplt,trstprt,stfgov,stfdem
24628,IT,2,88,3,,6,,2,3,2,5,7
8044,CZ,2,8,3,,6,,1,8,8,6,7
35728,SI,1,6,3,,2,,1,4,4,5,3
13541,GB,2,6,4,,6,,1,1,1,3,7
16263,GR,1,1,4,,2,,1,4,3,0,2


In [133]:
# Quick sanity check before we get started
df.describe()

Unnamed: 0,clsprty,lrscale,polintr,prtclffr,prtdgcl,prtvtefr,vote,trstplt,trstprt,stfgov,stfdem
count,37611.0,37611.0,37611.0,1977.0,37611.0,1977.0,37611.0,37611.0,37611.0,37611.0,37611.0
mean,1.727367,16.444338,2.749329,43.245321,4.503417,41.382398,1.425859,4.803621,5.033714,6.40544,7.193082
std,1.077339,27.90359,0.952343,29.115156,1.97217,30.683813,0.922063,9.83446,10.931258,12.515845,12.621552
min,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0
25%,1.0,4.0,2.0,7.0,2.0,7.0,1.0,2.0,2.0,3.0,4.0
50%,2.0,5.0,3.0,66.0,6.0,66.0,1.0,4.0,4.0,5.0,6.0
75%,2.0,8.0,3.0,66.0,6.0,66.0,2.0,6.0,6.0,7.0,7.0
max,9.0,99.0,9.0,88.0,9.0,88.0,9.0,99.0,99.0,99.0,99.0


# Let's Start!
## Filtering French Observations

Since some variables are country-specific, let's filter our dataset to focus on French respondents.

**Task:**
1. Filter the dataset to include only respondents from France (`cntry == 'FR'`)
2. Create a new dataframe called `df_france`
3. Drop the `cntry` column
4. Display the first few rows

In [134]:
# Your code here:
df_france = df[df['cntry'] == 'FR']
df_france = df_france.drop(columns='cntry')
df_france.head()

Unnamed: 0,clsprty,lrscale,polintr,prtclffr,prtdgcl,prtvtefr,vote,trstplt,trstprt,stfgov,stfdem
11177,2,5,3,66.0,6,88.0,1,8,8,2,6
11178,1,0,2,4.0,3,6.0,1,5,5,5,5
11179,2,5,3,66.0,6,66.0,2,4,3,5,4
11180,2,3,4,66.0,6,66.0,2,5,5,6,8
11181,2,5,3,66.0,6,66.0,2,5,3,6,7


## Filtering Relevant Observations

Before we can analyze the data, we need to understand how the variables are coded. 

If you check the codebook, you will see that most variables include codes that are not substantive answers (66, 77, 88, ...), such as "not applicable", refusals, or "don't know".
For now, we will simply treat these codes as missing and remove the corresponding observations.

> The codebook has been automatically downloaded in `data/raw/ESS10 codebook.html`

**Task:**
1. Filter each variable to include only relevant observations
2. Check the values of the remaining observations
3. Display the first few rows of the dataset
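
As a minimal sketch of this step, assuming a toy data frame rather than the real ESS file, out-of-range codes can be turned into missing values with `Series.where` and `Series.between`. The values and the `valid_ranges` dictionary below are illustrative; the valid range of each real variable should be taken from the codebook.

```python
import pandas as pd

# Toy data standing in for two ESS columns (values are made up)
toy = pd.DataFrame({
    "lrscale": [0, 5, 10, 77, 88],
    "polintr": [1, 4, 2, 8, 3],
})

# Assumed valid ranges; confirm each one against the codebook
valid_ranges = {"lrscale": (0, 10), "polintr": (1, 4)}

for col, (low, high) in valid_ranges.items():
    # Keep values inside the range, replace everything else with NaN
    toy[col] = toy[col].where(toy[col].between(low, high))

# Count missing values per column, then keep only complete rows
n_missing = toy.isna().sum()
toy_complete = toy.dropna()

print(n_missing)
print(toy_complete)
```

Calling `value_counts(dropna=False)` on each column afterwards is a quick way to confirm that only substantive codes remain.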

In [135]:
df_france.head()

Unnamed: 0,clsprty,lrscale,polintr,prtclffr,prtdgcl,prtvtefr,vote,trstplt,trstprt,stfgov,stfdem
11177,2,5,3,66.0,6,88.0,1,8,8,2,6
11178,1,0,2,4.0,3,6.0,1,5,5,5,5
11179,2,5,3,66.0,6,66.0,2,4,3,5,4
11180,2,3,4,66.0,6,66.0,2,5,5,6,8
11181,2,5,3,66.0,6,66.0,2,5,3,6,7


In [137]:
df_france.prtvtefr.value_counts()

prtvtefr
66.0    952
7.0     223
5.0     136
9.0     132
6.0     126
11.0    107
88.0     83
77.0     68
4.0      44
13.0     30
3.0      16
8.0      15
10.0     12
1.0      11
12.0     11
2.0       6
14.0      5
Name: count, dtype: int64

In [None]:
# Valid (substantive) answer range for each variable. Everything outside the
# range (e.g. 6, 66, 77, 88, 99) is set to missing. Check the codebook for each range.
valid_ranges = {
    'clsprty': (1, 2),
    'lrscale': (0, 10),   # the left-right scale starts at 0 (left), not 1
    'polintr': (1, 4),
    'prtdgcl': (1, 4),    # 6 means "not applicable", so it becomes missing too
    'prtclffr': (1, 11),  # party codes 1-11 are kept; see the codebook for 12-14
    'prtvtefr': (1, 11),
    'vote': (1, 3),
    'trstplt': (0, 10),
    'trstprt': (0, 10),
    'stfgov': (0, 10),
    'stfdem': (0, 10),
}
for col, (low, high) in valid_ranges.items():
    df_france[col] = df_france[col].where(df_france[col].between(low, high))

## Creating the Centrist Variable 

For our hypothesis, we need to identify "centrist" individuals. 
Therefore, we should add a new variable capturing this concept to our data frame.

**Task:**
- Use the `lrscale` variable to create a centrism variable
- Add the new variable `centrism` to the data frame
- Check the distribution of the newly created variable
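
One possible operationalization, offered as a sketch rather than the expected answer, is to treat centrism as closeness to the midpoint of the scale. The example assumes `lrscale` runs from 0 (left) to 10 (right) with midpoint 5. The 4 to 6 cut-off for the binary `centrist` flag is an arbitrary illustrative choice, and the toy values are made up.

```python
import pandas as pd

# Toy left-right placements, including one missing value
toy = pd.DataFrame({"lrscale": [0, 3, 5, 6, 10, None]})

# Continuous version: 5 at the midpoint, 0 at either extreme
toy["centrism"] = 5 - (toy["lrscale"] - 5).abs()

# Binary version: True for placements between 4 and 6 (assumed cut-off)
toy["centrist"] = toy["lrscale"].between(4, 6)
# between() returns False for NaN, so restore missing values explicitly
toy["centrist"] = toy["centrist"].where(toy["lrscale"].notna())

# Check the distribution of both versions
print(toy["centrism"].describe())
print(toy["centrist"].value_counts(dropna=False))
```

The continuous version keeps more information, while the binary flag is easier to use in crosstabs; which one fits better depends on how "centrist" is defined in the analysis.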


In [None]:
# Your code here:
# Create the centrism variable


# Check the distribution



## Creating an Affective Scale?

For our hypothesis, we also need to build an affective evaluation scale.

Unfortunately, the ESS does not include direct measures of out-party dislike or feeling thermometers toward parties.
Not having direct observations of what we are trying to test is common when doing research.

We need to find a way around this by building a proxy that is close to our original concept.
One way to build such a proxy is to create an additive scale combining multiple variables of interest.

That is, we could add variables together to build an "Affective Scale".

**Task:**
- Select some or all trust and satisfaction variables.
- Make sure they are coded in the same direction (higher = more positive evaluation).
- Combine them into an additive scale (sum or average).
- Create a new variable called `aff_eval` in the data frame.
- Check the distribution of `aff_eval` using summary statistics and a histogram.
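
A minimal sketch of an additive scale on made-up data follows. It assumes the four items already share the same 0 to 10 range and are coded so that higher means a more positive evaluation. The `reverse_0_10` helper is hypothetical and is included only to show how an item coded in the opposite direction could be flipped before averaging.

```python
import pandas as pd

# Toy trust and satisfaction items on a 0-10 scale (values are made up)
toy = pd.DataFrame({
    "trstplt": [2, 5, 8],
    "trstprt": [3, 5, 7],
    "stfgov": [1, 6, 9],
    "stfdem": [4, 4, 10],
})

items = ["trstplt", "trstprt", "stfgov", "stfdem"]


def reverse_0_10(series):
    """Flip a 0-10 item so that 0 becomes 10 (hypothetical helper)."""
    return 10 - series


# Row-wise average of the items; skipna=False leaves the scale missing
# whenever one of the items is missing
toy["aff_eval"] = toy[items].mean(axis=1, skipna=False)

# Summary statistics of the new scale
print(toy["aff_eval"].describe())
```

With matplotlib installed, `toy["aff_eval"].plot.hist()` draws the histogram. Averaging rather than summing keeps the scale on the original 0 to 10 range, which makes the values easier to read.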


In [151]:
# Your code here:
# Create the aff_eval variable (stored here as aff_eval_index)
components = ['trstplt', 'trstprt', 'stfgov', 'stfdem']

# Standardize each component (z-score) using the French sample mean and std
z_scores = (df_france[components] - df_france[components].mean()) / df_france[components].std()

# Average the four z-scores; the index is missing if any component is missing
df_france['aff_eval_index'] = z_scores.mean(axis=1, skipna=False)


In [153]:

# Check the distribution
df_france.aff_eval_index.head()


11177    0.807116
11178    0.327296
11179   -0.129296
11180    0.744861
11181    0.402714
Name: aff_eval_index, dtype: float64

## Exploratory Data Analysis

Now let's explore our cleaned data to understand the relationships between variables.
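
The sketch below shows three common ways to look at such a relationship: group means, a correlation, and a crosstab. It uses made-up values, so it illustrates only the mechanics and says nothing about the hypothesis. The column names `centrist`, `centrism`, and `aff_eval` are assumed to match the variables created earlier, and `aff_high` is a hypothetical helper column.

```python
import pandas as pd

# Toy data with assumed column names (values are made up)
toy = pd.DataFrame({
    "centrist": [True, True, False, False, True, False],
    "centrism": [5, 4, 1, 0, 5, 2],
    "aff_eval": [6.0, 5.5, 3.0, 2.5, 7.0, 4.0],
})

# Mean of the scale within each group
group_means = toy.groupby("centrist")["aff_eval"].mean()

# Pearson correlation between the continuous measures
corr = toy["centrism"].corr(toy["aff_eval"])

# Crosstab after splitting the scale at its median (hypothetical split)
toy["aff_high"] = toy["aff_eval"] >= toy["aff_eval"].median()
table = pd.crosstab(toy["centrist"], toy["aff_high"])

print(group_means)
print(round(corr, 2))
print(table)
```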

In [None]:
# Summary statistics for key variables


In [None]:
# Create crosstabs and plots to examine the relationships


## Saving the Final Dataset

Let's create our final analysis-ready dataset with the key variables for testing our hypothesis.
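
A minimal sketch of the saving step, assuming a CSV output. The file name `ess_france_clean.csv` and the `processed` folder are hypothetical, and a temporary directory is used so the example runs anywhere; in the project, the path would point into the data folder instead.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the cleaned data frame (values are made up)
toy = pd.DataFrame({
    "lrscale": [3, 5],
    "centrism": [3, 5],
    "aff_eval": [4.5, 6.0],
})

# Hypothetical output location inside a temporary directory
out_dir = Path(tempfile.mkdtemp()) / "processed"
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / "ess_france_clean.csv"

# index=False avoids writing the row index, which would otherwise
# come back as an extra "Unnamed: 0" column when the file is reloaded
toy.to_csv(out_path, index=False)

# Reload to confirm the file round-trips as expected
reloaded = pd.read_csv(out_path)
print(reloaded)
```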

In [None]:
# Save the final cleaned dataset
