## Data Acquisition

We are ultimately interested in testing the following hypothesis:
> **Centrist individuals exhibit lower levels of affective polarization.**

The goal of this notebook is to **acquire, clean, and prepare** a dataset that allows us to empirically examine this hypothesis using survey data. Specifically, we will extract a reduced set of variables from the ESS Round 10 dataset, rename them for clarity, and conduct a brief exploratory analysis.

This notebook will output a **smaller, well-documented, and analysis-ready dataset** containing only the variables relevant for testing the hypothesis.

### Work Plan

1. Load the ESS Round 10 dataset *(already completed)*
2. Identify the variables relevant for our hypothesis
3. Create a subset dataframe containing only these variables
4. Rename variables using clear and descriptive names
5. Produce basic summary statistics for each selected variable
6. Export the cleaned dataset as a temporary CSV file

### Tips

- Adapt code from previous notebooks:
    - mainly `04-data-exploration-columns.ipynb`
    - and also `subset_dataframes.ipynb`

- You can also consult the **Typst article** to help you identify theoretically relevant variables.

## Load the Data

In [2]:
!pip install requests

In [3]:
# Import Relevant Libraries/Packages/Modules
import os
import requests
import zipfile
import pandas as pd

In [4]:
# Preparing Relevant Paths & Folders
file_url = "https://github.com/datamisc/ess-10/raw/main/data.zip"
data_dir = "data/raw/"
# Create the data directory if it does not already exist
os.makedirs(data_dir, exist_ok=True)
# Filenames
csv_filename = "ESS10.csv"
zip_path = os.path.join(data_dir, "data.zip")
csv_path = os.path.join(data_dir, csv_filename)

In [5]:
# Function to download the dataset if not available
def get_data():
    if os.path.exists(csv_path):
        print(f"✅ CSV already exists: {csv_path}")
        return

    print("❗ CSV not found. Downloading ZIP...")

    # Download ZIP
    response = requests.get(file_url)
    response.raise_for_status()
    with open(zip_path, "wb") as f:
        f.write(response.content)
    print(f"📦 ZIP downloaded to: {zip_path}")

    # Extract ZIP
    print("📂 Extracting ZIP...")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(data_dir)

    print(f"✅ Files extracted to: {data_dir}")

    # Confirm extraction
    if not os.path.exists(csv_path):
        raise FileNotFoundError(
            f"CSV '{csv_filename}' not found inside ZIP. "
            f"Check ZIP contents in {data_dir}."
        )
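
If `get_data()` raises the `FileNotFoundError` above, listing the archive members is a quick way to see what the ZIP actually contains. The sketch below is only an illustration: `list_zip_members` is a hypothetical helper (not part of the notebook), and it runs on a small in-memory archive rather than the real ESS download.

```python
import io
import zipfile


def list_zip_members(zip_source):
    """Return the member names of a ZIP archive (a path or a file-like object)."""
    with zipfile.ZipFile(zip_source, "r") as zip_ref:
        return zip_ref.namelist()


# Build a tiny in-memory archive as stand-in data (NOT the real ESS file)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("ESS10.csv", "idno,cntry\n1,FR\n")
buffer.seek(0)

members = list_zip_members(buffer)
print(members)
```

With the real data you would pass `zip_path` instead of `buffer` and compare the listed names against `csv_filename`.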

In [7]:
# Use the newly created `get_data()` function.
get_data()

✅ CSV already exists: data/raw/ESS10.csv


In [17]:
# Loading the CSV Data
df_raw = pd.read_csv(csv_path)


In [None]:
# Take a look at the data you just loaded
df_raw

|  | name | essround | edition | proddate | idno | cntry | dweight | pspwght | pweight | anweight | ... | vinwe | inwde | jinws | jinwe | inwtm | mode | domain | prob | stratum | psu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ESS10e03_2 | 10 | 3.2 | 02.11.2023 | 10038 | BE | 0.882220 | 0.972276 | 0.718075 | 0.698167 | ... | 2022-09-01 17:47:00 | 2022-09-01 17:47:00 | 2022-09-01 17:47:00 | 2022-09-01 17:47:00 | 36.0 | 1 | 1.0 | 0.000397 | 188 | 2596 |
| 1 | ESS10e03_2 | 10 | 3.2 | 02.11.2023 | 10053 | BE | 1.047643 | 0.888635 | 0.718075 | 0.638107 | ... | 2022-04-08 11:07:00 | 2022-04-08 11:10:00 | 2022-04-08 11:07:00 | 2022-04-08 11:10:00 | 54.0 | 2 | 2.0 | 0.000334 | 194 | 2206 |
| 2 | ESS10e03_2 | 10 | 3.2 | 02.11.2023 | 10055 | BE | 1.087741 | 0.722811 | 0.718075 | 0.519033 | ... | 2022-05-20 11:08:00 | 2022-05-20 11:10:00 | 2022-05-20 11:08:00 | 2022-05-20 11:10:00 | 77.0 | 1 | 2.0 | 0.000322 | 198 | 2114 |
| 3 | ESS10e03_2 | 10 | 3.2 | 02.11.2023 | 10062 | BE | 0.909910 | 1.005565 | 0.718075 | 0.722072 | ... | 2022-05-22 13:58:00 | 2022-05-22 13:59:00 | 2022-05-22 13:58:00 | 2022-05-22 13:59:00 | 55.0 | 1 | 1.0 | 0.000385 | 150 | 2645 |
| 4 | ESS10e03_2 | 10 | 3.2 | 02.11.2023 | 10064 | BE | 0.918949 | 0.638705 | 0.718075 | 0.458639 | ... | 2022-05-18 11:44:00 | 2022-05-18 11:45:00 | 2022-05-18 11:44:00 | 2022-05-18 11:45:00 | 55.0 | 1 | 1.0 | 0.000381 | 149 | 2313 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 37606 | ESS10e03_2 | 10 | 3.2 | 02.11.2023 | 27808 | SK | 0.515714 | 0.339385 | 0.323800 | 0.109893 | ... | 2021-06-08 14:28:34 | 2021-06-08 14:30:41 | 2021-06-08 14:29:01 | 2021-06-08 14:31:44 | 70.0 | 1 | 1.0 | 0.001522 | 2610 | 27206 |
| 37607 | ESS10e03_2 | 10 | 3.2 | 02.11.2023 | 27826 | SK | 0.297974 | 0.196093 | 0.323800 | 0.063495 | ... | 2021-08-02 10:33:27 | 2021-08-02 10:36:27 | 2021-08-02 10:35:22 | 2021-08-02 10:37:34 | 45.0 | 1 | 2.0 | 0.002635 | 2610 | 27217 |
| 37608 | ESS10e03_2 | 10 | 3.2 | 02.11.2023 | 27834 | SK | 0.965931 | 0.857000 | 0.323800 | 0.277497 | ... | 2021-06-26 20:52:15 | 2021-06-26 20:53:05 | 2021-06-26 20:52:27 | 2021-06-26 20:54:32 | 33.0 | 1 | 1.0 | 0.000813 | 2631 | 27134 |
| 37609 | ESS10e03_2 | 10 | 3.2 | 02.11.2023 | 27846 | SK | 0.854279 | 0.624287 | 0.323800 | 0.202144 | ... | 2021-07-21 14:14:41 | 2021-07-21 14:17:31 | 2021-07-21 14:16:38 | 2021-07-21 14:18:38 | 43.0 | 1 | 1.0 | 0.000919 | 2638 | 27183 |


## Relevant Variables/Columns
- Check the ESS Round 10 codebook to find the names and coding of the variables you need
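
Once you have a candidate list from the codebook, it can help to confirm that every name really is a column before subsetting. This is a minimal sketch on a toy DataFrame; `df_demo` and `candidates` are illustrative names, and the toy data is not ESS data.

```python
import pandas as pd

# Toy stand-in for the raw survey data (NOT real ESS records)
df_demo = pd.DataFrame(
    {"cntry": ["FR", "BE"], "vote": [1, 2], "lrscale": [5, 3]}
)

# Candidate variable names taken from a codebook
candidates = ["vote", "cntry", "lrscale", "prtdgcl"]

# Collect the names that are absent from the DataFrame
missing = [name for name in candidates if name not in df_demo.columns]
print("Missing columns:", missing)
```

An empty `missing` list means the subset step will not fail with a `KeyError`.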

In [51]:
# Create a list with the names of the relevant variables
# (called `relevant_vars` so it does not shadow the built-in `vars`)
relevant_vars = ["vote", "cntry", "prtvtefr", "prtclffr", "prtdgcl", "lrscale"]

In [52]:
df = df_raw[relevant_vars]

In [53]:
col_names = [
    "vote",
    "country",
    "last_party",
    "closest_party",
    "how_close",
    "left_right"
]

In [54]:
df.columns = col_names

In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 37612 entries, 0 to cntry
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   vote           37612 non-null  object
 1   country        37612 non-null  object
 2   last_party     1978 non-null   object
 3   closest_party  1978 non-null   object
 4   how_close      37612 non-null  object
 5   left_right     37612 non-null  object
dtypes: object(6)
memory usage: 3.0+ MB
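
The `info()` output reports every column as `object`, so numeric summaries such as `describe()` will not behave as expected until the columns are converted. One possible approach, sketched here on toy data with `pd.to_numeric`, turns anything unparseable into `NaN`. Whether that is appropriate for the real columns is an assumption you should check against the codebook.

```python
import pandas as pd

# Toy column stored as strings, including one unparseable entry
df_demo = pd.DataFrame(
    {"left_right": ["5", "0", "88", "abc"], "country": ["FR"] * 4}
)

# errors="coerce" converts values that cannot be parsed into NaN
df_demo["left_right"] = pd.to_numeric(df_demo["left_right"], errors="coerce")

print(df_demo.dtypes)
```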


In [56]:
df = df.loc[df["country"] == "FR"]

In [57]:
df

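|  | vote | country | last_party | closest_party | how_close | left_right |
|---|---|---|---|---|---|---|
| 11177 | 1 | FR | 88.0 | 66.0 | 6 | 5 |
| 11178 | 1 | FR | 6.0 | 4.0 | 3 | 0 |
| 11179 | 2 | FR | 66.0 | 66.0 | 6 | 5 |
| 11180 | 2 | FR | 66.0 | 66.0 | 6 | 3 |
| 11181 | 2 | FR | 66.0 | 66.0 | 6 | 5 |
| ... | ... | ... | ... | ... | ... | ... |
| 13150 | 1 | FR | 5.0 | 5.0 | 2 | 3 |
| 13151 | 1 | FR | 6.0 | 6.0 | 3 | 4 |
| 13152 | 1 | FR | 7.0 | 66.0 | 6 | 7 |
| 13153 | 1 | FR | 4.0 | 4.0 | 2 | 2 |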
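
Values such as `66.0` and `88.0` in the party columns look like survey missing-value codes rather than real parties. The sketch below shows one way to recode such values to `NaN`. The code lists in `missing_codes` are assumptions made for illustration, so verify them in the ESS codebook before applying anything like this to the real data.

```python
import pandas as pd

# Toy rows shaped like the table above (NOT real ESS records)
df_demo = pd.DataFrame(
    {
        "last_party": [88.0, 6.0, 66.0, 5.0],
        "left_right": [5, 0, 88, 3],
    }
)

# ASSUMED missing-value codes per column; check the codebook
missing_codes = {
    "last_party": [66, 77, 88, 99],
    "left_right": [77, 88, 99],
}

df_clean = df_demo.copy()
for column, codes in missing_codes.items():
    # Keep values that are not missing codes; everything else becomes NaN
    df_clean[column] = df_clean[column].where(~df_clean[column].isin(codes))

print(df_clean)
```

Recoding before computing summary statistics keeps codes like `88` from inflating means on the left-right scale.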


## Data Subset
Use the list defined above to create a new DataFrame called `df_sub` containing only the relevant columns.
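
As a reminder of the pattern, here is a minimal sketch on toy data. `df_raw_demo` and `selected` are illustrative names, not the notebook's variables. Adding `.copy()` is a design choice: it makes the subset independent of the original, so later changes to the subset cannot affect the raw data.

```python
import pandas as pd

# Toy stand-in for the raw data (NOT real ESS records)
df_raw_demo = pd.DataFrame(
    {"idno": [1, 2], "cntry": ["FR", "BE"], "vote": [1, 2], "lrscale": [5, 3]}
)

selected = ["vote", "cntry", "lrscale"]

# Indexing with a list of names returns only those columns, in that order
df_sub = df_raw_demo[selected].copy()

# Editing the copy leaves the original untouched
df_sub.loc[0, "vote"] = 9
print(df_raw_demo["vote"].tolist())
```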

In [None]:
# Create the df_sub dataframe


## Rename Variables
Create a list of meaningful column names and assign it to `df.columns` to rename the DataFrame columns.
- Remember to write column names in lowercase with words separated by underscores (also called snake_case)
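
Assigning a list to `df.columns` relies on the new names being in exactly the same order as the existing columns. An alternative you may prefer is `DataFrame.rename` with an explicit mapping, sketched here on toy data. The old-to-new pairs follow the names used earlier in this notebook; `df_demo` and `rename_map` are illustrative.

```python
import pandas as pd

# Toy DataFrame with original ESS-style column names
df_demo = pd.DataFrame({"cntry": ["FR"], "lrscale": [5], "prtdgcl": [2]})

# Explicit old -> new mapping, so column order no longer matters
rename_map = {
    "cntry": "country",
    "lrscale": "left_right",
    "prtdgcl": "how_close",
}

df_renamed = df_demo.rename(columns=rename_map)
print(list(df_renamed.columns))
```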

In [None]:
# Rename columns


## Summary Statistics
You will likely use some of the following methods to quickly explore your dataset:
- `df.info()` to inspect data types and missing values
- `df['some_continuous_var'].describe()` to summarize a continuous variable
- `df['some_discrete_var'].value_counts()` to examine the distribution of a discrete variable
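
The methods listed above can be tried on a small example first. The sketch below uses made-up values in a toy DataFrame, treating `left_right` as roughly continuous and `vote` as discrete.

```python
import pandas as pd

# Made-up values for illustration (NOT real ESS records)
df_demo = pd.DataFrame(
    {
        "left_right": [5, 0, 5, 3, 5, 4, 7, 2],
        "vote": [1, 1, 2, 2, 2, 1, 1, 1],
    }
)

# Data types and non-null counts
df_demo.info()

# Count, mean, standard deviation, and quartiles of a numeric column
summary = df_demo["left_right"].describe()
print(summary)

# Frequency of each category in a discrete column
counts = df_demo["vote"].value_counts()
print(counts)
```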


In [None]:
df['some_var']...

## Save the Data
Save the cleaned DataFrame to a CSV file.


In [58]:
# Can you figure this out? Don't forget to set the `index` argument to `False`
df.to_csv("data/ess_subset.csv", index=False)
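
A quick way to confirm the export worked is to read the file back and compare it with what you wrote. This sketch does the round trip in a temporary directory with toy data. The file name `ess_subset_demo.csv` is hypothetical and is used only so the example does not touch your real output.

```python
import os
import tempfile

import pandas as pd

# Toy data for illustration (NOT real ESS records)
df_demo = pd.DataFrame(
    {"vote": [1, 2], "country": ["FR", "FR"], "left_right": [5, 3]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "ess_subset_demo.csv")
    # index=False keeps the row index out of the file
    df_demo.to_csv(out_path, index=False)
    df_back = pd.read_csv(out_path)

print(list(df_back.columns))
```

Without `index=False`, the reloaded file would gain an extra `Unnamed: 0` column holding the old row labels.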

In [74]:
df[df["last_party"] == 4].count()

vote             44
country          44
last_party       44
closest_party    44
how_close        44
left_right       44
dtype: int64

In [76]:
df[df["last_party"] == 5].count()

vote             136
country          136
last_party       136
closest_party    136
how_close        136
left_right       136
dtype: int64
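
Filtering one party code at a time works, but `value_counts()` can tabulate every code in a single call. This is a minimal sketch on made-up values; the codes in `df_demo` are illustrative only.

```python
import pandas as pd

# Made-up party codes for illustration (NOT real ESS records)
df_demo = pd.DataFrame({"last_party": [4.0, 5.0, 5.0, 6.0, 66.0, 5.0]})

# One row per distinct code, sorted by the code itself
party_counts = df_demo["last_party"].value_counts().sort_index()
print(party_counts)

# Equivalent single-code count using a boolean mask
n_party_5 = int((df_demo["last_party"] == 5).sum())
print(n_party_5)
```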