## Data Acquisition

We are ultimately interested in testing the following hypothesis:
> **Centrist individuals exhibit lower levels of affective polarization.**

The goal of this notebook is to **acquire, clean, and prepare** a dataset that allows us to empirically examine this hypothesis using survey data. Specifically, we will extract a reduced set of variables from the ESS Round 10 dataset, rename them for clarity, and conduct a brief exploratory analysis.

This notebook will output a **smaller, well-documented, and analysis-ready dataset** containing only the variables relevant for testing the hypothesis.

### Work Plan

1. Load the ESS Round 10 dataset *(already completed)*
2. Identify the variables relevant for our hypothesis
3. Create a subset dataframe containing only these variables
4. Rename variables using clear and descriptive names
5. Produce basic summary statistics for each selected variable
6. Export the cleaned dataset as a temporary CSV file

### Tips

- You should adapt code from previous notebooks:
    - mainly `04-data-exploration-columns.ipynb`
    - and also `subset_dataframes.ipynb`

- You can also consult the **Typst article** to help you identify theoretically relevant variables.

## Load the Data

In [2]:
# Import Relevant Libraries/Packages/Modules
import os
import requests
import zipfile
import pandas as pd

In [3]:
# Preparing Relevant Paths & Folders
file_url = "https://github.com/datamisc/ess-10/raw/main/data.zip"
data_dir = "data/raw/"
# Create the data directory if it does not exist yet
os.makedirs(data_dir, exist_ok=True)
# Filenames
csv_filename = "ESS10.csv"
zip_path = os.path.join(data_dir, "data.zip")
csv_path = os.path.join(data_dir, csv_filename)

In [4]:
# Function to download the dataset if not available
def get_data():
    if os.path.exists(csv_path):
        print(f"✅ CSV already exists: {csv_path}")
        return

    print("❗ CSV not found. Downloading ZIP...")

    # Download ZIP
    response = requests.get(file_url)
    response.raise_for_status()
    with open(zip_path, "wb") as f:
        f.write(response.content)
    print(f"📦 ZIP downloaded to: {zip_path}")

    # Extract ZIP
    print("📂 Extracting ZIP...")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(data_dir)

    print(f"✅ Files extracted to: {data_dir}")

    # Confirm extraction
    if not os.path.exists(csv_path):
        raise FileNotFoundError(
            f"CSV '{csv_filename}' not found inside ZIP. "
            f"Check ZIP contents in {data_dir}."
        )

In [5]:
# Use the newly created `get_data()` function.
get_data()

✅ CSV already exists: data/raw/ESS10.csv
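
If `get_data()` ever raises its `FileNotFoundError`, listing the archive's members shows which filenames the ZIP really contains. The following is a minimal sketch using the standard library's `zipfile.ZipFile.namelist()`. The helper name `list_zip_members` and the in-memory archive are made up for illustration; in this notebook you would pass `zip_path` instead.

```python
import io
import zipfile


def list_zip_members(zip_source):
    """Return the member names stored in a ZIP archive (path or file-like object)."""
    with zipfile.ZipFile(zip_source, "r") as zip_ref:
        return zip_ref.namelist()


# Build a small in-memory archive so the example runs without downloading anything.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zip_ref:
    zip_ref.writestr("ESS10.csv", "idno,cntry\n10038,BE\n")
    zip_ref.writestr("README.txt", "illustrative archive")

buffer.seek(0)
members = list_zip_members(buffer)
print(members)
```

Comparing this list with `csv_filename` makes a naming mismatch easy to spot.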


In [6]:
# Loading the CSV Data
df = pd.read_csv(csv_path)


In [7]:
# Take a look at the data you just loaded
df.head()

Unnamed: 0,name,essround,edition,proddate,idno,cntry,dweight,pspwght,pweight,anweight,...,vinwe,inwde,jinws,jinwe,inwtm,mode,domain,prob,stratum,psu
0,ESS10e03_2,10,3.2,02.11.2023,10038,BE,0.88222,0.972276,0.718075,0.698167,...,2022-09-01 17:47:00,2022-09-01 17:47:00,2022-09-01 17:47:00,2022-09-01 17:47:00,36.0,1,1.0,0.000397,188,2596
1,ESS10e03_2,10,3.2,02.11.2023,10053,BE,1.047643,0.888635,0.718075,0.638107,...,2022-04-08 11:07:00,2022-04-08 11:10:00,2022-04-08 11:07:00,2022-04-08 11:10:00,54.0,2,2.0,0.000334,194,2206
2,ESS10e03_2,10,3.2,02.11.2023,10055,BE,1.087741,0.722811,0.718075,0.519033,...,2022-05-20 11:08:00,2022-05-20 11:10:00,2022-05-20 11:08:00,2022-05-20 11:10:00,77.0,1,2.0,0.000322,198,2114
3,ESS10e03_2,10,3.2,02.11.2023,10062,BE,0.90991,1.005565,0.718075,0.722072,...,2022-05-22 13:58:00,2022-05-22 13:59:00,2022-05-22 13:58:00,2022-05-22 13:59:00,55.0,1,1.0,0.000385,150,2645
4,ESS10e03_2,10,3.2,02.11.2023,10064,BE,0.918949,0.638705,0.718075,0.458639,...,2022-05-18 11:44:00,2022-05-18 11:45:00,2022-05-18 11:44:00,2022-05-18 11:45:00,55.0,1,1.0,0.000381,149,2313


## Relevant Variables/Columns
- Check the ESS Round 10 codebook to confirm each variable's name, question wording, and response codes.
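
Once you have candidate names from the codebook, it helps to confirm that they exist in the loaded data before subsetting, since a misspelled name makes `df[...]` raise a `KeyError`. Below is a small sketch of such a check. The helper `split_by_availability` and the toy DataFrame are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd


def split_by_availability(frame, candidates):
    """Split candidate column names into those present in `frame` and those missing."""
    present = [name for name in candidates if name in frame.columns]
    missing = [name for name in candidates if name not in frame.columns]
    return present, missing


# Toy stand-in for the ESS data (hypothetical values, not real survey rows).
toy = pd.DataFrame({"cntry": ["BE", "FR"], "lrscale": [3, 5], "clsprty": [2, 1]})

# "prtdgcl" is deliberately absent from the toy frame to show the missing case.
present, missing = split_by_availability(toy, ["lrscale", "clsprty", "prtdgcl", "cntry"])
print("present:", present)
print("missing:", missing)
```

With the real data you would call the helper on `df` and your own list of variable names.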

In [8]:
# Create a list with the names of the relevant variables
vars = ["lrscale", "clsprty", "prtdgcl", "cntry", "prtclffr", "prtvtefr"]

## Data Subset
Use the previous list to create a new data frame called `df_sub` containing only the relevant columns.

In [9]:
# Create the df_sub dataframe (as a copy, so later changes do not affect df)
df_sub = df[vars].copy()
df_sub.head()

Unnamed: 0,lrscale,clsprty,prtdgcl,cntry,prtclffr,prtvtefr
0,3,2,6,BE,,
1,5,2,6,BE,,
2,5,1,3,BE,,
3,5,2,6,BE,,
4,4,2,6,BE,,


## Rename Variables
Create a list of meaningful column names and assign it to `df_sub.columns` to rename the DataFrame columns.
- Remember to name the columns in lowercase with underscores (also called snake_case)
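
Assigning a list to `.columns` relies on the new names being in exactly the same order as the existing columns. As an alternative, `DataFrame.rename(columns=...)` takes an explicit old-to-new mapping. The sketch below uses a toy DataFrame, and the new names in `name_map` are illustrative choices rather than required ones.

```python
import pandas as pd

# Toy frame with two of the ESS variable names used in this notebook.
toy = pd.DataFrame({"lrscale": [3, 5], "cntry": ["BE", "BE"]})

# Hypothetical mapping: the new names are illustrative, pick your own.
name_map = {"lrscale": "left_right", "cntry": "country"}

# rename() returns a new DataFrame; columns not in the mapping keep their names.
renamed = toy.rename(columns=name_map)
print(list(renamed.columns))
```

Because each new name is tied to its original, reordering the columns cannot silently attach a label to the wrong variable.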

In [10]:
# Rename columns
df_sub.columns = ["left_right", "closer_to_party", "how_close_party", "country", "party_closer_to_fr", "last_vote_fr"]

## Summary Statistics
You will likely use some of the following methods to quickly explore your dataset:
- `df.info()` to inspect data types and missing values
- `df['some_continuous_var'].describe()` to summarize a continuous variable
- `df['some_discrete_var'].value_counts()` to examine the distribution of a discrete variable


In [11]:
df_sub.head()

Unnamed: 0,left_right,closer_to_party,how_close_party,country,party_closer_to_fr,last_vote_fr
0,3,2,6,BE,,
1,5,2,6,BE,,
2,5,1,3,BE,,
3,5,2,6,BE,,
4,4,2,6,BE,,
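
`head()` only previews rows, so the summary-statistics step from the work plan still needs the methods listed above. The sketch below applies them to a toy DataFrame with hypothetical values. It assumes that 77, 88, and 99 are non-substantive codes (refusal, don't know, no answer) on the left-right scale; confirm the actual codes in the codebook before recoding real data.

```python
import pandas as pd

# Toy data (hypothetical values, not real survey rows).
toy = pd.DataFrame(
    {
        "left_right": [3, 5, 5, 88, 4, 77],
        "country": ["BE", "BE", "FR", "FR", "FR", "BE"],
    }
)

# Assumed missing-value codes; check them against the codebook.
assumed_missing_codes = [77, 88, 99]

# mask() sets matching entries to NaN so they do not distort the mean.
left_right = toy["left_right"].mask(toy["left_right"].isin(assumed_missing_codes))

toy.info()                            # data types and non-null counts
print(left_right.describe())          # summary of the continuous variable
print(toy["country"].value_counts())  # distribution of the discrete variable
```

Without the recoding step, values such as 88 would be treated as positions on the scale and inflate the mean.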


## Save the Data
Save the cleaned DataFrame to a CSV file.


In [13]:
# Can you figure this out? Don't forget to set the `index` argument to False
df_sub.to_csv("data/ESS10_clean.csv", index=False)
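
A quick way to confirm the export is to read the file back and compare its shape and column names with what was written. The sketch below does this with a toy DataFrame in a temporary directory. The filename `toy_clean.csv` is hypothetical; for the real data you would read back `data/ESS10_clean.csv`.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for df_sub (hypothetical values).
toy = pd.DataFrame({"left_right": [3, 5], "country": ["BE", "FR"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_clean.csv")  # hypothetical filename
    toy.to_csv(out_path, index=False)

    # Read the file back while the temporary directory still exists.
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
print(reloaded.shape)
```

With `index=False`, the row index is not written, so the reloaded file contains no extra `Unnamed: 0` column.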