## Data Acquisition

We are ultimately interested in testing the following hypothesis:
> **Centrist individuals exhibit lower levels of affective polarization.**

The goal of this notebook is to **acquire, clean, and prepare** a dataset that allows us to empirically examine this hypothesis using survey data. Specifically, we will extract a reduced set of variables from the ESS Round 10 dataset, rename them for clarity, and conduct a brief exploratory analysis.

This notebook will output a **smaller, well-documented, and analysis-ready dataset** containing only the variables relevant for testing the hypothesis.

### Work Plan

1. Load the ESS Round 10 dataset *(already completed)*
2. Identify the variables relevant for our hypothesis
3. Create a subset dataframe containing only these variables
4. Rename variables using clear and descriptive names
5. Produce basic summary statistics for each selected variable
6. Export the cleaned dataset as a temporary CSV file

### Tips

- You should adapt code from previous notebooks:
    - mostly `04-data-exploration-columns.ipynb`
    - but also `subset_dataframes.ipynb`

- You can also consult the **typst article** to help you identify theoretically relevant variables.

## Load the Data

In [None]:
# Import Relevant Libraries/Packages/Modules
import os
import requests
import zipfile
import pandas as pd

In [None]:
# Preparing Relevant Paths & Folders
file_url = "https://github.com/datamisc/ess-10/raw/main/data.zip"
data_dir = "data/raw/"
# Create the data directory if it does not exist yet
os.makedirs(data_dir, exist_ok=True)
# Filenames
csv_filename = "ESS10.csv"
zip_path = os.path.join(data_dir, "data.zip")
csv_path = os.path.join(data_dir, csv_filename)

In [None]:
# Function to download the dataset if not available
def get_data():
    if os.path.exists(csv_path):
        print(f"✅ CSV already exists: {csv_path}")
        return

    print("❗ CSV not found. Downloading ZIP...")

    # Download ZIP
    response = requests.get(file_url)
    response.raise_for_status()
    with open(zip_path, "wb") as f:
        f.write(response.content)
    print(f"📦 ZIP downloaded to: {zip_path}")

    # Extract ZIP
    print("📂 Extracting ZIP...")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(data_dir)

    print(f"✅ Files extracted to: {data_dir}")

    # Confirm extraction
    if not os.path.exists(csv_path):
        raise FileNotFoundError(
            f"CSV '{csv_filename}' not found inside ZIP. "
            f"Check ZIP contents in {data_dir}."
        )

In [None]:
# Use the newly created `get_data()` function.
get_data()
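
If `get_data()` raises the `FileNotFoundError`, listing the archive's members shows which file names the ZIP actually contains. Below is a minimal sketch using only the standard library. It builds a tiny in-memory archive as a stand-in for `data.zip`, and the helper name `list_zip_members` is hypothetical (it is not part of the notebook).

```python
import io
import zipfile


def list_zip_members(zip_source):
    """Return the member names of a ZIP archive (path or file-like object)."""
    with zipfile.ZipFile(zip_source, "r") as zip_ref:
        return zip_ref.namelist()


# Build a small in-memory archive so the example runs without downloading anything
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zip_out:
    zip_out.writestr("ESS10.csv", "cntry,lrscale\nFR,5\n")
    zip_out.writestr("README.txt", "toy archive")

buffer.seek(0)
members = list_zip_members(buffer)
print(members)
print("ESS10.csv" in members)
```

With the real file, the equivalent call would be `list_zip_members(zip_path)`.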

In [None]:
# Loading the raw CSV data into a `df_raw` object that you keep as a backup
df_raw = pd.read_csv(csv_path)

In [None]:
# Take a look at the data you just loaded.
# Only the last expression in a cell is displayed, so keep the one you want at the end.
df_raw
# or
df_raw.head()
# or
df_raw.sample(5)


## Relevant Variables/Columns
- Check the ESS codebook to find the names of the variables relevant to the hypothesis.
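
Once you have candidate names from the codebook, it can help to confirm that each one is actually a column of the loaded data. Below is a minimal sketch that uses a small stand-in DataFrame instead of the real `df_raw`. The helper name `find_missing_columns` is hypothetical.

```python
import pandas as pd


def find_missing_columns(frame, candidates):
    """Return the candidate names that are not columns of `frame`."""
    return [name for name in candidates if name not in frame.columns]


# Toy stand-in for `df_raw` (illustrative values only, not real ESS data)
toy_raw = pd.DataFrame({"cntry": ["FR", "FR"], "lrscale": [5, 3], "vote": [1, 2]})

candidates = ["cntry", "lrscale", "vote", "prtdgcl"]
missing = find_missing_columns(toy_raw, candidates)
print(missing)
```

An empty list means every candidate can be selected safely. Any name that is returned points to a typo or to a variable that is not in this file.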

In [None]:
# Create a list with the names of the relevant variables
vars_1 = ["vote", "cntry", "prtvtefr", "prtclffr", "prtdgcl", "lrscale"]
vars_2 = ["lrscale", "clsprty", "prtdgcl", "cntry", "prtclffr", "prtvtefr"]

# or ?

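vars = [
    "clsprty",  # feels closer to a particular party (yes/no)
    "cntry",  # country
    "lrscale",  # left-right self-placement; this needs to be recoded
    "prtclffr",  # party the respondent feels closer to (France)
    "prtdgcl",  # how close the respondent feels to that party
    "prtvtefr",  # party voted for in the last national election (France)
    "vote",  # voted in the last national election
]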

In [None]:
vars

In [None]:
# You could also use a dictionary (more on this later...)

vars_dict = {
    "clsprty": "closer_party_dummy",
    "cntry": "country",
    "lrscale": "ideology",
    "prtclffr": "closer_party",
    "prtdgcl": "closer_party_likert", 
    "prtvtefr": "previous_vote",
    "vote": "vote",
}


In [None]:
vars_dict.keys()    

## Data Subset
Use the previous list to create a new data frame called `df` containing only the relevant columns from `df_raw`.
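
Below is a minimal sketch of this step on toy data, assuming you want the subset to be independent of the raw backup. The column `agea` and all values are illustrative only.

```python
import pandas as pd

# Toy stand-in for `df_raw` (illustrative values only)
toy_raw = pd.DataFrame(
    {
        "cntry": ["FR", "FR", "DE"],
        "lrscale": [5, 3, 2],
        "agea": [34, 51, 27],
    }
)

selected = ["cntry", "lrscale"]

# Selecting with a list of names keeps only those columns, in the listed order;
# `.copy()` makes the subset independent of the original frame
toy_subset = toy_raw[selected].copy()

# Changing the subset leaves the raw backup untouched
toy_subset["cntry"] = toy_subset["cntry"].str.lower()

print(list(toy_subset.columns))
print(toy_subset["cntry"].tolist())
print(toy_raw["cntry"].tolist())
```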

In [None]:
# Create dataframe with the selected variables
# (`.copy()` keeps `df` independent of the `df_raw` backup)
df = df_raw[vars].copy()
df.sample(5)

In [None]:
# The dictionary keys select the same columns from the raw data
df_raw[list(vars_dict.keys())]

## Rename Variables
Create a list of meaningful column names and assign it to `df.columns` to rename the DataFrame columns.
- Remember to name the columns in snake_case: lowercase words separated by underscores.
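
To double-check the naming convention after renaming, a small regular expression can flag names that are not snake_case. This is only a sketch on toy data. The pattern and the helper name `is_snake_case` are assumptions about what counts as acceptable here, not a pandas feature.

```python
import re

import pandas as pd

# Lowercase letters/digits, with words joined by single underscores
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def is_snake_case(name):
    """Return True if `name` is lowercase words joined by single underscores."""
    return SNAKE_CASE.fullmatch(name) is not None


toy = pd.DataFrame({"cntry": ["FR"], "lrscale": [5]})
toy = toy.rename(columns={"cntry": "country", "lrscale": "ideology"})

# Collect any column names that break the convention
bad_names = [name for name in toy.columns if not is_snake_case(name)]
print(bad_names)
```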

In [None]:
# Rename columns (the order must match the order of the columns in `df`)
col_names = [
    "closer_party_dummy",
    "country",
    "ideology",
    "closer_party",
    "closer_party_likert", 
    "previous_vote",
    "vote"
]
df.columns = col_names
df.sample(5)

In [None]:
# Or use the previously created dictionary, starting again from the raw columns.
# The advantage is that the order doesn't matter: each key is matched to its column.
df = df_raw[list(vars_dict)].rename(columns=vars_dict)
df.sample(5)

## Summary Statistics
You will likely use some of the following methods to quickly explore your dataset:
- `df.info()` to inspect data types and missing values
- `df['some_continuous_var'].describe()` to summarize a continuous variable
- `df['some_discrete_var'].value_counts()` to examine the distribution of a discrete variable
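
One caveat when reading these summaries: survey files often store answers such as "refusal" or "don't know" as numeric codes, which distorts means and maxima. The sketch below assumes that the left-right scale uses 77, 88, and 99 for such answers. Confirm the exact codes in the codebook before relying on them. All values are toy data.

```python
import pandas as pd

# Assumed missing-value codes for the left-right scale; confirm in the codebook
MISSING_CODES = [77, 88, 99]

# Toy data (illustrative values only, not real ESS responses)
toy = pd.DataFrame(
    {
        "country": ["FR", "FR", "DE", "DE"],
        "ideology": [5, 88, 2, 8],
    }
)

# Replace the special codes with missing values before summarizing
ideology_clean = toy["ideology"].mask(toy["ideology"].isin(MISSING_CODES))

print(toy["ideology"].mean())   # distorted by the code 88
print(ideology_clean.mean())    # computed on valid answers only
print(toy["country"].value_counts())
```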


In [None]:
df.info()

In [None]:
# You could go 1 by 1
df['closer_party_dummy'].max()

In [None]:
# Or take a deeper look at a specific numeric variable
df['closer_party_dummy'].describe()

In [None]:
# But you can also check them all together if you have just a few columns
# This selects all numeric columns
df.select_dtypes(include="number")

In [None]:
# Then we use the describe method
df.select_dtypes(include="number").describe()


In [None]:
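# We can do the same for string or categorical variables
# (older pandas versions store text as "object"; newer ones may use a dedicated string dtype)
df.select_dtypes(include=["object", "string"])

In [None]:
# But for categorical/nominal variables we usually want to count things
# (on a DataFrame, `value_counts()` counts each unique combination of values across the columns)
df.select_dtypes(include=["object", "string"]).value_counts()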

## Save the Data
Save the cleaned DataFrame to a CSV file.
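
A quick way to check the export is to read the file back and compare it with what you saved. Below is a minimal sketch on toy data. It writes to a temporary directory, so the file name `toy_subset.csv` is illustrative and nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"country": ["FR", "DE"], "ideology": [5, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_subset.csv")

    # index=False keeps the row index out of the file
    toy.to_csv(out_path, index=False)

    # Reading the file back should give the same columns and values
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
print(reloaded.shape)
```

Without `index=False`, the reloaded frame would contain an extra unnamed column holding the old row index.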


In [None]:
df

In [None]:
# Can you figure this out? Don't forget to set the `index` argument to False
df.to_csv("data/ess_subset.csv", index=False)
