# RUNNING EXAMPLE

Candidate data and narrative for a general-purpose running example.

In [3]:
import pandas as pd
import numpy as np
from pathlib import Path

In [8]:
# Project root is the parent of the notebook's working directory.
PROJECT_DIR = Path.cwd().parent
ANES_PTH = PROJECT_DIR / "data" / "anes_timeseries_cdf_csv_20251211" / "anes_timeseries_cdf_csv_20251211.csv"

In [21]:
ANES_DF = pd.read_csv(ANES_PTH, low_memory=False)

# Narrative 1

* Dataset: American National Election Studies (ANES). URL: https://electionstudies.org/data-center/anes-time-series-cumulative-data-file/
* Challenges
  * Large number of coded variables. Column headers are codes (e.g., `VCF0305`), so human-readable descriptions are needed
  * Summarize information reported
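
One way to get readable headers is a lookup table from variable codes to short names. The sketch below is a minimal illustration on a toy frame rather than the real file. The short names (`study_year` and so on) are hypothetical, and the codes are ones described elsewhere in this document.

```python
import pandas as pd

# Hypothetical lookup table: the codes appear in this document,
# the short names are made up for illustration.
VARIABLE_LABELS = {
    "VCF0004": "study_year",
    "VCF0301": "party_id_7pt",
    "VCF0305": "partisan_strength",
}

# Toy stand-in for ANES_DF so the sketch runs without the data file.
toy = pd.DataFrame({"VCF0004": [2016, 2020], "VCF0301": [1, 7], "VCF0305": [4, 4]})

# Rename only the columns that have a label; any others keep their codes.
readable = toy.rename(columns=VARIABLE_LABELS)
print(list(readable.columns))
```

The same `rename` call should work on `ANES_DF` once the lookup table covers the variables of interest.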

## Problem 1: Getting summaries of the variables

Which is easier: letting an AI find the variables, or searching for them on my own?

# Examining partisanship

The source is "'I Disrespectfully Agree': The Differential Effects of Partisan Sorting on Social and Issue Polarization", L. Mason, 2014.

In [22]:
"VCF0305" in ANES_DF.columns
partisans = ANES_DF["VCF0305"]
partisans.value_counts()

VCF0305
4    24940
3    22785
2    16050
1     8728
       662
0      580
Name: count, dtype: int64

Prompt: Please find the variables that most relate to partisanship. Provide their descriptions and scales.

## ANES PARTISANSHIP VARIABLES

## Core Party Identification Variables

### **VCF0301 - Party Identification (7-Point Scale)**
**Description:** The foundational measure of party identification using the classic Michigan model.

**Scale (1-7):**
1. Strong Democrat
2. Weak Democrat
3. Independent-leaning Democrat
4. Independent (pure independent)
5. Independent-leaning Republican
6. Weak Republican
7. Strong Republican

**Missing codes:** 0, 9 = NA/DK/Refused

**Years available:** 1952-2024

**Key uses:** 
- Primary measure of partisan attachment
- Predicts voting behavior, policy preferences
- Tracks partisan realignments over time

---

### **VCF0302 - Initial Party ID Response**
**Description:** Records respondent's first answer to the party ID question before follow-up probes.

**Scale:**
1. Democrat
2. Republican
3. Independent
4. Other party
5. No preference

**Years available:** 1952-2024

**Key uses:**
- Distinguishes initial response from leaning
- Useful for understanding strength of partisan attachment

---

### **VCF0303 - Party Identification (3-Category Summary)**
**Description:** Simplified party identification that includes leaners with partisans.

**Scale (1-3):**
1. Democrat (includes Strong, Weak, and Independent-leaning Democrats)
2. Independent (pure independents only - no partisan lean)
3. Republican (includes Strong, Weak, and Independent-leaning Republicans)

**Years available:** 1952-2024

**Key uses:**
- Standard breakdown variable for cross-tabulations
- Most commonly used in ANES Guide visualizations
- Treats leaners as "hidden partisans"
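
The relationship between the 7-point and 3-category scales described above can be sketched as a simple recode. This is a minimal illustration on toy values, assuming the codes listed for VCF0301 (including 0 and 9 as missing); it is not necessarily how the cumulative file derives VCF0303.

```python
import pandas as pd

# Toy 7-point party ID values, including the missing codes 0 and 9.
pid7 = pd.Series([1, 2, 3, 4, 5, 6, 7, 0, 9], name="VCF0301")

# Leaners are grouped with partisans, as in the 3-category description.
SEVEN_TO_THREE = {1: 1, 2: 1, 3: 1, 4: 2, 5: 3, 6: 3, 7: 3}

# Codes absent from the mapping (0 and 9 here) become NaN.
pid3 = pid7.map(SEVEN_TO_THREE)
print(pid3.tolist())
```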

---

### **VCF0305 - Strength of Partisanship**
**Description:** Measures intensity of partisan attachment regardless of direction.

**Scale (0-4):**
- 1 = Independent (no partisan leaning)
- 2 = Leaning Independent (either direction)
- 3 = Weak partisan (either party)
- 4 = Strong partisan (either party)
- 0 = No response

**Years available:** 1952-2024

**Key uses:**
- Studies partisan polarization
- Examines intensity vs. direction of partisanship
- Tracks growth/decline of pure independents
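
The blank entry in the earlier `value_counts()` output suggests the column may be read as strings, with whitespace standing in for missing values. The sketch below shows one way to clean such a column and attach the labels listed above. The toy series and the assumption about how the CSV is parsed are mine, not confirmed by the codebook.

```python
import pandas as pd

# Toy stand-in for ANES_DF["VCF0305"]: string codes with a blank for missing.
# This is an assumption about how the CSV is parsed.
raw = pd.Series(["4", "3", " ", "0", "1", "2"], name="VCF0305")

# Labels taken from the scale listed above.
STRENGTH_LABELS = {
    1: "Independent",
    2: "Leaning Independent",
    3: "Weak partisan",
    4: "Strong partisan",
}

# Blanks cannot be parsed as numbers and become NaN with errors="coerce".
codes = pd.to_numeric(raw.str.strip(), errors="coerce")

# Code 0 (no response) is not in the mapping, so it also becomes NaN.
strength = codes.map(STRENGTH_LABELS)
print(strength.value_counts(dropna=False))
```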

---

### **VCF0306 - Father's Party Identification**
**Description:** Respondent's report of their father's party identification.

**Scale:** Same 7-point scale as VCF0301

**Years available:** 1952-2024

**Key uses:**
- Intergenerational transmission of partisanship
- Parental socialization effects

---

### **VCF0307 - Mother's Party Identification**
**Description:** Respondent's report of their mother's party identification.

**Scale:** Same 7-point scale as VCF0301

**Years available:** 1952-2024

**Key uses:**
- Intergenerational transmission of partisanship
- Research shows mothers may have stronger influence than previously thought

---

## Party Affect and Attitudes

### **VCF0218 - Thermometer: Democratic Party**
**Description:** Feeling thermometer rating of the Democratic Party.

**Scale (0-100):**
- 0 = Very cold/unfavorable
- 50 = Neutral
- 100 = Very warm/favorable

**Years available:** 1978-2024

**Key uses:**
- Affective polarization research
- Party attitudes beyond identification
- Measuring "warmth" toward parties

---

### **VCF0224 - Thermometer: Republican Party**
**Description:** Feeling thermometer rating of the Republican Party.

**Scale (0-100):**
- 0 = Very cold/unfavorable
- 50 = Neutral
- 100 = Very warm/favorable

**Years available:** 1978-2024

**Key uses:**
- Affective polarization research
- Party attitudes beyond identification
- Measuring "warmth" toward parties

---

### **VCF0290 - Thermometer Average: Major Parties**
**Description:** Average of Democratic and Republican party thermometer ratings.

**Scale (0-100):** Average of VCF0218 and VCF0224

**Years available:** 1978-2024

**Key uses:**
- Overall attitude toward the party system
- System support measures

---

### **VCF0291 - Thermometer Index: Major Parties**
**Description:** Difference between own-party and opposing-party thermometer ratings.

**Calculation:** 
- For Democrats: Democratic Party thermometer - Republican Party thermometer
- For Republicans: Republican Party thermometer - Democratic Party thermometer

**Scale:** -100 to +100 (higher = stronger partisan affect)

**Years available:** 1978-2024

**Key uses:**
- **Affective polarization** (primary measure)
- Tracks increase in partisan hostility over time
- Used in ~1,200+ citations since 2012
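
The calculation described above can be reproduced by hand from the two party thermometers. The following is a sketch on invented toy rows, assuming VCF0303 codes of 1 = Democrat, 2 = Independent, 3 = Republican. The column name `own_minus_other` is hypothetical, and the result is a manual recomputation, not a check of how VCF0291 is built in the file.

```python
import numpy as np
import pandas as pd

# Toy data: 3-category party ID plus the two party thermometers.
toy = pd.DataFrame({
    "VCF0303": [1, 3, 2, 1],
    "VCF0218": [85, 20, 50, 60],   # Democratic Party thermometer
    "VCF0224": [15, 90, 50, 40],   # Republican Party thermometer
})

dem_minus_rep = toy["VCF0218"] - toy["VCF0224"]

# Own-party minus other-party rating; left as NaN for pure independents.
toy["own_minus_other"] = np.select(
    [toy["VCF0303"] == 1, toy["VCF0303"] == 3],
    [dem_minus_rep, -dem_minus_rep],
    default=np.nan,
)
print(toy)
```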

---

### **VCF0314-VCF0321 - Party Likes/Dislikes**
**Description:** Open-ended responses about what respondent likes/dislikes about each party.

**Variables:**
- VCF0314: Democratic Party likes (number of mentions)
- VCF0315: Democratic Party dislikes (number of mentions)
- VCF0316: Democratic Party affect (likes - dislikes)
- VCF0317: Democratic Party salience (likes + dislikes)
- VCF0318: Republican Party likes (number of mentions)
- VCF0319: Republican Party dislikes (number of mentions)
- VCF0320: Republican Party affect (likes - dislikes)
- VCF0321: Republican Party salience (likes + dislikes)

**Scale:** 0-5+ mentions (varies by year)

**Years available:** 1952-2024

**Key uses:**
- Understanding reasons for party preferences
- Party image research
- Measuring ambivalence

---

## Ideology Variables (Closely Related to Partisanship)

### **VCF0803 - Liberal-Conservative Self-Identification (7-Point Scale)**
**Description:** Ideological self-placement on a liberal-conservative continuum.

**Scale (1-7):**
1. Extremely liberal
2. Liberal
3. Slightly liberal
4. Moderate/middle of the road
5. Slightly conservative
6. Conservative
7. Extremely conservative

**Years available:** 1972-2024

**Key uses:**
- Ideological sorting with party ID
- Issue constraint
- Predicting policy preferences

---

### **VCF0804 - Liberal-Conservative Self-Identification (3-Category)**
**Description:** Simplified ideological identification.

**Scale (1-3):**
1. Liberal (including slightly liberal)
2. Moderate
3. Conservative (including slightly conservative)

**Years available:** 1972-2024

**Key uses:**
- Cross-tabulations
- Simplified ideological analysis

---

## Vote Choice Variables (Dependent Variables Often Predicted by Partisanship)

### **VCF0704 - Presidential Vote (2-Party)**
**Description:** Two-party presidential vote choice.

**Scale:**
1. Democrat
2. Republican

**Missing codes:** 0 = Did not vote/other

**Years available:** 1948-2024 (presidential years)

**Key uses:**
- Primary dependent variable in voting behavior research
- Testing party ID's predictive power

---

### **VCF0706 - House Vote (2-Party)**
**Description:** Two-party House vote choice.

**Scale:**
1. Democrat
2. Republican

**Missing codes:** 0 = Did not vote/other

**Years available:** 1948-2024

**Key uses:**
- Congressional voting behavior
- Coattail effects
- Partisan loyalty in down-ballot races

---

## Key Relationships Between Variables

### **Party ID → Vote Choice**
- Strong partisans vote for their party ~90% of the time
- Weak partisans ~75% loyalty
- Leaning independents ~60-70% loyalty (varies over time)

### **Party ID → Ideology**
- Increasing correlation since 1970s ("ideological sorting")
- 2020s: Most Democrats are liberal, most Republicans are conservative
- 1970s: Much more ideological diversity within parties

### **Party ID → Feeling Thermometers**
- Strong correlation between party ID and own-party warmth
- **Affective polarization**: Increasing gap between own-party and other-party ratings
- 1970s: ~30% had "cold" feelings toward opposing party
- 2020s: ~80% have "cold" feelings toward opposing party

---

## Special Notes on Missing Data

**Missing data codes that may appear, depending on the variable:**
- -9 = Refused
- -8 = Don't Know
- -7 = Deleted due to incomplete interview
- -6 = No post-election interview
- -5 = Missing, other
- -1 = Inapplicable (skip pattern)
- 0 = NA/Inap (context-dependent)

**Always check individual variable codebooks for specific missing codes.**
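
Because the codes differ by variable, one option is a small helper that takes the missing codes as an argument. This is a minimal sketch: the helper name `mask_missing` is hypothetical, and the codes passed in below are only an example that should be replaced with the ones from the relevant codebook entry.

```python
import pandas as pd

def mask_missing(series: pd.Series, missing_codes) -> pd.Series:
    """Return a numeric copy of `series` with the given codes set to NaN."""
    numeric = pd.to_numeric(series, errors="coerce")
    return numeric.mask(numeric.isin(list(missing_codes)))

# Toy 7-point party ID column with a few candidate missing codes mixed in.
toy = pd.Series([1, 7, 0, 9, 4, -9], name="VCF0301")
clean = mask_missing(toy, missing_codes={0, 9, -9})
print(clean.tolist())
```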

---

## Recommended Variable Combinations for Analysis

### **Basic Partisanship Analysis:**
- VCF0301 (7-point ID) or VCF0303 (3-category)
- VCF0704 (Presidential vote)

### **Affective Polarization:**
- VCF0291 (Party thermometer index)
- VCF0218 + VCF0224 (Individual party thermometers)
- VCF0303 (Party ID)

### **Ideological Sorting:**
- VCF0301 (Party ID)
- VCF0803 (Liberal-conservative ID)
- Time trends (1972-2024)

### **Partisan Strength/Intensity:**
- VCF0305 (Strength of partisanship)
- VCF0316 + VCF0320 (Party affect scores)

### **Intergenerational Transmission:**
- VCF0301 (Respondent's party ID)
- VCF0306 (Father's party ID)
- VCF0307 (Mother's party ID)

---

## Data Access

All variables are available in the **ANES Time Series Cumulative Data File (1948-2024)** at:
https://electionstudies.org/data-center/anes-time-series-cumulative-data-file/

Individual year studies may have additional partisanship-related variables not in the cumulative file.


## Other variables

Prompt: Please list variables in the ANES data that relate to date of response and to the geography of the respondent. Provide scale information.

## DATE AND TIMING VARIABLES

### **VCF0004 - Study Year**
**Description:** Year of the ANES study.

**Scale:** 
- 1948, 1952, 1954, 1956, 1958, 1960, 1962, 1964, 1966, 1968, 1970, 1972, 1974, 1976, 1978, 1980, 1982, 1984, 1986, 1988, 1990, 1992, 1994, 1996, 1998, 2000, 2002, 2004, 2008, 2012, 2016, 2020, 2024

**Years available:** All years 1948-2024

**Key uses:**
- Identifying which election year the case is from
- Time-series analysis across elections
- Creating year fixed effects in models

---

### **VCF0006 - Case ID**
**Description:** Unique identifier for each respondent within a study year.

**Scale:** 4-digit numeric ID (varies by year, typically 1-9999)

**Years available:** All years 1948-2024

**Key uses:**
- Linking cases across pre/post-election waves
- Merging with other datasets
- Panel data linking (when applicable)

**Note:** Combined with VCF0004 creates unique identifier across all years
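
The note above can be illustrated with a combined key. The sketch uses toy values, and the column name `respondent_key` and the zero-padding width are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "VCF0004": [2016, 2016, 2020, 2020],
    "VCF0006": [1, 2, 1, 2],   # case IDs repeat across study years
})

# Zero-padding the case ID keeps the combined key fixed-width and sortable.
toy["respondent_key"] = (
    toy["VCF0004"].astype(str) + "-" + toy["VCF0006"].astype(str).str.zfill(4)
)

print(toy["respondent_key"].tolist())
print(toy["respondent_key"].is_unique)
```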

---

### **VCF1015 - Days Before Election (Pre-Election Interview)**
**Description:** Number of days before the election that the pre-election interview was conducted.

**Scale:** 
- Positive integers = days before election
- Typically ranges from 1 to 60+ days

**Missing codes:**
- 0 = Inap, no pre-election interview
- Negative values = various missing data codes

**Years available:** Presidential election years with pre-election interviews

**Key uses:**
- Campaign exposure analysis (early vs. late interviews)
- Controlling for timing effects
- Understanding when opinions were measured
- Identifying last-minute deciders vs. early deciders

**Example interpretation:**
- 60 = Interviewed 60 days (2 months) before election
- 7 = Interviewed 1 week before election
- 1 = Interviewed day before election
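
For early-versus-late comparisons, the day counts can be grouped into bins. The sketch below assumes the coding described above (positive values are days before the election, 0 and negatives are missing). The cut points and labels are hypothetical choices, not ANES conventions.

```python
import numpy as np
import pandas as pd

# Toy days-before-election values; 0 means no pre-election interview.
days = pd.Series([60, 30, 7, 1, 0], name="VCF1015")

# Treat 0 and negative codes as missing.
days = days.where(days > 0)

# Hypothetical cut points: 1-7 days, 8-30 days, more than 30 days.
timing = pd.cut(
    days,
    bins=[0, 7, 30, np.inf],
    labels=["final week", "final month", "earlier"],
)
print(timing.tolist())
```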

---

### **VCF1016 - Days After Election (Post-Election Interview)**
**Description:** Number of days after the election that the post-election interview was conducted.

**Scale:** 
- Positive integers = days after election
- Typically ranges from 1 to 90+ days

**Missing codes:**
- 0 = Inap, no post-election interview
- Negative values = various missing data codes

**Years available:** All years with post-election interviews (1948-2024)

**Key uses:**
- Post-election reflection and memory effects
- Controlling for interview timing in turnout validation
- Winner/loser effects over time
- Distinguishing immediate vs. delayed post-election responses

**Example interpretation:**
- 1 = Interviewed day after election
- 7 = Interviewed 1 week after election
- 30 = Interviewed 1 month after election
- 60 = Interviewed 2 months after election

**Research note:** Earlier post-election interviews may be more accurate for turnout reporting but affected by immediate emotional reactions to results.

---

### **VCF0017 - Mode of Interview**
**Description:** Method used to conduct the interview.

**Scale (varies by year, typical codes):**
- 1 = Face-to-face (in person)
- 2 = Telephone
- 3 = Mail
- 4 = Web/Internet
- 5 = Video (2020, 2024)

**Years available:** 1948-2024

**Key uses:**
- Mode effects analysis
- Data quality assessment
- Understanding social desirability bias
- Controlling for measurement differences

**Historical note:** 
- 1948-2012: Primarily face-to-face
- 2012: Introduction of web mode
- 2020: No face-to-face due to COVID-19
- 2024: Return of face-to-face plus web/video

---

## GEOGRAPHIC VARIABLES (PUBLIC USE)

### **VCF0112 - Census Region**
**Description:** Four-region Census classification of respondent's location.

**Scale (1-4):**
- 1 = Northeast (New England + Mid-Atlantic)
- 2 = Midwest (East North Central + West North Central)
- 3 = South (South Atlantic + East South Central + West South Central)
- 4 = West (Mountain + Pacific)

**Years available:** 1948-2024

**Key uses:**
- Regional voting patterns
- North/South comparisons
- Regional realignment analysis
- Standard demographic control

**States within each region:**
- **Northeast:** ME, NH, VT, MA, RI, CT, NY, NJ, PA
- **Midwest:** OH, IN, IL, MI, WI, MN, IA, MO, ND, SD, NE, KS
- **South:** DE, MD, DC, VA, WV, NC, SC, GA, FL, KY, TN, AL, MS, AR, LA, OK, TX
- **West:** MT, ID, WY, CO, NM, AZ, UT, NV, WA, OR, CA, AK, HI

---

### **VCF0113 - Political South/Non-South**
**Description:** Binary classification based on historical political regions.

**Scale (1-2):**
- 1 = South
- 2 = Non-South

**Years available:** 1948-2024

**Key uses:**
- **Southern realignment** research (shift from Democratic to Republican South)
- Civil rights era political change
- Regional political culture
- Most commonly used in ANES Guide breakdowns

**States coded as "South":**
The 11 states of the former Confederacy:
- AL, AR, FL, GA, LA, MS, NC, SC, TN, TX, VA
- Some classifications also add border states such as KY, OK, and WV; check the codebook for the definition used here

**Research importance:** Critical for understanding:
- Civil Rights Movement's political impact
- "Solid South" Democratic era (pre-1960s)
- Republican Southern Strategy (1960s-1980s)
- Contemporary partisan geography

---

### **VCF0111 - Size of Place (Urban/Rural Classification)**
**Description:** Population size of place of residence.

**Scale (typically 1-6 or similar, exact coding varies by year):**

#### **Common categories:**
- 1 = Large city (500,000+ or central city of metro area)
- 2 = Suburb of large city
- 3 = Medium city (50,000-499,999)
- 4 = Small city/town (10,000-49,999)
- 5 = Rural/open country (<10,000)

#### **Alternative coding in some years:**
- 1 = Central cities of 12 largest SMSAs
- 2 = Central cities of other SMSAs
- 3 = Suburbs of 12 largest SMSAs
- 4 = Suburbs of other SMSAs
- 5 = Adjacent areas
- 6 = Non-adjacent areas

**Years available:** 1952-2024 (with some gaps and coding changes)

**Key uses:**
- Urban-rural political divide
- Suburban voting patterns
- Metropolitan vs. non-metropolitan analysis
- Understanding "Big Sort" (geographic sorting by politics)

**Missing codes:** 0, 9 = NA/Inap/DK

**Important note:** Coding has changed over time. Check codebook for year-specific definitions.

---

### **VCF0114 - Urban/Rural Summary (Simple Classification)**
**Description:** Simplified urban/rural classification.

**Scale (1-3):**
- 1 = Urban (cities, central cities, metro areas)
- 2 = Suburban (suburbs of metro areas)
- 3 = Rural (small towns, open country, non-metro)

**Years available:** Select years (check codebook)

**Key uses:**
- Simple urban/suburban/rural breakdown
- Easier to use than detailed VCF0111
- Standard demographic control

---

### **VCF0900 - Congressional District of Residence**
**Description:** Respondent's congressional district.

**Scale:** Varies by state (1-53 for California, 1-27 for New York, etc.)

**Years available:** Select years

**Key uses:**
- Linking to House election results
- District-level contextual analysis
- Merging with congressional data

**Note:** Available in public data but with limitations. More detailed in RDA files.

---

### **VCF0900b - State and Congressional District (FIPS)**
**Description:** Combined state FIPS code and congressional district.

**Format:** 5 digits
- First 2 digits: State FIPS code
- Last 3 digits: Congressional district number

**Example:**
- 06053 = California (06), 53rd district
- 36027 = New York (36), 27th district

**Years available:** Select years

**Key uses:**
- Precise geographic identifier
- Merging with other datasets using standard FIPS codes
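
The 5-digit format described above can be split into its two parts. One practical wrinkle, assumed here rather than confirmed for this file, is that a numeric read drops the leading zero of codes such as `06053`. The sketch pads the value back to five digits before slicing.

```python
import pandas as pd

# Toy values: read as integers, California's 06053 arrives as 6053.
raw = pd.Series([6053, 36027], name="VCF0900b")

# Restore the 5-digit form, then split into state (2) and district (3) parts.
padded = raw.astype(str).str.zfill(5)
parsed = pd.DataFrame({
    "state_fips": padded.str[:2],
    "district": padded.str[2:].astype(int),
})
print(parsed)
```

Keeping `state_fips` as a string preserves the leading zero for later merges on FIPS codes.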

---

### **VCF0900c - State and Congressional District (Postal Abbreviation)**
**Description:** State postal code + congressional district number.

**Format:** Text (e.g., "CA-53", "NY-27")

**Years available:** Select years

**Key uses:**
- Human-readable district identifier
- Easier reference than FIPS codes

---

### **VCF0901a - State Code (FIPS)**
**Description:** Federal Information Processing Standards (FIPS) state code.

**Scale:** 01-56 (2-digit numeric)

**Examples:**
- 01 = Alabama
- 06 = California
- 12 = Florida
- 36 = New York
- 48 = Texas

**Years available:** 1948-2024

**Key uses:**
- Standard state identifier for data merging
- Linking to Census data
- State-level analysis

**Full FIPS code list available at:** https://www.census.gov/library/reference/code-lists/ansi.html

---

### **VCF0901b - State Postal Abbreviation**
**Description:** Two-letter state postal code.

**Scale:** 2-letter codes (AL, AK, AZ... WY)

**Examples:**
- CA = California
- NY = New York
- TX = Texas
- FL = Florida

**Years available:** 1948-2024

**Key uses:**
- Human-readable state identifier
- More intuitive than FIPS codes
- Standard reference for charts/tables

---

### **County Variables (VCF0170a-d) - LIMITED AVAILABILITY**
**Description:** County of interview location.

**Variables:**
- VCF0170a: County 1956-1960
- VCF0170b: County 1964-1976
- VCF0170c: County 1968-1982
- VCF0170d: County 1970, 1978-1998

**Availability:** **Limited in public data**. More detailed county information available through **Restricted Data Access (RDA)** only.

**Key uses (when available):**
- Fine-grained geographic analysis
- Merging with county-level demographic/economic data
- Urban/rural classification
- Contextual effects

---

## RESTRICTED GEOGRAPHIC DATA (RDA Only)

### **Additional Geographic Detail Available Through RDA:**

#### **Standard RDA Geocode Files (1978-present):**
- Precise latitude/longitude coordinates
- Census tract
- Census block group
- Minor Civil Division (MCD)
- ZIP codes (1994-present)
- County FIPS codes (more complete than public data)
- Metropolitan Statistical Area (MSA) codes

#### **How to Access:**
1. Apply through ICPSR's Virtual Data Enclave (VDE)
2. Restricted use for research purposes only
3. Cannot be shared or published directly
4. Must be analyzed in secure environment

#### **Why Restricted:**
- Protect respondent confidentiality
- Prevent re-identification
- Geographic detail could reveal identity when combined with other variables

#### **Common RDA Uses:**
- Merging contextual variables (e.g., county unemployment, racial composition)
- Analyzing neighborhood effects
- Understanding geographic sorting
- Studying local political context

**More info:** https://electionstudies.org/data-center/restricted-data-access/

---

## COMBINING DATE AND GEOGRAPHY VARIABLES

### **Common Research Applications:**

#### **1. Regional Trends Over Time**
```
VCF0004 (Year) × VCF0113 (South/Non-South)
- Track Southern realignment 1948-2024
- Visualize party shifts by region
```
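
A year-by-region breakdown like the one outlined above could be sketched as a grouped mean. The rows below are invented toy data, the shares are unweighted, and the party codes assume VCF0303 with 1 = Democrat.

```python
import pandas as pd

# Toy rows: year, region (1 = South, 2 = Non-South), 3-category party ID
# (1 = Democrat, 2 = Independent, 3 = Republican).
toy = pd.DataFrame({
    "VCF0004": [1952, 1952, 1952, 1952, 2020, 2020, 2020, 2020],
    "VCF0113": [1, 1, 2, 2, 1, 1, 2, 2],
    "VCF0303": [1, 1, 1, 3, 3, 3, 1, 3],
})

# Share of Democratic identifiers in each year-by-region cell (unweighted).
dem_share = (
    toy.assign(is_dem=toy["VCF0303"].eq(1))
       .groupby(["VCF0004", "VCF0113"])["is_dem"]
       .mean()
       .unstack("VCF0113")
       .rename(columns={1: "South", 2: "Non-South"})
)
print(dem_share)
```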

#### **2. Urban-Rural Polarization**
```
VCF0004 (Year) × VCF0111 (Urban/Rural)
- Document growing urban-rural divide
- Show suburban shift patterns
```

#### **3. Campaign Timing Effects**
```
VCF1015 (Days before election) × Vote choice
- Early deciders vs. late deciders
- Campaign influence windows
```

#### **4. Geographic Contextual Analysis**
```
VCF0901a (State) + RDA county data
- Merge with Census, economic, or election data
- Understand local context effects
```

#### **5. Mode Effects by Geography**
```
VCF0017 (Mode) × VCF0111 (Urban/Rural)
- Urban areas more likely web interviews
- Rural areas more face-to-face historically
```

---

## IMPORTANT NOTES AND CAVEATS

### **Date/Timing Variables:**
1. **Interview dates vary** - pre-election interviews typically span 6-8 weeks
2. **Post-election timing matters** - memory decays, emotions change
3. **Campaign dynamics** - early interviews miss late campaign events
4. **Mode and timing often confounded** - web interviews may be completed faster

### **Geographic Variables:**
1. **Boundaries change** - congressional districts redistricted every 10 years
2. **Metropolitan definitions evolve** - what counts as "urban" has changed
3. **Regional coding consistent** - but political meaning of "South" has shifted
4. **Privacy protections** - most detailed geography is restricted access only
5. **Sample design** - not all geographic areas sampled every year

### **Missing Data:**
- Alaska and Hawaii: Not included in face-to-face samples
- Some geographic variables have gaps in coverage
- Always check codebook for year-specific availability

### **Weighting:**
- Use appropriate weights (VCF0009x/y/z) when analyzing geographic patterns
- Weights adjust for unequal selection probabilities and non-response
- Different weights for face-to-face vs. web vs. combined samples
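
A weighted share can be computed with `numpy.average`. The sketch below uses invented weights and a hypothetical indicator column `is_dem`. Which of the weight variables mentioned above is appropriate for a given year and sample is an open question for the codebook, so `VCF0009z` here is only a placeholder.

```python
import numpy as np
import pandas as pd

# Toy data with invented weights.
toy = pd.DataFrame({
    "VCF0113": [1, 1, 2, 2],          # 1 = South, 2 = Non-South
    "is_dem": [1, 0, 1, 0],           # hypothetical indicator column
    "VCF0009z": [1.0, 3.0, 2.0, 2.0],
})

def weighted_share(group: pd.DataFrame) -> float:
    """Weighted mean of is_dem using VCF0009z as the weight."""
    return float(np.average(group["is_dem"], weights=group["VCF0009z"]))

# One weighted share per region.
shares = {region: weighted_share(g) for region, g in toy.groupby("VCF0113")}
print(shares)
```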

---

## RECOMMENDED VARIABLE COMBINATIONS

### **Basic Geographic Analysis:**
- VCF0004 (Year)
- VCF0901b (State postal code) 
- VCF0112 (Region) or VCF0113 (South/Non-South)
- VCF0111 (Urban/Rural)

### **Timing Analysis:**
- VCF0004 (Year)
- VCF1015 (Pre-election timing)
- VCF1016 (Post-election timing)
- VCF0017 (Mode)

### **State-Level Analysis:**
- VCF0004 (Year)
- VCF0901a or VCF0901b (State identifier)
- Can merge with external state-level data

### **Congressional District Analysis:**
- VCF0004 (Year)
- VCF0900, VCF0900b, or VCF0900c (District)
- Can merge with House election results

---

## DATA ACCESS

### **Public Use Variables:**
All VCF variables listed above, except the RDA-only items, are available in:
- **ANES Time Series Cumulative Data File (1948-2024)**
- Download at: https://electionstudies.org/data-center/

### **Restricted Access Variables:**
Additional geographic detail available through:
- **ICPSR Virtual Data Enclave**
- Application required
- Research use only
- Details at: https://electionstudies.org/data-center/restricted-data-access/

---

## REFERENCES FOR GEOGRAPHIC CODING

### **Census Regions and Divisions:**
https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf

### **FIPS State Codes:**
https://www.census.gov/library/reference/code-lists/ansi.html

### **Congressional Districts:**
https://www.census.gov/programs-surveys/geography/guidance/geo-areas/congressional-dist.html

### **Urban-Rural Classifications:**
https://www.census.gov/programs-surveys/geography/guidance/geo-areas/urban-rural.html
