# Final Project

The final project is an opportunity for you to explore how political campaigns and political journalists apply data science. This is meant to be an open-ended assignment. Pick a question from the world of politics (it can be broader than campaigns or American politics) and write a data-intensive paper (think of a blog post, e.g., Nate Cohn at the NYT's Upshot) that explores the question, describes the methodology you use to answer it, presents the results, and states the conclusions you can draw from the analysis. The paper should be around 5-8 pages (double-spaced), including any figures or tables but excluding your code.

The final project must involve original data analysis conducted in Python, in which you find or create one or more data sources, analyze and visualize the data, and describe your results. Good final papers will combine substantive knowledge of the question with high-quality data and analytics. You must submit your code along with your final paper.

In this notebook, I provide access to two data sources. You may use either of these, or others. I am providing this data to help you get started; you are not required to use it. The only requirement is that you use some real political data.

## Source #1: Cooperative Election Study

One promising data source is the Cooperative Election Study (CES), a large-N survey that has been conducted over multiple years. I am providing data for 2010-2020. Each year has around 20,000 observations.

The Cooperative Election Study consists of two waves in election years. The pre-election wave is in the field from late September to late October. The post-election wave is administered in November. Within a given election, the same people take the survey pre- and post-election. Across elections, the respondents are different.

To help you get started, I have uploaded a version of the Cooperative Election Study (`CCES_data_for_final_project.csv`). See below for how to read the data. Note that some questions have missing data because not all questions were asked in every year and state. For example, if there wasn't a gubernatorial question in a certain state-year, that data will be missing. Make sure you review the codebook: `data_dictionary.csv`.
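Before committing to a question, it can help to check how much of each variable is actually present in each year. The following is a minimal sketch on a small made-up frame: the column names match the CES file, but the values in `toy_ces` are hypothetical stand-ins, so swap in the real `data_ces` once you have loaded it.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the CES file: real column names, hypothetical values.
toy_ces = pd.DataFrame({
    "year": [2010, 2010, 2012, 2012, 2012, 2014],
    "voted_gov_party": ["Republican", np.nan, np.nan, np.nan, "Democratic", "Democratic"],
    "voted_sen_party": [np.nan, "Democratic", "Republican", np.nan, "Democratic", np.nan],
})

# Share of missing values in each column, computed within each survey year.
# isna() gives True/False per cell; the mean of booleans is the share of True.
missing_by_year = (
    toy_ces.drop(columns="year")
    .isna()
    .groupby(toy_ces["year"])
    .mean()
)
print(missing_by_year)
```

A share close to 1.0 for a given year suggests the question was not asked (or did not apply) that year, which is different from respondents skipping it.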

You can read more about the survey [here](https://github.com/joshuakalla/data_science_campaigns/blob/master/Final_Project/guide_cumulative_2006-2023.pdf).

## Source #2: American National Election Studies

A second promising data source is the American National Election Studies (ANES). This is a smaller-N but very high-quality survey. I am providing data from the ANES 2016-2020 panel. In this survey, the same respondents took a pre-2016 election survey, a post-2016 election survey, a pre-2020 election survey, and a post-2020 election survey. You can read more about the survey [here](https://electionstudies.org/data-center/2016-2020-panel-merged-file/).

The data file for this survey is `ANES_2020.csv`. To interpret the data, make sure you review the codebook: `anes_reduced_codebook.txt`.


# Reading the Cooperative Election Study

```python
import pandas as pd
data_ces = pd.read_csv("https://raw.githubusercontent.com/joshuakalla/data_science_campaigns/refs/heads/master/Final_Project/CCES_data_for_final_project.csv")
data_ces.head()
```

Unnamed: 0,year,case_id,weight_cumulative,st,pid7,ideo5,gender,birthyr,race,educ,...,voted_pres_party,intent_turnout_self,voted_turnout_self,vv_turnout_gvm,intent_rep_party,voted_rep_party,intent_gov_party,voted_gov_party,intent_sen_party,voted_sen_party
0,2010,12274,0.189605,MI,Lean Republican,Very Conservative,Male,1968,White,Some College,...,Republican,,Yes,Voted,Republican,Republican,Republican,Republican,,
1,2010,16008,1.112055,CA,Lean Democrat,Liberal,Male,1979,White,4-Year,...,Democratic,,Yes,Voted,Democratic,Democratic,Democratic,Democratic,Democratic,Democratic
2,2010,54292,0.387745,FL,Not Very Strong Republican,Conservative,Female,1952,White,4-Year,...,Republican,,Yes,Voted,Republican,Republican,Republican,Republican,Republican,Republican
3,2010,1708,0.065135,MO,Lean Republican,Very Conservative,Male,1953,White,4-Year,...,Republican,,Yes,No Record of Voting,Republican,,,,Republican,
4,2010,68960,0.05115,OK,Strong Democrat,Liberal,Male,1955,Hispanic,Post-Grad,...,Democratic,,Yes,Voted,Democratic,,Democratic,,Democratic,
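Because the same respondents answer both waves within a year, you can compare what people said they would do with what they later reported doing. The sketch below assumes that the `intent_*` columns come from the pre-election wave and the `voted_*` columns from the post-election wave, as the names suggest; confirm this in the codebook. The values in `toy_ces` are hypothetical.

```python
import pandas as pd

# Toy stand-in: real column names, hypothetical values.
toy_ces = pd.DataFrame({
    "intent_rep_party": ["Republican", "Republican", "Democratic",
                         "Democratic", None, "Democratic"],
    "voted_rep_party": ["Republican", "Democratic", "Democratic",
                        "Democratic", "Republican", None],
})

# Keep only respondents with an answer in both columns.
both_waves = toy_ces.dropna(subset=["intent_rep_party", "voted_rep_party"])

# Rows: stated intent. Columns: reported vote.
# normalize="index" makes each row sum to 1.
consistency = pd.crosstab(
    both_waves["intent_rep_party"],
    both_waves["voted_rep_party"],
    normalize="index",
)
print(consistency)
```

The diagonal of this table is the share of respondents whose reported vote matched their stated intent. This version is unweighted; a real analysis would likely also use the survey weights.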


```python
dictionary_ces = pd.read_csv("https://raw.githubusercontent.com/joshuakalla/data_science_campaigns/refs/heads/master/Final_Project/data_dictionary.csv")
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
dictionary_ces
```

Unnamed: 0,Variable,Label
0,year,Survey Year
1,case_id,Survey Respondent
2,weight_cumulative,Survey Weight
3,st,State
4,pid7,Partisan identity (7 point)
5,ideo5,Ideology (5 point)
6,gender,Sex
7,birthyr,Year of Birth
8,race,Race
9,educ,Education
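The dictionary lists `weight_cumulative` as the survey weight. One common way to use a survey weight is to sum weights within each category rather than counting rows. The helper below, `weighted_shares`, is a hypothetical name introduced for illustration, and the values in `toy_ces` are made up; treat it as a sketch and check the CES guide for the recommended weighting procedure.

```python
import pandas as pd

# Toy stand-in: real column names, hypothetical values.
toy_ces = pd.DataFrame({
    "year": [2010, 2010, 2010, 2012, 2012],
    "weight_cumulative": [0.5, 1.5, 1.0, 2.0, 1.0],
    "voted_rep_party": ["Republican", "Democratic", "Democratic",
                        "Republican", "Democratic"],
})

def weighted_shares(df, column, weight="weight_cumulative"):
    """Weighted share of each category in `column`, dropping missing answers."""
    answered = df.dropna(subset=[column])
    totals = answered.groupby(column)[weight].sum()
    return totals / totals.sum()

# Compare weighted and unweighted shares within a single year.
ces_2010 = toy_ces[toy_ces["year"] == 2010]
shares_2010 = weighted_shares(ces_2010, "voted_rep_party")
unweighted_2010 = ces_2010["voted_rep_party"].value_counts(normalize=True)
print(shares_2010)
print(unweighted_2010)
```

Computing shares within one year at a time keeps the comparison simple; pooling across years raises additional questions about how the weights are scaled.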


# Reading the ANES

```python
data_anes = pd.read_csv("https://raw.githubusercontent.com/joshuakalla/data_science_campaigns/refs/heads/master/Final_Project/ANES_2020.csv")
data_anes.head()
```


Unnamed: 0,V202110x,V202068x,V202072,V202119,V202120a,V202120b,V202120c,V202120d,V202120e,V202120f,...,V201426x,V201235,V201236,V201382x,V201351,V201352,V201356x,V201359x,V201362x,V202468x_quartile
0,-1,1,-1,-1,-1,-1,-1,-1,-1,-1,...,1,1,2,6,3,3,7,1,7,4
1,3,2,1,1,0,0,1,0,0,0,...,4,1,1,4,2,2,7,1,6,3
2,1,2,1,2,0,0,0,0,0,0,...,7,1,3,1,3,4,2,6,1,4
3,1,2,1,1,0,0,0,0,0,0,...,6,2,4,2,4,4,4,2,2,2
4,2,2,1,2,0,0,0,0,0,0,...,1,1,2,2,2,2,7,1,3,4
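The ANES columns are numeric codes, and the first row above contains several `-1` values. The sketch below assumes that negative codes mark missing or inapplicable answers rather than substantive responses; verify the meaning of each code in the codebook before applying this to a variable. The three columns in `toy_anes` copy the first five rows shown above.

```python
import pandas as pd

# Three columns copied from the head() output above.
toy_anes = pd.DataFrame({
    "V202110x": [-1, 3, 1, 1, 2],
    "V202072": [-1, 1, 1, 1, 1],
    "V201426x": [1, 4, 7, 6, 1],
})

# Assumption: negative codes are non-substantive (missing / inapplicable).
# mask() replaces every cell where the condition is True with NaN.
cleaned = toy_anes.mask(toy_anes < 0)

# Count how many answers were set to missing in each column.
print(cleaned.isna().sum())
```

Without this step, a mean or a regression would treat `-1` as a real value on the scale.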


To read the codebook, go to https://github.com/joshuakalla/data_science_campaigns/blob/master/Final_Project/anes_reduced_codebook.txt.

You can see the full question wording by downloading the questionnaires from https://electionstudies.org/data-center/2016-2020-panel-merged-file/.