**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update the relevant sections where you lost those points, and make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Andy Huang
- Audrey Kim
- Katie Moc
- Khang Quach
- Elton Villalta

# Research Question

-  Include a specific, clear data science question.
-  Make sure what you're measuring (variables) to answer the question is clear

What is your research question? Include the specific question you're setting out to answer. This question should be specific, answerable with data, and clear. A general question with specific subquestions is permitted. (1-2 sentences)

_____________________________________________________________________________________________________________________________

Given a year’s top hit songs, is there a correlation between their genre, popularity, and lyrical density and the final presidential vote in established swing states?


## Background and Prior Work


- Include a general introduction to your topic
- Include explanation of what work has been done previously
- Include citations or links to previous work

This section will present the background and context of your topic and question in a few paragraphs. Include a general introduction to your topic and then describe what information you currently know about the topic after doing your initial research. Include references to other projects who have asked similar questions or approached similar problems. Explain what others have learned in their projects.

Find some relevant prior work, and reference those sources, summarizing what each did and what they learned. Even if you think you have a totally novel question, find the most similar prior work that you can and discuss how it relates to your project.

References can be research publications, but they need not be. Blogs, GitHub repositories, company websites, etc., are all viable references if they are relevant to your project. It must be clear which information comes from which references. (2-3 paragraphs, including at least 2 references)

 **Use inline citation through HTML footnotes to specify which references support which statements** 

For example: After government genocide in the 20th century, real birds were replaced with surveillance drones designed to look just like birds.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Use a minimum of 2 or 3 citations, but we prefer more.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) You need enough to fully explain and back up important facts. 

Note that if you click a footnote number in the paragraph above it will transport you to the proper entry in the footnotes list below.  And if you click the ^ in the footnote entry, it will return you to the place in the main text where the footnote is made.

To understand the HTML here, `<a name="#..."> </a>` is a tag that allows you to produce a named reference for a given location.  Markdown has the construction `[text with hyperlink](#named reference)` that will produce a clickable link that transports you to the named reference.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Lorenz, T. (9 Dec 2021) Birds Aren’t Real, or Are They? Inside a Gen Z Conspiracy Theory. *The New York Times*. https://www.nytimes.com/2021/12/09/technology/birds-arent-real-gen-z-misinformation.html 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Also refs should be important to the background, not some randomly chosen vaguely related stuff. Include a web link if possible in refs as above.

_____________________________________________________________________________________________________________________________

Throughout history, music has played an important role in the creative expression of disadvantaged communities. We want to gain greater insight into this relationship using exploratory data analysis. Our initial research shows that groups fighting for liberation often turned to the rap and R&B genres and wrote songs about their struggles, especially throughout the civil rights movement. In more recent decades, we have seen increased political expression in other genres, such as pop and singer-songwriter. We want to find out whether these genres and other music variables correlate with political affiliation over time by state.

Past research has shown that people's music preferences can signal their political affiliation. Specifically, Republicans typically prefer country music, while Democrats typically prefer pop, rap, hip-hop, classic rock, and/or alternative music and are less likely to prefer country music.<a name="cite_ref-3"></a>[<sup>1</sup>](#cite_note-3) A second study examines the relationship between music preferences, partisanship, and political attitudes. It notes that music and political affiliation have long been connected, with music often both influencing and reflecting political events. Using an original survey, the study again found that Republicans are more likely to prefer country music, while Democrats tend to prefer classic rock and alternative genres. These preferences were found to influence a person's political attitudes: Republicans who prefer new country showed more favorable opinions of Republican congressmen and the Supreme Court. These findings suggest that political identities can be strengthened by music preferences through affective polarization.<a name="cite_ref-4"></a>[<sup>2</sup>](#cite_note-4) This research is relevant to our project because it shows a clear correlation between music genre and political party. We want to explore this further by examining additional variables and focusing on trends across US states.

1. <a name="cite_note-3"></a> [^](#cite_ref-3) Dolan, Eric W. “Music Preferences Serve as Markers of Political Affiliation.” PsyPost, 2 Mar. 2024, https://www.psypost.org/music-preferences-serve-as-markers-of-political-affiliation/
2. <a name="cite_note-4"></a> [^](#cite_ref-4) Mack, Brianna N., and Teresa R. Martin. “Party Rocking: Exploring the Relationship Between Music Preference, Partisanship, and Political Attitudes.” Ohio Wesleyan University, 26 Dec. 2023, https://www.sciencedirect.com/science/article/pii/S0304422X23001018

# Hypothesis



- Include your team's hypothesis
- Ensure that this hypothesis is clear to readers
- Explain why you think this will be the outcome (what was your thinking?)

What is your main hypothesis/predictions about what the answer to your question is? Briefly explain your thinking. (2-3 sentences)

_____________________________________________________________________________________________________________________________

We hypothesize that, given a year’s top hit songs, we can predict a given swing state’s final vote because genre, popularity, and lyrical density are highly correlated with it.

During an election year, if the most popular genres are classic rock and alternative with high lyrical density, we predict that a swing state’s final vote will be Democratic, whereas in years where the most popular genres are country and classical with lower lyrical density, we predict it will lean Republican. Because more popular songs tend to be better predictors of the direction of a swing state’s final vote, we weight songs of higher popularity more heavily.


# Data

## Data overview

For each dataset include the following information
- Dataset #1
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
- Dataset #2 (if you have more than one!)
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
- etc

Now write 2 - 5 sentences describing each dataset here. Include a short description of the important variables in the dataset; what the metrics and datatypes are, what concepts they may be proxies for. Include information about how you would need to wrangle/clean/preprocess the dataset

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.
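
One possible way to combine the two datasets is sketched below, assuming the Spotify songs have already been collapsed to one row per year and the election table to one winner per state and year. Both toy tables, their values, and the `weighted_spch` column are hypothetical placeholders, not our actual results. The sketch casts the float `top year` to an integer `year` and joins on it.

```python
import pandas as pd

# Hypothetical per-year music features (values are made up for illustration).
yearly_music = pd.DataFrame(
    {
        "top year": [2011.0, 2012.0, 2013.0, 2016.0],
        "weighted_spch": [4.6, 12.2, 8.0, 9.5],
    }
)

# Hypothetical per-state winners (values are made up for illustration).
state_winners = pd.DataFrame(
    {
        "year": [2012, 2012, 2016, 2020],
        "state": ["OHIO", "FLORIDA", "OHIO", "OHIO"],
        "party_simplified": ["DEMOCRAT", "DEMOCRAT", "REPUBLICAN", "REPUBLICAN"],
    }
)

# Give the music table an integer "year" key that matches the election table.
yearly_music = yearly_music.assign(
    year=yearly_music["top year"].astype(int)
).drop(columns="top year")

# An inner join keeps only years present in both tables.
combined = state_winners.merge(yearly_music, on="year", how="inner")
print(combined)
```

Note that an inner join on year keeps only presidential election years that fall inside the Spotify range, which for 2011 through 2019 means 2012 and 2016, so the combined table would be small.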

## Spotify 2010 - 2019 Top 100

In [6]:
import pandas as pd
import numpy as np

In [30]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION 

spotify = pd.read_csv("Spotify 2010 - 2019 Top 100.csv")
spotify

Unnamed: 0,title,artist,top genre,year released,added,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop,top year,artist type
0,STARSTRUKK (feat. Katy Perry),3OH!3,dance pop,2009.0,2022‑02‑17,140.0,81.0,61.0,-6.0,23.0,23.0,203.0,0.0,6.0,70.0,2010.0,Duo
1,My First Kiss (feat. Ke$ha),3OH!3,dance pop,2010.0,2022‑02‑17,138.0,89.0,68.0,-4.0,36.0,83.0,192.0,1.0,8.0,68.0,2010.0,Duo
2,I Need A Dollar,Aloe Blacc,pop soul,2010.0,2022‑02‑17,95.0,48.0,84.0,-7.0,9.0,96.0,243.0,20.0,3.0,72.0,2010.0,Solo
3,Airplanes (feat. Hayley Williams of Paramore),B.o.B,atl hip hop,2010.0,2022‑02‑17,93.0,87.0,66.0,-4.0,4.0,38.0,180.0,11.0,12.0,80.0,2010.0,Solo
4,Nothin' on You (feat. Bruno Mars),B.o.B,atl hip hop,2010.0,2022‑02‑17,104.0,85.0,69.0,-6.0,9.0,74.0,268.0,39.0,5.0,79.0,2010.0,Solo
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
998,Strike a Pose (feat. Aitch),Young T & Bugsey,afroswing,2019.0,2020‑08‑20,138.0,58.0,53.0,-6.0,10.0,59.0,214.0,1.0,10.0,67.0,2019.0,Duo
999,The London (feat. J. Cole & Travis Scott),Young Thug,atl hip hop,2019.0,2020‑06‑22,98.0,59.0,80.0,-7.0,13.0,18.0,200.0,2.0,15.0,75.0,2019.0,Solo
1000,,,,,,,,,,,,,,,,,
1001,,,,,,,,,,,,,,,,,


In [41]:
# The dataset has 3 empty rows at the bottom so we'll remove those.
spotify = spotify.iloc[:-3]

# Since the following code returns False, we know we have no other missing data in our dataset!
spotify.isna().any(axis=None)

False
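
As an alternative to slicing off a fixed number of rows, the empty rows could be dropped by content rather than by position. The sketch below uses a small made-up table in place of the real CSV and assumes the trailing rows are missing in every column, as they appear to be in the preview above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Spotify table: two real rows followed by fully empty rows.
toy = pd.DataFrame(
    {
        "title": ["Song A", "Song B", np.nan, np.nan],
        "pop": [70.0, 68.0, np.nan, np.nan],
        "top year": [2010.0, 2010.0, np.nan, np.nan],
    }
)

# Drop only rows in which every column is missing, wherever they occur.
cleaned = toy.dropna(how="all")

# Confirm that nothing is missing in the remaining rows.
has_missing = bool(cleaned.isna().any(axis=None))
print(len(cleaned), has_missing)
```

This does not depend on knowing how many empty rows the file contains, so it keeps working if the source file changes.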

In [47]:
# All of our numeric columns are already floats.
# The "added" column holds dates stored as strings (object dtype), but we will not be using it, so we can ignore it for now.
# We do not need to change the type of any other column.

spotify.dtypes 

title             object
artist            object
top genre         object
year released    float64
added             object
bpm              float64
nrgy             float64
dnce             float64
dB               float64
live             float64
val              float64
dur              float64
acous            float64
spch             float64
pop              float64
top year         float64
artist type       object
dtype: object

In [54]:
# We only want to consider the songs from 2011 onward since Spotify was not launched in the US until then.
spotify = spotify[spotify['top year'] >= 2011]

# For our project, we are going to focus on the "top genre", "spch", "pop", and "top year" columns.
spotify = spotify[['top genre', 'spch', 'pop', 'top year']]
spotify

Unnamed: 0,top genre,spch,pop,top year
100,british soul,3.0,84.0,2011.0
101,british soul,2.0,81.0,2011.0
102,dance pop,5.0,69.0,2011.0
103,canadian pop,5.0,80.0,2011.0
104,detroit hip hop,25.0,74.0,2011.0
...,...,...,...,...
979,dfw rap,6.0,84.0,2019.0
980,dfw rap,21.0,83.0,2019.0
981,dfw rap,8.0,82.0,2019.0
983,dance pop,9.0,85.0,2019.0
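
To compare songs with election results, the filtered table will probably need to be reduced to one row per year. The sketch below shows one way that could look, assuming `spch` (speechiness) serves as our stand-in for lyrical density and that popularity is used as a weight. The toy rows and the output column names `weighted_spch`, `modal_genre`, and `mean_pop` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like the filtered Spotify table (values are made up).
songs = pd.DataFrame(
    {
        "top genre": ["dance pop", "dance pop", "dfw rap", "dfw rap", "dance pop"],
        "spch": [5.0, 3.0, 20.0, 10.0, 9.0],
        "pop": [80.0, 20.0, 50.0, 50.0, 85.0],
        "top year": [2011.0, 2011.0, 2012.0, 2012.0, 2012.0],
    }
)


def summarize_year(group: pd.DataFrame) -> pd.Series:
    """Collapse one year's songs into a single row of features."""
    return pd.Series(
        {
            # Speechiness averaged with popularity as the weight.
            "weighted_spch": np.average(group["spch"], weights=group["pop"]),
            # Most frequent genre label in that year's list.
            "modal_genre": group["top genre"].mode().iloc[0],
            # Plain average popularity for reference.
            "mean_pop": group["pop"].mean(),
        }
    )


# Select the feature columns explicitly so the grouping key stays in the index.
yearly = songs.groupby("top year")[["top genre", "spch", "pop"]].apply(summarize_year)
print(yearly)
```

Weighting by popularity means a very popular song pulls the yearly speechiness value toward itself more than a less popular one does.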


## U.S. Presidential Election Results, 1976-2020

In [56]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION 

president = pd.read_csv("1976-2020-president.csv")
president.head(10)

Unnamed: 0,year,state,state_po,state_fips,state_cen,state_ic,office,candidate,party_detailed,writein,candidatevotes,totalvotes,version,notes,party_simplified
0,1976,ALABAMA,AL,1,63,41,US PRESIDENT,"CARTER, JIMMY",DEMOCRAT,False,659170,1182850,20210113,,DEMOCRAT
1,1976,ALABAMA,AL,1,63,41,US PRESIDENT,"FORD, GERALD",REPUBLICAN,False,504070,1182850,20210113,,REPUBLICAN
2,1976,ALABAMA,AL,1,63,41,US PRESIDENT,"MADDOX, LESTER",AMERICAN INDEPENDENT PARTY,False,9198,1182850,20210113,,OTHER
3,1976,ALABAMA,AL,1,63,41,US PRESIDENT,"BUBAR, BENJAMIN """"BEN""""",PROHIBITION,False,6669,1182850,20210113,,OTHER
4,1976,ALABAMA,AL,1,63,41,US PRESIDENT,"HALL, GUS",COMMUNIST PARTY USE,False,1954,1182850,20210113,,OTHER
5,1976,ALABAMA,AL,1,63,41,US PRESIDENT,"MACBRIDE, ROGER",LIBERTARIAN,False,1481,1182850,20210113,,LIBERTARIAN
6,1976,ALABAMA,AL,1,63,41,US PRESIDENT,,,True,308,1182850,20210113,,OTHER
7,1976,ALASKA,AK,2,94,81,US PRESIDENT,"FORD, GERALD",REPUBLICAN,False,71555,123574,20210113,,REPUBLICAN
8,1976,ALASKA,AK,2,94,81,US PRESIDENT,"CARTER, JIMMY",DEMOCRAT,False,44058,123574,20210113,,DEMOCRAT
9,1976,ALASKA,AK,2,94,81,US PRESIDENT,"MACBRIDE, ROGER",LIBERTARIAN,False,6785,123574,20210113,,LIBERTARIAN


In [55]:
president.isna().any(axis=None) 

True
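
Since the check above returns `True`, a reasonable next step is to see which columns hold the missing values (the preview shows blank `candidate`, `party_detailed`, and `notes` fields) and then reduce the table to one winner per state and year. The sketch below uses made-up vote counts and a hypothetical `SWING_STATES` list; which states count as swing states is something we would still need to define and justify.

```python
import numpy as np
import pandas as pd

# Toy rows using a subset of the election table's columns (vote counts are made up).
votes = pd.DataFrame(
    {
        "year": [2012, 2012, 2012, 2016, 2016],
        "state": ["OHIO", "OHIO", "OHIO", "OHIO", "OHIO"],
        "candidate": ["A", "B", np.nan, "C", "D"],
        "party_simplified": ["DEMOCRAT", "REPUBLICAN", "OTHER", "DEMOCRAT", "REPUBLICAN"],
        "candidatevotes": [520, 470, 10, 430, 510],
    }
)

# Count missing values per column to see where the NaNs are.
missing_per_column = votes.isna().sum()
print(missing_per_column)

# Hypothetical swing-state list, used here only for illustration.
SWING_STATES = ["OHIO", "FLORIDA"]
swing = votes[votes["state"].isin(SWING_STATES)]

# For each (year, state), keep the row with the most votes.
top_rows = swing.loc[swing.groupby(["year", "state"])["candidatevotes"].idxmax()]
winners = top_rows[["year", "state", "party_simplified"]].reset_index(drop=True)
print(winners)
```

Because the winner is picked from `candidatevotes` and `party_simplified`, missing values in columns such as `candidate` or `notes` would not affect this step.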

# Ethics & Privacy

- Thoughtful discussion of ethical concerns included
- Ethical concerns consider the whole data science process (question asked, data collected, data being used, the bias in data, analysis, post-analysis, etc.)
- How your group handled bias/ethical concerns clearly described

Acknowledge and address any ethics & privacy related issues of your question(s), proposed dataset(s), and/or analyses. Use the information provided in lecture to guide your group discussion and thinking. If you need further guidance, check out [Deon's Ethics Checklist](http://deon.drivendata.org/#data-science-ethics-checklist). In particular:

- Are there any biases/privacy/terms of use issues with the data you proposed?
- Are there potential biases in your dataset(s), in terms of who it composes, and how it was collected, that may be problematic in terms of it allowing for equitable analysis? (For example, does your data exclude particular populations, or is it likely to reflect particular human biases in a way that could be a problem?)
- How will you set out to detect these specific biases before, during, and after/when communicating your analysis?
- Are there any other issues related to your topic area, data, and/or analyses that are potentially problematic in terms of data privacy and equitable impact?
- How will you handle issues you identified?

______________________________________________________________________________________________________________

The Kaggle dataset we are using reports US election results from 1976 to 2020 at the state, district, and national levels. Because these election results are public nationwide, we assume there are no privacy or terms-of-use problems with this data. However, the data itself can be biased: voter turnout varies from year to year, so the results do not capture the opinion of everyone in a given state. Generally, turnout has been higher in elections perceived as high-stakes. In addition, laws prevent certain groups from voting: for instance, non-citizens (including legal residents) cannot vote, and, depending on the state, neither can convicted felons or some people with a mental disability.

The Spotify dataset was built using Spotify’s web API and does not rely on individual users’ listening activity in a way that breaches their personal privacy. However, Spotify did not launch in the U.S. until 2011, and after the trial period ended in January 2012, users were limited to ten hours of streaming per month. For this reason, we cannot use Spotify alone to analyze the relationship between music preference and U.S. political preference for years before 2011. To answer our research question as accurately as possible, we must analyze Spotify listening trends from 2011 onward in relation to U.S. election results.

# Team Expectations 


Read over the [COGS108 Team Policies](https://github.com/COGS108/Projects/blob/master/COGS108_TeamPolicies.md) individually. Then, include your group’s expectations of one another for successful completion of your COGS108 project below. Discuss and agree on what all of your expectations are. Discuss how your team will communicate throughout the quarter and consider how you will communicate respectfully should conflicts arise. By including each member’s name above and by adding their name to the submission, you are indicating that you have read the COGS108 Team Policies, accept your team’s expectations below, and have every intention to fulfill them. These expectations are for your team’s use and benefit — they won’t be graded for their details.

* Show up to scheduled meetings consistently
* Evenly assign tasks during meetings
* Complete assigned tasks before meeting 
* Communicate via Discord beforehand if unable to complete assigned tasks or show up to meeting
* Communicate and respond timely (within 1 day)
* Voice opinions respectfully
* Always ask for help if needed!

# Project Timeline Proposal

Specify your team's specific project timeline. An example timeline has been provided. Change the dates, times, names, and details to fit your group's plan.

If you think you will need any special resources or training outside what we have covered in COGS 108 to solve your problem, then your proposal should state these clearly. For example, if you have selected a problem that involves implementing multiple neural networks, please state this so we can make sure you know what you’re doing and so we can point you to resources you will need to implement your project. Note that you are not required to use outside methods.



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 10/30  |  12:30 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions; find dataset(s)  | Discuss and decide on final project topic; discuss hypothesis; begin background research; submit project proposal | 
| 11/5  |  3 PM |  Do background research on topic | Discuss who will focus on what (Background, Ethics, EDA, Conclusion, etc.); Review proposal feedback and make changes | 
| 11/12  | 2 PM  | Progress on assigned tasks; Clean data  | Work on checkpoint #1; Discuss any questions or other ideas we might want to add   |
| 11/19  | 3 PM  | Progress on assigned tasks | Review checkpoint #1 feedback and make changes; Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 11/26  | 3 PM  | Progress on assigned tasks; EDA | Work on checkpoint #2; Discuss/edit Analysis |
| 12/3  | 3 PM  | Complete analysis; Draft results/conclusion/discussion | Discuss/edit full project; Work on video |
| 12/11  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |