# Project 1 (Due Nov 13)

The goal of the first project is to non-parametrically model some phenomenon of interest, and generate sequences of values. There are six options below:

- Chordonomicon: 680,000 chord progressions of popular music songs. Create a chord generator, similar to what we did with Bach in class, but for a particular artist or genre. (https://github.com/spyroskantarelis/chordonomicon)
- Financial Time series, S&P500 Stocks: There are 500 time series here. Model how individual time series adjust over time, either together or separately. (https://www.kaggle.com/datasets/andrewmvd/sp-500-stocks)
- MIT-BIH Arrhythmia Database: An arrhythmia is an abnormal heart rhythm. This is a classic dataset of ECG time series measurements: 48 half-hour excerpts selected from a set of 4,000 24-hour ambulatory recordings. (https://www.physionet.org/content/mitdb/1.0.0/)
- Ukraine Conflict Monitor: Maintained by the Armed Conflict Location & Event Data Project (ACLED), the monitor provides near real-time information on the war in Ukraine from 2022 onward, including an interactive map, a curated data file, and weekly situation updates. Recorded events include battles, explosions/remote violence, violence against civilians, protests, and riots. (https://acleddata.com/monitor/ukraine-conflict-monitor)
- SIPRI Arms Trade: The SIPRI Arms Transfers Database is a comprehensive public resource tracking all international transfers of major conventional arms from 1950 to the present. For each deal, it records the number ordered, supplier and recipient identities, weapon types, delivery dates, and deal comments. The database can address questions such as who the suppliers and recipients of major weapons are, what weapons specific countries have transferred, and how supplier-recipient relationships have changed over time. (https://www.sipri.org/databases/armstransfers)
- Environmental Protection Agency data: The EPA, in general, has excellent data on the release of toxic substances, and I also tracked down air quality and asthma. You can put these together to look at how changes in toxic release correlate with air quality and respiratory disease over time: https://www.epa.gov/data https://www.epa.gov/toxics-release-inventory-tri-program/tri-toolbox https://www.cdc.gov/asthma/most_recent_national_asthma_data.htm https://www.earthdata.nasa.gov/topics/atmosphere/air-quality/data-access-tools

If you have other data sources that you're interested in, I am willing to consider them, as long as they lend themselves to an interesting analysis.

Submit a document or notebook that clearly addresses the following:

1. Describe the data clearly -- particularly any missing data that might impact your analysis -- and the provenance of your dataset. Who collected the data and why? (10/100 pts)
2. What phenomenon are you modeling? Provide a brief background on the topic, including definitions and details that are relevant to your analysis. Clearly describe its main features, and support those claims with data where appropriate. (10/100 pts)
3. Describe your non-parametric model (empirical cumulative distribution functions, kernel density function, local constant least squares regression, Markov transition models). How are you fitting your model to the phenomenon to get realistic properties of the data? What challenges did you have to overcome? (15/100 pts)
4. Either use your model to create new sequences (if the model is more generative) or bootstrap a quantity of interest (if the model is more inferential). (15/100 pts)
5. Critically evaluate your work in part 4. Do your sequences have the properties of the training data, and if not, why not? Are your estimates credible and reliable, or is there substantial uncertainty in your results? (15/100 pts)
6. Write a conclusion that explains the limitations of your analysis and potential for future work on this topic. (10/100 pts)

In addition, submit a GitHub repo containing your code and a description of how to obtain the original data from the source. Make sure the code is commented, where appropriate. Include a .gitignore file. We will look at your commit history briefly to determine whether everyone in the group contributed. (10/100 pts)

In class, each group will give a brief presentation and critique the other groups' work. Participating in your group's presentation and constructively critiquing the other groups' presentations account for the remaining 15/100 pts.

In [82]:
import pandas as pd
import numpy as np

In [83]:
url = "https://huggingface.co/datasets/ailsntua/Chordonomicon/resolve/main/chordonomicon_v2.csv"
df = pd.read_csv(url, dtype=str, low_memory=False)

In [84]:
df.head()

Unnamed: 0,id,chords,release_date,genres,decade,rock_genre,artist_id,main_genre,spotify_song_id,spotify_artist_id
0,1,<intro_1> C <verse_1> F C E7 Amin C F C G7 C F...,,'classic country pop',,,artist_1,pop,,4AIEGdwDzPELXYgM5JaEY5
1,2,<intro_1> E D A/Cs E D A/Cs <verse_1> E D A/Cs...,2003-01-01,'alternative metal' 'alternative rock' 'nu met...,2000.0,pop rock,artist_2,metal,2ffJZ2r8HxI5DHcmf3BO6c,694QW15WkebjcrWgQHzRYF
2,3,<intro_1> Csmin <verse_1> A Csmin A Csmin A Cs...,2003-01-01,'alternative metal' 'canadian rock' 'funk meta...,2000.0,canadian rock,artist_3,metal,5KiY8SZEnvCPyIEkFGRR3y,0niJkG4tKkne3zwr7I8n9n
3,4,<intro_1> D Dmaj7 D Dmaj7 <verse_1> Emin A D G...,2022-09-23,,2020.0,,artist_4,,01TtAcUqyLCRBZq4ZZiQWS,17BfKBemmMGO5ZAK25wraW
4,5,<intro_1> C <verse_1> G C G C <chorus_1> F Dmi...,2023-02-10,'modern country pop',2020.0,,artist_5,pop,3zUecdrWC3IqrNSjhnoF3G,4GGfAshSkqoxpZdoaHm7ky


In [85]:
df_country = df[df['main_genre'].str.lower() == 'country']

print(df_country.shape)
print(df_country['main_genre'].value_counts())


(53306, 10)
main_genre
country    53306
Name: count, dtype: int64


In [86]:
df_country.info()

<class 'pandas.core.frame.DataFrame'>
Index: 53306 entries, 20 to 642859
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 53306 non-null  object
 1   chords             53306 non-null  object
 2   release_date       44962 non-null  object
 3   genres             53306 non-null  object
 4   decade             44962 non-null  object
 5   rock_genre         15083 non-null  object
 6   artist_id          53306 non-null  object
 7   main_genre         53306 non-null  object
 8   spotify_song_id    47247 non-null  object
 9   spotify_artist_id  53306 non-null  object
dtypes: object(10)
memory usage: 4.5+ MB


## Question 1: Missing Data, Provenance, and Purpose

The filtered dataset contains 53,306 songs from the Chordonomicon dataset where the main_genre is “country” (case-insensitive). The dataset has 10 columns:

- id: Unique song identifier (no missing values)
- chords: Full chord progression of the song (no missing values)
- release_date: Release date of the song (44,962 non-null; 8,344 missing). The gaps may affect analyses that rely on temporal trends.
- genres: List of genres associated with the song (complete for this subset)
- decade: Decade of release (44,962 non-null; 8,344 missing, the same count as release_date, which suggests the two are missing for the same songs)
- rock_genre: Sub-genre if applicable (15,083 non-null; 38,223 missing)
- artist_id: Unique artist identifier (complete)
- main_genre: Main genre of the song (complete; all “country”)
- spotify_song_id: Spotify ID for the song (47,247 non-null; 6,059 missing)
- spotify_artist_id: Spotify ID for the artist (complete)

Missing data in release_date, decade, rock_genre, and spotify_song_id could affect analyses that rely on temporal information, sub-genres, or Spotify metadata.
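
The per-column missing counts quoted above can be read off the data frame directly. The following is a minimal sketch on a toy stand-in for `df_country`; the three rows and the names `toy`, `summary`, and `same_rows` are made up for illustration, and the same calls would apply to the real frame.

```python
import pandas as pd

# Toy stand-in for df_country; the values are illustrative, not real records.
toy = pd.DataFrame({
    "id": ["1", "2", "3"],
    "release_date": ["2003-01-01", None, None],
    "decade": ["2000.0", None, None],
    "rock_genre": [None, None, "pop rock"],
})

# Number and share of missing values per column.
missing = toy.isna().sum()
missing_share = toy.isna().mean().round(2)
summary = pd.DataFrame({"missing": missing, "share": missing_share})
print(summary)

# Check whether decade is missing on exactly the rows where release_date is.
same_rows = bool((toy["release_date"].isna() == toy["decade"].isna()).all())
print(same_rows)
```

Running the last check on the real data would confirm, rather than infer from matching counts, that `decade` and `release_date` are missing for the same songs.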

Provenance and Purpose: The Chordonomicon dataset was collected and released by Spyridon Kantarelis et al. (2024) to provide a large-scale collection of symbolic chord progressions from contemporary music, along with metadata such as genre, release date, and artist information. It is intended for research in music analysis, chord progression modeling, and graph-based machine learning. Researchers can use this dataset to analyze chord patterns, genre-specific musical structures, or temporal trends in music.

## Question 2: ...

## Question 3

In [87]:
#filter to get just the country genre
countrydf = df.query("main_genre == 'country'")
countrydf.shape

(53306, 10)

In [88]:
#get a list of chords for each song, removing tags like <verse_1>
songs = countrydf['chords']
chord_lists = [[chord for chord in song.split() if '<' not in chord] for song in songs]
chord_lists[0]

['G',
 'Cadd9',
 'G',
 'Cadd9',
 'G',
 'Cadd9',
 'G',
 'Cadd9',
 'G',
 'Cadd9',
 'G',
 'Cadd9',
 'G',
 'Cadd9',
 'G',
 'Cadd9',
 'G',
 'Cadd9',
 'G',
 'Cadd9',
 'G',
 'Cadd9',
 'G',
 'Cadd9',
 'G',
 'Cadd9',
 'G',
 'Cadd9',
 'G',
 'Cadd9',
 'G',
 'D',
 'G',
 'D',
 'G',
 'C',
 'G',
 'D',
 'G',
 'D',
 'G',
 'C',
 'G',
 'C',
 'D',
 'G',
 'C',
 'G',
 'Cadd9',
 'G',
 'Cadd9',
 'G',
 'Cadd9',
 'G',
 'Cadd9',
 'G',
 'Cadd9',
 'G']
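
The filter above drops any token that contains `<`. A slightly stricter alternative is to match whole section tags such as `<verse_1>`, which also makes it possible to keep the section structure if it is needed later. This is only a sketch: the example string is made up in the format of the `chords` column, and `TAG`, `split_chords`, and `split_sections` are hypothetical helper names.

```python
import re

# Made-up example string in the same format as the `chords` column.
song = "<intro_1> C <verse_1> F C E7 Amin <chorus_1> F G7 C"

# Section tags look like <verse_1>; match the whole token rather than
# testing for a '<' anywhere inside it.
TAG = re.compile(r"^<[^<>]+>$")

def split_chords(text):
    """Return the chord tokens of one song, dropping section tags."""
    return [tok for tok in text.split() if not TAG.match(tok)]

def split_sections(text):
    """Return (tag, chords) pairs, keeping the section structure."""
    sections = []
    for tok in text.split():
        if TAG.match(tok):
            sections.append((tok.strip("<>"), []))
        elif sections:
            sections[-1][1].append(tok)
        else:
            # Chords that appear before any tag go into an untitled section.
            sections.append(("", [tok]))
    return sections

print(split_chords(song))
print(split_sections(song))
```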

In [89]:
#build state list from every song, including the first one at index 0
states = set()
for song in chord_lists:
    states.update(song)

states = list(states)

In [90]:
S = len(states)
tr_counts = np.zeros( (S,S) )

#map each chord to its index once, instead of searching the list for every transition
state_index = {chord: i for i, chord in enumerate(states)}

#get transition counts; rows are the previous chord, columns are the current chord
for song in chord_lists:
    for t in range(1, len(song)):
        x_tm1 = song[t-1] #previous state
        x_t = song[t] #current state
        #update transition counts
        tr_counts[state_index[x_tm1], state_index[x_t]] += 1

print('Transition Counts:\n', tr_counts)

Transition Counts:
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
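
The printed corners of the matrix are all zero because most chord pairs never occur in the data, so it can help to check the counting logic on a case small enough to verify by hand. The sketch below uses two made-up chord sequences standing in for `chord_lists`; the names `toy_songs`, `toy_states`, `idx`, and `counts` are illustrative.

```python
import numpy as np

# Two made-up chord sequences standing in for chord_lists.
toy_songs = [["G", "C", "G", "D"], ["C", "G", "D", "G"]]

# Sorted state list plus a dictionary for constant-time index lookups.
toy_states = sorted({chord for song in toy_songs for chord in song})
idx = {chord: i for i, chord in enumerate(toy_states)}

counts = np.zeros((len(toy_states), len(toy_states)), dtype=int)
for song in toy_songs:
    # zip pairs each chord with its successor within the same song,
    # so no transition is counted across song boundaries.
    for prev, nxt in zip(song[:-1], song[1:]):
        counts[idx[prev], idx[nxt]] += 1

print(toy_states)
print(counts)

# How sparse is the matrix? Nonzero cells versus total cells.
print(np.count_nonzero(counts), counts.size)
```

Each song of length n contributes n - 1 transitions, so the entries of `counts` should sum to 6 here. The same check (total count equals total song length minus the number of songs) is a cheap sanity test on the full matrix.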


In [91]:
#column sums of the count matrix: total transitions into each chord (raw counts, not yet proportions)
sums = tr_counts.sum(axis=0, keepdims=True)
print('State Proportions:\n')
print(sums)

State Proportions:

[[ 33.   2.   3. ...  94.  26. 234.]]
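
For a generative model, each row of the count matrix has to be scaled to sum to one, so that row i holds the distribution of the next chord given the current chord i. The sketch below assumes rows index the current chord and columns the next chord, matching `tr_counts[index_from, index_to]`. The counts are made up, and `generate` is a hypothetical helper rather than part of the notebook.

```python
import numpy as np

toy_states = ["C", "D", "G"]
# Made-up counts with rows = current chord, columns = next chord.
counts = np.array([[0, 0, 2],
                   [0, 0, 1],
                   [1, 2, 0]], dtype=float)

# Row sums: how often each chord is followed by any chord.
row_sums = counts.sum(axis=1, keepdims=True)
probs = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums != 0)

def generate(start, length, rng):
    """Sample a chord sequence of at most `length` chords from the chain."""
    seq = [start]
    current = toy_states.index(start)
    for _ in range(length - 1):
        if probs[current].sum() == 0:
            break  # dead end: this chord was never followed by another
        current = rng.choice(len(toy_states), p=probs[current])
        seq.append(toy_states[current])
    return seq

rng = np.random.default_rng(0)
print(probs)
print(generate("G", 8, rng))
```

Every consecutive pair in a generated sequence should be a transition with a nonzero count in the training data, which gives a simple first check on the sampler.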


In [92]:
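#normalize each row of the count matrix so that row i gives P(next chord | current chord i);
#tr_counts is indexed [from, to], so the divisor is the row sum (axis=1), not the column sum
row_sums = tr_counts.sum(axis=1, keepdims=True)
tr_pr = np.divide(tr_counts, row_sums, out = np.zeros_like(tr_counts), where = row_sums!=0)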

print('Transition Proportions:\n')
tr_df = pd.DataFrame(np.round(tr_pr, 2), columns = states, index = states)
print(tr_df)


Transition Proportions:

           B/Cs  Csus4/G  F/F  Bmin/E  Ab7b9  Emin9/B  Asaugmaj7  Gmin13  \
B/Cs        0.0      0.0  0.0     0.0    0.0      0.0        0.0     0.0   
Csus4/G     0.0      0.0  0.0     0.0    0.0      0.0        0.0     0.0   
F/F         0.0      0.0  0.0     0.0    0.0      0.0        0.0     0.0   
Bmin/E      0.0      0.0  0.0     0.0    0.0      0.0        0.0     0.0   
Ab7b9       0.0      0.0  0.0     0.0    0.0      0.0        0.0     0.0   
...         ...      ...  ...     ...    ...      ...        ...     ...   
Bbmin7/Ds   0.0      0.0  0.0     0.0    0.0      0.0        0.0     0.0   
Dbsus2/Ab   0.0      0.0  0.0     0.0    0.0      0.0        0.0     0.0   
Abdim       0.0      0.0  0.0     0.0    0.0      0.0        0.0     0.0   
B11/E       0.0      0.0  0.0     0.0    0.0      0.0        0.0     0.0   
G7sus4      0.0      0.0  0.0     0.0    0.0      0.0        0.0     0.0   

           Ddim7  A7/E  ...  Emin9/D  Ddim7/C  Gminadd13/E  Db