# Project 1 (Due Nov 13)

The goal of the first project is to non-parametrically model some phenomenon of interest, and generate sequences of values. There are six options below:

- Chordonomicon: 680,000 chord progressions of popular music songs. Create a chord generator, similar to what we did with Bach in class, but for a particular artist or genre. (https://github.com/spyroskantarelis/chordonomicon)
- Financial Time series, S&P500 Stocks: There are 500 time series here. Model how individual time series adjust over time, either together or separately. (https://www.kaggle.com/datasets/andrewmvd/sp-500-stocks)
- MIT-BIH Arrhythmia Database: Arrhythmia is an abnormal heart rhythm. This classic dataset contains 48 half-hour ECG recordings from 47 subjects, drawn from a set of about 4,000 24-hour ambulatory recordings. (https://www.physionet.org/content/mitdb/1.0.0/)
- Ukraine Conflict Monitor: Maintained by the Armed Conflict Location & Event Data Project (ACLED), the monitor provides near real-time information on the war in Ukraine from 2022 onward, including an interactive map, a curated data file, and weekly situation updates. Events covered include battles, explosions/remote violence, violence against civilians, protests, and riots:
https://acleddata.com/monitor/ukraine-conflict-monitor
- SIPRI Arms Trade: The SIPRI Arms Transfers Database is a comprehensive public resource tracking all international transfers of major conventional arms from 1950 to the present. For each deal, information includes: number ordered, supplier/recipient identities, weapon types, delivery dates, and deal comments. The database can address questions about: who are suppliers and recipients of major weapons, what weapons have been transferred by specific countries, and how supplier-recipient relationships have changed over time.
https://www.sipri.org/databases/armstransfers
- Environmental Protection Agency data: The EPA has excellent data on the release of toxic substances, and I also tracked down data on air quality and asthma. You can put these together to look at how changes in toxic release correlate with air quality and respiratory disease over time:
https://www.epa.gov/data
https://www.epa.gov/toxics-release-inventory-tri-program/tri-toolbox
https://www.cdc.gov/asthma/most_recent_national_asthma_data.htm
https://www.earthdata.nasa.gov/topics/atmosphere/air-quality/data-access-tools

If you have other data sources that you're interested in, I am willing to consider them, as long as they lend themselves to an interesting analysis.

Submit a document or notebook that clearly addresses the following:
1. Describe the data clearly -- particularly any missing data that might impact your analysis -- and the provenance of your dataset. Who collected the data and why? (10/100 pts)
2. What phenomenon are you modeling? Provide a brief background on the topic, including definitions and details that are relevant to your analysis. Clearly describe its main features, and support those claims with data where appropriate. (10/100 pts)
3. Describe your non-parametric model (e.g., empirical cumulative distribution functions, kernel density estimators, local constant least squares regression, Markov transition models). How are you fitting your model to the phenomenon to get realistic properties of the data? What challenges did you have to overcome? (15/100 pts)
4. Either use your model to create new sequences (if the model is more generative) or bootstrap a quantity of interest (if the model is more inferential). (15/100 pts)
5. Critically evaluate your work in part 4. Do your sequences have the properties of the training data, and if not, why not? Are your estimates credible and reliable, or is there substantial uncertainty in your results? (15/100 pts)
6. Write a conclusion that explains the limitations of your analysis and potential for future work on this topic. (10/100 pts)
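
As a rough illustration of the generative route in parts 3 and 4, here is a minimal sketch of a first-order Markov transition model fit to a discretized series and then used to generate a new sequence. Everything in it is an assumption made for illustration: the function names `fit_markov_chain` and `simulate_chain`, the choice of five quantile bins, and the synthetic placeholder data standing in for a real series. It is one possible starting point, not a required approach.

```python
import numpy as np

def fit_markov_chain(states, n_states):
    """Estimate a first-order transition matrix from an integer state sequence."""
    counts = np.zeros((n_states, n_states))
    for current, nxt in zip(states[:-1], states[1:]):
        counts[current, nxt] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Avoid dividing by zero; rows with no observed transitions become uniform.
    safe_sums = np.where(row_sums == 0, 1, row_sums)
    return np.where(row_sums > 0, counts / safe_sums, 1.0 / n_states)

def simulate_chain(transition, start, length, rng):
    """Generate a state sequence by repeatedly sampling the next state."""
    sequence = [int(start)]
    for _ in range(length - 1):
        next_state = rng.choice(len(transition), p=transition[sequence[-1]])
        sequence.append(int(next_state))
    return np.array(sequence)

rng = np.random.default_rng(0)

# Placeholder data: synthetic values standing in for a real series (assumption).
values = rng.normal(0, 0.01, size=1000)

# Discretize into quantile bins so each state is roughly equally populated.
n_states = 5
edges = np.quantile(values, np.linspace(0, 1, n_states + 1)[1:-1])
states = np.digitize(values, edges)  # integer labels 0 .. n_states - 1

transition = fit_markov_chain(states, n_states)
synthetic_states = simulate_chain(transition, start=states[0], length=250, rng=rng)
```

For part 5, one natural check is to compare state frequencies and run lengths in `synthetic_states` against those in `states`. A first-order chain reproduces one-step dependence by construction, but it can miss longer-range structure.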

In addition, submit a GitHub repo containing your code and a description of how to obtain the original data from the source. Make sure the code is commented, where appropriate. Include a .gitignore file. We will look at your commit history briefly to determine whether everyone in the group contributed. (10/100 pts)

In class, each group will give a brief presentation and critique the other groups' work. Participating in your group's presentation and constructively critiquing the other groups' presentations accounts for the remaining 15/100 pts.


In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

## DATA DESCRIPTION & PROVENANCE

In [12]:
# Load data
index_df = pd.read_csv('data/sp500_index.csv')
companies_df = pd.read_csv('data/sp500_companies.csv')
stocks_df = pd.read_csv('data/sp500_stocks.csv')

print(f"sp500_index.csv:       {index_df.shape[0]:,} rows × {index_df.shape[1]} cols")
print(f"sp500_companies.csv:   {companies_df.shape[0]} companies")
print(f"sp500_stocks.csv:      {stocks_df.shape[0]:,} rows")

# Provenance
print("\nDATA PROVENANCE")
print("Source: Kaggle - https://www.kaggle.com/datasets/andrewmvd/sp-500-stocks")
print("Author: Andrew Mvd")
print("Collection: Automated daily pull from Yahoo Finance API")
print("Purpose: Public dataset for ML, research, and financial analysis")
print("Last updated: ~Nov 2025 (daily updates)")

# Missing data: report the share of nulls in the key column of each file
print("\nMISSING VALUES REPORT")
print(f"S&P500 Index missing: {index_df['S&P500'].isna().sum()} ({index_df['S&P500'].isna().mean()*100:.0f}%)")
print(f"Companies EBITDA null: {companies_df['Ebitda'].isna().mean()*100:.2f}%")
print(f"Stocks Adj Close null: {stocks_df['Adj Close'].isna().mean()*100:.2f}% ← Critical!")

print("""
IMPACT OF MISSING DATA:
- 67.3% missing Adj Close in sp500_stocks.csv → many stocks only have recent data
  (e.g., new S&P 500 additions like CRWD, PLTR)
- We will ONLY use sp500_index.csv for modeling → complete, clean, 2,517 trading days
- This avoids bias from partial stock histories
""")

sp500_index.csv:       2,517 rows × 2 cols
sp500_companies.csv:   502 companies
sp500_stocks.csv:      1,891,536 rows

DATA PROVENANCE
Source: Kaggle - https://www.kaggle.com/datasets/andrewmvd/sp-500-stocks
Author: Andrew Mvd
Collection: Automated daily pull from Yahoo Finance API
Purpose: Public dataset for ML, research, and financial analysis
Last updated: ~Nov 2025 (daily updates)

MISSING VALUES REPORT
S&P500 Index missing: 0 (0%)
Companies EBITDA null: 5.78%
Stocks Adj Close null: 67.34% ← Critical!

IMPACT OF MISSING DATA:
- 67.3% missing Adj Close in sp500_stocks.csv → many stocks only have recent data
  (e.g., new S&P 500 additions like CRWD, PLTR)
- We will ONLY use sp500_index.csv for modeling → complete, clean, 2,517 trading days
- This avoids bias from partial stock histories
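
The 67.34% figure is an aggregate over all rows, so on its own it does not show that the gaps come from stocks with short histories. A per-symbol breakdown would be one way to check that. The sketch below is a minimal example under stated assumptions: it assumes `sp500_stocks.csv` has `Date` and `Symbol` columns alongside `Adj Close` (only `Adj Close` is confirmed by the cell above), and the helper name `missingness_by_symbol` and the toy frame are hypothetical.

```python
import pandas as pd

def missingness_by_symbol(stocks, value_col="Adj Close"):
    """Per-symbol share of missing values and first date with a non-null value."""
    stocks = stocks.assign(Date=pd.to_datetime(stocks["Date"]))
    # Share of null rows for each symbol.
    share_missing = stocks[value_col].isna().groupby(stocks["Symbol"]).mean()
    # Earliest date on which each symbol has an observed value (NaT if none).
    first_valid = stocks.dropna(subset=[value_col]).groupby("Symbol")["Date"].min()
    return pd.DataFrame({
        "share_missing": share_missing,
        "first_valid_date": first_valid,
    })

# Toy frame standing in for stocks_df (assumed column names).
toy = pd.DataFrame({
    "Date": ["2020-01-02", "2020-01-03", "2020-01-06"] * 2,
    "Symbol": ["AAA"] * 3 + ["BBB"] * 3,
    "Adj Close": [10.0, 10.5, 11.0, None, None, 20.0],
})
report = missingness_by_symbol(toy)
print(report.sort_values("share_missing", ascending=False))
```

If the column assumptions hold, calling `missingness_by_symbol(stocks_df)` would show whether the missing rows cluster in symbols with late first valid dates or are spread across all symbols, which matters for how the gap is described.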

