# Project 1
## DS 5030 - Understanding Uncertainty
### Isaac Tabor, Hongfei Zhu, Jarrett Markman

In [25]:
# Import packages and load data
import pandas as pd
import numpy as np
data = pd.read_csv("data/sp500_companies.csv")

### Step 1: Describe the data clearly -- particularly any missing data that might impact your analysis -- and the provenance of your dataset. Who collected the data and why?

The data comes from [Kaggle](https://www.kaggle.com/datasets/andrewmvd/sp-500-stocks/data) and was *collected as of 11/7.* The dataset is regularly updated and covers all the companies included in the S&P 500 index. It contains descriptive details about each company, such as **`Shortname`, `Sector`, `Industry`, `City`, `State`, `Country`**, as well as **S&P 500** index metrics and general financial metrics such as **`Currentprice`, `Marketcap`, `Ebitda`, `Revenuegrowth`, `Weight`**. The dataset can support many kinds of research and analysis on companies within the S&P 500.

In [38]:
# See length of data and columns with missing data
print(f"There are: {len(data)} rows of data.", "The NA count by variable is:", data.isna().sum(), sep = '\n')

There are: 502 rows of data.
The NA count by variable is:
Exchange                0
Symbol                  0
Shortname               0
Longname                0
Sector                  0
Industry                0
Currentprice            0
Marketcap               0
Ebitda                 29
Revenuegrowth           3
City                    0
State                  20
Country                 0
Fulltimeemployees       9
Longbusinesssummary     0
Weight                  0
dtype: int64


The data has few NA values overall, but they are present in `Ebitda`, `Revenuegrowth`, `State`, and `Fulltimeemployees`. To investigate them further, we can look at the rows with missing data.

In [44]:
# Look at rows with missing data
data[data.isna().any(axis=1)]

|  | Exchange | Symbol | Shortname | Longname | Sector | Industry | Currentprice | Marketcap | Ebitda | Revenuegrowth | City | State | Country | Fulltimeemployees | Longbusinesssummary | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | NYQ | JPM | JP Morgan Chase & Co. | JPMorgan Chase & Co. | Financial Services | Banks - Diversified | 237.6 | 668924837888 | NaN | 0.03 | New York | NY | United States | 316043.0 | JPMorgan Chase & Co. operates as a financial s... | 0.012035 |
| 13 | NYQ | V | Visa Inc. | Visa Inc. | Financial Services | Credit Services | 317.71 | 615235846144 | 24973000000.0 | 0.117 | San Francisco | CA | United States | NaN | Visa Inc. operates as a payment technology com... | 0.011069 |
| 23 | NYQ | BAC | Bank of America Corporation | Bank of America Corporation | Financial Services | Banks - Diversified | 44.17 | 338911100928 | NaN | -0.005 | Charlotte | NC | United States | 213000.0 | Bank of America Corporation, through its subsi... | 0.006097 |
| 30 | NYQ | WFC | Wells Fargo & Company | Wells Fargo & Company | Financial Services | Banks - Diversified | 70.34 | 234196303872 | NaN | -0.018 | San Francisco | CA | United States | 220167.0 | Wells Fargo & Company, a financial services co... | 0.004213 |
| 32 | NYQ | ACN | Accenture plc | Accenture plc | Technology | Information Technology Services | 366.37 | 229157109760 | 11065910000.0 | 0.026 | Dublin | NaN | Ireland | 774000.0 | Accenture plc provides strategy and consulting... | 0.004123 |
| 34 | NYQ | AXP | American Express Company | American Express Company | Financial Services | Credit Services | 298.65 | 210382487552 | NaN | 0.08 | New York | NY | United States | 74600.0 | American Express Company, together with its su... | 0.003785 |
| 37 | NYQ | BX | Blackstone Inc. | Blackstone Inc. | Financial Services | Asset Management | 170.84 | 207208415232 | NaN | 0.541 | New York | NY | United States | 4735.0 | Blackstone Inc. is an alternative asset manage... | 0.003728 |
| 40 | NMS | LIN | Linde plc | Linde plc | Basic Materials | Specialty Chemicals | 424.31 | 202038607872 | 12581000000.0 | 0.025 | Woking | NaN | United Kingdom | 65596.0 | Linde plc operates as an industrial gas compan... | 0.003635 |
| 42 | NYQ | MS | Morgan Stanley | Morgan Stanley | Financial Services | Capital Markets | 123.44 | 198866780160 | NaN | 0.165 | New York | NY | United States | 80000.0 | Morgan Stanley, a financial holding company, p... | 0.003578 |
| 51 | NYQ | GS | Goldman Sachs Group, Inc. (The) | The Goldman Sachs Group, Inc. | Financial Services | Capital Markets | 566.1 | 177704452096 | NaN | 0.042 | New York | NY | United States | 46400.0 | The Goldman Sachs Group, Inc., a financial ins... | 0.003197 |


Many of the NA's for `Ebitda` come from banking and financial services companies. While `Ebitda` can be a valuable financial metric, companies like JP Morgan don't have a value: interest income and interest expense are central to how banks operate, so a measure of earnings before interest is not meaningful for them in the way it is for most other companies.
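One way to check this pattern is to count the missing `Ebitda` values within each `Sector`. The sketch below is only illustrative: the helper name `na_count_by_group` is hypothetical, and the small `sample` frame is copied from a few of the rows displayed above rather than read from the full file.

```python
import pandas as pd

def na_count_by_group(df, column, group):
    """Count missing values of `column` within each level of `group`."""
    return (
        df[column].isna()          # True where the value is missing
        .groupby(df[group])        # split the flags by group label
        .sum()                     # True counts as 1, so this counts NA's
        .sort_values(ascending=False)
    )

# Small sample copied from rows shown in the output above (not the full dataset)
sample = pd.DataFrame({
    "Symbol": ["JPM", "V", "BAC", "ACN", "LIN"],
    "Sector": ["Financial Services", "Financial Services",
               "Financial Services", "Technology", "Basic Materials"],
    "Ebitda": [None, 24973000000.0, None, 11065910000.0, 12581000000.0],
})

ebitda_na = na_count_by_group(sample, "Ebitda", "Sector")
print(ebitda_na)
```

On the full dataset, the equivalent call would be `na_count_by_group(data, "Ebitda", "Sector")`.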

The NA's in the `State` variable come from S&P 500 companies based outside the U.S. (for example, Accenture in Ireland and Linde in the United Kingdom), so they are not problematic.
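This can be verified by tabulating `Country` for the rows where `State` is missing. The following is a minimal sketch, assuming a hypothetical helper `missing_state_by_country` and a small `sample` frame copied from rows shown earlier, not the full dataset.

```python
import pandas as pd

def missing_state_by_country(df):
    """Return counts of Country among rows whose State is missing."""
    return df.loc[df["State"].isna(), "Country"].value_counts()

# Small sample copied from rows shown in the output above (not the full dataset)
sample = pd.DataFrame({
    "Symbol": ["JPM", "ACN", "LIN", "V"],
    "State": ["NY", None, None, "CA"],
    "Country": ["United States", "Ireland", "United Kingdom", "United States"],
})

missing_countries = missing_state_by_country(sample)
# If no U.S. company is missing a State, this lookup falls back to 0
us_missing = missing_countries.get("United States", 0)
print(missing_countries)
print("U.S. rows with missing State:", us_missing)
```

Running `missing_state_by_country(data)` on the full dataset and confirming that "United States" is absent from the result would support the claim for all 20 missing values.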

There are 9 NA's in `Fulltimeemployees`, which come from the following companies: Visa, Starbucks, D.R. Horton, ResMed, Raymond James, Super Micro Computer, Inc., F5, Inc., Solventum, and Amentum Holdings, Inc. There is no apparent pattern behind these gaps, but they are important to note.

The 3 NA's for `Revenuegrowth` come from Verizon, American Tower Corporation, and Western Digital Corporation. There is no definitive reason for these gaps, though they could potentially be attributed to a lack of growth.
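Company lists like the ones above can be pulled directly from the data by filtering on the missing column. The sketch below assumes a hypothetical helper `names_with_missing`; the `sample` frame is copied from two rows displayed earlier and stands in for the full dataset.

```python
import pandas as pd

def names_with_missing(df, column, name_col="Shortname"):
    """List the company names whose value in `column` is missing."""
    return df.loc[df[column].isna(), name_col].tolist()

# Small sample copied from rows shown in the output above (not the full dataset)
sample = pd.DataFrame({
    "Shortname": ["JP Morgan Chase & Co.", "Visa Inc."],
    "Fulltimeemployees": [316043.0, None],
    "Revenuegrowth": [0.03, 0.117],
})

missing_employees = names_with_missing(sample, "Fulltimeemployees")
missing_growth = names_with_missing(sample, "Revenuegrowth")
print(missing_employees)
print(missing_growth)
```

On the full dataset, `names_with_missing(data, "Fulltimeemployees")` and `names_with_missing(data, "Revenuegrowth")` would return the complete lists.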

### Step 2: What phenomenon are you modeling? Provide a brief background on the topic, including definitions and details that are relevant to your analysis. Clearly describe its main features, and support those claims with data where appropriate.

### Step 3: Describe your non-parametric model (empirical cumulative distribution functions, kernel density function, local constant least squares regression, Markov transition models). How are you fitting your model to the phenomenon to get realistic properties of the data? What challenges did you have to overcome?

### Step 4: Either use your model to create new sequences (if the model is more generative) or bootstrap a quantity of interest (if the model is more inferential).

### Step 5: Critically evaluate your work in part 4. Do your sequences have the properties of the training data, and if not, why not? Are your estimates credible and reliable, or is there substantial uncertainty in your results? 

### Step 6: Write a conclusion that explains the limitations of your analysis and potential for future work on this topic.