### Datasets explored

#### 1
The first dataset was `Huge Stock Market Dataset` found here: https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs

This looked promising, as it contains data for all US-based stocks and ETFs trading on the NYSE, NASDAQ, and NYSE MKT. After inspecting the data, however, I found it was last updated in November of 2017. It could still be a useful dataset, but I believe any resulting models would be more accurate if we can find a set that also includes the most recent years.

#### 2
Looking for something similar to the first dataset but with more recent data, I next came across `Stock Market Data (NASDAQ, NYSE, S&P500)` found here: https://www.kaggle.com/paultimothymooney/stock-market-data

I noticed it covers the same exchanges we are interested in (apart from the S&P 500, which is an index of stocks already listed on those exchanges), and the dataset description says it is updated weekly. This seemed to solve my original problem of wanting additional years of data, but upon inspecting the dataset, it appears to be missing many securities that were available in the first.

The first dataset had 8,539 total stocks & ETFs in it, but this second dataset had 3,022.

Excluding a majority of the stocks & ETFs entirely would be even worse than missing the most recent few years of data, so the search continues.

#### 3
The final dataset I found is `Stock Market Dataset`, found here: https://www.kaggle.com/jacksoncrow/stock-market-dataset. It looks very similar to the first one, with 8,049 unique ticker symbols, but was updated in April of 2020.

Of the three, this dataset appears to be the best and is recent enough. However, I found in the description that the user who created it used a library called yfinance to get the data for each symbol.

### Plan
I'm going to use the same library (yfinance) to download the most current stock data myself, so I can work with the latest data possible.

yfinance takes a ticker symbol and the period you want data for, and returns a DataFrame with that data. The first step is to get a list of all the ticker symbols traded, then feed those into yfinance to get the historical data, and store that data in a CSV.
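As a minimal sketch of the "symbol plus period in, DataFrame out" call described above: `fetch_history` and `VALID_PERIODS` are hypothetical names of my own, and the list of period strings is what I understand yfinance to accept, so treat it as an assumption that may differ between versions. The import sits inside the function so the period check runs before any network access.

```python
# Period strings I believe yfinance accepts (assumption; check your version).
VALID_PERIODS = {"1d", "5d", "1mo", "3mo", "6mo", "1y", "2y", "5y", "10y", "ytd", "max"}


def fetch_history(symbol, period="max"):
    """Return the price history for one ticker symbol as a DataFrame."""
    if period not in VALID_PERIODS:
        raise ValueError(f"Unsupported period: {period!r}")
    # Imported lazily so the validation above works without network access.
    import yfinance as yf

    return yf.Ticker(symbol).history(period=period)
```

Calling `fetch_history('A', period='1mo')` would then request one month of data for a single symbol.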

#### Get list of stock symbols traded

In [1]:
import pandas as pd

data = pd.read_csv('http://www.nasdaqtrader.com/dynamic/SymDir/nasdaqtraded.txt', sep='|')
data.head()

| | Nasdaq Traded | Symbol | Security Name | Listing Exchange | Market Category | ETF | Round Lot Size | Test Issue | Financial Status | CQS Symbol | NASDAQ Symbol | NextShares |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Y | A | Agilent Technologies, Inc. Common Stock | N | | N | 100.0 | N | | A | A | N |
| 1 | Y | AA | Alcoa Corporation Common Stock | N | | N | 100.0 | N | | AA | AA | N |
| 2 | Y | AAA | Listed Funds Trust AAF First Priority CLO Bond... | P | | Y | 100.0 | N | | AAA | AAA | N |
| 3 | Y | AAAU | Goldman Sachs Physical Gold ETF Shares | P | | Y | 100.0 | N | | AAAU | AAAU | N |
| 4 | Y | AAC.U | Ares Acquisition Corporation Units, each consi... | N | | N | 100.0 | N | | AAC.U | AAC= | N |


Looking into what these fields mean, I found this definition of `Test Issue`:
> Indicates whether the security is a test security.

Since these are test securities rather than real ones, they can safely be ignored.

In [2]:
data['Test Issue'].value_counts()

N    9816
Y      34
Name: Test Issue, dtype: int64

In [3]:
data = data[data['Test Issue'] == 'N']
len(data)

9816
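The equality filter above also drops any row whose `Test Issue` is empty. A sketch that makes this explicit on a small inline sample is shown here. The sample rows, the `ZZZZ` security, the footer line, and the helper name `load_real_symbols` are all made up for illustration; I'm assuming the real file ends with a `File Creation Time` footer line. Passing `keep_default_na=False` is my own choice, so that a symbol spelled like a missing-value marker (for example `NA`) would stay a string.

```python
import io

import pandas as pd

# Made-up sample in the same pipe-delimited layout, with fewer columns.
SAMPLE = """Nasdaq Traded|Symbol|Security Name|Test Issue
Y|A|Agilent Technologies, Inc. Common Stock|N
Y|AA|Alcoa Corporation Common Stock|N
Y|ZZZZ|Hypothetical Test Security|Y
File Creation Time: 0101202000:00|||
"""


def load_real_symbols(source):
    """Read a symbol directory and keep only non-test rows that have a symbol."""
    table = pd.read_csv(source, sep="|", dtype=str, keep_default_na=False)
    is_real = table["Test Issue"] == "N"
    has_symbol = table["Symbol"].str.strip() != ""
    return table[is_real & has_symbol].reset_index(drop=True)


real = load_real_symbols(io.StringIO(SAMPLE))
print(real["Symbol"].to_list())
```

The same function should accept the URL used in `In [1]` in place of the `StringIO` object.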

In [4]:
stock_symbols = data['Symbol'].to_list()
stock_symbols[:5]

['A', 'AA', 'AAA', 'AAAU', 'AAC.U']
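Some of these symbols contain a dot (`AAC.U` above). As far as I know, Yahoo Finance writes share classes with a hyphen instead (for example `BRK-B`), so symbols may need translating before they are passed to yfinance. The sketch below is an assumption on my part, not something taken from the dataset descriptions: `to_yahoo_symbol` is a hypothetical helper that only handles the dot-to-hyphen case, and unit or preferred-share suffixes likely need their own rules that I have not verified.

```python
def to_yahoo_symbol(symbol):
    """Translate a dotted symbol into the hyphenated form (share classes only)."""
    # Assumption: only the share-class separator differs; other suffixes are untouched.
    return symbol.strip().upper().replace(".", "-")


# Example with a few plain symbols and one dotted share-class symbol.
yahoo_symbols = [to_yahoo_symbol(s) for s in ["A", "AA", "BRK.B"]]
print(yahoo_symbols)
```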

#### Get historical data for each symbol and store in CSV

In [5]:
!mkdir project_data_output

mkdir: project_data_output: File exists
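The `File exists` message appears because the directory was already created on an earlier run. One way to make the cell safe to re-run is to create the directory from Python instead; `ensure_output_dir` is a hypothetical helper name, and this is only a sketch of that idea.

```python
from pathlib import Path


def ensure_output_dir(path="project_data_output"):
    """Create the output directory if needed and return it as a Path."""
    directory = Path(path)
    # exist_ok=True makes repeated calls a no-op instead of an error.
    directory.mkdir(parents=True, exist_ok=True)
    return directory
```

Calling `ensure_output_dir()` at the top of the notebook can be repeated without any message. In a shell cell, `mkdir -p project_data_output` has the same effect.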


In [6]:
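import yfinance as yf

# yf.download accepts several tickers as one space-separated string
symbols_as_string = ' '.join(stock_symbols)
# period='max' requests the full available history; group_by='ticker' groups the columns by symbol
stock_data = yf.download(symbols_as_string, period='max', group_by='ticker', threads=False)

stock_data.to_csv('project_data_output/data.csv')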

[**********************76%***********            ]  7463 of 9816 completed

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
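Because everything was requested in a single call, the error at 76% meant nothing was written to the CSV. One possible way around this, sketched here under my own assumptions, is to download in smaller batches and record which batches failed so they can be retried. `chunked`, `download_in_batches`, and `fetch_with_yfinance` are hypothetical helper names, and the batch size of 100 is arbitrary. The fetch function is passed in as an argument so the batching logic can be tried without network access.

```python
def chunked(items, size):
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def download_in_batches(symbols, fetch, batch_size=100):
    """Call fetch(batch) per batch; return (results, symbols in failed batches)."""
    results = []
    failed = []
    for batch in chunked(symbols, batch_size):
        try:
            results.append(fetch(batch))
        except Exception as error:  # broad on purpose: one bad batch should not stop the run
            print(f"Batch starting at {batch[0]} failed: {error}")
            failed.extend(batch)
    return results, failed


def fetch_with_yfinance(batch):
    """Download one batch with the same arguments as the cell above."""
    import yfinance as yf  # imported lazily so the helpers above work without it

    return yf.download(' '.join(batch), period='max', group_by='ticker', threads=False)
```

With these in place, `download_in_batches(stock_symbols, fetch_with_yfinance)` would return the batches that succeeded plus a list of symbols to retry, and each successful batch could be written to its own CSV as soon as it arrives.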