Note that the parts highlighted in yellow are quite time-consuming to run.

## Part 1. Data Scraper

## Part 1.1. Getting Raw Data

### Part 1.1.1. Importing Libraries

Import the necessary libraries.

In [None]:
import pandas as pd

import yfinance as yf
from pytickersymbols import PyTickerSymbols

from read_write_csv import save_csv

### <mark>Part 1.1.2. Scraping Raw Data</mark> 

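The scraping logic: get the list of S&P 500 tickers, download their price history from Yahoo Finance, and keep only the adjusted close prices.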

In [2]:
stock_data = PyTickerSymbols()
sp500_yahoo_tickers_list = stock_data.get_sp_500_nyc_yahoo_tickers()  # Get list of tickers for all SP500 stocks 

data = yf.download(  # download 10 years of price data from yfinance for the whole ticker list
    sp500_yahoo_tickers_list, period='10y', keepna=False
)

df = pd.DataFrame(  # build a dataframe from the downloaded data, keeping only the Adj Close prices
    data.iloc[:, data.columns.get_level_values(0) == 'Adj Close']
)

[*********************100%***********************]  538 of 538 completed

37 Failed downloads:
['HRS', 'SIVB', 'CNP-PB', 'BLL', 'MNSLV', 'WLTW', 'ABMD', 'HBANN', 'KIM-PI', 'RE', 'RF-PB', 'DISH', 'ABC', 'FRC', 'WRK', 'GS-PK', 'FLT', 'XON', 'CDAY', 'BOAPL', 'ATVI', 'UHID', 'K-WI', 'FBHS', 'PBSTV', 'PEAK', 'PXD', 'PKI']: YFPricesMissingError('$%ticker%: possibly delisted; no price data found  (period=10y) (Yahoo error = "No data found, symbol may be delisted")')
['HCP', 'PARAA', 'CEG', 'OGN', 'OTIS', 'CARR', 'NEEXU']: YFInvalidPeriodError("%ticker%: Period '10y' is invalid, must be one of ['1d', '5d', '1mo', '3mo', '6mo', '1y', '2y', '5y', 'ytd', 'max']")
['SBNY']: YFInvalidPeriodError("%ticker%: Period '10y' is invalid, must be one of ['1d', '5d', '1mo', '3mo', '6mo', 'ytd', 'max']")
['FISV']: YFInvalidPeriodError("%ticker%: Period '10y' is invalid, must be one of ['1d', '5d', '1mo', '3mo', '6mo', '1y', '2y', 'ytd', 'max']")


### Part 1.1.3. Overview

Let us check what we have at this point. Columns with a count of 0 (for example ABC and ABMD) belong to tickers whose download failed.

In [3]:
df.describe()  # summary statistics per column: count, mean, std, min, quartiles, max

Price,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close
Ticker,0P0000KQL0,0P0001I1JH,A,AAL,AAP,AAPL,ABBV,ABC,ABMD,ABT,...,XEL,XOM,XON,XRAY,XYL,YUM,ZBH,ZBRA,ZION,ZTS
count,1434.0,1434.0,2517.0,2517.0,2517.0,2517.0,2517.0,0.0,0.0,2517.0,...,2517.0,2517.0,0.0,2517.0,2517.0,2517.0,2517.0,2517.0,2517.0,2517.0
mean,2.048117,2.087138,88.756119,27.945144,131.378494,92.384616,88.792928,,,76.214976,...,48.231286,66.775741,,45.618223,76.444332,89.977027,115.963495,230.650095,36.991373,116.328325
std,0.221826,0.225107,40.293001,13.388271,41.06396,64.475861,43.802411,,,30.538865,...,13.741923,23.427996,,10.558341,30.768412,29.955871,16.20604,140.109513,10.998845,56.333974
min,1.3655,1.3855,31.031214,9.04,35.689999,20.697269,32.962017,,,30.864788,...,23.95533,25.031288,,17.26,27.042645,40.126244,75.456177,46.93,15.628467,36.925087
25%,1.8976,1.940325,55.954002,14.67,107.788757,34.155132,48.644257,,,42.275589,...,35.726334,54.053425,,36.444599,48.644886,62.214317,105.20977,104.410004,27.387794,57.855518
50%,2.03735,2.07596,75.648575,27.388784,138.948288,62.786938,72.829781,,,77.427391,...,52.689938,58.233849,,47.532429,73.231926,89.732719,113.299126,211.399994,37.348949,120.481628
75%,2.188475,2.23,129.002914,40.008011,151.840866,149.165283,130.198578,,,104.471436,...,60.203667,79.988983,,54.421982,100.557396,117.584152,125.037384,302.769989,44.652092,167.71994
max,2.615,2.6651,175.479584,56.988728,224.340866,237.330002,203.869995,,,133.728104,...,72.919998,124.348221,,65.587769,144.777924,141.782349,168.737976,614.549988,66.437309,240.630768
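
The summary above contains columns with a count of 0, which hold no prices at all. As a minimal sketch of how such columns could be listed and removed, the example below uses a small made-up table (the names `prices`, `empty_columns` and `cleaned` are illustrative and not part of the original notebook, which does not perform this step):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the scraped table: column "BBB" mimics a failed download.
prices = pd.DataFrame(
    {
        "AAA": [1.0, 2.0, 3.0],
        "BBB": [np.nan, np.nan, np.nan],
        "CCC": [np.nan, 5.0, 6.0],
    }
)

# Columns whose non-null count is zero contain no price data at all.
empty_columns = prices.columns[prices.count() == 0].tolist()

# Drop only the columns that are entirely NaN; partially filled ones stay.
cleaned = prices.dropna(axis=1, how="all")
```

Using `how="all"` keeps tickers with a shorter history, such as `CCC` here, which may or may not be what a later step needs.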


## Part 1.2. Processing Raw Data

### Part 1.2.1. Dropping index rows

The column header has several levels (`Price` and `Ticker`). Since we only use the Adj Close price (or another single price type), we drop the top level to make the table flat for further convenience.

In [4]:
df.columns = df.columns.droplevel(0)
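
To see what this step does, here is a small sketch on a made-up frame with the same two-level column layout (`Price`, `Ticker`). The tickers `AAA` and `BBB` and the variable names are assumptions for illustration only:

```python
import pandas as pd

# Toy frame with the same two-level column layout (Price, Ticker) as the download.
columns = pd.MultiIndex.from_tuples(
    [("Adj Close", "AAA"), ("Adj Close", "BBB"), ("Close", "AAA"), ("Close", "BBB")],
    names=["Price", "Ticker"],
)
toy = pd.DataFrame([[1.0, 2.0, 1.1, 2.1], [1.5, 2.5, 1.6, 2.6]], columns=columns)

# Keep only the 'Adj Close' columns, as in the scraping step.
adj_close = toy.loc[:, toy.columns.get_level_values(0) == "Adj Close"]

# Drop the outer 'Price' level so that only ticker symbols remain as column labels.
flat = adj_close.copy()
flat.columns = flat.columns.droplevel(0)
```

After the drop, `flat.columns` is a plain single-level index that still carries the name `Ticker`.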

### Part 1.2.2. Dropping Ticker column and saving CSV

After flattening, `Ticker` remains as the name of the column index (the label in the top-left corner of the table). We export the dataframe to a CSV file with the `index=False` parameter, which leaves the row index out of the file.

In [5]:
save_csv(df, 'historical_data.csv')

'c:\\Users\\nikit\\Desktop\\Personal\\pythonLanguage\\portfolio_optimization_ml\\src\\data\\historical_data.csv'
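
The `save_csv` helper comes from the project's own `read_write_csv` module, which is not shown here. A rough sketch of what such a helper might look like, assuming it writes with `index=False` and returns the saved path as the output above suggests (the name `save_csv_sketch` and its `folder` argument are hypothetical):

```python
from pathlib import Path
import tempfile

import pandas as pd


def save_csv_sketch(df: pd.DataFrame, filename: str, folder: Path) -> str:
    """Hypothetical stand-in for save_csv: write df to folder/filename, return the path."""
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / filename
    # index=False leaves the row index out of the file (assumed from the text above).
    df.to_csv(path, index=False)
    return str(path)


# Toy flat table whose column index is named 'Ticker', as after the droplevel step.
toy = pd.DataFrame({"AAA": [1.0, 1.5], "BBB": [2.0, 2.5]})
toy.columns.name = "Ticker"

out_dir = Path(tempfile.mkdtemp())
saved_path = save_csv_sketch(toy, "historical_data.csv", out_dir)

# Reading the file back shows only the ticker columns.
reloaded = pd.read_csv(saved_path)
```

Note that with `index=False` the row index is not stored, so a date index would have to be kept as a regular column if it is needed later.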