# **Data Scraper**

This notebook generates a `data` folder containing the time series of roughly 500 stocks (the symbols listed in `sp500.csv`) between a start date and an end date that you can define. By default, the data is scraped between 18 Nov. 1999 and 2 Sep. 2020.

This code is mainly inspired by the work of Kassie Papasotiriou and Romane Goldmuntz; special thanks to them.

Many other key functions are in the `TS_utils.py` file; be sure to have a look.

In [1]:
import numpy as np
import numpy.random as rd
import pandas as pd
import pandas_datareader.data as web

In [2]:
try:
  import pandas_market_calendars as mcal
except ImportError:
  # Install the package only when it is missing, then retry the import
  !pip -q install pandas_market_calendars
  import pandas_market_calendars as mcal


In [3]:
# Start and End date of stock data
start_date = pd.to_datetime('1999-11-18')
end_date   = pd.to_datetime('2020-09-02')

In [4]:
# Read names of Stocks we are interested in
symbols = pd.read_csv('https://raw.githubusercontent.com/Amelrich/Capstone-Fall-2020/master/sp500.csv', index_col=False)
symbols = list(symbols['Symbol'].values)
symbols = sorted(symbols)

# Correction for Yahoo Finance scraping: share-class tickers use '-' instead of '.'
symbols = ['BF-B' if x=='BF.B' else x for x in symbols]
symbols = ['BRK-B' if x=='BRK.B' else x for x in symbols]
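
The two corrections above follow the same pattern: the dot in a share-class ticker is replaced by a dash. A minimal sketch of a generic helper is shown below. The name `to_yahoo_symbol` is hypothetical, and the sketch assumes that every dotted ticker in the list is a share-class suffix, which the notebook only confirms for `BF.B` and `BRK.B`.

```python
def to_yahoo_symbol(symbol):
    """Replace the '.' share-class separator with '-' (assumed Yahoo convention)."""
    return symbol.replace(".", "-")


# Tickers without a dot are returned unchanged.
example = [to_yahoo_symbol(s) for s in ["BF.B", "BRK.B", "AAPL"]]
print(example)
```

If this assumption holds for the whole list, the two list comprehensions could be replaced by `symbols = [to_yahoo_symbol(x) for x in symbols]`.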

### Scraping

In [5]:
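def scrape_yahoo(stock_name, start_date, end_date):
  # Scrape the adjusted close and volume of one stock from Yahoo Finance.
  # Returns (DataFrame, find_flag), where find_flag is 1 on success and 0 otherwise.
  try:
    df = web.DataReader(stock_name, 'yahoo', start_date, end_date)
    # Copy the column subset so that adding 'Symbol' does not write to a view
    df = df[['Adj Close', 'Volume']].copy()
    df['Symbol'] = stock_name
    find_flag = 1
    return df, find_flag
  except KeyError:
    print("Could not find data on {}".format(stock_name))
    find_flag = 0
    return pd.DataFrame(), find_flag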
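
`scrape_yahoo` returns a `(data, find_flag)` pair, so a caller can use the flag to keep track of the symbols that could not be found. The sketch below illustrates one possible way to do this. `split_by_flag` and `fake_fetch` are hypothetical names, and `fake_fetch` is an offline stand-in for `scrape_yahoo` with made-up values.

```python
def split_by_flag(symbols, fetch):
    """Call fetch(symbol) -> (data, flag) and separate hits from misses."""
    found, missing = {}, []
    for symbol in symbols:
        data, flag = fetch(symbol)
        if flag:
            found[symbol] = data
        else:
            missing.append(symbol)
    return found, missing


def fake_fetch(symbol):
    # Hypothetical stand-in for scrape_yahoo so the sketch runs without network access.
    known = {"AAA": [1.0, 2.0], "BBB": [3.0]}
    if symbol in known:
        return known[symbol], 1
    return [], 0


found, missing = split_by_flag(["AAA", "BBB", "ZZZ"], fake_fetch)
print(sorted(found), missing)
```

In this notebook, `fetch` could be `lambda s: scrape_yahoo(s, start_date, end_date)`.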

In [6]:
!rm -rf data/
!mkdir data/

We scrape every stock and store its values in a CSV file.

In [7]:
count = 0
table = dict()

for stock in symbols:
  count += 1
  if count % 100 == 0:
    print(f"{count} stocks out of {len(symbols)} completed")

  TS = scrape_yahoo(stock, start_date, end_date)[0]
  table[stock] = len(TS)
  TS.to_csv('data/'+stock+'.csv', header=TS.columns)

100 stocks out of 505 completed
200 stocks out of 505 completed
300 stocks out of 505 completed
400 stocks out of 505 completed
500 stocks out of 505 completed


In [34]:
np.save('data/summary.npy', table) 
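
Because `table` is a Python dict, `np.save` stores it as a pickled object inside a zero-dimensional array, so reading it back requires `allow_pickle=True` followed by `.item()`. The sketch below shows the round trip on a toy dict in a temporary directory. The symbols and row counts are invented for illustration.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for the `table` dict built above (row counts are made up).
toy_table = {"AAA": 5231, "BBB": 0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "summary.npy")
    np.save(path, toy_table)  # the dict is pickled inside a 0-d object array

    # allow_pickle=True is required for object arrays; .item() unwraps the dict.
    restored = np.load(path, allow_pickle=True).item()

# Symbols with zero rows are those for which no data was retrieved.
empty_symbols = [sym for sym, n_rows in restored.items() if n_rows == 0]
print(restored, empty_symbols)
```

The same two calls apply to `data/summary.npy`: `np.load('data/summary.npy', allow_pickle=True).item()`.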

In [None]:
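try:
    from google.colab import files
    # Zip the data folder and trigger a browser download (Colab only)
    !zip -q -r data.zip data/
    files.download('data.zip')
except ImportError:
    print("Download is only available in Colab")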
