# __Preparing data__

- In this chapter, we download the '__adjusted closing price__' of stocks listed in the __Russell 1000__ stock index.
- We use the Russell 1000 instead of the S&P 500 because, with the S&P 500, too few stocks remain after we filter out stocks with missing values between 1999 and 2019.
- We are aware that filtering out stocks with missing values introduces __survivorship bias__. This will be addressed in future research.
- The adjusted closing prices are later converted into daily stock returns, which are then used when we optimize a portfolio.
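
As a minimal sketch of the last point, adjusted closing prices can be turned into daily returns with pandas `pct_change`. The prices and the names `toy_price_df` and `toy_return_df` are made up for illustration, and the use of simple (rather than log) returns is an assumption, since this chapter does not say which kind is used later.

```python
import pandas as pd

# Hypothetical adjusted closing prices for two made-up tickers
toy_price_df = pd.DataFrame(
    {"AAA": [100.0, 102.0, 101.0], "BBB": [50.0, 50.0, 55.0]},
    index=pd.to_datetime(["1999-11-01", "1999-11-02", "1999-11-03"]),
)

# Simple daily return: r_t = P_t / P_(t-1) - 1
# The first row has no previous price, so it is NaN and gets dropped
toy_return_df = toy_price_df.pct_change().dropna()

print(toy_return_df)
```

The resulting frame has one row fewer than the price frame, because the first trading day has no previous price to compare against.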

#### __Description of Data__
- __Timespan :__ Nov.1999 - Nov.2019 (the most recent 20 years)
- __Stock index used :__ Russell 1000
- __Source :__ Yahoo Finance
- __Library used for downloading data :__ pandas-datareader

#### __Contents__

- [__Step 01. Loading stock data__](#Step-01.-Loading-stock-data)
- [__Step 02. Filtering tickers__](#Step-02.-Filtering-tickers)
- [__Step 03. Creating a daily_price_df__](#Step-03.-Creating-a-daily_price_df)
- [__Step 04. Downloading industry information__](#Step-04.-Downloading-industry-information)

---

## Step 01. Loading stock data

__1. Importing required libraries__

In [1]:
# automatically reload already-imported libraries, in case their source code has changed
%load_ext autoreload
%autoreload 2

# libraries for general work
import pickle
import numpy as np
import pandas as pd
from tqdm import tqdm_notebook as tqdm

# for downloading stock data
import pandas_datareader.data as pdr
import datetime

# for downloading industry information of each company
# source : https://github.com/davidastephens/pandas-finance
from pandas_finance import Equity

# suppress all Python warnings
import warnings
warnings.filterwarnings("ignore")

__2. Loading Russell 1000 index constituents__
- Loading the tickers that made up the Russell 1000 index from the beginning of __Nov.1999__ until the end of __Nov.2019__

In [2]:
russel1000_tickers = pd.read_csv("./russel1000_constituents.csv", header=None)
russel1000_tickers = list(russel1000_tickers.iloc[:,0])
russel1000_tickers[:5]

['TWOU', 'MMM', 'ABT', 'ABBV', 'ABMD']

In [3]:
len(russel1000_tickers)

968

__3. Setting the `start` and `end` dates (the date range of the stock data to download)__

In [4]:
# Download window: November 1, 1999 through December 1, 2019
start = datetime.datetime(1999, 11, 1)
end = datetime.datetime(2019, 12, 1)

__4. Downloading stock data__

In [5]:
stock_data_dict = {}
missing_tickers = []

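for ticker in tqdm(russel1000_tickers):
    try:
        # download the daily price history of one ticker from Yahoo Finance
        stock_data = pdr.DataReader(ticker, 'yahoo', start, end)
        stock_data_dict[ticker] = stock_data
    except Exception:
        # catch Exception rather than everything, so that Ctrl+C can still stop the loop
        print(f"The following ticker made an error : {ticker}")
        missing_tickers.append(ticker)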

The following ticker made an error : BHGE
The following ticker made an error : BRK.B
The following ticker made an error : BFA
The following ticker made an error : BFB
The following ticker made an error : DPS
The following ticker made an error : DNB
The following ticker made an error : EGN
The following ticker made an error : FDC
The following ticker made an error : HEI.A
The following ticker made an error : HPT
The following ticker made an error : LEN.B
The following ticker made an error : LGF.A
The following ticker made an error : PF
The following ticker made an error : TSRO
The following ticker made an error : TMK
The following ticker made an error : TRCO
The following ticker made an error : DATA
The following ticker made an error : ULTI
The following ticker made an error : VVC
The following ticker made an error : VSM
The following ticker made an error : USG
The following ticker made an error : UBNT
The following ticker made an error : WFT
The following ticker made an error : JW.A
Th

In [6]:
len(stock_data_dict)

943

In [7]:
len(missing_tickers)

25

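__5. Checking `missing_tickers` once again__
- Yahoo Finance writes share-class tickers with a hyphen instead of a dot (e.g. `BRK.B` is listed as `BRK-B`), so we replace '.' with '-' in the failed tickers and try the download again.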

In [8]:
# replace '.' with '-' in every ticker that failed to download
replaced_missing_tickers = [ticker.replace('.', '-') for ticker in missing_tickers]

In [9]:
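# collect the remaining failures in a separate list,
# so that missing_tickers does not end up with duplicate entries
still_missing_tickers = []

for ticker in tqdm(replaced_missing_tickers):
    try:
        stock_data = pdr.DataReader(ticker, 'yahoo', start, end)
        stock_data_dict[ticker] = stock_data
    except Exception:
        print(f"The following ticker made an error : {ticker}")
        still_missing_tickers.append(ticker)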

The following ticker made an error : BHGE
The following ticker made an error : BFA
The following ticker made an error : BFB
The following ticker made an error : DPS
The following ticker made an error : DNB
The following ticker made an error : EGN
The following ticker made an error : FDC
The following ticker made an error : HPT
The following ticker made an error : PF
The following ticker made an error : TSRO
The following ticker made an error : TMK
The following ticker made an error : TRCO
The following ticker made an error : DATA
The following ticker made an error : ULTI
The following ticker made an error : VVC
The following ticker made an error : VSM
The following ticker made an error : USG
The following ticker made an error : UBNT
The following ticker made an error : WFT
The following ticker made an error : WP



In [10]:
len(stock_data_dict)

948

In [11]:
# Saving the raw downloaded data
with open('russel1000_raw_data_dict.pickle', 'wb') as f:
    pickle.dump(stock_data_dict, f, pickle.HIGHEST_PROTOCOL)

---

## Step 02. Filtering tickers

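__1. Filtering out tickers with missing values__
- A ticker whose price history has fewer rows than the longest history is missing data for part of the period, so we keep only the tickers whose history has the maximum length.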

In [12]:
ticker_list_in_stock_data_dict = list(stock_data_dict.keys())
stock_data_dict_len_list = []

for ticker in ticker_list_in_stock_data_dict:
    stock_data_dict_len_list.append(len(stock_data_dict[ticker]))

In [13]:
maximum_len = pd.Series(stock_data_dict_len_list).max()
maximum_len

5053

In [14]:
filtered_stock_data_dict = {}

for ticker in ticker_list_in_stock_data_dict:
    if len(stock_data_dict[ticker]) == maximum_len:
        filtered_stock_data_dict[ticker] = stock_data_dict[ticker]

In [15]:
len(filtered_stock_data_dict)

590
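
The length filter above can be illustrated on toy data. This is only a sketch: the tickers, prices, and `toy_`-prefixed names are made up, and it relies on the same assumption as the cells above, namely that a shorter history means missing trading days.

```python
import pandas as pd

# Hypothetical per-ticker price histories of different lengths
toy_stock_data_dict = {
    "AAA": pd.DataFrame({"Adj Close": [1.0, 1.1, 1.2, 1.3]}),
    "BBB": pd.DataFrame({"Adj Close": [2.0, 2.1]}),  # shorter history
    "CCC": pd.DataFrame({"Adj Close": [3.0, 3.1, 3.2, 3.3]}),
}

# Longest available history among all tickers
toy_maximum_len = max(len(df) for df in toy_stock_data_dict.values())

# Keep only the tickers that cover the full period
toy_filtered_dict = {
    ticker: df
    for ticker, df in toy_stock_data_dict.items()
    if len(df) == toy_maximum_len
}

print(sorted(toy_filtered_dict))
```

A full-length history can still contain `NaN` cells, so a check such as `df.isna().any()` may be a useful complement to the length comparison.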

In [16]:
# Saving the filtered data
with open('russel1000_stock_data_dict.pickle', 'wb') as f:
    pickle.dump(filtered_stock_data_dict, f, pickle.HIGHEST_PROTOCOL)

In [17]:
# Loading
with open('russel1000_stock_data_dict.pickle', 'rb') as f:
    filtered_stock_data_dict = pickle.load(f)

---

## Step 03. Creating a daily_price_df

__1. Creating a dataframe of price : `daily_price_df`__

In [18]:
daily_price_dict = {}
filtered_tickers = list(filtered_stock_data_dict.keys())

for ticker in filtered_tickers:
    daily_price_dict[ticker] = filtered_stock_data_dict[ticker].loc[:,'Adj Close']

In [19]:
daily_price_df = pd.DataFrame.from_dict(daily_price_dict)
# remove the index name; `del daily_price_df.index.name` fails on recent pandas versions
daily_price_df.index.name = None

daily_price_df.head(3)

|  | MMM | ABT | ABMD | ACHC | ATVI | ADBE | AMD | AES | AMG | AFL | ... | ZBRA | ZION | XEL | XRX | XLNX | YUM | BRK-B | HEI-A | LGF-A | JW-A |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1999-11-01 | 26.186508 | 7.625059 | 10.25 | 3.75 | 1.057096 | 17.013006 | 10.15625 | 22.383265 | 17.685398 | 5.89558 | ... | 26.222221 | 44.172218 | 8.828288 | 59.627827 | 28.284966 | 4.807233 | 41.740002 | 1.89936 | 1.423901 | 12.179971 |
| 1999-11-02 | 26.430267 | 7.187653 | 10.3125 | 3.75 | 1.071191 | 16.888933 | 10.28125 | 22.780302 | 17.523146 | 6.068317 | ... | 25.388889 | 45.481014 | 8.853581 | 57.188522 | 28.801279 | 4.733161 | 42.82 | 1.891793 | 1.385417 | 12.13453 |
| 1999-11-03 | 26.325787 | 6.963037 | 10.25 | 3.875 | 1.089984 | 17.075039 | 10.65625 | 22.20956 | 17.604273 | 6.150932 | ... | 24.944445 | 44.920109 | 8.72711 | 57.324043 | 30.417568 | 4.65168 | 43.060001 | 1.755583 | 1.539352 | 11.725501 |


__2. Checking the dtypes of__ `daily_price_df`
- We need to check the dtypes of the data in the dataframe, because some values that look like numbers (floats, ints, etc.) may actually be stored as strings.
- If strings are mixed with numbers in the dataframe, computations such as `.cov()` cannot be performed.

In [20]:
daily_price_df.dtypes.value_counts()

float64    590
dtype: int64

- In pandas, the dtype `object` typically indicates strings, while `float64` indicates floating-point numbers. All 590 columns are `float64`, so no conversion is needed here.
- Source : <a href="https://pbpython.com/pandas_dtypes.html">_Overview of Pandas Data Types_</a>
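
If an `object` column did show up, one possible remedy is `pd.to_numeric` with `errors="coerce"`, which turns entries that cannot be parsed into `NaN`. The frame below is made up for illustration (`toy_df` and `clean_df` are hypothetical names); this is a sketch of one option, not a step this notebook performs.

```python
import pandas as pd

# Hypothetical frame in which one column was stored as strings
toy_df = pd.DataFrame(
    {
        "AAA": [1.0, 2.0, 3.0],
        "BBB": pd.Series(["4.0", "5.5", "n/a"], dtype=object),
    }
)
print(toy_df.dtypes.value_counts())

# Convert every column to a numeric dtype; unparseable entries become NaN
clean_df = toy_df.apply(pd.to_numeric, errors="coerce")
print(clean_df.dtypes.value_counts())
```

After the conversion, the coerced `NaN` entries still have to be handled (dropped or filled) before the data is used.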

In [22]:
# # Saving
# with open('data/russel1000_daily_price_df.pickle', 'wb') as f: # 
#     pickle.dump(daily_price_df, f, pickle.HIGHEST_PROTOCOL)

In [23]:
# Loading
with open('data/russel1000_daily_price_df.pickle', 'rb') as f:
    daily_price_df = pickle.load(f)

---

## Step 04. Downloading industry information

In [35]:
sector_dict = {}

sector_dict['Industrials'] = []
sector_dict['Healthcare'] = []
sector_dict['Communication Services'] = []
sector_dict['Technology'] = []
sector_dict['Utilities'] = []
sector_dict['Financial Services'] = []
sector_dict['Basic Materials'] = []
sector_dict['Real Estate'] = []
sector_dict['Consumer Defensive'] = []
sector_dict['Consumer Cyclical'] = []
sector_dict['Energy'] = []

In [36]:
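for ticker in tqdm(filtered_tickers):
    try:
        # look up the sector of the company and file the ticker under it
        ticker_sector = Equity(ticker).sector
        sector_dict[ticker_sector].append(ticker)
    except Exception:
        # the lookup failed or returned a sector that is not a key of sector_dict
        print(f"following ticker made an issue : {ticker}")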

following ticker made an issue : BBT
following ticker made an issue : CHK
following ticker made an issue : GD
following ticker made an issue : JEC
following ticker made an issue : STI



In [37]:
sector_dict["Financial Services"].append('BBT')
sector_dict["Energy"].append('CHK')
sector_dict["Industrials"].append('GD')
sector_dict["Industrials"].append('JEC')
sector_dict["Financial Services"].append('STI')
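
The hard-coded sector keys and the manual fixes above could also be handled with a `collections.defaultdict` and a fallback mapping. In this sketch, `lookup_sector` is a hypothetical offline stand-in for `Equity(ticker).sector`, the sector labels given to `MMM` and `ABT` are illustrative assumptions, and the entries in `manual_sectors` are copied from the cell above.

```python
from collections import defaultdict

# Hypothetical offline stand-in for Equity(ticker).sector
# (the labels for MMM and ABT are illustrative assumptions)
toy_sectors = {"MMM": "Industrials", "ABT": "Healthcare"}


def lookup_sector(ticker):
    # Raises KeyError for unknown tickers, mimicking a failed lookup
    return toy_sectors[ticker]


# Manual assignments for tickers whose lookup fails (taken from the cell above)
manual_sectors = {"BBT": "Financial Services", "GD": "Industrials"}

# defaultdict creates the empty list for a sector on first use
toy_sector_dict = defaultdict(list)
for ticker in ["MMM", "ABT", "BBT", "GD"]:
    try:
        sector = lookup_sector(ticker)
    except KeyError:
        sector = manual_sectors[ticker]
    toy_sector_dict[sector].append(ticker)

print(dict(toy_sector_dict))
```

With this layout a newly encountered sector name gets its own list instead of raising a `KeyError`, at the cost of silently accepting misspelled sector names.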

In [38]:
industry_sum = 0

# print the number of tickers per sector and accumulate the total
for industry, tickers in sector_dict.items():
    print(f"{industry} : {len(tickers)}")
    industry_sum += len(tickers)

print()
print(f"Total number of tickers : {industry_sum}")

Industrials : 93
Healthcare : 68
Communication Services : 18
Technology : 67
Utilities : 31
Financial Services : 92
Basic Materials : 30
Real Estate : 49
Consumer Defensive : 37
Consumer Cyclical : 73
Energy : 32

Total number of tickers : 590


In [39]:
# Saving the sector dictionary
with open('./russel1000_sector_dict.pickle', 'wb') as f:
    pickle.dump(sector_dict, f, pickle.HIGHEST_PROTOCOL)