# Importing Market Data from Yahoo! Finance

In this notebook, I will find the ticker symbols for the companies in this universe. This will not be simple: company names change over time, and M&A activity often interferes with the mapping. This will be a learning experience, so I will need to adjust as I go!

In [2]:
import numpy as np
import pandas as pd
import os
import urllib.request

import yfinance

We have a CIK-to-company mapping that was downloaded in another notebook (#ToDo), so let's see if we can use that to help us:

In [3]:
map_df = pd.read_csv('company_list.csv')
map_df.head()

Unnamed: 0,CIK,Company,State/Country
0,1053468,ABBOTT GREGORY,CO
1,1295721,ACE Aviation Holdings Inc.,A8
2,1002819,AIR CANADA /QUEBEC/,A8
3,1110452,AIR FRANCE-KLM /FI,I0
4,310454,AIR MIDWEST INC,KS


In [4]:
os.listdir('sec-edgar-filings/')

['0000100517',
 '0001351548',
 '0000101001',
 'AAL',
 '0001405419',
 '0001614436',
 '0001144331',
 '0001166291',
 '0000921929',
 '0000899394',
 '.DS_Store',
 '0001159154',
 '0001498710',
 '0000714560',
 '0000869187',
 '0000319687',
 '0000027904',
 '0000006201',
 '0001050715',
 '0000904020',
 '0000706270',
 '0000810332',
 '0001029863',
 '0000948845',
 '0001172222',
 '0000766421',
 '0001058033',
 '0000835768',
 '0000793733',
 '0000092380',
 '0000004515',
 '0001088734',
 '0001158463',
 '0001362468',
 '0001011696',
 '0000003202',
 '0000948846',
 '0000914397',
 '0001000578',
 '0000701345',
 '0000046205']

In [5]:
uurl = 'https://www.sec.gov/include/ticker.txt'

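def download(t_url):
    response = urllib.request.urlopen(t_url)
    data = response.read()
    # NOTE: str() on bytes returns the repr ("b'aapl\\t320193\\n...'"), so the
    # text keeps a leading b', a trailing ', and literal backslash escapes.
    # That is why we split on the two-character sequence "\\n" here.
    txt_str = str(data)
    lines = txt_str.split("\\n")
    des_url = 'ticker_to_CIK.csv'
    with open(des_url, "w") as fx:
        for line in lines:
            fx.write(line + "\n")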

download(uurl)

In [6]:
os.listdir()

['company_list.csv',
 'all_stock_parse.ipynb',
 '.DS_Store',
 'master.idx',
 'EDGAR_scrape.ipynb',
 'ticker_to_CIK.txt',
 'cik_ticker.csv',
 'sec-edgar-filings',
 'AAL_parse.ipynb',
 'AAL_parsed.pickle',
 'CIK_to_Ticker.ipynb',
 '.ipynb_checkpoints',
 'ticker_to_CIK.csv',
 'cik_ticker.csv.download']

In [70]:
ticker_to_CIK = pd.read_table('ticker_to_CIK.txt', sep=r'\\t', header=None)
ticker_to_CIK.head()

Unnamed: 0,0,1
0,b'aapl,320193
1,msft,789019
2,amzn,1018724
3,goog,1652044
4,tcehy,1293451


In [71]:
# Fix weird apple entry
ticker_to_CIK.iloc[0,0] = 'aapl'
ticker_to_CIK.head()

Unnamed: 0,0,1
0,aapl,320193
1,msft,789019
2,amzn,1018724
3,goog,1652044
4,tcehy,1293451
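The stray `b'` on the first ticker most likely comes from calling `str()` on the raw bytes in `download`, which returns the bytes repr rather than the decoded text. Here is a minimal sketch of an alternative, assuming the file is UTF-8 text with one tab-separated `ticker<TAB>cik` pair per line. `parse_ticker_text` is a hypothetical helper, not something used elsewhere in this notebook.

```python
def parse_ticker_text(raw: bytes):
    """Decode downloaded bytes and split them into (ticker, cik) string pairs."""
    text = raw.decode("utf-8")  # assumption: the file is UTF-8 encoded
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines, e.g. after a trailing newline
        # partition never raises, even if a line has no tab in it
        ticker, _, cik = line.partition("\t")
        pairs.append((ticker.strip(), cik.strip()))
    return pairs


# Sample bytes in the same shape as the rows shown above.
sample_raw = b"aapl\t320193\nmsft\t789019\naal\t6201\n"
pairs = parse_ticker_text(sample_raw)
print(pairs[0])
```

Decoding first means there is no `b'` prefix, no trailing quote, and the separators are real tab characters, so the manual fix for the first row would not be needed.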


So now we have a mapping from ticker to CIK. However, this is not quite the end of the process. Recall that our directories pad the CIKs with leading zeroes so that each folder name is exactly 10 characters long. We will need to pad our CIKs so that they match.

In [72]:
sample = ticker_to_CIK[ticker_to_CIK[0] == 'aal']
sample

Unnamed: 0,0,1
915,aal,6201


In [73]:
sample.values[0][1].rjust(10, '0')

'0000006201'

The `.rjust` method pads a string on the left to a given width, which is exactly what we need. Let us now pad all of our CIKs:

In [75]:
ticker_to_CIK.index = [x.rjust(10,'0') for x in ticker_to_CIK[1]]
ticker_to_CIK = ticker_to_CIK[0]
ticker_to_CIK

0000320193       aapl
0000789019       msft
0001018724       amzn
0001652044       goog
0001293451      tcehy
               ...   
0001829432     aac-wt
0001209028    aaic-pb
0001209028    aaic-pc
0001838883    aaqc-un
001838883'    aaqc-wt
Name: 0, Length: 12857, dtype: object
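Note that the last index entry, `001838883'`, carries a stray quote (probably the closing quote of the bytes repr), so `rjust` counts it as a character and pads one zero short. Below is a small sketch of a more defensive padding step, assuming a CIK should contain digits only. `pad_cik` is a hypothetical helper introduced for illustration.

```python
def pad_cik(cik) -> str:
    """Return a CIK as a 10-character, zero-padded string."""
    # Keep digits only, which drops stray quotes or whitespace (assumption:
    # a valid CIK never contains non-digit characters).
    digits = "".join(ch for ch in str(cik) if ch.isdigit())
    return digits.zfill(10)


print(pad_cik("6201"))      # 0000006201
print(pad_cik("1838883'"))  # 0001838883
print(pad_cik(320193))      # 0000320193
```

In this notebook, something like `ticker_to_CIK[1].map(pad_cik)` could stand in for the list comprehension above, though I have not rerun the cells with it.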

In [81]:
companies = pd.DataFrame([x for x in os.listdir('sec-edgar-filings/') if len(x) == 10])
companies[:5]

Unnamed: 0,0
0,0000100517
1,0001351548
2,0000101001
3,0001405419
4,0001614436


In [86]:
len([x for x in companies[0] if x not in ticker_to_CIK.index])

27

In [87]:
len(companies)

39

In [77]:
companies[0].map(lambda x: ticker_to_CIK.loc[x][0])

KeyError: '0001351548'

In [80]:
'0001351548' in ticker_to_CIK.index

False
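The `KeyError` arises because `.loc` raises for any label that is missing from the index, and 27 of our 39 folders have no match. A single CIK can also map to several tickers (`0001209028` appears twice above). Here is a rough sketch of a lookup that tolerates both cases, using a plain dictionary. `build_cik_lookup` and the sample values are illustrative assumptions, not part of the notebook.

```python
from collections import defaultdict


def build_cik_lookup(pairs):
    """Group tickers by zero-padded CIK; one CIK may map to several tickers."""
    lookup = defaultdict(list)
    for ticker, cik in pairs:
        lookup[str(cik).zfill(10)].append(ticker)
    return dict(lookup)


# Illustrative (ticker, cik) pairs taken from the outputs above.
pairs = [
    ("aapl", "320193"),
    ("aal", "6201"),
    ("aaic-pb", "1209028"),
    ("aaic-pc", "1209028"),
]
lookup = build_cik_lookup(pairs)

# Illustrative folder names; a missing CIK yields an empty list, not an error.
folders = ["0000006201", "0001351548", "0001209028"]
matches = {cik: lookup.get(cik, []) for cik in folders}
print(matches)
```

Staying in pandas, `companies[0].isin(ticker_to_CIK.index)` should give a boolean mask of which folders have a ticker at all, which would be a way to separate the matched CIKs from the ones that still need investigating.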

In [31]:
companies

Unnamed: 0,0
0,100517
1,1351548
2,101001
3,1405419
4,1614436
5,1144331
6,1166291
7,921929
8,899394
9,1159154


In [69]:
ticker_to_CIK.sort_values(0)[-1030:-1000]

padded,0,1
1821424,uk,1821424
1821424,ukomw,1821424
1856659,ukwi,1856659
217410,ul,217410
875657,ulbi,875657
831001,ulbr,831001
1670076,ulcc,1670076
1415311,ule,1415311
1308208,ulh,1308208
1605810,ulnv,1605810


In [59]:
len('0000100517')

10

In [66]:
ticker_to_CIK.loc['0001351548'][0]

KeyError: '0001351548'

I will also need to find the historical tickers for each company. Perhaps I am looking at this the wrong way and I will need to also scrape the tickers from the files? Something to investigate.