# Read the Data from Yahoo Finance Remotely for One Stock

Use the `pandas_datareader` package to read the data remotely: <sup>[[1]](#ft1)</sup>
+ run `pip install pandas_datareader` in a shell to install it

In [1]:
import pandas as pd
# REF: https://pandas.pydata.org/pandas-docs/version/0.18.1/remote_data.html#remote-data-yahoo
# Install pandas_datareader first.
import pandas_datareader.data as web
import datetime 

t1 = datetime.datetime.now()
start = datetime.datetime(1970,1,1)
end = datetime.date.today()
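# read data from Yahoo Finance remotely
# for i in range(10):
BHP = web.DataReader("BHP", "yahoo", start, end)
t2 = datetime.datetime.now()
# The loop above is commented out, so this prints one request's time divided by 10;
# re-enable the loop to get a real 10-run average when comparing proxy setups.
print((t2 - t1)/10)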

0:00:00.128844
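
The timing cell reports the elapsed time divided by 10 while its `for` loop is commented out. A minimal sketch of averaging over repeated calls is shown here; `average_runtime` is a hypothetical helper (not part of `pandas_datareader`), and the lambda is a stand-in workload so the sketch runs without network access.

```python
import time

def average_runtime(func, repeats=10):
    """Return the mean wall-clock seconds of calling func() `repeats` times."""
    started = time.perf_counter()
    for _ in range(repeats):
        func()
    return (time.perf_counter() - started) / repeats

# Stand-in workload; in the notebook this would wrap the DataReader call.
mean_seconds = average_runtime(lambda: sum(range(1000)), repeats=10)
print(f"{mean_seconds:.6f} s per call")
```

Dividing by the same `repeats` value that drives the loop keeps the reported figure a true per-call average.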


# How to Obtain the Company List of an Industry

1. Website: https://finance.yahoo.com/sector/ms_basic_materials
2. Select the filter
3. Copy the list into a txt file, then extract the company symbols with a regex
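
The regex step can be sketched as follows. This assumes each copied row starts with the ticker symbol followed by a tab; the sample rows are made up for illustration and are not real screener data.

```python
import re

# Assumed layout: one screener row per line, ticker first, then a tab.
SYMBOL = re.compile(r"^[A-Z0-9.&-]+(?=\t)")

# Made-up sample rows, including a header line that should be skipped.
sample = (
    "Symbol\tName\tPrice\n"
    "BHP\tExample Co A\t1.00\n"
    "CTA-PA\tExample Co B\t2.00\n"
    "COX&KINGS.NS\tExample Co C\t3.00\n"
)

symbols = []
for line in sample.splitlines():
    match = SYMBOL.match(line)
    if match:
        symbols.append(match.group())

print(symbols)  # ['BHP', 'CTA-PA', 'COX&KINGS.NS']
```

The header row is skipped because `Symbol` contains lowercase letters, so the character class cannot reach the tab.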

In [8]:
import re
from collections import defaultdict
import pandas as pd
import pandas_datareader.data as web
import datetime
import os

def get_industry_data(path: str, start: datetime.datetime, end: datetime.datetime):
    '''
    Description: read the txt file, extract the company symbols, then download
                 the historical stock prices between start and end.
    Input:
        path = the path of the txt file
        start = start time
        end = end time
    Output:
        a dictionary {company_symbol: pd.DataFrame(historical stock prices)}
    '''
    # read the txt file
    with open(path, 'r') as f:
        lines = f.readlines()
    # a symbol is the run of capitals, digits, '.', '&' or '-' before the first tab
    pattern = re.compile(r'^[A-Z0-9.&-]+(?=\t)')

    company_list = []
    # get the company list
    for line in lines:
        company_list.extend(pattern.findall(line))

    company_dict = defaultdict(pd.DataFrame)
    for c in company_list:
        # read data from Yahoo Finance remotely
        try:
            company_dict[c] = web.DataReader(c, "yahoo", start, end)
            # print(c, 'complete')
        except Exception:
            # print the symbols whose download failed
            print(c)

    return company_dict

def to_csv(d: dict, folder: str):
    """
    Store the company data in csv files
    Input:
        d = the company dictionary
        folder = the name of the industry folder under the data folder
    Output:
        one csv file per company, holding its historical stock prices
    """

    # store the data in the data folder
    # REF: https://www.jianshu.com/p/dde02a88a5c1
    root_path = os.path.abspath(os.path.dirname(os.getcwd()))
    data_path = os.path.join(root_path, 'data')
    path = os.path.join(data_path, folder)
    try:
        # REF: https://www.geeksforgeeks.org/create-a-directory-in-python/
        os.mkdir(path) # create the corresponding industry directory
    except FileExistsError:
        # the folder already exists: empty it
        for f in os.listdir(path):
            os.remove(os.path.join(path, f))
    for k, v in d.items():
        file = os.path.join(path, k + '.csv') # file location
        v.to_csv(path_or_buf=file, index=True)
        
def get_url():
    '''
    Print the history-page url of each company, to check by hand whether it has data prior to 2021-01-01
    Input: space-separated company symbols, typed at the prompt
    Output: one markdown-style url link per company, printed
    '''
    text = input('company').split(' ')
    for t in text:
        url = "https://finance.yahoo.com/quote/NAME/history?p=NAME"
        # plain string replacement; no regex is needed here
        url = url.replace('NAME', t)
        print('+ [{c}]({u})'.format(c = t, u = url))

start = datetime.datetime(1970,1,1) # start time point 
end = datetime.datetime(2021,1,1) # end time point
root_path = os.path.abspath(os.path.dirname(os.getcwd()))
data_path = os.path.join(root_path, 'data')
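
`to_csv` creates the industry folder and empties it when it already exists. A minimal sketch of that folder handling with `pathlib` is shown here; `prepare_folder` is a hypothetical helper name, and the temporary directory stands in for the project's `data` folder.

```python
import tempfile
from pathlib import Path

def prepare_folder(base: Path, folder: str) -> Path:
    """Create base/folder if needed and delete any files already in it."""
    target = base / folder
    target.mkdir(parents=True, exist_ok=True)  # also creates a missing base
    for item in target.iterdir():
        if item.is_file():
            item.unlink()
    return target

with tempfile.TemporaryDirectory() as tmp:
    out = prepare_folder(Path(tmp) / "data", "agriculture")
    (out / "old.csv").write_text("stale")  # simulate a file from a previous run
    out = prepare_folder(Path(tmp) / "data", "agriculture")
    remaining = sorted(p.name for p in out.iterdir())

print(remaining)  # []
```

Passing `parents=True` means the call also works when the `data` folder itself does not exist yet, which plain `os.mkdir` does not handle.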

## Agriculture

<!-- ![](https://user-images.githubusercontent.com/22797017/116096274-bb93b400-a6db-11eb-824d-94c837ef181c.png) -->
![](https://user-images.githubusercontent.com/22797017/116341107-ecc2d000-a812-11eb-8a15-21a47afcc38e.png)

<!-- The following companies do not have the data prior to 2021-01-01:
+ [PPTA](https://finance.yahoo.com/quote/PPTA/history?p=PPTA)
+ [CNEY](https://finance.yahoo.com/quote/CNEY/history?p=CNEY)
+ [HUDI](https://finance.yahoo.com/quote/HUDI/history?p=HUDI)
+ [CRKN](https://finance.yahoo.com/quote/CRKN/history?p=CRKN)
+ [ZY](https://finance.yahoo.com/quote/ZY/history?p=ZY) -->

Companies: 
```
YTEN; RKDA; CGA; SEED; MBII; IPI; AVD; UAN; MGPI; ICL; CF; SMG; MOS; FMC; NTR; CTVA; CTA-PA; CTA-PB;
```

In [4]:
agri_dict = get_industry_data(os.path.join(data_path, 'agriculture.txt'), start, end)
print('data is ready.')
# to_csv(agri_dict, 'agriculture')

data is ready.


In [4]:
get_url()

company
+ [](https://finance.yahoo.com/quote//history?p=)


In [5]:
t = ''
for k in agri_dict.keys():
    t = k + '; ' + t
    
print(len(agri_dict.keys()))
t

18


'YTEN; RKDA; CGA; SEED; MBII; IPI; AVD; UAN; MGPI; ICL; CF; SMG; MOS; FMC; NTR; CTVA; CTA-PA; CTA-PB; '
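
The loop in the cell above prepends each key, so the printed string lists the symbols in reverse insertion order and ends with a trailing `'; '`. A small sketch of the equivalent `join` form, using an illustrative three-symbol list rather than the real dictionary:

```python
symbols = ["CTA-PB", "CTA-PA", "CTVA"]  # illustrative insertion order

# Same prepend loop as in the notebook cell.
loop_result = ""
for s in symbols:
    loop_result = s + "; " + loop_result

# Equivalent one-liner, without the trailing separator.
joined = "; ".join(reversed(symbols))

print(loop_result)  # CTVA; CTA-PA; CTA-PB; 
print(joined)       # CTVA; CTA-PA; CTA-PB
```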

## Energy

![](https://user-images.githubusercontent.com/22797017/115959053-e398e100-a53c-11eb-9161-db98a1c912d6.png)

The following companies have no data prior to 2021-01-01<sup>[[2]](#ft2)</sup>:
+ [CSAN](https://finance.yahoo.com/quote/CSAN/history?p=CSAN)
+ [CHK](https://finance.yahoo.com/quote/CHK/history?p=CHK)
+ [DEN](https://finance.yahoo.com/quote/DEN/history?p=DEN)
+ [XOG](https://finance.yahoo.com/quote/XOG/history?p=XOG)
+ [VEI](https://finance.yahoo.com/quote/VEI/history?p=VEI)
+ [AMR](https://finance.yahoo.com/quote/AMR/history?p=AMR)
+ [GLP-PB](https://finance.yahoo.com/quote/GLP-PB/history?p=GLP-PB)

Otherwise:
```
GMLPP; GLP-PA; TGP-PB; DLNG-PA; DCP-PC; NS-PA; ALIN-PB; ALIN-PE; CEQP-P; HMLP-PA; GLOP-PA; EP-PC; TGP-PA; ALIN-PA; NS-PC; DCP-PB; MTR; MARPS; ICD; CELP; IO; NRT; PVL; PHX; DWSN; NNA; CRT; MVO; NINE; GIFI; VOC; PRT; BPT; DLNG-PB; KLXE; SMLP; RNGR; CCLP; MMLP; AMPY; DLNG; EXTN; FTK; FET; SND; SBOW; GEOS; NGS; GLOP; EGY; SD; NC; TNP; PBT; RNET; TUSK; OSG; VIST; SJT; NRP; BORR; NGL; NR; CEIX; TNP-PF; TNP-PD; TNP-PE; TK; TTI; NOA; PDS; OIS; DSSI; HESM; LPI; BTU; WTI; SGU; TNK; CLMT; SOI; REX; PVAC; BRY; WTTR; SBR; DMLP; ESTE; TDW; HMLP; SRLP; GLOG; NBR-PA; NVGS; KRP; NBR; TRMD; FLNG; BCEI; LPG; TGS; HLX; NEX; MRC; ARCH; OMP; GLP; BOOM; FI; CAPL; PARR; VTOL; SLCA; ARLP; MNRL; TALO; NGL-PB; PBFX; GLOP-PB; GLOP-PC; PUMP; NGL-PC; GPRK; DRQ; DHT; RES; OII; DNOW; VET; KOS; STNG; CRK; ERF; GEL; PTEN; CLB; GLNG; NBLX; TGP; CPE; WLL; GLOG-PA; BPMP; AROC; OAS; FRO; YPF; USAC; PBF; DK; EURN; RTLR; DKL; PAGP; LBRT; SM; CRC; CPG; RIG; CVI; NS; CLNE; BSM; INT; HEP; WHD; CEQP; RRC; REGI; AR; MUR; ENLC; SWN; MGY; HP; VNOM; MTDR; NS-PB; CNX; ENBL; FTI; SUN; PDCE; ETRN; CHX; UGP; AM; NFG; DCP; EQT; NOV; VVV; SHI; HFC; OVV; SHLX; XEC; COG; CCJ; PAA; APA; TRGP; PSXP; WES; MRO; CLR; MMP; SSL; TPL; FANG; TS; DVN; CVE; PBA; HAL; HES; BKR; ET; OKE; OXY; EC; MPLX; VLO; WMB; SU; PXD; PSX; MPC; CNQ; SLB; KMI; EOG; E; TRP; EPD; PBR; PBR-A; EQNR; COP; ENB; SNP; BP; TOT; PTR; RDS-B; RDS-A; CVX; XOM;
```
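
The split between failed and successful downloads can also be collected in code instead of being read off the printed output. A sketch under stated assumptions: `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the remote call so the example runs offline.

```python
def fetch_all(symbols, fetch):
    """Return (data, failed): results by symbol, and the symbols whose fetch raised."""
    data, failed = {}, []
    for symbol in symbols:
        try:
            data[symbol] = fetch(symbol)
        except Exception:
            failed.append(symbol)
    return data, failed

# Stand-in for the remote call; raises KeyError for symbols listed in `broken`.
def fake_fetch(symbol, broken=("CHK", "DEN")):
    if symbol in broken:
        raise KeyError("Date")
    return [symbol]  # placeholder for a price table

data, failed = fetch_all(["XOM", "CHK", "CVX", "DEN"], fake_fetch)
print(sorted(data))  # ['CVX', 'XOM']
print(failed)        # ['CHK', 'DEN']
```

Returning the failures as a list makes it straightforward to feed them into a link generator or to retry them later.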

In [5]:
energy_dict = get_industry_data(os.path.join(data_path, 'energy.txt'), start, end)
print('data is ready.')
# to_csv(energy_dict, 'energy')

CSAN
CHK
DEN
XOG
VEI
AMR
GLP-PB
data is ready.


In [None]:
get_url()

In [6]:
t = ''
for k in energy_dict.keys():
    t = k + '; ' + t

print(len(energy_dict.keys()))
t

248


'GMLPP; GLP-PA; TGP-PB; DLNG-PA; DCP-PC; NS-PA; ALIN-PB; ALIN-PE; CEQP-P; HMLP-PA; GLOP-PA; EP-PC; TGP-PA; ALIN-PA; NS-PC; DCP-PB; MTR; MARPS; ICD; CELP; IO; NRT; PVL; PHX; DWSN; NNA; CRT; MVO; NINE; GIFI; VOC; PRT; BPT; DLNG-PB; KLXE; SMLP; RNGR; CCLP; MMLP; AMPY; DLNG; EXTN; FTK; FET; SND; SBOW; GEOS; NGS; GLOP; EGY; SD; NC; TNP; PBT; RNET; TUSK; OSG; VIST; SJT; NRP; BORR; NGL; NR; CEIX; TNP-PF; TNP-PD; TNP-PE; TK; TTI; NOA; PDS; OIS; DSSI; HESM; LPI; BTU; WTI; SGU; TNK; CLMT; SOI; REX; PVAC; BRY; WTTR; SBR; DMLP; ESTE; TDW; HMLP; SRLP; GLOG; NBR-PA; NVGS; KRP; NBR; TRMD; FLNG; BCEI; LPG; TGS; HLX; NEX; MRC; ARCH; OMP; GLP; BOOM; FI; CAPL; PARR; VTOL; SLCA; ARLP; MNRL; TALO; NGL-PB; PBFX; GLOP-PB; GLOP-PC; PUMP; NGL-PC; GPRK; DRQ; DHT; RES; OII; DNOW; VET; KOS; STNG; CRK; ERF; GEL; PTEN; CLB; GLNG; NBLX; TGP; CPE; WLL; GLOG-PA; BPMP; AROC; OAS; FRO; YPF; USAC; PBF; DK; EURN; RTLR; DKL; PAGP; LBRT; SM; CRC; CPG; RIG; CVI; NS; CLNE; BSM; INT; HEP; WHD; CEQP; RRC; REGI; AR; MUR; ENLC; S

## Transportation

![](https://user-images.githubusercontent.com/22797017/115959990-99fec500-a541-11eb-89cd-19ff40f4771e.png)

## Travel

![](https://user-images.githubusercontent.com/22797017/116093914-a9187b00-a6d9-11eb-8348-722caa8ce235.png)

The following companies have no data prior to 2021-01-01:

+ [EASEMYTRIP.NS](https://finance.yahoo.com/quote/EASEMYTRIP.NS/history?p=EASEMYTRIP.NS)
+ [NNAX](https://finance.yahoo.com/quote/NNAX/history?p=NNAX)
+ [61T.F](https://finance.yahoo.com/quote/61T.F/history?p=61T.F)
+ [CLVB.DU](https://finance.yahoo.com/quote/CLVB.DU/history?p=CLVB.DU)
+ [LI4.MU](https://finance.yahoo.com/quote/LI4.MU/history?p=LI4.MU)
+ [9961.HK](https://finance.yahoo.com/quote/9961.HK/history?p=9961.HK)
+ [CLVB.MU](https://finance.yahoo.com/quote/CLVB.MU/history?p=CLVB.MU)
+ [CLVB.F](https://finance.yahoo.com/quote/CLVB.F/history?p=CLVB.F)
+ [26Y.F](https://finance.yahoo.com/quote/26Y.F/history?p=26Y.F)
+ [LI4.DU](https://finance.yahoo.com/quote/LI4.DU/history?p=LI4.DU)
+ [0TUA.DU](https://finance.yahoo.com/quote/0TUA.DU/history?p=0TUA.DU)
+ [TUI1.HM](https://finance.yahoo.com/quote/TUI1.HM/history?p=TUI1.HM)

Otherwise:
```
LAG.SG; FLI.BE; 1NC.BE; PCE1.HM; E3X1.DU; 1NC.DU; 09B.BE; T6A.HM; TUI1.HM; MY1.BE; 1NC.HM; E3X1.MU; WBJ.BE; RC8.BE; E3X1.BE; E3X1.SG; 1NC.SG; E3X1.HM; PCE1.MU; TUI2.BE; WBJ.MU; TUI2.MU; CLV.DU; DG1.BE; T6A.MU; RC8.SG; CG6C.BE; CVC1.DU; 0HB2.IL; 1C6.SG; HOC.SG; 0W2Y.IL; PCE1.DU; CLV.MU; TXM1.MU; T6A.DU; CVC1.HM; 0TUA.BE; 1481.KL; 6NIQ.BE; 0022.KL; 4ZO0.F; E3X1.HA; PCE1.HA; TUI1.SG; T6A.SG; LAG.BE; 09B.DU; RC8.DU; HOC.MU; 09B.MU; HOC.HM; CLV.BE; PCE1.BE; TVD6.BE; LAG.HM; LAG.MU; TVD6.DU; TEM.BE; TXM1.BE; WBJ.HM; PCE1.SG; WBJ.SG; 9113.KL; E3X1.DE; 0I50.IL; HOC.DU; TUI1.MU; 0TUA.SG; RC8.MU; 0TUA.MU; HOC.BE; 1NC.HA; 5016.KL; TVD6.SG; PCE1.DE; 1NC.MU; CVC1.SG; T6A.BE; TUI1.HA; TXM1.SG; CVC1.BE; TUI1.DU; DG1.MU; LAG.DU; D3G.SG; MY1.SG; TUI1.BE; 6T8.MU; T6A.HA; CVC1.HA; LAG.HA; WESC; 5IH.F; A8N.MU; PSA.MI; SOS.MI; ONVC; ZMA.V; ASWN.SW; NTU1L.VS; JAY.AX; MKGI; CROWNTOURS.BO; FLAP.IS; LRG.SG; AVIA.TA; TVD6.F; 26Y.SG; HOC.F; YTRA; HSW.F; 1TJ.F; HSW.IR; 1235.HK; HOC.DE; 8668.HK; TRZ.TO; TZOO; ID9.F; 8069.HK; COX&KINGS.NS; TRRB; VMV.BO; COX&KINGS.BO; 1620.HK; 1701.HK; 0TUA.F; HLO.AX; 6882.HK; ALVDM.PA; 09B.F; TOUR; 2719.TWO; ITHL.BO; LMN.SW; 1901.HK; EDR.MC; MIN.L; 8095.HK; 2743.TWO; 2745.TWO; 0487.HK; 2734.TWO; LI4.F; 9BP.F; D3G.F; LIND; 0265.HK; DESP; 6242.TWO; ISTA.TA; WBJ.F; 600706.SS; 6T8.F; 1C6.F; 1745.HK; WEB.AX; 9376.T; 300178.SZ; MY1.F; 002159.SZ; FLI.F; 603099.SS; 603199.SS; 2T9A.F; 5706.TW; CTD.AX; MMYT; DG1.F; DRTGF; MMB.PA; 6577.T; NTHOL.IS; LAG.F; 000888.SZ; FLT.AX; TEM.F; 1810.SR; TEM.DU; CVCB3.SA; 300859.SZ; TUI2.F; 6548.T; TUI2.SG; TUI1.F; TUI1.DE; 002707.SZ; TNL; TRIP.MI; T6A.F; T6A.DE; TRIP.VI; TUIFF; 2731.TW; TUIFY; TRIP; TENG.L; NCLH.VI; 1NC.F; 000796.SZ; 6561.T; NCLH; 9085.S; THOMASCOOK.BO; HSW.L; 1992.HK; PGJO.JK; RC8.F; THOMASCOOK.NS; CLV.F; 7048.T; RCL; EXPE.VI; E3X1.F; TCOM; EXPE; CVC1.F; 9726.T; CCL; T1RI34.SA; YELO.JK; 6030.T; 0780.HK; 6191.T; N1CL34.SA; OTB.L; BOOK.VI; PCE1.F; DESP.BA; BKNG; R1CL34.SA; TRIP.MX; EXGR34.SA; CRIP34.SA; NCLHN.MX; 
C1CL34.SA; PANR.JK; PDES.JK; TRN.L; BAYU.JK; RCL.MX; TUI.L; EXPE.MX; BKNG34.SA; 601888.SS; TRIP.BA; 039130.KS; 032350.KS; SONA.JK; BKNG.MX;
```

In [25]:
travel_dict = get_industry_data(os.path.join(data_path, 'travel.txt'), start, end)
print('data is ready.')
# to_csv(travel_dict, 'travel')

EASEMYTRIP.NS
NNAX
61T.F
CLVB.DU
LI4.MU
9961.HK
CLVB.MU
CLVB.F
26Y.F
LI4.DU
0TUA.DU
TUI1.HM
data is ready.


In [53]:
get_url()

companyEASEMYTRIP.NS NNAX 61T.F CLVB.DU LI4.MU 9961.HK CLVB.MU CLVB.F 26Y.F LI4.DU 0TUA.DU TUI1.HM
+ [EASEMYTRIP.NS](https://finance.yahoo.com/quote/EASEMYTRIP.NS/history?p=EASEMYTRIP.NS)
+ [NNAX](https://finance.yahoo.com/quote/NNAX/history?p=NNAX)
+ [61T.F](https://finance.yahoo.com/quote/61T.F/history?p=61T.F)
+ [CLVB.DU](https://finance.yahoo.com/quote/CLVB.DU/history?p=CLVB.DU)
+ [LI4.MU](https://finance.yahoo.com/quote/LI4.MU/history?p=LI4.MU)
+ [9961.HK](https://finance.yahoo.com/quote/9961.HK/history?p=9961.HK)
+ [CLVB.MU](https://finance.yahoo.com/quote/CLVB.MU/history?p=CLVB.MU)
+ [CLVB.F](https://finance.yahoo.com/quote/CLVB.F/history?p=CLVB.F)
+ [26Y.F](https://finance.yahoo.com/quote/26Y.F/history?p=26Y.F)
+ [LI4.DU](https://finance.yahoo.com/quote/LI4.DU/history?p=LI4.DU)
+ [0TUA.DU](https://finance.yahoo.com/quote/0TUA.DU/history?p=0TUA.DU)
+ [TUI1.HM](https://finance.yahoo.com/quote/TUI1.HM/history?p=TUI1.HM)


In [26]:
t = ''
for k in travel_dict.keys():
    t = k + '; ' + t

print(len(travel_dict.keys()))
t

256


'LAG.SG; FLI.BE; 1NC.BE; PCE1.HM; E3X1.DU; 1NC.DU; 09B.BE; T6A.HM; MY1.BE; 1NC.HM; E3X1.MU; WBJ.BE; RC8.BE; E3X1.BE; E3X1.SG; 1NC.SG; E3X1.HM; PCE1.MU; TUI2.BE; WBJ.MU; TUI2.MU; CLV.DU; DG1.BE; T6A.MU; RC8.SG; CG6C.BE; CVC1.DU; 0HB2.IL; 1C6.SG; HOC.SG; 0W2Y.IL; PCE1.DU; CLV.MU; TXM1.MU; T6A.DU; CVC1.HM; 0TUA.BE; 1481.KL; 6NIQ.BE; 0022.KL; 4ZO0.F; E3X1.HA; PCE1.HA; TUI1.SG; T6A.SG; LAG.BE; 09B.DU; RC8.DU; HOC.MU; 09B.MU; HOC.HM; CLV.BE; PCE1.BE; TVD6.BE; LAG.HM; LAG.MU; TVD6.DU; TEM.BE; TXM1.BE; WBJ.HM; PCE1.SG; WBJ.SG; 9113.KL; E3X1.DE; 0I50.IL; ZMWYD; HOC.DU; TUI1.MU; 0TUA.SG; RC8.MU; 0TUA.MU; HOC.BE; 1NC.HA; 5016.KL; TVD6.SG; TVD6.MU; PCE1.DE; 1NC.MU; CVC1.SG; T6A.BE; TUI1.HA; TXM1.SG; CVC1.BE; TUI1.DU; DG1.MU; LAG.DU; D3G.SG; MY1.SG; TUI1.BE; 6T8.MU; T6A.HA; CVC1.HA; LAG.HA; WESC; 5IH.F; A8N.MU; PSA.MI; SOS.MI; ONVC; ZMA.V; ASWN.SW; NTU1L.VS; JAY.AX; 6NIQ.F; MKGI; CROWNTOURS.BO; FLAP.IS; LRG.SG; AVIA.TA; TVD6.F; 26Y.SG; HOC.F; YTRA; HSW.F; 1TJ.F; HSW.IR; 1235.HK; HOC.DE; TXM1.F; 866

## Issue: Missing Region

[Solution](https://github.com/lowspace/MAST90106/blob/main/code/Yahoo%20Finance%20Profile%20Page.ipynb): crawl the profile page of each company, then apply our own filter.

REF1: https://info.cloudquant.com/2019/07/url2symbol/  
REF2: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find

Known issue with this approach:
+ some companies do not have a profile page yet, such as:
    + https://finance.yahoo.com/quote/GMLPP/profile?p=GMLPP
    + https://finance.yahoo.com/quote/LAG.SG/profile?p=LAG.SG
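
A minimal sketch of building these profile URLs from a symbol. `profile_url` is a hypothetical helper; percent-encoding the symbol is an assumption made here because symbols such as `COX&KINGS.NS` contain `&`, which would otherwise split the query string. Whether the site accepts the encoded form is not verified.

```python
from urllib.parse import quote

def profile_url(symbol: str) -> str:
    """Build a profile-page URL, percent-encoding reserved characters in the symbol."""
    encoded = quote(symbol, safe="")
    return f"https://finance.yahoo.com/quote/{encoded}/profile?p={encoded}"

print(profile_url("GMLPP"))
print(profile_url("COX&KINGS.NS"))
```

Plain symbols such as `GMLPP` or `LAG.SG` pass through unchanged, while `&` becomes `%26`.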

# Footnotes

<a name="ft1">[1]</a>: How to access Yahoo Finance data remotely? (Chinese version) https://blog.csdn.net/Hellolijunshy/article/details/82527643

<a name="ft2">[2]</a>: Getting KeyError : 'Date' in Yahoo https://github.com/pydata/pandas-datareader/issues/640