<a href="https://colab.research.google.com/github/livcitylit/tmp/blob/master/Stock%20project/Week%202/APIs_and_scraping_for_stock_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1 - Yahoo finance API (Application programming interface)

Ever since Yahoo! finance decommissioned their historical data API, many programs that relied on it have stopped working.

yfinance aims to solve this problem by offering a reliable, threaded, and Pythonic way to download historical market data from Yahoo! finance.
It is open source and free to use.

You will find that this API arbitrarily slows down and speeds up when you are downloading lots of data. This is intentional, so that Yahoo finance doesn't flag you for scraping.

In [None]:
!pip install yfinance


In [None]:
import numpy as np
import pandas as pd
import yfinance as yf

On the stock market, each company is identified by an abbreviation known as a ticker. This is how we refer to companies on Yahoo finance. Tickers for companies traded on the London Stock Exchange end in .L.

In [None]:
# we now store the information from this ticker inside the variable tesco

tesco = yf.Ticker("TSCO.L")

tesco.info gives us a dictionary with the overall info for the company. It holds very important information such as the market cap (the total market value of all the company's shares), netIncome, the previous close price (the stock price when the market last closed), profit margin, volume traded and much more. If you need a refresher on dictionaries, you'll find some resources on the google classroom.

In [None]:
tesco.info

{'52WeekChange': -0.00042957067,
 'SandP52WeekChange': 0.13917124,
 'address1': 'Tesco House',
 'address2': 'Shire Park Kestrel Way',
 'algorithm': None,
 'annualHoldingsTurnover': None,
 'annualReportExpenseRatio': None,
 'ask': 227.1,
 'askSize': 0,
 'averageDailyVolume10Day': 44695995,
 'averageVolume': 25159565,
 'averageVolume10days': 44695995,
 'beta': 0.300632,
 'beta3Year': None,
 'bid': 226.9,
 'bidSize': 0,
 'bookValue': 1.253,
 'category': None,
 'circulatingSupply': None,
 'city': 'Welwyn Garden City',
 'companyOfficers': [],
 'country': 'United Kingdom',
 'currency': 'GBp',
 'dateShortInterest': None,
 'dayHigh': 228.5,
 'dayLow': 226.2,
 'dividendRate': 0.1,
 'dividendYield': 0.0417,
 'earningsQuarterlyGrowth': 0.42,
 'enterpriseToEbitda': 8.97,
 'enterpriseToRevenue': 0.557,
 'enterpriseValue': 36185833472,
 'exDividendDate': 1602720000,
 'exchange': 'LSE',
 'exchangeTimezoneName': 'Europe/London',
 'exchangeTimezoneShortName': 'GMT',
 'expireDate': None,
 'fiftyDayAvera
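
Since tesco.info is an ordinary dictionary, the usual dictionary tools apply. The sketch below uses a hand-made sample (`info_sample`, a hypothetical name) holding a few values copied from the output above, so it runs without a network connection. Note that the currency is `GBp`, which means prices are quoted in pence, not pounds.

```python
# A few values copied from the tesco.info output above; the live dictionary has many more keys.
info_sample = {"ask": 227.1, "bid": 226.9, "currency": "GBp", "beta3Year": None}

# Bid-ask spread in the quoted currency (pence, because currency is 'GBp').
spread_pence = info_sample["ask"] - info_sample["bid"]

# Convert the ask price from pence to pounds.
ask_pounds = info_sample["ask"] / 100

# .get returns None (or a default you choose) when a key is missing,
# instead of raising a KeyError.
market_cap = info_sample.get("marketCap")  # not in our small sample, so None

# Careful: the default is only used when the key is absent.
# 'beta3Year' exists with the value None, so we get None here, not "n/a".
beta_3y = info_sample.get("beta3Year", "n/a")

print(round(spread_pence, 2), round(ask_pounds, 3), market_cap, beta_3y)
```

With the real dictionary you would write `tesco.info.get(...)` in the same way; which keys are present can differ between tickers, so `.get` is the safer choice when looping over many companies.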

We can use the history method to look at the historic prices for the stock. Open is the price of the stock when the market opened on a given day, and Close is the price when it closed. High and Low are the highest and lowest prices for the day, and Volume is the number of shares traded that day. Dividends are payments to shareholders. Sometimes stocks are also split, like Tesla did recently, and this is shown in the Stock Splits column. You can specify the period or the interval you want data for as arguments to history.

In [None]:
tesco.history(period='max')

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
1988-07-01,16.005281,16.005281,16.005281,16.005281,0,0.0,0.0
1988-07-04,15.691253,15.691253,15.691253,15.691253,0,0.0,0.0
1988-07-05,15.900496,15.900496,15.900496,15.900496,0,0.0,0.0
1988-07-06,15.691253,15.691253,15.691253,15.691253,0,0.0,0.0
1988-07-07,15.377544,15.377544,15.377544,15.377544,0,0.0,0.0
...,...,...,...,...,...,...,...
2020-11-18,227.399994,233.800003,225.699997,233.199997,41037018,0.0,0.0
2020-11-19,232.300003,238.671005,232.300003,236.100006,38715085,0.0,0.0
2020-11-20,235.300003,236.100006,231.199997,232.699997,24921164,0.0,0.0
2020-11-23,232.199997,232.500000,227.199997,227.800003,29437772,0.0,0.0
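
Once the prices are in a DataFrame, derived columns are one line each. The sketch below rebuilds the last four rows of the output above by hand (`history_sample` is a hypothetical name), so it runs offline. It adds the day-over-day return and the intraday range.

```python
import pandas as pd

# The last four rows of the history output above, rebuilt by hand.
history_sample = pd.DataFrame(
    {
        "High": [233.800003, 238.671005, 236.100006, 232.500000],
        "Low": [225.699997, 232.300003, 231.199997, 227.199997],
        "Close": [233.199997, 236.100006, 232.699997, 227.800003],
    },
    index=pd.to_datetime(["2020-11-18", "2020-11-19", "2020-11-20", "2020-11-23"]),
)
history_sample.index.name = "Date"

# Day-over-day percentage change in the closing price.
# The first row has no previous day, so its return is NaN.
history_sample["Return"] = history_sample["Close"].pct_change()

# Intraday range: highest minus lowest price of the day.
history_sample["Range"] = history_sample["High"] - history_sample["Low"]

print(history_sample[["Close", "Return", "Range"]])
```

The same two lines should work on the full result of `tesco.history(...)`, since it has the same column names.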


In [None]:
tesco.actions

Date,Dividends,Stock Splits
1993-04-19,4.85,0.0
1993-10-04,2.45,0.0
1994-04-25,5.3,0.0
1994-09-26,2.7,0.0
1995-04-24,5.9,0.0
1995-09-25,3.05,0.0
1996-04-22,6.55,0.0
1996-09-23,3.25,0.0
1997-04-14,7.1,0.0
1997-09-22,3.55,0.0
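
The dividends above come twice a year, so a natural next step is to total them per calendar year. This is a minimal sketch in plain Python using the ten rows shown above; on the real data, a pandas groupby on the year of the index would likely do the same job.

```python
from collections import defaultdict

# (date, dividend) pairs copied from the tesco.actions output above.
dividends = [
    ("1993-04-19", 4.85), ("1993-10-04", 2.45),
    ("1994-04-25", 5.3), ("1994-09-26", 2.7),
    ("1995-04-24", 5.9), ("1995-09-25", 3.05),
    ("1996-04-22", 6.55), ("1996-09-23", 3.25),
    ("1997-04-14", 7.1), ("1997-09-22", 3.55),
]

per_year = defaultdict(float)
for date, amount in dividends:
    year = int(date[:4])  # the dates are ISO formatted, so the year is the first four characters
    per_year[year] += amount

# Round to two decimals to hide floating point noise from the additions.
totals = {year: round(total, 2) for year, total in sorted(per_year.items())}
print(totals)
```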


Analysts also frequently give recommendations on whether to buy, sell or hold a stock. These can be found with the recommendations attribute. If there are no current recommendations, it will not return anything.

In [None]:
tesco.recommendations


There are many other useful attributes we can use. A quick overview is given below.

In [None]:


# show actions (dividends, splits)
tesco.actions

# show dividends
tesco.dividends

# show splits
tesco.splits

# show financials
tesco.financials
tesco.quarterly_financials

# show major holders
tesco.major_holders

# show institutional holders
tesco.institutional_holders

# show balance sheet
tesco.balance_sheet
tesco.quarterly_balance_sheet

# show cashflow
tesco.cashflow
tesco.quarterly_cashflow

# show earnings
tesco.earnings
tesco.quarterly_earnings

# show sustainability
tesco.sustainability

# show analysts recommendations
tesco.recommendations

# show next event (earnings, etc)
tesco.calendar

# show ISIN code - *experimental*
# ISIN = International Securities Identification Number
tesco.isin

# show options expirations
tesco.options

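# get option chain for a specific expiration
# tesco.options lists the available expiration dates; it can be empty for tickers without listed options
expirations = tesco.options
if expirations:
    opt = tesco.option_chain(expirations[0])
    # data available via: opt.calls, opt.puts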

IndexError: ignored

# 2 - Scraping some extra data with selenium

The yahoo finance API is great, but it doesn't always give us what we want. If we go to the yahoo finance page for Tesco, we see some things we don't have yet, like the analyst price target, or the number of analysts who recommend buy, sell and hold. But we know web scraping now, so let's use that!

## 2.1 Set up selenium 

In [None]:
# setting up selenium 

!pip install selenium
!apt-get update # to update ubuntu to correctly run apt install
!apt install chromium-chromedriver
# copy the driver into the python path
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')


from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
# we use headless if we don't want to open a browser window that shows us what the driver is doing
# in colab, it will only work if you're in headless
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# in selenium 3.8 and later, the options are passed with options= (chrome_options= is deprecated)
driver = webdriver.Chrome('chromedriver', options=chrome_options)

# we then use driver.get with the URL of the webpage we want to open

# driver.get("https://www.website-url.com")




## 2.2 Access our webpage 

In [None]:
driver.get("https://uk.finance.yahoo.com/quote/TSCO.L?p=TSCO.L&.tsrc=fin-srch")

In [None]:
# we need to agree to cookies before it lets us into their page! 
driver.find_element_by_xpath('//*[@id="consent-page"]/div/div/div/div[2]/div[2]/form/button').click()


In [None]:
import time

In [None]:
# we need to scroll down on the page a bit before we get to the relevant part. This is important, otherwise the right html won't be loaded into our driver

old_position = 0
new_position = None

# this scrolls to the bottom of the page 
while new_position != old_position:
    # Get old scroll position
    old_position = driver.execute_script(
            ("return (window.pageYOffset !== undefined) ?"
            " window.pageYOffset : (document.documentElement ||"
            " document.body.parentNode || document.body).scrollTop;"))
    # Sleep and Scroll
    time.sleep(1)
    driver.execute_script((
            "var scrollingElement = (document.scrollingElement ||"
            " document.body);scrollingElement.scrollTop ="
            " scrollingElement.scrollHeight;"))
    # Get new position (the fallback reads .scrollTop so we always compare numbers)
    new_position = driver.execute_script(
            ("return (window.pageYOffset !== undefined) ?"
            " window.pageYOffset : (document.documentElement ||"
            " document.body.parentNode || document.body).scrollTop;"))

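The scrolling loop above follows a general pattern: repeat an action until a measured value stops changing. A minimal sketch of that pattern as a reusable function is given below. The names `scroll_until_stable` and `FakePage` are hypothetical, and the fake page only stands in for the browser so the example runs without selenium. The `max_rounds` cap is an addition that guards against pages that keep loading content forever.

```python
import time


def scroll_until_stable(get_position, scroll_to_bottom, pause=1.0, max_rounds=30):
    """Scroll until the position stops changing, or give up after max_rounds.

    Returns the number of scroll rounds that were performed.
    """
    for rounds in range(1, max_rounds + 1):
        before = get_position()
        scroll_to_bottom()
        time.sleep(pause)  # give the page time to load more content
        if get_position() == before:
            return rounds
    return max_rounds


class FakePage:
    """Stand-in for a browser page that reveals 1000 more pixels per scroll."""

    def __init__(self, height):
        self.height = height
        self.pos = 0

    def position(self):
        return self.pos

    def scroll(self):
        self.pos = min(self.pos + 1000, self.height)


page = FakePage(height=2500)
rounds_needed = scroll_until_stable(page.position, page.scroll, pause=0)
print(rounds_needed, page.position())

# With selenium (not run here), the two callables could be:
#   get_position = lambda: driver.execute_script("return window.pageYOffset;")
#   scroll_to_bottom = lambda: driver.execute_script(
#       "window.scrollTo(0, document.body.scrollHeight);")
```

The last round is the one that detects no movement, so a page that needs three scrolls to reach the bottom reports four rounds.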
In [None]:
# I have checked the HTML and found the XPath for the low analyst price target

low_target = driver.find_element_by_xpath('/html/body/div[1]/div/div/div[1]/div/div[3]/div[2]/div/div/div/div/div/div[10]/div/div/div/section/div/div[2]/div[1]/span[2]')

# see if you can do the same with the high target and the average target 


In [None]:
# by getting the attribute of the inner HTML, we see what it contains, which here is the price 
low_target.get_attribute('innerHTML')

'200.00'

In [None]:
# doing the same for the ESG risk
esg_risk = driver.find_element_by_xpath('/html/body/div[1]/div/div/div[1]/div/div[3]/div[2]/div/div/div/div/div/div[4]/div/div/div/section/div[2]/div/div/div[1]')

In [None]:
esg_risk.get_attribute('innerHTML')

'18.5'
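
Note that the scraped values are strings ('200.00' and '18.5'), not numbers, so they need converting before we can do any maths with them. Below is a minimal sketch of a converter. `to_number` is a hypothetical helper name, and the comma and 'N/A' cases are assumptions about what a page might show, not something observed above.

```python
def to_number(text):
    """Convert scraped text such as '200.00' or '1,234.5' to a float.

    Returns None when the text is not a number.
    """
    cleaned = text.strip().replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


low_target_value = to_number("200.00")  # the low price target scraped above
esg_risk_value = to_number("18.5")      # the ESG risk scraped above
missing_value = to_number("N/A")        # assumed placeholder for missing data

print(low_target_value, esg_risk_value, missing_value)
```

In the notebook you would pass in the scraped text directly, for example `to_number(low_target.get_attribute('innerHTML'))`.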

# 3 - Finnhub

We still couldn't get the recommendations for buy, sell etc. from Yahoo finance because they are stored in a picture. We can use another free API for this, the Finnhub API. The free version has some limits: we are limited to 30 API calls per second, but this won't be a huge issue for us. If you want to explore the American markets, you will probably want to use Finnhub quite a lot. It can give you company news, market news, forex, crypto and much more. A lot of this is limited to the American markets when we are using the free API.


To use the Finnhub API, you will need to create a token. Do this by going to the [Finnhub webpage](https://finnhub.io/)
and clicking *Get free API key*. You will need to create an account; again, feel free to use your citylit account here if you want. This will take you to a page that shows your API key. Copy it - you will need it a few cells further down.

In [None]:
# need to install finnhub 
!pip install finnhub-python


Collecting finnhub-python
  Downloading https://files.pythonhosted.org/packages/5d/a6/905724e1e32abdbe91d121bd0634be20aae532d8bcebc8d41821d6cb7033/finnhub_python-2.3.0-py3-none-any.whl
Installing collected packages: finnhub-python
Successfully installed finnhub-python-2.3.0


In [None]:
#load finnhub

import finnhub

In [None]:
# Set up the client - you need to paste your own API key here!
my_api_key = 'yourkeyhere!'
finnhub_client = finnhub.Client(api_key=my_api_key)

In [None]:
# requests is a package for requesting webpages
import requests

# In this example, we request the stock recommendations for one company.
# The ticker is passed as the symbol parameter. To look up other companies, change the ticker variable.
ticker = 'TSCO.L'
r = requests.get('https://finnhub.io/api/v1/stock/recommendation',
                 params={'symbol': ticker, 'token': my_api_key})
# stop with a clear HTTP error if the request was rejected, e.g. because the API key is not valid
r.raise_for_status()
# the parsed JSON holds the recommendations for certain time periods
print(r.json())

JSONDecodeError: ignored
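
To request recommendations for several companies, it helps to build the link from a ticker variable instead of typing it out each time. The sketch below does this with the standard library only, so it runs without a network connection or an API key. `recommendation_url` is a hypothetical helper name and `YOUR_API_KEY` is a placeholder for your own key.

```python
from urllib.parse import urlencode

FINNHUB_BASE = "https://finnhub.io/api/v1"


def recommendation_url(ticker, token):
    """Build the recommendation link for one ticker."""
    # urlencode escapes any characters that are not allowed in a URL
    query = urlencode({"symbol": ticker, "token": token})
    return f"{FINNHUB_BASE}/stock/recommendation?{query}"


# One link per ticker; paste your own key in place of the placeholder.
urls = {
    ticker: recommendation_url(ticker, "YOUR_API_KEY")
    for ticker in ["TSCO.L", "SBRY.L", "OCDO.L"]
}

for ticker, url in urls.items():
    print(ticker, url)
```

Each link can then be passed to `requests.get`. Remember the rate limit on the free plan if you loop over a long list of tickers.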

You can also get the tickers from different markets using Finnhub. Have a think about what you want to predict. If, for instance, you want to predict the stocks on the London stock exchange, you can find all the relevant tickers as shown below. If you want tickers from other exchanges, have a look at the list of available indices and their abbreviations within Finnhub [here](https://docs.google.com/spreadsheets/d/1I3pBxjfXB056-g_JYf_6o3Rns3BV2kMGG1nCatb91ls/edit#gid=0)

In [None]:
# the exchange is specified in this link as exchange=L
# L is London - if you need others, have a look at the list linked above

r = requests.get('https://finnhub.io/api/v1/stock/symbol?exchange=L&token=' + my_api_key)
df = pd.DataFrame(r.json())


In [None]:
df

# 4 - Decide what you want to predict! 

Now you have access to a lot of cool data. Remember to use the news notebook to combine the news as well. You are free to choose what you want to predict. It could for instance be a specific index, a few specific stocks, or a specific industry. You will also need to choose which time period you want to predict over. This depends on how you would be trading. Would you want to trade every day, every week, month or year? Depending on your answer to this, you can predict the next day price, the average price over a week, the max price over a week, the change in price, and so on. It is worth giving this a lot of thought early on, because this will very much affect your data cleaning and processing.
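
To make the idea of a prediction target concrete, here is a minimal sketch that builds three possible targets from closing prices: the next day price, the next day change, and whether the price goes up. It uses the four closing prices from the Tesco history output in section 1, rebuilt by hand. The column names are hypothetical choices, not a fixed convention.

```python
import pandas as pd

# Closing prices copied from the Tesco history output in section 1.
close = pd.Series(
    [233.199997, 236.100006, 232.699997, 227.800003],
    index=pd.to_datetime(["2020-11-18", "2020-11-19", "2020-11-20", "2020-11-23"]),
    name="Close",
)

targets = pd.DataFrame({"close": close})

# shift(-1) moves every value up one row, so each row sees the NEXT day's close.
targets["next_close"] = targets["close"].shift(-1)

# Relative change from today's close to the next day's close.
targets["next_change"] = targets["next_close"] / targets["close"] - 1

# A yes/no target: does the price go up the next day?
targets["goes_up"] = targets["next_close"] > targets["close"]

# The last row has no next day, so it cannot be used for training.
targets = targets.dropna()

print(targets)
```

Weekly targets follow the same idea with a longer look-ahead. Whatever you choose, make sure the target only uses data from after the row's date and the inputs only use data from before it.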

# 5 - Getting even more data




As we discovered, the Yahoo finance API is a little bit limited in that it only stores about 170 or so articles for each company. I still prefer using yahoo finance because we get the articles by the company ticker, so we are absolutely sure we are getting the right results. However, if you feel the number of articles is limiting you, we can use other APIs. One alternative is datanews, where you get 3000 free API calls a month. This could also potentially be a limit, depending on what you want to do.

In [None]:
! pip install datanews

Collecting datanews
  Downloading https://files.pythonhosted.org/packages/a0/61/17def7528f3d2e07a5df4122f1bea62595bd9e915851dfb3935960442635/datanews-0.0.7-py3-none-any.whl
Installing collected packages: datanews
Successfully installed datanews-0.0.7


The cell below gives us a very basic example of their API. As with Finnhub, you need to sign up and use your own API key below. You can sign up [here](https://datanews.io/).


In [None]:
import datanews
#your key goes here!!  
datanews.api_key = '050085f0cxslgxpvwqkjp3ops'

response = datanews.headlines(q='Beyond meat',language=['en'])
articles = response['hits']
print(articles[0]['title'])

datanews.headlines

Beyond Meat Introduces Beyond Pork in China - TheStreet


<function datanews.api_resources.news.headlines>

In [None]:
articles

[{'authors': ['Dan Weil'],
  'content': 'Beyond Meat, the plant-based meat company that is riding the health-food boom, said it was selling a pork product in China.\n\nBeyond Meat (BYND) - Get Report, the plant-based meat company that is riding the health-food boom, said on Wednesday it’s sel ... [+1947 chars]',
  'country': 'us',
  'description': 'Beyond Meat Introduces Beyond Pork in China  TheStreet',
  'imageUrl': 'https://www.thestreet.com/.image/t_share/MTczNjU3NzgxNTg3NDIxMTU4/beyond-meat.jpg',
  'language': 'en',
  'pubDate': '2020-11-18T14:15:00+00:00',
  'source': 'thestreet.com',
  'title': 'Beyond Meat Introduces Beyond Pork in China - TheStreet',
  'url': 'https://www.thestreet.com/investing/beyond-meat-unveils-beyond-pork-in-china'},
 {'authors': ['Chris DeMuth Jr.'],
  'content': 'New flavor, new cooking instructions, and new guidance (something must have been terribly wrong with the previous flavor, cooking instructions, and guidance). The new guidance is.. no guidance 

What we will likely do, though, is more complex searches, specifically using a specified time frame. The documentation for datanews uses curl. curl is a tool used on the command line or in scripts to transfer data. In Python, we usually use requests. I have translated the datanews curl commands into Python, so you can alter the example below as you see fit.

You can add other parameters in the same way. You can look at the extra parameters available [here](https://datanews.io/docs/news).

In [None]:
import requests

#YOUR OWN API KEY WILL GO HERE 
headers = {
    'x-api-key': '050085f0cxslgxpvwqkjp3ops',
}

params = (
    #the query is what we are searching for 
    ('q', 'Sainsbury'),
    ('from', '2016-07-01'),
    ('to', '2019-12-01'),
    ('language', 'en')
)



response = requests.get('http://api.datanews.io/v1/news', headers=headers, params=params)


In [None]:
# the response body is in JSON (JavaScript Object Notation) format
# response.json() turns it into a dictionary; its 'hits' key holds a list of dictionaries, one per article
response.json()

{'hits': [{'authors': ['Bethany Alice Papworth'],
   'content': 'THE busiest time of the year to bag a supermarket delivery slot is at Christmas.\n\nTo avoid disappointment, supermarkets have been letting shoppers book delivery slots from the start of November.\n\nWhen these delivery slots do become available, they ca ... [+3542 chars]',
   'country': 'gb',
   'description': 'THE busiest time of the year to bag a supermarket delivery slot is at Christmas. To avoid disappointment, supermarkets are letting you book delivery slots as early as the start of November.',
   'imageUrl': 'https://www.thescottishsun.co.uk/wp-content/uploads/sites/2/2019/11/D01BX4jpg-JS341170895-1.jpg?strip=all&quality=100&w=1064&h=708&crop=1',
   'language': 'en',
   'pubDate': '2019-11-11T12:12:55+00:00',
   'source': 'thescottishsun.co.uk',
   'title': "Christmas delivery slots 2019 - when can to book Tesco, Asda, Morrisons, Sainsbury's, Waitrose, Ocado and Iceland",
   'url': 'https://www.thescottishsun.co.uk

We now want to get the urls from the dictionaries. 

In [None]:
links = []
dates = []

# all of the data is stored in the key hits 
for dictionary in response.json()['hits']: 
    # for each hit, we get the value stored in the key url 
    url = dictionary['url']
    # and we then append it to the links we found
    links.append(url)
    # we probably want to create a list of all the dates as well
    dates.append(dictionary['pubDate'])

#response.json()['hits']

links

['https://www.ccn.com/beyond-meat-tanks-13-percent-because-of-this-terrible-news/',
 'https://www.timesofisrael.com/poland-will-end-its-kosher-and-halal-meat-export-industry-in-2025/',
 'https://www.ibtimes.com/wheres-beef-mcdonalds-test-new-beyond-meat-burger-canada-2834005',
 'https://www.rte.ie/archives/2019/0930/1079179-mister-meat-loaf/',
 'https://www.marketwatch.com/story/what-nutrition-experts-are-saying-about-imitation-meats-in-your-diet-2019-11-25?mod=home-page',
 'https://food.ndtv.com/video-andhra-crab-meat-masala-recipe-528213',
 'https://www.richmond-news.com/standout/buying-meat-the-old-fashioned-way-ensures-taste-and-quality-1.23931400',
 'https://www.wired.com/story/the-meat-allergy-tick-also-carries-a-mystery-killer-virus/',
 'https://www.ccn.com/no-mans-sky-beyond-ride-aliens/',
 'https://www.eater.com/2019/10/11/20909222/impossible-meat-future-war-ceo']

Now say that we had a list of companies that we want to get news for. We need to find a good way to append that data. There are
several ways of doing it. I think pandas is probably the easiest to read. The downside is that if you have a huge panda (DataFrame), it might get slow as the panda grows fat.

## 5.2 - Appending data to a panda

Usually I like to write a high-level function. It makes things easy and reusable. Also note a very important point here: when we start to append lots of data, don't append straight to a panda. Append to a list, and then make a panda from it. It is a lot more efficient!!

In [None]:

def getNews_dataNews(tickers, searchTerms): 
    
    """ Inputs: list of tickers, list of search terms
        Outputs: a data frame with the ticker, date, title and urls 
    """ 

    results = []
    # zip zips the lists together like a zipper 
    for ticker, searchTerm in zip(tickers, searchTerms):

        headers = {
            # YOUR API KEY HERE
            'x-api-key': '',
        }   

        params = (
            #the query is what we are searching for 
            ('q', searchTerm),
            ('from', '2016-07-01'),
            ('to', '2019-12-01'),
            ('language', 'en')
        )


        response = requests.get('http://api.datanews.io/v1/news', headers=headers, params=params)

        # all of the data is stored in the key hits 
        for dictionary in response.json()['hits']: 
            results.append([ticker, dictionary['pubDate'],  dictionary['title'], dictionary['url']])

    final_result_df = pd.DataFrame(results, columns=['ticker', 'pubDate', 'title', 'url'])
    return final_result_df

In [None]:
# time to reap the benefits of our high quality function

tickerList = ['TSCO.L', 'SBRY.L', 'OCDO.L']
searchTermList = ['Tesco', 'Sainsbury', 'Ocado']

superMarketNews = getNews_dataNews(tickerList, searchTermList)


In [None]:
superMarketNews

,ticker,pubDate,title,url
0,TSCO.L,2018-09-09T00:00:00+00:00,Tesco deal bodes ill for business,https://www.bangkokpost.com/business/2018099/t...
1,TSCO.L,2019-08-14T07:28:30+00:00,Triple Credit - TV and Broadband ISP NOW TV Pa...,https://www.ispreview.co.uk/index.php/2019/08/...
2,TSCO.L,2019-08-06T04:47:00+00:00,"Fresh redundancies as Tesco cuts 4,500 jobs at...",//www.theweek.co.uk/102629/fresh-redundancies-...
3,TSCO.L,2018-05-22T14:36:31+00:00,Tesco Direct is closing next month as it slash...,https://www.thescottishsun.co.uk/money/2680332...
4,TSCO.L,2019-11-11T12:12:55+00:00,Christmas delivery slots 2019 - when can to bo...,https://www.thescottishsun.co.uk/money/4941129...
5,TSCO.L,2016-08-23T00:00:00+00:00,Limit set for CP to appeal deal ruling,https://www.bangkokpost.com/business/2016823/l...
6,TSCO.L,2018-09-26T05:30:48+00:00,FCA fine Tesco Bank over cyber breach,https://www.fintechinshorts.com/fca-fine-tesco...
7,TSCO.L,2018-09-26T05:30:48+00:00,FCA fine Tesco Bank over cyber breach,https://www.fintechinshorts.com/fca-fine-tesco...
8,TSCO.L,2019-03-09T19:07:09+00:00,PRWeek UK Power Book 2019: New Tesco comms chi...,https://www.prweek.com/article/1578449/prweek-...
9,TSCO.L,2018-10-10T10:23:05.900000+00:00,Tesco Is Selling Cocoa-Cola Cinnamon For The F...,https://www.bustle.com/life/where-to-buy-cinna...
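
Two things stand out in the output above: rows 6 and 7 are identical, and the pubDate values are text in two slightly different formats (row 9 has fractional seconds). Below is a minimal sketch of cleaning both in plain Python, using three rows copied from the output. The title of row 9 is truncated exactly as displayed. On the full DataFrame, `superMarketNews.drop_duplicates()` and `pd.to_datetime` would likely be the pandas way to do the same.

```python
from datetime import datetime

# (ticker, pubDate, title) copied from rows 6, 7 and 9 of the output above.
rows = [
    ("TSCO.L", "2018-09-26T05:30:48+00:00", "FCA fine Tesco Bank over cyber breach"),
    ("TSCO.L", "2018-09-26T05:30:48+00:00", "FCA fine Tesco Bank over cyber breach"),
    ("TSCO.L", "2018-10-10T10:23:05.900000+00:00", "Tesco Is Selling Cocoa-Cola Cinnamon For The F..."),
]

seen = set()
unique_rows = []
for ticker, pub_date, title in rows:
    key = (ticker, pub_date, title)
    if key in seen:
        continue  # skip articles we have already stored
    seen.add(key)
    # fromisoformat reads timestamps both with and without fractional seconds
    unique_rows.append((ticker, datetime.fromisoformat(pub_date), title))

# Real datetime objects sort correctly by time.
unique_rows.sort(key=lambda row: row[1])

for ticker, published, title in unique_rows:
    print(ticker, published.date(), title)
```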
