### Topics

- Requests and getting webpages
- Tidying HTML
- Web scraping with BeautifulSoup
    - Traversal
- Simple parallelization
- Storing snapshots using Python's dill
- Exploration
- Stock selection using simple filters

### Imports

In [2]:
import numpy as np # linear algebra
import pandas as pd # pandas for dataframe based data processing and CSV file I/O

In [10]:
import requests # for http requests
from bs4 import BeautifulSoup # for html parsing and scraping
from fastnumbers import isfloat 
from fastnumbers import fast_float
from multiprocessing.dummy import Pool as ThreadPool 
import bs4

import matplotlib.pyplot as plt
import seaborn as sns
import json
from tidylib import tidy_document # for tidying incorrect html

sns.set_style('whitegrid')
%matplotlib inline
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"



In [4]:
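def ffloat(string):
    """Parse text such as '63,783.84' or '4750.00%' into a float; np.nan on failure."""
    if string is None:
        return np.nan
    # Values that are already numeric pass through unchanged.
    if isinstance(string, (int, float, np.integer, np.floating)):
        return string
    # Keep the first space-separated token, then drop thousands separators and '%'.
    return fast_float(string.split(" ")[0].replace(',','').replace('%',''),
                      default=np.nan)

def ffloat_list(string_list):
    """Apply ffloat to every item of a list."""
    return list(map(ffloat,string_list))

def remove_multiple_spaces(string):
    """Collapse runs of whitespace into single spaces; non-strings are returned as-is."""
    if isinstance(string, str):
        return ' '.join(string.split())
    return string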
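
To make the cleaning rule in `ffloat` concrete, here is a minimal standard-library sketch of the same idea. The name `to_float` is hypothetical, and it uses the built-in `float` with `try`/`except` in place of `fast_float`, so treat it as an illustration rather than a drop-in replacement.

```python
import math

def to_float(value):
    """Hypothetical stand-in for ffloat that uses float() instead of fast_float."""
    if value is None:
        return math.nan
    if isinstance(value, (int, float)):
        return value
    # Same cleaning as ffloat: first space-separated token, no commas, no '%'.
    token = value.split(" ")[0].replace(",", "").replace("%", "")
    try:
        return float(token)
    except ValueError:
        return math.nan

print(to_float("63,783.84"))       # thousands separator removed
print(to_float("4750.00%"))        # percent sign removed
print(to_float("17.27 (approx)"))  # only the first token is parsed
print(to_float("n/a"))             # unparseable text becomes nan
```

Note that a string with a leading space yields an empty first token, so it parses to `nan` under this rule.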

### Requests and Getting Content
- Status code
- Getting Text Content
- Getting JSON Content
- Error Handling

#### Getting a page and its status

In [13]:
response = requests.get("http://www.example.com/", timeout=240)
response.status_code
response.content

200

b'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 50px;\n        background-color: #fff;\n        border-radius: 1em;\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        body {\n            background-color: #fff;\n        }\n        div {\n            width: auto;\n            margin: 0 auto;\n            border-radius: 0;\n            padding: 1em;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\

#### Getting JSON and parsing it

In [14]:
url = "https://jsonplaceholder.typicode.com/posts/1"
response = requests.get(url, timeout=240)
response.status_code
response.json()

content = response.json()
content.keys()

200

{'body': 'quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto',
 'id': 1,
 'title': 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit',
 'userId': 1}

dict_keys(['userId', 'id', 'title', 'body'])

#### Checking Status Codes for errors

In [4]:
def request_with_check(url):
    page_response = requests.get(url, timeout=240)
    status = page_response.status_code
    if status>299:
        raise AssertionError("page content not found, status: %s"%status)
    return page_response

In [162]:
request_with_check("https://www.google.co.in/mycustom404page")

AssertionError: page content not found, status: 404

In [163]:
request_with_check("https://www.google.co.in/")

<Response [200]>
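
As an alternative to checking the status code by hand, `requests` provides `Response.raise_for_status()`, which raises `requests.exceptions.HTTPError` for 4xx and 5xx responses. The sketch below assumes that behaviour is acceptable in place of the `status>299` check above. `fetch_or_raise` is a hypothetical name, and the hand-built `Response` objects exist only so the example runs without network access.

```python
import requests

def fetch_or_raise(url, timeout=240):
    """Hypothetical variant of request_with_check built on raise_for_status()."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.exceptions.HTTPError on 4xx/5xx
    return response

# Offline demonstration: build Response objects by hand instead of hitting the network.
not_found = requests.Response()
not_found.status_code = 404
not_found.url = "https://example.invalid/missing"  # made-up URL for illustration

try:
    not_found.raise_for_status()
    raised = False
except requests.exceptions.HTTPError as err:
    raised = True
    print("request failed:", err)

ok = requests.Response()
ok.status_code = 200
ok_result = ok.raise_for_status()  # returns None when the status is fine
```

Catching `HTTPError` lets the caller distinguish HTTP failures from other exceptions, which a bare `AssertionError` does not.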

### Parsing and Traversal

- Getting body of a page and rendering it
- Inspecting Elements in Chrome
- Ways of getting element contents
- Traversing children
- Getting Text and numbers
- Tables

#### Rendering HTML in Jupyter notebook

In [5]:
from IPython.display import HTML
HTML("<b>Rendered HTML</b>")


Here we download the page, parse it with BeautifulSoup, and find the `h1` heading and the `div` whose id is `content_full`. We then render each of them with `HTML`.

In [19]:
page_response = requests.get("https://www.moneycontrol.com/india/stockpricequote/auto-2-3-wheelers/heromotocorp/HHM", timeout=240)
page_content = BeautifulSoup(page_response.content, "html.parser")
HTML(str(page_content.find("h1")))

HTML(str(page_content.find("div",attrs={'id':"content_full"})))

0,1
5-Day,16400

0,1
10-Day,24652

0,1
30-Day,33712


BUY,BUY.1,SELL,SELL.1
QTY,PRICE,PRICE,QTY
39,3300.00,3308.90,10
10,3293.55,3311.65,2
15,3292.10,3314.90,7
1,3291.00,3315.90,10
2,3290.00,3320.00,27
1631,Total,2087,Total

0,1
5-Day,314629

0,1
10-Day,428169

0,1
30-Day,546298



#### Finding elements by attributes

In [6]:
page_response = requests.get("https://www.moneycontrol.com/india/stockpricequote/auto-2-3-wheelers/heromotocorp/HHM", timeout=240)
page_content = BeautifulSoup(page_response.content, "html.parser")
price_div = page_content.find("div",attrs={"id":'b_changetext'})
HTML(str(price_div))

```python
page_content.find_all("p")
page_content.find("p",attrs={"class":"my-class"})
page_content.find("p",attrs={"id":"my-id"})
page_content.find_all("p",attrs={"id":"my-id"})
```

`.find` returns the first occurrence, while `.find_all` returns every match as a list.
You can also check the other `find_*` methods (`find_next`, `find_previous`, etc.).

Once you have an element, use `.text` to get its textual content. For example:

```python
elem = page_content.find("p",attrs={"class":"my-class"})
text = elem.text
```
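
The calls above can be tried on a small inline document, so the result does not depend on a live page. This is a minimal sketch: the HTML fragment, its class names, and its ids are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up fragment; the tag names, classes and ids are illustrative only.
sample = """
<div id="quote">
  <p class="label">Price</p>
  <p class="value" id="last-price">3,319.40</p>
  <p class="value">+1.41%</p>
</div>
"""
soup = BeautifulSoup(sample, "html.parser")

first_p = soup.find("p")                                    # first <p> only
all_values = soup.find_all("p", attrs={"class": "value"})   # every match
by_id = soup.find("p", attrs={"id": "last-price"})          # lookup by id
missing = soup.find("p", attrs={"id": "no-such-id"})        # None when nothing matches

print(first_p.text)
print([p.text for p in all_values])
print(by_id.text)
print(missing)
```

Because `find` returns `None` when nothing matches, it is worth checking the result before calling `.text` on it.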

#### Getting the children of an element as a list

In [7]:
list(price_div.children)

[' ', <span class="gr_15 uparw_pc"><strong>46.25</strong></span>, ' (+1.41%)']

As you can see, the first child is a whitespace-only string and not an actual HTML element, so we filter such items out. The next function does that, and it also skips HTML comments.

In [37]:

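def get_children(html_content):
    """Return child nodes, skipping comments and whitespace-only strings."""
    children = list()
    for item in html_content.children:
        if isinstance(item, bs4.element.Comment):
            continue
        # Keep tags, and keep strings only if they contain non-whitespace text.
        if isinstance(item, bs4.element.Tag) or len(str(item).strip())>0:
            children.append(item)
    return children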



In [38]:
get_children(price_div)

[<span class="gr_15 uparw_pc"><strong>46.25</strong></span>, ' (+1.41%)']
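
Since the live page may change, the same filtering can be reproduced on an inline fragment. The sketch below is an assumption-laden illustration: the fragment is hand-written to resemble the price `div` above, with an HTML comment added to show why comments are skipped.

```python
import bs4
from bs4 import BeautifulSoup

# Hand-written fragment resembling the price div; the comment is added for illustration.
snippet = ('<div id="b_changetext"> <!-- change -->'
           '<span class="gr_15 uparw_pc"><strong>46.25</strong></span> (+1.41%)</div>')
div = BeautifulSoup(snippet, "html.parser").find("div")

raw = list(div.children)
# Same rule as get_children: drop comments, keep tags and non-blank strings.
kept = [
    child for child in raw
    if not isinstance(child, bs4.element.Comment)
    and (isinstance(child, bs4.element.Tag) or str(child).strip())
]

print([type(child).__name__ for child in raw])
print(kept)
```

`Comment` is a subclass of `NavigableString`, which is why it has to be excluded before the non-blank string test.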

#### Parsing HTML tables and table like structures

In [39]:
html = '''
<table>
    <tr>
        <td>Month</td>
        <td>Price</td>
    </tr>
    <tr>
        <td>July</td>
        <td>2</td>
    </tr>
    <tr>
        <td>August</td>
        <td>4</td>
    </tr>
    <tr>
        <td>September</td>
        <td>3</td>
    </tr>
    <tr>
        <td>October</td>
        <td>2</td>
    </tr>
</table>

'''

HTML(html)

0,1
Month,Price
July,2
August,4
September,3
October,2


In [40]:




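def get_table_simple(table,is_table_tag=True):
    """Return the rows of a <table> (or of a table-like element) as lists of cell text."""
    # For a real table, rows are the <tr> tags; otherwise each child is treated as a row.
    elems = table.find_all('tr') if is_table_tag else get_children(table)
    table_data = list()
    for row in elems:
        row_data = list()
        row_elems = get_children(row)
        for elem in row_elems:
            text = elem.text.strip().replace("\n","")
            text = remove_multiple_spaces(text)
            if len(text)==0:
                continue  # skip cells with no text
            row_data.append(text)
        table_data.append(row_data)
    return table_data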


        

In [41]:
html = BeautifulSoup(html,"html.parser")
get_table_simple(html)

[['Month', 'Price'],
 ['July', '2'],
 ['August', '4'],
 ['September', '3'],
 ['October', '2']]
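
The list of rows is easy to reshape once the first row is treated as a header. A minimal sketch using only the standard library follows; the names `records` and `prices` are illustrative, and the rows are copied from the output above.

```python
# Rows as returned by get_table_simple for the Month/Price table.
rows = [['Month', 'Price'],
        ['July', '2'],
        ['August', '4'],
        ['September', '3'],
        ['October', '2']]

# Split the header from the data rows, then pair each cell with its column name.
header, *body = rows
records = [dict(zip(header, row)) for row in body]

# Cell text is always a string, so convert numeric columns explicitly.
prices = {rec['Month']: int(rec['Price']) for rec in records}

print(records[0])
print(prices)
```

A list of dicts like `records` can also be passed straight to `pd.DataFrame` if a dataframe is more convenient.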

In [42]:
html = '''
<html>
<body>
<div id="table" class="FL" style="width:210px; padding-right:10px">
    <div class="PA7 brdb">
        <div class="FL gL_10 UC">MARKET CAP (Rs Cr)</div>
        <div class="FR gD_12">63,783.84</div>
        <div class="CL"></div>
    </div>
    <div class="PA7 brdb">
        <div class="FL gL_10 UC">P/E</div>
        <div class="FR gD_12">17.27</div>
        <div class="CL"></div>
    </div>
    <div class="PA7 brdb">
        <div class="FL gL_10 UC">BOOK VALUE (Rs)</div>
        <div class="FR gD_12">589.29</div>
        <div class="CL"></div>
    </div>
    <div class="PA7 brdb">
        <div class="FL gL_10 UC">DIV (%)</div>
        <div class="FR gD_12">4750.00%</div>
        <div class="CL"></div>
    </div>
    <div class="PA7 brdb">
        <div class="FL gL_10 UC">Market Lot</div>
        <div class="FR gD_12">1</div>
        <div class="CL"></div>
    </div>
    <div class="PA7 brdb">
        <div class="FL gL_10 UC">INDUSTRY P/E</div>
        <div class="FR gD_12">19.99</div>
        <div class="CL"></div>
    </div>
</div>
</body>
</html>
'''
HTML(html)

In [44]:
content = BeautifulSoup(html,"html.parser")
get_table_simple(content.find("div",attrs={"id":"table"}),is_table_tag=False)

[['MARKET CAP (Rs Cr)', '63,783.84'],
 ['P/E', '17.27'],
 ['BOOK VALUE (Rs)', '589.29'],
 ['DIV (%)', '4750.00%'],
 ['Market Lot', '1'],
 ['INDUSTRY P/E', '19.99']]

### Library function for fetching scrip details

In [63]:

def get_scrip_info(url):
    """Scrape price, volume, yearly high/low and key ratios from a stock quote page."""
    key_val_pairs = {}

    page_response = requests.get(url, timeout=240)
    page_content = BeautifulSoup(page_response.content, "html.parser")
    price = ffloat(page_content.find('div',attrs={'id':'Nse_Prc_tick_div'}).text)
    name = page_content.find('h1',attrs={'class':'company_name'}).text

    yearly_high = page_content.find('span',attrs={'id':'n_52high'}).text.strip()
    yearly_low = page_content.find('span',attrs={'id':'n_52low'}).text.strip()
    html_data_content = page_content.find('div', attrs={'id': 'mktdet_1'})

    # The first two children of this div are parsed as table-like blocks of label/value rows.
    petable = get_table_simple(get_children(html_data_content)[0],is_table_tag=False)
    pbtable = get_table_simple(get_children(html_data_content)[1],is_table_tag=False)
    volume = ffloat(page_content.find('span',attrs={'id':'nse_volume'}).text)

    data_table = list()
    data_table.extend(petable)
    data_table.extend(pbtable)

    # Map each label to its numeric value; rows without exactly two cells map to None.
    collector = {row[0]:ffloat(row[1]) if len(row)==2 else None for row in data_table}

    key_val_pairs["pe"] = collector['P/E']
    key_val_pairs["book_value"] = collector['BOOK VALUE (Rs)']
    key_val_pairs["deliverables"] = collector['DELIVERABLES (%)']
    # The market-cap label may carry a '**' prefix, so both spellings are checked.
    if 'MARKET CAP (Rs Cr)' in collector:
        key_val_pairs["market_cap"] = collector['MARKET CAP (Rs Cr)']
    elif '**MARKET CAP (Rs Cr)' in collector:
        key_val_pairs["market_cap"] = collector['**MARKET CAP (Rs Cr)']
    key_val_pairs["pb"] = collector['PRICE/BOOK']
    key_val_pairs['price'] = price
    key_val_pairs['volume'] = volume
    key_val_pairs["yearly_low"] = ffloat(yearly_low)
    key_val_pairs["yearly_high"] = ffloat(yearly_high)
    return key_val_pairs


In [64]:
get_scrip_info("https://www.moneycontrol.com/india/stockpricequote/auto-2-3-wheelers/heromotocorp/HHM")

{'book_value': 589.29,
 'deliverables': 45.79,
 'market_cap': 66332.16,
 'pb': 5.64,
 'pe': 17.96,
 'price': 3319.4,
 'volume': 356538.0,
 'yearly_high': 4091.95,
 'yearly_low': 3033.75}
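
`get_scrip_info` calls `.text` directly on the result of `find`, which raises `AttributeError` when the page lacks the expected element, because `find` returns `None` in that case. One possible guard is sketched below. `text_or_none` is a hypothetical helper, and the inline HTML is a made-up fragment that reuses two element ids from the function above.

```python
from bs4 import BeautifulSoup

def text_or_none(soup, tag, attrs):
    """Hypothetical helper: stripped text of the first match, or None if absent."""
    elem = soup.find(tag, attrs=attrs)
    return elem.text.strip() if elem is not None else None

# Made-up fragment for illustration; the real page layout may differ.
fragment = '<div id="Nse_Prc_tick_div"> 3319.40 </div>'
soup = BeautifulSoup(fragment, "html.parser")

price_text = text_or_none(soup, "div", {"id": "Nse_Prc_tick_div"})  # element present
volume_text = text_or_none(soup, "span", {"id": "nse_volume"})      # element absent

print(price_text)
print(volume_text)
```

Returning `None` pairs well with `ffloat`, which already maps `None` to `np.nan`, so a missing field would surface as a missing value instead of an exception.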

### References

- https://ntguardian.wordpress.com/2016/09/19/introduction-stock-market-data-python-1/
- https://mapattack.wordpress.com/2017/02/14/python-for-stocks-2/

More involved articles:

- https://ntguardian.wordpress.com/2018/07/17/stock-data-analysis-python-v2/
- https://nextjournal.com/hisham/stock-market
 