## Web scraping

- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In IPython, you can install it with:

```ipython
!pip install beautifulsoup4
```

In [1]:
import json
import requests
from bs4 import BeautifulSoup

## Simple HTML

Here is an example HTML file. How do we extract the items Coffee, Tea, and Coke?

```html
<html>
    <body>
        <h1>Welcome to My Website</h1>
        <ul>
            <li>Coffee</li>
            <li>Tea</li>
            <li>Coke</li>
        </ul>
    </body>
</html>
```

Imagine we want to get the items in the list. The ul tag indicates an unordered list. We’ll then want to get each list item (list items are in li tags). Specifically, we’ll want to extract the text inside each list item. To do this, we’ll use the following code, where `example` is the HTML of the page.



```python
soup = BeautifulSoup(example, 'html.parser')
items = soup.find("ul").find_all("li")
```

You’ll notice that items is a list of three items, since there are three list items in the unordered list. You’ll also see that items[0].text will give you the text of the first list item!

In [28]:
example = """
<html>
    <body>
        <h1>Welcome to My Website</h1>
        <ul>
            <li>Coffee</li>
            <li test = 'heheda'>Tea</li>
            <li favorate = 'hehe'>
                <span style="color:blue">Coke</span>
            </li>
        </ul>
    </body>
</html>
"""

soup = BeautifulSoup(example, 'html.parser')
items = soup.find("ul").find_all("li")

In [29]:
str(items[2].text)

'\nCoke\n'

In [30]:
items[2].text.strip()  # strip() removes all surrounding whitespace, not only newlines

'Coke'

In [31]:
it = []
for item in items:
    it.append(item.text.strip())  # strip() also drops the indentation around nested tags

In [32]:
it

['Coffee', 'Tea', 'Coke']
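The loop above can also be written as a list comprehension. The sketch below repeats the same `example` markup so that it runs on its own. It uses `get_text(strip=True)` to trim whitespace and `Tag.get` to read an attribute such as `test`; the variable names `names` and `test_attrs` are only illustrative.

```python
from bs4 import BeautifulSoup

example = """
<html>
    <body>
        <h1>Welcome to My Website</h1>
        <ul>
            <li>Coffee</li>
            <li test = 'heheda'>Tea</li>
            <li favorate = 'hehe'>
                <span style="color:blue">Coke</span>
            </li>
        </ul>
    </body>
</html>
"""

soup = BeautifulSoup(example, "html.parser")
items = soup.find("ul").find_all("li")

# get_text(strip=True) strips whitespace around each text fragment
names = [item.get_text(strip=True) for item in items]

# Tag.get returns the attribute value, or None when the attribute is absent
test_attrs = [item.get("test") for item in items]

print(names)
print(test_attrs)
```

Only the second list item carries a `test` attribute, so the other entries of `test_attrs` are `None`.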

## Course website scraping

| Time         | Food                                   |   Calorie |
| :----------- | :------------------------------------- | --------: |
| breakfast    | egg, milk, cereal, avocado             |       600 |
| lunch        | chicken breast, brown rice, lettuce    |       700 |
| dinner       | steak, sweet potato, broccoli          |       800 |


The page [https://mlqmlq.github.io/STAT628/pages/d8.html](https://mlqmlq.github.io/STAT628/pages/d8.html) contains the table shown above. How do we get all the foods and calories from this table?

Below is the source code for this table. You can view it with Google Chrome's `Inspect` or `View Page Source`.

```html
<table rules="groups">
  <thead>
    <tr>
      <th style="text-align: left">Time</th>
      <th style="text-align: left">Food</th>
      <th style="text-align: right">Calorie</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">breakfast</td>
      <td style="text-align: left">egg, milk, cereal, avocado</td>
      <td style="text-align: right">600</td>
    </tr>
    <tr>
      <td style="text-align: left">lunch</td>
      <td style="text-align: left">chicken breast, brown rice, lettuce</td>
      <td style="text-align: right">700</td>
    </tr>
    <tr>
      <td style="text-align: left">dinner</td>
      <td style="text-align: left">steak, sweet potato, broccoli</td>
      <td style="text-align: right">800</td>
    </tr>
  </tbody>
  <tbody>
    <tr>
      <td style="text-align: left"> </td>
      <td style="text-align: left"> </td>
      <td style="text-align: right"> </td>
    </tr>
  </tbody>
</table>
```

We use `requests` to fetch the webpage from Python, and `BeautifulSoup` to parse the HTML text for us.

Sometimes a request lands on a [404 page](http://mlqmlq.github.io/stat628/pages/notes0309.html). We can check the response's `status_code` attribute to detect this; see the [list of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).

In [33]:
url = "https://mlqmlq.github.io/STAT628/pages/d8.html"
req_page = requests.get(url)

In [34]:
req_page.status_code ## Success

200
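To make the status check concrete, here is a minimal sketch that uses only the standard library's `http.HTTPStatus` to label a code and to test for success. The helper names `describe_status` and `is_success` are hypothetical and not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '404 Not Found' for an HTTP status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # Codes that are not registered in http.HTTPStatus
        return f"{code} Unknown"


def is_success(code):
    """Codes in the 2xx range mean the request succeeded."""
    return 200 <= code < 300


for code in (200, 404):
    outcome = "success" if is_success(code) else "failure"
    print(describe_status(code), "->", outcome)
```

With the response above, `is_success(req_page.status_code)` would be `True`. `requests` also provides `req_page.ok` and `req_page.raise_for_status()` for the same purpose.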

In [35]:
req_page.content ## Raw contents

b'\n<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8">\n    <title>Discussion 8</title>\n    <meta name="author" content="Linquan Ma">\n\n    <!-- Enable responsive viewport -->\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n\n    <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->\n    <!--[if lt IE 9]>\n      <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>\n    <![endif]-->\n\n    <!-- Le styles -->\n    <link href="http://mlqmlq.github.io/STAT628/assets/themes/twitter/bootstrap/css/bootstrap.2.2.2.min.css" rel="stylesheet">\n    <link href="http://mlqmlq.github.io/STAT628/assets/themes/twitter/css/style.css?body=1" rel="stylesheet" type="text/css" media="all">\n    <link href="http://mlqmlq.github.io/STAT628/assets/themes/twitter/css/main.css" rel="stylesheet" type="text/css" media="all">\n\n    <!-- Le fav and touch icons -->\n\n    <!-- atom & rss feed -->\n    <link href="http://mlqmlq.github.io/

In [36]:
page_content = req_page.content
page = BeautifulSoup(page_content, 'html.parser')  # Parse the HTML with BeautifulSoup so the page is easy to navigate
page


<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Discussion 8</title>
<meta content="Linquan Ma" name="author"/>
<!-- Enable responsive viewport -->
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
      <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
    <![endif]-->
<!-- Le styles -->
<link href="http://mlqmlq.github.io/STAT628/assets/themes/twitter/bootstrap/css/bootstrap.2.2.2.min.css" rel="stylesheet"/>
<link href="http://mlqmlq.github.io/STAT628/assets/themes/twitter/css/style.css?body=1" media="all" rel="stylesheet" type="text/css"/>
<link href="http://mlqmlq.github.io/STAT628/assets/themes/twitter/css/main.css" media="all" rel="stylesheet" type="text/css"/>
<!-- Le fav and touch icons -->
<!-- atom & rss feed -->
<link href="http://mlqmlq.github.io/STAT628nil" rel="alternate" title="Sitewide ATOM Feed" type="application/atom+xml"

In [37]:
page.find("table")

<table rules="groups">
<thead>
<tr>
<th style="text-align: left">Time</th>
<th style="text-align: left">Food</th>
<th style="text-align: right">Calorie</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">breakfast</td>
<td style="text-align: left">egg, milk, cereal, avocado</td>
<td style="text-align: right">600</td>
</tr>
<tr>
<td style="text-align: left">lunch</td>
<td style="text-align: left">chicken breast, brown rice, lettuce</td>
<td style="text-align: right">700</td>
</tr>
<tr>
<td style="text-align: left">dinner</td>
<td style="text-align: left">steak, sweet potato, broccoli</td>
<td style="text-align: right">800</td>
</tr>
</tbody>
<tbody>
<tr>
<td style="text-align: left"> </td>
<td style="text-align: left"> </td>
<td style="text-align: right"> </td>
</tr>
</tbody>
</table>

In [38]:
page.find_all("table")

[<table rules="groups">
 <thead>
 <tr>
 <th style="text-align: left">Time</th>
 <th style="text-align: left">Food</th>
 <th style="text-align: right">Calorie</th>
 </tr>
 </thead>
 <tbody>
 <tr>
 <td style="text-align: left">breakfast</td>
 <td style="text-align: left">egg, milk, cereal, avocado</td>
 <td style="text-align: right">600</td>
 </tr>
 <tr>
 <td style="text-align: left">lunch</td>
 <td style="text-align: left">chicken breast, brown rice, lettuce</td>
 <td style="text-align: right">700</td>
 </tr>
 <tr>
 <td style="text-align: left">dinner</td>
 <td style="text-align: left">steak, sweet potato, broccoli</td>
 <td style="text-align: right">800</td>
 </tr>
 </tbody>
 <tbody>
 <tr>
 <td style="text-align: left"> </td>
 <td style="text-align: left"> </td>
 <td style="text-align: right"> </td>
 </tr>
 </tbody>
 </table>]

In [39]:
table_part = page.find("table")

In [41]:
table_part.find_all("th")

[<th style="text-align: left">Time</th>,
 <th style="text-align: left">Food</th>,
 <th style="text-align: right">Calorie</th>]

In [42]:
table_part.find_all("td")

[<td style="text-align: left">breakfast</td>,
 <td style="text-align: left">egg, milk, cereal, avocado</td>,
 <td style="text-align: right">600</td>,
 <td style="text-align: left">lunch</td>,
 <td style="text-align: left">chicken breast, brown rice, lettuce</td>,
 <td style="text-align: right">700</td>,
 <td style="text-align: left">dinner</td>,
 <td style="text-align: left">steak, sweet potato, broccoli</td>,
 <td style="text-align: right">800</td>,
 <td style="text-align: left"> </td>,
 <td style="text-align: left"> </td>,
 <td style="text-align: right"> </td>]

In [48]:
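import numpy as np
import pandas as pd

# Text of every <td> cell, flattened row by row
content = [x.text for x in table_part.find_all("td")]
# Keep the first 9 cells (3 rows x 3 columns); the rest belong to the blank trailing row
values = np.array(content[:9]).reshape(3, 3)
# The <th> cells hold the column names
columns = [x.text for x in table_part.find_all("th")]
pd.DataFrame(data=values, columns=columns)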

|    | Time      | Food                                | Calorie |
| -: | :-------- | :---------------------------------- | ------: |
|  0 | breakfast | egg, milk, cereal, avocado          |     600 |
|  1 | lunch     | chicken breast, brown rice, lettuce |     700 |
|  2 | dinner    | steak, sweet potato, broccoli       |     800 |
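The cell above relies on knowing in advance that the table has three data rows. As a possible alternative, the sketch below walks the table one `tr` at a time and drops rows whose cells are all blank. The `table_html` string is a trimmed copy of the table source (style attributes removed) so that the block runs without a network request.

```python
from bs4 import BeautifulSoup

table_html = """
<table rules="groups">
  <thead>
    <tr><th>Time</th><th>Food</th><th>Calorie</th></tr>
  </thead>
  <tbody>
    <tr><td>breakfast</td><td>egg, milk, cereal, avocado</td><td>600</td></tr>
    <tr><td>lunch</td><td>chicken breast, brown rice, lettuce</td><td>700</td></tr>
    <tr><td>dinner</td><td>steak, sweet potato, broccoli</td><td>800</td></tr>
  </tbody>
  <tbody>
    <tr><td> </td><td> </td><td> </td></tr>
  </tbody>
</table>
"""

table = BeautifulSoup(table_html, "html.parser").find("table")

# Column names come from the <th> cells
header = [th.get_text(strip=True) for th in table.find_all("th")]

rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # Skip the header row (it has no <td>) and rows whose cells are all blank
    if any(cells):
        rows.append(cells)

print(header)
print(rows)
```

Passing the result to `pd.DataFrame(rows, columns=header)` should give the same table as above, without hardcoding the slice and the shape.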


## Practice

This exercise is from the [Brown CS1951A scraping assignment](https://cs.brown.edu/courses/csci1951-a/assignments/scraping.html).

To get started, we’re going to want to collect some data on the most active stocks in the market. Conveniently, Yahoo Finance [publishes this exact data](https://finance.yahoo.com/most-active). To collect this data, you’ll make use of web scraping.

For purposes of this assignment, we've made a copy of this page to keep the data static. Note, some of the data in our static copy is intentionally modified from real stock data to ensure you've handled edge cases. As such, you will scrape from this URL: [https://cs.brown.edu/courses/csci1951-a/resources/yahoo_finance.html](https://cs.brown.edu/courses/csci1951-a/resources/yahoo_finance.html)

Before scraping, you'll need your code to access this webpage. You should make use of the `requests` library to make an HTTP request and collect the HTML. If you're not familiar with the `requests` library, you can read about it [here](http://docs.python-requests.org/en/master).

Once you have accessed the HTML and assigned it to some variable, you'll want to scrape it, collecting the following for each stock in the table.

* company name
* price
* market cap
* percentage daily change

You'll use Beautiful Soup, a Python package, to scrape the HTML. This will require looking at the HTML structure of the Yahoo Finance page. You can select various HTML elements on a page by tag name, class name, and/or id. Using [inspect element](https://zapier.com/blog/inspect-element-tutorial/) on your web browser, you can check what HTML tags and classes contain the relevant information.
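As a rough illustration of selecting by tag name, class name, and id, here is a sketch on invented markup. The `stocks` id and the `row`, `name`, and `price` classes are made up for this example and are not taken from the Yahoo Finance copy, so inspect the real page to find its actual structure.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration only; the real page uses different
# tags, classes, and ids.
sample = """
<table id="stocks">
  <tr class="row">
    <td class="name">Example Corp</td>
    <td class="price">12.50</td>
  </tr>
  <tr class="row">
    <td class="name">Sample Inc</td>
    <td class="price">1,003.20</td>
  </tr>
</table>
"""

soup = BeautifulSoup(sample, "html.parser")

# Select the table by id, then its rows and cells by tag name and class name
table = soup.find("table", id="stocks")
stocks = []
for row in table.find_all("tr", class_="row"):
    name = row.find("td", class_="name").get_text(strip=True)
    price_text = row.find("td", class_="price").get_text(strip=True)
    # Remove thousands separators before converting to a number
    price = float(price_text.replace(",", ""))
    stocks.append({"name": name, "price": price})

# The same name cells, selected with a CSS selector instead
names = [td.get_text(strip=True) for td in soup.select("#stocks td.name")]

print(stocks)
print(names)
```

The price with a thousands separator stands in for the kind of edge case the assignment mentions; the real data may contain others.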

In [None]:
"""Your code here"""

## API Exercise

For example, Apple's (AAPL) one-day chart: [https://api.iextrading.com/1.0/stock/aapl/chart/1d](https://api.iextrading.com/1.0/stock/aapl/chart/1d)

Rather than using web scraping to collect this data, we’ll make use of an API. You’ll make requests to this API using Python’s `requests` library. IEX Trading offers an API with various endpoints that offer information about stocks. We’re going to want to collect two pieces of information for each stock in Yahoo’s most active stock table:

* the average closing price of each of the most active stocks over the last 6 months
* the number of articles recently written about each stock

To do this, you’ll want to make use of the [chart endpoint](https://iextrading.com/developer/docs/#chart) to collect the historical stock pricing. Then, you will want to parse through the data and average the closing price for each day.
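The averaging step might look like the sketch below. It assumes the chart endpoint returns a JSON array of daily records, each with a `close` field; the sample data and the helper name `average_close` are invented for illustration.

```python
import json

# Invented sample in the assumed shape of a chart response:
# a JSON array of daily records, each with a "close" field.
sample_response = """
[
  {"date": "2019-01-02", "close": 10.0},
  {"date": "2019-01-03", "close": 12.0},
  {"date": "2019-01-04", "close": 14.0}
]
"""


def average_close(records):
    """Average the "close" value over daily records, skipping missing values."""
    closes = [r["close"] for r in records if r.get("close") is not None]
    if not closes:
        return None  # nothing to average
    return sum(closes) / len(closes)


records = json.loads(sample_response)
print(average_close(records))
```

When the data comes from `requests`, calling `.json()` on the response returns the parsed list directly, so `json.loads` is only needed for raw text.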

Using the [news endpoint](https://iextrading.com/developer/docs/#news), you should get the articles for a specific stock. Then, you should count how many articles were returned by the API.

**Hint**: Some stocks from Yahoo are not listed on major stock exchanges, and thus the IEX Trading API does not have data on them. In this case, the IEX Trading API will return a [404 status](https://en.wikipedia.org/wiki/HTTP_404) code. Your program should handle this error by disregarding stocks from Yahoo if they are not present in the IEX Trading API. That is, these stocks should not be added to the database. You can check the status code of a request with `requests.get(...).status_code`.
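One way to structure the 404 handling is sketched below. To keep it runnable offline, the request function is passed in as a parameter and replaced by a stand-in. `FakeResponse`, `fake_get`, `count_articles`, and the symbols are all hypothetical; the only thing assumed about a real response is that it has `status_code` and `json()`, as `requests` responses do.

```python
class FakeResponse:
    """Stand-in with the two members used below: status_code and json()."""

    def __init__(self, status_code, payload=None):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload


def fake_get(symbol):
    # Hypothetical data: only "AAA" is known to the pretend API
    known = {"AAA": [{"headline": "one"}, {"headline": "two"}]}
    if symbol in known:
        return FakeResponse(200, known[symbol])
    return FakeResponse(404)


def count_articles(symbols, get):
    """Map each symbol to its article count, skipping symbols that return 404."""
    counts = {}
    for symbol in symbols:
        response = get(symbol)
        if response.status_code == 404:
            continue  # not in the API: leave this stock out entirely
        counts[symbol] = len(response.json())
    return counts


print(count_articles(["AAA", "ZZZ"], fake_get))
```

In real use, `get` could be a small wrapper that builds the news endpoint URL for a symbol and calls `requests.get` on it.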

In [None]:
"""Your code here"""