# Covered Topics
- Requests
- BeautifulSoup

# Requests

[Official Documentation](https://requests.readthedocs.io/en/master/)

In [3]:
import requests

req = requests.get('http://cat-fact.herokuapp.com/facts')
req

<Response [200]>

In [4]:
# .json() parses the JSON response body into Python objects (here, a dictionary)
req.json()["all"][0]

{'_id': '5b453e380fd3a600147f32f3',
 'text': 'Exposure to UV light with hairless or partially-hairless cats can result in sunburn, even during cloudy or shady conditions. If your cat risks overexposure, consider applying sunscreen daily.',
 'type': 'cat',
 'user': {'_id': '5a9ac18c7478810ea6c06381',
  'name': {'first': 'Alex', 'last': 'Wohlbruck'}},
 'upvotes': 4,
 'userUpvoted': None}
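
The call above assumes the request succeeds and the body is valid JSON. Below is a minimal sketch of a more defensive fetch, assuming a 10-second timeout is acceptable. `fetch_json` and its `get` parameter are hypothetical names introduced here for illustration; they are not part of Requests.

```python
import requests


def fetch_json(url, timeout=10.0, get=requests.get):
    """Fetch a URL and return its parsed JSON body.

    `get` defaults to requests.get. It is a parameter only so that the
    function can be tried without network access (an assumption of this
    sketch, not a Requests convention).
    """
    response = get(url, timeout=timeout)
    # Raise requests.HTTPError on a 4xx/5xx status instead of parsing an error page.
    response.raise_for_status()
    return response.json()
```

If the service is reachable, `fetch_json('http://cat-fact.herokuapp.com/facts')` should return the same structure as `req.json()` above.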

# Read Twitch Data and Create DataFrame

- https://towardsdatascience.com/creating-a-dataset-using-an-api-with-python-dcc1607616d

In [39]:
import numpy as np
import pandas as pd
import requests
import json

url = "https://wind-bow.glitch.me/twitch-api/channels/freecodecamp"
JSONContent = requests.get(url).json()
content = json.dumps(JSONContent, indent = 4, sort_keys=True)
print(content)

{
    "_id": 79776140,
    "_links": {
        "chat": "https://api.twitch.tv/kraken/chat/freecodecamp",
        "commercial": "https://api.twitch.tv/kraken/channels/freecodecamp/commercial",
        "editors": "https://api.twitch.tv/kraken/channels/freecodecamp/editors",
        "follows": "https://api.twitch.tv/kraken/channels/freecodecamp/follows",
        "self": "https://api.twitch.tv/kraken/channels/freecodecamp",
        "stream_key": "https://api.twitch.tv/kraken/channels/freecodecamp/stream_key",
        "subscriptions": "https://api.twitch.tv/kraken/channels/freecodecamp/subscriptions",
        "teams": "https://api.twitch.tv/kraken/channels/freecodecamp/teams",
        "videos": "https://api.twitch.tv/kraken/channels/freecodecamp/videos"
    },
    "background": null,
    "banner": null,
    "broadcaster_language": "en",
    "created_at": "2015-01-14T03:36:47Z",
    "delay": null,
    "display_name": "FreeCodeCamp",
    "followers": 10122,
    "game": "Creative",
    "langua

In [40]:
# List of channels we want to access
channels = ["ESL_SC2", "OgamingSC2", "cretetion", "freecodecamp", "storbeck", "habathcx", "RobotCaleb", "noobs2ninjas",
            "ninja", "shroud", "Dakotaz", "esltv_cs", "pokimane", "tsm_bjergsen", "boxbox", "wtcn", "a_seagull",
           "kinggothalion", "amazhs", "jahrein", "thenadeshot", "sivhd", "kingrichard"]

channels_list = []
# For each channel, we access its information through its API
for channel in channels:
    JSONContent = requests.get("https://wind-bow.glitch.me/twitch-api/channels/" + channel).json()
    if 'error' not in JSONContent:
        channels_list.append([JSONContent['_id'], JSONContent['display_name'], JSONContent['status'],
                             JSONContent['followers'], JSONContent['views']])
                         
dataset = pd.DataFrame(channels_list)
dataset.head(5)

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 30220059 | ESL_SC2 | RERUN: StarCraft 2 - Terminator vs. Parting (P... | 135394 | 60991791 |
| 1 | 71852806 | OgamingSC2 | UnderDogs - Rediffusion - Qualifier. | 40895 | 20694507 |
| 2 | 90401618 | cretetion | It's a Divison kind of Day | 908 | 11631 |
| 3 | 79776140 | FreeCodeCamp | Greg working on Electron-Vue boilerplate w/ Ak... | 10122 | 163747 |
| 4 | 86238744 | storbeck | | 10 | 1019 |
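
The loop above indexes each key directly, so a payload that lacks one of them would raise `KeyError`. Here is a sketch of the same row-building step using `dict.get`, run on hand-written payloads rather than live responses. `FIELDS`, `to_row`, and the sample dictionaries are illustrative assumptions; the channel values are copied from the output above, and the error payload is made up.

```python
import pandas as pd

FIELDS = ["_id", "display_name", "status", "followers", "views"]


def to_row(payload):
    """Return one row for the DataFrame, or None for an error payload."""
    if "error" in payload:
        return None
    # dict.get yields None for a missing key instead of raising KeyError.
    return [payload.get(field) for field in FIELDS]


# Hand-written stand-ins for API responses.
payloads = [
    {"_id": 90401618, "display_name": "cretetion",
     "status": "It's a Divison kind of Day", "followers": 908, "views": 11631},
    {"_id": 86238744, "display_name": "storbeck",
     "status": None, "followers": 10, "views": 1019},
    {"error": "Not Found"},  # hypothetical error payload
]

rows = [row for row in map(to_row, payloads) if row is not None]
sample = pd.DataFrame(rows, columns=["Id", "Name", "Status", "Followers", "Views"])
print(sample)
```

Passing `columns=` at construction names the columns in one step, instead of assigning `dataset.columns` afterwards.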


In [41]:
# Set names of columns
dataset.columns = ['Id', 'Name', 'Status', 'Followers', 'Views']

# Drop rows that contain missing values (e.g., a channel with no status)
dataset.dropna(axis = 0, how = 'any', inplace = True)

# Dropping rows leaves gaps in the index,
# so renumber it from 0 to the new length of the DataFrame
dataset.reset_index(drop=True, inplace=True)
dataset.head(5)

|   | Id | Name | Status | Followers | Views |
|---|---|---|---|---|---|
| 0 | 30220059 | ESL_SC2 | RERUN: StarCraft 2 - Terminator vs. Parting (P... | 135394 | 60991791 |
| 1 | 71852806 | OgamingSC2 | UnderDogs - Rediffusion - Qualifier. | 40895 | 20694507 |
| 2 | 90401618 | cretetion | It's a Divison kind of Day | 908 | 11631 |
| 3 | 79776140 | FreeCodeCamp | Greg working on Electron-Vue boilerplate w/ Ak... | 10122 | 163747 |
| 4 | 6726509 | Habathcx | Massively Effective | 14 | 764 |
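
The same cleanup can be written without `inplace=True`, which returns a new DataFrame and leaves the original untouched. This is a small offline sketch; `raw` is a hand-made sample that reuses values from the tables above.

```python
import pandas as pd

raw = pd.DataFrame(
    [
        [90401618, "cretetion", "It's a Divison kind of Day", 908, 11631],
        [86238744, "storbeck", None, 10, 1019],  # missing status
        [6726509, "Habathcx", "Massively Effective", 14, 764],
    ],
    columns=["Id", "Name", "Status", "Followers", "Views"],
)

# dropna removes the storbeck row; reset_index renumbers the remaining rows from 0.
clean = raw.dropna(axis=0, how="any").reset_index(drop=True)
print(clean)
```

Chaining keeps `raw` available for comparison, which can help when checking how many rows were dropped.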


In [42]:
# dataset.to_csv('twitch.csv', index=False)  

# BeautifulSoup

[Official Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

### Guides
- https://www.dataquest.io/blog/web-scraping-tutorial-python/

In this example, we’ll be scraping weather forecasts from the [National Weather Service](http://www.weather.gov/), and then analyzing them using the Pandas library.

### Quick Recap: HTML

```html
<html>
    
<head>
</head>
    
<body>
</body>

</html>
```
Right inside an `html` tag, we put two other tags, the `head` tag, and the `body` tag. The main content of the web page goes into the `body` tag. The `head` tag contains data about the title of the page, and other information that generally isn’t useful in web scraping.

```html
<html>
    
<head>
</head>
    
<body>
<p class="bold-paragraph">
Here's a paragraph of text!
<a href="https://www.dataquest.io" id="learn-link">Learn Data Science Online</a>
</p>
<p class="bold-paragraph extra-large">
Here's a second paragraph of text!
<a href="https://www.python.org" class="extra-large">Python</a>
</p>
</body>
    
</html>
```

Tags have commonly used names that depend on their position in relation to other tags:
- **child** — a child is a tag inside another tag. So the two p tags above are both children of the body tag.
- **parent** — a parent is the tag another tag is inside. Above, the html tag is the parent of the body tag.
- **sibling** — a sibling is a tag that is nested inside the same parent as another tag. For example, head and body are siblings, since they’re both inside html. Both p tags are siblings, since they’re both inside body.
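
These relationships can be explored in code with BeautifulSoup, which is introduced later in this notebook. The sketch below parses a trimmed copy of the HTML above (the `a` tags are left out for brevity) and walks from a tag to its parent, children, and sibling.

```python
from bs4 import BeautifulSoup

html = """
<html>
<head></head>
<body>
<p class="bold-paragraph">Here's a paragraph of text!</p>
<p class="bold-paragraph extra-large">Here's a second paragraph of text!</p>
</body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

body = soup.find("body")
# parent: the tag that body sits inside
print(body.parent.name)                       # html
# children: recursive=False limits the search to direct children
children = body.find_all("p", recursive=False)
print(len(children))                          # 2
# sibling: the next p tag under the same parent
sibling = children[0].find_next_sibling("p")
print(sibling["class"])                       # ['bold-paragraph', 'extra-large']
```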

In the above example, we added two `a` tags. `a` tags are links, and tell the browser to render a link to another web page. The `href` attribute of the tag determines where the link goes.

`a` and `p` are extremely common html tags. Here are a few others:
- `div` — indicates a division, or area, of the page.
- `b` — bolds any text inside.
- `i` — italicizes any text inside.
- `table` — creates a table.
- `form` — creates an input form.

For a full list of tags, look [here](https://developer.mozilla.org/en-US/docs/Web/HTML/Element).

The `class` and `id` attributes give HTML elements names, and make them easier to interact with when we’re scraping. One element can have multiple classes, and a class can be shared between elements. Each element can only have one id, and an id should only be used once on a page. Classes and ids are optional, and not all elements will have them. On their own, classes and ids don’t change how the tags are rendered; they only take effect when CSS or scripts target them.
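
As a quick illustration of how these attributes look from Python, the sketch below parses the two paragraphs from the example above with BeautifulSoup (covered in the next section) and reads `href`, `class`, and `id` from the tags.

```python
from bs4 import BeautifulSoup

html = """
<p class="bold-paragraph">
Here's a paragraph of text!
<a href="https://www.dataquest.io" id="learn-link">Learn Data Science Online</a>
</p>
<p class="bold-paragraph extra-large">
Here's a second paragraph of text!
<a href="https://www.python.org" class="extra-large">Python</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a", id="learn-link")
print(link["href"])        # https://www.dataquest.io
print(link.get("class"))   # None, because this tag has no class attribute

second = soup.find_all("p")[1]
# class can hold several values, so BeautifulSoup returns it as a list.
print(second["class"])     # ['bold-paragraph', 'extra-large']
```

Using `.get("class")` rather than `["class"]` avoids a `KeyError` on tags that lack the attribute.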

### `find_all` method

Website reference: http://dataquestio.github.io/web-scraping-pages/simple.html

In [7]:
import requests
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
page

<Response [200]>

In [8]:
page.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

In [9]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

In [10]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


In [17]:
# The find_all method finds all the instances of a tag on a page.

soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

In [18]:
for p in soup.find_all('p'):
    print(p.get_text())

Here is some simple content for this page.


In [19]:
# The find method finds the first instance of a tag on a page.

soup.find('p')

<p>Here is some simple content for this page.</p>
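
One practical difference between the two methods is what they return when nothing matches. The sketch below works offline on a copy of the simple page's HTML, with whitespace removed.

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>A simple example page</title></head>"
    "<body><p>Here is some simple content for this page.</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# There is no table on this page.
missing_all = soup.find_all("table")
missing_one = soup.find("table")
print(missing_all)   # []
print(missing_one)   # None

# Because find can return None, guard before calling methods on the result.
first_p = soup.find("p")
text = first_p.get_text() if first_p is not None else ""
print(text)          # Here is some simple content for this page.
```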

### Searching by classes and ids

Website reference: http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html

In [20]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

In [21]:
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [22]:
soup.find_all(class_="outer-text")

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [23]:
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]
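
When a tag carries several classes, `class_` matches if any one of them equals the given name. The sketch below shows this on an offline copy of the page above, and uses a CSS selector (introduced later in this notebook) to require two classes at once.

```python
from bs4 import BeautifulSoup

html = """
<html><body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches every p whose class list contains "first-item".
by_one_class = soup.find_all("p", class_="first-item")
print([p["id"] for p in by_one_class])   # ['first', 'second']

# Requires both classes, in any order.
both = soup.select("p.outer-text.first-item")
print([p["id"] for p in both])           # ['second']
```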

### Weather Data

The first step is to find the page we want to scrape. We’ll extract weather information about downtown San Francisco from this [page](http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168).

You can start the developer tools in Chrome by clicking View -> Developer -> Developer Tools. You should end up with a panel at the bottom of the browser like what you see below. Make sure the Elements panel is highlighted:
![Chrome](https://www.dataquest.io/wp-content/uploads/2019/01/devtools.png)

By right clicking on the page near where it says “Extended Forecast”, then clicking “Inspect”, we’ll open up the tag that contains the text “Extended Forecast” in the elements panel:

![Chrome](https://www.dataquest.io/wp-content/uploads/2019/01/ex_selected.png)

We can then scroll up in the elements panel to find the “outermost” element that contains all of the text that corresponds to the extended forecasts. In this case, it’s a div tag with the id seven-day-forecast:

![Chrome](https://www.dataquest.io/wp-content/uploads/2019/01/div.png)

If you click around in the Elements panel and explore the div, you’ll discover that each forecast item (like “Tonight”, “Thursday”, and “Thursday Night”) is contained in a div with the class tombstone-container.

We now know enough to download the page and start parsing it. In the below code, we:

- Download the web page containing the forecast.
- Create a `BeautifulSoup` object to parse the page.
- Find the `div` with id `seven-day-forecast`, and assign it to `seven_day`.
- Inside `seven_day`, find each individual forecast item.
- Extract and print the first forecast item.

In [24]:
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')

In [26]:
seven_day = soup.find(id="seven-day-forecast")
seven_day

<div class="panel panel-default" id="seven-day-forecast">
<div class="panel-heading">
<b>Extended Forecast for</b>
<h2 class="panel-title">
	    	    San Francisco CA	</h2>
</div>
<div class="panel-body" id="seven-day-forecast-body">
<div id="seven-day-forecast-container"><ul class="list-unstyled" id="seven-day-forecast-list"><li class="forecast-tombstone">
<div class="tombstone-container">
<p class="period-name">Today<br/><br/></p>
<p><img alt="Today: Mostly sunny, with a high near 76. Light southwest wind becoming west 5 to 10 mph in the afternoon. " class="forecast-icon" src="newimages/medium/sct.png" title="Today: Mostly sunny, with a high near 76. Light southwest wind becoming west 5 to 10 mph in the afternoon. "/></p><p class="short-desc">Mostly Sunny</p><p class="temp temp-high">High: 76 °F</p></div></li><li class="forecast-tombstone">
<div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p><img alt="Tonight: Mostly cloudy, with a low around 53. Southwe

In [27]:
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Today
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Today: Mostly sunny, with a high near 76. Light southwest wind becoming west 5 to 10 mph in the afternoon. " class="forecast-icon" src="newimages/medium/sct.png" title="Today: Mostly sunny, with a high near 76. Light southwest wind becoming west 5 to 10 mph in the afternoon. "/>
 </p>
 <p class="short-desc">
  Mostly Sunny
 </p>
 <p class="temp temp-high">
  High: 76 °F
 </p>
</div>


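As you can see, the first forecast item, stored in `tonight`, contains all the information we want. (In this run the first item happens to be the “Today” forecast, despite the variable name.)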

In [28]:
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()

print(period)
print(short_desc)
print(temp)

Today
Mostly Sunny
High: 76 °F
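
The cell above calls `.get_text()` directly on the result of `find`, which raises `AttributeError` if a tag is missing. Here is a sketch of a more defensive version, run on an offline copy of the first forecast item with the long `title` text shortened. `text_of` and `parse_item` are hypothetical helper names, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

html = """
<div class="tombstone-container">
 <p class="period-name">Today<br/><br/></p>
 <p><img alt="Today: Mostly sunny, with a high near 76." class="forecast-icon"
         src="newimages/medium/sct.png" title="Today: Mostly sunny, with a high near 76."/></p>
 <p class="short-desc">Mostly Sunny</p>
 <p class="temp temp-high">High: 76 °F</p>
</div>
"""


def text_of(item, css_class):
    """Return the stripped text of the first tag with css_class, or None."""
    tag = item.find(class_=css_class)
    return tag.get_text(strip=True) if tag is not None else None


def parse_item(item):
    """Collect the fields of one forecast item into a dictionary."""
    img = item.find("img")
    return {
        "period": text_of(item, "period-name"),
        "short_desc": text_of(item, "short-desc"),
        "temp": text_of(item, "temp"),
        "desc": img.get("title") if img is not None else None,
    }


item = BeautifulSoup(html, "html.parser").find(class_="tombstone-container")
print(parse_item(item))
```

Missing pieces come back as `None`, so one incomplete forecast item would not stop a loop over all items.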


### Advanced: CSS Selectors

You can also search for items using [CSS selectors](https://developer.mozilla.org/en-US/docs/Web/Guide/CSS/Getting_started/Selectors). These selectors are how the CSS language allows developers to specify HTML tags to style. Here are some examples:

- `p a` — finds all `a` tags inside of a `p` tag.
- `body p a` — finds all `a` tags inside of a `p` tag inside of a `body` tag.
- `html body` — finds all `body` tags inside of an `html` tag.
- `p.outer-text` — finds all `p` tags with a class of `outer-text`.
- `p#first` — finds all `p` tags with an id of `first`.
- `body p.outer-text` — finds any `p` tags with a class of `outer-text` inside of a `body` tag.

You can learn more about CSS selectors [here](https://developer.mozilla.org/en-US/docs/Web/Guide/CSS/Getting_started/Selectors).
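
BeautifulSoup accepts these selectors through its `select` and `select_one` methods. As a sketch, here are several of the selectors listed above applied to an offline copy of the ids-and-classes page used earlier:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("html body")))          # 1
print(len(soup.select("body p")))             # 4, p tags at any depth inside body
print(len(soup.select("p.outer-text")))       # 2
print(len(soup.select("body p.outer-text")))  # 2

# select_one returns the first match (or None) instead of a list.
first = soup.select_one("p#first")
print(first.get_text(strip=True))             # First paragraph.
```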

Now that we know how to extract each individual piece of information, we can combine our knowledge with CSS selectors and list comprehensions to extract everything at once.

In the below code, we:
- Select all items with the class `period-name` inside an item with the class `tombstone-container` in `seven_day`.
- Use a list comprehension to call the `get_text` method on each selected tag.

In [29]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['Today',
 'Tonight',
 'Saturday',
 'SaturdayNight',
 'Sunday',
 'SundayNight',
 'Monday',
 'MondayNight',
 'Tuesday']
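
Names such as `'SaturdayNight'` run together because `get_text()` joins the text on either side of a `<br/>` with nothing in between. Below is a sketch of one way around this, assuming the two words are separated by a `<br/>` tag (the markup here is an assumption modeled on the `period-name` tag shown earlier):

```python
from bs4 import BeautifulSoup

# Assumed markup for a two-word period name.
html = '<p class="period-name">Saturday<br/>Night</p>'
tag = BeautifulSoup(html, "html.parser").find("p")

print(tag.get_text())                           # SaturdayNight
# separator is placed between text fragments; strip trims each fragment.
print(tag.get_text(separator=" ", strip=True))  # Saturday Night
```

The same arguments could be passed inside the list comprehension above, and likewise for the short descriptions.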

In [32]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
short_descs

['Mostly Sunny',
 'Mostly Cloudy',
 'Partly Sunny',
 'Mostly Cloudy',
 'Mostly Cloudy',
 'Mostly Cloudy',
 'ChanceShowers',
 'ShowersLikely',
 'ShowersLikely']

In [33]:
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
temps

['High: 76 °F',
 'Low: 53 °F',
 'High: 68 °F',
 'Low: 53 °F',
 'High: 66 °F',
 'Low: 54 °F',
 'High: 66 °F',
 'Low: 56 °F',
 'High: 64 °F']

In [34]:
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
descs

['Today: Mostly sunny, with a high near 76. Light southwest wind becoming west 5 to 10 mph in the afternoon. ',
 'Tonight: Mostly cloudy, with a low around 53. Southwest wind 6 to 11 mph. ',
 'Saturday: Partly sunny, with a high near 68. West southwest wind 7 to 14 mph, with gusts as high as 18 mph. ',
 'Saturday Night: Mostly cloudy, with a low around 53. West southwest wind 8 to 14 mph. ',
 'Sunday: Mostly cloudy, with a high near 66. West southwest wind 8 to 15 mph, with gusts as high as 18 mph. ',
 'Sunday Night: Mostly cloudy, with a low around 54.',
 'Monday: A slight chance of rain before noon, then a chance of showers after noon.  Partly sunny, with a high near 66. Chance of precipitation is 40%.',
 'Monday Night: Showers likely, mainly after midnight.  Mostly cloudy, with a low around 56.',
 'Tuesday: Showers likely, mainly before noon.  Partly sunny, with a high near 64.']

In [35]:
import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc":descs
})
weather

|   | period | short_desc | temp | desc |
|---|---|---|---|---|
| 0 | Today | Mostly Sunny | High: 76 °F | Today: Mostly sunny, with a high near 76. Ligh... |
| 1 | Tonight | Mostly Cloudy | Low: 53 °F | Tonight: Mostly cloudy, with a low around 53. ... |
| 2 | Saturday | Partly Sunny | High: 68 °F | Saturday: Partly sunny, with a high near 68. W... |
| 3 | SaturdayNight | Mostly Cloudy | Low: 53 °F | Saturday Night: Mostly cloudy, with a low arou... |
| 4 | Sunday | Mostly Cloudy | High: 66 °F | Sunday: Mostly cloudy, with a high near 66. We... |
| 5 | SundayNight | Mostly Cloudy | Low: 54 °F | Sunday Night: Mostly cloudy, with a low around... |
| 6 | Monday | ChanceShowers | High: 66 °F | Monday: A slight chance of rain before noon, t... |
| 7 | MondayNight | ShowersLikely | Low: 56 °F | Monday Night: Showers likely, mainly after mid... |
| 8 | Tuesday | ShowersLikely | High: 64 °F | Tuesday: Showers likely, mainly before noon. ... |
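
With the data in a DataFrame, a first analysis step might be to turn the `temp` strings into numbers. The sketch below rebuilds only the `temp` column from the output above so that it runs offline. The `temp_num` and `is_night` column names are illustrative choices.

```python
import pandas as pd

temps = ["High: 76 °F", "Low: 53 °F", "High: 68 °F", "Low: 53 °F", "High: 66 °F",
         "Low: 54 °F", "High: 66 °F", "Low: 56 °F", "High: 64 °F"]
weather = pd.DataFrame({"temp": temps})

# Pull the first run of digits out of each string and convert it to an integer.
weather["temp_num"] = weather["temp"].str.extract(r"(\d+)", expand=False).astype(int)

# Rows that report a low are the night-time forecasts.
weather["is_night"] = weather["temp"].str.contains("Low")

print(weather["temp_num"].mean())
print(weather[weather["is_night"]])
```

Boolean columns like `is_night` can be used directly as row filters, as in the last line.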
