# Text Wrangling

This is adapted from https://realpython.com/python-web-scraping-practical-introduction/.

One useful package for web scraping that you can find in Python’s standard library is urllib, which contains tools for working with URLs. In particular, the urllib.request module contains a function called urlopen() that you can use to open a URL within a program.



In [1]:
from urllib.request import urlopen

The web page that you’ll open is at the following URL:



In [11]:
url = "http://olympus.realpython.org/profiles/aphrodite"

You can open the [URL](http://olympus.realpython.org/profiles/aphrodite) directly in your browser. To open the page from Python, pass the URL to urlopen():

In [12]:
page = urlopen(url)

In [13]:
page

<http.client.HTTPResponse at 0x7faae6812c10>

To extract the HTML from the page, first use the HTTPResponse object’s .read() method, which returns a sequence of bytes. Then use .decode() to decode the bytes to a string using UTF-8:

In [14]:
html_bytes = page.read()
html = html_bytes.decode("utf-8")

Printing html displays the HTML code of the page:

In [15]:
print(html)

<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/aphrodite.gif" />
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
<br><br>
Favorite color: Red
<br><br>
Hometown: Mount Olympus
</center>
</body>
</html>



## Extract Text From HTML With String Methods
One way to extract information from a web page’s HTML is to use string methods. For instance, you can use .find() to search through the text of the HTML for the <title> tags and extract the title of the web page.

To start, you’ll extract the title of the web page that you requested in the previous example. If you know the index of the first character of the title and the index of the first character of the closing </title> tag, then you can use a string slice to extract the title.

Because .find() returns the index of the first occurrence of a substring, you can get the index of the opening <title> tag by passing the string "<title>" to .find():

In [16]:
title_index = html.find("<title>")
title_index

14

You don’t want the index of the <title> tag, though. You want the index of the title itself. To get the index of the first letter in the title, add the length of the string "<title>" to title_index:

In [17]:
start_index = title_index + len("<title>")
start_index

21

Now get the index of the closing </title> tag by passing the string "</title>" to .find():



In [19]:
end_index = html.find("</title>")
end_index

39

Finally, you can extract the title by slicing the html string:



In [21]:
title = html[start_index:end_index]
title

'Profile: Aphrodite'
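
The steps above can be collected into a small helper. This is a minimal sketch rather than part of the original tutorial: the function name `extract_between` and the inline `sample_html` string (a shortened copy of the Aphrodite page) are assumptions, chosen so the example runs without a network connection. It also checks for the -1 that .find() returns when a substring is missing.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    marker_index = text.find(start_marker)
    if marker_index == -1:  # .find() returns -1 when nothing matches
        return None
    start_index = marker_index + len(start_marker)
    # Search for the end marker only after the start marker
    end_index = text.find(end_marker, start_index)
    if end_index == -1:
        return None
    return text[start_index:end_index]


# Hypothetical stand-in for the downloaded page
sample_html = "<html>\n<head>\n<title>Profile: Aphrodite</title>\n</head>\n</html>"

title = extract_between(sample_html, "<title>", "</title>")
print(title)

missing = extract_between(sample_html, "<h1>", "</h1>")
print(missing)
```

Returning None for a missing marker avoids silently slicing from the wrong position, which is what happens if -1 is used as an index.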

Real-world HTML can be much more complicated and far less predictable than the HTML on the Aphrodite profile page. Here’s another profile page with some messier HTML that you can scrape:

In [22]:
url = "http://olympus.realpython.org/profiles/poseidon"
page = urlopen(url)
html = page.read().decode("utf-8")
start_index = html.find("<title>") + len("<title>")
end_index = html.find("</title>")
title = html[start_index:end_index]
title


'\n<head>\n<title >Profile: Poseidon'

# TASK 1

Comment on why this is happening and why it makes text scraping difficult.

If you do a lot of scraping, it is worth learning about regular expressions and how to use them in Python. This [tutorial](https://realpython.com/python-web-scraping-practical-introduction/) has some basic information, and you can find many other good tutorials by searching for "Python regular expressions".
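
As a rough illustration of what a regular expression can do here, the sketch below uses Python's built-in `re` module on a made-up string. The `messy_html` value is an assumption modeled on the Poseidon output above (in particular, the exact form of its closing tag is invented), and the pattern is one possible choice, not the only one.

```python
import re

# Hypothetical messy HTML: extra whitespace inside the opening and closing tags
messy_html = "<html>\n<head>\n<title >Profile: Poseidon</title  / >\n</head>\n</html>"

# <title[^>]*>   matches the opening tag even with extra characters before ">"
# (.*?)          captures as little text as possible
# </title[^>]*>  matches the closing tag, again tolerating extra characters
pattern = r"<title[^>]*>(.*?)</title[^>]*>"
match = re.search(pattern, messy_html, re.IGNORECASE | re.DOTALL)

title = match.group(1).strip() if match else None
print(title)
```

`re.IGNORECASE` lets the pattern match tags written as `<TITLE>`, and `re.DOTALL` lets `.` match newlines in case the title spans more than one line.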

# Beautiful Soup in Python

In [25]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

In [26]:
url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

This program does three things:

* Opens the URL http://olympus.realpython.org/profiles/dionysus by using urlopen() from the urllib.request module
* Reads the HTML from the page as a string and assigns it to the html variable
* Creates a BeautifulSoup object and assigns it to the soup variable

The BeautifulSoup object assigned to soup is created with two arguments. The first argument is the HTML to be parsed, and the second argument, the string "html.parser", tells the object which parser to use behind the scenes. "html.parser" represents Python’s built-in HTML parser.

In [27]:
soup

<html>
<head>
<title>Profile: Dionysus</title>
</head>
<body bgcolor="yellow">
<center>
<br/><br/>
<img src="/static/dionysus.jpg"/>
<h2>Name: Dionysus</h2>
<img src="/static/grapes.png"/><br/><br/>
Hometown: Mount Olympus
<br/><br/>
Favorite animal: Leopard <br/>
<br/>
Favorite Color: Wine
</center>
</body>
</html>

In [28]:
print(soup.get_text())



Profile: Dionysus





Name: Dionysus

Hometown: Mount Olympus

Favorite animal: Leopard 

Favorite Color: Wine






There are a lot of blank lines in this output. These are the result of newline characters in the HTML document’s text. You can remove them with the .replace() string method if you need to.
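
One way to do that is sketched below. The `text` string is a hand-written stand-in for part of the `soup.get_text()` output (an assumption, so the example runs on its own). Because a single .replace() call only collapses each pair of newlines once, the call is repeated until no double newlines remain.

```python
# Hypothetical stand-in for part of the soup.get_text() output
text = "\n\n\nProfile: Dionysus\n\n\n\n\n\nName: Dionysus\n\nHometown: Mount Olympus\n\n"

# Collapse runs of newlines into a single newline
while "\n\n" in text:
    text = text.replace("\n\n", "\n")

# Drop the newline left at the very start and end
cleaned = text.strip()
print(cleaned)
```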

Often, you need to get only specific text from an HTML document. Using Beautiful Soup first to extract the text and then using the .find() string method is sometimes easier than working with regular expressions.

However, other times the HTML tags themselves are the elements that point out the data you want to retrieve. For instance, perhaps you want to retrieve the URLs for all the images on the page. These links are contained in the src attribute of "img" HTML tags.

In this case, you can use find_all() to return a list of all instances of that particular tag:

In [29]:
soup.find_all("img")

[<img src="/static/dionysus.jpg"/>, <img src="/static/grapes.png"/>]

This returns a list of all img tags in the HTML document. The objects in the list look like they might be strings representing the tags, but they’re actually instances of the Tag object provided by Beautiful Soup. Tag objects provide a simple interface for working with the information they contain.

You can explore this a little by first unpacking the Tag objects from the list:

In [30]:
image1, image2 = soup.find_all("img")

In [31]:
image1.name

'img'

You can access the HTML attributes of the Tag object by putting their names between square brackets, just as if the attributes were keys in a dictionary. To get the source of the images in the Dionysus profile page, you access the src attribute using the dictionary notation mentioned above:


In [32]:
image1["src"]

'/static/dionysus.jpg'
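
The src value is a path relative to the site, not a complete URL. If you need the full address of each image, one option is `urljoin()` from the standard library's urllib.parse module. The sketch below assumes the two src values shown earlier and the Dionysus page URL; the variable names are illustrative.

```python
from urllib.parse import urljoin

page_url = "http://olympus.realpython.org/profiles/dionysus"

# The src values of the two img tags on the Dionysus page
relative_sources = ["/static/dionysus.jpg", "/static/grapes.png"]

# Resolve each relative path against the URL of the page it came from
full_urls = [urljoin(page_url, src) for src in relative_sources]
for full_url in full_urls:
    print(full_url)
```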

Certain tags in HTML documents can be accessed by properties of the Tag object. For example, to get the <title> tag in a document, you can use the .title property:

In [33]:
soup.title

<title>Profile: Dionysus</title>

In [34]:
soup.title.string

'Profile: Dionysus'

# TASK 2

Write a program that grabs the full HTML from the page at the URL http://olympus.realpython.org/profiles.

Using Beautiful Soup, print out a list of all the links on the page by looking for HTML tags with the name a and retrieving the value taken on by the href attribute of each tag.



# TASK 3

Given that websites and their development are constantly changing, what might be challenges of writing sustainable code for text scraping?  What might you as someone who provides data do to provide a solution to these challenges?

# APIs

API stands for application programming interface, which is a set of definitions and protocols for building and integrating application software. APIs let your product or service communicate with other products and services without having to know how they’re implemented. They also allow you to share your data with other external users. 

# TASK 4

Identify 3 APIs that may provide data to users.  What types of data do they provide?  

# Example of an API:  openweathermap.org

Imagine that you are collecting data (or maybe even creating an application) where you have to obtain current, past, or forecasted weather for varying locations.

You can use a data service to collect this data. One such service is OpenWeather.

"OpenWeather is a team of IT-intellectuals that create pivotal products for business using climate data. For each point on the globe, OpenWeather provides hyperlocal minutely forecast, historical data, current state, and from short-term to annual forecasted weather data."

OpenWeather provides an API that returns data in response to user queries. In order to use it, you'll need to have an API "key". To make sure you are not a "bot", most data queries are restricted to accounts that have been validated. OpenWeather does this by providing a "key" to their services once you've verified your email. It also offers multiple tiers of API service depending on your subscription.

In this class, we'll be working with the free subscription which limits the number of queries you are allowed and also the type of information provided (see this [link](https://openweathermap.org/price)).

We are going to learn to use the API for the following:

* Current Weather - https://openweathermap.org/current
* 5 day weather forecasts - https://openweathermap.org/forecast5


APIs generally have a pattern to their text request, which comes in the form of a URL.  For OpenWeather's API, the beginning part of this request should start with the following:

```
https://api.openweathermap.org/data/2.5/weather?
```

Appended to the end of the URL can be other terms (query parameters) that specify the information you are requesting:

```
https://api.openweathermap.org/data/2.5/weather?zip=50014&units=imperial
```

If you paste this URL into your browser, you'll get an error about needing a "key" for the API. This requirement is increasingly common for requests to a server, to ensure that you are not a 'bot'. You should have registered for a key that is private to you in the last class. This 'key' is alphanumeric text that you append to the end of your URL, replacing {API KEY} below.

```
https://api.openweathermap.org/data/2.5/weather?zip=50014&units=imperial&appid={API KEY}
```
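
Rather than typing the query string by hand, you can build it from a dictionary. The following is a minimal sketch using `urlencode()` from the standard library; the variable names are illustrative and `YOUR_API_KEY` is a placeholder, not a working key.

```python
from urllib.parse import urlencode

base_url = "https://api.openweathermap.org/data/2.5/weather"

# Query parameters as key/value pairs
params = {
    "zip": "50014",
    "units": "imperial",
    "appid": "YOUR_API_KEY",  # placeholder: replace with your own key
}

# urlencode() joins the pairs with "&" and escapes special characters
url = base_url + "?" + urlencode(params)
print(url)
```

Keeping the parameters in a dictionary makes it easy to change one value, such as the zip code, without editing the rest of the URL. The `requests.get()` function also accepts such a dictionary directly through its `params` argument.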


# TASK 5

Explore the documentation for finding current weather from OpenWeather.  Change the units and language of an API call for finding the current weather of your choice.  

https://openweathermap.org/current

Describe what kind of information you are getting back from your call. Where is this information found in the API documentation?


In Python, you can use the 'requests' module to make a URL query, much like pasting a URL into a browser. Within 'requests', you'll use the 'get' function and store the response it returns in a data object, 'resp'. This data object has a number of attributes. You can see the attributes of any Python object with the built-in `dir` function. You'll see in this case that `resp` has attributes like `raw` or `content` and can explore these.

In [13]:
import requests
url = "https://api.openweathermap.org/data/2.5/weather?zip=50014&units=imperial&appid=60d3c6c39ff016032e3271b5cb528e91"
resp = requests.get(url)
dir(resp)
print(resp.raw)      # the underlying urllib3 response object
print(resp.content)  # the response body as raw bytes


<urllib3.response.HTTPResponse object at 0x7fe4f76f3f90>
b'{"coord":{"lon":-93.6945,"lat":42.0486},"weather":[{"id":800,"main":"Clear","description":"clear sky","icon":"01d"}],"base":"stations","main":{"temp":32.81,"feels_like":25.14,"temp_min":29.97,"temp_max":37.89,"pressure":1017,"humidity":86},"visibility":10000,"wind":{"speed":9.22,"deg":140},"clouds":{"all":0},"dt":1673281345,"sys":{"type":2,"id":2041488,"country":"US","sunrise":1673271745,"sunset":1673305249},"timezone":-21600,"id":0,"name":"Ames","cod":200}'
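
The list returned by `dir` is long because it includes many internal names that start with an underscore. A small sketch of filtering it down to the public names is shown below. It uses a parsed URL from the standard library instead of `resp` (an assumption, so that the example runs without an API key or network access); the same filtering works on any object.

```python
from urllib.parse import urlparse

# Any Python object works here; a parsed URL keeps the example offline
parts = urlparse("https://api.openweathermap.org/data/2.5/weather?zip=50014&units=imperial")

# Keep only the names that do not start with an underscore
public_names = [name for name in dir(parts) if not name.startswith("_")]
print(public_names)

# Explore individual attributes found in the list
print(parts.netloc)
print(parts.query)
```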


# TASK 6

What information is provided in the `resp.content`?

This is in a specific format called JSON.  You can read about [JSON here](https://en.wikipedia.org/wiki/JSON).
 
Print the JSON form of the `resp` data object by calling `resp.json()`.
Store the result in a variable and explore it as a Python dictionary. Print out the keys and values of this object. Identify where these keys and values appear in the API documentation.

Print out the temperature found within the 'main' output.

# TASK 7

Pick 10 zip codes and find the current weather for each. Using API calls, store this information in a pandas DataFrame with columns for the location, current temperature, minimum temperature, maximum temperature, and humidity.