# 🌐 Scraping, Part 4: Practice, practice, practice!

*Let's scrape some dataaaaaaa!*

*Note: In these examples, I'll be using `lxml`, but feel free to use `BeautifulSoup` if you prefer.*

## Let's start with Soma's personal website

It's https://jonathansoma.com.

Open it in your browser. View the raw HTML, and also practice popping open the element inspector.

## Q: What do you see? What would you want to extract from it?

Some ideas:

- How many hyperlinks does Soma's homepage contain?
- Which paragraph contains the most hyperlinks?

Let's load the HTML in Python. Remember how?

In [1]:
import requests

In [2]:
soma_html = requests.get("https://jonathansoma.com/").text
print(soma_html[:200])

<!DOCTYPE html>
<html>
<head>
<title>Jonathan Soma makes things</title>
<style>
#content {
width: 700px;
color: #333;
margin: 0 auto;
padding-bottom: 100px;
padding-top: 100px;
font-family: Georgia, s


Now let's parse it with `lxml`. Remember how?

In [3]:
import lxml.html
soma_dom = lxml.html.fromstring(soma_html)

## Q: How many links on Soma's homepage?

In [4]:
soma_links = soma_dom.cssselect("a")
len(soma_links)

11

## Q: What are all the URLs those hyperlinks point to?

In [5]:
for link in soma_links:
    print(link.attrib["href"])
    # In BeautifulSoup: print(link["href"])

http://brooklynbrainery.com
http://dabbles.in
http://www.omgmsg.com
https://investigate.ai
http://jonathansoma.com/singles
http://handsomeatlas.com
http://jonathansoma.com/notes/dosas-and-injera/
http://jonathansoma.com/open-source-language-map
https://tinyletter.com/jsoma
http://twitter.com/dangerscarf
mailto:jonathan.soma@gmail.com
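
One thing to watch for: `link.attrib["href"]` raises a `KeyError` if an `<a>` tag has no `href` at all (old-school named anchors, for example). Every link on this page has one, so the loop above works fine. Here's a minimal sketch of a more forgiving version. To keep it runnable without a network connection, it uses the standard library's `xml.etree.ElementTree` on a made-up snippet (`SAMPLE` and its URLs are invented for illustration); `lxml` elements offer the same `.get(...)` method.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for a page. The second <a> has no href.
SAMPLE = """
<div>
  <p><a href="http://example.com/one">one</a> and <a name="top">a bare anchor</a></p>
  <p><a href="mailto:someone@example.com">email</a></p>
</div>
""".strip()

root = ET.fromstring(SAMPLE)

hrefs = []
for link in root.iter("a"):
    # .get() returns None for a missing attribute instead of raising KeyError
    href = link.get("href")
    if href is None:
        continue  # skip anchors that don't point anywhere
    hrefs.append(href)

print(hrefs)
```

With the `soma_links` list from above, the equivalent would be `link.get("href")` in place of `link.attrib["href"]`.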


## Exercise: How many links are in each paragraph?

Let's start with grabbing each paragraph:

In [6]:
soma_paras = soma_dom.cssselect("p")
for i, p in enumerate(soma_paras):
    print(f"Paragraph {i+1}: {p.text_content()}")
    print("---")

Paragraph 1: I run a fake school and a paid newsletter about hobbies and have been known to talk too much about food. I love just about everything.
---
Paragraph 2: I've worked on baby-steps data science for journalists and lonely young men and rad old maps and pancakes and crowdsourced linguistics.
---
Paragraph 3: Want updates? I have a newsletter for that, too.
---
Paragraph 4:  
---
Paragraph 5: pithy = @dangerscarf lengthy = jonathan.soma@gmail.com
---


Now let's search *within* each paragraph for its links; we can do this because `.cssselect(...)` works on *any* element:

In [7]:
for i, p in enumerate(soma_dom.cssselect("p")):
    p_links = p.cssselect("a")
    print(f"Paragraph {i+1} has {len(p_links)} link(s)")
    print("---")

Paragraph 1 has 3 link(s)
---
Paragraph 2 has 5 link(s)
---
Paragraph 3 has 1 link(s)
---
Paragraph 4 has 0 link(s)
---
Paragraph 5 has 2 link(s)
---
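
That also lets us answer one of the questions from the top: which paragraph contains the most hyperlinks? Going by the counts printed above, it's paragraph 2, with 5 links. Here's a rough sketch of how to find that in code rather than by eye. It runs on a made-up snippet (`SAMPLE` is invented for illustration) and uses the standard library's `xml.etree.ElementTree` so it works offline; with `lxml` you could build `link_counts` from `p.cssselect("a")` instead.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for a page with three paragraphs.
SAMPLE = """
<body>
  <p>One <a href="http://example.com/a">link</a>.</p>
  <p><a href="http://example.com/b">Two</a> <a href="http://example.com/c">links</a>.</p>
  <p>No links here.</p>
</body>
""".strip()

root = ET.fromstring(SAMPLE)
paragraphs = list(root.iter("p"))

# One count per paragraph, in document order
link_counts = [len(list(p.iter("a"))) for p in paragraphs]

# Index of the largest count; on a tie, max() keeps the earliest paragraph
busiest_index = max(range(len(link_counts)), key=link_counts.__getitem__)

print(f"Paragraph {busiest_index + 1} has the most links ({link_counts[busiest_index]})")
```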


Now let's print the text and URL of each link:

In [8]:
for i, p in enumerate(soma_dom.cssselect("p")):
    p_links = p.cssselect("a")
    print(f"Paragraph {i+1} has {len(p_links)} link(s):")
    for a in p_links:
        text = a.text_content()
        url = a.attrib["href"]
        print(f"→ {text}: {url}")
    print("---")

Paragraph 1 has 3 link(s):
→ fake school: http://brooklynbrainery.com
→ paid newsletter about hobbies: http://dabbles.in
→ food: http://www.omgmsg.com
---
Paragraph 2 has 5 link(s):
→ baby-steps data science for journalists: https://investigate.ai
→ lonely young men: http://jonathansoma.com/singles
→ rad old maps: http://handsomeatlas.com
→ pancakes: http://jonathansoma.com/notes/dosas-and-injera/
→ crowdsourced linguistics: http://jonathansoma.com/open-source-language-map
---
Paragraph 3 has 1 link(s):
→ newsletter: https://tinyletter.com/jsoma
---
Paragraph 4 has 0 link(s):
---
Paragraph 5 has 2 link(s):
→ @dangerscarf: http://twitter.com/dangerscarf
→ jonathan.soma@gmail.com: mailto:jonathan.soma@gmail.com
---


## Exercise: `pandas` refresher

How would you make a `pandas` `DataFrame` representing each link's text and URL?

(You can forget, for now, about what paragraph the link is in.)

In [9]:
import pandas as pd

In [10]:
soma_link_df = pd.DataFrame([ {
    "text": link.text_content(),
    "url": link.attrib["href"]
} for link in soma_links ])

soma_link_df

| | text | url |
|---|---|---|
| 0 | fake school | http://brooklynbrainery.com |
| 1 | paid newsletter about hobbies | http://dabbles.in |
| 2 | food | http://www.omgmsg.com |
| 3 | baby-steps data science for journalists | https://investigate.ai |
| 4 | lonely young men | http://jonathansoma.com/singles |
| 5 | rad old maps | http://handsomeatlas.com |
| 6 | pancakes | http://jonathansoma.com/notes/dosas-and-injera/ |
| 7 | crowdsourced linguistics | http://jonathansoma.com/open-source-language-map |
| 8 | newsletter | https://tinyletter.com/jsoma |
| 9 | @dangerscarf | http://twitter.com/dangerscarf |
| 10 | jonathan.soma@gmail.com | mailto:jonathan.soma@gmail.com |


If [list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions) aren't your cup of tea, here's another way we could have done that:

In [11]:
soma_link_list = []

for link in soma_links:
    item_data = {
        "text": link.text_content(),
        "url": link.attrib["href"],
    }
    soma_link_list.append(item_data)

soma_link_list

[{'text': 'fake school', 'url': 'http://brooklynbrainery.com'},
 {'text': 'paid newsletter about hobbies', 'url': 'http://dabbles.in'},
 {'text': 'food', 'url': 'http://www.omgmsg.com'},
 {'text': 'baby-steps data science for journalists',
  'url': 'https://investigate.ai'},
 {'text': 'lonely young men', 'url': 'http://jonathansoma.com/singles'},
 {'text': 'rad old maps', 'url': 'http://handsomeatlas.com'},
 {'text': 'pancakes',
  'url': 'http://jonathansoma.com/notes/dosas-and-injera/'},
 {'text': 'crowdsourced linguistics',
  'url': 'http://jonathansoma.com/open-source-language-map'},
 {'text': 'newsletter', 'url': 'https://tinyletter.com/jsoma'},
 {'text': '@dangerscarf', 'url': 'http://twitter.com/dangerscarf'},
 {'text': 'jonathan.soma@gmail.com', 'url': 'mailto:jonathan.soma@gmail.com'}]

In [12]:
soma_link_df = pd.DataFrame(soma_link_list)
soma_link_df

| | text | url |
|---|---|---|
| 0 | fake school | http://brooklynbrainery.com |
| 1 | paid newsletter about hobbies | http://dabbles.in |
| 2 | food | http://www.omgmsg.com |
| 3 | baby-steps data science for journalists | https://investigate.ai |
| 4 | lonely young men | http://jonathansoma.com/singles |
| 5 | rad old maps | http://handsomeatlas.com |
| 6 | pancakes | http://jonathansoma.com/notes/dosas-and-injera/ |
| 7 | crowdsourced linguistics | http://jonathansoma.com/open-source-language-map |
| 8 | newsletter | https://tinyletter.com/jsoma |
| 9 | @dangerscarf | http://twitter.com/dangerscarf |
| 10 | jonathan.soma@gmail.com | mailto:jonathan.soma@gmail.com |
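
If you *did* want to remember which paragraph each link came from, the nested loop from the earlier exercise gets you most of the way there. Here's a minimal sketch, again on a made-up snippet (`SAMPLE` and the `"paragraph"` key are invented for illustration) and using the standard library's `xml.etree.ElementTree` so it runs offline. With `lxml`, the same loop shape works with `.cssselect(...)` and `.text_content()`.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for a page with two paragraphs.
SAMPLE = """
<body>
  <p>I run a <a href="http://example.com/school">fake school</a>.</p>
  <p>Maps: <a href="http://example.com/maps">rad <em>old</em> maps</a>
     and <a href="mailto:someone@example.com">email</a>.</p>
</body>
""".strip()

root = ET.fromstring(SAMPLE)

records = []
for i, p in enumerate(root.iter("p"), start=1):
    for a in p.iter("a"):
        records.append({
            "paragraph": i,                 # 1-based paragraph number
            "text": "".join(a.itertext()),  # all text inside the link, nested tags included
            "url": a.get("href"),
        })

for record in records:
    print(record)
```

Passing `records` to `pd.DataFrame(...)` would give you the same kind of table as above, plus a `paragraph` column.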


## Exercise: Add the URL's *protocol* to the DataFrame

(The protocol is the bit that comes before the `:`.)

In [13]:
soma_link_list = []

for link in soma_links:
    item_data = {
        "text": link.text_content(),
        "url": link.attrib["href"],
        "protocol": link.attrib["href"].split(":")[0],
    }
    soma_link_list.append(item_data)

soma_link_df = pd.DataFrame(soma_link_list)

soma_link_df

| | text | url | protocol |
|---|---|---|---|
| 0 | fake school | http://brooklynbrainery.com | http |
| 1 | paid newsletter about hobbies | http://dabbles.in | http |
| 2 | food | http://www.omgmsg.com | http |
| 3 | baby-steps data science for journalists | https://investigate.ai | https |
| 4 | lonely young men | http://jonathansoma.com/singles | http |
| 5 | rad old maps | http://handsomeatlas.com | http |
| 6 | pancakes | http://jonathansoma.com/notes/dosas-and-injera/ | http |
| 7 | crowdsourced linguistics | http://jonathansoma.com/open-source-language-map | http |
| 8 | newsletter | https://tinyletter.com/jsoma | https |
| 9 | @dangerscarf | http://twitter.com/dangerscarf | http |
| 10 | jonathan.soma@gmail.com | mailto:jonathan.soma@gmail.com | mailto |


In [14]:
soma_link_df["protocol"].value_counts()

protocol
http      8
https     2
mailto    1
Name: count, dtype: int64
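
Splitting on `:` works because every link on this page is an absolute URL. On a page with relative links (like `/notes/`), `.split(":")[0]` would hand back the whole string as the "protocol". Here's a small sketch of an alternative using the standard library's `urllib.parse.urlsplit`, which reports an empty scheme in that case. The relative URL and the `"(relative)"` label are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit

urls = [
    "http://brooklynbrainery.com",
    "https://investigate.ai",
    "mailto:jonathan.soma@gmail.com",
    "/notes/",  # made-up relative link, not on the real page
]

# What the split-on-colon approach returns for each URL
naive = [url.split(":")[0] for url in urls]

# What urlsplit reports; relative links get an empty string
schemes = [urlsplit(url).scheme for url in urls]

for url, n, s in zip(urls, naive, schemes):
    print(f"{url!r}: split gives {n!r}, urlsplit gives {s!r}")

# A plain-Python stand-in for value_counts()
scheme_counts = Counter(s or "(relative)" for s in schemes)
print(scheme_counts)
```

In the loop above, that would mean writing `urlsplit(link.attrib["href"]).scheme` for the `"protocol"` value.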

---
