# HTML and CSS - Parsing Markup

## HTML

HTML is the semantic structure of a website and is composed of elements denoted by specific tags. A tag looks like this:

```html
<tagname>tag content</tagname>
```

Most elements have an opening tag and a closing tag. The `/` marks the closing tag, which ends the element.

```html
<p>I'm a paragraph</p>
```

Some elements have no closing tag at all. The image element, for example, holds no content and so does not need one:

```html
<img src="https://some.url/coolimage.jpg">
```

There are many different tags, each with its own semantic meaning ([you can read more here](https://www.hongkiat.com/blog/html-5-semantics/)). However, how sites use them varies, and websites do not all follow the same conventions. Below are some widely used tags:

* `p`: paragraph of text
* `a`: a link
* `ul`: an unordered (bulleted) list
* `li`: a list item
* `strong`: important text (typically displays bold)
* `h1` through `h6`: heading levels 1 through 6 
* `div`: a widely used tag that signifies a division or section of content

It's worth noting that of all these tags, the `div` tag is a "pure" container and does not inherently represent anything semantically. Typically it is used in order to apply styling with CSS, which we will cover later.
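To see these tags from Python's point of view, here is a minimal sketch that lists every opening tag in a snippet using the standard library's `html.parser`. The snippet and the `TagLister` class name are made up for illustration.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Record the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called once for each opening tag, including <img>
        self.tags.append(tag)


# Hypothetical snippet, used only for illustration
snippet = """
<div>
    <h1>Free stuff</h1>
    <p>Read the <a href="https://some.url/rules">rules</a> first.</p>
    <ul>
        <li><strong>Bookshelf</strong></li>
        <li>Toaster</li>
    </ul>
    <img src="https://some.url/coolimage.jpg">
</div>
"""

lister = TagLister()
lister.feed(snippet)
print(lister.tags)
```

The parser reports the `img` tag like any other opening tag, even though it has no closing tag.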

## Attributes

Attributes are extra bits of information that appear inside the opening tag - their values sit inside quotation marks. These values configure the elements or adjust their behavior in various ways.

```html
<tagname attributename="value">tag content</tagname>
```

Some attributes are specific to certain tags, while others can be applied to any tag. We will look at the two most important such attributes for our purposes - `id` and `class`.

The `id` attribute gives a **unique** id to an element. This attribute is traditionally used for accessing elements with JavaScript. There should only be **one** element with any given `id`. We'll soon see why this level of specificity is useful.

The `class` attribute is primarily used for styling elements. Multiple tags of different types can share the same class name.

```html
<ul id="the_list">
    <li class="list_element">The data we want</li>
    <li class="list_element">The data we want</li>
    <li class="list_element">The data we want</li>
</ul>
```

Other attributes only apply to certain tags. One example is the image element's `src` attribute, which holds the URL of the image:

```html
<img src="https://some.url/coolimage.jpg">
```

Another is the `a` tag's `href` attribute, which indicates where the browser should navigate when the link is clicked.

```html
<a href="https://mycoollink.com">check out my cool link</a>
```
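Attributes are what a scraper usually reads. As a rough sketch, the standard library's `html.parser` hands each opening tag's attributes to `handle_starttag` as a list of `(name, value)` pairs. The `AttributeCollector` class below is a hypothetical helper written for this example, and the snippet reuses the markup shown above.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect (tag, attributes) pairs for every opening tag."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.elements.append((tag, dict(attrs)))


snippet = """
<ul id="the_list">
    <li class="list_element">The data we want</li>
    <li class="list_element">The data we want</li>
</ul>
<a href="https://mycoollink.com">check out my cool link</a>
<img src="https://some.url/coolimage.jpg">
"""

collector = AttributeCollector()
collector.feed(snippet)

for tag, attributes in collector.elements:
    print(tag, attributes)

# Pull out just the link targets
hrefs = [attributes.get("href") for tag, attributes in collector.elements if tag == "a"]
print(hrefs)
```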

## Webpage Structure

Most webpage documents contain the following tags:

* `html`: This surrounds the entire document
* `head`: Contains the title and other information such as links to style sheets and JavaScript. It does not render any visible content.
* `title`: The title of the page shown in the browser tab
* `body`: Contains all visible elements on the website

An example HTML document:

```html
<html>
    <head>
        <title>Some website with data</title>
    </head>
    <body>
        <h1>Welcome to our website</h1>
        <p>Please do not scrape this well formatted data</p>
        
        <ul>
            <li>beautiful</li>
            <li>delicious</li>
            <li>data</li>
        </ul>
    </body>
</html>
```

HTML elements are nested within each other, indicating hierarchy. The indentation is optional but helps to show that hierarchy visually.
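The nesting can be made explicit by tracking depth while parsing. The sketch below is a simplification: it assumes every element has a closing tag, which holds for the example document above but not for elements such as `img`. `OutlinePrinter` is a hypothetical name.

```python
from html.parser import HTMLParser


class OutlinePrinter(HTMLParser):
    """Record each tag together with how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1  # everything until the closing tag is one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1


document = """
<html>
    <head>
        <title>Some website with data</title>
    </head>
    <body>
        <h1>Welcome to our website</h1>
        <p>Please do not scrape this well formatted data</p>
        <ul>
            <li>beautiful</li>
            <li>delicious</li>
            <li>data</li>
        </ul>
    </body>
</html>
"""

printer = OutlinePrinter()
printer.feed(document)

for depth, tag in printer.outline:
    print("    " * depth + tag)
```

The printed outline reproduces the indentation of the source, even though the parser ignores whitespace and works only from the opening and closing tags.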

## Using the Chrome inspector 

We will use the Chrome developer tools to inspect our web pages, but Firefox, Safari, and Edge also have developer tools you can use. Chrome and Firefox are generally well regarded.

## Isolating elements with query selector

CSS is applied to HTML elements using selectors, also called query selectors. We will use the same selectors to pick out HTML elements in Python.
[Read more here](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors)
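To give a feel for what a selector does, here is a toy matcher for the simplest selector forms: `tag`, `#id`, `.class`, and combinations such as `tag.class`. It is only a sketch built on the standard library, not how browsers or scraping libraries implement selectors, and the names `matches`, `select`, and `ElementCollector` are invented for this example.

```python
import re
from html.parser import HTMLParser

# Matches an optional tag name, then an optional #id, then an optional .class
SELECTOR = re.compile(
    r"^(?P<tag>[a-zA-Z0-9]*)(?:#(?P<id>[\w-]+))?(?:\.(?P<cls>[\w-]+))?$"
)


def matches(selector, tag, attrs):
    """Return True if an element matches a simple selector."""
    parts = SELECTOR.match(selector)
    if parts is None:
        raise ValueError(f"unsupported selector: {selector}")
    if parts["tag"] and parts["tag"] != tag:
        return False
    if parts["id"] and attrs.get("id") != parts["id"]:
        return False
    # An element can carry several space-separated class names
    classes = (attrs.get("class") or "").split()
    if parts["cls"] and parts["cls"] not in classes:
        return False
    return True


class ElementCollector(HTMLParser):
    """Collect (tag, attributes) pairs for every opening tag."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        self.elements.append((tag, dict(attrs)))


def select(selector, elements):
    """Return the elements that match the selector."""
    return [(tag, attrs) for tag, attrs in elements if matches(selector, tag, attrs)]


# Hypothetical markup for illustration
snippet = """
<ul id="the_list">
    <li class="list_element">one</li>
    <li class="list_element">two</li>
    <li class="list_element">three</li>
</ul>
<a class="result-image" href="https://some.url/listing.html">listing</a>
"""

collector = ElementCollector()
collector.feed(snippet)
elements = collector.elements

print(select("li.list_element", elements))  # every li with that class
print(select("#the_list", elements))        # the single element with that id
print(select("a.result-image", elements))   # a tags with that class
```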

## Using Python to Scrape Websites

This section covers the basics of web scraping with Python.

You can use Python's standard library to scrape the web, and there are a number of other packages available for scraping in Python and other languages. We will use `requests-html` due to its ease of use.

First we need to install `requests-html`:

```console
$ pip3 install requests-html
```

The basic workflow for scraping is to download HTML from a url, and then parse the resulting HTML using query selectors.

```python
from requests_html import HTMLSession

# First we create a session
session = HTMLSession()

# Then we use session.get to get the contents of the url
response = session.get('https://newyork.craigslist.org/d/free-stuff/search/zip')
```

```python
# First we get all the URLs for all of the listings
# Looking at the page, we can see that the links are
# a tags with the "result-image" class applied. By using this
# query selector we get a list of all the links

link_tags = response.html.find("a.result-image")
```

```python
# Then we can loop through all the links and grab
# the href attribute from the a tag

links = []

for link in link_tags:
    links.append(link.attrs.get('href'))
```
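Because `attrs.get('href')` returns `None` when a tag has no `href`, and some sites use relative links, it can help to tidy the list before requesting each page. The sketch below is one possible approach using `urllib.parse.urljoin`. The `clean_links` helper and the example hrefs are hypothetical.

```python
from urllib.parse import urljoin


def clean_links(hrefs, base_url):
    """Drop missing hrefs, make relative ones absolute, and remove duplicates."""
    seen = set()
    cleaned = []
    for href in hrefs:
        if not href:
            continue  # skip None and empty strings
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned


base_url = 'https://newyork.craigslist.org/d/free-stuff/search/zip'

# Hypothetical hrefs for illustration
raw_hrefs = [
    '/mnh/zip/d/example/123.html',
    None,
    'https://newyork.craigslist.org/mnh/zip/d/example/123.html',
    'https://newyork.craigslist.org/brk/zip/d/other/456.html',
]

print(clean_links(raw_hrefs, base_url))
```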

```python
# We can use the pandas library to keep track of our data

import pandas as pd

clist_df = pd.DataFrame(columns=['id', 'images', 'title', 'description'])
clist_df
```

|  | id | images | title | description |
| --- | --- | --- | --- | --- |


```python
# Next we loop through our links and grab the HTML for each link

for link in links:
    # Web scraping often breaks due to small errors, so it can
    # be easiest to wrap our code in a try block and catch the errors
    try:
        response = session.get(link)

        page_id = link.split('/')[-1].replace('.html', '')

        images = [img.attrs.get('src') for img in response.html.find('.slide img')]

        title = response.html.find('#titletextonly', first=True).text

        description = response.html.find("#postingbody", first=True).text

        # DataFrame.append was removed in pandas 2.0, so we add the row with pd.concat
        row = pd.DataFrame([{'id': page_id,
            'images': images,
            'title': title,
            'description': description}])
        clist_df = pd.concat([clist_df, row], ignore_index=True)

    except Exception as e:
        # Print the error
        print(e)
        # and print the associated link
        print(f'failed on {link}')
```



```text
'NoneType' object has no attribute 'text'
failed on https://newyork.craigslist.org/mnh/zip/d/new-york-counter-top-oven/6989340916.html
'NoneType' object has no attribute 'text'
failed on https://newyork.craigslist.org/brk/zip/d/brooklyn-closet-ikea/6989188399.html
```
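The errors above occur because `find(..., first=True)` returned `None` for those pages, so reading `.text` failed. One way to keep such rows instead of skipping them is a small guard like the sketch below. `text_or_default` is a hypothetical helper, and `SimpleNamespace` stands in for a real element so the example runs without a network connection.

```python
from types import SimpleNamespace


def text_or_default(element, default=None):
    """Return element.text, or `default` when the selector matched nothing."""
    if element is None:
        return default
    return element.text


# Stand-ins for the result of response.html.find(selector, first=True)
found = SimpleNamespace(text='Toilet rack')
missing = None

print(text_or_default(found))
print(text_or_default(missing, default=''))
```

In the loop, this would replace the direct `.text` access, for example `title = text_or_default(response.html.find('#titletextonly', first=True))`.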


```python
# We can see we got all the data below!
# Unfortunately, only the first image of the slideshow renders
clist_df
```

|  | id | images | title | description |
| --- | --- | --- | --- | --- |
| 0 | 6989347730 | `[https://images.craigslist.org/00y0y_dzCOy8akK...` | Free IKEA BOOKSHELF and Wooden Dresser | QR Code Link to This Post\nIKEA shelf and Wood... |
| 1 | 6989344686 | `[https://images.craigslist.org/00u0u_3Bxy7UzgV...` | Baby Food | QR Code Link to This Post\nI have new and unop... |
| 2 | 6988069432 | `[https://images.craigslist.org/00N0N_YRPBBlU5o...` | Two free CD cases | QR Code Link to This Post\nNo discs, just cases |
| 3 | 6988070585 | `[https://images.craigslist.org/00C0C_5VNAb0tss...` | Free DVD about managing genital herpes | QR Code Link to This Post\nCome and get it |
| 4 | 6988074993 | `[https://images.craigslist.org/00D0D_JGwaEemNM...` | 8 Sony DVD+R new blank cds with cases | QR Code Link to This Post\nBrand new\nCome and... |
| ... | ... | ... | ... | ... |
| 113 | 6987426885 | `[https://images.craigslist.org/00i0i_kWJStA00L...` | Large Rubbermaid full of Goods | QR Code Link to This Post\nLarge Rubbermaid fu... |
| 114 | 6989121965 | `[https://images.craigslist.org/00606_8BBJRIePd...` | Toilet rack | QR Code Link to This Post\nFits over a toilet ... |
| 115 | 6980930696 | `[https://images.craigslist.org/00R0R_3JSNlmFVb...` | 2 Couches good condition | QR Code Link to This Post\n2 Couches good cond... |
| 116 | 6989118778 | `[https://images.craigslist.org/01313_7k1wogZht...` | Need gone asap | QR Code Link to This Post\nNeed these gone asap. |


```python
# We can use urllib to download the image urls
# into a folder named "scraped_images"

import os
import urllib.request

# Create the folder if it does not exist yet
os.makedirs('scraped_images', exist_ok=True)

for index, row in clist_df.iterrows():

    for url in row['images']:
        print(url)
        file_name = url.split('/')[-1]
        print(file_name)
        urllib.request.urlretrieve(url, 'scraped_images/' + file_name)
```

```text
https://images.craigslist.org/00y0y_dzCOy8akKXS_600x450.jpg
00y0y_dzCOy8akKXS_600x450.jpg
https://images.craigslist.org/00u0u_3Bxy7UzgVEr_600x450.jpg
00u0u_3Bxy7UzgVEr_600x450.jpg
https://images.craigslist.org/00N0N_YRPBBlU5ox_600x450.jpg
00N0N_YRPBBlU5ox_600x450.jpg
https://images.craigslist.org/00C0C_5VNAb0tss7m_600x450.jpg
00C0C_5VNAb0tss7m_600x450.jpg
https://images.craigslist.org/00D0D_JGwaEemNMX_600x450.jpg
00D0D_JGwaEemNMX_600x450.jpg
https://images.craigslist.org/00Z0Z_50HBOTnAMAS_600x450.jpg
00Z0Z_50HBOTnAMAS_600x450.jpg
https://images.craigslist.org/00A0A_5dlH6mCdHkU_600x450.jpg
00A0A_5dlH6mCdHkU_600x450.jpg
https://images.craigslist.org/00Q0Q_cmr2VjIaJSx_600x450.jpg
00Q0Q_cmr2VjIaJSx_600x450.jpg
https://images.craigslist.org/01616_3RD8ZSKwPsf_600x450.jpg
01616_3RD8ZSKwPsf_600x450.jpg
https://images.craigslist.org/00Q0Q_fNnY30Cndfj_600x450.jpg
00Q0Q_fNnY30Cndfj_600x450.jpg
https://images.craigslist.org/00808_bJ9qnQCtLrN_600x450.jpg
00808_bJ9qnQCtLrN_600x450.jpg
https://images

https://images.craigslist.org/01515_k8Gznwt6Oxn_600x450.jpg
01515_k8Gznwt6Oxn_600x450.jpg
https://images.craigslist.org/00g0g_3UurbAfi3Dt_600x450.jpg
00g0g_3UurbAfi3Dt_600x450.jpg
https://images.craigslist.org/00n0n_cMAk96frP2y_600x450.jpg
00n0n_cMAk96frP2y_600x450.jpg
https://images.craigslist.org/00404_gR73cwdmOvn_600x450.jpg
00404_gR73cwdmOvn_600x450.jpg
https://images.craigslist.org/00F0F_3VRsnGLYXsJ_600x450.jpg
00F0F_3VRsnGLYXsJ_600x450.jpg
https://images.craigslist.org/01212_89gBBIZ1vCD_600x450.jpg
01212_89gBBIZ1vCD_600x450.jpg
https://images.craigslist.org/00i0i_kWJStA00Lfu_600x450.jpg
00i0i_kWJStA00Lfu_600x450.jpg
https://images.craigslist.org/00606_8BBJRIePdAc_600x450.jpg
00606_8BBJRIePdAc_600x450.jpg
https://images.craigslist.org/00R0R_3JSNlmFVbUu_600x450.jpg
00R0R_3JSNlmFVbUu_600x450.jpg
https://images.craigslist.org/01313_7k1wogZhtbY_600x450.jpg
01313_7k1wogZhtbY_600x450.jpg
https://images.craigslist.org/00G0G_bzTZwFt6ZO3_600x450.jpg
00G0G_bzTZwFt6ZO3_600x450.jpg
```
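Splitting the URL on `/` works for the image URLs shown here, but it would keep any query string in the file name. A slightly more defensive sketch, assuming the same `scraped_images` folder, parses the URL first. The `local_path_for` helper is hypothetical, and the second URL is a made-up variant with a query string.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def local_path_for(url, folder='scraped_images'):
    """Build the local file path for an image URL, ignoring any query string."""
    # urlparse separates the path from the query, so "?size=large" is dropped
    file_name = PurePosixPath(urlparse(url).path).name
    return Path(folder) / file_name


print(local_path_for('https://images.craigslist.org/00y0y_dzCOy8akKXS_600x450.jpg'))

# Hypothetical URL with a query string, for illustration only
print(local_path_for('https://some.url/coolimage.jpg?size=large'))
```

The result can be passed to `urllib.request.urlretrieve` as the destination once the folder exists.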
