# A BeautifulSoup Boilerplate 
(Let's retrieve content from a web page)

A couple of notes on Beautiful Soup:
* It does not execute JavaScript
* It lets us use Cascading Style Sheets (CSS) selectors


In the following notebook I will show how to:

1. Retrieve the page
2. Select specific HTML nodes
3. Save the content in a CSV file

## Import and preparation

In [16]:
import requests
from bs4 import BeautifulSoup

import pandas as pd
from pathlib import Path

# The User-Agent header lets us present the request as if it came from a web browser
USER_AGENT_HEADER = {'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 9_3_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13E238 Safari/601.1'}
# URL is the page we will scrape data from
URL = "https://github.com/linediconsine"
# Where I want to save the data
CSV_FILE = "simple_test.csv"

## 1. Retrieve the page content
(And yes, that takes just 2 lines of code)

In [11]:
page = requests.get(URL, headers=USER_AGENT_HEADER)
soup = BeautifulSoup(page.content, 'html5lib')
# Note: 'html5lib' was the parser that worked best with CSS selectors in my tests
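
The two lines above assume the server answers with a successful response. A minimal sketch of a more defensive variant follows, assuming you want a timeout and an explicit error on HTTP failures. The `fetch_soup` helper name, the 10-second timeout and the built-in `'html.parser'` default are my assumptions, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, headers=None, timeout=10, parser="html.parser"):
    """Download `url` and return a parsed BeautifulSoup tree.

    Hypothetical helper: raises requests.HTTPError on 4xx/5xx responses
    instead of silently parsing an error page.
    """
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return BeautifulSoup(response.content, parser)
```

It could be called as `soup = fetch_soup(URL, headers=USER_AGENT_HEADER, parser='html5lib')` to keep the parser used in this notebook.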

## 2. Select specific HTML nodes

### CSS Selector

I will use CSS selectors because they are simpler and faster to work with than the other ways of selecting nodes.
A CSS selector is the part of a CSS rule set that selects the content you want to style; here we use it to select the content we want to extract. Let's look at some examples:

**Syntax:**

```     soup.select('CSS SELECTOR')   ```

### Common CSS selectors

* Class = .className

```python
soup.select('.slideshow-intro-content')
```

* Id = #idName

```python
soup.select('#elementId')
```

* Attribute = element[attribute-name]

```python
soup.select('div[data-timestamp]')
```

* *Concatenate (every node nested inside an element)*

```python
soup.select('.slideshow-intro-content *')
```

* *Multiple nodes in the same selection*

```python
soup.select(".post-content p, .post-content li")
```
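
The selector forms listed above can be tried without any network access on a small inline snippet. This is only a sketch: the HTML and the `demo_soup` name are made up for illustration, and it uses Python's built-in `'html.parser'` so that it runs without `html5lib` installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustration
html = """
<div class="post-content" id="main" data-timestamp="1500000000">
  <p>First paragraph</p>
  <ul><li>First item</li></ul>
</div>
<div class="sidebar"><p>Side note</p></div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

by_class = demo_soup.select(".post-content")        # by class
by_id = demo_soup.select("#main")                   # by id
by_attr = demo_soup.select("div[data-timestamp]")   # attribute is present
descendants = demo_soup.select(".post-content *")   # every nested node
multiple = demo_soup.select(".post-content p, .post-content li")

print([tag.name for tag in descendants])     # ['p', 'ul', 'li']
print([tag.get_text() for tag in multiple])  # ['First paragraph', 'First item']
```

Every call returns a list of nodes, which is empty when nothing matches.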

**References**

*CSS selectors guide*
https://www.w3schools.com/cssref/css_selectors.asp

*Beautiful Soup 4 Cheatsheet*
http://akul.me/blog/2016/beautifulsoup-cheatsheet/



### Retrieve a single element

In [36]:
# Retrieve a single element

# NOTE: This approach avoids a crash if the element is missing
user_name = ""
if soup.select('h1.vcard-names .p-name'):
    user_name = soup.select('h1.vcard-names .p-name')[0].get_text()

# This code, instead, will crash if the element doesn't exist:
# user_name = soup.select('h1.vcard-names .p-name')[0].get_text()
display(user_name)

'Marco A'
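
Beautiful Soup also provides `select_one()`, which returns the first match or `None`, so the check-then-index pattern can be wrapped in a small function. The following is a sketch on made-up HTML; the `text_or_default` helper is a hypothetical name, not something the notebook or the library defines.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the structure selected in this notebook
html = '<h1 class="vcard-names"><span class="p-name">Marco A</span></h1>'
demo_soup = BeautifulSoup(html, "html.parser")


def text_or_default(soup, selector, default=""):
    """Return the text of the first match, or `default` if nothing matches."""
    node = soup.select_one(selector)  # None when there is no match
    return node.get_text(strip=True) if node is not None else default


print(text_or_default(demo_soup, "h1.vcard-names .p-name"))  # Marco A
print(repr(text_or_default(demo_soup, ".does-not-exist")))   # ''
```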

### Retrieve multiple elements

In [34]:
# Retrieve multiple elements (note: the output is a list of HTML nodes)

array_repo = soup.select('.repo')
display(array_repo)

[<span class="repo js-repo" title="UkuTuna">UkuTuna</span>,
 <span class="repo js-repo" title="iPython-Snippets">iPython-Snippets</span>,
 <span class="repo js-repo" title="TheFoxGame">TheFoxGame</span>]

In [33]:
# Extract the text content from each HTML node

for repo in array_repo:
    display(repo.get_text())

'UkuTuna'

'iPython-Snippets'

'TheFoxGame'

## 3. Save the content in a CSV file

### Create a DataFrame

In [47]:
d = {'user_name': user_name , 'repo' : array_repo[0].get_text()}
df = pd.DataFrame(data=d, index=[0])
display(df)

|   | repo | user_name |
|---|------|-----------|
| 0 | UkuTuna | Marco A |
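
The DataFrame in this section keeps only the first repository. If one row per repository is wanted instead, a sketch could look like the following. The literal values stand in for the scraped `user_name` and repo texts, and `repo_names`, `rows` and `df_all` are illustrative names of my own.

```python
import pandas as pd

# Stand-ins for the values scraped earlier in the notebook
user_name = "Marco A"
repo_names = ["UkuTuna", "iPython-Snippets", "TheFoxGame"]

# One row per repository; the user name is repeated on every row
rows = [{"user_name": user_name, "repo": name} for name in repo_names]
df_all = pd.DataFrame(rows, columns=["user_name", "repo"])

print(df_all.shape)  # (3, 2)
```

In the notebook itself, `repo_names` could be built with `[repo.get_text() for repo in array_repo]`.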


### Save the df DataFrame in a CSV file
* Create the CSV file if it doesn't exist
* Append a new row to the CSV file if it already exists

In [40]:
df.to_csv(CSV_FILE, mode='a', header=not Path(CSV_FILE).is_file(), encoding="utf-16")
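
The same create-or-append logic can be written without pandas. The sketch below swaps in the standard library `csv` module, which is not what the notebook uses. The `append_row` helper name and the UTF-8 encoding are my assumptions.

```python
import csv
from pathlib import Path


def append_row(csv_path, row, fieldnames):
    """Append `row` (a dict) to `csv_path`, writing the header only once."""
    path = Path(csv_path)
    write_header = not path.is_file()  # header only for a brand-new file
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```

Calling `append_row("simple_test.csv", {"user_name": "Marco A", "repo": "UkuTuna"}, ["user_name", "repo"])` twice would leave one header line followed by two data rows.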