# Exercise: Web Scraping with Beautiful Soup

### 1. Import the required libraries (BeautifulSoup, urllib, and re)

In [1]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

### 2. Read the content of the following [link](https://analytics.usa.gov) with the urlopen() function and create a BeautifulSoup object

In [2]:
html = urlopen('https://analytics.usa.gov')
soup = BeautifulSoup(html, "lxml")
type(soup)

bs4.BeautifulSoup

### 3. Display the HTML you obtained in the previous step

In [3]:
print(soup)

<!DOCTYPE html>
<html lang="en">
<!-- Initalize title and data source variables -->
<head>
<!--

    Hi! Welcome to our source code.

    This dashboard uses data from the Digital Analytics Program, a US
    government team inside the General Services Administration, an independent
    federal agency.


    For a detailed tech breakdown of how 18F and friends built this site:

    https://18f.gsa.gov/2015/03/19/how-we-built-analytics-usa-gov/


    This is a fully open source project, and your contributions are welcome.

    Frontend static site: https://github.com/GSA/analytics.usa.gov
    Backend data reporting: https://github.com/18F/analytics-reporter

    -->
<meta charset="utf-8"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="NjbZn6hQe7OwV-nTsa6nLmtrOUcSGPRyFjxm5zkmCcg" name="google-site-verification"/>
<link href="/css/vendor/css/uswds.v0.9.1.css" rel="stylesheet"/>
<link href="/css/public_analytics.css" rel="stylesheet"/>
<link href="/images/analytics-f

### 4. Describe the structure you observe in the page, and in particular the structure of the links

We can see that the page uses D3 to animate its charts and that it contains five levels of headings, comments, and links to other pages through the 'href' attribute.

### 5. Find the page's 'a' tags and print the link stored in each tag's 'href' attribute

In [5]:
links_pagina = soup.find_all('a')

In [7]:
for link in links_pagina:
    print(link.get('href'))

/
#explanation
https://analytics.usa.gov/data/
data/
#top-pages-realtime
#top-pages-7-days
#top-pages-30-days
https://analytics.usa.gov/data/live/all-pages-realtime.csv
https://analytics.usa.gov/data/live/all-domains-30-days.csv
https://www.digitalgov.gov/services/dap/
https://www.digitalgov.gov/services/dap/common-questions-about-dap-faq/#part-4
https://support.google.com/analytics/answer/2763052?hl=en
https://analytics.usa.gov/data/live/second-level-domains.csv
https://analytics.usa.gov/data/live/sites.csv
mailto:DAP@support.digitalgov.gov
https://github.com/GSA/analytics.usa.gov
https://github.com/18F/analytics-reporter
https://github.com/GSA/analytics.usa.gov/issues
mailto:DAP@support.digitalgov.gov
https://analytics.usa.gov/data/
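
The output above mixes absolute URLs, relative paths, in-page fragments, and `mailto:` addresses. As a rough sketch of how those kinds could be told apart, the block below groups a few of the printed values using `urlparse` from the standard library. The `classify` helper and its labels are hypothetical names chosen for this illustration; they are not part of the exercise.

```python
from collections import defaultdict
from urllib.parse import urlparse

# A few href values copied from the output above.
hrefs = [
    "/",
    "#explanation",
    "data/",
    "https://analytics.usa.gov/data/",
    "mailto:DAP@support.digitalgov.gov",
    "https://github.com/GSA/analytics.usa.gov",
]


def classify(href):
    """Return a rough label for an href (hypothetical helper)."""
    scheme = urlparse(href).scheme
    if scheme:
        # Absolute references carry a scheme such as 'https' or 'mailto'.
        return scheme
    if href.startswith("#"):
        # A bare fragment points to an anchor inside the same page.
        return "fragment"
    # Everything else is a path relative to the current page or site root.
    return "relative"


groups = defaultdict(list)
for href in hrefs:
    groups[classify(href)].append(href)

for kind, items in groups.items():
    print(kind, items)
```

Only the values grouped under `https` would survive a filter for links that begin with 'http'.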


### 6. Create a regular expression that matches strings beginning with 'http'

In [8]:
regex_http = re.compile("^http")
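
To see what this pattern accepts before applying it to the page, the sketch below tests it against a handful of strings. Most are taken from the link output of step 5; `http://example.com` and the sentence beginning with "see" are made-up samples added only to show the effect of the `^` anchor.

```python
import re

# Same pattern as in the cell above: '^' anchors the match at the start.
regex_http = re.compile("^http")

samples = [
    "https://analytics.usa.gov/data/",    # from the page
    "http://example.com",                 # made-up sample
    "data/",                              # relative path from the page
    "#explanation",                       # fragment from the page
    "mailto:DAP@support.digitalgov.gov",  # from the page
    "see https://example.com",            # made-up: 'http' not at the start
]

# search() honors the '^' anchor, so only strings that start with 'http' match.
matches = [s for s in samples if regex_http.search(s)]
print(matches)
```

Note that `^http` also accepts any other string that merely starts with those four letters; a stricter pattern such as `^https?://` would be one way to require a full URL scheme, if that matters for your data.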

### 7. Find and print only the links that begin with "http"

In [17]:
links_http = soup.find_all('a', attrs={'href': regex_http})

In [26]:
links_http[0].attrs['href']

'https://analytics.usa.gov/data/'

In [30]:
# Collect the href value of every matching tag
links = [tag.attrs['href'] for tag in links_http]

In [32]:
for link in links: 
    print(link)

https://analytics.usa.gov/data/
https://analytics.usa.gov/data/live/all-pages-realtime.csv
https://analytics.usa.gov/data/live/all-domains-30-days.csv
https://www.digitalgov.gov/services/dap/
https://www.digitalgov.gov/services/dap/common-questions-about-dap-faq/#part-4
https://support.google.com/analytics/answer/2763052?hl=en
https://analytics.usa.gov/data/live/second-level-domains.csv
https://analytics.usa.gov/data/live/sites.csv
https://github.com/GSA/analytics.usa.gov
https://github.com/18F/analytics-reporter
https://github.com/GSA/analytics.usa.gov/issues
https://analytics.usa.gov/data/


### 8. Use the open function to create a file named parsed_data.txt

In [37]:
file = open('parsed_data.txt', 'w')

### 9. Write to the file all the links that begin with 'http'

In [38]:
for link in links:
    # Append a newline so each link ends up on its own line in the file
    file.write(link + '\n')

### 10. Close the file and use the !ls shell command to show that you created a new file 'parsed_data.txt' in your current directory

In [39]:
file.close()

In [36]:
!ls

Ejercicio6_Pandas_Series-de-Tiempo_Solucion.ipynb
Ejercicio7_Pandas_Series-de-Tiempo_Solucion.ipynb
Ejercicio8_Pandas_Series-de-Tiempo_Solucion.ipynb
Soluciones_Ejercicio5_WebScraping.ipynb
parsed_data.txt
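
Steps 8 to 10 can also be sketched with a `with` block, which closes the file automatically even if a write fails, and with `pathlib` in place of `!ls` to confirm the file exists. This is an alternative to the cells above, not the exercise's own solution. The two links are copied from the output of step 7, and the temporary directory is an assumption of this example so that it does not overwrite an existing `parsed_data.txt`.

```python
import tempfile
from pathlib import Path

# Two of the links printed in step 7, used here as sample data.
links = [
    "https://analytics.usa.gov/data/",
    "https://github.com/GSA/analytics.usa.gov",
]

# Assumption: write into a fresh temporary directory rather than the
# notebook's working directory.
out_path = Path(tempfile.mkdtemp()) / "parsed_data.txt"

# The file is closed when the block exits, so no explicit close() is needed.
with open(out_path, "w", encoding="utf-8") as f:
    for link in links:
        f.write(link + "\n")

# Portable check that the file was created, then read it back line by line.
print(out_path.exists())
lines_read = out_path.read_text(encoding="utf-8").splitlines()
print(lines_read)
```

Reading the file back is a quick way to confirm that each link landed on its own line.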
