## Automating Flat File Web Downloads & Scraping Web Data

### Downloading Flat Files from the Web

#### urllib

* Provides interface for fetching data from the web

* `urlopen()` - accepts URLs instead of file names
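
As a minimal sketch of how `urlopen()` behaves, the helper below opens a URL much like `open()` opens a file and returns the raw bytes. The helper name `fetch_bytes` and the `data:` URL are assumptions made for illustration (a `data:` URL carries its own content, so the example runs without network access); in practice you would pass an `http` or `https` URL.

```python
from urllib.request import urlopen


def fetch_bytes(url):
    """Return the raw bytes served at `url` (hypothetical helper)."""
    # urlopen() returns a file-like response object; using it as a context
    # manager closes it when the block exits.
    with urlopen(url) as response:
        return response.read()


# A data: URL embeds its content directly, so no network call is made here.
payload = fetch_bytes("data:text/plain,hello")
print(payload)
```
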

In [2]:
from urllib.request import urlretrieve
import pandas as pd

In [3]:
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

In [4]:
# download csv and save it to the working directory

urlretrieve(url, 'winequality-red.csv')

('winequality-red.csv', <http.client.HTTPMessage at 0x7f96f6e094a8>)

After these steps, the CSV file is in the working directory. It can be read into a DataFrame with the pandas `read_csv()` function.

If you need the file's full path to pass to `read_csv()`, run this command in bash:

    readlink -f <filename>

In [23]:
# read the csv file into a dataframe

df = pd.read_csv('/home/jlewis425/dsi_notes/winequality-red.csv', sep=';')

In [24]:
# inspect the top few rows of the data

df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


Alternatively, pandas can read the same file directly from the URL, without first saving it to the working directory. Pass the file's delimiter as `sep` (`;` for this file):

    df = pd.read_csv(url, sep=';')

pandas can also read Excel files from a URL directly into a DataFrame in the same way.

    


In [25]:
url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'

In [27]:
xl = pd.read_excel(url, sheet_name=None)

Setting `sheet_name=None` loads every sheet in the workbook. The result, `xl`, is a dictionary (an `OrderedDict` in this pandas version) that maps each sheet name to its DataFrame.

In [28]:
type(xl)

collections.OrderedDict

To see the sheet names, call the `.keys()` method on the dictionary.

In [29]:
print(xl.keys())

odict_keys(['1700', '1900'])


In [30]:
# inspect the first few lines of the '1700' sheet

print(xl['1700'].head())

                 country       1700
0            Afghanistan  34.565000
1  Akrotiri and Dhekelia  34.616667
2                Albania  41.312000
3                Algeria  36.720000
4         American Samoa -14.307000
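
Because `sheet_name=None` yields a dictionary keyed by sheet name, the sheets can be handled in a loop. The sketch below uses a small in-memory stand-in for `xl`: the `'1700'` rows are copied from the output above, while the `'1900'` values and the helper name `sheet_shapes` are placeholders invented for illustration.

```python
import pandas as pd


def sheet_shapes(sheets):
    """Map each sheet name to its (rows, columns) shape (hypothetical helper)."""
    return {name: frame.shape for name, frame in sheets.items()}


# Stand-in for the dict returned by pd.read_excel(url, sheet_name=None).
# The '1700' rows come from the output above; the '1900' values are placeholders.
xl_demo = {
    "1700": pd.DataFrame({"country": ["Afghanistan", "Albania"],
                          "1700": [34.565, 41.312]}),
    "1900": pd.DataFrame({"country": ["Afghanistan", "Albania"],
                          "1900": [0.0, 0.0]}),
}

shapes = sheet_shapes(xl_demo)
for name, shape in shapes.items():
    print(name, shape)
```

With the real workbook, `sheet_shapes(xl)` would report the shape of each sheet without any change to the helper.
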


### HTTP requests to import files from the web

### Basics of URLs

* Uniform Resource Locator

* References to web resources

* URLs most often refer to web addresses, but they can also point to resources reached through other schemes, such as File Transfer Protocol (FTP) or database access

* Components:
    
    * Protocol identifier: 'http:' or 'https:'
        
    * Resource name: websitename.com
        
* The combination of the two components uniquely specifies the web address
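
The two components can be pulled apart with the standard library's `urllib.parse.urlsplit`. This is a small sketch using the wine-quality URL from earlier; note that `urlsplit` further divides the resource name into a host (`netloc`) and a path.

```python
from urllib.parse import urlsplit

url = ("https://s3.amazonaws.com/assets.datacamp.com/production/"
       "course_1606/datasets/winequality-red.csv")
parts = urlsplit(url)

print(parts.scheme)   # protocol identifier (without the trailing colon)
print(parts.netloc)   # host portion of the resource name
print(parts.path)     # path to the file on that host
```
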

### HTTP

* HyperText Transfer Protocol

* Application protocol for distributed, collaborative, hypermedia information systems.

    * Less formally: a set of rules for transferring files (text, images, sound, video, etc.) on the World Wide Web.
    
    * Foundation of data communication for the World Wide Web.
    
* HTTPS - a more secure form of HTTP

* Visiting a website = sending an HTTP request

    * GET request
    
* `urlretrieve()` performs a GET request

* HTML - HyperText Markup Language

### GET requests using requests

* Working with `urllib` functions such as `urlretrieve()` can be awkward and laborious, so the higher-level **requests** package is far more popular.
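
A minimal sketch of a GET request with requests, assuming the package is installed (`pip install requests`). The helper name `get_text` and the 10-second timeout are illustrative choices, not part of the original notes. The last lines build the request without sending it, which shows the method and URL that `requests.get()` would transmit.

```python
import requests

url = ("https://s3.amazonaws.com/assets.datacamp.com/production/"
       "course_1606/datasets/winequality-red.csv")


def get_text(url, timeout=10):
    """Send a GET request and return the response body as text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx status codes
    return response.text


# Prepare (but do not send) the request to inspect what would go over the wire.
prepared = requests.Request("GET", url).prepare()
print(prepared.method, prepared.url)
```

Calling `get_text(url)` performs the actual download and needs network access.
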



### Web Scraping

* Web scraping often violates a website's Terms of Use
        
* One way to avoid attracting attention is to drive a real web browser, for example with Selenium:

        pip install selenium
        
* A browser driver is also needed, one for each browser being automated:

    * chromedriver (Chrome)
    
    * geckodriver (Firefox)
    
* Use the browser's Developer Tools to inspect the HTML structure of the target page

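#### CSS Selectors

* Always a string (so it will always be in quotes in Python)

* Format:

    1. Element type (e.g. `div`)
    
    2. `#` (for an id) or `.` (for a class)
    
    3. Name of the id or class

* Example: `div#main` selects the `<div>` whose id is `main`; `p.intro` selects every `<p>` with class `intro`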
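
The selector format can be tried out with Beautiful Soup's `select()` and `select_one()` methods. This is a sketch on made-up HTML; the ids, classes, and text are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
html = """
<div id="main">
  <p class="intro">First paragraph</p>
  <p class="intro">Second paragraph</p>
  <p class="footer">Footer text</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# element type + "#" + id name: matches the single <div id="main">
main_div = soup.select_one("div#main")

# element type + "." + class name: matches every <p class="intro">
intro_texts = [p.get_text() for p in soup.select("p.intro")]

print(main_div["id"])
print(intro_texts)
```
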


    


In [6]:
# import libraries

import certifi
import urllib3
from bs4 import BeautifulSoup

In [10]:
# create a PoolManager that verifies TLS certificates, then specify the url

http = urllib3.PoolManager(
    cert_reqs='CERT_REQUIRED',
    ca_certs=certifi.where())

# the url must be one string: a stray leading quote or a line break inside it
# makes the request fail (KeyError: '"https')
url = ('https://gutenberg.ca/ebooks/'
       'hemingwaye-oldmanandthesea/hemingwaye-oldmanandthesea-00-h.html')

In [8]:
# query the website and store the HTTP response in the variable `response`

response = http.request('GET', url)

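In [19]:
# parse the html with Python's built-in parser and store the result in `soup`

soup = BeautifulSoup(response.data, 'html.parser')

In [26]:
# find() treats a bare string as a tag name, so this looks for a <2017> tag
table = soup.find("2017")

In [27]:
# no such tag exists, so the result is None
print(table)

None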
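
The `None` above is expected: `find()` interprets its first argument as a tag name, and the page has no `<2017>` tag. As a sketch of the difference between searching by tag name and searching by text, here is the same call on made-up HTML (not the page requested above):

```python
import re

from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
html = "<table><tr><td>2017</td><td>42</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

missing = soup.find("2017")                          # looks for a <2017> tag
table = soup.find("table")                           # first <table> tag
cell = soup.find("td", string=re.compile("2017"))    # <td> whose text matches

print(missing)
print(table.name)
print(cell.get_text())
```
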
