# Web Scraping 
## A Gentle Introduction

### The Problem


- Visit the [NIST Statistical Reference Data Set](https://www.itl.nist.gov/div898/strd/anova/SiRstv.html)
- Use the data in [Data File in Two-Column Format](https://www.itl.nist.gov/div898/strd/anova/SiRstv.dat)
- Reproduce the [image](https://www.itl.nist.gov/div898/strd/anova/SiRstv.gif)


### Manual Solution
- Download file to local drive
- Edit in text editor to remove comments
- Read using standard data table functions

### Python Solution

In [None]:
import pandas
tbl = pandas.read_table('./SiRstv.dat', sep=r'\s+')  # raw string keeps the regex backslash intact
tbl.plot.scatter(x='Instrument', y='Resistance')

In [None]:
import matplotlib.pyplot as plt
plt.plot(tbl['Instrument'], tbl['Resistance'],'ko')
plt.ylabel('Resistivity, ohm*cm')
plt.xlabel('Instrument')
plt.show()

Next Steps
========================================================

We're still a long way from web scraping. We had to download a file from the web and edit it by hand into an appropriate format, deleting the documentation. That's acceptable for a small example, but more interesting data will be larger and dynamic.

Our next step towards web scraping will be to read files directly from the web. 



## Skipping lines

In [None]:
tbl = pandas.read_table('http://www.itl.nist.gov/div898/strd/anova/SiRstv.dat', sep=r'\s+', skiprows=59)
tbl.rename(columns = {'Data:':'Instrument', 'Instrument':'Resistance', 
                              'Resistance':'Blank'}, inplace = True) 
plt.plot(tbl['Instrument'], tbl['Resistance'],'ko')
plt.ylabel('Resistivity, ohm*cm')
plt.xlabel('Instrument')
plt.show()
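To see what `skiprows` and the column rename are doing, here is a minimal sketch on an in-memory stand-in. The `sample` text and its values are made up for illustration and are not the NIST data; the header row has three tokens while each data row has two, which is why the third column comes out empty.

```python
import io
import pandas

# Made-up stand-in for a file with documentation above the data (not the NIST values).
sample = (
    "Example data set\n"
    "Some documentation line\n"
    "More documentation\n"
    "Data:  Instrument  Resistance\n"
    "1  10.5\n"
    "2  11.5\n"
)

# Skip the three documentation lines so the 'Data:' row becomes the header.
demo = pandas.read_table(io.StringIO(sample), sep=r'\s+', skiprows=3)

# The header has one label too many, so shift the names left by one
# and park the leftover (all-NaN) column under 'Blank'.
demo = demo.rename(columns={'Data:': 'Instrument',
                            'Instrument': 'Resistance',
                            'Resistance': 'Blank'})
print(demo)
```

The hard-coded row count is the weak point: it has to be found by inspecting the file first.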

We can also read the whole file in a single call:

In [None]:
# The with-block closes the file even if read() raises
with open('./SiRstv.dat', 'r') as fin:
    txt = fin.read()
print(txt)

and via URL

In [None]:
import urllib.request  # the request submodule must be imported explicitly
url = urllib.request.urlopen('http://www.itl.nist.gov/div898/strd/anova/SiRstv.dat')
txt = url.read()
url.close()
print(txt)

Note that by default the URL is read as a binary stream. We'll want to decode to text later.
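As a small illustration of that decoding step, the sketch below uses a made-up `bytes` literal as a stand-in for what `url.read()` returns. It assumes the content is UTF-8; a different file might need another encoding.

```python
# Made-up stand-in for the bytes returned by url.read() (not the NIST values).
raw = b"Data:  Instrument  Resistance\n1  10.5\n"
print(type(raw))

# Decode the binary stream into text, assuming UTF-8.
text = raw.decode('utf-8')
print(type(text))

# Once decoded, ordinary string methods apply.
rows = text.splitlines()
print(rows[0].split())
```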

But first, consider that if files are large, we may want to scan only part of the file. This we can do with


In [None]:
import re
url = urllib.request.urlopen('http://www.itl.nist.gov/div898/strd/anova/SiRstv.dat')
lines = 0
while True:
    line = url.readline().decode('utf-8')
    if not line:
        break
    if re.match('Data:', line) is None:
        lines += 1
    else:
        break
print(lines)
url.close()
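The same scan can be tried without a network connection. The sketch below is a hypothetical helper, `count_lines_before`, run against an `io.BytesIO` object holding made-up content; it assumes the stream yields one `bytes` line per iteration, as file-like binary streams generally do.

```python
import io

def count_lines_before(stream, marker):
    """Count the lines that precede the first line starting with marker."""
    count = 0
    for raw_line in stream:
        if raw_line.decode('utf-8').startswith(marker):
            return count
        count += 1
    return None  # the marker never appeared

# Made-up stand-in for the remote file (not the NIST values).
fake_file = io.BytesIO(b"Title\n\nNotes\nData:  Instrument  Resistance\n1  10.5\n")
print(count_lines_before(fake_file, 'Data:'))
```

Blank lines are counted too, so the result can be passed to `skiprows` as a raw line offset.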

This finds the first occurrence of `Data`, but that doesn't help us much, because other files will not use the same marker. Instead, we can read line by line and let `pandas` determine whether each line is numeric. However, I find we need to wrap each line in an IO stream:

In [None]:
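import io
url = urllib.request.urlopen('http://www.itl.nist.gov/div898/strd/anova/SiRstv.dat')
lines = 0  # raw lines read so far, blank ones included, so it can be used as skiprows
while True:
    line = url.readline().decode('utf-8')
    if not line:
        break
    stripped = line.strip()
    if len(stripped) > 0:
        current_row = pandas.read_table(io.StringIO(stripped), sep=r'\s+', header=None)
        # is_numeric_dtype expects a dtype or 1-D array, so test each column's dtype
        if all(pandas.api.types.is_numeric_dtype(dtype) for dtype in current_row.dtypes):
            print(stripped)
            break
    lines += 1
url.close()
# Assumes the column-header row sits directly above the first numeric row; keep it
lines -= 1
print(lines)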
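The per-line check can be isolated and tried on a couple of strings. The helper name `is_numeric_line` is hypothetical, and the sample strings are made up; the sketch assumes whitespace-separated fields and asks `pandas` for the dtype of each parsed column.

```python
import io
import pandas

def is_numeric_line(line):
    """Return True when every whitespace-separated field parses as a number."""
    line = line.strip()
    if not line:
        return False  # a blank line holds no data
    row = pandas.read_table(io.StringIO(line), sep=r'\s+', header=None)
    # One dtype per column; all of them must be numeric.
    return all(pandas.api.types.is_numeric_dtype(dtype) for dtype in row.dtypes)

print(is_numeric_line('Data:  Instrument  Resistance'))
print(is_numeric_line('1  10.5'))
```

A line that mixes numbers and words is treated as non-numeric, which is what we want for documentation lines that happen to contain a value.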

In [None]:
tbl = pandas.read_table('http://www.itl.nist.gov/div898/strd/anova/SiRstv.dat', sep=r'\s+', skiprows=lines)
tbl.rename(columns = {'Data:':'Instrument', 'Instrument':'Resistance', 
                              'Resistance':'Blank'}, inplace = True) 
plt.plot(tbl['Instrument'], tbl['Resistance'],'ko')
plt.ylabel('Resistivity, ohm*cm')
plt.xlabel('Instrument')
plt.show()

# 1

Read a data table from [Data File in Table Format](https://www.itl.nist.gov/div898/strd/anova/SiRstvt.dat)

# 2

Repeat with a different data set from https://www.itl.nist.gov/div898/strd/general/dataarchive.html