# Reading data from files

Sometimes we need to read data from files. In general, these will be text files or binary files. Text files are easy to read; binary files are harder, because you need to know how their bytes are laid out before you can interpret them.
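
As a minimal sketch of the difference, the snippet below writes a small throwaway file and reads it back in text mode and in binary mode. The file name `example_tops.txt` and its single line are made up for illustration; they are not one of the course data files.

```python
import os
import tempfile

# Hypothetical example file, created in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'example_tops.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('TD,4268.0\n')

# Text mode ('r') decodes the bytes and returns a str.
with open(path, 'r', encoding='utf-8') as f:
    text = f.read()

# Binary mode ('rb') returns the raw, undecoded bytes.
with open(path, 'rb') as f:
    raw = f.read()

print(type(text), type(raw))
```

For a text file the two results carry the same information, but for a true binary format the bytes mean nothing until you know the layout.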

Let's start with reading some tops from a file.

In [1]:
with open('../data/L-30_tops.txt', 'r') as f:
    data = f.read()

In [2]:
data

'# L-30 well tops\nWyanDot FM,867.156\nDAWSON CANYON FM,984.50402\nLOGAN CANYON FM,1136.904\nUpper MISSISAUGA FM,2251.2529\nLower MISSISAUGA FM,3190.6464\nABENAKI FM,3404.3112\nMID BACCARO,3485.0832\nLower BACCARO,3964.5337\nBase O-Marker,2469.207\nTD,4268.0\nPay_sand_1-rft,2478.0\npay_sand_2,2499.0\npay_sand_3,2543.0\npay_sand_4,2637.0\nsand_5,2699.0\nsand_6,2795.0\nsand_7,2835.0\n'

In [3]:
with open('../data/L-30_tops.txt', 'r') as f:
    data = f.readlines()

In [4]:
data

['# L-30 well tops\n',
 'WyanDot FM,867.156\n',
 'DAWSON CANYON FM,984.50402\n',
 'LOGAN CANYON FM,1136.904\n',
 'Upper MISSISAUGA FM,2251.2529\n',
 'Lower MISSISAUGA FM,3190.6464\n',
 'ABENAKI FM,3404.3112\n',
 'MID BACCARO,3485.0832\n',
 'Lower BACCARO,3964.5337\n',
 'Base O-Marker,2469.207\n',
 'TD,4268.0\n',
 'Pay_sand_1-rft,2478.0\n',
 'pay_sand_2,2499.0\n',
 'pay_sand_3,2543.0\n',
 'pay_sand_4,2637.0\n',
 'sand_5,2699.0\n',
 'sand_6,2795.0\n',
 'sand_7,2835.0\n']

### Exercise

Write a `for` loop to read the lines of the file one by one, adding key: value pairs to a dictionary as you go.

<a title="You will need to skip lines that look like comments. Use str.split() to break the line at the comma, and float() to convert strings to numbers.">**Hints**</a>

In [5]:
tops = {}
for line in data:

    # Your code here!


In [5]:
tops = {}
for line in data:

    if line.startswith('#'):
        continue
    name, depth = line.split(',')
    name = name.title()
    depth = float(depth.strip())

    tops[name] = depth

In [6]:
tops

{'Abenaki Fm': 3404.3112,
 'Base O-Marker': 2469.207,
 'Dawson Canyon Fm': 984.50402,
 'Logan Canyon Fm': 1136.904,
 'Lower Baccaro': 3964.5337,
 'Lower Missisauga Fm': 3190.6464,
 'Mid Baccaro': 3485.0832,
 'Pay_Sand_1-Rft': 2478.0,
 'Pay_Sand_2': 2499.0,
 'Pay_Sand_3': 2543.0,
 'Pay_Sand_4': 2637.0,
 'Sand_5': 2699.0,
 'Sand_6': 2795.0,
 'Sand_7': 2835.0,
 'Td': 4268.0,
 'Upper Missisauga Fm': 2251.2529,
 'Wyandot Fm': 867.156}

Add this dictionary to `utils.py` by typing `tops = `, followed by this dict.

## Intro to Python students: stop here for now

----

## Read using NumPy

We can use `np.loadtxt()` for numeric files.

In [46]:
import numpy as np
np.loadtxt('../data/L-30_tops.txt', skiprows=1, usecols=[1], delimiter=',')

array([  867.156  ,   984.50402,  1136.904  ,  2251.2529 ,  3190.6464 ,
        3404.3112 ,  3485.0832 ,  3964.5337 ,  2469.207  ,  4268.     ,
        2478.     ,  2499.     ,  2543.     ,  2637.     ,  2699.     ,
        2795.     ,  2835.     ])

Or there's [`np.genfromtxt()`](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.io.genfromtxt.html), which copes better with missing values &mdash; try running it on `'../data/B-41_tops.txt'`.

In [44]:
np.genfromtxt('../data/L-30_tops.txt', skip_header=1, delimiter=',')

array([[        nan,         nan,         nan,         nan,         nan,
                nan,         nan,         nan,         nan,         nan,
                nan,         nan,         nan,         nan,         nan,
                nan,         nan],
       [  867.156  ,   984.50402,  1136.904  ,  2251.2529 ,  3190.6464 ,
         3404.3112 ,  3485.0832 ,  3964.5337 ,  2469.207  ,  4268.     ,
         2478.     ,  2499.     ,  2543.     ,  2637.     ,  2699.     ,
         2795.     ,  2835.     ]])

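The formation names come back as `nan` because `np.genfromtxt()` tries to parse every field as a float by default.

Both functions have a useful keyword argument, `unpack`; set it to `True` to get the columns back as separate vectors.

Both functions can also read gzip-compressed files directly, provided the file name ends in `.gz`.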
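
Here is a small sketch of these options, using made-up data rather than the course files: the gzip file name `example.txt.gz`, its contents, and the in-memory text with a missing depth are all assumptions for illustration.

```python
import gzip
import io
import os
import tempfile

import numpy as np

# Hypothetical two-column numeric file, gzip-compressed.
path = os.path.join(tempfile.mkdtemp(), 'example.txt.gz')
with gzip.open(path, 'wt') as f:
    f.write('# index,depth\n0,867.156\n1,984.50402\n2,1136.904\n')

# The .gz extension tells NumPy to decompress on the fly. Lines starting
# with '#' are treated as comments, and unpack=True returns one array
# per column instead of a single 2D array.
index, depth = np.loadtxt(path, delimiter=',', unpack=True)
print(depth)

# genfromtxt() tolerates an empty field, which it reads as nan.
text = io.StringIO('0,867.156\n1,\n2,1136.904\n')
arr = np.genfromtxt(text, delimiter=',')
print(arr)
```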

## `csv` built-in module

In [20]:
import csv

with open('../data/L-30_tops.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

['Formation name', 'Depth [m]']
['WyanDot FM', '867.156']
['DAWSON CANYON FM', '984.50402']
['LOGAN CANYON FM', '1136.904']
['Upper MISSISAUGA FM', '2251.2529']
['Lower MISSISAUGA FM', '3190.6464']
['ABENAKI FM', '3404.3112']
['MID BACCARO', '3485.0832']
['Lower BACCARO', '3964.5337']
['Base O-Marker', '2469.207']
['TD', '4268.0']
['Pay_sand_1-rft', '2478.0']
['pay_sand_2', '2499.0']
['pay_sand_3', '2543.0']
['pay_sand_4', '2637.0']
['sand_5', '2699.0']
['sand_6', '2795.0']
['sand_7', '2835.0']


In [22]:
import csv

with open('../data/L-30_tops.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['Formation name'], row['Depth [m]'])

WyanDot FM 867.156
DAWSON CANYON FM 984.50402
LOGAN CANYON FM 1136.904
Upper MISSISAUGA FM 2251.2529
Lower MISSISAUGA FM 3190.6464
ABENAKI FM 3404.3112
MID BACCARO 3485.0832
Lower BACCARO 3964.5337
Base O-Marker 2469.207
TD 4268.0
Pay_sand_1-rft 2478.0
pay_sand_2 2499.0
pay_sand_3 2543.0
pay_sand_4 2637.0
sand_5 2699.0
sand_6 2795.0
sand_7 2835.0
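
Note that `csv` returns every field as a string. As a rough sketch of turning the rows into a dictionary of floats, the snippet below uses an in-memory stand-in for the open file, containing only the header and the first two rows shown above; the variable name `tops_from_csv` is made up for this example.

```python
import csv
import io

# Stand-in for the open file: header plus the first two rows of the CSV.
f = io.StringIO(
    'Formation name,Depth [m]\n'
    'WyanDot FM,867.156\n'
    'DAWSON CANYON FM,984.50402\n'
)

tops_from_csv = {}
for row in csv.DictReader(f):
    # Tidy the name and convert the depth from str to float.
    name = row['Formation name'].title()
    tops_from_csv[name] = float(row['Depth [m]'])

print(tops_from_csv)
```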


## Read file using pandas

In [23]:
import pandas as pd

df = pd.read_csv('../data/L-30_tops.csv')

In [24]:
df

|   | Formation name | Depth [m] |
|---|---|---|
| 0 | WyanDot FM | 867.156 |
| 1 | DAWSON CANYON FM | 984.50402 |
| 2 | LOGAN CANYON FM | 1136.904 |
| 3 | Upper MISSISAUGA FM | 2251.2529 |
| 4 | Lower MISSISAUGA FM | 3190.6464 |
| 5 | ABENAKI FM | 3404.3112 |
| 6 | MID BACCARO | 3485.0832 |
| 7 | Lower BACCARO | 3964.5337 |
| 8 | Base O-Marker | 2469.207 |
| 9 | TD | 4268.0 |


In [33]:
import pandas as pd

df = pd.read_csv('../data/L-30_tops.txt', skiprows=1, names=['Formation', 'Depth'])

In [54]:
df

|   | Formation | Depth |
|---|---|---|
| 0 | WyanDot FM | 867.156 |
| 1 | DAWSON CANYON FM | 984.50402 |
| 2 | LOGAN CANYON FM | 1136.904 |
| 3 | Upper MISSISAUGA FM | 2251.2529 |
| 4 | Lower MISSISAUGA FM | 3190.6464 |
| 5 | ABENAKI FM | 3404.3112 |
| 6 | MID BACCARO | 3485.0832 |
| 7 | Lower BACCARO | 3964.5337 |
| 8 | Base O-Marker | 2469.207 |
| 9 | TD | 4268.0 |


In [55]:
df['Formation'] = df['Formation'].str.title()
df.head()

|   | Formation | Depth |
|---|---|---|
| 0 | Wyandot Fm | 867.156 |
| 1 | Dawson Canyon Fm | 984.50402 |
| 2 | Logan Canyon Fm | 1136.904 |
| 3 | Upper Missisauga Fm | 2251.2529 |
| 4 | Lower Missisauga Fm | 3190.6464 |


In [56]:
df.to_csv('../data/L-30_tops_improved.csv')
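
By default, `to_csv()` also writes the row index as an extra first column with no name. The sketch below, which uses a tiny made-up DataFrame called `df_demo` rather than the real data, shows what that does when the file is read back, and how `index=False` avoids it.

```python
import io

import pandas as pd

# Small illustrative DataFrame (not the full tops table).
df_demo = pd.DataFrame({'Formation': ['Wyandot Fm', 'Td'],
                        'Depth': [867.156, 4268.0]})

# With no path, to_csv() returns the CSV text as a string.
with_index = df_demo.to_csv()                # index written as first column
without_index = df_demo.to_csv(index=False)  # data columns only

# Reading the first version back produces a column called 'Unnamed: 0'.
cols_with = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index)).columns.tolist()
print(cols_with)
print(cols_without)
```

Whether to keep the index is a judgment call; for a simple table of tops like this one, dropping it usually gives a cleaner file.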

## Exercises

- Read the data from B-41_tops.txt
- Write a function that will load data from either of these files
- Load the data to pandas
- Write a new CSV file with the cleaned data

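## Write to a SQLite database

The built-in `sqlite3` module can store the `tops` dictionary from the first exercise in a small database file.

In [70]:
import sqlite3 as lite

con = lite.connect('tops.db')

with con:

    cur = con.cursor()
    cur.execute("CREATE TABLE strat(formation TEXT, depth DECIMAL, age INT)")

    # Use ? placeholders rather than string formatting, so that a name
    # containing a quote cannot break (or inject into) the SQL statement.
    for name, depth in tops.items():
        cur.execute("INSERT INTO strat VALUES (?, ?, ?)", (name, depth, 0))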


In [71]:
con = lite.connect('tops.db')

with con:    
    
    cur = con.cursor()    
    cur.execute("SELECT * FROM strat")

    rows = cur.fetchall()

    for row in rows:
        print(row)

('Wyandot Fm', 867.156, 0)
('Dawson Canyon Fm', 984.50402, 0)
('Logan Canyon Fm', 1136.904, 0)
('Upper Missisauga Fm', 2251.2529, 0)
('Lower Missisauga Fm', 3190.6464, 0)
('Abenaki Fm', 3404.3112, 0)
('Mid Baccaro', 3485.0832, 0)
('Lower Baccaro', 3964.5337, 0)
('Base O-Marker', 2469.207, 0)
('Td', 4268, 0)
('Pay_Sand_1-Rft', 2478, 0)
('Pay_Sand_2', 2499, 0)
('Pay_Sand_3', 2543, 0)
('Pay_Sand_4', 2637, 0)
('Sand_5', 2699, 0)
('Sand_6', 2795, 0)
('Sand_7', 2835, 0)
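
Once the data is in a table, SQL can filter and sort it. The following sketch assumes the same `strat` schema but uses an in-memory database and a three-entry dictionary, `example_tops`, invented for the example, so it runs without touching `tops.db`. The `?` placeholders let `sqlite3` pass values to the query safely.

```python
import sqlite3

# In-memory database, so the sketch leaves no file behind.
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute("CREATE TABLE strat(formation TEXT, depth DECIMAL, age INT)")

# A few tops taken from the output above, for illustration only.
example_tops = {'Wyandot Fm': 867.156, 'Abenaki Fm': 3404.3112, 'Td': 4268.0}
cur.executemany("INSERT INTO strat VALUES (?, ?, ?)",
                [(name, depth, 0) for name, depth in example_tops.items()])
con.commit()

# Select the tops deeper than a threshold, shallowest first.
cur.execute("SELECT formation, depth FROM strat WHERE depth > ? ORDER BY depth",
            (3000,))
deep = cur.fetchall()
con.close()

for row in deep:
    print(row)
```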
