# Data ingestion

Notes on the video for this material are in the adjoining [video notes](03-03-data-ingest-video-notes.ipynb). Review them, then complete this exercise.

First watch the video and log in to grading:

In [1]:
# Don't change this cell; just run it. 
from IPython.display import IFrame
IFrame('https://1813261-1.kaf.kaltura.com/media/t/1_odixlrd8/133896931', width=800, height=560)

from client.api.notebook import Notebook
ok = Notebook('03-03-data-ingest.ok')
# ok.auth(inline=True)

Assignment: 03-03 Data Ingestion
OK, version v1.14.15



1. **Write code to read data from the file `exer1.txt`.**
Put the data into the numpy array `exer1`.

In [2]:
%more exer1.txt

In [3]:
import numpy as np
def read_comma_separated_file(filename):
    with open(filename) as file:
        all_rows = []
        for line in file:
            # remove leading and trailing whitespace
            line = line.strip()
            if not line:
                # ignore empty lines
                continue
            # split the line into words
            row = [float(str_) for str_ in line.split(',')]
            if row:
                all_rows.append(row)
        return np.array(all_rows)

exer1 = read_comma_separated_file("exer1.txt")
print(exer1)

[[25.  25.5 10.4]
 [10.  15.  25.5]
 [ 9.   2.   7. ]]


In [4]:
_ = ok.grade('q01')  # run this to check your answer

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



2. **Write code to read data from the file `exer2.txt`.** Put the result into `exer2`.

In [5]:
%more exer2.txt

In [6]:
def read_spaces_separated_file(filename):
    with open(filename) as file:
        all_rows = []
        for line in file:
            # remove leading and trailing whitespace
            line = line.strip()
            if not line:
                # ignore empty lines
                continue
            # split the line into words
            row = [float(str_) for str_ in line.split(' ') if str_]
            if row:
                all_rows.append(row)
        return np.array(all_rows)

exer2 = read_spaces_separated_file("exer2.txt")
print(exer2)

[[25.  25.5 10.4]
 [10.  15.  25.5]
 [ 9.   2.   7. ]]
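
The `if str_` filter in the comprehension above is there because `line.split(' ')` yields an empty string wherever two spaces are adjacent. A minimal sketch of the difference, using a made-up sample line (an assumption, not the actual contents of `exer2.txt`):

```python
# Hypothetical sample line with runs of spaces between the values.
sample_line = "25   25.5  10.4"

# Splitting on a single space keeps an empty string for each extra space.
pieces_single_space = sample_line.split(' ')

# Splitting with no argument treats any run of whitespace as one separator.
pieces_any_whitespace = sample_line.split()

row = [float(piece) for piece in pieces_any_whitespace]
print(pieces_single_space)
print(row)
```

With `split()` and no argument, the extra filter is unnecessary, which may be the simpler choice when the separator is "any amount of whitespace".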


In [7]:
_ = ok.grade('q02')  # run this to check your work

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



3. **Write code to read data from the file `exer3.txt`.** Put the result into `exer3`.

In [8]:
%more exer3.txt

In [9]:
def read_fixed_format_file(filename):
    with open(filename) as file:
        all_rows = []
        for line in file:
            # remove trailing whitespace only; leading spaces belong to the first field
            line = line.rstrip()
            if not line or line.lstrip().startswith('#'):
                # ignore empty or comment lines
                continue
            # slice the line into fixed-width fields of 4 characters each
            row = [float(line[i:i+4]) for i in range(0,len(line),4)]
            if row:
                all_rows.append(row)
        return np.array(all_rows)

exer3 = read_fixed_format_file("exer3.txt")
print(exer3)

[[25.  25.5 10.4]
 [10.  15.  25.5]
 [ 9.   2.   7. ]]


## A better way

### Numpy `loadtxt` and `genfromtxt`

The [`loadtxt` function](https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html) is a convenience function that aims to be a fast reader for simply formatted files. 
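
As a rough illustration of `loadtxt` on a simply formatted file, the sketch below reads comma-separated values from an in-memory buffer. The sample text is an assumption standing in for a file such as `exer1.txt`, not its actual contents.

```python
import io
import numpy as np

# Hypothetical in-memory stand-in for a comma-separated data file.
sample_text = io.StringIO("25,25.5,10.4\n10,15,25.5\n9,2,7\n")

# loadtxt accepts a filename or a file-like object.
sample_array = np.loadtxt(sample_text, delimiter=',')
print(sample_array.shape)
print(sample_array.dtype)
```

Passing a filename instead of the buffer should work the same way.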

The [`genfromtxt` function](https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html) provides more sophisticated handling of, e.g., lines with missing values. 
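We rewrite the above code in the next cell using `genfromtxt`, whose `delimiter` argument accepts a list of field widths as well as a separator string. (`loadtxt` would suffice for the delimited files in exercises 1 and 2, but its `delimiter` must be a string, so it cannot split fixed-width columns.)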

In [10]:
def read_fixed_format_file(filename):
    data = np.genfromtxt(filename, delimiter=[4,4,4], comments='#')
    return data

exer3 = read_fixed_format_file("exer3.txt")
print(exer3)

[[25.  25.5 10.4]
 [10.  15.  25.5]
 [ 9.   2.   7. ]]
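
The missing-value handling mentioned above can be sketched as follows. The sample lines are made up for illustration; `filling_values` is a documented `genfromtxt` parameter, and the choice of `-1.0` as a fill value is arbitrary.

```python
import numpy as np

# Hypothetical comma-separated lines; the middle value of the second row is missing.
sample_lines = ["1.0,2.0,3.0", "4.0,,6.0"]

# By default, a missing float field becomes nan.
with_nan = np.genfromtxt(sample_lines, delimiter=',')

# filling_values substitutes a chosen value for missing fields instead.
with_fill = np.genfromtxt(sample_lines, delimiter=',', filling_values=-1.0)

print(with_nan)
print(with_fill)
```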


In [11]:
_ = ok.grade('q03')  # run this to check your answer

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



(A personal note: You might think this file format is really clueless. It contains, however, the exact same data as the other two files. This kind of column run-together is common in outputs of scientific computing, and *fixed-width parsing is the only way to unravel it!* These exercises are chosen from *my personal experience in frustrating data wrangling* to get data into python and other languages! -- Prof. Couch)
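
To see why run-together columns call for fixed-width parsing, consider the sketch below. The sample line is an assumption made up for illustration; it is not copied from `exer3.txt`.

```python
# Hypothetical line in which three 4-character fields run together.
sample_line = "25.025.510.4"

# Whitespace splitting cannot find the field boundaries.
whitespace_fields = sample_line.split()

# Slicing at fixed offsets recovers each 4-character field.
field_width = 4
fixed_fields = [
    float(sample_line[start:start + field_width])
    for start in range(0, len(sample_line), field_width)
]

print(whitespace_fields)
print(fixed_fields)
```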

4. (Advanced) Write code to read the data in `exer4.txt`. Make the third column an integer rather than floating point. Hint: look at the `dtype` parameter in the documentation for `genfromtxt` and use it to define the field types. Alternatively, consider letting `genfromtxt` infer the types via `dtype=None`.

In [12]:
%more exer4.txt

In [13]:
def read_fixed_format_ffi_file(filename):
    # data = np.genfromtxt(filename, delimiter=[4,4,4], comments='#')
    # The following produces tuple-like rows
    data = np.genfromtxt(filename, delimiter=[4,4,4], dtype=[('a','f8'),('b','f8'),('c','i8')], comments='#')
    
    return data

exer4 = read_fixed_format_ffi_file("exer4.txt")
print(exer4)

[(20. , 30. ,  4) (40.5, 50.5, 30) (60.2, 70.3, 50)]


See the NumPy documentation on [Structured Arrays and Structured Datatypes](https://docs.scipy.org/doc/numpy/user/basics.rec.html) for details.

In [14]:
type(exer4[0][2])

numpy.int64
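
The `dtype=None` alternative from the hint can be sketched like this. The sample lines are hypothetical fixed-width data (two float fields and one integer field, 4 characters each), not the contents of `exer4.txt`. When no names are given, `genfromtxt` labels the fields `f0`, `f1`, `f2`.

```python
import numpy as np

# Hypothetical fixed-width lines: two float fields followed by an integer field.
sample_lines = ["20.030.0   4", "40.550.5  30", "60.270.3  50"]

# dtype=None asks genfromtxt to infer a type for each column separately.
inferred = np.genfromtxt(sample_lines, delimiter=[4, 4, 4], dtype=None)

print(inferred.dtype)
print(inferred['f2'])
```

The exact integer width of the inferred column may depend on the platform and NumPy version, so an explicit `dtype` list remains the safer option when a specific type is required.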

In [15]:
_ = ok.grade('q04')  # run this to check your answer

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



### Afterword: what is an ndarray? 
In this example, the result is a *structured* `ndarray` whose rows print like tuples, rather than a regular `ndarray` of numbers. Why? *A regular `ndarray` requires all elements to have the same type!* Compare the above with the output from exercise 3. All elements there are floats, so that's a regular two-dimensional `ndarray`. When the columns have different types, as the `dtype` list specifies here, `numpy` instead builds a one-dimensional array of records, one per row, each with its own typed fields.
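
A small sketch of that contrast, built from literal values rather than a file. The field names `a`, `b`, `c` mirror the `dtype` used in exercise 4; the numbers themselves are only illustrative.

```python
import numpy as np

# A regular array: every element shares one dtype.
regular = np.array([[25.0, 25.5, 10.4], [10.0, 15.0, 25.5]])

# A structured array: each row is a record with named, individually typed fields.
structured = np.array(
    [(20.0, 30.0, 4), (40.5, 50.5, 30)],
    dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'i8')],
)

print(regular.shape, regular.dtype)        # two-dimensional, one dtype
print(structured.shape, structured.dtype)  # one-dimensional array of records
print(type(structured[0]))                 # a NumPy record type, not a Python tuple
print(structured['c'])                     # a single field is an ordinary integer array
```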

5. (Advanced) As in problem 4, read data from `exer5.txt`, and additionally label the columns 'carbon', 'nitrogen', 'oxygen'. Put the result into `exer5`.

In [16]:
%more exer5.txt

In [17]:
def read_fixed_format_ffi_file(filename):
    # data = np.genfromtxt(filename, delimiter=[4,4,4], comments='#')
    # The following produces tuple-like rows
    data = np.genfromtxt(filename, 
                         delimiter=[4,4,4], 
                         names = ['carbon', 'nitrogen', 'oxygen'],
                         dtype=[('carbon','f8'),('nitrogen','f8'),('oxygen','i8')], 
                         comments='#')
    
    return data

exer5 = read_fixed_format_ffi_file("exer5.txt")
exer5

array([(20. , 30. ,  4), (40.5, 50.5, 30), (60.2, 70.3, 50)],
      dtype=[('carbon', '<f8'), ('nitrogen', '<f8'), ('oxygen', '<i8')])

In [18]:
# use this to test your work 
for key in exer5.dtype.names:
    print('{} is {}'.format(key, exer5[key]))

carbon is [20.  40.5 60.2]
nitrogen is [30.  50.5 70.3]
oxygen is [ 4 30 50]


Note that these are regular `ndarray` vectors, not `tuple`s, because each one contains elements of a single type.

In [19]:
_ = ok.grade('q05')  # run this to check your answer

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



# When you are done with this notebook, 
* Save and checkpoint. 
* Ensure that the name of this file is precisely `03-03-data-ingest.ipynb`. 
* <del>Change `ready` to `True` in the cell below. </del>
* <del>Run the cell below to submit your work for grading. </del>
* Save and checkpoint the notebook. 

* If your Jupyter installation can download the notebook as a PDF,
    * Download it (File >> Download as >> PDF via LaTeX (.pdf)).
    * Rename the downloaded file to `<loginid>-03-03-data-ingest.pdf`. For example, my filename would be `jsingh11-03-03-data-ingest.pdf`.
    * Submit the file `<loginid>-03-03-data-ingest.pdf` to Canvas.
* Otherwise,
    * Download it (File >> Download as >> Notebook (.ipynb)).
    * Rename the downloaded file to `<loginid>-03-03-data-ingest.ipynb`. For example, my filename would be `jsingh11-03-03-data-ingest.ipynb`.
    * Submit the file `<loginid>-03-03-data-ingest.ipynb` to Canvas.