# Reading files

Run the following cell and watch the video. 

In [1]:
%%html
<iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/1813261/sp/181326100/embedIframeJs/uiconf_id/38997502/partner_id/1813261?iframeembed=true&playerId=kaltura_player&entry_id=1_odixlrd8&flashvars[streamerType]=auto&amp;flashvars[localizationCode]=en&amp;flashvars[leadWithHTML5]=true&amp;flashvars[sideBarContainer.plugin]=true&amp;flashvars[sideBarContainer.position]=left&amp;flashvars[sideBarContainer.clickToClose]=true&amp;flashvars[chapters.plugin]=true&amp;flashvars[chapters.layout]=vertical&amp;flashvars[chapters.thumbnailRotator]=false&amp;flashvars[streamSelector.plugin]=true&amp;flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&amp;flashvars[dualScreen.plugin]=true&amp;flashvars[Kaltura.addCrossoriginToIframe]=true&amp;&wid=1_3kmydx5s" width="512" height="288" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" frameborder="0" title="Kaltura Player"></iframe>

# The special problem of reading data

There are several problems involved with getting data into a program. 
* Several high-level data formats. 
* Each one has several subformats.  
* A multitude of tools for reading.
* Each tool has limits. 

## The basic formats: 
* **text**: human-readable and editable data. 
* **binary**: machine-readable data. 
     
## Text data
* Numbers in one of many printable formats. 
* Less space-efficient than binary formats. 
* Tends to be portable between machines.

## Binary data
* Much more efficiently stored. 
* Tends not to be portable between machines and software. 
* Several variants are incompatible with one another. 
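
As a rough illustration of the trade-off between the two formats, here is a minimal sketch using the standard library `struct` module (not used elsewhere in this notebook; the value and variable names are only illustrative). It stores one number as text and as an 8-byte binary double, then shows that the binary bytes depend on the byte order chosen, which is one reason binary files tend not to be portable.

```python
import struct

value = 3.141592653589793

# Text form: human-readable, one character per digit.
as_text = repr(value)

# Binary form: a double-precision float always takes 8 bytes.
little = struct.pack('<d', value)  # little-endian byte order
big = struct.pack('>d', value)     # big-endian byte order

print(len(as_text), 'characters as text:', as_text)
print(len(little), 'bytes as binary')

# Same number, different byte layouts: a reader must know which one was used.
print(little == big)        # the two layouts differ
print(little == big[::-1])  # one is the byte-reversal of the other

# Round trip: unpacking with the matching format recovers the value exactly.
restored = struct.unpack('<d', little)[0]
print(restored == value)
```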

We've already seen a really simple example of reading a file: 


In [None]:
# run this to see the file. 
%pycat data.txt

In [None]:
# read the file into a list
out = []
with open('data.txt', 'r') as file:  # the file is closed automatically
    for line in file:
        numbers = line.strip().split(',')
        out.append(numbers)
out

# Oops! This is not quite right. 
There is a profound difference between '1' and 1 (without the quotes). 
* `'1'` is a string. 
* `1` is an integer. 
* (`1.` is a floating point number.) 

# Internal and string representations of numbers
* In Python, there are several formats for numbers. 
* `int`: an integer. 
* `float`: a floating point number. 
* `complex`: a complex number. 

# Internal representations of numbers share these attributes: 
* Binary representation. 
* Usually a fixed 4, 8, or 16 bytes long (Python's built-in `int` is an exception: it grows as needed). 
* Expressed in base 2. 

# String representations of numbers are obviously different
* Variable length. 
* Digits in base 10. 
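
The contrast between the two representations can be sketched as follows. This assumes `numpy` is installed and uses its type codes (`i4`, `i8`, `f8`, `c16`) as examples of fixed-size internal formats; the particular numbers are arbitrary.

```python
import numpy as np

# Fixed-size internal representations: the size depends on the type, not the value.
for code in ['i4', 'i8', 'f8', 'c16']:
    print(code, 'occupies', np.dtype(code).itemsize, 'bytes')

# String representations grow with the number of digits.
for n in [7, 1234, 1000000]:
    print(repr(str(n)), 'has', len(str(n)), 'characters')

# The same integer written in base 2 and in base 10.
print(bin(10), 'is the base-2 form of', 10)
```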

# Converting between strings and numbers
* `str(numb)`: gives the string version of a number. 
* `float(strng)` or `int(strng)`: gives the numeric version. 
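
A short sketch of these conversions, using made-up values, shows why the distinction matters: the same operator does different things to strings and to numbers.

```python
# Strings and numbers behave differently under the same operator.
print('1' + '1')  # string concatenation
print(1 + 1)      # integer addition

# Converting between the two representations.
text = '42'
number = int(text)
print(number + 1)
print(float('3.5') * 2)
print(str(number) + '!')

# int() rejects text that is not a whole number.
try:
    int('3.5')
except ValueError as e:
    print('conversion failed:', e)
```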

# So, we can solve our little problem as follows: 

In [None]:
# convert each string entry to an integer
out2 = []
for row in out: 
    new_row = []
    for entry in row: 
        new_row.append(int(entry))
    out2.append(new_row)
out2
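
The same conversion can also be written more compactly as a nested list comprehension. This is a sketch of an equivalent form, not a replacement for the loop above; `rows` is a hypothetical stand-in for `out` so that the cell runs without `data.txt`.

```python
# Hypothetical stand-in for `out`: rows of strings as produced by split(',').
rows = [['1', '2', '3'], ['4', '5', '6']]

# Nested list comprehension: convert every entry of every row to int.
converted = [[int(entry) for entry in row] for row in rows]
print(converted)
```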

# Alas, if it were only that easy...

The reality: 
* A large share of the typical data scientist's time is spent finding ways to read data.
* Data reading errors are very common. 
* Data formats are not so neat as the above. 

Thus, an enormous amount of time has been spent on libraries for reading data. 

One of the simplest is `numpy.loadtxt`. 

See https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html

For example, we can write:

In [None]:
import numpy as np
stuff = np.loadtxt('data.txt', delimiter=',')
stuff

# A few observations

* The default conversion type is `float`. 
* The default delimiter is any whitespace. 

So, for this file: 

In [None]:
# run this to see the file
%pycat data2.txt

we might write, instead: 

In [None]:
stuff = np.loadtxt('data2.txt')  # default delimiter is whitespace 
stuff
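
To experiment with these defaults without any files on disk, one option is to hand `loadtxt` an in-memory text buffer. The sketch below assumes made-up data in `io.StringIO` objects standing in for `data.txt` and `data2.txt`; it also passes the `dtype` argument to request integers instead of the default floats.

```python
import io
import numpy as np

# Hypothetical in-memory stand-ins for comma- and whitespace-separated files.
comma_text = "1,2,3\n4,5,6\n"
space_text = "1 2 3\n4   5\t6\n"  # mixed spaces and a tab

from_commas = np.loadtxt(io.StringIO(comma_text), delimiter=',')
from_spaces = np.loadtxt(io.StringIO(space_text))  # default: split on whitespace

print(from_commas.dtype)  # the default conversion type is float
print(np.array_equal(from_commas, from_spaces))

# Ask for integers explicitly with dtype.
as_ints = np.loadtxt(io.StringIO(comma_text), delimiter=',', dtype=int)
print(as_ints)
```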

# Non-numeric data
Obviously, we need to do something about non-numeric data in the file. 
If we try naively to parse a file with non-numeric data,

In [None]:
%pycat data3.txt

In [None]:
try:
    stuff = np.loadtxt('data3.txt', delimiter=',')
except Exception as e:
    print(e)

... we get an error, and we must get more clever. Let's tell `loadtxt` what kinds of objects it is looking for. 
* `i4`: a 4-byte integer
* `U20`: a string of up to 20 characters (U means *unicode*)

In [None]:
stuff = np.loadtxt('data3.txt', delimiter=',', dtype={'names': ['apples', 'oranges', 'name'],
                                                      'formats': ['i4', 'i4', 'U20']})
stuff

# Notice several things in this example
* `loadtxt` automatically chose to represent this as an array of tuples (what NumPy calls a *structured array*) rather than an array of lists. 
* Indexing is the same, but math is different. 
* This is its way of saying *this is not a vector, matrix, or tensor.* 
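
These points can be checked on a small scale. The sketch below uses made-up, in-memory data with the same three columns as the example above, and shows indexing by position, indexing by field name, and arithmetic on whole columns.

```python
import io
import numpy as np

# Hypothetical in-memory data in the same layout as the example above.
text = "3,4,Smith\n5,6,Jones\n"
dtype = {'names': ['apples', 'oranges', 'name'],
         'formats': ['i4', 'i4', 'U20']}
table = np.loadtxt(io.StringIO(text), delimiter=',', dtype=dtype)

print(table.shape)        # one dimension: a sequence of records
print(table[0])           # a whole record
print(table[0][2])        # index within a record by position...
print(table[0]['name'])   # ...or by field name
print(table['apples'])    # a whole column is a regular integer array

# Math works on numeric columns, not on whole records.
print(table['apples'] + table['oranges'])
```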

# The curse of comma-separated values

Comma-separated values are a very common data representation, but have one **huge** problem. 
Microsoft Excel often puts out files like this. 

In [None]:
%pycat data3.csv

Excel thinks it's being clever, because there is a comma in the data. 
* It surrounds the field containing commas with double quotes. 
* Our parser knows nothing about that. 

So, when we try to parse it naively, we get: 

In [None]:
try:
    stuff = np.loadtxt('data3.csv', delimiter=',', dtype={'names': ['apples', 'oranges', 'name'],
                                                          'formats': ['i4', 'i4', 'U20']})
except Exception as e:
    print(e)

This parser won't handle this case. 
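
To see what goes wrong, it helps to split one such line on commas by hand. The line below is a made-up example in the style described above, not the actual contents of `data3.csv`.

```python
# Hypothetical line in the style Excel writes when a field contains a comma.
line = '3,4,"Smith, Pat"'

fields = line.split(',')
print(len(fields))  # one more field than the three columns we expect
print(fields)       # the quoted name has been cut in two
```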

We can, however, outsmart Excel by outputting as tab-separated values: 

In [None]:
%pycat data4.txt

In [None]:
stuff = np.loadtxt('data4.txt', delimiter='\t', dtype={'names': ['apples', 'oranges', 'name'],
                                                       'formats': ['i4', 'i4', 'U20']})
stuff

*This always works because Excel won't allow a tab character to be typed into a cell!*
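
Here is a small sketch of the same idea with made-up, in-memory data in place of `data4.txt`. Because the fields are separated by tabs, the commas inside the names are treated as ordinary characters.

```python
import io
import numpy as np

# Hypothetical tab-separated data; the name field contains a comma.
text = "3\t4\tSmith, Pat\n5\t6\tJones, Lee\n"
dtype = {'names': ['apples', 'oranges', 'name'],
         'formats': ['i4', 'i4', 'U20']}
table = np.loadtxt(io.StringIO(text), delimiter='\t', dtype=dtype)

print(table['name'])
print(table['apples'])
```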

# Fixed-width fields

Some files we need to load have fields that occupy a fixed number of characters. These kinds of files are found as output of scientific modeling programs in Fortran. For example, consider: 

In [None]:
%pycat data5.txt

We parse this kind of file with a different method that supports a more general *delimiter* field consisting of the widths -- as integers -- of the fields we should find. See https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
For example: 

In [None]:
stuff = np.genfromtxt('data5.txt', delimiter=(7,4,8,4))
stuff

The delimiter field says that there are numbers in the input that are 7, 4, 8, and 4 characters long. I included a comment line (starting with `#`) in the data to show you the column numbers. 
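
The sketch below applies the same technique to made-up, in-memory data (not the contents of `data5.txt`) with field widths of 5, 3, and 3 characters. Note that some adjacent numbers run together with no space between them, so only the widths tell the parser where one field ends and the next begins.

```python
import io
import numpy as np

# Hypothetical fixed-width data: fields of 5, 3, and 3 characters.
text = (" 1.50 22333\n"
        "10.25444 55\n")

# A tuple of widths tells genfromtxt where each field ends, even when
# adjacent numbers run together.
table = np.genfromtxt(io.StringIO(text), delimiter=(5, 3, 3))
print(table)
```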

# Let's put these patterns into practice
First run this cell to log in to the grading system.

In [None]:
# Don't change this cell; just run it. 
from client.api.notebook import Notebook
ok = Notebook('Reading files.ok')
ok.auth(inline=True)

1. **Write code to read data from the file `exer1.txt`.**
Put the data into the numpy array `exer1`

In [None]:
%pycat exer1.txt

In [None]:
# fill in details ... 
exer1 = ...
print(exer1)

In [None]:
_ = ok.grade('q01')  # run this to check your answer

2. **Write code to read data from the file `exer2.txt`.** Put the result into `exer2`.

In [None]:
%pycat exer2.txt

In [None]:
# fill in details ... 
exer2 = ...
print(exer2)

In [None]:
_ = ok.grade('q02')  # run this to check your work

3. **Write code to read data from the file `exer3.txt`.** Put the result into `exer3`.

In [None]:
%pycat exer3.txt

In [None]:
# fill in details ... 
exer3 = ...
print(exer3)

In [None]:
_ = ok.grade('q03')  # run this to check your answer

(A personal note: You might think this file format is really clueless. It contains, however, the exact same data as the other two files. This kind of column run-together is common in outputs of scientific computing, and *fixed-width parsing is the only way to unravel it!* These exercises are chosen from *my personal experience in frustrating data wrangling* to get data into Python and other languages! -- Prof. Couch)

4. (Advanced) Write code to input the data in `exer4.txt`. Make the third column an integer rather than floating point. Hint: look at the `dtype` field in the documentation for `genfromtxt` and use it to define the field types. Consider allowing the system to figure it out via `dtype=None` 

In [None]:
%pycat exer4.txt

In [None]:
# fill in details ...
exer4 = ...
print(exer4)

In [None]:
type(exer4[0][2])

In [None]:
_ = ok.grade('q04')  # run this to check your answer

### Afterword: what is an ndarray? 
In this example, the result will be an `ndarray` of tuples rather than a regular `ndarray` of numbers. Why? *A regular `ndarray` requires elements of the same type!* Compare this with the output from exercise 3. All elements there are floats, so that's a regular `ndarray`. When you violate that restriction, `numpy` automatically gives you a tuple rather than a list of values. 
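
The difference can be seen side by side in a small sketch with made-up, in-memory data (not the exercise files). It assumes `genfromtxt` is called with `dtype=None`, so that the type of each column is inferred; the field names `f0`, `f1`, `f2` are the defaults `genfromtxt` assigns when none are given.

```python
import io
import numpy as np

# Hypothetical data: every column holds whole numbers.
same_types = np.genfromtxt(io.StringIO("1 2\n3 4\n"), dtype=None)
print(same_types.shape, same_types.dtype.names)  # 2-D array, no field names

# Hypothetical data: two floating-point columns and one integer column.
mixed_types = np.genfromtxt(io.StringIO("1.5 2.5 7\n3.5 4.5 9\n"), dtype=None)
print(mixed_types.shape, mixed_types.dtype.names)  # 1-D array of records
print(mixed_types.dtype)
```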

5. (Advanced) As in problem 4, read data from `exer5.txt`, and additionally label the columns 'carbon', 'nitrogen', 'oxygen'. Put the result into exer5.

In [None]:
%pycat exer5.txt

In [None]:
# fill in details ...
exer5 = ...
exer5

In [None]:
# use this to test your work 
for key in exer5.dtype.names:
    print('{} is {}'.format(key, exer5[key]))

Note that these are real `ndarray` vectors, not `tuple`s, because each contains elements of a single type. 

In [None]:
_ = ok.grade('q05')  # run this to check your answer

# When you are done with this notebook, 
* Save and checkpoint. 
* Change `ready` to `True` in the cell below. 
* Run the cell below. 

In [None]:
ready = False  # change to True when ready to submit
if not ready:
    raise Exception("change ready to True when ready to submit")
_ = ok.submit()