# Working with Data

## Loading data from files

### Loading data

An important part of this course is about using Python to analyse and visualise data.
Most data, of course, is supplied to us in various *formats*: spreadsheets, database dumps, or text files in various formats (csv, tsv, json, yaml, hdf5, netcdf).
It is also stored in some *medium*: on a local disk, a network drive, or on the internet in various ways.
It is important to distinguish the data format, how the data is structured into a file, from the data's storage, where it is put. 

We'll look first at the question of data *transport*: loading data from a disk, and at downloading data from the internet.
Then we'll look at data *parsing*: building Python structures from the data.
These are related, but separate questions.
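As a minimal sketch of the distinction (the file name `example_numbers.txt` and its contents are made up for illustration), the first step below only moves text from disk into memory, and the second step builds Python structures from it:

```python
import os
import tempfile

# Work in a temporary folder so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example_numbers.txt")  # hypothetical example file
    with open(path, "w") as target:
        target.write("1,2,3\n4,5,6\n")

    # Transport: get the raw text from its storage (here, the local disk).
    with open(path) as source:
        raw_text = source.read()

# Parsing: build Python structures (a list of lists of ints) from that text.
rows = [[int(cell) for cell in line.split(",")] for line in raw_text.splitlines()]
print(rows)
```

The same parsing step would work unchanged if `raw_text` had arrived from a network drive or the internet instead.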

### An example datafile

Let's write an example datafile to disk so we can investigate it. We'll just use a plain-text file. The IPython notebook provides a way to do this: if we put
`%%writefile` at the top of a cell, the cell contents are saved to disk instead of being interpreted as Python.

In [3]:
%%writefile mydata.txt
A poet once said, 'The whole universe is in a glass of wine.'
We will probably never know in what sense he meant it, 
for poets do not write to be understood. 
But it is true that if we look at a glass of wine closely enough we see the entire universe. 
There are the things of physics: the twisting liquid which evaporates depending
on the wind and weather, the reflection in the glass;
and our imagination adds atoms.
The glass is a distillation of the earth's rocks,
and in its composition we see the secrets of the universe's age, and the evolution of stars. 
What strange array of chemicals are in the wine? How did they come to be? 
There are the ferments, the enzymes, the substrates, and the products.
There in wine is found the great generalization; all life is fermentation.
Nobody can discover the chemistry of wine without discovering, 
as did Louis Pasteur, the cause of much disease.
How vivid is the claret, pressing its existence into the consciousness that watches it!
If our small minds, for some convenience, divide this glass of wine, this universe, 
into parts -- 
physics, biology, geology, astronomy, psychology, and so on -- 
remember that nature does not know it!

So let us put it all back together, not forgetting ultimately what it is for.
Let it give us one more final pleasure; drink it and forget it all!
   - Richard Feynman

Overwriting mydata.txt


Where did that go? It went to the current folder, which for a notebook, by default, is where the notebook is on disk.

In [4]:
import os # The 'os' module gives us all the tools we need to search in the file system
os.getcwd() # Use the 'getcwd' function from the 'os' module to find where we are on disk.

'C:\\Users\\Dustin\\Documents\\GitHub\\rsd-engineeringcourse\\ch01data'

Can we see if it is there?

In [5]:
import os
[x for x in os.listdir(os.getcwd()) if ".txt" in x]  # os.listdir lists every entry; the list comprehension keeps only the names containing ".txt"

['mydata.txt']

In [89]:
os.listdir()

['.ipynb_checkpoints',
 '060files.ipynb',
 '061internet.ipynb',
 '062csv.ipynb',
 '064JsonYamlXML.ipynb',
 '065MazeSaved.ipynb',
 '066QuakeExercise.ipynb',
 '068QuakesSolution.ipynb',
 '072plotting.ipynb',
 '082NumPy.ipynb',
 '084Boids.ipynb',
 '110Capstone.ipynb',
 'index.md',
 'mydata.txt',
 'myfile2',
 'myfile3',
 'some_str2']

Yep! Note how we used a list comprehension to filter out all the extraneous files.
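One caveat: the test `".txt" in x` matches that text anywhere in the name, not only at the end. A small sketch with made-up file names shows how it differs from checking the extension with `str.endswith` or `os.path.splitext`:

```python
import os

# Hypothetical file names, used only to illustrate the filters.
names = ["mydata.txt", "notes.txt.bak", "060files.ipynb", "readme"]

contains_txt = [x for x in names if ".txt" in x]
ends_with_txt = [x for x in names if x.endswith(".txt")]
by_extension = [x for x in names if os.path.splitext(x)[1] == ".txt"]

print(contains_txt)   # ['mydata.txt', 'notes.txt.bak']
print(ends_with_txt)  # ['mydata.txt']
print(by_extension)   # ['mydata.txt']
```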

### Path independence and `os`

We can use `dirname` to get the parent folder of a folder, in a platform-independent way.

In [18]:
these = os.getcwd().split("/")
print(these[1:-1])  # possible, but not recommended: Windows separates path elements with "\", macOS and Linux with "/"

C:\Users\Dustin\Documents\GitHub\rsd-engineeringcourse\ch01data


In [20]:
os.path.dirname(os.getcwd())

'C:\\Users\\Dustin\\Documents\\GitHub\\rsd-engineeringcourse'

We could do this manually using `split`:

In [21]:
"/".join(os.getcwd().split("/")[:-1])

'C:/Users/Dustin/Documents/GitHub/rsd-engineeringcourse'

In [33]:
this = os.getcwd().split("\\")
print(this)
print(this[-1::-1]) # invert

['C:', 'Users', 'Dustin', 'Documents', 'GitHub', 'rsd-engineeringcourse', 'ch01data']
['ch01data', 'rsd-engineeringcourse', 'GitHub', 'Documents', 'Dustin', 'Users', 'C:']


But splitting on `/` would not work on Windows, where path elements are separated with a `\` instead of a `/`, and splitting on `\` would not work anywhere else. So it's important
to use `os.path` for this stuff.
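A minimal sketch of the `os.path` approach, using only standard-library functions: `dirname` and `basename` take a path apart, and `join` puts it back together with whatever separator the current platform uses.

```python
import os

here = os.getcwd()

# dirname/basename split off the last path element using the platform's own rules.
parent = os.path.dirname(here)
last = os.path.basename(here)

# join glues elements back together with the right separator for this platform.
rebuilt = os.path.join(parent, last)

print(os.sep)           # "/" on Linux and macOS, "\" on Windows
print(rebuilt == here)
```

Because no separator is typed by hand, the same lines run unchanged on Windows, macOS and Linux.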

**Supplementary Materials**: If you're not already comfortable with how files fit into folders, and folders form a tree,
    with folders containing subfolders, then look at http://swcarpentry.github.io/shell-novice/02-filedir/index.html. 

Satisfy yourself that after using `%%writefile`, you can then find the file on disk with Windows Explorer, macOS Finder, or the Linux shell.

We can see how in Python we can investigate the file system with functions in the `os` module, using just the same programming approaches as for anything else.

We'll gradually learn more features of the `os` module as we go, allowing us to move around the disk, `walk` around the
disk looking for relevant files, and so on. These will be important to master for automating our data analyses.
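As a preview, here is a hedged sketch of `os.walk` searching a folder tree for `.txt` files. The folder and file names are invented for the example, and the tree is built in a temporary directory so that the sketch is self-contained:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as top:
    # Build a tiny, made-up folder tree to search.
    os.makedirs(os.path.join(top, "raw", "2020"))
    for relative in [("a.txt",), ("raw", "b.csv"), ("raw", "2020", "c.txt")]:
        with open(os.path.join(top, *relative), "w") as target:
            target.write("placeholder\n")

    # os.walk yields (folder, subfolder names, file names) for every folder in the tree.
    found = []
    for folder, subfolders, files in os.walk(top):
        for name in files:
            if name.endswith(".txt"):
                found.append(os.path.relpath(os.path.join(folder, name), top))

print(sorted(found))
```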

### The Python `file` type

So, let's read our file:

In [16]:
myfile=open('mydata.txt')

In [53]:
with open('mydata.txt') as f:  # opens the file and closes it automatically at the end of the indented block (see "Closing files" below)
    t = f.read()

print(t)

In [17]:
type(myfile)

_io.TextIOWrapper

We can go line-by-line, by treating the file as an iterable:

In [18]:
[x for x in myfile]

["A poet once said, 'The whole universe is in a glass of wine.'\n",
 'We will probably never know in what sense he meant it, \n',
 'for poets do not write to be understood. \n',
 'But it is true that if we look at a glass of wine closely enough we see the entire universe. \n',
 'There are the things of physics: the twisting liquid which evaporates depending\n',
 'on the wind and weather, the reflection in the glass;\n',
 'and our imagination adds atoms.\n',
 "The glass is a distillation of the earth's rocks,\n",
 "and in its composition we see the secrets of the universe's age, and the evolution of stars. \n",
 'What strange array of chemicals are in the wine? How did they come to be? \n',
 'There are the ferments, the enzymes, the substrates, and the products.\n',
 'There in wine is found the great generalization; all life is fermentation.\n',
 'Nobody can discover the chemistry of wine without discovering, \n',
 'as did Louis Pasteur, the cause of much disease.\n',
 'How vivid is the

If we do that again, we get nothing: the file has already been read to the end, so there is no more data.

In [19]:
[x for x in myfile]

[]

We need to 'rewind' it!

In [20]:
myfile.seek(0)
[len(x) for x in myfile if 'know' in x]

[56, 39]

In [96]:
myfile.seek(0)
new = [x for x in myfile]  # a list of str, one per line of the file

print(new[0])
type(new[0])

A poet once said, 'The whole universe is in a glass of wine.'

str

It's really important to remember that a file is a *different* built-in type from a string.
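To see the difference between the file object and the strings it produces, here is a small self-contained sketch. It writes its own throwaway file in a temporary folder; the name `example.txt` is an assumption for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")
    with open(path, "w") as target:
        target.write("first line\nsecond line\n")

    with open(path) as source:
        lines = [line for line in source]       # iterating yields one str per line
        again = [line for line in source]       # the file is now exhausted
        source.seek(0)                          # rewind to the start
        rewound = [line for line in source]
        kind_of_file = type(source).__name__    # the file object itself is not a str
        kind_of_line = type(lines[0]).__name__

print(lines, again, rewound == lines, kind_of_file, kind_of_line)
```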

### Working with files

We can read one line at a time with `readline`: 

In [28]:
myfile.seek(0)
first = myfile.readline()  # reads the first line and advances the file position; nothing is deleted from the file on disk

In [29]:
first

"A poet once said, 'The whole universe is in a glass of wine.'\n"

In [30]:
second = myfile.readline()  # reads the next (second) line and advances the file position again

In [31]:
second

'We will probably never know in what sense he meant it, \n'

We can read the whole remaining file with `read`:

In [18]:
rest=myfile.read()

In [19]:
rest

"for poets do not write to be understood. \nBut it is true that if we look at a glass of wine closely enough we see the entire universe. \nThere are the things of physics: the twisting liquid which evaporates depending\non the wind and weather, the reflection in the glass;\nand our imagination adds atoms.\nThe glass is a distillation of the earth's rocks,\nand in its composition we see the secrets of the universe's age, and the evolution of stars. \nWhat strange array of chemicals are in the wine? How did they come to be? \nThere are the ferments, the enzymes, the substrates, and the products.\nThere in wine is found the great generalization; all life is fermentation.\nNobody can discover the chemistry of wine without discovering, \nas did Louis Pasteur, the cause of much disease.\nHow vivid is the claret, pressing its existence into the consciousness that watches it!\nIf our small minds, for some convenience, divide this glass of wine, this universe, \ninto parts -- \nphysics, biology

This means that, when a file is first opened, `read` is a convenient way to get the whole thing as a string:

In [20]:
open('mydata.txt').read()

"A poet once said, 'The whole universe is in a glass of wine.'\nWe will probably never know in what sense he meant it, \nfor poets do not write to be understood. \nBut it is true that if we look at a glass of wine closely enough we see the entire universe. \nThere are the things of physics: the twisting liquid which evaporates depending\non the wind and weather, the reflection in the glass;\nand our imagination adds atoms.\nThe glass is a distillation of the earth's rocks,\nand in its composition we see the secrets of the universe's age, and the evolution of stars. \nWhat strange array of chemicals are in the wine? How did they come to be? \nThere are the ferments, the enzymes, the substrates, and the products.\nThere in wine is found the great generalization; all life is fermentation.\nNobody can discover the chemistry of wine without discovering, \nas did Louis Pasteur, the cause of much disease.\nHow vivid is the claret, pressing its existence into the consciousness that watches it!

You can also read just a few characters:

In [35]:
myfile.seek(1335)

1335

In [33]:
myfile.read(15)

'\n   - Richard F'
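Rather than hard-coding a position, one option is to remember a position with `tell` and return to it later with `seek`. The sketch below assumes a throwaway two-line file (the name `example.txt` is made up) and reads part of a line, then jumps back and reads the whole line:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")
    with open(path, "w") as target:
        target.write("first line\nsecond line\n")

    with open(path) as source:
        source.readline()              # skip the first line
        bookmark = source.tell()       # remember the current position
        start = source.read(6)         # read just the next six characters
        source.seek(bookmark)          # jump back to the remembered position
        whole_line = source.readline()

print(repr(start), repr(whole_line))
```

For text files, positions obtained from `tell` (or 0) are the safe values to pass to `seek`.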

### Converting Strings to Files

Because files and strings are different types, we CANNOT just treat strings as if they were files:

In [38]:
mystring= "Hello World\n My name is James"

In [39]:
mystring

'Hello World\n My name is James'

In [40]:
mystring.readline()

AttributeError: 'str' object has no attribute 'readline'

This is important, because some file format parsers expect input from a **file** and not a string.
We can wrap a string in a file-like object using the `StringIO` class from the `io` module in the standard library:

In [91]:
from io import StringIO

In [92]:
mystringasafile=StringIO(mystring)

In [93]:
mystringasafile.readline()

'Hello World\n'

In [94]:
mystringasafile.readline()

' My name is James'

In [95]:
type(mystringasafile.readline())

str

Note that in a string, `\n` is used to represent a newline.
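As one illustration of a parser that wants a file rather than a string, the standard library's `json.load` reads from a file-like object, while `json.loads` takes a string. The JSON text below is made-up example data:

```python
import json
from io import StringIO

text = '{"name": "James", "scores": [1, 2, 3]}'  # made-up example data

# json.load expects a file-like object with a .read() method...
from_file = json.load(StringIO(text))
# ...while json.loads accepts the string directly.
from_string = json.loads(text)

print(from_file == from_string)
```

Wrapping the string in `StringIO` is what lets the file-based function accept it.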

### Closing files

We really ought to close files when we've finished with them: an open file ties up a file handle from the operating system, and data written to it may not reach the disk until it is closed. (On a shared computer,
this is particularly important.)

In [45]:
myfile.close()

Because it's so easy to forget this, Python provides a **context manager** to open a file, then close it automatically at
the end of an indented block:

In [49]:
with open('mydata.txt') as somefile:
    content = somefile.read()

content

"A poet once said, 'The whole universe is in a glass of wine.'\nWe will probably never know in what sense he meant it, \nfor poets do not write to be understood. \nBut it is true that if we look at a glass of wine closely enough we see the entire universe. \nThere are the things of physics: the twisting liquid which evaporates depending\non the wind and weather, the reflection in the glass;\nand our imagination adds atoms.\nThe glass is a distillation of the earth's rocks,\nand in its composition we see the secrets of the universe's age, and the evolution of stars. \nWhat strange array of chemicals are in the wine? How did they come to be? \nThere are the ferments, the enzymes, the substrates, and the products.\nThere in wine is found the great generalization; all life is fermentation.\nNobody can discover the chemistry of wine without discovering, \nas did Louis Pasteur, the cause of much disease.\nHow vivid is the claret, pressing its existence into the consciousness that watches it!

The code to be done while the file is open is indented, just like for an `if` statement.

You should pretty much **always** use this syntax for working with files.
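One reason to prefer it: the file is closed even if the code inside the block fails. The sketch below checks the file object's `closed` attribute in both cases; the file name `example.txt` and the deliberately raised error are illustrative assumptions.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")
    with open(path, "w") as target:
        target.write("some text\n")

    with open(path) as somefile:
        open_inside = not somefile.closed   # still open inside the block
    closed_after = somefile.closed          # closed as soon as the block ends

    try:
        with open(path) as somefile:
            raise RuntimeError("something went wrong while reading")
    except RuntimeError:
        closed_after_error = somefile.closed  # closed even though the block failed

print(open_inside, closed_after, closed_after_error)
```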

### Writing files

We might want to create a file from a string in memory. We can't do this with the notebook's `%%writefile` -- this is
just a notebook convenience, and isn't very programmable.

When we open a file, we can specify a 'mode', in this case, 'w' for writing. ('r' for reading is the default.)

In [59]:
with open('mywrittenfile', 'w') as target:
    target.write('Hello')
    target.write('World')

In [60]:
with open('mywrittenfile','r') as source:
    print(source.read())

HelloWorld


And we can "append" to a file with mode 'a':

In [61]:
with open('mywrittenfile', 'a') as target:
    target.write('Hello')
    target.write('James')

In [62]:
with open('mywrittenfile','r') as source:
    print(source.read())

HelloWorldHelloJames


If a file already exists, mode 'w' will overwrite it.
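The modes can be compared side by side in a short sketch (the file name is made up, and everything happens in a temporary folder). It also shows mode `'x'`, which, unlike `'w'`, refuses to open a file that already exists:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")

    with open(path, "w") as target:
        target.write("Hello\n")          # write() adds no newline; include "\n" yourself
    with open(path, "w") as target:
        target.write("World\n")          # 'w' again: the earlier contents are gone
    with open(path, "a") as target:
        target.write("James\n")          # 'a' keeps what is there and adds to the end

    try:
        open(path, "x")                  # 'x' refuses to touch an existing file
        refused = False
    except FileExistsError:
        refused = True

    with open(path) as source:
        final_text = source.read()

print(repr(final_text), refused)
```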

In [61]:
some_str = "i want to write this as a file"

In [62]:
type(some_str)

str

In [79]:
with open('myfile2', 'w') as fyle:  # open a text file object for writing, into which we can write a str
    fyle.write(some_str)
    

In [82]:
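some_str2 = "i also want to write THIS as a file"
print(type(some_str2))

some_str2 = StringIO(some_str2)  # wraps the str in an in-memory, file-like object; nothing is written to disk
print(type(some_str2))

open('some_str2', 'w')  # creates an empty file named 'some_str2' on disk; it is unrelated to the variable and is never written to or closed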


<class 'str'>
<class '_io.StringIO'>


<_io.TextIOWrapper name='some_str2' mode='w' encoding='UTF-8'>

In [85]:
open('some_str2','r')
#unpacked = [x for x in some_str2]
#print(unpacked)
#unpacked.seek(0)

type(some_str2)

_io.StringIO

In [86]:
some_str2.readline()

'i also want to write THIS as a file'

In [87]:
some_str2.seek(0)

0

**SUMMARY**

In [88]:
#### directory stuff

import os  # standard-library module for working with directories and files
os.getcwd()  # get the current working directory
os.listdir()  # list everything in the current directory

[x for x in os.listdir(os.getcwd()) if ".txt" in x]  # list comprehension to keep only the names containing ".txt"
os.path.dirname(os.getcwd())  # parent directory of the current directory

#### handling files

myfile = open('mydata.txt')  # open a text file in read mode, same as open('mydata.txt', 'r')

# accessing the contents of the file in different ways
# - read the entire (remaining) file content into one str
str_of_myfile = myfile.read()
# IMPORTANT: reading consumes the file; once the end is reached, further reads return nothing,
# so rewind before reading again
myfile.seek(0)  # go back to character 0 and start from there; seek can also jump to a later position
# - read a single line at a time
str_of_myfile_line = myfile.readline()  # reads the next line, in file order
str_of_myfile_part = myfile.readline(10)  # reads at most 10 characters of the next line (NOT line number 10)
# - list of the remaining lines of myfile
list_of_lines = [x for x in myfile]

# --> opening and reading can be combined
str_of_myfile = open('mydata.txt').read()  # open() creates the file object, read() returns its contents as a str

#### converting between str and file
from io import StringIO  # in-memory text file class from the io module (not part of os)
mystringasafile = StringIO('my text which I manually typed')
# this file-like object is now readable using .read() and .readline()

#### closing and managing files concisely
myfile.close()  # close files when done: this releases the file handle (the file on disk is untouched)
# to make this easier, work inside a *** with open(x) as y: *** block;
# Python automatically closes the file after executing the indented code block

with open('mydata.txt') as somefile:  # open file x as file object y for the duration of the indented block
    extracted_str = somefile.read()
extracted_str  # still available as a str; somefile has been closed automatically (mydata.txt stays on disk)


#### write str text to files
# .write() is the counterpart of .read(): it writes a str to the file
with open('mywrittenfile.txt', 'w') as target:  # create (or overwrite) 'mywrittenfile.txt', opened for writing as file object target
    target.write('Hello')  # write a str to target
    target.write('World')  # write another str directly after it (no newline is added)
    target.write(str_of_myfile_line)  # a str held in a variable works too

with open('mywrittenfile.txt', 'a') as target:  # 'a' appends instead of overwriting mywrittenfile.txt if it already exists
    target.write('Hello')
    target.write('James')

with open('mywrittenfile.txt', 'r') as source:  # just a reminder of how to get the file back into memory as a str
    written_str = source.read()

# a file object can't be passed to .write() directly; read it into a str first


'/Users/dustinherrmann/Documents/GitHub/rsd-engineeringcourse/ch01data'