I/O
====

The first step in data analysis is often getting the data and parsing it into a convenient format. This notebook explores how to load and save plain text and JSON. We will cover numeric, tabular and database options in a subsequent lecture.

1. [Reading files](#read)
2. [Writing to files](#write)
3. [Loading from web](#web)
4. [Loading JSON](#json)
5. [Review Problems](#hmk)

When we want to read from or write to a file, we need to open it first. When we are done, it needs to be closed so that the resources tied to the file are freed. Hence, in Python, a file operation takes place in the following order.

<ol>
<li>Open a file
<li>Read or write (perform operation)
<li>Close the file
</ol>
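
The three steps above can be sketched as follows. This is a minimal illustration written for Python 3 (the cells in this notebook use Python 2), and `demo.txt` is a hypothetical scratch file created in a temporary directory just for the example.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

f = open(path, 'w')        # 1. open the file
try:
    f.write('hello\n')     # 2. perform the operation
finally:
    f.close()              # 3. close it, even if the write raised an error

print(f.closed)            # True
```

The `try`/`finally` pair guarantees the close step runs; the `with` statement used later in this notebook does the same job with less typing.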

### Text files

A quick and dirty way to write a text file is the `%%file` cell magic.

In [4]:
%%file /Users/ilanman/gdi/animals.txt
name|species|age|weight
arun|cat|5|7.3
bob|bird|2|1.5
coco|cat|2|5.5
dumbo|elephant|23|454
elmo|dog|5|11
fido|dog|3|24.5
gumba|bird|2|2.7

Writing /Users/ilanman/gdi/animals.txt


Loading a text file<a id="read"></a>
=====

<ul>
<li>Specify the mode while opening a file
<li>read 'r', write 'w' or append 'a' to the file
<li>Specify if we want to open the file in text mode or binary mode. The default is reading in text mode. 
<li>In this mode, we get strings when reading from the file.
<li>Binary mode returns bytes - used when dealing with non-text files like image or exe files.
</ul>
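
To make the difference between text and binary mode concrete, here is a small sketch. It assumes Python 3, where the two modes return different types, and it uses a hypothetical scratch file in a temporary directory rather than `animals.txt`.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

with open(path, 'w') as f:       # 'w': write, text mode by default
    f.write('name|species\n')

with open(path, 'r') as f:       # 'r': read in text mode, gives a string
    as_text = f.read()

with open(path, 'rb') as f:      # 'rb': read in binary mode, gives bytes
    as_bytes = f.read()

print(type(as_text).__name__)    # str
print(type(as_bytes).__name__)   # bytes
```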

#### Basic way of opening a file

<ul>
<li>Remember to close the file after you're finished
</ul>

In [19]:
f = open('animals.txt')
print f.mode
print f.name

r
animals.txt


In [20]:
f.read()

'name|species|age|weight\narun|cat|5|7.3\nbob|bird|2|1.5\ncoco|cat|2|5.5\ndumbo|elephant|23|454\nelmo|dog|5|11\nfido|dog|3|24.5\ngumba|bird|2|2.7'

In [21]:
f.close()

#### Iterating over the file object to read one line at a time

<ul>
<li>Useful if we only want to extract some lines and the entire file is too large to fit into memory.
<li>Note the use of the `with` context manager
<li>Automates the closing of the file resource once the `with` block is exited, avoiding leakage of system resources
</ul>

In [22]:
with open('animals.txt') as f:   # enter the context manager
    for line in f:
        if 'cat' in line:
            print line.strip()

arun|cat|5|7.3
coco|cat|2|5.5
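
As a quick check that the `with` block really closes the file, the sketch below repeats the same pattern and then inspects `f.closed`. It is written for Python 3 and uses a hypothetical three-line scratch file instead of `animals.txt`.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'pets.txt')
with open(path, 'w') as f:
    f.write('arun|cat\nbob|bird\ncoco|cat\n')

cats = []
with open(path) as f:
    for line in f:               # the file object yields one line at a time
        if 'cat' in line:
            cats.append(line.strip())

print(cats)       # ['arun|cat', 'coco|cat']
print(f.closed)   # True: leaving the with block closed the file
```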


#### Reading into memory as a single string

In [26]:
with open('animals.txt') as f:
    text = f.read()

text

'name|species|age|weight\narun|cat|5|7.3\nbob|bird|2|1.5\ncoco|cat|2|5.5\ndumbo|elephant|23|454\nelmo|dog|5|11\nfido|dog|3|24.5\ngumba|bird|2|2.7'

#### Reading into memory as a list of strings

In [27]:
with open('animals.txt') as f:
    text = f.readlines()

text

['name|species|age|weight\n',
 'arun|cat|5|7.3\n',
 'bob|bird|2|1.5\n',
 'coco|cat|2|5.5\n',
 'dumbo|elephant|23|454\n',
 'elmo|dog|5|11\n',
 'fido|dog|3|24.5\n',
 'gumba|bird|2|2.7']

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

## EXERCISE TIME!

Read in the `animals.txt` file and only print the last column of the document. The result should be:

```
weight
7.3
1.5
5.5
454
11
24.5
2.7
```

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

In [28]:
## SOLUTION

with open('animals.txt') as f:
    for line in f:
        # strip the newline, split on the delimiter and keep the last field
        print line.strip().split('|')[-1]

weight
7.3
1.5
5.5
454
11
24.5
2.7


Saving a text file<a id="write"></a>
=====

<ul>
<li>Open the file in write 'w' or append 'a' mode (Python 3 also offers exclusive creation 'x' mode).
<li>Be careful with 'w' mode: if the file already exists it is overwritten and all previous data are erased.
</ul>
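
The sketch below shows how the three modes treat an existing file. It assumes Python 3 (the 'x' mode and `FileExistsError` are not available in Python 2), and `log.txt` is a hypothetical scratch file.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'log.txt')

with open(path, 'w') as f:
    f.write('first\n')
with open(path, 'w') as f:       # 'w' truncates: 'first' is gone
    f.write('second\n')
with open(path, 'a') as f:       # 'a' keeps existing data and adds to the end
    f.write('third\n')

with open(path) as f:
    contents = f.read()
print(repr(contents))            # 'second\nthird\n'

try:
    open(path, 'x')              # 'x' refuses to touch an existing file
    created = True
except FileExistsError:
    created = False
print(created)                   # False
```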

In [42]:
s = """
name|species|age|weight
arun|cat|5|7.3
bob|bird|2|1.5
coco|cat|2|5.5
dumbo|elephant|23|454
elmo|dog|5|11
fido|dog|3|24.5
gumba|bird|2|2.7
"""

In [43]:
with open('animals2.txt', 'w') as f:
    f.write(s)

In [44]:
!cat 'animals2.txt'


name|species|age|weight
arun|cat|5|7.3
bob|bird|2|1.5
coco|cat|2|5.5
dumbo|elephant|23|454
elmo|dog|5|11
fido|dog|3|24.5
gumba|bird|2|2.7


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

## EXERCISE TIME!

1) Create a folder in your home directory called `pygdi`<br>
2) In that directory, create a text file (called io.txt) with the following data:

```
1234567890
abcdefghij
```

3) From this notebook, load io.txt<br>
4) Print to the screen the letters associated with each even number from the first line. The output should be:

```
bdfhj
```

5) After step 4., delete the directory and the file inside it.

Note that you should do the above in Python only. Hint: Look up the `os` module.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

In [33]:
## SOLUTION

import os
os.mkdir('/Users/ilanman/pygdi')

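s = """1234567890
abcdefghij"""

# Write inside a `with` block so the file is flushed and closed before it is read back
with open('/Users/ilanman/pygdi/io.txt', 'w') as f:
    f.write(s)

with open('/Users/ilanman/pygdi/io.txt') as f:
    # Second line, every other character starting at index 1: the letters under the even digits
    print f.readlines()[1][1::2]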

os.remove('/Users/ilanman/pygdi/io.txt')
os.rmdir('/Users/ilanman/pygdi')

bdfhj
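
The solution above hard-codes one user's home directory. A more portable variant might look like the sketch below, written for Python 3. It builds paths with `os.path.join` and, as an assumption made so the example is safe to run anywhere, uses a temporary directory as a stand-in for the home directory.

```python
import os
import tempfile

# Stand-in for the home directory; os.path.expanduser('~') would give the real one.
base = tempfile.mkdtemp()
folder = os.path.join(base, 'pygdi')
path = os.path.join(folder, 'io.txt')

os.mkdir(folder)
with open(path, 'w') as f:
    f.write('1234567890\nabcdefghij')

with open(path) as f:
    digits, letters = f.read().splitlines()

# Keep the letter paired with each even digit.
result = ''.join(ch for d, ch in zip(digits, letters) if int(d) % 2 == 0)
print(result)   # bdfhj

# Clean up: the file first, then the (now empty) directories.
os.remove(path)
os.rmdir(folder)
os.rmdir(base)
```

Pairing digits and letters with `zip` selects by the digit's value rather than by its position, so it would still work if the first line were in a different order.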


Web resources<a id="web"></a>
=====

<ul>
<li>`requests`: a third-party HTTP library
<li>`urllib2`: part of the Python 2 standard library
</ul>

In [3]:
import requests

In [26]:
# Only download once - Project Gutenberg will block you if you do this repeatedly

try:
    with open('Ulysses.txt') as f:
        text = f.read()

except IOError:
    url = 'http://www.gutenberg.org/files/4300/4300-0.txt'
    resp = requests.get(url)
    text = resp.text
    
    with open('Ulysses.txt', 'w') as f:
        f.write(text.encode('utf-8'))
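
The `encode('utf-8')` call above turns the downloaded text into bytes before it is written. An alternative, sketched here under the assumption that we are free to use `io.open`, is to let the file object do the encoding and decoding. The sample string and scratch file are made up for the illustration, and the `print` calls are written for Python 3.

```python
import io
import os
import tempfile

# Hypothetical scratch file and sample text, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
text = u'caf\u00e9'            # 4 characters, one of them non-ASCII

with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)              # encoded to UTF-8 bytes on the way out

with io.open(path, encoding='utf-8') as f:
    round_trip = f.read()      # decoded back to text on the way in

with io.open(path, 'rb') as f:
    raw = f.read()             # the bytes actually stored on disk

print(round_trip == text)      # True
print(len(text))               # 4 characters
print(len(raw))                # 5 bytes, since the accented letter takes two
```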

In [31]:
print(text[:500])

﻿


The Project Gutenberg EBook of Ulysses, by James Joyce

This eBook is for the use of anyone anywhere at no cost and with almost
no restrictions whatsoever.  You may copy it, give it away or re-use it
under the terms of the Project Gutenberg License included with this
eBook or online at www.gutenberg.org


Title: Ulysses

Author: James Joyce

Release Date: August 1, 2008 [EBook #4300] 
Last Updated: May 16, 2016

Language: English

Character set encoding: UTF-8

*** ST


Loading JSON data<a id="json"></a>
=====

<ul>
<li>JavaScript Object Notation (JSON) is a common format for storing and exchanging data on the web.
<li>The `json` module in the standard library translates JSON into nested Python dictionaries and lists.
</ul>
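
Before touching the network, it may help to see the translation on a small string. The sketch below, written for Python 3, uses a hand-written JSON snippet whose fields are taken from the show record further down.

```python
import json

# A short hand-written JSON document.
raw = ('{"name": "Silicon Valley", "genres": ["Comedy"], '
       '"rating": {"average": 8.6}, "webChannel": null}')

show = json.loads(raw)                  # JSON text -> Python objects
print(type(show).__name__)              # dict
print(show['genres'][0])                # Comedy
print(show['rating']['average'])        # 8.6
print(show['webChannel'])               # None (JSON null)

back = json.dumps(show, sort_keys=True) # Python objects -> JSON text
print(json.loads(back) == show)         # True
```

`json.loads` and `json.dumps` work on strings, while `json.load` and `json.dump` work on open files.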

In [36]:
import json
import urllib2

In [37]:
resp = urllib2.urlopen('http://api.tvmaze.com/singlesearch/shows?q=silicon-valley&embed=episodes')
text = resp.read()
with open('silicon_valley.json', 'w') as f:
    f.write(text)

In [38]:
with open('silicon_valley.json') as f:
    data = json.load(f)

In [39]:
!head -c 1000 silicon_valley.json

{"id":143,"url":"http://www.tvmaze.com/shows/143/silicon-valley","name":"Silicon Valley","type":"Scripted","language":"English","genres":["Comedy"],"status":"Running","runtime":30,"premiered":"2014-04-06","schedule":{"time":"22:00","days":["Sunday"]},"rating":{"average":8.6},"weight":13,"network":{"id":8,"name":"HBO","country":{"name":"United States","code":"US","timezone":"America/New_York"}},"webChannel":null,"externals":{"tvrage":33759,"thetvdb":277165,"imdb":"tt2575988"},"image":{"medium":"http://tvmazecdn.com/uploads/images/medium_portrait/53/132726.jpg","original":"http://tvmazecdn.com/uploads/images/original_untouched/53/132726.jpg"},"summary":"<p>In the high-tech gold rush of modern <strong><em>\"Silicon Valley\"</em></strong>, the people most qualified to succeed are the least capable of handling success. Mike Judge brings his irreverent brand of humor, and his own experiences working in <em>Silicon Valley</em>, to the award-winning comedy now entering its third season. </p>",

In [40]:
from pprint import pprint as pp

In [41]:
pp(data['_embedded'])

{u'episodes': [{u'_links': {u'self': {u'href': u'http://api.tvmaze.com/episodes/10897'}},
                u'airdate': u'2014-04-06',
                u'airstamp': u'2014-04-06T22:00:00-04:00',
                u'airtime': u'22:00',
                u'id': 10897,
                u'image': {u'medium': u'http://tvmazecdn.com/uploads/images/medium_landscape/49/123633.jpg',
                           u'original': u'http://tvmazecdn.com/uploads/images/original_untouched/49/123633.jpg'},
                u'name': u'Minimum Viable Product',
                u'number': 1,
                u'runtime': 30,
                u'season': 1,
                u'summary': u"<p>Attending an elaborate launch party, Richard  and his computer programmer friends - Big Head, Dinesh  and Gilfoyle  - dream of making it big. Instead, they're living in the communal Hacker Hostel owned by former programmer Erlich, who gets to claim ten percent of anything they invent there. When it becomes clear that Richard has developed
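
Once loaded, the data is ordinary nested dictionaries and lists, so we can walk it with plain indexing and loops. The sketch below, written for Python 3, uses a tiny hand-built stand-in for `data` with the same key layout as the output above; the second episode is a placeholder, not real data.

```python
# Hand-built stand-in for `data`, mirroring the structure printed above.
data = {
    '_embedded': {
        'episodes': [
            {'name': 'Minimum Viable Product', 'season': 1, 'number': 1},
            # Placeholder entry, not real data, so the loop has two items.
            {'name': 'Placeholder Episode', 'season': 1, 'number': 2},
        ]
    }
}

titles = []
for ep in data['_embedded']['episodes']:
    # Format each episode as SxxExx followed by its name.
    titles.append('S{:02d}E{:02d} {}'.format(ep['season'], ep['number'], ep['name']))

print('\n'.join(titles))
```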

## Review Problems<a id='hmk'></a>

1) Read in and save the JSON returned by the following URL for each make, each to its own `.json` file, in the `pygdi` directory.

```python
http://www.carqueryapi.com/api/0.3/?callback=?&cmd=getModels&make=ford
```

where the `make` parameter takes each of the following values in turn:

```python
makes = ['Ford', 'GMC', 'Acura', 'Cadillac', 'Ferrari', 'Jaguar', 'Mercedes-Benz', 'BMW', 'Nissan', 'Porsche', 'Subaru', 'Toyota']
```

In the end you'll have 12 `.json` files. Note that the formatting of the JSON object may need to be adjusted slightly.

2) Find the number of models, for each make, that start with `'S'`.

3) Save the results as a `json` in a new file, `S_models.json`.

Your `S_models.json` file should look something like:

```python
{
"Acura": 1,
"Jaguar": 3,
"GMC": 8,
"Cadillac": 6,
"Nissan": 12,
"Porsche": 0,
"Toyota": 13,
"Ford": 11,
"Subaru": 2,
"Ferrari": 1,
"BMW": 1,
"Mercedes-Benz": 16
}
```

Hint: Try to do the whole thing for one make, then think about wrapping the entire process in a loop over every make.

In [97]:
## SOLUTION

import os
import urllib2
import json

os.mkdir('../pygdi')

makes = ['Ford', 'GMC', 'Acura', 'Cadillac', 'Ferrari', 'Jaguar', 'Mercedes-Benz', 'BMW', 'Nissan', 'Porsche', 'Subaru', 'Toyota']
s_models = {}

for m in makes:
    
    url = 'http://www.carqueryapi.com/api/0.3/?callback=?&cmd=getModels&make={}'.format(m)
    resp = urllib2.urlopen(url)
    text = resp.read()
    
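    make_file_name = '../pygdi/{}.json'.format(m)   # the exercise asks for `.json` files
    
    with open(make_file_name, 'w') as f:
        # Trim the wrapper characters around the JSON payload before saving
        f.write(text[3:-2])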
    
    with open(make_file_name, 'r') as f:
        data = json.load(f)
    
    # Count the models whose name starts with 'S'
    cnt = 0
    for model in data['Models']:
        if model['model_name'].startswith('S'):
            cnt += 1

    s_models[m] = cnt
    

with open('../pygdi/S_models.json', 'w') as f:
    json.dump(s_models, f)
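
The fixed slice `text[3:-2]` depends on the wrapper around the JSON always having the same length. A slightly more defensive alternative is sketched below, written for Python 3. The helper `strip_wrapper` and the wrapped sample string are both made up for this example, not taken from the API's actual response.

```python
import json

def strip_wrapper(text):
    """Return the substring from the first '{' to the last '}' inclusive."""
    start = text.index('{')
    end = text.rindex('}')
    return text[start:end + 1]

# Hypothetical wrapped response, invented for this illustration.
wrapped = '?({"Models": [{"model_name": "Sample"}, {"model_name": "Other"}]});'

data = json.loads(strip_wrapper(wrapped))
count = sum(1 for m in data['Models'] if m['model_name'].startswith('S'))
print(count)   # 1
```

Searching for the braces means the same code keeps working if the wrapper gains or loses a character, at the cost of assuming the payload is a single JSON object.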