# Week 6. Data Loading, Storage, and File Formats

**In this lecture, you will learn**

* Working with files in Python
* Reading and writing data in text format via pandas
* Handling binary data formats
* Using column-based file formats (Parquet)
* Using SQL (via the Python package ```sqlite3```)
  * We will continue this topic in week 7. 

---

In [1]:
import numpy as np
import pandas as pd
np.random.seed(12345)
np.set_printoptions(precision=4, suppress=True)

---

## 6.0 Files (Self-Reading)

* Most of this course uses high-level tools like ```pandas.read_csv``` to read data files from disk into Python data structures.
* However, it’s important to understand the basics of how to work with files in Python.

To open a file for reading or writing, use the built-in ```open``` function:

In [2]:
path = 'data/segismundo.txt'
f = open(path)

* By default, the file is opened in read-only mode ```'r'```.
* We can then treat the file handle ```f``` like a list and iterate over the lines like so:

In [3]:
for line in f: 
    print(line)

Sueña el rico en su riqueza,

que más cuidados le ofrece;



sueña el pobre que padece

su miseria y su pobreza;



sueña el que a medrar empieza,

sueña el que afana y pretende,

sueña el que agravia y ofende,



y en el mundo, en conclusión,

todos sueñan lo que son,

aunque ninguno lo entiende.


In [4]:
# The rstrip() method removes trailing characters (characters at the end of a string).
# With no argument it strips all trailing whitespace, including the newline.
lines = [x.rstrip() for x in open(path)]
lines

['Sueña el rico en su riqueza,',
 'que más cuidados le ofrece;',
 '',
 'sueña el pobre que padece',
 'su miseria y su pobreza;',
 '',
 'sueña el que a medrar empieza,',
 'sueña el que afana y pretende,',
 'sueña el que agravia y ofende,',
 '',
 'y en el mundo, en conclusión,',
 'todos sueñan lo que son,',
 'aunque ninguno lo entiende.']

* When you use ```open``` to create file objects, it is important to explicitly close the file when you are finished with it.
* Closing the file releases its resources back to the operating system.

In [5]:
f.close()

One of the ways to make it easier to clean up open files is to use the ```with``` statement.
 * This will automatically close the file ```f``` when exiting the ```with``` block.

In [6]:
with open(path) as f:
    lines = [x.rstrip() for x in f]

In [7]:
lines

['Sueña el rico en su riqueza,',
 'que más cuidados le ofrece;',
 '',
 'sueña el pobre que padece',
 'su miseria y su pobreza;',
 '',
 'sueña el que a medrar empieza,',
 'sueña el que afana y pretende,',
 'sueña el que agravia y ofende,',
 '',
 'y en el mundo, en conclusión,',
 'todos sueñan lo que son,',
 'aunque ninguno lo entiende.']
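The automatic cleanup performed by ```with``` can be checked through the file object's ```closed``` attribute. The following is a minimal sketch that writes a throwaway file in a temporary directory; the file name and its contents are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, 'demo.txt')  # hypothetical file

    # Create a small file with trailing spaces on the first line.
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('line one  \nline two\n')

    with open(demo_path, encoding='utf-8') as f:
        open_inside = not f.closed               # still open inside the block
        lines = [line.rstrip() for line in f]    # strip trailing whitespace

    closed_after = f.closed                      # closed once the block exits

print(open_inside, closed_after)
print(lines)
```

The handle ```f``` still exists after the block, but any further read on it would raise an error because the file is closed.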

* If we had typed ```f = open(path, 'w')```, a new file at "data/segismundo.txt" would have been created (***be careful!***), overwriting any one in its place.
* There is also the ```'x'``` file mode, which creates a writable file but fails if the file path already exists.

For readable files, some of the most commonly used methods are ```read```, ```seek```, and ```tell```. 
 * ```read``` returns a certain number of characters from the file.

In [8]:
f = open(path)
f.read(10)

'Sueña el r'

In [9]:
f.close()   # Lastly, we remember to close the file
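The other two methods, ```tell``` and ```seek```, report and change the current position in the file. The sketch below uses a small temporary file (a made-up stand-in for the poem) and opens it in binary mode so that positions are plain byte offsets; note that ```read(5)``` counts characters in text mode but bytes in binary mode.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, 'verse.txt')  # hypothetical file
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('Sueña el rico')

    # Text mode: read(5) returns 5 characters.
    with open(demo_path, encoding='utf-8') as f:
        text_chunk = f.read(5)

    # Binary mode: read(5) returns 5 bytes ('ñ' takes two bytes in UTF-8).
    with open(demo_path, 'rb') as f:
        byte_chunk = f.read(5)
        position = f.tell()    # byte offset after the read
        f.seek(0)              # jump back to the start of the file
        restart = f.read(3)

print(repr(text_chunk), byte_chunk, position, restart)
```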

### Python file modes

* ```r``` Read-only mode
* ```w``` Write-only mode; creates a new file (erasing the data for any file with the same name)
* ```x``` Write-only mode; creates a new file, but fails if the file path already exists
* ```a``` Append to existing file (create the file if it does not already exist)
* ```r+``` Read and write
* ```b``` Add to mode for binary files (e.g., ```'rb'``` or ```'wb'```)
* ```t``` Text mode for files (automatically decoding bytes to Unicode). This is the default if not specified. Add ```t``` to other modes to use this (e.g., ```'rt'``` or ```'xt'```)
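The difference between the ```w```, ```a```, and ```x``` modes listed above can be sketched as follows. The example works inside a temporary directory so that no real file is overwritten; the file name is hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, 'demo.txt')  # hypothetical file

    # 'w' creates the file (or truncates an existing one).
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('first line\n')

    # 'a' appends to the end instead of overwriting.
    with open(demo_path, 'a', encoding='utf-8') as f:
        f.write('second line\n')

    # 'x' refuses to overwrite: it raises FileExistsError if the path exists.
    try:
        with open(demo_path, 'x', encoding='utf-8') as f:
            f.write('never written\n')
        x_mode_failed = False
    except FileExistsError:
        x_mode_failed = True

    with open(demo_path, encoding='utf-8') as f:
        contents = f.read()

print(x_mode_failed)
print(repr(contents))
```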




---

## 6.1 Reading and Writing Data in Text Format

**pandas** features a number of functions for reading tabular data (e.g., ```read_csv```, ```read_excel```, ```read_sql```, etc.) as a DataFrame object.

Some of these functions, like ```pandas.read_csv```, perform type inference, because the column data types are not part of the data format. That means you don’t necessarily have to specify which columns are numeric, integer, boolean, or string. 

In [10]:
df = pd.read_csv('data/ex1.csv')  # Since this is comma-delimited, we can use read_csv
print(df.shape)
print(df.index)
print(df.columns)
print(df.dtypes)
df

(3, 5)
RangeIndex(start=0, stop=3, step=1)
Index(['a', 'b', 'c', 'd', 'message'], dtype='object')
a           int64
b           int64
c           int64
d           int64
message    object
dtype: object


Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
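Type inference is convenient, but it can also change data in ways you may not want. The sketch below uses a small in-memory CSV (made up for illustration) in which an identifier column has leading zeros; the ```dtype``` argument of ```read_csv``` overrides the inferred type for that column.

```python
import io

import pandas as pd

# Hypothetical CSV content; 'id' has leading zeros that matter.
csv_text = "id,price,flag,label\n001,1.5,True,x\n002,2.0,False,y\n"

inferred = pd.read_csv(io.StringIO(csv_text))
explicit = pd.read_csv(io.StringIO(csv_text), dtype={'id': str})

print(inferred.dtypes)          # 'id' is inferred as an integer column
print(inferred['id'].tolist())  # leading zeros are lost
print(explicit['id'].tolist())  # leading zeros are kept as strings
```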


We can also use ```read_table``` and specify the **delimiter**:

In [11]:
pd.read_table('data/ex1.csv', sep=',')

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


A file will not always have a **header row**. In that case, pass ```header=None``` so that pandas assigns default column names:

In [12]:
pd.read_csv('data/ex2.csv', header=None)

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [13]:
pd.read_csv('data/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])  # you can specify names yourself

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


Suppose you wanted the message column to be the index of the returned DataFrame.

You can either indicate you want the column at index 4 or named 'message' using the ```index_col``` argument. 

In [14]:
names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv('data/ex2.csv', names=names, index_col='message')   # the column message becomes the index

message,a,b,c,d
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12


In [15]:
pd.read_csv('data/ex2.csv', names=names, index_col='message').index

Index(['hello', 'world', 'foo'], dtype='object', name='message')

If you want to form a **hierarchical index** from multiple columns, pass a list of column numbers or names:

In [16]:
parsed = pd.read_csv('data/csv_mindex.csv',
                     index_col=['key1', 'key2'])
parsed

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16


In [17]:
parsed.index

MultiIndex([('one', 'a'),
            ('one', 'b'),
            ('one', 'c'),
            ('one', 'd'),
            ('two', 'a'),
            ('two', 'b'),
            ('two', 'c'),
            ('two', 'd')],
           names=['key1', 'key2'])
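Once the hierarchical index is in place, rows can be selected by the outer key alone or by a full ```(key1, key2)``` pair. A minimal sketch, using a shortened in-memory copy of the data above rather than the actual ```csv_mindex.csv``` file:

```python
import io

import pandas as pd

# Shortened, in-memory stand-in for data/csv_mindex.csv.
raw = (
    "key1,key2,value1,value2\n"
    "one,a,1,2\n"
    "one,b,3,4\n"
    "two,a,9,10\n"
)
parsed = pd.read_csv(io.StringIO(raw), index_col=['key1', 'key2'])

outer = parsed.loc['one']                     # all rows with key1 == 'one'
single = parsed.loc[('two', 'a'), 'value2']   # one cell by full key

print(outer)
print(single)
```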

In some cases, a table might not have a fixed delimiter, using whitespace or some other pattern to separate fields.

In [18]:
list(open('data/ex3.txt'))   # the fields here are separated by a variable amount of whitespace

['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']

In these cases, you can pass a regular expression as the delimiter for ```read_table```. Here, a run of one or more whitespace characters is expressed by the regular expression ```\s+```.
 * Because there is one fewer column name than the number of data columns, ```read_table``` infers that the first column should be the DataFrame’s index in this special case.

In [19]:
result = pd.read_table('data/ex3.txt', sep=r'\s+')   # raw string avoids an invalid-escape warning
result

Unnamed: 0,A,B,C
aaa,-0.264438,-1.026059,-0.6195
bbb,0.927272,0.302904,-0.032399
ccc,-0.264273,-0.386314,-0.217601
ddd,-0.871858,-0.348382,1.100491
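The same behavior can be reproduced without the data file. The sketch below feeds the first lines shown above to ```read_csv``` through ```io.StringIO```; it assumes the in-memory text matches the layout of ```ex3.txt```.

```python
import io

import pandas as pd

# First lines of the whitespace-separated table shown above.
text = (
    "            A         B         C\n"
    "aaa -0.264438 -1.026059 -0.619500\n"
    "bbb  0.927272  0.302904 -0.032399\n"
)

# Three header names but four fields per row: the first field becomes the index.
table = pd.read_csv(io.StringIO(text), sep=r'\s+')

print(table)
print(table.loc['bbb', 'A'])
```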


You can skip the first, third, and fourth rows of a file with ```skiprows```:

In [20]:
pd.read_csv('data/ex4.csv', skiprows=[0, 2, 3])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
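Skipping rows by position works when you know exactly where the unwanted lines are. If those lines share a marker character, the ```comment``` argument of ```read_csv``` is an alternative. The sketch below assumes a file whose extra lines all start with ```#```; the content is made up and is not the actual ```ex4.csv```.

```python
import io

import pandas as pd

# Hypothetical file content with comment lines around the header.
raw = (
    "# exported by a hypothetical tool\n"
    "a,b,c,d,message\n"
    "# units: none\n"
    "# source: made up for this sketch\n"
    "1,2,3,4,hello\n"
    "5,6,7,8,world\n"
)

by_position = pd.read_csv(io.StringIO(raw), skiprows=[0, 2, 3])
by_marker = pd.read_csv(io.StringIO(raw), comment='#')

print(by_position)
print(by_position.equals(by_marker))
```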


* Handling missing values is an important and frequently nuanced part of the file parsing process.
* Missing data is usually either not present (empty string) or marked by some ***sentinel*** value.
  * By default, pandas uses a set of commonly occurring sentinels, such as ```NA``` and ```NULL```.

In [21]:
result = pd.read_csv('data/ex5.csv')
result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [22]:
pd.isnull(result)

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,True
1,False,False,False,True,False,False
2,False,False,False,False,False,False


The ```na_values``` option accepts a list or set of additional strings to treat as missing values:

In [23]:
result = pd.read_csv('data/ex5.csv', na_values=['NULL'])
result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


Different NA sentinels can be specified for each column in a ```dict```:

In [24]:
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
pd.read_csv('data/ex5.csv', na_values=sentinels)

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,,5,6,,8,world
2,three,9,10,11.0,12,
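The strings passed to ```na_values``` are added to the default sentinels. To switch the defaults off and treat only your own strings as missing, ```read_csv``` also accepts ```keep_default_na=False```. A minimal sketch on made-up in-memory data:

```python
import io

import pandas as pd

# Hypothetical data: three different spellings of "missing".
raw = "something,a,message\none,1,NA\ntwo,2,NULL\nthree,3,n/a\n"

# Default behavior: all three spellings are recognized as missing.
default = pd.read_csv(io.StringIO(raw))

# Only 'NA' counts as missing; 'NULL' and 'n/a' stay as ordinary strings.
strict = pd.read_csv(io.StringIO(raw), keep_default_na=False, na_values=['NA'])

print(default['message'].isna().tolist())
print(strict['message'].tolist())
```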


### 6.1.1 Reading Text Files in Pieces

When processing **very large files** or figuring out the right set of arguments to correctly process a large file, you may only want to read in a small piece of a file or iterate through smaller chunks of the file.

Before we look at a large file, we make the pandas display settings more compact.

In [25]:
pd.options.display.max_rows = 10   # present only 10 rows of data

In [26]:
result = pd.read_csv('data/ex6.csv')   
result

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.501840,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q
...,...,...,...,...,...
9995,2.311896,-0.417070,-1.409599,-0.515821,L
9996,-0.479893,-0.650419,0.745152,-0.646038,E
9997,0.523331,0.787112,0.486066,1.093156,K
9998,-0.362559,0.598894,-1.843201,0.887292,G


If you only want to read a small number of rows (*avoiding reading the entire file*), specify that with ```nrows```:

In [27]:
pd.read_csv('data/ex6.csv', nrows=5)   # only read the first five rows

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.50184,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q


To read a file in pieces, specify a ```chunksize``` as a number of rows:

In [28]:
chunker = pd.read_csv('data/ex6.csv', chunksize=1000)
chunker

<pandas.io.parsers.readers.TextFileReader at 0x7f6a53013e50>

The ```TextFileReader``` object returned by ```read_csv``` allows you to iterate over the parts of the file according to the ```chunksize```.
 * For example, we can iterate over ex6.csv, aggregating the value counts in the 'key' column like so:

In [29]:
chunker = pd.read_csv('data/ex6.csv', chunksize=1000)

tot = pd.Series([], dtype='float64')   # explicit dtype avoids the empty-Series warning
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)


In [30]:
tot[:10]

E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64
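The chunked aggregation above can be tried without the large file. The sketch below builds a ten-row CSV in memory (made-up data) and reads it four rows at a time, accumulating both value counts and a running sum, so only one chunk is held in memory at any moment.

```python
import io

import pandas as pd

# Hypothetical data: keys cycle through A, B, C; values are 0..9.
raw = "key,value\n" + "".join(f"{'ABC'[i % 3]},{i}\n" for i in range(10))

counts = pd.Series(dtype='float64')
total = 0
for piece in pd.read_csv(io.StringIO(raw), chunksize=4):
    # Merge this chunk's counts into the running counts.
    counts = counts.add(piece['key'].value_counts(), fill_value=0)
    total += piece['value'].sum()

counts = counts.astype('int64').sort_index()

print(counts)
print(total)
```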

### 6.1.2 Writing Data to Text Format

In [31]:
data = pd.read_csv('data/ex5.csv')
data

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


Using DataFrame’s ```to_csv``` method, we can write the data out to a comma-separated file:

In [32]:
data.to_csv('data/out.csv')

Other delimiters can be used:

In [33]:
import sys
data.to_csv(sys.stdout, sep='|')  # writing to sys.stdout so it prints the text result to the console

|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo


Missing values appear as **empty strings** in the output. You might want to denote them by some other sentinel value:

In [34]:
data.to_csv(sys.stdout, na_rep='NULL')

,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo


In [35]:
data.to_csv(sys.stdout, index=False, header=False)   # omit both the index and the column names

one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo


You can also write only **a subset of the columns**, and in an order of your choosing:

In [36]:
data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])

a,b,c
1,2,3.0
5,6,
9,10,11.0


Series also has a ```to_csv``` method:

In [37]:
dates = pd.date_range('1/1/2000', periods=7)
ts = pd.Series(np.arange(7), index=dates)
ts.to_csv('data/tseries.csv')
print(ts)

2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
2000-01-06    5
2000-01-07    6
Freq: D, dtype: int64
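A CSV file stores dates as plain text, so reading the series back requires asking pandas to parse them again. The following round trip is a sketch that writes to an in-memory buffer instead of ```data/tseries.csv```; the series name ```value``` is an assumption added so that the column has a header.

```python
import io

import numpy as np
import pandas as pd

dates = pd.date_range('2000-01-01', periods=7)
ts = pd.Series(np.arange(7), index=dates, name='value')  # 'value' is a made-up name

# Write to an in-memory buffer, then rewind it for reading.
buffer = io.StringIO()
ts.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the index; parse_dates=True converts it back to dates.
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)['value']

print(restored)
```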


### 6.1.3 JSON Data

**JSON (short for JavaScript Object Notation)** has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more free-form data format than a tabular text form like CSV. Here is an example of a JSON string:

In [38]:
obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
              {"name": "Katie", "age": 38,
               "pets": ["Sixes", "Stache", "Cisco"]}]
}
"""

To convert a JSON string to Python form, use ```json.loads```:

In [39]:
import json
result = json.loads(obj)
print(type(result))   # It is a dict
result

<class 'dict'>


{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}

```json.dumps```, on the other hand, converts a Python object back to **JSON**:

In [40]:
asjson = json.dumps(result)
print(type(asjson))   # It is a (JSON) string
asjson

<class 'str'>


'{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'
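The round trip through JSON is not always exact, because JSON has fewer types than Python. The sketch below uses a made-up record to show that ```None``` maps to ```null``` and back, while a tuple comes back as a list; ```indent``` and ```sort_keys``` are optional arguments of ```json.dumps``` that make the output easier to read.

```python
import json

# Hypothetical record containing a tuple, which JSON cannot represent.
record = {'name': 'Wes', 'pet': None, 'ages': (30, 38), 'active': True}

encoded = json.dumps(record, indent=2, sort_keys=True)
decoded = json.loads(encoded)

print(encoded)
print(decoded['ages'])      # the tuple has become a list
print(decoded == record)    # so the round trip is not an exact copy
```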

We can pass the parsed JSON object (here, a list of dicts taken from ```result```) to the DataFrame constructor:

In [41]:
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
siblings

Unnamed: 0,name,age
0,Scott,30
1,Katie,38


```pandas.read_json``` can automatically convert JSON datasets in specific arrangements into a Series or DataFrame.

In [42]:
data = pd.read_json('data/example.json')
data

Unnamed: 0,a,b,c
0,1,2,3
1,4,5,6
2,7,8,9


If you need to export data from pandas to JSON, one way is to use the ```to_json``` method on Series and DataFrame:

In [43]:
print(data.to_json())

{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}


In [44]:
print(data.to_json(orient='records'))

[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
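The ```orient``` argument controls the layout of the JSON, and the same value can be given to ```read_json``` to read the data back. A minimal sketch on a small made-up frame, using an in-memory buffer rather than a file:

```python
import io

import pandas as pd

frame = pd.DataFrame({'a': [1, 4, 7], 'b': [2, 5, 8]})

# 'records' writes one JSON object per row and drops the index.
records_json = frame.to_json(orient='records')

restored = pd.read_json(io.StringIO(records_json), orient='records')

print(records_json)
print(restored)
```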


In [45]:
?pd.DataFrame.to_json

---

## 6.2 Binary Data Formats

One of the easiest ways to store data efficiently in **binary format** (a process known as serialization) is to use Python’s built-in **pickle** module.
* ```pandas.read_pickle```: read pickle files as DataFrames
* ```pandas.to_pickle```: save DataFrames as pickle files

In [46]:
frame = pd.read_csv('data/ex1.csv')
frame

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [47]:
frame.to_pickle('data/frame_pickle')

In [48]:
pd.read_pickle('data/frame_pickle')

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
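One practical difference from CSV is that a pickle file keeps the column types. The sketch below compares the two on a small made-up frame with a datetime column and a string column of zero-padded codes; it writes the pickle to a temporary directory.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical frame with types that a CSV file cannot record.
frame = pd.DataFrame({
    'when': pd.to_datetime(['2000-01-01', '2000-01-02']),
    'code': ['001', '002'],
})

with tempfile.TemporaryDirectory() as tmpdir:
    pickle_path = os.path.join(tmpdir, 'frame.pkl')
    frame.to_pickle(pickle_path)
    from_pickle = pd.read_pickle(pickle_path)

# CSV round trip through a string: types must be inferred again on read.
from_csv = pd.read_csv(io.StringIO(frame.to_csv(index=False)))

print(from_pickle.equals(frame))   # types and values survive
print(from_csv['code'].tolist())   # zero-padded codes become integers
print(from_csv['when'].dtype)      # dates come back as plain text
```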


### Reading Microsoft Excel Files

In [49]:
xlsx = pd.ExcelFile('data/ex1.xlsx')

In [50]:
pd.read_excel(xlsx, 'Sheet1')

Unnamed: 0.1,Unnamed: 0,a,b,c,d,message
0,0,1,2,3,4,hello
1,1,5,6,7,8,world
2,2,9,10,11,12,foo


If you are reading multiple sheets in a file, then it is faster to create the ```ExcelFile``` once and reuse it, but you can also simply pass the filename to ```pandas.read_excel```:

In [51]:
frame = pd.read_excel('data/ex1.xlsx', 'Sheet1')
frame

Unnamed: 0.1,Unnamed: 0,a,b,c,d,message
0,0,1,2,3,4,hello
1,1,5,6,7,8,world
2,2,9,10,11,12,foo


In [52]:
writer = pd.ExcelWriter('data/ex2.xlsx')
frame.to_excel(writer, sheet_name='Sheet1')   # pass sheet_name by keyword; positional use is deprecated in recent pandas
writer.close()

In [53]:
# You can also pass a file path to to_excel and avoid the ExcelWriter:
frame.to_excel('data/ex2.xlsx')

---

## 6.3 Column-Based File Formats (Parquet)

**Column pruning** is a big performance improvement that's possible for column-based file formats (e.g., **Parquet**) and not possible for row-based file formats (e.g., CSV). *I highly recommend that you use the Parquet format*. 
 * Parquet files are typically smaller than the equivalent CSV files, and they can usually be read and written much faster.
 * Parquet files also support nested data structures, which makes them ideal for storing complex data. 

Let's look at the file sizes of the same dataset (a panel of Hong Kong listed companies with stock returns and 151 signals to predict returns) stored in **parquet** and **csv** formats:

In [54]:
import os
print('Size of parquet format:', str(os.path.getsize('data/HK_stocks_151signals.parquet')/1000**2), 'MB')
print('Size of csv format:', str(os.path.getsize('data/HK_stocks_151signals.csv')/1000**2), 'MB')

Size of parquet format: 281.776629 MB
Size of csv format: 941.151779 MB


In [55]:
import timeit

In [56]:
start = timeit.default_timer()

D = pd.read_csv('data/HK_stocks_151signals.csv')
print(D.shape)

print(f'Total configuration execution time: {(timeit.default_timer() - start):.4f}s.', flush=True)

(413279, 155)
Total configuration execution time: 4.8096s.


In [57]:
start = timeit.default_timer()

D = pd.read_parquet('data/HK_stocks_151signals.parquet', engine='pyarrow')
print(D.shape)

print(f'Total configuration execution time: {(timeit.default_timer() - start):.4f}s.', flush=True)

(413279, 154)
Total configuration execution time: 0.1755s.


Again, Parquet is a columnar file format, so pandas can read only the columns that you want and skip the others. **This is a massive performance improvement!**
 * By contrast, the ```usecols``` argument of ```pandas.read_csv``` still has to scan every line of the file, because CSV stores data row by row. It only saves the cost of parsing and storing the unused columns.

In [58]:
start = timeit.default_timer()

D_small = pd.read_csv('data/HK_stocks_151signals.csv', usecols=['id', 'eom', 'ret_exc_lead1m'])

print(f'Total configuration execution time: {(timeit.default_timer() - start):.4f}s.', flush=True)

Total configuration execution time: 2.5336s.


In [59]:
start = timeit.default_timer()

D_small = pd.read_parquet('data/HK_stocks_151signals.parquet', engine='pyarrow', 
                          columns=['id', 'eom', 'ret_exc_lead1m'])

print(f'Total configuration execution time: {(timeit.default_timer() - start):.4f}s.', flush=True)

Total configuration execution time: 0.0249s.
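Column pruning can be tried on a small synthetic table as well. The sketch below assumes the ```pyarrow``` package is installed; the guard at the top only lets the snippet run (and do nothing) when it is not. The column names and sizes are made up and unrelated to the Hong Kong dataset.

```python
import importlib.util
import os
import tempfile

import numpy as np
import pandas as pd

# Assumption: pyarrow is available; otherwise the Parquet part is skipped.
HAS_PYARROW = importlib.util.find_spec('pyarrow') is not None

# Synthetic wide table: an id column plus 20 made-up signal columns.
rng = np.random.default_rng(0)
wide = pd.DataFrame(rng.standard_normal((1000, 20)),
                    columns=[f'signal_{i}' for i in range(20)])
wide.insert(0, 'id', np.arange(1000))

subset_columns = None
values_match = None
if HAS_PYARROW:
    with tempfile.TemporaryDirectory() as tmpdir:
        parquet_path = os.path.join(tmpdir, 'wide.parquet')
        wide.to_parquet(parquet_path, engine='pyarrow', index=False)

        # Only the two requested columns are read from disk.
        subset = pd.read_parquet(parquet_path, engine='pyarrow',
                                 columns=['id', 'signal_3'])
        subset_columns = list(subset.columns)
        values_match = subset['signal_3'].equals(wide['signal_3'])

print(subset_columns, values_match)
```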


### In short, please use the Parquet format if possible.

---

# END