![read_write](images/read_write.jpg)

In [21]:
import pandas as pd
import numpy as np

In [22]:
f1 = pd.read_csv('data/ex1.csv')

In [23]:
f1

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [24]:
f2 = pd.read_table('data/ex1.csv', sep = ',')

In [25]:
f2

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [26]:
f3 = pd.read_csv('data/ex2.csv', header = None) # comment = '#' defines which character marks the rest of a line as a comment
# na_values = 'Nothing' lists additional values to treat as NA.

In [74]:
f3

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [27]:
pd.read_csv('data/ex2.csv', names = ['a', 'b', 'c', 'd', 'message'])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


### `index_col`
* Use a specific column as the index of the DataFrame.

In [28]:
names = ['a', 'b', 'c', 'd', 'message']

In [29]:
pd.read_csv('data/ex2.csv', names = names, index_col='message')

message,a,b,c,d
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12


### Hierarchical Index from multiple columns

In [30]:
pd.read_csv('data/ex3.csv', index_col = ['key1', 'key2'])

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16


In [31]:
list(open('data/ex4.txt'))

['A B C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb 0.927272 0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382 1.100491']

In [32]:
pd.read_table('data/ex4.txt', sep = r'\s+')

Unnamed: 0,A,B,C
aaa,-0.264438,-1.026059,-0.6195
bbb,0.927272,0.302904,-0.032399
ccc,-0.264273,-0.386314,-0.217601
ddd,-0.871858,-0.348382,1.100491


* The header has one fewer column than the data rows, so `read_table` infers that the first column should be the DataFrame's index.

In [38]:
pd.read_csv('data/ex5.csv', skiprows=[0,2,3])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


### Handling missing values

In [39]:
pd.read_csv('data/ex6.csv')

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


### `na_values` 
* Treat the values in this list or set as missing values.

In [35]:
pd.read_csv('data/ex6.csv', na_values=['NULL'])

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


* Different sentinel values can be defined for each column by passing a dict.

In [36]:
sentinels = {'message' : ['foo', 'NA'], 'something': ['two']}

In [37]:
pd.read_csv('data/ex6.csv', na_values=sentinels)

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,,5,6,,8,world
2,three,9,10,11.0,12,


![read csv options](images/read_csv.jpg)

In [40]:
pd.options.display.max_rows = 10

* To read only a small number of rows, specify `nrows`:
```
pd.read_csv('filename', nrows = 5)
```

* To read a file in pieces, specify `chunksize`:
```
pd.read_csv('filename', chunksize = 1000)
```
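
With `chunksize`, `read_csv` returns an iterator of DataFrames instead of a single DataFrame. A minimal sketch, assuming a small in-memory CSV in place of a real file (the `key` and `value` columns are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_text = "key,value\n" + "\n".join(f"{'ab'[i % 2]},{i}" for i in range(10))

# chunksize=4 yields DataFrames of at most 4 rows each.
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=4)

totals = {}
n_chunks = 0
for chunk in chunks:
    n_chunks += 1
    # Accumulate per-key sums one chunk at a time.
    for key, subtotal in chunk.groupby("key")["value"].sum().items():
        totals[key] = totals.get(key, 0) + int(subtotal)

print(n_chunks, totals)
```

Only one chunk is held in memory at a time, which is the reason to prefer this pattern for files that are too large to load at once.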

### Writing data to text format

In [41]:
f4 = pd.read_csv('data/ex6.csv')

In [42]:
f4

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [43]:
f4.to_csv('data/ex7.csv')

In [46]:
!type data\ex7.csv

,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


* We can also pass `sys.stdout` to print the result to the console.

In [47]:
import sys

In [48]:
f4.to_csv(sys.stdout, sep = '|', na_rep = 'NULL')

|something|a|b|c|d|message
0|one|1|2|3.0|4|NULL
1|two|5|6|NULL|8|world
2|three|9|10|11.0|12|foo


In [49]:
f4.to_csv(sys.stdout, index = False, header = False)

one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo


* Writing only a subset of the columns

In [50]:
f4.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])

a,b,c
1,2,3.0
5,6,
9,10,11.0


In [51]:
dates = pd.date_range('1/1/2017', periods=7)

In [52]:
dates

DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06', '2017-01-07'],
              dtype='datetime64[ns]', freq='D')

In [53]:
s1 = pd.Series(range(7), index = dates)

In [56]:
s1

2017-01-01    0
2017-01-02    1
2017-01-03    2
2017-01-04    3
2017-01-05    4
2017-01-06    5
2017-01-07    6
Freq: D, dtype: int64

In [55]:
s1.to_csv(sys.stdout)

2017-01-01,0
2017-01-02,1
2017-01-03,2
2017-01-04,3
2017-01-05,4
2017-01-06,5
2017-01-07,6


### Python's csv module

In [57]:
import csv

In [59]:
lines = []
with open('data/ex8.csv') as fd:
    lines = list(csv.reader(fd))

In [60]:
lines

[['a', 'b', 'c'], ['1', '2', '3'], ['1', '2', '3']]

In [61]:
header, values = lines[0], lines[1:]

In [62]:
data_dict = {h:v for h,v in zip(header, zip(*values))}

In [63]:
data_dict

{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
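
The header/values split above can also be sketched with `csv.DictReader`, which takes the field names from the first row. This is an illustrative alternative rather than the notebook's own approach; the in-memory text is an assumption that mirrors the lines shown above, and the conversion to `int` is added for illustration.

```python
import csv
import io

# Hypothetical in-memory file with the same shape as the lines shown above.
text = "a,b,c\n1,2,3\n1,2,3\n"

# DictReader uses the first row as field names and yields one dict per row.
rows = list(csv.DictReader(io.StringIO(text)))

# Pivot the row dicts into a column-oriented dict, converting values to int.
columns = {name: [int(row[name]) for row in rows] for name in rows[0]}
print(columns)
```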

* CSV files come in many different flavors. To define a new format with a different delimiter, string quoting convention, or line terminator, we define a subclass of `csv.Dialect`.

In [64]:
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

In [65]:
fd = open('data/ex8.csv')
reader = csv.reader(fd, dialect = my_dialect)

* We can also pass individual CSV dialect parameters as keywords without defining a subclass.

In [66]:
reader = csv.reader(fd, delimiter = '|')
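
To see both styles produce the same rows, here is a small sketch that reads semicolon-delimited text from memory. The `SemicolonDialect` name and the sample text are assumptions made for illustration.

```python
import csv
import io


class SemicolonDialect(csv.Dialect):
    # Hypothetical dialect with the same settings as my_dialect above.
    lineterminator = "\n"
    delimiter = ";"
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL


# The quoted field contains the delimiter and must stay in one piece.
text = 'a;b;c\n1;"x;y";3\n'

# Passing the dialect class and passing keyword arguments give the same rows.
via_dialect = list(csv.reader(io.StringIO(text), dialect=SemicolonDialect))
via_keywords = list(csv.reader(io.StringIO(text), delimiter=";"))
print(via_dialect)
```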

![reader parameter](images/reader.jpg)

* The `csv` module will not work for files with fixed multicharacter delimiters. In those cases we have to do the line splitting and cleanup ourselves.
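
A minimal sketch of that manual splitting, assuming a hypothetical two-character delimiter `::` with inconsistent spacing around it:

```python
# Hypothetical lines using the two-character delimiter "::" with stray spaces.
raw_lines = ["a :: b :: c", "1 :: 2 :: 3", "4::5::6"]

# Split each line on the delimiter and strip surrounding whitespace.
rows = [[field.strip() for field in line.split("::")] for line in raw_lines]
print(rows)
```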

In [67]:
with open('data/ex9.csv', 'w', newline='') as fd:  # newline='' is recommended by the csv docs
    writer = csv.writer(fd, dialect = my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))
    writer.writerow(('7', '8', '9'))
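
To inspect what the writer produces without touching the disk, the same idea can be sketched against an in-memory buffer. The keyword arguments below are assumed to be equivalent to the `my_dialect` settings.

```python
import csv
import io

buffer = io.StringIO()
# Keyword arguments matching the my_dialect settings used above.
writer = csv.writer(buffer, delimiter=";", lineterminator="\n",
                    quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerow(("one", "two", "three"))
writer.writerow(("1", "2", "3"))

written = buffer.getvalue()
print(written)

# Reading the text back with the same delimiter recovers the rows.
rows = list(csv.reader(io.StringIO(written), delimiter=";"))
```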

### JSON data `json` 

In [68]:
obj = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
{"name": "Katie", "age": 38,
"pets": ["Sixes", "Stache", "Cisco"]}]
} """

In [69]:
import json

In [70]:
result = json.loads(obj)

In [71]:
result

{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}

In [72]:
asjson = json.dumps(result) # convert python object to JSON
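
A short round-trip sketch of `json.dumps` and `json.loads`, using a made-up record, shows how Python `None` maps to JSON `null` and back:

```python
import json

# Hypothetical record, smaller than the obj string above.
record = {"name": "Wes", "pet": None, "ages": [30, 38]}

# dumps serializes to a JSON string; Python None becomes JSON null.
as_json = json.dumps(record)
print(as_json)

# loads parses the string back into equivalent Python objects.
restored = json.loads(as_json)
```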

In [73]:
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])

In [74]:
siblings

Unnamed: 0,name,age
0,Scott,30
1,Katie,38


In [76]:
pd.read_json('data/ex10.json')

Unnamed: 0,a,b,c
0,1,2,3
1,4,5,6
2,7,8,9


### HTML

* `pd.read_html` parses tables out of HTML files into a list of DataFrames. By default it searches for `<table>` tags and parses all tabular data within them.

In [77]:
tables = pd.read_html('data/failed_bank_list.html')

In [78]:
len(tables)

1

In [79]:
failures = tables[0]

In [80]:
failures.head()

Unnamed: 0,Bank Name,City,ST,CERT,Acquiring Institution,Closing Date,Updated Date
0,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,"December 15, 2017","February 21, 2018"
1,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,"October 13, 2017","February 21, 2018"
2,Fayette County Bank,Saint Elmo,IL,1802,"United Fidelity Bank, fsb","May 26, 2017","July 26, 2017"
3,"Guaranty Bank, (d/b/a BestBank in Georgia & Mi...",Milwaukee,WI,30003,First-Citizens Bank & Trust Company,"May 5, 2017","March 22, 2018"
4,First NBC Bank,New Orleans,LA,58302,Whitney Bank,"April 28, 2017","December 5, 2017"


In [81]:
close_timestamps = pd.to_datetime(failures['Closing Date'])

In [82]:
close_timestamps.dt.year.value_counts()

2010    157
2009    140
2011     92
2012     51
2008     25
       ... 
2004      4
2001      4
2007      3
2003      3
2000      2
Name: Closing Date, Length: 16, dtype: int64

### XML

* eXtensible Markup Language
* `lxml.objectify` is useful for parsing the file. `getroot` returns the root node of the XML file.

In [83]:
from lxml import objectify

In [84]:
parsed = objectify.parse(open('data/ex11.xml'))

In [85]:
root = parsed.getroot()

In [86]:
root

<Element INDICATOR at 0x1f3f8ef8e88>
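
Once the root node is available, its child elements can be walked and collected into records. The sketch below swaps in the standard library's `xml.etree.ElementTree` instead of `lxml.objectify`, and the XML text and tag names are made up for illustration; they are not the contents of `ex11.xml`.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the tag names are invented for this example.
xml_text = """
<PERFORMANCE>
  <INDICATOR><NAME>On-time</NAME><VALUE>96.9</VALUE></INDICATOR>
  <INDICATOR><NAME>Delayed</NAME><VALUE>3.1</VALUE></INDICATOR>
</PERFORMANCE>
""".strip()

root = ET.fromstring(xml_text)

# Turn each child record into a dict mapping tag name to text.
records = [{child.tag: child.text for child in item} for item in root]
print(records)
```

A list of dicts like `records` can be passed directly to `pd.DataFrame`.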

### Binary data format
* Using Python's built-in `pickle` serialization, we can store data efficiently in binary format.

In [87]:
df = pd.read_csv('data/ex1.csv')

In [88]:
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


* At the lowest level, computers understand only 0 and 1. Encodings are systems for representing characters in binary.
* ASCII was very popular for representing English characters and symbols. It contains 128 characters.
* Many other languages have their own encodings, which can lead to confusion, e.g. "Latin-1" and "Mac Roman".
* Python 3 uses UTF-8 as the default encoding for source files and for `str.encode`.
* Other popular encodings are "Windows-1251" and "Latin-1".
* To set the encoding when reading a CSV file, use `encoding="Latin-1"`. The default encoding is UTF-8.
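
The difference between encodings can be seen by encoding one string two ways. A minimal sketch with a made-up sample word:

```python
text = "caf\u00e9"  # the word "cafe" with an acute accent on the e

utf8_bytes = text.encode("utf-8")      # the accented letter takes two bytes
latin1_bytes = text.encode("latin-1")  # the accented letter takes one byte
print(len(utf8_bytes), len(latin1_bytes))

# Decoding with the wrong encoding raises no error here but garbles the text.
garbled = utf8_bytes.decode("latin-1")
print(garbled)
```

This is why the `encoding` argument matters when reading a file: the bytes are only meaningful together with the encoding that produced them.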

In [90]:
df.to_pickle('data/ex12')

In [91]:
pd.read_pickle('data/ex12')

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


* pickle should be used only for short-term storage. The format is not guaranteed to stay stable across versions, so an object pickled today may not be readable with a later version of a library.

* Other binary formats are HDF5, MessagePack, Feather, bcolz
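
The pickle round trip that `to_pickle` and `read_pickle` perform above can be sketched with the standard library alone. The dictionary here is a made-up stand-in for a DataFrame.

```python
import pickle

# Hypothetical object to serialize.
data = {"a": [1, 2, 3], "message": "hello"}

# dumps returns the pickled bytes; loads rebuilds an equal object.
payload = pickle.dumps(data)
restored = pickle.loads(payload)
print(type(payload).__name__, restored == data)
```

Only unpickle data from sources you trust, since loading a pickle can execute arbitrary code.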

### Excel files
* Using the `ExcelFile` class or `read_excel`, we can read Excel files.

In [93]:
xlsx = pd.ExcelFile('data/ex13.xlsx')

In [94]:
xlsx.sheet_names # Identify sheets in Excel file

['Sheet1']

* To parse specific sheet in dataframe

In [95]:
df = xlsx.parse('Sheet1') # or xlsx.parse(0)

* Other parameters for `parse` include `parse_cols` (list of column indexes to use; renamed `usecols` in newer pandas versions), `skiprows` (list of row indexes to ignore), and `names` (list of names for the columns).

In [96]:
f12 = pd.read_excel(xlsx, 'Sheet1')

In [97]:
pd.read_excel('data/ex13.xlsx', 'Sheet1')

Unnamed: 0,a,b,c,message
0,1,2,3,hello
1,4,5,6,world
2,7,8,9,foo


* To write to Excel, first create an `ExcelWriter`.

In [98]:
writer = pd.ExcelWriter('data/ex14.xlsx')

In [99]:
f12.to_excel(writer, 'Sheet1')

In [100]:
writer.save() # in pandas 2.0 and later, use writer.close() instead

In [101]:
f12.to_excel('data/ex15.xlsx')

### SAS and Stata files
* `SAS` : statistical analysis system
* `Stata` : Statistics + data

```
from sas7bdat import SAS7BDAT

with SAS7BDAT('filename.sas7bdat') as file:
    df_sas = file.to_data_frame()
```

#### `read_stata`
* `pd.read_stata('filename.dta')`

### HDF5 file
* Each HDF5 file can store multiple datasets and supporting metadata. It is useful for storing large quantities of data that don't fit in memory, because we can efficiently read and write small sections of much larger arrays.
* HDF5 stands for Hierarchical Data Format version 5.

```
import h5py
data = h5py.File("filename.hdf5", 'r')
```

### Structure of HDF5 file

* Printing name of the groups
```
for key in data.keys():
    print(key)
```
- `meta`: Metadata for file
- `Quality` : quality of data
- `strain` : 

* To explore metadata
```
for key in data['meta'].keys():
    print(key)
```

* After exploration we can print out corresponding value
```
data['meta'][key][()]  # replaces the .value attribute, which was removed in h5py 3.0
```

### Matlab file
* A `.mat` file contains the variables of a MATLAB workspace.

```
import scipy.io
mat = scipy.io.loadmat('filename.mat')
print(type(mat)) # it will be dict
```
* keys = MATLAB variable names
* values = objects assigned to the variables

### Web APIs

In [102]:
import requests

In [103]:
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'

In [104]:
resp = requests.get(url)

In [105]:
resp

<Response [200]>

In [106]:
data = resp.json() # return the JSON response parsed into native Python objects (here, a list of dictionaries)

In [107]:
data[0]['title']



In [108]:
issues = pd.DataFrame(data, columns=['number', 'title', 'labels', 'state'])

In [109]:
issues

Unnamed: 0,number,title,labels,state
0,24864,BLD: silence npy_no_deprecated warnings with n...,[],open
1,24863,BUG: fix floating precision formatting in pres...,[],open
2,24862,Output excel table objects with to_xlsx(),"[{'id': 49254273, 'node_id': 'MDU6TGFiZWw0OTI1...",open
3,24861,Displayed prevision of float in presence of np...,[],open
4,24860,ExtensionDtype.construct_array_type is not opt...,"[{'id': 849023693, 'node_id': 'MDU6TGFiZWw4NDk...",open
...,...,...,...,...
25,24827,TST: test_constructors.py::TestSeriesConstruct...,"[{'id': 127685, 'node_id': 'MDU6TGFiZWwxMjc2OD...",open
26,24825,DOC/CLN: Timezone section in timeseries.rst,"[{'id': 211029535, 'node_id': 'MDU6TGFiZWwyMTE...",open
27,24823,HDFExtError: Problem creating the table,"[{'id': 307649777, 'node_id': 'MDU6TGFiZWwzMDc...",open
28,24819,BUG: DataFrame.merge(suffixes=) does not respe...,"[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=...",open


### Database connection