# CH. 6 - Data loading, storage, and file formats

Optional arguments to the pandas parsing functions fall into a few categories:
- Indexing: treat one or more columns as the index of the returned df, and choose whether to get column names from the file, from the user, or not at all
- Type Inference and Data Conversion:
    - includes user-defined value conversions and custom list of missing value markers
- Datetime Parsing:
    - includes combining capability - can combine date and time info spread over multiple cols
- Iterating:
    - support for iterating over chunks of very large files
- Unclean Data Issues:
    - skip rows or a footer, comments, and other minor things like numbers with thousands separated by commas

Because real-world data can be so messy, the file-reading functions accept a large number of parameters



If the file has no header row - pass `header=None` and let pd assign default column names  
`pd.read_csv('somefile.csv', header=None)`

Set your own headers  
`pd.read_csv('somefile.csv', names=['a', 'b', 'c'])`

Set index from a column  
`names = ['a', 'b', 'c']`  
`pd.read_csv('file.csv', names=names, index_col='c')`

Hierarchical index from multiple columns - pass a list of col numbers or names (pp. 172)  
`parsed = pd.read_csv('file.csv', index_col=['key1', 'key2'])`

Specify a whitespace delimiter (a regular expression matching one or more whitespace characters)  
`result = pd.read_csv('examples/ex3.txt', sep=r'\s+')`

Skip rows in data:  
`pd.read_csv('data.csv', skiprows=[0, 2, 3])`  

Specify null values in data  
`pd.read_csv('data.csv', na_values=['NULL'])`
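Several of the options above can be combined in one call. A minimal sketch, assuming made-up file contents (the `raw` text below is hypothetical and `StringIO` stands in for a real file path):

```python
from io import StringIO

import pandas as pd

# Hypothetical file contents; StringIO stands in for a path on disk.
raw = StringIO(
    "# exported by some tool\n"
    "key1,key2,value\n"
    "one,a,1\n"
    "one,b,NULL\n"
    "two,a,3\n"
)

parsed = pd.read_csv(
    raw,
    skiprows=[0],                # drop the comment line at the top
    index_col=["key1", "key2"],  # build a hierarchical index from two columns
    na_values=["NULL"],          # treat the string NULL as a missing value
)
print(parsed)
```

The `value` column comes back as floats because the missing entry is stored as `NaN`.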



## Reading Text Files in Pieces

When dealing with a large file, you may want to read only a small piece of it or iterate over it in chunks

Set pd display settings more compact  
`pd.options.display.max_rows = 10`

Read a small number of rows  
`pd.read_csv('data.csv', nrows=5)`

Set a `chunksize` as a number of rows  
`chunker = pd.read_csv('data.csv', chunksize=1000)`
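With `chunksize` set, `read_csv` returns an iterator instead of a df, so the file can be processed piece by piece. A rough sketch that tallies values in a `key` column across chunks; the data and the column names are made up for illustration:

```python
from io import StringIO

import pandas as pd

# Hypothetical 10-row file with a repeating key column.
raw = StringIO("key,value\n" + "".join(f"{'abc'[i % 3]},{i}\n" for i in range(10)))

chunker = pd.read_csv(raw, chunksize=4)  # yields DataFrames of up to 4 rows

totals = pd.Series(dtype="float64")
n_chunks = 0
for piece in chunker:
    n_chunks += 1
    # Add this chunk's counts to the running totals; keys not seen yet count as 0.
    totals = totals.add(piece["key"].value_counts(), fill_value=0)

totals = totals.sort_values(ascending=False)
print(n_chunks)
print(totals)
```

Only one chunk is held in memory at a time, which is the point of reading this way.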

Export data to csv  
`data.to_csv('filepath/data.csv')`

Disable row and column labels  
`data.to_csv('filepath/data.csv', index=False, header=False)`

Choose a subset of columns  
`data.to_csv('filepath/data.csv', index=False, columns=['a', 'b'])`
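To check what `to_csv` will write without touching the disk, the output can go to an in-memory buffer. A small sketch with a made-up df; `sep` and `na_rep` are extra `to_csv` parameters not covered in the notes above, used here as assumptions about what a reader might want:

```python
from io import StringIO

import numpy as np
import pandas as pd

# Made-up data with one missing value.
data = pd.DataFrame({"a": [1, 2], "b": [3.5, np.nan], "c": ["x", "y"]})

buf = StringIO()
data.to_csv(
    buf,
    index=False,         # no row labels
    columns=["a", "b"],  # write only these columns
    sep="|",             # use a pipe instead of a comma
    na_rep="NULL",       # write missing values as the string NULL
)
text = buf.getvalue()
print(text)
```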





# JSON Data

Convert a JSON string into Python objects  
`import json`  
`result = json.loads(obj)`  

Convert back to json  
`asjson = json.dumps(result)`  

Pass a list of dicts to a df  
`siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])`

`pandas.read_json` can automatically convert JSON in specific arrangements into a Series or df  
By default it assumes that each object in the JSON array is a row in the table
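The JSON steps above fit together roughly like this. A sketch using a made-up JSON document (the names and values in `obj` are invented for illustration):

```python
import json
from io import StringIO

import pandas as pd

# Made-up JSON document.
obj = """
{"name": "Ana",
 "pet": null,
 "siblings": [{"name": "Ben", "age": 30, "pets": ["Rex", "Milo"]},
              {"name": "Cara", "age": 38, "pets": ["Socks"]}]
}
"""

result = json.loads(obj)      # JSON text -> dict; JSON null becomes None
asjson = json.dumps(result)   # dict -> JSON text

# List of dicts -> df, keeping only two of the fields.
siblings = pd.DataFrame(result["siblings"], columns=["name", "age"])

# read_json treats each object in a JSON array as one row.
records = json.dumps(result["siblings"])
frame = pd.read_json(StringIO(records))
print(siblings)
print(frame)
```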




# Web Scraping HTML and XML

`pandas.read_html` - automatically parses tables out of HTML files and returns a list of df objects

## Parse XML with `lxml.objectify`

`from lxml import objectify`  
`path = 'some/file/path.xml'`  
`parsed = objectify.parse(open(path))`  
`root = parsed.getroot()` - get to the root node of the XML file  

`root.INDICATOR` returns a generator yielding each `<INDICATOR>` XML element - for each element we can populate a dict of tag names to data values  

`data = []`  
`skip_fields = ['a', 'd']`

`for elt in root.INDICATOR:`  
    `el_data = {}`  
    `for child in elt.getchildren():`  
        `if child.tag in skip_fields:`  
            `continue`  
        `el_data[child.tag] = child.pyval`  
    `data.append(el_data)`  
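A self-contained version of the loop above, as a sketch: the XML snippet, its `PERFORMANCE` root, and the field names are made up, and `objectify.fromstring` is used in place of parsing a file so the example runs without one. It assumes `lxml` is installed.

```python
from lxml import objectify

import pandas as pd

# Made-up XML with two INDICATOR records.
xml = b"""<PERFORMANCE>
  <INDICATOR><a>1</a><b>first</b><c>2.5</c><d>skip me</d></INDICATOR>
  <INDICATOR><a>2</a><b>second</b><c>4.0</c><d>skip me</d></INDICATOR>
</PERFORMANCE>"""

root = objectify.fromstring(xml)  # returns the root element directly

skip_fields = ["a", "d"]
data = []
for elt in root.INDICATOR:        # iterates over every <INDICATOR> sibling
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval  # pyval is the text converted to a Python value
    data.append(el_data)

perf = pd.DataFrame(data)         # list of dicts -> df
print(perf)
```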


   

# 6.2 Binary Data Formats

Store data in binary format using Python's built-in `pickle` serialization  
All pandas objects have a `to_pickle` method that writes the data to disk in pickle format:

`df.to_pickle('path/pickledata')`  

Read pickled data  
`pd.read_pickle('path/pickledata')`
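A quick round trip shows that the pickled object comes back unchanged. This is a sketch with a made-up df, written to a temporary directory so nothing is left behind:

```python
import os
import tempfile

import pandas as pd

# Made-up data.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame_pickle")
    df.to_pickle(path)               # write the df in pickle format
    restored = pd.read_pickle(path)  # read it back

print(restored.equals(df))
```

Pickle is best kept for short-term storage, and pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.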

## HDF5 Format

Intended for storing large quantities of scientific array data  
A single file can store multiple datasets and supporting metadata

Good choice for very large datasets - can efficiently read and write small sections of much larger arrays

## Excel Files
Reading XLS and XLSX files requires the add-on packages `xlrd` and `openpyxl`, respectively

`xlsx = pd.ExcelFile('file.xlsx')`  
`pd.read_excel(xlsx, 'Sheet1')`



# 6.3 Web APIs

`import requests`  
`url = 'someurl.com/api_stuff'`  
`resp = requests.get(url)`  

The response's `json` method returns the JSON parsed into native Python objects (a dict or a list, depending on the payload)  
`data = resp.json()`  
`data[0]['title']`  
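When the parsed response is a list of records, it can go straight into a df. The sketch below makes no network call: the `data` list is a hypothetical stand-in for what `resp.json()` might return, and its field names are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for resp.json(): a list of dicts, one per record.
data = [
    {"number": 101, "title": "Fix typo in docs", "state": "open", "labels": []},
    {"number": 102, "title": "Speed up parser", "state": "closed", "labels": ["perf"]},
]

print(data[0]["title"])

# Keep only the fields of interest as columns.
issues = pd.DataFrame(data, columns=["number", "title", "state"])
print(issues)
```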

# 6.4  Databases

loading data from SQL into a df is fairly straightforward

create a SQLite db using `sqlite3` driver:  
`import sqlite3`  
`query = """`  
`CREATE TABLE test`  
`(a VARCHAR(20), b VARCHAR(20),`  
` c REAL, d INTEGER`  
`); """`

`con = sqlite3.connect('mydata.sqlite')`  
`con.execute(query)`  
`con.commit()`
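End to end, the database workflow looks roughly like this. A sketch that uses an in-memory SQLite database instead of a file, with made-up rows; `pd.read_sql` accepts a `sqlite3` connection directly:

```python
import sqlite3

import pandas as pd

query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
 c REAL, d INTEGER
);"""

con = sqlite3.connect(":memory:")  # in-memory db, nothing written to disk
con.execute(query)

# Made-up rows to insert.
rows = [
    ("Atlanta", "Georgia", 1.25, 6),
    ("Tallahassee", "Florida", 2.6, 3),
    ("Sacramento", "California", 1.7, 5),
]
con.executemany("INSERT INTO test VALUES (?, ?, ?, ?)", rows)
con.commit()

# Run a query and load the result set into a df in one step.
frame = pd.read_sql("SELECT * FROM test", con)
con.close()
print(frame)
```

Column names come from the query result, so no manual handling of the cursor description is needed.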