
# Chapter 06: Data Loading, Storage, and File Formats

In [None]:
import pandas as pd
import numpy as np

example06DataURL = 'https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/examples/ex6.csv'

## 6.1 Reading and Writing Data in Text Format

- Parsing functions in pandas

| Function | Description |
|----------|-------------|
| read_csv | Load delimited data from a file, URL, or file-like object; use comma as default delimiter |
| read_table | Load delimited data from a file, URL, or file-like object; use tab ('\t') as default delimiter |
| read_fwf | Read data in fixed-width column format |
| read_clipboard | Version of `read_table` that reads data from the clipboard; useful for converting tables from web pages |
| read_excel | Read tabular data from an Excel XLS or XLSX file |
| read_hdf | Read HDF5 files written by pandas |
| read_html | Read all tables found in the given HTML document |
| read_json | Read data from a JSON string representation |
| read_msgpack | Read pandas data encoded using the MessagePack binary format |
| read_pickle | Read an arbitrary object stored in Python pickle format |
| read_sas | Read a SAS dataset stored in one of the SAS system's custom storage formats |
| read_sql | Read the result of a SQL query as a pandas DataFrame |
| read_stata | Read a dataset from Stata file format |
| read_feather | Read the Feather binary file format |
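
As a small illustration of the first two rows of the table, the sketch below parses the same records once with `read_csv` and once with `read_table`. The in-memory strings are made-up sample data (an assumption for illustration, not one of the book's example files), so the snippet runs without any download.

```python
import io
import pandas as pd

# Made-up sample data, used only for illustration
csv_text = "a,b,c\n1,2,3\n4,5,6\n"
tsv_text = csv_text.replace(",", "\t")

# read_csv splits fields on commas by default
from_csv = pd.read_csv(io.StringIO(csv_text))

# read_table splits fields on tabs by default
from_tsv = pd.read_table(io.StringIO(tsv_text))

# read_table can also read comma-separated data when sep is given explicitly
from_table_sep = pd.read_table(io.StringIO(csv_text), sep=",")

# All three calls should produce the same DataFrame
print(from_csv.equals(from_tsv))
print(from_csv.equals(from_table_sep))
```

The only difference between the two functions here is the default delimiter, which is why passing `sep` makes them interchangeable.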

- Some `read_csv`/`read_table` function arguments

| Argument | Description |
|----------|-------------|
| path | String indicating filesystem location, URL, or file-like object |
| sep or delimiter | Character sequence or regular expression to use to split fields in each row |
| header | Row number to use as column names; defaults to 0 (first row), but should be None if there is no header row |
| index_col | Column numbers or names to use as the row index in the result; can be a single name/number or a list of them for a hierarchical index |
| names | List of column names for result, combine with header=None |
| skiprows | Number of rows at beginning of file to ignore or list of row numbers (starting from 0) to skip |
| na_values | Sequence of values to replace with NA |
| comment | Character(s) to split comments off the end of lines |
| parse_dates | Attempt to parse data to `datetime`; `False` by default. If `True`, will attempt to parse the index. Otherwise, can specify a list of column numbers or names to parse. If an element of the list is a tuple or list, will combine multiple columns together and parse them as a single `datetime` column |
| keep_date_col | If joining columns to parse dates, keep the joined columns; `False` by default |
| converters | Dict mapping column numbers or names to functions |
| dayfirst | When parsing potentially ambiguous dates, treat as international format; `False` by default |
| date_parser | Function to use to parse dates |
| nrows | Number of rows to read from the beginning of file |
| iterator | Return a TextParser object for reading file piecemeal |
| chunksize | Size of file chunks for each iteration |
| skipfooter | Number of lines to ignore at end of file (named `skip_footer` in older pandas versions) |
| verbose | Print various parser output information, like the number of missing values placed in non-numeric columns |
| encoding | Text encoding for Unicode |
| squeeze | If the parsed data only contains one column, return a Series |
| thousands | Separator for thousands |
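
To make a few of these arguments concrete, here is a minimal sketch that combines `skiprows`, `header`, `names`, `index_col`, `na_values`, and `comment` in a single `read_csv` call. The text being parsed, the column names, and the `missing` sentinel are all made up for illustration.

```python
import io
import pandas as pd

# Made-up sample data: a junk first line, no header row,
# a sentinel string for a missing value, and a trailing comment
raw_text = (
    "this line is not data\n"
    "x,1,2.5\n"
    "y,missing,3.5#measured twice\n"
    "z,3,4.5\n"
)

df = pd.read_csv(
    io.StringIO(raw_text),
    skiprows=1,                        # ignore the first line of the file
    header=None,                       # the file has no header row
    names=["key", "count", "value"],   # so supply the column names ourselves
    index_col="key",                   # use the 'key' column as the row index
    na_values=["missing"],             # treat this string as NA
    comment="#",                       # drop everything after '#' on a line
)
print(df)
```

Because one entry in `count` becomes NA, that column is read as floating point rather than integer.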


### Reading Files in Pieces

- When processing very large files, or when working out the right set of arguments to parse a large file correctly, we may want to read only a small piece of the file or iterate through it in smaller chunks

In [2]:
# Example - Reading the first 5 rows of a large CSV file
ex06Data = pd.read_csv(example06DataURL, nrows = 5)
ex06Data

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.50184,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q


In [3]:
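# Example - Reading a large CSV file in chunks of 1,000 rows,
# then iterating over the chunks and counting the values in the 'key' column
ex06DataChunk = pd.read_csv(example06DataURL, chunksize = 1000)

# An explicit dtype avoids the default-dtype warning for an empty Series
countResult = pd.Series([], dtype='float64')
for chunk in ex06DataChunk:
    # Each item yielded by the reader is a DataFrame of up to 1,000 rows
    countResult = countResult.add(chunk['key'].value_counts(), fill_value=0)

countResult[:5]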

0    151.0
1    146.0
2    152.0
3    162.0
4    171.0
dtype: float64
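
The chunked counting pattern can also be sketched without a network connection. The snippet below is only an illustration under stated assumptions: the ten-row CSV text is made up, and combining the partial counts with `pd.concat` followed by `groupby` is one possible alternative to the repeated `add` calls used above, not the book's method. It also shows `iterator=True` with `get_chunk`, which pulls a chosen number of rows on demand.

```python
import io
import pandas as pd

# Made-up sample data standing in for a large CSV file
csv_text = "key,value\n" + "".join(f"{'ABC'[i % 3]},{i}\n" for i in range(10))

# chunksize returns an iterator of DataFrames rather than a single DataFrame
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

# Count the keys within each chunk, then combine the partial counts once
partial_counts = [chunk["key"].value_counts() for chunk in reader]
key_counts = (
    pd.concat(partial_counts)
    .groupby(level=0)      # group identical keys coming from different chunks
    .sum()
    .sort_values(ascending=False)
)
print(key_counts)

# iterator=True returns the same kind of reader; get_chunk reads rows on demand
reader = pd.read_csv(io.StringIO(csv_text), iterator=True)
first_three = reader.get_chunk(3)
print(first_three)
```

Sorting the combined counts puts the most frequent keys first, which is usually more useful than the alphabetical index order that `add` produces.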

## The rest of this chapter is skipped