# Chapter 6.  Data Loading, Storage, and File Formats 

In [None]:
import numpy as np
import pandas as pd
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)

* Input and output typically fall into a few main categories: reading text files and other
more efficient on-disk formats, loading data from databases, and interacting with network
sources like web APIs.

## 6.1  Reading and Writing Data in Text Format

* pandas features a number of functions for reading tabular data as a DataFrame
object. 
* Table 6-1 summarizes some of them, though **read_csv** and **read_table** are
likely the ones you’ll use the most.

<img style="float: left;" src="pic/pic_6_1.png" width="700">

<img style="float: left;" src="pic/pic_6_2.png" width="700">

I’ll give an overview of the mechanics of these functions, which are meant to convert
text data into a DataFrame. 


The optional arguments for these functions fall into
a few categories:

* *Indexing*  
Can treat one or more columns as the index of the returned DataFrame, and controls whether to get
column names from the file, the user, or not at all.


* *Type inference and data conversion*  
Includes user-defined value conversions and custom lists of missing-value
markers.


* *Datetime parsing*  
Includes the ability to combine date and time information
spread over multiple columns into a single column in the result.


* *Iterating*  
Support for iterating over chunks of very large files.


* *Unclean data issues*  
Skipping rows or a footer, comments, or other minor things like numeric data
with thousands separated by commas.
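As a small illustration of the last category, the sketch below parses an in-memory CSV string (made-up data, not one of the chapter's example files) that contains a comment line and quoted numbers with thousands separators. It relies on the **comment** and **thousands** options of **read_csv**.

```python
import io
import pandas as pd

# Hypothetical messy input: a comment line plus quoted numbers such as "1,234"
raw = '# exported from a hypothetical system\nname,amount\na,"1,234"\nb,"5,678"\n'

# comment='#' skips the commented line; thousands=',' strips the separators
clean = pd.read_csv(io.StringIO(raw), comment='#', thousands=',')
print(clean)
print(clean.dtypes)
```

Without **thousands=','**, the amount column would be read as strings rather than integers.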

Because of how messy data in the real world can be, some of the data loading functions (especially **read_csv**) have grown very complex in their options over time. 

It’s
normal to feel overwhelmed by the number of different parameters (**read_csv** has
over 50 as of this writing).

In [None]:
pd.read_csv?

* Some of these functions, like **pandas.read_csv**, perform *type inference*, because the
column data types are not part of the data format.   
* That means you don’t necessarily
have to specify which columns are numeric, integer, boolean, or string. 

* Handling dates and other custom types can require extra effort. 
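To make type inference concrete, here is a minimal sketch using an in-memory CSV string (hypothetical data, not one of the example files). It shows that numeric and boolean columns are inferred automatically, while a date column stays as text unless you ask for it to be parsed with **parse_dates**.

```python
import io
import pandas as pd

# Hypothetical data with integer, float, boolean, string, and date-like columns
raw = ("n,x,flag,label,when\n"
       "1,1.5,True,a,2000-01-01\n"
       "2,2.5,False,b,2000-01-02\n")

inferred = pd.read_csv(io.StringIO(raw))
print(inferred.dtypes)      # 'when' is not a datetime column here

# Dates need an explicit request
with_dates = pd.read_csv(io.StringIO(raw), parse_dates=['when'])
print(with_dates.dtypes)    # 'when' is now a datetime64 column
```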

Let’s start with a
small comma-separated (CSV) text file.

Open examples/ex1.csv in Excel to inspect it.

Since this is comma-delimited, we can use **read_csv** to read it into a DataFrame.

#### **Click ex1.csv in the folder pane on the left to open it!**

In [None]:
df = pd.read_csv('examples/ex1.csv')
df

We could also have used **read_table** and specified the delimiter.

In [None]:
pd.read_table?
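As a sketch of that alternative, the example below reads the same comma-delimited text with both functions. The data is an in-memory string made up for illustration, so it runs without the examples folder.

```python
import io
import pandas as pd

# Hypothetical comma-delimited content, similar in shape to a small CSV file
raw = "a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n"

via_csv = pd.read_csv(io.StringIO(raw))
# read_table defaults to a tab delimiter, so the comma must be given explicitly
via_table = pd.read_table(io.StringIO(raw), sep=',')

print(via_table)
print(via_csv.equals(via_table))
```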

A file will not always have a header row. Consider this file.

Open examples/ex2.csv in Excel to inspect it.

To read this file, you have a couple of options. You can allow pandas to assign default
column names, or you can specify names yourself.

In [None]:
pd.read_csv('examples/ex2.csv')

In [None]:
pd.read_csv('examples/ex2.csv', header=None)

In [None]:
pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])

Suppose you wanted the message column to be the index of the returned DataFrame.
You can either indicate you want the column at index 4 or named 'message' using
the index_col argument.

In [None]:
names = ['a', 'b', 'c', 'd', 'message']

In [None]:
pd.read_csv('examples/ex2.csv', names=names, index_col='message')

In [None]:
pd.read_csv('examples/ex2.csv', names=names, index_col=4)

In the event that you want to form a hierarchical index from multiple columns, pass a
list of column numbers or names.

Open examples/csv_mindex.csv in Excel to inspect it.

In [None]:
parsed = pd.read_csv('examples/csv_mindex.csv',
                     index_col=['key1', 'key2'])

In [None]:
parsed

In some cases, a table might not have a fixed delimiter, using whitespace or some
other pattern to separate fields. Consider a text file that looks like this.

In [None]:
list(open('examples/ex3.txt'))

While you could do some munging by hand, the fields here are separated by a variable amount of whitespace. In these cases, you can pass a regular expression as a
delimiter for **read_table**. Here the delimiter can be expressed by the regular expression **\s+**, which
matches one or more whitespace characters.

Whitespace is any character or series of characters that represents horizontal or vertical space in printing:

<pre>
‘ ’ – Space
‘\t’ – Horizontal tab
‘\n’ – Newline
‘\v’ – Vertical tab
‘\f’ – Form feed
‘\r’ – Carriage return
</pre>

In [None]:
result = pd.read_table('examples/ex3.txt', sep=r'\s+')  # raw string avoids an invalid-escape warning

In [None]:
result

When the delimiter is whitespace of variable length, use the regular expression string \s+.

In [None]:
pd.read_csv('examples/ex3.txt', sep=r'\s+')  # raw string avoids an invalid-escape warning

Because there was one fewer column name than the number of fields in each data row,
**read_table** infers that the first column should be the DataFrame’s index in this special case.
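The index inference can be reproduced without the example file. The sketch below uses a made-up whitespace-delimited string whose header has three names while each data row has four fields.

```python
import io
import pandas as pd

# Hypothetical text: 3 column names in the header, 4 fields in each data row
raw = ("        A      B      C\n"
       "aaa  -0.26  -1.03  -0.62\n"
       "bbb   0.93   0.30  -0.03\n")

result = pd.read_csv(io.StringIO(raw), sep=r'\s+')
print(result)
print(result.index.tolist())    # the extra leading field becomes the index
```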

The parser functions have many additional arguments to help you handle the wide
variety of exception file formats that occur (see a partial listing in Table 6-2). For
example, you can skip the first, third, and fourth rows of a file with **skiprows**.

Open examples/ex4.csv in Excel to inspect it.

In [None]:
#list(open('examples/ex4.csv'))

In [None]:
pd.read_csv('examples/ex4.csv', skiprows=[0, 2, 3])

Handling missing values is an important and frequently nuanced part of the file parsing process. Missing data is usually either not present (empty string) or marked by
some sentinel value. By default, pandas uses a set of commonly occurring sentinels,
such as **NA** and **NULL**:

In computer programming, a sentinel value (also referred to as a flag value, trip value, rogue value, signal value, or dummy data) is a special value whose presence signals a condition, typically the termination of a loop or recursive algorithm. In the context of file parsing, a sentinel is a special marker in the file that stands for a missing entry.

In [None]:
#list(open('examples/ex5.csv'))

In [None]:
result = pd.read_csv('examples/ex5.csv')

In [None]:
result

In [None]:
pd.isnull(result)

The **na_values** option can take either a list or set of strings to treat as missing
values.
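A minimal sketch of the list form, using a made-up in-memory CSV rather than ex5.csv: the string 'foo' is added to the default sentinels, so both it and the default marker NULL are read as missing.

```python
import io
import pandas as pd

# Hypothetical data: 'NULL' is a default sentinel, 'foo' is not
raw = ("something,a,b,message\n"
       "one,1,2,NULL\n"
       "two,5,6,world\n"
       "three,9,10,foo\n")

result = pd.read_csv(io.StringIO(raw), na_values=['foo'])
print(result)
print(result['message'].isna().tolist())
```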

Different NA sentinels can be specified for each column in a dict.

In [None]:
result

In [None]:
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}

In [None]:
pd.read_csv('examples/ex5.csv', na_values=sentinels)

Table 6-2 lists some frequently used options in **pandas.read_csv** and **pandas.read_table**.

<img style="float: left;" src="pic/pic_6_5.png" width="700">

<img style="float: left;" src="pic/pic_6_6.png" width="700">

### Reading Text Files in Pieces

When processing very large files or figuring out the right set of arguments to correctly process a large file, you may only want to read in a small piece of a file or iterate
through smaller chunks of the file.

Before we look at a large file, we make the pandas display settings more compact:

In [None]:
pd.options.display.max_rows = 10

In [None]:
result = pd.read_csv('examples/ex6.csv')
result

If you want to only read a small number of rows (avoiding reading the entire file),
specify that with **nrows**.

In [None]:
pd.read_csv('examples/ex6.csv', nrows=5)

To read a file in pieces, specify a **chunksize** as a number of rows:

In [None]:
chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)

In [None]:
chunker

The TextFileReader object returned by **read_csv** allows you to iterate over the parts of
the file according to the **chunksize**.   
For example, we can iterate over ex6.csv, aggregating
the value counts in the 'key' column like so:

In [None]:
chunker = pd.read_csv('examples/ex6.csv', chunksize=10)

In [None]:
tot = pd.Series([], dtype='int64')  # explicit dtype; the default for an empty Series varies across pandas versions

In [None]:
tot.add?

In [None]:
for piece in chunker:
    print(piece)
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
    print(tot)

In [None]:
tot.sort_values?

In [None]:
tot = tot.sort_values(ascending=False)

In [None]:
tot[:10]

### Writing Data to Text Format

Data can also be exported to a delimited format.   
Let’s consider one of the CSV files
read before:

In [None]:
list(open('examples/ex5.csv'))

In [None]:
data = pd.read_csv('examples/ex5.csv')
data

Using DataFrame’s **to_csv** method, we can write the data out to a comma-separated
file.

In [None]:
data.to_csv('examples/out.csv')

In [None]:
list(open('examples/out.csv'))

Other delimiters can be used, of course (writing to **sys.stdout** so it prints the text
result to the console).

In [None]:
import sys
data.to_csv(sys.stdout, sep='?')

Missing values appear as empty strings in the output.  
You might want to denote them
by some other sentinel value.

In [None]:
data.to_csv(sys.stdout, na_rep='NULL')

With no other options specified, both the row and column labels are written. Both of
these can be disabled.

In [None]:
data.to_csv(sys.stdout, index=False, header=False)

In [None]:
data

You can also write only a subset of the columns, and in an order of your choosing.

In [None]:
data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])

Series also has a **to_csv** method.

In [None]:
dates = pd.date_range('1/1/2000', periods=7)

In [None]:
dates

In [None]:
ts = pd.Series(np.arange(7), index=dates)

In [None]:
ts

In [None]:
ts.to_csv('examples/tseries.csv')

In [None]:
list(open('examples/tseries.csv'))

### Working with Delimited Formats

It’s possible to load most forms of tabular data from disk using functions like **pandas.read_table**.   
In some cases, however, some manual processing may be necessary.  
It’s not uncommon to receive a file with one or more malformed lines that trip up
**read_table**.   
To illustrate the basic tools, consider a small CSV file:


In [None]:
#list(open('examples/ex7.csv'))

For any file with a single-character delimiter, you can use Python’s built-in **csv** module.   
To use it, pass any open file or file-like object to csv.reader:

In [None]:
import csv
f = open('examples/ex7.csv')
reader = csv.reader(f)

Iterating through the reader like a file yields lists of values with any quote characters removed.

In [None]:
for line in reader:
    print(line)
f.close()  # close the file handle opened above
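The same approach works for other single-character delimiters. The sketch below is an illustration with a made-up pipe-delimited string. It passes **delimiter='|'** to csv.reader and shows that a quoted field containing the delimiter is kept intact.

```python
import csv
import io

# Hypothetical pipe-delimited text; the last field is quoted and contains '|'
raw = 'a|b|c\n1|2|"x|y"\n'

rows = list(csv.reader(io.StringIO(raw), delimiter='|'))
for row in rows:
    print(row)
```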

From there, it’s up to you to do the wrangling necessary to put the data in the form
that you need it.   
Let’s take this step by step.   
First, we read the file into a list of lines.

In [None]:
with open('examples/ex7.csv') as f:
    lines = list(csv.reader(f))

Then, we split the lines into the header line and the data lines:

In [None]:
header, values = lines[0], lines[1:]

In [None]:
header

In [None]:
values

Then we can create a dictionary of data columns using a dictionary comprehension
and the expression zip(*values), which transposes rows to columns:


In [None]:
data_dict = {h: v for h, v in zip(header, zip(*values))}
data_dict

In [None]:
data_dict = {h: v for h, v in zip(header, zip(values))}  # zip(values) receives a single argument, so each row is wrapped in a 1-tuple, e.g. ([1, 2, 3],)
data_dict

In [None]:
data_dict = {h: v for h, v in zip(header, values)}
data_dict

In [None]:
A = [[ 1, 2, 3],[ 4, 5, 6]]

In [None]:
list(zip(A))

In [None]:
list(zip(*A)) #zip(*a) is equal to zip(a[0], a[1], a[2], ...)

In [None]:
A[0]

In [None]:
A[1]

In [None]:
list(zip(A[0],A[1]))

In [None]:
for i,j in zip(A[0],A[1]):
    print(i,j)

In [None]:
list(zip([1,2,3],[4,5,6],[7,8,9]))

### JSON Data

(Skipped.)

* JSON (short for JavaScript Object Notation) has become one of the standard formats
for sending data by HTTP request between web browsers and other applications. 
* It is
a much more free-form data format than a tabular text form like CSV. 

Here is an
example:

In [None]:
obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
              {"name": "Katie", "age": 38,
               "pets": ["Sixes", "Stache", "Cisco"]}]
}
"""

In [None]:
type(obj)

* JSON is very nearly valid Python code with the exception of its null value null and
some other nuances (such as disallowing trailing commas at the end of lists). 
* The
basic types are objects (dicts), arrays (lists), strings, numbers, booleans, and nulls. 
* All
of the keys in an object must be strings. 
* There are several Python libraries for reading and writing JSON data. 
* I’ll use json here, as it is built into the Python standard
library. 
* To convert a JSON string to Python form, use **json.loads**:

In [None]:
import json
result = json.loads(obj)
result

json.dumps, on the other hand, converts a Python object back to JSON:

In [None]:
asjson = json.dumps(result)

In [None]:
type(asjson)

How you convert a JSON object or list of objects to a DataFrame or some other data
structure for analysis will be up to you.   
Conveniently, you can pass a list of dicts
(which were previously JSON objects) to the DataFrame constructor and select a subset of the data fields:

In [None]:
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
siblings

In [None]:
siblings = pd.DataFrame(result['siblings'])
siblings

The **pandas.read_json** function can automatically convert JSON datasets in specific arrangements into a Series or DataFrame. For example:

In [None]:
list(open('examples/example.json'))

The default options for **pandas.read_json** assume that each object in the JSON array
is a row in the table:

In [None]:
data = pd.read_json('examples/example.json')
data
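The row-per-object default can be checked with a small in-memory JSON array. This is a sketch with made-up values. The text is wrapped in io.StringIO so that **read_json** treats it as a file-like object.

```python
import io
import json
import pandas as pd

# Hypothetical JSON array: each object should become one row
raw = '[{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6}]'

frame = pd.read_json(io.StringIO(raw))
print(frame)

# Round trip: orient='records' writes one JSON object per row again
records = json.loads(frame.to_json(orient='records'))
print(records)
```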

If you need to export data from pandas to JSON, one way is to use the **to_json** methods on Series and DataFrame:

In [None]:
print(data.to_json())

In [None]:
print(data.to_json(orient='records'))

### XML and HTML: Web Scraping

* Python has many libraries for reading and writing data in the ubiquitous HTML and
XML formats. 
* Examples include lxml, Beautiful Soup, and html5lib. 
* While lxml is
comparatively much faster in general, the other libraries can better handle malformed
HTML or XML files.

* pandas has a built-in function, **read_html**, which uses libraries like lxml and Beautiful Soup to automatically parse tables out of HTML files as DataFrame objects. 

To
show how this works, I downloaded an HTML file (used in the pandas documentation)
from the United States FDIC government agency showing bank failures.  
First,
you must install some additional libraries used by **read_html**:

<p style="font-family: Courier New; font-size: 1.15em;">
conda install lxml

<p style="font-family: Courier New; font-size: 1.15em;">
pip install beautifulsoup4 html5lib

* The **pandas.read_html** function has a number of options, but by default it searches
for and attempts to parse all tabular data contained within < table > tags. 
    
The result is
a list of DataFrame objects:

Click examples/fdic_failed_bank_list.html to open it in Google Chrome.


Open examples/fdic_failed_bank_list.html in Notepad, then search for < table.


In [None]:
tables = pd.read_html('examples/fdic_failed_bank_list.html')

In [None]:
len(tables)

In [None]:
failures = tables[0]

In [None]:
failures.head()

#### Parsing XML with lxml.objectify

(Skipped.)

## 6.2  Binary Data Formats

* One of the easiest ways to store data (also known as serialization) efficiently in binary
format is using Python’s built-in **pickle** serialization. 
* pandas objects all have a
**to_pickle** method that writes the data to disk in pickle format:


In [None]:
list(open('examples/ex1.csv'))

In [None]:
frame = pd.read_csv('examples/ex1.csv')

In [None]:
frame

In [None]:
frame.to_pickle('examples/frame_pickle')

In [None]:
pd.read_pickle('examples/frame_pickle')

<img style="float: left;" src="pic/pic_0_1.png">

<span style="color:red">pickle is only recommended as a short-term storage format. The
problem is that it is hard to guarantee that the format will be stable
over time.</span>

### Using HDF5 Format

(Skipped.)

### Reading Microsoft Excel Files

* pandas also supports reading tabular data stored in Excel 2003 (and higher) files
using either the **ExcelFile** class or **pandas.read_excel** function. 
* Internally these
tools use the add-on packages xlrd and openpyxl to read XLS and XLSX files, respectively. 
* You may need to install these manually with pip or conda.

In [None]:
frame = pd.read_excel('examples/ex1.xlsx', sheet_name='Sheet1')  # name the sheet explicitly
frame

In [None]:
frame.to_excel('examples/ex2.xlsx')

### Remainder skipped

## 6.3  Interacting with Web APIs

* Many websites have public APIs providing data feeds via JSON or some other format.
* There are a number of ways to access these APIs from Python; one easy-to-use
method that I recommend is the requests package.


In [None]:
import requests

In [None]:
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'

In [None]:
resp = requests.get(url)

In [None]:
resp

The Response object’s **json** method returns the JSON content parsed
into native Python objects (for this endpoint, a list of dictionaries):

In [None]:
data = resp.json()

In [None]:
data[0]['title']

In [None]:
data

Each element in data is a dictionary containing all of the data found on a GitHub
issue page (except for the comments).   
We can pass data directly to DataFrame and
extract fields of interest:

In [None]:
issues = pd.DataFrame(data, columns=['number', 'title',
                                     'labels', 'state'])

In [None]:
issues
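Because the live API needs network access and its output changes over time, the field-selection step can also be tried offline. The sketch below uses a hand-written list of dictionaries as a stand-in for the parsed response. The records and the extra 'body' field are invented for illustration and are not real GitHub data.

```python
import pandas as pd

# Hypothetical stand-in for resp.json(): a list of dicts with extra fields
sample = [
    {'number': 101, 'title': 'Example issue A', 'labels': [],
     'state': 'open', 'body': 'not selected'},
    {'number': 102, 'title': 'Example issue B', 'labels': [{'name': 'Bug'}],
     'state': 'open', 'body': 'not selected'},
]

# columns= keeps only the listed fields, in the given order
issues = pd.DataFrame(sample, columns=['number', 'title', 'labels', 'state'])
print(issues)
```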

## 6.4  Interacting with Databases

* In a business setting, most data may not be stored in text or Excel files. 
* SQL-based
relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use,
and many alternative databases have become quite popular. 
* The choice of database is
usually dependent on the performance, data integrity, and scalability needs of an
application.

* Loading data from SQL into a DataFrame is fairly straightforward, and pandas has
some functions to simplify the process. 
* As an example, I’ll create a SQLite database
using Python’s built-in **sqlite3** driver:

(Skipped.) The SQLite-related files are in sqlite_new.zip.
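For orientation, here is a minimal sketch of the kind of workflow this section describes. It is not the material from sqlite_new.zip: the table name, columns, and rows are assumptions made up for illustration, and the database lives in memory so nothing is written to disk.

```python
import sqlite3
import pandas as pd

# In-memory database; the table and its contents are hypothetical
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)")

rows = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
con.executemany("INSERT INTO test VALUES (?, ?, ?, ?)", rows)
con.commit()

# pandas can run a query on a sqlite3 connection and return a DataFrame
frame = pd.read_sql_query("SELECT * FROM test", con)
con.close()
print(frame)
```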