# **Data Loading, Storage, and File Formats**
---

## **Reading and Writing Data in Text Format**
> **read_csv and read_table**

In [None]:
import numpy as np
import pandas as pd
df = pd.read_csv('examples/ex1.csv')
df


In [None]:
import numpy as np
import pandas as pd
df = pd.read_table('examples/ex1.csv',sep=',')
df

> **names: supply custom column names for a file that has no header row**

In [None]:
pd.read_csv('examples/ex2.csv',names=['a','b','c','d','message'])

> **index_col: use the specified column as the index**

In [None]:
names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv('examples/ex2.csv',names=names,index_col='message')

> **index_col=['key1', 'key2']: create a hierarchical index**

In [None]:
parsed = pd.read_csv('examples/csv_mindex.csv',
                    index_col=['key1', 'key2'])
parsed
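
As a rough sketch of what a hierarchical index gives you, the snippet below reads a small in-memory CSV instead of `examples/csv_mindex.csv`. The sample rows are made up for illustration, so the real file may differ.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for examples/csv_mindex.csv
raw = """key1,key2,value1,value2
one,a,1,2
one,b,3,4
two,a,9,10
two,b,11,12
"""

# Passing a list to index_col builds a two-level MultiIndex
parsed = pd.read_csv(io.StringIO(raw), index_col=['key1', 'key2'])

print(parsed.index.nlevels)        # number of index levels
print(parsed.loc['one'])           # all rows under the outer key 'one'
cell = parsed.loc[('two', 'b'), 'value2']  # one cell addressed by both keys
print(cell)
```

Selecting with the outer key alone returns a sub-frame indexed by the inner key, which is usually the main reason to index on two columns.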

> **skiprows=[0, 2, 3]: skip the listed lines (0-based line numbers)**

In [None]:
pd.read_csv('examples/ex4.csv',skiprows=[0,2,3])

In [None]:
result = pd.read_csv('examples/ex5.csv')
result

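> **Common read_csv / read_table parameters**
 ---
> **path: location of the file**   
> **sep: the delimiter, either a character sequence or a regular expression**  
> **header: row number to use as column names; set to None if the file has no header row**  
> **index_col: column(s) to use as the index**   
> **names: list of column names; combine with header=None**   
> **skiprows: rows to skip**  
> **na_values: values to treat as NA (missing)**  
> **nrows: number of rows to read**  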
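
The parameters listed above can be combined in a single call. The following is a minimal sketch on an in-memory string rather than a file; the sample text, the `;` separator, and the `missing` token are assumptions chosen only to exercise each option.

```python
import io
import pandas as pd

# Hypothetical sample: a leading comment line, no header row, ';' as the
# separator, and the token 'missing' marking an absent value.
raw = """# exported by some tool
1;2;3;hello
4;missing;6;world
7;8;9;foo
10;11;12;bar
"""

df = pd.read_csv(
    io.StringIO(raw),                  # a path or any file-like object
    sep=';',                           # field delimiter
    header=None,                       # the data has no header row
    names=['a', 'b', 'c', 'message'],  # so column names are supplied here
    skiprows=[0],                      # skip the leading comment line
    na_values=['missing'],             # treat this token as NA
    nrows=3,                           # read only the first three data rows
    index_col='message',               # use this column as the index
)
print(df)
```

Because column `b` now contains a missing value, pandas stores it as floating point rather than integer.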

### **Reading Text Files in Pieces**

> **nrows: read only the specified number of rows**

In [None]:
pd.options.display.max_rows = 10
result = pd.read_csv('examples/ex6.csv',index_col='key',nrows=5)
result

> **chunksize: read the file in chunks**

In [None]:
chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)
chunker
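
The object returned when `chunksize` is set is an iterator of DataFrames, so the usual pattern is to loop over it and aggregate as you go. Below is a sketch under stated assumptions: the 2,500 generated rows and the `key` column cycling through A, B, C are stand-ins for a large file, not the contents of `ex6.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large file: 2,500 rows with a 'key' column
# that cycles through the letters A, B, C.
rows = ['value,key'] + [f'{i},{"ABC"[i % 3]}' for i in range(2500)]
buffer = io.StringIO('\n'.join(rows))

# With chunksize set, read_csv returns an iterator instead of a DataFrame
chunker = pd.read_csv(buffer, chunksize=1000)

totals = pd.Series(dtype='float64')
n_chunks = 0
for piece in chunker:
    n_chunks += 1
    # Add this chunk's counts to the running totals; absent keys count as 0
    totals = totals.add(piece['key'].value_counts(), fill_value=0)

totals = totals.sort_values(ascending=False).astype(int)
print(n_chunks)
print(totals)
```

Only one chunk is held in memory at a time, which is the point of reading in pieces.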

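### **Writing Data to Text Format**

> **to_csv: write data to a file**  
> **na_rep='null': write missing values as the string null**  
> **index=False, header=False: omit the row index and the column names from the output**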

In [None]:
data = pd.read_csv('examples/ex5.csv')
data.to_csv('examples/out.csv')


In [None]:
import sys
data.to_csv(sys.stdout,sep='|')
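
The `na_rep`, `index`, and `header` options are not exercised by the cells above, so here is a small sketch of their effect. It writes to in-memory buffers instead of files, and the frame is hypothetical sample data.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical frame containing one missing value
frame = pd.DataFrame({'a': [1, 5],
                      'b': [2.0, np.nan],
                      'message': ['hello', 'world']})

# na_rep controls how missing values are written
with_null = io.StringIO()
frame.to_csv(with_null, na_rep='null')

# index=False and header=False drop the row labels and the column names
bare = io.StringIO()
frame.to_csv(bare, index=False, header=False, na_rep='null')

print(with_null.getvalue())
print(bare.getvalue())
```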

### **Working with Delimited Formats**

In [None]:
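import csv  # standard-library CSV module
with open('examples/ex7.csv', newline='') as f:  # the file closes automatically
    reader = csv.reader(f)  # yields each row as a list of strings
    for line in reader:  # iterate over the rows
        print(line)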
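
Once the rows are available as lists of strings, a common next step is to split off the header and turn the remaining rows into columns. The sketch below uses only the standard library; the sample text is an assumption standing in for the contents of `examples/ex7.csv`.

```python
import csv
import io

# Hypothetical file contents: a header row followed by two data rows
raw = '"a","b","c"\n"1","2","3"\n"1","2","3"\n'

lines = list(csv.reader(io.StringIO(raw)))
header, values = lines[0], lines[1:]

# zip(*values) transposes the rows into columns
data_dict = {h: v for h, v in zip(header, zip(*values))}
print(data_dict)

# Write the same rows back out with a different delimiter
out = io.StringIO()
writer = csv.writer(out, delimiter=';', quoting=csv.QUOTE_MINIMAL)
writer.writerow(header)
writer.writerows(values)
print(out.getvalue())
```

Note that `csv.reader` does no type conversion: every field is returned as a string.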

### **Reading JSON**

> **json.loads(): parse a JSON string into Python objects**

In [None]:
obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
              {"name": "Katie", "age": 38,
               "pets": ["Sixes", "Stache", "Cisco"]}]
}
"""

import json
result = json.loads(obj)
result

In [None]:
import pandas as pd
siblings = pd.DataFrame(result['siblings'],columns=['name','age'])
siblings

> **json.dumps(): convert a Python object to a JSON string**

In [None]:
asjson = json.dumps(result)
asjson

> **pd.read_json(): read a JSON dataset into a DataFrame**

In [None]:
data = pd.read_json('examples/example.json')
data

> **to_json(): serialize a DataFrame to JSON**

In [None]:
print(data.to_json())
# {"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}

In [None]:
print(data.to_json(orient='records'))
# [{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
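
To see how the `json` module and the pandas JSON functions fit together, here is a minimal round trip that needs no file on disk. The records are hypothetical, and wrapping the string in `io.StringIO` is a choice made so that `read_json` treats it as file contents.

```python
import io
import json
import pandas as pd

# Hypothetical records
records = [{'a': 1, 'b': 2, 'c': 3}, {'a': 4, 'b': 5, 'c': 6}]

text = json.dumps(records)      # Python objects -> JSON string
round_trip = json.loads(text)   # JSON string -> Python objects

# Read the same JSON text into a DataFrame
frame = pd.read_json(io.StringIO(text))

# orient='records' writes one JSON object per row
as_records = frame.to_json(orient='records')
print(as_records)
```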

### **Reading Data from the Web**

> **pd.read_html()**   

In [None]:
# Install the lxml library, which read_html uses to parse HTML
%pip install lxml

In [None]:
tables = pd.read_html('examples/fdic_failed_bank_list.html')
len(tables)
failures = tables[0]  # the first table on the page is the failed-bank list
failures.head()

In [None]:
close_timestamps = pd.to_datetime(failures['Closing Date'])
close_timestamps.dt.year.value_counts()

### **Parsing XML with lxml.objectify**
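
This section has no example of its own. As a rough illustration of the general idea (walk the element tree and collect each record's child elements into a dict), the sketch below uses the standard library's `xml.etree.ElementTree` rather than `lxml.objectify`, so it is a swapped-in technique and not the approach the heading names. The document and its tag names are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; the tag names are invented for illustration
xml_text = """<records>
  <record><name>alpha</name><value>1.5</value></record>
  <record><name>beta</name><value>2.5</value></record>
</records>"""

root = ET.fromstring(xml_text)

rows = []
for record in root.findall('record'):
    # Each child element becomes one key/value pair; the text is a string
    rows.append({child.tag: child.text for child in record})

print(rows)
```

A list of dicts in this shape can be passed directly to `pd.DataFrame`, after which the string columns can be converted to numeric types as needed.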

## **Binary Data Formats**

> **df.to_pickle(path): use pickle only as a short-term storage format**

In [None]:
frame = pd.read_csv('examples/ex1.csv')
frame.to_pickle('examples/frame_pickle') # store in a short-term format
pd.read_pickle('examples/frame_pickle')
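
A quick way to confirm that a pickle round trip preserves a DataFrame is to compare the restored object with the original. This sketch uses a temporary directory and a hypothetical frame so that it leaves nothing behind. Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.

```python
import os
import tempfile
import pandas as pd

frame = pd.DataFrame({'a': [1, 5], 'b': [2, 6]})  # hypothetical data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'frame_pickle')
    frame.to_pickle(path)            # serialize the whole DataFrame
    restored = pd.read_pickle(path)  # load it back before the directory is removed

print(restored.equals(frame))
```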

### **Using HDF5 Format**  
> **HDF5 is a well-regarded file format for storing large quantities of scientific array data**

### **Reading Excel Files**

In [None]:
# xlrd reads legacy .xls files; openpyxl handles .xlsx
%pip install xlrd
%pip install openpyxl

> **read_excel: read an Excel file**

In [None]:
xlsx = pd.ExcelFile('examples/ex1.xlsx')
pd.read_excel(xlsx,'Sheet1')

In [None]:
frame = pd.read_excel('examples/ex1.xlsx', 'Sheet1')
frame

> **to_excel: write data to an Excel file**

In [None]:
# ExcelWriter.save() was removed in pandas 2.0; the with block saves on exit
with pd.ExcelWriter('examples/ex2.xlsx') as writer:
    frame.to_excel(writer, sheet_name='Sheet1')

## **Interacting with Web APIs**

> **Fetch data with the requests module**

In [None]:

import requests
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
resp # the Response object; a 200 status means the request succeeded


In [None]:
data = resp.json() # parse the JSON body into native Python objects
data[0]['title']

In [None]:
issues = pd.DataFrame(data, columns=['number', 'title',
                                    'labels', 'state'])

issues

## **Interacting with Databases**

In [None]:
import sqlite3 # Python's built-in SQLite driver
# SQL statement that creates the test table
query = """
        CREATE TABLE test
        (a VARCHAR(20), b VARCHAR(20),
        c REAL,        d INTEGER
        );"""
con = sqlite3.connect('mydata.sqlite') # open (or create) the database file
con.execute(query)
con.commit()


In [None]:
data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]

stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"

con.executemany(stmt, data)
con.commit() # persist the inserted rows

In [None]:
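cursor = con.execute('select * from test') # run the query
rows = cursor.fetchall() # list of row tuples
# cursor.description holds one tuple per column; element 0 is the column name
pd.DataFrame(rows, columns=[x[0] for x in cursor.description])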
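
The create, insert, and query steps above can be tried end to end without touching `mydata.sqlite` by using an in-memory database. This is a minimal sketch with the standard library only; `':memory:'` is a deliberate substitution for the file path so that the example can be rerun without a "table already exists" error.

```python
import sqlite3

con = sqlite3.connect(':memory:')  # throwaway in-memory database
con.execute('CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)')

data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]

# The ? placeholders let the driver bind values safely
con.executemany('INSERT INTO test VALUES (?, ?, ?, ?)', data)
con.commit()

cursor = con.execute('SELECT * FROM test')
columns = [col[0] for col in cursor.description]  # column names
rows = cursor.fetchall()                          # list of row tuples
con.close()

print(columns)
print(rows)
```

The `columns` and `rows` values are exactly what the DataFrame constructor needs to rebuild the table.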

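> **read_sql: pandas has a read_sql function that makes it easy to read data from a SQLAlchemy connection. Here we connect to the SQLite database with SQLAlchemy and read from the table created earlier:**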

In [None]:
import sqlalchemy as sqla

db = sqla.create_engine('sqlite:///mydata.sqlite')

pd.read_sql('select * from test', db)
