# **Data Loading, Storage, and File Formats**
---

## **Reading and Writing Data in Text Format**
> **read_csv and read_table**

In [2]:
import numpy as np
import pandas as pd
df = pd.read_csv('examples/ex1.csv')
df


,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [3]:
import numpy as np
import pandas as pd
df = pd.read_table('examples/ex1.csv',sep=',')
df

,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
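The two cells above return the same table. A minimal sketch of that equivalence, assuming an in-memory string as a stand-in for `examples/ex1.csv` (the `csv_text` variable is hypothetical and only mirrors the output shown above):

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for examples/ex1.csv
csv_text = "a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# read_csv defaults to a comma delimiter
df_csv = pd.read_csv(io.StringIO(csv_text))

# read_table defaults to a tab delimiter, so the comma must be passed explicitly
df_table = pd.read_table(io.StringIO(csv_text), sep=",")

print(df_csv.equals(df_table))
```

In practice the two functions differ mainly in their default `sep`.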


> **names: supply column names for a file that has no header row**

In [4]:
pd.read_csv('examples/ex2.csv',names=['a','b','c','d','message'])

,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
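To see what `names` changes, the sketch below reads the same header-less text twice: once with `header=None`, which lets pandas number the columns, and once with explicit names. The `raw` string is an assumed stand-in for `examples/ex2.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for a file without a header row
raw = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# header=None: pandas assigns integer column labels 0..4
default_names = pd.read_csv(io.StringIO(raw), header=None)

# names=[...]: the first line is treated as data and the given labels are used
custom = pd.read_csv(io.StringIO(raw), names=["a", "b", "c", "d", "message"])

print(list(default_names.columns))
print(list(custom.columns))
```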


> **index_col: use a column of the file as the index**

In [5]:
names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv('examples/ex2.csv',names=names,index_col='message')

message,a,b,c,d
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12


> **index_col=['key1', 'key2']: build a hierarchical index from several columns**

In [6]:
parsed = pd.read_csv('examples/csv_mindex.csv',
                    index_col=['key1', 'key2'])
parsed

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
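Once the hierarchical index exists, rows can be selected by the outer level alone or by a full `(key1, key2)` pair. A short sketch, assuming a reduced in-memory copy of the data above rather than the real `examples/csv_mindex.csv`:

```python
import io

import pandas as pd

# Hypothetical, shortened stand-in for examples/csv_mindex.csv
raw = "key1,key2,value1,value2\none,a,1,2\none,b,3,4\ntwo,a,9,10\ntwo,b,11,12\n"

parsed = pd.read_csv(io.StringIO(raw), index_col=["key1", "key2"])

# Selecting on the outer level leaves the inner level as the index
outer_one = parsed.loc["one"]

# A tuple addresses both levels at once
single_value = parsed.loc[("two", "b"), "value2"]

print(outer_one)
print(single_value)
```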


> **skiprows=[0, 2, 3]: skip the listed rows (0-based line numbers in the file)**

In [13]:
pd.read_csv('examples/ex4.csv',skiprows=[0,2,3])

,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [16]:
result = pd.read_csv('examples/ex5.csv')
result

,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
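The empty cells in the output above are missing values. The sketch below shows how they can be detected with `isna()` and how `na_values` can mark extra sentinels per column. The `raw` text and the `sentinels` dictionary are illustrative assumptions, not the contents of `examples/ex5.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in data: an empty field and the string NA both parse as missing
raw = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,NA\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)

result = pd.read_csv(io.StringIO(raw))
missing_mask = result.isna()  # True wherever a value is missing

# Per-column sentinels that should also be treated as missing
sentinels = {"message": ["foo", "NA"], "something": ["two"]}
result2 = pd.read_csv(io.StringIO(raw), na_values=sentinels)

print(missing_mask)
print(result2)
```

Because one value in column `c` is missing, pandas stores that column as floating point, which is why `3.0` and `11.0` appear in the output above.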


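> **Common read_csv / read_table parameters**
 ---
> **path (filepath_or_buffer): location of the file, a URL, or a file-like object**   
> **sep: delimiter, given as a character sequence or a regular expression**  
> **header: row number to use as column names; set to None if the file has no header row**  
> **index_col: column(s) to use as the row index**   
> **names: list of column names, typically combined with header=None**   
> **skiprows: number of leading lines to skip, or a list of line numbers to skip**  
> **na_values: values to treat as missing (NA)**  
> **nrows: number of rows to read from the start of the file**  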
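Several of these parameters are often combined in one call. The following is a minimal sketch under stated assumptions: the whitespace-delimited `raw` text, its comment line, and the `missing` sentinel are all made up for illustration.

```python
import io

import pandas as pd

# Hypothetical whitespace-delimited text with a leading comment line
raw = (
    "# hypothetical comment line to skip\n"
    "key   x     y\n"
    "aaa   1.5   -1\n"
    "bbb   2.5   missing\n"
    "ccc   3.5   4\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s+",              # regular expression: one or more whitespace characters
    skiprows=[0],            # drop the comment line
    index_col="key",         # use the key column as the row index
    na_values=["missing"],   # treat this sentinel as NA
    nrows=2,                 # read only the first two data rows
)

print(df)
```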

### **Reading Text Files in Pieces**

> **nrows: read only the given number of rows**

In [23]:
pd.options.display.max_rows = 10
result = pd.read_csv('examples/ex6.csv',index_col='key',nrows=5)
result

key,one,two,three,four
L,0.467976,-0.038649,-0.295344,-1.824726
B,-0.358893,1.404453,0.704965,-0.200638
G,-0.50184,0.659254,-0.421691,-0.057688
R,0.204886,1.074134,1.388361,-0.982404
Q,0.354628,-0.133116,0.283763,-0.837063


> **chunksize: read the file in chunks of the given number of rows**

In [None]:
chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)
chunker
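The object returned with `chunksize` is an iterator that yields one DataFrame per chunk, so it is normally consumed in a loop. A sketch of that pattern, assuming randomly generated data in place of `ex6.csv` (the column names `value` and `key` and the use of `Counter` are choices made for this illustration):

```python
from collections import Counter
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in data: 2500 rows with a categorical key column
rng = np.random.default_rng(0)
frame = pd.DataFrame(
    {
        "value": rng.standard_normal(2500),
        "key": rng.choice(list("ABCDE"), size=2500),
    }
)
buf = io.StringIO(frame.to_csv(index=False))

chunker = pd.read_csv(buf, chunksize=1000)

chunk_sizes = []
key_counts = Counter()
for piece in chunker:  # each piece is a DataFrame of at most 1000 rows
    chunk_sizes.append(len(piece))
    key_counts.update(piece["key"].value_counts().to_dict())

print(chunk_sizes)
print(key_counts.most_common())
```

Only one chunk is held in memory at a time, which is the reason to prefer this over reading the whole file when it is large.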

### **Writing Data to Text Format**

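> **to_csv: write the data to a delimited text file**  
> **na_rep='null': write missing values as the string null instead of an empty field**  
> **index=False, header=False: omit the row index and the column header from the output**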

In [26]:
data = pd.read_csv('examples/ex5.csv')
data.to_csv('examples/out.csv')


In [27]:
import sys
data.to_csv(sys.stdout,sep='|')

|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
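The `na_rep`, `index`, `header`, and `columns` options listed above can be tried without touching the disk by writing into a string buffer. A small sketch, assuming a made-up three-column frame:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical frame containing missing values
data = pd.DataFrame(
    {
        "something": ["one", "two"],
        "c": [3.0, np.nan],
        "message": [None, "world"],
    }
)

# Missing values written as NULL, row index omitted
with_null = io.StringIO()
data.to_csv(with_null, na_rep="NULL", index=False)

# Neither index nor header, and only a subset of the columns
bare = io.StringIO()
data.to_csv(bare, index=False, header=False, columns=["something", "c"])

print(with_null.getvalue())
print(bare.getvalue())
```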


### **Working with Delimited Formats**

In [8]:
import csv  # standard-library CSV module
with open('examples/ex7.csv') as f:  # the file is closed automatically on exit
    for line in csv.reader(f):  # iterate over the parsed rows
        print(line)

['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3']
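The rows produced by `csv.reader` are plain lists of strings, so some reshaping is usually needed. The sketch below turns them into a dictionary of columns and writes them back with a different delimiter. The `raw` text mirrors the output above; the semicolon delimiter is an arbitrary choice for illustration.

```python
import csv
import io

# Hypothetical stand-in for examples/ex7.csv
raw = "a,b,c\n1,2,3\n1,2,3\n"

lines = list(csv.reader(io.StringIO(raw)))
header, values = lines[0], lines[1:]

# zip(*values) transposes the rows into columns
data_dict = {h: v for h, v in zip(header, zip(*values))}

# Write the same rows back using a semicolon as the delimiter
out = io.StringIO()
writer = csv.writer(out, delimiter=";", lineterminator="\n")
writer.writerow(header)
writer.writerows(values)

print(data_dict)
print(out.getvalue())
```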


### **Reading JSON**

> **json.loads(): parse a JSON string into Python objects**

In [10]:
obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
              {"name": "Katie", "age": 38,
               "pets": ["Sixes", "Stache", "Cisco"]}]
}
"""

import json
result = json.loads(obj)
result

{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}

In [15]:
import pandas as pd
siblings = pd.DataFrame(result['siblings'],columns=['name','age'])
siblings

,name,age
0,Scott,30
1,Katie,38


> **json.dumps(): serialize a Python object to a JSON string**

In [None]:
asjson = json.dumps(result)

# `data` is the DataFrame returned by pd.read_json('examples/example.json')
print(data.to_json())
{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}

print(data.to_json(orient='records'))
[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
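`json.dumps` and `json.loads` are inverses for basic types, and Python's `None` maps to JSON `null`. A minimal round-trip sketch, using a shortened, assumed version of the `result` dictionary from above:

```python
import json

# Shortened, hypothetical version of the parsed object shown earlier
result = {
    "name": "Wes",
    "pet": None,
    "siblings": [{"name": "Scott", "age": 30}],
}

asjson = json.dumps(result)      # Python object -> JSON string
roundtrip = json.loads(asjson)   # JSON string -> Python object

print(asjson)
print(roundtrip == result)
```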

> **pd.read_json(): read a JSON dataset into a DataFrame**

In [16]:
data = pd.read_json('examples/example.json')
data

,a,b,c
0,1,2,3
1,4,5,6
2,7,8,9
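`pd.read_json` and `DataFrame.to_json` can be checked against each other. The sketch below assumes the JSON is a list of records held in a string (the real layout of `examples/example.json` is not shown here, so `json_text` is a guess that merely reproduces the table above).

```python
import io

import pandas as pd

# Hypothetical JSON text: a list of records, one object per row
json_text = (
    '[{"a": 1, "b": 2, "c": 3},'
    ' {"a": 4, "b": 5, "c": 6},'
    ' {"a": 7, "b": 8, "c": 9}]'
)

data = pd.read_json(io.StringIO(json_text))

# orient="records" writes one JSON object per row
records = data.to_json(orient="records")
roundtrip = pd.read_json(io.StringIO(records))

print(records)
print(roundtrip.equals(data))
```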


### **Reading Web Data**

> **pd.read_html()**   

In [18]:
pip install lxml # install the lxml library

Collecting lxml
  Downloading lxml-4.9.0-cp310-cp310-macosx_10_15_x86_64.whl (4.6 MB)
Installing collected packages: lxml
Successfully installed lxml-4.9.0
Note: you may need to restart the kernel to use updated packages.


In [24]:
tables = pd.read_html('examples/fdic_failed_bank_list.html')
len(tables)
failures = tables[0] # take the first table: the list of failed banks
failures.head()

,Bank Name,City,ST,CERT,Acquiring Institution,Closing Date,Updated Date
0,Allied Bank,Mulberry,AR,91,Today's Bank,"September 23, 2016","November 17, 2016"
1,The Woodbury Banking Company,Woodbury,GA,11297,United Bank,"August 19, 2016","November 17, 2016"
2,First CornerStone Bank,King of Prussia,PA,35312,First-Citizens Bank & Trust Company,"May 6, 2016","September 6, 2016"
3,Trust Company Bank,Memphis,TN,9956,The Bank of Fayette County,"April 29, 2016","September 6, 2016"
4,North Milwaukee State Bank,Milwaukee,WI,20364,First-Citizens Bank & Trust Company,"March 11, 2016","June 16, 2016"


In [28]:
close_timestamps = pd.to_datetime(failures['Closing Date'])
close_timestamps.dt.year.value_counts()

2010    157
2009    140
2011     92
2012     51
2008     25
2013     24
2014     18
2002     11
2015      8
2016      5
2004      4
2001      4
2007      3
2003      3
2000      2
Name: Closing Date, dtype: int64
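The yearly tally above relies on two steps: parsing the date strings and counting the `year` component. A self-contained sketch of those steps on four made-up closing dates (no HTML parsing or `lxml` involved; the explicit `format` string is an assumption about how the dates are written):

```python
import pandas as pd

# Hypothetical sample of date strings in the same style as the Closing Date column
closing = pd.Series(
    ["September 23, 2016", "August 19, 2016", "May 6, 2016", "October 2, 2015"],
    name="Closing Date",
)

# Parse the strings into timestamps
stamps = pd.to_datetime(closing, format="%B %d, %Y")

# Count how many closings fall in each calendar year
per_year = stamps.dt.year.value_counts()

print(per_year)
```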

### **Parsing XML with lxml.objectify**

## **Binary Data Formats**
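
This section has no notes yet. As a placeholder, here is a minimal sketch of one binary format pandas supports, Python's pickle serialization, via `to_pickle` and `read_pickle`. The frame contents and the file name `frame_pickle` are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame to store
frame = pd.DataFrame({"a": [1, 5, 9], "message": ["hello", "world", "foo"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame_pickle")
    frame.to_pickle(path)             # write the frame in pickle format
    restored = pd.read_pickle(path)   # read it back

print(restored.equals(frame))
```

Pickle is convenient for short-term storage, but files should only be loaded from trusted sources, and the format is not guaranteed to stay readable across library versions.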

## **Interacting with Web APIs**

## **Interacting with Databases**
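
This section has no notes yet. As a placeholder, the sketch below loads a SQL query result into a DataFrame using the standard-library `sqlite3` module and `pd.read_sql`. The in-memory database, the table name `test`, and the rows are all assumptions made for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory SQLite database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)")

rows = [
    ("Atlanta", "Georgia", 1.25, 6),
    ("Tallahassee", "Florida", 2.6, 3),
    ("Sacramento", "California", 1.7, 5),
]
con.executemany("INSERT INTO test VALUES (?, ?, ?, ?)", rows)
con.commit()

# Run a query and collect the result set as a DataFrame
frame = pd.read_sql("SELECT * FROM test", con)
con.close()

print(frame)
```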