## Introduction to Pandas

- pandas overview
- pandas DataFrame
- how to create DataFrame
- accessing and modifying data

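Pandas is a Python library for data analysis. Data is often handled in tabular form, such as a relational table in SQL or a spreadsheet in Excel. Pandas provides tools to load, process, compute on, and analyze tabular data, and it can be combined with matplotlib and seaborn for data visualization.

The primary data structure in Pandas is the DataFrame, a two-dimensional tabular structure in which each column holds data of the same type. A DataFrame consists of row labels, column labels (which can be thought of as column names), and data items. The following is a typical DataFrame that describes student records: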

|     | sid | name | phone | AI | datastructure | python |
|----|----|----|----|----|----|----|
| 101 | 11375001 | 李大明 | 0922555123 | 66 | 86 | 69 |
| 102 | 11375010 | 陳傑憲 | 0919123456 | 96 | 88 | 77 |
| 103 | 11375022 | 林玉明 | 0955235743 | 88 | 77 | 66 |
| 104 | 11375055 | 姜昆雨 | 0931239097 | 77 | 87 | 98 |
| 105 | 11375199 | 陳辰威 | 0932098543 | 67 | 78 | 89 |

- row labels: 101, 102, 103, 104, 105
- column labels: sid, name, phone, AI, datastructure, python
- data items: the contents of each record (row)
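The point that each column holds a single data type can be checked with the `dtypes` attribute. The following is a minimal sketch using a small made-up table; the name `demo` and the `ratio` column are illustrative assumptions, not part of the student data.

```python
import pandas as pd

# A small illustrative table: one text column, one integer column, one float column
demo = pd.DataFrame(
    {'name': ['李大明', '陳傑憲'], 'AI': [66, 96], 'ratio': [0.5, 0.75]},
    index=[101, 102],
)

# dtypes lists the data type of every column (one type per column)
print(demo.dtypes)

# Individual columns can be checked as well
print(demo['AI'].dtype)      # int64
print(demo['ratio'].dtype)   # float64
```

The type reported for the text column depends on the pandas version, so it is not shown here.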

<img src="../data/dataframe1.png" width="600" height="700"/>

In [2]:
# create table with dictionary
student = {
    'sid': ['11375001', '11375010', '11375022', '11375055', '11375199'],
    'name': [ '李大明', '陳傑憲', '林玉明', '姜昆雨', '陳辰威'],
    'phone': ['0922555123', '0919123456', '0955235743', '0931239097', '0932098543'],
    'AI': [66, 96, 88, 77, 67],
    'datastructure': [86, 88, 77, 87, 78],
    'python': [69, 77, 66, 98, 89]
}
row_labels = [101, 102, 103, 104, 105]

In [3]:
# import pandas to work with it
import pandas as pd

In [4]:
# create a DataFrame with pd.DataFrame()
df = pd.DataFrame(student, index=row_labels)

In [5]:
df

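|     | sid | name | phone | AI | datastructure | python |
|----|----|----|----|----|----|----|
| 101 | 11375001 | 李大明 | 0922555123 | 66 | 86 | 69 |
| 102 | 11375010 | 陳傑憲 | 0919123456 | 96 | 88 | 77 |
| 103 | 11375022 | 林玉明 | 0955235743 | 88 | 77 | 66 |
| 104 | 11375055 | 姜昆雨 | 0931239097 | 77 | 87 | 98 |
| 105 | 11375199 | 陳辰威 | 0932098543 | 67 | 78 | 89 |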


In [7]:
df['name'] #same as df.name

101    李大明
102    陳傑憲
103    林玉明
104    姜昆雨
105    陳辰威
Name: name, dtype: object

Each column of a DataFrame is a `pandas.Series` object, which holds one-dimensional data together with the row labels.
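As a quick check of this statement, the sketch below rebuilds a reduced version of the student table (two columns and three rows, for brevity) and inspects one column. The names `small_df` and `name_col` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# A reduced version of the student table (illustrative subset)
small_df = pd.DataFrame(
    {'name': ['李大明', '陳傑憲', '林玉明'], 'AI': [66, 96, 88]},
    index=[101, 102, 103],
)

name_col = small_df['name']

# A single column is a pandas.Series
print(isinstance(name_col, pd.Series))   # True

# The Series carries the same row labels as the DataFrame
print(list(name_col.index))              # [101, 102, 103]

# Values in the Series can be looked up by row label
print(name_col.loc[102])                 # 陳傑憲
```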

In [8]:
df.loc[103]

sid                11375022
name                    林玉明
phone            0955235743
AI                       88
datastructure            77
python                   66
Name: 103, dtype: object

The `.loc[x]` indexer of a DataFrame retrieves the row whose row label is `x`. The result is a `pandas.Series` whose index consists of all the column labels.
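A small sketch of label-based access with `.loc` follows. It uses a cut-down scores table, and the names `scores`, `row`, and `subset` are assumptions made for the example.

```python
import pandas as pd

# Illustrative subset of the score columns
scores = pd.DataFrame(
    {'AI': [66, 96, 88], 'python': [69, 77, 66]},
    index=[101, 102, 103],
)

# One row label returns a Series indexed by the column labels
row = scores.loc[103]
print(list(row.index))             # ['AI', 'python']
print(row['AI'])                   # 88

# A row label and a column label together return a single data item
print(scores.loc[102, 'python'])   # 77

# A list of row labels returns a DataFrame with just those rows
subset = scores.loc[[101, 103]]
print(subset.shape)                # (2, 2)
```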

## Creating a DataFrame
- dictionary
- list
- Numpy Array
- files

**When creating a DataFrame, pay attention to the column labels, the row labels, and the data items.**

In [10]:
#create dataframe with list
l = [[1, 2, 100],
     [2, 4, 200],
     [3, 5, 300]]


In [11]:
df1 = pd.DataFrame(l, columns=['x', 'y', 'z']) # set the column names to x, y, z

In [12]:
df1

|     | x | y | z |
|----|----|----|----|
| 0 | 1 | 2 | 100 |
| 1 | 2 | 4 | 200 |
| 2 | 3 | 5 | 300 |


row labels = [0, 1, 2], column labels = ['x', 'y', 'z']
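When no row labels are supplied, pandas falls back to the default labels 0, 1, 2, and so on. The sketch below contrasts this with passing `index=` explicitly; the labels `'a'`, `'b'`, `'c'` and the variable names are made up for illustration.

```python
import pandas as pd

rows = [[1, 2, 100],
        [2, 4, 200],
        [3, 5, 300]]

# Without index=, pandas assigns the default row labels 0, 1, 2
default_df = pd.DataFrame(rows, columns=['x', 'y', 'z'])
print(list(default_df.index))     # [0, 1, 2]

# With index=, the row labels are the values passed in
labeled_df = pd.DataFrame(rows, columns=['x', 'y', 'z'], index=['a', 'b', 'c'])
print(list(labeled_df.index))     # ['a', 'b', 'c']
print(labeled_df.loc['b', 'z'])   # 200
```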

In [13]:
#create dataframe with numpy arrays
import numpy as np
arr = np.array([[1, 2, 200],
                [2, 4, 400],
                [3, 5, 500]])

In [14]:
df2 = pd.DataFrame(arr, columns=['x', 'y', 'z'])

In [15]:
df2

|     | x | y | z |
|----|----|----|----|
| 0 | 1 | 2 | 200 |
| 1 | 2 | 4 | 400 |
| 2 | 3 | 5 | 500 |


In [16]:
arr[0, 0] = 999 # df2 was built without copying the array, so here it shares memory with arr and changes too (whether a copy is made by default depends on the pandas version and its copy-on-write setting)

In [17]:
df2

|     | x | y | z |
|----|----|----|----|
| 0 | 999 | 2 | 200 |
| 1 | 2 | 4 | 400 |
| 2 | 3 | 5 | 500 |


In [26]:
l[0][0] = 99 # a DataFrame built from a list copies the data, so df1 is not affected

In [28]:
df1

|     | x | y | z |
|----|----|----|----|
| 0 | 1 | 2 | 100 |
| 1 | 2 | 4 | 200 |
| 2 | 3 | 5 | 300 |
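If the DataFrame should never be affected by later changes to the NumPy array, one option is to request a copy explicitly with the `copy=True` argument of `pd.DataFrame`. A minimal sketch, with `source` and `independent_df` as illustrative names:

```python
import numpy as np
import pandas as pd

source = np.array([[1, 2, 200],
                   [2, 4, 400],
                   [3, 5, 500]])

# copy=True asks pandas to copy the array data instead of sharing it
independent_df = pd.DataFrame(source, columns=['x', 'y', 'z'], copy=True)

# Modify the original array after the DataFrame has been created
source[0, 0] = 999

# The DataFrame still holds the original value
print(independent_df.loc[0, 'x'])   # 1
```

Being explicit about `copy` avoids relying on the default, which differs between array and list inputs.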


### Creating a DataFrame from a .csv file

In [31]:
# save dataframe to a csv file
df.to_csv('data.csv')

Contents of `data.csv`:

,sid,name,phone,AI,datastructure,python<br>
101,11375001,李大明,0922555123,66,86,69<br>
102,11375010,陳傑憲,0919123456,96,88,77<br>
103,11375022,林玉明,0955235743,88,77,66<br>
104,11375055,姜昆雨,0931239097,77,87,98<br>
105,11375199,陳辰威,0932098543,67,78,89

In [33]:
pd.read_csv('data.csv')

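|     | Unnamed: 0 | sid | name | phone | AI | datastructure | python |
|----|----|----|----|----|----|----|----|
| 0 | 101 | 11375001 | 李大明 | 922555123 | 66 | 86 | 69 |
| 1 | 102 | 11375010 | 陳傑憲 | 919123456 | 96 | 88 | 77 |
| 2 | 103 | 11375022 | 林玉明 | 955235743 | 88 | 77 | 66 |
| 3 | 104 | 11375055 | 姜昆雨 | 931239097 | 77 | 87 | 98 |
| 4 | 105 | 11375199 | 陳辰威 | 932098543 | 67 | 78 | 89 |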


In [34]:
pd.read_csv("data.csv", index_col=0) # index_col=0 uses the first column as the row labels

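|     | sid | name | phone | AI | datastructure | python |
|----|----|----|----|----|----|----|
| 101 | 11375001 | 李大明 | 922555123 | 66 | 86 | 69 |
| 102 | 11375010 | 陳傑憲 | 919123456 | 96 | 88 | 77 |
| 103 | 11375022 | 林玉明 | 955235743 | 88 | 77 | 66 |
| 104 | 11375055 | 姜昆雨 | 931239097 | 77 | 87 | 98 |
| 105 | 11375199 | 陳辰威 | 932098543 | 67 | 78 | 89 |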
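Note that the phone numbers read back from the file have lost their leading zero, because `read_csv` infers the column as integers. One way to keep them as text is the `dtype` argument of `read_csv`. The sketch below uses an in-memory string as a stand-in for `data.csv` (first two rows only) so that it runs on its own; `csv_text`, `as_numbers`, and `as_text` are illustrative names.

```python
import io
import pandas as pd

# In-memory stand-in for data.csv (first two rows only)
csv_text = (
    ",sid,name,phone,AI,datastructure,python\n"
    "101,11375001,李大明,0922555123,66,86,69\n"
    "102,11375010,陳傑憲,0919123456,96,88,77\n"
)

# Default type inference reads phone as an integer and drops the leading zero
as_numbers = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(as_numbers.loc[101, 'phone'])   # 922555123

# dtype= forces the listed columns to be read as strings
as_text = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    dtype={'sid': str, 'phone': str},
)
print(as_text.loc[101, 'phone'])      # 0922555123
```

With a real file, the same `dtype` argument can be passed to `pd.read_csv('data.csv', index_col=0, ...)`.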


### DataFrame attributes

In [35]:
df.index

Index([101, 102, 103, 104, 105], dtype='int64')

In [36]:
df.columns

Index(['sid', 'name', 'phone', 'AI', 'datastructure', 'python'], dtype='object')

In [37]:
df.columns[1]

'name'

In [38]:
df.index = np.arange(10, 15)

In [40]:
df #note that row labels of df change to 10, 11, 12, 13, 14

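|     | sid | name | phone | AI | datastructure | python |
|----|----|----|----|----|----|----|
| 10 | 11375001 | 李大明 | 0922555123 | 66 | 86 | 69 |
| 11 | 11375010 | 陳傑憲 | 0919123456 | 96 | 88 | 77 |
| 12 | 11375022 | 林玉明 | 0955235743 | 88 | 77 | 66 |
| 13 | 11375055 | 姜昆雨 | 0931239097 | 77 | 87 | 98 |
| 14 | 11375199 | 陳辰威 | 0932098543 | 67 | 78 | 89 |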
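Column labels can be replaced in the same way as row labels, by assigning to `df.columns`. As a sketch, the example below also shows `rename()`, which changes selected labels and returns a new DataFrame. The table `grades` and the new label names are made up for illustration.

```python
import pandas as pd

grades = pd.DataFrame({'AI': [66, 96], 'python': [69, 77]}, index=[101, 102])

# Replace all column labels at once; the length must match the number of columns
grades.columns = ['ai_score', 'python_score']
print(list(grades.columns))   # ['ai_score', 'python_score']

# rename() returns a new DataFrame with only the selected labels changed
renamed = grades.rename(index={101: 'first'})
print(list(renamed.index))    # ['first', 102]

# The original DataFrame keeps its row labels
print(list(grades.index))     # [101, 102]
```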
