- author: Lee Meng
- date: 2019-07-22 09:00
- title: Playing with Data with Ease: X Handy Pandas Tricks
- slug: pandas-top-x-tricks-and-cheatsheet
- tags: Python
- description: 
- summary: 
- image: doors-1767562_1280.jpg
- image_credit_url: 
- status: draft

In [9]:
import pandas as pd
pd.__version__

'0.23.4'

## Creating a DataFrame

### Creating a DataFrame from a Python dict

In [10]:
d = {"col_1": ['a', 'b', 'c'], "col_2": [1, 2, 3]}
pd.DataFrame(d)

Unnamed: 0,col_1,col_2
0,a,1
1,b,2
2,c,3


### Creating a random DataFrame with pd.util.testing

This is very handy when you just want to try out a few pandas features.

In [12]:
pd.util.testing.makeDataFrame().head(10)

Unnamed: 0,A,B,C,D
NWJOmfCirM,1.625947,1.804767,-0.710319,-1.052005
EUs7nTWHOi,-1.366118,0.309858,-0.627835,1.990895
DJjqLicICV,-0.70018,-0.398182,-0.485865,-3.076456
mcpdpreFVO,0.583953,0.608293,-0.796824,-2.932026
qkdYp98dtN,-1.259726,-0.502375,-1.614751,-2.366699
qt6RJD3jXz,2.145456,-0.297916,-1.163679,0.80367
FUH5HSmVvK,-0.81152,-0.378954,0.127611,0.694116
aGm5IiKTpN,0.358053,0.283839,0.411738,-0.304223
fPWPGd1zB3,0.388082,-0.671565,1.406873,1.106747
wzrNzTVd4c,0.65269,-0.834069,2.102678,1.27709
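
On newer pandas releases `pd.util.testing` is deprecated, so the helper above may not be available. Here is a minimal sketch of a stand-in that uses only public APIs, assuming NumPy is installed. The helper name `make_random_df`, the seed and the default shape are illustrative choices, not part of pandas, and unlike `makeDataFrame` it uses a plain integer index rather than random string labels.

```python
import numpy as np
import pandas as pd


def make_random_df(n_rows=10, columns=("A", "B", "C", "D"), seed=0):
    """Return a DataFrame of standard-normal floats (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # seeded so the output is reproducible
    data = rng.standard_normal((n_rows, len(columns)))
    return pd.DataFrame(data, columns=list(columns))


random_df = make_random_df()
print(random_df.head())
```

Fixing the seed also helps when you want to share a small test case with someone else.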


### Turning clipboard contents into a DataFrame

In [17]:
df = pd.read_clipboard()
df

Unnamed: 0,A,B,C,D
NWJOmfCirM,1.625947,1.804767,-0.710319,-1.052005
EUs7nTWHOi,-1.366118,0.309858,-0.627835,1.990895
DJjqLicICV,-0.70018,-0.398182,-0.485865,-3.076456
mcpdpreFVO,0.583953,0.608293,-0.796824,-2.932026


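Keep reproducibility in mind, though: the clipboard is gone once you copy something else, so save the data to a file for whoever needs it later.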
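
One way to do that, sketched under a few assumptions: the DataFrame below is a made-up stand-in for the result of `pd.read_clipboard()`, and the file name `clipboard_snapshot.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the result of pd.read_clipboard(); the values are made up.
df = pd.DataFrame({"A": [1.5, -0.25], "B": [0.5, 2.0]}, index=["row_1", "row_2"])

# Hypothetical file name; a temporary directory keeps the sketch tidy.
path = os.path.join(tempfile.mkdtemp(), "clipboard_snapshot.csv")
df.to_csv(path)

# Anyone can now rebuild the DataFrame from the file instead of the clipboard.
restored = pd.read_csv(path, index_col=0)
print(restored)
```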

### Reading a CSV file

This is not limited to local files: given a URL and a network connection, any CSV file on the web can be loaded into a DataFrame.

In [4]:
titanic = pd.read_csv('http://bit.ly/kaggletrain')

In [5]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
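
`read_csv` also accepts file-like objects, which makes it easy to experiment offline. The sketch below uses a tiny in-memory CSV (a few made-up rows in a Titanic-like layout) as a stand-in for a path or URL, and shows `usecols` for loading only the columns you need.

```python
import io

import pandas as pd

# A tiny in-memory CSV standing in for a local path or a URL.
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22.0
2,1,1,female,38.0
3,1,3,female,26.0
"""

# usecols limits parsing to the listed columns.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["Survived", "Sex", "Age"])
print(subset)
```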


### Turning a Google Sheet into a DataFrame

In [None]:
!pip install --upgrade -q gspread

In [None]:
from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials

gc = gspread.authorize(GoogleCredentials.get_application_default())

spreadsheet_name = "text summarization evaluation dataset (DS research week)"
worksheet = gc.open(spreadsheet_name).sheet1

# get_all_values gives a list of rows.
rows = worksheet.get_all_values()
print(rows)

# Convert to a DataFrame and render.
df = pd.DataFrame.from_records(rows)
df
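
If the first row of the sheet holds the column names, it may be nicer to promote it to a header. A minimal sketch, assuming `rows` is a list of lists of strings as returned by `get_all_values`; the rows and column names below are made up for illustration.

```python
import pandas as pd

# Made-up rows in the shape returned by worksheet.get_all_values().
rows = [
    ["id", "text", "score"],
    ["1", "first example", "0.5"],
    ["2", "second example", "0.75"],
]

# Use the first row as column names and the remaining rows as data.
sheet_df = pd.DataFrame(rows[1:], columns=rows[0])

# Cells arrive as strings, so convert numeric columns explicitly.
sheet_df["score"] = pd.to_numeric(sheet_df["score"])
print(sheet_df.dtypes)
```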

toc
- pandas and numpy rely on vectorization and broadcasting to speed up computation, so prefer built-in functions
- creating a DataFrame
    - mostly demonstrated with the Titanic dataset
    - csv / combining multiple CSVs (rows / columns) with `pd.concat`
        - dropping the header row
        - read only specific columns with `usecols` in `read_csv`
        - use `df.info` to check memory usage and convert numerical columns to categorical to save resources
    - dummy data frame
        - created for fast testing
            - `pd.util.testing.makeDataFrame()` ➡️ contains random values
            - `.makeMissingDataFrame()` ➡️ some values missing
            - `.makeTimeDataFrame()` ➡️ has DateTimeIndex
            - `.makeMixedDataFrame()` ➡️ mixed data type
        - made from `dict`: key = column name, value = list of values
        - `np.random.rand(n_rows, n_cols)`
- data cleaning / wrangling
    - drop columns
        - https://gist.github.com/600474cca52227129872d44559d312f2
    - rename columns
    - reindex, then drop
    - fillna
    - (conditional) drop columns
    - split a string into two columns
    - bin a numerical column into categories (`pd.cut`)
- select data
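    - remember that `loc`, `iloc`, `ix`, `at` and `iat` are attributes, not functions, so use `[]` rather than `()` (note that `ix` is deprecated in favor of `loc` / `iloc`)
    - reverse row / column order
    - row and/or column slicer
        - `df.loc[:, col_a:col_b]`
    - mask and `query`
    - select rows that have a null value in any column
        - `df[df.isnull().any(axis=1)]`
    - select columns of a specific dtype
    - select rows where a column takes specific values (`isin`)
    - select rows whose value in some column is among the top k
    - `df.filter`
        - the example below keeps the string columns and the columns for the years 20XX (the 19XX columns are filtered out)
            - `cols = ['RegionID', 'RegionName', 'State', 'Metro', 'CountyName', 'SizeRank', '20.*']`
            - `df.filter(regex='|'.join(cols), axis=1).head()`
    - `iterrows` for row-specific processing
    - select rows by the value of a string column
        - `df2[df2['comment'].str.contains('yoooo')]`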
- sorting

```python
df1.index = pd.CategoricalIndex(df1.index, 
                               categories=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday','Saturday', 'Sunday'], 
                               ordered=True)
df1.sort_index().plot(kind='bar', figsize=(15, 12))
```
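
The weekday-sorting idea can be tried on toy data. In this sketch the counts are made up, and `ordered=True` is the `CategoricalIndex` parameter that marks the categories as ordered, so sorting follows the weekday order rather than the alphabet.

```python
import pandas as pd

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Made-up counts, deliberately out of weekday order.
counts = pd.DataFrame({"visits": [5, 3, 8]}, index=["Wednesday", "Monday", "Sunday"])

# An ordered CategoricalIndex sorts by category position, not alphabetically.
counts.index = pd.CategoricalIndex(counts.index, categories=days, ordered=True)
sorted_counts = counts.sort_index()
print(sorted_counts)
```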

- data processing
    - apply
    - expand a list column into multiple columns
    - split into two subsets
    - to_datetime
        - https://twitter.com/justmarkham/status/1148217934298406912?s=20
    - join dataframe
        - `df.merge`
- aggregation
    - value_counts + sorted values
    - unique
    - groupby + (multiple aggregation functions or describe())
        - use `resample` for time-based aggregation
            - https://twitter.com/justmarkham/status/1151846604216971264?s=20
    - group by custom lambda func
    - the `transform` function
    - multi-index groupby (`unstack`) vs. `pivot_table`
- simple plotting
    - easy plot + nice style
        - `df.plot()`
    - change display options / style object
        - chain your operations!
        - https://t.co/6xlytNLmGm
        - https://t.co/mhz9GiueaN
- output
    - Windows-friendly output: `to_csv(encoding="cp932")` (cp932 is the Japanese Windows code page)
- powerful tools
    - pandas profiling
        - `pip install pandas-profiling`
        - well suited to analyzing numerical features
    - qgrid
        - https://www.evernote.com/l/AET7-dpk349LNJcQVCWP-rGWdnyGA6-mz2w
    - tqdm
        - https://www.evernote.com/l/AETKpFnXeB5B84M5PbPBwXR_dMDZ4vAu0Xw
    - swifter
    - Facets
        - https://www.evernote.com/l/AETaGMqtguRAKbEF9z4hWtF9HazEVkWr70c
    - cufflinks and plotly
        - https://www.kdnuggets.com/2019/07/10-simple-hacks-speed-data-analysis-python.html
- good reference
    - youtube
    - the article saved in Pocket with 10 tricks
    - safari
    - dataquest cheat sheet
        - https://storage.googleapis.com/molten/lava/2018/09/f0c721d9-pandas-cheat-sheet-dataquest.jpg
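
Several of the selection tricks in the outline above can be sketched on a toy DataFrame. Everything here is made up for illustration (the `people` table, its column names and the bin edges); only the pandas calls themselves come from the outline or the public API.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "age": [22.0, np.nan, 35.0, 58.0],
    "city": ["Taipei", "Tokyo", None, "Taipei"],
    "fare_2019": [7.25, 71.28, 8.05, 53.10],
})

rows_with_null = people[people.isnull().any(axis=1)]    # rows with any missing value
in_cities = people[people["city"].isin(["Taipei"])]     # membership test on a column
over_30 = people.query("age > 30")                      # filter with a string expression
top2_fare = people.nlargest(2, "fare_2019")             # top-k rows by one column
numeric_cols = people.select_dtypes(include="number")   # columns of a given dtype
name_matches = people[people["name"].str.contains("e")] # substring match on strings

# Bin a numerical column into labeled categories: (0, 30] and (30, 60].
age_band = pd.cut(people["age"], bins=[0, 30, 60], labels=["young", "older"])
print(age_band)
```

Note the square brackets after `people[...]` and the parentheses after methods such as `query`: indexing uses `[]`, method calls use `()`.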
    
    

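## Inspecting the distribution of values in a column

In [None]:
# Count how many rows each customer and each product appears in.
customers = df['customer_id'].value_counts()
products = df['product_id'].value_counts()

quantiles = [0, 0.1, 0.25, 0.5, 0.75, 0.8, 0.85, 0.9, 0.95, 0.96, 0.97, 0.98, 0.99, 0.995, 0.999, 0.9999, 1]
print('customers\n', customers.quantile(quantiles))
print('products\n', products.quantile(quantiles))

## Selecting a subset

In [None]:
# Keep customers with at least 35 rows and products with at least 20 rows
# (uses the value counts computed in the previous section).
customers = customers[customers >= 35]
products = products[products >= 20]

# Inner merges on the shared key columns keep only rows whose customer and product both pass.
reduced_df = df.merge(pd.DataFrame({'customer_id': customers.index})).merge(pd.DataFrame({'product_id': products.index}))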
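
To see the counting and filtering steps from these two sections end to end, here is a small sketch on a made-up purchase log. The data and the threshold of 2 are assumptions scaled down to fit the toy example, and the boolean mask with `isin` is shown as an alternative to the chained merges, not as the method used above.

```python
import pandas as pd

# Made-up purchase log for illustration only.
df = pd.DataFrame({
    "customer_id": ["c1", "c1", "c1", "c2", "c2", "c3"],
    "product_id": ["p1", "p2", "p1", "p1", "p2", "p3"],
})

# How often does each id appear?
customers = df["customer_id"].value_counts()
products = df["product_id"].value_counts()
print(customers.quantile([0, 0.5, 1]))

# Keep ids that appear at least twice.
frequent_customers = customers[customers >= 2].index
frequent_products = products[products >= 2].index

# Boolean mask with isin: keeps rows whose customer and product are both frequent.
mask = df["customer_id"].isin(frequent_customers) & df["product_id"].isin(frequent_products)
reduced_df = df[mask]
print(reduced_df)
```

The mask keeps the original row index and row order, which can be convenient when you want to trace filtered rows back to the source data.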