# Introduction to DataFrames.jl (3): Reshaping / Sorting

![](https://juliadata.github.io/DataFrames.jl/stable/assets/logo.png)

DataFrames.jl official site: [https://juliadata.github.io/DataFrames.jl/stable/](https://juliadata.github.io/DataFrames.jl/stable/)

DataFrames.jl GitHub: [https://github.com/JuliaData/DataFrames.jl/blob/master/docs/src/index.md](https://github.com/JuliaData/DataFrames.jl/blob/master/docs/src/index.md)

## 0. Installation

If DataFrames.jl is not installed yet, run `Pkg.add()` to install it. The cell below pins version 0.20.2.

In [1]:
using Pkg
Pkg.add(PackageSpec(name="DataFrames", version="0.20.2"))

  Updating registry at `C:\Users\james\.julia\registries\General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
 Resolving package versions...
  Updating `C:\Users\james\.julia\environments\v1.2\Project.toml`
 [no changes]
  Updating `C:\Users\james\.julia\environments\v1.2\Manifest.toml`
 [no changes]


## 1. Reshaping

In [2]:
using DataFrames, CSV

┌ Info: Recompiling stale cache file C:\Users\james\.julia\compiled\v1.2\CSV\HHBkp.ji for CSV [336ed68f-0bac-5ca0-87d4-7b16caf5d00b]
└ @ Base loading.jl:1240


In [3]:
df = CSV.read("iris.csv", delim=",")
first(df, 5)

|  | SepalLength | SepalWidth | PetalLength | PetalWidth | Class |
| --- | --- | --- | --- | --- | --- |
|  | Float64 | Float64 | Float64 | Float64 | String |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


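### stack and stackdf

`stack()` reshapes a DataFrame into "long" form: the selected columns are unpivoted into rows, and two new columns are created automatically, `variable` (the original column name) and `value` (the corresponding value).

The columns to stack can be given either by index or by name. The example below uses indices. The original DataFrame has shape (150, 5); after stacking all five columns it becomes (750, 2). Because the `String` column `Class` is stacked together with the `Float64` columns, the `value` column has element type `Any`.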

In [4]:
longdf = stack(df, 1:5)
size(longdf)

(750, 2)

In [5]:
first(longdf, 5)

|  | variable | value |
| --- | --- | --- |
|  | Symbol | Any |
| 1 | SepalLength | 5.1 |
| 2 | SepalLength | 4.9 |
| 3 | SepalLength | 4.7 |
| 4 | SepalLength | 4.6 |
| 5 | SepalLength | 5.0 |


In [6]:
last(longdf, 5)

|  | variable | value |
| --- | --- | --- |
|  | Symbol | Any |
| 1 | Class | Iris-virginica |
| 2 | Class | Iris-virginica |
| 3 | Class | Iris-virginica |
| 4 | Class | Iris-virginica |
| 5 | Class | Iris-virginica |
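
To make the shape change easier to follow, here is a minimal sketch on a three-row table. The table `toy` and its columns `id`, `a` and `b` are made up for illustration and are not part of the iris data.

```julia
using DataFrames

# Hypothetical toy table (not the iris data used above).
toy = DataFrame(id = [1, 2, 3], a = [1.0, 2.0, 3.0], b = [4.0, 5.0, 6.0])

# Stack columns :a and :b; :id is not stacked, so it is kept as an ID column.
long = stack(toy, [:a, :b])

println(size(toy))    # (3, 3)
println(size(long))   # (6, 3): 3 rows x 2 stacked columns
println(long.value)   # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

Each original row contributes one long row per stacked column, which is why the row count is multiplied by the number of stacked columns.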


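`stackdf()` does almost the same thing as `stack()`. The difference is that `stackdf()` returns a view into the original DataFrame, so it uses less memory and is better suited to large datasets. Called without column arguments, as below, it stacks the floating-point columns and keeps the rest (here `Class`) as an ID column. In DataFrames 0.20, calling `stackdf()` prints a deprecation warning; as far as I can tell, `stack(df, view=true)` is the intended replacement.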

In [28]:
stackdf(df)[1:5, :]


|  | variable | value | Class |
| --- | --- | --- | --- |
|  | Symbol | Any | String |
| 1 | SepalLength | 5.1 | Iris-setosa |
| 2 | SepalLength | 4.9 | Iris-setosa |
| 3 | SepalLength | 4.7 | Iris-setosa |
| 4 | SepalLength | 4.6 | Iris-setosa |
| 5 | SepalLength | 5.0 | Iris-setosa |


The following example selects the columns by name.

In [8]:
longdf = stack(df, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth])
size(longdf)

(600, 3)

In [9]:
first(longdf, 5)

|  | variable | value | Class |
| --- | --- | --- | --- |
|  | Symbol | Float64 | String |
| 1 | SepalLength | 5.1 | Iris-setosa |
| 2 | SepalLength | 4.9 | Iris-setosa |
| 3 | SepalLength | 4.7 | Iris-setosa |
| 4 | SepalLength | 4.6 | Iris-setosa |
| 5 | SepalLength | 5.0 | Iris-setosa |


In [10]:
last(longdf, 5)

|  | variable | value | Class |
| --- | --- | --- | --- |
|  | Symbol | Float64 | String |
| 1 | PetalWidth | 2.3 | Iris-virginica |
| 2 | PetalWidth | 1.9 | Iris-virginica |
| 3 | PetalWidth | 2.0 | Iris-virginica |
| 4 | PetalWidth | 2.3 | Iris-virginica |
| 5 | PetalWidth | 1.8 | Iris-virginica |


The next example stacks only two columns and explicitly specifies the ID column (`Class`).

In [11]:
longdf = stack(df, [:SepalLength, :SepalWidth], :Class)
first(longdf, 5)

|  | variable | value | Class |
| --- | --- | --- | --- |
|  | Symbol | Float64 | String |
| 1 | SepalLength | 5.1 | Iris-setosa |
| 2 | SepalLength | 4.9 | Iris-setosa |
| 3 | SepalLength | 4.7 | Iris-setosa |
| 4 | SepalLength | 4.6 | Iris-setosa |
| 5 | SepalLength | 5.0 | Iris-setosa |


In [12]:
last(longdf, 5)

|  | variable | value | Class |
| --- | --- | --- | --- |
|  | Symbol | Float64 | String |
| 1 | SepalWidth | 3.0 | Iris-virginica |
| 2 | SepalWidth | 2.5 | Iris-virginica |
| 3 | SepalWidth | 3.0 | Iris-virginica |
| 4 | SepalWidth | 3.4 | Iris-virginica |
| 5 | SepalWidth | 3.0 | Iris-virginica |
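
The effect of the third argument is easier to see when the table has a column that is neither stacked nor wanted as an ID. The sketch below uses a made-up table `toy` with a hypothetical `note` column to compare the default with an explicit ID column.

```julia
using DataFrames

# Hypothetical toy table with an extra column :note that we do not want to keep.
toy = DataFrame(id = [1, 2], note = ["x", "y"], a = [1.0, 2.0], b = [3.0, 4.0])

# Default: every column that is not stacked (:id and :note) is kept as an ID column.
long_all = stack(toy, [:a, :b])

# Explicit ID column: only :id is carried over, :note is dropped.
long_id = stack(toy, [:a, :b], :id)

println(size(long_all))  # (4, 4)
println(size(long_id))   # (4, 3)
```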


### melt and meltdf

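`melt()` is derived from `stack()`. Unlike `stack()`, which takes the columns to stack, `melt()` takes only the ID columns and stacks every remaining column. In this version `melt()` prints a deprecation warning; the equivalent call is `stack(df, Not(:Class))`, shown in the second cell below.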

In [32]:
longdf = melt(df, :Class)
first(longdf, 5)


|  | variable | value | Class |
| --- | --- | --- | --- |
|  | Symbol | Float64 | String |
| 1 | SepalLength | 5.1 | Iris-setosa |
| 2 | SepalLength | 4.9 | Iris-setosa |
| 3 | SepalLength | 4.7 | Iris-setosa |
| 4 | SepalLength | 4.6 | Iris-setosa |
| 5 | SepalLength | 5.0 | Iris-setosa |


In [34]:
longdf = stack(df, Not(:Class))
first(longdf, 5)

|  | variable | value | Class |
| --- | --- | --- | --- |
|  | Symbol | Float64 | String |
| 1 | SepalLength | 5.1 | Iris-setosa |
| 2 | SepalLength | 4.9 | Iris-setosa |
| 3 | SepalLength | 4.7 | Iris-setosa |
| 4 | SepalLength | 4.6 | Iris-setosa |
| 5 | SepalLength | 5.0 | Iris-setosa |


`meltdf()` does almost the same thing as `melt()`. The difference is that `meltdf()` returns a view into the original DataFrame, so it uses less memory and is better suited to large datasets.

In [14]:
meltdf(df, :Class)[1:5, :]


|  | variable | value | Class |
| --- | --- | --- | --- |
|  | Symbol | Any | String |
| 1 | SepalLength | 5.1 | Iris-setosa |
| 2 | SepalLength | 4.9 | Iris-setosa |
| 3 | SepalLength | 4.7 | Iris-setosa |
| 4 | SepalLength | 4.6 | Iris-setosa |
| 5 | SepalLength | 5.0 | Iris-setosa |
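
Since `melt()` and `meltdf()` trigger warnings in this version, it may help to see both written with `stack()` only. This is a sketch under the assumption that `stack()` accepts a `view` keyword (as it does in the releases I am aware of); the table `toy` is made up for illustration.

```julia
using DataFrames

# Hypothetical toy table (not the iris data used above).
toy = DataFrame(id = [1, 2, 3], a = [1.0, 2.0, 3.0], b = [4.0, 5.0, 6.0])

# Counterpart of melt(toy, :id): stack everything except the ID column.
long = stack(toy, Not(:id))

# Counterpart of meltdf(toy, :id): same rows, but backed by views of `toy`
# (assumes the `view` keyword is available in the installed version).
longview = stack(toy, Not(:id), view = true)

println(size(long))                              # (6, 3)
println(collect(longview.value) == long.value)   # true
```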


### unstack

`unstack()` is the inverse of `stack()` / `melt()`: it turns a long DataFrame back into the original "wide" form.

In [38]:
# unstack needs a column that uniquely identifies each original row, so add an index column first.
df.id = 1:size(df, 1)
# longdf = melt(df, [:Class, :id])
longdf = stack(df, Not([:Class, :id]))

|  | variable | value | Class | id |
| --- | --- | --- | --- | --- |
|  | Symbol | Float64 | String | Int64 |
| 1 | SepalLength | 5.1 | Iris-setosa | 1 |
| 2 | SepalLength | 4.9 | Iris-setosa | 2 |
| 3 | SepalLength | 4.7 | Iris-setosa | 3 |
| 4 | SepalLength | 4.6 | Iris-setosa | 4 |
| 5 | SepalLength | 5.0 | Iris-setosa | 5 |
| 6 | SepalLength | 5.4 | Iris-setosa | 6 |
| 7 | SepalLength | 4.6 | Iris-setosa | 7 |
| 8 | SepalLength | 5.0 | Iris-setosa | 8 |
| 9 | SepalLength | 4.4 | Iris-setosa | 9 |
| 10 | SepalLength | 4.9 | Iris-setosa | 10 |


Now call `unstack()` to convert the long dataset back to wide form (the original five columns plus `id`).

In [39]:
widedf = unstack(longdf, :variable, :value)
size(widedf)

(150, 6)

In [40]:
first(widedf, 5)

|  | Class | id | PetalLength | PetalWidth | SepalLength | SepalWidth |
| --- | --- | --- | --- | --- | --- | --- |
|  | String | Int64 | Float64⍰ | Float64⍰ | Float64⍰ | Float64⍰ |
| 1 | Iris-setosa | 1 | 1.4 | 0.2 | 5.1 | 3.5 |
| 2 | Iris-setosa | 2 | 1.4 | 0.2 | 4.9 | 3.0 |
| 3 | Iris-setosa | 3 | 1.3 | 0.2 | 4.7 | 3.2 |
| 4 | Iris-setosa | 4 | 1.5 | 0.2 | 4.6 | 3.1 |
| 5 | Iris-setosa | 5 | 1.4 | 0.2 | 5.0 | 3.6 |


In [18]:
last(widedf, 5)

|  | Class | id | PetalLength | PetalWidth | SepalLength | SepalWidth |
| --- | --- | --- | --- | --- | --- | --- |
|  | String | Int64 | Float64⍰ | Float64⍰ | Float64⍰ | Float64⍰ |
| 1 | Iris-virginica | 146 | 5.2 | 2.3 | 6.7 | 3.0 |
| 2 | Iris-virginica | 147 | 5.0 | 1.9 | 6.3 | 2.5 |
| 3 | Iris-virginica | 148 | 5.2 | 2.0 | 6.5 | 3.0 |
| 4 | Iris-virginica | 149 | 5.4 | 2.3 | 6.2 | 3.4 |
| 5 | Iris-virginica | 150 | 5.1 | 1.8 | 5.9 | 3.0 |
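
The round trip can be checked on a small table. The sketch below uses a made-up table `toy` whose `id` column plays the role of the index column added above; it stacks two columns and then unstacks them again.

```julia
using DataFrames

# Hypothetical toy table; :id uniquely identifies each row.
toy = DataFrame(id = [1, 2, 3], a = [1.0, 2.0, 3.0], b = [4.0, 5.0, 6.0])

long = stack(toy, [:a, :b])               # wide -> long
wide = unstack(long, :variable, :value)   # long -> wide, one row per distinct :id

println(size(wide))         # (3, 3)
println(wide.a == toy.a)    # true
println(wide.b == toy.b)    # true
```

The unique `id` is what lets `unstack()` decide which long rows belong to the same wide row; the restored columns allow `missing`, which is why the iris output above shows `Float64⍰`.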


## 2. Sorting

There are two sorting functions, `sort()` and `sort!()`. `sort!()` sorts the DataFrame in place, while `sort()` returns a sorted copy and leaves the original DataFrame unchanged.
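
A minimal sketch of the difference, using a made-up one-column table `toy`:

```julia
using DataFrames

# Hypothetical toy table (not the iris data).
toy = DataFrame(x = [3, 1, 2])

sorted = sort(toy)     # returns a new, sorted DataFrame
println(sorted.x)      # [1, 2, 3]
println(toy.x)         # [3, 1, 2]  (the original is unchanged)

sort!(toy)             # sorts `toy` itself
println(toy.x)         # [1, 2, 3]
```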

In [19]:
df = CSV.read("iris.csv", delim=",") |> DataFrame
first(df, 5)

|  | SepalLength | SepalWidth | PetalLength | PetalWidth | Class |
| --- | --- | --- | --- | --- | --- |
|  | Float64 | Float64 | Float64 | Float64 | String |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


By default, rows are sorted in ascending order, comparing the columns one after another from left to right.

In [20]:
sort(df)

|  | SepalLength | SepalWidth | PetalLength | PetalWidth | Class |
| --- | --- | --- | --- | --- | --- |
|  | Float64 | Float64 | Float64 | Float64 | String |
| 1 | 4.3 | 3.0 | 1.1 | 0.1 | Iris-setosa |
| 2 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 3 | 4.4 | 3.0 | 1.3 | 0.2 | Iris-setosa |
| 4 | 4.4 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 5 | 4.5 | 2.3 | 1.3 | 0.3 | Iris-setosa |
| 6 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 7 | 4.6 | 3.2 | 1.4 | 0.2 | Iris-setosa |
| 8 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 9 | 4.6 | 3.6 | 1.0 | 0.2 | Iris-setosa |
| 10 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |


Pass `rev=true` to sort in descending order.

In [21]:
sort(df, rev = true)

|  | SepalLength | SepalWidth | PetalLength | PetalWidth | Class |
| --- | --- | --- | --- | --- | --- |
|  | Float64 | Float64 | Float64 | Float64 | String |
| 1 | 7.9 | 3.8 | 6.4 | 2.0 | Iris-virginica |
| 2 | 7.7 | 3.8 | 6.7 | 2.2 | Iris-virginica |
| 3 | 7.7 | 3.0 | 6.1 | 2.3 | Iris-virginica |
| 4 | 7.7 | 2.8 | 6.7 | 2.0 | Iris-virginica |
| 5 | 7.7 | 2.6 | 6.9 | 2.3 | Iris-virginica |
| 6 | 7.6 | 3.0 | 6.6 | 2.1 | Iris-virginica |
| 7 | 7.4 | 2.8 | 6.1 | 1.9 | Iris-virginica |
| 8 | 7.3 | 2.9 | 6.3 | 1.8 | Iris-virginica |
| 9 | 7.2 | 3.6 | 6.1 | 2.5 | Iris-virginica |
| 10 | 7.2 | 3.2 | 6.0 | 1.8 | Iris-virginica |


The columns to sort by can be specified by index or by name.

In [22]:
sort(df, [1, 2, 3, 4], rev=true)
# Same result as the previous example

|  | SepalLength | SepalWidth | PetalLength | PetalWidth | Class |
| --- | --- | --- | --- | --- | --- |
|  | Float64 | Float64 | Float64 | Float64 | String |
| 1 | 7.9 | 3.8 | 6.4 | 2.0 | Iris-virginica |
| 2 | 7.7 | 3.8 | 6.7 | 2.2 | Iris-virginica |
| 3 | 7.7 | 3.0 | 6.1 | 2.3 | Iris-virginica |
| 4 | 7.7 | 2.8 | 6.7 | 2.0 | Iris-virginica |
| 5 | 7.7 | 2.6 | 6.9 | 2.3 | Iris-virginica |
| 6 | 7.6 | 3.0 | 6.6 | 2.1 | Iris-virginica |
| 7 | 7.4 | 2.8 | 6.1 | 1.9 | Iris-virginica |
| 8 | 7.3 | 2.9 | 6.3 | 1.8 | Iris-virginica |
| 9 | 7.2 | 3.6 | 6.1 | 2.5 | Iris-virginica |
| 10 | 7.2 | 3.2 | 6.0 | 1.8 | Iris-virginica |


To sort by two columns with a different order for each, follow the example below.

In [23]:
# One column is sorted in descending order, the other in ascending order
sort(df, (:Class, :PetalWidth), rev=(true,false))

|  | SepalLength | SepalWidth | PetalLength | PetalWidth | Class |
| --- | --- | --- | --- | --- | --- |
|  | Float64 | Float64 | Float64 | Float64 | String |
| 1 | 6.1 | 2.6 | 5.6 | 1.4 | Iris-virginica |
| 2 | 6.0 | 2.2 | 5.0 | 1.5 | Iris-virginica |
| 3 | 6.3 | 2.8 | 5.1 | 1.5 | Iris-virginica |
| 4 | 7.2 | 3.0 | 5.8 | 1.6 | Iris-virginica |
| 5 | 4.9 | 2.5 | 4.5 | 1.7 | Iris-virginica |
| 6 | 6.3 | 2.9 | 5.6 | 1.8 | Iris-virginica |
| 7 | 7.3 | 2.9 | 6.3 | 1.8 | Iris-virginica |
| 8 | 6.7 | 2.5 | 5.8 | 1.8 | Iris-virginica |
| 9 | 6.5 | 3.0 | 5.5 | 1.8 | Iris-virginica |
| 10 | 6.3 | 2.7 | 4.9 | 1.8 | Iris-virginica |


In [24]:
# Using order(); the result is identical to the previous example
sort(df, (order(:Class, rev=true), order(:PetalWidth, rev=false)))

|  | SepalLength | SepalWidth | PetalLength | PetalWidth | Class |
| --- | --- | --- | --- | --- | --- |
|  | Float64 | Float64 | Float64 | Float64 | String |
| 1 | 6.1 | 2.6 | 5.6 | 1.4 | Iris-virginica |
| 2 | 6.0 | 2.2 | 5.0 | 1.5 | Iris-virginica |
| 3 | 6.3 | 2.8 | 5.1 | 1.5 | Iris-virginica |
| 4 | 7.2 | 3.0 | 5.8 | 1.6 | Iris-virginica |
| 5 | 4.9 | 2.5 | 4.5 | 1.7 | Iris-virginica |
| 6 | 6.3 | 2.9 | 5.6 | 1.8 | Iris-virginica |
| 7 | 7.3 | 2.9 | 6.3 | 1.8 | Iris-virginica |
| 8 | 6.7 | 2.5 | 5.8 | 1.8 | Iris-virginica |
| 9 | 6.5 | 3.0 | 5.5 | 1.8 | Iris-virginica |
| 10 | 6.3 | 2.7 | 4.9 | 1.8 | Iris-virginica |
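
The two forms can be compared on a table small enough to check by hand. The sketch below uses a made-up table `toy` and passes vectors instead of tuples; I am assuming vectors are accepted by the installed version (to my understanding they are, and later releases may accept only vectors).

```julia
using DataFrames

# Hypothetical toy table (not the iris data).
toy = DataFrame(g = ["a", "b", "a", "b"], v = [2, 1, 1, 2])

# :g descending, :v ascending, written with vectors instead of tuples.
s1 = sort(toy, [:g, :v], rev = [true, false])

# The same ordering expressed per column with order().
s2 = sort(toy, [order(:g, rev = true), order(:v)])

println(s1.g)        # ["b", "b", "a", "a"]
println(s1.v)        # [1, 2, 1, 2]
println(s1 == s2)    # true
```

`order()` becomes more useful once each column needs its own combination of options, since every column then carries its settings with it.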


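The example below uses the `by` keyword to sort by the length of the column value (the length of the `Class` name). Note that `sort()` returns a new DataFrame and the result is not assigned here, so `first(df, 5)` and `last(df, 5)` display the original, unchanged `df`; assign the result (or use `sort!()`) to work with the sorted rows.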

In [25]:
sort(df, :Class, by=length)
first(df, 5)

|  | SepalLength | SepalWidth | PetalLength | PetalWidth | Class |
| --- | --- | --- | --- | --- | --- |
|  | Float64 | Float64 | Float64 | Float64 | String |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [26]:
last(df, 5)

|  | SepalLength | SepalWidth | PetalLength | PetalWidth | Class |
| --- | --- | --- | --- | --- | --- |
|  | Float64 | Float64 | Float64 | Float64 | String |
| 1 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 2 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 3 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 4 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 5 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
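
To see what `by` does to the row order, the sorted result has to be kept in a variable. A minimal sketch with a made-up table `toy`:

```julia
using DataFrames

# Hypothetical toy table (not the iris data).
toy = DataFrame(name = ["ccc", "a", "bb"], n = [1, 2, 3])

# Compare rows by the length of :name instead of by the string itself.
bylen = sort(toy, :name, by = length)

println(bylen.name)   # ["a", "bb", "ccc"]
println(bylen.n)      # [2, 3, 1]
println(toy.name)     # ["ccc", "a", "bb"]  (sort() did not modify toy)
```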


#### unique

Use `Base.unique()` to list the distinct values in a column.

In [27]:
unique(df[!, :Class])

3-element Array{String,1}:
 "Iris-setosa"    
 "Iris-versicolor"
 "Iris-virginica" 