# Introduction to Julia DataFrames.jl (1): Getting Started

![](https://juliadata.github.io/DataFrames.jl/stable/assets/logo.png)

DataFrames.jl official documentation: [https://juliadata.github.io/DataFrames.jl/stable/](https://juliadata.github.io/DataFrames.jl/stable/)

DataFrames.jl on GitHub: [https://github.com/JuliaData/DataFrames.jl/blob/master/docs/src/index.md](https://github.com/JuliaData/DataFrames.jl/blob/master/docs/src/index.md)

## 0. Installation

If DataFrames.jl is not installed yet, run `Pkg.add()` to install it. The cell below pins version 0.20.2.

In [1]:
using Pkg
Pkg.add(PackageSpec(name="DataFrames", version="0.20.2"))

[32m[1m   Updating[22m[39m registry at `C:\Users\qwerz\.julia\registries\General`


[?25l

[32m[1m   Updating[22m[39m git-repo `https://github.com/JuliaRegistries/General.git`




[32m[1m  Resolving[22m[39m package versions...
[32m[1m   Updating[22m[39m `C:\Users\qwerz\.julia\environments\v1.4\Project.toml`
[90m [no changes][39m
[32m[1m   Updating[22m[39m `C:\Users\qwerz\.julia\environments\v1.4\Manifest.toml`
[90m [no changes][39m


## 1. Creating a DataFrame

In [2]:
using DataFrames

### 1.1 Creating a DataFrame from vectors

In [3]:
df = DataFrame(col1 = 1:5, col2 = ["M", "F", "F", missing, "M"])

| Row | col1 | col2 |
|---|---|---|
| | Int64 | String⍰ |
| 1 | 1 | M |
| 2 | 2 | F |
| 3 | 3 | F |
| 4 | 4 | missing |
| 5 | 5 | M |
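
The `⍰` suffix in the type row (as in `String⍰`) marks a column that may hold `missing`, that is, a column whose element type is `Union{Missing, String}`. The following is a minimal sketch for checking this yourself; `df_demo` is an illustrative name, and the data mirrors the cell above.

```julia
using DataFrames

# Same toy data as above; the `missing` entry widens the element type of col2.
df_demo = DataFrame(col1 = [1, 2, 3, 4, 5],
                    col2 = ["M", "F", "F", missing, "M"])

eltype(df_demo.col1)            # Int64 on a 64-bit system: no missing values allowed
eltype(df_demo.col2)            # Union{Missing, String}
count(ismissing, df_demo.col2)  # number of missing entries in col2
```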


### 1.2 Building a DataFrame column by column

In [4]:
# Create an empty DataFrame with the constructor
df = DataFrame()

In [5]:
# Assign each column and its values to add it to the DataFrame
df.col1 = 1:5
df.col2 = ["M", "F", "F", missing, "M"]

# Display the DataFrame with show()
show(df)

5×2 DataFrame
│ Row │ col1  │ col2    │
│     │ Int64 │ String⍰ │
├─────┼───────┼─────────┤
│ 1   │ 1     │ M       │
│ 2   │ 2     │ F       │
│ 3   │ 3     │ F       │
│ 4   │ 4     │ missing │
│ 5   │ 5     │ M       │

### 1.3 Adding rows to a DataFrame

To append a row, pass the values to `push!()` as a tuple, a vector, or a dictionary.

In [6]:
# Using a tuple
push!(df, (1, "M"))

| Row | col1 | col2 |
|---|---|---|
| | Int64 | String⍰ |
| 1 | 1 | M |
| 2 | 2 | F |
| 3 | 3 | F |
| 4 | 4 | missing |
| 5 | 5 | M |
| 6 | 1 | M |


In [7]:
# Using a vector
push!(df, [2, "f"])

| Row | col1 | col2 |
|---|---|---|
| | Int64 | String⍰ |
| 1 | 1 | M |
| 2 | 2 | F |
| 3 | 3 | F |
| 4 | 4 | missing |
| 5 | 5 | M |
| 6 | 1 | M |
| 7 | 2 | f |


In [8]:
# Using a Dict
push!(df, Dict(:col2 => "F", :col1 => 2))

| Row | col1 | col2 |
|---|---|---|
| | Int64 | String⍰ |
| 1 | 1 | M |
| 2 | 2 | F |
| 3 | 3 | F |
| 4 | 4 | missing |
| 5 | 5 | M |
| 6 | 1 | M |
| 7 | 2 | f |
| 8 | 2 | F |
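
The three forms differ in how values are matched to columns: a tuple or vector is matched by position, while a Dict is matched by column name, which is why the Dict above can list `:col2` first. The sketch below restates this on a small table. The name `people` is illustrative, and the NamedTuple form is an extra that assumes the installed DataFrames.jl accepts NamedTuple rows; it is written in column order to stay on the safe side.

```julia
using DataFrames

people = DataFrame(col1 = [1, 2], col2 = ["M", "F"])

# Positional forms: values must follow the column order (col1, col2).
push!(people, (3, "F"))
push!(people, [4, "M"])

# Keyed forms: values are matched by column name.
push!(people, Dict(:col2 => "F", :col1 => 5))   # key order does not matter
push!(people, (col1 = 6, col2 = "M"))           # NamedTuple, kept in column order

size(people)   # (6, 2)
```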


### 1.4 Deleting rows

Call `deleterows!()` to delete the specified rows from a DataFrame in place.

In [9]:
deleterows!(df, 7:8)

| Row | col1 | col2 |
|---|---|---|
| | Int64 | String⍰ |
| 1 | 1 | M |
| 2 | 2 | F |
| 3 | 3 | F |
| 4 | 4 | missing |
| 5 | 5 | M |
| 6 | 1 | M |
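
Later DataFrames.jl releases renamed `deleterows!` (recent 1.x versions provide `deleteat!`), so the call above is tied to the version pinned in this notebook. A sketch of an alternative that does not depend on that function: index the table with the rows to keep, which also leaves the original untouched. The names `rows` and `kept` are illustrative.

```julia
using DataFrames

rows = DataFrame(col1 = [1, 2, 3, 4, 5, 1, 2, 2],
                 col2 = ["M", "F", "F", "M", "M", "M", "f", "F"])

# Keep every row index except 7 and 8.
keep_idx = setdiff(1:nrow(rows), 7:8)
kept = rows[keep_idx, :]

size(rows)   # (8, 2): the original is unchanged
size(kept)   # (6, 2)
```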


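### 1.5 Loading a data set

Continuing from the previous examples, we use CSV.jl to load the Auto MPG Data Set from the UCI Machine Learning Repository. The loaded object is a `DataFrame`.

If CSV.jl is not installed yet, install it first.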

In [10]:
Pkg.add("CSV")

[32m[1m  Resolving[22m[39m package versions...
[32m[1m   Updating[22m[39m `C:\Users\qwerz\.julia\environments\v1.4\Project.toml`
[90m [no changes][39m
[32m[1m   Updating[22m[39m `C:\Users\qwerz\.julia\environments\v1.4\Manifest.toml`
[90m [no changes][39m


In [11]:
using CSV

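The `read()` function of CSV.jl parses CSV data into a DataFrame: in the CSV.jl version used here, the value returned by `CSV.read()` is of type `DataFrame`.

The warnings are expected: the data set contains missing values, so CSV.jl emits warnings while loading the file.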

In [12]:
df = CSV.read("../../data/auto-mpg.data", delim=',')



| Row | mpg | cylinders | displacement | horsepower | weight | acceleration | model year |
|---|---|---|---|---|---|---|---|
| | Float64 | Int64⍰ | Float64 | String | Float64⍰ | Float64⍰ | Float64 |
| 1 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70.0 |
| 2 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70.0 |
| 3 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70.0 |
| 4 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70.0 |
| 5 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70.0 |
| 6 | 15.0 | 8 | 429.0 | 198.0 | 4341.0 | 10.0 | 70.0 |
| 7 | 14.0 | 8 | 454.0 | 220.0 | 4354.0 | 9.0 | 70.0 |
| 8 | 14.0 | 8 | 440.0 | 215.0 | 4312.0 | 8.5 | 70.0 |
| 9 | 14.0 | 8 | 455.0 | 225.0 | 4425.0 | 10.0 | 70.0 |
| 10 | 15.0 | 8 | 390.0 | 190.0 | 3850.0 | 8.5 | 70.0 |


In [13]:
# To display every column or row, use `show()`
# Here we display all columns of rows 1 to 5
show(df[1:5, :], allcols=true)

5×9 DataFrame
│ Row │ mpg     │ cylinders │ displacement │ horsepower │ weight   │
│     │ Float64 │ Int64⍰    │ Float64      │ String     │ Float64⍰ │
├─────┼─────────┼───────────┼──────────────┼────────────┼──────────┤
│ 1   │ 18.0    │ 8         │ 307.0        │ 130.0      │ 3504.0   │
│ 2   │ 15.0    │ 8         │ 350.0        │ 165.0      │ 3693.0   │
│ 3   │ 18.0    │ 8         │ 318.0        │ 150.0      │ 3436.0   │
│ 4   │ 16.0    │ 8         │ 304.0        │ 150.0      │ 3433.0   │
│ 5   │ 17.0    │ 8         │ 302.0        │ 140.0      │ 3449.0   │

│ Row │ acceleration │ model year │ origin  │ car name                  │
│     │ Float64⍰     │ Float64    │ Float64 │ String                    │
├─────┼──────────────┼────────────┼─────────┼───────────────────────────┤
│ 1   │ 12.0         │ 70.0       │ 1.0     │ chevrolet chevelle malibu │
│ 2   │ 11.5         │ 70.0       │ 1.0     │ b

### 1.6 Copying a DataFrame

Call `copy()` to duplicate a DataFrame and create a new one.

In [14]:
df2 = copy(df)

| Row | mpg | cylinders | displacement | horsepower | weight | acceleration | model year |
|---|---|---|---|---|---|---|---|
| | Float64 | Int64⍰ | Float64 | String | Float64⍰ | Float64⍰ | Float64 |
| 1 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70.0 |
| 2 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70.0 |
| 3 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70.0 |
| 4 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70.0 |
| 5 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70.0 |
| 6 | 15.0 | 8 | 429.0 | 198.0 | 4341.0 | 10.0 | 70.0 |
| 7 | 14.0 | 8 | 454.0 | 220.0 | 4354.0 | 9.0 | 70.0 |
| 8 | 14.0 | 8 | 440.0 | 215.0 | 4312.0 | 8.5 | 70.0 |
| 9 | 14.0 | 8 | 455.0 | 225.0 | 4425.0 | 10.0 | 70.0 |
| 10 | 15.0 | 8 | 390.0 | 190.0 | 3850.0 | 8.5 | 70.0 |


In [15]:
# df2 has the same contents as df
describe(df2)

| Row | variable | mean | min | median | max | nunique | nmissing | eltype |
|---|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Union… | Union… | Type |
| 1 | mpg | 23.5146 | 9.0 | 23.0 | 46.6 | | | Float64 |
| 2 | cylinders | 5.44836 | 3.0 | 4.0 | 8 | | 1.0 | Union{Missing, Int64} |
| 3 | displacement | 192.682 | 8.0 | 146.0 | 455.0 | | | Float64 |
| 4 | horsepower | | 304.0 | | ? | 94.0 | | String |
| 5 | weight | 2966.01 | 193.0 | 2797.5 | 5140.0 | | 6.0 | Union{Missing, Float64} |
| 6 | acceleration | 27.5656 | 8.0 | 15.5 | 4732.0 | | 6.0 | Union{Missing, Float64} |
| 7 | model year | 112.433 | 18.5 | 76.0 | 3035.0 | | | Float64 |
| 8 | origin | 1.98719 | 1.0 | 1.0 | 70.0 | | | Float64 |
| 9 | car name | | 71.0 | | vw rabbit custom | 306.0 | | String |
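
What makes `copy()` useful is that the new DataFrame owns its own column vectors, whereas plain assignment only creates a second name for the same object. A minimal sketch of the difference, using illustrative names (`original`, `alias`, `dup`):

```julia
using DataFrames

original = DataFrame(a = [1, 2, 3])
alias    = original         # plain assignment: same object, no copy
dup      = copy(original)   # new DataFrame with copied columns

dup.a[1]   = 100            # changes only the copy
alias.a[2] = 200            # also changes `original`

original.a   # [1, 200, 3]
dup.a        # [100, 2, 3]
```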


## 2. Saving a DataFrame to a CSV file

In [16]:
CSV.write("a.csv", df)

"a.csv"

The directory listing shows that the CSV file has been written.

In [17]:
readdir()

7-element Array{String,1}:
 ".ipynb_checkpoints"
 "04-02-2020.csv"
 "README.md"
 "a.csv"
 "auto-mpg.data"
 "julia_017_example.ipynb"
 "julia_017_hw.ipynb"

In [18]:
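# Use Julia's built-in DelimitedFiles library
# to check the file header and the first 5 records.
# readdlm() splits on whitespace by default, so each entry below
# holds only the text up to the first space in its line.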
using DelimitedFiles

readdlm("a.csv")[1:6]

6-element Array{Any,1}:
 "mpg,cylinders,displacement,horsepower,weight,acceleration,model"
 "18.0,8,307.0,130.0,3504.0,12.0,70.0,1.0,chevrolet"
 "15.0,8,350.0,165.0,3693.0,11.5,70.0,1.0,buick"
 "18.0,8,318.0,150.0,3436.0,11.0,70.0,1.0,plymouth"
 "16.0,8,304.0,150.0,3433.0,12.0,70.0,1.0,amc"
 "17.0,8,302.0,140.0,3449.0,10.5,70.0,1.0,ford"

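Because `readdlm()` splits on whitespace unless told otherwise, the check above cuts each line at its first space. The sketch below shows two standard-library ways to see complete lines or cells. It writes a hypothetical two-line sample file to a temporary directory instead of reading `a.csv`, so the file name and contents are assumptions made for illustration.

```julia
using DelimitedFiles

# Hypothetical two-line sample standing in for a.csv.
path = joinpath(mktempdir(), "sample.csv")
write(path, "mpg,car name\n18.0,chevrolet chevelle malibu\n")

lines = readlines(path)      # one String per line, spaces preserved
cells = readdlm(path, ',')   # 2×2 matrix, split on commas only

lines[2]      # "18.0,chevrolet chevelle malibu"
cells[2, 2]   # "chevrolet chevelle malibu"
```

`readlines()` is enough for a quick look at the raw text, while passing `','` to `readdlm()` gives access to individual cells.
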
## 3. Working with a DataFrame

### 3.1 Inspecting a DataFrame

In [19]:
# Check the dimensions of the DataFrame
size(df)

(398, 9)

In [20]:
# Summarize the DataFrame
describe(df)

| Row | variable | mean | min | median | max | nunique | nmissing | eltype |
|---|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Union… | Union… | Type |
| 1 | mpg | 23.5146 | 9.0 | 23.0 | 46.6 | | | Float64 |
| 2 | cylinders | 5.44836 | 3.0 | 4.0 | 8 | | 1.0 | Union{Missing, Int64} |
| 3 | displacement | 192.682 | 8.0 | 146.0 | 455.0 | | | Float64 |
| 4 | horsepower | | 304.0 | | ? | 94.0 | | String |
| 5 | weight | 2966.01 | 193.0 | 2797.5 | 5140.0 | | 6.0 | Union{Missing, Float64} |
| 6 | acceleration | 27.5656 | 8.0 | 15.5 | 4732.0 | | 6.0 | Union{Missing, Float64} |
| 7 | model year | 112.433 | 18.5 | 76.0 | 3035.0 | | | Float64 |
| 8 | origin | 1.98719 | 1.0 | 1.0 | 70.0 | | | Float64 |
| 9 | car name | | 71.0 | | vw rabbit custom | 306.0 | | String |


Each of the following three lines lists all rows and columns.

In [21]:
df
df[!, :]
df[:, :]

| Row | mpg | cylinders | displacement | horsepower | weight | acceleration | model year |
|---|---|---|---|---|---|---|---|
| | Float64 | Int64⍰ | Float64 | String | Float64⍰ | Float64⍰ | Float64 |
| 1 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70.0 |
| 2 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70.0 |
| 3 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70.0 |
| 4 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70.0 |
| 5 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70.0 |
| 6 | 15.0 | 8 | 429.0 | 198.0 | 4341.0 | 10.0 | 70.0 |
| 7 | 14.0 | 8 | 454.0 | 220.0 | 4354.0 | 9.0 | 70.0 |
| 8 | 14.0 | 8 | 440.0 | 215.0 | 4312.0 | 8.5 | 70.0 |
| 9 | 14.0 | 8 | 455.0 | 225.0 | 4425.0 | 10.0 | 70.0 |
| 10 | 15.0 | 8 | 390.0 | 190.0 | 3850.0 | 8.5 | 70.0 |
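
All three forms display the same table, but `!` and `:` differ in whether data is copied. The difference is easiest to see on a single column: `df[!, :col]` hands back the stored column itself, while `df[:, :col]` returns a copy. A minimal sketch on a small illustrative table (`cars` is not the data set loaded above):

```julia
using DataFrames

cars = DataFrame(mpg = [18.0, 15.0], cylinders = [8, 8])

no_copy = cars[!, :mpg]   # the stored column itself
copied  = cars[:, :mpg]   # a fresh copy of the column

no_copy === cars.mpg   # true
copied  === cars.mpg   # false

copied[1] = 0.0        # does not touch `cars`
cars.mpg[1]            # 18.0
```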


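In a Jupyter Notebook, a DataFrame is by default displayed only as far as it fits on screen, so some columns and rows may be omitted. The `show()` function gives finer control over the display. The example below sets `allcols=true` and `allrows=true` to show every column and row.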

In [22]:
show(df, allcols=true, allrows=true)

398×9 DataFrame
│ Row │ mpg     │ cylinders │ displacement │ horsepower │ weight   │
│     │ Float64 │ Int64⍰    │ Float64      │ String     │ Float64⍰ │
├─────┼─────────┼───────────┼──────────────┼────────────┼──────────┤
│ 1   │ 18.0    │ 8         │ 307.0        │ 130.0      │ 3504.0   │
│ 2   │ 15.0    │ 8         │ 350.0        │ 165.0      │ 3693.0   │
│ 3   │ 18.0    │ 8         │ 318.0        │ 150.0      │ 3436.0   │
│ 4   │ 16.0    │ 8         │ 304.0        │ 150.0      │ 3433.0   │
│ 5   │ 17.0    │ 8         │ 302.0        │ 140.0      │ 3449.0   │
│ 6   │ 15.0    │ 8         │ 429.0        │ 198.0      │ 4341.0   │
│ 7   │ 14.0    │ 8         │ 454.0        │ 220.0      │ 4354.0   │
│ 8   │ 14.0    │ 8         │ 440.0        │ 215.0      │ 4312.0   │
│ 9   │ 14.0    │ 8         │ 455.0        │ 225.0      │ 4425.0   │
│ 10  │ 15.0    │ 8         │ 390.0        │ 190.0      │ 3850.0   │
│ 11  │ 15.0    │ 8         │ 383.0  

`first()` and `last()` return the first n or last n rows of a DataFrame.

In [23]:
first(df, 5)

| Row | mpg | cylinders | displacement | horsepower | weight | acceleration | model year |
|---|---|---|---|---|---|---|---|
| | Float64 | Int64⍰ | Float64 | String | Float64⍰ | Float64⍰ | Float64 |
| 1 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70.0 |
| 2 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70.0 |
| 3 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70.0 |
| 4 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70.0 |
| 5 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70.0 |


In [24]:
last(df, 10)

| Row | mpg | cylinders | displacement | horsepower | weight | acceleration | model year |
|---|---|---|---|---|---|---|---|
| | Float64 | Int64⍰ | Float64 | String | Float64⍰ | Float64⍰ | Float64 |
| 1 | 26.0 | 4 | 156.0 | 92.0 | 2585.0 | 14.5 | 82.0 |
| 2 | 22.0 | 6 | 232.0 | 112.0 | 2835.0 | 14.7 | 82.0 |
| 3 | 32.0 | 4 | 144.0 | 96.0 | 2665.0 | 13.9 | 82.0 |
| 4 | 36.0 | 4 | 135.0 | 84.0 | 2370.0 | 13.0 | 82.0 |
| 5 | 27.0 | 4 | 151.0 | 90.0 | 2950.0 | 17.3 | 82.0 |
| 6 | 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82.0 |
| 7 | 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82.0 |
| 8 | 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82.0 |
| 9 | 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82.0 |
| 10 | 31.0 | 4 | 119.0 | 82.0 | 2720.0 | 19.4 | 82.0 |
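
Note that the `Row` numbers in the `last()` output restart at 1: the result is a new DataFrame, not a window that remembers its original positions. A small sketch on an illustrative 10-row table makes the returned rows easy to verify (`nums`, `head`, and `tail` are made-up names):

```julia
using DataFrames

nums = DataFrame(id = 1:10, sq = (1:10) .^ 2)

head = first(nums, 3)   # rows 1 to 3
tail = last(nums, 2)    # rows 9 and 10

head.id   # [1, 2, 3]
tail.sq   # [81, 100]
```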


### 3.2 DataFrame subsets

To view a subset of a DataFrame, use `df[<row index>, <column index>]`.

In [25]:
df[1:5, :]

| Row | mpg | cylinders | displacement | horsepower | weight | acceleration | model year |
|---|---|---|---|---|---|---|---|
| | Float64 | Int64⍰ | Float64 | String | Float64⍰ | Float64⍰ | Float64 |
| 1 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70.0 |
| 2 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70.0 |
| 3 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70.0 |
| 4 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70.0 |
| 5 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70.0 |


In [26]:
# As shown earlier, display all columns
show(df[1:5, :], allcols=true)

5×9 DataFrame
│ Row │ mpg     │ cylinders │ displacement │ horsepower │ weight   │
│     │ Float64 │ Int64⍰    │ Float64      │ String     │ Float64⍰ │
├─────┼─────────┼───────────┼──────────────┼────────────┼──────────┤
│ 1   │ 18.0    │ 8         │ 307.0        │ 130.0      │ 3504.0   │
│ 2   │ 15.0    │ 8         │ 350.0        │ 165.0      │ 3693.0   │
│ 3   │ 18.0    │ 8         │ 318.0        │ 150.0      │ 3436.0   │
│ 4   │ 16.0    │ 8         │ 304.0        │ 150.0      │ 3433.0   │
│ 5   │ 17.0    │ 8         │ 302.0        │ 140.0      │ 3449.0   │

│ Row │ acceleration │ model year │ origin  │ car name                  │
│     │ Float64⍰     │ Float64    │ Float64 │ String                    │
├─────┼──────────────┼────────────┼─────────┼───────────────────────────┤
│ 1   │ 12.0         │ 70.0       │ 1.0     │ chevrolet chevelle malibu │
│ 2   │ 11.5         │ 70.0       │ 1.0     │ b

In [27]:
# Select specific rows / columns to view
df[[1, 3, 5], [1, 2, 9]]

| Row | mpg | cylinders | car name |
|---|---|---|---|
| | Float64 | Int64⍰ | String |
| 1 | 18.0 | 8 | chevrolet chevelle malibu |
| 2 | 18.0 | 8 | plymouth satellite |
| 3 | 17.0 | 8 | ford torino |


Columns can be specified by index or by name. A name is written as ":" followed by the column name, which makes it a Julia Symbol. For example:

In [30]:
df[1:5, [:mpg, :displacement, :horsepower]]

| Row | mpg | displacement | horsepower |
|---|---|---|---|
| | Float64 | Float64 | String |
| 1 | 18.0 | 307.0 | 130.0 |
| 2 | 15.0 | 350.0 | 165.0 |
| 3 | 18.0 | 318.0 | 150.0 |
| 4 | 16.0 | 304.0 | 150.0 |
| 5 | 17.0 | 302.0 | 140.0 |
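
Positions and names can be used interchangeably, and the row index can also be a Boolean mask computed from a column. The sketch below is an illustration on made-up data rather than the Auto MPG table; the mask form goes slightly beyond the cells above.

```julia
using DataFrames

cars = DataFrame(mpg = [18.0, 15.0, 32.0],
                 cylinders = [8, 8, 4],
                 name = ["malibu", "skylark", "rabbit"])

by_position = cars[[1, 3], [1, 3]]          # rows 1 and 3, columns 1 and 3
by_name     = cars[[1, 3], [:mpg, :name]]   # same selection using Symbols
mask        = cars[cars.mpg .> 16, :]       # rows where mpg exceeds 16

by_position == by_name   # true
mask.name                # ["malibu", "rabbit"]
```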


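### 3.3 `select()` and `select!()`

To select columns of a DataFrame, use `select()` or `select!()`. The difference is that `select()` leaves the original DataFrame unchanged and returns a new DataFrame with the selected columns, whereas `select!()` modifies the original DataFrame.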

In [31]:
select(df2, 1:3)

| Row | mpg | cylinders | displacement |
|---|---|---|---|
| | Float64 | Int64⍰ | Float64 |
| 1 | 18.0 | 8 | 307.0 |
| 2 | 15.0 | 8 | 350.0 |
| 3 | 18.0 | 8 | 318.0 |
| 4 | 16.0 | 8 | 304.0 |
| 5 | 17.0 | 8 | 302.0 |
| 6 | 15.0 | 8 | 429.0 |
| 7 | 14.0 | 8 | 454.0 |
| 8 | 14.0 | 8 | 440.0 |
| 9 | 14.0 | 8 | 455.0 |
| 10 | 15.0 | 8 | 390.0 |


In [32]:
# df2 is unchanged
size(df2)

(398, 9)

In [33]:
# After select!(), df2 is modified and only the 3 selected columns remain
select!(df2, 1:3)
size(df2)

(398, 3)
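
Besides integer ranges, `select()` should also accept column names and the `Not` selector exported by DataFrames.jl. The sketch below assumes both are available in the installed version and uses a small illustrative table rather than `df2`.

```julia
using DataFrames

cars = DataFrame(mpg = [18.0, 15.0], cylinders = [8, 8], weight = [3504.0, 3693.0])

picked  = select(cars, [:mpg, :weight])   # keep two columns by name
dropped = select(cars, Not(:weight))      # keep everything except weight

size(cars)             # (2, 3): select() left the original alone
select!(cars, [:mpg])
size(cars)             # (2, 1): select!() changed it in place
```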

### 3.4 Column operations

#### Aggregate

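`aggregate` applies one or more functions to every column of a DataFrame. For example, to compute the mean and median of fuel economy (mpg) and engine displacement (displacement), we can proceed as shown below, using the `mean` and `median` functions from the Statistics module.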

In [34]:
df3 = df[1:5, [:mpg, :displacement]]

| Row | mpg | displacement |
|---|---|---|
| | Float64 | Float64 |
| 1 | 18.0 | 307.0 |
| 2 | 15.0 | 350.0 |
| 3 | 18.0 | 318.0 |
| 4 | 16.0 | 304.0 |
| 5 | 17.0 | 302.0 |


In [35]:
using Statistics

aggregate(df3, [mean, median])

| Row | mpg_mean | displacement_mean | mpg_median | displacement_median |
|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 |
| 1 | 16.8 | 316.2 | 17.0 | 307.0 |
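
`aggregate` is specific to older DataFrames.jl releases such as the one pinned here; as far as I know it was later deprecated and removed. The same four numbers can be obtained by applying the Statistics functions to the column vectors directly, which should not depend on the DataFrames.jl version. A minimal sketch, with `df_small` rebuilt from the five rows shown above:

```julia
using DataFrames, Statistics

df_small = DataFrame(mpg = [18.0, 15.0, 18.0, 16.0, 17.0],
                     displacement = [307.0, 350.0, 318.0, 304.0, 302.0])

# One statistic per column, computed on the column vectors themselves.
mpg_mean    = mean(df_small.mpg)              # 16.8
mpg_median  = median(df_small.mpg)            # 17.0
disp_mean   = mean(df_small.displacement)     # 316.2
disp_median = median(df_small.displacement)   # 307.0
```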


#### Sort: a quick look

Sorting is covered in more detail in later material.

`sort()` returns a sorted copy and does not change the original DataFrame.

In [36]:
sort(df3)

| Row | mpg | displacement |
|---|---|---|
| | Float64 | Float64 |
| 1 | 15.0 | 350.0 |
| 2 | 16.0 | 304.0 |
| 3 | 17.0 | 302.0 |
| 4 | 18.0 | 307.0 |
| 5 | 18.0 | 318.0 |


In [37]:
df3

| Row | mpg | displacement |
|---|---|---|
| | Float64 | Float64 |
| 1 | 18.0 | 307.0 |
| 2 | 15.0 | 350.0 |
| 3 | 18.0 | 318.0 |
| 4 | 16.0 | 304.0 |
| 5 | 17.0 | 302.0 |


`sort!()` sorts in place and changes the original DataFrame.

The example below sorts by displacement in descending order.

In [38]:
sort!(df3, :displacement, rev=true)

| Row | mpg | displacement |
|---|---|---|
| | Float64 | Float64 |
| 1 | 15.0 | 350.0 |
| 2 | 18.0 | 318.0 |
| 3 | 18.0 | 307.0 |
| 4 | 16.0 | 304.0 |
| 5 | 17.0 | 302.0 |


In [39]:
df3

| Row | mpg | displacement |
|---|---|---|
| | Float64 | Float64 |
| 1 | 15.0 | 350.0 |
| 2 | 18.0 | 318.0 |
| 3 | 18.0 | 307.0 |
| 4 | 16.0 | 304.0 |
| 5 | 17.0 | 302.0 |
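
To pull the sorting calls together: `sort()` with no column argument orders rows by all columns from left to right, a vector of names sets the sort keys explicitly, and `rev=true` reverses the order. The sketch below shows these variants next to the in-place `sort!()`, on an illustrative three-row table named `t`.

```julia
using DataFrames

t = DataFrame(mpg = [18.0, 15.0, 18.0], displacement = [307.0, 350.0, 318.0])

asc  = sort(t, [:mpg, :displacement])     # new table; ties on mpg broken by displacement
desc = sort(t, :displacement, rev=true)   # new table; largest displacement first

t.mpg            # [18.0, 15.0, 18.0]: sort() left `t` unchanged
sort!(t, :mpg)   # reorders `t` itself
t.mpg            # [15.0, 18.0, 18.0]
```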
