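# Running Chapter 7 of the "Julia Programming Cookbook" on Julia 1.5
- Julia packages change rapidly, and recipes written for Julia 1.2 already fail on Julia 1.5.
  - DataFrames.jl in particular has changed substantially, so this note records a corrected version of Chapter 7, which teaches how to work with it.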

## 7.4. Working with categorical data
- The `levels!` function is not available from DataFrames.jl alone; CategoricalArrays.jl is required.

List of grades

In [1]:
using DataFrames
grade_levels = ["F"; [x*y for x in 'D':-1:'A' for y in ["-", "", "+"]]]

13-element Array{String,1}:
 "F"
 "D-"
 "D"
 "D+"
 "C-"
 "C"
 "C+"
 "B-"
 "B"
 "B+"
 "A-"
 "A"
 "A+"

In [2]:
using Random
Random.seed!(1);

In [3]:
grades = categorical(rand(grade_levels, 100), ordered=true);
levels!(grades, grade_levels);
df = DataFrame(id=eachindex(grades), grades = grades);

LoadError: UndefVarError: levels! not defined

To use `levels!`, load CategoricalArrays.jl.

In [4]:
using CategoricalArrays
grades = categorical(rand(grade_levels, 100), ordered=true);
levels!(grades, grade_levels);
df = DataFrame(id=eachindex(grades), grades = grades);

Check the resulting ordering

In [5]:
isordered(grades)

true

In [6]:
levels(grades)

13-element Array{String,1}:
 "F"
 "D-"
 "D"
 "D+"
 "C-"
 "C"
 "C+"
 "B-"
 "B"
 "B+"
 "A-"
 "A"
 "A+"

In [7]:
describe(df, :eltype)

|   | variable | eltype |
|---|----------|--------|
|   | Symbol | DataType |
| 1 | id | Int64 |
| 2 | grades | CategoricalValue{String,UInt32} |


Keep only the students whose grade is A or higher

In [8]:
filter(x -> x.grades > "A-", df)

|    | id | grades |
|----|----|--------|
|    | Int64 | Cat… |
| 1  | 4  | A+ |
| 2  | 10 | A+ |
| 3  | 11 | A |
| 4  | 18 | A+ |
| 5  | 26 | A |
| 6  | 41 | A+ |
| 7  | 63 | A+ |
| 8  | 67 | A |
| 9  | 92 | A |
| 10 | 96 | A+ |
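
The comparison `x.grades > "A-"` only works because the array is ordered and `levels!` fixed the order of its levels. A minimal sketch of that behavior on a made-up three-level vector (`sizes` is hypothetical and not part of the book's recipe):

```julia
using CategoricalArrays

# Hypothetical data: without levels!, the levels would be sorted alphabetically (L, M, S)
sizes = categorical(["M", "S", "L", "S"], ordered=true)

# Set the order explicitly: S < M < L
levels!(sizes, ["S", "M", "L"])

levels(sizes)           # ["S", "M", "L"]
sizes[1] > sizes[2]     # true: "M" ranks above "S" in the custom order
bigger = sizes .> "S"   # compare each element against a level given as a plain string
```

The string on the right-hand side must be one of the declared levels, which is why the filter above compares against `"A-"` rather than an arbitrary string.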


## 7.5. Handling missing values

In [9]:
download("https://openmv.net/file/class-grades.csv", "grades.csv")

"grades.csv"

In [10]:
using CSV, DataFrames, Statistics

In [11]:
df = CSV.read("grades.csv");

LoadError: ArgumentError: provide a valid sink argument, like `using DataFrames; CSV.read(source, DataFrame)`

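Calling `CSV.read()` without a sink argument is no longer supported. Besides the `CSV.read(source, DataFrame)` form suggested by the error message, the call can be rewritten in either of the following ways:
- `df = DataFrame!(CSV.File())`
- `df = CSV.File() |> DataFrame`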
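
As a rough illustration of these forms, the sketch below parses a small in-memory CSV instead of `grades.csv`. The text in `csv_text` and the column names `a` and `b` are made up for the example; the empty field is expected to be read as `missing` under CSV.jl's default settings.

```julia
using CSV, DataFrames

# Hypothetical two-row CSV; the second value of column b is left empty
csv_text = """
a,b
1,2.5
2,
"""

# Pipe form, as in the list above
df_pipe = CSV.File(IOBuffer(csv_text)) |> DataFrame

# Sink-argument form, as suggested by the error message
df_sink = CSV.read(IOBuffer(csv_text), DataFrame)
```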

In [12]:
df = DataFrame!(CSV.File("grades.csv"));

└ @ CSV /Users/yusuketamai/.julia/packages/CSV/YUbbG/src/file.jl:604
└ @ CSV /Users/yusuketamai/.julia/packages/CSV/YUbbG/src/file.jl:604
└ @ CSV /Users/yusuketamai/.julia/packages/CSV/YUbbG/src/file.jl:604


- The validation results are already shown at this point, as warnings emitted while reading the file.

In [13]:
summary(df)

"99×6 DataFrame"

In [14]:
describe(df, :min, :max, :nmissing)

|   | variable | min | max | nmissing |
|---|----------|-----|-----|----------|
|   | Symbol | Real | Real | Int64 |
| 1 | Prefix | 4.0 | 8.0 | 0 |
| 2 | Assignment | 28.14 | 100.83 | 0 |
| 3 | Tutorial | 34.09 | 112.58 | 0 |
| 4 | Midterm | 28.12 | 110.0 | 0 |
| 5 | TakeHome | 16.91 | 108.89 | 1 |
| 6 | Final | 28.06 | 108.89 | 3 |


In [15]:
CSV.validate("grades.csv")

LoadError: UndefVarError: validate not defined

`CSV.validate()` has also been removed.

In [16]:
[cor(df[!, i], df[!, j]) for i in axes(df, 2), j in axes(df, 2)]

6×6 Array{Union{Missing, Float64},2}:
  1.0        0.0224759  0.431078  -0.0625435  missing  missing
  0.0224759  1.0        0.440115   0.215868   missing  missing
  0.431078   0.440115   1.0        0.135597   missing  missing
 -0.0625435  0.215868   0.135597   1.0        missing  missing
   missing    missing    missing    missing   missing  missing
   missing    missing    missing    missing   missing  missing

In [17]:
df2 = dropmissing(df);

In [18]:
describe(df2, :nmissing)

|   | variable | nmissing |
|---|----------|----------|
|   | Symbol | Int64 |
| 1 | Prefix | 0 |
| 2 | Assignment | 0 |
| 3 | Tutorial | 0 |
| 4 | Midterm | 0 |
| 5 | TakeHome | 0 |
| 6 | Final | 0 |


In [19]:
[cor(df2[!, i], df2[!, j]) for i in axes(df2, 2), j in axes(df2, 2)]

6×6 Array{Float64,2}:
  1.0        0.0484327  0.434525  -0.0586403  -0.0689997  0.0881758
  0.0484327  1.0        0.459001   0.200715    0.483206   0.286304
  0.434525   0.459001   1.0        0.148637    0.238167   0.23987
 -0.0586403  0.200715   0.148637   1.0         0.42719    0.724478
 -0.0689997  0.483206   0.238167   0.42719     1.0        0.474231
  0.0881758  0.286304   0.23987    0.724478    0.474231   1.0

In [20]:
function cor2(x, y)
    df = dropmissing(DataFrame([x, y]))
    cor(df[!, 1], df[!, 2])
end

cor2 (generic function with 1 method)

In [21]:
[cor2(df[!, i], df[!, j]) for i in axes(df, 2), j in axes(df, 2)]

6×6 Array{Float64,2}:
  1.0        0.0224759  0.431078  -0.0625435  -0.0916684  0.0902548
  0.0224759  1.0        0.440115   0.215868    0.492297   0.291232
  0.431078   0.440115   1.0        0.135597    0.209513   0.240551
 -0.0625435  0.215868   0.135597   1.0         0.442408   0.725121
 -0.0916684  0.492297   0.209513   0.442408    1.0        0.474231
  0.0902548  0.291232   0.240551   0.725121    0.474231   1.0
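
`cor2` performs pairwise deletion: for each pair of columns it drops only the rows where either value is missing, so more data is kept than with `dropmissing(df)` on the whole table. The same idea can be sketched without building a temporary data frame. The name `cor_pairwise` and the sample vectors are hypothetical, and the helper assumes numeric columns.

```julia
using Statistics

# Correlation over the rows where both x and y are observed
function cor_pairwise(x, y)
    keep = .!ismissing.(x) .& .!ismissing.(y)   # true where neither value is missing
    # Convert to Float64 so that the element type no longer allows missing
    cor(Float64.(x[keep]), Float64.(y[keep]))
end

# Hypothetical data: rows 2 and 3 each have one missing value
x = [1.0, 2.0, missing, 4.0, 5.0]
y = [2.0, missing, 6.0, 8.0, 11.0]

r = cor_pairwise(x, y)   # uses rows 1, 4 and 5 only
```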

## 7.6. Split-apply-combine with data frames

In [23]:
df = CSV.File("06. データフレームを使って分割‐適用‐結合を行う/iris.csv", footerskip=1, 
              header=["PetalLength", "PetalWidth", "SepalLength", "SepalWidth", "Class"]) |> DataFrame;

In [24]:
describe(df)

|   | variable | mean | min | median | max | nmissing | eltype |
|---|----------|------|-----|--------|-----|----------|--------|
|   | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
| 1 | PetalLength | 5.84333 | 4.3 | 5.8 | 7.9 | 0 | Float64 |
| 2 | PetalWidth | 3.054 | 2.0 | 3.0 | 4.4 | 0 | Float64 |
| 3 | SepalLength | 3.75867 | 1.0 | 4.35 | 6.9 | 0 | Float64 |
| 4 | SepalWidth | 1.19867 | 0.1 | 1.3 | 2.5 | 0 | Float64 |
| 5 | Class |  | Iris-setosa |  | Iris-virginica | 0 | String |


In [25]:
using Statistics
by(df, :Class) do x
    DataFrame(n = nrow(x),
            mean = mean(x.SepalWidth),
            std = std(x.SepalWidth))
end

LoadError: ArgumentError: by function was removed from DataFrames.jl. Use the `combine(groupby(...), ...)` or `combine(f, groupby(...))` instead.

`by()` has also been removed. The usage differs slightly, but `groupby()` combined with `combine()` serves as a replacement.

In [26]:
gdf = groupby(df, :Class)
combine(gdf, :SepalWidth => mean)

|   | Class | SepalWidth_mean |
|---|-------|-----------------|
|   | String | Float64 |
| 1 | Iris-setosa | 0.244 |
| 2 | Iris-versicolor | 1.326 |
| 3 | Iris-virginica | 2.026 |


To reproduce the original output more faithfully, a slightly longer call is needed:

In [35]:
combine(gdf, nrow, :SepalWidth => mean => :mean, :SepalWidth => std => :std)

|   | Class | nrow | mean | std |
|---|-------|------|------|-----|
|   | String | Int64 | Float64 | Float64 |
| 1 | Iris-setosa | 50 | 0.244 | 0.10721 |
| 2 | Iris-versicolor | 50 | 1.326 | 0.197753 |
| 3 | Iris-virginica | 50 | 2.026 | 0.27465 |
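
The error message from `by()` also mentions the `combine(f, groupby(...))` form, which stays closer to the original `do` block. A minimal sketch on made-up data follows (the `toy` table and its columns `g` and `v` are hypothetical). Here the block returns a named tuple rather than a `DataFrame`, which should yield one row per group.

```julia
using DataFrames, Statistics

# Hypothetical toy data with two groups
toy = DataFrame(g = ["a", "a", "b", "b"], v = [1.0, 3.0, 10.0, 20.0])

# The do block receives each group as a sub-data-frame
res = combine(groupby(toy, :g)) do sdf
    (n = nrow(sdf), mean = mean(sdf.v), std = std(sdf.v))
end
```

The pair syntax used above (`:SepalWidth => mean => :mean`) is usually the shorter choice when each output column depends on a single input column; the function form is convenient when the summary needs several columns at once.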


## 7.7. Converting between long and wide format

In [36]:
df.id = axes(df, 1);

Convert to long format

In [37]:
sdf = stack(df);
first(sdf, 3)

|   | Class | id | variable | value |
|---|-------|----|----------|-------|
|   | String | Int64 | String | Float64 |
| 1 | Iris-setosa | 1 | PetalLength | 5.1 |
| 2 | Iris-setosa | 2 | PetalLength | 4.9 |
| 3 | Iris-setosa | 3 | PetalLength | 4.7 |


Convert back to wide format

In [38]:
udf = unstack(sdf, :variable, :value);

In [39]:
names(udf)

6-element Array{String,1}:
 "Class"
 "id"
 "PetalLength"
 "PetalWidth"
 "SepalLength"
 "SepalWidth"

Check that the data frame is identical before and after the round trip

In [41]:
permutecols!(udf, names(df));
df == udf

LoadError: UndefVarError: permutecols! not defined

- `permutecols!()` has been removed, but plain column indexing is enough to work around it.

In [42]:
df == udf[:, names(df)]

true
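
The same round trip can be sketched on a small made-up table (`wide` and its columns are hypothetical). Indexing with `names(wide)` restores the original column order before the comparison, which is what `permutecols!` was used for.

```julia
using DataFrames

# Hypothetical wide table: one id column and two measurement columns
wide = DataFrame(id = 1:2, x = [1.0, 2.0], y = [3.0, 4.0])

# Wide -> long: x and y become (variable, value) pairs, id is kept as the key
long = stack(wide, [:x, :y])

# Long -> wide: spread variable/value back into separate columns
back = unstack(long, :variable, :value)

# Reorder the columns by name, then compare
same = back[:, names(wide)] == wide
```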

The aggregation that follows works as in 7.6: use `groupby()` and `combine()` in place of `by()`.

## 7.8. Testing data frames for equality

As before, replace the `CSV.read()` call.

In [44]:
df1 = DataFrame!(CSV.File("grades.csv"));
first(df1, 3)

└ @ CSV /Users/yusuketamai/.julia/packages/CSV/YUbbG/src/file.jl:604
└ @ CSV /Users/yusuketamai/.julia/packages/CSV/YUbbG/src/file.jl:604
└ @ CSV /Users/yusuketamai/.julia/packages/CSV/YUbbG/src/file.jl:604


|   | Prefix | Assignment | Tutorial | Midterm | TakeHome | Final |
|---|--------|------------|----------|---------|----------|-------|
|   | Int64 | Float64 | Float64 | Float64 | Float64? | Float64? |
| 1 | 5 | 57.14 | 34.09 | 64.38 | 51.48 | 52.5 |
| 2 | 8 | 95.05 | 105.49 | 67.5 | 99.07 | 68.33 |
| 3 | 8 | 83.7 | 83.17 | 30.0 | 63.15 | 48.89 |


In [45]:
using Random
Random.seed!(1)
df2 = df1[shuffle(axes(df1, 1)), shuffle(axes(df1, 2))];
first(df2, 3)

|   | Tutorial | TakeHome | Prefix | Assignment | Final | Midterm |
|---|----------|----------|--------|------------|-------|---------|
|   | Float64 | Float64? | Int64 | Float64 | Float64? | Float64 |
| 1 | 70.24 | 52.41 | 6 | 95.05 | 47.78 | 52.5 |
| 2 | 58.51 | 53.7 | 6 | 28.14 | 68.33 | 72.5 |
| 3 | 65.7 | 103.52 | 7 | 74.29 | 55.0 | 78.75 |


In [46]:
res = join(df1, df2, kind=:outer,
            on=union(names(df1), names(df2)),
            indicator=:check, validate=(true, true))
unique(res.check)

LoadError: ArgumentError: join function for data frames is not supported. Use innerjoin, leftjoin, rightjoin, outerjoin, semijoin, antijoin, or crossjoin

The `join()` function has been removed, so `join(kind=:outer)` is replaced with `outerjoin()`.  
When the data contain missing values, `matchmissing` must also be changed from its default to `:equal`.

In [48]:
res = outerjoin(df1, df2, 
                on=union(names(df1), names(df2)),
                indicator=:check, validate=(true, true),
                matchmissing=:equal);
unique(res.check)

1-element Array{String,1}:
 "both"
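
The check works because an outer join on every column labels each row as found in the left table only, the right table only, or both; two tables hold the same rows exactly when every label is `"both"`. A minimal sketch on made-up tables, reusing the keyword names from the cell above (`a`, `b` and `c` are hypothetical, and the keywords are assumed to behave as in the DataFrames.jl version used in this note):

```julia
using DataFrames

# Hypothetical table with a missing key value
a = DataFrame(k = [1, 2, missing], v = ["x", "y", "z"])

# Same content with rows and columns reordered
b = a[[3, 1, 2], [:v, :k]]

# Same shape, but one value differs
c = copy(a)
c.v[1] = "changed"

# matchmissing=:equal lets the missing key in one table match the other
res_same = outerjoin(a, b, on=names(a), indicator=:check,
                     validate=(true, true), matchmissing=:equal)
res_diff = outerjoin(a, c, on=names(a), indicator=:check,
                     validate=(true, true), matchmissing=:equal)

unique(res_same.check)   # only "both" when the tables hold the same rows
unique(res_diff.check)   # also contains "left_only" and "right_only"
```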