## Getting started with Julia, Jupyter and DataFrames

This project was completed in the [Julia language](https://julialang.org) using [Jupyter notebooks](https://docs.jupyter.org/en/latest/start/index.html).

Julia is a relatively new programming language (first released publicly in 2012) that was designed for scientific computing and uses modern language features. It has also become a widely used language for data analysis and data science. There are many packages available for Julia, and one that we will use throughout is the [DataFrames](https://dataframes.juliadata.org/stable/) package, which is fast and flexible.

This document is a getting-started guide to these tools, with just enough detail to run and understand the examples. We start with the basics of a DataFrame, after loading the package with:

In [1]:
using DataFrames

Note: to evaluate (run) a cell, press SHIFT-ENTER. If you get an error, you may need to install the package first; the error message usually explains how to do that (typically by running `using Pkg; Pkg.add("DataFrames")`).

Although we will typically get data from an external file, we start with a DataFrame that is created by entering the data directly. This happens to be the final standings of the NBA in the 1959-60 season.

In [2]:
nba1960 = DataFrame(team=["Boston Celtics", "Cincinnati Royals", "Detroit Pistons", "Minneapolis Lakers",
 "New York Knicks", "Philadelphia Warriors", "St. Louis Hawks", "Syracuse Nationals"],
W = [59, 19, 29, 26, 27, 49, 46, 45],
L = [16, 56, 46, 49, 48, 26, 29, 30])

Row,team,W,L
,String,Int64,Int64
1,Boston Celtics,59,16
2,Cincinnati Royals,19,56
3,Detroit Pistons,29,46
4,Minneapolis Lakers,26,49
5,New York Knicks,27,48
6,Philadelphia Warriors,49,26
7,St. Louis Hawks,46,29
8,Syracuse Nationals,45,30


These are the final standings of the 1959-60 NBA season. (Later we will show how to assemble this table from files containing all of the games.)

This DataFrame has 3 columns, and we create it by passing each column as a Vector (a 1D array) written with `[]`. Note that you will get an error if the vectors are not all the same length.

Also, each of the three columns has a type: the first is a `String` and the other two are `Int64`, which is short for a 64-bit integer (the standard integer type on most current computers). The DataFrames package determines the types automatically, and this is important for speed.
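As a small illustration in plain Julia (no DataFrame involved), the element type of a vector is inferred from the values entered. The variable names below are made up for this sketch, and `Int` is `Int64` only on a 64-bit machine:

```julia
# Element types are inferred from the literal values in each vector.
wins = [59, 19, 29]
teams = ["Boston Celtics", "Cincinnati Royals", "Detroit Pistons"]

println(eltype(wins))   # the integer type of this machine (Int64 on 64-bit systems)
println(eltype(teams))  # String
```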

## Chaining operations on DataFrames

As we will see, working with DataFrames often involves multiple steps that are chained together. The steps include subsetting by columns or rows, transforming a DataFrame to make new columns, and sorting.

First, we will do just a single operation, but still use the `Chain.jl` package, which will help as the operations get more complex. To use the `Chain` package, load it with:

In [3]:
using Chain

Again, if the package is not downloaded and available, you will be told how to obtain it.

The following will take the DataFrame above and sort it by wins:

In [4]:
@chain nba1960 begin
  sort(:W)
end

Row,team,W,L
,String,Int64,Int64
1,Cincinnati Royals,19,56
2,Minneapolis Lakers,26,49
3,New York Knicks,27,48
4,Detroit Pistons,29,46
5,Syracuse Nationals,45,30
6,St. Louis Hawks,46,29
7,Philadelphia Warriors,49,26
8,Boston Celtics,59,16


A note about the syntax. The general way to set up a sequence of steps on a DataFrame `df` is
```julia
@chain df begin

end
```

The operations then go between the `begin` and `end`. The `sort` function is just one example of such an operation; we will see many others here.
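To see what `@chain` is doing, here is a minimal sketch on a plain vector rather than a DataFrame. It assumes the `Chain` package is installed, and the vector `v` is made up for illustration. Each step receives the result of the previous step as its first argument:

```julia
using Chain

v = [29, 59, 19, 46]

# Each line is called with the previous result inserted as the first argument.
top2 = @chain v begin
  sort(rev=true)   # same as sort(v, rev=true)
  first(2)         # same as first(<sorted vector>, 2)
end

# The equivalent nested call without @chain:
nested = first(sort(v, rev=true), 2)
println(top2 == nested)
```

The nested form has to be read from the inside out, which is why the chained form tends to be easier to follow once there are several steps.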

The following will sort the dataframe from high to low:

In [5]:
@chain nba1960 begin
  sort(:W, rev=true)
end

Row,team,W,L
,String,Int64,Int64
1,Boston Celtics,59,16
2,Philadelphia Warriors,49,26
3,St. Louis Hawks,46,29
4,Syracuse Nationals,45,30
5,Detroit Pistons,29,46
6,New York Knicks,27,48
7,Minneapolis Lakers,26,49
8,Cincinnati Royals,19,56


This shows the standings in the expected way, with the leading teams at the top.

Before moving on, it should be pointed out that `@chain` is overkill for this and for some of the later examples. Instead, we could write

In [6]:
sort(nba1960, :W, rev=true)

Row,team,W,L
,String,Int64,Int64
1,Boston Celtics,59,16
2,Philadelphia Warriors,49,26
3,St. Louis Hawks,46,29
4,Syracuse Nationals,45,30
5,Detroit Pistons,29,46
6,New York Knicks,27,48
7,Minneapolis Lakers,26,49
8,Cincinnati Royals,19,56


However, the examples here will use the `@chain` command (technically a Julia macro) to be consistent as we move to more extensive examples.

## Filtering a DataFrame by rows

Another important operation is to find all rows of a DataFrame that fit some condition. Let's start with the original `nba1960` DataFrame and find only those rows with more than 35 wins. We will use the `subset` function. (Note: there is also a `filter` function; however, its syntax is not the same as that of other DataFrame functions, so we will not use it.)

In [7]:
@chain nba1960 begin
  subset(:W => w-> w .> 35)
end

Row,team,W,L
,String,Int64,Int64
1,Boston Celtics,59,16
2,Philadelphia Warriors,49,26
3,St. Louis Hawks,46,29
4,Syracuse Nationals,45,30


A note about the syntax of `subset` and other operations. The `:W` refers to the `W` column, so we are filtering based on that column. The right side of the `=>` is a function that takes a vector and returns a vector of booleans. That is, the `w` variable is the entire `:W` column. The function `w -> w .> 35` is called an anonymous function because it doesn't have a name; this is common for relatively simple functions.

The function has input variable `w` (the name of this variable doesn't matter) and compares it to 35; the `.` in `.>` broadcasts the `>` operator over the vector. The operation `w .> 35` returns a vector of true/false values that is `true` wherever the element of `w` is greater than 35. Only the rows where the result is `true` are kept in the resulting DataFrame.
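The broadcast comparison can be tried on its own, outside of `subset`. This sketch uses a plain vector holding the same win totals as the `W` column; the names `W` and `keep` are just for illustration:

```julia
W = [59, 19, 29, 26, 27, 49, 46, 45]

keep = W .> 35      # element-by-element comparison, one Bool per row
println(keep)
println(W[keep])    # the win totals of the rows that would be kept
```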

Another example is to take all rows of the DataFrame in which the value in the `L` column ends in a 6.

In [8]:
@chain nba1960 begin
  subset(:L => l -> mod.(l,10) .== 6)
end

Row,team,W,L
,String,Int64,Int64
1,Boston Celtics,59,16
2,Cincinnati Royals,19,56
3,Detroit Pistons,29,46
4,Philadelphia Warriors,49,26


Notice above that both the `mod` function and the `==` operator need to be broadcast over vectors, so a `.` is added. For a named function like `mod`, the `.` goes after the function name (`mod.`), while for an operator like `==` the `.` goes in front (`.==`).
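The following sketch, on a plain vector with the same values as the `L` column, shows each broadcast step separately and what happens when the dots are left off (the variable names are only for illustration):

```julia
L = [16, 56, 46, 49, 48, 26, 29, 30]

last_digit = mod.(L, 10)        # remainder of each element after dividing by 10
ends_in_6 = last_digit .== 6    # one Bool per element
println(ends_in_6)

# Without the dot, == compares the whole vector to the number 6: a single false.
println(last_digit == 6)

# Without the dot, mod has no method for a vector and throws an error.
mod_failed = try
  mod(L, 10)
  false
catch
  true
end
println(mod_failed)
```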

### Taking a subset of columns

If you only want a few columns (or drop some columns) from a DataFrame, the `select` function does this.  Let's take the team name and wins with

In [9]:
@chain nba1960 begin
  select(:team,:W)
end

Row,team,W
,String,Int64
1,Boston Celtics,59
2,Cincinnati Royals,19
3,Detroit Pistons,29
4,Minneapolis Lakers,26
5,New York Knicks,27
6,Philadelphia Warriors,49
7,St. Louis Hawks,46
8,Syracuse Nationals,45


The same result can also be obtained by dropping the `L` column with

In [10]:
@chain nba1960 begin
  select(Not(:L))
end

Row,team,W
,String,Int64
1,Boston Celtics,59
2,Cincinnati Royals,19
3,Detroit Pistons,29
4,Minneapolis Lakers,26
5,New York Knicks,27
6,Philadelphia Warriors,49
7,St. Louis Hawks,46
8,Syracuse Nationals,45


There is also a `Between` selector, which selects all columns between two given columns, and columns can also be matched with a regular expression.
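A minimal sketch of both selectors, assuming the `DataFrames` package is installed. The two-row DataFrame `df` and the regular expression are made up for illustration:

```julia
using DataFrames

df = DataFrame(team=["Boston Celtics", "Cincinnati Royals"], W=[59, 19], L=[16, 56])

by_range = select(df, Between(:team, :W))   # columns from :team through :W
by_regex = select(df, r"^[WL]$")            # columns whose name is exactly W or L

println(names(by_range))
println(names(by_regex))
```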

### Transforming a DataFrame: creating new columns

Another important operation on DataFrames is creating new columns from existing ones. There are two ways to do this: 1) using the `select` function, which keeps only the created columns and drops the others, and 2) using the `transform` function, which adds the new column(s) to the existing ones.

Both functions share the same syntax, in which you pass the following:
```
:col => FUNCTION on the column => :new_col
```

We start with an example that takes the number of wins of each team and multiplies it by 2 (this is purely illustrative and has little meaning for games or schedules). First we will use the `select` function to do this.

In [11]:
@chain nba1960 begin
  select(:W => (w -> 2*w) => :twice_W)
end

Row,twice_W
,Int64
1,118
2,38
3,58
4,52
5,54
6,98
7,92
8,90


Note a few things.

1) Since `select` was used, the only column is the new one, called `twice_W`. We'll see below what happens if we use `transform` instead.
2) The function between the thick arrows `=>` needs to be surrounded by (). Without them, an error will occur. This is because of the order of operations.
3) The function is an anonymous function (defined with a skinny arrow `->`), just like in the `subset` example above.
4) The `*` in this example does not need to be broadcast with a `.`, because multiplying a vector by a constant is defined and returns a vector.
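The order-of-operations point about the parentheses can be checked in plain Julia, without DataFrames. In this sketch (the names `p` and `q` are only for illustration), leaving out the parentheses makes everything to the right of `->` part of the function body, so the intended `source => function => target` form is never built:

```julia
# Without parentheses, `2*w => :twice_W` becomes the body of the function.
p = :W => w -> 2*w => :twice_W
println(p.second isa Function)   # true: the right side is one function
println(p.second([1, 2]))        # calling it returns a Pair, not a vector

# With parentheses, the function and the new column name are separate pieces.
q = :W => (w -> 2*w) => :twice_W
println(q.second.first([1, 2]))  # the doubling function applied to a vector
println(q.second.second)         # the target column name
```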

If we use `transform` instead:

In [12]:
@chain nba1960 begin
  transform(:W => (w -> 2*w) => :twice_W)
end

Row,team,W,L,twice_W
,String,Int64,Int64,Int64
1,Boston Celtics,59,16,118
2,Cincinnati Royals,19,56,38
3,Detroit Pistons,29,46,58
4,Minneapolis Lakers,26,49,52
5,New York Knicks,27,48,54
6,Philadelphia Warriors,49,26,98
7,St. Louis Hawks,46,29,92
8,Syracuse Nationals,45,30,90


Notice that the difference is that the new column is added to the existing columns of the DataFrame. The choice between `transform` and `select` just depends on what you want the resulting DataFrame to contain.

For the next example, we'll add the wins and losses of each row with:

In [13]:
@chain nba1960 begin
  transform([:W,:L] => ((w,l) -> w+l) => :num_games)
end

Row,team,W,L,num_games
,String,Int64,Int64,Int64
1,Boston Celtics,59,16,75
2,Cincinnati Royals,19,56,75
3,Detroit Pistons,29,46,75
4,Minneapolis Lakers,26,49,75
5,New York Knicks,27,48,75
6,Philadelphia Warriors,49,26,75
7,St. Louis Hawks,46,29,75
8,Syracuse Nationals,45,30,75


The big difference in this example is that it takes two columns to perform the calculation. The two columns are passed as a vector (array) using [ ], and the anonymous function is now a function of two arguments, for which we use the variables `w` and `l`; note that these need to be surrounded by ().
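The two-argument anonymous function can be tried directly on two plain vectors holding the same values as the `W` and `L` columns. The name `add_cols` is made up for this sketch:

```julia
W = [59, 19, 29, 26, 27, 49, 46, 45]
L = [16, 56, 46, 49, 48, 26, 29, 30]

add_cols = (w, l) -> w + l     # receives two whole columns as its arguments
num_games = add_cols(W, L)     # vector + vector adds element by element
println(num_games)
```

As with `2*w` earlier, no `.` is needed here because adding two vectors of the same length is already defined.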

The last example here pulls the nickname from a team name. Below we will use the `split` function within the `transform` function, but before that notice that

In [14]:
split("Boston Celtics"," ")

2-element Vector{SubString{String}}:
 "Boston"
 "Celtics"

creates an array of strings (technically substrings) by splitting on a space. We can pull out the last element, which is the nickname, with:

In [15]:
last(split("Boston Celtics"," "))

"Celtics"

The following then generates a column with the nickname:

In [16]:
@chain nba1960 begin
  transform(:team => (t->string.(last.(split.(t," ")))) => :nickname)
end

Row,team,W,L,nickname
,String,Int64,Int64,String
1,Boston Celtics,59,16,Celtics
2,Cincinnati Royals,19,56,Royals
3,Detroit Pistons,29,46,Pistons
4,Minneapolis Lakers,26,49,Lakers
5,New York Knicks,27,48,Knicks
6,Philadelphia Warriors,49,26,Warriors
7,St. Louis Hawks,46,29,Hawks
8,Syracuse Nationals,45,30,Nationals


Notice that because of broadcasting (applying a function to each element of a vector), each of the functions `string`, `last` and `split` needs to have a `.` appended.
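The three broadcast steps can be taken apart on a plain vector of team names. This is only a sketch with made-up variable names, showing what each dotted call produces:

```julia
teams = ["Boston Celtics", "New York Knicks", "St. Louis Hawks"]

pieces = split.(teams, " ")     # one vector of substrings per team
nick_sub = last.(pieces)        # last piece of each, still SubString{String}
nicknames = string.(nick_sub)   # converted to ordinary String values

println(nicknames)
println(eltype(nick_sub), " -> ", eltype(nicknames))
```

The final `string.` call is what makes the new column show up with type `String` rather than a substring type.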

### Multiple Operations on a DataFrame

Next, we can create the same DataFrame using multiple steps and this is the real power of the `@chain` macro. 

First, we will make a new column that is just the team name split by spaces:

In [17]:
@chain nba1960 begin
  transform(:team => (t->split.(t, " ")) => :tmp_col)
end

Row,team,W,L,tmp_col
,String,Int64,Int64,Array…
1,Boston Celtics,59,16,"SubString{String}[""Boston"", ""Celtics""]"
2,Cincinnati Royals,19,56,"SubString{String}[""Cincinnati"", ""Royals""]"
3,Detroit Pistons,29,46,"SubString{String}[""Detroit"", ""Pistons""]"
4,Minneapolis Lakers,26,49,"SubString{String}[""Minneapolis"", ""Lakers""]"
5,New York Knicks,27,48,"SubString{String}[""New"", ""York"", ""Knicks""]"
6,Philadelphia Warriors,49,26,"SubString{String}[""Philadelphia"", ""Warriors""]"
7,St. Louis Hawks,46,29,"SubString{String}[""St."", ""Louis"", ""Hawks""]"
8,Syracuse Nationals,45,30,"SubString{String}[""Syracuse"", ""Nationals""]"


Notice that each entry of the new column is a vector. We can then do the next step by pulling just the last element of each vector:

In [18]:
@chain nba1960 begin
  transform(:team => (t->split.(t, " ")) => :tmp_col)
  transform(:tmp_col => (v->string.(last.(v))) => :nickname)
end

Row,team,W,L,tmp_col,nickname
,String,Int64,Int64,Array…,String
1,Boston Celtics,59,16,"SubString{String}[""Boston"", ""Celtics""]",Celtics
2,Cincinnati Royals,19,56,"SubString{String}[""Cincinnati"", ""Royals""]",Royals
3,Detroit Pistons,29,46,"SubString{String}[""Detroit"", ""Pistons""]",Pistons
4,Minneapolis Lakers,26,49,"SubString{String}[""Minneapolis"", ""Lakers""]",Lakers
5,New York Knicks,27,48,"SubString{String}[""New"", ""York"", ""Knicks""]",Knicks
6,Philadelphia Warriors,49,26,"SubString{String}[""Philadelphia"", ""Warriors""]",Warriors
7,St. Louis Hawks,46,29,"SubString{String}[""St."", ""Louis"", ""Hawks""]",Hawks
8,Syracuse Nationals,45,30,"SubString{String}[""Syracuse"", ""Nationals""]",Nationals


We could then drop the `tmp_col` column with the `select` function using:

In [19]:
@chain nba1960 begin
  transform(:team => (t->split.(t, " ")) => :tmp_col)
  transform(:tmp_col => (v->string.(last.(v))) => :nickname)
  select(Not(:tmp_col))
end

Row,team,W,L,nickname
,String,Int64,Int64,String
1,Boston Celtics,59,16,Celtics
2,Cincinnati Royals,19,56,Royals
3,Detroit Pistons,29,46,Pistons
4,Minneapolis Lakers,26,49,Lakers
5,New York Knicks,27,48,Knicks
6,Philadelphia Warriors,49,26,Warriors
7,St. Louis Hawks,46,29,Hawks
8,Syracuse Nationals,45,30,Nationals


This can be simplified further using `ByRow` and `AsTable`. We will still split the `team` field, but now create a named tuple:

In [34]:
@chain nba1960 begin
  transform(:team =>ByRow(t->(city = join(split(t," ")[1:end-1]," "), nickname = last(split(t," ")))))
end

Row,team,W,L,team_function
,String,Int64,Int64,NamedTup…
1,Boston Celtics,59,16,"(city = ""Boston"", nickname = ""Celtics"")"
2,Cincinnati Royals,19,56,"(city = ""Cincinnati"", nickname = ""Royals"")"
3,Detroit Pistons,29,46,"(city = ""Detroit"", nickname = ""Pistons"")"
4,Minneapolis Lakers,26,49,"(city = ""Minneapolis"", nickname = ""Lakers"")"
5,New York Knicks,27,48,"(city = ""New York"", nickname = ""Knicks"")"
6,Philadelphia Warriors,49,26,"(city = ""Philadelphia"", nickname = ""Warriors"")"
7,St. Louis Hawks,46,29,"(city = ""St. Louis"", nickname = ""Hawks"")"
8,Syracuse Nationals,45,30,"(city = ""Syracuse"", nickname = ""Nationals"")"


Note that if we want the full city (like for "St. Louis"), we need to pull all but the last element of the split-up team string and join them back together with spaces. This is what `join(split(t," ")[1:end-1]," ")` does.
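The function passed to `ByRow` can be tested on single strings first. The sketch below wraps the same logic in a named helper, `city_and_nickname`, which is a hypothetical name used only here:

```julia
# Hypothetical helper that mirrors the anonymous function used with ByRow.
function city_and_nickname(t)
  parts = split(t, " ")
  # Everything except the last piece is the city; the last piece is the nickname.
  return (city = join(parts[1:end-1], " "), nickname = last(parts))
end

println(city_and_nickname("St. Louis Hawks"))
println(city_and_nickname("Boston Celtics"))
```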

`ByRow` applies the function to the column one element (one row) at a time, because the function as written works on a single string rather than on a whole vector. Next, `AsTable` takes the named tuple and creates two new columns with the given names. We'll then rearrange the columns with `select`:

In [36]:
@chain nba1960 begin
  transform(:team =>ByRow(t->(city = join(split(t," ")[1:end-1]," "), nickname = last(split(t," ")))) => AsTable)
  select(:city, :nickname, :W, :L)
end

Row,city,nickname,W,L
,String,SubStrin…,Int64,Int64
1,Boston,Celtics,59,16
2,Cincinnati,Royals,19,56
3,Detroit,Pistons,29,46
4,Minneapolis,Lakers,26,49
5,New York,Knicks,27,48
6,Philadelphia,Warriors,49,26
7,St. Louis,Hawks,46,29
8,Syracuse,Nationals,45,30


Often, when working with a DataFrame, you first need to figure out what the resulting DataFrame should look like. It then often takes multiple steps to get to the result, but `@chain` allows you to see step by step how to get there.

A tip for troubleshooting and understanding: the `#` character is the comment character in Julia. Put it at the beginning of a line inside the `@chain` block to disable that step; by commenting steps in and out, you can see what each line does.
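As a minimal sketch of this tip (assuming the `Chain` package is installed, and using a plain vector of the win totals instead of a DataFrame), commenting out the last step shows the intermediate result:

```julia
using Chain

W = [59, 19, 29, 26, 27, 49, 46, 45]

partial = @chain W begin
  sort(rev=true)
  # first(3)        # commented out, so this step is skipped
end

println(partial)    # the full sorted vector, because the last step did not run
```

Removing the `#` runs the step again, so the chain can be built up and inspected one line at a time.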

### Summarizing a DataFrame

The last key idea that we will present here is summarizing a DataFrame. This generally takes a DataFrame and finds minimums, maximums, means, etc.

The first way to do this is with the `describe` function, which automatically finds the mean, min, median and max of each column. The result is another DataFrame with each summarized column on a row. If we do this on the original `nba1960` DataFrame we get:

In [37]:
describe(nba1960)

Row,variable,mean,min,median,max,nmissing,eltype
,Symbol,Union…,Any,Union…,Any,Int64,DataType
1,team,,Boston Celtics,,Syracuse Nationals,0,String
2,W,37.5,19,37.0,59,0,Int64
3,L,37.5,16,38.0,56,0,Int64


Notice that the `team` column is a string, so numerical values like the mean and median don't make sense, but `min` and `max` are given in lexicographic order.
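The lexicographic ordering can be seen in plain Julia with `minimum` and `maximum` on a vector of the team names. This is a sketch of the string comparison only, not of how `describe` is implemented:

```julia
teams = ["Boston Celtics", "Cincinnati Royals", "Detroit Pistons", "Minneapolis Lakers",
         "New York Knicks", "Philadelphia Warriors", "St. Louis Hawks", "Syracuse Nationals"]

println(minimum(teams))   # first in lexicographic order
println(maximum(teams))   # last in lexicographic order

# Strings are compared character by character: 't' sorts before 'y'.
println("St. Louis Hawks" < "Syracuse Nationals")
```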

We can also add `describe` to the end of a `@chain` like: 

In [38]:
@chain nba1960 begin
  transform(:team =>ByRow(t->(city = join(split(t," ")[1:end-1]," "), nickname = last(split(t," ")))) => AsTable)
  select(:city, :nickname, :W, :L)
  describe()
end

Row,variable,mean,min,median,max,nmissing,eltype
,Symbol,Union…,Any,Union…,Any,Int64,DataType
1,city,,Boston,,Syracuse,0,String
2,nickname,,Celtics,,Warriors,0,SubString{String}
3,W,37.5,19,37.0,59,0,Int64
4,L,37.5,16,38.0,56,0,Int64


We can also summarize in a more custom way with the `combine` function. We can summarize the original `nba1960` DataFrame by including the first and third quartiles using the `quantile` function, which is defined in the `Statistics` package that we load with:

In [25]:
using Statistics

The following summarizes the wins and losses by finding their quartiles and medians.

In [40]:
@chain nba1960 begin
  combine(
    :W => (w->quantile(w,0.25)) => :W_Q1, 
    :W => (w->quantile(w,0.50)) => :W_median,
    :W => (w->quantile(w,0.75)) => :W_Q3,
    :L => (l->quantile(l,0.25)) => :L_Q1, 
    :L => (l->quantile(l,0.50)) => :L_median,
    :L => (l->quantile(l,0.75)) => :L_Q3 
  )
end

Row,W_Q1,W_median,W_Q3,L_Q1,L_median,L_Q3
,Float64,Float64,Float64,Float64,Float64,Float64
1,26.75,37.0,46.75,28.25,38.0,48.25


In terms of syntax, there is just one `combine` call, since we want all of the summarized variables in the same resulting DataFrame. Also, each of the summarized variables is put on its own line for readability; they could all be on a single line.
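The `quantile` calls can be checked on their own with a plain vector of the win totals. This sketch also uses the form of `quantile` that takes a vector of probabilities, which returns all three values at once; the variable names are only for illustration:

```julia
using Statistics

W = [59, 19, 29, 26, 27, 49, 46, 45]

# One call with three probabilities returns the three quantiles in order.
q1, med, q3 = quantile(W, [0.25, 0.50, 0.75])
println((q1, med, q3))

# The 0.50 quantile agrees with the median function.
println(med == median(W))
```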

We could also add the `combine` to the end of a longer `@chain` block, such as

In [27]:
@chain nba1960 begin
  transform(:team =>ByRow(t->(city = join(split(t," ")[1:end-1]," "), nickname = last(split(t," ")))) => AsTable)
  select(:city, :nickname, :W, :L)
  combine(
    :W => (w->quantile(w,0.25)) => :W_Q1, 
    :W => (w->quantile(w,0.50)) => :W_median,
    :W => (w->quantile(w,0.75)) => :W_Q3,
    :L => (l->quantile(l,0.25)) => :L_Q1, 
    :L => (l->quantile(l,0.50)) => :L_median,
    :L => (l->quantile(l,0.75)) => :L_Q3,
    :city => (c->mean(length.(c))) => :city_length_mean,
    nrow
  )
end

Row,W_Q1,W_median,W_Q3,L_Q1,L_median,L_Q3,city_length_mean,nrow
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Int64
1,26.75,37.0,46.75,28.25,38.0,48.25,8.875,8


where the mean length of the city names is also given. One needs to be careful with this. We need to broadcast the `length` function to get the number of characters in each city name, and we then take the mean of this vector. If you drop the `.` from `length`, you will get 8, simply because there are 8 rows in this data set.

Also, we have added `nrow` to the `combine`, which just gives the number of rows in the data set.
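The difference between `length.` and `length` is easy to see on a plain vector of the city names (a sketch with a made-up variable name, using `mean` from the `Statistics` package):

```julia
using Statistics

cities = ["Boston", "Cincinnati", "Detroit", "Minneapolis",
          "New York", "Philadelphia", "St. Louis", "Syracuse"]

println(length.(cities))         # number of characters in each city name
println(mean(length.(cities)))   # mean of those character counts

println(length(cities))          # number of elements in the vector
println(mean(length(cities)))    # mean of that single number
```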

### More about DataFrames

This just touches the tip of the iceberg of the `DataFrames` package. You should take a look at the [standard documentation](https://dataframes.juliadata.org/stable/), and there are many good YouTube videos, especially those from Bogumił Kamiński, one of the main contributors to `DataFrames.jl`. A recent deep dive was [given at JuliaCon 2022](https://www.youtube.com/watch?v=SXF4BawX-hs).