## Chapter 19: Data Analysis

This chapter is an introduction to data analysis.  We will load existing datasets from a package, make some plots, and do some basic analysis of a dataset.

Let's load the following packages.

In [None]:
using RDatasets, StatsPlots,  StatsBase, Statistics, DataFrames

The `RDatasets` package provides a collection of datasets that are available in R, a popular statistics language.  The following lists all of the data packages:

In [None]:
RDatasets.packages()

Each package has a set of datasets within it.  Here are the datasets in the `datasets` package.

In [None]:
RDatasets.datasets("datasets")

This loads the `iris` dataset from the `datasets` package.  The result is a `DataFrame`, which we will examine in more detail later.  Unlike an array, each column of a `DataFrame` can have its own type, which makes it a better fit for general datasets.  The column headers give the name of each column as well as its datatype:

In [None]:
iris=RDatasets.dataset("datasets","iris")

#### 19.2: Accessing the DataFrame

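We can get a column by using brackets and the name of the column (preceded by a colon).  The `!` in the row position selects all rows and returns the column itself rather than a copy.  The result is a 1-D array whose element type is given by the column type.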

In [None]:
iris[!,:SepalWidth]

We can also select the column by number.

In [None]:
iris[!,2]

So for example, we can find the mean of this column using the built-in `mean` function:

In [None]:
mean(iris[!,:SepalWidth])

And the following will find the standard deviation:

In [None]:
std(iris[!,:SepalWidth])

If we only want part of a column, we can use a range to access the desired rows.

In [None]:
iris[11:20,:SepalWidth]

In [None]:
iris[1:2:end,:PetalWidth]

#### Common functions of DataFrames

The `size` function gives the size of the DataFrame, just as it does for an array.  This shows that there are 150 rows and 5 columns.

In [None]:
size(iris)

Here are the column names:

In [None]:
names(iris)

Here are the column types.  Note that the last one is a `CategoricalValue` (since the column takes only 3 different values).  This runs the built-in function `eltype` on each of the columns.  The `.` broadcasts the function over the columns.

In [None]:
eltype.(eachcol(iris))

Here are the first 5 rows of the DataFrame:

In [None]:
first(iris,5)

And the last 5 rows:

In [None]:
last(iris,5)

### Basic Statistics

The `describe` function gives basic information about each variable, which includes the mean, min, median, max, the number of missing values, and the type.  Note: we will see missing values later in the chapter.

In [None]:
describe(iris)

In [None]:
mean(iris[!,:SepalLength]), std(iris[!,:SepalLength])

#### 19.3: Creating a Dataframe

Although data (as a `DataFrame`) is typically loaded from a file or from the `RDatasets` package, we can also make a data frame directly in the following way.

Note: the data is given by column, and each column is a vector or is converted to a vector.

In [None]:
data = DataFrame(A = 1:2:13, B = ["M", "F", "F", "M","X","F","M"], C=[3.0,2.5,pi,-2.3,1/3,56,100],D=[(-1)^n//n for n=1:7])

In [None]:
size(data)

In [None]:
names(data)

In [None]:
describe(data)

#### 19.2.4: Plotting data in a DataFrame

We can plot the data in a DataFrame in a manner similar to that in the `Plots` package, such as:

In [None]:
scatter(iris[!,:SepalLength],iris[!,:SepalWidth])

However, the `StatsPlots` package has some nice shorthand for this.  The `@df` macro takes the DataFrame first and then the plot command.  Inside the plot command, the macro lets us refer to columns by just their names (as symbols).

In [None]:
@df iris scatter(:SepalLength,:SepalWidth)

Here's a nice plot that colors the points by `Species`, the categorical variable.

In [None]:
@df iris scatter(:SepalLength,:SepalWidth,group=:Species)

In [None]:
mean(iris[!,:SepalWidth]),std(iris[!,:SepalWidth])

### 19.3: Manipulating DataFrames

Typically once you have a data frame, you will need to manipulate it.  This includes filtering (subsets) rows or columns and creating new columns.  We will cover that in this section. 

#### Selecting columns
If we want to select specific columns, we can do this in a couple of different ways.  The first uses the same technique as above.  This selects the A and D columns.  Note the colon (:) in front of the column names.

In [None]:
data[!,[:A,:D]]

Alternatively, we can use the `select` function.  The first argument is the data frame and the others are column names.

In [None]:
select(data, :A, :D)

And we can use column numbers and reorder:

In [None]:
select(data, 4,3)

#### Filtering (or subsetting) the rows

Next, we see how to filter (or subset) the rows based on some condition.  This example keeps only the rows where the column A values are less than 10.

In [None]:
subset(data, :A => a-> a .< 10)

Note that the right side of the pair is an anonymous function whose input is the entire column and whose output must be a vector of booleans.  This is why the less than sign is broadcast (`.<`).  The comparison on its own returns a vector of booleans (a `BitVector`):

In [None]:
data[!,:A] .< 10
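To make the link between the boolean vector and the rows that are kept more concrete, here is a minimal sketch that uses a plain vector in place of a DataFrame column.  The names `a`, `mask` and `kept` are made up for this illustration.

```julia
# A plain vector holding the same values as column A of `data`.
a = collect(1:2:13)      # [1, 3, 5, 7, 9, 11, 13]

# Broadcasting `<` compares each element with 10 and returns a BitVector.
mask = a .< 10

# Indexing with the mask keeps only the entries where the mask is true,
# in the same way that `subset` keeps only the matching rows.
kept = a[mask]
```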

Alternatively, we can wrap a function that works on a single value in `ByRow`, which applies it to each row:

In [None]:
subset(data, :A => ByRow(a-> a < 10))

This keeps all rows where column B is "F":

In [None]:
subset(data, :B => b-> b.== "F")

We can filter on more than one column.  This example returns all rows where column A is larger than column C.

Note: the column names must be put in a vector, and the function must then take 2 arguments (one per column).

In [None]:
subset(data, [:A, :C] => (a,c) -> a .> c)

And here's an example using three columns

In [None]:
subset(data, [:A, :D, :C] => (a,d,c) -> a.*d .> c)

#### Exercise

- Find all rows where the absolute value of the C column is greater than 2.
- Find all rows where the product of columns C and D is greater than 1.

#### Transforming Data Frames

If you want a new column that is some function of one or more of the columns, we will use either `select` or `transform`:
- use `select` if you only want the new column(s) in the data frame
- use `transform` if you want the original data frame as well as the new columns

The following makes a dataframe with a single column that is the square of the A column

In [None]:
select(data, :A => a-> a.^2)

Notice that the new column has the generic name `A_function`.  If we want to give that column a better name, use:

In [None]:
select(data, :A => (a-> a.^2) => :Asq)

Note: make sure the ( ) are around the function.  Remove them to see what happens.
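The reason the parentheses matter can be checked without a data frame at all, since the `=>` syntax just builds nested `Pair` objects.  The following is a small sketch; the names `with_parens` and `without_parens` are made up for this illustration.

```julia
# With parentheses: source column => (function => new column name).
with_parens = :A => (a -> a.^2) => :Asq

# Without parentheses, the body of the anonymous function extends to the end
# of the expression, so `=> :Asq` becomes part of what the function returns.
without_parens = :A => a -> a.^2 => :Asq

with_parens.second isa Pair           # true: the function paired with :Asq
without_parens.second isa Function    # true: the name was absorbed by the function
without_parens.second([1, 2])         # returns the pair [1, 4] => :Asq
```

In the second form, the function returns a pair for each call instead of a plain vector, so the column name is not passed to `select` in the way we intended.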

We can also make a column based on a function of two columns. For example:

In [None]:
select(data, [:C, :D] => ((c,d)-> c.*d) => :prod)

And if we want to do both:

In [None]:
select(data, :A => (a-> a.^2) => :Asq, [:C, :D] => ((c,d)-> c.*d) => :prod)

If instead of dropping the original columns we want to add new columns to them, we can use the `transform` function:

In [None]:
transform(data, :A => (a-> a.^2) => :Asq)

#### Exercise

- Create a new data frame from `data` which is the square root of column C.
- Using the `iris` dataframe produce a new column called `area` which is the area of a petal using the `PetalLength` and `PetalWidth` variables and the area of an ellipse. Keep the original columns with this new dataframe.

#### Sorting DataFrames

Sorting data frames is quite helpful in many situations.  We use the `sort` function to do this.  The following sorts on column C

In [None]:
sort(data, :C)

And if we want to sort in reverse order

In [None]:
sort(data, :C, rev = true)

Sorting depends on the type of the column.  Since column B contains strings, this sorts lexicographically.

In [None]:
sort(data, :B)

## 19.5: Joining DataFrames

Another important activity with data frames is joining two or more of them.  Typically this means that both data frames have a common piece of information on which to join.  Consider the following:

In [None]:
simpsons = DataFrame(
    id=1:2:13,
    name=["Homer","Marge","Lisa","Bart","Maggie","Apu","Moe"],
    age =[45,42,8,10,1,38,59],
    salary = [50000,25000,10000,missing,missing,45000,3000],
    favorite_food = ["pork chops","casserole","salad","hamburger",missing,"saag paneer","peanuts"]
  )

A keen eye notices that the `salary` and `favorite_food` columns data types have a ?.  This is because they have missing data.  Again, we'll explain how to handle this later. 

If we want to join this to the data frame called `data` where column `id` above matches `A` on `data`, we do the following:

In [None]:
innerjoin(data, simpsons, on = :A => :id)

We will explain `innerjoin` below, but first note a couple of things.  The first 4 columns came from `data` and the last 4 from `simpsons` (these counts don't need to be equal).  The `id` column from `simpsons` was dropped, since it carries the same values as the key column `A`.

The following are the joins in the `DataFrames` package:

- **innerjoin:** the output contains rows for values of the key that exist in all passed data frames.
- **leftjoin:** the output contains rows for values of the key that exist in the first (left) argument, whether or not that value exists in the second (right) argument.
- **rightjoin:** the output contains rows for values of the key that exist in the second (right) argument, whether or not that value exists in the first (left) argument.
- **outerjoin:** the output contains rows for values of the key that exist in any of the passed data frames.
- **semijoin:** Like an inner join, but output is restricted to columns from the first (left) argument.
- **antijoin:** The output contains rows for values of the key that exist in the first (left) but not the second (right) argument. As with semijoin, output is restricted to columns from the first (left) argument.
- **crossjoin:** The output is the cartesian product of rows from all passed data frames.
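To see how several of these joins differ, here is a minimal sketch on two tiny tables.  The data frames `people` and `jobs` and their contents are made up for this illustration; only the join functions themselves come from `DataFrames`.

```julia
using DataFrames

# Two small, made-up tables that share an `id` key.
people = DataFrame(id = [1, 2, 3], name = ["Ann", "Bob", "Cy"])
jobs   = DataFrame(id = [2, 3, 4], job  = ["chef", "pilot", "nurse"])

inner = innerjoin(people, jobs, on = :id)  # ids 2 and 3: present in both tables
left  = leftjoin(people, jobs, on = :id)   # ids 1, 2, 3: `job` is missing for id 1
outer = outerjoin(people, jobs, on = :id)  # ids 1, 2, 3, 4
semi  = semijoin(people, jobs, on = :id)   # rows of `people` with a match, no `job` column
anti  = antijoin(people, jobs, on = :id)   # rows of `people` without a match (id 1)
```

Comparing `nrow` and `names` of each result is a quick way to check which rows and columns a given join keeps.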

## 19.6: Summarizing Data

Usually one wants to boil down a dataset to a few numbers. This is typically what the mean, median, standard deviation and quartiles are useful for.  

The `combine` function in `DataFrames` will do this:

In [None]:
combine(data, :A => mean, :C => mean, :C => std, :D => (d -> quantile(d,0.75)) )

where the last function returns the 3rd quartile (0.75 quantile). 

We will see `combine` below in which a dataset is grouped and then computations are made on a per group basis.

## 19.7: The Pipe function

Consider the following crazy, made-up nested function evaluation
$$ \ln(\sqrt{e^{\sin(\tan^{-1}(0.25))}})$$
In Julia we can evaluate this with the nested call `log(sqrt(exp(sin(atan(0.25)))))`.

If you are not careful, the parentheses can be difficult to balance.  However, another way to think of this is to start with the number 0.25, apply the arctangent, apply the sine, apply the exponential, apply the square root, then apply the log.  This can be done without parentheses in the following way:

In [None]:
0.25 |> atan |> sin |> exp |> sqrt |> log

resulting in the same value.  The `|>` is the pipe command which takes the value on the left and "sends" it to the function on the right. We can successively send (or pipe) to multiple functions as above and this is how it is powerful.
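As a quick check, the following sketch compares the nested form with the piped form.  The variable names `nested` and `piped` are made up for this illustration.

```julia
# The nested call and the piped call apply the same functions in the same order.
nested = log(sqrt(exp(sin(atan(0.25)))))
piped  = 0.25 |> atan |> sin |> exp |> sqrt |> log

nested == piped   # true

# Mathematically, ln(sqrt(e^y)) = y/2, so both should be close to sin(atan(0.25))/2.
isapprox(piped, sin(atan(0.25)) / 2)   # true
```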

This only works directly for named functions of a single variable, but we can adapt it to other expressions with an anonymous function.  Consider calculating $\sin(1+ e^{0.25})$ using pipes.  Wrapping the anonymous function in parentheses keeps its body from absorbing the rest of the pipeline.  We do this with

In [None]:
0.25 |> (x -> 1 + exp(x)) |> sin

#### Another example

Let's look at an example with an array.  Calculate `deleteat!([1,2,3,4,5],3)` using pipes:

In [None]:
1:5 |> collect |> arr -> deleteat!(arr,3)

If you find this kinda overkill, you're right with these examples. We will do this with dataframes below, which is where it can get quite nice.

We will use the `Chain` package to help as well.  Add it with the package manager and then load it:

In [None]:
using Chain

The `Chain` package has a macro called `@chain` which will make the steps even clearer.  The above steps can be written as

In [None]:
@chain 1:5 begin
  collect
  deleteat!(3)
end

Between the `@chain` and the `begin` is what we are starting with.  In this case, the range `1:5`. Then each line inside the `@chain` does a separate step.  
1. call `collect`, that is make the vector.
2. delete the 3rd element. 

The way this works is that `@chain` inserts the result of the previous line as the first argument of each line.  That is, the first line is really `collect(1:5)` and then the second line is `deleteat!(collect(1:5),3)`.
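The same two steps can be written out by hand in plain Julia, which may make the rewriting rule easier to see.  This is only a sketch of the idea, not the code that the macro actually generates, and the names `step1` and `step2` are made up.

```julia
# Step 1: `collect` receives the starting value as its first argument.
step1 = collect(1:5)          # [1, 2, 3, 4, 5]

# Step 2: the previous result goes in as the first argument of `deleteat!`,
# which removes the 3rd element.
step2 = deleteat!(step1, 3)   # [1, 2, 4, 5]
```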

Hopefully you can see that with 3 or more steps, this can simplify things. 

Here's another example. 
1. Start with the vector [1,2,3,4,5,6,7,8,9,10],
2. square each element 
3. keep all even numbers 
4. find the mean.

We will do this starting with `1:10` and then making a single function call on each line of the `@chain`.

In [None]:
@chain 1:10 begin
  collect
  _.^2
  filter(x->mod(x,2)==0,_)
  mean
end

Notice that the 3rd and 4th lines of the cell contain an underscore `_`.  This tells `@chain` where to put the result of the previous line.  When a line has no underscore, the result goes in as the first argument.
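For comparison, here are the same four steps written without `@chain`, with one made-up variable per line (`squares`, `evens`, `result`).  This is a sketch of what each line computes rather than the macro's actual expansion.

```julia
using Statistics

squares = collect(1:10) .^ 2                      # `_.^2`: the previous result replaces `_`
evens   = filter(x -> mod(x, 2) == 0, squares)    # here `_` is the second argument
result  = mean(evens)                             # no `_`, so the result is the first argument
```

The even squares are 4, 16, 36, 64 and 100, so `result` is `44.0`.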

#### Exercise

We are going to find the standard deviation of the numbers $\pi/3, 4\pi/3, 7\pi/3, 10\pi/3, \ldots, 100\pi/3$ by doing the following.
1. start with the range from 1 to 100.
2. make a vector.
3. keep only the values that appear in 1, 4, 7, 10, ...
4. multiply by $\pi/3$ 
5. find the standard deviation

### Using a data frame with Chain

The above examples are still a bit overkill, but let's repeat the above steps with a dataframe, which is typically how we will use Chain. 

In [None]:
df = DataFrame(x=1:10)
@chain df begin
  select(:x => (x->x.^2) => :xsq)
  subset(:xsq => x-> mod.(x,2) .== 0)
  combine(:xsq => mean)
end

### 19.8: Missing Data

Often data is missing from a DataFrame, and Julia has a data type called `Missing` that has only one value, `missing`.  Before we examine missing values in DataFrames, here are some examples with just the `missing` value:

In [None]:
typeof(missing)

In [None]:
missing+6

In [None]:
mean([1,2,3,missing,5])

Nearly any operation involving `missing` results in `missing`, which is a way to signal that the result depends on data that is missing.
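A few more cases are worth trying, since comparisons propagate `missing` as well.  The sketch below uses only functions from Base; the variable names are made up for this illustration.

```julia
sum_result = missing + 6              # missing
cmp_result = missing == missing       # missing, not true: comparisons propagate too

is_miss  = ismissing(cmp_result)      # true: the reliable way to test for a missing value
same     = isequal(missing, missing)  # true: `isequal` always returns a Bool
fallback = coalesce(missing, 0)       # 0: replaces a missing value with a default
```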

#### Missing values in a DataFrame

Recall that the `simpsons` dataset above had missing data:

In [None]:
simpsons

First, note that the datatypes on the salary and favorite_food columns have a ?.  Actually the data type of these is: 

In [None]:
eltype.(eachcol(simpsons))

And you can see that the last two element types are `Union{Missing,Int64}` and `Union{Missing,String}`.  The `Union` datatype is a way to allow more than one type.  This means that the elements of salary can be either `Int64` or `Missing`.

We can find the maximum age with:

In [None]:
maximum(simpsons[!,:age])

But if we do the same with the salary column, the result is `missing`:

In [None]:
maximum(simpsons[!,:salary])

There is a nice function called `skipmissing`, which looks a bit strange when called on its own:

In [None]:
skipmissing(simpsons[!,:salary])

It doesn't seem to do anything except wrap the array in a `skipmissing` object, but if we now look for the maximum with:

In [None]:
maximum(skipmissing(simpsons[!,:salary]))

This returns what we expect.

In [None]:
mean(skipmissing(simpsons[!,:salary]))
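The wrapper returned by `skipmissing` is lazy: it only passes over the missing entries when something iterates through it.  The following sketch uses a plain vector with the same values as the `salary` column; the names `salary` and `present` are made up for this illustration.

```julia
using Statistics

# Same values as the salary column of `simpsons`, as a plain vector.
salary = [50000, 25000, 10000, missing, missing, 45000, 3000]

# `collect` materializes the values that `skipmissing` lets through.
present = collect(skipmissing(salary))   # the 5 non-missing salaries

with_missing = maximum(salary)                # missing
top_salary   = maximum(skipmissing(salary))   # 50000
avg_salary   = mean(skipmissing(salary))      # 26600.0
```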

An alternative way to do this using `@chain` is as follows:

In [None]:
@chain simpsons begin
  _[!,:salary]
  skipmissing
  mean
end

which just finds the mean of the 5 non-missing values.

#### A better way of handling missing values in a dataframe

A better way to do this is with a `dropmissing` function in `DataFrames`.  This creates a new dataframe that filters out any row that has a missing value. For example

In [None]:
dropmissing(simpsons)

Also, if you want to drop only the rows with a missing value in one or more particular columns, you can do the following:

In [None]:
dropmissing(simpsons, :favorite_food)

And using this we can repeat the steps above with the `@chain` syntax:

In [None]:
@chain simpsons begin
    dropmissing(:salary)
    combine(:salary => mean)
end

## 19.9: Split-Apply-Combine

A common situation in data analysis is wanting to compare means or standard deviations across groups within a dataset.  Often you first need to split the dataset into groups, do some analysis on each group, and then summarize the results.  This is known as *split-apply-combine*.  We will demonstrate this with an example.  Let’s return to the iris dataset that we loaded at the beginning of this chapter.

The following splits the `iris` dataset by the `Species` column (there are 3 species).

In [None]:
gdf = groupby(iris, :Species)

The `combine` function will summarize and result in a row per group.  Let's say we want the mean and standard deviation of the `PetalLength` variable:

In [None]:
combine(gdf, :PetalLength => mean, :PetalLength => std)

If we just want the number of rows in each group

In [None]:
combine(gdf, nrow)
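To get a feel for what `groupby` and `combine` are doing, the split, apply and combine steps can be sketched by hand with a dictionary.  The data below (`species`, `len`) is made up for this illustration, and this is not how `DataFrames` implements grouping internally.

```julia
using Statistics

# Made-up data: a group label and a measurement for each observation.
species = ["a", "b", "a", "c", "b", "a"]
len     = [1.0, 4.0, 2.0, 10.0, 6.0, 3.0]

# Split: collect the measurements that belong to each label.
groups = Dict{String, Vector{Float64}}()
for (s, x) in zip(species, len)
    push!(get!(groups, s, Float64[]), x)
end

# Apply and combine: compute one summary number per group.
group_means = Dict(s => mean(xs) for (s, xs) in groups)
```

Here `group_means` maps `"a"` to `2.0`, `"b"` to `5.0` and `"c"` to `10.0`, which is one row of output per group, just as `combine` produces.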

Typically, there are multiple steps involved in the split-apply-combine.  Let's say that we want to split the iris data set as above, find the maximum of the `SepalLength`, the median of the `SepalWidth` and then the mean area of the petal (as found above in the exercise)

In [None]:
@chain iris begin
  transform([:PetalWidth, :PetalLength] => ((w,l) -> pi*w.*l) => :PetalArea)
  groupby(:Species)
  combine(:SepalLength => maximum, :SepalWidth => median, :PetalArea => mean)
end

In the next couple of chapters we will use these techniques on a more interesting data set.