## Chapter 19: Data Analysis

This chapter introduces some basic data analysis.  We will load existing datasets from a package, plot the data, and perform some analysis on it.

Let's load the following packages.

In [None]:
using RDatasets, StatsPlots, Query, StatsBase, Statistics, DataFrames, Plots

The RDatasets package is a collection of the datasets that ship with R, a popular language for statistics, and with many R packages.  The following lists all of the packages that provide data:

In [None]:
RDatasets.packages()

Each package has a set of datasets within it.  Here are the datasets in the `datasets` package.

In [None]:
RDatasets.datasets("datasets")

This loads the `iris` dataset from the `datasets` package.  The result is a `DataFrame`, which we will examine in more detail later.  Unlike a 2D array, in which every element has the same type, each column of a `DataFrame` has its own type, which makes it better suited to general datasets.  The column headers give the name of each column as well as its datatype:

In [None]:
iris=RDatasets.dataset("datasets","iris")

#### 19.2: Accessing the DataFrame

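We can get a column by using brackets with `!` (meaning all rows, without copying) and the name of the column (preceded with a colon).  The result is a 1-D array whose element type is given by the column type.  (Older versions of DataFrames allowed `iris[:SepalWidth]`, but that form is no longer supported.)

In [None]:
iris[!,:SepalWidth]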

So for example, we can find the mean of this column using the built-in `mean` function:

In [None]:
mean(iris[!,:SepalWidth])

And the following will find the standard deviation:

In [None]:
std(iris[!,:SepalWidth])

If we only want part of a column, we can use a range to select the desired rows:

In [None]:
iris[11:20,:SepalWidth]

In [None]:
iris[1:2:end,:PetalWidth]
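
The different ways of accessing a column are related but not identical. The following is a minimal sketch, using a small made-up DataFrame `df` (not part of the `iris` data), showing that `df[!, :x]` and `df.x` return the stored column itself, while `df[:, :x]` returns a copy:

```julia
using DataFrames

# A small, made-up DataFrame used only for illustration.
df = DataFrame(x = [1.0, 2.0, 3.0, 4.0], y = ["a", "b", "c", "d"])

col_ref  = df[!, :x]   # the stored column itself (no copy)
col_copy = df[:, :x]   # a fresh copy of the column

# Changing the copy leaves the DataFrame untouched ...
col_copy[1] = 100.0

# ... but changing the column obtained with `!` changes the DataFrame.
col_ref[2] = -2.0

df
```

This matters mostly when a column is modified in place; for read-only work such as `mean` or `std`, either form gives the same values.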

#### Common functions of DataFrames

The `size` function gives the size of the DataFrame, just as it does for an array.  This shows that there are 150 rows and 5 columns.

In [None]:
size(iris)

Here are the column names:

In [None]:
names(iris)

Here are the column types.  Note that the last one is a `CategoricalValue` (since there are only 3 different values).

In [None]:
eltype.(eachcol(iris))

Here are the first 5 rows of the DataFrame:

In [None]:
first(iris,5)

And the last 5 rows:

In [None]:
last(iris,5)

#### 19.2.3: Creating a DataFrame

The following is a simple DataFrame that we can create:

In [None]:
data = DataFrame(A = 1:5, B = ["M", "F", "F", "M","X"], C=[3.0,2.5,pi,-2.3,1/3])
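
A DataFrame can also be extended after it is created. The following is a small sketch, using a made-up DataFrame called `people` (a hypothetical name, separate from `data` above), that appends a row with `push!` and then adds a column computed from an existing one:

```julia
using DataFrames

# A made-up DataFrame, separate from `data` above.
people = DataFrame(A = [1, 2, 3], B = ["M", "F", "X"])

# Append one row; the tuple must match the column order and types.
push!(people, (4, "F"))

# Add a new column computed from an existing column.
people.C = people.A .^ 2

size(people)   # (rows, columns)
```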

#### 19.2.4: Plotting data in a DataFrame

We can plot the data in a DataFrame in a manner similar to that in the `Plots` package, such as:

In [None]:
scatter(iris[!,:SepalLength],iris[!,:SepalWidth])

However, the `StatsPlots` package has some nice shorthand for this.  The `@df` macro takes the DataFrame as its first argument, followed by the plot command.  Inside the plot command, columns can be referred to by their names alone (as symbols).

In [None]:
@df iris scatter(:SepalLength,:SepalWidth)

Here's a nice plot that colors the points by `Species`, the categorical variable.

In [None]:
@df iris scatter(:SepalLength,:SepalWidth,group=:Species)

In [None]:
mean(iris[!,:SepalWidth]),std(iris[!,:SepalWidth])

### 19.3: Pipe command

Often in computing, we nest various functions to produce a result.  For example:

In [None]:
sqrt(sin(big(2.0)))

If the number of functions is large, this can get hard to read.  Instead, we can write this as:

In [None]:
2.0 |> big |> sin |> sqrt

This is often called postfix notation; that is, the functions are applied from left to right.  The symbol `|>` is called the pipe operator: we start with 2.0 and pipe it (like plumbing) to the `big` function, then the `sin` function, then the `sqrt` function.  Note that the results are the same, but the syntax, and often the way you think about the computation, changes.
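
As a quick check that these forms agree, here is a short sketch using only base Julia. It also shows function composition with `∘` and the broadcast pipe `.|>`, neither of which is used elsewhere in this chapter; they are included only for comparison:

```julia
nested   = sqrt(sin(big(2.0)))         # functions nested inside out
piped    = 2.0 |> big |> sin |> sqrt   # the same functions, left to right
composed = (sqrt ∘ sin ∘ big)(2.0)     # build one function, then apply it

# All three compute the same BigFloat value.
nested == piped == composed

# The broadcast pipe applies a function to each element of an array.
roots = [1.0, 4.0, 9.0] .|> sqrt
```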

#### Another example

Here is an example of pushing elements onto an array.  First, this is the way we have seen so far in this course:

In [None]:
A=zeros(Int,0)

In [None]:
push!(A,3)

In [None]:
A

And this is the pipe version.  Note that since `push!` takes two arguments, we need to use an anonymous function:

In [None]:
4 |> x->push!(A,x)

In my opinion, this is not clearer.  However, we will use the pipe when manipulating DataFrames, where a number of functions are often strung together.

#### 19.3.1: Manipulating DataFrames using the Query package

The Query package has a number of macros that operate on DataFrames (and other similar structures); generally, they are used with the pipe operator.

Recall that we can take an array `[1,2,3,4,5]` and square each element with the `map` function:

In [None]:
map(a->a^2,[1,2,3,4,5])

We can do the same with the `@map` macro of the Query package.  The `_^2` is shorthand for the anonymous function `x->x^2` (the name of the variable does not matter).  The way I like to think of this is that we start with the array and apply the square to each element.

In [None]:
[1,2,3,4,5] |> @map(_^2)

Here's a more complicated example:

In [None]:
collect(1:10) |> @map(_^2) |> @filter(_%2==0) |> mean

This does the following:
1. Start with the array from 1 to 10.
2. Square each element.
3. Keep only the even numbers.
4. Take the mean.
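
For comparison, the same four steps can be sketched with only base Julia and the `Statistics` standard library (no Query macros). The even squares are 4, 16, 36, 64 and 100, so the mean is 44:

```julia
using Statistics

squares = map(x -> x^2, 1:10)     # steps 1 and 2: square each of 1, ..., 10
evens   = filter(iseven, squares) # step 3: keep only the even values
result  = mean(evens)             # step 4: take the mean

# The same computation written as a pipeline with anonymous functions.
piped = 1:10 |> v -> map(x -> x^2, v) |> v -> filter(iseven, v) |> mean
```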

#### 19.4: Other Commands in the Query package

There are a lot of nice functions in the Query package.  The [documentation](https://www.queryverse.org/Query.jl/stable/) has details of all of the functions, but we will see the more important ones here.

We can apply the `@map` macro to a DataFrame as well.  This example applies three different functions and creates a new table with 3 new column names.  Note that to access a column we use `_.` followed by the column name.

In [None]:
data |> @map({a1 = _.A^2, a2 = 2*_.A, g=string("gender:",_.B)})

Also, note that the DataFrame `data` is not changed; instead, this creates a new table-like object based on `data`.

In [None]:
data |> @map({a1 = _.A^2, a2 = 2*_.A, g=string("gender:",_.B)}) |> typeof

If you want to make a new DataFrame (which is often desirable), try:

In [None]:
data |> @map({a1 = _.A^2, a2 = 2*_.A, g=string("gender:",_.B)}) |> DataFrame

#### Filtering a DataFrame

The `@filter` macro takes a boolean expression and returns a new table with all rows for which the expression is `true`:

In [None]:
data |> @filter(_.A>2)

In [None]:
data |> @filter(_.B != "M")

In [None]:
data |> @filter(abs(_.C) < 2.4 && _.A % 2 == 0)

#### Grouping Data

A very common operation on DataFrames is to group the rows according to some property.  A simple example is to group according to a categorical variable.  For example, on the `iris` dataset, we use the `Species` column:

In [None]:
iris |> @groupby(_.Species)

Although the output is a bit overwhelming, you can see three big groups (where all of the data is listed).

We then want to do something in each group.  This will give a count of the number of each species:

In [None]:
iris |> 
  @groupby(_.Species) |> 
  @map({Species = key(_), Count = length(_)}) |> 
  DataFrame

Note that when a number of commands are piped together, I recommend putting each command on a separate line.  If you do this, the `|>` needs to be the last thing on each line.

Here's a table that finds the mean of the SepalWidth and SepalLength columns for each species:

In [None]:
iris |>
  @groupby(_.Species) |>
  @map({Species = key(_), mean_sepal_width = mean(_.SepalWidth), mean_sepal_length = mean(_.SepalLength)}) |>
  DataFrame
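
The same kind of grouped summary can also be written with the `groupby` and `combine` functions of the DataFrames package itself, without Query. The following is a sketch on a small made-up table called `flowers` (hypothetical values, not the real `iris` data), so that the result is easy to check by hand:

```julia
using DataFrames, Statistics

# Made-up measurements, used only for illustration.
flowers = DataFrame(
  Species = ["setosa", "setosa", "virginica", "virginica", "virginica"],
  SepalWidth = [3.5, 3.0, 3.3, 2.7, 3.0]
)

# Group by Species, then count the rows and average SepalWidth in each group.
stats = combine(
  groupby(flowers, :Species),
  nrow => :Count,
  :SepalWidth => mean => :mean_sepal_width
)
```

Here `nrow => :Count` plays the role of `length(_)` and `:SepalWidth => mean` plays the role of `mean(_.SepalWidth)` in the Query version.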

#### The @orderby command

If we want to order (sort) a DataFrame, we can use this macro.  It keeps each row together.  Here's an example of sorting by the C column:

In [None]:
data |> @orderby(_.C)

We can also order by descending (from largest to smallest):

In [None]:
data |> @orderby_descending(_.C)

We can also sort on a function of a column.  This sorts by the absolute value of C:

In [None]:
data |> @orderby(abs(_.C))

If you want to sort first by one column and then by others, use `@thenby` or `@thenby_descending`.  Consider this example:

In [None]:
df = DataFrame(a=[2,1,1,2,1,3],b=[2,2,1,1,3,2])

In [None]:
df |> @orderby(_.a) |> @thenby(_.b)
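
To see what a two-level sort should produce, here is a sketch of the same ordering using the `sort` function from the DataFrames package (an alternative to the Query macros, shown only for comparison). Rows are ordered by `a`, and ties in `a` are broken by `b`:

```julia
using DataFrames

df = DataFrame(a = [2, 1, 1, 2, 1, 3], b = [2, 2, 1, 1, 3, 2])

# Sort by column a first, then by column b within equal values of a.
sorted = sort(df, [:a, :b])

# Sort by a ascending, but b descending within equal values of a.
mixed = sort(df, [:a, order(:b, rev = true)])
```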

#### The @join command

Another common task is to merge two datasets.  First, let's consider this DataFrame:

In [None]:
simpsons = DataFrame(
  id=1:5,
  name=["Homer","Marge","Lisa","Bart","Maggie"],
  age =[45,42,8,10,1],
  salary = [50000,25000,10000,missing,missing],
  favorite_food = ["pork chops","casserole","salad","hamburger",missing]
)

And then we will merge this with the `data` DataFrame where we will match the column `id` of the `simpsons` DataFrame with the `A` column of the `data` DataFrame.

The following joins these.  The first argument is the second DataFrame, the second argument is the column of the first DataFrame, and the third argument is the column of the second DataFrame; rows are matched where these two columns are equal.  The expression inside the `{ }` lists the columns to include in the result, taken from the second DataFrame (with two underscores) and from the first (with one underscore).

In [None]:
data |> 
  @join(simpsons, _.A, _.id, {__.name, __.age, _.B, _.C}) |> 
  DataFrame
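
The DataFrames package also has its own join functions. As a sketch for comparison, the following uses `innerjoin` on two small made-up tables (`left` and `right` are hypothetical names); only rows whose keys appear in both tables are kept:

```julia
using DataFrames

# Two made-up tables; the key is column A on the left and id on the right.
left  = DataFrame(A = [1, 2, 3], B = ["M", "F", "F"])
right = DataFrame(id = [1, 2, 4], name = ["Homer", "Marge", "Bart"])

# Match rows where left.A equals right.id.
joined = innerjoin(left, right, on = :A => :id)
```

The key 3 appears only in `left` and the key 4 only in `right`, so neither shows up in `joined`.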

#### The @mutate command

The `@map` macro generates a new table from an old one, but sometimes we want to keep the original columns and add new ones, perhaps as functions of existing columns.  Here's an example:

In [None]:
data |> @mutate(a1 = _.A^2, c1=2*_.C, r = rand())

### 19.5: Missing Data

Often data is missing from a DataFrame.  Julia has a data type called `Missing` that has only one value, `missing`.  Before we examine missing values in DataFrames, here are some examples with just the `missing` value:

In [None]:
typeof(missing)

In [None]:
missing+6

In [None]:
mean([1,2,3,missing,5])

In general, any operation on `missing` results in `missing`; this propagation is a way to signal that data is missing.
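
A few more cases are worth knowing. The following sketch uses only base Julia and shows that comparisons with `missing` also give `missing`, that `ismissing` and `coalesce` are the usual tools for testing and replacing it, and that logical operators return a definite answer when the missing value cannot affect the result:

```julia
a = missing == 1          # comparisons propagate: the result is missing
b = ismissing(missing)    # true: the way to test for a missing value
c = coalesce(missing, 0)  # 0: replace missing by a default value
d = true | missing        # true: the result is true whatever the missing value is
e = false & missing       # false: the result is false whatever the missing value is
f = true & missing        # missing: here the answer depends on the missing value
```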

#### Missing values in a DataFrame

Recall that the `simpsons` dataset above had missing data:

In [None]:
simpsons

First, note that the datatypes of the salary and favorite_food columns are followed by a `?`.  To see what this actually means, look at the column types:

In [None]:
eltype.(eachcol(simpsons))

And you can see that the last two element types are `Union{Missing,Int64}` and `Union{Missing,String}`.  The `Union` datatype is a way to allow more than one type.  This means that the elements of salary can be either `Int64` or `Missing`.
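
The same kind of `Union` type appears for an ordinary array that contains `missing`. Here is a brief sketch in base Julia (the array `v` is made up; `Int` is `Int64` on a 64-bit system):

```julia
v = [50000, missing, 10000]

T = eltype(v)        # Union{Missing, Int64} on a 64-bit system
nonmissingtype(T)    # the type with Missing removed

# Check which values belong to the union type.
50000 isa T, missing isa T, "abc" isa T
```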

We can find the maximum age with:

In [None]:
simpsons[!,:age] |> maximum

But if we do the same with the salary column, the result is `missing`:

In [None]:
simpsons[!,:salary] |> maximum

There is a nice function called `skipmissing`, which looks a bit strange at first:

In [None]:
simpsons[!,:salary] |> skipmissing

This doesn't seem to do anything except wrap the array in a `SkipMissing` object, but if we now look for the maximum with:

In [None]:
simpsons[!,:salary] |> skipmissing |> maximum

This returns what we expect.

In [None]:
simpsons[!,:salary] |> skipmissing |> mean

This just finds the mean of the 3 non-missing values.
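
The pieces above can be tried on a plain array as well. This sketch uses base Julia and the `Statistics` standard library, with the same salary values as the `simpsons` DataFrame; `count(ismissing, ...)` and `coalesce.` are additions not used elsewhere in this chapter:

```julia
using Statistics

salary = [50000, 25000, 10000, missing, missing]

n_missing = count(ismissing, salary)      # how many values are missing
present   = collect(skipmissing(salary))  # the non-missing values as an array
avg       = mean(skipmissing(salary))     # mean of the non-missing values
filled    = coalesce.(salary, 0)          # replace each missing value by 0
```

Note that replacing missing values by 0 and then averaging gives a different (smaller) mean than skipping them, so the choice between `skipmissing` and `coalesce` depends on what a missing value is taken to mean.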