## Chapter 19: Data Analysis

This chapter is an introduction to data analysis.  We will load existing datasets that come from a package, make some plots, and do some analysis of the data.

Let's load the following packages.

In [None]:
using RDatasets, StatsPlots, StatsBase, Statistics, DataFrames

The RDatasets package collects datasets that are available in R, a popular language for statistics.  The following lists all of the data packages:

In [None]:
RDatasets.packages()

Each package has a set of datasets within it.  Here are the datasets in the `datasets` package.

In [None]:
RDatasets.datasets("datasets")

The following loads the `iris` dataset from the `datasets` package.  The result is a `DataFrame`, which we will look at in detail later.  Unlike an array, each column of a `DataFrame` has its own type, which makes it a better fit for general datasets.  The column headers give the name of each column as well as its datatype:

In [None]:
iris=RDatasets.dataset("datasets","iris")

#### 19.2: Accessing the DataFrame

We can get a column by using brackets with `!` for the rows and the name of the column (preceded with a colon).  The result is a 1-D array whose element type is given by the column type.

In [None]:
iris[!,:SepalWidth]

So for example, we can find the mean of this column using the built-in `mean` function:

In [None]:
mean(iris[!,:SepalWidth])

And the following will find the standard deviation:

In [None]:
std(iris[!,:SepalWidth])

If you only want part of a column, use a range to select the desired rows:

In [None]:
iris[11:20,:SepalWidth]

In [None]:
iris[1:2:end,:PetalWidth]
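The row selector matters when every row is requested: `!` returns the stored column itself, while `:` (or a range) returns a copy.  The following is a minimal sketch of that difference on a small made-up DataFrame (the name `df` and its values are for illustration only, not part of `iris`):

```julia
using DataFrames

# A small illustrative DataFrame (hypothetical data).
df = DataFrame(x = [1, 2, 3], y = ["a", "b", "c"])

stored = df[!, :x]    # the column vector held by the DataFrame, no copy
copied = df[:, :x]    # a new vector with the same values
part   = df[2:3, :x]  # a range of rows also gives a new vector

copied[1] = 100       # changes only the copy
after_copy_edit = df.x[1]

stored[1] = -1        # changes the DataFrame itself
after_stored_edit = df.x[1]
```

Use `!` when you only read the column or want to avoid a copy, and `:` when you plan to modify the result without touching the original data.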

#### Common functions of DataFrames

The `size` function returns the dimensions of the DataFrame, just as it does for an array.  Here it shows that there are 150 rows and 5 columns.

In [None]:
size(iris)

Here are the column names:

In [None]:
names(iris)

Here are the column types.  Note that the last one is a `CategoricalValue`, since `Species` is a categorical variable that takes only 3 different values.

In [None]:
eltype.(eachcol(iris))

Here are the first 5 rows of the DataFrame:

In [None]:
first(iris,5)

And the last 5 rows:

In [None]:
last(iris,5)

### Basic Statistics

In [None]:
describe(iris)

In [None]:
mean(iris[!,:SepalLength]),std(iris[!,:SepalLength])

#### 19.3: Creating a DataFrame

In [None]:
data = DataFrame(A = 1:2:13, B = ["M", "F", "F", "M","X","F","M"], C=[3.0,2.5,pi,-2.3,1/3,56,100],D=[(-1)^n//n for n=1:7])

In [None]:
size(data)

In [None]:
names(data)

In [None]:
describe(data)

As with `iris`, `first` and `last` return the first or last rows of the DataFrame, for example `first(data, 3)`.

#### 19.2.4: Plotting data in a DataFrame

We can plot the data in a DataFrame in a manner similar to that in the `Plots` package, such as:

In [None]:
scatter(iris[!,:SepalLength],iris[!,:SepalWidth])

However, the `StatsPlots` package has some nice shorthand for this.  The `@df` macro takes the DataFrame as its first argument, followed by the plot command.  Inside the plot command, columns can be referred to by name alone, without repeating the DataFrame.

In [None]:
@df iris scatter(:SepalLength,:SepalWidth)

Here's a nice plot in which the points are colored by `Species`, the categorical variable.

In [None]:
@df iris scatter(:SepalLength,:SepalWidth,group=:Species)

In [None]:
mean(iris[!,:SepalWidth]),std(iris[!,:SepalWidth])

### 19.3: Manipulating DataFrames

In [None]:
data[:,[:A,:D]]

In [None]:
select(data, :A, :D)

#### Filtering (or subsetting) the rows

In [None]:
subset(data, :A => a-> a .< 10)

In [None]:
subset(data, :B => b-> b.== "F")

In [None]:
subset(data, [:A, :C] => (a,c) -> a-c .> 0)

In [None]:
subset(data, [:A, :D, :C] => (a,d,c) -> a.*d .> c)

In [None]:
subset(data, [:A, :D, :C] => ByRow((a,d,c)-> a*d >c))
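The last two cells express the same kind of condition in two ways: a function that receives whole columns and must broadcast (`.>`), and a function wrapped in `ByRow` that receives one row's values at a time.  The following sketch, on a small made-up DataFrame (`df` and its values are illustrative assumptions), checks that the two forms select the same rows:

```julia
using DataFrames

# A small illustrative DataFrame (hypothetical data).
df = DataFrame(A = [1, 3, 5, 7], C = [3.0, 2.5, 6.0, -2.3])

# Whole-column form: a and c are vectors, so the comparison is broadcast.
whole = subset(df, [:A, :C] => (a, c) -> a .> c)

# Row-wise form: a and c are scalars, so no dots are needed.
rowwise = subset(df, [:A, :C] => ByRow((a, c) -> a > c))
```

`ByRow` tends to read more naturally when the condition is written in terms of single values; the whole-column form is useful when the condition needs the entire column, such as comparing against its mean.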

#### Transforming Data Frames

If you want a new column that is some function of one or more of the columns, use either `select` or `transform`:
- use `select` if you only want the new column(s) in the data frame
- use `transform` if you want the original data frame as well as the new columns

In [None]:
select(data, :A => a-> a.^2)

In [None]:
select(data, :A => (a-> a.^2) => :Asq)

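Note: make sure the parentheses are around the function.  Without them, the function body extends to the end of the expression and absorbs `=> :Asq`, so the new column is not named as intended.  Remove them to see what happens.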
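To see why the parentheses matter, it can help to build the two pairs on their own, without passing them to `select`.  This is a small sketch using only base Julia; the variable names are made up for illustration:

```julia
# Build the two pairs directly to see how Julia groups them.
with_parens    = :A => (a -> a .^ 2) => :Asq
without_parens = :A => a -> a .^ 2 => :Asq

# With parentheses: source column => (function => new column name).
new_name = with_parens.second.second

# Without parentheses: source column => function, where the function body
# is the whole expression `a .^ 2 => :Asq`.
body_result = without_parens.second([1, 2, 3])   # a Pair, not a vector
```

In the second form the name `:Asq` ends up inside the function's return value rather than in the position where DataFrames looks for the new column name.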

In [None]:
transform(data, :A => (a-> a.^2) => :Asq)

In [None]:
select(data, [:C, :D] => ((c,d)-> c.*d) => :prod)

#### Exercise

Using the `iris` DataFrame, produce a new column called `area`, the area of a petal, computed from the `PetalLength` and `PetalWidth` variables and the formula for the area of an ellipse.

#### Sorting DataFrames

In [None]:
sort(data, :C)

In [None]:
sort(data, :C, rev = true)

In [None]:
sort(data, :B)

## 19.5: Joining DataFrames

In [None]:
simpsons = DataFrame(
    id=1:2:13,
    name=["Homer","Marge","Lisa","Bart","Maggie","Apu","Moe"],
    age =[45,42,8,10,1,38,59],
    salary = [50000,25000,10000,missing,missing,45000,3000],
    favorite_food = ["pork chops","casserole","salad","hamburger",missing,"saag paneer","peanuts"]
  )

In [None]:
innerjoin(data, simpsons, on = :A => :id)

The example above used `innerjoin`.  DataFrames provides the following join functions:

- **innerjoin:** The output contains rows for values of the key that exist in all passed data frames.
- **leftjoin:** The output contains rows for values of the key that exist in the first (left) argument, whether or not that value exists in the second (right) argument.
- **rightjoin:** The output contains rows for values of the key that exist in the second (right) argument, whether or not that value exists in the first (left) argument.
- **outerjoin:** The output contains rows for values of the key that exist in any of the passed data frames.
- **semijoin:** Like an inner join, but the output is restricted to columns from the first (left) argument.
- **antijoin:** The output contains rows for values of the key that exist in the first (left) but not the second (right) argument. As with semijoin, the output is restricted to columns from the first (left) argument.
- **crossjoin:** The output is the Cartesian product of rows from all passed data frames.
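The differences between these joins are easiest to see on two tiny tables whose keys only partly overlap.  The sketch below uses made-up data frames `left` and `right` (hypothetical names and values) keyed on `id`, and counts the rows each join returns:

```julia
using DataFrames

# Two small illustrative tables; ids 2 and 3 appear in both.
left  = DataFrame(id = [1, 2, 3], name  = ["a", "b", "c"])
right = DataFrame(id = [2, 3, 4], score = [10, 20, 30])

inner = innerjoin(left, right, on = :id)   # ids in both tables
lft   = leftjoin(left, right, on = :id)    # every id from `left`
rgt   = rightjoin(left, right, on = :id)   # every id from `right`
outer = outerjoin(left, right, on = :id)   # ids from either table
semi  = semijoin(left, right, on = :id)    # matching rows, `left` columns only
anti  = antijoin(left, right, on = :id)    # rows of `left` with no match

row_counts = (inner = nrow(inner), left = nrow(lft), right = nrow(rgt),
              outer = nrow(outer), semi = nrow(semi), anti = nrow(anti))
```

Rows that have no partner in the other table get `missing` in the columns that come from that table, which is why `leftjoin`, `rightjoin`, and `outerjoin` often introduce missing values.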

### 19.5: Missing Data

Often in a DataFrame, data is missing, and Julia has a data type called `Missing` that has only one value, `missing`.  Before we examine `missing` in DataFrames, here are some examples with just the `missing` value:

In [None]:
typeof(missing)

In [None]:
missing+6

In [None]:
mean([1,2,3,missing,5])

Most operations involving `missing` result in `missing`.  This propagation is a way to signal that a result depends on data that is not available.
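Because `missing` propagates, even comparing with `==` does not give `true` or `false`.  The following sketch uses only base Julia functions to show how to test for and replace `missing`; the variable names are made up for illustration:

```julia
# Comparing with == propagates missing, so use ismissing to test for it.
eq_result   = missing == missing        # missing, not true
is_missing  = ismissing(missing)        # true
same_value  = isequal(missing, missing) # true

# coalesce returns its first non-missing argument.
replaced = coalesce(missing, 0)         # 0

# Logical operators only propagate when the answer really is unknown.
known_true  = true | missing            # true
known_false = false & missing           # false
unknown     = true & missing            # missing
```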

#### Missing values in a DataFrame

Recall that the `simpsons` dataset above had missing data:

In [None]:
simpsons

First, note that the datatypes of the `salary` and `favorite_food` columns end with a `?`.  This marks a column that can hold missing values, as the element types show:

In [None]:
eltype.(eachcol(simpsons))

You can see that the last two element types are `Union{Missing,Int64}` and `Union{Missing,String}`.  A `Union` type allows values of more than one type.  This means that the elements of `salary` can be either `Int64` or `Missing`.
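The same `Union` element type appears for an ordinary array that mixes numbers and `missing`.  Here is a brief sketch with a made-up vector `v` (assuming a 64-bit machine, where `Int` is `Int64`):

```julia
# A plain vector that mixes integers and missing.
v = [50000, missing, 3000]

T = eltype(v)                      # Union{Missing, Int64} on a 64-bit machine
value_type = nonmissingtype(T)     # the type with Missing removed

# A vector declared with the Union type accepts both kinds of value.
w = Union{Missing, Int}[1, 2]
push!(w, missing)
```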

We can find the maximum age with:

In [None]:
simpsons[!,:age] |> maximum

But if we do the same with the salary column:

In [None]:
simpsons[!,:salary] |> maximum

The result is `missing`, because the column contains missing values.  The `skipmissing` function helps here, although at first it looks a bit strange:

In [None]:
simpsons[!,:salary] |> skipmissing

This doesn't seem to do anything except wrap the array in a `SkipMissing` object, but if we now look for the maximum with:

In [None]:
simpsons[!,:salary] |> skipmissing |> maximum

This returns what we expect.

In [None]:
simpsons[!,:salary] |> skipmissing |> mean

which just finds the mean of the 5 non-missing values.
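The wrapper returned by `skipmissing` is lazy: it only skips the missing entries when another function iterates over it.  The following sketch uses a plain vector with the same values as the `salary` column (the variable names are made up for illustration) to show a few common ways of working with it:

```julia
using Statistics

# The same values as the salary column of `simpsons`.
salary = [50000, 25000, 10000, missing, missing, 45000, 3000]

n_missing = count(ismissing, salary)      # how many entries are missing
present   = collect(skipmissing(salary))  # materialize the non-missing values
avg       = mean(skipmissing(salary))     # mean over the non-missing values
top       = maximum(skipmissing(salary))  # maximum over the non-missing values

# Replacing missing with a default is another option; whether 0 is a sensible
# default depends on the data, so this line is only an illustration.
filled = coalesce.(salary, 0)
```

Skipping and filling answer different questions: the mean of `filled` treats a missing salary as zero, while `avg` ignores those entries entirely.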