### 29: Manipulating DataFrames

Once you have a data frame, you will typically need to manipulate it.  This includes selecting columns, filtering (subsetting) rows, and creating new columns, all of which we cover in this section.

#### Selecting columns
If we want to select specific columns, we can do this in a couple of different ways.  The first uses the same indexing technique as above; the following selects the `A` and `D` columns.  Note the colon (:) in front of the column names.

In [None]:
data[!,[:A,:D]]

Row,A,D
Unnamed: 0_level_1,Int64,Rational…
1,1,-1//1
2,3,1//2
3,5,-1//3
4,7,1//4
5,9,-1//5
6,11,1//6
7,13,-1//7


Alternatively, we can use the `select` function.  The first argument is the data frame and the remaining arguments are column names.

In [None]:
select(data, :A, :D)

Row,A,D
Unnamed: 0_level_1,Int64,Rational…
1,1,-1//1
2,3,1//2
3,5,-1//3
4,7,1//4
5,9,-1//5
6,11,1//6
7,13,-1//7


We can also use column numbers, which lets us reorder the columns:

In [None]:
select(data, 4,3)

Row,D,C
Unnamed: 0_level_1,Rational…,Float64
1,-1//1,3.0
2,1//2,2.5
3,-1//3,3.14159
4,1//4,-2.3
5,-1//5,0.333333
6,1//6,56.0
7,-1//7,100.0


#### Filtering (or subsetting) the rows

Next, we see how to filter (or subset) the rows based on some condition.  This example keeps only the rows where the values in column A are less than 10.

In [None]:
subset(data, :A => a-> a .< 10)

Row,A,B,C,D
Unnamed: 0_level_1,Int64,String,Float64,Rational…
1,1,M,3.0,-1//1
2,3,F,2.5,1//2
3,5,F,3.14159,-1//3
4,7,M,-2.3,1//4
5,9,X,0.333333,-1//5


Note that the last argument is an anonymous function whose input is the entire column and whose output must be a vector of booleans.  This is why the less-than sign is broadcast (`.<`).

Alternatively, we can wrap a function that acts on a single value in `ByRow`:

In [None]:
subset(data, :A => ByRow(a-> a < 10))

Row,A,B,C,D
Unnamed: 0_level_1,Int64,String,Float64,Rational…
1,1,M,3.0,-1//1
2,3,F,2.5,1//2
3,5,F,3.14159,-1//3
4,7,M,-2.3,1//4
5,9,X,0.333333,-1//5
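
To check that the column-wise form and the `ByRow` form agree, here is a minimal sketch. The definition of `data` is an assumption, reconstructed from the output printed in this section; `<(10)` is ordinary Julia shorthand for `a -> a < 10`.

```julia
using DataFrames

# Assumed reconstruction of `data` from the printed output in this section
data = DataFrame(
    A = 1:2:13,
    B = ["M", "F", "F", "M", "X", "F", "M"],
    C = [3.0, 2.5, Float64(pi), -2.3, 1/3, 56.0, 100.0],
    D = [(-1)^n // n for n in 1:7],
)

# The function receives the whole column, so the comparison is broadcast
by_column = subset(data, :A => a -> a .< 10)

# ByRow applies a scalar function to each element of the column
by_row = subset(data, :A => ByRow(a -> a < 10))

# Same condition written with the partially applied comparison `<(10)`
by_row_short = subset(data, :A => ByRow(<(10)))

println(by_column == by_row)
println(by_row == by_row_short)
println(by_row.A)
```

All three calls should select the same five rows; which form to use is mostly a matter of taste.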


This keeps all rows where column B is "F":

In [None]:
subset(data, :B => b-> b.== "F")

Row,A,B,C,D
Unnamed: 0_level_1,Int64,String,Float64,Rational…
1,3,F,2.5,1//2
2,5,F,3.14159,-1//3
3,11,F,56.0,1//6


We can filter on more than one column.  This example returns all rows where column A is larger than column C.

Note that the columns must be passed as a vector and the function must take two arguments, one per column.

In [None]:
subset(data, [:A, :C] => (a,c) -> a .> c)

Row,A,B,C,D
Unnamed: 0_level_1,Int64,String,Float64,Rational…
1,3,F,2.5,1//2
2,5,F,3.14159,-1//3
3,7,M,-2.3,1//4
4,9,X,0.333333,-1//5


And here's an example using three columns:

In [None]:
subset(data, [:A, :D, :C] => (a,d,c) -> a.*d .> c)

Row,A,B,C,D
Unnamed: 0_level_1,Int64,String,Float64,Rational…
1,7,M,-2.3,1//4
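
The multi-column filters above can also be written row by row with `ByRow`, and `subset` accepts several conditions at once. The sketch below assumes a reconstruction of the relevant columns of `data` from the printed output.

```julia
using DataFrames

# Assumed reconstruction of the needed columns of `data`
data = DataFrame(
    A = 1:2:13,
    B = ["M", "F", "F", "M", "X", "F", "M"],
    C = [3.0, 2.5, Float64(pi), -2.3, 1/3, 56.0, 100.0],
)

# Two columns compared one row at a time, so no broadcasting is needed
a_gt_c = subset(data, [:A, :C] => ByRow((a, c) -> a > c))

# Several conditions: a row is kept only when every condition is true
small_f = subset(data, :A => ByRow(<(10)), :B => ByRow(==("F")))

println(a_gt_c.A)
println(small_f.A)
```

Passing separate conditions keeps each one simple; the alternative is a single function over `[:A, :B]` that combines them with `&&`.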


#### Exercise

- Find all rows where the absolute value of the C column is greater than 2.
- Find all rows where the product of columns C and D is greater than 1.

#### Transforming Data Frames

If you want a new column that is some function of one or more of the columns, we will use either `select` or `transform`:
- use `select` if you only want the new column(s) in the data frame
- use `transform` if you want the original data frame as well as the new columns

The following makes a data frame with a single column that is the square of the A column:

In [None]:
select(data, :A => a-> a.^2)

Row,A_function
Unnamed: 0_level_1,Int64
1,1
2,9
3,25
4,49
5,81
6,121
7,169


Notice that the new column has the generic name `A_function`.  To give the column a better name, use:

In [None]:
select(data, :A => (a-> a.^2) => :Asq)

Row,Asq
Unnamed: 0_level_1,Int64
1,1
2,9
3,25
4,49
5,81
6,121
7,169


Note: make sure the ( ) are around the function.  Remove them to see what happens.
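
To see why the parentheses matter without touching a data frame, the sketch below builds both expressions and inspects them. It uses only base Julia; the variable names are made up for illustration.

```julia
# With parentheses: column => (function => new name)
with_parens = :A => (a -> a .^ 2) => :Asq

# Without parentheses the body of the anonymous function extends to the
# end of the expression, so `=> :Asq` becomes part of the function itself
without_parens = :A => a -> a .^ 2 => :Asq

println(with_parens.second isa Pair)
println(without_parens.second isa Function)

# Calling the second function returns a Pair instead of a plain vector
println(without_parens.second([1, 2]))
```

In the second form the new column name ends up inside the function, which is not the `source => function => name` pattern used in the examples here.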

We can also make a column based on a function of two columns. For example:

In [None]:
select(data, [:C, :D] => ((c,d)-> c.*d) => :prod)

Row,prod
Unnamed: 0_level_1,Float64
1,-3.0
2,1.25
3,-1.0472
4,-0.575
5,-0.0666667
6,9.33333
7,-14.2857


And if we want to do both:

In [None]:
select(data, :A => (a-> a.^2) => :Asq, [:C, :D] => ((c,d)-> c.*d) => :prod)

Row,Asq,prod
Unnamed: 0_level_1,Int64,Float64
1,1,-3.0
2,9,1.25
3,25,-1.0472
4,49,-0.575
5,81,-0.0666667
6,121,9.33333
7,169,-14.2857


If, instead of discarding the original columns, we want to add new columns to the data frame, we use the `transform` function:

In [None]:
transform(data, :A => (a-> a.^2) => :Asq)

Row,A,B,C,D,Asq
Unnamed: 0_level_1,Int64,String,Float64,Rational…,Int64
1,1,M,3.0,-1//1,1
2,3,F,2.5,1//2,9
3,5,F,3.14159,-1//3,25
4,7,M,-2.3,1//4,49
5,9,X,0.333333,-1//5,81
6,11,F,56.0,1//6,121
7,13,M,100.0,-1//7,169
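
Both `select` and `transform` return a new data frame and leave the original untouched. The sketch below illustrates this on a one-column stand-in for `data` (an assumption made for brevity) and contrasts it with the in-place variant `transform!`.

```julia
using DataFrames

# Stand-in holding only column A of `data`
df = DataFrame(A = 1:2:13)

# transform returns a new data frame; df itself is not modified
with_sq = transform(df, :A => (a -> a .^ 2) => :Asq)
names_before = names(df)

println(names_before)
println(names(with_sq))

# transform! adds the column to df in place
transform!(df, :A => (a -> a .^ 2) => :Asq)
println(names(df))
```

The non-mutating versions are the safer default, since earlier cells of a notebook keep working on the unchanged data.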


#### Exercise

- Create a new data frame from `data` which is the square root of column C.
- Using the `iris` dataframe produce a new column called `area` which is the area of a petal using the `PetalLength` and `PetalWidth` variables and the area of an ellipse. Keep the original columns with this new dataframe.

#### Sorting DataFrames

Sorting data frames is quite helpful in many situations.  We use the `sort` function to do this.  The following sorts on column C:

In [None]:
sort(data, :C)

Row,A,B,C,D
Unnamed: 0_level_1,Int64,String,Float64,Rational…
1,7,M,-2.3,1//4
2,9,X,0.333333,-1//5
3,3,F,2.5,1//2
4,1,M,3.0,-1//1
5,5,F,3.14159,-1//3
6,11,F,56.0,1//6
7,13,M,100.0,-1//7


And if we want to sort in reverse order:

In [None]:
sort(data, :C, rev = true)

Row,A,B,C,D
Unnamed: 0_level_1,Int64,String,Float64,Rational…
1,13,M,100.0,-1//7
2,11,F,56.0,1//6
3,5,F,3.14159,-1//3
4,1,M,3.0,-1//1
5,3,F,2.5,1//2
6,9,X,0.333333,-1//5
7,7,M,-2.3,1//4


Sorting depends on the column type.  Since column B holds strings, this sorts lexicographically:

In [None]:
sort(data, :B)

Row,A,B,C,D
Unnamed: 0_level_1,Int64,String,Float64,Rational…
1,3,F,2.5,1//2
2,5,F,3.14159,-1//3
3,11,F,56.0,1//6
4,1,M,3.0,-1//1
5,7,M,-2.3,1//4
6,13,M,100.0,-1//7
7,9,X,0.333333,-1//5
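
Sorting on more than one column follows the same pattern. As a minimal sketch (again assuming a reconstruction of `data` from the printed output), the following sorts by `B` and, within each value of `B`, by `C` in descending order using the `order` helper from `DataFrames`.

```julia
using DataFrames

# Assumed reconstruction of the needed columns of `data`
data = DataFrame(
    A = 1:2:13,
    B = ["M", "F", "F", "M", "X", "F", "M"],
    C = [3.0, 2.5, Float64(pi), -2.3, 1/3, 56.0, 100.0],
)

# Sort by B ascending, then by C descending within each value of B
sorted = sort(data, [:B, order(:C, rev = true)])

println(sorted.B)
println(sorted.A)
```

Using `order` lets each column have its own direction, whereas the `rev = true` keyword on `sort` reverses every column.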


## Joining DataFrames

Another important activity with data frames is joining two or more of them.  Typically this means that both data frames have a common piece of information on which to join.  Consider the following:

In [None]:
simpsons = DataFrame(
    id=1:2:13,
    name=["Homer","Marge","Lisa","Bart","Maggie","Apu","Moe"],
    age =[45,42,8,10,1,38,59],
    salary = [50000,25000,10000,missing,missing,45000,3000],
    favorite_food = ["pork chops","casserole","salad","hamburger",missing,"saag paneer","peanuts"]
  )

Row,id,name,age,salary,favorite_food
Unnamed: 0_level_1,Int64,String,Int64,Int64?,String?
1,1,Homer,45,50000,pork chops
2,3,Marge,42,25000,casserole
3,5,Lisa,8,10000,salad
4,7,Bart,10,missing,hamburger
5,9,Maggie,1,missing,missing
6,11,Apu,38,45000,saag paneer
7,13,Moe,59,3000,peanuts


A keen eye notices that the data types of the `salary` and `favorite_food` columns have a `?`.  This is because they contain missing data.  Again, we'll explain how to handle this later.

If we want to join this to the data frame called `data` where column `id` above matches `A` on `data`, we do the following:

In [None]:
innerjoin(data, simpsons, on = :A => :id)

Row,A,B,C,D,name,age,salary,favorite_food
Unnamed: 0_level_1,Int64,String,Float64,Rational…,String,Int64,Int64?,String?
1,1,M,3.0,-1//1,Homer,45,50000,pork chops
2,3,F,2.5,1//2,Marge,42,25000,casserole
3,5,F,3.14159,-1//3,Lisa,8,10000,salad
4,7,M,-2.3,1//4,Bart,10,missing,hamburger
5,9,X,0.333333,-1//5,Maggie,1,missing,missing
6,11,F,56.0,1//6,Apu,38,45000,saag paneer
7,13,M,100.0,-1//7,Moe,59,3000,peanuts


We will explain `innerjoin` below, but first note a couple of things.  The first 4 columns came from `data` and the last 4 from `simpsons` (these counts don't need to be equal).  The `id` column from `simpsons` was dropped because its values already appear in column `A`.

The following are the joins in the `DataFrames` package:

- **innerjoin:** the output contains rows for values of the key that exist in all passed data frames.
- **leftjoin:** the output contains rows for values of the key that exist in the first (left) argument, whether or not that value exists in the second (right) argument.
- **rightjoin:** the output contains rows for values of the key that exist in the second (right) argument, whether or not that value exists in the first (left) argument.
- **outerjoin:** the output contains rows for values of the key that exist in any of the passed data frames.
- **semijoin:** Like an inner join, but output is restricted to columns from the first (left) argument.
- **antijoin:** The output contains rows for values of the key that exist in the first (left) but not the second (right) argument. As with semijoin, output is restricted to columns from the first (left) argument.
- **crossjoin:** The output is the cartesian product of rows from all passed data frames.
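
The differences between the joins are easiest to see when the keys only partly overlap. The sketch below uses two small hypothetical data frames (`left` and `right` are made up for illustration and are not part of this chapter's data).

```julia
using DataFrames

# Hypothetical tables: ids 2 and 3 appear in both, 1 only left, 4 only right
left = DataFrame(id = [1, 2, 3], x = ["a", "b", "c"])
right = DataFrame(id = [2, 3, 4], y = [20, 30, 40])

inner = innerjoin(left, right, on = :id)   # ids present in both
leftj = leftjoin(left, right, on = :id)    # every id from left; y may be missing
outer = outerjoin(left, right, on = :id)   # every id from either table
semi = semijoin(left, right, on = :id)     # matching rows, left columns only
anti = antijoin(left, right, on = :id)     # left rows with no match
cross = crossjoin(left, select(right, :y)) # every pairing of rows

println(sort(inner.id))
println(count(ismissing, leftj.y))
println(nrow(outer))
println(names(semi))
println(anti.id)
println(nrow(cross))
```

The `id` column is dropped from `right` before the cross join only so that the two inputs have no column name in common.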

## 19.6: Summarizing Data

Usually one wants to boil down a dataset to a few numbers. This is typically what the mean, median, standard deviation and quartiles are useful for.  

The `combine` function in `DataFrames` will do this:

In [None]:
combine(data, :A => mean, :C => mean, :C => std, :D => (d -> quantile(d,0.75)) )

Row,A_mean,C_mean,C_std,D_function
Unnamed: 0_level_1,Float64,Float64,Float64,Float64
1,7.0,23.2393,39.5518,0.208333


where the last function returns the 3rd quartile (0.75 quantile). 

We will see `combine` below in which a dataset is grouped and then computations are made on a per group basis.
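
The generated name `D_function` can be avoided the same way as with `select`, by appending `=> :name` to the function. A minimal sketch, using a one-column stand-in for `data` and the hypothetical output name `A_q3`:

```julia
using DataFrames, Statistics

# Stand-in holding only column A of `data`
df = DataFrame(A = 1:2:13)

# Name the quantile column explicitly instead of relying on `A_function`
stats = combine(df, :A => mean, :A => (a -> quantile(a, 0.75)) => :A_q3)

println(names(stats))
println(stats.A_mean[1])
println(stats.A_q3[1])
```

Named functions such as `mean` already produce a readable column name, so the explicit name is mainly useful for anonymous functions.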

### Using a data frame with Chain

The above examples are still a bit overkill, but let's repeat the above steps with a dataframe, which is typically how we will use Chain. 

In [None]:
df = DataFrame(x=1:10)
@chain df begin
  select(:x => (x->x.^2) => :xsq)
  subset(:xsq => x-> mod.(x,2) .== 0)
  combine(:xsq => mean)
end

Row,xsq_mean
Unnamed: 0_level_1,Float64
1,44.0


#### A better way of handling missing values in a dataframe

A better way to do this is with the `dropmissing` function in `DataFrames`.  This creates a new data frame that leaves out any row that has a missing value. For example:

In [None]:
dropmissing(simpsons)

Row,id,name,age,salary,favorite_food
Unnamed: 0_level_1,Int64,String,Int64,Int64,String
1,1,Homer,45,50000,pork chops
2,3,Marge,42,25000,casserole
3,5,Lisa,8,10000,salad
4,11,Apu,38,45000,saag paneer
5,13,Moe,59,3000,peanuts


Also, if you want to drop only the rows with a missing value in particular column(s), you can do the following:

In [None]:
dropmissing(simpsons, :favorite_food)

Row,id,name,age,salary,favorite_food
Unnamed: 0_level_1,Int64,String,Int64,Int64?,String
1,1,Homer,45,50000,pork chops
2,3,Marge,42,25000,casserole
3,5,Lisa,8,10000,salad
4,7,Bart,10,missing,hamburger
5,11,Apu,38,45000,saag paneer
6,13,Moe,59,3000,peanuts


And using this we can repeat the steps above with the `@chain` syntax:

In [None]:
@chain simpsons begin
    dropmissing(:salary)
    combine(:salary => mean)
end

Row,salary_mean
Unnamed: 0_level_1,Float64
1,26600.0
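
For comparison, the same mean can be computed without dropping rows by wrapping the column in `skipmissing` inside the function passed to `combine`. This is a sketch using only the `name` and `salary` columns of `simpsons`:

```julia
using DataFrames, Statistics

# The name and salary columns of `simpsons` as defined above
simpsons = DataFrame(
    name = ["Homer", "Marge", "Lisa", "Bart", "Maggie", "Apu", "Moe"],
    salary = [50000, 25000, 10000, missing, missing, 45000, 3000],
)

# The mean of a column containing missing values is itself missing
plain = combine(simpsons, :salary => mean)

# skipmissing ignores the missing entries for this computation only
skipped = combine(simpsons, :salary => (s -> mean(skipmissing(s))) => :salary_mean)

println(ismissing(plain.salary_mean[1]))
println(skipped.salary_mean[1])
```

`dropmissing` is convenient when whole rows should be excluded from every later step, while `skipmissing` affects only the one calculation it appears in.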


## Split-Apply-Combine

A common situation in data analysis is wanting to compare means or standard deviations between groups within a dataset. To do this, you first split the dataset into groups, perform some analysis on each group, and then summarize the results. This is known as *split-apply-combine*. We will demonstrate this with an example. Let’s return to the iris dataset that we loaded at the beginning of this chapter.

The following splits the `iris` dataset by the `Species` column, which has 3 distinct values (only the first and last groups are displayed):

In [None]:
gdf = groupby(iris, :Species)

Row,SepalLength,SepalWidth,PetalLength,PetalWidth,Species
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Cat…
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3.0,1.4,0.2,setosa
3,4.7,3.2,1.3,0.2,setosa
4,4.6,3.1,1.5,0.2,setosa
5,5.0,3.6,1.4,0.2,setosa
6,5.4,3.9,1.7,0.4,setosa
7,4.6,3.4,1.4,0.3,setosa
8,5.0,3.4,1.5,0.2,setosa
9,4.4,2.9,1.4,0.2,setosa
10,4.9,3.1,1.5,0.1,setosa

Row,SepalLength,SepalWidth,PetalLength,PetalWidth,Species
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Cat…
1,6.3,3.3,6.0,2.5,virginica
2,5.8,2.7,5.1,1.9,virginica
3,7.1,3.0,5.9,2.1,virginica
4,6.3,2.9,5.6,1.8,virginica
5,6.5,3.0,5.8,2.2,virginica
6,7.6,3.0,6.6,2.1,virginica
7,4.9,2.5,4.5,1.7,virginica
8,7.3,2.9,6.3,1.8,virginica
9,6.7,2.5,5.8,1.8,virginica
10,7.2,3.6,6.1,2.5,virginica


The `combine` function will summarize and result in a row per group.  Let's say we want the mean and standard deviation of the `PetalLength` variable:

In [None]:
combine(gdf, :PetalLength => mean, :PetalLength => std)

Row,Species,PetalLength_mean,PetalLength_std
Unnamed: 0_level_1,Cat…,Float64,Float64
1,setosa,1.462,0.173664
2,versicolor,4.26,0.469911
3,virginica,5.552,0.551895


If we just want the number of rows in each group:

In [None]:
combine(gdf, nrow)

Row,Species,nrow
Unnamed: 0_level_1,Cat…,Int64
1,setosa,50
2,versicolor,50
3,virginica,50


Typically, there are multiple steps involved in split-apply-combine.  Let's say that we want to split the iris data set as above and find the maximum of `SepalLength`, the median of `SepalWidth`, and the mean area of the petal (as found above in the exercise).  Note that the code below computes the area as `pi*w.*l`, which treats `PetalWidth` and `PetalLength` as the semi-axes of the ellipse:

In [None]:
@chain iris begin
  transform([:PetalWidth, :PetalLength] => ((w,l) -> pi*w.*l) => :PetalArea)
  groupby(:Species)
  combine(:SepalLength => maximum, :SepalWidth => median, :PetalArea => mean)
end

Row,Species,SepalLength_maximum,SepalWidth_median,PetalArea_mean
Unnamed: 0_level_1,Cat…,Float64,Float64,Float64
1,setosa,5.8,3.4,1.14857
2,versicolor,7.0,2.8,17.9712
3,virginica,7.9,3.0,35.4881
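
Each step of the `@chain` block receives the previous result as its first argument, so the same pipeline can be written with ordinary function calls. The sketch below shows this on a small hypothetical data frame (`petals` and its columns are made up and stand in for `iris`), using a plain product instead of the ellipse formula.

```julia
using DataFrames, Statistics

# Hypothetical toy data, not the iris dataset
petals = DataFrame(
    group = ["a", "a", "b"],
    wid = [1.0, 2.0, 3.0],
    len = [2.0, 2.0, 4.0],
)

# Apply: add a derived column while keeping the original ones
with_area = transform(petals, [:wid, :len] => ((w, l) -> w .* l) => :area)

# Split: one group per distinct value of `group`
grouped = groupby(with_area, :group)

# Combine: one summary row per group
result = combine(grouped, :area => mean)

println(result)
```

Naming the intermediate results makes each stage easy to inspect, at the cost of a few extra variables compared with the `@chain` version.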
