# Week 3 Notes - Getting and Cleaning Data

## Subsetting and Sorting
Time to wrangle and mould our datasets to our desires. Create a basic dataframe, reshuffle the contents of the columns and then insert some NA values too.
```R
set.seed(1345)
X <- data.frame("var1"=sample(1:5), "var2"=sample(6:10), "var3"=sample(11:15))
X <- X[sample(1:5),]; X$var2[c(1,3)] = NA
``` 

In Julia

In [1]:
using DataFrames, CSV, Pkg

In [2]:
X = DataFrame(col1=rand(1:20, 5), col2=rand(2:30, 5), col3=[missing, missing, 5, 4, 21])

Row,col1,col2,col3
Unnamed: 0_level_1,Int64,Int64,Int64?
1,19,5,missing
2,6,17,missing
3,4,15,5
4,9,9,4
5,10,22,21


Let's subset the dataframe and take specific columns and row combinations

In R
`X[,1]` to take the first column. We can also take the first column by passing the column name as a string `X[,"var1"]`. Let's take the first two rows of column 2 `X[1:2,"var2"]`

With Julia we can use the basic .dot syntax

In [3]:
X.col1

5-element Vector{Int64}:
 19
  6
  4
  9
 10

In [4]:
# This string based extraction is a bit slower
# Julia converts the String to a Symbol type 
X."col1"

5-element Vector{Int64}:
 19
  6
  4
  9
 10

We can also use indexing and the column names

In [5]:
X[:, 2]

5-element Vector{Int64}:
  5
 17
 15
  9
 22

In [6]:
X[:, "col1"]

5-element Vector{Int64}:
 19
  6
  4
  9
 10

In [7]:
X[:, :col1]

5-element Vector{Int64}:
 19
  6
  4
  9
 10

So to summarise it all: if we're using indexing, we can use the column number, "colname" or :colname. If we're using .dot syntax we can use df.colname or df."colname" - wooooooo
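
One thing the access styles above differ on is whether we get a copy of the column or the stored column itself. Here is a minimal sketch on a small made-up dataframe (the name `df` and its values are assumptions for illustration only): `df[:, :col1]` hands back a copy, while `df[!, :col1]` and `df.col1` hand back the vector that lives inside the dataframe.

```julia
using DataFrames

# Toy data, made up for illustration
df = DataFrame(col1 = [19, 6, 4], col2 = [5, 17, 15])

col_copy = df[:, :col1]   # `:` makes a fresh copy of the column
col_same = df[!, :col1]   # `!` hands back the stored vector itself

col_copy[1] = -1                      # changes only the copy
first_after_copy_edit = df.col1[1]

col_same[1] = 100                     # changes the dataframe as well
first_after_direct_edit = df.col1[1]

println((first_after_copy_edit, first_after_direct_edit))
```

This matters as soon as we call a mutating function (one ending in `!`) on a column, because `df.col1` gives that function the real data.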

In Julia, to check the column index of a certain column e.g. is it in the 10th column? etc., we can use the **columnindex()** function

In [8]:
columnindex(X, "col3")

3

To test whether a specific column is in the dataframe, based on its name, we can use **hasproperty()**

In [9]:
hasproperty(X, "col3")

true

#### Subset the dataframe using conditions, such as, print the dataframe in which the first column has values over 18

In [10]:
filter(row -> row.col1 > 18, X)

Row,col1,col2,col3
Unnamed: 0_level_1,Int64,Int64,Int64?
1,19,5,missing


If we have multiple conditions we can combine them with `||` (or) and `&&` (and)

In [11]:
filter(row -> row.col1 > 1 || row.col2 > 1, X)

Row,col1,col2,col3
Unnamed: 0_level_1,Int64,Int64,Int64?
1,19,5,missing
2,6,17,missing
3,4,15,5
4,9,9,4
5,10,22,21


In R we could do 
```R
X[(X$var1 <= 3 & X$var3 > 11),]
X[(X$var1 <= 3 | X$var3 > 11),]
``` 
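
For comparison, here is a rough Julia counterpart of the two R lines above. It is only a sketch: the dataframe `df` and its values are made up, and it deliberately contains no missing values, because `&&` and `||` need a plain true/false on both sides.

```julia
using DataFrames

# Toy data without missing values, made up for illustration
df = DataFrame(var1 = [2, 1, 3, 5, 4], var3 = [15, 11, 12, 14, 13])

# AND: both conditions must hold (R: X[(X$var1 <= 3 & X$var3 > 11),])
both = filter(row -> row.var1 <= 3 && row.var3 > 11, df)

# OR: at least one condition must hold (R: X[(X$var1 <= 3 | X$var3 > 11),])
either = filter(row -> row.var1 <= 3 || row.var3 > 11, df)

# Same AND filter, naming the columns up front instead of passing a whole row
both_cols = filter([:var1, :var3] => (a, b) -> a <= 3 && b > 11, df)
```

The column-pair form only touches the columns it names, which tends to be the quicker option on wide dataframes.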

### Sorting 

In R;
```R
sort(X$var1)
# Sort in reverse
sort(X$var1, decreasing=TRUE)
``` 

In Julia, to get just a sorted vector from a specific dataframe column, see
https://dataframes.juliadata.org/stable/man/sorting/

In [12]:
sort(X.col1, rev=true)

5-element Vector{Int64}:
 19
 10
  9
  6
  4

To print out the entire dataframe for viewing

In [13]:
sort(X, "col1") 
# or 
sort(X, 1)
# or - this is the slowest one (note that :1 is just the integer 1, so this is sort(X, [1]))
sort(X, [:1])

Row,col1,col2,col3
Unnamed: 0_level_1,Int64,Int64,Int64?
1,4,15,5
2,6,17,missing
3,9,9,4
4,10,22,21
5,19,5,missing


### Ordering
Ordering is used in conjunction with sorting, as it allows us to specify the sort order for each column in the DataFrame, e.g. sort by the first column and reverse it

In R; 
```R
X[order(X$var1, X$var3),] 
```

Now in Julia, based on the help information

In [14]:
sort(X, order("col1", rev=true))

Row,col1,col2,col3
Unnamed: 0_level_1,Int64,Int64,Int64?
1,19,5,missing
2,10,22,21
3,9,9,4
4,6,17,missing
5,4,15,5


We can also sort on several columns at once and give each its own direction. Note that **sort!()** (with the bang) sorts X in place, so X stays in this order from here on

In [15]:
sort!(X, ["col1", "col2"], rev=[true, false])

Row,col1,col2,col3
Unnamed: 0_level_1,Int64,Int64,Int64?
1,19,5,missing
2,10,22,21
3,9,9,4
4,6,17,missing
5,4,15,5
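
The same mixed-direction sort can also be written with one **order()** call per column. A small sketch on made-up data (the dataframe `df` is an assumption for illustration), using the non-mutating `sort` so the original stays as it was:

```julia
using DataFrames

# Toy data with ties in col1, made up for illustration
df = DataFrame(col1 = [2, 1, 2, 1], col2 = [10, 40, 30, 20])

# col1 descending, then col2 ascending within ties; returns a new dataframe
sorted = sort(df, [order(:col1, rev = true), order(:col2)])

# df itself is untouched because we used sort, not sort!
println(sorted)
```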


### Adding rows and columns 
Adding rows and columns is a very common procedure - it should become as comfortable as adding sides to the playdough structure that we've made.

In R;
```R
X$var4 <- rnorm(5)
```

In Julia, a very basic way to do this is via assignment: we can assign to a column which doesn't exist yet, but soon will, and provide the data which will fill the column

In [16]:
X.col4 = rand(100:200, 5)

5-element Vector{Int64}:
 162
 175
 148
 162
 151

In [17]:
X

Row,col1,col2,col3,col4
Unnamed: 0_level_1,Int64,Int64,Int64?,Int64
1,19,5,missing,162
2,10,22,21,175
3,9,9,4,148
4,6,17,missing,162
5,4,15,5,151
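
The heading promises rows as well as columns, so here is a hedged sketch of a few other DataFrames.jl helpers on made-up data (`df`, `more` and the `id` column are assumptions for illustration): `insertcols!` to place a column at a chosen position, `push!` to append a row, and `vcat` to stack two dataframes.

```julia
using DataFrames

# Toy data, made up for illustration
df = DataFrame(col1 = [19, 10], col2 = [5, 22])

# Add a column at a chosen position (here: first)
insertcols!(df, 1, :id => ["a", "b"])

# Add a row; the named tuple's fields must match the column names
push!(df, (id = "c", col1 = 9, col2 = 9))

# Stack a second dataframe with the same columns underneath
more = DataFrame(id = ["d"], col1 = [6], col2 = [17])
combined = vcat(df, more)
```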


## Summarising Data 
We'll be looking at different ways of providing a snapshot of the general big picture of our datasets - the averages, limits, deviations and so on.  

Let's do the basics, the beginning and end of the datasets; 
In R; 
```R
head(data, n=3)
tail(data, n=5)
``` 

In Julia

In [18]:
first(X)

Row,col1,col2,col3,col4
Unnamed: 0_level_1,Int64,Int64,Int64?,Int64
1,19,5,missing,162


In [19]:
last(X)

Row,col1,col2,col3,col4
Unnamed: 0_level_1,Int64,Int64,Int64?,Int64
5,4,15,5,151
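
To mirror the `n=` argument of the R calls, `first` and `last` accept a row count as a second argument. A minimal sketch on a made-up dataframe (`df` is an assumption for illustration):

```julia
using DataFrames

# Toy data, made up for illustration
df = DataFrame(a = 1:10, b = 11:20)

top3 = first(df, 3)     # like head(data, n=3)
bottom5 = last(df, 5)   # like tail(data, n=5)

one_row = first(df)     # without n we get a single DataFrameRow, not a dataframe
```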


To get a brief summary of the data with descriptive stats and other information such as the types of the variables in the columns, we can use `summary(data)` in R and in Julia we can use **describe()**

In [20]:
describe(X)

Row,variable,mean,min,median,max,nmissing,eltype
Unnamed: 0_level_1,Symbol,Float64,Int64,Float64,Int64,Int64,Type
1,col1,9.6,4,9.0,19,0,Int64
2,col2,13.6,5,15.0,22,0,Int64
3,col3,10.0,4,5.0,21,2,"Union{Missing, Int64}"
4,col4,159.6,148,162.0,175,0,Int64


In R you can also use the `str(data)` command; in Julia, **typeof()** shows the type of a single column

In [21]:
typeof(X.col1)

Vector{Int64} (alias for Array{Int64, 1})

To get the quantiles of a vector, in R we have the base function `quantile(data$column, na.rm=TRUE)` and in Julia we have to use the **Statistics.jl** package to add this functionality. Remember Julia is a more general language compared to R, which was always tailored towards statistical computing - see https://www.jlhub.com/julia/manual/en/function/quantile-exclamation 

In [22]:
using Statistics

In [333]:
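# Print quarter quantiles
# NB: quantile! (with the bang) reorders X.col1 in place, which shuffles that column
# relative to the others in X - use quantile(X.col1, ...) to leave the data untouched
quantile!(X.col1, [0.25, 0.5, 0.75, 1])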

4-element Vector{Float64}:
  6.0
  9.0
 10.0
 19.0

To skip the missing values and print the median value

In [24]:
quantile(skipmissing(X.col3), 0.5) 

5.0
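
Because a function ending in `!` is allowed to reorder its input, the plain `quantile` is usually the safer choice on a dataframe column. A small sketch, assuming a standalone vector `v` with the same values as the first column (the names `v`, `probs`, `q`, `q_bang` and `m` are made up for illustration):

```julia
using Statistics

v = [19, 10, 9, 6, 4]
probs = [0.25, 0.5, 0.75, 1.0]

q = quantile(v, probs)               # works on an internal copy, v is left alone
q_bang = quantile!(copy(v), probs)   # the in-place version, pointed at a throwaway copy

# Missing values have to be skipped explicitly
m = median(skipmissing([missing, missing, 5, 4, 21]))
```

Both calls give the same quantiles; the only difference is what happens to the input vector.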

### Checking for missing values
In R, count the number of missing values 
```R
sum(is.na(data$column))
```
Check if **any** NA values are present
```R
any(is.na(data$column))
```
Test to see whether all the values meet a certain condition (over 0)
```R
all(data$column > 0)
```

In Julia, count the missing values by passing a one-line anonymous function to **sum()**

In [25]:
sum(x -> ismissing(x), X.col3)

2

Check whether any missing values are in there

In [26]:
any(x -> ismissing(x), X.col3)

true

Check whether all the values meet a certain condition

In [27]:
all(x -> ismissing(x), X.col1)

false

In [28]:
all(x -> x > 0, X.col1)

true

A cool little function in Julia to extract only the rows of the dataframe which contain missing values

In [29]:
filter(x -> any(ismissing, x), X)

Row,col1,col2,col3,col4
Unnamed: 0_level_1,Int64,Int64,Int64?,Int64
1,4,5,missing,162
2,10,17,missing,162
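
DataFrames.jl also ships a few helpers aimed at exactly this job. The sketch below uses a made-up dataframe (`df` and its values are assumptions for illustration) to show `count` with `ismissing`, `completecases` and `dropmissing`.

```julia
using DataFrames

# Toy data, made up for illustration
df = DataFrame(col1 = [4, 6, 9], col3 = [missing, 21, missing])

n_missing = count(ismissing, df.col3)   # number of missing values in one column
complete = completecases(df)            # true where a row has no missing values
cleaned = dropmissing(df)               # keep only the complete rows
only_missing = df[.!complete, :]        # the rows that contain a missing value
```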


Perform a quick sum across the columns in a horizontal fashion - intuitively this means summing each row and producing a new total at the end of that row e.g. |1|2|3|6 (final)|. This can be a very quick way of checking whether there are any missing values, as the missing values will propagate across!

In R;
```R
rowSums(data)          # row-wise sums - an NA anywhere in the row propagates
colSums(is.na(data))   # count of NA values in each column
```

In Julia

In [30]:
sum(eachcol(X))

5-element Vector{Union{Missing, Int64}}:
    missing
 224
 170
    missing
 190

In Julia, if we want to sum each column, meaning every value in the column vertically, we can collect the columns and then broadcast **sum()** over them, OR we can just broadcast **sum()** over **eachcol()** directly - fascinating but easily confusing! 

In [31]:
sum.(collect(eachcol(X)))

4-element Vector{Union{Missing, Int64}}:
  48
  68
    missing
 798

In [32]:
sum.(eachcol(X))

4-element Vector{Union{Missing, Int64}}:
  48
  68
    missing
 798

The row-wise equivalent broadcasts **sum()** over **eachrow()**

In [33]:
sum.(eachrow(X))

5-element Vector{Union{Missing, Int64}}:
    missing
 224
 170
    missing
 190

If we want to skip missing values when doing these operations we can broadcast **skipmissing()** across the rows first

In [34]:
sum.(skipmissing.(eachrow(X)))

5-element Vector{Int64}:
 171
 224
 170
 189
 190
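
If we would rather keep the row totals inside the dataframe, one option is `transform` with `AsTable` and `ByRow`. This is a sketch on made-up data (`df` and the column name `row_total` are assumptions for illustration), not the only way to do it.

```julia
using DataFrames

# Toy data, made up for illustration
df = DataFrame(a = [1, 2, 3], b = [missing, 5, 6], c = [7, 8, missing])

# Row totals as a new column, ignoring missing values.
# AsTable(:) passes each row to the function as a named tuple.
with_total = transform(df, AsTable(:) => ByRow(row -> sum(skipmissing(row))) => :row_total)

# Column totals, ignoring missing values
col_totals = [sum(skipmissing(col)) for col in eachcol(df)]
```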

### Subsetting the dataframe based upon values in the columns
Say for example that we only want the data which have a specific zipcode (generic value) in a column, what can we do? In R;
```R
data[data$zipCode %in% c("4109", "4110"),] 
```

In Julia - get a dataframe in which the values in the first column are 1 

In [35]:
filter(row -> row.col1 == 1, X)

Row,col1,col2,col3,col4
Unnamed: 0_level_1,Int64,Int64,Int64?,Int64


Now another one wherein the values are larger than 1 and smaller than 15

In [36]:
filter(row -> row.col1 > 1 && row.col1 < 15, X)

Row,col1,col2,col3,col4
Unnamed: 0_level_1,Int64,Int64,Int64?,Int64
1,4,5,missing,162
2,6,22,21,175
3,9,9,4,148
4,10,17,missing,162
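
The R example above tests membership with `%in%`. A close Julia counterpart is the one-argument form of `in`, sketched here on made-up data (`df`, `wanted` and the zip codes are assumptions for illustration):

```julia
using DataFrames

# Toy data, made up for illustration
df = DataFrame(zipCode = [4109, 4000, 4110, 4101], name = ["a", "b", "c", "d"])

wanted = [4109, 4110]

# in(wanted) builds a function that tests membership in `wanted`
by_filter = filter(:zipCode => in(wanted), df)
by_subset = subset(df, :zipCode => ByRow(in(wanted)))
```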


## Cross Tabulation aka Frequency Tables
In order to provide small snapshots of potential interactions and relations, we can use cross-tabulation or frequency comparisons between variables, say, gender and acceptance rates to university

In R we have some base functions;
```R
xt <- xtabs(Freq ~ Gender + Admit, data=DF)
```

In Julia we have to load a specific package called **FreqTables** https://github.com/nalimilan/FreqTables.jl

In [37]:
using Pkg; Pkg.add("FreqTables") ; using FreqTables

    Updating registry at `~/.julia/registries/General.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`


Do a frequency table between columns 1 and 4 - clearly there is not much here to see given both are randomly generated vectors

In [38]:
freqtable(X, :col1, :col4)

5×4 Named Matrix{Int64}
col1 ╲ col4 │ 148  151  162  175
────────────┼───────────────────
4           │   0    0    1    0
6           │   0    0    0    1
9           │   1    0    0    0
10          │   0    0    1    0
19          │   0    1    0    0
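
If we don't want an extra package, a similar table can be built with plain DataFrames.jl by counting rows per group and then pivoting. A sketch on made-up admissions data (the dataframe `df`, its values and the count column `n` are assumptions for illustration):

```julia
using DataFrames

# Toy data, made up for illustration
df = DataFrame(Gender = ["Male", "Female", "Male", "Female", "Male"],
               Admit  = ["Admitted", "Rejected", "Rejected", "Rejected", "Admitted"])

# Long format: one row per Gender/Admit combination with its count
counts = combine(groupby(df, [:Gender, :Admit]), nrow => :n)

# Wide format: Gender down the side, Admit across the top, 0 where a combination is absent
wide = unstack(counts, :Gender, :Admit, :n; fill = 0)
```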

## Size of the data in human readable form 
Very simple and yet very informative information - how big is our data?
In R;
```R
print(object.size(data), units="Mb")
```

In Julia, we can use **varinfo()**
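
`varinfo()` lists every variable in the session. For the size of one object, `Base.summarysize` is an option; the sketch below uses a made-up dataframe (`df` and its contents are assumptions for illustration) and converts the byte count by hand.

```julia
using DataFrames

# Toy data, made up for illustration
df = DataFrame(a = rand(1_000), b = rand(1:10, 1_000))

bytes = Base.summarysize(df)             # memory reachable from df, in bytes
mib = round(bytes / 2^20, digits = 3)    # convert to mebibytes
println("df takes about $mib MiB")
```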

## Creating New Variables
Often our datasets may consist of variables and values which we need to prune, transform and mould to our liking - perhaps they have broken delimiters, or are a combination of two values in one, or are better represented in a different form - and so on. These tasks require us to create new variables and add these to the dataset - remember that we should always keep copies of the original dataset and not simply mutate it into oblivion. 

Let's take some restaurant data from Baltimore city to use as the sample dataset

In [39]:
download("https://gist.githubusercontent.com/slowteetoe/528c78213fcd80f05419/raw/e0a4a89476fca79e692df1a373e5025f5112a5f6/restaurants.csv", "restaurants.csv") 

"restaurants.csv"

In [40]:
rest_data = CSV.File("restaurants.csv") |> DataFrame
# We can also do 
# CSV.read(open("file.csv"), DataFrame)

Row,name,zipCode,neighborhood,councilDistrict,policeDistrict,Location 1
Unnamed: 0_level_1,String,Int64,String,Int64,String15,String
1,410,21206,Frankford,2,NORTHEASTERN,"4509 BELAIR ROAD\nBaltimore, MD\n"
2,1919,21231,Fells Point,1,SOUTHEASTERN,"1919 FLEET ST\nBaltimore, MD\n"
3,SAUTE,21224,Canton,1,SOUTHEASTERN,"2844 HUDSON ST\nBaltimore, MD\n"
4,#1 CHINESE KITCHEN,21211,Hampden,14,NORTHERN,"3998 ROLAND AVE\nBaltimore, MD\n"
5,#1 chinese restaurant,21223,Millhill,9,SOUTHWESTERN,"2481 frederick ave\nBaltimore, MD\n"
6,19TH HOLE,21218,Clifton Park,14,NORTHEASTERN,"2722 HARFORD RD\nBaltimore, MD\n"
7,3 KINGS,21205,McElderry Park,13,SOUTHEASTERN,"2510 MCELDERRY ST\nBaltimore, MD\n"
8,"3 MILES HOUSE, INC.",21211,Remington,7,NORTHERN,"2701 MILES AVE\nBaltimore, MD\n"
9,3 W'S TAVERN,21205,McElderry Park,13,SOUTHEASTERN,"2518 MONUMENT ST\nBaltimore, MD\n"
10,300 SOUTH ANN STREET,21231,Upper Fells Point,1,SOUTHEASTERN,"300 ANN ST\nBaltimore, MD\n"


### Creating sequences 
We can create sequences to use as indexes for extracting data e.g. 
`rest_data[index sequence]`  

In R we can create a sequence of integers with a step size of 2 by;
```R
seq_one <- seq(1,10, by=2)
```
or also using the **c()** combine function and then generating an index along the result
```R
seq_c <- c(1,3,8,25,100) ; seq(along = seq_c)
``` 

In Julia we can do it using the **collect()** function as well - with the start and stop on either side and the step size in the middle

In [41]:
seq_one = collect(1:3:10)

4-element Vector{Int64}:
  1
  4
  7
 10

Or just create a vector directly 

In [42]:
seq_vec = Vector(1:3:10)

4-element Vector{Int64}:
  1
  4
  7
 10
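
A few more ways of writing and using these sequences, as a sketch in base Julia (the variable names and the vector `x` are made up for illustration): the range literal, the `range` function, `eachindex` as a rough stand-in for R's `seq(along = x)`, and indexing with a range directly.

```julia
r = 1:2:10                # start:step:stop, nothing is allocated yet
odd_numbers = collect(r)  # materialise it into a Vector

same_range = range(1, stop = 10, step = 2)   # keyword form of the same range

x = [1, 3, 8, 25, 100]
idx = eachindex(x)        # indices along x, similar to R's seq(along = x)

every_other = x[1:2:end]  # a range can be used directly as an index
```

In most cases there is no need to `collect` a range before indexing with it.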

Say we want to create a new variable which indicates whether each record meets a certain condition, say, it is in the neighbourhoods of Roland Park or Homeland, and thus close to me - if it meets this condition it is assigned a "TRUE" value, otherwise a "FALSE" value, in a new column - how would we do this?

In R;
```R
restData$nearMe = restData$neighborhood %in% c("Roland Park", "Homeland")
```

In Julia - this is a new one for me! - make sure we wrap the lookup values in an outer array [[]], so that broadcasting treats them as a single collection instead of trying to pair them up element by element with the column

In [43]:
rest_data.nearMe = in.(rest_data.neighborhood, [["Roland Park", "Homeland"]])

1327-element BitVector:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 1
 0
 0
 0
 0
 0
 0
 0
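
The double brackets are one of several ways to stop broadcasting from iterating over the lookup values. A sketch in base Julia on a made-up vector (`neighborhood` and `near` are assumptions for illustration):

```julia
neighborhood = ["Frankford", "Roland Park", "Canton", "Homeland"]
near = ["Roland Park", "Homeland"]

wrapped_vec = in.(neighborhood, [near])             # one-element outer vector
wrapped_ref = in.(neighborhood, Ref(near))          # Ref marks `near` as a single value
comprehension = [n in near for n in neighborhood]   # no broadcasting at all
```

All three give the same true/false vector, so the choice is mostly about readability.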

R base has the very handy function table() which tallies the values of a variable and counts them in a neat table output - Julia base doesn't really have this as a single function so we have to use the **StatsBase** package and the **countmap()** function

In [44]:
Pkg.add("StatsBase") ; using StatsBase

   Resolving package versions...
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`


In [45]:
# 0 = false, 1 = true (Bool keys are printed as 0/1 here)
# perhaps we can convert the 0,1 to TRUE/FALSE?
qz = countmap(rest_data.nearMe)
println(qz)

Dict{Bool, Int64}(0 => 1314, 1 => 13)


### Creating binary variables 
We can use the **ifelse()** function with broadcasting to evaluate a conditional statement and store true or false in a new variable based upon the outcome of the condition. Here we'll do it in R first;
```R
restData$wrongZip = ifelse(restData$zipCode < 0, TRUE, FALSE) 
```    
And then in Julia

In [46]:
rest_data.wrongZip = ifelse.(rest_data.zipCode .< 0, true, false)

1327-element BitVector:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0

In [47]:
countmap(rest_data.wrongZip)

Dict{Bool, Int64} with 2 entries:
  0 => 1326
  1 => 1
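
Worth noting: when the two outcomes are simply true and false, the broadcast comparison already gives the answer, and `ifelse` becomes useful once the outcomes are something else. A sketch in base Julia on a made-up vector (`zip_codes` and the labels are assumptions for illustration):

```julia
zip_codes = [21206, -21226, 21231]

flag_ifelse = ifelse.(zip_codes .< 0, true, false)
flag_plain = zip_codes .< 0          # the comparison is already a Bool vector

# ifelse earns its keep when the two outcomes are not just true/false
label = ifelse.(zip_codes .< 0, "wrong", "ok")
```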

### Creating categorical variables
We may want to summarise certain aspects of our dataset by chunking them into categorical blocks - similar to percentiles. Say we want to get a look at the distribution of gene lengths or zip codes in our dataset, at quartile ranges - we can turn to categorical variables

Create them in Julia https://categoricalarrays.juliadata.org/v0.1/using.html

In [48]:
Pkg.add("CategoricalArrays") ; using CategoricalArrays

   Resolving package versions...
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`


In [49]:
cut(X.col1, 4)

5-element CategoricalArray{String,1,UInt32}:
 "Q1: [4.0, 6.0)"
 "Q2: [6.0, 9.0)"
 "Q3: [9.0, 10.0)"
 "Q4: [10.0, 19.0]"
 "Q4: [10.0, 19.0]"

In [50]:
 countmap(cut(rest_data.zipCode, 4))

Dict{CategoricalValue{String, UInt32}, Int64} with 4 entries:
  "Q4: [21225.5, 21287.0]"  => 332
  "Q2: [21202.0, 21218.0)"  => 507
  "Q3: [21218.0, 21225.5)"  => 351
  "Q1: [-21226.0, 21202.0)" => 137

In R let's use the **Hmisc** library 
```R
library(Hmisc)
restData$zipGroups = cut2(restData$zipCode, g=4)
table(restData$zipGroups)
```
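
Besides asking for a number of quantile groups, `cut` in CategoricalArrays also accepts explicit break points. A sketch on a made-up vector (`lengths` and the breaks are assumptions for illustration); the label text differs between package versions, so the example looks at the bin codes instead.

```julia
using CategoricalArrays

lengths = [4, 6, 9, 10, 19]

# Bins are closed on the left and open on the right: [0, 5), [5, 10), [10, 20)
binned = cut(lengths, [0, 5, 10, 20])

codes = levelcode.(binned)   # which bin (1, 2 or 3) each value landed in
n_bins = length(levels(binned))
```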

When we turn our data into categorical variables in R, we turn them into factors

### Reshaping data 
The notions of split-apply-combine are prevalent in data analytics, as they are terrific guiding principles for approaching data frames. They were originally popularised in R and built around R packages, but are now supported by plenty of packages in most high level programming languages. 

We'll start with R and the basic **reshape2** library 
```R
library(reshape2)
head(mtcars) #standard dataset in R
``` 

We can use some of the popular R datasets in Julia using the **RDatasets.jl** package - as we can see almost everything is supported in Julia

In [51]:
Pkg.add("RDatasets") ; using RDatasets

   Resolving package versions...
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`


Load the mtcars dataset from "datasets" 

In [52]:
mtcars = dataset("datasets", "mtcars")

Row,Model,MPG,Cyl,Disp,HP,DRat,WT,QSec,VS,AM,Gear,Carb
Unnamed: 0_level_1,String31,Float64,Int64,Float64,Int64,Float64,Float64,Float64,Int64,Int64,Int64,Int64
1,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
2,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
3,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
4,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
5,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
6,Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
7,Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
8,Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
9,Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
10,Merc 280,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4


In [53]:
first(mtcars, 5)

Row,Model,MPG,Cyl,Disp,HP,DRat,WT,QSec,VS,AM,Gear,Carb
Unnamed: 0_level_1,String31,Float64,Int64,Float64,Int64,Float64,Float64,Float64,Int64,Int64,Int64,Int64
1,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
2,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
3,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
4,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
5,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


In [54]:
# descriptive stats
describe(mtcars)

Row,variable,mean,min,median,max,nmissing,eltype
Unnamed: 0_level_1,Symbol,Union…,Any,Union…,Any,Int64,DataType
1,Model,,AMC Javelin,,Volvo 142E,0,String31
2,MPG,20.0906,10.4,19.2,33.9,0,Float64
3,Cyl,6.1875,4,6.0,8,0,Int64
4,Disp,230.722,71.1,196.3,472.0,0,Float64
5,HP,146.688,52,123.0,335,0,Int64
6,DRat,3.59656,2.76,3.695,4.93,0,Float64
7,WT,3.21725,1.513,3.325,5.424,0,Float64
8,QSec,17.8487,14.5,17.71,22.9,0,Float64
9,VS,0.4375,0,0.0,1,0,Int64
10,AM,0.40625,0,0.0,1,0,Int64


In [55]:
# column names
names(mtcars)

12-element Vector{String}:
 "Model"
 "MPG"
 "Cyl"
 "Disp"
 "HP"
 "DRat"
 "WT"
 "QSec"
 "VS"
 "AM"
 "Gear"
 "Carb"

### "Melting" dataframes in R / "Reshaping" in Julia
The main idea behind melting is the presence of "id" and "value" variables. The ID variables represent the consistent identifiers we want to retain, and the value variables will be "stacked" on top of each other into a single column - almost like a sub-data frame. It's hard to imagine but easier to understand once we play with it a little bit ourselves. When performing the melt and stack functions, we always need to designate these two sets of variables. In R, the id variables come first and the measure variables second, whereas in Julia it is the other way around. Let's take a look in R;
```R
mtcars$carname <- rownames(mtcars) # looks redundant based on julia code 
carMelt <- melt(mtcars, id=c("carname", "gear", "cyl"), measure.vars=c("mpg", "hp"))
```

In Julia this is performed with the DataFrames package and **stack()**

In [56]:
mt_stack = stack(mtcars, [:MPG, :HP], [:Model, :Gear, :Cyl])

Row,Model,Gear,Cyl,variable,value
Unnamed: 0_level_1,String31,Int64,Int64,String,Float64
1,Mazda RX4,4,6,MPG,21.0
2,Mazda RX4 Wag,4,6,MPG,21.0
3,Datsun 710,4,4,MPG,22.8
4,Hornet 4 Drive,3,6,MPG,21.4
5,Hornet Sportabout,3,8,MPG,18.7
6,Valiant,3,6,MPG,18.1
7,Duster 360,3,8,MPG,14.3
8,Merc 240D,4,4,MPG,24.4
9,Merc 230,4,4,MPG,22.8
10,Merc 280,4,6,MPG,19.2


### Casting dataframes

Imagine we want to get a quick overview of how many values of a certain variable we have in our dataset - say, the distribution of miles per gallon and horse power based upon the cylinders a car has - the relation between cylinders and engine performance : Cylinders ~ engine. In R we would use **dcast()**; 
```R
cylData <- dcast(carMelt, cyl ~ variable) 
```

This is a little trickier in Julia as there's not as much supporting documentation yet, but this is called a "pivot table" in Julia - Bogumil (who else?!) has a great post here https://www.juliabloggers.com/pivot-tables-in-dataframes-jl/ 

In [57]:
unstack(mt_stack, :Cyl, :variable, :Cyl, combine=length)

Row,Cyl,MPG,HP
Unnamed: 0_level_1,Int64,Int64?,Int64?
1,6,7,7
2,4,11,11
3,8,14,14


Now let's get the mean of these measures, rather than their length?

In [58]:
unstack(mt_stack, :Cyl, :variable, :value, combine=mean)

Row,Cyl,MPG,HP
Unnamed: 0_level_1,Int64,Float64?,Float64?
1,6,19.7429,122.286
2,4,26.6636,82.6364
3,8,15.1,209.214
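
Since the mtcars output is long, the whole stack and unstack round trip is easier to follow on a tiny table. This is a sketch on made-up data (the dataframe `cars` and its values are assumptions for illustration) using the same calls as above.

```julia
using DataFrames, Statistics

# Toy data, made up for illustration
cars = DataFrame(Model = ["A", "B", "C"],
                 Cyl   = [4, 4, 6],
                 MPG   = [20.0, 30.0, 21.0],
                 HP    = [90.0, 110.0, 100.0])

# Wide -> long: measure variables first, id variables second
long = stack(cars, [:MPG, :HP], [:Model, :Cyl])

# Long -> wide again, averaging all values that share a Cyl/variable pair
by_cyl = unstack(long, :Cyl, :variable, :value, combine = mean)
```

Here the two 4-cylinder cars collapse into one row holding the mean of their MPG and HP.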


### Averaging values

In [59]:
# sum. over single integers returns them unchanged, so this is the same as countmap(mtcars.AM)
countmap(sum.(mtcars.AM))

Dict{Int64, Int64} with 2 entries:
  0 => 19
  1 => 13

In [60]:
groupby(mtcars, :Cyl)

Row,Model,MPG,Cyl,Disp,HP,DRat,WT,QSec,VS,AM,Gear,Carb
Unnamed: 0_level_1,String31,Float64,Int64,Float64,Int64,Float64,Float64,Float64,Int64,Int64,Int64,Int64
1,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
2,Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
3,Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
4,Fiat 128,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
5,Honda Civic,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
6,Toyota Corolla,33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
7,Toyota Corona,21.5,4,120.1,97,3.7,2.465,20.01,1,0,3,1
8,Fiat X1-9,27.3,4,79.0,66,4.08,1.935,18.9,1,1,4,1
9,Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
10,Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2

Row,Model,MPG,Cyl,Disp,HP,DRat,WT,QSec,VS,AM,Gear,Carb
Unnamed: 0_level_1,String31,Float64,Int64,Float64,Int64,Float64,Float64,Float64,Int64,Int64,Int64,Int64
1,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
2,Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
3,Merc 450SE,16.4,8,275.8,180,3.07,4.07,17.4,0,0,3,3
4,Merc 450SL,17.3,8,275.8,180,3.07,3.73,17.6,0,0,3,3
5,Merc 450SLC,15.2,8,275.8,180,3.07,3.78,18.0,0,0,3,3
6,Cadillac Fleetwood,10.4,8,472.0,205,2.93,5.25,17.98,0,0,3,4
7,Lincoln Continental,10.4,8,460.0,215,3.0,5.424,17.82,0,0,3,4
8,Chrysler Imperial,14.7,8,440.0,230,3.23,5.345,17.42,0,0,3,4
9,Dodge Challenger,15.5,8,318.0,150,2.76,3.52,16.87,0,0,3,2
10,AMC Javelin,15.2,8,304.0,150,3.15,3.435,17.3,0,0,3,2
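
`groupby` on its own only splits the data; to actually average values per group we can follow it with `combine`. A sketch on made-up data (the dataframe `cars` and the output column names are assumptions for illustration):

```julia
using DataFrames, Statistics

# Toy data, made up for illustration
cars = DataFrame(Cyl = [4, 4, 6, 8, 8],
                 MPG = [22.0, 24.0, 21.0, 15.0, 17.0],
                 HP  = [90.0, 100.0, 110.0, 200.0, 220.0])

groups = groupby(cars, :Cyl)   # split

# apply + combine: one row per group with the mean of each measure
means = combine(groups, :MPG => mean => :mean_MPG, :HP => mean => :mean_HP)
```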


## DataFrames.jl 
Almost everything covered herein appears to be described somewhere on the DataFrames.jl package site https://dataframes.juliadata.org/stable/man/split_apply_combine/ and/or on the DataFramesMeta site https://juliadata.org/DataFramesMeta.jl/stable/dplyr/ . It may take some googling and forum scouring, but rest assured that the developers have considered many many procedures and functions. As such it is best to work through Bogumil's book and create your own projects that you're interested in - passion and curiosity as the backbones of science and inquiry! 

## Merging/Joining data 
Joining datasets together is a common operation, especially within SQL - we can think of the common terminology of inner join, outer join and so on that's almost synonymous with databases. In R let's merge some data; 
```R
mergedData = merge(reviews, solutions, by.x="solution_id", by.y="id", all=TRUE)
```

In Julia let's create some mock data 

In [61]:
people = DataFrame(ID=[20, 40], Name=["John Doe", "Jane Doe"])

Row,ID,Name
Unnamed: 0_level_1,Int64,String
1,20,John Doe
2,40,Jane Doe


In [62]:
jobs = DataFrame(ID=[20, 40], Job=["Lawyer", "Doctor"]) 

Row,ID,Job
Unnamed: 0_level_1,Int64,String
1,20,Lawyer
2,40,Doctor


Say there is a common ID shared between datasets: we can take advantage of this ID and 'sync' our data around it. In some sense this is like performing a union operation on a set - the common element is never duplicated. 

**Inner joins are "the output contains rows for values of the key that exist in all passed data frames."**

In [63]:
innerjoin(people, jobs, on = :ID)

Row,ID,Name,Job
Unnamed: 0_level_1,Int64,String,String
1,20,John Doe,Lawyer
2,40,Jane Doe,Doctor


In [64]:
intersect(names(people), names(jobs))

1-element Vector{String}:
 "ID"

In [65]:
breaks = DataFrame(ID=[20, 60], Fruit=["Apple", "Watermelon"])

Row,ID,Fruit
Unnamed: 0_level_1,Int64,String
1,20,Apple
2,60,Watermelon


To merge all of the datasets together based upon a shared variable, even if not all of the rows are present in the other datasets, we can use the **outerjoin()** method 

In [66]:
outerjoin(breaks, people, jobs, on = :ID)

Row,ID,Fruit,Name,Job
Unnamed: 0_level_1,Int64,String?,String?,String?
1,20,Apple,John Doe,Lawyer
2,40,missing,Jane Doe,Doctor
3,60,Watermelon,missing,missing
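
Between the inner and outer joins sit a few other flavours. The sketch below rebuilds two of the mock tables from above and shows `leftjoin` and `antijoin` next to `innerjoin` (the result names `inner`, `left` and `anti` are made up for illustration):

```julia
using DataFrames

people = DataFrame(ID = [20, 40], Name = ["John Doe", "Jane Doe"])
breaks = DataFrame(ID = [20, 60], Fruit = ["Apple", "Watermelon"])

inner = innerjoin(people, breaks, on = :ID)   # IDs present in both tables: 20
left  = leftjoin(people, breaks, on = :ID)    # every person; Fruit is missing for ID 40
anti  = antijoin(people, breaks, on = :ID)    # people without a match in breaks: 40
```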


In R, if we leave out the `by` arguments, merge() joins on all of the column names that the two data frames share;
```R
mergedData2 <- merge(reviews, solutions, all=TRUE)
```
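
Both R merges have a rough counterpart in DataFrames.jl. This is a sketch on made-up tables (`reviews`, `solutions`, `a`, `b` and all their columns are assumptions for illustration): a pair in `on` handles key columns with different names, and `intersect(names(...))` gives the shared column names to join on.

```julia
using DataFrames

# Keys with different names: left column => right column
reviews   = DataFrame(review_id = [1, 2, 3], solution_id = [10, 20, 30])
solutions = DataFrame(id = [10, 20, 40], answer = ["a", "b", "c"])
merged_keys = outerjoin(reviews, solutions, on = :solution_id => :id)

# Join on every column name the two tables share
a = DataFrame(ID = [1, 2], grp = ["x", "y"], left_val = [0.1, 0.2])
b = DataFrame(ID = [1, 2], grp = ["x", "z"], right_val = [10, 20])
shared = intersect(names(a), names(b))
merged_shared = outerjoin(a, b, on = shared)
```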

# Quiz

## 1. 
The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here: 

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv

and load the data into R. The code book, describing the variable names is here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf 

Create a logical vector that identifies the households on greater than 10 acres who sold more than $10,000 worth of agriculture products. Assign that logical vector to the variable agricultureLogical. Apply the which() function like this to identify the rows of the data frame where the logical vector is TRUE. 

```R
which(agricultureLogical) 
```

What are the first 3 values that result? 

In [67]:
download("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv", "quiz3_q1.csv")

"quiz3_q1.csv"

In [94]:
first_q = CSV.File("quiz3_q1.csv") |> DataFrame

Row,RT,SERIALNO,DIVISION,PUMA,REGION,ST,ADJUST,WGTP,NP,TYPE,ACR,AGS,BDS,BLD,BUS,CONP,ELEP,FS,FULP,GASP,HFL,INSP,KIT,MHP,MRGI,MRGP,MRGT,MRGX,PLM,RMS,RNTM,RNTP,SMP,TEL,TEN,VACS,VAL,VEH,WATP,YBL,FES,FINCP,FPARC,GRNTP,GRPIP,HHL,HHT,HINCP,HUGCL,HUPAC,HUPAOC,HUPARC,LNGI,MV,NOC,NPF,NPP,NR,NRC,OCPIP,PARTNER,PSF,R18,R60,R65,RESMODE,SMOCP,SMX,SRNT,SVAL,TAXP,WIF,WKEXREL,WORKSTAT,FACRP,FAGSP,FBDSP,FBLDP,FBUSP,FCONP,FELEP,FFSP,FFULP,FGASP,FHFLP,FINSP,FKITP,FMHP,FMRGIP,FMRGP,FMRGTP,FMRGXP,FMVYP,FPLMP,FRMSP,FRNTMP,FRNTP,FSMP,FSMXHP,FSMXSP,⋯
Unnamed: 0_level_1,String1,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,⋯
1,H,186,8,700,4,16,1015675,89,4,1,1,missing,4,2,2,missing,180,0,2,3,3,600,1,missing,1,1300,1,1,1,9,missing,missing,missing,1,1,missing,17,3,840,5,2,105600,2,missing,missing,1,1,105600,0,2,2,2,1,4,2,4,0,0,2,18,0,0,1,0,0,1,1550,3,0,1,24,3,2,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
2,H,306,8,700,4,16,1015675,310,1,1,missing,missing,1,7,missing,missing,60,0,2,3,3,missing,1,missing,missing,missing,missing,missing,1,2,2,600,missing,1,3,missing,missing,1,1,3,missing,missing,missing,660,23,1,4,34000,0,4,4,4,1,3,0,missing,0,0,0,missing,0,0,0,0,0,2,missing,missing,1,0,missing,missing,missing,missing,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
3,H,395,8,100,4,16,1015675,106,2,1,1,missing,3,2,2,missing,70,0,2,30,1,200,1,missing,missing,missing,missing,3,1,7,missing,missing,missing,1,2,missing,18,2,50,5,7,9400,2,missing,missing,1,3,9400,0,2,2,2,1,2,1,2,0,0,1,23,0,0,1,0,0,1,179,missing,0,1,16,1,13,13,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
4,H,506,8,700,4,16,1015675,240,4,1,1,missing,4,2,2,missing,40,0,2,80,1,200,1,missing,1,860,1,1,1,6,missing,missing,400,1,1,missing,19,3,500,2,1,66000,1,missing,missing,1,1,66000,0,1,1,1,1,3,2,4,0,0,2,26,0,0,1,0,0,2,1422,1,0,1,31,2,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
5,H,835,8,800,4,16,1015675,118,4,1,2,1,5,2,2,missing,250,0,2,3,3,700,1,missing,1,1900,1,1,1,7,missing,missing,650,1,1,missing,20,5,2,3,1,93000,2,missing,missing,1,1,93000,0,2,2,2,1,1,1,4,0,0,1,36,0,0,1,0,0,1,2800,1,0,1,25,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
6,H,989,8,700,4,16,1015675,115,4,1,1,missing,3,2,2,missing,130,0,2,3,3,250,1,missing,1,700,1,1,1,6,missing,missing,400,1,1,missing,15,2,1200,5,2,61000,1,missing,missing,1,1,61000,0,1,1,1,1,4,2,4,0,0,2,26,0,0,1,0,0,2,1330,2,0,1,7,1,7,3,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
7,H,1861,8,700,4,16,1015675,0,1,2,missing,missing,missing,missing,missing,missing,missing,0,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,5,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,0,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,⋯
8,H,2120,8,200,4,16,1015675,35,1,1,1,missing,2,1,2,missing,40,0,480,3,4,missing,1,missing,missing,missing,missing,missing,1,4,missing,missing,missing,1,4,missing,missing,1,650,5,missing,missing,missing,missing,missing,1,6,10400,0,4,4,4,1,5,0,missing,0,0,0,missing,0,0,0,1,1,2,missing,missing,1,0,missing,missing,missing,missing,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
9,H,2278,8,400,4,16,1015675,47,2,1,1,missing,3,2,2,missing,2,0,2,3,3,770,1,missing,1,750,1,1,1,6,missing,missing,missing,1,1,missing,13,2,660,3,2,209000,4,missing,missing,1,1,209000,0,4,4,4,1,1,0,2,0,0,0,5,0,0,0,1,1,1,805,3,0,1,22,1,6,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
10,H,2428,8,500,4,16,1015675,51,2,1,1,missing,2,1,2,missing,20,0,2,140,1,120,1,220,missing,missing,missing,3,1,5,missing,missing,missing,1,2,missing,1,2,2,5,missing,missing,missing,missing,missing,2,5,35400,0,4,4,4,2,1,0,missing,0,1,0,7,0,0,0,0,0,1,196,missing,0,0,4,missing,missing,missing,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,⋯


The **filter()** function seems like an obvious first choice, until we try to work with missing values, and this is where it falls short - instead we should use **subset()** in combination with **ByRow()** and `skipmissing=true` to handle this correctly

In [69]:
#filter(row -> row.AGS == 6 && row.ACR == 3, first_q)
subset(first_q, :AGS => ByRow(==(6)), :ACR => ByRow(==(3)), skipmissing=true)

Row,RT,SERIALNO,DIVISION,PUMA,REGION,ST,ADJUST,WGTP,NP,TYPE,ACR,AGS,BDS,BLD,BUS,CONP,ELEP,FS,FULP,GASP,HFL,INSP,KIT,MHP,MRGI,MRGP,MRGT,MRGX,PLM,RMS,RNTM,RNTP,SMP,TEL,TEN,VACS,VAL,VEH,WATP,YBL,FES,FINCP,FPARC,GRNTP,GRPIP,HHL,HHT,HINCP,HUGCL,HUPAC,HUPAOC,HUPARC,LNGI,MV,NOC,NPF,NPP,NR,NRC,OCPIP,PARTNER,PSF,R18,R60,R65,RESMODE,SMOCP,SMX,SRNT,SVAL,TAXP,WIF,WKEXREL,WORKSTAT,FACRP,FAGSP,FBDSP,FBLDP,FBUSP,FCONP,FELEP,FFSP,FFULP,FGASP,FHFLP,FINSP,FKITP,FMHP,FMRGIP,FMRGP,FMRGTP,FMRGXP,FMVYP,FPLMP,FRMSP,FRNTMP,FRNTP,FSMP,FSMXHP,FSMXSP,⋯
Unnamed: 0_level_1,String1,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,⋯
1,H,30346,8,400,4,16,1015675,120,4,1,3,6,3,2,2,missing,150,0,2,3,3,600,1,missing,1,1400,1,1,1,5,missing,missing,missing,1,1,missing,20,2,2,3,1,62600,2,missing,missing,3,1,62600,0,2,2,2,1,1,2,4,0,0,2,30,0,0,1,0,0,1,1550,3,0,0,24,2,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,⋯
2,H,53292,8,300,4,16,1015675,26,3,1,3,6,2,3,2,missing,120,0,1000,3,6,1500,1,missing,2,1400,2,1,1,4,missing,missing,missing,1,1,missing,11,3,100,5,1,120000,2,missing,missing,1,1,120000,0,2,2,2,1,6,1,3,0,0,1,18,0,0,1,0,0,1,1819,3,0,0,22,2,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
3,H,56299,8,800,4,16,1015675,97,2,1,3,6,2,2,2,missing,80,0,2,110,1,missing,1,missing,missing,missing,missing,missing,1,5,missing,missing,missing,1,4,missing,missing,2,2,9,2,23200,4,missing,missing,1,1,23200,0,4,4,4,1,4,0,2,0,0,0,missing,0,0,0,2,0,2,missing,missing,0,0,missing,1,3,3,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
4,H,101282,8,800,4,16,1015675,76,2,1,3,6,3,2,2,missing,70,0,850,3,4,860,1,missing,2,600,2,1,1,9,missing,missing,missing,1,1,missing,14,4,2,9,1,27500,4,missing,missing,1,1,27500,0,4,4,4,1,7,0,2,0,0,0,39,0,0,0,1,0,1,883,3,0,0,18,2,2,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,⋯
5,H,120351,8,800,4,16,1015675,51,5,1,3,6,5,2,2,missing,80,0,2,10,2,0,1,missing,missing,missing,missing,3,1,9,missing,missing,missing,1,2,missing,24,2,2,1,2,41500,3,missing,missing,1,1,41500,0,1,1,1,1,1,3,5,0,0,3,7,0,0,1,0,0,1,257,missing,0,0,32,2,2,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
6,H,122802,8,800,4,16,1015675,63,5,1,3,6,3,2,2,missing,150,0,2,3,3,4800,1,missing,2,1100,2,1,1,6,missing,missing,missing,1,1,missing,24,3,2,5,1,5200,2,missing,missing,2,1,5200,0,2,2,2,1,5,3,5,0,0,3,101,0,0,1,0,0,1,2233,3,0,0,65,2,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,⋯
7,H,133128,8,300,4,16,1015675,15,2,1,3,6,1,2,2,missing,100,0,770,3,4,2000,1,missing,2,870,2,1,1,5,missing,missing,missing,1,1,missing,22,4,2,9,2,32200,4,missing,missing,1,1,32200,0,4,4,4,1,7,0,2,0,0,0,53,0,0,0,2,2,2,1409,3,0,0,37,1,6,3,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
8,H,140896,8,400,4,16,1015675,72,2,1,3,6,4,2,2,missing,130,0,2,3,2,800,1,missing,missing,missing,missing,3,1,9,missing,missing,missing,1,2,missing,17,2,2,8,2,32000,4,missing,missing,1,1,32000,0,4,4,4,1,6,0,2,0,0,0,11,0,0,0,0,0,1,289,missing,0,0,23,1,6,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
9,H,169806,8,800,4,16,1015675,62,1,1,3,6,4,2,2,missing,100,0,2,3,3,1500,1,missing,missing,missing,missing,3,1,8,missing,missing,missing,1,2,missing,24,2,2,6,missing,missing,missing,missing,missing,1,4,30700,0,4,4,4,1,7,0,missing,0,0,0,28,0,0,0,1,1,1,725,missing,0,0,64,missing,missing,missing,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
10,H,173013,8,500,4,16,1015675,77,2,1,3,6,3,2,2,missing,50,0,2,70,3,200,1,missing,1,900,2,1,1,6,missing,missing,missing,1,1,missing,20,5,2,2,2,120000,4,missing,missing,1,1,120000,0,4,4,4,1,3,0,2,0,0,0,11,0,0,0,2,0,1,1120,3,0,0,24,1,3,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
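
To see why the missing values matter, here is a minimal sketch on a made-up four-row frame (the `toy` name and its values are assumptions for illustration, not the survey data). It compares `subset()` with a `filter()` that uses `isequal`, which returns `false` rather than `missing`.

```julia
using DataFrames

# Toy stand-in for first_q; the values are made up for illustration
toy = DataFrame(AGS = [6, missing, 6, 2], ACR = [3, 3, missing, 3])

# `missing == 6` evaluates to `missing`, not `false`,
# which is why a plain `row.AGS == 6 && ...` predicate errors
@show ismissing(missing == 6)

# Option 1: subset drops rows whose condition evaluates to missing
kept_subset = subset(toy, :AGS => ByRow(==(6)), :ACR => ByRow(==(3)), skipmissing = true)

# Option 2: filter with isequal, which always returns a Bool
kept_filter = filter(row -> isequal(row.AGS, 6) && isequal(row.ACR, 3), toy)

@show nrow(kept_subset) nrow(kept_filter)
```

Both options keep only the first row here; which one reads better is mostly a matter of taste.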


But we have to construct a new variable based on the evaluation of a conditional: each row of the new column is true if the condition holds and, conversely, false if it does not. I was stuck for a very long time until finding this post by who else than Bogumil https://stackoverflow.com/questions/50114613/add-column-conditionally 

In [95]:
first_q.largeLand = ifelse.((ismissing.(first_q.AGS)) .| (first_q.AGS .!= 6) .| (first_q.ACR .!= 3), false, true)

6496-element BitVector:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
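
A possible alternative to the `ifelse.` expression, sketched on made-up vectors (`ags` and `acr` are placeholder names standing in for the two columns): write the condition positively and let `coalesce` turn any `missing` result into `false`. This is only a sketch of the idea, not necessarily how the linked post does it.

```julia
# Made-up stand-ins for first_q.AGS and first_q.ACR
ags = [6, missing, 6, 2, 6]
acr = [3, 3, missing, 3, 3]

# The combined condition is `missing` wherever the answer is unknown;
# coalesce replaces those entries with `false`
large_land = coalesce.((ags .== 6) .& (acr .== 3), false)

# Row numbers where the condition holds
rows = findall(large_land)
println(rows)
```

Because every entry ends up as a proper Bool, this version also copes with a missing value in the second column.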

In [96]:
countmap(first_q.largeLand)

Dict{Bool, Int64} with 2 entries:
  0 => 6419
  1 => 77

In [98]:
findall(first_q.largeLand .== 1)

77-element Vector{Int64}:
  125
  238
  262
  470
  555
  568
  608
  643
  787
  808
  824
  849
  952
    ⋮
 5236
 5326
 5417
 5531
 5574
 5894
 6033
 6044
 6089
 6275
 6376
 6420

The answer is 125, 238, 262 - these are the first three row numbers where the condition is true

## 2. 
Using the jpeg package read in the following picture of your instructor into R

 https://d396qusza40orc.cloudfront.net/getdata%2Fjeff.jpg 

Use the parameter native=TRUE. What are the 30th and 80th quantiles of the resulting data? (some Linux systems may produce an answer 638 different for the 30th quantile)

In [99]:
download("https://d396qusza40orc.cloudfront.net/getdata%2Fjeff.jpg", "leek.jpg")

"leek.jpg"

In [142]:
Pkg.add(["ImageIO", "Images", "ImageMagick", "FileIO"]) ; using ImageIO, FileIO, ImageMagick

[32m[1m   Resolving[22m[39m package versions...
[32m[1m   Installed[22m[39m JpegTurbo_jll ───────────── v3.0.1+0
[32m[1m   Installed[22m[39m LERC_jll ────────────────── v3.0.0+1
[32m[1m   Installed[22m[39m TiledIteration ──────────── v0.3.1
[32m[1m   Installed[22m[39m Images ──────────────────── v0.24.1
[32m[1m   Installed[22m[39m ImageMagick ─────────────── v1.3.0
[32m[1m   Installed[22m[39m FFTW ────────────────────── v1.8.0
[32m[1m   Installed[22m[39m IdentityRanges ──────────── v0.3.1
[32m[1m   Installed[22m[39m IntelOpenMP_jll ─────────── v2024.0.2+0
[32m[1m   Installed[22m[39m Netpbm ──────────────────── v1.0.1
[32m[1m   Installed[22m[39m StaticArrays ────────────── v1.9.2
[32m[1m   Installed[22m[39m RealDot ─────────────────── v0.1.0
[32m[1m   Installed[22m[39m Distances ───────────────── v0.10.11
[32m[1m   Installed[22m[39m StaticArraysCore ────────── v1.4.2
[32m[1m   Installed[22m[39m CustomUnitRanges ────────── v1.0.

[32m  ✓ [39m[90mStaticArrays → StaticArraysChainRulesCoreExt[39m
[32m  ✓ [39mImageMagick
[32m  ✓ [39m[90mCoordinateTransformations[39m
[32m  ✓ [39m[90mRotations[39m
[32m  ✓ [39m[90mInterpolations[39m
MKL_jll[33m Waiting for background task / IO / timer.[39m
[pid 371619] waiting for IO to finish:
 Handle type        uv_handle_t->data
 timer              0x140ffc0->0x7f7c3ff8fb80
This means that a package has started a background task or event source that has not finished running. For precompilation to complete successfully, the event source needs to be closed explicitly. See the developer documentation on fixing precompilation hangs for more help.
[32m  ✓ [39m[90mImageTransformations[39m
[32m  ✓ [39m[90mMKL_jll[39m
[32m  ✓ [39m[90mImageContrastAdjustment[39m
[32m  ✓ [39m[90mFFTW[39m
[32m  ✓ [39m[90mFFTViews[39m
[32m  ✓ [39m[90mImageFiltering[39m
[32m  ✓ [39m[90mImageQualityIndexes[39m
[32m  ✓ [39mImages
  54 dependencies successfully pre

I'm gonna put this on hold as it's a bit hard to understand at the moment 
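
Once the pixel values are available as a plain numeric vector, the quantile step itself is short. A rough sketch on a made-up vector (this is not the image data; `pixels` is a placeholder name), using `quantile` from the Statistics standard library, whose default interpolation matches R's default:

```julia
using Statistics

# Made-up values standing in for the flattened image data
pixels = collect(1:10)

# 30th and 80th percentiles with linear interpolation
q30, q80 = quantile(pixels, [0.3, 0.8])
println((q30, q80))
```

Reading the JPEG in a way that reproduces R's `native=TRUE` integers is the part that still needs working out.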

## 3. 
Load the Gross Domestic Product data for the 190 ranked countries in this data set:

 https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv
 

Load the educational data from this data set:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv

Match the data based on the country shortcode. How many of the IDs match? Sort the data frame in descending order by GDP rank (so United States is last). What is the 13th country in the resulting data frame?

Original data sources: 

http://data.worldbank.org/data-catalog/GDP-ranking-table

http://data.worldbank.org/data-catalog/ed-stats


In [167]:
download("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv", "gdp.csv") ; download("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv", "education.csv")

"education.csv"

Take a look at the structure of the .csv using some bash commands - this will let us know what the delimiters look like, the header structure and any other important features to take into consideration

In [178]:
run(`head gdp.csv -n 20`)

,Gross domestic product 2012,,,,,,,,
,,,,,,,,,
,,,,(millions of,,,,,
,Ranking,,Economy,US dollars),,,,,
,,,,,,,,,
USA,1,,United States," 16,244,600 ",,,,,
CHN,2,,China," 8,227,103 ",,,,,
JPN,3,,Japan," 5,959,718 ",,,,,
DEU,4,,Germany," 3,428,131 ",,,,,
FRA,5,,France," 2,612,878 ",,,,,
GBR,6,,United Kingdom," 2,471,784 ",,,,,
BRA,7,,Brazil," 2,252,664 ",,,,,
RUS,8,,Russian Federation," 2,014,775 ",,,,,
ITA,9,,Italy," 2,014,670 ",,,,,
IND,10,,India," 1,841,710 ",,,,,
CAN,11,,Canada," 1,821,424 ",,,,,
AUS,12,,Australia," 1,532,408 ",,,,,
ESP,13,,Spain," 1,322,965 ",,,,,
MEX,14,,Mexico," 1,178,126 ",,,,,
KOR,15,,"Korea, Rep."," 1,129,598 ",,,,,


Process(`[4mhead[24m [4mgdp.csv[24m [4m-n[24m [4m20[24m`, ProcessExited(0))

Import the data into a DataFrame. The first 5 lines of the file are header lines, with the actual column names on the 4th line, so we take the header from line 4 and start parsing the data at line 6. Since the dataframe contains many ill-formatted columns and missing rows, we'll only extract the 4 columns that have been properly filled in, and then drop the rows with missing values. We'll also rename the columns to make them easier to read and work with. 

In [298]:
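# Column names come from line 4; data rows start at line 6
gdp_raw = CSV.read("gdp.csv", DataFrame, skipto=6, header=4) 
# Keep columns 1, 2, 4 and 5 (`:1` is just the integer 1 in Julia), then drop rows with missing values
gdp_raw_nomissing = dropmissing(gdp_raw[:, [1, 2, 4, 5]])
gdp_cleaned = rename!(gdp_raw_nomissing, ["CountryCode", "Rank", "Country", "US_dollars"])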

Row,CountryCode,Rank,Country,US_dollars
Unnamed: 0_level_1,String3,String,String31,String15
1,USA,1,United States,16244600
2,CHN,2,China,8227103
3,JPN,3,Japan,5959718
4,DEU,4,Germany,3428131
5,FRA,5,France,2612878
6,GBR,6,United Kingdom,2471784
7,BRA,7,Brazil,2252664
8,RUS,8,Russian Federation,2014775
9,ITA,9,Italy,2014670
10,IND,10,India,1841710


The commas in the US_dollars column are causing the numeric values to be represented as Strings, and thus the column type is String15. In order to perform numeric operations, we need to change this column type. We'll remove the commas and then parse the column's values as Integers. The Rank column is also a String, so it gets parsed the same way.

In [299]:
gdp_cleaned.US_dollars .= replace.(gdp_cleaned.US_dollars, "," => "")
gdp_cleaned.US_dollars = parse.(Int, gdp_cleaned.US_dollars)
gdp_cleaned.Rank = parse.(Int, gdp_cleaned.Rank)
#gdp_cleaned

190-element Vector{Int64}:
   1
   2
   3
   4
   5
   6
   7
   8
   9
  10
  11
  12
  13
   ⋮
 178
 180
 181
 182
 183
 184
 185
 186
 187
 188
 189
 190
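
The clean-up can be checked in isolation on a few made-up strings shaped like the raw values (padding spaces and thousands separators). `to_int` is a hypothetical helper name, and the explicit `strip` is an extra safety step that the cell above does not use:

```julia
# Made-up strings in the same shape as the raw US_dollars values
raw = [" 16,244,600 ", " 8,227,103 ", " 40 "]

# Trim the padding, drop the thousands separators, then parse
to_int(s) = parse(Int, replace(strip(s), "," => ""))

dollars = to_int.(raw)
println(dollars)
```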

In [300]:
gdp_cleaned

Row,CountryCode,Rank,Country,US_dollars
Unnamed: 0_level_1,String3,Int64,String31,Int64
1,USA,1,United States,16244600
2,CHN,2,China,8227103
3,JPN,3,Japan,5959718
4,DEU,4,Germany,3428131
5,FRA,5,France,2612878
6,GBR,6,United Kingdom,2471784
7,BRA,7,Brazil,2252664
8,RUS,8,Russian Federation,2014775
9,ITA,9,Italy,2014670
10,IND,10,India,1841710


Import education data

In [301]:
edu_raw = CSV.read("education.csv", DataFrame)

Row,CountryCode,Long Name,Income Group,Region,Lending category,Other groups,Currency Unit,Latest population census,Latest household survey,Special Notes,National accounts base year,National accounts reference year,System of National Accounts,SNA price valuation,Alternative conversion factor,PPP survey year,Balance of Payments Manual in use,External debt Reporting status,System of trade,Government Accounting concept,IMF data dissemination standard,Source of most recent Income and expenditure data,Vital registration complete,Latest agricultural census,Latest industrial data,Latest trade data,Latest water withdrawal data,2-alpha code,WB-2 code,Table Name,Short Name
Unnamed: 0_level_1,String3,String,String31?,String31?,String7?,String15?,String?,String15?,String31?,String?,String?,Int64?,Int64?,String3?,String31?,Int64?,String7?,String15?,String7?,String15?,String7?,String15?,String3?,String31?,Int64?,Int64?,Int64?,String3?,String3?,String,String
1,ABW,Aruba,High income: nonOECD,Latin America & Caribbean,missing,missing,Aruban florin,2000,missing,missing,1995,missing,missing,missing,missing,missing,missing,missing,Special,missing,missing,missing,missing,missing,missing,2008,missing,AW,AW,Aruba,Aruba
2,ADO,Principality of Andorra,High income: nonOECD,Europe & Central Asia,missing,missing,Euro,Register based,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,General,missing,missing,missing,Yes,missing,missing,2006,missing,AD,AD,Andorra,Andorra
3,AFG,Islamic State of Afghanistan,Low income,South Asia,IDA,HIPC,Afghan afghani,1979,"MICS, 2003",Fiscal year end: March 20; reporting period for national accounts data: FY.,2002/2003,missing,missing,VAB,missing,missing,missing,Actual,General,Consolidated,GDDS,missing,missing,missing,missing,2008,2000,AF,AF,Afghanistan,Afghanistan
4,AGO,People's Republic of Angola,Lower middle income,Sub-Saharan Africa,IDA,missing,Angolan kwanza,1970,"MICS, 2001, MIS, 2006/07",missing,1997,missing,missing,VAP,1991-96,2005,BPM5,Actual,Special,missing,GDDS,"IHS, 2000",missing,1964-65,missing,1991,2000,AO,AO,Angola,Angola
5,ALB,Republic of Albania,Upper middle income,Europe & Central Asia,IBRD,missing,Albanian lek,2001,"MICS, 2005",missing,missing,1996,1993,VAB,missing,2005,BPM5,Actual,General,Consolidated,GDDS,"LSMS, 2005",Yes,1998,2005,2008,2000,AL,AL,Albania,Albania
6,ARE,United Arab Emirates,High income: nonOECD,Middle East & North Africa,missing,missing,U.A.E. dirham,2005,missing,missing,1995,missing,missing,VAB,missing,missing,BPM4,missing,General,Consolidated,GDDS,missing,missing,1998,missing,2008,2005,AE,AE,United Arab Emirates,United Arab Emirates
7,ARG,Argentine Republic,Upper middle income,Latin America & Caribbean,IBRD,missing,Argentine peso,2001,missing,missing,1993,missing,1993,VAB,1971-84,2005,BPM5,Actual,Special,Consolidated,SDDS,"IHS, 2006",Yes,2002,2001,2008,2000,AR,AR,Argentina,Argentina
8,ARM,Republic of Armenia,Lower middle income,Europe & Central Asia,Blend,missing,Armenian dram,2001,"DHS, 2005",missing,missing,1996,1993,VAB,1990-95,2005,BPM5,Actual,Special,Consolidated,SDDS,"IHS, 2007",Yes,missing,missing,2008,2000,AM,AM,Armenia,Armenia
9,ASM,American Samoa,Upper middle income,East Asia & Pacific,missing,missing,U.S. dollar,2000,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,Yes,missing,missing,missing,missing,AS,AS,American Samoa,American Samoa
10,ATG,Antigua and Barbuda,Upper middle income,Latin America & Caribbean,IBRD,missing,East Caribbean dollar,2001,missing,The government has revised national accounts data for 1998-2008.,1990,missing,missing,VAB,missing,missing,BPM5,missing,General,missing,GDDS,missing,Yes,missing,missing,2007,1990,AG,AG,Antigua and Barbuda,Antigua and Barbuda


Perform an inner join on the `CountryCode` column, which is present in both dataframes. The number of rows in the result is the number of matching IDs

In [302]:
gdp_edu = innerjoin(gdp_cleaned, edu_raw, on=:CountryCode)

Row,CountryCode,Rank,Country,US_dollars,Long Name,Income Group,Region,Lending category,Other groups,Currency Unit,Latest population census,Latest household survey,Special Notes,National accounts base year,National accounts reference year,System of National Accounts,SNA price valuation,Alternative conversion factor,PPP survey year,Balance of Payments Manual in use,External debt Reporting status,System of trade,Government Accounting concept,IMF data dissemination standard,Source of most recent Income and expenditure data,Vital registration complete,Latest agricultural census,Latest industrial data,Latest trade data,Latest water withdrawal data,2-alpha code,WB-2 code,Table Name,Short Name
Unnamed: 0_level_1,String3,Int64,String31,Int64,String,String31?,String31?,String7?,String15?,String?,String15?,String31?,String?,String?,Int64?,Int64?,String3?,String31?,Int64?,String7?,String15?,String7?,String15?,String7?,String15?,String3?,String31?,Int64?,Int64?,Int64?,String3?,String3?,String,String
1,ABW,161,Aruba,2584,Aruba,High income: nonOECD,Latin America & Caribbean,missing,missing,Aruban florin,2000,missing,missing,1995,missing,missing,missing,missing,missing,missing,missing,Special,missing,missing,missing,missing,missing,missing,2008,missing,AW,AW,Aruba,Aruba
2,AFG,105,Afghanistan,20497,Islamic State of Afghanistan,Low income,South Asia,IDA,HIPC,Afghan afghani,1979,"MICS, 2003",Fiscal year end: March 20; reporting period for national accounts data: FY.,2002/2003,missing,missing,VAB,missing,missing,missing,Actual,General,Consolidated,GDDS,missing,missing,missing,missing,2008,2000,AF,AF,Afghanistan,Afghanistan
3,AGO,60,Angola,114147,People's Republic of Angola,Lower middle income,Sub-Saharan Africa,IDA,missing,Angolan kwanza,1970,"MICS, 2001, MIS, 2006/07",missing,1997,missing,missing,VAP,1991-96,2005,BPM5,Actual,Special,missing,GDDS,"IHS, 2000",missing,1964-65,missing,1991,2000,AO,AO,Angola,Angola
4,ALB,125,Albania,12648,Republic of Albania,Upper middle income,Europe & Central Asia,IBRD,missing,Albanian lek,2001,"MICS, 2005",missing,missing,1996,1993,VAB,missing,2005,BPM5,Actual,General,Consolidated,GDDS,"LSMS, 2005",Yes,1998,2005,2008,2000,AL,AL,Albania,Albania
5,ARE,32,United Arab Emirates,348595,United Arab Emirates,High income: nonOECD,Middle East & North Africa,missing,missing,U.A.E. dirham,2005,missing,missing,1995,missing,missing,VAB,missing,missing,BPM4,missing,General,Consolidated,GDDS,missing,missing,1998,missing,2008,2005,AE,AE,United Arab Emirates,United Arab Emirates
6,ARG,26,Argentina,475502,Argentine Republic,Upper middle income,Latin America & Caribbean,IBRD,missing,Argentine peso,2001,missing,missing,1993,missing,1993,VAB,1971-84,2005,BPM5,Actual,Special,Consolidated,SDDS,"IHS, 2006",Yes,2002,2001,2008,2000,AR,AR,Argentina,Argentina
7,ARM,133,Armenia,9951,Republic of Armenia,Lower middle income,Europe & Central Asia,Blend,missing,Armenian dram,2001,"DHS, 2005",missing,missing,1996,1993,VAB,1990-95,2005,BPM5,Actual,Special,Consolidated,SDDS,"IHS, 2007",Yes,missing,missing,2008,2000,AM,AM,Armenia,Armenia
8,ATG,172,Antigua and Barbuda,1134,Antigua and Barbuda,Upper middle income,Latin America & Caribbean,IBRD,missing,East Caribbean dollar,2001,missing,The government has revised national accounts data for 1998-2008.,1990,missing,missing,VAB,missing,missing,BPM5,missing,General,missing,GDDS,missing,Yes,missing,missing,2007,1990,AG,AG,Antigua and Barbuda,Antigua and Barbuda
9,AUS,12,Australia,1532408,Commonwealth of Australia,High income: OECD,East Asia & Pacific,missing,missing,Australian dollar,2006,missing,Fiscal year end: June 30; reporting period for national accounts data: FY.,missing,2007,1993,VAB,missing,2005,BPM5,missing,General,Consolidated,SDDS,"ES/BS, 1994",Yes,2001,2004,2008,2000,AU,AU,Australia,Australia
10,AUT,27,Austria,394708,Republic of Austria,High income: OECD,Europe & Central Asia,missing,Euro area,Euro,2001,missing,"A simple multiplier is used to convert the national currencies of EMU members to euros. The following irrevocable euro conversion rate was adopted by the EU Council on January 1, 1999: 1 euro = 13.7603 Austrian schilling. Please note that historical data before 1999 are not actual euros and are not comparable or suitable for aggregation across countries.",2000,missing,1993,VAB,missing,2005,BPM5,missing,Special,Consolidated,SDDS,IS 2000,Yes,1999-2000,2004,2008,2000,AT,AT,Austria,Austria
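
A small sketch of the matching step on two made-up frames (the codes and regions are illustrative only). `innerjoin` keeps the codes found in both frames, and `antijoin` shows which codes on the left found no partner, which is handy when the match count looks off:

```julia
using DataFrames

# Made-up stand-ins for gdp_cleaned and edu_raw
gdp = DataFrame(CountryCode = ["USA", "CHN", "XXX"], Rank = [1, 2, 3])
edu = DataFrame(CountryCode = ["USA", "CHN", "ABW"],
                Region = ["North America", "East Asia & Pacific", "Latin America & Caribbean"])

# Rows whose CountryCode appears in both frames
matched = innerjoin(gdp, edu, on = :CountryCode)

# Rows of gdp with no partner in edu
unmatched = antijoin(gdp, edu, on = :CountryCode)

println(nrow(matched), " matches; unmatched: ", unmatched.CountryCode)
```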


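Sort the dataframe by the "US_dollars" column in ascending order. This puts the smallest economies first and the United States last, which is what sorting in descending order by GDP rank asks for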

In [279]:
sort!(gdp_edu, :US_dollars, rev=false)

Row,CountryCode,Rank,Country,US_dollars,Long Name,Income Group,Region,Lending category,Other groups,Currency Unit,Latest population census,Latest household survey,Special Notes,National accounts base year,National accounts reference year,System of National Accounts,SNA price valuation,Alternative conversion factor,PPP survey year,Balance of Payments Manual in use,External debt Reporting status,System of trade,Government Accounting concept,IMF data dissemination standard,Source of most recent Income and expenditure data,Vital registration complete,Latest agricultural census,Latest industrial data,Latest trade data,Latest water withdrawal data,2-alpha code,WB-2 code,Table Name,Short Name
Unnamed: 0_level_1,String3,String,String31,Int64,String,String31?,String31?,String7?,String15?,String?,String15?,String31?,String?,String?,Int64?,Int64?,String3?,String31?,Int64?,String7?,String15?,String7?,String15?,String7?,String15?,String3?,String31?,Int64?,Int64?,Int64?,String3?,String3?,String,String
1,TUV,190,Tuvalu,40,Tuvalu,Lower middle income,East Asia & Pacific,missing,missing,Australian dollar,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,TV,TV,Tuvalu,Tuvalu
2,KIR,189,Kiribati,175,Republic of Kiribati,Lower middle income,East Asia & Pacific,IDA,missing,Australian dollar,2005,missing,The government statistical office has revised national accounts data for 1970-2008.,1991,missing,missing,VAB,missing,missing,missing,missing,General,missing,GDDS,missing,missing,missing,missing,2005,missing,KI,KI,Kiribati,Kiribati
3,MHL,188,Marshall Islands,182,Republic of the Marshall Islands,Lower middle income,East Asia & Pacific,IBRD,missing,U.S. dollar,1999,missing,missing,1991,missing,missing,VAB,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,MH,MH,Marshall Islands,Marshall Islands
4,PLW,187,Palau,228,Republic of Palau,Upper middle income,East Asia & Pacific,IBRD,missing,U.S. dollar,2005,missing,missing,1995,missing,missing,VAB,missing,missing,missing,missing,missing,missing,missing,missing,Yes,missing,missing,missing,missing,PW,PW,Palau,Palau
5,STP,186,S\xe3o Tom\xe9 and Principe,263,Democratic Republic of S\xe3o Tom\xe9 and Principe,Lower middle income,Sub-Saharan Africa,IDA,HIPC,S\xe3o Tom\xe9 and Principe dobra,2001,missing,missing,2001,missing,missing,VAP,missing,2005,missing,Preliminary,Special,missing,GDDS,PS 2000-01,missing,missing,missing,2008,missing,ST,ST,S\xe3o Tom\xe9 and Principe,S\xe3o Tom\xe9 and Principe
6,FSM,185,"Micronesia, Fed. Sts.",326,Federated States of Micronesia,Lower middle income,East Asia & Pacific,IBRD,missing,U.S. dollar,2000,missing,The government statistical office has revised national accounts data for 1995-2008.,1998,missing,missing,VAB,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,FM,FM,"Micronesia, Fed. Sts.",Micronesia
7,TON,184,Tonga,472,Kingdom of Tonga,Lower middle income,East Asia & Pacific,IDA,missing,Tongan pa'anga,2006,missing,missing,2000/2001,missing,missing,VAB,missing,missing,BPM5,Actual,missing,missing,GDDS,missing,Yes,2001,missing,2007,missing,TO,TO,Tonga,Tonga
8,DMA,183,Dominica,480,Commonwealth of Dominica,Upper middle income,Latin America & Caribbean,Blend,missing,East Caribbean dollar,2001,missing,missing,1990,missing,1993,VAB,missing,missing,BPM5,Actual,General,missing,GDDS,missing,Yes,missing,missing,2008,missing,DM,DM,Dominica,Dominica
9,COM,182,Comoros,596,Union of the Comoros,Low income,Sub-Saharan Africa,IDA,HIPC,Comorian franc,2003,"MICS, 2000",missing,1990,missing,missing,VAP,missing,2005,missing,Preliminary,missing,missing,missing,"IHS, 2004",missing,missing,missing,2007,missing,KM,KM,Comoros,Comoros
10,WSM,181,Samoa,684,Samoa,Lower middle income,East Asia & Pacific,IDA,missing,Samoan tala,2006,missing,missing,2002,missing,missing,VAB,missing,missing,BPM5,Preliminary,General,missing,missing,missing,missing,1999,missing,2008,missing,WS,WS,Samoa,Samoa
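
The question asks for a sort on the rank itself, so here is a sketch of that variant on a made-up frame (country names and ranks are placeholders). `sort` with `rev = true` gives descending order, and the n-th country is then a plain index lookup:

```julia
using DataFrames

# Made-up data: four countries with their GDP ranks
toy = DataFrame(Country = ["A", "B", "C", "D"], Rank = [2, 4, 1, 3])

# Descending by rank, so the top-ranked economy comes last
by_rank_desc = sort(toy, :Rank, rev = true)

# Pick the 3rd row, in the same way one would pick the 13th
third = by_rank_desc.Country[3]
println(third)
```

Note that this only works as intended once `Rank` has been parsed to integers; sorting the String version would order "100" before "2".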


The answer is 189 matches, and the 13th country is St. Kitts and Nevis!

## 4. 
What is the average GDP ranking for the "High income: OECD" and "High income: nonOECD" group?  

In [280]:
filter(x -> x == "High income: nonOECD", gdp_edu.var"Income Group")

23-element PooledArrays.PooledVector{Union{Missing, String31}, UInt32, Vector{UInt32}}:
 "High income: nonOECD"
 "High income: nonOECD"
 "High income: nonOECD"
 "High income: nonOECD"
 "High income: nonOECD"
 "High income: nonOECD"
 "High income: nonOECD"
 "High income: nonOECD"
 "High income: nonOECD"
 "High income: nonOECD"
 "High income: nonOECD"
 "High income: nonOECD"
 "High income: nonOECD"
 "High income: nonOECD"
 "High income: nonOECD"
 "High income: nonOECD"
 "High income: nonOECD"
 "High income: nonOECD"
 "High income: nonOECD"
 "High income: nonOECD"
 "High income: nonOECD"
 "High income: nonOECD"
 "High income: nonOECD"

In [304]:
high_non = filter(row -> row.var"Income Group" == "High income: nonOECD", gdp_edu)

Row,CountryCode,Rank,Country,US_dollars,Long Name,Income Group,Region,Lending category,Other groups,Currency Unit,Latest population census,Latest household survey,Special Notes,National accounts base year,National accounts reference year,System of National Accounts,SNA price valuation,Alternative conversion factor,PPP survey year,Balance of Payments Manual in use,External debt Reporting status,System of trade,Government Accounting concept,IMF data dissemination standard,Source of most recent Income and expenditure data,Vital registration complete,Latest agricultural census,Latest industrial data,Latest trade data,Latest water withdrawal data,2-alpha code,WB-2 code,Table Name,Short Name
Unnamed: 0_level_1,String3,Int64,String31,Int64,String,String31?,String31?,String7?,String15?,String?,String15?,String31?,String?,String?,Int64?,Int64?,String3?,String31?,Int64?,String7?,String15?,String7?,String15?,String7?,String15?,String3?,String31?,Int64?,Int64?,Int64?,String3?,String3?,String,String
1,ABW,161,Aruba,2584,Aruba,High income: nonOECD,Latin America & Caribbean,missing,missing,Aruban florin,2000,missing,missing,1995,missing,missing,missing,missing,missing,missing,missing,Special,missing,missing,missing,missing,missing,missing,2008,missing,AW,AW,Aruba,Aruba
2,ARE,32,United Arab Emirates,348595,United Arab Emirates,High income: nonOECD,Middle East & North Africa,missing,missing,U.A.E. dirham,2005,missing,missing,1995,missing,missing,VAB,missing,missing,BPM4,missing,General,Consolidated,GDDS,missing,missing,1998,missing,2008,2005,AE,AE,United Arab Emirates,United Arab Emirates
3,BHR,93,Bahrain,29044,Kingdom of Bahrain,High income: nonOECD,Middle East & North Africa,missing,missing,Bahraini dinar,2001,missing,missing,1985,missing,missing,VAP,missing,2005,BPM5,missing,General,Consolidated,GDDS,missing,Yes,missing,missing,2007,2003,BH,BH,Bahrain,Bahrain
4,BHS,138,"Bahamas, The",8149,Commonwealth of The Bahamas,High income: nonOECD,Latin America & Caribbean,missing,missing,Bahamian dollar,2000,missing,The government has revised national accounts data for 1997-2007. The new base year is 2006.,2006,missing,1993,VAB,missing,missing,BPM5,missing,General,Budgetary,GDDS,missing,missing,missing,1997,2008,missing,BS,BS,"Bahamas, The",The Bahamas
5,BMU,149,Bermuda,5474,The Bermudas,High income: nonOECD,North America,missing,missing,Bermuda dollar,2000,missing,The Statistical Office has revised national accounts data for 1996-2007.,1996,missing,missing,VAB,missing,missing,missing,missing,missing,missing,missing,missing,Yes,missing,missing,2008,missing,BM,BM,Bermuda,Bermuda
6,BRB,153,Barbados,4225,Barbados,High income: nonOECD,Latin America & Caribbean,missing,missing,Barbados dollar,2000,missing,missing,1974,missing,missing,VAB,missing,missing,BPM5,missing,General,Consolidated,GDDS,missing,Yes,missing,missing,2008,2000,BB,BB,Barbados,Barbados
7,BRN,113,Brunei Darussalam,16954,Brunei Darussalam,High income: nonOECD,East Asia & Pacific,missing,missing,Brunei dollar,2001,missing,missing,2000,missing,missing,VAP,missing,2005,missing,missing,General,missing,GDDS,missing,Yes,missing,missing,2006,missing,BN,BN,Brunei Darussalam,Brunei
8,CYP,102,Cyprus,22767,Republic of Cyprus,High income: nonOECD,Europe & Central Asia,missing,Euro area,Euro,2001,missing,"A simple multiplier is used to convert the national currencies of EMU members to euros. The following irrevocable euro conversion rate entered into force on January 1, 2008: 1 euro = 0.585274 Cyprus pounds. Please note that historical data are not actual euros and are not comparable or suitable for aggregation across countries.",missing,2000,missing,VAB,missing,2005,BPM5,missing,General,Consolidated,SDDS,missing,Yes,missing,2005,2008,2000,CY,CY,Cyprus,Cyprus
9,EST,103,Estonia,22390,Republic of Estonia,High income: nonOECD,Europe & Central Asia,missing,missing,Estonian kroon,2000,missing,missing,2000,missing,1993,VAB,1987-95,2005,BPM5,missing,General,Consolidated,SDDS,"ES/BS, 2004",Yes,2001,2005,2008,2000,EE,EE,Estonia,Estonia
10,GNQ,110,Equatorial Guinea,17697,Republic of Equatorial Guinea,High income: nonOECD,Sub-Saharan Africa,IBRD,missing,CFA franc,2002,missing,missing,2000,missing,missing,VAB,1965-84,2005,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,2000,GQ,GQ,Equatorial Guinea,Equatorial Guinea


In [306]:
mean.(eachcol(high_non.Rank))

1-element Vector{Float64}:
 91.91304347826087

The answer is 32.96667 for "High income: OECD" and 91.91304 for "High income: nonOECD"
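
Both averages can be computed in one go with a grouped summary instead of filtering one group at a time. A sketch on made-up rows (the `IncomeGroup` column name without a space and the `mean_rank` output name are assumptions for this toy example):

```julia
using DataFrames, Statistics

# Made-up rows; the ranks are illustrative only
toy = DataFrame(
    IncomeGroup = ["High income: OECD", "High income: OECD", "High income: nonOECD", "Low income"],
    Rank = [1, 5, 40, 150],
)

# One row per income group with the mean of Rank
avg_rank = combine(groupby(toy, :IncomeGroup), :Rank => mean => :mean_rank)
println(avg_rank)
```

On the real frame the grouping column would be `"Income Group"`, passed as a string.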

## 5. 
Cut the GDP ranking into 5 separate quantile groups. Make a table versus Income.Group. How many countries are Lower middle income but among the 38 nations with highest GDP? 

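Let's see how the quantile groups look - combine countmap() (from StatsBase) with cut(x, n) (from CategoricalArrays), where x is the Rank column and n is the number of groups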

In [323]:
countmap(cut(gdp_edu.Rank, 5))

Dict{CategoricalValue{String, UInt32}, Int64} with 5 entries:
  "Q5: [152.4, 190.0]" => 38
  "Q4: [113.8, 152.4)" => 38
  "Q3: [76.2, 113.8)"  => 37
  "Q2: [38.6, 76.2)"   => 38
  "Q1: [1.0, 38.6)"    => 38

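
The interval edges in those labels are just quantiles of the ranks. A rough sketch of the mechanics on a made-up vector, using only the Statistics standard library; this illustrates the idea and is not a claim about how `cut` is implemented internally:

```julia
using Statistics

# Made-up ranks
ranks = collect(1:10)

# Interior edges that split the data into 5 equally sized groups
edges = quantile(ranks, [0.2, 0.4, 0.6, 0.8])

# Group number = 1 + how many edges are at or below the rank,
# i.e. left-closed intervals like the "[38.6, 76.2)" labels
group = [1 + count(<=(r), edges) for r in ranks]
println(edges)
println(group)
```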
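Add a new variable containing the quantile category that each row belongs to (the column is called `quartile` here, although with five groups these are strictly speaking quintiles)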

In [312]:
gdp_edu.quartile = cut(gdp_edu.Rank, 5)

189-element CategoricalArray{String,1,UInt32}:
 "Q5: [152.4, 190.0]"
 "Q3: [76.2, 113.8)"
 "Q2: [38.6, 76.2)"
 "Q4: [113.8, 152.4)"
 "Q1: [1.0, 38.6)"
 "Q1: [1.0, 38.6)"
 "Q4: [113.8, 152.4)"
 "Q5: [152.4, 190.0]"
 "Q1: [1.0, 38.6)"
 "Q1: [1.0, 38.6)"
 "Q2: [38.6, 76.2)"
 "Q5: [152.4, 190.0]"
 "Q1: [1.0, 38.6)"
 ⋮
 "Q1: [1.0, 38.6)"
 "Q2: [38.6, 76.2)"
 "Q5: [152.4, 190.0]"
 "Q1: [1.0, 38.6)"
 "Q2: [38.6, 76.2)"
 "Q5: [152.4, 190.0]"
 "Q5: [152.4, 190.0]"
 "Q3: [76.2, 113.8)"
 "Q1: [1.0, 38.6)"
 "Q3: [76.2, 113.8)"
 "Q3: [76.2, 113.8)"
 "Q4: [113.8, 152.4)"

Stack the dataframe into long format - `Income Group` becomes the `value` column, and `quartile` is kept as the identifier for each row

In [316]:
stacked_income_quartile = stack(gdp_edu, :"Income Group", :quartile)

Row,quartile,variable,value
Unnamed: 0_level_1,Cat…,String,String31?
1,"Q5: [152.4, 190.0]",Income Group,High income: nonOECD
2,"Q3: [76.2, 113.8)",Income Group,Low income
3,"Q2: [38.6, 76.2)",Income Group,Lower middle income
4,"Q4: [113.8, 152.4)",Income Group,Upper middle income
5,"Q1: [1.0, 38.6)",Income Group,High income: nonOECD
6,"Q1: [1.0, 38.6)",Income Group,Upper middle income
7,"Q4: [113.8, 152.4)",Income Group,Lower middle income
8,"Q5: [152.4, 190.0]",Income Group,Upper middle income
9,"Q1: [1.0, 38.6)",Income Group,High income: OECD
10,"Q1: [1.0, 38.6)",Income Group,High income: OECD


Subset the dataframe so that only the lower middle income entries are retained

In [328]:
lower_middle_income_quartile = filter(row -> row.value == "Lower middle income", stacked_income_quartile)

Row,quartile,variable,value
Unnamed: 0_level_1,Cat…,String,String31?
1,"Q2: [38.6, 76.2)",Income Group,Lower middle income
2,"Q4: [113.8, 152.4)",Income Group,Lower middle income
3,"Q5: [152.4, 190.0]",Income Group,Lower middle income
4,"Q3: [76.2, 113.8)",Income Group,Lower middle income
5,"Q5: [152.4, 190.0]",Income Group,Lower middle income
6,"Q1: [1.0, 38.6)",Income Group,Lower middle income
7,"Q3: [76.2, 113.8)",Income Group,Lower middle income
8,"Q3: [76.2, 113.8)",Income Group,Lower middle income
9,"Q4: [113.8, 152.4)",Income Group,Lower middle income
10,"Q5: [152.4, 190.0]",Income Group,Lower middle income


Now perform a final countmap() on the newly added quartile variable to answer our question - how many lower middle income countries fall in Q1, the group holding the 38 highest GDP ranks?

In [329]:
countmap(lower_middle_income_quartile.quartile)

Dict{CategoricalValue{String, UInt32}, Int64} with 5 entries:
  "Q4: [113.8, 152.4)" => 9
  "Q5: [152.4, 190.0]" => 16
  "Q3: [76.2, 113.8)"  => 11
  "Q2: [38.6, 76.2)"   => 13
  "Q1: [1.0, 38.6)"    => 5

The answer is 5 - five lower middle income countries sit in Q1, the 38 nations with the highest GDP
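
The same count can also be read off a full cross-table of income group against quantile group, which is closer to the "make a table" wording of the question. A sketch on made-up rows, swapping the stack-and-filter route for `groupby` plus `unstack` (the `IncomeGroup` and `Group` column names and the "Q1"/"Q5" labels are placeholders):

```julia
using DataFrames

# Made-up rows: one income group and one quantile label per country
toy = DataFrame(
    IncomeGroup = ["Lower middle income", "Lower middle income", "High income: OECD", "Lower middle income"],
    Group = ["Q1", "Q5", "Q1", "Q1"],
)

# Count countries per (income group, quantile group) pair
counts = combine(groupby(toy, [:IncomeGroup, :Group]), nrow => :n)

# Spread the quantile groups into columns; empty cells become 0
wide = unstack(counts, :IncomeGroup, :Group, :n, fill = 0)
println(wide)
```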