# Week 3 Notes - Getting and Cleaning Data

## Subsetting and Sorting
Time to wrangle and mould our datasets to our desires. Create a basic dataframe, reshuffle its rows and then insert some NA values too.
```R
set.seed(1345)
X <- data.frame("var1"=sample(1:5), "var2"=sample(6:10), "var3"=sample(11:15))
X <- X[sample(1:5),]; X$var2[c(1,3)] = NA
``` 

In Julia

In [173]:
using DataFrames, CSV, Pkg

In [174]:
X = DataFrame(col1=rand(1:20, 5), col2=rand(2:30, 5), col3=[missing, missing, 5, 4, 21])

Row,col1,col2,col3
Unnamed: 0_level_1,Int64,Int64,Int64?
1,5,26,missing
2,9,8,missing
3,3,5,5
4,2,24,4
5,19,15,21


Let's subset the dataframe and take specific columns and row combinations

In R
`X[,1]` takes the first column. We can also take the first column by passing the column name as a string, `X[,"var1"]`. To take the first two rows of column 2, use `X[1:2,"var2"]`.

With Julia we can use the basic .dot syntax

In [175]:
X.col1

5-element Vector{Int64}:
  5
  9
  3
  2
 19

In [176]:
# This string based extraction is a bit slower
# Julia converts the String to a Symbol type 
X."col1"

5-element Vector{Int64}:
  5
  9
  3
  2
 19

We can also use indexing and the column names

In [177]:
X[:, 2]

5-element Vector{Int64}:
 26
  8
  5
 24
 15

In [178]:
X[:, "col1"]

5-element Vector{Int64}:
  5
  9
  3
  2
 19

In [179]:
X[:, :col1]

5-element Vector{Int64}:
  5
  9
  3
  2
 19

So to summarise it all: if we're using indexing, we can use the column number, the name as a string (`"colname"`) or the name as a Symbol (`:colname`). If we're using .dot syntax we can use `df.colname` or `df."colname"`.

In Julia, to check the column index of a certain column e.g. is it in the 10th column? etc., we can use the **columnindex()** function

In [180]:
columnindex(X, "col3")

3

To test whether a specific column is in the dataframe, based on its name, we can use **hasproperty()**

In [181]:
hasproperty(X, "col3")

true

#### Subset the dataframe using conditions, such as, print the dataframe in which the first column has values over 18

In [182]:
filter(row -> row.col1 > 18, X)

Row,col1,col2,col3
Unnamed: 0_level_1,Int64,Int64,Int64?
1,19,15,21


If we have multiple conditions, we can combine them with `||` (or) and `&&` (and)

In [183]:
filter(row -> row.col1 > 1 || row.col2 > 1, X)

Row,col1,col2,col3
Unnamed: 0_level_1,Int64,Int64,Int64?
1,5,26,missing
2,9,8,missing
3,3,5,5
4,2,24,4
5,19,15,21
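
The `||` example above matches every row, so it is hard to see what each operator does. As a minimal sketch (the dataframe `df` is made up for illustration and only mirrors the values of `X`), `&&` keeps rows where both conditions hold and `||` keeps rows where either holds. A comparison against `missing` returns `missing` rather than true or false, so `coalesce` is used here as one way to treat it as `false`:

```julia
using DataFrames

# Hypothetical stand-in for X, with the same values as the table above
df = DataFrame(col1 = [5, 9, 3, 2, 19],
               col2 = [26, 8, 5, 24, 15],
               col3 = [missing, missing, 5, 4, 21])

# Both conditions must hold
both = filter(row -> row.col1 > 2 && row.col2 > 10, df)

# Either condition may hold
either = filter(row -> row.col1 > 18 || row.col2 < 6, df)

# col3 contains missing values: turn a missing comparison into false
safe = filter(row -> coalesce(row.col3 > 4, false), df)
```

Here `both` keeps the rows with `col1` equal to 5 and 19, while `either` and `safe` keep the rows with `col1` equal to 3 and 19.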


In R we could do 
```R
X[(X$var1 <= 3 & X$var3 > 11),]
X[(X$var1 <= 3 | X$var3 > 11),]
``` 

### Sorting 

In R;
```R
sort(X$var1)
# Sort in reverse
sort(X$var1, decreasing=TRUE)
``` 

In Julia, we can sort a single column of the dataframe and get a vector back. Sorting whole dataframes is documented at
https://dataframes.juliadata.org/stable/man/sorting/

In [184]:
sort(X.col1, rev=true)

5-element Vector{Int64}:
 19
  9
  5
  3
  2

To sort the entire dataframe by a column and print it out for viewing

In [185]:
sort(X, "col1") 
# or 
sort(X, 1)
# or, with a vector of Symbols
sort(X, [:col1])

Row,col1,col2,col3
Unnamed: 0_level_1,Int64,Int64,Int64?
1,2,24,4
2,3,5,5
3,5,26,missing
4,9,8,missing
5,19,15,21


### Ordering
Ordering is used in conjunction with sorting, as it allows us to specify the sort order for each column in the DataFrame, e.g. sort by the first column in reverse and then by another column

In R; 
```R
X[order(X$var1, X$var3),] 
```

Now in Julia, based on the help information

In [186]:
sort(X, order("col1", rev=true))

Row,col1,col2,col3
Unnamed: 0_level_1,Int64,Int64,Int64?
1,19,15,21
2,9,8,missing
3,5,26,missing
4,3,5,5
5,2,24,4


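We can also sort by several columns at once, either with one `order()` call per column or, as here, with a vector of column names and a matching vector of `rev` flags. Note that `sort!` reorders `X` in place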

In [187]:
sort!(X, ["col1", "col2"], rev=[true, false])

Row,col1,col2,col3
Unnamed: 0_level_1,Int64,Int64,Int64?
1,19,15,21
2,9,8,missing
3,5,26,missing
4,3,5,5
5,2,24,4
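
The values in `X` are all distinct in `col1`, so the second sort key never comes into play. A minimal sketch on a made-up dataframe with repeated keys (the names `grp` and `val` are hypothetical), mixing an `order()` call with a plain column name:

```julia
using DataFrames

# Hypothetical data with repeated values in the first sort key
df = DataFrame(grp = [1, 2, 1, 2], val = [10, 20, 30, 40])

# grp in descending order, then val ascending within each grp
sorted = sort(df, [order(:grp, rev = true), :val])

# The same ordering expressed with a vector of rev flags
sorted2 = sort(df, [:grp, :val], rev = [true, false])
```

Both calls return a new dataframe and leave `df` untouched; `sort!` would reorder `df` itself.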


### Adding rows and columns 
Adding rows and columns is a very common procedure - it should become as comfortable as adding sides to the playdough structure that we've made.

In R;
```R
X$var4 <- rnorm(5)
```

In Julia, a very basic way to do this is via assignment: we can assign to a column which doesn't exist yet, but soon will, and provide the data which will fill the column

In [188]:
X.col4 = rand(100:200, 5)

5-element Vector{Int64}:
 113
 113
 141
 194
 125

In [189]:
X

Row,col1,col2,col3,col4
Unnamed: 0_level_1,Int64,Int64,Int64?,Int64
1,19,15,21,113
2,9,8,missing,113
3,5,26,missing,141
4,3,5,5,194
5,2,24,4,125
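
The heading mentions rows as well, so here is a small sketch of adding those too. It assumes a made-up dataframe (`id` and `score` are hypothetical names): `push!` appends a single row in place, and `vcat` stacks two dataframes with matching columns into a new one.

```julia
using DataFrames

# Hypothetical starting data
df = DataFrame(id = [1, 2], score = [0.5, 0.7])

# Append one row in place, given as a NamedTuple
push!(df, (id = 3, score = 0.9))

# Stack another dataframe with the same columns underneath
more = DataFrame(id = [4], score = [0.1])
combined = vcat(df, more)

# Add a derived column by broadcasting over an existing one
combined.double = 2 .* combined.score
```

`push!` mutates `df`, whereas `vcat` leaves both inputs as they were, which fits the advice of keeping the original data intact.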


## Summarising Data 
We'll be looking at different ways of providing a snapshot of the general big picture of our datasets - the averages, limits, deviations and so on.  

Let's do the basics, the beginning and end of the datasets; 
In R; 
```R
head(data, n=3)
tail(data, n=5)
``` 

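In Julia, `first()` and `last()` return a single row by default; pass a count, e.g. `first(X, 3)` or `last(X, 5)`, to mirror the R calls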

In [190]:
first(X)

Row,col1,col2,col3,col4
Unnamed: 0_level_1,Int64,Int64,Int64?,Int64
1,19,15,21,113


In [191]:
last(X)

Row,col1,col2,col3,col4
Unnamed: 0_level_1,Int64,Int64,Int64?,Int64
5,2,24,4,125


To get a brief summary of the data with descriptive stats and other information such as the Types of the variables in the columns, we can use `summary(data)` in R and in Julia we can use **describe()**

In [192]:
describe(X)

Row,variable,mean,min,median,max,nmissing,eltype
Unnamed: 0_level_1,Symbol,Float64,Int64,Float64,Int64,Int64,Type
1,col1,7.6,2,5.0,19,0,Int64
2,col2,15.6,5,15.0,26,0,Int64
3,col3,10.0,4,5.0,21,2,"Union{Missing, Int64}"
4,col4,137.2,113,125.0,194,0,Int64


In R you can also use the `str(data)` command to inspect the structure; in Julia, `typeof()` shows the type of an individual column

In [193]:
typeof(X.col1)

Vector{Int64} (alias for Array{Int64, 1})

To get the quantiles of a vector, in R we have the base function `quantile(data$column, na.rm=TRUE)`, and in Julia we load the **Statistics** package (shipped with Julia) to add this functionality. Remember Julia is a more general language compared to R, which was always tailored towards statistical computing - see https://www.jlhub.com/julia/manual/en/function/quantile-exclamation

In [194]:
using Statistics

In [195]:
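# Print quarter quantiles
# Note: quantile! may reorder X.col1 in place (the later outputs reflect this);
# use quantile(X.col1, ...) to leave the column untouched
quantile!(X.col1, [0, 0.25, 0.5, 0.75, 1])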

5-element Vector{Float64}:
  2.0
  3.0
  5.0
  9.0
 19.0

To skip the missing values and print the median value

In [196]:
quantile(skipmissing(X.col3), 0.5) 

5.0
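
One thing to watch with the `quantile!` cell above: the trailing `!` signals that the function may modify its input, here by reordering the vector. A minimal sketch of the difference, using a plain vector `v` (a hypothetical stand-in for a column) so that no dataframe is affected:

```julia
using Statistics

v = [19, 9, 5, 3, 2]

# Non-mutating version: v keeps its original order
q = quantile(v, [0.25, 0.5, 0.75])

# Mutating version: work on a copy, because the input may be reordered
w = copy(v)
m = quantile!(w, 0.5)
```

Calling `quantile!` directly on a dataframe column reorders that column independently of the other columns, so the non-mutating `quantile` is usually the safer choice.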

### Checking for missing values
In R, count the number of missing values 
```R
sum(is.na(data$column))
```
Check if **any** NA values are present
```R
any(is.na(data$column))
```
Test to see whether all the values meet a certain condition (over 0)
```R
all(data$column > 0)
```

In Julia, count the missing values by passing a one-line anonymous function to `sum`

In [197]:
sum(x -> ismissing(x), X.col3)

2

Check whether any missing values are in there

In [198]:
any(x -> ismissing(x), X.col3)

true

Check whether all the values meet a certain condition

In [199]:
all(x -> ismissing(x), X.col1)

false

In [200]:
all(x -> x > 0, X.col1)

true

A cool little function in Julia to extract only the dataframe rows which contain missing values

In [201]:
filter(x -> any(ismissing, x), X)

Row,col1,col2,col3,col4
Unnamed: 0_level_1,Int64,Int64,Int64?,Int64
1,3,8,missing,113
2,5,26,missing,141


Perform a quick sum of all of the columns in a horizontal fashion - intuitively this would mean summing the entire row, and producing a new sum in the final column of the same row e.g. |1|2|3|6 (final)|. This can be a very quick way of checking whether there are any missing values, as the missing values will propagate across!

In R, a related check counts the NA values in each column;
```R
colSums(is.na(data))
```

In Julia, adding the column vectors together element-wise gives the row sums

In [202]:
sum(eachcol(X))

5-element Vector{Union{Missing, Int64}}:
 151
    missing
    missing
 213
 172

In Julia if we want to actually sum each column, meaning every value in the column vertically, we can collect the columns and then broadcast `sum` over them, OR we can just broadcast `sum` over `eachcol(X)` directly - fascinating but easily confusing!

In [203]:
sum.(collect(eachcol(X)))

4-element Vector{Union{Missing, Int64}}:
  38
  78
    missing
 686

In [204]:
sum.(eachcol(X))

4-element Vector{Union{Missing, Int64}}:
  38
  78
    missing
 686

Broadcasting `sum` over **eachrow()** gives the row sums again, equivalent to the `sum(eachcol(X))` call earlier

In [205]:
sum.(eachrow(X))

5-element Vector{Union{Missing, Int64}}:
 151
    missing
    missing
 213
 172

If we want to skip missing values when doing these operations we would broadcast **skipmissing()** across the rows first

In [206]:
sum.(skipmissing.(eachrow(X)))

5-element Vector{Int64}:
 151
 124
 172
 213
 172
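
The R snippet `colSums(is.na(data))` counts missing values per column, which none of the Julia cells above do directly. A minimal sketch of one way to get the same counts, on a made-up dataframe (`a`, `b`, `c` are hypothetical columns):

```julia
using DataFrames

# Hypothetical data with missing values in two of three columns
df = DataFrame(a = [1, missing, 3],
               b = [missing, missing, 6],
               c = [7, 8, 9])

# Number of missing values in each column, in column order
missing_counts = [count(ismissing, col) for col in eachcol(df)]

# Pair the counts with the column names for readability
by_name = Dict(zip(names(df), missing_counts))

# Rows with no missing values at all
complete = dropmissing(df)
```

`describe(df)` reports the same information in its `nmissing` column, so this is mostly useful when the counts are needed as values.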

### Subsetting the dataframe based upon values in the columns
Say for example that we only want the data which have a specific zipcode (generic value) in a column, what can we do? In R;
```R
data[data$zipCode %in% c("4109", "4110"),] 
```

In Julia - get a dataframe in which the values in the first column are 1 

In [207]:
filter(row -> row.col1 == 1, X)

Row,col1,col2,col3,col4
Unnamed: 0_level_1,Int64,Int64,Int64?,Int64


Now another one wherein the values are larger than 1 and smaller than 15

In [208]:
filter(row -> row.col1 > 1 && row.col1 < 15, X)

Row,col1,col2,col3,col4
Unnamed: 0_level_1,Int64,Int64,Int64?,Int64
1,2,15,21,113
2,3,8,missing,113
3,5,26,missing,141
4,9,5,5,194


## Cross Tabulation aka Frequency Tables
In order to provide small snapshots of potential interactions and relations, we can use cross-tabulation or frequency comparisons between variables, say, male and female and acceptance rates to university

In R we have some base functions;
```R
xt <- xtabs(Freq ~ Gender + Admit, data=DF)
```

In Julia we have to load a specific package called **FreqTables** https://github.com/nalimilan/FreqTables.jl

In [209]:
using Pkg; Pkg.add("FreqTables") ; using FreqTables

   Resolving package versions...
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`


Do a frequency table between columns 1 and 4 - clearly there is not much here to see given both are randomly generated vectors

In [210]:
freqtable(X, :col1, :col4)

5×4 Named Matrix{Int64}
col1 ╲ col4 │ 113  125  141  194
────────────┼───────────────────
2           │   1    0    0    0
3           │   1    0    0    0
5           │   0    0    1    0
9           │   0    0    0    1
19          │   0    1    0    0

## Size of the data in human readable form 
Very simple and yet very informative - how big is our data?
In R;
```R
print(object.size(data), units="Mb")
```

In Julia, we can use **varinfo()**, which lists the variables in the current session together with their sizes
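
To get the size of one specific object rather than a listing of everything, `Base.summarysize` is one option. A minimal sketch, assuming a made-up dataframe `df`; the conversion to MiB is done by hand here:

```julia
using DataFrames

# Hypothetical data: two columns of 1000 values each
df = DataFrame(a = 1:1000, b = rand(1000))

# Size in bytes of the object and everything reachable from it
nbytes = Base.summarysize(df)

# Convert to mebibytes for a human readable figure
println(round(nbytes / 1024^2, digits = 3), " MiB")
```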

## Creating New Variables
Often our datasets may consist of variables and values which we need to prune, transform and mould to our likings - perhaps they have broken delimiters, or are a combination of two values in one, or are better represented in a different form - and so on. These tasks require us to create new variables and add these to the dataset - remember that we should always keep copies of the original dataset and not simply mutate it into oblivion.

Let's take some restaurant data from Baltimore city to use as the sample dataset

In [214]:
download("https://gist.githubusercontent.com/slowteetoe/528c78213fcd80f05419/raw/e0a4a89476fca79e692df1a373e5025f5112a5f6/restaurants.csv", "restaurants.csv") 

"restaurants.csv"

In [317]:
rest_data = CSV.File("restaurants.csv") |> DataFrame
# We can also do
# CSV.read("restaurants.csv", DataFrame)

Row,name,zipCode,neighborhood,councilDistrict,policeDistrict,Location 1
Unnamed: 0_level_1,String,Int64,String,Int64,String15,String
1,410,21206,Frankford,2,NORTHEASTERN,"4509 BELAIR ROAD\nBaltimore, MD\n"
2,1919,21231,Fells Point,1,SOUTHEASTERN,"1919 FLEET ST\nBaltimore, MD\n"
3,SAUTE,21224,Canton,1,SOUTHEASTERN,"2844 HUDSON ST\nBaltimore, MD\n"
4,#1 CHINESE KITCHEN,21211,Hampden,14,NORTHERN,"3998 ROLAND AVE\nBaltimore, MD\n"
5,#1 chinese restaurant,21223,Millhill,9,SOUTHWESTERN,"2481 frederick ave\nBaltimore, MD\n"
6,19TH HOLE,21218,Clifton Park,14,NORTHEASTERN,"2722 HARFORD RD\nBaltimore, MD\n"
7,3 KINGS,21205,McElderry Park,13,SOUTHEASTERN,"2510 MCELDERRY ST\nBaltimore, MD\n"
8,"3 MILES HOUSE, INC.",21211,Remington,7,NORTHERN,"2701 MILES AVE\nBaltimore, MD\n"
9,3 W'S TAVERN,21205,McElderry Park,13,SOUTHEASTERN,"2518 MONUMENT ST\nBaltimore, MD\n"
10,300 SOUTH ANN STREET,21231,Upper Fells Point,1,SOUTHEASTERN,"300 ANN ST\nBaltimore, MD\n"


### Creating sequences 
We can create sequences to use as indexes for extracting data e.g. 
`rest_data[index sequence]`  

In R we can create a sequence of integers with a step size of 2 by;
```R
seq_one <- seq(1,10, by=2)
```
or generate an index along an existing vector built with the **c()** (combine) function
```R
seq_c <- c(1,3,8,25,100) ; seq(along = seq_c)
``` 

In Julia we can do it using the **collect()** function on a range - with the start and stop on either side and the step size in the middle

In [234]:
seq_one = collect(1:3:10)

4-element Vector{Int64}:
  1
  4
  7
 10

Or just create a vector directly 

In [235]:
seq_vec = Vector(1:3:10)

4-element Vector{Int64}:
  1
  4
  7
 10

Say we want to create a new variable which indicates whether each record meets a certain condition, say, it is in one of a couple of neighbourhoods (here Roland Park or Homeland) and is thus close to me. If it meets this condition it is assigned "TRUE", otherwise "FALSE", in a new column - how would we do this?

In R;
```R
restData$nearMe = restData$neighborhood %in% c("Roland Park", "Homeland")
```

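In Julia - this is a new one for me! - make sure we wrap the lookup values in an extra array `[[...]]` (or in `Ref(...)`), so that broadcasting treats them as a single collection rather than trying to pair them element by element with the column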

In [318]:
rest_data.nearMe = in.(rest_data.neighborhood, [["Roland Park", "Homeland"]])

1327-element BitVector:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 1
 0
 0
 0
 0
 0
 0
 0
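
The reason for the double brackets is easier to see on a tiny example. A minimal sketch with a made-up vector `hoods` (hypothetical values): wrapping the lookup list in `Ref` has the same effect as the extra array, because it tells broadcasting to reuse the whole list for every element.

```julia
# Hypothetical neighbourhood names
hoods = ["Homeland", "Canton", "Roland Park"]
lookup = ["Roland Park", "Homeland"]

# Wrap the lookup list so it is treated as one value during broadcasting
near_ref = in.(hoods, Ref(lookup))

# Same idea with the extra array used in the cell above
near_vec = in.(hoods, [lookup])

# A comprehension gives the same answer without broadcasting
near_cmp = [h in lookup for h in hoods]
```

Without the wrapper, `in.(hoods, lookup)` would try to match the two vectors element by element, and since their lengths differ it raises an error.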

R base has the very handy function table() which creates a nice tally of one's variables and counts them in a neat table output - Julia base doesn't really have this single item function so we have to use the **StatsBase** package and the **countmap()** function

In [343]:
Pkg.add("StatsBase") ; using StatsBase

   Resolving package versions...
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`


In [349]:
# 0 = false, 1 = true
# the values are already Bool; they are just printed as 0/1 here
qz = countmap(rest_data.nearMe)
println(qz)

Dict{Bool, Int64}(0 => 1314, 1 => 13)


### Creating binary variables 
We can use the **ifelse()** function to evaluate a conditional statement and store true or false in a new variable based upon the result of the condition (in Julia it needs broadcasting, `ifelse.()`). Here we'll do it in R first;
```R
restData$wrongZip = ifelse(restData$zipCode < 0, TRUE, FALSE)
```    
And then in Julia

In [354]:
rest_data.wrongZip = ifelse.(rest_data.zipCode .< 0, true, false)

1327-element BitVector:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0

In [355]:
countmap(rest_data.wrongZip)

Dict{Bool, Int64} with 2 entries:
  0 => 1326
  1 => 1

### Creating categorical variables
We may want to summarise certain aspects of our dataset by chunking them into categorical blocks - similar to percentiles. Say we want to get a look at the distribution of gene lengths or zip codes in our dataset, at quartile ranges - we can turn to categorical variables

Create them in Julia https://categoricalarrays.juliadata.org/v0.1/using.html

In [211]:
Pkg.add("CategoricalArrays") ; using CategoricalArrays

   Resolving package versions...
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`


In [212]:
cut(X.col1, 4)

5-element CategoricalArray{String,1,UInt32}:
 "Q1: [2.0, 3.0)"
 "Q2: [3.0, 5.0)"
 "Q3: [5.0, 9.0)"
 "Q4: [9.0, 19.0]"
 "Q4: [9.0, 19.0]"

In [359]:
 countmap(cut(rest_data.zipCode, 4))

Dict{CategoricalValue{String, UInt32}, Int64} with 4 entries:
  "Q4: [21225.5, 21287.0]"  => 332
  "Q2: [21202.0, 21218.0)"  => 507
  "Q3: [21218.0, 21225.5)"  => 351
  "Q1: [-21226.0, 21202.0)" => 137

In R let's use the **Hmisc** library 
```R
library(Hmisc)
restData$zipGroups = cut2(restData$zipCode, g=4)
table(restData$zipGroups)
```

When we turn our data into Categorical variables, we turn them into factors  

### Reshaping data 
The notions of split-apply-combine are prevalent in data analytics, as they are terrific guiding principles for approaching data frames. They were originally popularised in R and built around R packages, but are now supported by plenty of packages in most high level programming languages. 

We'll start with R and the basic **reshape2** library
```R
library(reshape2)
head(mtcars) #standard dataset in R
``` 

We can use some of the popular R datasets in Julia using the **RDatasets.jl** package - as we can see almost everything is supported in Julia

In [363]:
Pkg.add("RDatasets") ; using RDatasets

   Resolving package versions...
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`


Load the mtcars dataset from "datasets" 

In [364]:
mtcars = dataset("datasets", "mtcars")

Row,Model,MPG,Cyl,Disp,HP,DRat,WT,QSec,VS,AM,Gear,Carb
Unnamed: 0_level_1,String31,Float64,Int64,Float64,Int64,Float64,Float64,Float64,Int64,Int64,Int64,Int64
1,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
2,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
3,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
4,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
5,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
6,Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
7,Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
8,Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
9,Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
10,Merc 280,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4


In [366]:
first(mtcars, 5)

Row,Model,MPG,Cyl,Disp,HP,DRat,WT,QSec,VS,AM,Gear,Carb
Unnamed: 0_level_1,String31,Float64,Int64,Float64,Int64,Float64,Float64,Float64,Int64,Int64,Int64,Int64
1,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
2,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
3,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
4,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
5,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


In [371]:
# descriptive stats
describe(mtcars)

Row,variable,mean,min,median,max,nmissing,eltype
Unnamed: 0_level_1,Symbol,Union…,Any,Union…,Any,Int64,DataType
1,Model,,AMC Javelin,,Volvo 142E,0,String31
2,MPG,20.0906,10.4,19.2,33.9,0,Float64
3,Cyl,6.1875,4,6.0,8,0,Int64
4,Disp,230.722,71.1,196.3,472.0,0,Float64
5,HP,146.688,52,123.0,335,0,Int64
6,DRat,3.59656,2.76,3.695,4.93,0,Float64
7,WT,3.21725,1.513,3.325,5.424,0,Float64
8,QSec,17.8487,14.5,17.71,22.9,0,Float64
9,VS,0.4375,0,0.0,1,0,Int64
10,AM,0.40625,0,0.0,1,0,Int64


In [370]:
# column names
names(mtcars)

12-element Vector{String}:
 "Model"
 "MPG"
 "Cyl"
 "Disp"
 "HP"
 "DRat"
 "WT"
 "QSec"
 "VS"
 "AM"
 "Gear"
 "Carb"

### "Melting" dataframes in R / "Reshaping" in Julia
The main idea behind melting is the presence of "id" and "value" (measure) variables. The id variables are the consistent identifiers we want to retain, and the measure variables are "stacked" into a single value column, with a companion column recording which variable each value came from. It's hard to imagine but easier to understand once we play with it a little bit ourselves. When performing the melt and stack functions, we always need to designate these two sets of variables. In R, the id variables come first and the measure variables second, whereas in Julia it is the other way around. Let's take a look in R;
```R
mtcars$carname <- rownames(mtcars) # looks redundant next to the Julia code, where Model is already a column
carMelt <- melt(mtcars, id=c("carname", "gear", "cyl"), measure.vars=c("mpg", "hp"))
```

In Julia this is performed with the DataFrames package and **stack()**

In [375]:
mt_stack = stack(mtcars, [:MPG, :HP], [:Model, :Gear, :Cyl])

Row,Model,Gear,Cyl,variable,value
Unnamed: 0_level_1,String31,Int64,Int64,String,Float64
1,Mazda RX4,4,6,MPG,21.0
2,Mazda RX4 Wag,4,6,MPG,21.0
3,Datsun 710,4,4,MPG,22.8
4,Hornet 4 Drive,3,6,MPG,21.4
5,Hornet Sportabout,3,8,MPG,18.7
6,Valiant,3,6,MPG,18.1
7,Duster 360,3,8,MPG,14.3
8,Merc 240D,4,4,MPG,24.4
9,Merc 230,4,4,MPG,22.8
10,Merc 280,4,6,MPG,19.2


### Casting dataframes

Imagine we want to get a quick overview of how many values of a certain variable we have in our dataset - say, the distribution of miles per gallon and horse power based upon the cylinders a car has - the relation between cylinders and engine performance : Cylinders ~ engine. In R we would use **dcast()**; 
```R
cylData <- dcast(carMelt, cyl ~ variable) 
```

This is a little trickier in Julia as there's not as much supporting documentation yet, but this is called a "pivot table" in Julia - Bogumil (who else?!) has a great post here https://www.juliabloggers.com/pivot-tables-in-dataframes-jl/

In [410]:
unstack(mt_stack, :Cyl, :variable, :Cyl, combine=length)

Row,Cyl,MPG,HP
Unnamed: 0_level_1,Int64,Int64?,Int64?
1,6,7,7
2,4,11,11
3,8,14,14


Now let's get the mean of these measures, rather than their count

In [411]:
unstack(mt_stack, :Cyl, :variable, :value, combine=mean)

Row,Cyl,MPG,HP
Unnamed: 0_level_1,Int64,Float64?,Float64?
1,6,19.7429,122.286
2,4,26.6636,82.6364
3,8,15.1,209.214
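
Stacking and unstacking are easier to follow on a dataframe small enough to check by hand. A minimal sketch with made-up cars (the `cars` data and its values are hypothetical, not taken from `mtcars`), going from wide to long and back to a per-cylinder mean:

```julia
using DataFrames, Statistics

# Hypothetical data: two 4-cylinder and two 6-cylinder cars
cars = DataFrame(model = ["a", "b", "c", "d"],
                 cyl = [4, 4, 6, 6],
                 mpg = [30.0, 26.0, 20.0, 18.0],
                 hp = [60, 70, 110, 120])

# Wide to long: measure variables first, id variables second
long = stack(cars, [:mpg, :hp], [:model, :cyl])

# Long to wide: one row per cyl, one column per variable,
# with duplicate entries combined by their mean
wide = unstack(long, :cyl, :variable, :value, combine = mean)
```

`long` has eight rows (four cars times two measures), and `wide` has one row per cylinder count holding the mean `mpg` and `hp`.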


### Averaging values

In [436]:
# sum.() applied to single numbers is a no-op, so this is the same as countmap(mtcars.AM)
countmap(sum.(mtcars.AM))

Dict{Int64, Int64} with 2 entries:
  0 => 19
  1 => 13

In [437]:
groupby(mtcars, :Cyl)

Row,Model,MPG,Cyl,Disp,HP,DRat,WT,QSec,VS,AM,Gear,Carb
Unnamed: 0_level_1,String31,Float64,Int64,Float64,Int64,Float64,Float64,Float64,Int64,Int64,Int64,Int64
1,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
2,Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
3,Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
4,Fiat 128,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
5,Honda Civic,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
6,Toyota Corolla,33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
7,Toyota Corona,21.5,4,120.1,97,3.7,2.465,20.01,1,0,3,1
8,Fiat X1-9,27.3,4,79.0,66,4.08,1.935,18.9,1,1,4,1
9,Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
10,Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2

Row,Model,MPG,Cyl,Disp,HP,DRat,WT,QSec,VS,AM,Gear,Carb
Unnamed: 0_level_1,String31,Float64,Int64,Float64,Int64,Float64,Float64,Float64,Int64,Int64,Int64,Int64
1,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
2,Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
3,Merc 450SE,16.4,8,275.8,180,3.07,4.07,17.4,0,0,3,3
4,Merc 450SL,17.3,8,275.8,180,3.07,3.73,17.6,0,0,3,3
5,Merc 450SLC,15.2,8,275.8,180,3.07,3.78,18.0,0,0,3,3
6,Cadillac Fleetwood,10.4,8,472.0,205,2.93,5.25,17.98,0,0,3,4
7,Lincoln Continental,10.4,8,460.0,215,3.0,5.424,17.82,0,0,3,4
8,Chrysler Imperial,14.7,8,440.0,230,3.23,5.345,17.42,0,0,3,4
9,Dodge Challenger,15.5,8,318.0,150,2.76,3.52,16.87,0,0,3,2
10,AMC Javelin,15.2,8,304.0,150,3.15,3.435,17.3,0,0,3,2
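
`groupby` on its own only splits the data; the averaging happens when the groups are passed to `combine`. A minimal sketch of the split-apply-combine step on a made-up dataframe (the `cars` data and the output names `mean_mpg` and `n` are hypothetical):

```julia
using DataFrames, Statistics

# Hypothetical data
cars = DataFrame(cyl = [4, 6, 4, 8, 6],
                 mpg = [30.0, 20.0, 26.0, 15.0, 18.0])

# Split by cyl, then compute the mean mpg and the group size
avg = combine(groupby(cars, :cyl),
              :mpg => mean => :mean_mpg,
              nrow => :n)

# Sort explicitly rather than relying on the group order
sort!(avg, :cyl)
```

The result has one row per value of `cyl`, which is the shape the earlier `unstack(..., combine=mean)` call produced for the stacked data.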


## DataFrames.jl 
Almost everything covered herein appears to be described somewhere on the DataFrames.jl package site https://dataframes.juliadata.org/stable/man/split_apply_combine/ and/or on the DataFramesMeta site https://juliadata.org/DataFramesMeta.jl/stable/dplyr/ . It may take some googling and forum scouring, but rest assured that the developers have considered many many procedures and functions. As such it is best to work through Bogumil's book and create your own projects that you're interested in - passion and curiosity as the backbones of science and inquiry! 

## Merging/Joining data 
Joining datasets together is a common operation, especially within SQL - we can think of the common terminology of inner join, outer join and so on that's almost synonymous with databases. In R let's merge some data;
```R
mergedData = merge(reviews, solutions, by.x="solution_id", by.y="id", all=TRUE)
```

In Julia let's create some mock data 

In [439]:
people = DataFrame(ID=[20, 40], Name=["John Doe", "Jane Doe"])

Row,ID,Name
Unnamed: 0_level_1,Int64,String
1,20,John Doe
2,40,Jane Doe


In [440]:
jobs = DataFrame(ID=[20, 40], Job=["Lawyer", "Doctor"]) 

Row,ID,Job
Unnamed: 0_level_1,Int64,String
1,20,Lawyer
2,40,Doctor


Say there is a common ID shared between datasets: we can take advantage of this ID and 'sync' our data around it. In some sense this is like performing a union operation on sets - the common element is never duplicated.

**Inner joins are "the output contains rows for values of the key that exist in all passed data frames."**

In [442]:
innerjoin(people, jobs, on = :ID)

Row,ID,Name,Job
Unnamed: 0_level_1,Int64,String,String
1,20,John Doe,Lawyer
2,40,Jane Doe,Doctor


In [445]:
intersect(names(people), names(jobs))

1-element Vector{String}:
 "ID"

In [451]:
breaks = DataFrame(ID=[20, 60], Fruit=["Apple", "Watermelon"])

Row,ID,Fruit
Unnamed: 0_level_1,Int64,String
1,20,Apple
2,60,Watermelon


To merge all of the datasets together based upon a shared variable, even if not all of the rows are present in the other datasets, we can use the **outerjoin()** method 

In [453]:
outerjoin(breaks, people, jobs, on = :ID)

Row,ID,Fruit,Name,Job
Unnamed: 0_level_1,Int64,String?,String?,String?
1,20,Apple,John Doe,Lawyer
2,40,missing,Jane Doe,Doctor
3,60,Watermelon,missing,missing


In R, if we leave out the `by` arguments, `merge()` joins on all of the column names that the two data frames share;
```R
mergedData2 <- merge(reviews, solutions, all=TRUE)
```

# Quiz

1. The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here: 

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv

and load the data into R. The code book, describing the variable names is here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf 

Create a logical vector that identifies the households on greater than 10 acres who sold more than $10,000 worth of agriculture products. Assign that logical vector to the variable agricultureLogical. Apply the which() function like this to identify the rows of the data frame where the logical vector is TRUE. 

```R
which(agricultureLogical) 
```

What are the first 3 values that result? 

In [456]:
download("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv", "quiz3_q1.csv")

"quiz3_q1.csv"

In [458]:
first_q = CSV.File("quiz3_q1.csv") |> DataFrame

Row,RT,SERIALNO,DIVISION,PUMA,REGION,ST,ADJUST,WGTP,NP,TYPE,ACR,AGS,BDS,BLD,BUS,CONP,ELEP,FS,FULP,GASP,HFL,INSP,KIT,MHP,MRGI,MRGP,MRGT,MRGX,PLM,RMS,RNTM,RNTP,SMP,TEL,TEN,VACS,VAL,VEH,WATP,YBL,FES,FINCP,FPARC,GRNTP,GRPIP,HHL,HHT,HINCP,HUGCL,HUPAC,HUPAOC,HUPARC,LNGI,MV,NOC,NPF,NPP,NR,NRC,OCPIP,PARTNER,PSF,R18,R60,R65,RESMODE,SMOCP,SMX,SRNT,SVAL,TAXP,WIF,WKEXREL,WORKSTAT,FACRP,FAGSP,FBDSP,FBLDP,FBUSP,FCONP,FELEP,FFSP,FFULP,FGASP,FHFLP,FINSP,FKITP,FMHP,FMRGIP,FMRGP,FMRGTP,FMRGXP,FMVYP,FPLMP,FRMSP,FRNTMP,FRNTP,FSMP,FSMXHP,FSMXSP,⋯
Unnamed: 0_level_1,String1,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,⋯
1,H,186,8,700,4,16,1015675,89,4,1,1,missing,4,2,2,missing,180,0,2,3,3,600,1,missing,1,1300,1,1,1,9,missing,missing,missing,1,1,missing,17,3,840,5,2,105600,2,missing,missing,1,1,105600,0,2,2,2,1,4,2,4,0,0,2,18,0,0,1,0,0,1,1550,3,0,1,24,3,2,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
2,H,306,8,700,4,16,1015675,310,1,1,missing,missing,1,7,missing,missing,60,0,2,3,3,missing,1,missing,missing,missing,missing,missing,1,2,2,600,missing,1,3,missing,missing,1,1,3,missing,missing,missing,660,23,1,4,34000,0,4,4,4,1,3,0,missing,0,0,0,missing,0,0,0,0,0,2,missing,missing,1,0,missing,missing,missing,missing,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
3,H,395,8,100,4,16,1015675,106,2,1,1,missing,3,2,2,missing,70,0,2,30,1,200,1,missing,missing,missing,missing,3,1,7,missing,missing,missing,1,2,missing,18,2,50,5,7,9400,2,missing,missing,1,3,9400,0,2,2,2,1,2,1,2,0,0,1,23,0,0,1,0,0,1,179,missing,0,1,16,1,13,13,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
4,H,506,8,700,4,16,1015675,240,4,1,1,missing,4,2,2,missing,40,0,2,80,1,200,1,missing,1,860,1,1,1,6,missing,missing,400,1,1,missing,19,3,500,2,1,66000,1,missing,missing,1,1,66000,0,1,1,1,1,3,2,4,0,0,2,26,0,0,1,0,0,2,1422,1,0,1,31,2,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
5,H,835,8,800,4,16,1015675,118,4,1,2,1,5,2,2,missing,250,0,2,3,3,700,1,missing,1,1900,1,1,1,7,missing,missing,650,1,1,missing,20,5,2,3,1,93000,2,missing,missing,1,1,93000,0,2,2,2,1,1,1,4,0,0,1,36,0,0,1,0,0,1,2800,1,0,1,25,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
6,H,989,8,700,4,16,1015675,115,4,1,1,missing,3,2,2,missing,130,0,2,3,3,250,1,missing,1,700,1,1,1,6,missing,missing,400,1,1,missing,15,2,1200,5,2,61000,1,missing,missing,1,1,61000,0,1,1,1,1,4,2,4,0,0,2,26,0,0,1,0,0,2,1330,2,0,1,7,1,7,3,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
7,H,1861,8,700,4,16,1015675,0,1,2,missing,missing,missing,missing,missing,missing,missing,0,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,5,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,0,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,⋯
8,H,2120,8,200,4,16,1015675,35,1,1,1,missing,2,1,2,missing,40,0,480,3,4,missing,1,missing,missing,missing,missing,missing,1,4,missing,missing,missing,1,4,missing,missing,1,650,5,missing,missing,missing,missing,missing,1,6,10400,0,4,4,4,1,5,0,missing,0,0,0,missing,0,0,0,1,1,2,missing,missing,1,0,missing,missing,missing,missing,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
9,H,2278,8,400,4,16,1015675,47,2,1,1,missing,3,2,2,missing,2,0,2,3,3,770,1,missing,1,750,1,1,1,6,missing,missing,missing,1,1,missing,13,2,660,3,2,209000,4,missing,missing,1,1,209000,0,4,4,4,1,1,0,2,0,0,0,5,0,0,0,1,1,1,805,3,0,1,22,1,6,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
10,H,2428,8,500,4,16,1015675,51,2,1,1,missing,2,1,2,missing,20,0,2,140,1,120,1,220,missing,missing,missing,3,1,5,missing,missing,missing,1,2,missing,1,2,2,5,missing,missing,missing,missing,missing,2,5,35400,0,4,4,4,2,1,0,missing,0,1,0,7,0,0,0,0,0,1,196,missing,0,0,4,missing,missing,missing,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,⋯


In [508]:
#filter(row -> row.AGS == 6 && row.ACR == 3, first_q)
#filter(row -> row.col1 > 1 && row.col1 < 15, X)
#filter(row -> row.AGS == 6, first_q) 
#filter(x -> x.AGS == 6, first_q)
subset(first_q, :AGS => ByRow(==(6)), :ACR => ByRow(==(3)), skipmissing=true)

Row,RT,SERIALNO,DIVISION,PUMA,REGION,ST,ADJUST,WGTP,NP,TYPE,ACR,AGS,BDS,BLD,BUS,CONP,ELEP,FS,FULP,GASP,HFL,INSP,KIT,MHP,MRGI,MRGP,MRGT,MRGX,PLM,RMS,RNTM,RNTP,SMP,TEL,TEN,VACS,VAL,VEH,WATP,YBL,FES,FINCP,FPARC,GRNTP,GRPIP,HHL,HHT,HINCP,HUGCL,HUPAC,HUPAOC,HUPARC,LNGI,MV,NOC,NPF,NPP,NR,NRC,OCPIP,PARTNER,PSF,R18,R60,R65,RESMODE,SMOCP,SMX,SRNT,SVAL,TAXP,WIF,WKEXREL,WORKSTAT,FACRP,FAGSP,FBDSP,FBLDP,FBUSP,FCONP,FELEP,FFSP,FFULP,FGASP,FHFLP,FINSP,FKITP,FMHP,FMRGIP,FMRGP,FMRGTP,FMRGXP,FMVYP,FPLMP,FRMSP,FRNTMP,FRNTP,FSMP,FSMXHP,FSMXSP,⋯
Unnamed: 0_level_1,String1,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,⋯
1,H,30346,8,400,4,16,1015675,120,4,1,3,6,3,2,2,missing,150,0,2,3,3,600,1,missing,1,1400,1,1,1,5,missing,missing,missing,1,1,missing,20,2,2,3,1,62600,2,missing,missing,3,1,62600,0,2,2,2,1,1,2,4,0,0,2,30,0,0,1,0,0,1,1550,3,0,0,24,2,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,⋯
2,H,53292,8,300,4,16,1015675,26,3,1,3,6,2,3,2,missing,120,0,1000,3,6,1500,1,missing,2,1400,2,1,1,4,missing,missing,missing,1,1,missing,11,3,100,5,1,120000,2,missing,missing,1,1,120000,0,2,2,2,1,6,1,3,0,0,1,18,0,0,1,0,0,1,1819,3,0,0,22,2,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
3,H,56299,8,800,4,16,1015675,97,2,1,3,6,2,2,2,missing,80,0,2,110,1,missing,1,missing,missing,missing,missing,missing,1,5,missing,missing,missing,1,4,missing,missing,2,2,9,2,23200,4,missing,missing,1,1,23200,0,4,4,4,1,4,0,2,0,0,0,missing,0,0,0,2,0,2,missing,missing,0,0,missing,1,3,3,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
4,H,101282,8,800,4,16,1015675,76,2,1,3,6,3,2,2,missing,70,0,850,3,4,860,1,missing,2,600,2,1,1,9,missing,missing,missing,1,1,missing,14,4,2,9,1,27500,4,missing,missing,1,1,27500,0,4,4,4,1,7,0,2,0,0,0,39,0,0,0,1,0,1,883,3,0,0,18,2,2,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,⋯
5,H,120351,8,800,4,16,1015675,51,5,1,3,6,5,2,2,missing,80,0,2,10,2,0,1,missing,missing,missing,missing,3,1,9,missing,missing,missing,1,2,missing,24,2,2,1,2,41500,3,missing,missing,1,1,41500,0,1,1,1,1,1,3,5,0,0,3,7,0,0,1,0,0,1,257,missing,0,0,32,2,2,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
6,H,122802,8,800,4,16,1015675,63,5,1,3,6,3,2,2,missing,150,0,2,3,3,4800,1,missing,2,1100,2,1,1,6,missing,missing,missing,1,1,missing,24,3,2,5,1,5200,2,missing,missing,2,1,5200,0,2,2,2,1,5,3,5,0,0,3,101,0,0,1,0,0,1,2233,3,0,0,65,2,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,⋯
7,H,133128,8,300,4,16,1015675,15,2,1,3,6,1,2,2,missing,100,0,770,3,4,2000,1,missing,2,870,2,1,1,5,missing,missing,missing,1,1,missing,22,4,2,9,2,32200,4,missing,missing,1,1,32200,0,4,4,4,1,7,0,2,0,0,0,53,0,0,0,2,2,2,1409,3,0,0,37,1,6,3,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
8,H,140896,8,400,4,16,1015675,72,2,1,3,6,4,2,2,missing,130,0,2,3,2,800,1,missing,missing,missing,missing,3,1,9,missing,missing,missing,1,2,missing,17,2,2,8,2,32000,4,missing,missing,1,1,32000,0,4,4,4,1,6,0,2,0,0,0,11,0,0,0,0,0,1,289,missing,0,0,23,1,6,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
9,H,169806,8,800,4,16,1015675,62,1,1,3,6,4,2,2,missing,100,0,2,3,3,1500,1,missing,missing,missing,missing,3,1,8,missing,missing,missing,1,2,missing,24,2,2,6,missing,missing,missing,missing,missing,1,4,30700,0,4,4,4,1,7,0,missing,0,0,0,28,0,0,0,1,1,1,725,missing,0,0,64,missing,missing,missing,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
10,H,173013,8,500,4,16,1015675,77,2,1,3,6,3,2,2,missing,50,0,2,70,3,200,1,missing,1,900,2,1,1,6,missing,missing,missing,1,1,missing,20,5,2,2,2,120000,4,missing,missing,1,1,120000,0,4,4,4,1,3,0,2,0,0,0,11,0,0,0,2,0,1,1120,3,0,0,24,1,3,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
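
A minimal sketch of why `skipmissing=true` matters in the call above. The `toy` table and its values are made up for illustration and only stand in for `first_q`. It shows two ways to get the same rows when the condition columns contain `missing`.

```julia
using DataFrames

# Hypothetical toy data standing in for first_q; the values are invented.
toy = DataFrame(AGS = [6, missing, 6, 2], ACR = [3, 3, missing, 3])

# `missing == 6` is `missing`, not `false`, so a bare `row.AGS == 6`
# predicate would throw here. isequal() always returns a Bool.
by_filter = filter(row -> isequal(row.AGS, 6) && isequal(row.ACR, 3), toy)

# subset() with skipmissing=true treats a missing condition as "drop this row".
by_subset = subset(toy, :AGS => ByRow(==(6)), :ACR => ByRow(==(3)), skipmissing=true)
```

Both versions keep only the first row of `toy`. `coalesce(row.AGS == 6, false)` would be another way to write the `filter` predicate.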


## Some scrambled notes for web-scraping

In [438]:
Pkg.add("Cascadia") ; using Cascadia, HTTP, Gumbo

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.10/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.10/Manifest.toml`


In [18]:
# Fetch the search results page; despite its name, `link` holds the HTTP response, not a URL
link = HTTP.get("https://www.amazon.com.au/s?k=1780225598")

HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Content-Type: text/html;charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Server: Server
Date: Fri, 09 Feb 2024 07:08:12 GMT
x-amz-rid: XBCSXRHSWPM48GETM3RJ
set-cookie: ******
Pragma: no-cache
Content-Encoding: gzip
content-security-policy-report-only: default-src 'self' blob: https: data: mediastream: 'unsafe-eval' 'unsafe-inline';report-uri https://metrics.media-amazon.com/
X-XSS-Protection: 1;
X-Content-Type-Options: nosniff
Accept-CH: ect,rtt,downlink,device-memory,sec-ch-device-memory,viewport-width,sec-ch-viewport-width,dpr,sec-ch-dpr,sec-ch-ua-platform,sec-ch-ua-platform-version
Accept-CH-Lifetime: 86400
Cache-Control: no-cache
Expires: -1
Content-Language: en-AU
Content-Security-Policy: upgrade-insecure-requests;report-uri https://metrics.media-amazon.com/
Strict-Transport-Security: max-age=47474747; includeSubDomains; preload
Vary: Content-Type,Accept-Encoding,User-Agent
X-Frame-Options: SAMEORIGIN

In [19]:
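# Parse the response body with Gumbo; despite its name, `link_string` is an HTML document tree, not a String
link_string = parsehtml(String(link.body))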

HTML Document:
<!DOCTYPE html>
HTMLElement{:HTML}:<HTML class="a-no-js" data-19ax5a9jf="dingo" lang="en-au">
  <head>
    <script>var aPageStart = (new Date()).getTime();    </script>
    <meta charset="utf-8"/>
    <script type="text/javascript">var ue_t0=ue_t0||+new Date();    </script>
    <meta content="on" http-equiv="x-dns-prefetch-control"/>
    <link href="https://images-fe.ssl-images-amazon.com" rel="dns-prefetch"/>
    <link href="https://m.media-amazon.com" rel="dns-prefetch"/>
    <link href="https://completion.amazon.com" rel="dns-prefetch"/>
    <script type="text/javascript">
window.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;
if (window.ue_ihb === 1) {

var ue_csm = window,
    ue_hob = +new Date();
(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=function(b){return f()-(b?0:d.ue_t0)};e.stub=function(b,a){if(!b[a]){var c=[];b[a]=function(){c.push([c.slice.call(arguments),e.d(),d.ue_id])};b[a].replay=function(b){for(var a;a=c.shift();)b

In [20]:
# The root <html> element has two children here: [1] is <head>, [2] is <body>
link_body = link_string.root[2]

HTMLElement{:body}:<body class="a-aui_72554-c a-aui_a11y_1_699934-c a-aui_a11y_4_835613-c a-aui_a11y_6_837773-c a-aui_a11y_sr_678508-c a-aui_killswitch_csa_logger_372963-c a-aui_pci_risk_banner_210084-c a-aui_preload_261698-c a-aui_rel_noreferrer_noopener_309527-c a-aui_template_weblab_cache_333406-c a-aui_tnr_v2_180836-c">
  <div id="a-page">
    <script data-a-state="{&quot;key&quot;:&quot;a-wlab-states&quot;}" type="a-state">{"AUI_A11Y_6_837773":"C","AUI_TNR_V2_180836":"C","AUI_PRELOAD_261698":"C","AUI_TEMPLATE_WEBLAB_CACHE_333406":"C","AUI_72554":"C","AUI_A11Y_1_699934":"C","AUI_A11Y_4_835613":"C","AUI_KILLSWITCH_CSA_LOGGER_372963":"C","AUI_A11Y_SR_678508":"C","AUI_REL_NOREFERRER_NOOPENER_309527":"C","AUI_PCI_RISK_BANNER_210084":"C"}    </script>
    <script>typeof uex === 'function' && uex('ld', 'portal-bb', {wb: 1})    </script>
    <img alt="" height="1" onload="window.ue_sbl &amp;&amp; window.ue_sbl();" src="//fls-fe.amazon.com.au/1/batch/1/OP/A39IBJ37TRP1C6:358-8618874-4820313

In [122]:
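# CSS class selector: collect every element under <body> with class "a-offscreen" (the price spans)
zz = eachmatch(sel".a-offscreen", link_body)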

2-element Vector{HTMLNode}:
 HTMLElement{:span}:<span class="a-offscreen">
  $19.25
</span>


 HTMLElement{:span}:<span class="a-offscreen">
  $24.99
</span>
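
A small offline sketch of the same selector step, so it can be rerun without hitting the live page. The `html` snippet below is invented for illustration (it only mimics the two price spans above), and `prices` is a made-up name.

```julia
using Gumbo, Cascadia

# Hypothetical stand-in for the downloaded page; `\$` keeps the dollar sign literal.
html = """
<html><body>
  <span class="a-offscreen">\$19.25</span>
  <span class="a-offscreen">\$24.99</span>
</body></html>
"""

doc = parsehtml(html)

# Same class selector as above, applied to the whole parsed tree.
nodes = eachmatch(sel".a-offscreen", doc.root)

# nodeText() pulls the text out of each matched element; strip() trims stray whitespace.
prices = [strip(nodeText(n)) for n in nodes]
```

Selecting by class keeps working if the page layout shifts, as long as the class name stays the same.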



In [98]:
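# Drill into the tree by child position; brittle, since the indices change whenever the page layout does
link_body[1][28][10]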

HTMLElement{:div}:<div class="s-desktop-width-max s-desktop-content s-opposite-dir s-wide-grid-style sg-row">
  <div class="sg-col-20-of-24 s-matching-dir sg-col-16-of-20 sg-col sg-col-8-of-12 sg-col-12-of-16">
    <div class="sg-col-inner">
      <div id="s-skipLinkTargetForMainSearchResults" tabindex="-1"></div>
      <span class="rush-component s-latency-cf-section" data-component-type="s-search-results">
        <div class="s-main-slot s-result-list s-search-results sg-row">
          <div class="a-section a-spacing-none s-result-item s-flex-full-width s-border-bottom-none s-widget s-widget-spacing-large" data-asin="" data-index="0">
            <div cel_widget_id="MAIN-TOP_BANNER_MESSAGE-0" class="s-widget-container s-spacing-mini s-widget-container-height-mini celwidget slot=MAIN template=TOP_BANNER_MESSAGE widgetId=messaging-messages-results-header-builder" data-uuid="43338d61-129c-47bc-ad08-60ab9a6a6d95">
              <span class="rush-component" data-component-type="s-messagi

In [155]:
price_element = zz[1].children[1]

HTML Text: `$19.25`

In [160]:
price_text = nodeText(price_element)

"\$19.25"

In [172]:
# Drop the leading "$"; it is a single byte, so index 2 is the start of the next character
price_text[2:end]

"19.25"

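A possible next step, sketched with Base Julia only: turn the scraped text into a number. `parse_price` is a hypothetical helper, not something from Cascadia or Gumbo. It also sidesteps a limitation of `price_text[2:end]`, which relies on the currency symbol being one byte long (a symbol like "€" takes three bytes, so index 2 would not be a valid character start).

```julia
# Hypothetical helper: strip everything except digits and dots, then try to parse.
# Returns a Float64, or `nothing` if no number is left to parse.
function parse_price(text::AbstractString)
    cleaned = replace(strip(text), r"[^0-9.]" => "")
    return tryparse(Float64, cleaned)
end

price = parse_price("\$19.25")       # currency symbol removed
big   = parse_price("\$1,299.00")    # thousands separator removed as well
bad   = parse_price("n/a")           # nothing numeric, so `nothing`
```

Using `tryparse` rather than `parse` means a malformed price gives `nothing` instead of an exception, which is easier to handle when looping over many scraped values.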