# Week 3 Notes - Getting and Cleaning Data

## Subsetting and Sorting
Time to wrangle and mould our datasets to our desires. Create a basic dataframe, reshuffle the rows and then insert some NA values too.
```R
set.seed(1345)
X <- data.frame("var1"=sample(1:5), "var2"=sample(6:10), "var3"=sample(11:15))
X <- X[sample(1:5),]; X$var2[c(1,3)] = NA
``` 

In Julia

In [6]:
using DataFrames, CSV

In [10]:
X = DataFrame(col1=rand(1:20, 5), col2=rand(2:30, 5), col3=[missing, missing, 5, 4, 21])

| Row | col1  | col2  | col3    |
| ---:| -----:| -----:| -------:|
|     | Int64 | Int64 | Int64?  |
| 1   | 6     | 25    | missing |
| 2   | 19    | 14    | missing |
| 3   | 3     | 4     | 5       |
| 4   | 18    | 9     | 4       |
| 5   | 1     | 13    | 21      |


Let's subset the dataframe and take specific columns and row combinations

In R
`X[,1]` to take the first column. We can also take the first column by passing the column name as a string: `X[,"var1"]`. Let's take the first two rows of column 2: `X[1:2,"var2"]`

With Julia we can use the basic .dot syntax

In [14]:
X.col1

5-element Vector{Int64}:
  6
 19
  3
 18
  1

In [26]:
# This string based extraction is a bit slower
# Julia converts the String to a Symbol type 
X."col1"

5-element Vector{Int64}:
  6
 19
  3
 18
  1

We can also use indexing and the column names

In [24]:
X[:, 2]

5-element Vector{Int64}:
 25
 14
  4
  9
 13

In [29]:
X[:, "col1"]

5-element Vector{Int64}:
  6
 19
  3
 18
  1

In [30]:
X[:, :col1]

5-element Vector{Int64}:
  6
 19
  3
 18
  1

So to summarise it all: if we're using indexing, we can use the column number, "colname" or :colname. If we're using .dot syntax we can use df.colname or df."colname" - wooooooo

In Julia, to check the column index of a certain column e.g. is it in the 10th column? etc., we can use the **columnindex()** function

In [31]:
columnindex(X, "col3")

3

To test whether a specific column is in the dataframe, based on its name, we can use **hasproperty()**

In [32]:
hasproperty(X, "col3")

true
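
One detail the access forms above hide is whether we get the stored column or a copy of it. The sketch below, on a made-up toy dataframe (`df` and its values are assumptions for illustration, not the `X` from these notes), shows the difference between `df.col1`, `df[!, :col1]` and `df[:, :col1]`.

```julia
using DataFrames

# Hypothetical toy data; the values are made up for illustration.
df = DataFrame(col1 = [6, 19, 3], col2 = [25, 14, 4])

a = df.col1        # the column vector stored inside df (no copy)
b = df[!, :col1]   # the same stored vector as df.col1
c = df[:, :col1]   # a fresh copy of the column

c[1] = 100         # changing the copy leaves df untouched
a[2] = 200         # changing the stored vector changes df itself
```

So `df[:, :col1]` is the safe choice when we plan to modify the result, while `df.col1` avoids the copy when we only want to read.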

#### Subset the dataframe using conditions, such as, print the dataframe in which the first column has values over 18

In [42]:
filter(row -> row.col1 > 18, X)

| Row | col1  | col2  | col3    |
| ---:| -----:| -----:| -------:|
|     | Int64 | Int64 | Int64?  |
| 1   | 19    | 14    | missing |


If we have multiple conditions, we can combine them with `&&` (and) or `||` (or)

In [57]:
filter(row -> row.col1 > 1 || row.col2 > 1, X)

| Row | col1  | col2  | col3    |
| ---:| -----:| -----:| -------:|
|     | Int64 | Int64 | Int64?  |
| 1   | 6     | 25    | missing |
| 2   | 19    | 14    | missing |
| 3   | 3     | 4     | 5       |
| 4   | 18    | 9     | 4       |
| 5   | 1     | 13    | 21      |
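
Since the OR example above keeps every row, here is a small sketch where the conditions actually cut the data down. It also shows one way to filter on a column that contains `missing`. The dataframe `df` is a hypothetical stand-in with made-up values, and using `coalesce` to treat `missing` as "no match" is just one possible choice.

```julia
using DataFrames

# Hypothetical toy data mirroring the shape of X (values are made up).
df = DataFrame(col1 = [6, 19, 3, 18, 1],
               col2 = [25, 14, 4, 9, 13],
               col3 = [missing, missing, 5, 4, 21])

# AND: both conditions must hold for a row to be kept.
both = filter(row -> row.col1 > 5 && row.col2 > 10, df)

# OR: either condition is enough.
either = filter(row -> row.col1 > 18 || row.col2 < 5, df)

# `missing > 4` evaluates to `missing`, not `false`, so wrap the comparison
# in coalesce to treat a missing value as "does not match".
over4 = filter(row -> coalesce(row.col3 > 4, false), df)
```

Without the `coalesce`, the last filter would raise an error, because the predicate has to return a plain `true` or `false`.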


In R we could do 
```R
X[(X$var1 <= 3 & X$var3 > 11),]
X[(X$var1 <= 3 | X$var3 > 11),]
``` 

### Sorting 

In R;
```R
sort(X$var1)
# Sort in reverse
sort(X$var1, decreasing=TRUE)
``` 

In Julia, to get just a sorted vector from a specific dataframe column (see the DataFrames.jl sorting docs):
https://dataframes.juliadata.org/stable/man/sorting/

In [64]:
sort(X.col1, rev=true)

5-element Vector{Int64}:
 19
 18
  6
  3
  1

To sort the entire dataframe by a column and print it out for viewing

In [68]:
sort(X, "col1")
# or
sort(X, 1)
# or - this is the slowest one (note that :1 is just the integer 1, so this is sort(X, [1]))
sort(X, [:1])

| Row | col1  | col2  | col3    |
| ---:| -----:| -----:| -------:|
|     | Int64 | Int64 | Int64?  |
| 1   | 1     | 13    | 21      |
| 2   | 3     | 4     | 5       |
| 3   | 6     | 25    | missing |
| 4   | 18    | 9     | 4       |
| 5   | 19    | 14    | missing |


### Ordering
Ordering is used in conjunction with sorting: it lets us specify the sort order for each column in the DataFrame, e.g. sort by one column first and reverse it

In R; 
```R
X[order(X$var1, X$var3),] 
```

Now in Julia, based on the help information

In [71]:
sort(X, order("col1", rev=true))

| Row | col1  | col2  | col3    |
| ---:| -----:| -----:| -------:|
|     | Int64 | Int64 | Int64?  |
| 1   | 19    | 14    | missing |
| 2   | 18    | 9     | 4       |
| 3   | 6     | 25    | missing |
| 4   | 3     | 4     | 5       |
| 5   | 1     | 13    | 21      |


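We can also sort by several columns at once and give each its own direction, here col1 descending and col2 ascending. Note that `sort!` reorders X in place, and that the output already shows col4 because this cell was run after the column was added in the next section.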

In [89]:
sort!(X, ["col1", "col2"], rev=[true, false])

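| Row | col1  | col2  | col3    | col4  |
| ---:| -----:| -----:| -------:| -----:|
|     | Int64 | Int64 | Int64?  | Int64 |
| 1   | 19    | 14    | missing | 189   |
| 2   | 18    | 9     | 4       | 132   |
| 3   | 6     | 25    | missing | 158   |
| 4   | 3     | 4     | 5       | 148   |
| 5   | 1     | 13    | 21      | 178   |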
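
The columns in X have no ties, so the second sort key never comes into play. Here is a small sketch with ties, passing one `order()` per column. The dataframe `df` and its values are made up for illustration.

```julia
using DataFrames

# Hypothetical toy data with repeated values in `a` so ties must be broken.
df = DataFrame(a = [1, 2, 1, 2], b = [4, 3, 2, 1])

# Sort by `a` descending, then break ties by `b` ascending.
sorted = sort(df, [order(:a, rev = true), order(:b)])

# sort returns a new DataFrame; sort! would reorder df in place instead.
```

This is the closest match to R's `X[order(X$var1, X$var3),]`, with the added option of a direction per column.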


### Adding rows and columns 
Adding rows and columns is a very common procedure - it should become as comfortable as adding sides to the playdough structure that we've made.

In R;
```R
X$var4 <- rnorm(5)
```

In Julia, a very basic way to do this is by assignment: we assign to a column which doesn't exist yet, but soon will, and provide the data which will fill the column

In [87]:
X.col4 = rand(100:200, 5)

5-element Vector{Int64}:
 189
 132
 158
 148
 178

In [88]:
X

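| Row | col1  | col2  | col3    | col4  |
| ---:| -----:| -----:| -------:| -----:|
|     | Int64 | Int64 | Int64?  | Int64 |
| 1   | 19    | 14    | missing | 189   |
| 2   | 18    | 9     | 4       | 132   |
| 3   | 6     | 25    | missing | 158   |
| 4   | 3     | 4     | 5       | 148   |
| 5   | 1     | 13    | 21      | 178   |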


## Summarising Data 
We'll be looking at different ways of providing a snapshot of the general big picture of our datasets - the averages, limits, deviations and so on.  

Let's do the basics, the beginning and end of the datasets; 
In R; 
```R
head(data, n=3)
tail(data, n=5)
``` 

In Julia (called without a row count, `first` and `last` each return a single row)

In [90]:
first(X)

| Row | col1  | col2  | col3    | col4  |
| ---:| -----:| -----:| -------:| -----:|
|     | Int64 | Int64 | Int64?  | Int64 |
| 1   | 19    | 14    | missing | 189   |


In [91]:
last(X)

| Row | col1  | col2  | col3    | col4  |
| ---:| -----:| -----:| -------:| -----:|
|     | Int64 | Int64 | Int64?  | Int64 |
| 5   | 1     | 13    | 21      | 178   |
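
To match R's `head(data, n=3)` and `tail(data, n=5)` more closely, `first` and `last` also accept a row count. A minimal sketch on a made-up dataframe (`df` and its values are assumptions for illustration):

```julia
using DataFrames

# Hypothetical toy data; the values are made up for illustration.
df = DataFrame(col1 = [19, 18, 6, 3, 1], col2 = [14, 9, 25, 4, 13])

top3 = first(df, 3)      # DataFrame with the first three rows
bottom2 = last(df, 2)    # DataFrame with the last two rows
one = first(df)          # a single DataFrameRow, not a DataFrame
```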


To get a brief summary of the data with descriptive stats and other information such as the types of the variables in the columns, we can use `summary(data)` in R and in Julia we can use **describe()**

In [95]:
describe(X)

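| Row | variable | mean    | min   | median  | max   | nmissing | eltype                |
| ---:|:-------- | -------:| -----:| -------:| -----:| --------:|:--------------------- |
|     | Symbol   | Float64 | Int64 | Float64 | Int64 | Int64    | Type                  |
| 1   | col1     | 9.4     | 1     | 6.0     | 19    | 0        | Int64                 |
| 2   | col2     | 13.0    | 4     | 13.0    | 25    | 0        | Int64                 |
| 3   | col3     | 10.0    | 4     | 5.0     | 21    | 2        | Union{Missing, Int64} |
| 4   | col4     | 161.0   | 132   | 158.0   | 189   | 0        | Int64                 |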


In R you can also use the `str(data)` command. In Julia, `typeof` reports the type of a single column

In [98]:
typeof(X.col1)

Vector{Int64} (alias for Array{Int64, 1})

To get the quantiles of a vector, in R we have the base function `quantile(data$column, na.rm=TRUE)` and in Julia we load the **Statistics** standard library to get this functionality. Remember Julia is a more general language compared to R, which was always tailored towards statistical computing - see https://www.jlhub.com/julia/manual/en/function/quantile-exclamation

In [105]:
using Statistics

In [107]:
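# Print quarter quantiles
# Careful: quantile! sorts X.col1 in place, so col1 no longer lines up
# with the other columns afterwards (the later outputs reflect this);
# quantile(X.col1, ...) leaves X untouched
quantile!(X.col1, [0, 0.25, 0.5, 0.75, 1])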

5-element Vector{Float64}:
  1.0
  3.0
  6.0
 18.0
 19.0
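
The difference between `quantile` and `quantile!` matters for dataframe columns, so here is a small sketch on a plain vector. The vector `v` holds made-up values, and only the Statistics standard library is needed.

```julia
using Statistics

v = [6, 19, 3, 18, 1]              # made-up values for illustration
probs = [0, 0.25, 0.5, 0.75, 1]

# quantile works on a copy, so v keeps its original order.
q = quantile(v, probs)

# quantile! is allowed to reorder its input while computing the result,
# so hand it a copy if the original order matters.
w = copy(v)
q_inplace = quantile!(w, probs)
```

Both calls return the same quantiles. The only difference is what happens to the input vector.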

To skip the missing values and print the median value

In [110]:
quantile(skipmissing(X.col3), 0.5) 

5.0

### Checking for missing values
In R, count the number of missing values 
```R
sum(is.na(data$column))
```
Check if **any** NA values are present
```R
any(is.na(data$column))
```
Test to see whether all the values meet a certain condition (over 0)
```R
all(data$column > 0)
```

In Julia, count the missing values by summing a one-line anonymous function over the column

In [157]:
sum(x -> ismissing(x), X.col3)

2

Check whether any missing values are in there

In [158]:
any(x -> ismissing(x), X.col3)

true

Check whether all the values meet a certain condition

In [159]:
all(x -> ismissing(x), X.col1)

false

In [160]:
all(x -> x > 0, X.col1)

true

A cool little one-liner in Julia to extract only the dataframe rows which contain missing values

In [197]:
filter(x -> any(ismissing, x), X)

| Row | col1  | col2  | col3    | col4  |
| ---:| -----:| -----:| -------:| -----:|
|     | Int64 | Int64 | Int64?  | Int64 |
| 1   | 1     | 14    | missing | 189   |
| 2   | 6     | 25    | missing | 158   |
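
The missing-value checks above can be written a little more compactly, and DataFrames.jl also has helpers for the opposite case of keeping only the complete rows. A minimal sketch on a made-up dataframe (`df` and its values are assumptions for illustration):

```julia
using DataFrames

# Hypothetical toy data; the values are made up for illustration.
df = DataFrame(col1 = [1, 2, 3], col3 = [missing, 5, missing])

# Functions can be passed by name, without wrapping them in x -> ...
n_missing = count(ismissing, df.col3)
has_missing = any(ismissing, df.col3)

# Boolean mask marking the rows that have no missing values at all.
mask = completecases(df)

# New DataFrame containing only the complete rows.
complete = dropmissing(df)
```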


Perform a quick sum across the columns in a horizontal fashion: this sums each row and produces one total per row, e.g. |1|2|3| gives 6. This can be a very quick way of checking whether there are any missing values, as a missing value will propagate into its row total! Note that the R snippet below does something slightly different: it counts the NA values in each column.

In R;
```R
colSums(is.na(data))
```

In Julia

In [221]:
sum(eachcol(X))

5-element Vector{Union{Missing, Int64}}:
    missing
 148
    missing
 175
 231
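
For a closer Julia match to R's `colSums(is.na(data))`, we can count the missing values in each column directly. This is a sketch on a made-up dataframe (`df` and its values are assumptions for illustration). The `nmissing` column of `describe()` reports the same numbers.

```julia
using DataFrames

# Hypothetical toy data; the values are made up for illustration.
df = DataFrame(col1 = [1, 3, 6],
               col3 = [missing, 5, missing],
               col4 = [189, 132, 158])

# One count per column, in column order.
na_counts = [count(ismissing, col) for col in eachcol(df)]

# Pair each column name with its count for easier reading.
named_counts = names(df) .=> na_counts
```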

In Julia, if we want to actually sum the entire column, meaning every value in the column vertically, we can collect the columns and then sum each one OR we can just broadcast the sum function over the columns - fascinating but easily confusing!

In [234]:
sum.(collect(eachcol(X)))

4-element Vector{Union{Missing, Int64}}:
  47
  65
    missing
 805

In [235]:
sum.(eachcol(X))

4-element Vector{Union{Missing, Int64}}:
  47
  65
    missing
 805

There is an equivalent operation for rows: broadcast over **eachrow()**

In [238]:
sum.(eachrow(X))

5-element Vector{Int64}:
 204
 148
 189
 175
 231

If we want to skip missing values when doing these operations, we can broadcast **skipmissing()** across the rows first

In [241]:
sum.(skipmissing.(eachrow(X)))

5-element Vector{Int64}:
 204
 148
 189
 175
 231

### Subsetting the dataframe based upon values in the columns
Say for example that we only want the data which have a specific zipcode (generic value) in a column, what can we do? In R;
```R
data[data$zipCode %in% c("4109", "4110"),] 
```

In Julia - get a dataframe in which the values in the first column are 1 

In [253]:
filter(row -> row.col1 == 1, X)

| Row | col1  | col2  | col3    | col4  |
| ---:| -----:| -----:| -------:| -----:|
|     | Int64 | Int64 | Int64?  | Int64 |
| 1   | 1     | 14    | missing | 189   |


Now another one wherein the values are larger than 1 and smaller than 15

In [254]:
filter(row -> row.col1 > 1 && row.col1 < 15, X)

| Row | col1  | col2  | col3    | col4  |
| ---:| -----:| -----:| -------:| -----:|
|     | Int64 | Int64 | Int64?  | Int64 |
| 1   | 3     | 9     | 4       | 132   |
| 2   | 6     | 25    | missing | 158   |
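
The R example at the start of this section uses `%in%` to match against a set of values. In Julia, `in` plays the same role. Here is a sketch with a made-up zipcode dataframe (`df`, `wanted` and the values are assumptions for illustration):

```julia
using DataFrames

# Hypothetical toy data; the values are made up for illustration.
df = DataFrame(zipCode = ["4109", "4110", "4200", "4109"],
               value = [10, 20, 30, 40])

wanted = ["4109", "4110"]

# Row-wise predicate, same style as the filters above.
subset1 = filter(row -> row.zipCode in wanted, df)

# Column => function form: in(wanted) builds the predicate x -> x in wanted.
subset2 = filter(:zipCode => in(wanted), df)
```

Both forms keep the same rows. The second one only touches the named column.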


## Cross Tabulation aka Frequency Tables
In order to provide small snapshots of potential interactions and relations, we can use cross-tabulation or frequency comparisons between variables, say, male and female and acceptance rates to university

In R we have some base functions;
```R
xt <- xtabs(Freq ~ Gender + Admit, data=DF)
```

In Julia we have to load a specific package called **FreqTables** https://github.com/nalimilan/FreqTables.jl

In [258]:
using Pkg; Pkg.add("FreqTables") ; using FreqTables

   Resolving package versions...
   Installed Combinatorics ── v1.0.2
   Installed FreqTables ───── v0.4.6
   Installed DelimitedFiles ─ v1.9.1
   Installed NamedArrays ──── v0.10.0
    Updating `~/.julia/environments/v1.10/Project.toml`
  [da1fdf0e] + FreqTables v0.4.6
    Updating `~/.julia/environments/v1.10/Manifest.toml`
  [861a8166] + Combinatorics v1.0.2
  [8bb1440f] + DelimitedFiles v1.9.1
  [da1fdf0e] + FreqTables v0.4.6
  [86f7a689] + NamedArrays v0.10.0
Precompiling project...
  ✓ DelimitedFiles
  ✓ Combinatorics
  ✓ NamedArrays
  ✓ FreqTables
  4 dependencies successfully precompiled in 5 seconds. 98 already precompiled.


Do a frequency table between columns 1 and 4 - clearly there is not much here to see given both are randomly generated vectors

In [263]:
freqtable(X, :col1, :col4)

5×5 Named Matrix{Int64}
col1 ╲ col4 │ 132  148  158  178  189
────────────┼────────────────────────
1           │   0    0    0    0    1
3           │   1    0    0    0    0
6           │   0    0    1    0    0
18          │   0    1    0    0    0
19          │   0    0    0    1    0

## Size of the data in human readable form 
Very simple and yet very informative - how big is our data?
In R;
```R
print(object.size(data), units="Mb")
```

In Julia, we can use **varinfo()**

In [268]:
varinfo(sortby=:size)

| name |      size | summary       |
|:---- | ---------:|:------------- |
| Base |           | Module        |
| Core |           | Module        |
| Main |           | Module        |
| X    | 989 bytes | 5×4 DataFrame |
