# Week 3 Notes - Getting and Cleaning Data

## Subsetting and Sorting
Time to wrangle and mould our datasets to our desires. Create a basic dataframe, shuffle its rows and then insert some NA values too.
```R
set.seed(1345)
X <- data.frame("var1"=sample(1:5), "var2"=sample(6:10), "var3"=sample(11:15))
X <- X[sample(1:5),]; X$var2[c(1,3)] = NA
``` 

In Julia

In [6]:
using DataFrames, CSV

In [10]:
X = DataFrame(col1=rand(1:20, 5), col2=rand(2:30, 5), col3=[missing, missing, 5, 4, 21])

| Row | col1 | col2 | col3 |
| --- | --- | --- | --- |
|  | Int64 | Int64 | Int64? |
| 1 | 6 | 25 | missing |
| 2 | 19 | 14 | missing |
| 3 | 3 | 4 | 5 |
| 4 | 18 | 9 | 4 |
| 5 | 1 | 13 | 21 |
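
To mirror the R snippet more closely, including the fixed seed and the row shuffle, something like the following should work. This is only a sketch: it assumes the `Random` standard library, borrows the `var1` to `var3` names from the R example, and uses `X2` as a made-up name so it doesn't clobber `X`. The random values will not match the ones R produces.

```julia
using DataFrames, Random

Random.seed!(1345)                      # rough analogue of set.seed(1345)
X2 = DataFrame(var1 = shuffle(1:5), var2 = shuffle(6:10), var3 = shuffle(11:15))
X2 = X2[shuffle(1:nrow(X2)), :]         # reorder the rows
allowmissing!(X2, :var2)                # the column must accept `missing` first
X2.var2[[1, 3]] .= missing              # same idea as X$var2[c(1,3)] = NA
```

Without the `allowmissing!` call the assignment would fail, because an `Int64` column cannot hold `missing`.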


Let's subset the dataframe and take specific columns and row combinations

In R, `X[,1]` takes the first column. We can also take the first column by passing its name as a string: `X[,"var1"]`. To take the first two rows of column 2, use `X[1:2,"var2"]`.

With Julia we can use the basic dot syntax

In [14]:
X.col1

5-element Vector{Int64}:
  6
 19
  3
 18
  1

In [26]:
# This string based extraction is a bit slower
# Julia converts the String to a Symbol type 
X."col1"

5-element Vector{Int64}:
  6
 19
  3
 18
  1

We can also use indexing and the column names

In [24]:
X[:, 2]

5-element Vector{Int64}:
 25
 14
  4
  9
 13

In [29]:
X[:, "col1"]

5-element Vector{Int64}:
  6
 19
  3
 18
  1

In [30]:
X[:, :col1]

5-element Vector{Int64}:
  6
 19
  3
 18
  1
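
The cells above all pull out a whole column. To combine row and column selections, as in the R `X[1:2,"var2"]` example, a minimal sketch could look like this (the dataframe is rebuilt with the same values as printed earlier so the snippet runs on its own; the variable names on the left are just for illustration):

```julia
using DataFrames

# Same values as the dataframe printed above
X = DataFrame(col1 = [6, 19, 3, 18, 1],
              col2 = [25, 14, 4, 9, 13],
              col3 = [missing, missing, 5, 4, 21])

first_two = X[1:2, :col2]           # like X[1:2,"var2"] in R
block     = X[1:2, [:col1, :col3]]  # a smaller DataFrame: 2 rows, 2 columns
copied    = X[:, :col1]             # `:` returns a copy of the column
shared    = X[!, :col1]             # `!` returns the stored column itself (no copy)
```

The `:` versus `!` distinction matters once we start modifying values: changes to `shared` show up in `X`, changes to `copied` do not.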

In Julia, to find the position of a certain column (e.g. is it the 10th column?), we can use the **columnindex()** function

In [31]:
columnindex(X, "col3")

3

To test whether a specific column is in the dataframe, based on its name, we can use **hasproperty()**

In [32]:
hasproperty(X, "col3")

true
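
Two related checks that may be handy, sketched here on a cut-down copy of the dataframe: `names` lists the column names, and `columnindex` returns 0 rather than throwing when the column does not exist. The column `col9` is deliberately made up to show the not-found case.

```julia
using DataFrames

X = DataFrame(col1 = [6, 19, 3, 18, 1], col2 = [25, 14, 4, 9, 13])

all_names = names(X)               # column names, as strings
has_col2  = "col2" in all_names    # membership test on the names
idx_none  = columnindex(X, :col9)  # 0 signals that there is no such column
```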

#### Subset the dataframe using conditions, e.g. keep only the rows in which the first column has values over 18

In [42]:
filter(row -> row.col1 > 18, X)

| Row | col1 | col2 | col3 |
| --- | --- | --- | --- |
|  | Int64 | Int64 | Int64? |
| 1 | 19 | 14 | missing |


If we have multiple conditions, we can combine them with `||` (or) or `&&` (and)

In [57]:
filter(row -> row.col1 > 1 || row.col2 > 1, X)

| Row | col1 | col2 | col3 |
| --- | --- | --- | --- |
|  | Int64 | Int64 | Int64? |
| 1 | 6 | 25 | missing |
| 2 | 19 | 14 | missing |
| 3 | 3 | 4 | 5 |
| 4 | 18 | 9 | 4 |
| 5 | 1 | 13 | 21 |
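
The `||` example keeps every row, so here is a sketch of an `&&` filter that actually drops some, plus one way to filter on a column that contains `missing`. The thresholds are arbitrary choices for illustration, and the dataframe is rebuilt with the values shown above so the snippet is self-contained.

```julia
using DataFrames

X = DataFrame(col1 = [6, 19, 3, 18, 1],
              col2 = [25, 14, 4, 9, 13],
              col3 = [missing, missing, 5, 4, 21])

# Both conditions must hold
both = filter(row -> row.col1 <= 18 && row.col2 > 10, X)

# `missing > 4` is `missing`, not true/false, so a plain filter on col3 would error.
# Option 1: test for missing explicitly
big3 = filter(row -> !ismissing(row.col3) && row.col3 > 4, X)

# Option 2: let subset treat missing results as false
big3_alt = subset(X, :col3 => ByRow(>(4)), skipmissing = true)
```

Both options should select the same rows; which one reads better is largely a matter of taste.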


In R we could do 
```R
# Rows where both conditions hold (and)
X[(X$var1 <= 3 & X$var3 > 11),]
# Rows where either condition holds (or)
X[(X$var1 <= 3 | X$var3 > 11),]
``` 

### Sorting 

In R:
```R
sort(X$var1)
# Sort in reverse
sort(X$var1, decreasing=TRUE)
``` 

In Julia, to sort just the vector from a specific dataframe column (see https://dataframes.juliadata.org/stable/man/sorting/)

In [64]:
sort(X.col1, rev=true)

5-element Vector{Int64}:
 19
 18
  6
  3
  1

To sort the entire dataframe by a column and print it out for viewing

In [68]:
sort(X, "col1") 
# or 
sort(X, 1)
# or - this is the slowest one 
# (note: `:1` is just the integer 1, so this is the same as sort(X, [1]))
sort(X, [:1])

| Row | col1 | col2 | col3 |
| --- | --- | --- | --- |
|  | Int64 | Int64 | Int64? |
| 1 | 1 | 13 | 21 |
| 2 | 3 | 4 | 5 |
| 3 | 6 | 25 | missing |
| 4 | 18 | 9 | 4 |
| 5 | 19 | 14 | missing |
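
Sorting on a column that contains `missing` is worth a quick look too. In this sketch (same values as above, rebuilt so it runs on its own) the missing entries are treated as larger than any number, so they land at the end of an ascending sort and at the start of a descending one.

```julia
using DataFrames

X = DataFrame(col1 = [6, 19, 3, 18, 1],
              col2 = [25, 14, 4, 9, 13],
              col3 = [missing, missing, 5, 4, 21])

by_col3      = sort(X, :col3)              # ascending; missing rows end up last
by_col3_desc = sort(X, :col3, rev = true)  # descending; missing rows come first
```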


### Ordering
Ordering is used in conjunction with sorting: it lets us specify the sort order for each column in the DataFrame, e.g. sort by one column first and reverse its order

In R:
```R
X[order(X$var1, X$var3),] 
```

Now in Julia, using the `order` function described in the help for `sort`

In [71]:
sort(X, order("col1", rev=true))

| Row | col1 | col2 | col3 |
| --- | --- | --- | --- |
|  | Int64 | Int64 | Int64? |
| 1 | 19 | 14 | missing |
| 2 | 18 | 9 | 4 |
| 3 | 6 | 25 | missing |
| 4 | 3 | 4 | 5 |
| 5 | 1 | 13 | 21 |


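We can sort by several columns at once and give each its own direction by passing a vector to `rev` (a vector of `order` calls does the same job). Note that `sort!` modifies `X` in place, and that the output below already includes `col4` because this cell was run after that column was added in the next section.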

In [89]:
sort!(X, ["col1", "col2"], rev=[true, false])

| Row | col1 | col2 | col3 | col4 |
| --- | --- | --- | --- | --- |
|  | Int64 | Int64 | Int64? | Int64 |
| 1 | 19 | 14 | missing | 189 |
| 2 | 18 | 9 | 4 | 132 |
| 3 | 6 | 25 | missing | 158 |
| 4 | 3 | 4 | 5 | 148 |
| 5 | 1 | 13 | 21 | 178 |
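
Since `col1` has no repeated values, the second sort key never gets a chance to act in the output above. A small sketch with ties makes the effect visible; the dataframe `df` and its columns `g` and `v` are invented purely for this illustration, and the sort is expressed with one `order` call per column.

```julia
using DataFrames

# Hypothetical data with ties in the first column
df = DataFrame(g = [1, 2, 1, 2], v = [4, 2, 3, 1])

# g descending, then v ascending within each value of g
out = sort(df, [order(:g, rev = true), order(:v)])
```

Here `sort` (no `!`) returns a new dataframe and leaves `df` untouched.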


### Adding rows and columns 
Adding rows and columns is a very common procedure - it should become as comfortable as adding sides to the playdough structure that we've made.

In R:
```R
X$var4 <- rnorm(5)
```

In Julia we can assign to a column which doesn't exist yet, and it will be created

In [87]:
X.col4 = rand(100:200, 5)

5-element Vector{Int64}:
 189
 132
 158
 148
 178

In [88]:
X

| Row | col1 | col2 | col3 | col4 |
| --- | --- | --- | --- | --- |
|  | Int64 | Int64 | Int64? | Int64 |
| 1 | 19 | 14 | missing | 189 |
| 2 | 18 | 9 | 4 | 132 |
| 3 | 6 | 25 | missing | 158 |
| 4 | 3 | 4 | 5 | 148 |
| 5 | 1 | 13 | 21 | 178 |
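
The heading also promises rows, so here is a rough sketch of adding rows in Julia, along with placing a new column at a chosen position. It uses a shortened copy of the dataframe; `Y`, `Z`, the `id` column and the appended values are all made up for the example. `vcat` plays roughly the role of R's `rbind`, and `insertcols!` is one counterpart to `cbind`.

```julia
using DataFrames

X = DataFrame(col1 = [6, 19, 3], col2 = [25, 14, 4], col3 = [missing, missing, 5])

# Append one row in place; the field names must match the column names
push!(X, (col1 = 7, col2 = 8, col3 = missing))

# Stack another dataframe with the same columns underneath
Y = DataFrame(col1 = [1], col2 = [13], col3 = [21])
Z = vcat(X, Y)

# Insert a new column at a chosen position (here: first)
insertcols!(Z, 1, :id => 1:nrow(Z))
```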
