# Week 2 Notes - Getting and Cleaning Data

## Reading from MySQL
Data in SQL are structured into databases -> databases consist of tables with fields -> tables contain entries as rows. The tables themselves often represent specific aspects of the data which are interlinked within the database - say a table for the salaries of employees, another of the annnual leave, another table for their personal details and so on. 

## Let's install MySQL
In R `install.packages("RMySQL")` 

In julia

In [1]:
using Pkg; Pkg.add("MySQL") ; using MySQL

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.10/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.10/Manifest.toml`


In [2]:
Pkg.add("DataFrames") ; using DataFrames  

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.10/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.10/Manifest.toml`


## Connecting to databases - UCSC genome browser example

Connecting to the UCSC MySQL server and pulling the databases on UCSC that are available to us -- let's do it in R. This will establish a connection to the server - a stream, and then using this stream we will execute a MySQL command `show databases;` to retrieve the available databases, and then disconnect from the stream. 
```R
ucscDB <- dbConnect(MySQL(), user="genome", host="genome-mysql.cse.ucsc.edu")  
result <- dbGetQuery(ucscDB, "show databases;"); dbDisconnect(ucscDB);
```

In Julia we can do this by employing the MySQL.jl package - part of the Databases.jl family;

In [3]:
# Connecting https://mysql.juliadatabases.org/dev/
ucscDB = DBInterface.connect(MySQL.Connection, "genome-mysql.soe.ucsc.edu", "genome")

MySQL.Connection(host="genome-mysql.soe.ucsc.edu", user="genome", port="3306", db="")

In [4]:
# Query the server and store the query in a dataframe - or a csv etc. 
result = DBInterface.execute(ucscDB, "show databases") |> DataFrame; 

In [5]:
# Lets view the result - we can see that it lists all the genomes stored on UCSC
result

Row,Database
Unnamed: 0_level_1,String
1,acaChl1
2,ailMel1
3,allMis1
4,allSin1
5,amaVit1
6,anaPla1
7,ancCey1
8,angJap1
9,anoCar1
10,anoCar2


In [6]:
"hg38" in result.Database 

true

In [7]:
# Close the connection stream 
DBInterface.close!(ucscDB)

### Now that we've connected to the MySQL server, we will connect to a specific database and perform some queries. 

In R - we'll connect, retrieve all of the table associated with the db, and then execute a funtion to see how many table are stored 
```R
hg38 <- dbConnect(MySQL(), user="genome", db="hg38", host="genome-mysql.soe.ucsc.edu")
allTables <- dbListTables(hg38)
length(allTables)
```

Let's get cracking on Julia

In [8]:
hg38 = DBInterface.connect(MySQL.Connection, "genome-mysql.soe.ucsc.edu", "genome", db="hg38")

MySQL.Connection(host="genome-mysql.soe.ucsc.edu", user="genome", port="3306", db="hg38")

In [9]:
hg38Tables = DBInterface.execute(hg38, "show tables") |> DataFrame; 

In [10]:
hg38Tables[1:10, :]

Row,Tables_in_hg38
Unnamed: 0_level_1,String
1,affyGnf1h
2,affyU133
3,affyU95
4,all_est
5,all_mrna
6,all_sts_primer
7,all_sts_seq
8,altLocations
9,altSeqLiftOverPsl
10,altSeqLiftOverPslP3


A whopping 2835 different tables ! UCSC is extreeeeemly information rich, as we can see, there are many many many data sources we can pull from.    

### Now to investigate the specific fields within a specific table

In R; get a table and perform a basic SQL function to show how many fields are in the table
```R
dbListFields(hg38, "all_mrna")
dbGetQuery(hg38, "select count(*) from all_mrna")
```

In Julia

In [11]:
countsql = DBInterface.execute(hg38, "select count(*) from all_mrna") |> DataFrame; 

In [12]:
countsql

Row,count(*)
Unnamed: 0_level_1,Int64
1,10489979


Now lets play with the contents of the fields in R

```R
mrnaData <- dbReadTable(hg38, "all_mrna")
head(mrnaData)
```

Julia !

In [30]:
mrnaData = DBInterface.execute(hg38, "select * from all_mrna;") |> DataFrame

Row,bin,matches,misMatches,repMatches,nCount,qNumInsert,qBaseInsert,tNumInsert,tBaseInsert,strand,qName,qSize,qStart,qEnd,tName,tSize,tStart,tEnd,blockCount,blockSizes,qStarts,tStarts
Unnamed: 0_level_1,UInt16,UInt32,UInt32,UInt32,UInt32,UInt32,UInt32,UInt32,UInt32,String,String,UInt32,UInt32,UInt32,String,UInt32,UInt32,UInt32,UInt32,Array…,Array…,Array…
1,585,1579,25,0,0,0,0,2,884,+,AM992877,1604,0,1604,chr1,248956422,11873,14361,3,"UInt8[0x33, 0x35, 0x34, 0x2c, 0x31, 0x30, 0x39, 0x2c, 0x31, 0x31, 0x34, 0x31, 0x2c]","UInt8[0x30, 0x2c, 0x33, 0x35, 0x34, 0x2c, 0x34, 0x36, 0x33, 0x2c]","UInt8[0x31, 0x31, 0x38, 0x37, 0x33, 0x2c, 0x31, 0x32, 0x36, 0x31, 0x32, 0x2c, 0x31, 0x33, 0x32, 0x32, 0x30, 0x2c]"
2,585,1419,21,0,0,0,0,2,1048,+,AM992881,1440,0,1440,chr1,248956422,11873,14361,3,"UInt8[0x33, 0x35, 0x34, 0x2c, 0x31, 0x32, 0x37, 0x2c, 0x39, 0x35, 0x39, 0x2c]","UInt8[0x30, 0x2c, 0x33, 0x35, 0x34, 0x2c, 0x34, 0x38, 0x31, 0x2c]","UInt8[0x31, 0x31, 0x38, 0x37, 0x33, 0x2c, 0x31, 0x32, 0x35, 0x39, 0x34, 0x2c, 0x31, 0x33, 0x34, 0x30, 0x32, 0x2c]"
3,585,1533,12,0,0,0,0,4,944,+,AM992878,1545,0,1545,chr1,248956422,11873,14362,5,"UInt8[0x33, 0x35, 0x34, 0x2c, 0x35, 0x32, 0x2c, 0x34, 0x33, 0x36, 0x2c, 0x32, 0x39, 0x39, 0x2c, 0x34, 0x30, 0x34, 0x2c]","UInt8[0x30, 0x2c, 0x33, 0x35, 0x34, 0x2c, 0x34, 0x30, 0x36, 0x2c, 0x38, 0x34, 0x32, 0x2c, 0x31, 0x31, 0x34, 0x31, 0x2c]","UInt8[0x31, 0x31, 0x38, 0x37, 0x33, 0x2c, 0x31, 0x32, 0x36, 0x34 … 0x36, 0x35, 0x38, 0x2c, 0x31, 0x33, 0x39, 0x35, 0x38, 0x2c]"
4,585,1578,27,0,0,0,0,2,884,+,AM992879,1605,0,1605,chr1,248956422,11873,14362,3,"UInt8[0x33, 0x35, 0x34, 0x2c, 0x31, 0x30, 0x39, 0x2c, 0x31, 0x31, 0x34, 0x32, 0x2c]","UInt8[0x30, 0x2c, 0x33, 0x35, 0x34, 0x2c, 0x34, 0x36, 0x33, 0x2c]","UInt8[0x31, 0x31, 0x38, 0x37, 0x33, 0x2c, 0x31, 0x32, 0x36, 0x31, 0x32, 0x2c, 0x31, 0x33, 0x32, 0x32, 0x30, 0x2c]"
5,585,1652,0,0,0,0,0,2,884,+,AM992871,1652,0,1652,chr1,248956422,11873,14409,3,"UInt8[0x33, 0x35, 0x34, 0x2c, 0x31, 0x30, 0x39, 0x2c, 0x31, 0x31, 0x38, 0x39, 0x2c]","UInt8[0x30, 0x2c, 0x33, 0x35, 0x34, 0x2c, 0x34, 0x36, 0x33, 0x2c]","UInt8[0x31, 0x31, 0x38, 0x37, 0x33, 0x2c, 0x31, 0x32, 0x36, 0x31, 0x32, 0x2c, 0x31, 0x33, 0x32, 0x32, 0x30, 0x2c]"
6,585,1650,2,0,0,0,0,2,884,+,AM992872,1652,0,1652,chr1,248956422,11873,14409,3,"UInt8[0x33, 0x35, 0x34, 0x2c, 0x31, 0x30, 0x39, 0x2c, 0x31, 0x31, 0x38, 0x39, 0x2c]","UInt8[0x30, 0x2c, 0x33, 0x35, 0x34, 0x2c, 0x34, 0x36, 0x33, 0x2c]","UInt8[0x31, 0x31, 0x38, 0x37, 0x33, 0x2c, 0x31, 0x32, 0x36, 0x31, 0x32, 0x2c, 0x31, 0x33, 0x32, 0x32, 0x30, 0x2c]"
7,585,1648,4,0,0,0,0,2,884,+,AM992875,1652,0,1652,chr1,248956422,11873,14409,3,"UInt8[0x33, 0x35, 0x34, 0x2c, 0x31, 0x30, 0x39, 0x2c, 0x31, 0x31, 0x38, 0x39, 0x2c]","UInt8[0x30, 0x2c, 0x33, 0x35, 0x34, 0x2c, 0x34, 0x36, 0x33, 0x2c]","UInt8[0x31, 0x31, 0x38, 0x37, 0x33, 0x2c, 0x31, 0x32, 0x36, 0x31, 0x32, 0x2c, 0x31, 0x33, 0x32, 0x32, 0x30, 0x2c]"
8,585,1485,3,0,0,0,0,2,1048,+,AM992880,1488,0,1488,chr1,248956422,11873,14409,3,"UInt8[0x33, 0x35, 0x34, 0x2c, 0x31, 0x32, 0x37, 0x2c, 0x31, 0x30, 0x30, 0x37, 0x2c]","UInt8[0x30, 0x2c, 0x33, 0x35, 0x34, 0x2c, 0x34, 0x38, 0x31, 0x2c]","UInt8[0x31, 0x31, 0x38, 0x37, 0x33, 0x2c, 0x31, 0x32, 0x35, 0x39, 0x34, 0x2c, 0x31, 0x33, 0x34, 0x30, 0x32, 0x2c]"
9,585,1631,8,0,0,0,0,4,897,+,BC032353,1673,0,1639,chr1,248956422,11873,14409,5,"UInt8[0x33, 0x35, 0x34, 0x2c, 0x31, 0x30, 0x39, 0x2c, 0x37, 0x33, 0x37, 0x2c, 0x33, 0x30, 0x30, 0x2c, 0x31, 0x33, 0x39, 0x2c]","UInt8[0x30, 0x2c, 0x33, 0x35, 0x34, 0x2c, 0x34, 0x36, 0x33, 0x2c, 0x31, 0x32, 0x30, 0x30, 0x2c, 0x31, 0x35, 0x30, 0x30, 0x2c]","UInt8[0x31, 0x31, 0x38, 0x37, 0x33, 0x2c, 0x31, 0x32, 0x36, 0x31 … 0x39, 0x35, 0x38, 0x2c, 0x31, 0x34, 0x32, 0x37, 0x30, 0x2c]"
10,585,1736,4,0,0,0,0,3,796,+,LP896001,1740,0,1740,chr1,248956422,11873,14409,4,"UInt8[0x33, 0x35, 0x34, 0x2c, 0x31, 0x32, 0x37, 0x2c, 0x37, 0x30, 0x2c, 0x31, 0x31, 0x38, 0x39, 0x2c]","UInt8[0x30, 0x2c, 0x33, 0x35, 0x34, 0x2c, 0x34, 0x38, 0x31, 0x2c, 0x35, 0x35, 0x31, 0x2c]","UInt8[0x31, 0x31, 0x38, 0x37, 0x33, 0x2c, 0x31, 0x32, 0x35, 0x39 … 0x39, 0x37, 0x34, 0x2c, 0x31, 0x33, 0x32, 0x32, 0x30, 0x2c]"


In [27]:
DataFrame(mrnaData, mrnaData[:, 1]) 

LoadError: MethodError: no method matching getindex(::MySQL.TextCursor{true}, ::Colon, ::Int64)

In [29]:
Pkg.add("CSV") ; using CSV

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.10/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.10/Manifest.toml`


### Refined queries 
MySQL has a wide range of query options which permit the extration of virtually any aspect of the data, with conditionals, ranges, mismatches and so on. These statements must simply be crafted according to the query structure and provided to the julia functions in order to get what you're looking for.    

For instance we can extract entries from the table which have values in the 'mismatches' column between 1 and 3;   

In R
```R
query <- dbSendQuery(hg38, "select * from all_mrna where misMatches between 1 and 3") 
mrnas <- fetch(query)
```

In Julia

In [32]:
query = DBInterface.execute(hg38, "select * from all_mrna where misMatches between 1 and 3") |> DataFrame

Row,bin,matches,misMatches,repMatches,nCount,qNumInsert,qBaseInsert,tNumInsert,tBaseInsert,strand,qName,qSize,qStart,qEnd,tName,tSize,tStart,tEnd,blockCount,blockSizes,qStarts,tStarts
Unnamed: 0_level_1,UInt16,UInt32,UInt32,UInt32,UInt32,UInt32,UInt32,UInt32,UInt32,String,String,UInt32,UInt32,UInt32,String,UInt32,UInt32,UInt32,UInt32,Array…,Array…,Array…
1,585,1650,2,0,0,0,0,2,884,+,AM992872,1652,0,1652,chr1,248956422,11873,14409,3,"UInt8[0x33, 0x35, 0x34, 0x2c, 0x31, 0x30, 0x39, 0x2c, 0x31, 0x31, 0x38, 0x39, 0x2c]","UInt8[0x30, 0x2c, 0x33, 0x35, 0x34, 0x2c, 0x34, 0x36, 0x33, 0x2c]","UInt8[0x31, 0x31, 0x38, 0x37, 0x33, 0x2c, 0x31, 0x32, 0x36, 0x31, 0x32, 0x2c, 0x31, 0x33, 0x32, 0x32, 0x30, 0x2c]"
2,585,1485,3,0,0,0,0,2,1048,+,AM992880,1488,0,1488,chr1,248956422,11873,14409,3,"UInt8[0x33, 0x35, 0x34, 0x2c, 0x31, 0x32, 0x37, 0x2c, 0x31, 0x30, 0x30, 0x37, 0x2c]","UInt8[0x30, 0x2c, 0x33, 0x35, 0x34, 0x2c, 0x34, 0x38, 0x31, 0x2c]","UInt8[0x31, 0x31, 0x38, 0x37, 0x33, 0x2c, 0x31, 0x32, 0x35, 0x39, 0x34, 0x2c, 0x31, 0x33, 0x34, 0x30, 0x32, 0x2c]"
3,585,925,2,0,0,1,3,7,11787,-,AK310121,930,0,930,chr1,248956422,16630,29344,8,"UInt8[0x31, 0x33, 0x35, 0x2c, 0x31, 0x39, 0x38, 0x2c, 0x31, 0x33 … 0x30, 0x32, 0x2c, 0x31, 0x35, 0x34, 0x2c, 0x32, 0x34, 0x2c]","UInt8[0x30, 0x2c, 0x31, 0x33, 0x35, 0x2c, 0x33, 0x33, 0x33, 0x2c … 0x30, 0x2c, 0x37, 0x35, 0x32, 0x2c, 0x39, 0x30, 0x36, 0x2c]","UInt8[0x31, 0x36, 0x36, 0x33, 0x30, 0x2c, 0x31, 0x36, 0x38, 0x35 … 0x37, 0x33, 0x37, 0x2c, 0x32, 0x39, 0x33, 0x32, 0x30, 0x2c]"
4,585,986,1,0,0,0,0,8,11630,-,AK310139,987,0,987,chr1,248956422,16727,29344,9,"UInt8[0x33, 0x38, 0x2c, 0x31, 0x39, 0x38, 0x2c, 0x31, 0x33, 0x36 … 0x35, 0x38, 0x2c, 0x31, 0x35, 0x34, 0x2c, 0x32, 0x34, 0x2c]","UInt8[0x30, 0x2c, 0x33, 0x38, 0x2c, 0x32, 0x33, 0x36, 0x2c, 0x33 … 0x31, 0x2c, 0x38, 0x30, 0x39, 0x2c, 0x39, 0x36, 0x33, 0x2c]","UInt8[0x31, 0x36, 0x37, 0x32, 0x37, 0x2c, 0x31, 0x36, 0x38, 0x35 … 0x37, 0x33, 0x37, 0x2c, 0x32, 0x39, 0x33, 0x32, 0x30, 0x2c]"
5,585,970,3,0,0,0,0,4,1101,-,AK294377,973,0,973,chr1,248956422,16938,19012,5,"UInt8[0x31, 0x31, 0x37, 0x2c, 0x35, 0x31, 0x30, 0x2c, 0x31, 0x34, 0x37, 0x2c, 0x39, 0x39, 0x2c, 0x31, 0x30, 0x30, 0x2c]","UInt8[0x30, 0x2c, 0x31, 0x31, 0x37, 0x2c, 0x36, 0x32, 0x37, 0x2c, 0x37, 0x37, 0x34, 0x2c, 0x38, 0x37, 0x33, 0x2c]","UInt8[0x31, 0x36, 0x39, 0x33, 0x38, 0x2c, 0x31, 0x37, 0x32, 0x33 … 0x32, 0x36, 0x37, 0x2c, 0x31, 0x38, 0x39, 0x31, 0x32, 0x2c]"
6,585,953,2,0,0,0,0,8,11369,-,AK300161,955,0,955,chr1,248956422,17020,29344,9,"UInt8[0x33, 0x35, 0x2c, 0x31, 0x33, 0x36, 0x2c, 0x31, 0x32, 0x35 … 0x32, 0x37, 0x2c, 0x31, 0x35, 0x34, 0x2c, 0x32, 0x34, 0x2c]","UInt8[0x30, 0x2c, 0x33, 0x35, 0x2c, 0x31, 0x37, 0x31, 0x2c, 0x32 … 0x30, 0x2c, 0x37, 0x37, 0x37, 0x2c, 0x39, 0x33, 0x31, 0x2c]","UInt8[0x31, 0x37, 0x30, 0x32, 0x30, 0x2c, 0x31, 0x37, 0x32, 0x33 … 0x37, 0x33, 0x37, 0x2c, 0x32, 0x39, 0x33, 0x32, 0x30, 0x2c]"
7,585,775,1,0,0,0,0,6,10897,-,AK308540,776,0,776,chr1,248956422,17671,29344,7,"UInt8[0x37, 0x31, 0x2c, 0x31, 0x34, 0x37, 0x2c, 0x39, 0x35, 0x2c … 0x32, 0x37, 0x2c, 0x31, 0x35, 0x34, 0x2c, 0x32, 0x34, 0x2c]","UInt8[0x30, 0x2c, 0x37, 0x31, 0x2c, 0x32, 0x31, 0x38, 0x2c, 0x33 … 0x31, 0x2c, 0x35, 0x39, 0x38, 0x2c, 0x37, 0x35, 0x32, 0x2c]","UInt8[0x31, 0x37, 0x36, 0x37, 0x31, 0x2c, 0x31, 0x37, 0x39, 0x31 … 0x37, 0x33, 0x37, 0x2c, 0x32, 0x39, 0x33, 0x32, 0x30, 0x2c]"
8,585,974,2,0,0,0,0,0,0,-,AK311358,976,0,976,chr1,248956422,29043,30019,1,"UInt8[0x39, 0x37, 0x36, 0x2c]","UInt8[0x30, 0x2c]","UInt8[0x32, 0x39, 0x30, 0x34, 0x33, 0x2c]"
9,585,1123,1,0,0,0,0,2,341,-,AY341950,1124,0,1124,chr1,248956422,34612,36077,3,"UInt8[0x35, 0x36, 0x32, 0x2c, 0x32, 0x30, 0x35, 0x2c, 0x33, 0x35, 0x37, 0x2c]","UInt8[0x30, 0x2c, 0x35, 0x36, 0x32, 0x2c, 0x37, 0x36, 0x37, 0x2c]","UInt8[0x33, 0x34, 0x36, 0x31, 0x32, 0x2c, 0x33, 0x35, 0x32, 0x37, 0x36, 0x2c, 0x33, 0x35, 0x37, 0x32, 0x30, 0x2c]"
10,585,1122,2,0,0,0,0,2,341,-,AY341952,1124,0,1124,chr1,248956422,34612,36077,3,"UInt8[0x35, 0x36, 0x32, 0x2c, 0x32, 0x30, 0x35, 0x2c, 0x33, 0x35, 0x37, 0x2c]","UInt8[0x30, 0x2c, 0x35, 0x36, 0x32, 0x2c, 0x37, 0x36, 0x37, 0x2c]","UInt8[0x33, 0x34, 0x36, 0x31, 0x32, 0x2c, 0x33, 0x35, 0x32, 0x37, 0x36, 0x2c, 0x33, 0x35, 0x37, 0x32, 0x30, 0x2c]"


In [34]:
first(query)

Row,bin,matches,misMatches,repMatches,nCount,qNumInsert,qBaseInsert,tNumInsert,tBaseInsert,strand,qName,qSize,qStart,qEnd,tName,tSize,tStart,tEnd,blockCount,blockSizes,qStarts,tStarts
Unnamed: 0_level_1,UInt16,UInt32,UInt32,UInt32,UInt32,UInt32,UInt32,UInt32,UInt32,String,String,UInt32,UInt32,UInt32,String,UInt32,UInt32,UInt32,UInt32,Array…,Array…,Array…
1,585,1650,2,0,0,0,0,2,884,+,AM992872,1652,0,1652,chr1,248956422,11873,14409,3,"UInt8[0x33, 0x35, 0x34, 0x2c, 0x31, 0x30, 0x39, 0x2c, 0x31, 0x31, 0x38, 0x39, 0x2c]","UInt8[0x30, 0x2c, 0x33, 0x35, 0x34, 0x2c, 0x34, 0x36, 0x33, 0x2c]","UInt8[0x31, 0x31, 0x38, 0x37, 0x33, 0x2c, 0x31, 0x32, 0x36, 0x31, 0x32, 0x2c, 0x31, 0x33, 0x32, 0x32, 0x30, 0x2c]"


In [None]:
size(query)

### Close the connection!

In [37]:
DBInterface.close!(hg38)

## HDF5 - Reading and Handling this data type
HDF5 is a data format used for storing large datasets - FAST5 uses a HDF5 backbone, and we know how large FAST5 files are!!! It stands for **H**eirarchical **D**ata **F**ormat. 

As expected, HDF5 support for R comes in a library, this time it is downloaded from bioconductor 
```R
source("url")
biocLite("rhdf5")
library(rhdf5)
```

We can create an example file using the hdf5 functions `created = h5createFile("example.h5")`    

Julia too has a package for working with HDF5 formats, can you guess the original name? **HDF5.jl**. Remarkable indeed. https://juliaio.github.io/HDF5.jl/stable/ - A nice explaner of HDF5 from the julia software page 

*"HDF5 stands for Hierarchical Data Format v5 and is closely modeled on file systems. In HDF5, a "group" is analogous to a directory, a "dataset" is like a file. HDF5 also uses "attributes" to associate metadata with a particular group or dataset. HDF5 uses ASCII names for these different objects, and objects can be accessed by Unix-like pathnames, e.g., "/sample1/tempsensor/firsttrial" for a top-level group "sample1", a subgroup "tempsensor", and a dataset "firsttrial"."*  

(Datasets cannot have child datasets, but groups can have either.) 

Groups/subgroups/datasets NOT Groups/datasets/datasets/groups

In [39]:
Pkg.add("HDF5") ; using HDF5

[32m[1m   Resolving[22m[39m package versions...
[32m[1m   Installed[22m[39m MPIPreferences ─ v0.1.10
[32m[1m   Installed[22m[39m HDF5_jll ─────── v1.12.2+2
[32m[1m   Installed[22m[39m Requires ─────── v1.3.0
[32m[1m   Installed[22m[39m HDF5 ─────────── v0.17.1
[32m[1m    Updating[22m[39m `~/.julia/environments/v1.10/Project.toml`
  [90m[f67ccb44] [39m[92m+ HDF5 v0.17.1[39m
[32m[1m    Updating[22m[39m `~/.julia/environments/v1.10/Manifest.toml`
  [90m[f67ccb44] [39m[92m+ HDF5 v0.17.1[39m
  [90m[3da0fdf6] [39m[92m+ MPIPreferences v0.1.10[39m
  [90m[ae029012] [39m[92m+ Requires v1.3.0[39m
[32m⌃[39m [90m[0234f1f7] [39m[92m+ HDF5_jll v1.12.2+2[39m
[36m[1m        Info[22m[39m Packages marked with [32m⌃[39m have new versions available and may be upgradable.
[32m[1mPrecompiling[22m[39m project...
[32m  ✓ [39m[90mMPIPreferences[39m
[32m  ✓ [39m[90mRequires[39m
[32m  ✓ [39m[90mHDF5_jll[39m
[32m  ✓ [39mHDF5
  4 dependencies

To write an example file in julia

In [41]:
example_hd = h5open("example.h5", "cw")

🗂️ HDF5.File: (read-write) example.h5

In [43]:
close(example_hd)

In [45]:
example_hd = h5open("example.h5", "r+")

🗂️ HDF5.File: (read-write) example.h5

Since HDF5 files are hierarchical and based upon a file-system like structure, we create groups

Once we have groups, we write to the groups, and perhaps subgroups, subsubgroups etc. - remember, HDF5 is akin to a filesystem

We'll create some matrix data which we'll write to a group called foo/A in R
```R
A = matrix(1:10, nr=5, nc=2)
h5write(A, "example.h5", "foo/A") #group is the third argument
B = array(seq(0.1,2.0, by=0.1), dim=c(5,2,2))
attr(B, "scale") <- "liter"
h5write(B, "example.h5", "foo/foobaa/B") 
h5ls("example.h5") # h5 ls view 
```

![image.png](attachment:image.png)

In Julia let's work with the HDF5 package 

Create a group called "foo" 

In [46]:
create_group(example_hd, "foo")

📂 HDF5.Group: /foo (file: example.h5)

Create some mock data

In [49]:
samp = Array(rand(2, 4))

2×4 Matrix{Float64}:
 0.453924  0.774484  0.896302  0.714723
 0.141766  0.911775  0.745648  0.932475

Write it to the group foo

In [51]:
# If a group doesn't already exist we can write to it by indexing
example_hd["newgroup"] = "yes"

"yes"

In [66]:
# To write to a pre-existing group 
# First initialize the group and load it into a variable
g = example_hd["foo"]
# Write to it by indexing directly
g["mydataset"] = samp 
# Write to it using the create_dataset() function
create_dataset(g, "simplestring", zeros(1, 2)) 

(HDF5.Dataset: /foo/simplestring (file: example.h5 xfer_mode: 0), HDF5.Datatype: H5T_IEEE_F64LE)

In [60]:
example_hd["foo/mydataset"]

🔢 HDF5.Dataset: /foo/mydataset (file: example.h5 xfer_mode: 0)

Read the contents of the dataset and its groups using the **read** function

In [61]:
read(example_hd["foo/mydataset"])

2×4 Matrix{Float64}:
 0.453924  0.774484  0.896302  0.714723
 0.141766  0.911775  0.745648  0.932475

In [67]:
read(example_hd["foo/simplestring"])

1×2 Matrix{Float64}:
 0.0  0.0

In [68]:
example_hd

🗂️ HDF5.File: (read-write) example.h5
├─ 📂 foo
│  ├─ 🔢 mydataset
│  └─ 🔢 simplestring
└─ 🔢 newgroup

In [71]:
close(example_hd)

For convience and consistency we can also use the **do** block conventions, which will take care of closing the stream for us 

In [73]:
h5open("example.h5", "r+") do stream
    group = create_group(stream, "dogroup")
    dataset = create_dataset(group, "thisdata", Float64, (10,10))
    write(dataset, rand(10,10))
end 

Basic notes;
Datasets can be create by 
```julia
g["mydataset"] = rand(3,5)
# or
write(g, "mydataset", rand(3,5))
``` 

### Reading specific parts of the data 
In R; 
```R
h5read("example.h5", "foo/A")
h5read("example.h5", "foo/new/dataset")
```

In Julia

In [89]:
openh5 = h5open("example.h5", "r+")

🗂️ HDF5.File: (read-write) example.h5
├─ 📂 dogroup
│  └─ 🔢 thisdata
├─ 📂 foo
│  ├─ 🔢 mydataset
│  └─ 🔢 simplestring
└─ 🔢 newgroup

In [91]:
read(openh5,"foo/mydataset")

2×4 Matrix{Float64}:
 0.453924  0.774484  0.896302  0.714723
 0.141766  0.911775  0.745648  0.932475

### Chunking and Compression 
The section on the Julia HDF5.jl package site does a nice job of explaining this ! https://juliaio.github.io/HDF5.jl/stable/

## Webscraping HTML 
Scraping information from the internet is a fun endevour, which based upon the virility of ones creativity, can lead to a large number of exciting avenues of adventure. As the web has been around for decades, almost every high level programming language has its own packages and libraries for working with web HTML and XML files. 
The most important part of web scraping is to avoid excessive request to websites as this will likely get your IP address blocked. We can safely assume that large companies such as amazon are interested in protecting themselves from massive data trawling operations, even though they are doing it on a massive scale. 

### Examples using google scholar

In R first; 

We'll store a url link, and then simply read the lines of the link, and then close the link 
```R
con = url("https://scholar.google.com/citations?user=kzht3-0AAAAJ&hl=en")
htmlCode = readLines(con)
close(con)
htmlCode
```

Now in Julia - we can use the HTTP.jl package - or Requests.jl

In [105]:
Pkg.add("HTTP") ; using HTTP

[32m[1m   Resolving[22m[39m package versions...
[32m[1m   Installed[22m[39m ExceptionUnwrapping ─ v0.1.10
[32m[1m   Installed[22m[39m SimpleBufferStream ── v1.1.0
[32m[1m   Installed[22m[39m ConcurrentUtilities ─ v2.3.0
[32m[1m   Installed[22m[39m BitFlags ──────────── v0.1.8
[32m[1m   Installed[22m[39m OpenSSL ───────────── v1.4.1
[32m[1m   Installed[22m[39m LoggingExtras ─────── v1.0.3
[32m[1m   Installed[22m[39m URIs ──────────────── v1.5.1
[32m[1m   Installed[22m[39m HTTP ──────────────── v1.10.1
[32m[1m    Updating[22m[39m `~/.julia/environments/v1.10/Project.toml`
  [90m[cd3eb016] [39m[92m+ HTTP v1.10.1[39m
[32m[1m    Updating[22m[39m `~/.julia/environments/v1.10/Manifest.toml`
  [90m[d1d4a3ce] [39m[92m+ BitFlags v0.1.8[39m
  [90m[f0e56b4a] [39m[92m+ ConcurrentUtilities v2.3.0[39m
  [90m[460bff9d] [39m[92m+ ExceptionUnwrapping v0.1.10[39m
  [90m[cd3eb016] [39m[92m+ HTTP v1.10.1[39m
  [90m[e6f89c97] [39m[92m+ Loggi

In [94]:
url = "https://scholar.google.com/citations?user=kzht3-0AAAAJ&hl=en"

"https://scholar.google.com/citations?user=kzht3-0AAAAJ&hl=en"

In [113]:
resp = HTTP.get(url)

HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Date: Wed, 31 Jan 2024 02:32:23 GMT
Expires: Fri, 01 Jan 1990 00:00:00 GMT
Cache-Control: no-cache, must-revalidate
Pragma: no-cache
Content-Type: text/html; charset=ISO-8859-1
X-Content-Type-Options: nosniff
Content-Encoding: gzip
Server: citations
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN
Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
Transfer-Encoding: chunked

<!doctype html><html><head><title>Michael Lynch - Google Scholar</title><meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta name="referrer" content="always"><meta name="viewport" content="width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=2"><meta name="format-detection" content="telephone=no"><link rel="shortcut icon" href="/favicon.ico"><link rel="canonical" href="https://scholar.google.com/citations?user=kzht3-0AAAAJ&amp;hl=en"><meta name="description" content=

#### A note on GET vs POST requests
**GET**
* parameters are in the URL
* used for fetching and *GETTING* documents
* maximum URL length 
* OK to cache
* won't change the server

**POST**
* parameters are in the body
* used for updating and *POSTING* data 
* not ok to cache
* can change the server

#### Parsing HTML file as a XML - conversion 
Depending on the structure of the HTML, we may be able to translate it as a XML and interpret it in the same structure - let's give it a go and see if it works. 

In R; 
```R
library(XML)
url <- "https://scholar.google.com/citations?user=kzht3-0AAAAJ&hl=en"
html <- htmlTreeParse(url, useInternalNodes=T)
```
Using xpath language queries

```R
xpathSApply(html, "//title", xmlValue)
```

Now let's look at the citation counts
```R
xpathSApply(html, "//td[@id='col-citedby']", xmlValue)
``` 

Give it a crack in Julia using the EzXML package - it will take a few more steps, and perhaps it can get compressed down with experience into some simpler code.

**The basic steps are**
1. Load HTTP package, perform a HTTP get request on the url
2. Parse the URL as a string by accessing the requests body 
3. Using the EzXML package, parse the html string using parsehtml()
4. Use the xpath query language with findall() and nodecontent() to extract the relevant information from the file


In [114]:
Pkg.add("EzXML") ; using EzXML

[32m[1m   Resolving[22m[39m package versions...
[32m[1m   Installed[22m[39m XML2_jll ─ v2.12.2+0
[32m[1m   Installed[22m[39m EzXML ──── v1.2.0
[32m[1m    Updating[22m[39m `~/.julia/environments/v1.10/Project.toml`
  [90m[8f5d6c58] [39m[92m+ EzXML v1.2.0[39m
[32m[1m    Updating[22m[39m `~/.julia/environments/v1.10/Manifest.toml`
  [90m[8f5d6c58] [39m[92m+ EzXML v1.2.0[39m
  [90m[02c8fc9c] [39m[92m+ XML2_jll v2.12.2+0[39m
[32m[1mPrecompiling[22m[39m project...
[32m  ✓ [39m[90mXML2_jll[39m
[32m  ✓ [39mEzXML
  2 dependencies successfully precompiled in 2 seconds. 72 already precompiled.


Open a HTTP request, and then parse the HTML file as a string which can be read by the EzXML package using the **parsehtml()** function. 

In [144]:
samp = HTTP.get(url, cookies=true);
data = String(samp.body)

"<!doctype html><html><head><title>Michael Lynch - Google Scholar</title><meta http-equiv=\"Content-Type\" content=\"text/html;charset=ISO-8859-1\"><meta http-equiv=\"X-UA-Compatible\" content=\"IE=Edge\"><meta name=\"referrer\" content=\"always\"><meta name=\"viewport\" content=\"wid"[93m[1m ⋯ 151840 bytes ⋯ [22m[39m"le=\"menuitem\" href=\"/intl/en/scholar/about.html\" tabindex=\"-1\" class=\"gs_md_li\">About Scholar</a><a role=\"menuitem\" href=\"//support.google.com/websearch?p=scholar_dsa&amp;hl=en&amp;oe=ASCII\" tabindex=\"-1\" class=\"gs_md_li\">Search help</a></div></div></div></body></html>"

Parse the html file now and store the root as a variable

In [154]:
q = parsehtml(data)
scholar_root = root(q)

[33m[1m└ [22m[39m[90m@ EzXML ~/.julia/packages/EzXML/DL8na/src/error.jl:97[39m
[33m[1m└ [22m[39m[90m@ EzXML ~/.julia/packages/EzXML/DL8na/src/error.jl:97[39m
[33m[1m└ [22m[39m[90m@ EzXML ~/.julia/packages/EzXML/DL8na/src/error.jl:97[39m
[33m[1m└ [22m[39m[90m@ EzXML ~/.julia/packages/EzXML/DL8na/src/error.jl:97[39m
[33m[1m└ [22m[39m[90m@ EzXML ~/.julia/packages/EzXML/DL8na/src/error.jl:97[39m
[33m[1m└ [22m[39m[90m@ EzXML ~/.julia/packages/EzXML/DL8na/src/error.jl:97[39m


EzXML.Node(<ELEMENT_NODE[html]@0x0000000005486f00>)

Using the **nodecontent.()** function of EzXML, find your query using the xpath language 

In [160]:
nodecontent.(findall("//title", scholar_root))

1-element Vector{String}:
 "Michael Lynch - Google Scholar"

Now for the more elaborate Xpath query

In [164]:
for citation in nodecontent.(findall("//td[@id='col-citedby']", scholar_root))
    println(citation)
end 

#### Using the httr library for R 
httr makes doing much of this a bit easier - as we can see the workflow is very similar to that of HTTP.jl 
```R
library(httr); html2 = GET(url)
content2 = content(html2, as="text")
parsedHtml = htmlParse(content2, asText=True)
xpathSapply(parsedHtml, "//title", xmlValue)
```

### Websites with user and password authentication
In order to access websites which request a user and password, in R we can include the information as an argument
```R
pp2 = GET("url", authenticate("user","passwd"))
```

In Julia we prefix the server address as such;

In [168]:
pp2 = HTTP.get("https://user:passwd@httpbin.org/basic-auth/user/passwd")

HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Date: Thu, 01 Feb 2024 02:37:55 GMT
Content-Type: application/json
Content-Length: 47
Connection: keep-alive
Server: gunicorn/19.9.0
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true

{
  "authenticated": true, 
  "user": "user"
}
"""

## Quiz 
1. Read up on the github API 
https://github.com/settings/applications
. Access the API to get information on your instructors repositories (hint: this is the url you want "https://api.github.com/users/jtleek/repos"). Use this data to find the time that the datasharing repo was created. What time was it created?   

This tutorial may be useful (
https://github.com/hadley/httr/blob/master/demo/oauth2-github.r
). You may also need to run the code in the base R package and not R studio.

In Julia, we'll use the github.jl package from https://github.com/JuliaWeb/GitHub.jl 

In [174]:
Pkg.add("GitHub") ; using GitHub

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.10/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.10/Manifest.toml`


Using julias github package, let's take a look at all of jtleeks repos, using the simple **repos** function with his username as the argument 

In [197]:
leeks_repos = repos("jtleek")

(Repo[Repo (all fields are Union{Nothing, T}):
  name: "2018"
  full_name: "jtleek/2018"
  description: "Fall 2018 repository with course materials for JHU Advanced Data Science"
  language: "HTML"
  default_branch: "master"
  owner: Owner("jtleek")
  id: 155565363
  size: 60855
  forks_count: 3
  stargazers_count: 1
  watchers_count: 1
  open_issues_count: 0
  url: URI("https://api.github.com/repos/jtleek/2018")
  html_url: URI("https://github.com/jtleek/2018")
  clone_url: URI("https://github.com/jtleek/2018.git")
  ssh_url: URI("git@github.com:jtleek/2018.git")
  homepage: URI("https://jhu-advdatasci.github.io/2018/")
  pushed_at: DateTime("2018-10-30T18:13:41")
  created_at: DateTime("2018-10-31T13:50:37")
  updated_at: DateTime("2021-12-05T09:30:00")
  has_issues: false
  has_wiki: true
  has_downloads: true
  has_pages: false
  private: false
  fork: true, Repo (all fields are Union{Nothing, T}):
  name: "ads2020"
  full_name: "jtleek/ads2020"
  description: "Advanced Data Scienc

In [210]:
# Doesn't work - I can't seem to get an iterator going
for f in leeks_repos
        print(f)
    end 
end 

Repo[Repo (all fields are Union{Nothing, T}):
  name: "2018"
  full_name: "jtleek/2018"
  description: "Fall 2018 repository with course materials for JHU Advanced Data Science"
  language: "HTML"
  default_branch: "master"
  owner: Owner("jtleek")
  id: 155565363
  size: 60855
  forks_count: 3
  stargazers_count: 1
  watchers_count: 1
  open_issues_count: 0
  url: URI("https://api.github.com/repos/jtleek/2018")
  html_url: URI("https://github.com/jtleek/2018")
  clone_url: URI("https://github.com/jtleek/2018.git")
  ssh_url: URI("git@github.com:jtleek/2018.git")
  homepage: URI("https://jhu-advdatasci.github.io/2018/")
  pushed_at: DateTime("2018-10-30T18:13:41")
  created_at: DateTime("2018-10-31T13:50:37")
  updated_at: DateTime("2021-12-05T09:30:00")
  has_issues: false
  has_wiki: true
  has_downloads: true
  has_pages: false
  private: false
  fork: true, Repo (all fields are Union{Nothing, T}):
  name: "ads2020"
  full_name: "jtleek/ads2020"
  description: "Advanced Data Science

  has_wiki: true
  has_downloads: true
  has_pages: false
  private: false
  fork: false, Repo (all fields are Union{Nothing, T}):
  name: "capitalIn21stCenturyinR"
  full_name: "jtleek/capitalIn21stCenturyinR"
  description: "Piketty in R"
  language: "HTML"
  default_branch: "master"
  owner: Owner("jtleek")
  id: 20234724
  size: 374812
  forks_count: 128
  stargazers_count: 213
  watchers_count: 213
  open_issues_count: 0
  url: URI("https://api.github.com/repos/jtleek/capitalIn21stCenturyinR")
  html_url: URI("https://github.com/jtleek/capitalIn21stCenturyinR")
  clone_url: URI("https://github.com/jtleek/capitalIn21stCenturyinR.git")
  ssh_url: URI("git@github.com:jtleek/capitalIn21stCenturyinR.git")
  pushed_at: DateTime("2016-07-18T17:22:51")
  created_at: DateTime("2014-05-27T20:38:31")
  updated_at: DateTime("2024-01-04T22:12:27")
  has_issues: true
  has_wiki: true
  has_downloads: true
  has_pages: true
  private: false
  fork: false, Repo (all fields are Union{Nothing, T}):

  url: URI("https://api.github.com/repos/jtleek/day1")
  html_url: URI("https://github.com/jtleek/day1")
  clone_url: URI("https://github.com/jtleek/day1.git")
  ssh_url: URI("git@github.com:jtleek/day1.git")
  pushed_at: DateTime("2017-07-11T15:25:20")
  created_at: DateTime("2017-07-10T21:44:28")
  updated_at: DateTime("2017-07-11T14:13:55")
  has_issues: true
  has_wiki: true
  has_downloads: true
  has_pages: false
  private: false
  fork: false, Repo (all fields are Union{Nothing, T}):
  name: "derfinder"
  full_name: "jtleek/derfinder"
  language: "R"
  default_branch: "master"
  owner: Owner("jtleek")
  id: 11549405
  size: 388
  forks_count: 0
  stargazers_count: 0
  watchers_count: 0
  open_issues_count: 0
  url: URI("https://api.github.com/repos/jtleek/derfinder")
  html_url: URI("https://github.com/jtleek/derfinder")
  clone_url: URI("https://github.com/jtleek/derfinder.git")
  ssh_url: URI("git@github.com:jtleek/derfinder.git")
  pushed_at: DateTime("2013-06-24T21:17:27")
 

  size: 41702
  forks_count: 11
  stargazers_count: 16
  watchers_count: 16
  open_issues_count: 1
  url: URI("https://api.github.com/repos/jtleek/genstats_site")
  html_url: URI("https://github.com/jtleek/genstats_site")
  clone_url: URI("https://github.com/jtleek/genstats_site.git")
  ssh_url: URI("git@github.com:jtleek/genstats_site.git")
  pushed_at: DateTime("2015-09-07T13:54:52")
  created_at: DateTime("2015-08-20T12:23:21")
  updated_at: DateTime("2023-10-05T18:12:28")
  has_issues: true
  has_wiki: true
  has_downloads: true
  has_pages: true
  private: false
  fork: false, Repo (all fields are Union{Nothing, T}):
  name: "github-slideshow"
  full_name: "jtleek/github-slideshow"
  description: "A robot powered training repository :robot:"
  language: "Ruby"
  default_branch: "master"
  owner: Owner("jtleek")
  id: 289258980
  size: 3515
  forks_count: 0
  stargazers_count: 0
  watchers_count: 0
  open_issues_count: 1
  url: URI("https://api.github.com/repos/jtleek/github-slides

  html_url: URI("https://github.com/jtleek/jhsph753and4")
  clone_url: URI("https://github.com/jtleek/jhsph753and4.git")
  ssh_url: URI("git@github.com:jtleek/jhsph753and4.git")
  pushed_at: DateTime("2014-05-15T14:22:41")
  created_at: DateTime("2014-01-04T21:06:44")
  updated_at: DateTime("2022-04-14T23:52:19")
  has_issues: true
  has_wiki: true
  has_downloads: true
  has_pages: true
  private: false
  fork: false, Repo (all fields are Union{Nothing, T}):
  name: "jhudash"
  full_name: "jtleek/jhudash"
  description: "A repository for all things DaSH"
  default_branch: "master"
  owner: Owner("jtleek")
  id: 42834789
  size: 381
  forks_count: 0
  stargazers_count: 0
  watchers_count: 0
  open_issues_count: 0
  url: URI("https://api.github.com/repos/jtleek/jhudash")
  html_url: URI("https://github.com/jtleek/jhudash")
  clone_url: URI("https://github.com/jtleek/jhudash.git")
  ssh_url: URI("git@github.com:jtleek/jhudash.git")
  pushed_at: DateTime("2015-09-21T10:35:40")
  created_a

  html_url: URI("https://github.com/jtleek/newproject")
  clone_url: URI("https://github.com/jtleek/newproject.git")
  ssh_url: URI("git@github.com:jtleek/newproject.git")
  pushed_at: DateTime("2023-08-08T04:37:22")
  created_at: DateTime("2017-08-30T18:14:23")
  updated_at: DateTime("2021-11-21T14:36:49")
  has_issues: true
  has_wiki: true
  has_downloads: true
  has_pages: false
  private: false
  fork: false, Repo (all fields are Union{Nothing, T}):
  name: "new_project"
  full_name: "jtleek/new_project"
  description: "This is my new project"
  default_branch: "master"
  owner: Owner("jtleek")
  id: 101907019
  size: 0
  forks_count: 0
  stargazers_count: 0
  watchers_count: 0
  open_issues_count: 0
  url: URI("https://api.github.com/repos/jtleek/new_project")
  html_url: URI("https://github.com/jtleek/new_project")
  clone_url: URI("https://github.com/jtleek/new_project.git")
  ssh_url: URI("git@github.com:jtleek/new_project.git")
  pushed_at: DateTime("2017-08-30T16:58:22")
  c

  full_name: "jtleek/rmd4edu"
  description: "A fresh batch of R Markdown templates"
  default_branch: "master"
  owner: Owner("jtleek")
  id: 273780186
  size: 12873
  forks_count: 0
  stargazers_count: 0
  watchers_count: 0
  open_issues_count: 0
  url: URI("https://api.github.com/repos/jtleek/rmd4edu")
  html_url: URI("https://github.com/jtleek/rmd4edu")
  clone_url: URI("https://github.com/jtleek/rmd4edu.git")
  ssh_url: URI("git@github.com:jtleek/rmd4edu.git")
  pushed_at: DateTime("2019-09-03T14:08:12")
  created_at: DateTime("2020-06-20T20:33:17")
  updated_at: DateTime("2023-09-08T18:09:02")
  has_issues: false
  has_wiki: true
  has_downloads: true
  has_pages: false
  private: false
  fork: true, Repo (all fields are Union{Nothing, T}):
  name: "robotjeff"
  full_name: "jtleek/robotjeff"
  description: "This is the Shiny app to make robot Jeff talk"
  language: "R"
  default_branch: "master"
  owner: Owner("jtleek")
  id: 94168246
  size: 2
  forks_count: 0
  stargazers_count

  has_wiki: true
  has_downloads: true
  has_pages: true
  license: License("NOASSERTION")
  private: false
  fork: false, Repo (all fields are Union{Nothing, T}):
  name: "slipper"
  full_name: "jtleek/slipper"
  description: "Tidy and easy bootstrapping"
  language: "R"
  default_branch: "master"
  owner: Owner("jtleek")
  id: 75975252
  size: 74
  forks_count: 12
  stargazers_count: 119
  watchers_count: 119
  open_issues_count: 2
  url: URI("https://api.github.com/repos/jtleek/slipper")
  html_url: URI("https://github.com/jtleek/slipper")
  clone_url: URI("https://github.com/jtleek/slipper.git")
  ssh_url: URI("git@github.com:jtleek/slipper.git")
  pushed_at: DateTime("2017-10-05T18:12:29")
  created_at: DateTime("2016-12-08T21:08:11")
  updated_at: DateTime("2024-01-24T22:26:21")
  has_issues: true
  has_wiki: true
  has_downloads: true
  has_pages: false
  private: false
  fork: false, Repo (all fields are Union{Nothing, T}):
  name: "software"
  full_name: "jtleek/software"
  de

  created_at: DateTime("2015-07-05T12:57:04")
  updated_at: DateTime("2015-07-05T12:57:04")
  has_issues: true
  has_wiki: true
  has_downloads: true
  has_pages: false
  private: false
  fork: false, Repo (all fields are Union{Nothing, T}):
  name: "testproject"
  full_name: "jtleek/testproject"
  default_branch: "master"
  owner: Owner("jtleek")
  id: 38634839
  size: 112
  forks_count: 0
  stargazers_count: 0
  watchers_count: 0
  open_issues_count: 0
  url: URI("https://api.github.com/repos/jtleek/testproject")
  html_url: URI("https://github.com/jtleek/testproject")
  clone_url: URI("https://github.com/jtleek/testproject.git")
  ssh_url: URI("git@github.com:jtleek/testproject.git")
  pushed_at: DateTime("2015-07-06T17:37:08")
  created_at: DateTime("2015-07-06T17:34:35")
  updated_at: DateTime("2015-07-06T17:34:35")
  has_issues: true
  has_wiki: true
  has_downloads: true
  has_pages: false
  private: false
  fork: false, Repo (all fields are Union{Nothing, T}):
  name: "testrepo

LoadError: ParseError:
[90m# Error @ [0;0m]8;;file:///home/number25/MEGA/Computational-Bio/Kobe-n-Pascal/ComputationalBiology-for-Autodidacts/Course_notes/Coursera_Data_Science_Track_JHOP/Mod3_Data_Cleaning/Week_2/In[210]#5:1\[90mIn[210]:5:1[0;0m]8;;\
    end 
[48;2;120;70;70mend[0;0m 
[90m└─┘ ── [0;0m[91minvalid identifier[0;0m

Search the output for the repo "datasharing" - hmmm this seems more difficult than I first assumed, as the output type is a Repo-Union custom type it seems 

Let's try honing in a specific repo, so use the **repo** function with the exact repo as the argument (including his username: username/repo)

In [184]:
repo("jtleek/datasharing")

Repo (all fields are Union{Nothing, T}):
  name: "datasharing"
  full_name: "jtleek/datasharing"
  description: "The Leek group guide to data sharing "
  default_branch: "master"
  owner: Owner("jtleek")
  id: 14204342
  size: 590
  subscribers_count: 561
  forks_count: 243587
  stargazers_count: 6434
  watchers_count: 6434
  open_issues_count: 892
  url: URI("https://api.github.com/repos/jtleek/datasharing")
  html_url: URI("https://github.com/jtleek/datasharing")
  clone_url: URI("https://github.com/jtleek/datasharing.git")
  ssh_url: URI("git@github.com:jtleek/datasharing.git")
  pushed_at: DateTime("2024-01-05T04:49:32")
  created_at: DateTime("2013-11-07T13:25:07")
  updated_at: DateTime("2024-02-01T12:03:04")
  has_issues: true
  has_wiki: true
  has_downloads: true
  has_pages: false
  private: false
  fork: false

Answer to the question is **created_at: DateTime("2013-11-07T13:25:07")**
 