# Data
This notebook shows how to load and process data easily.

In [1]:
using BenchmarkTools
using DataFrames
using DelimitedFiles
using CSV
using XLSX

# Get some data 

In [5]:
using Downloads



In [7]:
?Downloads.download

```
download(url, [ output = tempfile() ];
    [ method = "GET", ]
    [ headers = <none>, ]
    [ timeout = <none>, ]
    [ progress = <none>, ]
    [ verbose = false, ]
    [ downloader = <default>, ]
) -> output

    url        :: AbstractString
    output     :: Union{AbstractString, AbstractCmd, IO}
    method     :: AbstractString
    headers    :: Union{AbstractVector, AbstractDict}
    timeout    :: Real
    progress   :: (total::Integer, now::Integer) --> Any
    verbose    :: Bool
    downloader :: Downloader
```

Download a file from the given url, saving it to `output` or if not specified, a temporary path. The `output` can also be an `IO` handle, in which case the body of the response is streamed to that handle and the handle is returned. If `output` is a command, the command is run and output is sent to it on stdin.

If the `downloader` keyword argument is provided, it must be a `Downloader` object. Resources and connections will be shared between downloads performed by the same `Downloader` and cleaned up automatically when the object is garbage collected or there have been no downloads performed with it for a grace period. See `Downloader` for more info about configuration and usage.

If the `headers` keyword argument is provided, it must be a vector or dictionary whose elements are all pairs of strings. These pairs are passed as headers when downloading URLs with protocols that support them, such as HTTP/S.

The `timeout` keyword argument specifies a timeout for the download in seconds, with a resolution of milliseconds. By default no timeout is set, but this can also be explicitly requested by passing a timeout value of `Inf`.

If the `progress` keyword argument is provided, it must be a callback function which will be called whenever there are updates about the size and status of the ongoing download. The callback must take two integer arguments: `total` and `now` which are the total size of the download in bytes, and the number of bytes which have been downloaded so far. Note that `total` starts out as zero and remains zero until the server gives an indication of the total size of the download (e.g. with a `Content-Length` header), which may never happen. So a well-behaved progress callback should handle a total size of zero gracefully.

If the `verbose` option is set to true, `libcurl`, which is used to implement the download functionality, will print debugging information to `stderr`.
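
The `timeout` and `progress` keywords described above can be combined as in the following minimal sketch. The helper names `progress_message` and `fetch_with_progress` are hypothetical, and the 30-second timeout is an arbitrary choice for illustration. The callback treats `total == 0` as "size unknown", as the documentation recommends.

```julia
using Downloads

# Build a progress message; `total` may be 0 if the server never reports a size.
function progress_message(total::Integer, now::Integer)
    if total == 0
        return "downloaded $now bytes (total size unknown)"
    end
    pct = round(100 * now / total; digits = 1)
    return "downloaded $now of $total bytes ($pct%)"
end

# Hypothetical helper: download `url` to `output` with a timeout (in seconds)
# and a progress callback that prints a status line on every update.
function fetch_with_progress(url::AbstractString, output::AbstractString)
    return Downloads.download(url, output;
        timeout = 30,
        progress = (total, now) -> println(progress_message(total, now)))
end
```

Calling `fetch_with_progress(url, "programminglanguages.csv")` with the URL used in the next cell should return the output path, just like a plain `Downloads.download` call.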


In [8]:
P = Downloads.download("https://github.com/nassarhuda/easy_data/raw/master/programming_languages.csv",
    "programminglanguages.csv")

"programminglanguages.csv"

In [10]:
;wget "https://github.com/nassarhuda/easy_data/raw/master/programming_languages.csv"

--2021-06-27 01:56:14--  https://github.com/nassarhuda/easy_data/raw/master/programming_languages.csv
Resolving github.com (github.com)... 13.234.210.38
Connecting to github.com (github.com)|13.234.210.38|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/nassarhuda/easy_data/master/programming_languages.csv [following]
--2021-06-27 01:56:15--  https://raw.githubusercontent.com/nassarhuda/easy_data/master/programming_languages.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 876 [text/plain]
Saving to: 'programming_languages.csv'

     0K                                                       100% 1.94M=0s

2021-06-27 01:56:15 (1.94 MB/s) - 'programming_languages.csv' saved [876/876]



# Read data from text files

1. `readdlm`: the standard way to read a delimited file
2. `writedlm`: the counterpart for writing a delimited file

In [26]:
P,H = readdlm("programming_languages.csv",',';header=true)

(Any[1951 "Regional Assembly Language"; 1952 "Autocode"; … ; 2012 "Julia"; 2014 "Swift"], AbstractString["year" "language"])

In [30]:
writedlm("programming_languages_dlm.txt", P, "-")
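
To see `writedlm` and `readdlm` working together without needing the downloaded file, here is a small round-trip sketch. The three rows and the temporary file name are illustrative assumptions, not part of the original dataset workflow.

```julia
using DelimitedFiles

# A tiny illustrative table: one header row and three data rows.
header = ["year" "language"]
rows = Any[1957 "FORTRAN"; 1958 "LISP"; 2012 "Julia"]

# Write header and rows to a temporary comma-delimited file.
path = joinpath(mktempdir(), "languages.csv")
open(path, "w") do io
    writedlm(io, header, ',')
    writedlm(io, rows, ',')
end

# Read it back; with header=true, readdlm returns (data, header) as two matrices.
data, hdr = readdlm(path, ','; header = true)
```

Because the columns mix numbers and strings, `data` comes back as a matrix with element type `Any`, which is one reason a typed table such as a DataFrame is often more convenient.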

A more powerful option here is the `CSV` package. Passing `DataFrame` as the second argument to `CSV.read` loads the data into a DataFrame, which can have several advantages.

In [41]:
C = CSV.read("programming_languages.csv", DataFrame)

| | year | language |
|---|---|---|
| | Int64 | String |
| 1 | 1951 | Regional Assembly Language |
| 2 | 1952 | Autocode |
| 3 | 1954 | IPL |
| 4 | 1955 | FLOW-MATIC |
| 5 | 1957 | FORTRAN |
| 6 | 1957 | COMTRAN |
| 7 | 1958 | LISP |
| 8 | 1958 | ALGOL 58 |
| 9 | 1959 | FACT |
| 10 | 1959 | COBOL |


In [46]:
@show typeof(C)
C[1:10,:]

| | year | language |
|---|---|---|
| | Int64 | String |
| 1 | 1951 | Regional Assembly Language |
| 2 | 1952 | Autocode |
| 3 | 1954 | IPL |
| 4 | 1955 | FLOW-MATIC |
| 5 | 1957 | FORTRAN |
| 6 | 1957 | COMTRAN |
| 7 | 1958 | LISP |
| 8 | 1958 | ALGOL 58 |
| 9 | 1959 | FACT |
| 10 | 1959 | COBOL |


In [48]:
@show names(C)

names(C) = ["year", "language"]


2-element Vector{String}:
 "year"
 "language"

In [50]:
names(C)
C.year
C.language
describe(C)

| | variable | mean | min | median | max | nunique | nmissing | eltype |
|---|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Union… | Nothing | DataType |
| 1 | year | 1982.99 | 1951 | 1986.0 | 2014 | | | Int64 |
| 2 | language | | ALGOL 58 | | dBase III | 73.0 | | String |
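
The column access shown above (`names`, `C.year`, `C.language`) can be tried on a small in-memory table. This is a sketch under assumptions: the three-row CSV text is made up for illustration, and reading from an `IOBuffer` stands in for reading the downloaded file.

```julia
using CSV
using DataFrames

# Illustrative CSV text with the same two columns as the dataset.
csv_text = """
year,language
1957,FORTRAN
1958,LISP
2012,Julia
"""

# Parse the text into a DataFrame; column types are inferred per column.
df = CSV.read(IOBuffer(csv_text), DataFrame)

@show names(df)
@show eltype(df.year)

# Row filtering with a boolean mask built from a typed column.
recent = df[df.year .> 2000, :]
```

Unlike the `Any` matrix returned by `readdlm`, each DataFrame column carries its own element type, so comparisons such as `df.year .> 2000` operate on integers directly.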
