# Julia for Data Science

In this tutorial, we will discuss why *Julia* is the tool you want to use for your data science applications.

We will cover the following:
* **Data**
* Data processing
* Visualization

### Data: Build a strong relationship with your data.
Every data science task has one main ingredient, the _data_! Most likely, you want to use your data to learn something new. But before the _new_ part, what about the data you already have? Let's make sure you can **read** it, **store** it, and **understand** it before you start using it.

Julia makes this step easy: it offers data structures and packages for processing data, as well as existing functions that are readily usable on your data.

The goal of this first part is to get you acquainted with some of Julia's tools for managing your data.

First, let's install the packages that we will need.

To install a package, we use `Pkg.add("package-name")`.

In [71]:
using Pkg
Pkg.add("CSV")
Pkg.add("DataFrames")

[32m[1m Resolving[22m[39m package versions...
[32m[1m  Updating[22m[39m `C:\Users\mhamed\.julia\environments\v1.0\Project.toml`
[90m [no changes][39m
[32m[1m  Updating[22m[39m `C:\Users\mhamed\.julia\environments\v1.0\Manifest.toml`
[90m [no changes][39m
[32m[1m Resolving[22m[39m package versions...
[32m[1m  Updating[22m[39m `C:\Users\mhamed\.julia\environments\v1.0\Project.toml`
 [90m [a93c6f00][39m[92m + DataFrames v0.14.1[39m
[32m[1m  Updating[22m[39m `C:\Users\mhamed\.julia\environments\v1.0\Manifest.toml`
[90m [no changes][39m


Next, let's download the iris CSV file.

Note: `download` depends on external tools such as curl, wget, or fetch, so one of these must be installed.

In [1]:
iris = download("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv","iris_data.csv")

"iris_data.csv"

And there's the `.csv` file we downloaded!

The `readdlm` function from the `DelimitedFiles` standard library fills an array with the data stored in the input file. If we set the keyword argument `header` to `true`, we get a second output array containing the column names.

In [12]:
using DelimitedFiles

In [13]:
data,header = readdlm("iris_data.csv",',',header=true)

(Any[5.1 3.5 … 0.2 "setosa"; 4.9 3 … 0.2 "setosa"; … ; 6.2 3.4 … 2.3 "virginica"; 5.9 3 … 1.8 "virginica"], AbstractString["sepal_length" "sepal_width" … "petal_width" "species"])

In [30]:
header

1×5 Array{AbstractString,2}:
 "sepal_length"  "sepal_width"  "petal_length"  "petal_width"  "species"

In [29]:
data[1:10,:]

10×5 Array{Any,2}:
 5.1  3.5  1.4  0.2  "setosa"
 4.9  3    1.4  0.2  "setosa"
 4.7  3.2  1.3  0.2  "setosa"
 4.6  3.1  1.5  0.2  "setosa"
 5    3.6  1.4  0.2  "setosa"
 5.4  3.9  1.7  0.4  "setosa"
 4.6  3.4  1.4  0.3  "setosa"
 5    3.4  1.5  0.2  "setosa"
 4.4  2.9  1.4  0.2  "setosa"
 4.9  3.1  1.5  0.1  "setosa"
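Because the file mixes numbers and strings, `readdlm` returns a matrix with element type `Any`. A minimal sketch of how the numeric columns could be converted to `Float64`, using a small hand-written matrix in place of the downloaded file (the `sample`, `measurements`, and `species` names are illustrative and not part of the original notebook):

```julia
# A stand-in for the first rows of `data`, with the same mixed element type.
sample = Any[5.1 3.5 1.4 0.2 "setosa";
             4.9 3   1.4 0.2 "setosa";
             4.7 3.2 1.3 0.2 "setosa"]

# Columns 1 to 4 hold numbers; convert them to a concrete Float64 matrix.
measurements = Float64.(sample[:, 1:4])

# Column 5 holds the species labels; convert it to a vector of strings.
species = String.(sample[:, 5])

println(eltype(sample))        # Any
println(eltype(measurements))  # Float64
println(size(measurements))    # (3, 4)
```

Working with concretely typed arrays generally lets Julia generate faster code than it can for `Any` containers.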

Here we write our first small function: <br>

This function will allow us to find all the iris flowers with a petal_width less than or equal to a passed value.

In [50]:
function find_less_than(data, val::Float64, column::Int64 = 4)
    # Indices of the rows whose value in `column` is at most `val`
    loc = findall(data[:, column] .<= val)
    return data[loc, :]
end

find_less_than (generic function with 5 methods)

In [53]:
width_max = 0.15
result = find_less_than(data,width_max)
result

6×5 Array{Any,2}:
 4.9  3.1  1.5  0.1  "setosa"
 4.8  3    1.4  0.1  "setosa"
 4.3  3    1.1  0.1  "setosa"
 5.2  4.1  1.5  0.1  "setosa"
 4.9  3.1  1.5  0.1  "setosa"
 4.9  3.1  1.5  0.1  "setosa"
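The same pattern extends to any condition on a column. As a sketch (the `filter_rows` helper and the `sample` matrix below are hypothetical, not part of the notebook), a predicate function can replace the hard-coded `<=` comparison:

```julia
# Hypothetical helper: keep the rows where `pred` holds for the value in `column`.
function filter_rows(pred, data, column::Integer)
    loc = findall(pred, data[:, column])
    return data[loc, :]
end

# Made-up rows in the same layout as the iris data.
sample = Any[4.9 3.1 1.5 0.1 "setosa";
             5.1 3.5 1.4 0.2 "setosa";
             6.2 3.4 5.4 2.3 "virginica"]

# Rows with petal_width (column 4) of at most 0.15.
narrow = filter_rows(x -> x <= 0.15, sample, 4)

# Rows whose species (column 5) is "virginica". The do-block syntax works
# because the predicate is the first argument.
virginica = filter_rows(sample, 5) do s
    s == "virginica"
end

println(size(narrow, 1))   # 1
println(virginica[1, 5])   # virginica
```

Taking the predicate as the first argument follows the convention of Base functions such as `findall` and `filter`.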

**Reading and writing to files is really easy in Julia.** <br>

You can use different delimiters with the function `readdlm`.

To write to files, we can use `writedlm`. <br>

Let's write this same data to a file with a different delimiter.

In [57]:
new_data = [header;result]
writedlm("iris_petal_width_lt$width_max.txt", new_data, '-')

We can also check that `readdlm` reads our new text file correctly.

In [77]:
ir,hd = readdlm("iris_petal_width_lt$width_max.txt",'-' ,header=true)
println(hd)
ir

AbstractString["sepal_length" "sepal_width" "petal_length" "petal_width" "species"]


6×5 Array{Any,2}:
 4.9  3.1  1.5  0.1  "setosa"
 4.8  3    1.4  0.1  "setosa"
 4.3  3    1.1  0.1  "setosa"
 5.2  4.1  1.5  0.1  "setosa"
 4.9  3.1  1.5  0.1  "setosa"
 4.9  3.1  1.5  0.1  "setosa"
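One caveat with `'-'` as a delimiter: it round-trips cleanly here only because none of the values contain a hyphen or a minus sign. A hedged sketch of the same round trip using a tab delimiter and a temporary file (the `cols`, `rows`, and `path` names and the values are made up for illustration):

```julia
using DelimitedFiles

cols = ["petal_width" "species"]          # 1×2 header row
rows = Any[0.1 "setosa"; -0.2 "made-up"]  # artificial values containing '-'

path = tempname()

# Write header and rows with a tab delimiter, then read them back.
writedlm(path, [cols; rows], '\t')
back, hd = readdlm(path, '\t', header=true)

println(back[2, 1])  # -0.2
println(back[2, 2])  # made-up

rm(path)  # clean up the temporary file
```

With `'-'` as the delimiter, both `-0.2` and `made-up` would be split into separate fields when read back.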

### DataFrames! 
*Shout out to R fans!*
One other way to play around with data in Julia is to use a DataFrame.

This requires loading the `DataFrames` package.

In [95]:
using DataFrames

In [83]:
df = DataFrame(petal_width = ir[:,4], species = ir[:,5])

| Row | petal_width | species |
|-----|-------------|---------|
|     | Any         | Any     |
| 1   | 0.1         | setosa  |
| 2   | 0.1         | setosa  |
| 3   | 0.1         | setosa  |
| 4   | 0.1         | setosa  |
| 5   | 0.1         | setosa  |
| 6   | 0.1         | setosa  |


You can access columns by header name, or column index.

In this case, `df[1]` is equivalent to `df[:petal_width]`.

Note that if we want to access columns by header name, we precede the header name with a colon! In Julia, this means that the header names are treated as *symbols*.

In [None]:
df[:petal_width]
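For readers new to symbols, a short sketch in plain Julia (no `DataFrames` needed; the variable names are illustrative) showing how a `Symbol` relates to a `String` and how to convert between the two:

```julia
name = :petal_width

println(typeof(name))                   # Symbol
println(name == Symbol("petal_width"))  # true
println(String(name))                   # petal_width

# Symbols can be built programmatically, e.g. to generate column names.
parts = ["petal", "sepal"]
widths = [Symbol(p, "_width") for p in parts]
println(widths)
```

Building symbols from strings this way can be handy when column names are only known at run time.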

As we saw, the DataFrame columns were of type `Any`, so we would have to convert them before doing numeric work. <br>

To avoid this, we will use the `CSV` package, which reads our file and returns a DataFrame with properly typed columns.

In [87]:
using CSV

┌ Info: Precompiling CSV [336ed68f-0bac-5ca0-87d4-7b16caf5d00b]
└ @ Base loading.jl:1186


In [99]:
df = CSV.File("iris_data.csv") |> DataFrame;

`DataFrames` provides the `describe` function, which gives quick statistics about each column in your dataframe, and the `head` function, which shows the first few rows.

In [100]:
head(df)

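| Row | sepal_length | sepal_width | petal_length | petal_width | species |
|-----|--------------|-------------|--------------|-------------|---------|
|     | Float64⍰     | Float64⍰    | Float64⍰     | Float64⍰    | String⍰ |
| 1   | 5.1          | 3.5         | 1.4          | 0.2         | setosa  |
| 2   | 4.9          | 3.0         | 1.4          | 0.2         | setosa  |
| 3   | 4.7          | 3.2         | 1.3          | 0.2         | setosa  |
| 4   | 4.6          | 3.1         | 1.5          | 0.2         | setosa  |
| 5   | 5.0          | 3.6         | 1.4          | 0.2         | setosa  |
| 6   | 5.4          | 3.9         | 1.7          | 0.4         | setosa  |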


In [104]:
typeof(df) 

DataFrame

**`DataFrames` provides some handy features when dealing with data**

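First, it uses the `missing` value (of type `Missing`) to represent absent data. The `⍰` after the column types shown above marks columns that may contain missing values.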

In [108]:
a = missing
typeof(a)

Missing
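A brief sketch of how `missing` behaves in ordinary computations, using only Base Julia (the `widths` vector is made-up data for illustration):

```julia
widths = [0.2, missing, 0.4, 0.1]

# Arithmetic involving `missing` yields `missing`.
println(1 + missing)                    # missing
println(ismissing(sum(widths)))         # true

# Count the missing entries, or skip them when aggregating.
println(count(ismissing, widths))       # 1
println(maximum(skipmissing(widths)))   # 0.4

# `coalesce` replaces missing values with a default.
filled = coalesce.(widths, 0.0)
println(filled)                         # [0.2, 0.0, 0.4, 0.1]
```

Whether to skip missing entries or replace them with a default depends on the analysis; the `0.0` default here is an arbitrary choice.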

### RDatasets

We can use RDatasets to play around with pre-existing datasets. <br>

[available datasets](https://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html)

In [None]:
Pkg.add("RDatasets")

In [None]:
using RDatasets