# Data
Being able to load and process data easily makes any data science workflow more pleasant. In this notebook we will cover the file types most commonly encountered in data science tasks, and we will use this data throughout the rest of the course.

In [None]:
using DataFrames
using DelimitedFiles
using CSV
using XLSX

# 🗃️ Get Some Data

In Julia, it's easy to download a file from the web using the `download` function. You can also use your favorite command-line tool to download files: pressing `;` in the Julia REPL switches to shell mode.

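Note: in Julia versions before 1.6, `download` depends on external tools such as curl, wget or fetch, so one of them must be installed. Newer versions use the bundled `Downloads` standard library instead.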

In [None]:
download("https://raw.githubusercontent.com/nassarhuda/easy_data/master/programming_languages.csv",
    "Data/programming_languages.csv")

# 📂 Read your data from text files.

## Delimited Files

Let's start with the package `DelimitedFiles` which is in the standard library.

In [None]:
#=
readdlm(source, 
    delim::AbstractChar, 
    T::Type, 
    eol::AbstractChar; 
    header=false, 
    skipstart=0, 
    skipblanks=true, 
    use_mmap, 
    quotes=true, 
    dims, 
    comments=false, 
    comment_char='#')
=#
P, H = readdlm("Data/programming_languages.csv", ','; header=true)

In [None]:
P # the data as a matrix

In [None]:
H # The headers as a matrix

In [None]:
# Writing to a text file:
writedlm("Created Data/programminglanguages_dlm.txt", P, ';')
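As a small self-contained sketch of the same round trip, the cell below writes a toy matrix to a temporary file with a header row and reads it back. The data, the file name `langs_demo.txt`, and the variable names are made up for illustration and are not part of the course files.

```julia
using DelimitedFiles

# Toy data for illustration only (not the course file)
header = ["year" "language"]           # 1×2 matrix, written as the first row
rows = Any[1995 "Java"; 2012 "Julia"]  # mixed numbers and strings

path = joinpath(mktempdir(), "langs_demo.txt")
open(path, "w") do io
    writedlm(io, header, ';')
    writedlm(io, rows, ';')
end

# header=true splits the first line off and returns (data, header)
M, h = readdlm(path, ';'; header=true)
```

Because the columns mix numbers and strings, `M` comes back as a heterogeneous matrix, which is the same situation as `P` above.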

## CSV Files

A more powerful package to use here is the `CSV` package. Passing `DataFrame` as the second argument to `CSV.read` loads the data directly into a DataFrame, which has several advantages as we will see below.

In general, `CSV.jl` is the recommended way to load CSVs in Julia. Reach for `DelimitedFiles` only when you need a plain matrix or want low-level control over how an unusual file is parsed (delimiter, end-of-line character, comment handling, and so on).

### Reading CSV Files

In [None]:
C = CSV.read("Data/programming_languages.csv", DataFrame)

In [None]:
@show typeof(C)
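To experiment with `CSV.read` without touching the file system, a CSV can also be parsed from an in-memory buffer. The sketch below uses a two-row toy table invented for illustration; the variable names are assumptions.

```julia
using CSV
using DataFrames

# Toy CSV text for illustration only
csvtext = """
year,language
1995,Java
2012,Julia
"""

# CSV.read accepts an IO source; the second argument is the sink type
demo = CSV.read(IOBuffer(csvtext), DataFrame)

names(demo)   # column names taken from the first line
demo.year     # parsed as integers, not strings
```

Depending on the CSV.jl version, the string column may use a compact inline string type rather than `String`.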

### Selecting Rows and Columns

In [None]:
# Selecting Columns
C.year # Returns the column vector itself (no copy)
C[!, :year] # Same as C.year: the column vector itself (no copy)
C[:, :year] # Returns a copy of the column vector
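The difference between `!` and `:` matters as soon as you modify the result. The following sketch uses a toy DataFrame (invented for illustration) to show which form aliases the stored column and which one copies it.

```julia
using DataFrames

# Toy data for illustration only
df = DataFrame(year=[1995, 2012], language=["Java", "Julia"])

nocopy = df[!, :year]   # the vector stored inside df
copied = df[:, :year]   # an independent copy

copied[1] = 0           # df is unchanged
nocopy[2] = 2013        # df.year[2] changes too

df.year === nocopy      # true: same object
```

Prefer `df[:, :col]` when you want to modify the result safely, and `df.col` or `df[!, :col]` when you want to avoid the allocation.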

In [None]:
# Selecting Rows
C[1:5, :] # Selecting multiple rows returns a DataFrame

In [None]:
names(C) # Returns the column names

In [None]:
describe(C) # Per-column summary statistics, similar to pandas' DataFrame.describe()

## XLSX Files

Another file type that we often need to read is XLSX. Let's try to read a new file.

For more documentation: https://felipenoris.github.io/XLSX.jl/stable/tutorial/

### Reading XLSX

The easiest way to get `XLSX.jl` to read data as a table that DataFrames can parse is to use `XLSX.readtable` instead of `XLSX.readdata`.

`XLSX.readtable` automatically reads the first non-empty row in the file as column labels. It also skips empty columns on the left side of the worksheet.

#### Recommended Way

In [None]:
# If you don't want to specify cell ranges... though this will take a little longer...
G = XLSX.readtable("Data/zillow_data_download_april2020.xlsx", "Sale_counts_city");

In [None]:
zillowdf = DataFrame(G) # Convert XLSX DataTable to DataFrame

#### Other Way

In [None]:
# Read cells from XLSX and return matrix
T = XLSX.readdata("Data/zillow_data_download_april2020.xlsx", # file name
    "Sale_counts_city", # sheet name
    "A1:F9" # cell range
)

T = XLSX.readdata("Data/zillow_data_download_april2020.xlsx", # file name
    "Sale_counts_city!A1:F9" # sheet name and cell range
)
T

If you insist on using `XLSX.readdata`, you will have to manually convert the first row to a vector of Strings:

In [None]:
zillowtdf = DataFrame(T[2:end,:],  convert(Vector{String}, T[1,:])) # Turn matrix into a dataframe
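The header-row conversion can be tried without an Excel file. The sketch below applies the same idea to a small hand-written matrix; the values and column names are invented for illustration and are not taken from the Zillow workbook. It also narrows the column types, since columns built from a `Matrix{Any}` keep the element type `Any`.

```julia
using DataFrames

# Toy matrix for illustration only: first row holds the column labels
raw = Any["id" "city"; 1 "Springfield"; 2 "Shelbyville"]

# Rows 2:end are the data, row 1 becomes the column names
rawdf = DataFrame(raw[2:end, :], string.(raw[1, :]))

# Broadcasting identity over each column narrows Any to a concrete type
typeddf = mapcols(col -> identity.(col), rawdf)

eltype(rawdf.id), eltype(typeddf.id)
```

Using `string.(...)` instead of `convert(Vector{String}, ...)` is a design choice here: it also works when a header cell is not already a `String`.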

# DataFrames

## Creating DataFrames

In [None]:
foods = ["apple", "cucumber", "tomato", "banana"]
calories = [105,47,22,105]
prices = [0.85,1.6,0.8,0.6,]

caloriesdf = DataFrame(item=foods, calories=calories)
@show caloriesdf

pricesdf = DataFrame(item=foods, prices=prices)
@show pricesdf

## Joins

In [None]:
fooddf = innerjoin(caloriesdf, pricesdf, on=:item)
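In the example above both tables contain the same items, so every row finds a match. The sketch below uses two toy tables whose keys only partly overlap (data invented for illustration) to show how `innerjoin`, `leftjoin` and `outerjoin` differ.

```julia
using DataFrames

# Toy data for illustration only: "cucumber" has no price, "banana" has no calories
cal = DataFrame(item=["apple", "cucumber", "tomato"], calories=[105, 47, 22])
pri = DataFrame(item=["apple", "tomato", "banana"], prices=[0.85, 0.8, 0.6])

inner = innerjoin(cal, pri, on=:item)  # only items present in both tables
left  = leftjoin(cal, pri, on=:item)   # every row of cal; unmatched prices are missing
outer = outerjoin(cal, pri, on=:item)  # every item from either table

nrow(inner), nrow(left), nrow(outer)
```

Do not rely on the row order of a join result; sort explicitly if order matters.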

# 🔢 Time to process the data from Julia

`.==`
This is a vectorized "dot" operation: the dot applies the operator elementwise to an array. Here it checks each element for equality and returns an array of Booleans.
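A minimal sketch of how `.==` combines with `findfirst` and `findall`, using a toy vector invented for illustration:

```julia
# Toy data for illustration only
langs = ["Julia", "Java", "Julia"]

mask = langs .== "Julia"     # elementwise comparison: one Bool per element
whole = langs == "Julia"     # without the dot: compares the whole array to the string

first_hit = findfirst(mask)  # index of the first true entry
all_hits = findall(mask)     # indices of all true entries

no_hit = findfirst(langs .== "Rust")  # `nothing` when there is no match
```

The `nothing` returned for a failed search is what the lookup functions below have to handle.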

In [None]:
P

Here are some quick questions we might want to ask about this simple data.
+ Which year was a given language invented?
+ How many languages were created in a given year?

## Processing Matrix

In [None]:
# Q1: Which year was a given language invented?
function yearcreated(language::String)
    try
        loc = findfirst(P[:, 2] .== language) # first row where the language matches
        return P[loc, 1] # throws if loc is `nothing` (language not found)
    catch
        println("Error: Language not found in data")
    end
end

In [None]:
yearcreated("Julia")

In [None]:
yearcreated("Java")

In [None]:
yearcreated("Type Script")
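The `try`/`catch` above works because indexing with `nothing` throws. An alternative, sketched below on a toy matrix, is to test the result of `findfirst` explicitly. The function name `lookup_year` and the data are hypothetical, chosen for illustration.

```julia
# Toy data for illustration only, same layout as P: year in column 1, language in column 2
table = Any[1995 "Java"; 2012 "Julia"]

# Hypothetical helper: returns the year, or `nothing` if the language is absent
function lookup_year(table, language)
    loc = findfirst(==(language), table[:, 2])
    return isnothing(loc) ? nothing : table[loc, 1]
end

found = lookup_year(table, "Julia")
absent = lookup_year(table, "FakeLang")
```

Checking for `nothing` keeps unrelated errors (for example a typo in a column index) from being silently reported as "language not found".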

In [None]:
# Q2: How many languages were created in a given year?
function langcreatedinyear(year::Int64)
    # findall returns an empty vector when nothing matches, so no try/catch is needed
    langs = findall(P[:, 1] .== year) # rows where the year matches
    return length(langs)
end

In [None]:
langcreatedinyear(1958)

In [None]:
langcreatedinyear(1962)

In [None]:
langcreatedinyear(1995)

## Processing DataFrame

In [None]:
Pdf = DataFrame(year=P[:, 1], language=P[:, 2]) # Turning matrix to df

In [None]:
Pdf[1:5, :year]

In [None]:
# Q1: Which year was a given language invented?
function yearcreateddf(lang::String)
    try
        loc = findfirst(Pdf.language .== lang)
        return Pdf.year[loc]
    catch
        println("Error: Language not found in data")
    end
end

In [None]:
@show yearcreateddf("Ruby")
@show yearcreateddf("Kotlin")
@show yearcreateddf("F#")
@show yearcreateddf("Rust")
println()
@show yearcreateddf("FakeLang")

In [None]:
# Q2: How many languages were created in a given year?
function langcreatedinyeardf(year::Int64)
    return length(findall(Pdf.year .== year))
end

In [None]:
@show langcreatedinyeardf(1958)
@show langcreatedinyeardf(1962)
@show langcreatedinyeardf(1995)
@show langcreatedinyeardf(2020)
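When the count is needed for every year rather than one year at a time, the split-apply-combine functions in DataFrames are a natural fit. The sketch below uses a toy DataFrame invented for illustration; the variable names are assumptions.

```julia
using DataFrames

# Toy data for illustration only
toy = DataFrame(year=[1995, 1995, 2012], language=["Java", "Ruby", "Julia"])

# One row per distinct year, with the number of rows in that group
counts = combine(groupby(toy, :year), nrow => :count)

# Turn the summary into a lookup table: year => count
countbyyear = Dict(zip(counts.year, counts.count))

get(countbyyear, 1995, 0)  # a year with no rows falls back to the default 0
```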

## Processing Dictionaries

In [None]:
Dict([("A", 1), ("B", 2), (1, [1, 2])]) # Making a dict

In [None]:
somedict = Dict{Integer, Vector{String}}() # Empty dict

In [None]:
# somedict["julia"] = 7 # This will not work: keys must be Integers and values must be Vector{String}.

In [None]:
somedict[1] = ["Wow", "Yum"]

In [None]:
Pdict = Dict(pairs(eachcol(Pdf))) # Turning dataframe into a dict

In [None]:
@show length(Pdict[:year])
@show length(unique(Pdict[:year]))

In [None]:
Pdict[:year][1]

In [None]:
# Q1: Which year was a given language invented?
function yearcreateddict(lang::String)
    try
        loc = findfirst(Pdict[:language] .== lang) # first row where the language matches
        return Pdict[:year][loc] # throws if loc is `nothing` (language not found)
    catch
        println("Error: Language not found in data")
    end
end

In [None]:
@show yearcreateddict("Ruby")
@show yearcreateddict("Kotlin")
@show yearcreateddict("F#")
@show yearcreateddict("Rust")
println()
@show yearcreateddict("FakeLang")

In [None]:
# Q2: How many languages were created in a given year?
function langcreatedinyeardict(year::Int64)
    # findall returns an empty vector when nothing matches, so no try/catch is needed
    return length(findall(Pdict[:year] .== year))
end

In [None]:
@show langcreatedinyeardict(1958)
@show langcreatedinyeardict(1962)
@show langcreatedinyeardict(1995)
@show langcreatedinyeardict(2020)
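`Pdict` stores whole columns, so each query above still scans a vector. A dictionary keyed by year answers the second question with a direct lookup instead. The sketch below builds such an index from toy vectors; the names `byyear`, `years` and `languages` are hypothetical and the data is invented for illustration.

```julia
# Toy data for illustration only
years = [1995, 1995, 2012]
languages = ["Java", "Ruby", "Julia"]

# Map each year to the list of languages created in it
byyear = Dict{Int, Vector{String}}()
for (y, l) in zip(years, languages)
    # get! inserts an empty vector the first time a year is seen
    push!(get!(byyear, y, String[]), l)
end

n1995 = length(get(byyear, 1995, String[]))  # languages created in 1995
n2020 = length(get(byyear, 2020, String[]))  # missing key falls back to an empty list
```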

# 📝 Missing data

In [None]:
Pdf[1, 1] = missing
Pdf[3, 1] = missing
Pdf[1:5, :]

In [None]:
dropmissing(Pdf) # Returns a new DataFrame; reassign the result, or use dropmissing!(Pdf) to modify Pdf in place
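Assigning `missing` to `Pdf` works above most likely because its columns were built from a heterogeneous matrix and therefore accept any value. A column with a concrete element type such as `Int64` must first be widened with `allowmissing!`. The sketch below shows this on a toy DataFrame invented for illustration.

```julia
using DataFrames

# Toy data for illustration only
df = DataFrame(year=[1995, 2012, 1991], language=["Java", "Julia", "Python"])

allowmissing!(df, :year)   # widen the column type to Union{Missing, Int64}
df[1, :year] = missing

nmissing = count(ismissing, df.year)    # number of missing years
latest = maximum(skipmissing(df.year))  # skip missing values in a reduction

cleaned = dropmissing(df)  # new DataFrame; df itself still has 3 rows
nbefore = nrow(df)
dropmissing!(df)           # in-place variant; df now has 2 rows
```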