# Getting started with data and plotting

In many cases we might need to read data available in an external file rather
than type it into Julia ourselves.

This tutorial is concerned with reading tabular data into Julia. We'll cover
basic plotting along the way.

## Where to get help

* Plots.jl documentation: http://docs.juliaplots.org/latest/
* CSV.jl documentation: http://csv.juliadata.org/stable
* DataFrames.jl documentation: https://dataframes.juliadata.org/stable/

In [None]:
# We need this constant to point to where the data files are.
const DATA_DIR = joinpath(@__DIR__, "data");

!!! note
    There are multiple ways to read the same kind of data into Julia. This
    tutorial focuses on DataFrames.jl because it provides the ecosystem to
    work with most of the required file types in a straightforward manner.

### DataFrames.jl

The `DataFrames` package provides a set of tools for working with tabular
data. It is available through the Julia package system.
```julia
using Pkg
Pkg.add("DataFrames")
```

In [None]:
import DataFrames

### Plots.jl

The `Plots` package provides a set of tools for plotting. It is available
through the Julia package system.
```julia
using Pkg
Pkg.add("Plots")
```

In [None]:
import Plots

### What is a DataFrame?

A DataFrame is a data structure like a table or spreadsheet. You can use it
for storing and exploring a set of related data values. Think of it as a
smarter array for holding tabular data.
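
As a minimal sketch of the idea, the block below builds a small DataFrame by hand. The column names and values are made up for illustration and are not part of the tutorial's data files.

```julia
import DataFrames

# Hypothetical toy data: each keyword argument becomes a named column.
toy_df = DataFrames.DataFrame(
    Name = ["Ada", "Grace", "Alan"],
    Height = [1.70, 1.65, 1.78],
    Role = ["analyst", "admiral", "logician"],
)

# Each column is an ordinary vector with its own element type.
heights = toy_df.Height
dims = (DataFrames.nrow(toy_df), DataFrames.ncol(toy_df))
```

Unlike a plain matrix, the columns are named and may each hold a different type.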

## Reading Tabular Data into a DataFrame

We will begin by reading data from different file formats into a DataFrame
object.

### CSV files

CSV and other delimited text files can be read by the CSV.jl package.

```julia
using Pkg
Pkg.add("CSV")
```

In [None]:
import CSV

To read a CSV file into a DataFrame, we use the `CSV.read` function.

In [None]:
csv_df = CSV.read(joinpath(DATA_DIR, "StarWars.csv"), DataFrames.DataFrame)

Let's try plotting some of this data:

In [None]:
Plots.scatter(
    csv_df.Weight,
    csv_df.Height,
    xlabel = "Weight",
    ylabel = "Height",
)

That doesn't look right. What happened? If you look at the dataframe above,
CSV.jl read `Weight` in as a `String` column because some fields contain "NA".
Let's correct that by telling CSV.jl to treat "NA" as `missing`.

In [None]:
csv_df = CSV.read(
    joinpath(DATA_DIR, "StarWars.csv"),
    DataFrames.DataFrame,
    missingstring="NA",
)
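
To see the effect of `missingstring` in isolation, here is a small sketch that parses an in-memory string instead of a file. The inline data is hypothetical and only stands in for `StarWars.csv`.

```julia
import CSV
import DataFrames

# Hypothetical inline data standing in for a file on disk.
raw = """
Name,Weight
A,79
B,NA
C,65
"""

# Without `missingstring`, the "NA" field forces the column to be read as text.
as_text = CSV.read(IOBuffer(raw), DataFrames.DataFrame)
text_column = eltype(as_text.Weight) <: AbstractString

# With `missingstring`, "NA" becomes `missing` and the rest parse as integers.
as_numbers = CSV.read(IOBuffer(raw), DataFrames.DataFrame; missingstring = "NA")
number_column = eltype(as_numbers.Weight) == Union{Missing,Int}
```

A column that allows `missing` can still be passed to numeric functions that know how to skip or handle missing values.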

Then let's re-plot our data:

In [None]:
Plots.scatter(
    csv_df.Weight,
    csv_df.Height,
    title = "Height vs Weight of StarWars characters",
    xlabel = "Weight",
    ylabel = "Height",
    label = false,
    ylims = (0, 3),
)

Better! Read the [CSV documentation](https://csv.juliadata.org/stable/) for
other parsing options.

DataFrames.jl supports manipulation using functions similar to pandas. For
example, split the dataframe into groups based on eye color:

In [None]:
by_eyecolor = DataFrames.groupby(csv_df, :Eyecolor)

Then recombine into a single dataframe based on a function operating over the
split dataframes:

In [None]:
eyecolor_count = DataFrames.combine(by_eyecolor) do df
    return DataFrames.nrow(df)
end

We can rename columns:

In [None]:
DataFrames.rename!(eyecolor_count, :x1 => :count)
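
The group, count, and rename steps can also be written as a single `combine` call by pairing `nrow` with the desired column name. The sketch below uses a hypothetical toy DataFrame rather than the Star Wars data.

```julia
import DataFrames

# Hypothetical toy data with one categorical column.
toy_df = DataFrames.DataFrame(
    Eyecolor = ["blue", "brown", "blue", "brown", "blue"],
)

# `nrow => :count` counts the rows in each group and names the result column.
counts = DataFrames.combine(
    DataFrames.groupby(toy_df, :Eyecolor),
    DataFrames.nrow => :count,
)
```

This avoids the automatically generated `x1` column name, so no rename is needed afterwards.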

Drop some missing rows:

In [None]:
DataFrames.dropmissing!(eyecolor_count, :Eyecolor)
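
As a small illustration of how the column argument changes the result, the following sketch uses the non-mutating `dropmissing` on hypothetical toy data.

```julia
import DataFrames

# Hypothetical toy data with a missing value in each column.
toy_df = DataFrames.DataFrame(
    Eyecolor = ["blue", missing, "brown"],
    count = [3, 1, missing],
)

# Only rows with a missing `Eyecolor` are removed; the missing `count` stays.
by_column = DataFrames.dropmissing(toy_df, :Eyecolor)

# With no column given, any row containing a missing value is removed.
all_columns = DataFrames.dropmissing(toy_df)
```

The `!` variant used in the tutorial applies the same rule but modifies the DataFrame in place instead of returning a new one.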

Then we can visualize the data:

In [None]:
sort!(eyecolor_count, :count, rev = true)
Plots.bar(
    eyecolor_count.Eyecolor,
    eyecolor_count.count,
    xlabel = "Eyecolor",
    ylabel = "Number of characters",
    label = false,
)

### Other Delimited Files

We can also use the `CSV.jl` package to read any other delimited text file
format.

By default, `CSV.File` (which `CSV.read` uses to parse the file) will try to
detect a file's delimiter from the first 10 lines of the file.

Candidate delimiters include `','`, `'\t'`, `' '`, `'|'`, `';'`, and `':'`. If
it can't auto-detect the delimiter, it will assume `','`.

Let's take the example of space separated data.

In [None]:
ss_df = CSV.read(joinpath(DATA_DIR, "Cereal.txt"), DataFrames.DataFrame)

We can also specify the delimiter by passing the `delim` argument.

In [None]:
delim_df = CSV.read(
    joinpath(DATA_DIR, "Soccer.txt"),
    DataFrames.DataFrame,
    delim = "::",
)
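
The `delim` argument accepts either a single character or a string. The sketch below shows both forms on hypothetical in-memory data, so it does not depend on the files in `DATA_DIR`.

```julia
import CSV
import DataFrames

# Hypothetical inline data; the column names and values are made up.
semicolons = "team;goals\nA;3\nB;1\n"
double_colons = "team::goals\nA::3\nB::1\n"

# A single-character delimiter is passed as a Char.
semi_df = CSV.read(IOBuffer(semicolons), DataFrames.DataFrame; delim = ';')

# A multi-character delimiter is passed as a String.
colon_df = CSV.read(IOBuffer(double_colons), DataFrames.DataFrame; delim = "::")
```

Passing `delim` explicitly is a reasonable default whenever the format is known in advance, since it does not rely on auto-detection.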

## Working with DataFrames

Now that we have read the required data into a DataFrame, let us look at some
basic operations we can perform on it.

### Querying Basic Information

The `size` function gets us the dimensions of the DataFrame.

In [None]:
DataFrames.size(ss_df)

We can also use the `nrow` and `ncol` functions to get the number of rows and
columns respectively.

In [None]:
DataFrames.nrow(ss_df), DataFrames.ncol(ss_df)

The `describe` function gives basic summary statistics of data in a DataFrame.

In [None]:
DataFrames.describe(ss_df)
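
`describe` also accepts the names of specific statistics as symbols. The sketch below requests only a few of them for a hypothetical toy DataFrame.

```julia
import DataFrames

# Hypothetical toy data with one numeric and one text column.
toy_df = DataFrames.DataFrame(
    Calories = [110, 120, missing, 100],
    Name = ["a", "b", "c", "d"],
)

# Request only the mean, extrema, and number of missing values per column.
stats = DataFrames.describe(toy_df, :mean, :min, :max, :nmissing)
```

The result is itself a DataFrame with one row per column of the input, so it can be indexed like any other.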

Names of every column can be obtained by the `names` function.

In [None]:
DataFrames.names(ss_df)

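The corresponding column types are obtained by broadcasting `eltype` over the
columns. (Broadcasting over the DataFrame itself would apply `eltype` to every
cell instead.)

In [None]:
eltype.(DataFrames.eachcol(ss_df))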

### Accessing the Data

Similar to regular arrays, we use numerical indexing to access elements of a
DataFrame.

In [None]:
csv_df[1, 1]

The following are different ways to access a column.

In [None]:
csv_df[!, 1]

In [None]:
csv_df[!, :Name]

In [None]:
csv_df.Name

In [None]:
csv_df[:, 1] # Note that this creates a copy.
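
The difference between `!` and `:` matters once you start modifying the result. A minimal sketch on hypothetical toy data:

```julia
import DataFrames

# Hypothetical toy data.
toy_df = DataFrames.DataFrame(Name = ["a", "b"], Height = [1.7, 1.8])

stored = toy_df[!, :Height]   # the vector stored inside the DataFrame
copied = toy_df[:, :Height]   # a freshly allocated copy

same_object = stored === toy_df.Height
different_object = copied !== toy_df.Height

copied[1] = 0.0               # leaves toy_df untouched
stored[2] = 2.0               # changes toy_df.Height[2]
```

Use `:` when you want to work on the data without affecting the DataFrame, and `!` or property access when you want to avoid the copy.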

The following are different ways to access a row.

In [None]:
csv_df[1:1, :]

In [None]:
csv_df[1, :] # This produces a DataFrameRow.
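
The two row forms also behave differently under modification: a `DataFrameRow` is a view into its parent, whereas slicing with a range produces a new DataFrame. A sketch on hypothetical toy data:

```julia
import DataFrames

# Hypothetical toy data.
toy_df = DataFrames.DataFrame(Name = ["a", "b"], Height = [1.7, 1.8])

one_row_df = toy_df[1:1, :]   # a new one-row DataFrame (a copy)
row = toy_df[1, :]            # a DataFrameRow viewing toy_df

row.Height = 1.9              # writes through to toy_df
one_row_df.Height[1] = 0.0    # only changes the copy
```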

We can change values using the usual assignment syntax.

Assign a scalar to a range of rows.

In [None]:
csv_df[1:3, :Height] .= 1.83

Assign a vector to a range of rows of equal length.

In [None]:
csv_df[4:6, :Height] = [1.8, 1.6, 1.8]

In [None]:
csv_df
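
The same two assignment forms can be tried on a self-contained example. The DataFrame below is hypothetical toy data.

```julia
import DataFrames

# Hypothetical toy data with a single Float64 column.
toy_df = DataFrames.DataFrame(Height = [1.5, 1.6, 1.7, 1.8, 1.9, 2.0])

# Broadcast one scalar over three rows.
toy_df[1:3, :Height] .= 1.83

# Assign element-wise from a vector of matching length.
toy_df[4:6, :Height] = [1.8, 1.6, 1.8]
```

In both cases the new values must be convertible to the column's element type.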

!!! tip
    There are a lot more things which can be done with a DataFrame. Read the
    [docs](https://juliadata.github.io/DataFrames.jl/stable/) for more
    information.

For information on dplyr-type syntax:
* Read: https://dataframes.juliadata.org/stable/man/querying_frameworks/
* Check out DataFramesMeta.jl: https://github.com/JuliaData/DataFramesMeta.jl
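
Before reaching for an extra package, note that DataFrames.jl itself provides verb-style functions such as `subset` and `select`. The following sketch stays within DataFrames.jl and uses hypothetical toy data; it is an illustration of the style, not a replacement for the frameworks linked above.

```julia
import DataFrames

# Hypothetical toy data.
toy_df = DataFrames.DataFrame(
    Name = ["a", "b", "c"],
    Height = [1.7, 1.8, 1.6],
    Weight = [70, 80, 60],
)

# Keep the rows where Height exceeds 1.65, testing one row at a time.
tall = DataFrames.subset(toy_df, :Height => DataFrames.ByRow(h -> h > 1.65))

# Keep only the Name column of the filtered result.
names_only = DataFrames.select(tall, :Name)
```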

---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*