# Reading data into Julia

This notebook demonstrates reading data into Julia and performing basic manipulations.

## Load libraries

First, we need to load the libraries we will be working with. In Julia, functionality tends to be split between libraries much more than it is in other languages such as Python or R. To read data from a CSV file, we need two libraries: CSV and DataFrames. We also load the StatsBase package, which contains many basic statistical functions (mean, standard deviation, etc.).

In [None]:
using CSV, DataFrames, StatsBase

## Reading data

Since CSV reading and tabular data manipulation are split across libraries, we call the `CSV.read` function with two arguments: the path to the file, and `DataFrame`, the type of table to load the data into.

Unlike many other languages, Julia in an interactive session will print out the result of an operation, even if you assign it to a variable. To prevent this, add a semicolon at the end of the line.

In [None]:
sensors = CSV.read("data/bay_area_freeways.csv", DataFrame)
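To try `CSV.read` without the data file at hand, the sketch below parses a small in-memory table instead. The three rows are made-up illustration values, not taken from `bay_area_freeways.csv`; only the column names mirror the ones used in this notebook, and `toy` is a hypothetical variable name. The trailing semicolon suppresses the printed result.

```julia
using CSV, DataFrames

# Made-up CSV text, for illustration only.
csv_text = """
freeway_number,direction,avg_occ
101,N,0.04
101,S,0.07
280,N,0.02
"""

# CSV.read accepts an IO source as well as a file path.
toy = CSV.read(IOBuffer(csv_text), DataFrame);

size(toy)      # (number of rows, number of columns)
names(toy)     # column names, as strings
first(toy, 2)  # the first two rows
```

Checking `size` and `names` right after loading is a quick way to confirm that the file was parsed with the expected columns.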

## Accessing data in a loaded table

Once the data is loaded, we can access it by column or by row. Each column is a Vector (known as an array or list in other languages).

In [None]:
sensors.avg_occ
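Rows and single cells are reached through indexing with `[row, column]`. The sketch below uses a small made-up table (the values are illustrative, not from the sensor data) to show a few common forms.

```julia
using DataFrames

# Made-up values, for illustration only.
toy = DataFrame(freeway_number = [101, 101, 280],
                direction = ["N", "S", "N"],
                avg_occ = [0.04, 0.07, 0.02])

toy.avg_occ        # the column itself (no copy is made)
toy[!, :avg_occ]   # the same column, also without copying
toy[:, :avg_occ]   # a copy of the column
toy[1, :]          # the first row, as a DataFrameRow
toy[1:2, :]        # the first two rows, as a new DataFrame
toy[2, :avg_occ]   # a single value: row 2 of avg_occ
```

The difference between `!` and `:` matters when you modify the result: changes to a column obtained with `!` (or with `toy.avg_occ`) also change the table, while changes to a copy do not.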

## Computing functions of columns

Many functions in Julia are defined for Vectors of numbers, for instance `mean`.

In [None]:
mean(sensors.avg_occ)
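Other summary functions work the same way. The sketch below uses a made-up vector rather than the sensor data, and loads `Statistics` from Julia's standard library (an assumption made here to keep the example self-contained). It also shows one common wrinkle: if a column contains `missing` values, `mean` returns `missing` unless the column is wrapped in `skipmissing`.

```julia
using Statistics

occ = [0.04, 0.07, 0.02]   # made-up values, for illustration only

mean(occ)
median(occ)
std(occ)       # sample standard deviation (divides by n - 1)
extrema(occ)   # (minimum, maximum)

# A vector with a gap in it.
occ_with_gap = [0.04, missing, 0.02]

mean(occ_with_gap)               # missing
mean(skipmissing(occ_with_gap))  # mean of the non-missing values
```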

## Transforming columns

You can apply any Julia function elementwise by placing a `.` between the function name and its arguments. This is known as broadcasting.

In [None]:
log.(sensors.avg_occ)
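The dot syntax is not limited to built-in functions. As a minimal sketch, the example below defines a hypothetical helper, `to_percent`, for a single number and then applies it to every element of a made-up vector.

```julia
occ = [0.04, 0.07, 0.02]   # made-up values, for illustration only

# A hypothetical helper, written for a single number.
to_percent(x) = 100 * x

to_percent.(occ)          # applied to each element
round.(occ; digits = 1)   # keyword arguments are passed along as usual
@. log(occ) + 1           # @. adds a dot to every call in the expression
```

The `@.` macro is convenient for longer expressions, where writing each dot by hand is easy to get wrong.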

## Filtering data

You can also filter rows using logical operators. Like functions, they can be applied elementwise with a `.`, but the `.` goes before the operator. We have to explicitly specify that we want all columns (`:`).

In [None]:
sensors[sensors.avg_occ .> 0.05,:]

### Logical operators

There is nothing special about `sensors.avg_occ .> 0.05`. It simply creates a vector of boolean values, one for each row of the data, indicating whether that row has `avg_occ` > 0.05.

In [None]:
sensors.avg_occ .> 0.05
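Because the condition is an ordinary boolean vector, several conditions can be combined with the elementwise operators `.&` (and) and `.|` (or). The sketch below uses a small made-up table; the values and the names `toy`, `high`, and `north` are illustrative assumptions.

```julia
using DataFrames

# Made-up values, for illustration only.
toy = DataFrame(freeway_number = [101, 101, 280],
                direction = ["N", "S", "N"],
                avg_occ = [0.04, 0.07, 0.02])

high = toy.avg_occ .> 0.05     # one Bool per row
north = toy.direction .== "N"

count(high)             # how many rows satisfy the condition
toy[high .| north, :]   # rows meeting either condition

# When writing the comparisons inline, wrap each one in parentheses,
# because .& binds more tightly than .> and .==
toy[(toy.avg_occ .> 0.03) .& (toy.direction .== "N"), :]

# DataFrames also provides filter, which takes a column => predicate pair.
filter(:avg_occ => >(0.05), toy)
```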

## Split-apply-combine

Split-apply-combine is a common pattern in many data manipulation tools (e.g. `groupby` in Python's pandas, `group_by %>% summarize` in R). The pattern involves dividing the dataset based on the values of certain variables, applying operations to each subset, and combining the results, most often into a single row per group. For instance, we can compute mean occupancy by freeway and direction, as shown below.

In Julia, we use the `groupby` and `combine` functions for this. We reference variable names by placing a : in front of them, or putting them in double quotes if they contain spaces, dashes, or start with a number.

In [None]:
combine(
    # Group the data by these variables. You only need [] if you are grouping by multiple variables.
    groupby(sensors, [:freeway_number, :direction]),
    # Create a new variable mean_avg_occ by applying the mean() function to the
    # avg_occ column in each subset of the data.
    :avg_occ => mean => :mean_avg_occ
)
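`combine` accepts several `column => function => new_name` pairs at once, so more than one summary can be computed per group. The sketch below runs on a small made-up table (the values are illustrative, not from the sensor data) and also uses `nrow` to count the rows in each group. `by_group`, `max_avg_occ`, and `n_rows` are hypothetical names chosen for this example.

```julia
using DataFrames, Statistics

# Made-up values, for illustration only.
toy = DataFrame(freeway_number = [101, 101, 101, 280],
                direction = ["N", "N", "S", "N"],
                avg_occ = [0.04, 0.06, 0.07, 0.02])

by_group = combine(
    groupby(toy, [:freeway_number, :direction]),
    :avg_occ => mean => :mean_avg_occ,     # mean occupancy per group
    :avg_occ => maximum => :max_avg_occ,   # largest occupancy per group
    nrow => :n_rows                        # number of rows per group
)
```

The result has one row per combination of `freeway_number` and `direction` that appears in the data, with one column for each requested summary.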