# Introduction to DataFrames

In [None]:
using DataArrays
using DataFrames

## Missing values

* A missing value is represented by ``NA``.
* ``NA`` is not part of Base; it is provided by the ``DataArrays`` package.
* ``NA`` poisons other values: any arithmetic operation that involves ``NA`` returns ``NA``.

In [None]:
# NA poisons other values
1+NA

In [None]:
# Check if the evaluation of an expression results in NA
isna(1+NA)

In [None]:
# Note the difference between NaN and NA: NaN is a Float64 value, NA is not
(isa(NaN, Float64), isa(NA, Float64))

## DataArrays

* ``DataArray``s represent arrays that may contain missing values.
* ``DataArray{T}`` can store values of type ``T`` or ``NA``.
* In other words, ``DataArray{T}`` adds ``NA`` to ``Array{T}``.
* ``PooledDataArray{T}`` stores data efficiently when the array holds only a few distinct values.
* ``PooledDataArray{T}`` compresses ``DataArray{T}`` by keeping each distinct value once in a pool and referring to it by index.
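
The pooling idea can be sketched as follows. This is a minimal illustration that assumes the same legacy ``DataArrays`` version used in the rest of this notebook; the team names are made-up data, not part of the original example.

```julia
using DataArrays

# Hypothetical data: a vector with few distinct values and one missing entry
teams = @pdata(["Bulls", "Lakers", "Bulls", NA, "Celtics", "Bulls"])

# levels() lists the distinct non-missing values kept in the pool
pool = levels(teams)

# Missing entries are detected in the usual way
missing_fourth = isna(teams[4])
```

Six entries are stored, but the pool holds only the three distinct strings.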

### Constructing DataArrays

In [None]:
# Call the DataArray() constructor by passing a Vector to it
DataArray([0.1, 0.5, -2.4])

In [None]:
# Construct a DataArray by calling the @data() macro with a Vector input argument
@data([0.1, 0.5, -2.4])

In [None]:
# Convert Vector to DataArray
convert(DataArray, [0.1, 0.5, -2.4])

In [None]:
# It is not possible to call DataArray() with NA in its input argument
DataArray([0.1, NA, -2.4])

In [None]:
# However, it is possible to pass NA to the @data() macro
@data([0.1, NA, -2.4])

In [None]:
# The DataArray() constructor can be called with a Matrix input argument
DataArray([0.4 1.2; 3.5 7.2])

In [None]:
# The @data() macro can also be called with a Matrix input argument
@data([0.4 1.2; 3.5 7.2])

In [None]:
# Convert a Matrix to DataArray
convert(DataArray, [0.4 1.2; 3.5 7.2])

### Numerical computing with DataArrays

In [None]:
# Numerical computing can be done with data vectors
x = @data([0.1, NA, -2.4])
y = @data([-9.9, 0.5, 6.7])
x+y

In [None]:
# To remove missing values (NA), call dropna()
x = @data([0.1, NA, -2.4])
dropna(x)
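
Since ``dropna()`` returns only the non-missing values, a common pattern is to drop ``NA`` first and reduce afterwards. A small sketch, assuming the same legacy ``DataArrays`` API as above (the variable names ``clean`` and ``total`` are illustrative):

```julia
using DataArrays

x = @data([0.1, NA, -2.4])

# Keep only the non-missing values
clean = dropna(x)

# The reduction now runs over ordinary numbers, so NA cannot poison it
total = sum(clean)
```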

In [None]:
# Numerical computing can be done with data matrices and data vectors
A = @data([0.4 1.2 4.4; NA 7.2 3.9; 5.1 1.8 4.5])
y = @data([-9.9, 0.5, 6.7])
A*y

## DataFrames

* ``DataFrame``s represent data tables.
* A ``DataFrame`` is an ordered collection of ``DataArray``s.
* Every ``DataArray`` of a ``DataFrame`` represents one column of the corresponding data table.
* ``DataFrame``s accommodate heterogeneous data that may contain missing values.
* Every column (``DataArray``) of a ``DataFrame`` has its own element type.

### Example 02-01-01: NBA champions

#### Constructing DataFrames

In [None]:
# Call the DataFrame() constructor with keyword arguments (columns) of type Vector
DataFrame(
  player = ["Larry Bird", "Magic Johnson", "Michael Jordan", "Scottie Pippen"],
  champions = [3, 5, 6, 6]
)

In [None]:
# Start with an empty DataFrame and populate it
ChampionsFrame = DataFrame()
ChampionsFrame[:player] = ["Larry Bird", "Magic Johnson", "Michael Jordan", "Scottie Pippen"]
ChampionsFrame[:champions] = [3, 5, 6, 6]
ChampionsFrame

In [None]:
# Provide CSV-like tabular data to construct a new DataFrame
csv"""
  player,champions
  Larry Bird,3
  Magic Johnson,5
  Michael Jordan,6
  Scottie Pippen,6
"""

In [None]:
# Call the DataFrame() constructor with keyword arguments (columns) of type DataArray
player = @data(["Larry Bird", "Magic Johnson", "Michael Jordan", "Scottie Pippen"])
champions = @data([3, 5, 6, 6])
ChampionsFrame = DataFrame(player=player, champions=champions)

In [None]:
# Construct a DataFrame by joining two existing DataFrames
height = [2.06, 2.06, 1.98, 2.03]
HeightsFrame = DataFrame(player=player, height=height)
join(ChampionsFrame, HeightsFrame, on = :player)
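
The join above keeps only players that appear in both tables. As a hedged sketch of what happens when the tables do not match, the following uses the ``kind`` keyword of the legacy ``join()``; the tables ``champs`` and ``heights`` are hypothetical variants of the ones above, with one player deliberately left out of the heights table.

```julia
using DataFrames

# Hypothetical tables: Scottie Pippen has no entry in the heights table
champs = DataFrame(
  player = ["Larry Bird", "Magic Johnson", "Michael Jordan", "Scottie Pippen"],
  champions = [3, 5, 6, 6]
)
heights = DataFrame(
  player = ["Larry Bird", "Magic Johnson", "Michael Jordan"],
  height = [2.06, 2.06, 1.98]
)

# The default (inner) join keeps only players present in both tables
inner = join(champs, heights, on = :player)

# A left join keeps every row of the first table and fills the gaps with NA
left = join(champs, heights, on = :player, kind = :left)
```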

#### Querying basic information about DataFrames

In [None]:
# Get number of rows of a DataFrame
size(ChampionsFrame, 1)

In [None]:
# Get number of columns of a DataFrame
size(ChampionsFrame, 2)

In [None]:
# Get a numeric summary of a DataFrame
describe(ChampionsFrame)

#### Indexing DataFrames

In [None]:
# Index DataFrame by column name to get a specific column
ChampionsFrame[:player]

In [None]:
# Index DataFrame by row numbers to get specific rows
ChampionsFrame[2:3, :]
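
Rows can also be selected by a condition on a column instead of by position. A minimal sketch, assuming the legacy indexing syntax used in this notebook; the frame is rebuilt here so the example is self-contained, and the name ``winners`` is illustrative.

```julia
using DataFrames

ChampionsFrame = DataFrame(
  player = ["Larry Bird", "Magic Johnson", "Michael Jordan", "Scottie Pippen"],
  champions = [3, 5, 6, 6]
)

# Build a Boolean mask from a column and use it as the row index
winners = ChampionsFrame[ChampionsFrame[:champions] .> 4, :]
```

The mask must not contain ``NA``, so this pattern is safest on columns without missing values.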