In [1]:
# HIDDEN
Base.displaysize() = (5, 80)

## Getting Started

In the remaining sections of this chapter we will work with the Baby Names dataset from Chapter 1. We will pose a question, break the question down into high-level steps, then translate each step into Julia code using `DataFrames`. We begin by importing `DataFrames` and `CSV`:

In [2]:
using DataFrames
using CSV

Now we can read in the data using `CSV.read` ([docs](https://juliadata.github.io/CSV.jl/stable/)).

In [29]:
# CSV.jl 1.0 and later require a sink argument: CSV.read("babynames.csv", DataFrame)
baby = CSV.read("babynames.csv")
baby

| | Name | Sex | Count | Year |
| --- | --- | --- | --- | --- |
| | String | String | Int64 | Int64 |
| 1 | Mary | F | 9217 | 1884 |
| 2 | Anna | F | 3860 | 1884 |
| 3 | Emma | F | 2587 | 1884 |
| 4 | Elizabeth | F | 2549 | 1884 |
| 5 | Minnie | F | 2243 | 1884 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |


Note that for the code above to work, the `babynames.csv` file must be located in the same directory as this notebook. We can check what files are in the current folder by running the `ls` command-line tool. The leading `;` tells Julia to run the rest of the line as a shell command:

In [30]:
;ls

babynames.csv
julia_indexes.ipynb
julia_intro.ipynb
julia_structure.ipynb
others
pandas_apply_strings_plotting.ipynb
pandas_apply_strings_plotting.md
pandas_apply_strings_plotting_files
pandas_grouping_pivoting.ipynb
pandas_grouping_pivoting.md
pandas_indexes.ipynb
pandas_indexes.md
pandas_intro.ipynb
pandas_intro.md
pandas_structure.ipynb
pandas_structure.md



When we use `CSV.read` to read in data, we get a `DataFrame`. A DataFrame is a tabular data structure where each column is labeled (in this case `Name`, `Sex`, `Count`, `Year`) and each row is identified by its position (in this case 1, 2, ..., 1891894). Unlike a pandas DataFrame, a Julia DataFrame has no separate row index. In this respect it resembles the Table object introduced in Data 8, which also labels only its columns.
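
As a minimal sketch of this structure, the toy table below is built by hand from the first three rows shown above and inspected with `size`, `nrow`, and `ncol`. The variable name `toy` is only for illustration.

```julia
using DataFrames

# A toy table with the same columns as the Baby Names data
toy = DataFrame(
    Name = ["Mary", "Anna", "Emma"],
    Sex = ["F", "F", "F"],
    Count = [9217, 3860, 2587],
    Year = [1884, 1884, 1884],
)

size(toy)        # (rows, columns)
nrow(toy)        # number of rows
ncol(toy)        # number of columns
toy[1, :Name]    # rows are addressed by position, starting at 1
```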

## Indexes, Slicing, and Sorting

Let's use `DataFrames` to answer the following question:

**What were the five most popular baby names in 2016?**

### Breaking the Problem Down

We can decompose this question into the following simpler table manipulations:

1. Slice out the rows for the year 2016.
2. Sort the rows in descending order by `Count`.
3. Take the first five rows.

Now, we can express these steps in `DataFrames`.

### Taking a subset

Specific subsets of a data frame can be extracted using the indexing syntax. The colon `:` indicates that all items (rows or columns depending on its position) should be retained:

In [31]:
# Keep all rows and all columns
baby[:, :]

| | Name | Sex | Count | Year |
| --- | --- | --- | --- | --- |
| | String | String | Int64 | Int64 |
| 1 | Mary | F | 9217 | 1884 |
| 2 | Anna | F | 3860 | 1884 |
| 3 | Emma | F | 2587 | 1884 |
| 4 | Elizabeth | F | 2549 | 1884 |
| 5 | Minnie | F | 2243 | 1884 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |


To obtain a list of column names, use the `names()` function:

In [32]:
print(names(baby))

Symbol[:Name, :Sex, :Count, :Year]

Column names are represented as symbols, which are written with a leading `:`. So `:Count` refers to the column named `Count`, which is column number 3.

Julia uses 1-based indexing rather than the 0-based indexing common in many other languages. To select the element at row 2 of column `Name`, we can do:

In [33]:
baby[2, :Name]

"Anna"

To slice out multiple rows or columns, we can use a range written `start:stop`. Note that ranges include both endpoints, so `2:6` selects rows 2 through 6. To specify a range of rows and columns, use index numbers:

In [34]:
baby[2:6, 1:3]

| | Name | Sex | Count |
| --- | --- | --- | --- |
| | String | String | Int64 |
| 1 | Anna | F | 3860 |
| 2 | Emma | F | 2587 |
| 3 | Elizabeth | F | 2549 |
| 4 | Minnie | F | 2243 |
| 5 | Margaret | F | 2142 |
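
A minimal sketch in plain Julia, using made-up vectors rather than the dataset, illustrates both points: indices start at 1, and a range `a:b` includes both endpoints.

```julia
name_list = ["Mary", "Anna", "Emma"]

name_list[1]      # "Mary": the first element has index 1
name_list[end]    # "Emma": `end` is the last valid index
# name_list[0] would throw a BoundsError

r = 2:6
collect(r)        # [2, 3, 4, 5, 6]: both endpoints are included
length(r)         # 5

v = [10, 20, 30, 40, 50, 60]
v[2:6]            # [20, 30, 40, 50, 60]
```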


We will often want a single column from a DataFrame:

In [35]:
baby[:, :Year]

1891894-element Array{Int64,1}:
 ⋮

Note that when we select a single column, we get an `Array`, so we can perform arithmetic on all of its elements at once:

In [36]:
baby[:, :Year] * 2

1891894-element Array{Int64,1}:
 ⋮
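
Multiplying a vector by a scalar works directly, but most other element-wise operations need the dot (broadcast) form. The sketch below uses a small made-up `years` vector to show the difference.

```julia
years = [1884, 1884, 2016]

years * 2         # [3768, 3768, 4032]: vector times scalar is defined
years .* 2        # same result, written as an explicit broadcast
years .- 1880     # [4, 4, 136]: subtracting a scalar needs the dot
# years - 1880 throws a MethodError in Julia 1.x
```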

To select specific columns, we can pass a vector of column names as the column index:

In [37]:
# This is a DataFrame again
baby[:, [:Name, :Year]]

| | Name | Year |
| --- | --- | --- |
| | String | Int64 |
| 1 | Mary | 1884 |
| 2 | Anna | 1884 |
| 3 | Emma | 1884 |
| 4 | Elizabeth | 1884 |
| 5 | Minnie | 1884 |
| ⋮ | ⋮ | ⋮ |


Selecting columns is common, so there's a shorthand.

In [38]:
# Shorthand for baby[!, :Name]: returns the stored column itself, not a copy
baby.Name

1891894-element CSV.Column{String,PooledString}:
 ⋮
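
The two ways of selecting a column differ in whether they copy. The sketch below, on a made-up two-row table and assuming a DataFrames version that supports `!` row indexing (0.19 or later), shows that `toy.Name` and `toy[!, :Name]` return the stored column, while `toy[:, :Name]` returns a fresh copy.

```julia
using DataFrames

toy = DataFrame(Name = ["Mary", "Anna"], Count = [9217, 3860])

toy.Name === toy[!, :Name]    # true: both return the stored column itself
toy[:, :Name] == toy.Name     # true: the values are equal...
toy[:, :Name] === toy.Name    # false: ...but `:` makes a copy

# Changing a copy leaves the DataFrame untouched
col = toy[:, :Count]
col[1] = 0
toy.Count[1]                  # still 9217
```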

#### Selecting rows with conditions

To select all rows with the year 2016:

In [52]:
baby_2016 = baby[baby.Year .== 2016, :]

| | Name | Sex | Count | Year |
| --- | --- | --- | --- | --- |
| | String | String | Int64 | Int64 |
| 1 | Emma | F | 19414 | 2016 |
| 2 | Olivia | F | 19246 | 2016 |
| 3 | Ava | F | 16237 | 2016 |
| 4 | Sophia | F | 16070 | 2016 |
| 5 | Isabella | F | 14722 | 2016 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |


The inner expression `baby.Year .== 2016` carries out an element-wise comparison of the values in column `Year` and returns an array of Boolean `true` or `false` values, one for each row. Notice the dot in `.==`: it broadcasts the comparison over every element of the column. Only the rows where the value is `true` are kept.
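
To see the mask on its own, here is a small sketch with made-up `years` and `counts` vectors. It also shows one way to combine two conditions with the broadcast `.&` operator, which the text above does not cover.

```julia
years = [1884, 2016, 2016, 1990]
counts = [9217, 19414, 19246, 500]

mask = years .== 2016     # one Bool per element: [false, true, true, false]
counts[mask]              # [19414, 19246]

# Conditions combine element-wise with .& (and) and .| (or);
# the parentheses around each comparison are required
both = (years .== 2016) .& (counts .> 19300)
counts[both]              # [19414]
```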

### Sorting Rows

The next step is to sort the rows in descending order by `Count`. We can use the `sort!()` function with `rev=true`. By Julia convention, a function whose name ends in `!` modifies its argument, so `sort!` sorts `baby_2016` in place.

In [54]:
sort!(baby_2016, :Count, rev=true)

| | Name | Sex | Count | Year |
| --- | --- | --- | --- | --- |
| | String | String | Int64 | Int64 |
| 1 | Emma | F | 19414 | 2016 |
| 2 | Olivia | F | 19246 | 2016 |
| 3 | Noah | M | 19015 | 2016 |
| 4 | Liam | M | 18138 | 2016 |
| 5 | Ava | F | 16237 | 2016 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
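
If the original order should be preserved, the non-mutating `sort` returns a sorted copy instead. The sketch below contrasts the two on a made-up three-row table.

```julia
using DataFrames

toy = DataFrame(Name = ["Ava", "Emma", "Noah"], Count = [16237, 19414, 19015])

sorted = sort(toy, :Count, rev=true)   # returns a new DataFrame
toy.Name                               # ["Ava", "Emma", "Noah"]: unchanged
sorted.Name                            # ["Emma", "Noah", "Ava"]

sort!(toy, :Count, rev=true)           # reorders toy itself
toy.Name                               # ["Emma", "Noah", "Ava"]
```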


We can then slice the resulting DataFrame. Remember that Julia uses 1-based indexing:

In [56]:
# Get the value in the first row, first column
baby_2016[1, 1]

"Emma"

In [57]:
# Get the first five rows
baby_2016[1:5, :]

| | Name | Sex | Count | Year |
| --- | --- | --- | --- | --- |
| | String | String | Int64 | Int64 |
| 1 | Emma | F | 19414 | 2016 |
| 2 | Olivia | F | 19246 | 2016 |
| 3 | Noah | M | 19015 | 2016 |
| 4 | Liam | M | 18138 | 2016 |
| 5 | Ava | F | 16237 | 2016 |
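
As an alternative to the range slice, DataFrames also provides `first(df, n)`. The sketch below, on a made-up six-row table, suggests that the two forms give the same rows.

```julia
using DataFrames

toy = DataFrame(
    Name = ["Emma", "Olivia", "Noah", "Liam", "Ava", "Sophia"],
    Count = [19414, 19246, 19015, 18138, 16237, 16070],
)

top5 = first(toy, 5)            # the first five rows
isequal(top5, toy[1:5, :])      # true: same rows as the range slice
top5.Name                       # ["Emma", "Olivia", "Noah", "Liam", "Ava"]
```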


## In Conclusion

We now have the five most popular baby names in 2016, and we have learned to express the following operations with `DataFrames` and `CSV`:

| Goal | Operation |
| --------- | -------  |
| Read a CSV file | `CSV.read` |
| Slicing using labels or indices | `df[rows, cols]` with `:` and ranges |
| Slicing rows using a condition | Boolean mask from a broadcast comparison such as `.==` |
| Sorting rows | `sort!()` |
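
Putting the steps together, a minimal end-to-end sketch on a made-up five-row table (so it runs without `babynames.csv`) might look like this. It takes the top two rows rather than five because the toy table is so small.

```julia
using DataFrames

toy = DataFrame(
    Name = ["Mary", "Ava", "Emma", "Noah", "Anna"],
    Count = [9217, 16237, 19414, 19015, 3860],
    Year = [1884, 2016, 2016, 2016, 1884],
)

in_2016 = toy[toy.Year .== 2016, :]    # 1. keep only the rows for 2016 (a copy)
sort!(in_2016, :Count, rev=true)       # 2. sort by Count, descending, in place
top = in_2016[1:2, :]                  # 3. take the top rows
top.Name                               # ["Emma", "Noah"]
```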