# Data Analysis with DataFrames

In this guide, we explore the `DataFrames.jl` package, a tool for manipulating and analyzing tabular data. Plotting is handled by separate packages such as `Plots.jl`, which we use later in this guide. `DataFrames.jl` is a central package in the Julia data ecosystem and provides functionality similar to `pandas` in Python or `data.frame` in R.

## Loading Data

We will require the following packages to load an example dataset into a DataFrame.

In [None]:
import Pkg
Pkg.add("DataFrames") # data frames
Pkg.add("CSV")  # reading csv files

In [None]:
using DataFrames, CSV

In the following, we work with an example dataset of house prices. Analyzing this dataset gives us some insight into the real estate market and into how the sale price of a house depends on its properties.

Let's read the data from a .csv file into a dataframe.

In [None]:
data_dir = "../.assets/data/"

In [None]:
data = CSV.read(joinpath(data_dir, "houses_seattle", "kc_house_data.csv"), DataFrame)

## DataFrame Operations

Let's walk through some basic operations:

### Dimensions and Column Names

Here is how to obtain the dimensions of the dataframe, its column names, and its column datatypes:

In [None]:
n_rows, n_columns = size(data)

In [None]:
names(data)

In [None]:
mapcols(eltype, data)
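
The same inspection calls can be tried on a small hand-built table. The following is a minimal sketch that assumes a hypothetical three-row DataFrame named `toy` with invented values, not the house data:

```julia
using DataFrames

# Hypothetical toy table; names and values are invented for illustration.
toy = DataFrame(
    price = [250_000.0, 480_000.0, 1_200_000.0],
    bedrooms = [2, 3, 5],
    city = ["Kent", "Seattle", "Bellevue"],
)

dims = size(toy)                    # (rows, columns) as a tuple
n_r, n_c = nrow(toy), ncol(toy)     # the same information, one call per dimension
col_names = names(toy)              # column names as a Vector{String}
col_types = eltype.(eachcol(toy))   # element type of each column as a plain vector
```

`eltype.(eachcol(toy))` returns a vector of types, whereas `mapcols(eltype, toy)` returns a one-row DataFrame; which one is more convenient depends on what you do with the result.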

### Accessing Rows

To take a first look at the dataframe, we can use `first` and `last`:

In [None]:
first(data, 5)

In [None]:
last(data, 5)

Getting specific rows by row number:

In [None]:
data[1, :]

In [None]:
data[10:20, :]
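
As a sketch of what these indexing forms return, consider a hypothetical five-row table `toy` (invented values). A single row number yields a `DataFrameRow`, while a range or a vector of row numbers yields a new DataFrame:

```julia
using DataFrames

# Hypothetical toy table for illustration.
toy = DataFrame(id = 1:5, price = [100, 200, 300, 400, 500])

row = toy[2, :]           # a DataFrameRow referring to the second row
row_price = row.price     # values are accessed by column name
block = toy[2:4, :]       # a new DataFrame holding rows 2 to 4
picked = toy[[1, 5], :]   # arbitrary row numbers, in the given order
```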

### Sampling

To take a random sample from the data, we can get help from the `StatsBase` package.

In [None]:
Pkg.add("StatsBase")

using StatsBase

In [None]:
# Sample a number of rows without replacement
data[sample(1:nrow(data), 10; replace=false), :]

In [None]:
# Sample a fraction of rows without replacement
n_samples = floor(Int, nrow(data) * 0.001)
data[sample(1:nrow(data), n_samples; replace=false), :]
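
Random samples differ from run to run. If a repeatable sample is needed, one option is to pass an explicit random number generator to `sample`. The sketch below assumes a hypothetical table `toy` and an arbitrary seed of 42:

```julia
using DataFrames, StatsBase, Random

# Hypothetical toy table with 100 rows.
toy = DataFrame(id = 1:100)

rng = MersenneTwister(42)   # arbitrary seed, chosen only to make the draw repeatable
idx = sample(rng, 1:nrow(toy), 10; replace = false)   # 10 distinct row numbers
sampled = toy[idx, :]
```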

### Accessing Columns

**Accessing a specific column:** To access a specific column, index the DataFrame with a row selector and the column's name in square brackets. Let's break down the syntax of the command `data[!, :price]`:

- The `!` (bang) in the row position selects all rows without copying. The operation returns the column vector stored in the DataFrame, not a new copy of it. Therefore, if you modify the returned array, the original DataFrame `data` is modified as well.
- `:price` is a symbol representing the name of the column you wish to select. Julia uses the colon `:` to create symbols, which are like lightweight string identifiers often used for column names, among other things.

In [None]:
data[!, :price]

To get a copy of the column instead, use `:` in the row position:

In [None]:
data[:, :price]
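
The difference between `!` and `:` becomes visible once the returned vector is modified. The following sketch uses a hypothetical one-column table `toy` with invented values:

```julia
using DataFrames

# Hypothetical toy table for illustration.
toy = DataFrame(price = [100, 200, 300])

col_ref = toy[!, :price]    # the vector stored inside the DataFrame
col_copy = toy[:, :price]   # an independent copy

is_same_ref = col_ref === toy.price     # true: same object
is_same_copy = col_copy === toy.price   # false: different object

col_copy[1] = -1    # changes only the copy
col_ref[1] = 999    # changes the column inside `toy`

toy.price           # [999, 200, 300]
```

Here `toy.price` is shorthand for `toy[!, :price]`, so it also returns the stored vector without copying.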

**Basic Statistics**: 
- The `mean` function is part of Julia's Statistics module.
- To get an overview of the data, we can use the `describe` function. This will return a DataFrame containing basic statistics for each column.

In [None]:
using Statistics

mean(data[!, :price])

In [None]:
describe(data)

**Filtering rows**: The following code filters the rows of a DataFrame by a condition.

Let's break it down:

- `data[!, :price]` is selecting the 'price' column from the DataFrame data. The ! symbol makes the operation non-copying.
- `data[!, :price] .> 1000000` compares each value in the 'price' column to 1 million. This is a broadcasted operation, meaning it's applied element-wise to each value in the 'price' column. The result is a Boolean array where each element is true if the corresponding condition is true, and false otherwise.
- `data[data[!, :price] .> 1000000, :]` then selects all rows from data where the condition is true. The `:` symbol denotes all columns.

In [None]:
data[data[!, :price] .> 1000000, :] 
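
Besides indexing with a Boolean mask, `DataFrames.jl` offers the functions `filter` and `subset` for the same task. The sketch below applies all three to a hypothetical table `toy` (invented values) and also combines two conditions with the broadcasted `.&` operator:

```julia
using DataFrames

# Hypothetical toy table for illustration.
toy = DataFrame(
    price = [750_000, 1_250_000, 2_000_000, 400_000],
    bedrooms = [3, 4, 5, 2],
)

mask = toy.price .> 1_000_000                            # element-wise comparison
by_mask = toy[mask, :]                                   # rows where the mask is true
by_filter = filter(:price => >(1_000_000), toy)          # predicate applied per row
by_subset = subset(toy, :price => ByRow(>(1_000_000)))   # same condition via subset

# Two conditions at once: both must hold for a row to be kept.
large_and_expensive = toy[(toy.price .> 1_000_000) .& (toy.bedrooms .>= 5), :]
```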

**Selecting multiple columns**: To select multiple columns, pass a vector of column names as the column index.

In [None]:
data[!, [:price, :bedrooms, :bathrooms]]    

### Modifying and Reshaping

**Adding a column**: Let's add a new column to the dataframe. After running this line, the DataFrame `data` has a new column `price_per_sqft` containing the price per square foot for each house.

In [None]:
data[!, :price_per_sqft] = data[!, :price] ./ data[!, :sqft_living]

In [None]:
first(data, 5)
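
A derived column can also be created with `transform`, which returns a new DataFrame and leaves the original untouched. The following is a minimal sketch on a hypothetical table `toy`; the column name `ratio` is made up for illustration:

```julia
using DataFrames

# Hypothetical toy table for illustration.
toy = DataFrame(price = [300_000.0, 600_000.0], sqft_living = [1_000, 1_500])

# Property syntax, equivalent to toy[!, :price_per_sqft] = ...
toy.price_per_sqft = toy.price ./ toy.sqft_living

# transform applies `/` row by row to the two source columns and
# stores the result in a new column of a new DataFrame.
with_ratio = transform(toy, [:price, :sqft_living] => ByRow(/) => :ratio)
```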

**Exercise**: Add a column with the price per square meter.

**Grouping and Aggregation**: You can group data by one or more columns and then calculate aggregate statistics for each group. For example, let's calculate the average price for each number of bedrooms:



In [None]:
combine(
    groupby(data, :bedrooms),
    :price => mean => :mean_price,
)
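
Several aggregates can be computed in one `combine` call. The sketch below assumes a hypothetical table `toy` with invented prices and adds a group size and a median next to the mean:

```julia
using DataFrames, Statistics

# Hypothetical toy table for illustration.
toy = DataFrame(
    bedrooms = [2, 2, 3, 3, 3],
    price = [200.0, 300.0, 400.0, 500.0, 600.0],
)

stats = combine(
    groupby(toy, :bedrooms),
    nrow => :n_houses,                  # number of rows per group
    :price => mean => :mean_price,
    :price => median => :median_price,
)
```

Each `source => function => target` pair produces one output column, so the result has one row per group and one column per aggregate.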

**Renaming columns**: To rename columns, you can use the `rename` function and pass a mapping of old column names to new column names.

In [None]:
data = rename(
    data, 
    :yr_built => :year_built, 
    :yr_renovated => :year_renovated,
)
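
`rename` returns a new DataFrame, while `rename!` changes the column names of the existing one. A short sketch on a hypothetical table `toy`:

```julia
using DataFrames

# Hypothetical toy table for illustration.
toy = DataFrame(yr_built = [1990, 2005], yr_renovated = [0, 2015])

renamed = rename(toy, :yr_built => :year_built)   # `toy` keeps its old names
names_before = names(toy)

# In-place variant: modifies `toy` itself.
rename!(toy, :yr_built => :year_built, :yr_renovated => :year_renovated)
```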

**Sorting data**: To sort a DataFrame by one or more columns, you can use the `sort` function, or `sort!` to sort in place. Let's sort the data in place by price in descending order:

In [None]:
sort!(data, :price, rev = true) # in place, modifying the dataframe

In [None]:
sort(data, :sqft_living) # not in place, returns a new sorted dataframe
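
Sorting by several columns with mixed directions is possible with the `order` helper from `DataFrames.jl`. The sketch below uses a hypothetical table `toy` with invented values:

```julia
using DataFrames

# Hypothetical toy table for illustration.
toy = DataFrame(price = [300, 100, 200], sqft_living = [900, 1200, 900])

# Descending by price; `sort` returns a new DataFrame.
desc = sort(toy, :price, rev = true)
still_original = toy.price == [300, 100, 200]   # true: `toy` was not modified

# In place: ascending by living area, ties broken by descending price.
sort!(toy, [:sqft_living, order(:price, rev = true)])
```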

## Data Visualization

In [None]:
Pkg.add("Plots")
using Plots

**Scatter plots**: To create a scatter plot, you can use the `scatter` function from the Plots.jl package. Let's create a scatter plot of price vs. living area:

In [None]:
scatter(
    data[!, :sqft_living], 
    data[!, :price], 
    title = "Price vs Living Area", 
    xlab = "Living Area", 
    ylab = "Price",
    alpha = 0.2,
    markersize = 2,
)

In [None]:
using Statistics

avg_price_by_bedrooms = combine(
    groupby(data, :bedrooms), 
    :price => mean => :avg_price
)

bar(
    avg_price_by_bedrooms[!, :bedrooms], 
    avg_price_by_bedrooms[!, :avg_price], 
    title = "Average Price by Number of Bedrooms", 
    xlab = "Number of Bedrooms", 
    ylab = "Average Price"
)

In [None]:
histogram(
    data[!, :price], 
    bins = 50, 
    title = "Histogram of Prices", 
    xlab = "Price", 
    ylab = "Frequency"
)

For more advanced statistical data visualization, we can use the `StatsPlots` package.

In [None]:
Pkg.add("StatsPlots")

In [None]:
using StatsPlots

# Make sure that :bedrooms and :price are the correct column names
@df data boxplot(
    :bedrooms, 
    :price, 
    group = :bedrooms,
    xlabel = "Number of Bedrooms",
    ylabel = "Price",
    title = "Boxplot of Prices per Number of Bedrooms",
    legend = false,
    outliers = false,
)

The `@df` syntax is a macro provided by the StatsPlots package. A macro in Julia, denoted by the `@` symbol, receives the code it is applied to and rewrites it when the code is parsed, i.e., before the resulting code is executed.

The @df macro in particular is a convenience macro for working with DataFrames in Julia. It allows you to refer to the columns of a DataFrame within a plotting command without having to index into the DataFrame each time.
    
For instance, if you have a DataFrame `df` with columns `:x` and `:y`, instead of writing `plot(df[!, :x], df[!, :y])`, you can write `@df df plot(:x, :y)`. This is particularly handy for longer and more complex plotting commands.
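
To illustrate the idea of rewriting code before it runs, here is a strongly simplified imitation of this behavior. The macro `@with_cols` and the helper `replace_symbols` are hypothetical names invented for this sketch; this is not how StatsPlots implements `@df`, which handles many more cases.

```julia
using DataFrames

# Hypothetical helper: walk an expression and replace every quoted symbol
# such as :x by a column lookup on the DataFrame named `df`.
replace_symbols(ex, df) = ex   # anything else stays as it is
replace_symbols(ex::QuoteNode, df) =
    ex.value isa Symbol ? :($(df)[!, $(ex)]) : ex
replace_symbols(ex::Expr, df) =
    Expr(ex.head, map(arg -> replace_symbols(arg, df), ex.args)...)

# Hypothetical macro: rewrites the expression at parse time.
macro with_cols(df, ex)
    return esc(replace_symbols(ex, df))
end

toy = DataFrame(x = [1, 2, 3], y = [10, 20, 30])

# Expands to sum(toy[!, :x] .* toy[!, :y]) before execution.
total = @with_cols toy sum(:x .* :y)
```

Calling `@macroexpand @with_cols toy sum(:x .* :y)` shows the rewritten expression without running it.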

## 🫳 Exercise

1. Are houses with a waterfront view (waterfront = 1) significantly more expensive than those without a waterfront view?
2. Is there a significant difference in prices between houses with different conditions (based on the condition column)?

Analyze the data to answer these questions and present your results.

In [None]:
# your code here

## Joining DataFrames

The `DataFrames.jl` package provides functions to join dataframes based on column values:

In [None]:

# Create the first DataFrame
df1 = DataFrame(ID = [1, 2, 3, 4], Value1 = ["A", "B", "C", "D"])



In [None]:
# Create the second DataFrame
df2 = DataFrame(ID = [3, 4, 5, 6], Value2 = ["X", "Y", "Z", "W"])



In [None]:
# Perform a left join on the 'ID' column
leftjoin(df1, df2, on = :ID)



In [None]:
# Perform a right join on the 'ID' column
rightjoin(df1, df2, on = :ID)



In [None]:
innerjoin(df1, df2, on = :ID)
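
The join types differ in which IDs they keep and where they fill in `missing`. The following sketch rebuilds the two example tables so that it runs on its own, and additionally shows `outerjoin` and `antijoin`, which `DataFrames.jl` also provides:

```julia
using DataFrames

df1 = DataFrame(ID = [1, 2, 3, 4], Value1 = ["A", "B", "C", "D"])
df2 = DataFrame(ID = [3, 4, 5, 6], Value2 = ["X", "Y", "Z", "W"])

left = leftjoin(df1, df2, on = :ID)     # all IDs of df1; Value2 is missing for 1 and 2
inner = innerjoin(df1, df2, on = :ID)   # only IDs present in both tables
outer = outerjoin(df1, df2, on = :ID)   # all IDs of either table
anti = antijoin(df1, df2, on = :ID)     # rows of df1 without a match in df2

n_missing = count(ismissing, left.Value2)
```

The row order of a join result is not necessarily the order of the input tables, so sort the result if a specific order matters.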

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_