# Day 6: Session A - Sorting, Grouping, Joining, and Applying

[Link to session webpage](https://eds-217-essential-python.github.io/course-materials/interactive-sessions/6a_grouping_joining_sorting.html)

Date: 09/10/2024

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# create some random data:
np.random.seed(42)  # seed the random number generator for reproducibility
dates = pd.date_range(start='2023-01-01', periods=100)

data = {
    'date': dates,
    'site': np.random.choice(['Forest', 'Grassland', 'Wetland'], 100),
    'species': np.random.choice(['Oak', 'Maple', 'Pine', 'Birch'], 100),
    'count': np.random.randint(1, 20, 100),
    'temperature': np.random.normal(15, 5, 100)
}

df = pd.DataFrame(data)
print(df.head())

## Sorting Data

### Basic Sorting

In [None]:
df_sorted = df.sort_values('count', ascending = False)
print(df_sorted.head())

### Multi-column Sorting

For more advanced sorting, we can sort by several columns and set a different sort order for each column.

In [None]:
# first sort by 'site' (ascending), then by 'count' (descending) within each site
df_multi_sorted = df.sort_values(['site', 'count'], ascending=[True, False])
print(df_multi_sorted)

# no need for .copy() when sorting: sort_values() returns a new dataframe and leaves df unchanged.
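
To check that sorting really leaves the original alone, here is a minimal sketch on a small made-up table (`demo` is a stand-in, not the session's `df`). It also shows that the sorted result keeps the old row labels unless we ask for new ones with `ignore_index=True`.

```python
import pandas as pd

# Small hypothetical table standing in for the larger df above.
demo = pd.DataFrame({
    'site': ['Wetland', 'Forest', 'Forest', 'Grassland'],
    'count': [4, 12, 7, 9],
})

# sort_values() returns a new dataframe; demo keeps its original row order.
demo_sorted = demo.sort_values(['site', 'count'], ascending=[True, False])
print(demo.index.tolist())         # original row labels, untouched
print(demo_sorted.index.tolist())  # same labels, now in sorted order

# ignore_index=True renumbers the sorted rows from 0.
demo_renumbered = demo.sort_values(
    ['site', 'count'], ascending=[True, False], ignore_index=True
)
print(demo_renumbered)
```

Keeping the old labels is handy for tracing rows back to the source table; renumbering is cleaner if the sorted table is the one we carry forward.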

## Grouping and Aggregating

### Basic Groupby

In [None]:
# group by site, then take the sum of the count column
sites = df.groupby('site')['count'].sum()
print(sites)

### Multiple Aggregations

Using `agg()`, we can provide a list of aggregation functions instead of just one.

In [None]:
# for one column, let's get the results of multiple aggregations:
# pass the function names as strings (pandas looks up the built-in aggregation for each name)
site_stats = df.groupby('site')['count'].agg(['sum', 'min', 'max'])
print(site_stats)
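
The strings are just names of built-in aggregations. As a small sketch, `agg()` also accepts our own functions alongside the string names. Both the made-up table `demo` and the helper `count_range` are hypothetical and not part of the session data.

```python
import pandas as pd

# Small hypothetical table standing in for the larger df above.
demo = pd.DataFrame({
    'site': ['Forest', 'Forest', 'Wetland', 'Wetland', 'Wetland'],
    'count': [3, 5, 2, 8, 1],
})

def count_range(values):
    """Hypothetical helper: spread between the largest and smallest count."""
    return values.max() - values.min()

# Mix a built-in aggregation (by name) with a custom function.
# The custom column is labeled with the function's name.
stats = demo.groupby('site')['count'].agg(['sum', count_range])
print(stats)
```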

In [None]:
# provide column-specific aggregations in a dictionary to agg():
site_stats = df.groupby('site').agg({
    'count': ['sum', 'min', 'max'],
    'species': 'nunique',
    'temperature': 'mean'
})
print(site_stats)
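
The dictionary form returns two-level column names such as `('count', 'sum')`, which can be awkward to work with later. One alternative, sketched here on a small made-up table (`demo` and the output column names are assumptions chosen for illustration), is pandas named aggregation, where each keyword becomes a flat output column.

```python
import pandas as pd

# Small hypothetical table standing in for the larger df above.
demo = pd.DataFrame({
    'site': ['Forest', 'Forest', 'Wetland', 'Wetland', 'Wetland'],
    'species': ['Oak', 'Pine', 'Oak', 'Oak', 'Birch'],
    'count': [3, 5, 2, 8, 1],
    'temperature': [14.0, 16.0, 10.0, 12.0, 14.0],
})

# Named aggregation: output_name=(input_column, aggregation)
summary = demo.groupby('site').agg(
    total_count=('count', 'sum'),
    n_species=('species', 'nunique'),
    mean_temp=('temperature', 'mean'),
)
print(summary)
```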

## Joining Data

In [None]:
# create a second dataframe with site characteristics
site_data = pd.DataFrame({
    'site': ['Forest', 'Grassland', 'Wetland'],
    'soil_pH': [6.5, 7.2, 6.8],
    'annual_rainfall': [1200, 800, 1500]
})

In [None]:
# perform a join using the pd.merge() function:
# arguments:
# 1. left dataframe (the main table)
# 2. right dataframe (the table to join onto it)
# 3. on = 'site' <-- column that you want to join on
# 4. how = 'inner' <-- how to do the join (inner is the default and the most common)
merged_df = pd.merge(df, site_data, on = 'site', how = 'inner')
print(merged_df.head())
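
In the session data every site appears in both tables, so the choice of `how` makes no visible difference. As a rough sketch of when it matters, the made-up tables below (`obs`, `sites`, and the 'Desert' site are all hypothetical) include a site that has observations but no site characteristics.

```python
import pandas as pd

# Hypothetical observations, including a site ('Desert') missing from the site table.
obs = pd.DataFrame({
    'site': ['Forest', 'Wetland', 'Desert'],
    'count': [5, 3, 1],
})
sites = pd.DataFrame({
    'site': ['Forest', 'Wetland', 'Grassland'],
    'soil_pH': [6.5, 6.8, 7.2],
})

# inner: keep only sites present in both tables
inner = pd.merge(obs, sites, on='site', how='inner')
print(inner)

# left: keep every row of obs; unmatched rows get NaN for the new columns.
# indicator=True adds a '_merge' column showing where each row came from.
left = pd.merge(obs, sites, on='site', how='left', indicator=True)
print(left)
```

An inner join silently drops the unmatched row, so comparing row counts before and after a merge is a cheap sanity check.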

In [None]:
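# for time series data, it's often nice to make the row index the timestamp!
# use the set_index() method to set the index to a specific column
# note: this returns a new dataframe; merged_df itself is unchanged unless we assign the result
merged_df.set_index('date')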

### Using `inplace` keyword arguments in pandas

If you are calling a method that normally returns a new dataframe (like `set_index()` or `sort_values()`) and you want it to modify the existing dataframe instead...

Then you can pass `inplace=True` to force this behavior. The method then changes the dataframe directly and returns `None`.

In [None]:
# Make a copy so we don't mess up our dataframe while playing around.
df_copy = merged_df.copy()
print("Before:\n", df_copy.head())

df_copy.set_index('date', inplace = True)
print("With inplace:\n", df_copy.head())
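
For comparison, here is a minimal sketch of the two styles side by side on a small made-up table (`demo` is hypothetical). Assigning the result back is a common alternative to `inplace`, and it avoids the trap of capturing the `None` that an in-place call returns.

```python
import pandas as pd

# Small hypothetical table standing in for merged_df above.
demo = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=3),
    'count': [4, 9, 2],
})

# Style 1: assign the returned dataframe; demo still has 'date' as a column.
indexed = demo.set_index('date')
print('date' in demo.columns)

# Style 2: inplace=True modifies the object itself and returns None.
result = demo.copy()
returned = result.set_index('date', inplace=True)
print(returned)
print(result.head())
```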