# Showcase Peacock Aggregation Functions

```Peacock``` includes convenience aggregation functions for formatting data prior to plotting. This notebook showcases these functions and provides examples of how to use them.

In [1]:
import os
import pandas as pd
import numpy as np

In [2]:
from TELF.post_processing.Peacock.Utility import aggregate as pag


## 0. Load Sample Data

In [3]:
df = pd.read_csv(os.path.join("..", "..", "data", "sample2.csv"))
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   eid               235 non-null    object 
 1   s2id              230 non-null    object 
 2   doi               235 non-null    object 
 3   title             235 non-null    object 
 4   abstract          232 non-null    object 
 5   year              235 non-null    int64  
 6   authors           235 non-null    object 
 7   author_ids        235 non-null    object 
 8   affiliations      235 non-null    object 
 9   funding           109 non-null    object 
 10  PACs              95 non-null     object 
 11  publication_name  235 non-null    object 
 12  subject_areas     235 non-null    object 
 13  s2_authors        230 non-null    object 
 14  s2_author_ids     230 non-null    object 
 15  citations         201 non-null    object 
 16  references        191 non-null    object 
 1

## 1. General Aggregation Functions

### ```nunique```

The `nunique` function calculates the number of unique values in a target column of a pandas DataFrame, optionally grouping by specified columns.

#### Parameters
- ```data```: The DataFrame to operate on.
- ```target_column```: The column for which the unique value count is calculated.
- ```group_columns```: (Optional) A list of column names in `data` to group by. If not provided, the function returns the total number of unique values in `target_column`.

#### Functionality
- **Grouping**: If `group_columns` is provided, the function groups the DataFrame `data` based on these columns.
- **Unique Counting**: It then counts the unique occurrences of values in `target_column`. If `group_columns` is provided, this count is done within each group.
- **Result**: The output is a DataFrame where each row represents a unique group from `group_columns`, with a column showing the count of unique values in `target_column`. If no `group_columns` are provided, the result is an integer representing the total count of unique values in `target_column`.
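
The two return shapes described above can be reproduced with plain pandas. The sketch below is not Peacock's implementation; `nunique_sketch` and the toy data are hypothetical and only mirror the behavior documented in this section.

```python
import pandas as pd

def nunique_sketch(data, target_column, group_columns=None):
    """Hypothetical stand-in that mirrors the documented behavior of `nunique`."""
    if not group_columns:
        # No grouping: a single integer for the whole column.
        return data[target_column].nunique()
    # Grouping: one row per group, with the unique count in `target_column`.
    return data.groupby(group_columns)[target_column].nunique().reset_index()

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1],
    "year": [2019, 2019, 2020, 2021, 2021],
})

total = nunique_sketch(toy, "year")                     # integer
per_cluster = nunique_sketch(toy, "year", ["cluster"])  # DataFrame
print(total)
print(per_cluster)
```

The ungrouped call returns a scalar, while the grouped call returns a DataFrame, so downstream code needs to handle both shapes.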

#### Use Case
- This function is versatile for data analysis tasks that require understanding the uniqueness of data points, either overall or within specific groups. 
- For example, in academic research, it could be used to count the number of unique publications per author in each year or just the total unique publications in the dataset.
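
As a rough illustration of the per-author use case, the sketch below uses plain pandas on made-up data. It assumes author names are stored as a `;`-separated string, which may not match the format of the sample dataset; the `author` column is introduced here only for the example.

```python
import pandas as pd

# Toy data (made up); the ";" delimiter is an assumption.
toy = pd.DataFrame({
    "eid": ["p1", "p2", "p3", "p3"],
    "year": [2020, 2020, 2021, 2021],
    "authors": ["Ada;Bo", "Ada", "Bo", "Bo"],
})

# One row per (publication, author) pair.
long = toy.assign(author=toy["authors"].str.split(";")).explode("author")

# Unique publications per author and year; the duplicated "p3" row counts once.
pubs = long.groupby(["author", "year"])["eid"].nunique().reset_index()
print(pubs)
```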

In [4]:
pag.nunique(df, target_column='year')

36

**Let's first assign random cluster numbers for the demo. Normally, we would have used NMFk to get the cluster numbers.**

In [5]:
df["cluster"] = np.random.randint(0,10, len(df))

In [6]:
pag.nunique(df, target_column='year', group_columns=['cluster']).head(10)

| | cluster | year |
|---|---|---|
| 0 | 0 | 17 |
| 1 | 1 | 18 |
| 2 | 2 | 19 |
| 3 | 3 | 19 |
| 4 | 4 | 12 |
| 5 | 5 | 20 |
| 6 | 6 | 14 |
| 7 | 7 | 15 |
| 8 | 8 | 19 |
| 9 | 9 | 16 |


### ```sum```

The `sum` function aggregates numeric data in a DataFrame by computing the sum within specified groups.

#### Parameters
- ```data```: The DataFrame to operate on.
- ```group_columns```: A list of column names in data to group by.
- ```top_n```: (Optional) Limits the output to the top N groups based on the `sort_by` column.
- ```sort_by```: (Optional) The column to sort by when selecting the top N groups.
- ```round_floats```: (Optional) The number of decimal places to round numeric results to.
- ```preserve```: (Optional) A list of columns to exclude from aggregation, preserving their first value in each group.

#### Functionality
- **Grouping**: The DataFrame `data` is grouped based on `group_columns`.
- **Sum Calculation**: Calculates the sum for all numeric columns within each group.
- **Sorting and Top N Filtering**: If `top_n` and `sort_by` are provided, only the top N groups, ranked by the `sort_by` column, are included in the result.
- **Preserving Columns**: Specific non-numeric columns can be preserved in the output to maintain context.
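
To make the interaction of these options concrete, here is a minimal pandas sketch. It is not Peacock's code: `grouped_sum_sketch` is a hypothetical helper, the default of three decimal places is an assumption, and so is the choice to rank "top N" in descending order.

```python
import pandas as pd

def grouped_sum_sketch(data, group_columns, top_n=None, sort_by=None,
                       round_floats=3, preserve=None):
    """Hypothetical stand-in that mirrors the options documented for `sum`."""
    preserve = list(preserve or [])
    grouped = data.groupby(group_columns)
    # Sum every numeric column that is neither a group key nor preserved.
    numeric_cols = [c for c in data.select_dtypes(include="number").columns
                    if c not in group_columns and c not in preserve]
    result = grouped[numeric_cols].sum()
    if preserve:
        # Preserved columns keep their first value within each group.
        result = result.join(grouped[preserve].first())
    result = result.round(round_floats).reset_index()
    if top_n is not None and sort_by is not None:
        # Assumption: "top N" means the N largest values of `sort_by`.
        result = result.sort_values(sort_by, ascending=False).head(top_n)
    return result

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2],
    "num_citations": [5, 7, 1, 2, 20],
    "label": ["a", "a", "b", "b", "c"],
})
top2 = grouped_sum_sketch(toy, ["cluster"], top_n=2,
                          sort_by="num_citations", preserve=["label"])
print(top2)
```

Without `preserve`, a non-numeric column such as `label` would simply be dropped from the result in this sketch.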

#### Use Case
- This function is ideal for quantitative analysis where the total sum of certain metrics within specified groups is required.
- It can be utilized to evaluate total performance metrics, summarize total counts or amounts over time, or aggregate operational data.

In [7]:
pag.sum(df, group_columns=['cluster']).head(10)

| | cluster | num_citations | year | num_references |
|---|---|---|---|---|
| 0 | 0 | 205.0 | 40248.0 | 363.0 |
| 1 | 1 | 808.0 | 58329.0 | 809.0 |
| 2 | 2 | 885.0 | 52296.0 | 743.0 |
| 3 | 3 | 696.0 | 54336.0 | 1025.0 |
| 4 | 4 | 310.0 | 32217.0 | 392.0 |
| 5 | 5 | 768.0 | 62374.0 | 677.0 |
| 6 | 6 | 425.0 | 38221.0 | 598.0 |
| 7 | 7 | 509.0 | 50279.0 | 689.0 |
| 8 | 8 | 463.0 | 44252.0 | 722.0 |
| 9 | 9 | 1086.0 | 40196.0 | 494.0 |


In [8]:
pag.sum(df, group_columns=['cluster', 'year'])

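| | cluster | year | num_citations | num_references |
|---|---|---|---|---|
| 0 | 0 | 1994 | 83.0 | 0.0 |
| 1 | 0 | 1995 | 17.0 | 1.0 |
| 2 | 0 | 1996 | 16.0 | 3.0 |
| 3 | 0 | 1998 | 0.0 | 4.0 |
| 4 | 0 | 2003 | 12.0 | 8.0 |
| ... | ... | ... | ... | ... |
| 164 | 9 | 2019 | 0.0 | 0.0 |
| 165 | 9 | 2020 | 9.0 | 40.0 |
| 166 | 9 | 2022 | 2.0 | 38.0 |
| 167 | 9 | 2023 | 0.0 | 29.0 |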


### ```mean```

The `mean` function groups a pandas DataFrame by the specified columns and calculates the mean of every numeric column for each group, with options for sorting and filtering the results.

#### Parameters
- ```data```: The DataFrame to operate on.
- ```group_columns```: A list of column names in data to group by.
- ```top_n```: (Optional) Limits the output to the top N groups based on the `sort_by` column.
- ```sort_by```: (Optional) The column to sort by when selecting the top N groups.
- ```round_floats```: (Optional) The number of decimal places to round numeric results to.
- ```preserve```: (Optional) A list of columns to exclude from aggregation, preserving their first value in each group.

#### Functionality
- **Grouping**: The DataFrame `data` is grouped based on `group_columns`.
- **Mean Calculation**: Calculates the mean for all numeric columns within each group.
- **Sorting and Top N Filtering**: If `top_n` and `sort_by` are provided, only the top N groups, ranked by the `sort_by` column, are included in the result.
- **Preserving Columns**: Specific non-numeric columns can be preserved in the output to maintain context.
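
The core of the mean calculation can be approximated with a single pandas expression. This is a sketch on made-up data, not Peacock's implementation. Two points are assumptions: pandas skips missing values when averaging, and Peacock may or may not do the same; rounding to three decimal places is chosen here only for readability.

```python
import numpy as np
import pandas as pd

# Toy data (made up); one citation count is missing on purpose.
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1],
    "year": [2010, 2011, 2013, 2020, 2021],
    "num_citations": [1.0, 2.0, np.nan, 4.0, 5.0],
})

# Mean of every numeric column per cluster, rounded to 3 decimal places.
# pandas ignores NaN by default, so cluster 0 averages only two citation counts.
means = toy.groupby("cluster").mean(numeric_only=True).round(3).reset_index()
print(means)
```

Note that every numeric column is averaged, including `year`, unless it is used as a group key.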

#### Use Case
- This function is particularly useful for statistical analysis where the average of certain metrics needs to be understood within specified groups. 
- It can be used to evaluate average performance metrics, summarize typical counts or amounts over time, or compare operational data across groups.

In [9]:
pag.mean(df, group_columns=['cluster']).head(10)

| | cluster | num_citations | year | num_references |
|---|---|---|---|---|
| 0 | 0 | 10.25 | 2012.4 | 20.167 |
| 1 | 1 | 27.862 | 2011.345 | 29.963 |
| 2 | 2 | 34.038 | 2011.385 | 28.577 |
| 3 | 3 | 25.778 | 2012.444 | 37.963 |
| 4 | 4 | 19.375 | 2013.562 | 26.133 |
| 5 | 5 | 24.774 | 2012.065 | 21.839 |
| 6 | 6 | 22.368 | 2011.632 | 31.474 |
| 7 | 7 | 20.36 | 2011.16 | 27.56 |
| 8 | 8 | 21.045 | 2011.455 | 32.818 |
| 9 | 9 | 54.3 | 2009.8 | 24.7 |


In [10]:
pag.mean(df, group_columns=['cluster', 'year'])

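| | cluster | year | num_citations | num_references |
|---|---|---|---|---|
| 0 | 0 | 1994 | 83.0 | 0.0 |
| 1 | 0 | 1995 | 17.0 | 1.0 |
| 2 | 0 | 1996 | 16.0 | 3.0 |
| 3 | 0 | 1998 | 0.0 | 4.0 |
| 4 | 0 | 2003 | 12.0 | 8.0 |
| ... | ... | ... | ... | ... |
| 164 | 9 | 2019 | 0.0 | 0.0 |
| 165 | 9 | 2020 | 9.0 | 40.0 |
| 166 | 9 | 2022 | 2.0 | 38.0 |
| 167 | 9 | 2023 | 0.0 | 29.0 |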
