# Aggregating data
## Summary Statistics
- Summary statistics condense many numbers into a single value that describes the data.
- Examples of summary statistics include:
  - Mean, Median, Minimum, Maximum, Standard deviation
- Calculating summary statistics helps you:
  - Understand your data better.
  - Handle large datasets more easily by extracting key insights.
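As a minimal sketch of the statistics listed above, the snippet below computes each one on a small pandas Series. The GPA values are hypothetical and exist only for illustration; they are not loaded from the students dataset.

```python
import pandas as pd

# Hypothetical GPA values, invented for illustration only
gpa = pd.Series([3.5, 3.8, 3.2, 3.7, 3.1])

# Each method reduces the whole Series to a single number
summary = {
    "mean": gpa.mean(),
    "median": gpa.median(),
    "min": gpa.min(),
    "max": gpa.max(),
    "std": gpa.std(),  # pandas uses the sample standard deviation (ddof=1) by default
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The same methods are available on a single DataFrame column, such as `students["gpa"]`.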

In [1]:
import pandas as pd

# Load the students dataset
students = pd.read_csv(r'C:\Users\Dell\OneDrive\Desktop\KaranCodes\Datacampcourses\Associate-Data-Scientist-Python-Track\resources\DatamanipulationwithPandas-datasets\students.csv')

# Print the first few rows of the students DataFrame
print(students.head())

# Show information about the students DataFrame
# (info() prints its report directly and returns None, so no print() is needed)
students.info()

# Print the mean GPA of students
print(students["gpa"].mean())

# Print the median GPA of students
print(students["gpa"].median())

   student_id first_name last_name  age  gender             major  gpa  \
0           1       John       Doe   20    Male  Computer Science  3.5   
1           2       Emma     Smith   22  Female          Business  3.8   
2           3       Noah   Johnson   19    Male       Engineering  3.2   
3           4        Ava  Williams   21  Female       Mathematics  3.7   
4           5       Liam     Brown   20    Male           Physics  3.1   

   credits_completed  
0                 60  
1                 90  
2                 45  
3                 80  
4                 55  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   student_id         50 non-null     int64  
 1   first_name         50 non-null     object 
 2   last_name          50 non-null     object 
 3   age                50 non-null     int64  
 4   gender             50 

In [2]:
# Print the maximum age of students
print(students["age"].max())

# Print the minimum age of students
print(students["age"].min())

23
18


## Aggregating with custom summary statistic functions
- Pandas and NumPy offer many built-in functions for data summarization.
- Sometimes, you may need a custom function to summarize your data.
- The .agg() method allows you to:
  - Apply custom functions to a DataFrame.
  - Apply functions to multiple columns at once, making aggregation more efficient.
- Syntax example:
  - df['column'].agg(function)
- Custom function example:
  - IQR (Inter-Quartile Range) = 75th percentile − 25th percentile.
- The IQR is a more robust measure of spread than the standard deviation, especially when the dataset has outliers.
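To illustrate the robustness claim, here is a small sketch comparing the IQR and the standard deviation on the same data with and without a single outlier. The age values are hypothetical and chosen only to make the effect visible.

```python
import pandas as pd

def iqr(column):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return column.quantile(0.75) - column.quantile(0.25)

# Hypothetical ages, invented for illustration; the second Series adds one outlier
ages = pd.Series([18, 19, 20, 21, 22])
ages_with_outlier = pd.Series([18, 19, 20, 21, 22, 60])

for label, s in [("no outlier", ages), ("with outlier", ages_with_outlier)]:
    # .agg() passes the whole Series to the custom function
    print(label, "| IQR:", s.agg(iqr), "| std:", round(s.std(), 2))
```

In this sketch the single outlier changes the IQR only slightly, while the standard deviation grows many times over.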

In [3]:
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Apply IQR function to 'age', 'gpa', and 'credits_completed' columns
print(students[["age", "gpa", "credits_completed"]].agg(iqr))

age                   2.0
gpa                   0.5
credits_completed    30.0
dtype: float64


In [7]:
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Apply IQR and median to 'age', 'gpa', and 'credits_completed'
print(students[["age", "gpa", "credits_completed"]].agg([iqr, "median"]))

         age   gpa  credits_completed
iqr      2.0  0.50               30.0
median  21.0  3.45               70.0
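`.agg()` also accepts a dictionary that maps column names to functions, which can help when each column needs a different statistic. The sketch below uses a small hypothetical DataFrame rather than the students dataset.

```python
import pandas as pd

def iqr(column):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return column.quantile(0.75) - column.quantile(0.25)

# Hypothetical data, invented for illustration only
df = pd.DataFrame({
    "age": [18, 19, 20, 21, 22],
    "gpa": [2.9, 3.2, 3.5, 3.6, 3.9],
})

# Keys are column names; values are the function to apply to that column
per_column = df.agg({"age": "max", "gpa": iqr})
print(per_column)
```

The result is a Series indexed by column name, with one aggregated value per column.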


## Cumulative statistics
- Cumulative statistics track how a summary statistic, such as a sum or a maximum, builds up row by row, which is useful for following values over time or across sorted data.
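As a minimal sketch of how cumulative methods behave, the snippet below applies `cumsum()`, `cummax()`, and `cummin()` to a short Series. The credit values are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical credit counts, invented for illustration only
credits = pd.Series([30, 40, 30, 50])

running_total = credits.cumsum()  # sum of all values up to and including each row
running_max = credits.cummax()    # largest value seen so far at each row
running_min = credits.cummin()    # smallest value seen so far at each row

print(running_total.tolist())  # [30, 70, 100, 150]
print(running_max.tolist())    # [30, 40, 40, 50]
print(running_min.tolist())    # [30, 30, 30, 30]
```

Unlike `sum()` or `max()`, each cumulative method returns a Series of the same length as its input, so the result can be stored as a new column.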

In [11]:
# Sort students by age
students = students.sort_values("age")

# Get the cumulative sum of credits_completed, add as cum_credits_completed column
students["cum_credits_completed"] = students["credits_completed"].cumsum()

# Get the cumulative max of GPA, add as cum_max_gpa column
students["cum_max_gpa"] = students["gpa"].cummax()

# See the result
print(students[["first_name", "credits_completed", "cum_credits_completed", "gpa", "cum_max_gpa"]])

   first_name  credits_completed  cum_credits_completed  gpa  cum_max_gpa
6       Mason                 30                     30  2.9          2.9
15     Amelia                 40                     70  3.2          3.2
27      Sofia                 30                    100  3.5          3.5
11        Mia                 50                    150  3.5          3.5
18     Elijah                 35                    185  2.7          3.5
23      Emily                 50                    235  3.6          3.6
2        Noah                 45                    280  3.2          3.6
33   Scarlett                 45                    325  3.9          3.9
48      Isaac                 50                    375  3.3          3.9
42       Jack                 45                    420  3.1          3.9
38       Owen                 40                    460  2.8          3.9
29      Avery                 70                    530  3.8          3.9
16  Alexander                 60      