# Aggregating data
## Summary Statistics
- Summary statistics condense many numbers into a single value that describes them.
- Examples of summary statistics include:
  - Mean, Median, Minimum, Maximum, Standard deviation
- Calculating summary statistics helps you:
  - Understand your data better.
  - Handle large datasets more easily by extracting key insights.

In [1]:
import pandas as pd

# Load the students dataset
students = pd.read_csv(r'C:\Users\Dell\OneDrive\Desktop\KaranCodes\Datacampcourses\Associate-Data-Scientist-Python-Track\resources\DatamanipulationwithPandas-datasets\students.csv')

# Print the first few rows of the students DataFrame
print(students.head())

# Print information about the students DataFrame
# (info() prints its report itself and returns None, so it is not wrapped in print())
students.info()

# Print the mean GPA of students
print(students["gpa"].mean())

# Print the median GPA of students
print(students["gpa"].median())

   student_id first_name last_name  age  gender             major  gpa  \
0           1       John       Doe   20    Male  Computer Science  3.5   
1           2       Emma     Smith   22  Female          Business  3.8   
2           3       Noah   Johnson   19    Male       Engineering  3.2   
3           4        Ava  Williams   21  Female       Mathematics  3.7   
4           5       Liam     Brown   20    Male           Physics  3.1   

   credits_completed  
0                 60  
1                 90  
2                 45  
3                 80  
4                 55  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   student_id         50 non-null     int64  
 1   first_name         50 non-null     object 
 2   last_name          50 non-null     object 
 3   age                50 non-null     int64  
 4   gender             50 

In [2]:
# Print the maximum age of students
print(students["age"].max())

# Print the minimum age of students
print(students["age"].min())

23
18
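
The statistics listed at the start of this section can all be computed the same way. The sketch below uses a small made-up Series (the values are hypothetical and not taken from students.csv) so that it runs without the dataset; it also shows `.describe()`, which returns several of these statistics at once.

```python
import pandas as pd

# Hypothetical GPA values, used only for illustration
gpas = pd.Series([3.0, 3.5, 4.0, 2.5, 3.5], name="gpa")

print(gpas.mean())             # arithmetic average
print(gpas.median())           # middle value once sorted
print(gpas.min(), gpas.max())  # smallest and largest value
print(gpas.std())              # sample standard deviation (ddof=1 by default)

# describe() bundles count, mean, std, min, quartiles, and max
summary = gpas.describe()
print(summary)
```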


## Aggregating with custom summary functions
- Pandas and NumPy offer many built-in functions for data summarization.
- Sometimes, you may need a custom function to summarize your data.
- The .agg() method allows you to:
  - Apply custom functions to a DataFrame.
  - Apply functions to multiple columns at once, making aggregation more efficient.
- Syntax example:
  - df['column'].agg(function)
- Custom function example:
  - IQR (Inter-Quartile Range) = 75th percentile − 25th percentile.
- IQR is a robust alternative to standard deviation, especially when the dataset has outliers.

In [3]:
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Apply IQR function to 'age', 'gpa', and 'credits_completed' columns
print(students[["age", "gpa", "credits_completed"]].agg(iqr))

age                   2.0
gpa                   0.5
credits_completed    30.0
dtype: float64
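
To see why the IQR is described as robust to outliers, it can help to compare it with the standard deviation on two nearly identical samples. The ages below are made up for illustration; only the last value differs between the two Series.

```python
import pandas as pd

def iqr(column):
    # 75th percentile minus 25th percentile
    return column.quantile(0.75) - column.quantile(0.25)

# Hypothetical ages: the second Series replaces one value with an outlier
ages = pd.Series([19, 20, 20, 21, 22])
ages_with_outlier = pd.Series([19, 20, 20, 21, 60])

# The IQR is unchanged by the outlier, while the standard deviation grows sharply
print(iqr(ages), ages.std())
print(iqr(ages_with_outlier), ages_with_outlier.std())
```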


In [7]:
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Apply IQR and median to 'age', 'gpa', and 'credits_completed'
print(students[["age", "gpa", "credits_completed"]].agg([iqr, "median"]))

         age   gpa  credits_completed
iqr      2.0  0.50               30.0
median  21.0  3.45               70.0


## Cumulative statistics
- Cumulative statistics show how a summary statistic, such as a sum or a maximum, builds up row by row, which is useful for tracking values over time or along any sorted order.
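
As a minimal sketch of what cumulative methods return, the example below applies `.cumsum()`, `.cummax()`, and `.cummin()` to short made-up Series (hypothetical values, not the course data). Each result has the same length as its input, with row i summarizing rows 0 through i.

```python
import pandas as pd

# Hypothetical values, used only for illustration
credits = pd.Series([30, 40, 30, 50])
gpa = pd.Series([2.9, 3.2, 3.5, 2.7])

running = pd.DataFrame({
    "credits": credits,
    "cum_credits": credits.cumsum(),  # running total
    "gpa": gpa,
    "cum_max_gpa": gpa.cummax(),      # highest GPA seen so far
    "cum_min_gpa": gpa.cummin(),      # lowest GPA seen so far
})
print(running)
```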

In [11]:
# Sort students by age
students = students.sort_values("age")

# Get the cumulative sum of credits_completed, add as cum_credits_completed column
students["cum_credits_completed"] = students["credits_completed"].cumsum()

# Get the cumulative max of GPA, add as cum_max_gpa column
students["cum_max_gpa"] = students["gpa"].cummax()

# See the result
print(students[["first_name", "credits_completed", "cum_credits_completed", "gpa", "cum_max_gpa"]])

   first_name  credits_completed  cum_credits_completed  gpa  cum_max_gpa
6       Mason                 30                     30  2.9          2.9
15     Amelia                 40                     70  3.2          3.2
27      Sofia                 30                    100  3.5          3.5
11        Mia                 50                    150  3.5          3.5
18     Elijah                 35                    185  2.7          3.5
23      Emily                 50                    235  3.6          3.6
2        Noah                 45                    280  3.2          3.6
33   Scarlett                 45                    325  3.9          3.9
48      Isaac                 50                    375  3.3          3.9
42       Jack                 45                    420  3.1          3.9
38       Owen                 40                    460  2.8          3.9
29      Avery                 70                    530  3.8          3.9
16  Alexander                 60      

## Counting
- Counting is the usual way to summarize categorical data.
- Counts are only accurate if each item is counted once, so duplicates need to be handled first.
### Dropping duplicate names and pairs
- Removing duplicates is essential for getting accurate counts in our data.
- Often, we want to avoid counting the same item multiple times.
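
The effect of the `subset` argument is easiest to see on a few rows. The records below are hypothetical: dropping duplicates on `first_name` alone keeps one row per name, while dropping on the `first_name`/`major` pair keeps a name that appears with two different majors.

```python
import pandas as pd

# Hypothetical enrollment records: the same student can appear more than once
records = pd.DataFrame({
    "first_name": ["Ana", "Ana", "Ben", "Ben", "Cara"],
    "major": ["Physics", "Physics", "Biology", "Chemistry", "Physics"],
})

# Keep the first row for each name
unique_names = records.drop_duplicates(subset="first_name")

# Keep the first row for each name/major pair
unique_pairs = records.drop_duplicates(subset=["first_name", "major"])

print(len(records), len(unique_names), len(unique_pairs))
```

By default the first occurrence of each duplicate is the one that is kept.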

In [1]:
import pandas as pd

# Load the students dataset
students = pd.read_csv(r'C:\Users\Dell\OneDrive\Desktop\KaranCodes\Datacampcourses\Associate-Data-Scientist-Python-Track\resources\DatamanipulationwithPandas-datasets\students.csv')

# Drop duplicate first_name/major combinations
student_majors = students.drop_duplicates(subset=["first_name", "major"])
print(student_majors.head())

# Drop duplicate first_name/age combinations
student_ages = students.drop_duplicates(subset=["first_name", "age"])
print(student_ages.head())

# Subset the students who have gpa > 3.5 (honor students) and drop duplicate last names
honor_students = students[students["gpa"] > 3.5].drop_duplicates(subset=["last_name"])

# Print last_name column of honor_students
print(honor_students["last_name"])

   student_id first_name last_name  age  gender             major  gpa  \
0           1       John       Doe   20    Male  Computer Science  3.5   
1           2       Emma     Smith   22  Female          Business  3.8   
2           3       Noah   Johnson   19    Male       Engineering  3.2   
3           4        Ava  Williams   21  Female       Mathematics  3.7   
4           5       Liam     Brown   20    Male           Physics  3.1   

   credits_completed  
0                 60  
1                 90  
2                 45  
3                 80  
4                 55  
   student_id first_name last_name  age  gender             major  gpa  \
0           1       John       Doe   20    Male  Computer Science  3.5   
1           2       Emma     Smith   22  Female          Business  3.8   
2           3       Noah   Johnson   19    Male       Engineering  3.2   
3           4        Ava  Williams   21  Female       Mathematics  3.7   
4           5       Liam     Brown   20    Male

### Counting categorical variables
- Counting is a powerful method to get an overview of your data.
- It helps you spot patterns, curiosities, or anomalies that might not be obvious at first glance.

In [4]:
# Drop duplicate first_name/major combinations
student_majors = students.drop_duplicates(subset=["first_name", "major"])

# Drop duplicate first_name/age combinations
student_ages = students.drop_duplicates(subset=["first_name", "age"])

# Count the number of students in each major
major_counts = student_majors["major"].value_counts()
print(major_counts.head())

# Get the proportion of students in each major
major_props = student_majors["major"].value_counts(normalize=True)
print(major_props.head())

# Count the number of students for each age and sort
age_counts_sorted = student_ages["age"].value_counts(sort=True)
print(age_counts_sorted)

# Get the proportion of students at each age and sort
age_props_sorted = student_ages["age"].value_counts(sort=True, normalize=True)
print(age_props_sorted)

major
Computer Science    4
Business            4
Engineering         4
Mathematics         4
Physics             4
Name: count, dtype: int64
major
Computer Science    0.08
Business            0.08
Engineering         0.08
Mathematics         0.08
Physics             0.08
Name: proportion, dtype: float64
age
20    13
22    11
21    10
19     8
23     5
18     3
Name: count, dtype: int64
age
20    0.26
22    0.22
21    0.20
19    0.16
23    0.10
18    0.06
Name: proportion, dtype: float64
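
A compact sketch of the `.value_counts()` options used above, on a made-up Series of majors (hypothetical data). Sorting by count in descending order is the default, so `sort=True` does not need to be passed; `.sort_index()` can be chained on when ordering by label is more useful.

```python
import pandas as pd

# Hypothetical majors, used only for illustration
majors = pd.Series(
    ["Physics", "Biology", "Physics", "Chemistry", "Physics", "Biology"]
)

counts = majors.value_counts()                # most frequent first
props = majors.value_counts(normalize=True)   # shares that sum to 1
by_label = majors.value_counts().sort_index() # ordered by label instead

print(counts)
print(props)
print(by_label)
```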


## Grouped Summary Statistics

In [2]:
import pandas as pd

# Load the students dataset
students = pd.read_csv(r'C:\Users\Dell\OneDrive\Desktop\KaranCodes\Datacampcourses\Associate-Data-Scientist-Python-Track\resources\DatamanipulationwithPandas-datasets\students.csv')

# Calculate total credits completed by all students
total_credits = students["credits_completed"].sum()

# Subset for Computer Science majors, calc total credits
credits_cs = students[students["major"] == "Computer Science"]["credits_completed"].sum()

# Subset for Business majors, calc total credits
credits_business = students[students["major"] == "Business"]["credits_completed"].sum()

# Subset for Engineering majors, calc total credits
credits_engineering = students[students["major"] == "Engineering"]["credits_completed"].sum()

# Get proportion of credits completed for each major
# (dividing the list by total_credits, a NumPy integer, yields a NumPy array)
credits_propn_by_major = [credits_cs, credits_business, credits_engineering] / total_credits

# Print the result
print(credits_propn_by_major)

[0.08225108 0.07503608 0.07936508]


### Calculations using groupby method
- Subsetting and summing one group at a time, as above, works but repeats the same code for every group.
- The .groupby() method performs the same calculation for all groups in one call.
- It allows you to group data by one or more variables and perform calculations within each group.

In [4]:
# Group by major and calculate total credits completed
credits_by_major = students.groupby("major")["credits_completed"].sum()

# Group by major and is_honors (GPA > 3.5), and calculate total credits completed
students["is_honors"] = students["gpa"] > 3.5
credits_by_major_honors = students.groupby(["major", "is_honors"])["credits_completed"].sum()

# Print the result
print(credits_by_major_honors.head())

major             is_honors
Biology           False        225
Business          False         40
                  True         220
Chemistry         False        165
Computer Science  False        190
Name: credits_completed, dtype: int64
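
The proportion-of-credits calculation from the manual approach can also be written with `.groupby()`. The sketch below uses a small hypothetical DataFrame rather than students.csv, and checks that the grouped result matches the subset-and-sum result for one major.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "major": ["Physics", "Biology", "Physics", "Biology", "Chemistry"],
    "credits_completed": [60, 40, 30, 50, 20],
})

total = df["credits_completed"].sum()

# Manual approach: one subset per major
manual_physics = df[df["major"] == "Physics"]["credits_completed"].sum() / total

# Grouped approach: every major at once
propn_by_major = df.groupby("major")["credits_completed"].sum() / total
print(propn_by_major)
```

Because the result is a Series indexed by major, each proportion keeps its label, which the plain array from the manual approach does not.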


### Multiple grouped summaries
- The .agg() method can be used to compute multiple statistics on grouped data.
- It allows you to apply multiple functions at once after grouping.
- Statistics can be passed by name as strings, which is what the cell below does, or as functions such as those from NumPy (imported as np):
  - "min" or np.min: minimum value
  - "max" or np.max: maximum value
  - "mean" or np.mean: mean (average)
  - "median" or np.median: median
- Combining .groupby() and .agg() makes your summaries more detailed and efficient.
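
Custom functions can be mixed with built-in statistic names in a grouped `.agg()` call. The sketch below reuses the `iqr` function from earlier in these notes on a small hypothetical DataFrame; the function's name becomes the column label in the result.

```python
import pandas as pd

def iqr(column):
    # 75th percentile minus 25th percentile
    return column.quantile(0.75) - column.quantile(0.25)

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "major": ["Physics", "Physics", "Physics", "Biology", "Biology", "Biology"],
    "gpa": [2.0, 3.0, 4.0, 3.0, 3.5, 4.0],
})

# One row per major, one column per statistic
stats = df.groupby("major")["gpa"].agg(["min", "max", "mean", iqr])
print(stats)
```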

In [7]:
# For each major, aggregate gpa: get min, max, mean, and median
gpa_stats = students.groupby("major")["gpa"].agg(["min", "max", "mean", "median"])

# Print gpa_stats
print(gpa_stats)

# For each major, aggregate age and credits_completed: get min, max, mean, and median
age_credits_stats = students.groupby("major")[["age", "credits_completed"]].agg(["min", "max", "mean", "median"])

# Print age_credits_stats
print(age_credits_stats)

                        min  max      mean  median
major                                             
Biology                 2.7  3.4  3.075000    3.10
Business                3.2  3.8  3.625000    3.75
Chemistry               2.8  3.0  2.900000    2.90
Computer Science        3.1  3.6  3.325000    3.30
Economics               3.2  3.5  3.333333    3.30
Education               3.5  3.7  3.600000    3.60
Engineering             2.9  3.4  3.150000    3.15
History                 3.3  3.6  3.450000    3.45
Literature              3.4  3.7  3.533333    3.50
Mathematics             3.5  3.8  3.700000    3.75
Mechanical Engineering  2.8  3.5  3.166667    3.20
Physics                 2.7  3.1  2.950000    3.00
Political Science       3.6  3.7  3.633333    3.60
Psychology              3.6  3.9  3.825000    3.90
                       age                       credits_completed       \
                       min max       mean median               min  max   
major                             

## Pivot Tables
### Pivoting one variable
- Pivot tables are a standard method for aggregating data in spreadsheets.
- In pandas, pivot tables provide another way to perform grouped calculations.
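- The .pivot_table() method is an alternative to .groupby(); by default it computes the mean of each group.
- It lays out the groups of one variable as rows and, optionally, the groups of a second variable as columns, which is often easier to read than a grouped result.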

In [8]:
# Pivot for mean and median GPA for each major
mean_med_gpa_by_major = students.pivot_table(values="gpa", index="major", aggfunc=["mean", "median"])

# Print mean_med_gpa_by_major
print(mean_med_gpa_by_major)

                            mean median
                             gpa    gpa
major                                  
Biology                 3.075000   3.10
Business                3.625000   3.75
Chemistry               2.900000   2.90
Computer Science        3.325000   3.30
Economics               3.333333   3.30
Education               3.600000   3.60
Engineering             3.150000   3.15
History                 3.450000   3.45
Literature              3.533333   3.50
Mathematics             3.700000   3.75
Mechanical Engineering  3.166667   3.20
Physics                 2.950000   3.00
Political Science       3.633333   3.60
Psychology              3.825000   3.90
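
To make the relationship between the two methods concrete, the sketch below computes the same per-group mean with `.pivot_table()` and with `.groupby()` on a small hypothetical DataFrame. The numbers agree; the pivot table returns a DataFrame, while the grouped version returns a Series.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "major": ["Physics", "Physics", "Biology", "Biology"],
    "gpa": [3.0, 3.4, 3.6, 3.8],
})

by_pivot = df.pivot_table(values="gpa", index="major")  # aggfunc defaults to mean
by_group = df.groupby("major")["gpa"].mean()

print(by_pivot)
print(by_group)
```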


In [9]:
# Create a new column: is_honors (True if GPA > 3.5)
students["is_honors"] = students["gpa"] > 3.5

# Pivot for mean GPA by major and honors status (aggfunc defaults to the mean)
mean_gpa_by_major_honors = students.pivot_table(values="gpa", index="major", columns="is_honors")

# Print the pivot table
print(mean_gpa_by_major_honors)

is_honors                  False     True 
major                                     
Biology                 3.075000       NaN
Business                3.200000  3.766667
Chemistry               2.900000       NaN
Computer Science        3.233333  3.600000
Economics               3.333333       NaN
Education               3.500000  3.650000
Engineering             3.150000       NaN
History                 3.400000  3.600000
Literature              3.450000  3.700000
Mathematics             3.500000  3.766667
Mechanical Engineering  3.166667       NaN
Physics                 2.950000       NaN
Political Science            NaN  3.633333
Psychology                   NaN  3.825000


In [10]:
# Create a new column: is_honors (True if GPA > 3.5)
students["is_honors"] = students["gpa"] > 3.5

# Pivot for mean credits_completed by major and honors status; fill missing with 0s;
# margins=True adds an "All" row and column holding the overall means (not sums)
pivot_credits_by_major_honors = students.pivot_table(values="credits_completed", index="major", columns="is_honors", fill_value=0, margins=True)

# Print the result
print(pivot_credits_by_major_honors)

is_honors                   False        True        All
major                                                   
Biology                 56.250000    0.000000  56.250000
Business                40.000000   73.333333  65.000000
Chemistry               55.000000    0.000000  55.000000
Computer Science        63.333333   95.000000  71.250000
Economics               76.666667    0.000000  76.666667
Education               30.000000   82.500000  65.000000
Engineering             68.750000    0.000000  68.750000
History                 78.333333  100.000000  83.750000
Literature              52.500000   80.000000  61.666667
Mathematics             60.000000   88.333333  81.250000
Mechanical Engineering  75.000000    0.000000  75.000000
Physics                 62.500000    0.000000  62.500000
Political Science        0.000000   68.333333  68.333333
Psychology               0.000000   76.250000  76.250000
All                     63.437500   79.722222  69.300000
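
With the default `aggfunc`, the "All" margins hold means, as in the table above. If totals are wanted instead, one option is to pass `aggfunc="sum"`, in which case the margins become row and column sums. The sketch below assumes a small hypothetical DataFrame rather than students.csv.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "major": ["Physics", "Physics", "Biology", "Chemistry"],
    "is_honors": [False, True, True, False],
    "credits_completed": [60, 90, 80, 40],
})

totals = df.pivot_table(
    values="credits_completed",
    index="major",
    columns="is_honors",
    aggfunc="sum",   # totals instead of the default mean
    fill_value=0,    # empty major/honors combinations become 0 instead of NaN
    margins=True,    # adds the "All" row and column
)
print(totals)
```

Note that `fill_value=0` reads naturally for sums, but in a table of means a 0 can be mistaken for a real average, so leaving NaN in place may be clearer there.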
