# Summary and Aggregation Functions

## 1. Summary Functions
- Pandas has a number of simple "summary functions" (an informal name, not an official pandas term) that condense a DataFrame into a compact overview of its structure, data types, and basic statistics.

In [2]:
import pandas as pd
import numpy as np

exam_scores = pd.read_csv('data/exam_scores.csv')

### a. info()

In [3]:
exam_scores.info()  # Displays information about the DataFrame, such as number of entries, column names, data types, and memory usage.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB
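
The `+` in `memory usage: 62.6+ KB` marks a lower bound: for `object` columns pandas counts only the references, not the Python strings they point to. The sketch below illustrates the difference on a small, invented stand-in for `exam_scores` (the frame `demo` and its values are assumptions for illustration, not part of the dataset).

```python
import pandas as pd

# Hypothetical miniature stand-in for exam_scores; values are invented.
demo = pd.DataFrame({
    # dtype=object mirrors the object dtype reported by info() above
    "gender": pd.Series(["female", "male", "female"], dtype=object),
    "math score": [72, 69, 90],
})

shallow = demo.memory_usage()          # object columns: size of the references only
deep = demo.memory_usage(deep=True)    # also measures the string objects themselves

print(shallow)
print(deep)

# info() can report the deep figure directly (no '+' suffix in that case).
demo.info(memory_usage="deep")
```

The numeric column reports the same size either way; only the `object` column grows under `deep=True`.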


### b. describe()

In [4]:
exam_scores.describe()

,math score,reading score,writing score
count,1000.0,1000.0,1000.0
mean,67.128,70.174,68.973
std,14.815367,14.85599,15.109155
min,15.0,18.0,10.0
25%,58.0,60.0,59.0
50%,67.0,70.0,69.0
75%,78.0,81.0,80.0
max,100.0,100.0,100.0
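
By default `describe()` reports the 25th, 50th, and 75th percentiles. As a minimal sketch, the `percentiles` parameter can request others; the Series `scores` below uses invented values, not the real exam data.

```python
import pandas as pd

# Invented values for illustration only.
scores = pd.Series([15, 58, 67, 78, 100], name="math score")

# Ask for the 10th and 90th percentiles instead of the default quartiles.
summary = scores.describe(percentiles=[0.1, 0.9])
print(summary)
```

The rows `10%` and `90%` appear in place of `25%` and `75%`.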


In [5]:
# To summarize the categorical columns on their own, pass the 'include' parameter.
exam_scores.describe(include='object')  # Displays summary statistics for categorical columns.

,gender,race/ethnicity,parental level of education,lunch,test preparation course
count,1000,1000,1000,1000,1000
unique,2,5,6,2,2
top,female,group C,some college,standard,none
freq,502,294,226,649,654
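
The `include` parameter has a counterpart, `exclude`, and both accept dtype selectors such as `'number'`. The following is a small sketch on an invented frame (`demo` and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical miniature frame; values are invented.
demo = pd.DataFrame({
    "gender": pd.Series(["female", "male", "female"], dtype=object),
    "math score": [72, 69, 90],
})

numeric_summary = demo.describe(include="number")   # numeric columns only
other_summary = demo.describe(exclude="number")     # everything except numeric columns

print(numeric_summary)
print(other_summary)
```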


In [6]:
# The same 'include' parameter can also summarize numerical and categorical columns together.
exam_scores.describe(include='all')  # Displays summary statistics for both numerical and categorical columns.

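# count - Number of non-null entries in the column. For example, the gender column has 1000 non-null entries.
# unique - Number of distinct values in the column. Categorical columns only. For example, the gender column has 2 distinct values: male and female.
# top - Categorical columns only. The category that occurs most often. For example, 'female' is the most frequent value in the gender column.
# freq - Categorical columns only. The number of occurrences of the most frequent category. For example, 'female' occurs 502 times in the gender column.
# mean - Mean of a numeric column. For example, the mean math score is 67.128.
# std - Standard deviation of a numeric column; it measures how spread out the values are.
# min - Minimum value of a numeric column.
# 25% - 25th percentile (first quartile) of a numeric column.
# 50% - 50th percentile (second quartile, i.e. the median) of a numeric column.
# 75% - 75th percentile (third quartile) of a numeric column.
# max - Maximum value of a numeric column.
# NaN (shown as an empty cell in the output below) means the statistic does not apply to that column.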

,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
count,1000,1000,1000,1000,1000,1000.0,1000.0,1000.0
unique,2,5,6,2,2,,,
top,female,group C,some college,standard,none,,,
freq,502,294,226,649,654,,,
mean,,,,,,67.128,70.174,68.973
std,,,,,,14.815367,14.85599,15.109155
min,,,,,,15.0,18.0,10.0
25%,,,,,,58.0,60.0,59.0
50%,,,,,,67.0,70.0,69.0
75%,,,,,,78.0,81.0,80.0
max,,,,,,100.0,100.0,100.0
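
To make the categorical rows concrete, `count`, `unique`, `top`, and `freq` can each be reproduced with an individual method. This is a sketch on an invented Series (`gender` below is illustrative data, not the real column).

```python
import pandas as pd

# Invented values for illustration only.
gender = pd.Series(
    ["female", "male", "female", "female", "male"],
    dtype=object,
    name="gender",
)

counts = gender.value_counts()      # sorted by frequency, most common first

manual = {
    "count": gender.count(),        # non-null entries
    "unique": gender.nunique(),     # number of distinct values
    "top": counts.index[0],         # most frequent category
    "freq": counts.iloc[0],         # occurrences of that category
}

print(manual)
print(gender.describe())            # reports the same four statistics
```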


In [7]:
# describe() can also be called on a single column (a Series):
exam_scores['math score'].describe()  # Displays summary statistics for the 'math score' column.

count    1000.000000
mean       67.128000
std        14.815367
min        15.000000
25%        58.000000
50%        67.000000
75%        78.000000
max       100.000000
Name: math score, dtype: float64

In [8]:
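# If a column name contains no spaces (and is a valid Python identifier), the column can also be accessed with dot notation.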
exam_scores.gender.describe()

count       1000
unique         2
top       female
freq         502
Name: gender, dtype: object
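
Dot notation has two limits worth seeing side by side: it cannot express names with spaces, and it loses to existing DataFrame attributes of the same name. The frame below is invented for illustration, and the `count` column is a hypothetical name chosen to collide with the `DataFrame.count` method.

```python
import pandas as pd

# Hypothetical frame; the "count" column exists only to show the name clash.
demo = pd.DataFrame({
    "gender": ["female", "male"],
    "math score": [72, 69],
    "count": [1, 2],
})

by_dot = demo.gender            # works: valid identifier, no clash
by_bracket = demo["gender"]     # bracket access is always available

math = demo["math score"]       # brackets required: the name contains a space
count_col = demo["count"]       # brackets required: demo.count is the method, not the column

print(by_dot.equals(by_bracket))
print(callable(demo.count))
```

Bracket access is the safer default in reusable code for this reason.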

## 2. Aggregation Functions  聚合函数

In [10]:
# Individual methods return the same statistics one at a time: mean() and median() work on a DataFrame or a Series,
# while unique() is available on a Series only.
exam_scores.gender.unique()

array(['male', 'female'], dtype=object)
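
Several aggregations can also be requested at once with `agg()`. The sketch below uses an invented two-column frame (`scores` and its values are assumptions for illustration).

```python
import pandas as pd

# Invented values for illustration only.
scores = pd.DataFrame({
    "math score": [72, 69, 90, 47],
    "reading score": [72, 90, 95, 57],
})

print(scores.mean())      # one mean per column
print(scores.median())    # one median per column

# agg() takes a list of function names and returns one row per statistic.
table = scores.agg(["mean", "median", "min", "max"])
print(table)
```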

In [11]:
# To see every unique value together with the number of times it occurs, use value_counts():
exam_scores.gender.value_counts()  # Displays the count of unique values

gender
female    502
male      498
Name: count, dtype: int64
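
When proportions are more useful than raw counts, `value_counts()` accepts `normalize=True`. A minimal sketch on invented data:

```python
import pandas as pd

# Invented values for illustration only.
gender = pd.Series(["female", "male", "female", "female"], name="gender")

counts = gender.value_counts()                 # absolute frequencies
shares = gender.value_counts(normalize=True)   # relative frequencies, summing to 1

print(counts)
print(shares)
```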

## 3. Exercise

In [13]:
sma_data = pd.read_csv('data/Standard_Metropolitan_Areas_Data-data.csv')
sma_data.head()

,land_area,percent_city,percent_senior,physicians,hospital_beds,graduates,work_force,income,region,crime_rate
0,1384,78.1,12.3,25627,69678,50.1,4083.9,72100,1,75.55
1,3719,43.9,9.4,13326,43292,53.9,3305.9,54542,2,56.03
2,3553,37.4,10.7,9724,33731,50.6,2066.3,33216,1,41.32
3,3916,29.9,8.8,6402,24167,52.2,1966.7,32906,2,67.38
4,2480,31.5,10.5,8502,16751,66.1,1514.5,26573,4,80.19


In [14]:
sma_data.mean()

land_area         2615.727273
percent_city        42.518182
percent_senior       9.781818
physicians        1828.333333
hospital_beds     6345.868687
graduates           54.463636
work_force         449.366667
income            6762.505051
region               2.494949
crime_rate          55.643030
dtype: float64

In [15]:
sma_data.region.unique()

array([1, 2, 4, 3])

In [16]:
sma_data.describe(include='all')

,land_area,percent_city,percent_senior,physicians,hospital_beds,graduates,work_force,income,region,crime_rate
count,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
mean,2615.727273,42.518182,9.781818,1828.333333,6345.868687,54.463636,449.366667,6762.505051,2.494949,55.64303
std,3045.82621,17.348277,2.524547,3192.199763,9136.202716,7.773286,610.990885,10393.34966,1.013921,13.470943
min,47.0,13.4,3.9,140.0,481.0,30.3,66.9,769.0,1.0,23.32
25%,1408.0,30.1,8.35,459.0,2390.0,50.25,150.3,2003.0,2.0,46.115
50%,1951.0,39.5,9.7,774.0,3472.0,54.0,257.2,3510.0,3.0,56.06
75%,2890.5,52.6,10.75,1911.5,6386.5,58.3,436.5,6283.5,3.0,63.86
max,27293.0,100.0,21.8,25627.0,69678.0,72.8,4083.9,72100.0,4.0,85.62


In [17]:
sample_data1 = sma_data[sma_data.region == 3]
sample_data1.describe(include='all')  # Displays summary statistics for the sample data where region is 3.

,land_area,percent_city,percent_senior,physicians,hospital_beds,graduates,work_force,income,region,crime_rate
count,36.0,36.0,36.0,36.0,36.0,36.0,36.0,36.0,36.0,36.0
mean,2485.972222,46.172222,9.283333,1313.722222,4962.777778,51.002778,343.797222,4806.861111,3.0,58.265556
std,1653.8138,19.101019,3.458199,1459.654903,3879.692625,7.328164,354.312228,5452.647656,0.0,10.155822
min,654.0,14.5,3.9,140.0,481.0,30.3,66.9,769.0,3.0,36.36
25%,1517.0,31.35,7.6,509.75,2483.5,46.95,146.2,1940.75,3.0,52.4475
50%,2018.0,47.0,8.85,767.0,3876.0,50.95,230.55,2976.5,3.0,58.725
75%,2911.0,59.725,9.975,1583.5,5811.75,55.175,366.95,4829.5,3.0,64.9225
max,8360.0,90.7,21.8,7340.0,16941.0,70.7,1541.9,25663.0,3.0,76.35
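
Filtering on one region and then calling `describe()` summarizes a single group. One way to get the same summary for every region at once is `groupby()`, sketched here on an invented frame (`demo` and its values are assumptions, not the SMA data).

```python
import pandas as pd

# Invented values for illustration only.
demo = pd.DataFrame({
    "region": [1, 3, 3, 2, 3],
    "crime_rate": [75.5, 58.0, 62.0, 56.0, 60.0],
})

# Filter-then-describe, as in the cell above, for a single region.
region3 = demo[demo.region == 3]["crime_rate"].describe()

# One row of summary statistics per region.
by_region = demo.groupby("region")["crime_rate"].describe()

print(region3)
print(by_region)
```

The row for region 3 in `by_region` matches the filtered summary.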


In [18]:
sma_data.region.value_counts()  # Displays the count of unique values in the 'region' column.

region
3    36
2    25
1    21
4    17
Name: count, dtype: int64