# Data Aggregation and Statistics in Python

Data aggregation and statistics in Python typically rely on the pandas library and involve reducing large datasets into meaningful summaries to support Exploratory Data Analysis (EDA).
Methods like df.describe(), df.sum(), df.min(), and df.count() compute descriptive statistics on the values within the DataFrame.
Attributes such as df.columns, df.index, and df.shape, together with methods such as df.info() and df.nunique(), provide essential metadata about the DataFrame's structure, dimensions, data types, and the number of unique entries it contains.
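
The difference between attributes (no parentheses) and methods (called with parentheses) can be sketched on a tiny DataFrame. The `toy` data and its column names below are made up for illustration and are not part of the NBA dataset used in this chapter.

```python
import pandas as pd

# Hypothetical toy data, not taken from the NBA dataset
toy = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "team": ["X", "X", "Y", None],
    "points": [10.0, 20.0, None, 40.0],
})

# Attributes describe the structure and are accessed without parentheses
n_rows, n_cols = toy.shape
column_names = list(toy.columns)

# Methods compute something from the values and must be called
non_null = toy.count()              # non-missing entries per column
unique = toy.nunique()              # distinct non-missing values per column
total_points = toy["points"].sum()  # missing values are skipped by default

print(n_rows, n_cols)
print(column_names)
print(non_null.to_dict())
print(unique.to_dict())
print(total_points)
```

Missing entries are ignored by both count() and nunique(), which is why the per-column numbers can be smaller than the number of rows.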

In [1]:
import pandas as pd

In [2]:
df_ext = pd.read_csv("data/nba.csv") # Load the dataset

print(type(df_ext)) # Check the type of the object

<class 'pandas.core.frame.DataFrame'>


In [3]:
df_ext.columns # View the column names of the NBA DataFrame

Index(['Name', 'Team', 'Number', 'Position', 'Age', 'Height', 'Weight',
       'College', 'Salary'],
      dtype='object')

In [None]:
df_ext.index # View the index of the NBA DataFrame

RangeIndex(start=0, stop=458, step=1)

In [5]:
df_ext.shape # Get the shape of the NBA DataFrame

(458, 9)

In [6]:
df_ext.count() # Count non-null entries for each column

Name        457
Team        457
Number      457
Position    457
Age         457
Height      457
Weight      457
College     373
Salary      446
dtype: int64
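
Because count() only reports non-null entries, the number of missing values per column is the row count minus that result. The sketch below shows this on made-up data (the `toy` frame is an assumption for illustration) and compares it with isna().sum(), which gives the same figures directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "college": ["Texas", None, None],
    "salary": [1.0, np.nan, 3.0],
})

# Missing values derived from count(): total rows minus non-null entries
missing_from_count = len(toy) - toy.count()

# The same information computed directly
missing_direct = toy.isna().sum()

print(missing_from_count.to_dict())
print(missing_direct.to_dict())
```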

In [7]:
df_ext.nunique() # Count unique values for each column

Name        457
Team         30
Number       53
Position      5
Age          22
Height       18
Weight       87
College     118
Salary      309
dtype: int64

In [8]:
df_ext.info() # Get a concise summary of the DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 458 entries, 0 to 457
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      457 non-null    object 
 1   Team      457 non-null    object 
 2   Number    457 non-null    float64
 3   Position  457 non-null    object 
 4   Age       457 non-null    float64
 5   Height    457 non-null    object 
 6   Weight    457 non-null    float64
 7   College   373 non-null    object 
 8   Salary    446 non-null    float64
dtypes: float64(4), object(5)
memory usage: 32.3+ KB


In [9]:
df_ext.describe() # Get statistical summary of numerical columns

           Number         Age      Weight      Salary
count       457.0       457.0       457.0       446.0
mean    17.678337   26.938731  221.522976   4842684.0
std      15.96609    4.404016   26.368343   5229238.0
min           0.0        19.0       161.0     30888.0
25%           5.0        24.0       200.0   1044792.0
50%          13.0        26.0       220.0   2839073.0
75%          25.0        30.0       240.0   6500000.0
max          99.0        40.0       307.0  25000000.0


## Aggregation Methods

In [None]:
df_ext.head() # View the first few rows of the DataFrame

            Name            Team  Number Position   Age Height  Weight            College     Salary
0  Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0              Texas  7730337.0
1    Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0          Marquette  6796117.0
2   John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0  Boston University        NaN
3    R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0      Georgia State  1148640.0
4  Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0                NaN  5000000.0


In [14]:
df_ext.sum() # Try to sum every column; raises a TypeError here because the text columns are included

TypeError: can only concatenate str (not "int") to str
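
The error above appears to come from sum() trying to add up the text columns along with the numeric ones. One way to avoid it, sketched here on made-up data (the `toy` frame is an assumption), is to restrict the DataFrame to numeric columns first with select_dtypes(), which gives the same result as passing numeric_only=True.

```python
import pandas as pd

# Hypothetical toy data mixing a text column with numeric columns
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "age": [25.0, 30.0, 22.0],
    "salary": [100.0, None, 300.0],
})

# Keep only the numeric columns, then aggregate
numeric_part = toy.select_dtypes(include="number")
totals = numeric_part.sum()

# Equivalent shortcut: let sum() skip the non-numeric columns itself
same = totals.equals(toy.sum(numeric_only=True))

print(totals.to_dict())
print(same)
```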

In [19]:
df_ext.sum(numeric_only=True) # Sum of each numerical column (explicitly specifying numeric_only)

Number    8.079000e+03
Age       1.231100e+04
Weight    1.012360e+05
Salary    2.159837e+09
dtype: float64

In [None]:
df_ext.sum(axis=0, numeric_only=True) # Sum down each numerical column (axis=0, the default)

Number    8.079000e+03
Age       1.231100e+04
Weight    1.012360e+05
Salary    2.159837e+09
dtype: float64
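
To make the axis argument concrete, the following sketch contrasts axis=0 (one total per column) with axis=1 (one total per row). The `toy` frame and its labels are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: two numeric columns, three labelled rows
toy = pd.DataFrame(
    {"q1": [1, 2, 3], "q2": [10, 20, 30]},
    index=["a", "b", "c"],
)

# axis=0 collapses the rows: one result per column
col_totals = toy.sum(axis=0)

# axis=1 collapses the columns: one result per row
row_totals = toy.sum(axis=1)

print(col_totals.to_dict())
print(row_totals.to_dict())
```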

In [22]:
df_ext.min(numeric_only=True) # Minimum value of each numerical column

Number        0.0
Age          19.0
Weight      161.0
Salary    30888.0
dtype: float64

In [23]:
df_ext['Salary'].max() # Maximum value of the 'Salary' column

np.float64(25000000.0)

In [24]:
df_ext.median(numeric_only=True) # Median of each numerical column

Number         13.0
Age            26.0
Weight        220.0
Salary    2839073.0
dtype: float64

In [25]:
df_ext['Salary'].mean() # Mean of the 'Salary' column

np.float64(4842684.105381166)
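
The mean salary above is well above the median shown earlier, which is typical of data with a few very large values. Several statistics can be requested in one call with agg(); the sketch below uses made-up numbers (the `toy_salary` Series is an assumption) to show how a single large value pulls the mean away from the median.

```python
import pandas as pd

# Hypothetical salaries with one very large value
toy_salary = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0], name="salary")

# Compute several summary statistics in a single call
summary = toy_salary.agg(["min", "median", "mean", "max"])

print(summary.to_dict())
```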

In [26]:
df_ext.value_counts() # Count occurrences of each distinct row (rows with missing values are dropped by default)

Name              Team                    Number  Position  Age   Height  Weight  College          Salary    
Zach Randolph     Memphis Grizzlies       50.0    PF        34.0  6-9     260.0   Michigan State   9638555.0     1
Aaron Brooks      Chicago Bulls           0.0     PG        31.0  6-0     161.0   Oregon           2250000.0     1
Aaron Gordon      Orlando Magic           0.0     PF        20.0  6-9     220.0   Arizona          4171680.0     1
Aaron Harrison    Charlotte Hornets       9.0     SG        21.0  6-6     210.0   Kentucky         525093.0      1
Adreian Payne     Minnesota Timberwolves  33.0    PF        25.0  6-10    237.0   Michigan State   1938840.0     1
                                                                                                                ..
Andre Roberson    Oklahoma City Thunder   21.0    SG        24.0  6-7     210.0   Colorado         1210800.0     1
Andrew Bogut      Golden State Warriors   12.0    C         31.0  7-0     260.0   Uta

In [27]:
df_ext['Team'].value_counts() # Count how many times each value appears in the 'Team' column

Team
New Orleans Pelicans      19
Memphis Grizzlies         18
New York Knicks           16
Milwaukee Bucks           16
Brooklyn Nets             15
Boston Celtics            15
Los Angeles Clippers      15
Los Angeles Lakers        15
Phoenix Suns              15
Sacramento Kings          15
Chicago Bulls             15
Philadelphia 76ers        15
Toronto Raptors           15
Golden State Warriors     15
Indiana Pacers            15
Detroit Pistons           15
Cleveland Cavaliers       15
Dallas Mavericks          15
Houston Rockets           15
San Antonio Spurs         15
Atlanta Hawks             15
Charlotte Hornets         15
Miami Heat                15
Washington Wizards        15
Denver Nuggets            15
Oklahoma City Thunder     15
Utah Jazz                 15
Portland Trail Blazers    15
Orlando Magic             14
Minnesota Timberwolves    14
Name: count, dtype: int64
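
Two value_counts() options are often useful alongside the plain counts: dropna=False keeps missing values as their own category, and normalize=True returns proportions instead of counts. A minimal sketch on made-up data (the `toy_team` Series is an assumption):

```python
import pandas as pd

# Hypothetical team labels with one missing entry
toy_team = pd.Series(["X", "X", "Y", None], name="team")

# Default: missing values are excluded from the counts
counts = toy_team.value_counts()

# Keep missing values as a separate category
with_missing = toy_team.value_counts(dropna=False)

# Proportions of the non-missing values instead of raw counts
shares = toy_team.value_counts(normalize=True)

print(counts.to_dict())
print(len(with_missing))
print(shares.round(3).to_dict())
```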

Next Chapter [Indexing and Selection](3.IndexingSelection.ipynb)