In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

The Pandas library offers a function called pivot_table that summarizes a feature's values in a neat two-dimensional table.

A pivot table is similar to the DataFrame.groupby() method in Pandas: both split the data into groups and apply an aggregate function to each group.
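
To make the groupby comparison concrete, here is a minimal sketch on a small made-up dataframe (the column names and values are illustrative assumptions, not taken from the Titanic file). It computes the same per-group mean both ways.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male"],
    "fare": [80.0, 20.0, 50.0, 10.0, 30.0],
})

# groupby returns a Series indexed by the group key
by_groupby = toy.groupby("sex")["fare"].mean()

# pivot_table returns a DataFrame with the group key as its index
by_pivot = pd.pivot_table(toy, index="sex", values="fare", aggfunc="mean")

print(by_groupby)
print(by_pivot)
```

The numbers match; the main difference is the shape of the result and the extra reshaping options (such as columns) that pivot_table adds.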

In [9]:
df = pd.read_csv("titanic_train.csv")  # load the Titanic training set
df.isna().sum()                        # count missing values per column

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

###### How to Group Data Using Index in a Pivot Table?

To group data, pass pivot_table <b>the data and an index parameter</b>:

    data is the Pandas dataframe you pass to the function.
    
    index is the feature (or list of features) whose values are used to group your data.

The index feature will appear as an index in the resultant table. 

Generally, <i> categorical columns are used as indexes </i>

In [5]:

table = pd.pivot_table(
    df,
    index=['Sex', 'Pclass'],                       # group by sex, then passenger class
    aggfunc={'Age': 'mean', 'Survived': 'sum'}     # mean age and number of survivors per group
)

table

                     Age  Survived
Sex    Pclass
female 1       34.611765        91
       2       28.722973        70
       3       21.750000        72
male   1       41.281386        45
       2       30.740707        17
       3       26.507589        47


aggfunc is the parameter that specifies which aggregate function pivot_table applies to the grouped data.

To aggregate different features in different ways, provide a <b>dictionary as an input to the aggfunc parameter with the feature name as the key and the corresponding aggregate function as the value</b>.

###### Aggregate on Specific Features With the values Parameter

The <b>values</b> parameter tells the function which features to aggregate.

In [6]:
table = pd.pivot_table(
    df,
    index=['Sex', 'Pclass'],
    values=['Survived'],       # aggregate only the Survived column
    aggfunc='mean'             # mean of a 0/1 column = survival rate
)

table


               Survived
Sex    Pclass
female 1       0.968085
       2       0.921053
       3       0.500000
male   1       0.368852
       2       0.157407
       3       0.135447


In [None]:
table.plot(kind='barh', figsize=(8,4))
plt.show()
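
As a small illustration of what values changes, the sketch below uses a made-up dataframe (hypothetical column names and numbers). Without values, pivot_table aggregates every remaining column using its default aggregate, the mean; with values, only the named column is kept.

```python
import pandas as pd

# Hypothetical toy data with two numeric columns
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male"],
    "age": [30.0, 40.0, 20.0, 50.0],
    "fare": [80.0, 20.0, 10.0, 30.0],
})

# No values given: every non-index column is aggregated (default aggfunc is the mean)
all_numeric = pd.pivot_table(toy, index="sex")

# values given: only the fare column is aggregated
fare_only = pd.pivot_table(toy, index="sex", values="fare")

print(all_numeric)
print(fare_only)
```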

The <b>columns</b> parameter spreads the values of a feature across the top of the resultant table as column headers. 

Neither columns nor index is required on its own, but pivot_table needs at least one of them to group by. Using them together makes the relationship between the features easier to see.

In [7]:
table = pd.pivot_table(
    df,
    index=['Sex'],           # rows: one per sex
    columns=['Pclass'],      # columns: one per passenger class
    values=['Survived'],
    aggfunc='sum'            # number of survivors in each cell
)

table


       Survived
Pclass        1   2   3
Sex
female       91  70  72
male         45  17  47


In [None]:
table.plot(kind='bar', figsize=(8,5))
plt.show()
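
One way to think about the columns parameter is as a groupby on both features followed by moving one of them from the rows to the column headers. The sketch below shows this on a made-up dataframe (names and values are illustrative assumptions).

```python
import pandas as pd

# Hypothetical toy data: every sex/pclass combination occurs at least once
toy = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male", "male"],
    "pclass": [1, 1, 2, 1, 1, 2, 2],
    "survived": [1, 1, 0, 0, 1, 0, 0],
})

# Wide table: sex on the rows, pclass across the top
wide = pd.pivot_table(toy, index="sex", columns="pclass",
                      values="survived", aggfunc="sum")

# Same numbers via groupby, then unstack moves pclass into the columns
long = toy.groupby(["sex", "pclass"])["survived"].sum()
unstacked = long.unstack("pclass")

print(wide)
print(unstacked)
```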

###### Handling Missing Data

pivot_table also lets you deal with missing values through the parameters <b>dropna</b> and <b>fill_value</b>:

dropna (True by default) leaves out columns of the grouped table whose entries are all NaN.

fill_value replaces the NaN values in the grouped table with the value that you provide.

In [8]:
table = pd.pivot_table(
    df,
    index=['Sex','Survived','Pclass'],
    columns=['Embarked'],
    values=['Age'],
    aggfunc='mean'     # cells with no recorded ages stay NaN
)
table


                              Age
Embarked                        C          Q          S
Sex    Survived Pclass
female 0        1       50.000000        NaN  13.500000
                2             NaN        NaN  36.000000
                3       20.700000  28.100000  23.688889
       1        1       35.675676  33.000000  33.619048
                2       19.142857  30.000000  29.091667
                3       11.045455  17.600000  22.548387
male   0        1       43.050000  44.000000  45.362500
                2       29.500000  57.000000  33.414474
                3       27.555556  28.076923  27.168478
       1        1       36.437500        NaN  36.121667


In [10]:

table = pd.pivot_table(
    df,
    index=['Sex','Survived','Pclass'],
    columns=['Embarked'],
    values=['Age'],
    aggfunc='mean',
    fill_value=df['Age'].mean()   # replace NaN cells with the overall mean age
)
table


                              Age
Embarked                        C          Q          S
Sex    Survived Pclass
female 0        1       50.000000  29.699118  13.500000
                2       29.699118  29.699118  36.000000
                3       20.700000  28.100000  23.688889
       1        1       35.675676  33.000000  33.619048
                2       19.142857  30.000000  29.091667
                3       11.045455  17.600000  22.548387
male   0        1       43.050000  44.000000  45.362500
                2       29.500000  57.000000  33.414474
                3       27.555556  28.076923  27.168478
       1        1       36.437500  29.699118  36.121667
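
The effect of fill_value is easier to see on a tiny table. The sketch below uses a made-up dataframe (hypothetical names and values) in which one group/port combination has no rows, so its cell is NaN unless a fill value is given.

```python
import pandas as pd

# Hypothetical toy data: group "A" has no passengers from port "Y"
toy = pd.DataFrame({
    "group": ["A", "A", "B", "B"],
    "port": ["X", "X", "X", "Y"],
    "age": [1.0, 3.0, 5.0, 7.0],
})

# Without fill_value the empty A/Y cell is NaN
raw = pd.pivot_table(toy, index="group", columns="port",
                     values="age", aggfunc="mean")

# With fill_value the empty cell takes the overall mean age (4.0 here)
filled = pd.pivot_table(toy, index="group", columns="port",
                        values="age", aggfunc="mean",
                        fill_value=toy["age"].mean())

print(raw)
print(filled)
```

Filling with the overall mean is only one choice. The filled cells are placeholders, not averages computed from passengers in that group.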


### Crosstab

Crosstab (Cross-tabulation) is a frequency table showing how two (or more) categorical variables interact.

It counts how many observations fall into each category combination.

While pivot_table is more general (it can aggregate with many functions), crosstab is specialized for counts and percentages.
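
To illustrate the relationship, the sketch below builds the same frequency table both ways on a made-up dataframe (names and values are illustrative assumptions). The pivot_table version has to count a helper column and fill empty combinations with 0, which crosstab does without being asked.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male", "female"],
    "pclass": [1, 2, 1, 3, 3, 1],
    "survived": [1, 1, 0, 0, 1, 1],
})

# crosstab counts rows per sex/pclass combination
counts = pd.crosstab(toy["sex"], toy["pclass"])

# Equivalent pivot_table: count a non-null column, fill empty cells with 0
pivot_counts = toy.pivot_table(index="sex", columns="pclass",
                               values="survived", aggfunc="count",
                               fill_value=0)

print(counts)
print(pivot_counts)
```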

| Feature / Aspect        | **Crosstab (`pd.crosstab`)**                                    | **Pivot Table (`pd.pivot_table`)**                                  |
| ----------------------- | --------------------------------------------------------------- | ------------------------------------------------------------------- |
| **Data Types**          | Mainly categorical (for frequencies)                            | Works with both categorical + numerical data                        |
| **Primary Purpose**     | Show frequency counts or proportions between categories         | Summarize and aggregate numerical data by categories                |
| **Default Aggregation** | Counts (frequency table)                                        | Mean (unless specified via `aggfunc`)                               |
| **Aggregation Options** | Limited (`aggfunc` possible, but usually count/proportion)      | Flexible: sum, mean, count, min, max, custom functions              |
| **Number of Variables** | Typically 2 (row vs column categories), supports more if nested | Can handle multiple `index`, `columns`, and multiple `values`       |
| **Normalization**       | Built-in with `normalize` (`index`, `columns`, `all`)           | No direct parameter (must normalize manually)                       |
| **Margins (Totals)**    | `margins=True` adds totals row/column                           | `margins=True` also supported                                       |
| **Complexity**          | Simpler, faster for categorical cross-tabulation                | More powerful, flexible for complex data summarization              |
| **Common Use Case**     | Contingency tables, chi-square tests, quick category comparison | Exploratory data analysis, business summaries, grouped aggregations |
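
The normalize and margins options from the table can be sketched as follows, again on a made-up dataframe (names and values are illustrative assumptions).

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male", "female"],
    "pclass": [1, 2, 1, 3, 3, 1],
})

# normalize="index" turns each row into proportions that sum to 1
row_share = pd.crosstab(toy["sex"], toy["pclass"], normalize="index")

# margins=True appends an "All" row and column holding the totals
with_totals = pd.crosstab(toy["sex"], toy["pclass"], margins=True)

print(row_share.round(2))
print(with_totals)
```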


In [None]:
import pandas as pd

df = pd.read_csv("titanic_train.csv")

ct = pd.crosstab(df['Sex'], df['Pclass'])
print("Crosstab:\n", ct)

In [None]:

pt = pd.pivot_table(df, index='Sex', columns='Pclass', values='Survived', aggfunc='mean')
print("\nPivot Table:\n", pt.round(2))


Use Crosstabs when you want to count or compare proportions between categorical variables.

Use Pivot Tables when you want flexible aggregations (sum, mean, etc.) across multiple variables.