# Binning Numeric Columns

When grouping, numeric columns are often used as the aggregating column and not the grouping column. In this chapter, we'll learn how to bin numeric columns into specific groups using the `cut` and `qcut` **functions** (not methods). After binning, we'll be able to more easily use them for grouping. Let's begin with the housing dataset, which has a few numeric columns that make sense to bin.

In [1]:
import pandas as pd
usecols = ['Neighborhood', 'OverallQual', 'YearBuilt', 'Exterior1st', 
           'Foundation', 'GrLivArea', 'SalePrice']
df = pd.read_csv('../data/housing.csv', usecols=usecols)
df.head()

,Neighborhood,OverallQual,YearBuilt,Exterior1st,Foundation,GrLivArea,SalePrice
0,CollgCr,7,2003,VinylSd,PConc,1710,208500
1,Veenker,6,1976,MetalSd,CBlock,1262,181500
2,CollgCr,7,2001,VinylSd,PConc,1786,223500
3,Crawfor,7,1915,Wd Sdng,BrkTil,1717,140000
4,NoRidge,8,2000,VinylSd,PConc,2198,250000


## Grouping with numeric columns

Any column, regardless of its data type, may be used as the grouping column. Although numeric columns are usually used as the aggregating column, there are cases where it is sensible to use them as the grouping column. Here, we find the mean price for each of the ten unique values of `OverallQual` and also report the number of houses in each group.

In [2]:
df.groupby('OverallQual')['SalePrice'].agg(['mean', 'size'])

OverallQual,mean,size
1,50150.0,2
2,51770.333333,3
3,87473.75,20
4,108420.655172,116
5,133523.347607,397
6,161603.034759,374
7,207716.423197,319
8,274735.535714,168
9,367513.023256,43
10,438588.388889,18


The `GrLivArea` column is also numeric, but is a poor choice for grouping as there are many unique values. Let's perform the same operation as above.

In [3]:
df_temp = df.groupby('GrLivArea')['SalePrice'].agg(['mean', 'size'])
df_temp.head()

GrLivArea,mean,size
334,39300.0,1
438,60000.0,1
480,35311.0,1
520,68500.0,1
605,86000.0,1


The first five unique values of `GrLivArea` all appear exactly once. Groups with one observation are usually not that interesting. In fact, the average group contains just 1.7 rows.

In [4]:
df_temp['size'].mean()

np.float64(1.6957026713124275)

There are more than half as many groups as there are rows in the DataFrame.

In [5]:
print(f'There are {len(df_temp)} groups from {len(df)} total rows.')

There are 861 groups from 1460 total rows.


## Binning with `pd.cut`

The `pd.cut` function provides the machinery for binning numeric columns into a specific number of bins. Pass a numeric Series as the first argument and the boundaries of the bins as the second.

In [6]:
s = pd.cut(df['GrLivArea'], bins=[0, 500, 1000, 1500, 2000, 3000, 10_000])
s.head()

0    (1500, 2000]
1    (1000, 1500]
2    (1500, 2000]
3    (1500, 2000]
4    (2000, 3000]
Name: GrLivArea, dtype: category
Categories (6, interval[int64, right]): [(0, 500] < (500, 1000] < (1000, 1500] < (1500, 2000] < (2000, 3000] < (3000, 10000]]

An ordered categorical Series is returned with one fewer category than the number of boundaries given. Each category is an interval with two endpoints. The left endpoint is **exclusive**, while the right is **inclusive**. For instance, the interval `(1500, 2000]` does not include 1500 exactly, but does include 2000.
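To make the endpoint rule concrete, here is a small sketch on made-up values (not the housing data). It also uses the `right` parameter of `pd.cut`, which closes the intervals on the left instead when set to `False`. The variable names are only for illustration.

```python
import pandas as pd

# Toy values sitting exactly on and just past a boundary (illustrative only)
toy = pd.Series([1500, 1501, 2000])
edges = [1000, 1500, 2000, 3000]

right_closed = pd.cut(toy, bins=edges)              # default: (left, right]
left_closed = pd.cut(toy, bins=edges, right=False)  # [left, right)

# Show the value next to the interval it was assigned to under each rule
print(pd.DataFrame({'value': toy,
                    'right_closed': right_closed,
                    'left_closed': left_closed}))
```

With the default, 1500 belongs to the interval that ends at 1500. With `right=False`, it belongs to the interval that starts at 1500.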

### Interval data type

While the resulting column is categorical, each individual value in the column is an **Interval** object, which is specific to pandas. The `cat` accessor is used to return all six of these Interval categories.

In [7]:
s.cat.categories

IntervalIndex([(0, 500], (500, 1000], (1000, 1500], (1500, 2000], (2000, 3000],
               (3000, 10000]],
              dtype='interval[int64, right]')

A single value may be retrieved using integer location.

In [8]:
s.cat.categories[2]

Interval(1000, 1500, closed='right')
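An Interval can also be built and inspected directly. The sketch below constructs the same interval by hand and looks at a few of its attributes. It is meant only to show what kind of object each category is.

```python
import pandas as pd

# Build the same kind of object that pd.cut stores as a category
iv = pd.Interval(1000, 1500, closed='right')

print(iv.left, iv.right)  # the two endpoints
print(iv.length)          # width of the interval
print(1000 in iv)         # left endpoint is excluded
print(1500 in iv)         # right endpoint is included
```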

### Must know minimum and maximum value

You must know both the minimum and maximum value of the column you are binning to make precise bins around the current data. In this case, 0 is lower than the minimum and 10,000 is much greater than the maximum `GrLivArea`, so all values will be placed within a bin. Any values greater than the last boundary, or less than or equal to the first, will be missing in the returned Series.
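A quick way to see this is to bin a few made-up values with boundaries that deliberately do not cover them all. This is only a sketch on toy data.

```python
import pandas as pd

toy = pd.Series([0, 250, 500, 750])  # made-up values
binned = pd.cut(toy, bins=[0, 500])  # a single bin, (0, 500]

# 0 sits on the excluded left edge and 750 is past the last boundary
print(binned.isna().tolist())
```

Counting missing values with `isna().sum()` after a cut is a cheap check that the boundaries covered the data.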

Now that the data is binned, we can count the number of houses within each of these six categories. Notice how only three houses have a `GrLivArea` of 500 or less.

In [9]:
s.value_counts(sort=False)

GrLivArea
(0, 500]           3
(500, 1000]      228
(1000, 1500]     554
(1500, 2000]     461
(2000, 3000]     196
(3000, 10000]     18
Name: count, dtype: int64

To get the precise lower and upper boundaries, use the minimum and maximum of the column. You'll also need to set the `include_lowest` parameter to `True` so the very first bin includes the lowest value.

In [10]:
area_min, area_max = df['GrLivArea'].agg(['min', 'max'])
s = pd.cut(df['GrLivArea'], bins=[area_min, 500, 1000, 1500, 2000, 3000, area_max],
          include_lowest=True)
s.value_counts(sort=False)

GrLivArea
(333.999, 500.0]      3
(500.0, 1000.0]     228
(1000.0, 1500.0]    554
(1500.0, 2000.0]    461
(2000.0, 3000.0]    196
(3000.0, 5642.0]     18
Name: count, dtype: int64
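The effect of `include_lowest` is easiest to see by running the same cut with and without it. The sketch below uses three made-up values, with 334 standing in for the column minimum.

```python
import pandas as pd

toy = pd.Series([334, 400, 5642])  # made-up values; 334 plays the role of the minimum
edges = [toy.min(), 1000, toy.max()]

without = pd.cut(toy, bins=edges)
with_lowest = pd.cut(toy, bins=edges, include_lowest=True)

print(without.isna().sum())      # the minimum falls outside (334, 1000]
print(with_lowest.isna().sum())  # the first bin now reaches the minimum
```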

### Cut into a specific number of bins

A second way to use `pd.cut` is to supply it a single integer for the number of bins to create. Each bin created will have equal width. Here, we create eight bins on the same column and immediately find the counts of each.

In [11]:
pd.cut(df['GrLivArea'], bins=8).value_counts(sort=False)

GrLivArea
(328.692, 997.5]    228
(997.5, 1661.0]     740
(1661.0, 2324.5]    389
(2324.5, 2988.0]     85
(2988.0, 3651.5]     14
(3651.5, 4315.0]      0
(4315.0, 4978.5]      3
(4978.5, 5642.0]      1
Name: count, dtype: int64
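The equal-width edges can be reproduced by hand. This sketch, on made-up values with the same minimum and maximum as `GrLivArea`, compares the edges returned by `pd.cut` (via `retbins=True`) with evenly spaced points from `np.linspace`. It matches the output above, where the first edge (328.692) sits slightly below the minimum of 334 so that the minimum lands inside the first bin.

```python
import numpy as np
import pandas as pd

toy = pd.Series([334, 1000, 2500, 5642])  # made-up values
_, edges = pd.cut(toy, bins=8, retbins=True)

# Eight equal-width bins need nine evenly spaced edges
expected = np.linspace(toy.min(), toy.max(), 9)

print(np.allclose(edges[1:], expected[1:]))  # all edges but the first match
print(edges[0] < toy.min())                  # the first edge is nudged down
```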

### Take care when setting the `precision` parameter

By default, pandas uses up to three digits of precision when labeling the bins. You may use the `precision` parameter to set the decimal precision (just like rounding), though care must be taken, as it only rounds the displayed boundary values after the cut has taken place. The real boundaries are still the same as above. To show this, we'll set `precision` to -3.

In [12]:
pd.cut(df['GrLivArea'], bins=8, precision=-3).value_counts(sort=False)

GrLivArea
(300.0, 1000.0]     228
(1000.0, 1700.0]    740
(1700.0, 2300.0]    389
(2300.0, 3000.0]     85
(3000.0, 3700.0]     14
(3700.0, 4300.0]      0
(4300.0, 5000.0]      3
(5000.0, 5600.0]      1
Name: count, dtype: int64

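Setting `precision` to -3 results in the exact same counts as above. (The labels shown are rounded to the nearest hundred rather than the nearest thousand; pandas appears to raise the precision until every boundary label is distinct.) It would appear that the same number of houses (740) have `GrLivArea` greater than 997.5 up to 1661 as those with `GrLivArea` greater than 1000 up to 1700.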

The `between` method, which includes both endpoints by default, is used below to determine whether a house has a `GrLivArea` within a certain range. The resulting boolean Series is summed to find the count. Note how the true count for 1001 to 1700 does not match the count produced by `pd.cut`, as setting the `precision` parameter only rounds the boundaries after the cut has been made.

In [13]:
df['GrLivArea'].between(999, 1661).sum()

np.int64(740)

In [14]:
df['GrLivArea'].between(1001, 1700).sum()

np.int64(782)
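Another way to confirm that `precision` changes only the labels is to compare the integer codes behind the categories. The sketch below uses made-up values spanning the same range as `GrLivArea`, and `retbins=True` to recover the unrounded edges.

```python
import pandas as pd

toy = pd.Series([334, 999, 1001, 1690, 3000, 5642])  # made-up values

default = pd.cut(toy, bins=8)
rounded = pd.cut(toy, bins=8, precision=-3)

# Integer codes record which bin each row landed in, independent of the labels
print((default.cat.codes == rounded.cat.codes).all())

# The value 999 carries a rounded label that does not actually contain it
print(rounded.iloc[1], 999 in rounded.iloc[1])

# The unrounded edges are still available
_, edges = pd.cut(toy, bins=8, precision=-3, retbins=True)
print(edges)
```

When rounded boundaries are needed, one option is to pass them explicitly as a list to `bins` so that the labels and the actual cut points agree.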

### Label the bins with string names

Each bin may be labeled with a string instead of the interval by setting the `labels` parameter to a list of strings, one for each bin. Here, we create three equal-width bins with three string labels. When using string labels, you won't know the endpoints for the bins unless you return them by setting `retbins` to `True`. Both the Series and the bin boundaries will be returned as a tuple, which we unpack into separate variable names.

In [15]:
s, bins = pd.cut(df['GrLivArea'], bins=3, 
                 labels=['small', 'medium', 'large'], retbins=True)
s.head()

0     small
1     small
2     small
3     small
4    medium
Name: GrLivArea, dtype: category
Categories (3, object): ['small' < 'medium' < 'large']

The bin boundaries are displayed below.

In [None]:
bins
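One practical use of the returned boundaries is labeling new values with the same cut points. The following is a sketch on made-up data; `train` and `new` are hypothetical names, not part of the housing dataset.

```python
import pandas as pd

train = pd.Series([334, 1200, 2500, 5642])  # made-up "original" values
labels = ['small', 'medium', 'large']
_, edges = pd.cut(train, bins=3, labels=labels, retbins=True)

# Reuse the returned edges so new values are labeled with the same cut points
new = pd.Series([400, 3000, 5000])
relabeled = pd.cut(new, bins=edges, labels=labels)
print(relabeled.tolist())
```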

## Quantile binning with `pd.qcut`

When we cut our Series into eight equal-width bins, one of the categories had zero observations in it. Instead of using equal-width bins, you may wish to have an equal number of observations in each bin. The `pd.qcut` function bins according to quantiles. You may provide it a list of floats as the quantile boundaries, or an integer to create that many bins, each with approximately the same number of observations. Below, we attempt to create eight bins with the same number of observations in each. Because there are duplicate `GrLivArea` values, it may be impossible to create boundaries where each bin has exactly the same number of observations.

In [16]:
pd.qcut(df['GrLivArea'], 8, precision=0).value_counts(sort=False)

GrLivArea
(333.0, 954.0]      184
(954.0, 1130.0]     181
(1130.0, 1304.0]    183
(1304.0, 1464.0]    183
(1464.0, 1620.0]    182
(1620.0, 1777.0]    182
(1777.0, 2079.0]    182
(2079.0, 5642.0]    183
Name: count, dtype: int64
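Duplicate values can do more than unbalance the bins. When one value is common enough that several quantile boundaries coincide, `pd.qcut` raises an error by default. The sketch below, on made-up data, shows the failure and the `duplicates='drop'` option, which merges the repeated boundaries at the cost of returning fewer bins than requested.

```python
import pandas as pd

# Made-up data where one value dominates, so several quantile edges coincide
toy = pd.Series([1, 1, 1, 1, 1, 1, 2, 3, 4, 5])

try:
    pd.qcut(toy, 4)
except ValueError as err:
    print('qcut failed:', err)

# duplicates='drop' merges the repeated edges, leaving fewer than four bins
binned = pd.qcut(toy, 4, duplicates='drop')
print(binned.value_counts(sort=False))
```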

Provide a list of quantiles as the second argument to create bins of a specific size. Here, three bins are created that hold 20%, 70%, and 10% of the data.

In [17]:
pd.qcut(df['GrLivArea'], [0, 0.2, 0.9, 1], precision=0).value_counts(sort=False)

GrLivArea
(333.0, 1067.0]      292
(1067.0, 2158.0]    1022
(2158.0, 5642.0]     146
Name: count, dtype: int64

We can use the `quantile` method to verify the bin edge values.

In [18]:
df['GrLivArea'].quantile([0, 0.2, 0.9, 1])

0.0     334.0
0.2    1066.6
0.9    2158.3
1.0    5642.0
Name: GrLivArea, dtype: float64

## Grouping with bins

Grouping is often more sensible after binning numeric columns that have many unique values. Let's create a new column, `AreaBin`, that cuts `GrLivArea` into five categories each with the same number of observations.

In [19]:
df['AreaBin'] = pd.qcut(df['GrLivArea'], 5)
df.head(3)

,Neighborhood,OverallQual,YearBuilt,Exterior1st,Foundation,GrLivArea,SalePrice,AreaBin
0,CollgCr,7,2003,VinylSd,PConc,1710,208500,"(1578.0, 1869.0]"
1,Veenker,6,1976,MetalSd,CBlock,1262,181500,"(1066.6, 1339.0]"
2,CollgCr,7,2001,VinylSd,PConc,1786,223500,"(1578.0, 1869.0]"


We can now use this column like any other grouping column. Below, we find the median price for houses in each bin.

In [20]:
df.groupby('AreaBin')['SalePrice'].median().round(-3)


AreaBin
(333.999, 1066.6]    120000.0
(1066.6, 1339.0]     145000.0
(1339.0, 1578.0]     174000.0
(1578.0, 1869.0]     193000.0
(1869.0, 5642.0]     250000.0
Name: SalePrice, dtype: float64
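Because the binned column is categorical, `groupby` has to decide whether bins with no rows should appear in the result. That is controlled by the `observed` parameter, and recent pandas versions may warn when it is left unspecified. Here is a minimal sketch on a made-up DataFrame (the column names are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'area': [400, 900, 2500], 'price': [100, 200, 300]})  # made-up
toy['area_bin'] = pd.cut(toy['area'], bins=[0, 1000, 2000, 3000])

# observed=False keeps every bin, including the empty (1000, 2000]
all_bins = toy.groupby('area_bin', observed=False)['price'].size()
print(all_bins)

# observed=True reports only bins that actually contain rows
seen_bins = toy.groupby('area_bin', observed=True)['price'].size()
print(seen_bins)
```

Passing the parameter explicitly makes the intent clear either way.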

Here, we create a pivot table of the median price by `Foundation` and `AreaBin`.

In [21]:
df.pivot_table(index='Foundation', columns='AreaBin', 
               values='SalePrice', aggfunc='median')


AreaBin,"(333.999, 1066.6]","(1066.6, 1339.0]","(1339.0, 1578.0]","(1578.0, 1869.0]","(1869.0, 5642.0]"
Foundation,,,,,
BrkTil,88000.0,114000.0,137500.0,140000.0,189000.0
CBlock,124000.0,141000.0,154000.0,165325.0,216000.0
PConc,132625.0,164990.0,188750.0,224900.0,280000.0
Slab,88750.0,91500.0,118858.0,127300.0,144000.0
Stone,116000.0,102776.0,,,201489.5
Wood,,,143000.0,164000.0,250000.0


## Exercises

Use the `bikes` DataFrame for the following exercises.

In [22]:
bikes = pd.read_csv('../data/bikes.csv')
bikes.head(3)

,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
0,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,11.0,Michigan Ave & Oak St,15.0,73.9,12.7,mostlycloudy
1,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,31.0,Wells St & Walton St,19.0,69.1,6.9,partlycloudy
2,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,15.0,Dearborn St & Monroe St,23.0,73.0,16.1,mostlycloudy


In [24]:
import pandas as pd
import numpy as np

# ------------------------------------------------------------------------------
# SETUP: Load and Validate Data
# ------------------------------------------------------------------------------
bikes = pd.read_csv('../data/bikes.csv')

# Architect's Note:
# Binning transforms continuous numerical data into discrete categories.
# This is crucial for handling outliers, non-linear relationships, and simplifying reporting.

# ------------------------------------------------------------------------------
# EXERCISE 1: Custom Binning (0-100, 101-1000, 1001+)
# ------------------------------------------------------------------------------
def classify_trip_duration(df: pd.DataFrame) -> pd.Series:
    """
    Categorizes trips into custom logical buckets.
    Explanation:
    - pd.cut() allows defining specific boundaries.
    - We use -1 instead of 0 for the lower bound to strictly include 0 if present.
    - Labels make the result human-readable immediately.
    """
    return pd.cut(
        df['tripduration'],
        bins=[-1, 100, 1000, np.inf],
        labels=['Short (0-100)', 'Medium (101-1000)', 'Long (1001+)']
    ).value_counts().sort_index()

# ------------------------------------------------------------------------------
# EXERCISE 2: Equal-Width Binning
# ------------------------------------------------------------------------------
def equal_width_analysis(df: pd.DataFrame) -> pd.Series:
    """
    Cuts data into 5 bins of equal 'range' (width).
    Explanation:
    - bins=5 calculates (max - min) / 5.
    - ISSUE: In distributions with outliers (power law), this is often useless.
      Most data clusters in the first bin, while empty bins stretch to the outliers.
    """
    return pd.cut(df['tripduration'], bins=5).value_counts().sort_index()

# ------------------------------------------------------------------------------
# EXERCISE 3: Equal-Frequency (Quantile) Binning
# ------------------------------------------------------------------------------
def equal_freq_analysis(df: pd.DataFrame) -> pd.Series:
    """
    Cuts data into 5 bins with equal 'counts' (quantiles).
    Explanation:
    - pd.qcut() splits data so each bin has ~20% of the rows.
    - RESULT: Much more useful for skewed data (like trip durations) as it 
      reveals the distribution relative to the population density.
    """
    return pd.qcut(df['tripduration'], q=5).value_counts().sort_index()

# ------------------------------------------------------------------------------
# EXERCISE 4: Bivariate Quantile Analysis (Crosstab)
# ------------------------------------------------------------------------------
def duration_temp_crosstab(df: pd.DataFrame) -> pd.DataFrame:
    """
    Analyzes the relationship between Temperature and Trip Duration.
    Explanation:
    - We discretize BOTH variables into quantiles (low, med, high, etc.).
    - pd.crosstab() creates a frequency matrix (Heatmap data).
    - Pattern Search: Do long trips happen more in mild weather?
    """
    return pd.crosstab(
        index=pd.qcut(df['tripduration'], q=5, labels=['Shortest', 'Short', 'Med', 'Long', 'Longest']),
        columns=pd.qcut(df['temperature'], q=5, labels=['Coldest', 'Cold', 'Med', 'Hot', 'Hottest'])
    )

# ------------------------------------------------------------------------------
# EXERCISE 5: Pivot Table with Binned Columns
# ------------------------------------------------------------------------------
def pivot_duration_gender_temp(df: pd.DataFrame) -> pd.DataFrame:
    """
    Calculates Average Trip Duration by Gender across Temperature Deciles.
    Explanation:
    - We bin temperature into 10 buckets (deciles) inside the pivot_table call.
    - This creates a detailed profile of how behavior changes with weather.
    """
    return df.pivot_table(
        index='gender',
        columns=pd.qcut(df['temperature'], q=10),
        values='tripduration',
        aggfunc='mean'
    )

# ------------------------------------------------------------------------------
# EXERCISE 6: Handling Data Quality & Contextual Binning
# ------------------------------------------------------------------------------
def clean_and_bin_temperature(df: pd.DataFrame) -> pd.Series:
    """
    1. Identifies and removes the anomaly (9999 or similar).
    2. Bins valid data into semantic categories (Cold -> Hot).
    3. Counts occurrences, explicitly tracking Missing Values (NaN).
    """
    # 1. Clean: Replace impossible temps (e.g., > 150F) with NaN
    # Architect's Note: Using .where() for vectorized replacement
    clean_temp = df['temperature'].where(df['temperature'] < 150, np.nan)
    
    # 2. Bin: Define semantic boundaries (Fahrenheit assumptions)
    # Cold: <40, Cool: 40-55, Mild: 55-70, Warm: 70-85, Hot: >85
    return pd.cut(
        clean_temp,
        bins=[-np.inf, 40, 55, 70, 85, np.inf],
        labels=['Cold', 'Cool', 'Mild', 'Warm', 'Hot']
    ).value_counts(dropna=False).sort_index() # dropna=False keeps NaNs visible

# ------------------------------------------------------------------------------
# EXECUTION & ANALYSIS
# ------------------------------------------------------------------------------
print("--- Ex 1: Custom Binning (0-100, 101-1000, 1001+) ---")
print(classify_trip_duration(bikes))

print("\n--- Ex 2: Equal Width Binning (Does it make sense?) ---")
print(equal_width_analysis(bikes))
print("Architect's Verdict: NO. The data is heavily right-skewed. Bin 1 contains 99% of data.")

print("\n--- Ex 3: Equal Frequency Binning (Quantiles) ---")
print(equal_freq_analysis(bikes))
print("Architect's Verdict: YES. This creates balanced groups for comparison.")

print("\n--- Ex 4: Duration vs Temperature Crosstab ---")
print(duration_temp_crosstab(bikes))

print("\n--- Ex 5: Avg Duration by Gender & Temp Deciles ---")
print(pivot_duration_gender_temp(bikes).iloc[:, :3]) # Showing first 3 deciles for brevity

print("\n--- Ex 6: Semantic Weather Buckets (with NaNs) ---")
print(clean_and_bin_temperature(bikes))

--- Ex 1: Custom Binning (0-100, 101-1000, 1001+) ---
tripduration
Short (0-100)          242
Medium (101-1000)    39669
Long (1001+)         10178
Name: count, dtype: int64

--- Ex 2: Equal Width Binning (Does it make sense?) ---
tripduration
(-26.128, 17285.6]    50060
(17285.6, 34511.2]       11
(34511.2, 51736.8]        9
(51736.8, 68962.4]        3
(68962.4, 86188.0]        6
Name: count, dtype: int64
Architect's Verdict: NO. The data is heavily right-skewed. Bin 1 contains 99% of data.

--- Ex 3: Equal Frequency Binning (Quantiles) ---
tripduration
(59.999, 317.0]      10043
(317.0, 480.0]       10011
(480.0, 682.0]       10024
(682.0, 1007.0]       9997
(1007.0, 86188.0]    10014
Name: count, dtype: int64
Architect's Verdict: YES. This creates balanced groups for comparison.

--- Ex 4: Duration vs Temperature Crosstab ---
temperature   Coldest  Cold   Med   Hot  Hottest
tripduration                                    
Shortest         2712  2204  1931  1670     1526
Short       

  return df.pivot_table(


### Exercise 1

<span style="color:green; font-size:16px">Find the number of rides between trip durations of 0 to 100, 101 to 1000, and 1001 and above.</span>

In [32]:
classify_trip_duration(bikes)

tripduration
Short (0-100)          242
Medium (101-1000)    39669
Long (1001+)         10178
Name: count, dtype: int64

### Exercise 2

<span style="color:green; font-size:16px">Cut the trip duration into five bins where the width of each bin is the same size. Count the occurrence of each bin. Sort the resulting Series by the index. Does it make sense to use this type of binning?</span>

In [27]:
equal_width_analysis(bikes)

tripduration
(-26.128, 17285.6]    50060
(17285.6, 34511.2]       11
(34511.2, 51736.8]        9
(51736.8, 68962.4]        3
(68962.4, 86188.0]        6
Name: count, dtype: int64

### Exercise 3

<span style="color:green; font-size:16px">Cut the trip duration into five bins where the number of observations in each bin is approximately the same. Count the occurrence of each bin. Sort the resulting Series by the index. Does it make sense to use this type of binning?</span>

In [28]:
equal_freq_analysis(bikes)

tripduration
(59.999, 317.0]      10043
(317.0, 480.0]       10011
(480.0, 682.0]       10024
(682.0, 1007.0]       9997
(1007.0, 86188.0]    10014
Name: count, dtype: int64

### Exercise 4

<span style="color:green; font-size:16px">Quantile cut trip duration and temperature into five equal-sized bins and count the occurrences using `pd.crosstab`. Do you notice any patterns?</span>

In [29]:
duration_temp_crosstab(bikes)



temperature,Coldest,Cold,Med,Hot,Hottest
tripduration,,,,,
Shortest,2712,2204,1931,1670,1526
Short,2412,1947,2068,1927,1657
Med,2151,2103,2067,1917,1786
Long,1832,1940,2180,2101,1944
Longest,1458,1754,2328,2331,2143


### Exercise 5

<span style="color:green; font-size:16px">Create a pivot table containing the average trip duration by gender and temperature quantile cut into 10 equal-sized bins.</span>

In [30]:

pivot_duration_gender_temp(bikes).iloc[:, :3]



temperature,"(-9999.001, 37.0]","(37.0, 48.0]","(48.0, 55.9]"
gender,,,
Female,796.886049,670.031379,762.476715
Male,586.97294,647.648968,622.234574


### Exercise 6

<span style="color:green; font-size:16px">The temperature column has a single obviously wrong value. Replace this value with the numpy nan object and then cut the resulting Series into five bins, labeling them 'cold', 'cool', 'mild', 'warm', 'hot'. Choose the boundaries of the bins that make sense for these labels. Then count the occurrence of each label and include the missing values.</span>

In [31]:

clean_and_bin_temperature(bikes)

temperature
Cold     6434
Cool     8483
Mild    14819
Warm    18355
Hot      1998
Name: count, dtype: int64
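One thing to check in the counts above: they add up to 50,089, the full number of rides, and no missing-value row appears even though `dropna=False` was requested. The first decile edge in Exercise 5, -9999.001, suggests the bad reading is a large negative sentinel, which an upper-bound test such as `< 150` alone would leave in place and count as cold. The sketch below uses a small made-up Series (not the bikes data) and masks on both sides. The plausible range of -100 to 150 degrees is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

temp = pd.Series([-9999.0, 35.0, 60.0, 72.0, 90.0])  # made-up readings with one sentinel

# Keep only plausible Fahrenheit readings (assumed range); everything else becomes NaN
clean = temp.where(temp.between(-100, 150))

binned = pd.cut(
    clean,
    bins=[-np.inf, 40, 55, 70, 85, np.inf],
    labels=['cold', 'cool', 'mild', 'warm', 'hot'],
)

# dropna=False keeps the masked reading visible as a missing value
print(binned.value_counts(dropna=False, sort=False))
```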