# CALCULATING STREAKS

In [None]:
https://joshdevlin.com/blog/calculate-streaks-in-pandas/?utm_campaign=Data_Elixir&utm_source=Data_Elixir_285

In [2]:
import pandas as pd
import numpy as np

## 1/ Creating the Set of Data

In [13]:
# set a random seed using np.random.seed() to give us reproducible results.
# use np.random.choice() to generate an array of make and miss strings.
# convert that array to a column in a pandas dataframe.


np.random.seed(23)
seq = np.random.choice(['make', 'miss'], size=10, p=[.65, .35])   # p gives the probability assigned to each value in the list
streaks = pd.Series(seq, name='result').to_frame()
streaks

Unnamed: 0,result
0,make
1,miss
2,miss
3,make
4,make
5,miss
6,make
7,make
8,make
9,make


## 2/ Identify the start of each streak

In [None]:
The first step in calculating our streak in pandas is to identify the start of each streak. 
We’ll do this by using Series.shift() to create a new series with each row shifted down one position. 
We’ll then use Series.ne() to compare the original and shifted series and flag the rows where they are not equal.
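
To make the shift-and-compare step easier to follow, here is a minimal sketch on a small hand-written sample. The `demo`, `previous`, `is_start`, and `comparison` names are made up for illustration and are not part of the notebook's `streaks` dataframe.

```python
import pandas as pd

# Hypothetical four-shot sample, separate from the notebook's `streaks` frame.
demo = pd.Series(['make', 'make', 'miss', 'make'], name='result')

# shift() moves every value down one row; the first row has no predecessor, so it becomes NaN.
previous = demo.shift()

# ne() is True wherever a value differs from the one before it.
# The first row is compared against NaN, so it is always flagged as a start.
is_start = demo.ne(previous)

# Put the three series side by side to inspect the comparison row by row.
comparison = pd.DataFrame({'result': demo, 'previous': previous, 'start_of_streak': is_start})
print(comparison)
print(is_start.tolist())  # [True, False, True, True]
```

Seeing the shifted column next to the original shows why the first row always counts as the start of a streak.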

In [14]:
streaks['start_of_streak'] = streaks['result'].ne(streaks['result'].shift())
streaks

Unnamed: 0,result,start_of_streak
0,make,True
1,miss,True
2,miss,False
3,make,True
4,make,False
5,miss,True
6,make,True
7,make,False
8,make,False
9,make,False


## 3/ Creating Streak IDs

In [None]:
To calculate this column, we’re going to use Series.cumsum() to calculate the cumulative sum of our start_of_streak column. 
The cumsum() method treats True as 1 and False as 0, so the running total increases by one at every True value. 
Because each True marks the start of a streak, every row in the same streak ends up with the same ID, as the output below shows.
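
As a quick sketch of that behaviour in isolation, the cumulative sum of a boolean series only steps up on True values. The `flags` and `ids` names here are hypothetical and used only for this illustration.

```python
import pandas as pd

# Hypothetical start-of-streak flags; True marks the first row of a new streak.
flags = pd.Series([True, False, False, True, True, False])

# True counts as 1 and False as 0, so the running total only increases on True.
ids = flags.cumsum()
print(ids.tolist())  # [1, 1, 1, 2, 3, 3]
```

Rows that follow a True with False keep the same running total, which is what lets the total act as a streak ID.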

In [15]:
streaks['streak_id'] = streaks['start_of_streak'].cumsum()
streaks

Unnamed: 0,result,start_of_streak,streak_id
0,make,True,1
1,miss,True,2
2,miss,False,2
3,make,True,3
4,make,False,3
5,miss,True,4
6,make,True,5
7,make,False,5
8,make,False,5
9,make,False,5


## 4/ Counting Streaks

In [None]:
We’re going to group by our streak_id and then use GroupBy.cumcount() to count each streak. 
Note that cumcount() starts counting from zero, so we’ll have to add one to the result that we get.
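
A small sketch of the zero-based counting, using a made-up `demo` frame with hand-assigned streak IDs rather than the notebook's data:

```python
import pandas as pd

# Hypothetical streak IDs: three streaks of length 2, 3 and 1.
demo = pd.DataFrame({'streak_id': [1, 1, 2, 2, 2, 3]})

# cumcount() numbers the rows within each group, starting at 0.
zero_based = demo.groupby('streak_id').cumcount()
print(zero_based.tolist())  # [0, 1, 0, 1, 2, 0]

# Adding 1 turns the position within the group into a streak length so far.
one_based = zero_based + 1
print(one_based.tolist())   # [1, 2, 1, 2, 3, 1]
```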

In [16]:
streaks['streak_counter'] = streaks.groupby('streak_id').cumcount() + 1
streaks

Unnamed: 0,result,start_of_streak,streak_id,streak_counter
0,make,True,1,1
1,miss,True,2,1
2,miss,False,2,2
3,make,True,3,1
4,make,False,3,2
5,miss,True,4,1
6,make,True,5,1
7,make,False,5,2
8,make,False,5,3
9,make,False,5,4


## 5/ Creating a Streak Function

In [None]:
If we are going to be calculating streaks a lot, we can create a function that uses the logic we’ve just outlined. 
Our function will take a dataframe with a result column containing make and miss data and return that same dataframe with an added streak_counter column.

The format of the make/miss data doesn’t matter as long as the two outcomes are represented by distinct values. 
It could be make and miss strings as in our example above, Y and N strings, or True and False values. 

Because we start by shifting and comparing the values, any pair of distinct values will work.
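
To illustrate that claim, here is a sketch that applies the same three steps to True/False values instead of strings. The `made`, `start`, `streak_id`, and `counter` names and the sample data are assumptions made for this example.

```python
import pandas as pd

# Hypothetical results stored as booleans rather than 'make'/'miss' strings.
made = pd.Series([True, True, False, True, True, True], name='result')

# Same logic as before: flag changes, turn the flags into IDs, count within each ID.
start = made.ne(made.shift())
streak_id = start.cumsum()
counter = made.groupby(streak_id).cumcount() + 1

print(counter.tolist())  # [1, 2, 1, 1, 2, 3]
```

Nothing in these steps looks at what the values mean, only at whether a value differs from the one before it.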

In [18]:
def generate_streak_info(shots):
    """
    Parameters
    ----------
    
    shots:
        A dataframe containing data about shots.
        Must contain a `result` column with two distinct values for made and missed shots.
        Must be homogeneous (contain only shots that qualify for the streak type you want to calculate, e.g. all FT for a single player) and be pre-sorted by time.

    Returns
    -------

    shots_with_streaks:
        The original dataframe with a new column `streak_counter` containing integers with counts for each streak.
    """
    
    data = shots['result'].to_frame()
    data['start_of_streak'] = data['result'].ne(data['result'].shift())
    data['streak_id'] = data.start_of_streak.cumsum()
    data['streak_counter'] = data.groupby('streak_id').cumcount() + 1
    shots_with_streaks = pd.concat([shots, data['streak_counter']], axis=1)
    return shots_with_streaks

shots = streaks['result'].to_frame()
generate_streak_info(shots)

Unnamed: 0,result,streak_counter
0,make,1
1,miss,1
2,miss,2
3,make,1
4,make,2
5,miss,1
6,make,1
7,make,2
8,make,3
9,make,4
