# Series Deep Dive

This notebook covers core pandas series concepts using data from the US Fuel Economy Website.

## Loading Data


In [1]:
import pandas as pd

url = "./data/vehicles.csv"

df = pd.read_csv(url)

# Assign the column 'city08' to a new variable
city_mpg = df.city08

# Assign the column 'highway08' to a new variable
highway_mpg = df.highway08



In [2]:
# Preview the city_mpg series 
city_mpg

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int64

In [3]:
# Preview the highway data
highway_mpg

0        25
1        14
2        33
3        12
4        23
         ..
41139    26
41140    28
41141    24
41142    24
41143    21
Name: highway08, Length: 41144, dtype: int64

## Series Attributes

Pandas provides many methods and attributes for working with a Series. The built-in `dir` function lists all of the attributes available. They fall into a few broad groups:
- Dunder methods (.__add__, .__iter__, etc.) provide many numeric operations, looping, attribute access, and index access. The numeric operations return a Series.
- Corresponding operator methods for many of the numeric operations allow us to tweak the behavior (there is an .add method in addition to .__add__).
- Aggregate methods and properties reduce or aggregate the values in a series down to a single scalar value. The .mean, .max, and .sum methods and the .is_monotonic_increasing property are all examples.
- Conversion methods. Some of these start with .to_ and export the data to other formats.
- Manipulation methods such as .sort_values and .drop_duplicates, which return Series objects with the same index.
- Indexing and accessor methods and attributes such as .loc and .iloc. These return Series or scalars.
- String manipulation methods using .str.
- Date manipulation methods using .dt.
- Plotting methods using .plot.
- Categorical manipulation methods using .cat.
- Transformation methods such as .unstack, .reset_index, .agg, and .transform.

In [4]:
len(dir(city_mpg))

420

## Vectorized Operations

We can apply most math operations to a series together with another series or with a scalar. When you operate on two series, pandas aligns the indexes before performing the operation. Aligning takes each index entry in the left series and matches it up with every entry that has the same label in the index of the right series. The operation is then applied to each matched pair (for addition, values with the same index label are added together).

Hence, before performing vector operations, please make sure:
- Indexes are unique
- Indexes are common to both series



In [5]:
(city_mpg + highway_mpg) / 2

0        22.0
1        11.5
2        28.0
3        11.0
4        20.0
         ... 
41139    22.5
41140    24.0
41141    21.0
41142    21.0
41143    18.5
Length: 41144, dtype: float64

> If these conditions do not hold, you will get missing values or a combinatorial explosion of results. Here is a simple example of two series that have repeated index entries as well as non-common entries:

In [6]:
s1 = pd.Series([10, 20, 30], index=[1, 2, 2])

s2 = pd.Series([100, 200, 300], index=[2, 2, 4])

s1 + s2

1      NaN
2    120.0
2    220.0
2    130.0
2    230.0
4      NaN
dtype: float64
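
The two conditions listed earlier can be checked before operating on a pair of series. The following is a minimal sketch; the helper name `safe_to_align` is hypothetical and not part of pandas, and the small series are made up for illustration.

```python
import pandas as pd


def safe_to_align(left, right):
    """Return True if both indexes are unique and hold the same labels."""
    both_unique = left.index.is_unique and right.index.is_unique
    # Sort before comparing so that label order does not matter
    same_labels = left.index.sort_values().equals(right.index.sort_values())
    return both_unique and same_labels


s1 = pd.Series([10, 20, 30], index=[1, 2, 2])
s2 = pd.Series([100, 200, 300], index=[2, 2, 4])
s3 = pd.Series([1, 2, 3], index=[1, 2, 3])
s4 = pd.Series([10, 20, 30], index=[3, 2, 1])

print(safe_to_align(s1, s2))  # False: repeated and non-shared labels
print(safe_to_align(s3, s4))  # True: same unique labels in a different order
print(s3 + s4)                # values are matched by label, not by position
```

A check like this is cheap compared with debugging unexpected `NaN` values or duplicated rows after the fact.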

## Broadcasting

When performing math operations with a scalar, pandas broadcasts the operation to all values. Broadcasting has another advantage: many math operations are optimized and run very quickly on the CPU. This is called vectorization. (A numeric pandas series is a block of memory, and modern CPUs leverage a technology called Single Instruction/Multiple Data (SIMD) to apply a math operation to the block of memory.)
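
As a small illustration of scalar broadcasting, the sketch below uses a three-value stand-in series (not the vehicles data) and shows that the operator form and the method form give the same result.

```python
import pandas as pd

mpg = pd.Series([19, 9, 23], name="city08")

# The scalar is applied to every value; no explicit loop is written
doubled = mpg * 2
shifted = mpg + 10

# The method form is equivalent to the operator form
doubled_method = mpg.mul(2)

print(doubled.tolist())                # [38, 18, 46]
print(shifted.tolist())                # [29, 19, 33]
print(doubled.equals(doubled_method))  # True
```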

## Iteration

> Note that there is also an .__iter__ method on a series, so you can loop over its items. However, I recommend avoiding a for loop with a series. It is a code smell, indicating that you are probably doing things the wrong way, because you give up one of the benefits of pandas: vectorization and operating at the C level.
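
To make the contrast concrete, here is a minimal sketch on made-up values. It shows a Python-level loop and the vectorized method producing the same boolean series; the threshold of 15 is arbitrary.

```python
import pandas as pd

mpg = pd.Series([19, 9, 23, 10, 17])

# Loop version: each value is handled one at a time in Python
looped = pd.Series([val > 15 for val in mpg], index=mpg.index)

# Vectorized version: one call that operates on the whole series
vectorized = mpg.gt(15)

print((looped == vectorized).all())  # True
```

Both give the same answer, so the vectorized form can usually replace the loop without changing results.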

## Chaining

Chaining manipulations makes the code easier to read and understand, because each step sits on its own line.

In [7]:
(
    city_mpg.add(highway_mpg)
    .div(2)
)

0        22.0
1        11.5
2        28.0
3        11.0
4        20.0
         ... 
41139    22.5
41140    24.0
41141    21.0
41142    21.0
41143    18.5
Length: 41144, dtype: float64

## Aggregate Methods

These collapse the values of a series down to a scalar. Aggregations are the numbers that we report.
For example, if your superior asked for a sales report, it would answer questions such as:
- How many people came in (count)
- How much food was ordered (count)
- What was the total revenue (sum)
- When did people come in (skew)

Aggregations allow you to take a detailed data sample and collapse it to a single value.

In [8]:
# To calculate the mean 
city_mpg.mean()

18.369045304297103

In [9]:
# Aggregation Properties
city_mpg.is_monotonic_increasing

False

In [10]:
# Using the Quantile Method
city_mpg.quantile(0.25)

15.0

In [11]:
# To return multiple quantile values
city_mpg.quantile([0.25, 0.5, 0.75])

0.25    15.0
0.50    17.0
0.75    20.0
Name: city08, dtype: float64

## Count and Mean of an Attribute

To count values that meet some criteria, the sum method can be used. 

In [12]:
(city_mpg
 .gt(20) # Returns a boolean series
 .sum() # Returns the sum of the boolean series
)

10272

In [13]:
(city_mpg
 .gt(20) # Returns a boolean series 
 .mul(100) # Returns a series of the same length as the boolean series after multiplying by 100
 .mean() # Returns the mean of the series
)

24.965973167412017

>The code above works because Python treats True as 1 and False as 0.

## .agg and Aggregation Strings

The .agg method takes aggregations a step further; it can also transform the data, depending on how it is called.
Where .agg shines is in its ability to perform multiple aggregations at once. In that case, it returns a series with one entry per aggregation. You can pass in the names of aggregation methods, NumPy reduction functions, Python aggregations, or your own aggregation function.

In [14]:
import numpy as np

# Defining my own function
def last_value(s):
    """Returns the last value in a series"""
    return s.iloc[-1]


(city_mpg
 .agg(["mean", np.var, last_value, "max", "quantile"])
)

mean           18.369045
var            62.503036
last_value     16.000000
max           150.000000
quantile       17.000000
Name: city08, dtype: float64

## Conversion Methods

Sometimes there is a need to change the type of the data being used. This may be due to formats that do not include type information, or the need to use less memory.

The city_mpg data is loaded with dtype int64. This is overkill, as the data ranges between 6 and 150. To specify the type for a series, the `.astype()` method can be used. Using the correct type can save significant amounts of memory. The default numeric type is 8 bytes wide (64 bits, i.e. int64 or float64). If you can use a narrower type, you can cut back on memory usage, leaving you memory to process more data.

In [15]:
city_mpg.describe()

count    41144.000000
mean        18.369045
std          7.905886
min          6.000000
25%         15.000000
50%         17.000000
75%         20.000000
max        150.000000
Name: city08, dtype: float64

In [16]:
# Inspecting limits of data types
import numpy as np

# Returns the information about the data type uint8
# np.iinfo("uint8")

# Returns the information about the data type np.int8
np.iinfo(np.int8)

# Returns the information about the data type np.int16
# np.iinfo(np.int16)

# Returns the information about the data type np.int32
# np.iinfo(np.int32)

# Returns the information about the data type np.int64
# np.iinfo(np.int64)

# Returns information about the data type float16
# np.finfo(np.float16)

# Returns information about the data type float32
# np.finfo(np.float32)

# Returns information about the data type float64
# np.finfo(np.float64)

iinfo(min=-128, max=127, dtype=int8)

In [17]:
# Specifying a type

(city_mpg
 .astype(np.int16) 
).describe()

count    41144.000000
mean        18.369045
std          7.905886
min          6.000000
25%         15.000000
50%         17.000000
75%         20.000000
max        150.000000
Name: city08, dtype: float64

> After the dtype conversion, please cross-check the min and max values to ensure that they are the same.
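
One way to make that cross-check systematic is to compare the data range with the limits reported by `np.iinfo` before converting. This is a sketch only; the helper name `fits_in` is hypothetical, and the three-value series just mimics the 6 to 150 range of city_mpg.

```python
import numpy as np
import pandas as pd


def fits_in(series, dtype):
    """Return True if every value of an integer series fits in `dtype`."""
    info = np.iinfo(dtype)
    return bool(info.min <= series.min() and series.max() <= info.max)


mpg = pd.Series([6, 19, 150])  # stand-in with the same min and max as city_mpg

print(fits_in(mpg, np.int8))   # False: int8 stops at 127
print(fits_in(mpg, np.uint8))  # True: uint8 covers 0 to 255
print(fits_in(mpg, np.int16))  # True
```

This is why int8 is not a safe choice for this column even though it is the narrowest signed type.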

## Memory Usage

To determine the memory usage of a series, use either the `.nbytes` property or the `.memory_usage` method. For string/object dtypes, use `.memory_usage(deep=True)` to include
the amount of memory used by the Python objects in the Series.

In [18]:
(
    city_mpg
    .nbytes
)

329152

In [19]:
(
    city_mpg
    # Convert to int16
    .astype(np.int16)
    # Check the memory usage
    .nbytes
)

82288

Using .nbytes with object types only shows how much memory the pandas object itself is taking.
The make column holds strings and is stored with the object dtype. To get the amount of memory that
includes the strings, we need to use the .memory_usage method:

In [20]:

(
    df['make']
    .memory_usage(deep=True)
)


2606395

> The value of .nbytes is just the memory that the data is using and not the ancillary parts of the
Series. The .memory_usage includes the index memory and can include the contribution from object
types.
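
The difference between these measurements can be seen on a tiny example. The sketch below uses made-up series rather than the vehicles data, and sets `dtype=object` explicitly so the string case does not depend on the pandas version's default string type.

```python
import pandas as pd

nums = pd.Series([1, 2, 3], dtype="int64")
print(nums.nbytes)                     # 24: three 8-byte integers
print(nums.memory_usage(index=False))  # 24: same data, index excluded
print(nums.memory_usage())             # larger, because the index is included

words = pd.Series(["Ford", "Tesla", "Ford"], dtype=object)
shallow = words.memory_usage(index=False)
deep = words.memory_usage(index=False, deep=True)
print(shallow, deep)  # deep also counts the Python string objects
```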

In [21]:
# What does the make data look like?

(
    df['make']
    .value_counts()
)

Chevrolet                      4003
Ford                           3371
Dodge                          2583
GMC                            2494
Toyota                         2071
                               ... 
Volga Associated Automobile       1
Panos                             1
Mahindra                          1
Excalibur Autos                   1
London Coach Co Inc               1
Name: make, Length: 136, dtype: int64

## Conversion to categorical Variable

Categorical types are useful for string data with repeated values and can result in large memory savings. When strings are converted into categorical data, pandas no longer stores a Python string for each value; it stores each distinct value once, so repeating values are not duplicated.

In [22]:
# Let's convert the make data to a categorical variable

(
    df['make']
    # Convert to category
    .astype('category')
    # Check the memory usage
    .memory_usage(deep=True)
)

95888

### Ordered Categories

To create ordered categories, you need to define your own CategoricalDtype:

In [23]:
# Create a sorted list of the unique values
values = pd.Series(sorted(set(city_mpg)))

# Set the values as an ordered categorical data type 
city_type = pd.CategoricalDtype(categories=values, ordered=True)

# Set the city_mpg column to the city_type data type
city_mpg.astype(city_type)

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: category
Categories (105, int64): [6 < 7 < 8 < 9 ... 137 < 138 < 140 < 150]
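
What an ordered categorical adds is that comparisons, `.max`, and sorting follow the category order rather than alphabetical order. A minimal sketch with made-up size labels (not from the vehicles data):

```python
import pandas as pd

size_type = pd.CategoricalDtype(
    categories=["small", "medium", "large"], ordered=True
)
sizes = pd.Series(["medium", "small", "large", "small"]).astype(size_type)

print(sizes.max())                   # large
print((sizes > "small").tolist())    # [True, False, True, False]
print(sizes.sort_values().tolist())  # ['small', 'small', 'medium', 'large']
```

With an unordered categorical, the comparison and `.max` calls would raise an error instead.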

In [24]:
# Convert Series to a DataFrame

city_mpg.to_frame()

       city08
0          19
1           9
2          23
3          10
4          17
...       ...
41139      19
41140      20
41141      18
41142      18


## Manipulation Methods

The manipulation methods are the workhorses of pandas. They can be used to understand and clean data.

### .apply and .where
The `.apply` method should be avoided whenever possible. It calls a Python function on each individual value in a series, whereas a NumPy or pandas operation broadcasts over the whole series at once. If you have one million values in a series, the function will be called one million times.

In [25]:
# Defining Functions
import timeit

def gt20(val):
    """Return True if the value is greater than 20"""
    return val > 20


> The cell below times the `.apply` version. The broadcasted equivalent, `city_mpg.gt(20)`, skips the per-value function call and is about 50 times faster.

In [26]:
%%timeit
(
    city_mpg
    .apply(gt20)
)

5.96 ms ± 1.46 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
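
To compare the two approaches side by side outside of `%%timeit`, both can be timed with the standard `timeit` module. This is a minimal sketch on stand-in data (an integer range, not the vehicles file); the exact timings depend on the machine and the pandas version, so no numbers are claimed here.

```python
import timeit

import pandas as pd


def gt20(val):
    """Return True if the value is greater than 20."""
    return val > 20


mpg = pd.Series(range(50_000))  # stand-in data for illustration

# Time 10 runs of each approach
apply_time = timeit.timeit(lambda: mpg.apply(gt20), number=10)
broadcast_time = timeit.timeit(lambda: mpg.gt(20), number=10)

print(f"apply: {apply_time:.4f}s, broadcast: {broadcast_time:.4f}s")

# Both approaches give the same answer
print((mpg.apply(gt20) == mpg.gt(20)).all())
```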


> In the code below, I want to limit the make column in the dataset to the top 5 makes and label everything else `Other`.

In [27]:
# Select the make of a car
make = df['make']

top_5_makes = (
    make
    # Get the value counts for each make (sorted from most to least common)
    .value_counts()
    # Keep the names of the top 5 makes (the index labels, not the counts)
    .index[:5]
)


In [28]:
%%timeit
# Keep the top 5 makes and replace the rest with Other

(
    make
    # Keep the make where it is one of the top 5 makes, otherwise fill with "Other"
    .where(make.isin(top_5_makes), other='Other')
)


1.79 ms ± 57.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [29]:
# We could define a function that does the same
def generalize_top5(val):
    """Generalize the top 5 makes"""
    if val in top_5_makes:
        return val
    else:
        return 'Other'

In [30]:
%%timeit
make.apply(generalize_top5)

27.9 ms ± 685 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


> The `.where` method is faster than the `.apply` method here (1.79 ms versus 27.9 ms in the timings above).

## If Else with Pandas

If I wanted to keep the top five makes and use Top10 for the remainder of the top ten makes, with
Other for the rest, there is no single built-in pandas method to do that. I could use the following function in combination with .apply:

In [31]:
vc = make.value_counts()

top_5 = vc.index[:5]

top_10 = vc.index[:10]

def generalize(val):
    """Keep top 5 makes, label the rest of the top 10 'Top10', and all others 'Other'"""
    if val in top_5:
        return val
    elif val in top_10:
        return 'Top10'
    else:
        return 'Other'
    
make.apply(generalize)

0        Other
1        Other
2        Dodge
3        Dodge
4        Other
         ...  
41139    Other
41140    Other
41141    Other
41142    Other
41143    Other
Name: make, Length: 41144, dtype: object

In [32]:
# To replicate this with chained .where calls

(
    make
    # Keep the top 5 makes; label everything else "Top10" for now
    .where(make.isin(top_5), 'Top10')
    # Keep rows whose original make is in the top 10; label the rest "Other"
    .where(make.isin(top_10), 'Other')
)

0        Other
1        Other
2        Dodge
3        Dodge
4        Other
         ...  
41139    Other
41140    Other
41141    Other
41142    Other
41143    Other
Name: make, Length: 41144, dtype: object

Another option is the `select` function from the NumPy library.

In [33]:
np.select([make.isin(top_5), make.isin(top_10)], [make, 'Top10'], 'Other')

array(['Other', 'Other', 'Dodge', ..., 'Other', 'Other', 'Other'],
      dtype=object)

> The code above returns a NumPy array, which can then be converted to a series.

In [34]:
pd.Series(np.select([make.isin(top_5), make.isin(top_10)], [make, 'Top10'], 'Other'))

0        Other
1        Other
2        Dodge
3        Dodge
4        Other
         ...  
41139    Other
41140    Other
41141    Other
41142    Other
41143    Other
Length: 41144, dtype: object
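
Note that the series built this way gets a fresh default index and no name. If the original labels matter, the index and name can be passed explicitly. The sketch below uses a three-row stand-in for `make`, with hypothetical `top_1` and `top_2` lists playing the role of `top_5` and `top_10`.

```python
import numpy as np
import pandas as pd

# Stand-in data with a non-default index, for illustration only
make = pd.Series(["Ford", "Saab", "Dodge"], index=[10, 20, 30],
                 name="make", dtype=object)
top_1 = ["Ford"]           # plays the role of top_5
top_2 = ["Ford", "Dodge"]  # plays the role of top_10

labels = np.select(
    [make.isin(top_1), make.isin(top_2)],  # conditions, checked in order
    [make, "Top10"],                       # value used for each condition
    "Other",                               # default when nothing matches
)

# Reattach the original index and name to the NumPy result
result = pd.Series(labels, index=make.index, name=make.name)
print(result)
```

Keeping the original index means the result still aligns with other columns taken from the same dataframe.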

## Missing Data

Filling in Missing Data is important because many machine learning models do not work if there is missing data. Also, it is prudent to be aware of how much data is missing to make sure you are getting the full story from your data.

In [35]:
cyl_df = df["cylinders"]

(
    cyl_df
    # Check if data is missing
    .isna()
    # Count the missing values (True counts as 1)
    .sum()
)

206

In [36]:
# Get further context about which makes have missing cylinder data

(
    make
    # Select make records with missing cylinder data
    .loc[cyl_df.isna()]
) 

7138     Nissan
7139     Toyota
8143     Toyota
8144       Ford
8146       Ford
          ...  
34563     Tesla
34564     Tesla
34565     Tesla
34566     Tesla
34567     Tesla
Name: make, Length: 206, dtype: object

## Filling in Missing Data

It looks like cylinder information is missing for electric vehicles. The `.fillna` method allows you to specify a replacement value for any missing data.

In [37]:
# Select record with missing cylinder value and fill with 0

(
    cyl_df
    # Fill the missing cylinder data with 0
    .fillna(0)
)

0         4.0
1        12.0
2         4.0
3         8.0
4         4.0
         ... 
41139     4.0
41140     4.0
41141     4.0
41142     4.0
41143     4.0
Name: cylinders, Length: 41144, dtype: float64

> The above operation returns a new series with the missing values replaced by zero. To update the
`cyl_df` variable, I would need to assign this new result to it.

## Interpolating Data

The `.interpolate()` method comes in handy if the data is ordered (as time series data often is) and there are holes in the data. 

In [38]:
# Interpolating ordered series

temp = pd.Series([32, 40, None, 42, 39, 32])

temp

0    32.0
1    40.0
2     NaN
3    42.0
4    39.0
5    32.0
dtype: float64

In [39]:
# Using interpolate to fill the data

temp.interpolate()

0    32.0
1    40.0
2    41.0
3    42.0
4    39.0
5    32.0
dtype: float64

> Notice that index label 2 was missing, however, there are values for 1 and 3. After interpolation, the missing value becomes 41.0.

## Clipping Data

If you have outliers in your data, you might want to use the `.clip` method.

In [40]:
city_mpg.loc[:446]

0      19
1       9
2      23
3      10
4      17
       ..
442    15
443    15
444    15
445    15
446    31
Name: city08, Length: 447, dtype: int64

In [41]:
(
    city_mpg
    .loc[:446]
    # Set values below the lower threshold to lower and values above the upper threshold to upper
    .clip(
        lower=city_mpg.quantile(0.05),
        upper=city_mpg.quantile(0.75)
    )
)

0      19.0
1      11.0
2      20.0
3      11.0
4      17.0
       ... 
442    15.0
443    15.0
444    15.0
445    15.0
446    20.0
Name: city08, Length: 447, dtype: float64

## Sorting Values

The `.sort_values` method will sort values in ascending order and also rearrange the index accordingly.

In [42]:
city_mpg.sort_values()

7901       6
34557      6
37161      6
21060      6
35887      6
        ... 
34563    138
34564    140
32599    150
31256    150
33423    150
Name: city08, Length: 41144, dtype: int64

> Note that because of index alignment, you can still do math operations (and many other
operations) on a sorted series:

In [43]:
(city_mpg.sort_values() + highway_mpg) / 2

0        22.0
1        11.5
2        28.0
3        11.0
4        20.0
         ... 
41139    22.5
41140    24.0
41141    21.0
41142    21.0
41143    18.5
Length: 41144, dtype: float64

## Sorting Index

The `.sort_index` method sorts a series by its index labels. Here it restores the original order after sorting by value.

In [44]:
city_mpg.sort_values().sort_index()

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int64

## Ranking Data

The `.rank` method will return a series that keeps the original index but uses the ranks of values from the original series. By default, if two values are the same, their rank will be the average of the positions they take.

In [45]:
city_mpg.rank()

0        27060.5
1          235.5
2        35830.0
3          607.5
4        19484.0
          ...   
41139    27060.5
41140    29719.5
41141    23528.0
41142    23528.0
41143    15479.0
Name: city08, Length: 41144, dtype: float64

In [46]:
# Use method='min' so that tied values all get the lowest rank in their group

city_mpg.rank(method='min')

0        25555.0
1          136.0
2        35119.0
3          336.0
4        17467.0
          ...   
41139    25555.0
41140    28567.0
41141    21502.0
41142    21502.0
41143    13492.0
Name: city08, Length: 41144, dtype: float64

In [47]:
# Use method='dense' so that no rank positions are skipped

city_mpg.rank(method='dense')

0        14.0
1         4.0
2        18.0
3         5.0
4        12.0
         ... 
41139    14.0
41140    15.0
41141    13.0
41142    13.0
41143    11.0
Name: city08, Length: 41144, dtype: float64
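
The three tie-handling methods are easier to compare on a tiny series where the ranks can be worked out by hand. A minimal sketch with made-up values:

```python
import pandas as pd

s = pd.Series([10, 20, 20, 30])

# Default ('average'): the tied values share the mean of positions 2 and 3
print(s.rank().tolist())                # [1.0, 2.5, 2.5, 4.0]

# 'min': the tied values both get the lowest position, and position 3 is skipped
print(s.rank(method="min").tolist())    # [1.0, 2.0, 2.0, 4.0]

# 'dense': like 'min', but the next rank follows on without a gap
print(s.rank(method="dense").tolist())  # [1.0, 2.0, 2.0, 3.0]
```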

## Replacing Data

The `.replace` method allows mapping values to new values. You can pass a single value and its replacement, or use a dictionary to
map old values to new values.

In [48]:
(
    make
    # Replace 'Subaru' with 'Subaru-Not'
    .replace('Subaru', 'Subaru-Not')
)

0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4        Subaru-Not
            ...    
41139    Subaru-Not
41140    Subaru-Not
41141    Subaru-Not
41142    Subaru-Not
41143    Subaru-Not
Name: make, Length: 41144, dtype: object

> The to_replace parameter’s value can contain a regular expression if you provide the regex=True
parameter. In this example we use regular expression capture groups (they are specified in the
expression by the parentheses).

In [49]:
(
    make
    # Using regex to replace 'Subaru'
    .replace(regex=True, to_replace=r'(Su)ba(r.*)', value=r'\2-other-\1')
)

0         Alfa Romeo
1            Ferrari
2              Dodge
3              Dodge
4        ru-other-Su
            ...     
41139    ru-other-Su
41140    ru-other-Su
41141    ru-other-Su
41142    ru-other-Su
41143    ru-other-Su
Name: make, Length: 41144, dtype: object

## Binning Data

The `pd.cut` function bins data. Passing an integer for `bins` creates that many equal-width bins.

In [50]:
pd.cut(city_mpg, bins=15)

0         (15.6, 25.2]
1        (5.856, 15.6]
2         (15.6, 25.2]
3        (5.856, 15.6]
4         (15.6, 25.2]
             ...      
41139     (15.6, 25.2]
41140     (15.6, 25.2]
41141     (15.6, 25.2]
41142     (15.6, 25.2]
41143     (15.6, 25.2]
Name: city08, Length: 41144, dtype: category
Categories (15, interval[float64, right]): [(5.856, 15.6] < (15.6, 25.2] < (25.2, 34.8] < (34.8, 44.4] ... (111.6, 121.2] < (121.2, 130.8] < (130.8, 140.4] < (140.4, 150.0]]

In [51]:
# Specifying bin edges

pd.cut(city_mpg, bins=[0,10,20,40,80,160])

0        (10, 20]
1         (0, 10]
2        (20, 40]
3         (0, 10]
4        (10, 20]
           ...   
41139    (10, 20]
41140    (10, 20]
41141    (10, 20]
41142    (10, 20]
41143    (10, 20]
Name: city08, Length: 41144, dtype: category
Categories (5, interval[int64, right]): [(0, 10] < (10, 20] < (20, 40] < (40, 80] < (80, 160]]
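
The interval labels can be replaced with readable names through the `labels` parameter of `pd.cut`. The sketch below reuses the bin edges from the cell above on a few made-up values; the label names are arbitrary choices for illustration.

```python
import pandas as pd

mpg = pd.Series([9, 19, 23, 150])

binned = pd.cut(
    mpg,
    bins=[0, 10, 20, 40, 80, 160],
    # One label per bin, so there is one fewer label than bin edges
    labels=["very low", "low", "medium", "high", "very high"],
)

print(binned.tolist())  # ['very low', 'low', 'medium', 'very high']
```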

## Indexing Operations

Many index operations work on the `index position`, while others work on the `index label`.

In [52]:
# Renaming index

# Convert the make values to a dictionary
make_dict = make.to_dict()

# Rename the city_mpg index with the make dictionary
city_mpg_2 = city_mpg.rename(make_dict)

In [53]:
# To view the index

city_mpg_2.index

Index(['Alfa Romeo', 'Ferrari', 'Dodge', 'Dodge', 'Subaru', 'Subaru', 'Subaru',
       'Toyota', 'Toyota', 'Toyota',
       ...
       'Saab', 'Saturn', 'Saturn', 'Saturn', 'Saturn', 'Subaru', 'Subaru',
       'Subaru', 'Subaru', 'Subaru'],
      dtype='object', length=41144)

In [54]:
city_mpg_2

Alfa Romeo    19
Ferrari        9
Dodge         23
Dodge         10
Subaru        17
              ..
Subaru        19
Subaru        20
Subaru        18
Subaru        18
Subaru        16
Name: city08, Length: 41144, dtype: int64

## Resetting the Index

To reset the index to monotonically increasing, and therefore unique, integers starting at zero, use the `.reset_index` method.

In [55]:
city_mpg_2.reset_index(drop=True)

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int64

> Note that you can sort the values and the index with .sort_values and .sort_index respectively.
Because those keep the same index, but just rearrange the order, they do not impact operations that
align on the index.

## The .loc Attribute

The preferred way of pulling data out with indexing operators is through the `.loc` or `.iloc` attributes.

The `.loc` attribute deals with index labels. It allows you to pull out pieces of the series. The following can be passed into an index operation on `.loc`:

- A scalar value of one of the index labels
- A list of index labels
- A slice of labels (closed interval so it includes the stop value)
- An index
- A boolean array
- A function that accepts a series and returns one of the above




In [56]:
city_mpg_2.loc[['Subaru']]

Subaru    17
Subaru    21
Subaru    22
Subaru    19
Subaru    20
          ..
Subaru    19
Subaru    20
Subaru    18
Subaru    18
Subaru    16
Name: city08, Length: 885, dtype: int64

In [57]:
# Some more indexing

city_mpg_2.loc[['Subaru', 'Ford']]

Subaru    17
Subaru    21
Subaru    22
Subaru    19
Subaru    20
          ..
Ford      26
Ford      19
Ford      21
Ford      18
Ford      19
Name: city08, Length: 4256, dtype: int64

> After multiple operations, an intermediate object you are operating on might have a
completely different index than the original object. By using a function, you will have access to the intermediate series and be able to create a row filter based on it. For series objects, this might seem like overkill, but it comes in very handy with dataframes.

In [58]:
cost = pd.Series([1.00, 2.25, 3.50, 4.75, 6.00],
          index = ["Gum", "Apple", "Orange", "Pear", "Banana"]
          )

inflation = 1.10

(
    cost
    # Multiply by inflation
    .mul(inflation)
    # Select records with costs > 3.50, based on the intermediate result after .mul
    .loc[lambda val: val > 3.5]

)

Orange    3.850
Pear      5.225
Banana    6.600
dtype: float64

## The .iloc Attribute

This attribute is analogous to .loc but with a few differences. When we slice off of this attribute, we pull out items by index position. The .iloc attribute supports indexing with the following:

- A scalar index position
- A list of index positions
- A slice of positions (half-open interval so it does not include stop value)
- A NumPy array (or Python list) of boolean values
- A function that accepts a series and returns one of the above



In [59]:
# Pulling out the first value

city_mpg_2.iloc[0]

19

In [60]:
# Pulling out the last value

city_mpg_2.iloc[-1]

16

In [61]:
# Pulling out specific records

city_mpg_2.iloc[[0,-1]]

Alfa Romeo    19
Subaru        16
Name: city08, dtype: int64

In [62]:
# Slicing the data

city_mpg_2.iloc[0:2]

Alfa Romeo    19
Ferrari        9
Name: city08, dtype: int64

In [63]:
# Return the last 10 numbers

city_mpg_2.iloc[-10:]

Saab      18
Saturn    23
Saturn    21
Saturn    24
Saturn    21
Subaru    19
Subaru    20
Subaru    18
Subaru    18
Subaru    16
Name: city08, dtype: int64

In [64]:
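# Build a boolean mask of values above 50
mask = city_mpg_2 > 50

# .iloc does not accept a boolean series, so convert the mask to a NumPy array
city_mpg_2.iloc[mask.to_numpy()]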

Nissan     81
Toyota     81
Toyota     81
Ford       74
Nissan     84
         ... 
Tesla     140
Tesla     115
Tesla     104
Tesla      98
Toyota     55
Name: city08, Length: 236, dtype: int64

## Heads and Tails

The `.head` and `.tail` methods are useful for pulling out values at the start or end of the series,
respectively. These methods are used to quickly inspect a chunk of the data. The following code
inspects the five values (the default) at the start and end:

In [65]:
city_mpg_2.head()

Alfa Romeo    19
Ferrari        9
Dodge         23
Dodge         10
Subaru        17
Name: city08, dtype: int64

In [66]:
city_mpg_2.tail()

Subaru    19
Subaru    20
Subaru    18
Subaru    18
Subaru    16
Name: city08, dtype: int64

## Sampling

While `.head` and `.tail` allow us to inspect the data, sampling can be a better
choice. Often the first few entries of the data are incomplete, test data, or not representative
of all of the values.

In [67]:
# Sampling 7 records

city_mpg_2.sample(7, random_state=42)

Volvo         16
Mitsubishi    19
Buick         27
Jeep          15
Land Rover    13
Saab          17
Mercury       20
Name: city08, dtype: int64

## Filtering Index Values

The `.filter` method will filter index labels by exact match, substring, or regular expression. These are controlled with the mutually exclusive `items`, `like`, and `regex` parameters, respectively.

> Note that exact match (with items) fails with duplicate index labels:

In [68]:
# city_mpg_2.filter(items=['Ford', 'Chevy'])

In [69]:
city_mpg_2.filter(like='Ford')

Ford    18
Ford    16
Ford    17
Ford    17
Ford    15
        ..
Ford    26
Ford    19
Ford    21
Ford    18
Ford    19
Name: city08, Length: 3371, dtype: int64

In [70]:
# Using regex for filtering

city_mpg_2.filter(regex='(Ford)|(Subaru)')

Subaru    17
Subaru    21
Subaru    22
Ford      18
Ford      16
          ..
Subaru    19
Subaru    20
Subaru    18
Subaru    18
Subaru    16
Name: city08, Length: 4256, dtype: int64

## Reindexing

The `.reindex` method allows you to pull out values by index label. It conforms the series to the index labels provided and
returns a series in that order. Unlike `.loc`, which raises a `KeyError` for labels that are not in the index, `.reindex` accepts such labels and inserts missing values for them.

In [71]:
s_1 = pd.Series([10,20,30], index = ['a','b','c'])

s_2 = pd.Series([100,200,300], index = ['x','b','c'])

# Addition aligns on index labels, so the non-shared labels 'a' and 'x' would produce NaN
# s_1 + s_2


In [72]:
s_2

x    100
b    200
c    300
dtype: int64

In [73]:
s_2.reindex(s_1.index)

a      NaN
b    200.0
c    300.0
dtype: float64
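
If the missing entries should be treated as a known value rather than `NaN`, both `.reindex` and `.add` accept a `fill_value` argument. A minimal sketch using the same two series as above:

```python
import pandas as pd

s_1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
s_2 = pd.Series([100, 200, 300], index=["x", "b", "c"])

# Conform s_2 to the labels of s_1, using 0 where s_2 has no entry
conformed = s_2.reindex(s_1.index, fill_value=0)
print(conformed.tolist())  # [0, 200, 300]

# Add the series, treating a label that is missing on one side as 0
total = s_1.add(s_2, fill_value=0)
print(total.to_dict())  # {'a': 10.0, 'b': 220.0, 'c': 330.0, 'x': 100.0}
```

Whether 0 is a sensible stand-in depends on the data; for some columns a missing value should stay missing.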

## Strings and Objects

Pandas has a `string` type that supports missing values that are not `NaN`.

In [74]:
# The make column has an object type by default
make

0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: object

In [75]:
# Convert the object type to string type

make.astype('string')

0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: string

> The difference between the `string` type and strings stored in object (and category) type series is that the string methods return a nullable type when you use a `string` series. If the
result of the string method is missing, pandas will use its newer native nullable types (such as `Int64` with `<NA>`)
rather than falling back to `NaN`.
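
The difference shows up as soon as a string method returns numbers for a series that contains a missing value. A minimal sketch on a two-value stand-in series:

```python
import pandas as pd

obj = pd.Series(["Ford", None], dtype=object)
typed = obj.astype("string")

# With object dtype, the missing result forces a float column with NaN
print(obj.str.len().dtype)    # float64

# With string dtype, the result is a nullable integer column with <NA>
print(typed.str.len().dtype)  # Int64
print(typed.str.len())
```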

## Categorical Strings

If you have low cardinality string columns, consider using a categorical type for them. You will
have access to many of the same string manipulation methods (though some are not available in
this case). The main advantage here is memory savings and performance improvements, as the
operations need to be done only on the individual categories and not each value in the series:

In [76]:
make.astype('category')

0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: category
Categories (136, object): ['AM General', 'ASC Incorporated', 'Acura', 'Alfa Romeo', ..., 'Volvo', 'Wallace Environmental', 'Yugo', 'smart']

## The .str Accessor

The object, 'string', and 'category' types have a .str accessor that provides string manipulation
methods. Most of these methods are modeled after the Python string methods.

In [77]:
#  Make string lowercase
make.str.lower()

0        alfa romeo
1           ferrari
2             dodge
3             dodge
4            subaru
            ...    
41139        subaru
41140        subaru
41141        subaru
41142        subaru
41143        subaru
Name: make, Length: 41144, dtype: object

In [78]:
# More string accessors

make.str.find('Subaru')

0       -1
1       -1
2       -1
3       -1
4        0
        ..
41139    0
41140    0
41141    0
41142    0
41143    0
Name: make, Length: 41144, dtype: int64

## Searching

Regular expressions can be used to search through strings.

In [79]:
(
    make
    # Extract the first character in each make that is not a letter or a space
    .str.extract(r'([^a-z A-Z])', expand=False)
    .value_counts()
)

-    1727
.      46
,       9
Name: make, dtype: int64

## Splitting

When dealing with survey data, you may come across binned numeric values. The survey probably
had a drop-down of different ranges. It might have asked for your age and offered options such as
20-29, 30-39, 40-49, etc. Those survey results come in as strings because values containing a dash
cannot be parsed as numbers. Hence we cannot perform math operations on the ages, like calculating the minimum or
mean values, until the strings are split apart.

In [80]:
age = pd.Series(['0-10', '10-20', '20-30', '30-40', '40-50', '50-60', '60-70'])


age

0     0-10
1    10-20
2    20-30
3    30-40
4    40-50
5    50-60
6    60-70
dtype: object

In [81]:

(
    age
    # Split based on the '-' character
    .str.split('-', expand=True)
    # Take the upper bound of each range (the second column)
    .iloc[:, 1]
    .astype(int)
)    


0    10
1    20
2    30
3    40
4    50
5    60
6    70
Name: 1, dtype: int32

In [82]:
(
    age
    .str.split('-', expand=True)
    .astype(int)
    # the average of the bin ranges
    .mean(axis='columns')
)

0     5.0
1    15.0
2    25.0
3    35.0
4    45.0
5    55.0
6    65.0
dtype: float64

> The `.mean` is applied across each row (manipulating across the row is accomplished with the axis='columns' parameter). 
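
As an alternative sketch (not the approach used above), `.str.extract` with named groups produces labelled columns in one step, which can make the later arithmetic easier to read. The column names `low` and `high` are my own choice.

```python
import pandas as pd

age = pd.Series(['0-10', '10-20', '20-30', '30-40', '40-50', '50-60', '60-70'])

bounds = (
    age
    # Named groups become the column names of the resulting DataFrame
    .str.extract(r'(?P<low>\d+)-(?P<high>\d+)')
    .astype(int)
)

# Midpoint of each bin, equivalent to the row-wise mean shown above
midpoint = (bounds['low'] + bounds['high']) / 2
```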

## Replacing Text


Both the series and the .str attribute have a .replace method, and these methods have overlapping
functionality. If I want to replace single characters, I typically use .str.replace, but if I have
complete replacements for many of the values I use .replace.

In [83]:
# Replace every "a" with "ao" in each value of the series
make.str.replace('a', 'ao')

0        Alfao Romeo
1           Ferraori
2              Dodge
3              Dodge
4            Subaoru
            ...     
41139        Subaoru
41140        Subaoru
41141        Subaoru
41142        Subaoru
41143        Subaoru
Name: make, Length: 41144, dtype: object
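
The difference between the two methods is easiest to see side by side. In this minimal sketch the replacement values (`'RAM'`, `'Subaru Corp'`) are invented for illustration.

```python
import pandas as pd

makes = pd.Series(['Alfa Romeo', 'Dodge', 'Subaru', 'Subaru'])

# Series.replace with a dict swaps whole values that match a key exactly
whole = makes.replace({'Subaru': 'Subaru Corp', 'Dodge': 'RAM'})

# .str.replace works inside each string; regex=False treats 'a' literally
partial = makes.str.replace('a', 'ao', regex=False)

# Series.replace leaves substrings alone unless regex=True is passed
unchanged = makes.replace('a', 'ao')
```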

## Date and Time Manipulation

In [84]:
col = pd.Series(['2015-03-08 08:00:00+00:00',
                 '2015-03-08 08:30:00+00:00',
                 '2015-03-08 09:00:00+00:00',
                 '2015-03-08 09:30:00+00:00',
                 '2015-11-01 06:30:00+00:00',
                 '2015-11-01 07:00:00+00:00',
                 '2015-11-01 07:30:00+00:00',
                 '2015-11-01 08:00:00+00:00',
                 '2015-11-01 08:30:00+00:00',
                 '2015-11-01 08:00:00+00:00',
                 '2015-11-01 08:30:00+00:00',
                 '2015-11-01 09:00:00+00:00',
                 '2015-11-01 09:30:00+00:00',
                 '2015-11-01 10:00:00+00:00'])

col

0     2015-03-08 08:00:00+00:00
1     2015-03-08 08:30:00+00:00
2     2015-03-08 09:00:00+00:00
3     2015-03-08 09:30:00+00:00
4     2015-11-01 06:30:00+00:00
5     2015-11-01 07:00:00+00:00
6     2015-11-01 07:30:00+00:00
7     2015-11-01 08:00:00+00:00
8     2015-11-01 08:30:00+00:00
9     2015-11-01 08:00:00+00:00
10    2015-11-01 08:30:00+00:00
11    2015-11-01 09:00:00+00:00
12    2015-11-01 09:30:00+00:00
13    2015-11-01 10:00:00+00:00
dtype: object

In [85]:
# Convert from object dtype to timezone-aware datetimes in UTC

utc_s = pd.to_datetime(col, utc=True)

utc_s

0    2015-03-08 08:00:00+00:00
1    2015-03-08 08:30:00+00:00
2    2015-03-08 09:00:00+00:00
3    2015-03-08 09:30:00+00:00
4    2015-11-01 06:30:00+00:00
5    2015-11-01 07:00:00+00:00
6    2015-11-01 07:30:00+00:00
7    2015-11-01 08:00:00+00:00
8    2015-11-01 08:30:00+00:00
9    2015-11-01 08:00:00+00:00
10   2015-11-01 08:30:00+00:00
11   2015-11-01 09:00:00+00:00
12   2015-11-01 09:30:00+00:00
13   2015-11-01 10:00:00+00:00
dtype: datetime64[ns, UTC]

> Note the dtype of the result: `datetime64[ns, UTC]` indicates that the dates are stored as UTC. Once you have
converted a series to a datetime64[ns] dtype, you can use the `.dt` accessor.

In [86]:
# Convert from UTC to the Kuala Lumpur timezone

utc_s.dt.tz_convert('Asia/Kuala_Lumpur')

0    2015-03-08 16:00:00+08:00
1    2015-03-08 16:30:00+08:00
2    2015-03-08 17:00:00+08:00
3    2015-03-08 17:30:00+08:00
4    2015-11-01 14:30:00+08:00
5    2015-11-01 15:00:00+08:00
6    2015-11-01 15:30:00+08:00
7    2015-11-01 16:00:00+08:00
8    2015-11-01 16:30:00+08:00
9    2015-11-01 16:00:00+08:00
10   2015-11-01 16:30:00+08:00
11   2015-11-01 17:00:00+08:00
12   2015-11-01 17:30:00+08:00
13   2015-11-01 18:00:00+08:00
dtype: datetime64[ns, Asia/Kuala_Lumpur]

In [87]:
# Loading data with Offset

s = pd.Series(['2015-03-08 01:00:00-07:00',
               '2015-03-08 01:30:00-07:00',
               '2015-03-08 03:00:00-06:00',
               '2015-03-08 03:30:00-06:00',
               '2015-11-01 00:30:00-06:00',
               '2015-11-01 01:00:00-06:00',
               '2015-11-01 01:30:00-06:00',
               '2015-11-01 01:00:00-07:00',
               '2015-11-01 01:30:00-07:00',
               '2015-11-01 01:00:00-07:00',
               '2015-11-01 01:30:00-07:00',
               '2015-11-01 02:00:00-07:00',
               '2015-11-01 02:30:00-07:00',
               '2015-11-01 03:00:00-07:00'])

pd.to_datetime(s, utc=True).dt.tz_convert('Asia/Kuala_Lumpur')

0    2015-03-08 16:00:00+08:00
1    2015-03-08 16:30:00+08:00
2    2015-03-08 17:00:00+08:00
3    2015-03-08 17:30:00+08:00
4    2015-11-01 14:30:00+08:00
5    2015-11-01 15:00:00+08:00
6    2015-11-01 15:30:00+08:00
7    2015-11-01 16:00:00+08:00
8    2015-11-01 16:30:00+08:00
9    2015-11-01 16:00:00+08:00
10   2015-11-01 16:30:00+08:00
11   2015-11-01 17:00:00+08:00
12   2015-11-01 17:30:00+08:00
13   2015-11-01 18:00:00+08:00
dtype: datetime64[ns, Asia/Kuala_Lumpur]

## Loading Local Time Data

If we want to load local date information, we need to have the date, the offset, and the timezone.

In [88]:
time = pd.Series(['2015-03-08 01:00:00',
                  '2015-03-08 01:30:00',
                  '2015-03-08 02:00:00',
                  '2015-03-08 02:30:00',
                  '2015-03-08 03:00:00',
                  '2015-03-08 02:00:00',
                  '2015-03-08 02:30:00',
                  '2015-03-08 03:00:00',
                  '2015-03-08 03:30:00',
                  '2015-11-01 00:30:00',
                  '2015-11-01 01:00:00',
                  '2015-11-01 01:30:00',
                  '2015-11-01 02:00:00',
                  '2015-11-01 02:30:00',
                  '2015-11-01 01:00:00',
                  '2015-11-01 01:30:00',
                  '2015-11-01 02:00:00',
                  '2015-11-01 02:30:00',
                  '2015-11-01 03:00:00'])


offset = pd.Series([-7, -7, -7, -7, -7, -6, -6,
                    -6, -6, -6, -6, -6, -6, -6, -7, -7, -7, -7, -7])

> To apply each offset to its corresponding times, use `.groupby` with `.transform`. Inside the lambda, `s.name` is the group's offset string, which is passed to `.dt.tz_localize`.

In [89]:
offset = offset.replace({-7: '-07:00', -6: '-06:00'})

(
    pd.to_datetime(time)
    .groupby(offset)
    .transform(lambda s: s.dt.tz_localize(s.name)
                          .dt.tz_convert('Asia/Kuala_Lumpur'))               
)


0    2015-03-08 16:00:00+08:00
1    2015-03-08 16:30:00+08:00
2    2015-03-08 17:00:00+08:00
3    2015-03-08 17:30:00+08:00
4    2015-03-08 18:00:00+08:00
5    2015-03-08 16:00:00+08:00
6    2015-03-08 16:30:00+08:00
7    2015-03-08 17:00:00+08:00
8    2015-03-08 17:30:00+08:00
9    2015-11-01 14:30:00+08:00
10   2015-11-01 15:00:00+08:00
11   2015-11-01 15:30:00+08:00
12   2015-11-01 16:00:00+08:00
13   2015-11-01 16:30:00+08:00
14   2015-11-01 16:00:00+08:00
15   2015-11-01 16:30:00+08:00
16   2015-11-01 17:00:00+08:00
17   2015-11-01 17:30:00+08:00
18   2015-11-01 18:00:00+08:00
dtype: datetime64[ns, Asia/Kuala_Lumpur]
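
A simpler alternative, offered here as a sketch and not as the notebook's method, is to append each offset to its time string and let `pd.to_datetime` parse the combined value. The three-row `time` and `offset` series below are a shortened, hypothetical sample of the data above.

```python
import pandas as pd

# Hypothetical three-row sample of local times and their UTC offsets
time = pd.Series(['2015-03-08 01:00:00', '2015-03-08 03:00:00', '2015-11-01 01:00:00'])
offset = pd.Series(['-07:00', '-06:00', '-07:00'])

converted = (
    # Build strings such as '2015-03-08 01:00:00-07:00' and parse them as UTC
    pd.to_datetime(time + offset, utc=True)
    .dt.tz_convert('Asia/Kuala_Lumpur')
)
```

Passing `utc=True` matters here because the rows carry different offsets; it normalizes them all to a single timezone-aware dtype before the conversion.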

## Converting to UTC

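If you have a series with local time information (stored as datetime64[ns] and not a string), how you get to UTC depends on whether the series is timezone-aware. A timezone-aware series can be converted with `.dt.tz_convert('UTC')`. The `local` series below is timezone-naive, so `.dt.tz_localize('UTC')` is used instead; it labels the existing clock times as UTC without shifting them: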

In [90]:
local = pd.to_datetime(time)


local.dt.tz_localize('UTC')

0    2015-03-08 01:00:00+00:00
1    2015-03-08 01:30:00+00:00
2    2015-03-08 02:00:00+00:00
3    2015-03-08 02:30:00+00:00
4    2015-03-08 03:00:00+00:00
5    2015-03-08 02:00:00+00:00
6    2015-03-08 02:30:00+00:00
7    2015-03-08 03:00:00+00:00
8    2015-03-08 03:30:00+00:00
9    2015-11-01 00:30:00+00:00
10   2015-11-01 01:00:00+00:00
11   2015-11-01 01:30:00+00:00
12   2015-11-01 02:00:00+00:00
13   2015-11-01 02:30:00+00:00
14   2015-11-01 01:00:00+00:00
15   2015-11-01 01:30:00+00:00
16   2015-11-01 02:00:00+00:00
17   2015-11-01 02:30:00+00:00
18   2015-11-01 03:00:00+00:00
dtype: datetime64[ns, UTC]
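
The sketch below shows the two steps separately: `.dt.tz_localize` attaches a timezone to naive values, and `.dt.tz_convert` then moves them to another zone. The choice of `'America/Denver'` is an assumption made for illustration; the notebook does not say where the times were recorded.

```python
import pandas as pd

# Timezone-naive timestamps; assume they were recorded in Denver local time
naive = pd.to_datetime(pd.Series(['2015-03-08 01:00:00', '2015-11-01 03:00:00']))

# tz_localize attaches a zone without changing the clock reading
denver = naive.dt.tz_localize('America/Denver')

# tz_convert keeps the instant and changes the clock reading
utc = denver.dt.tz_convert('UTC')
```

Calling `.dt.tz_convert` directly on `naive` raises a `TypeError`, because pandas has no source timezone to convert from.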

## Converting to Epochs

You can get the number of seconds since the UNIX epoch from UTC or local time information.

In [91]:
secs = local.view('int64').floordiv(1e9).astype('int64')

secs

0     1425776400
1     1425778200
2     1425780000
3     1425781800
4     1425783600
5     1425780000
6     1425781800
7     1425783600
8     1425785400
9     1446337800
10    1446339600
11    1446341400
12    1446343200
13    1446345000
14    1446339600
15    1446341400
16    1446343200
17    1446345000
18    1446346800
dtype: int64

In [92]:
# Loading Epoch information

(
    pd.to_datetime(secs, unit='s')
    .dt.tz_localize('UTC')
)

0    2015-03-08 01:00:00+00:00
1    2015-03-08 01:30:00+00:00
2    2015-03-08 02:00:00+00:00
3    2015-03-08 02:30:00+00:00
4    2015-03-08 03:00:00+00:00
5    2015-03-08 02:00:00+00:00
6    2015-03-08 02:30:00+00:00
7    2015-03-08 03:00:00+00:00
8    2015-03-08 03:30:00+00:00
9    2015-11-01 00:30:00+00:00
10   2015-11-01 01:00:00+00:00
11   2015-11-01 01:30:00+00:00
12   2015-11-01 02:00:00+00:00
13   2015-11-01 02:30:00+00:00
14   2015-11-01 01:00:00+00:00
15   2015-11-01 01:30:00+00:00
16   2015-11-01 02:00:00+00:00
17   2015-11-01 02:30:00+00:00
18   2015-11-01 03:00:00+00:00
dtype: datetime64[ns, UTC]
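
Another way to compute epoch seconds, sketched here as an alternative to reinterpreting the underlying integers, is to subtract the epoch and divide by a one-second `Timedelta`. This does not depend on the values being stored as nanoseconds. The two timestamps are taken from the `time` series above.

```python
import pandas as pd

naive = pd.to_datetime(pd.Series(['2015-03-08 01:00:00', '2015-11-01 03:00:00']))

# Seconds since 1970-01-01, independent of the storage resolution
secs = (naive - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)

# unit='s' tells pandas the integers are seconds, not nanoseconds
restored = pd.to_datetime(secs, unit='s')
```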

## Manipulating dates


In [98]:
url = 'https://github.com/mattharrison/datasets/raw/master/data/alta-noaa-1980-2019.csv'


time_df = pd.read_csv(url)

time_df.head()

Unnamed: 0,STATION,NAME,LATITUDE,LONGITUDE,ELEVATION,DATE,DAPR,DASF,MDPR,MDSF,...,SNWD,TMAX,TMIN,TOBS,WT01,WT03,WT04,WT05,WT06,WT11
0,USC00420072,"ALTA, UT US",40.5905,-111.6369,2660.9,1980-01-01,,,,,...,29.0,38.0,25.0,25.0,,,,,,
1,USC00420072,"ALTA, UT US",40.5905,-111.6369,2660.9,1980-01-02,,,,,...,34.0,27.0,18.0,18.0,,,,,,
2,USC00420072,"ALTA, UT US",40.5905,-111.6369,2660.9,1980-01-03,,,,,...,30.0,27.0,12.0,18.0,,,,,,
3,USC00420072,"ALTA, UT US",40.5905,-111.6369,2660.9,1980-01-04,,,,,...,30.0,31.0,18.0,27.0,,,,,,
4,USC00420072,"ALTA, UT US",40.5905,-111.6369,2660.9,1980-01-05,,,,,...,30.0,34.0,26.0,34.0,,,,,,


In [99]:
# Convert to the datetime
dates = pd.to_datetime(time_df['DATE'])

dates

0       1980-01-01
1       1980-01-02
2       1980-01-03
3       1980-01-04
4       1980-01-05
           ...    
14155   2019-09-03
14156   2019-09-04
14157   2019-09-05
14158   2019-09-06
14159   2019-09-07
Name: DATE, Length: 14160, dtype: datetime64[ns]

> Because the `dates` series has the dtype `datetime64[ns]`, I can now use the `.dt` accessor.

In [100]:
# Get the weekday names

dates.dt.day_name()

0          Tuesday
1        Wednesday
2         Thursday
3           Friday
4         Saturday
           ...    
14155      Tuesday
14156    Wednesday
14157     Thursday
14158       Friday
14159     Saturday
Name: DATE, Length: 14160, dtype: object

In [101]:
# Date Properties

dates.dt.is_month_end

0        False
1        False
2        False
3        False
4        False
         ...  
14155    False
14156    False
14157    False
14158    False
14159    False
Name: DATE, Length: 14160, dtype: bool

In [102]:
# More date properties

dates.dt.weekday

0        1
1        2
2        3
3        4
4        5
        ..
14155    1
14156    2
14157    3
14158    4
14159    5
Name: DATE, Length: 14160, dtype: int64

In [103]:
# Date formatting

dates.dt.strftime('%Y-%m-%d')

0        1980-01-01
1        1980-01-02
2        1980-01-03
3        1980-01-04
4        1980-01-05
            ...    
14155    2019-09-03
14156    2019-09-04
14157    2019-09-05
14158    2019-09-06
14159    2019-09-07
Name: DATE, Length: 14160, dtype: object
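
The accessor calls above can be collected into one frame to compare them row by row. This is a minimal sketch on three hand-picked dates; note that `.dt.day_name()` is a method and needs parentheses, while the others are properties.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['1980-01-01', '1980-01-31', '2019-09-07']))

parts = pd.DataFrame({
    'year': dates.dt.year,
    'month': dates.dt.month,
    'day_name': dates.dt.day_name(),          # method call
    'weekday': dates.dt.weekday,              # Monday=0 ... Sunday=6
    'is_month_end': dates.dt.is_month_end,
    'formatted': dates.dt.strftime('%Y/%m/%d'),
})
```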

## Dates in the Index


In [104]:
snow = (
    time_df['SNOW']
    # Set the dates as the index (.rename uses the `dates` series as an old-label -> new-label mapping)
    .rename(dates)
)

snow

1980-01-01    2.0
1980-01-02    3.0
1980-01-03    1.0
1980-01-04    0.0
1980-01-05    0.0
             ... 
2019-09-03    0.0
2019-09-04    0.0
2019-09-05    0.0
2019-09-06    0.0
2019-09-07    0.0
Name: SNOW, Length: 14160, dtype: float64

In [105]:
# Finding Missing Data

snow[snow.isna()]

1985-07-30   NaN
1985-09-12   NaN
1985-09-19   NaN
1986-02-07   NaN
1986-06-26   NaN
              ..
2017-04-26   NaN
2017-09-20   NaN
2017-10-02   NaN
2017-12-23   NaN
2018-12-03   NaN
Name: SNOW, Length: 365, dtype: float64

In [106]:
# Date indexing from Sept - Oct 1985

snow.loc['1985-09':'1985-10']

1985-09-01    0.0
1985-09-02    0.0
1985-09-03    0.0
1985-09-04    0.0
1985-09-05    0.0
             ... 
1985-10-27    0.0
1985-10-28    0.0
1985-10-29    0.0
1985-10-30    0.0
1985-10-31    5.0
Name: SNOW, Length: 61, dtype: float64
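
Slicing with partial date strings works at several granularities. The sketch below uses a made-up daily series (`daily`, a hypothetical stand-in for `snow`) to show a whole month, a month range, and a single day.

```python
import pandas as pd

# Hypothetical daily series standing in for `snow`
idx = pd.date_range('1985-01-01', '1985-12-31', freq='D')
daily = pd.Series(range(len(idx)), index=idx)

september = daily.loc['1985-09']            # one whole month
sep_oct = daily.loc['1985-09':'1985-10']    # both endpoints are included
one_day = daily.loc['1985-10-31']           # a full date returns a scalar
```

Unlike positional slices, these label slices include the end of the range, which is why the September to October slice has 61 rows.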

## Dealing with Missing Data

In [107]:
# Fill in missing values

(
    snow
    # Select data from Aug - Dec 1985
    .loc['1985-08':'1985-12']
    # Fill the Null Values with 0
    .fillna(0)
)

1985-08-01    0.0
1985-08-02    0.0
1985-08-03    0.0
1985-08-04    0.0
1985-08-05    0.0
             ... 
1985-12-27    0.0
1985-12-28    0.0
1985-12-29    0.0
1985-12-30    5.0
1985-12-31    2.0
Name: SNOW, Length: 153, dtype: float64

> The best way to deal with missing data is to talk to a subject matter
expert and determine why it is missing.
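
Filling with zero is only one option. The following sketch compares it with forward filling and linear interpolation on a small invented series; which one is appropriate depends on why the values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with two gaps
idx = pd.date_range('1985-09-10', periods=5, freq='D')
readings = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0], index=idx)

zero_filled = readings.fillna(0)        # assumes a gap means "no snow"
forward = readings.ffill()              # carries the last reading forward
interpolated = readings.interpolate()   # linear estimate between neighbours
```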

## Dropping Missing Values

In [108]:
(
    snow
    # Drop all missing values
    .dropna()
)

1980-01-01    2.0
1980-01-02    3.0
1980-01-03    1.0
1980-01-04    0.0
1980-01-05    0.0
             ... 
2019-09-03    0.0
2019-09-04    0.0
2019-09-05    0.0
2019-09-06    0.0
2019-09-07    0.0
Name: SNOW, Length: 13795, dtype: float64

> Only use this method after talking to a subject matter expert who confirms
that it is OK to drop the data. Once rows are dropped, it can be hard to tell later that data was missing.

## Shifting Data



In [109]:
# Shifting data forward

snow.shift(1)

1980-01-01    NaN
1980-01-02    2.0
1980-01-03    3.0
1980-01-04    1.0
1980-01-05    0.0
             ... 
2019-09-03    0.0
2019-09-04    0.0
2019-09-05    0.0
2019-09-06    0.0
2019-09-07    0.0
Name: SNOW, Length: 14160, dtype: float64

In [110]:
# Shifting data backward

snow.shift(-1)

1980-01-01    3.0
1980-01-02    1.0
1980-01-03    0.0
1980-01-04    0.0
1980-01-05    1.0
             ... 
2019-09-03    0.0
2019-09-04    0.0
2019-09-05    0.0
2019-09-06    0.0
2019-09-07    NaN
Name: SNOW, Length: 14160, dtype: float64
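
Shifting is mostly useful for comparing a row with its neighbour. A minimal sketch, using the first five snowfall values shown above as a sample:

```python
import pandas as pd

idx = pd.date_range('1980-01-01', periods=5, freq='D')
snow_sample = pd.Series([2.0, 3.0, 1.0, 0.0, 0.0], index=idx)

# Change from the previous day; the first row has no predecessor, so it is NaN
change = snow_sample - snow_sample.shift(1)

# True where it snowed on two consecutive days
two_in_a_row = (snow_sample > 0) & (snow_sample.shift(1) > 0)
```

The `change` series is the same result that `snow_sample.diff()` returns.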

## Rolling Average



In [111]:
(
    snow
    # Calculate the 5-Day Moving Average
    .rolling(5)
    .mean()
)

1980-01-01    NaN
1980-01-02    NaN
1980-01-03    NaN
1980-01-04    NaN
1980-01-05    1.2
             ... 
2019-09-03    0.0
2019-09-04    0.0
2019-09-05    0.0
2019-09-06    0.0
2019-09-07    0.0
Name: SNOW, Length: 14160, dtype: float64
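
The leading NaN values appear because a window of 5 needs five observations before it produces a result. The sketch below shows how the `min_periods` parameter relaxes that requirement; the sixth value in `snow_sample` is invented to extend the sample.

```python
import pandas as pd

idx = pd.date_range('1980-01-01', periods=6, freq='D')
snow_sample = pd.Series([2.0, 3.0, 1.0, 0.0, 0.0, 4.0], index=idx)

# Default: a window yields NaN until it holds 5 observations
strict = snow_sample.rolling(5).mean()

# min_periods=1 averages whatever is available at the start of the series
lenient = snow_sample.rolling(5, min_periods=1).mean()
```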

## Resampling



In [112]:
# Find the maximum snowfall by month

(
    snow
    # Group by month
    .resample('M')
    # Get the max value for each month
    .max()
)

1980-01-31    20.0
1980-02-29    25.0
1980-03-31    16.0
1980-04-30    10.0
1980-05-31     9.0
              ... 
2019-05-31     5.1
2019-06-30     0.0
2019-07-31     0.0
2019-08-31     0.0
2019-09-30     0.0
Freq: M, Name: SNOW, Length: 477, dtype: float64

> The .resample method is used to aggregate values at different levels. At a high level, it groups date entries by some interval (yearly, monthly, weekly) and then aggregates the values within each interval.

In [113]:
(
    snow
    # Group by every 2 months
    .resample('2M')
    # Find the max for the grouped months
    .max()
)

1980-01-31    20.0
1980-03-31    25.0
1980-05-31    10.0
1980-07-31     1.0
1980-09-30     0.0
              ... 
2019-01-31    19.0
2019-03-31    20.7
2019-05-31    18.0
2019-07-31     0.0
2019-09-30     0.0
Freq: 2M, Name: SNOW, Length: 239, dtype: float64

> If we want to aggregate the maximum value for each ski season, which normally ends in May,
we could use the following code. This offset alias, 'A-MAY', indicates that we want an annual
grouping ('A'), but ending in May of each year:

In [114]:
(
    snow
    # Group by Each Year ending in May
    .resample('A-May')
    # Find the max for each year
    .max()
)

1980-05-31    25.0
1981-05-31    26.0
1982-05-31    34.0
1983-05-31    38.0
1984-05-31    25.0
1985-05-31    22.0
1986-05-31    34.0
1987-05-31    16.0
1988-05-31    23.0
1989-05-31    30.0
1990-05-31    32.0
1991-05-31    28.0
1992-05-31    22.0
1993-05-31    30.0
1994-05-31    36.0
1995-05-31    25.0
1996-05-31    34.0
1997-05-31    22.0
1998-05-31    29.0
1999-05-31    26.0
2000-05-31    23.0
2001-05-31    19.0
2002-05-31    28.0
2003-05-31    14.0
2004-05-31    24.0
2005-05-31    31.0
2006-05-31    27.0
2007-05-31    15.0
2008-05-31    21.0
2009-05-31    23.0
2010-05-31    32.0
2011-05-31    22.0
2012-05-31    18.0
2013-05-31    19.0
2014-05-31    11.0
2015-05-31    25.0
2016-05-31    15.0
2017-05-31    26.0
2018-05-31    21.8
2019-05-31    20.7
2020-05-31     0.0
Freq: A-MAY, Name: SNOW, dtype: float64
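
A resampler can also return several aggregates at once through `.agg`. The sketch below uses invented data and the `'MS'` (month start) alias, which I chose because, to my knowledge, it is spelled the same across pandas versions; my understanding is that newer releases prefer `'ME'` over the `'M'` alias used in this notebook, so treat the alias as version-dependent.

```python
import pandas as pd

# Hypothetical daily snowfall for the first quarter of 2019
idx = pd.date_range('2019-01-01', '2019-03-31', freq='D')
daily = pd.Series(1.0, index=idx)
daily.loc[pd.Timestamp('2019-02-14')] = 7.0   # one big storm

# 'MS' bins by calendar month and labels each bin with the month's first day
monthly = daily.resample('MS').agg(['max', 'sum', 'count'])
```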

## Gathering Aggregate Values (with Index)

Below, instead of performing an aggregation with `.resample`, we leverage the
`.transform` method, which works on aggregation groups but returns a series with the original index. This makes it easy to do things like calculate the percentage of quarterly snowfall that fell in a day:

In [115]:
(
    snow
    # Divide by the total snowfall for that Quarter
    .div( snow
          # Aggregate by the Quarter  
          .resample('Q')
          # Transform and get the sum for the quarter and return it indexed
          .transform('sum')
         )
    # Multiply by 100 to get the percentage
    .mul(100)
    # Fill null values with 0
    .fillna(0)
)

1980-01-01    0.527009
1980-01-02    0.790514
1980-01-03    0.263505
1980-01-04    0.000000
1980-01-05    0.000000
                ...   
2019-09-03    0.000000
2019-09-04    0.000000
2019-09-05    0.000000
2019-09-06    0.000000
2019-09-07    0.000000
Name: SNOW, Length: 14160, dtype: float64
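
The difference between aggregating and transforming is easier to see on a tiny series. In this sketch (invented values, monthly groups via the `'MS'` alias), `.sum()` returns one row per month while `.transform('sum')` repeats each month's total on every original row.

```python
import pandas as pd

idx = pd.date_range('2019-01-30', periods=4, freq='D')   # spans January and February
daily = pd.Series([1.0, 3.0, 2.0, 6.0], index=idx)

by_month = daily.resample('MS')

totals = by_month.sum()                 # one row per month
broadcast = by_month.transform('sum')   # one row per original day

# Each day's share of its own month's total, in percent
share = daily.div(broadcast).mul(100)
```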

In [116]:
(
    snow
    # Select the year 2019
    .loc['2019-01':'2019-12']
    # Resample by monthly interval
    .resample('M')
    # Find the sum of each month
    .sum()
    # Divide by the total snowfall for that year
    .div( 
         snow.loc['2019-01':'2019-12']
        #  Total snow fall for the year
        .sum()
         )
    # Multiply by 100
    .mul(100)
)

2019-01-31    23.631509
2019-02-28    29.559413
2019-03-31    24.939920
2019-04-30    14.926569
2019-05-31     6.942590
2019-06-30     0.000000
2019-07-31     0.000000
2019-08-31     0.000000
2019-09-30     0.000000
Freq: M, Name: SNOW, dtype: float64

## Groupby Operations



In [135]:
# Define a function that assigns each date in the index to a ski season.
# A season runs from October through September, so October to December count toward the following year.

def season(idx):
    year = idx.year
    month = idx.month
    return year.where((month < 10), year + 1)

In [136]:
(
    snow
    # Groupby the season
    .groupby(season)
    # Aggregate by finding the sum for each season
    .sum()
)

1980    457.5
1981    503.0
1982    842.5
1983    807.5
1984    816.0
1985    536.0
1986    740.8
1987    243.1
1988    314.5
1989    429.5
1990    331.5
1991    504.7
1992    340.8
1993    683.5
1994    321.0
1995    645.0
1996    525.5
1997    563.6
1998    579.6
1999    435.7
2000    453.0
2001    468.0
2002    457.8
2003    365.4
2004    514.0
2005    472.0
2006    594.6
2007    319.7
2008    606.0
2009    476.8
2010    391.0
2011    533.8
2012    293.5
2013    362.8
2014    358.7
2015    284.3
2016    354.6
2017    524.0
2018    308.8
2019    504.5
Name: SNOW, dtype: float64
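
As an alternative sketch, the season labels can be computed as an array up front and passed to `.groupby` directly, which avoids depending on how pandas hands the index to a callable. The five dates and values below are invented to straddle the October boundary.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2018-09-30', '2018-10-01', '2018-12-25', '2019-01-15', '2019-09-30'])
snow_sample = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)

# October through December count toward the following year's season
season_label = np.where(idx.month >= 10, idx.year + 1, idx.year)

totals = snow_sample.groupby(season_label).sum()
```

Only the first date falls in the 2018 season; the other four all belong to the season that ends in September 2019.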

## Cumulative Operations

There are also a handful of cumulative methods that work well with sequence data. These are
`.cummin`, `.cummax`, `.cumprod`, and `.cumsum`. They return the cumulative minimum, maximum, product,
and sum respectively.

In [178]:
(
    snow
    # Select Oct 2016 - Oct 2017 data
    .loc['2016-10':'2017-10']
    .cumsum()
)

2016-10-01      0.0
2016-10-02      0.0
2016-10-03      4.9
2016-10-04      4.9
2016-10-05      5.5
              ...  
2017-10-27    530.9
2017-10-28    530.9
2017-10-29    530.9
2017-10-30    530.9
2017-10-31    530.9
Name: SNOW, Length: 395, dtype: float64

> Cumulative sum within each group (here, each month)

In [188]:
(
    snow
    # Select Oct 2016 -Nov 2016 data
    .loc['2016-10':'2016-11']
    # Group By the month
    .resample('M')
    # Unpack the records with transform and get the cumsum
    .transform('cumsum')
)

2016-10-01     0.0
2016-10-02     0.0
2016-10-03     4.9
2016-10-04     4.9
2016-10-05     5.5
              ... 
2016-11-26    27.7
2016-11-27    43.0
2016-11-28     NaN
2016-11-29    49.0
2016-11-30    49.0
Name: SNOW, Length: 61, dtype: float64

Alternatively, if we wanted to do this calculation for every year, we can combine `.resample` with
`.transform` and `'cumsum'`:
```
    (snow
    .resample('A-SEP')
    .transform('cumsum')
    )
```
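
To see how a grouped cumulative sum differs from a plain one, here is a minimal sketch on four invented values that cross a month boundary. It groups by the month number of the index, which is only safe here because the sample stays within a single year.

```python
import pandas as pd

idx = pd.to_datetime(['2016-10-30', '2016-10-31', '2016-11-01', '2016-11-02'])
snow_sample = pd.Series([1.0, 2.0, 4.0, 8.0], index=idx)

running = snow_sample.cumsum()   # never resets
record = snow_sample.cummax()    # largest value seen so far

monthly_running = (
    snow_sample
    # Restart the running total in each month
    .groupby(snow_sample.index.month)
    .cumsum()
)
```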