<a href="https://colab.research.google.com/github/prof-rossetti/intro-to-python/blob/main/notes/python/packages/Pandas_Package_Overview_for_Data_Science_(Summer_2023).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Pandas Package Overview for Data Science (Mega Notebook)**

The [`pandas` package](https://pypi.org/project/pandas/) makes it easy to work with tabular data, such as CSV files, by providing two new datatypes: the `DataFrame` and the `Series`. This notebook walks through common, practical ways of working with these objects.

## 1) Obtaining DataFrames

The `pandas` package provides a datatype called the `DataFrame`, which represents tabular, spreadsheet-style data with rows and columns.






### The `DataFrame` Class Constructor

If we have some data in an eligible format (list of lists, list of dictionaries, dictionary of lists), we can pass it to the `DataFrame` class constructor to obtain a dataframe object.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

Constructing a `DataFrame` from a list of lists:

In [371]:
from pandas import DataFrame

# lesser used format (list of lists)
prices = [
    ["2020-10-01", 100.00],
    ["2020-10-02", 101.01],
    ["2020-10-03", 120.20],
    ["2020-10-04", 107.07],
    ["2020-10-05", 142.42],
    ["2020-10-06", 135.35],
    ["2020-10-07", 160.60],
    ["2020-10-08", 162.62],
]

df = DataFrame(prices, columns=["date","stock_price_usd"])
df.head()

Unnamed: 0,date,stock_price_usd
0,2020-10-01,100.0
1,2020-10-02,101.01
2,2020-10-03,120.2
3,2020-10-04,107.07
4,2020-10-05,142.42


Constructing a `DataFrame` from a list of dictionaries (i.e. "records" format):

In [372]:
from pandas import DataFrame

# common "records" format (list of dicts)
prices = [
    {"date": "2020-10-01", "stock_price_usd": 100.00},
    {"date": "2020-10-02", "stock_price_usd": 101.01},
    {"date": "2020-10-03", "stock_price_usd": 120.20},
    {"date": "2020-10-04", "stock_price_usd": 107.07},
    {"date": "2020-10-05", "stock_price_usd": 142.42},
    {"date": "2020-10-06", "stock_price_usd": 135.35},
    {"date": "2020-10-07", "stock_price_usd": 160.60},
    {"date": "2020-10-08", "stock_price_usd": 162.62},
]

df = DataFrame(prices)
df.head()

Unnamed: 0,date,stock_price_usd
0,2020-10-01,100.0
1,2020-10-02,101.01
2,2020-10-03,120.2
3,2020-10-04,107.07
4,2020-10-05,142.42


Constructing a `DataFrame` from a dictionary of lists:

In [373]:
from pandas import DataFrame

# lesser used format (dict of lists)
prices = {
    "date": ["2020-10-01", "2020-10-02", "2020-10-03", "2020-10-04", "2020-10-05", "2020-10-06", "2020-10-07", "2020-10-08"],
    "stock_price_usd": [100.00, 101.01, 120.20, 107.07, 142.42, 135.35, 160.60, 162.62]
}
df = DataFrame(prices)
df.head()

Unnamed: 0,date,stock_price_usd
0,2020-10-01,100.0
1,2020-10-02,101.01
2,2020-10-03,120.2
3,2020-10-04,107.07
4,2020-10-05,142.42
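
As a quick check on the three input formats above, the following sketch builds the same two-row table each way and compares the results. The variable names (`rows`, `records`, `columns`) are illustrative and not part of the original notebook. It also shows `to_dict("records")`, which converts a dataframe back into the "records" format.

```python
from pandas import DataFrame

# the same two rows of data, expressed in three different formats
rows = [["2020-10-01", 100.00], ["2020-10-02", 101.01]]
records = [
    {"date": "2020-10-01", "stock_price_usd": 100.00},
    {"date": "2020-10-02", "stock_price_usd": 101.01},
]
columns = {
    "date": ["2020-10-01", "2020-10-02"],
    "stock_price_usd": [100.00, 101.01],
}

df_rows = DataFrame(rows, columns=["date", "stock_price_usd"])
df_records = DataFrame(records)
df_columns = DataFrame(columns)

# equals() compares the values and dtypes of two dataframes
print(df_rows.equals(df_records))
print(df_records.equals(df_columns))

# going the other way: dataframe back to a list of dictionaries
print(df_records.to_dict("records"))
```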


### The `read_csv` Function

If we have data in CSV format, we can leverage the `read_csv` function to convert that data into a dataframe object.

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

FYI: if we have an XLS or XLSX file, we can use the `read_excel` function in much the same way.

https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
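
To try `read_csv` without needing any file at all, we can hand it a file-like object instead of a path. This is a minimal sketch that wraps a made-up CSV string (`csv_text`, not from the original notebook) in `io.StringIO`:

```python
from io import StringIO
from pandas import read_csv

# hypothetical CSV content, held in memory as a string
csv_text = """date,stock_price_usd
2020-10-01,100.00
2020-10-02,101.01
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
df = read_csv(StringIO(csv_text))

print(df.shape)   # (n_rows, n_cols)
print(df.dtypes)  # the datatype pandas inferred for each column
```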


#### Reading Local CSV Files

We can read local CSV files. For example, Colab provides a CSV file at the path "sample_data/california_housing_train.csv". Alternatively, you can upload your own file to the Colab filesystem:

In [374]:
from pandas import read_csv

csv_filepath = "sample_data/california_housing_train.csv"
housing_df = read_csv(csv_filepath)
housing_df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


In [375]:
# upload your CSV file into the colab notebook filesystem, and note its name (like "example.csv")

#csv_filepath = "example.csv"

#my_df = read_csv(csv_filepath)
#my_df.head()

#### Reading Hosted CSV Files

We can also read CSV formatted data hosted on the Internet:

In [376]:
from pandas import read_csv

request_url = "https://raw.githubusercontent.com/prof-rossetti/intro-to-python/main/data/daily_adjusted_nflx.csv"
prices_df = read_csv(request_url)
prices_df.head()

Unnamed: 0,timestamp,open,high,low,close,adjusted_close,volume,dividend_amount,split_coefficient
0,2021-10-18,632.1,638.41,620.5901,637.97,637.97,4669071,0.0,1.0
1,2021-10-15,638.0,639.42,625.16,628.29,628.29,4116874,0.0,1.0
2,2021-10-14,632.23,636.88,626.79,633.8,633.8,2672535,0.0,1.0
3,2021-10-13,632.1791,632.1791,622.1,629.76,629.76,2424638,0.0,1.0
4,2021-10-12,633.02,637.655,621.99,624.94,624.94,3227349,0.0,1.0


## 2) Working with DataFrames

Now that we have obtained a dataframe, let's start working with it.

#### Inspecting the Data

Inspecting first or last few rows:

In [377]:
prices_df.head()

Unnamed: 0,timestamp,open,high,low,close,adjusted_close,volume,dividend_amount,split_coefficient
0,2021-10-18,632.1,638.41,620.5901,637.97,637.97,4669071,0.0,1.0
1,2021-10-15,638.0,639.42,625.16,628.29,628.29,4116874,0.0,1.0
2,2021-10-14,632.23,636.88,626.79,633.8,633.8,2672535,0.0,1.0
3,2021-10-13,632.1791,632.1791,622.1,629.76,629.76,2424638,0.0,1.0
4,2021-10-12,633.02,637.655,621.99,624.94,624.94,3227349,0.0,1.0


In [378]:
prices_df.tail()

Unnamed: 0,timestamp,open,high,low,close,adjusted_close,volume,dividend_amount,split_coefficient
95,2021-06-03,495.19,496.66,487.25,489.43,489.43,3887445,0.0,1.0
96,2021-06-02,499.82,503.22,495.82,499.24,499.24,2268979,0.0,1.0
97,2021-06-01,504.01,505.41,497.7443,499.08,499.08,2482555,0.0,1.0
98,2021-05-28,504.4,511.76,502.53,502.81,502.81,2911297,0.0,1.0
99,2021-05-27,501.8,505.1,498.54,503.86,503.86,3253773,0.0,1.0


Counting number of rows:

In [379]:
len(prices_df) #> int of n_rows

100

Dataset shape (rows and columns):

In [380]:
prices_df.shape #> tuple of (n_rows, n_cols)

(100, 9)

Identifying column names:

In [381]:
prices_df.columns

# prices_df.columns.tolist() # as a list datatype

Index(['timestamp', 'open', 'high', 'low', 'close', 'adjusted_close', 'volume',
       'dividend_amount', 'split_coefficient'],
      dtype='object')

Be aware that every dataframe has an index (a set of row labels that identify each row). By default, the index is a sequence of auto-incrementing integers starting at 0.

In [382]:
print(prices_df.index)

RangeIndex(start=0, stop=100, step=1)


#### Accessing Columns

We can access one or more columns of values using a dictionary-like accessor.

When we access a single column (using a string column name), we get a pandas `Series` object:

In [383]:
closing_prices = prices_df["adjusted_close"]
print(type(closing_prices)) #> Series

closing_prices.head()

<class 'pandas.core.series.Series'>


0    637.97
1    628.29
2    633.80
3    629.76
4    624.94
Name: adjusted_close, dtype: float64

When we access multiple columns (using a list of column names), we get a `DataFrame` object back:

In [384]:
closing_prices = prices_df[["timestamp", "adjusted_close", "volume"]]
print(type(closing_prices)) #> DataFrame

closing_prices.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,timestamp,adjusted_close,volume
0,2021-10-18,637.97,4669071
1,2021-10-15,628.29,4116874
2,2021-10-14,633.8,2672535
3,2021-10-13,629.76,2424638
4,2021-10-12,624.94,3227349


#### Accessing Rows

We can reference a given row by its integer position, in a list-like way, using the `iloc` indexer.

When we access a single row, we get a `Series` object:

In [385]:
latest = prices_df.iloc[0]
print(type(latest)) #> Series

latest

<class 'pandas.core.series.Series'>


timestamp            2021-10-18
open                      632.1
high                     638.41
low                    620.5901
close                    637.97
adjusted_close           637.97
volume                  4669071
dividend_amount             0.0
split_coefficient           1.0
Name: 0, dtype: object

When we access multiple rows, we get a `DataFrame` object:

In [386]:
latest_few = prices_df.iloc[0:3]
print(type(latest_few)) #> DataFrame

latest_few

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,timestamp,open,high,low,close,adjusted_close,volume,dividend_amount,split_coefficient
0,2021-10-18,632.1,638.41,620.5901,637.97,637.97,4669071,0.0,1.0
1,2021-10-15,638.0,639.42,625.16,628.29,628.29,4116874,0.0,1.0
2,2021-10-14,632.23,636.88,626.79,633.8,633.8,2672535,0.0,1.0
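
Because `iloc` works by position, it can differ from a lookup by index label once the rows have been reordered. The sketch below uses a small made-up dataframe and the label-based `loc` indexer, which is not otherwise covered in this notebook, to show the difference:

```python
from pandas import DataFrame

df = DataFrame({
    "timestamp": ["2021-10-18", "2021-10-15", "2021-10-14"],
    "close": [637.97, 628.29, 633.80],
})

# after sorting, each row keeps its original index label
oldest_first = df.sort_values(by="timestamp")
print(oldest_first.index.tolist())  #> [2, 1, 0]

# iloc: the row in the first POSITION
first_by_position = oldest_first.iloc[0]
print(first_by_position["timestamp"])  #> 2021-10-14

# loc: the row whose index LABEL is 0
row_with_label_0 = oldest_first.loc[0]
print(row_with_label_0["timestamp"])  #> 2021-10-18
```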


#### Iterating / Looping through Rows


We can loop through the rows using the `iterrows` method, destructuring each item into the row's index and its values.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html

When we reference a given row in this approach we are dealing with a `Series` object.

In [387]:
for i, row in prices_df.head(3).iterrows():
    print("------------")
    print("INDEX:", i)
    print(type(row))
    print(row)
    #print(row["timestamp"])

------------
INDEX: 0
<class 'pandas.core.series.Series'>
timestamp            2021-10-18
open                      632.1
high                     638.41
low                    620.5901
close                    637.97
adjusted_close           637.97
volume                  4669071
dividend_amount             0.0
split_coefficient           1.0
Name: 0, dtype: object
------------
INDEX: 1
<class 'pandas.core.series.Series'>
timestamp            2021-10-15
open                      638.0
high                     639.42
low                      625.16
close                    628.29
adjusted_close           628.29
volume                  4116874
dividend_amount             0.0
split_coefficient           1.0
Name: 1, dtype: object
------------
INDEX: 2
<class 'pandas.core.series.Series'>
timestamp            2021-10-14
open                     632.23
high                     636.88
low                      626.79
close                     633.8
adjusted_close            633.8
volume      

#### Sorting Rows

We can use the DataFrame's `sort_values` method to sort rows on the basis of one or more given columns.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html

By default, `sort_values` is not mutating, but we can use the `inplace` parameter to perform a mutating sort:

In [388]:
# can store back in same variable to overwrite:
# prices_df = prices_df.sort_values(by="timestamp", ascending=True)

# or alternatively use inplace=True parameter to perform a MUTATING operation
prices_df.sort_values(by="timestamp", ascending=True, inplace=True)
prices_df.head()

Unnamed: 0,timestamp,open,high,low,close,adjusted_close,volume,dividend_amount,split_coefficient
99,2021-05-27,501.8,505.1,498.54,503.86,503.86,3253773,0.0,1.0
98,2021-05-28,504.4,511.76,502.53,502.81,502.81,2911297,0.0,1.0
97,2021-06-01,504.01,505.41,497.7443,499.08,499.08,2482555,0.0,1.0
96,2021-06-02,499.82,503.22,495.82,499.24,499.24,2268979,0.0,1.0
95,2021-06-03,495.19,496.66,487.25,489.43,489.43,3887445,0.0,1.0


In [389]:
prices_df.sort_values(by="timestamp", ascending=False, inplace=True)
prices_df.head()

Unnamed: 0,timestamp,open,high,low,close,adjusted_close,volume,dividend_amount,split_coefficient
0,2021-10-18,632.1,638.41,620.5901,637.97,637.97,4669071,0.0,1.0
1,2021-10-15,638.0,639.42,625.16,628.29,628.29,4116874,0.0,1.0
2,2021-10-14,632.23,636.88,626.79,633.8,633.8,2672535,0.0,1.0
3,2021-10-13,632.1791,632.1791,622.1,629.76,629.76,2424638,0.0,1.0
4,2021-10-12,633.02,637.655,621.99,624.94,624.94,3227349,0.0,1.0
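
Sorting can also use more than one column, with a separate direction for each. This sketch uses a small made-up dataframe, keeps the original untouched (no `inplace`), and then calls `reset_index(drop=True)` so the index labels count up from 0 again in the new order:

```python
from pandas import DataFrame

df = DataFrame({
    "product": ["tee", "cap", "tee", "cap"],
    "units": [3, 5, 1, 2],
})

# sort by product name (A to Z), then by units (largest first) within each product
sorted_df = df.sort_values(by=["product", "units"], ascending=[True, False])

# discard the old index labels and renumber the rows from 0
sorted_df = sorted_df.reset_index(drop=True)

print(sorted_df)
print(df["units"].tolist())  # the original dataframe is unchanged
```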


#### Filtering Rows

We can filter a `DataFrame` to get only the rows that match some given condition.

We specify a "mask" condition that determines, for each row, whether it meets the criteria or not (True vs False).

Then we pass this mask into a dataframe object accessor to give us only the rows that meet the given condition.

In [391]:
# this is the mask:
prices_df["timestamp"] == "2021-10-15"

0     False
1      True
2     False
3     False
4     False
      ...  
95    False
96    False
97    False
98    False
99    False
Name: timestamp, Length: 100, dtype: bool

Filtering based on equality, using familiar `==` operator:



In [392]:
prices_df[  prices_df["timestamp"] == "2021-10-15" ]

Unnamed: 0,timestamp,open,high,low,close,adjusted_close,volume,dividend_amount,split_coefficient
1,2021-10-15,638.0,639.42,625.16,628.29,628.29,4116874,0.0,1.0


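Filtering based on comparison operators (like `>=`). Here the values are date strings in YYYY-MM-DD format, which compare correctly as plain text: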

In [393]:
prices_df[  prices_df["timestamp"] >= "2021-10-12" ]

Unnamed: 0,timestamp,open,high,low,close,adjusted_close,volume,dividend_amount,split_coefficient
0,2021-10-18,632.1,638.41,620.5901,637.97,637.97,4669071,0.0,1.0
1,2021-10-15,638.0,639.42,625.16,628.29,628.29,4116874,0.0,1.0
2,2021-10-14,632.23,636.88,626.79,633.8,633.8,2672535,0.0,1.0
3,2021-10-13,632.1791,632.1791,622.1,629.76,629.76,2424638,0.0,1.0
4,2021-10-12,633.02,637.655,621.99,624.94,624.94,3227349,0.0,1.0


Filtering based on values between lower and upper bound, using `between` method:


In [394]:
prices_df[  prices_df["timestamp"].between("2021-10-01", "2021-11-01") ]

Unnamed: 0,timestamp,open,high,low,close,adjusted_close,volume,dividend_amount,split_coefficient
0,2021-10-18,632.1,638.41,620.5901,637.97,637.97,4669071,0.0,1.0
1,2021-10-15,638.0,639.42,625.16,628.29,628.29,4116874,0.0,1.0
2,2021-10-14,632.23,636.88,626.79,633.8,633.8,2672535,0.0,1.0
3,2021-10-13,632.1791,632.1791,622.1,629.76,629.76,2424638,0.0,1.0
4,2021-10-12,633.02,637.655,621.99,624.94,624.94,3227349,0.0,1.0
5,2021-10-11,633.195,639.42,626.78,627.04,627.04,2862470,0.0,1.0
6,2021-10-08,634.165,643.8,630.86,632.66,632.66,3272093,0.0,1.0
7,2021-10-07,642.2252,646.84,630.45,631.85,631.85,3556886,0.0,1.0
8,2021-10-06,628.18,639.8699,626.36,639.1,639.1,4580434,0.0,1.0
9,2021-10-05,606.94,640.39,606.89,634.81,634.81,9534293,0.0,1.0


Filtering based on inclusion, using `isin` method:

In [395]:
dates_of_interest = ["2021-10-15", "2021-10-14", "2021-10-12"]

prices_df[  prices_df["timestamp"].isin(dates_of_interest) ]

Unnamed: 0,timestamp,open,high,low,close,adjusted_close,volume,dividend_amount,split_coefficient
1,2021-10-15,638.0,639.42,625.16,628.29,628.29,4116874,0.0,1.0
2,2021-10-14,632.23,636.88,626.79,633.8,633.8,2672535,0.0,1.0
4,2021-10-12,633.02,637.655,621.99,624.94,624.94,3227349,0.0,1.0


Filtering on substring match, using `str.contains` method:

In [396]:
prices_df[  prices_df["timestamp"].str.contains("2021-10-1") ]

Unnamed: 0,timestamp,open,high,low,close,adjusted_close,volume,dividend_amount,split_coefficient
0,2021-10-18,632.1,638.41,620.5901,637.97,637.97,4669071,0.0,1.0
1,2021-10-15,638.0,639.42,625.16,628.29,628.29,4116874,0.0,1.0
2,2021-10-14,632.23,636.88,626.79,633.8,633.8,2672535,0.0,1.0
3,2021-10-13,632.1791,632.1791,622.1,629.76,629.76,2424638,0.0,1.0
4,2021-10-12,633.02,637.655,621.99,624.94,624.94,3227349,0.0,1.0
5,2021-10-11,633.195,639.42,626.78,627.04,627.04,2862470,0.0,1.0


### Saving Data to CSV File

We can use the `to_csv` method to export a `DataFrame` to a CSV file with a specified name.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html

In [407]:
# run this cell then check the colab notebook filesystem, where you can see the file and download it
prices_df.to_csv("prices_export.csv", index=False)
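
To see what the `index=False` parameter does, this sketch exports a small made-up dataframe both ways and reads each result back in. Calling `to_csv` without a file path returns the CSV text as a string, which keeps the example self-contained.

```python
from io import StringIO
from pandas import DataFrame, read_csv

df = DataFrame({"date": ["2020-10-01", "2020-10-02"], "price": [100.0, 101.01]})

# with no filepath, to_csv returns the CSV contents as a string
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# the exported index comes back as an extra, unnamed column
print(read_csv(StringIO(with_index)).columns.tolist())    #> ['Unnamed: 0', 'date', 'price']
print(read_csv(StringIO(without_index)).columns.tolist()) #> ['date', 'price']
```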

## 3) Working with `Series`

Let's return to focus on the `Series` datatype a bit.

https://pandas.pydata.org/docs/reference/api/pandas.Series.html

When we access a given **column** of values, we have a `Series` object. We can look up values by their index label (with the default index, this looks just like list-style access), and convert the series to a list of values as desired:

In [397]:
closing_prices = prices_df["adjusted_close"]

print(type(closing_prices))
print(closing_prices)

<class 'pandas.core.series.Series'>
0     637.97
1     628.29
2     633.80
3     629.76
4     624.94
       ...  
95    489.43
96    499.24
97    499.08
98    502.81
99    503.86
Name: adjusted_close, Length: 100, dtype: float64


In [398]:
closing_prices[0]

637.97

In [399]:
print(closing_prices.tolist())

[637.97, 628.29, 633.8, 629.76, 624.94, 627.04, 632.66, 631.85, 639.1, 634.81, 603.35, 613.15, 610.34, 599.06, 583.85, 592.64, 592.39, 593.26, 590.65, 573.14, 575.43, 589.35, 586.5, 582.87, 577.76, 589.29, 598.72, 597.54, 606.05, 606.71, 590.53, 588.55, 582.07, 569.19, 566.18, 558.92, 550.12, 547.58, 553.41, 553.33, 546.88, 543.71, 521.87, 518.91, 517.92, 515.92, 510.72, 512.4, 515.84, 519.97, 520.55, 524.89, 517.35, 510.82, 515.15, 517.57, 514.25, 519.3, 518.91, 516.49, 515.41, 511.77, 513.63, 531.05, 532.28, 530.31, 542.95, 547.95, 540.68, 537.31, 535.98, 530.76, 535.96, 541.64, 533.98, 533.54, 528.21, 533.5, 533.03, 527.07, 518.06, 512.74, 508.82, 497.0, 500.77, 498.34, 492.41, 491.9, 499.89, 488.77, 487.27, 485.81, 492.39, 494.66, 494.74, 489.43, 499.24, 499.08, 502.81, 503.86]


When we access a given **row** of values, we have a `Series` object, and can convert it to a dictionary as desired:

In [400]:
latest = prices_df.iloc[0]
print(type(latest))
print(latest)

<class 'pandas.core.series.Series'>
timestamp            2021-10-18
open                      632.1
high                     638.41
low                    620.5901
close                    637.97
adjusted_close           637.97
volume                  4669071
dividend_amount             0.0
split_coefficient           1.0
Name: 0, dtype: object


In [401]:
latest["close"]

637.97

In [402]:
latest.to_dict()

{'timestamp': '2021-10-18',
 'open': 632.1,
 'high': 638.41,
 'low': 620.5901,
 'close': 637.97,
 'adjusted_close': 637.97,
 'volume': 4669071,
 'dividend_amount': 0.0,
 'split_coefficient': 1.0}

### Value Counts

We can use the `value_counts` method to count the number of times each value appears in the series.

https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html

In [503]:
prices_df["dividend_amount"].value_counts()

0.0    100
Name: dividend_amount, dtype: int64

In [501]:
prices_df["adjusted_close"].value_counts()

518.91    2
637.97    1
531.05    1
541.64    1
535.96    1
         ..
590.53    1
606.71    1
606.05    1
597.54    1
503.86    1
Name: adjusted_close, Length: 99, dtype: int64

Using the `normalize` parameter will display the counts as proportions of the total number of rows (values between 0 and 1):

In [504]:
prices_df["dividend_amount"].value_counts(normalize=True)

0.0    1.0
Name: dividend_amount, dtype: float64

In [502]:
prices_df["adjusted_close"].value_counts(normalize=True)

518.91    0.02
637.97    0.01
531.05    0.01
541.64    0.01
535.96    0.01
          ... 
590.53    0.01
606.71    0.01
606.05    0.01
597.54    0.01
503.86    0.01
Name: adjusted_close, Length: 99, dtype: float64
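
The normalized values are fractions that add up to 1, so multiplying by 100 gives percentages. A minimal sketch with a made-up series of weekday labels:

```python
from pandas import Series

weekdays = Series(["Mon", "Tue", "Mon", "Wed", "Mon"])

counts = weekdays.value_counts()                # number of appearances per value
shares = weekdays.value_counts(normalize=True)  # fraction of all rows per value
percentages = (shares * 100).round(1)           # converted to percentages

print(counts)
print(shares)
print(percentages)
```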

### Series Aggregations

Consult the pandas Series docs for a comprehensive list of methods, including these aggregation methods:

In [403]:
prices_df["adjusted_close"].min()

485.81

In [404]:
prices_df["adjusted_close"].max()

639.1

In [405]:
prices_df["adjusted_close"].mean()

548.3657

In [406]:
prices_df["adjusted_close"].median()

533.76
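
If we want several of these aggregations at once, one option is the `agg` method, which accepts a list of aggregation names and returns the results together in a new `Series`. This is a sketch using a short made-up series of prices:

```python
from pandas import Series

prices = Series([100.00, 101.01, 120.20, 107.07])

# each name in the list becomes an index label in the result
summary = prices.agg(["min", "max", "mean", "median"])

print(type(summary))
print(summary)
```

The related `describe` method produces a similar summary, including the count, standard deviation, and quartiles.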

## 4) Manipulating DataFrames

Let's zoom back out to the `DataFrame` level, and learn how to manipulate or change them.

### Creating New Columns

Creating a column of constants, for example (see also Mapping Columns section for more examples):

In [408]:
prices_df["my_constant"] = 5
prices_df["my_constant"]

0     5
1     5
2     5
3     5
4     5
     ..
95    5
96    5
97    5
98    5
99    5
Name: my_constant, Length: 100, dtype: int64

### Renaming Columns

We can use the `rename` method to rename columns. This is not mutating unless we use the `inplace` parameter.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html

In [409]:
prices_df.rename(columns={"my_constant":"renamed_constant"}, inplace=True)
prices_df["renamed_constant"]

0     5
1     5
2     5
3     5
4     5
     ..
95    5
96    5
97    5
98    5
99    5
Name: renamed_constant, Length: 100, dtype: int64

### Dropping Columns

We can use the `drop` method to remove columns from the dataset. This is not mutating unless we use the `inplace` parameter.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html

In [410]:
prices_df.drop(columns=["renamed_constant"], inplace=True)
prices_df.head()

Unnamed: 0,timestamp,open,high,low,close,adjusted_close,volume,dividend_amount,split_coefficient
0,2021-10-18,632.1,638.41,620.5901,637.97,637.97,4669071,0.0,1.0
1,2021-10-15,638.0,639.42,625.16,628.29,628.29,4116874,0.0,1.0
2,2021-10-14,632.23,636.88,626.79,633.8,633.8,2672535,0.0,1.0
3,2021-10-13,632.1791,632.1791,622.1,629.76,629.76,2424638,0.0,1.0
4,2021-10-12,633.02,637.655,621.99,624.94,624.94,3227349,0.0,1.0


### Resetting the Index

Every `DataFrame` has default index values, but we can set our own.

We might do this in practice when one of the columns in the dataset contains values which uniquely identify each row in the dataset.

In [411]:
# assuming our timestamp values are unique, we could use them as the index:
#prices_df.index = prices_df["timestamp"]

#prices_df.head()

In [412]:
# creating a copy of our dataset, so we don't change the original:
prices_copy = prices_df.copy()

# set new index
# assuming our "timestamp" values are unique, we could use them as the index:
prices_copy.index = prices_copy["timestamp"]
prices_copy.head()

Unnamed: 0_level_0,timestamp,open,high,low,close,adjusted_close,volume,dividend_amount,split_coefficient
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2021-10-18,2021-10-18,632.1,638.41,620.5901,637.97,637.97,4669071,0.0,1.0
2021-10-15,2021-10-15,638.0,639.42,625.16,628.29,628.29,4116874,0.0,1.0
2021-10-14,2021-10-14,632.23,636.88,626.79,633.8,633.8,2672535,0.0,1.0
2021-10-13,2021-10-13,632.1791,632.1791,622.1,629.76,629.76,2424638,0.0,1.0
2021-10-12,2021-10-12,633.02,637.655,621.99,624.94,624.94,3227349,0.0,1.0
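
Assigning to `.index` keeps the original column as well, which is why "timestamp" appears twice in the output above. As an alternative sketch on a small made-up dataframe, the `set_index` method moves the column into the index, and `reset_index` moves it back out and restores the default numbering. Neither is mutating by default.

```python
from pandas import DataFrame

df = DataFrame({
    "timestamp": ["2021-10-18", "2021-10-15"],
    "close": [637.97, 628.29],
})

# move the "timestamp" column into the index
indexed = df.set_index("timestamp")
print(indexed.columns.tolist())  #> ['close']

# rows can now be looked up by their timestamp label
print(indexed.loc["2021-10-15", "close"])  #> 628.29

# move the index back into a regular column, restoring the default index
restored = indexed.reset_index()
print(restored.columns.tolist())  #> ['timestamp', 'close']
```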


### Mapping Columns

We can create mapped / transformed versions of the original columns, and store them back in the dataframe in a new column.

When we perform an operation with a scalar (single value), we perform it on each item in the Series:

In [413]:
prices_df["volume"] * 100

0     466907100
1     411687400
2     267253500
3     242463800
4     322734900
        ...    
95    388744500
96    226897900
97    248255500
98    291129700
99    325377300
Name: volume, Length: 100, dtype: int64

In [414]:
#prices_df["timestamp"] + " 12:00:00"

When we perform an operation with two series, pandas performs it element-wise: the first values in each series are combined, then the second values, and so on (values are matched up by their index labels).

This essentially allows us to access and compare multiple values from the same row:

In [415]:
# mapped Series of values:
prices_df["close"] - prices_df["open"]

0     5.8700
1    -9.7100
2     1.5700
3    -2.4191
4    -8.0800
       ...  
95   -5.7600
96   -0.5800
97   -4.9300
98   -1.5900
99    2.0600
Length: 100, dtype: float64

In [416]:
# storing back in a new column:
prices_df["daily_range"] = prices_df["close"] - prices_df["open"]

prices_df[["timestamp", "open", "close", "daily_range"]].head()

Unnamed: 0,timestamp,open,close,daily_range
0,2021-10-18,632.1,637.97,5.87
1,2021-10-15,638.0,628.29,-9.71
2,2021-10-14,632.23,633.8,1.57
3,2021-10-13,632.1791,629.76,-2.4191
4,2021-10-12,633.02,624.94,-8.08


##### Applying a Transformation Function

For more complex logic, we can use the `apply` method to apply a transformation function.

We define the transformation function to accept a single parameter representing one of the values in the original series we are applying the transformation over.

Then we pass that transformation function as a parameter to the `apply` method.

https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html

In [436]:
# transformation function:

def buy_sell_recommendation(adjusted_closing_price):
    if adjusted_closing_price < 630:
        return "BUY"
    else:
        return "SELL"

assert buy_sell_recommendation(629.9) == "BUY"
assert buy_sell_recommendation(630.1) == "SELL"

In [441]:
prices_df["recommendation"] = prices_df["adjusted_close"].apply(buy_sell_recommendation)

prices_df[[ "adjusted_close", "recommendation"]].head()

Unnamed: 0,adjusted_close,recommendation
0,637.97,SELL
1,628.29,BUY
2,633.8,SELL
3,629.76,BUY
4,624.94,BUY


In [418]:
# transformation function:

def parse_year_month(my_date_str):
    """Param my_date_str (str) : a date string like '2021-10-18'."""
    # take the first seven characters, like "2021-10"
    return my_date_str[0:7]

assert parse_year_month("2050-01-30") == "2050-01"

In [442]:
prices_df["year_month"] = prices_df["timestamp"].apply(parse_year_month)

prices_df[["timestamp", "year_month"]]

Unnamed: 0,timestamp,year_month
0,2021-10-18,2021-10
1,2021-10-15,2021-10
2,2021-10-14,2021-10
3,2021-10-13,2021-10
4,2021-10-12,2021-10
...,...,...
95,2021-06-03,2021-06
96,2021-06-02,2021-06
97,2021-06-01,2021-06
98,2021-05-28,2021-05
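
For simple transformations like the two above, there are often vectorized alternatives to `apply` that operate on the whole column at once. This sketch, on a small made-up dataframe, uses `numpy.where` for the buy / sell logic and the `.str` accessor for the string slicing. It is one possible approach, not a replacement for `apply` in more complex cases.

```python
import numpy as np
from pandas import DataFrame

df = DataFrame({
    "timestamp": ["2021-10-18", "2021-10-15", "2021-10-14"],
    "adjusted_close": [637.97, 628.29, 633.80],
})

# np.where(condition, value_if_true, value_if_false), evaluated for every row
df["recommendation"] = np.where(df["adjusted_close"] < 630, "BUY", "SELL")

# the .str accessor applies string operations (here, slicing) to every value
df["year_month"] = df["timestamp"].str[0:7]

print(df)
```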


##### Converting to Dates and Times

For date and time specific values, we can use the `to_datetime` helper function from pandas to make life easier.

https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html

In [458]:
print(type(prices_df["timestamp"][0]))

<class 'str'>


In [459]:
from pandas import to_datetime

prices_df["timestamp_dt"] = to_datetime(prices_df["timestamp"])

print(type(prices_df["timestamp_dt"][0]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>


Now we have a datetime-aware column that supports datetime operations. See:

  + https://docs.python.org/3/library/datetime.html
  + https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
  + https://docs.python.org/3/library/datetime.html#datetime.datetime.weekday

In [456]:
prices_df["year"] = prices_df["timestamp_dt"].dt.year
prices_df["month"] = prices_df["timestamp_dt"].dt.month
prices_df["weekday"] = prices_df["timestamp_dt"].dt.weekday
prices_df["day_name"] = prices_df["timestamp_dt"].dt.strftime("%A")

prices_df[["timestamp", "timestamp_dt", "year_month", "year", "month", "weekday", "day_name"]]

Unnamed: 0,timestamp,timestamp_dt,year_month,year,month,weekday,day_name
0,2021-10-18,2021-10-18,2021-10,2021,10,0,Monday
1,2021-10-15,2021-10-15,2021-10,2021,10,4,Friday
2,2021-10-14,2021-10-14,2021-10,2021,10,3,Thursday
3,2021-10-13,2021-10-13,2021-10,2021,10,2,Wednesday
4,2021-10-12,2021-10-12,2021-10,2021,10,1,Tuesday
...,...,...,...,...,...,...,...
95,2021-06-03,2021-06-03,2021-06,2021,6,3,Thursday
96,2021-06-02,2021-06-02,2021-06,2021,6,2,Wednesday
97,2021-06-01,2021-06-01,2021-06,2021,6,1,Tuesday
98,2021-05-28,2021-05-28,2021-05,2021,5,4,Friday


In [None]:
#def day_of_week(day_index):
#    weekdays_map = {0: "Monday", 1: "Tuesday", 2: "Wednesday", 3: "Thursday", 4: "Friday", 5: "Saturday", 6: "Sunday"}
#    return weekdays_map[day_index]
#
#prices_df["day_of_week"] = prices_df["weekday"].apply(day_of_week)
#prices_df[["timestamp", "timestamp_dt", "year_month", "year", "month", "weekday", "day_of_week"]]
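
A datetime-aware column also supports date comparisons and date arithmetic. The sketch below, on a small made-up dataframe, filters by date, measures the span between the earliest and latest dates, and uses `dt.day_name()` as another way to get weekday names.

```python
from pandas import DataFrame, to_datetime

df = DataFrame({"timestamp": ["2021-10-18", "2021-10-15", "2021-05-28"]})
df["timestamp_dt"] = to_datetime(df["timestamp"])

# weekday names, without needing a format string
df["day_name"] = df["timestamp_dt"].dt.day_name()

# filtering with a real date comparison (not a text comparison)
cutoff = to_datetime("2021-10-01")
october_onward = df[df["timestamp_dt"] >= cutoff]

# subtracting two timestamps gives a Timedelta, which knows its number of days
span_days = (df["timestamp_dt"].max() - df["timestamp_dt"].min()).days

print(df)
print(len(october_onward))
print(span_days)
```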

## 5) DataFrame Aggregations

To better illustrate some advanced techniques, we will use other datasets.

In this sales dataset, you will see we have a row per date per product sold on that date.

In [465]:
from pandas import read_csv, to_datetime

sales_df = read_csv("https://raw.githubusercontent.com/prof-rossetti/data-analytics-in-python/main/data/unit-2/monthly-sales/sales-201803.csv")
sales_df["date"] = to_datetime(sales_df["date"])
sales_df.head()

Unnamed: 0,date,product,unit price,units sold,sales price
0,2018-03-01,Button-Down Shirt,65.05,2,130.1
1,2018-03-01,Vintage Logo Tee,15.95,1,15.95
2,2018-03-01,Sticker Pack,4.5,1,4.5
3,2018-03-02,Super Soft Hoodie,75.0,2,150.0
4,2018-03-02,Button-Down Shirt,65.05,7,455.35


In [422]:
print(len(sales_df))

117


In [423]:
products = sales_df["product"].unique()
print(products)

['Button-Down Shirt' 'Vintage Logo Tee' 'Sticker Pack' 'Super Soft Hoodie'
 'Baseball Cap' 'Khaki Pants' 'Brown Boots']


In [424]:
days = sales_df["date"].unique()

print(len(days))
print(days)

31
['2018-03-01' '2018-03-02' '2018-03-03' '2018-03-04' '2018-03-05'
 '2018-03-06' '2018-03-07' '2018-03-08' '2018-03-09' '2018-03-10'
 '2018-03-11' '2018-03-12' '2018-03-13' '2018-03-14' '2018-03-15'
 '2018-03-16' '2018-03-17' '2018-03-18' '2018-03-19' '2018-03-20'
 '2018-03-21' '2018-03-22' '2018-03-23' '2018-03-24' '2018-03-25'
 '2018-03-26' '2018-03-27' '2018-03-28' '2018-03-29' '2018-03-30'
 '2018-03-31']


We can get the total sales pretty easily using `Series` aggregation:

In [425]:
print("TOTAL SALES:", sales_df["sales price"].sum().round(2))

TOTAL SALES: 12000.71


However, how can we dynamically get the total sales for each product?

### Grouping

Enter the `groupby` method, which splits the rows into groups so that each group can be aggregated separately.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html

In [426]:
sales_by_product = sales_df.groupby("product")["sales price"].sum()
print("SALES PER PRODUCT:")
print(type(sales_by_product))
sales_by_product.sort_values(ascending=False)

SALES PER PRODUCT:
<class 'pandas.core.series.Series'>


product
Button-Down Shirt    6960.35
Super Soft Hoodie    1875.00
Khaki Pants          1602.00
Vintage Logo Tee      941.05
Brown Boots           250.00
Sticker Pack          216.00
Baseball Cap          156.31
Name: sales price, dtype: float64

How about the total sales per day?

In [427]:
sales_by_date = sales_df.groupby("date")["sales price"].sum()
print("SALES PER DAY:")
print(type(sales_by_date))
sales_by_date

SALES PER DAY:
<class 'pandas.core.series.Series'>


date
2018-03-01     150.55
2018-03-02     675.53
2018-03-03     902.63
2018-03-04     209.28
2018-03-05     177.50
2018-03-06     117.40
2018-03-07     371.60
2018-03-08      69.55
2018-03-09     340.38
2018-03-10     765.20
2018-03-11    1067.48
2018-03-12     383.05
2018-03-13     252.00
2018-03-14     449.15
2018-03-15      24.95
2018-03-16     290.90
2018-03-17     641.18
2018-03-18     739.25
2018-03-19     199.45
2018-03-20     327.00
2018-03-21     306.55
2018-03-22     215.60
2018-03-23     495.70
2018-03-24     610.40
2018-03-25     259.40
2018-03-26     228.10
2018-03-27     295.10
2018-03-28     117.40
2018-03-29     298.60
2018-03-30     510.33
2018-03-31     509.50
Name: sales price, dtype: float64

Sales per weekday?

In [466]:
sales_df["weekday"] = sales_df["date"].dt.strftime("%A")

sales_by_weekday = sales_df.groupby("weekday")["sales price"].sum()
print("SALES PER DAY OF WEEK:")
print(type(sales_by_weekday))
sales_by_weekday

SALES PER DAY OF WEEK:
<class 'pandas.core.series.Series'>


weekday
Friday       2312.84
Monday        988.10
Saturday     3428.91
Sunday       2275.41
Thursday      759.25
Tuesday       991.50
Wednesday    1244.70
Name: sales price, dtype: float64

### Pivot Tables

We can alternatively use the `pivot_table` function for more fine-grained grouping and aggregation.

https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html

The `index` parameter specifies the rows we want to wind up with (i.e. "row per what?").

The `values` param specifies what columns we would like to aggregate.

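The `aggfunc` param specifies how to aggregate those columns. We can pass our own aggregation function, use one from numpy, or pass the name of a built-in aggregation as a string (like "sum"). Recent pandas versions may emit a warning that recommends the string form over `np.sum`. We can aggregate different columns differently.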

In [443]:
from pandas import pivot_table
import numpy as np

dates_pivot = pivot_table(sales_df,
                          index=["date"],
                          values=["sales price"],
                          aggfunc={"sales price": np.sum}
                        )

print(type(dates_pivot))
dates_pivot.rename(columns={"sales price": "sales_total"}, inplace=True)
dates_pivot.sort_values(by=["sales_total"], ascending=False, inplace=True)
dates_pivot.head()

<class 'pandas.core.frame.DataFrame'>


date,sales_total
2018-03-11,1067.48
2018-03-03,902.63
2018-03-10,765.2
2018-03-18,739.25
2018-03-02,675.53
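
For a simple single-column sum like this one, the pivot table should contain the same numbers we would get from `groupby`. Here is a minimal sketch comparing the two approaches on made-up data (the `toy_df` rows are hypothetical):

```python
from pandas import DataFrame, pivot_table

# small made-up dataset with two sales on the first day
toy_df = DataFrame([
    {"date": "2018-03-01", "sales price": 10.0},
    {"date": "2018-03-01", "sales price": 15.0},
    {"date": "2018-03-02", "sales price": 40.0},
])

# approach 1: pivot table with one row per date
via_pivot = pivot_table(toy_df, index=["date"], values=["sales price"], aggfunc="sum")

# approach 2: group by date and sum the same column
via_groupby = toy_df.groupby("date")[["sales price"]].sum()

print(via_pivot)
print(via_groupby)
```

The pivot table approach becomes more convenient once we want to aggregate several columns, each in its own way.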


In [461]:
from pandas import pivot_table
import numpy as np

dates_pivot = pivot_table(sales_df,
    index=["date"],
    values=["sales price", "units sold"],
    aggfunc={
        "sales price": np.sum,
        "units sold": np.sum
    } # designate the agg function to be used for each original column. can use our own custom functions here as well
)

print(type(dates_pivot))
dates_pivot.rename(columns={"sales price": "sales_total", "units sold": "units_sold"}, inplace=True)
dates_pivot.sort_values(by=["sales_total"], ascending=False, inplace=True)
dates_pivot.head()

<class 'pandas.core.frame.DataFrame'>


date,sales_total,units_sold
2018-03-11,1067.48,20
2018-03-03,902.63,15
2018-03-10,765.2,15
2018-03-18,739.25,18
2018-03-02,675.53,13


In [462]:
from pandas import pivot_table
import numpy as np

products_pivot = pivot_table(sales_df,
    index=["product"],
    values=["sales price", "units sold"],
    aggfunc={
        "sales price": np.sum,
        "units sold": np.sum
    }
)

print(type(products_pivot))
products_pivot.rename(columns={"sales price": "sales_total", "units sold": "units_sold"}, inplace=True)
products_pivot.sort_values(by=["sales_total"], ascending=False, inplace=True)
products_pivot.head()

<class 'pandas.core.frame.DataFrame'>


product,sales_total,units_sold
Button-Down Shirt,6960.35,107
Super Soft Hoodie,1875.0,25
Khaki Pants,1602.0,18
Vintage Logo Tee,941.05,59
Brown Boots,250.0,2


In [478]:
from pandas import pivot_table
import numpy as np

weekdays_pivot = pivot_table(sales_df,
    index=["weekday"],
    values=["sales price", "units sold"],
    aggfunc={
        "sales price": np.mean,
        "units sold": np.mean
    }
)

print(type(weekdays_pivot))
weekdays_pivot.rename(columns={"sales price": "sales_avg", "units sold": "units_avg"}, inplace=True)
weekdays_pivot.sort_values(by=["sales_avg"], ascending=False, inplace=True)
weekdays_pivot.head()

<class 'pandas.core.frame.DataFrame'>


weekday,sales_avg,units_avg
Saturday,155.859545,3.045455
Sunday,126.411667,2.944444
Friday,115.642,2.2
Wednesday,77.79375,1.8125
Monday,76.007692,1.923077


In [None]:
#import plotly.express as px
#
#chart_df = products_pivot.copy()
#chart_df["product"] = chart_df.index
#
## sorting inverse order to get the bars to show up in the right order when charting
#chart_df.sort_values(by=["sales_total"], ascending=True, inplace=True)
#px.bar(chart_df,  y="product", x="sales_total", orientation="h", title="Top Selling Products (March 2018)")


In [None]:
#import plotly.express as px
#
#chart_df = dates_pivot.copy()
#chart_df.sort_values(by=["date"], ascending=True, inplace=True)
#chart_df["date"] = chart_df.index # adding this as a separate column, for charting purposes
#
#px.bar(chart_df, x="date", y="sales_total", title="Sales by Day (March 2018)")

In [None]:
#import plotly.express as px
#
#chart_df = weekdays_pivot.copy()
#chart_df["weekday"] = chart_df.index
#
## sorting inverse order to get the bars to show up in the right order when charting
#chart_df.sort_values(by=["sales_avg"], ascending=True, inplace=True)
#px.bar(chart_df,  y="weekday", x="sales_avg", orientation="h", title="Average Sales per Day of Week (March 2018)")


## 6) Shift-based Methods

We will now cover shift-based methods, which allow us to perform operations using a given value and other corresponding values in the rows above or below.

Here is a very simple example dataset to illustrate these concepts:

In [467]:
from pandas import DataFrame

gdp_df = DataFrame([
    {"year": 1990, "gdp": 100},
    {"year": 1991, "gdp": 105},
    {"year": 1992, "gdp": 110},
    {"year": 1993, "gdp": 115},
    {"year": 1994, "gdp": 110}
])
gdp_df.head()

Unnamed: 0,year,gdp
0,1990,100
1,1991,105
2,1992,110
3,1993,115
4,1994,110


### Shift, Growth, Percent Change

Because row order matters for shift-based methods, we should first ensure the rows are sorted in the proper order.

In [469]:
# it is already sorted, but here we are sorting by year for good measure
gdp_df.sort_values(by=["year"], ascending=True, inplace=True)
gdp_df.head()

Unnamed: 0,year,gdp
0,1990,100
1,1991,105
2,1992,110
3,1993,115
4,1994,110


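We can use the `shift` method, specifying how many rows to shift by via the `periods` parameter. With a positive value, each row receives the value from that many rows above it; with a negative value, from that many rows below it.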

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html

In [470]:
gdp_df["gdp"].shift(periods=1) # 1 to get the value from the row above, -1 to get the value from the row below

0      NaN
1    100.0
2    105.0
3    110.0
4    115.0
Name: gdp, dtype: float64
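
The direction of the shift depends on how the rows are sorted. As a minimal sketch, if the rows were sorted with the newest year first, the previous year's value would sit in the row below, so we would shift by -1 instead. The `toy_df` and `desc_df` names are made up for illustration.

```python
from pandas import DataFrame

toy_df = DataFrame({"year": [1990, 1991, 1992], "gdp": [100, 105, 110]})

# newest year first: the previous year's value now sits in the row BELOW
desc_df = toy_df.sort_values(by=["year"], ascending=False)

# a negative period pulls each value up from the row below
desc_df["gdp_prev"] = desc_df["gdp"].shift(periods=-1)
print(desc_df)
```

The last row (the earliest year) has no row below it, so its previous value is missing.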

This allows us to perform an ad-hoc percent growth calculation from one period to another:

In [481]:
# our own calculation of growth / percent change, using current and previous values:
gdp_df["gdp_prev"] = gdp_df["gdp"].shift(periods=1)

gdp_df["gdp_growth"] = (gdp_df["gdp"] - gdp_df["gdp_prev"]) / gdp_df["gdp_prev"]

gdp_df[["year", "gdp", "gdp_prev", "gdp_growth"]]

Unnamed: 0,year,gdp,gdp_prev,gdp_growth
0,1990,100,,
1,1991,105,100.0,0.05
2,1992,110,105.0,0.047619
3,1993,115,110.0,0.045455
4,1994,110,115.0,-0.043478


However, there is also a built-in `pct_change` method for this purpose.

https://pandas.pydata.org/docs/reference/api/pandas.Series.pct_change.html

In [472]:
# equivalent, leveraging the pct_change method:
gdp_df["gdp_growth"] = gdp_df["gdp"].pct_change(periods=1)

gdp_df[["year", "gdp", "gdp_growth"]]

Unnamed: 0,year,gdp,gdp_growth
0,1990,100,
1,1991,105,0.05
2,1992,110,0.047619
3,1993,115,0.045455
4,1994,110,-0.043478


### Cumulative Growth

To calculate cumulative growth for a particular period, we can use the `cumprod()` (i.e. cumulative product) method, or the `product()` method, depending on the use case. The former returns a running product for each row, while the latter returns a single overall value.


https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.cumprod.html

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.product.html

Before we calculate a product, we'll need the growth numbers to be relative to 1, rather than 0. We'll also need to replace the initial NaN growth value with 0, so that after adding 1, the first period represents 100%.

In [484]:
gdp_df["gdp_growth"] + 1

0         NaN
1    1.050000
2    1.047619
3    1.045455
4    0.956522
Name: gdp_growth, dtype: float64

In [483]:
# loc[row_index, column_name]
gdp_df.loc[0, "gdp_growth"]

nan

In [485]:
# overwriting initial NaN value to make our math work later
gdp_df.loc[0, "gdp_growth"] = 0

gdp_df

Unnamed: 0,year,gdp,gdp_prev,gdp_growth
0,1990,100,,0.0
1,1991,105,100.0,0.05
2,1992,110,105.0,0.047619
3,1993,115,110.0,0.045455
4,1994,110,115.0,-0.043478


In [486]:
gdp_df["gdp_growth"] + 1

0    1.000000
1    1.050000
2    1.047619
3    1.045455
4    0.956522
Name: gdp_growth, dtype: float64

In [487]:
(gdp_df["gdp_growth"] + 1).cumprod()

0    1.00
1    1.05
2    1.10
3    1.15
4    1.10
Name: gdp_growth, dtype: float64

In [488]:
gdp_df["cumulative_growth"] = (gdp_df["gdp_growth"] + 1).cumprod()

gdp_df

Unnamed: 0,year,gdp,gdp_prev,gdp_growth,cumulative_growth
0,1990,100,,0.0,1.0
1,1991,105,100.0,0.05,1.05
2,1992,110,105.0,0.047619,1.1
3,1993,115,110.0,0.045455,1.15
4,1994,110,115.0,-0.043478,1.1
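
The `product()` method mentioned at the start of this section is not demonstrated above, so here is a minimal sketch contrasting it with `cumprod()`. It rebuilds the GDP values in a standalone `Series`, and uses `fillna(0)` as an alternative way to replace the initial NaN (an assumption for brevity, in place of the `loc` assignment).

```python
from pandas import Series

gdp = Series([100, 105, 110, 115, 110])

# fillna(0) treats the first period as zero growth, then we make values relative to 1
growth_factors = gdp.pct_change(periods=1).fillna(0) + 1

running = growth_factors.cumprod()  # one value per row (growth to date)
overall = growth_factors.product()  # a single number for the whole period

print(running.round(2).tolist())
print(round(overall, 2))
```

The overall product should match the last value of the running product, so `product()` is the shortcut when we only care about growth from start to end.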


### Trends and Moving Averages

We can leverage the related window-based methods, `rolling` and `ewm`, to calculate trends and moving averages.

Rolling Window Averages:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html

Exponential Weighted Moving Averages:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ewm.html

In [None]:
trends_df = prices_df.copy()
trends_df.sort_values(by=["timestamp"], inplace=True) # sort for proper rolling order

trends_df["ma_50"] = trends_df["adjusted_close"].rolling(window=50).mean()

trends_df["ema_50"] = trends_df["adjusted_close"].ewm(span=50, min_periods=0, adjust=False, ignore_na=False).mean()

#trends_df
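
To build intuition for the exponentially weighted average, here is a minimal sketch that reproduces it by hand on a few made-up values. With `adjust=False`, each result blends the previous result with the new value, using a smoothing factor of `2 / (span + 1)`. The `prices` values and the small `span` are assumptions for illustration.

```python
from pandas import Series

prices = Series([10.0, 12.0, 11.0, 13.0])  # toy values
span = 3
alpha = 2 / (span + 1)  # smoothing factor implied by the span

# manual recursion: start from the first value, then blend in each new value
manual = [prices.iloc[0]]
for price in prices.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * price)

# the same calculation using the built-in method
built_in = prices.ewm(span=span, adjust=False).mean()

print(manual)
print(built_in.tolist())
```

A larger span gives a smaller smoothing factor, so the average reacts more slowly to new values.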

In [490]:
import plotly.express as px

# plot the same adjusted closing prices the moving averages were computed from
px.line(trends_df, x="timestamp", y=["adjusted_close", "ma_50", "ema_50"],
        title="Adjusted Closing Prices",
        color_discrete_map={
            "adjusted_close": "royalblue",
            "ma_50": "orange",
            "ema_50": "yellow"
        }
)

You'll notice there are no values for the first N-1 periods in our rolling window average, where N is the size of the window. This is because there aren't yet enough values to fill the window. It's OK!
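
To see the missing early values on a small scale, here is a minimal sketch using made-up numbers and a window of 3 (both are assumptions for illustration). It also shows the `min_periods` parameter of `rolling`, which permits partial windows if we would rather have early values than gaps.

```python
from pandas import Series

values = Series([10, 20, 30, 40, 50])

# default: a result appears only once the window is full,
# so the first (window - 1) rows are NaN
full_window = values.rolling(window=3).mean()

# min_periods=1: average whatever values are available so far
partial_window = values.rolling(window=3, min_periods=1).mean()

print(full_window.tolist())
print(partial_window.tolist())
```

Partial windows fill the gaps, but the earliest averages are based on fewer values, so they are less smooth than the rest.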