# Pandas

## Introduction

[Documentation](https://pandas.pydata.org/docs/#pandas-documentation)

___Pandas is well suited for many different kinds of data:___

* Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
* Ordered and unordered (not necessarily fixed-frequency) time series data.
* Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
* Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

The two primary data structures of pandas, [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) (1-dimensional) and [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. 

___Here are just a few of the things that pandas does well:___

* Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
* Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
* Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
* Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
* Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
* Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
* Intuitive merging and joining data sets
* Flexible reshaping and pivoting of data sets
* Hierarchical labeling of axes (possible to have multiple labels per tick)
* Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
* Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging.
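
As a small illustration of the automatic alignment and missing-data handling listed above, the sketch below adds two Series whose labels only partly overlap. The labels and values are made up for the example.

```python
import pandas as pd

# Two Series with partially overlapping labels (hypothetical data).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Addition aligns on labels, not on position.
# "x" has no partner in b, so its result is NaN.
total = a + b
print(total)

# Reductions such as sum skip missing values by default.
print(total.sum())
```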

___Mutability and copying of data___

All pandas data structures are value-mutable (the values they contain can be altered) but not always size-mutable.  
* The length of a __Series__ cannot be changed
* Columns can be inserted into a __DataFrame__. 

However, the vast majority of methods produce new objects and leave the input data untouched. In general we like to favor immutability where sensible.
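
A minimal sketch of that behavior, using a hypothetical one-column DataFrame: a method call such as `sort_values` hands back a new object, while column assignment changes the existing one.

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2]})

# sort_values returns a new DataFrame; df keeps its original row order.
sorted_df = df.sort_values("a")

# Assigning to a new column label inserts a column into df itself.
df["b"] = df["a"] * 2

print(df)
print(sorted_df)
```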

## Basics

* The columns and the index are known as the axes. The index is axis 0, and the columns are axis 1.
* Pandas uses `NaN` (not a number) to represent missing values.
* By default, pandas shows 60 rows and 20 columns (controlled by the `display.max_rows` and `display.max_columns` options); the source book limits that further so the data fits on a page.
* The `.head` method accepts an optional parameter, `n`, which controls the number of rows displayed. The default value for `n` is 5. Similarly, the `.tail` method returns the last `n` rows.
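
The points about axes and `.head`/`.tail` can be tried on a small frame. The DataFrame below is invented for the example; only the method names come from the notes above.

```python
import pandas as pd

df = pd.DataFrame({"a": range(10), "b": range(10, 20)})

first_three = df.head(3)   # n passed positionally
last_two = df.tail(n=2)    # n passed by keyword

# axis=0 runs down the index: one result per column.
col_sums = df.sum(axis=0)
# axis=1 runs across the columns: one result per row.
row_sums = df.sum(axis=1)

print(first_three)
print(last_two)
print(col_sums)
```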


![image.png](./images/dataframe-struct.png)

### Data types

https://pandas.pydata.org/pandas-docs/stable/reference/arrays.html#datetime-data

In very broad terms, data may be classified as either __continuous__ or __categorical__. 
* Continuous data is always numeric and represents some kind of measurement, such as height, wage, or salary. Continuous data can take on an infinite number of possible values.
* Categorical data, on the other hand, represents a discrete, finite set of values, such as car color, type of poker hand, or brand of cereal.

The following describes common pandas data types:

* float – The NumPy float type, which supports missing values
* int – The NumPy integer type, which does not support missing values
* 'Int64' – pandas nullable integer type
* `object` – The NumPy type for storing strings (and mixed types). The object data type is unlike the others: a column of the object data type may contain any valid Python object. __Typically, when a column is of the object data type, it signals that the entire column is strings. When you load a CSV file and a string column has missing values, pandas puts a NaN (float) in those cells, so the column may hold both strings and float (missing) values.__
* 'category' – pandas categorical type, which does support missing values. As pandas grew larger and more popular, the `object` data type proved to be too generic for all columns with string values. __pandas created its own categorical data type to handle columns of strings (or numbers) with a fixed number of possible values.__
* bool – The NumPy Boolean type, which does not support missing values (None becomes False, np.nan becomes True)
* 'boolean' – pandas nullable Boolean type
* datetime64[ns] – The NumPy date type, which does support missing values (NaT)
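
The following sketch shows how several of these types treat a missing value. The sample values are hypothetical; the dtype names are the ones listed above.

```python
import numpy as np
import pandas as pd

# A NumPy integer column cannot hold a missing value,
# so pandas falls back to float64 and stores NaN.
ints = pd.Series([1, 2, None])

# The nullable 'Int64' type keeps integers and stores <NA>.
nullable = pd.Series([1, 2, None], dtype="Int64")

# 'category' stores each distinct value once, plus integer codes.
colors = pd.Series(["red", "blue", "red"], dtype="category")

# The nullable 'boolean' type keeps the missing value as <NA>.
flags = pd.Series([True, None], dtype="boolean")

# Datetime columns represent a missing value as NaT.
dates = pd.to_datetime(pd.Series(["2021-01-01", None]))

# Plain Python truthiness explains the bool caveat above.
print(bool(None), bool(np.nan))
print(ints.dtype, nullable.dtype, colors.dtype, flags.dtype)
```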

## DataFrame
https://pandas.pydata.org/docs/reference/frame.html

In [1]:
import pandas as pd
import numpy as np

In [34]:
stocks = pd.read_csv("data/stocks.csv")
print(type(stocks))
display(stocks)

print('shape = ', stocks.shape)
print('size = ', stocks.size)
print('ndim = ', stocks.ndim)
print('len = ', len(stocks))

<class 'pandas.core.frame.DataFrame'>


|   | Symbol | Shares | Low | High |
|---|--------|--------|-----|------|
| 0 | AAPL | 40 | 135 | 170 |
| 1 | AMZN | 8 | 900 | 1125 |
| 2 | TSLA | 50 | 220 | 400 |


shape =  (3, 4)
size =  12
ndim =  2
len =  3


### General properties and methods

* The index and the columns represent the same thing but along different axes. They are occasionally referred to as the row index and column index.  
* If you do not specify the index, pandas will use a `RangeIndex`. A `RangeIndex` is a subclass of an `Index` that is analogous to Python's `range` object. __Its entire sequence of values is not loaded into memory until it is necessary to do so, thereby saving memory.__  
* When possible, Index objects are implemented using hash tables that allow for very fast selection and data alignment. They are ordered and can have duplicate entries.
* Beneath the `index`, `columns`, and `data` are NumPy `ndarrays`.
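
A short sketch of the index behavior described above, using made-up data: the default index is a `RangeIndex`, and an explicit index may contain duplicate labels.

```python
import pandas as pd

# No index given, so pandas uses a RangeIndex.
df = pd.DataFrame({"a": [1, 2, 3]})
print(type(df.index))

# An Index is ordered and may hold duplicate labels.
s = pd.Series([10, 20, 30], index=["x", "y", "x"])
print(s.index.is_unique)

# Selecting a duplicated label returns every matching row.
both_x = s.loc["x"]
print(both_x)
```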

In [3]:
# components of a DataFrame

columns = stocks.columns
index = stocks.index
data = stocks.to_numpy()

display(index)
display(columns)

# Beneath the index, columns, and data are NumPy ndarrays.
display(data)
display(index.to_numpy())
display(columns.to_numpy())

RangeIndex(start=0, stop=3, step=1)

Index(['Symbol', 'Shares', 'Low', 'High'], dtype='object')

array([['AAPL', 40, 135, 170],
       ['AMZN', 8, 900, 1125],
       ['TSLA', 50, 220, 400]], dtype=object)

array([0, 1, 2], dtype=int64)

array(['Symbol', 'Shares', 'Low', 'High'], dtype=object)

In [39]:
# data type of each column
display(stocks.dtypes)

# counts of each data type
display(stocks.dtypes.value_counts())

# number of non-missing values for each column
display(stocks.count())

Symbol    object
Shares     int64
Low        int64
High       int64
dtype: object

int64     3
object    1
dtype: int64

Symbol    3
Shares    3
Low       3
High      3
dtype: int64

In [41]:
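# print a concise summary: index, columns, non-null counts, dtypes, memory usage
# .info() prints its report and returns None, which is why display() also shows None
display(stocks.info())

# summary statistics for the numeric columns
display(stocks.describe())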

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Symbol  3 non-null      object
 1   Shares  3 non-null      int64 
 2   Low     3 non-null      int64 
 3   High    3 non-null      int64 
dtypes: int64(3), object(1)
memory usage: 224.0+ bytes


None

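|       | Shares | Low | High |
|-------|--------|-----|------|
| count | 3.0 | 3.0 | 3.0 |
| mean | 32.666667 | 418.333333 | 565.0 |
| std | 21.93931 | 419.295043 | 498.422512 |
| min | 8.0 | 135.0 | 170.0 |
| 25% | 24.0 | 177.5 | 285.0 |
| 50% | 40.0 | 220.0 | 400.0 |
| 75% | 45.0 | 560.0 | 762.5 |
| max | 50.0 | 900.0 | 1125.0 |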


### Accessing Columns

Selecting a single column from a DataFrame returns a Series

In [6]:
display(stocks)

|   | Symbol | Shares | Low | High |
|---|--------|--------|-----|------|
| 0 | AAPL | 40 | 135 | 170 |
| 1 | AMZN | 8 | 900 | 1125 |
| 2 | TSLA | 50 | 220 | 400 |


In [7]:
display(stocks['Low'])
display(stocks.Low)

0    135
1    900
2    220
Name: Low, dtype: int64

0    135
1    900
2    220
Name: Low, dtype: int64

We can also index off of the `.loc` and `.iloc` attributes to pull out a Series. The former selects by label (here, the column name), while the latter selects by integer position. The pandas documentation refers to these as _label-based_ and _integer position-based_ indexing.

`loc/iloc[row_selector, column_selector]`

In [8]:
display(stocks.loc[:, 'Low'])
display(stocks.iloc[:, 2])
display(stocks.iloc[:1, 2])

0    135
1    900
2    220
Name: Low, dtype: int64

0    135
1    900
2    220
Name: Low, dtype: int64

0    135
Name: Low, dtype: int64
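
With the default `RangeIndex`, labels and positions coincide, which hides the difference between `.loc` and `.iloc`. The sketch below uses a hypothetical frame indexed by ticker symbol to show that label slices include both endpoints while position slices exclude the stop.

```python
import pandas as pd

df = pd.DataFrame(
    {"Low": [135, 900, 220], "High": [170, 1125, 400]},
    index=["AAPL", "AMZN", "TSLA"],
)

# Label-based: the slice "AAPL":"AMZN" includes both end labels.
by_label = df.loc["AAPL":"AMZN", "Low"]

# Position-based: 0:1 excludes the stop position, as in plain Python.
by_position = df.iloc[0:1, 0]

print(by_label)
print(by_position)
```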

#### Accessing multiple columns

In [9]:
# This can also be used to order columns
display(stocks[
    [
        "High",
        "Low"
    ]
])

display(type(stocks[["High"]]))
display(type(stocks["High"]))

display(type(stocks.loc[:, ["Low"]]))
display(type(stocks.loc[:, "Low"]))

|   | High | Low |
|---|------|-----|
| 0 | 170 | 135 |
| 1 | 1125 | 900 |
| 2 | 400 | 220 |


pandas.core.frame.DataFrame

pandas.core.series.Series

pandas.core.frame.DataFrame

pandas.core.series.Series

#### Selecting and Filtering Columns by Data Types and Names

In [10]:
display(stocks)
display(stocks.select_dtypes(include=["number"]))
display(stocks.select_dtypes(exclude=[np.int64]))

# .filter searches column names by default (or index labels with axis=0).
# The `like` parameter keeps every label that contains the substring 'Low'.
display(stocks.filter(like='Low'))

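|   | Symbol | Shares | Low | High |
|---|--------|--------|-----|------|
| 0 | AAPL | 40 | 135 | 170 |
| 1 | AMZN | 8 | 900 | 1125 |
| 2 | TSLA | 50 | 220 | 400 |


|   | Shares | Low | High |
|---|--------|-----|------|
| 0 | 40 | 135 | 170 |
| 1 | 8 | 900 | 1125 |
| 2 | 50 | 220 | 400 |


|   | Symbol |
|---|--------|
| 0 | AAPL |
| 1 | AMZN |
| 2 | TSLA |


|   | Low |
|---|-----|
| 0 | 135 |
| 1 | 900 |
| 2 | 220 |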
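
Besides `like`, `DataFrame.filter` also accepts `items` (exact names) and `regex`. The one-row frame below is a hypothetical stand-in for the stocks data, used only to compare the three parameters.

```python
import pandas as pd

df = pd.DataFrame(
    {"Symbol": ["AAPL"], "Shares": [40], "Low": [135], "High": [170]}
)

# items: keep exactly these column names, in this order.
by_items = df.filter(items=["Low", "High"])

# like: keep columns whose name contains the substring.
by_like = df.filter(like="S")

# regex: keep columns whose name matches the pattern.
by_regex = df.filter(regex="^(Low|High)$")

print(list(by_items.columns))
print(list(by_like.columns))
print(list(by_regex.columns))
```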


### Ordering Columns

In [31]:
display(stocks)

order = ['Shares', 'High', 'Low']
        
display(stocks[order])

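|   | Symbol | Shares | Low | High |
|---|--------|--------|-----|------|
| 0 | AAPL | 40 | 135 | 170 |
| 1 | AMZN | 8 | 900 | 1125 |
| 2 | TSLA | 50 | 220 | 400 |


|   | Shares | High | Low |
|---|--------|------|-----|
| 0 | 40 | 170 | 135 |
| 1 | 8 | 1125 | 900 |
| 2 | 50 | 400 | 220 |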


### Renaming Columns

In [11]:
display(stocks)

column_dict = {column : column.lower() for column in stocks.columns.to_list()}
display(stocks.rename(columns=column_dict)) # not an in place change

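|   | Symbol | Shares | Low | High |
|---|--------|--------|-----|------|
| 0 | AAPL | 40 | 135 | 170 |
| 1 | AMZN | 8 | 900 | 1125 |
| 2 | TSLA | 50 | 220 | 400 |


|   | symbol | shares | low | high |
|---|--------|--------|-----|------|
| 0 | AAPL | 40 | 135 | 170 |
| 1 | AMZN | 8 | 900 | 1125 |
| 2 | TSLA | 50 | 220 | 400 |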


### Setting and Renaming Index

In [12]:
display(stocks)
index_map={'AAPL': 'Apple Inc.'}
display(stocks.set_index('Symbol').rename(index=index_map))  # not an in place change

|   | Symbol | Shares | Low | High |
|---|--------|--------|-----|------|
| 0 | AAPL | 40 | 135 | 170 |
| 1 | AMZN | 8 | 900 | 1125 |
| 2 | TSLA | 50 | 220 | 400 |


| Symbol | Shares | Low | High |
|--------|--------|-----|------|
| Apple Inc. | 40 | 135 | 170 |
| AMZN | 8 | 900 | 1125 |
| TSLA | 50 | 220 | 400 |
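
Once a column becomes the index, its values work as row labels for `.loc`, and `reset_index` turns the index back into an ordinary column. A minimal sketch on a hypothetical two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"Symbol": ["AAPL", "AMZN", "TSLA"], "Shares": [40, 8, 50]})

# set_index returns a new DataFrame; df itself is unchanged.
indexed = df.set_index("Symbol")

# The symbols can now be used as row labels.
amzn_shares = indexed.loc["AMZN", "Shares"]

# reset_index moves the index back into a regular column.
restored = indexed.reset_index()

print(amzn_shares)
print(list(restored.columns))
```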


### Processing data in different columns together

In [13]:
low_high_1 = stocks.Low + stocks.High
low_high_2 = stocks.loc[:, ['Low', 'High']].sum(axis="columns")

display(type(low_high_1))

display(low_high_1)
display(low_high_2)

pandas.core.series.Series

0     305
1    2025
2     620
dtype: int64

0     305
1    2025
2     620
dtype: int64

### Creating Columns
`assign` method - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html
`insert` method - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.insert.html

In [14]:
stocks_copy = stocks.copy(deep=False)

display(stocks_copy)

display(stocks_copy.assign(vol_gtr_45= lambda col: col.Shares > 45))
display(stocks_copy.assign(vol_gtr_45= stocks_copy.Shares > 45))

stocks_copy.insert(loc=0, column="Difference", value=stocks_copy["High"] - stocks_copy["Low"])
stocks_copy['change_percentage'] = 0
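# average of Low and High; the parentheses are needed because / binds tighter than +
stocks_copy['average_price'] = (stocks.Low + stocks.High) / 2
display(stocks_copy)

|   | Symbol | Shares | Low | High |
|---|--------|--------|-----|------|
| 0 | AAPL | 40 | 135 | 170 |
| 1 | AMZN | 8 | 900 | 1125 |
| 2 | TSLA | 50 | 220 | 400 |


|   | Symbol | Shares | Low | High | vol_gtr_45 |
|---|--------|--------|-----|------|------------|
| 0 | AAPL | 40 | 135 | 170 | False |
| 1 | AMZN | 8 | 900 | 1125 | False |
| 2 | TSLA | 50 | 220 | 400 | True |


|   | Symbol | Shares | Low | High | vol_gtr_45 |
|---|--------|--------|-----|------|------------|
| 0 | AAPL | 40 | 135 | 170 | False |
| 1 | AMZN | 8 | 900 | 1125 | False |
| 2 | TSLA | 50 | 220 | 400 | True |


|   | Difference | Symbol | Shares | Low | High | change_percentage | average_price |
|---|------------|--------|--------|-----|------|-------------------|---------------|
| 0 | 35 | AAPL | 40 | 135 | 170 | 0 | 152.5 |
| 1 | 225 | AMZN | 8 | 900 | 1125 | 0 | 1012.5 |
| 2 | 180 | TSLA | 50 | 220 | 400 | 0 | 310.0 |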
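
`assign` can create several columns in one call. When the values are callables, a later one may refer to a column created earlier in the same call. The sketch below uses hypothetical column names (`spread`, `mid`, `spread_pct`) on a reduced copy of the price data.

```python
import pandas as pd

df = pd.DataFrame({"Low": [135, 900, 220], "High": [170, 1125, 400]})

result = df.assign(
    spread=lambda d: d.High - d.Low,
    mid=lambda d: (d.Low + d.High) / 2,
    # Refers to the two columns created just above.
    spread_pct=lambda d: d.spread / d.mid,
)

print(result)
# assign returns a new DataFrame; df still has only Low and High.
print(list(df.columns))
```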


### Deleting Columns

In [15]:
display(stocks.drop(columns='High'))
display(stocks)

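|   | Symbol | Shares | Low |
|---|--------|--------|-----|
| 0 | AAPL | 40 | 135 |
| 1 | AMZN | 8 | 900 |
| 2 | TSLA | 50 | 220 |


|   | Symbol | Shares | Low | High |
|---|--------|--------|-----|------|
| 0 | AAPL | 40 | 135 | 170 |
| 1 | AMZN | 8 | 900 | 1125 |
| 2 | TSLA | 50 | 220 | 400 |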
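
As the second output shows, `drop` leaves the original DataFrame untouched. The sketch below, on a hypothetical one-row frame, also drops several columns at once and uses `errors="ignore"` to tolerate a column name that does not exist.

```python
import pandas as pd

df = pd.DataFrame(
    {"Symbol": ["AAPL"], "Shares": [40], "Low": [135], "High": [170]}
)

# Drop several columns at once; the result is a new DataFrame.
trimmed = df.drop(columns=["Low", "High"])

# A missing label normally raises KeyError; errors="ignore" skips it.
unchanged = df.drop(columns="Missing", errors="ignore")

print(list(trimmed.columns))
print(list(df.columns))
```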


## Series
https://pandas.pydata.org/docs/reference/series.html

In [16]:
# get series out of DataFrame

symbol_series = stocks.Symbol
shares_series = stocks.Shares

display(type(symbol_series))
display(symbol_series.dtype)
display(shares_series.dtype)

pandas.core.series.Series

dtype('O')

dtype('int64')

### Get samples from Series

In [17]:
# This function returns the first `n` rows for the object based on position
display(symbol_series.head())

# n = Number of items from axis to return
display(symbol_series.sample(n=1))

0    AAPL
1    AMZN
2    TSLA
Name: Symbol, dtype: object

1    AMZN
Name: Symbol, dtype: object

### Get stats from Series

In [18]:
display(symbol_series.value_counts())

# Return number of non-NA/null observations in the Series
print('\nCount = ', symbol_series.count())

# return a NumPy array with the unique values
print('\nUnique = ', symbol_series.unique())

# Basic summary statistics are provided with .min, .max, .mean, .median, and .std

print('\nDescribe:\n', shares_series.describe())

TSLA    1
AAPL    1
AMZN    1
Name: Symbol, dtype: int64


Count =  3

Unique =  ['AAPL' 'AMZN' 'TSLA']

Describe:
 count     3.000000
mean     32.666667
std      21.939310
min       8.000000
25%      24.000000
50%      40.000000
75%      45.000000
max      50.000000
Name: Shares, dtype: float64
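
The counting methods differ in how they treat missing values, which the three-row stocks data cannot show. A small sketch on a hypothetical Series with one repeated and one missing value:

```python
import pandas as pd

s = pd.Series(["AAPL", "AMZN", "AAPL", None])

counts = s.value_counts()                    # missing values excluded
with_missing = s.value_counts(dropna=False)  # missing values counted too
shares = s.value_counts(normalize=True)      # fractions instead of counts

print(counts)
print(len(s), s.count(), s.nunique())
```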


### Series Operations

In [19]:
display(shares_series)

# a new Series or DataFrame is returned when using an operator
display(shares_series + 100)
display(shares_series.add(100))
display(shares_series > 100)

0    40
1     8
2    50
Name: Shares, dtype: int64

0    140
1    108
2    150
Name: Shares, dtype: int64

0    140
1    108
2    150
Name: Shares, dtype: int64

0    False
1    False
2    False
Name: Shares, dtype: bool
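
A comparison such as the last one yields a boolean Series, which can then be used as a mask or counted. A minimal sketch with the same share counts and a hypothetical threshold of 45:

```python
import pandas as pd

shares = pd.Series([40, 8, 50], name="Shares")

# Comparison returns a boolean Series with the same index.
mask = shares > 45

# Indexing with the mask keeps only the rows where it is True.
large = shares[mask]

# True counts as 1, so sum gives the number of matching rows.
n_large = mask.sum()

# The operator and its method form give the same result.
same = shares.add(100).equals(shares + 100)

print(large)
print(n_large, same)
```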

### Chaining

In [20]:
# The .pipe method on a Series needs to be passed a function that accepts a Series as input and can return anything

def debug_ser(series):
    print("From pipe:")
    print(series)
    return series

print("\nend result:\n", shares_series.add(100).pipe(debug_ser).astype(float))

From pipe:
0    140
1    108
2    150
Name: Shares, dtype: int64

end result:
 0    140.0
1    108.0
2    150.0
Name: Shares, dtype: float64
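
`.pipe` also forwards extra positional and keyword arguments to the function, so parameterized steps fit into a chain. The helper `clip_and_scale` below is hypothetical and exists only for this sketch.

```python
import pandas as pd

def clip_and_scale(series, lower, factor=1.0):
    """Raise values below `lower` to `lower`, then multiply by `factor`."""
    return series.clip(lower=lower) * factor

shares = pd.Series([40, 8, 50], name="Shares")

# Arguments after the function are passed through to it.
result = shares.pipe(clip_and_scale, 10, factor=2).astype(float)

print(result)
```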
