# Pandas

## Introduction

[Documentation](https://pandas.pydata.org/docs/#pandas-documentation)

___Pandas is well suited for many different kinds of data:___

* Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
* Ordered and unordered (not necessarily fixed-frequency) time series data.
* Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
* Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

The two primary data structures of pandas, [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) (1-dimensional) and [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. 

___Here are just a few of the things that pandas does well:___

* Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
* Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
* Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
* Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
* Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
* Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
* Intuitive merging and joining data sets
* Flexible reshaping and pivoting of data sets
* Hierarchical labeling of axes (possible to have multiple labels per tick)
* Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
* Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging.
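Two of the features above, automatic label alignment and group by, can be sketched in a few lines. The Series `a` and `b` and the `trades` DataFrame are made-up illustrative data, not part of the original notes.

```python
import pandas as pd

# Two Series whose labels only partially overlap (illustrative data)
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Arithmetic aligns on labels, not positions; "x" has no partner in b,
# so the result holds NaN for that label
total = a + b
print(total)

# Split-apply-combine: split rows by symbol, sum the shares in each group
trades = pd.DataFrame(
    {"symbol": ["AAPL", "AAPL", "TSLA"], "shares": [10, 30, 50]}
)
shares_per_symbol = trades.groupby("symbol")["shares"].sum()
print(shares_per_symbol)
```

Because alignment happens on labels, the order of the rows in `a` and `b` does not affect the result.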

___Mutability and copying of data___

All pandas data structures are value-mutable (the values they contain can be altered) but not always size-mutable.  
* The length of a __Series__ cannot be changed
* Columns can be inserted into a __DataFrame__. 

However, the vast majority of methods produce new objects and leave the input data untouched. In general we like to favor immutability where sensible.
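The distinction between mutating an object and getting a new one back can be seen in a small sketch. The DataFrame `df` and its column names are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Value mutability: an existing value can be altered in place
df.loc[0, "a"] = 100

# Size mutability: a column can be inserted into an existing DataFrame
df["c"] = df["a"] + df["b"]

# Most methods return a new object and leave the input untouched
dropped = df.drop(columns="b")

print(df.columns.tolist())       # the original still has column "b"
print(dropped.columns.tolist())  # the new object does not
```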

## Basics

* The columns and the index are known as the axes. The index is axis 0, and the columns are axis 1.
* Pandas uses `NaN` (not a number) to represent missing values.
* By default, pandas displays at most 60 rows and 20 columns (the `display.max_rows` and `display.max_columns` options), but we have limited that in the book so that the data fits on a page.
* The `.head` method accepts an optional parameter, `n`, which controls the number of rows displayed. The default value for `n` is 5. Similarly, the `.tail` method returns the last `n` rows.
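A minimal sketch of the points above, using a made-up ten-row DataFrame (`df`, `x`, and `y` are illustrative names):

```python
import numpy as np
import pandas as pd

# Ten rows; the first value of "y" is missing (NaN)
df = pd.DataFrame({"x": range(10), "y": [np.nan] + [1.0] * 9})

first = df.head()     # n defaults to 5
last3 = df.tail(n=3)  # the last three rows

# axis=0 reduces down the index (one result per column);
# axis=1 reduces across the columns (one result per row).
# NaN values are skipped by default.
col_sums = df.sum(axis=0)
row_sums = df.sum(axis=1)

# Temporarily lower the display limits for one block of output
with pd.option_context("display.max_rows", 4, "display.max_columns", 5):
    print(df)
```

`pd.option_context` restores the previous settings on exit, so the change to the display limits stays confined to the `with` block.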


![image.png](./images/dataframe-struct.png)

### Data types

[pandas arrays and data types reference](https://pandas.pydata.org/pandas-docs/stable/reference/arrays.html#datetime-data)

In very broad terms, data may be classified as either __continuous__ or __categorical__. 
* Continuous data is always numeric and represents some kind of measurement, such as height, wage, or salary. Continuous data can take on an infinite number of possible values.
* Categorical data, on the other hand, represents a discrete, finite set of values, such as car color, type of poker hand, or brand of cereal.

The following describes common pandas data types:

* float – The NumPy float type, which supports missing values
* int – The NumPy integer type, which does not support missing values
* 'Int64' – pandas nullable integer type
* object – The NumPy type for storing strings (and mixed types)
* 'category' – pandas categorical type, which does support missing values
* bool – The NumPy Boolean type, which does not support missing values (when cast to `bool`, `None` becomes `False` and `np.nan` becomes `True`)
* 'boolean' – pandas nullable Boolean type
* datetime64[ns] – The NumPy date type, which does support missing values (NaT)
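How these types treat missing values can be checked with a short sketch. The values are made up for illustration, and the variable names are not from the original notes.

```python
import numpy as np
import pandas as pd

# NumPy integers cannot hold a missing value, so pandas falls back to float
floats = pd.Series([1, 2, None])
print(floats.dtype)

# The nullable 'Int64' type keeps integers and stores the gap as pd.NA
nullable_ints = pd.Series([1, 2, None], dtype="Int64")
print(nullable_ints.dtype)

# Casting to NumPy bool silently loses missing values:
# None becomes False, np.nan becomes True
as_bool = pd.Series([None, np.nan, 0, 1], dtype="object").astype(bool)
print(as_bool.tolist())

# The nullable 'boolean' type keeps the missing value instead
nullable_bool = pd.Series([True, None], dtype="boolean")

# Categorical and datetime data both support missing values (NaN and NaT)
colors = pd.Series(["red", "blue", None], dtype="category")
dates = pd.to_datetime(["2021-01-01", None])
```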

## DataFrame

In [1]:
import pandas as pd

In [11]:
stocks = pd.read_csv("data/stocks.csv")
print(type(stocks))
stocks

<class 'pandas.core.frame.DataFrame'>


|   | Symbol | Shares | Low | High |
|---|--------|--------|-----|------|
| 0 | AAPL   | 40     | 135 | 170  |
| 1 | AMZN   | 8      | 900 | 1125 |
| 2 | TSLA   | 50     | 220 | 400  |


### Index

* The index and the columns represent the same thing but along different axes. They are occasionally referred to as the row index and column index.  
* If you do not specify the index, pandas will use a `RangeIndex`. A `RangeIndex` is a subclass of an `Index` that is analogous to Python's `range` object. __Its entire sequence of values is not loaded into memory until it is necessary to do so, thereby saving memory.__  
* When possible, Index objects are implemented using hash tables that allow for very fast selection and data alignment. They are ordered and can have duplicate entries.
* Beneath the `index`, `columns`, and `data` are NumPy `ndarrays`.
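The default `RangeIndex` and the possibility of duplicate labels can be sketched as follows. The small objects here are illustrative and separate from the `stocks` data.

```python
import pandas as pd

# No index specified, so pandas falls back to a RangeIndex
df = pd.DataFrame({"Shares": [40, 8, 50]})
print(df.index)

# An explicit index keeps its order and may contain duplicate labels
s = pd.Series([1, 2, 3], index=["a", "b", "a"])
print(s.index.is_unique)

# Selecting a duplicated label returns every matching row
both_a = s.loc["a"]
print(both_a)
```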

In [22]:
# components of a DataFrame

columns = stocks.columns
index = stocks.index
data = stocks.to_numpy()

display(index)
display(columns)

# Beneath the index, columns, and data are NumPy ndarrays.
display(data)
display(index.to_numpy())
display(columns.to_numpy())

RangeIndex(start=0, stop=3, step=1)

Index(['Symbol', 'Shares', 'Low', 'High'], dtype='object')

array([['AAPL', 40, 135, 170],
       ['AMZN', 8, 900, 1125],
       ['TSLA', 50, 220, 400]], dtype=object)

array([0, 1, 2], dtype=int64)

array(['Symbol', 'Shares', 'Low', 'High'], dtype=object)

In [31]:
# data type of each column
display(stocks.dtypes)

# counts of each data type
stocks.dtypes.value_counts()

Symbol    object
Shares     int64
Low        int64
High       int64
dtype: object

int64     3
object    1
dtype: int64

In [33]:
# get info on the dataframe
stocks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Symbol  3 non-null      object
 1   Shares  3 non-null      int64 
 2   Low     3 non-null      int64 
 3   High    3 non-null      int64 
dtypes: int64(3), object(1)
memory usage: 224.0+ bytes
