# Pandas

## Introduction

[Documentation](https://pandas.pydata.org/docs/#pandas-documentation)

___Pandas is well suited for many different kinds of data:___

* Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
* Ordered and unordered (not necessarily fixed-frequency) time series data.
* Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
* Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

The two primary data structures of pandas, [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) (1-dimensional) and [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. 

___Here are just a few of the things that pandas does well:___

* Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
* Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
* Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
* Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
* Easy conversion of ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
* Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
* Intuitive merging and joining data sets
* Flexible reshaping and pivoting of data sets
* Hierarchical labeling of axes (possible to have multiple labels per tick)
* Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
* Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging.
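
Automatic label alignment, mentioned in the list above, can be illustrated with a minimal sketch. The two Series and their labels are made up for illustration.

```python
import pandas as pd

# Two Series whose labels only partly overlap (hypothetical data).
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Addition aligns on labels, not on position; labels present in
# only one operand produce NaN in the result.
total = a + b
print(total)

# The .add method accepts fill_value, which stands in for the
# missing side instead of producing NaN.
filled = a.add(b, fill_value=0)
print(filled)
```

Because the result contains missing values, `total` is stored as `float64` even though both inputs are integers.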

___Mutability and copying of data___

All pandas data structures are value-mutable (the values they contain can be altered) but not always size-mutable.  
* The length of a __Series__ cannot be changed
* Columns can be inserted into a __DataFrame__. 

However, the vast majority of methods produce new objects and leave the input data untouched. In general we like to favor immutability where sensible.
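
As a rough illustration of these points (the frame and column names are hypothetical), the sketch below shows a method returning a new object, a value changed in place, and a column added to an existing DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"price": [3.0, 1.0, 2.0]})

# Most methods return a new object and leave the caller untouched.
sorted_df = df.sort_values("price")
print(df["price"].tolist())      # the original row order is preserved

# Values are mutable: an indexer can overwrite a cell in place.
df.loc[0, "price"] = 9.0

# A DataFrame is size-mutable along its columns.
df["double"] = df["price"] * 2
print(df)
```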

## Basics

* The columns and the index are known as the axes. The index is axis 0, and the columns are axis 1.
* Pandas uses `NaN` (not a number) to represent missing values.
* By default, pandas shows at most 60 rows and 20 columns; both limits are configurable display options.
* The `.head` method accepts an optional parameter, `n`, which controls the number of rows returned. The default value for `n` is 5. Similarly, the `.tail` method returns the last `n` rows. The `.sample` method returns a random sample of items from an axis of the object.
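
The axis numbering and the sampling methods above can be tried on a small frame. This is only a sketch with made-up data; `random_state` is used so the random sample is repeatable.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "b": [10, 20, 30, 40, 50, 60]})

first_two = df.head(n=2)                  # first two rows
last_two = df.tail(n=2)                   # last two rows
picked = df.sample(n=3, random_state=0)   # three random rows

# axis=0 (the index) collapses the rows: one result per column.
col_totals = df.sum(axis=0)

# axis=1 (the columns) collapses the columns: one result per row.
row_totals = df.sum(axis=1)

print(col_totals)
print(row_totals)
```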


![image.png](./images/dataframe-struct.png)

### Data types

https://pandas.pydata.org/pandas-docs/stable/reference/arrays.html#datetime-data

In very broad terms, data may be classified as either __continuous__ or __categorical__. 
* Continuous data is always numeric and represents some kind of measurement, such as height, wage, or salary. Continuous data can take on an infinite number of possible values. 
* Categorical data, on the other hand, represents a discrete, finite set of values, such as car color, type of poker hand, or brand of cereal.

The following describes common pandas data types:

* float – The NumPy float type, which supports missing values
* int – The NumPy integer type, which does not support missing values
* 'Int64' – pandas nullable integer type. __Note: This is different than `int64`__
* `object` – The NumPy type for storing strings (and mixed types). The object data type is unlike the others: a column of this type may contain any valid Python object. __Typically, an object column signals that the entire column holds strings. When you load a CSV file and a string column has missing values, pandas inserts a `NaN` (a float) in those cells, so the column can hold both strings and float (missing) values.__
* 'category' – pandas categorical type, which does support missing values. As pandas grew larger and more popular, the `object` data type proved to be too generic for all columns with string values. __pandas created its own categorical data type to handle columns of strings (or numbers) with a fixed number of possible values.__
* bool – The NumPy Boolean type, which does not support missing values (None becomes False, np.nan becomes True)
* 'boolean' – pandas nullable Boolean type
* datetime64[ns] – The NumPy date type, which does support missing values (NaT)
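
The way these types treat missing values can be checked with a short sketch. The values are arbitrary and only serve to show which dtype pandas ends up with.

```python
import numpy as np
import pandas as pd

# A NumPy integer column cannot hold a missing value,
# so pandas falls back to float64 and stores NaN.
as_float = pd.Series([1, 2, None])
print(as_float.dtype)

# The nullable 'Int64' type keeps the integers and stores pd.NA.
as_nullable = pd.Series([1, 2, None], dtype="Int64")
print(as_nullable.dtype)

# NumPy bool has no missing value: None becomes False, np.nan becomes True.
numpy_bool = np.array([None, np.nan], dtype=object).astype(bool)
print(numpy_bool)

# The nullable 'boolean' type keeps the missing value as pd.NA.
nullable_bool = pd.Series([True, None], dtype="boolean")
print(nullable_bool)
```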

## DataFrame
https://pandas.pydata.org/docs/reference/frame.html

In [1]:
import pandas as pd
import numpy as np

In [2]:
stocks = pd.read_csv("data/stocks.csv")
print(type(stocks))
display(stocks)

print('shape = ', stocks.shape)
print('size = ', stocks.size)
print('ndim = ', stocks.ndim)
print('len = ', len(stocks))

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,40,135,170
1,AMZN,8,900,1125
2,TSLA,50,220,400


shape =  (3, 4)
size =  12
ndim =  2
len =  3


### General properties and methods

* The index and the columns represent the same thing but along different axes. They are occasionally referred to as the row index and column index.  
* If you do not specify the index, pandas will use a `RangeIndex`. A `RangeIndex` is a subclass of an `Index` that is analogous to Python's `range` object. __Its entire sequence of values is not loaded into memory until it is necessary to do so, thereby saving memory.__  
* When possible, Index objects are implemented using hash tables that allow for very fast selection and data alignment. They are ordered and can have duplicate entries.
* Beneath the `index`, `columns`, and `data` are NumPy `ndarrays`.

In [3]:
# components of a DataFrame

columns = stocks.columns
index = stocks.index
data = stocks.to_numpy()

display(index)
display(columns)

# Beneath the index, columns, and data are NumPy ndarrays.
display(data)
display(index.to_numpy())
display(columns.to_numpy())

RangeIndex(start=0, stop=3, step=1)

Index(['Symbol', 'Shares', 'Low', 'High'], dtype='object')

array([['AAPL', 40, 135, 170],
       ['AMZN', 8, 900, 1125],
       ['TSLA', 50, 220, 400]], dtype=object)

array([0, 1, 2], dtype=int64)

array(['Symbol', 'Shares', 'Low', 'High'], dtype=object)

In [4]:
# data type of each column
display(stocks.dtypes)

# counts of each data type
display(stocks.dtypes.value_counts())

# number of non-missing values for each column
display(stocks.count())

Symbol    object
Shares     int64
Low        int64
High       int64
dtype: object

int64     3
object    1
dtype: int64

Symbol    3
Shares    3
Low       3
High      3
dtype: int64

In [5]:
# get info on the dataframe
display(stocks.info())

# get summary statistics for the numeric columns
display(stocks.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Symbol  3 non-null      object
 1   Shares  3 non-null      int64 
 2   Low     3 non-null      int64 
 3   High    3 non-null      int64 
dtypes: int64(3), object(1)
memory usage: 224.0+ bytes


None

Unnamed: 0,Shares,Low,High
count,3.0,3.0,3.0
mean,32.666667,418.333333,565.0
std,21.93931,419.295043,498.422512
min,8.0,135.0,170.0
25%,24.0,177.5,285.0
50%,40.0,220.0,400.0
75%,45.0,560.0,762.5
max,50.0,900.0,1125.0


### Accessing Rows and Columns

Selecting a single column from a DataFrame returns a Series

In [6]:
display(stocks)

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,40,135,170
1,AMZN,8,900,1125
2,TSLA,50,220,400


#### Accessing columns

In [7]:
display(stocks['Low'])
display(stocks.Low)

0    135
1    900
2    220
Name: Low, dtype: int64

0    135
1    900
2    220
Name: Low, dtype: int64

We can also index off of the `.loc` and `.iloc` attributes to pull out a Series. The former allows us to pull out by column name, while the latter by position. These are referred to as _label-based_ and _positional-based_ in the pandas documentation.

`loc/iloc[row_selector, column_selector]`
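
A minimal sketch of the difference between the two indexers, using a hypothetical frame with string row labels so that labels and positions cannot be confused:

```python
import pandas as pd

df = pd.DataFrame(
    {"Low": [135, 900, 220], "High": [170, 1125, 400]},
    index=["AAPL", "AMZN", "TSLA"],
)

# .loc selects by label; a label slice includes its end point.
by_label = df.loc["AAPL":"AMZN", "Low"]

# .iloc selects by integer position; the end of the slice is excluded.
by_position = df.iloc[0:1, 0]

# A scalar in both slots returns a single value.
cell = df.loc["TSLA", "High"]

print(by_label)
print(by_position)
print(cell)
```

The inclusive end point of a `.loc` slice is why `stocks.loc[0:1, :]` returns two rows while `stocks.iloc[:1, 2]` returns one.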

In [8]:
display(stocks.loc[:, 'Low'])
display(stocks.iloc[:, 2])
display(stocks.iloc[:1, 2])

0    135
1    900
2    220
Name: Low, dtype: int64

0    135
1    900
2    220
Name: Low, dtype: int64

0    135
Name: Low, dtype: int64

#### Accessing Rows

In [9]:
display(stocks)
display(stocks.loc[0, :])
display(stocks.loc[0:1, :])

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,40,135,170
1,AMZN,8,900,1125
2,TSLA,50,220,400


Symbol    AAPL
Shares      40
Low        135
High       170
Name: 0, dtype: object

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,40,135,170
1,AMZN,8,900,1125


#### Accessing multiple columns

In [10]:
# This can also be used to order columns
display(stocks[
    [
        "High",
        "Low"
    ]
])

display(type(stocks[["High"]]))
display(type(stocks["High"]))

display(type(stocks.loc[:, ["Low"]]))
display(type(stocks.loc[:, "Low"]))

Unnamed: 0,High,Low
0,170,135
1,1125,900
2,400,220


pandas.core.frame.DataFrame

pandas.core.series.Series

pandas.core.frame.DataFrame

pandas.core.series.Series

#### Selecting and Filtering Columns by Data Types and Names

In [11]:
display(stocks)
display(stocks.select_dtypes(include=["number"]))
display(stocks.select_dtypes(exclude=[np.int64]))

# filter searches column names (or index labels), depending on which parameter is used.
# The like parameter selects all columns (or index labels) whose name contains the string 'Low'
display(stocks.filter(like='Low'))

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,40,135,170
1,AMZN,8,900,1125
2,TSLA,50,220,400


Unnamed: 0,Shares,Low,High
0,40,135,170
1,8,900,1125
2,50,220,400


Unnamed: 0,Symbol
0,AAPL
1,AMZN
2,TSLA


Unnamed: 0,Low
0,135
1,900
2,220


### Ordering Columns

In [12]:
display(stocks)

order = ['Shares', 'High', 'Low']
        
display(stocks[order])

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,40,135,170
1,AMZN,8,900,1125
2,TSLA,50,220,400


Unnamed: 0,Shares,High,Low
0,40,170,135
1,8,1125,900
2,50,400,220


### Renaming Columns

In [13]:
display(stocks)

column_dict = {column : column.lower() for column in stocks.columns.to_list()}
display(stocks.rename(columns=column_dict)) # not an in place change

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,40,135,170
1,AMZN,8,900,1125
2,TSLA,50,220,400


Unnamed: 0,symbol,shares,low,high
0,AAPL,40,135,170
1,AMZN,8,900,1125
2,TSLA,50,220,400


### Setting and Renaming Index

In [14]:
display(stocks)
index_map={'AAPL': 'Apple Inc.'}
display(stocks.set_index('Symbol').rename(index=index_map))  # not an in place change

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,40,135,170
1,AMZN,8,900,1125
2,TSLA,50,220,400


Unnamed: 0_level_0,Shares,Low,High
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Apple Inc.,40,135,170
AMZN,8,900,1125
TSLA,50,220,400


### Processing data in different columns together

In [15]:
low_high_1 = stocks.Low + stocks.High
low_high_2 = stocks.loc[:, ['Low', 'High']].sum(axis="columns")

display(type(low_high_1))

display(low_high_1)
display(low_high_2)

pandas.core.series.Series

0     305
1    2025
2     620
dtype: int64

0     305
1    2025
2     620
dtype: int64

### Creating Columns
`assign` method - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html
`insert` method - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.insert.html

In [16]:
stocks_copy = stocks.copy(deep=False)

display(stocks_copy)

# a callable passed to assign receives the whole DataFrame
display(stocks_copy.assign(vol_gtr_45=lambda df: df.Shares > 45))
display(stocks_copy.assign(vol_gtr_45=stocks_copy.Shares > 45))

stocks_copy.insert(loc=0, column="Difference", value=stocks_copy["High"] - stocks_copy["Low"])
stocks_copy['change_percentage'] = 0
# parentheses are required: division binds tighter than addition
stocks_copy['average_price'] = (stocks.Low + stocks.High) / 2
display(stocks_copy)

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,40,135,170
1,AMZN,8,900,1125
2,TSLA,50,220,400


Unnamed: 0,Symbol,Shares,Low,High,vol_gtr_45
0,AAPL,40,135,170,False
1,AMZN,8,900,1125,False
2,TSLA,50,220,400,True


Unnamed: 0,Symbol,Shares,Low,High,vol_gtr_45
0,AAPL,40,135,170,False
1,AMZN,8,900,1125,False
2,TSLA,50,220,400,True


Unnamed: 0,Difference,Symbol,Shares,Low,High,change_percentage,average_price
0,35,AAPL,40,135,170,0,152.5
1,225,AMZN,8,900,1125,0,1012.5
2,180,TSLA,50,220,400,0,310.0


### Chaining

In [17]:
display(stocks)
display(stocks.sum())
display(stocks.sum()[1:].sum())
display(stocks[['Shares', 'High', 'Low']].sum().sum())

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,40,135,170
1,AMZN,8,900,1125
2,TSLA,50,220,400


Symbol    AAPLAMZNTSLA
Shares              98
Low               1255
High              1695
dtype: object

3048

3048

### Operations on DataFrame

In [18]:
display(stocks)

# plus operator, which attempts to add a scalar value to each value of each column of the DataFrame
display(stocks.iloc[:, 1:] + 1)

# the .add method is equivalent to the + operator and is easier to use in a method chain
display(stocks.iloc[:, 1:].add(1))

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,40,135,170
1,AMZN,8,900,1125
2,TSLA,50,220,400


Unnamed: 0,Shares,Low,High
0,41,136,171
1,9,901,1126
2,51,221,401


Unnamed: 0,Shares,Low,High
0,41,136,171
1,9,901,1126
2,51,221,401


### Deleting Columns

In [19]:
display(stocks.drop(columns='High'))
display(stocks)

Unnamed: 0,Symbol,Shares,Low
0,AAPL,40,135
1,AMZN,8,900
2,TSLA,50,220


Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,40,135,170
1,AMZN,8,900,1125
2,TSLA,50,220,400


### Creating DataFrames from scratch

In [20]:
symbols = ['RIL', 'TTM']
quotes = [1324.2, 12.4]

stocks_data_frame_dict = {'symbol': symbols, 'quote': quotes}

# By default, pandas will create a RangeIndex 
df_1 = pd.DataFrame(stocks_data_frame_dict, dtype=np.float16)
display(df_1)
display(df_1.index)

# specify our own index
df_2 = pd.DataFrame(stocks_data_frame_dict, index=['a', 'b'])
display(df_2)
display(df_2.index)

# list of dicts
list_of_dicts = [
    {
        'symbol' : 'NSDQ',
        'quote' : 33
    },
    {
        'symbol' : 'AAPL',
        'quote' : 333.1,
        'volume': 22
    },
    {
        'symbol' : 'ZNGA',
        'quote' : None,
        'volume': 'Delisted'
    }
]

display(pd.DataFrame(list_of_dicts))

Unnamed: 0,symbol,quote
0,RIL,1324.0
1,TTM,12.398438


RangeIndex(start=0, stop=2, step=1)

Unnamed: 0,symbol,quote
a,RIL,1324.2
b,TTM,12.4


Index(['a', 'b'], dtype='object')

Unnamed: 0,symbol,quote,volume
0,NSDQ,33.0,
1,AAPL,333.1,22
2,ZNGA,,Delisted


### Exporting DataFrame

DataFrames can be exported with the methods whose names start with `to_`, for example `to_csv`, `to_clipboard`, and `to_json`.

In [21]:
display(stocks)

# write json to string buffer

from io import StringIO
fout = StringIO()

stocks.to_json(fout)
display(fout.getvalue())

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,40,135,170
1,AMZN,8,900,1125
2,TSLA,50,220,400


'{"Symbol":{"0":"AAPL","1":"AMZN","2":"TSLA"},"Shares":{"0":40,"1":8,"2":50},"Low":{"0":135,"1":900,"2":220},"High":{"0":170,"1":1125,"2":400}}'

#### Exporting to type-tracking formats

Once you have your data in a format you like, you can save it in a binary format that tracks types, such as the Feather format (pandas leverages the pyarrow library to do this).
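
To see why a format that tracks types matters, the sketch below round-trips a small, made-up frame through CSV text held in memory. The column types chosen before the export are not restored on read.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"symbol": ["AAPL", "AMZN", "AAPL"], "shares": [40, 8, 50]})
df["symbol"] = df["symbol"].astype("category")
df["shares"] = df["shares"].astype("int16")

# Round-trip through CSV text held in memory.
buffer = StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(df.dtypes)        # category and int16
print(restored.dtypes)  # types are inferred again from the text
```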

##### Feather format

Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames. It has a few specific design goals:

* Lightweight, minimal API: make pushing data frames in and out of memory as simple as possible
* Language agnostic: Feather files are the same whether written by Python or R code. Other languages can read and write Feather files, too.
* High read and write performance. When possible, Feather operations should be bound by local disk performance.
* This format is meant to enable in-memory transfer of structured data between languages and optimized so that data can be used as is without internal conversion.

See the `DataFrame.to_feather` method and the `pd.read_feather` function.

##### Parquet format

Whereas Feather optimizes the binary data for the in-memory structure, Parquet optimizes for the on-disk format. 

See the `DataFrame.to_parquet` method and the `pd.read_parquet` function.

Right now there is some conversion required for pandas to load data from both Parquet and Feather. But both are quicker than CSV and persist types.

### Efficiently reading data into DataFrame

See the `pd.read_*` functions, such as `pd.read_csv`, `pd.read_json`, and `pd.read_sql`.

Tips:
* Read a limited number of rows/lines in a file. Example using the `nrows` param in `pd.read_csv`
* Specify a smaller `dtype` for a given column than the default which might take more space
* If a column has few unique values relative to its length (low cardinality), use the `category` data type
* If you can process chunks of the data at a time and do not need all of it in memory, you can use the `chunksize` parameter
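
The tips above can be combined as in the following sketch. The CSV text is a hypothetical stand-in for a large file on disk; `StringIO` is used so the example does not depend on any file.

```python
from io import StringIO

import numpy as np
import pandas as pd

csv_text = "Symbol,Shares,Low\nAAPL,40,135\nAMZN,8,900\nTSLA,50,220\nAAPL,5,140\n"

# nrows limits how many data rows are parsed.
preview = pd.read_csv(StringIO(csv_text), nrows=2)

# dtype selects a smaller or more suitable type for each column.
typed = pd.read_csv(
    StringIO(csv_text),
    dtype={"Symbol": "category", "Shares": np.int16, "Low": np.int16},
)

# chunksize returns an iterator of DataFrames, so only one chunk
# needs to be in memory at a time.
total_shares = 0
for chunk in pd.read_csv(StringIO(csv_text), chunksize=2):
    total_shares += chunk["Shares"].sum()

print(typed.dtypes)
print(total_shares)
```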

#### Data types are inferred by pandas

> Because CSV files contain no information about type, pandas tries to infer the types of the columns. If all of the values of a column are whole numbers and none of them are missing, then it uses the int64 type. If the column is numeric but not whole numbers, or if there are missing values, it uses float64. These data types may store more information than you need. For example, if your numbers are all below 200, you could use a smaller type, like np.int16 (or np.uint8 if they are all non-negative).

> As of pandas 0.24, there is a new type 'Int64' (note the capitalization) that supports integer types with missing numbers. You will need to specify it with the dtype parameter if you want to use this type, as pandas will convert integers that have missing numbers to float64.

> If the column turns out to be non-numeric, pandas will convert it to an object column, and treat the values as strings. String values in pandas take up a bunch of memory as each value is stored as a Python string. If we convert these to categoricals, pandas will use much less memory as it only stores the string once, rather than creating new strings (even if they repeat) for every row.
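
The inference rules quoted above can be observed on a tiny, hypothetical CSV in which column `b` has one missing value:

```python
from io import StringIO

import pandas as pd

csv_text = "a,b\n1,1\n2,\n3,3\n"

# Without hints, the complete integer column becomes int64 and the
# column with a missing value becomes float64.
inferred = pd.read_csv(StringIO(csv_text))
print(inferred.dtypes)

# Requesting the nullable 'Int64' type keeps column b as integers.
nullable = pd.read_csv(StringIO(csv_text), dtype={"b": "Int64"})
print(nullable.dtypes)
```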

The pandas library can also read CSV files found on the internet. You can point the read_csv function to the URL directly.

#### CSV and JSONs

In [22]:
# Note the memory usage in the output

# reading the whole file
display(pd.read_csv('data/stocks.csv').info())

# reading a limited number of rows
display(pd.read_csv('data/stocks.csv', 
                    nrows=1,
                    dtype={'Low': np.int32}
                    ).info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Symbol  3 non-null      object
 1   Shares  3 non-null      int64 
 2   Low     3 non-null      int64 
 3   High    3 non-null      int64 
dtypes: int64(3), object(1)
memory usage: 224.0+ bytes


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Symbol  1 non-null      object
 1   Shares  1 non-null      int64 
 2   Low     1 non-null      int32 
 3   High    1 non-null      int64 
dtypes: int32(1), int64(2), object(1)
memory usage: 156.0+ bytes


None

#### SQL

1. Create a SQLite database to store the Beatles information

``` python
>>> import sqlite3
>>> con = sqlite3.connect("data/beat.db")
>>> with con:
...     cur = con.cursor()
...     cur.execute("""DROP TABLE IF EXISTS Band""")
...     cur.execute(
...         """CREATE TABLE Band(id INTEGER PRIMARY KEY,
...         fname TEXT, lname TEXT, birthyear INT)"""
...     )
...     cur.execute(
...         """INSERT INTO Band VALUES(
...         0, 'Paul', 'McCartney', 1942)"""
...     )
...     cur.execute(
...         """INSERT INTO Band VALUES(
...         1, 'John', 'Lennon', 1940)"""
...     )
...     _ = con.commit()
```

2. Read the table from the database into a DataFrame. Note that if we are reading a table, we need to use a SQLAlchemy connection. SQLAlchemy is a library that abstracts databases for us:

``` python
>>> import sqlalchemy as sa
>>> engine = sa.create_engine(
...     "sqlite:///data/beat.db", echo=True
... )
>>> sa_connection = engine.connect()
>>> beat = pd.read_sql(
...     "Band", sa_connection, index_col="id"
... )
>>> beat
   fname      lname  birthyear
id                            
0   Paul  McCartney       1942
1   John     Lennon       1940
```

3. Read from the table using a SQL query. This can use a SQLite connection or a SQLAlchemy connection:
``` python
>>> sql = """SELECT fname, birthyear from Band"""
>>> fnames = pd.read_sql(sql, con)
>>> fnames
  fname  birthyear
0  Paul       1942
1  John       1940
```

The pandas library leverages the SQLAlchemy library, which can talk to most SQL databases. This lets you create DataFrames from tables, or you can run a SQL select query and create the DataFrame from the query.
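
For a quick experiment that needs neither a database file nor SQLAlchemy, a query can be run against an in-memory SQLite database. This is a sketch that reuses the table layout from the steps above; the `WHERE` clause and its parameter are added for illustration.

```python
import sqlite3

import pandas as pd

# An in-memory database avoids touching the file system.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE Band(id INTEGER PRIMARY KEY, fname TEXT, lname TEXT, birthyear INT)"
)
con.executemany(
    "INSERT INTO Band VALUES (?, ?, ?, ?)",
    [(0, "Paul", "McCartney", 1942), (1, "John", "Lennon", 1940)],
)
con.commit()

# params passes values separately from the SQL text.
older = pd.read_sql(
    "SELECT fname, birthyear FROM Band WHERE birthyear < ?",
    con,
    params=(1942,),
)
con.close()

print(older)
```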

#### Reading from HTML

In [23]:
url = 'https://ssltsw.forexprostools.com'

dfs = pd.read_html(url)

display(len(dfs))
display(dfs)

4

[    0   1        2         3            4   5
 0 NaN NaN  EUR/USD    1.1766   Strong Buy NaN
 1 NaN NaN  GBP/USD    1.3082         Sell NaN
 2 NaN NaN  USD/JPY  105.9200          Buy NaN
 3 NaN NaN  AUD/USD    0.7130         Sell NaN
 4 NaN NaN  USD/CAD    1.3402  Strong Sell NaN
 5 NaN NaN  EUR/JPY  124.6100      Neutral NaN
 6 NaN NaN  EUR/CHF    1.0764      Neutral NaN,
     0   1              2         3            4   5
 0 NaN NaN           Gold  1976.700  Strong Sell NaN
 1 NaN NaN         Silver    24.370      Neutral NaN
 2 NaN NaN         Copper     2.855   Strong Buy NaN
 3 NaN NaN  Crude Oil WTI    39.950          Buy NaN
 4 NaN NaN      Brent Oil    43.250   Strong Buy NaN
 5 NaN NaN    Natural Gas     1.856      Neutral NaN
 6 NaN NaN    US Coffee C   118.750  Strong Sell NaN,
     0   1              2         3            4   5
 0 NaN NaN  Euro Stoxx 50   3174.32  Strong Sell NaN
 1 NaN NaN        S&P 500   3271.12   Strong Buy NaN
 2 NaN NaN            DAX  12313.36  St

### Memory Tips

In [24]:
#  list limits for NumPy integer types
display(np.iinfo(np.int32))

# get information about floating-point numbers
display(np.finfo(np.float16))

# ask a DataFrame or Series how many bytes it is using with the .memory_usage method. 
# Note that this also includes the memory requirements of the index. 
# For pandas to extract the exact amount of memory of an object data type column, the deep parameter must be set to True
display(stocks.memory_usage())
display(stocks.Low.memory_usage())
display(stocks.memory_usage(deep=True))

iinfo(min=-2147483648, max=2147483647, dtype=int32)

finfo(resolution=0.001, min=-6.55040e+04, max=6.55040e+04, dtype=float16)

Index     128
Symbol     24
Shares     24
Low        24
High       24
dtype: int64

152

Index     128
Symbol    183
Shares     24
Low        24
High       24
dtype: int64

#### Reducing memory by changing data types

In [25]:
stocks_deep_copy = stocks.copy(deep=True)

# check max values before conversion
display(stocks_deep_copy.select_dtypes(np.int64).describe())

display(stocks_deep_copy.dtypes)
display(stocks_deep_copy.memory_usage())

# convert int64 to int16
stocks_deep_copy['Low'] = stocks_deep_copy['Low'].astype(np.int16)
stocks_deep_copy['High'] = stocks_deep_copy['High'].astype(np.int16)
stocks_deep_copy['Shares'] = stocks_deep_copy['Shares'].astype(np.int16)

display(stocks_deep_copy.dtypes)
display(stocks_deep_copy.memory_usage())

Unnamed: 0,Shares,Low,High
count,3.0,3.0,3.0
mean,32.666667,418.333333,565.0
std,21.93931,419.295043,498.422512
min,8.0,135.0,170.0
25%,24.0,177.5,285.0
50%,40.0,220.0,400.0
75%,45.0,560.0,762.5
max,50.0,900.0,1125.0


Symbol    object
Shares     int64
Low        int64
High       int64
dtype: object

Index     128
Symbol     24
Shares     24
Low        24
High       24
dtype: int64

Symbol    object
Shares     int16
Low        int16
High       int16
dtype: object

Index     128
Symbol     24
Shares      6
Low         6
High        6
dtype: int64

#### Convert to Categorical Data

Consider changing object data types to categorical if they have a reasonably low cardinality (number of unique values)
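
The saving is easiest to see on a column with many repeated values. The sketch below uses a made-up column of three symbols repeated 1000 times; the exact byte counts depend on the platform and pandas version, so only the comparison is meaningful.

```python
import pandas as pd

# Hypothetical low-cardinality column: 3 distinct strings, 3000 rows.
symbols = pd.Series(["AAPL", "AMZN", "TSLA"] * 1000, dtype=object)
as_category = symbols.astype("category")

# deep=True counts the memory of the Python string objects as well.
object_bytes = symbols.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
```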

In [26]:
stocks_deep_copy = stocks.copy(deep=True)

display(stocks_deep_copy)

display(stocks_deep_copy.Symbol.dtype)

print('Num of unique values = ', stocks_deep_copy.select_dtypes(include=["object"]).nunique())
stocks_deep_copy['Symbol'] = stocks_deep_copy['Symbol'].astype("category")

display(stocks_deep_copy.Symbol.dtype)

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,40,135,170
1,AMZN,8,900,1125
2,TSLA,50,220,400


dtype('O')

Num of unique values =  Symbol    3
dtype: int64


CategoricalDtype(categories=['AAPL', 'AMZN', 'TSLA'], ordered=False)

## Series
https://pandas.pydata.org/docs/reference/series.html

In [27]:
# get series out of DataFrame

symbol_series = stocks.Symbol
shares_series = stocks.Shares

display(type(symbol_series))
display(symbol_series.dtype)
display(shares_series.dtype)

pandas.core.series.Series

dtype('O')

dtype('int64')

### Get samples from Series

In [28]:
# This function returns the first `n` rows for the object based on position
display(symbol_series.head())

# n = Number of items from axis to return
display(symbol_series.sample(n=1))

0    AAPL
1    AMZN
2    TSLA
Name: Symbol, dtype: object

0    AAPL
Name: Symbol, dtype: object

### Get stats from Series

In [29]:
display(symbol_series.value_counts())

# Return number of non-NA/null observations in the Series
print('\nCount = ', symbol_series.count())

# return a NumPy array with the unique values
print('\nUnique = ', symbol_series.unique())

# Basic summary statistics are provided with .min, .max, .mean, .median, and .std

display(shares_series.describe())

AMZN    1
AAPL    1
TSLA    1
Name: Symbol, dtype: int64


Count =  3

Unique =  ['AAPL' 'AMZN' 'TSLA']


count     3.000000
mean     32.666667
std      21.939310
min       8.000000
25%      24.000000
50%      40.000000
75%      45.000000
max      50.000000
Name: Shares, dtype: float64

### Series Operations

In [30]:
display(shares_series)

# a new Series or DataFrame is returned when using an operator
display(shares_series + 100)
display(shares_series.add(100))
display(shares_series > 100)

0    40
1     8
2    50
Name: Shares, dtype: int64

0    140
1    108
2    150
Name: Shares, dtype: int64

0    140
1    108
2    150
Name: Shares, dtype: int64

0    False
1    False
2    False
Name: Shares, dtype: bool

### Chaining

In [31]:
# The .pipe method on a Series needs to be passed a function that accepts a Series as input and can return anything

def debug_ser(series):
    print("From pipe:")
    print(series)
    return series

print("\nend result:\n", shares_series.add(100).pipe(debug_ser).astype(float))

From pipe:
0    140
1    108
2    150
Name: Shares, dtype: int64

end result:
 0    140.0
1    108.0
2    150.0
Name: Shares, dtype: float64


## Performing Analysis

In [32]:
display(stocks)

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,40,135,170
1,AMZN,8,900,1125
2,TSLA,50,220,400


### High Level Summary of Data

In [33]:
display(stocks.info())
display(stocks.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Symbol  3 non-null      object
 1   Shares  3 non-null      int64 
 2   Low     3 non-null      int64 
 3   High    3 non-null      int64 
dtypes: int64(3), object(1)
memory usage: 224.0+ bytes


None

Unnamed: 0,Shares,Low,High
count,3.0,3.0,3.0
mean,32.666667,418.333333,565.0
std,21.93931,419.295043,498.422512
min,8.0,135.0,170.0
25%,24.0,177.5,285.0
50%,40.0,220.0,400.0
75%,45.0,560.0,762.5
max,50.0,900.0,1125.0


### Get n largest and smallest values

In [34]:
display(stocks.nlargest(2, "Shares"))
display(stocks.nsmallest(2, "Shares"))

Unnamed: 0,Symbol,Shares,Low,High
2,TSLA,50,220,400
0,AAPL,40,135,170


Unnamed: 0,Symbol,Shares,Low,High
1,AMZN,8,900,1125
0,AAPL,40,135,170


### Sorting

In [35]:
display(stocks)

# sort by a single column
display(stocks.sort_values('Shares', ascending=True))

# sort by multiple columns
# High is used as the tie breaker when Low values are equal
display(stocks.sort_values(['Low', 'High'], ascending=True))

display(stocks)

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,40,135,170
1,AMZN,8,900,1125
2,TSLA,50,220,400


Unnamed: 0,Symbol,Shares,Low,High
1,AMZN,8,900,1125
0,AAPL,40,135,170
2,TSLA,50,220,400


Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,40,135,170
2,TSLA,50,220,400
1,AMZN,8,900,1125


Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,40,135,170
1,AMZN,8,900,1125
2,TSLA,50,220,400


### Drop Duplicates

In [36]:
series_with_dupes = pd.Series([3,54,7653,23,7653, 3])
display(series_with_dupes)

display(series_with_dupes.drop_duplicates())

0       3
1      54
2    7653
3      23
4    7653
5       3
dtype: int64

0       3
1      54
2    7653
3      23
dtype: int64

### Group By

TBD

### String operations

In [44]:
ser = pd.Series(['test', 'lol', 'omg', 'test_1', 23])
display(ser)

display(ser.str.replace('test', 'tst'))

0      test
1       lol
2       omg
3    test_1
4        23
dtype: object

0      tst
1      lol
2      omg
3    tst_1
4      NaN
dtype: object

### Binning 
https://pbpython.com/pandas-qcut-cut.html
    
When dealing with continuous numeric data, it is often helpful to bin the data into multiple buckets for further analysis. There are several different terms for binning including bucketing, discrete binning, discretization or quantization. Pandas supports these approaches using the `cut` and `qcut` functions.

#### qcut

The pandas documentation describes qcut as a “Quantile-based discretization function.” This basically means that __`qcut` tries to divide up the underlying data into equal sized bins__. The function defines the bins using percentiles based on the distribution of the data, not the actual numeric edges of the bins.
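
The difference between quantile-based and width-based bins shows up on skewed data. The values below are made up so that one outlier stretches the range.

```python
import pandas as pd

# Hypothetical skewed data: most values are small, one is very large.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 1000])

# qcut places the bin edges at quantiles, so every bin holds
# the same number of rows.
by_quantile = pd.qcut(values, q=4)
print(by_quantile.value_counts())

# cut with an integer splits the value range into equal-width bins,
# so the outlier leaves most rows in the first bin.
by_width = pd.cut(values, bins=4)
print(by_width.value_counts())
```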

In [73]:
df = pd.DataFrame({'nums': np.arange(12)})
display(df)

# Because we asked for quantiles with q=4 the bins match the percentiles from the describe function.
display(df.describe())

#  create 4 equal sized groupings of the data
display(pd.qcut(df['nums'], q=4))

df['quantiles (qcut)'] = pd.qcut(df['nums'], q=4)

# adding labels to the buckets
df['quantile_labels (qcut)'] = pd.qcut(df['nums'], q=4, labels=['25', '50', '75', '100'])

display(df['quantiles (qcut)'].value_counts())
display(df)

Unnamed: 0,nums
0,0
1,1
2,2
3,3
4,4
5,5
6,6
7,7
8,8
9,9


Unnamed: 0,nums
count,12.0
mean,5.5
std,3.605551
min,0.0
25%,2.75
50%,5.5
75%,8.25
max,11.0


0     (-0.001, 2.75]
1     (-0.001, 2.75]
2     (-0.001, 2.75]
3        (2.75, 5.5]
4        (2.75, 5.5]
5        (2.75, 5.5]
6        (5.5, 8.25]
7        (5.5, 8.25]
8        (5.5, 8.25]
9       (8.25, 11.0]
10      (8.25, 11.0]
11      (8.25, 11.0]
Name: nums, dtype: category
Categories (4, interval[float64]): [(-0.001, 2.75] < (2.75, 5.5] < (5.5, 8.25] < (8.25, 11.0]]

(8.25, 11.0]      3
(5.5, 8.25]       3
(2.75, 5.5]       3
(-0.001, 2.75]    3
Name: quantiles (qcut), dtype: int64

Unnamed: 0,nums,quantiles (qcut),quantile_labels (qcut)
0,0,"(-0.001, 2.75]",25
1,1,"(-0.001, 2.75]",25
2,2,"(-0.001, 2.75]",25
3,3,"(2.75, 5.5]",50
4,4,"(2.75, 5.5]",50
5,5,"(2.75, 5.5]",50
6,6,"(5.5, 8.25]",75
7,7,"(5.5, 8.25]",75
8,8,"(5.5, 8.25]",75
9,9,"(8.25, 11.0]",100


#### cut

`cut` is used to specifically define the bin edges. There is no guarantee about the distribution of items in each bin. In fact, you can define bins in such a way that no items are included in a bin or nearly all items are in a single bin.
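
One detail worth checking when defining edges by hand is that bins are right-closed by default, so a value equal to the lowest edge falls outside every bin. A small sketch with made-up numbers and hypothetical label names:

```python
import pandas as pd

nums = pd.Series([0, 1, 5, 6, 10, 11])
edges = [0, 5, 10, 15]

# Default bins are (0, 5], (5, 10], (10, 15]; the value 0 sits on the
# open left edge and becomes NaN.
default_bins = pd.cut(nums, bins=edges)

# include_lowest=True closes the first interval on the left.
with_lowest = pd.cut(nums, bins=edges, include_lowest=True)

# labels replaces the interval notation with readable names.
labelled = pd.cut(
    nums, bins=edges, include_lowest=True, labels=["low", "mid", "high"]
)

print(default_bins)
print(labelled)
```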

In [78]:
display(df)

df['cut'] = pd.cut(df['nums'], bins=[0, 5, 10, 15])
display(df)

Unnamed: 0,nums,quantiles (qcut),quantile_labels (qcut),cut
0,0,"(-0.001, 2.75]",25,
1,1,"(-0.001, 2.75]",25,"(0.0, 5.0]"
2,2,"(-0.001, 2.75]",25,"(0.0, 5.0]"
3,3,"(2.75, 5.5]",50,"(0.0, 5.0]"
4,4,"(2.75, 5.5]",50,"(0.0, 5.0]"
5,5,"(2.75, 5.5]",50,"(0.0, 5.0]"
6,6,"(5.5, 8.25]",75,"(5.0, 10.0]"
7,7,"(5.5, 8.25]",75,"(5.0, 10.0]"
8,8,"(5.5, 8.25]",75,"(5.0, 10.0]"
9,9,"(8.25, 11.0]",100,"(5.0, 10.0]"


Unnamed: 0,nums,quantiles (qcut),quantile_labels (qcut),cut
0,0,"(-0.001, 2.75]",25,
1,1,"(-0.001, 2.75]",25,"(0.0, 5.0]"
2,2,"(-0.001, 2.75]",25,"(0.0, 5.0]"
3,3,"(2.75, 5.5]",50,"(0.0, 5.0]"
4,4,"(2.75, 5.5]",50,"(0.0, 5.0]"
5,5,"(2.75, 5.5]",50,"(0.0, 5.0]"
6,6,"(5.5, 8.25]",75,"(5.0, 10.0]"
7,7,"(5.5, 8.25]",75,"(5.0, 10.0]"
8,8,"(5.5, 8.25]",75,"(5.0, 10.0]"
9,9,"(8.25, 11.0]",100,"(5.0, 10.0]"


## Methods Overview

Overview of some useful methods:

__Legend__: `s` = Series, `df` = DataFrame

1. `df.mean()`, `df.std()`, and `df.quantile()` - Get a statistical summary of the data.
2. `df.select_dtypes("int64").describe()` - Describe all columns of data type `int64`. This is useful for checking each column's maximum before converting it to a smaller data type.
3. `s.nunique()` / `df.nunique()` - Number of unique values.
4. `df.head()`, `df.tail()`, `df.sample()` - Get sample rows from the data set.
5. `pd.isna()` - Detect missing values for an array-like object.
6. `s.str.replace` / `s.str.split` - Replace and split operations on strings.
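
A few of these methods can be tried together on a small, made-up frame; the column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"symbol": ["AAPL_US", "AMZN_US", None], "shares": [40, 8, 40]}
)

# Element-wise check for missing values.
missing_mask = pd.isna(df["symbol"])

# Number of distinct values in a column.
distinct_shares = df["shares"].nunique()

# String methods skip missing values and keep them missing.
tickers = df["symbol"].str.split("_").str[0]

print(missing_mask)
print(distinct_shares)
print(tickers)
```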