# Pandas Tutorial

The name pandas is derived from 'panel data'. Panel data is data that includes multiple variables about individuals over time (so a combination of cross-sectional data and time series data). For example, a medical study in which many health statistics are recorded over time for thousands of patients.

If you continue using Python for projects at work, you will love pandas and use it more frequently than you use Excel. It is extremely powerful and easy to use (now that you have the other tutorials under your belts).

NumPy's main object is the array. Pandas wraps these arrays (or series, as they are called in pandas) in an object called a dataframe. A dataframe is very similar to an Excel spreadsheet, but the rows and columns are on steroids, allowing for more flexibility and power than Excel. (Extreme understatement!)

So let's get started!

### Import statement
ALWAYS import pandas as pd

In [1]:
import pandas as pd

## 1. Creating dataframes

As with anything in python, there are multiple ways to create a dataframe in pandas. I am only going to present two here.

##### NOTE 1: Although you have the flexibility to use any variable names you want, the python standard is to use df for dataframe. If you have a program or function with only one dataframe, call it df. If you have a program that contains multiple dataframes, then call them df_billings, df_FX_rates, df_waterfall, df_derivatives, df_ex_girlfriends etc. Start the variable name with df_ to let the user know you are talking about a dataframe.

##### NOTE 2: A note about my naming convention:
In python classes are generally capitalized and all other variables <b> DO NOT </b> contain capitals. However, when there is a common abbreviation, I tend to use capital letters for the abbreviation. Some examples:

 - Foreign Exchange: FX
 - United States Dollar (equivalent): US
 - Document Currency: DC
 - Currency tickers: AUD, EUR, GBP, JPY, USD
 - Company code: ADIR, ADUS, AILP, etc.
 - Three month United States Treasury Yield: UST_3m


##### NOTE: In practice 90% of the time we will load data into pandas from some other source and not create them directly, but you need a basic understanding of how to do this because you will need to create them from time to time.

### 1.1 Creating a dataframe from an ndarray

In [2]:
import numpy as np

In [3]:
data = np.random.randint(0, 10, size= (5,5))
data

array([[0, 5, 7, 9, 7],
       [3, 0, 0, 1, 8],
       [1, 1, 5, 3, 7],
       [6, 9, 9, 2, 0],
       [8, 9, 3, 8, 1]])

In [4]:
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E'])
df

   A  B  C  D  E
0  0  5  7  9  7
1  3  0  0  1  8
2  1  1  5  3  7
3  6  9  9  2  0
4  8  9  3  8  1


The first column is called the index. It starts at zero and is similar to the row numbers in excel.

The columns are labeled, in this case with letters just like Excel.
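Because `np.random.randint` draws new numbers on every run, your values will differ from the output shown above. The following is a minimal sketch, assuming you want reproducible values: it seeds a NumPy generator first and then inspects the index and column labels. The seed `42` and the names `rng`, `data_demo` and `df_demo` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator so the same "random" integers appear on every run.
rng = np.random.default_rng(42)
data_demo = rng.integers(0, 10, size=(5, 5))

df_demo = pd.DataFrame(data_demo, columns=['A', 'B', 'C', 'D', 'E'])

print(df_demo.index)          # the default row labels, starting at zero
print(list(df_demo.columns))  # the column labels we supplied
print(df_demo.shape)          # (number of rows, number of columns)
```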

### 1.2 Creating a dataframe from a dictionary of lists

In [5]:
list_albums = ['Forever', 'Greatest Hits', 'Spice', 'Spiceworld']
list_dates = ['11-01-2000', '11-07-2007 ', '11-04-1996', '11-03-1997']        
ww_sales = [4000000, 1200000, 23000000, 13000000]
with_Ginger = [False, True, True, True]
US_certification = [np.nan, np.nan, '7x Platnum', '4x Platnum']

In [6]:
data_dict = {'album': list_albums,
            'date': list_dates,
            'global_sales': ww_sales,
            'Ginger?': with_Ginger,
            'US_certification': US_certification}
data_dict

{'album': ['Forever', 'Greatest Hits', 'Spice', 'Spiceworld'],
 'date': ['11-01-2000', '11-07-2007 ', '11-04-1996', '11-03-1997'],
 'global_sales': [4000000, 1200000, 23000000, 13000000],
 'Ginger?': [False, True, True, True],
 'US_certification': [nan, nan, '7x Platnum', '4x Platnum']}

In [7]:
df_spice_girls = pd.DataFrame(data_dict)
df_spice_girls

           album        date  global_sales  Ginger? US_certification
0        Forever  11-01-2000       4000000    False              NaN
1  Greatest Hits  11-07-2007       1200000     True              NaN
2          Spice  11-04-1996      23000000     True       7x Platnum
3     Spiceworld  11-03-1997      13000000     True       4x Platnum


## 2. Exploring your dataframe

There are several methods that provide descriptive information about your dataframe that can be very helpful if you have a large dataset loaded into pandas.

### 2. a. df.info()

## <font color='red'> The first thing you should always do is check that the variable types are as you expect them to be using df.info() </font>
Many times we get data from excel, web scraping or user input that is not as we expect. We need to check that the data types are as we expect. The .info() method will display data types and general information about the data.

In [8]:
df_spice_girls.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   album             4 non-null      object
 1   date              4 non-null      object
 2   global_sales      4 non-null      int64 
 3   Ginger?           4 non-null      bool  
 4   US_certification  2 non-null      object
dtypes: bool(1), int64(1), object(3)
memory usage: 260.0+ bytes


So what is this telling us?

df_spice_girls is a pandas dataframe (that is, an object of the DataFrame class).

The index is a range index from 0 to 3.

The # column is just the index of the column.

The 'Column' displays all of the column names

The third column displays the count of values that are not null. <b>This can indicate blanks in your dataset.</b> (Seeing a non-null count equal to the number of rows is a good thing.) This is very helpful if there are many rows and you are concerned that your dataset is not complete. If you see several columns with 5,000 non-null values and a column with 4,997, then you know you are missing some data or have blanks that you were not expecting (and need to do something with).
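If you only want the number of blanks per column rather than the full info() report, one possible approach is to combine isna() with sum(). This is a small sketch on a cut-down copy of the Spice Girls data; the names `df_demo`, `missing_per_column` and `rows_with_gaps` are just for illustration.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'album': ['Forever', 'Greatest Hits', 'Spice', 'Spiceworld'],
    'US_certification': [np.nan, np.nan, '7x Platnum', '4x Platnum'],
})

# isna() marks each missing cell as True; sum() then counts the Trues per column.
missing_per_column = df_demo.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
rows_with_gaps = df_demo[df_demo.isna().any(axis=1)]
print(rows_with_gaps)
```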

The Dtype column displays the datatype. The datatypes you will see most often in a dataframe are:
 - float64: (a floating point number, i.e. a decimal)
 - int64: (an integer)
 - datetime64[ns]: This is a date and time format that can store dates to the nanosecond. (Nanoseconds are not really relevant for most of our work.)
 - bool: A column of true/false values.
 - object: Usually strings.
 
 
##### Objects are usually strings! This is important. Pandas stores anything it cannot fit into one of the other types as an object. If you see a variable that should be a number listed as an object, then there is a problem with the data.
Usually this means someone has stored a number in Excel as text.
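When a numeric column does arrive as text, one way to repair it is pd.to_numeric. The sketch below uses made-up values (the series `amounts` is hypothetical, not from the tutorial data) and assumes the only problems are a thousands separator and a placeholder such as 'n/a'.

```python
import pandas as pd

# Hypothetical column where the numbers arrived as text.
amounts = pd.Series(['1000', '2500', '3,000', 'n/a'])
print(amounts.dtype)  # a text dtype, not a numeric one

# Remove the thousands separator, then convert. errors='coerce' turns
# anything that still cannot be parsed into NaN instead of raising an error.
amounts_clean = pd.to_numeric(amounts.str.replace(',', '', regex=False),
                              errors='coerce')
print(amounts_clean)
```

Using errors='coerce' is a design choice: it keeps the conversion from failing, but you should then check how many NaN values were created.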

We can see from our df_spice_girls that pandas does not recognize the date column as a datetime64 type; it is stored as an object. We need to convert it so that pandas recognizes it as a date. The function we use is to_datetime. Because to_datetime is a function in the pandas module, we have to write pd.to_datetime().



In [9]:
df_spice_girls['date'] = pd.to_datetime(df_spice_girls['date'])
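Letting pandas guess the date format works here, but a date such as 11-01-2000 is ambiguous (November 1 or January 11). A hedged sketch, assuming the dates in this tutorial are month-day-year: strip stray whitespace (one of the date strings has a trailing space) and pass an explicit format. The names `dates_text` and `dates` are illustrative.

```python
import pandas as pd

dates_text = pd.Series(['11-01-2000', '11-07-2007 ', '11-04-1996', '11-03-1997'])

# Strip stray whitespace, then parse with an explicit month-day-year format
# so pandas does not have to guess which number is the month.
dates = pd.to_datetime(dates_text.str.strip(), format='%m-%d-%Y')
print(dates)
```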

In [10]:
df_spice_girls.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   album             4 non-null      object        
 1   date              4 non-null      datetime64[ns]
 2   global_sales      4 non-null      int64         
 3   Ginger?           4 non-null      bool          
 4   US_certification  2 non-null      object        
dtypes: bool(1), datetime64[ns](1), int64(1), object(2)
memory usage: 260.0+ bytes


![spice-girls.jpg](attachment:spice-girls.jpg)


### 2. b. df.describe()
The describe method provides summary statistics for all of the numeric variables in your dataframe.

In [11]:
df_spice_girls.describe()

       global_sales
count           4.0
mean     10300000.0
std       9850212.0
min       1200000.0
25%       3300000.0
50%       8500000.0
75%      15500000.0
max      23000000.0


The global sales is the only numeric field (column) in our dataframe, so the descriptive statistics for this variable are presented with the describe method. If we had multiple variables that were numeric, describe would show the details for all of them.
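To see describe with more than one numeric column, here is a small sketch that derives a second numeric column from the first (sales in millions). The column name `global_sales_mm` and the dataframe name `df_demo` are invented for this example. The last line uses include='all', which asks describe to summarize the text column as well.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'album': ['Forever', 'Greatest Hits', 'Spice', 'Spiceworld'],
    'global_sales': [4000000, 1200000, 23000000, 13000000],
})

# A second numeric column derived from the first (sales in millions).
df_demo['global_sales_mm'] = df_demo['global_sales'] / 1_000_000

# describe() now reports one column of statistics per numeric variable.
summary = df_demo.describe()
print(summary)

# include='all' adds the non-numeric columns (count, unique, top, freq).
print(df_demo.describe(include='all'))
```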

### 2. c. df.value_counts()
The value_counts method counts the number of times each unique value shows up in a particular column. It is most useful for categorical (i.e. text) columns.

In [12]:
df_spice_girls['album'].value_counts()

Greatest Hits    1
Spiceworld       1
Spice            1
Forever          1
Name: album, dtype: int64

In [13]:
df_spice_girls['Ginger?'].value_counts()

True     3
False    1
Name: Ginger?, dtype: int64

In [14]:
df_spice_girls['US_certification'].value_counts()

4x Platnum    1
7x Platnum    1
Name: US_certification, dtype: int64

### Notice that the NaN values (blanks in a spreadsheet) did not show up in the value counts!
To fix this we need to add dropna=False to the value_counts method.

### <font color = 'red'> This is a very common problem using data from excel or the web. It is a good idea to always include dropna=False in every value count. </font>

In [15]:
df_spice_girls['US_certification'].value_counts(dropna=False)

NaN           2
4x Platnum    1
7x Platnum    1
Name: US_certification, dtype: int64
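Sometimes the share of each value is more useful than the raw count. As a rough sketch, value_counts also accepts normalize=True, which returns proportions instead of counts. The series name `certs` is illustrative.

```python
import numpy as np
import pandas as pd

certs = pd.Series([np.nan, np.nan, '7x Platnum', '4x Platnum'])

# normalize=True returns the fraction of rows for each value;
# dropna=False keeps the missing values in the tally.
shares = certs.value_counts(dropna=False, normalize=True)
print(shares)
```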

#### To better understand how these methods are helpful, let's use a real dataset.

In [16]:
df= pd.read_excel('../data/FX_DERIVATIVES_ALL.xlsx', sheet_name='FX_forwards_static')

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5314 entries, 0 to 5313
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   ID             5314 non-null   object        
 1   Entity         5314 non-null   object        
 2   Counterparty   5314 non-null   object        
 3   Currency_Pair  5314 non-null   object        
 4   Buy_Currency   5314 non-null   object        
 5   Sell_Currency  5314 non-null   object        
 6   Trade_Date     5314 non-null   datetime64[ns]
 7   Delivery_Date  5314 non-null   datetime64[ns]
 8   Contract_Rate  5314 non-null   float64       
 9   Buy_Amount     5314 non-null   float64       
 10  Sell_Amount    5314 non-null   float64       
dtypes: datetime64[ns](2), float64(3), object(6)
memory usage: 456.8+ KB


##### OK so what does info() tell us at this point:
 - The index is a range from 0 to 5313. This is similar to a row number in excel.
 - pandas correctly categorized all of the text (string) columns as objects
 - both dates are correctly categorized as datetime objects
 - all of the numeric entries, {Contract_Rate, Buy_Amount, Sell_Amount} are floating point numbers
 - Every column contains 5314 entries, so we do not have missing values.
 - We can also see the column names.
 

##### I do not like the capitalization of these columns, so I am going to rename these columns without the capital letters. (Except for ID because this is a common abbreviation.)

There are several ways to rename the columns. The first is a simple dictionary approach using the rename() method of a dataframe that you will most likely use very often.

###### NOTE: If you are getting data from tableau, you will most likely be using .rename() very often. We should not have blank spaces in column names. (Very Strong Preference, but not a requirement)

In [None]:
df = df.rename(columns = {'Entity': 'entity',
                         'Counterparty': 'counterparty',
                         'Currency_Pair': 'currency_pair',
                         'Buy_Currency': 'buy_currency',
                         'Sell_Currency': 'sell_currency',
                         'Trade_Date': 'trade_date',
                         'Delivery_Date': 'delivery_date',
                         'Contract_Rate': 'contract_rate',
                         'Buy_Amount': 'buy_amount',
                         'Sell_Amount': 'sell_amount'})

Another way to do this would be to extract the column names into a list, change the list capitalization and then rename the columns to the new list.

In [18]:
# extracting column names
list_cols = df.columns
list_cols

Index(['ID', 'Entity', 'Counterparty', 'Currency_Pair', 'Buy_Currency',
       'Sell_Currency', 'Trade_Date', 'Delivery_Date', 'Contract_Rate',
       'Buy_Amount', 'Sell_Amount'],
      dtype='object')

Notice that our variable list_cols is an Index and not a list! This is because pandas knows the columns are an index into the dataframe. We need to change this to be a list using the list() constructor.

In [19]:
list_cols = list(list_cols)
list_cols

['ID',
 'Entity',
 'Counterparty',
 'Currency_Pair',
 'Buy_Currency',
 'Sell_Currency',
 'Trade_Date',
 'Delivery_Date',
 'Contract_Rate',
 'Buy_Amount',
 'Sell_Amount']

In [20]:
# convert the column names to be lower case for all columns EXCEPT 'ID'
for i, item in enumerate(list_cols):
    print(i, item)
    if item != 'ID':
        lower_case = item.lower()
        list_cols[i] = lower_case
list_cols

0 ID
1 Entity
2 Counterparty
3 Currency_Pair
4 Buy_Currency
5 Sell_Currency
6 Trade_Date
7 Delivery_Date
8 Contract_Rate
9 Buy_Amount
10 Sell_Amount


['ID',
 'entity',
 'counterparty',
 'currency_pair',
 'buy_currency',
 'sell_currency',
 'trade_date',
 'delivery_date',
 'contract_rate',
 'buy_amount',
 'sell_amount']

In [21]:
# reassigning the columns of the dataframe to be the lower case list of column names.
df.columns = list_cols
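The loop above can also be written as a single list comprehension. This is only a sketch on an empty toy dataframe (`df_demo`) that borrows a few of the column names; it is an alternative style, not a replacement for the approach shown.

```python
import pandas as pd

# Empty toy dataframe that borrows a few of the original column names.
df_demo = pd.DataFrame(columns=['ID', 'Entity', 'Currency_Pair', 'Trade_Date'])

# Lower-case every column name except 'ID'.
df_demo.columns = [col if col == 'ID' else col.lower() for col in df_demo.columns]
print(list(df_demo.columns))
```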

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5314 entries, 0 to 5313
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   ID             5314 non-null   object        
 1   entity         5314 non-null   object        
 2   counterparty   5314 non-null   object        
 3   currency_pair  5314 non-null   object        
 4   buy_currency   5314 non-null   object        
 5   sell_currency  5314 non-null   object        
 6   trade_date     5314 non-null   datetime64[ns]
 7   delivery_date  5314 non-null   datetime64[ns]
 8   contract_rate  5314 non-null   float64       
 9   buy_amount     5314 non-null   float64       
 10  sell_amount    5314 non-null   float64       
dtypes: datetime64[ns](2), float64(3), object(6)
memory usage: 456.8+ KB


### Now let's explore the dataset using describe and value_counts

In [23]:
df.describe()

       contract_rate    buy_amount    sell_amount
count         5314.0        5314.0         5314.0
mean       68.344292   140864800.0   -150410500.0
std       226.661081   523273200.0    486116900.0
min         0.560555          41.0  -8082055000.0
25%         1.131188     2861728.0    -16000000.0
50%         1.390734     5722594.0     -6502750.0
75%        59.077625    13003580.0     -2696684.0
max           1371.7  6500000000.0      7537424.0


This does not tell us much because the rates and amounts come from such a muddled mix of currencies. We will use this later to look at a specific currency and it will be more helpful.


In [24]:
df['entity'].value_counts(dropna=False)

ADIR    3631
ADUS    1455
AILP     101
NDIN      65
NDBV      52
ASCN       5
ADCN       4
NLGM       1
Name: entity, dtype: int64

In [25]:
df['counterparty'].value_counts(dropna=False)

BOA     1497
JPM     1354
HSBC     758
WFB      481
USB      428
CITI     321
RBS      159
GS       156
UBS       97
SG        53
MSC        7
SOC        3
Name: counterparty, dtype: int64

### It looks like SG and SOC are both tickers for Societe Generale. We need to change these to be one specific counterparty.

To do this we will use the replace method. Inputs:
 - to_replace: The string or value you want replaced in the dataset
 - value: The new value in the dataframe

In [26]:
df.replace('SG', 'SOC', inplace=True)
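Calling replace on the whole dataframe changes every cell equal to 'SG', in any column. If the same code could appear elsewhere, a more cautious option is to replace within a single column. A minimal sketch on toy data, where the 'SG' entity code is a hypothetical value added to show the difference:

```python
import pandas as pd

# Toy data: 'SG' appears as a counterparty and, hypothetically, as an entity code.
df_demo = pd.DataFrame({'entity': ['SG', 'ADIR', 'ADUS'],
                        'counterparty': ['SG', 'SOC', 'JPM']})

# Replace only within the counterparty column; other columns are left untouched.
df_demo['counterparty'] = df_demo['counterparty'].replace({'SG': 'SOC'})
print(df_demo)
```

The dictionary form also scales to several corrections at once, one key per old value.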

In [27]:
df['counterparty'].value_counts(dropna=False)

BOA     1497
JPM     1354
HSBC     758
WFB      481
USB      428
CITI     321
RBS      159
GS       156
UBS       97
SOC       56
MSC        7
Name: counterparty, dtype: int64

In [28]:
df['currency_pair'].value_counts(dropna=False)

EURUSD             1041
GBPUSD              556
USDJPY              525
AUDUSD              455
USDINR Offshore     349
USDCAD              302
USDCHF              267
USDINR              232
USDRON              229
USDSEK              171
USDKRW              164
USDSGD              133
USDBRL Offshore     127
USDDKK              121
USDSGD Offshore     102
USDNOK               93
USDKRW Offshore      66
USDILS               59
USDBRL               50
USDCNY Offshore      40
USDHKD               37
USDCNY               29
USDRUB               27
NZDUSD               26
USDILS Offshore      17
USDCZK               16
USDUAH               14
EURGBP               10
EURSEK                7
EURRON                6
EURJPY                6
EURRUB                6
EURPLN                6
EURCHF                5
USDTRY                4
USDMXP                3
EURNOK                3
EURDKK                3
EURKRW                2
USDZAR                2
EURBRL                2
USDEUR          

### Now we have our first REAL data issue. The currency pairs are more than the standard 6 characters when the currency is offshore!

There are several approaches to deal with this, but let's try splitting the column currency_pair into two separate columns using str.split(" "). This should move the 'Offshore' text to another column and the currency_pair column will contain only the first six characters of the column.

Below we are adding two columns to the dataframe. 'currency_pair' (which already exists, so we are writing over this column) and 'offshore'. We are splitting the string contained in the 'currency_pair' column based on whitespace.
The expand=True is necessary for pandas to return a dataframe (versus a series).

In [29]:
df[['currency_pair', 'offshore']] = df['currency_pair'].str.split(" ", expand=True)
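To see what str.split with expand=True returns before it is assigned back, here is a small sketch on three sample values. The names `pairs` and `parts` are illustrative.

```python
import pandas as pd

pairs = pd.Series(['EURUSD', 'USDINR Offshore', 'USDBRL Offshore'])

# expand=True returns a dataframe with one column per piece of the split.
# Rows without a second piece get a missing value in that column.
parts = pairs.str.split(' ', expand=True)
parts.columns = ['currency_pair', 'offshore']
print(parts)
```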

In [30]:
df.head()

,ID,entity,counterparty,currency_pair,buy_currency,sell_currency,trade_date,delivery_date,contract_rate,buy_amount,sell_amount,offshore
0,ADUS2357,ADUS,JPM,USDINR,INR,USD,2020-09-24,2020-11-05,74.071,700000000.0,-9450392.19,
1,ADIR8274,ADIR,JPM,GBPUSD,USD,GBP,2020-09-24,2020-09-30,1.273907,7643443.0,-6000000.0,
2,ADIR8273,ADIR,HSBC,EURUSD,USD,EUR,2020-09-24,2020-10-07,1.16682,14001840.0,-12000000.0,
3,ADUS2355,ADUS,WFB,USDCAD,CAD,USD,2020-09-22,2020-09-24,1.331265,10000000.0,-7511652.45,
4,ADUS2356,ADUS,WFB,USDCAD,USD,CAD,2020-09-22,2021-02-10,1.330427,6013108.0,-8000000.0,


So we can see that we have a new variable 'offshore'. First let's check that the currency_pair column looks correct by using value_counts.

In [31]:
df['currency_pair'].value_counts()

EURUSD    1041
USDINR     581
GBPUSD     556
USDJPY     525
AUDUSD     455
USDCAD     302
USDCHF     267
USDSGD     235
USDKRW     230
USDRON     229
USDBRL     177
USDSEK     171
USDDKK     121
USDNOK      93
USDILS      76
USDCNY      69
USDHKD      37
USDRUB      27
NZDUSD      26
USDCZK      16
USDUAH      14
EURGBP      10
EURSEK       7
EURRON       6
EURRUB       6
EURJPY       6
EURPLN       6
EURCHF       5
USDTRY       4
USDMXP       3
EURNOK       3
EURDKK       3
EURKRW       2
EURBRL       2
USDZAR       2
USDEUR       1
Name: currency_pair, dtype: int64

OK all of these are actually currency pairs with no additional text. So we are halfway there. Now let's turn our attention to the 'offshore' column

In [32]:
df['offshore'].value_counts(dropna=False)

NaN         4613
Offshore     701
Name: offshore, dtype: int64

We want to clean this by making all rows that say 'Offshore' equal to 'True' and all NaNs equal to 'False'. We will do this with the replace and fillna methods.

In [33]:
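# assign the results back to the column; inplace=True on a selected column
# may not update the dataframe in newer versions of pandas
df['offshore'] = df['offshore'].replace('Offshore', 'True')
df['offshore'] = df['offshore'].fillna('False')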
df['offshore'].value_counts(dropna=False)

False    4613
True      701
Name: offshore, dtype: int64

Now let's check that the datatypes are still correct.

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5314 entries, 0 to 5313
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   ID             5314 non-null   object        
 1   entity         5314 non-null   object        
 2   counterparty   5314 non-null   object        
 3   currency_pair  5314 non-null   object        
 4   buy_currency   5314 non-null   object        
 5   sell_currency  5314 non-null   object        
 6   trade_date     5314 non-null   datetime64[ns]
 7   delivery_date  5314 non-null   datetime64[ns]
 8   contract_rate  5314 non-null   float64       
 9   buy_amount     5314 non-null   float64       
 10  sell_amount    5314 non-null   float64       
 11  offshore       5314 non-null   object        
dtypes: datetime64[ns](2), float64(3), object(7)
memory usage: 498.3+ KB


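This is not exactly what we want because the 'offshore' variable is stored as text (an object) rather than as a boolean. Be careful here: astype('bool') would treat every non-empty string as True, including the string 'False'. Instead, we compare the column with the string 'True', which returns a proper boolean column.

In [36]:
df['offshore'] = df['offshore'] == 'True'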
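One thing to watch when converting text to booleans: Python treats any non-empty string as true, so the strings 'True' and 'False' both convert to True. The sketch below shows this on toy data and two ways around it. All variable names are illustrative.

```python
import pandas as pd

flags_text = pd.Series(['True', 'False', 'False'], dtype=object)

# Pitfall: every non-empty string is "truthy", so this is True on all rows.
naive = flags_text.astype(bool)
print(naive.tolist())

# Comparing with the string 'True' gives the intended booleans.
correct = flags_text == 'True'
print(correct.tolist())

# Alternatively, build the flag directly from the original text.
pairs = pd.Series(['EURUSD', 'USDINR Offshore', 'USDBRL Offshore'])
offshore = pairs.str.endswith('Offshore')
print(offshore.tolist())
```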

In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5314 entries, 0 to 5313
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   ID             5314 non-null   object        
 1   entity         5314 non-null   object        
 2   counterparty   5314 non-null   object        
 3   currency_pair  5314 non-null   object        
 4   buy_currency   5314 non-null   object        
 5   sell_currency  5314 non-null   object        
 6   trade_date     5314 non-null   datetime64[ns]
 7   delivery_date  5314 non-null   datetime64[ns]
 8   contract_rate  5314 non-null   float64       
 9   buy_amount     5314 non-null   float64       
 10  sell_amount    5314 non-null   float64       
 11  offshore       5314 non-null   bool          
dtypes: bool(1), datetime64[ns](2), float64(3), object(6)
memory usage: 462.0+ KB


### 2. d. df.head(), df.tail() and df.sample()
When using large datasets, it is difficult to see all of the data. A good way to check the data is to check the first n rows using head, the last n rows using tail and a random sample of rows using sample.

To better demonstrate this, we will be using the fx_derivatives_all database. (This may take a little while to load.)

The head(n_rows) method shows the first n_rows of the dataframe and their column names. The default n_rows is equal to 5, so if you do not specify the number of rows, it will show you the first five.

In [None]:
df.head()

Sometimes looking at the bottom rows is important to make sure the data loaded properly. This is particularly true if you are loading a spreadsheet that someone else has created. (Totals at the bottom of the data, comments and random text may cause the data to be messy.) df.tail(n_rows) will show the final n_rows.

In [None]:
df.tail(5)

df.sample(n_rows) is a method that will pull a random sample of the dataframe rows. This can be particularly helpful if you are checking for data errors.

In [None]:
df.sample(5)
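Because sample is random, each call returns different rows. If you need the same sample every time (for example, to discuss specific rows with a colleague), sample accepts a random_state argument. A small sketch on a toy dataframe of 100 rows, with illustrative names:

```python
import pandas as pd

df_demo = pd.DataFrame({'row_id': range(100)})

first_three = df_demo.head(3)   # rows 0, 1, 2
last_three = df_demo.tail(3)    # rows 97, 98, 99

# The same random_state always selects the same rows.
sample_a = df_demo.sample(5, random_state=0)
sample_b = df_demo.sample(5, random_state=0)
print(sample_a.equals(sample_b))
```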

This is usually more helpful if we have more than 100 rows of data, but I think you get the point.

In [None]:
df['entity'].value_counts()

In [None]:
df['counterparty'].value_counts()

In [None]:
df.describe()

In [None]:
df.info()

### 2. e. Other descriptive information

In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.index

## 3. Indexing, subsetting, slicing and selecting columns

### 3. a. df.iloc
To reference a particular row, column or group of rows and columns by numeric position (row or column number as an integer), use .iloc[].

In [None]:
# will return the first row, every column
df.iloc[0,:]

In [None]:
# will return the first column, every row
df.iloc[:,0]

In [None]:
# will take a subset of columns and rows
df.iloc[5:10, :4]

##### note that the index (very first column on the left) is not considered one of the columns, but just a row reference

In [None]:
df.iloc[-1, :]
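The same patterns are easier to follow on a tiny dataframe where you can check every value by eye. A minimal sketch with made-up values:

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [10, 20, 30, 40],
                        'B': [1, 2, 3, 4],
                        'C': ['w', 'x', 'y', 'z']})

first_row = df_demo.iloc[0, :]    # first row, every column (returned as a series)
block = df_demo.iloc[1:3, :2]     # rows 1 and 2 (the end is excluded), first two columns
last_cell = df_demo.iloc[-1, -1]  # negative numbers count from the end

print(first_row)
print(block)
print(last_cell)
```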

### 3. b. df.loc

To reference a row, a series of rows or columns by their labels, or by a condition on the values in a column, use .loc[].

In [None]:
df.loc[df['ID']=='ADIR3715']

In [None]:
df.loc[df['buy_currency']=='EUR']

In [None]:
df.loc[df['trade_date']>= '2020-01-01']
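Conditions can be combined, and .loc can select columns at the same time. The sketch below uses a toy dataframe with hypothetical trade IDs and dates. Each condition is wrapped in parentheses and joined with & (and); use | for or.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'ID': ['T1', 'T2', 'T3', 'T4'],
    'buy_currency': ['EUR', 'USD', 'EUR', 'JPY'],
    'trade_date': pd.to_datetime(['2019-12-30', '2020-01-15',
                                  '2020-03-01', '2020-06-10']),
})

# EUR purchases traded on or after 1 January 2020.
mask = (df_demo['buy_currency'] == 'EUR') & (df_demo['trade_date'] >= '2020-01-01')

# Rows are chosen by the mask, columns by their labels.
result = df_demo.loc[mask, ['ID', 'trade_date']]
print(result)
```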

In [None]:
Selection

In [None]:
Data Cleaning

In [None]:
Filter, Sort, Groupby

In [None]:
Join/Combine

Statistics

## Notes on what to do next

- Add columns
    - numeric
    - with operations
    - to some
    - text
    
- boolean mask
- merge
- pivot
- groupby
    Possibly do a groupby to see the total spend by currency on a specific currency
    Average EUR rates, total sold, etc
    
    
    
- pandas plotting

In [None]:
 - Examples :
        FX Derivatives ALL
        - average spot
        - best counterparty by currency
        - most often used WTF.
        - stochastic
        


Beautiful Soup
 - getting financial data
 - reporting
 - automation (from book)
 - seaborn