# Pandas Tutorial

The name pandas is derived from 'panel data'. Panel data is data that includes multiple variables about individuals over time (so a combination of cross-sectional data and time series data). For example, a medical study might store a myriad of health statistics over time for each of thousands of patients.

If you continue using Python for projects at work, you will love pandas and use it more frequently than you use Excel. It is extremely powerful and easy to use (now that you have the other tutorials under your belt).

NumPy's main object is the array. Pandas brings these arrays (or series, as they are called in pandas) into an object called a dataframe. A dataframe is very similar to an Excel spreadsheet, but the rows and columns are on steroids, allowing for more flexibility and power than Excel. (Extreme understatement!)

So let's get started!

### Import statement
ALWAYS import pandas as pd

In [107]:
import pandas as pd

## 1. Creating dataframes

As with anything in Python, there are multiple ways to create a dataframe in pandas. I am only going to present two here.

##### NOTE 1: Although you have the flexibility to use any variable names you want, the common Python convention is to use df for a dataframe. If you have a program or function with only one dataframe, call it df. If you have a program that contains multiple dataframes, then call them df_billings, df_FX_rates, df_waterfall, df_derivatives, df_ex_girlfriends etc. Start the variable name with df_ to let the reader know you are talking about a dataframe.

##### NOTE 2: A note about my naming convention:
In Python, class names are generally capitalized and all other variables <b> DO NOT </b> contain capitals. However, when there is a common abbreviation, I tend to use capital letters for the abbreviation. Some examples:

 - Foreign Exchange: FX
 - United States Dollar (equivalent): US
 - Document Currency: DC
 - Currency tickers: AUD, EUR, GBP, JPY, USD
 - Company code: ADIR, ADUS, AILP, etc.
 - Three month United States Treasury Yield: UST_3m


##### NOTE: In practice, 90% of the time we will load data into pandas from some other source and not create dataframes directly, but you need a basic understanding of how to do this because you will need to create them from time to time.

## 1.1 Creating a dataframe from an ndarray

In [108]:
import numpy as np

In [109]:
data = np.random.randint(0, 10, size=(5, 5))
data

array([[9, 7, 7, 7, 9],
       [9, 7, 7, 7, 2],
       [9, 8, 0, 6, 6],
       [4, 9, 0, 7, 1],
       [9, 8, 2, 1, 5]])

In [110]:
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E'])
df

| | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | 9 | 7 | 7 | 7 | 9 |
| 1 | 9 | 7 | 7 | 7 | 2 |
| 2 | 9 | 8 | 0 | 6 | 6 |
| 3 | 4 | 9 | 0 | 7 | 1 |
| 4 | 9 | 8 | 2 | 1 | 5 |


The first column is called the index. It starts at zero and is similar to the row numbers in Excel.

The columns are labeled (in this case with letters, just like Excel).
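
To make the index and the column labels concrete, here is a minimal sketch that uses a small fixed array (instead of random numbers, so the result is the same on every run) and inspects both. The names arr and df_small are made up for this illustration.

```python
import numpy as np
import pandas as pd

# A fixed 2x3 array so the output is reproducible
arr = np.array([[1, 2, 3],
                [4, 5, 6]])

df_small = pd.DataFrame(arr, columns=['A', 'B', 'C'])

# The index holds the row labels; by default pandas numbers rows from zero
print(list(df_small.index))    # [0, 1]

# The columns attribute holds the column labels we passed in
print(list(df_small.columns))  # ['A', 'B', 'C']
```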

## 1.2 Creating a dataframe from a dictionary of lists

In [111]:
list_albums = ['Forever', 'Greatest Hits', 'Spice', 'Spiceworld']
list_dates = ['11-01-2000', '11-07-2007 ', '11-04-1996', '11-03-1997']        
ww_sales = [4000000, 1200000, 23000000, 13000000]
with_Ginger = [False, True, True, True]
US_certification = [np.nan, np.nan, '7x Platnum', '4x Platnum']

In [112]:
data_dict = {'album': list_albums,
            'date': list_dates,
            'global_sales': ww_sales,
            'Ginger?': with_Ginger,
            'US_certification': US_certification}
data_dict

{'album': ['Forever', 'Greatest Hits', 'Spice', 'Spiceworld'],
 'date': ['11-01-2000', '11-07-2007 ', '11-04-1996', '11-03-1997'],
 'global_sales': [4000000, 1200000, 23000000, 13000000],
 'Ginger?': [False, True, True, True],
 'US_certification': [nan, nan, '7x Platnum', '4x Platnum']}

In [113]:
df_spice_girls = pd.DataFrame(data_dict)
df_spice_girls

| | album | date | global_sales | Ginger? | US_certification |
|---|---|---|---|---|---|
| 0 | Forever | 11-01-2000 | 4000000 | False | NaN |
| 1 | Greatest Hits | 11-07-2007 | 1200000 | True | NaN |
| 2 | Spice | 11-04-1996 | 23000000 | True | 7x Platnum |
| 3 | Spiceworld | 11-03-1997 | 13000000 | True | 4x Platnum |


## 2. Exploring your dataframe

There are several methods that provide descriptive information about your dataframe that can be very helpful if you have a large dataset loaded into pandas.

## 2. a. df.info()

### The first thing you should always do is check that the variable types are as you expect them to be.
Data from Excel, web scraping or user input often arrives in a form we do not expect. The .info() method displays the data types and general information about the data.

In [114]:
df_spice_girls.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   album             4 non-null      object
 1   date              4 non-null      object
 2   global_sales      4 non-null      int64 
 3   Ginger?           4 non-null      bool  
 4   US_certification  2 non-null      object
dtypes: bool(1), int64(1), object(3)
memory usage: 260.0+ bytes


The # column is just the position of each column, starting at zero.

The 'Column' column displays all of the column names.

The 'Non-Null Count' column displays how many values in each column are not null. A count lower than the number of entries (here, 2 out of 4 for US_certification) indicates blanks in your dataset. (Seeing the full count as 'non-null' is a good thing.)

The Dtype column displays the datatype. The datatypes you will see most often in a dataframe are:
 - float64 (a floating point number, i.e. a decimal)
 - int64 (an integer)
 - datetime64[ns] This is a date and time format that can store dates to the nanosecond. (Nanoseconds are not really relevant for most of our work.)
 - object
 - bool (True or False)
 
##### Objects are (almost always) strings! This is important. Pandas stores anything it cannot fit into one of the more specific types as an object, and in practice that means text. If you see a variable that should be a number listed as an object, then there is a problem with the data.
Usually this means someone has stored a number as text in Excel.
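
As a rough sketch of how to repair that situation, the example below builds a tiny made-up dataframe whose numbers arrived as text and converts the column with pd.to_numeric. The dataframe df_text and its values are hypothetical and only there to illustrate the idea.

```python
import pandas as pd

# Hypothetical example: amounts that arrived as text, e.g. from an Excel export
df_text = pd.DataFrame({'amount': ['100', '250', '75']})

# pandas does not treat this column as numeric
print(pd.api.types.is_numeric_dtype(df_text['amount']))   # False

# pd.to_numeric converts the text to numbers
df_text['amount'] = pd.to_numeric(df_text['amount'])
print(pd.api.types.is_numeric_dtype(df_text['amount']))   # True
print(df_text['amount'].sum())                            # 425
```

If some values cannot be converted, pd.to_numeric raises an error by default. Passing errors='coerce' turns those values into NaN instead, which you can then find with .info().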

We can see from the .info() output that pandas does not recognize the date column of df_spice_girls as a datetime64 type; it is listed as an object. We need to convert it so that pandas treats it as a date.



In [115]:
df_spice_girls['date'] = pd.to_datetime(df_spice_girls['date'])

In [116]:
df_spice_girls.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   album             4 non-null      object        
 1   date              4 non-null      datetime64[ns]
 2   global_sales      4 non-null      int64         
 3   Ginger?           4 non-null      bool          
 4   US_certification  2 non-null      object        
dtypes: bool(1), datetime64[ns](1), int64(1), object(2)
memory usage: 260.0+ bytes
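
When pandas has to guess the layout of a date string, a value like 11-01-2000 is ambiguous (November 1st or January 11th?). A safer variation, sketched below, is to tell pd.to_datetime the layout explicitly with the format argument. I am assuming here that the strings are month-day-year; adjust the format if your data is day-first. The sketch also strips the stray trailing space that one of the strings contains.

```python
import pandas as pd

# Same date strings as above, including the stray trailing space
dates = pd.Series(['11-01-2000', '11-07-2007 ', '11-04-1996', '11-03-1997'])

# Strip whitespace first, then state the layout explicitly.
# ASSUMPTION: the strings are month-day-year.
parsed = pd.to_datetime(dates.str.strip(), format='%m-%d-%Y')

print(parsed.dt.year.tolist())   # [2000, 2007, 1996, 1997]
print(parsed.dt.month.tolist())  # [11, 11, 11, 11]
```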


## 2. b. df.describe()
The describe method provides summary statistics for all of the numeric variables in your dataframe.

In [117]:
df_spice_girls.describe()

| | global_sales |
|---|---|
| count | 4.0 |
| mean | 10300000.0 |
| std | 9850212.0 |
| min | 1200000.0 |
| 25% | 3300000.0 |
| 50% | 8500000.0 |
| 75% | 15500000.0 |
| max | 23000000.0 |


The global sales is the only numeric field (column) in our dataframe, so the descriptive statistics for this variable are presented with the describe method. If we had multiple variables that were numeric, describe would show the details for all of them.
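
Here is a quick sketch of that last point. It rebuilds the album and sales columns from above and adds a second numeric column, sales_millions, which I derive here purely for illustration (it is not part of the original dataframe). Describe then reports on both numeric columns and skips the text column.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'album': ['Forever', 'Greatest Hits', 'Spice', 'Spiceworld'],
    'global_sales': [4000000, 1200000, 23000000, 13000000],
})

# Derived column, added only so there is a second numeric column to describe
df_demo['sales_millions'] = df_demo['global_sales'] / 1_000_000

summary = df_demo.describe()

# Only the numeric columns appear; 'album' is left out
print(list(summary.columns))                  # ['global_sales', 'sales_millions']
print(summary.loc['max', 'sales_millions'])   # 23.0
```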

## 2. c. df.value_counts()
The value_counts method counts the number of times each unique value shows up in a particular column. It is most useful for categorical (i.e. text) columns.

In [118]:
df_spice_girls['album'].value_counts()

Forever          1
Spiceworld       1
Greatest Hits    1
Spice            1
Name: album, dtype: int64

In [119]:
df_spice_girls['Ginger?'].value_counts()

True     3
False    1
Name: Ginger?, dtype: int64

In [120]:
df_spice_girls['US_certification'].value_counts()

7x Platnum    1
4x Platnum    1
Name: US_certification, dtype: int64

### Notice that the NaN values (i.e. blanks in a spreadsheet) did not show up in the value counts!
To include them, pass dropna=False to the value_counts method.

<font color = 'red'> This is a very common problem when using data from Excel or the web. It is a good idea to include dropna=False in every value count. </font>
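
A minimal sketch of the difference, using the same certification values as above in a standalone series (the variable names cert and counts are just for this example):

```python
import numpy as np
import pandas as pd

cert = pd.Series([np.nan, np.nan, '7x Platnum', '4x Platnum'],
                 name='US_certification')

# Default behaviour: the two NaN rows are silently left out
print(cert.value_counts().sum())    # 2

# With dropna=False, NaN gets its own row in the result
counts = cert.value_counts(dropna=False)
print(counts.sum())                 # 4

# Results are sorted by frequency, so the NaN row (count 2) comes first
print(counts.iloc[0])               # 2
```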

## 2. d. df.head(), df.tail() and df.sample()
When using large datasets, it is difficult to see all of the data. A good way to check the data is to look at the first n rows using head, the last n rows using tail and a random sample of rows using sample.

To better demonstrate this, we will load FX derivatives data from an Excel file.

In [121]:
df = pd.read_excel('../data/FX_DERIVATIVES_SOME.xlsx')

In [122]:
df.head(10)

| | ID | Entity | Counterparty | Currency_Pair | Buy_Currency | Sell_Currency | Trade_Date | Delivery_Date | Contract_Rate | Buy_Amount | Sell_Amount |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ADIR3725 | ADIR | RBS | USDJPY | USD | JPY | 2012-08-15 | 2012-10-10 | 78.867 | 4564647.0 | -360000000 |
| 1 | ADUS813 | ADUS | CITI | EURUSD | USD | EUR | 2012-08-15 | 2012-10-11 | 1.229378 | 614689.0 | -500000 |
| 2 | ADIR3723 | ADIR | GS | USDCHF | CHF | USD | 2012-08-14 | 2012-09-19 | 0.973454 | 9000000.0 | -9245429 |
| 3 | ADIR3719 | ADIR | UBS | EURUSD | USD | EUR | 2012-08-13 | 2012-10-10 | 1.233346 | 6166730.0 | -5000000 |
| 4 | ADUS810 | ADUS | CITI | EURUSD | EUR | USD | 2012-08-09 | 2012-10-11 | 1.2294 | 1000000.0 | -1229400 |
| 5 | ADIR3717 | ADIR | CITI | USDSGD | USD | SGD | 2012-08-09 | 2012-09-14 | 1.245248 | 1204579.0 | -1500000 |
| 6 | ADIR3715 | ADIR | BOA | USDJPY | USD | JPY | 2012-08-09 | 2012-10-10 | 78.5226 | 3820556.0 | -300000000 |
| 7 | ADUS808 | ADUS | CITI | USDCHF | CHF | USD | 2012-08-08 | 2012-11-16 | 0.9704 | 61000000.0 | -62860676 |
| 8 | ADIR3709 | ADIR | BOA | USDJPY | USD | JPY | 2012-08-06 | 2012-10-10 | 78.1261 | 3839946.0 | -300000000 |
| 9 | ADIR3708 | ADIR | BOA | EURUSD | USD | EUR | 2012-08-06 | 2012-10-10 | 1.240847 | 3722541.0 | -3000000 |


In [123]:
df.tail(5)

| | ID | Entity | Counterparty | Currency_Pair | Buy_Currency | Sell_Currency | Trade_Date | Delivery_Date | Contract_Rate | Buy_Amount | Sell_Amount |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 31 | ADIR3678 | ADIR | CITI | USDSGD | USD | SGD | 2012-07-10 | 2012-09-14 | 1.266689 | 2763109.0 | -3500000 |
| 32 | ADIR3677 | ADIR | BOA | AUDUSD | USD | AUD | 2012-07-10 | 2012-09-12 | 1.014912 | 1522368.0 | -1500000 |
| 33 | AILP469 | AILP | CITI | USDILS | ILS | USD | 2012-06-20 | 2012-09-21 | 3.876038 | 7000000.0 | -1805968 |
| 34 | ADIR3551 | ADIR | JPM | EURUSD | USD | EUR | 2012-03-20 | 2012-10-12 | 1.324263 | 177980900.0 | -134400000 |
| 35 | NDBV169 | NDBV | BOA | EURKRW | EUR | KRW | 2020-03-13 | 2020-03-17 | 1371.7 | 2267166.0 | -3109871282 |


In [124]:
df.sample(10)

| | ID | Entity | Counterparty | Currency_Pair | Buy_Currency | Sell_Currency | Trade_Date | Delivery_Date | Contract_Rate | Buy_Amount | Sell_Amount |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 35 | NDBV169 | NDBV | BOA | EURKRW | EUR | KRW | 2020-03-13 | 2020-03-17 | 1371.7 | 2267166.0 | -3109871282 |
| 18 | ADIR3698 | ADIR | BOA | AUDUSD | USD | AUD | 2012-07-26 | 2012-09-26 | 1.034292 | 8274336.0 | -8000000 |
| 19 | ADIR3699 | ADIR | UBS | EURUSD | USD | EUR | 2012-07-26 | 2012-09-26 | 1.229099 | 8603693.0 | -7000000 |
| 10 | ADIR3707 | ADIR | GS | AUDUSD | USD | AUD | 2012-08-03 | 2012-09-26 | 1.051216 | 8409728.0 | -8000000 |
| 34 | ADIR3551 | ADIR | JPM | EURUSD | USD | EUR | 2012-03-20 | 2012-10-12 | 1.324263 | 177980900.0 | -134400000 |
| 29 | ADIR3683 | ADIR | CITI | EURUSD | USD | EUR | 2012-07-12 | 2012-09-12 | 1.220399 | 3661197.0 | -3000000 |
| 1 | ADUS813 | ADUS | CITI | EURUSD | USD | EUR | 2012-08-15 | 2012-10-11 | 1.229378 | 614689.0 | -500000 |
| 31 | ADIR3678 | ADIR | CITI | USDSGD | USD | SGD | 2012-07-10 | 2012-09-14 | 1.266689 | 2763109.0 | -3500000 |
| 7 | ADUS808 | ADUS | CITI | USDCHF | CHF | USD | 2012-08-08 | 2012-11-16 | 0.9704 | 61000000.0 | -62860676 |
| 25 | ADUS799 | ADUS | CITI | EURUSD | EUR | USD | 2012-07-17 | 2012-10-11 | 1.228709 | 7000000.0 | -8600963 |


This is usually more helpful if we have more than 100 rows of data, but I think you get the point.
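
Because sample picks rows at random, you get different rows every time you run it. If you want the same "random" rows on every run (handy when you share a notebook), you can pass random_state. The sketch below uses a made-up 100-row dataframe called df_rows as a stand-in for a larger dataset.

```python
import pandas as pd

# Hypothetical stand-in for a larger dataset: 100 rows numbered 0 to 99
df_rows = pd.DataFrame({'row_id': range(100)})

first = df_rows.head(3)   # first three rows
last = df_rows.tail(3)    # last three rows

# The same random_state gives the same sample every time
picked_a = df_rows.sample(5, random_state=42)
picked_b = df_rows.sample(5, random_state=42)

print(first['row_id'].tolist())    # [0, 1, 2]
print(last['row_id'].tolist())     # [97, 98, 99]
print(picked_a.equals(picked_b))   # True
```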

In [125]:
df['Entity'].value_counts()

ADIR    27
ADUS     6
AILP     2
NDBV     1
Name: Entity, dtype: int64

In [126]:
df['Counterparty'].value_counts()

CITI    15
BOA     11
GS       4
UBS      4
RBS      1
JPM      1
Name: Counterparty, dtype: int64

In [127]:
df.describe()

| | Contract_Rate | Buy_Amount | Sell_Amount |
|---|---|---|---|
| count | 36.0 | 36.0 | 36.0 |
| mean | 50.152256 | 10774150.0 | -141338200.0 |
| std | 228.147616 | 30271820.0 | 521936900.0 |
| min | 0.9704 | 614689.0 | -3109871000.0 |
| 25% | 1.178103 | 2339090.0 | -11184070.0 |
| 50% | 1.237097 | 4197145.0 | -5000000.0 |
| 75% | 1.563767 | 7000000.0 | -1951492.0 |
| max | 1371.7 | 177980900.0 | -500000.0 |


In [128]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   ID             36 non-null     object        
 1   Entity         36 non-null     object        
 2   Counterparty   36 non-null     object        
 3   Currency_Pair  36 non-null     object        
 4   Buy_Currency   36 non-null     object        
 5   Sell_Currency  36 non-null     object        
 6   Trade_Date     36 non-null     datetime64[ns]
 7   Delivery_Date  36 non-null     datetime64[ns]
 8   Contract_Rate  36 non-null     float64       
 9   Buy_Amount     36 non-null     float64       
 10  Sell_Amount    36 non-null     int64         
dtypes: datetime64[ns](2), float64(2), int64(1), object(6)
memory usage: 3.2+ KB


## 2. e. Other descriptive attributes

In [129]:
df.columns

Index(['ID', 'Entity', 'Counterparty', 'Currency_Pair', 'Buy_Currency',
       'Sell_Currency', 'Trade_Date', 'Delivery_Date', 'Contract_Rate',
       'Buy_Amount', 'Sell_Amount'],
      dtype='object')

In [130]:
df.shape

(36, 11)

In [131]:
df.index

RangeIndex(start=0, stop=36, step=1)

## 3. Indexing, subsetting, slicing and selecting columns

## 3. a. df.iloc
To reference a particular row, column or group of rows and columns by their numeric position (row or column number), use .iloc[]

In [132]:
# will return the first row, every column
df.iloc[0,:]

ID                          ADIR3725
Entity                          ADIR
Counterparty                     RBS
Currency_Pair                 USDJPY
Buy_Currency                     USD
Sell_Currency                    JPY
Trade_Date       2012-08-15 00:00:00
Delivery_Date    2012-10-10 00:00:00
Contract_Rate                 78.867
Buy_Amount               4.56465e+06
Sell_Amount               -360000000
Name: 0, dtype: object

In [133]:
# will return the first column, every row
df.iloc[:,0]

0     ADIR3725
1      ADUS813
2     ADIR3723
3     ADIR3719
4      ADUS810
5     ADIR3717
6     ADIR3715
7      ADUS808
8     ADIR3709
9     ADIR3708
10    ADIR3707
11     ADUS807
12    ADIR3705
13    ADIR3704
14    ADIR3703
15    ADIR3702
16    ADIR3701
17    ADIR3700
18    ADIR3698
19    ADIR3699
20    ADIR3697
21    ADIR3696
22    ADIR3694
23    ADIR3693
24     ADUS802
25     ADUS799
26    ADIR3687
27    ADIR3686
28    ADIR3685
29    ADIR3683
30     AILP472
31    ADIR3678
32    ADIR3677
33     AILP469
34    ADIR3551
35     NDBV169
Name: ID, dtype: object

In [134]:
# will take a subset of columns and rows
df.iloc[5:10, :4]

| | ID | Entity | Counterparty | Currency_Pair |
|---|---|---|---|---|
| 5 | ADIR3717 | ADIR | CITI | USDSGD |
| 6 | ADIR3715 | ADIR | BOA | USDJPY |
| 7 | ADUS808 | ADUS | CITI | USDCHF |
| 8 | ADIR3709 | ADIR | BOA | USDJPY |
| 9 | ADIR3708 | ADIR | BOA | EURUSD |


##### Note that the index (the very first column on the left) is not considered one of the columns; it is just a row reference.

In [135]:
df.iloc[-1, :]

ID                           NDBV169
Entity                          NDBV
Counterparty                     BOA
Currency_Pair                 EURKRW
Buy_Currency                     EUR
Sell_Currency                    KRW
Trade_Date       2020-03-13 00:00:00
Delivery_Date    2020-03-17 00:00:00
Contract_Rate                 1371.7
Buy_Amount               2.26717e+06
Sell_Amount              -3109871282
Name: 35, dtype: object
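
A few more position-based patterns are sketched below on a tiny made-up dataframe (df_pos and its values are hypothetical). The sketch shows that iloc slices stop before the end position, just like Python lists, and that a list of positions picks out specific rows.

```python
import pandas as pd

# Hypothetical four-row dataframe for illustration
df_pos = pd.DataFrame({'ID': ['a', 'b', 'c', 'd'],
                       'value': [10, 20, 30, 40]})

# Slices stop before the end position: rows 1 and 2, not row 3
print(df_pos.iloc[1:3, 0].tolist())      # ['b', 'c']

# A list picks specific row positions
print(df_pos.iloc[[0, 3], 1].tolist())   # [10, 40]

# Column position 0 is 'ID'; the index is not counted as a column
print(df_pos.columns[0])                 # ID
```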

## 3. b. df.loc

To reference a row, a series of rows or columns by their labels, or to select rows by the values in a column, use .loc[]

In [136]:
df.loc[df['ID']=='ADIR3715']

| | ID | Entity | Counterparty | Currency_Pair | Buy_Currency | Sell_Currency | Trade_Date | Delivery_Date | Contract_Rate | Buy_Amount | Sell_Amount |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | ADIR3715 | ADIR | BOA | USDJPY | USD | JPY | 2012-08-09 | 2012-10-10 | 78.5226 | 3820556.0 | -300000000 |


In [137]:
df.loc[df['Buy_Currency']=='EUR']

| | ID | Entity | Counterparty | Currency_Pair | Buy_Currency | Sell_Currency | Trade_Date | Delivery_Date | Contract_Rate | Buy_Amount | Sell_Amount |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | ADUS810 | ADUS | CITI | EURUSD | EUR | USD | 2012-08-09 | 2012-10-11 | 1.2294 | 1000000.0 | -1229400 |
| 25 | ADUS799 | ADUS | CITI | EURUSD | EUR | USD | 2012-07-17 | 2012-10-11 | 1.228709 | 7000000.0 | -8600963 |
| 35 | NDBV169 | NDBV | BOA | EURKRW | EUR | KRW | 2020-03-13 | 2020-03-17 | 1371.7 | 2267165.77 | -3109871282 |


In [138]:
df.loc[df['Trade_Date']>= '2020-01-01']

| | ID | Entity | Counterparty | Currency_Pair | Buy_Currency | Sell_Currency | Trade_Date | Delivery_Date | Contract_Rate | Buy_Amount | Sell_Amount |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 35 | NDBV169 | NDBV | BOA | EURKRW | EUR | KRW | 2020-03-13 | 2020-03-17 | 1371.7 | 2267165.77 | -3109871282 |
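
The filters above each use a single condition and return every column. As a sketch of how this extends, the example below combines two conditions and keeps only selected columns. It uses a small stand-in dataframe, df_fx, built from four of the rows shown in the head output earlier, so it runs without the Excel file.

```python
import pandas as pd

# Four rows copied from the head() output above
df_fx = pd.DataFrame({
    'ID': ['ADIR3725', 'ADUS813', 'ADIR3723', 'ADUS810'],
    'Entity': ['ADIR', 'ADUS', 'ADIR', 'ADUS'],
    'Buy_Currency': ['USD', 'USD', 'CHF', 'EUR'],
    'Buy_Amount': [4564647.0, 614689.0, 9000000.0, 1000000.0],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
mask = (df_fx['Entity'] == 'ADIR') & (df_fx['Buy_Currency'] == 'USD')

# Rows go before the comma, column labels after it
result = df_fx.loc[mask, ['ID', 'Buy_Amount']]

print(result['ID'].tolist())      # ['ADIR3725']
print(list(result.columns))       # ['ID', 'Buy_Amount']
```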
