### FINA 4380 with Marius Popescu

## Pandas - Part I

https://pandas.pydata.org

#### Pandas can be easily imported as follows:

In [1]:
import pandas as pd

In [2]:
# We will also use Numpy quite often, so I will be importing it in every notebook
import numpy as np

### 1. Pandas Series
A Pandas Series is a one-dimensional array-like object containing a sequence of values (of types similar to NumPy types) and an associated array of data labels, called its *index*. It is conceptually very similar to a NumPy array.

In [3]:
#Creating a series from a list
list1 = [5,10,15]

In [4]:
sr1=pd.Series(list1)
sr1

0     5
1    10
2    15
dtype: int64

Since no index was specified, a default one consisting of the integers 0 through N - 1 (where N is the length of the data) was created.

In [5]:
# Creating a series from a list and using another list as the index
sr2=pd.Series(data = list1, index = ['X','Y','Z'])
sr2

X     5
Y    10
Z    15
dtype: int64
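
As a small illustration of what the index labels are for, the sketch below looks up a value in a labeled Series both by label and by position. The variable names (`prices`, `by_label`, `by_position`) are only chosen for this example.

```python
import pandas as pd

# Same values and labels as sr2 above
prices = pd.Series([5, 10, 15], index=['X', 'Y', 'Z'])

by_label = prices.loc['Y']      # look up by index label
by_position = prices.iloc[1]    # look up by integer position

print(by_label, by_position)    # 10 10
print(list(prices.index))       # ['X', 'Y', 'Z']
```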

In [6]:
# Creating a series from a NumPy array
arr = np.array([5,10,15])
pd.Series(arr)

0     5
1    10
2    15
dtype: int32

In [7]:
# Creating a series from a dictionary
d = {"GOOGL":1100, "AAPL":120, "MSFT": 110, "AMZN":1700}
series1 = pd.Series(d)
series1

GOOGL    1100
AAPL      120
MSFT      110
AMZN     1700
dtype: int64

### 2. Pandas DataFrames
A Pandas DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be of a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dictionary of Pandas Series that share the same index. Each Series represents a named column of the DataFrame.
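
The "dictionary of Series" view can be sketched directly: below, two Series that share the same index are combined into one DataFrame, and selecting a column gives back a Series. The names `roa`, `roe`, and `msft` are illustrative only; the values are the MSFT figures used later in this notebook.

```python
import pandas as pd

# Two Series with the same index (fiscal years)
roa = pd.Series([7.0, 9.1, 9.8], index=[2015, 2016, 2017])
roe = pd.Series([14.4, 22.1, 29.4], index=[2015, 2016, 2017])

# Each dictionary key becomes a column name
msft = pd.DataFrame({'ROA': roa, 'ROE': roe})

# Selecting a single column returns a Series with the shared index
roa_col = msft['ROA']
print(type(roa_col).__name__)   # Series
print(msft.shape)               # (3, 2)
```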

In [8]:
#Creating a DataFrame from a nested list
df1 = pd.DataFrame([[10,20,'True'],[1000,2000,'False']])
df1

      0     1      2
0    10    20   True
1  1000  2000  False


If no index is specified, a default one consisting of the integers 0 through N - 1 (where N is the length of the data) will be created. In addition, default names for the columns are created. However, column names can be specified when the DataFrame is created, using the `columns` parameter.
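
A minimal sketch of the difference between default and user-supplied column names follows. The column names `'A'`, `'B'`, and `'Flag'` are made up for this example.

```python
import pandas as pd

rows = [[10, 20, True], [1000, 2000, False]]

# No columns given: Pandas uses the integers 0, 1, 2 as column names
default_df = pd.DataFrame(rows)

# Column names supplied through the columns parameter
named_df = pd.DataFrame(rows, columns=['A', 'B', 'Flag'])

print(list(default_df.columns))   # [0, 1, 2]
print(list(named_df.columns))     # ['A', 'B', 'Flag']
```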

In [9]:
# Creating a DataFrame from a two-dimensional NumPy randomly-generated array of integers
df2 = pd.DataFrame(np.random.randint(0,10,(2,3)),
                   index = ['Row1','Row2'],
                   columns = ['Col1','Col2','Col3'])
df2

      Col1  Col2  Col3
Row1     5     6     4
Row2     5     8     4


In [10]:
# Creating a DataFrame from a dictionary
data_2017 = {'Ticker': ['MSFT', 'MSFT', 'MSFT', 'AAPL', 'AAPL', 'AAPL', 'AMZN', 'AMZN', 'AMZN'],
        'Year': [2015, 2016, 2017, 2015, 2016, 2017, 2015, 2016, 2017],
        'ROA': [7.0, 9.1, 9.8, 20.5, 14.9, 13.9, 1.0, 3.2, np.nan], 
        'ROE': [14.4, 22.1, 29.4, 46.3, 36.9, 36.9, 4.9, 14.5, 12.9]}

In [11]:
fin_df_2017 = pd.DataFrame(data_2017)
fin_df_2017

  Ticker  Year   ROA   ROE
0   MSFT  2015   7.0  14.4
1   MSFT  2016   9.1  22.1
2   MSFT  2017   9.8  29.4
3   AAPL  2015  20.5  46.3
4   AAPL  2016  14.9  36.9
5   AAPL  2017  13.9  36.9
6   AMZN  2015   1.0   4.9
7   AMZN  2016   3.2  14.5
8   AMZN  2017   NaN  12.9


### 3. Data Input and Output 

Pandas features a number of functions for reading tabular data as a DataFrame object. We will only focus on reading from and writing to CSV.

#### CSV Input

In [12]:
# Reading in only certain columns from a .csv file located in the same directory
# as the notebook. In addition, setting one of the columns as the index
df = pd.read_csv('acctg_data_in.csv',
                usecols=['fyear','tic','at','ceq','ni'],
                index_col='fyear')
#df

#### We can use `obj_name.info()` to present a summary of the DataFrame

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25 entries, 2010 to 2017
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   tic     25 non-null     object
 1   at      25 non-null     int64 
 2   ceq     25 non-null     int64 
 3   ni      25 non-null     int64 
dtypes: int64(3), object(1)
memory usage: 1000.0+ bytes


#### We can use `obj_name.head()` to output the first five elements of a Series or the first five rows of a DataFrame. Passing an integer `n` returns the first `n` instead.

In [14]:
#df.head()

In [15]:
#df.head(0)

#### We can use `obj_name.tail()` to output the last five elements of a Series or the last five rows of a DataFrame. Passing an integer `n` returns the last `n` instead.

In [16]:
#df.tail()

In [17]:
#df.tail(1)
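
The effect of the optional argument to `head()` and `tail()` can be checked on a small Series. This is only a sketch; the Series `s` is created just for this example.

```python
import pandas as pd

s = pd.Series(range(10))       # values 0 through 9

first_five = s.head()          # default: first 5 elements
first_three = s.head(3)        # first 3 elements
last_two = s.tail(2)           # last 2 elements
nothing = s.head(0)            # empty Series (same dtype, no elements)

print(first_three.tolist())    # [0, 1, 2]
print(last_two.tolist())       # [8, 9]
print(len(nothing))            # 0
```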

#### CSV Output

In [18]:
# Writing the DataFrame to a .csv file. Setting index=False keeps the
# Pandas row index out of the output file
fin_df_2017.to_csv('fin_df_2017.csv',index=False)
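
To see what `index=False` changes, the sketch below calls `to_csv()` without a file name, in which case the CSV text is returned as a string rather than written to disk. The two-row DataFrame `small_df` is a cut-down example built for this illustration.

```python
import pandas as pd

small_df = pd.DataFrame({'Ticker': ['MSFT', 'AAPL'], 'ROE': [29.4, 36.9]})

with_index = small_df.to_csv()                # default: the index is written as the first column
without_index = small_df.to_csv(index=False)  # the index is left out

# Compare the header lines of the two outputs
print(with_index.splitlines()[0])      # ,Ticker,ROE
print(without_index.splitlines()[0])   # Ticker,ROE
```

With the default setting, reading the file back with `pd.read_csv` would produce an extra unnamed column holding the old index values.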

### 4. Time Series with Pandas
Any variable that is observed or measured at many points in time forms a time series. Time series data is an important form of structured data in finance. The simplest and most widely used time series are those indexed by timestamp (a specific instant in time).

The short way of reading in and properly formatting a time series *.csv* file:

In [19]:
bank_df = pd.read_csv('bank_data.csv',
                       usecols = ['date','TICKER','RET'],
                       index_col = 'date',
                       parse_dates = True)
bank_df.tail(3)

           TICKER       RET
date
2018-12-27      C  0.006415
2018-12-28      C  0.001159
2018-12-31      C  0.004438
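
For comparison, here is one possible "long way" that should give the same index: read the file with `date` as an ordinary column, convert it with `pd.to_datetime`, and then set it as the index. The sketch uses an in-memory string holding the three rows shown above in place of `bank_data.csv`, and the names `short_way` and `long_way` are chosen for this example.

```python
import io
import pandas as pd

# Three rows in the same layout as the output above
csv_text = """date,TICKER,RET
2018-12-27,C,0.006415
2018-12-28,C,0.001159
2018-12-31,C,0.004438
"""

# Short way: parse the index column as dates while reading
short_way = pd.read_csv(io.StringIO(csv_text), index_col='date', parse_dates=True)

# Long way: read first, then convert and set the index in separate steps
long_way = pd.read_csv(io.StringIO(csv_text))
long_way['date'] = pd.to_datetime(long_way['date'])
long_way = long_way.set_index('date')

print(type(short_way.index).__name__)            # DatetimeIndex
print(short_way.index.equals(long_way.index))    # True
```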


#### We can get more information about the DataFrame Index, using the `df_name.index` attribute

In [20]:
bank_df.index

DatetimeIndex(['2017-01-03', '2017-01-04', '2017-01-05', '2017-01-06',
               '2017-01-09', '2017-01-10', '2017-01-11', '2017-01-12',
               '2017-01-13', '2017-01-17',
               ...
               '2018-12-17', '2018-12-18', '2018-12-19', '2018-12-20',
               '2018-12-21', '2018-12-24', '2018-12-26', '2018-12-27',
               '2018-12-28', '2018-12-31'],
              dtype='datetime64[ns]', name='date', length=2510, freq=None)

### 5. Setting and Resetting the Index in a DataFrame

#### We can reset the index to the default by using the `df_name.reset_index()` method. The reset is not permanent (not in place): the method returns a new DataFrame and leaves the original unchanged, as seen below.

In [21]:
#bank_df.reset_index()
bank_df.reset_index().head(3)

         date TICKER        RET
0  2017-01-03    WFC    0.01615
1  2017-01-04    WFC   0.000893
2  2017-01-05    WFC  -0.015522


In [22]:
#bank_df.head(3)
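
The "not in place" behavior can be verified on a small example: after calling `reset_index()`, the original DataFrame still has its old index unless the result is assigned to a name. The DataFrame `ret_df` below is a two-row stand-in built for this sketch.

```python
import pandas as pd

ret_df = pd.DataFrame({'RET': [0.01615, 0.000893]},
                      index=pd.Index(['2017-01-03', '2017-01-04'], name='date'))

flat = ret_df.reset_index()     # returns a new DataFrame

print(ret_df.index.name)        # date  (original is unchanged)
print(list(flat.columns))       # ['date', 'RET']
print(list(flat.index))         # [0, 1]

# To keep the change, assign the result to a name
ret_df = ret_df.reset_index()
```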

#### We can change the index from the default to a column of our choice, using the `df_name.set_index()` method. The change is not permanent (not in place): the method returns a new DataFrame and leaves the original unchanged, as seen below.

In [23]:
#bank_df.reset_index().set_index('TICKER')
bank_df.reset_index().set_index('TICKER').head(3)

              date        RET
TICKER
WFC     2017-01-03    0.01615
WFC     2017-01-04   0.000893
WFC     2017-01-05  -0.015522


In [24]:
#bank_df.head(3)
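
A short sketch of the same point for `set_index()`: the chosen column moves into the index of the returned DataFrame, while the original keeps its default index. The DataFrame `flat_df` is a small example made for this illustration.

```python
import pandas as pd

flat_df = pd.DataFrame({'TICKER': ['WFC', 'WFC'],
                        'RET': [0.01615, 0.000893]})

by_ticker = flat_df.set_index('TICKER')   # returns a new DataFrame

print(list(flat_df.columns))     # ['TICKER', 'RET']  (original is unchanged)
print(list(by_ticker.columns))   # ['RET']
print(by_ticker.index.name)      # TICKER
```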

#### We can also use the `set_index()` method to create a multi-index for a DataFrame.

In [25]:
bank_df2 = bank_df.reset_index().set_index(['TICKER','date'])
bank_df2.head(3)

                          RET
TICKER date
WFC    2017-01-03     0.01615
WFC    2017-01-04    0.000893
WFC    2017-01-05   -0.015522


In [26]:
#bank_df2.index

#### Additional details on the multi-index can also be found as follows:

In [27]:
#bank_df2.index.names

In [28]:
#bank_df2.index.levels
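
As a rough guide to what these attributes return, the sketch below builds a four-row multi-index DataFrame from rows shown earlier in this notebook and inspects `names`, `nlevels`, and `levels`. The name `mi_df` is chosen for this example.

```python
import pandas as pd

rows = pd.DataFrame({
    'TICKER': ['WFC', 'WFC', 'C', 'C'],
    'date': pd.to_datetime(['2017-01-03', '2017-01-04', '2018-12-27', '2018-12-28']),
    'RET': [0.01615, 0.000893, 0.006415, 0.001159],
})

mi_df = rows.set_index(['TICKER', 'date'])

print(list(mi_df.index.names))       # ['TICKER', 'date']
print(mi_df.index.nlevels)           # 2
# levels holds the sorted unique labels of each level
print(list(mi_df.index.levels[0]))   # ['C', 'WFC']
print(len(mi_df.index.levels[1]))    # 4
```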