### Pandas

* Pandas was created by Wes McKinney, who began developing it while working at AQR.
* .loc selects rows by label (the values in the index).  In this case, the labels are just 0, 1, 2, ...
* .iloc selects rows by integer position.  It works like indexing a list, starting at 0.
* Because the row labels here are just 0, 1, 2, ..., .loc and .iloc behave almost identically.  The one difference is in slicing: .iloc slices like a list, stopping before the end position, whereas .loc includes the end label.
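
The slicing difference can be sketched on a small hand-built frame. The column name `x` and the letter labels below are made up for illustration; they are not part of the tips data.

```python
import pandas as pd

# Toy frame with the default integer labels 0, 1, 2, ...
df = pd.DataFrame({"x": [10, 20, 30, 40, 50]})

by_label = df.loc[1:3]      # labels 1, 2, 3 (end label included)
by_position = df.iloc[1:3]  # positions 1 and 2 (end position excluded)

# After relabeling the rows, .loc looks up the new labels
# while .iloc still counts positions.
df2 = df.copy()
df2.index = ["a", "b", "c", "d", "e"]
by_label2 = df2.loc["b":"d"]   # rows labeled b, c, d
by_position2 = df2.iloc[1:3]   # rows at positions 1 and 2 (labeled b and c)

print(len(by_label), len(by_position))
```

Once the labels are no longer 0, 1, 2, ... (for example after sorting), .loc and .iloc generally select different rows, so it helps to pick the one that matches the intent.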

In [108]:
tips = load_dataset("tips")
# tips.info()
# tips.describe()
# tips.dtypes
# tips.head()
# tips.tail()
# tips.columns
# tips.index
# tips.day.unique()
# tips.loc[0]
# tips.loc[3:6]
# tips.iloc[0]
# tips.iloc[3:6]
# tips.iloc[3:10:2]
# tips.iloc[-1]
# tips.iloc[-4:]
# tips['tip']
# tips.tip
# tips[['total_bill', 'tip']]
# tips[['total_bill', 'tip']].loc[3:7]
# tips.loc[3:7][['total_bill', 'tip']]
# tips.to_dict()
# tips.to_dict('records')
# tips[['total_bill', 'tip']].to_numpy()

In [109]:
tips2 = tips.copy()
tips2.columns = ['new_' + c for c in tips2.columns]
tips2.head(3)

,new_total_bill,new_tip,new_sex,new_smoker,new_day,new_time,new_size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3


In [110]:
tips2 = tips2.rename(columns={'new_smoker': 'new_new_smoker'})
tips2.head(3)

,new_total_bill,new_tip,new_sex,new_new_smoker,new_day,new_time,new_size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3


### Sorting

In [111]:
tips = tips.sort_values(by=['sex', 'total_bill'])
tips.head()

,total_bill,tip,sex,smoker,day,time,size
172,7.25,5.15,Male,Yes,Sun,Dinner,2
149,7.51,2.0,Male,No,Thur,Lunch,2
195,7.56,1.44,Male,No,Thur,Lunch,2
218,7.74,1.44,Male,Yes,Sat,Dinner,2
126,8.52,1.48,Male,No,Thur,Lunch,2


### Inserting columns

We can add new rows and columns, but adding columns is far more common.  Operations on columns are element-wise, as in NumPy.

In [112]:
tips['pct'] = tips.tip / tips.total_bill
tips['day_type'] = tips.day.map(lambda x: 'Weekend' if x in ['Sat', 'Sun'] else 'Weekday')
tips.head(3)

,total_bill,tip,sex,smoker,day,time,size,pct,day_type
172,7.25,5.15,Male,Yes,Sun,Dinner,2,0.710345,Weekend
149,7.51,2.0,Male,No,Thur,Lunch,2,0.266312,Weekday
195,7.56,1.44,Male,No,Thur,Lunch,2,0.190476,Weekday


### Filtering

In [113]:
# tips[tips.sex=="Male"].head()
# tips[(tips.sex=="Male") & (tips.pct>0.2)].head()
# tips[(tips.sex=="Male") | (tips.pct>0.2)].head()


### Aggregating

There are built-in aggregation methods, and you can apply your own function with apply.  A lambda expression is often the most convenient way to define a short function inline, while def is better suited to longer functions.

In [114]:
tips2 = tips[['total_bill', 'tip']]

# tips2.sum()
# tips[['total_bill', 'tip']].sum()
# tips.total_bill.sum()
# tips['total_bill'].sum()
# tips2.mean()
# tips2.std()
# tips2.corr()
# tips2.cov()
# tips2.median()
# tips2.quantile([0.25, 0.5, 0.75])
# tips2.sum(axis=1)
tips2.apply(lambda x: (x**2).sum())
# def sumsquares(x):
#     return (x**2).sum()
# tips2.apply(sumsquares)


total_bill    114780.4443
tip             2658.6932
dtype: float64

### Aggregating by groups

In empirical asset pricing, we are constantly either grouping by stock and doing something to each time series of stock data, or we are grouping by date and doing something to each cross-section.  

In [115]:
tips.head()

,total_bill,tip,sex,smoker,day,time,size,pct,day_type
172,7.25,5.15,Male,Yes,Sun,Dinner,2,0.710345,Weekend
149,7.51,2.0,Male,No,Thur,Lunch,2,0.266312,Weekday
195,7.56,1.44,Male,No,Thur,Lunch,2,0.190476,Weekday
218,7.74,1.44,Male,Yes,Sat,Dinner,2,0.186047,Weekend
126,8.52,1.48,Male,No,Thur,Lunch,2,0.173709,Weekday


In [116]:
# tips.groupby('sex').total_bill.mean()
# tips.groupby(['sex', 'time']).total_bill.mean()
# tips.groupby(['sex']).total_bill.apply(lambda x: (x**2).sum())
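
The two grouping patterns from asset pricing can be sketched on a toy panel. This is only an illustration: the tickers `AAA` and `BBB`, the column names, and the returns are all made up.

```python
import pandas as pd

# Hypothetical long-form panel: one row per (date, ticker) pair.
panel = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-31", "2020-01-31", "2020-02-29", "2020-02-29"]),
    "ticker": ["AAA", "BBB", "AAA", "BBB"],
    "ret": [0.01, 0.03, -0.02, 0.04],
})

# Time-series operation: group by stock, then average each stock's returns over time.
mean_by_stock = panel.groupby("ticker")["ret"].mean()

# Cross-sectional operation: group by date, then average across stocks on each date.
mean_by_date = panel.groupby("date")["ret"].mean()

print(mean_by_stock)
print(mean_by_date)
```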

### Transform

When we aggregate, we get a lower-dimensional object - for example, just one number for each column.  With transform, we get an object of the same dimension we started with, which is useful when we want to paste it into the original object.  This new object will repeat the aggregate in order to be of the same dimension as the original.  This is useful, for example, if we want to include a group mean as a characteristic in a model.

In [117]:
tips['total_by_sex'] = tips.groupby('sex').total_bill.transform(lambda x: x.mean())
tips.head()

,total_bill,tip,sex,smoker,day,time,size,pct,day_type,total_by_sex
172,7.25,5.15,Male,Yes,Sun,Dinner,2,0.710345,Weekend,20.744076
149,7.51,2.0,Male,No,Thur,Lunch,2,0.266312,Weekday,20.744076
195,7.56,1.44,Male,No,Thur,Lunch,2,0.190476,Weekday,20.744076
218,7.74,1.44,Male,Yes,Sat,Dinner,2,0.186047,Weekend,20.744076
126,8.52,1.48,Male,No,Thur,Lunch,2,0.173709,Weekday,20.744076
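
The difference in shape between aggregating and transforming is easiest to see on a tiny frame. The names `g` and `x` below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "x": [1.0, 3.0, 10.0]})

# Aggregation: one value per group, indexed by the group key.
agg = df.groupby("g")["x"].mean()

# Transform: one value per original row, indexed like df.
same_shape = df.groupby("g")["x"].transform("mean")

# Because the index matches, the result can be stored as a new column.
df["group_mean"] = same_shape

print(agg.shape, same_shape.shape)
```

Passing the string "mean" is a shorthand for the built-in aggregation; a lambda such as `lambda x: x.mean()` gives the same values.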


### Demeaning by group

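Demeaning is not an aggregation: it returns one value per row rather than one per group.  Either transform or apply can compute it.  transform always returns a result aligned to the original index, so it can be assigned directly as a new column.  With apply, pandas 2.0 and later prepend the group keys to the index by default, so pass group_keys=False to groupby if you use apply.

In [118]:
tips['total_dev_from_mean_by_sex'] = tips.groupby('sex').total_bill.transform(lambda x: x - x.mean())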
tips.head()

,total_bill,tip,sex,smoker,day,time,size,pct,day_type,total_by_sex,total_dev_from_mean_by_sex
172,7.25,5.15,Male,Yes,Sun,Dinner,2,0.710345,Weekend,20.744076,-13.494076
149,7.51,2.0,Male,No,Thur,Lunch,2,0.266312,Weekday,20.744076,-13.234076
195,7.56,1.44,Male,No,Thur,Lunch,2,0.190476,Weekday,20.744076,-13.184076
218,7.74,1.44,Male,Yes,Sat,Dinner,2,0.186047,Weekend,20.744076,-13.004076
126,8.52,1.48,Male,No,Thur,Lunch,2,0.173709,Weekday,20.744076,-12.224076
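
A quick way to check a group demeaning is to confirm that the deviations sum to zero within each group. The sketch below uses a made-up frame (names `g` and `x` are hypothetical). It computes the deviations with transform, and also with apply under the assumption that group_keys=False is passed so the original row labels are kept.

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "b", "a", "b"], "x": [1.0, 10.0, 3.0, 20.0]})

# transform returns a result indexed like df, so it can be assigned directly.
df["x_dev"] = df.groupby("g")["x"].transform(lambda s: s - s.mean())

# apply with group_keys=False keeps the original row labels as the index.
via_apply = df.groupby("g", group_keys=False)["x"].apply(lambda s: s - s.mean())

# Sanity check: deviations sum to zero within each group.
check = df.groupby("g")["x_dev"].sum()
print(check)
```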


### Index and reset index

In [119]:
avgs = tips.groupby(['sex', 'time']).total_bill.mean()
# avgs
# avgs.loc[('Male', 'Dinner')]
# avgs.index
# avgs.reset_index()
# avgs.reset_index().set_index(['sex', 'time'])

### Wide versus long form

In [120]:
# avgs.unstack()
# avgs.unstack().stack()

### Dataframes versus series

* Pandas provides two classes: DataFrame and Series.  
* A series is like a one column dataframe but not quite.  There are occasionally dataframe methods that are not available for a series.  
* We can convert a series to a dataframe with pd.DataFrame.
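
As a small sketch of the distinction, selecting one column with single brackets gives a Series, while double brackets give a one-column DataFrame. The frame below is made up for illustration; `to_frame` is the Series method that does the same conversion as pd.DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"tip": [1.0, 2.5], "size": [2, 3]})

s = df["tip"]             # single brackets: a Series
one_col = df[["tip"]]     # double brackets: a one-column DataFrame
converted = s.to_frame()  # Series -> DataFrame

print(type(s).__name__, s.shape)              # a Series is one-dimensional
print(type(one_col).__name__, one_col.shape)  # a DataFrame is two-dimensional
```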

In [121]:
# type(tips)
# type(tips.tip)
# type(avgs)
# type(avgs.unstack())
# type(pd.DataFrame(tips.tip))

### Pandas data reader

In [122]:
treasury10 = pdr('DGS10', 'fred', start=1980)
treasury30 = pdr('DGS30', 'fred', start=1980)
both = pdr(['DGS10', 'DGS30'], 'fred', start=1980)

# treasury10.head()
# treasury30.head()
# both.head()
# treasury10.index
# type(treasury10)
# np.log(1+both/100)

### Merging dataframes

* Merge, join, and concatenate are different methods for combining dataframes.  
* Merge provides the finest control.

In [123]:
both1 = treasury10.merge(treasury30, left_index=True, right_index=True)
both2 = treasury10.merge(treasury30, on='DATE')
both3 = treasury10.join(treasury30)
both4 = pd.concat((treasury10, treasury30), axis=1)
[both.equals(b) for b in [both1, both2, both3, both4]]


[True, True, True, True]
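
One example of the control merge offers is the choice of which keys to keep. The sketch below uses two made-up frames whose keys only partly overlap. `how` and `indicator` are parameters of merge; the frame and column names are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "a": ["x", "y", "z"]})
right = pd.DataFrame({"key": [2, 3, 4], "b": [20, 30, 40]})

# Default (inner) merge keeps only keys present in both frames.
inner = left.merge(right, on="key")

# Outer merge keeps every key; indicator=True adds a _merge column
# recording whether each row came from the left, the right, or both.
outer = left.merge(right, on="key", how="outer", indicator=True)

print(inner)
print(outer)
```

The treasury series above share the same dates, which is why all four approaches there give identical results.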

### Missing data

* Missing values are recorded as NaN (not a number).  
* We can drop them or fill them.  
* We can fill with a specific value or fill from the previous entry or the next entry.

In [124]:
# both.isna()
# both.dropna()
# both.dropna(subset=['DGS10'])
# both.fillna(0)
# both.bfill()
# both.ffill()
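
The effect of each option is easy to see on a short made-up series with a gap in the middle:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

dropped = s.dropna()       # keeps only 1.0 and 4.0
zero_filled = s.fillna(0)  # 1, 0, 0, 4
forward = s.ffill()        # 1, 1, 1, 4: carry the previous value forward
backward = s.bfill()       # 1, 4, 4, 4: pull the next value back

print(forward.tolist(), backward.tolist())
```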

### Working with time series



In [125]:
# both.shift().head()
# both.shift(2).head()
# both.shift(-1).head()
# both.diff().head()
# both.diff(2).head()
# both.pct_change()
# both.rolling(5).mean()
# both.rolling(5).std()
# both.resample('M').last()
# both.resample('MS').first()
# both.resample('M').mean()

When pdr returns daily data, the index is a pandas DatetimeIndex whose elements are Timestamp objects, a subclass of the datetime class in Python's standard datetime module.  Pandas implements many of the datetime methods and attributes directly on the index.  strftime formats dates as strings in many different ways.  Its inverse is strptime, which parses date strings in a given format into datetime objects; for a whole column or index of strings, pd.to_datetime does the same job.

In [126]:
# both.index.dtype
# [x.year for x in both.index]
# both.index.map(lambda x: x.month)
# both.index.astype(str)
# both.index.strftime("%b %d, %Y")
# both.resample('M').last().index.to_period('M')
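
A minimal round trip between dates and strings, on a made-up three-day index. The format string is chosen arbitrarily for illustration.

```python
from datetime import datetime

import pandas as pd

idx = pd.date_range("2020-01-30", periods=3, freq="D")

# DatetimeIndex -> strings
as_text = idx.strftime("%Y/%m/%d")

# strings -> DatetimeIndex, using the same format
parsed = pd.to_datetime(as_text, format="%Y/%m/%d")

# For a single string, the standard library's strptime does the parsing.
one = datetime.strptime("2020/01/30", "%Y/%m/%d")

print(as_text.tolist())
```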

### Upsampling

We can map a time series onto a higher-frequency index with reindex.  This creates NaNs at the new dates, which we might fill with the most recent valid value using ffill (forward filling the time series).  Downsampling is done with resample as shown above.

In [127]:
monthly = both.resample('MS').first()
min_date = monthly.index.min()
max_date = monthly.index.max()
new_index = pd.date_range(start=min_date, end=max_date, freq="D")
monthly = monthly.reindex(new_index).ffill()
monthly.head()

,DGS10,DGS30
1980-01-01,10.5,10.23
1980-01-02,10.5,10.23
1980-01-03,10.5,10.23
1980-01-04,10.5,10.23
1980-01-05,10.5,10.23
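
As an alternative sketch, resample can also upsample: resampling a month-start series to daily frequency and forward filling should give the same result as the reindex approach. The frame below is made up, with a hypothetical column `rate`.

```python
import pandas as pd

monthly = pd.DataFrame(
    {"rate": [1.0, 2.0]},
    index=pd.date_range("2020-01-01", periods=2, freq="MS"),
)

# Approach from the text: build a daily index, reindex, then forward fill.
daily_index = pd.date_range(monthly.index.min(), monthly.index.max(), freq="D")
via_reindex = monthly.reindex(daily_index).ffill()

# Alternative: resample to daily frequency and forward fill in one chain.
via_resample = monthly.resample("D").ffill()

print(len(via_reindex), len(via_resample))
```

Both versions stop at the last date in the monthly index, so the final month is represented by a single day unless the range is extended.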


### Saving and reading dataframes

* To save or read files in Colab, you need to "mount" your Google Drive.  Click on the file icon in the left toolbar and then click the Google Drive icon.
* pandas has read_csv, read_stata, read_sas, and read_excel functions.

In [128]:
both.to_csv('filename.csv')
newboth = pd.read_csv('filename.csv', parse_dates=['DATE'])
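
Note that reading the file back with only parse_dates=['DATE'] leaves DATE as an ordinary column. If the goal is to restore the dates as the index, one option is to pass index_col as well. The sketch below uses a made-up frame and an in-memory buffer in place of a file on disk.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {"y": [1.0, 2.0]},
    index=pd.DatetimeIndex(["1980-01-02", "1980-01-03"], name="DATE"),
)

buffer = io.StringIO()  # in-memory stand-in for a file on disk
df.to_csv(buffer)
buffer.seek(0)

# index_col restores DATE as the index; parse_dates=True parses the index as dates.
restored = pd.read_csv(buffer, index_col="DATE", parse_dates=True)

print(restored.index)
```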