# Fund Selection

## Indicators

- Filters (Data Preprocessing)
- Industry Classification (correlation to sector ETFs)
- Value & Size Classification (correlation to Value and Size ETFs)
- Performance (mean of funds' return as benchmark)
- Volatility
- Survival Time
- Risk
- Inflow
- Turnover


In [28]:
import pandas as pd
import numpy as np
import datetime

In [2]:
funds = pd.read_csv('data/13f_filings_fractions_new.csv')
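
Since `iCUSIP` values can begin with zeros and the two date columns are used for filtering later on, it may help to pin the dtypes at load time. The following is only a sketch: it reads a small in-memory CSV instead of the real file, and the `dtype` and `parse_dates` arguments are assumptions, not what this notebook actually ran.

```python
import io
import pandas as pd

# Two illustrative rows standing in for the real CSV file.
sample_csv = io.StringIO(
    "iRECORD_ID,iCIK,iCUSIP,iPERIOD_END,iFILING_DATE,iQTY,iMARKET_VALUE\n"
    "4443748,216235,055622104,1999-09-30,2000-01-03,28287.0,3134550.0\n"
    "4443749,216235,020039103,1999-09-30,2000-01-03,79199.0,5690620.0\n"
)

sample = pd.read_csv(
    sample_csv,
    dtype={"iCUSIP": str},                        # keep leading zeros in CUSIPs
    parse_dates=["iPERIOD_END", "iFILING_DATE"],  # parse the dates while loading
)
print(sample.dtypes)
```

With the dates parsed on load, the later `pd.to_datetime` conversions would become unnecessary.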

In [3]:
funds.head()

Unnamed: 0,iRECORD_ID,iCIK,iCUSIP,iPERIOD_END,iFILING_DATE,iAMEND,iRESTATEMENT,iTYPE,iQTY,iMARKET_VALUE,iLONG_FRACTION
0,4443748,216235,055622104,1999-09-30,2000-01-03,0,0,0,28287.0,3134550.0,0.009282
1,4443749,216235,020039103,1999-09-30,2000-01-03,0,0,0,79199.0,5690620.0,0.016851
2,4443750,216235,713448108,1999-09-30,2000-01-03,0,0,0,186140.0,5795950.0,0.017163
3,4443751,216235,002824100,1999-09-30,2000-01-03,0,0,0,169555.0,6290420.0,0.018627
4,4443752,216235,05964H105,1999-09-30,2000-01-03,0,0,0,16897.0,174250.0,0.000516


In [4]:
funds_copy = funds.copy()

## Filters (Data Preprocessing)

### Principle

- filter out duplicate rows
- filter out rows where iQTY is zero
- filter out rows where iMARKET_VALUE is zero
- filter out rows where iFILING_DATE is before June 30, 2013, because electronic filing of the form was not required before that date
- filter out filings that are amendments or restatements
- keep only filings completed within 45 days of the end of the quarter
- limit the number of holdings per fund
- filter on the market value of each fund
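
The row-level rules above can be sketched as a single function. This is a minimal illustration on a toy table, not the notebook's actual pipeline: the function name `apply_filters` and the toy rows are hypothetical, and the last two rules (holdings count and fund market value) are left out because no thresholds are given.

```python
import pandas as pd

def apply_filters(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the row-level filters to a 13F holdings table (hypothetical helper)."""
    out = df.copy()
    out["iPERIOD_END"] = pd.to_datetime(out["iPERIOD_END"])
    out["iFILING_DATE"] = pd.to_datetime(out["iFILING_DATE"])
    # Keep original filings only (no amendments).
    out = out[out["iAMEND"] == 0]
    # Drop positions with zero shares or zero market value.
    out = out[(out["iQTY"] != 0) & (out["iMARKET_VALUE"] != 0)]
    # Drop exact duplicate rows.
    out = out.drop_duplicates()
    # Keep filings dated on or after June 30, 2013.
    out = out[out["iFILING_DATE"] >= pd.Timestamp(2013, 6, 30)]
    # Keep filings made within 45 days of the quarter end.
    interval = out["iFILING_DATE"] - out["iPERIOD_END"]
    return out[interval <= pd.Timedelta(days=45)]

toy = pd.DataFrame({
    "iCIK":          [1, 2, 3, 4, 5, 1],
    "iCUSIP":        ["A", "B", "C", "D", "E", "A"],
    "iPERIOD_END":   ["2013-06-30", "2013-06-30", "2013-06-30",
                      "2012-06-30", "2013-06-30", "2013-06-30"],
    "iFILING_DATE":  ["2013-08-01", "2013-08-01", "2013-08-01",
                      "2012-08-01", "2013-09-30", "2013-08-01"],
    "iAMEND":        [0, 1, 0, 0, 0, 0],
    "iQTY":          [100.0, 100.0, 0.0, 100.0, 100.0, 100.0],
    "iMARKET_VALUE": [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0],
})

filtered = apply_filters(toy)
print(filtered)
```

In the toy table, fund 2 is an amendment, fund 3 has zero shares, fund 4 filed before 2013, fund 5 filed 92 days after the quarter end, and the last row duplicates the first, so only one row for fund 1 should remain.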

In [5]:
len(funds.iCIK.unique())

9540

### Restatement, Amendments, Type

Open question: what does iTYPE represent? The cells below inspect the values of iAMEND, iRESTATEMENT, and iTYPE.

In [6]:
funds.iAMEND.unique()

array([0, 1], dtype=int64)

In [7]:
funds.iRESTATEMENT.unique()

array([0, 1], dtype=int64)

In [8]:
funds.iTYPE.unique()

array([0, 1, 3, 2, 5, 4], dtype=int64)

In [9]:
funds.iTYPE.value_counts()

0    56703261
1      398090
2      289728
3      177434
4         387
5         336
Name: iTYPE, dtype: int64
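
One way to probe what `iTYPE` might encode is to cross-tabulate it against the two flags whose meaning is clearer. The sketch below uses made-up values purely to show the mechanics; the counts are not from the dataset.

```python
import pandas as pd

# Hypothetical values, only to illustrate the cross-tabulation.
toy = pd.DataFrame({
    "iTYPE":  [0, 0, 0, 1, 1, 2, 3],
    "iAMEND": [0, 0, 1, 0, 1, 0, 0],
})

# Rows are iTYPE codes, columns are iAMEND values, cells are row counts.
table = pd.crosstab(toy["iTYPE"], toy["iAMEND"])
print(table)
```

If some `iTYPE` codes on the real data only ever appeared with `iAMEND == 1`, that would hint that they describe kinds of amendments, but that remains to be checked.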

In [10]:
## keep only original filings: no amendments, no restatements, and iTYPE == 0
funds = funds[funds['iAMEND']== 0]
funds = funds[funds['iRESTATEMENT']== 0]
funds = funds[funds['iTYPE']== 0]
funds

Unnamed: 0,iRECORD_ID,iCIK,iCUSIP,iPERIOD_END,iFILING_DATE,iAMEND,iRESTATEMENT,iTYPE,iQTY,iMARKET_VALUE,iLONG_FRACTION
0,4443748,216235,055622104,1999-09-30,2000-01-03,0,0,0,28287.0,3134550.0,0.009282
1,4443749,216235,020039103,1999-09-30,2000-01-03,0,0,0,79199.0,5690620.0,0.016851
2,4443750,216235,713448108,1999-09-30,2000-01-03,0,0,0,186140.0,5795950.0,0.017163
3,4443751,216235,002824100,1999-09-30,2000-01-03,0,0,0,169555.0,6290420.0,0.018627
4,4443752,216235,05964H105,1999-09-30,2000-01-03,0,0,0,16897.0,174250.0,0.000516
...,...,...,...,...,...,...,...,...,...,...,...
57569231,74561590,1730479,039483102,2020-03-31,2020-04-23,0,0,0,300.0,11000.0,0.000086
57569232,74561591,1730479,88160R101,2020-03-31,2020-04-23,0,0,0,25.0,13000.0,0.000102
57569233,74561592,1730479,921937793,2020-03-31,2020-04-23,0,0,0,230.0,25000.0,0.000196
57569234,74561593,1730479,863667101,2020-03-31,2020-04-23,0,0,0,1030.0,171000.0,0.001341


In [11]:
## the three flag columns are constant after filtering, so drop them
funds = funds.drop(columns=['iAMEND', 'iRESTATEMENT', 'iTYPE'])
funds

Unnamed: 0,iRECORD_ID,iCIK,iCUSIP,iPERIOD_END,iFILING_DATE,iQTY,iMARKET_VALUE,iLONG_FRACTION
0,4443748,216235,055622104,1999-09-30,2000-01-03,28287.0,3134550.0,0.009282
1,4443749,216235,020039103,1999-09-30,2000-01-03,79199.0,5690620.0,0.016851
2,4443750,216235,713448108,1999-09-30,2000-01-03,186140.0,5795950.0,0.017163
3,4443751,216235,002824100,1999-09-30,2000-01-03,169555.0,6290420.0,0.018627
4,4443752,216235,05964H105,1999-09-30,2000-01-03,16897.0,174250.0,0.000516
...,...,...,...,...,...,...,...,...
57569231,74561590,1730479,039483102,2020-03-31,2020-04-23,300.0,11000.0,0.000086
57569232,74561591,1730479,88160R101,2020-03-31,2020-04-23,25.0,13000.0,0.000102
57569233,74561592,1730479,921937793,2020-03-31,2020-04-23,230.0,25000.0,0.000196
57569234,74561593,1730479,863667101,2020-03-31,2020-04-23,1030.0,171000.0,0.001341


In [12]:
## drop records wherein number of shares is zero 
## or total amount of money invested is zero

funds = funds[funds['iMARKET_VALUE']!=0]
funds = funds[funds['iQTY'] != 0]
funds

Unnamed: 0,iRECORD_ID,iCIK,iCUSIP,iPERIOD_END,iFILING_DATE,iQTY,iMARKET_VALUE,iLONG_FRACTION
0,4443748,216235,055622104,1999-09-30,2000-01-03,28287.0,3134550.0,0.009282
1,4443749,216235,020039103,1999-09-30,2000-01-03,79199.0,5690620.0,0.016851
2,4443750,216235,713448108,1999-09-30,2000-01-03,186140.0,5795950.0,0.017163
3,4443751,216235,002824100,1999-09-30,2000-01-03,169555.0,6290420.0,0.018627
4,4443752,216235,05964H105,1999-09-30,2000-01-03,16897.0,174250.0,0.000516
...,...,...,...,...,...,...,...,...
57569231,74561590,1730479,039483102,2020-03-31,2020-04-23,300.0,11000.0,0.000086
57569232,74561591,1730479,88160R101,2020-03-31,2020-04-23,25.0,13000.0,0.000102
57569233,74561592,1730479,921937793,2020-03-31,2020-04-23,230.0,25000.0,0.000196
57569234,74561593,1730479,863667101,2020-03-31,2020-04-23,1030.0,171000.0,0.001341


In [13]:
## drop duplicate rows
funds = funds.drop_duplicates()
funds

Unnamed: 0,iRECORD_ID,iCIK,iCUSIP,iPERIOD_END,iFILING_DATE,iQTY,iMARKET_VALUE,iLONG_FRACTION
0,4443748,216235,055622104,1999-09-30,2000-01-03,28287.0,3134550.0,0.009282
1,4443749,216235,020039103,1999-09-30,2000-01-03,79199.0,5690620.0,0.016851
2,4443750,216235,713448108,1999-09-30,2000-01-03,186140.0,5795950.0,0.017163
3,4443751,216235,002824100,1999-09-30,2000-01-03,169555.0,6290420.0,0.018627
4,4443752,216235,05964H105,1999-09-30,2000-01-03,16897.0,174250.0,0.000516
...,...,...,...,...,...,...,...,...
57569231,74561590,1730479,039483102,2020-03-31,2020-04-23,300.0,11000.0,0.000086
57569232,74561591,1730479,88160R101,2020-03-31,2020-04-23,25.0,13000.0,0.000102
57569233,74561592,1730479,921937793,2020-03-31,2020-04-23,230.0,25000.0,0.000196
57569234,74561593,1730479,863667101,2020-03-31,2020-04-23,1030.0,171000.0,0.001341
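
Note that `drop_duplicates()` with no arguments compares every column, including `iRECORD_ID`. If that ID is unique per row (an assumption that is not verified here), no two rows can match exactly. A sketch of checking for duplicates on a hypothetical key of fund, security, and period instead:

```python
import pandas as pd

toy = pd.DataFrame({
    "iRECORD_ID":  [1, 2, 3],
    "iCIK":        [216235, 216235, 216235],
    "iCUSIP":      ["055622104", "055622104", "020039103"],
    "iPERIOD_END": ["1999-09-30", "1999-09-30", "1999-09-30"],
    "iQTY":        [28287.0, 28287.0, 79199.0],
})

# Hypothetical key: one row per fund, security, and reporting period.
key = ["iCIK", "iCUSIP", "iPERIOD_END"]

full_row_dupes = int(toy.duplicated().sum())       # compares all columns
key_dupes = int(toy.duplicated(subset=key).sum())  # compares key columns only
deduped = toy.drop_duplicates(subset=key, keep="first")
print(full_row_dupes, key_dupes, len(deduped))
```

Whether repeated key rows are true duplicates is a judgment call: a fund may legitimately report the same CUSIP more than once in a period, so inspecting such rows before dropping them seems prudent.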


In [14]:
funds.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51482411 entries, 0 to 57569235
Data columns (total 8 columns):
 #   Column          Dtype  
---  ------          -----  
 0   iRECORD_ID      int64  
 1   iCIK            int64  
 2   iCUSIP          object 
 3   iPERIOD_END     object 
 4   iFILING_DATE    object 
 5   iQTY            float64
 6   iMARKET_VALUE   float64
 7   iLONG_FRACTION  float64
dtypes: float64(3), int64(2), object(3)
memory usage: 3.5+ GB


In [15]:
funds['iFILING_DATE']= pd.to_datetime(funds['iFILING_DATE'])
funds['iPERIOD_END']= pd.to_datetime(funds['iPERIOD_END'])
funds

Unnamed: 0,iRECORD_ID,iCIK,iCUSIP,iPERIOD_END,iFILING_DATE,iQTY,iMARKET_VALUE,iLONG_FRACTION
0,4443748,216235,055622104,1999-09-30,2000-01-03,28287.0,3134550.0,0.009282
1,4443749,216235,020039103,1999-09-30,2000-01-03,79199.0,5690620.0,0.016851
2,4443750,216235,713448108,1999-09-30,2000-01-03,186140.0,5795950.0,0.017163
3,4443751,216235,002824100,1999-09-30,2000-01-03,169555.0,6290420.0,0.018627
4,4443752,216235,05964H105,1999-09-30,2000-01-03,16897.0,174250.0,0.000516
...,...,...,...,...,...,...,...,...
57569231,74561590,1730479,039483102,2020-03-31,2020-04-23,300.0,11000.0,0.000086
57569232,74561591,1730479,88160R101,2020-03-31,2020-04-23,25.0,13000.0,0.000102
57569233,74561592,1730479,921937793,2020-03-31,2020-04-23,230.0,25000.0,0.000196
57569234,74561593,1730479,863667101,2020-03-31,2020-04-23,1030.0,171000.0,0.001341


In [16]:
funds.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51482411 entries, 0 to 57569235
Data columns (total 8 columns):
 #   Column          Dtype         
---  ------          -----         
 0   iRECORD_ID      int64         
 1   iCIK            int64         
 2   iCUSIP          object        
 3   iPERIOD_END     datetime64[ns]
 4   iFILING_DATE    datetime64[ns]
 5   iQTY            float64       
 6   iMARKET_VALUE   float64       
 7   iLONG_FRACTION  float64       
dtypes: datetime64[ns](2), float64(3), int64(2), object(1)
memory usage: 3.5+ GB
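
The `+` in `3.5+ GB` means pandas did not count the contents of the object column, so the real footprint is higher. As a rough sketch of one possible mitigation (not something this notebook does), a repetitive string column such as `iCUSIP` can be stored as a categorical; the toy column below is made up.

```python
import pandas as pd

# Toy column with many repeats of a few CUSIPs.
toy = pd.DataFrame({"iCUSIP": ["055622104", "020039103", "713448108"] * 10_000})

# deep=True also counts the memory held by the strings themselves.
before = toy["iCUSIP"].memory_usage(deep=True)
toy["iCUSIP"] = toy["iCUSIP"].astype("category")
after = toy["iCUSIP"].memory_usage(deep=True)
print(before, after)
```

The saving depends on how often values repeat; a column of mostly unique strings would gain little.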


In [18]:
## only keep filings dated on or after June 30, 2013
## (datetime is imported as a module, so the class is datetime.datetime)
funds = funds[funds.iFILING_DATE >= datetime.datetime(2013, 6, 30)]
funds

Unnamed: 0,iRECORD_ID,iCIK,iCUSIP,iPERIOD_END,iFILING_DATE,iQTY,iMARKET_VALUE,iLONG_FRACTION
26088669,38587073,1349353,002824100,2013-06-30,2013-08-01,96510.0,3366000.0,0.023443
26088670,38587074,1349353,025816109,2013-06-30,2013-08-01,70214.0,5249000.0,0.036557
26088671,38587075,1349353,026874784,2013-06-30,2013-08-01,346115.0,15471000.0,0.107748
26088672,38587076,1349353,060505104,2013-06-30,2013-08-01,882340.0,11347000.0,0.079026
26088673,38587077,1349353,064058100,2013-06-30,2013-08-01,129065.0,3620000.0,0.025212
...,...,...,...,...,...,...,...,...
57569231,74561590,1730479,039483102,2020-03-31,2020-04-23,300.0,11000.0,0.000086
57569232,74561591,1730479,88160R101,2020-03-31,2020-04-23,25.0,13000.0,0.000102
57569233,74561592,1730479,921937793,2020-03-31,2020-04-23,230.0,25000.0,0.000196
57569234,74561593,1730479,863667101,2020-03-31,2020-04-23,1030.0,171000.0,0.001341


In [21]:
## assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
funds = funds.assign(FILING_INTERVAL=funds['iFILING_DATE'] - funds['iPERIOD_END'])
funds

Unnamed: 0,iRECORD_ID,iCIK,iCUSIP,iPERIOD_END,iFILING_DATE,iQTY,iMARKET_VALUE,iLONG_FRACTION,FILING_INTERVAL
26088669,38587073,1349353,002824100,2013-06-30,2013-08-01,96510.0,3366000.0,0.023443,32 days
26088670,38587074,1349353,025816109,2013-06-30,2013-08-01,70214.0,5249000.0,0.036557,32 days
26088671,38587075,1349353,026874784,2013-06-30,2013-08-01,346115.0,15471000.0,0.107748,32 days
26088672,38587076,1349353,060505104,2013-06-30,2013-08-01,882340.0,11347000.0,0.079026,32 days
26088673,38587077,1349353,064058100,2013-06-30,2013-08-01,129065.0,3620000.0,0.025212,32 days
...,...,...,...,...,...,...,...,...,...
57569231,74561590,1730479,039483102,2020-03-31,2020-04-23,300.0,11000.0,0.000086,23 days
57569232,74561591,1730479,88160R101,2020-03-31,2020-04-23,25.0,13000.0,0.000102,23 days
57569233,74561592,1730479,921937793,2020-03-31,2020-04-23,230.0,25000.0,0.000196,23 days
57569234,74561593,1730479,863667101,2020-03-31,2020-04-23,1030.0,171000.0,0.001341,23 days
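
The interval is a `timedelta64` column; for inspection it can be easier to view it as whole days. A small sketch with toy dates, where the column name `FILING_DAYS` is hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    "iPERIOD_END":  pd.to_datetime(["2013-06-30", "2020-03-31", "2013-06-30"]),
    "iFILING_DATE": pd.to_datetime(["2013-08-01", "2020-04-23", "2013-09-30"]),
})

# Subtracting two datetime columns yields timedeltas; .dt.days gives integers.
interval = toy["iFILING_DATE"] - toy["iPERIOD_END"]
toy["FILING_DAYS"] = interval.dt.days

# Flag filings made more than 45 days after the quarter end.
late = toy["FILING_DAYS"] > 45
print(toy)
print("late filings:", int(late.sum()))
```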


In [22]:
funds.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28211620 entries, 26088669 to 57569235
Data columns (total 9 columns):
 #   Column           Dtype          
---  ------           -----          
 0   iRECORD_ID       int64          
 1   iCIK             int64          
 2   iCUSIP           object         
 3   iPERIOD_END      datetime64[ns] 
 4   iFILING_DATE     datetime64[ns] 
 5   iQTY             float64        
 6   iMARKET_VALUE    float64        
 7   iLONG_FRACTION   float64        
 8   FILING_INTERVAL  timedelta64[ns]
dtypes: datetime64[ns](2), float64(3), int64(2), object(1), timedelta64[ns](1)
memory usage: 2.1+ GB


In [29]:
## only keep the funds which complete filing within 45 days (for backtesting purpose)
funds = funds[funds['FILING_INTERVAL']<=datetime.timedelta(days = 45)]
funds

Unnamed: 0,iRECORD_ID,iCIK,iCUSIP,iPERIOD_END,iFILING_DATE,iQTY,iMARKET_VALUE,iLONG_FRACTION,FILING_INTERVAL
26088669,38587073,1349353,002824100,2013-06-30,2013-08-01,96510.0,3366000.0,0.023443,32 days
26088670,38587074,1349353,025816109,2013-06-30,2013-08-01,70214.0,5249000.0,0.036557,32 days
26088671,38587075,1349353,026874784,2013-06-30,2013-08-01,346115.0,15471000.0,0.107748,32 days
26088672,38587076,1349353,060505104,2013-06-30,2013-08-01,882340.0,11347000.0,0.079026,32 days
26088673,38587077,1349353,064058100,2013-06-30,2013-08-01,129065.0,3620000.0,0.025212,32 days
...,...,...,...,...,...,...,...,...,...
57569231,74561590,1730479,039483102,2020-03-31,2020-04-23,300.0,11000.0,0.000086,23 days
57569232,74561591,1730479,88160R101,2020-03-31,2020-04-23,25.0,13000.0,0.000102,23 days
57569233,74561592,1730479,921937793,2020-03-31,2020-04-23,230.0,25000.0,0.000196,23 days
57569234,74561593,1730479,863667101,2020-03-31,2020-04-23,1030.0,171000.0,0.001341,23 days
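
The last two principles, limiting the number of holdings and filtering on fund market value, are not implemented in the cells above. One possible approach is to aggregate per fund and period and filter on the result. All three thresholds below are hypothetical placeholders, as is the toy data.

```python
import pandas as pd

# Hypothetical thresholds; the notebook does not specify any values.
MIN_HOLDINGS = 2
MAX_HOLDINGS = 3
MIN_TOTAL_VALUE = 1_000_000.0

toy = pd.DataFrame({
    "iCIK":          [1, 1, 2, 3, 3, 3, 3],
    "iPERIOD_END":   ["2013-06-30"] * 7,
    "iCUSIP":        list("ABCDEFG"),
    "iMARKET_VALUE": [600_000.0, 500_000.0, 5_000_000.0,
                      100_000.0, 100_000.0, 100_000.0, 100_000.0],
})

# Compute per-filing statistics and broadcast them back onto each row.
grouped = toy.groupby(["iCIK", "iPERIOD_END"])
n_holdings = grouped["iCUSIP"].transform("nunique")
total_value = grouped["iMARKET_VALUE"].transform("sum")

keep = n_holdings.between(MIN_HOLDINGS, MAX_HOLDINGS) & (total_value >= MIN_TOTAL_VALUE)
selected = toy[keep]
print(selected)
```

Using `transform` keeps the result aligned with the original rows, so the mask can be applied directly without a merge.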


## Industry Classification (correlation to sector ETFs)

## Value & Size Classification (correlation to Value and Size ETFs)

## Performance (mean of funds' return as benchmark)

## Volatility

## Survival Time

## Risk

## Inflow

## Turnover