# Capstone Project 2 - The Financial Performance of the ESG Funds
# Data Wrangling
By Tingyin Xiao

<a id="0"></a>
# Table of Contents
* [1.  Introduction](#1)
* [2.  Import the Dependencies](#2)
* [3.  Data Collection](#3)
    * [3.1  Load the data](#3.1)
    * [3.2  Merge the data](#3.2)
    * [3.3  Save the merged data](#3.3)

* [4. Data Organization](#4)
    * [4.1  Load the saved data](#4.1)
    * [4.2  Clean the fund names and share classes including trademark symbols](#4.2)
    * [4.3  Merge duplicated columns with different names](#4.3)
    * [4.4  Deal with the double reported months](#4.4)
    * [4.5  Find the funds and share classes that have target variables over longest period](#4.5)
    * [4.6  Save the time series data and the data records with target covering longest time period](#4.6)
* [5. Data Definition](#5)
    * [5.1  Load the data](#5.1)
    * [5.2  Define the target](#5.2)
    * [5.3  Define the features](#5.3)
* [6. Data Cleaning](#6)
* [7. Summary](#7)

<a id="1"></a> 
## Introduction

Sustainable investing lets people who want to make the world better align their investments with their personal values. With the rise of this type of ethical investing, both investors and the financial institutions that serve them want data-based knowledge and insights on ESG (environmental, social, and governance) funds. By analyzing the financial performance of ESG funds, we can advise stakeholders on whether ESG investments generate higher returns in general, which assets perform better, and how these assets are likely to perform in the near future.

To do the analysis, we downloaded the ESG funds data from the Fossil Free Funds website (https://fossilfreefunds.org/). The data are mostly reported once per month, with a few months reported twice when the data were updated within the month. The dataset goes back to March 2020 and is still updated monthly (at the time of writing: July 2023).

This section is devoted to data wrangling. The combined table of all available raw data (from March 2020 to the present) has more than 300,000 rows and 146 columns. Each row is an observation of a share class in a given month, with its characteristics and various ESG evaluations as the columns. The goal of the data wrangling is to organize and clean the data so that it is ready for the next step, exploratory data analysis.

[Back to the Table of Contents](#0)

<a id="2"></a> 
## Import the Dependencies

In [2]:
import pandas as pd
import numpy as np
import glob
import warnings
from ydata_profiling import ProfileReport
import re

[Back to the Table of Contents](#0)

<a id="3"></a> 
## Data Collection

<a id="3.1"></a>
### Load the data

In [3]:
# get the list of files in the data folder

file_names = glob.glob('data-FFF/Invest+Your+Values+shareclass+results+*.xlsx')
file_names.sort()
#file_names

[Back to the Table of Contents](#0)

<a id="3.2"></a>
### Merge the data

In [4]:
# Read every monthly file and combine them into one DataFrame
# The report date is taken from the file name and stored as a column
warnings.filterwarnings("ignore", message="Unknown extension is not supported and will be removed")

frames = []
for file in file_names:
    df = pd.read_excel(file, sheet_name = 'Shareclasses')
    print(file)
    df['current time'] = file[-13:-5]  # YYYYMMDD stamp just before '.xlsx'
    frames.append(df)

# a single concat avoids re-copying the growing table on every iteration
data = pd.concat(frames)

data-FFF/Invest+Your+Values+shareclass+results+20200414.xlsx
data-FFF/Invest+Your+Values+shareclass+results+20200519.xlsx
data-FFF/Invest+Your+Values+shareclass+results+20200610.xlsx
data-FFF/Invest+Your+Values+shareclass+results+20200716.xlsx
data-FFF/Invest+Your+Values+shareclass+results+20200811.xlsx
data-FFF/Invest+Your+Values+shareclass+results+20200913.xlsx
data-FFF/Invest+Your+Values+shareclass+results+20200928.xlsx
data-FFF/Invest+Your+Values+shareclass+results+20201018.xlsx
data-FFF/Invest+Your+Values+shareclass+results+20201111.xlsx
data-FFF/Invest+Your+Values+shareclass+results+20210121.xlsx
data-FFF/Invest+Your+Values+shareclass+results+20210303.xlsx
data-FFF/Invest+Your+Values+shareclass+results+20210401.xlsx
data-FFF/Invest+Your+Values+shareclass+results+20210503.xlsx
data-FFF/Invest+Your+Values+shareclass+results+20210715.xlsx
data-FFF/Invest+Your+Values+shareclass+results+20210809.xlsx
data-FFF/Invest+Your+Values+shareclass+results+20210902.xlsx
data-FFF/Invest+Your+Val
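
The slice `file[-13:-5]` relies on every file name ending in an eight-digit date followed by `.xlsx`. A slightly more defensive variant, sketched here, extracts the stamp with a regular expression and checks that it is a real calendar date. The helper name `report_date_from_name` is hypothetical and is not part of the notebook.

```python
import re
from datetime import datetime

# Hypothetical helper (not in the original notebook): find the YYYYMMDD
# stamp that directly precedes the ".xlsx" extension.
_DATE_PATTERN = re.compile(r"(\d{8})\.xlsx$")

def report_date_from_name(file_name):
    match = _DATE_PATTERN.search(file_name)
    if match is None:
        raise ValueError(f"no YYYYMMDD stamp found in {file_name!r}")
    stamp = match.group(1)
    datetime.strptime(stamp, "%Y%m%d")  # raises ValueError for impossible dates
    return stamp

example = "data-FFF/Invest+Your+Values+shareclass+results+20200414.xlsx"
print(report_date_from_name(example))
```

For the file names listed above, this should return the same string as the slice; the difference is that a misnamed file raises an error instead of silently producing a wrong stamp.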

In [5]:
data.shape

(365103, 146)

In [6]:
type(data)

pandas.core.frame.DataFrame

[Back to the Table of Contents](#0)

<a id="3.3"></a>
### Save the merged data

In [7]:
data.to_csv('data/all_results_data.csv', index = False)

In [8]:
# save dtype of each column
datatypes  = data.dtypes.to_frame('dtypes').reset_index()
type(datatypes)

pandas.core.frame.DataFrame

In [9]:
datatypes.head()

Unnamed: 0,index,dtypes
0,Fund profile: Shareclass name,object
1,Fund profile: Ticker,object
2,Fund profile: Fund name,object
3,Fund profile: Asset manager,object
4,Fund profile: Shareclass type,object


In [10]:
datatypes.to_csv('data/datatypes.csv', index = False)

[Back to the Table of Contents](#0)

<a id="4"></a> 
## Data Organization

<a id="4.1"></a> 
### Load the saved data

In [18]:
types = pd.read_csv('data/datatypes.csv')
dtypes_dict = types.set_index('index')['dtypes'].to_dict()
#dtypes_dict

In [19]:
df = pd.read_csv('data/all_results_data.csv', dtype = dtypes_dict)
#set(df['current time'])
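
One reason to pass the saved dtypes back to `read_csv` is that a CSV file does not store column types. The sketch below uses a made-up two-row frame to show the effect on the `current time` stamp, which is text in the notebook but consists only of digits. The in-memory buffers (`io.StringIO`) stand in for the files on disk and are an assumption made to keep the example self-contained.

```python
import io
import pandas as pd

# Toy frame; 'current time' holds the report stamp as text, as in the notebook.
original = pd.DataFrame({
    "Fund profile: Ticker": ["TNBRX", "DDDCX"],
    "current time": ["20200414", "20200519"],
})
csv_text = original.to_csv(index=False)

# A plain read infers integers for the all-digit stamps.
naive = pd.read_csv(io.StringIO(csv_text))

# Store the dtypes as text, read them back, and pass them to read_csv.
dtype_text = original.dtypes.to_frame("dtypes").reset_index().to_csv(index=False)
dtype_map = pd.read_csv(io.StringIO(dtype_text)).set_index("index")["dtypes"].to_dict()
restored = pd.read_csv(io.StringIO(csv_text), dtype=dtype_map)

print(naive["current time"].tolist())
print(restored["current time"].tolist())
```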

In [20]:
fund_names = df['Fund profile: Fund name']
len(fund_names)

365103

In [21]:
len(set(fund_names))

4706

In [22]:
shareclasses = df['Fund profile: Shareclass name']
len(shareclasses)

365103

In [23]:
len(set(shareclasses))

14125

[Back to the Table of Contents](#0)

<a id="4.2"></a> 
### Clean the fund names and share classes including trademark symbols

Many fund names and share class names contain the trademark symbol ™ or the registered trademark symbol ®. A symbol may be missing in some records and present in others. Even when it is present in all records, the same name can print differently. We therefore need to clean the fund and share class names before the records can be matched.

In [24]:
# find the fund and share class names that contains trademark symbols

tm_shareClasses = shareclasses.str.contains('™')
tm_fundNames = fund_names.str.contains('™')

r_shareClasses = shareclasses.str.contains('®')
r_fundNames = fund_names.str.contains('®')

In [25]:
shareClasses_tm = set(shareclasses[tm_shareClasses])
fundNames_tm = set(fund_names[tm_fundNames])
                                 
shareClasses_r = set(shareclasses[r_shareClasses])
fundNames_r = set(fund_names[r_fundNames])

In [26]:
len(shareClasses_tm)

33

In [27]:
len(fundNames_tm)

36

In [28]:
len(shareClasses_r)

1414

In [29]:
len(fundNames_r)

666

In [30]:
list(shareClasses_tm)[:5]

['Schwab US REIT ETF™',
 'Schwab International Small-Cap Eq ETF™',
 'Invesco RAFI™ Strategic US ETF',
 'Eaton Vance Stock NextShares™',
 'Vanguard US Growth Admiral™']

In [31]:
list(fundNames_tm)[:5]

['Invesco RAFI™ Strategic Emerging Markets ETF',
 'Schwab Core Equity Fund™',
 'Invesco RAFI™ Strategic US ETF',
 'Eaton Vance Stock NextShares™',
 'Schwab International Small-Cap Equity ETF™']

In [32]:
list(fundNames_r)[25:35]

['VY® T. Rowe Price Capital Appreciation Portfolio',
 'SPDR®\xa0EURO STOXX\xa0Small Cap ETF',
 'VY® Invesco Oppenheimer Global Portfolio',
 'Fidelity® 500 Index Fund',
 'Fidelity® Series Small Cap Discovery Fund',
 'Financial Select Sector SPDR®\xa0Fund',
 'Fidelity® Small-Mid Cap Opportunities ETF',
 'Fidelity® High Dividend ETF',
 'Fidelity Advisor® Sustainability U.S. Equity Fund',
 'Fidelity® International Value Factor ETF']

In [33]:
list(shareClasses_r)[30:40]

['Real Estate Select Sector SPDR®',
 'JPMorgan SmartRetirement® 2055 I',
 'Fidelity Advisor® Diversified Intl I',
 'Invesco S&P MidCap 400® Equal Weight ETF',
 'BlackRock LifePath® Index 2055 Inv P',
 'Fidelity Advisor Freedom® Blend 2065 A',
 'Fidelity Advisor Asset Manager® 40% Z',
 'Schwab ® US Large-Cap Growth Idx',
 'Fidelity® NASDAQ Composite Index®',
 'Fidelity Advisor® Emerging Asia M']

In [34]:
# Here is an example that the same string can be printed out differently

ex_fundNames = fund_names.str.contains('SPDR®')
list(set(fund_names[ex_fundNames]))[:10]

['SPDR® Portfolio S&P 600 Small Cap ETF',
 'SPDR® Portfolio S&P 500 ETF',
 'SPDR®\xa0MFS Systematic Value Equity ETF',
 'SPDR® S&P Kensho Intelligent Structures ETF',
 'SPDR®\xa0S&P\xa0400 Mid Cap Growth ETF',
 'SPDR®\xa0S&P\xa0International Dividend ETF',
 'SPDR® S&P 500 ETF Trust',
 'SPDR® Kensho Final Frontiers ETF',
 'SPDR®\xa0S&P\xa0Homebuilders ETF',
 'SPDR® Portfolio S&P 500 Growth ETF']
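
The `\xa0` in the output is a non-breaking space (U+00A0). It looks like an ordinary space when printed but compares unequal to it, which is why otherwise identical names fail to match. A minimal sketch, using one name from the output above and an assumed ordinary-space spelling of the same name:

```python
import unicodedata

# Two spellings of one fund name: a non-breaking space after the
# registered-trademark sign versus an ordinary space (the second
# spelling is assumed for illustration).
name_nbsp = "SPDR®\xa0MFS Systematic Value Equity ETF"
name_space = "SPDR® MFS Systematic Value Equity ETF"

same_before = name_nbsp == name_space
char_name = unicodedata.name(name_nbsp[5])  # character right after '®'
same_after = name_nbsp.replace("\xa0", " ") == name_space

print(same_before, char_name, same_after)
```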

In [35]:
# First, replace the non-breaking space '\xa0' with a regular space ' '

fund_names_clean = fund_names.str.replace('\xa0', ' ')
list(set(fund_names_clean[ex_fundNames]))[:10]

['SPDR® S&P 600 Small Cap ETF',
 'SPDR® S&P Bank ETF',
 'SPDR® Portfolio S&P 600 Small Cap ETF',
 'SPDR® S&P Software & Services ETF',
 'SPDR® Portfolio S&P 500 ETF',
 'Materials Select Sector SPDR® Fund',
 'SPDR® S&P Kensho Intelligent Structures ETF',
 'Utilities Select Sector SPDR® Fund',
 'SPDR® S&P Telecom ETF',
 'SPDR® S&P Capital Markets ETF']

In [36]:
shareclass_clean = shareclasses.str.replace('\xa0', ' ')
list(set(shareclass_clean[shareclasses.str.contains('SPDR®')]))[:10]

['SPDR® S&P 600 Small Cap ETF',
 'SPDR® S&P Bank ETF',
 'Energy Select Sector SPDR® ETF',
 'SPDR® S&P Software & Services ETF',
 'SPDR® Portfolio S&P 500 ETF',
 'Materials Select Sector SPDR® ETF',
 'Industrial Select Sector SPDR® ETF',
 'SPDR® S&P Telecom ETF',
 'SPDR® S&P Capital Markets ETF',
 'SPDR® S&P 500 ETF Trust']

In [37]:
# The trademark symbol is missing in certain records
list(set(fund_names_clean[fund_names_clean.str.contains('SPDR ')]))[:]

['SPDR Bloomberg SASB Emerging Markets ESG Select ETF',
 'SPDR S&P SmallCap 600 ESG ETF',
 'The Real Estate Select Sector SPDR Fund',
 'SPDR Bloomberg SASB Developed Markets Ex US ESG Select ETF',
 'SPDR MSCI USA Climate Paris Aligned ETF',
 'SPDR S&P® North American Natural Resources ETF']

To match the records, we need to remove all trademark symbols, keep a single space between words, and strip leading and trailing spaces.

In [38]:
# copy the original columns of fund names and share class names

df['Shareclass name'] = df['Fund profile: Shareclass name']
df['Fund name'] = df['Fund profile: Fund name']

In [39]:
# clean the new columns by replacing the '\xa0' string by ' '

df['Shareclass name'] = df['Shareclass name'].str.replace('\xa0', ' ')
df['Fund name'] = df['Fund name'].str.replace('\xa0', ' ')

In [40]:
# Also drop the trademark symbols (™ and ®)
# note that there might be space before and/or after the symbols

def replace_trademark(s):
    s = s.replace('™', ' ')
    s = s.replace('®', ' ')
    s = re.sub(r'\s+', ' ', s)  # raw string: collapse runs of whitespace into one space
    return s.strip()
    
df['Shareclass name'] = df['Shareclass name'].apply(replace_trademark)
df['Fund name'] = df['Fund name'].apply(replace_trademark)
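
As a variation, the cleaning can be done in a single helper that also tolerates missing names. This is only a sketch: `clean_name` is a hypothetical function, and it assumes that a missing name arrives as a non-string value such as NaN.

```python
import re

# One character class covers both symbols and the non-breaking space.
_SYMBOLS = re.compile("[™®\xa0]")
_WHITESPACE = re.compile(r"\s+")

def clean_name(value):
    """Return the name without trademark symbols; pass non-strings through."""
    if not isinstance(value, str):
        return value  # e.g. NaN for a missing name
    without_symbols = _SYMBOLS.sub(" ", value)
    return _WHITESPACE.sub(" ", without_symbols).strip()

# Names taken from the outputs above
print(clean_name("Schwab ® US Large-Cap Growth Idx"))
print(clean_name("SPDR®\xa0S&P\xa0Homebuilders ETF"))
print(clean_name("Fidelity® NASDAQ Composite Index®"))
```

Because non-strings are returned unchanged, the helper could be passed to `.apply` on a column that contains missing values without raising an `AttributeError`.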

In [41]:
# examine the results

df[['Fund profile: Shareclass name','Shareclass name',
    'Fund profile: Fund name','Fund name']][df['Fund profile: Fund name'].str.contains('\xa0')].head()


Unnamed: 0,Fund profile: Shareclass name,Shareclass name,Fund profile: Fund name,Fund name
5570,JNL/RAFI® Fundamental Asia Developed A,JNL/RAFI Fundamental Asia Developed A,JNL/RAFI® Fundamental Asia Developed Fund,JNL/RAFI Fundamental Asia Developed Fund
5571,JNL/RAFI® Fundamental Asia Developed I,JNL/RAFI Fundamental Asia Developed I,JNL/RAFI® Fundamental Asia Developed Fund,JNL/RAFI Fundamental Asia Developed Fund
7734,Consumer Discret Sel Sect SPDR® ETF,Consumer Discret Sel Sect SPDR ETF,Consumer Discretionary Select Sector SPDR® Fund,Consumer Discretionary Select Sector SPDR Fund
7736,Energy Select Sector SPDR® ETF,Energy Select Sector SPDR ETF,Energy Select Sector SPDR® Fund,Energy Select Sector SPDR Fund
7737,Financial Select Sector SPDR® ETF,Financial Select Sector SPDR ETF,Financial Select Sector SPDR® Fund,Financial Select Sector SPDR Fund


In [42]:
df[['Fund profile: Shareclass name','Shareclass name','Fund profile: Fund name','Fund name']][df['Fund profile: Fund name'].str.contains('™')].head()


Unnamed: 0,Fund profile: Shareclass name,Shareclass name,Fund profile: Fund name,Fund name
2426,Eaton Vance Global Income Builder NS™,Eaton Vance Global Income Builder NS,Eaton Vance Global Income Builder NextShares™,Eaton Vance Global Income Builder NextShares
2464,Eaton Vance Stock NextShares™,Eaton Vance Stock NextShares,Eaton Vance Stock NextShares™,Eaton Vance Stock NextShares
4395,Invesco Cleantech™ ETF,Invesco Cleantech ETF,Invesco Cleantech™ ETF,Invesco Cleantech ETF
4433,Invesco Dividend Achievers™ ETF,Invesco Dividend Achievers ETF,Invesco Dividend Achievers™ ETF,Invesco Dividend Achievers ETF
4577,Invesco High Yield Eq Div Achiev™ ETF,Invesco High Yield Eq Div Achiev ETF,Invesco High Yield Equity Dividend Achievers™ ETF,Invesco High Yield Equity Dividend Achievers ETF


In [43]:
df[['Fund profile: Shareclass name','Shareclass name',
    'Fund profile: Fund name','Fund name']][df['Fund profile: Fund name'].str.contains('®')].head()


Unnamed: 0,Fund profile: Shareclass name,Shareclass name,Fund profile: Fund name,Fund name
507,American Century Ultra® A,American Century Ultra A,American Century Ultra® Fund,American Century Ultra Fund
508,American Century Ultra® C,American Century Ultra C,American Century Ultra® Fund,American Century Ultra Fund
509,American Century Ultra® G,American Century Ultra G,American Century Ultra® Fund,American Century Ultra Fund
510,American Century Ultra® I,American Century Ultra I,American Century Ultra® Fund,American Century Ultra Fund
511,American Century Ultra® Inv,American Century Ultra Inv,American Century Ultra® Fund,American Century Ultra Fund


[Back to the Table of Contents](#0)

<a id="4.3"></a> 
### Merge duplicated columns with different names

Several columns are duplicated in meaning, for example 'Returns and fees: Month end trailing returns, 1 year' and 'Financial performance: Month end trailing returns, year 1'. They are the same variable under different names in the original files, so we merge each pair into a single column.

In [44]:
key_str = 'Financial performance:'
sub_col = df.filter(regex = key_str)
sub_col.columns

Index(['Financial performance: Financial performance as-of date',
       'Financial performance: Month end trailing returns, year 1',
       'Financial performance: Month end trailing returns, year 3',
       'Financial performance: Month end trailing returns, year 5',
       'Financial performance: Month end trailing returns, year 10'],
      dtype='object')

In [45]:
key_str2 = 'Returns and fees:'
sub_col2 = df.filter(regex = key_str2)
sub_col2.columns

Index(['Returns and fees: Financial performance as-of date',
       'Returns and fees: Month end trailing returns, 1 month',
       'Returns and fees: Month end trailing returns, 3 month',
       'Returns and fees: Month end trailing returns, 6 month',
       'Returns and fees: Month end trailing returns, 1 year',
       'Returns and fees: Month end trailing returns, 3 year',
       'Returns and fees: Month end trailing returns, 5 year',
       'Returns and fees: Month end trailing returns, 10 year',
       'Returns and fees: Month end trailing returns, 15 year',
       'Returns and fees: Month end trailing returns, 20 year',
       'Returns and fees: Month end trailing returns, year-to-date',
       'Returns and fees: Month end trailing returns, since inception',
       'Returns and fees: Prospectus net expense ratio'],
      dtype='object')

In [46]:
# merge the duplicated columns for 'Month end trailing returns, 1 year' to a new column

df['Month end trailing returns, 1 year']= (df['Financial performance: Month end trailing returns, year 1']
                                           .combine_first(df['Returns and fees: Month end trailing returns, 1 year']))

In [47]:
Non_missing = df['Month end trailing returns, 1 year'].count()

In [48]:
Non_missing1 = df['Returns and fees: Month end trailing returns, 1 year'].count()

In [49]:
Non_missing2 = df['Financial performance: Month end trailing returns, year 1'].count()

In [50]:
assert Non_missing == Non_missing1 + Non_missing2
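
The assertion holds only if no row has a value in both source columns, so it doubles as a check that the two names never overlap. The toy example below, with made-up values, shows how `combine_first` fills gaps and how an overlap would enter the count.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two source columns (values are made up).
old_name = pd.Series([1.5, np.nan, np.nan, np.nan])
new_name = pd.Series([np.nan, 2.5, np.nan, 4.0])

# combine_first keeps the caller's value and falls back to the argument
# only where the caller is missing.
merged = old_name.combine_first(new_name)

# Rows filled in both columns would be counted twice in the simple sum.
overlap = int((old_name.notna() & new_name.notna()).sum())

print(merged.tolist())
print(merged.count() == old_name.count() + new_name.count() - overlap)
```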

In [51]:
# merge the other duplicated columns

df['Month end trailing returns, 3 year']= (df['Financial performance: Month end trailing returns, year 3']
                                           .combine_first(df['Returns and fees: Month end trailing returns, 3 year']))
df['Month end trailing returns, 5 year']= (df['Financial performance: Month end trailing returns, year 5']
                                           .combine_first(df['Returns and fees: Month end trailing returns, 5 year']))
df['Month end trailing returns, 10 year']= (df['Financial performance: Month end trailing returns, year 10']
                                            .combine_first(df['Returns and fees: Month end trailing returns, 10 year']))
df['Financial performance as-of date']= (df['Financial performance: Financial performance as-of date']
                                         .combine_first(df['Returns and fees: Financial performance as-of date']))


In [52]:
df.head()

Unnamed: 0,Fund profile: Shareclass name,Fund profile: Ticker,Fund profile: Fund name,Fund profile: Asset manager,Fund profile: Shareclass type,Fund profile: Shareclass inception date,Fund profile: Category group,Fund profile: Sustainability mandate,Fund profile: US-SIF member,Fund profile: Oldest shareclass inception date,...,"Fossil Free Funds: Fossil fuel insurance holdings, count","Fossil Free Funds: Fossil fuel insurance holdings, weight","Fossil Free Funds: Fossil fuel insurance holdings, asset",Shareclass name,Fund name,"Month end trailing returns, 1 year","Month end trailing returns, 3 year","Month end trailing returns, 5 year","Month end trailing returns, 10 year",Financial performance as-of date
0,1290 SmartBeta Equity A,TNBRX,1290 SmartBeta Equity Fund,1290 Funds,Open-end mutual fund,2014-11-12,International Equity Funds,Y,,2014-11-12,...,,,,1290 SmartBeta Equity A,1290 SmartBeta Equity Fund,-7.74546,3.45056,4.32708,,2020-03-31
1,1290 SmartBeta Equity I,TNBRX,1290 SmartBeta Equity Fund,1290 Funds,Open-end mutual fund,2014-11-12,International Equity Funds,Y,,2014-11-12,...,,,,1290 SmartBeta Equity I,1290 SmartBeta Equity Fund,-7.50301,3.69803,4.58249,,2020-03-31
2,1290 SmartBeta Equity R,TNBRX,1290 SmartBeta Equity Fund,1290 Funds,Open-end mutual fund,2014-11-12,International Equity Funds,Y,,2014-11-12,...,,,,1290 SmartBeta Equity R,1290 SmartBeta Equity Fund,-7.97413,3.16886,4.04976,,2020-03-31
3,1290 SmartBeta Equity T,TNBRX,1290 SmartBeta Equity Fund,1290 Funds,Open-end mutual fund,2014-11-12,International Equity Funds,Y,,2014-11-12,...,,,,1290 SmartBeta Equity T,1290 SmartBeta Equity Fund,-7.52837,3.68855,4.57675,,2020-03-31
4,13D Activist A,DDDCX,13D Activist Fund,13D Activist Fund,Open-end mutual fund,2011-12-28,U.S. Equity Fund,Y,,2011-12-28,...,,,,13D Activist A,13D Activist Fund,-23.01826,-3.8415,-1.0491,,2020-03-31


In [53]:
# drop the partially filled source columns now that they have been merged

df = df.drop(columns=['Financial performance: Month end trailing returns, year 1',
                      'Returns and fees: Month end trailing returns, 1 year',
                      'Financial performance: Month end trailing returns, year 3',
                      'Returns and fees: Month end trailing returns, 3 year',
                      'Financial performance: Month end trailing returns, year 5',
                      'Returns and fees: Month end trailing returns, 5 year',
                      'Financial performance: Month end trailing returns, year 10',
                      'Returns and fees: Month end trailing returns, 10 year',
                      'Financial performance: Financial performance as-of date',
                      'Returns and fees: Financial performance as-of date'])


[Back to the Table of Contents](#0)

<a id="4.4"></a> 
### Deal with the double reported months

The performance for some months is reported in the data files of two consecutive months. Are these double-reported records exact duplicates?

In [54]:
# 'current time' was added when loading data to differentiate batches of data
# it is the reporting time

time_table = df[['Financial performance as-of date','current time']]
tt = time_table.drop_duplicates()
tt.head()

Unnamed: 0,Financial performance as-of date,current time
0,2020-03-31,20200414
9238,2020-04-30,20200519
13992,2020-03-31,20200519
18476,2020-05-31,20200610
20686,2020-04-30,20200610


In [55]:
tt.sort_values(['Financial performance as-of date', 'current time']).head(31)

Unnamed: 0,Financial performance as-of date,current time
0,2020-03-31,20200414
13992,2020-03-31,20200519
9238,2020-04-30,20200519
20686,2020-04-30,20200610
18476,2020-05-31,20200610
27861,2020-05-31,20200716
27846,2020-06-30,20200716
37183,2020-07-31,20200811
46423,2020-08-31,20200913
55648,2020-08-31,20200928


In [56]:
# the two largest counts of records indicate the double reporting for the same date

df['Financial performance as-of date'].value_counts().head()

2020-08-31    18450
2022-01-31    17573
2020-05-31     9384
2020-06-30     9316
2020-07-31     9240
Name: Financial performance as-of date, dtype: int64

In [57]:
df['current time'].value_counts().head()

20200610    9370
20200716    9337
20200811    9240
20200414    9238
20200519    9238
Name: current time, dtype: int64

In [58]:
col_names = df.columns
type(col_names)

pandas.core.indexes.base.Index

In [59]:
len(col_names)

143

In [60]:
# 'current time' should be ignored when judging duplicated rows

col_names_subset = col_names.drop('current time')

In [61]:
len(col_names_subset)

142

In [62]:
duplicated_rows = df.duplicated(subset = col_names_subset)

In [63]:
# number of duplicated rows in the data

sum(duplicated_rows)

7128
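
To make the duplicate test concrete, here is a small sketch on made-up rows. Two rows count as duplicates when they agree on every column except the report stamp; the stamps and values below are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Shareclass name": ["Fund A", "Fund A", "Fund B"],
    "Month end trailing returns, 1 year": [25.5, 25.5, 24.7],
    "current time": ["20220210", "20220310", "20220210"],
})

# Compare every column except the report stamp.
compare_cols = [col for col in toy.columns if col != "current time"]

# The first occurrence is kept as the original; later repeats are flagged.
is_repeat = toy.duplicated(subset=compare_cols)
print(is_repeat.tolist())
```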

In [64]:
df[duplicated_rows].head()

Unnamed: 0,Fund profile: Shareclass name,Fund profile: Ticker,Fund profile: Fund name,Fund profile: Asset manager,Fund profile: Shareclass type,Fund profile: Shareclass inception date,Fund profile: Category group,Fund profile: Sustainability mandate,Fund profile: US-SIF member,Fund profile: Oldest shareclass inception date,...,"Fossil Free Funds: Fossil fuel insurance holdings, count","Fossil Free Funds: Fossil fuel insurance holdings, weight","Fossil Free Funds: Fossil fuel insurance holdings, asset",Shareclass name,Fund name,"Month end trailing returns, 1 year","Month end trailing returns, 3 year","Month end trailing returns, 5 year","Month end trailing returns, 10 year",Financial performance as-of date
200365,ALPS/Kotak India Growth A,INAAX,ALPS/Kotak India Growth,ALPS,Open-end mutual fund,2018-06-12,International Equity Funds,Y,,2011-02-14,...,,,,ALPS/Kotak India Growth A,ALPS/Kotak India Growth,25.55341,16.95053,12.37844,11.83715,2022-01-31
200366,ALPS/Kotak India Growth C,INFCX,ALPS/Kotak India Growth,ALPS,Open-end mutual fund,2011-02-14,International Equity Funds,Y,,2011-02-14,...,,,,ALPS/Kotak India Growth C,ALPS/Kotak India Growth,24.69715,16.11809,11.57407,11.01872,2022-01-31
200367,ALPS/Kotak India Growth I,INDIX,ALPS/Kotak India Growth,ALPS,Open-end mutual fund,2011-02-14,International Equity Funds,Y,,2011-02-14,...,,,,ALPS/Kotak India Growth I,ALPS/Kotak India Growth,25.93037,17.26608,12.69606,12.13499,2022-01-31
200368,ALPS/Kotak India Growth II,,ALPS/Kotak India Growth,ALPS,Open-end mutual fund,2019-12-19,International Equity Funds,Y,,2011-02-14,...,,,,ALPS/Kotak India Growth II,ALPS/Kotak India Growth,26.29507,17.50607,12.83439,12.20379,2022-01-31
200369,ALPS/Kotak India Growth Inv,INDAX,ALPS/Kotak India Growth,ALPS,Open-end mutual fund,2011-02-14,International Equity Funds,Y,,2011-02-14,...,,,,ALPS/Kotak India Growth Inv,ALPS/Kotak India Growth,25.53898,17.09529,12.439,11.81849,2022-01-31


In [65]:
# which date(s) have duplicated reporting

df.loc[duplicated_rows, 'Financial performance as-of date'].unique()

array(['2022-01-31', nan], dtype=object)

In [66]:
# drop these duplicated rows

df = df[~duplicated_rows]
df.shape

(357975, 143)

In [68]:
# some months' performance is reported in two consecutive files without being exact duplicates
# keep the rows from the latest report, since they contain the updated values

df = df.sort_values(['current time'])
df = df.drop_duplicates(subset = ['Shareclass name', 'Financial performance as-of date'], keep = 'last')
df.shape

(286596, 143)

In [69]:
row_counts = df.groupby(['Shareclass name', 'Financial performance as-of date']).apply(len)
sum(row_counts > 1)

0
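
The keep-the-latest-report rule can be illustrated on a toy table with made-up values. After sorting by the report stamp, `keep='last'` retains the most recently reported row for each share class and as-of date.

```python
import pandas as pd

toy = pd.DataFrame({
    "Shareclass name": ["Fund A", "Fund A", "Fund B"],
    "Financial performance as-of date": ["2020-03-31", "2020-03-31", "2020-03-31"],
    "current time": ["20200519", "20200414", "20200414"],
    "Month end trailing returns, 1 year": [-7.70, -7.75, -3.10],
})

latest = (
    toy.sort_values("current time")
       .drop_duplicates(
           subset=["Shareclass name", "Financial performance as-of date"],
           keep="last",
       )
)
print(latest)
```

The stamps are zero-padded `YYYYMMDD` strings, so sorting them as text also sorts them chronologically.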

[Back to the Table of Contents](#0)

<a id="4.5"></a> 
### Find the funds and share classes that have target variables over longest period

First, we choose 'Month end trailing returns, 1 year' as our target variable.

In [72]:
df_sub = pd.concat([df['Financial performance as-of date'], df['current time']], axis = 1)
df_sub.columns = ['date1', 'date2']
df_sub.head()

Unnamed: 0,date1,date2
0,2020-03-31,20200414
2,2020-03-31,20200414
3,2020-03-31,20200414
4,2020-03-31,20200414
5,2020-03-31,20200414


In [73]:
df_sub.describe()

Unnamed: 0,date1,date2
count,277579,286596
unique,34,40
top,2020-05-31,20200610
freq,9363,9349


In [74]:
time_series_data = df[['Shareclass name','Financial performance as-of date','Month end trailing returns, 1 year']]


In [75]:
time_series_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 286596 entries, 0 to 365102
Data columns (total 3 columns):
 #   Column                              Non-Null Count   Dtype  
---  ------                              --------------   -----  
 0   Shareclass name                     286596 non-null  object 
 1   Financial performance as-of date    277579 non-null  object 
 2   Month end trailing returns, 1 year  271807 non-null  float64
dtypes: float64(1), object(2)
memory usage: 8.7+ MB


In [76]:
sorted_data = time_series_data.sort_values(['Shareclass name','Financial performance as-of date'])

In [77]:
sorted_data.head(16)

Unnamed: 0,Shareclass name,Financial performance as-of date,"Month end trailing returns, 1 year"
0,1290 SmartBeta Equity A,2020-03-31,-7.74546
9238,1290 SmartBeta Equity A,2020-04-30,-2.73874
18476,1290 SmartBeta Equity A,2020-05-31,5.15167
27846,1290 SmartBeta Equity A,2020-06-30,0.92022
37183,1290 SmartBeta Equity A,2020-07-31,5.48843
55648,1290 SmartBeta Equity A,2020-08-31,10.96411
64873,1290 SmartBeta Equity A,2020-09-30,6.55784
1,1290 SmartBeta Equity I,2020-03-31,-7.50301
9239,1290 SmartBeta Equity I,2020-04-30,-2.46099
18477,1290 SmartBeta Equity I,2020-05-31,5.3863


In [78]:
sorted_data.shape

(286596, 3)

In [79]:
sorted_data['Shareclass name'].nunique()

14096

In [80]:
non_missing_data = sorted_data.dropna(subset = ['Financial performance as-of date', 
                                                'Month end trailing returns, 1 year'])
non_missing_data.shape

(271807, 3)

In [81]:
freq_counts = non_missing_data['Shareclass name'].value_counts()
sorted_counts = freq_counts.sort_values(ascending = False).reset_index()
sorted_counts.columns = ['Shareclass name', '# time steps']
sorted_counts.head()

Unnamed: 0,Shareclass name,# time steps
0,Goldman Sachs International Eq Inc Inv,31
1,Janus Henderson Growth And Income S,31
2,Janus Henderson Overseas C,31
3,Janus Henderson Overseas D,31
4,Janus Henderson Overseas I,31


In [82]:
max_t = sorted_counts['# time steps'].max()

In [83]:
sorted_counts.shape

(13438, 2)

In [84]:
shareclasses_sub = sorted_counts.loc[sorted_counts['# time steps'] == max_t,'Shareclass name']
shareclasses_sub.head()

0    Goldman Sachs International Eq Inc Inv
1       Janus Henderson Growth And Income S
2                Janus Henderson Overseas C
3                Janus Henderson Overseas D
4                Janus Henderson Overseas I
Name: Shareclass name, dtype: object

In [85]:
shareclasses_list = list(shareclasses_sub)
len(shareclasses_list)

5579
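
The selection above keeps every share class that is tied for the largest number of non-missing target values. The same idea on a toy table (made-up names and values), written with `groupby(...).count()` instead of `value_counts`, might look like this:

```python
import pandas as pd

toy = pd.DataFrame({
    "Shareclass name": ["A", "A", "A", "B", "B", "C", "C", "C"],
    "Month end trailing returns, 1 year": [1.0, 2.0, 3.0, 1.0, None, 4.0, 5.0, 6.0],
})

# Count the non-missing target values per share class.
counts = toy.groupby("Shareclass name")["Month end trailing returns, 1 year"].count()

# Keep every share class tied for the maximum count.
longest = counts[counts == counts.max()].index.tolist()
print(longest)
```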

In [86]:
# the months are not all consecutive
# some months are missing: for the first share class, 2020-11-30 and 2021-04-30 are missing
# also, for the last month, 2022-12-31 is NaN

longest_ts = sorted_data[sorted_data['Shareclass name'].isin(shareclasses_list)]
longest_ts.head(33)

Unnamed: 0,Shareclass name,Financial performance as-of date,"Month end trailing returns, 1 year"
7,1919 Socially Responsive Balanced A,2020-03-31,0.00352
9245,1919 Socially Responsive Balanced A,2020-04-30,5.60674
18483,1919 Socially Responsive Balanced A,2020-05-31,13.11467
27853,1919 Socially Responsive Balanced A,2020-06-30,10.9106
37190,1919 Socially Responsive Balanced A,2020-07-31,15.01209
55655,1919 Socially Responsive Balanced A,2020-08-31,21.47144
64880,1919 Socially Responsive Balanced A,2020-09-30,18.34213
74102,1919 Socially Responsive Balanced A,2020-10-31,15.15362
83172,1919 Socially Responsive Balanced A,2020-12-31,20.57031
92372,1919 Socially Responsive Balanced A,2021-01-31,17.88592


In [87]:
# check whether these two time steps are missing for all share classes

check_missing = longest_ts.loc[longest_ts['Financial performance as-of date'].isin(['2020-11-30', '2021-04-30'])]
check_missing

Unnamed: 0,Shareclass name,Financial performance as-of date,"Month end trailing returns, 1 year"


None of the share classes with the longest time series of 'Month end trailing returns, 1 year' have this performance reported for these two months.
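
Gaps like these can also be found programmatically by comparing the observed as-of dates with a complete list of month ends. The sketch below uses only the first ten dates shown for the first share class, so it can detect only the first gap; `month_ends` is a hypothetical helper.

```python
import calendar
from datetime import date

def month_ends(start, end):
    """Yield the last calendar day of every month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        last_day = calendar.monthrange(year, month)[1]
        yield date(year, month, last_day).isoformat()
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)

# As-of dates of the first share class shown above (first ten rows only).
observed = {
    "2020-03-31", "2020-04-30", "2020-05-31", "2020-06-30", "2020-07-31",
    "2020-08-31", "2020-09-30", "2020-10-31", "2020-12-31", "2021-01-31",
}
expected = list(month_ends(date(2020, 3, 1), date(2021, 1, 31)))
missing = [d for d in expected if d not in observed]
print(missing)
```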

In [88]:
# drop the NaN values

longest_ts = longest_ts.loc[~longest_ts['Financial performance as-of date'].isnull()]
longest_ts.head(33)

Unnamed: 0,Shareclass name,Financial performance as-of date,"Month end trailing returns, 1 year"
7,1919 Socially Responsive Balanced A,2020-03-31,0.00352
9245,1919 Socially Responsive Balanced A,2020-04-30,5.60674
18483,1919 Socially Responsive Balanced A,2020-05-31,13.11467
27853,1919 Socially Responsive Balanced A,2020-06-30,10.9106
37190,1919 Socially Responsive Balanced A,2020-07-31,15.01209
55655,1919 Socially Responsive Balanced A,2020-08-31,21.47144
64880,1919 Socially Responsive Balanced A,2020-09-30,18.34213
74102,1919 Socially Responsive Balanced A,2020-10-31,15.15362
83172,1919 Socially Responsive Balanced A,2020-12-31,20.57031
92372,1919 Socially Responsive Balanced A,2021-01-31,17.88592


In [90]:
len(set(longest_ts['Shareclass name']))

5579

[Back to the Table of Contents](#0)

<a id="4.6"></a> 
### Save the time series data and the data records with target covering longest time period

In [92]:
# save the time series data

longest_ts.to_csv('data/shareclasses_one_year_return_max_months_long_time_series.csv', index = False)

In [93]:
# include the features and other targets

df_filtered = df[df['Shareclass name'].isin(shareclasses_list)]
df_filtered.shape

(178505, 143)

In [95]:
df_filtered.tail()

Unnamed: 0,Fund profile: Shareclass name,Fund profile: Ticker,Fund profile: Fund name,Fund profile: Asset manager,Fund profile: Shareclass type,Fund profile: Shareclass inception date,Fund profile: Category group,Fund profile: Sustainability mandate,Fund profile: US-SIF member,Fund profile: Oldest shareclass inception date,...,"Fossil Free Funds: Fossil fuel insurance holdings, count","Fossil Free Funds: Fossil fuel insurance holdings, weight","Fossil Free Funds: Fossil fuel insurance holdings, asset",Shareclass name,Fund name,"Month end trailing returns, 1 year","Month end trailing returns, 3 year","Month end trailing returns, 5 year","Month end trailing returns, 10 year",Financial performance as-of date
364837,VY® Invesco Equity and Income S2,IVIPX,VY® Invesco Equity and Income Portfolio,Voya,Open-end mutual fund,2009-02-27,Allocation Funds,,,2001-12-10,...,1.0,0.014404,15000773.0,VY Invesco Equity and Income S2,VY Invesco Equity and Income Portfolio,,,,,
365074,WisdomTree U.S. ESG ETF,RESP,WisdomTree U.S. ESG Fund,WisdomTree,ETF,2007-02-23,U.S. Equity Fund,Y,,2007-02-23,...,2.0,0.004272,286368.0,WisdomTree U.S. ESG ETF,WisdomTree U.S. ESG Fund,,,,,
365093,Xtrackers MSCI EAFE ESG Leaders Eq ETF,EASG,Xtrackers MSCI EAFE ESG Leaders Equity ETF,Xtrackers,ETF,2018-09-05,International Equity Funds,Y,,2018-09-05,...,9.0,0.036928,1600746.0,Xtrackers MSCI EAFE ESG Leaders Eq ETF,Xtrackers MSCI EAFE ESG Leaders Equity ETF,,,,,
365094,Xtrackers MSCI EMs ESG Leaders Eq ETF,EMSG,Xtrackers MSCI Emerging Markets ESG Leaders Eq...,Xtrackers,ETF,2018-12-04,International Equity Funds,Y,,2018-12-04,...,0.0,0.0,0.0,Xtrackers MSCI EMs ESG Leaders Eq ETF,Xtrackers MSCI Emerging Markets ESG Leaders Eq...,,,,,
365095,Xtrackers MSCI USA ESG Leaders Eq ETF,USSG,Xtrackers MSCI USA ESG Leaders Equity ETF,Xtrackers,ETF,2019-03-06,U.S. Equity Fund,Y,,2019-03-06,...,2.0,0.003216,3962100.0,Xtrackers MSCI USA ESG Leaders Eq ETF,Xtrackers MSCI USA ESG Leaders Equity ETF,,,,,


In [96]:
df_filtered.to_csv('data/shareclasses_one_year_return_max_months_long_full_data.csv', index = False)

In [124]:
len(set(df_filtered['Shareclass name']))

5579
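
The filter-then-count check above can be sketched on a toy frame. This is only an illustration: the fund names, the return values, and the `keep_names` list (a stand-in for `shareclasses_list`) are made up.

```python
import pandas as pd

# Toy records; names and values are invented for illustration only.
records = pd.DataFrame({
    "Shareclass name": ["Fund A", "Fund A", "Fund B", "Fund C"],
    "Month end trailing returns, 1 year": [1.2, 1.5, -0.3, 4.0],
})

# Hypothetical stand-in for shareclasses_list
keep_names = ["Fund A", "Fund C"]

# Keep every row whose share class is in the list
filtered = records[records["Shareclass name"].isin(keep_names)]

# nunique() counts distinct non-null names; it matches len(set(...))
# as long as no name is missing
n_kept = filtered["Shareclass name"].nunique()
print(filtered.shape, n_kept)
```

Comparing the distinct count before and after saving is a cheap way to confirm that the filter kept exactly the intended share classes.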

In [97]:
# save dtype of each column
datatypes2  = df_filtered.dtypes.to_frame('dtypes').reset_index()
type(datatypes2)

pandas.core.frame.DataFrame

In [98]:
datatypes2.tail()

Unnamed: 0,index,dtypes
138,"Month end trailing returns, 1 year",float64
139,"Month end trailing returns, 3 year",float64
140,"Month end trailing returns, 5 year",float64
141,"Month end trailing returns, 10 year",float64
142,Financial performance as-of date,object


In [99]:
datatypes2.to_csv('data/shareclasses_one_year_return_max_months_long_full_data_datatypes.csv', index = False)

[Back to the Table of Contents](#0)

<a id="5"></a> 
## Data Definition

This section defines the target variables and the feature variables. The feature variables are grouped by category. 

<a id="5.1"></a> 
### Load the data

In [138]:
# Load the saved data

types2 = pd.read_csv('data/shareclasses_one_year_return_max_months_long_full_data_datatypes.csv')
dtypes_dict2 = types2.set_index('index')['dtypes'].to_dict()

df_1yr = pd.read_csv('data/shareclasses_one_year_return_max_months_long_full_data.csv', dtype = dtypes_dict2)
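
The save-and-reload pattern for column dtypes can be sketched end to end on a toy frame. This is a minimal sketch, not the notebook's actual data: the column names and values are invented, and in-memory strings replace the CSV files on disk.

```python
import io
import pandas as pd

# Toy frame; columns and values are invented for illustration only.
toy = pd.DataFrame({
    "Shareclass name": ["Fund A", "Fund B"],
    "count": [1, 2],
    "weight": [0.5, 0.25],
})

# Save: one row per column, holding the dtype name as a string
dtype_table = toy.dtypes.astype(str).to_frame("dtypes").reset_index()
dtype_csv = dtype_table.to_csv(index=False)
data_csv = toy.to_csv(index=False)

# Load: rebuild the {column: dtype} mapping and pass it to read_csv
types = pd.read_csv(io.StringIO(dtype_csv))
dtypes_dict = types.set_index("index")["dtypes"].to_dict()
restored = pd.read_csv(io.StringIO(data_csv), dtype=dtypes_dict)

print(restored.dtypes)
```

One caveat of this approach: `read_csv` does not accept `datetime64` dtype strings through `dtype`, so date columns need to stay as `object` (as they are here) or be handled with `parse_dates`.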

[Back to the Table of Contents](#0)

<a id="5.2"></a> 
### Define the target

Although the data were selected based on the one-year month end trailing returns, other candidate target variables are available in the dataframe. 

In [139]:
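# Note: matching on 'Returns' also picks up 'Returns and fees: Prospectus net expense ratio',
# which is a fee rather than a return
target_cols = [col for col in df_1yr.columns if 'returns' in col or 'Returns' in col]
target_cols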

['Returns and fees: Month end trailing returns, 1 month',
 'Returns and fees: Month end trailing returns, 3 month',
 'Returns and fees: Month end trailing returns, 6 month',
 'Returns and fees: Month end trailing returns, 15 year',
 'Returns and fees: Month end trailing returns, 20 year',
 'Returns and fees: Month end trailing returns, year-to-date',
 'Returns and fees: Month end trailing returns, since inception',
 'Returns and fees: Prospectus net expense ratio',
 'Month end trailing returns, 1 year',
 'Month end trailing returns, 3 year',
 'Month end trailing returns, 5 year',
 'Month end trailing returns, 10 year']

In [140]:
len(target_cols)

12

In [141]:
# Check the fraction of missing values in each target

df_1yr[target_cols].isnull().mean().sort_values()

Month end trailing returns, 1 year                               0.031125
Month end trailing returns, 3 year                               0.044940
Month end trailing returns, 5 year                               0.089885
Month end trailing returns, 10 year                              0.210661
Returns and fees: Prospectus net expense ratio                   0.374919
Returns and fees: Month end trailing returns, year-to-date       0.375211
Returns and fees: Month end trailing returns, 6 month            0.375250
Returns and fees: Month end trailing returns, since inception    0.375872
Returns and fees: Month end trailing returns, 3 month            0.375889
Returns and fees: Month end trailing returns, 1 month            0.376292
Returns and fees: Month end trailing returns, 15 year            0.569239
Returns and fees: Month end trailing returns, 20 year            0.650508
dtype: float64

The column 'Month end trailing returns, 1 year' will be our main target, as it has the least missing data (about 3.1% of rows).
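
The choice of the main target can also be made programmatically. The sketch below uses a toy frame with invented values, and it matches on the phrase "trailing returns" (an assumption, slightly stricter than the filter used above) so that the expense ratio column is not treated as a return.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "Month end trailing returns, 1 year": [1.0, 2.0, 3.0, np.nan],
    "Month end trailing returns, 10 year": [1.0, np.nan, np.nan, np.nan],
    "Returns and fees: Prospectus net expense ratio": [0.5, 0.4, np.nan, np.nan],
})

# Keep only columns that are trailing returns (case-insensitive match)
return_cols = [c for c in toy.columns if "trailing returns" in c.lower()]

# Fraction of missing values per candidate target, smallest first
missing_share = toy[return_cols].isnull().mean().sort_values()

# The candidate with the least missing data
main_target = missing_share.idxmin()
print(main_target)
```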

[Back to the Table of Contents](#0)

<a id="5.3"></a> 
### Define the features

In [142]:
column_names = df_1yr.columns
column_names[120:]

Index(['Returns and fees: Month end trailing returns, 15 year',
       'Returns and fees: Month end trailing returns, 20 year',
       'Returns and fees: Month end trailing returns, year-to-date',
       'Returns and fees: Month end trailing returns, since inception',
       'Returns and fees: Prospectus net expense ratio',
       'Fund profile: Target date',
       'Fossil Free Funds: Fossil fuel finance grade',
       'Fossil Free Funds: Fossil fuel finance fund score',
       'Fossil Free Funds: Fossil fuel finance holdings, count',
       'Fossil Free Funds: Fossil fuel finance holdings, weight',
       'Fossil Free Funds: Fossil fuel finance holdings, asset',
       'Fossil Free Funds: Fossil fuel insurance grade',
       'Fossil Free Funds: Fossil fuel insurance fund score',
       'Fossil Free Funds: Fossil fuel insurance holdings, count',
       'Fossil Free Funds: Fossil fuel insurance holdings, weight',
       'Fossil Free Funds: Fossil fuel insurance holdings, asset',
      

In [143]:
# There are several categories of feature variables

cat_features = set([colname.split(':')[0] for colname in column_names 
                    if ':' in colname and 'Returns and fees' not in colname
                    and 'Fund profile' not in colname])
cat_features

{'Deforestation Free Funds',
 'Fossil Free Funds',
 'Gender Equality Funds',
 'Gun Free Funds',
 'Prison Free Funds',
 'Tobacco Free Funds',
 'Weapon Free Funds'}

In [144]:
# The feature category of most interest in this project is 'Fossil Free Funds'

df_1yr.filter(like = 'Fossil Free Funds: ').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178505 entries, 0 to 178504
Data columns (total 38 columns):
 #   Column                                                                                 Non-Null Count   Dtype  
---  ------                                                                                 --------------   -----  
 0   Fossil Free Funds: Fossil fuel grade                                                   178500 non-null  object 
 1   Fossil Free Funds: Fossil fuel holdings, count                                         178505 non-null  int64  
 2   Fossil Free Funds: Fossil fuel holdings, weight                                        178505 non-null  float64
 3   Fossil Free Funds: Fossil fuel holdings, asset                                         178505 non-null  int64  
 4   Fossil Free Funds: Carbon Underground 200, count                                       178505 non-null  int64  
 5   Fossil Free Funds: Carbon Underground 200, weight                

In [145]:
# In this category, 'Fossil fuel grade' is one of the columns with the least missing values

df_grade = df_1yr['Fossil Free Funds: Fossil fuel grade']
df_grade.head()

0    A
1    A
2    A
3    D
4    D
Name: Fossil Free Funds: Fossil fuel grade, dtype: object

In [146]:
# combine the fossil fuel grade column with the one year return target column

df_return = df_1yr['Month end trailing returns, 1 year']

df_sub = pd.concat([df_grade, df_return], axis = 1)
df_sub.columns = ['grade', 'return']
df_sub.head()

Unnamed: 0,grade,return
0,A,-0.65016
1,A,0.00352
2,A,0.37831
3,D,-32.02905
4,D,-32.24253


In [147]:
# see if higher grade corresponds to higher return

pivot_table = pd.pivot_table(df_sub, values = 'return', index = 'grade')
pivot_table

grade,return
A,14.60262
B,14.09777
C,9.107015
D,9.079069
F,7.257575
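
`pd.pivot_table` aggregates with the mean by default, so the table above shows the average one-year return per grade. A minimal sketch on invented toy data shows the equivalent `groupby` form, which also makes it easy to add the median and the group size next to the mean:

```python
import pandas as pd

# Toy data; grades and returns are invented for illustration only.
toy = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "B"],
    "return": [10.0, 20.0, 4.0, 6.0, 8.0],
})

# Default aggfunc of pivot_table is the mean
by_pivot = pd.pivot_table(toy, values="return", index="grade")

# Same mean via groupby, plus median and number of rows per grade
by_group = toy.groupby("grade")["return"].agg(["mean", "median", "count"])
print(by_group)
```

Looking at the count alongside the mean helps judge whether a difference between grades rests on a comparable number of observations.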


In [148]:
# other categories

df_1yr.filter(like = 'Deforestation Free Funds: ').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178505 entries, 0 to 178504
Data columns (total 10 columns):
 #   Column                                                               Non-Null Count   Dtype  
---  ------                                                               --------------   -----  
 0   Deforestation Free Funds: Deforestation grade                        178505 non-null  object 
 1   Deforestation Free Funds: Deforestation-risk producer, count         178505 non-null  int64  
 2   Deforestation Free Funds: Deforestation-risk producer, weight        178505 non-null  float64
 3   Deforestation Free Funds: Deforestation-risk producer, asset         178505 non-null  int64  
 4   Deforestation Free Funds: Deforestation-risk financier, count        178505 non-null  int64  
 5   Deforestation Free Funds: Deforestation-risk financier, weight       178505 non-null  float64
 6   Deforestation Free Funds: Deforestation-risk financier, asset        178505 non-null  int64 

In [149]:
df_1yr.filter(like = 'Gender Equality Funds: ').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178505 entries, 0 to 178504
Data columns (total 12 columns):
 #   Column                                                                                                          Non-Null Count   Dtype  
---  ------                                                                                                          --------------   -----  
 0   Gender Equality Funds: Gender equality grade                                                                    169052 non-null  object 
 1   Gender Equality Funds: Gender equality group ranking                                                            169466 non-null  float64
 2   Gender Equality Funds: Gender equality score (out of 100 points)                                                27895 non-null   float64
 3   Gender Equality Funds: Gender equality score, gender balance (out of 100 points)                                27895 non-null   float64
 4   Gender Equality Funds: Gender eq

In [150]:
df_1yr.filter(like = 'Gun Free Funds: ').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178505 entries, 0 to 178504
Data columns (total 10 columns):
 #   Column                                    Non-Null Count   Dtype  
---  ------                                    --------------   -----  
 0   Gun Free Funds: Civilian firearm grade    178505 non-null  object 
 1   Gun Free Funds: Civilian firearm, count   178505 non-null  int64  
 2   Gun Free Funds: Civilian firearm, weight  178505 non-null  float64
 3   Gun Free Funds: Civilian firearm, asset   178505 non-null  int64  
 4   Gun Free Funds: Gun manufacturer, count   178505 non-null  int64  
 5   Gun Free Funds: Gun manufacturer, weight  178505 non-null  float64
 6   Gun Free Funds: Gun manufacturer, asset   178505 non-null  int64  
 7   Gun Free Funds: Gun retailer, count       178505 non-null  int64  
 8   Gun Free Funds: Gun retailer, weight      178505 non-null  float64
 9   Gun Free Funds: Gun retailer, asset       178505 non-null  int64  
dtypes: float64(3), int64

In [151]:
df_1yr.filter(like = 'Prison Free Funds: ').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178505 entries, 0 to 178504
Data columns (total 22 columns):
 #   Column                                                   Non-Null Count   Dtype  
---  ------                                                   --------------   -----  
 0   Prison Free Funds: Prison industrial complex grade       156189 non-null  object 
 1   Prison Free Funds: All flagged, count                    156189 non-null  float64
 2   Prison Free Funds: All flagged, weight                   156189 non-null  float64
 3   Prison Free Funds: All flagged, asset                    156189 non-null  float64
 4   Prison Free Funds: Prison industry, count                156189 non-null  float64
 5   Prison Free Funds: Prison industry, weight               156189 non-null  float64
 6   Prison Free Funds: Prison industry, asset                156189 non-null  float64
 7   Prison Free Funds: Border industry, count                156189 non-null  float64
 8   Prison Free Fu

In [152]:
df_1yr.filter(like = 'Tobacco Free Funds: ').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178505 entries, 0 to 178504
Data columns (total 7 columns):
 #   Column                                                               Non-Null Count   Dtype  
---  ------                                                               --------------   -----  
 0   Tobacco Free Funds: Tobacco grade                                    178505 non-null  object 
 1   Tobacco Free Funds: Tobacco producer, count                          178505 non-null  int64  
 2   Tobacco Free Funds: Tobacco producer, weight                         178505 non-null  float64
 3   Tobacco Free Funds: Tobacco producer, asset                          178505 non-null  int64  
 4   Tobacco Free Funds: Tobacco-promoting entertainment company, count   178505 non-null  int64  
 5   Tobacco Free Funds: Tobacco-promoting entertainment company, weight  178505 non-null  float64
 6   Tobacco Free Funds: Tobacco-promoting entertainment company, asset   178505 non-null  int64  

In [153]:
df_1yr.filter(like = 'Weapon Free Funds: ').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178505 entries, 0 to 178504
Data columns (total 13 columns):
 #   Column                                                    Non-Null Count   Dtype  
---  ------                                                    --------------   -----  
 0   Weapon Free Funds: Military weapon grade                  178505 non-null  object 
 1   Weapon Free Funds: Military weapon, count                 178505 non-null  int64  
 2   Weapon Free Funds: Military weapon, weight                178505 non-null  float64
 3   Weapon Free Funds: Military weapon, asset                 178505 non-null  int64  
 4   Weapon Free Funds: Major military contractors, count      178505 non-null  int64  
 5   Weapon Free Funds: Major military contractors, weight     178505 non-null  float64
 6   Weapon Free Funds: Major military contractors, asset      178505 non-null  int64  
 7   Weapon Free Funds: Nuclear weapons, count                 178505 non-null  int64  
 8   Weap

[Back to the Table of Contents](#0)

<a id="6"></a> 
## Data Cleaning

In [155]:
# check if there are duplicated rows

df_1yr.duplicated().sum()

0

In [156]:
# check if all rows have values for 'Month end trailing returns, 1 year', the main target variable

df_1yr['Month end trailing returns, 1 year'].isnull().mean()

0.03112517856642671

In [157]:
# check current dimension of the dataframe

df_1yr.shape

(178505, 143)

In [158]:
# drop the rows that do not have values for the main target

df_1yr = df_1yr.dropna(subset = ['Month end trailing returns, 1 year'])
df_1yr.shape

(172949, 143)

In [159]:
# some features and targets have too much missing data
# check the fraction of missing values in each target and feature variable

missing_rate = df_1yr.isnull().mean().sort_values()
missing_rate[missing_rate != 0.0]

Fossil Free Funds: Fossil fuel grade                                                                              0.000029
Month end trailing returns, 3 year                                                                                0.014259
Fund profile: Shareclass tickers                                                                                  0.016490
Fund profile: Ticker                                                                                              0.019728
Gender Equality Funds: Gender equality group ranking                                                              0.050356
Gender Equality Funds: Gender equality grade                                                                      0.052750
Month end trailing returns, 5 year                                                                                0.060648
Prison Free Funds: Private prison operators, weight                                                               0.129032
Prison Free Fund

In [160]:
# drop columns in which 10% or more of the values are missing
df_1yr = df_1yr.loc[:, df_1yr.isnull().mean() < .1]
df_1yr.shape

(172949, 91)
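
The two cleaning steps above (drop rows without the target, then drop sparse columns) can be sketched on a toy frame. The helper `drop_sparse_columns` and the toy column names are hypothetical, introduced only for illustration; returning the dropped names makes it easy to record what was removed.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, max_missing=0.1):
    """Hypothetical helper: keep columns whose missing share is below max_missing."""
    missing_share = frame.isnull().mean()
    keep = missing_share < max_missing
    dropped = missing_share[~keep].index.tolist()
    return frame.loc[:, keep], dropped

# Toy frame; names and values are invented for illustration only.
toy = pd.DataFrame({
    "target": [1.0, 2.0, np.nan, 4.0],
    "dense": [1, 2, 3, 4],
    "sparse": [np.nan, np.nan, 1.0, 2.0],
})

# Step 1: drop rows that lack the main target
toy = toy.dropna(subset=["target"])

# Step 2: drop columns with 10% or more missing values
cleaned, dropped = drop_sparse_columns(toy, max_missing=0.1)
print(cleaned.shape, dropped)
```

The order matters: the missing share of each column is computed after the rows without a target are removed, as in the cells above.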

In [161]:
len(set(df_1yr['Shareclass name']))

5579

In [168]:
len(df_1yr.filter(like = 'Deforestation Free Funds: ').columns)

10

In [169]:
len(df_1yr.filter(like = 'Fossil Free Funds: ').columns)

28

In [172]:
len(df_1yr.filter(like = 'Gender Equality Funds: ').columns)

4

In [170]:
len(df_1yr.filter(like = 'Gun Free Funds: ').columns)

10

In [175]:
len(df_1yr.filter(like = 'Prison Free Funds: ').columns)

0

In [174]:
len(df_1yr.filter(like = 'Tobacco Free Funds: ').columns)

7

In [167]:
len(df_1yr.filter(like = 'Weapon Free Funds: ').columns)

13
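
The per-category counts above can also be obtained in one pass by splitting each column name at the colon. This is a sketch on a made-up list of column names; on the real dataframe it would also count the 'Fund profile' and 'Returns and fees' prefixes.

```python
import pandas as pd

# Toy column names; a small invented subset for illustration only.
columns = pd.Index([
    "Fossil Free Funds: Fossil fuel grade",
    "Fossil Free Funds: Fossil fuel holdings, count",
    "Gun Free Funds: Civilian firearm grade",
    "Shareclass name",
])

# Take the text before the first colon as the category prefix
prefixes = [c.split(":")[0] for c in columns if ":" in c]

# Number of columns per category, sorted by category name
counts = pd.Series(prefixes).value_counts().sort_index()
print(counts)
```

A category that lost all of its columns, such as Prison Free Funds here, simply does not appear in the result, so comparing the counts before and after cleaning shows which categories were dropped.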

In [162]:
# Save the data
df_1yr.to_csv('data/shareclasses_one_year_return_max_months_less_than_10pct_missing.csv', index = False)

In [163]:
# save dtype of each column
datatypes_1yr  = df_1yr.dtypes.to_frame('dtypes').reset_index()
datatypes_1yr.tail()

Unnamed: 0,index,dtypes
86,Fund name,object
87,"Month end trailing returns, 1 year",float64
88,"Month end trailing returns, 3 year",float64
89,"Month end trailing returns, 5 year",float64
90,Financial performance as-of date,object


In [164]:
datatypes_1yr.to_csv('data/shareclasses_one_year_return_max_months_less_than_10pct_missing_datatypes.csv',
                     index = False)

[Back to the Table of Contents](#0)

<a id="7"></a> 
## Summary

In the data wrangling, I have done the following:

* 1. Downloaded and combined the full raw data set of 365103 rows and 146 columns. The data is saved for future use: 
    `all_results_data.csv`
    
    In this data, there are 4706 distinct funds and 14125 distinct share classes.
    
    
* 2. Extracted and organized the time series data of the share classes with the longest record (31 months) of the main target (Month end trailing returns, 1 year), and saved the data for the next step of the analysis:
    `shareclasses_one_year_return_max_months_long_time_series.csv`
    
    In this file, there are three columns: the share class name, the date of the financial performance, and the main target (month end trailing returns, 1 year). There are 5579 distinct share classes, each with observations of the target variable over the same 31 months.
    
    
* 3. Filtered the full data set to the share classes with the longest record of the main target, keeping all feature variables and the other candidate target variables, and saved the data for future use:
    `shareclasses_one_year_return_max_months_long_full_data.csv`
    
    In this file, there are 178505 rows and 143 columns, covering the same 5579 distinct share classes as the time series file. Including the main target, there are 12 target variables. The feature variables include 7 categories of ESG evaluations: Deforestation Free Funds (10 columns), Fossil Free Funds (38 columns), Gender Equality Funds (12 columns), Gun Free Funds (10 columns), Prison Free Funds (22 columns), Tobacco Free Funds (7 columns), and Weapon Free Funds (13 columns). Some of these target and feature variables have a lot of missing data.
    
    
* 4. Extracted a subset of targets and features that have less than 10 percent missing data for all the share classes with the longest record of the main target, and saved the data for the next step of the analysis:
    `shareclasses_one_year_return_max_months_less_than_10pct_missing.csv`
    
    In this data, the rows without a value for the main target were dropped, leaving 172949 rows and the same 5579 distinct share classes. As the target and feature variables with 10 percent or more missing data were also dropped, only 91 columns are left. Among them, there are 10 columns for the Deforestation Free Funds, 28 columns for the Fossil Free Funds, 4 columns for the Gender Equality Funds, 10 columns for the Gun Free Funds, 0 columns for the Prison Free Funds, 7 columns for the Tobacco Free Funds, and 13 columns for the Weapon Free Funds.  
    
In the next section of exploratory data analysis, I will use the tidy data of the time series (point 2): `shareclasses_one_year_return_max_months_long_time_series.csv`

and the clean data including both the target variable and the features that contain less than 10% missing values (point 4):
`shareclasses_one_year_return_max_months_less_than_10pct_missing.csv`

The main target variable is the month end trailing returns, 1 year, and the main feature variables are the ones in the category of Fossil Free Funds.

[Back to the Table of Contents](#0)