# Forecasting tax avoidance rates by international listed companies

## Problem description

**Tax avoidance is not a crime!** Companies that engage in it operate on the edge of the law. By definition, tax avoidance consists in arranging economic activities in a way that differs from what tax regulations anticipate, in order to reduce the tax burden. International tax avoidance arose from the globalization and liberalization of national economies, the weakening of trade barriers, and the development of new technologies. It is achieved through aggressive **tax optimization** (e.g. tax havens, double taxation agreements).

In this study, we check whether we can **forecast, one year ahead, the level of tax avoidance of a group of companies** listed on stock exchanges using shallow machine learning models. Econometric research shows that this problem is non-trivial and that the most important determinants come from the financial statements themselves. The question is whether any additional data sources can improve the forecast.

There are many ways to measure tax avoidance in the literature, and each has advantages and disadvantages. The most popular metric appears to be the Effective Tax Rate (ETR) = $\dfrac{\textrm{total tax expenses}}{\textrm{pre-tax income}}$. In this dataset, ETR takes values in the range [0, 1]. The measure applies directly to each jurisdiction and is based on annual data published in financial statements. As a result, the effective tax rate changes from year to year, and it cannot be determined when income tax is negative because deferred tax assets exceed current tax. ETR is used as the target (endogenous) variable in this study.

Two evaluation metrics were selected for this problem: Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). The choice was deliberate: MAE is relatively easy to interpret, while RMSE penalizes the model for large individual errors, which may be crucial in ETR forecasting. For this problem, absolute measures seem more appropriate than relative ones, since ETR can be zero or close to zero, which makes relative errors unstable. RMSE is treated as the most important metric.
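
The difference between the two metrics can be illustrated on toy numbers. The sketch below is not part of the original analysis: the function names and the toy forecasts are illustrative assumptions, and only the formulas for ETR, MAE and RMSE come from the text above.

```python
import numpy as np


def effective_tax_rate(total_tax_expenses, pre_tax_income):
    """ETR = total tax expenses / pre-tax income (element-wise)."""
    tax = np.asarray(total_tax_expenses, dtype=float)
    income = np.asarray(pre_tax_income, dtype=float)
    return tax / income


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy ETR values (illustrative only): three perfect forecasts and one large miss
y_true = np.array([0.20, 0.25, 0.30, 0.10])
y_pred = np.array([0.20, 0.25, 0.30, 0.50])

toy_mae = mae(y_true, y_pred)    # average absolute error
toy_rmse = rmse(y_true, y_pred)  # dominated by the single large error
```

On these toy values a single error of 0.4 yields an MAE of 0.1 but an RMSE of 0.2, which is the sense in which RMSE punishes large individual errors more strongly.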

This is a classic panel data problem (many companies observed over many years).

## Dataset description

The database used in the study was created for the paper "Determinants of multinational tax avoidance" (Agnieszka Teterycz and Anna Białek, PhD) from data retrieved from the Bloomberg database and from OECD and PwC reports. Information on the introduction of regulations concerning controlled foreign companies (CFCs) in the analysed countries was taken from OECD reports. Data on the number of double taxation agreements signed by the analysed countries were taken from PwC reports. The dataset covers companies included in the WIG, DAX, UK100, CAC40 and ATX indices, listed on stock exchanges in Poland, Germany, Great Britain, France and Austria in 2005-2017.

All companies from the financial (including banks) and insurance sectors were excluded from the analysis, as were those with missing data in the explanatory variables. In addition, observations with negative values of the financial result before tax and of income tax were removed, to avoid situations where a negative ETR value would be difficult to interpret. These exclusions and the removal of outliers reduced the panel from 7 800 to 4 719 observations. The panel is balanced (13 years x 363 companies). Missing values were imputed using company-level medians and means. The authors also used forward-fill interpolation at the beginning of the time series.
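
Company-level imputation of the kind described above could look roughly as follows. This is a minimal sketch on toy data, not the authors' code: the column names `firm`, `year` and `x` are hypothetical, and only the median variant is shown.

```python
import numpy as np
import pandas as pd

# Toy panel with two companies and one missing value each (illustrative only)
panel = pd.DataFrame({
    "firm": ["A", "A", "A", "B", "B", "B"],
    "year": [2005, 2006, 2007, 2005, 2006, 2007],
    "x":    [np.nan, 2.0, 4.0, 10.0, np.nan, 30.0],
})

# Median of the observed values of the same company, aligned to every row
firm_median = panel.groupby("firm")["x"].transform("median")

# Fill each gap with the median of its own company, never with another company's data
panel["x_imputed"] = panel["x"].fillna(firm_median)
```

Imputing within a company keeps the scale of each firm intact, which matters here because total assets differ by several orders of magnitude between companies.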

#### Columns description 

* index - technical index
* Ticker - company ticker from stock exchange
* Nazwa2 - full name of a company
* sektor - business sector of a company
* rok - year
* gielda - the stock exchange on which the company is listed {1: Warsaw, 2: Frankfurt, 3: London, 4: Paris, 5: Vienna}
* ta - total assets of a company 
* txt - total tax expenses of a company 
* pi - pre-tax income of a company 
* str - statutory tax rate of a company 
* xrd - research and development expenditure of a company 
* ni - net income of a company 
* ppent - net property, plant and equipment of a company
* intant - total intangible assets of a company 
* dlc - long term debt of a company 
* dltt - short term debt of a company 
* capex - capital expenditures of a company 
* revenue - revenue of a company 
* cce - cash and cash equivalents of a company 
* adv - advertising expenses of a company  
* etr - effective tax rate of a company
* diff - statutory tax rate - effective tax rate
* roa - return on assets of a company
* lev - leverage of a company
* intan - intangible assets/total assets
* rd - research and development expenditure/total assets
* ppe - property, plant and equipment/total assets
* sale - log(1 + revenue of a company/total assets)
* cash_holdings - cash and cash equivalents of a company/total assets
* adv_expenditure - advertising expenses/total assets
* capex2 - capex/property plant and equipment
* cfc - controlled foreign company (CFC) regulations in the company's country
* dta - double taxation agreements of the company's country
* capex2_scaled - scaled capex2

The remaining columns are technical and redundant, so they will be deleted.
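
Several of the ratio columns can be recomputed from the raw columns. The sketch below does this for one observation copied from the `df.sample(10)` output in the Data loading section (Sage Group PLC, 2009). The formulas for `lev` and `sale` are not stated by the authors; they are assumptions that happen to reproduce the values of this row.

```python
import numpy as np
import pandas as pd

# Raw values of one observation, copied from the sample output (Sage Group PLC, 2009)
raw = pd.DataFrame({
    "ta": [2738.5], "txt": [77.900002], "pi": [267.399994], "str": [0.28],
    "ni": [144.5], "dlc": [460.600006], "dltt": [18.799999],
    "revenue": [1439.300049],
})

derived = pd.DataFrame({
    "etr": raw["txt"] / raw["pi"],                  # effective tax rate
    "roa": raw["ni"] / raw["ta"],                   # return on assets
    "lev": (raw["dlc"] + raw["dltt"]) / raw["ta"],  # assumed: total debt / total assets
    "sale": np.log1p(raw["revenue"] / raw["ta"]),   # assumed: log(1 + revenue / total assets)
})
# Statutory tax rate minus effective tax rate
derived["diff"] = raw["str"] - derived["etr"]
```

A check of this kind is a cheap way to confirm what each derived column means before using it as a feature.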

## Dependencies loading

In [92]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', 500)

## Data preparation

#### Data loading

In [110]:
df = pd.read_stata("../data/tax_avoidance.dta")

In [111]:
df.sample(10)

Unnamed: 0,index,Ticker,Nazwa2,sektor,rok,gielda,ta,txt,pi,str,xrd,ni,ppent,intant,dlc,dltt,capex,revenue,cce,adv,etr,diff,roa,lev,intan,rd,ppe,sale,cash_holdings,adv_expenditure,capex2,cfc,dta,capex2_scaled,firm_id,firma_id,rok2005,rok2006,rok2007,rok2008,rok2009,rok2010,rok2011,rok2012,rok2013,rok2014,rok2015,rok2016,rok2017,industry,industry1,capex1,roa1,country1,country2,country3,country4,country5,industry11,industry12,industry13,industry14,industry15,industry16,industry17,industry18,industry19,industry20,diff1,diff2,diff3,_est_random,_est_fixed
3709,3986,SGE LN Equity,Sage Group PLC/The,technology,2009,3,2738.5,77.900002,267.399994,0.28,189.5,144.5,174.600006,2246.800049,460.600006,18.799999,19.5,1439.300049,59.400002,0.0,0.291324,-0.011324,0.052766,0.175059,0.820449,0.069198,0.063758,0.422374,0.021691,0.0,0.111684,1,1,0.000104,Sage Group PLC/The,Sage Group PLC/The,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,technology,technology,3.020425,0.052766,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,-0.011324,-0.011324,-0.011324,1,1
1740,274,ATT PW Equity,Grupa Azoty SA,materials,2016,1,10993.594727,116.833,432.075989,0.19,0.0,301.869995,6360.625977,507.431,1415.218994,127.711998,1241.666992,8966.803711,641.89502,0.0,0.270399,-0.080399,0.027459,0.140348,0.046157,0.0,0.578576,0.596437,0.058388,0.0,0.195211,1,0,0.000181,Grupa Azoty SA,Grupa Azoty SA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,materials,materials,7.125015,0.027459,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,-0.080399,-0.080399,-0.080399,1,1
1152,3150,PBB GY Equity,Deutsche Pfandbriefbank AG,real estate,2013,2,104979.888889,35.555556,-52.888889,0.289411,20.222222,-108.0,5.222222,29.666667,58054.666667,26075.777778,0.0,3096.444444,1121.222222,0.0,0.0,0.96168,-0.001029,0.801396,0.000283,0.000193,5e-05,0.029069,0.01068,0.0,0.0,1,0,0.0,Deutsche Pfandbriefbank AG,Deutsche Pfandbriefbank AG,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,real estate,real estate,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.96168,0.96168,0.96168,1,1
1976,1026,IPX PW Equity,Impexmetal SA,materials,2005,1,1815.302002,17.007,73.389,0.19,0.0,60.662998,858.851013,23.127001,87.144997,490.544006,133.554993,2613.037109,36.341,0.0,0.231738,-0.041738,0.033418,0.318233,0.01274,0.0,0.473117,0.891773,0.020019,0.0,0.155504,0,0,0.000144,Impexmetal SA,Impexmetal SA,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,materials,materials,4.901973,0.033418,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,-0.041738,-0.041738,-0.041738,1,1
3399,1176,KMP PW Equity,Przedsiebiorstwo Produkcyjno-Handlowe Ko,materials,2011,1,111.149002,-1.285,23.666,0.19,0.0,25.674,83.389999,0.371,14.787,10.021,2.262,52.935001,0.711,0.0,0.0,0.244297,0.230987,0.223196,0.003338,0.0,0.750254,0.389507,0.006397,0.0,0.027126,0,0,2.5e-05,Przedsiebiorstwo Produkcyjno-Handlowe Ko,Przedsiebiorstwo Produkcyjno-Handlowe Ko,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,materials,materials,1.182341,0.230987,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.244297,0.244297,0.244297,1,1
575,360,BFT PW Equity,Benefit Systems SA,industrials,2008,1,8.561,0.925,4.698,0.19,0.0,3.773,0.075,0.001,0.0,19.43811,0.256,26.363001,4.561,0.0,0.196892,-0.006892,0.44072,1.0,0.000117,0.0,0.008761,1.405957,0.532765,0.0,3.413333,0,0,0.003168,Benefit Systems SA,Benefit Systems SA,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,industrials,industrials,0.227932,0.44072,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,-0.006892,-0.006892,-0.006892,1,1
3696,4415,SAF FP Equity,Safran SA,industrials,2009,4,18169.0,235.0,893.0,0.3443,659.0,641.0,2201.0,5343.0,1904.0,1367.0,293.0,10559.0,415.0,0.0,0.263158,0.081142,0.03528,0.180032,0.294072,0.036271,0.12114,0.458155,0.022841,0.0,0.133121,1,1,0.000124,Safran SA,Safran SA,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,industrials,industrials,5.68358,0.03528,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.081142,0.081142,0.081142,1,1
3601,3197,RHM GY Equity,Rheinmetall AG,consumer discretionary,2005,2,3423.0,52.0,170.0,0.29617,144.0,113.0,1052.0,417.0,397.0,162.0,190.0,3454.0,408.0,0.0,0.305882,-0.009712,0.033012,0.163307,0.121823,0.042068,0.307333,0.697665,0.119194,0.0,0.180608,1,0,0.000168,Rheinmetall AG,Rheinmetall AG,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,consumer discretionary,consumer discretionary,5.252274,0.033012,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.009712,-0.009712,-0.009712,1,1
123,84,ACG PW Equity,Ac SA,consumer discretionary,2011,1,101.371002,5.896,28.479,0.19,0.0,22.583,42.835999,0.582,4.314,0.0,14.087,140.625,20.68,0.0,0.20703,-0.01703,0.222776,0.042557,0.005741,0.0,0.422567,0.870134,0.204003,0.0,0.328859,0,0,0.000305,Ac SA,Ac SA,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,consumer discretionary,consumer discretionary,2.713833,0.222776,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.01703,-0.01703,-0.01703,1,1
232,131,AMB PW Equity,Ambra SA,consumer staples,2016,1,411.759003,6.409,31.353001,0.19,0.0,17.868999,103.253998,60.479,23.806,25.097,12.185,422.239014,9.866,0.0,0.204414,-0.014414,0.043397,0.118766,0.14688,0.0,0.250763,0.705793,0.023961,0.0,0.11801,1,0,0.00011,Ambra SA,Ambra SA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,consumer staples,consumer staples,2.57908,0.043397,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.014414,-0.014414,-0.014414,1,1


#### Dataset adjustment

##### Removing redundant variables

In [112]:
df.columns

Index(['index', 'Ticker', 'Nazwa2', 'sektor', 'rok', 'gielda', 'ta', 'txt',
       'pi', 'str', 'xrd', 'ni', 'ppent', 'intant', 'dlc', 'dltt', 'capex',
       'revenue', 'cce', 'adv', 'etr', 'diff', 'roa', 'lev', 'intan', 'rd',
       'ppe', 'sale', 'cash_holdings', 'adv_expenditure', 'capex2', 'cfc',
       'dta', 'capex2_scaled', 'firm_id', 'firma_id', 'rok2005', 'rok2006',
       'rok2007', 'rok2008', 'rok2009', 'rok2010', 'rok2011', 'rok2012',
       'rok2013', 'rok2014', 'rok2015', 'rok2016', 'rok2017', 'industry',
       'industry1', 'capex1', 'roa1', 'country1', 'country2', 'country3',
       'country4', 'country5', 'industry11', 'industry12', 'industry13',
       'industry14', 'industry15', 'industry16', 'industry17', 'industry18',
       'industry19', 'industry20', 'diff1', 'diff2', 'diff3', '_est_random',
       '_est_fixed'],
      dtype='object')

In [113]:
df.drop(columns=['index', 'firm_id', 'firma_id', 'rok2005', 'rok2006',
       'rok2007', 'rok2008', 'rok2009', 'rok2010', 'rok2011', 'rok2012',
       'rok2013', 'rok2014', 'rok2015', 'rok2016', 'rok2017', 'industry',
       'industry1', 'capex1', 'roa1', 'country1', 'country2', 'country3',
       'country4', 'country5', 'industry11', 'industry12', 'industry13',
       'industry14', 'industry15', 'industry16', 'industry17', 'industry18',
       'industry19', 'industry20', 'diff1', 'diff2', 'diff3', '_est_random',
       '_est_fixed'], inplace = True)

df.shape

(4719, 33)

##### Checking if every variable has proper type

In [114]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4719 entries, 0 to 4718
Data columns (total 33 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Ticker           4719 non-null   object 
 1   Nazwa2           4719 non-null   object 
 2   sektor           4719 non-null   object 
 3   rok              4719 non-null   int16  
 4   gielda           4719 non-null   int8   
 5   ta               4719 non-null   float64
 6   txt              4719 non-null   float64
 7   pi               4719 non-null   float64
 8   str              4719 non-null   float64
 9   xrd              4719 non-null   float64
 10  ni               4719 non-null   float64
 11  ppent            4719 non-null   float64
 12  intant           4719 non-null   float64
 13  dlc              4719 non-null   float64
 14  dltt             4719 non-null   float64
 15  capex            4719 non-null   float64
 16  revenue          4719 non-null   float64
 17  cce           

##### Endogenous variable shifting

In [115]:
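# Sort so that each company's observations are in chronological order
df.sort_values(by=["Ticker", "rok"], inplace=True)
# Use next year's ETR of the same company as the target for the current row
df["etr"] = df.groupby("Ticker")["etr"].shift(-1)
# 2017 has no following year, so it has no target and is dropped
df = df[df.rok != 2017]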

df.shape

(4356, 33)
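
The effect of the target shift is easier to see on a toy panel. The sketch below is illustrative only: the tickers and ETR values are made up, and the extra column `etr_next` is a hypothetical name used to keep the original and the shifted values side by side.

```python
import pandas as pd

# Toy panel: two companies, three years each (made-up values)
toy = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "rok":    [2015, 2016, 2017, 2015, 2016, 2017],
    "etr":    [0.10, 0.20, 0.30, 0.40, 0.50, 0.60],
})

toy = toy.sort_values(["Ticker", "rok"])

# Next year's ETR of the same company becomes the target of the current row
toy["etr_next"] = toy.groupby("Ticker")["etr"].shift(-1)

# The last year of each company has no target, so it is dropped
toy = toy.dropna(subset=["etr_next"]).reset_index(drop=True)
```

Shifting within each company guarantees that a target never comes from a different firm, even if the panel were unbalanced.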

#### External data adding

In line with the efficient market hypothesis, we assumed that all information about a specific company is reflected in its book values at the end of the year, so analyzing additional company-level stock market data would probably add little. In addition, company names (tickers) are not perfectly consistent, and in some cases it is very difficult to find the right company on websites such as stooq.pl. A bulk download is therefore not feasible, and a manual download would take too much time. We therefore chose variables that describe the condition of the selected stock exchanges and of the entire economy in each country.

##### First source: World Bank, V-Dem index, Polity index, BR index, BMR index
We assume that variables such as democracy indices (a proxy for investment sentiment), GDP growth, GDP per capita and inflation (proxies for the business cycle) might be significant for tax avoidance predictions. Data were gathered for the five countries mentioned above (2005-2017). This data chunk is part of our own dataset, which collects determinants of democracy. We do not want to share the full dataset yet, so we use only a slice of it.

In [121]:
df_extra0 = pd.read_csv("../data/external_dataset.csv")

In [122]:
df_extra0.head()

Unnamed: 0,country_name,year,y_v2x_polyarchy,y_e_p_polity,y_BR_Democracy,y_BMR_democracy,WB_GDPgrowth,WB_GDPpc,WB_Inflation
0,Austria,2005,0.855,10.0,1.0,1.0,2.244065,38403.133877,2.299139
1,Austria,2006,0.863,10.0,1.0,1.0,3.454042,40635.281816,1.441547
2,Austria,2007,0.885,10.0,1.0,1.0,3.727415,46855.771745,2.168556
3,Austria,2008,0.884,10.0,1.0,1.0,1.460424,51708.765754,3.21595
4,Austria,2009,0.894,10.0,1.0,1.0,-3.764578,47963.179402,0.506308
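
Country-level variables have to be attached to every company-year row. One way to do it is sketched below on toy data. The mapping from `gielda` to country names is an assumption (only "Austria" is visible in the output above, and the codes for Germany and Great Britain are inferred from the tickers in the sample), and the ticker is hypothetical.

```python
import pandas as pd

# Assumed mapping from the exchange code to the country name used in the external file
GIELDA_TO_COUNTRY = {
    1: "Poland", 2: "Germany", 3: "United Kingdom", 4: "France", 5: "Austria",
}

# Toy company panel (hypothetical ticker listed in Vienna)
companies = pd.DataFrame({
    "Ticker": ["X1 AV Equity", "X1 AV Equity"],
    "rok": [2005, 2006],
    "gielda": [5, 5],
})

# Two rows copied from the external dataset preview
macro = pd.DataFrame({
    "country_name": ["Austria", "Austria"],
    "year": [2005, 2006],
    "WB_GDPgrowth": [2.244065, 3.454042],
    "WB_Inflation": [2.299139, 1.441547],
})

companies["country_name"] = companies["gielda"].map(GIELDA_TO_COUNTRY)

# Left join keeps every company row; validate fails loudly on duplicated country-year keys
merged = companies.merge(
    macro,
    left_on=["country_name", "rok"],
    right_on=["country_name", "year"],
    how="left",
    validate="many_to_one",
).drop(columns="year")
```

The `validate="many_to_one"` argument guards against silently multiplying company rows if the external file contained duplicate country-year entries.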


##### Second source: main stock market indexes for Poland, Germany, Great Britain, France and Austria
We assume that these variables might be an additional proxy for market conditions in each country.

In [133]:
# Yearly index data, one file per country; variable names follow the file names
cac = pd.read_csv("../data/external_dataset1/^cac_y.csv")
dax = pd.read_csv("../data/external_dataset1/^dax_y.csv")
a5c = pd.read_csv("../data/external_dataset1/a5_c_y.csv")
wig = pd.read_csv("../data/external_dataset1/wig_y.csv")
xf = pd.read_csv("../data/external_dataset1/x_f_y.csv")
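
One possible way to turn a yearly index series into a market-condition feature is the yearly return. The sketch below is an assumption, not something taken from the notebook: the column names `Date` and `Close` are guesses at the file layout, and the values are made up.

```python
import pandas as pd

# Hypothetical content of one yearly index file; real column names may differ
index_raw = pd.DataFrame({
    "Date": ["2004-12-31", "2005-12-30", "2006-12-29"],
    "Close": [100.0, 110.0, 99.0],
})

# Calendar year of each observation, used later as the join key
index_raw["year"] = pd.to_datetime(index_raw["Date"]).dt.year

# Year-over-year relative change of the closing value
index_raw["index_return"] = index_raw["Close"].pct_change()

# The first year has no previous value, so it has no return
features = index_raw.dropna(subset=["index_return"])[["year", "index_return"]]
```

Returns are comparable across countries, unlike raw index levels, which differ in scale between exchanges.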

##### Third source: S&P indexes for sectors
We believe that S&P sector indexes might be good proxies for the condition of each sector.

#### Dataset splitting

We decided to split the dataset into:
 * **train (& validation) dataset** - 2005 - 2015 (years refer to the exogenous variables; the target is the following year's ETR) x 363 companies
 * **test (out of sample / out of time) dataset** - 2016 (year of the exogenous variables) x 363 companies

The test dataset will be used **only** for the final predictions! The authors assume that they have no access to it during the study and do not examine its statistical properties.

In [100]:
df_train = df[df.rok != 2016]
df_test = df[df.rok == 2016]

In [101]:
df_train.shape

(3993, 33)

In [102]:
df_test.shape

(363, 33)
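
The same out-of-time split can be wrapped in a small helper, which makes the rule explicit and easy to check. This is a sketch on toy data: the function name `split_by_year` and the toy tickers are hypothetical.

```python
import pandas as pd


def split_by_year(panel: pd.DataFrame, test_year: int, year_col: str = "rok"):
    """Rows before `test_year` go to train, rows of `test_year` go to test."""
    train = panel[panel[year_col] < test_year].copy()
    test = panel[panel[year_col] == test_year].copy()
    return train, test


# Toy panel: two companies observed in 2014-2016 (made-up tickers)
toy = pd.DataFrame({
    "Ticker": ["AAA"] * 3 + ["BBB"] * 3,
    "rok": [2014, 2015, 2016] * 2,
})

train, test = split_by_year(toy, test_year=2016)

# Out-of-time property: every training year precedes every test year
is_out_of_time = train["rok"].max() < test["rok"].min()
```

With 363 companies, 11 training years give 11 x 363 = 3993 rows and the single test year gives 363 rows, which matches the shapes printed above.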

## Initial descriptive analyses of the data

In [107]:
df.head()

Unnamed: 0,Ticker,Nazwa2,sektor,rok,gielda,ta,txt,pi,str,xrd,ni,ppent,intant,dlc,dltt,capex,revenue,cce,adv,etr,diff,roa,lev,intan,rd,ppe,sale,cash_holdings,adv_expenditure,capex2,cfc,dta,capex2_scaled
13,11B PW Equity,11 bit studios SA,communication,2005,1,21.127613,1.24185,6.329725,0.19,0.0,5.0879,0.276275,4.1959,0.0,0.0,2.223413,11.873301,12.142975,0.0,0.196193,-0.006193,0.240818,0.0,0.198598,0.0,0.013076,0.445954,0.574744,0.0,8.047824,0,0,0.007469
14,11B PW Equity,11 bit studios SA,communication,2006,1,21.127613,1.24185,6.329725,0.19,0.0,5.0879,0.276275,4.1959,0.0,0.0,2.223413,11.873301,12.142975,0.0,0.196193,-0.006193,0.240818,0.0,0.198598,0.0,0.013076,0.445954,0.574744,0.0,8.047824,0,0,0.007469
15,11B PW Equity,11 bit studios SA,communication,2007,1,21.127613,1.24185,6.329725,0.19,0.0,5.0879,0.276275,4.1959,0.0,0.0,2.223413,11.873301,12.142975,0.0,0.196193,-0.006193,0.240818,0.0,0.198598,0.0,0.013076,0.445954,0.574744,0.0,8.047824,0,0,0.007469
16,11B PW Equity,11 bit studios SA,communication,2008,1,21.127613,1.24185,6.329725,0.19,0.0,5.0879,0.276275,4.1959,0.0,0.0,2.223413,11.873301,12.142975,0.0,0.196193,-0.006193,0.240818,0.0,0.198598,0.0,0.013076,0.445954,0.574744,0.0,8.047824,0,0,0.007469
17,11B PW Equity,11 bit studios SA,communication,2009,1,21.127613,1.24185,6.329725,0.19,0.0,5.0879,0.276275,4.1959,0.0,0.0,2.223413,11.873301,12.142975,0.0,0.188487,-0.006193,0.240818,0.0,0.198598,0.0,0.013076,0.445954,0.574744,0.0,8.047824,0,0,0.007469


In [109]:
df.sektor.value_counts()

consumer discretionary    924
industrials               816
materials                 672
technology                444
consumer staples          324
real estate               312
communication             300
health care               228
utilities                 204
energy                    132
Name: sektor, dtype: int64