---
title: |
  | Earnings Management and Investor Protection:
  | Accounting Reading Group - Assignment III\vspace{1cm}
author:
  - name: Melisa Mazaeva
    email: melisa.mazaeva@student.hu-berlin.de
    affiliations:
      - Humboldt-Universität zu Berlin  
date: today
date-format: MMM D, YYYY [\vspace{1cm}]
abstract: |
  | This project uses the TRR 266 Template for Reproducible Empirical Accounting Research (TREAT) to provide an infrastructure for open science-oriented empirical projects. Leveraging external Worldscope financial data, the repository showcases a reproducible workflow that integrates Python scripts for data analysis. The project’s output replicates and extends findings from the seminal paper by Leuz, Nanda, and Wysocki (2003), in particular the descriptive statistics for the four individual earnings management measures and the aggregate earnings management score across countries. In doing so, it documents and discusses the research design choices made and the differences between the original and reproduced results. The code base, adapted from TREAT, also illustrates how the template can be used to structure a reproducible empirical project.
  | \vspace{6cm}
bibliography: references.bib
biblio-style: apsr
format:
  pdf:
    documentclass: article
    number-sections: true
    toc: false
fig_caption: yes
fontsize: 11pt
indent: yes
always_allow_html: yes
number-sections: true 
header-includes:
  - \usepackage[nolists]{endfloat}    
  - \usepackage{setspace}\doublespacing
  - \setlength{\parindent}{4em}
  - \setlength{\parskip}{0em}
  - \usepackage[hang,flushmargin]{footmisc}
  - \usepackage{caption} 
  - \captionsetup[table]{skip=24pt,font=bf}
  - \usepackage{array}
  - \usepackage{threeparttable}
  - \usepackage{adjustbox}
  - \usepackage{graphicx}
  - \usepackage{csquotes}
  - \usepackage{indentfirst}  # Added this line to ensure the first paragraph is indented for better readability
  - \usepackage[margin=1in]{geometry}
---


\pagebreak

# Research Design Choices and Assumptions {#sec-research_design_assumptions}

The aim of Assignment III is to replicate a specific empirical table (Table 2 Panel A) from the seminal paper by @Leuz_2003. This table involves calculating the earnings management (EM) measures for firms across various countries over a defined period and examining the relationship between these measures and investor protection. The replication process includes data loading, preparation, cleaning, and normalization, followed by the application of statistical methods to compute and interpret financial metrics. For Assignment III, I pulled data from the Worldscope database through WRDS and used the Python programming language to carry out the empirical analysis. Visual Studio Code was used as the Integrated Development Environment (IDE) for writing, debugging, and optimizing the Python code.

The replication is based on data pulled from the Worldscope database, specifically from the `wrds_ws_company` and `wrds_ws_funda` tables, which were merged for the analysis. The first table provides company profile information, including items such as ISIN, Worldscope Identifiers, company name, and the country where the company is domiciled [@WRDS_WS_Company_2024]. The second table contains Fundamentals Annual data at the company-year level, including items such as total assets, net income, and other relevant financial variables [@WRDS_WS_Funda_2024].

Following @Leuz_2003, I focus the analysis on companies across various countries, ensuring that the data accurately reflects the fiscal years 1990 to 1999 as specified in the original study. The replication aims to mirror the research design as closely as possible with the available data.

In addition, I impose the following assumptions to ensure clarity and consistency where the paper by @Leuz_2003 does not provide explicit guidance:

1. The original paper references the November 2000 version of the Worldscope Database. However, the data used for this analysis represents the latest available version, updated in July 2024, with quarterly frequency updates [@WRDS_2024]. Due to potential adjustments and updates made to the database since November 2000, there may be differences between the databases that could affect the results. For example, companies may restate financials after the original reporting period, so that these restatements are reflected in the later database version rather than the historical one. Moreover, the data vendor Refinitiv regularly updates its databases to correct errors and add new information, which may be included in the later data but not in the November 2000 snapshot.
2. The original paper outlines key terms that will be used in this project to ensure consistency and accuracy in the replication. @Leuz_2003 define earnings management as the manipulation of a firm's reported economic performance by insiders to deceive certain stakeholders or to affect contractual results. The authors describe investor protection as a key institutional factor that limits insiders' acquisition of private control benefits, thereby reducing their incentives to manage accounting earnings by ensuring strong and well-enforced rights for outside investors. Finally, private control benefits are the benefits that insiders can gain from controlling a firm, which can include financial gains or other advantages that are not shared with other stakeholders [@Leuz_2003].
3. While pulling the data, I encountered negative values for some key financial metrics, such as operating income (`item1250`) and net income before preferred dividends (`item1651`). @Leuz_2003 do not explicitly specify how to handle such values. For this replication, I retain negative values in the analysis so that it captures the full spectrum of earnings management activities across countries.
4. Another potential source of discrepancies between the original and replicated tables is the choice of variables pulled from Worldscope. For example, in the `wrds_ws_funda` table, both `item1151` and `item4051` are named “DEPRECIATION, DEPLETION AND AMORTIZATION” [@WRDS_WS_Funda_2024]. I chose `item1151` from the Income Statement rather than `item4051` from the Cash Flow Statement, based on the Excel Industrials Template by @WRDS_Balancing_Model_2024. Since the authors do not specify which database variables they used, this choice could cause differences in the results.
5. The EM measures are based on scaled variables (e.g., operating cash flow scaled by lagged total assets). As such, the currency of the relevant data items should not affect the results as long as the same currency is used consistently for both the numerator and the denominator. This approach ensures comparability across different countries, regardless of their local currencies. Additionally, according to @Thomson_Financial_2007 [p. 20], all Worldscope data is consistently reported in the local currency of each firm’s country of domicile, eliminating the need for currency conversions in this project.
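
The currency argument in assumption 5 can be checked on a toy example. The sketch below is only an illustration under stated assumptions: the column names `cfo` and `total_assets`, the numbers, and the conversion factor are all made up, and the factor is held constant across years. The lag is taken per firm by row order after sorting, which assumes no gaps between fiscal years.

```python
import pandas as pd

# Made-up firm data in "local currency"; cfo and total_assets are
# hypothetical column names, not Worldscope item codes.
df = pd.DataFrame({
    "item6105": ["C001"] * 3,
    "year_": [1995, 1996, 1997],
    "cfo": [10.0, 12.0, 9.0],
    "total_assets": [100.0, 120.0, 150.0],
})

def scale_by_lagged_assets(data):
    """Return cfo divided by the firm's total assets of the previous row."""
    data = data.sort_values(["item6105", "year_"])
    lagged_assets = data.groupby("item6105")["total_assets"].shift(1)
    return data["cfo"] / lagged_assets

scaled_local = scale_by_lagged_assets(df)

# Express both numerator and denominator in another currency using one
# constant (arbitrary) conversion factor.
fx = 1.95583
converted = df.assign(cfo=df["cfo"] * fx, total_assets=df["total_assets"] * fx)
scaled_converted = scale_by_lagged_assets(converted)

# The conversion factor cancels, so the largest difference is at most
# floating-point rounding error.
max_diff = (scaled_local - scaled_converted).abs().max()
```

The cancellation relies on the same factor applying to both years. If numerator and denominator were converted at different exchange rates, the ratio would change.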

By following the steps provided in @sec-replication_steps and adhering to the assumptions made, I successfully replicated the analysis and produced the required table. A thorough step-by-step approach, with each step clearly documented, helped to understand and verify the outputs.


# Replication Steps {#sec-replication_steps}
## Step 1: Pulling the Data and Managing the Databases
In contrast to Assignment I, where the data was provided externally, Assignment III additionally involves pulling data directly from the Worldscope database, merging the relevant tables, and preparing the data for further analysis, covering the full path from raw data to final output.

To ensure the integrity of the data, I filtered out rows with empty `item6105` (Worldscope Permanent ID) values, as this identifier is critical for firm/year level filtering in the data preparation step. In total, 125 observations from the dynamic data and 10,306 observations from the static data were removed. To compile the dataset, the dynamic and static datasets were merged on the `item6105` identifier, representing the unique Worldscope Permanent ID. WRDS advises using this identifier consistently within Worldscope data since it remains stable over time [@WRDS_Overview_2024]. An inner join was used for this merge, because this approach ensures that only the complete and consistent data from both tables is retained. 
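
The cleaning and merge logic described above can be sketched with pandas. This is a minimal illustration on made-up rows, not the project's actual pull script: the frame names `static` and `dynamic` and the columns `country` and `total_assets` are hypothetical placeholders, while `item6105` and `year_` are the fields named in the text.

```python
import pandas as pd

# Toy stand-ins for the static (company) and dynamic (fundamentals) tables.
# All values and the columns country / total_assets are made up.
static = pd.DataFrame({
    "item6105": ["C001", "C002", None],
    "country": ["GERMANY", "FRANCE", "ITALY"],
})
dynamic = pd.DataFrame({
    "item6105": ["C001", "C001", "C003", None],
    "year_": [1995, 1996, 1995, 1995],
    "total_assets": [100.0, 110.0, 50.0, 70.0],
})

# Drop rows without a Worldscope Permanent ID before merging.
static_clean = static.dropna(subset=["item6105"])
dynamic_clean = dynamic.dropna(subset=["item6105"])

# Inner join: keep only firm-years whose ID appears in both tables.
# validate raises if the static table has more than one row per ID.
merged = dynamic_clean.merge(
    static_clean, on="item6105", how="inner", validate="many_to_one"
)
```

In this toy case only the two firm-years of `C001` survive, because `C002` has no fundamentals and `C003` has no company profile.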

Moreover, an additional filter was applied to retain only company rows. This was achieved by selecting rows where the `item6100` field equals 'C', indicating that the Worldscope Identifier represents a company. This step ensures that the analysis includes only company data, excluding averages, exchange rates, securities, and stock indices, as indicated by @WRDS_Overview_2024.

As required by @Leuz_2003, financial institutions are removed from the analysis based on their SIC codes. This is done by filtering out rows where the `item7021` identifier, representing the SIC code, falls within the range of 6000 to 6999. Hence, the dataset focuses only on non-financial companies, aligning with the methodology of the original paper.
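
The company and SIC filters can be expressed as two boolean masks. The snippet below is a sketch on made-up rows. It assumes that `item7021` is stored as a number, so if the pulled data holds SIC codes as strings, they would need to be converted first.

```python
import pandas as pd

# Made-up rows; only the field names come from the text above.
df = pd.DataFrame({
    "item6105": ["C001", "C002", "C003", "C004"],
    "item6100": ["C", "C", "S", "C"],      # 'C' marks a company record
    "item7021": [3711, 6021, 2834, 6999],  # SIC code (assumed numeric)
})

is_company = df["item6100"] == "C"
# between() is inclusive on both ends, so 6000 and 6999 count as financial.
is_financial = df["item7021"].between(6000, 6999)

non_financial = df[is_company & ~is_financial]
```

Only `C001` remains here: `C002` and `C004` fall in the financial SIC range, and `C003` is not a company record.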

Finally, additional filtering ensures that only data from the 31 countries, as given in the paper, is included. These filtering steps were applied to reduce the dataset size and improve the workflow. 

Notably, the configuration file applies an additional refinement proposed by @WRDS_Overview_2024. Setting the `freq` variable to `A` (Annual) ensures that the data represents financial information reported on an annual basis, which is consistent with the paper’s methodology. This excludes data reported on a current, daily, or quarterly basis.

After retaining only relevant company data, filtering out financial institutions, and focusing on the specified countries, I saved the processed data to a CSV file at the path `data/pulled` specified in the configuration file `config/pull_data_cfg`.

## Step 2: Data Preparation
To verify the pulled data, I checked the dataset for duplicate firm-year observations based on the combination of the Worldscope Permanent ID (`item6105`) and the fiscal year (`year_`) and confirmed that no duplicates were present, ensuring the accuracy of the data for further analysis. As the original study does not specify the net income measure, `item1651` (Net Income before Preferred Dividends) was used as the variable for net income. This choice aligns with the final net income figure reported in the income statement, based on the Excel Industrials Template by @WRDS_Balancing_Model_2024.
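
The duplicate check on firm-year observations mentioned above can be written in a few lines. This is a sketch on made-up rows rather than the project's own preparation script. Only the key columns `item6105` and `year_` are taken from the text.

```python
import pandas as pd

# Made-up firm-year rows.
df = pd.DataFrame({
    "item6105": ["C001", "C001", "C002"],
    "year_": [1995, 1996, 1995],
})

# duplicated() flags every repeat of an (ID, year) pair after its first
# occurrence, so the sum counts the surplus rows.
n_duplicates = int(df.duplicated(subset=["item6105", "year_"]).sum())

if n_duplicates > 0:
    raise ValueError(f"{n_duplicates} duplicate firm-year rows found")
```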

To prepare the data sample in line with the methodology outlined in the paper, it is essential to follow the requirements, as instructed by @Leuz_2003. Firstly, countries with sufficient firm-year observations must be filtered. Each country should have at least 300 firm-year observations for certain key accounting variables, including total assets, sales, net income, and operating income. In this step, no countries were eliminated as all countries met the requirement, which aligns with the paper's overview of countries.
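
One way to implement the country-level requirement is to count, per country, the firm-years in which all key variables are present. The function below is a sketch under assumptions: the column names in `KEY_VARS` and `country` are hypothetical, and the toy call lowers the threshold to 2 so that the effect is visible on three rows.

```python
import pandas as pd

# Hypothetical column names for the key accounting variables.
KEY_VARS = ["total_assets", "sales", "net_income", "operating_income"]

def keep_countries_with_enough_obs(df, min_obs=300):
    """Keep rows of countries with at least min_obs complete firm-years."""
    complete = df.dropna(subset=KEY_VARS)
    counts = complete.groupby("country").size()
    eligible = counts[counts >= min_obs].index
    return df[df["country"].isin(eligible)]

# Made-up data: country A has two complete firm-years, country B has one.
toy = pd.DataFrame({
    "country": ["A", "A", "B"],
    "total_assets": [100.0, 110.0, 50.0],
    "sales": [80.0, 85.0, 40.0],
    "net_income": [5.0, 6.0, 2.0],
    "operating_income": [8.0, 9.0, 3.0],
})

filtered = keep_countries_with_enough_obs(toy, min_obs=2)
```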

In the second filtering step, firms with adequate consecutive data must be identified. Each firm must have income statement and balance sheet information for at least three consecutive years, with all key accounting variables mentioned above present. If a firm had at least three consecutive years of complete data at any point, all its data entries were retained in the final dataset. Therefore, only those countries and firms that meet these criteria for all specified variables are retained in the dataset.
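
The consecutive-years rule can be implemented by measuring runs of adjacent fiscal years per firm. The sketch below is one possible approach, not necessarily the one used in the project code. The column `net_income` stands in for the full list of key variables and is a hypothetical name.

```python
import pandas as pd

def firms_with_consecutive_years(df, key_vars, min_years=3):
    """Keep all rows of firms that have at least min_years consecutive
    fiscal years with every variable in key_vars present."""
    complete = (
        df.dropna(subset=key_vars)
        .drop_duplicates(["item6105", "year_"])
        .sort_values(["item6105", "year_"])
    )
    # A new run starts at a firm's first year (diff is NaN) and whenever
    # the gap to the previous complete year is not exactly one.
    new_run = complete.groupby("item6105")["year_"].diff() != 1
    run_id = new_run.cumsum()
    run_length = complete.groupby(run_id)["year_"].transform("size")
    eligible = complete.loc[run_length >= min_years, "item6105"].unique()
    return df[df["item6105"].isin(eligible)]

# Made-up data: F1 has a three-year run plus an isolated year, F2 has a
# gap, and F3 is missing net income in its middle year.
toy = pd.DataFrame({
    "item6105": ["F1"] * 4 + ["F2"] * 3 + ["F3"] * 3,
    "year_": [1990, 1991, 1992, 1995, 1990, 1992, 1993, 1990, 1991, 1992],
    "net_income": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 1.0, None, 3.0],
})

kept = firms_with_consecutive_years(toy, key_vars=["net_income"])
```

In line with the rule described above, all four rows of `F1` are kept, including the isolated 1995 observation, while `F2` and `F3` are dropped entirely.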

During the preparation step, 8,265 firms and 20,521 firm-year observations (all due to the second filtering step) were dropped, resulting in a final dataset with 18,040 firms and 123,469 firm-year observations. The differences between the prepared dataset and the figures reported in the paper (70,955 firm-year observations and 8,616 non-financial firms) could be due to the factors listed in @sec-research_design_assumptions, such as variations in the initial datasets, data updates, and filtering criteria. In addition, the original study might have included data cleaning steps not explicitly mentioned, such as handling outliers, specific industry exclusions, or other criteria, which could affect the final counts.

Since the number of observations in this project is substantially higher than in @Leuz_2003, I partially replicated Table 1 from @Leuz_2003 (only the columns on countries and firm-year observations) to compare the firm-year observations and identify discrepancies for specific countries. The following table presents the (partially) replicated Table 1.


```{python}
#| label: table1
#| echo: true
#| output: true

import pandas as pd

# Load Table 1 CSV file
table_1 = pd.read_csv('data/generated/table_1.csv')

# Display Table 1
table_1
```

\pagebreak

\setcounter{table}{0}
\renewcommand{\thetable}{\arabic{table}}

# References {-}
\setlength{\parindent}{-0.2in}
\setlength{\leftskip}{0.2in}
\setlength{\parskip}{8pt}
\noindent