<a id='importing-dependencies'></a>
<font size="+3" color='#053c96'><b> Feature Engineering</b></font>  

Feature engineering is a critical step in the data analysis pipeline that involves transforming raw data into meaningful features to enhance the predictive power of machine learning models. In this notebook, we focus on crafting and refining features specific to financial data for effective fraud detection.

#### Objectives:
- **Enhance Model Performance**: Create features that improve the accuracy, precision, and recall of our models.
- **Capture Financial Insights**: Derive meaningful metrics and ratios from the raw data to better reflect financial health and risk.
- **Address Data Challenges**: Handle messy, imbalanced, or incomplete data by engineering robust features that mitigate these issues.

#### Key Highlights:
1. **Feature Selection**:
   - Identify and retain features relevant to detecting financial anomalies or fraud.
   - Remove redundant or non-informative features.

2. **Feature Transformation**:
   - Normalize and standardize numerical features to ensure consistency across scales.
   - Encode categorical variables for compatibility with machine learning models.

3. **Derived Metrics**:
   - Engineer financial ratios such as **Debt-to-Equity Ratio**, **Profit Margins**, and **Liquidity Ratios** to capture meaningful patterns.
   - Incorporate domain-specific knowledge to design features relevant to financial statement analysis.

4. **Feature Validation**:
   - Evaluate the impact of engineered features through statistical analysis and visualization.
   - Assess feature importance using feature selection methods.

This notebook serves as a bridge between raw data exploration and model development, ensuring that the final dataset is well-prepared for training robust machine learning models.


<a id='importing-dependencies'></a>
<font size="+2" color='#053c96'><b> Importing Libraries</b></font>  

In [3]:
import sys
# Insert the parent path relative to this notebook so we can import from the src folder.
sys.path.insert(0, "..")

from src.dependencies import *
from src.functions import *

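To import the project's dependencies and helper functions from the `src` folder, the cell above adds the notebook's parent directory to the module search path with `sys.path.insert(0, "..")`. Later cells use `pd`, which is assumed to be brought into scope by the wildcard import from `src.dependencies`.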

<a id='data'></a>
<font size="+2" color='#053c96'><b> Data</b></font>

#### Loading the data

The cleaned financial dataset loaded below is the input for feature engineering.

In [4]:
df = pd.read_csv('../src/data/cleaned_financial_data.csv')

<a id='feature-engineering'></a>
<font size="+2" color='#780404'><b> Vertical Analysis</b></font>   
Vertical analysis is a technique for analyzing the relationships between the items on any one of the financial statements in one reporting period. The analysis results in the relationships between components expressed as percentages that can then be compared across periods. This method is often referred to as “common sizing” financial statements. In the vertical analysis of an income statement, net sales is assigned 100%; for a balance sheet, total assets is assigned 100% on the asset side, and total liabilities and equity is expressed as 100% on the other side. All other items in each of the sections are expressed as a percentage of these numbers.
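
As a minimal sketch of common sizing, the example below expresses two income statement items as a percentage of revenue. The `toy` DataFrame and its values are hypothetical and only reuse the notebook's column names for illustration.

```python
import pandas as pd

# Hypothetical two-company income statement; values are illustrative only.
toy = pd.DataFrame({
    "Revenue": [200.0, 500.0],
    "CostOfSales": [120.0, 350.0],
    "GrossProfit": [80.0, 150.0],
})

# Express each line item as a percentage of revenue (the 100% base).
common_size = toy[["CostOfSales", "GrossProfit"]].div(toy["Revenue"], axis=0) * 100
print(common_size)
```

Because gross profit is revenue minus cost of sales in this toy data, the two percentages in each row add up to 100.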

#### Income Statement Vertical Analysis

In [5]:
df['VA_Revenue_CostOfSales'] = df['CostOfSales'] / df['Revenue'] * 100

In [6]:
df['VA_Revenue_GrossProfit'] = df['GrossProfit'] / df['Revenue'] * 100

In [7]:
df['VA_Revenue_TotalCostBase'] = df['TotalCostBase'] / df['Revenue'] * 100

In [8]:
df['VA_Revenue_EBIT'] = df['EBIT'] / df['Revenue'] * 100

In [9]:
df['VA_Revenue_NetProfitAfterTax'] = df['NetProfitAfterTax'] / df['Revenue'] * 100

#### Balance Sheet Vertical Analysis

In [10]:
df['VA_TotalEquity_RetainedEarnings'] = df['RetainedEarnings'] / df['TotalEquity'] * 100

In [11]:
df['VA_NCL_TotalEquityAndLiabilities_TotalEquity'] = df['TotalEquity'] / df['NCL_TotalEquityAndLiabilities'] * 100

In [12]:
df['VA_NCL_TotalEquityAndLiabilities_TotalLiabilities'] = df['TotalLiabilities'] / df['NCL_TotalEquityAndLiabilities'] * 100

In [13]:
df['VA_TotalAssets_NCA_TotalNonCurrentAssets'] = df['NCA_TotalNonCurrentAssets'] / df['TotalAssets'] * 100

In [14]:
df['VA_TotalAssets_CA_TotalCurrentAssets'] = df['CA_TotalCurrentAssets'] / df['TotalAssets'] * 100
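
Each vertical analysis column divides by a base item, and pandas returns `inf` when that base is zero. One possible guard is sketched below; `safe_ratio` is a hypothetical helper, not part of `src.functions`, and the toy data is illustrative.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator: pd.Series, denominator: pd.Series, scale: float = 1.0) -> pd.Series:
    """Divide two Series, returning NaN where the denominator is zero."""
    # Replacing zeros with NaN avoids +/-inf results from division by zero.
    return numerator / denominator.replace(0, np.nan) * scale

# Hypothetical data: the second row has zero revenue.
toy = pd.DataFrame({"CostOfSales": [60.0, 10.0], "Revenue": [100.0, 0.0]})
toy["VA_Revenue_CostOfSales"] = safe_ratio(toy["CostOfSales"], toy["Revenue"], scale=100)
print(toy)
```

Returning NaN rather than `inf` keeps the undefined rows visible to later missing-value handling instead of distorting scaling or model training.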

<a id='feature-engineering'></a>
<font size="+2" color='#780404'><b> Ratio Analysis</b></font>   
Ratio analysis is a tool used to evaluate the financial performance and health of a company by analyzing the relationships between different financial statement items. It involves calculating various financial ratios based on the financial data available in a company's financial statements, such as the balance sheet, income statement, and cash flow statement.

Financial ratios can be broadly classified into four categories: liquidity ratios, solvency ratios, profitability ratios, and activity ratios.

#### Gross Profit Margin
Gross Profit Margin is a key financial ratio that measures a company's profitability and efficiency in producing and selling its products or services. It is often used by investors, creditors, and financial analysts to assess a company's financial health and performance, and to compare it to its peers or industry benchmarks.

In [15]:
# Parentheses are required: without them only CostOfSales is divided by Revenue.
df['GrossProfitMargin'] = (df['Revenue'] - df['CostOfSales']) / df['Revenue'] * 100
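
Gross profit margin is gross profit (revenue minus cost of sales) divided by revenue. The hypothetical numbers below are a small arithmetic check of that formula; they also show why the numerator needs its own parentheses, since division binds more tightly than subtraction in Python.

```python
import pandas as pd

# Hypothetical single-company figures, for illustration only.
toy = pd.DataFrame({"Revenue": [200.0], "CostOfSales": [120.0]})

# Gross profit margin as a percentage of revenue.
with_parens = (toy["Revenue"] - toy["CostOfSales"]) / toy["Revenue"] * 100

# Without parentheses this evaluates Revenue - (CostOfSales / Revenue * 100).
without_parens = toy["Revenue"] - toy["CostOfSales"] / toy["Revenue"] * 100

print(with_parens.iloc[0], without_parens.iloc[0])
```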

#### Operating Profit Margin
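Operating Profit Margin is a financial ratio that measures a company's profitability and efficiency in generating operating income from its revenue. The code below uses EBITDA as the operating income measure, so the result is effectively an EBITDA margin expressed as a fraction of revenue.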

In [16]:
df['OperatingProfitMargin'] = df['EBITDA'] / df['Revenue']

#### Net Profit Margin
Net Profit Margin is a financial ratio that measures a company's profitability and efficiency in generating profit after all expenses have been accounted for. 

In [17]:
df['NetProfitMargin'] = df['NetProfitAfterTax'] / df['Revenue']

#### Asset Turnover Ratio
The asset turnover ratio is a financial ratio that measures a company's efficiency in using its assets to generate revenue. It is calculated by dividing the company's net sales by its total assets.

In [18]:
# Asset turnover is revenue (net sales) divided by total assets.
df['AssetTurnoverRatio'] = df['Revenue'] / df['TotalAssets']

#### EBIT to Sales Ratio
The EBIT to Sales Ratio is a financial ratio that measures a company's operating profitability as a percentage of its total revenue or sales. It is calculated by dividing a company's Earnings Before Interest and Taxes (EBIT) by its total revenue.

In [19]:
df['EBITtoSalesRatio'] = df['EBIT'] / df['Revenue']

#### Non-Current Asset Turnover Ratio
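Non-Current Asset Turnover Ratio is a financial ratio that measures a company's efficiency in generating revenue from its non-current assets. The code below divides revenue by total non-current assets less total depreciation and amortisation.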

In [20]:
df['NCA_TurnoverRatio'] = df['Revenue'] / (df['NCA_TotalNonCurrentAssets'] - df['DepreciationAmortisationTotal'])

#### Debt to Equity Ratio
The debt-to-equity ratio is a financial ratio that shows the proportion of debt and equity that a company is using to finance its assets. It is calculated by dividing the company's total liabilities by its shareholder equity.

In [21]:
df['DebtEquityRatio'] = df['TotalLiabilities'] / df['TotalEquity']

#### Cash Conversion Cycle
The cash conversion cycle (CCC) is a financial metric used to measure the time it takes a company to convert its inventory and other resources into cash flow from sales. It's calculated as:

CCC = DIO + DSO - DPO

where DIO is the days inventory outstanding, DSO is the days sales outstanding, and DPO is the days payable outstanding.

The three components are calculated first, using the following formulas:

- DIO = (Inventory / Cost of Goods Sold) * 365
- DSO = (Accounts Receivable / Revenue) * 365
- DPO = (Accounts Payable / Cost of Goods Sold) * 365

where:

- Inventory is the value of inventory held (`CA_Inventories`).
- Cost of Goods Sold is the cost of the goods sold (`CostOfSales`).
- Accounts Receivable is the value of trade and other receivables (`CA_TradeAndOtherReceivables`).
- Revenue is the revenue generated by the company (`Revenue`).
- Accounts Payable is the value of trade and other payables (`CL_TradeAndOtherPayables`).

In [22]:
DIO = (df['CA_Inventories'] / df['CostOfSales']) * 365
DSO = (df['CA_TradeAndOtherReceivables'] / df['Revenue']) * 365
DPO = (df['CL_TradeAndOtherPayables'] / df['CostOfSales']) * 365

df['CCC'] = DIO + DSO - DPO
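
As a worked example of the CCC arithmetic, the sketch below uses hypothetical figures chosen so that each component comes out to a round number of days. The column names mirror the notebook's; the values are not taken from the dataset.

```python
import pandas as pd

# Hypothetical annual figures for one company, for illustration only.
toy = pd.DataFrame({
    "CA_Inventories": [50.0],
    "CA_TradeAndOtherReceivables": [100.0],
    "CL_TradeAndOtherPayables": [73.0],
    "CostOfSales": [365.0],
    "Revenue": [730.0],
})

dio = toy["CA_Inventories"] / toy["CostOfSales"] * 365           # 50 days of inventory
dso = toy["CA_TradeAndOtherReceivables"] / toy["Revenue"] * 365  # 50 days to collect
dpo = toy["CL_TradeAndOtherPayables"] / toy["CostOfSales"] * 365 # 73 days to pay suppliers

ccc = dio + dso - dpo
print(ccc.iloc[0])
```

A CCC of about 27 days here means cash is tied up for roughly 27 days between paying suppliers and collecting from customers.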

#### Return on Equity Ratio
Return on equity measures the profit a company generates per unit of shareholder equity. The formula used here is:

Return on Equity Ratio = Net Profit After Tax / Total Shareholder Equity Before Minorities

In [23]:
df['ReturnEquityRatio'] = df['NetProfitAfterTax'] / df['TotalShareholderEquityBeforeMinorities']

#### Quick Ratio
The quick ratio measures a company's ability to meet its short-term obligations with its most liquid assets. It is calculated as:

Quick Ratio = (Current Assets - Inventories) / Current Liabilities

In [24]:
df['QuickRatio'] = (df['CA_TotalCurrentAssets'] - df['CA_Inventories']) / df['CL_TotalCurrentLiabilities']

#### Operating Expense Ratio
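The operating expense ratio is a financial metric that represents the percentage of a company's total revenue that is spent on operating expenses. It is calculated by dividing operating expenses by total revenue and multiplying the result by 100 to express it as a percentage. The code below multiplies by -100 rather than 100, which yields a positive percentage only if operating expenses are recorded as negative values in the dataset.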

In [25]:
df['OperatingExpenseRatio'] = (df['OperatingExpensesOverheads'] / df['Revenue']) * -100

#### Return on Assets
Return on Assets (ROA) ratio measures a company's ability to generate profit from its assets, and is calculated by dividing a company's net profit by its total assets.

In [26]:
df['ROA'] = df['NetProfitAfterTax'] / df['TotalAssets']

#### Operating Margin Ratio
The operating margin ratio is calculated by dividing operating income by revenue. Operating income is calculated as revenue minus cost of sales minus operating expenses.

In [27]:
df['OperatingMarginRatio'] = (df['Revenue'] - df['TotalCostBase']) / df['Revenue']

#### Debt-to-Assets Ratio
The debt-to-assets ratio is a financial ratio that measures the proportion of a company's total assets that are financed through debt. It shows the degree to which a company is leveraged and can indicate the level of risk associated with investing in the company.

In [28]:
df["DebtToAssetsRatio"] = df["TotalLiabilities"] / df["TotalAssets"]

#### Cash Ratio
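The cash ratio measures a company's ability to pay its short-term obligations using only its cash and cash equivalents. It is calculated by dividing cash and cash equivalents by current liabilities. Note that the code below uses `CFF_NetIncCashAndCashEquivalents` as the numerator; judging by its name, this column holds the net increase in cash and cash equivalents rather than the closing balance, so the result is best read as a proxy.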

In [29]:
df["CashRatio"] = df["CFF_NetIncCashAndCashEquivalents"] / df["CL_TotalCurrentLiabilities"]

#### Financial Leverage
This ratio measures how much a company is relying on debt to finance its operations. It is calculated as total assets divided by total equity.

In [30]:
df['FinancialLeverage'] = df['TotalAssets'] / df['TotalEquity']
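
The three leverage measures in this notebook are algebraically linked whenever the balance sheet balances, that is, when total assets equal total liabilities plus total equity. Under that assumption, financial leverage equals one plus the debt-to-equity ratio. The sketch below checks this on hypothetical figures.

```python
import numpy as np
import pandas as pd

# Hypothetical balance sheets that balance: assets = liabilities + equity.
toy = pd.DataFrame({
    "TotalAssets": [300.0, 500.0],
    "TotalLiabilities": [200.0, 100.0],
    "TotalEquity": [100.0, 400.0],
})

debt_to_equity = toy["TotalLiabilities"] / toy["TotalEquity"]
debt_to_assets = toy["TotalLiabilities"] / toy["TotalAssets"]
financial_leverage = toy["TotalAssets"] / toy["TotalEquity"]

# Financial leverage should equal 1 + debt-to-equity on a balanced sheet.
leverage_matches = np.isclose(financial_leverage, 1 + debt_to_equity)
print(leverage_matches)
```

Because the features carry overlapping information in that case, they are likely to be highly correlated, which is worth keeping in mind during feature selection.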

#### DuPont Analysis
DuPont analysis is a method used to analyze a company's return on equity (ROE) by breaking it down into three components:

- Net Profit Margin = Net Profit After Tax / Revenue
- Asset Turnover = Revenue / Total Assets
- Financial Leverage = Total Assets / Total Equity

Multiplying these three ratios together gives the company's ROE:

ROE = Net Profit Margin x Asset Turnover x Financial Leverage

In [31]:
df['DupontAnalysis'] = df['NetProfitMargin'] * df['AssetTurnoverRatio'] * df['FinancialLeverage']
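
The DuPont product can be sanity-checked against a direct ROE calculation: when asset turnover is revenue over total assets, revenue and total assets cancel, leaving net profit divided by total equity. The figures below are hypothetical and only illustrate the identity.

```python
import numpy as np
import pandas as pd

# Hypothetical figures for one company, for illustration only.
toy = pd.DataFrame({
    "NetProfitAfterTax": [20.0],
    "Revenue": [200.0],
    "TotalAssets": [400.0],
    "TotalEquity": [100.0],
})

net_profit_margin = toy["NetProfitAfterTax"] / toy["Revenue"]   # 0.10
asset_turnover = toy["Revenue"] / toy["TotalAssets"]            # 0.50
financial_leverage = toy["TotalAssets"] / toy["TotalEquity"]    # 4.00

dupont_roe = net_profit_margin * asset_turnover * financial_leverage
direct_roe = toy["NetProfitAfterTax"] / toy["TotalEquity"]

# The decomposition and the direct calculation should agree.
identity_holds = np.isclose(dupont_roe, direct_roe)
print(dupont_roe.iloc[0], direct_roe.iloc[0])
```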

### Saving Engineered Features to CSV

The dataset with engineered features is saved for further use in the modeling process. The following code exports the processed data to a CSV file:


In [32]:
df.to_csv('../src/data/featured_financial_data.csv', index=False)
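
Before handing the file to modeling, it can be useful to audit the ratio columns for non-finite values, since many estimators reject `inf`. The sketch below shows one way to count them per column on hypothetical data; it is an optional check, not a step the notebook performs.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row has zero revenue.
toy = pd.DataFrame({"Revenue": [100.0, 0.0], "EBIT": [10.0, 5.0]})
toy["EBITtoSalesRatio"] = toy["EBIT"] / toy["Revenue"]  # second row becomes inf

# Count values that are inf, -inf, or NaN in each numeric column.
numeric = toy.select_dtypes(include="number")
non_finite = (~np.isfinite(numeric)).sum()
print(non_finite)

# One option is to convert infinities to NaN so they are handled as missing values.
cleaned = toy.replace([np.inf, -np.inf], np.nan)
```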