# DSCI 100: Project Proposal 

In [2]:
# libraries 
library(tidyverse)
library(repr)
library(tidymodels)
library(gridExtra)
options(repr.matrix.max.rows = 6)


### 1. Introduction

Auditing is the examination of businesses' financial records to verify that they comply with standard accounting laws and principles (Hooda, 2018). Certain characteristics of a firm, such as historical discrepancies between its financial reports and audit inspections, can help auditors identify firms that are at higher risk of fraudulent activity. This dataset contains information about 777 firms, each of which is classified as either a “Fraud” firm or a “Non-fraud” firm.

The dataset aims to aid the auditing process by providing insight into whether a particular firm is “high risk” (in which case auditors would want to visit the firm) or “low risk” (in which case auditors may skip visiting that firm). Some of the risk factors examined in the dataset include discrepancies in reports, historical discrepancy scores, and amounts of money involved in misstatements. With this dataset, we will implement a K-nearest neighbours classification model to identify “Fraud” firms in unseen data.

### 2. Preliminary EDA

#### Feature Descriptions

| **Inherent risk factors** |                                                                                               | **Control risk factors** |                                                                                     |
|-----------------------|-----------------------------------------------------------------------------------------------------|----------------------|-------------------------------------------------------------------------------------------|
| **Feature**           | Information                                                                                         | **Feature**          | Information                                                                               |
| Para A value          | Discrepancy found in the planned-expenditure of inspection and summary report A in Rs (in crore).   | Sector score         | Historical risk score value of the target-unit in the Table 1 using analytical procedure. |
| Para B value          | Discrepancy found in the unplanned-expenditure of inspection and summary report B in Rs (in crore). | Loss                 | Amount of loss suffered by the firm last year.                                            |
| Total                 | Total amount of discrepancy found in other reports Rs (in crore).                                   | History              | Average historical loss suffered by firm in the last 10 years.                            |
| Number                | Historical discrepancy score.                                                                       | District score       | Historical risk score of a district in the last 10 years.                                 |
| Money value           | Amount of money involved in misstatements in the past audits.                                       |                      |                                                                                           |
| **Other features**    |                                                                                                     |                      |                                                                                           |
| **Feature**           | Information                                                                                         | **Feature**          | Information                                                                               |
| Sector ID             | Unique ID of the target sector.                                                                     | Location ID          | Unique ID of the city/province.                                                           |
| ARS                   | Total risk score using analytical procedure.                                                        | Audit ID             | Unique ID assigned to an audit case.                                                      |
| Risk class            | Risk Class assigned to an audit-case. (Target Feature)                                              |                      |                                                                                           |

In [None]:
audit <- read_csv("audit_data/audit_risk.csv") 
audit <- audit |> mutate(Risk = as.factor(Risk), LOCATION_ID = as.factor(LOCATION_ID))
head(audit)

In [None]:
# Tidying the Data

audit_tidy <- audit |>
  pivot_longer(cols = starts_with("PARA_"), names_to = "discrepancy", values_to = "discrepancy_value") |>
  pivot_longer(cols = starts_with("Score_"), names_to = "score_variable", values_to = "score_value") |>
  pivot_longer(cols = starts_with("Risk_"), names_to = "risk_variable", values_to = "risk_value")

head(audit_tidy)
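
One thing worth checking about the tidying step is how chaining several `pivot_longer()` calls affects the number of rows per firm. The sketch below uses a small made-up tibble (`toy`, with a hypothetical `firm` column and two columns per prefix; none of these values come from the audit data) to show that each successive pivot multiplies the row count, so one firm ends up on several rows.

```r
library(tidyr)
library(tibble)

# Made-up stand-in data: 2 firms, 2 columns for each of the three prefixes
toy <- tibble(
  firm    = c("a", "b"),
  PARA_A  = c(1.0, 2.0), PARA_B  = c(0.5, 0.0),
  Score_A = c(0.2, 0.4), Score_B = c(0.2, 0.6),
  Risk_A  = c(0.2, 0.8), Risk_B  = c(0.1, 0.0)
)

# Same chained pivots as in the tidying cell, applied to the toy data
toy_long <- toy |>
  pivot_longer(cols = starts_with("PARA_"), names_to = "discrepancy", values_to = "discrepancy_value") |>
  pivot_longer(cols = starts_with("Score_"), names_to = "score_variable", values_to = "score_value") |>
  pivot_longer(cols = starts_with("Risk_"), names_to = "risk_variable", values_to = "risk_value")

nrow(toy)                     # 2 firms
nrow(toy_long)                # 2 * 2 * 2 * 2 = 16 rows
sum(toy_long$firm == "a")     # each firm now appears on 8 rows
```

If the same pattern holds for the audit data, splitting after pivoting could place rows from one firm in both the training and test sets, so it may be worth comparing `nrow(audit)` with `nrow(audit_tidy)` before splitting.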

In [None]:
summary(audit_tidy)

In [None]:
# Train Test Split
set.seed(1) # arbitrary seed so that the random split is reproducible
risk_split <- initial_split(audit_tidy, prop = 0.75, strata = Risk)
risk_train <- training(risk_split)
risk_test <- testing(risk_split)

head(risk_train)

In [4]:
feature_plot1 <- risk_train |>
  ggplot(aes(x = TOTAL, y = Inherent_Risk, color = Risk)) +
    geom_point() +
    scale_x_log10() + scale_y_log10() +
    ggtitle('Scatter Plot of TOTAL vs Inherent Risk') +
    xlab('TOTAL') +
    ylab('Inherent Risk')

feature_plot2 <- ggplot(risk_train, aes(x = TOTAL, y = Money_Value)) +
  geom_point(aes(color = Risk), alpha = 0.5) +
  scale_x_log10() + scale_y_log10() +
  ggtitle('Scatter Plot of TOTAL vs Money_Value') +
  xlab('TOTAL') +
  ylab('Money_Value')


feature_plot3 <- risk_train |>
  group_by(LOCATION_ID, Risk) |>
  summarise(count = n(), .groups = "drop") |>
  group_by(LOCATION_ID) |>
  summarise(total = sum(count), most_frequent_risk = Risk[which.max(count)]) |>
  ungroup() |>
  mutate(LOCATION_ID = reorder(LOCATION_ID, -total)) |>
  ggplot(aes(y = LOCATION_ID, x = total, fill = most_frequent_risk)) +
    geom_bar(stat = "identity", position = "dodge") +
    ggtitle("Bar Chart of Location ID by Risk") +
    ylab("Location ID") +
    xlab("Count") 


feature_plot4 <- risk_train |>
  ggplot(aes(y = score_value, x = score_variable, fill = Risk)) +
    geom_bar(stat = "identity", position = "stack") +
    ggtitle("Bar Chart of Score Variables by Score Values") +
    ylab("Score Values") +
    xlab("Score Variables") 


grid.arrange(feature_plot1, feature_plot2, feature_plot3, feature_plot4, ncol = 2)


### 3. Methods

We will use the K-nearest neighbours (KNN) algorithm to build the classifier for our data, and we want to find the number of neighbours, K, that gives the most accurate classification results. After splitting the data into a training set and a test set, we will further split the training data into sub-training and validation sets in order to perform cross-validation. We will then create a recipe that uses Risk as the class and ___, ____, ____, and ___ as the predictors, and that standardizes the predictors based on the training data. When specifying the KNN model, we will set neighbors = tune() so that cross-validation can estimate the accuracy for multiple values of K. We will combine this model with the recipe in a workflow to train the classifier. To see which number of neighbours is appropriate, we will visualize the results by plotting the accuracy estimates against the number of neighbours. The final K-nearest neighbours classifier will use the number of neighbours that gives the most accurate predictions.
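
The steps described above could look roughly like the following sketch, which runs on synthetic data rather than the audit data. The data frame `toy`, the predictors `x1` and `x2`, the seed, the 5 folds, and the grid of K values are all placeholder assumptions for illustration, not the project's actual choices. The sketch also assumes the `kknn` package is installed.

```r
library(tidymodels)

set.seed(2024) # arbitrary seed so the sketch is reproducible

# Synthetic stand-in data: two hypothetical numeric predictors and a two-level class
n <- 200
toy <- tibble(
  x1 = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 3)),
  x2 = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 3)),
  Risk = factor(rep(c("0", "1"), each = n / 2))
)

# Training / test split, stratified by class
toy_split <- initial_split(toy, prop = 0.75, strata = Risk)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Recipe: Risk is the class; predictors are scaled and centered
toy_recipe <- recipe(Risk ~ x1 + x2, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# KNN model specification with the number of neighbours left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Cross-validation folds and candidate values of K (placeholder choices)
toy_vfold <- vfold_cv(toy_train, v = 5, strata = Risk)
k_grid <- tibble(neighbors = seq(from = 1, to = 15, by = 2))

# Workflow: recipe + model, evaluated for every K in the grid
knn_results <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = toy_vfold, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Accuracy estimate against number of neighbours
accuracy_plot <- ggplot(knn_results, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors", y = "Accuracy estimate")

# K with the highest cross-validated accuracy estimate
best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)
```

In this sketch the test set (`toy_test`) is deliberately left untouched during tuning, so it would remain available for a final evaluation of the classifier refit with `best_k`.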

### 4. Expected Outcomes and Significance 