# Identifying Fake Bills

The Introduction goes here

In [5]:
library(tidyverse)  # for general tidyverse functions
library(tidymodels) # for making the training/testing split
library(ggplot2)    # for producing plots
library(repr)       # for adjusting plots

“package ‘ggplot2’ was built under R version 4.3.2”
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom    

In [6]:
# reading and tidying the dataset
fake_bills <- read_csv("https://raw.githubusercontent.com/heeyachung/dsci-group-14-/main/fakebills.csv") |> #reading the file
    mutate(is_genuine = as_factor(is_genuine)) |> #factoring the legitimacy variable
    mutate(is_genuine = fct_recode(is_genuine, "real" = "1", "fake" = "2")) #renaming values

# splitting the data into the training and testing split
fake_bills_split <- initial_split(fake_bills, prop = 0.75, strata = is_genuine)
bills_training <- training(fake_bills_split)  # training split
bills_testing <- testing(fake_bills_split)    # testing split
head(bills_training)

Rows: 1500 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (7): is_genuine, diagonal, height_left, height_right, margin_low, margin...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| is_genuine `<fct>` | diagonal `<dbl>` | height_left `<dbl>` | height_right `<dbl>` | margin_low `<dbl>` | margin_up `<dbl>` | length `<dbl>` |
|---|---|---|---|---|---|---|
| fake | 171.59 | 104.14 | 104.38 | 4.97 | 3.47 | 111.22 |
| fake | 172.55 | 104.25 | 104.23 | 5.6 | 3.13 | 111.72 |
| fake | 171.88 | 104.3 | 104.18 | 5.34 | 3.33 | 112.69 |
| fake | 171.63 | 104.05 | 104.25 | 4.61 | 3.1 | 110.91 |
| fake | 171.83 | 104.13 | 104.52 | 4.94 | 3.27 | 111.72 |
| fake | 172.3 | 104.28 | 103.9 | 5.1 | 3.57 | 110.66 |


In [7]:
bills_class <- bills_training |>
    group_by(is_genuine) |>
    summarize(count = n()) 
bills_class

| is_genuine `<fct>` | count `<int>` |
|---|---|
| real | 750 |
| fake | 375 |


As discussed in the initial analysis in our proposal, the two classes are imbalanced: the training set contains 750 real bills but only 375 fake ones, a 2:1 ratio. Because this imbalance could bias the model toward predicting the majority class, we will generate synthetic data points for fake bills so that the two classes are balanced.
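
One possible way to generate such synthetic points is SMOTE, which creates new minority-class observations by interpolating between existing ones and their nearest neighbours. The sketch below is an illustration only, not necessarily the method used in this analysis: it assumes the `themis` package (not loaded above) is installed, and it uses a small made-up data frame, `toy_bills`, with arbitrary values and the same 2:1 imbalance rather than the real bills data.

```r
library(tidymodels)
library(themis)  # assumption: provides step_smote(); not loaded in the cells above

set.seed(2024)  # arbitrary seed, only so this sketch is reproducible

# Hypothetical toy data (arbitrary values, NOT the real bills) with a 2:1 imbalance
toy_bills <- tibble(
    is_genuine = factor(rep(c("real", "fake"), times = c(40, 20)),
                        levels = c("real", "fake")),
    margin_low = c(rnorm(40, mean = 4, sd = 0.3), rnorm(20, mean = 5, sd = 0.5)),
    length     = c(rnorm(40, mean = 113, sd = 0.4), rnorm(20, mean = 112, sd = 0.6))
)

# step_smote() needs numeric predictors with no missing values;
# over_ratio = 1 asks for the minority class to match the majority class in size
balance_recipe <- recipe(is_genuine ~ ., data = toy_bills) |>
    step_smote(is_genuine, over_ratio = 1, neighbors = 5)

# prep() estimates the step; bake(new_data = NULL) returns the rebalanced training data
balanced <- balance_recipe |>
    prep() |>
    bake(new_data = NULL)

count(balanced, is_genuine)
```

If this approach were applied to `bills_training`, any missing values in the predictors would need to be handled first (for example with an imputation step), and the rebalancing should be applied to the training split only so that the testing split keeps its original class proportions.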