cheatR: Catch 'em baddies

This is a mini package to help you find cheaters by comparing hand-ins! (Read more about the circumstances that brought about the development of this package.)

Download and Install

You can install cheatR from GitHub with:

# install.packages("devtools")
devtools::install_github("mattansb/cheatR")

Example usage

Scripting

Create a list of files:

my_files <- list.files(path = '../doc', pattern = '\\.docx?$', full.names = TRUE)
my_files
#> [1] "../doc/paper1 (1).docx" "../doc/paper1 (2).docx"
#> [3] "../doc/paper1 (3).docx" "../doc/paper2 (1).doc"

The first 3 documents are different drafts of the same paper, so we would expect them to be similar to each other. The last document is a draft of a different paper, so it should be dissimilar to the first 3. All files are about 45K words long.
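
Note that the pattern argument of list.files() is a regular expression rather than a shell glob, so an unescaped dot matches any character. A minimal sketch, assuming a hypothetical temporary folder with made-up file names, of collecting only .doc, .docx and .pdf files:

```r
# Hypothetical folder and file names, created only for this illustration
demo_dir <- file.path(tempdir(), "hand_ins_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("essay1.docx", "essay2.doc", "essay3.pdf",
                                  "notes.txt", "mydoc_list.csv")))

# Escape the dot and anchor the pattern at the end of the file name so that
# only the three supported extensions match
demo_files <- list.files(path = demo_dir,
                         pattern = "\\.(docx?|pdf)$",
                         full.names = TRUE,
                         ignore.case = TRUE)
basename(demo_files)
```

With the looser pattern '.doc', a file such as "mydoc_list.csv" would also be picked up, because "ydoc" satisfies "any character followed by doc".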

Now we can use cheatR to find duplicates.

The main function, catch_em, takes the following input arguments:

  • flist - a list of documents (.doc/.docx/.pdf). A full or relative path must be provided for each file.
  • n_grams - the n-gram size used for the comparison (see the ngram package).
  • time_lim - the maximum time, in seconds, allowed for each comparison (we found that some corrupt files run forever and crash R, so a time limit might be needed).
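
To give an intuition for n_grams: an n-gram is a run of n consecutive tokens (words, in this sketch), and two documents that share many long n-grams probably share copied passages. The base-R sketch below illustrates the idea only; word_ngrams() and shared_ngram_share() are hypothetical helpers, and this is not claimed to be how cheatR or the ngram package compute similarity.

```r
# Split text into lower-case words and return all runs of n consecutive words
word_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Share of the distinct n-grams in `a` that also occur in `b`
shared_ngram_share <- function(a, b, n) {
  grams_a <- unique(word_ngrams(a, n))
  grams_b <- unique(word_ngrams(b, n))
  if (length(grams_a) == 0) return(NA_real_)
  mean(grams_a %in% grams_b)
}

# Made-up sentences for illustration
original  <- "the quick brown fox jumps over the lazy dog"
copied    <- "yesterday the quick brown fox jumps over the fence"
unrelated <- "a completely different sentence about statistics and data"

shared_ngram_share(original, copied, n = 3)    # 5 of 7 trigrams are shared
shared_ngram_share(original, unrelated, n = 3) # no trigrams are shared
```

Larger values of n demand longer verbatim overlaps before two texts count as similar, which is presumably why the default of 10 is fairly strict.
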

library(cheatR)
#> Catch 'em cheaters!
results <- catch_em(flist = my_files,
                    n_grams = 10, time_lim = 1) # defaults
#> Reading documents... Done!
#> Looking for cheaters
#> ===========================================================================
#> Busted!
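
The idea behind time_lim can be sketched in base R with setTimeLimit(). This is an illustration only and not necessarily how catch_em enforces its limit; with_time_limit() is a hypothetical helper, and it returns NA for any error, not just a timeout.

```r
with_time_limit <- function(expr, seconds) {
  # Abort the current computation once `seconds` of elapsed time have passed
  setTimeLimit(elapsed = seconds, transient = TRUE)
  # Always lift the limit again, whether or not the expression finished
  on.exit(setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE))
  # `expr` is a lazily evaluated argument, so it runs here, inside tryCatch()
  tryCatch(expr, error = function(e) NA_real_)
}

# A quick computation finishes well within the limit
fast <- with_time_limit(sum(1:10), seconds = 1)

# A slow computation (about 5 seconds of short sleeps) is cut off early
slow <- with_time_limit({
  for (i in 1:100) Sys.sleep(0.05)
  1
}, seconds = 0.5)
```

Returning NA for a comparison that timed out lets the remaining comparisons carry on instead of stalling the whole run.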

The resulting list contains a matrix with the similarity values between each pair of documents:

knitr::kable(summary(results))

                   paper1 (1).docx   paper1 (2).docx   paper1 (3).docx   paper2 (1).doc
paper1 (1).docx    1.000
paper1 (2).docx    0.873             1.000
paper1 (3).docx    0.901             0.878             1.000
paper2 (1).doc     0.002             0.002             0.002             1
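
To flag suspicious pairs programmatically, a similarity matrix like the one above can be filtered by a cutoff. The sketch below rebuilds the table by hand as a plain matrix rather than assuming the exact structure returned by summary(results); the 0.8 cutoff is an arbitrary choice for illustration.

```r
docs <- c("paper1 (1).docx", "paper1 (2).docx", "paper1 (3).docx", "paper2 (1).doc")

# Lower-triangular similarity matrix, filled column by column with the
# values from the table above
sim <- matrix(NA_real_, nrow = 4, ncol = 4, dimnames = list(docs, docs))
sim[lower.tri(sim, diag = TRUE)] <- c(1.000, 0.873, 0.901, 0.002,
                                      1.000, 0.878, 0.002,
                                      1.000, 0.002,
                                      1.000)

# Keep only off-diagonal pairs whose similarity exceeds the cutoff
cutoff <- 0.8
hits <- which(sim > cutoff & lower.tri(sim), arr.ind = TRUE)
flagged <- data.frame(doc_a = rownames(sim)[hits[, "row"]],
                      doc_b = colnames(sim)[hits[, "col"]],
                      similarity = sim[hits],
                      stringsAsFactors = FALSE)
flagged[order(-flagged$similarity), ]
```

Here the three drafts of paper1 are flagged against each other, while paper2 never appears in the result.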

You can also plot the relational graph if you'd like to get a clearer picture of who copied from whom.

graph_em(results, weight_range = c(0.7, 1))
#> Using `nicely` as default layout

Shiny app!

The accompanying Shiny app can be found on shinyapps.io, but can also be run locally with:

cheatR::catch_em_app()

Limitations?

  • As far as we can tell, this should work on any language; we tried both English and Hebrew, with and without setting Sys.setlocale("LC_ALL", "Hebrew").
  • Best performance was achieved on R version > 3.5.0.
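
A quick way to check whether the running R session is newer than 3.5.0 (a minimal sketch; the variable name and the message text are made up for illustration):

```r
# TRUE when the running R version is newer than 3.5.0
r_is_recent <- getRversion() > "3.5.0"

if (!r_is_recent) {
  message("R ", as.character(getRversion()),
          " detected; consider upgrading to a version above 3.5.0")
}
```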

Authors

  • Mattan S. Ben-Shachar [aut, cre].
  • Almog Simchon [aut, cre].