This is a mini package to help you find cheaters by comparing hand-ins! (Read more about the circumstances that brought about the development of this package.)
Download and Install
You can install cheatR from GitHub with:
# install.packages("devtools")
devtools::install_github("mattansb/cheatR")
Create a list of files:
my_files <- list.files(path = '../doc', pattern = '.doc', full.names = TRUE)
my_files
#> [1] "../doc/paper1 (1).docx" "../doc/paper1 (2).docx"
#> [3] "../doc/paper1 (3).docx" "../doc/paper2 (1).doc"
The first 3 documents are different drafts of the same paper, so we would expect them to be similar to each other. The last document is a draft of a different paper, so it should be dissimilar to the first 3. All files are about 45K words long.
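One detail worth knowing about the file listing above: the `pattern` argument of `list.files()` is a regular expression, so `'.doc'` matches any character followed by "doc" anywhere in the file name. The sketch below compares it with a stricter pattern on a throwaway folder. The folder and file names are made up for the demonstration and are not part of cheatR.

```r
# Build a temporary folder with three empty files (hypothetical names).
demo_dir <- file.path(tempdir(), "cheatR_pattern_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("paper1 (1).docx", "paper2 (1).doc", "notes_doc.txt")))

# Loose pattern: "." is a regex wildcard, so "notes_doc.txt" is matched too.
loose <- list.files(path = demo_dir, pattern = ".doc")

# Stricter pattern: a literal dot, then "doc" or "docx", anchored at the end of the name.
strict <- list.files(path = demo_dir, pattern = "\\.docx?$", ignore.case = TRUE)

print(loose)
print(strict)
```

If a folder mixes documents with other files, the stricter pattern is the safer choice.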
Now we can use cheatR to find duplicates.
The only function, catch_em, takes the following input arguments:
- flist - a list of documents (
- time_lim - max time in seconds for each comparison (we found that some corrupt files run forever and crash R, so a time limit might be needed).
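How cheatR enforces `time_lim` internally is not shown here. The following is only a sketch of how a time limit per comparison could be enforced in base R with `setTimeLimit()`. The helper `with_time_limit` and its `on_timeout` argument are hypothetical names, not part of the package.

```r
# Hypothetical helper: evaluate `expr`, giving up after `seconds` of elapsed time.
# Note: for simplicity, ANY error (not only a timeout) yields `on_timeout`.
with_time_limit <- function(expr, seconds, on_timeout = NA) {
  # The limit applies to the current computation only (transient = TRUE).
  setTimeLimit(elapsed = seconds, transient = TRUE)
  # Always lift the limit again when this function exits.
  on.exit(setTimeLimit(elapsed = Inf, transient = FALSE), add = TRUE)
  # `expr` is evaluated lazily, i.e. inside tryCatch and under the limit.
  tryCatch(expr, error = function(e) on_timeout)
}

# A deliberately slow task: busy-waits for about 3 seconds.
slow_task <- function() {
  t0 <- Sys.time()
  while (as.numeric(difftime(Sys.time(), t0, units = "secs")) < 3) {}
  "finished"
}

fast_result <- with_time_limit(1 + 1, seconds = 5)          # completes normally
slow_result <- with_time_limit(slow_task(), seconds = 0.2)  # cut short, gives NA
```

R only checks time limits at points where a user interrupt could be handled, so long-running compiled code may not stop exactly on time.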
library(cheatR)
#> Catch 'em cheaters!
results <- catch_em(flist = my_files, n_grams = 10, time_lim = 1) # defaults
#> Reading documents... Done!
#> Looking for cheaters
#> ===========================================================================
#> Busted!
The resulting list contains a matrix with the similarity values between each pair of documents:
| paper1 (1).docx | paper1 (2).docx | paper1 (3).docx | paper2 (1).doc |
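To give a feel for what a similarity value between two documents can mean, here is a minimal base-R sketch of one n-gram measure: the share of one text's distinct word n-grams that also occur in the other. This is an assumption made for illustration, not necessarily the measure cheatR computes. The functions `word_ngrams` and `ngram_overlap` are hypothetical, and the two sentences are toy inputs.

```r
# Split a text into lower-case words and return all runs of `n` consecutive words.
word_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "), character(1))
}

# Share of the distinct n-grams of `a` that also occur in `b` (between 0 and 1).
ngram_overlap <- function(a, b, n = 3) {
  grams_a <- unique(word_ngrams(a, n))
  grams_b <- unique(word_ngrams(b, n))
  if (length(grams_a) == 0) return(NA_real_)
  mean(grams_a %in% grams_b)
}

text_a <- "the quick brown fox jumps over the lazy dog"
text_b <- "the quick brown fox sleeps under the lazy dog"

same_score  <- ngram_overlap(text_a, text_a)  # identical texts share every n-gram
mixed_score <- ngram_overlap(text_a, text_b)  # 3 of the 7 trigrams are shared
```

Larger values of `n` demand longer identical passages before two documents count as similar, which is why a value such as 10 is strict about verbatim copying.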
You can also plot the relational graph if you'd like to get a clearer picture of who copied from whom.
graph_em(results, weight_range = c(0.7, 1))
#> Using `nicely` as default layout
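The call above passes `weight_range = c(0.7, 1)`. Assuming this means that only document pairs whose similarity falls inside the range are drawn as edges, the selection could be sketched as follows. The matrix values are made up purely for illustration, the matrix is assumed to be symmetric, and `edges_in_range` is a hypothetical helper, not a cheatR function.

```r
# Toy similarity matrix; the values are invented for this example only.
docs <- c("A", "B", "C", "D")
sim <- matrix(
  c(1.00, 0.95, 0.80, 0.10,
    0.95, 1.00, 0.85, 0.05,
    0.80, 0.85, 1.00, 0.20,
    0.10, 0.05, 0.20, 1.00),
  nrow = 4, byrow = TRUE, dimnames = list(docs, docs)
)

# Keep each unordered pair once (upper triangle) if its weight lies in the range.
edges_in_range <- function(sim, weight_range) {
  keep <- upper.tri(sim) & sim >= weight_range[1] & sim <= weight_range[2]
  idx <- which(keep, arr.ind = TRUE)
  data.frame(
    from = rownames(sim)[idx[, "row"]],
    to = colnames(sim)[idx[, "col"]],
    weight = sim[idx],
    stringsAsFactors = FALSE
  )
}

edges <- edges_in_range(sim, weight_range = c(0.7, 1))
print(edges)
```

In this toy case document D has no edges, because none of its similarities reaches 0.7.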
A Shiny app can be found on shinyapps.io, but it can also be run locally with:
- As far as we can tell, this should work on any language; we tried both English and Hebrew, with and without setting
- Best performance was achieved on R version > 3.5.0.
- Mattan S. Ben-Shachar [aut, cre].
- Almog Simchon [aut, cre].