## Plagiarism Detection

### Main and related tasks in plagiarism detection

* **Plagiarism detection:** Given a document, identify all plagiarized sources and the boundaries of re-used passages.
   - This task is similar to deduplication.
* **Author identification:** Given a document, identify its author.
* **Author profiling:** Given a document, extract information about the author (e.g. gender, age).

### External vs. Intrinsic plagiarism detection

#### External plagiarism detection

Given a set of suspicious documents and a set of source documents, the
task is to find all plagiarized text passages in the suspicious documents
and the corresponding text passages in the source documents.
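
One common way to compare a suspicious document with a candidate source is to measure the overlap of their word n-grams. The sketch below is only an illustration, not the method prescribed by this task: the helper names `shingles` and `jaccard`, the n-gram size of 3, and the toy sentences are all assumptions made for the example.

```python
def shingles(tokens, n=3):
    """Return the set of all contiguous word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy, made-up documents used only for illustration.
source = "the quick brown fox jumps over the lazy dog".split()
suspicious = "yesterday the quick brown fox jumps over a fence".split()
unrelated = "plagiarism detection compares documents with each other".split()

sim_suspicious = jaccard(shingles(source), shingles(suspicious))
sim_unrelated = jaccard(shingles(source), shingles(unrelated))
print(sim_suspicious, sim_unrelated)
```

A pair whose similarity exceeds some threshold would then be reported as a candidate plagiarism. A suitable threshold depends on the data and would have to be tuned.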

#### Intrinsic plagiarism detection

Given a set of suspicious documents, the task is to identify all plagiarized
text passages, e.g., by detecting writing style breaches. Comparing
a suspicious document with other documents is not allowed in this task.
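
A writing style breach can be approximated by computing a simple style feature over consecutive chunks of one document and flagging chunks that deviate strongly from the rest. The sketch below is a deliberately naive illustration: the feature (average word length), the chunk size, the threshold, and the function names are all assumptions, and real systems use much richer stylometric features.

```python
from statistics import mean, pstdev


def avg_word_length(tokens):
    """Average number of characters per token in a non-empty chunk."""
    return sum(len(t) for t in tokens) / len(tokens)


def style_outliers(tokens, chunk_size=5, threshold=1.5):
    """Return indices of chunks whose feature lies more than `threshold`
    standard deviations away from the mean over all chunks of the document."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    if not chunks:
        return []
    features = [avg_word_length(chunk) for chunk in chunks]
    mu, sigma = mean(features), pstdev(features)
    if sigma == 0:
        return []  # completely uniform style, nothing to flag
    return [i for i, f in enumerate(features) if abs(f - mu) / sigma > threshold]


# Toy, made-up document: short words everywhere except in the third chunk.
plain = "a cat sat on it".split()
ornate = ("extraordinarily sophisticated terminological "
          "characterization notwithstanding").split()
document = plain * 2 + ornate + plain * 2

flagged = style_outliers(document)
print(flagged)
```

Note that only the suspicious document itself is inspected here, which is what distinguishes the intrinsic setting from the external one.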

# Task: Select a detection algorithm and implement it in Python

- Input: A file in a 3-column vertical format (word, lemma, tag).
- Output: One plagiarism per line: id TAB detected source id TAB real source id. The last line is an evaluation line: precision, recall, and F1 measure.
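
The output and evaluation step could be sketched as follows. The task does not spell out how precision and recall are counted, so the definitions here are an assumption: every row describes a genuinely plagiarized document, a detected source of `None` means the detector gave no answer, precision is the share of correct answers among the answers given, and recall is the share of correct answers among all rows. The function name `evaluate` and the sample rows are made up for the example.

```python
def evaluate(rows):
    """Compute (precision, recall, F1) from (doc_id, detected_id, real_id) rows.

    Assumption: each row is a truly plagiarized document; detected_id is None
    when the detector reported no source for it.
    """
    answered = [row for row in rows if row[1] is not None]
    correct = [row for row in answered if row[1] == row[2]]
    precision = len(correct) / len(answered) if answered else 0.0
    recall = len(correct) / len(rows) if rows else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up detection results: (id, detected source id, real source id).
rows = [
    ("101", "1", "1"),   # correct
    ("102", "2", "2"),   # correct
    ("103", "5", "3"),   # wrong source
    ("104", None, "4"),  # no source detected
]

for doc_id, detected_id, real_id in rows:
    print(f"{doc_id}\t{detected_id}\t{real_id}")

precision, recall, f1 = evaluate(rows)
print(f"precision: {precision:.2f}, recall: {recall:.2f}, F1: {f1:.2f}")
```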


In [1]:
!wget https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/AdvancedNlpCourse/LanguageResourcesFromWeb/training_data.vert

--2021-10-27 18:05:12--  https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/AdvancedNlpCourse/LanguageResourcesFromWeb/training_data.vert
Resolving nlp.fi.muni.cz (nlp.fi.muni.cz)... 147.251.51.11
Connecting to nlp.fi.muni.cz (nlp.fi.muni.cz)|147.251.51.11|:443... connected.
HTTP request sent, awaiting response... 200 Ok
Length: 503730 (492K) [application/octet-stream]
Saving to: ‘training_data.vert’


2021-10-27 18:05:14 (672 KB/s) - ‘training_data.vert’ saved [503730/503730]



In [24]:
import sys, codecs, re

def parse_input(input):
  """
  Parse the input vert file into a dictionary. At the top level, documents are grouped by author.
  Each document is represented by a dictionary with the following entries:
    - author,
    - unique id,
    - class ("original" or "plagiarism"),
    - source_id (same as the unique id for originals; the unique id of the original for suspicious documents),
    - wordlist (dictionary mapping each word to its count),
    - lemmalist (dictionary mapping each lemma to its count).
  
  """

  header_re = re.compile(r'<doc author="([^"]+)" id="(\d+)" class="(plagiarism|original)" source="(\d+)"')

  # read all documents into memory - fine for a small amount of data
  doc_sets = {} # sets of documents, each from one author
  doc = {}
  with open(input, "r") as handle:
    for line in handle:
        if line.startswith('<doc'):

            # structure for info about document
            author, id_, class_, source_id = header_re.match(line).groups()
            doc = {
                'author': author,
                'id': id_,
                'class': class_,
                'source_id': source_id,
                'wordlist': {},
                'lemmalist': {}
            }
        elif line.startswith('</doc'):

            # add the document to its author's set of original or suspicious documents
            if doc['author'] not in doc_sets:
                doc_sets[doc['author']] = {'original': [], 'suspicious': []}
            if doc['class'] == 'original':
                doc_sets[doc['author']]['original'].append(doc)
            else:
                doc_sets[doc['author']]['suspicious'].append(doc)
        elif not line.startswith('<'):

            # adding info about content of document
            word, lemma, tag = line.rstrip().split('\t')[:3]
            doc['wordlist'][word] = doc['wordlist'].get(word, 0) + 1
            doc['lemmalist'][lemma] = doc['lemmalist'].get(lemma, 0) + 1

    return doc_sets

In [25]:
doc_sets = parse_input('training_data.vert')
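
With the documents parsed into count dictionaries, one possible baseline detector compares each suspicious document with the originals in the same author's set and reports the most similar one. The sketch below uses cosine similarity over the count dictionaries. This is only one option, not the required algorithm, and it assumes that the true source is among the originals of the same set. The names `cosine`, `detect_sources`, and `field`, as well as the toy data, are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two {item: count} dictionaries."""
    dot = sum(count * b.get(item, 0) for item, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_sources(doc_sets, field='wordlist'):
    """Return (id, detected source id, real source id) for each suspicious document.

    Assumption: the source is one of the originals in the same author's set.
    """
    results = []
    for docs in doc_sets.values():
        if not docs['original']:
            continue
        for susp in docs['suspicious']:
            best = max(docs['original'],
                       key=lambda orig: cosine(susp[field], orig[field]))
            results.append((susp['id'], best['id'], susp['source_id']))
    return results


# Toy data in the same shape as the output of parse_input (made up).
toy_doc_sets = {
    'author_a': {
        'original': [
            {'id': '1', 'source_id': '1', 'wordlist': {'cat': 2, 'sat': 1, 'mat': 1}},
            {'id': '2', 'source_id': '2', 'wordlist': {'stock': 2, 'market': 1, 'fell': 1}},
        ],
        'suspicious': [
            {'id': '3', 'source_id': '1', 'wordlist': {'cat': 1, 'sat': 1, 'rug': 1}},
        ],
    },
}

detections = detect_sources(toy_doc_sets)
print(detections)
```

On the real data the same function could be called as `detect_sources(doc_sets)`, or with `field='lemmalist'` to compare lemma counts instead of word forms.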