## Plagiarism Detection

### Main and related tasks in plagiarism detection

* **Plagiarism detection:** Given a document, identify all plagiarized sources and the boundaries of re-used passages.
   - similar to deduplication
* **Author identification:** Given a document, identify its author.
* **Author profiling:** Given a document, extract information about the author (e.g. gender, age).

### External vs. Intrinsic plagiarism detection

#### External plagiarism detection

Given a set of suspicious documents and a set of source documents, the
task is to find all text passages in the suspicious documents that have
been plagiarized, together with the corresponding text passages in the
source documents.
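
One common way to approach this comparison is to measure how many word n-grams of a suspicious document also occur in a source document. The sketch below is only an illustration of that idea, not the method used later in this notebook; the helper names `shingles` and `containment` and the toy sentences are assumptions made for the example.

```python
def shingles(tokens, n=3):
    """Return the set of word n-grams (shingles) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(suspicious_tokens, source_tokens, n=3):
    """Fraction of the suspicious document's shingles that also occur in the source."""
    susp = shingles(suspicious_tokens, n)
    if not susp:
        return 0.0  # document shorter than n tokens
    return len(susp & shingles(source_tokens, n)) / len(susp)


# toy data (hypothetical, not taken from training_data.vert)
source = "the quick brown fox jumps over the lazy dog".split()
suspicious = "yesterday the quick brown fox jumps over a fence".split()
unrelated = "rhubarb leaves are not recommended for consumption".split()

print(containment(suspicious, source))     # 4 of 7 trigrams are shared
print(containment(suspicious, unrelated))  # no shared trigrams
```

Because shingles keep word order, the shared n-grams also indicate where the re-used passage lies, which a bag-of-words comparison cannot do.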

#### Intrinsic plagiarism detection

Given a set of suspicious documents, the task is to identify all plagiarized
text passages, e.g., by detecting writing style breaches. Comparing a
suspicious document with other documents is not allowed in this task.
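
A style breach can be approximated by computing a stylistic feature per chunk of the document and flagging chunks that deviate strongly from the rest. The following is a minimal sketch under simplifying assumptions: the only feature is mean word length, chunks are given as plain strings, and the function name `style_outliers` and the threshold are hypothetical. Real systems typically combine many features.

```python
import statistics


def style_outliers(chunks, z_threshold=1.5):
    """Return indices of chunks whose mean word length deviates strongly from the others."""
    features = []
    for chunk in chunks:
        words = chunk.split()
        # guard against empty chunks
        features.append(sum(len(w) for w in words) / max(len(words), 1))
    mean = statistics.mean(features)
    stdev = statistics.pstdev(features)
    if stdev == 0:
        return []  # all chunks look the same for this feature
    return [i for i, f in enumerate(features) if abs(f - mean) / stdev > z_threshold]


# toy document split into chunks (hypothetical data)
chunks = [
    "the cat sat on the mat",
    "a dog ran to the park",
    "we ate jam and bread",
    "she had a red hat",
    "notwithstanding considerable methodological heterogeneity",
]
print(style_outliers(chunks))
```

Only the document itself is inspected here, which matches the constraint that no other documents may be consulted.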

## Task: Select a detection algorithm and implement it in Python

- Input: a file in 3-column vertical format (word, lemma, tag)
- Output: one plagiarism per line: id TAB detected source id TAB real source id. Evaluation line: precision, recall, F1 measure.
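
The output line and the evaluation line described above could be produced along these lines. This is a sketch only: `format_result` and `precision_recall_f1` are hypothetical helper names, and the ids and counts are made up for illustration.

```python
def format_result(doc_id, detected_source_id, real_source_id):
    """One result line: id TAB detected source id TAB real source id."""
    return f"{doc_id}\t{detected_source_id}\t{real_source_id}"


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts; empty denominators give 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# hypothetical ids and counts
print(format_result("101", "55", "55"))
print("precision: %.2f, recall: %.2f, F1: %.2f" % precision_recall_f1(tp=8, fp=2, fn=2))
```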


In [1]:
!wget https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/AdvancedNlpCourse/LanguageResourcesFromWeb/training_data.vert

--2021-10-29 09:29:14--  https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/AdvancedNlpCourse/LanguageResourcesFromWeb/training_data.vert
Resolving nlp.fi.muni.cz (nlp.fi.muni.cz)... 147.251.51.11
Connecting to nlp.fi.muni.cz (nlp.fi.muni.cz)|147.251.51.11|:443... connected.
HTTP request sent, awaiting response... 200 Ok
Length: 503730 (492K) [application/octet-stream]
Saving to: ‘training_data.vert’


2021-10-29 09:29:16 (663 KB/s) - ‘training_data.vert’ saved [503730/503730]



In [16]:
import pandas as pd
import numpy as np
import re

class PlagiarismDetection:
  def __init__(self):
    self.metadata = None
    self.docs = None

    # store computations that might be useful for other methods
    self.bag_of_words_docs = None

  def parse_input(self, vert_file):
    header_re = re.compile(r'<doc author="([^"]+)" id="(\d+)" class="(plagiarism|original)" source="(\d+)"')
    self.metadata = {}
    self.docs = {}
    current_id = None
    doc_list = []

    with open(vert_file, "r") as handle:
      for line in handle:

        # start of the document - preparing metadata
        if line.startswith('<doc'):

          # structure for info about document
          author, id_, class_, source_id = header_re.match(line).groups()
          doc = {
            'author': author,
            'id': id_,
            'class': class_,
            'source_id': source_id,
          }
          current_id = id_
          doc_list = []

        # end of the document - storing metadata
        elif line.startswith('</doc'):

          # adding document to the author's set - to original or suspicious documents
          if doc['author'] not in self.metadata:
              self.metadata[doc['author']] = {'original': [], 'suspicious': []}
          if doc['class'] == 'original':
              self.metadata[doc['author']]['original'].append(doc)
          else:
              self.metadata[doc['author']]['suspicious'].append(doc)

          self.docs[current_id] = pd.DataFrame(doc_list, columns=['word', 'lemma', 'tag'])

        elif not line.startswith('<'):

          # storing content of document
          word, lemma, tag = line.rstrip().split('\t')[:3]
          doc_list.append([word, lemma, tag])



In [17]:
detector = PlagiarismDetection()
detector.parse_input('training_data.vert')

In [20]:
detector.docs['101']

| | word | lemma | tag |
|---|---|---|---|
| 0 | Reveň | reveň | k1gFnSc1 |
| 1 | kadeřavá | kadeřavý | k2eAgFnSc1d1 |
| 2 | Reveň | reveň | k1gFnSc1 |
| 3 | kadeřavá | kadeřavý | k2eAgFnSc1d1 |
| 4 | ( | ( | kIx( |
| ... | ... | ... | ... |
| 276 | listy | lista | k1gFnPc1 |
| 277 | doporučovány | doporučovat | k5eAaImNgFnP |
| 278 | ke | k | k7c3 |
| 279 | konzumaci | konzumace | k1gFnSc3 |


In [79]:
import re

def parse_input(vert_file):
  """
  Parse the input vert file. Returns a tuple (doc_sets, lemma_set, word_set, N).

  In doc_sets, documents are grouped by author. Each document is represented
  by a dictionary with metadata:
    - author,
    - unique id,
    - class ("original" or "plagiarism"),
    - source_id (the same as the unique id for originals; the id of the original for suspicious documents),
    - wordlist (words with their counts),
    - lemmalist (lemmas with their counts).

  lemma_set and word_set map each lemma/word to the number of documents
  containing it; N is the total number of documents.
  """

  header_re = re.compile(r'<doc author="([^"]+)" id="(\d+)" class="(plagiarism|original)" source="(\d+)"')

  # reading all documents into memory - okay for a small amount of data
  doc_sets = {} # sets of documents, each from one author
  doc = {}
  word_set = {}
  lemma_set = {}
  N = 0
  doc_lemmas = {}
  doc_words = {}
  with open(vert_file, "r") as handle:
    for line in handle:
        if line.startswith('<doc'):

            # structure for info about document
            author, id_, class_, source_id = header_re.match(line).groups()
            doc = {
                'author': author,
                'id': id_,
                'class': class_,
                'source_id': source_id,
                'wordlist': {},
                'lemmalist': {}
            }

            doc_lemmas = {}
            doc_words = {}

        elif line.startswith('</doc'):

            # adding document to the author's set - to original or suspicious documents
            if doc['author'] not in doc_sets:
                doc_sets[doc['author']] = {'original': [], 'suspicious': []}
            if doc['class'] == 'original':
                doc_sets[doc['author']]['original'].append(doc)
            else:
                doc_sets[doc['author']]['suspicious'].append(doc)

            N += 1
            for word in doc_words.keys():
                word_set[word] = word_set.get(word, 0) + 1
            for lemma in doc_lemmas.keys():
                lemma_set[lemma] = lemma_set.get(lemma, 0) + 1
        elif not line.startswith('<'):

            # adding info about content of document
            word, lemma, tag = line.rstrip().split('\t')[:3]
            doc['wordlist'][word] = doc['wordlist'].get(word, 0) + 1
            doc['lemmalist'][lemma] = doc['lemmalist'].get(lemma, 0) + 1

            doc_words[word] = doc_words.get(word, 0) + 1
            doc_lemmas[lemma] = doc_lemmas.get(lemma, 0) + 1

    return doc_sets, lemma_set, word_set, N

In [45]:
from scipy import spatial

# minimum similarity for a suspicious document to be reported as plagiarism
DOC_SIMILARITY_THRESHOLD = 0.5

def cosine_similarity(doc1, doc2, word_or_lemma_list, **kwargs):
    """
    Convert both documents into vectors and compute their cosine similarity.
    Each vector item represents one word; its value is the relative
    frequency of that word in the document.
    The result is a number between 0 and 1 representing the similarity of
    the documents (1 means identical word distributions).
    """
    vector1, vector2 = [], []
    all_words = list(doc1[word_or_lemma_list].keys() | doc2[word_or_lemma_list].keys())
    doc1_len = float(sum(doc1[word_or_lemma_list].values()))
    doc2_len = float(sum(doc2[word_or_lemma_list].values()))
    for word in all_words:
        vector1.append(doc1[word_or_lemma_list].get(word, 0) / doc1_len)
        vector2.append(doc2[word_or_lemma_list].get(word, 0) / doc2_len)
    # scipy returns the cosine distance, so subtract it from 1 to get the similarity
    similarity = 1.0 - spatial.distance.cosine(vector1, vector2)
    return similarity
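
To make the measure in the cell above concrete, here is a small dependency-free sketch of cosine similarity on word-count dictionaries. The function name `cosine` and the toy counts are assumptions for illustration. It also shows that the measure ignores document length: scaling all counts by a constant leaves the result unchanged, so dividing by the document length does not affect the score.

```python
import math


def cosine(counts1, counts2):
    """Cosine similarity of two {word: count} dictionaries."""
    vocab = counts1.keys() | counts2.keys()
    dot = sum(counts1.get(w, 0) * counts2.get(w, 0) for w in vocab)
    norm1 = math.sqrt(sum(v * v for v in counts1.values()))
    norm2 = math.sqrt(sum(v * v for v in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm1 * norm2)


# toy counts (hypothetical)
a = {"reveň": 2, "kadeřavá": 1}
b = {"reveň": 4, "kadeřavá": 2}  # same proportions, twice as long
c = {"konzumaci": 3}             # no words in common with a

print(cosine(a, b))  # close to 1: identical word distribution
print(cosine(a, c))  # 0.0: disjoint vocabularies
```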

In [57]:
import numpy as np

def tf(word, doc, word_or_lemma_list):
    """Term frequency: count of the word relative to the document length."""
    N = float(sum(doc[word_or_lemma_list].values()))
    return doc[word_or_lemma_list][word] / N

def idf(word, word_or_lemma_set, N):
    """Inverse document frequency; 1 is added to the document count."""
    word_occurrence = word_or_lemma_set.get(word, 0) + 1
    return np.log(N / word_occurrence)

def tf_idf(doc, word_or_lemma_set, word_or_lemma_list, N):
    tf_idf_vec = np.zeros((len(word_or_lemma_set),))
    for word in doc[word_or_lemma_list].keys():
        tf_ = tf(word, doc, word_or_lemma_list)
        idf_ = idf(word, word_or_lemma_set, N)

        value = tf_ * idf_
        # NOTE: word_or_lemma_set[word] is the word's document frequency as
        # built by parse_input, not a unique position, so words with the same
        # document frequency overwrite each other in the vector.
        tf_idf_vec[word_or_lemma_set[word]] = value
    return tf_idf_vec

def tfidf_lemma(doc1, doc2, word_or_lemma_list, word_or_lemma_set, N, **kwargs):
    vector1 = tf_idf(doc1, word_or_lemma_set, word_or_lemma_list, N)
    vector2 = tf_idf(doc2, word_or_lemma_set, word_or_lemma_list, N)
    # scipy returns the cosine distance, so subtract it from 1 to get the similarity
    similarity = 1.0 - spatial.distance.cosine(vector1, vector2)
    return similarity
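
In `tf_idf` above, the vector position of a word is read from `word_or_lemma_set[word]`, which `parse_input` fills with document frequencies, so words that share a document frequency end up in the same slot. A variant that gives every term its own position might look like the sketch below. It is an untested alternative, not the code that produced the results in this notebook; `tfidf_vector`, `index` and the toy numbers are assumptions.

```python
import math


def tfidf_vector(counts, doc_freq, n_docs, index):
    """Build a tf-idf vector with one dedicated slot per vocabulary term."""
    total = sum(counts.values())
    vec = [0.0] * len(index)
    for term, count in counts.items():
        if term not in index:
            continue  # term outside the known vocabulary
        tf = count / total
        idf = math.log(n_docs / (doc_freq[term] + 1))  # same +1 as in idf() above
        vec[index[term]] = tf * idf
    return vec


# toy document frequencies (hypothetical)
doc_freq = {"reveň": 1, "listy": 3, "ke": 5}
# explicit term -> position mapping, built once for the whole collection
index = {term: i for i, term in enumerate(sorted(doc_freq))}

vec = tfidf_vector({"reveň": 2, "ke": 2}, doc_freq, n_docs=10, index=index)
print(index)
print(vec)
```

With a shared `index`, two document vectors are aligned term by term, which is what the cosine comparison expects.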
  

In [40]:
def evaluate(doc_sets, lemma_set, N, metric, word_or_lemma_list, word_or_lemma_set):
  # Compare the word lists of suspicious documents with the originals from the same document set.
  # The success rate is evaluated at the same time.
  stats = {'tp': 0, 'fp': 0, 'tn': 0, 'fn': 0}
  for author, doc_set in doc_sets.items():
      #print('Doc set by %s\n' % author)
      set_stats = {'tp': 0, 'fp': 0, 'tn': 0, 'fn': 0}
      for doc in doc_set['suspicious']:
          # comparison with all originals
          most_similar_doc_id = doc['id'] # default: the document is most similar to itself
          highest_similarity_score = 0.0
          plagiarism = False
          for orig_doc in doc_set['original']:
              similarity_score = metric(doc1=doc, doc2=orig_doc, word_or_lemma_list=word_or_lemma_list, word_or_lemma_set=word_or_lemma_set, N=N)
              if similarity_score >= DOC_SIMILARITY_THRESHOLD \
                      and similarity_score > highest_similarity_score:
                  most_similar_doc_id = orig_doc['id']
                  highest_similarity_score = similarity_score
                  plagiarism = True
          #print('%s\t%s\t%s\n' % (doc['id'], most_similar_doc_id, doc['source_id']))
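          # evaluation
          # Note: every document in 'suspicious' has class 'plagiarism', so the
          # 'tn' and 'fn' branches below are never reached and recall is either 1.00 or 0.00.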
          if most_similar_doc_id == doc['source_id']:
              if doc['class'] == 'plagiarism':
                  set_stats['tp'] += 1
              else:
                  set_stats['tn'] += 1
          else:
              if doc['class'] == 'plagiarism':
                  set_stats['fp'] += 1
              else:
                  set_stats['fn'] += 1
      # evaluation: per-author precision, recall and F1
      try:
          precision = set_stats['tp'] / float(set_stats['tp'] + set_stats['fp'])
      except ZeroDivisionError:
          precision = 0.0
      try:
          recall = set_stats['tp'] / float(set_stats['tp'] + set_stats['fn'])
      except ZeroDivisionError:
          recall = 0.0
      try:
          f1_measure = 2 * precision * recall / (precision + recall)
      except ZeroDivisionError:
          f1_measure = 0.0
      #print('Set precision: %.2f, recall: %.2f, F1: %.2f\n\n' %
      #    (precision, recall, f1_measure))
      stats['tp'] += set_stats['tp']
      stats['fp'] += set_stats['fp']
      stats['tn'] += set_stats['tn']
      stats['fn'] += set_stats['fn']
  try:
      precision = stats['tp'] / float(stats['tp'] + stats['fp'])
  except ZeroDivisionError:
      precision = 0.0
  try:
      recall = stats['tp'] / float(stats['tp'] + stats['fn'])
  except ZeroDivisionError:
      recall = 0.0
  try:
      f1_measure = 2 * precision * recall / (precision + recall)
  except ZeroDivisionError:
      f1_measure = 0.0
  print('Overall precision: %.2f, recall: %.2f, F1: %.2f\n' %
      (precision, recall, f1_measure))

In [80]:
doc_sets, lemma_set, word_set, N = parse_input('training_data.vert')
evaluate(doc_sets=doc_sets, lemma_set=lemma_set, N=N, metric=cosine_similarity, word_or_lemma_list='wordlist', word_or_lemma_set=None)

Overall precision: 0.88, recall: 1.00, F1: 0.93



In [63]:
evaluate(doc_sets=doc_sets, lemma_set=lemma_set, N=N, metric=cosine_similarity, word_or_lemma_list='lemmalist', word_or_lemma_set=None)

Overall precision: 0.88, recall: 1.00, F1: 0.93



In [81]:
evaluate(doc_sets=doc_sets, lemma_set=lemma_set, N=N, metric=tfidf_lemma, word_or_lemma_list='lemmalist', word_or_lemma_set=lemma_set)


Overall precision: 0.65, recall: 1.00, F1: 0.79



In [82]:
evaluate(doc_sets=doc_sets, lemma_set=lemma_set, N=N, metric=tfidf_lemma, word_or_lemma_list='wordlist', word_or_lemma_set=word_set)


Overall precision: 0.65, recall: 1.00, F1: 0.79



In [68]:
lemma_set

{}