**Submitted by: Prabin Sahani**

**BECE (Day)**

**171342**

**NLTK (Natural Language Toolkit):**

  NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

**Scikit-Learn (Sklearn):**

  Scikit-learn is a widely used, robust library for machine learning in Python. It provides efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction, through a consistent interface. Here we use Scikit-learn's TfidfVectorizer and cosine_similarity.


*The following code imports the libraries needed for plagiarism detection.*

In [2]:
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [4]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

*The following code makes a list (student_files) of all the files in the current working directory that have the .txt file extension.*

Note: The files to be checked for plagiarism must be in .txt format.

In [5]:
student_files = [doc for doc in os.listdir() if doc.endswith('.txt')]

*The following code reads the content of each file and stores it in a list (student_notes).*

In [6]:
from pathlib import Path  # read_text() opens, reads, and closes each file
student_notes = [Path(file_name).read_text() for file_name in student_files]


*The following code preprocesses the text in the above list (student_notes). First, the content of each file is tokenized into sentences with sent_tokenize. Then every character that is not a letter, including digits and punctuation such as '.', ',', '!', '?', '{', '}', '[', ']', '(' and ')', is replaced with a space using a regular expression. The text is converted to lowercase and split into words. Next, the words that are not English stopwords are lemmatized, and the stopwords themselves are dropped. Finally, the words are joined back into a single string, which is appended to the list (student_note).*

In [7]:
wordnet_lem = WordNetLemmatizer()
# Build the stopword set once instead of once per word.
stop_words = set(stopwords.words('english'))
student_note = []
for note in student_notes:
    cleaned_sentences = []
    for sent in nltk.sent_tokenize(note):
        # Replace every non-letter with a space, then lowercase and split into words.
        words = re.sub('[^a-zA-Z]', ' ', sent).lower().split()
        # Lemmatize only the words that are not English stopwords.
        words = [wordnet_lem.lemmatize(word) for word in words if word not in stop_words]
        cleaned_sentences.append(' '.join(words))
    student_note.append(' '.join(cleaned_sentences).strip())
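
To make the cleaning steps concrete, here is a minimal sketch that applies the same regular expression, lowercasing, and stopword filtering to a single sentence. The tiny `SAMPLE_STOPWORDS` set and the `clean_sentence` helper are hypothetical names used only for illustration (the notebook itself relies on NLTK's English stopword list), and lemmatization is left out so that the sketch runs without any downloads.

```python
import re

# Hypothetical, deliberately tiny stopword set; the notebook uses NLTK's full English list.
SAMPLE_STOPWORDS = {"the", "is", "a", "of", "in", "and", "on"}

def clean_sentence(sentence):
    """Keep letters only, lowercase, split into words, and drop stopwords."""
    letters_only = re.sub('[^a-zA-Z]', ' ', sentence)
    words = letters_only.lower().split()
    return ' '.join(word for word in words if word not in SAMPLE_STOPWORDS)

cleaned = clean_sentence("The 2 cats sat on the mat, and slept!")
print(cleaned)  # cats sat mat slept
```

The digit and the punctuation marks disappear because they are not letters, and "The", "on", "the", and "and" are dropped as stopwords.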

**Tf-idf:**
Term frequency-inverse document frequency (tf-idf) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. The tf-idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.
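
The weighting described above can be sketched in plain Python. This is an illustrative sketch, not the notebook's implementation: the `tfidf` helper is a hypothetical name, it assumes the smoothed idf variant `ln((1 + n) / (1 + df)) + 1` that scikit-learn documents as its default, and it skips the length normalization that TfidfVectorizer applies afterwards.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (raw tf times smoothed idf, unnormalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return weights

# Toy corpus for illustration only.
docs = ["data science uses data", "science needs evidence", "science is fun"]
weights = tfidf(docs)
print(weights[0])
```

In the first document, "science" appears in every document and so receives the lowest weight, while "data" appears twice and only in that document, so it receives the highest.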

**Cosine Similarity:**
Cosine similarity is a metric for determining how similar two data objects are irrespective of their size. We can use it to measure the similarity between two texts in Python. Each data object is treated as a vector, and the similarity is the cosine of the angle between the two vectors. The formula for the cosine similarity between two vectors x and y is:
cos(x, y) = (x · y) / (||x|| ||y||)
where x · y is the dot product of the vectors and ||x|| is the Euclidean length of x.
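
As a worked example of the formula, the following sketch computes the cosine similarity of small hand-made vectors in plain Python. The `cosine` helper and the sample vectors are hypothetical. Returning 0.0 for an all-zero vector is an assumption made here to avoid a division by zero.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: an empty (all-zero) vector is similar to nothing
    return dot / (norm_x * norm_y)

short_doc = [1, 2, 0]
long_doc = [10, 20, 0]   # same direction as short_doc, ten times the length
other_doc = [0, 0, 5]    # shares no terms with the other two

same_direction = cosine(short_doc, long_doc)
no_overlap = cosine(short_doc, other_doc)
print(same_direction, no_overlap)
```

The first score is approximately 1 even though one vector is ten times longer, which is what "irrespective of their size" means, and the second is 0 because the vectors have no terms in common.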

*The following lambda function uses TfidfVectorizer() to convert each preprocessed text in student_note into a vector of tf-idf weights, so that the documents are represented as numbers that can be compared.*

In [8]:
vectorize = lambda Text: TfidfVectorizer().fit_transform(Text).toarray()

The following lambda function computes the cosine similarity matrix for the vectors of two text files. The entry at [0][1] of that matrix is the similarity between the two files.

In [9]:
similarity = lambda doc1, doc2: cosine_similarity([doc1, doc2])

In the following code, all the preprocessed texts are converted into vectors.

In [10]:
vectors = vectorize(student_note)

In the following code, tuples of the file names in student_files and their vectors are created and stored in the s_vectors list.

In [11]:
s_vectors = list(zip(student_files, vectors))

In the following code, we first create plagiarism_results as a set so that each result is stored only once. For each file, we copy s_vectors, find the index of the current (file, vector) tuple, and remove it from the copy so that the file is compared only with the other files. We then compute the similarity between the current file and each remaining file, sort the two file names so that a pair produces the same entry regardless of comparison order, and store the names together with the similarity score. Finally, the results are added to plagiarism_results, which is printed after calling the function.

In [12]:
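def check_plagiarism():
    # A set keeps each (file, file, score) result only once.
    plagiarism_results = set()
    for student_a, text_vector_a in s_vectors:
        # Copy the list and drop the current file so it is not compared with itself.
        new_vectors = s_vectors.copy()
        current_index = new_vectors.index((student_a, text_vector_a))
        del new_vectors[current_index]
        for student_b, text_vector_b in new_vectors:
            # [0][1] is the similarity between the first and the second vector.
            sim_score = similarity(text_vector_a, text_vector_b)[0][1]
            # Sort the names so that (a, b) and (b, a) produce the same entry.
            student_pair = sorted((student_a, student_b))
            score = (student_pair[0], student_pair[1], sim_score)
            plagiarism_results.add(score)
    return plagiarism_results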
    
for data in check_plagiarism():
    print(data)

('source_check.txt', 'test_3.txt', 0.021059745276379792)
('source_check.txt', 'student_work.txt', 0.7742890007201874)
('student_work.txt', 'test_3.txt', 0.02219225284643004)
('New Text Document.txt', 'student_work.txt', 0.0)
('New Text Document.txt', 'test_3.txt', 0.0)
('New Text Document.txt', 'source_check.txt', 0.0)
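
The pairwise comparison above visits every pair twice and relies on the set to discard the duplicate. As an alternative sketch (not the notebook's method), `itertools.combinations` yields each unordered pair exactly once. The `cosine` and `pairwise_scores` helpers, the file names, and the toy vectors below are all hypothetical, and the scores are sorted so that the most similar pair is listed first.

```python
import math
from itertools import combinations

def cosine(x, y):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros (assumption)."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms if norms else 0.0

def pairwise_scores(named_vectors):
    """Score every unordered pair of (name, vector) entries exactly once."""
    results = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(named_vectors, 2):
        first, second = sorted((name_a, name_b))
        results.append((first, second, cosine(vec_a, vec_b)))
    # Highest similarity first, so the most suspicious pair is at the top.
    return sorted(results, key=lambda row: row[2], reverse=True)

# Hypothetical file names and toy vectors, not the notebook's data.
sample = [
    ("a.txt", [1, 1, 0]),
    ("b.txt", [2, 2, 0]),
    ("c.txt", [0, 0, 3]),
    ("empty.txt", [0, 0, 0]),
]
rows = pairwise_scores(sample)
for row in rows:
    print(row)
```

Four files give six pairs, matching the six results printed by the notebook. As in the output above, an empty file scores 0 against every other file.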
