# 2252621

# In-Class Exercise on Plagiarism Detection

## Overview
In this assignment, we work step by step through building a plagiarism detection application on top of a pre-trained Word2Vec model.

- Data: https://s3.amazonaws.com/video.udacity-data.com/topher/2019/January/5c4147f9_data/data.zip
- This exercise requires knowledge of Python programming with the following libraries:
  - `gensim` (to load the Word2Vec model)
  - `numpy` (to compute similarity)
- Additionally, we will use Google's pre-trained Word2Vec model: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g

## Steps to Solve This Exercise
1. Explore the dataset.
2. Build a class for computing document similarity (the `DocSim` class).
3. Create an instance of that class: load the pre-trained word embedding model, create a list of stopwords, and instantiate `DocSim`.
4. Detect plagiarism and evaluate the results.

**Note**: Submit only a single Jupyter Notebook file that can handle all tasks, including data downloading, preprocessing, and model training. (Submissions that do not follow the guidelines will receive a score of 0.)

## Import and Download

In [1]:
# Install required libraries
%pip install gensim numpy pandas matplotlib nltk gdown requests --quiet

Note: you may need to restart the kernel to use updated packages.




In [2]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import gensim
import sys
import platform
import gdown
import requests
from zipfile import ZipFile
from gensim.models import KeyedVectors

import nltk
from nltk.corpus import stopwords

# Download stopwords dataset (only needed once)
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# Python environment details
print("Python executable being used:", sys.executable)
print("Python version:", sys.version)

# Operating System details
print("Operating System:", platform.system())
print("OS Version:", platform.version())
print("OS Release:", platform.release())

# Machine and architecture details
print("Machine:", platform.machine())

# Visual Studio Code details (based on environment variable)
vscode_info = os.environ.get('VSCODE_PID', None)
if vscode_info:
    print("Running in Visual Studio Code")
else:
    print("Not running in Visual Studio Code")

Python executable being used: e:\anaconda3\envs\ml_env_test\python.exe
Python version: 3.12.8 | packaged by Anaconda, Inc. | (main, Dec 11 2024, 16:48:34) [MSC v.1929 64 bit (AMD64)]
Operating System: Windows
OS Version: 10.0.19045
OS Release: 10
Machine: AMD64
Running in Visual Studio Code


In [4]:
# if not os.path.exists("data"):
#     os.makedirs("data")
# if not os.path.exists("models"):
#     os.makedirs("models")


In [5]:
dataset_url = "https://s3.amazonaws.com/video.udacity-data.com/topher/2019/January/5c4147f9_data/data.zip"
dataset_filename = "data.zip"

if not os.path.exists(dataset_filename):
    print("Downloading dataset...")
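    response = requests.get(dataset_url, timeout=60)
    response.raise_for_status()  # Fail loudly instead of saving an HTTP error page as data.zip
    with open(dataset_filename, "wb") as f:
        f.write(response.content)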
    print("Dataset downloaded successfully!")

    # Extract the dataset
    print("Extracting dataset...")
    with ZipFile(dataset_filename, 'r') as zip_ref:
        zip_ref.extractall("data")
    print("Dataset extracted successfully!")
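
The cell above extracts the archive only in the same branch that downloads it, so if `data.zip` is already present but the `data` folder has been removed, nothing is extracted. A minimal sketch of separating the two checks is shown below; `ensure_extracted` is a hypothetical helper name, and the demonstration runs on a throwaway archive rather than the real dataset.

```python
import os
import tempfile
from zipfile import ZipFile


def ensure_extracted(zip_path, target_dir):
    """Extract zip_path into target_dir unless target_dir already has content.

    Returns True if extraction ran, False if it was skipped.
    """
    if os.path.isdir(target_dir) and os.listdir(target_dir):
        return False
    with ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(target_dir)
    return True


# Demonstration on a temporary archive (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with ZipFile(demo_zip, "w") as zf:
        zf.writestr("data/example.txt", "hello")

    out_dir = os.path.join(tmp, "extracted")
    first_run = ensure_extracted(demo_zip, out_dir)   # extracts
    second_run = ensure_extracted(demo_zip, out_dir)  # skipped, folder is populated
    extracted_files = sorted(os.listdir(os.path.join(out_dir, "data")))
```

In the notebook this would be called as `ensure_extracted(dataset_filename, "data")` after the download check, so extraction no longer depends on whether the download ran.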


In [6]:
# Define download URL and file path
# word2vec_url = "https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM"  # link given in the assignment
word2vec_url = "https://drive.google.com/uc?id=1t6JI7xzGIh47O4-D19Ybs44zRB3dY5FX"  # our own copy
word2vec_filename = "models/GoogleNews-vectors-negative300.bin.gz"

# Ensure the models directory exists
if not os.path.exists("models"):
    os.makedirs("models")

# Check if the model file already exists
if not os.path.exists(word2vec_filename):
    print("Downloading pre-trained Word2Vec model (this may take a while)...")
    
    try:
        gdown.download(word2vec_url, word2vec_filename, quiet=False)
        print("Pre-trained Word2Vec model downloaded successfully!")
    except Exception as e:
        print(f"Error downloading Word2Vec model: {e}")

else:
    print("Pre-trained Word2Vec model already exists.")


Pre-trained Word2Vec model already exists.


## Exploring the dataset

In [7]:
def print_directory_structure(root_dir=None, indent=""):
    """
    Recursively print the directory tree rooted at root_dir.
    
    Args:
        root_dir (str): The directory to start from. If None, uses the current working directory.
        indent (str): Indentation for subdirectories (used for recursion).
    """
    if root_dir is None:
        root_dir = os.getcwd()  # Use current directory if not specified

    print(f"📂 {root_dir}")  # Print the root directory

    for item in os.listdir(root_dir):
        item_path = os.path.join(root_dir, item)
        if os.path.isdir(item_path):
            print(f"{indent}📁 {item}")  # Print folder name
            print_directory_structure(item_path, indent + "    ")  # Recursive call for subdirectory
        else:
            print(f"{indent}📄 {item}")  # Print file name


In [None]:
print_directory_structure()

📂 e:\2_LEARNING_BKU\2_File_2\K22_HK242\CO3085_NLP\BT\Lab05
📄 2252621_Lab5_HEX copy.ipynb
📄 2252621_Lab5_HEX.ipynb
📄 2252621_Lab5_inclass.ipynb
📁 data
📂 e:\2_LEARNING_BKU\2_File_2\K22_HK242\CO3085_NLP\BT\Lab05\data
    📁 data
📂 e:\2_LEARNING_BKU\2_File_2\K22_HK242\CO3085_NLP\BT\Lab05\data\data
        📄 .DS_Store
        📄 file_information.csv
        📄 g0pA_taska.txt
        📄 g0pA_taskb.txt
        📄 g0pA_taskc.txt
        📄 g0pA_taskd.txt
        📄 g0pA_taske.txt
        📄 g0pB_taska.txt
        📄 g0pB_taskb.txt
        📄 g0pB_taskc.txt
        📄 g0pB_taskd.txt
        📄 g0pB_taske.txt
        📄 g0pC_taska.txt
        📄 g0pC_taskb.txt
        📄 g0pC_taskc.txt
        📄 g0pC_taskd.txt
        📄 g0pC_taske.txt
        📄 g0pD_taska.txt
        📄 g0pD_taskb.txt
        📄 g0pD_taskc.txt
        📄 g0pD_taskd.txt
        📄 g0pD_taske.txt
        📄 g0pE_taska.txt
        📄 g0pE_taskb.txt
        📄 g0pE_taskc.txt
        📄 g0pE_taskd.txt
        📄 g0pE_taske.txt
        📄 g1pA_taska.txt
     

In [9]:
# Explore the dataset
data_path = "data/data"
print("Files in dataset directory:")
print(os.listdir(data_path))


Files in dataset directory:
['.DS_Store', 'file_information.csv', 'g0pA_taska.txt', 'g0pA_taskb.txt', 'g0pA_taskc.txt', 'g0pA_taskd.txt', 'g0pA_taske.txt', 'g0pB_taska.txt', 'g0pB_taskb.txt', 'g0pB_taskc.txt', 'g0pB_taskd.txt', 'g0pB_taske.txt', 'g0pC_taska.txt', 'g0pC_taskb.txt', 'g0pC_taskc.txt', 'g0pC_taskd.txt', 'g0pC_taske.txt', 'g0pD_taska.txt', 'g0pD_taskb.txt', 'g0pD_taskc.txt', 'g0pD_taskd.txt', 'g0pD_taske.txt', 'g0pE_taska.txt', 'g0pE_taskb.txt', 'g0pE_taskc.txt', 'g0pE_taskd.txt', 'g0pE_taske.txt', 'g1pA_taska.txt', 'g1pA_taskb.txt', 'g1pA_taskc.txt', 'g1pA_taskd.txt', 'g1pA_taske.txt', 'g1pB_taska.txt', 'g1pB_taskb.txt', 'g1pB_taskc.txt', 'g1pB_taskd.txt', 'g1pB_taske.txt', 'g1pD_taska.txt', 'g1pD_taskb.txt', 'g1pD_taskc.txt', 'g1pD_taskd.txt', 'g1pD_taske.txt', 'g2pA_taska.txt', 'g2pA_taskb.txt', 'g2pA_taskc.txt', 'g2pA_taskd.txt', 'g2pA_taske.txt', 'g2pB_taska.txt', 'g2pB_taskb.txt', 'g2pB_taskc.txt', 'g2pB_taskd.txt', 'g2pB_taske.txt', 'g2pC_taska.txt', 'g2pC_taskb.txt'

In [10]:
# Load dataset path
DATA_PATH = "data/data"
CSV_FILE = os.path.join(DATA_PATH, "file_information.csv")

# Load file classification information (train data only)
df_info = pd.read_csv(CSV_FILE)
df_info.columns = ["File", "Task", "Category"]  # Rename columns for clarity

In [11]:
df_info.head()

| | File | Task | Category |
|---|---|---|---|
| 0 | g0pA_taska.txt | a | non |
| 1 | g0pA_taskb.txt | b | cut |
| 2 | g0pA_taskc.txt | c | light |
| 3 | g0pA_taskd.txt | d | heavy |
| 4 | g0pA_taske.txt | e | non |
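
Before building the similarity class, it can help to check how the categories are distributed across tasks. The sketch below uses only the five rows shown by `df_info.head()` as a stand-in frame (named `sample` here, a hypothetical name); in the notebook the same two calls would be applied to `df_info` itself.

```python
import pandas as pd

# Stand-in frame containing only the five rows displayed by df_info.head().
sample = pd.DataFrame({
    "File": ["g0pA_taska.txt", "g0pA_taskb.txt", "g0pA_taskc.txt",
             "g0pA_taskd.txt", "g0pA_taske.txt"],
    "Task": ["a", "b", "c", "d", "e"],
    "Category": ["non", "cut", "light", "heavy", "non"],
})

# How many files fall into each category.
category_counts = sample["Category"].value_counts()

# Task-by-category table: one row per task, one column per category.
task_by_category = pd.crosstab(sample["Task"], sample["Category"])

print(category_counts)
print(task_by_category)
```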


## Build a class for computing document similarity (`DocSim` class)

In [12]:
class DocSim:
    def __init__(self, model, stopwords=None):
        """
        Initialize the DocSim class with a pre-trained word embedding model and stopwords.
        
        Args:
            model: A pre-trained word embedding model (e.g., Word2Vec, FastText).
            stopwords: A list of stopwords to ignore during similarity calculation.
        """
        self.model = model
        # Store as a set so membership tests in _vectorize are O(1).
        self.stopwords = set(stopwords) if stopwords else set()

    def _vectorize(self, words):
        """
        Vectorize a list of words using the word embedding model.
        
        Args:
            words: A list of words to vectorize.
        
        Returns:
            The average vector of the words.
        """
        vectors = []
        for word in words:
            if word not in self.stopwords and word in self.model:
                vectors.append(self.model[word])
        
        if vectors:
            return np.mean(vectors, axis=0)
        else:
            return np.zeros(self.model.vector_size)
    
    def similarity(self, source, target):
        """
        Calculate the similarity between two documents.
        
        Args:
            source: Source document (string).
            target: Target document (string).
        
        Returns:
            Cosine similarity score.
        """
        source_vector = self._vectorize(source.split())
        target_vector = self._vectorize(target.split())
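        norm_product = np.linalg.norm(source_vector) * np.linalg.norm(target_vector)
        if norm_product == 0:
            # At least one document has no in-vocabulary, non-stopword tokens.
            return 0.0
        return np.dot(source_vector, target_vector) / norm_product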



## Create an instance of the above class 

- Load the pre-trained word embedding model 
- Create a list of stopwords and create an instance of the `DocSim` class

In [13]:
# Load Google's pre-trained Word2Vec model
print("Loading pre-trained Word2Vec model...")
word2vec_path = "models/GoogleNews-vectors-negative300.bin.gz"
model = KeyedVectors.load_word2vec_format(word2vec_path, binary=True)
print("Word2Vec model loaded successfully!")


Loading pre-trained Word2Vec model...
Word2Vec model loaded successfully!


In [14]:
# Get a list of common English stopwords
stopwords_list = set(stopwords.words('english'))

print("Example Stopwords:", list(stopwords_list)[:10])  # Show a few stopwords


Example Stopwords: ['no', 'very', 'was', 'has', 'wouldn', 'between', 't', "hasn't", 'few', 'his']


In [15]:
# custom_stopwords = {
#     "amazon", "product", "review", "music", "album", "track", "song", "listen",
#     "buy", "purchased", "download", "play", "sound", "favorite", "cd", "mp3",
#     "great", "good", "best", "bad", "worst"
# }

# # Merge with NLTK stopwords
# stopwords_list.update(custom_stopwords)

In [16]:
# Create an instance of the DocSim class
doc_sim = DocSim(model, stopwords=stopwords_list)
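
To see what `doc_sim.similarity` computes without loading the 300-dimensional model, the sketch below repeats the same two steps (average the word vectors, then take the cosine) on a toy vocabulary. The 2-dimensional vectors, the stopword set, and the helper names `average_vector` and `cosine` are made up for illustration and are not Word2Vec values.

```python
import numpy as np

# Toy 2-dimensional "embeddings" (illustrative only, not Word2Vec values).
toy_vectors = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.8, 0.6]),
    "car": np.array([0.0, 1.0]),
}
toy_stopwords = {"the", "a"}


def average_vector(text):
    """Average the vectors of all known, non-stopword tokens in text."""
    words = [w for w in text.split() if w not in toy_stopwords and w in toy_vectors]
    if not words:
        return np.zeros(2)
    return np.mean([toy_vectors[w] for w in words], axis=0)


def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0


same = cosine(average_vector("the cat"), average_vector("a cat"))       # identical content words
related = cosine(average_vector("the cat"), average_vector("the dog"))  # nearby vectors
empty = cosine(average_vector("the cat"), average_vector("the a"))      # only stopwords on one side
print(same, related, empty)
```

Because documents are reduced to a single averaged vector, word order is ignored and long documents on the same topic tend to score close to 1, which is worth keeping in mind when reading the similarity table later on.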

## Plagiarism Detection and Evaluation

In [17]:
# Load all text files and map them to their category
text_data = {}
for file_name in df_info["File"]:
    file_path = os.path.join(DATA_PATH, file_name)
    
    try:
        with open(file_path, "r", encoding="utf-8") as f:
            text_data[file_name] = f.read()
    except UnicodeDecodeError:
        print(f"Warning: {file_name} has non-UTF-8 encoding. Trying ISO-8859-1.")
        with open(file_path, "r", encoding="ISO-8859-1") as f:
            text_data[file_name] = f.read()




In [18]:
text_data.keys()

dict_keys(['g0pA_taska.txt', 'g0pA_taskb.txt', 'g0pA_taskc.txt', 'g0pA_taskd.txt', 'g0pA_taske.txt', 'g0pB_taska.txt', 'g0pB_taskb.txt', 'g0pB_taskc.txt', 'g0pB_taskd.txt', 'g0pB_taske.txt', 'g0pC_taska.txt', 'g0pC_taskb.txt', 'g0pC_taskc.txt', 'g0pC_taskd.txt', 'g0pC_taske.txt', 'g0pD_taska.txt', 'g0pD_taskb.txt', 'g0pD_taskc.txt', 'g0pD_taskd.txt', 'g0pD_taske.txt', 'g0pE_taska.txt', 'g0pE_taskb.txt', 'g0pE_taskc.txt', 'g0pE_taskd.txt', 'g0pE_taske.txt', 'g1pA_taska.txt', 'g1pA_taskb.txt', 'g1pA_taskc.txt', 'g1pA_taskd.txt', 'g1pA_taske.txt', 'g1pB_taska.txt', 'g1pB_taskb.txt', 'g1pB_taskc.txt', 'g1pB_taskd.txt', 'g1pB_taske.txt', 'g1pD_taska.txt', 'g1pD_taskb.txt', 'g1pD_taskc.txt', 'g1pD_taskd.txt', 'g1pD_taske.txt', 'g2pA_taska.txt', 'g2pA_taskb.txt', 'g2pA_taskc.txt', 'g2pA_taskd.txt', 'g2pA_taske.txt', 'g2pB_taska.txt', 'g2pB_taskb.txt', 'g2pB_taskc.txt', 'g2pB_taskd.txt', 'g2pB_taske.txt', 'g2pC_taska.txt', 'g2pC_taskb.txt', 'g2pC_taskc.txt', 'g2pC_taskd.txt', 'g2pC_taske.txt',

In [19]:
text_data['g0pA_taska.txt']

'Inheritance is a basic concept of Object-Oriented Programming where\nthe basic idea is to create new classes that add extra detail to\nexisting classes. This is done by allowing the new classes to reuse\nthe methods and variables of the existing classes and new methods and\nclasses are added to specialise the new class. Inheritance models the\n“is-kind-of” relationship between entities (or objects), for example,\npostgraduates and undergraduates are both kinds of student. This kind\nof relationship can be visualised as a tree structure, where ‘student’\nwould be the more general root node and both ‘postgraduate’ and\n‘undergraduate’ would be more specialised extensions of the ‘student’\nnode (or the child nodes). In this relationship ‘student’ would be\nknown as the superclass or parent class whereas, ‘postgraduate’ would\nbe known as the subclass or child class because the ‘postgraduate’\nclass extends the ‘student’ class.\n\nInheritance can occur on several layers, where if visualis

In [20]:
print(df_info.info())
print(df_info["Category"].unique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   File      100 non-null    object
 1   Task      100 non-null    object
 2   Category  100 non-null    object
dtypes: object(3)
memory usage: 2.5+ KB
None
['non' 'cut' 'light' 'heavy' 'orig']
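
For evaluation, the five categories usually need to be collapsed into a binary label. The sketch below assumes the common reading of these labels for this corpus: `cut`, `light`, and `heavy` are degrees of plagiarism, `non` is an independently written answer, and `orig` is the source text that the answers are compared against. That interpretation and the helper name `to_label` are assumptions, not something the notebook states.

```python
import pandas as pd

# Assumed interpretation of the category labels (see the note above).
PLAGIARIZED = {"cut", "light", "heavy"}


def to_label(category):
    """Return 1 for plagiarized, 0 for non-plagiarized, -1 for the source text."""
    if category == "orig":
        return -1
    return int(category in PLAGIARIZED)


categories = pd.Series(["non", "cut", "light", "heavy", "orig"])
labels = categories.map(to_label)
print(labels.tolist())
```

In the notebook this would be applied as `df_info["Category"].map(to_label)`, after which the rows labeled `-1` can be set aside as reference documents.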


In [21]:
# Store results
results = set()     # One (file1, file2, similarity, category1, category2) tuple per pair
remove_set = set()  # Pairs already scored, so each unordered pair is compared once

# Compare every document with every other document
print("Computing document similarities...")
for file1, text1 in text_data.items():
    for file2, text2 in text_data.items():
        if file1 != file2:  # Avoid self-comparison
            if (file1, file2) not in remove_set and (file2, file1) not in remove_set:
                remove_set.add((file1, file2))
                similarity = doc_sim.similarity(text1, text2)
                category1 = df_info[df_info["File"] == file1]["Category"].values[0]
                category2 = df_info[df_info["File"] == file2]["Category"].values[0]

                # Store unique pair with categories
                results.add((file1, file2, similarity, category1, category2))

# Convert results to DataFrame
df_results = pd.DataFrame(results, columns=["File1", "File2", "Similarity", "Category1", "Category2"])

# Sort by Similarity (Descending)
df_results = df_results.sort_values(by="Similarity", ascending=False)

# Save sorted results
df_results.to_csv("train_similarities.csv", index=False)
print("Training similarities saved to train_similarities.csv (sorted, no duplicates)")


Computing document similarities...


Training similarities saved to train_similarities.csv (sorted, no duplicates)
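
The same "each unordered pair once" logic can be written more compactly with `itertools.combinations`, which removes the need for a bookkeeping set. The sketch below is an alternative under stated assumptions, not the notebook's method: `pairwise_scores` is a hypothetical helper, the three short texts are invented, and a word-overlap (Jaccard) score stands in for `doc_sim.similarity` so the example runs without the model.

```python
from itertools import combinations


def pairwise_scores(texts, score_fn):
    """Score every unordered pair of documents exactly once.

    texts maps file name -> document text; score_fn(text1, text2) -> float.
    """
    return [
        (name1, name2, score_fn(texts[name1], texts[name2]))
        for name1, name2 in combinations(texts, 2)
    ]


def jaccard(text1, text2):
    """Stand-in similarity: shared words divided by all distinct words."""
    words1, words2 = set(text1.split()), set(text2.split())
    union = words1 | words2
    return len(words1 & words2) / len(union) if union else 0.0


# Invented toy documents, used only to exercise the helper.
toy_texts = {
    "a.txt": "inheritance is a concept",
    "b.txt": "inheritance is an idea",
    "c.txt": "dynamic programming",
}
pairs = pairwise_scores(toy_texts, jaccard)
print(len(pairs), pairs[0])
```

With n documents this produces n(n-1)/2 pairs, so 100 files give 4,950 comparisons.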


In [22]:
df_results.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4950 entries, 1348 to 1798
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   File1       4950 non-null   object 
 1   File2       4950 non-null   object 
 2   Similarity  4950 non-null   float32
 3   Category1   4950 non-null   object 
 4   Category2   4950 non-null   object 
dtypes: float32(1), object(4)
memory usage: 212.7+ KB


In [23]:
df_results.head(10)

| | File1 | File2 | Similarity | Category1 | Category2 |
|---|---|---|---|---|---|
| 1348 | g0pE_taska.txt | orig_taska.txt | 0.997738 | light | orig |
| 4611 | g3pA_taskd.txt | orig_taskd.txt | 0.997639 | cut | orig |
| 1783 | g4pC_taska.txt | orig_taska.txt | 0.997398 | cut | orig |
| 2283 | g4pC_taskd.txt | orig_taskd.txt | 0.996071 | heavy | orig |
| 4172 | g3pA_taskd.txt | g4pC_taskd.txt | 0.994818 | cut | heavy |
| 559 | g0pE_taska.txt | g4pC_taska.txt | 0.994622 | light | cut |
| 4374 | g2pA_taskc.txt | orig_taskc.txt | 0.994138 | light | orig |
| 736 | g2pB_taskd.txt | g3pA_taskd.txt | 0.989713 | light | cut |
| 4070 | g2pB_taskd.txt | orig_taskd.txt | 0.989347 | light | orig |
| 2382 | g0pB_taskc.txt | orig_taskc.txt | 0.988685 | cut | orig |
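
One possible way to turn these scores into an evaluation is to keep only the comparisons against the `orig` file, flag an answer as plagiarized when its similarity reaches a threshold, and compare the flags with the categories. The sketch below assumes that `cut`, `light`, and `heavy` count as plagiarized; the helper name `evaluate_threshold`, the threshold, and every value in the toy frame are hypothetical and are not results from this notebook.

```python
import pandas as pd

PLAGIARIZED = {"cut", "light", "heavy"}  # assumed meaning of the category labels


def evaluate_threshold(df, threshold):
    """Return (precision, recall) when flagging rows with Similarity >= threshold."""
    predicted = df["Similarity"] >= threshold
    actual = df["Category"].isin(PLAGIARIZED)
    true_pos = int((predicted & actual).sum())
    false_pos = int((predicted & ~actual).sum())
    false_neg = int((~predicted & actual).sum())
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical answer-vs-source scores, invented for illustration.
toy = pd.DataFrame({
    "File": ["x1.txt", "x2.txt", "x3.txt", "x4.txt"],
    "Similarity": [0.99, 0.97, 0.90, 0.95],
    "Category": ["cut", "light", "heavy", "non"],
})
precision, recall = evaluate_threshold(toy, threshold=0.96)
print(precision, recall)
```

In the notebook, the input frame could be obtained from `df_results` by keeping rows where `Category2 == "orig"` and both files belong to the same task, then renaming `Category1` to `Category`; sweeping the threshold over a range shows the precision and recall trade-off.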
