This document introduces text mining with TF-IDF as implemented in [scikit-learn](https://scikit-learn.org/stable/). It covers two scenarios. In the first, we calculate the TF-IDF vectors and the cosine similarity for a tutorial question. In the second, we simulate checking how similar assignments are in their term usage with the TF-IDF approach.

# 1. An example from the tutorial sheet

Given the following documents,

**DOCUMENT 1** This dog eats dog food.

**DOCUMENT 2** That cat eats fish

(Note: In the above documents, please consider ‘This’ and ‘That’ as stop words.)

1. What are the TF-IDF vectors of the documents after being normalized to unit length?

2. Given a query Q = {dog eats fish}, what is the cosine distance between Q and Document 1?

## 1.1 Data preparation

We first import the packages that will be used in this document.

1. [Pandas](https://pandas.pydata.org/): Pandas is an open-source Python library widely used for data manipulation, analysis, and cleaning tasks. The central data structure in Pandas is the [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) which provides methods to facilitate the preliminary examination of essential properties, statistical summaries, and a select number of rows for a cursory exploration of the data.

2. [NumPy](https://numpy.org/): NumPy is a powerful Python library for numerical and array-based computing. It provides support for large, multi-dimensional arrays and matrices, along with a wide range of mathematical functions to operate on these arrays efficiently.

3. [sklearn.feature_extraction.text.TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html): TfidfVectorizer, provided by scikit-learn, converts a collection of raw documents to a matrix of TF-IDF features.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

Prepare the documents and the query of the question.

In [2]:
documents = [
    'This dog eats dog food.',
    'That cat eats fish.',
]

In [3]:
Q = ['dog eats fish']

## 1.2 TF-IDF vectors of the documents after normalization

We apply [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) provided by scikit-learn to calculate the TF-IDF vectors.

We set the parameter `stop_words` to a list of the words to treat as stop words, here 'this' and 'that' as stated in the question. All of them are removed from the resulting tokens.

The parameter `smooth_idf` defaults to `True`, which adds the constant 1 to the numerator and denominator of the IDF, as if an extra document containing every term in the collection exactly once had been seen. This prevents division by zero. Smoothing is not covered in our lecture, so we set it to `False` for this scenario.

In [4]:
tfidf_vectorizer_tut = TfidfVectorizer(stop_words=['this', 'that'], smooth_idf=False)

We use [TfidfVectorizer.fit_transform()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit_transform) to let the model learn the vocabulary and IDF; it returns the document-term matrix.

In [5]:
tfidf_vector_tut = tfidf_vectorizer_tut.fit_transform(documents)

We can check the word list by [TfidfVectorizer.get_feature_names_out()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.get_feature_names_out).

In [6]:
tfidf_vectorizer_tut.get_feature_names_out()

array(['cat', 'dog', 'eats', 'fish', 'food'], dtype=object)

Then let's check the IDF vector with the attribute [TfidfVectorizer.idf_](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.idf_).

The IDF values differ from those calculated with the formula from the lecture because scikit-learn uses a different formula (shown here for `smooth_idf=False`):

$$
\text{IDF} = \ln\frac{\#\text{documents}}{\#\text{documents containing the word}} + 1
$$

Note: this formula only explains the output of the code. Please use the formula introduced in the lecture for tutorials and exams.

In [7]:
tfidf_vectorizer_tut.idf_

array([1.69314718, 1.69314718, 1.        , 1.69314718, 1.69314718])
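As a quick check, the values above can be reproduced by hand with NumPy. This is a minimal sketch: the document frequencies are counted manually from the two example documents, and the variable names are illustrative only. The smoothed variant is included to show what the default `smooth_idf=True` would have produced.

```python
import numpy as np

n_documents = 2
# Number of documents containing each term, in the order
# ['cat', 'dog', 'eats', 'fish', 'food'] (counted by hand)
document_frequency = np.array([1, 1, 2, 1, 1])

# smooth_idf=False: ln(n / df) + 1
idf_unsmoothed = np.log(n_documents / document_frequency) + 1

# smooth_idf=True (the default): ln((1 + n) / (1 + df)) + 1
idf_smoothed = np.log((1 + n_documents) / (1 + document_frequency)) + 1

print(idf_unsmoothed)
print(idf_smoothed)
```

Only 'eats' appears in both documents, so it is the only term whose unsmoothed IDF equals 1.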

We put the TF-IDF vectors and the corresponding word list into a [pd.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) for display and further calculation.

In [8]:
tfidf_df_tut = pd.DataFrame(tfidf_vector_tut.toarray(),columns=tfidf_vectorizer_tut.get_feature_names_out())

And then round it to 3 decimals.

In [9]:
tfidf_df_tut.round(decimals=3)

| | cat | dog | eats | fish | food |
|---|---|---|---|---|---|
| 0 | 0.0 | 0.865 | 0.255 | 0.0 | 0.432 |
| 1 | 0.652 | 0.0 | 0.385 | 0.652 | 0.0 |
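The same normalized vectors can be derived step by step, which may help when comparing against a hand calculation. The sketch below assumes raw term counts as TF (scikit-learn's default) and the unsmoothed scikit-learn IDF; the term counts are entered by hand and the variable names are illustrative.

```python
import numpy as np

# Term order: ['cat', 'dog', 'eats', 'fish', 'food']
term_counts = np.array([
    [0, 2, 1, 0, 1],  # Document 1: dog eats dog food
    [1, 0, 1, 1, 0],  # Document 2: cat eats fish
])

# Unsmoothed scikit-learn IDF: ln(n / df) + 1 with n = 2 documents
idf = np.log(2 / np.array([1, 1, 2, 1, 1])) + 1

# Multiply TF by IDF, then divide each row by its Euclidean length
tfidf_raw = term_counts * idf
tfidf_unit = tfidf_raw / np.linalg.norm(tfidf_raw, axis=1, keepdims=True)

print(tfidf_unit.round(3))
```

Each row of `tfidf_unit` has length 1, which is what "normalized to unit length" means in the question.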


## 1.3 Cosine similarity between Q and Document 1

Construct the TF-IDF vector of query Q by [TfidfVectorizer.transform()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.transform).

In [10]:
Q_vector = tfidf_vectorizer_tut.transform(Q)

Since the vectors have been normalized to unit length, the cosine similarity is simply the inner product of the two vectors. We compute it with [np.dot()](https://numpy.org/doc/stable/reference/generated/numpy.dot.html).

In [11]:
cosine_similarity_tut = np.dot(Q_vector.toarray()[0], tfidf_df_tut.iloc[0])

In [12]:
cosine_similarity_tut

0.6626683974485237
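The question asks for the cosine distance, while the value above is the cosine similarity. A minimal sketch of the conversion follows, assuming the common convention that cosine distance is 1 minus cosine similarity (check which convention the lecture uses). It uses `cosine_similarity` from `sklearn.metrics.pairwise`, which is not part of the original notebook, and rebuilds the vectors so that it runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ['This dog eats dog food.', 'That cat eats fish.']
vectorizer = TfidfVectorizer(stop_words=['this', 'that'], smooth_idf=False)
document_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform(['dog eats fish'])

# One row per query, one column per document
similarities = cosine_similarity(query_vector, document_vectors)
similarity_to_doc1 = similarities[0, 0]

# Assumed convention: distance = 1 - similarity
distance_to_doc1 = 1 - similarity_to_doc1

print(similarity_to_doc1, distance_to_doc1)
```

`cosine_similarity` accepts sparse matrices directly, so no `toarray()` call is needed.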

# 2. Checking frequent-word similarity with TF-IDF

In this scenario, we simulate three assignments, two of which share many frequent words. The three text files are the project specification documents for phase 1 and phase 2, plus an AI-generated version that only paraphrases the phase 1 specification.

## 2.1 Data preparation

We import some additional packages for this case.

1. [pathlib.Path](https://docs.python.org/3/library/pathlib.html): This module offers classes representing filesystem paths with semantics appropriate for different operating systems.

2. [glob](https://docs.python.org/3/library/glob.html): The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order.

In [13]:
from pathlib import Path  
import glob

We use `glob.glob()` to get a list of path names that match a pattern, which must be a string containing a path specification. Here the pattern matches every `.txt` file in the current directory.

In [14]:
text_files = glob.glob('./*.txt')

Check the matched path names.

In [15]:
text_files

['.\\INFS42037203ProjectPhase1.txt',
 '.\\INFS42037203ProjectPhase1_chatgpt.txt',
 '.\\INFS42037203ProjectPhase2.txt']

We can get the final path component without its suffix with `Path.stem`.

In [16]:
text_titles = [Path(text).stem for text in text_files]

Check the titles (file names without extensions).

In [17]:
text_titles

['INFS42037203ProjectPhase1',
 'INFS42037203ProjectPhase1_chatgpt',
 'INFS42037203ProjectPhase2']
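Because `glob` returns results in arbitrary order and the later cells pick files by position, it can be safer to sort the matches first. The sketch below illustrates this with placeholder files in a temporary directory; the file names are hypothetical and not part of the original data.

```python
import glob
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a few placeholder files (hypothetical names)
    for name in ['b_report.txt', 'a_report.txt', 'notes.md']:
        Path(tmp_dir, name).write_text('placeholder', encoding='utf-8')

    # Sort the matches so that index positions are stable across runs
    text_files = sorted(glob.glob(str(Path(tmp_dir) / '*.txt')))
    text_titles = [Path(text).stem for text in text_files]

print(text_titles)
```

Only the `.txt` files are matched, and their titles come back in alphabetical order.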

## 2.2 TF-IDF vectors after normalization

Similar to scenario 1, we apply [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) provided by scikit-learn to calculate the TF-IDF vectors. This time we set the parameter `stop_words` to `'english'` to use a predefined stop word list, and `input` to `'filename'` so that the vectorizer reads each document from the given file path.

In [18]:
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')

### 2.2.1 Two assignments with low similarity

Similar to scenario 1, compute the TF-IDF vectors with [TfidfVectorizer.fit_transform()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit_transform), here for the phase 1 and phase 2 specifications.

In [19]:
tfidf_vector_diff = tfidf_vectorizer.fit_transform([text_files[0],text_files[2]])

In [20]:
tfidf_df_diff = pd.DataFrame(tfidf_vector_diff.toarray(), index=[text_titles[0], text_titles[2]], columns=tfidf_vectorizer.get_feature_names_out())

In [21]:
tfidf_df_diff

| | 00 | 01 | 10 | 100 | 100mb | 11 | 12 | 128 | 13 | 15 | ... | want | way | web | week | work | works | world | written | www | zip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| INFS42037203ProjectPhase1 | 0.026708 | 0.0 | 0.03561 | 0.026708 | 0.0 | 0.0 | 0.025024 | 0.0 | 0.0 | 0.062561 | ... | 0.012512 | 0.017805 | 0.025024 | 0.03561 | 0.025024 | 0.012512 | 0.037537 | 0.025024 | 0.062561 | 0.0 |
| INFS42037203ProjectPhase2 | 0.010994 | 0.015452 | 0.021988 | 0.021988 | 0.030903 | 0.046355 | 0.0 | 0.015452 | 0.015452 | 0.0 | ... | 0.0 | 0.010994 | 0.0 | 0.010994 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.123613 |


Calculate the cosine similarity of these two files.

In [22]:
cosine_similarity_diff = np.dot(tfidf_df_diff.iloc[0],tfidf_df_diff.iloc[1])

In [23]:
cosine_similarity_diff

0.36272088568181

As the similarity is relatively low, there is no need to check the details.
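When more than two files need to be compared, refitting the vectorizer for every pair becomes tedious. One possible alternative, sketched below, is to fit once on all documents and compute every pairwise cosine similarity in a single matrix product. The texts are short hypothetical stand-ins held in memory rather than the real files, and because the IDF is then learned from the whole collection, the values will generally differ from those obtained by fitting each pair separately.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in texts; the notebook itself reads files via input='filename'
texts = [
    'the project proposal describes the dataset and the marks',
    'this proposal for the project explains the dataset and marking',
    'upload the zip archive before the deadline',
]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(texts)  # rows are L2-normalized by default

# For unit-length rows, the matrix product gives all pairwise cosine similarities
similarity_matrix = (tfidf @ tfidf.T).toarray()

print(np.round(similarity_matrix, 3))
```

The diagonal is 1 (each document compared with itself), and entry `[i, j]` is the similarity between documents `i` and `j`.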

### 2.2.2 Two assignments with high similarity

Similar to scenario 1, compute the TF-IDF vectors with [TfidfVectorizer.fit_transform()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit_transform), here for the phase 1 specification and its AI-paraphrased version.

In [24]:
tfidf_vector_sim = tfidf_vectorizer.fit_transform(text_files[:-1])

In [25]:
tfidf_df_sim = pd.DataFrame(tfidf_vector_sim.toarray(), index=text_titles[:-1], columns=tfidf_vectorizer.get_feature_names_out())

In [26]:
tfidf_df_sim

| | 00 | 10 | 100 | 12 | 15 | 15th | 16 | 1st | 20 | 2023 | ... | voting | want | way | web | week | work | works | world | written | www |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| INFS42037203ProjectPhase1 | 0.027692 | 0.051893 | 0.03892 | 0.025946 | 0.046153 | 0.018461 | 0.027692 | 0.012973 | 0.046153 | 0.046153 | ... | 0.012973 | 0.012973 | 0.025946 | 0.025946 | 0.036922 | 0.025946 | 0.012973 | 0.027692 | 0.025946 | 0.064866 |
| INFS42037203ProjectPhase1_chatgpt | 0.053955 | 0.0 | 0.0 | 0.0 | 0.080933 | 0.026978 | 0.053955 | 0.0 | 0.080933 | 0.080933 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.053955 | 0.0 | 0.0 | 0.053955 | 0.0 | 0.0 |


Calculate the cosine similarity of these two files.

In [27]:
cosine_similarity_sim = np.dot(tfidf_df_sim.iloc[0],tfidf_df_sim.iloc[1])

In [28]:
cosine_similarity_sim

0.727880340296424

Since the cosine similarity is high, let's check some details of these two files.

Let’s reorganize the DataFrame so that the words are in rows rather than columns and have a look at the new DataFrame.

In [29]:
tfidf_df_sim = tfidf_df_sim.stack().reset_index()

In [30]:
tfidf_df_sim

| | level_0 | level_1 | 0 |
|---|---|---|---|
| 0 | INFS42037203ProjectPhase1 | 00 | 0.027692 |
| 1 | INFS42037203ProjectPhase1 | 10 | 0.051893 |
| 2 | INFS42037203ProjectPhase1 | 100 | 0.038920 |
| 3 | INFS42037203ProjectPhase1 | 12 | 0.025946 |
| 4 | INFS42037203ProjectPhase1 | 15 | 0.046153 |
| ... | ... | ... | ... |
| 1117 | INFS42037203ProjectPhase1_chatgpt | work | 0.000000 |
| 1118 | INFS42037203ProjectPhase1_chatgpt | works | 0.000000 |
| 1119 | INFS42037203ProjectPhase1_chatgpt | world | 0.053955 |
| 1120 | INFS42037203ProjectPhase1_chatgpt | written | 0.000000 |


Rename the columns to meaningful names.

In [31]:
tfidf_df_sim = tfidf_df_sim.rename(columns={0:'tfidf', 'level_0': 'document','level_1': 'term'})

To find the 10 words with the highest TF-IDF in each file, we sort by document and TF-IDF score (descending), then group by document and take the first 10 rows of each group.

In [32]:
tfidf_df_sim.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)

| | document | term | tfidf |
|---|---|---|---|
| 412 | INFS42037203ProjectPhase1 | proposal | 0.313838 |
| 409 | INFS42037203ProjectPhase1 | project | 0.295377 |
| 138 | INFS42037203ProjectPhase1 | data | 0.286146 |
| 33 | INFS42037203ProjectPhase1 | ai | 0.221533 |
| 390 | INFS42037203ProjectPhase1 | phase | 0.221533 |
| 316 | INFS42037203ProjectPhase1 | marks | 0.193841 |
| 540 | INFS42037203ProjectPhase1 | use | 0.17538 |
| 514 | INFS42037203ProjectPhase1 | techniques | 0.16615 |
| 498 | INFS42037203ProjectPhase1 | submission | 0.147688 |
| 139 | INFS42037203ProjectPhase1 | dataset | 0.110766 |
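The reshape, rename, and top-N steps can be easier to follow on a tiny table. The sketch below uses a hypothetical two-document DataFrame with made-up scores (not the values from the files above) and takes the top 2 terms per document instead of the top 10.

```python
import pandas as pd

# Hypothetical TF-IDF scores: documents in rows, terms in columns
wide = pd.DataFrame(
    {'data': [0.5, 0.1], 'project': [0.7, 0.6], 'zip': [0.0, 0.8]},
    index=['doc_a', 'doc_b'],
)

# Move the terms from columns into rows: one (document, term, score) per row
long = wide.stack().reset_index()
long.columns = ['document', 'term', 'tfidf']

# Sort by document, then by descending score, and keep the first 2 rows per document
top_terms = (
    long.sort_values(by=['document', 'tfidf'], ascending=[True, False])
        .groupby('document')
        .head(2)
)

print(top_terms)
```

`groupby(...).head(n)` keeps the original row labels, which is why the index in the output above is not consecutive.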


The two files exhibit a significant overlap of terms with high TF-IDF values. To determine if plagiarism is present, a further check on their sentence-level similarity is required. However, sentence-level similarity is not covered in this course. If you're keen on exploring this topic further, you may consider conducting independent research.

Author: *Kaki Zhou* 12/10/2023 