## DATA MINING LAB 02

## NAME: Dinky Mishra
## CSUID: 2864923

Part1:
1. We want to count only the words that appear in the webpage text as the content of the page, not words inside any tags <……> or in system-generated scripts or HTML code. You can also count the words that appear in the title bar and in the menu of the webpage.
2. We do not want to count any subexpression that is part of another word.
“Spin” should not be counted as “pin”.
3. However, counting is not case sensitive: Insert, insert, and INSERT are all counted as the same word.
4. Words from the same stem are counted as the same word. For example, program, programming, programmed, and programmable are all counted as “program”. You can directly add OR conditions with all the variations of the words that come from the same stem to count them all as the same word.
5. We want to count phrases (bi-grams or tri-grams), for example by counting occurrences of ‘data mining’, that is, ‘data’ immediately followed by ‘mining’. This is usually done to add a discovered bi-gram or tri-gram to the term dictionary as a single term; for example, ‘data mining’ is added to the term dictionary as the single term ‘data_mining’ with its frequency.

Summary of Lab2
Fetching the Content of the Webpages:

We started by retrieving the content from the given webpages. For this task, Python libraries such as requests or web scraping tools like BeautifulSoup from bs4 were utilized.

Text Preprocessing:

After obtaining the content, we preprocessed the text to make it suitable for analysis.
This involved:
Stripping HTML tags to extract only the readable text present on the webpage.
Converting the entire content to lowercase to maintain consistency.
Removing special characters and punctuation marks.
Breaking the content into individual words or tokens, a process called tokenization.
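
The preprocessing steps above can be sketched as follows. This is a minimal illustration rather than the notebook's own code: it uses the standard-library `html.parser` module in place of BeautifulSoup, and the names `VisibleTextParser`, `html_to_tokens`, and the sample HTML string are hypothetical.

```python
import re
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes while skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Tag names and attribute values never reach this method,
        # so only the page text is collected.
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_tokens(html):
    """Strip markup, lowercase the text, and split it into word tokens."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return re.findall(r"\b\w+\b", text)


# Hypothetical sample page: "data" also occurs in a script and an attribute.
sample = (
    "<html><head><title>Data Lab</title>"
    "<script>var data = 1;</script></head>"
    '<body><p class="data">Spin the DATA.</p></body></html>'
)
tokens = html_to_tokens(sample)
print(tokens)
```

Only the title and paragraph text survive in this sketch; the word "data" inside the script and the `class` attribute is not counted.
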
Counting Term Frequencies:

The next phase involved counting the frequency of the seven terms provided in the assignment: "research", "data", "mining", "analytics", and the bi-grams "machine learning", "deep learning", and "data mining".
Bi-grams are treated as individual terms, so each bi-gram is counted as a single term with its own frequency.
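
One way to count whole words, stem variants, and bi-grams with a single routine is a word-boundary regular expression. The sketch below is an assumption about how this could be done, not the code used later in this notebook; `count_term`, the variant lists, and `sample_text` are hypothetical.

```python
import re


def count_term(text, variants):
    """Count whole-word, case-insensitive matches of any variant in text.

    Multi-word variants such as "data mining" match only when the words
    are separated by whitespace, i.e. one immediately follows the other.
    """
    alternatives = "|".join(
        r"\s+".join(re.escape(word) for word in variant.split())
        # Longest variants first so "programming" wins over "program".
        for variant in sorted(variants, key=len, reverse=True)
    )
    pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    return len(pattern.findall(text))


sample_text = (
    "Data mining and Machine  Learning: programming a program; "
    "spin is not pin. DATA MINING!"
)

# OR-ing the variants of one stem counts them all as the same term.
program_variants = ["program", "programming", "programmed", "programmable"]

print("data mining:", count_term(sample_text, ["data mining"]))
print("machine learning:", count_term(sample_text, ["machine learning"]))
print("program:", count_term(sample_text, program_variants))
print("pin:", count_term(sample_text, ["pin"]))
```

Note that in this sketch the unigram "data" is still counted inside "data mining"; whether those occurrences should be subtracted is a design decision the assignment text does not settle.
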
Constructing Document Vectors:

Finally, for each document, we crafted a vector. Each vector's entry was mapped to the frequency of one of the seven terms, allowing us to represent the content of the document in terms of these term frequencies.

STEP 1 : Import necessary libraries & Mount Drive

In [None]:
import pandas as pd
import re
import math
from collections import defaultdict

In [None]:
from google.colab import drive
drive.mount('/content/drive',force_remount=True)

Mounted at /content/drive


In [None]:
ls "/content/drive/MyDrive/Colab Notebooks/datamining/Lab2/docs"

doc1.txt  doc2.txt  doc3.txt  doc4.txt  doc5.txt  doc6.txt


Step 2: Define Terms and File Paths

In [None]:
# Terms to search for
terms = ["research", "data", "mining", "analytics", "machine learning", "deep learning", "data mining"]

# Paths to your saved text documents on Google Drive
folder_path = "/content/drive/MyDrive/Colab Notebooks/datamining/Lab2/docs/"
file_names = [f"{folder_path}doc{i}.txt" for i in range(1, 7)]


Step 3: Read and Compute Term Frequencies for Each Document

In [None]:
# Dictionary to store term frequencies for each document
term_freq = defaultdict(dict)

# Read and process content for each file
for file_name in file_names:
    with open(file_name, 'r') as f:
        text = f.read().lower()  # Convert to lowercase

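    # Tokenize the text (split into words).
    # Note: the raw file content is tokenized as-is, so words inside
    # HTML tags and scripts are counted as well.
    tokens = re.findall(r'\b\w+\b', text)

    # Count exact token matches, so "pin" never matches inside "spin".
    # Caveat: every token is a single word, so the two-word terms
    # ("machine learning", "deep learning", "data mining") can never equal
    # a token and always get a count of 0 here.
    for term in terms:
        term_freq[file_name][term] = tokens.count(term)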


In [None]:
# Read and process content for each file
for file_name in file_names:
    with open(file_name, 'r') as f:
        original_text = f.read()
        print(f"Original snippet for {file_name}:\n{original_text[:100]}...")  # Print the first 100 characters

        text = original_text.lower()
        print(f"Lowercase snippet for {file_name}:\n{text[:100]}...\n")  # Print the first 100 characters in lowercase

Original snippet for /content/drive/MyDrive/Colab Notebooks/datamining/Lab2/docs/doc1.txt:
<!DOCTYPE html><html lang="en"><head>
      <script type="text/javascript" data-source="https://cdn....
Lowercase snippet for /content/drive/MyDrive/Colab Notebooks/datamining/Lab2/docs/doc1.txt:
<!doctype html><html lang="en"><head>
      <script type="text/javascript" data-source="https://cdn....

Original snippet for /content/drive/MyDrive/Colab Notebooks/datamining/Lab2/docs/doc2.txt:
<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-la...
Lowercase snippet for /content/drive/MyDrive/Colab Notebooks/datamining/Lab2/docs/doc2.txt:
<!doctype html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-la...

Original snippet for /content/drive/MyDrive/Colab Notebooks/datamining/Lab2/docs/doc3.txt:
<!DOCTYPE html>
<!-- saved from url=(0039)https://my.clevelandclinic.org/research -->
<html lang="en...
Lowercase snippet for

Step 4: Calculate Document Frequency (DF) and Inverse Document Frequency (IDF) for Each Term

In [None]:
# Calculate Document Frequency (DF) for each term
df = {}
for term in terms:
    df[term] = sum(1 for doc in term_freq if term_freq[doc][term] > 0)

# Calculate Inverse Document Frequency (IDF) for each term
idf = {}
total_docs = len(file_names)
for term, freq in df.items():
    if freq == 0:
        idf[term] = 0
    else:
        idf[term] = math.log(total_docs / freq)

In [None]:
df

{'research': 6,
 'data': 6,
 'mining': 4,
 'analytics': 5,
 'machine learning': 0,
 'deep learning': 0,
 'data mining': 0}

In [None]:
idf

{'research': 0.0,
 'data': 0.0,
 'mining': 0.4054651081081644,
 'analytics': 0.1823215567939546,
 'machine learning': 0,
 'deep learning': 0,
 'data mining': 0}

Step 5: Calculate TF-IDF for Each Term in Each Document

In [None]:
# Calculate TF-IDF for each term in each document
tf_idf = {}
for doc, frequencies in term_freq.items():
    tf_idf[doc] = {}
    for term, freq in frequencies.items():
        tf_idf[doc][term] = freq * idf[term]
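
As a quick sanity check of the IDF and TF-IDF formulas, the values reported in this notebook can be reproduced by hand. The sketch below hard-codes the document frequencies from the `df` output and two term frequencies taken from the document-vector table in Part 2; the helper name `inverse_document_frequency` is hypothetical.

```python
import math

TOTAL_DOCS = 6

# Document frequencies copied from the `df` output above.
document_frequency = {"research": 6, "data": 6, "mining": 4, "analytics": 5}


def inverse_document_frequency(doc_freq, total_docs=TOTAL_DOCS):
    """idf = ln(N / df); defined as 0 when the term occurs in no document."""
    return math.log(total_docs / doc_freq) if doc_freq else 0.0


idf_values = {
    term: inverse_document_frequency(n) for term, n in document_frequency.items()
}

# "mining" occurs 21 times in doc2.txt and "data" 575 times in doc4.txt.
mining_in_doc2 = 21 * idf_values["mining"]
data_in_doc4 = 575 * idf_values["data"]

print(f"idf(mining) = {idf_values['mining']:.6f}")
print(f"tf-idf(mining, doc2) = {mining_in_doc2:.6f}")
print(f"tf-idf(data, doc4) = {data_in_doc4:.6f}")
```

Because "research" and "data" occur in all six documents, their IDF is ln(6/6) = 0, so their TF-IDF score is 0 however often they appear. This is why only "mining" and "analytics" show up in the inverted index.
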

Step 6: Build the Inverted Index (Term Dictionary)

In [None]:
# Build Inverted Index (Term Dictionary)
inverted_index = {}
for term in terms:
    inverted_index[term] = {}
    for doc in tf_idf:
        if tf_idf[doc][term] > 0:
            inverted_index[term][doc] = tf_idf[doc][term]

Step 7: Display the Inverted Index

In [None]:
# Create a DataFrame from the Inverted Index
data = []
for term, docs in inverted_index.items():
    for doc, score in docs.items():
        data.append([term, doc, score])

df_display = pd.DataFrame(data, columns=["Term", "Document", "TF-IDF Score"])
display(df_display)

Unnamed: 0,Term,Document,TF-IDF Score
0,mining,/content/drive/MyDrive/Colab Notebooks/datamin...,8.514767
1,mining,/content/drive/MyDrive/Colab Notebooks/datamin...,113.53023
2,mining,/content/drive/MyDrive/Colab Notebooks/datamin...,113.53023
3,mining,/content/drive/MyDrive/Colab Notebooks/datamin...,2.027326
4,analytics,/content/drive/MyDrive/Colab Notebooks/datamin...,8.022148
5,analytics,/content/drive/MyDrive/Colab Notebooks/datamin...,0.182322
6,analytics,/content/drive/MyDrive/Colab Notebooks/datamin...,3.46411
7,analytics,/content/drive/MyDrive/Colab Notebooks/datamin...,3.46411
8,analytics,/content/drive/MyDrive/Colab Notebooks/datamin...,4.193396


## Extra Credit: Inverted Index (Term Dictionary) Construction with TF-IDF for a Full Feature Set for Each Document

In [None]:
import re
from collections import defaultdict

# Step 1: Import necessary libraries & Mount Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# Step 2: Define File Paths
folder_path = "/content/drive/MyDrive/Colab Notebooks/datamining/Lab2/docs/"
file_names = [f"{folder_path}doc{i}.txt" for i in range(1, 7)]

# Step 3: Read and Compute Term Frequencies for Each Document
doc_term_counts = {}

for file_name in file_names:
    with open(file_name, 'r') as f:
        text = f.read().lower()  # Convert to lowercase

    tokens = re.findall(r'\b\w+\b', text)
    term_counts = defaultdict(int)
    for token in tokens:
        term_counts[token] += 1

    doc_name = file_name.split('/')[-1]  # Extract document name from file path
    doc_term_counts[doc_name] = term_counts

# Step 4: Display Unique Terms and Their Counts for Each Document
for doc_name, term_counts in doc_term_counts.items():
    print(f"\nUnique Terms and Their Counts for {doc_name}:")
    unique_terms = sorted(term_counts.items(), key=lambda x: x[0])  # Sort by term
    for term, count in unique_terms:
        print(f"{term}: {count}")


Streaming output truncated to the last 5000 lines.
390: 1
390px: 1
397: 2
3a: 41
3a14967969: 1
3a3: 1
3a6487637: 1
3abook: 17
3adata: 32
3adoi: 5
3aits: 1
3ajournal: 15
3akev: 32
3amtx: 32
3aoclcnum: 2
3aofi: 32
3apmid: 1
3asid: 32
3d2: 3
3d5: 1
3d546782: 1
3d7: 1
3da18c5b92: 1
3deng_datamininganalysis: 1
3djstor: 1
3ds2cid: 2
3dssrn: 1
3dtop_venues: 1
3em: 6
3f: 2
3fabstract_id: 1
3fg: 1
3fsearchdomain: 1
3fview_op: 1
3fvolume: 1
3px: 4
3rd: 2
4: 24
40: 6
400: 2
407: 2
41: 9
410: 1
42253: 3
43: 2
4398: 4
44: 2
440px: 1
4428654: 1
448: 2
4503: 4
451120: 2
45263753: 2
45px: 1
461: 4
47: 4
471: 4
48: 4
489: 4
49: 2
4a: 6
4c: 1
4d1d: 3
4eea: 1
4em: 23
4px: 1
4th: 1
5: 28
50: 4
500: 2
50000: 1
5000000: 2
50055336: 2
500px: 1
50px: 1
51: 5
512: 2
520: 1
521: 3
52428800: 2
540: 3
54264d1c4647: 3
546782: 2
55860: 4
5670: 2
585px: 1
59749: 3
59904: 3
59px: 1
5a20: 3
5dbd98ca: 2
5em: 13
5x: 12
6: 23
60: 5
6069: 4
619: 2
640: 1
640px: 1
6487637: 2
649: 1
65: 3
66: 4
66676dd79b: 2
6

In [None]:
import pandas as pd
import re
import math
from collections import defaultdict

# Step 1: Import necessary libraries & Mount Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# Step 2: Define File Paths
folder_path = "/content/drive/MyDrive/Colab Notebooks/datamining/Lab2/docs/"
file_names = [f"{folder_path}doc{i}.txt" for i in range(1, 7)]

# Extract document names from file paths
doc_names = [f"doc{i}" for i in range(1, 7)]

# Step 3: Read and Compute Term Frequencies for Each Document
term_freq = defaultdict(dict)
all_terms = set()

for file_name in file_names:
    with open(file_name, 'r') as f:
        text = f.read().lower()  # Convert to lowercase

    tokens = re.findall(r'\b\w+\b', text)
    all_terms.update(tokens)

    for token in tokens:
        if token not in term_freq[file_name]:
            term_freq[file_name][token] = 0
        term_freq[file_name][token] += 1

# Step 4: Calculate Document Frequency (DF) and Inverse Document Frequency (IDF) for Each Term
df = {term: 0 for term in all_terms}
for term in all_terms:
    df[term] = sum(1 for doc in term_freq if term in term_freq[doc])

idf = {}
total_docs = len(file_names)
for term, freq in df.items():
    if freq == 0:
        idf[term] = 0
    else:
        idf[term] = math.log(total_docs / freq)

# Step 5: Calculate TF-IDF for Each Term in Each Document
tf_idf = {}
for doc, frequencies in term_freq.items():
    tf_idf[doc] = {}
    for term, freq in frequencies.items():
        if term in idf:  # Ensure the term is in idf
            tf_idf[doc][term] = freq * idf[term]

# Step 6: Calculate Collection Frequency (CF) and Max TF
collection_freq = defaultdict(int)
max_tf = defaultdict(lambda: defaultdict(int))

for doc, frequencies in term_freq.items():
    for term, freq in frequencies.items():
        collection_freq[term] += freq
        if freq > max_tf[doc][term]:
            max_tf[doc][term] = freq

# Step 7: Create the Postings File
postings_file = defaultdict(list)

for doc, doc_name in zip(file_names, doc_names):
    with open(doc, 'r') as f:
        text = f.read().lower()
    tokens = re.findall(r'\b\w+\b', text)
    for position, token in enumerate(tokens):
        if token in all_terms:
            postings_file[token].append((doc_name, position))

# Step 8: Build the Inverted Index (Term Dictionary) with Additional Details
inverted_index = {}
for term in all_terms:
    inverted_index[term] = {}
    for doc, doc_name in zip(tf_idf, doc_names):
        if term in tf_idf[doc] and tf_idf[doc][term] > 0:
            inverted_index[term][doc_name] = {
                'TF-IDF': tf_idf[doc][term],
                'CF': collection_freq[term],
                'Max TF': max_tf[doc][term],
                'DF': df[term]
            }

# Step 9: Display the Postings File
postings_data = []
for term, postings in postings_file.items():
    for doc, position in postings:
        postings_data.append([term, doc, position])

df_postings = pd.DataFrame(postings_data, columns=["Term", "Document", "Position"])
display(df_postings)

# Step 10: Display the Dictionary File
dictionary_data = []
for term, docs in inverted_index.items():
    for doc, details in docs.items():
        dictionary_data.append([
            term, doc, details['TF-IDF'], details['CF'], details['Max TF'], details['DF']
        ])

df_dictionary = pd.DataFrame(dictionary_data, columns=["Term", "Document", "TF-IDF", "Collection Frequency", "Max TF", "Doc Freq"])
display(df_dictionary)


Mounted at /content/drive


Unnamed: 0,Term,Document,Position
0,doctype,doc1,0
1,doctype,doc2,0
2,doctype,doc3,0
3,doctype,doc4,0
4,doctype,doc5,0
...,...,...,...
289573,messagehelp,doc6,6776
289574,messagehelp,doc6,6780
289575,feedbacksubmit,doc6,6799
289576,plugins,doc6,6842


Unnamed: 0,Term,Document,TF-IDF,Collection Frequency,Max TF,Doc Freq
0,4cc1,doc1,3.583519,2,2,1
1,60bc61,doc1,1.791759,1,1,1
2,sciences,doc1,0.364643,30,2,5
3,sciences,doc2,2.187859,30,12,5
4,sciences,doc3,1.093929,30,6,5
...,...,...,...,...,...,...
21180,7cwikibase,doc2,0.693147,3,1,3
21181,7cwikibase,doc4,0.693147,3,1,3
21182,7cwikibase,doc5,0.693147,3,1,3
21183,snapshot,doc1,71.670379,40,40,1


**### NOTE: Do NOT use the code below!**


In [None]:

# Calculate Collection Frequency (CF) and Max TF for each term in each document
collection_freq = defaultdict(int)
max_tf = defaultdict(lambda: defaultdict(int))

for doc, frequencies in term_freq.items():
    for term, freq in frequencies.items():
        collection_freq[term] += freq
        if freq > max_tf[doc][term]:
            max_tf[doc][term] = freq

# Add CF and Max TF to the inverted index
for term in terms:
    for doc in tf_idf:
        if tf_idf[doc][term] > 0:
            inverted_index[term][doc] = {
                'TF-IDF': tf_idf[doc][term],
                'CF': collection_freq[term],
                'Max TF': max_tf[doc][term]
            }


In [None]:
# Create the Postings File
postings_file = defaultdict(list)

for doc in file_names:
    with open(doc, 'r') as f:
        text = f.read().lower()
    tokens = re.findall(r'\b\w+\b', text)
    for position, token in enumerate(tokens):
        if token in terms:
            postings_file[token].append((doc, position))

# Display the Postings File
postings_data = []
for term, postings in postings_file.items():
    for doc, position in postings:
        postings_data.append([term, doc, position])

df_postings = pd.DataFrame(postings_data, columns=["Term", "Document", "Position"])
display(df_postings)


Unnamed: 0,Term,Document,Position
0,data,/content/drive/MyDrive/Colab Notebooks/datamin...,10
1,data,/content/drive/MyDrive/Colab Notebooks/datamin...,261
2,data,/content/drive/MyDrive/Colab Notebooks/datamin...,452
3,data,/content/drive/MyDrive/Colab Notebooks/datamin...,1107
4,data,/content/drive/MyDrive/Colab Notebooks/datamin...,1501
...,...,...,...
2773,mining,/content/drive/MyDrive/Colab Notebooks/datamin...,667
2774,mining,/content/drive/MyDrive/Colab Notebooks/datamin...,3596
2775,mining,/content/drive/MyDrive/Colab Notebooks/datamin...,4211
2776,mining,/content/drive/MyDrive/Colab Notebooks/datamin...,4420


In [None]:
# Display the Dictionary File
dictionary_data = []
for term, docs in inverted_index.items():
    for doc, details in docs.items():
        dictionary_data.append([
            term, doc, details['TF-IDF'], details['CF'], details['Max TF'], df[term]
        ])

df_dictionary = pd.DataFrame(dictionary_data, columns=["Term", "Document", "TF-IDF", "Collection Frequency", "Max TF", "Doc Freq"])
display(df_dictionary)


Unnamed: 0,Term,Document,TF-IDF,Collection Frequency,Max TF,Doc Freq
0,mining,/content/drive/MyDrive/Colab Notebooks/datamin...,8.514767,586,21,4
1,mining,/content/drive/MyDrive/Colab Notebooks/datamin...,113.53023,586,280,4
2,mining,/content/drive/MyDrive/Colab Notebooks/datamin...,113.53023,586,280,4
3,mining,/content/drive/MyDrive/Colab Notebooks/datamin...,2.027326,586,5,4
4,analytics,/content/drive/MyDrive/Colab Notebooks/datamin...,8.022148,106,44,5
5,analytics,/content/drive/MyDrive/Colab Notebooks/datamin...,0.182322,106,1,5
6,analytics,/content/drive/MyDrive/Colab Notebooks/datamin...,3.46411,106,19,5
7,analytics,/content/drive/MyDrive/Colab Notebooks/datamin...,3.46411,106,19,5
8,analytics,/content/drive/MyDrive/Colab Notebooks/datamin...,4.193396,106,23,5


### **PART 2 BEGINS**

Part2:
In this section, we'll construct a cosine similarity matrix using the term frequencies calculated earlier. We'll also normalize each document vector to calculate the cosine similarity.

Step 1: Display Document Vectors for Keywords

This step is for visualization to get a grasp on our document vectors before we compute cosine similarities.

In [None]:
# Convert the term frequencies dictionary to a DataFrame for a clearer display
df_term_freq = pd.DataFrame({doc.split('/')[-1]: tf for doc, tf in term_freq.items()}).T

# Replace NaN values (indicating a term wasn't found in a document) with 0
df_term_freq.fillna(0, inplace=True)

# Ensure all values are integers, as term frequencies are whole numbers
df_term_freq = df_term_freq.astype(int)

# Display the DataFrame
print("Document Vectors For Keywords:")
df_term_freq


Document Vectors For Keywords:


Unnamed: 0,research,data,mining,analytics,machine learning,deep learning,data mining
doc1.txt,13,296,0,44,0,0,0
doc2.txt,31,264,21,0,0,0,0
doc3.txt,110,21,0,1,0,0,0
doc4.txt,27,575,280,19,0,0,0
doc5.txt,27,575,280,19,0,0,0
doc6.txt,40,107,5,23,0,0,0


Step 2: Normalize Each Document Vector

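Before calculating cosine similarity, each vector is normalized to unit length (i.e., a magnitude of 1). The cosine similarity of two documents then reduces to the dot product of their normalized vectors.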
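
For reference, the relationship between normalization and cosine similarity can be written out as below. The notation is introduced here for illustration only: $\mathbf{d}_i$ stands for the term-frequency vector of document $i$ and $\hat{\mathbf{d}}_i$ for its unit-length version.

```latex
\[
\cos(\mathbf{d}_i, \mathbf{d}_j)
  = \frac{\mathbf{d}_i \cdot \mathbf{d}_j}
         {\lVert \mathbf{d}_i \rVert \, \lVert \mathbf{d}_j \rVert},
\qquad
\hat{\mathbf{d}} = \frac{\mathbf{d}}{\lVert \mathbf{d} \rVert}
\;\Longrightarrow\;
\cos(\mathbf{d}_i, \mathbf{d}_j) = \hat{\mathbf{d}}_i \cdot \hat{\mathbf{d}}_j
\]
```

Dividing by the magnitudes removes the effect of document length, so a long page and a short page with the same term proportions get a similarity of 1.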

In [None]:
def normalize_vector(vector):
    magnitude = math.sqrt(sum([val**2 for val in vector.values()]))
    if magnitude == 0:
        return vector
    return {term: val/magnitude for term, val in vector.items()}

# Assuming term_freq is a dictionary of term frequencies for each document
normalized_vectors = {doc.split('/')[-1]: normalize_vector(tf) for doc, tf in term_freq.items()}

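# Convert to DataFrame for tabular representation.
# A separate name is used so the document-frequency dict `df` from Part 1
# is not overwritten.
df_normalized = pd.DataFrame(normalized_vectors).transpose()

# Display the DataFrame
print(df_normalized)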


          research      data    mining  analytics  machine learning  \
doc1.txt  0.043401  0.988200  0.000000   0.146895               0.0   
doc2.txt  0.116261  0.990091  0.078757   0.000000               0.0   
doc3.txt  0.982221  0.187515  0.000000   0.008929               0.0   
doc4.txt  0.042161  0.897873  0.437225   0.029669               0.0   
doc5.txt  0.042161  0.897873  0.437225   0.029669               0.0   
doc6.txt  0.342959  0.917416  0.042870   0.197202               0.0   

          deep learning  data mining  
doc1.txt            0.0          0.0  
doc2.txt            0.0          0.0  
doc3.txt            0.0          0.0  
doc4.txt            0.0          0.0  
doc5.txt            0.0          0.0  
doc6.txt            0.0          0.0  


Step 3: Calculate Cosine Similarity Between Pairs of documents

In [None]:
import math
import pandas as pd


def cosine_similarity(vec1, vec2):
    """Return the cosine of the angle between two term vectors (dicts)."""
    dot_product = sum(val * vec2.get(term, 0) for term, val in vec1.items())
    magnitude1 = math.sqrt(sum(val**2 for val in vec1.values()))
    magnitude2 = math.sqrt(sum(val**2 for val in vec2.values()))
    if magnitude1 * magnitude2 == 0:  # Avoid division by zero
        return 0
    return dot_product / (magnitude1 * magnitude2)


# normalized_vectors is keyed by file name (e.g. "doc1.txt"), not by the
# full paths in file_names, so iterate over its own keys
doc_keys = list(normalized_vectors)
cosine_sim_matrix = {doc1: {doc2: cosine_similarity(normalized_vectors[doc1], normalized_vectors[doc2])
                            for doc2 in doc_keys} for doc1 in doc_keys}

# Convert the dictionary to a pandas DataFrame for better display
cosine_sim_df = pd.DataFrame(cosine_sim_matrix)
print(cosine_sim_df)



          doc1.txt  doc2.txt  doc3.txt  doc4.txt  doc5.txt  doc6.txt
doc1.txt  1.000000  0.983454  0.229243  0.893466  0.893466  0.950443
doc2.txt  0.983454  1.000000  0.299851  0.928313  0.928313  0.951575
doc3.txt  0.229243  0.299851  1.000000  0.210041  0.210041  0.510652
doc4.txt  0.893466  0.928313  0.210041  1.000000  1.000000  0.862778
doc5.txt  0.893466  0.928313  0.210041  1.000000  1.000000  0.862778
doc6.txt  0.950443  0.951575  0.510652  0.862778  0.862778  1.000000


Step 4: Display the Cosine Similarity Matrix in Tabular Form

In [None]:
# Convert the cosine similarity matrix to a DataFrame for better display
df_cosine_sim = pd.DataFrame(cosine_sim_matrix).T

# Display the matrix
df_cosine_sim


Unnamed: 0,doc1.txt,doc2.txt,doc3.txt,doc4.txt,doc5.txt,doc6.txt
doc1.txt,1.0,0.983454,0.229243,0.893466,0.893466,0.950443
doc2.txt,0.983454,1.0,0.299851,0.928313,0.928313,0.951575
doc3.txt,0.229243,0.299851,1.0,0.210041,0.210041,0.510652
doc4.txt,0.893466,0.928313,0.210041,1.0,1.0,0.862778
doc5.txt,0.893466,0.928313,0.210041,1.0,1.0,0.862778
doc6.txt,0.950443,0.951575,0.510652,0.862778,0.862778,1.0


Step 5: Compare Document Vectors and Cosine Similarity Matrix in Tabular Format

In [None]:
from IPython.display import display, HTML

# Convert the term frequencies dictionary to a DataFrame, using only document file names
df_term_freq = pd.DataFrame({doc.split('/')[-1]: tf for doc, tf in term_freq.items()}).T.fillna(0).astype(int)

# Convert the cosine similarity matrix to a DataFrame, using only document file names
df_cosine_sim = pd.DataFrame({doc1.split('/')[-1]: {doc2.split('/')[-1]: sim for doc2, sim in sims.items()}
                              for doc1, sims in cosine_sim_matrix.items()})

# Generate HTML tables for side-by-side display
table_term_freq = df_term_freq.to_html(classes='table table-condensed')
table_cosine_sim = df_cosine_sim.to_html(classes='table table-condensed')

# Display the tables side-by-side
display(HTML(f'<table><tr><td><h3>Document Vectors For Keywords</h3>{table_term_freq}</td><td><h3>Cosine Similarity Matrix</h3>{table_cosine_sim}</td></tr></table>'))


Unnamed: 0,research,data,mining,analytics,machine learning,deep learning,data mining
doc1.txt,13,296,0,44,0,0,0
doc2.txt,31,264,21,0,0,0,0
doc3.txt,110,21,0,1,0,0,0
doc4.txt,27,575,280,19,0,0,0
doc5.txt,27,575,280,19,0,0,0
doc6.txt,40,107,5,23,0,0,0

Unnamed: 0,doc1.txt,doc2.txt,doc3.txt,doc4.txt,doc5.txt,doc6.txt
doc1.txt,1.0,0.983454,0.229243,0.893466,0.893466,0.950443
doc2.txt,0.983454,1.0,0.299851,0.928313,0.928313,0.951575
doc3.txt,0.229243,0.299851,1.0,0.210041,0.210041,0.510652
doc4.txt,0.893466,0.928313,0.210041,1.0,1.0,0.862778
doc5.txt,0.893466,0.928313,0.210041,1.0,1.0,0.862778
doc6.txt,0.950443,0.951575,0.510652,0.862778,0.862778,1.0


### Part3: Analysis and Discussion of Problems



# 1.   Discuss briefly your topic analysis with your cosine similarity matrix, focusing on: does each value (in Cosine Sim) for each pair of docs indicate the similarity correctly?
 --> The cosine similarity value quantifies how similar two documents are based on the frequencies of the 7 given terms. Values closer to 1 indicate higher similarity, while values closer to 0 indicate lower similarity. By examining the matrix, we can check whether the documents expected to be similar have values near 1.
•	Diagonal Values: These are 1, as expected, since a document is perfectly similar to itself.
•	Doc4 and Doc5: This pair has a value of 1 because their term-frequency vectors are identical (27, 575, 280, 19, 0, 0, 0).
•	Other Pairs: Values range from 0.210041 (Doc3 with Doc4 and Doc5) to 0.983454 (Doc1 with Doc2). Doc3 is the least similar to every other document because it is dominated by "research" (110 occurrences) rather than "data".
Thus, each value in the cosine similarity matrix reflects the similarity between document pairs in terms of these term frequencies. Note that the three bi-grams have a frequency of 0 in every document, so they do not contribute to any similarity value.

# 2.   Which 2 docs are most similar in terms of 7 given topics?
-->Based on the generated cosine similarity matrix, doc4 and doc5 are the most similar in terms of the 7 given topics, as they have a cosine similarity value of 1, indicating they are identical regarding these topics.

# 3.  Are the Topics of Doc6 similar to the Topics of Doc 4 and 5?
 Explain Why or Why Not in terms of 7 TFs? If not, what are the reasons?

 --> The similarity between Doc6 and Doc4 is 0.862778, and the similarity between Doc6 and Doc5 is also 0.862778.
Since the highest possible similarity score is 1, a value of 0.862778 suggests that Doc6 is quite similar, but not identical, to both Doc4 and Doc5. It is still lower than Doc6's similarity to Doc1 (0.950443) and Doc2 (0.951575).
The value of 1 between Doc4 and Doc5 indicates they are identical in terms of the seven topics.
In terms of Term Frequencies (TFs):
Similarity: "data" is the most frequent of the seven terms in all three documents (107 in Doc6, 575 in Doc4 and Doc5), so the vectors point in broadly the same direction. After normalization, the "data" component is 0.917416 for Doc6 and 0.897873 for Doc4 and Doc5.
Difference: The main difference is "mining": it appears 280 times in Doc4 and Doc5 but only 5 times in Doc6 (normalized 0.437225 versus 0.042870). Doc6 also gives relatively more weight to "research" (40 versus 27; normalized 0.342959 versus 0.042161) and to "analytics" (23 versus 19; normalized 0.197202 versus 0.029669).
Possible reasons:
Depth of Content: Doc6 may cover "mining" in much less depth than Doc4 and Doc5 while giving more space to research and analytics.
Inclusion of Additional Content: There might be additional content or topics in Doc6 that influence the term frequencies of the seven given topics.
Different Context: The context or the way the topics are discussed in Doc6 might differ from Doc4 and Doc5, influencing the term frequencies.
In conclusion, Doc6 is similar to Doc4 and Doc5 in terms of the 7 topics, but not identical. The gap comes mostly from the much lower frequency of "mining" in Doc6, together with its relatively higher share of "research" and "analytics".
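
The Doc6 versus Doc4 value can be reproduced directly from the term-frequency table in Part 2. The following is a minimal check, assuming the seven components are ordered as in that table (research, data, mining, analytics, machine learning, deep learning, data mining); the function and variable names are illustrative.

```python
import math

# Term-frequency vectors copied from the "Document Vectors For Keywords" table.
term_frequencies = {
    "doc4.txt": [27, 575, 280, 19, 0, 0, 0],
    "doc6.txt": [40, 107, 5, 23, 0, 0, 0],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:  # Avoid division by zero
        return 0.0
    return dot / (norm_u * norm_v)


similarity = cosine_similarity(
    term_frequencies["doc6.txt"], term_frequencies["doc4.txt"]
)
print(f"cos(doc6, doc4) = {similarity:.6f}")
```

The result agrees with the 0.862778 entry in the cosine similarity matrix.
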






