Need for Topic Modelling 

In [7]:
from bertopic import BERTopic
import pandas as pd 
import re

Load all the text into a dataframe for analysis 

In [8]:
files = ['a1.txt', 'a2.txt' , 'b1.txt' , 'b2.txt']

In [None]:
def get_paragraphs(file_paths):
    all_paras = []
    
    for path in file_paths:
        with open(f"./datasets/class_1{path}", "r", encoding="utf-8") as f:
            text = f.read()
            paragraphs = [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]
            all_paras.extend(paragraphs)
            
    return pd.DataFrame(all_paras, columns=['text'])

data = get_paragraphs(files)
print(len(data))
print(data.head())

Use BERTopic to extract the top topics 

In [22]:
model = BERTopic(embedding_model='all-MiniLM-L6-v2')

In [23]:
topics,probs = model.fit_transform(data['text'])

In [24]:
model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,804,0_the_of_to_and,"[the, of, to, and, in, is, it, be, that, which]",[The majority of the women of any class are no...
1,1,275,1_the_of_is_to,"[the, of, is, to, in, that, we, and, it, as]",[(2) What is the relation of this present occu...


Looks like the step words are too huge in number and are skewing the results. 
Lets remove the step words before clustering 

In [25]:
from sklearn.feature_extraction.text import CountVectorizer


In [28]:
vectorizer = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))

In [31]:
model = BERTopic(
    embedding_model='all-MiniLM-L6-v2',vectorizer_model=vectorizer,calculate_probabilities=True
)

topics,probs = model.fit_transform(data['text'])

In [32]:
model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,804,0_men_women_power_society,"[men, women, power, society, life, state, work...","[Is not this enough, and much more than enough..."
1,1,275,1_memory_knowledge_object_experience,"[memory, knowledge, object, experience, differ...",[This question of the nature of the object als...


More clearer topics this time but these are too broad, lets make the clustering more finer 

In [33]:
from umap import UMAP
from hdbscan import HDBSCAN

In [34]:
umap_model = UMAP(n_neighbors=5, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

In [35]:
model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer,
    calculate_probabilities=True,
    verbose=True
)

topics,probs = model.fit_transform(data['text'])

2026-01-23 15:11:02,736 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 34/34 [00:46<00:00,  1.36s/it]
2026-01-23 15:11:54,180 - BERTopic - Embedding - Completed ✓
2026-01-23 15:11:54,182 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2026-01-23 15:11:57,875 - BERTopic - Dimensionality - Completed ✓
2026-01-23 15:11:57,877 - BERTopic - Cluster - Start clustering the reduced embeddings
2026-01-23 15:11:57,932 - BERTopic - Cluster - Completed ✓
2026-01-23 15:11:57,947 - BERTopic - Representation - Fine-tuning topics using representation models.
2026-01-23 15:11:58,180 - BERTopic - Representation - Completed ✓


In [36]:
model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,581,0_state_work_society_men,"[state, work, society, men, power, opinion, hu...",[Whether art will flourish in a Socialistic co...
1,1,275,1_object_knowledge_memory_different,"[object, knowledge, memory, different, past, e...",[This question of the nature of the object als...
2,2,223,2_women_men_woman_life,"[women, men, woman, life, power, general, law,...","[And in the case of public offices, if the pol..."


NOOO! The topics are getting overlapped a lot and thus you only get a few topics i think. For more finer topics lets try running these on invidual files on not all at once.

In [41]:
data_a1 = get_paragraphs([files[0]])
data_a2 = get_paragraphs([files[1]])

In [42]:
topics_a1,probs_a1 = model.fit_transform(data_a1['text'])

2026-01-23 15:20:29,394 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 8/8 [00:08<00:00,  1.01s/it]
2026-01-23 15:20:37,459 - BERTopic - Embedding - Completed ✓
2026-01-23 15:20:37,460 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2026-01-23 15:20:37,793 - BERTopic - Dimensionality - Completed ✓
2026-01-23 15:20:37,794 - BERTopic - Cluster - Start clustering the reduced embeddings
2026-01-23 15:20:37,807 - BERTopic - Cluster - Completed ✓
2026-01-23 15:20:37,811 - BERTopic - Representation - Fine-tuning topics using representation models.
2026-01-23 15:20:37,862 - BERTopic - Representation - Completed ✓


In [44]:
model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,116,-1_women_men_society_case,"[women, men, society, case, power, general, na...",[Exactly the same thing may be said of the wom...
1,0,58,0_human_power_men_character,"[human, power, men, character, life, women, so...","[And this, indeed, is what makes it strange to..."
2,1,50,1_women_men_things_great,"[women, men, things, great, woman, faculties, ...","[They perhaps have it from nature, but they ce..."
3,2,21,2_law_wife_slave_marriage,"[law, wife, slave, marriage, man, legal, laws,...",[Its refusal completes the assimilation of the...


Since the book is mostly continous and a mixture of a lot of topics, the large num of outliers make sense. The current output is much better than when we ran this on all novels together.

Repeat this for all the other novels and pick the more populated topics.

In [85]:
topics_a2 , probs_a2 = model.fit_transform(data_a2['text'])
model.get_topic_info()

2026-01-24 14:48:25,381 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 9/9 [00:05<00:00,  1.54it/s]
2026-01-24 14:48:31,251 - BERTopic - Embedding - Completed ✓
2026-01-24 14:48:31,251 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2026-01-24 14:48:31,486 - BERTopic - Dimensionality - Completed ✓
2026-01-24 14:48:31,487 - BERTopic - Cluster - Start clustering the reduced embeddings
2026-01-24 14:48:31,496 - BERTopic - Cluster - Completed ✓
2026-01-24 14:48:31,498 - BERTopic - Representation - Fine-tuning topics using representation models.
2026-01-24 14:48:31,545 - BERTopic - Representation - Completed ✓


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,60,-1_state_education_government_men,"[state, education, government, men, persons, d...","[It still remains unrecognised, that to bring ..."
1,0,135,0_society_liberty_human_conduct,"[society, liberty, human, conduct, individual,...","[But with regard to the merely contingent, or,..."
2,1,62,1_truth_opinion_opinions_true,"[truth, opinion, opinions, true, discussion, m...","[Where their influence prevails, they make it ..."
3,2,15,2_religious_public_religion_country,"[religious, public, religion, country, feeling...","[But the opinion of a similar majority, impose..."


In [56]:
data_b1 = get_paragraphs([files[2]])
data_b2 = get_paragraphs([files[3]])

In [59]:
topics_b1, probs_b1 = model.fit_transform(data_b1['text'])

2026-01-23 15:56:55,714 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 9/9 [00:10<00:00,  1.14s/it]
2026-01-23 15:57:06,000 - BERTopic - Embedding - Completed ✓
2026-01-23 15:57:06,001 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2026-01-23 15:57:06,459 - BERTopic - Dimensionality - Completed ✓
2026-01-23 15:57:06,460 - BERTopic - Cluster - Start clustering the reduced embeddings
2026-01-23 15:57:06,474 - BERTopic - Cluster - Completed ✓
2026-01-23 15:57:06,478 - BERTopic - Representation - Fine-tuning topics using representation models.
2026-01-23 15:57:06,540 - BERTopic - Representation - Completed ✓


In [61]:
model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,74,-1_government_guild_state_world,"[government, guild, state, world, bakunin, war...",[Above both will be the joint Committee of Par...
1,0,114,0_work_present_socialism_state,"[work, present, socialism, state, men, communi...","[Socialism, at any rate in most of its forms, ..."
2,1,67,1_class_marx_syndicalism_labor,"[class, marx, syndicalism, labor, political, r...","[The I. W. W., though it has a less definite p..."
3,2,31,2_men_power_evils_love,"[men, power, evils, love, life, world, human, ...",[It is not so that human relations will be con...


In [63]:
topics_b2,probs_b2 = model.fit_transform(data_b2['text'])
model.get_topic_info()

2026-01-24 02:52:26,609 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 9/9 [00:06<00:00,  1.49it/s]
2026-01-24 02:52:32,707 - BERTopic - Embedding - Completed ✓
2026-01-24 02:52:32,708 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2026-01-24 02:52:32,959 - BERTopic - Dimensionality - Completed ✓
2026-01-24 02:52:32,960 - BERTopic - Cluster - Start clustering the reduced embeddings
2026-01-24 02:52:32,981 - BERTopic - Cluster - Completed ✓
2026-01-24 02:52:32,984 - BERTopic - Representation - Fine-tuning topics using representation models.
2026-01-24 02:52:33,026 - BERTopic - Representation - Completed ✓


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,17,-1_cause_table_precise_different,"[cause, table, precise, different, vague, accu...","[An instrument is a ""measure"" of a set of stim..."
1,0,104,0_images_past_memory_sensations,"[images, past, memory, sensations, present, se...",[They give laws according to which images of p...
2,1,101,1_object_matter_mental_physical,"[object, matter, mental, physical, different, ...",[When a mental occurrence can be regarded as a...
3,2,54,2_desire_actions_instinct_animals,"[desire, actions, instinct, animals, animal, u...","[The essence of instinct, one might say, is th..."


I'm a little doubtful of how gemini is gonna write the paragraph just given the topics, so this is just to test if the meaning/content of the AI generated para is similar to the human written ones. 
Lets do a semantic similarity test using the cosine similarity. 

In [5]:
from sentence_transformers import SentenceTransformer,util
sim_check_model = SentenceTransformer('all-MiniLM-L6-v2')

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
def check_similarity(para_1, para_2): 
    embedding1 = sim_check_model.encode(para_1)
    embedding2 = sim_check_model.encode(para_2)

    similarity = util.cos_sim(embedding1, embedding2)
    print(f"{similarity.item():.4f}")
    return similarity.item()

In [None]:
gemini_response1 = "A human's true character is profoundly tested and revealed not necessarily in times of weakness, but in the stark light of power. While power itself is merely an instrument, its acquisition often illuminates the deeper moral fabric of an individual. Historically, men have frequently occupied positions of significant power, and their actions in such roles have shaped societies. It is here that the strength or flaw in one's character becomes undeniably apparent. A man of strong character might wield power responsibly, with empathy and foresight, using it to uplift and create. Conversely, a weak character, given authority, can quickly succumb to arrogance and corruption, mistaking control for true leadership. Ultimately, the way a human, especially a man, navigates the complexities of power is a definitive measure of their inherent character, influencing not only their personal legacy but the trajectory of humanity." \

def compare_clusters(gemini_response):
    avg_sim = 0
    count = 0
    for i in range(0,len(data_a1)):
        if (topics_a1[i] == 0):
            avg_sim = avg_sim + check_similarity(data_a1['text'][i],gemini_response)
            count = count+1

    avg_sim = avg_sim/count
    print(f"avg_sim is {avg_sim}")

compare_clusters(gemini_response1)

avg_sim is 0.54


The prompts are not that similar, cos-1(0.54) = 58 degree. 

Probably because of the length difference in comparision, one is 150 words and the others are huge essays. Lets compare small cluster wise.

In [13]:
gemini_response2 = "The inherent capacities of the human being know no gender, yet societal power structures have historically cast men in a dominant role. This immense power, often asserted through custom and law, has significantly shaped the character of both sexes. Men, holding absolute sway, frequently developed traits of dominion, while women's character was molded by the demands of subjection, curtailing their intellectual and moral growth. Mill's perspective illuminates how this systemic imbalance not only diminishes individual potential but also impoverishes society by preventing the full flowering of human talent. True progress, then, necessitates an equal distribution of power, allowing every human character to develop freely, unburdened by artificial constraints."
compare_clusters(gemini_response2)

0.4798
0.3679
0.2689
0.3113
0.5969
0.6158
0.4912
0.4679
0.6313
0.4014
0.5684
0.3517
0.3407
0.3531
0.3751
0.5232
0.5696
0.4995
0.6121
0.5184
0.4481
0.4231
0.5675
0.4091
0.4924
0.5019
0.3499
0.3392
0.3970
0.4006
0.4666
0.2707
0.2539
0.5413
0.4786
0.4755
0.4837
0.4812
0.3274
0.4489
0.5801
0.5963
0.5504
0.5219
0.4693
0.5398
0.5084
0.6009
0.4367
0.6357
0.5604
0.5376
0.3556
0.3902
0.5245
0.6052
0.5315
0.3643
avg_sim is 0.6691310344827586


0.6691310344827586

Not that bad, cos inverse (0.66) around 45 degrees. 

In [1]:
with open("./class_2/a1_t1.txt", "r") as file:
    gemini_essay = file.read()
    print(gemini_essay)

The intricate relationship between the concepts of "human," "power," "men," and "character" forms a fundamental analytical framework for understanding the historical
development of social structures and individual identities. These terms are not fixed or universally defined but are instead deeply contingent upon historical context, 
societal organization, and the distribution of authority. A systematic exploration reveals how definitions of humanity have been historically circumscribed, how power 
has been allocated, how the identity of "men" has been socially constructed, and how individual character has been molded under these prevailing conditions.

To begin, the concept of "human" itself has not always been universally applied in its fullest sense. While biologically all individuals belong to the human species,
the *recognition* of full humanity, entailing autonomy, rationality, and the right to self-development, has historically been selectively bestowed. For extensive
periods, pr

Lets compare the entire essay's semantic similarity instead of paragraph by paragraph

In [None]:
str_top0 = ""
for i in range(0,len(data_a1)):
    if (topics_a1[i] == 0):
        str_top0 = str_top0 + data_a1['text'][i]

check_similarity(str_top0, gemini_essay)

0.6123


Lets try take a random human para and use our current approach to see how it generalizes  

In [6]:
human = "The U.S. bombings thatended World War II didn’t mark theclose of atomic warfare.They were just the beginning. The U.S. bombingsthat ended World War IIdidn’t mark the close ofatomic warfare. They werejust the beginning. From 1945 to 2017, nuclear nationscarried out more than 2,000 explosive testsin the atmosphere, undergroundand underwater, mostly in remote places. From 1945 to 2017,nuclear nations carried outmore than 2,000 explosivetests in the atmosphere,underground and underwater,mostly in remote places. Some of the atmospheric testswere magnitudes more powerful thanthe bombs dropped on Japan,sickening and displacing thousands. Some of the atmospherictests were magnitudes morepowerful than the bombsdropped on Japan, sickeningand displacing thousands. Their descendants — whocontinue to endure physical,psychological, economicand cultural fallout — are livingproof that nuclear weaponsshould never be testedagain. If only today’s leaderswould take heed. Their descendants — whocontinue to endure physical,psychological, economicand cultural fallout — are livingproof that nuclear weaponsshould never be testedagain. If only today’s leaderswould take heed. The Toll . About an hour’s drivefrom the Las Vegas Strip, deep craters pockmark the desert sand for miles in every direction. It’s here, amid the sunbaked flats, that the United States conducted 928 nuclear tests during the Cold War above and below ground. The site is mostly quiet now, and has been since 1992, when Washington halted America’s testing program. There are growing fears this could soon change. As tensions deepen in America’s relations with Russia and China, satellite images reveal all three nations areactively expandingtheirnuclear testing facilities, cutting roads and digging new tunnels at long-dormant proving grounds, including in Nevada."
ai_gen = "The U.S. bombings of Hiroshima and Nagasaki, which marked the end of World War II on August 6 and 9, 1945, respectively, are often viewed as the final chapter in the annals of atomic warfare. However, as a recent tweet from The New York Times (NYT) reminds us, these bombings were merely the beginning of a new and ominous era in global conflict. The devastating impact of the atomic bombs on Hiroshima and Nagasaki was immediate and catastrophic. The first bomb, dropped on Hiroshima, exploded with a force equivalent to 15,000 tons of TNT. The second bomb, dropped on Nagasaki, was even more powerful, with an explosive yield of 21,000 tons of TNT. The resulting destruction was unprecedented, with tens of thousands of lives lost in the blink of an eye. However, the aftermath of the bombings was far from over. The atomic bombs unleashed a new and terrifying form of warfare, one that would shape the course of history for decades to come. The atomic bombs marked the dawn of the Nuclear Age, a period characterized by the proliferation of nuclear weapons and the constant threat of nuclear war. In the years following World War II, the United States, the Soviet Union, and other nations raced to develop and stockpile nuclear weapons. The Cold War, a period of geopolitical tension between the United States and the Soviet Union, was marked by a nuclear arms race that saw the stockpiling of tens of thousands of nuclear weapons. The threat of nuclear war was not just theoretical. The world came closer to nuclear war than many realize. During the Cuban Missile Crisis in 1962, the United States and the Soviet Union came dangerously close to nuclear war. The crisis began when the United States discovered that the Soviet Union had installed nuclear missiles in Cuba, just 90 miles off the coast of Florida. The United States demanded that the Soviet Union remove the missiles, and a tense standoff ensued. The world held its breath as the two superpowers engaged in a high-stakes game of brinkmanship. The crisis was eventually resolved through diplomacy, but it served as a stark reminder of the dangers of nuclear war."

check_similarity(human,ai_gen)

0.7094


0.7094463109970093