# Topic Modeling
### Second attempt
Lbl2Vec is an algorithm for unsupervised document classification and unsupervised document retrieval. It automatically generates jointly embedded label, document, and word vectors and returns documents of categories modeled by manually predefined keywords. The key idea is that many semantically similar keywords can represent a category. First, the algorithm creates a joint embedding of document and word vectors. Once documents and words share a vector space, it learns a label vector for each category from the manually defined keywords that represent it. Finally, it predicts which category a document belongs to from the similarity between the document vector and the label vectors. The algorithm is described in more detail in the following article and paper:

https://towardsdatascience.com/unsupervised-text-classification-with-lbl2vec-6c5e040354de

https://www.scitepress.org/Link.aspx?doi=10.5220/0010710300003058
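
The final step described above, assigning each document to the label whose vector is most similar, can be sketched in plain Python. This is only an illustration: the 3-dimensional vectors are made-up toy values, and `cosine` and `classify` are hypothetical helper names, not part of the Lbl2Vec API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(doc_vector, label_vectors):
    """Return (best_label, similarities) for one document vector."""
    sims = {label: cosine(doc_vector, vec) for label, vec in label_vectors.items()}
    best = max(sims, key=sims.get)
    return best, sims

# Toy label vectors (illustrative values only, not learned embeddings)
label_vectors = {
    'Chemistry': [0.9, 0.1, 0.0],
    'Politics&Society': [0.0, 0.2, 0.9],
}
doc_vector = [0.1, 0.3, 0.8]

best_label, sims = classify(doc_vector, label_vectors)
print(best_label, sims)
```

In the real model the vectors come from the jointly trained document and word embedding, but the decision rule is the same kind of nearest-label lookup.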

In [1]:
import pandas as pd
# Read the dataset into a DataFrame
data = pd.read_csv('D:\\Alerta Backup Data\\gonzalo_data\\datasets\\text\\data_clean_large.csv', sep=';')
data = data.dropna(subset=['text'])
data = data[data['class'] == 0]
# Print head
data.head()

Unnamed: 0,class,text,image
0,0,© From your Google Drive Interview-Mode BK99 S...,focused_Algorithms_one_0.jpg
1,0,=] Computatior] ] + File Edit View Insert Form...,focused_Algorithms_one_1.jpg
2,0,= 401 Computational Geometry File Edit View ...,focused_Algorithms_one_10.jpg
3,0,= 401 Computational Geometry File Edit View ...,focused_Algorithms_one_100.jpg
4,0,= 401 Computational Geometry File Edit View ...,focused_Algorithms_one_101.jpg


In [2]:
# Extract course from 'image' column
data['course'] = data['image'].apply(lambda x: x.split('_')[1])
# Print random sample
data.sample(5)

Unnamed: 0,class,text,image,course
2650,0,EQ) Peper 2 NM - Google Doc: B Paper2INM + (5 ...,focused_English_one_2650.jpg,English
4730,0,"Sy 's THE STORMING STAG! i FN a ee aaa 4, What...",focused_Management_partial_4730.jpg,Management
8496,0,ext derta ould 2 Wor Walze 1 of Membershi 5 ...,focused_Social Politics_one_8496.jpg,Social Politics
4222,0,Untitled document * (a) File Edit View Insert ...,focused_LatinAmericanGovPolitics_partial_4222.jpg,LatinAmericanGovPolitics
6868,0,Next? < Previous There are three emotional con...,focused_Philosophy_one_6868.jpg,Philosophy
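
The `split('_')[1]` call above relies on the image names following a fixed four-part pattern. A slightly more defensive parser might look like the sketch below. The pattern is inferred from the sample rows only, and the field names `label` and `view` are assumptions, since the notebook does not say what `focused` or `one`/`partial` mean.

```python
def parse_image_name(image_name):
    """Split '<label>_<course>_<view>_<id>.jpg' into its parts."""
    stem = image_name.rsplit('.', 1)[0]  # drop the file extension
    # Unpacking raises ValueError if the name does not have exactly four parts
    label, course, view, doc_id = stem.split('_')
    return {'label': label, 'course': course, 'view': view, 'id': int(doc_id)}

parsed = parse_image_name('focused_Social Politics_one_8496.jpg')
print(parsed['course'])
```

Failing loudly on an unexpected name makes it easier to spot files that would otherwise end up with a wrong course silently.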


In [3]:
# Extract set of courses
courses = set(data['course'])
print(courses)

{'Principles&Practice', 'IntroductionToPsychology', 'ClinicalPractium', 'InternationalSocialJustice', 'Chemistry', 'Literature', 'PhysicalChemistry', 'IntroGenderSexuality', 'GenderSexuality', 'IntroductionToEthics', 'Pharmacology', 'other', 'DrugBiology', 'Business', 'LatinAmericanGovPolitics', 'Genetics', 'Phisiology', 'GeneralPhysics', 'Ethics', 'Philosophy', 'Race&Racism', 'QuantitativeAnalysis', 'Algorithms', 'Biology', 'Neuroscience', 'neutral', 'ReadingFilm', 'ReadingLiterature', 'Data Analytics', 'ComputerScience', 'VRdevelopment', 'Politics', 'Art', 'LinearAlgebra', 'Physics', 'Social Politics', 'extra', 'Management', 'ProfessionalPractices', 'English', 'Astronomy', 'Marketing'}


In [4]:
# Group the courses by topic and add them to a dictionary
course_groups = {
    'ComputerScience': ['Algorithms', 'VRdevelopment', 'ComputerScience'],
    'Chemistry': ['Chemistry', 'PhysicalChemistry'],
    'Physics': ['Physics'],
    'Math&Statistics': ['Data Analytics', 'QuantitativeAnalysis', 'LinearAlgebra'],
    'Pharma' : ['Pharmacology'],
    'Biology': ['Biology', 'Genetics', 'DrugBiology', 'Neuroscience', 'Phisiology'],
    'Psychology': ['Psychology', 'IntroductionToPsychology'],
    'Business': ['Business', 'Marketing', 'Management'],
    'Gender': ['IntroGenderSexuality', 'GenderSexuality'],
    'Philosophy&Ethics' : ['Philosophy', 'Ethics', 'IntroductionToEthics'],
    'Politics&Society': ['Social Politics', 'LatinAmericanGovPolitics', 'Politics', 'InternationalSocialJustice', 'Race&Racism'],
    'Arts': ['Art'],
    'Astronomy': ['Astronomy'],
    'Literature': ['Literature', 'ReadingLiterature', 'ReadingFilm'],
}

# Group courses into topics
def group_courses(course):
    for key, value in course_groups.items():
        if course in value:
            return key
    return 'Other'

# Create a new column called 'topic' that has the value of the course group
data['topic'] = data['course'].apply(group_courses)
# Print the number of documents per topic
data['topic'].value_counts()

topic
Other                2322
Politics&Society     1419
ComputerScience       809
Chemistry             777
Biology               650
Business              607
Math&Statistics       510
Philosophy&Ethics     468
Astronomy             269
Literature            265
Gender                191
Pharma                141
Arts                   93
Physics                48
Psychology             37
Name: count, dtype: int64
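
The grouping function above scans every topic list for each row. An alternative is to invert the dictionary once into a course-to-topic lookup. The sketch below uses a shortened copy of `course_groups`; `course_to_topic` and `group_course` are hypothetical names.

```python
# Shortened copy of the notebook's course_groups, for illustration
course_groups = {
    'Chemistry': ['Chemistry', 'PhysicalChemistry'],
    'Gender': ['IntroGenderSexuality', 'GenderSexuality'],
}

# Invert topic -> [courses] into course -> topic for direct lookups
course_to_topic = {
    course: topic
    for topic, courses in course_groups.items()
    for course in courses
}

def group_course(course):
    """Return the topic of a course, or 'Other' if it is not grouped."""
    return course_to_topic.get(course, 'Other')

print(group_course('PhysicalChemistry'))
print(group_course('extra'))
```

With pandas, the same lookup could be applied as `data['course'].map(course_to_topic).fillna('Other')`.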

In [5]:
# Drop the rows with topic equal to 'Other'; copy so that new columns can be added without a SettingWithCopyWarning
data = data[data['topic'] != 'Other'].copy()
# Set a topic index
data['topic_id'] = data['topic'].factorize()[0]
# Create a new dataframe called topics and remove duplicates
topics = data[['topic', 'topic_id']].drop_duplicates().sort_values('topic_id')
# Print topics
topics

Unnamed: 0,topic,topic_id
0,ComputerScience,0
290,Arts,1
383,Astronomy,2
652,Biology,3
788,Business,4
908,Chemistry,5
2255,Math&Statistics,6
2719,Philosophy&Ethics,7
3055,Gender,8
3313,Politics&Society,9
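
`factorize` assigns integer codes in order of first appearance, which is why `ComputerScience` (the first rows of the file) gets id 0. A plain-Python sketch of that behavior, ignoring missing values, is shown below; this `factorize` is a simplified stand-in written for illustration, not the pandas implementation.

```python
def factorize(values):
    """Assign integer codes in order of first appearance."""
    codes, uniques, seen = [], [], {}
    for value in values:
        if value not in seen:
            seen[value] = len(uniques)
            uniques.append(value)
        codes.append(seen[value])
    return codes, uniques

codes, uniques = factorize(['ComputerScience', 'Arts', 'ComputerScience', 'Astronomy'])
print(codes)    # [0, 1, 0, 2]
print(uniques)  # ['ComputerScience', 'Arts', 'Astronomy']
```

One consequence is that the ids depend on row order: reordering or filtering the data before this step can change which id a topic receives.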


In [6]:
# Drop the columns that are not needed: 'class', 'image', 'course', 'topic'
data = data.drop(columns=['class', 'image', 'course', 'topic'])
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2, random_state=42)
# Print the shape of train and test sets
print(train.shape, test.shape)
data

(5027, 2) (1257, 2)


Unnamed: 0,text,topic_id
0,© From your Google Drive Interview-Mode BK99 S...,0
1,=] Computatior] ] + File Edit View Insert Form...,0
2,= 401 Computational Geometry File Edit View ...,0
3,= 401 Computational Geometry File Edit View ...,0
4,= 401 Computational Geometry File Edit View ...,0
...,...,...
8601,oor w 5s E fe unties Build Stings Scones in Bu...,0
8602,"soibigne, tain u nica lea @ \c) is ieee i (eee...",0
8603,Ba od ie tS) nf [HAO Lo le re (epee eo | I HO ...,0
8604,toh. eae lo ( = 6 j < E it | aro b@ fo re BL L...,0
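
The topic counts above range from 37 documents (Psychology) to 1419 (Politics&Society), so a purely random split can leave rare topics under-represented in the test set. One option is a stratified split. The sketch below shows the idea in plain Python with toy labels; `stratified_split` is a hypothetical helper, not the function used in this notebook.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly test_size of each label in test."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Keep at least one test item per label (a label with a single item gets no training item)
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: one rare and one frequent topic
labels = ['Physics'] * 10 + ['Politics&Society'] * 90
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With scikit-learn, passing `stratify=data['topic_id']` to `train_test_split` achieves a similar effect. Doing so would change the split and therefore the scores reported later in this notebook.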


In [7]:
# Dictionary with a list of keywords for each topic
topic_keywords = {
    'ComputerScience': [
    'algorithms', 'data', 'programming', 'networks', 'databases', 'security', 'software', 'development', 'efficiency',
    'backend', 'frontend', 'web', 'robotics', 'computation', 'computational', 'code', 'run', 'debug', 'input', 
    'output', 'technology', 'overflow', 'binary', 'bit', 'bits', 'byte', 'bytes', 'computer', 'computers', 
    'computing', 'program', 'programs', 'programming', 'programmer', 'programmers', 'programmed',
    'automation', 'devops', 'cloud', 'architecture', 'servers', 'testing', 'debugging', 'scaling', 'data structures',
    'analysis', 'analytics', 'design patterns', 'machine learning', 'deep learning', 'neural networks', 'data science', 
    'database management', 'database design', 'data mining', 'data modeling', 'data warehousing', 'data visualization', 
    'data analysis', 'data-driven', 'information', 'information systems', 'artificial intelligence', 'computer vision',
    'natural language processing', 'cybersecurity', 'cryptography', 'encryption', 'decryption', 'hacking', 'penetration testing',
    'firewalls', 'authentication', 'authorization', 'access control', 'virtualization', 'cloud computing', 'cloud storage',
    'distributed computing', 'parallel computing', 'high-performance computing', 'scalability', 'optimization',
    'operating systems', 'system administration', 'mobile development', 'web development', 'responsive design', 
    'user interface', 'user experience', 'agile', 'project management', 'version control', 'continuous integration', 
    'continuous delivery', 'source code', 'open source', 'intellectual property', 'licensing', 'patents', 'copyright',
    'data privacy', 'online privacy', 'e-commerce', 'web services', 'internet of things', 'big data', 'blockchain', 
    'virtual reality', 'augmented reality', 'quantum computing', '3D printing', 'nanotechnology', 'game development',
    'graphics programming', 'rendering', 'compilers', 'interoperability', 'API', 'JSON', 'XML', 'HTTP', 'HTTPS', 
    'TCP/IP', 'FTP', 'SMTP', 'REST', 'SOAP', 'microservices', 'serverless', 'containerization', 'kubernetes',
    'docker', 'ansible', 'terraform', 'jenkins', 'puppet', 'chef', 'salt', 'ansible', 'monitoring', 'logging', 
    'troubleshooting', 'incident response', 'performance tuning', 'load balancing', 'fault tolerance', 
    'reliability engineering', 'chaos engineering', 'computer graphics', 'computer vision', 'user research', 
    'virtual assistants', 'machine translation', 'audio processing', 'speech recognition', 'natural language understanding',
    'predictive analytics', 'data preprocessing', 'data cleaning', 'data integration', 'data engineering', 
    'data governance', 'data quality', 'data enrichment', 'data exploration', 'data validation', 'data profiling',
    'data storage', 'data migration', 'data synchronization', 'data access', 'data federation', 'data catalog', 
    'data lineage', 'data lineage', 'data lineage', 'data transformation', 'data curation', 'data lineage', 
    'data masking', 'data anonymization', 'data ethics'],
    'Chemistry': ['atoms', 'molecules', 'elements', 'reactions', 'thermodynamics', 'kinetics', 'spectroscopy', 'quantum', 'organic', 'inorganic', 'oxygen', 'co2', 'hydrogen', 'electron', 'electronic', 'electrons', 'proton', 'protons', 'napthol', 'state', 'experiment', 'measurements', 'measurement', 'element', 'acid', 'base', 'pH', 'redox', 'stoichiometry', 'enthalpy', 'entropy', 'gibbs', 'solubility', 'solution', 'equilibrium', 'rate', 'rate law', 'catalysis', 'transition state', 'molecular orbitals', 'periodic table', 'valence', 'bonding', 'hybridization', 'isomers', 'chirality', 'alkanes', 'alkenes', 'alkynes', 'aromatics', 'amines', 'alcohols', 'carbonyls', 'carboxylic acids', 'esters', 'amines', 'amides', 'polymers', 'biomolecules', 'proteins', 'nucleic acids', 'lipids', 'carbohydrates', 'chromatography', 'mass spectrometry', 'infrared spectroscopy', 'UV-Vis spectroscopy', 'NMR spectroscopy', 'X-ray crystallography', 'gas laws', 'ideal gas', 'real gas', 'colligative properties', 'phase diagrams'],
    'Physics': ['waves', 'particle', 'thermal energy', 'fluids', 'electricity', 'nuclear', 'dynamics', 'vibration', 'power', 'general relativity', 'gravitational potential energy', 'potential energy', 'quantum mechanics', 'circuits', 'electrostatics', 'magnetism', 'energy', 'astrophysics', 'particle physics', 'optics', 'temperature', 'kinematics', 'wave', 'mechanics', 'entropy', 'radioactivity', 'physical optics', 'quantum theory', 'kinetic energy', 'subatomic', 'astronomy', 'nuclear physics', 'conservation of energy', 'thermodynamics', 'relativity', 'oscillation', 'solid', 'cosmology', 'work', 'heat', 'electromagnetism', 'thermostat', 'quantum', 'geometrical optics', 'special relativity'],
    'Math&Statistics': ['math', 'mathematics', 'calculus', 'differentiation', 'integration', 'derivatives', 'limits', 'functions', 'graphing', 'equations', 'algebra', 'linear', 'quadratic', 'polynomial', 'exponential', 'logarithmic', 'trigonometric', 'complex', 'vector', 'matrix', 'probability', 'probabilistic', 'statistics', 'statistical', 'data', 'analysis', 'sampling', 'hypothesis', 'testing', 'inference', 'regression', 'correlation', 'ANOVA', 'random', 'variable', 'distribution', 'normal', 'binomial', 'poisson', 'chi-squared', 't-distribution', 'f-distribution', 'confidence', 'interval', 'estimation', 'discrete', 'discretization', 'combinatorics', 'permutation', 'combination', 'graph', 'graphing', 'network', 'theory', 'graph', 'geometry', 'Euclidean', 'non-Euclidean', 'topology', 'fractal', 'dimension', 'metric'],
    'Pharma': ['drugs', 'medicines', 'pharmaceuticals', 'pharmacy', 'pharmacology', 'pharmaceutics', 'pharmacokinetics', 'pharmacodynamics', 'clinical', 'preclinical', 'toxicology', 'pharmacogenetics', 'pharmacogenomics', 'pharmacovigilance', 'pharmacoepidemiology', 'pharmacoeconomics', 'drug interactions', 'drug delivery', 'drug development', 'drug discovery', 'therapeutics', 'therapeutic agents', 'biopharmaceuticals', 'biologics', 'biosimilars', 'generic drugs', 'over-the-counter drugs', 'prescription drugs', 'active pharmaceutical ingredients', 'excipients', 'formulations', 'dosage forms', 'clinical trials', 'drug safety', 'pharmaceutical regulation', 'pharmaceutical marketing', 'pharmaceutical sales', 'pharmacy benefit management'],
    'Biology': ['biology', 'evolution', 'genetics', 'cell', 'ecology', 'physiology', 'neuroscience', 'immunology', 'microbiology', 'biotechnology', 'biochemistry', 'heart', 'lung', 'brain', 'body', 'bone', 'bones', 'muscle', 'muscles', 'blood', 'tissue', 'tissues', 'organ', 'organs', 'organism', 'organisms', 'cellular', 'chromosome', 'gene', 'DNA', 'RNA', 'nucleus', 'mitosis', 'meiosis', 'prokaryote', 'eukaryote', 'adaptation', 'natural selection', 'mutation', 'inheritance', 'variation', 'cloning', 'virus', 'bacteria', 'fungi', 'parasite', 'immunity', 'antibody', 'vaccine', 'antigen', 'pathogen', 'disease', 'infection', 'epidemic', 'endocrine', 'hormone', 'neuron', 'synapse', 'reflex', 'afferent', 'efferent', 'peripheral', 'central', 'cerebellum', 'cerebral cortex', 'neurotransmitter', 'dendrite', 'axon', 'action potential', 'membrane potential', 'synaptic transmission', 'receptor', 'ligand', 'signal transduction', 'endocytosis', 'exocytosis', 'vesicle', 'cytoskeleton', 'flagellum', 'cilia', 'organelle', 'membrane', 'osmosis', 'diffusion', 'active transport', 'enzyme', 'substrate', 'metabolism', 'glycolysis', 'citric acid cycle', 'electron transport chain', 'photosynthesis', 'respiration', 'fermentation', 'amino acid', 'protein', 'carbohydrate', 'lipid', 'nucleotide', 'enzyme', 'hormone', 'neurotransmitter', 'receptor', 'apoptosis', 'cancer', 'tumor', 'stem cell', 'regeneration', 'development', 'differentiation', 'gamete', 'fertilization', 'zygote', 'embryo', 'blastula', 'gastrula', 'morula', 'organogenesis', 'homeostasis', 'feedback', 'positive feedback', 'negative feedback', 'metabolism', 'nutrition', 'digestion', 'absorption', 'excretion', 'circulatory', 'lymphatic', 'respiratory', 'excretory', 'immune', 'nervous', 'endocrine', 'muscular', 'skeletal', 'integumentary', 'reproductive', 'vertebrate', 'invertebrate'],
    'Psychology': ['cognition', 'cognitive', 'perception', 'perceive', 'learning', 'learn', 'memory', 'remember', 'development', 'develop', 'personality', 'personality traits', 'traits', 'psychology', 'psychological', 'social', 'sociology', 'sociological', 'counseling', 'counsel', 'neuropsychology', 'neuroscience', 'neurological', 'behavioral', 'behavior', 'behaviors', 'behaviour', 'behaviours', 'mental health', 'mental illness', 'mental disorder', 'psychiatry', 'psychoanalysis', 'psychoanalytic', 'therapist', 'therapy', 'therapies', 'clinical psychology', 'abnormal psychology', 'positive psychology', 'forensic psychology', 'child psychology', 'adolescent psychology', 'sports psychology', 'educational psychology', 'industrial-organizational psychology', 'social psychology'],
    'Business': ['management', 'marketing', 'finance', 'accounting', 'economics', 'entrepreneurship', 'strategy', 'operations', 'leadership', 'resources', 'bank', 'credit', 'money', 'market', 'stock', 'stocks', 'investment', 'investments', 'economy', 'economies', 'economic', 'economics', 'financial', 'finances', 'accounting', 'account', 'accounts', 'accountant', 'accountants', 'entrepreneur', 'entrepreneurs', 'entrepreneurship','insurance', 'insurances', 'insurer', 'insurers', 'insure', 'insured', 'insuring', 'insures', 'invest', 'invested'],
    'Gender': ['feminism', 'queer', 'intersectionality', 'masculinity', 'sexuality', 'gender', 'sexism', 'homophobia', 'transphobia', 'identity', 'patriarchy'],
    'Philosophy&Ethics': ['logic', 'metaphysics', 'epistemology', 'ethics', 'aesthetics', 'existentialism', 'philosophy', 'social', 'mind', 'ontology', 'deontology', 'utilitarianism', 'virtue', 'morality', 'subjectivity', 'objectivity', 'rationality', 'reasoning', 'argument', 'justification', 'dialectics', 'phenomenology', 'hermeneutics', 'postmodernism', 'structuralism', 'postcolonial', 'anarchism', 'communism', 'libertarianism', 'existentialist', 'skepticism', 'socratic', 'platonism', 'aristotelian', 'nihilism', 'humanism', 'transhumanism', 'naturalism', 'pragmatism', 'neo-kantian', 'neo-hegelian', 'critical theory', 'continental philosophy', 'analytic philosophy', 'phenomenology', 'pragmatism', 'post-structuralism', 'postmodernism', 'existentialist philosophy', 'ontology', 'deontology', 'epistemic', 'epistemology', 'hermeneutics', 'historiography', 'phenomenology', 'philosophy of language', 'philosophy of law', 'philosophy of mind', 'philosophy of religion', 'philosophy of science', 'political philosophy', 'social philosophy', 'philosophy of technology', 'existential philosophy'],
    'Politics&Society': ['democracy', 'globalization', 'human rights', 'environmental policy', 'public policy', 'political theory', 'international relations', 'race', 'class', 'gender', 'civil', 'war', 'protest', 'country', 'police'],
    'Arts': ['painting', 'sculpture', 'photography', 'music', 'film', 'theater', 'literature', 'performance', 'installation', 'design', 'contemporary', 'modern', 'color', 'fashion'],
    'Astronomy': ['planets', 'stars', 'galaxies', 'cosmology', 'astrophysics', 'astronomy', 'exoplanets', 'astrobiology', 'gravity', 'black holes', 'nebulae', 'supernovae', 'cosmic rays', 'dark matter', 'dark energy', 'telescopes', 'observatories', 'interstellar', 'intergalactic', 'red giants', 'white dwarfs', 'black dwarfs', 'neutron stars', 'pulsars', 'quasars', 'cosmic microwave background', 'cosmic inflation', 'cosmic web', 'gravitational waves', 'interplanetary', 'solar system', 'orbital mechanics', 'celestial mechanics', 'asteroids', 'comets', 'meteoroids', 'meteorites', 'moon', 'lunar', 'solar', 'eclipse', 'zodiac', 'constellations', 'Milky Way', 'Andromeda', 'Hubble', 'Kepler', 'Chandra', 'Spitzer', 'James Webb', 'planetarium', 'star chart', 'cosmic evolution', 'cosmic abundance', 'exoplanet discovery', 'extraterrestrial life', 'SETI'],
    'Literature': ['poetry', 'prose', 'fiction', 'nonfiction', 'drama', 'criticism', 'literary', 'postcolonial', 'novel', 'literature', 'film', 'writing', 'reading', 'book', 'shakespeare']
}
# Add to the topics dataframe a column called 'keywords' with the list of keywords for each topic
topics['keywords'] = topics['topic'].apply(lambda topic: topic_keywords[topic])
# get the number of keywords for each topic
topics['number_of_keywords'] = topics['keywords'].apply(len)
topics

Unnamed: 0,topic,topic_id,keywords,number_of_keywords
0,ComputerScience,0,"[algorithms, data, programming, networks, data...",183
290,Arts,1,"[painting, sculpture, photography, music, film...",14
383,Astronomy,2,"[planets, stars, galaxies, cosmology, astrophy...",57
652,Biology,3,"[biology, evolution, genetics, cell, ecology, ...",140
788,Business,4,"[management, marketing, finance, accounting, e...",42
908,Chemistry,5,"[atoms, molecules, elements, reactions, thermo...",74
2255,Math&Statistics,6,"[math, mathematics, calculus, differentiation,...",62
2719,Philosophy&Ethics,7,"[logic, metaphysics, epistemology, ethics, aes...",65
3055,Gender,8,"[feminism, queer, intersectionality, masculini...",11
3313,Politics&Society,9,"[democracy, globalization, human rights, envir...",15
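
Several keyword lists above contain duplicates (for example `'data lineage'` and `'ansible'`), uppercase entries such as `'API'`, and multi-word phrases. Because the `tokenize` function used later in this notebook produces lowercase single-word tokens, such entries cannot match a token as written. The sketch below shows one possible normalization; `normalize_keywords` is a hypothetical helper, and splitting phrases into words is an assumption that may make generic words like `data` count as keywords.

```python
import re

def normalize_keywords(keywords, min_len=2, max_len=15):
    """Lowercase, split into single-word tokens, and de-duplicate, preserving order."""
    seen, result = set(), []
    for keyword in keywords:
        # Split on anything that is not an ASCII letter (spaces, hyphens, digits, slashes)
        for token in re.split(r'[^a-z]+', keyword.lower()):
            if min_len <= len(token) <= max_len and token not in seen:
                seen.add(token)
                result.append(token)
    return result

sample = ['data lineage', 'data lineage', 'API', 'TCP/IP', 'data-driven', 'ansible', 'ansible']
normalized = normalize_keywords(sample)
print(normalized)
```

The `min_len` and `max_len` defaults mirror the values passed to `simple_preprocess` in this notebook, so that keywords and document tokens are filtered in the same way.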


In [8]:

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import strip_tags
from gensim.models.doc2vec import TaggedDocument

# doc: document text string
# returns the tokenized document
# strip_tags removes HTML/XML-style tags from the text
# simple_preprocess converts a document into a list of lowercase tokens, ignoring tokens that are too short or too long
# simple_preprocess also removes numerical values as well as punctuation characters
def tokenize(doc):
    return simple_preprocess(strip_tags(doc), deacc=True, min_len=2, max_len=15)

# add data set type column
train['data_set_type'] = 'train'
test['data_set_type'] = 'test'

# concat train and test data
data_full_corpus = pd.concat([train,test]).reset_index(drop=True)

# reduce dataset to only documents that belong to topics for which we defined keywords
data_full_corpus = data_full_corpus[data_full_corpus['topic_id'].isin(list(topics['topic_id']))].reset_index(drop=True)

# tokenize and tag documents for Lbl2Vec training
data_full_corpus['tagged_docs'] = data_full_corpus.apply(lambda row: TaggedDocument(tokenize(row['text']), [str(row.name)]), axis=1)

# add doc_key column
data_full_corpus['doc_key'] = data_full_corpus.index.astype(str)

# add the topic name, keywords, and keyword count for each document
data_full_corpus = data_full_corpus.merge(topics, on='topic_id', how='left')
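
Once the documents are tokenized, it can be useful to check which keywords actually occur in the corpus, since a keyword that never appears cannot help locate documents for its label. The sketch below uses toy token lists; `keyword_coverage` is a hypothetical helper and not part of gensim or Lbl2Vec.

```python
def keyword_coverage(keywords, tokenized_docs):
    """Split keywords into those present in and missing from the corpus vocabulary."""
    vocabulary = {token for doc in tokenized_docs for token in doc}
    present = [k for k in keywords if k in vocabulary]
    missing = [k for k in keywords if k not in vocabulary]
    return present, missing

# Toy tokenized documents (illustrative only)
docs = [['linear', 'algebra', 'matrix'], ['vector', 'space', 'matrix']]
present, missing = keyword_coverage(['matrix', 'vector', 'ANOVA', 'rate law'], docs)
print(present)
print(missing)
```

In this notebook the token lists could be taken from the tagged documents, for example `[td.words for td in data_full_corpus['tagged_docs']]`. The embedding model may additionally drop rare words, so this check is only an upper bound on what the model can use.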

In [9]:
data_full_corpus

Unnamed: 0,text,topic_id,data_set_type,tagged_docs,doc_key,topic,keywords,number_of_keywords
0,Dashboard & 601 words Journal #2 + (oy File Ed...,7,train,"([dashboard, words, journal, oy, file, edit, v...",0,Philosophy&Ethics,"[logic, metaphysics, epistemology, ethics, aes...",65
1,p. 5 Net impulse causes object to speed up or ...,13,train,"([net, impulse, causes, object, to, speed, up,...",1,Physics,"[waves, particle, thermal energy, fluids, elec...",45
2,Account Dashboard = Courses History Course M...,3,train,"([account, dashboard, courses, history, course...",2,Biology,"[biology, evolution, genetics, cell, ecology, ...",140
3,& My Drive - Google Drive Be 401 Computational...,0,train,"([my, drive, google, drive, be, computational,...",3,ComputerScience,"[algorithms, data, programming, networks, data...",183
4,"se Vectors | Chapter 1, Essence of linear alge...",6,train,"([se, vectors, chapter, essence, of, linear, a...",4,Math&Statistics,"[math, mathematics, calculus, differentiation,...",62
...,...,...,...,...,...,...,...,...
6279,© Mi cores-testing sems-Vininie con 3 ylab a...,3,test,"([mi, cores, testing, sems, vininie, con, ylab...",6279,Biology,"[biology, evolution, genetics, cell, ecology, ...",140
6280,Make an empty set used to store the eages In t...,0,test,"([make, an, empty, set, used, to, store, the, ...",6280,ComputerScience,"[algorithms, data, programming, networks, data...",183
6281,Cc @ virginiacommonwealth.instructure.com/cour...,3,test,"([cc, instructure, com, courses, pages, slides...",6281,Biology,"[biology, evolution, genetics, cell, ecology, ...",140
6282,Cc @ virginiacommonwealth.instructure.com/cour...,3,test,"([cc, instructure, com, courses, pages, slides...",6282,Biology,"[biology, evolution, genetics, cell, ecology, ...",140


In [10]:
topics

Unnamed: 0,topic,topic_id,keywords,number_of_keywords
0,ComputerScience,0,"[algorithms, data, programming, networks, data...",183
290,Arts,1,"[painting, sculpture, photography, music, film...",14
383,Astronomy,2,"[planets, stars, galaxies, cosmology, astrophy...",57
652,Biology,3,"[biology, evolution, genetics, cell, ecology, ...",140
788,Business,4,"[management, marketing, finance, accounting, e...",42
908,Chemistry,5,"[atoms, molecules, elements, reactions, thermo...",74
2255,Math&Statistics,6,"[math, mathematics, calculus, differentiation,...",62
2719,Philosophy&Ethics,7,"[logic, metaphysics, epistemology, ethics, aes...",65
3055,Gender,8,"[feminism, queer, intersectionality, masculini...",11
3313,Politics&Society,9,"[democracy, globalization, human rights, envir...",15


In [11]:
from lbl2vec import Lbl2Vec

# init model with parameters
Lbl2Vec_model = Lbl2Vec(
    keywords_list=list(topics.keywords),
    tagged_documents=data_full_corpus['tagged_docs'][data_full_corpus['data_set_type'] == 'train'],
    label_names=list(topics.topic),
    similarity_threshold=0.43,
    min_num_docs=5,
    epochs=10,
)

# train model
Lbl2Vec_model.fit()

2023-05-02 13:11:41,293 - Lbl2Vec - INFO - Train document and word embeddings
2023-05-02 13:12:15,008 - Lbl2Vec - INFO - Train label embeddings
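
The `similarity_threshold` and `min_num_docs` arguments control which documents contribute to each label vector. The sketch below is a simplified illustration of that idea, not the library's implementation: it averages the documents most similar to the keyword centroid and omits steps such as outlier removal. All vectors are toy values and the helper names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def learn_label_vector(keyword_vectors, doc_vectors, similarity_threshold, min_num_docs):
    """Average the documents closest to the keyword centroid (simplified sketch)."""
    keyword_centroid = mean_vector(keyword_vectors)
    # Rank documents by similarity to the keyword centroid, most similar first
    ranked = sorted(doc_vectors, key=lambda d: cosine(d, keyword_centroid), reverse=True)
    selected = [d for d in ranked if cosine(d, keyword_centroid) > similarity_threshold]
    # Fall back to the top-ranked documents if the threshold was too strict
    if len(selected) < min_num_docs:
        selected = ranked[:min_num_docs]
    return mean_vector(selected), len(selected)

keyword_vectors = [[1.0, 0.0], [0.8, 0.2]]
doc_vectors = [[0.9, 0.1], [0.7, 0.3], [0.0, 1.0], [0.1, 0.9]]
label_vector, n_used = learn_label_vector(
    keyword_vectors, doc_vectors, similarity_threshold=0.43, min_num_docs=2
)
print(n_used, label_vector)
```

A higher threshold keeps only documents that closely match the keywords, while `min_num_docs` guards against a label vector being built from too few documents.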


In [12]:
from sklearn.metrics import f1_score

# predict similarity scores
model_docs_lbl_similarities = Lbl2Vec_model.predict_model_docs()

# merge DataFrames to compare the predicted and true category labels
evaluation_train = model_docs_lbl_similarities.merge(data_full_corpus[data_full_corpus['data_set_type'] == 'train'], left_on='doc_key', right_on='doc_key')
y_true_train = evaluation_train['topic']
y_pred_train = evaluation_train['most_similar_label']

# with exactly one label per document, micro-averaged F1 equals accuracy
print('F1 score:', f1_score(y_true_train, y_pred_train, average='micro'))

2023-05-02 13:12:15,717 - Lbl2Vec - INFO - Get document embeddings from model
2023-05-02 13:12:15,727 - Lbl2Vec - INFO - Calculate document<->label similarities


F1 score: 0.5496319872687487
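
Since the topics are imbalanced, the micro-averaged score above is dominated by the large topics. The sketch below computes micro and macro F1 by hand on toy labels to show the difference; the helper names are hypothetical and the data is made up.

```python
def per_class_counts(y_true, y_pred, label):
    """True positives, false positives, and false negatives for one label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    counts = [per_class_counts(y_true, y_pred, label) for label in labels]
    # Micro: pool the counts over all labels, then compute one F1
    micro = f1_from_counts(*(sum(c) for c in zip(*counts)))
    # Macro: compute F1 per label, then take the unweighted mean
    macro = sum(f1_from_counts(*c) for c in counts) / len(labels)
    return micro, macro

# Toy example: the rare class 'B' is never predicted
y_true = ['A'] * 8 + ['B'] * 2
y_pred = ['A'] * 10
micro, macro = micro_macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(micro, macro, accuracy)
```

In scikit-learn, passing `average='macro'` to `f1_score` gives the macro variant, which weights every topic equally regardless of its size.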


In [13]:
# predict similarity scores of new test documents (they were not used during Lbl2Vec training)
new_docs_lbl_similarities = Lbl2Vec_model.predict_new_docs(tagged_docs=data_full_corpus['tagged_docs'][data_full_corpus['data_set_type']=='test'])

# merge DataFrames to compare the predicted and true topic labels
evaluation_test = new_docs_lbl_similarities.merge(data_full_corpus[data_full_corpus['data_set_type']=='test'], left_on='doc_key', right_on='doc_key')
y_true_test = evaluation_test['topic']
y_pred_test = evaluation_test['most_similar_label']

print('F1 score:',f1_score(y_true_test, y_pred_test, average='micro'))

2023-05-02 13:12:34,090 - Lbl2Vec - INFO - Calculate document embeddings
2023-05-02 13:12:37,740 - Lbl2Vec - INFO - Calculate document<->label similarities


F1 score: 0.543357199681782
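
A single score does not show which topics get confused with each other. One way to inspect this is to count the most frequent (true, predicted) pairs among the errors. The sketch below uses toy labels, and `top_confusions` is a hypothetical helper.

```python
from collections import Counter

def top_confusions(y_true, y_pred, n=3):
    """Return the n most frequent (true, predicted) pairs among misclassified items."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(n)

# Toy labels (illustrative only)
y_true = ['Physics', 'Physics', 'Astronomy', 'Chemistry', 'Physics']
y_pred = ['Astronomy', 'Astronomy', 'Astronomy', 'Physics', 'Physics']
confusions = top_confusions(y_true, y_pred)
print(confusions)
```

In this notebook it could be called as `top_confusions(y_true_test, y_pred_test)`.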


In [15]:
evaluation_test.head(3)

Unnamed: 0,doc_key,most_similar_label,highest_similarity_score,ComputerScience,Arts,Astronomy,Biology,Business,Chemistry,Math&Statistics,...,Literature,Pharma,Physics,text,topic_id,data_set_type,tagged_docs,topic,keywords,number_of_keywords
0,5027,Business,0.364207,0.264404,0.269181,0.228555,0.23788,0.364207,0.196428,0.283186,...,0.325066,0.254052,0.195562,3.24 GUS + Full Schedule - 36 2¢ GUS Best suit...,4,test,"([gus, full, schedule, gus, best, suited, for,...",Business,"[management, marketing, finance, accounting, e...",42
1,5028,Politics&Society,0.440532,0.142629,0.209527,0.118389,0.177022,0.169079,0.12816,0.123822,...,0.295996,0.151381,0.135072,Account Dashboard = Courses Calendar History A...,9,test,"([account, dashboard, courses, calendar, histo...",Politics&Society,"[democracy, globalization, human rights, envir...",15
2,5029,Literature,0.452651,0.167024,0.296795,0.406046,0.196308,0.237937,0.228649,0.338373,...,0.452651,0.147582,0.324708,PHYS-103-0¢ ~ ELEMENTA ASTRONOM 100 points | F...,9,test,"([phys, elementa, astronom, points, feb, at, h...",Politics&Society,"[democracy, globalization, human rights, envir...",15


In [16]:
# df columns
evaluation_test.columns

Index(['doc_key', 'most_similar_label', 'highest_similarity_score',
       'ComputerScience', 'Arts', 'Astronomy', 'Biology', 'Business',
       'Chemistry', 'Math&Statistics', 'Philosophy&Ethics', 'Gender',
       'Politics&Society', 'Psychology', 'Literature', 'Pharma', 'Physics',
       'text', 'topic_id', 'data_set_type', 'tagged_docs', 'topic', 'keywords',
       'number_of_keywords'],
      dtype='object')

In [26]:
def display_pred_sample(df_indx):
    """Print the true topic, predicted topic, score, and text of one test document."""
    row = evaluation_test.loc[df_indx]
    print('True topic:', row['topic'])
    print('Predicted topic:', row['most_similar_label'])
    print('Similarity score:', row['highest_similarity_score'])
    print('Text:', row['text'])
    # Topics sorted by similarity score, highest first
    topics_sorted = sorted(((t, row[t]) for t in topics['topic']), key=lambda x: x[1], reverse=True)
    print('Topics sorted by similarity score:')
    print(topics_sorted)
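
Besides the sorted list, the gap between the best and second-best similarity gives a rough sense of how decisive a prediction is. The sketch below uses the scores visible for the third row of `evaluation_test.head(3)`; the columns elided in that output are left out, so the ranking is only illustrative. `rank_labels` is a hypothetical helper.

```python
def rank_labels(similarities, k=3):
    """Return the top-k (label, score) pairs and the margin between the top two."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    margin = ranked[0][1] - ranked[1][1]
    return ranked[:k], margin

# Scores visible in the truncated output for doc_key 5029
similarities = {
    'ComputerScience': 0.167024, 'Arts': 0.296795, 'Astronomy': 0.406046,
    'Biology': 0.196308, 'Business': 0.237937, 'Chemistry': 0.228649,
    'Math&Statistics': 0.338373, 'Literature': 0.452651, 'Pharma': 0.147582,
    'Physics': 0.324708,
}
top3, margin = rank_labels(similarities)
print(top3, round(margin, 6))
```

A small margin, as in this example, suggests the document sits between two label vectors and the prediction should be treated with caution.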


In [83]:
import random
# display one random test document
for i in range(1):
    # randrange excludes the upper bound, so r is always a valid row index
    r = random.randrange(len(evaluation_test))
    print('Sample', r, ' ', 30*'-', sep='')
    display_pred_sample(r)

Sample192 ------------------------------
True topic: Politics&Society
Predicted topic: Politics&Society
Similarity score: 0.6709587
Text: Account Dashboard Calendar Ie People Files Zoom @ virginiacommonwealth.instructure.com Home An BigBlueButto! Media Gallen Adobe Creative Cloud My Mediz Quizze Grade olla 90n /\labe odule ssignmen CL ing 20. ouncemen LATIN AMERICAN GOVT & POLITICS oral ite) ‘ion Download Nicaragua Doubling Down on Dictatorship.pdf (1.25 MB) the vote, the only question on election day would be voter turnout. He- roic work by citizen election monitors Urnas Abiertas revealed an as- tonishing 80 percent abstention rate.*° The opposition claimed a mora! victory, but in practical terms, the Ortega-Murillo regime successfully used overt repression and blatant election fraud to stay in power without provoking another prodemocratic uprising. Nicaragua Doubling Down on Dictatorship.pdf 0x ( < Previous 3) =   Urs )8/fil > Files   @u 5/8 264 > Nicaragua Doubling Down on Dictator