<a href="https://colab.research.google.com/github/jan-kreischer/UZH_ML4NLP/blob/main/Project-06/ex06_tm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Project 6 - Topic Modeling 
using Latent Dirichlet Allocation (LDA)  
and Combined Topic Models (CTM).  
## 1. Setup
### 1.1 Dependencies
Installing all dependencies needed to run the simulations

In [82]:
!pip install contextualized-topic-models==2.2.0



### 1.2 Imports

In [108]:
import re
import random
import os
import urllib
import urllib.request
import gzip
import io
import csv
import random
from collections import defaultdict
from tqdm import tqdm
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

### 1.3 Google Drive
We connect Google Drive in order to access stored data.

In [84]:
# Enable access to files stored in Google Drive
from google.colab import drive
# Leave this like it is
mountpoint = '/content/drive/' 
drive.mount(mountpoint)

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [85]:
# Adapt this path to the folder where your data is stored in google drive
base_path = 'My Drive/UZH_ML4NLP/Projects/Project-06/data' 
data_path = os.path.join(mountpoint, base_path)
# Cd into the directory with the git repo
% cd $data_path

/content/drive/My Drive/UZH_ML4NLP/Projects/Project-06/data


### 1.4 Constants


In [111]:
NUM_LDA_TOPICS = 8 # The number of different topics to identify
NUM_FEATURES = 10000
MAX_DF=0.5
MIN_DF=0.01

In [87]:
# Path to the data files
path_before_1990 = 'titles_before_1990.txt'
path_from_1990_to_2009 = 'titles_from_1990_to_2009.txt'
path_from_2010 = 'titles_from_2010.txt'

### 1.5 Data Acquisition

In [88]:
# Execute the following cell only once to download the data and write it as a file to your google drive. Afterwards, skip this cell or comment it out.
'''
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# to download the data manually or get more information, go to: https://dblp.org/faq/How+can+I+download+the+whole+dblp+dataset.html
url = 'https://dblp.uni-trier.de/xml/dblp.xml.gz'
num_titles = 500000  # the (max)number of titles to load 


def load_gzip_file(url):
    """Download Gzip-file."""
    response = urllib.request.urlopen(url)
    compressed_file = io.BytesIO(response.read())
    decompressed_file = gzip.GzipFile(fileobj=compressed_file)
    return decompressed_file

def extract_titles(input_file, max_num=40000):
    """Extract title and publication year of dblp papers, given as input file.
    
    Divide the papers into 3 time periods. 
    
    Collect max max_num papers per time period.
    """
    pairs_before_1990 = []
    count_before_1990 = 0
    pairs_from_1990_to_2009 = []
    count_from_1990_to_2009 = 0
    pairs_from_2010 = []
    count_from_2010 = 0
    got_title = False
    for line in tqdm(input_file):
        line_str = line.decode('utf-8')
        if got_title: 
            # we have a title and check for the corresponding year
            year_result = re.search(r'<year>(.*)</year>', line_str)
            if year_result:
                # we also have the year and thus save the title-year pair
                year = int(year_result.group(1))
                if year < 1990:
                    pairs_before_1990.append((title, year))
                    count_before_1990 += 1
                elif year < 2010:
                    pairs_from_1990_to_2009.append((title, year))
                    count_from_1990_to_2009 += 1
                else:
                    pairs_from_2010.append((title, year))
                    count_from_2010 += 1
                got_title = False
        else:
            # we have no title and search for title
            result = re.search(r'<title>(.*)</title>', line_str)
            if result:
                title = result.group(1)
                if len(title.split(' ')) < 3:  
                    # only include titles with at least four words
                    continue
                got_title = True
        
        if count_before_1990 >= max_num and count_from_1990_to_2009 >= max_num and count_from_2010 >= max_num:
            return pairs_before_1990, pairs_from_1990_to_2009, pairs_from_2010
    
    return pairs_before_1990, pairs_from_1990_to_2009, pairs_from_2010

def save_data(pairs, file_path):
    with open(file_path, 'w') as fout:
        writer = csv.writer(fout)
        for pair in pairs:
            writer.writerow(pair)

in_file = load_gzip_file(url)
pairs_before_1990, pairs_from_1990_to_2009, pairs_from_2010 = extract_titles(in_file)
save_data(pairs_before_1990, path_before_1990)
save_data(pairs_from_1990_to_2009, path_from_1990_to_2009)
save_data(pairs_from_2010, path_from_2010)
'''

'\nfrom google.colab import drive\ndrive.mount(\'/content/drive\', force_remount=True)\n\n# to download the data manually or get more information, go to: https://dblp.org/faq/How+can+I+download+the+whole+dblp+dataset.html\nurl = \'https://dblp.uni-trier.de/xml/dblp.xml.gz\'\nnum_titles = 500000  # the (max)number of titles to load \n\n\ndef load_gzip_file(url):\n    """Download Gzip-file."""\n    response = urllib.request.urlopen(url)\n    compressed_file = io.BytesIO(response.read())\n    decompressed_file = gzip.GzipFile(fileobj=compressed_file)\n    return decompressed_file\n\ndef extract_titles(input_file, max_num=40000):\n    """Extract title and publication year of dblp papers, given as input file.\n    \n    Divide the papers into 3 time periods. \n    \n    Collect max max_num papers per time period.\n    """\n    pairs_before_1990 = []\n    count_before_1990 = 0\n    pairs_from_1990_to_2009 = []\n    count_from_1990_to_2009 = 0\n    pairs_from_2010 = []\n    count_from_2010 = 

## 2. Topic Modeling
### 2.1 Using Latent Dirichlet Allocation (LDA)

In [109]:
def load_titles(path):
  with open(path) as fin:
    reader = csv.reader(fin)
    titles = [row[0] for row in reader]
  return titles

In [138]:
# Simple text preprocessing by removing 
# all letters which are not in roman alphabet
def preprocess_text(text):
    text = re.sub(r'[^a-zA-Z ]', '', text)
    #text = re.sub(r'\b\w{1,3}\b', ' ', text)
    #text = re.sub(' +', ' ', text)
    text = text.lower()
    return text

In [None]:
# Now we turn the documents (or titles in this case) into a matrix feature representation.
def vectorize_data(titles, max_df=MAX_DF, min_df=MIN_DF, max_features=NUM_FEATURES):
  tf_vectorizer = CountVectorizer(max_df=max_df, min_df=min_df, max_features=max_features, stop_words='english')
  tf = tf_vectorizer.fit_transform(titles)
  tf_feature_names = tf_vectorizer.get_feature_names_out()
  return tf, tf_feature_names

#### 2.1.1 - Before the 1990s:

In [139]:
# Load the titles
titles_before_1990 = load_titles(path_before_1990)
print("{} titles before 1990".format(len(titles_before_1990)))

40000 titles before 1990


In [140]:
# Show some random samples
random.sample(titles_before_1990, 10)

['Computation of e<sup>N</sup> for -&#8734;&lt;N&lt;+&#8734;.',
 'On the Omega(n log n) Lower Bound for Convex Hull and Maximal Vector Determination.',
 'The future of control.',
 'The Average Number of Stable Matchings.',
 'The binary derivative test: noise filter, crypto aid, and random-number seed selector.',
 'The Arithmetic Cube.',
 'Database Machines: An Introduction.',
 'Technical Note - An Importance Ranking for System Components Based upon Cuts.',
 'Formal and informal communication among scientists in sleep research.',
 'Universal Asynchronous Iterative Arrays of Mealy Automata.']

In [141]:
preprocessed_titles_before_1990 = [preprocess_text(title) for title in titles_before_1990]

In [142]:
# Show some preprocessed samples
random.sample(preprocessed_titles_before_1990, 10)

['parameteradaptive control with configuration aids and supervision functions',
 'an improvement in the iterative data flow analysis algorithm',
 'a note on the predicatively definable sets of n n nepeicircvoda',
 'review of associative networks  the representation and use of knowledge in computers by n v findler ed academic press',
 'algorithm  matrix bandwidth and profile reduction f',
 'axioms of symmetry throwing darts at the real number line',
 'generalized handle grammars and their relation to petri nets',
 'calculation of multicategory minimum distance classifier recognition error for binomial measurement distributions',
 'costeffectiveness analysis for strategic decisions',
 'meanvariance approximations to expected logarithmic utility']

In [143]:
tf_01, tf_feature_names_01 = vectorize_data(preprocessed_titles_before_1990, max_df=0.95, min_df=0.01)

In [144]:
lda_01 = LatentDirichletAllocation(n_components=8, max_iter=10, learning_method='online', random_state=42).fit(tf_01)

In [145]:
for topic_idx, topic in enumerate(lda_01.components_):
    print(f'Topic {topic_idx}:', end=' ')
    print(' '.join([tf_feature_names_01[i] for i in topic.argsort()[:-12 - 1:-1]]))

Topic 0: theory problems algorithms simulation decision parallel application solution applications optimal digital control
Topic 1: computer logic model programs digital performance design networks using applications systems simulation
Topic 2: problem programming optimal language digital processing software research solution parallel linear control
Topic 3: data method network models application languages solution processing problem using analysis programming
Topic 4: note information linear functions applications finite technical programming systems time problem decision
Topic 5: algorithm design analysis approach sets performance new using implementation parallel linear digital
Topic 6: systems using parallel performance implementation decision distributed linear control digital design processing
Topic 7: control networks new recognition distributed time pattern optimal systems approach digital linear


Topics:
0. Graph/networks algorithms (seems to be mostly about algorithms that (maybe) operate on graphs/networks)
1. pattern recognition (and maybe robotics)
2. ...

#### 2.1.2 - From 1990 to 2009:

In [146]:
titles_from_1990_to_2009 = load_titles(path_from_1990_to_2009)
print("{} titles from 1990 to 2009".format(len(titles_from_1990_to_2009)))

327307 titles from 1990 to 2009


In [147]:
random.sample(titles_from_1990_to_2009, 10)

['Model Formulation: Generic Data Modeling for Clinical Rrepositories.',
 'Linear transformations of Wiener process that born Wiener process, Brownian bridge or Ornstein-Uhlenbeck process.',
 'Identifying local spatial association in flow data.',
 'Reducing the number of sub-classifiers for pairwise multi-category support vector machines.',
 'Designs in a coset geometry: Delsarte theory revisited.',
 'Phonemic segmentation using the generalised Gamma distribution and small sample Bayesian information criterion.',
 'On call admission control for IP telephony in best effort networks.',
 'Highlights of the Fourth Ka Band Utilization Conference.',
 "Analyse spatiale et cartes anim&eacute;es : construction d'un prototype d'animation des dynamiques d&eacute;mographiques.",
 'The Control Method for the Robot Hand Based on the Fuzzy Theory.']

In [148]:
preprocessed_titles_from_1990_to_2009 = [preprocess_text(title) for title in titles]

In [149]:
random.sample(preprocessed_titles_from_1990_to_2009, 10)

['a performance analysis of mcbased systems',
 'a comparison of order structures for automatic digital computers',
 'algorithm  qshepd quadratic shepard method for bivariate interpolation of scattered data',
 'the brisbane media centre',
 'elimination of cardinality quantifiers',
 'convergent deduction for probabilistic logic',
 'on insensitivities in urban redistricting and facility location',
 'a note on primary and secondary syncategoremata',
 'nonlinear programming counterexamples to two global optimization algorithms',
 'highspeed indirect cryption']

In [150]:
tf_02, tf_feature_names_02 = vectorize_data(titles_from_1990_to_2009, max_df=0.95, min_df=0.01)

In [151]:
lda_02 = LatentDirichletAllocation(n_components=8, max_iter=10, learning_method='online', random_state=42).fit(tf_02)

In [152]:
for topic_idx, topic in enumerate(lda_02.components_):
    print(f'Topic {topic_idx}:', end=' ')
    print(' '.join([tf_feature_names_02[i] for i in topic.argsort()[:-12 - 1:-1]]))

Topic 0: time algorithm linear new network models algorithms efficient high robust equations management
Topic 1: method study problems evaluation case space programming equations finite linear new performance
Topic 2: design approach nonlinear optimal fuzzy modeling computing robust control equations new time
Topic 3: based control model methods computer robust time linear simulation network detection dynamic
Topic 4: using analysis networks performance problem multi dynamic neural wireless mobile recognition network
Topic 5: systems data information multiple digital linear robust management time control nonlinear analysis
Topic 6: adaptive application structure non theory knowledge scheme management robust linear control finite
Topic 7: learning estimation applications order image distributed graphs web software power real development


#### 2.1.3 - From 2010 onwards:

In [153]:
# Load the titles
titles_from_2010 = load_titles(path_from_2010)
print("{} titles from from 2010".format(len(titles_from_2010)))

715820 titles from from 2010


In [154]:
# Show some random samples
random.sample(titles_from_2010, 10)

['A Modified Earley Parser for Huge Natural Language Grammars.',
 'Bidimensional allocation of seats via zero-one matrices with given line sums.',
 'Spectral Leakage-Driven Loopback Scheme for Prediction of Mixed-Signal Circuit Specifications.',
 'Ontology-Based Mobile Communication in Agriculture.',
 'Using machine learning to support healthcare professionals in making preauthorisation decisions.',
 'Performance Modeling and Analysis of Heterogeneous Machine Type Communications.',
 'An NTF-enhanced incremental &#931;&#916; modulator using a SAR quantizer.',
 'Erratum to "A bubble-stabilized least-squares finite element method for steady MHD duct flow problems at high Hartmann numbers" [J. Comput. Physics 228 (2009) 8301-8320].',
 'Energy-Delay Tradeoff in Ultra-Dense Networks Considering BS Sleeping and Cell Association.',
 'Intuitionistic Type-2 Fuzzy Set and Its Properties.']

In [155]:
# Preprocess the titles by removing certain characters
preprocessed_titles_from_2010 = [preprocess_text(title) for title in titles]

In [156]:
# Vectorize
tf_03, tf_feature_names_03 = vectorize_data(preprocessed_titles_from_2010, max_df=0.95, min_df=0.01)

In [157]:
lda_03 = LatentDirichletAllocation(n_components=8, max_iter=10, learning_method='online', random_state=42).fit(tf_03)

In [158]:
for topic_idx, topic in enumerate(lda_03.components_):
    print(f'Topic {topic_idx}:', end=' ')
    print(' '.join([tf_feature_names_03[i] for i in topic.argsort()[:-12 - 1:-1]]))

Topic 0: theory problems algorithms simulation decision parallel application solution applications optimal digital control
Topic 1: computer logic model programs digital performance design networks using applications systems simulation
Topic 2: problem programming optimal language digital processing software research solution parallel linear control
Topic 3: data method network models application languages solution processing problem using analysis programming
Topic 4: note information linear functions applications finite technical programming systems time problem decision
Topic 5: algorithm design analysis approach sets performance new using implementation parallel linear digital
Topic 6: systems using parallel performance implementation decision distributed linear control digital design processing
Topic 7: control networks new recognition distributed time pattern optimal systems approach digital linear


# Combined Topic Models

New method developed by [Bianchi et al. 2021](https://aclanthology.org/2021.acl-short.96/). 

[A 6min presentation of the paper by one of the authors.](https://underline.io/lecture/25716-pre-training-is-a-hot-topic-contextualized-document-embeddings-improve-topic-coherence)

Code: [https://github.com/MilaNLProc/contextualized-topic-models](https://github.com/MilaNLProc/contextualized-topic-models)

Tutorial: [https://colab.research.google.com/drive/1fXJjr_rwqvpp1IdNQ4dxqN4Dp88cxO97?usp=sharing](https://colab.research.google.com/drive/1fXJjr_rwqvpp1IdNQ4dxqN4Dp88cxO97?usp=sharing)

Again, perform topic modelling for the three time periods - this time using the combined topic models (CTMs). 

You can use and adapt the code from the tutorial linked above.

Use the available GPU for faster running times.

In [None]:
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing

num_ctm_topics = 5  # you can also choose a higher number of topics

### Before the 1990s:

### From 1990 to 2009

### From 2010 onwards