# Topic Modelling on `data_schoolofinf`

Using the scraped metadata and the downloaded PDFs, we build the topic models.


0. Using `gensim`, create the corpus and vocabulary.
1. Use `LDAtuning` (in R) to find the best number of topics.
2. Create the topic model using `gensim`.
3. Visualise the results using `lda2vis`.

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import pickle as pkl
import pandas as pd
from numpy.random import RandomState
rng = RandomState(93748573)
import os

DATA_DIR = '../../data/data_schoolofinf/'

In [2]:
from gensim.corpora import Dictionary

2018-01-29 11:43:55,588 : INFO : 'pattern' package not found; tag filters are not available for English


### Combining all the tokens together:

In [3]:
df_combined_toks = pd.read_pickle(os.path.join(DATA_DIR,'toks', 'toks.combined.pkl'))

In [4]:
df_combined_toks.head(2)

| pub_id | year | toks_metada | toks_pdf2txt |
|---|---|---|---|
| 400818dc-63af-4a26-80c5-906f98e1f8ab | 1989 | [ballooning, stability, analysis, jet, hmode, ... | |
| 18b1a861-afef-4fff-bc80-d02e05be18c4 | 2013 | [query, processing, data, integration, chapter... | |


In [5]:
df_combined_toks['toks'] = df_combined_toks.apply(
    lambda row: list(row.toks_metada) + list(row.toks_pdf2txt), axis=1)
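
The concatenation above relies on both columns holding list-like values in every row. The following is a minimal sketch of a more defensive merge, under the assumption (not confirmed by the notebook) that a publication without a PDF could show up as `None` or `NaN`. `combine_tokens` is a hypothetical helper, not something defined in this notebook.

```python
import math

def combine_tokens(meta_toks, pdf_toks):
    """Concatenate two token sequences, treating missing values as empty."""
    def as_list(value):
        # None or a float NaN (how pandas marks a missing cell) becomes []
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return []
        return list(value)
    return as_list(meta_toks) + as_list(pdf_toks)

# A publication with metadata tokens but no PDF tokens
combined = combine_tokens(["query", "processing"], None)
```

If it were needed, such a helper could replace the lambda passed to `apply`, e.g. `lambda row: combine_tokens(row.toks_metada, row.toks_pdf2txt)`.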

## Using publications from 1997-2017

In [5]:
df_combined_toks = df_combined_toks.drop(
    df_combined_toks[(df_combined_toks.year < 1997) | (df_combined_toks.year > 2017)].index)

In [6]:
df_combined_toks.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8028 entries, 18b1a861-afef-4fff-bc80-d02e05be18c4 to b2920a27-5293-4f4a-8874-4a0ea804d91a
Data columns (total 3 columns):
year            8028 non-null int64
toks_metada     8028 non-null object
toks_pdf2txt    8028 non-null object
dtypes: int64(1), object(2)
memory usage: 250.9+ KB


### Model 1: metadata + PDF

In [16]:
docs = df_combined_toks.toks.tolist()

combined_toks_dict = Dictionary(docs)

# Remove tokens that are too frequent (in more than 50% of documents)
# or too rare (in fewer than 10 documents)
combined_toks_dict.filter_extremes(no_below=10, no_above=.5)
combined_toks_dict.compactify()

2018-01-29 10:01:38,942 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-01-29 10:01:56,301 : INFO : built Dictionary(636191 unique tokens: ['query', 'processing', 'data', 'integration', 'chapter']...) from 8028 documents (total 25623592 corpus positions)
2018-01-29 10:01:57,438 : INFO : discarding 581457 tokens: [('data', 4317), ('approach', 4177), ('based', 4816), ('access', 4192), ('model', 4714), ('system', 5123), ('paper', 4628), ('using', 4803), ('supervectors', 8), ('use', 4212)]...
2018-01-29 10:01:57,439 : INFO : keeping 54734 tokens which were in no less than 10 and no more than 4014 (=50.0%) documents
2018-01-29 10:01:57,665 : INFO : resulting dictionary: Dictionary(54734 unique tokens: ['query', 'processing', 'integration', 'chapter', 'illustrate']...)
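
As the log shows, the filter works on document frequency: a token is kept if it occurs in at least 10 documents and in at most 4014 (50% of 8028) documents. The pure-Python sketch below illustrates that rule on a toy corpus. It is not gensim's implementation, it ignores the `keep_n` cap of `filter_extremes`, and the name `filter_by_document_frequency` is made up for this example.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below=10, no_above=0.5):
    """Return the set of tokens whose document frequency is within bounds."""
    doc_freq = Counter()
    for doc in docs:
        # set() ensures each token counts once per document
        doc_freq.update(set(doc))
    # Upper bound expressed as an absolute number of documents
    max_docs = int(no_above * len(docs))
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ["data", "query"],
    ["data", "model"],
    ["data", "query", "query"],
    ["data"],
]
# "data" is in 4 of 4 documents (too frequent), "model" in 1 (too rare)
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
```

Note that repeated occurrences within one document (such as `query` in the third toy document) do not raise the count, which is why very common words like `data` and `model` are discarded from the real corpus regardless of how often they appear per paper.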


In [20]:
# Create a bag-of-words (BoW) representation of each publication:
df_combined_toks['bow'] = df_combined_toks['toks'].apply(combined_toks_dict.doc2bow)

# Generate a corpus based on the tokens, which we will be using later
corpus = df_combined_toks.bow.tolist()
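
For readers unfamiliar with the bag-of-words format, `doc2bow` turns a token list into a sorted list of `(token_id, count)` pairs and, by default, ignores tokens that are not in the dictionary. The sketch below mimics that behaviour in plain Python. The `token2id` mapping and the `to_bow` helper are illustrative assumptions, not the ids gensim assigned in this notebook.

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Convert a token list into sorted (token_id, count) pairs."""
    # Tokens missing from the vocabulary (e.g. filtered out earlier) are skipped
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

# Hypothetical vocabulary; "data" is absent, as it would be after filtering
token2id = {"query": 0, "processing": 1, "integration": 2}
bow = to_bow(["query", "data", "query", "integration"], token2id)
```

Word order is lost in this representation, which is what LDA expects as input.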

In [32]:
# Save dictionary:
combined_toks_dict.save(os.path.join(DATA_DIR, 'corpora','dictionary.all'))

2018-01-29 10:16:18,510 : INFO : saving Dictionary object under ../../data/data_schoolofinf/corpora/dictionary.all, separately None
2018-01-29 10:16:18,602 : INFO : saved ../../data/data_schoolofinf/corpora/dictionary.all


In [41]:
tmp = df_combined_toks.toks
no_pub = 0

with open(os.path.join(DATA_DIR,'toks','toks2word.all'), 'w') as f:
    for pub in tmp:
        if len(pub):
            out = " ".join(pub)
            f.write(out + "\n")
        else:
            # Count publications that have no tokens at all
            no_pub += 1
print(no_pub)

0
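
The resulting file holds one publication per line, with tokens separated by single spaces. The sketch below shows the same write logic as a function together with a reader that restores the token lists, using a temporary directory instead of `DATA_DIR`. The helper names and the example file name are hypothetical.

```python
import os
import tempfile

def write_docs(path, docs):
    """Write one space-joined document per line; return how many were empty."""
    skipped = 0
    with open(path, "w", encoding="utf-8") as f:
        for tokens in docs:
            if tokens:
                f.write(" ".join(tokens) + "\n")
            else:
                skipped += 1
    return skipped

def read_docs(path):
    """Read the file back into a list of token lists."""
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toks2word.example")
    skipped = write_docs(path, [["query", "processing"], [], ["topic", "model"]])
    restored = read_docs(path)
```

Because empty publications are skipped rather than written as blank lines, line numbers in the file only match DataFrame row positions when the skipped count is 0, as it is here.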


### Model 2: metadata only

In [10]:
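# Take an explicit copy so that adding columns later does not trigger SettingWithCopyWarning
df_metadata = df_combined_toks[['year', 'toks_metada']].copy()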

In [11]:
docs = df_metadata.toks_metada.tolist()

toks_dict = Dictionary(docs)

# Remove tokens that are too frequent (in more than 50% of documents)
# or too rare (in fewer than 10 documents)
toks_dict.filter_extremes(no_below=10, no_above=.5)
toks_dict.compactify()

2018-01-29 11:24:10,197 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-01-29 11:24:11,490 : INFO : built Dictionary(26756 unique tokens: ['query', 'processing', 'data', 'integration', 'chapter']...) from 8028 documents (total 719861 corpus positions)
2018-01-29 11:24:11,568 : INFO : discarding 20534 tokens: [('chase', 9), ('sv', 2), ('ubm', 2), ('supervectors', 2), ('synthesizer', 9), ('eer', 4), ('wsj', 7), ('rps', 2), ('openairinterface', 3), ('academia', 6)]...
2018-01-29 11:24:11,569 : INFO : keeping 6222 tokens which were in no less than 10 and no more than 4014 (=50.0%) documents
2018-01-29 11:24:11,583 : INFO : resulting dictionary: Dictionary(6222 unique tokens: ['query', 'processing', 'data', 'integration', 'chapter']...)


In [12]:
# Create a bag-of-words (BoW) representation of each publication:
df_metadata['bow'] = df_metadata['toks_metada'].apply(toks_dict.doc2bow)

# Generate a corpus based on the tokens, which we will be using later
corpus = df_metadata.bow.tolist()

In [13]:
# Save dictionary:
toks_dict.save(os.path.join(DATA_DIR, 'corpora','dictionary.meta'))

2018-01-29 11:24:54,315 : INFO : saving Dictionary object under ../../data/data_schoolofinf/corpora/dictionary.meta, separately None
2018-01-29 11:24:54,320 : INFO : saved ../../data/data_schoolofinf/corpora/dictionary.meta


In [14]:
tmp = df_metadata.toks_metada
no_pub = 0

with open(os.path.join(DATA_DIR,'toks','toks2word.meta'), 'w') as f:
    for pub in tmp:
        if len(pub):
            out = " ".join(pub)
            f.write(out + "\n")
        else:
            # Count publications that have no tokens at all
            no_pub += 1
print(no_pub)

0


## Restricting to publications from 2012-2017

In [6]:
df_combined_toks = df_combined_toks.drop(
    df_combined_toks[(df_combined_toks.year < 2012) | (df_combined_toks.year > 2017)].index)

In [7]:
print(list(df_combined_toks.year.unique()))

[2013, 2012, 2014, 2016, 2015, 2017]


In [8]:
df_combined_toks.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3620 entries, 18b1a861-afef-4fff-bc80-d02e05be18c4 to b2920a27-5293-4f4a-8874-4a0ea804d91a
Data columns (total 4 columns):
year            3620 non-null int64
toks_metada     3620 non-null object
toks_pdf2txt    3620 non-null object
toks            3620 non-null object
dtypes: int64(1), object(3)
memory usage: 141.4+ KB


### Model 3: metadata + PDF 

In [9]:
docs = df_combined_toks.toks.tolist()

combined_toks_dict = Dictionary(docs)

# Remove tokens that are too frequent (in more than 50% of documents)
# or too rare (in fewer than 10 documents)
combined_toks_dict.filter_extremes(no_below=10, no_above=.5)
combined_toks_dict.compactify()

2018-01-29 11:44:45,810 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-01-29 11:45:01,056 : INFO : built Dictionary(376785 unique tokens: ['query', 'processing', 'data', 'integration', 'chapter']...) from 3620 documents (total 13084978 corpus positions)
2018-01-29 11:45:01,974 : INFO : discarding 338843 tokens: [('data', 2339), ('part', 1843), ('problem', 1981), ('approach', 2225), ('based', 2528), ('access', 2347), ('second', 1869), ('model', 2398), ('system', 2553), ('informatik_germany', 8)]...
2018-01-29 11:45:01,975 : INFO : keeping 37942 tokens which were in no less than 10 and no more than 1810 (=50.0%) documents
2018-01-29 11:45:02,142 : INFO : resulting dictionary: Dictionary(37942 unique tokens: ['query', 'processing', 'integration', 'chapter', 'illustrate']...)


In [12]:
# Create a bag-of-words (BoW) representation of each publication:
df_combined_toks['bow'] = df_combined_toks['toks'].apply(combined_toks_dict.doc2bow)

# Generate a corpus based on the tokens, which we will be using later
corpus = df_combined_toks.bow.tolist()

In [14]:
tmp = df_combined_toks.toks
no_pub = 0

with open(os.path.join(DATA_DIR,'toks','toks2word.less.all'), 'w') as f:
    for pub in tmp:
        if len(pub):
            out = " ".join(pub)
            f.write(out + "\n")
        else:
            # Count publications that have no tokens at all
            no_pub += 1
print(no_pub)

0
