### Learning about ACL 2020 from analysis of the paper abstracts

This notebook uses a data set of abstracts that I scraped from the [ACL 2020 accepted papers](https://www.aclweb.org/anthology/events/acl-2020/#2020-acl-main).

Could I use NLP techniques to get the big picture of what the conference was about?

I tried 3 techniques:
* topic modeling
* TF-IDF keyword extraction
* multi-word extraction with NLTK collocations()



In [1]:
# read in the data

with open('abs.csv', 'r') as f:
    docs = f.read().lower().splitlines()
    
len(docs)

1224

### Topic Modeling

I'll use Gensim for the topic modeling. 

In [4]:
from gensim import models, corpora
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [18]:
NUM_TOPICS = 10

In [6]:
# preprocess docs
wnl = WordNetLemmatizer()

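def preprocess(docs, stop_words):
    """
    Tokenize, lemmatize, and remove stopwords and non-alpha tokens.
    param: docs - a list of raw text documents
    param: stop_words - an iterable of words to drop after lemmatization
    return: a list of token lists, one per document
    """
    stop_words = set(stop_words)  # set membership tests are faster than list lookups

    processed_docs = []
    for doc in docs:
        # keep alphabetic tokens only, lemmatized (WordNet treats them as nouns by default)
        tokens = [wnl.lemmatize(t) for t in word_tokenize(doc.lower()) if t.isalpha()]
        tokens = [t for t in tokens if t not in stop_words]
        processed_docs.append(tokens)

    return processed_docs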

In [7]:
stopword_list = stopwords.words('english')
stopword_list += ['propose', 'approach', 'paper', 'show', 'result', 'system', 'also', 'two', 'different']
preprocessed_docs = preprocess(docs, stopword_list)

In [10]:
# look at one processed doc
print(" ".join(preprocessed_docs[5])) # join the list of words

conversation achieved remarkable research attention recently however generating informative response multiple relevant knowledge without losing fluency coherence still one main challenge address issue proposes method us recurrent knowledge interaction among response decoding step incorporate appropriate knowledge furthermore introduce knowledge copy mechanism using pointer network copy word external knowledge according knowledge attention distribution joint neural conversation model integrates recurrent knowledge copy kic performs well generating informative response experiment demonstrate model fewer parameter yield significant improvement competitive baseline datasets average bleu ab duconv average bleu ab knowledge format textual amp amp structured language english amp amp chinese


In [11]:
# the dictionary maps words to id numbers
dictionary = corpora.Dictionary(preprocessed_docs)

In [12]:
# represent the doc tokens in numeric form
corpus = [dictionary.doc2bow(tokens) for tokens in preprocessed_docs]

In [19]:
# build an LDA model
lda_model = models.LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)

In [20]:
for i in range(NUM_TOPICS):
    top_words = [t[0] for t in lda_model.show_topic(i, 9)]
    print("\nTopic", str(i), ':', top_words)


Topic 0 : ['language', 'model', 'task', 'translation', 'method', 'data', 'sentence', 'using', 'corpus']

Topic 1 : ['model', 'task', 'language', 'method', 'sentence', 'training', 'performance', 'work', 'representation']

Topic 2 : ['model', 'language', 'word', 'task', 'method', 'data', 'learning', 'work', 'sentence']

Topic 3 : ['model', 'language', 'task', 'performance', 'data', 'set', 'method', 'training', 'neural']

Topic 4 : ['model', 'language', 'task', 'text', 'word', 'method', 'data', 'performance', 'learning']

Topic 5 : ['model', 'neural', 'translation', 'task', 'sentence', 'information', 'language', 'data', 'performance']

Topic 6 : ['model', 'task', 'method', 'language', 'neural', 'information', 'representation', 'new', 'framework']

Topic 7 : ['model', 'language', 'task', 'data', 'method', 'information', 'text', 'framework', 'generation']

Topic 8 : ['model', 'task', 'method', 'text', 'data', 'performance', 'word', 'learning', 'using']

Topic 9 : ['model', 'language', 'dat

### Topic Modeling Results

There are many overlapping words in the topics. The words *model* and *language* occurred in all topics even when I reduced the number of topics to 4. Finding the optimal number of topics is one of the most difficult aspects of topic modeling. [This paper](https://www.researchgate.net/publication/283947121_A_heuristic_approach_to_determine_an_appropriate_number_of_topics_in_topic_modeling) discusses an approach using the perplexity metric to find an optimal number of topics.

I would say that the results here are disappointing because the topics are not well separated. In part, I can blame the data: the topics suggest that most papers were about language models trained on a corpus, and that many involved the machine translation task.

### TF-IDF

Next, I try to extract keywords using TF-IDF in scikit-learn.

In [21]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.feature_extraction.text import CountVectorizer

In [22]:
docs2 = [" ".join(doc) for doc in preprocessed_docs]

In [23]:
# create word counts for the docs 
count_vectorizer = CountVectorizer()

word_counts = count_vectorizer.fit_transform(docs2)

In [25]:
# use sklearn's tfidf transformer

tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True) 
tfidf_transformer.fit(word_counts)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

In [30]:
# extract idf values
# get_feature_names_out() requires scikit-learn >= 1.0
df_idf = pd.DataFrame(
    tfidf_transformer.idf_,
    index=count_vectorizer.get_feature_names_out(),
    columns=["idf_weights"],
)

# sort ascending
results = df_idf.sort_values(by=['idf_weights'])

print(results.head(n=20))

             idf_weights
model           1.347967
task            1.666565
language        1.813587
method          2.060963
data            2.099429
performance     2.101883
neural          2.205334
work            2.293585
training        2.336145
using           2.351794
experiment      2.377355
text            2.403586
ha              2.420337
however         2.444269
learning        2.461722
present         2.465249
information     2.497568
new             2.501224
based           2.569433
word            2.573362


Notice that *model* and *language* have low idf weights because they occur in a large share of the documents.
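The link between an idf weight and the number of documents containing the word can be made explicit. With `smooth_idf=True`, scikit-learn computes idf as `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The sketch below inverts that formula for two of the weights printed above; the helper names `smooth_idf` and `doc_freq_from_idf` are mine.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed idf as computed by scikit-learn when smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def doc_freq_from_idf(n_docs, idf):
    """Invert the formula: how many documents must contain the term?"""
    return (1 + n_docs) / math.exp(idf - 1) - 1

n_docs = 1224  # len(docs) from the first cell
idf_weights = {"model": 1.347967, "language": 1.813587}  # copied from the table above

doc_freqs = {w: doc_freq_from_idf(n_docs, idf) for w, idf in idf_weights.items()}
for word, df in doc_freqs.items():
    print(word, round(df), f"{df / n_docs:.0%}")
```

By this calculation *model* appears in roughly 864 of the 1,224 abstracts (about 71%) and *language* in roughly 542 (about 44%), which explains why neither helps to tell topics apart.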

In [38]:
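# count matrix (built from the raw docs; the vectorizer itself was fit on docs2,
# so raw tokens that are missing from its vocabulary are ignored here)
count_vectors = count_vectorizer.transform(docs)

# tf-idf scores
tf_idf_vectors = tfidf_transformer.transform(count_vectors)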

In [42]:
# for the first doc, find the important words

# get_feature_names_out() requires scikit-learn >= 1.0
feature_names = count_vectorizer.get_feature_names_out()

# get tfidf vector for first document
first_document_vector = tf_idf_vectors[0]

# print the scores
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"])
results = df.sort_values(by=["tfidf"], ascending=False)
print(results[:20])

                 tfidf
directed      0.529406
speech        0.524567
child         0.290064
adult         0.232672
trained       0.152596
comparable    0.137928
acoustically  0.116336
linguistic    0.112919
phonemic      0.109977
acoustic      0.101965
differs       0.101965
eventually    0.099106
prosodic      0.099106
acquisition   0.096688
repetition    0.094594
see           0.092746
looking       0.089599
partially     0.089599
explores      0.085817
least         0.085817


### TF-IDF Results

The high-scoring TF-IDF words for the sample document above seem like good 'topic' words for that one document.

A weakness of both topic modeling and TF-IDF is that they look at single words. The following code sections use NLTK to find multi-word expressions.

### NLTK collocations



In [43]:
import nltk
from nltk import word_tokenize

tokens = word_tokenize(" ".join(docs2))
text_obj = nltk.Text(tokens)

In [44]:
text_obj.collocations()

natural language; machine translation; neural network; language
processing; question answering; named entity; shared task; state art;
reading comprehension; social medium; publicly available; training
data; neural machine; available http; extensive experiment; word
embeddings; entity recognition; sentiment analysis; language model;
attention mechanism


This gives us a good sense of what people are talking about in the abstracts. 
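`collocations()` only prints its result. To get the bigrams back as data, together with their scores, NLTK's `BigramCollocationFinder` can be used directly. The sketch below is one way to do that on a toy token list; the helper `top_bigrams`, the likelihood-ratio measure, and the minimum frequency of 2 are my choices for illustration.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

def top_bigrams(tokens, n=20, min_freq=2):
    """Return up to n ((word1, word2), score) pairs ranked by likelihood ratio."""
    finder = BigramCollocationFinder.from_words(tokens)
    finder.apply_freq_filter(min_freq)  # drop bigrams seen fewer than min_freq times
    scored = finder.score_ngrams(BigramAssocMeasures.likelihood_ratio)
    return scored[:n]

# toy token stream standing in for the tokens built from docs2
toy_tokens = (
    "neural machine translation improves machine translation quality "
    "while neural network model help machine translation"
).split()

bigrams = top_bigrams(toy_tokens)
for pair, score in bigrams:
    print(" ".join(pair), round(score, 2))
```

On the real data, passing the `tokens` list from the cell above instead of `toy_tokens` should give a ranked list that can be filtered or counted further.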

The lesson here is that in highly technical text with multi-word expressions, approaches that deal with individual tokens may not get optimal results. 