# Hands-on Text Processing

DS 7008

Raf Alvarado

12 March 2024


# Overview

1. Collect
2. Learn
3. Parse
4. Annotate
5. Vectorize
6. Model

# Tools

- Git and GitHub
- Rivanna
- Gensim

# Core Concepts

The core concepts of ``gensim`` are:

1. **document**: some text.
2. **corpus**: a collection of documents.
3. **vector**: a mathematically convenient representation of a document.
4. **model**: an algorithm for transforming vectors from one representation to another.

# Example

Here is an example corpus:

In [77]:
corpus_raw = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

Here's how we convert these data into a corpus object.

Create a set of stop words: frequent words that carry little meaning on their own.

In [34]:
stoplist = set('for a of the and to in'.split(' '))

Tokenize each document: split it on whitespace, lowercase each word, and drop stop words. The result is a list of token lists.

In [79]:
corpus = []
for doc in corpus_raw:
    doclist = []
    for word in doc.split():
        word = word.lower()
        if word not in stoplist:
            doclist.append(word)
    corpus.append(doclist)

In [80]:
corpus

[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
 ['graph', 'minors', 'survey']]

Count word frequencies.

In [88]:
from collections import defaultdict

frequency = defaultdict(int)
for doc in corpus:
    for token in doc:
        frequency[token] += 1

Keep only the words that appear more than once across the whole corpus.

In [91]:
corpus_processed = []
for doc in corpus:
    doclist = []
    for token in doc:
        if frequency[token] > 1:
            doclist.append(token)
    corpus_processed.append(doclist)

In [92]:
corpus_processed

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

# Create a Dictionary

In [93]:
from gensim import corpora

In [95]:
dictionary = corpora.Dictionary(corpus_processed)

In [96]:
print(dictionary)

Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...>


In [97]:
print(dictionary.token2id)

{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}
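The id assignment shown above can be reproduced in plain Python. This is a minimal sketch that mimics the printed `token2id` mapping, assuming ids are handed out document by document and alphabetically within each document. It is not gensim's implementation, and `token2id_sketch` is a hypothetical name.

```python
# The filtered corpus from the previous section.
corpus_processed = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey'],
]

# Assumption: new tokens get the next free integer id, visiting documents
# in order and the distinct tokens of each document alphabetically.
token2id_sketch = {}
for doc in corpus_processed:
    for token in sorted(set(doc)):
        if token not in token2id_sketch:
            token2id_sketch[token] = len(token2id_sketch)

print(token2id_sketch)
```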


# Vectorize

We create a bag-of-words representation of the corpus. Each document becomes a list of `(token_id, count)` pairs.

In [99]:
bow = []
for text in corpus_processed:
    bow.append(dictionary.doc2bow(text))

In [100]:
bow

[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]
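To make the `(token_id, count)` pairs concrete, here is a minimal plain-Python sketch of the bag-of-words conversion. The helper `doc2bow_sketch` is hypothetical and only illustrates the idea; it assumes that tokens missing from the dictionary are silently dropped, which matches how the queries later in this notebook behave.

```python
from collections import Counter

# Token ids copied from the dictionary printed above.
token2id = {'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
            'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8,
            'trees': 9, 'graph': 10, 'minors': 11}

def doc2bow_sketch(tokens, token2id):
    """Count known tokens and return (token_id, count) pairs sorted by id."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# 'system' occurs twice in this document, so id 5 gets a count of 2.
print(doc2bow_sketch(['system', 'human', 'system', 'eps'], token2id))
```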

# Significant Words

In [113]:
from gensim import models

In [114]:
tfidf = models.TfidfModel(bow)

In [127]:
tfidf[dictionary.doc2bow(['system'])]

[(5, 1.0)]

In [116]:
words = "system minors".lower().split()
print(tfidf[dictionary.doc2bow(words)])

[(5, 0.5898341626740045), (11, 0.8075244024440723)]
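The weights above can be checked by hand. The sketch below assumes the model's default weighting is raw term count times `log2(N / df)`, followed by scaling each vector to unit length, where `N` is the number of documents and `df` is the number of documents containing the term. The function `tfidf_sketch` is hypothetical and is meant as an illustration, not a replacement for `TfidfModel`.

```python
import math
from collections import Counter

# Bag-of-words corpus copied from the Vectorize section.
bow = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]

num_docs = len(bow)
# Document frequency: in how many documents does each token id occur?
doc_freq = Counter(token_id for doc in bow for token_id, _ in doc)
# Assumed inverse document frequency: log base 2 of N / df.
idf = {token_id: math.log2(num_docs / df) for token_id, df in doc_freq.items()}

def tfidf_sketch(doc_bow):
    """Weight counts by idf, then scale the vector to unit (L2) length."""
    weights = [(t, count * idf[t]) for t, count in doc_bow if t in idf]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(t, w / norm) for t, w in weights] if norm else []

# 'system' is id 5 (in 3 documents), 'minors' is id 11 (in 2 documents).
print(tfidf_sketch([(5, 1), (11, 1)]))
```

Because "minors" occurs in fewer documents than "system", it receives the larger weight.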


In [117]:
from gensim import similarities

# Index the TF-IDF vectors of the corpus; num_features is the vocabulary size.
index = similarities.SparseMatrixSimilarity(tfidf[bow], num_features=len(dictionary))

In [118]:
index

<gensim.similarities.docsim.SparseMatrixSimilarity at 0x7fae8f419d10>

In [121]:
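# 'engineering' was filtered out of the dictionary earlier (it occurs only once),
# so doc2bow ignores it and only 'system' contributes to the query.
query_document = 'system engineering'.split()
query_bow = dictionary.doc2bow(query_document)
# Similarity of the query to each of the nine documents.
sims = index[tfidf[query_bow]]
print(list(enumerate(sims)))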

[(0, 0.0), (1, 0.32448703), (2, 0.41707572), (3, 0.7184812), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]


In [74]:
for document_number, score in sorted(enumerate(sims), key=lambda x: x[1], reverse=True):
    print(document_number, score)

3 0.7184812
2 0.41707572
1 0.32448703
0 0.0
4 0.0
5 0.0
6 0.0
7 0.0
8 0.0
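Each score above is a cosine similarity between two sparse TF-IDF vectors. The following is a minimal sketch of that computation on `(token_id, weight)` lists; `cosine_sketch` is a hypothetical helper, and the two vectors are the TF-IDF outputs printed earlier for "system" and for "system minors".

```python
import math

def cosine_sketch(u, v):
    """Cosine similarity of two sparse vectors given as (token_id, weight) lists."""
    u_map, v_map = dict(u), dict(v)
    # Only token ids present in both vectors contribute to the dot product.
    dot = sum(w * v_map[t] for t, w in u_map.items() if t in v_map)
    norm_u = math.sqrt(sum(w * w for w in u_map.values()))
    norm_v = math.sqrt(sum(w * w for w in v_map.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

system_vec = [(5, 1.0)]
system_minors_vec = [(5, 0.5898341626740045), (11, 0.8075244024440723)]

print(cosine_sketch(system_vec, system_minors_vec))
```

Documents that share no token with the query have a dot product of zero, which is why most documents score 0.0 in the ranking.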
