[**Blueprints for Text Analysis Using Python**](https://github.com/blueprints-for-text-analytics-python/blueprints-text)  
Jens Albrecht, Sidharth Ramachandran, Christian Winkler


# Chapter 5:<div class='tocSkip'/>

# Feature Engineering and Syntactic Similarity

## Remark<div class='tocSkip'/>

The code in this notebook differs slightly from the printed book.

Several layout and formatting commands, such as `figsize` to control figure size or the subplot commands, are removed in the book.

All of this is done to simplify the code in the book and put the focus on the important parts instead of formatting.

## Setup<div class='tocSkip'/>

Set directory locations. If working on Google Colab: copy files and install required libraries.

In [2]:
import sys, os
ON_COLAB = 'google.colab' in sys.modules

if ON_COLAB:
    GIT_ROOT = 'https://github.com/blueprints-for-text-analytics-python/blueprints-text/raw/master'
    os.system(f'wget {GIT_ROOT}/ch05/setup.py')

%run -i setup.py

You are working on Google Colab.
Files will be downloaded to "/content".
Downloading required files ...
!wget -P /content https://github.com/blueprints-for-text-analytics-python/blueprints-text/raw/master/settings.py
!wget -P /content/data/abcnews https://github.com/blueprints-for-text-analytics-python/blueprints-text/raw/master/data/abcnews/abcnews-date-text.csv.gz
!wget -P /content/ch05 https://github.com/blueprints-for-text-analytics-python/blueprints-text/raw/master/ch05/requirements.txt

Additional setup ...
!pip install -r ch05/requirements.txt
!python -m spacy download en


## Load Python Settings<div class="tocSkip"/>

Common imports, defaults for formatting in Matplotlib, Pandas etc.

In [3]:
%run "$BASE_DIR/settings.py"

%reload_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'png'

# Data preparation

In [4]:
sentences = ["It was the best of times",
             "it was the worst of times",
             "it was the age of wisdom",
             "it was the age of foolishness"]

tokenized_sentences = [[t for t in sentence.split()] for sentence in sentences]

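# Flatten the nested list to collect every word
# For more on this syntax, see ChatGPT: https://chatgpt.com/share/675ba64c-73b0-8000-987f-cbc3f2600369
all_words = [
    word                                   # the word produced by the two nested loops
    for sentence in (tokenized_sentences)  # outer loop: iterate over each sentence
    for word in (sentence)                 # inner loop: iterate over each word in the sentence
]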

# vocabulary = set([w for s in tokenized_sentences for w in s])
vocabulary = set(all_words)

import pandas as pd
[[w, i] for i,w in enumerate(vocabulary)]

[['was', 0],
 ['foolishness', 1],
 ['times', 2],
 ['it', 3],
 ['of', 4],
 ['age', 5],
 ['wisdom', 6],
 ['worst', 7],
 ['It', 8],
 ['best', 9],
 ['the', 10]]
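
The column order in the output above comes from iterating over a Python `set`, which is not guaranteed to be the same from one interpreter run to the next. A minimal sketch, assuming a sorted list is an acceptable vocabulary, that fixes the order; `sorted_vocabulary` and `word_to_index` are illustrative names, not part of the notebook.

```python
sentences = ["It was the best of times",
             "it was the worst of times",
             "it was the age of wisdom",
             "it was the age of foolishness"]

# Collect the unique tokens, then sort them to fix the column order
sorted_vocabulary = sorted({w for s in sentences for w in s.split()})

# Map each word to its column position
word_to_index = {w: i for i, w in enumerate(sorted_vocabulary)}

print(sorted_vocabulary)
print(word_to_index["It"], word_to_index["it"])
```

Uppercase letters sort before lowercase ones, so `It` and `it` remain two separate entries in this case-sensitive vocabulary.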

# One-hot by hand

In [5]:
def onehot_encode(tokenized_sentence):
    return [1 if w in tokenized_sentence else 0 for w in vocabulary]

onehot = [onehot_encode(tokenized_sentence) for tokenized_sentence in tokenized_sentences]

for (sentence, oh) in zip(sentences, onehot):
    print("%s: %s" % (oh, sentence))

[1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1]: It was the best of times
[1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1]: it was the worst of times
[1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1]: it was the age of wisdom
[1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1]: it was the age of foolishness


In [6]:
pd.DataFrame(onehot, columns=list(vocabulary))

Unnamed: 0,was,foolishness,times,it,of,age,wisdom,worst,It,best,the
0,1,0,1,0,1,0,0,0,1,1,1
1,1,0,1,1,1,0,0,1,0,0,1
2,1,0,0,1,1,1,1,0,0,0,1
3,1,1,0,1,1,1,0,0,0,0,1


In [7]:
# Count the words shared by the one-hot vectors of the first two sentences
sim = [onehot[0][i] & onehot[1][i] for i in range(0, len(vocabulary))]
print(sim)
sum(sim)

[1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1]


4

In [8]:
import numpy as np
np.dot(onehot[0], onehot[1])

4

In [9]:
print(onehot[0])
print(onehot[1])
print(onehot)

[1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1]
[1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1]
[[1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1], [1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1], [1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1], [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1]]


In [10]:
np.dot(onehot, onehot[1])

array([4, 6, 4, 4])

## Out of vocabulary

In [11]:
onehot_encode("the age of wisdom is the best of times".split())

[0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1]

In [12]:
# onehot_encode("John likes to watch movies. Mary likes movies too.".split())
onehot_encode("John likes".split())

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
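
The all-zero vector above carries no hint that both words were unknown. One possible way to make this visible is to return the unknown tokens alongside the vector. This is a sketch only: `onehot_encode_with_oov` is a hypothetical helper, and the vocabulary is passed in explicitly and sorted so the example is self-contained.

```python
def onehot_encode_with_oov(tokens, vocabulary):
    """Return the one-hot vector plus the tokens missing from the vocabulary."""
    token_set = set(tokens)
    vocab_set = set(vocabulary)
    vector = [1 if w in token_set else 0 for w in vocabulary]
    unknown = [t for t in tokens if t not in vocab_set]
    return vector, unknown

# Same words as the notebook's vocabulary, in sorted order
vocabulary = sorted(["It", "it", "was", "the", "best", "worst", "of",
                     "times", "age", "wisdom", "foolishness"])

vector, unknown = onehot_encode_with_oov("John likes the age".split(), vocabulary)
print(vector)
print(unknown)
```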

## Document-term matrix

In [13]:
onehot

[[1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1],
 [1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1],
 [1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1],
 [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1]]

## Similarities

In [14]:
import numpy as np
np.dot(onehot, np.transpose(onehot))

array([[6, 4, 3, 3],
       [4, 6, 4, 4],
       [3, 4, 6, 5],
       [3, 4, 5, 6]])

# scikit-learn one-hot vectorization

In [15]:
from sklearn.preprocessing import MultiLabelBinarizer
lb = MultiLabelBinarizer()
lb.fit([vocabulary])
lb.transform(tokenized_sentences)

array([[1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0],
       [0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1],
       [0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0],
       [0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0]])
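
The columns in the array above do not line up with the hand-built vectors because `MultiLabelBinarizer` stores its labels in sorted order in the `classes_` attribute. A short sketch on the first two sentences shows how to read the column labels; the variable names are illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

tokenized = [s.split() for s in ["It was the best of times",
                                 "it was the worst of times"]]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(tokenized)

# classes_ holds the label for each column, in sorted order
print(list(mlb.classes_))
print(encoded)
```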

# CountVectorizer

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [17]:
more_sentences = sentences + ["John likes to watch movies. Mary likes movies too.",
                              "Mary also likes to watch football games."]
pd.DataFrame(more_sentences)

Unnamed: 0,0
0,It was the best of times
1,it was the worst of times
2,it was the age of wisdom
3,it was the age of foolishness
4,John likes to watch movies. Mary likes movies too.
5,Mary also likes to watch football games.


In [18]:
cv.fit(more_sentences)

In [19]:
print(cv.get_feature_names_out())

['age' 'also' 'best' 'foolishness' 'football' 'games' 'it' 'john' 'likes'
 'mary' 'movies' 'of' 'the' 'times' 'to' 'too' 'was' 'watch' 'wisdom'
 'worst']


In [20]:
dt = cv.transform(more_sentences)

In [21]:
dt

<6x20 sparse matrix of type '<class 'numpy.int64'>'
	with 38 stored elements in Compressed Sparse Row format>

In [22]:
pd.DataFrame(dt.toarray(), columns=cv.get_feature_names_out())

Unnamed: 0,age,also,best,foolishness,football,games,it,john,likes,mary,movies,of,the,times,to,too,was,watch,wisdom,worst
0,0,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,1,0,0,0
1,0,0,0,0,0,0,1,0,0,0,0,1,1,1,0,0,1,0,0,1
2,1,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,1,0,1,0
3,1,0,0,1,0,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0
4,0,0,0,0,0,0,0,1,2,1,2,0,0,0,1,1,0,1,0,0
5,0,1,0,0,1,1,0,0,1,1,0,0,0,0,1,0,0,1,0,0


In [23]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(dt[0], dt[1])

array([[0.83333333]])
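
The value 0.8333 can be checked by hand. `CountVectorizer` lowercases by default, so `It` and `it` collapse into one feature and the first two sentences share five of their six words. The case-sensitive one-hot vectors from earlier share only four. The sketch below assumes each word occurs once per sentence, so counts and one-hot values coincide; `cosine` and `onehot` are illustrative helpers.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def onehot(sentence, vocab):
    tokens = set(sentence.split())
    return [1 if w in tokens else 0 for w in vocab]

a, b = "It was the best of times", "it was the worst of times"

# Case-sensitive vocabulary: "It" and "it" are different features
vocab_cased = sorted(set(a.split()) | set(b.split()))
cos_cased = cosine(onehot(a, vocab_cased), onehot(b, vocab_cased))

# Lowercased vocabulary, as CountVectorizer uses by default
vocab_lower = sorted(set(a.lower().split()) | set(b.lower().split()))
cos_lower = cosine(onehot(a.lower(), vocab_lower), onehot(b.lower(), vocab_lower))

print(cos_cased, cos_lower)
```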

In [24]:
len(more_sentences)

6

In [25]:
pd.DataFrame(cosine_similarity(dt, dt))

Unnamed: 0,0,1,2,3,4,5
0,1.0,0.83,0.67,0.67,0.0,0.0
1,0.83,1.0,0.67,0.67,0.0,0.0
2,0.67,0.67,1.0,0.83,0.0,0.0
3,0.67,0.67,0.83,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.52
5,0.0,0.0,0.0,0.0,0.52,1.0


# TF/IDF

In [26]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer()
tfidf_dt = tfidf.fit_transform(dt)

In [27]:
pd.DataFrame(tfidf_dt.toarray(), columns=cv.get_feature_names_out())

Unnamed: 0,age,also,best,foolishness,football,games,it,john,likes,mary,movies,of,the,times,to,too,was,watch,wisdom,worst
0,0.0,0.0,0.57,0.0,0.0,0.0,0.34,0.0,0.0,0.0,0.0,0.34,0.34,0.47,0.0,0.0,0.34,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.34,0.0,0.0,0.0,0.0,0.34,0.34,0.47,0.0,0.0,0.34,0.0,0.0,0.57
2,0.47,0.0,0.0,0.0,0.0,0.0,0.34,0.0,0.0,0.0,0.0,0.34,0.34,0.0,0.0,0.0,0.34,0.0,0.57,0.0
3,0.47,0.0,0.0,0.57,0.0,0.0,0.34,0.0,0.0,0.0,0.0,0.34,0.34,0.0,0.0,0.0,0.34,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.31,0.5,0.25,0.61,0.0,0.0,0.0,0.25,0.31,0.0,0.25,0.0,0.0
5,0.0,0.42,0.0,0.0,0.42,0.42,0.0,0.0,0.34,0.34,0.0,0.0,0.0,0.0,0.34,0.0,0.0,0.34,0.0,0.0


In [28]:
pd.DataFrame(cosine_similarity(tfidf_dt, tfidf_dt))

Unnamed: 0,0,1,2,3,4,5
0,1.0,0.68,0.46,0.46,0.0,0.0
1,0.68,1.0,0.46,0.46,0.0,0.0
2,0.46,0.46,1.0,0.68,0.0,0.0
3,0.46,0.46,0.68,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.43
5,0.0,0.0,0.0,0.0,0.43,1.0
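
To see where the TF-IDF weights come from, the matrix can be rebuilt with NumPy alone. The sketch assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), for which the inverse document frequency is $\ln\frac{1+n}{1+\mathrm{df}(t)} + 1$, and it approximates the default tokenizer with a regular expression. `tfidf_manual` and the other names are illustrative.

```python
import re
import numpy as np

docs = ["It was the best of times",
        "it was the worst of times",
        "it was the age of wisdom",
        "it was the age of foolishness",
        "John likes to watch movies. Mary likes movies too.",
        "Mary also likes to watch football games."]

# Lowercase and keep tokens of two or more word characters
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

# Raw term counts (document-term matrix)
counts = np.array([[doc.count(t) for t in vocab] for doc in tokenized], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                 # document frequency per term
idf = np.log((1 + n_docs) / (1 + df)) + 1     # smoothed IDF
weighted = counts * idf

# L2-normalize each row
tfidf_manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(round(tfidf_manual[0, vocab.index("best")], 2))
```

Rare words such as `best` receive a higher weight than words like `it` that occur in four of the six sentences.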


In [29]:
headlines = pd.read_csv(ABCNEWS_FILE, parse_dates=["publish_date"])
headlines.head()

Unnamed: 0,publish_date,headline_text
0,2003-02-19,aba decides against community broadcasting licence
1,2003-02-19,act fire witnesses must be aware of defamation
2,2003-02-19,a g calls for infrastructure protection summit
3,2003-02-19,air nz staff in aust strike for pay rise
4,2003-02-19,air nz strike to affect australian travellers


In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
dt = tfidf.fit_transform(headlines["headline_text"])

In [31]:
dt

<1103663x95878 sparse matrix of type '<class 'numpy.float64'>'
	with 7001357 stored elements in Compressed Sparse Row format>

In [32]:
dt.data.nbytes

56010856
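
The number above covers only the stored values. A rough estimate of the full footprint, and of what a dense array would need, follows from the matrix shape and the number of stored elements reported earlier. The 4-byte index size is an assumption (32-bit indices); the actual index type can differ.

```python
n_rows, n_cols, nnz = 1103663, 95878, 7001357

data_bytes = nnz * 8               # float64 values, equals dt.data.nbytes
indices_bytes = nnz * 4            # column indices, assuming int32
indptr_bytes = (n_rows + 1) * 4    # row pointers, assuming int32
csr_bytes = data_bytes + indices_bytes + indptr_bytes

dense_bytes = n_rows * n_cols * 8  # every cell stored as float64

print(f"CSR:   {csr_bytes / 1e6:.1f} MB")
print(f"dense: {dense_bytes / 1e9:.1f} GB")
```

Under these assumptions the sparse representation needs tens of megabytes, while a dense array of the same shape would need hundreds of gigabytes.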

In [33]:
%%time
cosine_similarity(dt[0:10000], dt[0:10000])

CPU times: user 299 ms, sys: 895 ms, total: 1.19 s
Wall time: 1.18 s


array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.16913596,
        0.16792138],
       [0.        , 0.        , 0.        , ..., 0.16913596, 1.        ,
        0.33258708],
       [0.        , 0.        , 0.        , ..., 0.16792138, 0.33258708,
        1.        ]])

## Stopwords

In [34]:
from spacy.lang.en.stop_words import STOP_WORDS as stopwords
print(len(stopwords))
tfidf = TfidfVectorizer(stop_words=list(stopwords))
dt = tfidf.fit_transform(headlines["headline_text"])
dt

326


<1103663x95600 sparse matrix of type '<class 'numpy.float64'>'
	with 5644186 stored elements in Compressed Sparse Row format>

## min_df

In [35]:
tfidf = TfidfVectorizer(stop_words=list(stopwords), min_df=2)
dt = tfidf.fit_transform(headlines["headline_text"])
dt

<1103663x58527 sparse matrix of type '<class 'numpy.float64'>'
	with 5607113 stored elements in Compressed Sparse Row format>

In [36]:
tfidf = TfidfVectorizer(stop_words=list(stopwords), min_df=.0001)
dt = tfidf.fit_transform(headlines["headline_text"])

In [37]:
print(dt.shape)
print(dt.data.nbytes)


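# Do not run the code below: converting to a dense array uses a very large amount of memory
# dense_dt = dt.toarray()
# Create a pandas DataFrame to inspect the data more easily
# df = pd.DataFrame(dense_dt, columns=tfidf.get_feature_names_out())
# Print the DataFrame
# print(df)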

(1103663, 6772)
38531048


## max_df

In [38]:
tfidf = TfidfVectorizer(stop_words=list(stopwords), max_df=0.1)
dt = tfidf.fit_transform(headlines["headline_text"])


In [39]:
tfidf = TfidfVectorizer(max_df=0.1)
dt = tfidf.fit_transform(headlines["headline_text"])
dt

<1103663x95875 sparse matrix of type '<class 'numpy.float64'>'
	with 6532752 stored elements in Compressed Sparse Row format>

## n-grams

In [41]:
tfidf = TfidfVectorizer(stop_words=list(stopwords), ngram_range=(1,2), min_df=2)
dt = tfidf.fit_transform(headlines["headline_text"])
print(dt.shape)
print(dt.data.nbytes)
tfidf = TfidfVectorizer(stop_words=list(stopwords), ngram_range=(1,3), min_df=2)
dt = tfidf.fit_transform(headlines["headline_text"])
print(dt.shape)
print(dt.data.nbytes)

(1103663, 559961)
67325400
(1103663, 747988)
72360104


## Lemmas

In [47]:
# from tqdm.auto import tqdm
# import spacy
# nlp = spacy.load("en_core_web_sm")
# nouns_adjectives_verbs = ["NOUN", "PROPN", "ADJ", "ADV", "VERB"]
# for i, row in tqdm(headlines.iterrows(), total=len(headlines)):
#     doc = nlp(str(row["headline_text"]))
#     headlines.at[i, "lemmas"] = " ".join([token.lemma_ for token in doc])
#     headlines.at[i, "nav"] = " ".join([token.lemma_ for token in doc if token.pos_ in nouns_adjectives_verbs])

from tqdm.auto import tqdm
import spacy
import pandas as pd

# Load only the pipeline components that are needed
nlp = spacy.load("en_core_web_sm", disable=['ner', 'parser'])
nouns_adjectives_verbs = ["NOUN", "PROPN", "ADJ", "ADV", "VERB"]

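# Create the empty columns up front
headlines['lemmas'] = ''
headlines['nav'] = ''
# Process only the first 10,000 rows; copy so later assignments do not target a view
headlines = headlines[:10000].copy()
# Batch-processing function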
def process_batch(texts):
    docs = list(nlp.pipe(texts))

    lemmas = []
    navs = []
    for doc in docs:
        lemmas.append(" ".join([token.lemma_ for token in doc]))
        navs.append(" ".join([token.lemma_ for token in doc if token.pos_ in nouns_adjectives_verbs]))

    return lemmas, navs

# Process the headlines in batches
batch_size = 1000
for i in tqdm(range(0, len(headlines), batch_size)):
    batch_texts = headlines['headline_text'].iloc[i:i+batch_size].astype(str)
    lemmas_batch, navs_batch = process_batch(batch_texts)

    headlines.iloc[i:i+batch_size, headlines.columns.get_loc('lemmas')] = lemmas_batch
    headlines.iloc[i:i+batch_size, headlines.columns.get_loc('nav')] = navs_batch



In [48]:
headlines.head()

Unnamed: 0,publish_date,headline_text,lemmas,nav
0,2003-02-19,aba decides against community broadcasting licence,aba decide against community broadcasting licence,aba decide community broadcasting licence
1,2003-02-19,act fire witnesses must be aware of defamation,act fire witness must be aware of defamation,act fire witness aware defamation
2,2003-02-19,a g calls for infrastructure protection summit,a g call for infrastructure protection summit,g call infrastructure protection summit
3,2003-02-19,air nz staff in aust strike for pay rise,air nz staff in aust strike for pay rise,air nz staff aust strike pay rise
4,2003-02-19,air nz strike to affect australian travellers,air nz strike to affect australian traveller,air nz strike affect australian traveller


In [50]:
tfidf = TfidfVectorizer(stop_words=list(stopwords))
dt = tfidf.fit_transform(headlines["lemmas"].map(str))
dt

<10000x7786 sparse matrix of type '<class 'numpy.float64'>'
	with 49279 stored elements in Compressed Sparse Row format>

In [51]:
tfidf = TfidfVectorizer(stop_words=list(stopwords))
dt = tfidf.fit_transform(headlines["nav"].map(str))
dt

<10000x7612 sparse matrix of type '<class 'numpy.float64'>'
	with 48516 stored elements in Compressed Sparse Row format>

## Remove the 10,000 most common English words

In [53]:
top_10000 = pd.read_csv("https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english.txt", header=None)
tfidf = TfidfVectorizer(stop_words=list(set(top_10000.iloc[:,0].values)))
dt = tfidf.fit_transform(headlines["nav"].map(str))
dt

<10000x4197 sparse matrix of type '<class 'numpy.float64'>'
	with 11472 stored elements in Compressed Sparse Row format>

In [54]:
tfidf = TfidfVectorizer(ngram_range=(1,2), stop_words=list(set(top_10000.iloc[:,0].values)), min_df=2)
dt = tfidf.fit_transform(headlines["nav"].map(str))
dt

<10000x1954 sparse matrix of type '<class 'numpy.float64'>'
	with 9531 stored elements in Compressed Sparse Row format>

## Finding the document most similar to a made-up document

In [55]:
tfidf = TfidfVectorizer(stop_words=list(stopwords), min_df=2)
dt = tfidf.fit_transform(headlines["lemmas"].map(str))
dt

<10000x4256 sparse matrix of type '<class 'numpy.float64'>'
	with 45749 stored elements in Compressed Sparse Row format>

In [56]:
made_up = tfidf.transform(["australia and new zealand discuss optimal apple size"])

In [57]:
sim = cosine_similarity(made_up, dt)

In [58]:
sim[0]

array([0., 0., 0., ..., 0., 0., 0.])

In [59]:
headlines.iloc[np.argsort(sim[0])[::-1][0:5]][["publish_date", "lemmas"]]

Unnamed: 0,publish_date,lemmas
7172,2003-03-25,new zealand tourist murder in nairobi
4919,2003-03-14,khan capture four as new zealand falter
5115,2003-03-15,new zealand cling to world cup lifeline
4596,2003-03-13,anderson appoint coach of new zealand
4597,2003-03-13,anderson appoint new zealand coach


# Finding the most similar documents

In [60]:
# there are "test" headlines in the corpus
stopwords.add("test")
tfidf = TfidfVectorizer(stop_words=list(stopwords), ngram_range=(1,2), min_df=2, norm='l2')
dt = tfidf.fit_transform(headlines["headline_text"])

### Timing Cosine Similarity

In [61]:
%%time
cosine_similarity(dt[0:10000], dt[0:10000], dense_output=False)

CPU times: user 42.4 ms, sys: 0 ns, total: 42.4 ms
Wall time: 46.1 ms


<10000x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 1817469 stored elements in Compressed Sparse Row format>

In [62]:
%%time
r = cosine_similarity(dt[0:10000], dt[0:10000])
r[r > 0.9999] = 0
print(np.argmax(r))

71177118
CPU times: user 415 ms, sys: 1.38 s, total: 1.79 s
Wall time: 1.84 s


In [63]:
%%time
r = cosine_similarity(dt[0:10000], dt[0:10000], dense_output=False)
r[r > 0.9999] = 0
print(np.argmax(r))

71177118
CPU times: user 503 ms, sys: 130 ms, total: 633 ms
Wall time: 536 ms


### Timing Dot-Product

In [64]:
%%time
r = np.dot(dt[0:10000], np.transpose(dt[0:10000]))
r[r > 0.9999] = 0
print(np.argmax(r))

71177118
CPU times: user 457 ms, sys: 33.9 ms, total: 491 ms
Wall time: 490 ms
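
The dot product finds the same index as `cosine_similarity` because the vectorizer was created with `norm='l2'`: every row has length 1, so the denominator of the cosine formula is 1. A small check on random data, not on the headline matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 8))

# Scale every row to unit length (what norm='l2' does)
X_normalized = X / np.linalg.norm(X, axis=1, keepdims=True)

# Plain dot product of the normalized rows
dot = X_normalized @ X_normalized.T

# Cosine similarity computed from the unnormalized rows
norms = np.linalg.norm(X, axis=1)
cosine = (X @ X.T) / np.outer(norms, norms)

print(np.allclose(dot, cosine))
```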


## Batch

In [65]:
%%time
batch = 10000
max_sim = 0.0
max_a = None
max_b = None
for a in range(0, dt.shape[0], batch):
    for b in range(0, a+batch, batch):
        print(a, b)
        #r = np.dot(dt[a:a+batch], np.transpose(dt[b:b+batch]))
        r = cosine_similarity(dt[a:a+batch], dt[b:b+batch], dense_output=False)
        # eliminate identical vectors by setting their similarity
        # to 0 so that they cannot be picked as the maximum
        r[r > 0.9999] = 0
        sim = r.max()
        if sim > max_sim:
            # argmax returns a single value which we have to
            # map to the two dimensions
            (max_a, max_b) = np.unravel_index(np.argmax(r), r.shape)
            # adjust offsets in corpus (this is a submatrix)
            max_a += a
            max_b += b
            max_sim = sim

0 0
CPU times: user 534 ms, sys: 27 ms, total: 561 ms
Wall time: 596 ms


In [66]:
print(max_a, max_b)

7117 7118


In [67]:
print(max_sim)

0.969104954669429


In [69]:
pd.set_option('max_colwidth', None)
headlines.iloc[[max_a, max_b]][["publish_date", "headline_text"]]

Unnamed: 0,publish_date,headline_text
7117,2003-03-25,gulf veterans more likely to have health problems
7118,2003-03-25,gulf veterans more likely to report health problems
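
The flat index 71177118 printed in the timing cells and the pair (7117, 7118) found here are the same position. `np.argmax` numbers the cells of the matrix row by row, and `np.unravel_index` converts that number back into a row and a column:

```python
import numpy as np

flat_index = 71177118
shape = (10000, 10000)

# Convert the flat position into (row, column)
row, col = np.unravel_index(flat_index, shape)
print(row, col)

# Equivalent arithmetic: integer division and remainder by the row length
print(divmod(flat_index, shape[1]))
```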


# Finding most related words

In [72]:
tfidf_word = TfidfVectorizer(stop_words=list(stopwords), min_df=100)
dt_word = tfidf_word.fit_transform(headlines["headline_text"])

In [73]:
r = cosine_similarity(dt_word.T, dt_word.T)
np.fill_diagonal(r, 0)

In [74]:
voc = tfidf_word.get_feature_names_out()
size = r.shape[0] # r is a square matrix (words x words)
for index in np.argsort(r.flatten())[::-1][0:40]:
    a = index // size  # row of the flattened position
    b = index % size   # column of the flattened position
    if a > b:  # avoid repetitions
        print('"%s" related to "%s"' % (voc[a], voc[b]))

"world" related to "cup"
"war" related to "anti"
"war" related to "iraq"
"water" related to "rain"
"man" related to "court"
"un" related to "iraq"
"says" related to "iraq"
"nsw" related to "govt"
"troops" related to "iraq"
"police" related to "man"
"qld" related to "govt"
"police" related to "death"
"water" related to "plan"
"rain" related to "qld"
"un" related to "iraqi"
"un" related to "new"
"war" related to "un"
"un" related to "council"
"says" related to "iraqi"
"war" related to "says"
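
The loop above reads 40 entries but prints 20 pairs, because every pair occurs twice in the symmetric matrix. One alternative, sketched here, is to look only at the upper triangle. The small matrix and its values are made up for illustration, and `top_pairs` is a hypothetical helper.

```python
import numpy as np

def top_pairs(sim, names, top_n=3):
    """Return the top_n most similar pairs from a symmetric similarity matrix."""
    rows, cols = np.triu_indices(sim.shape[0], k=1)  # strictly above the diagonal
    values = sim[rows, cols]
    order = np.argsort(values)[::-1][:top_n]
    return [(names[rows[i]], names[cols[i]], float(values[i])) for i in order]

# Made-up similarities, not taken from the headline corpus
names = ["world", "cup", "war", "iraq"]
sim = np.array([[1.0, 0.9, 0.1, 0.2],
                [0.9, 1.0, 0.0, 0.1],
                [0.1, 0.0, 1.0, 0.7],
                [0.2, 0.1, 0.7, 1.0]])

pairs = top_pairs(sim, names, top_n=2)
print(pairs)
```

Because the diagonal is excluded by `k=1`, no separate `np.fill_diagonal` call is needed in this variant.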
