<a href="https://colab.research.google.com/github/lmrhody/femethodsS23/blob/main/Week10notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 10 Notebook

Name: 

Date:

Class: 

Notes: (Anything you'd like to share with me before I read your notebook)

This week's notebook is based on Chapter 5 from 
[**Blueprints for Text Analytics Using Python**](https://github.com/blueprints-for-text-analytics-python/blueprints-text) by
Jens Albrecht, Sidharth Ramachandran, Christian Winkler. I've added cells to this week's notebook to help explain 

# Vectors, Features, and Similarity

## Review: Preparing Data for Operationalization
- Tokenizing: breaking documents into units (words, characters, sentences) called tokens.
  - Multiple tokens can be grouped together into n-grams.
- Stemming:
  - Words = root + prefix, suffix
  - Root = where the core meaning is held
  - Stemming reduces words that have similar meanings but multiple forms (tense, plurals, gerunds, etc.) to a common root.
- Dimension Reduction
  - Simplify the number of observations
  - Stopword reduction (removing words that don't convey "meaning")
  - Remove words that are too frequent or too unique
  - Remove other terms that are likely to confuse the counting.
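
The review steps above can be sketched in a few lines of plain Python. This is only an illustration: the stop word list is a tiny made-up set, and `crude_stem` is a hypothetical toy suffix stripper, not a real stemming algorithm such as Porter's.

```python
import re

# Tiny illustrative stop word list (an assumption, not a standard list)
STOP_WORDS = {"it", "was", "the", "of"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(token):
    """Toy suffix stripper; real stemmers apply many more rules."""
    for suffix in ("ness", "ing", "s"):
        # only strip when a reasonably long root remains
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = tokenize("It was the age of foolishness")
content_tokens = [t for t in tokens if t not in STOP_WORDS]  # stopword reduction
stems = [crude_stem(t) for t in content_tokens]              # stemming

print(tokens)          # ['it', 'was', 'the', 'age', 'of', 'foolishness']
print(content_tokens)  # ['age', 'foolishness']
print(stems)           # ['age', 'foolish']
```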


## Vectors
Vectors are mathematical objects that encode length and direction; they represent a position, or a change in position, within a space. A vector is written as a 1-dimensional array of numbers (its components). When represented geometrically, vectors represent coordinates in an n-dimensional space, where n is the number of dimensions (units being compared).
- In machine learning, text is represented as an array of numbers
- A vector is a natural extension of a single real number: a tuple of numbers (a pair, in two dimensions)
- Vectors are useful because they are numerical representations that are spatial and therefore have norms and distances
- We use spatial properties to measure similarity
- Measuring similarity is a fundamental principle for analyzing texts with computers
- Occupying the same, or related space, represents similarities between vectors
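
As a small illustration of norms and distances, the sketch below treats two made-up 2-dimensional vectors as points in a plane. The helper names `norm` and `distance` and the example values are assumptions chosen for illustration; they are not part of the book's code.

```python
import math

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def distance(u, v):
    """Euclidean distance between two vectors of the same dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u = [3, 4]
v = [6, 8]   # same direction as u, twice as long

print(norm(u))         # 5.0
print(norm(v))         # 10.0
print(distance(u, v))  # 5.0
```

The two vectors point in the same direction but have different lengths, which is why similarity measures based on direction and measures based on distance can disagree.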


# Feature Engineering and Syntactic Similarity

## Remark<div class='tocSkip'/>

The code in this notebook differs slightly from the printed book. 

Several layout and formatting commands, like `figsize` to control figure size or subplot commands are removed in the book.

All of this is done to simplify the code in the book and put the focus on the important parts instead of formatting.

## Setup<div class='tocSkip'/>

Set directory locations. If working on Google Colab: copy files and install required libraries.

In [1]:
import sys, os
ON_COLAB = 'google.colab' in sys.modules

if ON_COLAB:
    GIT_ROOT = 'https://github.com/blueprints-for-text-analytics-python/blueprints-text/raw/master'
    os.system(f'wget {GIT_ROOT}/ch05/setup.py')

%run -i setup.py

You are working on Google Colab.
Files will be downloaded to "/content".
Downloading required files ...
!wget -P /content https://github.com/blueprints-for-text-analytics-python/blueprints-text/raw/master/settings.py
!wget -P /content/data/abcnews https://github.com/blueprints-for-text-analytics-python/blueprints-text/raw/master/data/abcnews/abcnews-date-text.csv.gz
!wget -P /content/ch05 https://github.com/blueprints-for-text-analytics-python/blueprints-text/raw/master/ch05/requirements.txt

Additional setup ...
!pip install -r ch05/requirements.txt
!python -m spacy download en


## Load Python Settings<div class="tocSkip"/>

Common imports, defaults for formatting in Matplotlib, Pandas etc.

In [2]:
%run "$BASE_DIR/settings.py"

%reload_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'png'

# Data preparation

In [9]:
sentences = ["It was the best of times", 
             "it was the worst of times", 
             "it was the age of wisdom", 
             "it was the age of foolishness"]

tokenized_sentences = [[t for t in sentence.split()] for sentence in sentences]

vocabulary = list(dict.fromkeys([w for s in tokenized_sentences for w in s]))
# vocabulary = set([w for s in tokenized_sentences for w in s])

import pandas as pd
[[w, i] for i,w in enumerate(vocabulary)]

[['It', 0],
 ['was', 1],
 ['the', 2],
 ['best', 3],
 ['of', 4],
 ['times', 5],
 ['it', 6],
 ['worst', 7],
 ['age', 8],
 ['wisdom', 9],
 ['foolishness', 10]]

# One-hot by hand

In [10]:
def onehot_encode(tokenized_sentence):
    return [1 if w in tokenized_sentence else 0 for w in vocabulary]

onehot = [onehot_encode(tokenized_sentence) for tokenized_sentence in tokenized_sentences]

for (sentence, oh) in zip(sentences, onehot):
    print("%s: %s" % (oh, sentence))


[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]: It was the best of times
[0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0]: it was the worst of times
[0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0]: it was the age of wisdom
[0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1]: it was the age of foolishness


In [5]:
onehot_encode("the age of wisdom is the best of times".split())

[0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0]

In [6]:
onehot_encode("John likes to watch movies. Mary likes to watch movies too.".split())

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

## The Document-Term Matrix
The document-term matrix is the vector representation of all documents and is the most basic building block for nearly all machine learning tasks we will do this semester. 

In [11]:
import pandas as pd
pd.DataFrame(onehot, columns=vocabulary)

Unnamed: 0,It,was,the,best,of,times,it,worst,age,wisdom,foolishness
0,1,1,1,1,1,1,0,0,0,0,0
1,0,1,1,0,1,1,1,1,0,0,0
2,0,1,1,0,1,0,1,0,1,1,0
3,0,1,1,0,1,0,1,0,1,0,1


### Calculating similarities
Calculating the similarities between documents works by calculating the number of common 1s at the corresponding positions. In one-hot encoding, this is an extremely fast operation, as it can be calculated on the bit level by ANDing the vectors and counting the number of 1s in the resulting vectors. 

In [12]:
# calculate the similarities between the first 2 sentences
# the result is the number of 1s that are shared between the 2 sentences. 
sim = [onehot[0][i] & onehot[1][i] for i in range(0, len(vocabulary))]
sum(sim)

4
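
To make the "bit level" idea concrete, here is a minimal sketch that packs each one-hot vector into a single Python integer and ANDs the two integers. The helper `to_bits` is a hypothetical name introduced for this example; the two vectors are copied from the one-hot output above.

```python
def to_bits(vector):
    """Pack a list of 0/1 values into one integer, first element as the highest bit."""
    bits = 0
    for value in vector:
        bits = (bits << 1) | value
    return bits

first = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # "It was the best of times"
second = [0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0]  # "it was the worst of times"

common = to_bits(first) & to_bits(second)   # one AND over all positions at once
shared_ones = bin(common).count("1")        # count the 1 bits in the result
print(shared_ones)  # 4
```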

## Scalar Product or Dot Product
The scalar product (or dot product) of two vectors is calculated by multiplying their corresponding components and adding up the products. Because a product is 1 only if both factors are 1, the dot product of two one-hot vectors counts the number of 1s they have in common. 

In [13]:
import numpy as np
np.dot(onehot[0], onehot[1])

4

## Out of vocabulary

In [14]:
onehot_encode("the age of wisdom is the best of times".split())

[0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0]

In [15]:
onehot_encode("John likes to watch movies. Mary likes movies too.".split())

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

## document term matrix

In [16]:
onehot

[[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
 [0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0],
 [0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0],
 [0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1]]

## similarities

In [17]:
import numpy as np
np.dot(onehot, np.transpose(onehot))

array([[6, 4, 3, 3],
       [4, 6, 4, 4],
       [3, 4, 6, 5],
       [3, 4, 5, 6]])
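
The matrix product above can be unpacked into plain Python to show what each cell means: entry (i, j) is the dot product of document i with document j. This is a sketch for illustration only; the helper `dot` and the variable `onehot_rows` are names chosen for this example.

```python
onehot_rows = [
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1],
]

def dot(u, v):
    """Multiply corresponding components and add up the products."""
    return sum(a * b for a, b in zip(u, v))

# compare every document with every other document
similarity = [[dot(u, v) for v in onehot_rows] for u in onehot_rows]
for row in similarity:
    print(row)
```

The diagonal holds the number of distinct words in each sentence (6), and the matrix is symmetric because comparing document i with j is the same as comparing j with i.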

## scikit-learn one-hot vectorization

We did one-hot vectorization by hand, but you can also use a tool like scikit-learn to do it. scikit-learn's `OneHotEncoder` is designed for the specific purpose of encoding categorical features, and that's not what we're doing here. We just want to see the encoding in action, so we use the `MultiLabelBinarizer` (because we want to make it as complicated as possible...). Note that `MultiLabelBinarizer` sorts the vocabulary, so the columns appear in a different order than in our hand-built matrix. 


In [18]:
from sklearn.preprocessing import MultiLabelBinarizer
lb = MultiLabelBinarizer()
lb.fit([vocabulary])
lb.transform(tokenized_sentences)

array([[1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0],
       [0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1],
       [0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0],
       [0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0]])

### Note: 
A vectorizer has two separate methods: `fit` and `transform`. `fit` *learns* the vocabulary. `transform` *converts* the documents into vectors. 
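
The fit/transform split can be illustrated with a small stand-in class. `TinyBinarizer` is a hypothetical class written for this notebook, not scikit-learn code; it assumes the vocabulary is kept in sorted order and that words not seen during `fit` are simply ignored.

```python
class TinyBinarizer:
    """Hypothetical minimal vectorizer that illustrates fit and transform."""

    def fit(self, token_lists):
        # learn the vocabulary: every distinct token, in sorted order
        self.classes_ = sorted({tok for tokens in token_lists for tok in tokens})
        return self

    def transform(self, token_lists):
        # convert each document into a 0/1 vector over the learned vocabulary
        rows = []
        for tokens in token_lists:
            present = set(tokens)
            rows.append([1 if term in present else 0 for term in self.classes_])
        return rows

train = [s.split() for s in ["It was the best of times",
                             "it was the worst of times",
                             "it was the age of wisdom",
                             "it was the age of foolishness"]]

binarizer = TinyBinarizer().fit(train)
vectors = binarizer.transform(train)
unseen = binarizer.transform(["the age of reason".split()])

print(binarizer.classes_)
print(vectors[0])  # [1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0]
print(unseen[0])   # [0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0]
```

Because `transform` only looks up words in the vocabulary learned by `fit`, the word "reason" leaves no trace in the last vector.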

# Text Statistics
Counting words is the simplest approach to analyzing language and text. While language is fluid and dynamic, there are some predictable elements. For example, we can count on the fact that there are words that are frequently found across all documents and all language domains (even poetry). The most common words used in English documents are: the, of, to, and, in, is, for, The, that, and said. Conversely, rare words are themselves very common: words that appear only once in a corpus, called *hapax legomena*, can make up about half of the distinct words in its vocabulary. 

Two "laws" describe this phenomenon in text analysis: [*Zipf's law*](https://en.wikipedia.org/wiki/Zipf%27s_law) and [*Heaps' Law*](https://en.wikipedia.org/wiki/Heaps%27_law#:~:text=Heaps'%20law%20means%20that%20as,the%20distinct%20terms%20are%20drawn). 
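
A quick way to see both patterns is to count words in a small text. The sketch below uses the four example sentences from earlier (lowercased and joined); a corpus this small only hints at the effect, so treat it as an illustration rather than evidence for either law.

```python
from collections import Counter

text = ("it was the best of times it was the worst of times "
        "it was the age of wisdom it was the age of foolishness")
tokens = text.split()

counts = Counter(tokens)
hapaxes = sorted(word for word, count in counts.items() if count == 1)

# vocabulary size after each token is read (the quantity Heaps' law describes)
seen = set()
growth = []
for token in tokens:
    seen.add(token)
    growth.append(len(seen))

print(counts.most_common(4))  # a few words account for most of the tokens
print(hapaxes)                # ['best', 'foolishness', 'wisdom', 'worst']
print(len(tokens), growth[-1])
```

Here 4 of the 10 distinct words occur only once, while the four most frequent words cover 16 of the 24 tokens.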


# Bag of Words Models
Bag-of-words representations create vectors for documents that also preserve the frequency of the words that appear in each document as a feature. The frequency of each word is used as part of the weighting of the model. Models like Latent Dirichlet Allocation (LDA) explicitly require a BoW approach. 
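
Before turning to a library, here is a minimal hand-rolled bag of words for a single sentence. The regular expression mirrors the default token pattern of scikit-learn's `CountVectorizer` (words of two or more characters); the variable names are assumptions made for this sketch.

```python
import re
from collections import Counter

sentence = "John likes to watch movies. Mary likes movies too."

# lowercase, then keep tokens of at least two word characters
tokens = re.findall(r"\b\w\w+\b", sentence.lower())
counts = Counter(tokens)

vocabulary = sorted(counts)                   # one column per distinct word
bow_vector = [counts[word] for word in vocabulary]

print(vocabulary)  # ['john', 'likes', 'mary', 'movies', 'to', 'too', 'watch']
print(bow_vector)  # [1, 2, 1, 2, 1, 1, 1]
```

Unlike a one-hot vector, the entries for "likes" and "movies" are 2, because the vector records how often each word occurs.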


# CountVectorizer
Creating vectorizers from scratch can be very time intensive, and since a similar method can be repurposed for several types of modeling, we can use the same algorithm over and over again. If we use scikit-learn, that algorithm is part of the class called CountVectorizer. The process of turning documents into vectors is also called feature extraction. 

In [19]:
# import scikit-learn's CountVectorizer class and create an instance named cv
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [20]:
more_sentences = sentences + ["John likes to watch movies. Mary likes movies too.",
                              "Mary also likes to watch football games."]
pd.DataFrame(more_sentences)

Unnamed: 0,0
0,It was the best of times
1,it was the worst of times
2,it was the age of wisdom
3,it was the age of foolishness
4,John likes to watch movies. Mary likes movies too.
5,Mary also likes to watch football games.


In [21]:
# Use the CountVectorizer to learn the new vocabulary in more_sentences
cv.fit(more_sentences)

In [22]:
# then print out the whole vocabulary
print(cv.get_feature_names_out())

['age' 'also' 'best' 'foolishness' 'football' 'games' 'it' 'john' 'likes'
 'mary' 'movies' 'of' 'the' 'times' 'to' 'too' 'was' 'watch' 'wisdom'
 'worst']


In [23]:
# then we use transform to vectorize all the sentences in the more_sentences variable 
dt = cv.transform(more_sentences)

In [24]:
# when we call the variable dt, it describes the matrix of vectors that CountVectorizer produced
# that matrix is a vector of vectors, stored in a sparse format
dt

<6x20 sparse matrix of type '<class 'numpy.int64'>'
	with 38 stored elements in Compressed Sparse Row format>

In [25]:
# When we turn the vector of vectors (or matrix) into a dataframe, 
# we can see the features for each sentence 
pd.DataFrame(dt.toarray(), columns=cv.get_feature_names_out())

Unnamed: 0,age,also,best,foolishness,football,games,it,john,likes,mary,movies,of,the,times,to,too,was,watch,wisdom,worst
0,0,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,1,0,0,0
1,0,0,0,0,0,0,1,0,0,0,0,1,1,1,0,0,1,0,0,1
2,1,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,1,0,1,0
3,1,0,0,1,0,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0
4,0,0,0,0,0,0,0,1,2,1,2,0,0,0,1,1,0,1,0,0
5,0,1,0,0,1,1,0,0,1,1,0,0,0,0,1,0,0,1,0,0


## Calculating Cosine Similarities
If you want to find similarities between documents in a bag-of-words matrix, counting the shared 1s is no longer enough. A word can occur more than once in a document, and that needs to be given additional weight in our calculation of similarity. We can use the angle between two vectors to measure the similarity between them: cosine similarity is the cosine of that angle. Because word counts are never negative, the output is limited to numbers between 0 and 1, with 0 representing no similarity and 1 representing vectors that point in exactly the same direction. Scikit-learn has a function that helps us to do this. 

In [26]:
#import cosine_similarity and then use it to calculate the angle between the first and second sentences.
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(dt[0], dt[1])

array([[0.83333333]])
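
The same number can be worked out by hand: cosine similarity is the dot product of the two vectors divided by the product of their lengths. The sketch below copies the two count vectors from the table above and uses only the standard library.

```python
import math

# Count vectors for the first two sentences, copied from the table above
doc0 = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0]
doc1 = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1]

dot = sum(a * b for a, b in zip(doc0, doc1))   # 5 shared words
norm0 = math.sqrt(sum(a * a for a in doc0))    # sqrt(6)
norm1 = math.sqrt(sum(b * b for b in doc1))    # sqrt(6)

cosine = dot / (norm0 * norm1)                 # 5 / 6
print(round(cosine, 8))  # 0.83333333
```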

In [27]:
# the more_sentences variable has 6 sentences (or docs), so we will need to compare each of the
# six sentences to each of the other sentences. 
len(more_sentences)

6

In [28]:
# Each sentence is both a row and a column; each cell holds the cosine similarity
# of that pair. Every document is identical to itself, so the diagonal is 1. 
# The added sentences share no words with the first 4, so 
# their cosine similarity is 0. 
pd.DataFrame(cosine_similarity(dt, dt))

Unnamed: 0,0,1,2,3,4,5
0,1.0,0.83,0.67,0.67,0.0,0.0
1,0.83,1.0,0.67,0.67,0.0,0.0
2,0.67,0.67,1.0,0.83,0.0,0.0
3,0.67,0.67,0.83,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.52
5,0.0,0.0,0.0,0.0,0.52,1.0


## Review
Vectorizing textual data turns strings of words into numerical arrays, which become a set of features that we can use to compare one document to another. One way to do this is the Bag-of-Words approach. Bag-of-words vectorizing does not take word order into account, but it does record how often each word appears in a document and gives more frequent terms a higher weight. 

In the next section, we'll work with TF-IDF vectorizing, which essentially "punishes" words that appear too frequently in the corpus. 

# TF/IDF

Term Frequency - Inverse Document Frequency (tf-idf) weighs how often a term occurs in a single document against how many documents in the corpus contain that term. A word that is frequent in one document but also appears in most other documents receives a lower weight, while a word that is frequent in one document and rare in the rest of the collection receives a higher weight. 

The logic behind this approach is that if there are words that are used frequently in every document, then they are not likely to convey important information. Instead, important information will likely be found in uncommon words that convey something different. 

Inverse document frequency creates a penalty for common words. In fact, we can arrive at a tf-idf weighting even if we start with a Bag-of-Words matrix. 

In [29]:
# from the feature_extraction text package in sklearn, we import TfidfTransformer
# then we fit and transform the variable dt from above. 
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer()
tfidf_dt = tfidf.fit_transform(dt)

In [38]:
# next, we turn the matrix into a dataframe so we can see it. 
pd.DataFrame(tfidf_dt.toarray(), columns=cv.get_feature_names_out())

Unnamed: 0,age,also,best,foolishness,football,games,it,john,likes,mary,movies,of,the,times,to,too,was,watch,wisdom,worst
0,0.0,0.0,0.57,0.0,0.0,0.0,0.34,0.0,0.0,0.0,0.0,0.34,0.34,0.47,0.0,0.0,0.34,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.34,0.0,0.0,0.0,0.0,0.34,0.34,0.47,0.0,0.0,0.34,0.0,0.0,0.57
2,0.47,0.0,0.0,0.0,0.0,0.0,0.34,0.0,0.0,0.0,0.0,0.34,0.34,0.0,0.0,0.0,0.34,0.0,0.57,0.0
3,0.47,0.0,0.0,0.57,0.0,0.0,0.34,0.0,0.0,0.0,0.0,0.34,0.34,0.0,0.0,0.0,0.34,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.31,0.5,0.25,0.61,0.0,0.0,0.0,0.25,0.31,0.0,0.25,0.0,0.0
5,0.0,0.42,0.0,0.0,0.42,0.42,0.0,0.0,0.34,0.34,0.0,0.0,0.0,0.0,0.34,0.0,0.0,0.34,0.0,0.0
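
The weights in the table above can be reproduced by hand. The sketch below assumes scikit-learn's default settings for `TfidfTransformer` (`smooth_idf=True`, `norm='l2'`), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The helper names are made up for this example.

```python
import math
import re
from collections import Counter

docs = ["It was the best of times",
        "it was the worst of times",
        "it was the age of wisdom",
        "it was the age of foolishness",
        "John likes to watch movies. Mary likes movies too.",
        "Mary also likes to watch football games."]

tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
n_docs = len(docs)

# document frequency: in how many documents does each term occur?
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency (assumed scikit-learn default)."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(tokens):
    """Term count times idf, scaled so the vector has length 1 (L2 norm)."""
    weights = {t: c * idf(t) for t, c in Counter(tokens).items()}
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / length for t, w in weights.items()}

first = tfidf_vector(tokenized[0])
print(round(first["best"], 2))   # 0.57
print(round(first["times"], 2))  # 0.47
print(round(first["it"], 2))     # 0.34
```

"best" appears in only one document, so it outweighs "it", which appears in four, even though both occur exactly once in the first sentence.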


In [30]:
# We can do the same cosine similarity calculation on the tf-idf matrix
pd.DataFrame(cosine_similarity(tfidf_dt, tfidf_dt))

Unnamed: 0,0,1,2,3,4,5
0,1.0,0.68,0.46,0.46,0.0,0.0
1,0.68,1.0,0.46,0.46,0.0,0.0
2,0.46,0.46,1.0,0.68,0.0,0.0
3,0.46,0.46,0.68,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.43
5,0.0,0.0,0.0,0.0,0.43,1.0


In [31]:
headlines = pd.read_csv(ABCNEWS_FILE, parse_dates=["publish_date"])
headlines.head()

Unnamed: 0,publish_date,headline_text
0,2003-02-19,aba decides against community broadcasting licence
1,2003-02-19,act fire witnesses must be aware of defamation
2,2003-02-19,a g calls for infrastructure protection summit
3,2003-02-19,air nz staff in aust strike for pay rise
4,2003-02-19,air nz strike to affect australian travellers


In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
dt = tfidf.fit_transform(headlines["headline_text"])

In [33]:
dt

<1103663x95878 sparse matrix of type '<class 'numpy.float64'>'
	with 7001357 stored elements in Compressed Sparse Row format>

In [34]:
dt.data.nbytes

56010856
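
The byte count above follows directly from the matrix description: each stored element is a float64, which takes 8 bytes. The sketch below redoes that arithmetic with the figures from the output and compares it with what a dense array of the same shape would need. Note that `dt.data` holds only the stored values; the index arrays of the sparse format take additional memory that is not counted here.

```python
# Figures copied from the notebook output above
n_rows, n_cols = 1_103_663, 95_878
stored_values = 7_001_357
BYTES_PER_FLOAT64 = 8

sparse_value_bytes = stored_values * BYTES_PER_FLOAT64
dense_bytes = n_rows * n_cols * BYTES_PER_FLOAT64
density = stored_values / (n_rows * n_cols)

print(sparse_value_bytes)               # 56010856, matching dt.data.nbytes
print(round(dense_bytes / 1e9, 1))      # size of a dense float64 array in GB
print(f"{density:.6%}")                 # share of cells that are non-zero
```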

In [35]:
%%time
cosine_similarity(dt[0:10000], dt[0:10000])

CPU times: user 225 ms, sys: 1.43 s, total: 1.66 s
Wall time: 1.66 s


array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.16913596,
        0.16792138],
       [0.        , 0.        , 0.        , ..., 0.16913596, 1.        ,
        0.33258708],
       [0.        , 0.        , 0.        , ..., 0.16792138, 0.33258708,
        1.        ]])

## Stopwords

In [36]:
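from spacy.lang.en.stop_words import STOP_WORDS as stopwords
# spaCy's English stop word list (a set); its size is printed for reference only
print(len(stopwords))
# stop_words="english" selects scikit-learn's built-in list, not spaCy's
tfidf = TfidfVectorizer(stop_words="english")
dt = tfidf.fit_transform(headlines["headline_text"])
dt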

326


<1103663x95588 sparse matrix of type '<class 'numpy.float64'>'
	with 5616010 stored elements in Compressed Sparse Row format>

## min_df

In [37]:
tfidf = TfidfVectorizer(stop_words="english", min_df=2)
dt = tfidf.fit_transform(headlines["headline_text"])
dt

<1103663x58516 sparse matrix of type '<class 'numpy.float64'>'
	with 5578938 stored elements in Compressed Sparse Row format>

In [38]:
tfidf = TfidfVectorizer(stop_words="english", min_df=.0001)
dt = tfidf.fit_transform(headlines["headline_text"])
dt

<1103663x6767 sparse matrix of type '<class 'numpy.float64'>'
	with 4788559 stored elements in Compressed Sparse Row format>

## max_df

In [39]:
tfidf = TfidfVectorizer(stop_words="english", max_df=0.1)
dt = tfidf.fit_transform(headlines["headline_text"])
dt

<1103663x95588 sparse matrix of type '<class 'numpy.float64'>'
	with 5616010 stored elements in Compressed Sparse Row format>

In [40]:
tfidf = TfidfVectorizer(max_df=0.1)
dt = tfidf.fit_transform(headlines["headline_text"])
dt

<1103663x95875 sparse matrix of type '<class 'numpy.float64'>'
	with 6532752 stored elements in Compressed Sparse Row format>
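
The `min_df` and `max_df` parameters prune the vocabulary by document frequency. In scikit-learn, an integer value is an absolute number of documents and a float is a proportion of all documents (so `min_df=.0001` on roughly 1.1 million headlines means a word must appear in about 110 of them). The pure-Python sketch below imitates that rule on a toy corpus; `prune_vocabulary` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

def prune_vocabulary(tokenized_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies between min_df and max_df."""
    n_docs = len(tokenized_docs)
    df = Counter(term for tokens in tokenized_docs for term in set(tokens))
    # integers are absolute document counts, floats are proportions
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(term for term, count in df.items() if low <= count <= high)

toy_docs = [s.split() for s in ["it was the best of times",
                                "it was the worst of times",
                                "it was the age of wisdom",
                                "it was the age of foolishness"]]

frequent_enough = prune_vocabulary(toy_docs, min_df=2)
not_too_common = prune_vocabulary(toy_docs, max_df=0.5)

print(frequent_enough)  # ['age', 'it', 'of', 'the', 'times', 'was']
print(not_too_common)   # ['age', 'best', 'foolishness', 'times', 'wisdom', 'worst']
```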

## n-grams

In [41]:
tfidf = TfidfVectorizer(stop_words="english", ngram_range=(1,2), min_df=2)
dt = tfidf.fit_transform(headlines["headline_text"])
print(dt.shape)
print(dt.data.nbytes)
tfidf = TfidfVectorizer(stop_words="english", ngram_range=(1,3), min_df=2)
dt = tfidf.fit_transform(headlines["headline_text"])
print(dt.shape)
print(dt.data.nbytes)

(1103663, 557787)
66868960
(1103663, 742391)
71782784
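
The jump in the number of columns comes from the extra features that word n-grams create. A minimal sketch of how `ngram_range=(1,2)` expands a token list is shown below; `word_ngrams` is a hypothetical helper written for this example, and the tokens are taken from one of the headlines.

```python
def word_ngrams(tokens, n_min, n_max):
    """Return all word n-grams with n_min <= n <= n_max, shortest first."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = ["air", "nz", "strike"]
unigrams_and_bigrams = word_ngrams(tokens, 1, 2)
up_to_trigrams = word_ngrams(tokens, 1, 3)

print(unigrams_and_bigrams)  # ['air', 'nz', 'strike', 'air nz', 'nz strike']
print(up_to_trigrams)
```

Every additional n-gram length adds new combinations to the vocabulary, which is why `min_df=2` is used above to discard combinations that occur only once.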


## Lemmas

In [44]:
from tqdm.auto import tqdm
import spacy
nlp = spacy.load("en_core_web_sm")
nouns_adjectives_verbs = ["NOUN", "PROPN", "ADJ", "ADV", "VERB"]

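# Full run over every headline (slow):
# for i, row in tqdm(headlines.iterrows(), total=len(headlines)):
# Quick run: lemmatize only the first 24 headlines; all other rows stay NaN
for i, row in tqdm(headlines[:24].iterrows(), total=len(headlines)):
    doc = nlp(str(row["headline_text"]))
    # all lemmas of the headline
    headlines.at[i, "lemmas"] = " ".join([token.lemma_ for token in doc])
    # lemmas of nouns, proper nouns, adjectives, adverbs and verbs only
    headlines.at[i, "nav"] = " ".join([token.lemma_ for token in doc if token.pos_ in nouns_adjectives_verbs])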

  0%|          | 0/1103663 [00:00<?, ?it/s]

In [45]:
headlines.head()

Unnamed: 0,publish_date,headline_text,lemmas,nav
0,2003-02-19,aba decides against community broadcasting licence,aba decide against community broadcasting licence,aba decide community broadcasting licence
1,2003-02-19,act fire witnesses must be aware of defamation,act fire witness must be aware of defamation,act fire witness aware defamation
2,2003-02-19,a g calls for infrastructure protection summit,a g call for infrastructure protection summit,g call infrastructure protection summit
3,2003-02-19,air nz staff in aust strike for pay rise,air nz staff in aust strike for pay rise,air nz staff aust strike pay rise
4,2003-02-19,air nz strike to affect australian travellers,air nz strike to affect australian traveller,air nz strike affect australian traveller


In [46]:
tfidf = TfidfVectorizer(stop_words="english")
dt = tfidf.fit_transform(headlines["lemmas"].map(str))
dt

<1103663x13708 sparse matrix of type '<class 'numpy.float64'>'
	with 1227849 stored elements in Compressed Sparse Row format>

In [47]:
tfidf = TfidfVectorizer(stop_words="english")
dt = tfidf.fit_transform(headlines["nav"].map(str))
dt

<1103663x13399 sparse matrix of type '<class 'numpy.float64'>'
	with 1225616 stored elements in Compressed Sparse Row format>

## remove top 10,000

In [49]:
top_10000 = pd.read_csv("https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english.txt", header=None)
tfidf = TfidfVectorizer(stop_words=list(set(top_10000.iloc[:,0].values)))
dt = tfidf.fit_transform(headlines["nav"].map(str))
dt

<1103663x8772 sparse matrix of type '<class 'numpy.float64'>'
	with 1109199 stored elements in Compressed Sparse Row format>

In [51]:
tfidf = TfidfVectorizer(ngram_range=(1,2), stop_words=list(set(top_10000.iloc[:,0].values)), min_df=2)
dt = tfidf.fit_transform(headlines["nav"].map(str))
dt

<1103663x5193 sparse matrix of type '<class 'numpy.float64'>'
	with 1107096 stored elements in Compressed Sparse Row format>

## Finding document most similar to made-up document

In [52]:
tfidf = TfidfVectorizer(stop_words="english", min_df=2)
dt = tfidf.fit_transform(headlines["lemmas"].map(str))
dt

<1103663x8167 sparse matrix of type '<class 'numpy.float64'>'
	with 1222308 stored elements in Compressed Sparse Row format>

In [53]:
made_up = tfidf.transform(["australia and new zealand discuss optimal apple size"])

In [54]:
sim = cosine_similarity(made_up, dt)

In [55]:
sim[0]

array([0., 0., 0., ..., 0., 0., 0.])

In [56]:
headlines.iloc[np.argsort(sim[0])[::-1][0:5]][["publish_date", "lemmas"]]

Unnamed: 0,publish_date,lemmas
28116,2003-07-04,bracewell name as new zealand cricket coach
23167,2003-06-10,new feature for apple and grape festival
10731,2003-04-11,gronholm dominate in new zealand
22701,2003-06-08,new zealand to play two test in india
16773,2003-05-11,pakistan skittle out for 116 by new zealand


# Finding the most similar documents

In [68]:
# there are "test" headlines in the corpus
stopwords.add("test")
# pass the extended spaCy list; stop_words="english" would ignore the added word
tfidf = TfidfVectorizer(stop_words=list(stopwords), ngram_range=(1,2), min_df=2, norm='l2')
dt = tfidf.fit_transform(headlines["headline_text"])

# Finding most related words

In [69]:
tfidf_word = TfidfVectorizer(stop_words="english", min_df=1000)
dt_word = tfidf_word.fit_transform(headlines["headline_text"])

In [70]:
r = cosine_similarity(dt_word.T, dt_word.T)
np.fill_diagonal(r, 0)

In [71]:
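voc = tfidf_word.get_feature_names_out()
size = r.shape[0]  # r is a square (terms x terms) matrix
# walk through the 40 largest entries of the flattened matrix
for index in np.argsort(r.flatten())[::-1][0:40]:
    a, b = divmod(index, size)  # flat index -> (row, column)
    if a > b:  # r is symmetric, so report each pair only once
        print('"%s" related to "%s"' % (voc[a], voc[b]))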

"sri" related to "lanka"
"hour" related to "country"
"seekers" related to "asylum"
"springs" related to "alice"
"pleads" related to "guilty"
"hill" related to "broken"
"trump" related to "donald"
"violence" related to "domestic"
"climate" related to "change"
"driving" related to "drink"
"care" related to "aged"
"gold" related to "coast"
"royal" related to "commission"
"mental" related to "health"
"wind" related to "farm"
"flu" related to "bird"
"murray" related to "darling"
"north" related to "korea"
"hour" related to "2014"
"world" related to "cup"

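
The index arithmetic in the loop above can be tried out on a tiny symmetric matrix. The sketch below uses three made-up terms and made-up scores; it shows how a position in the flattened matrix maps back to a (row, column) pair, and why only half of the top entries produce output.

```python
labels = ["sri", "lanka", "cup"]
# made-up symmetric similarity scores with a zeroed diagonal
scores = [[0.0, 0.9, 0.1],
          [0.9, 0.0, 0.2],
          [0.1, 0.2, 0.0]]

size = len(labels)
flat = [value for row in scores for value in row]
order = sorted(range(len(flat)), key=lambda i: flat[i], reverse=True)

pairs = []
for index in order[:4]:
    a, b = divmod(index, size)   # flat index -> (row, column)
    if a > b:                    # each pair appears twice; keep one copy
        pairs.append((labels[a], labels[b]))

print(pairs)  # [('lanka', 'sri'), ('cup', 'lanka')]
```

Four top entries yield two pairs here, just as the 40 top entries above yield 20 printed pairs.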