Today we're going to use a different approach to similarity: finding the association between words. This is the same idea that we used for finding phrases with Pointwise Mutual Information (PMI), but here we will use a direction-specific measure, which gives more syntactically sensitive results.

Let's start by getting our environment ready.

In [1]:
from text_analytics import TextAnalytics
import os
import pandas as pd

ai = TextAnalytics()
ai.data_dir = os.path.join(".", "data")
print("Done!")

Done!


We're going to start by looking at our data from Project Gutenberg. At this point we only want to test the association measure function, so we load just the first 1,000 rows rather than the whole dataset. Later we will load the complete, already processed results.

In [2]:
file = os.path.join(ai.data_dir, "stylistics.authorship_1850.gz")
df = pd.read_csv(file, nrows = 1000)
print(df)

          Author                Title  \
0       abbott_j  alexander_the_great   
1       abbott_j  alexander_the_great   
2       abbott_j  alexander_the_great   
3       abbott_j  alexander_the_great   
4       abbott_j  alexander_the_great   
..           ...                  ...   
995  altsheler_j        the_candidate   
996  altsheler_j        the_candidate   
997  altsheler_j        the_candidate   
998  altsheler_j        the_candidate   
999  altsheler_j        the_candidate   

                                                  Text  
0    note project gutenberg also has an html versio...  
1    it will be recollected to epirus where her fri...  
2    it would be best to endeavor to effect a landi...  
3    transport his army across the straits the army...  
4    that the true greatness of the soul of alexand...  
..                                                 ...  
995  comrades yet churchill was not wholly pleased ...  
996  idea and he immediately hunted up the cousin a

Now we can find the most associated pairs of words. The *save_phraser* option allows us to reuse these results as the basis for our phrase detection as well, but here we will just inspect them.

In [3]:
association_df = ai.get_association(df, min_count = 1, save_phraser = True)
print(association_df)

	Time: 227.36381030082703 Full: 989787  Reduced: 392987 with 4997883 words.

	TOTAL TIME: 227.37843680381775
	TOTAL NGRAMS: 392987
	TOTAL WORDS: 4997883
	After pruning:
	TOTAL NGRAMS: 382815

	Calculating association for 382815 pairs.
	Processed 382815 items in 1.6508512496948242
         Word1    Word2       Max        LR        RL  Freq
347524      de  peyster  0.998835  0.998835  0.063444   126
57412    madam   rachel  0.998102  0.998102  0.414062    53
241275  madame   campan  0.997126  0.997126  0.039785    37
290489    loch    leven  0.997059  0.997059  0.447368    34
24262    rhode   island  0.997006  0.031223  0.997006    36
...        ...      ...       ...       ...       ...   ...
59263      and      and -0.036657 -0.036657 -0.036657   127
4039       the       of -0.037537 -0.068729 -0.037537  1891
16027      the      and -0.037744 -0.075858 -0.037744   433
36766       of       of -0.041107 -0.041107 -0.041107    47
3216       the      the -0.078698 -0.078698 -0.078698  1047

This is a direction-specific measure, so each pair gets two scores: left-to-right (*LR*) and right-to-left (*RL*). The *Max* column holds the larger of the two, which lets us find the best direction for each pair. Finally, *Freq* gives the frequency of the pair. This measure is not as frequency-sensitive as PMI, but it is still biased toward infrequent pairs, which tend to come out as highly associated.
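
To make the idea of a direction-specific score concrete, here is a minimal sketch using ΔP (Delta P), one well-known directional association measure. This is an assumption for illustration only: the notebook does not say which formula `get_association` implements, and the function name `delta_p`, the toy sentence, and the result names `w2_given_w1` and `w1_given_w2` are all hypothetical. How the two values would map onto the *LR* and *RL* columns depends on the library's convention, so the sketch names them by what they condition on instead.

```python
from collections import Counter


def delta_p(tokens):
    """Score each adjacent word pair in both directions with Delta P.

    Returns {(w1, w2): (w2_given_w1, w1_given_w2)}.
    """
    pairs = Counter(zip(tokens, tokens[1:]))
    total = sum(pairs.values())

    # How often each word occupies the left or the right slot of a pair.
    left, right = Counter(), Counter()
    for (w1, w2), n in pairs.items():
        left[w1] += n
        right[w2] += n

    def ratio(num, den):
        # Guard against an empty comparison set in very small corpora.
        return num / den if den else 0.0

    scores = {}
    for (w1, w2), a in pairs.items():
        b = left[w1] - a       # w1 followed by some other word
        c = right[w2] - a      # w2 preceded by some other word
        d = total - a - b - c  # pairs with neither w1 on the left nor w2 on the right
        # P(w2 | w1) - P(w2 | not w1): how well the left word predicts the right one.
        w2_given_w1 = ratio(a, a + b) - ratio(c, c + d)
        # P(w1 | w2) - P(w1 | not w2): how well the right word predicts the left one.
        w1_given_w2 = ratio(a, a + c) - ratio(b, b + d)
        scores[(w1, w2)] = (w2_given_w1, w1_given_w2)
    return scores


tokens = "on horseback he rode on horseback she rode on foot they rode".split()
fwd, bwd = delta_p(tokens)[("on", "horseback")]
print(round(fwd, 3), round(bwd, 3))  # 0.667 0.889
```

In this toy sentence "horseback" is always preceded by "on", while "on" is sometimes followed by "foot", so the two directions get different scores. A symmetric measure such as PMI would assign the pair a single value and hide that difference.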

Now let's load fully trained word pairs for the Project Gutenberg dataset and the web/tweet dataset.

In [4]:
pd.set_option('display.max_rows', 100)

file = os.path.join(ai.data_dir, "stylistics.gutenberg_all.gz.association.gz")
pg_df = pd.read_csv(file)
print(pg_df)

file = os.path.join(ai.data_dir, "sociolinguistics.english_all.gz.association.gz")
tw_df = pd.read_csv(file)
print(tw_df)

        Word1      Word2       Max        LR        RL    Freq
0          le    gardeur  0.999871  0.999871  0.022920    1344
1           m     kinlay  0.999788  0.999788  0.010918     638
2         des  esseintes  0.999738  0.999738  0.011821     442
3         des    hermies  0.999672  0.999672  0.009146     342
4         don     aníbal  0.999594  0.999594  0.002330     464
...       ...        ...       ...       ...       ...     ...
3987819   the        and -0.032834 -0.061111 -0.032834  182982
3987820   the         of -0.032943 -0.059793 -0.032943  235270
3987821   and        and -0.033293 -0.033293 -0.033293   44364
3987822    of         of -0.035106 -0.035106 -0.035106   15540
3987823   the        the -0.067346 -0.067346 -0.067346   87214

[3987824 rows x 6 columns]
           Word1        Word2       Max        LR        RL   Freq
0        gustado           un  0.999816  0.014754  0.999816    870
1         indigo       cafind  0.999745  0.999745  0.129993    397
2          vote

Let's get some samples of highly associated left-to-right phrases from both datasets. Sampling with `frac = 1` returns every matching row in random order.

In [5]:
pg_lr_df = pg_df.loc[pg_df["LR"] > 0.90].sample(frac = 1)
print(pg_lr_df)

tw_lr_df = tw_df.loc[tw_df["LR"] > 0.90].sample(frac = 1)
print(tw_lr_df)

     Word1      Word2       Max        LR        RL   Freq
422    sir    cropton  0.993883  0.993883  0.000029     18
215    van     burnam  0.996390  0.996390  0.013843    558
1449    on  horseback  0.931131  0.931131  0.002727  16716
1765  miss      meeke  0.920657  0.920657  0.000340    140
741     de     tabley  0.991285  0.991285  0.000026     12
...    ...        ...       ...       ...       ...    ...
360     mr     selwin  0.994804  0.994804  0.000087     20
970     of     thunes  0.963883  0.963883  0.000002     56
829    van      vliet  0.982720  0.982720  0.002828    114
68     van    someren  0.998297  0.998297  0.001489     60
886   pont    audemer  0.974357  0.974357  0.029851     76

[1607 rows x 6 columns]
                Word1                Word2       Max        LR        RL  Freq
1912           younus              algohar  0.986111  0.986111  0.250000   142
813            yvonne             fovargue  0.993786  0.993786  0.008060    16
252            ovidiu         

Now we can repeat the same code, this time finding right-to-left phrases.

In [6]:
pg_rl_df = pg_df.loc[pg_df["RL"] > 0.90].sample(frac = 1)
print(pg_rl_df)

tw_rl_df = tw_df.loc[tw_df["RL"] > 0.90].sample(frac = 1)
print(tw_rl_df)

                    Word1        Word2       Max            LR        RL  Freq
1469              merrion       square  0.930160  1.064107e-03  0.930160    80
1812               faeroe        isles  0.916660  3.109978e-03  0.916660    22
2040               tuatha           de  0.901989  1.577325e-04  0.901989    74
1926              metonic        cycle  0.909088  5.704505e-03  0.909088    20
642         heterochronic  variability  0.991732  3.276898e-03  0.991732    12
...                   ...          ...       ...           ...       ...   ...
290                  fadl           el  0.995464  1.841312e-03  0.995464    22
192                 menlo         park  0.996633  6.475565e-04  0.996633    30
323                  tóth        jános  0.995025  2.040816e-01  0.995025    20
1124               geroch           of  0.957401  3.352864e-07  0.957401    12
1163  contradistinguished         from  0.953139  2.004467e-05  0.953139    90

[475 rows x 6 columns]
                Word1       

And there you go! 

In this lab, we've seen how to find or retrieve the most associated words, while taking into account the direction of association. Because the filtered results are shuffled at random, you can rerun the cells to see more examples. These measures pick up more specific patterns than PMI.
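
Since the measure is still biased toward infrequent pairs, one possible follow-up is to combine the association threshold with a frequency floor. The sketch below assumes only the column names shown in the outputs above (*Word1*, *Word2*, *Max*, *LR*, *RL*, *Freq*). The helper `strong_pairs`, the `min_freq` value of 50, and the added *Direction* column are hypothetical and not part of the `text_analytics` package. The rows are copied from the Project Gutenberg output printed earlier.

```python
import pandas as pd

# A handful of rows copied from the Project Gutenberg output shown earlier.
toy_df = pd.DataFrame(
    [
        ("le", "gardeur", 0.999871, 0.999871, 0.022920, 1344),
        ("on", "horseback", 0.931131, 0.931131, 0.002727, 16716),
        ("de", "tabley", 0.991285, 0.991285, 0.000026, 12),
        ("menlo", "park", 0.996633, 6.475565e-04, 0.996633, 30),
        ("the", "of", -0.032943, -0.059793, -0.032943, 235270),
    ],
    columns=["Word1", "Word2", "Max", "LR", "RL", "Freq"],
)


def strong_pairs(df, threshold=0.90, min_freq=50):
    """Keep pairs that are strongly associated in either direction and frequent enough."""
    keep = df.loc[(df["Max"] > threshold) & (df["Freq"] >= min_freq)].copy()
    # Record which of the two directional columns holds the larger score.
    keep["Direction"] = keep[["LR", "RL"]].idxmax(axis=1)
    return keep.sort_values("Freq", ascending=False)


print(strong_pairs(toy_df))
```

With these settings only "on horseback" and "le gardeur" remain: "de tabley" and "menlo park" fall below the frequency floor, and "the of" fails the association threshold. Passing `pg_df` or `tw_df` instead of `toy_df` would apply the same filter to the full results, and lowering `min_freq` brings the rarer pairs back.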