# <font color='darkorange'>**Term association visualization**</font>

This notebook walks through building term association graphs, which show the terms in the dataset that appear more or less frequently in one specific category than in the other nine categories. The original code comes from Jason Kessler's Scattertext project, available on GitHub (<a href="https://github.com/JasonKessler/scattertext">link to source</a>).
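
To make the idea of "more or less frequent in one category than in the rest" concrete, here is a minimal sketch on toy data. It is not Scattertext's scoring (the notebook relies on the library's scaled F-scores further down); it only compares each term's relative frequency inside a target category with its relative frequency everywhere else. The function name `term_rate_gap` and the toy sentences are illustrative assumptions.

```python
from collections import Counter


def term_rate_gap(docs, target):
    """Return {term: rate in `target` minus rate in all other categories}.

    `docs` is an iterable of (category, text) pairs. Tokenization is a plain
    lowercase whitespace split, which is far cruder than spaCy's.
    """
    in_cat, rest = Counter(), Counter()
    for category, text in docs:
        tokens = text.lower().split()
        (in_cat if category == target else rest).update(tokens)
    # Guard against empty sides so the division never fails.
    n_cat = sum(in_cat.values()) or 1
    n_rest = sum(rest.values()) or 1
    vocabulary = set(in_cat) | set(rest)
    return {t: in_cat[t] / n_cat - rest[t] / n_rest for t in vocabulary}


# Hypothetical mini-corpus, only for illustration.
toy_docs = [
    ('cigar', 'long silver cigar in the sky'),
    ('cigar', 'cigar shaped object'),
    ('disk', 'silver disk in the sky'),
    ('light', 'bright light in the sky'),
]
gaps = term_rate_gap(toy_docs, 'cigar')
top_term = max(gaps, key=gaps.get)  # term most over-represented in 'cigar'
```

Positive values mark terms that lean towards the target category, negative values mark terms that lean towards the rest. A raw difference like this favors common words, which is one reason a normalized score is preferable on real data.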

In [1]:
# Setting things up.
#%pip install scattertext
#%pip install pytextrank
import os
import pandas as pd
from pathlib import Path
import scattertext as st
import spacy

# Uncomment and set the path below if the working directory needs to change.
#os.chdir('*** Change working directory here if needed ***')
data_folder = Path('data/')
html_folder = Path('html/')



In [2]:
# Loading alien data.
data = pd.read_csv(data_folder/'ufo_sighting_data.csv', header=0, delimiter=',', encoding='utf-8')
data.iloc[0]


Date_time                                                           10/10/1949 20:30
city                                                                      san marcos
state/province                                                                    tx
country                                                                           us
UFO_shape                                                                   cylinder
length_of_encounter_seconds                                                     2700
described_duration_of_encounter                                           45 minutes
description                        This event took place in early fall around 194...
date_documented                                                            4/27/2004
latitude                                                                  29.8830556
longitude                                                                 -97.941111
Name: 0, dtype: object

In [3]:
# Showing some instances of sightings in the US.
us_data = data[data['country'] == 'us']
us_data

Unnamed: 0,Date_time,city,state/province,country,UFO_shape,length_of_encounter_seconds,described_duration_of_encounter,description,date_documented,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/2004,29.8830556,-97.941111
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.9783333,-96.645833
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.4180556,-157.803611
5,10/10/1961 19:00,bristol,tn,us,sphere,300,5 minutes,My father is now 89 my brother 52 the girl wit...,4/27/2007,36.595,-82.188889
7,10/10/1965 23:45,norwalk,ct,us,disk,1200,20 minutes,A bright orange color changing to reddish colo...,10/2/1999,41.1175,-73.408333
...,...,...,...,...,...,...,...,...,...,...,...
80327,9/9/2013 21:15,nashville,tn,us,light,600.0,10 minutes,Round from the distance/slowly changing colors...,9/30/2013,36.165833,-86.784444
80328,9/9/2013 22:00,boise,id,us,circle,1200.0,20 minutes,Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r...,9/30/2013,43.613611,-116.202500
80329,9/9/2013 22:00,napa,ca,us,other,1200.0,hour,Napa UFO&#44,9/30/2013,38.297222,-122.284444
80330,9/9/2013 22:20,vienna,va,us,circle,5.0,5 seconds,Saw a five gold lit cicular craft moving fastl...,9/30/2013,38.901111,-77.265556
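
The descriptions above contain HTML character references such as `&#44` (a comma written without the closing semicolon), and these leak into the extracted terms later on. The following is a possible cleaning step, not part of the original notebook: it uses the standard library's `html.unescape`, which also accepts numeric references that lack the trailing semicolon. The helper name `clean_description` is an assumption.

```python
import html


def clean_description(text):
    """Decode HTML character references in a sighting description.

    Non-string values (for example missing descriptions) are returned as-is.
    Caveat: a reference immediately followed by digits, such as '&#4410',
    would be read as one larger code point.
    """
    if not isinstance(text, str):
        return text
    return html.unescape(text)


sample = 'Boise&#44 ID&#44 spherical&#44 20 min'
cleaned = clean_description(sample)
```

In the notebook this could be applied before building the corpus, for instance with `us_data['description'].map(clean_description)`; whether that changes the resulting plots has not been checked here.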


In [4]:
# Initializing spacy.
nlp = spacy.load('en_core_web_sm')

In [5]:
# Creating the corpus.
corpus = st.CorpusFromPandas(us_data,
                             category_col='UFO_shape',
                             text_col='description',
                             nlp=nlp).build()

In [58]:
# Exploring the 10 highest-scoring terms (by scaled F-score) for each UFO category.
cats = ['chevron', 'cigar', 'diamond', 'disk', 'formation', 'other', 'oval', 'rectangle', 'triangle']

# The term-frequency table does not depend on the category, so it is built once.
term_freq = corpus.get_term_freq_df()
for cat in cats:
    term_freq[cat] = corpus.get_scaled_f_scores(cat)
    print('\U0001F47D Frequent terms for', cat, '-->', list(term_freq.sort_values(by=cat, ascending=False).index[:10]))

👽 Frequent terms for chevron --> ['chevron', 'chevron shaped', 'a chevron', 'chevron shape', 'cheveron', 'large chevron', 'chevron formation', 'black chevron', 'quot;v&quot shaped', 'large boomerang']
👽 Frequent terms for cigar --> ['silver cigar', 'cigar shaped&#44', 'large cigar', 'cigar shaped', 'white cigar', 'cigar', 'cigar shape', 'a cigar', 'long cigar', 'orange cigar']
👽 Frequent terms for diamond --> ['black diamond', 'dimond', 'diamond shaped', 'diamond', 'diamond shape', 'a diamond', 'bright diamond', 'orange diamond', 'white diamond', 'diamond with']
👽 Frequent terms for disk --> ['silver disk', 'a disk', 'saucer with', 'saucer shaped', 'saucer', 'a saucer', 'a disc', 'disk with', 'disk shaped', 'disk shape']
👽 Frequent terms for formation --> ['formation of', 'formation seen', 'a formation', 'v formation', 'shaped formation', 'light formation', 'formation in', 'formation', 'of 5', 'in v']
👽 Frequent terms for other --> ['crescent', 'crescent shaped', 'bell', 'bell shaped',

In [53]:
# Creating the html file for the scatterplot. The example below works for the 'cigar' category.
html = st.produce_scattertext_explorer(corpus,
                                       category='cigar',
                                       category_name='for cigar',
                                       not_category_name='for the rest',
                                       width_in_pixels=1000,
                                       )

# write_bytes opens, writes and closes the file, and returns the number of bytes written.
(html_folder/'ufo_cigar.html').write_bytes(html.encode('utf-8'))

8026933
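
The cell above handles a single category. One way to repeat it for every category is sketched below. The helper `export_category_pages` and its `render` parameter are assumptions introduced here: `render` stands for any function that takes a category name and returns the HTML string, so the file handling can be tried without building the corpus.

```python
from pathlib import Path


def export_category_pages(categories, render, out_dir):
    """Write one 'ufo_<category>.html' file per category.

    `render` is a callable mapping a category name to an HTML string.
    Returns {category: number of bytes written}.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for cat in categories:
        target = out_dir / f'ufo_{cat}.html'
        written[cat] = target.write_bytes(render(cat).encode('utf-8'))
    return written


# Possible use in this notebook (needs the corpus built earlier):
# export_category_pages(
#     cats,
#     lambda cat: st.produce_scattertext_explorer(
#         corpus,
#         category=cat,
#         category_name=f'for {cat}',
#         not_category_name='for the rest',
#         width_in_pixels=1000,
#     ),
#     html_folder,
# )
```

Given the size of the single file above, each page is likely to be several megabytes, so the full loop may take a while and use noticeable disk space.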