## Scatter plot with `Scattertext`
`scattertext` is "a tool for finding distinguishing terms in small-to-medium-sized corpora, and presenting them in an interesting, interactive scatter plot with non-overlapping term labels." (See the [documentation]( https://github.com/JasonKessler/scattertext).)

In this notebook, we are going to compare the works of two 19th century novelists: [Charles Dickens](https://en.wikipedia.org/wiki/Charles_Dickens) and [George Eliot](https://en.wikipedia.org/wiki/George_Eliot) (aka Mary Ann Evans). Such a comparison could be used to address questions about gender when it comes to authorship, or, perhaps, about key differences between novels set in urban vs. rural environments.

## Set up

In [1]:
%%capture
!pip install scattertext

In [1]:
import pandas as pd
import scattertext as st
from IPython.core.display import HTML

In [2]:
#load data
dickens_url = 'https://raw.githubusercontent.com/msaxton/nlp-data/main/dickens.csv'
dickens_df = pd.read_csv(dickens_url)

In [3]:
# sanity check
print(dickens_df.shape)
dickens_df.sample(5)

(24707, 6)


Unnamed: 0,author,title,text,nouns,adjectives,verbs
1242,dickens,cities,"He unshaded his face after a little while, and...",face while,little,unshade speak
11698,dickens,copperfield,"‘I am sure you are right,’ she returned; ‘and ...",habit self,sure right bad guarded trustful odd former,return grow change wonder ’ study regain
2065,dickens,cities,"“That’s true,” Mr. Lorry acknowledged, with hi...",hand chin eye,true troubled troubled,’ acknowledge
8169,dickens,times,"‘From bad to worse, from worse to worsen. She...",herseln everyway street night t th brigg ower ...,bad bad bad bitter bad more much young,worsen leave disgrace coom coom coom hinder ha...
11328,dickens,copperfield,"Somebody was leaning out of my bedroom window,...",bedroom window forehead stone parapet air face...,cool pale vacant drunk,lean refresh feel address say try smoke know c...


In [3]:
#load data
eliot_url = 'https://raw.githubusercontent.com/msaxton/nlp-data/main/eliot.csv'
eliot_df = pd.read_csv(eliot_url)

In [5]:
# sanity check
print(eliot_df.shape)
eliot_df.sample(5)

(18139, 6)


Unnamed: 0,author,title,text,nouns,adjectives,verbs
4680,eliot,mill,At this moment a striking incident made the bo...,moment incident boy walk plunging body water n...,striking small ready unpleasant,make pause intimate undergo
2465,eliot,middlemarch,“I wonder whether you would like to have that ...,miniature stair miniature grandmother,beautiful right,wonder like hang mean think keep wish
12021,eliot,bede,Adam spoke these words with the firm distinctn...,word distinctness man hesitation,firm unsaid more,speak resolve leave bind say go
3258,eliot,middlemarch,Mr. Hawley rode home without thinking of Lydga...,attendance light piece evidence side news exec...,other able new other slow significant sudden d...,ride think become get rid pay spread gather co...
3465,eliot,middlemarch,She entertained no visions of their ever comin...,vision union posture renunciation relation par...,whole sinful inward happy disposed chief repul...,entertain come take accept think keep dwell be...


## Pre-process data

There are a few changes we need to make to our data to get it ready for processing by `Scattertext`.

**First**, we are going to get a smaller sample of the data so that we can process things more quickly for our in-class demonstration. If you were to do this as a research project, you might consider using your entire dataset.

**Second**, we are going to combine both datasets into one `DataFrame`.

**Third**, we are going to drop all the columns from that `DataFrame` except for `author` and `nouns.`

In [6]:
# create samples
dickens_sample_df = dickens_df.sample(10_000)
eliot_sample_df = eliot_df.sample(10_000)

In [7]:
# combine DataFrames
df = pd.concat([dickens_sample_df, eliot_sample_df])

In [14]:
# drop all columns except 'author' and 'noun'
noun_df = df[['author', 'noun']]

In [16]:
print(adj_df.shape)
noun_df.sample(10)

(20000, 2)


Unnamed: 0,author,adjectives
13055,eliot,possible
15629,eliot,crazy little great busy ignorant large politic...
3725,dickens,certain circumstanced likely accidental strang...
14852,eliot,such old fashioned new high high constructive ...
1785,eliot,unsuited little hard ridiculous well young use...
6260,eliot,sweet bent sympathetic little expansive great ...
16643,eliot,good ill
3669,eliot,wrong little least pleasant
5695,eliot,dreary certain swift angry feeble strange dism...
10629,dickens,suspicious great small


## Build corpus and visualize

Now that we have our data in the shape that we need, we can hand it over to `Scattertext` to do the heavy lifting. The code below follows `Scattertext`'s [documentation](https://github.com/JasonKessler/scattertext). We first create a `Scattertext` corpus, then we transform that corpus into an html-based visualization, finally, we display that visualization within our notebook. Note: you can also download the visualization as an html file.

In [17]:
# create a scattertext corpus
corpus = st.CorpusFromPandas(nouns_df, category_col='author', text_col='nouns').build()

In [18]:
# transform corpus into html-based visualization with scattertext
html = st.produce_scattertext_explorer(corpus,
                                       category='eliot',  # this sets the y-axis
                                       category_name='Eliot', # label y-axis
                                       not_category_name='Dickens',  # label x-axis
                                       minimum_term_frequency=20,
                                       width_in_pixels=900)

In [19]:
# display visualization in notebook
HTML(html)

In [20]:
# Note: You can save this visualization as an html file
file_name = 'example.html'
with open(file_name, encoding='utf8', mode='w') as f:
  f.write(html)

## Compare using adjectives

We have compared Dickens and Eliot on the basis of the nouns they used. It might also be infomrative to compare them on the basis of the adjectives they used.

Starting with our initial datasets, `dickens_df` and `eliot_df`, make a comparison on the adjectives used by these authors using `Scattertext`.

In [4]:
# create samples
dickens_sample_df = dickens_df.sample(10_000)
eliot_sample_df = eliot_df.sample(10_000)

In [5]:
# combine DataFrames
df = pd.concat([dickens_sample_df, eliot_sample_df])

In [8]:
# drop all columns except 'author' and 'adjectives'
adj_df = df[['author', 'adjectives']]

In [9]:
# create a scattertext corpus
corpus = st.CorpusFromPandas(adj_df, category_col='author', text_col='adjectives').build()

In [10]:
# # transform corpus into html-based visualization with scattertext
html = st.produce_scattertext_explorer(corpus,
                                       category='eliot',  # this sets the y-axis
                                       category_name='Eliot', # label y-axis
                                       not_category_name='Dickens',  # label x-axis
                                       minimum_term_frequency=20,
                                       width_in_pixels=900)

In [11]:
# display visualization in notebook
HTML(html)