## Scatter plot with `Scattertext`
`scattertext` is "a tool for finding distinguishing terms in small-to-medium-sized corpora, and presenting them in an interesting, interactive scatter plot with non-overlapping term labels." (See the [documentation](https://github.com/JasonKessler/scattertext).)

In this notebook, we compare the works of two 19th-century novelists: [Charles Dickens](https://en.wikipedia.org/wiki/Charles_Dickens) and [George Eliot](https://en.wikipedia.org/wiki/George_Eliot) (the pen name of Mary Ann Evans). Such a comparison could be used to explore questions about gender and authorship or, perhaps, key differences between novels set in urban and rural environments.

## Set up

In [1]:
%%capture
!pip install scattertext

In [2]:
import pandas as pd
import scattertext as st
from IPython.display import HTML  # public import path for the HTML display class

In [3]:
# load the Dickens dataset
dickens_url = 'https://raw.githubusercontent.com/msaxton/nlp-data/main/dickens.csv'
dickens_df = pd.read_csv(dickens_url)

In [5]:
# sanity check
print(dickens_df.shape)
dickens_df.sample(5)

(24707, 6)


|  | author | title | text | nouns | adjectives | verbs |
|---|---|---|---|---|---|---|
| 22263 | dickens | pickwick | 'Dear me,' said the prim man in the cloth boot... | man cloth boot circumstance | prim extraordinary | dear say |
| 22886 | dickens | pickwick | 'My dear friend,' said Mr. Ben Allen, taking a... | friend advantage absence counter whither hand ... | dear temporary second dear miserable | say take retire dispense refer |
| 1938 | dickens | cities | As soon as they were established in their new ... | residence father routine avocation household h... | new little english slight little speedy solemn... | establish enter arrange appoint appoint teach ... |
| 8649 | dickens | times | ‘What shall I understand that you mean by a ba... | name | bad | understand mean |
| 17054 | dickens | bleak | These encomiums bring them to Mount Pleasant a... | encomium house door top toe favour sneer oracl... | perennial particular malignant privileged | bring open have survey leave stand consult inf... |


In [6]:
# load the Eliot dataset
eliot_url = 'https://raw.githubusercontent.com/msaxton/nlp-data/main/eliot.csv'
eliot_df = pd.read_csv(eliot_url)

In [7]:
# sanity check
print(eliot_df.shape)
eliot_df.sample(5)

(18139, 6)


|  | author | title | text | nouns | adjectives | verbs |
|---|---|---|---|---|---|---|
| 2522 | eliot | middlemarch | The scent would have been sweeter to Fred Vinc... | scent lane horseback mind effort father side w... | sweet unsuccessful other eager young unskilled... | come worry imagine expect straightway enter th... |
| 17461 | eliot | scenes | Mr. Gilfil could not help going up to the old ... | coachman hand chair room moon face piping voic... | old unable small same awful | help go wring speak motion take leave hang lis... |
| 12059 | eliot | bede | “I came on an errand for Mother,” said Adam. “... | errand touch complaint bit | old | come say get want go stay |
| 11674 | eliot | bede | He hurried his step along the narrow causeway,... | step causeway door woman shake head | narrow clean old slow palsied | hurry rap open |
| 16379 | eliot | felix | "Oh, then," said Esther, turning her head asid... | head birch stem time | distant good | say turn consider bear get know get get |


## Pre-process data

There are a few changes we need to make to our data to get it ready for processing by `Scattertext`.

**First**, we are going to get a smaller sample of the data so that we can process things more quickly for our in-class demonstration. If you were to do this as a research project, you might consider using your entire dataset.

**Second**, we are going to combine both datasets into one `DataFrame`.

**Third**, we are going to drop all the columns from that `DataFrame` except for `author` and `nouns`.
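
The three steps above can be sketched on a toy dataset before running them on the real one. This is a minimal sketch under stated assumptions: the two tiny DataFrames, the sample size of 2, and the `random_state=42` seed are made up for illustration and are not part of the notebook's data. Passing `random_state` is optional; it makes `sample` return the same rows on every run, which can help when you want to reproduce a plot.

```python
import pandas as pd

# Toy stand-ins for dickens_df and eliot_df (hypothetical rows, not the real data)
toy_dickens = pd.DataFrame({
    "author": ["dickens"] * 3,
    "title": ["bleak", "times", "cities"],
    "nouns": ["house door", "name", "residence father"],
    "verbs": ["bring open", "understand mean", "establish enter"],
})
toy_eliot = pd.DataFrame({
    "author": ["eliot"] * 3,
    "title": ["bede", "felix", "scenes"],
    "nouns": ["errand touch", "head birch", "coachman hand"],
    "verbs": ["come say", "say turn", "help go"],
})

# Step 1: take a fixed-size random sample from each author;
# random_state is an assumed seed so the sample is repeatable
dickens_part = toy_dickens.sample(2, random_state=42)
eliot_part = toy_eliot.sample(2, random_state=42)

# Step 2: stack the two samples into one DataFrame
combined = pd.concat([dickens_part, eliot_part])

# Step 3: keep only the category column and the text column
toy_nouns_df = combined[["author", "nouns"]]
print(toy_nouns_df.shape)  # (4, 2)
```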

In [8]:
# create samples
dickens_sample_df = dickens_df.sample(10000)
eliot_sample_df = eliot_df.sample(10000)

In [14]:
# combine DataFrames
df = pd.concat([dickens_sample_df, eliot_sample_df])

In [19]:
# drop all columns except 'author' and 'nouns'
nouns_df = df[['author','nouns']]
print(nouns_df.shape)
nouns_df.sample(5)

(20000, 2)


|  | author | nouns |
|---|---|---|
| 1028 | eliot | matter fact feeling fact position feeling pens... |
| 13147 | dickens | fellow part story man father childhood public |
| 10936 | eliot | downstair bonnet shawl girl rose girl hat froc... |
| 13576 | eliot | stepfather face mark hair none top head eyebro... |
| 17058 | eliot | property mother side blood money pity way man ... |


## Build corpus and visualize

Now that our data is in the shape we need, we can hand it over to `Scattertext` to do the heavy lifting. The code below follows `Scattertext`'s [documentation](https://github.com/JasonKessler/scattertext). First we create a `Scattertext` corpus, then we transform that corpus into an HTML-based visualization, and finally we display that visualization within our notebook. Note: you can also save the visualization as an HTML file.
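
To read the plot, it helps to know roughly what is being counted. The sketch below is not `Scattertext`'s implementation. It is a simplified, standard-library illustration on made-up rows: it counts how often each term appears under each author and drops terms whose total count falls below a threshold, which is loosely the role `minimum_term_frequency` plays in the call below (the library's exact filtering rule may differ). The function name `term_counts_by_category` and the toy rows are hypothetical.

```python
from collections import Counter


def term_counts_by_category(rows, min_total=1):
    """Count whitespace-separated terms per category.

    rows: iterable of (category, text) pairs.
    Returns {term: {category: count}} for terms whose total count
    across all categories is at least min_total.
    """
    per_category = {}
    for category, text in rows:
        per_category.setdefault(category, Counter()).update(text.split())

    # Total count of each term across every category
    totals = Counter()
    for counts in per_category.values():
        totals.update(counts)

    return {
        term: {cat: per_category[cat][term] for cat in sorted(per_category)}
        for term, total in totals.items()
        if total >= min_total
    }


# Hypothetical rows in the same (author, nouns) shape as nouns_df
toy_rows = [
    ("eliot", "mother side money"),
    ("eliot", "mother face hair"),
    ("dickens", "man father money"),
    ("dickens", "man house door"),
]

counts = term_counts_by_category(toy_rows, min_total=2)
for term in sorted(counts):
    print(term, counts[term])
```

In this toy data, `mother` is used only by one author and `money` is used equally by both. In the real plot, terms of the first kind tend to sit toward one author's corner, while terms of the second kind fall near the diagonal.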

In [20]:
# create a scattertext corpus
corpus = st.CorpusFromPandas(nouns_df, category_col='author', text_col='nouns').build()

In [21]:
# transform corpus into html-based visualization with scattertext
html = st.produce_scattertext_explorer(corpus,
                                       category='eliot',  # this sets the y-axis
                                       category_name='Eliot', # label y-axis
                                       not_category_name='Dickens',  # label x-axis
                                       minimum_term_frequency=20,
                                       width_in_pixels=900)

In [23]:
# display visualization in notebook
HTML(html)

In [24]:
# Note: You can save this visualization as an html file
file_name = 'example.html'
with open(file_name, encoding='utf8', mode='w') as f:
  f.write(html)

## Compare using adjectives

We have compared Dickens and Eliot on the basis of the nouns they used. It might also be informative to compare them on the basis of the adjectives they used.

Starting with our initial datasets, `dickens_df` and `eliot_df`, use `Scattertext` to compare the adjectives used by these authors.
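
Because the adjective comparison repeats the noun workflow with a different column, one option is to wrap the data preparation in a small helper. The sketch below is one possible way to organize the code, not part of `Scattertext`: `prepare_pos_frame`, its parameters, and the toy frames are hypothetical. It also drops rows whose chosen column is missing, on the assumption that passages without any adjectives may have been read from the CSV as missing values.

```python
import pandas as pd


def prepare_pos_frame(frames, pos_col, n=None, seed=None):
    """Sample each frame, combine them, and keep 'author' plus one column.

    frames:  list of DataFrames that each have 'author' and pos_col columns
    pos_col: name of the part-of-speech column to keep, e.g. 'adjectives'
    n:       rows to sample from each frame (None keeps every row)
    seed:    optional random_state so the sample is repeatable
    """
    parts = []
    for frame in frames:
        part = frame if n is None else frame.sample(n, random_state=seed)
        parts.append(part)
    combined = pd.concat(parts)
    # Drop rows where the chosen column is missing (NaN)
    return combined[["author", pos_col]].dropna(subset=[pos_col])


# Hypothetical toy frames standing in for dickens_df and eliot_df
toy_dickens = pd.DataFrame({
    "author": ["dickens", "dickens", "dickens"],
    "nouns": ["name", "house door", "friend hand"],
    "adjectives": ["bad", None, "dear temporary"],
})
toy_eliot = pd.DataFrame({
    "author": ["eliot", "eliot"],
    "nouns": ["errand touch", "step door"],
    "adjectives": ["old", "narrow clean"],
})

toy_adjectives_df = prepare_pos_frame([toy_dickens, toy_eliot], "adjectives")
print(toy_adjectives_df.shape)  # (4, 2)
```

With the real data, the call might look like `prepare_pos_frame([dickens_df, eliot_df], 'adjectives', n=10000, seed=42)`, and the result could be passed to `st.CorpusFromPandas` in the same way as in the noun example.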

In [25]:
# create samples
dickens_sample_df = dickens_df.sample(10000)
eliot_sample_df = eliot_df.sample(10000)

In [26]:
# combine DataFrames
df = pd.concat([dickens_sample_df, eliot_sample_df])

In [27]:
# drop all columns except 'author' and 'adjectives'
adjectives_df = df[['author','adjectives']]
print(adjectives_df.shape)
adjectives_df.sample(5)

(20000, 2)


|  | author | adjectives |
|---|---|---|
| 10002 | eliot | other |
| 14827 | eliot | handsome spacious first conspicuous |
| 3302 | dickens | sure dear |
| 128 | eliot | present certain uncomfortable |
| 14058 | eliot | great many little nice coloured sweet great good |


In [29]:
# create a scattertext corpus
corpus = st.CorpusFromPandas(adjectives_df, category_col='author', text_col='adjectives').build()

In [30]:
# transform corpus into html-based visualization with scattertext
html = st.produce_scattertext_explorer(corpus,
                                       category='eliot',  # this sets the y-axis
                                       category_name='Eliot', # label y-axis
                                       not_category_name='Dickens',  # label x-axis
                                       minimum_term_frequency=20,
                                       width_in_pixels=900)

In [32]:
# save this visualization as an html file
file_name = 'adjectives.html'
with open(file_name, encoding='utf8', mode='w') as f:
  f.write(html)

# display visualization in notebook
# (this must be the last expression in the cell for Jupyter to render it)
HTML(html)