We need a comprehensive list of stop words. To keep our study reproducible, we build this list from a defined set of rules: any word that fits one of the rules is treated as a stop word and removed from our corpus.

We begin by getting the standard list of stop words from the nltk package:

In [32]:
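import re

import nltk
import pandas as pd
from nltk.corpus import stopwords

# download the stop word corpus (does nothing if it is already up to date)
nltk.download('stopwords')

# load and print the standard English stop words
stopWords = stopwords.words('english')
for w in stopWords:
    print(w)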

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/samanthagarland/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
i
me
my
myself
we
our
ours
ourselves
you
you're
you've
you'll
you'd
your
yours
yourself
yourselves
he
him
his
himself
she
she's
her
hers
herself
it
it's
its
itself
they
them
their
theirs
themselves
what
which
who
whom
this
that
that'll
these
those
am
is
are
was
were
be
been
being
have
has
had
having
do
does
did
doing
a
an
the
and
but
if
or
because
as
until
while
of
at
by
for
with
about
against
between
into
through
during
before
after
above
below
to
from
up
down
in
out
on
off
over
under
again
further
then
once
here
there
when
where
why
how
all
any
both
each
few
more
most
other
some
such
no
nor
not
only
own
same
so
than
too
very
s
t
can
will
just
don
don't
should
should've
now
d
ll
m
o
re
ve
y
ain
aren
aren't
couldn
couldn't
didn
didn't
doesn
doesn't
hadn
hadn't
hasn
hasn't
haven
haven't
isn
isn't
ma
mightn
mightn't
mustn
mustn't
needn
ne

Are there any rules that seem to sum up this list?

- pronouns
- prepositions
- single letters
- contractions built from the above categories (e.g., "you're", "she's")
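
As a rough illustration, the categories above could be written as an explicit predicate, so that each stop word can be traced back to the rule it satisfies. This is only a sketch: the category sets below are tiny hypothetical samples rather than complete word lists, the function name `matching_rule` is our own, and the fourth category is read here as contractions.

```python
import string

# Hypothetical, deliberately tiny category sets, for illustration only.
PRONOUNS = {"i", "me", "you", "he", "she", "it", "we", "they"}
PREPOSITIONS = {"of", "at", "by", "for", "with", "in", "on", "to"}
CONTRACTION_SUFFIXES = ("'re", "'ve", "'ll", "'d", "'s")


def matching_rule(word):
    """Return the name of the first rule the word satisfies, or None."""
    w = word.lower()
    if len(w) == 1 and w in string.ascii_lowercase:
        return "single letter"
    if w in PRONOUNS:
        return "pronoun"
    if w in PREPOSITIONS:
        return "preposition"
    # A contraction built on a word from one of the categories above.
    for suffix in CONTRACTION_SUFFIXES:
        if w.endswith(suffix) and w[: -len(suffix)] in PRONOUNS | PREPOSITIONS:
            return "contraction"
    return None


for candidate in ["she", "with", "y", "you're", "prostate"]:
    print(candidate, "->", matching_rule(candidate))
```

Returning the rule name rather than a plain True/False makes it easy to audit why a given word was removed.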

In our research we remove punctuation from the transcripts, so we strip the apostrophes from the NLTK stop words now; otherwise a stop word such as "don't" would never match the cleaned token "dont":

In [33]:
# remove apostrophes and store the result as a set (duplicates collapse)
stopWords = {word.replace("'", "") for word in stopWords}
for word in stopWords:
    print(word)

up
their
y
couldnt
weren
ve
shouldn
wont
had
about
didnt
thatll
yourselves
have
down
dont
didn
wasnt
him
whom
any
m
its
themselves
as
werent
or
this
o
of
few
wouldnt
after
himself
at
when
mightn
them
needn
above
such
doing
you
with
why
more
am
over
not
isnt
did
mightnt
itself
he
who
your
shes
other
neednt
couldn
s
now
below
by
hasnt
from
been
haven
they
having
through
it
should
i
being
a
against
and
ourselves
between
some
very
is
hadnt
on
her
but
d
hadn
these
can
to
out
ours
before
mustn
all
once
nor
same
are
each
has
the
myself
only
most
me
youre
off
both
doesnt
while
will
ma
own
for
if
won
where
ain
there
does
were
we
shouldve
too
during
arent
mustnt
wasn
those
into
just
how
ll
in
hasn
that
re
wouldn
shouldnt
doesn
further
theirs
my
herself
no
youll
hers
under
here
our
havent
which
an
his
shan
do
isn
youve
so
yourself
again
shant
was
she
until
be
don
what
then
aren
than
because
yours
t
youd
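
One way to keep the stop words and the corpus in step is to pass both through the same normalization function. The sketch below assumes that "removing punctuation" means deleting every character in `string.punctuation`; the helper name `normalize` and the toy word lists are ours and are not part of the study's pipeline.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(token):
    """Lowercase a token and strip ASCII punctuation from it."""
    return token.lower().translate(_PUNCT_TABLE)


raw_stop_words = ["you're", "don't", "the"]          # toy stand-in for the NLTK list
raw_tokens = ["You're", "fine,", "don't", "worry."]  # toy stand-in for transcript text

stop_set = {normalize(w) for w in raw_stop_words}
kept = [normalize(t) for t in raw_tokens if normalize(t) not in stop_set]
print(kept)  # ['fine', 'worry']
```

Note that `string.punctuation` only covers ASCII characters, so typographic (curly) apostrophes would need to be handled separately.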


Since the NLTK list does not include every single letter, we add the missing ones now:

In [34]:
import string

# all lowercase letters a-z; letters already in the set are unaffected
singleLetters = list(string.ascii_lowercase)
stopWords.update(singleLetters)

Now let's look at the most frequent words in our doctor/patient conversations and discern a few categories from there.

In [35]:
from collections import Counter

# read in the dataframe
transcript_df = pd.read_csv("/Users/samanthagarland/Downloads/processed_transcripts1.csv")

# select conversation 1
t1 = transcript_df["Convo_1"]

# for simple frequency analysis, we assemble all conversations into one long string;
# non-string entries (e.g., NaN) are skipped, and entries are joined with a space so
# that the last word of one entry does not fuse with the first word of the next
all_conversations_string = " ".join(s for s in t1 if isinstance(s, str))

# assemble a lowercased list of all words, dropping the empty strings
# that split(" ") produces for repeated spaces
all_words = [word.lower() for word in all_conversations_string.split(" ") if word]

# create a counter to find the most frequent words
counts = Counter(all_words)

# take the k most frequent words ...
k = 500
most_common = counts.most_common(k)

# ... and keep only those that are not already stop words
topWords = [(word, num) for word, num in most_common if word not in stopWords]
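
Because the cell above filters after taking the top k words, topWords ends up with fewer than k entries. If exactly k non-stop words were wanted instead, one alternative would be to filter before counting. The sketch below shows that variant on toy data; the function name `top_k_content_words` and the toy inputs are hypothetical.

```python
from collections import Counter


def top_k_content_words(tokens, stop_words, k):
    """Return the k most frequent tokens that are not stop words."""
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(k)


toy_tokens = "the psa is low and the psa is stable so the risk is low".split()
toy_stop_words = {"the", "is", "and", "so"}
print(top_k_content_words(toy_tokens, toy_stop_words, 2))
```

For our purposes either order works, since we only inspect the frequent words by eye to discover new stop word categories.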

Which of the k most frequent words are not already stop words? Are they useful? Let's see...

In [36]:
for word, num in topWords:
    print(word, num)

pt 52917
md 49270
okay 22986
know 20976
um 20477
yeah 16969
prostate 12863
cancer 11883
right 11487
surgery 10300
thats 9885
like 9541
radiation 9335
would 9301
well 8361
get 7807
hmm 7691
uh 7172
think 7045
one 6550
im 6087
go 5644
good 5365
theres 5034
kind 4907
going 4877
mm 4845
see 4726
risk 4367
back 4361
want 4192
oth 4157
treatment 4131
little 4089
time 3991
biopsy 3948
mean 3887
say 3826
people 3734
oh 3720
take 3574
years 3559
things 3501
really 3423
probably 3357
something 3309
thing 2979
psa 2941
side 2920
make 2915
sure 2856
talk 2715
two 2656
low 2591
could 2567
alright 2567
got 2559
lot 2517
come 2401
bit 2387
way 2373
gleason 2370
said 2360
need 2331
months 2270
men 2265
done 2205
give 2173
anything 2148
weeks 2144
much 2075
look 2035
gonna 2030
year 1998
long 1992
ill 1959
yes 1942
still 1940
day 1925
may 1899
ah 1866
tell 1845
effects 1841
bladder 1814
actually 1804
pretty 1792
blood 1783
dr 1776
three 1755
six 1752
put 1744
disease 1712
even 1695
surveillance 1684
fi

Many of these words come from the transcript legend (for example, "pt", "md", and "oth"). Let's collect those in a list and add them to stopWords.

In [37]:
legend = set(["pt", "md", "oth", "so", "legend", "inaudible", "phi", "laughs", "pt/so"])
stopWords = stopWords.union(legend)
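
Before committing to a list like this, it can help to check how often each candidate term actually occurs. The sketch below uses a toy `Counter` in place of the `counts` object built earlier; the numbers and the candidate list are made up for illustration.

```python
from collections import Counter

# Toy token counts standing in for the `counts` object built earlier.
toy_counts = Counter({"pt": 5, "md": 4, "oth": 1, "prostate": 3})

# Candidate legend terms; which ones belong here is an assumption of this sketch.
candidate_legend = ["pt", "md", "oth", "inaudible"]

for term in candidate_legend:
    # A Counter returns 0 for keys it has never seen, so no KeyError is raised.
    print(term, toy_counts[term])

# Terms that never occur are harmless as stop words but worth knowing about.
unused = [t for t in candidate_legend if toy_counts[t] == 0]
print("never observed:", unused)
```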

The frequent words also include many filler words, which we add to stopWords as well.

In [38]:
filler = set(["um", "hmm", "uh", "mm", "uhhuh", "uhmmm", "nah", "whatnot", "mhmmm", "uhhmm",
              "othumhmm", "mmhmm", "umhmm", "oh", "ah", "hm", "ok", "okay", "kay", "umm",
              "gee", "yeah", "yep", "huh", "ya", "mmm", "hum", "kinda", "like", "right"])
stopWords = stopWords.union(filler)
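
Since transcribers spell hesitation sounds in many ways ("uhmmm", "mhmmm", "uhhmm"), a pattern-based rule could complement the hand-written list. The sketch below assumes, purely for illustration, that any token of two or more characters drawn only from the letters u, h, and m is a filler. This rule is our own guess, not part of the study, and it would also catch real words such as "hum" or "mum".

```python
import re

# Assumed rule: a token of two or more characters drawn only from u, h and m.
FILLER_PATTERN = re.compile(r"[uhm]{2,}")


def looks_like_filler(word):
    """True if the whole word matches the assumed filler pattern."""
    return FILLER_PATTERN.fullmatch(word.lower()) is not None


for w in ["um", "mmhmm", "uhhuh", "uhmmm", "okay", "prostate", "m"]:
    print(w, looks_like_filler(w))
```

Fillers such as "okay", "yeah", or "oh" fall outside this pattern and would still have to be listed explicitly.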

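The frequent words also contain further pronoun-based contractions and other common function words (for example, "im", "thats", and "would") that the NLTK list does not cover. We collect them below and add them to stopWords.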

In [39]:
pr = set(["would", "well", "im", "itd", "theyll", "go", "theres", "em", "thats", "could", "gonna",
          "ill", "theyre", "ive", "us", "cant", "id", "lets", "hes", "wed", "weve", "came", "sounds"])
stopWords = stopWords.union(pr)

Our stop words now include words from the following categories:
- pronouns
- prepositions
- single letters
- contractions built from the above categories
- filler words ("um", "hmm", "ok", "yep", etc.)
- legend-specific words ("pt", "md", etc.)
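
With the categories settled, applying the list to a transcript line amounts to a membership test per token. The sketch below shows this on a made-up line with a small toy stop word set; the function name `remove_stop_words` is hypothetical, and it assumes the text has already had its punctuation removed.

```python
def remove_stop_words(text, stop_words):
    """Split text on whitespace, lowercase it, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


toy_stop_words = {"pt", "md", "um", "the", "is", "i", "so", "okay"}
line = "MD okay so the biopsy is um negative"
print(remove_stop_words(line, toy_stop_words))  # ['biopsy', 'negative']
```

Keeping the stop words in a set makes each lookup constant time, which matters when filtering a corpus of this size.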

In [40]:
# sorted() accepts any iterable and returns a list, so no list() call is needed
stopWords = sorted(stopWords)
for w in stopWords:
    print(w)


a
about
above
after
again
against
ah
ain
all
am
an
and
any
are
aren
arent
as
at
b
be
because
been
before
being
below
between
both
but
by
c
came
can
cant
could
couldn
couldnt
d
did
didn
didnt
do
does
doesn
doesnt
doing
don
dont
down
during
e
each
em
f
few
for
from
further
g
gee
go
gonna
h
had
hadn
hadnt
has
hasn
hasnt
have
haven
havent
having
he
her
here
hers
herself
hes
him
himself
his
hm
hmm
how
huh
hum
i
id
if
ill
im
in
inaudible
into
is
isn
isnt
it
itd
its
itself
ive
j
just
k
kay
kinda
l
laughs
legend
lets
like
ll
m
ma
md
me
mhmmm
mightn
mightnt
mm
mmhmm
mmm
more
most
mustn
mustnt
my
myself
n
nah
needn
neednt
no
nor
not
now
o
of
off
oh
ok
okay
on
once
only
or
oth
other
othumhmm
our
ours
ourselves
out
over
own
p
phi
pt
pt/so
q
r
re
right
s
same
shan
shant
she
shes
should
shouldn
shouldnt
shouldve
so
some
sounds
such
t
than
that
thatll
thats
the
their
theirs
them
themselves
then
there
theres
these
they
theyll
theyre
this
those
through
to
too
u
uh
uhhmm
uhhuh
uhmmm
um
umhmm
umm
under
u