# Stop Word Rules

We need a comprehensive list of stop words. To keep our study reproducible, we build this list from a defined set of rules: a word that fits one of the rules may be deemed a stop word and removed from our corpus.

We begin by getting the standard list of stop words from the nltk package:

In [1]:
import pandas as pd
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import re

stopWords = stopwords.words('english')
for w in stopWords:
    print(w)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/samanthagarland/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
i
me
my
myself
we
our
ours
ourselves
you
you're
you've
you'll
you'd
your
yours
yourself
yourselves
he
him
his
himself
she
she's
her
hers
herself
it
it's
its
itself
they
them
their
theirs
themselves
what
which
who
whom
this
that
that'll
these
those
am
is
are
was
were
be
been
being
have
has
had
having
do
does
did
doing
a
an
the
and
but
if
or
because
as
until
while
of
at
by
for
with
about
against
between
into
through
during
before
after
above
below
to
from
up
down
in
out
on
off
over
under
again
further
then
once
here
there
when
where
why
how
all
any
both
each
few
more
most
other
some
such
no
nor
not
only
own
same
so
than
too
very
s
t
can
will
just
don
don't
should
should've
now
d
ll
m
o
re
ve
y
ain
aren
aren't
couldn
couldn't
didn
didn't
doesn
doesn't
hadn
hadn't
hasn
hasn't
haven
haven't
isn
isn't
ma
mightn
mightn't
mustn
mustn't
needn
ne

Are there any rules that seem to sum up this list?

- pronouns
- prepositions
- single letters
- contractions of the above categories
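
Two of these rules depend only on a word's surface form, so they can be checked mechanically. The sketch below uses a small hand-picked sample from the output above rather than the full nltk list, and the function name `rough_category` is hypothetical. Pronouns and prepositions cannot be detected this way and would need a word list or a part-of-speech tagger.

```python
def rough_category(word):
    """Assign a word to a coarse category using only its surface form."""
    if len(word) == 1:
        return "single letter"
    if "'" in word:
        return "has apostrophe"
    return "other"

# A few words copied from the nltk output above
sample = ["i", "me", "you're", "s", "t", "don't", "of", "between"]

for w in sample:
    print(w, "->", rough_category(w))
```

Note that "i" counts as both a pronoun and a single letter, so the categories overlap.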

Because we remove punctuation from the corpus in our research, we also strip the apostrophes from these package words now:

In [2]:
stopWords = {word.replace("'", "") for word in stopWords}
for word in stopWords:
    print(word)

some
should
having
y
m
himself
through
it
youre
up
mustnt
is
them
further
werent
isnt
but
him
wouldnt
its
me
hasnt
isn
didnt
what
thatll
or
by
shouldve
hadn
here
ours
his
where
had
which
more
each
was
have
not
the
and
once
re
during
with
yourself
shouldnt
being
an
ll
why
herself
are
were
other
ourselves
after
youll
off
hasn
a
then
until
couldnt
if
under
our
d
hadnt
we
about
o
any
won
shouldn
yours
t
haven
myself
am
they
when
such
themselves
for
don
only
your
didn
can
as
ma
i
has
needn
weren
all
wasnt
of
shes
her
there
neednt
into
did
mightn
this
than
very
wouldn
mustn
yourselves
out
aren
doesn
because
nor
that
arent
s
against
just
do
she
ve
these
my
to
now
ain
doesnt
over
youd
their
does
at
wasn
how
before
shant
havent
between
who
so
itself
hers
couldn
down
he
shan
youve
above
below
been
in
wont
same
no
on
dont
be
those
own
again
theirs
most
while
doing
too
from
you
mightnt
few
will
both
whom
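
The cell above strips only apostrophes, which is all the nltk list needs. If other punctuation is removed from the corpus as well, the same normalization would have to be applied to every stop word added later. A minimal sketch, assuming ASCII punctuation as defined by `string.punctuation`; the helper name `strip_punctuation` is ours and not part of this notebook:

```python
import string

# Translation table that deletes every character in string.punctuation
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(word):
    """Return the word with all ASCII punctuation characters removed."""
    return word.translate(_PUNCT_TABLE)

examples = ["you're", "don't", "pt/so", "costa_rica"]
print([strip_punctuation(w) for w in examples])
```

`string.punctuation` includes `/` and `_`, so tokens such as "pt/so" and "costa_rica" would be altered. Whether that is wanted depends on how the corpus itself was cleaned.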


The package list does not include every letter of the alphabet, so we add the missing single letters now:

In [3]:
import string

# all 26 lowercase letters; letters already present in the set are left unchanged
singleLetters = list(string.ascii_lowercase)
stopWords.update(singleLetters)

Now let's look at the most frequent words in our doctor/patient conversations and discern a few categories from there.

In [4]:
# read in the dataframe
transcript_df = pd.read_csv("/Users/samanthagarland/Downloads/processed_transcripts1.csv")

# select conversation 1
t1 = transcript_df["Convo_1"]

# for simple frequency analysis, we assemble all conversations into one long string
all_conversations_string = ""
for s in t1:
    if isinstance(s, str):  # skip non-string entries (e.g. missing values)
        all_conversations_string += s

# assemble a lowercased list of all words, dropping the empty strings left by repeated spaces
all_words = []
for word in all_conversations_string.split(" "):
    if word != "":  # compare by value; `is not ""` tests identity and is unreliable
        all_words.append(word.lower())

#create a counter to find the most frequent words
from collections import Counter
counts = Counter(all_words)

k = 500

most_common = counts.most_common(k)
topWords = []
for word, num in most_common:
    if word not in stopWords:
        topWords.append((word, num))
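
The counting step can also be wrapped in a small function, which makes it easy to rerun as the stop word list grows. This is a sketch rather than the notebook's code: the name `top_non_stop_words` and the toy inputs are illustrative. It splits on any whitespace and counts each conversation separately, so the last word of one conversation cannot fuse with the first word of the next, and its counts may differ slightly from the cell above.

```python
from collections import Counter

def top_non_stop_words(texts, stop_words, k):
    """Return the (word, count) pairs among the k most frequent words
    that are not in stop_words."""
    counts = Counter()
    for text in texts:
        if isinstance(text, str):  # skip non-string entries such as NaN
            counts.update(text.lower().split())
    return [(w, n) for w, n in counts.most_common(k) if w not in stop_words]

# Toy stand-in for the transcript column
toy = ["PT um okay the prostate", "MD okay the surgery", float("nan")]
print(top_non_stop_words(toy, {"the", "pt", "md"}, k=5))
```

As in the cell above, filtering happens after taking the top k, so fewer than k words are returned whenever some of them are already stop words.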

Which of the k most frequent words are not already stop words? Are they useful? Let's see...

In [5]:
for word, num in topWords:
    print(word, num)

pt 52917
md 49270
okay 22986
know 20976
um 20477
yeah 16969
prostate 12863
cancer 11883
right 11487
surgery 10300
thats 9885
like 9541
radiation 9335
would 9301
well 8361
get 7807
hmm 7691
uh 7172
think 7045
one 6550
im 6087
go 5644
good 5365
theres 5034
kind 4907
going 4877
mm 4845
see 4726
risk 4367
back 4361
want 4192
oth 4157
treatment 4131
little 4089
time 3991
biopsy 3948
mean 3887
say 3826
people 3734
oh 3720
take 3574
years 3559
things 3501
really 3423
probably 3357
something 3309
thing 2979
psa 2941
side 2920
make 2915
sure 2856
talk 2715
two 2656
low 2591
could 2567
alright 2567
got 2559
lot 2517
come 2401
bit 2387
way 2373
gleason 2370
said 2360
need 2331
months 2270
men 2265
done 2205
give 2173
anything 2148
weeks 2144
much 2075
look 2035
gonna 2030
year 1998
long 1992
ill 1959
yes 1942
still 1940
day 1925
may 1899
ah 1866
tell 1845
effects 1841
bladder 1814
actually 1804
pretty 1792
blood 1783
dr 1776
three 1755
six 1752
put 1744
disease 1712
even 1695
surveillance 1684
fi

Many of these words come from the transcript legend. Let's collect those in a list and add them to stopWords.

In [6]:
legend = set(["interview_length", "significant_other", "pt", "doc", "md", "oth", "so", "legend", "inaudible", "phi", "laughs", "pt/so", "mdmd", "patient", "physician", "clean", "indecipherable"])
stopWords = stopWords.union(legend)

The list also includes many filler words, which we add to stopWords as well.

In [7]:
filler = set(["um", "uhmhmm", 'mmhmmm', "umhmmm", "mmmhmm", "lot", "mmkay", "yer", 'ummmm','mmmmm', "mhmm", "na", "mkay", "ohhhohohohoh", "whatever", "sorta", "uhum", "noooo", "jeez", "things", "thing", "ahhhh", "mmmm","ummm", "stuff", "yall", "hmm", "uh", "mm", "oooh", "uuh","uhhuh", "uhmmm", "nah", "whatnot", "mhmmm", "uhhmm", "othumhmm", "mmhmm", "umhmm","oh", "ah", "hm","ok", "okay", "kay", "umm","gee", "yeah","yep", "huh", "ya", "mmhmm", "mmm", "hum", "kinda", "like","right", "yup", "hi", "nope", "hey", "cuz", "mmhm", "mhm", "ahh", "hello", "gosh", "bye", "uhh", "er", "yea", "geez", "ohh", "heh", "ahhh", "ohhh", "aah", "yada", "whoa", "aaah", "okey", "dokey", "uhmhmmm", "whew", "unhunh","nooh", "nahuh", "ahhm", "yeyeah", "uhhhhh", "uhhm", "uhm", "hmmm", "eh", "ha", "yah", "mmmhmmm", "alrighty", "alright"])
stopWords = stopWords.union(filler)
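
Many of the fillers above are elongated spellings of the same sound ("ummm", "ummmm", "ahhh", "ahhhh"). One possible rule, not used in this notebook, is to collapse runs of a repeated letter before the lookup, so that a short seed list covers the variants. The helper names `collapse_repeats` and `is_filler` and the seed list are illustrative assumptions.

```python
import re

def collapse_repeats(word):
    """Collapse any run of the same character to a single character."""
    return re.sub(r"(.)\1+", r"\1", word)

# Hypothetical seed list; each entry is stored in its collapsed form
seed_fillers = {"um", "uh", "ah", "oh", "hm", "mhm"}
collapsed_seeds = {collapse_repeats(w) for w in seed_fillers}

def is_filler(word):
    """True if the word matches a seed filler once repeated letters are collapsed."""
    return collapse_repeats(word.lower()) in collapsed_seeds

for w in ["ummmm", "ahhhh", "mmhmm", "hmmm", "prostate"]:
    print(w, is_filler(w))
```

A rule like this can also catch real words that happen to collapse to a seed, so its matches should be inspected before they are added to the stop words.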

The list also contains further pronoun and preposition forms, along with contractions and a few common verbs, which we collect below and add to stopWords.

In [8]:
pr = set(["would", "well", "anyone", "shell", "hell", "etc", "say", "going", "look", "maam", "got","let", "also", "gotta", "thereof", "lot", "mustve", "said", "wheres", "see", "ow", "went", "get", "hows", "hed", "thereve", "st", "thatd", "goin", "theyve", "im", "itd", "theyll","go", "theres", "em","thats","could", "aint","gonna", "ill", "theyre", "ive", "us", "cant", "id", "lets", "hes", "wed", "weve", "came", "sounds", "whats", "hes", "thered","whatd", "doin", "mightve", "oughta", "gal", "whos", "itll", "'em", "wanna", "could"])
stopWords = stopWords.union(pr)

The transcripts should contain no names at all, but if we find any, we remove them along with the stop words.

In [9]:
names = set(["taiwan", "taiwanese", "communist", "henry", "walsh", "potter", "ohio", "hodgkins", "california", "florida", "alabama", "alaska", "germany", "europe", "michigan", "virginia", "swedish", "costa_rica", "greek", "african", "washington", "vietnam", "indianapolis"])
stopWords = stopWords.union(names)

We will also take out any misspelled words as we come across them.

In [10]:
misspell = set(["thew", "ne", "de", "leastno", "un", "ro", "imrt", "et", "tthe", "ti", "youyou", "rd", "aa", "yepvery", "ga", "nd", "ab", "nn", "gu", "alot", "ay", "le","anand", "andand", "thethe", "ifif", "itit", "ii", "iii", "thatsthat", "eek", "uhoh", "wewe", "isis", "nahuh", "yeyeah","th", "'", "'cause", "youl", "whatev"])
stopWords = stopWords.union(misspell)
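
Several entries above ("thethe", "andand", "itit") are a word typed twice with no space. A regular expression can flag such candidates for review. This is a sketch, not the procedure used here, and the name `is_doubled` is hypothetical.

```python
import re

# Matches a string made of some non-empty chunk immediately repeated once
_DOUBLED = re.compile(r"(.+)\1")

def is_doubled(word):
    """True if the whole word is one chunk written twice, e.g. 'thethe'."""
    return _DOUBLED.fullmatch(word) is not None

candidates = ["thethe", "andand", "youyou", "the", "biopsy", "mama"]
print([w for w in candidates if is_doubled(w)])
```

The pattern also matches legitimate words such as "mama", so the flagged words still need a manual check before being treated as misspellings.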

Our stop words now include words from the following categories:
- pronouns
- prepositions
- single letters
- contractions of the above categories
- filler words ("um", "hmm", "ok", "yep", etc.)
- legend-specific words ("pt", "md", etc.)
- names
- misspelled words
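
Keeping each category as a named set makes the final list easier to audit and to rebuild. The sketch below uses tiny stand-in sets rather than the full lists, and the names `categories`, `all_stop_words`, `remove_stop_words`, and `categories_of` are illustrative.

```python
# Tiny stand-in sets; the real lists are the ones built in the cells above
categories = {
    "filler": {"um", "hmm", "ok", "yep"},
    "legend": {"pt", "md"},
    "single_letter": {"a", "b"},
}

# Union of every category into one stop word set
all_stop_words = set().union(*categories.values())

def remove_stop_words(text, stop_words):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in stop_words]

def categories_of(word):
    """List the categories that contain the word, for auditing."""
    return sorted(name for name, words in categories.items() if word in words)

print(remove_stop_words("PT um the biopsy was ok", all_stop_words))
print(categories_of("md"))
```

With this layout, any removed word can be traced back to the rule that put it on the list.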

In [11]:
stopWords = sorted(stopWords)
for w in stopWords:
    print(w)
print(stopWords)

'
'cause
'em
a
aa
aaah
aah
ab
about
above
african
after
again
against
ah
ahh
ahhh
ahhhh
ahhm
ain
aint
alabama
alaska
all
alot
alright
alrighty
also
am
an
anand
and
andand
any
anyone
are
aren
arent
as
at
ay
b
be
because
been
before
being
below
between
both
but
by
bye
c
california
came
can
cant
clean
communist
costa_rica
could
couldn
couldnt
cuz
d
de
did
didn
didnt
do
doc
does
doesn
doesnt
doin
doing
dokey
don
dont
down
during
e
each
eek
eh
em
er
et
etc
europe
f
few
florida
for
from
further
g
ga
gal
gee
geez
germany
get
go
goin
going
gonna
gosh
got
gotta
greek
gu
h
ha
had
hadn
hadnt
has
hasn
hasnt
have
haven
havent
having
he
hed
heh
hell
hello
henry
her
here
hers
herself
hes
hey
hi
him
himself
his
hm
hmm
hmmm
hodgkins
how
hows
huh
hum
i
id
if
ifif
ii
iii
ill
im
imrt
in
inaudible
indecipherable
indianapolis
interview_length
into
is
isis
isn
isnt
it
itd
itit
itll
its
itself
ive
j
jeez
just
k
kay
kinda
l
laughs
le
leastno
legend
let
lets
like
ll
look
lot
m
ma
maam
md
mdmd
me
mhm
mhmm
mhmmm
