### Sample text analysis using spaCy

spaCy is a Python library for natural language processing that can help you with linguistic analyses.

To install spaCy and its English-language model, run these commands in your virtual environment:
- `pip3 install spacy`
- `python3 -m spacy download en_core_web_sm`

We will be importing the `text.txt` file in our `data` folder. It contains a sample article about a very special [cat](https://www.buzzfeednews.com/article/juliareinstein/this-thicc-lazy-high-maintenance-incredibly-well-hydrated/).

In [1]:
import spacy
import pandas as pd

In [2]:
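# Load the small English pipeline (tokenizer, tagger, parser, NER)
nlp = spacy.load('en_core_web_sm')

# Open the text file read-only (assuming it is UTF-8 encoded) and read it into a string;
# the `with` block closes the file once the text has been read
with open("../data/text.txt", "r", encoding="utf-8") as f:
    text = f.read()
len(text) # number of characters in the string, including spaces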

2990

Now let's pass the string to spaCy, which turns it into a `Doc` object we can analyze:

In [3]:
doc = nlp(text)
len(doc) # number of tokens in the document

724

The document behaves like a list of words. To get the text of each word or 'token', we can use the token's `.text` attribute:

In [4]:
for token in doc:
    print(token.text)

This
is
Bruno
,
and
he
’s
a
25
-
pound
cat
who
’s
currently
up
for
adoption
at
the
Wright
-
Way
Rescue
Adoption
Center
in
Morton
Grove
,
Illinois
.
Erin
Ellison
,
who
works
at
the
shelter
,
told
BuzzFeed
News
he
's
been
in
their
care
since
April
11
when
he
was
given
up
for
adoption
because
he
"
was
n't
meshing
well
with
young
kids
in
his
home
.
"



"
He
was
no
doubt
loved
by
his
former
family
but
maybe
a
little
too
much
,
"
she
said
.
"
He
needed
a
home
that
would
love
him
enough
to
help
him
trim
down
.
"



The
7
-
year
-
old
cat
is
polydactyl
,
meaning
he
has
a
few
extra
toes
.
He
also
has
a
strange
habit
of
standing
on
his
hind
legs
,
the
shelter
said
on
Facebook
.



“
This
usually
happens
when
I
want
food
.
No
,
my
foster
parents
did
not
teach
me
this
.
They
are
not
sure
how
I
learned
,
”
the
shelter
said
.



He
’s
now
on
a
diet
and
“
walking
,
playing
,
and
doing
tricks
”
so
he
can
lose
some
weight
.



Bruno
also
apparently
loves
to
be
petted
while
he
eats
.



“
It
took
my
fo
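Each token carries more than its text. As a minimal sketch (not part of the original notebook), the block below reads a few lexical flags from every token. It assumes spaCy is installed and uses a blank English pipeline, which only tokenizes, so it does not depend on the downloaded model. The names `nlp_blank`, `sample` and `token_info` are made up for this illustration.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough for this demo,
# so it runs without the en_core_web_sm download.
nlp_blank = spacy.blank("en")
sample = nlp_blank("This is Bruno, and he's a 25-pound cat.")

# Collect (text, is_alpha, is_punct) for every token:
# is_alpha is True for purely alphabetic tokens, is_punct for punctuation.
token_info = [(token.text, token.is_alpha, token.is_punct) for token in sample]

for text_, is_alpha, is_punct in token_info:
    print(repr(text_), is_alpha, is_punct)
```

Flags like these make it easy to tell real words apart from commas, hyphens and periods before counting anything.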

Now we can count some words by:
- turning the words into a list
- turning that list into a pandas data frame
- counting the values

In [5]:
# Collect the text of every token in a plain Python list
rows = [token.text for token in doc]

In [6]:
print(rows)

['This', 'is', 'Bruno', ',', 'and', 'he', '’s', 'a', '25', '-', 'pound', 'cat', 'who', '’s', 'currently', 'up', 'for', 'adoption', 'at', 'the', 'Wright', '-', 'Way', 'Rescue', 'Adoption', 'Center', 'in', 'Morton', 'Grove', ',', 'Illinois', '.', 'Erin', 'Ellison', ',', 'who', 'works', 'at', 'the', 'shelter', ',', 'told', 'BuzzFeed', 'News', 'he', "'s", 'been', 'in', 'their', 'care', 'since', 'April', '11', 'when', 'he', 'was', 'given', 'up', 'for', 'adoption', 'because', 'he', '"', 'was', "n't", 'meshing', 'well', 'with', 'young', 'kids', 'in', 'his', 'home', '.', '"', '\n\n', '"', 'He', 'was', 'no', 'doubt', 'loved', 'by', 'his', 'former', 'family', 'but', 'maybe', 'a', 'little', 'too', 'much', ',', '"', 'she', 'said', '.', '"', 'He', 'needed', 'a', 'home', 'that', 'would', 'love', 'him', 'enough', 'to', 'help', 'him', 'trim', 'down', '.', '"', '\n\n', 'The', '7', '-', 'year', '-', 'old', 'cat', 'is', 'polydactyl', ',', 'meaning', 'he', 'has', 'a', 'few', 'extra', 'toes', '.', 'He', 'a

In [7]:
word_dataframe = pd.DataFrame(rows)
word_dataframe.columns = ['word']
word_dataframe.head()

|   | word |
|---|------|
| 0 | This |
| 1 | is |
| 2 | Bruno |
| 3 | , |
| 4 | and |


In [8]:
word_count = word_dataframe['word'].value_counts().reset_index()
word_count.head()

|   | word | count |
|---|------|-------|
| 0 | , | 34 |
| 1 | . | 33 |
| 2 | I | 20 |
| 3 | the | 18 |
| 4 | `\n\n` | 18 |
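The most frequent entries here are punctuation marks and line breaks rather than words. One possible way to leave them out, sketched with only the standard library on a hypothetical list of token strings (`tokens` stands in for `rows`): keep a token only if it contains at least one letter or digit. Lowercasing before counting is an extra choice made for this sketch, not something the notebook does. spaCy tokens also expose `is_punct` and `is_space` flags that could serve the same purpose.

```python
from collections import Counter

def is_word(token: str) -> bool:
    """Return True if the token contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)

# Hypothetical sample of token strings, shaped like the `rows` list above.
tokens = ["This", "is", "Bruno", ",", "and", "he", "’s", "a", "25", "-",
          "pound", "cat", ".", "\n\n", "He", "is", "a", "cat", "."]

# Drop punctuation and whitespace tokens, then count case-insensitively.
counts = Counter(t.lower() for t in tokens if is_word(t))

for word, n in counts.most_common(3):
    print(word, n)
```

Contractions such as `’s` survive this filter because they contain a letter; whether to keep them is a judgment call for your analysis.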


In [9]:
word_count_alt = word_dataframe.groupby('word').agg({"word":"count"})
word_count_alt.head()

| word (index) | word |
|--------------|------|
| `\n` | 1 |
| `\n\n` | 18 |
| ! | 1 |
| " | 12 |
| 's | 1 |
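The two approaches give the same counts but order the rows differently: `value_counts()` sorts by frequency, while `groupby()` sorts by the grouping key. A small sketch on made-up data (the `tokens` list and the variable names are hypothetical) shows the difference. It uses `groupby(...).size()` instead of `.agg(...)` so that the count column gets its own name.

```python
import pandas as pd

# Hypothetical toy tokens standing in for the `rows` list built earlier.
tokens = ["the", "cat", ",", "the", "cat", "the", "."]
df = pd.DataFrame({"word": tokens})

# value_counts: rows sorted by frequency, most common first.
by_frequency = (
    df["word"].value_counts().rename_axis("word").reset_index(name="count")
)

# groupby + size: rows sorted by the grouping key (the word itself).
by_key = df.groupby("word").size().reset_index(name="count")

print(by_frequency)
print(by_key)
```

If you want the `groupby` result in frequency order as well, `by_key.sort_values("count", ascending=False)` is one way to get it.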


In [10]:
word_count.to_csv('../output/word_count.csv', index=False)
word_count_alt.to_csv('../output/word_count2.csv')