# Importing Libraries

In [1]:
import praw
import spacy

import config  # local module that provides client_id, client_secret and user_agent

reddit = praw.Reddit(client_id=config.client_id,
                     client_secret=config.client_secret,
                     user_agent=config.user_agent)

# First Steps with spaCy

In [4]:
nlp = spacy.load("en_core_web_sm")

In [8]:
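# Read the file as UTF-8 explicitly; the platform default encoding can garble characters such as en dashes.
with open("data\\wiki_us.txt", "r", encoding="utf-8") as f:
    text = f.read()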

## Setting up Doc Object

In [10]:
doc = nlp(text)

In [14]:
for sent in doc.sents:
    print(f">> {sent}")

>> The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
>> It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
>> At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
>> The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world.
>> The national capital is Washington, D.C., and the most populous city is New York.


>> Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
>> The United States emerged from the thirtee

## Token Attributes

In [27]:
sentence_1 = list(doc.sents)[0]
print(sentence_1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.


In [37]:
token_2 = sentence_1[2]
display(token_2)
display(token_2.text)        # the token's verbatim text
display(token_2.left_edge)   # leftmost token of this token's syntactic subtree
display(token_2.right_edge)  # rightmost token of this token's syntactic subtree (a "," here, because the subtree extends past the word itself)
display(token_2.ent_type_)   # named-entity type; "GPE" = geopolitical entity
display(token_2.ent_iob_)    # "B" = begins an entity, "I" = inside an entity, "O" = outside any entity
display(token_2.lemma_)      # lemma = base (dictionary) form of the word
display(token_2.pos_)        # coarse part-of-speech tag; "PROPN" = proper noun
display(token_2.dep_)        # syntactic dependency relation; "nsubj" = nominal subject

States

'States'

The

,

'GPE'

'I'

'States'

'PROPN'

'nsubj'
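
`left_edge` and `right_edge` are easier to read next to the token's subtree. The sketch below builds a tiny `Doc` by hand, so it runs without a trained pipeline. The words, heads, and dependency labels are made up for illustration (they are not model output), and the code assumes the spaCy v3 `Doc` constructor, where `heads` holds absolute token indices.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-annotated toy parse (illustrative only, not produced by a model).
words = ["The", "United", "States", "is", "large"]
heads = [2, 2, 3, 3, 3]  # index of each token's head; the root points to itself
deps = ["det", "compound", "nsubj", "ROOT", "acomp"]

toy_doc = Doc(Vocab(), words=words, heads=heads, deps=deps)
states = toy_doc[2]

subtree = [t.text for t in states.subtree]
print(subtree)                 # tokens governed by "States", itself included
print(states.left_edge.text)   # first token of that subtree
print(states.right_edge.text)  # last token of that subtree
```

In the real sentence above, the subtree of "States" reaches further right than the word itself, which is why `right_edge` lands on a comma.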

## Linguistic Annotations

In [38]:
text2 = "Paul enjoys progamming in Python."
doc2 = nlp(text2)
print(doc2)

Paul enjoys progamming in Python.


In [39]:
for token in doc2:
    print(token.text, token.pos_, token.dep_)

Paul PROPN nsubj
enjoys VERB ROOT
progamming VERB xcomp
in ADP prep
Python PROPN pobj
. PUNCT punct
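
The tag abbreviations printed above can be decoded with `spacy.explain`, which looks a label up in spaCy's built-in glossary and returns a short description, or `None` for a label it does not know. A minimal sketch; the `labels` list simply repeats some of the tags from the output above.

```python
import spacy

# Tags taken from the token printout above.
labels = ["PROPN", "VERB", "ADP", "nsubj", "xcomp", "pobj"]

for label in labels:
    # spacy.explain returns a human-readable description, or None if the label is unknown.
    print(f"{label:>6}  {spacy.explain(label)}")
```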


In [40]:
from spacy import displacy

In [45]:
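# In a notebook, displacy.render displays the visualization itself and returns None,
# so wrapping it in display() only adds a stray "None" to the output.
displacy.render(doc2, style="dep")
displacy.render(doc2, style="ent")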

In [48]:
display(list(doc.sents)[4])
# displacy.render shows the parse directly in the notebook; no display() wrapper is needed.
displacy.render(list(doc.sents)[4], style="dep")

The national capital is Washington, D.C., and the most populous city is New York.

## Named Entity Recognition

In [49]:
for ent in doc.ents:
    print(ent.text, ent.label_)

The United States of America GPE
U.S.A. GPE
USA GPE
the United States GPE
U.S. GPE
US GPE
America GPE
North America LOC
50 CARDINAL
five CARDINAL
326 CARDINAL
Indian NORP
3.8 million square miles QUANTITY
9.8 million square kilometers QUANTITY
fourth ORDINAL
United States GPE
Canada GPE
Mexico GPE
Bahamas GPE
Cuba GPE
more than 331 million CARDINAL
third ORDINAL
Washington GPE
D.C. GPE
New York GPE
Siberia LOC
North American NORP
at least 12,000 years ago DATE
European NORP
the 16th century DATE
The United States GPE
thirteen CARDINAL
British NORP
the East Coast LOC
Great Britain GPE
the American Revolutionary War ORG
1775–1783 CARDINAL
the late 18th century DATE
U.S. GPE
North America LOC
Native Americans NORP
1848 DATE
the United States GPE
United States GPE
the second half of the 19th century DATE
the American Civil War ORG
The Spanish–American War and World War EVENT
U.S. GPE
World War II EVENT
the Cold War EVENT
the United States GPE
the Korean War EVENT
the Vietnam War EVENT
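
A long entity list like the one above is easier to survey once it is tallied by label. A minimal sketch using `collections.Counter`; the `sample` pairs are copied from the output above, and `count_labels` is a hypothetical helper name introduced only for this example.

```python
from collections import Counter


def count_labels(pairs):
    """Tally entity labels from an iterable of (text, label) pairs."""
    return Counter(label for _, label in pairs)


# A few entities copied from the output above.
sample = [
    ("The United States of America", "GPE"),
    ("U.S.A.", "GPE"),
    ("North America", "LOC"),
    ("50", "CARDINAL"),
    ("Indian", "NORP"),
    ("fourth", "ORDINAL"),
    ("Canada", "GPE"),
]

label_counts = count_labels(sample)
for label, count in label_counts.most_common():
    print(label, count)
```

With the notebook's `doc`, the same helper could be fed `((ent.text, ent.label_) for ent in doc.ents)` to tally every entity at once.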


In [50]:
displacy.render(doc, style="ent")

## Word Vectors