### For this assignment you must, using the text provided: 

1. Create a Spacy doc from the text,
2. Print the toekn text, part of speech for each token in the doc 
3. Find and print any geographical entity mentioned in the doc
4. Use a RegEx to find any death count mentioned in the doc
5. Find the similarity between the entire doc and the doc "I am happy"

In [2]:
!python -m spacy download en_core_web_sm

[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')


## 1. Creating a Spacy doc from our text

In [0]:
# your solution here 
import spacy 

nlp = spacy.load('en_core_web_sm') # loading in the package we just downloaded...

doc = nlp("The number of confirmed and probable lung-injury cases linked to vaping increased to 1,299, including 26 deaths, the federal Centers for Disease Control and Prevention said Thursday. The count of cases rose by 219 from a week ago. The cases were spread across 49 states, the District of Columbia, and the U.S. Virgin Islands, and 26 people have died. Alaska is the only state without reported cases. Connecticut, Pennsylvania, Michigan, Massachusetts, New York and Texas confirmed deaths for the first time over the past week. Georgia and California confirmed an additional death each. Among the deaths recently reported was of a 17-year-old from New York City, one of the youngest people reported to have died due to vaping-related injury so far. The CDC’s count of vaping-related deaths didn’t include one reported Wednesday by Utah’s health department. It said a person under the age of 30 years had died at home, without being hospitalized. The victim died after vaping products containing THC, the psychoactive ingredient in marijuana. If confirmed by the CDC, the Utah death would raise the total number of vaping-related fatalities across the U.S. to 27. Investigators from the Food and Drug Administration are conducting a criminal probe into the supply chain for vaping products, while health authorities investigate what is causing the vaping-related illnesses. The authorities have found that, among the 573 patients who reported their vaping habits, 76% reported using products containing THC. Many had bought the products on the black market, according to previous reports. Yet health officials say they haven’t linked any one product or substance with all of the illnesses, as only a third of the patients have reported exclusive THC use and only 13% have reported exclusive nicotine-product use. As the numbers of injured have risen, health authorities have urged people to stop using electronic cigarettes, some highlighting THC-containing products specifically. Separately, states including Massachusetts, New York and Washington have taken steps to crack down on flavored e-cigarettes, which the Trump administration has also said it would take.")

## 2. Finding the token text and associated part of speech for each token in our doc

In [4]:
# your solution here
for token in doc:
    print(token.text, token.pos_)

The DET
number NOUN
of ADP
confirmed VERB
and CCONJ
probable ADJ
lung NOUN
- PUNCT
injury NOUN
cases NOUN
linked VERB
to ADP
vaping VERB
increased VERB
to ADP
1,299 NUM
, PUNCT
including VERB
26 NUM
deaths NOUN
, PUNCT
the DET
federal ADJ
Centers PROPN
for ADP
Disease PROPN
Control PROPN
and CCONJ
Prevention PROPN
said VERB
Thursday PROPN
. PUNCT
The DET
count NOUN
of ADP
cases NOUN
rose VERB
by ADP
219 NUM
from ADP
a DET
week NOUN
ago ADV
. PUNCT
The DET
cases NOUN
were VERB
spread VERB
across ADP
49 NUM
states NOUN
, PUNCT
the DET
District PROPN
of ADP
Columbia PROPN
, PUNCT
and CCONJ
the DET
U.S. PROPN
Virgin PROPN
Islands PROPN
, PUNCT
and CCONJ
26 NUM
people NOUN
have VERB
died VERB
. PUNCT
Alaska PROPN
is VERB
the DET
only ADJ
state NOUN
without ADP
reported VERB
cases NOUN
. PUNCT
Connecticut PROPN
, PUNCT
Pennsylvania PROPN
, PUNCT
Michigan PROPN
, PUNCT
Massachusetts PROPN
, PUNCT
New PROPN
York PROPN
and CCONJ
Texas PROPN
confirmed VERB
deaths NOUN
for ADP
the DET
first ADJ
t

## 3. Creating a set of each geopolitical entity mentioned in the article

In [7]:
# your solution here
for ent in doc.ents: # for each entity in our Doc...
    print(ent.text, ent.label_) # print it alongside its label

1,299 CARDINAL
26 CARDINAL
Centers for Disease Control and Prevention ORG
Thursday DATE
219 CARDINAL
a week ago DATE
49 CARDINAL
the District of Columbia GPE
the U.S. Virgin Islands GPE
26 CARDINAL
Alaska GPE
Connecticut GPE
Pennsylvania GPE
Michigan GPE
Massachusetts GPE
New York GPE
Texas GPE
first ORDINAL
the past week DATE
Georgia GPE
California GPE
New York City GPE
one CARDINAL
CDC ORG
Wednesday DATE
Utah GPE
the age of 30 years DATE
THC ORG
CDC ORG
Utah GPE
U.S. GPE
27 CARDINAL
the Food and Drug Administration ORG
573 CARDINAL
76% PERCENT
THC ORG
one CARDINAL
as only a third CARDINAL
THC ORG
only 13% PERCENT
THC ORG
Massachusetts GPE
New York GPE
Washington GPE
Trump ORG


## 4. Using a RegEx to find any mention of a death count

In [29]:
# your solution here 
import re

doc = nlp('The number of confirmed and probable lung-injury cases linked to vaping increased to 1,299, including 26 deaths, the federal Centers for Disease Control and Prevention said Thursday. The count of cases rose by 219 from a week ago. The cases were spread across 49 states, the District of Columbia, and the U.S. Virgin Islands, and 26 people have died. Alaska is the only state without reported cases. Connecticut, Pennsylvania, Michigan, Massachusetts, New York and Texas confirmed deaths for the first time over the past week. Georgia and California confirmed an additional death each. Among the deaths recently reported was of a 17-year-old from New York City, one of the youngest people reported to have died due to vaping-related injury so far. The CDC’s count of vaping-related deaths didn’t include one reported Wednesday by Utah’s health department. It said a person under the age of 30 years had died at home, without being hospitalized. The victim died after vaping products containing THC, the psychoactive ingredient in marijuana. If confirmed by the CDC, the Utah death would raise the total number of vaping-related fatalities across the U.S. to 27. Investigators from the Food and Drug Administration are conducting a criminal probe into the supply chain for vaping products, while health authorities investigate what is causing the vaping-related illnesses. The authorities have found that, among the 573 patients who reported their vaping habits, 76% reported using products containing THC. Many had bought the products on the black market, according to previous reports. Yet health officials say they haven’t linked any one product or substance with all of the illnesses, as only a third of the patients have reported exclusive THC use and only 13% have reported exclusive nicotine-product use. As the numbers of injured have risen, health authorities have urged people to stop using electronic cigarettes, some highlighting THC-containing products specifically. Separately, states including Massachusetts, New York and Washington have taken steps to crack down on flavored e-cigarettes, which the Trump administration has also said it would take.')

expression = r'death(s)' 

for match in re.finditer(expression, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    print("Found match:", span.text)
    
for ent in doc.ents:
    if ent.label_ == 'CARDINAL':  
      next_token = doc[ent.start + 1]
      if next_token.text in ('deaths','death','dead'):
        print(ent.text, next_token.text)
    

Found match: deaths
Found match: deaths
Found match: deaths
Found match: deaths
26 deaths


## 5. Finding the similarity between the entire doc and the doc "I am happy"

In [31]:
# your solution here
!sudo python -m spacy download en_core_web_md
!python -m spacy download en_core_web_md

Collecting en_core_web_md==2.1.0
[?25l  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.1.0/en_core_web_md-2.1.0.tar.gz (95.4MB)
[K     |████████████████████████████████| 95.4MB 12.0MB/s 
[?25hBuilding wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py) ... [?25l[?25hdone
  Created wheel for en-core-web-md: filename=en_core_web_md-2.1.0-cp36-none-any.whl size=97126237 sha256=b6ae40e797bcc6ff6d747c5a42ef8b1125677827a4254fe52657d8a397ed74e0
  Stored in directory: /tmp/pip-ephem-wheel-cache-hjrswecx/wheels/c1/2c/5f/fd7f3ec336bf97b0809c86264d2831c5dfb00fc2e239d1bb01
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-2.1.0
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_md')
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_m

In [34]:
import en_core_web_md # you only have to download it in the line above if you didn't earlier

nlp = en_core_web_md.load()

# compare two documents

doc1 = nlp('The number of confirmed and probable lung-injury cases linked to vaping increased to 1,299, including 26 deaths, the federal Centers for Disease Control and Prevention said Thursday. The count of cases rose by 219 from a week ago. The cases were spread across 49 states, the District of Columbia, and the U.S. Virgin Islands, and 26 people have died. Alaska is the only state without reported cases. Connecticut, Pennsylvania, Michigan, Massachusetts, New York and Texas confirmed deaths for the first time over the past week. Georgia and California confirmed an additional death each. Among the deaths recently reported was of a 17-year-old from New York City, one of the youngest people reported to have died due to vaping-related injury so far. The CDC’s count of vaping-related deaths didn’t include one reported Wednesday by Utah’s health department. It said a person under the age of 30 years had died at home, without being hospitalized. The victim died after vaping products containing THC, the psychoactive ingredient in marijuana. If confirmed by the CDC, the Utah death would raise the total number of vaping-related fatalities across the U.S. to 27. Investigators from the Food and Drug Administration are conducting a criminal probe into the supply chain for vaping products, while health authorities investigate what is causing the vaping-related illnesses. The authorities have found that, among the 573 patients who reported their vaping habits, 76% reported using products containing THC. Many had bought the products on the black market, according to previous reports. Yet health officials say they haven’t linked any one product or substance with all of the illnesses, as only a third of the patients have reported exclusive THC use and only 13% have reported exclusive nicotine-product use. As the numbers of injured have risen, health authorities have urged people to stop using electronic cigarettes, some highlighting THC-containing products specifically. Separately, states including Massachusetts, New York and Washington have taken steps to crack down on flavored e-cigarettes, which the Trump administration has also said it would take.')
doc2 = nlp('I am happy')

print(doc1.similarity(doc2))

0.5751000869904347
