### **spaCy Language Processing Pipelines: Exercises**

In [69]:
# Import the necessary libraries
import spacy

nlp = spacy.load("en_core_web_sm")  # load the pre-trained English pipeline

#### **Exercise: 1**

- Collect all the proper nouns from the given text in a list and count them.
- A **proper noun** is a noun that names a particular person, place, or thing.

In [70]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [71]:
# "Captain" is not a person's name here, so it is fine that it is not counted: the tagger seems to treat it as a common noun, not a proper noun.
text = '''Captain is huge. captain is awesome. Captain is good.

    Ravi and Raju are the best friends from school days.They wanted to go for a world tour and
visit famous cities like Paris, London, Dubai, Rome etc and also they called their another friend Mohan to take part of this world tour.
They started their journey from Hyderabad and spent next 3 months travelling all the wonderful cities in the world and cherish a happy moments!
'''

# https://spacy.io/usage/linguistic-features

# Create the Doc object by running the pipeline on the text
doc = nlp(text)

# First, print every token with its POS tag, for my own reference :)
for sentence in doc.sents:
  for token in sentence:
    print(token, '                ', token.pos_ , '           ', spacy.explain(token.pos_))

Captain                  NOUN             noun
is                  AUX             auxiliary
huge                  ADJ             adjective
.                  PUNCT             punctuation
captain                  NOUN             noun
is                  AUX             auxiliary
awesome                  ADJ             adjective
.                  PUNCT             punctuation
Captain                  NOUN             noun
is                  AUX             auxiliary
good                  ADJ             adjective
.                  PUNCT             punctuation


                      SPACE             space
Ravi                  PROPN             proper noun
and                  CCONJ             coordinating conjunction
Raju                  PROPN             proper noun
are                  AUX             auxiliary
the                  DET             determiner
best                  ADJ             adjective
friends                  NOUN             noun
from                 

In [72]:
proper_nouns_list = []
count_of_proper_nouns = 0

for sentence in doc.sents:
  for token in sentence:
    if token.pos_ == 'PROPN':
      proper_nouns_list.append(token)

count_of_proper_nouns = len(proper_nouns_list)
print(proper_nouns_list, count_of_proper_nouns)

[Ravi, Raju, Paris, London, Dubai, Rome, Mohan, Hyderabad] 8


In [73]:
proper_nouns_list = []
count_of_proper_nouns = 0

for sentence in doc.sents:
  for token in sentence:
    if token.pos_ == 'PROPN':
      proper_nouns_list.append(token.lemma_)

count_of_proper_nouns = len(proper_nouns_list)
print('Proper Nouns:', proper_nouns_list, '\nCount:', count_of_proper_nouns)

Proper Nouns: ['Ravi', 'Raju', 'Paris', 'London', 'Dubai', 'Rome', 'Mohan', 'Hyderabad'] 
Count: 8


In [75]:
# short version
p_n = [token.lemma_ for sentence in doc.sents for token in sentence if token.pos_ == 'PROPN']
print('Proper Nouns:', p_n, '\nCount:', len(p_n))

Proper Nouns: ['Ravi', 'Raju', 'Paris', 'London', 'Dubai', 'Rome', 'Mohan', 'Hyderabad'] 
Count: 8


**Expected Output**

Proper Nouns:  [Ravi, Raju, Paris, London, Dubai, Rome, Mohan, Hyderabad]

Count:  8
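
The count above is the number of proper-noun tokens, so a name that occurs twice in a text would be counted twice. If distinct names are wanted instead, something like `collections.Counter` could be used. The sketch below starts from the list printed by the cells above; the extra `'Paris'` at the end is a hypothetical addition (it is not in the exercise text) to make the difference visible.

```python
from collections import Counter

# Proper nouns as printed above, plus one hypothetical repeated name
# ("Paris" is appended only to illustrate duplicates).
names = ['Ravi', 'Raju', 'Paris', 'London', 'Dubai', 'Rome', 'Mohan', 'Hyderabad', 'Paris']

name_counts = Counter(names)          # maps each name to how often it occurs

print('Total mentions:', len(names))
print('Distinct names:', len(name_counts))
print('Most frequent:', name_counts.most_common(1))
```

With the original eight names and no repeats, both numbers would be the same.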


#### **Exercise: 2**

- Get all the company names from the given text and count them.
- **Hint**: Use spaCy's **ner** functionality.

In [76]:
text = '''The Top 5 companies in USA are Tesla, Walmart, Amazon, Microsoft, Google and the top 5 companies in
India are Infosys, Reliance, HDFC Bank, Hindustan Unilever and Bharti Airtel'''


doc = nlp(text)

for ent in doc.ents:
  print(ent, '      ', ent.label_, '         ', spacy.explain(ent.label_))

# Everything is recognized correctly except Bharti Airtel: the model picks up only "Bharti".

5        CARDINAL           Numerals that do not fall under another type
USA        GPE           Countries, cities, states
Tesla        ORG           Companies, agencies, institutions, etc.
Walmart        ORG           Companies, agencies, institutions, etc.
Amazon        ORG           Companies, agencies, institutions, etc.
Microsoft        ORG           Companies, agencies, institutions, etc.
Google        ORG           Companies, agencies, institutions, etc.
5        CARDINAL           Numerals that do not fall under another type
India        GPE           Countries, cities, states
Infosys        ORG           Companies, agencies, institutions, etc.
Reliance        ORG           Companies, agencies, institutions, etc.
HDFC Bank        ORG           Companies, agencies, institutions, etc.
Hindustan Unilever        ORG           Companies, agencies, institutions, etc.
Bharti        ORG           Companies, agencies, institutions, etc.


In [77]:
# Printing ent.text instead of ent does not solve the problem either.
for ent in doc.ents:
  print(ent.text, '      ', ent.label_, '         ', spacy.explain(ent.label_))

5        CARDINAL           Numerals that do not fall under another type
USA        GPE           Countries, cities, states
Tesla        ORG           Companies, agencies, institutions, etc.
Walmart        ORG           Companies, agencies, institutions, etc.
Amazon        ORG           Companies, agencies, institutions, etc.
Microsoft        ORG           Companies, agencies, institutions, etc.
Google        ORG           Companies, agencies, institutions, etc.
5        CARDINAL           Numerals that do not fall under another type
India        GPE           Countries, cities, states
Infosys        ORG           Companies, agencies, institutions, etc.
Reliance        ORG           Companies, agencies, institutions, etc.
HDFC Bank        ORG           Companies, agencies, institutions, etc.
Hindustan Unilever        ORG           Companies, agencies, institutions, etc.
Bharti        ORG           Companies, agencies, institutions, etc.


In [78]:
companies = []

for ent in doc.ents:
  if ent.label_ == 'ORG':
    print(ent.text)
    companies.append(ent.text)

Tesla
Walmart
Amazon
Microsoft
Google
Infosys
Reliance
HDFC Bank
Hindustan Unilever
Bharti


In [79]:
print('Company Names:', companies, '\nCount:', len(companies))

Company Names: ['Tesla', 'Walmart', 'Amazon', 'Microsoft', 'Google', 'Infosys', 'Reliance', 'HDFC Bank', 'Hindustan Unilever', 'Bharti'] 
Count: 10


In [80]:
# short version
companies = [ent.text for ent in doc.ents if ent.label_ == 'ORG']
print('Company Names:', companies, '\nCount:', len(companies))

Company Names: ['Tesla', 'Walmart', 'Amazon', 'Microsoft', 'Google', 'Infosys', 'Reliance', 'HDFC Bank', 'Hindustan Unilever', 'Bharti'] 
Count: 10


**Expected Output**


Company Names:  [Tesla, Walmart, Amazon, Microsoft, Google, Infosys, Reliance, HDFC Bank, Hindustan Unilever, Bharti Airtel]

Count:  10
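
The model output in this notebook lists `Bharti`, while the expected output lists `Bharti Airtel`. One possible way to close that gap is spaCy's rule-based `entity_ruler` component, which labels exact phrases as entities. The sketch below is only an illustration under some assumptions: it uses a blank English pipeline (`spacy.blank("en")`) so that it runs without the downloaded model, and the names `rule_nlp`, `rule_doc`, and `rule_companies` are made up for this example. With the loaded `en_core_web_sm` pipeline, the ruler would presumably be added with `nlp.add_pipe("entity_ruler", before="ner")` so that the rule takes effect before the statistical NER runs.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) keeps this sketch
# independent of the downloaded en_core_web_sm model.
rule_nlp = spacy.blank("en")

# The entity ruler labels exact phrase matches as entities.
ruler = rule_nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Bharti Airtel"}])

rule_doc = rule_nlp("India are Infosys, Reliance, HDFC Bank, Hindustan Unilever and Bharti Airtel")

# Only the phrase covered by the pattern is recognized here,
# because this blank pipeline has no statistical NER component.
rule_companies = [ent.text for ent in rule_doc.ents if ent.label_ == "ORG"]
print(rule_companies)
```

A hand-written pattern only covers the names it lists, so this is a patch for a known miss rather than a general fix for the model.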

## [**Solution**](./language_processing_exercise_solutions.ipynb)