In [1]:
# !pip install -U spacy
import spacy

In [2]:
!python -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.3.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz (12.0 MB)
[+] Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [3]:
# Load the small English model: https://spacy.io/models
# python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
nlp

<spacy.lang.en.English at 0x115c54dbfc8>

In [4]:
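# Parse text through the `nlp` model
# Note: each trailing backslash joins the next line with no space in between,
# which is why "offthousands" shows up as a single token in the outputs that follow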
my_text = """The economic situation of the country is on edge , as the stock market crashed causing loss of millions.\
Citizens who had their main investment in the share-market are facing a great loss. Many companies might lay off\
thousands of people to reduce labor cost"""

my_doc = nlp(my_text)
type(my_doc)

spacy.tokens.doc.Doc

In [5]:
piano_text = 'Gus is learning piano'
piano_doc = nlp(piano_text)
for token in piano_doc:
    print(token.text, token.tag_, token.head.text, token.dep_)

Gus NNP learning nsubj
is VBZ learning aux
learning VBG learning ROOT
piano NN learning dobj


In [6]:
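# displacy has to be imported before it can be used
from spacy import displacy
# Render the dependency parse inline in the notebook
displacy.render(piano_doc, style='dep', jupyter=True)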

## Tokenization with spaCy

In [5]:
# Printing the tokens of a doc
for token in my_doc:
  print(token.text)

The
economic
situation
of
the
country
is
on
edge
,
as
the
stock
market
crashed
causing
loss
of
millions
.
Citizens
who
had
their
main
investment
in
the
share
-
market
are
facing
a
great
loss
.
Many
companies
might
lay
offthousands
of
people
to
reduce
labor
cost


## Text-Preprocessing with spaCy

In [6]:
# Printing tokens and boolean values stored in different attributes
for token in my_doc:
    print(token.text,'--',token.is_stop,'---',token.is_punct)

The -- True --- False
economic -- False --- False
situation -- False --- False
of -- True --- False
the -- True --- False
country -- False --- False
is -- True --- False
on -- True --- False
edge -- False --- False
, -- False --- True
as -- True --- False
the -- True --- False
stock -- False --- False
market -- False --- False
crashed -- False --- False
causing -- False --- False
loss -- False --- False
of -- True --- False
millions -- False --- False
. -- False --- True
Citizens -- False --- False
who -- True --- False
had -- True --- False
their -- True --- False
main -- False --- False
investment -- False --- False
in -- True --- False
the -- True --- False
share -- False --- False
- -- False --- True
market -- False --- False
are -- True --- False
facing -- False --- False
a -- True --- False
great -- False --- False
loss -- False --- False
. -- False --- True
Many -- True --- False
companies -- False --- False
might -- True --- False
lay -- False --- False
offthousands -- False --

In [7]:
# Removing stop words and punctuation
my_doc_cleaned = [token for token in my_doc if not token.is_stop and not token.is_punct]

for token in my_doc_cleaned:
    print(token.text)

economic
situation
country
edge
stock
market
crashed
causing
loss
millions
Citizens
main
investment
share
market
facing
great
loss
companies
lay
offthousands
people
reduce
labor
cost


In [8]:
# Reading a long text on robotics into a spaCy doc
robotics_data= """Robotics is an interdisciplinary research area at the interface of computer science and engineering. Robotics involvesdesign, construction, operation, and use of robots. The goal of robotics is to design intelligent machines that can help and assist humans in their day-to-day lives and keep everyone safe. Robotics draws on the achievement of information engineering, computer engineering, mechanical engineering, electronic engineering and others.Robotics develops machines that can substitute for humans and replicate human actions. Robots can be used in many situations and for lots of purposes, but today many are used in dangerous environments(including inspection of radioactive materials, bomb detection and deactivation), manufacturing processes, or where humans cannot survive (e.g. in space, underwater, in high heat, and clean up and containment of hazardousmaterials and radiation). Robots can take on any form but some are made to resemble humans in appearance. This is said to help in the acceptance of a robot in 
certain replicative behaviors usually performed by people. Such robots attempt to replicate walking, lifting, speech, cognition, or any other human activity. Many of todays robots are inspired by nature, contributing to the field of bio-inspired 
robotics.The concept of creating machines that can operate autonomously dates back to classical times, but research into the functionality and potential uses of robots did not grow substantially until the 20th century. Throughout history, it has been frequently assumed by various scholars, inventors, engineers, and technicians that robots will one day be able to mimic human behavior and manage tasks in a human-like fashion. Today, robotics is a rapidly growing field, as technological advances continue; researching, designing, and building new robots serve various practical purposes, whether domestically, commercially, or militarily. Many robots are built to do jobs that are hazardous to people, such as defusing bombs, finding survivors in unstable ruins, and exploring mines and shipwrecks. Robotics is also used in STEM (science, technology, engineering, and mathematics) as a teaching aid. The advent of nanorobots, microscopic robots that can be injected into the human body, could revolutionize medicine and human health.Robotics is a branch of engineering that involves the conception, design, manufacture, and operation of robots. This field overlaps with computer engineering, computer science (especially artificial intelligence), electronics, mechatronics, nanotechnology and bioengineering.The word robotics was derived from the word robot, which was introduced to the public by Czech writer Karel Capek in his play R.U.R. (Rossums Universal Robots), whichwas published in 1920. The word robot comes from the Slavic word robota, which means slave/servant. The play begins in a factory that makes artificial people called robots, creatures who can be mistaken for humans – very similar to the modern ideas of androids. Karel Capek himself did not coin the word. He wrote a short letter in reference to an etymology in the 
Oxford English Dictionary in which he named his brother Josef Capek as its actual 
originator.According to the Oxford English Dictionary, the word robotics was first 
used in print by Isaac Asimov, in his science fiction short story "Liar!", 
published in May 1941 in Astounding Science Fiction. Asimov was unaware that he 
was coining the term  since the science and technology of electrical devices is 
electronics, he assumed robotics already referred to the science and technology 
of robots. In some of Asimovs other works, he states that the first use of the 
word robotics was in his short story Runaround (Astounding Science Fiction, March 
1942) where he introduced his concept of The Three Laws of Robotics. However, 
the original publication of "Liar!" predates that of "Runaround" by ten months, 
so the former is generally cited as the words origin.There are many types of robots; 
they are used in many different environments and for many different uses. Although 
being very diverse in application and form, they all share three basic similarities 
when it comes to their construction:Robots all have some kind of mechanical construction, a frame, form or shape designed to achieve a particular task. For example, a robot designed to travel across heavy dirt or mud, might use caterpillar tracks. The mechanical aspect is mostly the creators solution to completing the assigned task and dealing with the physics of the environment around it. Form follows function.Robots have electrical components which power and control the machinery. For example, the robot with caterpillar tracks would need some kind of power to move the tracker treads. That power comes in the form of electricity, which will have to travel through a wire and originate from a battery, a basic electrical circuit. Even petrol powered machines that get their power mainly from petrol still require an electric current to start the combustion process which is why most petrol powered machines like cars, have batteries. The electrical aspect of robots is used for movement (through motors), sensing (where electrical signals are used to measure things like heat, sound, position, and energy status) and operation (robots need some level of electrical energy supplied to their motors and sensors in order to activate and perform basic operations) All robots contain some level of computer programming code. A program is how a robot decides when or how to do something. In the caterpillar track example, a robot that needs to move across a muddy road may have the correct mechanical construction and receive the correct amount of power from its battery, but would not go anywhere without a program telling it to move. Programs are the core essence of a robot, it could have excellent mechanical and electrical construction, but if its program is poorly constructed its performance will be very poor (or it may not perform at all). There are three different types of robotic programs: remote control, artificial intelligence and hybrid. 
A robot with remote control programing has a preexisting set of commands that it will only perform if and when it receives a signal from a control source, typically a human being with a remote control. It is perhaps more appropriate to view devices controlled primarily by human commands as falling in the discipline of automation rather than robotics. Robots that use artificial intelligence interact with their environment on their own without a control source, and can determine reactions to objects and problems they encounter using their preexisting programming. Hybrid is a form of programming that incorporates both AI and RC functions.As more and more robots are designed for specific tasks this method of classification becomes more relevant. For example, many robots are designed for assembly work, which may not be readily adaptable for other applications. They are termed as "assembly robots". For seam welding, some suppliers provide complete welding systems with the robot i.e. the welding equipment along with other material handling facilities like turntables, etc. as an integrated unit. Such an integrated robotic system is called a "welding robot" even though its discrete manipulator unit could be adapted to a variety of tasks. Some robots are specifically designed for heavy load manipulation, and are labeled as "heavy-duty robots".one or two wheels. 
These can have certain advantages such as greater efficiency and reduced parts, as well as allowing a robot to navigate in confined places that a four-wheeled robot would not be able to.Two-wheeled balancing robots Balancing robots generally use a gyroscope to detect how much a robot is falling and then drive the wheels proportionally in the same direction, to counterbalance the fall at hundreds of times per second, based on the dynamics of an inverted pendulum.[71] Many different balancing robots have been designed.[72] While the Segway is not commonly thought of as a robot, it can be thought of as a component of a robot, when used as such Segway refer to them as RMP (Robotic Mobility Platform). An example of this use has been as NASA Robonaut that has been mounted on a Segway.One-wheeled balancing robots Main article: Self-balancing unicycle A one-wheeled balancing robot is an extension of a two-wheeled balancing robot so that it can move in any 2D direction using a round ball as its only wheel. Several one-wheeled balancing robots have been designed recently, such as Carnegie Mellon Universitys "Ballbot" that is the approximate height and width of a person, and Tohoku Gakuin University BallIP Because of the long, thin shape and ability to maneuver in tight spaces, they have the potential to function better than other robots in environments with people
"""

# Pass the Text to Model
robotics_doc = nlp(robotics_data)

print('Before PreProcessing n_Tokens: ', len(robotics_doc))

# Removing stopwords and punctuation from the doc.
robotics_doc=[token for token in robotics_doc if not token.is_stop and not token.is_punct]

print('After PreProcessing n_Tokens: ', len(robotics_doc))

Before PreProcessing n_Tokens:  1626
After PreProcessing n_Tokens:  744


## Lemmatization

In [14]:
# Lemmatizing the tokens of a doc
text='played plays playing. likes like'
doc=nlp(text)
for token in doc:
    print(token.lemma_)

play
play
play
.
like
like


## Lexical attributes of spaCy

In [21]:
text=' 2020 is far worse than 2009'
doc=nlp(text)
for token in doc:
    if token.like_num:
        print(token)

2020
2009


## Question
## Suppose you have a text document containing details of various employees.
## How would you extract all the employees' email addresses in order to send them a common email?
employee_text=""" name : Koushiki age: 45 email : koushiki@gmail.com
                 name : Gayathri age: 34 email: gayathri1999@gmail.com
                 name : Ardra age: 60 email : ardra@gmail.com
                 name : pratham parmar age: 15 email : parmar15@yahoo.com
                 name : Shashank age: 54 email: shank@rediffmail.com
                 name : Utkarsh age: 46 email :utkarsh@gmail.com"""
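
One possible answer to the question above, as a sketch: spaCy tokens carry a `like_email` lexical attribute, so the addresses can be collected by filtering the tokens of the doc. The blank English pipeline (`spacy.blank("en")`) and the names `nlp_blank`, `employee_doc` and `emails` are assumptions made for this illustration; the `nlp` model loaded earlier should behave the same way for this attribute.

```python
import spacy

# Assumption: a blank English pipeline is sufficient here, because
# `like_email` is a lexical attribute and does not need a trained model.
nlp_blank = spacy.blank("en")

employee_text = """ name : Koushiki age: 45 email : koushiki@gmail.com
                 name : Gayathri age: 34 email: gayathri1999@gmail.com
                 name : Ardra age: 60 email : ardra@gmail.com
                 name : pratham parmar age: 15 email : parmar15@yahoo.com
                 name : Shashank age: 54 email: shank@rediffmail.com
                 name : Utkarsh age: 46 email :utkarsh@gmail.com"""

employee_doc = nlp_blank(employee_text)

# Keep only the tokens that look like an email address
emails = [token.text for token in employee_doc if token.like_email]

for email in emails:
    print(email)
```

Filtering on a token attribute keeps the solution in the same style as the `like_num` example, with no regular expression to maintain.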

## Part of Speech analysis with spaCy

In [None]:
# POS tagging using spaCy
my_text='John plays basketball,if time permits. He played in high school too.'
my_doc=nlp(my_text)
for token in my_doc:
  print(token.text,'---- ',token.pos_)

John ----  PROPN
plays ----  VERB
basketball ----  NOUN
, ----  PUNCT
if ----  SCONJ
time ----  NOUN
permits ----  VERB
. ----  PUNCT
He ----  PRON
played ----  VERB
in ----  ADP
high ----  ADJ
school ----  NOUN
too ----  ADV
. ----  PUNCT


In [None]:
spacy.explain('SCONJ')

'subordinating conjunction'

In [None]:
# Raw text document
raw_text="""I liked the movies etc The movie had good direction  The movie was amazing i.e.
            The movie was average direction was not bad The cinematography was nice. i.e.
            The movie was a bit lengthy  otherwise fantastic  etc etc"""
# Creating a spacy object
raw_doc=nlp(raw_text)

# Checking if POS tag is X and printing them
print('The junk values are..')
for token in raw_doc:
    if token.pos_=='X':
        print(token.text)

print('After removing junk')
# Removing the tokens whose POS tag is X (junk).
clean_doc=[token for token in raw_doc if token.pos_ != 'X']
print(clean_doc)

The junk values are..
etc
i.e.
i.e.
etc
etc
After removing junk
[I, liked, the, movies, The, movie, had, good, direction,  , The, movie, was, amazing, 
            , The, movie, was, average, direction, was, not, bad, The, cinematography, was, nice, ., 
            , The, movie, was, a, bit, lengthy,  , otherwise, fantastic,  ]


## Named Entity Recognition

In [None]:
# Preparing the spaCy document
text='Tony Stark owns the company StarkEnterprises . Emily Clark works at Microsoft and lives in Manchester. She loves to read the Bible and learn French'
doc=nlp(text)

# Printing the named entities
print(doc.ents)
# Printing labels of entities.
for entity in doc.ents:
  print(entity.text,'--- ',entity.label_)

(Tony Stark, StarkEnterprises, Emily Clark, Microsoft, Manchester, Bible, French)
Tony Stark ---  PERSON
StarkEnterprises ---  ORG
Emily Clark ---  PERSON
Microsoft ---  ORG
Manchester ---  GPE
Bible ---  WORK_OF_ART
French ---  NORP


In [None]:
# Using displacy for visualizing NER
from spacy import displacy
displacy.render(doc,style='ent',jupyter=True)

## Extracting brand names with Named Entity Recognition

In [None]:
mobile_industry_article=""" 30 Major mobile phone brands Compete in India – A Case Study of Success and Failures
Is the Indian mobile market a terrible War Zone? We have more than 30 brands competing with each other. Let’s find out some insights about the world second-largest mobile bazaar.There is a massive invasion by Chinese mobile brands in India in the last four years. Some of the brands have been able to make a mark while others like Meizu, Coolpad, ZTE, and LeEco are a failure.On one side, there are brands like Sony or HTC that have quit from the Indian market on the other side we have new brands like Realme or iQOO entering the marketing in recent months.The mobile market is so competitive that some of the brands like Micromax, which had over 18% share back in 2014, now have less than 5%. Even the market leader Samsung with a 34% market share in 2014, now has a 21% share whereas Xiaomi has become a market leader. The battle is fierce and to sustain and scale-up is going to be very difficult for any new entrant.new comers in Indian Mobile MarketiQOO –They have recently (March 2020) launched the iQOO 3 in India with its first 5G phone – iQOO 3. The new brand is part of the Vivo or the BBK electronics group that also owns several other brands like Oppo, Oneplus and Realme.Realme – Realme launched the first-ever phone – Realme 1 in November 2018 and has quickly became a popular brand in India. The brand is one of the highest sellers in online space and even reached a 16% market share threatening Xiaomi’s dominance.iVoomi – In 2017, we have seen the entry of some new Chinese mobile brands likeiVoomi which focuses on the sub 10k price range, and is a popular online player. They have an association with Flipkart.Techno &amp; Infinix – Transsion Group’s Tecno and Infinix brands debuted in India in mid-2017 and are focusing on the low end and mid-range phones in the price range of Rs. 5000 to Rs. 12000.10.OR &amp; Lephone – 10.OR has a partnership with Amazon India and is an exclusive online brand with phones like 10.OR D, G and E. 
However, the brand is not very aggressive currently.Kult – Kult is another player who launched a very aggressively priced Kult Beyond mobile in 2017 and followed up by launching 2-3 more models.However, most of these new brands are finding it difficult to strengthen their footing in India. As big brands like Xiaomi leave no stone unturned to make things difficult.Also, it is worth noting that there is less Chinese players coming to India now. As either all the big brands have already set shop or burnt their hands and retreated to the homeland China.Chinese/ Global  Brands Which failed or are at the Verge of Failing in India?
There are a lot more failures in the market than the success stories. Let’s first look at the failures and then we will also discuss why some brands were able to succeed in India.HTC – The biggest surprise this year for me was the failure of HTC in India. The brand has been in the country for many years, in fact, they were the first brand to launch Android mobiles. Finally HTC decided to call it a day in July 2018.LeEco – LeEco looked promising and even threatening to Xiaomi when it came to India. The company launched a series of new phones and smart TVs at affordable rates. Unfortunately, poor financial planning back home caused the brand to fail in India too.LG – The company seems to have lost focus and are doing poorly in all segments. While the budget and mid-range offering are uncompetitive, the high-end models are not preferred by buyers.Sony – Absurd pricing and lack of ability to understand the Indian buyers have caused Sony to shrink mobile operations in India. In the last 2 years, there are far fewer launches and hardly any promotions or hype around the new products.Meizu – Meizu is also a struggling brand in India and is going nowhere with the current strategy. There are hardly any popular mobiles nor a retail presence.ZTE – The company was aggressive till last year with several new phones launching under the Nubia banner, but with recent issues in the US, they have even lost the plot in India.Coolpad – I still remember the first meeting with Coolpad CEO in Mumbai when the brand started operations. There were big dreams and ambitions, but the company has not been able to deliver and keep up with the rivals in the last 1 year.Gionee – Gionee was doing well in the retail, but the infighting in the company and loss of focus from the Chinese parent company has made it a failure. The company is planning a comeback. However, we will have to wait and see when that happens."""

In [None]:
# creating spacy doc
mobile_doc=nlp(mobile_industry_article)

# List to store name of mobile companies
list_of_org=[]

# Appending entities which have the label 'ORG' to the list
for entity in mobile_doc.ents:
  if entity.label_=='ORG':
    list_of_org.append(entity.text)

print(list_of_org)

['Meizu', 'Sony', 'Vivo', 'Xiaomi', 'Flipkart', 'Techno &amp', 'Infinix – Transsion Group', '12000.10.OR &amp', 'Lephone', 'Amazon India', 'Global  Brands', 'the Verge of Failing', 'Sony', 'Sony', 'Meizu', 'Meizu', 'Nubia']
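
The list above repeats some brands and also contains a few false positives. A small follow-up sketch, using only the standard library, counts how often each ORG string was found so that repeated mentions stand out. The list is copied verbatim from the output above; `org_counts` is a name introduced here for illustration.

```python
from collections import Counter

# ORG entity texts copied from the notebook output above
list_of_org = ['Meizu', 'Sony', 'Vivo', 'Xiaomi', 'Flipkart', 'Techno &amp',
               'Infinix – Transsion Group', '12000.10.OR &amp', 'Lephone',
               'Amazon India', 'Global  Brands', 'the Verge of Failing',
               'Sony', 'Sony', 'Meizu', 'Meizu', 'Nubia']

# Count how many times each entity text occurs
org_counts = Counter(list_of_org)

# Print the entity texts that were detected more than once
for name, count in org_counts.items():
    if count > 1:
        print(name, count)

print('Distinct ORG strings:', len(org_counts))
```

Counting does not remove the false positives (for example 'the Verge of Failing'), so a manual review or an allow-list of known brands would still be needed.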


## You come across many articles about theft and other crimes.
## When using them for a case study, you may need to avoid revealing the original names, companies and places. How can you do that?

## Write a function that scans the text for named entities labelled PERSON, ORG or GPE and replaces each of them with "UNKNOWN".
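
A minimal sketch of such a function is given here; it is one possible approach, not the only one. It walks over `doc.ents` and rebuilds the text using character offsets. The names `anonymize`, `SENSITIVE_LABELS`, `nlp_blank` and `demo_doc` are hypothetical. To keep the demo independent of any particular model's predictions, the entities are set by hand on a blank pipeline; with a loaded model such as `en_core_web_sm`, `doc.ents` would be filled in by the NER component instead.

```python
import spacy
from spacy.tokens import Span

# Entity labels to hide
SENSITIVE_LABELS = {"PERSON", "ORG", "GPE"}

def anonymize(doc):
    """Return doc.text with PERSON, ORG and GPE entities replaced by UNKNOWN."""
    pieces = []
    last_end = 0
    for ent in doc.ents:
        if ent.label_ in SENSITIVE_LABELS:
            # Copy the text between the previous entity and this one
            pieces.append(doc.text[last_end:ent.start_char])
            pieces.append("UNKNOWN")
            last_end = ent.end_char
    # Copy whatever follows the last replaced entity
    pieces.append(doc.text[last_end:])
    return "".join(pieces)

# Demo with hand-labelled entities (assumption: stands in for model output)
nlp_blank = spacy.blank("en")
demo_doc = nlp_blank("Emily Clark works at Microsoft and lives in Manchester.")
demo_doc.ents = [
    Span(demo_doc, 0, 2, label="PERSON"),  # Emily Clark
    Span(demo_doc, 4, 5, label="ORG"),     # Microsoft
    Span(demo_doc, 8, 9, label="GPE"),     # Manchester
]

print(anonymize(demo_doc))
```

Working with character offsets rather than token lists preserves the original spacing and punctuation of the text.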

## spaCy pipelines

In [None]:
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)

['tagger', 'parser', 'ner']


## Adding components to pipeline

In [None]:
# spaCy 2.x API (in spaCy 3.x, pass the component name instead: nlp.add_pipe('textcat'))
nlp.add_pipe(nlp.create_pipe('textcat'))
print(nlp.pipe_names)

['tagger', 'parser', 'ner', 'textcat']


In [None]:
nlp.remove_pipe("textcat")

('textcat', <spacy.pipeline.pipes.TextCategorizer at 0x21724d52948>)

In [None]:
print(nlp.pipe_names)

['tagger', 'parser', 'ner']


## Creating a custom pipeline component

In [None]:
def my_custom_component(doc):
    doc_length = len(doc)
    print(' The no of tokens in the document ', doc_length)
    # Collect the label of every named entity found so far in the pipeline
    entity_labels=[ent.label_ for ent in doc.ents]
    print(entity_labels)
    # Return the doc
    return doc


# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Add the component in the pipeline after ner
nlp.add_pipe(my_custom_component, after='ner')
print(nlp.pipe_names)

# Call the nlp object on your text
doc = nlp(" The Hindu Newspaper has increased the cost. I usually read the paper on my way to Delhi railway station")

['tagger', 'parser', 'ner', 'my_custom_component']
 The no of tokens in the document  21
['ORG', 'GPE']
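
The cell above uses the spaCy 2.x convention of passing the function object to `add_pipe`. As a hedged sketch, assuming spaCy 3.x is installed instead, a custom component is registered under a name with the `@Language.component` decorator and then added by that name. The component name `token_counter` and the variables `nlp_v3` and `doc_v3` are hypothetical, and a blank pipeline is used so that no model download is needed (which also means there is no `ner` component to position the new one after).

```python
import spacy
from spacy.language import Language

# Assumption: spaCy 3.x. Components are registered under a string name.
@Language.component("token_counter")
def token_counter(doc):
    print("Number of tokens in the document:", len(doc))
    # A component must return the doc so the next one can process it
    return doc

# Blank English pipeline: tokenizer only, no trained components
nlp_v3 = spacy.blank("en")

# In spaCy 3.x the component is added by its registered name
nlp_v3.add_pipe("token_counter", last=True)
print(nlp_v3.pipe_names)

doc_v3 = nlp_v3("I usually read the paper on my way to Delhi railway station")
```

With a trained pipeline that includes an entity recognizer, the same call accepts `after='ner'` in place of `last=True`.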


## Coreference resolution

In [None]:
# neuralcoref is compatible with spaCy 2.x only
import spacy
import neuralcoref
nlp = spacy.load('en_core_web_sm')

In [None]:
neuralcoref.add_to_pipe(nlp)

doc1 = nlp('My sister has a dog. She loves him.')
print(doc1._.coref_clusters)

[My sister: [My sister, She], a dog: [a dog, him]]


In [None]:
doc2 = nlp('Angela lives in Boston. She is quite happy in that city.')
print(doc2.ents)
for ent in doc2.ents:
    print(ent._.coref_cluster)

(Boston,)
Boston: [Boston, that city]
