# What's spaCy?

spaCy is a **free**, **open-source library** for advanced **Natural Language Processing** (NLP) in Python.

When you're working with a lot of text, you'll eventually want to know more about it. For example, what's it about? What do the words mean in context? Who is doing what to whom? What products and companies are mentioned in the text? Which texts are similar to each other?

spaCy is designed specifically for **production use** and helps you build applications that process and "understand" large volumes of text. It can be used to build **information extraction** or **natural language understanding** systems, or to pre-process text for **deep learning**.


## What spaCy isn't

- First, **spaCy is not a platform or an "API".** Unlike a platform, spaCy doesn't provide software as a service or a web application. It’s an open-source library designed to help you build NLP applications, not a consumable service.

- Second, **spaCy is not an out-of-the-box chat bot engine.** While spaCy can be used to power conversational applications, it’s not designed specifically for chat bots, and only provides the underlying text processing capabilities.

- Third, **spaCy is not research software.** It’s built on the latest research, but it’s designed to get things done. This leads to fairly different design decisions than NLTK or CoreNLP, which were created as platforms for teaching and research. The main difference is that spaCy is integrated and opinionated. spaCy tries to avoid asking the user to choose between multiple algorithms that deliver equivalent functionality. Keeping the menu small lets spaCy deliver generally better performance and developer experience.

- Fourth, **spaCy is not a company.** It’s an open-source library. The company publishing spaCy and other software is called Explosion AI.

## Installation

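spaCy is compatible with **64-bit CPython 2.7/3.5+** and runs on **Unix/Linux**, **macOS/OS X** and **Windows** (these Python versions apply to older releases; spaCy 3.x, which the cells below install, requires Python 3). The latest version of spaCy is available via pip and conda.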

 --> Installation with pip on Linux, Windows and macOS/OS X (-U upgrades an existing installation):
 
     pip install -U spacy
     
 --> Installation with conda on Linux, Windows and macOS/OS X:
 
     conda install -c conda-forge spacy



Once you’ve [downloaded and installed](https://spacy.io/usage/models) a model, you can load it via spacy.load(). This returns a *Language* object containing all the components and data needed to process text. We usually call it *nlp*. Calling the *nlp* object on a string of text returns a processed *Doc*:


In [4]:
!pip install spacy

Defaulting to user installation because normal site-packages is not writeable
Collecting spacy
  Downloading spacy-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.6 MB)
Collecting pathy>=0.10.0
  Downloading pathy-0.10.1-py3-none-any.whl (48 kB)
Collecting catalogue<2.1.0,>=2.0.6
  Using cached catalogue-2.0.8-py3-none-any.whl (17 kB)
Collecting thinc<8.2.0,>=8.1.0
  Downloading thinc-8.1.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (815 kB)
Collecting murmurhash<1.1.0,>=0.28.0
  Using cached murmurhash-1.0.9-cp310-cp310-manylinux_2_5_x86_64.manylinux

In [5]:
!python3 -m spacy download en_core_web_sm

2023-02-03 14:11:46.832545: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-02-03 14:11:46.871506: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-02-03 14:11:46.871523: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-02-03 14:11:48.680247: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2023-02-03 14:11:48.680266: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to c

In [6]:
# Tokenization: split the text into individual tokens

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I am hasmukh from india")
print(doc)
print(type(doc))

I am hasmukh from india
<class 'spacy.tokens.doc.Doc'>


In [7]:
for token in doc:
    print(token.text)

I
am
hasmukh
from
india


In [8]:
# Part-of-speech (POS) tagging

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I am hasmukh from Gujarat India")
print(doc)
for token in doc:
    print(token.text ,">>>>>",token.pos_)

I am hasmukh from Gujarat India
I >>>>> PRON
am >>>>> AUX
hasmukh >>>>> NOUN
from >>>>> ADP
Gujarat >>>>> PROPN
India >>>>> PROPN
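
The `pos_` values above are coarse-grained Universal POS tags. As a reading aid, here is a small hand-written glossary for the five tags that appear in this output. The dictionary `POS_DESCRIPTIONS` is an illustration and not part of spaCy; spaCy's own `spacy.explain()` function returns similar descriptions.

```python
# Hand-written glossary for the tags printed above (Universal POS tag set).
# This is an illustrative lookup, not something spaCy provides under this name.
POS_DESCRIPTIONS = {
    "PRON": "pronoun",
    "AUX": "auxiliary",
    "NOUN": "noun",
    "ADP": "adposition",
    "PROPN": "proper noun",
}

# (text, tag) pairs copied by hand from the output above; no model is loaded.
tagged = [
    ("I", "PRON"),
    ("am", "AUX"),
    ("hasmukh", "NOUN"),
    ("from", "ADP"),
    ("Gujarat", "PROPN"),
    ("India", "PROPN"),
]

for word, tag in tagged:
    print(f"{word:<10}{tag:<7}{POS_DESCRIPTIONS[tag]}")
```

Note that the lowercase "hasmukh" was tagged NOUN while the capitalized "Gujarat" and "India" were tagged PROPN, which suggests that capitalization influences the model's prediction.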


Using spaCy’s built-in **displaCy** visualizer, here’s what our example sentence and its dependencies look like:

In [10]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("i am hasmukh mer from india")
displacy.serve(doc, style="dep")




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


## Named Entities 

A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the ents property of a Doc:

In [17]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("i am hasmukh mer from india")

for ent in doc.ents:
    print(ent)
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

india
india 22 27 GPE


In [16]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("i am Hasmukh mer from india")

for ent in doc.ents:
    print(ent)
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Hasmukh mer
Hasmukh mer 5 16 PERSON
india
india 22 27 GPE
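
The `start_char` and `end_char` values in the output above are character offsets into the original string, with the end offset exclusive, so slicing the text with them recovers the entity. A minimal sketch in plain Python, with the offsets copied by hand from the output above rather than computed by spaCy:

```python
# Offsets and labels copied from the printed output above; no model is loaded here.
text = "i am Hasmukh mer from india"
entities = [(5, 16, "PERSON"), (22, 27, "GPE")]

recovered = []
for start_char, end_char, label in entities:
    # end_char is exclusive, so a plain slice returns the entity text.
    span_text = text[start_char:end_char]
    recovered.append(span_text)
    print(span_text, label)
```

This is handy when you need to highlight or replace entities in the raw text, because the offsets refer to characters rather than token positions.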


## Visualizing the Named Entity recognizer

The entity visualizer, *ent*, highlights named entities and their labels in the text.

In [18]:
import spacy
from spacy import displacy

text = "i am Hasmukh mer from india"

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.serve(doc, style="ent")
# https://spacy.io/api/annotation#named-entities




Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [20]:
import spacy
from spacy import displacy

text = "i am hasmukh mer from India"

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.serve(doc, style="ent")


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [22]:
!python3 -m spacy download en_core_web_md



In [23]:
import spacy

nlp = spacy.load("en_core_web_md")
tokens = nlp("lion bear apple banana fadsfdshds")

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)
# Vector norm: The L2 norm of the token’s vector (the square root of the sum of the values squared)
# has vector: Does the token have a vector representation?
# OOV: Out-of-vocabulary

lion True 55.145737 False
bear True 52.114674 False
apple True 43.366478 False
banana True 31.620354 False
fadsfdshds False 0.0 True


The words “lion”, “bear”, “apple” and "banana" are all pretty common in English, so they’re part of the model’s vocabulary, and come with a vector. The word “fadsfdshds” on the other hand is a lot less common and out-of-vocabulary – so its vector representation consists of 300 dimensions of 0, which means it’s practically nonexistent. If your application will benefit from a large vocabulary with more vectors, you should consider using one of the larger models or loading in a full vector package, for example, *en_vectors_web_lg*, which includes over 1 million unique vectors.
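
The vector norm printed above is the L2 norm described in the code comment: the square root of the sum of the squared components. A minimal sketch in plain Python, using a short made-up vector instead of the model's 300-dimensional ones (the helper `l2_norm` is hypothetical, not a spaCy function):

```python
import math


def l2_norm(vector):
    """Return the square root of the sum of the squared components."""
    return math.sqrt(sum(x * x for x in vector))


toy_vector = [3.0, 4.0]    # made-up values, chosen so the result is easy to check
zero_vector = [0.0] * 300  # all zeros, like the out-of-vocabulary vector described above

print(l2_norm(toy_vector))   # 5.0
print(l2_norm(zero_vector))  # 0.0
```

A norm of 0.0 is therefore a quick signal that a token has no usable vector.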

spaCy is able to compare two objects, and make a prediction of how similar they are. Predicting similarity is useful for building recommendation systems or flagging duplicates. For example, you can suggest a user content that’s similar to what they’re currently looking at, or label a support ticket as a duplicate if it’s very similar to an already existing one.

Each Doc, Span and Token comes with a .similarity() method that lets you compare it with another object, and determine the similarity. Of course similarity is always subjective – whether “dog” and “cat” are similar really depends on how you’re looking at it. spaCy’s similarity model usually assumes a pretty general-purpose definition of similarity.
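
By default, spaCy estimates similarity as the cosine of the angle between two vectors. The following is a minimal sketch of that calculation in plain Python, not spaCy's implementation. The function `cosine_similarity` and the three-dimensional vectors are made up for illustration and are not taken from any model.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the two L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Made-up vectors: the two "fruit" vectors point in a similar direction.
fruit_a = [0.9, 0.1, 0.3]
fruit_b = [0.8, 0.2, 0.4]
animal = [0.1, 0.9, 0.2]

print(cosine_similarity(fruit_a, fruit_b))  # close to 1: similar direction
print(cosine_similarity(fruit_a, animal))   # noticeably lower
# Compare with a tolerance instead of == 1.0 because of floating-point rounding.
print(math.isclose(cosine_similarity(fruit_a, fruit_a), 1.0))  # True
```

Because the score depends only on direction, two vectors of very different length can still be rated as highly similar.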

In [24]:
import spacy

nlp = spacy.load("en_core_web_md")  # make sure to use larger model!
tokens = nlp("lion bear cow apple mango spinach")

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

lion lion 1.0
lion bear 0.40031397342681885
lion cow 0.4524093568325043
lion apple 0.06742796301841736
lion mango 0.18510109186172485
lion spinach 0.06951921433210373
bear lion 0.40031397342681885
bear bear 1.0
bear cow 0.2781473994255066
bear apple 0.18584339320659637
bear mango 0.14443379640579224
bear spinach 0.0758492723107338
cow lion 0.4524093568325043
cow bear 0.2781473994255066
cow cow 1.0
cow apple 0.25756582617759705
cow mango 0.26287969946861267
cow spinach 0.261837899684906
apple lion 0.06742796301841736
apple bear 0.18584339320659637
apple cow 0.25756582617759705
apple apple 1.0
apple mango 0.6305076479911804
apple spinach 0.5129707455635071
mango lion 0.18510109186172485
mango bear 0.14443379640579224
mango cow 0.26287969946861267
mango apple 0.6305076479911804
mango mango 1.0
mango spinach 0.5483009219169617
spinach lion 0.06951921433210373
spinach bear 0.0758492723107338
spinach cow 0.261837899684906
spinach apple 0.5129707455635071
spinach mango 0.5483009219169617
spin

In the output above, "lion" and "bear" have a similarity of about 0.40 (40%), while "apple" and "mango" score about 0.63 (63%). Identical tokens are obviously 100% similar to each other (just not always exactly 1.0, because of vector math and floating-point imprecision).
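
The nested loop above compares every token with every token, so each pair is printed twice and every token is also compared with itself. Since the scores in the output are symmetric, one way to list each pair only once is `itertools.combinations`. The sketch below works on the plain word strings; with spaCy you would iterate over the tokens instead and call `.similarity()` on each pair.

```python
from itertools import combinations

words = "lion bear cow apple mango spinach".split()

# Each unordered pair appears exactly once; self-pairs are skipped.
pairs = list(combinations(words, 2))

for first, second in pairs:
    print(first, second)

print(len(pairs))  # 15 pairs, versus 36 comparisons in the nested loop
```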