# Extract Keywords Using spaCy in Python

In this piece, you’ll learn how to extract the most important keywords from a chunk of text, whether it is an article, an academic paper, or even a short tweet. You can use the result to generate hashtags, estimate the importance of a sentence, and so on.
I will be using spaCy, an industrial-strength natural language processing library, for this tutorial. I have previously written a tutorial on similarity matching using spaCy; feel free to check it out. There are three sections in this tutorial:


Original Source - https://betterprogramming.pub/extract-keywords-using-spacy-in-python-4a8415478fbf

* Setup
* Implementation
* Conclusion

# 1. Setup
We will install the spaCy module with pip. Older spaCy 2.x releases needed administrative privileges to create a symlink when downloading a language model; the 3.x release installed below installs models as regular Python packages, so a normal terminal is usually sufficient. It’s highly recommended to create a virtual environment before you run the following command:

In [1]:
!pip install -U spacy

Collecting spacy
  Downloading spacy-3.2.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)
Collecting srsly<3.0.0,>=2.4.1
  Downloading srsly-2.4.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (451 kB)
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4
  Downloading pydantic-1.8.2-cp37-cp37m-manylinux2014_x86_64.whl (10.1 MB)
Collecting thinc<8.1.0,>=8.0.12
  Downloading thinc-8.0.13-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (628 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0
  Downloading spacy_loggers-1.0.1-py3-none-any.whl (7.0 kB)
Collecting pathy>=0.3.5
  Downloading pathy-0.6.1-py3-none-any.whl (42 kB)
Collecting langcodes<4.

The next step is to download the language model of your choice. I will be using the large English model for this tutorial. Feel free to check the official website for the complete list of available models.

## en_core_web_lg (large)

In [2]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.2.0/en_core_web_lg-3.2.0-py3-none-any.whl (777.4 MB)
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


The large model is a download of about 800 MB. If you would just like to try things out, download one of the smaller models instead.

## en_core_web_md (medium)
The medium model is much smaller: the 3.2.0 release downloaded below is about 46 MB.

In [3]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.2.0/en_core_web_md-3.2.0-py3-none-any.whl (45.7 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


## en_core_web_sm (small)
The smallest English language model should take only a moment to download, as the 3.2.0 release is around 14 MB.

In [4]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 2.2.5
    Uninstalling en-core-web-sm-2.2.5:
      Successfully uninstalled en-core-web-sm-2.2.5
Successfully installed en-core-web-sm-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


When you’re done, run the following command to check whether spaCy is working properly. It also indicates the models that have been installed.

In [5]:
!python -m spacy validate

✔ Loaded compatibility table

ℹ spaCy installation: /usr/local/lib/python3.7/dist-packages/spacy

NAME             SPACY            VERSION
en_core_web_md   >=3.2.0,<3.3.0   3.2.0   ✔
en_core_web_sm   >=3.2.0,<3.3.0   3.2.0   ✔
en_core_web_lg   >=3.2.0,<3.3.0   3.2.0   ✔
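
If you prefer to check from Python rather than the command line, the sketch below tests whether each model package can be imported. It relies only on the standard library; `installed_models` is a hypothetical helper name, and the check assumes each model was installed as a regular Python package, as the pip output above suggests.

```python
import importlib.util

# Model package names used in this tutorial
MODEL_NAMES = ["en_core_web_sm", "en_core_web_md", "en_core_web_lg"]


def installed_models(names):
    """Return the names from `names` that can be found as importable packages."""
    return [name for name in names if importlib.util.find_spec(name) is not None]


# Prints whichever of the three models are available in this environment
print(installed_models(MODEL_NAMES))
```

This only confirms that the package is importable; `spacy validate` remains the way to confirm that the model version is compatible with the installed spaCy version.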



Let’s move to the next section and start writing some code in Python.

# 2. Implementation
## Import
First, we need to add an import statement at the top of the file.

In [6]:
import spacy

# Apart from spaCy, we need the following imports as well. Counter is used to count
# and sort the keywords by frequency, while punctuation is a string containing
# the common ASCII punctuation characters.

from collections import Counter
from string import punctuation

## Load spaCy model
We can load the model that we have just installed with the following command. Modify the string to match the name of the model you’ve installed.

If spaCy cannot load the model even though it’s installed, you can load it in a different way: import the model package directly and call its load function.

In [7]:
nlp = spacy.load("en_core_web_lg")

# Alternative if spacy.load cannot find the model: import the package directly.
# import en_core_web_lg
# nlp = en_core_web_lg.load()
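
If you are unsure which model size is installed, one option is to try several names in order and keep the first one that loads. The sketch below is a hypothetical helper, not part of spaCy. It assumes the loader raises `OSError` for a missing model, which is how `spacy.load` reports a model it cannot find.

```python
def load_first_available(names, loader):
    """Try each model name in order; return (name, pipeline) for the first that loads.

    `loader` is any callable that takes a model name and raises OSError
    when the model is missing, for example spacy.load.
    """
    for name in names:
        try:
            return name, loader(name)
        except OSError:
            # This model is not installed, so try the next candidate
            continue
    raise OSError(f"None of the models could be loaded: {list(names)}")
```

With spaCy installed, you could call it as `name, nlp = load_first_available(["en_core_web_lg", "en_core_web_md", "en_core_web_sm"], spacy.load)`.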

## Hotword function
We’ll be writing the keyword extraction code inside a function. It’s a lot more convenient and we can easily call it whenever we need to extract keywords from a big chunk of text. It accepts a string as an input parameter.

In [8]:
def get_hotwords(text):
    result = []
    pos_tag = ['PROPN', 'ADJ', 'NOUN'] # 1
    doc = nlp(text.lower()) # 2
    for token in doc:
        # 3
        if token.text in nlp.Defaults.stop_words or token.text in punctuation:
            continue
        # 4
        if token.pos_ in pos_tag:
            result.append(token.text)

    return result # 5

* #1 A list of the part-of-speech tags that we would like to extract. I will be using just PROPN (proper noun), ADJ (adjective) and NOUN (noun) for this tutorial. If you would like to extract other parts of speech, such as verbs, extend the list to suit your requirements.

* #2 Convert the input text to lowercase and process it with the spaCy model that we loaded earlier. A processed Doc object is returned, which contains the Token objects produced by tokenization.

* #3 Loop over each token and check whether its text is a stop word or punctuation. If it is, skip it and move on to the next token.

* #4 Store the token text if its part-of-speech tag is one of those we specified earlier.

* #5 Return the result as a list of strings.
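
The steps above can also be sketched as a variant that takes the part-of-speech tags as a parameter, which makes it easy to include verbs, for example. `filter_hotwords` is a hypothetical name, and the stand-in tokens below only imitate the `.text` and `.pos_` attributes of spaCy tokens so that the example runs without a model.

```python
from string import punctuation
from types import SimpleNamespace


def filter_hotwords(tokens, pos_tags=("PROPN", "ADJ", "NOUN"), stop_words=frozenset()):
    """Return the text of tokens whose part-of-speech tag is in pos_tags.

    `tokens` can be a spaCy Doc or any iterable of objects that expose
    .text and .pos_ attributes.
    """
    result = []
    for token in tokens:
        # Skip stop words and punctuation, as in step #3
        if token.text in stop_words or token.text in punctuation:
            continue
        # Keep only the requested part-of-speech tags, as in step #4
        if token.pos_ in pos_tags:
            result.append(token.text)
    return result


# Stand-in tokens; with spaCy you would pass nlp(text.lower()) instead
# and set stop_words=nlp.Defaults.stop_words.
fake_doc = [
    SimpleNamespace(text="people", pos_="NOUN"),
    SimpleNamespace(text="read", pos_="VERB"),
    SimpleNamespace(text="stories", pos_="NOUN"),
    SimpleNamespace(text=".", pos_="PUNCT"),
]

print(filter_hotwords(fake_doc))                              # ['people', 'stories']
print(filter_hotwords(fake_doc, pos_tags=("NOUN", "VERB")))   # ['people', 'read', 'stories']
```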


Let’s test it out by using a simple text of your choice. I’m using the following input text:

In [9]:
output = get_hotwords('''Welcome to Medium! Medium is a publishing platform where 
people can read important, insightful stories on the topics that matter most to
 them and share ideas with the world.''')

print(output)

['medium', 'medium', 'publishing', 'platform', 'people', 'important', 'insightful', 'stories', 'topics', 'ideas', 'world']


# Remove duplicate items
Note that the function we’ve just written returns duplicate items when the same keyword appears more than once in the input text. In this case, the keyword medium appears twice. You can easily remove the duplicates with the set function:


In [10]:
output = set(get_hotwords('''Welcome to Medium! Medium is a publishing platform where
 people can read important, insightful stories on the topics that matter most to them
  and share ideas with the world.'''))


print(output)

{'ideas', 'important', 'medium', 'platform', 'world', 'insightful', 'stories', 'people', 'topics', 'publishing'}
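
A set does not preserve the order in which the keywords appeared in the text. If that order matters, one alternative is `dict.fromkeys`, sketched below on the keyword list printed earlier. This is an optional variation, not a replacement for the `set` approach.

```python
# The keyword list returned by get_hotwords for the sample text
keywords = ['medium', 'medium', 'publishing', 'platform', 'people', 'important',
            'insightful', 'stories', 'topics', 'ideas', 'world']

# Dict keys are unique and keep insertion order (guaranteed since Python 3.7),
# so this removes duplicates while keeping the first occurrence of each keyword.
unique_keywords = list(dict.fromkeys(keywords))

print(unique_keywords)
# ['medium', 'publishing', 'platform', 'people', 'important', 'insightful', 'stories', 'topics', 'ideas', 'world']
```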


# Generate hashtags from keywords
You can easily generate hashtags from keywords by prepending the hash symbol to every keyword. The easiest way to do this is with a list comprehension. Then join the resulting list with a space to produce a single hashtag string:

In [11]:
output = set(get_hotwords('''Welcome to Medium! Medium is a publishing platform 
where people can read important, insightful stories on the topics that matter 
most to them and share ideas with the world.'''))


hashtags = [('#' + x) for x in output]
print(' '.join(hashtags))

#ideas #important #medium #platform #world #insightful #stories #people #topics #publishing
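
The order of the hashtags above depends on how the set happens to iterate, so it may differ between runs. If you want reproducible output, one simple option (an addition here, not part of the original approach) is to sort the keywords before joining them:

```python
# The de-duplicated keywords from the sample text
keywords = {'ideas', 'important', 'medium', 'platform', 'world',
            'insightful', 'stories', 'people', 'topics', 'publishing'}

# sorted() returns the keywords in alphabetical order, so the string is stable
hashtag_string = ' '.join('#' + keyword for keyword in sorted(keywords))

print(hashtag_string)
# #ideas #important #insightful #medium #people #platform #publishing #stories #topics #world
```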


# Sort by frequency
There may be cases in which you want the keywords ordered by frequency. In that case, use the Counter class to count the keywords and retrieve the most frequent ones. Counter has a most_common method that accepts an integer n and returns the n most common items. Remember, you must remove the set function to retain the frequency of each keyword.

In [12]:
output = get_hotwords('''Welcome to Medium! Medium is a publishing platform where people can 
read important, insightful stories on the topics that matter most to them and 
share ideas with the world.''')

hashtags = [('#' + x[0]) for x in Counter(output).most_common(5)]
print(' '.join(hashtags))

#medium #publishing #platform #people #important
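
To see what `most_common` returns before the hashtags are built, the sketch below runs it on the keyword list printed earlier. Each entry is a `(keyword, count)` pair, which is why the comprehension above reads `x[0]`. Keywords with equal counts keep the order in which they first appeared.

```python
from collections import Counter

# The keyword list returned by get_hotwords for the sample text
keywords = ['medium', 'medium', 'publishing', 'platform', 'people', 'important',
            'insightful', 'stories', 'topics', 'ideas', 'world']

# most_common(n) returns the n most frequent items as (item, count) tuples
top_three = Counter(keywords).most_common(3)

print(top_three)
# [('medium', 2), ('publishing', 1), ('platform', 1)]
```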


# 3. Conclusion
Let’s recap what we’ve learned today. We started off by installing the spaCy module via pip install. Then we downloaded a pre-trained language model. In this case, I downloaded the large version of the English model.

Next, we wrote some simple code to implement our own keyword extractor. We defined our own hotword function that accepts an input string and outputs a list of keywords. We used Python’s built-in set function to remove duplicates from the result. A list comprehension made it easy to prepend the hash symbol to each keyword and create a hashtag string. Finally, we used the most_common method of the Counter class to sort the keywords by frequency.