## 1. Extracting Phone Numbers

Phone number formats vary widely from country to country, which makes matching them tricky. The best strategy is to target the format of the specific country you want to parse. If you need to cover several countries, you can add a separate pattern for each one to the matcher. If there are too many countries for that to be practical, you can relax some conditions and use a more general pattern (we'll see how to do that).


In [1]:
# Import spaCy and the rule-based Matcher
import spacy
from spacy.matcher import Matcher

In [2]:
nlp = spacy.load("en_core_web_md") 

In [3]:
doc1 = nlp("You can call my office on +1 (221) 102-2423 or email me directly (320) 332-4959 or (221) 200-2994.")
doc2 = nlp("You can call me on (221) 102 2423.")

Let's start with the US phone number format. A US number is written as (541) 754-3010 domestically or +1 (541) 754-3010 internationally. We can build the pattern from an optional +1, a three-digit area code in parentheses, and then a three-digit block and a four-digit block separated by an optional hyphen. Each dictionary in the pattern describes exactly one token. Here is the pattern:

In [4]:
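pattern = [
    {"TEXT": "+1", "OP": "?"},                       # optional country code
    {"TEXT": "("}, {"SHAPE": "ddd"}, {"TEXT": ")"},  # area code in parentheses
    {"SHAPE": "ddd"},                                # three-digit block
    {"TEXT": "-", "OP": "?"},                        # optional hyphen
    {"SHAPE": "dddd"},                               # four-digit block
]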
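
Because every dictionary in the pattern has to line up with one token, it can help to check how a phone number is tokenized before writing the pattern. The following is a minimal sketch, assuming the default English tokenizer rules; `spacy.blank("en")` is used only to avoid loading a full model, and the names `blank_nlp`, `sample`, and `tokens` are illustrative.

```python
import spacy

# A blank English pipeline has only the tokenizer, which is all we need here.
blank_nlp = spacy.blank("en")
sample = blank_nlp("+1 (221) 102-2423")

# Collect each token's text together with its shape (digits are shown as "d").
tokens = [(token.text, token.shape_) for token in sample]
for text, shape in tokens:
    print(f"{text!r:10} {shape}")
```

The parentheses and the hyphen come out as separate tokens, which is why the pattern lists them as separate entries.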

In [5]:
matcher = Matcher(nlp.vocab)
matcher.add("usPhonNum", [pattern])

In [7]:
# Extract the phone numbers
for mid, start, end in matcher(doc1):
    print(start, end, doc1[start:end])

6 13 +1 (221) 102-2423
7 13 (221) 102-2423
17 23 (320) 332-4959
24 30 (221) 200-2994
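
The first two results overlap: because the +1 is optional, the matcher reports the same number both with and without the country code. One way to keep only the longest match is sketched below in plain Python, using the `(start, end)` pairs printed above. The helper name `keep_longest` is hypothetical. spaCy also provides `spacy.util.filter_spans` and a `greedy="LONGEST"` argument to `Matcher.add` for the same purpose.

```python
def keep_longest(spans):
    """Keep the longest span among overlapping (start, end) token ranges."""
    # Visit the longest spans first; ties are broken by the earliest start.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    for start, end in ordered:
        # Two ranges are disjoint when one ends before the other starts.
        if all(end <= k_start or start >= k_end for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)


# The (start, end) pairs reported by the matcher for doc1.
matches = [(6, 13), (7, 13), (17, 23), (24, 30)]
filtered = keep_longest(matches)
print(filtered)
```

Here the shorter span `(7, 13)` is dropped because it lies inside `(6, 13)`, leaving one span per phone number.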
