<h1 align='center'>NLP Tutorial: Regular Expressions</h1>

<h3>(1) Regex in customer support</h3>

<h3>Retrieve order number</h3>

In [56]:
import re

chat1='codebasics: Hello, I am having an issue with my order # 412889912'

pattern = r'order[^\d]*(\d*)'  # raw string avoids invalid-escape warnings for \d
matches = re.findall(pattern, chat1)
matches

['412889912']

In [57]:
chat2='codebasics: I have a problem with my order number 412889912'
pattern = r'order[^\d]*(\d*)'
matches = re.findall(pattern, chat2)
matches

['412889912']

In [58]:
chat3='codebasics: My order 412889912 is having an issue, I was charged 300$ when online it says 280$'
pattern = r'order[^\d]*(\d*)'
matches = re.findall(pattern, chat3)
matches

['412889912']

In [59]:
def get_pattern_match(pattern, text):
    matches = re.findall(pattern, text)
    if matches:
        return matches[0]

In [60]:
get_pattern_match(r'order[^\d]*(\d*)', chat1)

'412889912'
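
One edge case is worth knowing: because `(\d*)` also matches an empty string, the helper returns `''` rather than `None` for a message that mentions an order without a number. Below is a minimal sketch of the difference, assuming a stricter `\d+` variant. The `strict` pattern and the `no_number` message are illustrative and not part of the original notebook.

```python
import re

def get_pattern_match(pattern, text):
    # Same helper as above: first match, or None when nothing matches
    matches = re.findall(pattern, text)
    if matches:
        return matches[0]
    return None

loose = r'order[^\d]*(\d*)'   # zero or more digits: can capture ''
strict = r'order[^\d]*(\d+)'  # one or more digits: requires a real number

# Hypothetical message that says "order" but contains no digits
no_number = 'codebasics: where is my order?'
with_number = 'codebasics: Hello, I am having an issue with my order # 412889912'

loose_result = get_pattern_match(loose, no_number)
strict_result = get_pattern_match(strict, no_number)
number_result = get_pattern_match(strict, with_number)

print(repr(loose_result), repr(strict_result), repr(number_result))
```

With the stricter pattern a missing order number shows up as `None`, which is easier to check for than an empty string.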

<h3>Retrieve email id and phone</h3>

In [61]:
chat1 = 'codebasics: you ask lot of questions 😠  1235678912, abc@xyz.com'
chat2 = 'codebasics: here it is: (123)-567-8912, abc@xyz.com'
chat3 = 'codebasics: yes, phone: 1235678912 email: abc@xyz.com'

**-----Email id-----**

In [62]:
get_pattern_match(r'[a-zA-Z0-9_]*@[a-z]*\.[a-zA-Z0-9]*',chat1)

'abc@xyz.com'

In [63]:
get_pattern_match(r'[a-zA-Z0-9_]*@[a-z]*\.[a-zA-Z0-9]*',chat2)

'abc@xyz.com'

In [64]:
get_pattern_match(r'[a-zA-Z0-9_]*@[a-z]*\.[a-zA-Z0-9]*',chat3)

'abc@xyz.com'

**-----Phone number-----**

In [39]:
get_pattern_match(r'\d{10}|\(\d{3}\)-\d{3}-\d{4}', chat1)

'1235678912'

In [67]:
get_pattern_match(r'\d{10}|\(\d{3}\)-\d{3}-\d{4}', chat2)

'(123)-567-8912'

In [68]:
get_pattern_match(r'\d{10}|\(\d{3}\)-\d{3}-\d{4}', chat3)

'1235678912'
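
When a pattern contains more than one capturing group, `re.findall` returns a tuple per match with one slot per group, and the slots for groups that did not take part in the match are empty strings. The sketch below contrasts a two-group pattern with a group-free version of the same alternation. The `digits_only` normalization step is an illustrative addition, not part of the original notebook.

```python
import re

text = 'codebasics: here it is: (123)-567-8912, abc@xyz.com'

# Two capturing groups: each match is a tuple, unused groups are ''
grouped = re.findall(r'(\d{10})|(\(\d{3}\)-\d{3}-\d{4})', text)

# No capturing groups: each match is the whole matched string
ungrouped = re.findall(r'\d{10}|\(\d{3}\)-\d{3}-\d{4}', text)

# Optional normalization: strip everything that is not a digit
digits_only = [re.sub(r'\D', '', match) for match in ungrouped]

print(grouped)
print(ungrouped)
print(digits_only)
```

Dropping the capturing groups keeps the return type a plain string for both phone formats.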

<h3>(2) Regex for Information Extraction</h3>

In [69]:
text='''
Born	Elon Reeve Musk
June 28, 1971 (age 50)
Pretoria, Transvaal, South Africa
Citizenship	
South Africa (1971–present)
Canada (1971–present)
United States (2002–present)
Education	University of Pennsylvania (BS, BA)
Title	
Founder, CEO and Chief Engineer of SpaceX
CEO and product architect of Tesla, Inc.
Founder of The Boring Company and X.com (now part of PayPal)
Co-founder of Neuralink, OpenAI, and Zip2
Spouse(s)	
Justine Wilson
​
​(m. 2000; div. 2008)​
Talulah Riley
​
​(m. 2010; div. 2012)​
​
​(m. 2013; div. 2016)
'''

In [71]:
get_pattern_match(r'age (\d+)', text)

'50'

In [73]:
get_pattern_match(r'Born(.*)\n', text).strip()

'Elon Reeve Musk'

In [74]:
get_pattern_match(r'Born.*\n(.*)\(age', text).strip()

'June 28, 1971'

In [75]:
get_pattern_match(r'\(age.*\n(.*)', text)

'Pretoria, Transvaal, South Africa'

In [78]:
def extract_personal_information(text):
    age = get_pattern_match(r'age (\d+)', text)
    full_name = get_pattern_match(r'Born(.*)\n', text)
    birth_date = get_pattern_match(r'Born.*\n(.*)\(age', text)
    birth_place = get_pattern_match(r'\(age.*\n(.*)', text)
    return {
        'age': int(age),
        'name': full_name.strip(),
        'birth_date': birth_date.strip(),
        'birth_place': birth_place.strip()
    }

In [79]:
extract_personal_information(text)

{'age': 50,
 'name': 'Elon Reeve Musk',
 'birth_date': 'June 28, 1971',
 'birth_place': 'Pretoria, Transvaal, South Africa'}

In [80]:
text = '''
Born	Mukesh Dhirubhai Ambani
19 April 1957 (age 64)
Aden, Colony of Aden
(present-day Yemen)[1][2]
Nationality	Indian
Alma mater	
St. Xavier's College, Mumbai
Institute of Chemical Technology (B.E.)
Stanford University (drop-out)
Occupation	Chairman and MD, Reliance Industries
Spouse(s)	Nita Ambani ​(m. 1985)​[3]
Children	3
Parent(s)	
Dhirubhai Ambani (father)
Kokilaben Ambani (mother)
Relatives	Anil Ambani (brother)
Tina Ambani (sister-in-law)
'''

In [81]:
extract_personal_information(text)

{'age': 64,
 'name': 'Mukesh Dhirubhai Ambani',
 'birth_date': '19 April 1957',
 'birth_place': 'Aden, Colony of Aden'}
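
`extract_personal_information` assumes every field is present: if any pattern fails to match, `int(age)` or `.strip()` is called on `None` and raises an error. Below is a minimal sketch of a more forgiving variant, using the same four patterns. The names `first_group` and `extract_personal_information_safe` and the two sample strings are hypothetical and only serve as illustration.

```python
import re

def first_group(pattern, text):
    # Return the first captured group, stripped, or None if there is no match
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None

def extract_personal_information_safe(text):
    age = first_group(r'age (\d+)', text)
    return {
        'age': int(age) if age is not None else None,
        'name': first_group(r'Born(.*)\n', text),
        'birth_date': first_group(r'Born.*\n(.*)\(age', text),
        'birth_place': first_group(r'\(age.*\n(.*)', text),
    }

# Made-up infobox snippets: one complete, one missing most fields
complete = "\nBorn\tAda Example\n1 January 1990 (age 30)\nSpringfield\n"
partial = "\nBorn\tAda Example\n"

complete_info = extract_personal_information_safe(complete)
partial_info = extract_personal_information_safe(partial)
print(complete_info)
print(partial_info)
```

Missing fields come back as `None` instead of raising, so the caller can decide how to handle incomplete records.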

<h3>References</h3>

Please refer to my video on Python regular expressions to learn more: https://www.youtube.com/watch?v=sHw5hLYFaIw

Here is the code of that video: https://github.com/codebasics/py/blob/master/Advanced/regex/regex_tutorial_python.ipynb

<h3>Exercise</h3>

https://github.com/codebasics/nlp-tutorials/blob/main/1_regex/regex_nlp_exercise_questions.ipynb

<h3>Installation instructions</h3>

pip install spacy

python -m spacy download en_core_web_sm

pip install nltk

<h3>Sentence & Word Tokenization In Spacy</h3>

In [1]:
import spacy

In [2]:
nlp = spacy.load("en_core_web_sm")

doc = nlp("Dr. Strange loves pav bhaji of mumbai. Hulk loves chat of delhi")

In [3]:
for sentence in doc.sents:
    print(sentence)

Dr. Strange loves pav bhaji of mumbai.
Hulk loves chat of delhi


In [4]:
for sentence in doc.sents:
    for word in sentence:
        print(word)

Dr.
Strange
loves
pav
bhaji
of
mumbai
.
Hulk
loves
chat
of
delhi


<h3>Sentence & Word Tokenization In NLTK</h3>

In [5]:
from nltk.tokenize import sent_tokenize
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dhava\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [6]:
sent_tokenize("Dr. Strange loves pav bhaji of mumbai. Hulk loves chat of delhi")

['Dr.', 'Strange loves pav bhaji of mumbai.', 'Hulk loves chat of delhi']
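
A rule that simply breaks after every period produces the same three pieces, which shows why abbreviations such as `Dr.` make sentence splitting harder than it looks. The sketch below uses only the standard library. The small `abbreviations` set and the merge step are illustrative assumptions, not how NLTK or spaCy work internally.

```python
import re

text = "Dr. Strange loves pav bhaji of mumbai. Hulk loves chat of delhi"

# Naive rule: split on whitespace that follows a period
naive = re.split(r'(?<=\.)\s+', text)

# Hypothetical fix: glue a piece back on if the previous one ends in an abbreviation
abbreviations = {'Dr.', 'Mr.', 'Mrs.'}
merged = []
for piece in naive:
    if merged and merged[-1].split()[-1] in abbreviations:
        merged[-1] = merged[-1] + ' ' + piece
    else:
        merged.append(piece)

print(naive)
print(merged)
```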

In [7]:
from nltk.tokenize import word_tokenize

In [8]:
word_tokenize("Dr. Strange loves pav bhaji of mumbai. Hulk loves chat of delhi")

['Dr',
 '.',
 'Strange',
 'loves',
 'pav',
 'bhaji',
 'of',
 'mumbai',
 '.',
 'Hulk',
 'loves',
 'chat',
 'of',
 'delhi']

**The code above shows that spaCy is object-oriented (it returns Doc, Span and Token objects), whereas NLTK is a string processing library (it takes strings and returns lists of strings)**

<h2 align="center">Spacy Tokenization Tutorial</h2>

In [14]:
import spacy

Create a blank language object and tokenize the words in a sentence

In [15]:
nlp = spacy.blank("en")

doc = nlp("Dr. Strange loves pav bhaji of mumbai as it costs only 2$ per plate.")

for token in doc:
    print(token)

Dr.
Strange
loves
pav
bhaji
of
mumbai
as
it
costs
only
2
$
per
plate
.


Creating a blank language object gives you a tokenizer and an empty pipeline. We will look at language pipelines in more detail in the next tutorial.

<img src="spacy_blank_pipeline.jpg" height=100 width=500 />

<h3>Using index to grab tokens</h3>

In [16]:
doc[0]

Dr.

In [17]:
token = doc[1]
token.text

'Strange'

In [None]:
dir(token)

In [9]:
type(nlp)

spacy.lang.en.English

In [10]:
type(doc)

spacy.tokens.doc.Doc

In [11]:
type(token)

spacy.tokens.token.Token

In [12]:
nlp.pipe_names

[]

<h3>Span object</h3>

In [19]:
span = doc[0:5]
span

Dr. Strange loves pav bhaji

In [20]:
type(span)

spacy.tokens.span.Span

<h3>Token attributes</h3>

In [21]:
doc = nlp("Tony gave two $ to Peter.")

In [31]:
token0 = doc[0]
token0

Tony

In [32]:
token0.is_alpha

True

In [33]:
token0.like_num

False

In [34]:
token2 = doc[2]
token2

two

In [35]:
token2.like_num

True

In [36]:
token3 = doc[3]
token3

$

In [37]:
token3.like_num

False

In [38]:
token3.is_currency

True

In [39]:
for token in doc:
    print(token, "==>", "index: ", token.i, "is_alpha:", token.is_alpha, 
          "is_punct:", token.is_punct, 
          "like_num:", token.like_num,
          "is_currency:", token.is_currency,
         )

Tony ==> index:  0 is_alpha: True is_punct: False like_num: False is_currency: False
gave ==> index:  1 is_alpha: True is_punct: False like_num: False is_currency: False
two ==> index:  2 is_alpha: True is_punct: False like_num: True is_currency: False
$ ==> index:  3 is_alpha: False is_punct: False like_num: False is_currency: True
to ==> index:  4 is_alpha: True is_punct: False like_num: False is_currency: False
Peter ==> index:  5 is_alpha: True is_punct: False like_num: False is_currency: False
. ==> index:  6 is_alpha: False is_punct: True like_num: False is_currency: False


<h3>Collecting email ids of students from students information sheet</h3>

In [47]:
with open("students.txt") as f:
    text = f.readlines()
text

['Dayton high school, 8th grade students information\n',
 '\n',
 'Name\tbirth day   \temail\n',
 '-----\t------------\t------\n',
 'Virat   5 June, 1882    virat@kohli.com\n',
 'Maria\t12 April, 2001  maria@sharapova.com\n',
 'Serena  24 June, 1998   serena@williams.com \n',
 'Joe      1 May, 1997    joe@root.com\n',
 '\n',
 '\n',
 '\n']

In [48]:
text = " ".join(text)
text



In [49]:
doc = nlp(text)
emails = []
for token in doc:
    if token.like_email:
        emails.append(token.text)
emails        

['virat@kohli.com',
 'maria@sharapova.com',
 'serena@williams.com',
 'joe@root.com']
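
For comparison, the same email ids can be collected with a regular expression from the first part of this tutorial. This is a minimal sketch that copies the sheet's lines inline instead of reading `students.txt`. The email pattern is a simplified assumption and does not cover every valid address.

```python
import re

# Lines copied from the students sheet shown above
lines = [
    'Name\tbirth day   \temail\n',
    'Virat   5 June, 1882    virat@kohli.com\n',
    'Maria\t12 April, 2001  maria@sharapova.com\n',
    'Serena  24 June, 1998   serena@williams.com \n',
    'Joe      1 May, 1997    joe@root.com\n',
]
text = " ".join(lines)

# Simplified pattern: local part, '@', then one or more dot-separated labels
email_pattern = r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+'
emails = re.findall(email_pattern, text)
print(emails)
```

The token-based approach needs no hand-written pattern, while the regex approach needs no NLP library. Which one fits better depends on the rest of the processing.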

<h3>Support in other languages</h3>

spaCy supports many languages. Some of them do not have trained pipelines though!
https://spacy.io/usage/models#languages

In [41]:
nlp = spacy.blank("hi")
doc = nlp("भैया जी! 5000 ₹ उधार थे वो वापस देदो")
for token in doc:
    print(token, token.is_currency)

भैया False
जी False
! False
5000 False
₹ True
उधार False
थे False
वो False
वापस False
देदो False


<h3>Customizing tokenizer</h3>

In [42]:
from spacy.symbols import ORTH

nlp = spacy.blank("en")
doc = nlp("gimme double cheese extra large healthy pizza")
tokens = [token.text for token in doc]
tokens

['gimme', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza']

In [43]:
nlp.tokenizer.add_special_case("gimme", [
    {ORTH: "gim"},
    {ORTH: "me"},
])
doc = nlp("gimme double cheese extra large healthy pizza")
tokens = [token.text for token in doc]
tokens

['gim', 'me', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza']
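
Conceptually, a special case is a lookup from one exact string to the list of pieces it should become, and the pieces must join back into the original string. The sketch below mimics that idea with a plain dictionary and whitespace splitting. It is a toy model for intuition and not spaCy's tokenizer. The `tokenize` function and `special_cases` dictionary are hypothetical names.

```python
# Toy special-case table: exact string -> pieces it should be split into
special_cases = {"gimme": ["gim", "me"]}

def tokenize(text, special_cases):
    tokens = []
    for chunk in text.split():
        pieces = special_cases.get(chunk, [chunk])
        # The pieces must reproduce the original text exactly
        assert "".join(pieces) == chunk
        tokens.extend(pieces)
    return tokens

tokens = tokenize("gimme double cheese extra large healthy pizza", special_cases)
print(tokens)
```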

<h3>Sentence Tokenization or Segmentation</h3>

In [164]:
doc = nlp("Dr. Strange loves pav bhaji of mumbai. Hulk loves chat of delhi")
for sentence in doc.sents:
    print(sentence)

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: `nlp.add_pipe('sentencizer')`. Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting `doc[i].is_sent_start`.

In [165]:
nlp.pipeline

[]

In [166]:
nlp.add_pipe('sentencizer')

<spacy.pipeline.sentencizer.Sentencizer at 0x1caadca0680>

In [167]:
doc = nlp("Dr. Strange loves pav bhaji of mumbai. Hulk loves chat of delhi")
for sentence in doc.sents:
    print(sentence)

Dr. Strange loves pav bhaji of mumbai.
Hulk loves chat of delhi


In [44]:
nlp.pipeline

[]

<h3>Exercise</h3>

(1) Think stats is a free book to study statistics (https://greenteapress.com/thinkstats2/thinkstats2.pdf) 

This book references many websites where you can download free datasets. You are an NLP engineer working for some company and you want to collect all the dataset websites from this book. To keep the exercise simple, you are given a paragraph from the book, and you want to grab all the URLs from this paragraph using spaCy.

In [52]:
text='''
Look for data to help you address the question. Governments are good
sources because data from public research is often freely available. Good
places to start include http://www.data.gov/, and http://www.science.
gov/, and in the United Kingdom, http://data.gov.uk/.
Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/, 
and the European Social Survey at http://www.europeansocialsurvey.org/.
'''

# TODO: Write code here
# Hint: token has an attribute that can be used to detect a url

(2) Extract all money transactions from the sentence below along with the currency. The output should be:

two $

500 €

In [51]:
transactions = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"

# TODO: Write code here
# Hint: Use token.i for the index of a token and token.is_currency for currency symbol detection

[Click me to see a solution](https://github.com/codebasics/nlp-tutorials/blob/main/4_tokenization/spacy_tokenizer_exercise_solution.ipynb)

<h3>Further Reading</h3>

https://spacy.io/usage/linguistic-features#tokenization

<h2 align="center">Spacy Language Processing Pipelines Tutorial</h2>

<h3>Blank nlp pipeline</h3>

In [90]:
import spacy

nlp = spacy.blank("en")

doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")

for token in doc:
    print(token)

Captain
america
ate
100
$
of
samosa
.
Then
he
said
I
can
do
this
all
day
.


The text is tokenized even though we have a blank pipeline, as shown below. Processing starts with the Tokenizer component, and the pipeline components would sit in the dotted rectangle in the image below. There is nothing in it, hence the blank pipeline.

<img height=300 width=400 src="spacy_blank_pipeline.jpg" />

In [68]:
nlp.pipe_names

[]

nlp.pipe_names is an empty list, which indicates that there are no components in the pipeline. Processing always starts with the tokenizer, which is not listed as a pipeline component.

A more general diagram of an NLP pipeline may look something like the one below

<img height=300 width=400 src="spacy_loaded_pipeline.jpg" />
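
The idea in the diagram can be modeled in a few lines of plain Python: a tokenizer always runs first, and each pipeline component then receives the tokens and adds information to them. This is a toy sketch for intuition only. `ToyPipeline`, `toy_tokenizer` and `mark_numbers` are hypothetical names and say nothing about how spaCy implements its pipeline.

```python
def toy_tokenizer(text):
    # Always runs first; not counted as a pipeline component
    return [{"text": word} for word in text.split()]

def mark_numbers(tokens):
    # Example component: annotate each token with a number flag
    for token in tokens:
        token["like_num"] = token["text"].isdigit()
    return tokens

class ToyPipeline:
    def __init__(self):
        self.components = []  # list of (name, function) pairs

    def add_pipe(self, name, func):
        self.components.append((name, func))

    @property
    def pipe_names(self):
        return [name for name, _ in self.components]

    def __call__(self, text):
        doc = toy_tokenizer(text)
        for _, func in self.components:
            doc = func(doc)
        return doc

nlp_toy = ToyPipeline()
blank_names = list(nlp_toy.pipe_names)            # no components yet
blank_doc = nlp_toy("Captain america ate 100 samosa")

nlp_toy.add_pipe("mark_numbers", mark_numbers)
loaded_doc = nlp_toy("Captain america ate 100 samosa")
print(blank_names, nlp_toy.pipe_names, loaded_doc[3])
```

A blank pipeline still tokenizes, but the tokens carry no extra annotations until a component is added.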

<h3>Download trained pipeline</h3>

To download a trained pipeline, use a command such as:

python -m spacy download en_core_web_sm

This downloads the small (sm) pipeline for the English language.

Further instructions: https://spacy.io/usage/models#quickstart

In [85]:
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [86]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x1b5ddd5f5e0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1b5ddd5f280>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1b5ddd66a50>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1b5d92391c0>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x1b5d922ae00>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1b5d996e660>)]

The sm in en_core_web_sm means small. Other sizes are available as well, such as medium (md) and large (lg). See: https://spacy.io/usage/models#quickstart

In [87]:
doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")

for token in doc:
    print(token, " | ", spacy.explain(token.pos_), " | ", token.lemma_)

Captain  |  proper noun  |  Captain
america  |  proper noun  |  america
ate  |  verb  |  eat
100  |  numeral  |  100
$  |  numeral  |  $
of  |  adposition  |  of
samosa  |  proper noun  |  samosa
.  |  punctuation  |  .
Then  |  adverb  |  then
he  |  pronoun  |  he
said  |  verb  |  say
I  |  pronoun  |  I
can  |  auxiliary  |  can
do  |  verb  |  do
this  |  pronoun  |  this
all  |  determiner  |  all
day  |  noun  |  day
.  |  punctuation  |  .


**Run the same code with a blank pipeline and check what output you get.**

<h3>Named Entity Recognition</h3>

In [34]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)

Tesla Inc ORG
$45 billion MONEY


In [43]:
from spacy import displacy

displacy.render(doc, style="ent")

<h3>Trained processing pipeline in French</h3>

In [26]:
nlp = spacy.load("fr_core_news_sm")

OSError: [E050] Can't find model 'fr_core_news_sm'. It doesn't seem to be a Python package or a valid path to a data directory.

You need to install the processing pipeline for the French language using this command:

python -m spacy download fr_core_news_sm

In [36]:
nlp = spacy.load("fr_core_news_sm")

In [42]:
doc = nlp("Tesla Inc va racheter Twitter pour $45 milliards de dollars")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
Twitter  |  MISC  |  Miscellaneous entities, e.g. events, nationalities, products or works of art


In [39]:
for token in doc:
    print(token, " | ", token.pos_, " | ", token.lemma_)

Tesla  |  X  |  Tesla
Inc  |  X  |  Inc
va  |  VERB  |  aller
racheter  |  VERB  |  racheter
Twitter  |  VERB  |  twitter
pour  |  ADP  |  pour
$  |  NOUN  |  dollar
45  |  NUM  |  45
milliards  |  NOUN  |  milliard
de  |  ADP  |  de
dollars  |  NOUN  |  dollar


<h3>Adding a component to a blank pipeline</h3>

In [82]:
source_nlp = spacy.load("en_core_web_sm")

nlp = spacy.blank("en")
nlp.add_pipe("ner", source=source_nlp)
nlp.pipe_names

['ner']

In [81]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)

Tesla Inc ORG
$45 billion MONEY


In the image below you can see the sentencizer component in the pipeline

<img height=300 width=400 src="sentecizer.jpg" />

<h3>Further reading</h3>

https://spacy.io/usage/processing-pipelines#pipelines

<h3>Stemming in NLTK</h3>

In [4]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [10]:
words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting"]

for word in words:
    print(word, "|", stemmer.stem(word))

eating | eat
eats | eat
eat | eat
ate | ate
adjustable | adjust
rafting | raft
ability | abil
meeting | meet


<h3>Lemmatization in Spacy</h3>

In [24]:
import spacy

In [25]:
nlp = spacy.load("en_core_web_sm")

# doc = nlp("Mando talked for 3 hours although talking isn't his thing")  # alternative sentence to try
doc = nlp("eating eats eat ate adjustable rafting ability meeting better")
for token in doc:
    print(token, " | ", token.lemma_)

eating  |  eat
eats  |  eat
eat  |  eat
ate  |  eat
adjustable  |  adjustable
rafting  |  rafting
ability  |  ability
meeting  |  meeting
better  |  well
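
The two outputs differ because stemming strips suffixes with fixed rules, while lemmatization looks up the dictionary form of a word. The sketch below is a deliberately tiny model of both ideas. `toy_stem` with its three suffixes is not the Porter algorithm, and the two-entry `LEMMAS` table is a hypothetical stand-in for a real lemmatizer's linguistic knowledge.

```python
SUFFIXES = ("able", "ing", "s")

def toy_stem(word):
    # Rule-based: chop the first matching suffix if at least 3 letters remain
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            return stem
    return word

# Lookup-based: irregular forms need knowledge of the language
LEMMAS = {"ate": "eat", "better": "well"}

def toy_lemma(word):
    return LEMMAS.get(word, word)

words = ["eating", "eats", "ate", "adjustable", "rafting", "meeting", "better"]
stems = {word: toy_stem(word) for word in words}
lemmas = {word: toy_lemma(word) for word in words}
print(stems)
print(lemmas)
```

Suffix rules cannot turn "ate" into "eat" or "better" into "well", and a lookup has no generic rule for unseen words, which is why the two techniques give different results.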


<h3>Customizing lemmatizer</h3>

In [26]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [29]:
ar = nlp.get_pipe('attribute_ruler')

ar.add([[{"TEXT":"Bro"}],[{"TEXT":"Brah"}]],{"LEMMA":"Brother"})

doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")
for token in doc:
    print(token.text, "|", token.lemma_)

Bro | Brother
, | ,
you | you
wanna | wanna
go | go
? | ?
Brah | Brother
, | ,
do | do
n't | not
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust


In [35]:
doc[6]

Brah

In [36]:
doc[6].lemma_

'Brother'

In [1]:
import spacy

<h3>POS tags</h3>

In [12]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Elon flew to mars yesterday. He carried biryani masala with him")

for token in doc:
    print(token," | ", token.pos_, " | ", spacy.explain(token.pos_))

Elon  |  PROPN  |  proper noun
flew  |  VERB  |  verb
to  |  ADP  |  adposition
mars  |  NOUN  |  noun
yesterday  |  NOUN  |  noun
.  |  PUNCT  |  punctuation
He  |  PRON  |  pronoun
carried  |  VERB  |  verb
biryani  |  PROPN  |  proper noun
masala  |  NOUN  |  noun
with  |  ADP  |  adposition
him  |  PRON  |  pronoun


You can check https://v2.spacy.io/api/annotation for the complete list of pos categories in spacy. 

https://en.wikipedia.org/wiki/Preposition_and_postposition

https://en.wikipedia.org/wiki/Part_of_speech

In [8]:
doc = nlp("Wow! Dr. Strange made 265 million $ on the very first day")

for token in doc:
    print(token," | ", token.pos_, " | ", spacy.explain(token.pos_))

Wow  |  INTJ  |  interjection
!  |  PUNCT  |  punctuation
Dr.  |  PROPN  |  proper noun
Strange  |  PROPN  |  proper noun
made  |  VERB  |  verb
265  |  NUM  |  numeral
million  |  NUM  |  numeral
$  |  NUM  |  numeral
on  |  ADP  |  adposition
the  |  DET  |  determiner
very  |  ADV  |  adverb
first  |  ADJ  |  adjective
day  |  NOUN  |  noun


<h3>Tags</h3>

In [10]:
doc = nlp("Wow! Dr. Strange made 265 million $ on the very first day")

for token in doc:
    print(token," | ", token.pos_, " | ", spacy.explain(token.pos_), " | ", token.tag_, " | ", spacy.explain(token.tag_))

Wow  |  INTJ  |  interjection  |  UH  |  interjection
!  |  PUNCT  |  punctuation  |  .  |  punctuation mark, sentence closer
Dr.  |  PROPN  |  proper noun  |  NNP  |  noun, proper singular
Strange  |  PROPN  |  proper noun  |  NNP  |  noun, proper singular
made  |  VERB  |  verb  |  VBD  |  verb, past tense
265  |  NUM  |  numeral  |  CD  |  cardinal number
million  |  NUM  |  numeral  |  CD  |  cardinal number
$  |  NUM  |  numeral  |  CD  |  cardinal number
on  |  ADP  |  adposition  |  IN  |  conjunction, subordinating or preposition
the  |  DET  |  determiner  |  DT  |  determiner
very  |  ADV  |  adverb  |  RB  |  adverb
first  |  ADJ  |  adjective  |  JJ  |  adjective (English), other noun-modifier (Chinese)
day  |  NOUN  |  noun  |  NN  |  noun, singular or mass


<h3>In the sentences below, spaCy figures out the past vs. present tense of quit</h3>

In [34]:
doc = nlp("He quits the job")

print(doc[1].text, "|", doc[1].tag_, "|", spacy.explain(doc[1].tag_))

quits | VBZ | verb, 3rd person singular present


In [35]:
doc = nlp("he quit the job")

print(doc[1].text, "|", doc[1].tag_, "|", spacy.explain(doc[1].tag_))

quit | VBD | verb, past tense


<h3>Removing all SPACE, PUNCT and X token from text</h3>

Processing Microsoft's earnings report: https://www.microsoft.com/en-us/investor/earnings/fy-2022-q2/press-release-webcast

In [44]:
earnings_text="""Microsoft Corp. today announced the following results for the quarter ended December 31, 2021, as compared to the corresponding period of last fiscal year:

·         Revenue was $51.7 billion and increased 20%
·         Operating income was $22.2 billion and increased 24%
·         Net income was $18.8 billion and increased 21%
·         Diluted earnings per share was $2.48 and increased 22%
“Digital technology is the most malleable resource at the world’s disposal to overcome constraints and reimagine everyday work and life,” said Satya Nadella, chairman and chief executive officer of Microsoft. “As tech as a percentage of global GDP continues to increase, we are innovating and investing across diverse and growing markets, with a common underlying technology stack and an operating model that reinforces a common strategy, culture, and sense of purpose.”
“Solid commercial execution, represented by strong bookings growth driven by long-term Azure commitments, increased Microsoft Cloud revenue to $22.1 billion, up 32% year over year” said Amy Hood, executive vice president and chief financial officer of Microsoft."""

doc = nlp(earnings_text)

filtered_tokens = []

for token in doc:
    if token.pos_ not in ["SPACE", "PUNCT", "X"]:
        filtered_tokens.append(token)

In [50]:
filtered_tokens[:10]

[Microsoft,
 Corp.,
 today,
 announced,
 the,
 following,
 results,
 for,
 the,
 quarter]

In [47]:
count = doc.count_by(spacy.attrs.POS)
count

{96: 13,
 92: 48,
 100: 23,
 90: 9,
 85: 17,
 93: 16,
 97: 27,
 98: 1,
 84: 19,
 103: 10,
 87: 6,
 99: 5,
 89: 12,
 86: 2,
 94: 3,
 95: 2}

In [48]:
doc.vocab[96].text

'PROPN'

In [49]:
for k,v in count.items():
    print(doc.vocab[k].text, "|",v)

PROPN | 13
NOUN | 48
VERB | 23
DET | 9
ADP | 17
NUM | 16
PUNCT | 27
SCONJ | 1
ADJ | 19
SPACE | 10
AUX | 6
SYM | 5
CCONJ | 12
ADV | 2
PART | 3
PRON | 2
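
The same kind of tally can be reproduced with `collections.Counter` once tokens are paired with their POS names, which makes it clearer what the counts mean. This is a minimal sketch that hard-codes the `(token, POS)` pairs printed earlier for the "Elon flew to mars" sentence instead of calling spaCy. The `tagged` list is copied from that output.

```python
from collections import Counter

# (token, POS) pairs copied from the POS tags output above
tagged = [
    ("Elon", "PROPN"), ("flew", "VERB"), ("to", "ADP"), ("mars", "NOUN"),
    ("yesterday", "NOUN"), (".", "PUNCT"), ("He", "PRON"), ("carried", "VERB"),
    ("biryani", "PROPN"), ("masala", "NOUN"), ("with", "ADP"), ("him", "PRON"),
]

# Tally by POS name, analogous to count_by followed by the vocab lookup
pos_counts = Counter(pos for _, pos in tagged)

# Same filter as filtered_tokens: drop SPACE, PUNCT and X tokens
kept = [word for word, pos in tagged if pos not in {"SPACE", "PUNCT", "X"}]

print(pos_counts.most_common())
print(kept)
```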


<h2 align='center'>NLP Tutorial: Named Entity Recognition (NER)</h2>

In [1]:
import spacy

In [3]:
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [13]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
$45 billion  |  MONEY  |  Monetary values, including unit


In [11]:
from spacy import displacy

displacy.render(doc, style="ent")

<h3>List down all the entities</h3>

In [14]:
nlp.pipe_labels['ner']

['CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART']

The entity labels are also documented on this page: https://spacy.io/models/en

In [34]:
doc = nlp("Michael Bloomberg founded Bloomberg in 1982")
for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))

Michael Bloomberg | PERSON | People, including fictional
Bloomberg | GPE | Countries, cities, states
1982 | DATE | Absolute or relative dates or periods


Above, the model made a mistake: it labeled Bloomberg the company as GPE. Let's try Hugging Face for this now.

https://huggingface.co/dslim/bert-base-NER?text=Michael+Bloomberg+founded+Bloomberg+in+1982

On that page, also go through the 3 sample examples for NER

In [36]:
doc = nlp("Tesla Inc is going to acquire Twitter Inc for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", ent.start_char, "|", ent.end_char)

Tesla Inc  |  ORG  |  0 | 9
Twitter Inc  |  ORG  |  30 | 41
$45 billion  |  MONEY  |  46 | 57
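
`start_char` and `end_char` are character offsets into the original string, with the end offset exclusive, so slicing the text with them gives back the entity text. The sketch below checks this with the offsets printed above, without calling spaCy. The `entities` list is copied from that output.

```python
text = "Tesla Inc is going to acquire Twitter Inc for $45 billion"

# (label, start_char, end_char) triples copied from the output above
entities = [("ORG", 0, 9), ("ORG", 30, 41), ("MONEY", 46, 57)]

# Slicing with the offsets recovers each entity's text
recovered = [(label, text[start:end]) for label, start, end in entities]
print(recovered)
```

Character offsets like these are handy when entities need to be highlighted in the original text or stored alongside it.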


<h3>Setting custom entities</h3>

In [43]:
doc = nlp("Tesla is going to acquire Twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_)

Twitter  |  PRODUCT
$45 billion  |  MONEY


In [44]:
s = doc[2:5]
s

going to acquire

In [45]:
type(s)

spacy.tokens.span.Span

In [46]:
from spacy.tokens import Span

s1 = Span(doc, 0, 1, label="ORG")
s2 = Span(doc, 5, 6, label="ORG")

doc.set_ents([s1, s2], default="unmodified")

In [47]:
for ent in doc.ents:
    print(ent.text, " | ", ent.label_)

Tesla  |  ORG
Twitter  |  ORG
$45 billion  |  MONEY
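
The `Span(doc, 0, 1)` and `Span(doc, 5, 6)` calls use token indices with an exclusive end, just like Python list slices. The sketch below shows where those numbers come from. It uses whitespace splitting as a rough stand-in for spaCy's tokenizer: that splits `$45` differently, but the indices of the first seven words are the same. The `span_text` helper is a hypothetical name.

```python
text = "Tesla is going to acquire Twitter for $45 billion"

# Rough stand-in for tokenization; only the early indices matter here
tokens = text.split()

def span_text(tokens, start, end):
    # End index is exclusive, as in doc[start:end]
    return " ".join(tokens[start:end])

tesla = span_text(tokens, 0, 1)
twitter = span_text(tokens, 5, 6)
phrase = span_text(tokens, 2, 5)
print(tesla, "|", twitter, "|", phrase)
```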


<h2 align='center'>NLP Tutorial: Text Representation - Bag Of Words (BOW)</h2>

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("spam.csv")
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
df.Category.value_counts()

ham     4825
spam     747
Name: Category, dtype: int64

In [5]:
df['spam'] = df['Category'].apply(lambda x: 1 if x =='spam' else 0)

In [8]:
df.shape

(5572, 3)

In [6]:
df.head()

Unnamed: 0,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


<h3>Train test split</h3>

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.Message, df.spam, test_size=0.2)

In [9]:
X_train.shape

(4457,)

In [10]:
X_test.shape

(1115,)

In [11]:
type(X_train)

pandas.core.series.Series

In [16]:
X_train[:4]

1579    How to Make a girl Happy? It's not at all diff...
4992    We made it! Eta at taunton is 12:30 as planned...
4142    Beautiful truth : Expression of the face could...
1141            Oh:)as usual vijay film or its different?
Name: Message, dtype: object

In [17]:
type(y_train)

pandas.core.series.Series

In [19]:
y_train[:4]

1579    0
4992    0
4142    0
1141    0
Name: spam, dtype: int64

In [21]:
type(X_train.values)

numpy.ndarray

<h3>Create bag of words representation using CountVectorizer</h3>

In [22]:
from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer()

X_train_cv = v.fit_transform(X_train.values)
X_train_cv

<4457x7675 sparse matrix of type '<class 'numpy.int64'>'
	with 59056 stored elements in Compressed Sparse Row format>

In [25]:
X_train_cv.toarray()[:2][0]

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [26]:
X_train_cv.shape

(4457, 7675)

In [53]:
v.get_feature_names_out()[1771]

'chef'

In [None]:
v.vocabulary_

In [42]:
X_train_np = X_train_cv.toarray()
X_train_np[0]

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [41]:
np.where(X_train_np[0]!=0)

(array([ 755,  916, 1125, 1127, 1279, 1483, 1771, 1842, 1928, 2046, 2263,
        2290, 2327, 2464, 2744, 3018, 3151, 3154, 3195, 3272, 3348, 3391,
        3541, 3717, 3770, 4138, 4218, 4237, 4311, 4706, 4804, 4908, 4951,
        5117, 5423, 5425, 5432, 6474, 6588, 6847, 6966, 7184, 7191, 7302],
       dtype=int64),)

In [44]:
X_train[:4][1579]

"How to Make a girl Happy? It's not at all difficult to make girls happy. U only need to be... 1. A friend 2. Companion 3. Lover 4. Chef . . .  &lt;#&gt; . Good listener  &lt;#&gt; . Organizer  &lt;#&gt; . Good boyfriend  &lt;#&gt; . Very clean  &lt;#&gt; . Sympathetic  &lt;#&gt; . Athletic  &lt;#&gt; . Warm . . .  &lt;#&gt; . Courageous  &lt;#&gt; . Determined  &lt;#&gt; . True  &lt;#&gt; . Dependable  &lt;#&gt; . Intelligent . . .  &lt;#&gt; . Psychologist  &lt;#&gt; . Pest exterminator  &lt;#&gt; . Psychiatrist  &lt;#&gt; . Healer . .  &lt;#&gt; . Stylist  &lt;#&gt; . Driver . . Aaniye pudunga venaam.."

In [52]:
X_train_np[0][1771]

1
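
What the vectorizer does can be sketched in plain Python: build a sorted vocabulary from the training texts, then represent each text as a vector of word counts in vocabulary order. This is a minimal sketch, assuming the default behavior of lowercasing and keeping only tokens of two or more word characters. The two example messages and the names `tokenize` and `vectorize` are illustrative.

```python
import re
from collections import Counter

# Tiny made-up corpus
docs = ["Free entry to win free tickets", "Ok lar joking wif u oni"]

def tokenize(text):
    # Lowercase, keep tokens with at least two word characters
    return re.findall(r'\b\w\w+\b', text.lower())

# Vocabulary: every distinct token, sorted, mapped to a column index
vocab = sorted({token for doc in docs for token in tokenize(doc)})
index = {token: i for i, token in enumerate(vocab)}

def vectorize(text):
    counts = Counter(tokenize(text))
    return [counts.get(token, 0) for token in vocab]

first_vector = vectorize(docs[0])
unseen_vector = vectorize("win cash now")  # unknown words are ignored
print(vocab)
print(first_vector)
print(unseen_vector)
```

Each column corresponds to one vocabulary word, which is why most entries are zero and the real matrix is stored in sparse format.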

<h3>Train the naive bayes model</h3>

In [54]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train_cv, y_train)

In [56]:
X_test_cv = v.transform(X_test)

<h3>Evaluate Performance</h3>

In [59]:
from sklearn.metrics import classification_report

y_pred = model.predict(X_test_cv)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99       976
           1       0.98      0.93      0.95       139

    accuracy                           0.99      1115
   macro avg       0.98      0.96      0.97      1115
weighted avg       0.99      0.99      0.99      1115



In [60]:
emails = [
    'Hey mohan, can we get together to watch footbal game tomorrow?',
    'Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!'
]

emails_count = v.transform(emails)
model.predict(emails_count)

array([0, 1], dtype=int64)

<h3>Train the model using sklearn pipeline and reduce number of lines of code</h3>

In [61]:
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [66]:
clf.fit(X_train, y_train)

In [64]:
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99       976
           1       0.98      0.93      0.95       139

    accuracy                           0.99      1115
   macro avg       0.98      0.96      0.97      1115
weighted avg       0.99      0.99      0.99      1115



### Stop words tutorial

In [26]:
import spacy

from spacy.lang.en.stop_words import STOP_WORDS

len(STOP_WORDS)

326

In [104]:
nlp = spacy.load("en_core_web_sm")

doc = nlp("We just opened our wings, the flying part is coming soon")

for token in doc:
    if token.is_stop:
        print(token)

We
just
our
the
part
is


In [64]:
def preprocess(text):
    doc = nlp(text)
    
    no_stop_words = [token.text for token in doc if not token.is_stop]
    return " ".join(no_stop_words)            

In [103]:
preprocess("Musk wants time to prepare for a trial over his")

'Musk wants time prepare trial'

In [105]:
preprocess("The other is not other but your divine brother")

'divine brother'
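
Stop word removal is not always harmless: words such as "not" carry meaning, and dropping them can flip the sense of a sentence. The sketch below shows this with a small hand-written stop list and whitespace splitting. The `STOP` set and the `remove_stop_words` function are illustrative and much smaller than spaCy's `STOP_WORDS`.

```python
# Small hypothetical stop list for illustration
STOP = {"the", "is", "not", "but", "your", "other", "a", "to", "for", "over", "his"}

def remove_stop_words(text):
    # Keep only words whose lowercase form is not in the stop list
    return " ".join(word for word in text.split() if word.lower() not in STOP)

kept_meaning = remove_stop_words("The other is not other but your divine brother")
lost_negation = remove_stop_words("This movie is not good")
print(repr(kept_meaning))
print(repr(lost_negation))
```

For tasks such as sentiment analysis it may be safer to keep negations, while for topic-style tasks removing them usually matters less.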

##### Remove stop words from pandas dataframe text column

Dataset is downloaded from: https://www.kaggle.com/datasets/jbencina/department-of-justice-20092018-press-releases
It contains press releases of different court cases from the Department of Justice (DOJ). The releases contain information such as outcomes of criminal cases, notable actions taken against felons, or other updates about the current administration.

In [97]:
import pandas as pd

df = pd.read_json("doj_press.json",lines=True)

df.shape

(13087, 6)

In [98]:
df.head(5)

Unnamed: 0,id,title,contents,date,topics,components
0,,Convicted Bomb Plotter Sentenced to 30 Years,"PORTLAND, Oregon. – Mohamed Osman Mohamud, 23,...",2014-10-01T00:00:00-04:00,[],[National Security Division (NSD)]
1,12-919,$1 Million in Restitution Payments Announced t...,WASHINGTON – North Carolina’s Waccamaw River...,2012-07-25T00:00:00-04:00,[],[Environment and Natural Resources Division]
2,11-1002,$1 Million Settlement Reached for Natural Reso...,BOSTON– A $1-million settlement has been...,2011-08-03T00:00:00-04:00,[],[Environment and Natural Resources Division]
3,10-015,10 Las Vegas Men Indicted \r\nfor Falsifying V...,WASHINGTON—A federal grand jury in Las Vegas...,2010-01-08T00:00:00-05:00,[],[Environment and Natural Resources Division]
4,18-898,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir...",2018-07-09T00:00:00-04:00,[Environment],[Environment and Natural Resources Division]


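Filter out the rows that do not have any topics associated with the case. Each `topics` value is a list, so `.str.len()` returns the number of topics and an empty list gives 0.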

In [82]:
df = df[df["topics"].str.len() != 0]
df.head()

Unnamed: 0,id,title,contents,date,topics,components
4,18-898,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir...",2018-07-09T00:00:00-04:00,[Environment],[Environment and Natural Resources Division]
7,14-1412,14 Indicted in Connection with New England Com...,A 131-count criminal indictment was unsealed t...,2014-12-17T00:00:00-05:00,[Consumer Protection],[Civil Division]
19,17-1419,2017 Southeast Regional Animal Cruelty Prosecu...,The United States Attorney’s Office for the Mi...,2017-12-14T00:00:00-05:00,[Environment],"[Environment and Natural Resources Division, U..."
22,15-1562,21st Century Oncology to Pay $19.75 Million to...,"21st Century Oncology LLC, has agreed to pay $...",2015-12-18T00:00:00-05:00,"[False Claims Act, Health Care Fraud]",[Civil Division]
23,17-1404,21st Century Oncology to Pay $26 Million to Se...,21st Century Oncology Inc. and certain of its ...,2017-12-12T00:00:00-05:00,"[Health Care Fraud, False Claims Act]","[Civil Division, USAO - Florida, Middle]"


In [99]:
df.shape

(13087, 6)

In [84]:
df = df.head(100).copy()  # explicit copy, so a column added later is not written to a slice of the filtered frame
df.shape

(100, 6)

In [87]:
df["contents_new"] = df.contents.apply(preprocess)

In [88]:
df

Unnamed: 0,id,title,contents,date,topics,components,contents_new
4,18-898,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir...",2018-07-09T00:00:00-04:00,[Environment],[Environment and Natural Resources Division],"U.S. Department Justice , U.S. Environmental P..."
7,14-1412,14 Indicted in Connection with New England Com...,A 131-count criminal indictment was unsealed t...,2014-12-17T00:00:00-05:00,[Consumer Protection],[Civil Division],131 - count criminal indictment unsealed today...
19,17-1419,2017 Southeast Regional Animal Cruelty Prosecu...,The United States Attorney’s Office for the Mi...,2017-12-14T00:00:00-05:00,[Environment],"[Environment and Natural Resources Division, U...",United States Attorney Office Middle District ...
22,15-1562,21st Century Oncology to Pay $19.75 Million to...,"21st Century Oncology LLC, has agreed to pay $...",2015-12-18T00:00:00-05:00,"[False Claims Act, Health Care Fraud]",[Civil Division],"21st Century Oncology LLC , agreed pay $ 19.75..."
23,17-1404,21st Century Oncology to Pay $26 Million to Se...,21st Century Oncology Inc. and certain of its ...,2017-12-12T00:00:00-05:00,"[Health Care Fraud, False Claims Act]","[Civil Division, USAO - Florida, Middle]",21st Century Oncology Inc. certain subsidiarie...
...,...,...,...,...,...,...,...
316,15-1359,Alaska Plastic Surgeon Convicted of Wire Fraud...,Doctor Hid Millions in Secret Accounts in Pana...,2015-11-04T00:00:00-05:00,[Tax],[Tax Division],Doctor Hid Millions Secret Accounts Panama Ala...
318,16-396,Alaska Plastic Surgeon Sentenced to Prison for...,Defendant Concealed Bank Accounts in Panama an...,2016-04-04T00:00:00-04:00,[Tax],[Tax Division],Defendant Concealed Bank Accounts Panama Costa...
321,17-736,Alaskan Commercial Fishing Couple Charged with...,An Alaskan couple was charged in federal court...,2017-07-26T00:00:00-04:00,[Tax],"[Tax Division, USAO - Alaska]","Alaskan couple charged federal court Juneau , ..."
322,18-717,Alaskan Husband And Wife Plead Guilty To Willf...,A husband and wife pleaded guilty yesterday to...,2018-06-01T00:00:00-04:00,[Tax],[Tax Division],husband wife pleaded guilty yesterday counts w...


In [89]:
len(df.contents[4])

6286

In [90]:
len(df.contents_new[4])

4810
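
The two lengths above can be turned into a rough measure of how much text the stop word removal dropped. This is only a minimal sketch using the numbers printed by the cells above; the variable names are made up for illustration. Since `preprocess` joins tokens with spaces (note the extra space before commas in the cleaned text), character counts are an approximation, not an exact word-level measure.

```python
# Lengths copied from the outputs of len(df.contents[4]) and len(df.contents_new[4])
original_len = 6286
cleaned_len = 4810

# Characters removed and the relative reduction in percent
removed = original_len - cleaned_len
reduction_pct = round(100 * removed / original_len, 1)

print(removed, reduction_pct)  # 1476 23.5
```

So for this press release, roughly a quarter of the characters go away after stop word removal.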

In [94]:
df.contents[4][:300]

'The U.S. Department of Justice, the U.S. Environmental Protection Agency (EPA), and the Rhode Island Department of Environmental Management (RIDEM) announced today that two subsidiaries of Stanley Black & Decker Inc.—Emhart Industries Inc. and Black & Decker Inc.—have agreed to clean up dioxin conta'

In [95]:
df.contents_new[4][:300]

'U.S. Department Justice , U.S. Environmental Protection Agency ( EPA ) , Rhode Island Department Environmental Management ( RIDEM ) announced today subsidiaries Stanley Black & Decker Inc.—Emhart Industries Inc. Black & Decker Inc.—have agreed clean dioxin contaminated sediment soil Centredale Manor'

##### Examples where removing stop words can create a problem

**(1) Sentiment detection: Depending on your dataset, removing stop words can sometimes change the sentiment of a sentence. In the example here, "not" is a stop word, so a negative review ends up identical to a positive one.**

In [31]:
preprocess("this is a good movie")

'good movie'

In [32]:
preprocess("this is not a good movie")

'good movie'
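
One possible workaround is to exempt negation words from removal. The sketch below uses plain Python instead of spaCy so that it runs on its own. `TOY_STOP_WORDS`, `NEGATIONS`, and `remove_stop_words` are hypothetical names, and the tiny stop word set is an assumption for illustration only (spaCy's built-in list is much larger).

```python
# Illustrative stop word set only; NOT spaCy's actual list
TOY_STOP_WORDS = {"this", "is", "a", "an", "the", "not"}

# Words we may want to keep for sentiment tasks even though they are stop words
NEGATIONS = {"not", "no", "never"}

def remove_stop_words(text, stop_words=TOY_STOP_WORDS, keep=frozenset()):
    """Drop stop words from whitespace-separated text, except words listed in `keep`."""
    kept = [
        word for word in text.split()
        if word.lower() not in stop_words or word.lower() in keep
    ]
    return " ".join(kept)

print(remove_stop_words("this is not a good movie"))                  # good movie
print(remove_stop_words("this is not a good movie", keep=NEGATIONS))  # not good movie
```

The same idea could be applied inside `preprocess` by keeping a token when its lowercased text is in a negation set, even if `token.is_stop` is true. Whether that helps depends on the dataset and the model.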

**(2) Language translation: Say you want to translate the following sentence from English to Telugu. If you remove stop words before translating, the result will be very poor.**

In [35]:
preprocess("how are you doing dhaval?")

'dhaval ?'

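**(3) Chatbot or any Q&A system: Removing stop words can strip out the words that carry the user's intent, such as the negation in the question here.**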

In [39]:
preprocess("I don't find yoga mat on your website. Can you help?")

'find yoga mat website . help ?'