# Tokenization part 2
- How to extract emails from the `students.txt` file using spaCy and regex
- How to extract URLs using spaCy
- How to extract transactions using spaCy


In [None]:
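#!pip install spacy
#!python -m spacy download en_core_web_sm
# re is part of the Python standard library and needs no installation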

In [2]:
import spacy
import re


- The line `nlp = spacy.load("en_core_web_sm")` initializes a spaCy NLP (Natural Language Processing) pipeline from the pre-trained model `"en_core_web_sm"`.

- `spacy.load()`: A function provided by the spaCy library to load a trained pipeline. It takes a string argument giving the name of an installed pipeline package or the path to one.

- `"en_core_web_sm"`: The name of the pre-trained model being loaded.

- `"en_core_web_sm"` is the small English pipeline. It is trained on written English web text and comes with components for tokenization, part-of-speech tagging, dependency parsing, named entity recognition, and other NLP tasks. The name breaks down as follows:

- `en`: Stands for English.
- `core`: Signifies that the model includes the core NLP components.
- `web`: Means that the model is trained on web text data.
- `sm`: Stands for "small", the smallest model size.
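
The naming parts listed above can be read off the model name programmatically. The following is only a small sketch: the helper `describe_model_name` is hypothetical (it is not part of spaCy), and it assumes a four-part name of the form language, type, genre, size.

```python
def describe_model_name(name: str) -> dict:
    """Split a four-part pipeline name such as 'en_core_web_sm' into labeled parts.

    Hypothetical helper for illustration; assumes exactly four '_'-separated parts.
    """
    lang, kind, genre, size = name.split("_")
    return {"language": lang, "type": kind, "genre": genre, "size": size}


parts = describe_model_name("en_core_web_sm")
print(parts)
```

Names that do not follow the four-part layout would raise a `ValueError` in this sketch, so it is meant for reading the convention, not for validating package names.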

In [2]:
nlp = spacy.load("en_core_web_sm")

In [9]:
# Mount Google Drive (only needed when running in Google Colab)
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
# Open the file and read it into a list of lines
with open("/content/drive/My Drive/NLP sheet/Resources/students.txt") as f:
    text = f.readlines()
text

['NMAMIT Institute of Tachnology,Bachelor of Engineering \n',
 '\n',
 '  Name    birth day         email \n',
 ' \n',
 ' Virat   5/06/1882     virat@kohli.com\n',
 ' Maria   2/04/2001    maria@sharapova.com\n',
 ' Serena  24/06/1998   serena@williams.com\n',
 ' Joe      1/05/1997    joe@root.com\n',
 ' ashwal   11/05/2008    ashwal@nmamit.ac.in\n',
 ' Joen      1/09/2000  joen@nmamit.ac.in\n',
 ' Clawn    1/10/2005   clawn@nitte.edu.in.com\n',
 ' reeema      21/05/1997    reema@flagroots.com\n',
 ' leesha      4/6/2008    leesha@root.com\n',
 ' harshin      7/07/2019    harshin@dreams.com\n',
 ' Dhruvi       30/11/2022   dhrvi@nitte.edu.in\n',
 ' Jane      1/9/2000    jane@nmamit.ac.in\n',
 ' Jon      12/12/2003    jon@nmamit.ac.in\n',
 ' Jonin      10/2/1998    jonin@nmamit.ac.in\n',
 ' Haon      4/1/2010    hoen@nmamit.ac.in\n',
 ' Renz      5/5/1999    ren@nmamit.ac.in\n',
 ' Joe      1/9/2000    joe@nmamit.ac.in\n',
 'ChristopherAnderson    2/3/1981  ChristopherAnderson@gmail.com\n

In [8]:
# If using a local Jupyter Notebook instead of Colab
# with open("../Resources/students.txt") as f:
#     text = f.readlines()
# text


Converting the file content into a single string using `"".join()`:
- Easier manipulation: Regex operations work on one string rather than on a list of lines.
- Efficiency: A single pass over one string avoids looping over the list and processing each line separately.
- Pattern matching: Regex patterns can be applied to the entire content at once.
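
As a minimal sketch of what the join does, the example below uses a short hypothetical list `lines` that stands in for the result of `f.readlines()`; the two rows are copied from the file output shown earlier.

```python
# Hypothetical stand-in for the list returned by f.readlines()
lines = [
    " Virat   5/06/1882     virat@kohli.com\n",
    " Maria   2/04/2001    maria@sharapova.com\n",
]

# Each element already ends with "\n", so joining on the empty string
# keeps the original line layout intact.
joined = "".join(lines)

print(type(lines).__name__, len(lines))    # a list with one element per line
print(type(joined).__name__, len(joined))  # one string holding every character
```

Calling `f.read()` instead of `f.readlines()` would return the same single string directly, without the intermediate list.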

In [7]:
#Creating a single string text
#joins together all elements of the text variable into a single string.
text="".join(text)
text

'NMAMIT Institute of Tachnology,Bachelor of Engineering \n\n  Name    birth day         email \n \n Virat   5/06/1882     virat@kohli.com\n Maria   2/04/2001    maria@sharapova.com\n Serena  24/06/1998   serena@williams.com\n Joe      1/05/1997    joe@root.com\n ashwal   11/05/2008    ashwal@nmamit.ac.in\n Joen      1/09/2000  joen@nmamit.ac.in\n Clawn    1/10/2005   clawn@nitte.edu.in.com\n reeema      21/05/1997    reema@flagroots.com\n leesha      4/6/2008    leesha@root.com\n harshin      7/07/2019    harshin@dreams.com\n Dhruvi       30/11/2022   dhrvi@nitte.edu.in\n Jane      1/9/2000    jane@nmamit.ac.in\n Jon      12/12/2003    jon@nmamit.ac.in\n Jonin      10/2/1998    jonin@nmamit.ac.in\n Haon      4/1/2010    hoen@nmamit.ac.in\n Renz      5/5/1999    ren@nmamit.ac.in\n Joe      1/9/2000    joe@nmamit.ac.in\nChristopherAnderson    2/3/1981  ChristopherAnderson@gmail.com\nRonaldClark   3/4/1981        RonaldClark@gmail.com\nMaryWright    4/5/1981       MaryWright@gmail.com\nLi

## Getting Emails

### Using spaCy

In [None]:
doc = nlp(text)  # Doc object holding the tokenized text
emails = []  # collected email addresses
for token in doc:  # iterate over the Token objects
    if token.like_email:  # True if the token looks like an email address
        emails.append(token.text)  # store the token's text
print("email id:\n", emails)

email id:
 ['virat@kohli.com', 'maria@sharapova.com', 'serena@williams.com', 'joe@root.com', 'ashwal@nmamit.ac.in', 'joen@nmamit.ac.in', 'clawn@nitte.edu.in.com', 'reema@flagroots.com', 'leesha@root.com', 'harshin@dreams.com', 'dhrvi@nitte.edu.in', 'jane@nmamit.ac.in', 'jon@nmamit.ac.in', 'jonin@nmamit.ac.in', 'hoen@nmamit.ac.in', 'ren@nmamit.ac.in', 'joe@nmamit.ac.in', 'ChristopherAnderson@gmail.com', 'RonaldClark@gmail.com', 'MaryWright@gmail.com', 'LisaMitchell@gmail.com', 'MichelleJohnson@gmail.com', 'JohnThomas@gmail.com', 'DanielRodriguez@gmail.com', 'AnthonyLopez@gmail.com', 'PatriciaPerez@gmail.com', 'NancyWilliams@hotmail.com', 'LauraJackson@hotmail.com', 'RobertLewis@hotmail.com', 'PaulHill@hotmail.com', 'KevinRoberts@hotmail.com', 'LindaJones@hotmail.com', 'KarenWhite@hotmail.com', 'SarahLee@hotmail.com', 'MichaelScott@hotmail.com', 'MarkTurner@hotmail.com', 'JasonBrown@aol.com', 'BarbaraHarris@aol.com', 'BettyWalker@aol.com', 'KimberlyGreen@aol.com', 'WilliamPhillips@aol.co

### Using RegEx

In [None]:
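# re.findall returns a list of the matching strings
# Pattern: account name, "@", domain, ".", extension. Only one label is accepted
# after the domain, so multi-part domains such as nmamit.ac.in are cut to nmamit.ac
emails = re.findall(r"[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-z]+", text)
print('EMAIL ID: \n', emails)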

EMAIL ID: 
 ['virat@kohli.com', 'maria@sharapova.com', 'serena@williams.com', 'joe@root.com', 'ashwal@nmamit.ac', 'joen@nmamit.ac', 'clawn@nitte.edu', 'reema@flagroots.com', 'leesha@root.com', 'harshin@dreams.com', 'dhrvi@nitte.edu', 'jane@nmamit.ac', 'jon@nmamit.ac', 'jonin@nmamit.ac', 'hoen@nmamit.ac', 'ren@nmamit.ac', 'joe@nmamit.ac', 'ChristopherAnderson@gmail.com', 'RonaldClark@gmail.com', 'MaryWright@gmail.com', 'LisaMitchell@gmail.com', 'MichelleJohnson@gmail.com', 'JohnThomas@gmail.com', 'DanielRodriguez@gmail.com', 'AnthonyLopez@gmail.com', 'PatriciaPerez@gmail.com', 'NancyWilliams@hotmail.com', 'LauraJackson@hotmail.com', 'RobertLewis@hotmail.com', 'PaulHill@hotmail.com', 'KevinRoberts@hotmail.com', 'LindaJones@hotmail.com', 'KarenWhite@hotmail.com', 'SarahLee@hotmail.com', 'MichaelScott@hotmail.com', 'MarkTurner@hotmail.com', 'JasonBrown@aol.com', 'BarbaraHarris@aol.com', 'BettyWalker@aol.com', 'KimberlyGreen@aol.com', 'WilliamPhillips@aol.com', 'DonaldDavis@aol.com', 'JeffM
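
Unlike the spaCy result, the regex output above shortens `ashwal@nmamit.ac.in` to `ashwal@nmamit.ac`, because the pattern accepts only one dot-separated label after the domain name. One possible adjustment, shown here as a sketch and not as a complete email validator, is to allow one or more dot-separated labels. The names `SIMPLE`, `MULTI_LABEL`, and `sample` are illustrative; the sample rows are taken from the file contents shown earlier.

```python
import re

# Three rows copied from the students.txt output shown earlier
sample = (
    " ashwal   11/05/2008    ashwal@nmamit.ac.in\n"
    " Clawn    1/10/2005   clawn@nitte.edu.in.com\n"
    " Virat   5/06/1882     virat@kohli.com\n"
)

# The pattern used in the cell above: exactly one label after the domain
SIMPLE = re.compile(r"[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-z]+")

# Assumed alternative: one or more ".label" groups after the domain.
# The group is non-capturing so findall still returns whole matches.
MULTI_LABEL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

print(SIMPLE.findall(sample))
print(MULTI_LABEL.findall(sample))
```

Because every `.` in the domain must be followed by at least one label character, a sentence-ending period directly after an address is not swallowed into the match.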

## Collecting website links (URLs) from a paragraph using spaCy

In [None]:
text= '''Look for data to help you address the question. Governments are good sources because data from public research is often freely available. Good places to start include http://www.data.gov/, and http://www.science.gov/, and in the United Kingdom, http://data.gov.uk/. Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/, and the European Social Survey at http://www.europeansocialsurvey.org/. The current representation will be formed by a well-organized collection of agents, previously structured in a dynamic, control-based manner. This collection of agents will be built based on the analysis of activations of conception and structuring agents that intercommunicate. Having first deployed an intent, a global interpretation of the system’s situation is formed by means of questionings, qualifying aspects of things, memorized cases, development of numerous cognitive aspects by activating agents that operate proper scaling up, all of which will allow for the efficient emergence of the representation. The system’s interpretation of this collection of agents will take the form of http://www.systemsurvey.org/ a network of dynamic knowledge of apprehensions, operating through questions in a steadily activated loop. This knowledge network will be activated by the system and further developed based on inter-agent relations that will result in significant aggregations of knowledge, structures of dynamic knowledge with appropriate (domain.com) characteristics.'''

### Getting URLs

In [None]:
nlp = spacy.load("en_core_web_sm")  # load the pipeline (already loaded above, repeated so the cell runs on its own)
doc = nlp(text)  # Doc object holding the sequence of tokens
url = []  # collected URLs
for token_url in doc:  # iterate over the Token objects
    if token_url.like_url:  # True if the token looks like a URL
        url.append(token_url.text)  # store the token's text
print('URL:\n', url)

URL:
 ['http://www.data.gov/', 'http://www.science.gov/', 'http://data.gov.uk/.', 'http://www3.norc.org/gss+website/', 'http://www.europeansocialsurvey.org/.', 'http://www.systemsurvey.org/', 'domain.com']
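
Two entries in the list above, `http://data.gov.uk/.` and `http://www.europeansocialsurvey.org/.`, still carry the period that ended their sentence, and `domain.com` is reported even though it has no scheme. A possible post-processing step, sketched here on a hard-coded copy of that output (the names `cleaned` and `with_scheme` are illustrative), strips trailing punctuation and optionally keeps only entries that start with a scheme.

```python
# Copy of the list printed by the spaCy cell above
url = [
    "http://www.data.gov/",
    "http://www.science.gov/",
    "http://data.gov.uk/.",
    "http://www3.norc.org/gss+website/",
    "http://www.europeansocialsurvey.org/.",
    "http://www.systemsurvey.org/",
    "domain.com",
]

# Remove sentence punctuation that the tokenizer left attached to the URL
cleaned = [u.rstrip(".,;:!?") for u in url]

# Optional filter: keep only entries with an explicit http(s) scheme
with_scheme = [u for u in cleaned if u.startswith(("http://", "https://"))]

print(cleaned)
print(with_scheme)
```

Whether bare domains such as `domain.com` should be kept depends on the use case, so the scheme filter is left as a separate step.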


## Getting Transactions

In [4]:
transactions = "Aron gave two $ to Shawn, Smith gave 500 € to Johan"
nlp = spacy.load("en_core_web_sm")  # load the pipeline
doc = nlp(transactions)  # Doc object holding the sequence of tokens
for token_trans in doc:  # iterate over the Token objects
    next_i = token_trans.i + 1  # index of the following token
    # A number-like token directly followed by a currency symbol;
    # the bounds check avoids an IndexError when the last token is a number
    if token_trans.like_num and next_i < len(doc) and doc[next_i].is_currency:
        print(token_trans.text, doc[next_i].text)  # print the amount and its currency


two $
500 €
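
For comparison, the sketch below attempts the same extraction with a plain regular expression in place of spaCy. It is a swapped-in technique, not the method used above, and the pattern `AMOUNT` with its short, hand-picked list of currency symbols is an assumption made for illustration. It finds digit amounts only, which shows what `like_num` adds: spelled-out numbers such as "two" are missed.

```python
import re

transactions = "Aron gave two $ to Shawn, Smith gave 500 € to Johan"

# Digits (with an optional decimal part), optional whitespace, then one of a
# few hard-coded currency symbols. Spelled-out numbers are not covered.
AMOUNT = re.compile(r"(\d+(?:\.\d+)?)\s*([$€£¥₹])")

pairs = AMOUNT.findall(transactions)  # list of (amount, symbol) tuples
print(pairs)
```

Here only `500 €` is found, while the spaCy version also reports `two $`.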
