# Exploring techniques for NLP
As we mentioned in the lecture slides, the different approaches used to solve NLP problems commonly fall into three categories: 
- rule-based, 
- machine learning, and 
- deep learning. 

This notebook shows how to apply some of these approaches to NLP problems. 

You can open the cloud version of this notebook using the following link:
<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/lzadeh/NLP/blob/main/1-Exploring_techniques_for_NLP.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

## Rule-based technique
Similar to other early AI systems, early attempts at designing NLP systems were based on building rules for the task at hand. This required that the developers had some expertise in the domain to formulate rules that could be incorporated into a program. Such systems also required resources like dictionaries and thesauruses, typically compiled and digitized over a period of time.

Regular expressions (regex) are a great tool for text analysis and for building rule-based systems. A regex is a pattern of characters used to match and find substrings in text. For example, a regex like <b><font color='maroon'>‘^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$’</font></b> matches a string that consists of a single email ID; without the `^` and `$` anchors, the same pattern can be used to find email IDs inside a longer piece of text. Regexes are a great way to incorporate domain knowledge into your NLP system. For example, given a job advertisement, we may want to build a system that automatically identifies the contract type, the job title, and the salary. Contract types and salary ranges come in many forms, but most of them follow patterns that regexes can match.
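
As a minimal sketch of finding email IDs inside a longer text, the block below applies an unanchored variant of the pattern above with `re.findall`. The sample sentence and the helper name `find_emails` are assumptions made for illustration, and the pattern is not a complete validator for real-world addresses.

```python
import re

# Unanchored variant of the email pattern (no ^ or $), so it can match
# anywhere inside a longer string. Illustrative only, not a full validator.
EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9_\-\.]+@[a-zA-Z0-9_\-\.]+\.[a-zA-Z]{2,5}')

def find_emails(text):
    """Return every substring of `text` that looks like an email ID."""
    return EMAIL_PATTERN.findall(text)

# Hypothetical sample text
sample = "Contact alice@example.com or bob.smith@mail.example.org for details"
emails = find_emails(sample)
print(emails)
```

Because the pattern has no capture groups, `re.findall` returns the full matched substrings.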

Scenario: Consider a recruiting agent that tries to match candidates with available opportunities by identifying contract type, job title, and salary range from the following list of job descriptions:

- <b>Job Description 1:</b> "We are looking for a full-time software engineer to join our team. The salary for this position is $100,000 per year."

- <b>Job Description 2:</b> "Our company is offering a part-time administrative assistant job with a salary range of $20-25 per hour." 

- <b>Job Description 3:</b> "This is a 6-month internship opportunity for a data analyst. The contract rate is $75,000 per year." 

Regular expressions can be used to find matches for contract types, job titles, and salary ranges.

In [1]:
import re

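def extract_information(text):
  # Contract type: one of a fixed set of keywords
  contract_pattern = r'(full-time|part-time|contract|internship)'
  # Job title: the two words that follow the leading "Title: " label
  title_pattern = r'^Title: ([a-zA-Z]+ [a-zA-Z]+)'
  # Salary: a dollar amount, optionally followed by "," or "-" separated digit groups
  salary_pattern = r'\$\d+(?:[,-]\d+)*'

  contract = re.search(contract_pattern, text, re.IGNORECASE)
  title = re.search(title_pattern, text, re.IGNORECASE)
  salary = re.search(salary_pattern, text, re.IGNORECASE)

  # Replace each match object with the matched text; unmatched fields stay None
  if contract:
    contract = contract.group(1)
  if title:
    title = title.group(1)
  if salary:
    salary = salary.group(0)

  return contract, title, salary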

text1 = "Title: Software engineer, Description: We are looking for a full-time software engineer to join our team. The salary for this position is $100,000 per year."
text2 = "Title: Administrative assistant, Description: Our company is offering a part-time administrative assistant job with a salary range of $20-25 per hour."
text3 = "Title: Data analyst, Description: This is a 6-month internship opportunity for a data analyst. The contract rate is $75,000 per year."

print(extract_information(text1))
print(extract_information(text2))
print(extract_information(text3))

('full-time', 'Software engineer', '$100,000')
('part-time', 'Administrative assistant', '$20-25')
('internship', 'Data analyst', '$75,000')


The regular expressions used in this code are designed to capture common patterns in job post data, but they may not capture all possible variations. The code can be further refined and improved to capture more specific cases.
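
One possible refinement, sketched below under stated assumptions, also captures the pay period and converts the amounts to integers. The helper name `parse_salary`, the returned dictionary keys, and the restriction to "per year" / "per hour" are choices made for this illustration, not part of the code above.

```python
import re

# A dollar amount, an optional "-upper" bound, then "per year" or "per hour".
SALARY_PATTERN = re.compile(
    r'\$(\d[\d,]*)(?:-(\d[\d,]*))?\s+per\s+(year|hour)',
    re.IGNORECASE,
)

def parse_salary(text):
    """Return the salary range and pay period found in `text`, or None."""
    match = SALARY_PATTERN.search(text)
    if match is None:
        return None
    low, high, period = match.groups()
    low = int(low.replace(',', ''))
    # A single amount is treated as a range whose bounds are equal.
    high = int(high.replace(',', '')) if high else low
    return {'min': low, 'max': high, 'period': period.lower()}

salary1 = parse_salary("The salary for this position is $100,000 per year.")
salary2 = parse_salary("a salary range of $20-25 per hour.")
print(salary1)
print(salary2)
```

Returning numbers instead of raw strings makes it easier to compare offers, for example when filtering jobs by a minimum salary.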

## Information extraction
One major challenge in NLP is creating structured data from unstructured and/or semi-structured documents. Information extraction is the NLP task of retrieving structured data from plain text. 

### Exercise
The following texts have been selected from a newspaper article about the best players of the Champions League in 2022:

- Karim Mostafa Benzema is a French professional footballer who plays as a striker for La Liga club Real Madrid and the France national team.
- Kylian Mbappé Lottin is a French professional footballer who plays as a forward for Ligue 1 club Paris Saint-Germain and the France national team 
- Lewandowski is a Polish professional footballer who plays as a striker for Bundesliga club Bayern Munich and is the captain of the Poland national team. 

Let's create a function and pass a pattern and text to return the first match of the pattern inside the text.


In [2]:
def get_match_text(pattern, text):
    match_items = re.findall(pattern, text)
    if match_items:
        return match_items[0]
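
Because `get_match_text` relies on `re.findall`, a pattern with one capture group returns only the captured part, while a pattern without groups returns the whole match. The sketch below illustrates this on a made-up sentence; the player and club names are hypothetical and the function is repeated so that the block runs on its own.

```python
import re

def get_match_text(pattern, text):
    match_items = re.findall(pattern, text)
    if match_items:
        return match_items[0]

# Hypothetical sentence, not one of the exercise texts
sentence = 'Alex Example is a professional footballer who plays as a goalkeeper for Example FC.'

whole_match = get_match_text(r'plays as a \w+', sentence)    # no group: full match
group_only = get_match_text(r'plays as a (\w+)', sentence)   # one group: captured text
no_match = get_match_text(r'plays as an (\w+)', sentence)    # no match: None

print(whole_match)
print(group_only)
print(no_match)
```

Wrapping the part you want in parentheses is therefore a convenient way to drop the surrounding context from the result.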

In [3]:
player1 = 'Karim Mostafa Benzema is a French professional footballer who plays as a striker for La Liga club Real Madrid and the France national team.'
player2 = 'Kylian Mbappé Lottin is a French professional footballer who plays as a forward for Ligue 1 club Paris Saint-Germain and the France national team'
player3 = 'Lewandowski is a Polish professional footballer who plays as a striker for Bundesliga club Bayern Munich and is the captain of the Poland national team.'

### <font color='blue'> Exercise 1 </font>
Write a pattern to extract the name of the first player.

In [27]:
# Write a pattern to extract the name of the first player.
pattern = '?'
get_match_text(pattern, player1)

'Karim Mostafa Benzema '

### <font color='blue'> Exercise 2 </font>
Write a pattern to extract the club name for the second player.

In [31]:
# Write a pattern to extract the club name for the second player.
pattern = '?'
get_match_text(pattern, player2)

'Paris Saint-Germain'

### <font color='blue'> Exercise 3 </font>
Write a pattern to retrieve the national team name for the third player.

In [33]:
# Write a pattern to retrieve the national team name for the third player.
pattern = '?'
get_match_text(pattern, player3)

'Poland '

## Deep Learning for NLP
In the following sections you can see some interesting examples of deep learning techniques for NLP.
- Sample 1: Sentiment analysis
- Sample 2: Text generation 
- Sample 3: Question answering

Transformers are among the most advanced deep learning architectures for a wide range of NLP tasks. We will cover them in more detail later.

In [28]:
# you need to install tensorflow and transformers packages
# !pip install tensorflow
# !pip install -q transformers


In [1]:
# import transformers pipeline
from transformers import pipeline



In [2]:
# creating a ready to use sentiment analysis pipeline
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["I love NLP", "I hate NLP"]
sentiment_pipeline(data)


No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


[{'label': 'POSITIVE', 'score': 0.9997692704200745},
 {'label': 'NEGATIVE', 'score': 0.9990854263305664}]
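
The pipeline returns one dictionary per input, in the same order as the inputs. As a small sketch that does not re-run the model, the block below pairs each sentence with its prediction; the predictions are copied from the output above, and the helper name `summarize` is an assumption made for illustration.

```python
data = ["I love NLP", "I hate NLP"]

# Predictions copied from the output above (the model is not re-run here)
predictions = [
    {'label': 'POSITIVE', 'score': 0.9997692704200745},
    {'label': 'NEGATIVE', 'score': 0.9990854263305664},
]

def summarize(texts, predictions):
    """Pair each input text with its predicted label and rounded score."""
    return [
        (text, pred['label'], round(pred['score'], 4))
        for text, pred in zip(texts, predictions)
    ]

summary = summarize(data, predictions)
for row in summary:
    print(row)
```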

In [3]:
# Generating text using a text generation pipeline
text_generator = pipeline("text-generation", model="gpt2")
text_generator("Natural language processing (NLP) is", max_length=30, num_return_sequences=3)

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.

[{'generated_text': 'Natural language processing (NLP) is thought to be one of the key factors in improving verbal performance due to its influence on memory in humans. However'},
 {'generated_text': 'Natural language processing (NLP) is considered to be the cornerstone of the field of linguistics and of cognitive science. The goal is to obtain the'},
 {'generated_text': 'Natural language processing (NLP) is the core tool in NLP programs. NLP consists of two or more languages whose standard features (such as'}]

In [4]:
# Answering questions using a question answering pipeline
nlp = pipeline("question-answering") 
context = """ Natural language processing (NLP) is a subfield of linguistics, computer science, 
                and artificial intelligence concerned with the interactions between computers 
                and human language, in particular how to program computers to process and analyze 
                large amounts of natural language data. 
                The result is a computer capable of "understanding" the contents of documents, 
                including the contextual nuances of the language within them. The technology can 
                then accurately extract information and insights contained in the documents 
                as well as categorize and organize the documents themselves. """ 
nlp(question="What is NLP?", context=context)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some layers from the model checkpoint at distilbert-base-cased-distilled-squad were not used when initializing TFDistilBertForQuestionAnswering: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForQuestionAnswering were not initialized from the model ch

{'score': 0.9869260191917419,
 'start': 1,
 'end': 28,
 'answer': 'Natural language processing'}
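
The `start` and `end` values in the result are character offsets into `context`, so slicing the context with them recovers the answer. The sketch below checks this with the values copied from the output above and a shortened copy of the context; it does not re-run the model.

```python
# Shortened copy of the context above; the leading space is kept because
# the offsets count from the very first character of the string.
context = " Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence."

# Result copied from the output above
result = {
    'score': 0.9869260191917419,
    'start': 1,
    'end': 28,
    'answer': 'Natural language processing',
}

# Slice the context using the character offsets
span = context[result['start']:result['end']]
print(span)
```

This is useful when the answer needs to be highlighted in the original document rather than only displayed on its own.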