<a href="https://colab.research.google.com/github/pstorniolo/Master2021/blob/main/L0_Strings_and_regex_with_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Natural Language Processing - Lesson 0**


---


I can tell you without a shadow of a doubt that NLP is at the top of the still-open challenges of Artificial Intelligence.

But as geometry teaches us, there is no height without a base.


### > Why Colab? 
Colab (Google Colaboratory) is a free cloud service based on Jupyter Notebooks that supports... FREE GPU!!! 

Lectures will be held through Colab Notebooks. Downloading each notebook takes only a few simple steps:

( *only one thing is required ... having Google Drive or GitHub* )

1.   Click on https://bit.ly/2K59ZVD 
2.   `File` > `Save a copy in Drive` / or `Save a copy on GitHub`
3.   (Drive option) Go to your `Drive` and check that the copy of the notebook is in the `Colab Notebooks` folder
4.   (GitHub option) Choose which `repository` to copy the notebook to and then `open it with Colab`

## **Lesson 0 - Strings, Python and regular expressions**

Whenever we have textual data, we need to apply several pre-processing steps to the data to transform words into numerical features that work with machine learning algorithms and, in general, with a computer.

The pre-processing steps for a problem depend mainly on the domain and the problem itself, hence, we don’t need to apply all steps to every problem.

### String Basic Manipulation

One place where the Python language really shines is in the manipulation of strings. This section will cover some of Python's built-in string methods and formatting operations, before moving on to a quick guide to the extremely useful subject of *regular expressions*.

Strings ( `str` ) are one of the most basic data types in Python, used to represent textual data. In Python 3 a string is an immutable sequence of Unicode characters (code points), not an array of raw bytes.


In [None]:
# How to define a string? 
string_1 = "Example string #1"
string_2 = 'Exa\'mple string #2'
print(string_1)   
print(type(string_1))
print(string_2)   
print(type(string_2))


Example string #1
<class 'str'>
Exa'mple string #2
<class 'str'>


Python does not have a separate character data type: a single character is simply a string of length 1. To access it, use bracket notation, exactly as with a `list`.

In [None]:
string_3 = "This is a string."
print(string_3[0])
print(string_3[0:7])

T
This is
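The bracket notation also accepts negative indices and a step, and immutability means that item assignment fails. The following is a minimal sketch of those behaviours; the variable names are only illustrative.

```python
# Indexing, slicing, and immutability on a sample string.
text = "This is a string."

last_char = text[-1]      # negative indices count from the end
first_word = text[:4]     # slice from the start up to (not including) index 4
every_other = text[::2]   # a third slice field sets the step

try:
    text[0] = "t"         # strings are immutable: item assignment fails
except TypeError as err:
    error_message = str(err)

# Build a new string instead of modifying the old one.
lowered = "t" + text[1:]

print(last_char, first_word, every_other, lowered, sep=" | ")
```

Any operation that seems to change a string actually returns a new one, which is why the result has to be assigned to a name.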


As with a `list`, use the `len()` function to get the length of a string.


In [None]:
len(string_3)

17

The `+` and `*` operators are overloaded for the `str` class: `+` concatenates two strings and `*` repeats a string a given number of times.

In [None]:
first_name = "Abraham"
last_name = " Lincoln"
first_name + last_name

'Abraham Lincoln'

In [None]:
"a" * 10

'aaaaaaaaaa'

Python has a set of very useful built-in methods that you can use on strings. Let's look at the most commonly used ones.

**Formatting strings: Adjusting case**

`upper()` `lower()` `capitalize()` `title()` `swapcase()`



In [None]:
fox = "tHe qUICk bROWn fOx."
print(fox.upper())
print(fox.lower())
print(fox.capitalize())
print(fox.title())
print(fox.swapcase())

THE QUICK BROWN FOX.
the quick brown fox.
The quick brown fox.
The Quick Brown Fox.
ThE QuicK BrowN FoX.


**Formatting strings: Adding and removing spaces**

`strip()`


In [None]:
# Another common need is to remove spaces (or other characters) from the beginning or end of a string.
# To remove them only on the right or on the left, use rstrip() or lstrip() respectively.
line = '         this is the content         '
line.strip()   # 'this is the content'
line.rstrip()  # '         this is the content'
line.lstrip()  # only this last expression is displayed by the notebook

'this is the content         '

In [None]:
num = "444440004035"
num.strip('4')

'0004035'
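One detail that is easy to miss: the argument of `strip()`, `lstrip()` and `rstrip()` is treated as a set of characters to remove, not as a prefix or suffix. A short sketch, with made-up sample values:

```python
# strip() and friends accept a *set* of characters, not a prefix or suffix.
url = "www.example.com"
core = url.strip("w.moc")      # drops any of 'w', '.', 'm', 'o', 'c' from both ends

padded = "   padded   "
left_clean = padded.lstrip()   # only leading whitespace is removed
right_clean = padded.rstrip()  # only trailing whitespace is removed

num = "444440004035"
no_leading_fours = num.lstrip("4")   # the '4' inside the string is untouched

print(repr(core), repr(left_clean), repr(right_clean), repr(no_leading_fours))
```

Stripping stops at the first character that is not in the set, so characters in the middle of the string are never removed.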

**Splitting and partitioning strings**

`str.split()`  `str.join()`


In [None]:
# split() finds all instances of the split point and returns the substrings in between. By default it splits on any whitespace, returning a list of the individual words in a string:
line_str = line.split()
line_str

['this', 'is', 'the', 'content']

In [None]:
line_test = "A! My name is Elena! And yours?"
line_test.split('A')

['', '! My name is Elena! ', 'nd yours?']

In [None]:
# join() reverses the effect of split(): it concatenates a list of strings, inserting the separator between them
' -. - '.join(['1','2','3'])

'1 -. - 2 -. - 3'
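The difference between splitting on any whitespace and splitting on an explicit separator can be surprising, so here is a small sketch. The sample sentence is invented for illustration.

```python
# split() with and without an explicit separator, plus a round trip with join().
sentence = "the quick  brown fox"     # note the double space

words = sentence.split()              # runs of whitespace count as one separator
pieces = sentence.split(" ")          # every single space is a separator
head, rest = sentence.split(maxsplit=1)   # split only once, at the first gap

rejoined = " ".join(words)            # glue the words back with single spaces

print(words)
print(pieces)
print(repr(head), repr(rest))
print(rejoined)
```

With an explicit separator, two adjacent separators produce an empty string in the result, which is why `pieces` has one more element than `words`.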

**Format Strings**

`str() ` `+`  `format()`


In [None]:
# Another use of string methods is to manipulate string representations of values of other types. 
pi = 3.14159
str(pi)

'3.14159'

In [None]:
# Concatenation
"The value of pi is " + str(pi)

'The value of pi is 3.14159'

In [None]:
# format() is a more flexible way to build strings. Inside the {} marker you can also include information on exactly what you would like to appear there
"The value of pi is {}".format(pi)
"First letter: {}. Last letter: {}.".format('A', 'Z')
"First letter: {first}. Last letter: {last}.".format(last='Z', first='A') # You can refer to a string with a keyword

'First letter: A. Last letter: Z.'
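The "information inside the `{}` marker" can be a format specification such as a number of decimals or a column width. The sketch below shows a few common cases, together with the equivalent f-string syntax available in Python 3.6 and later; the values are illustrative.

```python
# Format specifications inside {} markers.
pi = 3.14159

two_decimals = "The value of pi is {:.2f}".format(pi)   # fixed-point, 2 decimals
padded = "[{:>8}]".format("fox")                         # right-align in 8 columns
repeated = "{0}-{1}-{0}".format("a", "b")                # positional indices can repeat
f_string = f"The value of pi is {pi:.3f}"                # same mini-language in an f-string

print(two_decimals)
print(padded)
print(repeated)
print(f_string)
```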

**Finding and replacing substrings**

`find()`  `index()` 


In [None]:
line = 'the quick brown fox jumped over a lazy fox'
line.find('fox')

16

In [None]:
line.index('fo')

16

`find()` and `index()` are very similar; the only difference is what happens when no occurrence is found: `find()` returns `-1`, while `index()` raises a `ValueError`.

In [None]:
line.find('bear')

-1

In [None]:
#line.index('bear')
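The two behaviours can be compared side by side without stopping the notebook by catching the exception. This is only a sketch; `rfind()` and the `in` operator are added as closely related tools.

```python
# find() vs index() when the substring is missing.
line = 'the quick brown fox jumped over a lazy fox'

missing_pos = line.find('bear')       # -1 signals "not found"

try:
    line.index('bear')                # index() raises instead of returning -1
    index_failed = False
except ValueError:
    index_failed = True

last_fox = line.rfind('fox')          # search from the right-hand side
has_fox = 'fox' in line               # membership test when the position is not needed

print(missing_pos, index_failed, last_fox, has_fox)
```

Because `-1` is also a valid index (the last character), it is worth checking the result of `find()` explicitly before using it to slice.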

`startswith()` `endswith()`


In [None]:
line.endswith('dog')

False

In [None]:
line.startswith('the')

True

`replace()`


In [None]:
# Search and replace static strings
line.replace('brown', 'red')

'the quick red fox jumped over a lazy fox'

And what if we wanted to find all occurrences of a substring inside a string?

There is no single built-in string method that returns them all, but we can use the more powerful **regular expressions** library.

In [None]:
import re
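As a first taste, here is one possible way to collect every position of a substring, once with `re.finditer()` and once with a plain loop over `str.find()`. The helper name `find_all` is hypothetical, not part of the standard library.

```python
import re

line = 'the quick brown fox jumped over a lazy fox'

# Regex version: re.escape() makes sure the substring is taken literally.
regex_positions = [m.start() for m in re.finditer(re.escape('fox'), line)]

def find_all(text, sub):
    """Return the start index of every occurrence of sub in text (hypothetical helper)."""
    positions = []
    start = 0
    while True:
        idx = text.find(sub, start)
        if idx == -1:
            break
        positions.append(idx)
        start = idx + 1   # step by one, so overlapping occurrences are found too
    return positions

loop_positions = find_all(line, 'fox')
print(regex_positions, loop_positions, line.count('fox'))
```

The two versions agree here, but they differ on overlapping occurrences: `re.finditer()` only reports non-overlapping matches, while the loop above advances one character at a time.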

**Regular expression**

Python has a built-in package called **re**, which can be used to work with Regular Expressions.

The `re` module offers a set of functions that allow us to search a string for a match:

| Function | Description |
| --- | --- |
| `findall` | Returns a list containing all matches |
| `search` | Returns a Match object if there is a match anywhere in the string |
| `split` | Returns a list where the string has been split at each match |
| `sub` | Replaces one or many matches with a string |


What is a Regular Expression?

A regular expression is a special text string for describing a search pattern. You are probably familiar with wildcard notations such as `*.txt` to find all text files in a file manager. The regex equivalent is `.*\.txt`.


Let's see together the basic rules for writing a regular expression.

> `abc` Letters

>  `123` digits

> `\d`	Any Digit

> `\D`	Any Non-digit character

> `.` The wildcard: it matches any single character (letter, digit, whitespace, anything) except, by default, a newline. To match a literal period, escape the dot with a backslash: `\.`

> `[abc]`	Only a, b, or c

> `[^abc]`	Not a, b, nor c

> `[a-z]`	Characters a to z (also `[a-zA-Z]`)

> `[0-9]`	Numbers 0 to 9

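> `\w`	Any alphanumeric character or underscore; for ASCII text it is equivalent to `[a-zA-Z0-9_]` (in Python 3 it also matches Unicode letters and digits unless the `re.ASCII` flag is set)

> `\W`	Any character that `\w` does not match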

>  `{m}`	m Repetitions

>  `*`	Zero or more repetitions

> `+`	One or more repetitions

> `?`	Optional character

> `\s`	Any Whitespace

> `\S`	Any Non-whitespace character

> `^…$`	Start and end of the string

> `(…)`	Capture Group. Regular expressions allow us to not just match text but also to *extract information for further processing*. This is done by defining **groups of characters** and capturing them using the special parentheses **(** and **)** metacharacters. Any subpattern inside a pair of parentheses will be captured as a group.

> `(a(bc))`	Capture Sub-group

> `(.*)`	Capture all

> `(abc|def)`	Matches abc or def
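A few of the rules above can be tried directly in Python. The sketch below uses made-up sample strings and writes every pattern as a raw string (`r"..."`) so that backslashes reach the regex engine unchanged.

```python
import re

digits = re.findall(r"\d+", "Room 101, floor 3")          # one or more digits
vowels = re.findall(r"[aeiou]", "regex")                  # character class
spellings = re.findall(r"colou?r", "color colour")        # '?' makes the 'u' optional
is_year = re.fullmatch(r"\d{4}", "2021") is not None      # exactly four digits
either = re.findall(r"(abc|def)", "abcdefxyz")            # alternation inside a group
dots = re.findall(r"\.", "a.b.c")                         # escaped dot matches a literal '.'
anchored = re.search(r"^the.*fox$", "the quick fox") is not None   # start ... end

print(digits, vowels, spellings, is_year, either, dots, anchored)
```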

Now, go to this link [https://regex101.com/](https://regex101.com/) and solve some regex exercises. 


Very popular examples of regular expressions, or regex, are:



> `\w+@\w+\.[a-z]{3}`

> `[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]` 
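To see what these two patterns accept, they can be tested with `re.fullmatch()`. The reading of the first as a simplified e-mail pattern and of the second as a 16-character code with the shape of an Italian fiscal code is an assumption, as are the helper names and the sample values below.

```python
import re

# The two patterns listed above, compiled once.
EMAIL_RE = re.compile(r"\w+@\w+\.[a-z]{3}")
CODE_RE = re.compile(r"[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]")

def is_simple_email(text):
    """True if the whole text matches the simplified e-mail pattern (hypothetical helper)."""
    return EMAIL_RE.fullmatch(text) is not None

def has_code_shape(text):
    """True if the whole text has the 16-character letter/digit layout (hypothetical helper)."""
    return CODE_RE.fullmatch(text) is not None

print(is_simple_email("tony@tiger.net"))          # fits the pattern
print(is_simple_email("first.last@example.com"))  # '.' before the '@' is not matched by \w
print(is_simple_email("info@example.io"))         # two-letter ending does not fit {3}
print(has_code_shape("RSSMRA80A01H501U"))         # made-up sample with the right layout
print(has_code_shape("rssmra80a01h501u"))         # lowercase letters do not fit [A-Z]
```

The rejected addresses show that the first pattern is deliberately simple: it is a teaching example, not a complete validator for real e-mail addresses.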



 
----------------------




In [None]:
line = 'the  ++ quick brown fox  +++ jumped over + a lazy fox'
print(line)
print(line.split())
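print(line.split(r'\++'))      # str.split() takes its argument literally, so nothing is split
print(re.split(r'\++', line))  # re.split() treats it as a pattern: one or more '+' characters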

the  ++ quick brown fox  +++ jumped over + a lazy fox
['the', '++', 'quick', 'brown', 'fox', '+++', 'jumped', 'over', '+', 'a', 'lazy', 'fox']
['the  ++ quick brown fox  +++ jumped over + a lazy fox']
['the  ', ' quick brown fox  ', ' jumped over ', ' a lazy fox']


`compile()` compiles a regular expression pattern into a regular expression object, which can be used for matching using its `match()` and `search()` methods. The sequence 


> `prog = re.compile(pattern)`

> `result = prog.match(string)`

is equivalent to 

>  `result = re.match(pattern, string)`

but using `re.compile()` and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.
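A typical case is applying the same pattern to many strings in a loop. The following sketch, with invented sample lines, compiles the pattern once and reuses it.

```python
import re

# Compile once, reuse many times.
word_re = re.compile(r"\w+")

lines = ["the quick brown fox", "jumped over", "a lazy fox"]
word_counts = [len(word_re.findall(text)) for text in lines]

# The compiled object exposes the same operations as the module-level functions.
first_word = word_re.match(lines[0]).group()

print(word_counts, first_word)
```

The `re` module also keeps an internal cache of recently used patterns, so for a handful of expressions the speed difference is small; a compiled object mainly helps readability and tight loops.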


`match()` checks for a match only at the beginning of the string, while `search()` checks for a match anywhere in the string and returns the *first location* where the regular expression produces a match. (In short, `search()` is a `match()` that acts on the whole string.)


In [None]:
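regex = re.compile(r'\s+')   # one or more whitespace characters
for s in ["abc", "ab c", " abc"]:
    # match() only succeeds if the pattern matches at the start of s
    if regex.match(s):
        print(s, "match() matches")
    else:
        print(s, "match() does not match")
    # search() succeeds if the pattern matches anywhere in s
    if regex.search(s):
        print(s, "search() matches")
    else:
        print(s, "search() does not match")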

abc match() does not match
abc search() does not match
ab c match() does not match
ab c search() matches
 abc match() matches
 abc search() matches
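The same comparison can be extended with `fullmatch()`, which requires the pattern to cover the entire string. A small sketch with made-up inputs:

```python
import re

number_re = re.compile(r"\d+")

at_start = number_re.match("42 apples")       # matches: the digits are at the start
not_at_start = number_re.match("apples 42")   # None: match() never looks past position 0
anywhere = number_re.search("apples 42")      # matches: search() scans the whole string
whole = number_re.fullmatch("42")             # matches: the pattern covers everything
partial = number_re.fullmatch("42 apples")    # None: trailing text is left over

print(at_start.group(), not_at_start, anywhere.group(), whole.group(), partial)
```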


When a pattern is matched, a Match object is created. Match objects always have a boolean value of `True`, while a failed match returns `None`, so you can test whether there was a match with a simple `if` statement.

Match objects support many methods and attributes, the main ones being the following:  

- `group([group1,...])`
- `groups([default])`
- `start([group])`
- `end([group])`


In [None]:
# group() returns one or more subgroups of the match. With a single argument the result is a single string;
# with multiple arguments the result is a tuple with one item per argument.
# Without arguments, group1 defaults to zero (the whole match is returned)
m = re.search(r"(\w+) (\w+)", "Isaac Newton, physicist")
m.group(0)   # 'Isaac Newton'
m.group(1)   # 'Isaac'
m.group(2)   # 'Newton' (the last expression is the one displayed)
#m.group(3)  # would raise IndexError: no such group

'Newton'

In [None]:
m.groups() # groups() returns a tuple containing all the subgroups of the match

('Isaac', 'Newton')

In [None]:
m.string[m.start():m.end()]
#m.start(0) # start() and end() take the group number as input; the default is 0
#m.end(0)

'Isaac Newton'

In [None]:
email = "tony@tiremove_thisger.net"  # address containing a token to strip out
m = re.search("remove_this", email)
email[:m.start()] + email[m.end():]

'tony@tiger.net'
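The position methods also work per group, and a failed search has to be guarded against because it returns `None`. A brief sketch on the same sample sentence used earlier:

```python
import re

m = re.search(r"(\w+) (\w+)", "Isaac Newton, physicist")

whole_span = m.span()                    # (start, end) of the whole match
surname_span = (m.start(2), m.end(2))    # positions of the second group only
both_names = m.group(1, 2)               # several groups at once give a tuple

# A failed search returns None, so test before calling any method on the result.
no_match = re.search(r"\d", "no digits here")
found_digit = no_match.group() if no_match else None

print(whole_span, surname_span, both_names, found_digit)
```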

`re.findall()` scans the string from left to right and returns all non-overlapping matches, in the order found, as a list of strings (or a list of tuples if the pattern contains more than one group).

`re.finditer()` returns an iterator yielding Match objects over all non-overlapping matches of the pattern in the string.

In [None]:
array = re.findall(r"(\w+) (\w+)", "Isaac Newton and I, physicist")  # two groups: a list of tuples
for i in array:
  print(i[0]+" "+i[1])

Isaac Newton
and I


In [None]:
iterator = re.finditer(r"(\w+) (\w+)", "Isaac Newton and I, physicist")  # iterator of Match objects
for i in iterator:
  print(i.group())


Isaac Newton
and I
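What `re.findall()` returns depends on how many capture groups the pattern contains, which is a frequent source of confusion. The sketch below runs three variants of the same pattern on the sentence used above.

```python
import re

text = "Isaac Newton and I, physicist"

no_groups = re.findall(r"\w+ \w+", text)        # no group: whole matches as strings
one_group = re.findall(r"(\w+) \w+", text)      # one group: only that group's text
two_groups = re.findall(r"(\w+) (\w+)", text)   # several groups: a tuple per match

# finditer() yields Match objects, so both the text and the positions are available.
starts = [m.start() for m in re.finditer(r"\w+ \w+", text)]

print(no_groups)
print(one_group)
print(two_groups)
print(starts)
```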


Another powerful tool is the function `re.sub()`, which operates much like `str.replace()` but takes a pattern instead of a fixed substring.



> `result = re.sub(pattern, repl, string, count=0, flags=0)` 



In [None]:
# Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string with the replacement repl. If the pattern isn’t found, string is returned unchanged
result = re.sub('abc','','There is abc before def')   # Delete pattern abc
result = re.sub(r'\s+',' ',result)           # Collapse runs of whitespace into a single space
result = re.sub('abc(def)ghi', r'\1', 'abcdefghi') # Replace a string with a part of itself
result = re.sub(r"(\d+) (\w+)", r"\2 \1", '1 a') # Swap the two captured groups
result

'a 1'
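The `count` argument in the signature above limits the number of replacements, and `repl` may also be a function that receives each Match object. A short sketch with invented sample strings; `re.subn()` is shown as a related function that also reports how many replacements were made.

```python
import re

collapsed = re.sub(r"\s+", " ", "a   b \n c")               # every whitespace run -> one space
only_first = re.sub(r"fox", "cat", "fox fox fox", count=1)  # replace just the first match

# repl can be a function: here each number found is doubled.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples and 10 pears")

# subn() returns the new string together with the number of substitutions.
new_text, n_subs = re.subn(r"fox", "cat", "fox fox")

print(collapsed)
print(only_first)
print(doubled)
print(new_text, n_subs)
```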