# Regular Expressions 

## Practical part
Load the text "The Time Machine" by H.G. Wells from your txt-file into a string variable.

In [36]:
from re import findall

import requests

text_file_url = 'https://www.gutenberg.org/files/35/35-0.txt'
paper_url = 'https://arxiv.org/pdf/1706.03762'

response_1 = requests.get(text_file_url)
response_2 = requests.get(paper_url)

with open('time_machine.txt','w', encoding='utf-8') as f:         # to write unicode characters we use encoding as utf-8
    f.write(response_1.text)     # we need to use .text for text content instead of .content
    
with open('Attention is all you need paper.pdf','wb') as f:      # 'wb' is used to write on binary files
    f.write(response_2.content)



In [37]:
with open('time_machine.txt','r',encoding='utf-8') as file:
    text=file.read().rstrip()
    
    

In [38]:
type(text)

str

## String Operations

Work with the pure text string and use the Python string methods to solve the following problems:

- Find all occurrences of the phrase "The Time Machine".
- Find all occurrences of the word "time", independent of letter capitalization (so also "Time" or even "tiMe", if it appeared).
- Split your text at every occurrence of a newline "\n". You will get a list of strings.
- Afterwards, revert this operation by joining the resulting list of strings again correctly. Make sure that your result equals the original text. 
- Try to transform the text into a list of words by using the split() operation. 
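
The five tasks above can be sketched with plain string methods only. This is a minimal sketch on a short, made-up `sample` string (an assumption, standing in for the full novel); the same calls should work on the loaded `text`. Note that `lower().count("time")` also counts "time" inside longer words such as "sometimes".

```python
# Hypothetical stand-in for the full novel so the example runs on its own.
sample = "The Time Machine\nby H. G. Wells\n\nIt was time. TIME and tiMe again: The Time Machine."

# 1. Exact phrase: str.count returns the number of non-overlapping occurrences.
phrase_count = sample.count("The Time Machine")

# 2. Case-insensitive "time": lower-case a copy of the text, then count.
time_count = sample.lower().count("time")

# 3. Split at every newline ...
lines = sample.split("\n")

# 4. ... and join with the same separator to get the original string back.
rejoined = "\n".join(lines)

# 5. split() without arguments splits on any run of whitespace.
words = sample.split()

print(phrase_count, time_count, len(lines), rejoined == sample, len(words))
```

Joining with `"\n"` is what makes step 4 the exact inverse of step 3; joining with an empty string would drop the line breaks.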
   

In [39]:
import re
# Use re.compile() to build a pattern for "The Time Machine" and findall() to collect all instances.
# The character classes also match lower-case variants such as "the time machine".

count= re.compile('[tT]he [tT]ime [mM]achine').findall(text)
len(count)

27

In [40]:
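import re
# Note: '[tT]ime' only matches "time" and "Time"; pass flags=re.IGNORECASE to also catch "TIME" or "tiMe".
count=re.compile('[tT]ime').findall(text)
len(count)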

232

In [41]:
# Use str.splitlines() to split the string into a list of lines.
print(type(text))
splited_text=text.splitlines(keepends=False)
#print(text)
print(type(text.splitlines()))

<class 'str'>
<class 'list'>


In [51]:
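# Split at "\n" and join with the same separator to restore the original text.
# (''.join(text) would iterate over the characters of the string itself, so the check would be trivially true.)
joined_text='\n'.join(text.split('\n'))

if joined_text==text:
    print(True)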

True


In [56]:
# Use str.split() to split the text into a list of words (it splits on any run of whitespace).
type(text)
#print(text.split())
print(len(text.split()))


35513


## Encodings

Some small exercises to see unicode in action. 


https://docs.python.org/3/howto/unicode.html

Examples:

In [57]:
# Avoid naming a variable `str`: that would shadow the built-in type.
lesson = "This is a unicode lesson."
lesson

'This is a unicode lesson.'

In [59]:
lesson.encode("utf-8")


b'This is a unicode lesson.'

In [60]:
lesson = "This is ánother unicode lesson. $%"

In [61]:
lesson.encode("utf-8")

b'This is \xc3\xa1nother unicode lesson. $%'

### Unicode
- Copy some text in Cyrillic or Chinese from a website you trust, e.g. OTH.
- Print the Unicode code point of each character. Use the package _unicodedata_ to get the category and the name of each character.
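
A minimal sketch of the per-character inspection described above. The `sample` string is a made-up example (an assumption, not text taken from any particular website); `ord`, `unicodedata.category` and `unicodedata.name` are standard-library calls.

```python
import unicodedata

# Hypothetical sample mixing Cyrillic, punctuation, a space and Chinese.
sample = "Привет, 世界"

rows = []
for char in sample:
    code_point = f"U+{ord(char):04X}"            # code point in the usual U+XXXX notation
    category = unicodedata.category(char)         # e.g. "Lu" = uppercase letter, "Lo" = other letter
    name = unicodedata.name(char, "<unnamed>")    # fallback for characters without a name
    rows.append((char, code_point, category, name))
    print(char, code_point, category, name)
```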

Normalization:
- Unicode representations are not unique. Find below two different representations for the letter "á".
- Convince yourself that they represent the same letter.
- Evaluate whether the strings are equal in Python.
- Use unicodedata.normalize to achieve string equality.
- What does the normalized representation look like?
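
The normalization steps above can be sketched as follows. The two representations are written with explicit escapes so they stay distinguishable in the source: a precomposed "á" (U+00E1) and an "a" followed by a combining acute accent (U+0301). The variable names are illustrative only.

```python
import unicodedata

composed = "\u00e1"      # LATIN SMALL LETTER A WITH ACUTE, one code point
decomposed = "a\u0301"   # "a" + COMBINING ACUTE ACCENT, two code points

# Both render as the same letter, but Python compares code points.
print(composed, decomposed)
raw_equal = composed == decomposed

# After normalizing both sides to the same form, the strings compare equal.
nfc_equal = unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)
nfd_equal = unicodedata.normalize("NFD", composed) == unicodedata.normalize("NFD", decomposed)

# NFC prefers the precomposed form, NFD the decomposed one.
nfc_points = [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", decomposed)]
nfd_points = [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", composed)]

print(raw_equal, nfc_equal, nfd_equal, nfc_points, nfd_points)
```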

In [69]:
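# Use a descriptive name instead of `str`, which would shadow the built-in type.
chinese_text='你能肯定吗？'
encoded_str=chinese_text.encode('utf-8')
#print(len(chinese_text))

In [67]:
import unicodedata
for char in chinese_text:
    cate=unicodedata.category(char)
    print(cate)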

Lo
Lo
Lo
Lo
Lo
Po


In [73]:
accented='á'
unicodedata.name(accented[0])

'LATIN SMALL LETTER A WITH ACUTE'

In [78]:
# NFKD leaves pure-ASCII text unchanged, so this example returns the input as-is.
sentence='I am doing Homework.'
unicodedata.normalize('NFKD',sentence)

'I am doing Homework.'

### ASCII encoding
Encode the string in ASCII and decode it into UTF-8.
- What happens?
- What could you use it for?
- Use different error-handling options for encoding ("strict", "replace", "backslashreplace", "namereplace"; there are even more).
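
A minimal sketch comparing the error handlers named above on the same string. `lesson` and `results` are illustrative names; the handler names are the standard ones accepted by `str.encode`.

```python
lesson = "This is ànother unicode lesson"

# "strict" is the default: a character outside ASCII raises UnicodeEncodeError.
try:
    lesson.encode("ascii", errors="strict")
    strict_failed = False
except UnicodeEncodeError as error:
    strict_failed = True
    print("strict:", error)

# The other handlers substitute or drop the offending character instead.
handlers = ("ignore", "replace", "backslashreplace", "xmlcharrefreplace", "namereplace")
results = {name: lesson.encode("ascii", errors=name) for name in handlers}
for name, encoded in results.items():
    print(f"{name:>18}: {encoded}")

# ASCII is a subset of UTF-8, so ASCII bytes always decode cleanly as UTF-8.
round_trip = results["replace"].decode("utf-8")
print(round_trip)
```

Because the result contains only ASCII bytes, this kind of encoding can serve as a lossy fallback for systems that cannot handle non-ASCII input, at the cost of altering the original text.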



In [84]:
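lesson = "This is ànother unicode lesson"

In [92]:
# 'namereplace' substitutes each non-ASCII character with a \N{...} escape containing its Unicode name.
lesson.encode('ascii',errors='namereplace')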

b'This is \\N{LATIN SMALL LETTER A WITH GRAVE}nother unicode lesson'

## Regular Expressions

Familiarize yourself with the Python re package: https://docs.python.org/3/library/re.html

Warmup: Find all occurrences of the pattern "a[bcd]*b" in the string "abcbdab"

In [93]:
# findall() returns all non-overlapping matches, scanned from left to right.
sample='abcbdab'
re.compile('a[bcd]*b').findall(sample)

['abcb', 'ab']
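
To see why the warmup yields 'abcb' rather than 'ab' for the first match, it may help to look at the match positions and at a non-greedy variant. A short sketch; the lazy pattern `a[bcd]*?b` is added here for comparison only and is not part of the exercise.

```python
import re

sample = "abcbdab"

# Greedy: [bcd]* takes as many characters as possible, then backtracks
# until the final "b" can still match.
greedy = [(m.group(), m.span()) for m in re.finditer(r"a[bcd]*b", sample)]

# Lazy: [bcd]*? takes as few characters as possible.
lazy = [(m.group(), m.span()) for m in re.finditer(r"a[bcd]*?b", sample)]

print(greedy)
print(lazy)
```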

In [None]:
import re
re.compile('')

Perform the following searches on your Time Machine text:
- Find the word "time"
- Find the word "time" with small or capital letter at the beginning ("time" and "Time")
- Print context for each occurrence (e.g. 10 chars before and after the finding)
- Are there any digits in the text?
- Count the number of fullstops (".") in the text (this might give you an impression of how many sentences you have...)
- Find patterns with various (at least 2) fullstops in a row (e.g. "...")
- However, some fullstops do not mark the end of a sentence. For example abbreviations like "e.g.", "i.e.", names ("H.G. Wells"), or patterns like "...". We may assume that the end of a sentence is marked by a fullstop followed by a space. Find those patterns. 
- Which words are capitalized although they are not at the beginning of a sentence? Find all such occurrences.
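
One possible set of patterns for the searches listed above, shown on a short made-up `sample` string (an assumption, so the example runs without the downloaded file). The patterns are heuristics, not the only valid answers; for instance, the last one looks only for capitalized words that directly follow a lower-case letter, comma or semicolon plus a space, so it misses names after abbreviations such as "H.G. Wells".

```python
import re

# Hypothetical stand-in for the novel text.
sample = ("The Time Traveller paused... In 1895, e.g. in London, time passed. "
          "Sometimes Weena laughed. H.G. Wells wrote it.")

# The word "time" as a whole word (\b keeps "Sometimes" out).
time_lower = re.findall(r"\btime\b", sample)

# "time" or "Time".
time_any = re.findall(r"\b[tT]ime\b", sample)

# Context: up to 10 characters before and after each match.
width = 10
contexts = [sample[max(0, m.start() - width): m.end() + width]
            for m in re.finditer(r"\b[tT]ime\b", sample)]

# Digits, number of full stops, and runs of at least two full stops.
digits = re.findall(r"\d+", sample)
fullstop_count = len(re.findall(r"\.", sample))
dot_runs = re.findall(r"\.{2,}", sample)

# Heuristic sentence ends: a full stop followed by a space.
sentence_ends = len(re.findall(r"\. ", sample))

# Capitalized words that are not at the start of a sentence (heuristic).
mid_sentence_caps = re.findall(r"(?<=[a-z,;] )[A-Z][a-z]+", sample)

print(time_lower, time_any, contexts, digits, fullstop_count,
      dot_runs, sentence_ends, mid_sentence_caps)
```

On this sample the "full stop plus space" rule still counts the dots after "..." , "e.g." and "H.G.", which illustrates why the exercise treats it only as an assumption.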





In [100]:
#pending

#### Sentence tokenization 

Manually using regex, see task description in lecture (NLP02-1).
Do not use any dedicated method from a language package here.
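
A minimal sketch of a regex-only sentence splitter. The exact rules from the lecture (NLP02-1) are not reproduced here; the boundary rule below is an assumption: split at whitespace that follows ".", "!" or "?" and precedes an upper-case letter or an opening quotation mark. `split_sentences` is a hypothetical helper name.

```python
import re

# Assumed boundary rule: sentence-final punctuation, whitespace,
# then an upper-case letter or an opening quote.
BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z"“])')

def split_sentences(text):
    """Split text into sentences at the assumed boundary pattern."""
    return [part.strip() for part in BOUNDARY.split(text) if part.strip()]

sample = ('The Time Traveller smiled. "Are you sure?" he asked. '
          'We were not... and yet we stayed!')
sentences = split_sentences(sample)
for sentence in sentences:
    print(sentence)
```

This rule keeps "not... and" together because the next word is lower-case, but it would wrongly split after abbreviations followed by a capital letter, such as "H. G. Wells"; handling those needs extra exceptions.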

In [13]:
# TODO