# GIAN 2: Searching and Finding

You should be familiar with the Jupyter Notebook.

This notebook shows you how to search text using simple techniques. 

We will use small bits of text as examples, but you can use the same techniques on very large texts.

## 1. What is a text string?

A text string is a sequence of characters. In Python 3, a text string can include any Unicode character; UTF-8 is the encoding most commonly used to store such text.
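
To make "a sequence of characters" concrete, here is a minimal sketch. The sample text is made up for illustration and is not part of the original notebook. It shows that a string has a length, that its characters can be accessed by position, and that characters are not the same thing as bytes.

```python
# A made-up sample string containing two non-ASCII characters
sample = "naïve café"

print(len(sample))        # number of characters in the string
print(sample[0])          # the first character
print(list(sample[:5]))   # the first five characters, one by one

# Encoding the string as UTF-8 turns characters into bytes.
# Non-ASCII characters need more than one byte each,
# so the byte count is larger than the character count.
encoded = sample.encode("utf-8")
print(len(encoded))
```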

You can indicate that something is a text string by surrounding it with single quotes or double quotes.

In [None]:
mytext1="I love text mining"
mytext2='I love text mining'

Since the quotes are not part of the string itself, these two ways of writing it are interchangeable.

We can compare them by using `==` to show that this is the case.

In [None]:
mytext1==mytext2

So, how do you choose whether to use single quotes or double quotes?

Easy! If the text contains single quotes, use double quotes to surround it. If the text contains double quotes, use single quotes to surround it.

In [None]:
mytext1="My classmate said: 'I love text mining'"
mytext2='My classmate said: "I love text mining"'

Now, look at the cell above and try to predict the result of the following code. Remember that there are now quotes inside the string which are part of the string.

In [None]:
mytext1==mytext2
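
As an aside to the quoting rule above: when a text contains both kinds of quotes, Python also lets you escape a quote with a backslash. The sketch below uses made-up variable names and a made-up sentence to show that the backslash is not stored in the string.

```python
# Surrounded by single quotes, so the inner single quote is escaped
escaped = 'It\'s a "fine" line'

# Surrounded by double quotes, so the inner double quotes are escaped
alternative = "It's a \"fine\" line"

# Both spellings produce exactly the same string
print(escaped == alternative)
print(escaped)
```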

A final way to represent text is to surround it with three single or double quotes on each side. This is very useful for text which
+ contains many quotes
+ contains strange characters
+ runs across multiple lines

The only thing you need to remember is that, if you do this, you cannot have the same sequence of quotes inside the string. 

In [None]:
mytext3="""

This string starts with two blank lines.

"You can use as many single or double quotes as you want in this 'string'"

Just don't use the same three quotes that you also use to surround the text:

So, this is fine: '''

And this too ""

But the characters that delimit the string can't be used:

because they signal the end of the string!
"""

Play around with the string in the preceding cell and look at the differences in output. 

How can you make the string illegal?

## 2. Regular Expressions

Regular expressions are a tool to look for specific patterns in text. These patterns can be used for many text mining tasks, from very simple to extremely complex.

+ very simple: for instance, finding a specific word
+ a bit more complex: for instance, finding e-mail addresses
+ even more complex: for instance, extracting specific parts of html source code dependent on what patterns occur before or after it
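
The first two levels of the list above can be sketched as follows. The sample sentence and the e-mail pattern are assumptions made for illustration; the pattern is deliberately simplified and does not cover every valid address.

```python
import re

# Made-up sample text for illustration
sample = "Contact alice@example.com or bob.smith@example.org about text mining."

# Very simple: is a specific word present?
word_match = re.search(r"mining", sample)
print(word_match is not None)

# A bit more complex: a simplified e-mail pattern.
# Name part, then "@", then a domain containing at least one dot.
email_pattern = r"[\w.+-]+@[\w-]+\.[\w.-]+"
emails = re.findall(email_pattern, sample)
print(emails)
```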

### Experienced programmers say: "Always refer to documentation when using regular expressions"

+ [Here is a cheatsheet for regular expressions in Python](https://www.debuggex.com/cheatsheet/regex/python)

+ [And here is the official technical documentation for regular expressions in Python3](https://docs.python.org/3/library/re.html#module-re)

In [None]:
# Before we can use regular expressions in the notebook, we need to import the regular expressions module
import re

## 3. Asking whether text *contains* a pattern (re.search)

A first thing we can do with regular expressions is to check whether a text contains a pattern.

In [None]:
mytext = """I want to live,
I want to give
I've been a miner
for a heart of gold
It's these expressions
I never give
That keep me searching
for a heart of Gold
And I'm getting old
Keeps me searching
for a heart of gold
And I'm getting old

I've been to Hollywood
I've been to Redwood
I crossed the ocean
for a heart of gold
I've been in my mind,
it's such a fine line
That keeps me searching
for a heart of gold
And I'm getting old
Keeps me searching
for a heart of gold
And I'm getting old

Keep me searching
for a heart of gold
You keep me searching
And I'm growing old
Keep me searching
for a heart of gold
I've been a miner
for a heart of gold
"""

In [None]:
# Is there gold ?
mypattern="gold"
re.search(mypattern, mytext)

In [None]:
# Is there silver ?
mypattern="silver"
re.search(mypattern, mytext)

In [None]:
# Is there Gold ?
mypattern="Gold"
re.search(mypattern, mytext)

In [None]:
# Is there [Gg]old ?
# We use square brackets to indicate a set of possible characters
mypattern="[Gg]old"
re.search(mypattern, mytext)
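
The cells above display either a match object or nothing at all. As a minimal sketch of how to work with that result, the example below uses a single line from the lyrics; the variable names are made up. `re.search` returns `None` when the pattern is absent, and `re.IGNORECASE` is shown as one possible alternative to the `[Gg]` character set.

```python
import re

sample = "for a heart of Gold"

match = re.search(r"[Gg]old", sample)
if match:
    print(match.group())   # the text that matched
    print(match.span())    # (start, end) positions in the string
else:
    print("no match")

# When the pattern does not occur, the result is None
missing = re.search(r"silver", sample)
print(missing)

# Alternative to [Gg]: ignore case for the whole pattern
ignore_case = re.search(r"gold", sample, re.IGNORECASE)
print(ignore_case.group())
```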

## 4. Asking whether a text *matches* a pattern from the beginning (re.match)

In [None]:
# Is there gold at the beginning of the text?
mypattern="gold"
re.match(mypattern, mytext)

In [None]:
# Is the word "I" at the beginning of the text?
mypattern="I"
re.match(mypattern, mytext)
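
To make the difference with the previous section explicit, here is a small sketch that runs `re.match`, `re.search`, and `re.fullmatch` on the same made-up two-line sample. `re.fullmatch` is not used elsewhere in this notebook; it is included only for contrast.

```python
import re

sample = "I want to live,\nI want to give"

# re.match only looks at the very beginning of the string
at_start = re.match(r"want", sample)
print(at_start)           # None, because the string starts with "I"

# re.search scans the whole string for the first occurrence
anywhere = re.search(r"want", sample)
print(anywhere.start())   # position where "want" was found

# re.fullmatch requires the pattern to cover the entire string
whole = re.fullmatch(r"I", sample)
print(whole)              # None, because there is more text after "I"
```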

In [None]:
# Does the text start with a word composed of alphanumeric characters?
# The r prefix makes this a raw string, so the backslash reaches the regex engine unchanged
mypattern=r"\w+"
re.match(mypattern, mytext)

In [None]:
# Does the text start with an uppercase alphabetic character?
mypattern="[A-Z]"
re.match(mypattern, mytext)

In [None]:
# Does the text start with a lowercase alphabetic character?
mypattern="[a-z]"
re.match(mypattern, mytext)

In [None]:
# Does the text start with a single alphanumeric character followed by a whitespace character?
mypattern=r"\w\s"
re.match(mypattern, mytext)

In [None]:
# Does the text start with any sequence of characters, followed by "want" ?
mypattern=".+ want"
re.match(mypattern, mytext)

In [None]:
# Does the text start with any sequence of characters, followed by "want", followed by any sequence of characters?
mypattern=".+ want .+"
re.match(mypattern, mytext)

In [None]:
# Does the text start with any sequence of characters, followed by "gold" ?
# Beware: by default the dot does not match a newline, so this pattern cannot reach past the first line
mypattern=".+ gold"
re.match(mypattern, mytext)

In [None]:
# Does the text start with ANY sequence of characters, followed by "gold" ?
# We need to use the re.DOTALL flag to indicate we want the dot to match a newline character
mypattern=".+ gold"
re.match(mypattern, mytext, re.DOTALL)

In [None]:
# And we need to use "group" to show the full matched text
# Observe that the newline characters are now shown as the escape sequence \n
mypattern=".+ gold"
re.match(mypattern, mytext, re.DOTALL).group()
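
The output of the last cell is long because `.+` is greedy: with `re.DOTALL` it runs to the last " gold" in the text. The sketch below uses a shortened, made-up excerpt to compare the default behaviour, the greedy match, and a non-greedy variant (`.+?`). The non-greedy quantifier is an addition for illustration and is not used elsewhere in this notebook.

```python
import re

sample = "I've been a miner\nfor a heart of gold\nAnd I'm getting old\nfor a heart of gold\n"

# Without re.DOTALL the dot stops at the first newline, so there is no match
no_flag = re.match(r".+ gold", sample)
print(no_flag)

# With re.DOTALL the greedy .+ extends to the LAST " gold"
greedy = re.match(r".+ gold", sample, re.DOTALL)
print(repr(greedy.group()))

# The non-greedy .+? stops at the FIRST " gold"
lazy = re.match(r".+? gold", sample, re.DOTALL)
print(repr(lazy.group()))
```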

## 5. Finding multiple patterns in a text 

In [None]:
# How many times does Neil Young tell us that he is old ?
mypattern=r" (old)\s"
results=re.findall(mypattern, mytext)
print(len(results))
print(results)
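
The pattern above relies on a space before "old" and a whitespace character after it. A possible alternative, sketched here on a made-up excerpt, is the word boundary `\b`, which also finds "old" at the very end of a text where no whitespace follows. Treat it as one option, not as the notebook's method.

```python
import re

# Note: this sample does NOT end with a newline
sample = "And I'm getting old\nfor a heart of gold\nAnd I'm growing old"

# Space before and whitespace after: misses the final "old"
with_space = re.findall(r" (old)\s", sample)
print(len(with_space), with_space)

# Word boundaries: finds every standalone "old", but not "gold"
with_boundary = re.findall(r"\bold\b", sample)
print(len(with_boundary), with_boundary)
```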

In [None]:
# Where has Neil Young been to ?
# We use round brackets to surround the part we want to extract
mypattern="been (to \w+)"
results=re.findall(mypattern, mytext)
print(results)

In [None]:
# What, where, how, ... has Neil Young been ?
# We can extract multiple groups in a pattern
mypattern=r"([\w']+) (been) (\w+) (.+)"
results=re.findall(mypattern, mytext)
print(results)
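
The shape of the result depends on the number of groups in the pattern. As a short sketch on a made-up three-line excerpt: with one group, `re.findall` returns a list of strings; with several groups, it returns a list of tuples that can be unpacked in a loop. The loop variable names are hypothetical.

```python
import re

sample = "I've been to Hollywood\nI've been to Redwood\nI've been in my mind,"

# One group: a list of strings
one_group = re.findall(r"been (to \w+)", sample)
print(one_group)

# Several groups: a list of tuples, one tuple per match
many_groups = re.findall(r"([\w']+) (been) (\w+) (.+)", sample)
for who, verb, preposition, rest in many_groups:
    print(who, "|", verb, "|", preposition, "|", rest)
```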

## 6. Splitting text using a pattern (basic tokenizing) 

In [None]:
# What are all the strings in the text that are delimited by a space?
results=re.split(r" ", mytext)
print(results)

In [None]:
# What are all the strings in the text that are delimited by a space or a newline character?
results=re.split(r"[ \n]", mytext)
print(results)

In [None]:
# What are all the strings in the text that are delimited by a space, a newline character, or a comma?
results=re.split(r"[ \n,]", mytext)
print(results)
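
Splitting on a single delimiter character produces empty strings wherever two delimiters are adjacent, for instance a comma followed by a newline. One way to clean this up, sketched on a made-up two-line sample, is to split on runs of delimiters and drop any empty strings that remain.

```python
import re

sample = "I want to live,\nI want to give\n"

# One delimiter at a time: empty strings appear between ",\n" and at the end
single = re.split(r"[ \n,]", sample)
print(single)

# Split on one or more delimiters, then filter out empty strings
tokens = [token for token in re.split(r"[ \n,]+", sample) if token]
print(tokens)
```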