---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

# Working With Text

In [15]:
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

len(text1) # The length of text1

76

In [16]:
text2 = text1.split(' ') # Return a list of the words in text1, separating by ' '.

len(text2)

14

In [17]:
text2

['Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations',
 '']
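The trailing `''` in the list above appears because `text1` ends with a space and `split(' ')` treats every single space as a separator. A minimal sketch of the alternative, calling `split()` with no argument so that runs of whitespace are collapsed and empty strings are dropped (the names `words_space` and `words_any` are illustrative, not part of the original notebook):

```python
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

# split(' ') splits on each single space, so the trailing space yields ''
words_space = text1.split(' ')

# split() with no argument splits on any run of whitespace and
# discards empty strings at the start and end
words_any = text1.split()

print(len(words_space))        # 14
print(len(words_any))          # 13
print(words_space[-1] == '')   # True
```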

<br>
List comprehension allows us to find specific words:

In [18]:
[w for w in text2 if len(w) > 3] # Words in text2 that are longer than 3 letters

['Ethics',
 'built',
 'right',
 'into',
 'ideals',
 'objectives',
 'United',
 'Nations']

In [19]:
[w for w in text2 if w.istitle()] # Capitalized words in text2

['Ethics', 'United', 'Nations']

In [20]:
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'

['Ethics', 'ideals', 'objectives', 'Nations']

<br>
We can find unique words using `set()`.

In [21]:
text3 = 'To be or not to be'
text4 = text3.split(' ')

len(text4)

6

In [22]:
len(set(text4)) # To find out all the unique words

5

In [23]:
set(text4)

{'To', 'be', 'not', 'or', 'to'}

In [24]:
len(set([w.lower() for w in text4])) # .lower converts the string to lowercase.

4

In [25]:
set([w.lower() for w in text4])

{'be', 'not', 'or', 'to'}
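The same result can be written as a set comprehension, which avoids building an intermediate list. This is only a sketch of an equivalent spelling; the name `unique_lower` is illustrative:

```python
text3 = 'To be or not to be'
text4 = text3.split(' ')

# Set comprehension: lowercase each word and keep only distinct values
unique_lower = {w.lower() for w in text4}

print(len(unique_lower))     # 4
print(sorted(unique_lower))  # ['be', 'not', 'or', 'to']
```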

In [26]:
# How the split works
text_example = 'ouagadougou'
text_example = text_example.split('ou')
text_example

['', 'agad', 'g', '']

In [27]:
# Join back to the original text
text = 'ou'.join(text_example)
text

'ouagadougou'

**Get the letters out of the word**

In [28]:

text.split('') # this way doesn't work!

ValueError: empty separator

In [29]:
# Method 1
list(text)

['o', 'u', 'a', 'g', 'a', 'd', 'o', 'u', 'g', 'o', 'u']

In [30]:
# Method 2
[c for c in text]

['o', 'u', 'a', 'g', 'a', 'd', 'o', 'u', 'g', 'o', 'u']

**Cleaning Text**

In [31]:
text_clean = ' A quick brown fox jumped over the lazy dog. '

# First use strip to strip out the white space from the start and the end
text_clean = text_clean.strip()

# Split on spaces
text_clean.split(' ')

['A', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']
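Note that the last token above still carries the full stop (`'dog.'`). One possible way to trim it, sketched here under the assumption that only ASCII punctuation at the edges of each token needs removing, is `str.strip` with `string.punctuation` (the name `tokens` is illustrative):

```python
import string

text_clean = ' A quick brown fox jumped over the lazy dog. '

# split() with no argument ignores leading/trailing whitespace,
# and strip(string.punctuation) trims ASCII punctuation from both ends of each token
tokens = [w.strip(string.punctuation) for w in text_clean.split()]

print(tokens)
# ['A', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
```

Punctuation inside a token (for example the apostrophe in `don't`) is left untouched by this approach.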

**Changing text: find and replace**

In [32]:
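# find the offset of the first 'o', searching from the left
text_clean.find('o')

# find the offset of the last 'o', searching from the right
text_clean.rfind('o')

# replace every 'o' with 'O' (only this last expression is displayed by the cell)
text_clean.replace('o','O')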

'A quick brOwn fOx jumped Over the lazy dOg.'
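Because a notebook cell displays only the value of its last expression, the results of `find` and `rfind` are not shown above. A small sketch that prints them explicitly (the variable names are illustrative):

```python
text_clean = 'A quick brown fox jumped over the lazy dog.'

first_o = text_clean.find('o')    # offset of the first 'o' (in "brown")
last_o = text_clean.rfind('o')    # offset of the last 'o' (in "dog")
missing = text_clean.find('xyz')  # find returns -1 when the substring is absent

print(first_o, last_o, missing)   # 10 40 -1

# Strings are immutable: replace returns a new string and leaves text_clean unchanged
replaced = text_clean.replace('o', 'O')
print(text_clean == replaced)     # False
```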

**Handling larger texts**

In [33]:
# Read files line by line
f = open('UNDHR.txt','r')
text = f.readline()

# strip trailing whitespace, including the final newline character \n
text.rstrip()

FileNotFoundError: [Errno 2] No such file or directory: 'UNDHR.txt'

In [None]:
# Read the full file
f.seek(0) # reset the read position to the start, since readline() above advanced it
text = f.read()
len(text)

# split lines delimited by a \n
text = text.splitlines()

# Get the first line
text[0]
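The `FileNotFoundError` above is raised because `UNDHR.txt` is not in the working directory. A minimal sketch of a more defensive read, using a `with` block so the file is closed automatically. The helper name `read_lines` and the UTF-8 encoding are assumptions made for illustration, not part of the original notebook:

```python
from pathlib import Path

def read_lines(path):
    """Return the file's lines without trailing newlines, or [] if the file is missing."""
    p = Path(path)
    if not p.exists():
        return []
    # Assumption: the file is UTF-8 encoded
    with p.open('r', encoding='utf-8') as f:
        return f.read().splitlines()

lines = read_lines('UNDHR.txt')
print(len(lines))  # number of lines read; 0 if the file was not found
```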

### Processing free-text

In [34]:
text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr @UN @UN_Women'
text6 = text5.split(' ')

text6

['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr',
 '@UN',
 '@UN_Women']

<br>
Finding hashtags:

In [35]:
[w for w in text6 if w.startswith('#')]

['#UNSG']

<br>
Finding callouts:

In [36]:
[w for w in text6 if w.startswith('@')]

['@', '@UN', '@UN_Women']

In [37]:
# To get rid of the lone '@', we use a regular expression here
import re
[w for w in text6 if re.search('@[A-Za-z0-9_]+',w)]

['@UN', '@UN_Women']

In [38]:
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')

<br>

We can use regular expressions to help us with more complex parsing.

For example, `re.search('@[A-Za-z0-9_]+', w)` matches every word that contains `'@'` followed by one or more characters (`'+'` means one or more times), where each of those characters is a:
* capital letter (`'A-Z'`)
* lowercase letter (`'a-z'`)
* digit (`'0-9'`)
* or underscore (`'_'`)
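`re.search` looks for the pattern anywhere inside the word, so it does not by itself require the word to begin with `'@'`. The sketch below contrasts it with `re.match` (anchored at the start) and `re.fullmatch` (the whole word must match). The `words` list is made up for illustration:

```python
import re

# Hypothetical tokens chosen to show the difference between the three functions
words = ['@UN', '@', 'mail@example', '@UN_Women', '@UN!']
pattern = r'@[A-Za-z0-9_]+'

contains = [w for w in words if re.search(pattern, w)]     # pattern anywhere in the word
starts = [w for w in words if re.match(pattern, w)]        # pattern at the start of the word
whole = [w for w in words if re.fullmatch(pattern, w)]     # pattern covers the entire word

print(contains)  # ['@UN', 'mail@example', '@UN_Women', '@UN!']
print(starts)    # ['@UN', '@UN_Women', '@UN!']
print(whole)     # ['@UN', '@UN_Women']
```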

In [39]:
import re # import re - a module that provides support for regular expressions

[w for w in text8 if re.search('@[A-Za-z0-9_]+', w)]

['@UN', '@UN_Women']

In [40]:
# Another way
[w for w in text8 if re.search(r'@\w+', w)] # raw string, so \w is passed to re unchanged

['@UN', '@UN_Women']
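As an alternative to splitting first and filtering afterwards, `re.findall` can be applied to the whole string. A short sketch (the names `callouts`, `hashtags` and `ascii_callouts` are illustrative). Note that for `str` patterns in Python 3, `\w` also matches non-ASCII letters and digits, so it is broader than `[A-Za-z0-9_]` unless the `re.ASCII` flag is given:

```python
import re

text7 = ('@UN @UN_Women "Ethics are built right into the ideals and objectives '
         'of the United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')

# findall scans the whole string and returns every non-overlapping match
callouts = re.findall(r'@\w+', text7)
hashtags = re.findall(r'#\w+', text7)

print(callouts)  # ['@UN', '@UN_Women']
print(hashtags)  # ['#UNSG']

# re.ASCII restricts \w to [A-Za-z0-9_]
ascii_callouts = re.findall(r'@\w+', text7, flags=re.ASCII)
print(ascii_callouts)  # ['@UN', '@UN_Women']
```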

**Finding specific characters**

In [41]:
text12 = 'ouagadougou'

print(re.findall(r'[aeiou]',text12))
print(re.findall(r'[^aeiou]',text12))

['o', 'u', 'a', 'a', 'o', 'u', 'o', 'u']
['g', 'd', 'g']


**Regex for Dates**

In [42]:
dateStr = '23-10-2022\n23/10/2022\n23/10/02\n10/23/2022\n23 Oct 2002\n23 October 2002\nOct 23, 2002\n'

In [43]:
re.findall(r'\d{2}[/-]\d{2}[/-]\d{4}', dateStr) # 2 digits-2 digits-4 digits

['23-10-2022', '23/10/2022', '10/23/2022']

In [44]:
re.findall(r'\d{2}[/-]\d{2}[/-]\d{2,4}', dateStr) # 2 digits-2 digits-2 to 4 digits

['23-10-2022', '23/10/2022', '23/10/02', '10/23/2022']

In [45]:
re.findall(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}',dateStr)

['23-10-2022', '23/10/2022', '23/10/02', '10/23/2022']

In [46]:
# With a capturing group (parentheses), findall returns only the text matched inside the group, i.e. the month
re.findall(r'\d{2} (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}', dateStr)

['Oct']

In [48]:
# Use a non-capturing group (?:...) so that findall returns the whole match
re.findall(r'\d{2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}', dateStr)

['23 Oct 2002']
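To make the effect of groups on `re.findall` more concrete, here is a small sketch: with two capturing groups it returns tuples, while `re.finditer` gives match objects from which both the full match and the groups can be read (the names `pairs` and `full` are illustrative):

```python
import re

dateStr = '23-10-2022\n23/10/2022\n23/10/02\n10/23/2022\n23 Oct 2002\n23 October 2002\nOct 23, 2002\n'
pattern = r'\d{2} (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (\d{4})'

# Two capturing groups: findall returns a list of (month, year) tuples
pairs = re.findall(pattern, dateStr)
print(pairs)  # [('Oct', '2002')]

# finditer yields match objects; group(0) is always the full match
full = [m.group(0) for m in re.finditer(pattern, dateStr)]
print(full)   # ['23 Oct 2002']
```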

In [50]:
# Also match full month names such as October: the three-letter abbreviation may be followed by zero or more lowercase letters
re.findall(r'\d{2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{4}', dateStr)

['23 Oct 2002', '23 October 2002']

In [52]:
re.findall(r'(?:\d{2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )?\d{4}', dateStr)

['23 Oct 2002', '23 October 2002', 'Oct 23, 2002']

In [53]:
re.findall(r'(?:\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{1,2}, )?\d{4}', dateStr)

['23 Oct 2002', '23 October 2002', 'Oct 23, 2002']
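Once the pattern grows this long, named groups and `re.VERBOSE` can make it easier to read and let each date be broken into its parts. The following is a sketch of that idea rather than a definitive date parser. The group names (`day_before`, `day_after`, `month`, `year`) and the `parsed` list are illustrative:

```python
import re

dateStr = '23-10-2022\n23/10/2022\n23/10/02\n10/23/2022\n23 Oct 2002\n23 October 2002\nOct 23, 2002\n'

# In VERBOSE mode whitespace in the pattern is ignored, so a literal space is written as [ ]
pattern = re.compile(r'''
    (?:(?P<day_before>\d{1,2})[ ])?    # optional day before the month
    (?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)
    [ ]
    (?:(?P<day_after>\d{1,2}),[ ])?    # optional day and comma after the month
    (?P<year>\d{4})                    # four-digit year
''', re.VERBOSE)

parsed = []
for m in pattern.finditer(dateStr):
    # The day appears in exactly one of the two optional groups for these inputs
    day = m.group('day_before') or m.group('day_after')
    parsed.append((day, m.group('month'), m.group('year')))

print(parsed)
# [('23', 'Oct', '2002'), ('23', 'October', '2002'), ('23', 'Oct', '2002')]
```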