---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

# Working With Text

In [1]:
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

len(text1) # The length of text1

76

In [3]:
text2 = text1.split(' ') # Return a list of the words in text1, separating by ' '.

len(text2)

14

In [4]:
text2

['Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations',
 '']

<br>
List comprehensions let us select specific words:

In [97]:
[w for w in text2 if len(w) > 3] # Words in text2 that are longer than 3 letters

['Ethics',
 'built',
 'right',
 'into',
 'ideals',
 'objectives',
 'United',
 'Nations']

In [6]:
[w for w in text2 if w.istitle()] # Capitalized words in text2

['Ethics', 'United', 'Nations']

In [7]:
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'

['Ethics', 'ideals', 'objectives', 'Nations']

<br>
We can find unique words using `set()`.

In [98]:
text3 = 'To be or not to be'
text4 = text3.split(' ')

len(text4)

6

In [99]:
len(set(text4))

5

In [100]:
set(text4)

{'To', 'be', 'not', 'or', 'to'}

In [101]:
len(set([w.lower() for w in text4])) # .lower converts the string to lowercase.

4

In [102]:
set([w.lower() for w in text4])

{'be', 'not', 'or', 'to'}

### Some more word comparison functions
- s.startswith(t) 
- s.endswith(t)
- t in s
- s.isupper(), s.islower(), s.istitle()
- s.isalpha(), s.isdigit(), s.isalnum()
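
The comparison functions listed above all return booleans, which is what makes them convenient as filters inside a list comprehension. A minimal sketch, where the `samples` list is made up for illustration:

```python
samples = ['Ethics', 'UN', 'nations', '2002', 'bit.ly']

# One row per string: (string, istitle, isalpha, isdigit)
table = [(s, s.istitle(), s.isalpha(), s.isdigit()) for s in samples]
for row in table:
    print(row)

# Substring, prefix, and suffix tests are booleans too,
# so they can be used directly as filters.
has_dot = [s for s in samples if '.' in s]
ends_s = [s for s in samples if s.endswith('s')]
print(has_dot, ends_s)
```

Note that `'2002'.istitle()` is `False`: a string with no cased characters is never considered title case.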

### String operations
- s.lower(), s.upper(), s.title()
- s.split(t)
- s.splitlines()
- s.join(t)
- s.strip(), s.rstrip()
- s.find(), s.rfind()
- s.replace(u, v)

### From words to characters

In [103]:
text5 = 'ouagadougou'

text6 = text5.split('ou')  # split() removes the separator from the result
text6

['', 'agad', 'g', '']

In [104]:
'ou'.join(text6)

'ouagadougou'

In [105]:
text5.split('')  # An empty separator is not allowed, so this raises an error

ValueError: empty separator

##### list()

In [106]:
list(text5)

['o', 'u', 'a', 'g', 'a', 'd', 'o', 'u', 'g', 'o', 'u']

##### list comprehension

In [112]:
[c for c in text5 if c in ['a', 'o']]

['o', 'a', 'a', 'o', 'o']

### Cleaning text

In [114]:
text8 = '   A quick brown fox jumped over the lazy dog. '
text8.split(' ')

['',
 '',
 '',
 'A',
 'quick',
 'brown',
 'fox',
 'jumped',
 'over',
 'the',
 'lazy',
 'dog.',
 '']

In [115]:
text9 = text8.strip()
text9

'A quick brown fox jumped over the lazy dog.'
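
As an aside, calling `split()` with no argument splits on runs of whitespace and drops the empty strings, so it can stand in for `strip()` followed by `split(' ')` when only the words are needed. A minimal sketch; the variable names are only for illustration:

```python
messy = '   A quick brown fox jumped over the lazy dog. '

# split(' ') treats every single space as a separator, so leading,
# trailing, and repeated spaces produce empty strings.
by_space = messy.split(' ')

# split() with no argument splits on runs of whitespace and
# never returns empty strings.
by_whitespace = messy.split()

print(by_space[:4])
print(by_whitespace)
```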

### Changing text

In [116]:
text9.find('o')  # Lowest index of 'o' in the string (spaces count as characters)

10

In [25]:
text9.rfind('o')  # Highest index of 'o' in string

40

In [117]:
text9.replace('o', 'O')

'A quick brOwn fOx jumped Over the lazy dOg.'

### Processing free-text

In [118]:
text10 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr @UN @UN_Women'
text11 = text10.split(' ')

text11

['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr',
 '@UN',
 '@UN_Women']

<br>
Finding hashtags (words that start with `#`):

In [119]:
[w for w in text11 if w.startswith('#')]

['#UNSG']

<br>
Finding callouts:

In [75]:
[w for w in text11 if w.startswith('@')]

['@', '@UN', '@UN_Women']

In [120]:
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')

##### Note that the lone '@' is still returned
`startswith('@')` also accepts a bare `'@'`, which is not a callout. This is why we use regular expressions in the next section.

In [121]:
[w for w in text8 if w.startswith('@')]

['@UN', '@UN_Women', '@']

## Regular Expressions

<br>

We can use regular expressions to help us with more complex parsing. 

For example, `'@[A-Za-z0-9_]+'` matches any word that contains `'@'` followed by at least one character from this set:
* capital letters (`'A-Z'`)
* lowercase letters (`'a-z'`)
* digits (`'0-9'`)
* underscore (`'_'`)

In [122]:
import re

In [123]:
[w for w in text8 if re.search('@[A-Za-z0-9_]+', w)]

['@UN', '@UN_Women']
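
The same pattern can also be applied to the whole string with `re.findall`, which skips the `split(' ')` step. This is a sketch rather than the notebook's own approach; `tweet` repeats the text of `text7`, and the hashtag pattern is an assumed analogue of the callout pattern.

```python
import re

tweet = ('@UN @UN_Women "Ethics are built right into the ideals and '
         'objectives of the United Nations" #UNSG @ NY Society for '
         'Ethical Culture bit.ly/2guVelr')

# '@' followed by one or more letters, digits, or underscores.
callouts = re.findall(r'@[A-Za-z0-9_]+', tweet)

# Same idea for hashtags.
hashtags = re.findall(r'#[A-Za-z0-9_]+', tweet)

# \w covers [A-Za-z0-9_]; on Python 3 strings it also matches
# non-ASCII letters and digits, so it is slightly broader.
callouts_w = re.findall(r'@\w+', tweet)

print(callouts, hashtags)
```

The bare `'@'` before `NY` is skipped because `+` requires at least one character after it.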

### Another Example
The `r''` prefix tells Python to treat the literal as a raw string, in which backslashes are kept as-is. For example, `r'\n'` is a backslash followed by the character `n`, not a newline.
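
A quick way to see the difference is to compare lengths. The sample sentence below is made up for illustration:

```python
import re

plain = '\n'    # one character: a newline
raw = r'\n'     # two characters: a backslash and the letter n

print(len(plain), len(raw))

# In a regular expression, \d has to reach the re module with its
# backslash intact. A raw string does that without doubling it.
text = 'Room 101, floor 3'
digits = re.findall(r'\d+', text)
digits_escaped = re.findall('\\d+', text)   # equivalent, but harder to read

print(digits)
```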

In [81]:
text12 = 'ouagadougou'

re.findall(r'[aeiou]', text12)  # vowels

['o', 'u', 'a', 'a', 'o', 'u', 'o', 'u']

In [82]:
re.findall(r'[^aeiou]', text12)  # any character that is not a vowel (here, the consonants)

['g', 'd', 'g']

### Dates

In [138]:
datestr = '2-10-2002\n23/10/2002\n23/10/02\n10/23/2002\n23 Oct 2002\n23 October 2002\nOct 23, 2002\nOctober 23, 2002\n'

re.findall(r'\d{2}[/-]\d{2}[/-]\d{4}', datestr)
# re.findall(r'[0-9][0-9][/-]\d{2}[/-]\d{4}', datestr)

['23/10/2002', '10/23/2002']

In [129]:
re.findall(r'\d{2}[/-]\d{2}[/-]\d{2,4}', datestr)

['23/10/2002', '23/10/02', '10/23/2002']

In [130]:
re.findall(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}', datestr)

['2-10-2002', '23/10/2002', '23/10/02', '10/23/2002']

#### Matching month names

In [133]:
re.findall(r'\d{2} (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}', datestr)

['Oct']

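#### Non-capturing groups: `(?:...)`
When a pattern contains a capturing group `( )`, `re.findall` returns only the text matched inside the group. Writing the group as `(?:...)` makes it non-capturing, so the whole match is returned instead.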

In [91]:
re.findall(r'\d{2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}', datestr)

['23 Oct 2002']
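
To make the effect of grouping on `re.findall` concrete, here is a small sketch on a single date. The shortened month list `(Oct|Nov)` is only there to keep the patterns readable.

```python
import re

s = '23 Oct 2002'

# One capturing group: findall returns only that group's text.
one_group = re.findall(r'\d{2} (Oct|Nov) \d{4}', s)

# Non-capturing group: findall returns the whole match.
no_group = re.findall(r'\d{2} (?:Oct|Nov) \d{4}', s)

# Several capturing groups: findall returns one tuple per match.
parts = re.findall(r'(\d{2}) (Oct|Nov) (\d{4})', s)

# re.search exposes both the whole match and each group.
m = re.search(r'(\d{2}) (Oct|Nov) (\d{4})', s)

print(one_group, no_group, parts, m.group(0), m.group(2))
```

If both the full date and its components are needed, capturing groups with `re.search` or `re.finditer` are usually a better fit than `re.findall`.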

In [137]:
re.findall(r'\d{2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{4}', datestr)

['23 Oct 2002', '23 October 2002']

#### Matching dates where the month name comes first

In [93]:
re.findall(r'(?:\d{2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )?\d{4}', datestr)

['23 Oct 2002', '23 October 2002', 'Oct 23, 2002', 'October 23, 2002']

In [94]:
re.findall(r'(?:\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{1,2}, )?\d{4}', datestr)

['23 Oct 2002', '23 October 2002', 'Oct 23, 2002', 'October 23, 2002']
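
The numeric and month-name patterns can be combined into one expression with `|`. The sketch below is one possible way to do that, not part of the original notebook; the names `MONTH`, `DATE_RE`, and `find_dates` are hypothetical, and the month list uses three-letter prefixes so that `[a-z]*` covers both abbreviated and full names.

```python
import re

# Three-letter prefix, optionally followed by the rest of the month name.
MONTH = r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*'

DATE_RE = re.compile(
    r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}'                       # 23/10/2002, 2-10-2002
    r'|(?:\d{1,2} )?' + MONTH + r' (?:\d{1,2}, )?\d{4}'    # 23 Oct 2002, Oct 23, 2002
)

def find_dates(text):
    """Return every substring of text that looks like a date."""
    return DATE_RE.findall(text)

datestr = ('2-10-2002\n23/10/2002\n23/10/02\n10/23/2002\n23 Oct 2002\n'
           '23 October 2002\nOct 23, 2002\nOctober 23, 2002\n')

for d in find_dates(datestr):
    print(d)
```

The pattern only checks the shape of the text. It does not validate that the day or month is in range, so `99/99/9999` would also match.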

### File operations
- f = open(filename, mode)
- f.readline(), f.read(n), f.read()
- for line in f: dosomething(line)
- f.seek(n)
- f.write(message)
- f.close()
- f.closed
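
The file operations above can be tried out on a scratch file. The sketch below assumes a temporary file named `sample.txt` (a made-up name) and uses a `with` block, which closes the file automatically, so an explicit `f.close()` is not needed.

```python
import os
import tempfile

# Hypothetical scratch file so the example does not touch real data.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')

with open(path, 'w') as f:        # 'w' creates or truncates the file
    f.write('first line\nsecond line\n')

with open(path, 'r') as f:
    first = f.readline()          # read one line, including its '\n'
    f.seek(0)                     # move back to the start of the file
    chunk = f.read(5)             # read the next 5 characters
    f.seek(0)
    lines = [line.rstrip() for line in f]   # iterate line by line

print(repr(first), repr(chunk), lines)
print(f.closed)                   # the with block has closed the file
```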