# Working With Text

In [1]:
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

len(text1) # The length of text1

76

In [2]:
text2 = text1.split(' ') # Return a list of the words in text1, separating by ' '.

len(text2)

14

In [3]:
text2

['Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations',
 '']

<br>
List comprehension allows us to find specific words:

In [4]:
[w for w in text2 if len(w) > 3] # Words in text2 that are longer than 3 characters

['Ethics',
 'built',
 'right',
 'into',
 'ideals',
 'objectives',
 'United',
 'Nations']

In [5]:
[w for w in text2 if w.istitle()] # Capitalized words in text2

['Ethics', 'United', 'Nations']

In [6]:
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'

['Ethics', 'ideals', 'objectives', 'Nations']

<br>
We can find unique words using `set()`.

In [7]:
text3 = 'To be or not to be'
text4 = text3.split(' ')

len(text4)

6

In [8]:
len(set(text4))

5

In [9]:
set(text4)

{'To', 'be', 'not', 'or', 'to'}

In [10]:
len(set([w.lower() for w in text4])) # .lower converts the string to lowercase.

4

In [11]:
set([w.lower() for w in text4])

{'be', 'not', 'or', 'to'}

### Some word comparison functions

 - s.startswith(t)
 - s.endswith(t)
 - t in s
 - s.isupper()
 - s.islower()
 - s.istitle()
 - s.isalpha()
 - s.isdigit()
 - s.isalnum()
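
As a quick check of the comparison functions listed above, here is a minimal sketch. The strings `word` and `code` are illustrative inputs and are not taken from the notebook's data.

```python
word = 'Nations'   # illustrative input, not from the notebook
code = 'UN2016'    # illustrative input mixing letters and digits

print(word.startswith('Na'))   # True
print(word.endswith('s'))      # True
print('tion' in word)          # True: substring test
print(word.istitle())          # True: first letter uppercase, rest lowercase
print(word.isupper())          # False
print(word.islower())          # False
print(code.isalpha())          # False: contains digits
print(code.isdigit())          # False: contains letters
print(code.isalnum())          # True: only letters and digits
```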
 
### String Operations

 - s.lower()
 - s.upper()
 - s.title()
 - s.split(t)
 - s.splitlines()
 - s.join(t)
 - s.strip()
 - s.rstrip(t)
 - s.find(t)
 - s.rfind(t)
 - s.replace(u,v)
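
The string operations above can be tried on a short example. This is only a sketch; the input string `s` is made up for illustration.

```python
s = '  To be, or not to be  '     # illustrative input

clean = s.strip()                 # remove leading and trailing whitespace
print(clean)                      # To be, or not to be
print(clean.lower())              # to be, or not to be
print(clean.upper())              # TO BE, OR NOT TO BE
print(clean.title())              # To Be, Or Not To Be

words = clean.split(' ')          # split on single spaces
print(words)                      # ['To', 'be,', 'or', 'not', 'to', 'be']
print('-'.join(words))            # To-be,-or-not-to-be

print(clean.find('be'))           # 3: index of the first 'be'
print(clean.rfind('be'))          # 17: index of the last 'be'
print(clean.replace('be', 'see')) # To see, or not to see
print('dog...'.rstrip('.'))       # dog: strips trailing '.' characters
print('line one\nline two'.splitlines())  # ['line one', 'line two']
```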

In [14]:
text5 = 'ouagadoudou'
text6 = text5.split('ou')
text6

['', 'agad', 'd', '']

In [15]:
'ou'.join(text6)

'ouagadoudou'

In [16]:
list(text5)

['o', 'u', 'a', 'g', 'a', 'd', 'o', 'u', 'd', 'o', 'u']

In [18]:
[c for c in text5]

['o', 'u', 'a', 'g', 'a', 'd', 'o', 'u', 'd', 'o', 'u']

### Cleaning text

In [19]:
text8 = '      A quick brown fox jumped over the lazy dog.    '
text8.split(' ')

['',
 '',
 '',
 '',
 '',
 '',
 'A',
 'quick',
 'brown',
 'fox',
 'jumped',
 'over',
 'the',
 'lazy',
 'dog.',
 '',
 '',
 '',
 '']

In [21]:
text9 = text8.strip()
text9.split(' ')

['A', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']
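
As an alternative sketch, calling `split()` with no argument splits on runs of whitespace and ignores leading and trailing whitespace, so the separate `strip()` step is not needed. The variable names `padded` and `tokens` are introduced here only for illustration.

```python
padded = '      A quick brown fox jumped over the lazy dog.    '

# With no separator, split() treats any run of whitespace as one delimiter
# and produces no empty strings at the ends.
tokens = padded.split()
print(tokens)
# ['A', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']
```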

In [22]:
text9.find('o')

10

In [23]:
text9.replace('o','O')

'A quick brOwn fOx jumped Over the lazy dOg.'

### Handling larger text

- Reading files line by line

- f = open(filename, mode)  # mode: r = read-only ; w = write
- f.readline()
- f.read()
- f.read(n)
- for line in f: doSomething(line)
- f.seek(n)
- f.write(message)
- f.close()
- f.closed  # attribute, not a method: True once the file has been closed
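
The file operations above can be sketched end to end on a throwaway file. The file name `sample.txt` and its contents are assumptions made for this example, so that it does not depend on `UNDHR.txt` being present.

```python
import os
import tempfile

# Create a small temporary file (hypothetical name and contents).
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w') as f:
    f.write('Preamble\nFirst line\nSecond line\n')

with open(path, 'r') as f:
    first = f.readline()                   # reads up to and including '\n'
    chunk = f.read(5)                      # reads the next 5 characters
    f.seek(0)                              # jump back to the start of the file
    lines = [line.rstrip() for line in f]  # iterate over the file line by line

print(repr(first))   # 'Preamble\n'
print(chunk)         # First
print(lines)         # ['Preamble', 'First line', 'Second line']
print(f.closed)      # True: the with block closed the file
```

Using `with` closes the file automatically, which is why `f.closed` is `True` after the block without an explicit `f.close()`.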

#### How to remove the last newline character
Use `.rstrip()`, which strips trailing whitespace, including the final newline.

In [34]:
f = open('UNDHR.txt','r')
text14 = f.readline()
print(list(text14))
print(list(text14.rstrip()))

['P', 'r', 'e', 'a', 'm', 'b', 'l', 'e', '\n']
['P', 'r', 'e', 'a', 'm', 'b', 'l', 'e']


In [29]:
f.seek(0)
text12 = f.read()
print(len(text12))
text13 = text12.splitlines()
print(len(text13))
print(text13[0])

10814
118
Preamble


### Processing free-text

In [35]:
text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')

text6

['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr']

<br>
Finding hashtags:

In [36]:
[w for w in text6 if w.startswith('#')]

['#UNSG']

<br>
Finding callouts:

In [37]:
[w for w in text6 if w.startswith('@')]

['@']

In [38]:
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')

<br>

We can use regular expressions to help us with more complex parsing. 

For example, `'@[A-Za-z0-9_]+'` matches all words that contain `'@'` followed by at least one character from this set:
* capital letter (`'A-Z'`)
* lowercase letter (`'a-z'`)
* number (`'0-9'`)
* or underscore (`'_'`)

In [39]:
import re # import re - a module that provides support for regular expressions

[w for w in text8 if re.search('@[A-Za-z0-9_]+', w)]

['@UN', '@UN_Women']

 - . : wildcard, matches a single character
 - ^ : start of a string
 - $ : end of a string
 - [ ] : matches one of the range of characters within [ ]
 - [a-z] : matches one of the range of characters a, b, ..., z
 - [^a-z] : matches a character that is not a, b, ..., z
 - a|b : matches either a or b, where a and b are strings
 - ( ) : scoping for operators
 - \ : escape character for special characters (\t, \n, \b)
 - \b: Matches word boundary
 - \d: Any digit, equivalent to [0-9]
 - \D: Any non-digit, equivalent to [^0-9]
 - \s: Any whitespace, equivalent to [ \t\n\r\f\v]
 - \S: Any non-whitespace, equivalent to [^ \t\n\r\f\v]
 - \w: Alphanumeric character, equivalent to [a-zA-Z0-9_]
 - \W: Non-alphanumeric, equivalent to [^a-zA-Z0-9_]
 - * : matches zero or more occurrences
 - + : matches one or more occurrences
 - ? : matches zero or one occurrence
 - {n}: exactly n repetitions, n>=0
 - {n,}: at least n repetitions
 - {,n} : at most n repetitions
 - {m,n} : at least m and at most n repetitions
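
A few of the patterns above can be tried with `re.findall` and `re.search`. This is a minimal sketch; the string `sample` and the other inputs are invented for illustration.

```python
import re

sample = 'Call 555-0199 or 555-0123 before 9am, room B12.'  # illustrative input

phones = re.findall(r'\d{3}-\d{4}', sample)    # {n}: exactly n digits
rooms = re.findall(r'\b[A-Z]\d+\b', sample)    # \b: word boundaries, +: one or more
halves = re.findall(r'am|pm', sample)          # a|b: either alternative
starts = re.search(r'^Call', sample) is not None   # ^: start of the string
ends = re.search(r'\.$', sample) is not None       # \.: literal dot, $: end of the string
spellings = re.findall(r'colou?r', 'color colour')  # ?: zero or one 'u'
runs = re.findall(r'\d{2,3}', '1 22 333 4444')       # {m,n}: 2 to 3 digits

print(phones)     # ['555-0199', '555-0123']
print(rooms)      # ['B12']
print(halves)     # ['am']
print(starts, ends)   # True True
print(spellings)  # ['color', 'colour']
print(runs)       # ['22', '333', '444']
```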


In [40]:
[w for w in text8 if re.search(r'@\w+', w)] # raw string so that \w is passed to re unchanged

['@UN', '@UN_Women']

In [42]:
text12 = 'ouagadoudou'
re.findall(r'[aeiou]',text12)

['o', 'u', 'a', 'a', 'o', 'u', 'o', 'u']

In [43]:
re.findall(r'[^aeiou]',text12)

['g', 'd', 'd']