## Working With Text

In [2]:
text1 = "Ethics are built right into the ideals and objectives of the United Nations "
len(text1)

76

In [3]:
text2 = text1.split(' ')   # splits text at ' ' and returns a list 
len(text2)

14

In [4]:
text2

['Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations',
 '']
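
The empty string at the end of the list appears because `text1` ends with a space, and `split(' ')` treats every single space as a separator. As a small sketch (the names `with_sep` and `words` are only illustrative), calling `split()` with no argument splits on runs of whitespace and drops the empty strings:

```python
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

# split(' ') keeps an empty string after the trailing space
with_sep = text1.split(' ')

# split() with no argument splits on any run of whitespace
# and discards leading/trailing empty strings
words = text1.split()

print(len(with_sep), len(words))  # 14 13
```

Which form is appropriate depends on whether empty tokens carry meaning in the data; for plain word counts, the no-argument form is usually the safer choice.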

List comprehensions let us filter a list for specific words:

In [5]:
[w for w in text2 if len(w) > 3] # Words in text2 that are more than 3 characters long

['Ethics',
 'built',
 'right',
 'into',
 'ideals',
 'objectives',
 'United',
 'Nations']

In [6]:
[w for w in text2 if w.istitle()] # Capitalized words in text2

['Ethics', 'United', 'Nations']

In [7]:
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'

['Ethics', 'ideals', 'objectives', 'Nations']

We can find unique words using `set()`.

In [9]:
text3 = 'To be or not to be'
text4 = text3.split(' ')
len(text4)

6

In [10]:
len(set(text4))

5

In [11]:
set(text4)

{'To', 'be', 'not', 'or', 'to'}

In [12]:
len(set([w.lower() for w in text4])) # .lower converts the string to lowercase.

4
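
If we also want to know how often each word occurs, one possible approach (not part of the original walkthrough) is `collections.Counter` from the standard library. The sketch below lowercases each word first so that 'To' and 'to' are counted together; the name `counts` is only illustrative.

```python
from collections import Counter

text3 = 'To be or not to be'

# Lowercase each word before counting so that 'To' and 'to' are merged
counts = Counter(w.lower() for w in text3.split(' '))

print(len(counts))    # number of unique lowercase words: 4
print(counts['to'])   # 2
print(counts['not'])  # 1
```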

### Processing free-text

In [13]:
text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')
text6

['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr']

Finding hashtags:

In [14]:
[w for w in text6 if w.startswith('#')]

['#UNSG']

Finding callouts:

In [15]:
[w for w in text6 if w.startswith('@')]

['@']
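
Here `startswith('@')` also matches the bare '@' in "@ NY", which is not a callout. A minimal sketch of one way to exclude it, assuming a callout needs at least one character after the '@' (the name `callouts` is only illustrative):

```python
text5 = ('"Ethics are built right into the ideals and objectives of the United Nations" '
         '#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')
text6 = text5.split(' ')

# Keep only tokens that have at least one character after the '@'
callouts = [w for w in text6 if w.startswith('@') and len(w) > 1]

print(callouts)  # [] because this text contains no real callouts
```

This check still accepts tokens such as '@!', so it is only a rough filter.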

In [16]:
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')

We can use regular expressions to help us with more complex parsing.

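For example, '@[A-Za-z0-9_]+' matches any word that contains '@' followed by one or more of the following characters:

- an uppercase letter ('A-Z')
- a lowercase letter ('a-z')
- a digit ('0-9')
- an underscore ('_')

Because `re.search` looks for the pattern anywhere in a word, the '@' does not have to be the first character.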

In [19]:
import re

[w for w in text8 if re.search(r'@[A-Za-z0-9_]+', w)]

['@UN', '@UN_Women']
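
As an alternative sketch, `re.findall` can scan the whole string directly, so the text does not need to be split first. The names `callouts` and `shorthand` are only illustrative, and the second pattern assumes ASCII-only handles.

```python
import re

text7 = ('@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" '
         '#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')

# findall returns every non-overlapping match in the string
callouts = re.findall(r'@[A-Za-z0-9_]+', text7)
print(callouts)  # ['@UN', '@UN_Women']

# With re.ASCII, \w is shorthand for [A-Za-z0-9_]
shorthand = re.findall(r'@\w+', text7, flags=re.ASCII)
print(shorthand == callouts)  # True
```

Unlike the list comprehension above, `findall` returns only the matched part of each word, so a token such as 'via@UN,' would yield '@UN' rather than the whole token.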