## Working With Text

In [1]:
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

len(text1) # The length of text1

76

In [2]:
text2 = text1.split(' ') # Return a list of the words in text1, separating by ' '.

len(text2)

14

In [3]:
text2

['Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations',
 '']
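
The empty string at the end of the list comes from the trailing space in `text1`: `split(' ')` cuts at every single space. As a minimal sketch of an alternative, calling `split()` with no argument splits on runs of whitespace and drops empty strings at the ends. The variable names below are chosen only for this illustration.

```python
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

words_single_space = text1.split(' ')  # cuts at every single space
words_whitespace = text1.split()       # cuts on runs of whitespace, no empty strings

print(len(words_single_space))  # 14, includes the trailing ''
print(len(words_whitespace))    # 13
```

Which form is preferable depends on whether empty fields carry meaning in the data; for plain sentences, `split()` is usually the more convenient choice.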

## Finding specific words

A list comprehension lets us filter a list for specific words:

In [4]:
[w for w in text2 if len(w) > 3] # Words in text2 that are longer than 3 letters

['Ethics',
 'built',
 'right',
 'into',
 'ideals',
 'objectives',
 'United',
 'Nations']

In [5]:
[w for w in text2 if w.istitle()] # Capitalized words in text2

['Ethics', 'United', 'Nations']

In [6]:
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'

['Ethics', 'ideals', 'objectives', 'Nations']
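
The conditions above can also be combined, and the expression before `for` can transform each word. The following is a small sketch; `text2` is rebuilt with `split()` so the snippet runs on its own, and the result names are assumptions made for this example.

```python
text2 = "Ethics are built right into the ideals and objectives of the United Nations".split()

# Capitalized words that also end in 's'
capitalized_s = [w for w in text2 if w.istitle() and w.endswith('s')]

# (word, length) pairs for words longer than 3 letters
long_word_lengths = [(w, len(w)) for w in text2 if len(w) > 3]

print(capitalized_s)         # ['Ethics', 'Nations']
print(long_word_lengths[0])  # ('Ethics', 6)
```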

We can find unique words using `set()`.

In [7]:
text3 = 'To be or not to be'
text4 = text3.split(' ')

len(text4)

6

In [8]:
len(set(text4))

5

In [9]:
set(text4)

{'To', 'be', 'not', 'or', 'to'}

In [10]:
len(set([w.lower() for w in text4])) # .lower() converts each word to lowercase before deduplicating.

4

In [11]:
set([w.lower() for w in text4])

{'be', 'not', 'or', 'to'}
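
A set shows which words occur but not how often. As a hedged sketch, `collections.Counter` from the standard library (not used elsewhere in this notebook) can count case-insensitive word frequencies in the same sentence; `word_counts` is a name introduced for this example.

```python
from collections import Counter

text3 = 'To be or not to be'

# Lowercase each word first so that 'To' and 'to' are counted together
word_counts = Counter(w.lower() for w in text3.split())

print(word_counts['to'])  # 2
print(word_counts['or'])  # 1
print(len(word_counts))   # 4 unique words
```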

## Comparison functions

```
s.startswith(t)
s.endswith(t)
t in s # True if t is a substring of s
s.isupper(), s.islower(), s.istitle()
s.isalpha(), s.isdigit(), s.isalnum()
```
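
A short sketch of these tests on sample strings chosen for this illustration:

```python
s = 'United Nations'

print(s.startswith('United'))    # True
print(s.endswith('s'))           # True
print('Nat' in s)                # True
print(s.istitle())               # True: every word starts with a capital letter
print(s.isupper(), s.islower())  # False False
print(s.isalpha())               # False: the space is not a letter
print('2017'.isdigit())          # True
print('UN2017'.isalnum())        # True
```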

## String Operations

```
s.lower(), s.upper(), s.title()
s.split(t)
s.splitlines()
s.join(t)
s.strip(), s.rstrip() # remove whitespace from both ends / from the end only
s.find(t), s.rfind(t) # index of the first / last occurrence of t, or -1 if not found
s.replace(u,v)
```
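
A brief sketch of the case-changing, line-splitting, and joining operations, using sample strings that are assumptions for this example:

```python
s = 'the united nations'

upper = s.upper()  # 'THE UNITED NATIONS'
title = s.title()  # 'The United Nations'

# splitlines() cuts at line boundaries and drops the newline characters
lines = 'first line\nsecond line\n'.splitlines()  # ['first line', 'second line']

# join() is called on the separator and takes the list of pieces
joined = '-'.join(['a', 'b', 'c'])  # 'a-b-c'

print(upper, title, lines, joined)
```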

In [18]:
text5 = 'ouagadougou'
text6 = text5.split('ou')
text6

['', 'agad', 'g', '']

In [20]:
'ou'.join(text6)

'ouagadougou'

In [21]:
list(text5)

['o', 'u', 'a', 'g', 'a', 'd', 'o', 'u', 'g', 'o', 'u']

## Cleaning text

In [22]:
text8 = '    A quick brown fox jumped over the lazy dog. '
text8.split(' ')

['',
 '',
 '',
 '',
 'A',
 'quick',
 'brown',
 'fox',
 'jumped',
 'over',
 'the',
 'lazy',
 'dog.',
 '']

In [24]:
text9 = text8.strip()
text9

'A quick brown fox jumped over the lazy dog.'

### Cleaning spaces or newlines at the end of a line

In [37]:
text10 = 'A quick brown fox jumped over the lazy dog\n'
text10.rstrip()

'A quick brown fox jumped over the lazy dog'
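
The stripping methods also accept an optional argument naming the characters to remove. The sketch below compares the variants on a sample string introduced for this example; `repr()` is used so the whitespace is visible in the printed output.

```python
line = '  A quick brown fox jumped over the lazy dog.\n'

right_stripped = line.rstrip()        # removes trailing whitespace, keeps leading spaces
left_stripped = line.lstrip()         # removes leading whitespace, keeps the newline
newline_only = line.rstrip('\n')      # removes only trailing newline characters
no_period = line.strip().rstrip('.')  # strips whitespace, then the final period

print(repr(right_stripped))
print(repr(left_stripped))
print(repr(no_period))
```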

## Changing text

In [25]:
text9.find('o')

10

In [26]:
text9.rfind('o')

40

In [27]:
text9.replace('o', 'O')

'A quick brOwn fOx jumped Over the lazy dOg.'
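
A few details of these methods are worth checking: `find()` returns `-1` when nothing is found, it accepts a start index, and `replace()` returns a new string rather than changing the original. A minimal sketch, with result names chosen for this illustration:

```python
text9 = 'A quick brown fox jumped over the lazy dog.'

missing = text9.find('cat')               # -1: 'cat' does not occur
next_o = text9.find('o', 11)              # 15: first 'o' at or after index 11
first_only = text9.replace('o', 'O', 1)   # replace only the first occurrence

print(missing, next_o)
print(first_only)  # 'A quick brOwn fox jumped over the lazy dog.'
print(text9)       # unchanged, because strings are immutable
```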

## Handling larger files

### Reading files line by line

In [30]:
f = open('sample_data/README.md', 'r')
f.readline() 

'This directory includes a few sample datasets to get you started.\n'

### Reading the full file

In [35]:
f.seek(0)
text12 = f.read()
len(text12)

930

In [32]:
text13 = text12.splitlines()
len(text13)

19

In [33]:
text13[0]

'This directory includes a few sample datasets to get you started.'

### File operations

```
f = open(filename, mode)
f.readline(), f.read(), f.read(n) # read one line / the whole file / up to n characters
for line in f: doSomething(line)
f.seek(n) # move the reading position to offset n
f.write(message)
f.close()
f.closed # True if the file has been closed
```
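
In practice these calls are often wrapped in a `with` statement, which closes the file automatically when the block ends. The sketch below writes and reads a temporary file; the path and file contents are hypothetical and exist only for this example.

```python
import os
import tempfile

# Hypothetical file created only for this example
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

with open(path, 'w') as f:
    f.write('first line\nsecond line\n')

with open(path, 'r') as f:
    first = f.readline()                       # read a single line
    f.seek(0)                                  # go back to the start of the file
    lines = [line.rstrip('\n') for line in f]  # iterate over the file line by line

print(repr(first))  # 'first line\n'
print(lines)        # ['first line', 'second line']
print(f.closed)     # True: the with block closed the file
```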

## Processing free-text

In [12]:
text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')

text6

['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr']

Finding hashtags:

In [13]:
[w for w in text6 if w.startswith('#')]

['#UNSG']

Finding callouts:

In [14]:
[w for w in text6 if w.startswith('@')]

['@']

In [15]:
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')


We can use regular expressions to help us with more complex parsing. 

For example, `'@[A-Za-z0-9_]+'` matches every word that contains `'@'` followed by at least one of these characters:
* capital letter (`'A-Z'`)
* lowercase letter (`'a-z'`)
* number (`'0-9'`)
* underscore (`'_'`)

Because `re.search` scans the whole word, the `'@'` does not have to be the first character.

In [16]:
import re # import re - a module that provides support for regular expressions

[w for w in text8 if re.search('@[A-Za-z0-9_]+', w)]

['@UN', '@UN_Women']
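
As an alternative sketch, `re.findall` can extract the matches directly from the unsplit string, and the same character class works for hashtags. The `'#...'` pattern and the variable names are assumptions added for this illustration.

```python
import re

text7 = ('@UN @UN_Women "Ethics are built right into the ideals and objectives '
         'of the United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')

# findall returns every non-overlapping match in the string
mentions = re.findall(r'@[A-Za-z0-9_]+', text7)
hashtags = re.findall(r'#[A-Za-z0-9_]+', text7)

print(mentions)  # ['@UN', '@UN_Women']
print(hashtags)  # ['#UNSG']
```

The lone `@` before `NY` is skipped because the pattern requires at least one character after it.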