# IT Skills for linguists 2
## UAM, Faculty of English, 2BA
### Topic: *Text manipulation*
#### Poznań, 27.02.2023
#### Teacher: mgr inż. Michał Junczyk


# 7. Text manipulation

- Language researchers are often interested in
  - Sifting through texts
  - Finding words or phrases with particular properties
- Previous class - regular expressions and pattern matching 
- This class - manipulating text, i.e. converting one string of letters into another
- This is a typical task in programs that deal with natural language.
- Tools covered: the *re* functions *re.sub()* and *re.split()*, and the string methods *translate()* and *join()*.

## 7.1 String Manipulation Is Costly

- String manipulation can be computationally intensive: Python strings are immutable, so any change produces a new string.

In [22]:
s = ''
for i in 'Apalachicola':
    if i not in 'aeiou':
        s = s + i
print(s)

Aplchcl


- Every time a letter is not a vowel, a new string is created <br> (when *s = s + i* is executed)
- The old string *s* then becomes available for garbage collection
- If the string is long, or if the operation is done on a lot of strings, <br> a lot of new strings are created and a lot of garbage collection is needed
- Solution: convert strings to lists before processing (lists can be changed in place)
- Yet string manipulation is still costly!
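
The list-based solution mentioned above can be sketched as follows. This is a minimal illustration, not a benchmark: the names `word`, `kept` and `result` are chosen here for the example, and the actual speed difference depends on the Python implementation and on the size of the input.

```python
# Collect the non-vowel letters in a list and build the string once at the end.
word = 'Apalachicola'
kept = []                      # lists are mutable, so append() changes them in place
for letter in word:
    if letter not in 'aeiou':
        kept.append(letter)
result = ''.join(kept)         # the final string is created once, here
print(result)
```

The loop no longer creates a new string on every pass; only the final *join()* call builds a string.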

## 7.2 Manipulating Text

- The *sub()* function in the *re* module is the simplest function for manipulating text
- It converts one string into another by pattern matching
- Arguments:
  - a pattern
  - a replacement
  - the string
  - *count* - a maximum number of replacements (optional)
  - *flags* - additional flags (optional)

In [23]:
# manip1.py
import re
#define a string
s1 = 'This is a rather long string'
#replace '.s' with 'INS'
s2 = re.sub('.s','INS',s1)
#print old and new strings
print(s1,'\n',s2)

This is a rather long string 
 ThINS INS a rather longINString


### Example of *max replacement count* argument usage

In [24]:
import re

#a test string
s1 = 'This is a rather long string'
#a pattern
pat = '.s'
#find how many instances of the pattern
countmax = len(re.findall(pat,s1))
#print the string
print(s1)
#make substitutions 1 by 1
i = 1
while i < countmax+1:
    #make a change
    s2 = re.sub(pat,'WOW',s1,count=i)
    #print that one change
    print('\t',i,':',s2)
    #increment counter
    i += 1

This is a rather long string
	 1 : ThWOW is a rather long string
	 2 : ThWOW WOW a rather long string
	 3 : ThWOW WOW a rather longWOWtring


- The length of the list returned by *findall()* gives the total number of matches
- The loop iterates through increasing numbers of substitutions
- Notice: if *count = 0* (the default), all substitutions are made (rather than none)
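
The behaviour of *count* can be checked directly. The sketch below reuses the test string and pattern from the example; the variable names `all_default`, `all_zero` and `only_two` are introduced here only for illustration.

```python
import re

s1 = 'This is a rather long string'
pat = '.s'

all_default = re.sub(pat, 'WOW', s1)           # count omitted
all_zero = re.sub(pat, 'WOW', s1, count=0)     # count=0 means "no limit"
only_two = re.sub(pat, 'WOW', s1, count=2)     # at most two replacements

print(all_default == all_zero)
print(only_two)
```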

### Example of *flags* argument usage

- The *flags* argument provides various options
- Example: case-insensitive matching: *flags=re.I*
- Note - case-insensitive matching can also be achieved by adjusting the pattern

In [25]:
import re
#a test string
s1 = 'This is a rather long string'
#do a replacement
s2 = re.sub('t','WOW',s1)
#do a case-insensitive replacement
s3 = re.sub('t','WOW',s1,flags=re.I)
#incorporate case directly in the pattern
s4 = re.sub('t|T','WOW',s1)
#show the original string and all three results
print(s1,'\n',s2,'\n',s3,'\n',s4,sep='')

This is a rather long string
This is a raWOWher long sWOWring
WOWhis is a raWOWher long sWOWring
WOWhis is a raWOWher long sWOWring
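
Two equivalent spellings may be worth knowing: *re.IGNORECASE* is the long name of *re.I*, and a character class such as `[tT]` is a compact alternative to `t|T`. The sketch below only restates the example above; the names `s5` and `s6` are introduced here for illustration.

```python
import re

s1 = 'This is a rather long string'

# long name of the re.I flag
s5 = re.sub('t', 'WOW', s1, flags=re.IGNORECASE)
# character class instead of the alternation 't|T'
s6 = re.sub('[tT]', 'WOW', s1)

print(s5 == s6)
```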


### The string *translate()* method

- Useful for converting single letters to other single letters
- Usage:
  - The *str.maketrans()* method makes a translation table
      - the table specifies which letters are mapped to which
      - the table is later passed to *translate()* to make the translation
      - the translation table is implemented as a Python dictionary
      - the two string arguments to *str.maketrans()* must be the same length

In [26]:
#make a translation table
mytab = str.maketrans('aeiou','happy')
#a test string
s = 'This is my sample string'
#print that string
print(s)
#print translation
print(s.translate(mytab))

This is my sample string
Thps ps my shmpla strpng
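
To see what the translation table looks like, it can be printed and inspected. The sketch below reuses the table from the example; the name `length_error` is introduced here only to record what happens when the two arguments differ in length.

```python
# The table is a dictionary whose keys and values are integer character codes.
mytab = str.maketrans('aeiou', 'happy')
print(mytab)

# ord() turns a character into its code, chr() turns a code back into a character
print(chr(mytab[ord('a')]))

# Arguments of unequal length are rejected with a ValueError
try:
    str.maketrans('aeiou', 'xyz')
    length_error = False
except ValueError:
    length_error = True
print(length_error)
```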


### The *re.split()* function

- Not to be confused with the string method *split()*
- The string *split()* splits a string up based on some specific **delimiter string**
- The *re.split()* function splits a string based on a **regular expression** instead

In [27]:
#manip5.py
import re

#a test string
s = 'First sentence. Second sentence.'
ss1 = s.split('e.')    #do a regular split
ss2 = re.split('e.',s) #do re.split
print(s)               #print sentence
#print split() results
print('s.split()')
for ss in ss1:
    print('\t"',ss,'"',sep='')
#print re.split() results
print('re.split()')
for ss in ss2:
    print('\t"',ss,'"',sep='')

First sentence. Second sentence.
s.split()
	"First sentenc"
	" Second sentenc"
	""
re.split()
	"First s"
	"t"
	"c"
	" S"
	"ond s"
	"t"
	"c"
	""


- The program compares the results of the string method *split()* and *re.split()* with the delimiter 'e.'.
- The string method *split()* interprets the delimiter literally and returns 3 strings
- The *re.split()* function interprets it as a regular expression ('e' followed by any character) and returns 8 strings
- Note - different syntax
  - The *re.split()* function takes two required arguments
  - the pattern is the first argument and the string to split is the second
  - The *split()* method is suffixed to the string it operates on and takes a single argument of *delimiter*.
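
If a delimiter should be taken literally by *re.split()*, one option is to escape it with *re.escape()*. The sketch below shows this, followed by a pattern that a plain delimiter string cannot express. The sample `text` and the names `literal`, `escaped`, `parts` and `sentences` are made up for this illustration, and the sentence pattern is deliberately simplistic.

```python
import re

s = 'First sentence. Second sentence.'

literal = s.split('e.')                   # delimiter taken literally
escaped = re.split(re.escape('e.'), s)    # the dot is escaped, so it matches only a real '.'
print(literal == escaped)

# Split after '.', '!' or '?' together with any following whitespace
text = 'First sentence. Second sentence! Third?'   # hypothetical sample text
parts = re.split(r'[.!?]\s*', text)
sentences = [p for p in parts if p]       # drop the empty string left at the end
print(sentences)
```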

### The string *join()* method

- The *join()* method joins a list of strings together with a string infix (it is a string method, not part of the *re* module)
- Syntax: The string it is suffixed to is the infix.
    - single argument - list of strings (any iterable of strings works)

In [28]:
#a test sentence
s = 'This is a sentence.'
#split into words
wds = s.split()
#define hyphen
hyphen = '-'
#join bits with hyphen
hyphenated = hyphen.join(wds)
#print original sentence
print(s)
#print hyphenated sentence
print(hyphenated)

This is a sentence.
This-is-a-sentence.
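
A few further uses of *join()* are sketched below, reusing the test sentence from the example. The names `restored`, `squeezed`, `lengths` and `as_text` are introduced here only for illustration.

```python
s = 'This is a sentence.'
wds = s.split()

# Joining with a single space restores this particular sentence
restored = ' '.join(wds)
print(restored == s)

# The infix can be any string, including the empty string
squeezed = ''.join(wds)
print(squeezed)

# join() expects strings, so other values have to be converted first
lengths = [len(w) for w in wds]
as_text = '-'.join(str(n) for n in lengths)
print(as_text)
```

Note that the round trip through *split()* and *join()* only restores the original when the words were separated by single spaces.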
