# LAB 2 Chapter 2 Strings

Almost every useful program involves some kind of text processing, whether it is parsing
data or generating output. This chapter focuses on common problems involving text
manipulation, such as pulling apart strings, searching, substitution, lexing, and parsing.
Many of these tasks can be easily solved using built-in methods of strings. However,
more complicated operations might require the use of regular expressions or the creation
of a full-fledged parser. All of these topics are covered. In addition, a few tricky
aspects of working with Unicode are addressed.

#### 2.1 Problem
You need to split a string into fields, but the delimiters (and spacing around them) aren’t
consistent throughout the string.

- `str.split()` only takes a single delimiter
- `re.split()` is more flexible because you can specify a pattern for the delimiter
- using a capture group `(...)` in the `re.split()` pattern includes the matched delimiters in the results of the split
- a noncapture group uses `(?:...)`

In [24]:
import re
line = 'asdf fjdk; afed, fjek,asdf, foo mike) moore('
# Split on a semicolon, comma, or whitespace, followed by any extra whitespace
print(re.split(r'[;,\s]\s*', line))
# A capture group keeps the matched delimiters in the result
fields = re.split(r'(;|,|\s)\s*', line)
print('*'*20)
print('fields: ', fields)
values = fields[::2]               # even positions hold the values
delimiters = fields[1::2] + ['']   # odd positions hold the delimiters

print('values: ',values)
print('delimiters: ',delimiters)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo', 'mike)', 'moore(']
********************
fields:  ['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo', ' ', 'mike)', ' ', 'moore(']
values:  ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo', 'mike)', 'moore(']
delimiters:  [' ', ';', ',', ',', ',', ' ', ' ', '']
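
The notes above mention noncapture groups, but the cell does not use one. As a minimal sketch (the shorter sample line and the variable names are only for illustration), the following shows a `(?:...)` group leaving the delimiters out of the result, and the captured delimiters being used to rebuild a string:

```python
import re

line = 'asdf fjdk; afed, fjek,asdf, foo'

# Noncapture group: the alternation is grouped, but the delimiters are not returned
plain = re.split(r'(?:,|;|\s)\s*', line)
print(plain)

# Capture group: the delimiters are returned, so the line can be reassembled
fields = re.split(r'(;|,|\s)\s*', line)
values = fields[::2]
delimiters = fields[1::2] + ['']
rejoined = ''.join(v + d for v, d in zip(values, delimiters))
print(rejoined)
```

The rebuilt string drops the whitespace that `\s*` consumed after each delimiter, because that whitespace sits outside the capture group.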


#### 2.2 Problem
You need to check the start or end of a string for specific text patterns, such as filename extensions, URL schemes, and so on.

- `str.startswith()`
- `str.endswith()`
- to check against several choices, pass them to either method as a tuple; passing a list raises a `TypeError`

In [31]:
name = 'Michael'
print(name.startswith('Mic'), name.endswith('ael'))

True True
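
The tuple requirement noted above can be sketched as follows. The filenames and the URL are made-up sample data, not part of the original notes:

```python
filenames = ['Makefile', 'foo.c', 'bar.py', 'spam.c', 'spam.h']

# A tuple of suffixes checks several extensions in one call
c_files = [name for name in filenames if name.endswith(('.c', '.h'))]
print(c_files)

choices = ['http:', 'ftp:']
url = 'http://www.python.org'

# A list is rejected, so convert it to a tuple first
try:
    url.startswith(choices)
    list_accepted = True
except TypeError:
    list_accepted = False

is_remote = url.startswith(tuple(choices))
print(list_accepted, is_remote)
```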


#### 2.3 Problem
You want to match text using the same wildcard patterns as are commonly used when
working in Unix shells (e.g., *.py, Dat[0-9]*.csv, etc.).

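- `from fnmatch import fnmatch, fnmatchcase`
- `fnmatch()` follows the case-sensitivity rules of the underlying operating system; `fnmatchcase()` is always case-sensitive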


In [51]:
from fnmatch import fnmatch, fnmatchcase
# fnmatch() gives True here on Windows, False on Linux and macOS
print(fnmatch('foo.txt', '*.TXT'))
print(fnmatchcase('foo.txt', '*.TXT'))

False
False
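
A common use of these functions is filtering a list of names. The sketch below uses `fnmatchcase()` so the result does not depend on the operating system; the file names are invented for illustration:

```python
from fnmatch import fnmatchcase

names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py', 'DAT3.CSV']

# Shell-style pattern: 'Dat', one digit, anything, then '.csv' (case-sensitive)
csv_files = [name for name in names if fnmatchcase(name, 'Dat[0-9]*.csv')]
py_files = [name for name in names if fnmatchcase(name, '*.py')]

print(csv_files)
print(py_files)
```

`DAT3.CSV` is left out because `fnmatchcase()` compares case exactly.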


#### 2.4 Problem
You want to match or search text for a specific pattern.
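- for a simple literal, use the string methods
    - `str.find()`, `str.endswith()`, `str.startswith()`
- use `re` (regular expressions) to do more complicated things
    - `\d+` matches one or more digits
    - `+` means one or more of the preceding element
    - add `$` to the end of the pattern if you want an exact match

- it is better to precompile a pattern if it is going to be matched multiple times
    - `match()` only looks for a match at the start of the string
    - use `findall()` to get all matches
- capture groups in the pattern let you separate out the parts of a match
- it is better to write patterns as raw strings so the backslashes are left uninterpreted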
    

In [38]:
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
m = datepat.match('11/27/2012')
print(type(m))
print(m)
print('group(): ',m.group())
print('group(0): ',m.group(0))
print('group(1): ',m.group(1))
print('group(2): ',m.group(2))
print('group(3): ',m.group(3))
print('groups: ',m.groups())

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
datepat.findall(text)

<class 're.Match'>
<re.Match object; span=(0, 10), match='11/27/2012'>
group():  11/27/2012
group(0):  11/27/2012
group(1):  11
group(2):  27
group(3):  2012
groups:  ('11', '27', '2012')


[('11', '27', '2012'), ('3', '13', '2013')]
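
The notes on `match()` and `$` can be checked with a short sketch. The trailing `abcdef` input and the variable names are assumptions chosen for illustration:

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

# match() only anchors at the start, so trailing junk is silently ignored
m = datepat.match('11/27/2012abcdef')
prefix = m.group()

# Adding $ requires the pattern to run to the end of the string
exact = re.compile(r'(\d+)/(\d+)/(\d+)$')
rejected = exact.match('11/27/2012abcdef')
accepted = exact.match('11/27/2012')

# findall() returns one tuple of groups per match, which can be unpacked
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
dates = ['{}-{}-{}'.format(year, month, day)
         for month, day, year in datepat.findall(text)]

print(prefix, rejected, accepted.groups())
print(dates)
```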

#### 2.5 Problem
You want to search for and replace a text pattern in a string.

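- `str.replace()` for simple literal patterns
- for more complicated patterns use `sub()` from the `re` module
- in the replacement string, a backslashed digit such as `\3` refers to that capture group number in the pattern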

In [43]:
text = 'yeah, but no, but yeah, but no, but yeah'
print(text.replace('no', 'na'))
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
print('text',text)
print('text re sub:',re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text))


yeah, but na, but yeah, but na, but yeah
text Today is 11/27/2012. PyCon starts 3/13/2013.
text re sub: Today is 2012-11-27. PyCon starts 2013-3-13.
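
If the same substitution is applied repeatedly, the pattern can be compiled once, as suggested in the 2.4 notes. The sketch below also uses `subn()`, which returns the new text together with the number of substitutions made. It is one possible variation, not the only way to do it:

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# subn() behaves like sub() but also reports how many replacements happened
new_text, count = datepat.subn(r'\3-\1-\2', text)

print(new_text)
print(count)
```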


#### 2.6 Problem
You need to search for and possibly replace text in a case-insensitive manner.

- re.IGNORECASE flag
    - re.sub('python', 'snake', text, flags=re.IGNORECASE)
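
A small sketch of the flag in use, with a made-up sample string:

```python
import re

text = 'UPPER PYTHON, lower python, Mixed Python'

# The flag makes the pattern match regardless of case
found = re.findall('python', text, flags=re.IGNORECASE)
replaced = re.sub('python', 'snake', text, flags=re.IGNORECASE)

print(found)
print(replaced)
```

Note that the replacement text is inserted exactly as written; it does not pick up the case of the text it replaces.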

#### 2.7 Problem
You’re trying to match a text pattern using regular expressions, but it is identifying the longest possible matches of a pattern. Instead, you would like to change it to find the shortest possible match.

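- `?` matches 0 or 1 occurrence of the preceding element; placed after `*` or `+` it makes that operator non-greedy, so it finds the shortest possible match
- `.` matches any character except a newline (`\n`)
- `(?:...)` is a noncapture group (i.e., it defines a group for the purposes of matching, but that group is not captured separately or numbered)
- `re.DOTALL` makes `.` match all characters, including newlines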
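
The difference between the greedy and non-greedy forms can be seen on a string with two quoted phrases. The sample text and variable names are assumptions for illustration:

```python
import re

text = 'Computer says "no." Phone says "yes."'

# Greedy: .* runs to the last quote, so both phrases end up in one match
greedy = re.compile(r'"(.*)"')
greedy_matches = greedy.findall(text)

# Non-greedy: .*? stops at the first closing quote it can
lazy = re.compile(r'"(.*?)"')
lazy_matches = lazy.findall(text)

print(greedy_matches)
print(lazy_matches)
```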

#### 2.8 Problem
You’re trying to match a block of text using a regular expression, but you need the match
to span multiple lines.

In [47]:
comment = re.compile(r'/\*(.*?)\*/')
text1 = '/* this is a comment */'
text2 = '''/* this is a
multiline comment */
'''

print(comment.findall(text1))
# . does not match newlines, so the multiline comment is not found
print(comment.findall(text2))
# (?:.|\n) is a noncapture group that also matches newlines
comment = re.compile(r'/\*((?:.|\n)*?)\*/')
print(comment.findall(text2))

[' this is a comment ']
[]
[' this is a\nmultiline comment ']
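
As an alternative to the `(?:.|\n)` group, the `re.DOTALL` flag mentioned in the notes makes `.` match newlines as well. A minimal sketch with its own sample text:

```python
import re

text = '''/* this is a
multiline comment */
'''

# Without the flag, . stops at the newline and nothing is found
plain = re.compile(r'/\*(.*?)\*/')
plain_matches = plain.findall(text)

# With re.DOTALL, . also matches the newline
dotall = re.compile(r'/\*(.*?)\*/', re.DOTALL)
dotall_matches = dotall.findall(text)

print(plain_matches)
print(dotall_matches)
```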


#### 2.9 Problem
You’re working with Unicode strings, but need to make sure that all of the strings have
the same underlying representation.
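
The notes stop at the problem statement here. One standard-library approach, offered only as a sketch, is `unicodedata.normalize()`. The two sample strings below are assumptions: they show the same text written with a precomposed character and with a combining character.

```python
import unicodedata

s1 = 'Spicy Jalape\u00f1o'    # n with tilde as a single code point
s2 = 'Spicy Jalapen\u0303o'   # plain n followed by a combining tilde

# The strings display the same but compare as different
same_before = (s1 == s2)

# NFC composes characters where possible, giving both the same representation
n1 = unicodedata.normalize('NFC', s1)
n2 = unicodedata.normalize('NFC', s2)
same_after = (n1 == n2)

print(same_before, len(s1), len(s2))
print(same_after, len(n1), len(n2))
```

`'NFD'` would work equally well for comparison purposes; it decomposes characters instead of composing them.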