## Before continuing, please select menu option:  **Cell => All output => clear**

# Regex

## A (Very Brief) History of Regular Expressions

In 1951, mathematician Stephen Cole Kleene described the concept of a regular language, a language that is recognizable by a finite automaton and formally expressible using regular expressions. In the mid-1960s, computer science pioneer Ken Thompson, one of the original designers of Unix, implemented pattern matching in the QED text editor using Kleene’s notation.

Since then, regexes have appeared in many programming languages, editors, and other tools as a means of determining whether a string matches a specified pattern. Python, Java, and Perl all support regex functionality, as do most Unix tools and many text editors.

## Exercise:

Play around on https://pythex.org/ by copying and pasting the test string below:

    The "quick" brown fox (may have jumped), [or not] over the lazy dog"

Hint: See the cheat sheet on the Pythex page.



* Try searching for the letter "o".
* Try searching for the word "fox".
* Search for all whitespace
* Search for all non-whitespace
* Try searching for the word "fox" followed by "dog".
* Try searching for a word with letter "o" or "a" in it.
* Make groups of any word with an "o" in it.
* Group words within []

In [None]:
# strings support some matching and searching functionality:
s = 'foo123bar'
'123' in s

In [None]:
s = 'foo123bar'
s.find('123')
# or
s.index('123')
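
The two methods above return the same offset when the substring is present, but they behave differently when it is missing. A minimal sketch of that difference (the variable names are only for illustration):

```python
s = 'foo123bar'

# Both return the starting offset when the substring is present.
found_at = s.find('123')
indexed_at = s.index('123')
print(found_at, indexed_at)

# When the substring is missing, find() returns -1 ...
missing = s.find('999')
print(missing)

# ... while index() raises ValueError, so it needs a try/except.
try:
    s.index('999')
    raised = False
except ValueError:
    raised = True
print(raised)
```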

### Rather than searching for a fixed substring like '123', suppose you wanted to determine whether a string contains any three consecutive decimal digit characters, as in the strings 'foo123bar', 'foo456bar', '234baz', and 'qux678'.

In [None]:
# import the "re" module from the python standard library:
import re
# dir(re) if you wish...

In [None]:
s = 'foo123bar'
re.search(r'\d\d\d', s)   # returns a match object for the first run of three digits
# The span of the match object gives the same indices you would use to slice the string
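
To make the relationship between the span and a string slice concrete, here is a small sketch using the same string. The names `start`, `end`, `matched` and `same` are illustrative only.

```python
import re

s = 'foo123bar'
m = re.search(r'\d\d\d', s)

start, end = m.span()        # (start, end) offsets of the match
matched = m.group(0)         # the text that matched
same = s[start:end] == matched
print(m.span(), matched, same)
```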

In [None]:
if re.search('[0-9]{3}', s):   # a match object is truthy
    print('Found a match')
else:
    print('No match')

In [None]:
if re.match(r'\d\d\d', s):   # match() only succeeds if the pattern matches at the beginning of the string
    print('Found a match')
else:
    print('No match')
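
The difference between `re.search` and `re.match` is easy to miss, so here is a short side-by-side sketch. It also shows `re.fullmatch`, which is not used elsewhere in this notebook and requires the whole string to match.

```python
import re

s = 'foo123bar'
pattern = r'\d\d\d'

anywhere = re.search(pattern, s)            # scans the whole string
at_start = re.match(pattern, s)             # only tries position 0, so None here
at_start_ok = re.match(pattern, '123bar')   # succeeds: the digits come first
whole = re.fullmatch(pattern, '123')        # the entire string must match

print(bool(anywhere), at_start, bool(at_start_ok), bool(whole))
```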

Notice the `r` prefix above, which denotes a "raw" string. This is a common convention for regex patterns because it stops Python from interpreting backslashes before the regex engine sees them.
Unless an 'r' or 'R' prefix is present, escape sequences in string and bytes literals are interpreted according to rules similar to those used by Standard C. The recognized escape sequences are:
```
\newline = Backslash and newline ignored
\\ = Backslash (\)
\' = Single quote (')
\" = Double quote (")
\a = ASCII Bell (BEL)
\b = ASCII Backspace (BS)
\f = ASCII Formfeed (FF)
\n = ASCII Linefeed (LF)
\r = ASCII Carriage Return (CR)
\t = ASCII Horizontal Tab (TAB)
\v = ASCII Vertical Tab (VT)
\ooo = Character with octal value ooo
\xhh = Character with hex value hh
```

Escape sequences recognized only in string literals (not in bytes literals) are:
```
\N{name} = Character named name in the Unicode database
\uxxxx = Character with 16-bit hex value xxxx
\Uxxxxxxxx = Character with 32-bit hex value xxxxxxxx
```

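For example, each of the following searches for a newline character (here in a sample string `text`):
```
text = 'line one\nline two'
m = re.search(chr(92)+chr(110), text)   # backslash + "n", built from character codes
m = re.search("\\n", text)              # the backslash escaped in a normal string
m = re.search(r"\n", text)              # a raw string: the most readable form
```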
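
The three patterns above are the same two-character string (a backslash followed by `n`), which the regex engine then reads as "newline". A quick sketch to check this, with illustrative variable names and a made-up sample string:

```python
import re

built = chr(92) + chr(110)   # backslash followed by the letter n
escaped = "\\n"
raw = r"\n"

same_pattern = built == escaped == raw
print(same_pattern, len(raw))     # the pattern is two characters long
print(len("\n"))                  # a real newline is a single character

text = 'line one\nline two'
m = re.search(raw, text)
print(m.span())
```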

In [None]:
# We can use parentheses to pull subgroups out of the string:
s = 'foo123bar  foo bar'
m = re.search(r'(\d\d\d).*\s+(\w+)\s+', s)
if m:
    print(m.groups())
    print(m.group(0)) # The entire match
    print(m.group(1)) # The first parenthesized subgroup
    print(m.group(2)) # The second parenthesized subgroup
    print(m.group(0,1,2)) # Multiple args give a tuple

### Question: What if we wanted the second subgroup above to pull out any characters, not just alphanumeric ones (`\w`)?

In [None]:
sshow = '''Index Slot Port Address Media  Speed        State    Proto
============================================================
   0    3    0   720000   id    N32	  No_Light    FC  Disabled (Persistent) 
   1    3    1   720100   id    N32	  No_Light    FC  Disabled (Persistent) 
   2    3    2   720200   id    8G 	  Online      FC  F-Port  50:00:09:75:a8:17:50:1f 
   3    3    3   720300   id    8G 	  Online      FC  F-Port  50:00:09:75:a8:17:50:9f 
   4    3    4   720400   id    N32	  No_Light    FC  Disabled (Persistent) 
   5    3    5   720500   id    N32	  No_Light    FC  Disabled (Persistent) 
   8    3    8   720800   id    N32	  No_Light    FC  Disabled (Persistent) 
   9    3    9   720900   id    N32	  No_Light    FC  Disabled (Persistent) 
  10    3   10   720a00   id    N32	  No_Light    FC  Disabled (Persistent) 
  11    3   11   720b00   id    N32	  No_Light    FC  Disabled (Persistent) 
  12    3   12   720c00   id    N32	  No_Light    FC  Disabled (Persistent) 
  13    3   13   720d00   id    N32	  No_Light    FC  Disabled (Persistent) 
  14    3   14   720e00   id    8G 	  Online      FC  F-Port  50:06:0e:80:12:3b:2c:13 
  15    3   15   720f00   id    8G 	  Online      FC  LS E-Port  10:00:88:94:71:25:4b:43 "sgsindcw03sanr02" 
  16    3   16   721000   id    8G 	  Online      FC  F-Port  50:06:0e:80:07:c8:62:33 
  17    3   17   721100   id    8G 	  Online      FC  F-Port  50:06:0e:80:07:c8:62:bb 
  18    3   18   721200   id    8G 	  Online      FC  F-Port  50:06:0e:80:07:29:83:13 
  19    3   19   721300   id    8G 	  Online      FC  F-Port  50:06:0e:80:07:2a:84:13 
  20    3   20   721400   id    8G 	  Online      FC  F-Port  50:06:0e:80:07:2a:fe:13 
'''.split('\n')

### For complex matches, or when the same pattern is checked many times in a loop, it is recommended to `compile` it to a regular expression pattern object first for speed.
 - https://docs.python.org/3/library/re.html#regular-expression-objects
 - https://docs.python.org/3/library/re.html#match-objects

In [None]:
wwnpat = re.compile(r'([0-9a-f][0-9a-f]([\s:-]?[0-9a-f][0-9a-f]){7})', re.IGNORECASE)  # find wwn
# most re functions (search, match, split, ...) are also available as methods on the compiled pattern:
for i, line in enumerate(sshow, 1):
    m = wwnpat.search(line)
    if m:
        print(m.groups())
        print(f'line{i}:', m.group(0))
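
The `m.groups()` output above contains two items because the pattern has two sets of capturing parentheses, and a repeated group only keeps its last repetition. If only the whole WWN is wanted, one option is a non-capturing group `(?:...)`. This is a sketch on a single shortened line, not a change to the pattern used above.

```python
import re

line = 'Online      FC  F-Port  50:00:09:75:a8:17:50:1f'

# Same pattern as above: the inner group captures too.
capturing = re.compile(r'([0-9a-f][0-9a-f]([\s:-]?[0-9a-f][0-9a-f]){7})', re.IGNORECASE)
# Inner group changed to (?:...), so it still repeats but captures nothing.
non_capturing = re.compile(r'([0-9a-f][0-9a-f](?:[\s:-]?[0-9a-f][0-9a-f]){7})', re.IGNORECASE)

both = capturing.search(line).groups()
only_wwn = non_capturing.search(line).groups()
print(both)
print(only_wwn)
```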

In [None]:
wwnpat = re.compile(r'([0-9a-f][0-9a-f]([\s:-]?[0-9a-f][0-9a-f]){7})', re.IGNORECASE)  # find wwn
# most re functions (search, match, split, ...) are also available as methods on the compiled pattern:
for i, line in enumerate(sshow, 1):
    m = wwnpat.search(line)
    if m:
        print(re.split(r'\s+', line))    # <== here's an example of splitting a string using regular expressions
        print(f'line{i}:', m.group(0))
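
One detail worth noticing in the `re.split` output is the empty string at the start of each list, caused by the leading whitespace on the line. A small sketch of two ways to avoid it, using a shortened sample line:

```python
import re

line = '   2    3    2   720200   id'

by_regex = re.split(r'\s+', line)           # leading whitespace yields an empty first field
by_strip = re.split(r'\s+', line.strip())   # strip first to avoid the empty field
by_str_split = line.split()                 # str.split() with no argument also trims

print(by_regex)
print(by_strip)
print(by_str_split)
```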

## Exercise:
1. Have a scan of the re module: https://docs.python.org/3/library/re.html
1. Create a function "getcache" to accept the list of cache lines.
 - Values should be stored in a list of dictionaries using the headers as keys.
 - The function should return this list of dictionaries.
 - Note that headers and values are separated by a mixture of `,+!@`
 - There is also a corrupt line which should be ignored.
 
The goal is to return:
 
```
[{'Module#': '0',
  'Cache Location': 'CACHE-1CA',
  'CM DIMM Size(GB)': '32',
  'Cache Size(GB)': '256',
  'SM Size(GB)': '18',
  'Cache Residency Size(MB)': '0',
  'CFM Size(GB)': '400'},
 {'Module#': '0',
  'Cache Location': 'CACHE-1CB',
  'CM DIMM Size(GB)': '32',
  'Cache Size(GB)': '160',
  'SM Size(GB)': '',
  'Cache Residency Size(MB)': '',
  'CFM Size(GB)': '400'},
 {'Module#': '1',...
    
```
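
As a hint rather than a solution, a character class lets `re.split` break a line on any one of several delimiters. The sketch below uses made-up data with different delimiters (`;` and `|`), so the names and values are purely hypothetical.

```python
import re

# Hypothetical header and data lines, not the cache data from the exercise.
header_line = 'name;size|owner'
data_line = 'disk0|100;root'

delimiters = re.compile(r'[;|]')     # inside [...] these characters are literal

headers = delimiters.split(header_line)
fields = delimiters.split(data_line)
record = dict(zip(headers, fields))  # pair each header with its value
print(record)
```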
 

In [None]:
 # It might be easier to copy paste this into https://pythex.org/ to test your regex:
cache = '''<<Cache>>
Module#,Cache Location,CM DIMM Size(GB)+Cache Size(GB)!SM Size(GB)@Cache Residency Size(MB),CFM Size(GB)
0,CACHE-1CA,32,256,18,0,400
0+CACHE-1CB+32+160+++400
325$£%£$%"£%"£%%sfsdfsdf,sdfdsf£$234324234sdfdsfsd <= this is a data corruption
1,CACHE-1CC,,,,,
1,CACHE-1CD,,,,,
0!CACHE-2CA!32!256!18!0!400
0@CACHE-2CB@32@160@@@400
1,"CACHE-2CC",,,,,
1,"CACHE-2CD",,,,,'''.split('\n')

In [None]:
# %load answers/getcache.py

## Exercise:
With what you have learned so far, modify your tool program:
1. Read in a file with the format of "poolinfo.txt" (i.e. passed on the command line)
2. Make a dictionary of dictionaries, one for each Pool summary.
 - Key on the pool id.
 - Second level dictionary has key/value pairs for each item, e.g. `{'Type': 'DP(Multi-Tier)', 'Status':'Normal'}`.

```
    DP   <=== Note the sections: DP is followed by multiple DP pools, and TI by Thin Image pools.
    DP(Multi-Tier)
      POOL-ID : 0x0001(  1)   <=== KEY
        Status                              : Normal
        Total_Size                          : 232305528[MB]
        Used_Size                           : 166673850[MB]
        Reserved_Size                       : 0[MB]
        Free_+_Reserved_Size(Formatted)     : 65631678[MB](22413426[MB])
        Physical_Size                       : 0[MB]
```
        
3. Add two items to the dictionary to contain the POOL-VOL table.
 - Add a `poolvolheader` item for a list of headings.
 - Add a `poolvols` item to hold a list of lists containing the values.
 
 ```
    POOL-VOL(136VOL(s))
       Pool-Vol   Index   Tier   Volume_Size[MB]   Used_Size[MB]   Used_Rate[%]   Expanded_Space_Used   
      ----------+-------+------+-----------------+---------------+--------------+---------------------+-
       0xF000     0       1      2092944           1109262         53.0           Disabled              
       0xF001     1       1      2097144           1111488         53.0           Disabled              
       0xF002     2       1      2097144           1111488         53.0           Disabled              
       0xF003     3       1      2097144           1111488         53.0           Disabled                        
```


4. Use the dictionary to print out statistics, e.g. line counts, value counts, max, min, total etc.

#### Feel free to write & debug in Jupyter or VS Code, but attempt to make a standalone CLI program.
#### I would suggest at least two functions: one or more producing the dictionary etc., and another to produce the statistics.
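
As a starting hint for step 2, here is one possible way to pull a name and value out of the `name : value` lines with named groups. The pattern `item_pat` and the three sample lines are assumptions made for illustration; the real file may need a different pattern.

```python
import re

# Hypothetical helper pattern: a name, a colon, then the rest of the line.
item_pat = re.compile(r'^\s*(?P<name>\S+)\s*:\s*(?P<value>.*?)\s*$')

# Sample lines copied from the excerpt above.
sample = [
    '      POOL-ID : 0x0001(  1)',
    '        Status                              : Normal',
    '        Total_Size                          : 232305528[MB]',
]

parsed = {}
for line in sample:
    m = item_pat.match(line)
    if m:                                  # skip lines that are not "name : value"
        parsed[m.group('name')] = m.group('value')

print(parsed)
```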