## Before continuing, please select menu option:  **Cell => All output => clear**

# Open repo PythonTutButty03 and go through the CLI examples

 - CLIargs1.py - Parameter argv in the sys module.
 - CLIargs2.py - Program structure separating CLI parsing.
 - CLIargs3.py - Separating options and arguments.
 - CLIskeleton.py - Uses argparse, a helpful Python standard library module.

Take a look at clickskeleton.py

# Files & I/O

In [None]:
import os
os.path.exists('testdata/')

### There are many pathname manipulation functions:
 - https://docs.python.org/3/library/os.path.html
  - Common ones are abspath, basename, dirname, exists, join, split, splitext
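
A minimal sketch of the common functions listed above. The path `testdata/AllConf.csv` is treated purely as a string here, so the file does not need to exist for this to run:

```python
import os

# Build a path from components; the separator depends on the platform
path = os.path.join('testdata', 'AllConf.csv')

directory = os.path.dirname(path)        # 'testdata'
filename = os.path.basename(path)        # 'AllConf.csv'
head, tail = os.path.split(path)         # the same pair as (dirname, basename)
stem, ext = os.path.splitext(filename)   # ('AllConf', '.csv')

print(directory, filename, stem, ext)
print(os.path.isabs(os.path.abspath(path)))   # abspath always returns an absolute path
```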

In [None]:
dir(os.path)

In [None]:
entries = os.listdir('testdata/') # older approach: builds the full list of names up front (costly if the directory is huge)
entries

In [None]:
entries = os.scandir('testdata') # preferred over listdir: returns a lazy iterator of DirEntry objects
entries

In [None]:
for i in os.scandir('testdata'):
    print(i.name)

In [None]:
from pathlib import Path
entries = Path('testdata')   # This is now the preferred method with most functionality
for entry in entries.iterdir():
    print(entry.name)

In [None]:
# Unix shells expand name patterns containing wildcards like ? and * into a list of files. This is called globbing
import glob
glob.glob('testdata/*.txt')   # forward slashes also work on Windows and avoid the invalid '\*' escape

In [None]:
p = Path('.')
list(p.glob('testdata/*.csv'))
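
The two globbing styles can be compared side by side. This is only a sketch: it builds a throwaway directory with hypothetical file names so that it runs without the `testdata` folder:

```python
import glob
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'sub').mkdir()
    for name in ('a.txt', 'b.txt', 'c.csv', 'sub/d.txt'):
        (root / name).write_text('demo')

    # glob.glob takes a string pattern and returns a list of strings
    top_level = sorted(Path(p).name for p in glob.glob(str(root / '*.txt')))

    # Path.glob yields Path objects; '**' also descends into subdirectories
    recursive = sorted(p.name for p in root.glob('**/*.txt'))

print(top_level)   # ['a.txt', 'b.txt']
print(recursive)   # ['a.txt', 'b.txt', 'd.txt']
```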

In [None]:
# Walking a directory tree and counting the directories visited
count = 0
for dirpath, dirnames, files in os.walk('.'):
    count += 1
count
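
To see the names rather than just a count, each `(dirpath, dirnames, files)` tuple can be unpacked further. The sketch below assumes a small temporary tree with made-up file names so the result is predictable:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'logs', 'old'))
    samples = ('readme.txt',
               os.path.join('logs', 'today.log'),
               os.path.join('logs', 'old', 'last.log'))
    for sample in samples:
        with open(os.path.join(tmp, sample), 'w') as fd:
            fd.write('demo')

    found = []
    for dirpath, dirnames, files in os.walk(tmp):
        dirnames.sort()   # sorting in place fixes the order in which subdirectories are visited
        for name in sorted(files):
            rel = os.path.relpath(os.path.join(dirpath, name), tmp)
            found.append(rel.replace(os.sep, '/'))

print(found)   # ['readme.txt', 'logs/today.log', 'logs/old/last.log']
```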

In [None]:
# Using a temporary file
from tempfile import TemporaryFile
fp = TemporaryFile('w+t')
fp.write('Hello universe!')
fp.name

In [None]:
fp.close()
os.remove(fp.name)  # fails because the file was automatically removed upon closing
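
A temporary file is usually managed with a `with` block, so closing (and therefore deletion) happens automatically. A minimal sketch:

```python
from tempfile import TemporaryFile

with TemporaryFile('w+t') as fp:
    fp.write('Hello universe!')
    fp.seek(0)             # rewind before reading back what was written
    content = fp.read()

print(content)     # Hello universe!
print(fp.closed)   # True: the file was closed and removed on leaving the block
```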

In [None]:
import zipfile
with zipfile.ZipFile('testdata/allconf.zip', 'r') as zipobj:
    for i in zipobj.namelist():
        print(i)


## Reading & Writing files

<br>Modes are:
- 'r'	open for reading (default)
- 'w'	open for writing, truncating the file first
- 'x'	open for exclusive creation, failing if the file already exists
- 'a'	open for writing, appending to the end of the file if it exists
- 'b'	binary mode
- 't'	text mode (default)
- '+'	open a disk file for updating (reading and writing)
- 'U'	universal newlines mode (deprecated; removed in Python 3.11)
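
The difference between `'x'`, `'a'` and `'w'` is easiest to see by running them against the same file. A small sketch using a temporary directory and a hypothetical file name:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'modes.txt')

    with open(path, 'x') as fd:       # 'x' creates the file; it must not exist yet
        fd.write('first\n')

    try:
        open(path, 'x')               # a second exclusive create fails
    except FileExistsError:
        exclusive_failed = True
    else:
        exclusive_failed = False

    with open(path, 'a') as fd:       # 'a' keeps the existing content and appends
        fd.write('second\n')
    with open(path) as fd:
        after_append = fd.read()

    with open(path, 'w') as fd:       # 'w' truncates before writing
        fd.write('third\n')
    with open(path) as fd:
        after_write = fd.read()

print(exclusive_failed)      # True
print(repr(after_append))    # 'first\nsecond\n'
print(repr(after_write))     # 'third\n'
```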

Difference between binary/text:
- Files opened in binary mode return contents as bytes objects without Unicode decoding.
- In text mode the contents of the file are returned as str, after the bytes are decoded using a platform-dependent or specified encoding.
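
As an illustration, the same file can be read back in both modes. This sketch writes a short UTF-8 string to a temporary file (the file name is made up):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cafe.txt')
    with open(path, 'w', encoding='utf-8') as fd:
        fd.write('café')

    with open(path, 'rb') as fd:                  # binary: raw bytes, no decoding
        raw = fd.read()
    with open(path, 'r', encoding='utf-8') as fd: # text: bytes decoded to str
        text = fd.read()

print(type(raw), raw)         # <class 'bytes'> b'caf\xc3\xa9'
print(type(text), text)       # <class 'str'> café
print(len(raw), len(text))    # 5 4 (the accented letter takes two bytes in UTF-8)
```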

#### File methods:
 - `.read(size=-1)` Reads the entire file, or up to *size* characters (text mode) or bytes (binary mode).
 - `.readline(size=-1)` Reads the next line or up to *size* characters from the next line.
 - `.readlines()` Reads the remaining lines from the file as a list (including '\n').
 - `.write(string or bytes)` Writes to the file.
 - `.writelines(seq)` Writes the sequence to the file (note that line endings '\n' are not appended).
 
 Note that the `print` function also accepts a `file` argument, which can be an open file object:
     ```
     print(*args, file=sys.stdout)
     ```
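
The file methods and the `file=` argument of `print` can be tried together. A minimal sketch, assuming a throwaway file in a temporary directory:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'methods.txt')

    with open(path, 'w') as fd:
        fd.writelines(['alpha\n', 'beta\n'])   # newlines must be supplied by the caller
        print('gamma', 'delta', file=fd)       # print adds the trailing newline itself

    with open(path) as fd:
        first = fd.readline()    # just the first line
        rest = fd.readlines()    # everything that is left, as a list

print(repr(first))   # 'alpha\n'
print(rest)          # ['beta\n', 'gamma delta\n']
```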
 

In [None]:
myfn = 'mytest.txt'
fd = open(myfn, 'wb') # 'wb' opens for binary writing; like 'w', it truncates the file first

In [None]:
!dir my*

In [None]:
# Get the file mode used
print(fd.mode)

In [None]:
# Get the file's name
print(fd.name)

In [None]:
# Write text to a file with a newline
fd.write(bytes("Write me to the file\n", 'UTF-8'))

In [None]:
# Close the file
fd.close()

In [None]:
!type mytest.txt

In [None]:
# Opens a file for reading and writing
fd = open(myfn, "rb+")  # the + indicates update mode, you can also write to it (no truncation)
# Read text from the file
text = fd.read()
print(type(text))
print(text)

In [None]:
# The earlier handle was never closed; CPython closes it once fd is rebound, but an explicit fd.close() is safer:
fd = open(myfn, "r+")  # Note by default this is opened as text
fd.seek(6)
fd.write('XX')
fd.seek(0)
# Read text from the file
text = fd.read()
print(type(text))
print(text)

In [None]:
# Close the file
fd.close()

In [None]:
# using the with statement context manager:
with open(myfn, "r+") as fd:
    # Read text from the file
    text = fd.read()
    print(type(text))
    print(text)
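# The file is closed automatically when the with block exits,
# so the next call raises ValueError (I/O operation on closed file):
fd.seek(0)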

In [None]:
import os
# Delete the file
os.remove(myfn)

In [None]:
!dir mytest.txt

In [None]:
# Looking to see if the demo files are available with a Jupyter special execute prefix (!):
!dir testdata

In [None]:
filename = 'SSHOW_SYS.txt'
filename = 'AllConf.csv'
pathname = os.path.join('testdata', filename)
if os.path.exists(pathname): print('Yes') 

In [None]:
if os.path.exists(pathname):
    count = 0
    with open(pathname) as fd:
        line = fd.readline()
        while line != '':  # EOF is signalled by an empty string
            line=line.strip()
            print(line)
            count += 1
            if count > 3: break
            line = fd.readline()

In [None]:
if os.path.exists(pathname):
    count = 0
    with open(pathname) as fd:
        for line in fd.readlines():   # this will return a full list of the entire file
            line=line.strip()   # Comment this line and see what happens
            print(line)
            count += 1
            if count > 3: break

In [None]:
# This final approach is more Pythonic and can be quicker and more memory efficient. 
# Therefore, it is suggested you use this instead.
if os.path.exists(pathname):
    with open(pathname) as fd:
        for i, line in enumerate(fd):   # Notice how iterating over a file object is the same as repeatedly issuing .readline()
            line = line.strip()
            print(f'[{i}] {line}')
            if i > 3: break  

In [None]:
# An example of opening a zip file and directly reading the internal file 
import zipfile
with zipfile.ZipFile(r'testdata\allconf.zip', 'r') as zipobj:
    text = zipobj.read('AllConf.csv').decode()   # A zipfile object is opened as binary so may need decode to string 

for i, line in enumerate(text.split('\n')):
    print(line)
    if i > 5:
        break
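
For a large member, reading the whole file into one string may not be ideal. A possible alternative, sketched here with an in-memory archive and a made-up member name `demo.csv`, is to open the member and iterate over it line by line:

```python
import io
import zipfile

# Build a small archive in memory so the example needs no files on disk
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zipobj:
    zipobj.writestr('demo.csv', 'id,name\n1,alpha\n2,beta\n')

rows = []
with zipfile.ZipFile(buffer) as zipobj:
    with zipobj.open('demo.csv') as member:   # binary file-like object
        # TextIOWrapper decodes the bytes so the member can be read as text
        for line in io.TextIOWrapper(member, encoding='utf-8'):
            rows.append(line.rstrip('\n').split(','))

print(rows)   # [['id', 'name'], ['1', 'alpha'], ['2', 'beta']]
```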

### Working with two files at once:
There are times when you may want to read a file and write to another file at the same time. Here is an example:

``` python
d_path = 'dog_breeds.txt'
d_r_path = 'dog_breeds_reversed.txt'
with open(d_path, 'r') as reader, open(d_r_path, 'w') as writer:
    dog_breeds = reader.readlines()
    writer.writelines(reversed(dog_breeds))
```

### Don't reinvent the snake:
Additionally, the standard library includes modules that can help you:

* wave: read and write WAV files (audio)
* aifc: read and write AIFF and AIFC files (audio; removed in Python 3.13)
* sunau: read and write Sun AU files (removed in Python 3.13)
* tarfile: read and write tar archive files
* zipfile: work with ZIP archives
* configparser: easily create and parse configuration files
* xml.etree.ElementTree: create or read XML-based files
* msilib: read and write Microsoft Installer files (removed in Python 3.13)
* plistlib: generate and parse macOS .plist files

There are plenty more out there, and even more third-party tools are available on PyPI. Some popular ones are:

* PyPDF2: PDF toolkit
* xlwings: read and write Excel files
* xlsxwriter: write Excel files
* Pillow: image reading and manipulation

## Exercise:
1. In VS Code, create a `mypackage` folder under the `pytut` directory (which you should have created previously).
1. Copy all the `*.txt` and `*.csv` files from PytutButty01 into this new directory.
1. Copy the `CLIskeleton.py` program (under PytutButty03) into this folder and rename it `mygrep.py`.
1. Write a CLI program that accepts a filename and a text string:

```
mygrep.py [options] <filename> <text string>
-c = case sensitive
```

 - It needs to output any lines containing the text string.
 - By default supplied text searching should be case insensitive.



## Exercise (Part2):
1. Copy the `CLIskeleton.py` program (under PytutButty03) into the `mypackage` folder and rename it `<name>tool.py`.
1. You can play around and test in the notebook, but the objective is to do the following as a standalone CLI program, so edit and modify the skeleton in VS Code:

- Write some code to process either the `SSHOW_SYS.txt` or `Allconf.csv` or `poolinfo.txt` and print out your favorite section.
    - `SSHOW_SYS.txt` example: text between `???/switchshow` to blank line.
    - `Allconf.csv` example: text between `<<System Option Information>>` to next `<<?>>` section.
    - `poolinfo.txt` example: text between `POOL-ID` and a blank line.

### As a hint:
 - After the argparse processing call a new function `main(args.filename)`.
 - Create a new main function to process and output the file.
 - It can be a good idea to logically put in an output limit while initially developing (e.g. maximum 20 lines). 
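
The shape suggested by the hints might look roughly like the sketch below. It is not the contents of `CLIskeleton.py`: the `parse_args` helper, the `limit` parameter and the sample file are assumptions made for illustration, and the processing inside `main` is only a placeholder.

```python
import argparse
import os
import tempfile

def main(filename, limit=20):
    """Return at most `limit` stripped lines (placeholder for the real processing)."""
    lines = []
    with open(filename) as fd:
        for i, line in enumerate(fd):
            if i >= limit:        # output limit while developing
                break
            lines.append(line.rstrip())
    return lines

def parse_args(argv=None):
    # argv=None makes argparse read sys.argv[1:], as a real CLI would
    parser = argparse.ArgumentParser(description='Print part of a file')
    parser.add_argument('filename', help='file to process')
    return parser.parse_args(argv)

# Demo call with an explicit argument list instead of the real command line
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'sample.txt')
    with open(sample, 'w') as fd:
        fd.write('one\ntwo\nthree\n')
    args = parse_args([sample])
    output = main(args.filename, limit=2)

print(output)   # ['one', 'two']
```

In a standalone script the demo section would typically be replaced by an `if __name__ == '__main__':` guard that calls `main(parse_args().filename)`.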

# Regex

A (Very Brief) History of Regular Expressions

In 1951, mathematician Stephen Cole Kleene described the concept of a regular language, a language that is recognizable by a finite automaton and formally expressible using regular expressions. In the mid-1960s, computer science pioneer Ken Thompson, one of the original designers of Unix, implemented pattern matching in the QED text editor using Kleene’s notation.

Since then, regexes have appeared in many programming languages, editors, and other tools as a means of determining whether a string matches a specified pattern. Python, Java, and Perl all support regex functionality, as do most Unix tools and many text editors.

## Exercise:

Play around on https://pythex.org/; copy and paste the test string below:

    The "quick" brown fox (may have jumped), [or not] over the lazy dog"

Hint: See the cheat sheet on the Pythex page.



* Try searching for the letter "o".
* Try searching for the word "fox".
* Search for all whitespace
* Search for all non-whitespace
* Try searching for the word "fox" followed by "dog".
* Try searching for a word with letter "o" or "a" in it.
* Make groups of any word with an "o" in it.
* Group words within []

In [None]:
# strings support some matching and searching functionality:
s = 'foo123bar'
'123' in s

In [None]:
s = 'foo123bar'
s.find('123')
# or
s.index('123')

### Rather than searching for a fixed substring like '123', suppose you wanted to determine whether a string contains any three consecutive decimal digit characters, as in the strings 'foo123bar', 'foo456bar', '234baz', and 'qux678'.

In [None]:
# import the "re" module from the python standard library:
import re
# dir(re) if you wish...

In [None]:
s = 'foo123bar'
re.search(r'\d\d\d', s)   # produces a match object looking for 3 x digits
# The span of the match object is the same as the slice in the string

In [None]:
if re.search('[0-9]{3}', s):   # a match object is truthy
    print('Found a match')
else:
    print('No match')

In [None]:
if re.match(r'\d\d\d', s):   # match() only succeeds at the start of the string, so 'foo123bar' gives no match
    print('Found a match')
else:
    print('No match')

Notice the `r` prefix above, which denotes a "raw" string; using it for regex patterns is a common convention that avoids accidental interpretation of backslashes.
Unless an 'r' or 'R' prefix is present, escape sequences in string and bytes literals are interpreted according to rules similar to those used by Standard C. The recognized escape sequences are:
```
\newline = Backslash and newline ignored
\\ = Backslash (\)
\' = Single quote (')
\" = Double quote (")
\a = ASCII Bell (BEL)
\b = ASCII Backspace (BS)
\f = ASCII Formfeed (FF)
\n = ASCII Linefeed (LF)
\r = ASCII Carriage Return (CR)
\t = ASCII Horizontal Tab (TAB)
\v = ASCII Vertical Tab (VT)
\ooo = Character with octal value ooo
\xhh = Character with hex value hh
```

Escape sequences only recognized in string literals are:
```
\N{name} = Character named name in the Unicode database
\uxxxx = Character with 16-bit hex value xxxx
\Uxxxxxxxx = Character with 32-bit hex value xxxxxxxx
```

For example if looking for the newline character:
```
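import re
s = 'first line\nsecond line'
m = re.search(chr(92)+chr(110), s)   # backslash + 'n' built from character codes
m = re.search("\\n", s)              # escaped backslash in a normal string
m = re.search(r"\n", s)              # raw string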
```
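
A quick check of what the different spellings actually contain can make the escape rules above more concrete. This is only a sketch and the sample text is made up:

```python
import re

plain = '\n'       # one character: the newline itself
escaped = '\\n'    # two characters: a backslash followed by 'n'
raw = r'\n'        # the same two characters, written as a raw string

print(len(plain), len(escaped), len(raw))   # 1 2 2
print(escaped == raw)                       # True

# The regex engine interprets the two-character form as "match a newline",
# so both spellings find the newline in the sample text
found_raw = re.search(raw, 'one\ntwo') is not None
found_plain = re.search(plain, 'one\ntwo') is not None
print(found_raw, found_plain)               # True True
```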

In [None]:
# We can use parentheses to pull subgroups out of the string:
s = 'foo123bar  foo bar'
m = re.search(r'(\d\d\d).*\s+(\w+)\s+', s)
if m:
    print(m.groups())
    print(m.group(0)) # The entire match
    print(m.group(1)) # The first parenthesized subgroup
    print(m.group(2)) # The second parenthesized subgroup
    print(m.group(0,1,2)) # Multiple args give a tuple

### Question: What if we wanted the second subgroup above to pull out any characters not just alphanumeric (\w)? 

In [None]:
sshow = '''Index Slot Port Address Media  Speed        State    Proto
============================================================
   0    3    0   720000   id    N32	  No_Light    FC  Disabled (Persistent) 
   1    3    1   720100   id    N32	  No_Light    FC  Disabled (Persistent) 
   2    3    2   720200   id    8G 	  Online      FC  F-Port  50:00:09:75:a8:17:50:1f 
   3    3    3   720300   id    8G 	  Online      FC  F-Port  50:00:09:75:a8:17:50:9f 
   4    3    4   720400   id    N32	  No_Light    FC  Disabled (Persistent) 
   5    3    5   720500   id    N32	  No_Light    FC  Disabled (Persistent) 
   8    3    8   720800   id    N32	  No_Light    FC  Disabled (Persistent) 
   9    3    9   720900   id    N32	  No_Light    FC  Disabled (Persistent) 
  10    3   10   720a00   id    N32	  No_Light    FC  Disabled (Persistent) 
  11    3   11   720b00   id    N32	  No_Light    FC  Disabled (Persistent) 
  12    3   12   720c00   id    N32	  No_Light    FC  Disabled (Persistent) 
  13    3   13   720d00   id    N32	  No_Light    FC  Disabled (Persistent) 
  14    3   14   720e00   id    8G 	  Online      FC  F-Port  50:06:0e:80:12:3b:2c:13 
  15    3   15   720f00   id    8G 	  Online      FC  LS E-Port  10:00:88:94:71:25:4b:43 "sgsindcw03sanr02" 
  16    3   16   721000   id    8G 	  Online      FC  F-Port  50:06:0e:80:07:c8:62:33 
  17    3   17   721100   id    8G 	  Online      FC  F-Port  50:06:0e:80:07:c8:62:bb 
  18    3   18   721200   id    8G 	  Online      FC  F-Port  50:06:0e:80:07:29:83:13 
  19    3   19   721300   id    8G 	  Online      FC  F-Port  50:06:0e:80:07:2a:84:13 
  20    3   20   721400   id    8G 	  Online      FC  F-Port  50:06:0e:80:07:2a:fe:13 
'''.split('\n')

### For complex matches or checks repeated in a loop, it is recommended to `compile` the pattern into a regular expression pattern object first, for speed.
 - https://docs.python.org/3/library/re.html#regular-expression-objects
 - https://docs.python.org/3/library/re.html#match-objects

In [None]:
wwnpat = re.compile(r'([0-9a-f][0-9a-f]([\s:-]?[0-9a-f][0-9a-f]){7})', re.IGNORECASE)  # find wwn
# all the re functions are also defined as pattern methods:
for i, line in enumerate(sshow, 1):
    m = wwnpat.search(line)
    if m:
        print(m.groups())
        print(f'line{i}:', m.group(0))

In [None]:
wwnpat = re.compile(r'([0-9a-f][0-9a-f]([\s:-]?[0-9a-f][0-9a-f]){7})', re.IGNORECASE)  # find wwn
# all the re functions are also defined as pattern methods:
for i, line in enumerate(sshow, 1):
    m = wwnpat.search(line)
    if m:
        print(re.split(r'\s+', line))    # <== here's an example of splitting a string using regular expressions
        print(f'line{i}:', m.group(0))

## Exercise:
1. Have a scan of the re module: https://docs.python.org/3/library/re.html
1. Create a function "getcache" to accept the list of cache lines.
 - Values should be stored in a list of dictionaries using the headers as keys.
 - The function should return this list of dictionaries.
 - Note that headers and values are separated by a mixture of `,+!@`
 - There is also a corrupt line which should be ignored.
 
Goal is to return:
 
```
[{'Module#': '0',
  'Cache Location': 'CACHE-1CA',
  'CM DIMM Size(GB)': '32',
  'Cache Size(GB)': '256',
  'SM Size(GB)': '18',
  'Cache Residency Size(MB)': '0',
  'CFM Size(GB)': '400'},
 {'Module#': '0',
  'Cache Location': 'CACHE-1CB',
  'CM DIMM Size(GB)': '32',
  'Cache Size(GB)': '160',
  'SM Size(GB)': '',
  'Cache Residency Size(MB)': '',
  'CFM Size(GB)': '400'},
 {'Module#': '1',...
    
```
 

In [None]:
# It might be easier to copy and paste this into https://pythex.org/ to test your regex:
cache = '''<<Cache>>
Module#,Cache Location,CM DIMM Size(GB)+Cache Size(GB)!SM Size(GB)@Cache Residency Size(MB),CFM Size(GB)
0,CACHE-1CA,32,256,18,0,400
0+CACHE-1CB+32+160+++400
325$£%£$%"£%"£%%sfsdfsdf,sdfdsf£$234324234sdfdsfsd <= this is a data corruption
1,CACHE-1CC,,,,,
1,CACHE-1CD,,,,,
0!CACHE-2CA!32!256!18!0!400
0@CACHE-2CB@32@160@@@400
1,"CACHE-2CC",,,,,
1,"CACHE-2CD",,,,,'''.split('\n')

In [None]:
# %load answers/getcache.py

## Exercise:
With what you have learned so far modify your tool program:
1. Read in a file with the format of "poolinfo.txt" (i.e. passed on the command line)
2. Make a dictionary of dictionaries for each Pool summary.
 - Key on the pool id.
 - Second level dictionary has key/value pairs for each item, e.g. `{'Type': 'DP(Multi-Tier)', 'Status':'Normal'}`.

```
    DP   <=== Note that each section: DP followed by multiple DP pools and TI by Thin Image pools.
    DP(Multi-Tier)
      POOL-ID : 0x0001(  1)   <=== KEY
        Status                              : Normal
        Total_Size                          : 232305528[MB]
        Used_Size                           : 166673850[MB]
        Reserved_Size                       : 0[MB]
        Free_+_Reserved_Size(Formatted)     : 65631678[MB](22413426[MB])
        Physical_Size                       : 0[MB]
```
        
3. Add two items to the dictionary to contain the POOL-VOL table.
 - Add a `poolvolheader` item for a list of headings.
 - Add a `poolvols` item to hold a list of lists containing the values.
 
 ```
    POOL-VOL(136VOL(s))
       Pool-Vol   Index   Tier   Volume_Size[MB]   Used_Size[MB]   Used_Rate[%]   Expanded_Space_Used   
      ----------+-------+------+-----------------+---------------+--------------+---------------------+-
       0xF000     0       1      2092944           1109262         53.0           Disabled              
       0xF001     1       1      2097144           1111488         53.0           Disabled              
       0xF002     2       1      2097144           1111488         53.0           Disabled              
       0xF003     3       1      2097144           1111488         53.0           Disabled                        
```


4. Use the dictionary to print out statistics, e.g. line counts, value counts, max, min, total etc.

#### Feel free to write and debug in Jupyter or VS Code, but attempt to make a standalone CLI program.
#### I would suggest at least two functions: one or more producing the dictionary etc., and another to produce the statistics.

## Generators are great for processing files & pipelines

*This example uses generator comprehensions but a more complete solution would likely use generator functions.*

Imagine a large dataset:

> permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round<br>
> digg,Digg,60,web,San Francisco,CA,1-Dec-06,8500000,USD,b<br>
> digg,Digg,60,web,San Francisco,CA,1-Oct-05,2800000,USD,a<br>
> facebook,Facebook,450,web,Palo Alto,CA,1-Sep-04,500000,USD,angel<br>
> facebook,Facebook,450,web,Palo Alto,CA,1-May-05,12700000,USD,a<br>
> photobucket,Photobucket,60,web,Palo Alto,CA,1-Mar-05,3000000,USD,a<br>
> ...

Strategy:

1. Read every line of the file.
2. Split each line into a list of values.
3. Extract the column names.
4. Use the column names and lists to create a dictionary.
5. Filter out the rounds you aren’t interested in.
6. Calculate the total and average values for the rounds you are interested in.

In [None]:
# Is the sample available:
!dir techcrunch.csv

In [None]:
!type techcrunch.csv

In [None]:
# Read in the file:
file_name = "techcrunch.csv"
lines = (line for line in open(file_name))
lines

In [None]:
# Split each line into values:
list_line = (s.rstrip().split(",") for s in lines)
list_line

In [None]:
# Get just the header row:
cols = next(list_line)
cols

In [None]:
# Convert data into a dictionary:
company_dicts = (dict(zip(cols, data)) for data in list_line)
company_dicts

In [None]:
# Filter out the rounds you are not interested in:
funding = (
    int(company_dict["raisedAmt"])
    for company_dict in company_dicts
    if company_dict["round"].upper() == "A"
)
funding

In [None]:
# Calculate the total:
total_series_a = sum(funding)
print(f"Total series A fundraising: ${total_series_a}")
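
The same pipeline can also be sketched with generator functions instead of generator comprehensions. The function names below are made up for illustration, and the rows shown earlier are embedded as a string so the example runs without `techcrunch.csv`:

```python
import io

SAMPLE = """permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round
digg,Digg,60,web,San Francisco,CA,1-Dec-06,8500000,USD,b
digg,Digg,60,web,San Francisco,CA,1-Oct-05,2800000,USD,a
facebook,Facebook,450,web,Palo Alto,CA,1-Sep-04,500000,USD,angel
facebook,Facebook,450,web,Palo Alto,CA,1-May-05,12700000,USD,a
photobucket,Photobucket,60,web,Palo Alto,CA,1-Mar-05,3000000,USD,a
"""

def read_rows(fd):
    """Yield each line of an open file as a list of values."""
    for line in fd:
        yield line.rstrip().split(',')

def to_dicts(rows):
    """Use the first row as column names and yield one dict per data row."""
    cols = next(rows)
    for row in rows:
        yield dict(zip(cols, row))

def series_a_amounts(records):
    """Yield raisedAmt as an int for round 'a' records only."""
    for record in records:
        if record['round'].upper() == 'A':
            yield int(record['raisedAmt'])

# The with block closes the (in-memory) file once the pipeline has been consumed
with io.StringIO(SAMPLE) as fd:
    total_series_a = sum(series_a_amounts(to_dicts(read_rows(fd))))

print(f'Total series A fundraising: ${total_series_a}')   # Total series A fundraising: $18500000
```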

## Exercise:
 1. When does the code to read the data lines from the file get executed above?
 2. Modify the code above to calculate the average of the filtered rounds.