### Files
The open() function opens a file and returns a file handle that can be used to read or write the file in the usual way. The code f = open('name', 'r') opens the file into the variable f, ready for reading operations; call f.close() when finished. Instead of 'r', use 'w' for writing and 'a' for appending.

The standard for-loop works for text files, iterating through the lines of the file (this works only for text files, not binary files). The for-loop technique is a simple and efficient way to look at all the lines in a text file:

In [None]:
# Echo the contents of a file
f = open('../data/foo.txt', 'r')
for line in f:   ## iterates over the lines of the file
    print(line, end='')   ## end='' so print does not add an end-of-line char
                          ## since 'line' already includes the end-of-line.
f.close()
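
A common alternative to calling f.close() by hand is the with statement, which closes the file automatically when the block ends. The sketch below is only an illustration: it writes a small scratch file first (the temporary path is an assumption, not one of the files used in this notebook) so that it can run anywhere.

```python
import os
import tempfile

# Hypothetical scratch file so the sketch runs anywhere.
path = os.path.join(tempfile.mkdtemp(), 'foo.txt')
with open(path, 'w') as f:
    f.write('first line\nsecond line\n')

# The file is closed automatically when the with-block exits,
# even if the loop body raises an exception.
lines_seen = []
with open(path, 'r') as f:
    for line in f:
        lines_seen.append(line)
        print(line, end='')   # 'line' already ends with '\n'

print(f.closed)   # True
```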

Reading one line at a time has the nice quality that not all the file needs to fit in memory at one time -- handy if you want to look at every line in a 10 gigabyte file without using 10 gigabytes of memory. The f.readlines() method reads the whole file into memory and returns its contents as a list of its lines. The f.read() method reads the whole file into a single string, which can be a handy way to deal with the text all at once, such as with regular expressions we'll see later.
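
To make the difference between the two whole-file methods concrete, here is a minimal sketch. The scratch file and its contents are made up for the illustration.

```python
import os
import tempfile

# Hypothetical scratch file with three short lines.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w') as f:
    f.write('alpha\nbeta\ngamma\n')

with open(path) as f:
    lines = f.readlines()   # list of lines, each keeping its '\n'

with open(path) as f:
    text = f.read()         # one string holding the whole file

print(lines)       # ['alpha\n', 'beta\n', 'gamma\n']
print(len(text))   # 17
```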

For writing, the f.write(string) method is the easiest way to write data to an open output file. You can also use print with an open file through its optional file= argument: "print(string, file=f)". (Python 2 used the awkward statement syntax "print >> f, string", which no longer works in Python 3.)

In [None]:
# Write a list of lines to a file
ll = ["Copy that Divergence\n",
    "Ready for Countdown\n"]
f = open('../data/foo_w.txt', 'w')

for line in ll:   ## iterates over the strings in the list
    f.write(line)    ## write() adds no end-of-line char of its own;
                     ## each string here already ends with '\n'.
f.close()
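
For comparison, here is a small sketch of write(), print() with file=, and append mode side by side. The scratch path is hypothetical and only there to keep the example self-contained.

```python
import os
import tempfile

# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), 'foo_w.txt')

with open(path, 'w') as f:
    f.write('written with write()\n')        # write() adds no newline for you
    print('written with print()', file=f)    # print() appends '\n' itself

with open(path, 'a') as f:                   # 'a' keeps the existing contents
    print('appended later', file=f)

with open(path) as f:
    contents = f.read()
print(contents, end='')
```

Opening the same path with 'w' again would truncate the file, which is the main reason to reach for 'a' when adding to existing output.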

### Python Regular Expressions
Regular expressions are a powerful language for matching text patterns. This page gives a basic introduction to regular expressions themselves sufficient for our Python exercises and shows how regular expressions work in Python. The Python "re" module provides regular expression support.

#### In Python a regular expression search is typically written as:
match = re.search(pat, str)

The re.search() method takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, search() returns a match object; otherwise it returns None. Therefore, the search is usually immediately followed by an if-statement to test if the search succeeded, as shown in the following example which searches for the pattern 'word:' followed by a 3 letter word (details below):

In [None]:
import re

str = 'an example word:cat!!'

match = re.search(r'word:\w\w\w', str)
# If-statement after search() tests if it succeeded
if match:
  print ('found', match.group()) ## 'found word:cat'
else:
  print ('did not find')

The 'r' at the start of the pattern string designates a python "raw" string which passes through backslashes without change, which is very handy for regular expressions. It is recommended that you always write pattern strings with the 'r' just as a habit.
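
A quick way to see what the 'r' changes is to compare the same text with and without it. This is just an illustrative sketch; the sample string is made up.

```python
import re

print(len('\n'))    # 1: a single newline character
print(len(r'\n'))   # 2: a backslash followed by 'n'

text = 'a cat sat'
# Without r, Python turns '\b' into a backspace character, so the pattern
# looks for backspace + 'cat' + backspace and finds nothing.
no_raw = re.search('\bcat\b', text)
# With r, the backslashes reach the regex engine, which reads \b as a
# word boundary.
raw = re.search(r'\bcat\b', text)
print(no_raw)        # None
print(raw.group())   # cat
```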

#### Basic Patterns
The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

- a, X, 9, < -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below)

- . (a period) -- matches any single character except newline '\n'
- \w -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
- \b -- boundary between word and non-word
- \s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
- \t, \n, \r -- tab, newline, return
- \d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
- ^ = start, $ = end -- match the start or end of the string
- \ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure whether a punctuation character such as '@' has special meaning, you can put a backslash in front of it, \@, to make sure it is treated just as a character.
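
A few of the patterns in this list (\b, \s, ^, $ and the escaping backslash) are easier to absorb with something runnable in front of you. The sample strings below are made up for illustration.

```python
import re

# \b marks a word boundary, so the 'cat' inside 'concatenate' is skipped.
boundary = re.search(r'\bcat\b', 'concatenate the cat')
print(boundary.span())        # (16, 19)

# \s matches one whitespace char, \S one non-whitespace char.
spaced = re.search(r'\S\s\S', 'ab\tcd')
print(repr(spaced.group()))   # 'b\tc'

# ^ and $ anchor the pattern to the start and end of the string.
anchored_ok = re.search(r'^\d\d\d$', '123')
anchored_bad = re.search(r'^\d\d\d$', 'x123')
print(anchored_ok.group())    # 123
print(anchored_bad)           # None

# A backslash removes the special meaning of '.'.
any_char = re.search(r'a.c', 'abc')      # '.' matches any char
literal_bad = re.search(r'a\.c', 'abc')  # needs a literal '.'
literal_ok = re.search(r'a\.c', 'a.c')
print(any_char.group())       # abc
print(literal_bad)            # None
print(literal_ok.group())     # a.c
```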

#### Basic Examples
Joke: what do you call a pig with three eyes? piiig!

The basic rules of regular expression search for a pattern within a string are:

- The search proceeds through the string from start to end, stopping at the first match found
- All of the pattern must be matched, but not all of the string
- If match = re.search(pat, str) is successful, match is not None and in particular match.group() is the matching text

In [None]:
## Search for pattern 'iii' in string 'piiig'.
## All of the pattern must match, but it may appear anywhere.
## On success, match.group() is matched text.
match = re.search(r'iii', 'piiig') # found, match.group() == "iii"
print(match)                       # Returns Match object span=(1, 4), match='iii'

In [None]:
match = re.search(r'igs', 'piiig') # not found, match == None
print(match)

In [None]:
## . = any char but \n
match = re.search(r'..g', 'piiig') # found, match.group() == "iig"
print(match)

In [None]:
## \d = digit char, \w = word char
match = re.search(r'\d\d\d', 'p123g') # found, match.group() == "123"
print(match)

In [None]:
match = re.search(r'\w\w\w', '@@abcd!!') # found, match.group() == "abc"
print(match)

#### Repetition
Things get more interesting when you use + and * to specify repetition in the pattern

- \+ -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
- \* -- 0 or more occurrences of the pattern to its left
- ? -- match 0 or 1 occurrences of the pattern to its left

#### Leftmost & Largest
First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible -- i.e. + and * go as far as possible (the + and * are said to be "greedy").

#### Repetition Examples

In [None]:
## i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig') # found, match.group() == "piii"
print(match)

In [None]:
## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
match = re.search(r'i+', 'piigiiii') # found, match.group() == "ii"
print(match)

In [None]:
## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx') # found, match.group() == "1 2   3"
print(match)

In [None]:
match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx') # found, match.group() == "12  3"
print(match)

In [None]:
match = re.search(r'\d\s*\d\s*\d', 'xx123xx') # found, match.group() == "123"
print(match)

In [None]:
## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar') # not found, match == None
print(match)

In [None]:
## but without the ^ it succeeds:
match = re.search(r'b\w+', 'foobar') # found, match.group() == "bar"
print(match)
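
The examples above exercise + and * and ^, but not the ? quantifier or the $ anchor. The sketch below fills that gap; the sample words are made up.

```python
import re

# ? = zero or one of the thing to its left, so the 'u' is optional.
spellings = [re.search(r'colou?r', word).group()
             for word in ['color', 'colour']]
print(spellings)        # ['color', 'colour']

# ? is greedy too: it takes the optional char when it can.
two_digits = re.search(r'\d\d?', '12345')
print(two_digits.group())   # 12

# $ anchors the match to the end of the string.
last_word = re.search(r'\w+$', 'foo bar')
print(last_word.group())    # bar
```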

#### Emails Example
Suppose you want to find the email address inside the string 'purple alice-b@google.com monkey dishwasher'. We'll use this as a running example to demonstrate more regular expression features. Here's an attempt using the pattern r'\w+@\w+':


In [None]:
str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'\w+@\w+', str)
if match:
    print (match.group())  ## 'b@google'

The search does not get the whole email address in this case because the \w does not match the '-' or '.' in the address. We'll fix this using the regular expression features below.

#### Square Brackets
Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'. The codes \w, \s etc. work inside square brackets too with the one exception that dot (.) just means a literal dot. For the emails problem, the square brackets are an easy way to add '.' and '-' to the set of chars which can appear around the @ with the pattern r'[\w.-]+@[\w.-]+' to get the whole email address:

In [None]:
match = re.search(r'[\w.-]+@[\w.-]+', str)
if match:
    print(match.group())  ## 'alice-b@google.com'

(More square-bracket features) You can also use a dash to indicate a range, so [a-z] matches all lowercase letters. To use a dash without indicating a range, put the dash last, e.g. [abc-]. An up-hat (^) at the start of a square-bracket set inverts it, so [^ab] means any char except 'a' or 'b'.
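
Here is a short sketch of the extra square-bracket features just described (ranges, a trailing literal dash, inversion with ^, and the literal dot). The test strings are invented for the example.

```python
import re

# [a-z] is a range: one or more lowercase letters.
lower = re.search(r'[a-z]+', 'ABC def GHI')
print(lower.group())      # def

# A dash placed last is a literal dash, not a range.
dashed = re.search(r'[abc-]+', 'xx a-b-c yy')
print(dashed.group())     # a-b-c

# ^ at the start inverts the set: any char except 'a' or 'b'.
inverted = re.search(r'[^ab]+', 'abbacdab')
print(inverted.group())   # cd

# Inside brackets a dot is just a dot.
dots = re.search(r'[.]+', 'wait...what')
print(dots.group())       # ...
```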

#### Group Extraction
The "group" feature of a regular expression allows you to pick out parts of the matching text. Suppose for the emails problem that we want to extract the username and host separately. To do this, add parentheses ( ) around the username and host in the pattern, like this: r'([\w.-]+)@([\w.-]+)'. In this case, the parentheses do not change what the pattern will match; instead they establish logical "groups" inside of the match text. On a successful search, match.group(1) is the match text corresponding to the 1st left parenthesis, and match.group(2) is the text corresponding to the 2nd left parenthesis. The plain match.group() is still the whole match text as usual.

In [None]:
str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', str)
if match:
    print (match.group())   ## 'alice-b@google.com' (the whole match)
    print (match.group(1))  ## 'alice-b' (the username, group 1)
    print (match.group(2))  ## 'google.com' (the host, group 2)

#### findall
findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern. findall() finds *all* the matches and returns them as a list of strings, with each string representing one match.

In [None]:
## Suppose we have a text with many email addresses
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', str) ## ['alice@google.com', 'bob@abc.com']
for email in emails:
    # do something with each found email string
    print (email)

# Python Utilities
In this section, we look at a few of Python's many standard utility modules to solve common problems.

### File System -- os, os.path, shutil
The *os* and *os.path* modules include many functions to interact with the file system. The *shutil* module can copy files.

- os module docs
- filenames = os.listdir(dir) -- list of filenames in that directory path (not including . and  ..). The filenames are just the names in the directory, not their absolute paths.
- os.path.join(dir, filename) -- given a filename from the above list, use this to put the dir and filename together to make a path
- os.path.abspath(path) -- given a path, return an absolute form, e.g. /home/nick/foo/bar.html
- os.path.dirname(path), os.path.basename(path) -- given dir/foo/bar.html, return the dirname "dir/foo" and basename "bar.html"
- os.path.exists(path) -- true if it exists
- os.mkdir(dir_path) -- makes one dir, os.makedirs(dir_path) makes all the needed dirs in this path
- shutil.copy(source-path, dest-path) -- copy a file (dest path directories should exist)

In [None]:
import os

## Example pulls filenames from a dir, prints their relative and absolute paths
def printdir(dir):
  filenames = os.listdir(dir)
  for filename in filenames:
    print (filename)  ## foo.txt
    print (os.path.join(dir, filename)) ## dir/foo.txt (relative to current dir)
    print (os.path.abspath(os.path.join(dir, filename))) ## /home/nick/dir/foo.txt

In [None]:
printdir('../')
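
The remaining functions from the list (makedirs, dirname, basename, exists, shutil.copy) can be tried out safely in a throwaway directory. This is a sketch under that assumption: the directory comes from tempfile.mkdtemp(), the file names are hypothetical, and shutil.rmtree() is used only to clean up afterwards.

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()                 # hypothetical scratch directory
nested = os.path.join(base, 'a', 'b')
os.makedirs(nested)                       # creates 'a' and 'a/b' in one call

src = os.path.join(nested, 'bar.html')
with open(src, 'w') as f:
    f.write('<p>hello</p>\n')

print(os.path.basename(src))              # bar.html
print(os.path.dirname(src) == nested)     # True
print(os.path.exists(src))                # True

dest = os.path.join(base, 'copy.html')    # destination directory already exists
shutil.copy(src, dest)
listing = sorted(os.listdir(base))
print(listing)                            # ['a', 'copy.html']

shutil.rmtree(base)                       # clean up the scratch directory
```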

Exploring a module works well with the built-in python help() and dir() functions. In the interpreter, do an "import os", and then use these commands to look at what's available in the module: dir(os), help(os.listdir), dir(os.path), help(os.path.dirname).

## subprocess — Subprocess management


The recommended approach to invoking subprocesses is to use the run() function for all use cases it can handle. For more advanced use cases, the underlying [Popen](https://docs.python.org/3/library/subprocess.html#subprocess.Popen) interface can be used directly.

In [48]:
import subprocess

subprocess.run(["ls", "-lha"])

CompletedProcess(args=['ls', '-lha'], returncode=0)

In [49]:
# run() raises FileNotFoundError when the executable itself cannot be found.
# A command that starts but exits with a non-zero status does not raise unless check=True.
subprocess.run(["./bash-script-with-bad-syntax"])

FileNotFoundError: [Errno 2] No such file or directory: './bash-script-with-bad-syntax': './bash-script-with-bad-syntax'
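
A missing executable is a different failure from a command that runs and then exits with an error status. The sketch below shows the second case, using the Python interpreter itself as a stand-in for a failing command (that choice is an assumption made so the example runs on any platform).

```python
import subprocess
import sys

# A child process that exits with status 3 (hypothetical failing command).
cmd = [sys.executable, '-c', 'import sys; sys.exit(3)']

result = subprocess.run(cmd)       # no exception, just a non-zero returncode
print(result.returncode)           # 3

failed_code = None
try:
    subprocess.run(cmd, check=True)    # raises on a non-zero exit status
except subprocess.CalledProcessError as e:
    failed_code = e.returncode
    print('command failed with status', failed_code)
```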

If shell=True, the command string is interpreted as a raw shell command.

Using shell=True may expose you to code injection if you use user input to build the command string.

In [50]:
subprocess.run("ls -lha", shell=True)

CompletedProcess(args='ls -lha', returncode=0)
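
To see why the list form is the safer default, consider input containing shell syntax. In this sketch the "untrusted" string and the tiny child program are both made up; capture_output and text assume Python 3.7 or later.

```python
import subprocess
import sys

user_input = 'hello; echo INJECTED'    # hypothetical untrusted input

# Passed as a list element, the input is a single argument to the child
# process; no shell ever parses the ';'.
child_code = 'import sys; print(sys.argv[1])'
result = subprocess.run([sys.executable, '-c', child_code, user_input],
                        capture_output=True, text=True)
print(result.stdout, end='')           # hello; echo INJECTED
```

Had the same string been pasted into a command line run with shell=True, the shell would have treated everything after the ';' as a second command. If a shell really is required, shlex.quote() from the standard library can escape individual arguments on POSIX shells.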

In [56]:
import subprocess
import sys

# create two files to hold the output and errors, respectively
with open('out.txt','w+') as fout:
    with open('err.txt','w+') as ferr:
        out=subprocess.run(["ls",'-lha'],stdout=fout,stderr=ferr)
        # reset file to read from it
        fout.seek(0)
        # save output (if any) in variable
        output=fout.read()

        # reset file to read from it
        ferr.seek(0) 
        # save errors (if any) in variable
        errors = ferr.read()

print("printing output:")
print(output)
# total 20K
# drwxrwxr-x  3 felipe felipe 4,0K Nov  4 15:28 .
# drwxrwxr-x 39 felipe felipe 4,0K Nov  3 18:31 ..
# drwxrwxr-x  2 felipe felipe 4,0K Nov  3 19:32 .ipynb_checkpoints
# -rw-rw-r--  1 felipe felipe 5,5K Nov  4 15:28 main.ipynb
print ("End output\n")

print("printing errors:")
print(errors)
# '' empty string

printing output:
total 176
drwxr-xr-x  7 vishp100  staff   224B Dec 23 09:19 .
drwxr-xr-x  6 vishp100  staff   192B Dec 23 09:19 ..
drwxr-xr-x  4 vishp100  staff   128B Dec 22 08:34 .ipynb_checkpoints
-rw-r--r--@ 1 vishp100  staff    57K Dec 22 08:33 01_python_language.ipynb
-rw-r--r--@ 1 vishp100  staff    26K Dec 23 09:19 02_python_language_files_regex_utils.ipynb
-rw-r--r--@ 1 vishp100  staff     0B Dec 23 09:41 err.txt
-rw-r--r--  1 vishp100  staff     0B Dec 23 09:41 out.txt

End output

printing errors:
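
When the output is only needed as strings, run() can collect it directly, without the intermediate files. This is a minimal sketch, assuming Python 3.7 or later for capture_output; the child program is a made-up one-liner that writes to both streams.

```python
import subprocess
import sys

# Hypothetical child program that writes one line to each stream.
child_code = 'import sys; print("to stdout"); print("to stderr", file=sys.stderr)'
result = subprocess.run([sys.executable, '-c', child_code],
                        capture_output=True, text=True)

print(repr(result.stdout))   # 'to stdout\n'
print(repr(result.stderr))   # 'to stderr\n'
```

Writing to files, as in the cell above, is still the better fit when the output is large or needs to be kept after the program ends.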



# Exceptions
An exception represents a run-time error that halts the normal execution at a particular line and transfers control to error handling code. This section just introduces the most basic uses of exceptions. For example a run-time error might be that a variable used in the program does not have a value (NameError .. you've probably seen that one a few times), or a file open operation error because a file does not exist (IOError, which in Python 3 is an alias of OSError). Learn more in the exceptions tutorial and see the entire exception list.

Without any error handling code (as we have done thus far), a run-time exception just halts the program with an error message. That's a good default behavior, and you've seen it many times. You can add a "try/except" structure to your code to handle exceptions, like this:

In [62]:
import sys

filename = '../hello.txt'
try:
    ## Either of these two lines could throw an IOError, say
    ## if the file does not exist or the read() encounters a low level error.
    f = open(filename, 'r')
    text = f.read()
    f.close()
except IOError:
    ## Control jumps directly to here if any of the above lines throws IOError.
    sys.stderr.write('problem reading:' + filename + '\n')
## In any case, the code then continues with the line after the try/except

problem reading:../hello.txt

The try: section includes the code which might throw an exception. The except: section holds the code to run if there is an exception. If there is no exception, the except: section is skipped (that is, that code is for error handling only, not the "normal" case for the code). You can get a pointer to the exception object itself with syntax "except IOError as e: .." (e points to the exception object).
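
Here is a small sketch of the "as e" form. The path is hypothetical and is assumed not to exist, so the except branch runs.

```python
import sys

filename = 'no-such-dir/hello.txt'    # hypothetical path that should not exist
err_name = None
try:
    with open(filename, 'r') as f:
        text = f.read()
except OSError as e:
    # In Python 3, IOError is an alias of OSError; a missing file raises
    # FileNotFoundError, which is a subclass of it.
    text = None
    err_name = type(e).__name__
    sys.stderr.write('problem reading ' + filename + ': ' + str(e) + '\n')

print(err_name)   # FileNotFoundError
```

Note that Python 3 removes the name e once the except block ends, so anything needed later (here the exception's type name) has to be copied into another variable inside the block.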