# Advanced Python Modules

Here we will discuss the following modules:


1.) Counter - Collection Module

2.) DefaultDict - Collection Module

3.) OrderedDict - Collection Module

4.) NamedTuple - Collection Module

5.) Datetime

6.) Python debugger

7.) timeit

8.) Regular Expression

9.) StringIO


# Collections Module

The collections module is a built-in module that implements specialized container data types providing alternatives to Python’s general purpose built-in containers. We've already gone over the basics: dict, list, set, and tuple.

Now we'll learn about the alternatives that the collections module provides.

## 1.) Counter

*Counter* is a *dict* subclass for counting hashable objects. Elements are stored as dictionary keys, and their counts are stored as the corresponding values.

Let's see how it can be used:

In [1]:
from collections import Counter # C has to be capital in Counter

In [2]:
l = [1,1,1,1,1,1,12,2,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,4]

Counter(l)

Counter({1: 6, 12: 1, 2: 6, 3: 5, 4: 6})

In [3]:
s = 'asasaasasasjkgfkgfknvjfkdnvdv'

Counter(s)

Counter({'a': 6,
         's': 5,
         'j': 2,
         'k': 4,
         'g': 2,
         'f': 3,
         'n': 2,
         'v': 3,
         'd': 2})

In [4]:
s = "How many times words like these words show up in a sentence. How many times in totol"

In [5]:
words = s.split()

Counter(words)

Counter({'How': 2,
         'many': 2,
         'times': 2,
         'words': 2,
         'like': 1,
         'these': 1,
         'show': 1,
         'up': 1,
         'in': 2,
         'a': 1,
         'sentence.': 1,
         'totol': 1})

In [6]:
c = Counter(words)

In [7]:
c.most_common(2) # Shows the 2 most common words

[('How', 2), ('many', 2)]

## Common patterns when using the Counter() object


    sum(c.values())                 # total of all counts
    c.clear()                       # reset all counts
    list(c)                         # list unique elements
    set(c)                          # convert to a set
    dict(c)                         # convert to a regular dictionary
    c.items()                       # convert to a list of (elem, cnt) pairs
    Counter(dict(list_of_pairs))    # convert from a list of (elem, cnt) pairs
    c.most_common()[:-n-1:-1]       # n least common elements
    c += Counter()                  # remove zero and negative counts

In [8]:
sum(c.values())

17
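To make a few of the patterns above concrete, here is a small sketch. The sentence and the variable names (`total`, `least_common`) are only illustrative; `subtract()` is used here simply to force a negative count so that the last pattern has something to remove.

```python
from collections import Counter

words = "How many times words like these words show up in a sentence".split()
c = Counter(words)

total = sum(c.values())                    # total of all counts -> 12

n = 2
least_common = c.most_common()[:-n-1:-1]   # n least common elements
# -> [('sentence', 1), ('a', 1)]

c.subtract(['How', 'How'])                 # 'How' now has a count of -1
c += Counter()                             # drops keys with zero or negative counts
# 'How' is no longer a key in c
```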

## The next 3 topics in the collections module are DefaultDict, OrderedDict and NamedTuple

### These topics are summarized in the DataTypes sheet
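As a quick reminder before moving on, here is a minimal sketch of the three containers. The data and the names `groups`, `od` and `Point` are made up for illustration only.

```python
from collections import defaultdict, OrderedDict, namedtuple

# defaultdict: a missing key is created on first access using the factory (list here)
groups = defaultdict(list)
for word in ['apple', 'avocado', 'banana']:
    groups[word[0]].append(word)
# groups['a'] -> ['apple', 'avocado']

# OrderedDict: remembers insertion order and offers reordering helpers
od = OrderedDict([('a', 1), ('b', 2), ('c', 3)])
od.move_to_end('a')
# list(od) -> ['b', 'c', 'a']

# namedtuple: a tuple whose fields can also be read by name
Point = namedtuple('Point', ['x', 'y'])
p = Point(x=1, y=2)
# p.x -> 1, p[1] -> 2
```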

# 5.) Datetime

   * time class
        * attributes - min, max, resolution
  
   * date class
        * attributes - day, month, year, min, max, resolution
        * methods - today(), timetuple(), replace()

### time class

In [8]:
import datetime
t = datetime.time

In [9]:
t = datetime.time(5,25,1) # arguments: hour, minute, second, microsecond, tzinfo

In [10]:
print(t) # Note that the time class of datetime holds only a time, not a date
print(t.min) # min and max are class attributes giving the valid range of times in a day
print(t.max)
print(t.resolution) # smallest possible difference between two time objects

05:25:01
00:00:00
23:59:59.999999
0:00:00.000001


### date class

In [11]:
today = datetime.date(2020, 1, 1)
print(today)
print(today.day)

2020-01-01
1


In [12]:
today = datetime.date
print(today.today()) # note the parentheses: today() is a class method of date, not an attribute

2019-07-11


In [13]:
today = datetime.date.today()
print(today.timetuple())
print(today.day)
print(today.min)
print(today.max)
print(today.resolution)

time.struct_time(tm_year=2019, tm_mon=7, tm_mday=11, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=192, tm_isdst=-1)
11
0001-01-01
9999-12-31
1 day, 0:00:00


In [14]:
d1 = datetime.date(2015, 3, 11)
d2 = d1.replace(year=1990)
print(d2)
print(d1)

1990-03-11
2015-03-11


In [15]:
d1-d2 # subtracting two dates gives a timedelta, here the difference in days. A timedelta can also express the
# difference in other units, such as seconds or hours

datetime.timedelta(days=9131)
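The comment above mentions getting the difference in other units. A minimal sketch of how that might look, reusing the same two dates (the names `delta`, `hours` and `later` are just for illustration):

```python
import datetime

d1 = datetime.date(2015, 3, 11)
d2 = d1.replace(year=1990)

delta = d1 - d2                              # a datetime.timedelta
days = delta.days                            # 9131
seconds = delta.total_seconds()              # 788918400.0
hours = delta / datetime.timedelta(hours=1)  # 219144.0

# A timedelta can also be added to a date to move forward in time
later = d1 + datetime.timedelta(weeks=2)     # 2015-03-25
```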

# 6.) Python Debugger

You've probably used a variety of print statements to try to find errors in your code. A better way of doing this is by using Python's built-in debugger module (pdb). The pdb module implements an interactive debugging environment for Python programs. It includes features to let you pause your program, look at the values of variables, and watch program execution step-by-step, so you can understand what your program actually does and find bugs in the logic.

This is a bit difficult to show since it requires creating an error on purpose, but hopefully this simple example illustrates the power of the pdb module. <br>*Note: Keep in mind it would be pretty unusual to use pdb in an iPython Notebook setting.*

___
Here we will create an error on purpose, trying to add a list to an integer

In [3]:
import pdb

In [None]:
x = [1,3,4]
y = 2
z = 3

result = y+z
print(result)

pdb.set_trace() # Main line of code

result2 = y+x
print(result2)

5
--Return--
> <ipython-input-5-5b7597957c3f>(8)<module>()->None
-> pdb.set_trace() # Main line of code
(Pdb) x
[1, 3, 4]
(Pdb) y
2
(Pdb) z
3
(Pdb) x**2
*** TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'
(Pdb) x + y
*** TypeError: can only concatenate list (not "int") to list


Use q to quit the debugger.

Official Document - https://docs.python.org/3/library/pdb.html
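Besides q, a handful of other commands cover most debugging sessions. The sketch below uses a hypothetical helper, `add_values`, with a `debug` flag (both invented for this example) so that the debugger only opens when asked for. It relies on the built-in `breakpoint()`, available from Python 3.7, which by default drops into pdb just like `pdb.set_trace()`.

```python
def add_values(y, x, debug=False):
    """Hypothetical helper: add y and x, optionally pausing in pdb first."""
    if debug:
        # Opens the (Pdb) prompt at this point. Useful commands there:
        #   p expr  - print the value of an expression
        #   n       - run the next line
        #   s       - step into a function call
        #   l       - list the source around the current line
        #   c       - continue running until the next breakpoint
        #   q       - quit the debugger
        breakpoint()
    return y + x

result = add_values(2, 3)   # debug is off, so this runs straight through
```

Calling `add_values(2, [1, 3, 4], debug=True)` would pause before the failing addition, so x and y can be inspected first.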

# 7.) Timing your code - timeit

This module provides a simple way to time small bits of Python code. It has both a Command-Line Interface as well as a callable one. It avoids a number of common traps for measuring execution times. 

In [2]:
import timeit

In [3]:
'0-1-2-3-.......-99' # we need to create a string like this

'0-1-2-3-.......-99'

In [4]:
"-".join(str(n) for n in range(100))

'0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-24-25-26-27-28-29-30-31-32-33-34-35-36-37-38-39-40-41-42-43-44-45-46-47-48-49-50-51-52-53-54-55-56-57-58-59-60-61-62-63-64-65-66-67-68-69-70-71-72-73-74-75-76-77-78-79-80-81-82-83-84-85-86-87-88-89-90-91-92-93-94-95-96-97-98-99'

In [5]:
"-".join([str(n) for n in range(100)])

'0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-24-25-26-27-28-29-30-31-32-33-34-35-36-37-38-39-40-41-42-43-44-45-46-47-48-49-50-51-52-53-54-55-56-57-58-59-60-61-62-63-64-65-66-67-68-69-70-71-72-73-74-75-76-77-78-79-80-81-82-83-84-85-86-87-88-89-90-91-92-93-94-95-96-97-98-99'

In [7]:
"-".join(map(str, range(100)))

'0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-24-25-26-27-28-29-30-31-32-33-34-35-36-37-38-39-40-41-42-43-44-45-46-47-48-49-50-51-52-53-54-55-56-57-58-59-60-61-62-63-64-65-66-67-68-69-70-71-72-73-74-75-76-77-78-79-80-81-82-83-84-85-86-87-88-89-90-91-92-93-94-95-96-97-98-99'

In [29]:
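# Study these one at a time by uncommenting a single line

#print("-".join(str(n) for n in range(100)))   # joined string
#print(str(n) for n in range(100))             # prints a generator object, not its items
print([str(n) for n in range(100)])            # list of strings
#print("-".join([str(n) for n in range(100)])) # joined string built from the list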

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99']


In [48]:
# Generator expression
timeit.timeit('"-".join(str(n) for n in range(100))', number=10000) # note that the statement to time is passed as a string

0.2624610440000197

In [49]:
# List comprehension
timeit.timeit('"-".join([str(n) for n in range(100)])', number=10000)

0.22552700599999298

In [51]:
timeit.timeit('"-".join(map(str, range(100)))', number=10000) # note that map is the fastest

0.1775258859997848
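The statement does not have to be a string. As a sketch, `timeit.timeit` also accepts a callable, and `timeit.repeat` runs the whole measurement several times so that the minimum can be taken as the most stable figure. The function name `join_with_map` and the smaller `number` are choices made for this example only.

```python
import timeit

def join_with_map():
    return "-".join(map(str, range(100)))

# Pass a callable instead of a string
single = timeit.timeit(join_with_map, number=1000)

# setup runs once before timing; repeat=3 gives three separate measurements
runs = timeit.repeat('"-".join(map(str, range(n)))',
                     setup='n = 100',
                     repeat=3,
                     number=1000)
best = min(runs)   # total seconds for 1000 executions in the fastest run
```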

## Built-in Magic Function

In [4]:
%timeit "-".join(str(n) for n in range(100)) # %timeit reports the mean and standard deviation of the time per loop
# across 7 runs of 10000 loops each

22.6 µs ± 616 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [5]:
%timeit "-".join([str(n) for n in range(100)])

21.2 µs ± 2.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [7]:
%timeit "-".join(map(str, range(100)))

15.4 µs ± 287 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


Check out the documentation for more information:
https://docs.python.org/3/library/timeit.html

# 8.) Regular Expressions

Regular expressions are text-matching patterns described with a formal syntax. You'll often hear regular expressions referred to as 'regex' or 'regexp' in conversation. Regular expressions can include a variety of rules, from finding repetition, to text-matching, and much more. As you advance in Python you'll see that a lot of your parsing problems can be solved with regular expressions (they're also a common interview question!).

In this topic we will discuss:

[1] Searching for patterns in text

[2] Split feature

[3] findall method

[4] re pattern

    (i) repetition syntax
    (ii) character sets 
    (iii) exclusion
    (iv) character ranges
    (v) escape codes

## [1] Searching for Patterns in Text

One of the most common uses for the re module is for finding patterns in text. Let's do a quick example of using the search method in the re module to find some text:

In [4]:
import re

In [5]:
patterns = ['term1', 'term2']

In [6]:
text = "This is a string with term1 but not the other term"

## Search feature of re

In [13]:
for pattern in patterns:
    print('Searching for "%s" in:\n "%s"\n' %(pattern,text))
    
    #Check for match
    if re.search(pattern,text):
        print('Match was found. \n')
    else:
        print('No Match was found.\n')

Searching for "term1" in:
 "This is a string with term1 but not the other term"

Match was found. 

Searching for "term2" in:
 "This is a string with term1 but not the other term"

No Match was found.



In [14]:
print(re.search("h", "w")) # no match is found, so None is returned

None


In [10]:
match = re.search(patterns[0], text)
type(match) # This object holds information about the match, including the original input string, the regular expression
# that was used, and the location of the match

re.Match

This **Match** object returned by the search() method is more than just a Boolean or None, it contains information about the match, including the original input string, the regular expression that was used, and the location of the match. Let's see the methods we can use on the match object:

In [11]:
# We can therefore call methods on this match object
match.start() # Tells index of the start of the match

22

In [12]:
match.end()

27
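Two more methods on the match object are worth knowing: `group()` returns the matched text and `span()` returns the start and end together. A minimal sketch using the same text as above:

```python
import re

text = "This is a string with term1 but not the other term"
match = re.search('term1', text)

if match:                                     # search() returns None when nothing matches
    matched = match.group()                   # 'term1'
    span = match.span()                       # (22, 27)
    sliced = text[match.start():match.end()]  # 'term1'
```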

## [2] Split feature

In [22]:
split_term = "@" # same as the split for the strings

phrase = "My email is Ibrahimrupawala@gmail.com"

In [25]:
re.split(split_term, phrase) # the pattern comes first, then the string to split

['My email is Ibrahimrupawala', 'gmail.com']

In [26]:
# notice that the split of the string has a syntax phrase.split("@")

phrase.split(split_term)

['My email is Ibrahimrupawala', 'gmail.com']

In [27]:
# Questions like this are common, e.g. how would you separate the domain name from the rest of an email address?
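One possible answer to that kind of question is sketched below. It uses a made-up address (`user@example.com`) and capture groups, which are written with parentheses; the pattern is deliberately simple and is not a full email validator.

```python
import re

phrase = "My email is user@example.com"   # hypothetical address for illustration

# Each pair of parentheses captures one part of the match
match = re.search(r'(\S+)@(\S+)', phrase)
user, domain = match.groups()
# user -> 'user', domain -> 'example.com'

# re.split accepts a pattern, so it can split on several characters at once
parts = re.split(r'[@.]', 'user@example.com')
# parts -> ['user', 'example', 'com']
```

Splitting on more than one character is something the plain string `split` method cannot do in a single call.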

## [3] findall method

Finding all instances of a pattern:

In [30]:
re.findall("match", "here is one match and here is another match")

['match', 'match']
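`findall` only returns the matched text. If the positions are needed as well, `re.finditer` yields a match object for every occurrence. A short sketch with the same sentence (the names `positions` and `before` are illustrative):

```python
import re

text = "here is one match and here is another match"

# finditer yields match objects, so start() and end() are available for each hit
positions = [(m.start(), m.end()) for m in re.finditer("match", text)]
# positions -> [(12, 17), (38, 43)]

# When the pattern contains a group, findall returns the group instead of the whole match
before = re.findall(r'(\w+) match', text)
# before -> ['one', 'another']
```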

## [4] re Pattern Syntax

This will be the bulk of this lecture on using re with Python. Regular expressions support a huge variety of patterns beyond just simply finding where a single string occurred. 

We can use *metacharacters* along with re to find specific types of patterns. 

Since we will be testing multiple re syntax forms, let's create a function that will print out results given a list of various regular expressions and a phrase to parse:

In [15]:
def multi_re_find(patterns,phrase):
    '''
    Takes in a list of regex patterns
    Prints a list of all matches
    '''
    for pattern in patterns:
        print('Searching the phrase using the re check: %r' %(pattern))
        print(re.findall(pattern,phrase))
        print('\n')

### (i) Repetition Syntax

There are five ways to express repetition in a pattern:

   1. A pattern followed by the meta-character <code>*</code> is repeated zero or more times. 
   2. Replace the <code>*</code> with <code>+</code> and the pattern must appear at least once. 
   3. Using <code>?</code> means the pattern appears zero or one time. 
   4. For a specific number of occurrences, use <code>{m}</code> after the pattern, where **m** is replaced with the number of times the pattern should repeat. 
   5. Use <code>{m,n}</code> where **m** is the minimum number of repetitions and **n** is the maximum. Leaving out **n** <code>{m,}</code> means the value appears at least **m** times, with no maximum.
    
Now we will see an example of each of these using our multi_re_find function:

In [38]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

test_patterns = [ 'sd*',     # s followed by zero or more d's
                'sd+',          # s followed by one or more d's
                'sd?',          # s followed by zero or one d's
                'sd{3}',        # s followed by three d's
                'sd{2,3}',      # s followed by two to three d's
                ]

multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: 'sd*'
['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sddd', 'sd', 's', 's', 's', 's', 's', 's', 'sdddd']


Searching the phrase using the re check: 'sd+'
['sd', 'sd', 'sddd', 'sddd', 'sddd', 'sd', 'sdddd']


Searching the phrase using the re check: 'sd?'
['sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 's', 'sd']


Searching the phrase using the re check: 'sd{3}'
['sddd', 'sddd', 'sddd', 'sddd']


Searching the phrase using the re check: 'sd{2,3}'
['sddd', 'sddd', 'sddd', 'sddd']
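The open-ended form <code>{m,}</code> from point 5 is not covered by the patterns above, so here is a small sketch of it on the same test phrase:

```python
import re

test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

# s followed by two or more d's, with no upper limit
at_least_two = re.findall('sd{2,}', test_phrase)
# at_least_two -> ['sddd', 'sddd', 'sddd', 'sdddd']
```

Unlike <code>sd{2,3}</code>, the last match keeps all four d's because no maximum is set.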




### (ii) Character Sets

Character sets are used when you wish to match any one of a group of characters at a point in the input. Brackets are used to construct character set inputs. For example: the input <code>[ab]</code> searches for occurrences of either **a** or **b**.
Let's see some examples:

In [40]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

test_patterns = ['[sd]',    # either s or d
                's[sd]+']   # s followed by one or more s or d

multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: '[sd]'
['s', 'd', 's', 'd', 's', 's', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 'd', 's', 'd', 's', 'd', 's', 's', 's', 's', 's', 's', 'd', 'd', 'd', 'd']


Searching the phrase using the re check: 's[sd]+'
['sdsd', 'sssddd', 'sdddsddd', 'sds', 'sssss', 'sdddd']




It makes sense that the first input <code>[sd]</code> returns every instance of s or d. Also, the second input <code>s[sd]+</code> returns any full strings that begin with an s and continue with s or d characters until another character is reached.

### (iii) Exclusion

We can use <code>^</code> to exclude terms by incorporating it into the bracket syntax notation. For example: <code>[^...]</code> will match any single character not in the brackets. Let's see some examples:

In [42]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

Use <code>[^!.? ]</code> to check for matches that are not a <code>!</code>, <code>.</code>, <code>?</code>, or space. Add a <code>+</code> to check that the match appears at least once. This basically translates into finding the words.

In [45]:
re.findall('[^!.? ]+',test_phrase) # Notice the + sign, which requires the match to appear at least once

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']

### (iv) Character Ranges

As character sets grow larger, typing every character that should (or should not) match could become very tedious. A more compact format using character ranges lets you define a character set to include all of the contiguous characters between a start and stop point. The format used is <code>[start-end]</code>.

Common use cases are to search for a specific range of letters in the alphabet. For instance, <code>[a-f]</code> would return matches with any occurrence of letters between a and f. 

Let's walk through some examples:

In [46]:

test_phrase = 'This is an example sentence. Lets see if we can find some letters.'

test_patterns=['[a-z]+',      # sequences of lower case letters
               '[A-Z]+',      # sequences of upper case letters
               '[a-zA-Z]+',   # sequences of lower or upper case letters
               '[A-Z][a-z]+'] # one upper case letter followed by lower case letters
                
multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: '[a-z]+'
['his', 'is', 'an', 'example', 'sentence', 'ets', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']


Searching the phrase using the re check: '[A-Z]+'
['T', 'L']


Searching the phrase using the re check: '[a-zA-Z]+'
['This', 'is', 'an', 'example', 'sentence', 'Lets', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']


Searching the phrase using the re check: '[A-Z][a-z]+'
['This', 'Lets']




### (v) Escape Codes

You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more. For example:

<table border="1" class="docutils">
<colgroup>
<col width="14%" />
<col width="86%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Code</th>
<th class="head">Meaning</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\d</span></tt></td>
<td>a digit</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\D</span></tt></td>
<td>a non-digit</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\s</span></tt></td>
<td>whitespace (tab, space, newline, etc.)</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\S</span></tt></td>
<td>non-whitespace</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\w</span></tt></td>
<td>alphanumeric (letters, digits, and underscore)</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\W</span></tt></td>
<td>non-alphanumeric</td>
</tr>
</tbody>
</table>

Escapes are indicated by prefixing the character with a backslash <code>\</code>. Unfortunately, a backslash must itself be escaped in normal Python strings, and that results in expressions that are difficult to read. Using raw strings, created by prefixing the literal value with <code>r</code>, eliminates this problem and maintains readability.

Personally, I think the <code>r</code> prefix is one of the things that blocks someone unfamiliar with regex in Python from reading regex code at first. Note that <code>r</code> does not escape anything itself: it tells Python to leave the backslashes in the literal untouched, so they reach the re module exactly as written. Hopefully after seeing these examples this syntax will become clear.
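A small sketch of the difference a raw string makes. The <code>\b</code> word-boundary code is not in the table above; it is used here only because it shows the problem clearly, since in a normal string <code>'\b'</code> means a backspace character.

```python
import re

# Both spellings produce the same two-character pattern: a backslash and a d
same = ('\\d+' == r'\d+')           # True

# In a normal string '\b' is one backspace character; in a raw string it stays as two characters
normal_len = len('\b')              # 1
raw_len = len(r'\b')                # 2

phrase = 'cat concat cat.'
with_raw = re.findall(r'\bcat\b', phrase)     # ['cat', 'cat'], whole words only
without_raw = re.findall('\bcat\b', phrase)   # [], the pattern looks for backspace characters
```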

In [48]:
test_phrase = 'This is a string with some numbers 1233 and a symbol #hashtag'

test_patterns=[ r'\d+', # sequence of digits
                r'\D+', # sequence of non-digits
                r'\s+', # sequence of whitespace
                r'\S+', # sequence of non-whitespace
                r'\w+', # alphanumeric characters
                r'\W+', # non-alphanumeric
                ]

multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: '\\d+'
['1233']


Searching the phrase using the re check: '\\D+'
['This is a string with some numbers ', ' and a symbol #hashtag']


Searching the phrase using the re check: '\\s+'
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']


Searching the phrase using the re check: '\\S+'
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', '#hashtag']


Searching the phrase using the re check: '\\w+'
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', 'hashtag']


Searching the phrase using the re check: '\\W+'
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' #']




## Notes: 

Take a look at the full [documentation](https://docs.python.org/3/library/re.html#regular-expression-syntax) if you ever need to look up a particular pattern.

You can also check out the nice summary tables at this [source](http://www.tutorialspoint.com/python/python_reg_expressions.htm).




# 9.) StringIO Objects and the io Module

Back in **Lecture 24 - Files** we opened files that exist outside of Python, and streamed their contents into an in-memory file object. You can also create in-memory file-like objects within your program that Python treats the same way. Text data is stored in a StringIO object, while binary data would be stored in a BytesIO object. This object can then be used as input or output to most functions that would expect a standard file object.

Let's investigate StringIO objects. The best way to show this is by example:

In [1]:
import io

In [2]:
# Arbitrary String
message = 'This is just a normal string.'

In [3]:
# Use the StringIO class to wrap the string in a file-like object
f = io.StringIO(message)

Now we have an object *f* that we will be able to treat just like a file. For example:

In [4]:
f.read()

'This is just a normal string.'

We can also write to it. Because the read above left the cursor at the end, the new text is appended:

In [5]:
f.write(' Second line written to file like object')

40

In [6]:
# Reset cursor just like you would a file
f.seek(0)

0

In [7]:
# Read again
f.read()

'This is just a normal string. Second line written to file like object'

In [8]:
# Close the object when contents are no longer needed
f.close()

This has various use cases, especially in web scraping, where you may want to treat a string you scraped as if it were a file.
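As a sketch of that use case, suppose some scraped text happens to be in CSV form. The data below is invented for the example. Wrapping it in a StringIO object lets `csv.reader`, which expects a file-like object, parse it directly.

```python
import csv
import io

# Hypothetical scraped text; in practice this would come from a web response
scraped = "name,score\nalice,90\nbob,85\n"

# csv.reader expects a file-like object, so wrap the string first
rows = list(csv.reader(io.StringIO(scraped)))
# rows -> [['name', 'score'], ['alice', '90'], ['bob', '85']]

# A StringIO object can also collect output; getvalue() returns everything
# written so far without needing seek(0)
buffer = io.StringIO()
buffer.write('first line\n')
print('second line', file=buffer)
contents = buffer.getvalue()
buffer.close()
# contents -> 'first line\nsecond line\n'
```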

For more info on StringIO check out the documentation: https://docs.python.org/3/library/io.html