## Leftovers

10 weeks is not a lot of time to learn how to program. There were 2 basic topics that I wanted to cover this term, but we ran out of time. Here is a short sample of each so you can learn on your own!

### Functional Programming - List Comprehension, `map`, and `lambda`

This was mentioned very briefly in one of the homeworks, but I gave you no detail. Looping works fine in Python, but it feels tedious when the inside of the loop is just 1 line of code. Very often, we want to loop through a list, do something to each item in the list, and save the results into a new list. Python provides 2 basic ways of doing this in just 1 line: list comprehensions and `map`. There are entire programming languages (like Haskell, Scheme, and Lisp) that are built around this style. The general technique is called "functional programming" because it's centered on using functions.

First we read in some information from a text file. Remember the baby names? We save the result into a list called `lines`. We want to then clean up the information using `strip` and `split`:

In [141]:

with open('./datasets/baby_names_2012.txt','r') as f:
    lines = f.readlines()
    
lines = lines[1:] #remove the header row
       
print lines[:10]


['Aabha   13\r\n', 'Aabriella   5\r\n', 'Aaden   5\r\n', 'Aadhira 6\r\n', 'Aadhya  218\r\n', 'Aadi    10\r\n', 'Aadison 11\r\n', 'Aaditri 10\r\n', 'Aadya   292\r\n', 'Aadyn   16\r\n']


A **list comprehension** looks like a `for` loop in reverse. Compare the original `for` loop below with the same thing written as a list comprehension. You have almost exactly the same words, but the inside of the loop comes first, and the whole statement is wrapped in square brackets. A list comprehension replaces a `for` loop that has only 1 line in it: it applies that line to each element of the list and produces a new list, which we save here as the variable `cleanlines2`.

In [142]:
#original for loop method

cleanlines = []

for l in lines:
    cleanlines.append(l.strip())
    
    
#list comprehension
cleanlines2 = [l.strip() for l in lines]

print cleanlines[:10], cleanlines2[:10] #they look the same!


['Aabha   13', 'Aabriella   5', 'Aaden   5', 'Aadhira 6', 'Aadhya  218', 'Aadi    10', 'Aadison 11', 'Aaditri 10', 'Aadya   292', 'Aadyn   16'] ['Aabha   13', 'Aabriella   5', 'Aaden   5', 'Aadhira 6', 'Aadhya  218', 'Aadi    10', 'Aadison 11', 'Aaditri 10', 'Aadya   292', 'Aadyn   16']
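
A list comprehension can do more than `strip`. Here is a minimal sketch that pulls out each frequency and turns it into a number. The `sample` list is typed in by hand as a stand-in for the first few entries of `lines`, so the block runs without the data file, and it uses `print(...)` with parentheses so it also runs on Python 3.

```python
# A few lines typed in by hand; a stand-in for the real `lines` list
sample = ['Aabha   13\r\n', 'Aabriella   5\r\n', 'Aaden   5\r\n']

# for-loop version: take the second item on each line and make it an integer
counts = []
for l in sample:
    counts.append(int(l.split()[1]))

# list comprehension version: same words, inside of the loop first
counts2 = [int(l.split()[1]) for l in sample]

print(counts)
print(counts2)
```

Both versions build the same list of integers; the comprehension just says it in 1 line.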


**`map`** is very similar, but with an emphasis on functions. It takes 2 arguments: a function (either a built-in Python one, or one that you make yourself) and a list. It goes through each element of the list and calls that function on it, returning the results in a new list. Here, in order to refer to the `strip` method by itself, we need to say `str.strip`, because it's a method that belongs to strings.

In [143]:
cleanlines3 = map(str.strip, lines)

print cleanlines3[:10] #the same as above!

['Aabha   13', 'Aabriella   5', 'Aaden   5', 'Aadhira 6', 'Aadhya  218', 'Aadi    10', 'Aadison 11', 'Aaditri 10', 'Aadya   292', 'Aadyn   16']
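
One caveat if you ever run this on Python 3 (this notebook uses Python 2): there, `map` hands back a lazy iterator instead of a list, so slicing it with `[:10]` fails. A minimal sketch of the usual workaround, again with a hand-typed `sample` list standing in for `lines`:

```python
# A few lines typed in by hand; a stand-in for the real `lines` list
sample = ['Aabha   13\r\n', 'Aabriella   5\r\n', 'Aaden   5\r\n']

# wrapping map in list() gives a real list on both Python 2 and Python 3
cleaned = list(map(str.strip, sample))

print(cleaned[:2])
```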


What if we want to do something else with each element? Let's say we want to make it uppercase, and split the items so that the name and frequency are separate. There are 2 ways to do this. The first way involves defining our own function. It's a function that takes a string (i.e., 1 element from the list) and returns a list, after stripping, making the text uppercase, and splitting the information. We can call the function whatever we want. Then, we use that function name as the first argument in `map`:

In [144]:

def myfunc(somestring):
    # strip the line ending, uppercase the text, then split into [name, frequency]
    newstring = somestring.strip().upper().split()
    return newstring
    
    
cleanlines4 = map(myfunc,lines)

print cleanlines4[:10]
    


[['AABHA', '13'], ['AABRIELLA', '5'], ['AADEN', '5'], ['AADHIRA', '6'], ['AADHYA', '218'], ['AADI', '10'], ['AADISON', '11'], ['AADITRI', '10'], ['AADYA', '292'], ['AADYN', '16']]


If our function were more complicated, then defining it this way would make sense. But our function is really just 1 line, and we will probably have no use for `myfunc` after this point. It seems kind of silly to define a new function that's only 1 line and that we only use once. That's why Python has "lambda functions" (sometimes called "anonymous functions"). They allow you to create a function in 1 line without giving it a name, which makes them super useful with `map`. Below I do the same thing as above, but without creating `myfunc`:

In [145]:

cleanlines5 = map(lambda x: x.strip().upper().split(), lines)
print cleanlines5[:10]


[['AABHA', '13'], ['AABRIELLA', '5'], ['AADEN', '5'], ['AADHIRA', '6'], ['AADHYA', '218'], ['AADI', '10'], ['AADISON', '11'], ['AADITRI', '10'], ['AADYA', '292'], ['AADYN', '16']]


Here's how to read it: 

```python
lambda x: x.strip().upper().split()
```

"Create a function that takes 1 argument, which we call x. Take x and do `strip`, `upper`, and `split` on it. Then return the result." 

By convention people usually use `x` as the argument, but you can use any name you want:

In [146]:
#try to say "lambda banana" 5 times fast!
cleanlines6 = map(lambda banana: banana.strip().upper().split(), lines)
print cleanlines6[:10]


[['AABHA', '13'], ['AABRIELLA', '5'], ['AADEN', '5'], ['AADHIRA', '6'], ['AADHYA', '218'], ['AADI', '10'], ['AADISON', '11'], ['AADITRI', '10'], ['AADYA', '292'], ['AADYN', '16']]
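
To tie the pieces together, here is a small sketch showing that a named function, a lambda, and a list comprehension all produce the same result. The `sample` list is typed in by hand as a stand-in for `lines`, and the `with_...` variable names are just for illustration.

```python
# Two lines typed in by hand; a stand-in for the real `lines` list
sample = ['Aabha   13\r\n', 'Aadhya  218\r\n']

def myfunc(somestring):
    return somestring.strip().upper().split()

# 1. a named function passed to map
with_def = list(map(myfunc, sample))

# 2. a lambda passed to map
with_lambda = list(map(lambda x: x.strip().upper().split(), sample))

# 3. a list comprehension, with no separate function at all
with_comprehension = [l.strip().upper().split() for l in sample]

print(with_def)
print(with_def == with_lambda == with_comprehension)
```

Which one you pick is mostly a matter of taste. The `list(...)` wrapper is only there so the block behaves the same on Python 2 and Python 3.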


### Regular Expressions (RegEx)

Regular expressions are almost a language of their own. Every major programming language has a way to use them (the idea goes back to formal language theory and early Unix tools like `grep`, and the Perl programming language popularized the syntax most languages use today). They allow you to do very complex searches through text. In Python, everything is done using the built-in `re` module, which you load with `import re`. The basic idea with regular expressions is that you specify a *pattern* using plain text and special symbols, and you search for that pattern in some bit of text. Here is a super basic introduction, but you should check out the help page to really understand it: https://docs.python.org/2/howto/regex.html

In [158]:
import re

mystring = 'SAM-I-AM has the phone number: 867-5309. SAM has a favorite meal, it is green eggs and ham (or yams)' 

p = re.compile('SAM') #find the text SAM

m = p.search( mystring )
if m:
    print 'Match found: ', m.group()
else:
    print 'No match'

Match found:  SAM


Notice that it only reported one match. The `search` method only finds the first occurrence of a pattern, which here is the `SAM` at the very start of the string (the beginning of `SAM-I-AM`), so it never gets to the second `SAM`. To find all of them, we use `findall`. We also modify our pattern to match SAM followed by anything, using the `.`. This means it will match any single character following SAM (including a space, but not a newline), so we can see what comes right after each one.

In [159]:
p = re.compile('SAM.') #find SAM followed by any single character

m = p.findall( mystring )
if m:
    print 'Match found: ', m
else:
    print 'No match'

Match found:  ['SAM-', 'SAM ']
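
To check where each match actually sits in the text, match objects have a `start()` method, and `finditer` gives you one match object per occurrence. A small sketch using the plain `SAM` pattern (the variable names `first` and `positions` are mine, and `mystring` is repeated so the block runs on its own):

```python
import re

mystring = 'SAM-I-AM has the phone number: 867-5309. SAM has a favorite meal, it is green eggs and ham (or yams)'

p = re.compile('SAM')

# search stops at the first match; start() says where in the string it was found
first = p.search(mystring)
print(first.start())

# finditer walks through every match, one match object at a time
positions = [m.start() for m in p.finditer(mystring)]
print(positions)
```

Position 0 is the very first character of the string, so the first match is the `SAM` inside `SAM-I-AM`.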


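Let's find every "am" together with the character right before it, ignoring upper/lowercase. Again the `.` matches any single character, this time before `AM`. Note that this is not quite the same as finding words that end in "am": it also picks up `yam` from the middle of "yams".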

In [160]:
p = re.compile('.AM',re.IGNORECASE) #find any single character followed by `AM`

m = p.findall( mystring )
if m:
    print 'Match found: ', m
else:
    print 'No match'

Match found:  ['SAM', '-AM', 'SAM', 'ham', 'yam']


We can get fancier. Now I want to include the 5 characters (including spaces) leading up to each `AM`. I can do this by putting a number in curly brackets after the `.`. Try changing the number and see what happens. Try adding `.{3}` to the end of the pattern to include 3 characters after the "AM".

In [150]:
p = re.compile('.{5}AM',re.IGNORECASE) #find the string AM, and grab 5 characters leading up to it

m = p.findall( mystring )
if m:
    print 'Match found: ', m
else:
    print 'No match'

Match found:  ['AM-I-AM', '09. SAM', 'and ham', '(or yam']
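
Here is one way the suggested experiment could look: the same pattern with `.{3}` added to the end, so each match also includes the 3 characters after the "AM". This is just a sketch, with `mystring` repeated so the block runs on its own.

```python
import re

mystring = 'SAM-I-AM has the phone number: 867-5309. SAM has a favorite meal, it is green eggs and ham (or yams)'

# 5 characters before AM, then AM, then 3 characters after it
p = re.compile('.{5}AM.{3}', re.IGNORECASE)

matches = p.findall(mystring)
print(matches)
```

Notice that the "yam" match drops out: only 2 characters come after it in the string, so `.{3}` has nothing left to match.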


Let's find just the phone number. We can specify a range of characters to search for in square brackets: `[0-9]` matches only digits. We specify repetitions in curly brackets. So, we look for 3 digits in a row like this: `[0-9]{3}`. Below we search for 3 digits, followed by a dash `-`, then 4 more digits:

In [151]:
p = re.compile('[0-9]{3}-[0-9]{4}') #find the phone number
m = p.findall( mystring )
if m:
    print 'Match found: ', m
else:
    print 'No match'

Match found:  ['867-5309']


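Alternatively, we could have searched for a digit followed by any 7 characters, for a total of 8 characters (the length of `867-5309`). This is a much looser pattern: it only works here because the phone number contains the first digit in the string.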

In [152]:
p = re.compile('[0-9].{7}') #find a digit followed by any 7 characters
m = p.findall( mystring )
if m:
    print 'Match found: ', m
else:
    print 'No match'

Match found:  ['867-5309']
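
To see the difference between the two phone-number patterns, try them on text that has another digit in it. A quick sketch, where `other` is a made-up string and not part of the course data:

```python
import re

# a made-up string with a stray digit before the phone number
other = '3 little pigs called 867-5309'

# 3 digits, a dash, 4 digits: only matches something shaped like a phone number
strict = re.findall('[0-9]{3}-[0-9]{4}', other)

# a digit plus any 7 characters: also grabs the "3" and whatever follows it
loose = re.findall('[0-9].{7}', other)

print(strict)
print(loose)
```

The more specific the pattern, the fewer surprises you get when the text changes.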


If we use a `*`, that means "repeat the previous item any number of times (including zero)". So, if we search for `egg.*`, it will find the string "egg", followed by any characters up until the end of the line (which here is also the end of the string, because `.` does not match a newline). We also go back to the `search` method to find only the first occurrence this time:

In [153]:
p = re.compile('egg.*') #the end of the string, starting with "egg"
m = p.search( mystring )
if m:
    print 'Match found: ', m.group()
else:
    print 'No match'

Match found:  eggs and ham (or yams)
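
Two details about `*` are easier to see on tiny strings. The sketch below uses made-up test strings: the first shows that `*` only applies to the single item right before it and that "any number of times" includes zero, and the second shows that `.*` stops at a line break.

```python
import re

# 'eg*' means: an "e", followed by zero or more "g"s
matches = re.findall('eg*', 'e eg egg eggg')
print(matches)

# '.' does not match a newline, so '.*' stops at the end of the line
m = re.search('egg.*', 'eggs and ham\n(or yams)')
print(m.group())
```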


The `re.sub` function allows you to use regular expressions to do a find-and-replace. We could use it to remove all punctuation from our string. We specify all the characters to look for in square brackets. First let's do a basic `findall` to see what I mean:

In [154]:
p = re.compile('[.,?!;()]') #find all the punctuation (inside square brackets, `.` is just a period)
m = p.findall( mystring )
if m:
    print 'Match found: ', m
else:
    print 'No match'

Match found:  ['.', ',', '(', ')']


Alternatively, we could use the "not" character `^` at the start of the square brackets to search for anything that's not a letter, a number, or a space:

In [155]:
p = re.compile('[^A-Za-z0-9 ]') #find anything that's not a letter, number, or space
m = p.findall( mystring )
if m:
    print 'Match found: ', m
else:
    print 'No match'

Match found:  ['-', '-', ':', '-', '.', ',', '(', ')']


Now we can remove all the punctuation in one line, like this: 

In [156]:
re.sub('[^A-Za-z0-9 ]','',mystring)

'SAMIAM has the phone number 8675309 SAM has a favorite meal it is green eggs and ham or yams'

Or we could replace all punctuation with some other character, like underscores: 

In [163]:
re.sub('[^A-Za-z0-9 ]','_',mystring)

'SAM_I_AM has the phone number_ 867_5309_ SAM has a favorite meal_ it is green eggs and ham _or yams_'

Let's replace all the digits with Xs. This time we use the special sequence `\d`, which represents "any digit".

In [162]:
re.sub(r'\d','X',mystring) #the r prefix keeps the backslash exactly as typed

'SAM-I-AM has the phone number: XXX-XXXX. SAM has a favorite meal, it is green eggs and ham (or yams)'
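
The pieces above can be combined. Here is a minimal sketch, assuming we only want to hide the phone number rather than every digit. It reuses the phone-number pattern (written with `\d` instead of `[0-9]`) and calls `sub` on the compiled pattern, which works the same way as `re.sub`. The name `masked` is mine.

```python
import re

mystring = 'SAM-I-AM has the phone number: 867-5309. SAM has a favorite meal, it is green eggs and ham (or yams)'

# 3 digits, a dash, 4 digits; the r prefix keeps the backslashes exactly as typed
p = re.compile(r'\d{3}-\d{4}')

# replace anything shaped like a phone number with a placeholder
masked = p.sub('XXX-XXXX', mystring)
print(masked)
```

For this string the result is the same as replacing every digit, but on text with other numbers in it (prices, dates) only the phone-number-shaped parts would be replaced.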

I could keep going. Regular expressions are very powerful, but also very complicated. You will probably memorize 1 or 2 basic rules, and have to look up the rest every time you use them. That's totally fine; many programmers do the same thing. There are online tools for building regular expressions and seeing how they work (like http://regexr.com/). Just Google "regular expression builder" and you should find a few.