# Lab Session 2 - Review, Code blocks, Conditions, Loops, Functions (1/30/19)

## Objectives

* **Review handling text with `strings` and `lists` of words**

* **Understand how whitespace/indentation functions in Python**
      
* **Reading text from a file**
      
* **Using conditions to allow optional steps**
    
* **Repeating steps and walking through a list using a loop**

* **Grouping steps into a unit with functions**


### 1. Review: text in `strings`

* create a string object with 6 characters `B E A R S !`
* in a notebook/interactive context, the value of the last expression in a cell is passed back to you and displayed

In [2]:
'BEARS!'

'BEARS!'

* the `'` quotes around the output indicate that it is a string type object
* using the `print()` function gives a prettier formatted version of the object (including any layout characters like line feeds etc)

In [3]:
print('BEARS!')

BEARS!


* we can apply functions to this string

In [4]:
len('BEARS!')

6

In [5]:
'BEARS!'.lower()

'bears!'

* BUT each time we retype it we are creating another string object and after the line of code has been executed we have no way of getting back to that same object
* It is 'orphaned' in the Python memory workspace!

![](pointer1.png)

* when we do something to it, like using `replace` to swap the exclamation point for an empty string (`''`)
* the result is also a new object that is orphaned

In [6]:
'BEARS!'.replace('!','')

'BEARS'

![](pointer2.png)

* So we can assign a named pointer to the string object and then we have a way to get back to it...

In [7]:
text = 'BEARS!'

![](pointer3.png)

In [8]:
print(text)

BEARS!


In [9]:
print(text*5)

BEARS!BEARS!BEARS!BEARS!BEARS!


In [10]:
print(text.lower())

bears!


* **BUT** notice that if we do something to the object pointed to by the name `text`, it doesn't change the object

In [11]:
print("This is result of calling replace('!','') on text >>>", text.replace('!',''))
print()
print("The current object pointed to by text is >>>", text)

This is result of calling replace('!','') on text >>> BEARS

The current object pointed to by text is >>> BEARS!


![](pointer4.png)

* **SO** to get the behavior we might have expected, that is, that `text` would point to the result of stripping the `!` we need to reassign the pointer:

In [12]:
text = text.replace('!','')

* this does the following:

![](pointer5.png)

In [13]:
print("The current object pointed to by text is >>>", text)

The current object pointed to by text is >>> BEARS


In [14]:
text = 'BEARS!'
text2 = text

In [15]:
print('text >>> ', text)
print()
print('text2 >>> ', text2)

text >>>  BEARS!

text2 >>>  BEARS!


![](pointer6.png)

In [16]:
text = text.replace('!','')

In [17]:
print('text >>> ', text)
print()
print('text2 >>> ', text2)

text >>>  BEARS

text2 >>>  BEARS!


![](pointer7.png)

In [18]:
text2=text.lower()

In [19]:
print('text >>> ', text)
print()
print('text2 >>> ', text2)

text >>>  BEARS

text2 >>>  bears


![](pointer8.png)
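
* The pointer diagrams above can also be checked in code. This is a minimal sketch using Python's built-in `is` operator, which tests whether two names point to the same object; the names `a`, `b` and `c` are made up for illustration

```python
# Two names can point to the same string object
a = 'BEARS!'
b = a
print(a is b)      # True: both names point to one object

# String functions return a NEW object; the original is untouched
c = a.replace('!', '')
print(c)           # BEARS
print(a)           # BEARS!
print(c is a)      # False

# Reassigning one name does not affect the other
a = a.lower()
print(a)           # bears!
print(b)           # BEARS!
```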

### Turning a `string` of text into a `list` of words (tokens)

* One of the string object functions in Python is the `.split()` function

In [20]:
help(str.split)

Help on method_descriptor:

split(...)
    S.split(sep=None, maxsplit=-1) -> list of strings
    
    Return a list of the words in S, using sep as the
    delimiter string.  If maxsplit is given, at most maxsplit
    splits are done. If sep is not specified or is None, any
    whitespace string is a separator and empty strings are
    removed from the result.



In [21]:
sent = "This, is a sentence!..."

In [22]:
sent.split()

['This,', 'is', 'a', 'sentence!...']

* Here the _tokenization_ uses just one or more whitespace characters as the delimiter between items


* Often we want to _normalize_ case, i.e. put all words into lower- (or sometimes upper-) case and also strip non-alphabetical characters, e.g. punctuation.


* The `.lower()`, `.replace()` and `.translate()` string functions can be used for this

In [23]:
sent_lc=sent.lower()
print(sent_lc)

this, is a sentence!...


* `.replace()` allows you to replace all instances of one character (or substring) with another

In [24]:
sent_lc.replace('.','')

'this, is a sentence!'

* But a whole series of calls to `.replace()` can be chained together like this:

In [25]:
sent_lc_no_punc=sent_lc.replace('.','').replace(',','').replace('!','')

In [26]:
print(sent_lc_no_punc)

this is a sentence


* The `.translate()` function allows for a mapping to be created between a list of characters and their replacement

In [27]:
rdict = str.maketrans('','', '!,.')
sent_lc.translate(rdict)

'this is a sentence'

In [28]:
rdict

{33: None, 44: None, 46: None}

In [29]:
sent_lc.translate(rdict).split()

['this', 'is', 'a', 'sentence']

* Now we have a normalized and punctuation stripped list of words


* We can put all these steps together into a __BLOCK__ of code (see more discussion below)

In [30]:
sent = "This, is a sentence!..."
sent=sent.lower()
rdict = str.maketrans('','', '!,.')
sent=sent.translate(rdict)
sent.split()

['this', 'is', 'a', 'sentence']

In [31]:
help(str.maketrans)

Help on built-in function maketrans:

maketrans(x, y=None, z=None, /)
    Return a translation table usable for str.translate().
    
    If there is only one argument, it must be a dictionary mapping Unicode
    ordinals (integers) or characters to Unicode ordinals, strings or None.
    Character keys will be then converted to ordinals.
    If there are two arguments, they must be strings of equal length, and
    in the resulting dictionary, each character in x will be mapped to the
    character at the same position in y. If there is a third argument, it
    must be a string, whose characters will be mapped to None in the result.
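
* Typing out every punctuation character by hand gets tedious. As a sketch of one alternative, the standard library's `string.punctuation` constant holds the ASCII punctuation characters and can be passed as the third argument to `str.maketrans()`; the names `strip_punct`, `example` and `example_tokens` are made up for illustration

```python
import string

# string.punctuation is a ready-made string of ASCII punctuation characters
strip_punct = str.maketrans('', '', string.punctuation)

example = "This, is a sentence!..."
example_tokens = example.lower().translate(strip_punct).split()
print(example_tokens)   # ['this', 'is', 'a', 'sentence']
```

* Note that this only covers ASCII punctuation; characters such as curly quotes would need to be added to the string yourself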



### Indexing and slicing of a `string`

* A string is a sequence (or list) of characters so we can refer to specific characters in the string using **INDEXING**
* In Python indexes begin at **ZERO**/**0**
* So `'bear'[1]` is the character `e` and *not* `b`, which is `'bear'[0]`

In [32]:
print('bear'[1])

e


In [33]:
print('bear'[0])

b


In [34]:
word = "Bear!"

In [35]:
word[1]

'e'

In [36]:
word[3]

'r'

In [37]:
word[-1]

'!'

![](string-indexing.png)

* A **SLICE** is a contiguous sequence of characters in a string

In [38]:
word[0:3]

'Bea'

* The start index is **inclusive** and the end index is **exclusive**
* But it is better to think of the indexes as points **BEFORE** a character, just like in the figure above
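
* A few more slices of the same string may help here. This is a small sketch showing what happens when the start or end index is left out or is negative

```python
word = "Bear!"

print(word[:3])         # Bea   (start omitted: begin at index 0)
print(word[1:])         # ear!  (end omitted: run to the end of the string)
print(word[-2:])        # r!    (negative start: count back from the end)
print(word[1:4])        # ear
print(len(word[1:4]))   # 3, i.e. end index minus start index
```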

## Lists

* We have just started to explore lists by splitting a string into tokens, e.g.


In [39]:
'This is a sentence with NINE words in it'.split()

['This', 'is', 'a', 'sentence', 'with', 'NINE', 'words', 'in', 'it']

* A list is a sequence of items with each item separated by a `,` (comma) and surrounded by open and close square brackets
* To work with a list we again will want to point to it with a named pointer

In [40]:
sent = 'This is a sentence with NINE words in it'
tokens = sent.split()
print('There are', len(tokens), 'tokens in the string:', sent)
print('They are', tokens)

There are 9 tokens in the string: This is a sentence with NINE words in it
They are ['This', 'is', 'a', 'sentence', 'with', 'NINE', 'words', 'in', 'it']


* To create a list (which can contain anything) you use:
    * `[ item1, item2, ..., itemN]`
    
* For example:

In [41]:
tokens

['This', 'is', 'a', 'sentence', 'with', 'NINE', 'words', 'in', 'it']

In [42]:
tokens[7]

'in'

In [43]:
len(tokens)

9

In [44]:
tokens[-1]

'it'

In [45]:
tokens[2:4]

['a', 'sentence']

In [46]:
tokens[:3]

['This', 'is', 'a']

In [47]:
list_of_ints = [2,3,5,7,11]

In [48]:
print(list_of_ints)

[2, 3, 5, 7, 11]


* This is how the items in the list are indexed:
![](list-indexing.png)

* And you can **INDEX** and **SLICE** items in a list to retrieve them


In [49]:
list_of_ints[2]

5

In [50]:
list_of_ints[-2:]

[7, 11]
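
* Since a list can contain anything, the items do not all have to be the same type. A small sketch, with a made-up list called `mixed`, showing that indexing and slicing work the same way whatever the items are

```python
# A list holding a string, an integer, a float and another list
mixed = ['bear', 42, 3.5, ['a', 'nested', 'list']]

print(len(mixed))      # 4
print(mixed[0])        # bear
print(mixed[-1])       # ['a', 'nested', 'list']
print(mixed[-1][1])    # nested  (index into the inner list)
print(mixed[1:3])      # [42, 3.5]
```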

## Code blocks

* So far a lot of the Python we have been working on has been single function calls or definitions, each in an individual cell

In [51]:
sent = 'This is a sentence!'

In [52]:
print(sent)

This is a sentence!


In [53]:
sent = sent.replace('!','')

In [54]:
print(sent)

This is a sentence


In [55]:
sent = sent.lower()

In [56]:
print(sent)

this is a sentence


* and so on... but we can also put them all together into a single cell and each step will be executed one after the other:

In [57]:
sent = 'This is a sentence!'
print(sent)
sent = sent.replace('!','')
print(sent)
sent = sent.lower()
print(sent)

This is a sentence!
This is a sentence
this is a sentence


In [58]:
sent = 'This is a sentence!'
sent
sent = sent.replace('!','')
sent
sent = sent.lower()
sent

'this is a sentence'

* This is called a **BLOCK** of code
* Python uses indentation/whitespace to work out which steps belong together, for instance in function definitions

### What are functions?

* __Functions__ in their simplest form are blocks of code that are grouped together and pointed to with a name 


In [59]:
def do_some_steps():
    sent = 'This is a sentence!'
    print(sent)
    sent = sent.replace('!','')
    print(sent)
    sent = sent.lower()
    print(sent)

* What we've done here is create a function to carry out these steps every time it is called with `do_some_steps()`
* Everything within the function is indented one level

In [60]:
do_some_steps()

This is a sentence!
This is a sentence
this is a sentence


* We can `call` the function `do_some_steps` twice

In [61]:
do_some_steps()
do_some_steps()

This is a sentence!
This is a sentence
this is a sentence
This is a sentence!
This is a sentence
this is a sentence


* So to _define_ a function you use the syntax:

```
    def function_name():
        ** CODE BLOCK **
```

* And can _call_ (or _execute_) the code using `function_name()`

In [62]:
def another_function():
    print('in my function')
    print('still in it')
    
print('where am i?')

where am i?


In [63]:
another_function()

in my function
still in it


#### Function arguments and return values

* Functions become really useful and flexible when you can pass certain __INPUTS__ to them to use

In [64]:
def say_hello1():
    print('Hello whatever your name is!')

In [65]:
say_hello1()

Hello whatever your name is!


* Here the function just executes the fixed `print()` function


* But we can add a parameter or argument to the function definition like this:

In [66]:
def say_hello2(name):
    print('Hello',name)

* Now when we call `say_hello2()` we must provide a value for `name`

In [69]:
say_hello2()

TypeError: say_hello2() missing 1 required positional argument: 'name'

In [71]:
say_hello2('Sigfried')

Hello Sigfried


In [72]:
say_hello2('Mildred')

Hello Mildred


* You can also use the parameter name in your function call (this is more relevant when you have multiple arguments)

In [73]:
say_hello2(name="Sigfried")

Hello Sigfried


* Arguments can also have _default_ values which will be used if you don't pass a value when you call the function. This will remove the error of trying to call it without a value.

In [74]:
def say_hello3(name="You"):          # here the default value for name will be 'You'
    print('Hello {}!'.format(name))

In [75]:
say_hello3('Sigfried')

Hello Sigfried!


In [76]:
say_hello3()

Hello You!
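
* The functions so far only `print()` their result. A function can instead hand a value back to the caller with `return`. This is a sketch using a made-up function `make_greeting` with two parameters, both with default values, which also shows why naming the arguments helps when there are several

```python
def make_greeting(name="You", greeting="Hello"):
    # Build the string and hand it back to the caller instead of printing it
    return '{} {}!'.format(greeting, name)

print(make_greeting())                                # Hello You!
print(make_greeting('Sigfried'))                      # Hello Sigfried!
print(make_greeting(greeting='Hi', name='Mildred'))   # Hi Mildred!

# Because the result is returned, it can be stored and reused
message = make_greeting('Mildred')
print(message.upper())                                # HELLO MILDRED!
```

* With named arguments the order in the call no longer matters, so `greeting` can come before `name`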


## Conditions in Python

* Frequently we want to only do something if some condition is met (or is `True`)
* For instance test whether a string is all in lowercase with the `.islower()` function


In [77]:
sent2 = 'this is all LOWERCASE'
sent2.islower()

False

In [78]:
sent2 = 'this is all lowercase'

In [79]:
sent2.islower()

True

In [80]:
if sent2.islower():
    print('The string is all lowercase!')

The string is all lowercase!


* Note the use of an indented block (here just one line but it could be multiple lines)

In [81]:
if sent2.islower():
    print('The string is all lowercase!')
    sent3 = sent2.lower()
    print("But I've made sent3 all lowercase")
    print(">>>",sent3)
    print()
    print('See?')

The string is all lowercase!
But I've made sent3 all lowercase
>>> this is all lowercase

See?


In [82]:
sent2 = sent2.lower()

if sent2.islower():
    print('The string is all lowercase!')

The string is all lowercase!


* These conditions (i.e., one being equal to two, or greater than two) should never be `True`

In [83]:
1==2

False

In [84]:
if 1>2:
    print('Something is wrong with Python math!')

In [85]:
if 1==2:
    print('Something is wrong with Python math!')

In [86]:
if 1<2:
    print('Phew looks like we are good with Python math!')

Phew looks like we are good with Python math!


In [87]:
if 1<=2:
    print('Phew looks like we are good with Python math!')

Phew looks like we are good with Python math!


* We can invert a conditional test with `not`

In [88]:
if not 1>2:
    print('Phew looks like we are good with Python math!')

Phew looks like we are good with Python math!


* And also have an `else` block which is executed if the test fails

In [89]:
if 1>2:
    print('Something is wrong with Python math!')
else:
    print('Phew looks like we are good with Python math!')

Phew looks like we are good with Python math!


### TASK

* Try defining some string objects and using the various string functions that give a `True` or `False` value in some conditional statements

* E.g.
    * `.islower()`
    * `.isupper()`
    * `.startswith()`
    * `.endswith()`
    * `.isdecimal()`
    * `.isdigit()`
    * `.isnumeric()`
    * `.istitle()`



* Try inverting the test with `not`
* Define some `else` blocks for some of them

In [90]:
if not "abc".islower():
    print('String is not all lowercase')
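
* As a starting point for the task, here is a sketch combining a few of the string tests listed above with `not` and `else`; the names `animal` and `year` and their values are made up for illustration

```python
animal = 'Bears'

if animal.istitle():
    print(animal, 'is in title case')
else:
    print(animal, 'is not in title case')

# Inverting a test with not
if not animal.endswith('!'):
    print(animal, 'does not end with an exclamation point')

year = '2019'

if year.isdigit():
    print(year, 'contains only digits')
else:
    print(year, 'contains something other than digits')
```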

## Loops

* loops are another code structure you will see a lot
* they are a way of repeating steps in code
* or walking through each item in a list

In [91]:
my_list = ['a', 'b','c']

In [92]:
for item in my_list:
    print('item is:', item)

item is: a
item is: b
item is: c


* A common idiom is to use the `range()` function to create a sequence of numbers (shown here as a list)

In [93]:
list(range(10))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

* And then use that list as a way to repeat some block of code a certain number of times

In [94]:
for i in range(10):
    print('Hello!')

Hello!
Hello!
Hello!
Hello!
Hello!
Hello!
Hello!
Hello!
Hello!
Hello!


In [95]:
for i in range(5):
    print('Say "hello"', i, 'time(s)')

Say "hello" 0 time(s)
Say "hello" 1 time(s)
Say "hello" 2 time(s)
Say "hello" 3 time(s)
Say "hello" 4 time(s)


In [96]:
for item in [1,21,3023,4,5]:
    print('Current value  of item =',item)

Current value  of item = 1
Current value  of item = 21
Current value  of item = 3023
Current value  of item = 4
Current value  of item = 5


In [97]:
def tokenize(text):
    rep_chars=str.maketrans('','','!.,-')
    normalized_string = text.lower().translate(rep_chars)
    tokens = normalized_string.split()
    return tokens

In [98]:
tokenize('This is a sentence. This!!! is, another one...')

['this', 'is', 'a', 'sentence', 'this', 'is', 'another', 'one']

In [99]:
for token in tokenize('This is a sentence. This!!! is, another one...'):
    print(token)

this
is
a
sentence
this
is
another
one


In [100]:
tokens = tokenize('This is a sentence. This!!! is, another one...')

for token in tokens:
    
    if token == 'is':
        print(token)

is
is
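
* Instead of printing inside the loop, a loop and a condition can be used together to count or collect items. A sketch under the assumption that we want to count the token `is` and keep tokens longer than four characters; the names `count` and `long_tokens` are made up, and `.append()` is the list function that adds an item to the end of a list

```python
tokens = ['this', 'is', 'a', 'sentence', 'this', 'is', 'another', 'one']

count = 0          # how many times we have seen 'is'
long_tokens = []   # tokens with more than four characters

for token in tokens:
    if token == 'is':
        count = count + 1
    if len(token) > 4:
        long_tokens.append(token)

print(count)         # 2
print(long_tokens)   # ['sentence', 'another']
```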


## Working with the _Personals Ads_ corpus

* The file `personals_ads.txt` has the NLTK _Personals Ads_ corpus (`text8` in the lecture notebook)

* Each ad is on a separate line

* Ads are separated by an empty line (`\n\n`)

In [101]:
with open('personals_ads.txt') as f:   # the file is closed automatically after this block
    ads = f.read()

In [102]:
print(ads[:300])

25 SEXY MALE , seeks attrac older single lady , for discreet encounters . 

35YO Security Guard , seeking lady in uniform for fun times . 

40 yo SINGLE DAD , sincere friendly DTE seeks r / ship with fem age open S / E 

44yo tall seeks working single mum or lady below 45 fship rship . Nat Open 6 . 


### Tasks

* Create a list of ads by splitting the text using the delimiter `\n\n`
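
* One possible way to approach this, sketched on a short stand-in string rather than the real file. The string `sample` is an assumption: two ads copied from the output above, separated by an empty line. `.strip()` removes leading and trailing whitespace from each piece

```python
# A short stand-in for the contents of personals_ads.txt (assumed layout)
sample = ("25 SEXY MALE , seeks attrac older single lady , for discreet encounters . \n"
          "\n"
          "35YO Security Guard , seeking lady in uniform for fun times . \n")

ad_list = []
for ad in sample.split('\n\n'):
    ad = ad.strip()    # drop leading/trailing spaces and line feeds
    if ad:             # an empty string counts as False, so empty pieces are skipped
        ad_list.append(ad)

print(len(ad_list))    # 2
print(ad_list[1])      # 35YO Security Guard , seeking lady in uniform for fun times .
```

* The same loop should work on the full text by replacing `sample` with `ads`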


In [103]:
from collections import Counter

In [104]:
words_dist=Counter(ads.split())

In [105]:
words_dist.most_common()

[(',', 533),
 ('.', 347),
 ('/', 104),
 ('for', 99),
 ('and', 73),
 ('to', 69),
 ('lady', 68),
 ('-', 66),
 ('seeks', 60),
 ('a', 52),
 ('with', 43),
 ('ship', 33),
 ('&', 30),
 ('S', 29),
 ('relationship', 29),
 ('fun', 28),
 ('slim', 27),
 ('build', 27),
 ('o', 26),
 ('in', 25),
 ('s', 24),
 ('y', 23),
 ('50', 23),
 ('I', 22),
 ('movies', 22),
 ('good', 21),
 ('non', 21),
 ('smoker', 21),
 ('honest', 19),
 ('dining', 19),
 ('out', 19),
 ('rship', 18),
 ('looking', 18),
 ('age', 17),
 ('attractive', 17),
 ('who', 17),
 ('like', 17),
 ('friendship', 17),
 ('40', 16),
 ('45', 16),
 ('35', 16),
 ('5', 16),
 ('MALE', 15),
 ('times', 15),
 ('male', 15),
 ('Looking', 15),
 ('seeking', 14),
 ('r', 14),
 ('open', 14),
 ('the', 14),
 ('female', 14),
 ('life', 14),
 ("''", 14),
 ('fit', 14),
 ('or', 13),
 ('LADY', 13),
 ('guy', 13),
 ('no', 13),
 ('GSOH', 13),
 ('music', 13),
 ('enjoy', 13),
 ('meet', 13),
 ('ft', 13),
 ('30', 13),
 ('f', 13),
 ('tall', 12),
 ('of', 12),
 ('be', 12),
 ('employe

In [106]:
ads_tokens=ads.split()
ads_tokens

['25',
 'SEXY',
 'MALE',
 ',',
 'seeks',
 'attrac',
 'older',
 'single',
 'lady',
 ',',
 'for',
 'discreet',
 'encounters',
 '.',
 '35YO',
 'Security',
 'Guard',
 ',',
 'seeking',
 'lady',
 'in',
 'uniform',
 'for',
 'fun',
 'times',
 '.',
 '40',
 'yo',
 'SINGLE',
 'DAD',
 ',',
 'sincere',
 'friendly',
 'DTE',
 'seeks',
 'r',
 '/',
 'ship',
 'with',
 'fem',
 'age',
 'open',
 'S',
 '/',
 'E',
 '44yo',
 'tall',
 'seeks',
 'working',
 'single',
 'mum',
 'or',
 'lady',
 'below',
 '45',
 'fship',
 'rship',
 '.',
 'Nat',
 'Open',
 '6',
 '.',
 '2',
 '35',
 'yr',
 'old',
 'OUTGOING',
 'M',
 'seeks',
 'fem',
 '28',
 '-',
 '35',
 'for',
 'o',
 '/',
 'door',
 'sports',
 '-',
 'w',
 '/',
 'e',
 'away',
 'A',
 'professional',
 'business',
 'male',
 ',',
 'late',
 '40s',
 ',',
 '6',
 'feet',
 'tall',
 ',',
 'slim',
 'build',
 ',',
 'well',
 'groomed',
 ',',
 'great',
 'personality',
 ',',
 'home',
 'owner',
 ',',
 'interests',
 'include',
 'the',
 'arts',
 'travel',
 'and',
 'all',
 'things',
 'good

In [107]:
ads_tokens.index('male')

86
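
* The counts above treat `MALE` and `male` as different words and count punctuation marks as tokens. As a sketch of one way to tidy this up, the `tokenize()` function from the loops section can be applied before counting. The string `sample` is a made-up stand-in holding the first two ads from the output above

```python
from collections import Counter

def tokenize(text):
    # Same steps as the tokenize() function defined earlier
    rep_chars = str.maketrans('', '', '!.,-')
    return text.lower().translate(rep_chars).split()

# Stand-in text: the first two ads from the output above
sample = ("25 SEXY MALE , seeks attrac older single lady , for discreet encounters . "
          "35YO Security Guard , seeking lady in uniform for fun times .")

counts = Counter(tokenize(sample))

print(counts['lady'])   # 2
print(counts['male'])   # 1  (MALE has been lowercased)
print(counts[','])      # 0  (punctuation has been stripped)
```

* A `Counter` returns `0` for an item it has never seen, which is why looking up `','` does not raise an error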