<div>
<img src="img/python_logo.png" width="100px" style="float: left; "/> 
<div style="font-size: 40px; padding-top: 20px">Python basics</div>   
<div style="font-size: 30px; padding-top: 20px">Part 2 - Lists</div>   
</div>

 
<h2 style="clear: both">COMM4190 Spring 2025</h2>

<h3>Instructor: Matt O'Donnell (mbod@asc.upenn.edu)</h3>

-----

<div class="alert alert-info">
    
## Overview

* This notebook will cover:
    1. __`list` OBJECTS__ - are **ordered sequences** of objects. 
        * Lists are a key way to organize data so that items can be accessed and processed in a sequence (i.e. a series of repeated steps on each item)
        * An example is a `list` of `str` objects where each string is a 'word' (*token*) in a text.
    2. __`list` indexing and slicing__ - the position of a data item in a list is called its __INDEX__.
        * The first item in a list has an index of `0`
        * _Indexing_ is one way you can retrieve a single item in a list
        * _Slicing_ allows you to retrieve a contiguous sequence of items from a list
    3. __`list` specific functions__ - just as we saw with `str` objects, `list` objects have functions that are specific to working with ordered sequences of data
    4. __Working with lists of strings__ - which serve as the core data structure for representing and analyzing a text.
    

</div>
    
------

## 1. The `list` object in Python

* A __LIST__ in Python is __AN ORDERED SEQUENCE OF OBJECTS__


* To create a `list` object you use:
    1. An open square bracket `[` to mark the beginning of the `list`
    2. A closed square bracket `]` to mark the end of the `list`
    3. A comma `,` to separate each _item_ in the list
    
    
* For example, to define a `list` object with three string objects as items:
    1. `A`
    2. `b`
    3. `C`
    
  you would enter:

In [None]:
['A','b','C']

* As we saw when working with `string` objects, to make a list useful for larger tasks we need a _named pointer_ that serves as a reference to the object

In [None]:
my_list = ['A','b','C']

* In a notebook, entering the pointer name in a code cell will produce a representation of the object referenced by the pointer.

In [None]:
my_list

* You can also use the _generic function_ `print()` to display the data in a list object.

In [None]:
# display the data in the list object

print(my_list)

In [None]:
# what is the type of the object?

type(my_list)

In [None]:
# how big is the list object, i.e. how many items does it hold?

len(my_list)
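
* The steps in this section can be pulled together in one short sketch. The names `letters` and `empty` are illustrative and do not appear elsewhere in this notebook.

```python
# illustrative names, not used elsewhere in the notebook
letters = ['A', 'b', 'C']   # three str items separated by commas
empty = []                  # square brackets with nothing inside: a list with no items

print(type(letters))        # the type of the object
print(len(letters))         # how many items are in the list
print(len(empty))           # an empty list has a length of zero
```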

## 2. Indexing and slicing `list` objects


### Indexing `list` objects

* Items in a list have an __INDEX__ that allows them to be retrieved. 


* The syntax for indexing is:

   `object[INDEX]`
        
    e.g.
        
   `my_list[0]`


* The _first_ item in the list has an index of __ZERO__ [`0`]

In [None]:
my_list[0]

* The _second_ item in the list has an index of __ONE__ [`1`]

In [None]:
my_list[1]

* The _third_ item in the list has an index of __TWO__ [`2`], and so on...

In [None]:
print(my_list[2])

* Here in the code `print(my_list[2])` notice that:
    * we first retrieve the third item in the list pointed to by the `my_list` pointer
    * this will return the `string` object `'C'`
    * which is then displayed using the `print()` function

* If you try to use an index for an item that does not exist in the list, e.g. trying to get the _fourth_ item in a list with only _three_ items in it, your code will stop with an `IndexError`


```
my_list ->

index:   0    1    2     3

item:  ['A' ,'b' ,'C']   ?

```

In [None]:
# my_list points to a list object with three items
# if we try and index the fourth item using index 3

my_list[3]
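
* One way to see why index `3` fails is to compare it with the largest valid index, which is always `len(my_list) - 1`. The sketch below assumes the `try` / `except` statement, which has not been introduced in this notebook, to catch the error so the cell keeps running. The names `last_valid_index` and `error_message` are illustrative.

```python
my_list = ['A', 'b', 'C']

# valid indexes run from 0 up to len(my_list) - 1
last_valid_index = len(my_list) - 1
print(last_valid_index, my_list[last_valid_index])

# asking for the fourth item (index 3) raises an IndexError
try:
    my_list[3]
except IndexError as err:
    error_message = str(err)
    print('IndexError:', error_message)
```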

### List slicing

* You can get contiguous subsequences of a list using __SLICING__


* It works just like we saw with _slicing_ of `str` objects


* You specify:
    * A __START INDEX__
    * An __END INDEX__
    * Separated by a colon `:` character
    
    
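* __REMEMBER__ a slice is start-inclusive but end-exclusive: it includes the item at the start index and stops just before the item at the end index, so `my_list[0:2]` returns the items at indexes `0` and `1` only.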

In [None]:
my_list[0:2]
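
* A few more slices of the same three-item list, as a minimal sketch of the start-inclusive, end-exclusive rule. The pointer names on the left are illustrative.

```python
my_list = ['A', 'b', 'C']

first_two = my_list[0:2]    # items at index 0 and 1; index 2 is excluded
last_two = my_list[1:3]     # items at index 1 and 2
from_start = my_list[:2]    # an omitted start index defaults to 0
to_end = my_list[1:]        # an omitted end index runs to the end of the list

print(first_two, last_two, from_start, to_end)

# slicing builds a new list; the original list is unchanged
print(my_list)
```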

## 4. Representing a text as a _list of strings_


### Turning a `string` of text into a `list` of words (tokens)

* One of the string object functions in Python is the `.split()` function

In [None]:
help(str.split)

In [None]:
sent = "This, is a sentence!..."

In [None]:
sent.split()
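
* Calling `.split()` with no argument treats any run of whitespace (spaces, tabs, newlines) as a single delimiter. The sketch below contrasts that with passing an explicit `' '` delimiter. The string `spaced` is a made-up example, not part of the notebook's data.

```python
# made-up example string with a run of spaces, a tab and a trailing newline
spaced = "This,   is a\tsentence!...\n"

tokens_default = spaced.split()       # any run of whitespace is one delimiter
tokens_on_space = spaced.split(' ')   # every single space is a delimiter

print(tokens_default)
print(tokens_on_space)                # note the empty strings and the leftover tab/newline
```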

* Here the _tokenization_ uses just runs of one or more whitespace characters as the delimiter between items, so punctuation stays attached to the words


* Often we want to _normalize_ case, i.e. put all words into lower- (or sometimes upper-) case and also strip non-alphabetical characters, e.g. punctuation.


* The `.lower()`, `.replace()` and `.translate()` string functions can be used for this

In [None]:
sent_lc=sent.lower()
print(sent_lc)

* `.replace()` allows you to replace all instances of one substring (here a single character) with another; it returns a new string and leaves the original unchanged

In [None]:
sent_lc.replace('.','')

* A whole series of calls to `.replace()` can be chained together like this:

In [None]:
sent_lc_no_punc=sent_lc.replace('.','').replace(',','').replace('!','')

In [None]:
print(sent_lc_no_punc)

* The `.translate()` function applies a mapping between characters and their replacements. The mapping is built with `str.maketrans()`, whose third argument lists the characters to delete

In [None]:
rdict = str.maketrans('','', '!,.#$%#()')
sent_lc.translate(rdict)
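
* To see what `str.maketrans()` actually builds, the sketch below prints the mapping before applying it. It also shows `string.punctuation`, a standard-library constant listing the ASCII punctuation characters, as one possible alternative to typing the characters by hand. The names `delete_table`, `strip_all`, `stripped` and `stripped_all` are illustrative.

```python
import string

sent_lc = "this, is a sentence!..."

# three-argument form: every character in the third string is mapped to None (deleted)
delete_table = str.maketrans('', '', '!,.')
print(delete_table)          # a dict keyed by character code points

stripped = sent_lc.translate(delete_table)
print(stripped)

# alternative: delete every ASCII punctuation character
strip_all = str.maketrans('', '', string.punctuation)
stripped_all = sent_lc.translate(strip_all)
print(stripped_all)
```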

* Now we have a normalized and punctuation-stripped string, ready to be split into a list of words


* We can put all these steps together into a __BLOCK__ of code (see more discussion below)

In [None]:
sent = "This, is a sentence!..."
sent=sent.lower()
rdict = str.maketrans('','', '!,.')
sent=sent.translate(rdict)
tokens=sent.split()

In [None]:
tokens

In [None]:
tokens[0]

In [None]:
tokens[2]

In [None]:
tokens[1:3]
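
* Because the same steps are needed for every text, they could also be wrapped in a reusable function. This is only a sketch: the function name `tokenize` and its `punctuation` parameter are hypothetical and not part of the notebook, and defining functions with `def` is assumed here without being introduced.

```python
# hypothetical helper, not defined elsewhere in the notebook
def tokenize(text, punctuation='!,."'):
    """Lower-case text, delete the given punctuation characters, split on whitespace."""
    table = str.maketrans('', '', punctuation)
    return text.lower().translate(table).split()

tokens = tokenize("This, is a sentence!...")
print(tokens)
print(tokens[0], tokens[2], tokens[1:3])
```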

### Example: A longer text

In [None]:
three_bears_text='''
Once upon a time, there was a little girl named Goldilocks.  She  went for a walk in the forest.  
Pretty soon, she came upon a house.  She knocked and, when no one answered, she walked right in. 

At the table in the kitchen, there were three bowls of porridge. Goldilocks was hungry.  
She tasted the porridge from the first bowl. 

"This porridge is too hot!" she exclaimed.

So, she tasted the porridge from the second bowl.

"This porridge is too cold," she said

So, she tasted the last bowl of porridge.

"Ahhh, this porridge is just right," she said happily and she ate it all up.
'''

In [None]:
three_bears_text_lc=three_bears_text.lower()
rdict = str.maketrans('','', '!,."')
three_bears_text_lc_np=three_bears_text_lc.translate(rdict)
three_bears_tokens=three_bears_text_lc_np.split()

In [None]:
len(three_bears_tokens)

In [None]:
three_bears_tokens

#### Slicing the list of tokens

In [None]:
three_bears_tokens[20:30]

#### Using some list functions

* Using `.count(obj)` to find the number of times a token occurs

In [None]:
# how many times does 'she' occur in the list of tokens?
three_bears_tokens.count('she')

In [None]:
# how many times does 'bears' occur in the list of tokens?
three_bears_tokens.count('bears')

In [None]:
# how many times does 'porridge' occur in the list of tokens?
three_bears_tokens.count('porridge')
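
* `.count()` compares whole items, so it is case-sensitive, does not match parts of a token, and returns `0` rather than an error when the item is absent. A minimal sketch on a small made-up token list (`sample_tokens` is illustrative):

```python
# small made-up token list for illustration
sample_tokens = ['she', 'tasted', 'the', 'porridge', 'she', 'said']

she_count = sample_tokens.count('she')       # exact matches only
bears_count = sample_tokens.count('bears')   # absent token gives 0, not an error
upper_count = sample_tokens.count('She')     # case matters
partial_count = sample_tokens.count('sh')    # part of a token does not match

print(she_count, bears_count, upper_count, partial_count)
```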

* Using `.index(obj, (sidx))` to find the first instance of an item in a list
    * The second optional _argument_ of the `.index()` function gives the index to start searching from
    * By default (i.e. when it is not specified) this is `0`, the beginning of the list

In [None]:
# what is the index of the first instance of 'porridge'?
three_bears_tokens.index('porridge')

In [None]:
three_bears_tokens[48]

In [None]:
# what is the next one? - start +1 from the first
three_bears_tokens.index('porridge', 49)

In [None]:
three_bears_tokens[48:56]
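
* Repeating the "start one past the last match" step finds every position of a token. The sketch below assumes a `while` loop, which has not been introduced in this notebook, and uses `.count()` to decide when to stop, because `.index()` raises a `ValueError` when the item is not found. The list `sample_tokens` and the names `positions`, `start` and `found` are illustrative.

```python
# small made-up token list for illustration
sample_tokens = ['she', 'tasted', 'the', 'porridge', 'this', 'porridge', 'is', 'too', 'hot']

positions = []   # indexes where 'porridge' was found
start = 0        # index to start searching from

# keep searching until we have found as many positions as .count() reports
while len(positions) < sample_tokens.count('porridge'):
    found = sample_tokens.index('porridge', start)
    positions.append(found)
    start = found + 1    # next search begins just after this match

print(positions)
```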