# Notebook 4: range, zip, enumerate, and useful string methods

This notebook expands on for-loops. It introduces `range`, a way to iterate over numbers within a certain range, which in turn gives access to index-based iteration over containers. It also shows how to use `zip` and `enumerate`.
It also discusses several additional string methods such as `split` and `join`.
Finally, the homework will lead you to use what you have learned so far (specifically, for-loops, if statements, and lists) to implement $n$-gram extraction.

## For-loops: reminder

A _for-loop_ iterates over some object (an **iterable**) and considers the elements of that object in order.

In [1]:
for letter in "apple":
	print(letter)

a
p
p
l
e


In [2]:
for letter in "apple":
	print("Hello")

Hello
Hello
Hello
Hello
Hello


In [3]:
indexes = [1,0,-1,3]
word = "linguistics"
for index in indexes:
	print(word[index])

i
l
s
g


In order to print indexes of items in iterables, we can implement a **counter**, i.e. a variable that will increase every time some condition is met. In this case, we will set the counter to $0$ and increase it with every iteration.

In [4]:
index = 0
for letter in word:
	print(letter, index)
	index += 1

l 0
i 1
n 2
g 3
u 4
i 5
s 6
t 7
i 8
c 9
s 10


**Example:** Let's say we are given three lists: a list of states (`states`), a list of average temperatures for those states in the same order (`temperatures`), and a list of states that are considered part of New England (`new_england`).

In [5]:
states = ["Alabama","Alaska","Arizona","Arkansas","California","Colorado",
  "Connecticut","Delaware","Florida","Georgia","Hawaii","Idaho","Illinois",
  "Indiana","Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland",
  "Massachusetts","Michigan","Minnesota","Mississippi","Missouri","Montana",
  "Nebraska","Nevada","New Hampshire","New Jersey","New Mexico","New York",
  "North Carolina","North Dakota","Ohio","Oklahoma","Oregon","Pennsylvania",
  "Rhode Island","South Carolina","South Dakota","Tennessee","Texas","Utah",
  "Vermont","Virginia","Washington","West Virginia","Wisconsin","Wyoming"]

temperatures = [62.8, 26.6, 60.3, 60.4, 59.4, 45.1, 49, 55.3, 70.7, 63.5,
                70, 44.4, 51.8, 51.7, 47.8, 54.3, 55.6, 66.4, 41, 54.2, 
                47.9, 44.4, 41.2, 63.4, 54.5, 42.7, 48.8, 49.9, 43.8, 52.7, 
                53.4, 45.4, 59, 40.4, 50.7, 59.6, 48.4, 48.8, 50.1, 62.4, 
                45.2, 57.6, 64.8, 48.6, 42.9, 55.1, 48.3, 51.8, 43.1, 42]

new_england = ["Maine", "Vermont", "New Hampshire", "Massachusetts", "Connecticut",
               "Rhode Island"]

The code below prints average temperatures for New England states. The variable `index` stores the index of an item we are currently looking at.

In [7]:
index = 0
for state in states:
	if state in new_england:
		print(state+":", temperatures[index])
	index +=1

Connecticut: 49
Maine: 41
Massachusetts: 47.9
New Hampshire: 43.8
Rhode Island: 50.1
Vermont: 42.9


**Practice:** A helpful function for the following practice exercise is `sum`, which takes a list of numbers as an argument and returns their sum. FYI, the functions `min` and `max` are available as well.

In [9]:
numbers = [1, 18.3, 9, 0, 3.14]
print("Sum of those numbers is", sum(numbers))
print("The smallest number is", min(numbers))
print("The largest number is", max(numbers))

Sum of those numbers is 31.44
The smallest number is 0
The largest number is 18.3


Modify the code above to print the average temperature in New England. (You can use the `round` function to make the resulting number prettier.)

In [13]:
#average = sum(numbers)/len(numbers)
#print("The average is", round(average))

sum_temp = 0
count = 0
index = 0
for state in states:
	if state in new_england:
		sum_temp += temperatures[index]
		count += 1
	index += 1
avg_temp = round(sum_temp/count, 1)
print(avg_temp)

45.8


In [15]:
index = 0
ne_temps = []
for state in states:
	if state in new_england:
		ne_temps.append(temperatures[index])
	index += 1
avg_temp = round(sum(ne_temps)/len(ne_temps), 1)
print(avg_temp)

45.8


### Modifying strings

Characters of a string cannot be reassigned by index, i.e. the existing parts of a string cannot be modified directly:

In [16]:
string = "apple"
string[-1] = "b"

TypeError: 'str' object does not support item assignment
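
One way around this restriction, sketched below, is to build a new string from slices of the old one. The variable name `new_string` is only illustrative.

```python
string = "apple"

# Take everything except the last character and append the new character.
new_string = string[:-1] + "b"

print(new_string)  # applb
print(string)      # apple (the original string is unchanged)
```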

If we have a task to "mask" all vowels from a text, we will need to create a new string based on the old one.

**Practice** Can you think of how to do it?

In [21]:
vowels = "aoiue"
text = "This is a sentence that should contain no vowels."

#try it here by yourself!
masked_text = ""
for char in text:
	if char not in vowels:
		masked_text += char
	else:
		masked_text += "*"
print(masked_text)

Th*s *s * s*nt*nc* th*t sh**ld c*nt**n n* v*w*ls.


**Practice:** You are given a string `alphabet` that contains all English letters, and a string `text`.

In [22]:
alphabet = "abcdefghijklmnopqrstuvwxyz"
text = "A chessboard appeared, but it was triangular, and so big that only the nearest point could be seen."

Write code that makes this string lowercase and removes punctuation from it.

In [28]:
new_text = ""
for char in text.lower():
	if char in alphabet or char == " ":
		new_text += char
	else:
		continue
print(new_text)

a chessboard appeared but it was triangular and so big that only the nearest point could be seen


## Range

Say that you want to print the word "hello" ten times. How would you do it? The most trivial answer is "I'll write _print("Hello")_ ten times". But how would you do it with a for-loop? Can you think of a way to make the loop iterate exactly $10$ times?

In [30]:
# Try it here!
list(range(1,20)) # first num is inclusive, second num is exclusive

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

**Range** is a numeric iterable defined by three arguments: _start_, _end_, and _step_. These arguments behave exactly as they do in slices: _start_ defines the initial numerical value, _end_ is the first value not included in the range, and _step_ defines the difference between each value and the one that follows it.

In [31]:
for value in range(500, 1000):
	print(value)

500
501
502
...
998
999


In [32]:
for value in range(0,10, 2):
	print(value)

0
2
4
6
8
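
To make the parallel with slices concrete, here is a small sketch that uses the same _start_, _end_, and _step_ values once in `range` and once in a slice. The variable names are illustrative, and the negative step is an extra case not shown above.

```python
word = "linguistics"

# The same three numbers select the same positions in both cases.
forward_indexes = list(range(1, 8, 3))
forward_slice = word[1:8:3]
print(forward_indexes)  # [1, 4, 7]
print(forward_slice)    # iut

# A negative step counts down; the end value is still excluded.
backward_indexes = list(range(10, 0, -2))
backward_slice = word[10:0:-2]
print(backward_indexes)  # [10, 8, 6, 4, 2]
print(backward_slice)    # sisun
```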


If only one argument is provided, it is considered to be _end_, and the initial value is assumed to be $0$.

In [1]:
for value in range(20):
	print(value)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19


A range object does not show its values when printed, but it can easily be converted to a list using the `list` function.
(If you are curious about the nature of the range object, read [this article](https://treyhunner.com/2018/02/python-range-is-not-an-iterator/); a safe option is to simply call it an iterable or a range object.)

In [None]:
print("Printing range object:", range(10))
print("Typecasting range to a list:", list(range(10)))

In order to iteratively get indexes available in some iterable, we can use the following trick: `range(len(iterable))`.

In [2]:
word = "apple"
for i in range(len(word)):
	print(i, word[i])

0 a
1 p
2 p
3 l
4 e
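
Index-based iteration is most useful when the loop needs to look at more than one position at a time. As a minimal sketch (the list name `doubled` is just an illustration), the code below compares every letter with its right neighbor to find doubled letters.

```python
word = "apple"
doubled = []

# Stop one position early so that word[i + 1] never goes past the end.
for i in range(len(word) - 1):
    if word[i] == word[i + 1]:
        doubled.append(word[i] + word[i + 1])

print(doubled)  # ['pp']
```

A plain `for letter in word` loop cannot do this directly, because it only sees one letter per iteration.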


**Practice** OK, now you know all you need to know about for-loops! Can you write code that asks the user for their $10$ favorite foods, one at a time? Add those foods to a list, and once the user is done, print them back!

In [6]:
# Try it here!
foods = []
for i in range(10):
	food = input("What are your top 10 favorite foods? List them 1 by 1: ")
	foods.append(food)

for food in foods:
	print(food)

m
a
r
k
h
a
b
i
b
i


## N-grams

$n$-gram models are a very basic, fundamental concept in computational linguistics!
Intuitively, $n$-grams are sequences of $n$ consecutive symbols.

    word:   banana
    n:      2
    ngrams: ba, an, na
    
    word:   linguist
    n:      3
    ngrams: lin, ing, ngu, gui, uis, ist

$n$-grams with $n=2$ are called _bigrams_, and $n$-grams with $n=1$ are called _unigrams_.

For computational linguistics and NLP, **$n$-gram models** are extremely important: symbol-level $n$-gram models define which sequences of characters are (im)possible in a certain language, word-level $n$-gram models tell us which words can be adjacent to each other, and so on.

**Practice:** write code that extracts _bigrams_ from a given word.

In [12]:
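word = input("Enter a word: ")
ngrams = []
n = 3  # length of each n-gram; set n = 2 for bigrams (the output below was produced with n = 3)

for i in range(len(word) - n+1):
	ngram = word[i:i+n]
	# keep only the first occurrence of every n-gram
	if ngram not in ngrams:
		ngrams.append(ngram)
print(ngrams)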

['hel', 'ell', 'llo', 'lo ', 'o t', ' th', 'the', 'her', 'ere']
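
The same sliding-window loop also works on a list of words, which gives word-level $n$-grams. The sketch below is only an illustration: the sentence is made up, and unlike the cell above it keeps repeated $n$-grams.

```python
words = ["the", "cat", "sat", "on", "the", "mat"]
n = 2
word_bigrams = []

# Slicing a list returns a list, so every bigram is a list of two words.
for i in range(len(words) - n + 1):
    word_bigrams.append(words[i:i + n])

print(word_bigrams)
# [['the', 'cat'], ['cat', 'sat'], ['sat', 'on'], ['on', 'the'], ['the', 'mat']]
```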


## Enumerate and Zip

Object-defining functions that can sometimes be very useful are `enumerate` and `zip`.

**`enumerate`** takes an iterable (for example, a list) as input and produces _tuples_, where every tuple contains the index of an item followed by the item itself. Just like `range`, this function creates its own object, which can easily be converted into a list.

In [19]:
input_list = ["NY", "CA", "RI", "CO"]
list(enumerate(input_list))

[(0, 'NY'), (1, 'CA'), (2, 'RI'), (3, 'CO')]

In [None]:
z = (0,1,2,3,4,4)


A **tuple** is another basic data type in Python. Tuples share most of their functionality with lists; the main difference is that tuples are immutable, i.e. their items cannot be reassigned, added, or removed after the tuple is created. Tuples can be thought of as "protected lists", but read [here](https://realpython.com/python-lists-tuples/) to learn more.
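
A short sketch of what this looks like in practice: a tuple can be indexed and unpacked like a list, but to change it we have to go through a list and build a new tuple. All variable names here are illustrative.

```python
pair = (0, "NY")

# Reading works just as it does for lists.
print(pair[1])    # NY
print(len(pair))  # 2

# A tuple can be unpacked into separate variables.
index, state = pair
print(index, state)  # 0 NY

# pair[1] = "CA" would raise a TypeError, so we convert to a list,
# change the list, and convert back to a new tuple.
as_list = list(pair)
as_list[1] = "CA"
modified = tuple(as_list)
print(modified)  # (0, 'CA')
print(pair)      # (0, 'NY') (the original tuple is unchanged)
```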

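**`zip`** takes an arbitrary number of iterables (for example, lists) as input and produces tuples, where every tuple is an index-wise combination of items from those iterables (i.e. `(list1[0], list2[0])`, `(list1[1], list2[1])`, ...). Like `enumerate`, it creates its own object, which can easily be converted into a list.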

In [22]:
towns = ["Port Jeff", "Stony Brook", "Lake Grove"]
random_list = [111,121,131]
list(zip(towns, random_list))

[('Port Jeff', 111), ('Stony Brook', 121), ('Lake Grove', 131)]

In [24]:
for i, town in enumerate(towns):
	print(i, town)

0 Port Jeff
1 Stony Brook
2 Lake Grove
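
`zip` can be unpacked in a for-loop in the same way, which removes the need for a manual counter when two lists are aligned by position. The sketch below uses shortened, illustrative versions of the state and temperature lists from the beginning of this notebook.

```python
# Shortened versions of the lists, for illustration only.
states = ["Connecticut", "Delaware", "Maine", "Vermont"]
temperatures = [49, 55.3, 41, 42.9]
new_england = ["Maine", "Vermont", "Connecticut"]

lines = []
# Every iteration receives one state together with its temperature.
for state, temp in zip(states, temperatures):
    if state in new_england:
        lines.append(state + ": " + str(temp))

for line in lines:
    print(line)
# Connecticut: 49
# Maine: 41
# Vermont: 42.9
```

If the lists have different lengths, `zip` stops as soon as the shortest one runs out.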


## Several useful string methods

There are multiple methods that simplify working with strings and lists. This section illustrates the following ones: `replace`, `split`, `strip`, `startswith`, `endswith`, and `join`.

**`replace`** returns a new string in which every occurrence of `old_substring` is replaced by `new_substring`. The original string is left unchanged.

    string.replace(old_substring, new_substring)

In [27]:
string = "Hi friend. It is very nice to see you, friend!"
string.replace("friend", "Mark")

'Hi Mark. It is very nice to see you, Mark!'
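
As a small addition, `replace` also accepts an optional third argument that limits how many occurrences are replaced. The sketch below uses it to change only the first "friend"; the name `first_only` is illustrative.

```python
string = "Hi friend. It is very nice to see you, friend!"

# The third argument is the maximum number of replacements to perform.
first_only = string.replace("friend", "Mark", 1)

print(first_only)  # Hi Mark. It is very nice to see you, friend!
print(string)      # Hi friend. It is very nice to see you, friend!
```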

**Practice:** Using the template provided below, greet everybody whose name is listed in the list `guests`.

In [28]:
template = "Hi, [guest], it is very nice to meet you!"
guests = ["Pearl", "Garnet", "Peridot"]

# your code
for guest in guests:
	print(template.replace("[guest]", guest))

Hi, Pearl, it is very nice to meet you!
Hi, Garnet, it is very nice to meet you!
Hi, Peridot, it is very nice to meet you!


**`split`** takes a string and splits it into a list of substrings at every occurrence of the provided separator. If no argument is provided, `split` splits the string on whitespace.

    string.split(separator)

In [29]:
text = "A chessboard appeared, but it was triangular, and so big that only the nearest point could be seen."
# your code
text.split(" ")

['A',
 'chessboard',
 'appeared,',
 'but',
 'it',
 'was',
 'triangular,',
 'and',
 'so',
 'big',
 'that',
 'only',
 'the',
 'nearest',
 'point',
 'could',
 'be',
 'seen.']

In [30]:
text = "Achessboardappeared"
#code
text.split()

['Achessboardappeared']

In [31]:
names = "Anna and Mary and John and Sebastian"
#code
names.split(" and ")

['Anna', 'Mary', 'John', 'Sebastian']
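
Calling `split()` without an argument is not quite the same as calling `split(" ")`. The sketch below, with a made-up string, shows the difference when the text contains a double space, a tab, and a trailing newline.

```python
text = "one  two\tthree\n"

# No argument: any run of whitespace counts as a single separator.
no_argument = text.split()
print(no_argument)  # ['one', 'two', 'three']

# A single space as the separator: every space splits, nothing else does.
single_space = text.split(" ")
print(single_space)  # ['one', '', 'two\tthree\n']
```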

In [None]:
names = "Anna, and , Mary and John and Sebastian"
#code

**`strip`** removes invisible symbols from both ends of the string. Called without arguments, it removes whitespace characters such as ` `, `\n` and `\t`. It is an extremely useful method when working with "dirty" user input, or when processing text files.

    string.strip()

In [32]:
string = "\nHello world!   \t"
string = string.strip()
print("-->" + string + "<--")

-->Hello world!<--
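
`strip` is often combined with `split` to clean up separated values. The following sketch assumes a hypothetical comma-separated line of user input and removes the stray whitespace around every item.

```python
raw = "  apple, pear ,plum \n"
fruits = []

# Split on commas first, then clean both ends of every piece.
for piece in raw.split(","):
    fruits.append(piece.strip())

print(fruits)  # ['apple', 'pear', 'plum']
```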


**`startswith`** and **`endswith`** are string methods that return booleans depending on the string starting or ending with a certain substring.

    string.startswith(substring)
    string.endswith(substring)

In [33]:
print("'hello' starts with 'hell':", "hello".startswith("hell"))
print("'hello' starts with 'hi':", "hello".startswith("hi"))
print("'hello' starts with 'hello':", "hello".startswith("hello"))

'hello' starts with 'hell': True
'hello' starts with 'hi': False
'hello' starts with 'hello': True


In [34]:
print("'linguistics' ends with 'cs':", "linguistics".endswith("cs"))
print("'linguistics' ends with '':", "linguistics".endswith(""))

'linguistics' ends with 'cs': True
'linguistics' ends with '': True
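
Because these methods return booleans, they fit naturally into `if` statements. As an illustration with a made-up word list, the sketch below collects words with the suffix "ing" and words with the prefix "un".

```python
words = ["walking", "table", "singing", "king", "unhappy", "undo"]
ing_words = []
un_words = []

for word in words:
    if word.endswith("ing"):
        ing_words.append(word)
    if word.startswith("un"):
        un_words.append(word)

print(ing_words)  # ['walking', 'singing', 'king']
print(un_words)   # ['unhappy', 'undo']
```

Note that "king" is matched as well: these methods only compare characters and know nothing about morphology.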


**`join`** is a string method that takes a list as its argument and, provided that all items in that list are strings, concatenates them, placing the given string between them. If some item is not a string, a `TypeError` is raised.

    conjunction_string.join(list_to_concatenate)

In [35]:
names = ['Anna', 'Mary', 'John', 'Sebastian']
print(" and ".join(names))

Anna and Mary and John and Sebastian


In [38]:
letters = ['P', 'y', 't', 'h', 'o', 'n']
print("".join(letters))

Python
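
Since `join` only accepts strings, a list of numbers has to be converted first. A minimal sketch, with illustrative variable names:

```python
numbers = [1, 2, 3]

# ", ".join(numbers) would raise a TypeError, so convert every item first.
as_strings = []
for number in numbers:
    as_strings.append(str(number))

joined = ", ".join(as_strings)
print(joined)  # 1, 2, 3
```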
