## CS 210 Spring 2024 - Feb 8
### File I/O, Regular expressions

---

### <font color="brown">File I/O</font>

#### <font color="brown">Reading From a File</font>

**Open file and iterate through the lines**

In [42]:
oscars = {}
for line in open("oscars.txt"):   # read one line at a time
    movie, year = line.split(':')
    oscars[year.strip()] = movie.strip()
print(oscars)

{'2018': 'The Shape of Water', '2010': 'The Hurt Locker', '2017': 'Moonlight', '2019': 'Green Book', '2014': '12 Years a Slave', '2016': 'Spotlight', '2015': 'Birdman', '2022': 'CODA'}
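
The loop above leaves the file open until Python cleans it up. A common alternative is a `with` block, which closes the file as soon as the block ends. The following is a minimal sketch, not the lecture's code: the sample lines and the file name `oscars_sample.txt` are assumptions made so that the example runs on its own.

```python
import os
import tempfile

# Hypothetical sample data in the same <movie>:<year> layout as oscars.txt
sample = "Moonlight:2017\nGreen Book : 2019\n"

# Write the sample to a temporary file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "oscars_sample.txt")
with open(path, "w") as f:
    f.write(sample)

oscars_sample = {}
with open(path) as f:              # f is closed automatically after the block
    for line in f:                 # still reads one line at a time
        movie, year = line.split(":")
        oscars_sample[year.strip()] = movie.strip()

print(oscars_sample)
```

The `strip()` calls matter here: they remove the trailing newline from the year and any spaces around the separator.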


**Sort films by year**

In [43]:
oscar_years = sorted(oscars.items(),  # dictionary items are returned as tuples
          key=lambda movie: movie[0])
print(oscar_years)

[('2010', 'The Hurt Locker'), ('2014', '12 Years a Slave'), ('2015', 'Birdman'), ('2016', 'Spotlight'), ('2017', 'Moonlight'), ('2018', 'The Shape of Water'), ('2019', 'Green Book'), ('2022', 'CODA')]
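
The `key` function decides what each tuple is compared by. As a small sketch on a three-film subset of the dictionary above (the names `films`, `by_title`, and `newest_first` are introduced here only for illustration), the same call can sort by title or by year in descending order:

```python
films = {'2018': 'The Shape of Water', '2010': 'The Hurt Locker', '2022': 'CODA'}

# item[1] is the movie title, so this sorts alphabetically by title
by_title = sorted(films.items(), key=lambda item: item[1])

# convert the year string to int and reverse for newest first
newest_first = sorted(films.items(), key=lambda item: int(item[0]), reverse=True)

print(by_title)
print(newest_first)
```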


#### <font color="brown">Writing to a File</font>

**Open file in "w" (write) mode, and write to it**

In [40]:
grades = {'Jenna': 80, 'Dylan': 75, 'Anis': 65}
scores_file = open("scores_file.txt","w")   # open a file in "write" mode
for key,value in grades.items():
    scores_file.write(key + ':' + str(value) + '\n')  # call write method on file
scores_file.close()  # make sure to close the file
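
If the call to `close()` is forgotten, buffered data may not reach the file. One way to avoid that is to open the file in a `with` block. The sketch below writes the same `grades` dictionary to a temporary directory (an assumption, so that no file in the working directory is overwritten) and reads it back to check the result.

```python
import os
import tempfile

grades = {'Jenna': 80, 'Dylan': 75, 'Anis': 65}

# Temporary location chosen only for this example
path = os.path.join(tempfile.mkdtemp(), "scores_file.txt")

with open(path, "w") as scores_file:          # closed automatically, even on error
    for name, score in grades.items():
        scores_file.write(f"{name}:{score}\n")

with open(path) as scores_file:               # read back what was written
    contents = scores_file.read()

print(contents)
```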

---

#### Exercise: Parsing a population file with '|' field separator

In [44]:
# read populations from file into a dictionary
# country name is key, population is value
# each line of file is <country>|<population>
# population may have commas, need to remove

def getPopulations(file):
    pops = {}
    for line in open(file):
        country, pop = line.split('|')
        population = int(pop.replace(',',''))  
        pops[country] = population
    return pops

In [45]:
populations = getPopulations('population.txt')

In [46]:
populations['China']

1347350000

In [47]:
populations['Nepal']

26620809

**Use list comprehension to get countries with population over 100 million**

In [48]:
large_pops = [c for c,p in populations.items() if p > 100000000 ]
print(large_pops)

['China', 'India', 'United States', 'Indonesia', 'Brazil', 'Pakistan', 'Nigeria', 'Bangladesh', 'Russia', 'Japan', 'Mexico']
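
To see the parsing steps without the data file, the sketch below runs the same logic on two in-memory lines. The helper name `parse_populations` and the list `sample_lines` are assumptions for illustration; the two populations are the values shown above. Note that `int()` ignores surrounding whitespace, so the trailing newline on each line is harmless once the commas are removed.

```python
# Two lines in the <country>|<population> layout, as they would be read from a file
sample_lines = ["China|1,347,350,000\n", "Nepal|26,620,809\n"]

def parse_populations(lines):
    """Build a {country: population} dictionary from an iterable of lines."""
    pops = {}
    for line in lines:
        country, pop = line.split('|')
        pops[country.strip()] = int(pop.replace(',', ''))  # '1,347,350,000\n' -> 1347350000
    return pops

sample_pops = parse_populations(sample_lines)
over_100_million = [c for c, p in sample_pops.items() if p > 100_000_000]

print(sample_pops)
print(over_100_million)
```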


---

#### Exercise: Counting words in a document using Counter

In [49]:
from collections import Counter

word_counts = Counter()
for line in open('metamorphosis.txt'):
    tokens = line.split()  # separate into non-whitespace sequences
    for token in tokens:
        word_counts.update([token.lower().strip(',.')])  # strip ',' and '.' from word

In [50]:
print(word_counts)

Counter({'he': 10, 'to': 6, 'the': 4, 'that': 4, 'was': 4, 'his': 4, 'a': 3, 'and': 3, 'look': 2, 'at': 2, 'dull': 2, 'feel': 2, 'right': 2, 'have': 2, 'gregor': 1, 'then': 1, 'turned': 1, 'out': 1, 'window': 1, 'weather': 1, 'drops': 1, 'of': 1, 'rain': 1, 'could': 1, 'be': 1, 'heard': 1, 'hitting': 1, 'pane': 1, 'which': 1, 'made': 1, 'him': 1, 'quite': 1, 'sad': 1, 'how': 1, 'about': 1, 'if': 1, 'i': 1, 'sleep': 1, 'little': 1, 'bit': 1, 'longer': 1, 'forget': 1, 'all': 1, 'this': 1, 'nonsense': 1, 'thought': 1, 'but': 1, 'something': 1, 'unable': 1, 'do': 1, 'because': 1, 'used': 1, 'sleeping': 1, 'on': 1, 'in': 1, 'present': 1, 'state': 1, "couldn't": 1, 'get': 1, 'into': 1, 'position': 1, 'however': 1, 'hard': 1, 'threw': 1, 'himself': 1, 'onto': 1, 'always': 1, 'rolled': 1, 'back': 1, 'where': 1, 'must': 1, 'tried': 1, 'it': 1, 'hundred': 1, 'times': 1, 'shut': 1, 'eyes': 1, 'so': 1, "wouldn't": 1, 'floundering': 1, 'legs': 1, 'only': 1, 'stopped': 1, 'when': 1, 'began': 1, 'mil

**Find top 5 most common words**

In [51]:
for word, count in word_counts.most_common(5):
    print(word, count)

he 10
to 6
the 4
that 4
was 4


**Find words of length 4 or more that occur at least twice**

In [52]:
commons = [(w,c) for w,c in word_counts.most_common() if len(w) > 3 and c > 1]
commons

[('that', 4), ('look', 2), ('dull', 2), ('feel', 2), ('right', 2), ('have', 2)]

**Find up to 5 words of length 4 or more that occur at least twice**

In [53]:
commons = [(w,c) for w,c in word_counts.most_common() if len(w) > 3 and c > 1][:5]
commons

[('that', 4), ('look', 2), ('dull', 2), ('feel', 2), ('right', 2)]
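
The same steps can be tried on a short string instead of a file. This is a sketch with a made-up sentence (the variable `text` is an assumption, not taken from `metamorphosis.txt`); it passes a generator straight to `Counter` rather than calling `update` once per token.

```python
from collections import Counter

# Made-up sample text, used only to keep the example self-contained
text = "The bird saw the worm. The worm, startled, hid."

# Lowercase each token and strip ',' and '.' before counting
counts = Counter(token.lower().strip(',.') for token in text.split())

top_two = counts.most_common(2)
long_repeats = [(w, c) for w, c in counts.most_common() if len(w) > 3 and c > 1]

print(top_two)
print(long_repeats)
```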

---

### <font color="brown">Regular Expressions</font>

Tutorials can be found at the following sites:

1. https://www.w3schools.com/python/python_regex.asp
2. https://developers.google.com/edu/python/regular-expressions#basic-patterns
3. https://docs.python.org/3/howto/regex.html?highlight=regular%20expressions

And the site https://regex101.com/ has a regular expression engine you can use to try things out.


#### <font color="brown">Import the re module</font>

In [2]:
import re

---

#### <font color="brown">Search for a pattern in a string using re.search function</font>

In [18]:
res = re.search('a','cat')  # search for pattern 'a' in target 'cat'
res

<re.Match object; span=(1, 2), match='a'>

**search returns a Match object: span=(1, 2) gives the start index and end index (exclusive)<br>
of the match within the target string "cat", and match gives the text that was matched**
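
The values shown in the echoed Match object can also be read programmatically. A short sketch using the standard Match methods `group()`, `start()`, `end()`, and `span()`:

```python
import re

res = re.search('a', 'cat')

print(res.group())   # the matched text
print(res.start())   # index where the match begins
print(res.end())     # index just past the end of the match
print(res.span())    # (start, end) as a tuple

# Slicing the target with the span recovers the matched text
print('cat'[res.start():res.end()])
```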

In [19]:
res = re.search('a','dog')
print(res)

None


**If you simply echo res, nothing will be echoed since res is None, see below**

In [20]:
res

**So it's good practice to print the result of search, or test it in a condition, in case it is None**

In [21]:
print('matched' if re.search('a','dog') else 'not matched')

not matched


**search returns the first occurrence of a match, in case there are multiple matches**

In [25]:
res = re.search('ar','barbaric')  
print(res)  

<re.Match object; span=(1, 3), match='ar'>


In [27]:
# since failure is possible when searching, use condition
def searchit(pattern,astr): 
    return 'Matched' if re.search(pattern,astr) else 'No match' 

# "if re.search(pattern,astr)" is the same as "if re.search(pattern,astr) is not None"

In [28]:
print(searchit('a','cat'))
print(searchit('a','dog'))
print(searchit('ar','barbaric'))

Matched
No match
Matched


**<font color="red">Matching literal strings is faster with a string method such as find</font>**

In [29]:
def findit(litstr,target):
    return 'No match' if target.find(litstr) == -1 else 'Matched'
    
print(findit('a','cat'))
print(findit('a','dog'))
print(findit('ar','barbaric'))

Matched
No match
Matched
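
When only a yes/no answer is needed, the `in` operator is another way to test for a literal substring without a regular expression. A minimal sketch (the function name `contains` is introduced here for illustration):

```python
def contains(litstr, target):
    # `in` checks for a literal substring; no pattern is compiled
    return 'Matched' if litstr in target else 'No match'

results = [contains('a', 'cat'), contains('a', 'dog'), contains('ar', 'barbaric')]
print(results)
```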


---

#### <font color="brown">Writing regexp patterns with metacharacters</font>

**Metacharacter [ ] is used for a class of characters<br>
Metacharacter * means 0 or more of preceding character/class<br>
Metacharacter + means 1 or more of preceding character/class**

**Example 1**<br>
Search for any sequence of characters that starts with 'a', ends with 't', and has zero or more 'c's in between

In [3]:
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('ac*t',astr)  # uses metacharacter *
    print('match') if res else print('no match')

match
no match
match
no match
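
The interactive loop can also be replaced by a fixed list of test strings, which makes it easier to compare `*` and `+` side by side. The inputs below are assumptions chosen for illustration, not the ones typed in the session above.

```python
import re

samples = ['at', 'act', 'acct', 'tacit']   # hypothetical test strings

star_results = {}   # pattern 'ac*t': zero or more 'c'
plus_results = {}   # pattern 'ac+t': one or more 'c'
for s in samples:
    star_results[s] = bool(re.search('ac*t', s))
    plus_results[s] = bool(re.search('ac+t', s))

print(star_results)
print(plus_results)
```

Only 'at' differs between the two patterns: it has zero 'c's, which `*` allows and `+` does not. 'tacit' fails both because the 'i' sits between 'c' and 't'.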


**Example 2**<br>
Search for any sequence of characters that starts with 'a', ends with 't', and has AT LEAST one 'c' in between

In [31]:
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('ac+t',astr)  # uses metacharacter +
    print('match') if res else print('no match')

string? ('quit' to stop)  at


no match


string? ('quit' to stop)  act


match


string? ('quit' to stop)  tact


match


string? ('quit' to stop)  tacit


no match


string? ('quit' to stop)  quit


**Example 3**<br>
Search for any sequence that starts with a, ends with t, and has 0 or more digits in between

In [32]:
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('a[0-9]*t',astr)  # uses metacharacters [] and *
    print('match') if res else print('no match')

string? ('quit' to stop)  at


match


string? ('quit' to stop)  art


no match


string? ('quit' to stop)  ca2t


match


string? ('quit' to stop)  xya90ter


match


string? ('quit' to stop)  quit


**Example 4**<br>
Search for any sequence that starts with a, ends with t, and has any number of letters or digits (or none) in between

In [33]:
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('a[a-zA-Z0-9]*t',astr)  # uses metacharacters [] and *
    print('match') if res else print('no match')

string? ('quit' to stop)  at


match


string? ('quit' to stop)  cA23ter


no match


string? ('quit' to stop)  ca23ter


match


string? ('quit' to stop)  quit


**Example 5**<br>
Search for any sequence that starts with a, ends with t, and has AT LEAST one letter and one digit between, in that order<br>
i.e. between a and t, all letters must precede all digits

In [34]:
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('a[a-zA-Z]+[0-9]+t',astr)  # uses metacharacters [] and +
    print('match') if res else print('no match')

string? ('quit' to stop)  at


no match


string? ('quit' to stop)  cater


no match


string? ('quit' to stop)  caBb3txy


match


string? ('quit' to stop)  ca3btxy


no match


string? ('quit' to stop)  quit
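
Printing 'match' or 'no match' hides which part of the target was matched. As a sketch, calling `group()` on the Match object shows the matching substring for two of the inputs used above:

```python
import re

pattern = 'a[a-zA-Z]+[0-9]+t'   # letters first, then digits, between 'a' and 't'

found = {}
for s in ['caBb3txy', 'ca3btxy']:
    m = re.search(pattern, s)
    found[s] = m.group() if m else None   # None when the search fails

print(found)
```

In 'ca3btxy' the digit comes before the letter, so the pattern cannot match.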


---

**Metacharacter . matches any character**

**Example**<br>
Search for any sequence that starts with a, ends with t, and has any character any number of times (including zero) between

In [35]:
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('a.*t',astr)  # uses metacharacters . and *
    print('match') if res else print('no match')

string? ('quit' to stop)  at


match


string? ('quit' to stop)  a.t


match


string? ('quit' to stop)  tartan


match


string? ('quit' to stop)  roast


match


string? ('quit' to stop)  race


no match


string? ('quit' to stop)  ta?!34tte


match


string? ('quit' to stop)  quit
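
The `*` metacharacter is greedy: it consumes as many characters as it can while still letting the rest of the pattern match. The sketch below reruns the inputs from the session above and records the matched text, so that the extent of each match is visible.

```python
import re

targets = ['at', 'a.t', 'tartan', 'roast', 'race', 'ta?!34tte']

matched = {}
for s in targets:
    m = re.search('a.*t', s)
    matched[s] = m.group() if m else None

for s, text in matched.items():
    print(s, '->', text)
```

For 'ta?!34tte' the match runs from the first 'a' to the last 't', not the first 't' that follows.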


---

**Metacharacter ? matches zero or one occurrence of the *preceding* character/class**

**Example**<br>
Search for the sequence 'act' or 'at' in any string

In [36]:
res = re.search('ac?t','at')
print(res)
res = re.search('ac?t','act')
print(res)
res = re.search('ac?t','tractor')
print(res)
res = re.search('ac?t','art')
print(res)

<re.Match object; span=(0, 2), match='at'>
<re.Match object; span=(0, 3), match='act'>
<re.Match object; span=(2, 5), match='act'>
None


---

**Metacharacter ^ matches start of target string when used outside of a [ ] class<br>
Metacharacter $ matches end of target string**

**Example 1**<br>
Match all target strings that start with p

In [37]:
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('^p',astr)  # uses metacharacter ^
    print('match') if res else print('no match')

string? ('quit' to stop)  p


match


string? ('quit' to stop)  part


match


string? ('quit' to stop)  apart


no match


string? ('quit' to stop)  quit


**Example 2**<br>
Match all target strings that end with p

In [38]:
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('p$',astr)  # uses metacharacter $
    print('match') if res else print('no match')

string? ('quit' to stop)  p


match


string? ('quit' to stop)  amp


match


string? ('quit' to stop)  quit


**Example 3**<br>
Match all target strings that start with ar, end with t, and have at least one lowercase letter between

In [39]:
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('^ar[a-z]+t$',astr)  # uses metacharacters ^, [ ], +, and $
    print('match') if res else print('no match')

string? ('quit' to stop)  at


no match


string? ('quit' to stop)  art


no match


string? ('quit' to stop)  arrest


match


string? ('quit' to stop)  aresty


no match


string? ('quit' to stop)  quit
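
When a pattern must cover the whole target string, `re.fullmatch` is an alternative to writing `^` and `$` explicitly. As a sketch on the inputs above, both approaches give the same answers:

```python
import re

inputs = ['at', 'art', 'arrest', 'aresty']

anchored = {s: bool(re.search('^ar[a-z]+t$', s)) for s in inputs}   # explicit anchors
full = {s: bool(re.fullmatch('ar[a-z]+t', s)) for s in inputs}      # whole-string match

print(anchored)
print(full)
```

'art' fails because `[a-z]+` needs at least one letter between 'ar' and the final 't'.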


In [18]:
f = open("genreMovieSample.txt", "r")

for line in f:
    parts = line.strip().split("|")
    print(parts)

FileNotFoundError: [Errno 2] No such file or directory: 'genreMovieSample.txt'
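
The error above means the file is not in the notebook's working directory. One way to handle that case is to catch `FileNotFoundError` and report the problem rather than stop with a traceback. This is a sketch: the helper name `read_fields` and the path `no_such_file.txt` are assumptions for illustration.

```python
import os
import tempfile

def read_fields(path, sep="|"):
    """Return each line split on sep, or an empty list if the file is missing."""
    try:
        with open(path, "r") as f:
            return [line.strip().split(sep) for line in f]
    except FileNotFoundError:
        print(f"Could not find {path}; check the working directory")
        return []

# A path that is known not to exist, to exercise the except branch
missing = os.path.join(tempfile.mkdtemp(), "no_such_file.txt")
rows = read_fields(missing)
print(rows)
```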