# Solutions

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Files" data-toc-modified-id="Files-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Files</a></span></li><li><span><a href="#Regular-Expressions" data-toc-modified-id="Regular-Expressions-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Regular Expressions</a></span></li><li><span><a href="#Project-–-The-Invoices" data-toc-modified-id="Project-–-The-Invoices-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Project – The Invoices</a></span></li></ul></div>

## Files

**1.2.3 File creator function**

Ok, so a function that creates a plaintext file! There are a bunch of ways to do this, but this is how I would do it. As always, I try to base the solutions on what has been taught so far in the course.

First, we import the `Path` class from the `pathlib` module:

In [2]:
from pathlib import Path

Since we will use the `Path` class in our function, we must convert the first parameter to a path object. Why? Because some users may pass a string value as the path, and strings don't have the method `.is_absolute()`. It would crash!

In [35]:
def file_creator(path, content):
    path = Path(path)
    if not path.is_absolute():
        print("The path you provided isn't working.",
              "It should be an absolute path, try again!")
        return
    else:
        if not ".txt" in str(path):
            print("You must include a textfile in your path!")
            return
        else:
            file = open(path,"w")
            file.write(content)
            # close the file so the content is flushed to disk
            file.close()

I also included the if-statement `if not ".txt" in str(path):`. If the user forgets to include a filename in their path, the function will crash. The `open()` function needs a filename at the end of the passed path argument. 

The if-statement checks to see "if there isn't a file extension in the path, print a warning!" It does so by checking if the string ".txt" isn't in the path. But this will only work if we convert the path into a string value, hence the `str(path)`. The `in` operator doesn't work on path objects, because they can't be looped over (they aren't iterable).
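One caveat with the substring check: `".txt" in str(path)` is true whenever ".txt" appears anywhere in the path, not only at the end. As a minimal sketch of a stricter alternative, path objects expose the final extension through `.suffix`. The helper name `has_txt_suffix` and the example paths are made up for illustration:

```python
from pathlib import Path

def has_txt_suffix(path):
    # hypothetical helper: True only if the final extension is ".txt"
    return Path(path).suffix == ".txt"

print(has_txt_suffix("/tmp/notes.txt"))      # True
print(has_txt_suffix("/tmp/notes.txt.bak"))  # False
print(".txt" in "/tmp/notes.txt.bak")        # True, the substring check is fooled
```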

If we pass something that isn't an absolute path, such as an empty string, the function will warn us and abort:

In [32]:
file_creator("","Hello")

The path you provided isn't working. It should be an absolute path, try again!
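Why does an empty string trigger the warning? A small sketch of what `Path` does with it, and of why the conversion is needed in the first place:

```python
from pathlib import Path

print(Path(""))                    # .
print(Path("").is_absolute())      # False, "." is a relative path
print(Path.cwd().is_absolute())    # True
print(hasattr("", "is_absolute"))  # False, plain strings lack the method
```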


Here's an absolute path to try on:

In [33]:
path = Path.cwd() / "new_file.txt"

In [44]:
file_creator(path,"This is a new file!")

In [45]:
file = open(path,"r")
print(file.read())
file.close()

This is a new file!
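The write-then-read round trip can also be written with `with` blocks, which close each file automatically, even if something goes wrong halfway. This is only a sketch: the helper name `write_then_read` is made up, and it uses a temporary folder so it doesn't touch your working directory:

```python
import tempfile
from pathlib import Path

def write_then_read(path, content):
    # hypothetical helper: write text, then read it back;
    # each "with" block closes its file when the block ends
    with open(path, "w") as file:
        file.write(content)
    with open(path, "r") as file:
        return file.read()

with tempfile.TemporaryDirectory() as folder:
    demo_path = Path(folder) / "new_file.txt"
    print(write_then_read(demo_path, "This is a new file!"))  # This is a new file!
```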


In [46]:
# run this if you want to delete the file:
path.unlink()

**1.3 Exercise – move the textfiles**

Ok! First we run the provided code to create all the files:

In [63]:
import shutil
import os

from random import randint, seed

seed(30)

# First, create tree of directories
path = Path('exercise')
if os.path.isdir(path):
    shutil.rmtree(path)
    os.mkdir(path)
    os.mkdir(path / 'old_location')
    os.mkdir(path / 'new_location')
else:
    os.mkdir(path)
    os.mkdir(path / 'old_location')
    os.mkdir(path / 'new_location')

# This following code randomly creates 500 files
file_path = Path('exercise/old_location')
for i in range(500):
    # random number to decide file extension of present sequence
    num = randint(0,1)
    # if 'num' equals 0 -> plaintext, otherwise pythonfile
    file_ext = ".txt" if num == 0 else ".py"
    
    # Here to decide file content
    if file_ext == '.txt':
        text = "This is a plaintext file!"
    else:
        text = "# this is a python file"
        
    # finally, writing and closing the file
    file = open(file_path / f"file_{randint(500,10000)}{file_ext}","w")
    file.write(text)
    file.close()


Righty!

So, in this exercise, we need to copy and move a whole bunch of files, using a function, from "old_location" to "new_location". We should also include code that adds the current date to each file's name when it is moved. Let's do a small TODO schematic:

In [64]:
# TODO – define function

# TODO – convert potential path strings into path objects

# TODO – find all files in old_location, save to list

# TODO – filter all text files

# TODO – loop over our text files

# TODO – rename each file with date

# TODO – create file paths for each file

# TODO – copy each file

# TODO – move each copy to 'new_location'

# TODO – create path variables for the two locations

A pretty scary long list, right? Well, it's actually not that bad, since a lot of these TODOs will be done together in one go. Let's start:

In [65]:
# TODO – define function
def move_files(location_A, location_B, date):
    # TODO – convert potential path strings into path objects
    location_A, location_B = Path(location_A), Path(location_B)
    
    # TODO – find all files in old_location, save to list
    all_files = os.listdir(location_A)

    # TODO – filter all text files
    all_txt_files = []
    
    for file in all_files:
        if file.endswith(".txt"):
            all_txt_files.append(file)
        else:
            continue


    # TODO – loop over our text files
    for file_name in all_txt_files:
        
        # TODO – rename each file with date
        new_file_name = f"{date}_{file_name}"
        
        # TODO – create file paths for each file
        # use the function's own parameters, not the global path variables
        file_path_1 = location_A / file_name
        file_path_2 = location_B / new_file_name
        
        # TODO – copy each file
        # TODO – move each copy to 'new_location'
        shutil.copy(file_path_1,file_path_2)
    
    print(f"All {len(all_txt_files)} text files copied and moved!")
        
# TODO – create path variables for the two locations
old_loc = Path("exercise/old_location")
new_loc = Path("exercise/new_location")

In [66]:
move_files(old_loc, new_loc, "2020-10-15")

All 262 text files copied and moved!


Let's have a look in the "new_location" to see if it worked (only showing 10 here):

In [67]:
os.listdir(new_loc)[:10]

['2020-10-15_file_4286.txt',
 '2020-10-15_file_7200.txt',
 '2020-10-15_file_3099.txt',
 '2020-10-15_file_2387.txt',
 '2020-10-15_file_2556.txt',
 '2020-10-15_file_6042.txt',
 '2020-10-15_file_8709.txt',
 '2020-10-15_file_6081.txt',
 '2020-10-15_file_1513.txt',
 '2020-10-15_file_6254.txt']

Dates are now added to the file names! All in all there are 262 text files, which means there are more text files than python files.

Remember, if you just change the file extension in our function, this will work for any type of file you have, anywhere on your computer! Maybe you can find some use for it somewhere? Have fun!
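As a rough sketch of that idea, here is a variant that takes the extension as a parameter and stamps today's date by itself. The name `copy_with_date`, the `ext` parameter, and the use of `Path.glob()` and `date.today()` are my own choices for illustration, not part of the exercise:

```python
import shutil
from datetime import date
from pathlib import Path

def copy_with_date(src_dir, dst_dir, ext=".txt"):
    # hypothetical variant of move_files with a configurable extension
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    today = date.today().isoformat()  # formatted like "2020-10-15"
    copied = 0
    for src in src_dir.glob(f"*{ext}"):
        if src.is_file():
            # the copy gets the date in front of the original file name
            shutil.copy(src, dst_dir / f"{today}_{src.name}")
            copied += 1
    return copied
```

Calling `copy_with_date("exercise/old_location", "exercise/new_location", ext=".py")` would then copy the python files instead.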

(run this following code if you want to remove all exercise folders and files:)

In [68]:
shutil.rmtree("exercise/")

**1.5 Exercise – pdf to text function**

Ok! To be able to get this done we need to first import the `PyPDF2` module and the `Path` class from the `pathlib` module:

In [5]:
import PyPDF2

from pathlib import Path

Let's do a TODO list:

In [6]:
# TODO – define a function with two parameters

# TODO – make sure that the two arguments are path objects

# TODO – using the pdf's path parameter, open the file in "rb" mode

# TODO – pass the opened file object to the PyPDF2's PdfFileReader class

# TODO – create empty string variable to add pdf text to

# TODO – create for-loop: loop over all the pdf's pages

# TODO – get page data

# TODO – add each page's text to the string variable

# TODO – save the string variable to a plaintext file object

# TODO – create two path objects to test our function on

# TODO – test the function!

So, quite a list! Let's get to work with our function. I'll just name it the same as in the course notebook!

In [7]:
# TODO – define a function with two parameters
def pdf_converter(pdf_path, results_path):
    
    # TODO – make sure that the two arguments are path objects
    pdf_path, results_path = Path(pdf_path), Path(results_path)

    # TODO – using the pdf's path parameter, open the file in "rb" mode
    file_object = open(pdf_path, "rb")
    
    # TODO – pass the opened file object to the PyPDF2's PdfFileReader class
    pdf_file = PyPDF2.PdfFileReader(file_object)

    # TODO – create empty string variable to add pdf text to
    pdf_content = ""

    # TODO – create for-loop: loop over all the pdf's pages
    for i in range(pdf_file.getNumPages()):
        # TODO – get page data
        page = pdf_file.getPage(i)
        
        # TODO – add each page's text to the string variable
        pdf_content += page.extractText()
        
    # TODO – save the string variable to a plaintext file object
    text_file = open(results_path,"w")
    
    text_file.write(pdf_content)
    text_file.close()
    # also close the pdf file we opened at the top
    file_object.close()


That's our function! Pretty straightforward actually, at least I think so. Hope you're still with me :)

Now, let's test the function to see if it works! I'll just try it on the same pdf file as in the course notebook. It is found in the "course_material" directory. I'll save the results in a plaintext file in the current working directory: 

In [17]:
# TODO – create two path objects to test our function on
path_to_pdf = Path("../course_material/report.pdf")
results_path = Path("report.txt") # important with file extension!


# TODO – test the function!
pdf_converter(path_to_pdf, results_path)

Let's open the newly created "report.txt":

In [18]:
file = open(results_path,"r")
text = file.read()
file.close()

How long is the file?

In [19]:
len(text)

55344

The first 1000 characters:

In [20]:
text[:1000]

'Corporate governance report 2019\nH & M Hennes\n & Mauritz AB\nH & M Hennes & Mauritz AB is a Swedish public limited company. H&M™s \nclass B share is listed on Nasdaq Stockholm. H&M applies the Swedish \nCorporate Governance Code (the Code) and has prepared this corporate \ngovernance report in accordance with the Annual Accounts Act and the \nCode. H&M has applied the Code since 2005. The report, which covers \n\ndirectors and has been reviewed by the company™s auditors.\nH&M is governed by both external regulations and internal \n control documents.\n\n ŠThe Swedish Companies Act\n ŠAccounting legislation including the Swedish Bookkeeping Act \n and Annual Accounts Act\n ŠMAR, EU Market Abuse Regulation (596/2014/EU)\n ŠNasdaq Stockholm Rules for Issuers\n ŠThe General Data Protection Regulation (GDPR)\n ŠSwedish Corporate Governance Code (the Code), which is available \n\n\nmay deviate from individual rules provided they give an explanation of \nthe deviation, describe the chosen 

It seems to have worked! Yay!

In [21]:
# Uncomment this if you want to remove the results file:
#results_path.unlink()

## Regular Expressions

**2.4 Exercise – matching phone numbers**

First, let's get the list:

In [60]:
import re

file = open("../course_material/phone_list.txt","r")
text = file.read()
file.close()

In [65]:
print(text[:250])

Participant, Phone number
Boivie Jurgen, 0703-1901XX
Bram,  Mats, 0707-2321XX
Carlsson,  Lars, 0735-4474XX
Christiansen,  Jan, 0730-2868XX
Ekblom,  Torbjorn, 018-5115XX
Ekstedt,  Stig, 0706-4084XX
Englund,  Jan, 0703-6826XX
Grine,  Mats, 0735-6226XX



The cellphone numbers all start with "07", followed by more digits, then a dash, then more digits, and finally two "X" letters. Let's try to write a regex based on this:

In [105]:
p = r"07\d+-\d+XX"

In [106]:
number_list = re.findall(p,text)

In [107]:
len(number_list)

31

There are 31 Swedish cellphone numbers in the list!
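To see the pattern at work without the course file, here is a small sketch that runs it on the first few rows printed above. The `sample` string is copied from that output, and the variable names are my own:

```python
import re

sample = """Participant, Phone number
Boivie Jurgen, 0703-1901XX
Bram,  Mats, 0707-2321XX
Carlsson,  Lars, 0735-4474XX
Christiansen,  Jan, 0730-2868XX
Ekblom,  Torbjorn, 018-5115XX
"""

# raw string (r"...") so Python leaves the backslashes for the regex engine
cell_pattern = r"07\d+-\d+XX"

print(re.findall(cell_pattern, sample))
# ['0703-1901XX', '0707-2321XX', '0735-4474XX', '0730-2868XX']
```

The landline number "018-5115XX" is skipped, since it doesn't contain "07" followed by digits.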

**2.6 Exercise – can you find the number?**

Let's start by importing the re module:

In [24]:
import re

We also need to get the speech and save it into a variable:

In [22]:
text = open("../course_material/speach.txt","r").read()

We know that the number we need to find comes _before_ the word "refugees". This means that we can include this word in our regular expression:

In [26]:
p = "refugees"

In [27]:
re.findall(p,text)

['refugees', 'refugees']

Ok, so Trump mentions the word "refugees" two times in the speech. Let's check what he says just before the word "refugees". We can do this by including a blank space and the word character `\w`. The word character matches alphabetical letters, numerical digits _and_ the underscore!

Since we want the whole _word_ in front of "refugees", we'll add the plus sign `+`, which means we want `\w` one or more times:

In [28]:
p = r"\w+ refugees"

In [29]:
re.findall(p,text)

['become refugees', '000 refugees']

Ok! So the second match is the one we're interested in. But he obviously didn't say "000 refugees", there's more to it. However, how are numbers formatted in this speech? If Trump said "one hundred thousand refugees", is it typed:
```
"100.000"?
"100,000"?
"100 000"?
"100'000"?
```
They are all plausible. To be safe, we won't commit to any one of them. Instead, we're going to use the "not a word character" special character `\W`, which matches every separator above. Have a look:

In [37]:
p = r"\W"
test = "., '"

re.findall(p, test)

['.', ',', ' ', "'"]

So let's include it in our search:

In [38]:
p = r"\W\w+ refugees"

In [39]:
re.findall(p,text)

[' become refugees', ',000 refugees']

Let's swap the word character `\w` for the digit character `\d`, and add another `\d+` so we can see the amount in front of the comma:

In [50]:
p = r"\d+\W\d+ refugees"

re.findall(p,text)

['620,000 refugees']

Now, is there more? Is he, for example, saying "1,620,000 refugees"? Let's have a look by duplicating `\d+\W`, so the entire expression will be:

In [44]:
p = r"\d+\W\d+\W\d+ refugees"

This should match on "1,620,000 refugees":

In [45]:
t = "1,620,000 refugees"

In [46]:
re.findall(p,t)

['1,620,000 refugees']

Let's check in Trump's speech:

In [47]:
re.findall(p,text)

[]

No hits! This means that the number President Trump is talking about must be 620,000! To see what comes right before the number, let's change the first `\d+` into `\w+` instead and have a look:

In [48]:
p = r"\w+\W\d+\W\d+ refugees"

In [49]:
re.findall(p,text)

['estimated 620,000 refugees']

Voilà!

One problem still remains, though. This regular expression won't match just any number, which was part of the exercise. This is, however, easily fixed! Let's go back to the expression that only matches the number and the word "refugees":

In [51]:
p = r"\d+\W\d+ refugees"

We can actually wrap the first two tokens of our expression, `\d+\W`, in a group, and then attach a repetition quantifier to this group. So instead of `\d+\W`, we'll type `(\d+\W)*`. This means that this pattern can occur zero or more times. Have a look:

In [56]:
p = r"((\d+\W)*\d+ refugees)"

_(I've also added a group that encloses the entire expression. This is just so that all hits will be displayed when I use the `.findall()` method here below)_

This pattern will now match any number we give it, if it's followed by the string " refugees":

In [57]:
t = "1,123,032 refugees"
re.findall(p,t)

[('1,123,032 refugees', '123,')]

In [58]:
t = "32 refugees"
re.findall(p,t)

[('32 refugees', '')]

In [59]:
t = "54,654,721,321 refugees"
re.findall(p,t)

[('54,654,721,321 refugees', '721,')]

There you go!
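The second item in each tuple above is whatever the inner group `(\d+\W)` matched on its last repetition, which is why it shows up as "123," or as an empty string. If you only care about the full match, one alternative (a sketch, using syntax that may not have been covered in the course) is a non-capturing group `(?:...)`. With no capturing groups in the pattern, `re.findall()` returns the whole match as a plain string:

```python
import re

# (?:...) groups the tokens for the * quantifier without capturing them
p = r"(?:\d+\W)*\d+ refugees"

for t in ["1,123,032 refugees", "32 refugees", "54,654,721,321 refugees"]:
    print(re.findall(p, t))
# ['1,123,032 refugees']
# ['32 refugees']
# ['54,654,721,321 refugees']
```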

**2.9 Exercise – Who has landlines?**

First, let's get the list:

In [111]:
import re

file = open("../course_material/phone_list.txt","r")
text = file.read()
file.close()

Here, I will try to write a regex that captures each row in the list, and then use a group to catch people's names. Each row in the list ends with a newline character. That will be our breaking point. Sometimes, I find it easier to divide the string into parts that you can then deconstruct!

Let's write a regex:

In [110]:
p = r",\s\w+-\w+\n"

In [113]:
re.findall(p, text)[:10]

[', 0703-1901XX\n',
 ', 0707-2321XX\n',
 ', 0735-4474XX\n',
 ', 0730-2868XX\n',
 ', 018-5115XX\n',
 ', 0706-4084XX\n',
 ', 0703-6826XX\n',
 ', 0735-6226XX\n',
 ', 018-2066XX\n',
 ', 0738-2149XX\n']

So this regex catches all phone numbers in the list (plus the comma and the blank space that precede the number). Let's see if we can capture all numbers where the second digit isn't a seven, since numbers starting with "07" are cellphone numbers:

In [114]:
p = r",\s0[^7]\d+-\w+\n"

In [115]:
re.findall(p, text)[:10]

[', 018-5115XX\n',
 ', 018-2066XX\n',
 ', 018-4611XX\n',
 ', 018-5007XX\n',
 ', 018-3213XX\n',
 ', 018-3005XX\n']

There we go! So to recap, I'll go through the regex character by character. It captures all strings that consist of:
1. a comma `,`
2. any type of whitespace character `\s` (whitespace characters are blank spaces, newlines, tabs etc.)
3. the digit zero `0`
4. any single character that is NOT the digit seven: `[^7]`
5. one or more numerical digits `\d+`
6. a literal dash character `-`
7. one or more word characters `\w+` (remember that these capture numerical digits as well)
8. a newline character `\n`
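The walkthrough above can also live inside the pattern itself. As a sketch, the `re.VERBOSE` flag (which may be new to you) tells Python to ignore whitespace and `#` comments inside the pattern, so every piece can sit on its own commented line. The two sample rows are taken from the phone list output shown earlier, and the variable names are my own:

```python
import re

landline = re.compile(r"""
    ,       # a literal comma
    \s      # one whitespace character
    0       # the digit zero
    [^7]    # any single character except 7
    \d+     # one or more digits
    -       # a literal dash
    \w+     # one or more word characters (digits and the XX)
    \n      # a newline
""", re.VERBOSE)

sample = "Ekblom,  Torbjorn, 018-5115XX\nEkstedt,  Stig, 0706-4084XX\n"
print(landline.findall(sample))  # [', 018-5115XX\n']
```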

Now, we just need to add regex syntax to capture the names on these rows. Let's write a name-finding regex by itself first, and then combine it with the landline-finding regex above.

Each name is separated by a comma and a space, so let's use that in the regex. However, looking at the top of the list, the comma doesn't seem to be included in all rows:

In [120]:
print(text[:54])

Participant, Phone number
Bolsvik Jurgen, 0703-1901XX



So there may or may not be a comma. We'll include a comma with a star to cover our bases `,*`:

In [145]:
# now looking for each name
p = r"[A-Z][a-z]+,*\s[A-Z][a-z]+"

In [125]:
re.findall(p,text)

['Participant, Phone', 'Bolsvik Jurgen']

Huh? It only catches the first two rows, why is that? Maybe there's more than one blank space in between each surname and name? Let's check:

In [138]:
p = r"[A-Z][a-z]+,*\s+"

In [140]:
re.findall(p,text)[:5]

['Participant, ', 'Phone ', 'Bolsvik ', 'Jurgen, ', 'Brumm,  ']

That seems to be the issue. Let's modify:

In [144]:
p = r"[A-Z][a-z]+,*\s+[A-Z][a-z]+"

In [143]:
re.findall(p,text)[:10]

['Participant, Phone',
 'Bolsvik Jurgen',
 'Brumm,  Mats',
 'Carlsson,  Yngve',
 'Svensson,  Jan',
 'Ekstrom,  Torbjorn',
 'Ekgren,  Stig',
 'Engdahl,  Jan',
 'Gripe,  Mats',
 'Hakku,  Tommy']

This seems to have worked great! Now let's combine the two regexes. Since a regex pattern is such an eyesore, when I'm writing a longer regex I sometimes save parts of the pattern in variables. Then I use those variables in an f-string that becomes my final regex pattern. It makes it a little bit more readable, in my opinion. I'll show you what I mean:

In [146]:
all_names = r"[A-Z][a-z]+,*\s+[A-Z][a-z]+"
all_landlines = r",\s0[^7]\d+-\w+\n"

In [147]:
p = f"{all_names}{all_landlines}"

In [148]:
re.findall(p, text)

['Ekstrom,  Torbjorn, 018-5115XX\n',
 'Hakku,  Tommy, 018-2066XX\n',
 'Harrysson,  Peder, 018-4611XX\n',
 'Helgsson,  Kurt, 018-5007XX\n',
 'Langefors,  Arvid, 018-3213XX\n',
 'Roos,  Anne, 018-3005XX\n']

Tada! Since we only want the names, let's group that part of the regex:

In [149]:
p = f"({all_names}){all_landlines}"

In [150]:
re.findall(p, text)

['Ekstrom,  Torbjorn',
 'Hakku,  Tommy',
 'Harrysson,  Peder',
 'Helgsson,  Kurt',
 'Langefors,  Arvid',
 'Roos,  Anne']

There you go! Those six people have landlines!
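If you want the names in a tidier form, here is an optional sketch that splits each hit on the comma and whitespace and flips the two parts. It assumes the surname comes first in each row, as the list appears to be laid out; the helper name `flip_name` is made up:

```python
import re

def flip_name(row):
    # hypothetical helper: "Ekstrom,  Torbjorn" -> "Torbjorn Ekstrom"
    surname, first_name = re.split(r",*\s+", row)
    return f"{first_name} {surname}"

landline_owners = ['Ekstrom,  Torbjorn', 'Hakku,  Tommy', 'Roos,  Anne']
print([flip_name(row) for row in landline_owners])
# ['Torbjorn Ekstrom', 'Tommy Hakku', 'Anne Roos']
```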

## Project – The Invoices

It's standard practice to always start with all your imports. Here's what we'll be using:

In [153]:
import os
import datetime
import PyPDF2
import re

from pathlib import Path

Let's first get a list of all pdf-files:

In [154]:
path = Path('project')
files = os.listdir(path)

We'll need to check out what the pdf-files look like as strings. Knowing this, we can write regular expressions to extract the information we're interested in:

In [157]:
file = open(path / files[0],"rb")

pdf_file = PyPDF2.PdfFileReader(file)

In [158]:
page = pdf_file.getPage(0)

In [163]:
print(page.extractText())


Invoice date: 2020-01-24

Invoice for services in accordance with #T542AA1, Chap 3
--------------------------------------------------------


Total expenditure: SEK 1,951,604

------
Our contact: Christoffer Olofsson (841012-8668)




In [164]:
page.extractText()

'\nInvoice date: 2020-01-24\n\nInvoice for services in accordance with #T542AA1, Chap 3\n--------------------------------------------------------\n\n\nTotal expenditure: SEK 1,951,604\n\n------\nOur contact: Christoffer Olofsson (841012-8668)\n\n'

After looking through a sample of the files, they all seem to have the same layout. 
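With a fixed layout like this, each field can be pulled out with its own small regex. Here is a minimal sketch run on the page text printed above. Which fields the project actually needs is an assumption on my part, the variable names are made up, and the `{4}` style quantifier (meaning "exactly four times") may not have been covered in the course:

```python
import re

page_text = '\nInvoice date: 2020-01-24\n\nInvoice for services in accordance with #T542AA1, Chap 3\n--------------------------------------------------------\n\n\nTotal expenditure: SEK 1,951,604\n\n------\nOur contact: Christoffer Olofsson (841012-8668)\n\n'

# group 1 holds the part we care about in each pattern
invoice_date = re.search(r"Invoice date: (\d{4}-\d{2}-\d{2})", page_text).group(1)
total = re.search(r"Total expenditure: SEK ((?:\d+,)*\d+)", page_text).group(1)
contact = re.search(r"Our contact: (.+) \(", page_text).group(1)

print(invoice_date)                 # 2020-01-24
print(int(total.replace(",", "")))  # 1951604
print(contact)                      # Christoffer Olofsson
```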