# Lecture 3
## 1. Opening, Reading and Writing Files

In the previous lectures we covered the Python basics and inspected some of the built-in data types and syntax. The real power of programming lies in the ability of computers to perform thousands of operations in very little time. In other words, coding helps us to interrogate "big data": for the humanities, more text than we are able to read in a lifetime.

Until now, our "data" was mostly confined to mock examples: strings, lists or other values we entered manually. In this lecture we finally turn to a more realistic use of coding in the Humanities, and show how Python assists us with analysing larger, external information sources. 

In this series of lectures we focus on:
- Reading files from disk.
- Reading data from the web.
- Performing some analysis on these data.
- Writing results to a file.

## 1.1. Locating files: `os` and `path`

Before opening a file, Python has to locate it. Create a string variable that tells your program where it has to look. 
By default, Python looks in the current working directory, which is usually where your script (or Notebook such as this one) is located. Therefore you create a **relative path**, i.e. a path that starts from where your script is located. 
Let's try to find John Locke's "An Essay Concerning Human Understanding", which is saved in the `data` subdirectory.

In [None]:
file_name = 'data/pg10615.txt'

In the above cell we assigned the relative path to the `file_name` variable. Using relative paths is highly recommended. Not only are they often shorter, they also make your scripts and data more portable. The relative path will work on any computer (as long as you don't start moving the folders), while the absolute path only points to the right file on my laptop.

relative path: `'data/pg10615.txt'`

absolute path: `'/Users/kasparbeelen/Documents/Onderwijsea/CTH/lectures/lecture3/data/pg10615.txt'`
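The difference between the two kinds of path can be inspected with the standard `os.path` module. The sketch below is only an illustration: it manipulates path strings and does not require the file to exist on your computer.

```python
import os

file_name = 'data/pg10615.txt'   # the relative path used above

# A relative path is interpreted from the current working directory.
print(os.path.isabs(file_name))   # False

# abspath() turns it into an absolute path by prepending that directory.
absolute = os.path.abspath(file_name)
print(os.path.isabs(absolute))    # True
print(absolute)                   # differs from computer to computer

# os.path.join() glues path parts together with the right separator
# for your operating system.
joined = os.path.join('data', 'pg10615.txt')
print(joined)
```

The absolute path printed here will look different on every machine, which is exactly why the relative path is the more portable choice.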


[VU] Sometimes you see double dots at the beginning of a file path; this means 'the parent of the current directory'. When writing a file path, you can use the following:

- /     go to the root of the current drive
- ./    go to the current directory
- ~/    go to the home directory
- -     go to the previous directory
- ../   go to the parent directory (one up in the tree)

Note that `~/` and `-` are understood by the shell (for example by `cd`), not by Python's `open()`.

You can use your Notebook to navigate your computer as you would in your terminal, with commands such as `cd` (change directory), `ls` (list directory) and `pwd` (print working directory).

Print the current directory:

In [2]:
pwd

'/Users/kasparbeelen/Documents/Onderwijsea/CTH/lectures/lecture3'

Go to the User directory:

In [4]:
cd ~/

/Users/kasparbeelen


List all items in the User directory:

In [5]:
ls ./ 

Applications/      Downloads/         Pictures/          autoconf-2.69/
Calibre Library/   Library/           Public/            nltk_data/
Desktop/           Movies/            Sites/             polyglot_data/
Documents/         Music/             anaconda3/         scikit_learn_data/


Go back to the previous directory:

In [6]:
cd -

/Users/kasparbeelen/Documents/Onderwijsea/CTH/lectures/lecture3


Go to the parent's parent folder (go two up):

In [8]:
cd ../../

/Users/kasparbeelen/Documents/Onderwijsea/CTH


In [9]:
ls ./

examples/   lectures/   literature/ notes/


Go back to the previous directory:

In [11]:
cd -

/Users/kasparbeelen/Documents/Onderwijsea/CTH/lectures/lecture3


Go one up:

In [12]:
cd ..

/Users/kasparbeelen/Documents/Onderwijsea/CTH/lectures


Go one down, to the 'lecture3' folder:

In [13]:
cd lecture3

/Users/kasparbeelen/Documents/Onderwijsea/CTH/lectures/lecture3


...and we should be home again after a long travel:

In [14]:
pwd

'/Users/kasparbeelen/Documents/Onderwijsea/CTH/lectures/lecture3'

The commands you ran in the above cells are not Python, but commands borrowed from bash, the command-line language; Jupyter makes them available as so-called 'magic' commands. Jupyter Notebook thus allows you to combine both languages to some extent.
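The same navigation can also be done in plain Python with the standard `os` module. The sketch below mirrors `pwd`, `cd` and `ls`. To avoid depending on the folders on your own computer, it first creates a temporary folder containing a (hypothetical) `lecture3` subfolder.

```python
import os
import tempfile

start = os.getcwd()                            # like `pwd`

with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, 'lecture3'))    # hypothetical folder name
    os.chdir(os.path.join(tmp, 'lecture3'))    # like `cd lecture3`
    os.chdir('..')                             # like `cd ..`
    entries = os.listdir('.')                  # like `ls ./`
    os.chdir(start)                            # go back to where we started

print(entries)   # ['lecture3']
```

Unlike the shell, Python has no shortcut for `cd -`; storing the starting directory in a variable (here `start`) serves the same purpose.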

## 1.2 Opening Documents

Python has a built-in function, `open()`, which returns a 'file object'.

`open()` has the following crucial arguments: 
- **location**: the path of the file (see above)
- **mode**: a combination of characters that indicates the purpose of opening the file
- **encoding**: the encoding of the text file

What do **mode** and **encoding** actually mean?

### 1.2.1 Encoding 

**UTF-8**

You may wonder what an encoding is and what *utf-8* is. For anyone working with texts and computers this is vital to know. Internally, a computer knows no characters whatsoever: every piece of information is represented as numbers (which in turn are represented in a binary format, as zeroes and ones). An encoding specifies which numbers represent which characters. A famous and long-standing encoding scheme is ASCII, in which for example the letter 'A' is encoded using the number 65. ASCII, however, only has a very limited alphabet and cannot encode a lot of writing systems. A modern-day standard supporting countless writing systems is *unicode*, and *utf-8* is the most widely used way of encoding unicode text. This is the type of encoding that you will want to use for your data whenever possible. Whenever you have a choice, you should use unicode!
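The relation between characters, numbers and bytes can be checked directly in Python. The example word below is our own choice; the rest uses only built-in functions and string methods.

```python
# Every character corresponds to a number.
print(ord('A'))     # 65
print(chr(65))      # A

# Encoding turns a string into bytes; decoding reverses the process.
word = 'caf\u00e9'                # the word 'café', written with an escape for 'é'
as_utf8 = word.encode('utf-8')
print(as_utf8)                    # b'caf\xc3\xa9'
print(len(word), len(as_utf8))    # 4 5
print(as_utf8.decode('utf-8'))

# ASCII has no number for 'é', so encoding the same word fails.
try:
    word.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)                   # False
```

Note that the four characters of the word take up five bytes in utf-8: the accented letter needs two bytes, while each ASCII letter needs only one.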

### 1.2.2 Mode
[VU]
* **r** = Opens a file for reading only. The file pointer is placed at the beginning of the file.
* **w** = Opens a file for writing only. Overwrites the file if the file exists. If the file does not exist, creates a new file for writing.
* **a** = Opens a file for appending. The file pointer is at the end of the file if the file exists. If the file does not exist, it creates a new file for writing. Use it if you would like to add something to the end of a file.
* **t** = Text mode (the default). It can be combined with one of the modes above, as in `rt`.
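The difference between `w` and `a` is easiest to see by trying both on the same file. The sketch below works in a temporary folder, so no real data is touched; the file name `notes.txt` is made up for the occasion.

```python
import os
import shutil
import tempfile

tmp = tempfile.mkdtemp()                 # temporary folder for this demo
path = os.path.join(tmp, 'notes.txt')    # hypothetical file name

f = open(path, 'w', encoding='utf-8')    # 'w' creates the file
f.write('first\n')
f.close()

f = open(path, 'a', encoding='utf-8')    # 'a' adds to the end
f.write('second\n')
f.close()

f = open(path, 'r', encoding='utf-8')    # 'r' reads from the beginning
after_append = f.read()
f.close()

f = open(path, 'w', encoding='utf-8')    # 'w' again: the old content is lost
f.write('third\n')
f.close()

f = open(path, 'r', encoding='utf-8')
after_overwrite = f.read()
f.close()

shutil.rmtree(tmp)                       # remove the temporary folder

print(repr(after_append))      # 'first\nsecond\n'
print(repr(after_overwrite))   # 'third\n'
```

Be careful with `w`: opening an existing file in this mode empties it immediately, even before you write anything.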

## 1.3 Reading Documents

Let's use Python to read in a few paragraphs from Locke's "An Essay Concerning Human Understanding".

In [15]:
locke = open('data/locke_excerpt.txt','r')

Reminder: The `open()` function requires the file path as its first argument. The second (optional) argument specifies the *mode* in which the file is opened. The third (optional) argument specifies the encoding of the file.

Even though we opened the file in 'read' mode, `open()` does not return the actual content or text. To assign the text to a variable we have to call the `read()` method on the file object:

In [16]:
locke_text = locke.read()
print(locke_text)

THE EPISTLE TO THE READER

READER,

I have put into thy hands what has been the diversion of some of my idle and heavy hours. 
If it has the good luck to prove so of any of thine, and thou hast but half so much pleasure in reading as I had in writing it, thou wilt as little think thy money, as I do my pains, ill bestowed.
Mistake not this for a commendation of my work; nor conclude, because I was pleased with the doing of it, that therefore I am fondly taken with
it now it is done. 
He that hawks at larks and sparrows has no less sport, though a much less considerable quarry, than he that flies at nobler game: and he is little acquainted with the subject of this treatise--the UNDERSTANDING--who does not know that, as it is the most elevated faculty of the soul, so it is employed with a greater and more constant delight than any of the other. 
Its searches after truth are a sort of hawking and hunting, wherein the very pursuit makes a great part of the pleasure. 
Every step the mind tak

After reading, it is recommended to close the file:

In [17]:
locke.close()

The code below will raise a `ValueError`, because the content is no longer accessible after closing the file:

In [18]:
locke.read()

ValueError: I/O operation on closed file.

## 1.4 `read()`, `readlines()` and `readline()`

In order to *read* the contents of the file, Python provides three related operations. The first operation is `read()`:

`f = open(path,'r').read()` assigns the entire document to a variable `f`:

In [33]:
document = open('data/locke_excerpt.txt','r')
text = document.read()
document.close()

The variable `text` now holds the entire content of the file located at `data/locke_excerpt.txt` as a single string and we can access and manipulate it just like any other string. We can print the first 100 characters of this string:

In [34]:
print(text[:100])

THE EPISTLE TO THE READER

READER,

I have put into thy hands what has been the diversion of some of


The second operation is `readlines()`, which returns a list of the lines in the file, where each item of the list represents a single line:

In [36]:
document = open('data/locke_excerpt.txt','r')
lines = document.readlines()
print(lines)
print(type(lines))
document.close()

['THE EPISTLE TO THE READER\n', '\n', 'READER,\n', '\n', 'I have put into thy hands what has been the diversion of some of my idle and heavy hours. \n', 'If it has the good luck to prove so of any of thine, and thou hast but half so much pleasure in reading as I had in writing it, thou wilt as little think thy money, as I do my pains, ill bestowed.\n', 'Mistake not this for a commendation of my work; nor conclude, because I was pleased with the doing of it, that therefore I am fondly taken with\n', 'it now it is done. \n', 'He that hawks at larks and sparrows has no less sport, though a much less considerable quarry, than he that flies at nobler game: and he is little acquainted with the subject of this treatise--the UNDERSTANDING--who does not know that, as it is the most elevated faculty of the soul, so it is employed with a greater and more constant delight than any of the other. \n', 'Its searches after truth are a sort of hawking and hunting, wherein the very pursuit makes a great

The third operation `readline()` returns the next line of the file, returning the text up to and including the next newline character (*\n*, or *\r\n* on Windows). More simply put, this operation will read a file line-by-line. So if you call this operation again, it will return the next line in the file. Try it out below!

In [37]:
infile = open('data/locke_excerpt.txt', "r")
next_line = infile.readline()
print(next_line)

THE EPISTLE TO THE READER



Press `ctrl+enter` repeatedly in the cell below; this shows you a new line each time.

In [43]:
print(infile.readline())

Mistake not this for a commendation of my work; nor conclude, because I was pleased with the doing of it, that therefore I am fondly taken with
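What happens when there are no lines left? The sketch below uses a small two-line file (created on the spot in a temporary folder, with a made-up name) to show that `readline()` then returns an empty string rather than raising an error.

```python
import os
import shutil
import tempfile

tmp = tempfile.mkdtemp()                      # temporary folder for this demo
path = os.path.join(tmp, 'two_lines.txt')     # hypothetical file name

outfile = open(path, 'w', encoding='utf-8')
outfile.write('first line\nsecond line\n')
outfile.close()

infile = open(path, 'r', encoding='utf-8')
first = infile.readline()
second = infile.readline()
third = infile.readline()     # nothing left to read
infile.close()
shutil.rmtree(tmp)            # remove the temporary folder

print(repr(first))    # 'first line\n'
print(repr(second))   # 'second line\n'
print(repr(third))    # ''
```

An empty line in the middle of a file is returned as `'\n'`, so the empty string `''` can only mean that the end of the file has been reached.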



But what about **big data**? So far, we managed to load the complete file. But what if the file size ran into the gigabytes and we were only interested in a small subsection of the data? Loading the entire file into memory will significantly slow down your computer (unless you possess one with generous RAM, but even then). Instead, we can loop over the file object, which hands us one line at a time:

In [None]:
infile = open('data/locke_excerpt.txt', "rt")
for line in infile:
    print(line)
infile.close()

`infile.close()` closes our file, which is a very important operation: it prevents Python from keeping files open that are no longer needed.
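Because the loop receives the lines one by one, we can stop as soon as we have what we need. The sketch below illustrates this with a generated file of 100,000 lines (a stand-in for a really big file; the file name is made up): only the first three lines are ever processed.

```python
import os
import shutil
import tempfile

tmp = tempfile.mkdtemp()                       # temporary folder for this demo
path = os.path.join(tmp, 'many_lines.txt')     # hypothetical file name

# Create a file with many lines.
outfile = open(path, 'w', encoding='utf-8')
for number in range(100000):
    outfile.write('line %s\n' % number)
outfile.close()

first_three = []
infile = open(path, 'r', encoding='utf-8')
for line in infile:
    first_three.append(line)
    if len(first_three) == 3:
        break                  # stop early: the remaining lines are not processed
infile.close()
shutil.rmtree(tmp)             # remove the temporary folder

print(first_three)   # ['line 0\n', 'line 1\n', 'line 2\n']
```

With `read()` or `readlines()` all 100,000 lines would have been loaded first, only to throw most of them away.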

### Intermezzo: The 'newline' character

[MK]The 'newline' character is probably something new to you. If you are dealing with plain text files (typically files whose name ends in the '.txt' extension), your machine uses a special character internally to signal that a new line should begin. Internally, such newlines are represented as `"\n"`. Normally, this character is visualized on your screen as if the enter key were pressed. See what happens below: 

In [None]:
s = "This is the first line.\nThis is the second line."
print(s)

There exists a similar character to encode 'tab' characters, namely `\t`. You can use this character to play around with the indentation of your (e.g. hierarchically structured) output:


In [None]:
s = "First line\n\t* Second line\n\t* Third line\n\t* Fourth line\nFifth line"
print(s)

[MK] In the code block above in which you read the Locke excerpt line by line, the newline is still included at the end of each line: this is why you see all the extra empty lines in the output! If you wish to remove all leading and trailing whitespace from a string (newlines, spaces, but also tabs), you can use the `strip()` method:

In [None]:
s = "   strip me!    "
print(s)
print(s.strip())

*Exercise*: loop through the file and print each line without the leading and trailing whitespace.

#### End of intermezzo

## 1.5 Processing Files

Besides printing we can also manipulate the content of the file or extract information from it such as counting the number of lines. 

In [None]:
infile = open('data/pg10615.txt', "rt")
count = 0
for line in infile:
    count+=1
    
print(count)
infile.close()

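In the above code we loop over the file line by line and add one to `count` for each line, so that `count` ends up holding the number of lines in the file.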

In [None]:
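infile = open('data/locke_excerpt.txt', "rt")
new_lines = []
for line in infile:
    # keep only lines longer than 5 characters (newline included)
    if len(line) > 5:
        # lowercase the line and remove the surrounding whitespace
        new_lines.append(line.lower().strip())
    
print(new_lines)
infile.close()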

This is just a small teaser.

## 1.6 Context Manager

In many situations you have to read in and process large collections of text. Keeping all these files open is often pointless and might slow down your computer. [VU] In fact, it is good practice to close the file as soon as you do not need it anymore. Now, lo and behold, we can achieve that with the following:

In [None]:
infile.close()

[VU] There is actually an easier (and preferred) way to make sure that the file is closed as soon as you don't need it anymore, namely using what is called a <span style="background-color:yellow">context manager</span>:

In [None]:
with open('data/locke_excerpt.txt', "r") as infile:
    content = infile.read()
    
print(content)

[VU] The main advantage of using the with-statement is that it automatically closes the file once you leave the local context defined by the indentation level. If you 'manually' open and close the file, you risk forgetting to close the file. Therefore, context managers are considered a best-practice, and we will use the with-statement in all of our following code. 
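You can verify this behaviour yourself: every file object has a `closed` attribute. The sketch below (using a throwaway file with a made-up name in a temporary folder) shows that the file is closed after the with-block, even when an error occurs halfway through.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.txt')     # hypothetical file name

    with open(path, 'w', encoding='utf-8') as outfile:
        outfile.write('some text\n')
    closed_after_write = outfile.closed         # checked outside the with-block

    try:
        with open(path, 'r', encoding='utf-8') as infile:
            content = infile.read()
            # simulate a problem in the middle of processing
            raise RuntimeError('something went wrong')
    except RuntimeError:
        pass
    closed_after_error = infile.closed

print(closed_after_write)   # True
print(closed_after_error)   # True
```

With a manual `open()` and `close()`, the error would have skipped the `close()` call and left the file open.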

## 1.7 Writing Files

## Processing File Content

Reading files is just the first step of your research process. In most cases, we'd like to process the text further, for example going from a document to a table of word frequencies. In what follows, we perform some basic string processing that prepares your document for further analysis: counting the words in the document. 

Word-counting is a very rudimentary but nonetheless useful form of content analysis. Moretti coined the term 'distant reading' to argue that texts can be interpreted at some level of abstraction. Below, we compare two British philosophers from the Enlightenment.

Before counting words, we first have to decide what, exactly, constitutes a word. To make things easier we can define a word as everything between two whitespaces (spaces, but also hard returns such as '\n').

Also, which distinctions do we retain? For a computer 'Tree' and 'tree' are different things. Do we want to count them as the same item?

### Convert to lower case

A standard procedure in text-processing is lowercasing. This converts all capital characters to lowercase, such as in:

In [8]:
print('Donald Duck'.lower())

# Or the same

name = 'Trevor Noah'
print(name.lower())

donald duck
trevor noah


Again, why would we do this? The choice for lowercasing is arbitrary; we could just as well uppercase the whole text. The point here is **uniformisation**, i.e. we want to discard differences between elements that are not relevant to our research question. Here we make the explicit choice to treat 'Hamburger' and 'hamburger' as the same item. We want to count them as the same word. 

Equally, whether a word starts with a capital because it stands at the beginning of a sentence or appears in lowercase somewhere else, it remains the same word from a semantic point of view (in the majority of cases).

In [10]:
print("Don't lowercase this!".upper())

DON'T LOWERCASE THIS!


Of course, this is a choice, and should always be reported.
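The effect of this choice on counting can be made visible with a toy sentence of our own making. The sketch below splits the sentence on whitespace and counts the distinct items with and without lowercasing.

```python
sentence = "A Tree is a tree\nis a TREE"

# Splitting on whitespace (spaces and newlines) gives the separate items.
tokens = sentence.split()
print(tokens)
# ['A', 'Tree', 'is', 'a', 'tree', 'is', 'a', 'TREE']

# Without lowercasing, 'Tree', 'tree' and 'TREE' are three different items.
print(tokens.count('tree'))    # 1
print(len(set(tokens)))        # 6

# After lowercasing they are counted as one and the same word.
lowered = sentence.lower().split()
print(lowered.count('tree'))   # 3
print(len(set(lowered)))       # 3
```

The number of distinct items halves in this example, which shows how much such a seemingly small decision can change your counts.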

### Delete punctuation

Lowercasing discards largely unimportant differences in the text.

In [25]:
with open('data/pg10615.txt', "r") as infile:
    content = infile.read()
    
subsample = content[10000:10499].lower()
print(subsample)

not to envy
them, since they afford thee an opportunity of the like diversion, if
thou wilt make use of thy own thoughts in reading. it is to them, if
they are thy own, that i refer myself: but if they are taken upon trust
from others, it is no great matter what they are; they are not following
truth, but some meaner consideration; and it is not worth while to be
concerned what he says or thinks, who says or thinks only as he is
directed by another. if thou judgest for thyself i know thou wilt 


To transform this string into a list of words, we can split it on whitespace using the `split()` method.

In [26]:
print(subsample.split())

['not', 'to', 'envy', 'them,', 'since', 'they', 'afford', 'thee', 'an', 'opportunity', 'of', 'the', 'like', 'diversion,', 'if', 'thou', 'wilt', 'make', 'use', 'of', 'thy', 'own', 'thoughts', 'in', 'reading.', 'it', 'is', 'to', 'them,', 'if', 'they', 'are', 'thy', 'own,', 'that', 'i', 'refer', 'myself:', 'but', 'if', 'they', 'are', 'taken', 'upon', 'trust', 'from', 'others,', 'it', 'is', 'no', 'great', 'matter', 'what', 'they', 'are;', 'they', 'are', 'not', 'following', 'truth,', 'but', 'some', 'meaner', 'consideration;', 'and', 'it', 'is', 'not', 'worth', 'while', 'to', 'be', 'concerned', 'what', 'he', 'says', 'or', 'thinks,', 'who', 'says', 'or', 'thinks', 'only', 'as', 'he', 'is', 'directed', 'by', 'another.', 'if', 'thou', 'judgest', 'for', 'thyself', 'i', 'know', 'thou', 'wilt']


This looks better, but is not perfect yet. If you have a closer look, you notice that some punctuation marks are still glued to the words (for example 'are' and 'are;'). Again, a computer would count these as two totally different items, which leads us to underestimate the use of this verb (as the two forms are kept distinct). 

Ignore the code for now; just look at 'are' in the output, which shows the frequency of the words in this short fragment.

In [34]:
from collections import Counter
Counter(subsample.split())

Counter({'afford': 1,
         'an': 1,
         'and': 1,
         'another.': 1,
         'are': 3,
         'are;': 1,
         'as': 1,
         'be': 1,
         'but': 2,
         'by': 1,
         'concerned': 1,
         'consideration;': 1,
         'directed': 1,
         'diversion,': 1,
         'envy': 1,
         'following': 1,
         'for': 1,
         'from': 1,
         'great': 1,
         'he': 2,
         'i': 2,
         'if': 4,
         'in': 1,
         'is': 4,
         'it': 3,
         'judgest': 1,
         'know': 1,
         'like': 1,
         'make': 1,
         'matter': 1,
         'meaner': 1,
         'myself:': 1,
         'no': 1,
         'not': 3,
         'of': 2,
         'only': 1,
         'opportunity': 1,
         'or': 2,
         'others,': 1,
         'own': 1,
         'own,': 1,
         'reading.': 1,
         'refer': 1,
         'says': 2,
         'since': 1,
         'some': 1,
         'taken': 1,
         'that': 1,
         

Almost there! We almost have a clean list of words. The code below helps you to remove punctuation. 
First we define what punctuation actually comprises, using a standard list provided by Python:

In [36]:
import string
punct = string.punctuation
print(punct)
print(type(punct))

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
<class 'str'>


This returns a string, which contains the most frequent punctuation marks.

In the code below, we start by defining an empty string. Then we loop over all lines in John Locke's treatise. For each line, we again loop over all characters (notice the double for loop in this example) and add a whitespace to the initial string if the character belongs to the set of punctuation marks; otherwise we just add the character itself.

In [None]:
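chars = ''
# use a context manager so the file is closed once the loop is done
with open('data/pg10615.txt', 'rt') as infile:
    for line in infile:
        line_lowered = line.lower()
        for char in line_lowered:
            if char in punct:
                chars += ' '   # replace punctuation with a space
            else:
                chars += char  # keep every other character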

The result of this looks as follows:

In [None]:
print(chars[10000:10499].split())
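The double loop is explicit and easy to follow. As an aside, the same replacement can be written more compactly with the string methods `str.maketrans()` and `translate()`. This is an alternative technique, not the one used in this lecture, and the sample sentence is a short fragment taken from the excerpt above.

```python
import string

# Build a table that maps every punctuation mark to a space.
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

sample = "They are; they are not following truth, but some meaner consideration;"
cleaned = sample.lower().translate(table)
print(cleaned.split())
# ['they', 'are', 'they', 'are', 'not', 'following', 'truth', 'but', 'some', 'meaner', 'consideration']

# The character-by-character loop gives the same result.
looped = ''
for char in sample.lower():
    if char in string.punctuation:
        looped += ' '
    else:
        looped += char
print(looped == cleaned)   # True
```

Both versions now count 'are' twice, because 'are;' has lost its semicolon.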

### Counting Words

Now we are ready to count all the words using `Counter()`, which is provided by Python's standard `collections` module. There are other ways to count the frequency of elements in a list, but for now, this is the easiest option.

In [37]:
words = chars.split()
print('The number of words is %s'%len(words))
wf = Counter(words)

The number of words is 153643


`Counter()` is rather convenient: it allows you to easily list the most common words.

To list the most common words, call the `most_common()` method on the `wf` variable, which is a `Counter` object.

In [38]:
wf.most_common(10)

[('the', 7653),
 ('of', 7256),
 ('and', 4943),
 ('to', 4723),
 ('in', 3126),
 ('that', 2966),
 ('it', 2761),
 ('is', 2494),
 ('a', 2299),
 ('be', 1999)]
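To get a feeling for how a `Counter` behaves, it helps to try it on a list small enough to count by hand. The tokens below are our own toy example.

```python
from collections import Counter

tokens = ['the', 'ideas', 'the', 'mind', 'the', 'ideas']
counts = Counter(tokens)

# A Counter works like a dictionary: word -> frequency.
print(counts['the'])      # 3
print(counts['ideas'])    # 2

# Words that never occurred simply have a count of zero.
print(counts['locke'])    # 0

# most_common(n) returns the n most frequent items as (word, count) pairs.
print(counts.most_common(2))   # [('the', 3), ('ideas', 2)]
```

The list of pairs returned by `most_common()` has the same shape as the output for Locke's treatise above.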

What a disappointment, you might think at this point. All this work, just to obtain a list of these 'boring' words? The most frequent words are mostly not very informative, at least not if you want to pin down the topic of a text. These very frequent words are often called stopwords. 

On a side note: word frequencies have a rather persistent distribution, in which a small set of words is very frequent. For example, the 10 most frequent words account for around 26% of all words in the text; put differently, roughly 0.15% of the total vocabulary (the ten most frequent words) is enough to compose 26% of the text. For n=100: 60% of the text is composed of only about 1.5 percent of the vocabulary. If you want to compress your text, just throw away the 10 most frequent words!

The code below allows you to verify this.

In [61]:
topn = 100
total_words = len(words)
total_topn = sum([v for w,v in wf.most_common(topn)])
print(topn/len(set(words))*100)
print(total_topn/total_words*100)

1.4779781259237363
60.47590843709118


### Filtering "Function" Words

In [49]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kasparbeelen/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [52]:
from nltk.corpus import stopwords
stopw = stopwords.words('english')
print(stopw)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no

In [54]:
wf_filtered = Counter(w for w in words if w not in stopw)
wf_filtered.most_common(25)

[('ideas', 1398),
 ('one', 911),
 ('idea', 886),
 ('mind', 592),
 ('may', 501),
 ('man', 440),
 ('us', 424),
 ('men', 422),
 ('without', 381),
 ('things', 362),
 ('body', 361),
 ('think', 352),
 ('make', 341),
 ('simple', 339),
 ('yet', 328),
 ('would', 327),
 ('another', 326),
 ('power', 316),
 ('though', 313),
 ('innate', 297),
 ('motion', 290),
 ('parts', 254),
 ('duration', 251),
 ('use', 243),
 ('cannot', 240)]
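The filtering step does not depend on NLTK as such: any collection of words you want to ignore will do. The sketch below uses a tiny hand-made stopword list and a made-up sentence, purely for illustration (NLTK's list is far longer).

```python
from collections import Counter

# A tiny, hand-made stopword list (an assumption for this example).
my_stopwords = {'the', 'of', 'and', 'to', 'in', 'is'}

text = 'the idea of the mind is the idea of ideas in the mind and the idea'
sample_words = text.split()

# Count only the words that are not in the stopword list.
filtered = Counter(w for w in sample_words if w not in my_stopwords)
print(filtered.most_common())
# [('idea', 3), ('mind', 2), ('ideas', 1)]
```

The stopwords here are stored in a set (curly braces) rather than a list. For long texts this makes the `not in` test noticeably faster; in the cells above you could achieve the same by filtering against `set(stopw)`.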

## Writing files
### CSV Files

The previous steps reduced a book to a table of word frequencies. You do not want to repeat this procedure every time, so it makes sense to save the table as an intermediate result. A convenient format is a CSV file, where CSV stands for Comma-Separated Values. The comma in this case is called the **delimiter**: the character that separates the items on each row. The end of a row is usually marked by a hard return.

The content of an example CSV file:

```
ideas,1398
one,911
idea,886
```



In [66]:
content = ''
for key,value in wf_filtered.items():
    line = key+','+str(value)+'\n'
    content+=line
    
# or more concise
#content = '\n'.join(["{},{}".format(k,v) for k,v in wf.items()])

In [67]:
filename = "data/wf.csv"
with open(filename, "w") as outfile:
    outfile.write(content)

In [68]:
!ls data
!head data/wf.csv

locke_excerpt.txt pg10615.txt       wf.csv
﻿the,1
project,88
gutenberg,98
ebook,12
of,7256
an,439
essay,18
concerning,70
humane,6
understanding,158
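Gluing strings together works fine as long as the words themselves contain no commas. As an alternative, Python's standard `csv` module takes care of delimiters and quoting for you. The sketch below writes three of the frequencies shown earlier to a throwaway file (the file name and the header row are our own additions) and reads them back in.

```python
import csv
import os
import tempfile
from collections import Counter

# Three of the frequencies from the filtered table above.
frequencies = Counter({'ideas': 1398, 'one': 911, 'idea': 886})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'wf_example.csv')    # hypothetical file name

    # newline='' lets the csv module handle the line endings itself.
    with open(path, 'w', newline='', encoding='utf-8') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(['word', 'frequency'])    # header row
        for word, freq in frequencies.most_common():
            writer.writerow([word, freq])

    with open(path, 'r', newline='', encoding='utf-8') as infile:
        rows = list(csv.reader(infile))

print(rows)
# [['word', 'frequency'], ['ideas', '1398'], ['one', '911'], ['idea', '886']]
```

Note that everything read back from a CSV file is a string: the frequencies have to be converted with `int()` before you can calculate with them again.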


## JSON Files

### Google Books

In [70]:
from urllib.request import urlopen
import json
from pprint import pprint
antwoord=urlopen("https://www.googleapis.com/books/v1/volumes?q=shakespeare").read()
data=json.loads(antwoord.decode("utf-8"))
pprint(data)

{'items': [{'accessInfo': {'accessViewStatus': 'FULL_PUBLIC_DOMAIN',
                           'country': 'NL',
                           'embeddable': True,
                           'epub': {'downloadLink': 'http://books.google.nl/books/download/The_Works_of_Shakespear.epub?id=wsPe-P8lb8AC&hl=&output=epub&source=gbs_api',
                                    'isAvailable': False},
                           'pdf': {'downloadLink': 'http://books.google.nl/books/download/The_Works_of_Shakespear.pdf?id=wsPe-P8lb8AC&hl=&output=pdf&sig=ACfU3U07OWCZj5Z2mMdO5J4MsPgtzmmiJg&source=gbs_api',
                                   'isAvailable': True},
                           'publicDomain': True,
                           'quoteSharingAllowed': False,
                           'textToSpeechPermission': 'ALLOWED',
                           'viewability': 'ALL_PAGES',
                           'webReaderLink': 'http://play.google.com/books/reader?id=wsPe-P8lb8AC&hl=&printsec=frontcover&sour

                                          'in Shakespearean criticism over a '
                                          'period of about three decades. Many '
                                          'of them were written for specific '
                                          'occasions or specific reasons '
                                          'having to do with teaching or with '
                                          'panel discussions before diverse '
                                          'audiences, which she entered into '
                                          'along with others. In the process '
                                          'she contributed some of the best '
                                          'work on Shakespeare that was then '
                                          'extant, as this collection '
                                          'demonstrates. Searching for a '
                                          'principle of organizati
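Once parsed with `json.loads()`, a JSON response is nothing more than nested Python dictionaries and lists, which you navigate with square brackets. The sketch below uses a hand-made miniature that borrows a few keys from the output above; it is not real API output and needs no internet connection.

```python
import json

# A hand-made miniature of the structure shown above (not real API output).
raw = '{"items": [{"accessInfo": {"publicDomain": true, "country": "NL"}}]}'

data = json.loads(raw)              # JSON text -> Python dictionary
first_item = data['items'][0]       # 'items' holds a list of results
print(first_item['accessInfo']['publicDomain'])   # True
print(first_item['accessInfo']['country'])        # NL

# json.dumps() does the reverse: Python objects -> JSON text.
as_text = json.dumps(data, indent=2)
print(as_text)
```

JSON's `true` becomes Python's `True` on the way in, and is turned back into `true` by `json.dumps()`.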

## Obtaining and Reading Web Pages

- Focus on specific HTML structures: tables
- Scraping specific content from Web pages

In [None]:
from bs4 import BeautifulSoup as bs
import requests

base_url = "https://www.poemhunter.com/charles-bukowski/poems"

# download the raw HTML of the overview page
content = requests.get(base_url).content

In [None]:
soup = bs(content,'lxml')
tables=soup.find_all('table')
len(tables)

In [None]:
for table in tables:
    print(table.get('class','NaN'))

In [None]:
poems=soup.find('table',{'class':'poems'})

In [None]:
print(len(poems))

In [None]:
links = poems.find_all('a')

In [None]:
first_link = links[0]
url = first_link['href']
print(url)

In [None]:
import urllib.parse
poem_url = urllib.parse.urljoin(base_url,url)
print(poem_url)
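Links on a web page are often relative, just like the file paths in section 1.1, and `urljoin()` resolves them against the address of the page they were found on. The `href` values in the sketch below are invented for illustration and are not taken from the actual page.

```python
from urllib.parse import urljoin

base_url = "https://www.poemhunter.com/charles-bukowski/poems"

# Hypothetical href values, for illustration only.

# A link starting with '/' is resolved from the root of the website.
print(urljoin(base_url, "/poem/example/"))
# https://www.poemhunter.com/poem/example/

# A link without a leading '/' replaces the last part of the base URL.
print(urljoin(base_url, "page-2/"))
# https://www.poemhunter.com/charles-bukowski/page-2/

# A complete URL is left untouched.
print(urljoin(base_url, "https://example.org/other"))
# https://example.org/other
```

Using `urljoin()` instead of simply adding strings together saves you from having to work out which of these three cases applies.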

In [None]:
poem = bs(requests.get(poem_url).content,'lxml')
poem_div = poem.find('div',{'class':'KonaBody'}).find('p')#.find_all('br')
#print(poem_div)
print(str(poem_div).replace('<p>','').replace('</p>','').replace('<br/>','\n').strip())