# Intro to Python

---

## Jupyter Notebooks

This is our introduction to Jupyter Notebooks.

---

### Working with text files

First we define a reference to the file we want to open.

In [3]:
file_name = 'wollstonecraft-letters.txt'

Next, we open a connection to the file.

Here, we use the `open` function to open a connection to the file and store the result in a variable, `file`. This variable holds a reference to the open file and gives us access to information about it.

The `open` function accepts the file name as its first argument and, optionally, a mode: typically `'r'` for read, `'w'` for write, or `'a'` for append. The `'r'` can be left out because it is the default.

In [5]:
file = open(file_name, 'r')
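
As a minimal sketch of the three modes, the following cell writes, appends to, and then reads back a small scratch file. The file name `modes_demo.txt` and the temporary folder are hypothetical, chosen only so that nothing in the workshop folder gets overwritten. The `write`, `read`, and `close` methods used here are standard file methods.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary folder (not part of the workshop data)
demo_path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

# 'w' creates the file, or empties it if it already exists
out = open(demo_path, 'w')
out.write('first line\n')
out.close()

# 'a' keeps the existing content and adds to the end of it
out = open(demo_path, 'a')
out.write('second line\n')
out.close()

# 'r' is the default mode, so it can be left out
src = open(demo_path)
content = src.read()
src.close()

print(content)
```

Be careful with `'w'`: opening an existing file in write mode erases its previous content.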

Our file reference has properties and methods: properties are characteristics of the file, and methods are things we can do with the file. We get the properties and run the methods using dot notation. Properties do not take parentheses, whereas methods require them.

In [6]:
file.name

'wollstonecraft-letters.txt'

In [7]:
file.readable()

True

In [9]:
file.writable()

False

In [10]:
file.close()

In [11]:
file.closed

True

Now that we've closed the file, we can reopen it with a shorter variable name. To read the data from the file, we use the `readlines()` method. This generates a list of lines. A line is defined as everything up to and including a newline character. As you can see here, the source text file includes numerous newline characters, `\n`, and these separate the lines.

When I run the `readlines()` method, notice what I get is a data structure that begins and ends with brackets `[]` and between those brackets, the lines are separated by commas. So it is a comma-separated list of lines between brackets. This is a very common Python data structure.

After doing whatever I want to do with the information, I must also close the file, because each file connection consumes resources and only a limited number of connections can be open at once.

In [12]:
f = open(file_name)
data = f.readlines()
f.close()

In [13]:
data

['The Project Gutenberg eBook of Letters Written During a Short Residence in Sweden, Norway, and Denmark\n',
 '    \n',
 'This ebook is for the use of anyone anywhere in the United States and\n',
 'most other parts of the world at no cost and with almost no restrictions\n',
 'whatsoever. You may copy it, give it away or re-use it under the terms\n',
 'of the Project Gutenberg License included with this ebook or online\n',
 'at www.gutenberg.org. If you are not located in the United States,\n',
 'you will have to check the laws of the country where you are located\n',
 'before using this eBook.\n',
 '\n',
 'Title: Letters Written During a Short Residence in Sweden, Norway, and Denmark\n',
 '\n',
 'Author: Mary Wollstonecraft\n',
 '\n',
 'Release date: November 1, 2002 [eBook #3529]\n',
 '                Most recently updated: June 5, 2022\n',
 '\n',
 'Language: English\n',
 '\n',
 '\n',
 '\n',
 '*** START OF THE PROJECT GUTENBERG EBOOK LETTERS WRITTEN DURING A SHORT RESIDENCE IN SWEDEN,

I can also iterate over the file line by line. Using a Python programming structure called a `for` loop, I can go through the lines one at a time and do something with each one. In this case, I print it. Because each line already ends with its own newline character, I tell `print` to end with nothing at all instead of its default newline; otherwise every line would be followed by an extra blank line. The result is a printout of each line in the text.

In [15]:
f = open(file_name)
for line in f:
    print(line, end='')

The Project Gutenberg eBook of Letters Written During a Short Residence in Sweden, Norway, and Denmark
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: Letters Written During a Short Residence in Sweden, Norway, and Denmark

Author: Mary Wollstonecraft

Release date: November 1, 2002 [eBook #3529]
                Most recently updated: June 5, 2022

Language: English



*** START OF THE PROJECT GUTENBERG EBOOK LETTERS WRITTEN DURING A SHORT RESIDENCE IN SWEDEN, NORWAY, AND DENMARK ***




CASSELL'S NATIONAL LIBRARY.




LETTERS
WRITTEN
_DURING A SHORT RESIDENCE_
IN
SWEDEN, NORWAY, AND
D

After I'm done iterating over the file, I can't iterate over it again without reopening the file. Think of it as a playhead. Once the loop has finished, the playhead is at the end of the file, so if I try iterating now, there's nothing left to iterate over.

In [16]:
for line in f:
    print(line, end='')

In [17]:
f.close()
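
Reopening is not the only way to start over: the playhead can also be moved back to the beginning. The following is a minimal sketch using the file object's `seek` method on a small scratch file. The file name `playhead_demo.txt` is hypothetical and stands in for the workshop text.

```python
import os
import tempfile

# Hypothetical scratch file so the example does not depend on the workshop data
demo_path = os.path.join(tempfile.mkdtemp(), 'playhead_demo.txt')
out = open(demo_path, 'w')
out.write('one\ntwo\nthree\n')
out.close()

f = open(demo_path)
first_pass = list(f)    # reads every line; the playhead is now at the end
second_pass = list(f)   # nothing left to read, so this list is empty
f.seek(0)               # move the playhead back to the start of the file
third_pass = list(f)    # the lines are available again
f.close()

print(len(first_pass), len(second_pass), len(third_pass))
```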

I could also go one line at a time if I wanted to, using the `next` function.

In [18]:
f = open(file_name)
print(next(f))
print(next(f))
f.close()

The Project Gutenberg eBook of Letters Written During a Short Residence in Sweden, Norway, and Denmark

    



Having to manually close each file after every use is error-prone, so instead we will typically use something called a context manager, which automatically closes the connection for us. A context manager uses the `with` keyword. Here, I use the `with` keyword, then run the `open` function, which returns the reference to the file as before. I assign that reference to a variable name using the keyword `as`. Then I indent the code that should run while the connection is open. Any code that is indented in this way runs within the context manager. Notice that afterwards, the file connection is closed.

Within the code block, I am going to first read the data. As we saw earlier, this generates a list of lines. We can find out the length of the list using the `len` function.

In [20]:
with open(file_name) as f:
    data = f.readlines()
    print(len(data))

5428


In [22]:
f.closed

True
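
One reason the context manager is less error-prone is that it closes the file even when the indented code fails partway through. Here is a small sketch of that behavior, using a hypothetical scratch file (`context_demo.txt`) and a deliberately triggered error. The `try` and `except` keywords are used only to keep the error from stopping the cell.

```python
import os
import tempfile

# Hypothetical scratch file so the example does not depend on the workshop data
demo_path = os.path.join(tempfile.mkdtemp(), 'context_demo.txt')
with open(demo_path, 'w') as out:
    out.write('some text\n')

try:
    with open(demo_path) as f:
        data = f.readlines()
        int('not a number')   # deliberately raises a ValueError
except ValueError:
    print('Something went wrong inside the block')

# The file was closed anyway when the with block ended
print(f.closed)
```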

Now that I know the number of lines, I can create a selection of lines using a technique called slicing. Notice in this code sample, I retrieve the lines and store the reference to the lines in a variable called `data`. Then I slice the data using brackets and a colon: `data[2000:2050]` means the elements at indexes 2000 through 2049. The slice includes the start index but excludes the end index. Then I use our old friend the `for` loop to iterate over that selection.

In [24]:
with open(file_name) as f:
    data = f.readlines()
    for line in data[2000:2050]:
        print(line, end='')

in Sweden, and after I arrived at Tonsberg. By chance I found a fine
rivulet filtered through the rocks, and confined in a basin for the
cattle. It tasted to me like a chalybeate; at any rate, it was pure; and
the good effect of the various waters which invalids are sent to drink
depends, I believe, more on the air, exercise, and change of scene, than
on their medicinal qualities. I therefore determined to turn my morning
walks towards it, and seek for health from the nymph of the fountain,
partaking of the beverage offered to the tenants of the shade.

Chance likewise led me to discover a new pleasure equally beneficial to
my health. I wished to avail myself of my vicinity to the sea and bathe;
but it was not possible near the town; there was no convenience. The
young woman whom I mentioned to you proposed rowing me across the water
amongst the rocks; but as she was pregnant, I insisted on taking one of
the oars, and learning to row. It was not difficult, and I do not know a
pleasante

## Python Lists

In all programming languages, you can put diverse bits of data into groups called **collections**. Some collections have no order, other collections are ordered. An ordered collection is often called a **sequence**. In Python, the most common sequence type is the list. Lists are very useful and very commonly used in Python. We have seen how, when we call the `readlines()` method on a file in Python, the method returns a list of lines, and we saw how we can iterate over those lines. Because of how common lists are, and how useful it is to iterate over lists, it is worth pausing to take a look at these concepts.

We declare a list by choosing a variable name and then using the assignment operator, the equals sign, to assign a list literal to that name.

In [27]:
words = ['gracefully', 'gambols', 'phrase', 'edge']

Notice here, we have made a list of what we would call in everyday language "words". In Python we call these **strings** because they are strings of characters. We put them between quotes. Putting them between quotes indicates that we are dealing with strings of characters, not with some sort of command or Python keyword. In this case, the term `words` is not a string, it's a variable name. We don't put it between quotes.

In [28]:
words[0]

'gracefully'

In [29]:
words[1]

'gambols'

In [30]:
words[2]

'phrase'

In [31]:
words[3]

'edge'

A list is a sequence, which means order matters. Each element in a list therefore has an **index**, which is a numerical expression of its place in the sequence. Python, along with many other programming languages, starts indexing at 0, not at 1, so the first element is always at index 0. Some languages commonly encountered in statistical analysis, such as R and Matlab, start their indexes at 1.

If we try to get an index that is out of bounds, we will cause an error.
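
To illustrate, here is a short sketch of what happens with an out-of-bounds index. The list has four elements, so the valid indexes are 0 through 3. The `try` and `except` keywords are used only so that the error is reported without stopping the cell.

```python
words = ['gracefully', 'gambols', 'phrase', 'edge']

try:
    words[4]   # there is no element at index 4
except IndexError as error:
    message = str(error)
    print('IndexError:', message)

# The last valid index is always one less than the length of the list
last_index = len(words) - 1
print(words[last_index])
```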

As we saw earlier, we can get a slice of a list.

In [32]:
words[1:3]

['gambols', 'phrase']

When you use slicing syntax, the indexes you use are inclusive for the start index, but exclusive for the end index.
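
A few more slices of the same list may make the rule concrete. This sketch also shows that either index can be left out, in which case the slice runs from the beginning or through the end of the list.

```python
words = ['gracefully', 'gambols', 'phrase', 'edge']

first_two = words[0:2]   # indexes 0 and 1; index 2 is excluded
same_two = words[:2]     # leaving out the start means "from the beginning"
last_two = words[2:]     # leaving out the end means "through the last element"

print(first_two)
print(same_two)
print(last_two)

# The number of elements in a slice is the end index minus the start index
print(len(words[1:3]))
```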

You can add elements to a list in several different ways. The most common, and the most efficient from a computational perspective, is to append an element to the end of the list. You do this with the `append` method.

In [33]:
words.append('chalybeate')
words

['gracefully', 'gambols', 'phrase', 'edge', 'chalybeate']

It goes beyond the scope of this workshop to explore all the different ways you can manipulate lists. You can extend a list with another list. You can splice an entire list into an existing list with slice assignment. You can delete elements at any place in the list, replace elements with others, or insert an element at an index. Keep in mind that inserting an element in the middle of a list is computationally slower than appending, especially for large lists, because every later element has to shift over by one position.

In [34]:
words.insert(2, 'seashore')
words

['gracefully', 'gambols', 'seashore', 'phrase', 'edge', 'chalybeate']
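
As a brief sketch of some of the other manipulations mentioned above, the following cell works on a separate list, `demo_words`, so that the `words` list used in the rest of the workshop is left untouched. The name `demo_words` and the added words are chosen only for illustration.

```python
demo_words = ['gracefully', 'gambols', 'seashore', 'phrase', 'edge', 'chalybeate']

demo_words.extend(['rivulet', 'fountain'])  # add every element of another list
demo_words[3] = 'sentence'                  # replace the element at index 3
del demo_words[0]                           # delete the element at index 0
removed = demo_words.pop()                  # remove and return the last element

print(demo_words)
print(removed)
```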

You can generally find quick solutions for various types of list manipulations by doing an online search or using an LLM.

## Python Iteration

In Python, we say we iterate over a collection when we do the same thing to its elements, over and over. We can do this once for every element in the collection, or only for some of them, such as the elements that fulfill some condition.

We typically iterate over a list using a **for loop**, as shown earlier in this workshop. A very simple approach to this is to just print the content of the list.

In [37]:
for word in words:
    print(word)

gracefully
gambols
seashore
phrase
edge
chalybeate
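
Iterating only over elements that fulfill a condition can be sketched by placing an `if` statement inside the loop. In this example the condition, a word length greater than six characters, is arbitrary, and the names `long_words` and `sample_words` are chosen only for illustration.

```python
sample_words = ['gracefully', 'gambols', 'seashore', 'phrase', 'edge', 'chalybeate']

long_words = []
for word in sample_words:
    # Only words with more than six characters pass the test
    if len(word) > 6:
        long_words.append(word)
        print(word)
```

The indented lines under the `if` run only for the words that pass the test; the other words are skipped.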


You can also carry out some sort of modification before printing it.

In [39]:
for word in words:
    print(word.capitalize())

Gracefully
Gambols
Seashore
Phrase
Edge
Chalybeate


Note that when I do this, the list itself is not modified.

In [40]:
words

['gracefully', 'gambols', 'seashore', 'phrase', 'edge', 'chalybeate']

It is considered unsafe to modify a list while one is iterating over it. If we do want to iterate over a list and modify all the elements, there are several ways of doing that, including enumeration and comprehensions. However, these are well beyond the scope of an introductory workshop. The most straightforward way to do it is to just create a new list and populate it as we iterate.

In [42]:
cap_words = []
for word in words:
    cap_words.append(word.capitalize())
cap_words

['Gracefully', 'Gambols', 'Seashore', 'Phrase', 'Edge', 'Chalybeate']

In this example, I first declare an empty list. Then I iterate over the `words` list, and for every word in it, I append that word, capitalized, to the new list. At the end, I have a new list of the same words, but capitalized.

## Importing libraries

Python ships with a large collection of built-in modules called the **standard library**. However, for many purposes it is useful to **import** libraries of code beyond the standard library. Many of these libraries do not ship with the basic Python installation. Instead, you first have to install a package, then import the library from the package.

The usual place to find 3rd party Python code packages is the Python Package Index, [PyPI](https://pypi.org/). You can install packages from PyPI using a command, `pip install`. Note that it is best practice to create a virtual environment before installing Python packages.

In a Jupyter notebook you typically have to precede this command with an exclamation mark: `!pip install package_name` or, on a Mac, typically `!pip3 install package_name`. The exclamation mark instructs your notebook to run the command in the system shell (the Terminal) instead of interpreting it as Python code.

In this case, let's say we have done a search for "python turn pdf to text". We have found [a link to a source](https://www.geeksforgeeks.org/convert-pdf-to-txt-file-using-python/) that seems like it could work. The article says to install the PyPDF2 package using `pip install`. After making sure you have a virtual environment running, execute the following command (or `!pip install`, if that is what works on your system):

In [4]:
!pip3 install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1

[notice] A new release of pip is available: 24.2 -> 25.0
[notice] To update, run: pip3 install --upgrade pip


Now we can `import` the required library from the PyPDF2 package. First, make sure the PDF you want to convert is in the same folder as your notebook file. We will use a PDF from our UTSA Libraries Special Collections [Oral History collection](https://digital.utsa.edu/digital/custom/oralhistories/). Once the PDF is in the same folder as your notebook, you can run the code that we found online. The first part of the code is a Python **function**, which is a set of computer instructions that is designed to be reusable.

In Python, a function is defined with the `def` keyword. Following the `def` keyword is the name of the function, in this case `pdf_to_text`. After the name comes a pair of parentheses. Inside are the names of the function's **arguments** (more formally, its parameters), which are the pieces of information it needs in order to do its job. In this case it needs the path to the PDF file and the name of the output text file.

The function will open the PDF file, create a PyPDF2 reader for it, prepare a variable to store the text, then read through the PDF one page at a time, adding the extracted text to the text storage variable. Finally, it will open the text file and write the text to that file.

When we run the code block with the function in it, nothing visible will happen. The Python interpreter will read the function and load it into memory, but this only *defines* the function. It does not run the instructions inside the function.
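
The difference between defining a function and running it can be seen in a much smaller example. The function `count_words` below is hypothetical and not part of the code we found online; it exists only to illustrate the idea.

```python
def count_words(text):
    # Split the text on whitespace and count the resulting pieces
    return len(text.split())

# Nothing has run yet: the lines above only defined the function.
# Calling the function by name, with an argument, runs its instructions.
number = count_words('Chance likewise led me to discover a new pleasure')
print(number)
```

Only the last two lines produce a result, because that is where the function is actually called.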

In [7]:
import PyPDF2

def pdf_to_text(pdf_path, output_txt):
    # Open the PDF file in read-binary mode
    with open(pdf_path, 'rb') as pdf_file:
        # Create a PdfReader object instead of PdfFileReader
        pdf_reader = PyPDF2.PdfReader(pdf_file)

        # Initialize an empty string to store the text
        text = ''

        # Go through the PDF one page at a time and collect its text
        for page in pdf_reader.pages:
            text += page.extract_text()

    # Write the extracted text to a text file
    with open(output_txt, 'w', encoding='utf-8') as txt_file:
        txt_file.write(text)

Next we see another code block. This starts with a conditional statement, `if __name__ == "__main__":`. This means the code should run only if the file is being run as a script, not if it is being imported as a module. We are not yet at the point of writing and importing modules, so any code we run is going to be run as a script. Notice that as written, the code refers to files called 'gfg.pdf' and 'gfg.txt', which are not the files we are using. So, we will have to change these variables before running the code. Otherwise we will get an error, because the Python interpreter will not find the 'gfg.pdf' file. We can also dispense with the conditional.

In [2]:
if __name__ == "__main__":
    pdf_path = 'gfg.pdf'

    output_txt = 'gfg.txt'

    pdf_to_text(pdf_path, output_txt)

    print("PDF converted to text successfully!")

Let's rewrite the code to suit our needs.

In [8]:
pdf_path = 'p15125coll4_1078.pdf'

output_txt = 'doc.txt'

pdf_to_text(pdf_path, output_txt)

print("PDF converted to text successfully!")

PDF converted to text successfully!


When we look at our folder now, we should see the 'doc.txt' file has been created and has the correct text in it. We can now do any of the things we were doing earlier with text files, such as find out the number of lines.

In [9]:
with open('doc.txt') as doc:
    data = doc.readlines()
    print(len(data))

602


We now have our Python environment successfully set up. We have read and looped through text files and converted pdfs to text. We are ready to do text analysis.