[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kasparvonbeelen/ghi_python/blob/main/3%20-%20Text%20and%20String%20Methods.ipynb)


# 3 Working with Text: Strings and string methods


## Text Mining for Historians (with Python)
## A Gentle Introduction to Working with Textual Data in Python

### Created by Kaspar Beelen and Luke Blaxill

### For the German Historical Institute, London

<img align="left" src="https://www.ghil.ac.uk/typo3conf/ext/wacon_ghil/Resources/Public/Images/institute_icon_small.png">


## 3.1 String variables and methods

Variables can contain, or more correctly, refer to strings. You may have noticed how operations (such as addition) allow you to perform simple string manipulations. For example, we can write a program that prints a greeting with a name.

### -- Exercise: 

Change the value of the `first_name` and `last_name` variables so that the cell below prints a correct greeting.

In [None]:
first_name = 'First_name' # change this to your first name
last_name = 'Last_name' # change this to your last name
print("Hello"+' '+first_name+' '+last_name) # this combines the variables in a greeting

We'd achieve the same results by passing these variables as separate arguments to the `print()` function.

In [None]:
print("Hello", first_name, last_name)
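As a small sketch of why both cells print the same greeting: when `print()` receives several arguments, it joins them with a single space by default (this is its `sep` parameter). The names used below are placeholder values chosen for illustration only.

```python
first_name = 'Ada'        # placeholder value, for illustration only
last_name = 'Lovelace'    # placeholder value, for illustration only

# Option 1: build the greeting yourself with +
greeting_concat = "Hello" + ' ' + first_name + ' ' + last_name

# Option 2: mimic what print() does with several arguments (join them with a space)
greeting_joined = ' '.join(["Hello", first_name, last_name])

print(greeting_concat)
print("Hello", first_name, last_name)            # the default separator is a space
print("Hello", first_name, last_name, sep='-')   # sep lets you pick another separator
```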

But Python provides you with many more tools to process and manipulate strings (and, by extension, whole documents).

Below we first inspect the general syntax and discuss a few simple examples.

The `Breakout` provides more detailed background information.

Let's store (a part of) the famous opening sentence of "A Tale of Two Cities" in a variable `first_sentence`.

In [None]:
first_sentence = "It was the best of times, it was the worst of times."

### -- Exercise: 

Print the content of `first_sentence`.

In [None]:
# Enter answer here

String variables (and numbers) can be thought of as **objects**, "things you can do stuff with". In Python, each object has a set of **methods/functions** attached to it, which are the tools that enable you to manipulate these objects. 

If objects can be thought of as the **nouns** of a programming language, then methods/functions serve as the **verbs**: they are the tools that operate on (do something with) these objects. 

In general the methods (or functions) appear in these forms:
- `function(object)`
- `object.method()`

For string objects (`str` in Python), we can change the general notation to:
- `function(str)`
- `str.method()`


This may look confusing at first—and we can't go into detail here about these syntactic differences—but you will get familiar with the syntax pretty soon, we promise.
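To make the two notations a little more concrete, here is a minimal sketch that applies one function and one method to the same string. It uses `len()` and `str.upper()`, both of which are discussed further down in this notebook.

```python
first_sentence = "It was the best of times, it was the worst of times."

# function(object): the string goes between the parentheses
number_of_characters = len(first_sentence)

# object.method(): the string comes before the dot
shouted = first_sentence.upper()

print(number_of_characters)
print(shouted)
```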

Below we discuss a few functions and methods, which will provide you with the tools for working with text data (more technically strings).



### `len()`

`len()` takes an object and returns the number of elements, i.e. the length of the object. When given a string `len()` counts the number of characters, not words.

Applying `len()` to `first_sentence` should return 52.

In [None]:
len(first_sentence)
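The difference between counting characters and counting words can be sketched as follows. The example uses `str.split()`, a method not yet introduced in this notebook, which cuts a string into pieces wherever it finds whitespace; treat it as a rough preview rather than a proper way of counting words.

```python
first_sentence = "It was the best of times, it was the worst of times."

# len() on a string counts every character: letters, spaces and punctuation
n_characters = len(first_sentence)

# split() cuts the string on whitespace; len() then counts the resulting pieces
n_words = len(first_sentence.split())

print(n_characters)
print(n_words)
```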

The `first_sentence` variable is just a toy example. We can easily load the actual content of ["A Tale of Two Cities"](https://www.gutenberg.org/files/98/98-0.txt) and print the number of characters it contains. (Please ignore the code in the example; we show it here only to convince you how easily you could scale up from one line of text to a whole book.)

In [None]:
import requests 
book = requests.get('https://www.gutenberg.org/files/98/98-0.txt').content.decode('utf-8') # download book
print(book[:1000]) # print first 1000 characters

In [None]:
print(len(book)) # print the number of characters

### `str.lower()`

Lowercasing is often useful for normalizing texts, i.e. removing distinctions between words we don't really care about when analysing collections at scale. For example, many search engines use lowercasing in the background to provide you with all documents that match your query, i.e. if you search for `berlin` you will also get results for `Berlin` etc. Later in this course, when we focus on counting words, lowercasing will also be useful because we want to count `"Book"` and `"book"` as the same word.


Converting all capitals to lowercase is common practice in text mining, but of course, whether it's appropriate or not depends on the purposes of your research. For example, if you are interested in Named Entities (such as place names), you had better retain capitals, as these contain useful signals for detecting such entities.

However, the most important thing at this point is that you understand the syntax of the statement and what it returns. `str.lower()` acts on the string (which comes before the dot) and returns a string object.

Please note that this method works directly on a string or on a variable referring to a string. 

In [None]:
print('LOWERCASE ME!'.lower()) # lowercase and print

In [None]:
lowercase = 'LOWERCASE ME!' # variable assignment
print(lowercase.lower()) # lowercase variable and print


Both `len()` and `str.lower()` are called **fruitful** functions/methods: they return something (a number and a string, respectively).
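One consequence of a method *returning* something is worth a short sketch: lowercasing produces a new string and leaves the original untouched, so you need to assign the result to a variable if you want to keep it. The variable names below are chosen for illustration only.

```python
title = 'A Tale of Two Cities'

# lower() returns a new, lowercased string
title_lower = title.lower()

print(title)        # the original string is unchanged
print(title_lower)  # the lowercased copy

# after lowercasing, "Book" and "book" count as the same word
print('Book'.lower() == 'book'.lower())
```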


### -- Exercise

Lowercase the variable `first_sentence`, store the lowercased version in a new variable and print the length of this variable.

In [None]:
# add answer here

### `str.endswith(parameter)`

`str.endswith(parameter)` is another commonly used string method. It differs slightly from `str.lower()` because it requires an argument between the parentheses. `str.endswith(parameter)` returns a **boolean value**: `True` if the string on the left-hand side of the `.` ends with the string given as an argument, and `False` otherwise. This is commonly used to check the extension of a document, for example:

In [None]:
filename = 'document_1.txt'
filename.endswith('.txt')

In [None]:
filename.endswith('.doc')
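As a hedged sketch of how this check might be used on more than one file: the filenames below are made up, and the example relies on a list and a list comprehension, which have not been covered yet in this course. Note that `str.endswith()` also accepts a tuple of endings.

```python
# hypothetical filenames, for illustration only
filenames = ['document_1.txt', 'notes.doc', 'letter.TXT', 'scan.pdf']

# lowercase each name first, so that '.TXT' and '.txt' are treated alike
text_files = [name for name in filenames if name.lower().endswith('.txt')]

# passing a tuple checks for several endings at once
text_or_doc = [name for name in filenames if name.lower().endswith(('.txt', '.doc'))]

print(text_files)
print(text_or_doc)
```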

We are using some technical terms here, which will be explained in more detail later. However, we hope that you slowly start to pick up and remember some of these terms just by reading through the notebook. Don't worry too much about the explanations; try to understand how the code works. That's the most important thing at this point!

### `dir()`

Of course the Python string toolkit is much larger. Use the `dir()` function to see all the methods you can apply to a string. 

In [None]:
print(dir(str))

In [None]:
print(dir("Hello World."))

`dir()` returns a list of all the tools that apply to a string. You can ignore the items starting with `__`, but please look at those elements further down, for example the `str.upper()` method.

To inspect the `docstring` of a method, which explains its functionality, use `help()`.
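If the long list is hard to read, one possible way to hide the items starting with `__` is sketched below. It uses a list comprehension and `str.startswith()`, neither of which has been explained yet, so feel free to skip the details.

```python
# keep only the names that do not start with a double underscore
public_methods = [name for name in dir(str) if not name.startswith('__')]

print(public_methods)
print(len(public_methods))  # how many methods remain
```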

In [None]:
help(str.upper)

Let's see what `str.upper()` does!

In [None]:
'hello'.upper()

### -- Exercise

- Create a few code cells below
- Inspect the docstring of the following methods `str.strip()`, `str.isalpha()` and `str.startswith()`
- Create a new string variable (whatever text you prefer)
- Apply the above methods to the string and print the outcome

## `Breakout:`
- more about [string methods](break_out/string_methods.ipynb)[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kasparvonbeelen/ghi_python/blob/main/break_out/string_methods.ipynb)


## Indexing and slicing

Another common type of string manipulation is indexing and slicing. Indexing here means retrieving characters of a string (it could also be another data type) by their position (i.e. obtaining the fifth or last character of a word).

In Python, we start counting from `0`: to retrieve the first element, we add `[0]` to the end of a string (variable). Note the square brackets!

In [None]:
print(first_sentence[0])

To print the second character, we need to access the item at position 1.

In [None]:
print(first_sentence[1])

To access the last character, use `[-1]`.

In [None]:
print(first_sentence[-1])
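A short sketch of how positive and negative positions relate: because counting starts at `0`, the last character sits at position `len(first_sentence) - 1`, and `[-1]` is simply a shortcut for it.

```python
first_sentence = "It was the best of times, it was the worst of times."

# counting starts at 0, so the last valid position is the length minus one
last_position = len(first_sentence) - 1

print(first_sentence[last_position])  # last character, the long way
print(first_sentence[-1])             # last character, the short way
print(first_sentence[-2])             # second-to-last character
```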

Slicing is similar to indexing, but it allows you to select a sequence of (multiple) characters. We still use square brackets but add a colon. To the left of the colon stands the index of the first character to include; to the right, the index at which the slice stops (the character at that index is not included). 

Below we print everything between (and including) the sixth and tenth character.

In [None]:
print(first_sentence[5:10])
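The slice above can be unpacked in a small sketch. The variable names `start` and `stop` are chosen for illustration; the point is that the character at `start` is included while the character at `stop` is not, so the slice contains `stop - start` characters.

```python
first_sentence = "It was the best of times, it was the worst of times."

start, stop = 5, 10
piece = first_sentence[start:stop]

print(repr(piece))                 # repr() makes the spaces visible
print(len(piece) == stop - start)  # the slice holds stop - start characters
print(repr(first_sentence[stop]))  # the character at position stop is not part of the slice
```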

Negative indices can also be used for slicing.

In [None]:
print(first_sentence[-6:-1])

The first or last character can remain implicit.

In [None]:
print(first_sentence[:5])

In [None]:
print(first_sentence[-5:])
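A minimal sketch of what leaving out a bound means: an empty left side stands for "from the start", an empty right side for "until the end". Cutting a string at the same position with both forms therefore gives two pieces that add up to the original.

```python
first_sentence = "It was the best of times, it was the worst of times."

head = first_sentence[:5]   # from the start up to (not including) position 5
tail = first_sentence[5:]   # from position 5 until the end

print(repr(head))
print(head + tail == first_sentence)  # the two pieces rebuild the sentence
print(repr(first_sentence[-5:]))      # the last five characters
```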

Even though these operations seem pretty abstract, we will use indexing and slicing frequently later in this course. Please consult the `breakout` for more information.

### -- Exercise

- Assign the sentence (from Jane Austen's "Pride and Prejudice") below to a variable named `sentence`. (Please remember, double click on any Markdown cell to reveal the actual text)
>   "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife."                     
- Lowercase the sentence and assign it to `sentence_lower`
- Print the first and last **words** of the lowercased sentence

In [None]:
# Enter code here

## `Breakout`:
[More on string indexing](break_out/indexing_and_slicing.ipynb)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kasparvonbeelen/ghi_python/blob/main/break_out/indexing%20and%20slicing.ipynb)





## 3.3 Reading and Opening Text Files

In this section, we transition from experimenting with mock examples to working with more realistic, historical examples. First, we do this on a small scale, but soon we'll be processing thousands of newspaper articles!

**[Important:]** To proceed, we have to download and store the data used in the examples below. Please run the code cell below (it creates a new folder in which it then stores a sonnet of Shakespeare).

In [None]:
# ! Run this cell to download data used in examples below
from pathlib import Path
Path("working_data").mkdir(exist_ok=True)
!wget https://raw.githubusercontent.com/kasparvonbeelen/ghi_python/main/example_data/notebook_3/shakespeare_sonnet_i.txt -O working_data/shakespeare_sonnet_i.txt

To open a file in Python, you first have to tell Python where it is stored. More technically, you provide a location or `path` as a string. The `Break out` will point you to more information about the path syntax; for now, a simple example of (what is called) a **relative** path should suffice.

A relative path gives the location of a file relative to your current position in the folder structure of your working environment. In our case, this means relative to where the Notebook (the one in which you are working at the moment) is located.

To see the files in the current folder, run the `ls .` (list) command in the cell below.

In [None]:
!ls .

Please note that `!ls` starts with an exclamation mark. `ls` is a bash command you'd normally use in a terminal. This is not very important at the moment, just remember that lines starting with `!` are not Python code.

You should see the folder `working_data` in the output. Now we can list the items in `working_data`, again using `ls`.

In [None]:
!ls working_data/

The relative path to our file is `working_data/shakespeare_sonnet_i.txt`. Python requires you to define the path as a string (i.e. enclosed by single or double quotation marks).
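As an optional sketch, the same relative path can also be built with `pathlib` (the module used in the download cell above) instead of typing the string by hand. `Path.cwd()` shows the folder that the relative path is interpreted against; its output depends on where your notebook runs.

```python
from pathlib import Path

# join the folder name and the file name into one relative path
relative_path = Path("working_data") / "shakespeare_sonnet_i.txt"

print(relative_path.as_posix())     # the path written with forward slashes
print(relative_path.is_absolute())  # a relative path is not absolute
print(relative_path.name)           # just the file name
print(relative_path.suffix)         # just the extension
print(Path.cwd())                   # the folder you are currently working in
```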

Getting the location right is the first part of the puzzle. Next, we need some Python tools to open a file and read its content. It may sound confusing at first (why open _and_ read?), but these are separate steps in Python. 

Let's use the `open()` function to open the sonnet. As you will notice, this doesn't return the actual text, but a `_io.TextIOWrapper` object (you can safely ignore that).

In [None]:
path = "working_data/shakespeare_sonnet_i.txt"
sonnet = open(path)
sonnet
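A file opened with `open()` stays open until it is closed. The sketch below shows a common Python idiom, the `with` statement, which closes the file automatically. So that the example runs on its own, it first writes a small throwaway file; the file name `demo.txt` and the temporary folder are assumptions made for illustration, not part of the course data.

```python
import tempfile
from pathlib import Path

# create a small throwaway file so this example does not depend on the download
folder = Path(tempfile.mkdtemp())
demo_path = folder / "demo.txt"  # hypothetical file name
demo_path.write_text("From fairest creatures we desire increase,\n", encoding="utf-8")

# inside the with-block the file is open; afterwards Python closes it for us
with open(demo_path, encoding="utf-8") as handle:
    print(handle.closed)         # the file is still open here
    first_line = handle.read()   # read the content into a string

print(handle.closed)             # the file has been closed automatically
print(first_line)
```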

We need to apply the `read()` method to the `_io.TextIOWrapper` object to inspect the content of the file.

In [None]:
sonnet = open(path).read()
sonnet

Please note the special characters such as `\n` (which marks a new line). This becomes apparent when we print the sonnet.


In [None]:
print(sonnet)
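The role of `\n` can be seen in a small, self-contained sketch. The two lines below were typed in by hand from the opening of the sonnet, so their spelling and punctuation may differ slightly from the downloaded file. `str.splitlines()` is a string method not discussed elsewhere in this notebook.

```python
two_lines = "From fairest creatures we desire increase,\nThat thereby beauty's rose might never die,"

print(repr(two_lines))  # repr() shows the raw string, with \n visible
print(two_lines)        # print() turns \n into an actual line break

# splitlines() cuts the string at every newline character
lines = two_lines.splitlines()
print(len(lines))
print(lines[0])
```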

Since the `sonnet` variable refers to a string, we can use everything we learned before to analyse and manipulate this string.

In [None]:
len(sonnet)

In [None]:
sonnet.lower()

`str.find()` allows you to query a string for a substring. It returns the lowest index at which the substring is found, i.e. the position where the first match starts, or `-1` if the substring does not occur.

In [None]:
help(str.find)

In [None]:
sonnet.find('riper')

In [None]:
sonnet[98:]
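Rather than typing the position by hand, the number returned by `str.find()` can be stored in a variable and reused for slicing. The sketch below uses the first three lines of the sonnet, typed in by hand (so they may differ slightly from the downloaded file), and also shows what happens when the substring is missing.

```python
text = ("From fairest creatures we desire increase,\n"
        "That thereby beauty's rose might never die,\n"
        "But as the riper should by time decease,")

query = 'riper'
position = text.find(query)  # index where the first match starts

print(position)
print(text[position:position + len(query)])  # slice out exactly the match
print(text[position:])                       # everything from the match onwards

# find() returns -1 when the substring does not occur
missing = text.find('Hamlet')
print(missing)
```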


## `Break out`
- [reading and writing files](https://openbookproject.net/thinkcs/python/english3e/files.html)**[Under construction]**
- [paths and filenames](break_out/paths.ipynb)

## Fin.