# 3 Reading texts at scale: Strings and string methods

## 3.2. String variables and methods

Variables can contain (or, more correctly, refer to) strings. You may have noticed how operations such as addition allow you to do simple string manipulation. For example, we can write a simple program that prints a greeting with a name.

-- Exercise: change the values of the `first_name` and `last_name` variables, so that the cell below prints a correct greeting.

In [5]:
first_name = 'First_name'
last_name = 'Last_name'
print("Hello"+' '+first_name+' '+last_name)

Hello First_name Last_name
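
Concatenation with `+` is not the only way to build such a greeting. The sketch below reuses the placeholder values from the cell above and shows two common alternatives, an f-string and `str.join()`. The names `greeting_plus`, `greeting_fstring`, and `greeting_join` are introduced here for illustration only.

```python
first_name = 'First_name'
last_name = 'Last_name'

# Concatenation with +, as in the cell above
greeting_plus = "Hello" + ' ' + first_name + ' ' + last_name

# An f-string inserts the variable values at the {placeholders}
greeting_fstring = f"Hello {first_name} {last_name}"

# join() glues a list of strings together with the separator ' '
greeting_join = ' '.join(["Hello", first_name, last_name])

print(greeting_fstring)
print(greeting_plus == greeting_fstring == greeting_join)
```

All three approaches produce the same string; which one you use is mostly a matter of readability.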


But Python provides you with many tools to process strings (and, by extension, whole documents).

Below we first inspect the general syntax and discuss a few simple examples.

The intermezzo provides more detailed background information.

Let's store part of the opening sentence of Charles Dickens' "A Tale of Two Cities" in a variable with the name `first_sentence`.

In [6]:
first_sentence = "It was the best of times, it was the worst of times."

-- Exercise: print the content of `first_sentence`.


Strings and numbers can be thought of as **objects**, "things you can do stuff with". In Python, each object has a set of **methods/functions** attached to it. If objects can be thought of as **nouns**, then methods/functions serve as **verbs**: they are the tools that operate on (do something with) these objects.

In general, methods (or functions) appear in one of these forms:
- `function(object,argument)`
- `object.method(arguments)`
    
In the example below, we apply the `len()` function to measure the number of characters in a string; the `.lower()` method lowercases all characters.

Both are called **fruitful** functions, as they return something (i.e. a number and a string, respectively).

Below we apply a method to the string directly.

In [None]:
print('HELLO'.lower())
print(len('HELLO'))

Of course, the methods can also be applied to variables:

In [7]:
print(first_sentence.lower())
print(len(first_sentence))

it was the best of times, it was the worst of times.
52
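
Because `len()` and `.lower()` are fruitful, their return values can be stored in variables instead of being printed straight away. The following is a minimal sketch of that idea; the names `lowercased` and `number_of_characters` are made up for this illustration.

```python
first_sentence = "It was the best of times, it was the worst of times."

# Store the returned values instead of printing them right away
lowercased = first_sentence.lower()
number_of_characters = len(first_sentence)

print(lowercased)
print(number_of_characters)

# The original string is unchanged: .lower() returned a new string
print(first_sentence)
```

Note that `first_sentence` still starts with a capital letter afterwards: the method returns a new string rather than modifying the existing one.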


### *str.lower()*

Lowercasing is often useful for normalizing texts, i.e. removing distinctions between words that we don't really care about when processing collections at scale. We discuss a few more methods later. Converting all capitals to lowercase is a fairly common step, but whether it is appropriate depends on the purposes of your research. For example, if you are interested in Named Entities (such as place names), you may prefer to retain capitals.

However, the most important thing at this point is that you understand the syntax of the statement and what it returns.
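
To make the effect of normalization concrete, the sketch below counts how often the phrase `'it was'` occurs in the Dickens sentence, before and after lowercasing. It uses the `count()` string method; the variable names `count_before` and `count_after` are hypothetical.

```python
first_sentence = "It was the best of times, it was the worst of times."

# Without lowercasing, 'It was' and 'it was' are different strings
count_before = first_sentence.count('it was')

# After lowercasing, both occurrences match
count_after = first_sentence.lower().count('it was')

print(count_before)
print(count_after)
```

The first count misses the occurrence at the start of the sentence because of its capital letter; after lowercasing, both occurrences are found.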

In [None]:
# something more here

### *str.endswith()*
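
As a brief illustration (a sketch, not a complete treatment), `endswith()` checks whether a string ends with a given substring and returns `True` or `False`. Its counterpart `startswith()` does the same for the beginning of a string. Both are case-sensitive.

```python
first_sentence = "It was the best of times, it was the worst of times."

# endswith() returns a boolean: True or False
ends_with_times_stop = first_sentence.endswith('times.')
ends_with_times = first_sentence.endswith('times')      # ignores the final full stop

# startswith() is case-sensitive, so lowercasing first changes the answer
starts_with_it = first_sentence.startswith('it')
starts_with_it_lowered = first_sentence.lower().startswith('it')

print(ends_with_times_stop, ends_with_times)
print(starts_with_it, starts_with_it_lowered)
```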

### dir()

In [11]:
dir(str)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']
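
The list returned by `dir(str)` mixes special names (those surrounded by double underscores) with the everyday string methods. As a sketch, the names can be filtered so that only the latter remain; `public_methods` is a name chosen here for illustration.

```python
# Keep only the names that do not start with an underscore
public_methods = [name for name in dir(str) if not name.startswith('_')]

print(len(public_methods))
print(public_methods[:5])

# Every method carries a short description in its __doc__ attribute
print(str.endswith.__doc__)
```

The exact number of methods depends on your Python version, so the length is printed rather than assumed.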


## `Intermezzo`
- other string operations
- string indexing

## 3.3 Reading and opening text documents

But let's now transition to working with more realistic, historical examples: first on a small scale, but by the end of the chapter we'll be processing thousands of newspaper articles.

To open a file in Python, you have to give a location or `path`, i.e. indicate where it is stored on your computer. The `Intermezzo` below will point you to more information about the syntax of path names, but for now we use a simple example of what is called a **relative** path: a path relative to the folder from which you are running this notebook.

To see the files in this folder, run the `ls` (list) command in the cell below.

In [12]:
!ls

1 - Introduction.ipynb                 LICENSE
2 - First Steps.ipynb                  README.md
3_1 - Reading texts Introduction.ipynb data
3_2 - Strings and String Methods.ipynb example_data


You see the folder `example_data` appearing. Now we can list the subfolders or files in `example_data` again using `ls`.

In [14]:
!ls example_data/

notebook_3


You see that this series of notebooks has a specific folder called `notebook_3`, which you can inspect in the same way, but don't forget to add `example_data` to the path.

In [15]:
!ls example_data/notebook_3/

shakespeare_sonnet_i.txt


Now you can see the file we were searching for! So the relative path is `example_data/notebook_3/shakespeare_sonnet_i.txt`. Python requires you to define the path as a string (i.e. enclosed by single or double quotation marks).
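
Because the path is an ordinary string, you can inspect it with standard tools before trying to open the file. The sketch below uses functions from Python's built-in `os` module; whether `os.path.exists()` reports `True` depends on the folder from which you run the code.

```python
import os

path = "example_data/notebook_3/shakespeare_sonnet_i.txt"

print(type(path))               # the path is an ordinary string
print(os.path.basename(path))   # only the file name, without the folders
print(os.getcwd())              # the folder that relative paths start from
print(os.path.exists(path))     # True only if the file is found from that folder
```

If the last line prints `False`, the relative path does not match the folder you are working in, which is a common cause of errors when opening files.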

This is the first part of the puzzle. Next, we need some Python tools to open a file and read its content. It may come across as confusing at first, but these are separate steps in Python.

Let's use the `open()` function to open the sonnet. As you will notice, this doesn't return the actual text, but an `_io.TextIOWrapper` object (you can safely ignore that for now).

In [16]:

path = "example_data/notebook_3/shakespeare_sonnet_i.txt"
sonnet = open(path)
sonnet

<_io.TextIOWrapper name='example_data/notebook_3/shakespeare_sonnet_i.txt' mode='r' encoding='UTF-8'>

We need to apply the `read()` method to the `_io.TextIOWrapper` object to inspect the content of the file.

In [17]:
sonnet = open(path).read()
sonnet

"From fairest creatures we desire increase,\nThat thereby beauty's rose might never die,\nBut as the riper should by time decease,\nHis tender heir might bear his memory:\nBut thou, contracted to thine own bright eyes,\nFeed'st thy light's flame with self-substantial fuel,\nMaking a famine where abundance lies,\nThyself thy foe, to thy sweet self too cruel:\nThou that art now the world's fresh ornament,\nAnd only herald to the gaudy spring,\nWithin thine own bud buriest thy content,\nAnd tender churl mak'st waste in niggarding:\nPity the world, or else this glutton be,\nTo eat the world's due, by the grave and thee."

Please note the special characters such as `\n`, which marks a new line; this becomes apparent when you print the string.

In [18]:
print(sonnet)

From fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou, contracted to thine own bright eyes,
Feed'st thy light's flame with self-substantial fuel,
Making a famine where abundance lies,
Thyself thy foe, to thy sweet self too cruel:
Thou that art now the world's fresh ornament,
And only herald to the gaudy spring,
Within thine own bud buriest thy content,
And tender churl mak'st waste in niggarding:
Pity the world, or else this glutton be,
To eat the world's due, by the grave and thee.
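
The two steps used above (opening the file, then reading it) are often wrapped in a `with` block, which closes the file automatically once the block ends. The sketch below illustrates this pattern. Since the sonnet file may not be available where you run it, the code first writes a two-line stand-in file to a temporary folder; the names `demo_folder`, `demo_path`, and `demo_text` are made up for this example.

```python
import os
import tempfile

# Write a small stand-in file so the example runs anywhere
demo_folder = tempfile.mkdtemp()
demo_path = os.path.join(demo_folder, "demo_sonnet.txt")
with open(demo_path, "w", encoding="utf-8") as out_file:
    out_file.write("From fairest creatures we desire increase,\n"
                   "That thereby beauty's rose might never die,")

# Open and read in one block; the file is closed when the block ends
with open(demo_path, encoding="utf-8") as in_file:
    demo_text = in_file.read()

print(in_file.closed)   # shows whether the file has been closed
print(demo_text)
```

Stating the `encoding` explicitly is optional, but it avoids surprises when the same code runs on a computer with a different default encoding.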


In [20]:
len(sonnet)

609

In [23]:
sonnet.lower()

"from fairest creatures we desire increase,\nthat thereby beauty's rose might never die,\nbut as the riper should by time decease,\nhis tender heir might bear his memory:\nbut thou, contracted to thine own bright eyes,\nfeed'st thy light's flame with self-substantial fuel,\nmaking a famine where abundance lies,\nthyself thy foe, to thy sweet self too cruel:\nthou that art now the world's fresh ornament,\nand only herald to the gaudy spring,\nwithin thine own bud buriest thy content,\nand tender churl mak'st waste in niggarding:\npity the world, or else this glutton be,\nto eat the world's due, by the grave and thee."

In [21]:
sonnet.find('riper')

98

In [22]:
sonnet[98:]

"riper should by time decease,\nHis tender heir might bear his memory:\nBut thou, contracted to thine own bright eyes,\nFeed'st thy light's flame with self-substantial fuel,\nMaking a famine where abundance lies,\nThyself thy foe, to thy sweet self too cruel:\nThou that art now the world's fresh ornament,\nAnd only herald to the gaudy spring,\nWithin thine own bud buriest thy content,\nAnd tender churl mak'st waste in niggarding:\nPity the world, or else this glutton be,\nTo eat the world's due, by the grave and thee."

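The last two cells combine `find()`, which returns the position of the first match, with slicing via square brackets. The following sketch applies the same idea to a single line of the sonnet, so the positions are easy to verify by hand; `line`, `start`, and `end` are names chosen for this illustration.

```python
line = "But as the riper should by time decease,"

start = line.find('riper')       # position of the first character of 'riper'
end = start + len('riper')       # position just after its last character

print(start)
print(line[start:end])           # slice out exactly the word
print(line[start:])              # everything from the word to the end
print(line.find('summer'))       # -1 means the substring was not found
```

Positions start at 0, so the `r` of `riper` sits at position 11 in this line. When `find()` returns `-1`, the substring does not occur, so it is worth checking for that value before slicing.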

## `Intermezzo`
- reading files
- writing files
- special characters
- paths and filenames
- what files to read

## `Intermezzo++`
- encoding
- common errors

## 3.4 Processing text

At this point