# Strings

Strings are not like integers, floats, and booleans. A string is a **sequence**, which means it contains multiple values in a particular order.
In this chapter we'll see how to access the values that make up a string, and we'll use functions that process strings.

As an exercise, you'll have a chance to apply these tools to a word game called Wordle.

## A string is a sequence

A string is a sequence of characters. A **character** can be a letter (in almost any alphabet), a digit, a punctuation mark, or white space.

You can select a character from a string with the bracket operator.
This example statement selects character number 1 from `fruit` and
assigns it to `letter`:
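Here is a minimal sketch of that statement. It assumes `fruit` holds the string `'banana'`, the word used later in this section.

```python
fruit = 'banana'    # assumed example string
letter = fruit[1]   # select the character with index 1
print(letter)       # prints: a
```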

The expression in brackets is an **index**, so called because it *indicates* which character in the sequence to select.
But the result might not be what you expect.

The letter with index `1` is actually the second letter of the string.
An index is an offset from the beginning of the string, so the offset of the first letter is `0`.

You can think of `'b'` as the 0th letter of `'banana'` -- pronounced "zero-eth".

The index in brackets can be a variable.

Or an expression that contains variables and operators.
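A short sketch of both cases, again assuming `fruit` is `'banana'`; the variable name `i` is only for illustration.

```python
fruit = 'banana'
i = 1

from_variable = fruit[i]        # the index is a variable
from_expression = fruit[i + 1]  # the index is an expression

print(from_variable)    # prints: a
print(from_expression)  # prints: n
```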

But the value of the index has to be an integer -- otherwise you get a `TypeError`.
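To see the error without stopping the program, the following sketch catches it with `try` and `except`, which this chapter has not introduced; treat that part as an assumption. The exact wording of the message depends on the Python version.

```python
fruit = 'banana'

try:
    fruit[1.5]              # a float is not a valid index
except TypeError as err:
    message = str(err)
    print('TypeError:', message)
```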

As we saw in Chapter 1, we can use the built-in function `len` to get the length of a string.

To get the last letter of a string, you might be tempted to use the length as the index, writing `fruit[n]` where `n` is `len(fruit)`.

But that causes an `IndexError` because there is no letter in `'banana'` with the index 6. Because we started counting at `0`, the six letters are numbered `0` to `5`. To get the last character, you have to subtract `1` from `n`:
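The sketch below shows both attempts, assuming `fruit` is `'banana'` and `n` is its length. The failing attempt is wrapped in `try` and `except` so the rest of the code still runs.

```python
fruit = 'banana'
n = len(fruit)      # 6

try:
    fruit[n]        # there is no character with index 6
except IndexError as err:
    message = str(err)
    print('IndexError:', message)   # prints: IndexError: string index out of range

last = fruit[n - 1] # index 5 is the last valid index
print(last)         # prints: a
```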

But there's an easier way.
To get the last letter in a string, you can use a negative index, which counts backward from the end.

The index `-1` selects the last letter, `-2` selects the second to last, and so on.
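For example, assuming `fruit` is `'banana'`:

```python
fruit = 'banana'

print(fruit[-1])   # prints: a  (last letter)
print(fruit[-2])   # prints: n  (second to last)

# A negative index i selects the same character as len(fruit) + i.
same = fruit[-1] == fruit[len(fruit) - 1]
print(same)        # prints: True
```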

## String slices

A segment of a string is called a **slice**.
Selecting a slice is similar to selecting a character.

The operator `[n:m]` returns the part of the string from the `n`th
character up to the `m`th character (including the first but excluding the second).
This behavior is counterintuitive, but it might help to imagine the indices pointing *between* the characters.

For example, the slice `[3:6]` selects the letters `ana`, which means that `6` is legal as part of a slice, but not legal as an index.
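A small sketch of these slices, assuming `fruit` is `'banana'`. It also confirms that `6` fails as a plain index, using `try` and `except` to catch the error.

```python
fruit = 'banana'

first_three = fruit[0:3]   # characters with indices 0, 1, 2
last_three = fruit[3:6]    # characters with indices 3, 4, 5

print(first_three)   # prints: ban
print(last_three)    # prints: ana

try:
    fruit[6]
    index_six_ok = True
except IndexError:
    index_six_ok = False

print(index_six_ok)  # prints: False
```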


If you omit the first index, the slice starts at the beginning of the string.

If you omit the second index, the slice goes to the end of the string:
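Both shortcuts can be sketched like this, assuming `fruit` is `'banana'`:

```python
fruit = 'banana'

start = fruit[:3]   # same as fruit[0:3]
end = fruit[3:]     # same as fruit[3:len(fruit)]

print(start)   # prints: ban
print(end)     # prints: ana
```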

If the first index is greater than or equal to the second, the result is an **empty string**, represented by two quotation marks:
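For instance, assuming `fruit` is `'banana'`:

```python
fruit = 'banana'

equal_indices = fruit[3:3]     # first index equal to the second
reversed_indices = fruit[4:2]  # first index greater than the second

print(repr(equal_indices))     # prints: ''
print(repr(reversed_indices))  # prints: ''
print(len(equal_indices))      # prints: 0
```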

An empty string contains no characters and has length 0.

Continuing this example, what do you think `fruit[:]` means? Try it and
see.

## Strings are immutable

It is tempting to use the `[]` operator on the left side of an
assignment, with the intention of changing a character in a string, like this:
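Here is a sketch of such an attempt. The value `'Hello, world!'` for `greeting` is an assumption made for illustration, and `try` and `except` are used only so the error can be printed instead of stopping the program.

```python
greeting = 'Hello, world!'   # assumed example value

try:
    greeting[0] = 'J'        # try to change the first character in place
except TypeError as err:
    message = str(err)
    print('TypeError:', message)

print(greeting)              # prints: Hello, world!
```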

The result is a `TypeError`.
In the error message, the "object" is the string and the "item" is the character
we tried to assign.
For now, an **object** is the same thing as a value, but we will refine that definition later.

The reason for this error is that strings are **immutable**, which means you can't change an existing string.
The best you can do is create a new string that is a variation of the original.
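A minimal sketch of that approach, assuming `greeting` holds `'Hello, world!'`; the name `new_greeting` is chosen here for illustration.

```python
greeting = 'Hello, world!'            # assumed example value
new_greeting = 'J' + greeting[1:]     # new first letter plus the rest of the string

print(new_greeting)   # prints: Jello, world!
print(greeting)       # prints: Hello, world!
```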

This example concatenates a new first letter onto a slice of `greeting`.
It has no effect on the original string.

## String comparison

The relational operators work on strings. To see if two strings are
equal, we can use the `==` operator.
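For example, with an assumed variable `word`:

```python
word = 'banana'

if word == 'banana':
    print('All right, banana.')

# The comparison is exact, so a difference in case makes two strings unequal.
same_ignoring_nothing = 'Banana' == 'banana'
print(same_ignoring_nothing)   # prints: False
```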

Other relational operations are useful for putting words in alphabetical
order:
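The following sketch uses a hypothetical helper, `compare_word`, that reports where a word falls relative to `'banana'`. The function name and the messages are assumptions for illustration.

```python
def compare_word(word):
    """Describe where `word` falls relative to 'banana'."""
    if word < 'banana':
        return word + ' comes before banana.'
    elif word > 'banana':
        return word + ' comes after banana.'
    else:
        return 'All right, banana.'

print(compare_word('apple'))    # prints: apple comes before banana.
print(compare_word('cherry'))   # prints: cherry comes after banana.
```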

Python does not handle uppercase and lowercase letters the same way
people do. All the uppercase letters come before all the lowercase
letters, so:
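For example, a capitalized word sorts before a lowercase one even when its first letter comes later in the alphabet:

```python
capitalized_first = 'Pineapple' < 'banana'   # 'P' comes before 'b'
lowercase_first = 'pineapple' < 'banana'     # 'p' comes after 'b'

print(capitalized_first)   # prints: True
print(lowercase_first)     # prints: False
```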

To solve this problem, we can convert strings to a standard format, such as all lowercase, before performing the comparison.
Keep that in mind if you have to defend yourself against a man armed with a Pineapple.
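One way to sketch that conversion uses the `lower` method, which returns a lowercase copy of a string. The helper name `comes_before` is hypothetical.

```python
def comes_before(first, second):
    """Compare two words alphabetically, ignoring case."""
    return first.lower() < second.lower()

print('Pineapple' < 'banana')               # prints: True
print(comes_before('Pineapple', 'banana'))  # prints: False
```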

## String methods

W3Schools.com: [Python String Methods](https://www.w3schools.com/python/python_strings_methods.asp)

Strings provide methods that perform a variety of useful operations.
A method is similar to a function -- it takes arguments and returns a value -- but the syntax is different.
For example, the method `upper` takes a string and returns a new string with all uppercase letters.

Instead of the function syntax `upper(word)`, it uses the method syntax `word.upper()`.

This use of the dot operator specifies the name of the method, `upper`, and the name of the string to apply the method to, `word`.
The empty parentheses indicate that this method takes no arguments.
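A short sketch, assuming `word` is `'banana'`; the name `shouted` is only for illustration.

```python
word = 'banana'
shouted = word.upper()   # method syntax: string, dot, method name, parentheses

print(shouted)   # prints: BANANA
print(word)      # prints: banana  (the original string is unchanged)
```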

A method call is called an **invocation**; in this case, we would say that we are invoking `upper` on `word`.

## Writing files

String operators and methods are useful for reading and writing text files.
As an example, we'll work with the text of *Dracula*, a novel by Bram Stoker that is available from Project Gutenberg (<https://www.gutenberg.org/ebooks/345>).

In [None]:
import os

if not os.path.exists('pg345.txt'):
    !wget https://www.gutenberg.org/cache/epub/345/pg345.txt

I've downloaded the book in a plain text file called `pg345.txt`, which we can open for reading like this:

In addition to the text of the book, this file contains a section at the beginning with information about the book and a section at the end with information about the license.
Before we process the text, we can remove this extra material by finding the special lines at the beginning and end that begin with `'***'`.

The following function takes a line and checks whether it is one of the special lines.
It uses the `startswith` method, which checks whether a string starts with a given sequence of characters.
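A minimal sketch of such a function follows. The name `is_special_line` and the sample lines are assumptions made for illustration.

```python
def is_special_line(line):
    """Return True if the line starts with three asterisks."""
    return line.startswith('***')

print(is_special_line('*** START OF THE EBOOK ***'))  # prints: True
print(is_special_line('CHAPTER I'))                   # prints: False
```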

We can use this function to loop through the lines in the file and print only the special lines.

Now let's create a new file, called `pg345_cleaned.txt`, that contains only the text of the book.
In order to loop through the book again, we have to open it again for reading.
And, to write a new file, we can open it for writing.

`open` takes an optional parameter that specifies the "mode" -- in this example, `'w'` indicates that we're opening the file for writing.
If the file doesn't exist, it will be created; if it already exists, the contents will be replaced.
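The sketch below illustrates both cases with a hypothetical scratch file in a temporary folder, so no real file is overwritten. It uses the `write`, `read`, and `close` methods of file objects.

```python
import os
import tempfile

# Hypothetical scratch file, created in a fresh temporary folder.
path = os.path.join(tempfile.mkdtemp(), 'scratch.txt')

writer = open(path, 'w')   # the file does not exist yet, so it is created
writer.write('first version\n')
writer.close()

writer = open(path, 'w')   # the file exists now, so its contents are replaced
writer.write('second version\n')
writer.close()

reader = open(path)        # without a mode, the file is opened for reading
contents = reader.read()
reader.close()

print(contents)            # prints: second version
```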

As a first step, we'll loop through the file until we find the first special line.

The `break` statement "breaks" out of the loop -- that is, it causes the loop to end immediately, before we get to the end of the file.

When the loop exits, `line` contains the special line that made the conditional true.

In [None]:
line

Because `reader` keeps track of where it is in the file, we can use a second loop to pick up where we left off.

The following loop reads the rest of the file, one line at a time.
When it finds the special line that indicates the end of the text, it breaks out of the loop.
Otherwise, it writes the line to the output file.

When this loop exits, `line` contains the second special line.

At this point `reader` and `writer` are still open, which means we could keep reading lines from `reader` or writing lines to `writer`.
To indicate that we're done, we can close both files by invoking the `close` method.
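The steps in this section can be sketched end to end. Because `pg345.txt` may not be available, this version first writes a tiny stand-in file with made-up content and the same overall layout; the file names, the sample text, and the helper `is_special_line` are all assumptions.

```python
import os
import tempfile

def is_special_line(line):
    """Return True if the line starts with three asterisks."""
    return line.startswith('***')

# Made-up stand-in for the book: header, special line, text, special line, license.
sample = (
    'Header information\n'
    '*** START OF THE EBOOK ***\n'
    'First line of the story.\n'
    'Second line of the story.\n'
    '*** END OF THE EBOOK ***\n'
    'License information\n'
)

folder = tempfile.mkdtemp()
source_path = os.path.join(folder, 'sample.txt')
cleaned_path = os.path.join(folder, 'sample_cleaned.txt')

writer = open(source_path, 'w')
writer.write(sample)
writer.close()

reader = open(source_path)
writer = open(cleaned_path, 'w')

# First loop: read up to and including the first special line.
for line in reader:
    if is_special_line(line):
        break

# Second loop: continue from there and copy lines until the second special line.
for line in reader:
    if is_special_line(line):
        break
    writer.write(line)

reader.close()
writer.close()

# Read the new file back to check the result.
reader = open(cleaned_path)
cleaned = reader.read()
reader.close()
print(cleaned)
```

After the second loop, `line` holds the special line that marks the end of the text, and the cleaned file contains only the two lines of the story.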

To check whether this process was successful, we can read the first few lines from the new file we just created.

The `endswith` method checks whether a string ends with a given sequence of characters.
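For example, it can check a file name's extension. The file name below is the one used in this section; the variable names are for illustration.

```python
filename = 'pg345_cleaned.txt'

is_text = filename.endswith('.txt')
is_csv = filename.endswith('.csv')

print(is_text)   # prints: True
print(is_csv)    # prints: False
```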

## Find and replace

In the Icelandic translation of *Dracula* from 1901, the name of one of the characters was changed from "Jonathan" to "Thomas".
To make this change in the English version, we can loop through the book, use the `replace` method to replace one name with another, and write the result to a new file.

We'll start by counting the lines in the cleaned version of the file.

To see whether a line contains "Jonathan", we can use the `in` operator, which checks whether this sequence of characters appears anywhere in the line.
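Here is a small sketch that counts matching lines. To stay self-contained, it loops over a short list of made-up lines rather than the book itself.

```python
# Made-up lines standing in for lines read from the file.
lines = [
    'Jonathan opened the door.',
    'The night was quiet.',
    'Mina wrote to Jonathan, and Jonathan replied.',
]

lines_with_name = 0
for line in lines:
    if 'Jonathan' in line:
        lines_with_name += 1

print(lines_with_name)   # prints: 2
```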

There are 199 lines that contain the name, but that's not quite the total number of times it appears, because it can appear more than once in a line.
To get the total, we can use the `count` method, which returns the number of times a sequence appears in a string.
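A sketch of the difference, again using a short list of made-up lines instead of the book:

```python
# Made-up lines standing in for lines read from the file.
lines = [
    'Jonathan opened the door.',
    'The night was quiet.',
    'Mina wrote to Jonathan, and Jonathan replied.',
]

total = 0
for line in lines:
    total += line.count('Jonathan')   # occurrences in this line

print(total)   # prints: 3
```

The name appears in two of these lines but three times in total, because the last line contains it twice.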

Now we can replace `'Jonathan'` with `'Thomas'` like this:
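The following sketch applies `replace` line by line. It works on a short list of made-up lines and collects the results in another list; with real files, the loop would read from one file and write each new line to the other.

```python
# Made-up lines standing in for lines read from the cleaned file.
lines = [
    'Jonathan opened the door.',
    'The night was quiet.',
    'Mina wrote to Jonathan, and Jonathan replied.',
]

replaced = []
for line in lines:
    # replace returns a new string; the original line is unchanged.
    replaced.append(line.replace('Jonathan', 'Thomas'))

for line in replaced:
    print(line)
```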

The result is a new file called `pg345_replaced.txt` that contains a version of *Dracula* where Jonathan Harker is called Thomas.

## Glossary

**sequence:**
 An ordered collection of values where each value is identified by an integer index.

**character:**
An element of a string, such as a letter, digit, punctuation mark, or whitespace character.

**index:**
An integer value used to select an item in a sequence, such as a character in a string. In Python, indices start from `0`.

**slice:**
 A part of a string specified by a range of indices.

**empty string:**
A string that contains no characters and has length `0`.

**object:**
 Something a variable can refer to. An object has a type and a value.

**immutable:**
If the elements of an object cannot be changed, the object is immutable.

**invocation:**
 An expression -- or part of an expression -- that calls a method.

**regular expression:**
A sequence of characters that defines a search pattern.

**pattern:**
A rule that specifies the requirements a string has to meet to constitute a match.

**string substitution:**
Replacement of a string, or part of a string, with another string.

**shell command:**
A statement in a shell language, which is a language used to interact with an operating system.

## Exercises

In [None]:
# This cell tells Jupyter to provide detailed debugging information
# when a runtime error occurs. Run it before working on the exercises.

%xmode Verbose

In [1]:
# The following code is used to download files from the web
from os.path import basename, exists
from urllib.request import urlretrieve

def download(url):
    """Download `url` into the current folder unless the file already exists."""
    filename = basename(url)
    if not exists(filename):
        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    else:
        print("Already downloaded")

In [None]:
download('https://raw.githubusercontent.com/AllenDowney/ThinkPython/v3/words.txt')

### Exercise

"Wordle" is an online word game where the objective is to guess a five-letter word in six or fewer attempts.
Each attempt has to be recognized as a word, not including proper nouns.
After each attempt, you get information about which of the letters you guessed appear in the target word, and which ones are in the correct position.

For example, suppose the target word is `MOWER` and you guess `TRIED`.
You would learn that `E` is in the word and in the correct position, `R` is in the word but not in the correct position, and `T`, `I`, and `D` are not in the word.

As a different example, suppose you have guessed the words `SPADE` and `CLERK`, and you've learned that `E` is in the word, but not in either of those positions, and none of the other letters appear in the word.
Of the words in the word list, how many could be the target word?
Write a function called `check_word` that takes a five-letter word and checks whether it could be the target word, given these guesses.

In [None]:
# Solution goes here

You can use any of the functions from the previous chapter, like `uses_any`.

In [None]:
def uses_any(word, letters):
    for letter in word.lower():
        if letter in letters.lower():
            return True
    return False

You can use the following loop to test your function.

In [None]:
for line in open('words.txt'):
    word = line.strip()
    if len(word) == 5 and check_word(word):
        print(word)

### Exercise

Continuing the previous exercise, suppose you guess the word `TOTEM` and learn that the `E` is *still* not in the right place, but the `M` is. How many words are left?

In [None]:
# Solution goes here


## Credits

Adapted from [Think Python: 3rd Edition](https://allendowney.github.io/ThinkPython/index.html) by [Allen B. Downey](https://allendowney.com)

Code license: [MIT License](https://mit-license.org/)

Text license: [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/)