# LING 242 Python Lecture 1: Strings

* Getting substrings
* Building strings
* Modifications
* Boolean methods
* Case
* Numbers
* Which string method?

Computers represent language using the string datatype. Python's support for string manipulation is excellent. If you want to work with (human) language, Python is the (computer) language of choice for most purposes.

## Getting Substrings

Let's start by creating a Python string of the alphabet with the variable name `alphabet`, and access the letters contained within it using basic indexing (square brackets, i.e. \[\]). Note that indexing starts at 0 and supports forward as well as backward (negative) indices.

In [1]:
alphabet = "abcdefghijklmnopqrstuvwxyz" 

In [2]:
alphabet[0]

'a'

In [3]:
alphabet[3]

'd'

In [4]:
alphabet[-1]

'z'

In [5]:
alphabet[-4]

'w'

Now, we create a Python string of the word "antihumanitarianism" with the variable name `S` (for string).

Slicing (using the : operator inside of square brackets) applied to strings produces substrings. Note that:
- The letter at the second index (the stop) is NOT included
- Leaving the start/stop index empty means that the beginning/end is assumed

Exercise: Let's pull out good English words that are substrings of "antihumanitarianism":

|a|n|t|i|h| u | m| a| n| i|  t | a | r | i | a | n | i | s | m |
| --- | --- | --- | --- |--- | --- |---  |--- | --- |--- | --- |--- |---|---|--- | --- | --- | --- | --- |
|0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16| 17 | 18 |
|-19|-18|-17|-16|-15|-14|-13|-12|-11|-10|-9|-8|-7|-6|-5|-4|-3|-2|-1|


In [6]:
S = "antihumanitarianism" 

In [7]:
S[:]

'antihumanitarianism'

In [8]:
S[4:9]

'human'

In [9]:
S[:-3]

'antihumanitarian'

In [10]:
S[4:-3]

'humanitarian'

In [11]:
S[4:]

'humanitarianism'

|a|n|t|i|h| u | m| a| n| i|  t | a | r | i | a | n | i | s | m |
| --- | --- | --- | --- |--- | --- |---  |--- | --- |--- | --- |--- |---|---|--- | --- | --- | --- | --- |
|0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16| 17 | 18 |
|-19|-18|-17|-16|-15|-14|-13|-12|-11|-10|-9|-8|-7|-6|-5|-4|-3|-2|-1|

In [12]:
S[6:9]

'man'

In [13]:
S[:3]

'ant'

In [14]:
S[:2]

'an'

In [15]:
S[8:11]

'nit'

In [16]:
S[9:11]

'it'

In [17]:
S[-9:-6]

'tar'

|a|n|t|i|h| u | m| a| n| i|  t | a | r | i | a | n | i | s | m |
| --- | --- | --- | --- |--- | --- |---  |--- | --- |--- | --- |--- |---|---|--- | --- | --- | --- | --- |
|0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16| 17 | 18 |
|-19|-18|-17|-16|-15|-14|-13|-12|-11|-10|-9|-8|-7|-6|-5|-4|-3|-2|-1|

In [18]:
S[-3:-1]

'is'

In [19]:
S[:4]

'anti'

In [20]:
S[-3:]

'ism'

In [21]:
S[-7:-10:-1]

'rat'
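The last cell uses a third slice component, the step, which has not been described yet. As a minimal sketch (the variable names here are only for illustration), a slice can be written `start:stop:step`, and a negative step walks through the string from right to left:

```python
word = "antihumanitarianism"

every_other = word[::2]   # every second letter, starting at index 0
backwards = word[::-1]    # a step of -1 reverses the whole string
rat = word[-7:-10:-1]     # start at 'r' (-7), walk left, stop before index -10

print(every_other)  # athmntraim
print(backwards)    # msinairatinamuhitna
print(rat)          # rat
```

With a negative step, the start index must be to the right of the stop index, otherwise the result is an empty string.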

Sometimes, you don't know the exact index where you want to make a cut or cuts, but you know there's a special character (or set of characters) there. 

The [*split*](https://docs.python.org/3/library/stdtypes.html#str.split) method is used for breaking a string at a delimiter. The delimiter can be more than one character.

It returns a list of strings corresponding to the parts. The delimiter is not included.

In [22]:
S1 = "merry-go-round"
S2 = "red, green, blue, and yellow"

In [23]:
S1.split("-")

['merry', 'go', 'round']

In [24]:
S1.split("-")[2]

'round'

In [25]:
S2.split(" ")

['red,', 'green,', 'blue,', 'and', 'yellow']

In [26]:
S2.split(", ")

['red', 'green', 'blue', 'and yellow']

In [27]:
colours = S2.split(", ")
colours[-1] = colours[-1][4:]
colours

['red', 'green', 'blue', 'yellow']
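Two further behaviours of `split` are worth knowing; the sketch below uses made-up example strings. Called with no argument, `split` breaks at any run of whitespace (spaces, tabs, newlines) and never returns empty strings, whereas `split(" ")` breaks at every single space. An optional second argument limits how many splits are made.

```python
messy = "red   green\tblue\nyellow"

by_space = messy.split(" ")     # splits at every single space
by_whitespace = messy.split()   # no argument: splits at any run of whitespace

print(by_space)       # ['red', '', '', 'green\tblue\nyellow']
print(by_whitespace)  # ['red', 'green', 'blue', 'yellow']

# The second argument is the maximum number of splits (here: one).
first, rest = "merry-go-round".split("-", 1)
print(first)  # merry
print(rest)   # go-round
```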

Another common occurrence is when you've got a string with some junk at the beginning or end (or both) that you want to remove.

The [strip](https://docs.python.org/3/library/stdtypes.html#str.strip) method removes whitespace (spaces, newlines, tabs) by default, but can remove any character you like from the edges of your string.

It stops when it hits something it hasn't been told to remove. 

There are other methods (*rstrip* and *lstrip*) which apply this from only one direction.

In [28]:
S = " +-+-+strip me+-+-+ "
S

' +-+-+strip me+-+-+ '

In [29]:
S.strip()

'+-+-+strip me+-+-+'

In [30]:
S.strip("+")

' +-+-+strip me+-+-+ '

In [31]:
S.strip(" +")

'-+-+strip me+-+-'

In [32]:
S.strip(" -+")

'strip me'

In [33]:
S.rstrip(" -+")

' +-+-+strip me'
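One point that is easy to miss in the cells above: the argument to `strip` is treated as a set of characters, not as a prefix or suffix to match. A small sketch with a made-up example word:

```python
word = "banana"

# Every leading and trailing 'a' or 'b' is removed, in any order,
# until a character outside the set is reached.
stripped = word.strip("ab")
left_only = word.lstrip("ab")

print(stripped)   # nan
print(left_only)  # nana
```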

## Building strings

For many purposes, the concatenator "+" is all you need.

In [34]:
S1 = "hello"
S2 = "world"

In [35]:
S1 + S2

'helloworld'

In [36]:
S1 + " " + S2

'hello world'

However, this is wasteful for more than a few strings, so if you have many strings you want to put together, particularly if they are already in a Python list, use [join](https://docs.python.org/3/library/stdtypes.html#str.join).

The join method has a funky syntax: it is called as a method of the delimiter string, with the list of strings as the argument.

In [37]:
L1 = ["merry","go","round"]
L2 = ["anti","dis","establish","ment","arian","ism"]

In [38]:
"-".join(L1)

'merry-go-round'

In [39]:
" ".join(L1)

'merry go round'

In [40]:
" ".join(L2)

'anti dis establish ment arian ism'

In [41]:
"".join(L2)

'antidisestablishmentarianism'
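Note that `join` only accepts strings. The following sketch (with a made-up list of numbers) converts each item with `str` first, and contrasts this with building the same result by repeated concatenation:

```python
counts = [3, 1, 4]

# "-".join(counts) would raise a TypeError, so convert each number first.
joined = "-".join(str(n) for n in counts)
print(joined)  # 3-1-4

# The same result with repeated +, which creates a new
# intermediate string on every pass through the loop.
built = ""
for n in counts:
    built = built + str(n) + "-"
built = built.rstrip("-")
print(built == joined)  # True
```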

If you want to construct a string from one or more variables, use [f-strings](https://docs.python.org/3/tutorial/inputoutput.html#formatted-string-literals) (add an f in front of the quotes) with the variables (or even longer expressions) between curly brackets ({}).

In [42]:
noun = "dog"
verb = "run"

In [43]:
f"The {noun} likes to {verb}!"

'The dog likes to run!'
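As a small illustration of the "longer expressions" case, the curly brackets can contain method calls and function calls, not just variable names. The sentence below is only an example:

```python
noun = "dog"
verb = "run"

# Any Python expression can go between the curly brackets.
sentence = f"The {noun.upper()} likes to {verb}, and '{noun}' has {len(noun)} letters."
print(sentence)  # The DOG likes to run, and 'dog' has 3 letters.
```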

## Modifications

First, remember that Python strings are not mutable. Whenever you are "modifying" a string, you are not actually changing it; you are creating a new string based on the old one.
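A quick way to see this, sketched with a throwaway example: assigning to an index of a string raises a `TypeError`, while building a new string leaves the original untouched.

```python
S = "ring"

try:
    S[0] = "k"           # strings do not support item assignment
except TypeError as err:
    print("TypeError:", err)

new_S = "k" + S[1:]      # build a new string instead
print(S)      # ring
print(new_S)  # king
```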

Of course, you can "modify" strings by using a combination of what we have already seen. 

Let's turn the string "the lords of the ring" into "the lord of the rings" using slicing and concatenation. Then let's try doing it by splitting and joining.

In [44]:
S = "the lords of the ring"

In [45]:
S[:8]

'the lord'

In [46]:
S[9:]

' of the ring'

In [47]:
S[:8] + S[9:] + "s"

'the lord of the rings'

In [48]:
S_words = S.split(" ")
S_words[1] = S_words[1][:-1]
S_words[-1] = S_words[-1] + "s"
" ".join(S_words)

'the lord of the rings'

If there is a particular substring within a string that you wish to change, then use the [replace](https://docs.python.org/3/library/stdtypes.html#str.replace) method.

It is called on the whole string, and takes as its two arguments the string to be replaced, and the string to replace it with. 

By default it applies to all instances of the substring, so be careful!

In [49]:
S = "hands and feet"

In [50]:
S.replace("feet", "fingers")

'hands and fingers'

In [51]:
S.replace("and", "or")

'hors or feet'

In [52]:
S.replace(" and "," or ")

'hands or feet'
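If replacing every instance is not what you want, `replace` also takes an optional third argument giving the maximum number of replacements. A minimal sketch, reusing the same example string:

```python
S = "hands and feet"

all_hits = S.replace("and", "or")        # every occurrence, even inside "hands"
first_only = S.replace("and", "or", 1)   # at most one replacement, from the left
padded = S.replace(" and ", " or ")      # surrounding spaces restrict the match

print(all_hits)    # hors or feet
print(first_only)  # hors and feet
print(padded)      # hands or feet
```

Limiting the count does not help here, because the first match is the one inside "hands"; making the search string more specific is the safer fix.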

## Boolean methods

Here we mean string methods (and an operator) which return a boolean; they check to see if a string has a particular property.

Two of the most useful boolean methods are [startswith](https://docs.python.org/3/library/stdtypes.html#str.startswith) and [endswith](https://docs.python.org/3/library/stdtypes.html#str.endswith), which check whether a string starts/ends with another string. They are very handy for checking for morphological affixes, for instance.

In [53]:
S1 = "reranked"
S2 = "disabled"

In [54]:
S1.startswith("re")

True

In [55]:
if S2.startswith("dis"):
    print(S2[3:])

abled


In [56]:
if S1.endswith("ed"):
    print(S1[:-2])

rerank
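Both methods also accept a tuple of strings, which is convenient when checking for several affixes at once. In this sketch the prefix tuple and the word list are made up for illustration:

```python
prefixes = ("re", "un", "dis")   # hypothetical set of prefixes to look for
words = ["reranked", "disabled", "walked"]

# startswith returns True if the word starts with ANY string in the tuple.
prefixed = [w for w in words if w.startswith(prefixes)]
print(prefixed)  # ['reranked', 'disabled']

# endswith works the same way.
print("walked".endswith(("ed", "ing")))  # True
```

Note that the argument must be a tuple; passing a list raises a `TypeError`.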


If you want to check if a string appears anywhere within another string, use the *in* operator.

In [57]:
"dis" in S2

True

In [58]:
"able" in S2

True

In [59]:
"red" in S1

False

Another thing we often want to know is whether a string is in fact a word (consisting only of letters). The [isalpha](https://docs.python.org/3/library/stdtypes.html#str.isalpha) method does this. 

Note that whitespace, punctuation, and some characters that appear regularly within English words (e.g. "-" and "'") are not considered alphabetic.

In [60]:
S1 = "hello"
S2 = "hello world"
S3 = "a can't-do-it attitude"

In [61]:
S1.isalpha()

True

In [62]:
S2.isalpha()

False

In [63]:
S2.replace(" ","").isalpha()

True

In [64]:
S3.isalpha()

False

In [65]:
words = S3.split()
words[0].isalpha()

True

In [66]:
words[1].isalpha()  #can't-do-it

False

In [67]:
S3.replace(" ", "").replace("-","").replace("'","").isalpha()

True
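Chaining `replace` calls works on a whole phrase, but it would also accept strings made only of punctuation and letters in odd places. One possible alternative for single tokens, sketched here with a hypothetical helper named `is_wordlike`, is to split at the word-internal characters and require every part to be alphabetic:

```python
def is_wordlike(token):
    """Hypothetical helper: True if token is letters joined by '-' or "'"."""
    # Treat apostrophes like hyphens, then check each piece separately.
    parts = token.replace("'", "-").split("-")
    return all(part.isalpha() for part in parts)

print(is_wordlike("can't-do-it"))  # True
print(is_wordlike("hello"))        # True
print(is_wordlike("--"))           # False: the empty pieces are not alphabetic
print(is_wordlike("route66"))      # False
```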

## Case


Upper and lower case versions of the same word are considered entirely different in Python! This means Python is _case-sensitive_. Case-insensitive means it doesn't matter what case you use (your browser's search function is probably case-insensitive by default).

In [68]:
S1 = "case"
S2 = "CASE"
S3 = "Case"

In [69]:
S1 == S2

False

In [70]:
S1 == S3

False

To convert between upper and lower case, use the [upper](https://docs.python.org/3/library/stdtypes.html#str.upper) and [lower](https://docs.python.org/3/library/stdtypes.html#str.lower) methods. If the string is already upper/lower case, this has no effect. It is standard practice to lowercase all words so that English capitalization rules do not make the same word look like different strings. You can also convert a string to have capitalized letters at the beginning of each word with [title](https://docs.python.org/3/library/stdtypes.html#str.title).

In [71]:
S1.upper()

'CASE'

In [72]:
S2.lower()

'case'

In [73]:
S3.lower()

'case'

In [74]:
S1.title()

'Case'
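Putting this to use, a case-insensitive comparison can be sketched by lowercasing both sides before comparing. The second half of the example shows a known quirk of `title`: it treats any letter that follows a non-letter as the start of a new word, so apostrophes give surprising results.

```python
S1 = "case"
S2 = "CASE"
S3 = "Case"

# Lowercase every string before comparing.
same = S1.lower() == S2.lower() == S3.lower()
print(same)  # True

# title() capitalizes the letter after the apostrophe as well.
titled = "it's a dog's life".title()
print(titled)  # It'S A Dog'S Life
```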

To check whether a string is entirely upper or lower case, use [isupper](https://docs.python.org/3/library/stdtypes.html#str.isupper) and [islower](https://docs.python.org/3/library/stdtypes.html#str.islower).

In [75]:
S1.isupper()

False

In [76]:
S2.isupper()

True

In [77]:
S3.islower()

False

## Numbers

Be careful about numbers represented as strings, versus real numbers.

In [78]:
S = "42"
i = 42
S1 = "case"

In [79]:
S == i

False

To check if a string consists of only digits (and is thus appropriate for conversion to an integer), use [isdigit](https://docs.python.org/3/library/stdtypes.html#str.isdigit).




In [80]:
S.isdigit()

True

In [81]:
S1.isdigit()

False

In [82]:
# i.isdigit()  # would raise AttributeError: isdigit is a string method, and i is an int
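Keep in mind that `isdigit` only looks for digit characters, so it rejects some strings that `int` or `float` would happily convert. The sketch below shows a few such cases, plus a hypothetical helper `to_int` that simply attempts the conversion and catches the error:

```python
checks = {text: text.isdigit() for text in ["42", "-42", "4.2", ""]}
print(checks)  # {'42': True, '-42': False, '4.2': False, '': False}

def to_int(text):
    """Hypothetical helper: return an int, or None if text is not an integer."""
    try:
        return int(text)
    except ValueError:
        return None

print(to_int("-42"))   # -42
print(to_int("4.2"))   # None
print(to_int("case"))  # None
```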

Convert from strings to ints or floats using the built-in functions [int](https://docs.python.org/3/library/functions.html#int) and [float](https://docs.python.org/3/library/functions.html#float), and from a number to a string using [str](https://docs.python.org/3/library/stdtypes.html#str).

In [83]:
str(i) == S

True

In [84]:
str(i).isdigit()

True

In [85]:
int(S) == i

True

In [86]:
float(S) == i

True

In [87]:
str(float(S)) == S

False

In [88]:
str(float(S))

'42.0'

When using f-strings, you can format numbers using ":"! We'll just set the number of decimal places using the precision, written as "." followed by a number (e.g. `.3f`), but there are a lot more [options](https://docs.python.org/3/library/string.html#format-specification-mini-language).

In [89]:
import math
math.pi

3.141592653589793

In [90]:
print("Pi is " + str(math.pi))

Pi is 3.141592653589793


In [91]:
f"Pi is {math.pi}"

'Pi is 3.141592653589793'

In [92]:
f"Pi is {math.pi:.3f}"

'Pi is 3.142'
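To give a taste of those other options, here is a short sketch with made-up values showing a thousands separator, a percentage, and a fixed field width:

```python
import math

n = 1234567
ratio = 0.256

with_commas = f"{n:,}"        # thousands separator
as_percent = f"{ratio:.1%}"   # multiply by 100, one decimal place, add %
padded = f"{math.pi:8.2f}"    # two decimals, padded on the left to 8 characters

print(with_commas)   # 1,234,567
print(as_percent)    # 25.6%
print(repr(padded))  # '    3.14'
```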

## Which string method?

For each of the below, what is the string method or operator you would use to solve the problem?

1. Convert surnames from the first-letter-capitalized (title) form to an all-caps form.
1. Change British spelling (e.g. colour) to American (color) throughout a document.
1. Given a list of words, output a file with one word on each line.
1. Grab the last three letters of a word to use as a feature for machine learning.
1. Exclude all numbers from a list of words derived from a corpus.
1. Remove any punctuation appearing at the end of a sentence (e.g. "What is that!?!?").