First, go find this lecture at https://github.ubc.ca/MDS-CL-2019-20/COLX_521_corp-ling_students/blank_lectures/Lecture1_strings.ipynb

Then, click to open up this [google doc](https://docs.google.com/document/d/1j6HWzh96song43CWh6uQJs7HbCZIko_Xl7jlnVawe4s/edit). Wait for instructions before contributing.

# COLX 521 Lecture 1: Strings

* Getting substrings
* Concatenation
* Modifications
* Booleans
* Case
* Numbers
* And more...

Computers represent language using the string datatype. Python's support for string manipulation is excellent. If you want to work with (human) language, Python is the (computer) language of choice for most purposes.

## Getting Substrings

Let's start by creating a Python string of the word "antihumanitarianism" with the variable name S (for string), and access the letters contained within it using basic indexing (square brackets, i.e. \[\]). Note that indexing supports forward as well as backward (negative) indices.

In [3]:
S = "antihumanitarianism" 

In [4]:
S[0]

'a'

In [5]:
S[3]

'i'

In [6]:
S[-1]

'm'

In [7]:
S[-4]

'n'

Slicing (using the : operator inside of square brackets) applied to strings produces substrings. Note that:
- The letter at the second index (the stop) is NOT included
- Leaving the start/stop index empty means that the beginning/end is assumed (you should do this rather than writing the first or last index explicitly)
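
Here is a minimal sketch of these two rules on a shorter word. The variable names `word` and `k` are made up for this illustration and are not part of the lecture code.

```python
# Illustration only: `word` and `k` are made-up example variables.
word = "human"

# The stop index is excluded, so the slice [a:b] has b - a characters.
print(word[1:4])       # uma
print(len(word[1:4]))  # 3

# An empty start/stop means "from the beginning" / "to the end", so cutting
# at any index k and gluing the two halves back gives the original string.
k = 2
print(word[:k])                     # hu
print(word[k:])                     # man
print(word[:k] + word[k:] == word)  # True
```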

Exercise: Let's pull out good English words that are substrings of "antihumanitarianism":

|a|n|t|i|h| u | m| a| n| i|  t | a | r | i | a | n | i | s | m |
| --- | --- | --- | --- |--- | --- |---  |--- | --- |--- | --- |--- |---|---|--- | --- | --- | --- | --- |
|0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16| 17 | 18 |
|-19|-18|-17|-16|-15|-14|-13|-12|-11|-10| -9| -8| -7| -6| -5 | -4 | -3 | -2 | -1 |


In [8]:
S[4:9]

'human'

In [9]:
S[:]

'antihumanitarianism'

In [10]:
S[:-3]

'antihumanitarian'

In [11]:
S[4:-3]

'humanitarian'

In [12]:
S[4:]

'humanitarianism'

In [13]:
S[6:9]

'man'

In [14]:
S[:3]

'ant'

In [15]:
S[:2]

'an'

In [16]:
S[8:11]

'nit'

In [17]:
S[9:11]

'it'

In [18]:
S[-9:-6]

'tar'

In [19]:
S[-3:-1]

'is'

In [20]:
S[:4]

'anti'

In [21]:
S[-3:]

'ism'

In [24]:
S[-7:-10:-1]

'rat'

Sometimes, you don't know the exact index where you want to make a cut or cuts, but you know there's a special character (or set of characters) there. 

The [*split*](https://docs.python.org/3/library/stdtypes.html#str.split) method is used for breaking at a delimiter. 

It returns a list of strings corresponding to the parts. The delimiter is not included.

In [34]:
#provided code
S1 = "merry-go-round"
S2 = "red, green, blue, and yellow"

In [35]:
S1.split("-")

['merry', 'go', 'round']

In [36]:
S2.split(" ")

['red,', 'green,', 'blue,', 'and', 'yellow']

In [37]:
S2.split(", ")

['red', 'green', 'blue', 'and yellow']

In [38]:
colours = S2.split(", ")
colours[-1] = colours[-1][4:]
colours

['red', 'green', 'blue', 'yellow']
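
One detail that the examples above do not show: calling split with no argument is not the same as calling it with " ". The sketch below uses a made-up string, `messy`, with irregular spacing to show the difference.

```python
# Illustration only: `messy` is a made-up string with irregular whitespace.
messy = "red   green\tblue\n"

# With an explicit " " delimiter, every single space is a cut point,
# so consecutive spaces produce empty strings, and tabs/newlines are kept.
print(messy.split(" "))  # ['red', '', '', 'green\tblue\n']

# With no argument, split cuts on any run of whitespace and
# ignores whitespace at the edges.
print(messy.split())     # ['red', 'green', 'blue']
```

For tokenizing ordinary text on whitespace, the no-argument form is usually the safer choice.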

Another common occurrence is when you've got a string with some junk at the beginning or end (or both) that you want to remove.

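The [strip](https://docs.python.org/3/library/stdtypes.html#str.strip) method removes whitespace (spaces, newlines, tabs) by default, but can remove any characters you like from the edges of your string. Its argument is treated as a set of characters to remove, not as an exact prefix or suffix.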

It stops when it hits something it hasn't been told to remove. 

There are other methods (*rstrip* and *lstrip*) which apply this from only one direction.

In [26]:
#provided code
S = " +-+-+strip me+-+-+ "

In [27]:
S.strip()

'+-+-+strip me+-+-+'

In [28]:
S.strip("+")

' +-+-+strip me+-+-+ '

In [33]:
S.strip(" +")

'-+-+strip me+-+-'

In [42]:
S.strip(" -+")

'strip me'

In [43]:
S.rstrip(" -+")

' +-+-+strip me'
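
Because the argument is a set of characters rather than an exact string, the strip family can remove more than you expect. The sketch below uses a made-up `filename` to show the pitfall and one possible workaround based on endswith (covered in the Boolean methods section) and slicing.

```python
# Illustration only: `filename`, `suffix`, and `stem` are made-up names.
filename = "test.txt"

# rstrip treats ".txt" as the characters '.', 't', and 'x',
# so it also eats the final 't' of "test".
print(filename.rstrip(".txt"))  # tes

# To remove an exact suffix, check for it and slice it off instead.
suffix = ".txt"
if filename.endswith(suffix):
    stem = filename[:-len(suffix)]
else:
    stem = filename
print(stem)  # test
```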

## Concatenation

For many purposes, "+" is all you need.

In [45]:
#provided code
S1 = "hello"
S2 = "world"

In [46]:
S1 + S2

'helloworld'

In [47]:
S1 + " " + S2

'hello world'

However, if you have many strings you want to put together, particularly if they are already in a Python list, use [join](https://docs.python.org/3/library/stdtypes.html#str.join). 

The join method has a funky syntax: it is called as a method of the delimiter string, with the list of strings as the argument.

In [35]:
#provided code
L1 = ["merry","go","round"]
L2 = ["anti","dis","establish","ment","arian","ism"]

In [36]:
"-".join(L1)

'merry-go-round'

In [37]:
" ".join(L1)

'merry go round'

In [38]:
" ".join(L2)

'anti dis establish ment arian ism'

In [39]:
"".join(L2)

'antidisestablishmentarianism'
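
As a rough sketch of how join relates to split, the example below shows that the two undo each other when given the same delimiter, and that join expects every item to already be a string. The names `date`, `parts`, and `numbers` are made up for this illustration.

```python
# Illustration only: `date`, `parts`, and `numbers` are made-up names.
date = "2019-09-04"

# split and join with the same delimiter undo each other.
parts = date.split("-")
print(parts)                    # ['2019', '09', '04']
print("-".join(parts) == date)  # True

# join only accepts strings, so convert other types with str first.
numbers = [1, 2, 3]
print(", ".join([str(n) for n in numbers]))  # 1, 2, 3
```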

## Modifications

First, remember that Python strings are not mutable. Whenever you "modify" a string, you are not actually changing it; you are creating a new string based on the old one.
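
A quick way to see this for yourself is to try assigning to a single index. This is a small sketch with a made-up variable, `word`.

```python
# Illustration only: `word` and `new_word` are made-up names.
word = "cat"

# Item assignment is not allowed on strings.
try:
    word[0] = "b"
except TypeError as err:
    print("TypeError:", err)

# Instead, build a new string from pieces of the old one.
new_word = "b" + word[1:]
print(new_word)  # bat
print(word)      # cat (the original is unchanged)
```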

Of course, you can "modify" strings by using a combination of what we have already seen. 

Exercise: try turning the string "the lords of the ring" into "the lord of the rings" using slicing and concatenation. Then try doing it by splitting and joining.

In [43]:
#provided code
S = "the lords of the ring"

In [44]:
S[:8] + S[9:] + "s"

'the lord of the rings'

In [8]:
S_words = S.split(" ")
S_words[1] = S_words[1][:-1]
S_words[-1] = S_words[-1] + "s"
" ".join(S_words)

'the lord of the rings'

If there is a particular substring within a string that you wish to change, then use the [replace](https://docs.python.org/3/library/stdtypes.html#str.replace) method.

It is called on the whole string, and takes as its two arguments the string to be replaced, and the string to replace it with. 

By default it applies to all instances of the substring, so be careful!

In [57]:
#provided code
S = "hands and feet"

In [58]:
S.replace("feet","fingers")

'hands and fingers'

In [59]:
S.replace("and","or")

'hors or feet'

In [60]:
S.replace(" and "," or ")

'hands or feet'
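
The replace method also takes an optional third argument that limits how many occurrences are replaced, counting from the left. Here is a short sketch with a made-up string, `path`.

```python
# Illustration only: `path` is a made-up example string.
path = "a-b-c-d"

print(path.replace("-", "+"))     # a+b+c+d  (all occurrences)
print(path.replace("-", "+", 1))  # a+b-c-d  (only the first)
print(path.replace("-", "+", 2))  # a+b+c-d  (only the first two)

# The count runs left to right, so it does not rescue the "hands" example:
print("hands and feet".replace("and", "or", 1))  # hors and feet
```

Padding the search string with spaces, as in the last cell above, is still the simpler fix for that case.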

## Boolean methods

Here we mean string methods (and an operator) which return a boolean; they check to see if a string has a particular property.

Two of the most useful boolean methods are [startswith](https://docs.python.org/3/library/stdtypes.html#str.startswith) and [endswith](https://docs.python.org/3/library/stdtypes.html#str.endswith), which check whether a string starts/ends with another string. They are very handy for checking for morphological affixes, for instance.

In [63]:
#provided code
S1 = "reranked"
S2 = "disabled"

In [64]:
S1.startswith("re")

True

In [65]:
if S2.startswith("dis"):
    print(S2[3:])

abled


In [66]:
if S1.endswith("ed"):
    print(S1[:-2])

rerank
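
The two cells above can be folded into a small reusable function. What follows is only a naive sketch: `strip_suffix` and its default suffix list are hypothetical, and this is nowhere near a real stemmer.

```python
def strip_suffix(word, suffixes=("ing", "ed", "s")):
    """Hypothetical helper: cut off the first suffix in `suffixes` that matches."""
    for suffix in suffixes:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    # No suffix matched, so return the word unchanged.
    return word

print(strip_suffix("reranked"))  # rerank
print(strip_suffix("walking"))   # walk
print(strip_suffix("cats"))      # cat
print(strip_suffix("bus"))       # bu   (naive: any final "s" is cut)
print(strip_suffix("tree"))      # tree

# startswith and endswith also accept a tuple of candidates.
print("reranked".endswith(("ing", "ed")))  # True
```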


If you want to check if a string appears anywhere within another string, use the *in* operator.

In [67]:
"dis" in S2

True

In [68]:
"able" in S2

True

In [69]:
"red" in S1

False

Another thing we often want to know is whether a string is in fact a word (consisting only of letters). The [isalpha](https://docs.python.org/3/library/stdtypes.html#str.isalpha) method does this. 

Note that whitespace, punctuation, and some characters that appear regularly within English words (e.g. "-" and "'") are not considered alphabetic.

In [72]:
#provided code
S1 = "hello"
S2 = "hello world"
S3 = "a can't-do-it attitude"

In [73]:
S1.isalpha()

True

In [74]:
S2.isalpha()

False

In [75]:
words = S3.split()
words[0].isalpha()

True

In [76]:
word2 = words[1]  #can't-do-it
word2.isalpha()

False

In [79]:
subwords = word2.split("-")
subwords[-1].isalpha()

True

In [80]:
subwords[0].replace("'", "").isalpha()

True
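
Building on the last cell, here is one possible way to accept hyphenated and contracted words. The function `is_wordlike` is a hypothetical helper, and which characters to allow is an assumption you would adapt to your own data.

```python
def is_wordlike(token):
    """Hypothetical helper: True if token has only letters, hyphens, and
    apostrophes, and contains at least one letter."""
    stripped = token.replace("-", "").replace("'", "")
    # isalpha returns False for the empty string, which rules out "--".
    return stripped.isalpha()

print(is_wordlike("can't-do-it"))  # True
print(is_wordlike("hello"))        # True
print(is_wordlike("hello!"))       # False
print(is_wordlike("--"))           # False
```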

## Case


Upper and lower case versions of the same word are considered entirely different in Python!

In [84]:
#provided code
S1 = "case"
S2 = "CASE"
S3 = "Case"


In [85]:
S1 == S2

False

In [86]:
S1 == S3

False

To convert between upper and lower case, use the [upper](https://docs.python.org/3/library/stdtypes.html#str.upper) and [lower](https://docs.python.org/3/library/stdtypes.html#str.lower) methods. If the string is already upper/lower case, this has no effect. It is very standard to lowercase all words to standardize for the effects of English capitalization rules.

In [88]:
S1.upper()

'CASE'

In [89]:
S2.lower()

'case'

In [90]:
S3.lower()

'case'
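
In practice, this means lowercasing both sides before comparing, or lowercasing every token before counting. The sketch below uses made-up names (`w1`, `w2`, `tokens`, `lowered`) to show the effect.

```python
# Illustration only: all names here are made up for this sketch.
w1 = "case"
w2 = "CASE"

# Compare the lowercased versions to ignore capitalization.
print(w1 == w2)                  # False
print(w1.lower() == w2.lower())  # True

# Lowercasing before counting merges "The" and "the".
tokens = "The cat saw the other cat".split()
print(tokens.count("the"))   # 1
lowered = [t.lower() for t in tokens]
print(lowered.count("the"))  # 2
```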

To check whether a string is entirely upper or lower case, use [isupper](https://docs.python.org/3/library/stdtypes.html#str.isupper) and [islower](https://docs.python.org/3/library/stdtypes.html#str.islower).

In [91]:
S1.isupper()

False

In [92]:
S2.isupper()

True

In [93]:
S3.islower()

False

## Numbers

Be careful about numbers represented as strings, versus real numbers.

In [51]:
#provided code
S = "42"
i = 42

In [52]:
S == i

False

To check if a string consists of only digits (and is thus appropriate for conversion to an integer), use [isdigit](https://docs.python.org/3/library/stdtypes.html#str.isdigit).




In [53]:
S.isdigit()

True

In [54]:
S1.isdigit()  # S1 is still "case" from the Case section above

False

In [55]:
i.isdigit()

AttributeError: 'int' object has no attribute 'isdigit'

Convert from strings to ints or floats using the built-in functions [int](https://docs.python.org/3/library/functions.html#int) and [float](https://docs.python.org/3/library/functions.html#float), and from a number to a string using [str](https://docs.python.org/3/library/stdtypes.html#str).

In [7]:
str(i) == S

True

In [8]:
int(S) == i

True

In [9]:
float(S) == i

True

In [10]:
str(float(S)) == S

False

In [11]:
str(float(S))

'42.0'
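
Putting isdigit and int together gives a simple guarded conversion. The function `to_int` below is a hypothetical helper, sketched under the assumption that the input is plain ASCII text; some Unicode characters, such as superscript digits, pass isdigit but are rejected by int.

```python
def to_int(text):
    """Hypothetical helper: int(text) if text is all digits, otherwise None."""
    if text.isdigit():
        return int(text)
    return None

print(to_int("42"))    # 42
print(to_int("-5"))    # None  ("-" is not a digit)
print(to_int("4.2"))   # None  ("." is not a digit)
print(to_int(""))      # None  (isdigit is False for the empty string)
print(to_int("case"))  # None
```

Note that isdigit is stricter than int itself: int("-5") works fine, so this check is only suitable when you expect non-negative whole numbers.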

## And more...

The *in* operator is useful when you want to know if a substring appears in a text, but often you want to know exactly where in the string it appears. For this, use the [find](https://docs.python.org/3/library/stdtypes.html#str.find) method. It returns -1 when the string is not found (be careful, since this could be misinterpreted as a valid negative index!).

In [62]:
#provided code
S = '''It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way – in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.'''

In [63]:
S.find(", ")

24

In [64]:
S.find("blurst")

-1

In [65]:
wisdom_index = S.find("wisdom")
S[wisdom_index - 7:wisdom_index + len("wisdom")]

'age of wisdom'
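
To make the warning about -1 concrete, here is a small sketch with a made-up string, `text`, showing what happens when the result of a failed search is used directly as an index.

```python
# Illustration only: `text` and `idx` are made-up names.
text = "the best of times"

idx = text.find("worst")
print(idx)         # -1
print(text[idx:])  # s   (silently wrong: -1 is read as "the last character")

# Check for -1 before slicing.
if idx != -1:
    print(text[idx:])
else:
    print("not found")  # this branch runs
```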

The find method has two optional arguments that are very useful: the first is an index at which to start the search, and the second is an index not to search past. The first can be used to find multiple instances of the same string. Let's use it to pull out each of the "it was the" clauses in the Dickens quote above.

In [66]:
clauses = []
start_index = 0 #I know it starts with "It was the"

# my code here
while start_index != -1:
    end_index = S.find(",", start_index)
    print(S[start_index:end_index])
    clauses.append(S[start_index:end_index])
    start_index = S.find("it was the", end_index)
    
# my code here
    
print(clauses)
print(S.split(", "))

It was the best of times
it was the worst of times
it was the age of wisdom
it was the age of foolishness
it was the epoch of belief
it was the epoch of incredulity
it was the season of Light
it was the season of Darkness
it was the spring of hope
it was the winter of despair
['It was the best of times', 'it was the worst of times', 'it was the age of wisdom', 'it was the age of foolishness', 'it was the epoch of belief', 'it was the epoch of incredulity', 'it was the season of Light', 'it was the season of Darkness', 'it was the spring of hope', 'it was the winter of despair']
['It was the best of times', 'it was the worst of times', 'it was the age of wisdom', 'it was the age of foolishness', 'it was the epoch of belief', 'it was the epoch of incredulity', 'it was the season of Light', 'it was the season of Darkness', 'it was the spring of hope', 'it was the winter of despair', 'we had everything before us', 'we had nothing before us', 'we were all going direct to Heaven', 'we were a
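
The same looping pattern can be generalized to collect the position of every match. The function `find_all` below is a hypothetical helper, not something built into Python, and it only reports non-overlapping matches.

```python
def find_all(text, sub):
    """Hypothetical helper: start indices of every non-overlapping match of sub."""
    indices = []
    if not sub:
        # Guard: an empty substring would otherwise loop forever.
        return indices
    start = text.find(sub)
    while start != -1:
        indices.append(start)
        # Resume the search just past the match we recorded.
        start = text.find(sub, start + len(sub))
    return indices

print(find_all("it was the best, it was the worst", "it was"))  # [0, 17]
print(find_all("banana", "an"))                                 # [1, 3]
print(find_all("banana", "x"))                                  # []
```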

If you want to know how often a particular substring appears in a string, use [count](https://docs.python.org/3/library/stdtypes.html#str.count).

In [105]:
print(S.count("best"))



1


In [None]:
print(S.count("we"))

In [None]:
print(S.count(" we "))
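
Two things to keep in mind about count: it matches substrings, not words, and it counts non-overlapping occurrences. The sketch below uses a made-up string, `line`, rather than the Dickens quote.

```python
# Illustration only: `line` is a made-up example string.
line = "we were weary"

# count matches inside longer words too.
print(line.count("we"))          # 3
# Splitting first and counting list items counts whole words only.
print(line.split().count("we"))  # 1

# Matches are non-overlapping: "aaaa" contains "aa" at 0 and 2, not at 1.
print("aaaa".count("aa"))        # 2
```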

Though much more rarely used, the Python slicing operator has an optional third argument known as the step. It allows you to get every nth item, but perhaps its most useful property is that a step of -1 will reverse the string. 

In [68]:
#provided code
nums = "123456"

In [70]:
nums[::2]

'135'

In [13]:
nums[::-1]

'654321'

In [16]:
S[::-1]

'.ylno nosirapmoc fo eerged evitalrepus eht ni ,live rof ro doog rof ,deviecer gnieb sti no detsisni seitirohtua tseision sti fo emos taht ,doirep tneserp eht ekil raf os saw doirep eht ,trohs ni – yaw rehto eht tcerid gniog lla erew ew ,nevaeH ot tcerid gniog lla erew ew ,su erofeb gnihton dah ew ,su erofeb gnihtyreve dah ew ,riapsed fo retniw eht saw ti ,epoh fo gnirps eht saw ti ,ssenkraD fo nosaes eht saw ti ,thgiL fo nosaes eht saw ti ,ytiludercni fo hcope eht saw ti ,feileb fo hcope eht saw ti ,ssenhsiloof fo ega eht saw ti ,modsiw fo ega eht saw ti ,semit fo tsrow eht saw ti ,semit fo tseb eht saw tI'

Exercise: Use this functionality to print out the palindromes (words that read the same forward or backwards) in the list below.

In [110]:
to_check = ["abba","fun","toot","woohoo","racecar","duh","mom"]

for item in to_check:
    #your code here
    if item == item[::-1]:
        print(item + " is a palindrome")
    else:
        print(item + " is not a palindrome")
    #your code here

abba is a palindrome
fun is not a palindrome
toot is a palindrome
woohoo is not a palindrome
racecar is a palindrome
duh is not a palindrome
mom is a palindrome
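
The check above can be extended to whole phrases by combining several methods from this lecture (lower, isalpha, join, and a step of -1). The function `is_palindrome` is a hypothetical helper, and ignoring case and non-letters is an assumption about what should count as a palindrome.

```python
def is_palindrome(text):
    """Hypothetical helper: palindrome test that ignores case and non-letters."""
    # Keep only the letters, lowercased, and glue them back together.
    letters = "".join([ch.lower() for ch in text if ch.isalpha()])
    # An empty string trivially equals its reverse, so rule it out.
    return letters != "" and letters == letters[::-1]

print(is_palindrome("racecar"))                         # True
print(is_palindrome("A man, a plan, a canal: Panama"))  # True
print(is_palindrome("woohoo"))                          # False
print(is_palindrome("..."))                             # False
```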


Those are some of the main methods/operators for strings, but feel free to explore the docs to find others that might be useful for you!