# Strings

## We can do things with strings

In Data 8 we've already seen some operations that can be done with strings.

In [1]:
first_name = "Franz"
last_name = "Kafka"
full_name = first_name + last_name
print(full_name)

FranzKafka


Remember that computers don't understand context: Python joins the two strings exactly as written, so we have to add the space ourselves.

In [2]:
full_name = first_name + " " + last_name
print(full_name)

Franz Kafka


## Strings are made up of sub-strings

You can think of strings as a [sequence](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#sequence) of smaller strings or characters. We can access a piece of that sequence using square brackets `[]`.

In [3]:
full_name[1]

'r'

<div class="alert alert-danger">
Don't forget, Python (and many other languages) starts counting from 0.
</div>

In [4]:
full_name[0]

'F'

In [5]:
full_name[4]

'z'

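## You can slice strings using `[ : ]`

If you want a range (or "slice") of a sequence, give a start index and an end index separated by a colon. You get everything from the first index up to, but not including, the second index; i.e., Python slicing is *exclusive* of the end index: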

In [6]:
full_name[0:4]

'Fran'

In [7]:
full_name[0:5]

'Franz'

You can see some of the logic for this when we consider implicit indices: leaving out the first index means "from the start", and leaving out the second means "to the end".

In [8]:
full_name[:5]

'Franz'

In [9]:
full_name[5:]

' Kafka'
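
One way to see why an exclusive end index is convenient is the short sketch below. It reuses the `full_name` value from above; the names `cut`, `left`, and `right` are made up just for this illustration.

```python
full_name = "Franz Kafka"

cut = 5  # any index would work as the split point

left = full_name[:cut]   # everything before index 5
right = full_name[cut:]  # everything from index 5 onward

print(left)
print(right)

# Because the end index is excluded, the two pieces never overlap,
# so gluing them back together gives the original string.
print(left + right == full_name)
```

The same index appears in both slices, and no character is counted twice or left out.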

If we want to find out how long a string is, we can use the `len` function:

In [10]:
len(full_name)

11
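
Since counting starts at 0, the length of a string is always one more than its last valid index. The sketch below illustrates this with the same `full_name` value; the `try`/`except` is only there so the example keeps running after the error, and the name `n` is just for illustration.

```python
full_name = "Franz Kafka"
n = len(full_name)

# Counting starts at 0, so the last character sits at index n - 1.
print(full_name[n - 1])

# Index n itself is one past the end and raises an IndexError.
try:
    full_name[n]
except IndexError as err:
    print("IndexError:", err)
```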

## Strings have methods

* There are other operations defined on string data. These are called **string [methods](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#method)**.
* Jupyter Notebook lets you do tab-completion after a dot ('.') to see what methods an [object](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#object) (i.e., a defined variable) has to offer. Try it now!

In [11]:
str.

SyntaxError: invalid syntax (<ipython-input-11-8c081d95124d>, line 1)

Let's look at the `upper` method. What does it do? Let's take a look at the documentation. Jupyter Notebooks let us do this with a question mark ('?') before *or* after an object (again, a defined variable).

In [12]:
str.upper?

So we can use it to upper-caseify a string. 

In [13]:
full_name.upper()

'FRANZ KAFKA'

You have to use the parentheses at the end because `upper` is a method of the string class, and the parentheses are what actually call it.
<p></p>
<div class="alert alert-danger">
Don't forget, simply calling the method does not change the original variable; you must *reassign* the variable:
</div>

In [14]:
print(full_name)

Franz Kafka


In [15]:
full_name = full_name.upper()
print(full_name)

FRANZ KAFKA


For what it's worth, you don't need to have a variable to use the `upper()` method; you could use it on the string itself.

In [16]:
"Franz Kafka".upper()

'FRANZ KAFKA'

What do you think should happen when you call `upper` on an int? What about a string representation of an int?

In [17]:
1.upper()

SyntaxError: invalid syntax (<ipython-input-17-02adcf0a0b2a>, line 1)

In [18]:
"1".upper()

'1'
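
The `SyntaxError` for `1.upper()` comes from how Python reads `1.`: it looks like the start of a decimal number such as `1.5`, so Python never gets as far as looking up `upper`. The sketch below wraps the integer in parentheses to get past that and shows the error you might have expected instead. The `try`/`except` is only there so the example keeps running, and the name `message` is just for illustration.

```python
# Parentheses stop Python from reading "1." as the start of a float.
try:
    (1).upper()
except AttributeError as err:
    # Integers simply have no method called upper.
    message = str(err)
    print("AttributeError:", message)

# Converting the integer to a string first works fine.
print(str(1).upper())
```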

## Challenge 1: Write your name

1. Make two string variables, one with your first name and one with your last name.
2. Concatenate both strings to form your full name and [assign](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#assign) it to a variable.
3. Assign a new variable that has your full name in all upper case.
4. Slice that string to get your first name again.

In [19]:
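# 1. First and last name as separate strings
first_name = "Jillian"
last_name = "Smith"
# 2. Concatenate, with a space in between
full_name = first_name + " " + last_name
# 3. Upper-case copy of the full name
upper_full_name = full_name.upper()
# 4. Slice out the first name (7 characters long)
full_name[0:7]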

'Jillian'

## Challenge 2: Try seeing what the following string methods do:

* `split`
* `join`
* `replace`
* `strip`
* `find`

In [20]:
my_string = "It was a Sunday morning at the height of spring."
my_string.split()

['It', 'was', 'a', 'Sunday', 'morning', 'at', 'the', 'height', 'of', 'spring.']

In [21]:
my_string = "It was a Sunday morning at the height of spring."
s = "-"
broke_up = my_string.split()
s.join(broke_up)

'It-was-a-Sunday-morning-at-the-height-of-spring.'

In [22]:
my_string = "It was a Sunday morning at the height of spring."
new_season = "winter."
my_string.replace("spring.", new_season)

'It was a Sunday morning at the height of winter.'

In [23]:
my_string = "It was a Sunday morning at the height of spring."
my_string.strip('.')

'It was a Sunday morning at the height of spring'

In [24]:
my_string = "It was a Sunday morning at the height of spring."
my_string.find('m')

16

## Challenge 3: Working with strings

Below is a string of Edgar Allan Poe's "A Dream Within a Dream":

In [25]:
poem = '''Take this kiss upon the brow!
And, in parting from you now,
Thus much let me avow —
You are not wrong, who deem
That my days have been a dream;
Yet if hope has flown away
In a night, or in a day,
In a vision, or in none,
Is it therefore the less gone?  
All that we see or seem
Is but a dream within a dream.

I stand amid the roar
Of a surf-tormented shore,
And I hold within my hand
Grains of the golden sand —
How few! yet how they creep
Through my fingers to the deep,
While I weep — while I weep!
O God! Can I not grasp 
Them with a tighter clasp?
O God! can I not save
One from the pitiless wave?
Is all that we see or seem
But a dream within a dream?'''

What is the difference between `poem.strip("?")` and `poem.replace("?", "")`?

In [26]:
poem.strip("?")

'Take this kiss upon the brow!\nAnd, in parting from you now,\nThus much let me avow —\nYou are not wrong, who deem\nThat my days have been a dream;\nYet if hope has flown away\nIn a night, or in a day,\nIn a vision, or in none,\nIs it therefore the less gone?  \nAll that we see or seem\nIs but a dream within a dream.\n\nI stand amid the roar\nOf a surf-tormented shore,\nAnd I hold within my hand\nGrains of the golden sand —\nHow few! yet how they creep\nThrough my fingers to the deep,\nWhile I weep — while I weep!\nO God! Can I not grasp \nThem with a tighter clasp?\nO God! can I not save\nOne from the pitiless wave?\nIs all that we see or seem\nBut a dream within a dream'

In [27]:
poem.replace("?", "")

'Take this kiss upon the brow!\nAnd, in parting from you now,\nThus much let me avow —\nYou are not wrong, who deem\nThat my days have been a dream;\nYet if hope has flown away\nIn a night, or in a day,\nIn a vision, or in none,\nIs it therefore the less gone  \nAll that we see or seem\nIs but a dream within a dream.\n\nI stand amid the roar\nOf a surf-tormented shore,\nAnd I hold within my hand\nGrains of the golden sand —\nHow few! yet how they creep\nThrough my fingers to the deep,\nWhile I weep — while I weep!\nO God! Can I not grasp \nThem with a tighter clasp\nO God! can I not save\nOne from the pitiless wave\nIs all that we see or seem\nBut a dream within a dream'

`poem.strip("?")` only removes `?` characters from the very beginning and very end of the string, so here it drops just the final `?`. Any run of `?` at either end would be removed, but nothing in the middle is touched. `poem.replace("?", "")` finds every single `?` and replaces it with an empty string, which gets rid of every `?` in the string.
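
The difference is easier to see on a short string. The sketch below uses a made-up string `s` (not part of the poem) with question marks at both ends and in the middle.

```python
# A short made-up string, so the difference is easy to see.
s = "??Is all that we see?? or seem??"

# strip only trims the ends, and it trims every matching character there.
stripped = s.strip("?")
print(stripped)

# replace touches every occurrence, including the ones in the middle.
replaced = s.replace("?", "")
print(replaced)
```

Note that `strip` leaves the `??` in the middle alone, while `replace` removes it.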

At what index does the word "*and*" first appear? Where does it last appear?

In [28]:
first = poem.find('And')
last = poem.rfind('And')  # rfind searches from the right, so it returns the last match
first_words = "The first 'And' starts at index"
last_words = "The last 'And' starts at index"
print(first_words, first)
print(last_words, last)

The first 'And' starts at index 30
The last 'And' starts at index 359


How can you answer the above accounting for upper- and lowercase?

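You can make the entire text all uppercase or all lowercase and then search for 'and'. Keep in mind that `find` matches any substring, so it also reports the 'and' inside words like "stand", "hand", and "sand".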

In [30]:
new_poem = poem.lower()
and_1 = new_poem.find('and')
and_2 = new_poem.find('and', 31)
and_3 = new_poem.find('and', 315)
and_4 = new_poem.find('and', 360)
and_5 = new_poem.find('and', 382)
print(and_1)
print(and_2)
print(and_3)
print(and_4)
print(and_5)

30
314
359
381
407
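
Rather than typing in each starting offset by hand, we could let a loop keep calling `find` from just past the previous match until it returns `-1`. This is only a sketch on a two-line excerpt; the names `text`, `lowered`, `positions`, and `start` are made up for the example.

```python
# A short excerpt keeps the example small; any string would do.
text = "I stand amid the roar\nAnd I hold within my hand"
lowered = text.lower()

positions = []
start = lowered.find("and")
while start != -1:  # find returns -1 when there is no further match
    positions.append(start)
    # Continue the search one character past the previous match.
    start = lowered.find("and", start + 1)

print(positions)
```

Of the matches in this excerpt, only the one at the start of the second line is the word "And" itself; the others sit inside "stand" and "hand".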


## Challenge 4: Counting Text

Below is a string of Robert Frost's "The Road Not Taken":

In [31]:
poem = '''Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;

Then took the other, as just as fair,
And having perhaps the better claim,
Because it was grassy and wanted wear;
Though as for that the passing there
Had worn them really about the same,

And both that morning equally lay
In leaves no step had trodden black.
Oh, I kept the first for another day!
Yet knowing how way leads on to way,
I doubted if I should ever come back.

I shall be telling this with a sigh
Somewhere ages and ages hence:
Two roads diverged in a wood, and I—
I took the one less traveled by,
And that has made all the difference.'''

Using the `len` function and the string methods, answer the following questions:

How many characters are in the poem? (Note that `len` counts every character, including spaces, punctuation, and line breaks, not just letters.)

In [32]:
len(poem)

729

How many words?

In [45]:
word_array = poem.split()
len(word_array)

144

How many lines? (HINT: A line break is represented as `\n`.)

In [44]:
lines = poem.count('\n')
print(lines + 1)

23
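
Counting `\n` this way also counts the blank lines that separate the stanzas. If we only wanted lines that contain text, one possible approach is sketched below on a short excerpt. It uses the `splitlines` string method, and the names `excerpt`, `all_lines`, and `verse_lines` are made up for the example.

```python
# A two-stanza excerpt; the blank line separates the stanzas.
excerpt = (
    "Two roads diverged in a yellow wood,\n"
    "And sorry I could not travel both\n"
    "\n"
    "Then took the other, as just as fair,"
)

all_lines = excerpt.splitlines()

# Keep only the lines that still contain something after trimming whitespace.
verse_lines = [line for line in all_lines if line.strip()]

print(len(all_lines))    # counts the blank separator line too
print(len(verse_lines))  # counts only lines with text
```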


How many stanzas?

In [37]:
stanzas = poem.count('\n\n') + 1 
print(stanzas)

4


How many unique words? (HINT: look up what a `set` is)

In [43]:
unique_words = set(word_array)
print(len(unique_words))

print(unique_words)

100
{'leaves', 'there', 'long', 'leads', 'all', 'the', 'them', 'day!', 'trodden', 'telling', 'a', 'come', 'lay', 'wanted', 'morning', 'equally', 'perhaps', 'for', 'in', 'ages', 'down', 'had', 'knowing', 'another', 'other,', 'traveled', 'looked', 'it', 'this', 'Then', 'better', 'Two', 'by,', 'less', 'Had', 'difference.', 'Yet', 'doubted', 'just', 'Because', 'how', 'passing', 'was', 'first', 'way,', 'sigh', 'roads', 'claim,', 'both', 'back.', 'not', 'made', 'having', 'far', 'one', 'black.', 'stood', 'In', 'yellow', 'and', 'step', 'Somewhere', 'traveler,', 'undergrowth;', 'ever', 'wear;', 'should', 'on', 'Oh,', 'kept', 'hence:', 'no', 'that', 'diverged', 'bent', 'way', 'Though', 'really', 'be', 'as', 'I—', 'where', 'grassy', 'could', 'I', 'same,', 'has', 'with', 'took', 'shall', 'travel', 'sorry', 'to', 'wood,', 'And', 'if', 'To', 'fair,', 'worn', 'about'}


Remove commas and check the number of unique words again. Why is it different?

In [42]:
no_commas = poem.replace(',', '')
poem_words = no_commas.split()
unique_words = set(poem_words)
print(len(unique_words))

99


The number of unique words without commas is different because in the original set of unique words there were two entries for the word "way". One was 'way' and the other was 'way,' (with a comma at the end). When we remove all the commas first, Python only finds one unique 'way', which is why we now have 99 unique words instead of 100.
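
Commas are not the only thing that splits one word into several entries: other punctuation and capitalization (for example 'And' versus 'and') do the same. A rough sketch of one way to account for that is below, using a short excerpt rather than the whole poem. It relies on `string.punctuation` from the standard library, and the names `excerpt`, `raw_words`, and `clean_words` are made up for the example.

```python
import string

# A short excerpt keeps the example small.
excerpt = (
    "Yet knowing how way leads on to way,\n"
    "And both that morning equally lay\n"
    "Somewhere ages and ages hence:"
)

raw_words = excerpt.split()

# Lowercase each word and trim punctuation from both ends.
clean_words = [word.lower().strip(string.punctuation) for word in raw_words]

print(len(set(raw_words)))
print(len(set(clean_words)))
```

`string.punctuation` only covers ASCII punctuation, so a character like the long dash in 'I—' would need to be handled separately.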