In [None]:
from IPython.display import YouTubeVideo

YouTubeVideo("GieEvhO4fKA")

# Strings
*  Strings are **immutable** sequences of characters. 
* Strings are specified using either single (') or double (") quotes
    * It doesn't matter which you use; choose whichever is more convenient for the string you are typing.
    * <span style="color:red">Personally I find this problematic, because I am not consistent in my use of ' vs " and thus have messy code to edit.</span>
* Triple quotes ('''xyz''',"""xyz""")
    * Useful for strings that contain quote symbols or that span multiple lines
   

In [1]:
zeus = "Zeus"
apollo = 'Apollo'
poseidon  = '''Poseidon'''
text1 = """I heard her say to the driver, "I wouldn't want to drive this bus myself." """
text1b = 'I heard him say, "I would not want to ride the bus all day."'
print(text1)
print(text1b)

I heard her say to the driver, "I wouldn't want to drive this bus myself." 
I heard him say, "I would not want to ride the bus all day."


In [2]:
text2 = '''I heard her say to the driver, "I wouldn't want to drive this bus myself."'''
print(text2)

I heard her say to the driver, "I wouldn't want to drive this bus myself."


In [4]:
text3 = """I heard her say to the driver, "I wouldn't want to drive this bus myself." """


# A Little History About Characters and Computers

* Computers are built from transistors (1's and 0's), which naturally translate to numbers
* How do you get computers to represent characters?
    * Map numbers to characters
* ASCII [(American Standard Code for Information Interchange)](https://en.wikipedia.org/wiki/ASCII)
    * First standard published in 1963
    * Has 128 characters, based primarily on the English alphabet
![ASCII Code Tables](https://upload.wikimedia.org/wikipedia/commons/e/e0/ASCII_Code_Chart-Quick_ref_card.png)

## Limitations of ASCII

* The world isn't all written in English
### Introduce [UNICODE](https://en.wikipedia.org/wiki/Unicode)

>the latest version of Unicode contains a repertoire of more than 120,000 characters covering 129 modern and historic scripts, as well as multiple symbol sets. ([Wikipedia](https://en.wikipedia.org/wiki/Unicode))

## Python and ASCII and UNICODE

* In Python 2.x strings were by default ASCII but could be UNICODE
* In Python 3.x strings are all UNICODE

### Going between the character and the numeric representation
* **ord(CHARACTER)**: returns the integer (decimal) value that corresponds to a single-character string
    * This is the basis of comparison of string values
* **chr(INTEGER)**: returns the character corresponding to the integer value

In [5]:
print(chr(89))
print(chr(5674))

Y
ᘪ


In [6]:
print(ord("œ"))

339


In [7]:
print(ord("a"))

97


### Comparing Characters Means Comparing Their Ordinal Values

In [8]:
print('a' < 'A')

False
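The same rule extends from single characters to whole strings: they are compared character by character using the ordinal values. A minimal sketch, using a made-up list of words, of what this means for sorting:

```python
# Uppercase letters have smaller ordinal values than lowercase letters
print(ord("A"), ord("a"))

# Hypothetical word list, used only for this illustration
words = ["banana", "Cherry", "apple"]

# Default sorting compares ordinal values, so "Cherry" comes first
by_ordinal = sorted(words)
print(by_ordinal)

# Lowercasing each word for the comparison gives a case-insensitive order
ignoring_case = sorted(words, key=str.lower)
print(ignoring_case)
```

Passing `key=str.lower` only changes how the words are compared; the strings in the result keep their original case.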


# Strings: Accessing Data
* Strings are an example of a sequence in Python. (We will talk more about sequences in the next class.)
* We can access individual characters in a string with a square bracket ([]) syntax.
* Python sequences start at 0. That is, the first element in the sequence is accessed by the number 0 (zero).
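A small sketch of indexing, and of what immutability means in practice. The variable names are chosen only for this illustration.

```python
# A hypothetical string used only for this illustration
god = "Poseidon"

first = god[0]    # index 0 is the first character
last = god[-1]    # negative indices count from the end of the string
print(first, last)

# Strings are immutable, so assigning to an index raises a TypeError
try:
    god[0] = "p"
    changed = True
except TypeError as err:
    changed = False
    print("Cannot modify in place:", err)

# Build a new string instead of changing the old one
lowered = "p" + god[1:]
print(lowered)
print(god)  # the original string is unchanged
```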

## **Slicing**

* You can access a segment of a string using a slicing notation: **STRING[start:stop:increment]**
    * start is inclusive
    * **stop is exclusive**


In [9]:
print(text2)
print(text2[0])
print(text2[0:13])
print(text2[13])

I heard her say to the driver, "I wouldn't want to drive this bus myself."
I
I heard her s
a


## Slicing
* start, stop and increment all have default values
    * start: 0
    * stop: Length of string
    * increment: 1

In [10]:
print(text2[:11])
print(text2[11:])
print(text2[::2])

I heard her
 say to the driver, "I wouldn't want to drive this bus myself."
Ihadhrsyt h rvr Iwud' att rv hsbsmsl.
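Negative values also work in slices. The sketch below, on a made-up string, shows a few combinations that may be useful:

```python
word = "immutable"  # hypothetical example string

tail = word[-4:]          # start four characters from the end
backwards = word[::-1]    # a negative increment walks the string in reverse
every_third = word[1::3]  # start at index 1 and take every third character
clipped = word[5:100]     # a stop beyond the end is clipped, not an error

print(tail)
print(backwards)
print(every_third)
print(clipped)
```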


# Strings: Attributes and Methods
* Strings are objects that have **attributes** and **methods**
    * **attributes** think nouns (things strings have)
    * **methods** think verbs (things strings do)
* You can access **attributes** and **methods** using the 'dot' (.) notation
* You can learn about the **attributes** and **methods** using tab completion and **help()**

In [11]:
text2.isupper()

False

In [12]:
help(text2.isalnum)

Help on built-in function isalnum:

isalnum(...) method of builtins.str instance
    S.isalnum() -> bool
    
    Return True if all characters in S are alphanumeric
    and there is at least one character in S, False otherwise.



In [13]:
text2.isalnum()

False

In [14]:
text2.split()

['I',
 'heard',
 'her',
 'say',
 'to',
 'the',
 'driver,',
 '"I',
 "wouldn't",
 'want',
 'to',
 'drive',
 'this',
 'bus',
 'myself."']

# string module
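* In Python 2, many of the string object methods were also available as functions in the string module; in Python 3, the module mainly provides constants (for example **string.punctuation** and **string.ascii_letters**) and a few helpers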
# What do you want to do with strings?
* Change case (upper and lower)
* Recognize punctuation
* Check to see if a substring is in the string
* Check to see if the string is a letter, a number, alphanumeric, punctuation
* Split into substrings
* Concatenate multiple strings
* Replicate
## How many of these can we identify with methods and functions in Python?

In [15]:
import string
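As a partial answer to the question above, here is a minimal sketch covering three items from the list: recognizing punctuation with the `string.punctuation` constant, concatenating with `+`, and replicating with `*`. The example text and variable names are made up for illustration.

```python
import string

# string.punctuation is a constant holding the ASCII punctuation characters
print(string.punctuation)

phrase = "Call H.O. with changes!"  # hypothetical example text

# Keep only the characters that are not punctuation
no_punct = "".join(ch for ch in phrase if ch not in string.punctuation)
print(no_punct)

# Concatenate with + and replicate with *
greeting = "Zeus" + " & " + "Apollo"
rule = "-" * 10
print(greeting)
print(rule)
```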

# String Split
* Split the string with a specified delimiter
* Returns a **list** (to be discussed later) of substrings

In [16]:
a = '1,2,3,4,5'
help(a.split)

Help on built-in function split:

split(...) method of builtins.str instance
    S.split(sep=None, maxsplit=-1) -> list of strings
    
    Return a list of the words in S, using sep as the
    delimiter string.  If maxsplit is given, at most maxsplit
    splits are done. If sep is not specified or is None, any
    whitespace string is a separator and empty strings are
    removed from the result.



In [17]:
a.split()
# no white space in a, so returns a list containing the whole string

['1,2,3,4,5']

In [18]:
a.split(',')
# sep is now ',' so split will break a up by commas

['1', '2', '3', '4', '5']
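Two details from the help text are easy to miss: `maxsplit` limits the number of splits, and splitting on whitespace in general (no `sep`) behaves differently from splitting on a single space. A short sketch with made-up strings:

```python
record = "name: Chapman: Brian"  # hypothetical example text

# maxsplit=1 stops after the first delimiter, so the rest stays together
key, value = record.split(": ", 1)
print(key)
print(value)

spaced = "resp  care"  # note the two spaces

# With no sep, runs of whitespace count as one separator
default_parts = spaced.split()
# With sep=" ", every single space separates, leaving an empty string
space_parts = spaced.split(" ")
print(default_parts)
print(space_parts)
```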

In [19]:
note ="""resp care
pt received on psv mode, per team peep placed back on at 5 cmH20. initially pt requiring ps 12, now on 8 for progression of weaning. tolerating fair with rr approx 25-32 range. mdi's given q4h, flovent started at 8 p. bid. cuff leak seems more constant today, ?worse with peep on, cuff pressure kept at 30 cmH20 with 10 cc's in cuff, to seal it would require cuff pressure of 45 cmh20. IP evaluated and chooses not to replace trach at this time, maintain cuff pressure at 30. c/w slow wean, progress to trach mask as soon as possible."""

print(note.split())

['resp', 'care', 'pt', 'received', 'on', 'psv', 'mode,', 'per', 'team', 'peep', 'placed', 'back', 'on', 'at', '5', 'cmH20.', 'initially', 'pt', 'requiring', 'ps', '12,', 'now', 'on', '8', 'for', 'progression', 'of', 'weaning.', 'tolerating', 'fair', 'with', 'rr', 'approx', '25-32', 'range.', "mdi's", 'given', 'q4h,', 'flovent', 'started', 'at', '8', 'p.', 'bid.', 'cuff', 'leak', 'seems', 'more', 'constant', 'today,', '?worse', 'with', 'peep', 'on,', 'cuff', 'pressure', 'kept', 'at', '30', 'cmH20', 'with', '10', "cc's", 'in', 'cuff,', 'to', 'seal', 'it', 'would', 'require', 'cuff', 'pressure', 'of', '45', 'cmh20.', 'IP', 'evaluated', 'and', 'chooses', 'not', 'to', 'replace', 'trach', 'at', 'this', 'time,', 'maintain', 'cuff', 'pressure', 'at', '30.', 'c/w', 'slow', 'wean,', 'progress', 'to', 'trach', 'mask', 'as', 'soon', 'as', 'possible.']


# String Join
* **join()** is the inverse of **split()**
* Base string becomes the delimiter

In [20]:
number_list = a.split(",") #split by comma and put elements into a list
print(number_list)
print(''.join(number_list)) #base string '' is the delimiter, so the elements are joined with no space between them

['1', '2', '3', '4', '5']
12345


In [21]:
print( ' '.join(number_list)) #now the elements have a space between them because ' ' is the delimiter

1 2 3 4 5


In [22]:
print(','.join(number_list))

print(', '.join(number_list))
print( 'this will look messy'.join(number_list))         

1,2,3,4,5
1, 2, 3, 4, 5
1this will look messy2this will look messy3this will look messy4this will look messy5
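One thing to watch for: `join()` only accepts strings, so numbers need to be converted first. A minimal sketch, with a made-up list of integers:

```python
values = [1, 2, 3]  # hypothetical list of integers, not strings

# join() raises a TypeError when the items are not strings
try:
    ",".join(values)
    join_failed = False
except TypeError as err:
    join_failed = True
    print("join needs strings:", err)

# Convert each item with str() before joining
joined = ",".join(str(v) for v in values)
print(joined)

# split() followed by join() with the same delimiter restores the original
original = "1,2,3,4,5"
round_trip = ",".join(original.split(","))
print(round_trip == original)
```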


# String methods for preprocessing/modifying
* Note that since strings are **immutable** these methods don't change the string, but return a *new* string
* **lower()**: converts all characters to lower case
* **upper()**: converts all characters to upper case
* **replace(a,b)**: replaces all occurrences of a in string with b (e.g., replacing tabs with spaces)

In [23]:
note2 = """PLANL: Admin K-excelate. 3% NS with q2h Na levels. Neuro exam q1h. Monitor I/O, resp status. MRI today after questionaire completed. Call H.O. with changes."""
print(note2.upper())
print("-"*42)
print(note2.lower())
print("-"*42)
print(note2)

PLANL: ADMIN K-EXCELATE. 3% NS WITH Q2H NA LEVELS. NEURO EXAM Q1H. MONITOR I/O, RESP STATUS. MRI TODAY AFTER QUESTIONAIRE COMPLETED. CALL H.O. WITH CHANGES.
------------------------------------------
planl: admin k-excelate. 3% ns with q2h na levels. neuro exam q1h. monitor i/o, resp status. mri today after questionaire completed. call h.o. with changes.
------------------------------------------
PLANL: Admin K-excelate. 3% NS with q2h Na levels. Neuro exam q1h. Monitor I/O, resp status. MRI today after questionaire completed. Call H.O. with changes.


In [24]:
print(note2.swapcase())
print("-"*42)

print(note2.replace('a','Z'))
print("-"*42)

print(note2.replace(' ','')) # replace spaces with empty string

planl: aDMIN k-EXCELATE. 3% ns WITH Q2H nA LEVELS. nEURO EXAM Q1H. mONITOR i/o, RESP STATUS. mri TODAY AFTER QUESTIONAIRE COMPLETED. cALL h.o. WITH CHANGES.
------------------------------------------
PLANL: Admin K-excelZte. 3% NS with q2h NZ levels. Neuro exZm q1h. Monitor I/O, resp stZtus. MRI todZy Zfter questionZire completed. CZll H.O. with chZnges.
------------------------------------------
PLANL:AdminK-excelate.3%NSwithq2hNalevels.Neuroexamq1h.MonitorI/O,respstatus.MRItodayafterquestionairecompleted.CallH.O.withchanges.
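Because each of these methods returns a new string, calls can be chained. The sketch below uses a made-up fragment of text; note that `strip()` (remove leading and trailing whitespace) and the optional count argument of `replace()` are not covered above and are included here only as illustration.

```python
raw = "  Neuro exam\tq1h.  "  # hypothetical text with a tab and extra spaces

# Each method returns a new string, so the calls can be chained
cleaned = raw.strip().replace("\t", " ").lower()
print(cleaned)

# raw itself is unchanged because strings are immutable
print(repr(raw))

# An optional third argument limits how many occurrences are replaced
first_only = "a-b-c".replace("-", "+", 1)
print(first_only)
```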


# String methods for evaluating
*  You can check whether a string is 
    * alphanumeric (**isalnum()**)
    * alphabetic (**isalpha()**)
    * made up of digits (**isdigit()**)
    * whitespace (**isspace()**)
    * upper case (**isupper()**)
    * lower case (**islower()**)
    

In [25]:
a='10'  
print(a.isalnum())
print(a.isalpha())
print(a.isdigit())
print(a.isspace())


True
False
True
False


In [26]:
print('\n'.isspace()) # isspace() is a method; it is applied to the string preceding the dot
print('\t'.isspace())
print('b'.isspace())
print('b'.isalpha())

True
True
False
True
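To see several of these checks side by side, here is a small sketch. The helper `describe` is a hypothetical function written for this illustration, not part of Python.

```python
def describe(text):
    """Return the names of the is* checks that text passes."""
    checks = ["isalnum", "isalpha", "isdigit", "isspace", "isupper", "islower"]
    # getattr(text, name) looks up the method by name; () then calls it
    return [name for name in checks if getattr(text, name)()]

print(describe("10"))
print(describe("abc"))
print(describe("3.14"))  # the '.' is neither a letter nor a digit
print(describe(" \t"))
print(describe(""))      # every check is False for the empty string
```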


# String Inclusion
## **in**
* **STRING1 in STRING2**: Test whether **STRING1** is anywhere in **STRING2**

In [27]:
report = """Patient: Brian Chapman
12/20/2012 – 2:08PM

HISTORY:  s/p skateboard injury with direct trauma to the right shoulder.  Possibly acromioclavicular separation.  Persistent pain despite several months of physical therapy.  Status post steroid injection  of the acromioclavicular joint one week prior to examination.  New onset of posterior shoulder pain.

COMPARISON STUDIES: 
Radiographs performed at UCSD of the right shoulder in July 2012.

TECHNIQUE: 
On a 3 Tesla superconducting magnet, coronal and sagittal T1, coronal and sagittal T2 fat saturated, and axial proton density fat saturated sequences through the right shoulder were performed without the administration of intravenous or intra-articular contrast.

FINDINGS:  
Supraspinatus tendon: within normal limits. No muscle atrophy or edema is observed.

Infraspinatus tendon: Low grade partial thickness articular sided tearing on a background of mild tendinosis.  No muscle atrophy or edema is observed.

Subscapularis tendon:  Mild signal alteration and thickening of the supraspinatus.  No evidence of tearing.    No muscle atrophy or edema is observed.

Teres minor tendon: within normal limits.  No muscle atrophy or edema is observed.

Subacromial/Subdeltoid Bursa:  A small amount of fluid.

Visualized deltoid muscle and tendon: grossly normal.

Long head biceps tendon: within normal limits. 

Labrum:  within normal limits.  No paralabral cysts are identified.

Rotator cuff interval: within normal limits.

Axillary recess: within normal limits.

Suprascapular and spinoglenoid notches: no mass is identified. 

Quadrilateral space: no mass is identified.

Bone marrow: no bone marrow edema or acute fracture.  No dislocation.  Patchy marrow signal within the glenoid and proximal humeral shaft with sparing of the humeral epiphysis is in keeping with hematopoetic marrow.

Acromioclavicular joint:  Moderate acromioclavicular joint osteoarthrosis with downgoing osteophytes, marrow edema, subchondral cystic change, and capsular hypertrophy.  There is moderate mass effect on the underlying supraspinatus musculotendinous junction.

Cartilage: grossly within normal limits.

Coracoclavicular Ligaments: The conoid and trapezoid components of the coracoclavicular ligament are intact.


IMPRESSION:



1.  Moderate acromioclavicular joint osteoarthrosis.

2.  Low grade partial thickness articular sided tearing of the infraspinatus on a background of mild tendinosis.

3.  Mild supraspinatus tendinosis.

4.  Mild subacromial-subdeltoid bursitis.
"""

In [28]:
print("Shoulder" in report)
print("shoulder" in report)
print("SHOULDER" in report)
print("SHOULDER" in report.upper())

False
True
False
True
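The last line above suggests a general pattern: convert both strings to the same case before testing. A minimal sketch, where `contains_ignore_case` is a hypothetical helper name and the snippet is a short excerpt of the report:

```python
def contains_ignore_case(needle, haystack):
    """Test for inclusion after lowercasing both strings."""
    return needle.lower() in haystack.lower()

snippet = "New onset of posterior shoulder pain."

print("Shoulder" in snippet)                       # case-sensitive test
print(contains_ignore_case("Shoulder", snippet))   # case-insensitive test
print(contains_ignore_case("pituitary", snippet))
```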


## **find(STRING)**

In [29]:
help(report.find)

Help on built-in function find:

find(...) method of builtins.str instance
    S.find(sub[, start[, end]]) -> int
    
    Return the lowest index in S where substring sub is found,
    such that sub is contained within S[start:end].  Optional
    arguments start and end are interpreted as in slice notation.
    
    Return -1 on failure.



In [30]:
print(report.find("edema")) # recall report is the text above; this prints the index of the first occurrence of "edema"
print(report.lower().find("edema"))
print(report.lower().find("pituitary")) # "pituitary" is not in report, so find() returns -1 to signal "not found" (a real match position is always 0 or greater)

794
794
-1
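`find()` only reports the first match, but its optional start argument makes it possible to keep searching. A sketch of one way to collect every position; `find_all` is a hypothetical helper and the snippet is a made-up sentence:

```python
def find_all(text, sub):
    """Return the starting index of every occurrence of sub in text."""
    positions = []
    start = text.find(sub)
    while start != -1:
        positions.append(start)
        # Resume the search one character after the previous match
        start = text.find(sub, start + 1)
    return positions

snippet = "No muscle atrophy or edema. Marrow edema is seen."
print(find_all(snippet, "edema"))
print(find_all(snippet, "carotid"))  # no match gives an empty list
```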


## index(STRING)

In [31]:
help(report.index)

Help on built-in function index:

index(...) method of builtins.str instance
    S.index(sub[, start[, end]]) -> int
    
    Like S.find() but raise ValueError when the substring is not found.



In [32]:
print(report.index("shoulder")) #Like S.find() but raise ValueError when the substring is not found.
print(report.index("carotid"))

108


ValueError: substring not found
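Because `index()` raises an exception, code that uses it usually wraps the call in `try`/`except`. A minimal sketch; `safe_index` is a hypothetical helper name, and returning `None` for a missing substring is just one possible choice:

```python
def safe_index(text, sub):
    """Return the index of sub in text, or None if it is not found."""
    try:
        return text.index(sub)
    except ValueError:
        return None

snippet = "right shoulder"  # hypothetical example text
print(safe_index(snippet, "shoulder"))
print(safe_index(snippet, "carotid"))
```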

In [33]:
sequence = """ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC
CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGC
CTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGG
AAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCC
CTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAG
TTTAATTACAGACCTGAA"""


In [34]:
# Demonstration of simple input
# upper() converts the typed text to upper case so it can match the upper-case sequence
subsequence = input("Enter a subsequence to check for inclusion in sequence: ").upper()
print(subsequence in sequence)

Enter a subsequence to check for inclusion in sequence: tgga
True
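One caveat: a triple-quoted sequence typed across several lines contains newline characters, so a subsequence that spans a line break will not be found. A sketch of the issue and one possible fix, on a short made-up sequence:

```python
# Hypothetical two-line sequence; the line break is stored as '\n'
wrapped = """ACAAGATG
CCATTGTC"""

motif = "ATGCC"  # spans the line break in wrapped

found_wrapped = motif in wrapped
print(found_wrapped)

# Remove the newlines before searching
joined = wrapped.replace("\n", "")
found_joined = motif in joined
print(found_joined)

# count() reports how many times a substring occurs
c_count = joined.count("C")
print(c_count)
```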


## String Formatting

Python has a C-style way of formatting (interpolating) strings. This is what I am most used to and is what you will usually see in my code. 

* `%d` is for substituting in an integer
    * We can specify the minimum number of spaces to use and whether we want leading zeros
    

In [35]:
print("%d"%5) #just 5
print("%5d"%5) # minimum width of 5: right-aligned and padded with spaces
print("%05d"%5) # minimum width of 5: padded with leading zeros

5
    5
00005


* `%f` is for substituting in a float
    * You can control the minimum field width and the number of digits after the decimal point
    * Also `%e` and `%E`
    

In [37]:
import math
print("%f"%math.pi)
print("%2.4f"%math.pi)
print("%e"%math.exp(math.pi))

3.141593
3.1416
2.314069e+01


* `%s` is for substituting in a string (everything can be represented as a string)

Here are some basic examples

In [38]:
import math
print("The value of pi is %5.4f"%math.pi)
print("My name is %s, %s, although you may know me as %03d."%("Bond","James Bond",7))

The value of pi is 3.1416
My name is Bond, James Bond, although you may know me as 007.
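The width specifiers are handy for lining up columns. A small sketch with made-up values; a `-` flag left-justifies within the field, and `%%` produces a literal percent sign:

```python
# Hypothetical rows: (label, value, count)
rows = [("Na", 138.5, 3), ("K", 4.1, 12)]

lines = []
for name, value, count in rows:
    # %-5s: left-justified string in 5 columns
    # %8.2f: float in 8 columns with 2 decimal places
    # %03d: integer padded with zeros to 3 digits
    lines.append("%-5s|%8.2f|%03d" % (name, value, count))

for line in lines:
    print(line)

percent = "%d%% complete" % 50
print(percent)
```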


Since Python 2.6, strings have also had a *format()* method. Here are some examples from the [Python documentation](https://docs.python.org/2/library/string.html#string-formatting)

In [40]:
#the ability to do complex variable substitutions and value formatting
#Format strings contain “replacement fields” surrounded by curly braces {}. 
#Anything that is not contained in braces is considered literal text, which is copied unchanged to the output. 
#If you need to include a brace character in the literal text, it can be escaped by doubling: {{ and }}.
#In most of the cases the syntax is similar to the old %-formatting, with the addition of the {} and with : used instead of %. For example, '%03.2f' can be translated to '{:03.2f}'.

#accessing arguments by position
print('{0}, {1}, {2}'.format('a', 'b', 'c'))
print('{2}, {1}, {0}'.format('a', 'b', 'c'))
#accessing by name
#two ways to get the same result
print('Coordinates: {latitude}, {longitude}'.format(latitude='37.24N', longitude='-115.81W'))
coord = {'latitude': '37.24N', 'longitude': '-115.81W'} # coord is a dictionary
print('Coordinates: {latitude}, {longitude}'.format(**coord)) # ** unpacks the dictionary into keyword arguments (latitude=..., longitude=...)
#accessing by attributes
#{0} is first item, {0.real} = real part of first item, {0.imag} = imaginary part of first item
c = 3-5j
('The complex number {0} is formed from the real part {0.real} and the imaginary part {0.imag}.').format(c)  

a, b, c
c, b, a
Coordinates: 37.24N, -115.81W
Coordinates: 37.24N, -115.81W


'The complex number (3-5j) is formed from the real part 3.0 and the imaginary part -5.0.'
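For comparison, here is a sketch of how the earlier `%` examples might look with `format()`. The format specification goes after a colon inside the braces; the variable names are chosen only for this illustration.

```python
import math

plain = "{:d}".format(5)         # like "%d"
padded = "{:5d}".format(5)       # like "%5d": width 5, padded with spaces
zeros = "{:05d}".format(5)       # like "%05d": width 5, padded with zeros
pi4 = "{:.4f}".format(math.pi)   # like "%.4f": four decimal places

# < and > choose left or right alignment within the field width
aligned = "{:<8}|{:>8}".format("left", "right")

# Doubled braces produce literal braces in the output
braces = "{{{}}}".format("x")

for item in (plain, padded, zeros, pi4, aligned, braces):
    print(item)
```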