___
<h1> Machine Learning </h1>
<h2> M. Sc. in Electrical and Computer Engineering </h2>
<h3> Instituto Superior de Engenharia / Universidade do Algarve </h3>

[LESTI](https://ise.ualg.pt/curso/1941) / [ISE](https://ise.ualg.pt) / [UAlg](https://www.ualg.pt)

Pedro J. S. Cardoso (pcardoso@ualg.pt)

___

# Strings 

- Textual data in Python is handled with `str` objects, more commonly known as strings. 
- They are `immutable` sequences of unicode code points. 
- When storing textual data or sending it over the network, you may want to encode it, using an encoding appropriate for the medium you're using. 
- String literals are written in Python using single, double or triple quotes (triple quotes can be either single or double). 

Four ways to define a string:

In [1]:
str1 = 'This is a string. We built it with single quotes.'

In [2]:
str2 = "This is also a string, but built with double-quotes."

In [3]:
str3 = '''This is built using triple quotes,
so it can span multiple lines.'''

In [4]:
str4 = """This too 
is a multiline one, 
built with triple double-quotes."""

What is the difference between the following two cells?

In [5]:
str4

'This too \nis a multiline one, \nbuilt with triple double-quotes.'

In [6]:
print(str4)

This too 
is a multiline one, 
built with triple double-quotes.
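The difference seen above comes from two textual views of the same object: echoing a value at the end of a cell shows its `repr`, while `print` writes its `str` form. A minimal sketch, using an assumed sample string (`text` is not part of the original notebook):

```python
# Assumed sample text; any string containing a newline behaves the same way.
text = "first line\nsecond line"

echoed = repr(text)   # the form shown when a cell simply ends with `text`
printed = str(text)   # the form that print(text) writes to the output

print(echoed)    # quotes and the escape sequence \n stay visible
print(printed)   # the newline is rendered as an actual line break
```

The `repr` form is meant to be unambiguous, so quotes and escape sequences are shown; the `str` form is meant to be readable.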


To get the length of a string, use the `len` function:

In [7]:
len(str4)

63

As these are instances of the `str` class, they have associated methods and properties

In [8]:
dir(str4)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'stri

These methods can be called in the usual object-oriented way:

In [9]:
str1.lower() # returns a new string with all characters in lowercase    

'this is a string. we built it with single quotes.'

In [10]:
str1.upper() # returns a new string with all characters in uppercase

'THIS IS A STRING. WE BUILT IT WITH SINGLE QUOTES.'

In [11]:
str1.title() # returns a new string with the first character of each word in uppercase

'This Is A String. We Built It With Single Quotes.'

In [12]:
str1.split() # returns a list of words in the string

['This',
 'is',
 'a',
 'string.',
 'We',
 'built',
 'it',
 'with',
 'single',
 'quotes.']

Other methods are:
- `strip`: remove leading and trailing whitespace
- `replace`: replace a substring with another
- `find`: find the first occurrence of a substring
- `count`: count the number of occurrences of a substring
- `startswith`: check if a string starts with a given substring
- `endswith`: check if a string ends with a given substring
- `join`: join a list of strings with a given separator
- `isalnum`: check if a string is alphanumeric
- `isalpha`: check if a string is alphabetic
- `isdigit`: check if all characters in a string are digits
- `islower`: check if a string is lowercase
- `isupper`: check if a string is uppercase
- `isspace`: check if a string is whitespace
- `istitle`: check if a string is titlecase
- `isidentifier`: check if a string is a valid identifier
- `isprintable`: check if a string is printable
- `isdecimal`: check if a string is decimal
- etc.

See [here](https://docs.python.org/3/library/stdtypes.html#string-methods) for a complete list.
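A few of the methods listed above, sketched on a hypothetical sample string (`raw` and the other variable names are assumptions made for illustration):

```python
raw = "  spam, eggs, spam  "   # hypothetical sample string

clean = raw.strip()                      # leading/trailing whitespace removed
swapped = clean.replace("eggs", "ham")   # new string with the substring replaced
first_pos = clean.find("eggs")           # index of the first occurrence
missing = clean.find("bacon")            # find returns -1 when nothing is found
n_spam = clean.count("spam")             # number of non-overlapping occurrences
words = clean.split(", ")                # split on an explicit separator
joined = " | ".join(words)               # the separator is the object, the list is the argument

print(clean, swapped, first_pos, missing, n_spam, words, joined, sep="\n")
print(clean.startswith("spam"), clean.endswith("eggs"))
print("2024".isdigit(), "20.24".isdigit())
```

Note that every one of these calls returns a new object; `raw` itself is never modified.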

You can also repeat a string very easily, like this:

In [13]:
20 * '-*'

'-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*'

## Indexing and slicing

- When manipulating sequences, it's very common to have to access them at one precise position (indexing), or to get a subsequence out of them (slicing). 
- When dealing with immutable sequences, both operations are read-only.
- When you get a slice of a sequence, you can specify the start and stop positions, and the step: `my_sequence[start:stop:step]`. 
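The `start:stop:step` notation is not specific to strings. As a small sketch, the same slices can be applied to a list and a tuple (the sample data below is hypothetical):

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']   # hypothetical sample list
numbers = (10, 20, 30, 40, 50)                       # hypothetical sample tuple

middle = letters[2:6]        # start=2, stop=6 (stop is excluded)
every_other = letters[::2]   # omitted start/stop default to the whole sequence
head = numbers[:3]           # slicing a tuple gives a tuple

print(middle)
print(every_other)
print(head)
```

In each case the slice has the same type as the sequence it was taken from, and the original sequence is left untouched.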

![indexing](images/indexing.png)


In [14]:
s = 'Are you suggesting that coconuts migrate?'

Strings are immutable so you can't change them

In [15]:
s[0] = 't'

TypeError: 'str' object does not support item assignment
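Since item assignment is not supported, the usual workaround is to build a new string from pieces of the old one. A minimal sketch (the names `changed` and `replaced` are assumptions made for illustration):

```python
s = 'Are you suggesting that coconuts migrate?'

# Build a new string: the new first character plus everything after position 0.
changed = 't' + s[1:]

# Alternative: replace only the first occurrence of 'A' (third argument = max replacements).
replaced = s.replace('A', 't', 1)

print(changed)
print(replaced)
print(s)   # the original string is unchanged
```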

But you can access them. To get the first character of a string, you can use the index 0

In [None]:
s[0]

To get the 6th character of a string, you can use the index 5

In [None]:
s[5]

To get the first 4 characters of a string, you can use the slice `0:4`, or just `:4`

In [None]:
s[:4]

To get the characters from the 5th onward, you can use the slice `4:`

In [None]:
s[4:]

To get the 5th to the 14th characters of a string, you can use the slice `4:14`. Note that these are the characters at positions 4, 5, 6, 7, 8, 9, 10, 11, 12 and 13; the stop position 14 is excluded.

In [None]:
s[4:14]

The `zip` function can be used to iterate over two sequences at the same time, returning tuples with the elements of each sequence at the same position. In the example, each character of `s` is paired with its index, and the `list` function is used to convert the result to a list, because `zip` returns an iterator.

In [None]:
list(zip(s, range(len(s))))
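For pairing elements with their positions, the built-in `enumerate` is a common alternative to `zip` with `range`. A short sketch on an assumed sample word; note that `enumerate` puts the index first:

```python
word = "coconut"   # hypothetical sample string

pairs_zip = list(zip(word, range(len(word))))   # (character, index) tuples
pairs_enum = list(enumerate(word))              # (index, character) tuples

print(pairs_zip[:3])
print(pairs_enum[:3])

# Swapping the order of each enumerate pair reproduces the zip result.
swapped = [(char, index) for index, char in enumerate(word)]
print(swapped == pairs_zip)
```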

To get the 5th to the 14th characters of a string with a step of 3, you can use the slice `4:14:3`. Note that these are the characters at positions 4, 7, 10 and 13.

In [None]:
s[4:14:3]        # slicing, start, stop and step (every 3 chars)
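One way to check which positions a slice visits is to compare it with `range`, which takes the same start, stop and step arguments. A minimal sketch (the names `positions` and `picked` are assumptions made for illustration):

```python
s = 'Are you suggesting that coconuts migrate?'

positions = list(range(4, 14, 3))             # the indexes visited by the slice 4:14:3
picked = ''.join(s[i] for i in positions)     # collect the characters one by one

print(positions)
print(picked == s[4:14:3])   # both approaches select the same characters
```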

Recall the string:

In [None]:
s

To get the last character of a string, you can use the index -1

In [None]:
s[-1]            # indexing at last position

To get the last 5 characters of a string, you can use the slice `-5:`

In [16]:
s[-5:]

'rate?'

To get up to the last 5 characters of a string, you can use the slice `:-5`

In [17]:
s[:-5]

'Are you suggesting that coconuts mig'

To get the characters from the 6th up to, but not including, the last 5, you can use the slice `5:-5`

In [18]:
s[5:-5]

'ou suggesting that coconuts mig'
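A negative index counts from the end of the string, so `-k` refers to the same position as `len(s) - k`. The sketch below checks this equivalence for the slices used above (the variable `n` is introduced here for illustration):

```python
s = 'Are you suggesting that coconuts migrate?'
n = len(s)

print(s[-1] == s[n - 1])        # last character
print(s[-5:] == s[n - 5:])      # last 5 characters
print(s[:-5] == s[:n - 5])      # everything except the last 5 characters
print(s[5:-5] == s[5:n - 5])    # from position 5 up to the last 5 characters

# The two complementary slices put the original string back together.
print(s[:-5] + s[-5:] == s)
```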

## The `id` of a string (optional)

To get all the characters of a string, you can use the slice `:`

In [19]:
s[:]

'Are you suggesting that coconuts migrate?'

Copies of strings might not be what you expect. Assignment does not copy the string; it only binds a second name to the same object:

In [20]:
r = s

So, the `id` of both variables will be the same

In [21]:
id(s)

4423916816

In [22]:
id(r)

4423916816

Concatenating two slices builds a new string object with the same content, so its `id` is different

In [23]:
s_copy = s[:5] + s[5:]
print(s_copy)
id(s_copy)

Are you suggesting that coconuts migrate?


4571499664

But, since strings are immutable, the full slice `s[:]` does not need to create a copy: it returns the original object, so the `id` is the same (an optimization)

In [24]:
s_copy = s[:]
id(s_copy)

4423916816

Writing the literal again creates a new string object here, so the `id` is different

In [25]:
s_copy = 'Are you suggesting that coconuts migrate?'
id(s_copy)

4423912112

The `copy` module can also be applied to a string but, since strings are immutable, `copy.deepcopy` returns the very same object, as the `id` below shows.

In [26]:
import copy
t = copy.deepcopy(s)

In [27]:
id(t)

4423916816
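The `id` comparisons above can be written more directly with the `is` operator, which tests identity, while `==` tests whether the contents are equal. A short sketch reusing the same string (the name `rebuilt` is an assumption made for illustration; identity results for rebuilt strings are implementation details of the interpreter, so they are printed rather than relied upon):

```python
import copy

s = 'Are you suggesting that coconuts migrate?'
r = s                      # second name for the same object
rebuilt = s[:5] + s[5:]    # same content, built from two pieces
t = copy.deepcopy(s)       # immutable object, so no real copy is needed

print(r is s)              # same object
print(rebuilt == s)        # equal content
print(rebuilt is s)        # identity of a rebuilt string: typically False in CPython
print(t is s)              # typically True in CPython
```

For strings, `==` is almost always the comparison you want; identity only matters when reasoning about memory, as in this section.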

## Encoding and decoding strings (optional)

- Using the `encode`/`decode` methods, we can encode unicode strings into bytes objects and decode bytes objects back into strings. 
- UTF-8 is a variable-length character encoding, capable of encoding all possible unicode code points. 
- Notice also that by adding the prefix `b` in front of a string literal, we're creating a bytes object.

In [28]:
s = "This is üñíção"            # unicode string: code points
s

'This is üñíção'

In [29]:
type(s)

str

In [30]:
encoded_s = s.encode('utf-8')  # utf-8 encoded version of s
type(encoded_s)

bytes

In [31]:
encoded_s

b'This is \xc3\xbc\xc3\xb1\xc3\xad\xc3\xa7\xc3\xa3o'

In [32]:
encoded_s.decode('utf-8')

'This is üñíção'
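The variable-length nature of UTF-8 can be seen by comparing the number of code points in a string with the number of bytes in its encoded form. A minimal sketch (the sample characters and the name `sizes` are chosen here for illustration):

```python
s = "This is üñíção"
encoded_s = s.encode('utf-8')

print(len(s))            # number of unicode code points
print(len(encoded_s))    # number of bytes after encoding

# Bytes needed per character: ASCII takes 1 byte, other characters take 2 to 4.
sizes = {ch: len(ch.encode('utf-8')) for ch in ["a", "ç", "€", "😀"]}
print(sizes)
```

Here the five accented characters take two bytes each, which is why the encoded version is five bytes longer than the string.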

In [33]:
b"This is \xc3\xbc\xc3\xb1\xc3\xad\xc3\xa7\xc3\xa3o"

b'This is \xc3\xbc\xc3\xb1\xc3\xad\xc3\xa7\xc3\xa3o'

In [34]:
b"This is \xc3\xbc\xc3\xb1\xc3\xad\xc3\xa7\xc3\xa3o".decode('utf-8')

'This is üñíção'

In [35]:
"This is \xc3\xbc\xc3\xb1\xc3\xad\xc3\xa7\xc3\xa3o".decode('utf-8')

AttributeError: 'str' object has no attribute 'decode'
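The error above occurs because, without the `b` prefix, the literal is a `str` whose escapes such as `\xc3` are individual code points rather than bytes. One possible way to recover the intended text, sketched here under the assumption that every code point is below 256, is to map the code points back to bytes with the `latin-1` encoding and then decode as UTF-8 (the names `garbled` and `recovered` are assumptions made for illustration):

```python
# A str, not bytes: each \xNN escape is one code point in the range 0-255.
garbled = "This is \xc3\xbc\xc3\xb1\xc3\xad\xc3\xa7\xc3\xa3o"

print(type(garbled))
print(len(garbled))   # counts code points, including the ten \xNN ones

# latin-1 maps code points 0-255 one-to-one onto byte values,
# so this turns the str back into the original UTF-8 bytes.
as_bytes = garbled.encode('latin-1')
recovered = as_bytes.decode('utf-8')

print(recovered)
```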

# Exercises

[Go here...](exercises/02-exercises.ipynb)