# Strings in Python

## Basic Syntax

Declaring strings in Python:

In [8]:
s      = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit'
also_s = "Lorem ipsum dolor sit amet, consectetur adipiscing elit"

print('     s: ', s)
print('also_s: ', also_s)

     s:  Lorem ipsum dolor sit amet, consectetur adipiscing elit
also_s:  Lorem ipsum dolor sit amet, consectetur adipiscing elit


Strings support slicing:

In [10]:
# str[START:STOP:STEP]

s[10:20:3]

'mori'
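
The `START:STOP:STEP` notation above allows any of the three parts to be omitted or negative. A minimal sketch of a few common combinations, reusing the same sample string (the variable names are illustrative only):

```python
s = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit'

first_word = s[:5]         # START omitted: slice from the beginning
last_word = s[-4:]         # negative START counts from the end
every_other = s[:10:2]     # STEP of 2 takes every second character
reversed_word = s[4::-1]   # negative STEP walks backwards from index 4

print(first_word, last_word, every_other, reversed_word, sep='\n')
```

Slicing never raises an error for out-of-range bounds; it simply returns a shorter (possibly empty) string.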

They support two arithmetic-style operators: `+` concatenates strings and `*` repeats them:

In [11]:
s_1 = 'foot'
s_2 = 'ball'

s_3 = s_1 + s_2 + '!' * 2
print(s_3)

football!!


Strings are immutable objects.


In [12]:
s[-1] = 'o'

TypeError: 'str' object does not support item assignment
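
Because item assignment is not allowed, a "modified" string has to be built as a new object. One possible way to do this, shown here as a sketch on a shortened sample string, is slicing plus concatenation:

```python
s = 'Lorem ipsum'

# Slicing and concatenation create a new string; the original is untouched.
t = s[:-1] + 'o'

print(s)  # Lorem ipsum
print(t)  # Lorem ipsuo
```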

## Escaping Characters


To include a quote character inside a string, rather than have it end the string (and cause a syntax error), put the **escape character** `\` directly before the quote.

In [1]:
string_1 = "Dragon's mother said \"No\""

string_2 = 'Dragon\'s mother said "No"'

print(string_1, string_2, sep='\n')  # sep='\n' prints each string on its own line

Dragon's mother said "No"
Dragon's mother said "No"


If you need to output the backslash itself, it also needs to be escaped, like any other special character:

In [3]:
print("\\")

\


## String Methods


For the convenience of working with strings, many useful methods are built-in. Let's consider some of them:

### Splitting and Joining


1. **`str.partition(sep[arator]) -> tuple`**

Splits a string at the first occurrence of the separator into three components (beginning, separator, end) and returns them as a tuple.

In [13]:
s.partition('em')

('Lor', 'em', ' ipsum dolor sit amet, consectetur adipiscing elit')
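
Two edge cases of `partition` are worth seeing side by side: what happens when the separator is absent, and how the companion method `rpartition` splits at the last occurrence instead of the first. A small sketch on a shortened sample string:

```python
s = 'Lorem ipsum dolor sit amet'

found = s.partition('or')        # splits at the first 'or'
missing = s.partition('xyz')     # separator absent: whole string, then two empty strings
from_right = s.rpartition('or')  # splits at the last 'or'

print(found, missing, from_right, sep='\n')
```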

2. **`str.split(sep[, maxsplit]) -> list`**

Splits a string into parts using a separator and returns these parts as a list.


In [14]:
a = s.split()         # or s.split(' '), which gives the same result for this string
b = s.split('e')      # sep='e'
c = s.split('e', 2)   # maxsplit=2

print(a, ' ', b, c, sep='\n')

['Lorem', 'ipsum', 'dolor', 'sit', 'amet,', 'consectetur', 'adipiscing', 'elit']
 
['Lor', 'm ipsum dolor sit am', 't, cons', 'ct', 'tur adipiscing ', 'lit']
['Lor', 'm ipsum dolor sit am', 't, consectetur adipiscing elit']


In [16]:
words = s.split()
words[0]

'Lorem'

In [17]:
list(words[0])

['L', 'o', 'r', 'e', 'm']

3. **`str.join(words)  -> str`**

Joins the strings in `words`, using the string the method is called on as the separator.


In [18]:
a = ''.join(words)
b = ' '.join(words)
c = ' ^_^ '.join(words)

print(a, b, c, sep='\n')

Loremipsumdolorsitamet,consecteturadipiscingelit
Lorem ipsum dolor sit amet, consectetur adipiscing elit
Lorem ^_^ ipsum ^_^ dolor ^_^ sit ^_^ amet, ^_^ consectetur ^_^ adipiscing ^_^ elit
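
`join` only accepts strings, so other values have to be converted first. The sketch below shows that conversion, plus a common `split`/`join` round trip that collapses repeated whitespace (the sample values are made up for illustration):

```python
numbers = [1, 2, 3]

# ', '.join(numbers) would raise TypeError, so convert each item to str first.
joined = ', '.join(str(n) for n in numbers)

# split() without arguments drops runs of whitespace; join puts single spaces back.
normalized = ' '.join('  too   many  spaces '.split())

print(joined, normalized, sep='\n')
```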


### Character case


In [5]:
s

'Lorem ipsum dolor sit amet, consectetur adipiscing elit'

In [None]:
s.upper()

In [None]:
s.lower()

In [None]:
s.lower().capitalize()

In [None]:
s.title()

In [None]:
s.swapcase()

### Searching and Replacing


To check if a substring is in a string, you can use the `in` operator:

In [None]:
'lorem' in s

In [None]:
'lorem' in s.lower()

To find where a substring occurs in a string, use the `find` and `index` methods. They behave the same, except that `find` returns -1 when the substring is missing, while `index` raises `ValueError`:

In [None]:
a = s.find('t')           # returns the index of the first occurrence.
b = s.rfind('t')          # returns the index of the last occurrence.
c = s.find('nonexistent') # returns -1 if the substring is not found.

print(a, b, c)

In [None]:
a = s.index('ipsum')
b = s.rindex('ipsum')

print(a, b)

In [None]:
c = s.index('nonexistent')
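
Since `index` raises `ValueError` for a missing substring, calls like the one above are usually wrapped in `try`/`except`. A minimal sketch; the helper name `locate` is hypothetical and not part of the standard library:

```python
s = 'Lorem ipsum dolor sit amet'

def locate(text, sub):
    """Return the index of sub in text, or None if it is absent."""
    try:
        return text.index(sub)
    except ValueError:
        return None

print(locate(s, 'ipsum'))        # 6
print(locate(s, 'nonexistent'))  # None
```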

In [None]:
txt = "I like bananas, bananas"

a = txt.replace("bananas", "apples")
b = txt.strip("Ilans ")                # removes any of the listed characters from both ends of the string

print(a, b, sep='\n')
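
The argument of `strip` is treated as a set of characters, not as a substring, which explains the result above. The sketch below repeats that call and adds a few related variants: `replace` with a count limit, and the one-sided `lstrip` and `rstrip`:

```python
txt = "I like bananas, bananas"

first_only = txt.replace("bananas", "apples", 1)  # third argument limits the number of replacements
stripped = txt.strip("Ilans ")                    # strips I, l, a, n, s and space from both ends
left = 'xxhixx'.lstrip('x')                       # strips only from the beginning
right = 'xxhixx'.rstrip('x')                      # strips only from the end

print(first_only, stripped, left, right, sep='\n')
```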

### String Analysis

We can check what characters a string consists of:

In [None]:
strings = ['abc', '2', '   ']

print('\t\t'.join('string isalpha isdigit isspace'.split()))

for s in strings:
    print("'" + s + "'", s.isalpha(), s.isdigit(), s.isspace(), sep='\t\t')

Other useful methods are `startswith`, `endswith`, and `strip`:

In [None]:
'Hello, world!'.startswith('Hel')

In [None]:
'Hello, world!'.endswith('world')

In [None]:
'    Hello world    '.strip()

## String formatting

### Classic formatting

#### The `%` operator

As with `printf` in C, Python strings can be formatted with the `%` operator and conversion specifiers:

- `%s` - string (or an object with a string representation, such as numbers)

In [None]:
name1, name2 = 'Alice', [1, 2]
'Hello, %s and %s' % (name1, name2)

- `%d` - integers (a float argument is truncated to its integer part)

In [None]:
result = 99.99
'Your score: %d' % result

- `%f` - floating-point numbers

In [None]:
result = 99.99
'Your score: %f' % result

- `%.<number of digits>f` - floating-point numbers with a fixed number of digits after the decimal point

In [None]:
result = 99.99
'Your score: %.1f' % result    # rounds to 1 digit after the decimal point

- `%x`/`%X` - integers in hexadecimal representation (lowercase/uppercase)

In [None]:
result = 110
'Your score: %x' % result

#### `format` method

In [None]:
name = 'Bob'
'Hello, {}'.format(name)

You can pass multiple arguments to the `format` method:


In [None]:
'Hello, {}! Hello, {}!'.format('Bob','Alice')

To control which argument goes into which placeholder, refer to arguments by index or by keyword:

In [None]:
'Hello, {1}! Hello, {0}!'.format('Bob','Alice')

Keyword arguments work as well, and the values do not have to be strings:


In [None]:
'Coordinates: {latitude}, {longitude}'.format(latitude=37.24, longitude=-115.81)

Format specifiers similar to those of the `%` operator are written after a colon:


In [None]:
result = 99.9
'Your score: {:.2f}'.format(result)
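
To make the correspondence with the `%` operator concrete, here is a small sketch that formats the same value both ways and adds a few specifiers for width, alignment, and hexadecimal output. The sample values are arbitrary:

```python
result = 99.9

old_style = 'Your score: %.2f' % result
new_style = 'Your score: {:.2f}'.format(result)

# > right-aligns, < left-aligns, ^ centers; the number is the field width.
padded = '|{:>8.1f}|{:<6}|{:^6}|'.format(result, 'ab', 'cd')
hex_value = '{:x} {:X}'.format(255, 255)

print(old_style, new_style, padded, hex_value, sep='\n')
```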

### f-strings

**f-strings** are a newer and more convenient way of formatting strings, added in Python 3.6:

In [None]:
year = 2024
season = 8
f'In the {year}-th year, the {season}-th season of the Python 3 course will take place.'  # Note the f symbol before the string

f-strings support number formatting:

In [None]:
year = 2024
season = 8

# .2f - a real number with two decimal places
f'In the {year}-th year, the {season:.2f}-th season of the Python 3 course will take place.'

Inside f-strings, you can perform various operations:

In [None]:
year = 2024

f'In the {year}-th year, the {year-2017+1}-th season of the Python 3 course will take place.'

You can access list elements by index:


In [None]:
years = [2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024]
season = 8

f'In the {years[season]}-th year, the {season}-th season of the Python 3 course will take place.'

And even use functions and methods:

In [None]:
year = 2024
season = 8
name = 'Python 3'

f'In the {year}-th year, the {season}-th season of the {name.upper()} course will take place.'

#### How to output curly braces?

In f-strings, curly braces mark variable substitutions and expressions. To output a literal curly brace, double it:

In [None]:
f"{{70 + 4}}"

With triple curly braces, the doubled pair is printed as literal braces and the inner pair evaluates the expression:



In [None]:
f"{{{70 + 4}}}"

#### `f"{var=}"`

Another convenient f-string feature, added in Python 3.8, is printing an expression together with its value:


In [None]:
debug_var = 42
f"{debug_var=}"

This helps when debugging applications with `print` and f-strings.
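
The `=` specifier is not limited to plain variable names. As a sketch (assuming Python 3.8 or newer, with made-up sample values), it also works with expressions and can be combined with a format specifier:

```python
debug_var = 42
ratio = 2 / 3

a = f"{debug_var=}"       # name and value
b = f"{debug_var + 1=}"   # the expression text is echoed as written
c = f"{ratio=:.3f}"       # a format specifier may follow the '='

print(a, b, c, sep='\n')
```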

## `string` module

Python has a special module to facilitate working with strings:


In [None]:
import string

string.ascii_letters

In [None]:
string.ascii_lowercase

In [None]:
string.digits

In [None]:
string.whitespace

### `Template`

The module also allows creating templates using `Template` and the `substitute` method:


In [None]:
from string import Template

s = Template('$who likes $what')
s.substitute(who='tim', what='kung pao')

If a key is missing from the dictionary, `substitute` raises `KeyError`:

In [None]:
d = dict(who='tim')
Template('$who likes $what').substitute(d)

In [None]:
# The safe_substitute method allows avoiding the key missing error

Template('$who likes $what').safe_substitute(d)
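
The difference between the two methods can be seen in one place. A minimal sketch, reusing the template from above, that catches the `KeyError` from `substitute` and compares it with `safe_substitute`:

```python
from string import Template

template = Template('$who likes $what')
data = {'who': 'tim'}

try:
    strict = template.substitute(data)
except KeyError as exc:
    strict = f'missing key: {exc}'

# safe_substitute leaves unknown placeholders in the text unchanged.
lenient = template.safe_substitute(data)

print(strict, lenient, sep='\n')
```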

# Encodings

**Definitions:**
- **Character** (韩) - abstraction
- **Charset** - a set of "allowed" characters
- **Encoding** - rules for how to write characters on a computer

Translating characters into bytes is called **encoding**; translating bytes back into characters is called **decoding**.

Encodings:
- **ASCII** - encoding for 128 characters. Each character takes 7 bits.
- **KOI-8** - encoding for 256 characters (ASCII extension for the Russian language). Each character takes 8 bits.
- **Unicode** - a standard that includes several encodings:
    - UTF-8 - from 1 to 4 bytes
    - UTF-16 - 2 or 4 bytes
    - UTF-32 - 4 bytes

In Unicode, the unique identifier of a character (its code point, e.g., for r, Я or 韩) is not the same thing as its byte representation, which depends on the encoding.

Python has `str` for a sequence of characters and `bytes` for a sequence of bytes.

You can convert between a character and its integer code point (e.g., U+1D11E for 𝄞) using the `ord` and `chr` functions:

In [None]:
a = ord('𝄞')
b = ord('\U0001D11E')
c = chr(ord('\U0001D11E'))

print(a, b, c, sep='\n')
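
The byte sizes listed for UTF-8, UTF-16, and UTF-32 can be checked directly. In the sketch below the helper name `byte_lengths` is hypothetical, and the `-le` codec variants are used so that no byte order mark is added to the counts:

```python
def byte_lengths(ch):
    """Return the encoded length of one character in three Unicode encodings."""
    codecs = ('utf-8', 'utf-16-le', 'utf-32-le')
    return {codec: len(ch.encode(codec)) for codec in codecs}

for ch in ['r', 'Я', '韩', '𝄞']:
    # The code point is the same in every encoding; only the byte count differs.
    print(f'U+{ord(ch):04X}', byte_lengths(ch))
```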

Let's consider another example:

In [None]:
s = 'café'
len(s)

In [None]:
b = s.encode('utf8')                 # binary representation
print(b, len(b), type(b), sep='\n')

The string has 4 characters, but its UTF-8 byte representation has 5 bytes. Why? The character `é` is outside ASCII, so UTF-8 encodes it with two bytes.

Let's look at how each byte looks individually:

In [None]:
print(b)
for byte in b:
    print(hex(byte), byte, chr(byte))

Python supports many encodings, and the same string can produce different bytes in each of them:

In [None]:
string = 'El Niño'

for codec in ['latin_1', 'utf_8', 'utf_16', 'cp437']:
    encoded = string.encode(codec)
    print(codec, encoded.decode(codec), encoded, sep='\t\t')

Not all encodings work for all strings:

In [None]:
city = 'São Paulo'
city.encode('iso8859_1')

In [None]:
city.encode('cp437')

What to do with such errors? Handle them:

In [None]:
city.encode('cp437', errors='ignore') # bad

In [None]:
city.encode('cp437', errors='replace') # better

In [None]:
city.encode('cp437', errors='xmlcharrefreplace') # still not perfect
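
To compare the error handlers above side by side, the following sketch collects the result of each one in a dictionary (the name `results` is illustrative) and also shows the default strict behaviour:

```python
city = 'São Paulo'

results = {}
for handler in ('ignore', 'replace', 'xmlcharrefreplace'):
    results[handler] = city.encode('cp437', errors=handler)
    print(handler, results[handler], sep='\t')

try:
    city.encode('cp437')   # the default is errors='strict'
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print('strict raised an error:', strict_failed)
```

`ignore` silently drops the character, `replace` substitutes `?`, and `xmlcharrefreplace` keeps the information as a numeric character reference, which only makes sense if the consumer understands XML/HTML references.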

# Reading and writing to a file


Let's create a simple file:

In [None]:
%%file testfile.txt
Hello, world!
This is a test file.

## Opening a file

To open a file, we use the `open` function. Its first two arguments are the file path and the mode (default is `'r'`):

In [None]:
f = open('testfile.txt', 'r')
f

The following modes for opening files exist:
- `r` - read only; cursor at the beginning of the file; raises an error if the file does not exist
- `r+` - read and write; cursor at the beginning of the file; raises an error if the file does not exist; existing content is not erased on open
- `w` - write only; creates the file if it does not exist; erases (truncates) any existing content
- `w+` - read and write; creates the file if it does not exist; erases (truncates) any existing content
- `a` - write only; creates the file if it does not exist; writes are appended to the end of the file
- `a+` - read and write; creates the file if it does not exist; cursor at the end of the file; writes are appended
- `r+b` - read and write in binary format (adding `b` to any mode opens the file in binary mode)
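
The difference between `w` and `a` is easiest to see by writing to the same file twice. A small sketch that works in a temporary directory; the file name `modes.txt` is made up for the example:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

with open(path, 'w') as f:      # 'w' creates the file
    f.write('first\n')
with open(path, 'a') as f:      # 'a' keeps the content and appends
    f.write('second\n')
with open(path) as f:           # default mode is 'r'
    after_append = f.read()

with open(path, 'w') as f:      # 'w' again: previous content is erased
    f.write('third\n')
with open(path) as f:
    after_rewrite = f.read()

print(repr(after_append), repr(after_rewrite))
```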

### Context Manager of the `open` function

After working with a file, it is necessary to close it:

In [None]:
f.close()
f.closed   # check that the file is closed

However, if an error occurs before calling the `close` method during program execution, the file will not be closed correctly!

To ensure that the file is closed even in the event of an error, the `open` function can be used as a context manager.

A **context manager** is a construct that wraps a block of code and runs setup code immediately before the block and cleanup code immediately after it. The cleanup code runs even if an error occurs inside the wrapped block.

```python
with <context_manager> as <variable>:
    some_code()
```
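
To illustrate the "before" and "after" steps, here is a toy context manager written as a class with `__enter__` and `__exit__` methods. It is only a sketch of the protocol; the class name `Tracked` and its `events` list are hypothetical and have nothing to do with how `open` is implemented:

```python
class Tracked:
    """Toy context manager that records when it is entered and exited."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append('enter')   # runs before the body of the with block
        return self                   # this value is bound to the name after `as`

    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append('exit')    # runs even if the body raised an exception
        return False                  # False means: do not suppress the exception


tracker = Tracked()
try:
    with tracker as t:
        t.events.append('body')
        1 / 0                         # the error does not prevent __exit__ from running
except ZeroDivisionError:
    pass

print(tracker.events)  # ['enter', 'body', 'exit']
```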

When opening files, instead of:

In [None]:
f = open('testfile.txt', 'r')
print(f.name)
# do something else
f.close()


it is recommended to use the following:


In [None]:
with open('testfile.txt', 'r') as f:
    print(f.name)
    # do something else

### Example:

In [None]:
# explicit open/close pair

f = open('testfile.txt', 'r')
print(f"File opened {f.name}")
line = f.readline()
n = 1 / 0
f.close()

In [None]:
f.closed

An exception occurred, we finished executing the code cell before the `close` line, and the file did not close. Let's close the file.

In [None]:
f.close()
f.closed

In [None]:
# context manager

with open('testfile.txt', 'r') as f:
    print(f"File opened {f.name}")
    line = f.readline()
    n = 1 / 0

In [None]:
f.closed

## Reading a file

### `for` loop

In [None]:
with open('testfile.txt', 'r') as f:
    for i, line in enumerate(f):
        print(i, line, end='')

### `readline` method

The `readline` method reads one line per call and returns an empty string at the end of the file:

In [None]:
with open('testfile.txt') as f:
    count = 0
    while True:
        s = f.readline()
        if s == '':
            break
        print(f'count: {count}, {s!r}')
        count += 1

In [None]:
with open('testfile.txt') as f:
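    s1 = f.readline().rstrip('\n')  # rstrip('\n') removes the newline only if it is present
    s2 = f.readline().rstrip('\n')
    s3 = f.readline().rstrip('\n')  # the file has only two lines, so this is ''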

print(s1, s2, s3, sep='\n')

### `readlines` method

In [None]:
# returns a list of lines (it is better not to use this for very large files).

with open('testfile.txt') as f:
    lines = f.readlines()
lines

### `read` method

This method allows reading the file in two ways:


1. Reading the entire file

In [None]:
with open('testfile.txt') as f:
    content = f.read()  # If no argument is specified, it reads the entire file.
    print(content)

2. Reading N characters

In [None]:
with open('testfile.txt') as f:
    chars = f.read(10)  # Specify the argument 10 - it reads 10 characters.
    print(chars)

> **Note** - reading the entire file should be done with great caution!
>
>If you do not know the size of the file in advance or your program could theoretically be run for a file of extremely large size (larger than your RAM), do not read the entire file:
>- Do not use the `readlines` method!
>- Do not use the `read` method without an argument!
>
>Otherwise, you will fill your entire RAM with your text, and everything will crash!
>
>It is safer to read the file line by line (using `readline`) or in chunks of characters (by specifying an argument for the `read` method).
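
Reading in chunks, as recommended in the note, might look like the sketch below. The helper name `count_chars`, the chunk size, and the sample file are assumptions made for the example; only one chunk is held in memory at a time:

```python
import os
import tempfile

def count_chars(path, chunk_size=1024):
    """Count the characters in a text file without loading it all at once."""
    total = 0
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)   # at most chunk_size characters
            if not chunk:                # empty string means end of file
                break
            total += len(chunk)
    return total

# Build a small sample file so the example is self-contained.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(sample_path, 'w') as f:
    f.write('Hello, world!\nThis is a test file.\n')

print(count_chars(sample_path, chunk_size=8))
```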

## Writing to a file

To open a file for writing, you can specify `'w'` as the second argument and use the `write` method:

In [None]:
with open('newtext.txt', 'w') as f:
    f.write('aaa\n')
    f.write('bbb\n')
    f.write('bbb\n')

In [None]:
!cat newtext.txt

#### Copying a text file

In [None]:
old_name = 'testfile.txt'
new_name = 'new_testfile.txt'

with open(old_name) as old, open(new_name, 'w') as new:
    for line in old:
        new.write(line)

In [None]:
!cat new_testfile.txt

## Files with different encodings

In [None]:
with open('unicode_file.txt', 'w', encoding='utf-16le') as f:
    f.write('韩国烧酒')

In [None]:
with open('unicode_file.txt', 'r', encoding='utf-16le') as f:
    print(f.read())

In [None]:
# if the wrong encoding is used (here, the platform default), decoding fails or produces garbage.

with open('unicode_file.txt', 'r') as f:
    print(f.read())
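
A decoding failure of this kind can be caught explicitly. The sketch below writes the same text to a temporary file as UTF-16LE, then reads it back once with the matching encoding and once as UTF-8 (chosen here as an assumption about what a typical default would be):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'unicode_file.txt')
text = '韩国烧酒'

with open(path, 'w', encoding='utf-16le') as f:
    f.write(text)

with open(path, encoding='utf-16le') as f:
    right = f.read()                  # matching encoding: text is recovered

try:
    with open(path, encoding='utf-8') as f:
        wrong = f.read()
except UnicodeDecodeError:
    wrong = None                      # these bytes are not valid UTF-8

print(right == text, wrong)
```

Passing `encoding` explicitly on both write and read avoids depending on the platform default.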

# Working with Paths in Python. The `pathlib` Module

When working with files, it is often necessary to open files that are not in the same directory as the executing program. Therefore, it is important to be able to correctly specify the path to the required file.

A poor option would be to treat paths as strings:

```python
data_folder = "source_data/text_files/"
file_to_open = data_folder + "raw_data.txt"
f = open(file_to_open)
print(f.read())
```

Such code hard-codes `/` as the path separator, while Windows natively uses `\`, so the resulting strings may not work with some libraries on some operating systems.

A better option is to use the `os.path` module:

```python
import os.path

data_folder = os.path.join("source_data", "text_files")
file_to_open = os.path.join(data_folder, "raw_data.txt")
f = open(file_to_open)
print(f.read())
```

This code will work correctly on all operating systems, but using it is quite cumbersome.

Currently, the best solution is the [`pathlib`](https://docs.python.org/3/library/pathlib.html) module and its `Path` class:

In [None]:
from pathlib import Path
type(Path)

In [None]:
# Path() with no arguments is the relative path '.', i.e. the current directory

p = Path()
p, print(p)

In [None]:
type(p)

In [None]:
# resolve method converts the path to its canonical form.

p = Path('.././/book')
p = p.resolve()
p

In [None]:
# cwd method returns the current working directory.

Path.cwd()

In [None]:
# iterdir method allows you to see all files in the directory (iterate over it):

p = Path('./sample_data')

for f in p.iterdir():
    print(f)

In [None]:
type(p.iterdir())

If `p` is the path to the directory, then `p/'fname'` is the path to the file (or directory) `fname` within it.
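
With the `/` operator, the earlier `os.path.join` example can be rewritten more compactly. The sketch uses `PurePosixPath` only so that the printed result is the same on every operating system; in real code `Path` would normally be used instead:

```python
from pathlib import PurePosixPath

data_folder = PurePosixPath('source_data') / 'text_files'
file_to_open = data_folder / 'raw_data.txt'

# joinpath is the method equivalent of chaining the / operator.
same = PurePosixPath('source_data').joinpath('text_files', 'raw_data.txt')

print(file_to_open)
print(same == file_to_open)
```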

In [None]:
%%bash
mkdir sample_data/dir1
mkdir sample_data/dir2
mkdir sample_data/dir3/
mkdir sample_data/dir3/subdir3

In [None]:
p

In [None]:
d = Path('dir3')
p2 = p / d
p2

In [None]:
p2.exists()

In [None]:
# methods determine what is located at this path (file, folder, or symbolic link):

p2.is_file(), p2.is_dir(), p2.is_symlink()

In [None]:
# the parts attribute breaks the path into its individual components.

Path('/content/sample_data/mnist_test.csv').parts

In [None]:
# absolute method gets the absolute path.

print(p2.absolute().parts, p2)

In [None]:
# the parent attribute gives the parent directory.

p2.parent, p2.parent.parent

In [None]:
# determine the name and extension of the file from its name

p2 = Path('/content/sample_data/mnist_test.csv')
p2.name, p2.stem, p2.suffix

In [None]:
# rglob method recursively finds all files in the folder and its subfolders matching a specified pattern

for item in p.rglob("*.csv"):
    print(item)
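
The recursive behaviour of `rglob` is clearer when compared with `glob`, which only looks at the given directory unless the pattern says otherwise. A self-contained sketch that builds a small temporary tree (all file and folder names are made up):

```python
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
(root / 'sub').mkdir()
(root / 'a.csv').touch()
(root / 'sub' / 'b.csv').touch()
(root / 'sub' / 'c.txt').touch()

top_level = sorted(item.name for item in root.glob('*.csv'))   # this directory only
recursive = sorted(item.name for item in root.rglob('*.csv'))  # includes subdirectories

print(top_level)
print(recursive)
```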