# Strings - Humans use text. Computers speak bytes. 🥰

Despite the wealth of multimedia information, text processing remains one of the dominant functions of computers.

## Abundance of Digital Text 😅

Computers are used to edit, store, and display documents, and to transport documents over the Internet. 

Furthermore, digital systems are used to archive a wide range of textual information, and new data is being generated at a rapidly increasing pace. 

A large corpus can readily surpass a petabyte of data (1 PB = 1000 TB, or a million gigabytes). 

Common examples of digital collections that include textual information are:

• Snapshots of the World Wide Web, as Internet document formats HTML and XML are primarily text formats, with added tags for multimedia content
• All documents stored locally on a user’s computer
• Email archives
• Customer reviews
• Compilations of status updates on social networking sites such as Facebook
• Feeds from sites such as X and Tumblr

In this chapter we explore some of the fundamental algorithms that can be used to efficiently analyze and process large textual data sets. 

---

Python’s `str` class is specifically designed to efficiently represent an immutable sequence of characters, based upon the **Unicode international character set.** 

Strings have a more compact internal representation than the referential lists and tuples.
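
As a quick check of the immutability claim, the sketch below (variable names are illustrative, not from the original notebook) shows that item assignment on a `str` raises `TypeError`, while "changing" a string really means building a new one.

```python
greeting = "hello"

# Item assignment is not supported on str objects
try:
    greeting[0] = "j"
    assignment_failed = False
except TypeError:
    assignment_failed = True

# Build a new string instead; the original object is left untouched
jello = "j" + greeting[1:]

print(assignment_failed)  # True
print(greeting, jello)    # hello jello
```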

String literals can be enclosed in single quotes, as in `'hello'` , or double quotes, as in `"hello"`. 

In [2]:
my_str = "what's up!"
print(my_str)

what's up!


## The `str` Class - `be = "the_spark"`

In [3]:
# Alternatively, the quote delimiter can be designated 
# using a backslash as a so-called escape character, as in 
my_str = 'Don\'t stop.'
print(my_str)

Don't stop.


Other commonly escaped characters are `"\n"` for newline and `"\t"` for tab. 

**Unicode characters** can be included, such as ``'20\u20AC'`` for the string ``20€``.

In [4]:
print("20 \u20AC ONLY!") # 20 € ONLY!
print("the escape room \'is real\'")
print("\'the numbers mason, what do they mean?\'")

20 € ONLY!
the escape room 'is real'
'the numbers mason, what do they mean?'


In [5]:
my_str = "sezai"
my_str_slice = my_str[2:0]  # start >= stop, so this is the empty string "", not None
bool(my_str_slice) # an empty string is falsy, so this is False

False

Python also has `f-strings` (formatted string literals) and `r-strings` (raw string literals).

In [7]:
a , b = 2, 3
print(f"{a} is not equal to {b} so the answer is: {a == b}")

2 is not equal to 3 so the answer is: False


A Python raw string treats the backslash character `\` as a literal character. Raw strings are useful when a string needs to contain a backslash, such as in a regular expression or a Windows directory path, and you don’t want it to be treated as an escape character.

In [1]:
# Here is an example of an r-string, do not worry about the function

def reset_accumulated_memory_stats(device: tuple[float, int] = None) -> None:
    r"""Reset the "accumulated" (historical) stats tracked by the CUDA memory allocator.

    See :func:`~torch.cuda.memory_stats` for details. Accumulated stats correspond to
    the `"allocated"` and `"freed"` keys in each individual stat dict, as well as
    `"num_alloc_retries"` and `"num_ooms"`.

    Args:
        device (torch.device or int, optional): selected device. Returns
            statistic for the current device, given by :func:`~torch.cuda.current_device`,
            if :attr:`device` is ``None`` (default).

    .. note::
        See :ref:`cuda-memory-management` for more details about GPU memory
        management.
    """
    device = _get_device_index(device, optional=True)
    return torch._C._cuda_resetAccumulatedMemoryStats(device)
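
A smaller illustration of what the `r` prefix changes, as a minimal sketch: the path and the sample text below are made up for the example. A regular literal turns `\n` into one newline character, while the raw literal keeps the backslash and the `n` as two characters, which is also what the `re` module expects for patterns such as `\d+`.

```python
import re

# Regular literal: \n is interpreted as a single newline character
regular = "C:\new_folder"
# Raw literal: the backslash and the n stay as two separate characters
raw = r"C:\new_folder"

print(len(regular), len(raw))   # 12 13
print("\n" in regular)          # True
print("\\" in raw)              # True

# Raw strings keep regular expressions readable
numbers = re.findall(r"\d+", "room 12, floor 3")
print(numbers)                  # ['12', '3']
```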

### 4 different ways to concatenate Strings:

In [2]:
def repeated_contatenation(n):
	# do not do this
	result = ""
	for i in range(n):
		result += str(i)
	return result

def list_append_and_join(n):
	temp = []
	for i in range(n):
		temp.append(str(i))
	result = "".join(temp)
	return result

def list_comp(n):
	# comprehensions are often a good idea
	result = "".join([str(i) for i in range(n)])
	return result


def gen_comp(n):
	# no need for list anyway!
	result = "".join(str(i) for i in range(n))
	return result

Here are the results:

| Method | Time |
| ---- | ---- |
| `repeated_contatenation` | 0.1979 sec |
| `list_append_and_join` | 0.1714 sec |
| `list_comp` | 0.1514 sec |
| `gen_comp` | 0.1434 sec |
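
The table does not say which `n` or machine produced these numbers, so treat them as one sample. A minimal way to take your own measurement with `timeit` might look like the sketch below. The four functions are restated (with corrected spelling) so the block runs on its own, and `N` and `REPEATS` are assumed values, not the ones behind the table.

```python
import timeit

def repeated_concatenation(n):
    result = ""
    for i in range(n):
        result += str(i)
    return result

def list_append_and_join(n):
    temp = []
    for i in range(n):
        temp.append(str(i))
    return "".join(temp)

def list_comp(n):
    return "".join([str(i) for i in range(n)])

def gen_comp(n):
    return "".join(str(i) for i in range(n))

N = 10_000    # assumed problem size
REPEATS = 5   # assumed number of timed calls per function

timings = {}
for func in (repeated_concatenation, list_append_and_join, list_comp, gen_comp):
    # timeit returns the total time for REPEATS calls, in seconds
    timings[func.__name__] = timeit.timeit(lambda: func(N), number=REPEATS)

for name, seconds in timings.items():
    print(f"{name:<24} {seconds:.4f} sec")
```

Absolute values and even the ranking can change between runs, Python versions, and input sizes, so it is worth repeating the measurement before drawing conclusions.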



In [8]:
# a lot of string methods to use

# they can change the strings
print("K".lower())

# they can tell you things about them
print("asd".islower(), "asdF".islower())
# True False

print("asd12".isalpha(), "asd12".isalnum())
# False True

k
True False
False True


### Composing Strings

Doing repeated concatenation is terribly inefficient.

##### WARNING: do not do this

```python
letters = ""
for c in document:
	if c.isalpha():
		letters += c
```

Constructing the new string would require time proportional to its length. 

If the final result has $n$ characters, the series of concatenations would take time proportional to the familiar sum $1 + 2 + 3 + \cdots + n$, and therefore $O(n^{2})$ time.
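
The closed form of that sum makes the quadratic bound explicit, assuming (as the text does) that the $k$-th concatenation copies about $k$ characters:

$$
1 + 2 + 3 + \cdots + n \;=\; \sum_{k=1}^{n} k \;=\; \frac{n(n+1)}{2} \;=\; \frac{n^{2}}{2} + \frac{n}{2} \;=\; O(n^{2})
$$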

##### Instead: collect the pieces in a list, then `join`

In [15]:
# Appending to the end of a list takes amortized O(1) time.

document = "This is an example document with numbers like 2 on it"

temp = []  # start with empty list
for c in document:
	if c.isalnum() or c == " ":
		temp.append(c)  # append alphanumeric character or space

letters = "".join(temp)  # compose overall result

print(letters)

print()

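# Or even better, with a list comprehension:

letters = "".join([c for c in document if c.isalnum() or c == " "])

# not even a list is needed
# a generator expression does the job
# (note the call: c.isalnum(), not the bare method c.isalnum)
letters = "".join(c for c in document if c.isalnum() or c == " ")
print(letters)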


This is an example document with numbers like 2 on it

This is an example document with numbers like 2 on it


### Efficiently Concatenating Many Strings With .join() in Python

The `.join()` method is cleaner, more Pythonic, and more readable than concatenating many strings together in a loop using the augmented concatenation operator (`+=`), as you saw before. 

An explicit loop is way more complex and harder to understand than the equivalent call to `.join()`. 

The `.join()` method is also **faster and more efficient** regarding memory usage.

Unlike the concatenation operators, Python’s `.join()` doesn’t create new intermediate strings in each iteration. 

Instead, it creates a single new string object by joining the elements from the input iterable with the selected separator string. 

This behavior is more efficient than using the regular concatenation operators in a loop.

It’s important to note that `.join()` doesn’t allow you to concatenate non-string objects directly:

```python
"; ".join([1, 2, 3, 4, 5])

# Traceback (most recent call last):
#     ...
# TypeError: sequence item 0: expected str instance, int found
```

When you try to join non-string objects using `.join()`, you get a `TypeError`, which is consistent with the behavior of the concatenation operators. 

To work around this behavior, you can take advantage of `str()` and a generator expression:

```python
numbers = [1, 2, 3, 4, 5]

"; ".join(str(number) for number in numbers)
# '1; 2; 3; 4; 5'
```

The generator expression in the call to `.join()` converts every number into a string object before running the concatenation and producing the final string. 

With this technique, you can concatenate objects of different types into a string.
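
As a small sketch of the mixed-type case, the `record` list below is made up for the example. Both the generator expression and `map(str, ...)` convert each item before joining, and they produce the same string.

```python
# Hypothetical record with several different types
record = ["sensor-7", 42, 3.5, None, True]

line_gen = ", ".join(str(item) for item in record)
line_map = ", ".join(map(str, record))

print(line_gen)              # sensor-7, 42, 3.5, None, True
print(line_gen == line_map)  # True
```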

### Prefix and Suffix

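Every string has prefixes and suffixes: a prefix is a substring that starts at index 0, and a suffix is a substring that ends at the last character. 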

Sometimes these are helpful to set some strings apart from others.

In [16]:
S = "CGTAAACTGCTTTAATCAAACGC"

my_prefix = "CGTAA"

my_suffix = "CGC"

prefix_to_all = "" # the empty string is a prefix of every string.
suffix_to_all = "" # the empty string is a suffix of every string.
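
The definitions above can be checked directly with `startswith` and `endswith`. The sketch below reuses the same DNA-like string; the `prefixes` and `suffixes` lists are only there to show that a string of length n has n + 1 of each, counting the empty string and the string itself.

```python
S = "CGTAAACTGCTTTAATCAAACGC"

print(S.startswith("CGTAA"))             # True
print(S.endswith("CGC"))                 # True
print(S.startswith(""), S.endswith(""))  # True True

# Every slice S[:i] is a prefix and every slice S[i:] is a suffix
prefixes = [S[:i] for i in range(len(S) + 1)]
suffixes = [S[i:] for i in range(len(S) + 1)]

print(all(S.startswith(p) for p in prefixes))  # True
print(all(S.endswith(x) for x in suffixes))    # True
```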

### Searching for Substrings  

The operator syntax `pattern in s` can be used to determine whether the given `pattern` occurs as a substring of string `s`. 

Table A.1 describes several related methods that determine the number of such occurrences, and the index at which the leftmost or rightmost such occurrence begins. 

Each of the functions in this table accepts two optional parameters, start and end, which are indices that effectively restrict the search to the implicit slice `s[start:end]`. For example, the call `s.find(pattern, 5)` restricts the search to `s[5:]`.

==Table A.1== 

| Calling Syntax |  Description  |
|-|-|
|`s.count(pattern)` | Return the number of non-overlapping occurrences of pattern  |
|`s.find(pattern)` |Return the index starting the leftmost occurrence of pattern; else `-1`  |
|`s.index(pattern)` |Similar to `find`, but raise `ValueError` if not found.  |
|`s.rfind(pattern)` |Return the index starting the rightmost occurrence of pattern; else `-1`  |
|`s.rindex(pattern)` |Similar to `rfind`, but raise `ValueError` if not found. |


Prefer `find` when a missing pattern is an expected case: it returns `-1` instead of raising a `ValueError`.

In [17]:
my_str = "we_love_learning_it makes us grow!"

print(f"where is underscore? {my_str.find('_')}") # 2

print(f"where is l? {my_str.index('l')}") # 3

print(f"where is next l? {my_str.index('l', 4)}") # 8

print(f"how many e's are in my_str?: {my_str.count('e')}") # 4

where is underscore? 2
where is l? 3
where is next l? 8
how many e's are in my_str?: 4
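
A few details of Table A.1 are easy to miss, so here is a short sketch on a made-up string: `count` skips overlapping matches, `rfind` reports the rightmost start index, and the optional `start` and `end` arguments restrict the search while still returning indices relative to the whole string.

```python
s = "banana"

print("nan" in s)          # True
print(s.count("ana"))      # 1  (the matches at 1 and 3 overlap, so only one is counted)
print(s.find("ana"))       # 1
print(s.rfind("ana"))      # 3
print(s.find("a", 2, 4))   # 3  (searches s[2:4], index is still relative to s)
print(s.find("x"))         # -1

try:
    s.index("x")
except ValueError as err:
    print(err)             # substring not found
```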


### Constructing Related Strings  

Strings in Python are **immutable**, so none of their methods modify an existing string instance.

However, many methods return a newly constructed string that is closely related to an existing one.

Table A.2 provides a summary of such methods, including those that replace a given pattern with another, that vary the case of alphabetic characters, that produce a fixed-width string with desired justification, and that produce a copy of a string with extraneous characters stripped from either end.

==Table A.2==

|Calling Syntax |  Description  |
|-|-|
|`s.replace(old, new)` |Return a copy of s with all occurrences of old replaced by new  |
|`s.capitalize()` |Return a copy of s with its first character in uppercase and the rest in lowercase  |
|`s.upper()` |Return a copy of s with all alphabetic characters in uppercase  |
|`s.lower()` |Return a copy of s with all alphabetic characters in lowercase  |
|`s.center(width)` |Return a copy of s, padded to width, centered among spaces  |
|`s.ljust(width)` |Return a copy of s, padded to width with trailing spaces  |
|`s.rjust(width)` |Return a copy of s, padded to width with leading spaces  |
|`s.zfill(width)` |Return a copy of s, padded to width with leading zeros  |
|`s.strip()` |Return a copy of s, with leading and trailing whitespace removed  |
|`s.lstrip()`| Return a copy of s, with leading whitespace removed  |
|`s.rstrip()` |Return a copy of s, with trailing whitespace removed|

Several of these methods accept optional parameters not detailed in the table.

For example, the `replace` method replaces all non-overlapping occurrences of the old pattern by default, but an optional parameter can limit the number of replacements that are performed. 

In [25]:
related_string_test = "can you make this relateable?"

print(related_string_test.replace("can", "Can"))
# Can you make this relateable?

print(related_string_test.capitalize())
# Can you make this relateable?

# scream and shout
print(related_string_test.upper())
# CAN YOU MAKE THIS RELATEABLE?

# whisper
print(related_string_test.lower())
# can you make this relateable?

Can you make this relateable?
Can you make this relateable?
CAN YOU MAKE THIS RELATEABLE?
can you make this relateable?


In [29]:
related_string_test = "  can you make this relateable?  "

# strip from emptiness
print(related_string_test.strip())
# can you make this relateable?

# you can choose which emptiness you do not want
print(related_string_test.lstrip())
# can you make this relateable?  
# whitespace on right remains

# or you can drop right side
print(related_string_test.rstrip())
#   can you make this relateable?

can you make this relateable?
can you make this relateable?  
  can you make this relateable?
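
The padding methods from Table A.2 and the optional arguments mentioned earlier are not exercised by the cells above. A minimal sketch with made-up values: `repr` is used so the padding spaces are visible, and the third argument of `replace` and the character argument of `strip` are the optional parameters.

```python
label = "42"

centered = label.center(6)
left = label.ljust(6)
right = label.rjust(6)
zeros = label.zfill(6)

print(repr(centered))  # '  42  '
print(repr(left))      # '42    '
print(repr(right))     # '    42'
print(zeros)           # 000042

# Optional count: only the first two separators are replaced
limited = "a-b-c-d".replace("-", "+", 2)
print(limited)         # a+b+c-d

# Optional characters: strip any of 'x' or '*' from both ends
cleaned = "xx**data**xx".strip("x*")
print(cleaned)         # data
```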


### Testing Boolean Conditions  

==Table A.3== includes methods that test for a Boolean property of a string, such as **whether it begins or ends with a pattern**, or **whether its characters qualify as being alphabetic, numeric, whitespace**, etc. 

For the standard ASCII character set, alphabetic characters are the uppercase A–Z, and lowercase a–z, numeric digits are 0–9, and whitespace includes the space character, tab character, newline, and carriage return.

Conventions for what are considered alphabetic and numeric character codes are extended to more general Unicode character sets.

|Calling Syntax |Description  |
|-|-|
|`s.startswith(pattern)` |Return `True` if `pattern` is a prefix of string s  |
|`s.endswith(pattern)`| Return `True` if `pattern` is a suffix of string s  |
|`s.isspace()` |Return `True` if all characters of nonempty string are whitespace  |
|`s.isalpha()`| Return `True` if all characters of nonempty string are alphabetic  |
|`s.islower()` |Return `True` if there are one or more alphabetic characters, all of which are lowercased  |
|`s.isupper()`| Return `True` if there are one or more alphabetic characters, all of which are uppercased  |
|`s.isdigit()`| Return `True` if all characters of nonempty string are digits: 0–9, their Unicode equivalents, and digit-like characters such as superscripts (e.g., `²`)  |
|`s.isdecimal()` |Return `True` if all characters of nonempty string represent digits 0–9, including Unicode equivalents  |
|`s.isnumeric()` |Return `True` if all characters of nonempty string are numeric Unicode characters (e.g., 0–9, equivalents, fraction characters)  |
| `s.isalnum()`| Return `True` if all characters of nonempty string are either alphabetic or numeric (as per above definitions) |


In [43]:
goal = "hard work"

print(goal.startswith("hard"))
# True

print(goal.endswith("work"))
# True

print(goal.isalpha())
# False - Because there are whitespaces within

print(goal.islower())
# True

print(goal.isupper())
# False

True
True
False
True
False


In [46]:
number = "two"

print(number.isdigit())
# False - the characters are letters, not digits

print(number.isdecimal())
# False - alphabetic.

print("420".isdecimal())
# True - you can work with these digits.

print("½".isnumeric())
# True - ½ not a decimal but numeric

print(number.isalnum())
# True

False
False
True
True
True


### `isalnum()` is the Umbrella ☔

`str.isalnum()`

Return `True` if all characters in the string are alphanumeric and there is at least one character, `False` otherwise. 

A character c is alphanumeric if one of the following returns True: `c.isalpha()`, `c.isdecimal()`, `c.isdigit()`, or `c.isnumeric()`.
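
To see the umbrella rule in action, the sketch below runs all five tests on a handful of sample characters chosen for this example. The superscript two counts as a digit but not as a decimal, and the fraction one half is numeric only, yet both are alphanumeric.

```python
# Sample characters: letter, digit, superscript two, one half, space, underscore
samples = ["a", "7", "\u00b2", "\u00bd", " ", "_"]

results = {}
for c in samples:
    results[c] = (c.isalpha(), c.isdecimal(), c.isdigit(), c.isnumeric(), c.isalnum())
    print(repr(c), results[c])

# isalnum() agrees with "any of the other four" for every sample
umbrella_holds = all(flags[4] == any(flags[:4]) for flags in results.values())
print(umbrella_holds)  # True
```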


### Splitting and Joining Strings  😎

==Table A.4== describes several important methods of Python’s string class, used to compose a sequence of strings together using a delimiter to separate each pair, or to take an existing string and determine a decomposition of that string based upon existence of a given separating pattern.  

This matters because it is cheaper to collect pieces in a list and join them once than to rebuild a string repeatedly:

|Calling Syntax |Description  |
|-|-|
|`sep.join(strings)` |Return the composition of the given sequence of strings, inserting sep as delimiter between each pair  |
|`s.splitlines()` |Return a `list` of substrings of s, as delimited by newlines  |
|`s.split(sep, count)`| Return a `list` of substrings of s, as delimited by the first count occurrences of sep. If count is not specified, split on all occurrences. If sep is not specified, split on runs of whitespace.  |
|`s.rsplit(sep, count)` |Similar to split, but using the rightmost occurrences of sep  |
|`s.partition(sep)` |Return `(head, sep, tail)` such that `s = head + sep + tail`, using leftmost occurrence of sep, if any; else return `(s, '', '')`  |
|`s.rpartition(sep)` |Return `(head, sep, tail)` such that `s = head + sep + tail`, using rightmost occurrence of sep, if any; else return `('', '', s)`|


The `join` method is used to assemble a string from a series of pieces. 

In [5]:
sentence = ' and '.join([ "red" , "green" , "blue" ]) 
print(sentence) # "red and green and blue"

red and green and blue


In [6]:
list_one = "red and green and blue".split(" and ")
print(list_one)  # [ "red" , "green" , "blue" ]

['red', 'green', 'blue']


In [7]:
list_two = "red and green and blue".split() # Uses whitespace as delimiter
print(list_two) # [ "red" , "and" , "green" , "and" , "blue" ].

['red', 'and', 'green', 'and', 'blue']
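
The remaining rows of Table A.4 can be tried the same way. This is a minimal sketch on made-up strings, showing the optional `count` argument, the difference between `split()` and `split(" ")`, and what `partition` returns when the separator is missing.

```python
path = "usr/local/lib/python3"

first_split = path.split("/", 1)
last_split = path.rsplit("/", 1)
print(first_split)             # ['usr', 'local/lib/python3']
print(last_split)              # ['usr/local/lib', 'python3']

print(path.partition("/"))     # ('usr', '/', 'local/lib/python3')
print(path.rpartition("/"))    # ('usr/local/lib', '/', 'python3')
print(path.partition("#"))     # ('usr/local/lib/python3', '', '')

# No argument: runs of whitespace act as one delimiter
print("a  b \t c".split())     # ['a', 'b', 'c']
# Explicit single space: empty strings appear between adjacent spaces
print("a  b".split(" "))       # ['a', '', 'b']

lines = "one\ntwo\r\nthree".splitlines()
print(lines)                   # ['one', 'two', 'three']
```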


## Pattern-Matching Algorithms

In the classic pattern-matching problem, we are given a text string T of length n and a pattern string P of length m, and want to find whether P is a substring of T.
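
One straightforward way to solve this problem is brute force: try every starting index in T and compare character by character. The sketch below is an illustration under that assumption, not a description of how `str.find` is implemented, and the name `find_brute` is chosen for this example. Its worst case is $O(nm)$ comparisons.

```python
def find_brute(T: str, P: str) -> int:
    """Return the lowest index of T at which P begins, or -1 if P is absent."""
    n, m = len(T), len(P)
    for i in range(n - m + 1):        # every possible starting index in T
        k = 0                         # number of pattern characters matched so far
        while k < m and T[i + k] == P[k]:
            k += 1
        if k == m:                    # reached the end of the pattern
            return i
    return -1                         # no starting index produced a full match


print(find_brute("we_love_learning", "learn"))  # 8
print(find_brute("we_love_learning", "hate"))   # -1
```

For the cases tried here it agrees with the built-in `T.find(P)`, including the empty pattern, which matches at index 0.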

In [3]:
my_str = "we_love_learning_it makes us grow!"

print(f"where is underscore? {my_str.find('_')}") # 2

print(f"where is l? {my_str.index('l')}") # 3

print(f"where is next l? {my_str.index('l', 4)}") # 8

print(f"how many e's are in my_str?: {my_str.count('e')}") # 4

where is underscore? 2
where is l? 3
where is next l? 8
how many e's are in my_str?: 4


In [5]:
# Partition the string into three parts using the given separator.
print(my_str.partition(" "))
# ('we_love_learning_it', ' ', 'makes us grow!')

# Return a list of the words in the string, using sep as the delimiter string.
print(my_str.split())
# ['we_love_learning_it', 'makes', 'us', 'grow!']

# Return a copy with all occurrences of substring old replaced by new.
print(my_str.replace("makes", "helps"))
# we_love_learning_it helps us grow!

('we_love_learning_it', ' ', 'makes us grow!')
['we_love_learning_it', 'makes', 'us', 'grow!']
we_love_learning_it helps us grow!


Here is another workout:

In [47]:
drake = "Tuscan Leather"

print(drake.find("L")) # 7

# start, end
print(drake.find("a", 0, 6)) # 4

print(drake.find("W")) # -1

print(drake.index("a")) # 4

7
4
-1
4


In [50]:
drake = "Tuscan Leather"

try:
    print(drake.index("W")) # ValueError substring not found
except ValueError as e:
    print(e)

print(drake.index("a", 5)) # 9

print(drake.lower().count("t")) # 2

print(drake.split()[1][2:4]) # at

print(drake.partition(" ")) # ('Tuscan', ' ', 'Leather')

substring not found
9
2
at
('Tuscan', ' ', 'Leather')


## ASCII - Unicode - Encoding Things. ✏️

ASCII and Unicode are two standards for representing characters as numbers. 

Basically, they define how to represent different characters in binary so that they can be written, stored, transmitted, and read in digital media. 

The main difference between the two is in the way they encode the character and the number of bits that they use for each.


### ASCII, Origins

ASCII uses 7 bits to represent a character. With 7 bits there are at most 2^7 (= 128) distinct combinations, which means that at most 128 characters can be represented.

    - Wait, 7 bits? But why not 1 byte (8 bits)?

    - The eighth bit was used as a parity bit for error detection. This was relevant years ago.

Most ASCII characters are printable characters of the alphabet such as abc, ABC, 123, ?&!, etc. The others are control characters such as carriage return, line feed, tab, etc.

### ASCII was meant for English only. 128 characters.

    - What? Why English only? So many languages out there!

    - Because the center of the computer industry was in the USA at that time. As a consequence, they didn't need to support accents or other marks such as á, ü, ç, ñ, etc. (aka diacritics).


### Unicode, The Rise

ASCII Extended solves the problem for languages that are based on the Latin alphabet... what about the others needing a completely different alphabet? Greek? Russian? Chinese and the likes?

We would have needed an entirely new character set... that's the rationale behind Unicode. Unicode doesn't contain every character from every language, but it does contain a gigantic number of characters.

You cannot save text to your hard drive as "Unicode". Unicode is an abstract representation of the text. You need to "encode" this abstract representation. That's where an encoding comes into play.

### Encodings: UTF-8 vs UTF-16 vs UTF-32

The basics:

    - UTF-8 and UTF-16 are variable-length encodings.
    
    - In UTF-8, a character occupies at least 8 bits (1 to 4 bytes).
    
    - In UTF-16, a character occupies at least 16 bits (2 or 4 bytes).
    
    - UTF-32 is a fixed-length encoding of 32 bits.

UTF-8 uses the ASCII set for the first 128 characters. That's handy because it means ASCII text is also valid in UTF-8.
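
The size differences between the encodings can be observed with `str.encode`. In this sketch the sample characters are chosen for illustration, and the little-endian codec names (`utf-16-le`, `utf-32-le`) are used so that no byte-order mark is added to the counts.

```python
# A, e with acute accent, euro sign, grinning-face emoji
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

sizes = {}
for ch in samples:
    sizes[ch] = (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )
    print(f"U+{ord(ch):04X}  utf-8/16/32 bytes: {sizes[ch]}")

# ASCII text produces identical bytes in ASCII and UTF-8
same_bytes = "hello".encode("ascii") == "hello".encode("utf-8")
print(same_bytes)  # True

# Encoding and decoding with the same codec round-trips the text
round_trip = "20 \u20ac".encode("utf-8").decode("utf-8")
print(round_trip == "20 \u20ac")  # True
```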

## TL;DR: UTF-8 is widely used in the world.

----

- ASCII 32 is the space character `" "`

- ASCII 48 is `0`, ... , ASCII 57 is `9`

- ASCII 65 is `A`, ... , ASCII 90 is `Z`

- ASCII 97 is `a`, ... , ASCII 122 is `z` 
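
These code points can be checked with the built-ins `ord` and `chr`. The short sketch below also shows two consequences of the layout: upper and lower case letters are 32 apart, and a digit character's value is its distance from `"0"`.

```python
print(ord(" "), ord("0"), ord("9"))  # 32 48 57
print(ord("A"), ord("Z"))            # 65 90
print(ord("a"), ord("z"))            # 97 122
print(chr(65), chr(97))              # A a

# Upper and lower case letters differ by exactly 32
upper_from_lower = chr(ord("a") - 32)
print(upper_from_lower)              # A

# A digit character's numeric value is its offset from "0"
seven = ord("7") - ord("0")
print(seven)                         # 7

# str.isascii() reports whether every character is below 128
print("cafe".isascii(), "caf\u00e9".isascii())  # True False
```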

In [2]:
# also do not forget, there is a module you can use 
# for checking string constants 

from string import digits, ascii_letters, ascii_lowercase

print(digits)

print(ascii_lowercase)

0123456789
abcdefghijklmnopqrstuvwxyz


## Example Questions Here: 💕

In [51]:
from collections import Counter

def isAnagram(s: str, t: str) -> bool:
    return Counter(s) == Counter(t) 

print(isAnagram("rat", "car"))


# or map approach
def isAnagramTwo(s: str, t: str) -> bool:
    
    def make_counter_map(container: str) -> dict:
        counts = {}  # named `counts` to avoid shadowing the built-in `map`
        for elem in container:
            counts[elem] = counts.get(elem, 0) + 1
        return counts
    
    return make_counter_map(s) == make_counter_map(t)

print(isAnagramTwo("car", "rac"))

False
True


In [52]:
from collections import defaultdict

def groupAnagrams(strs: list[str]) -> list[list[str]]:
    
    # every key has an empty list as default
    res = defaultdict(list)
    for s in strs:
        # we can use a tuple for our dict keys
        # sorted returns a list
        # make it a tuple for it to be hashable
        res[tuple(sorted(s))].append(s)

    return list(res.values()) 
    
print(groupAnagrams(["eat","tea","tan","ate","nat","bat"]))

[['eat', 'tea', 'ate'], ['tan', 'nat'], ['bat']]


In [53]:
"""
Q: A phrase is a palindrome if, after converting all 
uppercase letters into lowercase letters and removing 
all non-alphanumeric characters, it reads the same 
forward and backward. 

Alphanumeric characters include letters and numbers.

Given a string s, return true if it is a palindrome, or 
false otherwise.

Example 1:

    Input: s = "A man, a plan, a canal: Panama"
    Output: true
    
    Explanation: "amanaplanacanalpanama" is a palindrome.

Example 2:

    Input: s = "race a car"
    Output: false
    
    Explanation: "raceacar" is not a palindrome.

Example 3:

    Input: s = " "
    Output: true
    
    Explanation: s is an empty string "" after removing 
        non-alphanumeric characters.
        Since an empty string reads the same forward and 
        backward, it is a palindrome.

"""

class Solution:
    def isPalindrome_(self, s:str):
        # we can simply use string methods.

        result_string = ""

        for c in s:
            if c.isalnum():
                result_string += c.lower()

        return result_string == result_string[::-1]


    def isPalindrome(self, s: str):
        # we can use two pointers and not worry about 
        # reversing the string
        
        l, r = 0 , len(s) - 1

        while l < r:
            # skip all non-alphanumeric characters
            # (isalnum, not isalpha, so digits are compared too)
            while l < r and not s[l].isalnum():
                l += 1
            while r > l and not s[r].isalnum():
                r -= 1
            # check equality
            if s[l].lower() != s[r].lower():
                return False
            l, r = l + 1, r - 1

        # if all passed, we found a Palindrome
        return True

    def isPalindrome__(self, s: str):
        """You can write the alphanumeric function yourself.
        Also, you don't need to reverse the string. 
        You can compare the characters one by one"""

        l, r = 0 , len(s) - 1

        while l < r:
            # pass all non alphanumeric characters
            while l < r and not self.alphanumeric(s[l]):
                l += 1
            while r > l and not self.alphanumeric(s[r]):
                r -= 1
            # check equality
            if s[l].lower() != s[r].lower():
                return False
            l, r = l + 1, r - 1

        # if all passed, we found a Palindrome
        return True

    def alphanumeric(self, s):
        return (ord("A") <= ord(s) <= ord("Z")) or \
            (ord("a") <= ord(s) <= ord("z")) or \
            (ord("0") <= ord(s) <= ord("9"))

if __name__ == '__main__':
    sol = Solution()

    # best approach ?
    print(sol.isPalindrome(s = "A man, a plan, a canal: Panama"))
    print(sol.isPalindrome(s = "race a car"))

    # second approach - string methods
    print(sol.isPalindrome_(s = "A man, a plan, a canal: Panama"))
    print(sol.isPalindrome_(s = "race a car"))

    # third approach, our own alphanumeric function
    print(sol.isPalindrome__(s = "A man, a plan, a canal: Panama"))
    print(sol.isPalindrome__(s = "race a car"))

True
False
True
False
True
False


In [55]:
"""
Given a string, check if it is a palindrome 
(reads the same forwards and backwards), ignoring 
spaces and punctuation.
"""
def is_palindrome(string):
    # Remove spaces and punctuation
    # we do not want anything that is not alnum
    # we want to make all lowercase so we can compare
    cleaned_string = ''.join(char.lower() for char in string if char.isalnum())

    # Check if the cleaned string is equal to its reverse
    return cleaned_string == cleaned_string[::-1]

# Test the function
test_string = "A man, a plan, a canal, Panama!"
print(is_palindrome(test_string))  # Output: True


True


In [2]:
"""
Given a string representing a sentence, count the 
occurrences of each unique word and store the 
counts in a dictionary"""

def count_word_occurances(sentence):
    # Convert the sentence to lowercase and split into words
    words = sentence.lower().split()

    # Initialize an empty dictionary to store word counts
    word_counts = {}

    # Count the occurrence of each word
    for word in words:
        # Remove punctuation marks and special characters
        word = "".join(char for char in word if char.isalnum())

        # Update the count in the dictionary
        if word:
            word_counts[word] = word_counts.get(word, 0) + 1

    return word_counts

# Example usage:
sentence = "Hello, how are you? Are you doing well, hello?"
print(count_word_occurances(sentence))

{'hello': 2, 'how': 1, 'are': 2, 'you': 2, 'doing': 1, 'well': 1}
