# String Operations and Manipulation in Python

## 1.0 String Operations

String operations in Python are the actions you can perform on strings, which are sequences of characters. Python provides many built-in functions and methods for working with strings. This section covers the following common string operations:

**1.1** Concatenation  
**1.2** String formatting  
**1.3** String Indexing  
**1.4** Slicing  
**1.5** String length  
**1.6** String methods  
**1.7** String searching and Manipulation  
**1.8** String Conversion

### 1.1 Concatenation

You can concatenate (combine) two or more strings using the + operator.

In [1]:
str1 = "Hello"
str2 = "World"
result = str1 + " " + str2

print(result)

Hello World


In [2]:
str1 = "1"
str2 = "2"
result = str1 + "-" + str2

print(result)

1-2


In [3]:
str1 = "Hello"
str2 = "100"
result = str1 + " " + str2

print(result)

Hello 100
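
The examples above only join strings with other strings. As a minimal sketch (the names `count`, `label` and `joined` are illustrative and not part of the original notebook), the following shows that + does not convert numbers for you, and that str.join() is an alternative when you have many pieces to combine:

```python
count = 100  # an int, not a str

# Adding a str and an int raises TypeError; Python does not convert implicitly.
try:
    label = "Hello " + count
except TypeError:
    # Convert explicitly with str() before concatenating.
    label = "Hello " + str(count)

# str.join() concatenates an iterable of strings using a separator.
joined = " ".join(["Hello", "World", "100"])

print(label)
print(joined)
```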


### 1.2 String formatting
You can format strings using f-strings, the older-style % operator, or the str.format() method

In [4]:
# Example 1 (Using f-strings)
name = "Alice"
age = 30
formatted_string = f"My name is {name} and I am {age} years old."

formatted_string

'My name is Alice and I am 30 years old.'

In [5]:
# Example 2 (Using %-formatting)
product = "apple"
price = 0.75
formatted_string = "The %s costs $%.2f" % (product, price)

formatted_string

'The apple costs $0.75'

In [6]:
# Example 3 (Using str.format())
city = "New York"
population = 8_400_000
formatted_string = "The population of {} is {}.".format(city, population)

formatted_string

'The population of New York is 8400000.'

In [7]:
# Example 4 (Using positional arguments with str.format())
item = "book"
price = 15.99
formatted_string = "A {} costs ${:.2f}".format(item, price)

formatted_string

'A book costs $15.99'

In [8]:
# Example 5 (Using named arguments with str.format())
fruit = "banana"
quantity = 5
formatted_string = "I bought {qty} {fruit}s.".format(qty=quantity, fruit=fruit)

formatted_string

'I bought 5 bananas.'
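
The format specifications used with % and str.format() above also work inside f-strings. The sketch below is illustrative only (the variable names `price_line`, `population_line` and `padded` are assumptions made for this example) and shows a few commonly used specifiers:

```python
item = "book"
price = 15.99
population = 8_400_000

# :.2f rounds to two decimal places.
price_line = f"A {item} costs ${price:.2f}"

# :, inserts a thousands separator.
population_line = f"Population: {population:,}"

# :>8 right-aligns the value in a field that is 8 characters wide.
padded = f"{item:>8}|"

print(price_line)
print(population_line)
print(padded)
```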

### 1.3 String Indexing

Access a single character of a string using its index value. Indexing starts at 0, and negative indices count from the end of the string

In [9]:
my_string = "Python"

In [10]:
first_char = my_string[0]
print(first_char)

P


In [11]:
last_char = my_string[-1]
print(last_char)

n


In [12]:
middle_char = my_string[2]
print(middle_char)

t


In [13]:
second_char = my_string[1]
print(second_char)

y
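
Indexing has two edge cases worth knowing about. The following sketch (the flag names are hypothetical, chosen only for this illustration) checks that an index past the end raises IndexError and that strings cannot be modified in place:

```python
my_string = "Python"

# -1 refers to the same character as len(my_string) - 1.
same_char = my_string[-1] == my_string[len(my_string) - 1]

# An index outside the valid range raises IndexError.
try:
    my_string[10]
    out_of_range = False
except IndexError:
    out_of_range = True

# Strings are immutable, so assigning to an index raises TypeError.
try:
    my_string[0] = "J"
    assignment_failed = False
except TypeError:
    assignment_failed = True

print(same_char, out_of_range, assignment_failed)
```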


### 1.4 Slicing
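Access a group of characters from a string using slicing. A slice has the form [start:stop:step], where the start index is included and the stop index is excluded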

In [14]:
my_string = "GreatLearning"
substring = my_string[2:5]
# substring is 'eat'

substring

'eat'

In [15]:
first_three_chars = my_string[:3]
# first_three_chars is 'Gre'

first_three_chars

'Gre'

In [16]:
last_three_chars = my_string[-3:]
# last_three_chars is 'ing'

last_three_chars

'ing'

In [17]:
every_other_char = my_string[::2]
# every_other_char is 'Geterig'

every_other_char

'Geterig'

In [18]:
reversed_string = my_string[::-1]
# reversed_string is 'gninraeLtaerG'

reversed_string

'gninraeLtaerG'
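
Unlike indexing, slicing is forgiving about ranges that run past the end of the string. Here is a small sketch of that behaviour, together with a hypothetical helper `is_palindrome` (not part of the original notebook) that builds on the reversing slice:

```python
my_string = "GreatLearning"

# Explicit start, stop and step; the stop index 13 is excluded.
middle = my_string[5:13:1]

# A stop index beyond the end is clipped instead of raising an error.
clipped = my_string[10:100]

def is_palindrome(text):
    """Return True if text reads the same forwards and backwards."""
    return text == text[::-1]

print(middle)
print(clipped)
print(is_palindrome("level"), is_palindrome(my_string))
```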

### 1.5 String Length

In [19]:
my_string = "Hello, World!"
length = len(my_string)

length

13

In [20]:
empty_string = ""
length = len(empty_string)

length

0

In [21]:
long_string = "This is a very long string with lots of characters."
length = len(long_string)

length

51

In [22]:
unicode_string = "Αυτή είναι μια συμβολοσειρά σε Unicode."
length = len(unicode_string)

length

39

In [23]:
newline_string = "Line 1\nLine 2\nLine 3"
length = len(newline_string)

length

20
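
As the Unicode and newline examples suggest, len() counts characters (code points), not bytes or typed keystrokes. The following sketch makes this explicit; the variable names are illustrative assumptions:

```python
alpha = "\u03b1"  # the Greek small letter alpha, written as an escape

char_count = len(alpha)                  # one character
byte_count = len(alpha.encode("utf-8"))  # two bytes when encoded as UTF-8

# "\n" is a single newline character; in a raw string it is a backslash plus 'n'.
newline_count = len("\n")
raw_count = len(r"\n")

print(char_count, byte_count, newline_count, raw_count)
```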

### 1.6 String Methods

In [24]:
# Example 1 (str.upper())
my_string = "Hello, World!"
uppercase_string = my_string.upper()

uppercase_string

'HELLO, WORLD!'

In [25]:
# Example 2 (str.lower())
lowercase_string = my_string.lower()

lowercase_string

'hello, world!'

In [26]:
# Example 3 (str.strip())
my_string = "   Python   "
stripped_string = my_string.strip()

stripped_string

'Python'

In [27]:
# Example 4 (str.replace())
original_string = "I like cats."
new_string = original_string.replace("cats", "dogs")

new_string

'I like dogs.'

In [28]:
# Example 5 (str.split())
sentence = "This is a sentence."
words = sentence.split()

words

['This', 'is', 'a', 'sentence.']
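
String methods return new strings and leave the original untouched, which means they can be chained. A minimal sketch, assuming the illustrative names `raw`, `cleaned`, `rebuilt` and `parts`:

```python
raw = "   Hello, World!   "

# Each method returns a new string, so calls can be chained left to right.
cleaned = raw.strip().lower().replace("world", "python")

# str.join() is the inverse of str.split() for single-space separated text.
words = "This is a sentence.".split()
rebuilt = " ".join(words)

# With an explicit separator, split() keeps empty fields.
parts = "a,b,,c".split(",")

print(cleaned)
print(rebuilt)
print(parts)
```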

### 1.7 String searching and Manipulation

In [29]:
# Example 1 (str.find())
my_string = "Hello, World!"
index = my_string.find("World")

index

7

In [30]:
# Example 2 (str.startswith())
starts_with_hello = my_string.startswith("Hello")

starts_with_hello

True

In [31]:
# Example 3 (str.endswith())
ends_with_hash = my_string.endswith("#")

ends_with_hash

False

In [32]:
# Example 4 (str.capitalize())
sentence = "this is a sentence."
capitalized_sentence = sentence.capitalize()

capitalized_sentence

'This is a sentence.'
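
When the text you search for is missing, the search methods behave differently from each other. The sketch below (variable names are assumptions for this example) compares str.find(), str.index() and the in operator:

```python
my_string = "Hello, World!"

# find() returns -1 when the substring is not present.
missing = my_string.find("Python")

# index() raises ValueError in the same situation.
try:
    my_string.index("Python")
    raised = False
except ValueError:
    raised = True

# The in operator is the simplest membership test.
has_world = "World" in my_string

# count() returns the number of non-overlapping occurrences.
occurrences = my_string.count("l")

print(missing, raised, has_world, occurrences)
```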

### 1.8 String Conversion

In [33]:
# Example 1 (int to str)
number = 42
str_number = str(number)

str_number

'42'

In [34]:
# Example 2 (float to str)
pi = 3.14159
str_pi = str(pi)

str_pi

'3.14159'

In [35]:
# Example 3 (str to int)
str_num = "123"
int_num = int(str_num)

int_num

123

In [36]:
# Example 4 (str to float)
str_float = "3.14"
float_num = float(str_float)

float_num

3.14

In [37]:
# Example 5 (ord())
char = 'A'
ascii_value = ord(char)

ascii_value

65
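
Conversions from str can fail when the text is not a valid number. The following is one possible way to handle that, using a hypothetical helper `to_int` that is not part of the original notebook; it also shows chr() as the inverse of ord():

```python
def to_int(text, default=None):
    """Convert text to int, returning default if the text is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        return default

valid = to_int("123")
not_an_int = to_int("3.14")   # int() does not accept a decimal point in a string
fallback = to_int("abc", 0)

# chr() turns a code point back into a character.
round_trip = chr(ord("A"))

print(valid, not_an_int, fallback, round_trip)
```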

## 2.0 String Manipulation using Regular Expressions (Regex)

A regular expression, often referred to as "regex" or "regexp," is a powerful and flexible pattern matching and text manipulation tool used in Python and many other programming languages. It is a sequence of characters that defines a search pattern. Regular expressions are commonly used for tasks like searching, extracting, validating, and manipulating text based on specific patterns.

**Regex Functions**
1. findall
2. search
3. split
4. sub
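
Of the four functions listed, search and sub work as follows: search returns a match object for the first occurrence (or None), and sub replaces every match. A minimal sketch, with a made-up `text` value chosen only for illustration:

```python
import re

text = "Order 66 shipped on 2021-07-19"

# search() returns a match object for the first match, or None if there is none.
match = re.search(r'\d+', text)
first_number = match.group() if match else None
position = match.span() if match else None

no_match = re.search(r'xyz', text)

# sub() replaces every match of the pattern with the replacement string.
masked = re.sub(r'\d', '#', text)

print(first_number, position)
print(no_match)
print(masked)
```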

**Special sequences**

\d: Matches any digit (equivalent to [0-9]). \
\D: Matches any non-digit character (equivalent to [^0-9]). \
\w: Matches any word character (equivalent to [a-zA-Z0-9_]). \
\W: Matches any non-word character (equivalent to [^a-zA-Z0-9_]). \
\s: Matches any whitespace character (e.g., space, tab, newline). \
\S: Matches any non-whitespace character. \
\b: Matches a word boundary. It does not consume any characters, but it asserts that a word character is present at one end and a non-word character is present at the other end. \
\B: Matches a non-word boundary.
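
The word boundary sequences are easier to understand with an example. The sketch below uses a made-up sentence to show the difference between \b and \B. Note the raw string prefix r'...': without it, Python would interpret '\b' as a backspace character.

```python
import re

text = "cat concat cats cat."

# \bcat\b matches 'cat' only when it stands alone as a whole word.
whole_words = re.findall(r'\bcat\b', text)

# \Bcat matches 'cat' only when it does NOT start at a word boundary,
# which here is the 'cat' inside 'concat'.
inside_words = re.findall(r'\Bcat', text)

print(whole_words)
print(inside_words)
```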

In [38]:
#Import a library
import re

### 2.1 Finding characters in a text
Let's use the below sample text and see how we can extract different parts of this text using regular expressions

In [39]:
sample_text = "abcxyzABCXYZabc xyz%$...| 676-898 "

### 2.2 Finding Patterns

Let's find all the occurrences of 'abc' in the above text

In [40]:
pattern = re.compile(r'abc') # this is used to store the pattern you want to search
matches = pattern.findall(sample_text) # this is used to find all instances in a text that match the pattern
print(matches) # the result is in the form of a Python list

['abc', 'abc']


<b>Note:</b> The pattern which is compiled is case-sensitive. So by searching for 'abc', you cannot find 'ABC'
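
If you do want to ignore case, one option is the re.IGNORECASE flag. A minimal sketch using the same sample text:

```python
import re

sample_text = "abcxyzABCXYZabc xyz%$...| 676-898 "

# re.IGNORECASE makes the compiled pattern match regardless of letter case.
pattern = re.compile(r'abc', re.IGNORECASE)
matches = pattern.findall(sample_text)
print(matches)
```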

### 2.3 Match everything
The dot character matches any single character except a newline, so findall returns every character in the text, one at a time

In [41]:
pattern = re.compile(r'.')
matches = pattern.findall(sample_text)
print(matches)

['a', 'b', 'c', 'x', 'y', 'z', 'A', 'B', 'C', 'X', 'Y', 'Z', 'a', 'b', 'c', ' ', 'x', 'y', 'z', '%', '$', '.', '.', '.', '|', ' ', '6', '7', '6', '-', '8', '9', '8', ' ']
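
The sample text has no line breaks, so the newline exception does not show up above. The following sketch uses a made-up two-line string to show the default behaviour and the effect of the re.DOTALL flag:

```python
import re

two_lines = "ab\ncd"

# By default the dot skips the newline character.
default_matches = re.findall(r'.', two_lines)

# With re.DOTALL the dot matches the newline as well.
dotall_matches = re.findall(r'.', two_lines, flags=re.DOTALL)

print(default_matches)
print(dotall_matches)
```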


### 2.4 Match all numbers
The identifier <b>\d</b> finds all digits - numbers from 0 to 9

In [42]:
pattern = re.compile(r'\d')
matches = pattern.findall(sample_text)
print(matches)

['6', '7', '6', '8', '9', '8']


### 2.5 Match all which are NOT numbers
The identifier <b>\D</b> finds all characters which are not digits

In [43]:
pattern = re.compile(r'\D')
matches = pattern.findall(sample_text)
print(matches)

['a', 'b', 'c', 'x', 'y', 'z', 'A', 'B', 'C', 'X', 'Y', 'Z', 'a', 'b', 'c', ' ', 'x', 'y', 'z', '%', '$', '.', '.', '.', '|', ' ', '-', ' ']


### 2.6 Match all word characters
The identifier <b>\w</b> finds all word characters - letters a to z and A to Z, digits 0 to 9, and underscore

In [44]:
pattern = re.compile(r'\w')
matches = pattern.findall(sample_text)
print(matches)

['a', 'b', 'c', 'x', 'y', 'z', 'A', 'B', 'C', 'X', 'Y', 'Z', 'a', 'b', 'c', 'x', 'y', 'z', '6', '7', '6', '8', '9', '8']
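
For str patterns in Python 3, \w also matches letters outside the ASCII range unless you ask for ASCII-only matching. A short sketch with a made-up string containing an accented letter (written as an escape):

```python
import re

text = "caf\u00e9_1"  # 'cafe' with an accented e, then an underscore and a digit

# Default behaviour: Unicode letters count as word characters.
unicode_matches = re.findall(r'\w', text)

# re.ASCII restricts \w to [a-zA-Z0-9_].
ascii_matches = re.findall(r'\w', text, flags=re.ASCII)

print(unicode_matches)
print(ascii_matches)
```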


### 2.7 Match all which are NOT word characters
The identifier <b>\W</b> finds all characters which are not letters, digits or underscore

In [45]:
pattern = re.compile(r'\W')
matches = pattern.findall(sample_text)
print(matches)

[' ', '%', '$', '.', '.', '.', '|', ' ', '-', ' ']


### 2.8 Match all whitespaces
The identifier <b>\s</b> finds all whitespaces - space, tab, new line

In [46]:
pattern = re.compile(r'\s')
matches = pattern.findall(sample_text)
print(matches)

[' ', ' ', ' ']


### 2.8 Match all which are NOT whitespaces
The identifier <b>\S</b> finds all characters which are not whitespace - anything other than space, tab and new line

In [47]:
pattern = re.compile(r'\S')
matches = pattern.findall(sample_text)
print(matches)

['a', 'b', 'c', 'x', 'y', 'z', 'A', 'B', 'C', 'X', 'Y', 'Z', 'a', 'b', 'c', 'x', 'y', 'z', '%', '$', '.', '.', '.', '|', '6', '7', '6', '-', '8', '9', '8']


### 2.9 Reserved Characters
Some characters are reserved for specific functions in regular expressions - . ^ $ * + ? { } [ ] | \ ( )

We cannot use them directly. For example, as we saw earlier, using '.' gives you all the characters, but what if you want to find the occurrence of all dots in a text? You can use the backslash to tell Python that you actually want to find a dot and not use the dot as an identifier

In [48]:
pattern = re.compile(r'\.')
matches = pattern.findall(sample_text)
print(matches)

['.', '.', '.']
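
When the text to search for contains several reserved characters, escaping each one by hand gets tedious. One option is re.escape(), which adds the backslashes for you. The sketch below assumes a literal string `literal` chosen for illustration, and also shows that most reserved characters lose their special meaning inside square brackets:

```python
import re

sample_text = "abcxyzABCXYZabc xyz%$...| 676-898 "

# re.escape() returns the literal with all reserved characters escaped.
literal = "...|"
pattern = re.compile(re.escape(literal))
matches = pattern.findall(sample_text)

# Inside [ ], the dot and the dollar sign are treated as ordinary characters.
dots_and_dollars = re.findall(r'[.$]', sample_text)

print(matches)
print(dots_and_dollars)
```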


### 2.10 Splitting data
The split function returns a list where the string has been split at each match of the pattern

In [49]:
#split by spaces
text = "Hello World! This is a test."
words = re.split(r'\s+', text)

words

['Hello', 'World!', 'This', 'is', 'a', 'test.']

In [50]:
#split by commas and spaces
text = "apple,banana, cherry, date"
fruits = re.split(r',\s*', text)
print(fruits)

['apple', 'banana', 'cherry', 'date']


In [51]:
#split by multiple delimiters
text = "One, two; three - four"
items = re.split(r'[,\s;.-]+', text)
print(items)

['One', 'two', 'three', 'four']


In [52]:
#split by digits
text = "The year is 2023"
parts = re.split(r'\d+', text)
print(parts)

['The year is ', '']


In [53]:
#split non-word characters
text = "Hello! World@ This# is^ a$ test%"
words = re.split(r'\W+', text)
print(words)

['Hello', 'World', 'This', 'is', 'a', 'test', '']
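
The last two results contain an empty string because the text ends with a match. The following sketch shows one way to drop such empty strings, plus two other features of re.split: the maxsplit argument and capturing groups. The variable names are illustrative.

```python
import re

text = "Hello! World@ This# is^ a$ test%"

# Filter out the empty strings left over by leading or trailing matches.
words = [w for w in re.split(r'\W+', text) if w]

# maxsplit limits the number of splits; the rest of the text stays together.
first_two = re.split(r'\s+', "Hello World! This is a test.", maxsplit=2)

# A capturing group in the pattern keeps the delimiters in the result.
with_delims = re.split(r'(\d+)', "The year is 2023")

print(words)
print(first_two)
print(with_delims)
```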


## 3.0 Exercise Study 1 - Extracting dates from logs

Let's use our understanding from the above exercise to extract dates from the below sample text

In [54]:
sample_text = """
2021-01-01: January 2021, Friday: Mr. Xyz
2021-07-19: July 2021, Monday: Mr L
2021-05-28: May 2021, Friday: Ms. IOU abc
2021/11/10: November 2021, Wednesday: Dr. Moby D
2020-01-01xyzabc: Jan 10th: Dr Dora The Exp
2020-04-02 0-1102-010002: Mrs. Joy
2020-02-01 0-12-010002: Mr. Mario
2019-09-18: suu: Mr. Luigi
2018-07-01: hha iuas
21-10-19: xyuz sus
21-10-18
21-9-20
21-8-14
21/12/02
21/11/14
21/10/21
"""

#### Let's start simple
First let's try to find all text that match 4 numbers, a dash, then 2 numbers, a dash and finally followed by 2 more numbers

In [55]:
pattern = re.compile(r'\d\d\d\d-\d\d-\d\d')
matches = pattern.findall(sample_text)
print(matches)

['2021-01-01', '2021-07-19', '2021-05-28', '2020-01-01', '2020-04-02', '2020-02-01', '2019-09-18', '2018-07-01']


#### Using Quantifiers to match patterns
You can use quantifiers to check for multiple occurrences. For example, instead of repeating <b>\d</b> four times, you can say <b>\d+</b> which will look for 1 or more digits

In [56]:
pattern = re.compile(r'\d+-\d+-\d+')
matches = pattern.findall(sample_text)
print(matches)

['2021-01-01', '2021-07-19', '2021-05-28', '2020-01-01', '2020-04-02', '0-1102-010002', '2020-02-01', '0-12-010002', '2019-09-18', '2018-07-01', '21-10-19', '21-10-18', '21-9-20', '21-8-14']


But if we do this, you end up with garbage values like <b>0-1102-010002</b> and <b>0-12-010002</b>. You can avoid these by specifying the exact number of digits to check for using curly brackets - <b>{how many?}</b>

In [57]:
pattern = re.compile(r'\d{4}-\d{2}-\d{2}')
matches = pattern.findall(sample_text)
print(matches)

['2021-01-01', '2021-07-19', '2021-05-28', '2020-01-01', '2020-04-02', '2020-02-01', '2019-09-18', '2018-07-01']


Now, we are missing out on dates which have only 2 digits to represent the year - like <b>21-10-19</b> and 1 digit to represent the month - like <b>21-8-14</b>. To include these as well, we can change the pattern to look for a range of numbers. You can do this with <b>{min,max}</b>

In [58]:
pattern = re.compile(r'\d{2,4}-\d{1,2}-\d{2}')
matches = pattern.findall(sample_text)
print(matches)

['2021-01-01', '2021-07-19', '2021-05-28', '2020-01-01', '2020-04-02', '2020-02-01', '2019-09-18', '2018-07-01', '21-10-19', '21-10-18', '21-9-20', '21-8-14']
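
One caveat with range quantifiers is that the pattern can also match part of a longer run of digits. The sketch below uses a made-up string to show the problem, and one possible fix using the lookarounds <b>(?&lt;!\d)</b> and <b>(?!\d)</b>, which require that no digit sits directly before or after the match. This goes beyond the patterns used in this notebook and is only an illustration.

```python
import re

text = "id 12345-01-011 on 21-9-20"

# The loose pattern picks a date-like piece out of the longer number.
loose = re.findall(r'\d{2,4}-\d{1,2}-\d{2}', text)

# The lookarounds reject matches that are glued to other digits.
strict = re.findall(r'(?<!\d)\d{2,4}-\d{1,2}-\d{2}(?!\d)', text)

print(loose)
print(strict)
```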


#### Matching multiple characters

The above code is not yet perfect. It misses out on dates which are separated by '/' - like <b>21/10/21</b>. We can use square brackets <b>[ ]</b> to tell Python to look for -'s or for /'s

In [59]:
pattern = re.compile(r'\d{2,4}[-/]\d{1,2}[-/]\d{2}')
matches = pattern.findall(sample_text)
print(matches)

['2021-01-01', '2021-07-19', '2021-05-28', '2021/11/10', '2020-01-01', '2020-04-02', '2020-02-01', '2019-09-18', '2018-07-01', '21-10-19', '21-10-18', '21-9-20', '21-8-14', '21/12/02', '21/11/14', '21/10/21']
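
The pattern above would also accept a date that mixes both separators. If you want the second separator to be the same as the first, one option is to capture the first separator in a group and refer back to it with <b>\1</b>. This is a sketch on a made-up string, not part of the original exercise. Because findall returns only the group when a pattern contains one, the example uses finditer to get the full match.

```python
import re

text = "2021-01-01 21/12/02 2021-11/10"

# ([-/]) captures the first separator; \1 requires the same character again.
pattern = re.compile(r'\d{2,4}([-/])\d{1,2}\1\d{2}')

# group(0) is the whole match, regardless of any capturing groups.
dates = [m.group(0) for m in pattern.finditer(text)]
print(dates)
```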


## 4.0 Exercise Study 2 - Extracting names from logs

Let's try to extract all the names in the sample text. We need 8 names - Mr. Xyz, Mr L, Ms. IOU abc, Dr. Moby D, Dr Dora The Exp, Mrs. Joy, Mr. Mario and Mr. Luigi

In [60]:
sample_text = """
2021-01-01: January 2021, Friday: Mr. Xyz
2021-07-19: July 2021, Monday: Mr L
2021-05-28: May 2021, Friday: Ms. IOU abc
2021/11/10: November 2021, Wednesday: Dr. Moby D
2020-01-01xyzabc: Jan 10th: Dr Dora The Exp
2020-04-02 0-1102-010002: Mrs. Joy
2020-02-01 0-12-010002: Mr. Mario
2019-09-18: suu: Mr. Luigi
2018-07-01: hha iuas
21-10-19: xyuz sus
21-10-18
21-9-20
21-8-14
21/12/02
21/11/14
21/10/21
"""

Let's start by getting all the Mr

In [61]:
pattern = re.compile(r'Mr')
matches = pattern.findall(sample_text)
print(matches)

['Mr', 'Mr', 'Mr', 'Mr', 'Mr']


After Mr, there may or may not be a dot. We can use the <b>?</b> quantifier to specify 0 or 1 dot

In [62]:
pattern = re.compile(r'Mr\.?')
matches = pattern.findall(sample_text)
print(matches)

['Mr.', 'Mr', 'Mr', 'Mr.', 'Mr.']


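Now let's bring Ms, Mrs and Dr into the match as well. To do this, we can put the alternatives in a <b>(group)</b> and separate them with the either-or operator <b>|</b>. The <b>?:</b> at the start makes the group non-capturing, so findall still returns the whole match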

In [63]:
pattern = re.compile(r'(?:Mr|Ms|Mrs|Dr)\.?')
matches = pattern.findall(sample_text)
print(matches)

['Mr.', 'Mr', 'Ms.', 'Dr.', 'Dr', 'Mr', 'Mr.', 'Mr.']
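
Notice that 'Mrs.' is reported as 'Mr' in the output above. Alternatives are tried from left to right, and Mr succeeds before Mrs is ever attempted. The sketch below, on a made-up string, shows the effect and two possible ways around it: listing the longer alternative first, or making the 's' optional.

```python
import re

text = "Mrs. Joy and Mr Luigi"

# 'Mr' is tried first, so 'Mrs.' is cut short to 'Mr'.
original_order = re.findall(r'(?:Mr|Ms|Mrs|Dr)\.?', text)

# Putting 'Mrs' before 'Mr' lets the longer title win.
longer_first = re.findall(r'(?:Mrs|Mr|Ms|Dr)\.?', text)

# Equivalent alternative: 'Mrs?' matches 'Mr' with an optional 's'.
optional_s = re.findall(r'(?:Mrs?|Ms|Dr)\.?', text)

print(original_order)
print(longer_first)
print(optional_s)
```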


Now let's get the first names. After the title there is a whitespace character, then a capital letter, then zero or more word characters

In [64]:
pattern = re.compile(r'(?:Mr|Ms|Mrs|Dr)\.?\s[A-Z]\w*')
matches = pattern.findall(sample_text)
print(matches)

['Mr. Xyz', 'Mr L', 'Ms. IOU', 'Dr. Moby', 'Dr Dora', 'Mrs. Joy', 'Mr. Mario', 'Mr. Luigi']


Now let's get their middle names and last names

In [65]:
sample_text = """
2021-01-01: January 2021, Friday: Mr. Xyz
2021-07-19: July 2021, Monday: Mr L
2021-05-28: May 2021, Friday: Ms. IOU abc
2021/11/10: November 2021, Wednesday: Dr. Moby D
2020-01-01xyzabc: Jan 10th: Dr Dora The Exp
2020-04-02 0-1102-010002: Mrs. Joy
2020-02-01 0-12-010002: Mr. Mario
2019-09-18: suu: Mr. Luigi
2018-07-01: hha iuas
21-10-19: xyuz sus
21-10-18
21-9-20
21-8-14
21/12/02
21/11/14
21/10/21
"""

After the first name, there is an optional space followed by an optional word. Also note that not all names have middle and last names, so these expressions have to be optional.

<b> ?</b> finds the optional space
    
<b>(?:[A-Za-z]\w*)?</b> finds the optional middle and last names

In [66]:
pattern = re.compile(r'(?:Mr|Ms|Mrs|Dr)\.?\s[A-Z]\w* ?(?:[A-Za-z]\w*)? ?(?:[A-Za-z]\w*)?')
matches = pattern.findall(sample_text)
print(matches)

['Mr. Xyz', 'Mr L', 'Ms. IOU abc', 'Dr. Moby D', 'Dr Dora The Exp', 'Mrs. Joy', 'Mr. Mario', 'Mr. Luigi']
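
If you need the title and the name as separate values, one possible variation is to use named groups with finditer. The pattern below is an assumption made for this sketch, not the notebook's own pattern: it lists Mrs before Mr, and replaces the two optional words with <b>(?: [A-Za-z]\w*){0,2}</b>, which allows up to two extra words. It is run on a shortened copy of the sample text.

```python
import re

short_text = """
2021-05-28: May 2021, Friday: Ms. IOU abc
2020-01-01xyzabc: Jan 10th: Dr Dora The Exp
2020-04-02 0-1102-010002: Mrs. Joy
"""

# (?P<title>...) and (?P<name>...) give the two parts readable names.
pattern = re.compile(
    r'(?P<title>Mrs|Mr|Ms|Dr)\.?\s(?P<name>[A-Z]\w*(?: [A-Za-z]\w*){0,2})'
)

people = [(m.group('title'), m.group('name')) for m in pattern.finditer(short_text)]
print(people)
```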


---