
### Problems

In this problem set, we focus on problems related to strings.

   1. Given a long string `s`, count the frequency of the substring `sub`.
   2. Find the location (start, end) of a particular word in a long string. A word can occur more than once, so collect all of its locations in a list of tuples.
   3. Given a list of strings and a particular substring, extract the list of indices of the strings in which that substring occurs. For example, given the list ["Gfg is good", "for Geeks", "I love Gfg", "Gfg is useful"] and the substring "Gfg", the result should be [0, 2, 3].

In [0]:
import re


### Solutions

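For the first problem, we check whether the substring exists and, if it does, count its occurrences with str.count(). The explicit check is not strictly necessary, because str.count() already returns 0 when the substring is absent: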

In [0]:
s1 = "I turned into a cat while I was in the Amazon jungle. After I got back from trip, I wrote an article about it telling everyone how I turned into a cat"
s2 = "I went to the Amazon jungle"

sub = "cat"

def count_substring(s, g):
    """Return the number of non-overlapping occurrences of g in s."""
    if g in s:  # the substring g occurs at least once
        return s.count(g)
    else:
        return 0

print(count_substring(s1, sub))
print(count_substring(s2, sub))

2
0
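
Note that str.count() only counts non-overlapping occurrences. If overlapping matches should be counted as well, one option is to loop over str.find(). The following is a minimal sketch under that assumption; the helper name `count_overlapping` is hypothetical and not part of the solution above:

```python
def count_overlapping(s, g):
    """Count occurrences of g in s, allowing overlaps (hypothetical helper)."""
    if not g:
        return 0  # assumption: an empty substring is treated as occurring zero times
    count = 0
    start = s.find(g)
    while start != -1:
        count += 1
        # advance by one character only, so overlapping matches are found too
        start = s.find(g, start + 1)
    return count

print("aaaa".count("aa"))               # 2 (non-overlapping)
print(count_overlapping("aaaa", "aa"))  # 3 (overlapping)
```

For the strings `s1` and `s2` used above, both approaches give the same counts, since "cat" cannot overlap with itself.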



For the second problem, we need to know how to identify the indices of substrings. Python strings offer several methods for locating substrings:

   - str.find(_sub_, _start_, _end_) returns the lowest index in the string where the substring _sub_ is found within the slice s[_start_:_end_]. It returns -1 if _sub_ is not found;
   - str.rfind(_sub_, _start_, _end_) is similar to str.find(), except that it returns the highest index at which _sub_ is found, i.e. the start of the last occurrence.

Another pair of options is str.index(_sub_, _start_, _end_) and str.rindex(_sub_, _start_, _end_). They behave like str.find() and str.rfind(), with one difference: when the substring is not found, str.find() returns -1, whereas str.index() raises a ValueError.
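
To make the difference concrete, here is a small sketch that searches for a substring that does not occur. The variable names `text` and `position` are illustrative only:

```python
text = "I went to the Amazon jungle"

# str.find() signals a missing substring with -1
print(text.find("cat"))

# str.index() raises ValueError instead, so the call is wrapped in try/except
try:
    position = text.index("cat")
except ValueError:
    position = None  # assumption: None stands for "not found" in this sketch
print(position)
```

Which one to prefer is a design choice: str.find() suits a quick test, while str.index() makes a missing substring impossible to ignore.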

In addition, the re.search(_regex_, _text_) function returns either None (if the pattern does not match) or a `re.Match` object that contains information about the matching part of the string. It stops after the first match, so it is better suited to testing whether a pattern occurs than to extracting every occurrence.

In contrast, re.findall(_regex_, _text_) returns all non-overlapping matches of the pattern in the string as a list of strings (or a list of tuples if the pattern contains more than one capturing group). The string is scanned left to right, and matches are returned in the order found.

re.finditer(_regex_, _text_) scans the string in the same way, but instead of a list it returns an iterator that yields a match object for each match. This is helpful when the number of matches is large, because re.findall() builds the entire list in memory, whereas the iterator produces matches one at a time.

Let's look at some examples of these methods before revealing the solution:

In [0]:
s3 = "The famous musician Franz Liszt created a phenomenon called Lisztomania characterized by intense levels of hysteria demonstrated by fans, akin to the treatment of celebrity musicians today. Liszt was a good friend of Chopin and he even helped Chopin write his piano concertos."
sub = "Liszt"

print(s3.find(sub))
print(s3.rfind(sub))
print(s3.index(sub))
print(s3.rindex(sub))

search_res = re.search(sub, s3)
print(search_res)

26
190
26
190
<re.Match object; span=(26, 31), match='Liszt'>
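
The cell above does not exercise re.findall() and re.finditer(). As a minimal sketch of how their return values differ, the following uses a short illustrative string (`text` is not part of the original example):

```python
import re

text = "Liszt inspired Lisztomania; Chopin admired Liszt."

# re.findall() returns the matched strings themselves, without positions
names = re.findall("Liszt", text)
print(names)

# re.finditer() yields match objects one at a time; each carries its span
spans = [m.span() for m in re.finditer("Liszt", text)]
print(spans)
```

Only the match objects carry positions, which is why re.finditer() is the natural fit when start and end indices are needed.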



Now let's see the solution using re.finditer():

In [0]:
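result = []
# re.escape() ensures sub is matched literally, even if it contains regex metacharacters
for match in re.finditer(re.escape(sub), s3):
    # match.start() is the index of the first character, match.end() is one past the last
    result.append((match.start(), match.end()))
print(result)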

[(26, 31), (60, 65), (190, 195)]
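
Note that the span (60, 65) lies inside "Lisztomania", so the search above finds the substring rather than the standalone word. If only whole words should count, one possible approach is to add word boundaries (`\b`) to the pattern. This is a sketch on a short illustrative string; the names `text`, `word`, and `whole_words` are assumptions made for the example:

```python
import re

text = "Franz Liszt started Lisztomania. Liszt knew Chopin."
word = "Liszt"

# \b matches at a word boundary; re.escape() keeps the word literal
pattern = r"\b" + re.escape(word) + r"\b"

# collect (start, end) for every whole-word match
whole_words = [(m.start(), m.end()) for m in re.finditer(pattern, text)]
print(whole_words)
```

Here the occurrence inside "Lisztomania" is skipped, because no word boundary follows the "t".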



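Problem 3 builds on the previous one: a string contains the substring exactly when str.find() does not return -1, so we keep the indices of the strings that pass this test: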

In [0]:
mylist = ["Gfg is good", "for Geeks", "I love Gfg", "Gfg is useful"]
sub = "Gfg"

# keep the index of every string in which str.find() locates sub
result = [i for i, text in enumerate(mylist) if text.find(sub) != -1]
result

Out[5]: [0, 2, 3]