# Practical 3-1: Strings and Text Processing

## Part 1: Strings

### 1.1 Comparing Strings Using == and the is Keyword

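▪ The **==** operator checks whether two objects have equal values.

▪ The **is** operator checks whether two variables point to the same object in memory. For equal strings the result depends on whether Python reuses the same object, so **is** should not be used to compare values.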

In [None]:
string_1 = 'hello world'
string_2 = "hello world"

print(string_1 == string_2)
print(string_1 is string_2)

In [None]:
string_3 = 'hello world'
print(string_1 is string_3)

### 1.2 id()

▪ The **id()** function returns an integer that identifies the specified object; it is unique among the objects that exist at the same time.

In [None]:
print(id(string_1))
print(id(string_2))
print(id(string_3))
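
The relationship between ==, is, and id() can be sketched with lists, since every list display creates a new object and the result does not depend on how the interpreter stores string literals. The names list_a, list_b, and alias are illustrative only.

```python
# Illustrative names; each [...] display below creates a new list object.
list_a = [1, 2, 3]
list_b = [1, 2, 3]   # equal value, but a separate object
alias = list_a       # a second name for the same object

print(list_a == list_b)           # True: the values are equal
print(list_a is list_b)           # False: two distinct objects
print(list_a is alias)            # True: both names refer to one object
print(id(list_a) == id(alias))    # True: "is" compares the identities that id() reports
```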

### 1.3 Escape Sequence

▪ An escape sequence starts with a backslash, which signals that the character following it has a special meaning.

In [None]:
print("Bobby's World")

In [None]:
print('Bobby\'s World')

In [None]:
print("This is printed\nin two lines")
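
A few more escape sequences, together with a raw string (prefix r) that switches escape processing off, are sketched below. The variable names are illustrative.

```python
tabbed = "Name:\tBobby"        # \t is a single tab character
backslash = "C:\\temp"         # \\ is one literal backslash
raw = r"first\nsecond"         # raw string: \n stays as two characters

print(tabbed)
print(backslash)
print(raw)
print(len("a\nb"), len(r"a\nb"))   # 3 4
```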

### 1.4 Referencing a File

<img src="properties.png" width="300">

▪ Python allows us to use macOS/Linux-style forward slashes "/" in file paths, even on Windows.

▪ Therefore, we can refer to the file as 'C:/Users/narae/Desktop/alice.txt'.

https://sites.pitt.edu/~naraehan/python3/file_path_cwd.html

In [None]:
print("C:/Users/User/Desktop/Files")

In [None]:
# SyntaxError: "\U" starts a Unicode escape, so single backslashes do not work here
print("C:\Users\User\Desktop\Files")

In [None]:
print("C:\\Users\\User\\Desktop\\Files")
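
Two other ways to write a Windows path are sketched below: a raw string, and the standard library's pathlib module. PureWindowsPath is used here on the assumption that the code may run on a non-Windows machine; it only manipulates the path text and never touches the disk.

```python
from pathlib import PureWindowsPath

escaped = "C:\\Users\\User\\Desktop\\Files"
raw = r"C:\Users\User\Desktop\Files"      # raw string: backslashes are kept as typed
print(escaped == raw)                     # True

# PureWindowsPath accepts forward slashes and prints Windows-style backslashes.
folder = PureWindowsPath("C:/Users/User/Desktop/Files")
print(folder)              # C:\Users\User\Desktop\Files
print(folder.as_posix())   # C:/Users/User/Desktop/Files
print(folder.name)         # Files
```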

### 1.5 Multiline Strings: Line Continuation Character

▪ In Python, a backslash "\\" can also be used as a line continuation character.

In [None]:
# Without a continuation character, the second literal is a separate statement and is not joined
message = "two lines "
"of text"
print(message)

In [None]:
# A backslash \ at the end of a line continues the statement, so the adjacent string literals are joined into one string.
message = "two lines "\
"of text"
print(message)

In [None]:
total = 1 + 2\
    + 3
    
print(total)

### 1.6 Multiline Strings: Parentheses

▪ A long string can also be split across lines inside parentheses "()", where adjacent string literals are joined automatically.

In [None]:
print("two lines of text "
    'in code cell') 

In [None]:
text = ("several lines " 
        "of text "
        "in code cell")
print(text)

### 1.7 Multiline Strings: Triple Single Quotes or Triple Double Quotes

▪ To keep line breaks and indentation exactly as typed, use triple quotation marks (""" or ''').

In [None]:
print("""several
    lines of
        text""")

In [None]:
my_string = """Hello, welcome to
            the world of Python"""
print(my_string)
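
Because a triple-quoted string keeps every space, the indentation used to line up the code becomes part of the text. One possible way to remove it, sketched here with the standard library's textwrap.dedent(), is shown below; the variable names are illustrative.

```python
import textwrap

indented = """Hello, welcome to
    the world of Python"""
print(repr(indented))   # the four spaces are part of the string

# The backslash after the opening quotes suppresses the first newline;
# dedent() then removes the indentation shared by every line.
cleaned = textwrap.dedent("""\
    Hello, welcome to
    the world of Python""")
print(cleaned)
```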

### 1.8 Formatting String: String format()

▪ The format() method was introduced for handling complex string formatting more efficiently.

▪ Format strings contain curly braces {} as placeholders (replacement fields), which are replaced by the arguments passed to format().

▪ Float precision with the .format() method uses the syntax {[index]:[width][.precision][type]}

In [None]:
x = 'looked'

print("Misha {} and {} around.".format('walked', x))

In [None]:
print("My name is {} and weight is {} kg!".format('Zara', 21))

In [None]:
x = 12.3456789

print('The value of x is {0:.2f}.\n'.format(x))

In [None]:
height = 180.3152
weight = 81.65431

y = "My height and weight are {0:.2f} and {1:.3f} respectively.".format(height, weight)
print(y)
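
The examples above use the precision part of the syntax only. A small sketch of the width part follows, along with the f-string form (Python 3.6+), which accepts the same format specification. The name padded is illustrative.

```python
x = 12.3456789

padded = "{0:8.2f}".format(x)     # width 8, 2 decimal places, right-aligned
print("[" + padded + "]")         # [   12.35]

# An f-string uses the same specification after the colon.
print(f"{x:.2f}")                 # 12.35
```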

### 1.9 Loop Through a String

▪ We can iterate through a string using a for loop.

In [None]:
# Count the occurrences of a specific letter in a string; both are read from user input
sentence = input("Please enter a sentence: ")
letter = input("Please enter a letter: ")

count = 0
for item in sentence:
    if item == letter:
        count += 1

message = "{} {} '{}' found.".format(count, "letter" if count == 1 else "letters", letter)
print(message)
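
The same loop can be wrapped in a function so that it can be tried without typing input each time. The function name count_letter and the sample sentence are illustrative; the built-in str.count() method gives the same result for a single character.

```python
def count_letter(sentence, letter):
    """Count how many characters in sentence equal letter (case-sensitive)."""
    count = 0
    for item in sentence:
        if item == letter:
            count += 1
    return count

sample = "The quick brown fox jumps over the lazy dog."
print(count_letter(sample, "o"))   # 4
print(sample.count("o"))           # 4, the built-in equivalent
```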

### 1.10 Accessing Character(s) in a String

In [None]:
#      01234567890123456789012345678901234567890123
fox = "The quick brown fox jumps over the lazy dog."

print(fox[0])  # The first character
print(fox[4])  # The fifth character
print(fox[-1]) # The last character

### 1.11 String Slicing: Positive Indexing

<img src="string-slicing.png" width="400">

In [None]:
dog = "the lazy dog"

print(dog[:3]) # From beginning until index 3-1
print(dog[9:]) # From 9 until the end
print(dog[4:8]) # From index 4 until index 8-1

### 1.12 String Slicing: Negative Indexing

In [None]:
print(dog[-12:3])  # From index -12 (the first character) until index 3-1
print(dog[-11:-4]) # From index -11 until index -4-1
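
A negative index i refers to the same position as len(s) + i. The sketch below checks this for the slices above by writing each one with its positive equivalent.

```python
dog = "the lazy dog"
n = len(dog)                    # 12

print(dog[-1] == dog[n - 1])    # True: -1 is the last character
print(dog[-12:3])               # the      (same as dog[0:3])
print(dog[-11:-4])              # he lazy  (same as dog[1:8])
print(dog[-3:])                 # dog      (the last three characters)
```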

### 1.13 String Membership Test

In [None]:
string = 'programming'

print('a' in string)
print('a' not in string)

In [None]:
# Searching for a substring using the in keyword
fox = "The quick brown fox jumps over the lazy dog"

letter = input("Search a letter: ") 

if letter in fox:
    print(letter, 'found')
else:
    print(letter, 'not found')

### 1.14 Modifying an Immutable String

In [None]:
# Raises TypeError: strings are immutable, so item assignment is not supported
string[5] = 'a'
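
Since a string cannot be changed in place, a common workaround is to build a new string from slices of the old one. The sketch below catches the error first and then creates the modified copy; the name modified is illustrative.

```python
string = 'programming'

try:
    string[5] = 'A'              # item assignment is not supported for str
except TypeError as error:
    print("TypeError:", error)

# Build a new string from slices instead; the original is unchanged.
modified = string[:5] + 'A' + string[6:]
print(modified)                  # progrAmming
print(string)                    # programming
```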

### 1.15 Deleting a String

In [None]:
del string

In [None]:
# Raises NameError: the name string no longer exists after del
print(string)

### 1.16 dir()

▪ The **dir()** function returns the names of all attributes and methods of the specified object, without their values.

In [None]:
# Check the type of a variable before listing the attributes and methods available for it
string = 'Python Programming'

print(type(string))

In [None]:
print(dir(str))

In [None]:
print(dir(string))

In [None]:
# We can use help() to find out more about str's index() method
# help() prints the documentation itself and returns None, so no print() is needed
help(str.index)

In [None]:
age_list = [70, 60]

print(type(age_list))

In [None]:
print(dir(age_list))

In [None]:
help(list.sort)
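
The output of dir() includes many names that begin with underscores. A minimal sketch for listing only the public names is shown below; the variable name public_str_names is illustrative.

```python
# Keep only the names that do not start with an underscore.
public_str_names = [name for name in dir(str) if not name.startswith('_')]
print(public_str_names)

# hasattr() checks for a single attribute without listing everything.
print(hasattr(str, 'upper'))     # True
print(hasattr(str, 'sort'))      # False: sort() belongs to list, not str
```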

### 1.17 String Methods

▪ Below are some other useful String methods:

Method | Functionality
-------|-----------------
s.count(t, start, end) | Counts how many times string t occurs in s, or in the part of s between the optional indices start and end.
s.find(t) | Returns the index of the first instance of string t inside s (-1 if not found).
s.rfind(t) | Returns the index of the last instance of string t inside s (-1 if not found).
s.index(t) | Like s.find(t), except it raises ValueError if t is not found.
s.rindex(t) | Like s.rfind(t), except it raises ValueError if t is not found.
s.isalnum() | Returns True if s has at least 1 character and all characters are alphanumeric, and False otherwise.
s.isalpha() | Returns True if s has at least 1 character and all characters are alphabetic, and False otherwise.
s.isdigit() | Returns True if s has at least 1 character and all characters are digits, and False otherwise.
s.join(text) | Combines the words in text (an iterable of strings) into one string, using s as the glue.
s.split(t) | Splits s into a list wherever t is found (on whitespace if t is omitted).
s.splitlines() | Splits s into a list of strings, one per line.
s.lower() | Returns a lowercased copy of s.
s.upper() | Returns an uppercased copy of s.
s.title() | Returns a title-cased copy of s.
s.strip() | Returns a copy of s without leading or trailing whitespace.
s.replace(t, u) | Returns a copy of s with every instance of t replaced by u.
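
Several of the methods in the table are sketched below on one sample string. The sample text and variable names are illustrative.

```python
s = "  The quick brown fox  "
words = s.strip().split()                  # ['The', 'quick', 'brown', 'fox']
print(words)
print("-".join(words))                     # The-quick-brown-fox
print(s.strip().replace("quick", "slow"))  # The slow brown fox
print(s.count("o"))                        # 2
print(s.find("quick"), s.find("cat"))      # 6 -1

try:
    s.index("cat")                         # index() raises instead of returning -1
except ValueError as error:
    print("ValueError:", error)

# True False True False
print("abc123".isalnum(), "abc123".isalpha(), "123".isdigit(), "".isdigit())
```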


# Part 2: Text Processing

## 2.1 Reading Text from the Local Drive

### 2.1.1 Opening a Text File in a Local Drive

▪ Python has a built-in **open()** function that opens a file and returns a **file object**, which is used to read or modify the file.

https://www.programiz.com/python-programming/file-operation

In [None]:
f = open('C:/Users/User/Desktop/pg2554.txt') 

In [None]:
f = open('C:\\Users\\User\\Desktop\\pg2554.txt') 

### 2.1.2 Reading Text File from a Local Drive

▪ When reading from a file opened in text mode, we can use the **read(size)** method to read at most **size** characters.

In [None]:
print(f.read(8)) # read the first 8 characters

In [None]:
print(f.tell())  # get the current file position

In [None]:
print(f.read(4)) # read the next 4 characters

In [None]:
print(f.seek(0)) # move the file cursor back to the start

In [None]:
raw_local = f.read() # read from the current position to the end of the file
print(raw_local[:201])

In [None]:
print(raw_local)

### 2.1.3 Closing the File

▪ Closing a file with the **close()** method frees up the resources that were tied to the file.

In [None]:
f.close()

### 2.1.4 Opening a Text File with encoding='utf-8' 

▪ UTF-8 is the most popular character encoding on the internet.

▪ It's always good practice to explicitly specify UTF-8 encoding when opening files.

In [None]:
f = open('C:\\Users\\User\\Desktop\\pg2554.txt', encoding='utf-8') 

raw_local = f.read()
print(raw_local[:201])

f.close()
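
The open, read, and close steps can also be written with a with block, which closes the file automatically even if an error occurs. The sketch below first writes a small sample file to a temporary folder so that it does not depend on pg2554.txt being present; the file name sample.txt and its contents are made up for illustration.

```python
import os
import tempfile

# Hypothetical sample file, created only so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("Crime and Punishment\nby Fyodor Dostoevsky\n")

    with open(path, encoding="utf-8") as f:
        first = f.read(8)    # read the first 8 characters
        f.seek(0)            # move back to the start
        whole = f.read()     # read the entire file
    is_closed = f.closed     # True: the with block has closed the file

print(first)
print(whole)
print(is_closed)
```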

## 2.2 Reading Text from the Web

### 2.2.1 urllib.request

▪ We can use the **urlopen** function from the "urllib.request" module to open and read URL contents. The read() method of the response returns the content as bytes.

In [None]:
import urllib.request

url = "http://www.gutenberg.org/files/2554/2554-0.txt"

with urllib.request.urlopen(url) as f:
    print(f.read(300))

In [None]:
import urllib.request

# Without a with block, the response must be closed explicitly
f = urllib.request.urlopen("http://www.gutenberg.org/files/2554/2554-0.txt")

print(f.read(300))
f.close()

### 2.2.2 Reading the Content from Web

In [None]:
import urllib.request

with urllib.request.urlopen('http://www.python.org/') as f:
    raw_html = f.read()
    
print(raw_html[:300])
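
The value returned by read() on a URL response is a bytes object, not a str, which is why the output is shown with a b'...' prefix. The sketch below uses a hard-coded bytes value as a stand-in for downloaded data, so it runs without a network connection, and it assumes the content is UTF-8 encoded. The variable names are illustrative.

```python
# Stand-in for the bytes that f.read() would return; no network access needed.
raw_bytes = "<title>Café Python</title>".encode("utf-8")
print(type(raw_bytes))            # <class 'bytes'>

# decode() turns the bytes into a str, assuming UTF-8 encoding.
text = raw_bytes.decode("utf-8")
print(type(text))                 # <class 'str'>
print(text)

# "é" takes two bytes in UTF-8, so the bytes object is one item longer.
print(len(raw_bytes), len(text))
```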

### 2.2.3 Preprocessing HTML Text

▪ We use a Python library called **BeautifulSoup** (http://www.crummy.com/software/BeautifulSoup/) to get text out of HTML.

▪ Beautiful Soup makes it easy to scrape information from web pages by pulling data out of HTML and XML files.

▪ Beautiful Soup's **Tag.get_text()** method returns the text within the tag. 

▪ **Example**: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [None]:
from bs4 import BeautifulSoup

# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
html_text = BeautifulSoup(raw_html, 'html.parser').get_text()
print(html_text[:300])