# Introduction to String Manipulation and Reading Files

We'll re-examine some of the basic string manipulation we have seen before and then move on to reading files.

## String Manipulation

Recall that a string object in Python is technically a sequence of characters. We can access elements of a string the same way we access elements of any sequence, with the square bracket notation and an index, `[index]`. Indexing starts at 0.
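
As a quick illustration of zero-based indexing, here is a minimal sketch. The string `word` is a hypothetical example and is not used elsewhere in this notebook.

```python
# Hypothetical example string, used only for illustration
word = "Python"

first = word[0]               # indexing starts at 0
last = word[len(word) - 1]    # the last valid index is len(word) - 1
also_last = word[-1]          # negative indices count from the right

print(first, last, also_last)
```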

In [None]:
# Create a silly string
s = "This is just a silly string that contains all kinds of nonsense."

In [None]:
# Let's see how long the string is
print(f"The string s is {len(s)} characters long.")

In [None]:
# Iterate through the string using a basic for loop
# This will print each character on a separate line
for char in s:
    print(char)

In [None]:
# Another way to iterate through the string
for i in range(len(s)):
    print(f"The character returned by calling s[{i}] is {s[i]}")

In [None]:
# You can also slice a string
print(f"s[4:9] = {s[4:9]}")

In [None]:
# Starting at far right-hand-side and going left
print(f"s[-1:-5:-1] = {s[-1:-5:-1]}")

In [None]:
# What if I wanted to completely reverse the sentence?
# YOUR CODE HERE

In [None]:
# Changing the case of a string
print(f"s.lower() = {s.lower()}")
print(f"s.upper() = {s.upper()}")

### Searching Within a String

We can also search within a string using some built-in methods. The method `.find()` returns the index of the first occurrence of the substring you pass to it, or `-1` if it does not find the substring. Optional `start` and `end` arguments let you restrict the search to a range of indices.

In [None]:
# Find the first occurrence of "a"
s.lower().find("a")

In [None]:
# Find the first occurrence of "that"
s.lower().find("that")

In [None]:
# Find the first occurrence of "these"
# "these" is not in the string s, so we get back -1
s.lower().find("these")

In [None]:
# See if there is another "a" after the first one we found before
# Omitting "end" searches from "start" to the end of the string
# Where is the first occurrence of "a" starting from index 14?
s.lower().find("a", 14)

In [None]:
# Any "a" between index 14 and 30
s.lower().find("a", 14, 30)

So, the argument `end` is **exclusive**. Try changing the end to 31.

In [None]:
s.lower().find("a", 14, 31)
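
One way to see why the two calls differ is to compare them with slicing, which uses the same boundaries. The sketch below repeats the definition of `s` so it can run on its own; the other variable names are only illustrative.

```python
s = "This is just a silly string that contains all kinds of nonsense."
lowered = s.lower()

# The character at index 30 is the "a" in "that"
char_at_30 = lowered[30]

# end=30 stops just before index 30, so that "a" is not seen
miss = lowered.find("a", 14, 30)

# end=31 includes index 30, so the "a" is found
hit = lowered.find("a", 14, 31)

# Slices use the same boundaries: the end index is excluded
in_slice_30 = "a" in lowered[14:30]
in_slice_31 = "a" in lowered[14:31]

print(char_at_30, miss, hit, in_slice_30, in_slice_31)
```

Note that `.find()` still reports the index relative to the whole string, not relative to `start`.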

We also have the convenience methods of `startswith()` and `endswith()` that return `True` or `False`.

In [None]:
s.lower().startswith("in the")

In [None]:
s.lower().startswith("this")

In [None]:
s.lower().endswith(".")

In [None]:
s.lower().endswith("sense.")

In [None]:
# Is the string s lower case? What if you call .lower()?
print(f"s.islower() = {s.islower()}")
print(f"s.lower().islower() = {s.lower().islower()}")

#### Checking Character Contents of String

We can check to see if the string contains only alphabetical characters, only digits, or only alphanumerical characters.

In [None]:
# Does s contain only alphabetical characters?
s.lower().isalpha()

In [None]:
# Does s contain only digits?
s.lower().isdigit()

In [None]:
# Does s contain only alphanumeric characters?
s.lower().isalnum()

Those all came back `False`. There are two reasons:

1. There are spaces in the string (whitespace).
2. There is a period at the end of the sentence.
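
To confirm those two reasons, a small sketch can collect every character in `s` that is neither a letter nor a digit. The name `offenders` is just an illustrative choice, and `s` is redefined so the cell runs on its own.

```python
s = "This is just a silly string that contains all kinds of nonsense."

# Collect the distinct characters that are neither letters nor digits
offenders = sorted({char for char in s if not char.isalnum()})

print(offenders)
```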

So, let's try another string to test our intuition.

In [None]:
# testing out .isalpha(), .isalnum(), etc.
test = "1234abcd"

# YOUR CODE
# print out various tests on the string test

Let's go back to our original string `s` and remove the spaces and put the string in lower case.

In [None]:
# Put in lower case and remove the spaces
removeSpaces = "".join(s.lower().split())
removeSpaces

In [None]:
# What happens when we call .isalpha() now?
removeSpaces.isalpha()

In [None]:
# Doesn't like the period (".")
# Let's remove it with rstrip() and try again
removeSpaces.rstrip(".").isalpha()

#### Splitting and Joining Strings

We just saw how to split a string and join it back together. By default, when you call `split()` with no arguments, it uses whitespace as the delimiter for splitting the string. For our sentence `s`, that is what we wanted. We then called `join`, passing it the list of split words and specifying that there should be no characters between the elements of the list (each word in our case) when joining them.
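
The default whitespace behavior is not quite the same as splitting on a single space. The sketch below uses a hypothetical string, `messy`, to show the difference.

```python
# Hypothetical string with two spaces and a tab between words
messy = "one  two\tthree"

default_split = messy.split()      # any run of whitespace is one delimiter
space_split = messy.split(" ")     # every single space is a delimiter

print(default_split)
print(space_split)

# join() places the separator string between the elements of the list
rejoined = "".join(default_split)
print(rejoined)
```

With an explicit `" "` delimiter, the doubled space produces an empty string in the result and the tab is left inside the last element.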

Now, let's create a new string by joining the split original string with a new character, `|`. Then, we'll try to split it using that delimiter.

In [None]:
# Split s after making it lower case
splitS = s.lower().split()
print(splitS)

In [None]:
# Create a newS string that joins all of the elements in the
# list with a pipe symbol "|" between each element
newS = "|".join(splitS)
print(newS)

In [None]:
# Now split the newS using the new delimiter
newS.split("|")

## Reading from Files

We've already seen how useful `pandas` can be for reading in .csv and .xlsx files. These files often contain **structured** data. There are many times when we have unstructured data in files that we also want to analyze. For example, we might want to parse emails, HTML (Hyper Text Markup Language), or JSON (JavaScript Object Notation) files. Each of these file types does have a structure, but it is often not **tabular** like we saw with our .csv and Excel files. In practice, this means we often need to manually examine a representative sample of the files we want to parse to help us write code that automates the task of "reading" them.
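
To make "structured but not tabular" concrete, here is a minimal sketch that parses a small piece of JSON text with the standard library `json` module. The JSON content (keys and values) is made up for illustration and does not come from any file used in this notebook.

```python
import json

# Hypothetical JSON text; the keys and values are invented for this example
raw = '{"sender": "a@example.com", "tags": ["work", "urgent"], "body": {"lines": 3}}'

# Parse the JSON text into Python dictionaries and lists
record = json.loads(raw)

print(type(record))
print(record["tags"])
print(record["body"]["lines"])
```

The values are nested lists and dictionaries rather than rows and columns, which is why we usually have to inspect a sample before writing parsing code.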

### Reading Files

There are several ways to read in a file. At its most basic, you can use the built-in function `open()`.

- [`open()`](https://docs.python.org/3/library/functions.html#open) returns a [file object](https://docs.python.org/3/glossary.html#term-file-object), and is most commonly used with two arguments: `open(filename, mode)`.
- Using the [`with`](https://docs.python.org/3/reference/compound_stmts.html#with) keyword. It is good practice to use the `with` keyword when dealing with file objects. The advantage is that the file is properly closed after its suite finishes, even if an exception is raised at some point. Using `with` is also much shorter than writing equivalent `try-finally` blocks.
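
The comparison between `with` and `try-finally` can be sketched as follows. To keep the example self-contained, it first writes a small throwaway file; the file name `demo.txt` and its contents are assumptions made only for this illustration.

```python
import os
import tempfile

# Create a small throwaway file so the example can run anywhere
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.txt")
with open(path, "w") as out:
    out.write("hello\n")

# Manual version: try/finally guarantees that close() runs
f = open(path, "r")
try:
    manual_text = f.read()
finally:
    f.close()

# Shorter version: the with statement closes the file for us
with open(path, "r") as g:
    with_text = g.read()

print(f.closed, g.closed)
print(manual_text == with_text)
```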

Once the file is open, you can call `read()` which will try to read the entire file. You can also read one line at a time with `readline()`.
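
Here is a minimal sketch of the difference between `read()` and `readline()`. It writes its own two-line temporary file first; the file name and contents are hypothetical.

```python
import os
import tempfile

# Write a two-line throwaway file for the demonstration
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "two_lines.txt")
with open(path, "w") as out:
    out.write("first line\nsecond line\n")

with open(path, "r") as f:
    whole = f.read()        # the entire file as one string
    leftover = f.read()     # nothing is left to read, so this is ""

with open(path, "r") as f:
    line1 = f.readline()    # one line, including its trailing newline
    line2 = f.readline()
    at_end = f.readline()   # "" signals the end of the file

print(repr(whole))
print(repr(line1), repr(line2), repr(at_end))
```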

### HTML file

Let's look at a simple HTML file. All we are going to do is open the file, read in its content, convert it all to lower case, store it in the variable `inputText`, and then explicitly close the file. We can then print `inputText` to see what was in the file. We'll do it two ways.

In [None]:
# 1. With open()
f = open("sample.html", "r")
inputText = f.read().lower()
# Explicitly close the file 
f.close()

print(inputText)

In [None]:
# 1b. Read the first three lines one at a time
f2 = open("sample.html", "r")
print("line 1:", f2.readline(), end="")
print("line 2:", f2.readline(), end="")
print("line 3:", f2.readline(), end="")
    
f2.close()

In [None]:
# 1c. Iterate over the file one line at a time
f3 = open("sample.html", "r")
for line in f3:
    print(line, end="")
    
f3.close()

In [None]:
# 2. Use with statement
with open("sample.html", "r") as f4:
    inputText = f4.read().lower()
    

# We did not explicitly close the file. Is it closed?
print(f4.closed)
print(inputText)

#### Splitting Lines

We saw earlier that we could split a string into pieces using whitespace as the delimiter (by default) with `split()`. There is another way to split -- `splitlines()` which uses newline characters to split. Let's try both with our `inputText` variable.

In [None]:
inputText.split()

In [None]:
inputText.splitlines()
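
Since the output above depends on the contents of `sample.html`, here is a small self-contained sketch of the same comparison. The string `snippet` is a hypothetical stand-in for the file contents.

```python
# Hypothetical two-line HTML snippet standing in for the file contents
snippet = "<h1>my title</h1>\n<p>some text here</p>\n"

words = snippet.split()            # breaks on every run of whitespace
lines = snippet.splitlines()       # breaks only at line boundaries
newline_split = snippet.split("\n")  # leaves a trailing empty string

print(words)
print(lines)
print(newline_split)
```

Unlike `split("\n")`, `splitlines()` does not produce an empty final element when the text ends with a newline.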

### Text Files

Reading a text file (.txt) can be done just like we did with the HTML file. The preferred method is to use the `with` statement because it closes the file for us and forces us to think in the **context** of working with the file while it is open. That is, whatever code we put inside the `with` statement runs with the file open, which can be costly in terms of memory and processing time, so it helps to keep that block short.
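
One way to apply this idea is to do only the reading inside the `with` block and the processing afterwards. The sketch below assumes a hypothetical temporary file so that it can run on its own.

```python
import os
import tempfile

# Hypothetical throwaway file used only for this illustration
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "notes.txt")
with open(path, "w") as out:
    out.write("Line One\nLine Two\n")

# Only the read happens while the file is open
with open(path, "r") as text_file:
    contents = text_file.read()

# The file is already closed here, but the string is still available
lowered_lines = contents.lower().splitlines()

print(text_file.closed)
print(lowered_lines)
```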



In [None]:
with open("wordDoc.txt", "r") as textFile:
    myText = textFile.read()
    
print(myText)

Let's split `myText` using `splitlines()` after changing the case to lower case. We'll use a `for` loop to iterate over each element in the returned list.

In [None]:
for line in myText.lower().splitlines():
    print("line is:\n", line, sep="")
    print()

----

<font color='red' size = '5'> Student Exercise </font>

You have been given a .txt file called `nyTimes.txt` that contains an article from the *New York Times*. In the **Code** cells below, do the following:

1. Read the data from the .txt file into a variable called `article` and print it out.
2. How many characters are in the article?
3. Approximately, how many words are in the article?
4. Which words show up the most? 

-----

In [None]:
# 1. Read in the file to variable article


In [None]:
# 2. How many characters in article?


In [None]:
# 3. Approximately, how many words in article? 


In [None]:
# 4. Which words show up the most? 
