# A Deeper Look at Strings
<hr style="border:1px solid gray">

We have seen the data type `str` for strings, which is technically called a *text sequence type*. First, we will revisit the `str` data type. Then, we will explore strings in a bit more detail.

<hr style="border:1px solid gray">

## Strings - the `str` data type

The `str` data type is a special sequence type - a *text* sequence type. Recall that we create strings by providing a sequence of zero or more characters enclosed in either a pair of single quote characters, `'`, or a pair of double quote characters, `"`.

In [None]:
# Create string1, print it and its type
string1 = 'My name is Mateo.'
print('string1 =', string1)
print('   type =', type(string1))

In [None]:
# Create string2, print it and its type
string2 = 'Hello, my name is Mateo!'
print('string2 =', string2)
print('   type =', type(string2))

In [None]:
# Create the string user_name
user_name = 'Mateo'

In [None]:
# Check to see if user_name is in string1 or string2
user_name in string1 

In [None]:
user_name in string2

## Strings are Immutable

Unlike a `list`, strings are immutable. In this sense, they are similar to a `tuple`. Let's try changing a string and see what happens.

In [None]:
# First, look at the first letter of user_name
print('first letter of user_name is', user_name[0])

In [None]:
# Now try to change it
user_name[0] = 'K'
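
Since item assignment on a string raises a `TypeError`, the usual workaround is to build a *new* string from pieces of the old one. The following is a minimal sketch of that idea; the variable `new_name` is introduced here only for illustration.

```python
# Strings cannot be modified in place, so build a new one instead
user_name = 'Mateo'

try:
    user_name[0] = 'K'           # item assignment is not supported by str
except TypeError as err:
    print('TypeError:', err)

# Slice off the first letter and concatenate a replacement in front
new_name = 'K' + user_name[1:]
print('new_name  =', new_name)
print('user_name =', user_name)  # the original string is unchanged
```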

## Special Characters

Strings can also contain special characters. For example, a newline is represented as `\n`, a tab as `\t`, and a backspace as `\b`. Let's try them.

In [None]:
# Try tab \t in the middle of string
print('Here is some\ttext with two\ttabs in it.')

In [None]:
# Try newline \n
print('Here is some\ntext with two\nnewlines in it.')

In [None]:
# What about the backspace \b
# What do you think will be printed?
print('54\b65')
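
How a special character is displayed can depend on where the output is shown (a terminal and a notebook may render a backspace differently). One way to see what is actually stored in the string is to look at its `repr()` and its length. The short sketch below uses two illustrative variables, `tabbed` and `backspaced`, that are not part of the cells above.

```python
# Each escape sequence is stored as a single character
tabbed = 'a\tb'
print(tabbed)            # the tab is rendered as whitespace
print(repr(tabbed))      # repr() shows the escape sequence itself
print(len(tabbed))       # 3 characters: 'a', tab, 'b'

backspaced = '54\b65'
print(repr(backspaced))  # the backspace shows up as \x08
print(len(backspaced))   # 5 characters: the backspace is still in the string
```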

## Raw and Formatted String Literals

As we saw above, we can include newline and tab characters within a string by using `\n` and `\t`, respectively. If you want to have the characters `\n` show up in the string instead of being replaced with a newline, then you have two options:

1. Escape it by adding an additional backslash: `\\n`.
2. Make the string a **raw string literal** by preceding the opening quotation mark with the letter `r`. 

In [None]:
# Create a string with a new line character in it
s1 = 'String over \n two lines?'
print(s1)

In [None]:
# Create a string with escaped new line character
s2 = 'String over \\n two lines?'
print(s2)

In [None]:
# Create a raw literal string
s3 = r'String over \n two lines?'
print(s3)

### Raw String Literals
In addition to using raw string literals as shown above, you will often encounter them when using **regular expressions** in text analysis projects.
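
As a rough illustration of why raw strings and regular expressions go together, the sketch below uses Python's built-in `re` module to pull four-digit numbers out of a sentence. The pattern and the sample sentence are made up for this example. In a regular expression `\b` means "word boundary", whereas in an ordinary string literal `\b` is a backspace, so the pattern is written as a raw string.

```python
import re

# \d matches a digit and \b matches a word boundary in a regular expression.
# Writing the pattern as a raw string keeps the backslashes exactly as typed.
pattern = r'\b\d{4}\b'

sample_text = 'Founded in 1975; revenue figures are for 2020.'
years = re.findall(pattern, sample_text)
print(years)
```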

### Formatted String Literals

To help with formatting when printing out strings, Python provides the concept of *formatted string literals*, also called f-strings. You can include the value of Python expressions inside a string by prefixing the string with an `f` or `F` and writing expressions as `{expression}`. We have already seen examples of this concept.

In [None]:
# Create an f-string
my_var = 3*3
my_fstring = f'The value of my_var is {my_var}'
print(my_fstring)

In [None]:
my_fstring2 = f'The value of my_var with some formatting is {my_var:0.2f}'
print(my_fstring2)
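
The part after the colon inside the braces is a *format specifier*. The cell below is a small sampler of a few commonly used specifiers. The variables `price` and `count` are illustrative values chosen for this sketch.

```python
# A few common format specifiers inside f-strings
price = 1234.5678
count = 42

print(f'Two decimal places : {price:.2f}')
print(f'Thousands separator: {price:,.2f}')
print(f'Right-aligned width: {count:>6}|')
print(f'Zero-padded        : {count:05d}')
print(f'Expression         : {count * 2}')
```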

### Concatenating Strings

In [None]:
# Concatenate two strings together
# The + operator is "overloaded", so it works on strings in addition to numbers
string1 + string2

### Repeating Strings

In [None]:
# Repeating strings
# The * operator is also overloaded, allowing it to work on strings
string2 * 4

### Other Useful String Functions and Methods

Let's explore a few of the most commonly used string operations.

### Length of Strings

In [None]:
# How long is a string?
# Use the len() function
print('length of string1 =', len(string1))
print('length of string2 =', len(string2))

### Changing the Case of Strings

In [None]:
# What if we want to convert everything to lower case or upper case
# Convert string1 to lower
print(string1.lower())

In [None]:
# Convert string1 to upper
print(string1.upper())

In [None]:
# NOTE: Neither of these modifies the original string1. Instead, each returns a new string.
print(string1)

<hr style="border:1px solid gray">

<font color='red' size = '5'> Student Exercise </font>

Complete the following tasks in the empty **Code** cells below.

1. Convert `string2` to lower case and print it out.
2. Convert `string2` to upper case and print it out.
3. If you repeat `string1` twice, repeat `string2` three times, and combine them, how many characters are in that string?

<hr style="border:1px solid gray">

In [None]:
# 1. Convert `string2` to lower case and print it out.


In [None]:
# 2. Convert `string2` to upper case and print it out.


In [None]:
# 3. If you repeat `string1` twice, repeat `string2` three times,
#    and combine them, how many characters are in that string?


<hr style="border:1px solid gray">

## Cleaning Strings

A common need when cleaning text data is to remove white space from the **beginning and end** of strings. We can use the `strip` method to accomplish this. White space includes space, tab, and newline characters.

In [None]:
# Create a string with spaces at beginning and end
badString = '     Spaces at the beginning and end.     '
print('before:|', badString, '|', sep='')
print('after :|', badString.strip(), '|', sep='')

In [None]:
# Create a string with spaces at beginning and end
# Add newline and tab characters and spaces in middle
# Will strip work?
badString2 = '''     Spaces at the beginning and end.     Now let's try newline \n and
    even a tab \t or \t two to see what happens.\n'''
print('before:|', badString2, '|', sep='')
print()
print('after :|', badString2.strip(), '|', sep='')

Well, that did not work. `strip()` removes whitespace only at the beginning and end of a string (including the trailing newline, as we just saw), so the whitespace in the middle is left untouched. Luckily for us, there is another useful method that will help out. Calling `split()` with no arguments breaks the string apart at every run of spaces, tabs, and newlines and puts the words into a `list` (remember `list`s?). We can then reconstruct the string with single spaces between the words by using the `join()` method. Let's try it.


In [None]:
# Split it and look at the list
badStringList = badString2.split()
print(badStringList)

In [None]:
# Now, reconstruct with join
goodString = ' '.join(badStringList)
print(goodString)
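
The split-then-join idiom above can be wrapped in a small helper so it is easy to reuse. This is only a sketch: the function name `normalize_whitespace` and the sample string `messy` are hypothetical names introduced here.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace into one space and trim both ends."""
    # split() with no arguments splits on any run of spaces, tabs, or newlines
    return ' '.join(text.split())


messy = '  too   many\tspaces\n and  lines \n'
cleaned = normalize_whitespace(messy)
print('before:|', messy, '|', sep='')
print('after :|', cleaned, '|', sep='')
```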

Similar to `strip`, there are also `lstrip` and `rstrip`, which strip off whitespace at the beginning (the left) and at the end (the right), respectively.
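
To make the difference between the three methods visible, the sketch below prints the `repr()` of each result so that any remaining whitespace shows up as escape sequences. The string `padded` is an illustrative value made up for this example.

```python
# Compare lstrip, rstrip, and strip on the same string
padded = '\t  left and right  \n'

print(repr(padded.lstrip()))  # leading whitespace removed
print(repr(padded.rstrip()))  # trailing whitespace removed
print(repr(padded.strip()))   # both ends removed
```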

<hr style="border:1px solid gray">

<font color='red' size = '5'> Student Exercise </font>

In the **Code** cell below is a string variable named `badString3` containing "extra" whitespace at both the beginning and end (but not in the middle).

Complete the following tasks in the empty **Code** cells below the cell that contains `badString3`. Be sure to run that cell of code before trying your own code.

1. Strip all the whitespace from the beginning of `badString3`.
2. Strip all the whitespace from the end of `badString3`.
3. Strip all the whitespace from the beginning and end of `badString3`.

<hr style="border:1px solid gray">

In [None]:
# String with whitespace at beginning and end
badString3 = '\t   \t\n\t\tWhat a fun string!  \t\t\t\n\r\f'
badString3

In [None]:
# 1. Strip all the whitespace from the beginning of `badString3`


In [None]:
# 2. Strip all the whitespace from the end of `badString3`


In [None]:
# 3. Strip all the whitespace from the beginning and end of `badString3`


<hr style="border:1px solid gray">

## Searching for Substrings

We may also want to find substrings within a string. There are various methods for this task, similar to the `in` operator we saw earlier. We will look at `startswith`, `endswith`, `find`, and `replace`. The methods `startswith` and `endswith` do exactly what they say: they return a boolean indicating whether the string starts or ends with the specified substring. The `find` method searches for a substring within a string. If the substring is found, the *index* of its first occurrence is returned. If it is not found, `find` returns `-1`. You can use `replace` to replace one substring with another within a string. By default, it replaces *all* occurrences of the substring. An optional *count* argument allows you to specify the maximum number of replacements.

In [None]:
# Create a string
string3 = 'I think text analysis is fun because text is where the hidden messages are.'

In [None]:
# Does string3 start with 'I'?
print(string3.startswith('I'))

In [None]:
# What about case? Test with 'i'
print(string3.startswith('i'))

In [None]:
# Does string3 end with 'text'?
print(string3.endswith('text'))

In [None]:
# Where does the first occurrence of 'text' occur in string3?
string3.find('text')

In [None]:
# Can we pull out the word text where it first occurs?
print(string3[string3.find('text'):string3.find('text')+4])

In [None]:
# Replace the word "text" with "TEXT" for all occurrences
print(string3.replace('text', 'TEXT'))

In [None]:
# Replace "text" with "TEXT" once - just the first occurrence
print(string3.replace('text', 'TEXT', 1))

In [None]:
# The replace() method does not change the original string
print(string3)
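
The cells above only search for substrings that exist. The sketch below shows the other case, where `find` returns `-1`, and why it is worth checking for that value before using the result as an index. The string `sentence` and the variable `position` are illustrative names introduced for this example.

```python
sentence = 'Text analysis is fun.'

# find() returns -1 when the substring does not appear
position = sentence.find('data')
print(position)

# Check for -1 before slicing; otherwise -1 would be used as a real index
if position == -1:
    print('not found')
else:
    print(sentence[position:position + len('data')])

# find() is case-sensitive, just like the in operator
print(sentence.find('text'))          # -1: the string starts with 'Text'
print(sentence.lower().find('text'))  # 0
```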

### Counting Substrings

You may also want to count the number of occurrences of a substring within a string. You can use the `count` method.

*Thought Exercise:* Are there any other ways to accomplish the same task?

In [None]:
# How many times does "text" show up in string3?
# We are going to convert everything to lower case and then count
string3.lower().count("text")
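
Two details of `count` are easy to overlook: it is case-sensitive, and it counts non-overlapping matches of the *substring* wherever they appear, even inside longer words. The sketch below illustrates both with a made-up string, `phrase`.

```python
phrase = 'The theme of the thesis'

# Case-sensitive: the capitalized 'The' is not counted
print(phrase.count('the'))

# After lowercasing, every match is counted, including those inside words
print(phrase.lower().count('the'))

# Matches do not overlap: 'aaaa' contains 'aa' twice, not three times
print('aaaa'.count('aa'))
```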

<hr style="border:1px solid gray">

<font color='red' size = '5'> Student Exercise </font>

In the **Code** cell below is a long string variable named `overview`. It contains language from a less-than-truckload freight carrier's website.

Complete the following tasks in the empty **Code** cells below the cell that contains `overview`. Be sure to run that cell of code before trying your own code.

1. Convert the string to lower case, storing it in the variable `overview_lower`, and print it out.
2. Convert the string to upper case, storing it in the variable `overview_upper`, and print it out.
3. Count the number of characters in `overview` and print it out.
4. Count the number of times the substring "ltl" occurs.
5. Count the number of times the substring "we" occurs.
6. Count the number of times the **word** "we" occurs. Will this be different than what you found above? Why or why not?
7. How many words are in `overview_lower`?
8. What is the most commonly occurring word in `overview_lower`?

<hr style="border:1px solid gray">

In [None]:
overview = '''Overview

We are one of the largest regional North American less-than-truckload (“LTL”) motor carriers. \
We provide regional LTL services through a single integrated, union-free organization. \
Our service offerings, which include expedited transportation, are provided through an \
expansive network of service centers stretching from the northeast to the southeast. \
In addition to our core LTL services, we offer a range of value-added services including \
container drayage, truckload brokerage and supply chain consulting. More than 98% of our \
revenue has historically been derived from transporting LTL shipments for our customers, \
whose demand for our services is generally tied to industrial production and the overall \
health of the U.S. domestic economy.

We have increased our revenue and customer base over the past ten years primarily through \
organic market share growth. Our infrastructure allows us to provide next-day and second-day \
service throughout the east coast of the continental United States. We believe the growth in \
demand for our services can be attributed to our ability to consistently provide a superior \
level of customer service at a fair price, which allows our customers to meet their supply \
chain needs. Our integrated structure allows us to offer our customers consistent, \
high-quality service from origin to destination, and we believe our operating structure and \
proprietary information systems enable us to efficiently manage our operating costs. Our \
services are complemented by our technological capabilities, which we believe improve the \
efficiency of our operations while also empowering our customers to manage their individual \
shipping needs.

We were founded and incorporated in Pennsylvania in 1975. Our principal executive offices \
are located at 1 North Bell Avenue, Carnegie, Pennsylvania 15106.

Our Industry

Trucking companies provide transportation services to virtually every industry operating in \
the United States and generally offer higher levels of reliability and faster transit times \
than other surface transportation options. The trucking industry is comprised principally of \
two types of motor carriers: LTL and truckload. LTL freight carriers typically pick up multiple \
shipments from multiple customers on a single truck. The LTL freight is then routed through a \
network of service centers where the freight may be transferred to other trucks with similar \
destinations. LTL motor carriers generally require a more expansive network of local pickup and \
delivery (“P&D”) service centers, as well as larger breakbulk, or hub, facilities. In contrast, \
truckload carriers generally dedicate an entire truck to one customer from origin to destination.

Significant capital is required to create and maintain a network of service centers and a fleet \
of tractors and trailers. The high fixed costs and capital spending requirements for LTL motor \
carriers make it difficult for new start-up or small operators to effectively compete with \
established carriers. In addition, successful LTL motor carriers generally employ, and regularly \
update, a high level of technology-based systems and processes that provide information to \
customers and help reduce operating costs.

The American Trucking Associations reported total transportation revenue in the United States \
of $911.2 billion in 2020, which included approximately $41.1 billion for the LTL industry based \
on information reported in Transport Topics. The LTL industry is highly competitive on the basis \
of service and price and has consolidated significantly since the industry was deregulated in 1980. \
The largest 5 and 10 LTL motor carriers accounted for approximately 58% and 83%, respectively, of \
the domestic LTL market in 2020. We believe consolidation in our industry will continue due to \
increased customer demand for transportation providers that can offer both regional and national \
service as well as other complementary value-added services.

Competition

The transportation and logistics industry is intensely competitive and highly fragmented. We \
compete with regional, inter-regional and national LTL carriers and, to a lesser extent, with \
truckload carriers, small package carriers, airfreight carriers and railroads. We also compete \
with, and provide transportation services to, third-party logistics providers that determine both \
the mode of transportation and the carrier. Most of our competitors may have a broader global \
network and a wider range of services than we do. Competition in our industry is based primarily \
on service, price, available capacity and business relationships. We believe we are able to gain \
market share by expanding our capacity in the United States and providing high-quality service at \
a fair price.

Throughout our organization, we continuously seek to improve customer service by, among other \
things, maximizing on-time performance and minimizing cargo claims. We believe our transit times \
are generally faster and more reliable than those of our competitors, in part because of our more \
efficient service center network, use of team drivers and proprietary technology. 

We utilize flexible scheduling and train our employees to perform multiple tasks, which we \
believe allows us to achieve greater productivity and higher levels of customer service than \
our competitors. We believe our focus on employee communication, continued education, \
development and motivation strengthens the relationships and trust among our employees.'''

In [None]:
# 1. Convert the string to lower case, storing it in the variable `overview_lower`, and print it out.


In [None]:
# 2. Convert the string to upper case, storing it in the variable `overview_upper`, and print it out.


In [None]:
# 3. Count the number of characters in `overview` and print it out.


In [None]:
# 4. Count the number of times the substring "ltl" occurs


In [None]:
# 5. Count the number of times the substring "we" occurs.


In [None]:
# 6. Count the number of times the **word** "we" occurs.
# Will this be different than what you found above? Why or why not?


In [None]:
# 7. How many words are in `overview_lower`?


In [None]:
# 8. What is the most commonly occurring word in `overview_lower`?

## Ancillary Information

The following links point you to additional resources that you might find helpful in learning this material. 

1. The official Python tutorial about [formatting strings][1].
2. A nice post about [f-strings][2].



-----

[1]: https://docs.python.org/3/tutorial/inputoutput.html
[2]: https://realpython.com/python-f-strings/

**&copy; 2022 - Present: Matthew D. Dean, Ph.D.   
Clinical Associate Professor of Business Analytics at William \& Mary.**