# Python Data Manipulation

* Now that you have a basic understanding of the Python programming language, we can use it to manipulate data: strings, lists, and CSV files



## String Methods

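* Strings come with many built-in *methods*, functions attached to the string type, that inspect or manipulate the text contents of the string. Strings cannot be changed in place, so these methods return a new value rather than modifying the original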
* See the [Python documentation](https://docs.python.org/3/library/stdtypes.html#string-methods) for a full list of string methods

In [None]:
# use the lowercase method on a string literal
"The Batman".lower()

In [None]:
# Put a string in a variable
superhero = "The Batman"

superhero.upper()

* The `split` method is very handy

In [None]:
# Make a longer string and split it
superhero = "The Batman bit the frog; he died of poison"
superhero.split()

In [None]:
# Split on a semicolon instead of a space
superhero.split(";")
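
* A small sketch of two more things `split` can do, plus its inverse `join`. The variable names (`sentence`, `parts`, `first_word`, `rest`, `rejoined`) are made up for illustration

```python
sentence = "The Batman bit the frog; he died of poison"

# split on the semicolon; the second piece keeps its leading space
parts = sentence.split(";")
print(parts)

# a second argument limits the number of splits, counting from the left
first_word, rest = sentence.split(" ", 1)
print(first_word)
print(rest)

# join is the inverse of split: glue a list of strings back together
rejoined = ";".join(parts)
print(rejoined == sentence)
```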

In [None]:
# Check whether every character in the string is a digit
"50000".isdigit()

In [None]:
# Doesn't work for money: the dollar sign is not a digit
"$50000".isdigit()

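* One possible way around this, as a rough sketch: strip the characters we know are not digits before calling `isdigit`. The `looks_like_money` helper is a made-up name, not a built-in, and it only handles a leading dollar sign and comma separators (decimals like `$50.00` still fail)

```python
def looks_like_money(text):
    # drop surrounding whitespace, a leading dollar sign, and comma separators
    cleaned = text.strip().lstrip("$").replace(",", "")
    # isdigit is True only if every remaining character is a digit
    return cleaned.isdigit()

print(looks_like_money("$50000"))
print(looks_like_money("$50,000"))
print(looks_like_money("fifty bucks"))
```
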
* String formatting is a *super* useful way to programmatically create strings
* This is a very powerful system, so definitely [check out the documentation](https://docs.python.org/3/library/string.html#formatstrings)

In [None]:
# create a string template
template_string = "My name is {}"

In [None]:
name = "Dr. Strange"
template_string.format(name)

In [None]:
# create string template variables
template_string = "Oh we are using made up names. Hello, {you}! My name is {me}"
my_name = "Spiderman"

In [None]:
# format the template with data values
template_string.format(me=my_name, you=name)
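
* As a taste of what else the format system can do, here is a small sketch using format specifications (the part after the colon inside the braces). The template text and the values are made up for illustration

```python
# the ,.2f spec adds thousands separators and keeps two decimal places
price_template = "The Batmobile costs ${price:,.2f}"
print(price_template.format(price=1234567.891))

# >10 right-aligns in a field 10 characters wide, 03d pads an integer with zeros
row_template = "{name:>10} | {number:03d}"
print(row_template.format(name="Robin", number=7))
```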

* Use triple quotes to make strings with newlines

In [None]:
# Make a Frosty string and split on the lines
multiline_example = """Nature’s first green is gold,
Her hardest hue to hold.
Her early leaf’s a flower;
But only so an hour.
Then leaf subsides to leaf,
So Eden sank to grief,
So dawn goes down to day
Nothing gold can stay."""

multiline_example.splitlines()

* What if we wanted to split the lines and then split the words in each line?

In [None]:
# This raises an AttributeError: splitlines() returns a list, and lists have no split method
multiline_example.splitlines().split()

* When we use `splitlines` we get back a list, and lists have no `split` method. To process the values in the list, we need to loop over them
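* A minimal sketch of such a loop, using a made-up two-line string so the cell stands on its own (the names `two_lines` and `words_per_line` are just for illustration)

```python
two_lines = """first line here
second line"""

words_per_line = []
# splitlines gives a list of strings, so loop over it
for line in two_lines.splitlines():
    # each line is a string, so split works on it
    words_per_line.append(line.split())

print(words_per_line)
```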

---

## List Comprehensions

* [List comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions) provide a concise way to create lists.
* If you find yourself creating lists by performing operations in a `for` loop, a list comprehension is a more *pythonic* way to do this
* For example, squaring all the values in a list

In [None]:
# Create a list of squares
# Code from Python documentation
squares = []
for x in range(10):
    squares.append(x**2)

squares

* Here is the same problem solved with a list comprehension

In [None]:
# compute the squares and save in a variable called squares
squares = [x**2 for x in range(10)]
squares

* If you find yourself creating an empty list, looping, and appending to that list then you might consider a list comprehension
* A list comprehension consists of brackets containing an expression followed by a `for` clause, then zero or more `for` or `if` clauses. 
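* As a quick sketch of those extra clauses (the variable names are made up): an `if` clause filters values, and a second `for` clause behaves like a nested loop

```python
# keep only the squares of even numbers by adding an if clause
even_squares = [x**2 for x in range(10) if x % 2 == 0]
print(even_squares)

# two for clauses work like nested loops, with the first one as the outer loop
pairs = [(letter, number) for letter in "ab" for number in range(2)]
print(pairs)
```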

* You can call functions in list comprehensions to do more powerful processing

In [None]:
# create a list of super people
list_of_strings = ['Bruce WAYNE', "  The JOKER", "   ThAnOS   "]

# create a function to clean a string
def clean_string(to_clean):
    # remove whitespace and lowercase
    return to_clean.strip().lower()

# use a list comprehension to process our list of super people
cleaned = [clean_string(x) for x in list_of_strings]
cleaned

* And this works with string methods, so now we can split the Robert Frost poem

In [None]:
# Make a Frosty string and split on the lines
multiline_example = """Nature’s first green is gold,
Her hardest hue to hold.
Her early leaf’s a flower;
But only so an hour.
Then leaf subsides to leaf,
So Eden sank to grief,
So dawn goes down to day
Nothing gold can stay."""

# loop over each line and split each line into word tokens
tokenized = [line.split() for line in multiline_example.splitlines()]
tokenized

---

## Working With CSV Files

* CSV (comma-separated values) files store tabular data. They are like very simple spreadsheets (think Excel), except that the content is stored as plain text.
* Python has a CSV parser as part of the standard library: the `csv` module.
* The `csv` module provides functions and classes that make it easier to parse and iterate through CSV files.
 

In [None]:
# load the csv module
import csv

* Now we need to tell Python to open `pgh-mayors.csv` and that the resulting `mayors_file` file handle should be processed as a CSV file.
* We do that by calling the `reader()` function of the `csv` module

In [None]:
# open the mayors file
with open("pgh-mayors.csv", 'r') as mayors_file:
    # Create a CSV reader 
    mayors = csv.reader(mayors_file)

* At this point, the entire CSV file is treated as a table: a collection of rows and columns
* We can iterate (loop) through this table and get access to each individual row, just as we would loop over the lines of a text file
* But the `csv` module automatically splits each row into separate values!

In [None]:
# open the mayors file
with open("pgh-mayors.csv", 'r') as mayors_file:
    # Create a CSV reader 
    mayors = csv.reader(mayors_file)
    
    # loop over the file and print the row contents 
    for mayor in mayors:
        print(mayor)
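
* To see why the `csv` module is worth using over a plain `split(",")`, here is a small sketch with made-up data held in a string (`io.StringIO` makes a string behave like an open file, so no file on disk is needed). The sample row is invented and is not from `pgh-mayors.csv`

```python
import csv
import io

# an invented row: the second value contains a comma, so it is quoted
sample = 'id,name,note\n1,"Wayne, Bruce",hero\n'

# naive splitting breaks the quoted value into two pieces
naive = sample.splitlines()[1].split(",")
print(naive)

# csv.reader understands the quotes and keeps the value together
rows = list(csv.reader(io.StringIO(sample)))
print(rows[1])
```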
    

* You probably noticed that each row (the `mayor` variable in the loop) is just a list: it holds the values from each column of that row.
* You can access individual columns exactly the same way you would access values in a list.
* For example, the name of each mayor is in a column called 'mayor', which is the second column and therefore has an index of 1

In [None]:
# open the mayors file
with open("pgh-mayors.csv", 'r') as mayors_file:
    # Create a CSV reader 
    mayors = csv.reader(mayors_file)
    
    # loop over the file and print each mayor's name (column index 1)
    for mayor in mayors:
        print(mayor[1])
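
* If counting column positions feels fragile, the standard library also has `csv.DictReader`, which uses the header row as keys so that columns can be looked up by name. A hedged sketch with invented rows in a string (the column names `number` and `term` and all sample values are assumptions, not the real contents of `pgh-mayors.csv`)

```python
import csv
import io

# invented stand-in data; the real file may use different column names
sample = "number,mayor,term\n1,Example Mayor A,1900-1904\n2,Example Mayor B,1904-1910\n"

names = []
# DictReader consumes the header row and yields one dictionary per data row
for row in csv.DictReader(io.StringIO(sample)):
    names.append(row["mayor"])

print(names)
```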
    

* You probably also noticed that the first row does not contain data: it's just the column headers
* Before we can do any processing on the mayors, we need to skip the header row
* We can do this with the `next()` function, which pulls the first row off the reader so that the loop only sees the data rows

In [None]:
# open the diabetes file
with open("diabetes.csv", 'r') as diabetes_file:
    # Create a CSV reader 
    diabetes_data = csv.reader(diabetes_file)

    # use next to pull the header row off the reader (not the raw file)
    headers = next(diabetes_data)
    print(headers)

    # loop over the remaining rows of the file
    for row in diabetes_data:
        print(row[1]) # print only the values for the chol column


In [None]:
# open the mayors file
with open("pgh-mayors.csv", 'r') as mayors_file:
    # Create a CSV reader 
    mayors = csv.reader(mayors_file)
    
    # save the first row to a variable called headers
    headers = next(mayors)
    print("Headers:", headers)
    
    # loop over the file and print the row contents 
    for mayor in mayors:
        print(mayor)
    

## Who Is the Longest-Serving Mayor?


In [None]:
import csv

# read every row (including the header row) into a list of lists
with open("pgh-mayors.csv", "r") as mayors_file:
    mayors = list(csv.reader(mayors_file))

# peek at the header row and the first data row
mayors[0:2]

In [None]:
# create an empty dictionary to hold term lengths
term_length = {}

# loop over all the mayors excluding the first line
for mayor in mayors[1:]:
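    # get the column with term information
    term = mayor[2]
    # split on - and unpack into two variables
    start,end = term.split("-")
    # compute the length by subtracting; start and end are still strings,
    # and subtraction is not defined for strings, so this raises a TypeError
    length = end - start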
    # save the mayor name and length in our dictionary
    term_length[mayor[1]] = length

In [None]:
# create an empty dictionary to hold term lengths
term_length = {}

# loop over all the mayors excluding the first line
for mayor in mayors[1:]:
    # get the column with term information
    term = mayor[2]
    # split on - and unpack into two variables
    start,end = term.split("-")
    # convert both years to integers, then subtract
    # (a term that is not a plain start-end pair of years still raises an error)
    length = int(end) - int(start)
    # save the mayor name and length in our dictionary
    term_length[mayor[1]] = length

In [None]:
# create an empty dictionary to hold term lengths
term_length = {}

# loop over all the mayors excluding the first line
for mayor in mayors[1:]:
    # get teh column with term information
    term = mayor[2]
    # split on - and unpack into a term variable
    terms = term.split("-")
    
    if len(terms) < 2:
        print(terms)
    else:
        start, end = terms
    
    # compute the length by mathing
    length = int(end) - int(start)
    # save the mayor name and length in our dictionary
    term_length[mayor[1]] = length

In [None]:
# create an empty dictionary to hold term lengths
term_length = {}

# loop over all the mayors excluding the first line
for mayor in mayors[1:]:
    # get the column with term information
    term = mayor[2]
    # split on - and keep the pieces in a list
    terms = term.split("-")
    
    # compute the length in years, handling the special cases
    if len(terms) < 2:
        # a single year: count it as a one-year term
        length = 1
    elif "present" in terms[1]:
        # still in office: measure up to 2023
        length = 2023 - int(terms[0])
    else:
        start, end = terms
        length = int(end) - int(start)
    # save the mayor name and length in our dictionary
    term_length[mayor[1]] = length
term_length
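
* The term-length rules can also be packed into a small function, which makes them easy to try out on individual values. This is only a sketch: `term_years` is a made-up helper name, the sample strings are invented, and `current_year=2023` mirrors the hard-coded year used in this notebook

```python
def term_years(term, current_year=2023):
    # split the term string into its start and (optional) end pieces
    pieces = term.split("-")
    if len(pieces) < 2:
        # a single year: count it as a one-year term
        return 1
    start, end = pieces[0], pieces[1]
    if "present" in end:
        # still in office: measure up to current_year
        return current_year - int(start)
    return int(end) - int(start)

print(term_years("1900-1904"))
print(term_years("1950"))
print(term_years("2020-present"))
```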

In [None]:
mayor_name = ""
longest_term = 0

for mayor, term in term_length.items():
    if term > longest_term:
        mayor_name = mayor
        longest_term = term
    
template = "{m} served the longest term with {t} years."
print(template.format(m=mayor_name, t=longest_term))
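
* The search for the largest value can also be written with the built-in `max` function. A small sketch on an invented dictionary (the names and numbers are placeholders, not real results)

```python
# placeholder data standing in for the term_length dictionary
example_lengths = {"Example Mayor A": 4, "Example Mayor B": 12, "Example Mayor C": 6}

# key=example_lengths.get tells max to compare the values, not the names
longest_serving = max(example_lengths, key=example_lengths.get)

template = "{m} served the longest term with {t} years."
print(template.format(m=longest_serving, t=example_lengths[longest_serving]))
```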