## Now the real fun begins...

Before we start playing with data files, we need to cover one more really important section.

### Loops, conditionals, and functions

If you have used a programming language before, you're probably familiar with the for-loop. For everyone else, a for-loop is a way of iterating through a data structure (a string, a list, a dictionary, etc.) or a file. It executes the same piece of code multiple times, with a variable being updated on every iteration.

In [None]:
## CODE CELL 1
# The go-to first example of a for-loop

for i in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:
    print(i)

The `range()` function is useful here. It can be given up to three parameters: $([start], stop[, step])$ (brackets indicate optional parameters). As with slicing in strings and lists, the stop value itself is excluded, so with the default step of 1 the last number produced is $stop-1$.

In [None]:
## CODE CELL 2
# Do the above more efficiently using the "range" function:

for i in range(1,11):
    print(i)
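
The optional `step` parameter isn't used in the cell above. Here is a minimal sketch of how it changes the sequence; the variable names (`evens`, `countdown`, `first_three`) are made up for illustration.

```python
# range(start, stop, step): count from start up to (not including) stop, in jumps of step
evens = list(range(2, 11, 2))
print(evens)          # [2, 4, 6, 8, 10]

# A negative step counts downward; stop is still excluded
countdown = list(range(5, 0, -1))
print(countdown)      # [5, 4, 3, 2, 1]

# With a single argument, start defaults to 0 and step defaults to 1
first_three = list(range(3))
print(first_three)    # [0, 1, 2]
```

Wrapping `range()` in `list()` is only needed to display all the values at once; a for-loop can iterate over the range directly.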

You may be wondering why we chose the letter "i" to iterate through the lists above. The answer is that it's just convention. Letters like "i" and "j" are often used for iteration, but in practice you can use any valid variable name: letters, letter + number combinations, or even an underscore.
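
As a small sketch of that point, the loops below use a descriptive name and an underscore in place of "i". The names `number`, `total`, and `repeats` are arbitrary choices for this example.

```python
# A descriptive name can make the loop easier to read than a bare "i"
total = 0
for number in range(1, 4):
    total = total + number
print(total)      # 6

# An underscore signals that the loop variable itself is never used
repeats = 0
for _ in range(3):
    repeats = repeats + 1
print(repeats)    # 3
```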

One common use of a for-loop is to iteratively append elements to a list.

In [None]:
## CODE CELL 3
# What are the first ten multiples of 3?

multiples = []    # initializing list
for i in range(1,11):
    multiples.append(3*i)

print(multiples)

You can also use multiple layers of loops (called "nested loops").

In [None]:
## CODE CELL 4
# What are the first five multiples of numbers 10-14?

for i in range(10,15):
    myList = []
    for j in range(1,6):
        myList.append(i*j)
    print(myList)

As you begin writing more complex code, you may find it helpful to use this step-by-step visualization tool: http://pythontutor.com/visualize.html#. (DEMO)

What if we want to control what code gets executed based on certain conditions? That's where conditional statements come in. Let's look at some operators you'll likely use:

In [None]:
## CODE CELL 5

print(3 < 4)    # less than

In [None]:
## CODE CELL 6

print(4 <= 4)    # less than or equal to

In [None]:
## CODE CELL 7

print('a' == 'A')    # equal to

In [None]:
## CODE CELL 8

print(4 != 4)    # not equal to

In [None]:
## CODE CELL 9

print(2 >= 10)     # greater than or equal to

In [None]:
## CODE CELL 10

print('a' > 'A')    # greater than

You may have noticed some unexpected behavior. Python compares strings lexicographically, character by character, using each character's Unicode code point. Uppercase ASCII letters have lower code points than lowercase ones, so 'a', 'b', etc. come *after* 'A', 'B', etc.
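
To see where this ordering comes from, the built-in `ord()` function shows the numeric code point behind a character. The word list below is made up for illustration.

```python
# ord() returns the Unicode code point used when characters are compared
print(ord('A'))    # 65
print(ord('a'))    # 97

# Sorting therefore puts capitalized words ahead of lowercase ones
words = ['banana', 'Cherry', 'apple']
sorted_words = sorted(words)
print(sorted_words)    # ['Cherry', 'apple', 'banana']

# Lowercasing both sides gives a case-insensitive comparison
same_ignoring_case = 'a'.lower() == 'A'.lower()
print(same_ignoring_case)    # True
```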

The return values of True/False are called "Booleans". These are actual values that can be assigned to a variable:

In [None]:
## CODE CELL 11

var = True
new_var = False

print('var:', var, '\nnew_var:', new_var)

In Part 1, we saw another case where the result was a Boolean: we were checking if "Orlando" was in the `topcities` list. The `in` and `not in` membership checks return True or False.
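
As a quick refresher, here is a sketch of both membership checks. The `cities` list is a made-up stand-in, not the actual list from Part 1.

```python
# Hypothetical list for illustration only
cities = ['New York', 'Los Angeles', 'Chicago']

in_list = 'Chicago' in cities
print(in_list)        # True

not_in_list = 'Orlando' not in cities
print(not_in_list)    # True

# On strings, "in" tests whether one string appears inside another
has_york = 'York' in 'New York'
print(has_york)       # True
```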

Now, we can implement some of these comparisons in what's called an if-else statement. The gist of it is this: if {some condition}, execute some code; for all other cases, execute some other code.

In [None]:
## CODE CELL 12
# On which days could we potentially have a picnic?

forecast7Day = ['rain', 'mostly cloudy', 'rain', 'mostly cloudy', 'sunny', 'partly cloudy', 'rain']
picnic = []
for i in forecast7Day:
    if i == 'rain':
        picnic.append('no')
    else:
        picnic.append('yes')
        
print(picnic)

We can include more "options" by incorporating "elif" statements.

In [None]:
## CODE CELL 13
# How many layers do I need to wear for the next few days?

forecastTemps = [50, 59, 72, 74, 60, 62, 63]
layers = []
for i in forecastTemps:
    if i < 60:
        layers.append('wear a jacket')
    elif i >= 70:
        layers.append("don't need a jacket or sweater")
    else:
        layers.append('wear a sweater')

print(layers)

You can incorporate as many elif statements as you'd like.

One last type of control structure is the while-loop. The general structure is the following: while {some condition}, execute some code. Iteration will continue until that condition is no longer true.

In [None]:
## CODE CELL 14
# Using up a gift card

balance = 110     # initial balance = $110
while balance - 20 >= 0:
    print('Your balance is now $' + str(balance))
    balance -= 20    # using up $20 for each purchase
print('Final balance: $' + str(balance))

Finally, functions. Functions are useful when you want to execute a section of code repeatedly with different inputs. The inputs are named in the function definition (as "parameters"), and their values (called "arguments") are supplied when the function is called. Functions are defined with `def` followed by a user-provided name.

In [None]:
## CODE CELL 15
# A function to generalize the gift card code in the previous example

def giftCard(init_balance, purchase_size):    # this function has two parameters
    balance = init_balance
    while balance - purchase_size >= 0:
        print('Your balance is now $' + str(balance))
        balance -= purchase_size
    return 'Final balance: $' + str(balance)

What happened when you ran the previous cell?

In order to use the function, we have to call it.

In [None]:
## CODE CELL 16
# Calling the giftCard function

newCard = giftCard(200, 50)    # initial balance = $200, purchase_size = $50
print(newCard)   

Try calling `giftCard()` with different arguments.

Question to ponder/research: What happens when you replace the "return" statement with a "print" statement like the one in the non-function version of this code? What's the difference between a print statement and a return statement in a function?

**Exercise 3:**

How many words are in the first sentence of Charles Dickens's *Oliver Twist*? How many times does the word "which" appear?

*Hint: The punctuation characters in this sentence are given in the `remove_punc` list. Think about using the `replace` and `split` string methods from Part 1 of the workshop to get a list of words without punctuation.*
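
As a reminder of how those two methods behave, here is a minimal sketch on a made-up sentence (not the exercise text). The names `sentence`, `cleaned`, and `words` are arbitrary.

```python
sentence = 'Hello, world;\nhello again.'    # made-up text spanning two lines

# replace() returns a new string, so the result has to be reassigned each time
cleaned = sentence
for mark in [',', ';', '.']:
    cleaned = cleaned.replace(mark, '')

# split() with no arguments breaks on any whitespace, including newlines
words = cleaned.split()
print(words)         # ['Hello', 'world', 'hello', 'again']
print(len(words))    # 4
```

Note that 'Hello' and 'hello' remain distinct strings, since string comparison is case-sensitive.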

**Answer 3:**

In [None]:
## CODE CELL 17

oliverT = '''Among other public buildings in a certain town, which for many reasons
it will be prudent to refrain from mentioning, and to which I will
assign no fictitious name, there is one anciently common to most towns,
great or small: to wit, a workhouse; and in this workhouse was born; on
a day and date which I need not trouble myself to repeat, inasmuch as
it can be of no possible consequence to the reader, in this stage of
the business at all events; the item of mortality whose name is
prefixed to the head of this chapter.'''

remove_punc = [',', ':', ';', '.']

## ENTER CODE HERE



*Note:* For a larger text where it would be prohibitive to create a list of all possible punctuation characters for removal, consider using the NLTK library as per the top response here: https://stackoverflow.com/questions/21361073/tokenize-words-in-a-list-of-sentences-python. Alternatively, you can use the re library for regular expressions: https://docs.python.org/3/howto/regex.html#splitting-strings.
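
A minimal sketch of the regular-expression route, assuming that a "word" means a run of letters, digits, or underscores. The sample text is a shortened fragment chosen for illustration.

```python
import re

text = 'To wit, a workhouse; and in this workhouse was born.'

# \w+ matches runs of word characters, so punctuation and whitespace
# are skipped without having to list them by hand
words = re.findall(r'\w+', text)
print(words)
print(len(words))    # 10
```

One caveat with this pattern: a contraction such as "don't" would be split into two pieces, so the pattern may need adjusting for other texts.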

### Introduction to pandas

The last part of this workshop is a quick introduction to pandas. The pandas library is a popular data science library which allows fast access to and analysis of structured data. It's often used for tabular data which gets converted into a 2D data structure (rows x columns) called a DataFrame.

For this section, we'll be working with a dataset from Kaggle called "Foodborne Disease Outbreaks, 1998-2015". This file contains data from the CDC's electronic Foodborne Outbreak Reporting System (eFORS). More information can be found here: https://www.kaggle.com/cdc/foodborne-diseases

Let's start by importing the pandas library.

In [None]:
## CODE CELL 18

import pandas as pd

We can read the CSV file "outbreaks.csv" in as a DataFrame for easier manipulation.

In [None]:
## CODE CELL 19

df = pd.read_csv('outbreaks.csv')
df

For convenience, pandas DataFrames have methods called `head()` and `tail()` which display the first few or last few rows, respectively (5 rows by default).

In [None]:
## CODE CELL 20

df.head()

In [None]:
## CODE CELL 21
# Showing the last 10 rows

df.tail(10)

What are all the column names in this table?

In [None]:
## CODE CELL 22

df.columns

How many rows/columns are in this table?

In [None]:
## CODE CELL 23

df.shape

You can get this information along with information about the data types present using the `info()` method.

In [None]:
## CODE CELL 24

df.info()

We can get summary statistics of the columns with numerical data using the `describe()` method.

In [None]:
## CODE CELL 25

df.describe()

What "species" were responsible for outbreaks over the course of the years covered in this dataset?

In [None]:
## CODE CELL 26

df['Species'].unique()

What if we want to only deal with outbreaks that happened in New Jersey?

In [None]:
## CODE CELL 27

df[df['State'] == 'New Jersey']

A subset of a DataFrame is itself a DataFrame, so it can be assigned to a variable:

In [None]:
## CODE CELL 28

nj = df[df['State'] == 'New Jersey']
nj.head()
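
To see what the expression inside the brackets is doing, here is a sketch on a tiny hand-made DataFrame. The values are invented and are not taken from "outbreaks.csv"; `toy`, `mask`, and `nj_toy` are hypothetical names.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'State': ['New Jersey', 'Ohio', 'New Jersey'],
    'Illnesses': [12, 30, 7],
})

# The comparison produces one True/False value per row
mask = toy['State'] == 'New Jersey'
print(mask.tolist())    # [True, False, True]

# Indexing with that Boolean column keeps only the rows marked True
nj_toy = toy[mask]
print(nj_toy)
print(len(nj_toy))      # 2
```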

We can choose to only look at certain columns in the DataFrame:

In [None]:
## CODE CELL 29

df[['Year', 'Illnesses']].head(10)

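What if we only want to focus on certain rows? Slicing selects rows by position, and as with lists, the end of the slice is excluded (the cell below returns the rows at positions 10 through 19).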

In [None]:
## CODE CELL 30

df[10:20]

We can also sort the rows by a chosen column. What were the top 5 outbreaks in terms of fatalities?

In [None]:
## CODE CELL 31

df.sort_values(by='Fatalities', ascending=False).head()

How many outbreaks occurred each year?

In [None]:
## CODE CELL 32

df.groupby('Year').size()
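
To make the grouping step concrete, here is a sketch on a tiny hand-made DataFrame where the counts can be checked by eye. The years are invented, and `toy` and `counts` are hypothetical names.

```python
import pandas as pd

# Six made-up rows: two in 2010, one in 2011, three in 2012
toy = pd.DataFrame({'Year': [2010, 2010, 2011, 2012, 2012, 2012]})

# groupby() collects rows that share a Year; size() counts the rows in each group
counts = toy.groupby('Year').size()
print(counts)
print(counts.loc[2012])    # 3
```

The result is a pandas Series indexed by year, which is why it can be plotted or sorted directly.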

We can quickly plot this using the matplotlib library.

In [None]:
## CODE CELL 33
# "Magic" command below allows you to see plot in the notebook

%matplotlib inline    

In [None]:
## CODE CELL 34

df.groupby('Year').size().plot(kind='bar')

**Exercise 4:**

How many outbreaks occurred for each Food type listed in the dataset between 2010 and 2012? Sort these in descending order by number of outbreaks for each Food type, and plot the top 10 foods which were associated with outbreaks.

*Hint: To subset a DataFrame with a compound conditional, use the "&" character for AND and the "|" pipe character for OR. For example, if you want the subset where Col1 == 1 and Col2 == 2, you could use the following:* `df[(df['Col1']==1) & (df['Col2']==2)]`
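
The hint above can be tried out on a tiny hand-made DataFrame. This is only a sketch: the `Col1`/`Col2` values are invented, and `toy`, `both`, and `either` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    'Col1': [1, 1, 2, 3],
    'Col2': [2, 5, 2, 7],
})

# AND: only rows where both conditions hold
both = toy[(toy['Col1'] == 1) & (toy['Col2'] == 2)]
print(len(both))      # 1

# OR: rows where at least one condition holds
either = toy[(toy['Col1'] == 1) | (toy['Col2'] == 2)]
print(len(either))    # 3
```

The parentheses around each condition matter: `&` and `|` bind more tightly than `==`, so leaving them out changes what gets compared.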

**Answer 4:**

In [None]:
## CODE CELL 35

## ENTER CODE HERE


## Post-workshop exercise

**Top 50 cities**

One of the files you downloaded from the GitHub repository is called "top50cities.txt". The file includes the rank, city name, and population size for each of the top 50 largest cities in the United States. Read this file into your Python working environment and output a new file showing only the city names, with one city per line. For pointers on how to read and write files in Python, check out this link: http://www.pythonforbeginners.com/files/reading-and-writing-files-in-python. One possible solution is provided in the script "top50cities_solution.py" which is also in the folder you downloaded from the repository.
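
For a general feel of reading and writing text files, here is a minimal sketch that works on a throwaway file. The file names, the tab-separated layout, and the contents are all assumptions made for this example; the actual layout of "top50cities.txt" may differ.

```python
import os
import tempfile

# Work in a temporary folder so no real files are touched
folder = tempfile.mkdtemp()
in_path = os.path.join(folder, 'example_in.txt')      # hypothetical file names
out_path = os.path.join(folder, 'example_out.txt')

# Create a small input file with made-up, tab-separated lines
with open(in_path, 'w') as f:
    f.write('1\tapple\t30\n2\tbanana\t12\n')

# Read line by line, keep only the second field, and write it to a new file
with open(in_path) as infile, open(out_path, 'w') as outfile:
    for line in infile:
        fields = line.rstrip('\n').split('\t')
        outfile.write(fields[1] + '\n')

with open(out_path) as f:
    result = f.read()
print(result)
```

Using `with` closes each file automatically once the indented block finishes.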

*References:*

The following materials were consulted during development of this notebook:

B. Rhodes, *PyCon Pandas Tutorial*, (2015), GitHub repository, https://github.com/brandon-rhodes/pycon-pandas-tutorial.