# Python Training Day 2

# <font color = '#526520'> Control Structures </font>

Python has four main control structures:
1. if / elif / else statements (conditional)
2. while loop
3. for loop
4. try / except statements

## Conditional Statements

* Conditional statements use boolean logic (True or False) to execute lines of code only if certain conditions are met
* conditional statements end with a ":"; forgetting it will result in a syntax error
* Python interprets any indented code under a valid conditional statement as the code to run if the condition is met
 * Python requires consistent indentation (e.g. tabs or spaces, no mixing and matching)

### Boolean operator refresher
* boolean data types have two values (True / False)
* "==" checks if two values are equal
* "!=" checks if two values are not equal
* ">" greater than; ">=" greater than or equal
* "<" less than; "<=" less than or equal

### Conditional statements
**And**  
"and" can be used to chain two conditional statements together  
``A==B and A != 0``  
The above statement will evaluate to true if A equals B and A is not equal to 0   

**Or**  
"or" can be used to chain two conditional statements together; it returns True if either condition is True  
``A==B or A != 0``  
The above statement will evaluate to true if A equals B or A is not equal to 0  
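
To make the `and` / `or` rules above concrete, here is a small sketch. The values assigned to `A` and `B` are made up purely for illustration.

```python
# Hypothetical values chosen only to illustrate "and" / "or"
A = 5
B = 5

both_true = (A == B) and (A != 0)    # True: both conditions hold
either_true = (A == B) or (A != 0)   # True: at least one condition holds

A = 0
B = 0

both_after = (A == B) and (A != 0)   # False: A equals B, but A is 0
either_after = (A == B) or (A != 0)  # True: A still equals B

print(both_true, either_true, both_after, either_after)
```

Wrapping each comparison in parentheses is optional here, but it makes the two conditions easier to read.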

### if / elif / else

Using if statements you can execute certain code only if certain criteria are met (similar to a case / when block in SQL)

Python has three variations of the if statement

1. if (required)
 * checks if certain criteria have been met, if True then executes code below
2. elif (optional)
 * optional statement that must follow an if statement. 
 * If the initial "if statement" criteria was not met then the criteria in this statement is evaluated. 
 * If this statement evaluates to True then the code below this statement is executed.
3. else (optional)
 * optional statement that does not require any logical criteria. 
 * If none of the criteria in the if / elif statements above were met then the code under the else statement gets executed
 

_* note: if no else statement is used and the if statement criteria is not met Python will continue executing the rest of the program_








#Conditional Example One

usr_input = int(input('type a non-zero number: '))

if usr_input < 0:
    print('you entered a negative number')
else:
    print('you entered a positive number')

In [None]:
#Conditional Example Two
a = 10
b = 15

if a < b:
    print('10 is less than 15')
elif a == 10:
    print('a equals 10')
else:
    print('a is not equal to 10 and a is not less than 15')

## While loop

A while loop is a condition-based loop that keeps executing the code block underneath it for as long as the condition evaluates to True. The condition is re-checked before every pass, and the loop stops as soon as it evaluates to False.  

Essentially, a while loop is like an if statement that repeats itself until its condition becomes False.

#### Never ending loop
* a common mistake when writing while loops is to create a condition that will never evaluate to False.
* building a while loop whose condition can never become False will result in a loop that runs forever.
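
As a rough sketch of how a never-ending loop comes about, the loop below moves `a` in the wrong direction, so `a < b` can never stop being true. The `max_iterations` safety cap is an assumption added for this example so that the cell still finishes.

```python
# a moves AWAY from b, so the condition "a < b" never stops being True.
# max_iterations is a hypothetical safety cap that ends the loop for us.
a = 10
b = 15
max_iterations = 5
iterations = 0

while a < b:
    a -= 1               # bug: this should be a += 1
    iterations += 1
    if iterations >= max_iterations:
        print('stopping early after', iterations, 'iterations')
        break            # leave the loop even though a < b is still True
```

Without the cap you would have to interrupt the kernel manually to stop the cell.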



In [None]:
#what will be final number printed?
a = 10
b = 15

while a < b:
    print(a)
    a += 1 # this adds 1 to the variable a

## For loop

For loops are a more precise version of while loops that are useful for iterating through objects or performing a certain action a fixed number of times. 

#### For loop syntax:

The syntax for a basic for loop:

for [variable name] in range([start number], [end number], [step (optional)]): <- the colon ends the loop statement; indent the loop body below
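
A short sketch of the optional step argument, using made-up start and end numbers. Note that `range` stops before the end number.

```python
# step of 2: start at 0, stop before 10
evens = []
for i in range(0, 10, 2):
    evens.append(i)
print(evens)

# a negative step counts down: start at 3, stop before 0
countdown = list(range(3, 0, -1))
print(countdown)
```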


In [None]:
# how many times will this statement print? 
for i in range (0,2):
    print(i)
    
# i = variable name -- this is completely dynamic you can use almost anything as the variable name
for numbers in range(0,2):
    print(numbers)

In [None]:
# print all items in a list
lst = ['one','two','three','four','five']
print('list size:',len(lst))

for i in range (0,len(lst)): 
    print('i value:', i)
    print('list value:', lst[i])

#### For each:
These are special versions of for loops that iterate through each item in a list or series. These loops are incredibly useful if you want to perform some kind of operation on everything contained within a list or just simply want to access every object in a list. 

for [variable name] in list:

In the for loop above we stored a number in the variable; in the for each loop you store the actual list item. 

Use Examples: 

search a list for a string, print string if it contains a certain substring.



In [None]:
# Example 1A, print every item in a list
print ('Example 1A')
test_list = [1,2,3,4,5]

for item in test_list:
    print(item)
    
#Example 1B again the variable name is dynamic
print('Example 1B')
for list_items in test_list:
    print(list_items)
    
# Example two -- search a list for a string containing a certain substring. Print if found.
print('Example 2')
string_list = ['dog','cat','hammerhead shark']

for text in string_list:
    if 'shark' in text:
        print("oh no it's a",text)

#### Enumerate
This is a special version of the for each loop that returns both the list item and the index associated with the list item. <br>

These loops are useful if you are looping through a multidimensional list or need to keep track of the location of list objects. <br>



Syntax <br>

for [index variable], [list object variable] in enumerate(list): <br>



In [None]:
lst = [1,2,3,4]
for idx, obj in enumerate(lst):
    print ('Index:', idx)
    print('List Object:', obj)
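
For the multidimensional case mentioned above, one possible approach is to nest two `enumerate` loops. The `grid` list and the variable names are invented for this sketch.

```python
# grid is a hypothetical 2 x 2 nested list
grid = [['a', 'b'],
        ['c', 'd']]

positions = []
for row_idx, row in enumerate(grid):          # outer loop: one row at a time
    for col_idx, value in enumerate(row):     # inner loop: each item in that row
        positions.append((row_idx, col_idx, value))
        print('row', row_idx, 'col', col_idx, 'value', value)
```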


### When to use a while loop and when to use a for loop

Before answering this question, let's reiterate the differences between the two:
* *for loops* perform a process a set number of times (e.g. n number of times, where n is equal to the number of records in an object)
* *while loops* perform a process indefinitely until a certain condition is met/a certain condition is no longer met

The general rule of thumb is to use a *for loop* whenever possible, if for nothing more than it's (marginally) easier to read and interpret the code.

However, there are some circumstances where a *while loop* is the more appropriate choice

1. When you are waiting for a specific action to occur (e.g. user interaction, or a specific value to be entered)
2. When you are working with some kind of data structure with a difficult to determine size (e.g. a server that is streaming live data)

If you decide to implement a *while loop*, then there are **two things you must ensure** (and which are not required for for loops):

1. You've included a condition which instructs the program to break outside of the loop (once condition is met)
2. You are highly confident that the condition (for breaking outside the loop) *will actually occur*

If either of these is not met, then your program will be stuck in an **infinite loop** and will never complete, at which point you will need to manually interrupt the program and rethink your code implementation.
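
Here is a minimal sketch of the "waiting for a specific value" case. To keep it runnable without typing anything, `simulated_inputs` stands in for values arriving one at a time, and `max_attempts` is an assumed safety cap that guarantees the loop ends.

```python
# simulated_inputs is hypothetical data standing in for user input
simulated_inputs = iter(['no', 'maybe', 'quit', 'ignored'])

attempts = 0
max_attempts = 10      # safety cap so the loop is guaranteed to end
response = None

# keep going until we see 'quit' or run out of attempts
while response != 'quit' and attempts < max_attempts:
    response = next(simulated_inputs)
    attempts += 1

print('stopped after', attempts, 'attempts')
```

We do not know in advance how many values arrive before `'quit'`, which is why a while loop fits better than a for loop here.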


### `Try` and `Except`: Statements for Handling Errors
Try and except statements are a special form of control flow that checks whether an operation runs without raising an error. <br>

The code in the `try` block is always attempted first; if no exception is raised, the `except` block is skipped <br>

If an exception is raised, the rest of the `try` block is skipped and the code in the `except` block is executed <br>

These statements are useful for handling errors -- if an error is present you can use the `except` block to print an error message or handle the exception.

For more information on how to handle exceptions in your code look at the documentation for [Errors and Exceptions](https://docs.python.org/3/tutorial/errors.html).

In [None]:
# Simple try / except 
a = 1
b = 0

try:
    a / b # <-- no ":" needed after this statement
except ZeroDivisionError: # <-- name the exception you expect instead of catching everything
    print('cannot divide by 0')

In [None]:
# Complex try / except

usr_input = input('enter a number: ') 

try:
    number = int(usr_input) # <-- try casting string into integer
    print(number) # if the conversion works print the number
    
except ValueError: # <-- int() raises ValueError when the text is not a whole number
    print('you did not enter a valid number')
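
A `try` block can be followed by more than one `except` block, each handling a different kind of error. The sketch below assumes a made-up helper called `safe_divide`; it is only an illustration of the pattern.

```python
def safe_divide(numerator_text, denominator_text):
    """Hypothetical helper: divide two numbers supplied as strings."""
    try:
        result = int(numerator_text) / int(denominator_text)
    except ValueError as err:
        # raised by int() when the text is not a whole number
        return 'not a number: ' + str(err)
    except ZeroDivisionError:
        # raised by / when the denominator is 0
        return 'cannot divide by 0'
    return result

print(safe_divide('10', '4'))
print(safe_divide('10', '0'))
print(safe_divide('ten', '4'))
```

Naming the exception type keeps unrelated errors (for example a typo in a variable name) from being silently swallowed.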
    

In [None]:
# tie it all together

def askint():
    while True:
        # try to take user input and convert it to an integer
        try:
            val = int(input("Please enter your age: "))
        # if the conversion fails, then print this message and begin another loop
        except ValueError:
            print("Please enter an integer.")
            # continue skips the rest of this pass and starts the next iteration of the loop
            continue
        # if there is no error, print 'thank you!'
        else:
            print ('Thank you!')
            # breaks you out of the loop
            break
        # finally runs at the end of every pass, even when continue or break is hit
        finally:
            print("I execute no matter what!")
        
askint()

# <font color = '#526520'> Errors and Exceptions </font>

While writing programs or stepping through code interactively, you are sure to encounter times where Python does not like the instructions you've provided and returns an error. At a high level, there are two types of errors: *syntax errors* and *exceptions*.

* **Syntax Errors (a.k.a. 'parsing errors')** - arguably the most common error, which arises when your code is not syntactically correct. You may be missing a colon, bracket, or some other syntax element, and Python doesn't like it.
    * When Python raises a Syntax Error, it includes a pointer `^` to the earliest point in the affected code where an error was detected. The actual mistake is often in the code just before the arrow, so check there first. <br> <br>
    
* **Exceptions** - exceptions are errors in code which occur during execution despite the code itself being syntactically correct. There are various types of *exceptions*, and each has a name to help guide you in finding where the issue lies.
    * NOTE: There is a family of built-in exceptions which you will find adequately discussed in the documentation. However there are also user-defined exceptions (commonly found in open source packages) and for that you will need to refer to the module's documentation.

While these might appear unclear at first, you will soon get used to interpreting the errors in a meaningful way that helps you fix your code. For more information, check out [Errors and Exceptions](https://docs.python.org/3/tutorial/errors.html) and
[Built-in Exceptions](https://docs.python.org/3/library/exceptions.html#bltin-exceptions).
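
To illustrate what a user-defined exception looks like, here is a small sketch. `NegativeAgeError` and `check_age` are invented names for this example; they are not part of Python or of any package.

```python
class NegativeAgeError(Exception):
    """Hypothetical user-defined exception for this example."""


def check_age(age):
    # raise our own exception type when the value makes no sense
    if age < 0:
        raise NegativeAgeError('age cannot be negative: ' + str(age))
    return age


try:
    check_age(-3)
except NegativeAgeError as err:
    # the exception name and its message are both available
    caught = type(err).__name__ + ': ' + str(err)
    print(caught)
```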

In [None]:
# Example of a Syntax Error in Python
for element in [1, 2, 3] print element

# look at the output below and see if you can identify where the error actually occurs (HINT: it's before the print function)

In [None]:
# Example of an Exception in Python
# look at the output below, notice Python provides both the type of exception (i.e. ModuleNotFoundError, a subclass of ImportError) as well as a description 
import abracadabara

# <font color = '#526520'> Quick Mention of NumPy Module </font>

## What is NumPy

NumPy is an abbreviated name for *Numerical Python* and is a (if not THE) leading package for high-performance scientific-computing and data analysis. It provides many useful components, two of which are:

* **ndarray** - N-dimensional arrays (row x column x levels) providing capability for vectorized operations
    * Very similar to the arrays we learned about during the R training. Although the Pythonic notation and syntax is different, the concept is the same.
* Standard math functions for efficient operations that do not require writing a loop

It is important to be aware of NumPy since the *Pandas* Module builds upon it. In the interest of time we will not explore *NumPy* in more detail; you can find more information in the [NumPy Documentation](http://www.numpy.org/).
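
As a quick taste of what "vectorized" means, the sketch below converts a few made-up temperatures without writing a loop. The array contents are assumptions chosen for illustration.

```python
import numpy as np

# hypothetical temperatures in Celsius
temps_c = np.array([-1.8, 0.0, 21.5])

# the arithmetic is applied to every element at once, no loop needed
temps_f = temps_c * 9 / 5 + 32
print(temps_f)

# a 2-dimensional array: 2 rows x 3 columns
grid = np.arange(6).reshape(2, 3)
print(grid.shape)
print(grid.sum(axis=0))   # column totals
```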

# <font color = '#526520'> Introduction to Pandas Module </font>

## What is Pandas 

Pandas is a Python package that provides data structures that enable Python to efficiently handle large amounts of data from different sources. 

Key features:

- Easy handling of missing data
- Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
- Powerful, flexible group by functionality to perform split-apply-combine operations on data sets
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
- Intuitive merging and joining data sets
- Flexible reshaping and pivoting of data sets
- Hierarchical labeling of axes
- Robust IO tools for loading data from flat files, Excel files, databases, and HDF5
- Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
- Source: (http://nbviewer.jupyter.org/github/fonnesbeck/Bios8366/blob/master/notebooks/Section2_1-Introduction-to-Pandas.ipynb)

## Pandas Documentation
Before we take a look at the exciting world of Pandas you should know how to get some help.

Pandas has very good online documentation for almost all of the DataFrame / Series methods and functions. 

This can be found here: http://pandas.pydata.org/pandas-docs/stable/

Google and Stack Overflow also have a plethora of helpful Pandas resources. 

## Pandas Data Structures

Pandas has two main data structures: the series (one-dimensional) and the data frame (two-dimensional)

### Series

A Pandas series is like a list in that it is one-dimensional and can handle data of different types. However, unlike a list, which is bound to a numerical index, a Pandas series can have a more meaningful index.
<br>


In [None]:
#import pandas
import pandas as pd #<--- pd = alias we will use later on to refer to pandas package
import numpy as np # <--- a lot of pandas functions use numpy so it's a good idea to import

In [None]:
#Pandas Series default index
default_series = pd.Series([100,200,300,400,'500'])
default_series


In [None]:
#view default_series values
default_series.values


In [None]:
#view default_series index
default_series.index

In [None]:
#meaningful series index
complex_series = pd.Series([100,200,300,400,'500'],
                          index = ['Number1','Number2','Number3','Number4','Text1'])

complex_series


In [None]:
#view complex_series indices
complex_series.index

#### Selecting Data

If you select data from a dataframe row or column the data will be by default returned in the form of a series.

In order to get at the data that you are interested in it is useful to understand series functionality. 

In [None]:
#select Text1 value 
complex_series['Text1'] # <-- just like a python dictionary!
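
A few more ways to pull values out of a series, sketched with a made-up `prices` series. The last two lines build a tiny hypothetical dataframe to show that a single selected column comes back as a series.

```python
import pandas as pd

# hypothetical sandwich prices with a meaningful index
prices = pd.Series([4.5, 7.0, 12.25], index=['blt', 'club', 'reuben'])

print(prices['club'])              # one label -> a single value
print(prices[['blt', 'reuben']])   # list of labels -> a smaller series
print(prices[prices > 5])          # boolean condition -> matching rows only

# selecting one column from a dataframe returns a series
menu = pd.DataFrame({'sandwich': ['blt', 'club'], 'price': [4.5, 7.0]})
print(type(menu['price']))
```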

### DataFrame

A dataframe is a tabular data structure that will look very familiar to anyone who has worked with an Excel spreadsheet or SQL data table. 
<br> 

Dataframes can be sliced, filtered, added, joined, concatenated, duplicated. 
<br>

The real power in dataframes comes from being able to quickly manipulate data across an entire dataframe or apply a pythonic function to an entire column.

#### Building a dataframe

In its simplest form all a dataframe needs is a series of data.

This data can come in the form of a python list or a Pandas series



In [None]:
#build a VERY basic dataframe
list_of_values = ['value1','value2','value3']
basic_df = pd.DataFrame(list_of_values)
basic_df

In [None]:
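#Dir for DataFrames: list the public attributes and methods (names that do not start with "_")
print([name for name in dir(pd.DataFrame) if not name.startswith('_')])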

In `basic_df` above, the data is input as a single column with a default column label (0) automatically assigned.

#### Building a complex dataframe 

Most dataframes are built using a combination of data and column names.

If you are doing this from scratch there are three common ways to build a data frame

1. Using two lists
2. Using a dictionary
3. Using a series


In [None]:
#build a complex dataframe using lists

list_of_values = [['value1','value2'],['value1','value2']]
column_names = ['Col1','Col2']

df1 = pd.DataFrame(list_of_values,columns = column_names)
df1

In [None]:
#build a complex dataframe using dictionaries

data_dict = {'Col1':['value1','value2'],
            'Col2':['value1','value2']}

df2 = pd.DataFrame(data_dict)
df2

In [None]:
#build a complex dataframe using series

series1 = pd.Series(['value1','value2'])
series2 = pd.Series(['value1','value2'])
column_names = ['Col1','Col2']

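# passing a list of series to pd.DataFrame would turn each series into a ROW,
# so concatenate along axis = 1 to make each series a COLUMN named by keys
df3 = pd.concat([series1,series2], axis = 1, keys = column_names)
df3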

### Inputs

Pandas makes it easy to convert native Python data structures to Pandas data structures

Pandas also makes it really easy to input data from non-pythonic sources (e.g. flat files, Excel spreadsheets, clipboards)

#### Flat File input
`pd.read_csv('filename.txt')` <br>
helpful options:
- sep: which character to use as the delimiter; `sep = '|'` will use pipes as the delimiter
- nrows: read in a specific number of rows (good for large files); `nrows = 100` will limit the import to the first 100 rows
- on_bad_lines: what to do with rows that have too many values (usually caused by a delimiting error); `on_bad_lines = 'skip'` will skip these lines instead of throwing an error (pandas versions before 1.3 used `error_bad_lines = False` for this)
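
To try the `sep` and `nrows` options without needing a file on disk, the sketch below reads from an in-memory string. `raw_text` is made-up data; with a real file you would pass the file name instead of `io.StringIO(...)`.

```python
import io
import pandas as pd

# hypothetical pipe-delimited text standing in for the contents of a flat file
raw_text = "name|price\nblt|4.50\nclub|7.00\nreuben|12.25\n"

# sep='|' splits on pipes; nrows=2 reads only the first two data rows
first_two = pd.read_csv(io.StringIO(raw_text), sep='|', nrows=2)
print(first_two)
```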

#### Excel File Input
`pd.read_excel('filename.xlsx',sheet_name='Sheet1')` <br>
helpful options:
- sheet_name: which sheet to read (older pandas versions spelled this `sheetname`)
- converters: can specify the datatype for a specific Excel column 

#### Clipboard
Possibly one of the neatest Pandas imports -- this function will import whatever data is stored on the clipboard. This is really handy for working with data that you may not want to save or map to in your script.
`pd.read_clipboard(sep='|')` <br>
helpful options:
- sep: what delimiter to use


In [None]:
#get an example dataframe from the web
clean = lambda s: s.replace('$', '')[:-1] if '.' in s else s.replace('$', '') # don't worry about these steps
url = 'https://raw.github.com/gjreda/best-sandwiches/master/data/best-sandwiches-geocode.tsv'
sandwiches = pd.read_table(url, sep='\t', converters={'price': lambda s: float(clean(s))})
sandwiches.head(3)




### Accessing DF data

#### Accessing data from a column (or list of columns)

*Single Column* <br>
`df['COLUMN_NAME']` <br> or <br>
`df.COLUMN_NAME` <-- no spaces allowed <br> 
*Multiple Columns* <br>
`df[['COLUMN_NAME1','COLUMN_NAME2']]` <br> or <br>
`df[df.columns[1:3]]` 




#### Accessing data from a row (or list of rows)
`df.loc[index]`





In [None]:
# get list of sandwich columns
sandwiches.info()

In [None]:
#view data in sandwich column
sandwiches['sandwich'].head(5) # <--- returned as a series, since it is only a single column

In [None]:
#view data in sandwich and city columns
sandwiches[['sandwich','city']].head(5) #<--- returned as a dataframe

In [None]:
# view data in rows 0,1, & 4 
sandwiches.loc[[0,1,4]]

In [None]:
#view sandwich and description for rows 0,1 & 4
sandwiches.loc[[0,1,4],['sandwich','description']]
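
One thing worth knowing about `loc` is that it looks rows up by index label, not by position. With the default 0, 1, 2... index the two coincide, so the sketch below uses a made-up dataframe with text labels to show the difference. `iloc`, the position-based counterpart, is included only for comparison.

```python
import pandas as pd

# hypothetical dataframe with text labels as the index
shops = pd.DataFrame({'sandwich': ['blt', 'club', 'reuben'],
                      'price': [4.5, 7.0, 12.25]},
                     index=['x', 'y', 'z'])

by_label = shops.loc['y', 'price']    # row labelled 'y', column 'price'
by_position = shops.iloc[1, 1]        # second row, second column
print(by_label, by_position)
```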

### Filtering Data

Pandas makes it easy to filter data in a way that is very similar to writing a query in SQL. <br>

NOTE: many SQL interpreters are not case-sensitive when comparing strings -- when filtering in Python you have to match the case of the strings that you are searching for.
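
If you do want a case-insensitive match, one common approach is to lower-case the column before comparing. The `shops` dataframe below is invented for this sketch.

```python
import pandas as pd

# hypothetical data with inconsistent capitalisation
shops = pd.DataFrame({'city': ['Chicago', 'chicago', 'Evanston'],
                      'price': [9.0, 7.5, 10.0]})

# exact match: only the row spelled exactly 'Chicago'
exact = shops[shops['city'] == 'Chicago']

# lower-case the column first so both spellings match
any_case = shops[shops['city'].str.lower() == 'chicago']

print(len(exact), len(any_case))
```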


### Editing Data

In [None]:
# like SQL if we are interested in seeing what unique values make up a column we can select distinct values 
# SQL 
# select distinct city from sandwiches

sandwiches['city'].unique()

In [None]:
# like SQL if we are interested in values from a specific city we can filter by that string
# SQL 
# select * from sandwiches where city = 'Chicago'

sandwiches[sandwiches['city'] == 'Chicago']

In [None]:
# like SQL if we are interested in values from a specific city and below a certain price we can filter by this combo
# SQL 
# select * from sandwiches where city = 'Chicago' and price <= 9
# the & = AND in Pandas

sandwiches[(sandwiches['city'] == 'Chicago') & (sandwiches['price'] <= 9)]

In [None]:
# like SQL if we are interested in values from a set of cities and below a certain price we can filter by this combo
# SQL 
# select * from sandwiches where city  in ('Evanston','Bolingbrook') and price <= 10

sandwiches[(sandwiches['city'].isin(['Evanston', 'Bolingbrook'])) & (sandwiches['price'] <= 10)]

In [None]:
# Pandas supports OR statements too -- let's find a sandwich that is either in chicago or is cheap 
# SQL 
# select * from sandwiches where city = 'Chicago' or price <= 8
# the pipe symbol = OR in Pandas

sandwiches[(sandwiches['city'] == 'Chicago') | (sandwiches['price'] <= 8)]

## Modifying DataFrames
### Updating DataFrames

In [None]:
# if you want to add a column to a dataframe the syntax is very simple
# add a column called 'Type' with constant value of 'Restaurant' to our table

sandwiches['Type'] = 'Restaurant'
sandwiches['Type'].head(5)



### Using loc

loc can be used to look up specific row / column combinations. You can also use it to update specific row / column combinations.

The syntax for looking up a row / column is as follows:

`df.loc['condition row must meet', 'column_name']` <br>

If you want to update this row / column combination you just need to add an `=`

`df.loc['condition row must meet', 'column_name'] = 'new value'`


In [None]:
# if you want to edit specific rows in a given column you use loc
# let's give all of the shops with expensive sandwiches the title of 'Fancy'

#find the top 25% of prices
sandwiches['price'].quantile([.25,.5,.75])



In [None]:
#first it's a good idea to write a test filter case to check that you are filtering correctly
sandwiches[(sandwiches['price'] >= 10)].price.describe()



In [None]:
#filter checks out -- so now we modify
sandwiches.loc[(sandwiches['price'] >= 10),'Type'] = 'Fancy'

#check that the Type value changed for the expensive rows
sandwiches.loc[(sandwiches['Type'] == 'Fancy'),'price'].describe()

## DataFrame Operations


In [1]:
import pandas as pd

#get an example dataframe from the web
url = 'https://raw.githubusercontent.com/jvns/pandas-cookbook/v0.1/data/weather_2012.csv'
weather = pd.read_table(url, sep=',')
weather.head(3)


Unnamed: 0,Date/Time,Temp (C),Dew Point Temp (C),Rel Hum (%),Wind Spd (km/h),Visibility (km),Stn Press (kPa),Weather
0,2012-01-01 00:00:00,-1.8,-3.9,86,4,8.0,101.24,Fog
1,2012-01-01 01:00:00,-1.8,-3.7,87,4,8.0,101.24,Fog
2,2012-01-01 02:00:00,-1.8,-3.4,89,7,4.0,101.26,"Freezing Drizzle,Fog"


In [2]:
#Pandas allows you to use existing column data to create new columns
#You are able to use all arithmetic operations between columns (+, -, *, /)

#Let's create a column called Misery: the absolute difference between temperature and wind speed. 

weather['Misery'] = abs(weather['Temp (C)'] - weather['Wind Spd (km/h)'])

weather[['Temp (C)','Wind Spd (km/h)','Misery']].head(5)



Unnamed: 0,Temp (C),Wind Spd (km/h),Misery
0,-1.8,4,5.8
1,-1.8,4,5.8
2,-1.8,7,8.8
3,-1.5,6,7.5
4,-1.5,7,8.5


### Group By

Group bys in Pandas work just like SQL group bys. <br>

#### Syntax <br>

`df.groupby(['fields to group by'])['optional - columns to get data from'].aggregate('operation')` <br>

group by aggregations <br>
* sum
* mean
* min
* max
* count
* std (standard deviation)
* any valid custom function


<br>

Note: when you perform a groupby the "grouping" columns become the index of the dataframe. If you want to see these columns as columns and not index values you can reset the index, which returns the grouping columns to regular columns and sets the index back to the default autonumber. 

`df = df.reset_index()`
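
The effect of `reset_index()` is easiest to see on a tiny example. The `obs` dataframe and its values are made up for this sketch.

```python
import pandas as pd

# hypothetical observations
obs = pd.DataFrame({'Weather': ['Fog', 'Fog', 'Clear', 'Clear'],
                    'Misery': [5.8, 6.2, 10.0, 12.0]})

# after the groupby, 'Weather' is the index rather than a column
grouped = obs.groupby('Weather')['Misery'].aggregate(['sum', 'mean'])
print(grouped.index.tolist())

# reset_index turns 'Weather' back into an ordinary column
flat = grouped.reset_index()
print(flat.columns.tolist())
```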

In [3]:
# Pandas also has powerful group by functions
# lets look at total and average misery by weather type

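#total and average misery
# select the Misery column before aggregating so that non-numeric columns are not summed / averaged
weather.groupby('Weather')['Misery'].aggregate(['sum','mean']).head(5)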

Unnamed: 0_level_0,sum,mean
Weather,Unnamed: 1_level_1,Unnamed: 2_level_1
Clear,15269.1,11.515158
Cloudy,23010.5,13.316262
Drizzle,473.7,11.553659
"Drizzle,Fog",608.2,7.6025
"Drizzle,Ice Pellets,Fog",19.6,19.6


### Working with DateTime Data



In [4]:
#pandas can convert strings to datetime
weather['Date/Time'] = pd.to_datetime(weather['Date/Time'])


In [5]:
# once in datetime format you can easily access different date attributes
# lets make a column of dates (no time)
weather['Date'] = weather['Date/Time'].dt.date

#.dt is the Pandas datetime accessor -- similar to extract in SQL

# and lets also make a column of months
weather['Month'] = weather['Date/Time'].dt.month
weather.head(5)

Unnamed: 0,Date/Time,Temp (C),Dew Point Temp (C),Rel Hum (%),Wind Spd (km/h),Visibility (km),Stn Press (kPa),Weather,Misery,Date,Month
0,2012-01-01 00:00:00,-1.8,-3.9,86,4,8.0,101.24,Fog,5.8,2012-01-01,1
1,2012-01-01 01:00:00,-1.8,-3.7,87,4,8.0,101.24,Fog,5.8,2012-01-01,1
2,2012-01-01 02:00:00,-1.8,-3.4,89,7,4.0,101.26,"Freezing Drizzle,Fog",8.8,2012-01-01,1
3,2012-01-01 03:00:00,-1.5,-3.2,88,6,4.0,101.27,"Freezing Drizzle,Fog",7.5,2012-01-01,1
4,2012-01-01 04:00:00,-1.5,-3.3,88,7,4.8,101.23,Fog,8.5,2012-01-01,1


### Pivoting Data in Python

Pandas allows you to easily transform dataframes into pivot tables (these are still stored as dataframes, just with more levels). 

This is very advantageous when you are dealing with large amounts of data that would slow down Excel.

#### Syntax

Some Python functions accept keyword arguments, where you assign each value to a named parameter of the function. This is more precise, and the order of the arguments does not matter as long as each value is assigned to the correct keyword. 

`pd.pivot_table(df, values = 'column to aggregate', index = 'column(s) to use as rows', columns = 'column(s) to use as columns', aggfunc = 'how to aggregate (default = mean)')`
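
Here is a small sketch of the keyword arguments in action, including `aggfunc`. The `obs` dataframe is made-up data for illustration.

```python
import pandas as pd

# hypothetical observations
obs = pd.DataFrame({'Month': [1, 1, 2, 2],
                    'Weather': ['Fog', 'Clear', 'Fog', 'Fog'],
                    'Misery': [4.0, 10.0, 6.0, 8.0]})

# default aggregation is the mean
by_mean = pd.pivot_table(obs, values='Misery', index='Month', columns='Weather')
print(by_mean)

# aggfunc switches the aggregation, here to the maximum
by_max = pd.pivot_table(obs, values='Misery', index='Month', columns='Weather',
                        aggfunc='max')
print(by_max)
```

Combinations that never occur in the data (here, Clear weather in month 2) show up as NaN.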

In [6]:
#Let's pivot avg misery by month and weather 
pt = pd.pivot_table(weather, values = 'Misery', index = 'Month', columns = 'Weather')
pt

Weather,Clear,Cloudy,Drizzle,"Drizzle,Fog","Drizzle,Ice Pellets,Fog","Drizzle,Snow","Drizzle,Snow,Fog",Fog,Freezing Drizzle,"Freezing Drizzle,Fog",...,"Snow,Fog","Snow,Haze","Snow,Ice Pellets",Thunderstorms,"Thunderstorms,Heavy Rain Showers","Thunderstorms,Moderate Rain Showers,Fog","Thunderstorms,Rain","Thunderstorms,Rain Showers","Thunderstorms,Rain Showers,Fog","Thunderstorms,Rain,Fog"
Month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,25.280519,22.895862,20.7,20.2,,,,11.661111,19.675,12.75,...,,,,,,,,,,
2,18.415753,19.514013,17.7,,,,,7.725,25.3,,...,14.35,9.02,,,,,,,,
3,11.930323,12.313208,14.6,6.742857,,,,5.034286,,,...,7.3,,20.95,,,,,,,
4,9.621505,12.241243,,1.6,,,,3.45,,,...,,,,,,,,,,
5,6.566972,7.297059,8.1,4.020833,,,,4.2375,,,...,,,,,1.9,,8.3,9.075,,
6,7.896842,7.225,6.2,5.05,,,,,,,...,,,,,,,,,12.5,
7,10.387597,12.793694,,10.9,,,,16.3,,,...,,,,16.65,,4.6,12.4,7.028571,9.25,1.6
8,9.799107,7.943299,18.8,,,,,9.95,,,...,,,,,,,,5.033333,,
9,6.452308,8.588525,8.185714,8.0,,,,5.7,,,...,,,,,,,,5.2,,
10,4.838554,8.842484,1.9,5.71,,,,5.44,,,...,,,,,,,,,,


### Cleaning DataFrames

Pandas has two operations for dealing with NaN values.

#### dropna()

This will drop columns or rows that contain NaN values from a dataframe. 

You can drop NaN values from both columns and rows. 

##### Dropping Columns
`df.dropna(axis = 1)`
##### Dropping Rows
`df.dropna(axis = 0)`

There are multiple options to control exactly which rows / columns get dropped. These can be found here: [dropna documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html)

#### fillna()

This method allows you to fill in any NaN values with a constant value

`df.fillna('value')`
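
The sketch below applies both methods to a small made-up dataframe so the effect of each option is visible. The `thresh` option keeps only rows / columns that have at least that many non-NaN values.

```python
import numpy as np
import pandas as pd

# hypothetical dataframe with some missing values
df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, np.nan, 6.0],
                   'c': [7.0, 8.0, 9.0]})

rows_kept = df.dropna(axis=0)               # keep rows with no NaN at all
cols_kept = df.dropna(axis=1)               # keep columns with no NaN at all
mostly_full = df.dropna(axis=1, thresh=2)   # keep columns with at least 2 non-NaN values
filled = df.fillna(0)                       # replace every NaN with 0

print(rows_kept.index.tolist())
print(cols_kept.columns.tolist())
print(mostly_full.columns.tolist())
```

Neither method changes `df` itself; each returns a new dataframe.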

In [7]:
# let's keep only the columns that have at least 7 non-NaN values (thresh = 7)

dropped_pt = pd.DataFrame(pt.dropna(axis = 1, thresh = 7))
dropped_pt

Weather,Clear,Cloudy,Drizzle,"Drizzle,Fog",Fog,Mainly Clear,Mostly Cloudy,Rain,Rain Showers,"Rain,Fog"
Month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,25.280519,22.895862,20.7,20.2,11.661111,30.133333,26.484756,18.09,24.7,17.066667
2,18.415753,19.514013,17.7,,7.725,21.473267,19.443939,12.719048,20.15,13.4
3,11.930323,12.313208,14.6,6.742857,5.034286,13.251825,16.744715,19.783333,12.307692,14.333333
4,9.621505,12.241243,,1.6,3.45,10.659322,9.238596,17.276364,13.17037,12.983333
5,6.566972,7.297059,8.1,4.020833,4.2375,7.804762,7.135079,8.161538,7.787097,8.614286
6,7.896842,7.225,6.2,5.05,,8.611,8.030374,4.616981,5.927778,3.0
7,10.387597,12.793694,,10.9,16.3,11.129866,10.637143,9.1,10.9875,
8,9.799107,7.943299,18.8,,9.95,10.491935,9.815169,4.622222,7.434783,1.957143
9,6.452308,8.588525,8.185714,8.0,5.7,8.71588,8.168553,9.008,8.711111,4.875
10,4.838554,8.842484,1.9,5.71,5.44,7.36875,9.209649,8.84375,7.9,5.793333


In [8]:
# Now lets fill in the remaining NaN values with 0
filled_pt = pd.DataFrame(dropped_pt.fillna(0))
filled_pt

Weather,Clear,Cloudy,Drizzle,"Drizzle,Fog",Fog,Mainly Clear,Mostly Cloudy,Rain,Rain Showers,"Rain,Fog"
Month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,25.280519,22.895862,20.7,20.2,11.661111,30.133333,26.484756,18.09,24.7,17.066667
2,18.415753,19.514013,17.7,0.0,7.725,21.473267,19.443939,12.719048,20.15,13.4
3,11.930323,12.313208,14.6,6.742857,5.034286,13.251825,16.744715,19.783333,12.307692,14.333333
4,9.621505,12.241243,0.0,1.6,3.45,10.659322,9.238596,17.276364,13.17037,12.983333
5,6.566972,7.297059,8.1,4.020833,4.2375,7.804762,7.135079,8.161538,7.787097,8.614286
6,7.896842,7.225,6.2,5.05,0.0,8.611,8.030374,4.616981,5.927778,3.0
7,10.387597,12.793694,0.0,10.9,16.3,11.129866,10.637143,9.1,10.9875,0.0
8,9.799107,7.943299,18.8,0.0,9.95,10.491935,9.815169,4.622222,7.434783,1.957143
9,6.452308,8.588525,8.185714,8.0,5.7,8.71588,8.168553,9.008,8.711111,4.875
10,4.838554,8.842484,1.9,5.71,5.44,7.36875,9.209649,8.84375,7.9,5.793333


# Merging / Joining / Concatenating DataFrames

It's nice when data is centralized, but more often than not we work with data spread out across multiple tables or DataFrames. 

Pandas makes it easy to merge, join, and union DataFrames using syntax that will look familiar to SQL users.

## Merging / Joining

To accomplish a join between two DataFrames you use the Pandas DataFrame merge method: [Merge Method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) <br>

Merging Notes:
* A merge requires two DataFrames
* The two DataFrames will need to contain some key column(s) to join on
  * These keys can be explicitly declared within the function, or if the keys are contained in the indices of the DataFrames then you can merge using the indices
* If the same column name exists in both DataFrames Pandas will automatically rename the columns, adding a suffix (defaults = _x and _y) to differentiate between the two
  * you can set your own suffixes to use if you would like
  * to avoid bringing over duplicate columns, select only the columns you need from each DataFrame before merging (the `copy` keyword only controls whether data is copied; it does not drop columns)
* The default join method is an inner join; you can specify three other joins (left, right, and outer) using the `how` keyword


[Merging Examples](https://pandas.pydata.org/pandas-docs/stable/merging.html)
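
As a sketch of the suffix behaviour and the `how` keyword described in the notes above, the example below merges two made-up DataFrames that share a `value` column name. The suffix strings are arbitrary choices for illustration.

```python
import pandas as pd

# hypothetical DataFrames that both contain a column called 'value'
left = pd.DataFrame({'key': ['r1', 'r2', 'r3'], 'value': [1, 2, 3]})
right = pd.DataFrame({'key': ['r2', 'r3', 'r4'], 'value': ['b', 'c', 'd']})

# outer join keeps keys from both sides; suffixes rename the clashing columns
outer = left.merge(right, how='outer', on='key', suffixes=('_left', '_right'))
print(outer)
```

Keys that exist on only one side (r1 and r4) get NaN in the columns coming from the other DataFrame.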





In [9]:
#create two example data frames a and b
import pandas as pd
import numpy as np

a = pd.DataFrame({'lkey':['r1','r2','r3','r4','r5'],'ldata':[1,2,3,4,5]})
b = pd.DataFrame({'rkey':['r1','r2','r3','r4'],'rdata':['a','b','c','d']})

print(a)
print(b)

   ldata lkey
0      1   r1
1      2   r2
2      3   r3
3      4   r4
4      5   r5
  rdata rkey
0     a   r1
1     b   r2
2     c   r3
3     d   r4


In [10]:
#perform an inner join between a and b
# a = left table, b = right table

merged_df = a.merge(b,
                    how='inner',
                    left_on = 'lkey',
                    right_on = 'rkey'
                   )

merged_df

Unnamed: 0,ldata,lkey,rdata,rkey
0,1,r1,a,r1
1,2,r2,b,r2
2,3,r3,c,r3
3,4,r4,d,r4


In [11]:
#perform a left join between a and b
# a = left table, b = right table

merged_df = a.merge(b,
                    how='left',
                    left_on = 'lkey',
                    right_on = 'rkey'
                   )

merged_df

Unnamed: 0,ldata,lkey,rdata,rkey
0,1,r1,a,r1
1,2,r2,b,r2
2,3,r3,c,r3
3,4,r4,d,r4
4,5,r5,,
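
The right and outer joins mentioned in the merging notes follow the same pattern. The sketch below is illustrative only: it assumes a modified right-hand table (`b_extra`, with a made-up key `r6`) so that both sides have an unmatched row, and it uses the `indicator` keyword to show where each row came from.

```python
import pandas as pd

a = pd.DataFrame({'lkey': ['r1', 'r2', 'r3', 'r4', 'r5'], 'ldata': [1, 2, 3, 4, 5]})
# Assumption for illustration: an extra key 'r6' that has no match in a
b_extra = pd.DataFrame({'rkey': ['r1', 'r2', 'r3', 'r4', 'r6'],
                        'rdata': ['a', 'b', 'c', 'd', 'e']})

# Right join: keep every row of the right table
right_df = a.merge(b_extra, how='right', left_on='lkey', right_on='rkey')

# Outer join: keep every row of both tables;
# indicator=True adds a '_merge' column (both / left_only / right_only)
outer_df = a.merge(b_extra, how='outer', left_on='lkey', right_on='rkey',
                   indicator=True)

print(right_df)
print(outer_df)
```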


In [12]:
#perform a join using the index of each DataFrame

#print indices

print('df a indices:',a.index.tolist())
print('df b indices:',b.index.tolist())

merged_df = a.merge(b,
                    how='inner',
                    left_index = True,
                    right_index = True
                   )
merged_df

df a indices: [0, 1, 2, 3, 4]
df b indices: [0, 1, 2, 3]


Unnamed: 0,ldata,lkey,rdata,rkey
0,1,r1,a,r1
1,2,r2,b,r2
2,3,r3,c,r3
3,4,r4,d,r4


In [13]:
# ADVANCED -- set the index of each dataframe and then join

a2 = a.set_index('lkey')
print('a2 indices:', a2.index.tolist())


b2 = b.set_index('rkey')
print('b2 indices:', b2.index.tolist())

merged_df = a2.merge(b2,left_index = True, right_index = True)
merged_df

a2 indices: ['r1', 'r2', 'r3', 'r4', 'r5']
b2 indices: ['r1', 'r2', 'r3', 'r4']


Unnamed: 0,ldata,rdata
r1,1,a
r2,2,b
r3,3,c
r4,4,d
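
When the keys already live in the index, the DataFrame `join` method is a shorthand for an index-on-index merge. A small sketch, rebuilding the same `a2` and `b2` frames so that it runs on its own (note that `join` defaults to a left join, so `how='inner'` is passed explicitly here):

```python
import pandas as pd

a = pd.DataFrame({'lkey': ['r1', 'r2', 'r3', 'r4', 'r5'], 'ldata': [1, 2, 3, 4, 5]})
b = pd.DataFrame({'rkey': ['r1', 'r2', 'r3', 'r4'], 'rdata': ['a', 'b', 'c', 'd']})

a2 = a.set_index('lkey')
b2 = b.set_index('rkey')

# join() matches on the index of both frames
joined_df = a2.join(b2, how='inner')
print(joined_df)
```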


## Concatenating DataFrames

Pandas provides the `concat()` function to combine two (or more) DataFrames into one

[Pandas concat()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html)

Requirements:
* Two (or more) DataFrames
* To perform a typical union, the column names must match across the DataFrames
* If column names do not match, or one DataFrame has more columns than the other, all columns will be present in the unioned DataFrame and null values (NaN) will be filled in wherever data is missing


In [14]:
#create two example data frames c and d
import pandas as pd
import numpy as np

c = pd.DataFrame({'key':['r1','r2'],'data':[1,2]})
d = pd.DataFrame({'key':['r3','r4'],'data':[3,4]})

print(c)
print(d)

   data key
0     1  r1
1     2  r2
   data key
0     3  r3
1     4  r4


In [15]:
#union the two DataFrames
unioned_df = pd.concat([c,d])
unioned_df

Unnamed: 0,data,key
0,1,r1
1,2,r2
0,3,r3
1,4,r4


In [16]:
#create two DataFrames with different column names (data1 and data2)
e = pd.DataFrame({'key':['r1','r2'],'data1':[1,2]})
f = pd.DataFrame({'key':['r3','r4'],'data2':[3,4]})

#union the two DataFrames (ignore_index = True will automatically renumber the index for you (only use if index is meaningless))
unioned_df = pd.concat([e,f],ignore_index = True)
unioned_df[['key','data1','data2']]

Unnamed: 0,key,data1,data2
0,r1,1.0,
1,r2,2.0,
2,r3,,3.0
3,r4,,4.0


In [17]:
#create two DataFrames where one has an extra column (data3)
e = pd.DataFrame({'key':['r1','r2'],'data1':[1,2]})
f = pd.DataFrame({'key':['r3','r4'],'data2':[3,4], 'data3':[4,5]})

#union the two DataFrames (ignore_index = True will automatically renumber the index for you (only use if index is meaningless))
unioned_df = pd.concat([e,f],ignore_index = True)
unioned_df[['key','data1','data2','data3']]

Unnamed: 0,key,data1,data2,data3
0,r1,1.0,,
1,r2,2.0,,
2,r3,,3.0,4.0
3,r4,,4.0,5.0


## Iterating Through a DataFrame

### iterrows()
If you want to loop through every row of a DataFrame, one of the most straightforward ways is to use the built-in `iterrows()` method. It is convenient rather than fast, so on large DataFrames prefer a vectorized operation where one exists.

This works like the `enumerate()` function in that it returns an index and an object on each iteration. In this case the object is a pandas Series representing all the data in a single row.

Syntax

    for idx, obj in df.iterrows():
        print(idx)
        print(obj)
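
A self-contained sketch of the pattern, using a small made-up DataFrame (`df` and its columns are assumptions for this illustration only):

```python
import pandas as pd

# Hypothetical example frame
df = pd.DataFrame({'key': ['r1', 'r2', 'r3'], 'data': [1, 2, 3]})

total = 0
for idx, row in df.iterrows():
    # row is a Series; individual values are looked up by column name
    print(idx, row['key'], row['data'])
    total += row['data']

print('total:', total)
```

For a simple sum like this, `df['data'].sum()` gives the same answer without a loop; the loop is shown only to illustrate what `iterrows()` yields.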


### apply()

If you want to perform an operation on all of the rows or columns of a DataFrame, or on every value in a single column, a convenient way to do this is the `apply()` method.

Called on a column (a Series), `apply()` runs a function on each value, which covers basic algebra (e.g. raising every cell in a column to a certain exponent) as well as custom functions. Called on a DataFrame, it passes each column (or each row, with `axis=1`) to the function as a Series.
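
A minimal sketch of both uses described above. The DataFrame, its column names, and the helper `c_to_f` are made up for this illustration.

```python
import pandas as pd

# Hypothetical example frame
df = pd.DataFrame({'city': ['Boston', 'Austin', 'Denver'],
                   'temp_c': [2.0, 18.0, 7.5]})

# Basic algebra on one column: square every value
df['temp_squared'] = df['temp_c'].apply(lambda x: x ** 2)

# Custom function applied to every value in a column
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

df['temp_f'] = df['temp_c'].apply(c_to_f)

# Row-wise apply: axis=1 passes each row to the function as a Series
df['label'] = df.apply(lambda row: f"{row['city']}: {row['temp_f']:.1f}F", axis=1)

print(df)
```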


In [20]:
#iterrows and apply example
# lets bring back the pivoted weather table
filled_pt.head(5)



Weather,Clear,Cloudy,Drizzle,"Drizzle,Fog",Fog,Mainly Clear,Mostly Cloudy,Rain,Rain Showers,"Rain,Fog"
Month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,25.280519,22.895862,20.7,20.2,11.661111,30.133333,26.484756,18.09,24.7,17.066667
2,18.415753,19.514013,17.7,0.0,7.725,21.473267,19.443939,12.719048,20.15,13.4
3,11.930323,12.313208,14.6,6.742857,5.034286,13.251825,16.744715,19.783333,12.307692,14.333333
4,9.621505,12.241243,0.0,1.6,3.45,10.659322,9.238596,17.276364,13.17037,12.983333
5,6.566972,7.297059,8.1,4.020833,4.2375,7.804762,7.135079,8.161538,7.787097,8.614286


In [29]:
# use iterrows to loop through every row (i.e. month) and return 
# the weather condition with the highest misery score

for idx, row in filled_pt.iterrows():
    weather = ''
    cell = 0
    for row_weather, misery in row.items(): #iterates (label, value) pairs of a Series; replaces iteritems(), removed in pandas 2.0
        if misery > cell:
            cell = misery
            weather = row_weather
    print(idx,weather)
        

1 Mainly Clear
2 Mainly Clear
3 Rain
4 Rain
5 Rain,Fog
6 Mainly Clear
7 Fog
8 Drizzle
9 Rain
10 Mostly Cloudy
11 Rain
12 Mainly Clear


In [39]:
# apply example
# return the mean misery score for each type of weather 
# across all 12 months
filled_pt.apply(pd.Series.mean)

Weather
Clear            11.916625
Cloudy           12.771880
Drizzle          10.383532
Drizzle,Fog       6.132530
Fog               7.626677
Mainly Clear     13.609652
Mostly Cloudy    13.140108
Rain             12.337939
Rain Showers     12.023714
Rain,Fog          7.823719
dtype: float64

# Outputting DataFrames

Earlier we loaded a .CSV file into a Pandas DataFrame with the `read_csv()` function (from Pandas). Unsurprisingly, writing files back to disk is just as easy with the `to_csv()` method, which is called on the DataFrame itself.

"But what if I'm using a filetype other than .CSV?!?!" Thankfully for you, the authors of Pandas anticipated this and incorporated a suite of functions for writing different filetypes.

You are encouraged to visit the pandas documentation on [Input/Output](https://pandas.pydata.org/pandas-docs/stable/io.html) to see the family of reader/writer functions and the various arguments they can take.

In [19]:
# Example of writing filled_pt DataFrame to disk
# to_csv is a DataFrame method, not a module-level pandas function
filled_pt.to_csv('filled_pt.csv')  # 'filled_pt.csv' is a hypothetical output filename
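
A minimal round-trip sketch: write a DataFrame with the `to_csv()` method and read it back with `read_csv()`. The DataFrame and the file name `example_output.csv` are made up for this illustration, and a temporary directory is used so nothing is left behind on disk.

```python
import os
import tempfile

import pandas as pd

# Hypothetical example frame
df = pd.DataFrame({'key': ['r1', 'r2'], 'data': [1, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'example_output.csv')
    # index=False skips writing the row index as an extra column
    df.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

print(reloaded)
```

If the index carries meaning (as `Month` does in the pivoted weather table), leave `index` at its default so the index is written out as well.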

# <font color = '#526520'> Other Python Tools for Analytics </font>

So far we've covered mostly vanilla (i.e. built-in) Python, along with an introduction to the powerful *Pandas* module. While *Pandas* provides the friendly dataframe object, as you strive for greater insights from your analyses in Python it is likely that you'll need additional tools and resources. Let's briefly mention other modules useful for analytics (so that you're at least aware of their names and what they do) and see where you can find more information for when the time comes.

## SciPy
SciPy isn't a single package per se but rather a *stack* of open source software for scientific computing in Python. There are several packages associated with SciPy, with some of the most common being:

* **NumPy** - fundamental package for numerical computation, which defines the *numerical array* and *matrices* <br> <br>
* **SciPy Library** - collection of numerical algorithms and domain-specific tools <br> <br>
* **Matplotlib** - a data visualization package <br> <br>
* **Pandas** - seminal package for working with DataFrames in Python <br> <br>
* **Scikits** - a family of packages used for machine learning, computer vision, and various other domains of data analysis <br> <br>

You can find all this information (and more!) at the [SciPy Homepage](https://www.scipy.org/)

## Scikit-learn
If you are looking to perform any sort of statistical modeling/machine learning, then *scikit-learn* is your go-to module. It contains a thorough collection of algorithms for classification, regression, clustering, dimensionality reduction, and more!

Although it is not covered directly in this training, exploring the [Scikit-learn website](http://scikit-learn.org/stable/) is highly recommended.


# Example Time