# Lecture 4 Notebook
Duncan Callaway
September 10 2019

This lecture continues introducing the class to Pandas and goes into "groupby".

In class I worked in "Duncan's Lecture 4 in class workbook.ipynb"

In [None]:
import numpy as np
import pandas as pd

## Recap last lecture

#### Data frame vs dict of lists

In [None]:
fruit_info={'fruit':['apple','banana','orange','raspberry'],
                  'color':['red','yellow','orange','pink'],
                  'weight':[120,150,250,15]
         }
fruit_info_df = pd.DataFrame(data = fruit_info)
print(fruit_info)
fruit_info_df

The data frame has:
1. column headers,
2. an index column,
3. rows,
4. columns,
5. numeric and text entries, though every entry within a single column has the same type.

#### "the index"

Note this is different from positional indexing.  Here we're talking about the set of labels that identifies each row of the frame.

In [None]:
fruit_info_df.index

What I was *expecting* was a list from 0 to 3, e.g. `[0, 1, 2, 3]`

But I got a `RangeIndex` instead.  This is just a more compact way of giving the same information: start at 0, stop before 4, step by 1.  However note, if I do this:

In [None]:
fruit_info_df.index = ['zero', 'one', 'two', 'three']

In [None]:
fruit_info_df.index

...then I get what I expected.  More on indices in a moment.
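
Here is a small sketch of the point about the default index. The frame `demo` is made up for illustration and is not part of the lecture data. Converting the `RangeIndex` to a list gives the values I was expecting.

```python
import pandas as pd

# Hypothetical frame, only used to illustrate the default index
demo = pd.DataFrame({"weight": [120, 150, 250, 15]})

# With no index supplied, pandas stores a compact RangeIndex
default_index = demo.index
as_list = list(default_index)

print(type(default_index).__name__)
print(as_list)
```

The `RangeIndex` stores only start, stop and step, which is why it prints differently from an explicit list of labels.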

#### loc and iloc

`.loc` identifies location by row (index) labels and column names.  

In [None]:
fruit_info_df.loc['zero':'two', 'fruit':'weight']

Note, `.loc` is inclusive: the slice includes the end label!

.iloc identifies location by number -- just like indexing in numpy.

In [None]:
fruit_info_df.iloc[0:3,1:3]

.iloc is exclusive on the end location value.  
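
To make the inclusive versus exclusive difference concrete, here is a minimal sketch on a made-up three-row frame (`demo` is a hypothetical name, and the values are just copied from the fruit example).

```python
import pandas as pd

# Hypothetical frame with string row labels
demo = pd.DataFrame(
    {"fruit": ["apple", "banana", "orange"], "weight": [120, 150, 250]},
    index=["zero", "one", "two"],
)

# .loc slices by label and keeps the end label
by_label = demo.loc["zero":"one", "fruit":"weight"]

# .iloc slices by position and stops before the end position
by_position = demo.iloc[0:1, 0:1]

print(by_label.shape)
print(by_position.shape)
```

The two slices look similar, but the label slice returns two rows and two columns while the positional slice returns a single cell.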

## Back to our question: which hour had the most wind...

In [None]:
caiso_data_stack = pd.read_csv('CAISO_2017to2018_stack.csv', index_col= 0)

Let's make the name shorter to save me typing:

In [None]:
cds = caiso_data_stack
cds.head()

Let's look at some info about the data:

In [None]:
cds.shape

In [None]:
cds.size

What did those two commands give us? <br>
`shape`: (number of rows, number of columns of *data*) <br>
`size`: total number of cell entries.

Notice these numbers don't include what's in the index.
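
A quick check of that claim on a tiny made-up frame (`demo` is hypothetical): the index labels are not counted in either number.

```python
import pandas as pd

# Hypothetical 3 x 2 frame with a labeled index
demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

n_rows, n_cols = demo.shape   # rows and data columns only
n_cells = demo.size           # rows times data columns

print(n_rows, n_cols, n_cells)
```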

Here's something fun -- 

In [None]:
cds.describe()

## Logical indexing
Logical indexing is an extremely powerful way to pull data out of a frame.  
For example, with the stacked data frame, let's pull out only wind generation.

First, I'll show you a boolean series based on comparisons to the 'Source' data column:

In [None]:
(cds['Source']=='WIND TOTAL').head()

Now we can embed that inside the `.loc` method:

In [None]:
cds.loc[cds['Source']=='WIND TOTAL',:].head()
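
Here is a self-contained sketch of the same idea on made-up data. The frame `stack` and the `'SOLAR PV'` label are assumptions for illustration only; I'm just mimicking a `Source` column and an `MWh` column. It also shows that boolean series can be combined with `&`, as long as each comparison is wrapped in parentheses.

```python
import pandas as pd

# Hypothetical stacked frame; values are invented
stack = pd.DataFrame(
    {
        "Source": ["WIND TOTAL", "SOLAR PV", "WIND TOTAL", "SOLAR PV"],
        "MWh": [100.0, 0.0, 80.0, 50.0],
    }
)

# Boolean series: True where the row is wind
is_wind = stack["Source"] == "WIND TOTAL"

# Use the boolean series as the row selector in .loc
wind_rows = stack.loc[is_wind, :]

# Combine two conditions and pick a single column
high_wind = stack.loc[is_wind & (stack["MWh"] > 90), "MWh"]

print(wind_rows)
print(high_wind)
```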

Ok.  Any ideas how we can use that to get the information we want?  Reminder, the question is:

What hour of the day had the lowest average wind power in California in the last 12 months?

In [None]:
wind = cds.loc[cds['Source']=='WIND TOTAL',:]

What is the data structure of `wind`?

In [None]:
type(wind)

Next week we'll use pivots to do this better, but for now let's use a for loop to get information by hour. 

First thing to do is figure out how to get the hour out of the index.

[`datetime.strptime`](https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior) is useful for this if you're working on individual dates.

But `pd.to_datetime` is even better, especially if you're working on a lot of values in a list (or as the case will be, values in a pandas series).
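
A small side-by-side sketch of the two approaches. The timestamps and their format are made up; the real index in the CAISO file may be formatted differently.

```python
from datetime import datetime

import pandas as pd

# Hypothetical timestamp strings (format assumed for this sketch)
stamps = ["2017-08-29 00:00:00", "2017-08-29 13:00:00"]

# strptime handles one string at a time and needs an explicit format
single = datetime.strptime(stamps[0], "%Y-%m-%d %H:%M:%S")

# pd.to_datetime converts the whole list at once
many = pd.to_datetime(stamps)

print(single.hour)
print(list(many.hour))
```

`pd.to_datetime` returns a `DatetimeIndex` here, so attributes like `hour` apply to every value at once.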

In [None]:
windex = pd.to_datetime(wind.index)
windex

That looks basically the same as what we had before the `to_datetime` transformation.  But the resulting `DatetimeIndex` has a number of attributes, such as `hour`:

In [None]:
windex.hour

Now we can scroll through the wind dataframe, grabbing one hour at a time for averaging:

In [None]:
wind_ave = [] # initializes a list to populate
for i in range(0,24):
    # average the 'MWh' column only; 'Source' holds text and can't be averaged
    wind_ave.append(wind.loc[windex.hour == i, 'MWh'].mean())

In [None]:
print(wind_ave)

In [None]:
type(wind_ave)

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.plot(wind_ave)

We can see pretty clearly that the min is 10 or 11...let's dig a little more.

One way to do this is to drop the data into a data frame and then *sort* the data frame.

In [None]:
df_wind = pd.DataFrame(wind_ave)
df_wind

I'm going to be adding more MWh values to the data frame in just a moment, so let's be clear that this column is the average:

In [None]:
df_wind.columns = ['Average MWh']

In [None]:
df_wind.sort_values(by='Average MWh',ascending=True).head()

Ok -- so it looks as though mid-day is the minimum *average*.  

Nice to see that the index values were preserved.

Note that I didn't permanently change the dataframe to have data in the ascending order -- I just displayed a sorted copy.
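
A minimal sketch of that behavior, on a made-up three-row frame (`demo` and `ranked` are hypothetical names): `sort_values` hands back a new frame and leaves the original order alone, and `idxmin` is a shortcut if all you want is the label of the smallest value.

```python
import pandas as pd

# Hypothetical averages for three hours
demo = pd.DataFrame({"Average MWh": [3.0, 1.0, 2.0]})

# Sorting returns a new frame; the index labels travel with the rows
ranked = demo.sort_values(by="Average MWh", ascending=True)

# Label of the row holding the minimum
lowest = demo["Average MWh"].idxmin()

print(ranked)
print(list(demo.index))   # original order is unchanged
print(lowest)
```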

But what's the range? For each hour:

In [None]:
wind_min = [] # initializes a list to populate
wind_max = [] # initializes a list to populate
for i in range(0,24):
    # take the min and max of the 'MWh' column only
    wind_min.append(wind.loc[windex.hour == i, 'MWh'].min())
    wind_max.append(wind.loc[windex.hour == i, 'MWh'].max())

In [None]:
wind_max[0]

In [None]:
# each list has 24 entries, one per hour, matching the rows of df_wind
df_wind['min MWh'] = wind_min
df_wind['max MWh'] = wind_max

In [None]:
df_wind

In [None]:
plt.plot(df_wind)
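
Since this lecture is heading toward "groupby", here is a hedged sketch of how the hourly loops above could be collapsed into one call. The frame `demo_wind` is made up (two days of invented hourly values) and only stands in for the real `wind` frame; the column name `MWh` is taken from the data above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 48 hourly rows with invented MWh values 0..47
idx = pd.date_range("2018-01-01", periods=48, freq="h")
demo_wind = pd.DataFrame({"MWh": np.arange(48, dtype=float)}, index=idx)

# Group rows by hour of day, then compute three statistics per group
hourly = demo_wind.groupby(demo_wind.index.hour)["MWh"].agg(["mean", "min", "max"])

# Hour with the lowest average
lowest_hour = hourly["mean"].idxmin()

print(hourly.head())
print(lowest_hour)
```

The result has one row per hour and one column per statistic, which is the same shape of table the loops build up by hand.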

## Row and column labels
The columns are identified with a list of values.  Let's look at the fruit data set again:

In [None]:
fruit_info_df.columns

In [None]:
type(fruit_info_df.columns)

The rows are similarly labeled:

In [None]:
fruit_info_df.index

In [None]:
type(fruit_info_df.index)

They are both the same data type within Pandas: the `Index`.

Note, we can do a bunch of other stuff:

## Merging
Let's make another data frame and tack it on to the first:

In [None]:
price_df = pd.DataFrame({'price':[0.5, 0.65, 1, 0.15],
                        'frut':['apple', 'banana', 'orange', 'rasberry']})
price_df

In [None]:
fruit_info_df

Now let's blindly merge:

In [None]:
pd.merge(price_df,fruit_info_df)

What went wrong?

The two frames share no column name, because we misspelled 'fruit' as 'frut' in the new frame.  There are two ways to fix this.  First, specify the key columns directly:

In [None]:
pd.merge(price_df,fruit_info_df, left_on = 'frut', right_on = 'fruit')

Second, fix the spelling and *don't* tell pandas which columns to use.  In that case pandas merges on whatever column names the two frames have in common.

In [None]:
price_df.columns[1]='fruit'  # raises TypeError: Index does not support mutable operations

Bummer!  Can't mutate index values one at a time.  What to do?  Replace the whole list of column names:

In [None]:
col_list = list(price_df.columns)
col_list

In [None]:
col_list[1] = 'fruit'

In [None]:
price_df.columns = col_list
price_df
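
An alternative to rebuilding the list of column names is `rename`, sketched here on a made-up two-row frame (`prices` is a hypothetical name). It maps old names to new ones and returns a new frame.

```python
import pandas as pd

# Hypothetical frame with the misspelled column
prices = pd.DataFrame({"price": [0.5, 0.65], "frut": ["apple", "banana"]})

# Map the old column name to the new one; other columns are untouched
prices = prices.rename(columns={"frut": "fruit"})

print(list(prices.columns))
```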

In [None]:
pd.merge(fruit_info_df,price_df)

Note we can use different syntax:

In [None]:
fruit_info_df.merge(price_df)

Now we're still missing raspberries -- why?

Again, spelling error in the new frame.  Let's fix:

In [None]:
price_df.loc[3,'fruit'] = 'raspberry'

Note that, unlike the column labels, individual entries in the data frame itself can be changed directly.  They are mutable.

In [None]:
fruit_info_df.merge(price_df)
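
One way to spot a mismatched key like this without eyeballing the output is an outer merge with `indicator=True`. This is a sketch on made-up frames (`left` and `right` are hypothetical names), not what I did in class.

```python
import pandas as pd

# Hypothetical frames; the key is misspelled in the second one
left = pd.DataFrame({"fruit": ["apple", "raspberry"], "weight": [120, 15]})
right = pd.DataFrame({"fruit": ["apple", "rasberry"], "price": [0.5, 0.15]})

# An outer merge keeps every key; the _merge column says where each row came from
check = left.merge(right, on="fruit", how="outer", indicator=True)

# Keys that appear in only one of the two frames
unmatched = check.loc[check["_merge"] != "both", "fruit"]

print(check)
print(sorted(unmatched))
```

Rows marked `left_only` or `right_only` are the ones the default inner merge silently drops.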

Note the `fruit_info_df` data frame is still intact; to keep the merged result you'd need to assign it to a data frame name.

In [None]:
fruit_info_df

Here's a cool little factoid about data frames: a for loop over a data frame steps through its column names, so you can burn through the columns of the frame.  

In [None]:
for i in fruit_info_df:
    print(fruit_info_df.loc['one',i])

Note, there are other commands, such as `join` and `concat`, that do similar things.  

I've found merge seems to work well for most purposes.

FWIW, `pd.concat` seems to be a little more brute force -- requires more careful syntax, but likely does unexpected things less often once you understand the syntax.
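
For a sense of what "more brute force" means, here is a small sketch of `pd.concat` on two made-up one-row frames (`a` and `b` are hypothetical). It glues frames together along an axis rather than matching on a key column.

```python
import pandas as pd

# Hypothetical one-row frames with the same columns
a = pd.DataFrame({"fruit": ["apple"], "weight": [120]})
b = pd.DataFrame({"fruit": ["banana"], "weight": [150]})

# Stack rows on top of each other and renumber the index
stacked = pd.concat([a, b], ignore_index=True)

# Place frames side by side, aligned on their index labels
side_by_side = pd.concat([a, b], axis=1)

print(stacked)
print(side_by_side)
```

Note the side-by-side version ends up with duplicate column names, which is the kind of thing you have to manage yourself with `concat`.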

In [None]:
merged_df = fruit_info_df.merge(price_df)
merged_df

We can streamline by replacing the index number with the fruit column.  

What's the `inplace` argument for?  It means the data frame is modified under its original name instead of being returned as a new object.  This can be convenient, though it does not necessarily save memory, since pandas may still make a copy internally.  

In [None]:
merged_df.set_index('fruit', inplace = True)
merged_df
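
The alternative to `inplace = True` is to assign the result to a name. A minimal sketch on a made-up frame (`demo`, `indexed` and `restored` are hypothetical names):

```python
import pandas as pd

# Hypothetical frame
demo = pd.DataFrame({"fruit": ["apple", "banana"], "price": [0.5, 0.65]})

# Without inplace, set_index returns a new frame and leaves demo alone
indexed = demo.set_index("fruit")

# reset_index moves the index back into an ordinary column
restored = indexed.reset_index()

print(indexed)
print(list(demo.columns))
```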

## Multilevel indexing
We can also assign "multilevel" column or row names, like so:

In [None]:
levels = [('categorical', 'color'),('quantitative', 'weight'),('quantitative','price')]
levels

Note the use of tuples (ordered groups of values in parentheses) in setting up the multiindex.  This will come up again later.  

In [None]:
merged_df.columns = pd.MultiIndex.from_tuples(levels)
merged_df

Now we have categories and subcategories of columns:

In [None]:
merged_df['quantitative']

Note, we can also drop and add things.  With multilevel indexing things get a little tricky.  

First, we can drop everything from the top level:

In [None]:
merged_test_df = merged_df.drop(columns=[('quantitative',)], axis = 1)
merged_test_df

Note that I put the column identifier inside the parens, like a tuple, but it's not essential there.

However if we want to drop only a column from the second level, we get an error without the tuple syntax:

In [None]:
merged_test_df = merged_df.drop(columns=[('quantitative','price')], axis = 1)
merged_test_df

We can also drop rows: 

In [None]:
merged_df.drop(index=[('apple')], axis = 0, inplace = True)
merged_df

Note indexing multilevels with `.loc` gets a little tricky.  The thing to keep in mind is that you're working with tuples in each index location:

In [None]:
merged_df.loc['banana', ('quantitative', 'price')]

If you leave the second entry of the tuple empty you get all values under that top level.  

In [None]:
merged_df.loc['banana', ('quantitative', )]

You can also loop through the columns of the multilevel data frame like this: 

In [None]:
for i, j in merged_df:
    print(merged_df.loc['banana', (i, j)])
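
To pull the multilevel selections together, here is a self-contained sketch on a made-up two-row frame (`demo` is hypothetical; the level names mirror the ones above). It also shows `xs`, which selects on the second level directly.

```python
import pandas as pd

# Hypothetical frame with two-level column labels
cols = pd.MultiIndex.from_tuples(
    [("categorical", "color"), ("quantitative", "weight"), ("quantitative", "price")]
)
demo = pd.DataFrame(
    [["yellow", 150, 0.65], ["orange", 250, 1.0]],
    index=["banana", "orange"],
    columns=cols,
)

# A full tuple picks out a single cell
price = demo.loc["banana", ("quantitative", "price")]

# A top-level label alone returns every column beneath it
quant = demo.loc[:, "quantitative"]

# xs selects by a label on the second level
by_second = demo.xs("price", axis=1, level=1)

print(price)
print(list(quant.columns))
print(by_second)
```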