# Loading the Pandas library

The pandas library is not part of the Python standard library, so it must be installed separately (for example, with pip). Once it is installed, we import it into our program.

In [4]:
import pandas


# Awesome, you imported pandas, now what?

Now that the pandas library has been imported into your Python program,
it's time to load our first data set.

The pandas library can load many types of data files into your program, such as:
* CSV
* Excel
* HTML
* SQL
* JSON
* And many more

For our purposes, we will use CSV data files.
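As a rough illustration of the list above, each format has its own reader function whose name starts with read_. The sketch below calls read_csv() and read_json() on small in-memory strings (wrapped in io.StringIO) so that it runs without any files on disk. The sample strings are an assumption made for this example; they reuse two values from the usage data shown later.

```python
import io
import pandas

# Sample data held in strings instead of files (an assumption for this sketch).
csv_text = "MeterId,Value (kWh)\n10443720000759353,2.343\n10443720000759353,1.211\n"
json_text = (
    '[{"MeterId": 10443720000759353, "Value (kWh)": 2.343},'
    ' {"MeterId": 10443720000759353, "Value (kWh)": 1.211}]'
)

# Each reader returns a DataFrame, whatever the source format was.
from_csv = pandas.read_csv(io.StringIO(csv_text))
from_json = pandas.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```

Both calls produce a DataFrame with the same two columns, so the rest of your program does not need to care which file format the data came from.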

# The read_csv() function

We will use the read_csv() function to load our first data file into our program.
This function loads the data into what is called a DataFrame.
Dot notation is used to access the functions located within the pandas library, as in pandas.read_csv().

In [6]:
import pandas

dataframe = pandas.read_csv('usage_2018-08-22.csv')

print(type(dataframe))

print(dataframe.head())

<class 'pandas.core.frame.DataFrame'>
             MeterId            StartDate              EndDate  Value (kWh)  \
0  10443720000759353  2018-08-05 00:00:00  2018-08-05 01:00:00        2.343   
1  10443720000759353  2018-08-05 01:00:00  2018-08-05 02:00:00        1.211   
2  10443720000759353  2018-08-05 02:00:00  2018-08-05 03:00:00        1.199   
3  10443720000759353  2018-08-05 03:00:00  2018-08-05 04:00:00        2.251   
4  10443720000759353  2018-08-05 04:00:00  2018-08-05 05:00:00        1.064   

   Notes  
0    NaN  
1    NaN  
2    NaN  
3    NaN  
4    NaN  


## Checking our data

By running the example above, you can see that we imported the pandas library and
read in the file usage_2018-08-22.csv.

We also checked that we correctly created a DataFrame by calling the built-in Python function 'type()', which shows that the variable dataframe is of the type 'pandas.core.frame.DataFrame'.

We can also verify the content of the data file by using a method of the DataFrame object called 'head()', which shows the first 5 rows of our data.
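As a small sketch of the same idea, head() also accepts the number of rows to show, and the companion method tail() shows rows from the end. To keep the example runnable without the data file, it reads the five rows printed above from an in-memory string; that string is an assumption made for this sketch.

```python
import io
import pandas

# The first five rows of the usage file, copied from the output above.
csv_text = """MeterId,StartDate,EndDate,Value (kWh),Notes
10443720000759353,2018-08-05 00:00:00,2018-08-05 01:00:00,2.343,
10443720000759353,2018-08-05 01:00:00,2018-08-05 02:00:00,1.211,
10443720000759353,2018-08-05 02:00:00,2018-08-05 03:00:00,1.199,
10443720000759353,2018-08-05 03:00:00,2018-08-05 04:00:00,2.251,
10443720000759353,2018-08-05 04:00:00,2018-08-05 05:00:00,1.064,
"""
dataframe = pandas.read_csv(io.StringIO(csv_text))

first_two = dataframe.head(2)   # first 2 rows instead of the default 5
last_two = dataframe.tail(2)    # last 2 rows

print(first_two)
print(last_two)
```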

## Examining our data

There are many attributes associated with a DataFrame object. They include:
* shape - which tells us the number of rows and columns in the DataFrame, displayed as a tuple
* columns - which tells us the names of the columns in the DataFrame
* dtypes - which tells us the datatype of each column

Or, if you would rather see all of this information in one step, the DataFrame object also has a built-in method called 'info()', which displays the number and names of the columns, the datatypes, and more.

In [7]:
import pandas

dataframe = pandas.read_csv('usage_2018-08-22.csv')

print(dataframe.shape)

(432, 5)


In [8]:
import pandas

dataframe = pandas.read_csv('usage_2018-08-22.csv')

print(dataframe.columns)

Index(['MeterId', 'StartDate', 'EndDate', 'Value (kWh)', 'Notes'], dtype='object')


In [9]:
import pandas

dataframe = pandas.read_csv('usage_2018-08-22.csv')

print(dataframe.dtypes)

MeterId          int64
StartDate       object
EndDate         object
Value (kWh)    float64
Notes          float64
dtype: object


In [1]:
import pandas

dataframe = pandas.read_csv('usage_2018-08-22.csv')

print(dataframe.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 432 entries, 0 to 431
Data columns (total 5 columns):
MeterId        432 non-null int64
StartDate      432 non-null object
EndDate        432 non-null object
Value (kWh)    432 non-null float64
Notes          0 non-null float64
dtypes: float64(2), int64(1), object(2)
memory usage: 17.0+ KB
None


## Datatypes in pandas

As you can see above, the datatypes in pandas are a little different from those used in Python.
Here is how the pandas datatypes relate to the Python ones:

| Pandas | Python | Description |
|--------|--------|-------------|
| object | str | Text (strings); the most common type |
| int64 | int | Whole numbers |
| float64 | float | Decimal numbers |
| datetime64 | datetime | Dates and times |
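
The dtypes output earlier shows StartDate and EndDate loaded as text rather than as dates. A minimal sketch of converting such a column with pandas.to_datetime() follows; the two-row in-memory sample is an assumption made so the example runs without the data file.

```python
import io
import pandas

# Two rows copied from the usage data, held in a string for this sketch.
csv_text = """StartDate,Value (kWh)
2018-08-05 00:00:00,2.343
2018-08-05 01:00:00,1.211
"""
dataframe = pandas.read_csv(io.StringIO(csv_text))
print(dataframe.dtypes)   # StartDate is read in as text

# Convert the text column into a datetime64 column.
dataframe['StartDate'] = pandas.to_datetime(dataframe['StartDate'])
print(dataframe.dtypes)   # StartDate is now a datetime64 column

# Datetime columns expose their parts (year, day, hour, ...) through .dt
print(dataframe['StartDate'].dt.hour)
```

Converting date columns up front makes later steps, such as grouping by day, much simpler than working with the raw text.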


## Subsetting

The pandas library allows you to subset your data by column, by row, and by individual cell.

### Subset by Column Name

In [3]:
import pandas

dataframe = pandas.read_csv('usage_2018-08-22.csv')

subset = dataframe['Value (kWh)']

print(subset.head())

0    2.343
1    1.211
2    1.199
3    2.251
4    1.064
Name: Value (kWh), dtype: float64
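Selecting a single column name, as above, gives back a Series. As a short sketch, passing a list of column names instead gives back a DataFrame containing just those columns. The in-memory sample below is an assumption made so the example runs without the data file.

```python
import io
import pandas

# Three rows copied from the usage data, held in a string for this sketch.
csv_text = """MeterId,StartDate,Value (kWh)
10443720000759353,2018-08-05 00:00:00,2.343
10443720000759353,2018-08-05 01:00:00,1.211
10443720000759353,2018-08-05 02:00:00,1.199
"""
dataframe = pandas.read_csv(io.StringIO(csv_text))

single = dataframe['Value (kWh)']                  # one name  -> Series
subset = dataframe[['StartDate', 'Value (kWh)']]   # list of names -> DataFrame

print(type(single))
print(type(subset))
print(subset)
```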


### Subset by Row
Subsetting by row can be done in two ways:
* By row name (also called the index label), using the loc[ ] indexer
* By row index (the row's position), using the iloc[ ] indexer

Most of the time, the row name and the row index will be the same.
They can differ, for example, with time-based data, where the index label might be the timestamp while the row index is simply the position of the row.

The following will print out the first row from our dataset.
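To make the difference concrete, here is a small sketch in which the StartDate column is used as the index, so the row names are timestamps while the row positions are still 0, 1, 2. The in-memory sample and the use of the index_col parameter are assumptions made for this illustration; the tutorial's own examples keep the default numeric index.

```python
import io
import pandas

# Three rows copied from the usage data, held in a string for this sketch.
csv_text = """StartDate,Value (kWh)
2018-08-05 00:00:00,2.343
2018-08-05 01:00:00,1.211
2018-08-05 02:00:00,1.199
"""
# index_col makes the StartDate text the row name (index label).
dataframe = pandas.read_csv(io.StringIO(csv_text), index_col='StartDate')

by_label = dataframe.loc['2018-08-05 01:00:00']   # look up by row name
by_position = dataframe.iloc[1]                   # look up by row position

print(by_label)
print(by_position)
```

Both lookups return the same row here, but loc[1] would fail on this DataFrame because no row is named 1.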

In [7]:
import pandas

dataframe = pandas.read_csv('usage_2018-08-22.csv')

print(dataframe.loc[0])
print(type(dataframe.loc[0]))

MeterId          10443720000759353
StartDate      2018-08-05 00:00:00
EndDate        2018-08-05 01:00:00
Value (kWh)                  2.343
Notes                          NaN
Name: 0, dtype: object
<class 'pandas.core.series.Series'>


In [8]:
import pandas

dataframe = pandas.read_csv('usage_2018-08-22.csv')

print(dataframe.iloc[0])
print(type(dataframe.iloc[0]))

MeterId          10443720000759353
StartDate      2018-08-05 00:00:00
EndDate        2018-08-05 01:00:00
Value (kWh)                  2.343
Notes                          NaN
Name: 0, dtype: object
<class 'pandas.core.series.Series'>


As you can see, selecting a single row with loc[ ] or iloc[ ] does not return the data as a DataFrame object. Rather, it returns a Series object. The Series is the second major data type that the pandas library provides.
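As a brief sketch of this behaviour, asking for one row by a single label gives a Series, while wrapping that label in a list gives a one-row DataFrame. The in-memory sample is an assumption made so the example runs without the data file.

```python
import io
import pandas

# Two rows copied from the usage data, held in a string for this sketch.
csv_text = """MeterId,Value (kWh)
10443720000759353,2.343
10443720000759353,1.211
"""
dataframe = pandas.read_csv(io.StringIO(csv_text))

as_series = dataframe.loc[0]       # single label    -> Series
as_frame = dataframe.loc[[0]]      # list of labels  -> DataFrame

print(type(as_series))
print(type(as_frame))
```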

### Subsetting by row and column
Pandas allows you to subset by both column name and row name/index using loc[ ] and iloc[ ]. To do this, pass in the rows and the columns as two comma-separated lists.

In [11]:
import pandas

dataframe = pandas.read_csv('usage_2018-08-22.csv')

print(dataframe.loc[[0],['MeterId']])

             MeterId
0  10443720000759353


You can also use Python's built-in slicing syntax to select multiple rows and columns.

In [13]:
import pandas

dataframe = pandas.read_csv('usage_2018-08-22.csv')

print(dataframe.iloc[:4, :2])

             MeterId            StartDate
0  10443720000759353  2018-08-05 00:00:00
1  10443720000759353  2018-08-05 01:00:00
2  10443720000759353  2018-08-05 02:00:00
3  10443720000759353  2018-08-05 03:00:00
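One detail worth sketching: slices behave differently in the two indexers. With iloc[ ] the end position is excluded, as in ordinary Python slicing, while with loc[ ] the end label is included. The in-memory sample below is an assumption made so the example runs without the data file.

```python
import io
import pandas

# Four rows copied from the usage data, held in a string for this sketch.
csv_text = """MeterId,Value (kWh)
10443720000759353,2.343
10443720000759353,1.211
10443720000759353,1.199
10443720000759353,2.251
"""
dataframe = pandas.read_csv(io.StringIO(csv_text))

by_position = dataframe.iloc[0:2, 1]           # positions 0 and 1 (end excluded)
by_label = dataframe.loc[0:2, 'Value (kWh)']   # labels 0, 1 and 2 (end included)

print(len(by_position))
print(len(by_label))
```

This is why a label slice such as loc[i:i+23] covers 24 rows, not 23.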


### Grouping and Calculations

Subsetting is one way to group your data. Another is the groupby() method, which takes the name of the column you want to group by as its parameter.

Once you have grouped your data, you can select the specific column that you want to perform a calculation on, and then perform the calculation on that column.
Available calculations include the sum() and mean() methods.

The following code groups the data by the MeterId column, selects the Value (kWh) column from the grouped data, and then adds up all of the values for each MeterId (in this case there is only one MeterId).

In [20]:
import pandas

dataframe = pandas.read_csv('usage_2018-08-22.csv')
group = dataframe.groupby('MeterId')
group_value = group['Value (kWh)']
group_total = group_value.sum()
print(group_total.head())

MeterId
10443720000759353    1003.443
Name: Value (kWh), dtype: float64
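As a further sketch, several calculations can be requested in one step with the agg() method of the grouped column, here both sum and mean. The in-memory sample holds only the first five rows of the usage data, an assumption made so the example runs without the file, so its totals are much smaller than the full-file result above.

```python
import io
import pandas

# The first five rows of the usage data, held in a string for this sketch.
csv_text = """MeterId,Value (kWh)
10443720000759353,2.343
10443720000759353,1.211
10443720000759353,1.199
10443720000759353,2.251
10443720000759353,1.064
"""
dataframe = pandas.read_csv(io.StringIO(csv_text))

group_value = dataframe.groupby('MeterId')['Value (kWh)']

# agg() returns one row per MeterId and one column per calculation.
summary = group_value.agg(['sum', 'mean'])
print(summary)
```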


## Exercise 

The provided file, usage_2018-08-22.csv, contains hour-by-hour electric usage data for the first billing cycle.

Write a program that displays the total usage for each day in the cycle, the average daily usage for the cycle, and the total usage for the cycle.

In [36]:
import pandas

dataframe = pandas.read_csv('usage_2018-08-22.csv')
total_hours = dataframe.shape[0]               # shape[0] is the total number of rows; each row represents 1 hour
cycle_total = 0

for i in range(0, total_hours, 24):            # Each loop starts the counter at the first hour of the next day
    day = dataframe.loc[i:i+23, 'Value (kWh)'] # loc slices include the end label, so this selects the 24 rows for that day
    day_total = day.sum()
    cycle_total += day_total
    print("The total usage for day ", i//24 + 1, " is: ", day_total, "(kWh)")

average_usage = cycle_total/(total_hours/24)   # total_hours/24 is the number of days in the cycle
print("\nThe average daily usage is: ", average_usage, "(kWh)")
print("\nThe total usage for the cycle is: ", cycle_total, "(kWh)")
    



The total usage for day  1  is:  48.415 (kWh)
The total usage for day  2  is:  54.92 (kWh)
The total usage for day  3  is:  56.724000000000004 (kWh)
The total usage for day  4  is:  52.745 (kWh)
The total usage for day  5  is:  40.929 (kWh)
The total usage for day  6  is:  37.683 (kWh)
The total usage for day  7  is:  42.396 (kWh)
The total usage for day  8  is:  55.491 (kWh)
The total usage for day  9  is:  43.948 (kWh)
The total usage for day  10  is:  46.216 (kWh)
The total usage for day  11  is:  57.086 (kWh)
The total usage for day  12  is:  63.77199999999999 (kWh)
The total usage for day  13  is:  65.13999999999999 (kWh)
The total usage for day  14  is:  73.246 (kWh)
The total usage for day  15  is:  76.955 (kWh)
The total usage for day  16  is:  67.396 (kWh)
The total usage for day  17  is:  57.003 (kWh)
The total usage for day  18  is:  63.378 (kWh)

The average daily usage is:  55.74683333333334 (kWh)

The total usage for the cycle is:  1003.4430000000001 (kWh)
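An alternative worth sketching combines the earlier sections: instead of stepping through the rows 24 at a time, parse StartDate as a date and let groupby() collect the rows for each day. This is only one possible approach, not the solution above. The in-memory sample is an assumption made so the example runs without the data file, and the two rows for 2018-08-06 are made-up values for illustration. The formatted prints also round away floating-point noise such as 56.724000000000004.

```python
import io
import pandas

# Sample data for this sketch: the 2018-08-05 rows come from the usage file,
# the 2018-08-06 rows are hypothetical values added for illustration.
csv_text = """StartDate,Value (kWh)
2018-08-05 00:00:00,2.343
2018-08-05 01:00:00,1.211
2018-08-06 00:00:00,1.500
2018-08-06 01:00:00,2.500
"""
dataframe = pandas.read_csv(io.StringIO(csv_text), parse_dates=['StartDate'])

# Group the rows by calendar day, then total the usage within each day.
daily_totals = dataframe.groupby(dataframe['StartDate'].dt.date)['Value (kWh)'].sum()

for day, total in daily_totals.items():
    print(f"The total usage for {day} is: {total:.3f} (kWh)")

print(f"\nThe average daily usage is: {daily_totals.mean():.3f} (kWh)")
print(f"\nThe total usage for the cycle is: {daily_totals.sum():.3f} (kWh)")
```

Grouping by date does not rely on every day having exactly 24 rows, so it keeps working if an hour is missing from the file.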
