# Agenda
Over the previous two sessions we covered most of the basics we need to get started on some more interesting reservoir engineering, production engineering, and geosciences problems. From this point on we will be taking a learn-by-projects approach.

# A Gentle Introduction to Pandas

## What is Pandas
Pandas is an importable Python package with its own data structures, an enormous number of useful built-in routines, and its own syntax. Pandas is one of the most widely used packages in Python. If you installed the Anaconda data science platform, then pandas is already installed. To use it, you must import it; recall that it is imported like so:

import pandas 

Recall that we can import and assign an alias in one line like so:

import pandas as pd

The name Pandas is derived from "panel data". Pandas is as close to a spreadsheet as one can get in Python, but it's much more powerful and complicated than a spreadsheet.

The scope of use of Pandas is enormous, and it could fill many hours and many hundreds of lines of text if we dedicated ourselves solely to learning Pandas. I don't see much utility in that. I think Pandas has to be absorbed in small doses, and this is the first dose.


## Pandas Series and DataFrames

There are two building blocks in Pandas: the series and the dataframe. In Pandas a column of data is called a series, and a collection of one or more series is called a dataframe. A series holds only one data type: an integer, a float, a logical True or False, a datetime, a timedelta, a string, or a category. A single column in a dataframe cannot have more than one data type, but each column can have a different data type.
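
As a minimal sketch of the idea above, the snippet below builds a small dataframe from three series and prints the data type of each column. The well names, depths, and producer flags are made-up illustration values, not data from the course file.

```python
import pandas as pd

# Hypothetical values for illustration only
well_name = pd.Series(['Well A', 'Well B', 'Well C'])   # strings
depth_m = pd.Series([612.5, 598.0, 640.2])              # floats
is_producer = pd.Series([True, False, True])            # logical values

# A dataframe is a collection of series; each dictionary key becomes a column name
example_df = pd.DataFrame({'well_name': well_name,
                           'depth_m': depth_m,
                           'is_producer': is_producer})

# Each column has exactly one data type, but the types differ between columns
print(example_df.dtypes)
```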

On first examination a series looks like a simple list of values, see below.

In [None]:
'''
We need to build the kernel and populate it with data
''' 

import pandas as pd

# Now use pandas to import the file. Note the naming convention and the pandas alias of pd for pandas

wells_df = pd.read_csv('Marten_Hills_meta.csv')

In [None]:
# Now let's look at the names of the columns using the pandas columns attribute

cols = wells_df.columns
print(cols)


In [None]:
import pandas as pd

# Recall that any string must be encased by quotes or tick marks, and that each value is delineated with a comma

zone_series = pd.Series(['Wabamun', 'Clearwater', 'McMurray', 'Paleozoic', 'Wabiskaw', 'Unspecified',
 'Mannville', 'Nisku', 'Grand Rapids', 'Keg River', 'Graminia', 'Devonian',
 'Wabiskaw-McMurray', 'Joli Fou', 'Banff', 'Ireton'])

zone_list = ['Wabamun', 'Clearwater', 'McMurray', 'Paleozoic', 'Wabiskaw', 'Unspecified',
 'Mannville', 'Nisku', 'Grand Rapids', 'Keg River', 'Graminia', 'Devonian',
 'Wabiskaw-McMurray', 'Joli Fou', 'Banff', 'Ireton']


But a Series is very different from a list: every value in a Series has an index that follows it through every operation, whereas a list has no persistent index, see below.
    
    
    

In [2]:
# In the eyes of Python a series looks like this:

print('This is an unsorted Series:', zone_series)

# note the index 

print()
print('This is a sorted Series:\n', zone_series.sort_values())

# Note the index of each value is preserved. Now let's look at a list





In [3]:
print(wells_df['Formation'].unique())


In [4]:
# In the eyes of Python a list looks like this:

print('This is an unsorted List:\n', zone_list)

# Note that no index is shown

zone_list.sort()
print()
print('This is a sorted List:\n', zone_list)

# Note that a list has no index


## Loading a Pandas DataFrame

There are a number of ways to load a Pandas dataframe, but we will look at only one technique right now: loading a dataframe from a csv file or an Excel spreadsheet.

The syntax is simple:

`df = pd.read_csv(filename)` for a csv or `df = pd.read_excel(filename, sheet_name='Sheet1')` for Excel. The difference with Excel is that you can also state a tab name; if you omit `sheet_name`, the first sheet is read.

There are keyword arguments that we can use to dictate how the data is loaded, and these can be found here:

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

Among the most useful are `skiprows=XX` for skipping rows of data at the top of the file, where `XX` is the number of rows to skip, and `sep=YY` for specifying a delimiter such as a space or tab, where `YY` is a string enclosed by quotes or ticks.
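
The two arguments above can be sketched as follows. To keep the example self-contained, the "file" is a made-up string wrapped in `io.StringIO` rather than a real file on disk; the contents and the name `demo_df` are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical file contents: two lines of notes, then tab-delimited data
raw_text = (
    "Exported by some tool\n"
    "Units: metres\n"
    "well_name\tformation\tdepth_m\n"
    "Well A\tClearwater\t612.5\n"
    "Well B\tWabiskaw\t598.0\n"
)

# io.StringIO lets pandas read the string as if it were a file.
# skiprows=2 skips the two note lines; sep='\t' declares a tab delimiter.
demo_df = pd.read_csv(io.StringIO(raw_text), skiprows=2, sep='\t')

print(demo_df)
```

With a real file, the first argument would simply be the file name, as in the `Marten_Hills_meta.csv` example earlier.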

## Pandas Columns Indexing

Recall that a Pandas dataframe is a collection of one or more Series, or columns of data. There are several ways to specify or index a column of data in Pandas, but for the time being I will use only one. A column of data in a dataframe can be indexed using the name of the column as a string enclosed in quotes or ticks, placed inside square brackets like so:

wells_df[*column_name*] 
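
A minimal sketch of this bracket indexing, using a small made-up dataframe in place of `wells_df` (the values and the name `demo_df` are hypothetical):

```python
import pandas as pd

# Hypothetical data standing in for wells_df
demo_df = pd.DataFrame({'Formation': ['Clearwater', 'Wabiskaw', 'Clearwater'],
                        'depth_m': [612.5, 598.0, 640.2]})

# Index one column by its name as a string inside square brackets
formation_col = demo_df['Formation']

# The result is a Series, with its index carried along
print(type(formation_col))
print(formation_col)
```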

There is a rule of thumb for naming a column in Pandas: the first character should not be a digit. You can force the column name to be a number, but it makes indexing confusing and causes issues.


There is also a tip for naming columns: use snake case, as in `snake_case`. Snake case uses lowercase words joined by underscores.
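
One possible way to tidy column names along these lines is sketched below. The original column names are invented examples of what an exported spreadsheet might contain, and the approach shown is just one option, not the only way to rename columns.

```python
import pandas as pd

# Hypothetical column names of the kind often found in exported spreadsheets
demo_df = pd.DataFrame({'Well Name': ['Well A', 'Well B'],
                        'Total Depth': [612.5, 598.0],
                        '2023 Oil': [10.0, 12.5]})

# Move the leading number to the end so the name starts with a letter
demo_df = demo_df.rename(columns={'2023 Oil': 'Oil 2023'})

# Convert to snake case: lowercase, with spaces replaced by underscores
demo_df.columns = demo_df.columns.str.lower().str.replace(' ', '_', regex=False)

print(list(demo_df.columns))
# prints ['well_name', 'total_depth', 'oil_2023']
```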

### First steps when loading data

It's helpful to make a list of the dataframe column names for easy reference. The `columns` attribute (an attribute rather than a method because it has no parentheses after it; more on this later) returns those names.