# Pandas Fundamentals
Welcome to a short primer on how to use Pandas.  Note that this is not an in-depth tutorial; there are excellent resources at AFIT and online for further study.

If you haven't already done so, please install the Pandas library by opening a terminal from your Anaconda distribution and typing **conda install pandas**.  You may need administrative privileges for this.

Let's start by importing the Pandas library. 

In [None]:
import pandas as pd

1-D data structures in Pandas are called **Series**.

Let's create a very simple Series.  Note that the index (0, 1, 2) is printed alongside the values (101, 102, 103).

In [None]:
data = pd.Series([101,102,103])
print(data)

You can access a particular value in the Series by using its index, similar to NumPy.

In [None]:
data[0]

You can replace the index with other numbers or even strings.  Here, let's replace the index with state abbreviations

In [None]:
data.index = ['OH','NY','FL']

Now, to access the value for Ohio, we use 'OH' as the index label.  This comes in very handy when you're querying data.

In [None]:
data

So, three different ways to access a value in a Series are:

In [None]:
print(data['OH'])
print(data.loc['OH'])
print(data.iloc[0])  # iloc uses 0-based positions: 0 for the first entry up to n-1, where n is the length of the Series

Finally, you can find the length of the Series using the .size attribute (it is not a function, so no parentheses are needed).

In [None]:
data.size
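
As a compact recap of the Series operations above, here is a minimal sketch that builds the same Series with its labels supplied up front through the `index` argument of `pd.Series`.  The variable names (`counts`, `by_label`, `by_position`) are ours and are used only for illustration.

```python
import pandas as pd

# Same values as above; the labels are passed at construction time
counts = pd.Series([101, 102, 103], index=['OH', 'NY', 'FL'])

by_label = counts.loc['NY']     # look up by index label
by_position = counts.iloc[1]    # look up by 0-based position
print(by_label, by_position)    # both refer to the same entry

# size is an attribute; len() returns the same number for a Series
print(counts.size, len(counts))
```

Passing the labels at construction time avoids the separate `data.index = ...` step, but the result is the same kind of object.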

## Pandas Dataframes

Pandas dataframes are mainly created by reading in a data file, such as .csv, .xlsx, etc.  If the input file has a header row, Pandas uses it for the column names.

Below, we read from a dataset about passenger survival on the Titanic.  It can be found here: https://www.kaggle.com/competitions/titanic.  It is also located in this folder.

For information on how to read other formats, see: https://pandas.pydata.org/docs/reference/io.html

In [None]:
titanic = pd.read_csv('titanic_train.csv') 
titanic.head() #print the header and the first 5 lines.
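
To see how the header row becomes the column names without needing the Titanic file, here is a minimal sketch that reads a small made-up CSV from an in-memory string.  The data and the names `csv_text` and `toy` are invented for illustration and are not part of the Titanic dataset.

```python
import io
import pandas as pd

# Made-up CSV text: the first line is the header row
csv_text = "PassengerId,Age,Fare\n1,22,7.25\n2,38,71.28\n3,,7.92\n"

# read_csv accepts a file path or a file-like object such as StringIO
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.columns.tolist())  # column names taken from the header row
print(toy.head(2))           # first 2 rows; the default index starts at 0
```

The empty field in the last row is read as a missing value (NaN), which is how gaps in the real file show up as well.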

Note that the dataframe index is set to the default starting at 0.  

We can set the index to another column, such as PassengerId, using the set_index method.


In [None]:
titanic = titanic.set_index('PassengerId')

In [None]:
titanic.isnull().sum()
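
The cell above counts the missing values in each column: `isnull()` marks every missing entry as True, and `sum()` adds up those marks column by column.  A minimal sketch on a made-up frame (the data and the names `toy` and `missing` are assumptions for illustration) shows this together with `set_index`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Age': [22.0, np.nan, 26.0],
    'Cabin': [None, 'C85', None],
})

# set_index returns a new DataFrame, so we assign the result back
toy = toy.set_index('PassengerId')

missing = toy.isnull().sum()   # Series: column name -> count of missing values
print(missing)
```

Note that after `set_index`, PassengerId is the row index and no longer appears among the columns.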

In [None]:
titanic.head(3)  # get the first 3 rows only

In [None]:
titanic.columns # prints the column names

To retrieve a column of data, simply use the column name.

In [None]:
titanic.loc[:,'Age']
# or titanic['Age']

You can also get more than 1 column, but **you need to make a list of the column names** first

In [None]:
titanic.loc[:,['Age','Survived']]

# NOTE: this would not work: titanic['Age','Fare']
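
To make the difference concrete, here is a minimal sketch on a made-up frame (the data and variable names are ours, for illustration only).  It shows what each form returns and why the version without the inner list fails.

```python
import pandas as pd

toy = pd.DataFrame({'Age': [22.0, 38.0], 'Survived': [0, 1], 'Fare': [7.25, 71.28]})

one_column = toy['Age']                 # a single name returns a Series
two_columns = toy[['Age', 'Survived']]  # a list of names returns a DataFrame
print(type(one_column).__name__, type(two_columns).__name__)

# Without the inner list, pandas looks for one column named ('Age', 'Survived')
try:
    toy['Age', 'Survived']
    tuple_lookup_failed = False
except KeyError:
    tuple_lookup_failed = True
print(tuple_lookup_failed)
```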

Now what if we wanted the 2nd entry of the 'Age' column?  
- The PassengerId of the 2nd entry is 2
    - You can use the label-based .loc[2] indexer for that
- The base index for the 2nd entry is 1 (the first entry starts at 0)
    - You can use the position-based .iloc[1] indexer for that

In [None]:
print(titanic.loc[2,'Age'])
print(titanic.iloc[1,4])
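
Hard-coding the column position (4 for 'Age' above) is easy to get wrong if the columns change.  As a hedged alternative, the sketch below looks the position up with `columns.get_loc`.  The frame is made up for illustration, so 'Age' sits at a different position here than in the Titanic data.

```python
import pandas as pd

toy = pd.DataFrame(
    {'Survived': [0, 1, 1], 'Age': [22.0, 38.0, 26.0]},
    index=pd.Index([1, 2, 3], name='PassengerId'),
)

age_by_label = toy.loc[2, 'Age']          # row label 2, column label 'Age'
age_pos = toy.columns.get_loc('Age')      # 0-based position of the 'Age' column
age_by_position = toy.iloc[1, age_pos]    # second row, same column by position
print(age_by_label, age_by_position)
```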

In [None]:
titanic.head(3)

In [None]:
print(titanic.loc[1:3,'Age'])
# or
print (titanic.iloc[0:3,4])

If you want to retrieve a range of rows, use the colon (:) symbol.  Note in the example below that .loc includes the last label in the range, while .iloc does NOT include the last position.

In [None]:
print(titanic.loc[1:4,'Age'])
print(titanic.iloc[0:3,4])
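
The inclusive versus exclusive behavior can be checked on a small made-up Series (the values and names below are ours, for illustration only):

```python
import pandas as pd

ages = pd.Series([22.0, 38.0, 26.0, 35.0],
                 index=pd.Index([1, 2, 3, 4], name='PassengerId'))

label_slice = ages.loc[1:3]      # labels 1, 2 and 3: the end label is included
position_slice = ages.iloc[0:3]  # positions 0, 1 and 2: position 3 is excluded
same_rows = label_slice.equals(position_slice)
print(len(label_slice), len(position_slice), same_rows)
```

Both slices return the same three rows here because label 1 sits at position 0.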

Lastly, note that there are two ways to use .loc to retrieve a single value:

In [None]:
print(titanic.loc[2,'Age'])
print(titanic['Age'].loc[2])