# Pandas Fundamentals
Welcome to a short primer on how to use Pandas. Note that this is not an in-depth tutorial; there are excellent resources at AFIT and online for further study.

If you haven't already done so, please install the Pandas library by opening a terminal for your Anaconda distribution and typing **conda install pandas**. You may need administrative privileges for this.

Let's start by importing the Pandas library. 

In [1]:
import pandas as pd

1-D data structures in Pandas are called **Series**.

Let's create a very simple Series. Note that the index (0, 1, 2) is printed alongside the values (101, 102, 103).

In [2]:
data = pd.Series([101,102,103])
print(data)

0    101
1    102
2    103
dtype: int64


You can access a particular value in the Series by using its index, similar to NumPy.

In [3]:
data[0]

101

You can replace the index with other numbers or even strings.  Here, let's replace the index with state abbreviations

In [4]:
data.index = ['OH','NY','FL']

Then if we wanted to access the value for Ohio, we would use 'OH' as our index.  This comes in very handy when you're querying data.  

In [5]:
data

OH    101
NY    102
FL    103
dtype: int64
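As a small aside, the labels do not have to be assigned after the fact. The sketch below builds the same labelled Series in one step; the variable names `labelled` and `from_dict` are made up for illustration.

```python
import pandas as pd

# Pass the labels at construction time instead of assigning .index afterwards.
labelled = pd.Series([101, 102, 103], index=['OH', 'NY', 'FL'])

# A dict also works: the keys become the index labels.
from_dict = pd.Series({'OH': 101, 'NY': 102, 'FL': 103})

print(labelled)
print(labelled.equals(from_dict))  # both hold the same labels and values
```

Either form gives a Series that can be queried with `labelled['OH']`.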

So three different ways to access a value in a Series are:

In [6]:
print(data['OH'])
print(data.loc['OH'])
print(data.iloc[0])  # iloc uses 0-based positions, from 0 to n-1, where n is the length of the Series

101
101
101
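The difference between labels and positions is easiest to see when the labels are themselves integers. The following is a minimal sketch with a made-up Series `s` whose labels are 10, 20, 30 (these values are an assumption for illustration, not part of the data above).

```python
import pandas as pd

# Integer labels that do NOT match the 0-based positions.
s = pd.Series([101, 102, 103], index=[10, 20, 30])

by_label = s.loc[10]     # label 10 is the first entry
by_position = s.iloc[0]  # position 0 is also the first entry

# Label 0 does not exist, so .loc raises a KeyError.
missing = False
try:
    s.loc[0]
except KeyError:
    missing = True

print(by_label, by_position, missing)
```

Using `.loc` for labels and `.iloc` for positions makes the intent explicit and avoids ambiguity when the index holds integers.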


Finally, you can find the length of the Series using the .size attribute (note that it takes no parentheses).

In [7]:
data.size

3
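For comparison, here is a short sketch of related ways to ask how big a Series is. It rebuilds the three-element Series from above so that it runs on its own.

```python
import pandas as pd

data = pd.Series([101, 102, 103], index=['OH', 'NY', 'FL'])

n_size = data.size    # number of elements (an attribute, no parentheses)
n_len = len(data)     # Python's built-in len gives the same count for a Series
shape = data.shape    # a 1-tuple for 1-D data

print(n_size, n_len, shape)
```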

## Pandas Dataframes

Pandas dataframes are mainly created by reading in a data file, such as .csv, .xlsx, etc. If the input file has a header row, Pandas uses it for the column names.

Below, we read from a dataset about the passengers of the Titanic and whether they survived. It can be found here: https://www.kaggle.com/competitions/titanic. It is also located in this folder.

For information on how to read other formats, see: https://pandas.pydata.org/docs/reference/io.html

In [8]:
titanic = pd.read_csv('titanic_train.csv') 
titanic.head()  # show the column headers and the first 5 rows

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
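If the CSV file is not at hand, the same calls can be tried on a small in-memory sample. In the sketch below, `io.StringIO` stands in for the file, and the rows are a trimmed copy of the first three passengers shown above (only five columns are kept, which is a simplification for illustration).

```python
import io
import pandas as pd

# A tiny CSV with a header row, held in a string instead of a file.
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22.0
2,1,1,female,38.0
3,1,3,female,26.0
"""

# read_csv accepts file-like objects as well as file paths.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))  # head(n) shows the first n rows
print(sample.shape)    # (rows, columns)
```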


Note that the dataframe index is set to the default, starting at 0.

We can set the index to another column, such as PassengerId, using the set_index method.


In [9]:
titanic = titanic.set_index('PassengerId')
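Note that `set_index` returns a new dataframe rather than changing the original, which is why the result is assigned back to `titanic` above. The sketch below shows this on a made-up three-row sample and also shows the `index_col` argument of `read_csv`, which sets the index while reading; the names `sample`, `indexed`, and `at_read` are illustrative only.

```python
import io
import pandas as pd

csv_text = """PassengerId,Survived,Age
1,0,22.0
2,1,38.0
3,1,26.0
"""

sample = pd.read_csv(io.StringIO(csv_text))

# set_index returns a new dataframe; `sample` keeps its default 0-based index.
indexed = sample.set_index('PassengerId')
print('PassengerId' in sample.columns)   # still a regular column here
print('PassengerId' in indexed.columns)  # moved into the index here

# Alternative: choose the index column while reading the file.
at_read = pd.read_csv(io.StringIO(csv_text), index_col='PassengerId')
print(indexed.equals(at_read))
```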

In [10]:
titanic.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64
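The cell above counts missing values per column. As a rough sketch of how that works, `isnull()` produces a table of True/False flags and `sum()` adds them up column by column, with True counting as 1. The small dataframe `df` below is made up for illustration.

```python
import pandas as pd

# None is stored as a missing value (NaN) by pandas.
df = pd.DataFrame({
    'Age': [22.0, None, 26.0],
    'Cabin': [None, 'C85', None],
})

mask = df.isnull()   # True wherever a value is missing
counts = mask.sum()  # column-wise count of True values

print(mask)
print(counts)
```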

In [11]:
titanic.head(3)  # get the first 3 rows only

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |


In [12]:
titanic.columns # prints the column names

Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Cabin', 'Embarked'],
      dtype='object')

To retrieve a column of data, use the column name, either with .loc or with plain square brackets.

In [13]:
titanic.loc[:,'Age']
# or titanic['Age']

PassengerId
1      22.0
2      38.0
3      26.0
4      35.0
5      35.0
       ... 
887    27.0
888    19.0
889     NaN
890    26.0
891    32.0
Name: Age, Length: 891, dtype: float64

You can also get more than one column, but **you need to pass the column names as a list**.

In [14]:
titanic.loc[:,['Age','Survived']]

# NOTE: this would not work: titanic['Age','Fare']

| PassengerId | Age | Survived |
|---|---|---|
| 1 | 22.0 | 0 |
| 2 | 38.0 | 1 |
| 3 | 26.0 | 1 |
| 4 | 35.0 | 1 |
| 5 | 35.0 | 0 |
| ... | ... | ... |
| 887 | 27.0 | 0 |
| 888 | 19.0 | 1 |
| 889 | NaN | 0 |
| 890 | 26.0 | 1 |
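To see why the list is needed, here is a minimal sketch on a made-up two-row dataframe `df`. Without the inner brackets, Pandas treats `'Age','Fare'` as a single tuple key and looks for one column with that name, which raises a KeyError.

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [22.0, 38.0], 'Survived': [0, 1], 'Fare': [7.25, 71.2833]},
    index=pd.Index([1, 2], name='PassengerId'),
)

# A list of names returns a dataframe with just those columns.
with_brackets = df[['Age', 'Survived']]
with_loc = df.loc[:, ['Age', 'Survived']]

# Without the list, the key is the tuple ('Age', 'Fare'), which is not a column.
failed = False
try:
    df['Age', 'Fare']
except KeyError:
    failed = True

print(with_brackets.equals(with_loc), failed)
```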


Now what if we wanted the 2nd entry of the 'Age' column?
- The PassengerId of the 2nd entry is 2
    - You can use the label-based .loc[2] indexer for that
- The base index for the 2nd entry is 1 (the first entry starts at 0)
    - You can use the position-based .iloc[1] indexer for that
- With .iloc the column is also given by position: 'Age' is the 5th column, so its position is 4

In [15]:
print(titanic.loc[2,'Age'])
print(titanic.iloc[1,4])

38.0
38.0
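Hard-coding the column position 4 works, but it breaks silently if the columns are reordered. One possible alternative, sketched below on a made-up two-row dataframe `df`, is to look the position up by name with `columns.get_loc`.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Survived': [0, 1],
        'Pclass': [3, 1],
        'Name': ['Braund, Mr. Owen Harris', 'Cumings, Mrs. John Bradley'],
        'Sex': ['male', 'female'],
        'Age': [22.0, 38.0],
    },
    index=pd.Index([1, 2], name='PassengerId'),
)

# Look up the 0-based position of the 'Age' column instead of hard-coding it.
age_pos = df.columns.get_loc('Age')

by_label = df.loc[2, 'Age']         # row label 2, column label 'Age'
by_position = df.iloc[1, age_pos]   # row position 1, column position age_pos

print(age_pos, by_label, by_position)
```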


In [16]:
titanic.head(3)

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |


In [17]:
print(titanic.loc[1:3,'Age'])
# or
print(titanic.iloc[0:3,4])

PassengerId
1    22.0
2    38.0
3    26.0
Name: Age, dtype: float64
PassengerId
1    22.0
2    38.0
3    26.0
Name: Age, dtype: float64


If you want to retrieve a range of rows, use the colon (:) symbol. Note in the example below that .loc includes the end label (4), while .iloc does NOT include the end position (3).

In [20]:
print(titanic.loc[1:4,'Age'])
print(titanic.iloc[0:3,4])

PassengerId
1    22.0
2    38.0
3    26.0
4    35.0
Name: Age, dtype: float64
PassengerId
1    22.0
2    38.0
3    26.0
Name: Age, dtype: float64
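The same rule applies to a Series with string labels. The sketch below reuses the state-abbreviation Series from earlier (rebuilt here so the block runs on its own) to contrast the two kinds of slice.

```python
import pandas as pd

s = pd.Series([101, 102, 103], index=['OH', 'NY', 'FL'])

# Label slice: both endpoints are included.
label_slice = s.loc['OH':'NY']

# Position slice: the end position is excluded, as with Python lists.
position_slice = s.iloc[0:1]

print(len(label_slice), len(position_slice))
```

A handy way to remember it: positions behave like ordinary Python slicing, while labels have no natural "one past the end", so Pandas includes the end label.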


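Lastly, note that there are 2 ways to use the .loc indexer: on the dataframe with a row label and a column label, or on a single column with just the row label.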

In [19]:
print(titanic.loc[2,'Age'])
print(titanic['Age'].loc[2])

38.0
38.0
