### Pandas / Package

Pandas is the main Python package for working with `relational` (tabular) or labeled data.  
It is built on `top` of the NumPy package. [<u>more details</u>](https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/)

In [5]:
# pip install pandas

import pandas as pd

### DataFrame

A DataFrame is a two-dimensional `table` made up of a collection of Series.  
A Pandas Series is like a `column` in a table.  
It is a one-dimensional labeled array holding data of `any` type.

In [6]:
data = {
    'apples': [3, 2, 0, 1],
    'oranges': [0, 3, 7, 2],
    'available': ['yes', 'no', 'yes', 'no'],
}

df = pd.DataFrame(data)
display(df)

|   | apples | oranges | available |
|---|--------|---------|-----------|
| 0 | 3 | 0 | yes |
| 1 | 2 | 3 | no |
| 2 | 0 | 7 | yes |
| 3 | 1 | 2 | no |
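
To make the relationship between a Series and a DataFrame concrete, here is a minimal sketch. The variable names (`apples`, `fruit`, `column`) are illustrative only and reuse the numbers from the cell above.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
apples = pd.Series([3, 2, 0, 1], name="apples")

# A DataFrame is a set of Series that share one index.
fruit = pd.DataFrame({
    "apples": apples,
    "oranges": pd.Series([0, 3, 7, 2]),
})

# Selecting a single column gives back a Series.
column = fruit["oranges"]

print(type(column).__name__)
print(fruit.shape)   # (rows, columns)
print(fruit.dtypes)  # each column keeps its own dtype
```

Because every column is a Series, most Series methods (`max`, `mean`, `unique`, and so on) can be applied to a single DataFrame column.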


### Read CSV

Import a comma-separated values (csv) file into a `DataFrame`.


In [9]:
df = pd.read_csv("read_data/_data/titanic.csv")
display(df.head())

print("Shape:", df.shape)
print("Columns:", df.columns)

|   | Name | PClass | Age | Sex | Survived | SexCode |
|---|------|--------|-----|-----|----------|---------|
| 0 | Allen, Miss Elisabeth Walton | 1st | 29.0 | female | 1 | 1 |
| 1 | Allison, Miss Helen Loraine | 1st | 2.0 | female | 0 | 1 |
| 2 | Allison, Mr Hudson Joshua Creighton | 1st | 30.0 | male | 0 | 0 |
| 3 | Allison, Mrs Hudson JC (Bessie Waldo Daniels) | 1st | 25.0 | female | 0 | 1 |
| 4 | Allison, Master Hudson Trevor | 1st | 0.92 | male | 1 | 0 |


Shape: (1313, 6)
Columns: Index(['Name', 'PClass', 'Age', 'Sex', 'Survived', 'SexCode'], dtype='object')
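
`read_csv` takes optional arguments that control which columns are loaded and how they are typed. The sketch below reads from an in-memory string instead of a file so that it runs anywhere. The third row ("Doe, Mr John") is a made-up passenger added only to show how an empty field is read, and the choice of `int8` is an assumption for illustration.

```python
import io

import pandas as pd

# Small sample in the same layout as the Titanic file.
# The last row is hypothetical and has an empty Age field.
csv_text = """Name,PClass,Age,Sex,Survived
"Allen, Miss Elisabeth Walton",1st,29.0,female,1
"Allison, Miss Helen Loraine",1st,2.0,female,0
"Doe, Mr John",3rd,,male,0
"""

sample = pd.read_csv(
    io.StringIO(csv_text),                          # any path or file-like object works
    usecols=["Name", "PClass", "Age", "Survived"],  # load only these columns
    dtype={"Survived": "int8"},                     # force a compact integer type
)

print(sample.shape)
print(sample.dtypes)
print("Missing ages:", sample["Age"].isna().sum())
```

Quoted fields may contain commas, which is why the names are parsed as a single column, and the empty `Age` field is read as `NaN`.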


### Location

Both `iloc` (position-based) and `loc` (label-based) are very useful for looking up rows and cleaning data.

In [None]:
display(df.iloc[1:4]) # 2nd to 4th row

# Select by index label (passenger name); drop=False keeps the Name column too:
df = df.set_index('Name', drop=False)
display(df.loc['Allen, Miss Elisabeth Walton'])
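
One difference that is easy to miss: slices with `iloc` exclude the end position, while slices with `loc` include the end label. A minimal sketch, using a hypothetical `people` table with made-up labels:

```python
import pandas as pd

people = pd.DataFrame(
    {"age": [29, 2, 30, 25]},
    index=["Ann", "Bob", "Cid", "Dee"],  # hypothetical row labels
)

by_position = people.iloc[1:3]       # positions 1 and 2; the end is excluded
by_label = people.loc["Bob":"Dee"]   # labels Bob through Dee; the end is included
one_cell = people.loc["Cid", "age"]  # single value by row and column label

print(list(by_position.index))
print(list(by_label.index))
print(one_cell)
```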

### Conditional

Conditional selection and `filtering` of data are common tasks.

In [None]:
females = df[df['Sex'] == 'female']
males_60 = df[(df['Sex'] == 'male') & (df['Age'] >= 60)]

print("Females:", len(females))  # len() counts rows; .size would count rows * columns
display(females.head())

print("Males age 60+:", len(males_60))
display(males_60.head())
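
Boolean masks can also be combined with `|` (or), `&` (and) and `~` (not), and `isin` tests membership in a list. The sketch below uses a small hypothetical table `p`; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical passengers; one age is missing on purpose.
p = pd.DataFrame({
    "Sex": ["female", "male", "male", "female"],
    "Age": [29.0, 61.0, None, 2.0],
    "PClass": ["1st", "2nd", "3rd", "1st"],
})

# Each comparison needs its own parentheses when combined.
is_child_or_senior = (p["Age"] < 18) | (p["Age"] >= 60)
not_first_class = ~p["PClass"].isin(["1st"])

print(len(p[is_child_or_senior]))
print(len(p[not_first_class]))
print(len(p[is_child_or_senior & not_first_class]))
```

Comparisons against a missing value evaluate to `False`, so the row without an age is excluded from the age-based mask.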

### Replace

`replace` swaps exact values by default and also accepts regular expressions when `regex=True`.

In [None]:
# Replace female/male
df['Sex'] = df['Sex'].replace(['female', 'male'], ['Woman', 'Man'])
df['PClass'] = df['PClass'].replace(r'1st', 'First', regex=True)
display(df.head())
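
As a further sketch, `replace` also accepts a dictionary mapping old values to new ones, and a regex replacement can reuse a captured group. The Series `s` and the replacement labels are assumptions chosen for illustration.

```python
import pandas as pd

s = pd.Series(["1st", "2nd", "3rd", "1st"])

# Exact-match replacement with a mapping of old value -> new value.
named = s.replace({"1st": "First", "2nd": "Second", "3rd": "Third"})

# Regex replacement: keep the captured digit and drop the ordinal suffix.
digits = s.replace(r"^(\d)(?:st|nd|rd)$", r"\1", regex=True)

print(named.tolist())
print(digits.tolist())
```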

### Statistics

Pandas has multiple `built-in methods` for descriptive statistics.

In [None]:
# Statistics (by Age)
A = pd.DataFrame()
A['max'] = [df['Age'].max()]
A['min'] = [df['Age'].min()]
A['avg'] = [df['Age'].mean()]
display(A)

# Value counts (by PClass)
A = pd.DataFrame()
A['PClass'] = df['PClass'].value_counts()
display(A)

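# Unique values (by Sex)
A = pd.DataFrame()
A['unique_value'] = df['Sex'].unique()
# value_counts() is sorted by frequency, unique() by first appearance,
# so look the counts up by label to keep each total next to its value
A['total'] = df['Sex'].value_counts().reindex(A['unique_value']).to_numpy()
display(A)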

# Missing values (by Age)
A = df[df['Age'].isnull()]

print("Missing values (Age):", len(A))  # number of rows without an age
display(A.head())
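
The same statistics can be gathered more compactly. The sketch below uses a small made-up `ages` Series rather than the Titanic data, and shows `agg`, `isna().sum()` and `describe()`.

```python
import pandas as pd

# Hypothetical ages with two missing entries.
ages = pd.Series([29.0, 2.0, None, 30.0, None], name="Age")

summary = ages.agg(["min", "max", "mean"])  # missing values are skipped
n_missing = ages.isna().sum()               # count of missing entries

print(summary)
print("Missing:", n_missing)
print(ages.describe())  # count, mean, std, min, quartiles, max
```

Note that `describe()` reports `count` as the number of non-missing values, so it can be smaller than the length of the Series.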


### GroupBy

`groupby` is one of the `most powerful` features in pandas: it splits the rows into groups by key and computes a statistic per group.

In [None]:
A = df.groupby('Sex').count()
display(A)

A = df.groupby('Sex').mean(numeric_only=True)
display(A)

A = df.groupby('Sex')['Survived'].count()
display(A)

A = df.groupby(['Sex', 'Survived']).mean(numeric_only=True)
display(A)
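
Several statistics can be computed in one pass with named aggregation. This is a minimal sketch on a hypothetical table `t`; the output column names (`passengers`, `survival_rate`, `mean_age`) are made up for the example.

```python
import pandas as pd

# Hypothetical passengers; one age is missing on purpose.
t = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Age": [29.0, 2.0, 30.0, None, 61.0],
    "Survived": [1, 0, 0, 0, 1],
})

# Each keyword is an output column: (input column, aggregation).
result = t.groupby("Sex").agg(
    passengers=("Survived", "size"),     # rows per group
    survival_rate=("Survived", "mean"),  # mean of 0/1 values = share of survivors
    mean_age=("Age", "mean"),            # missing ages are skipped
)

print(result)
```

`size` counts every row in a group, whereas `count` counts only non-missing values, which is why `count()` can differ from column to column.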

### Plotting

Pandas `integrates` with Matplotlib, so we can directly plot DataFrames.

In [None]:
import matplotlib.pyplot as plt

df.plot(kind='scatter', x='Age', y='PClass')
plt.show()  # finish the scatter figure so the histogram gets its own axes

df['Age'].plot(kind='hist', title='Age')
plt.show()
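
For more control over layout, each plot can be sent to an explicit Matplotlib axes through the `ax` argument. The sketch below uses a small made-up table `t`, and selects the off-screen `Agg` backend so that it also runs outside a notebook (an assumption; drop that line when working interactively).

```python
import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen; not needed in a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data for illustration.
t = pd.DataFrame({
    "Age": [29.0, 2.0, 30.0, 25.0, 61.0],
    "Survived": [1, 0, 0, 0, 1],
})

# One figure with two side-by-side axes.
fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

t["Age"].plot(kind="hist", bins=5, title="Age", ax=ax_hist)
t.plot(kind="scatter", x="Age", y="Survived", ax=ax_scatter)

fig.tight_layout()
```

Passing `ax` avoids relying on whichever axes happens to be current, which matters when several plots are created in the same cell.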
