## Pandas ✨ Machine Learning


### Package

pandas is a library for working with `relational` or labeled data, built on top of the NumPy package. [<u>more details</u>](https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/)

In [1]:
# pip install pandas

import pandas as pd

### DataFrame

A DataFrame is a `two-dimensional table` made up of a collection of Series, one per column, that share the same index.

In [None]:
data = {
    'apples': [3, 2, 0, 1],
    'oranges': [0, 3, 7, 2]
}

df = pd.DataFrame(data)
display(df)
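
To make the relationship between a DataFrame and its Series concrete, here is a minimal sketch that rebuilds the fruit table from two separate Series. The variable names are chosen for illustration only.

```python
import pandas as pd

# Each column of the table starts out as its own Series.
apples = pd.Series([3, 2, 0, 1], name="apples")
oranges = pd.Series([0, 3, 7, 2], name="oranges")

# Placing the Series side by side (axis=1) yields a DataFrame.
fruit = pd.concat([apples, oranges], axis=1)

# Selecting a single column gives back a Series.
column = fruit["apples"]
print(type(column).__name__)  # Series
print(fruit.shape)            # (4, 2): 4 rows, 2 columns
```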

### Read CSV


Import a comma-separated values (csv) file into a `DataFrame`.

In [None]:
df = pd.read_csv("_data/titanic.csv")
display(df.head())

print("Size:", df.size)
print("Shape:", df.shape)
print("Columns:", df.columns)
df.info()
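
The attributes printed above are easy to mix up, so the sketch below reads a tiny, made-up CSV from an in-memory string (a stand-in for a file on disk) and compares them. The three rows are hypothetical and are not taken from the Titanic data.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for a CSV file on disk.
csv_text = """Name,Sex,Age
Alice,female,29
Bob,male,
Carol,female,41
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)  # (3, 3): rows, columns
print(sample.size)   # 9: rows * columns
print(len(sample))   # 3: number of rows
print(sample["Age"].isna().sum())  # 1: the empty field is read as NaN
```

Note that `size` counts cells, not rows, so `len(df)` or `df.shape[0]` is the safer choice when a row count is wanted.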

### Location

Both `iloc` (selection by integer position) and `loc` (selection by index label) are very useful for lookups and data cleaning.

In [None]:
df = pd.read_csv("_data/titanic.csv")

# Select 2nd to 4th row:
display(df.iloc[1:4])

# Select by index (name):
df = df.set_index('Name', drop=False)  # keep 'Name' as a column as well
display(df.loc['Allen, Miss Elisabeth Walton'])
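
One difference between the two indexers that is easy to miss is how slices are bounded. The sketch below uses a small, hypothetical table indexed by name to show it.

```python
import pandas as pd

# Hypothetical passengers, indexed by name.
people = pd.DataFrame(
    {"Age": [29, 35, 41, 52]},
    index=["Alice", "Bob", "Carol", "Dave"],
)

by_position = people.iloc[1:3]       # positions 1 and 2; the end is excluded
by_label = people.loc["Bob":"Dave"]  # labels Bob through Dave; the end is included

print(list(by_position.index))     # ['Bob', 'Carol']
print(list(by_label.index))        # ['Bob', 'Carol', 'Dave']
print(people.loc["Carol", "Age"])  # 41: a single cell, by row and column label
```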

### Conditional

Conditional selection and `filtering` of data are common tasks.

In [None]:
df = pd.read_csv("_data/titanic.csv")

females = df[df['Sex'] == 'female']
males_60 = df[(df['Sex'] == 'male') & (df['Age'] >= 60)]

# len() counts rows; .size would count rows * columns
print("Females:", len(females))
display(females.head())

print("Males age 60+:", len(males_60))
display(males_60.head())
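
Filtering of this kind works through boolean masks: each comparison produces a Series of `True`/`False` values, and masks can be combined with `&` (and), `|` (or) and `~` (not). A minimal sketch on a hypothetical four-row sample with the same `Sex` and `Age` column names:

```python
import pandas as pd

# Hypothetical sample; the last age is missing on purpose.
people = pd.DataFrame({
    "Sex": ["female", "male", "male", "female"],
    "Age": [29.0, 64.0, 35.0, None],
})

is_male = people["Sex"] == "male"  # boolean Series, one value per row
is_senior = people["Age"] >= 60    # comparisons with NaN evaluate to False

senior_males = people[is_male & is_senior]  # both conditions must hold
not_senior = people[~is_senior]             # negation; includes the NaN row
age_known = people[people["Age"].notna()]   # rows with a recorded age

print(len(senior_males))  # 1
print(len(not_senior))    # 3
print(len(age_known))     # 3
```

Each condition needs its own parentheses when written inline, because `&` and `|` bind more tightly than `==` and `>=` in Python.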

### Replace

`replace` matches whole values by default and also accepts `regex` patterns when `regex=True` is passed.

In [None]:
df = pd.read_csv("_data/titanic.csv")

# Replace female/male
df['Sex'] = df['Sex'].replace(['female', 'male'], ['Woman', 'Man'])
df['PClass'] = df['PClass'].replace(r'1st', 'First', regex=True)
display(df.head())
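
The difference between plain and regex replacement can be sketched on a short, hypothetical Series of class labels:

```python
import pandas as pd

# Hypothetical class labels.
pclass = pd.Series(["1st", "2nd", "3rd", "1st"])

# Without regex, replace() only matches whole cell values.
whole = pclass.replace("1st", "First")
unchanged = pclass.replace("st", "")  # no cell equals "st", so nothing changes

# With regex=True, the pattern may match part of a value;
# here the ordinal suffix is stripped from every label.
digits = pclass.replace(r"(st|nd|rd)$", "", regex=True)

print(whole.tolist())      # ['First', '2nd', '3rd', 'First']
print(unchanged.tolist())  # ['1st', '2nd', '3rd', '1st']
print(digits.tolist())     # ['1', '2', '3', '1']
```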

### Statistics

Pandas has multiple `built-in methods` for descriptive statistics.

In [None]:
df = pd.read_csv("_data/titanic.csv")

# Statistics (by Age)
A = pd.DataFrame()
A['max'] = [df['Age'].max()]
A['min'] = [df['Age'].min()]
A['avg'] = [df['Age'].mean()]
display(A)

# Value counts (by PClass)
A = pd.DataFrame()
A['PClass'] = df['PClass'].value_counts()
display(A)

# Unique values (by Sex)
A = pd.DataFrame()
A['unique_value'] = df['Sex'].unique()
# Look up each count by label, so totals stay aligned with their values
A['total'] = A['unique_value'].map(df['Sex'].value_counts())
display(A)

# Missing values (by Age)
A = df[df['Age'].isnull()]

# len() counts rows; .size would count rows * columns
print("Missing values (Age):", len(A))
display(A.head())
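
As a compact alternative to building the summary column by column, several statistics can be requested in one call. The sketch below uses a hypothetical `Age` Series with two missing entries.

```python
import pandas as pd

# Hypothetical ages; None becomes NaN.
ages = pd.Series([22.0, 38.0, None, 35.0, None], name="Age")

# agg() with a list of names returns one value per statistic.
# Missing values are skipped by min, max and mean.
summary = ages.agg(["min", "max", "mean"])

# isna() marks missing entries; summing the booleans counts them.
missing = ages.isna().sum()

print(summary)
print("Missing:", missing)  # Missing: 2
```

`ages.describe()` returns a similar summary that also includes the count, standard deviation and quartiles.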


### GroupBy

GroupBy is one of the `most powerful` features in pandas: it splits the rows into groups and applies an aggregation to each group.

In [None]:
df = pd.read_csv("_data/titanic.csv")

A = df.groupby('Sex').count()
display(A)

A = df.groupby('Sex').mean(numeric_only=True)
display(A)

A = df.groupby('Sex')['Survived'].count()
display(A)

A = df.groupby(['Sex', 'Survived']).mean(numeric_only=True)
display(A)
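
When different columns need different aggregations, named aggregation keeps everything in one table. The following is a sketch on a hypothetical four-row sample; the output column names (`passengers`, `known_ages`, and so on) are made up for illustration.

```python
import pandas as pd

# Hypothetical sample; one age is missing on purpose.
people = pd.DataFrame({
    "Sex": ["female", "male", "male", "female"],
    "Age": [30.0, 40.0, None, 20.0],
    "Survived": [1, 0, 1, 1],
})

# Each keyword is an output column: (input column, aggregation).
summary = people.groupby("Sex").agg(
    passengers=("Survived", "size"),     # rows per group
    known_ages=("Age", "count"),         # non-missing ages only
    mean_age=("Age", "mean"),            # NaN values are skipped
    survival_rate=("Survived", "mean"),  # mean of 0/1 flags is a rate
)
print(summary)
```

The gap between `passengers` and `known_ages` for the male group shows that `count` ignores missing values, which is also why `df.groupby('Sex').count()` can report different numbers per column.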

### Plotting

Pandas `integrates` with Matplotlib, so we can directly plot DataFrames.

In [None]:
import matplotlib.pyplot as plt

df = pd.read_csv("_data/titanic.csv")

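# Show each plot in its own figure; otherwise the histogram
# may be drawn on top of the scatter plot's axes
df.plot(kind='scatter', x='Age', y='PClass')
plt.show()

df['Age'].plot(kind='hist', title='Age')
plt.show()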
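
For more control over layout, the Matplotlib axes can be created first and handed to pandas through the `ax` parameter. This is a minimal sketch on hypothetical data; the `Fare` column and its values are assumptions for illustration and may not exist in the CSV used above.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample; "Fare" is an assumed numeric column.
people = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
})

# One figure with two side-by-side axes.
fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Passing ax= tells pandas exactly where to draw each plot.
people.plot(kind="scatter", x="Age", y="Fare", ax=ax_scatter, title="Age vs. Fare")
people["Age"].plot(kind="hist", bins=5, ax=ax_hist, title="Age")

fig.tight_layout()
```

In a notebook the figure is rendered at the end of the cell; in a script, `plt.show()` or `fig.savefig(...)` would be needed to see it.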
