# Pandas

In this notebook, we'll learn the basics of data analysis with the Python Pandas library.

<img src="figures/pandas.png" width=500>




# Loading the data

We'll work with the Titanic dataset, stored as a CSV file at `data/titanic.csv`.

We load it into a Pandas DataFrame, the library's core structure for tabular data.

In [None]:
# import the libraries
import matplotlib
import urllib
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Read from CSV to Pandas DataFrame
df = pd.read_csv("data/titanic.csv", header=0)

In [None]:
# First five rows
df.head()

These are the different features:
* pclass: class of travel
* name: full name of the passenger
* sex: gender
* age: numerical age
* sibsp: number of siblings/spouses aboard
* parch: number of parents/children aboard
* ticket: ticket number
* fare: cost of the ticket
* cabin: location of room
* embarked: port that the passenger embarked at (C - Cherbourg, S - Southampton, Q - Queenstown)
* survived: survival metric (0 - died, 1 - survived)

# Exploratory analysis

We're going to use the Pandas library and see how we can explore and process our data.

In [None]:
# Describe features
df.describe()

In [None]:
# Histograms
df["age"].hist()

In [None]:
# Unique values
df["embarked"].unique()

In [None]:
# Selecting data by feature
df["name"].head()

In [None]:
# Filtering
df[df["sex"] == "female"].head() # keep only the rows where sex is "female"
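
Filtering works through a boolean mask: the comparison yields one True/False value per row, and indexing with that mask keeps the True rows. The sketch below illustrates this on a small made-up table (the `toy` DataFrame and its values are hypothetical, not taken from the Titanic file) and shows one way to combine two conditions.

```python
import pandas as pd

# Hypothetical toy data, not real Titanic rows
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "male"],
    "age": [29.0, 40.0, 2.0, 18.0],
})

# The comparison produces a boolean Series with one entry per row
mask = toy["sex"] == "female"

# Indexing with the mask keeps only the rows where it is True
females = toy[mask]

# Conditions combine with & (and) and | (or); each one needs its own parentheses
adult_females = toy[(toy["sex"] == "female") & (toy["age"] >= 18)]
```

Note that the filtered frames keep their original row labels, so `adult_females` here still carries label 0.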

In [None]:
# Sorting
df.sort_values("age", ascending=False).head()

In [None]:
# Grouping
survived_group = df.groupby("survived")
survived_group.mean(numeric_only=True) # average only the numeric columns; text columns such as name cannot be averaged
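
To make the grouping step concrete, here is a minimal sketch on made-up numbers (the `toy` DataFrame is hypothetical). Rows are split by their `survived` value, and the mean of each remaining column is computed within each group.

```python
import pandas as pd

# Hypothetical toy data with only numeric columns
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "age": [40.0, 20.0, 30.0, 60.0],
    "fare": [10.0, 50.0, 70.0, 30.0],
})

# One row per distinct value of "survived", one column per averaged feature
group_means = toy.groupby("survived").mean()

# Number of rows that fell into each group
group_sizes = toy.groupby("survived").size()
```

The result is indexed by the group key, so `group_means.loc[1, "fare"]` reads the average fare of the rows with `survived == 1`.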

In [None]:
# Selecting row
df.iloc[0, :] # iloc gets rows (or columns) at particular positions in the index (so it only takes integers)

In [None]:
# Selecting specific value
df.iloc[0, 1]


In [None]:
# Selecting by index
df.loc[0] # loc gets rows (or columns) with particular labels from the index
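
With the default index (0, 1, 2, ...), `iloc` and `loc` return the same row, which hides the difference between them. The sketch below uses a made-up table with non-default labels (the `toy` DataFrame, names, and labels are hypothetical) so that position and label no longer coincide.

```python
import pandas as pd

# Hypothetical toy data with row labels 10, 20, 30 instead of 0, 1, 2
toy = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cal"], "age": [30, 41, 25]},
    index=[10, 20, 30],
)

# iloc is positional: position 0 is the first row, whatever its label
first_by_position = toy.iloc[0]

# loc is label-based: this is the row whose index label is 20
row_by_label = toy.loc[20]

# loc also accepts a column label to pick out a single value
cal_age = toy.loc[30, "age"]

# There is no row labelled 0 here, so toy.loc[0] would raise a KeyError
```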

# Preprocessing

In [None]:
# Rows with at least one NaN value
df[pd.isnull(df).any(axis=1)].head()

In [None]:
# Drop rows with NaN values
df = df.dropna() # removes rows with any NaN values
df = df.reset_index(drop=True) # renumbers the rows from 0 without keeping the old labels as a column
df.head()
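
Dropping rows leaves gaps in the row labels, which is why the index is reset afterwards. The following sketch, on made-up values (the `toy` DataFrame is hypothetical), shows the gap and the two ways `reset_index` can handle the old labels.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 has a missing age
toy = pd.DataFrame({
    "age": [np.nan, 22.0, 35.0],
    "fare": [7.25, 8.05, 53.1],
})

# dropna() removes row 0, so the surviving labels are 1 and 2
dropped = toy.dropna()
labels_after_drop = dropped.index.tolist()

# By default the old labels are kept as a new column named "index"
with_old_labels = dropped.reset_index()

# drop=True discards the old labels and only renumbers the rows
renumbered = dropped.reset_index(drop=True)
```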

In [None]:
# Dropping multiple columns
df = df.drop(["name", "cabin", "ticket"], axis=1) # we won't use text features for our initial basic models
df.head()

In [None]:
# Map feature values
df["sex"] = df["sex"].map({"female": 0, "male": 1}).astype(int)
df["embarked"] = df["embarked"].map({"S": 0, "C": 1, "Q": 2}).astype(int) # rows with NaN were already dropped above
df.head()
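
One thing to watch with `map` and a dictionary: any value that is not a key in the dictionary becomes NaN, and converting a column containing NaN with `astype(int)` raises an error. The sketch below shows this on a made-up Series, where `"X"` is a hypothetical unexpected port code that does not appear in the dataset description.

```python
import pandas as pd

# Hypothetical values; "X" stands in for a code missing from the mapping
ports = pd.Series(["S", "C", "Q", "X"])

# Values found in the dict are replaced; anything else becomes NaN
mapped = ports.map({"S": 0, "C": 1, "Q": 2})

# Checking for unmapped values before casting to int avoids a surprise error
unmapped = ports[mapped.isna()].tolist()
```

Here `unmapped` lists the offending values, which can then be added to the mapping or filtered out.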

# Feature engineering

In [None]:
# Create a new feature by applying a function to each row with a lambda expression
def get_family_size(sibsp, parch):
    family_size = sibsp + parch
    return family_size

df["family_size"] = df[["sibsp", "parch"]].apply(lambda x: get_family_size(x["sibsp"], x["parch"]), axis=1)
df.head()
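
For a simple sum like this, the same feature can also be computed by adding the two columns directly, which avoids calling a Python function once per row. The sketch below compares both approaches on made-up values (the `toy` DataFrame is hypothetical); it is an alternative, not a replacement for the row-wise pattern, which remains useful for logic that cannot be expressed column-wise.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"sibsp": [1, 0, 3], "parch": [0, 2, 1]})

# Row-wise: the lambda receives one row at a time
row_wise = toy[["sibsp", "parch"]].apply(
    lambda row: row["sibsp"] + row["parch"], axis=1
)

# Column-wise: the addition is applied to whole columns at once
column_wise = toy["sibsp"] + toy["parch"]

toy["family_size"] = column_wise
```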

In [None]:
# Reorder columns (only the columns listed here are kept)
df = df[['pclass', 'sex', 'age', 'sibsp', 'parch', 'family_size', 'fare', 'embarked', 'survived']]
df.head()

# Saving data

In [None]:
# Saving dataframe to CSV
df.to_csv("data/processed_titanic.csv", index=False)
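
Passing `index=False` leaves the row labels out of the file, so reading it back gives the same columns without an extra unnamed one. A minimal round-trip sketch, using an in-memory buffer instead of a file and made-up values (the `toy` DataFrame is hypothetical):

```python
import io

import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"pclass": [1, 3], "survived": [1, 0]})

# Write to an in-memory text buffer rather than a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the same text back into a new DataFrame
restored = pd.read_csv(io.StringIO(csv_text))
```

The first line of `csv_text` is the header `pclass,survived`; with the default `index=True` it would start with an empty field for the index column.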