# Playing with Pandas

In this notebook, we'll go over the basics of Pandas. The focus will be on going through one pass of the process instead of covering many concepts.

**Pandas** is a Python library for data manipulation and analysis.

Some of the reasons we use it are:

* Easy reading and writing of data from different sources and formats
* Easy to do common tasks like handling missing data
* Well-suited for tabular data with different types of columns
* Plays well with other libraries like NumPy (in fact, it builds on top of it)

Basically, it's easy to load, process and analyze data with *Pandas*.

But before we dive in, let's look at the whole ML process and where *Pandas* fits in.

In [1]:
import pandas as pd


## 1. Data Structures: Intro and basic access

The main data structures in *Pandas* are **Series** and **DataFrame**.

### Series

* One-dimensional array-like structure
* Capable of holding any single data type
* Has indices

So, it's a NumPy array with row labels and a name. Basically, a *smart* array.

![A series is an indexed array](resources/Series.png)

In [2]:
# "Normal" integer indexed series
s = pd.Series([22, 12, 18, 25, 30])
s

0    22
1    12
2    18
3    25
4    30
dtype: int64

In [3]:
# Series with String index and a name
s = pd.Series([22, 12, 18, 25, 30], index=['Anna', 'Bob', 'Carol', 'Dave', 'Elsa'], name='Age')
s

Anna     22
Bob      12
Carol    18
Dave     25
Elsa     30
Name: Age, dtype: int64
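
To make the "array with row labels and a name" description concrete, the sketch below pulls those three pieces out of a small Series. The variable names (`ages`, `values`, `labels`, `name`) are chosen here for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# A shortened version of the Series above
ages = pd.Series([22, 12, 18], index=['Anna', 'Bob', 'Carol'], name='Age')

values = ages.to_numpy()    # the underlying NumPy array
labels = list(ages.index)   # the row labels
name = ages.name            # the name of the Series

print(type(values).__name__, values.tolist(), labels, name)
```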

In [5]:
s['Anna': 'Carol']

Anna     22
Bob      12
Carol    18
Name: Age, dtype: int64
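
Note that the label slice above includes its end label (`'Carol'`), unlike ordinary Python slicing. The following sketch contrasts label-based and position-based access using `.loc` and `.iloc`; the variable names are assumptions made for this example.

```python
import pandas as pd

ages = pd.Series([22, 12, 18, 25, 30],
                 index=['Anna', 'Bob', 'Carol', 'Dave', 'Elsa'], name='Age')

by_label = ages.loc['Anna':'Carol']   # label slice: the end label is included
by_position = ages.iloc[0:2]          # positional slice: position 2 is excluded
single = ages['Dave']                 # a single label returns one value

print(by_label.tolist(), by_position.tolist(), single)
```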

### DataFrame

* Two-dimensional tabular data structure
* Has indices and columns
* Columns can be of different data types

You can think of it as:

* A dictionary of *Series* objects with the same index, or
* A 2-D NumPy array with row and column labels.

![A DataFrame has data, row labels and column labels](resources/DataFrame.png)
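
Both ways of thinking about a DataFrame can be tried directly. This is a minimal sketch with made-up values; `from_series` and `from_array` are hypothetical names used only here.

```python
import numpy as np
import pandas as pd

idx = ['Anna', 'Bob', 'Carol']

# View 1: a dictionary of Series that share the same index
age = pd.Series([22, 12, 18], index=idx)
gender = pd.Series(['F', 'M', 'F'], index=idx)
from_series = pd.DataFrame({'Age': age, 'Gender': gender})

# View 2: a 2-D NumPy array plus row and column labels
array = np.array([[22, 12], [12, 13], [18, 14]])
from_array = pd.DataFrame(array, index=idx, columns=['Age', 'Id'])

print(from_series)
print(from_array)
```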

In [6]:
people_data = [[22, 'F'], [12, 'M'], [18, 'F'], [25, 'M'], [30, 'F']]
names = ['Anna', 'Bob', 'Carol', 'Dave', 'Elsa']
fields = ['Age', 'Gender']

In [7]:
df = pd.DataFrame(people_data, index=names, columns=fields)
df

| | Age | Gender |
| --- | --- | --- |
| Anna | 22 | F |
| Bob | 12 | M |
| Carol | 18 | F |
| Dave | 25 | M |
| Elsa | 30 | F |


In [8]:
df = pd.DataFrame({'Age': [22, 12, 18, 25, 30], 'Gender': ['F', 'M', 'F', 'M', 'F'], 'Id': [12, 13, 14, 15, 16]}, index=names)
df

| | Age | Gender | Id |
| --- | --- | --- | --- |
| Anna | 22 | F | 12 |
| Bob | 12 | M | 13 |
| Carol | 18 | F | 14 |
| Dave | 25 | M | 15 |
| Elsa | 30 | F | 16 |


In [9]:
df.index

Index([u'Anna', u'Bob', u'Carol', u'Dave', u'Elsa'], dtype='object')

In [11]:
df['Age']

Anna     22
Bob      12
Carol    18
Dave     25
Elsa     30
Name: Age, dtype: int64

In [12]:
df[['Age', 'Id']]

| | Age | Id |
| --- | --- | --- |
| Anna | 22 | 12 |
| Bob | 12 | 13 |
| Carol | 18 | 14 |
| Dave | 25 | 15 |
| Elsa | 30 | 16 |
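
The two column selections above return different types: a single column name gives a Series, while a list of names gives a DataFrame. The sketch below checks this on a two-row frame and also shows row access by label with `.loc`, which the notebook has not used so far. The frame `people` is made up for the example.

```python
import pandas as pd

people = pd.DataFrame({'Age': [22, 12], 'Id': [12, 13]}, index=['Anna', 'Bob'])

one_column = people['Age']     # single name -> Series
sub_frame = people[['Age']]    # list of names -> DataFrame (here with one column)
row = people.loc['Bob']        # row label -> Series holding that row's values

print(type(one_column).__name__, type(sub_frame).__name__, row['Id'])
```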


In [13]:
df = pd.read_csv('resources/titanic.csv')

# Quick peek of the data. First 5 rows.
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [14]:
df.shape

(891, 12)

In [15]:
df.columns

Index([u'PassengerId', u'Survived', u'Pclass', u'Name', u'Sex', u'Age',
       u'SibSp', u'Parch', u'Ticket', u'Fare', u'Cabin', u'Embarked'],
      dtype='object')

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
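
The non-null counts above already hint at missing data: `Age` has 714 of 891 values (177 missing) and `Cabin` has only 204 (687 missing). A more direct way to count missing values per column is `isna().sum()`. The sketch below uses a tiny made-up frame rather than the Titanic file so that it runs on its own; the values in `sample` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row sample with gaps in two columns
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [None, 'C85', None, None],
    'Fare': [7.25, 71.2833, 7.925, 8.05],
})

# isna() marks missing cells as True; summing counts them per column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```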


In [17]:
df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [18]:
df['Age']

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
5       NaN
6      54.0
7       2.0
8      27.0
9      14.0
10      4.0
11     58.0
12     20.0
13     39.0
14     14.0
15     55.0
16      2.0
17      NaN
18     31.0
19      NaN
20     35.0
21     34.0
22     15.0
23     28.0
24      8.0
25     38.0
26      NaN
27     19.0
28      NaN
29      NaN
       ... 
861    21.0
862    48.0
863     NaN
864    24.0
865    42.0
866    27.0
867    31.0
868     NaN
869     4.0
870    26.0
871    47.0
872    33.0
873    47.0
874    28.0
875    15.0
876    20.0
877    19.0
878     NaN
879    56.0
880    25.0
881    33.0
882    22.0
883    28.0
884    25.0
885    39.0
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [19]:
# Mean Age
df['Age'].mean()

29.69911764705882
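
The mean above is computed over the non-missing ages only, because `mean()` skips `NaN` values by default. The sketch below illustrates that behaviour on a three-element Series with made-up values, and shows how the result changes if missing values are kept or filled instead.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 26.0])

mean_skipping_nan = ages.mean()           # NaN ignored: (22 + 26) / 2
mean_with_nan = ages.mean(skipna=False)   # NaN propagates to the result
mean_after_fill = ages.fillna(0).mean()   # NaN replaced by 0: (22 + 0 + 26) / 3

print(mean_skipping_nan, mean_with_nan, mean_after_fill)
```

Filling with 0 is used here only to show that the choice affects the result; it is rarely a sensible fill value for ages.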