# Pandas (Top-down)

## Introduction

- Pandas is one of the most widely used Python libraries for data science.
- It gives Python a data-frame workflow similar to the one R users know.
- [Official site](https://pandas.pydata.org/)
- [Author Wes McKinney](http://wesmckinney.com/)

We will start by thinking of pandas as a tool for handling **tabular data**. By handling we mean:
- Storing data in a memory-efficient way.
- Representing data in a format that is easy to interpret.
- Providing convenient methods to process and manipulate the data at hand.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('data/train.csv'); df.head() # explain the dataset, head(), and how Jupyter displays DataFrames and Series (not all values are printed)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [9]:
df['Name'] # selecting a single column

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
5                                       Moran, Mr. James
6                                McCarthy, Mr. Timothy J
7                         Palsson, Master. Gosta Leonard
8      Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                    Nasser, Mrs. Nicholas (Adele Achem)
10                       Sandstrom, Miss. Marguerite Rut
11                              Bonnell, Miss. Elizabeth
12                        Saundercock, Mr. William Henry
13                           Andersson, Mr. Anders Johan
14                  Vestrom, Miss. Hulda Amanda Adolfina
15                      Hewlett, Mrs. (Mary D Kingcome) 
16                                  Rice, Master. Eugene
17                          Wil

In [11]:
df[['Name', 'Sex', 'Age']] # selecting multiple columns by passing a list to the indexing operator []

| | Name | Sex | Age |
|---|---|---|---|
| 0 | Braund, Mr. Owen Harris | male | 22.0 |
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 |
| 2 | Heikkinen, Miss. Laina | female | 26.0 |
| 3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 |
| 4 | Allen, Mr. William Henry | male | 35.0 |
| 5 | Moran, Mr. James | male | NaN |
| 6 | McCarthy, Mr. Timothy J | male | 54.0 |
| 7 | Palsson, Master. Gosta Leonard | male | 2.0 |
| 8 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 |
| 9 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 |


#### Selecting rows and columns simultaneously

In [12]:
# df.loc[row_selection]; row_selection: single label, list of labels, slice of labels or sequence of booleans
df.loc[890] # single label

PassengerId                    891
Survived                         0
Pclass                           3
Name           Dooley, Mr. Patrick
Sex                           male
Age                             32
SibSp                            0
Parch                            0
Ticket                      370376
Fare                          7.75
Cabin                          NaN
Embarked                         Q
Name: 890, dtype: object

In [13]:
df.loc[[888, 889, 890]] # list of labels

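| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |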


In [14]:
df.loc[888:890] # slice of labels. Note: stop is included

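| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |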


In [15]:
# df.loc[row_selection, col_selection]; col_selection accepts the same kinds of values as row_selection.
df.loc[890, ['Age', 'Sex']] 

Age      32
Sex    male
Name: 890, dtype: object

A good way to practice subsetting is to picture a selection in your mind and then try to produce it with this notation.
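The first `.loc` cell lists a sequence of booleans as a valid row selection, but none of the cells above uses one. Here is a minimal sketch on a small hand-made frame: the three rows are copied from the `head()` output, and `small`, `mask` and `older` are hypothetical names.

```python
import pandas as pd

# Three rows copied from the head() output; `small` is a hypothetical name.
small = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina', 'Allen, Mr. William Henry'],
    'Sex': ['male', 'female', 'male'],
    'Age': [22.0, 26.0, 35.0],
})

# A boolean Series works as a row selection: rows where the mask is True are kept.
mask = small['Age'] > 25
older = small.loc[mask, ['Name', 'Age']]
print(older)
```

The mask is built with a comparison on a column, so the row selection and the column selection can still be combined in a single `.loc` call.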

## Pandas is built on top of NumPy

Can we think of a pandas DataFrame or Series as a NumPy array with custom row and column labels, i.e. a dictionary of NumPy arrays?

If so, most of the methods and attributes that work on a NumPy array should also work on pandas objects.
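A small sketch of that idea, using the first five ages from the `head()` output (`ages` and `arr` are hypothetical names). It also shows one place where the two libraries differ by default: `Series.std()` uses `ddof=1`, while the NumPy array method uses `ddof=0`.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0], name='Age')  # first five ages from head()

arr = ages.to_numpy()          # the underlying NumPy array
roots = np.sqrt(ages)          # NumPy ufuncs accept a Series and return a Series
print(type(arr), type(roots))

# Same mean either way, but the default standard deviation differs:
# pandas normalizes by N - 1 (ddof=1), NumPy by N (ddof=0).
print(ages.mean(), arr.mean())
print(ages.std(), arr.std(), arr.std(ddof=1))
```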

In [22]:
df['Age'].mean(), df['Age'].std()

(29.69911764705882, 14.526497332334044)

In [19]:
df['Age'].max(), df['Age'].argmax() # argmax returns the integer position of the maximum; use idxmax() for the index label


(80.0, 630)
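Label and position of the maximum coincide here only because the index is the default 0 to 890 range. A sketch with a hypothetical Series whose labels differ from its positions makes the distinction visible:

```python
import pandas as pd

# Hypothetical Series with non-default labels so that label and position differ.
ages = pd.Series([22.0, 80.0, 35.0], index=[100, 630, 7])

label = ages.idxmax()                 # index label of the maximum
position = ages.to_numpy().argmax()   # integer position of the maximum
print(label, position)
```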

In [20]:
df.shape

(891, 12)

In [21]:
df['Age'].dtypes

dtype('float64')

Beyond methods and attributes, NumPy concepts such as **broadcasting** also carry over to pandas DataFrames and Series.

In [25]:
df['Age'].head() + 1 # the arithmetic operation is broadcast to every element of the DataFrame or Series.

0    23.0
1    39.0
2    27.0
3    36.0
4    36.0
Name: Age, dtype: float64

In [27]:
df['Age'].head() > 30 # comparison operators also work.

0    False
1     True
2    False
3     True
4     True
Name: Age, dtype: bool

So most NumPy methods, attributes and concepts carry over. On top of that, pandas objects have some **extra methods and operations** of their own.

In [35]:
df['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [39]:
df['Sex'].unique() # related methods: tail(), head(), nunique(), etc.

array(['male', 'female'], dtype=object)

### Anatomy of DataFrame

At first glance, a DataFrame looks like any other two-dimensional table of data: it has rows and it has columns. Technically, it has three main components.

#### The three components of a DataFrame

A DataFrame is composed of three components: the **index**, the **columns**, and the **data**. You need to be aware of all three to use a DataFrame to its full potential. The data is also known as the **values**.

In [28]:
df.index

RangeIndex(start=0, stop=891, step=1)

In [29]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [30]:
df.values

array([[1, 0, 3, ..., 7.25, nan, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, nan, 'S'],
       ...,
       [889, 0, 3, ..., 23.45, nan, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)

In [31]:
df.index.values # it's a NumPy array

array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
       169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 18

In [32]:
df.columns.values # the column labels are also a NumPy array

array(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype=object)
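To see the three components working together, here is a sketch that goes the other way and assembles a DataFrame from a NumPy array of values, a list of row labels and a list of column labels. The numbers are the first three `Age` and `Fare` values from the `head()` output; `values` and `rebuilt` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Data: a plain 2-D NumPy array (Age and Fare of the first three passengers).
values = np.array([[22.0, 7.25], [38.0, 71.2833], [26.0, 7.925]])

# Index and columns are just labels attached to the two axes of that array.
rebuilt = pd.DataFrame(values, index=[0, 1, 2], columns=['Age', 'Fare'])

print(rebuilt.index.tolist())
print(rebuilt.columns.tolist())
print(rebuilt.to_numpy())
```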

#### Summary:
<img src="images/anatomy.png"  width="400">

So a pandas DataFrame or Series is essentially several NumPy arrays put together in a meaningful way. In essence, pandas is a NumPy array with a labeled index. Logically, then, we should also be able to access rows and columns by **integer location**.

#### Pandas objects have a **dual reference**

The next obvious question is why we want a dual reference. For the same reason we have both **for** and **while** loops: sometimes a *for* loop is more appropriate and sometimes a *while* loop, depending on the problem at hand. The same reasoning applies to having both **labels** and **integer locations** for referencing.

To differentiate between the two references we have the **iloc** indexer. It tells pandas that you are using *integer locations* for reference.

In [33]:
# df.iloc[row_selection, col_selection] 
# row, col can be int, list of int, slice of int
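The cell above only describes the `iloc` signature, so here is a minimal sketch of each selection type. It assumes a small hypothetical frame with string labels (`small`), chosen so that labels and integer locations cannot be confused.

```python
import pandas as pd

# Hypothetical frame with string labels so labels and positions are clearly different.
small = pd.DataFrame(
    {'Age': [22.0, 38.0, 26.0], 'Sex': ['male', 'female', 'female']},
    index=['a', 'b', 'c'],
)

row = small.iloc[0]                # single int: first row, returned as a Series
cell = small.iloc[2, 0]            # row position 2, column position 0
block = small.iloc[0:2, [0, 1]]    # slice of ints: stop is excluded
by_label = small.loc['a':'b']      # label slice for comparison: stop is included
print(block)
```

Both `block` and `by_label` contain rows `a` and `b`: the integer slice `0:2` stops before position 2, while the label slice `'a':'b'` includes `'b'`.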

___
#### Redefining DataFrame and Series: NumPy arrays with axis labels
___

## Processing dataframe

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('data/train.csv'); df.head()

In [41]:
names = df.Name.copy()

In [51]:
names[0].split(', ')[0]

'Braund'

In [52]:
names.head().str.split(', ')

0                            [Braund, Mr. Owen Harris]
1    [Cumings, Mrs. John Bradley (Florence Briggs T...
2                             [Heikkinen, Miss. Laina]
3       [Futrelle, Mrs. Jacques Heath (Lily May Peel)]
4                           [Allen, Mr. William Henry]
Name: Name, dtype: object

In [54]:
names.head().str.split(', ').str[0] # .str.get(0) will have the same effect

0       Braund
1      Cumings
2    Heikkinen
3     Futrelle
4        Allen
Name: Name, dtype: object

In [55]:
names.head().str.split(', ', expand=True)

| | 0 | 1 |
|---|---|---|
| 0 | Braund | Mr. Owen Harris |
| 1 | Cumings | Mrs. John Bradley (Florence Briggs Thayer) |
| 2 | Heikkinen | Miss. Laina |
| 3 | Futrelle | Mrs. Jacques Heath (Lily May Peel) |
| 4 | Allen | Mr. William Henry |
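The expanded result is an ordinary DataFrame, so its columns can be named and used like any others. A sketch with two names from the dataset, assuming the hypothetical column names `LastName` and `Rest`:

```python
import pandas as pd

names = pd.Series(['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'], name='Name')

# n=1 splits only at the first ', ' so every row yields exactly two parts.
parts = names.str.split(', ', n=1, expand=True)
parts.columns = ['LastName', 'Rest']   # hypothetical column names
print(parts)
```

Limiting the split with `n=1` is a design choice: without it, a name containing a second `', '` would produce a third column.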


In [None]:
# Create new feature FamilySize as a combination of SibSp and Parch
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

In [None]:
# Create new feature IsAlone from FamilySize
df['IsAlone'] = 0
df.loc[df['FamilySize'] == 1, 'IsAlone'] = 1
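The same flag can also be built in one step by converting the boolean comparison directly. This is a sketch on three hypothetical rows (`df_small` is not part of the notebook), not a replacement for the cell above:

```python
import pandas as pd

df_small = pd.DataFrame({'SibSp': [1, 0, 1], 'Parch': [0, 0, 2]})  # hypothetical rows
df_small['FamilySize'] = df_small['SibSp'] + df_small['Parch'] + 1

# The comparison gives a boolean Series; astype(int) turns True/False into 1/0.
df_small['IsAlone'] = (df_small['FamilySize'] == 1).astype(int)
print(df_small)
```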

In [None]:
# Fill missing values in the Fare column with the median
df['Fare'] = df['Fare'].fillna(df['Fare'].median())

# df.loc[df['Fare'].isnull(), 'Fare'] =  df['Fare'].median()
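To check what `fillna` with the median does, here is a sketch on a short hypothetical Series with one missing value. `median()` skips missing values by default, so the fill value is computed from the observed fares only.

```python
import numpy as np
import pandas as pd

fare = pd.Series([7.25, np.nan, 8.05, 53.1], name='Fare')  # hypothetical values

median = fare.median()         # NaN is skipped: median of 7.25, 8.05 and 53.1
filled = fare.fillna(median)   # returns a new Series; `fare` itself is unchanged
print(median, filled.isnull().sum())
```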

In [None]:
# Mapping Sex
df['Sex'] = df['Sex'].map( {'female': 0, 'male': 1} ).astype(int)

In [61]:
# Mapping Embarked: inspect the categories first
df.Embarked.astype('category').cat.categories

Index(['C', 'Q', 'S'], dtype='object')
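The cell above only lists the categories. One possible way to turn them into integers is `cat.codes`, sketched here on a hypothetical Series (`embarked`) that includes one missing value to show how it is encoded:

```python
import pandas as pd

embarked = pd.Series(['S', 'C', 'S', 'Q', None], name='Embarked')  # hypothetical values

as_cat = embarked.astype('category')
codes = as_cat.cat.codes   # categories are sorted, so C -> 0, Q -> 1, S -> 2
print(list(as_cat.cat.categories))
print(codes.tolist())      # a missing value is encoded as -1
```

Because missing values come out as `-1`, it is usually worth deciding how to fill them before relying on the codes.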

In [None]:
# Mapping Fare
df.loc[ df['Fare'] <= 7.91, 'Fare']= 0
df.loc[(df['Fare'] > 7.91) & (df['Fare'] <= 14.454), 'Fare'] = 1
df.loc[(df['Fare'] > 14.454) & (df['Fare'] <= 31), 'Fare'] = 2
df.loc[ df['Fare'] > 31, 'Fare'] = 3
df['Fare'] = df['Fare'].astype(int)

In [64]:
pd.cut(df['Fare'].head(), 4, labels=False)

0    0
1    3
2    0
3    2
4    0
Name: Fare, dtype: int64
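Passing an integer to `pd.cut` gives equal-width bins, which is not the same binning as the `Mapping Fare` cell. As a sketch, the thresholds from that cell can be passed as explicit bin edges instead. The fares are the five values from the `head()` output; `edges` and `fare_band` are hypothetical names.

```python
import numpy as np
import pandas as pd

fare = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05], name='Fare')  # fares from head()

# Bin edges taken from the Mapping Fare cell. Intervals are right-closed by default:
# (-inf, 7.91] -> 0, (7.91, 14.454] -> 1, (14.454, 31] -> 2, (31, inf] -> 3.
edges = [-np.inf, 7.91, 14.454, 31, np.inf]
fare_band = pd.cut(fare, bins=edges, labels=False)
print(fare_band.tolist())
```

With `labels=False` the result holds the integer bin number of each value, which matches what the chain of `df.loc[...]` assignments produces.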

In [None]:
# Mapping Age
df.loc[ df['Age'] <= 16, 'Age'] = 0
df.loc[(df['Age'] > 16) & (df['Age'] <= 32), 'Age'] = 1
df.loc[(df['Age'] > 32) & (df['Age'] <= 48), 'Age'] = 2
df.loc[(df['Age'] > 48) & (df['Age'] <= 64), 'Age'] = 3
df.loc[ df['Age'] > 64, 'Age'] = 4

In [None]:
# Feature selection: remove variables no longer containing relevant information
drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp']
df = df.drop(drop_elements, axis = 1)