*For full screen view & to download, visit *
- `HTML` : <a href="https://prasanth-ntu.github.io/html/ML-Course/Pandas-Introduction/Pandas-Introduction-Notes.html" target="_blank">Click here</a>
- `GITHUB` : <a href="https://github.com/prasanth-ntu/prasanth-ntu.github.io/blob/master/html/ML-Course/Pandas-Introduction/Pandas-Introduction-Notes.ipynb" target="_blank">Click here</a>
- `NBVIEWER` : <a href="https://nbviewer.jupyter.org/github/prasanth-ntu/prasanth-ntu.github.io/blob/master/html/ML-Course/Pandas-Introduction/Pandas-Introduction-Notes.ipynb" target="_blank">Click here</a>

*To get back to ML website*, <a href="https://prasanth-ntu.github.io/html/ML-Course-9.html" target="_blank">Click here</a>

In [1]:
import pandas as pd
import numpy as np

# Basic Data Structure
<ul>
    <li><code>Series</code> : <b>One-dimensional</b> labeled array that can hold any data type</li>
    <li><code>DataFrame</code>: <b>Two-dimensional</b> labeled table whose columns can have different data types</li>
    <li><code>Panel</code>  : <b>Three-dimensional</b> data structure (deprecated in pandas 0.20 and removed in 0.25)</li>
    <li><code>PanelND</code>: <b>N-dimensional</b> data structure (also removed from current pandas releases)</li>
</ul>
<p>More details : <a href="https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro" target="_blank">Pandas - Basic Data Structure</a></p>

## Series

- A Pandas Series is a one-dimensional array of indexed data. 
- It's capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.) 
- Unlike a Python `list` or a NumPy `ndarray`, a Pandas `Series` can have an explicitly defined index, set with the keyword `index`.

- We will see 
    - how to create pandas `Series` from `list` and `dict` 
    - how to access index names/numbering, values,  
    - how to do indexing and slicing
    - how to define indexes (both numeric and non-numeric)
    
<p><font size="3" color="blue">A Pandas Series has only one index</font></p>


**Create a Pandas Series with a list**

If we don't specify an index, Pandas assigns a default integer index starting from 0.

In [10]:
s = pd.Series(data=[1,3,5,np.nan,9.0,11.5, 'hello'])  
print (s)

0        1
1        3
2        5
3      NaN
4        9
5     11.5
6    hello
dtype: object


<p>Note: <code>np.nan</code>, or <b>'Not a Number'</b>, is a special floating point value that indicates a missing data point (the older alias <code>np.NaN</code> is no longer available in NumPy 2.0). In this case, the value at index <code>3</code> is not available.
<br><br>
We will go through common ways to handle <code>np.nan</code> in the later sections of this post.</p>
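As a small sketch of the note above (the name `s_demo` is only for illustration), the snippet below checks that `np.nan` is an ordinary float, that it never compares equal to itself, and that `isna()` is therefore the reliable way to locate missing entries.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1, 3, 5, np.nan, 9.0, 11.5, 'hello'])

nan_is_float = isinstance(np.nan, float)   # NaN is a regular Python float
nan_equals_itself = np.nan == np.nan       # NaN never compares equal, even to itself
missing_mask = s_demo.isna()               # element-wise test for missing values
n_missing = int(missing_mask.sum())        # number of missing entries

print(nan_is_float, nan_equals_itself, n_missing)
```

Because `==` cannot detect `NaN`, comparisons such as `s_demo == np.nan` find nothing; `isna()` is the intended tool.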

### `index` & `values`

- Applying `.index` on a pandas Series will return the index names/numbering.
- Applying `.values` on a pandas Series will return the values as *numpy array*.

In [11]:
s.index

RangeIndex(start=0, stop=7, step=1)

In [12]:
s.values

array([1, 3, 5, nan, 9.0, 11.5, 'hello'], dtype=object)

### indexing & slicing
Similar to list & numpy in Python, we can access data in Series by using square-bracket notation `[]`.

In [14]:
s[1]           # indexing - returns an element

3

In [15]:
s[0:4]         # slicing - returns a pandas Series

0      1
1      3
2      5
3    NaN
dtype: object

### `name`

Using `.name`, we can assign a name to the Pandas Series and to its index.

In [17]:
s            # Before assigning name to Series and its index

0        1
1        3
2        5
3      NaN
4        9
5     11.5
6    hello
dtype: object

In [24]:
s.name = 'Series name is Cool'
s.index.name= 'idx'

In [25]:
s            # After assigning name to Series and its index

idx
0        1
1        3
2        5
3      NaN
4        9
5     11.5
6    hello
Name: Series name is Cool, dtype: object

**Create a Pandas Series with a list and user-defined index**

Here, we are specifying **index values** for each element in the list. This allows us to override the default sequential indexing.

<u>Note:</u> <i>If data has <code>n</code> elements, index must be the same length as data.</i>

In [29]:
s2 = pd.Series(data=[1,3,5,np.nan,9.0,11.5, 'hello'], index=[0,10,2,3,4,5,6])  
print (s2)

0         1
10        3
2         5
3       NaN
4         9
5      11.5
6     hello
dtype: object


In [30]:
s2.index

Int64Index([0, 10, 2, 3, 4, 5, 6], dtype='int64')
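The length rule noted above can be checked directly. The following is a minimal sketch with made-up values: passing three data values with only two index labels makes the constructor raise a `ValueError`.

```python
import pandas as pd

try:
    # 3 values but only 2 labels: the lengths do not match
    pd.Series(data=[1, 3, 5], index=['a', 'b'])
    length_error = None
except ValueError as exc:
    length_error = str(exc)

print(length_error)
```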

We can also specify **non-numeric index values** for each element in the list.

In [36]:
s3 = pd.Series(data=[1,3,5,np.nan,9.0,11.5, 'hello'], index=["one", "two", "three", "four", "five", "six", "seven"])  
print (s3)

one          1
two          3
three        5
four       NaN
five         9
six       11.5
seven    hello
dtype: object


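We can also perform **indexing on a Pandas Series with user-defined non-numeric index values**. In the pandas version used here, position-based indexing with plain `[]` still works even though the default index has been overwritten. Recent pandas releases deprecate integer keys in `[]` on a non-integer index, so `s3.iloc[2]` is the explicit, version-independent spelling.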

In [43]:
s3[2]                   # using numeric indexing

5

In [44]:
s3["three"]             # using non-numeric indexing 

5

In [46]:
s3[0:6]

one         1
two         3
three       5
four      NaN
five        9
six      11.5
dtype: object

In [39]:
s3["one":"six"]     # slicing

one         1
two         3
three       5
four      NaN
five        9
six      11.5
dtype: object
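The two kinds of access shown above can be made explicit with `.loc` (labels) and `.iloc` (positions). This is a sketch on a copy of the same data, named `s3_demo` here for illustration. Note that a label slice includes its end label, while a position slice excludes its end position.

```python
import numpy as np
import pandas as pd

s3_demo = pd.Series(
    [1, 3, 5, np.nan, 9.0, 11.5, 'hello'],
    index=["one", "two", "three", "four", "five", "six", "seven"],
)

by_label = s3_demo.loc["three"]             # look up by index label
by_position = s3_demo.iloc[2]               # look up by integer position
label_slice = s3_demo.loc["one":"six"]      # end label "six" IS included
position_slice = s3_demo.iloc[0:6]          # end position 6 is NOT included

print(by_label, by_position)
print(list(label_slice.index) == list(position_slice.index))
```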

In [47]:
# Attributes values and index
s = pd.Series([1,3,5,np.nan,9.0,11.5, 'hello'])
val_s = s.values
idx_s = s.index
val_s

array([1, 3, 5, nan, 9.0, 11.5, 'hello'], dtype=object)

In [48]:
val_s, idx_s

(array([1, 3, 5, nan, 9.0, 11.5, 'hello'], dtype=object),
 RangeIndex(start=0, stop=7, step=1))

**Create a Pandas Series with a dictionary**

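When we convert a dictionary into a Pandas Series, the dictionary keys become the index. In the pandas version used here, the index is sorted by key, as the output below shows; pandas 0.23 and later (on Python 3.6+) keep the dictionary's insertion order instead.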

In [54]:
d = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}       # user-defined index
s = pd.Series(d)
s

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [55]:
s.index

Index(['Ohio', 'Oregon', 'Texas', 'Utah'], dtype='object')
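Since the resulting order depends on the pandas version, a version-independent sketch is to sort explicitly with `sort_index()`. The same snippet also shows that passing `index=` together with a dictionary selects and orders keys, and that a label missing from the dictionary (the hypothetical `'Nevada'` here) gets `NaN`.

```python
import pandas as pd

d_demo = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

# Explicit sort: same result regardless of how the constructor orders keys
s_sorted = pd.Series(d_demo).sort_index()

# index= picks keys in the given order; 'Nevada' is not in the dict, so it becomes NaN
s_picked = pd.Series(d_demo, index=['Texas', 'Ohio', 'Nevada'])

print(s_sorted)
print(s_picked)
```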

**Create a Pandas Series with repeated indices**

In [56]:
s = pd.Series(np.random.randn(6), index=['a', 'b', 'r', 'a', 'c', 'a'])  # repeated indices
s

a   -0.883589
b   -0.292810
r   -0.333291
a    0.483789
c    1.633583
a   -1.122912
dtype: float64

In [59]:
s['a']               # extracts all values with index as 'a'

a   -0.883589
a    0.483789
a   -1.122912
dtype: float64
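With repeated labels, the type of the result depends on how often the label occurs. The sketch below uses fixed values instead of random ones (an assumption made so the result is reproducible): a label that occurs once returns a scalar, a repeated label returns a Series, and `groupby(level=0)` combines the values that share a label.

```python
import pandas as pd

s_dup = pd.Series([10, 20, 30, 40, 50, 60],
                  index=['a', 'b', 'r', 'a', 'c', 'a'])

has_unique_labels = s_dup.index.is_unique    # False: 'a' appears three times
single = s_dup['b']                          # label occurs once -> scalar
repeated = s_dup['a']                        # label occurs 3 times -> Series
totals = s_dup.groupby(level=0).sum()        # one total per distinct label

print(has_unique_labels, single)
print(repeated)
print(totals)
```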

**Create a Pandas Series and update/change the index names**

We can create a Series without specifying any index values/names and change the index later.

In [68]:
s = pd.Series(data=[1, 3, 5, np.nan, 9.0, 11.5, 'hello'])  
print ("Pandas Series Values:", s.values)
print ("Pandas Series Index (before updating the index):\n", s.index)

Pandas Series Values: [1 3 5 nan 9.0 11.5 'hello']
Pandas Series Index (before updating the index):
 RangeIndex(start=0, stop=7, step=1)


In [69]:
s.index = ['a', 'b', 'r', 'a', 'c', 'a', 'd']         # updating the index names
print ("Pandas Series Index (after updating the index):\n", s.index)

Pandas Series Index (after updating the index):
 Index(['a', 'b', 'r', 'a', 'c', 'a', 'd'], dtype='object')


In [70]:
s

a        1
b        3
r        5
a      NaN
c        9
a     11.5
d    hello
dtype: object

## `DataFrame`

DataFrames are essentially multidimensional arrays with attached **row** and **column** labels, and often with heterogeneous types and/or missing data.

<p><font size="3" color="blue">A DataFrame has two indexes: <code>index</code> for rows and <code>columns</code> for columns</font></p>

**Create a DataFrame from a dictionary of pd.Series**

`d` is a dictionary made of `pd.Series` values. Using `pd.DataFrame()`, the `d` dictionary is converted into a `DataFrame`.

Since the `Age` value for `Steven` is not given, `DataFrame` automatically fills it with `NaN`.

In [77]:
d = {'Age'    : pd.Series([27, 21, 30],         index=['John', 'Emma', 'Andrew']), 
     'Height' : pd.Series([165, 177, 154, 169], index=['John', 'Emma', 'Andrew', 'Steven'])}
df_ds = pd.DataFrame(d)
df_ds

Unnamed: 0,Age,Height
Andrew,30.0,154
Emma,21.0,177
John,27.0,165
Steven,,169
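A side effect of the automatic `NaN` fill is a change of dtype. The sketch below rebuilds the same frame under the illustrative name `df_demo` and inspects it: `Age` becomes `float64` because `NaN` is a float, while `Height`, which has no gaps, stays `int64`.

```python
import pandas as pd

d_demo = {
    'Age':    pd.Series([27, 21, 30],         index=['John', 'Emma', 'Andrew']),
    'Height': pd.Series([165, 177, 154, 169], index=['John', 'Emma', 'Andrew', 'Steven']),
}
df_demo = pd.DataFrame(d_demo)

age_dtype = df_demo['Age'].dtype           # float64, because of the NaN for Steven
height_dtype = df_demo['Height'].dtype     # int64, no missing values
steven_age_missing = pd.isna(df_demo.loc['Steven', 'Age'])

print(age_dtype, height_dtype, steven_age_missing)
```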


### `index`, `columns` & `values`

In [11]:
df_ds.index

Index(['Andrew', 'Emma', 'John', 'Steven'], dtype='object')

In [12]:
df_ds.columns

Index(['Age', 'Height'], dtype='object')

In [13]:
df_ds.values

array([[ 30., 154.],
       [ 21., 177.],
       [ 27., 165.],
       [ nan, 169.]])

The code below is an expanded version of the above code.

In [14]:
s_1 = pd.Series([27, 21, 30],         index=['John', 'Emma', 'Andrew'])
s_2 = pd.Series([165, 177, 154, 169], index=['John', 'Emma', 'Andrew', 'Steven'])
d = {'Age'    : s_1,
     'Height' : s_2}
df_ds = pd.DataFrame(d)
df_ds

Unnamed: 0,Age,Height
Andrew,30.0,154
Emma,21.0,177
John,27.0,165
Steven,,169


### Sorting using `sort_values(by='')`

In [17]:
df_ds.sort_values(by="Age", ascending=False)

Unnamed: 0,Age,Height
Andrew,30.0,154
John,27.0,165
Emma,21.0,177
Steven,,169
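In the sorted output above, the row with the missing `Age` ends up last. That is the default of `sort_values`; the `na_position` parameter controls it. A short sketch on an equivalent frame (named `df_demo` for illustration):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    {'Age': [30.0, 21.0, 27.0, np.nan], 'Height': [154, 177, 165, 169]},
    index=['Andrew', 'Emma', 'John', 'Steven'],
)

# Default: rows with NaN in the sort column are placed last
nan_last = df_demo.sort_values(by='Age', ascending=False)

# na_position='first' moves them to the top instead
nan_first = df_demo.sort_values(by='Age', ascending=False, na_position='first')

print(list(nan_last.index))
print(list(nan_first.index))
```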


### indexing & slicing

Slicing with `[]` extracts all the columns of a set of rows.

In [155]:
df_ds['Andrew':'Andrew']          # slicing - returns a Pandas DataFrame

Unnamed: 0,Age,Height
Andrew,30.0,154


In [174]:
df_ds[0:2]                        # we select the specific rows by integer position

Unnamed: 0,Age,Height
Andrew,30.0,154
Emma,21.0,177


In [146]:
df_ds['Andrew':'Emma']            # we select the specific rows using index name

Unnamed: 0,Age,Height
Andrew,30.0,154
Emma,21.0,177


In [69]:
df_ds["Age"]                      # we select the specific column  
                                  # returns a Pandas Series

Andrew    30.0
Emma      21.0
John      27.0
Steven     NaN
Name: Age, dtype: float64

In [173]:
df_ds["Andrew":"Emma"]["Age"]     # we select the specific rows & a column

Andrew    30.0
Emma      21.0
Name: Age, dtype: float64
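The chained selection above (rows first, then a column) can also be written as a single `.loc[rows, column]` call. The sketch below, on an equivalent frame called `df_demo` for illustration, shows that both give the same result for reading. For assignment, the single `.loc` call is the dependable form, because a chained expression may operate on a temporary copy.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    {'Age': [30.0, 21.0, 27.0, np.nan], 'Height': [154, 177, 165, 169]},
    index=['Andrew', 'Emma', 'John', 'Steven'],
)

chained = df_demo['Andrew':'Emma']['Age']           # two indexing steps
single_step = df_demo.loc['Andrew':'Emma', 'Age']   # one step: rows and column together
same_result = chained.equals(single_step)

# Assignment through a single .loc call modifies df_demo itself
df_demo.loc['Andrew':'Emma', 'Age'] = [31.0, 22.0]

print(same_result)
print(df_demo)
```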

**Create a DataFrame from a dictionary of lists**

In [167]:
# 2. Create from a dictionary of lists
l_1 = [1, 2, 3, 4, 5]
l_2 = [2, 4, 6, 8, 10]
l_d = {'list1': l_1, 'list2':l_2}   
df_dl = pd.DataFrame(l_d)
df_dl

Unnamed: 0,list1,list2
0,1,2
1,2,4
2,3,6
3,4,8
4,5,10
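A dictionary of lists describes the table column by column. The same table can also be described row by row, as a list of dictionaries. The sketch below (variable names are illustrative) builds the frame both ways and confirms they are identical.

```python
import pandas as pd

# Column-oriented: one list per column
from_columns = pd.DataFrame({'list1': [1, 2, 3, 4, 5],
                             'list2': [2, 4, 6, 8, 10]})

# Row-oriented: one dictionary per row
records = [{'list1': a, 'list2': b}
           for a, b in zip([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])]
from_rows = pd.DataFrame(records)

same_frame = from_columns.equals(from_rows)
print(same_frame)
```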


We can also index and slice through the rows and columns of the above DataFrame.

In [175]:
df_dl[0:3]

Unnamed: 0,list1,list2
0,1,2
1,2,4
2,3,6


In [169]:
df_dl[0:3]['list1']

0    1
1    2
2    3
Name: list1, dtype: int64

## Index Object
Both the `Series` and `DataFrame` objects contain an explicit **`index`** that helps you reference and modify data. 

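It is an `immutable array` and behaves like an `ordered set`.
- `immutable` - Individual values in the index cannot be changed in place (the whole index can still be replaced, as we did with `s.index = [...]` above)
- `ordered set` - The values stay in the sequence in which they were created, and set operations such as union and intersection are supported. Unlike a true set, an index may contain repeated values, as shown earlier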

In [179]:
idx1 = pd.Index([1, 3, 5, 7, 9, 11])
idx2 = pd.Index([2, 4, 5, 7, 9, 12])

print (idx1)
print ("Data type:", type(idx1))

Int64Index([1, 3, 5, 7, 9, 11], dtype='int64')
Data type: <class 'pandas.core.indexes.numeric.Int64Index'>
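The immutability described above can be verified with a short sketch (the names `idx_demo` and `s_demo` are illustrative): assigning to a single position of an `Index` raises a `TypeError`, whereas replacing the whole index of a Series is allowed.

```python
import pandas as pd

idx_demo = pd.Index([1, 3, 5, 7, 9, 11])

try:
    idx_demo[0] = 99                 # in-place modification is not supported
    mutation_error = None
except TypeError as exc:
    mutation_error = str(exc)

s_demo = pd.Series([10, 20, 30])
s_demo.index = ['a', 'b', 'c']       # replacing the entire index is fine

print(mutation_error)
print(list(s_demo.index))
```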


### indexing and slicing

In [188]:
idx1[2]                 # indexing

5

In [190]:
idx1[2:5]               # slicing

Int64Index([5, 7, 9], dtype='int64')

### `size`,`shape`,`ndim`,`dtype`

In [195]:
print ("size of the index      :", idx1.size)    # can be applied to Series, DataFrame as well
print ("shape of the index     :", idx1.shape)   # can be applied to Series, DataFrame as well
print ("dimension of the index :", idx1.ndim)    # can be applied to Series, DataFrame as well
print ("type of the index      :", idx1.dtype)   # can be applied to Series as well

size of the index      : 6
shape of the index     : (6,)
dimension of the index : 1
type of the index      : int64


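### Set operations - `intersection`, `union`, `symmetric_difference`

https://c1.staticflickr.com/5/4162/34264296262_7cb1fb7395_o.jpg

In [207]:
# Older pandas releases also accepted the operators `&`, `|`, `^` for these set operations;
# that usage was later deprecated, so the named methods are the version-independent form.
print ('idx1                                  :', idx1)
print ('idx2                                  :', idx2, "\n")
print ("Intersection of idx1 and idx2         :", idx1.intersection(idx2))
print ("Union of idx1 and idx2                :", idx1.union(idx2))
print ("Symmetric difference of idx1 and idx2 :", idx1.symmetric_difference(idx2))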

idx1                                  : Int64Index([1, 3, 5, 7, 9, 11], dtype='int64')
idx2                                  : Int64Index([2, 4, 5, 7, 9, 12], dtype='int64') 

Intersection of idx1 and idx2         : Int64Index([5, 7, 9], dtype='int64')
Union of idx1 and idx2                : Int64Index([1, 2, 3, 4, 5, 7, 9, 11, 12], dtype='int64')
Symmetric difference of idx1 and idx2 : Int64Index([1, 2, 3, 4, 11, 12], dtype='int64')


# Basic Statistics

In [215]:
# Create a DataFrame
# AZ = Arizona, CA = California, FL = Florida
# HI = Hawaii,  MI = Michigan,   PA = Pennsylvania
# TX = Texas,   VA = Virginia,   WA = Washington

s_popu  = pd.Series([6.9,   39.3,  20.6,  1.43, 9.93,  12.8,  27.9,  8.4,   7.3])    # million persons
s_area  = pd.Series([295.2, 424.0, 170.3, 28.3, 250.5, 119.3, 695.7, 110.8, 184.7])  # thousand km^2
s_warea = pd.Series([1.0,   20.5,  31.4,  11.7, 104.1, 3.4,   19.1,  8.5,   12.5])   # thousand km^2

d = {'area_water'  : s_warea, 
     'population'  : s_popu,
     'area_total'  : s_area
    }

df = pd.DataFrame(d)
df.index = ['AZ', 'CA', 'FL', 'HI', 'MI', 'PA', 'TX', 'VA', 'WA']
df

Unnamed: 0,area_total,area_water,population
AZ,295.2,1.0,6.9
CA,424.0,20.5,39.3
FL,170.3,31.4,20.6
HI,28.3,11.7,1.43
MI,250.5,104.1,9.93
PA,119.3,3.4,12.8
TX,695.7,19.1,27.9
VA,110.8,8.5,8.4
WA,184.7,12.5,7.3


## `mean`, `min`, `max`,`std`

In [223]:
df_mean = df.mean(axis = 0)                  # axis=0 => one mean per column (computed down the rows)
print ("The average of ...\n", df_mean)

The average of ...
 area_total    253.200000
area_water     23.577778
population     14.951111
dtype: float64


In [224]:
df_min = df.min(axis = 0)                   # axis=0 => one minimum per column
print ("The minimum of ...\n", df_min)

The minimum of ...
 area_total    28.30
area_water     1.00
population     1.43
dtype: float64


In [225]:
df_max = df.max(axis = 0)                   # axis=0 => one maximum per column
print ("The maximum of ...\n", df_max)

The maximum of ...
 area_total    695.7
area_water    104.1
population     39.3
dtype: float64


In [229]:
df_std = df.std(axis = 0)                   # axis=0 => one (sample) standard deviation per column
print ("The standard deviation of ...\n", df_std)

The standard deviation of ...
 area_total    202.207140
area_water     31.588320
population     12.100559
dtype: float64
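To make the `axis` argument concrete, here is a sketch on three rows of the table above (the reduced frame `df_demo` is only for illustration). `axis=0` gives one value per column and `axis=1` one value per row. It also checks that `std()` is the sample standard deviation (`ddof=1`), which differs from NumPy's default of `ddof=0`. Averaging across columns with different units is not meaningful here; it only illustrates the direction of the reduction.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    {'area_total': [295.2, 424.0, 170.3], 'population': [6.9, 39.3, 20.6]},
    index=['AZ', 'CA', 'FL'],
)

col_means = df_demo.mean(axis=0)     # one mean per column
row_means = df_demo.mean(axis=1)     # one mean per row (illustration only)

sample_std = df_demo['population'].std()                          # pandas default: ddof=1
numpy_std = np.std(df_demo['population'].to_numpy(), ddof=1)      # same formula in NumPy

summary = df_demo.describe()         # count, mean, std, min, quartiles, max in one call

print(col_means)
print(row_means)
print(summary)
```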


## `corr`

In [230]:
df_corr = df.corr()                  
print ("The correlation between the columns is ...\n", df_corr)

The correlation between the columns is ...
             area_total  area_water  population
area_total    1.000000    0.077453    0.709171
area_water    0.077453    1.000000    0.024941
population    0.709171    0.024941    1.000000
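By default, `corr()` computes the Pearson correlation coefficient. As a sketch, the snippet below recomputes one entry of the matrix by hand from the definition (covariance of the two columns divided by the product of their spreads) and compares it with the pandas result. The frame `df_demo` repeats two columns of the data above.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'area_total': [295.2, 424.0, 170.3, 28.3, 250.5, 119.3, 695.7, 110.8, 184.7],
    'population': [6.9, 39.3, 20.6, 1.43, 9.93, 12.8, 27.9, 8.4, 7.3],
})

x = df_demo['area_total']
y = df_demo['population']

# Pearson r = sum((x - mean_x)(y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = df_demo.corr().loc['area_total', 'population']

print(r_manual, r_pandas)
```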


# `DataFrame` - Revisited in detail

## `read_csv()`

In [5]:
import pandas as pd
import numpy as np

# Read data from the CSV file named 'Titanic.csv'
# into a pandas DataFrame
titanic = pd.read_csv('Titanic.csv')
titanic

Unnamed: 0,Name,PClass,Age,Sex,Survived
0,"Allen, Miss Elisabeth Walton",1st,29.00,female,1
1,"Allison, Miss Helen Loraine",1st,2.00,female,0
2,"Allison, Mr Hudson Joshua Creighton",1st,30.00,male,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.00,female,0
4,"Allison, Master Hudson Trevor",1st,0.92,male,1
5,"Anderson, Mr Harry",1st,47.00,male,1
6,"Andrews, Miss Kornelia Theodosia",1st,63.00,female,1
7,"Andrews, Mr Thomas, jr",1st,39.00,male,0
8,"Appleton, Mrs Edward Dale (Charlotte Lamson)",1st,58.00,female,1
9,"Artagaveytia, Mr Ramon",1st,71.00,male,0


In [6]:
print (type(titanic))

<class 'pandas.core.frame.DataFrame'>


In [7]:
titanic.shape                      # (nrows, ncols)

(1312, 5)
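If `Titanic.csv` is not at hand, `read_csv` can be tried on in-memory text. The sketch below uses `io.StringIO` with three rows copied from the output above; the exact layout of the real file is an assumption. Names contain commas, so they are wrapped in double quotes, which `read_csv` handles by default.

```python
import io
import pandas as pd

# Hypothetical stand-in for the file contents (layout of the real file is assumed)
csv_text = '''Name,PClass,Age,Sex,Survived
"Allen, Miss Elisabeth Walton",1st,29.00,female,1
"Allison, Miss Helen Loraine",1st,2.00,female,0
"Allison, Mr Hudson Joshua Creighton",1st,30.00,male,0
'''

titanic_demo = pd.read_csv(io.StringIO(csv_text))

print(titanic_demo.shape)      # (nrows, ncols)
print(titanic_demo.dtypes)     # column types inferred by read_csv
```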

## `head` & `tail`
<ul>
    <li><code>head()</code> : return first <b>n</b> rows of a DataFrame</li>
    <li><code>tail()</code> : return last <b>n</b> rows of a DataFrame</li>
</ul>

In [7]:
titanic.head()                     # default (n = 5)

Unnamed: 0,Name,PClass,Age,Sex,Survived
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1
1,"Allison, Miss Helen Loraine",1st,2.0,female,0
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0
4,"Allison, Master Hudson Trevor",1st,0.92,male,1


In [9]:
titanic.tail(n=4)                    # last 4 rows (default is n = 5)

Unnamed: 0,Name,PClass,Age,Sex,Survived
1308,"Zakarian, Mr Maprieder",3rd,26.0,male,0
1309,"Zenni, Mr Philip",3rd,22.0,male,0
1310,"Lievens, Mr Rene",3rd,24.0,male,0
1311,"Zimmerman, Leo",3rd,29.0,male,0


## indexing & slicing - Detailed

<p>There are commonly three ways of data selection:
<ul>
    <li>Selection by Label</li>
    <li>Selection by Position</li>
    <li>Selection by Boolean Mask (Filtering)</li>
</ul>

<code>.loc[]</code> <i>attribute is the primary indexing method of DataFrames</i>

</p>

In [98]:
titanic_head_7 = titanic.head(n=7).copy()
titanic_head_7

Unnamed: 0,Name,PClass,Age,Sex,Survived
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1
1,"Allison, Miss Helen Loraine",1st,2.0,female,0
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0
4,"Allison, Master Hudson Trevor",1st,0.92,male,1
5,"Anderson, Mr Harry",1st,47.0,male,1
6,"Andrews, Miss Kornelia Theodosia",1st,63.0,female,1


### `loc` - Selection by Label

Using `.loc[]`, we can extract the rows of the dataframe using the index names/numbers.

In [53]:
titanic_head_7.loc[2]                   # single label

Name        Allison, Mr Hudson Joshua Creighton
PClass                                      1st
Age                                          30
Sex                                        male
Survived                                      0
Name: 2, dtype: object

In [25]:
titanic_head_7[2:4]                    # slicing - without using `.loc[]` method
                                       # note that the index (4) is NOT included

Unnamed: 0,Name,PClass,Age,Sex,Survived
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0


In [24]:
titanic_head_7.loc[2:4]                # slicing
                                       # note that the index (4) is ALSO included

Unnamed: 0,Name,PClass,Age,Sex,Survived
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0
4,"Allison, Master Hudson Trevor",1st,0.92,male,1


In [22]:
titanic_head_7.loc[[4,1,1,2]]           # list or array of label

Unnamed: 0,Name,PClass,Age,Sex,Survived
4,"Allison, Master Hudson Trevor",1st,0.92,male,1
1,"Allison, Miss Helen Loraine",1st,2.0,female,0
1,"Allison, Miss Helen Loraine",1st,2.0,female,0
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0


In [45]:
titanic_head_7[2:5][['Age','Survived']] # without using `.loc[]` (chained indexing)
                                        # has limitations when it comes to selecting arbitrary rows

Unnamed: 0,Age,Survived
2,30.0,0
3,25.0,0
4,0.92,1


In [44]:
titanic_head_7.loc[2:4,['Age','Survived']]           
                                         # Two labels are given as [index, column]

Unnamed: 0,Age,Survived
2,30.0,0
3,25.0,0
4,0.92,1


### `iloc` - Selection by Position

`.iloc[]` is the primary attribute for selection by absolute position along index and columns.
The 'i' in `iloc` stands for 'integer': rows and columns are addressed by integer position (the implicit, Python-style index), not by label.

In [54]:
titanic_head_7.iloc[2]                # it goes by integer location and not by index name/number

Name        Allison, Mr Hudson Joshua Creighton
PClass                                      1st
Age                                          30
Sex                                        male
Survived                                      0
Name: 2, dtype: object

In [62]:
titanic_head_7.iloc[2:5, 0:3]         # slicing
                                      # Two integer locations are given as [index_loc, column_loc]

Unnamed: 0,Name,PClass,Age
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0
4,"Allison, Master Hudson Trevor",1st,0.92


In [64]:
titanic_head_7.iloc[[3, 1, 5], [2, 4]] # lists or arrays of integer positions

Unnamed: 0,Age,Survived
3,25.0,0
1,2.0,0
5,47.0,1


### Selection by Boolean Mask

Boolean masking is a flexible way to specify conditions during data selection.


<p><u>Note</u> <i>that direct masking operations are interpreted row-wise rather than column-wise</i></p>

In [68]:
titanic_head_7['Age'] <= 30

0     True
1     True
2     True
3     True
4     True
5    False
6    False
Name: Age, dtype: bool

In [71]:
titanic_head_7['Sex'] == 'male'

0    False
1    False
2     True
3    False
4     True
5     True
6    False
Name: Sex, dtype: bool

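The result is `True` only where both conditions are satisfied, because `&` is the element-wise `AND` operator. Each condition needs its own parentheses, since `&` binds more tightly than comparisons such as `<=`.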

In [74]:
(titanic_head_7['Age'] <= 30) & (titanic_head_7['Sex'] == 'male')

0    False
1    False
2     True
3    False
4     True
5    False
6    False
dtype: bool

Passing the mask to `[]` extracts the rows of the DataFrame for which the above condition is `True`.

In [75]:
titanic_head_7[(titanic_head_7['Age'] <= 30) & (titanic_head_7['Sex'] == 'male')]

Unnamed: 0,Name,PClass,Age,Sex,Survived
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0
4,"Allison, Master Hudson Trevor",1st,0.92,male,1
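Masks can be combined in other ways as well. The sketch below rebuilds only the `Age` and `Sex` columns of the seven rows above (as `df_demo`, for illustration) and shows `|` (element-wise OR), `~` (element-wise NOT), and a mask used inside `.loc` to pick rows and a column in one step.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Age': [29.0, 2.0, 30.0, 25.0, 0.92, 47.0, 63.0],
    'Sex': ['female', 'female', 'male', 'female', 'male', 'male', 'female'],
})

# OR: rows where at least one condition holds
young_or_male = df_demo[(df_demo['Age'] <= 30) | (df_demo['Sex'] == 'male')]

# NOT: invert a mask
not_male = df_demo[~(df_demo['Sex'] == 'male')]

# Mask for the rows plus a column label, in a single .loc call
young_male_ages = df_demo.loc[(df_demo['Age'] <= 30) & (df_demo['Sex'] == 'male'), 'Age']

print(list(young_or_male.index))
print(list(not_male.index))
print(young_male_ages.tolist())
```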


### Add new row

In [99]:
titanic_head_7

Unnamed: 0,Name,PClass,Age,Sex,Survived
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1
1,"Allison, Miss Helen Loraine",1st,2.0,female,0
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0
4,"Allison, Master Hudson Trevor",1st,0.92,male,1
5,"Anderson, Mr Harry",1st,47.0,male,1
6,"Andrews, Miss Kornelia Theodosia",1st,63.0,female,1


In [103]:
titanic_head_7.shape

(7, 5)

Let's add a new row to the existing `titanic_head_7`. Remember to use `.loc[]` when **adding a new row** to DataFrame.

However, `.iloc[]` can still be used in setting/updating a row if it already exists.

In [104]:
new_item = ['Emma Stone', '1st', 28, 'female', 1]

# Add the new item into the DataFrame
titanic_head_7.loc[7,:] = new_item   # titanic_head_7.iloc[7,:] = new_item ------> will throw an error
titanic_head_7

Unnamed: 0,Name,PClass,Age,Sex,Survived
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1.0
1,"Allison, Miss Helen Loraine",1st,2.0,female,0.0
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0.0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0.0
4,"Allison, Master Hudson Trevor",1st,0.92,male,1.0
5,"Anderson, Mr Harry",1st,47.0,male,1.0
6,"Andrews, Miss Kornelia Theodosia",1st,63.0,female,1.0
7,Emma Stone,1st,28.0,female,1.0
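In the output above, `Survived` is displayed as `1.0`/`0.0` after the new row was added through `.loc`. One alternative, sketched here on a reduced hypothetical frame `df_demo`, is to build the new row as its own one-row DataFrame and append it with `pd.concat`; in this sketch the integer dtype of `Survived` is kept.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Allen, Miss Elisabeth Walton', 'Allison, Miss Helen Loraine'],
    'Survived': [1, 0],
})

# The new row as a one-row DataFrame with matching column names
new_row = pd.DataFrame([{'Name': 'Emma Stone', 'Survived': 1}])

# ignore_index=True renumbers the rows 0..n-1 in the result
extended = pd.concat([df_demo, new_row], ignore_index=True)

print(extended)
print(extended['Survived'].dtype)
```

`pd.concat` returns a new DataFrame and leaves `df_demo` unchanged.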


### Add new column

In [94]:
height = np.array([166] * len(titanic_head_7)) 
height

array([166, 166, 166, 166, 166, 166, 166, 166])

In [108]:
titanic_head_7.loc[:,'Height'] = height # titanic_head_7.iloc[:,5] = height ------> will throw an error
titanic_head_7

Unnamed: 0,Name,PClass,Age,Sex,Survived,Height
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1.0,166
1,"Allison, Miss Helen Loraine",1st,2.0,female,0.0,166
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0.0,166
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0.0,166
4,"Allison, Master Hudson Trevor",1st,0.92,male,1.0,166
5,"Anderson, Mr Harry",1st,47.0,male,1.0,166
6,"Andrews, Miss Kornelia Theodosia",1st,63.0,female,1.0,166
7,Emma Stone,1st,28.0,female,1.0,166


### Replace values in DataFrame

In [111]:
replace_age = np.array([22] * len(titanic_head_7))

# replace values in DataFrame
titanic_head_7.loc[:,'Age'] = replace_age # titanic_head_7.iloc[:,2] = replace_age ------> will also work
titanic_head_7

Unnamed: 0,Name,PClass,Age,Sex,Survived,Height
0,"Allen, Miss Elisabeth Walton",1st,22,female,1.0,166
1,"Allison, Miss Helen Loraine",1st,22,female,0.0,166
2,"Allison, Mr Hudson Joshua Creighton",1st,22,male,0.0,166
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,22,female,0.0,166
4,"Allison, Master Hudson Trevor",1st,22,male,1.0,166
5,"Anderson, Mr Harry",1st,22,male,1.0,166
6,"Andrews, Miss Kornelia Theodosia",1st,22,female,1.0,166
7,Emma Stone,1st,22,female,1.0,166
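Replacing an entire column is the simplest case. Combining `.loc` with a boolean mask replaces only the values that meet a condition. A sketch on a small hypothetical frame `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Age': [29.0, 2.0, 30.0, 47.0],
    'Sex': ['female', 'female', 'male', 'male'],
})

# Replace Age only in the rows where Age is greater than 29
df_demo.loc[df_demo['Age'] > 29, 'Age'] = 22.0

print(df_demo)
```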


# Missing Data

The difference between data found in many tutorials and data in the real world is that real-world data is rarely clean and homogeneous. In particular, many interesting datasets will have some amount of data missing.

Pandas chose to use two already-existing Python null values for missing data:
<ul>
    <li>the special floating-point <b>NaN</b> value</li>
    <li>the Python <b>None</b> object</li>
</ul>

`NaN` usually appears in numerical datasets and `None` appears in the rest of cases.

`NaN` and `None` both have their place, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate.

In [3]:
# Let's create a simple Pandas Series example. Data in Series share the same type
s_nan = pd.Series([1, np.nan, 2, None])
s_nan

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

Note that in the above example, `np.nan` is a float64 value, so the integers 1 and 2 are both upcast to float64.

`None` is a Python object, so on its own it would force dtype `object`; when it appears in a numerical array, pandas automatically converts it to `NaN`, and the result is float64 as well.
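Three common ways of handling such missing values are detecting them, dropping them, and filling them. The sketch below applies `isna()`, `dropna()` and `fillna()` to the same Series (named `s_nan_demo` here for illustration); the fill value `0` is an arbitrary choice.

```python
import numpy as np
import pandas as pd

s_nan_demo = pd.Series([1, np.nan, 2, None])

missing_mask = s_nan_demo.isna()      # True where the value is NaN (None was converted to NaN)
dropped = s_nan_demo.dropna()         # keep only the non-missing entries
filled = s_nan_demo.fillna(0)         # replace missing entries with 0

print(missing_mask.tolist())
print(dropped.tolist())
print(filled.tolist())
```

All three methods return new objects and leave `s_nan_demo` unchanged.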
