# Pandas
- named after "Panel-Data"
- has a fast and efficient DataFrame object for data manipulation and integrated indexing
- includes tools to read and write data in different formats: csv, txt, excel,...
- [More about Pandas](https://pandas.pydata.org/)

Son Huynh
31.01.2020

## Pandas Documentation:
- DataFrame attributes and methods: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
- Series attributes and methods: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

## Table of contents:
* [1. DataFrame](#1.-DataFrame)
    - [1.1 Reading File](#1.1-Reading-file-into-a-dataframe)
    - [1.2 Initial Inspection](#1.2-Initial-inspection)
    - [1.3 DataFrame: Attributes](#1.3-DataFrame:-Attributes)
    - [1.4 DataFrame: Methods](#1.4-DataFrame:-Methods)
* [2. Series](#2.-Series)
* [3. Index and Column](#3.-Index-and-Column)
* [4. Slicing](#4.-Slicing)
* [5. Basic Filtering](#5.-Basic-Filtering)

In [2]:
import pandas as pd

## <p style="color:blue;">1. DataFrame</p>

A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.
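
As a minimal sketch of this idea, the snippet below builds a small DataFrame by hand and looks up one entry by its row label and column name. The data and the variable name `passengers` are made up for illustration and are not taken from `titanic.csv`.

```python
import pandas as pd

# Hypothetical toy data, not taken from titanic.csv
passengers = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid"],
    "age": [22.0, 38.0, 26.0],
    "survived": [0, 1, 1],
})

print(passengers.shape)          # (3, 3): three rows, three columns
print(passengers.loc[1, "age"])  # the entry at row label 1, column "age"
```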

### 1.1 Reading file into a dataframe

If the file is in the same folder as the notebook, the file name alone is enough; we don't need to specify the full file path.

In [3]:
titanic = pd.read_csv("titanic.csv")

Export dataframe to csv file

In [4]:
titanic.to_csv("titanic_copy.csv", index=False) # index=False: don't write the index as an extra column
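
The effect of `index=False` can be seen in a small round trip. This is only a sketch on made-up data: it writes to a string instead of a file (calling `to_csv()` without a path returns the CSV text) and reads it back through `io.StringIO`.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the real dataset
df = pd.DataFrame({"pclass": [3, 1], "fare": [7.25, 71.2833]})

with_index = df.to_csv()                # the index is written as a first, unnamed column
without_index = df.to_csv(index=False)  # only the data columns are written

print(with_index.splitlines()[0])     # ,pclass,fare
print(without_index.splitlines()[0])  # pclass,fare

# Reading the first variant back turns the stored index into a data column
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))         # ['Unnamed: 0', 'pclass', 'fare']
```

If a file was saved with the index, `pd.read_csv(..., index_col=0)` restores that first column as the index instead.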

### 1.2 Initial inspection

In [5]:
titanic

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,deck
0,0,3,male,22.0,1,0,7.2500,S,
1,1,1,female,38.0,1,0,71.2833,C,3.0
2,1,3,female,26.0,0,0,7.9250,S,
3,1,1,female,35.0,1,0,53.1000,S,3.0
4,0,3,male,35.0,0,0,8.0500,S,
5,0,3,male,,0,0,8.4583,Q,
6,0,1,male,54.0,0,0,51.8625,S,5.0
7,0,3,male,2.0,3,1,21.0750,S,
8,1,3,female,27.0,0,2,11.1333,S,
9,1,2,female,14.0,1,0,30.0708,C,


#### Visualize the Dataframe structure:

https://www.w3resource.com/w3r_images/pandas-data-structure.svg

### 1.3 DataFrame: Attributes

In [6]:
titanic.shape

(891, 9)

In [7]:
titanic.index

RangeIndex(start=0, stop=891, step=1)

In [8]:
titanic.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'deck'],
      dtype='object')

In [9]:
titanic.sex # Each column or each row is a Series (one-dimensional array)

0        male
1      female
2      female
3      female
4        male
        ...  
886      male
887    female
888    female
889      male
890      male
Name: sex, Length: 891, dtype: object

In [10]:
titanic.sex.values # The values of a Series are stored in a numpy array

array(['male', 'female', 'female', 'female', 'male', 'male', 'male',
       'male', 'female', 'female', 'female', 'female', 'male', 'male',
       'female', 'female', 'male', 'male', 'female', 'female', 'male',
       'male', 'female', 'male', 'female', 'female', 'male', 'male',
       'female', 'male', 'male', 'female', 'female', 'male', 'male',
       'male', 'male', 'male', 'female', 'female', 'female', 'female',
       'male', 'female', 'female', 'male', 'male', 'female', 'male',
       'female', 'male', 'male', 'female', 'female', 'male', 'male',
       'female', 'male', 'female', 'male', 'male', 'female', 'male',
       'male', 'male', 'male', 'female', 'male', 'female', 'male', 'male',
       'female', 'male', 'male', 'male', 'male', 'male', 'male', 'male',
       'female', 'male', 'male', 'female', 'male', 'female', 'female',
       'male', 'male', 'female', 'male', 'male', 'male', 'male', 'male',
       'male', 'male', 'male', 'male', 'female', 'male', 'female', 'male',
      

#### Two styles to access a column:

- Attribute style: `titanic.sex`
- Dictionary style: `titanic['sex']`

The __dictionary style__ also works for column names that contain __spaces__ or that clash with an existing DataFrame attribute or method (for example `count`). However, it is recommended to remove spaces from column names and use the attribute style where possible.
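
A short sketch of where the two styles differ, using a made-up frame whose column names are chosen to cause trouble for the attribute style:

```python
import pandas as pd

# Hypothetical columns: one name with a space, one that shadows a method
df = pd.DataFrame({"ticket class": [3, 1], "count": [10, 20]})

print(df["ticket class"].tolist())  # [3, 1]: dictionary style handles the space
print(df["count"].tolist())         # [10, 20]: dictionary style returns the column
print(callable(df.count))           # True: attribute style returns the DataFrame.count method
```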

### 1.4 DataFrame: Methods

__Tip for working with jupyter notebook:__ 
- Press Tab to autocomplete a variable or attribute/method name.
- Inside the method, press Tab to see available parameters

In [11]:
titanic.head() # View first 5 rows

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,deck
0,0,3,male,22.0,1,0,7.25,S,
1,1,1,female,38.0,1,0,71.2833,C,3.0
2,1,3,female,26.0,0,0,7.925,S,
3,1,1,female,35.0,1,0,53.1,S,3.0
4,0,3,male,35.0,0,0,8.05,S,


In [13]:
titanic.tail(3) # View last 3 rows

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,deck
888,0,3,female,,1,2,23.45,S,
889,1,1,male,26.0,0,0,30.0,C,3.0
890,0,3,male,32.0,0,0,7.75,Q,


In [14]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
survived    891 non-null int64
pclass      891 non-null int64
sex         891 non-null object
age         714 non-null float64
sibsp       891 non-null int64
parch       891 non-null int64
fare        891 non-null float64
embarked    889 non-null object
deck        203 non-null float64
dtypes: float64(3), int64(4), object(2)
memory usage: 62.8+ KB


In [15]:
titanic.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,deck
count,891.0,891.0,714.0,891.0,891.0,891.0,203.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208,3.369458
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429,1.44416
min,0.0,1.0,0.42,0.0,0.0,0.0,1.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104,2.0
50%,0.0,3.0,28.0,0.0,0.0,14.4542,3.0
75%,1.0,3.0,38.0,1.0,0.0,31.0,4.0
max,1.0,3.0,80.0,8.0,6.0,512.3292,7.0


<p style="color:red;">Quiz</p>

In [16]:
titanic.describe(include=['O']) # Summarize columns of dtype object (here: strings)

Unnamed: 0,sex,embarked
count,891,889
unique,2,3
top,male,S
freq,577,644


In [17]:
titanic.sum() # Default axis is 0

#titanic.mean()

#titanic.count()

survived                                                  342
pclass                                                   2057
sex         malefemalefemalefemalemalemalemalemalefemalefe...
age                                                   21205.2
sibsp                                                     466
parch                                                     340
fare                                                  28693.9
deck                                                      684
dtype: object

#### Counting missing values

In [18]:
titanic.isnull().sum()

survived      0
pclass        0
sex           0
age         177
sibsp         0
parch         0
fare          0
embarked      2
deck        688
dtype: int64
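
Because `isnull()` returns a table of `True`/`False` values, the same pattern also gives the share of missing values when `sum()` is replaced by `mean()`. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing ages
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

missing_count = df.isnull().sum()   # True counts as 1, False as 0
missing_share = df.isnull().mean()  # fraction of missing entries per column

print(missing_count["age"])  # 2
print(missing_share["age"])  # 0.5
```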

#### Correlation table:
- The closer the value is to `1` (or `-1`), the more strongly positively (or negatively) correlated the pair is.
- A correlation close to `0` means there is little linear relationship between the two variables.
- As a rule of thumb, a correlation with magnitude `>0.5` is usually considered strong.

In [19]:
titanic.corr()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,deck
survived,1.0,-0.338481,-0.077221,-0.035322,0.081629,0.257307,0.041841
pclass,-0.338481,1.0,-0.369226,0.083081,0.018443,-0.5495,0.619288
age,-0.077221,-0.369226,1.0,-0.308247,-0.189119,0.096067,-0.192717
sibsp,-0.035322,0.083081,-0.308247,1.0,0.414838,0.159651,0.047626
parch,0.081629,0.018443,-0.189119,0.414838,1.0,0.216225,0.033374
fare,0.257307,-0.5495,0.096067,0.159651,0.216225,1.0,-0.297521
deck,0.041841,0.619288,-0.192717,0.047626,0.033374,-0.297521,1.0
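
To make the interpretation concrete, here is a sketch on made-up columns that are perfectly positively and perfectly negatively related. It selects the numeric columns first with `select_dtypes`, which is an assumption about how you may want to handle text columns: depending on the pandas version, calling `corr()` on a frame that still contains strings can raise an error.

```python
import pandas as pd

# Hypothetical data: double_x rises with x, neg_x falls with x
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "double_x": [2, 4, 6, 8],
    "neg_x": [-1, -2, -3, -4],
    "label": ["a", "b", "c", "d"],
})

# Keep only numeric columns before computing the correlation table
corr = df.select_dtypes("number").corr()

print(corr.loc["x", "double_x"])  # close to 1: perfect positive correlation
print(corr.loc["x", "neg_x"])     # close to -1: perfect negative correlation
```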


<p style="color:red;">Quiz</p>

#### Sorting Dataframe

Remark: 
- Almost all pandas functions and methods don't modify the original dataframe; they return a new object instead. If you want to keep the change, you have to assign the result back to a variable (for example the original dataframe name).
- Avoid using the parameter `inplace=True`. It is generally not more memory efficient, and deprecating it has been discussed for future pandas versions.
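
The first remark can be checked on a tiny made-up frame: the call returns a sorted object, while the original keeps its order until the result is assigned back.

```python
import pandas as pd

df = pd.DataFrame({"age": [35, 2, 54]})  # hypothetical values

sorted_df = df.sort_values("age")  # returns a new, sorted dataframe

print(df["age"].tolist())          # [35, 2, 54]: the original is unchanged
print(sorted_df["age"].tolist())   # [2, 35, 54]
print(sorted_df.index.tolist())    # [1, 0, 2]: rows keep their original index labels
```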

In [20]:
titanic = titanic.sort_values('age')

titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,deck
803,1,3,male,0.42,0,1,8.5167,C,
755,1,2,male,0.67,1,1,14.5,S,
644,1,3,female,0.75,2,1,19.2583,C,
469,1,3,female,0.75,2,1,19.2583,C,
78,1,2,male,0.83,0,2,29.0,S,


In [21]:
titanic = titanic.sort_values(['pclass', 'parch', 'age'], ascending=[True, False, False])

titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,deck
438,0,1,male,64.0,1,4,263.0,S,3.0
659,0,1,male,58.0,0,2,113.275,C,4.0
390,1,1,male,36.0,1,2,120.0,S,2.0
763,1,1,female,36.0,1,2,120.0,S,2.0
540,1,1,female,36.0,0,2,71.0,S,2.0


In [22]:
# Use sort_index if you want to sort the dataframe by index
titanic = titanic.sort_index(axis=0)

titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,deck
0,0,3,male,22.0,1,0,7.25,S,
1,1,1,female,38.0,1,0,71.2833,C,3.0
2,1,3,female,26.0,0,0,7.925,S,
3,1,1,female,35.0,1,0,53.1,S,3.0
4,0,3,male,35.0,0,0,8.05,S,


## <p style="color:blue;">2. Series</p>

A Series is a single column (or row) with index labels. Its structure is similar to a Python dictionary (labels map to values), but it is built on a numpy array and comes with several useful attributes and methods.
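
The dictionary analogy can be sketched as follows. The labels and numbers are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical average fares per class; the dict keys become the index labels
fares = pd.Series({"first": 84.15, "second": 20.66, "third": 13.68})

print(fares["second"])    # 20.66: lookup by label, like a dictionary
print("third" in fares)   # True: membership is tested against the index labels
print(fares.max())        # 84.15: array-style methods are available as well
print(isinstance(fares.to_numpy(), np.ndarray))  # True: the values live in a numpy array
```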

#### Converting between types

Use `astype` with the target type: `str`, `int`, `float`, `'category'`, ...

In [23]:
titanic.fare.astype(int)

0       7
1      71
2       7
3      53
4       8
       ..
886    13
887    30
888    23
889    30
890     7
Name: fare, Length: 891, dtype: int32
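
A few more conversions, sketched on made-up values. Note that converting floats to `int` drops the decimals rather than rounding, and that a column containing missing values cannot be converted to plain `int` this way.

```python
import pandas as pd

fare = pd.Series([7.25, 71.2833, 7.925])  # hypothetical fares without missing values

as_int = fare.astype(int)  # decimals are cut off, not rounded
as_str = fare.astype(str)  # every value becomes text

# 'category' stores each distinct value once, which suits columns with few unique values
sex = pd.Series(["male", "female", "female"]).astype("category")

print(as_int.tolist())                # [7, 71, 7]
print(as_str.tolist()[0])             # 7.25 (now a string)
print(sex.cat.categories.tolist())    # ['female', 'male']
```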

#### Count values:

Count how often each value occurs in the series. The result is a series with the distinct values as its index.

In [24]:
titanic.sex.value_counts()

male      577
female    314
Name: sex, dtype: int64

In [25]:
titanic.sex.value_counts(normalize=True) # Relative frequencies (fractions that sum to 1) instead of counts

male      0.647587
female    0.352413
Name: sex, dtype: float64
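
By default `value_counts` leaves missing values out. The sketch below, on a made-up series, shows the `dropna=False` option that includes them, next to the default and the normalized variant.

```python
import numpy as np
import pandas as pd

embarked = pd.Series(["S", "C", "S", np.nan])  # hypothetical values with one missing entry

counts = embarked.value_counts()                      # missing values are ignored
counts_with_nan = embarked.value_counts(dropna=False) # missing values get their own row
shares = embarked.value_counts(normalize=True)        # fractions of the non-missing values

print(counts["S"])            # 2
print(counts.sum())           # 3
print(counts_with_nan.sum())  # 4
```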

#### List unique values in a series

In [26]:
titanic.deck.unique()

array([nan,  3.,  5.,  7.,  4.,  1.,  2.,  6.])

In [27]:
titanic.deck.is_unique

False

### <p style="color:orange;">Practice</p>

In [28]:
# Read 'cars.csv' into a dataframe called cars

cars = pd.read_csv('cars.csv')

In [29]:
# Show last 5 rows

In [None]:
# Are there any missing values in the dataset?

In [None]:
# Does MPG_Highway have higher standard deviation than MPG_City?

In [None]:
# How many Audi cars are in the dataset?

In [None]:
# Which is the heaviest car?

In [None]:
# What are the unique car types?

In [None]:
# List 2 strongly positively correlated pairs and 1 strongly negatively correlated pair

## <p style="color:blue;">3. Index and Column</p>

Both the row index and the columns are Index objects. They are also built upon numpy arrays.
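
A quick way to see this, sketched on a tiny made-up frame: both attributes are instances of `pd.Index`, and an Index does not allow changing a single element in place, which is why renaming works by replacing the whole set of labels.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # hypothetical frame

print(isinstance(df.index, pd.Index))    # True (a RangeIndex is a kind of Index)
print(isinstance(df.columns, pd.Index))  # True

# Index objects are immutable: item assignment raises a TypeError
immutable = False
try:
    df.columns[0] = "z"
except TypeError:
    immutable = True
print(immutable)                         # True
```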

#### Set and reset dataframe index

In [30]:
titanic.head(3)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,deck
0,0,3,male,22.0,1,0,7.25,S,
1,1,1,female,38.0,1,0,71.2833,C,3.0
2,1,3,female,26.0,0,0,7.925,S,


In [31]:
titanic = titanic.set_index('survived')

In [32]:
titanic

survived,pclass,sex,age,sibsp,parch,fare,embarked,deck
0,3,male,22.0,1,0,7.2500,S,
1,1,female,38.0,1,0,71.2833,C,3.0
1,3,female,26.0,0,0,7.9250,S,
1,1,female,35.0,1,0,53.1000,S,3.0
0,3,male,35.0,0,0,8.0500,S,
0,3,male,,0,0,8.4583,Q,
0,1,male,54.0,0,0,51.8625,S,5.0
0,3,male,2.0,3,1,21.0750,S,
1,3,female,27.0,0,2,11.1333,S,
1,2,female,14.0,1,0,30.0708,C,


In [33]:
titanic = titanic.reset_index()

In [34]:
titanic

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,deck
0,0,3,male,22.0,1,0,7.2500,S,
1,1,1,female,38.0,1,0,71.2833,C,3.0
2,1,3,female,26.0,0,0,7.9250,S,
3,1,1,female,35.0,1,0,53.1000,S,3.0
4,0,3,male,35.0,0,0,8.0500,S,
5,0,3,male,,0,0,8.4583,Q,
6,0,1,male,54.0,0,0,51.8625,S,5.0
7,0,3,male,2.0,3,1,21.0750,S,
8,1,3,female,27.0,0,2,11.1333,S,
9,1,2,female,14.0,1,0,30.0708,C,


#### Rename columns

In [35]:
# By assigning new values directly
titanic.columns = ['Alive', 'Class', 'Sex', 'Age', 'SibingSpouse', 'ParentChild', 'Fare', 'Embarked', 'Deck']

In [36]:
titanic

Unnamed: 0,Alive,Class,Sex,Age,SibingSpouse,ParentChild,Fare,Embarked,Deck
0,0,3,male,22.0,1,0,7.2500,S,
1,1,1,female,38.0,1,0,71.2833,C,3.0
2,1,3,female,26.0,0,0,7.9250,S,
3,1,1,female,35.0,1,0,53.1000,S,3.0
4,0,3,male,35.0,0,0,8.0500,S,
5,0,3,male,,0,0,8.4583,Q,
6,0,1,male,54.0,0,0,51.8625,S,5.0
7,0,3,male,2.0,3,1,21.0750,S,
8,1,3,female,27.0,0,2,11.1333,S,
9,1,2,female,14.0,1,0,30.0708,C,


In [37]:
# By using rename to change only some columns
titanic = titanic.rename({'SibingSpouse': 'SibSp', 'ParentChild': 'ParChi'}, axis=1)

In [None]:
titanic.head(1)

#### Create and delete column

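When creating a new column, the attribute style cannot be used because the column doesn't exist yet: assigning to `titanic.Age_Months` would set a plain Python attribute on the object instead of adding a column. Use the dictionary style instead.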

In [None]:
titanic['Age_Months'] = titanic.Age * 12
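
The difference between the two styles when assigning can be sketched on a made-up frame. The warning filter is only there because pandas warns about the attribute pattern.

```python
import warnings
import pandas as pd

df = pd.DataFrame({"Age": [22.0, 38.0]})  # hypothetical ages

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # pandas emits a warning for the next line
    df.Age_Months = df.Age * 12       # sets a Python attribute, no column is added

created_by_attribute = "Age_Months" in df.columns
print(created_by_attribute)           # False

df["Age_Months"] = df["Age"] * 12     # dictionary style adds the column
print("Age_Months" in df.columns)     # True
```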

In [None]:
titanic

In [None]:
titanic = titanic.drop(columns=['Age_Months'])

In [None]:
titanic

## <p style="color:blue;">4. Slicing</p>

#### Slicing column(s)

In [None]:
# Select single column (equivalent to titanic.Alive)
#titanic['Alive']

# Select multiple columns
titanic[['Alive', 'Class', 'Sex']]

In [None]:
# Remember: store your result if you want access later
titanic2 = titanic[['Alive', 'Class', 'Sex']]

<p style="color:red;">Quiz</p>

#### Slicing using index position

In [None]:
titanic.iloc[0]

In [None]:
titanic.iloc[:5,-3:]

#### Slicing using index name

When using loc, the range is inclusive
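
The contrast with `iloc` is easiest to see when the index labels differ from the positions. A minimal sketch with made-up labels:

```python
import pandas as pd

# Hypothetical frame whose index labels start at 10
df = pd.DataFrame({"Class": [3, 1, 3, 1, 3]}, index=[10, 11, 12, 13, 14])

by_position = df.iloc[0:3]  # positions 0, 1, 2: the end position is excluded
by_label = df.loc[10:13]    # labels 10, 11, 12, 13: the end label is included

print(by_position.index.tolist())  # [10, 11, 12]
print(by_label.index.tolist())     # [10, 11, 12, 13]
```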

In [None]:
cols = ['Alive', 'Class', 'Sex'] # Example selection; any list of existing column names works
titanic.loc[0:3, cols]

In [None]:
titanic2 = titanic.set_index('Sex')

titanic2.head(3)

In [None]:
titanic2.loc['male','Class':'Fare']

## <p style="color:blue;">5. Basic Filtering</p>

In [None]:
titanic[titanic['Sex'] == 'female'].head()

In [None]:
# Lazy style
#titanic[titanic['Age'] < 1]['Alive']

# Recommendation: avoid chained indexing and use loc instead, especially when trying to set new values
titanic.loc[(titanic['Age'] < 1), 'Alive']
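
The sketch below extends the same pattern on made-up data: combining two conditions with `&` (each condition in its own parentheses) and setting values through `loc` with a boolean mask. The column name `Group` is invented for the example.

```python
import pandas as pd

# Hypothetical rows using the renamed columns from this notebook
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [0.42, 38.0, 0.75, 35.0],
    "Alive": [1, 1, 1, 0],
})

# Select one column for the rows that match a condition
infants_alive = df.loc[df["Age"] < 1, "Alive"]

# Combine conditions with & (and) or | (or)
female_infants = df[(df["Sex"] == "female") & (df["Age"] < 1)]

# Set values for matching rows in a single loc call; other rows stay missing
df.loc[df["Age"] < 1, "Group"] = "infant"

print(infants_alive.tolist())        # [1, 1]
print(len(female_infants))           # 1
print(df["Group"].isnull().sum())    # 2
```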

<p style="color:red;">Quiz</p>