# Data Loading

Import the required Python packages

In [1]:
import sys
import pandas as pd

If pandas is not installed yet, you can install it from within this notebook:

In [2]:
!pip install pandas



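Check the Python and pandas versions

In [3]:
# sys.version_info avoids slicing the version string, which would report "3.1" on Python 3.10+
print('Python Version: ', f'{sys.version_info.major}.{sys.version_info.minor}')
print('Pandas Version: ', pd.__version__)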

Python Version:  3.7
Pandas Version:  1.0.3


---

## Create DataFrame

Load data from a .csv file into a pandas DataFrame using pandas.read_csv

In [4]:
df_titanic = pd.read_csv('./datasets/titanic.csv')
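pandas.read_csv also accepts optional arguments that control what gets loaded. The sketch below is only an illustration: it reads a small in-memory sample (a hypothetical three-row excerpt, not the real file) and uses `usecols` and `dtype` to load a subset of columns with an explicit type.

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics a few columns of the Titanic file.
csv_text = """Survived,Pclass,Name,Sex,Age,Fare
0,3,Mr. Owen Harris Braund,male,22.0,7.25
1,1,Miss. Margaret Edith Graham,female,19.0,30.0
0,3,Mr. Patrick Dooley,male,32.0,7.75
"""

# read_csv accepts a path or a file-like object; StringIO stands in for a file here.
df_sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['Name', 'Sex', 'Age', 'Fare'],  # load only the columns that are needed
    dtype={'Fare': 'float64'},               # state the expected dtype explicitly
)

print(df_sample.shape)
print(list(df_sample.columns))
```

Loading only the needed columns can reduce memory use on larger files.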

Show the DataFrame:

In [5]:
df_titanic

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000


Check the first 10 rows using .head(10) (without an argument, .head() returns the first 5 rows)

In [6]:
df_titanic.head(10)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05
5,0,3,Mr. James Moran,male,27.0,0,0,8.4583
6,0,1,Mr. Timothy J McCarthy,male,54.0,0,0,51.8625
7,0,3,Master. Gosta Leonard Palsson,male,2.0,3,1,21.075
8,1,3,Mrs. Oscar W (Elisabeth Vilhelmina Berg) Johnson,female,27.0,0,2,11.1333
9,1,2,Mrs. Nicholas (Adele Achem) Nasser,female,14.0,1,0,30.0708


Check the last 10 rows using .tail(10) (without an argument, .tail() returns the last 5 rows)

In [7]:
df_titanic.tail(10)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
877,0,3,Mr. Johann Markun,male,33.0,0,0,7.8958
878,0,3,Miss. Gerda Ulrika Dahlberg,female,22.0,0,0,10.5167
879,0,2,Mr. Frederick James Banfield,male,28.0,0,0,10.5
880,0,3,Mr. Henry Jr Sutehall,male,25.0,0,0,7.05
881,0,3,Mrs. William (Margaret Norton) Rice,female,39.0,0,5,29.125
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.45
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0
886,0,3,Mr. Patrick Dooley,male,32.0,0,0,7.75


---

## Select Specific Columns

Create a variable that holds the list of column names you want to select, then use it to index the DataFrame

In [8]:
select_columns = ['Name', 'Sex', 'Age', 'Fare']

In [9]:
df_titanic[select_columns]

Unnamed: 0,Name,Sex,Age,Fare
0,Mr. Owen Harris Braund,male,22.0,7.2500
1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,71.2833
2,Miss. Laina Heikkinen,female,26.0,7.9250
3,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,53.1000
4,Mr. William Henry Allen,male,35.0,8.0500
...,...,...,...,...
882,Rev. Juozas Montvila,male,27.0,13.0000
883,Miss. Margaret Edith Graham,female,19.0,30.0000
884,Miss. Catherine Helen Johnston,female,7.0,23.4500
885,Mr. Karl Howell Behr,male,26.0,30.0000
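The selection above passes a list of column names, which returns a DataFrame. As a small sketch on a hypothetical two-row frame, the example below contrasts this with selecting a single label (which returns a Series) and with the equivalent `.loc` form.

```python
import pandas as pd

# Hypothetical two-row frame with the same column names as the selection above.
df = pd.DataFrame({
    'Name': ['Mr. Owen Harris Braund', 'Miss. Laina Heikkinen'],
    'Sex': ['male', 'female'],
    'Age': [22.0, 26.0],
    'Fare': [7.25, 7.925],
})

one_column = df['Age']                    # single label -> Series
sub_frame = df[['Name', 'Age']]           # list of labels -> DataFrame
same_frame = df.loc[:, ['Name', 'Age']]   # explicit form: all rows, listed columns

print(type(one_column).__name__)
print(type(sub_frame).__name__)
```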


---

## Describe Data

Check the data type of each column using .dtypes

In [10]:
df_titanic.dtypes

Survived                     int64
Pclass                       int64
Name                        object
Sex                         object
Age                        float64
Siblings/Spouses Aboard      int64
Parents/Children Aboard      int64
Fare                       float64
dtype: object

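Convert the Age column from float to integer. Note that .astype('int64') truncates the fractional part (an age below 1 becomes 0) and raises an error if the column contains NaN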

In [11]:
df_titanic['Age'] = df_titanic['Age'].astype('int64')
df_titanic.dtypes

Survived                     int64
Pclass                       int64
Name                        object
Sex                         object
Age                          int64
Siblings/Spouses Aboard      int64
Parents/Children Aboard      int64
Fare                       float64
dtype: object
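To make the behavior of the conversion concrete, here is a minimal sketch using hypothetical age values (not taken from the dataset). It compares plain truncation with rounding first, and shows the nullable `Int64` dtype as one option when a column contains missing values.

```python
import pandas as pd

# Hypothetical ages, including fractional values.
ages = pd.Series([22.0, 0.92, 38.7], name='Age')

truncated = ages.astype('int64')          # drops the fractional part
rounded = ages.round().astype('int64')    # rounds to the nearest integer first

# A float column with a missing value cannot be cast to plain int64.
ages_with_gap = pd.Series([22.0, None, 38.0], name='Age')
try:
    ages_with_gap.astype('int64')
    cast_failed = False
except ValueError:
    cast_failed = True

# The nullable integer dtype (capital "I") keeps the missing value as <NA>.
nullable = ages_with_gap.astype('Int64')

print(truncated.tolist())
print(rounded.tolist())
print(nullable.isna().tolist())
```

Whether truncation or rounding is appropriate depends on how the ages will be used later.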

Check the total number of rows and columns in the DataFrame (the notebook shows them at the bottom-left of the displayed DataFrame, or you can use .shape, which returns the dimensions as a (rows, columns) tuple)

In [12]:
df_titanic

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26,0,0,30.0000


In [13]:
print('Total rows: ', df_titanic.shape[0])
print('Total columns: ', df_titanic.shape[1])

Total rows:  887
Total columns:  8


You can use .describe() to generate descriptive statistics (count, mean, std, min, quartiles, and max) for the numeric columns

In [14]:
df_titanic.describe()

Unnamed: 0,Survived,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
count,887.0,887.0,887.0,887.0,887.0,887.0
mean,0.385569,2.305524,29.455468,0.525366,0.383315,32.30542
std,0.487004,0.836662,14.129919,1.104669,0.807466,49.78204
min,0.0,1.0,0.0,0.0,0.0,0.0
25%,0.0,2.0,20.0,0.0,0.0,7.925
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.1375
max,1.0,3.0,80.0,8.0,6.0,512.3292
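The output above covers only the numeric columns. As a rough sketch on a hypothetical five-row frame, calling .describe() on a text column returns a different summary (count, unique, top, freq), and .value_counts() gives the full frequency table.

```python
import pandas as pd

# Hypothetical five-row frame with one text column and one numeric column.
df = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'male'],
    'Age': [22, 38, 26, 35, 27],
})

numeric_stats = df.describe()       # numeric columns only by default
text_stats = df['Sex'].describe()   # count, unique, top, freq for a text column
sex_counts = df['Sex'].value_counts()

print(numeric_stats.loc['mean', 'Age'])
print(text_stats['top'], text_stats['freq'])
print(sex_counts.to_dict())
```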


Let's standardize the column names by converting them to uppercase and abbreviating the two longest ones

In [15]:
new_columns = { 'Survived': 'SURVIVED',
                'Pclass': 'PCLASS',
                'Name': 'NAME',
                'Sex': 'SEX',
                'Age': 'AGE',
                'Siblings/Spouses Aboard': 'SSA',
                'Parents/Children Aboard': 'PCA',
                'Fare': 'FARE'
}

In [16]:
df_titanic = df_titanic.rename(columns=new_columns)
df_titanic

Unnamed: 0,SURVIVED,PCLASS,NAME,SEX,AGE,SSA,PCA,FARE
0,0,3,Mr. Owen Harris Braund,male,22,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26,0,0,30.0000
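Writing the mapping by hand works well for eight columns. As an alternative sketch (on a hypothetical one-row frame), `rename` also accepts a function, so the abbreviations can be given explicitly and the uppercase step applied to every column with `str.upper`.

```python
import pandas as pd

# Hypothetical one-row frame that reuses some of the original column names.
df = pd.DataFrame({
    'Survived': [0],
    'Pclass': [3],
    'Siblings/Spouses Aboard': [1],
    'Parents/Children Aboard': [0],
})

# Only the abbreviations need an explicit mapping.
short_names = {
    'Siblings/Spouses Aboard': 'SSA',
    'Parents/Children Aboard': 'PCA',
}

# First apply the abbreviations, then uppercase every column name.
df_renamed = df.rename(columns=short_names).rename(columns=str.upper)

print(list(df_renamed.columns))
```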


---

## Check Value

Check whether the DataFrame contains any NULL values using .isnull() (it returns True for every cell that is NULL/NaN)

In [17]:
df_titanic.isnull()

Unnamed: 0,SURVIVED,PCLASS,NAME,SEX,AGE,SSA,PCA,FARE
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...
882,False,False,False,False,False,False,False,False
883,False,False,False,False,False,False,False,False
884,False,False,False,False,False,False,False,False
885,False,False,False,False,False,False,False,False


For a per-column summary, chain .any() (does the column contain a missing value?) or .sum() (how many values are missing?)

In [18]:
df_titanic.isnull().any()

SURVIVED    False
PCLASS      False
NAME        False
SEX         False
AGE         False
SSA         False
PCA         False
FARE        False
dtype: bool

In [19]:
df_titanic.isnull().sum()

SURVIVED    0
PCLASS      0
NAME        0
SEX         0
AGE         0
SSA         0
PCA         0
FARE        0
dtype: int64
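The Titanic data has no missing values, so every check above returns False or 0. To show what the same calls report when values are missing, here is a small sketch on a hypothetical frame with two gaps.

```python
import pandas as pd

# Hypothetical frame with one missing AGE and one missing FARE.
df = pd.DataFrame({
    'NAME': ['A', 'B', 'C'],
    'AGE': [22.0, None, 35.0],
    'FARE': [7.25, 71.2833, None],
})

missing_per_column = df.isnull().sum()            # count of missing values per column
total_missing = int(missing_per_column.sum())     # count across the whole frame
rows_with_missing = df[df.isnull().any(axis=1)]   # rows with at least one missing value

print(missing_per_column.to_dict())
print(total_missing)
print(rows_with_missing.index.tolist())
```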

---

## Save Data

Let's manipulate the previous DataFrame. In this case I want to keep only the NAME, SEX, AGE, and FARE columns and drop the others

In [20]:
df_titanic = df_titanic[['NAME', 'SEX', 'AGE', 'FARE']]
df_titanic

Unnamed: 0,NAME,SEX,AGE,FARE
0,Mr. Owen Harris Braund,male,22,7.2500
1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38,71.2833
2,Miss. Laina Heikkinen,female,26,7.9250
3,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35,53.1000
4,Mr. William Henry Allen,male,35,8.0500
...,...,...,...,...
882,Rev. Juozas Montvila,male,27,13.0000
883,Miss. Margaret Edith Graham,female,19,30.0000
884,Miss. Catherine Helen Johnston,female,7,23.4500
885,Mr. Karl Howell Behr,male,26,30.0000


And save the DataFrame to another .csv file using DataFrame.to_csv

In [21]:
df_titanic.to_csv('./datasets/titanic_edit.csv')

Let's try to load the new data that we generated previously

In [22]:
df_titanic_edit = pd.read_csv('./datasets/titanic_edit.csv')

In [23]:
df_titanic_edit

Unnamed: 0.1,Unnamed: 0,NAME,SEX,AGE,FARE
0,0,Mr. Owen Harris Braund,male,22,7.2500
1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38,71.2833
2,2,Miss. Laina Heikkinen,female,26,7.9250
3,3,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35,53.1000
4,4,Mr. William Henry Allen,male,35,8.0500
...,...,...,...,...,...
882,882,Rev. Juozas Montvila,male,27,13.0000
883,883,Miss. Margaret Edith Graham,female,19,30.0000
884,884,Miss. Catherine Helen Johnston,female,7,23.4500
885,885,Mr. Karl Howell Behr,male,26,30.0000
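The extra Unnamed: 0 column in the reloaded DataFrame appears because to_csv writes the index as the first column by default, and read_csv then reads it back as ordinary data. The sketch below, using a hypothetical two-row frame and an in-memory string instead of a file, shows two common ways to avoid this: write with `index=False`, or read with `index_col=0`.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the edited Titanic data.
df = pd.DataFrame({
    'NAME': ['Mr. Owen Harris Braund', 'Mr. Karl Howell Behr'],
    'AGE': [22, 26],
})

# Default behavior: the index is written as an unnamed first column.
with_index = df.to_csv()  # returns the CSV text when no path is given
reloaded_extra = pd.read_csv(io.StringIO(with_index))

# Option 1: do not write the index at all.
without_index = df.to_csv(index=False)
reloaded_clean = pd.read_csv(io.StringIO(without_index))

# Option 2: keep the index in the file and restore it on load.
reloaded_indexed = pd.read_csv(io.StringIO(with_index), index_col=0)

print(list(reloaded_extra.columns))
print(list(reloaded_clean.columns))
print(list(reloaded_indexed.columns))
```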


Let's save the DataFrame to another file format, in this case Parquet.

Save the DataFrame to a Parquet file using DataFrame.to_parquet (this requires pyarrow or fastparquet to be installed; otherwise the call fails with an ImportError)

In [24]:
df_titanic.to_parquet('./datasets/titanic_edit.parquet')

Let's load the Parquet file using pandas.read_parquet

In [25]:
df_new = pd.read_parquet('./datasets/titanic_edit.parquet')
df_new

Unnamed: 0,NAME,SEX,AGE,FARE
0,Mr. Owen Harris Braund,male,22,7.2500
1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38,71.2833
2,Miss. Laina Heikkinen,female,26,7.9250
3,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35,53.1000
4,Mr. William Henry Allen,male,35,8.0500
...,...,...,...,...
882,Rev. Juozas Montvila,male,27,13.0000
883,Miss. Margaret Edith Graham,female,19,30.0000
884,Miss. Catherine Helen Johnston,female,7,23.4500
885,Mr. Karl Howell Behr,male,26,30.0000
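Because the Parquet step depends on an optional engine, a defensive version can be useful in a notebook that others will run. The following is a minimal sketch, assuming a hypothetical two-row frame and a temporary directory rather than the ./datasets folder. It catches the ImportError that pandas raises when neither pyarrow nor fastparquet is available.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame standing in for the edited Titanic data.
df = pd.DataFrame({
    'NAME': ['Mr. Owen Harris Braund', 'Mr. Karl Howell Behr'],
    'AGE': [22, 26],
})

restored = None
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.parquet')
    try:
        df.to_parquet(path, index=False)   # needs pyarrow or fastparquet
        restored = pd.read_parquet(path)
    except ImportError as err:
        # Raised by pandas when no Parquet engine can be imported.
        print('No Parquet engine available:', err)

if restored is not None:
    # Parquet stores column types, so AGE should come back as an integer column.
    print(restored.dtypes)
```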


You can explore the other data formats that pandas supports in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

Keep Exploring!