# Intro

This notebook is intended to walk through basic data manipulation operations with Pandas and scikit-learn. After completing this exercise, you will know:

1. How to read data from a CSV file
2. How to do basic aggregations
3. How to print basic summary statistics
4. How to plot basic information about the data


# Technical imports

In [3]:
import pandas as pd               # basic data manipulation tool
import numpy as np                # numerical operations tool
import matplotlib.pyplot as plt   # plotting and visualization tool

In [2]:
%matplotlib inline

# Reading data from a CSV file

First, we need to load the data from a CSV file into a variable.

In [4]:
data = pd.read_csv("./titanic_train_data.csv")
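
As a minimal sketch of what `pd.read_csv` does, the example below parses a small CSV held in memory instead of a file on disk. The three-row `sample_csv` text and the `sample` variable are made up for illustration, and `io.StringIO` only stands in for the file path.

```python
import io

import pandas as pd

# Illustrative CSV text (hypothetical); the notebook reads a file on disk instead.
sample_csv = """PassengerId,Survived,Sex,Age
1,0,male,22.0
2,1,female,38.0
6,0,male,
"""

# read_csv accepts a path or any file-like object.
sample = pd.read_csv(io.StringIO(sample_csv))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # column types inferred from the text
```

The empty `Age` field in the last row is read as a missing value (`NaN`), which is why a column like this ends up with a floating-point type.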

In [6]:
# A quick glimpse at the dataset

data.head(10)

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |


# General data information

The following operations display summary information about the dataset.

In [14]:
data.info() # column data types and non-null counts

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
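
The non-null counts reported by `info()` can also be read the other way round, as missing values per column. The sketch below does this on a hypothetical four-row `sample` frame (not the real dataset), using `isna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset with a few missing entries.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Fare": [7.25, 71.2833, 8.4583, 53.1],
})

missing_counts = sample.isna().sum()   # number of missing entries per column
missing_share = sample.isna().mean()   # fraction of rows missing per column

print(missing_counts)
print(missing_share)
```

Applied to `data`, the same two lines should agree with the `info()` output above, for example 891 - 714 = 177 missing `Age` values.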


In [12]:
data.describe() # More detailed information about numerical columns

|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [16]:
data.Sex.value_counts() # Count how often each value of Sex occurs

male      577
female    314
Name: Sex, dtype: int64
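
Counts are often easier to compare as proportions. A minimal sketch, assuming a made-up four-element `sex` Series rather than the real column, uses the `normalize=True` option of `value_counts`:

```python
import pandas as pd

# Hypothetical stand-in for data.Sex.
sex = pd.Series(["male", "female", "male", "male"], name="Sex")

counts = sex.value_counts()                 # absolute counts per value
shares = sex.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```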

We can print multiple statistics for numerical columns

In [22]:
print("Age mean:", data.Age.mean())
print("Age standard deviation:", data.Age.std())
print("Age quantiles:\n\n", data.Age.quantile(q=[0.2, 0.5, 0.75]))

Age mean: 29.69911764705882
Age standard deviation: 14.526497332334044
Age quantiles:

 0.20    19.0
0.50    28.0
0.75    38.0
Name: Age, dtype: float64
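
The same kind of statistics can be collected into one object instead of being printed one by one. The sketch below assumes a small made-up `ages` Series (with one missing value) and uses `Series.agg` with a list of function names; pandas skips missing values in these reductions by default.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; the NaN is ignored by mean, std, median and quantile.
ages = pd.Series([22.0, 38.0, np.nan, 26.0, 35.0, 54.0], name="Age")

summary = ages.agg(["mean", "std", "median"])   # Series indexed by statistic name
quartiles = ages.quantile([0.25, 0.5, 0.75])    # Series indexed by quantile level

print(summary)
print(quartiles)
```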


# Basic data aggregations and grouping

Grouping operations usually take the form:


```{python}

data.groupby(COLUMN(s)).OPERATION
```
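
To make the template concrete, here is a minimal sketch on a hypothetical four-row `sample` frame. It groups by one column, applies one operation, and then checks one group by hand with a boolean filter to show what the grouping does.

```python
import pandas as pd

# Hypothetical miniature of the dataset.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# COLUMN = "Sex", OPERATION = mean() of the Age column
mean_age = sample.groupby("Sex")["Age"].mean()

# The same number for a single group, computed by filtering rows manually.
female_mean = sample.loc[sample["Sex"] == "female", "Age"].mean()

print(mean_age)
print(female_mean)
```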

In [31]:
# Group by sex and count number of people

data[["Sex", "Age"]].groupby("Sex").size()

Sex
female    314
male      577
dtype: int64

In [24]:
# Group by sex and calculate mean age

data[["Sex", "Age"]].groupby("Sex").mean()

| Sex | Age |
|---|---|
| female | 27.915709 |
| male | 30.726645 |
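
Grouping also works with a list of columns. The following sketch assumes a made-up six-row `sample` frame and groups by both `Sex` and `Pclass`; the result is indexed by each combination of the two.

```python
import pandas as pd

# Hypothetical miniature of the dataset.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Pclass": [3, 1, 3, 1, 3, 1],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
})

# One mean age per (Sex, Pclass) combination.
by_sex_class = sample.groupby(["Sex", "Pclass"])["Age"].mean()

# unstack() moves Pclass into the columns, giving a Sex x Pclass table.
as_table = by_sex_class.unstack()

print(by_sex_class)
print(as_table)
```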


You can perform multiple aggregation operations at the same time, using the following syntax:

```{python}

data.groupby(COLUMNS).agg({
        'col1': [functions],
        'col2': [functions],
        ....
        'coln': [functions]
})

```

In [33]:
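# Group by sex and calculate the mean and standard deviation of age,
# plus the number of records in each group.
# Function names are passed as strings; recent pandas versions warn
# when NumPy functions such as np.mean are passed to agg.

data.groupby("Sex").agg({
    'Age': ["mean", "std"],
    'Sex': ["size"]
})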

| Sex | Age (mean) | Age (std) | Sex (size) |
|---|---|---|---|
| female | 27.915709 | 14.110146 | 314 |
| male | 30.726645 | 14.678201 | 577 |
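
The dictionary form produces two-level column labels. If flat column names are preferred, pandas also supports named aggregation, sketched here on a hypothetical four-row `sample` frame; the output names `age_mean`, `age_std` and `n_passengers` are chosen for illustration only.

```python
import pandas as pd

# Hypothetical miniature of the dataset.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# Each keyword is an output column: name=(input column, function name).
summary = sample.groupby("Sex").agg(
    age_mean=("Age", "mean"),
    age_std=("Age", "std"),
    n_passengers=("Age", "size"),   # size counts rows, including missing ages
)

print(summary)
```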


In [34]:
#TODO: more things like this...