![](images/EscUpmPolit_p.gif "UPM")

# Course Notes for Data Science Summer School

Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2018 Carlos A. Iglesias

## [Introduction to Data Visualization](0_Intro_Visualization.ipynb)

# Table of Contents
* [Introduction](#Introduction)
* [Visualisation with Pandas](#Visualisation-with-Pandas)
* [Loading and Cleaning](#Loading-and-Cleaning)
* [General exploration](#General-exploration)
* [Feature Age](#Feature-Age)
* [Feature Sex](#Feature-Sex)
* [Feature Pclass](#Feature-Pclass)
* [Feature Fare](#Feature-Fare)
* [Feature Embarked](#Feature-Embarked)
* [Features SibSp](#Features-SibSp)
* [Feature ParCh](#Feature-ParCh)

# Introduction

We are going to show some examples of visualization with two libraries: [pandas](https://pandas.pydata.org/) and [seaborn](https://seaborn.pydata.org/). Both of them are based on [matplotlib](https://matplotlib.org/).

The best way to learn these libraries is to play with them, and consult their documentation as well as online forums (stackoverflow, ...) when you want to learn something in particular.

# Visualisation with Pandas

Pandas provides a very good integration with matplotlib. DataFrames have the following methods:
* **plot()**, for a number of charts, that can be selected with the argument *kind*:
  * 'bar' for bar plots
  * 'hist' for histograms
  * 'box' for boxplots
  * 'kde' for density plots
  * 'area' for area plots
  * 'scatter' for scatter plots
  * 'hexbin' for hexagonal bin plots
  * 'pie' for pie charts
  
Every plot kind has an equivalent method on the DataFrame.plot accessor. This means you can use **df.plot(kind='line')** or **df.plot.line()**. Check the [plot documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html#pandas.DataFrame.plot) to learn about the remaining parameters.
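As a minimal sketch of this equivalence, the two call styles can be compared on a small made-up DataFrame (the name `toy` and its values are only illustrative, they are not part of the Titanic data):

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the two call styles
toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 3, 5]})

# Both calls draw the same line chart and return a matplotlib Axes
ax_kind = toy.plot(kind="line", x="x", y="y")
ax_accessor = toy.plot.line(x="x", y="y")

# The same applies to the other kinds, for example scatter
ax_scatter = toy.plot.scatter(x="x", y="y")
```

Which style to use is mostly a matter of taste; the accessor form has the advantage that tab completion lists the available kinds.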

In addition, the module *pandas.plotting* (formerly *pandas.tools.plotting*) provides **scatter_matrix**.

You can consult more details in the [documentation](http://pandas.pydata.org/pandas-docs/stable/visualization.html).

# Loading and Cleaning

In [None]:
# General import and load data
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
sns.set(color_codes=True)

# if matplotlib is not set inline, you will not see plots

#alternatives auto gtk gtk2 inline osx qt qt5 wx tk
#%matplotlib auto
#%matplotlib qt
%matplotlib inline

In [None]:
#We get a URL with raw content (not HTML one)
url="https://raw.githubusercontent.com/gsi-upm/dsss-2018/master/data-titanic/train.csv"
df = pd.read_csv(url)
df_original = df.copy() # Copy to have a version of df without modifications
df.head()

In [None]:
# Cleaning
# Fill missing ages with the median
df['Age'] = df['Age'].fillna(df['Age'].median())

#Commented for simplifying visualization (should be done before ML)
#df.loc[df["Sex"] == "male", "Sex"] = 0
#df.loc[df["Sex"] == "female", "Sex"] = 1
#df.loc[df["Embarked"] == "S", "Embarked"] = 0
#df.loc[df["Embarked"] == "C", "Embarked"] = 1
#df.loc[df["Embarked"] == "Q", "Embarked"] = 2

# Drop columns
df.drop(['Cabin', 'Ticket', 'Name'], axis=1, inplace=True)


# Fill missing values with the median or the most frequent value (mode); removing the rows is another option
# Assigning the result back avoids chained in-place modification, which recent pandas versions warn about
df['Fare'] = df['Fare'].fillna(df['Fare'].median())
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df['Sex'] = df['Sex'].fillna(df['Sex'].mode()[0])
df

#  General exploration

Let's examine the dataset

In [None]:
# General description of the dataset
df.describe()

In [None]:
# Column types
df.dtypes

In [None]:
# Non-numeric columns
df.dtypes[df.dtypes == object]

In [None]:
# Number of null values
df.isnull().sum()

In [None]:
# Analyse the distribution of every numeric column
df.hist(figsize=(10,10))
plt.show()

We can see Age and Fare are on very different scales, so it is a good idea to scale them before applying ML algorithms.
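As a rough sketch of what scaling means here, the following block applies two common options with plain pandas. The values in `toy` are invented for illustration, and which scaling is preferable depends on the algorithm used later:

```python
import pandas as pd

# Hypothetical values on very different scales (not the real Titanic rows)
toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0],
                    "Fare": [7.25, 71.28, 7.93, 53.10]})

# Min-max scaling: every column ends up in the range [0, 1]
min_max = (toy - toy.min()) / (toy.max() - toy.min())

# Standardization: every column ends up with mean 0 and unit standard deviation
standardized = (toy - toy.mean()) / toy.std()

print(min_max.round(2))
print(standardized.round(2))
```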

**What is the correlation of the variables?**

In [None]:
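# We can see the pairwise correlation between variables. A value near 0 means low correlation,
# while a value near -1 or 1 indicates strong correlation.
# Only numeric columns are used, since Sex and Embarked are still strings.
df.select_dtypes(include='number').corr()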

We do not find any relevant correlation between features.

## Visualization with Seaborn

We will start by understanding the dataset.

**What was the distribution by sex?**

Seaborn provides the [catplot](https://seaborn.pydata.org/generated/seaborn.catplot.html) function, which helps us generate many kinds of graphs by changing the parameter *kind*, as well as shortcuts for every kind of graph (barplot, violinplot, etc.).

In [None]:
sns.catplot(x="Sex", data=df, kind='count')

We can also see the distribution by Age and Sex. **How many passengers by age and sex?**

In [None]:
fg = sns.FacetGrid(df, hue="Sex", aspect=3)
# fill=True replaces the deprecated shade=True (seaborn >= 0.11)
fg.map(sns.kdeplot, "Age", fill=True)
fg.set(xlim=(0, 80));

Now let's analyze by Pclass. **How many passengers by Sex and Pclass?**

In [None]:
#sns.catplot(x="Pclass", data=df, hue='Sex', kind='count') # the same with the general function
sns.countplot(x="Pclass", data=df, hue='Sex')

Now let's see *fare*. **How much did the passengers pay for the tickets in each class?**

In [None]:
#sns.catplot(x="Pclass", y="Fare", data=df, kind='bar')
sns.catplot(x="Pclass", y="Fare", data=df, kind='strip')

In [None]:
sns.catplot(x="Pclass", y="Fare", data=df, kind='violin', aspect=1.5)

In [None]:
sns.catplot(x="Pclass", y="Fare", data=df, kind='box', aspect=1.5)

We see some outliers in the first class distribution.

Let's take a closer look at the correlation between features with a heatmap.

In [None]:
sns.heatmap(df.select_dtypes(include='number').corr(), vmax=.8, linewidths=0.01,
            square=True,annot=True,cmap='YlGnBu',linecolor="white")

In [None]:
# Enlarge the figure and add a title
plt.figure(figsize=(10, 10))
plt.title('Correlation between features');
sns.heatmap(df.select_dtypes(include='number').corr(), vmax=.8, linewidths=0.01,
            square=True,annot=True,cmap='YlGnBu',linecolor="white")

We see that Pclass has the strongest (negative) correlation with Survived, followed by Fare, Parch and Age.
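Reading such a ranking off a heatmap can be error prone, so it may help to sort the correlations with the target directly. The sketch below does this on a small invented DataFrame (`toy` and its values are assumptions for illustration; on the notebook data you would use `df` instead):

```python
import pandas as pd

# Hypothetical rows, only to show the technique
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 1, 3, 2, 3],
    "Fare":     [7.3, 71.3, 53.1, 8.1, 13.0, 8.5],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
})

# Keep numeric columns only, then take the correlations with the target
corr_with_target = (toy.select_dtypes(include="number")
                       .corr()["Survived"]
                       .drop("Survived"))

# Sort by absolute value so that strong negative correlations rank as high as positive ones
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```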

We could also represent this with a scatterplot.

In [None]:
# General description of the relationship between variables using Seaborn PairGrid
# We use the cleaned df, since null values would give us an error (you can check it).
g = sns.PairGrid(df, hue="Survived")
g = g.map(plt.scatter)

There are too many variables, so we are going to represent only a subset.

In [None]:
# PairGrid of variables
g = sns.PairGrid(df, hue="Survived", vars=['Pclass', 'Sex', 'Age'])
g = g.map(plt.scatter)

We can observe, for example, that more women survived, and that survival differs across passenger classes.

We can represent these findings.

In [None]:
sns.barplot(x="Pclass", y='Survived', hue='Sex', data=df)

We can see that women had a higher survival rate in all the passenger classes.

In [None]:
# sns.catplot(x="Age", y="Embarked", hue="Sex", data=df, kind="violin")
sns.violinplot(x="Age", y="Embarked", hue="Sex", data=df)

Now we are going to put into practice our knowledge about munging and visualisation. We will analyse every feature of the dataset.

# Feature Age

The original dataset has 177 missing values for Age. We are going to explore this feature in more detail.

In [None]:
# Histogram of Age
# For Series, you can use hist(), plot.hist() or plot(kind='hist')
df['Age'].hist()

We see the histogram is slightly *right skewed*, so we will replace null values with the median instead of the mean.

When a distribution is *significantly skewed*, the extreme values in the long tail can have a disproportionately large influence on our model. So, it can be good to transform the variable before building our model to reduce skewness. Taking the natural logarithm or the square root of each point are two simple transformations.
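A minimal sketch of both transformations follows, using an invented right-skewed series (the values are not real Titanic fares) and `np.log1p`, which computes log(1 + x) and is a common variant of the natural logarithm that tolerates zeros:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values with a long tail
values = pd.Series([7.0, 8.0, 8.5, 10.0, 13.0, 26.0, 30.0, 80.0, 260.0, 500.0])

sqrt_values = np.sqrt(values)
log_values = np.log1p(values)   # log(1 + x), safe when a value is 0

# Series.skew() returns the sample skewness; values closer to 0 mean a more symmetric distribution
print("skew original:", round(values.skew(), 2))
print("skew sqrt:    ", round(sqrt_values.skew(), 2))
print("skew log1p:   ", round(log_values.skew(), 2))
```

On data like this, the logarithm reduces skewness more than the square root, at the cost of a less direct interpretation of the transformed values.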

In [None]:
# We look at the distribution with more bins
df['Age'].hist(bins=30, range=(0, df['Age'].max()))

Now we analyse the relationship of Age and Survived.

In [None]:
# Now we visualise age and survived to see if there is some relationship
# The size parameter of FacetGrid was renamed to height in seaborn 0.9
sns.FacetGrid(df, hue="Survived", height=5).map(sns.kdeplot, "Age").add_legend()

We do not observe significant differences.

In [None]:
# We plot the histogram per age
g = sns.FacetGrid(df, col='Survived')
g.map(plt.hist, "Age", color="steelblue")

We observe that the distribution of non-survivors is left skewed. Most children survived.

In [None]:
#Alternative to Seaborn with matplotlib integrated in pandas
df.hist(column='Age', by='Survived', sharey=True)

In [None]:
# We can observe the detail for children
df[df.Age < 20].hist(column='Age', by='Survived', sharey=True)

In [None]:
# Survival rate of passengers younger than 20
df[df.Age < 20]['Survived'].mean()

There were null values; we will recap at the end of this notebook how to manage them.

We are going now to see the distribution of passengers younger than 20 that survived.

In [None]:
df.query('Age < 20 and Survived == 1').groupby(['Sex','Pclass']).size().unstack(['Pclass']).plot(kind='bar')

In [None]:
# Passengers younger than 20 that survived, grouped by Sex and Pclass

df.query('Age < 20 and Survived == 1').groupby(['Sex','Pclass']).size().plot(kind='bar')

We are going to improve it a bit.

In [None]:
# We move 'Sex' from the row index to the columns with unstack, so that Pclass stays in the rows (x axis)
df.query('Age < 20 and Survived == 1').groupby(['Sex','Pclass']).size().unstack(['Sex']).plot(kind='bar')
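To see what `unstack` does to the grouped counts before plotting them, here is a small sketch on invented data (`toy` is a hypothetical DataFrame, not the Titanic one):

```python
import pandas as pd

# Hypothetical passengers
toy = pd.DataFrame({
    "Sex":    ["female", "female", "male", "male", "female", "male"],
    "Pclass": [1, 3, 1, 3, 3, 3],
})

# groupby(...).size() returns a Series indexed by (Sex, Pclass)
counts = toy.groupby(["Sex", "Pclass"]).size()

# unstack("Sex") moves Sex from the row index to the columns:
# rows are now Pclass values, columns are female/male
by_class = counts.unstack("Sex")
print(by_class)

# unstack("Pclass") does the opposite: rows are Sex, columns are Pclass
by_sex = counts.unstack("Pclass")
print(by_sex)
```

In a bar plot, the rows become the groups on the x axis and each column becomes one bar series, which is why the choice of level to unstack changes the chart.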

In [None]:
# Now we make the plot show both values combined (stacked bars)
df.query('Age < 20 and Survived == 1').groupby(['Sex','Pclass']).size().unstack(['Sex']).plot(kind='bar',
                                                                                              stacked=True)

In [None]:
# Small touches

pclass_labels = ['First', 'Second', 'Third']
sex_labels = ['Female', 'Male']

# Keep the returned Axes in ax; naming it plt would shadow matplotlib.pyplot
ax = df.query('Age < 20 and Survived == 1').groupby(['Sex','Pclass']).size().unstack(['Sex']).plot(kind='bar',
                                                            stacked=True, rot=0, subplots=False, figsize=(5,10))
ax.set_xticklabels(pclass_labels)
ax.legend(labels=sex_labels)
ax.set_xlabel('Passenger class')
ax.set_title('Passenger class per sex')

In [None]:
# The same chart in horizontal, this time for survivors older than 25
pclass_labels = ['First', 'Second', 'Third']
sex_labels = ['Female', 'Male']

# Keep the returned Axes in ax; naming it plt would shadow matplotlib.pyplot
ax = df.query('Age > 25 and Survived == 1').groupby(['Sex','Pclass']).size().unstack(['Sex']).plot(kind='barh',
                                                            stacked=True, rot=0, subplots=False)
ax.set_yticklabels(pclass_labels)
ax.legend(labels=sex_labels)

ax.set_ylabel('Passenger class')
ax.set_title('Passenger class per sex')

# Feature Sex

We are now going to explore the Sex attribute.

In [None]:
# How many passengers by sex
df.groupby('Sex').size()

We see men are more numerous than women.

In [None]:
# Plot with seaborn
sns.countplot(x='Sex', data=df)

In [None]:
# Same graph with matplotlib and pandas
colors_sex = ['#ff69b4', 'b']
df.groupby('Sex').size().plot(kind='bar', rot=0, color=colors_sex)

In [None]:
# How many passengers survived by sex
df.groupby('Sex')['Survived'].sum()

In [None]:
# Survival rate by sex
df.groupby('Sex')['Survived'].mean()

We see that 74% of women survived, while only 19% of men survived.
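The two cells above rely on Survived being a 0/1 column: its sum is the number of survivors and its mean is the survival rate. A small sketch on invented data (`toy` is hypothetical) makes the relationship explicit:

```python
import pandas as pd

# Hypothetical passengers: 1 means survived, 0 means did not survive
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

grouped = toy.groupby("Sex")["Survived"]
survivors = grouped.sum()    # number of survivors per group
total = grouped.size()       # number of passengers per group
rate = grouped.mean()        # fraction of survivors per group (survivors / total)

summary = pd.DataFrame({"survivors": survivors, "total": total, "rate": rate})
print(summary)
```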

In [None]:
# Graphical representation
# You can add the parameter estimator to change the estimator (e.g. estimator=np.median)
# For example, with estimator=np.size you get the same chart as with countplot
#sns.barplot(x='Sex', y='Survived', data=df, estimator=np.size)
sns.barplot(x='Sex', y='Survived', data=df)

We can see now if men and women follow the same age distribution.

In [None]:
df.hist(column='Age', by='Sex')

It seems they follow a similar distribution. We can separate per passenger class.

In [None]:
df.hist(column='Age', by='Pclass')

We see there are more young passengers in third class.

# Feature Pclass

We have already seen how passengers are distributed across Pclass.

In [None]:
df.groupby('Pclass').size()

In [None]:
# Distribution
sns.countplot('Pclass', data=df)

Most passengers are in 3rd class.

In [None]:
# Survivors per class
sns.barplot(x='Pclass', y='Survived', data=df)

As expected, passenger class is very significant, since first class has the highest survival rate.

We can also see the distribution of classes per sex.

In [None]:
# factorplot was renamed to catplot in seaborn 0.9
sns.catplot(x='Pclass', data=df, hue='Sex', kind='count')

In [None]:
df.groupby(['Pclass', 'Sex']).Survived.mean()

We see that most women in first and second class survived, 97% and 92% respectively.

# Feature Fare

We are going to analyse the feature *Fare* and will take the opportunity to introduce how to manage outliers.

As we see in the PairGrid chart, Fare is directly related to the Passenger class.

In [None]:
df['Fare'].hist()

In [None]:
df.hist(['Fare','Pclass'])

We see the distribution is right skewed. We are going to detect outliers using a box plot.

In [None]:
sns.boxplot(data=df['Fare'])

In [None]:
# We can see the same with matplotlib.
# If seaborn has been imported, you may need to add sym='k.' to show the outliers
df.boxplot(column='Fare', return_type='axes', sym='k.')
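The points drawn beyond the whiskers of a box plot follow, by default in matplotlib, the 1.5 × IQR rule. As a sketch, the same rule can be applied numerically; the `fares` values below are invented for illustration:

```python
import pandas as pd

# Hypothetical fares with one extreme value
fares = pd.Series([7.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 512.0])

# Interquartile range: distance between the first and third quartiles
q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5 * iqr, q3 + 1.5 * iqr] are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = fares[(fares < lower) | (fares > upper)]
print(outliers)
```

This flags candidates only; whether a flagged value is an error or a legitimate extreme still needs a look at the rows themselves.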

Since Fare depends on Pclass, we are going to show outliers per passenger class.

In [None]:
df.boxplot(column='Fare', by = 'Pclass', return_type='axes', sym='k.')

We see that most outliers are in class 1. In particular, we see some values higher than 500 that are probably errors.

In [None]:
df[df.Fare > 400]

We can replace this value by the median(), the mean(), or the second highest value.
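The three options can be sketched on an invented series as follows. The `threshold` of 400 mirrors the filter used above, and computing the median and mean on the non-outlier values only is an assumption made for this example:

```python
import pandas as pd

# Hypothetical fares; the last one plays the role of the suspicious value
fares = pd.Series([30.0, 80.0, 151.6, 263.0, 512.3])
threshold = 400
is_outlier = fares > threshold

# Option 1: replace with the median of the remaining values
with_median = fares.mask(is_outlier, fares[~is_outlier].median())

# Option 2: replace with the mean of the remaining values
with_mean = fares.mask(is_outlier, fares[~is_outlier].mean())

# Option 3: cap at the highest value that is not an outlier
with_cap = fares.clip(upper=fares[~is_outlier].max())

print(pd.DataFrame({"median": with_median, "mean": with_mean, "cap": with_cap}))
```

Capping keeps the passenger among the most expensive fares, while the median and the mean move it to the middle of the distribution.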

In [None]:
# Show the highest values
df.sort_values('Fare', ascending=False).head(8)

In [None]:
# Replace
df.loc[df.Fare > 400, 'Fare'] = 263.0

# Check we have removed outliers
df.sort_values('Fare', ascending=False).head(8)

In [None]:
df.boxplot(column='Fare', by='Pclass', return_type='axes', sym='k.')

# Feature Embarked

We can analyze the distribution based on the port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton). 

**Where did the passengers come from?** 

In [None]:
df.groupby('Embarked').size()

In [None]:
# Distribution
sns.countplot(x='Embarked', data=df)

Since there are missing values, we will replace them with the most frequent value ('S'), and we will also encode the variable, since it is categorical.

We can check whether the port of embarkation has an impact on survival.

In [None]:
df.groupby(['Embarked']).Survived.mean()

In [None]:
sns.barplot(x='Embarked', y='Survived', data=df)

It seems passengers who embarked at Cherbourg (C) have a higher chance of survival.
We can analyse this by sex.

In [None]:
sns.barplot(x="Embarked", y='Survived', hue='Sex', data=df)

There is also an improvement by gender for passengers embarking in Cherbourg.

We have to fill the null values (2 null values) and encode this variable, since it is categorical. We will do it after reviewing the rest of the features.
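A possible way to do both steps is sketched below on an invented column. The integer mapping follows the one commented out in the cleaning cell (S = 0, C = 1, Q = 2); the one-hot variant with `pd.get_dummies` is an alternative added here for comparison:

```python
import pandas as pd

# Hypothetical column with one missing port
toy = pd.DataFrame({"Embarked": ["S", "C", None, "Q", "S"]})

# Fill missing values with the most frequent port
most_frequent = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_frequent)

# Option 1: integer codes with an explicit mapping
port_codes = {"S": 0, "C": 1, "Q": 2}
toy["Embarked_code"] = toy["Embarked"].map(port_codes)

# Option 2: one-hot encoding, one indicator column per port
one_hot = pd.get_dummies(toy["Embarked"], prefix="Embarked")

print(toy)
print(one_hot)
```

Integer codes are compact but suggest an order between ports that does not exist, so one-hot encoding is often the safer choice for models that interpret numbers as magnitudes.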

# Features SibSp

We analyse the distribution.

In [None]:
df.groupby('SibSp').size()

In [None]:
# Distribution
sns.countplot(x='SibSp', data=df)

We can see that most passengers traveled without siblings or spouses. 

We analyse whether this had an impact on survival.

In [None]:
df.groupby('SibSp').Survived.mean()

In [None]:
df.hist(column='SibSp', by='Survived', sharey=True)

We see that it does not provide much information. While the mean survival rate of all passengers is 38%, passengers with 0 SibSp have a survival rate of about 35%. Surprisingly, passengers with 1 sibling or spouse have a higher rate, about 54%. We are going to see the distribution by gender.

In [None]:
df.groupby(['SibSp', 'Sex']).size()

We see that for SibSp, there is almost the same number of men and women. Now we calculate the survival probability.

In [None]:
df.groupby(['SibSp', 'Sex']).Survived.mean()

In [None]:
sns.barplot(x="SibSp", y='Survived', hue='Sex', data=df)

We observe that when SibSp > 2, the survival probability drops by half. We are going to check if there is a difference in age.

In [None]:
df.groupby(['SibSp', 'Sex']).Age.mean()

In [None]:
sns.barplot(x="SibSp", y='Age', hue='Sex', data=df)

Indeed, when SibSp > 3, age is lower. We are going to check the relationship with Pclass.

In [None]:
df.groupby(['SibSp', 'Pclass']).size()

In [None]:
df.groupby(['SibSp', 'Pclass']).Survived.mean()

In [None]:
sns.barplot(x="Sex", y='SibSp', hue='Pclass', data=df)

We see that in 3rd class, females had higher SibSp.

In [None]:
sns.barplot(x="SibSp", y='Survived', hue='Pclass', data=df)

It seems that SibSp is relevant for determining the survival rate.

# Feature ParCh

The feature Parch (Parents-Children Aboard) is somewhat related to the previous one, since it reflects family ties. It is well known that in emergencies, family groups often all die or evacuate together, so it is expected that it will also have an impact on our model.

In [None]:
df.groupby('Parch').size()

In [None]:
# Distribution
sns.countplot(x='Parch', data=df)

We see that most passengers had no parents or children aboard.

We analyze now the relationship with Survived.

In [None]:
df.groupby('Parch').Survived.mean()

In [None]:
# Survival probability per Parch value
df.groupby('Parch').Survived.mean().plot()

We see the probability of surviving is higher for Parch values of 2 and 3. Since there are too few rows with Parch >= 3, that part is not reliable.

In [None]:
df.hist(column='Parch', by='Survived', sharey=True)

In [None]:
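# Number of passengers, mean SibSp and survival rate per class, sex and Parch
# Selecting several columns needs a list (double brackets) in current pandas versions
df.groupby(['Pclass', 'Sex', 'Parch'])[['Parch', 'SibSp', 'Survived']].agg({'Parch': 'count', 'SibSp': 'mean', 'Survived': 'mean'})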

We observe that Parch has an important impact for men in first and second class. We are going to check the age.

In [None]:
df.query('(Sex == "male") and (Pclass == [1, 2]) and (Parch == [1, 2])')[['Survived', 'Age']].mean()

We see that in those cases, the mean age is 27. We can compare with the rest of the men in first and second class.

In [None]:
df.query('(Sex == "male") and (Pclass == [1, 2])')[['Survived', 'Age']].mean()

We observe that there is a significant difference, so we suspect that this feature has an impact on men in first and second class.
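The queries above compare a column with a list, as in `Pclass == [1, 2]`. In `DataFrame.query`, this is treated like a membership test. The sketch below, on an invented DataFrame named `toy`, compares it with the equivalent boolean mask built with `isin`:

```python
import pandas as pd

# Hypothetical passengers
toy = pd.DataFrame({
    "Sex":    ["male", "male", "female", "male", "male"],
    "Pclass": [1, 2, 1, 3, 2],
    "Parch":  [1, 0, 2, 1, 2],
})

# Inside query, "column == [a, b]" selects rows whose value is in the list
with_query = toy.query('(Sex == "male") and (Pclass == [1, 2]) and (Parch == [1, 2])')

# The same selection written with boolean masks
mask = (toy["Sex"] == "male") & toy["Pclass"].isin([1, 2]) & toy["Parch"].isin([1, 2])
with_mask = toy[mask]

print(with_query)
```

The mask version is more verbose, but it works outside `query` too, where comparing a column with a list of a different length raises an error.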

# References

* [Basic Feature Engineering with the Titanic Data](https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/)

## Licence

The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).

© 2018 Carlos A. Iglesias, Universidad Politécnica de Madrid.