#### This tutorial introduces data visualization with Python: how to create basic visualizations, as well as some interactive ones.

#### <mark>Yellow highlights indicate a small exercise or task for you to try out.</mark>
#### <mark> Remember to hit Shift+Enter in all the code cells. </mark>

<div class="alert alert-block alert-info">A cell like this indicates a question you need to answer for this Challenge on the U4I platform.</div>

Before we work with data, we need to import some Python libraries.\
`Pandas` is a commonly used library for analyzing data.\
`Matplotlib` is a plotting library for visualizing data.\
`NumPy` is a library for scientific computing.\
`Seaborn` is a library for statistical data visualization.

In [None]:
# import Pandas library and call it 'pd'
import pandas as pd

# import matplotlib.pyplot and call it 'plt'
import matplotlib.pyplot as plt

#import numpy and call it 'np'
import numpy as np

#import seaborn and call it 'sns'
import seaborn as sns 

%matplotlib inline
plt.style.use("seaborn-v0_8-dark")  # this style was named "seaborn-dark" before Matplotlib 3.6

As you will see below, we will "call upon" these libraries in commands by referring to these names.

### Import Data Set

We're going to start with some data that show the number of searches for machine learning related terms on a search engine from 2004 to 2020. 

One simple way of visualizing data is to put it into a table.

In [None]:
# import data & view first 10 rows
temporal_data = pd.read_csv('temporal.csv')
temporal_data.head(10) 

Notice the "categorical" heading. In this data set, all the values are arbitrarily grouped into category 1 or 0 to demonstrate how to visualize a large data set with continuous and categorical variables.

In [None]:
# Change the 'Mes' column to datetime format so years are grouped together on the x axis
temporal_data['Mes'] = pd.to_datetime(temporal_data['Mes'])
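
To see what this conversion does, here is a minimal sketch on a made-up three-row table. It assumes that `Mes` holds year-month text such as `'2004-01'`; check the output of `temporal_data.head(10)` to confirm the actual format. The `sample` table and its values are hypothetical, and the `format` argument is optional.

```python
import pandas as pd

# Hypothetical sample; the real 'Mes' column may use a different text format
sample = pd.DataFrame({'Mes': ['2004-01', '2004-02', '2005-01'],
                       'data science': [12, 12, 9]})

print(sample['Mes'].dtype)   # text values before the conversion
sample['Mes'] = pd.to_datetime(sample['Mes'], format='%Y-%m')
print(sample['Mes'].dtype)   # datetime values after the conversion

# Datetime columns expose date parts through the .dt accessor
print(sample['Mes'].dt.year.tolist())
```

Once the column holds real dates, plotting functions can space the points correctly along the x axis instead of treating each label as unrelated text.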

### Visualize Data

#### Line Graphs

Now let's create a simple visualization with our data set: the number of searches from 2004-2020 for "data science".

In [None]:
plt.plot(temporal_data['Mes'], temporal_data['data science'])
plt.xlabel('Year')
plt.ylabel('Number of Searches')
plt.title('Searches for "data science"')
plt.show()

Now let's see the searches for all 3 terms in one graph.

In [None]:
plt.plot(temporal_data['Mes'], temporal_data['data science'], label='data science')
plt.plot(temporal_data['Mes'], temporal_data['machine learning'], label='machine learning')
plt.plot(temporal_data['Mes'], temporal_data['deep learning'], label='deep learning')
plt.xlabel('Date')
plt.ylabel('Number of Searches')
plt.title('Searches for AI Terms by Date')
plt.legend()
plt.show()
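
The three `plt.plot` calls above differ only in the column name, so the same chart can be drawn with a loop. The sketch below uses a hypothetical `demo` table filled with random numbers in place of `temporal_data`, so the lines it draws carry no meaning; only the column names are taken from this notebook.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for temporal_data, filled with random values
rng = np.random.default_rng(0)
dates = pd.date_range('2004-01-01', periods=24, freq='MS')
terms = ['data science', 'machine learning', 'deep learning']
demo = pd.DataFrame({term: rng.integers(0, 100, size=len(dates)) for term in terms})
demo['Mes'] = dates

# Draw one line per search term on the same axes
fig, ax = plt.subplots()
for term in terms:
    ax.plot(demo['Mes'], demo[term], label=term)
ax.set_xlabel('Date')
ax.set_ylabel('Number of Searches')
ax.set_title('Searches for AI Terms by Date')
ax.legend()
plt.show()
```

Adding a fourth term then only means adding one more name to the `terms` list.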

#### Scatter plots

In [None]:
plt.scatter(temporal_data['data science'], temporal_data['machine learning'])
plt.xlabel('data science')
plt.ylabel('machine learning')
plt.title('Relationship Between "machine learning" and "data science" Searches')
plt.show()
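
A scatter plot shows a relationship visually; a correlation coefficient summarizes it as a single number. The sketch below computes the Pearson correlation with `Series.corr` on made-up values. The `demo` table is hypothetical; with the real data you would call `.corr()` on the `temporal_data` columns instead.

```python
import pandas as pd

# Hypothetical values standing in for two columns of temporal_data
demo = pd.DataFrame({'data science': [10, 20, 30, 40, 50],
                     'machine learning': [12, 18, 33, 41, 48]})

# Pearson correlation: a value close to 1 means the points lie near a rising straight line
r = demo['data science'].corr(demo['machine learning'])
print(round(r, 3))
```

A value near 0 would suggest no linear relationship, and a value near -1 a falling one.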

<mark>Create two more scatter plots: one for Deep learning and Data science, and another for Deep learning and Machine learning.</mark>

In [None]:
# your code here

In [None]:
# your code here

#### Bar charts

In [None]:
plt.bar(temporal_data['Mes'], temporal_data['machine learning'], width=10)
plt.xlabel('Date')
plt.ylabel('Number of searches')
plt.title('Searches for "machine learning"')
plt.show()
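
With one bar per month, a chart that spans many years can get crowded. One option, shown here as an assumption rather than as part of the original tutorial, is to average the values by year before plotting. The `demo` table and its numbers are hypothetical.

```python
import pandas as pd

# Hypothetical monthly values standing in for temporal_data
demo = pd.DataFrame({
    'Mes': pd.to_datetime(['2004-01-01', '2004-02-01', '2005-01-01', '2005-02-01']),
    'machine learning': [10, 20, 30, 50],
})

# Group the rows by calendar year and take the mean of each group
yearly = demo.groupby(demo['Mes'].dt.year)['machine learning'].mean()
print(yearly)
```

The result has one value per year, so `plt.bar(yearly.index, yearly.values)` would draw one bar per year.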

<mark>Create two more bar charts: one for Deep learning, and another for Data science.</mark>

In [None]:
# your code here

In [None]:
# your code here

#### Pairplot

In [None]:
pp = sns.pairplot(temporal_data)

In [None]:
pp = sns.pairplot(temporal_data, hue='categorical')

#### Jointplot

In [None]:
sns.jointplot(x='data science', y='machine learning', data=temporal_data)
plt.show()

#### Heatmap

In [None]:
plt.figure(figsize=(9, 8))
sns.heatmap(temporal_data[['data science', 'machine learning', 'deep learning']])

## The Titanic Data Set

#### The Titanic data set is known for the Titanic ML competition on Kaggle. In this Jupyter notebook, we simply want to explore the data set and gain some insights into what the data means with data visualizations. The original data set comes in two parts (a training set and a test set), but for the purposes of visualization, we will only work with the training set.

In [None]:
# Import training data set and call it "train_data"
train_data = pd.read_csv("train.csv")

These are the variables in our data set:\
PassengerId: Unique ID of the passenger\
Survived: Survived (1) or died (0)\
Pclass: Passenger's class (1st, 2nd, or 3rd)\
Name: Passenger's name\
Sex: Passenger's sex\
Age: Passenger's age\
SibSp: Number of siblings/spouses aboard the Titanic\
Parch: Number of parents/children aboard the Titanic\
Ticket: Ticket number\
Fare: Fare paid for ticket\
Cabin: Cabin number\
Embarked: Where the passenger got on the ship (C = Cherbourg, S = Southampton, Q = Queenstown)

In [None]:
# Show first 5 rows of data set
train_data.head()

In [None]:
# Show descriptive statistics
train_data.describe()

In [None]:
# Show missing values in data set
column_names = train_data.columns
for column in column_names:
    print(column + ' - ' + str(train_data[column].isnull().sum()))
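
The loop above can also be written as a single pandas expression. This is a minimal sketch on a hypothetical three-column table; on the real data the same call would be `train_data.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical mini table with some gaps
demo = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, np.nan],
    'Cabin': [None, 'C85', None, None],
    'Fare':  [7.25, 71.28, 7.92, 53.10],
})

# isnull() marks each missing cell as True; sum() counts the True values per column
missing = demo.isnull().sum()
print(missing)
```

The result is a Series indexed by column name, so it can be sorted or filtered like any other column of numbers.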

In [None]:
# Remove 'Ticket' and 'PassengerId' as they don't help explain 'Survived'
# Remove 'Cabin' due to many missing values
train_data = train_data.drop(columns=['Ticket','PassengerId','Cabin'])

In [None]:
# Check first 5 rows of data to make sure variables are removed
train_data.head()

#### Heatmaps

In [None]:
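# corr_matrix was not defined earlier; assuming it is the correlation matrix of the numeric columns
corr_matrix = train_data.corr(numeric_only=True)  # numeric_only needs pandas 1.5 or later
plt.figure(figsize=(9, 8))
sns.heatmap(data=corr_matrix, cmap='bwr', annot=True, linewidths=0.2)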
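
The heatmap above expects a correlation matrix: a square table that holds the correlation between every pair of numeric columns. The sketch below builds one with `DataFrame.corr` on a hypothetical six-row table; the values are made up and only the column names come from the Titanic data. It assumes pandas 1.5 or later for the `numeric_only` argument.

```python
import pandas as pd

# Hypothetical mini version of the Titanic table
demo = pd.DataFrame({
    'Survived': [1, 0, 1, 0, 1, 0],
    'Pclass':   [1, 3, 1, 3, 2, 3],
    'Fare':     [80.0, 7.5, 60.0, 8.0, 20.0, 7.0],
    'Sex':      ['female', 'male', 'female', 'male', 'female', 'male'],
})

# numeric_only=True skips text columns such as 'Sex'
corr_demo = demo.corr(numeric_only=True)
print(corr_demo.round(2))
```

Each column correlates perfectly with itself, so the diagonal is always 1, and the table is symmetric around that diagonal.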

In [None]:
# Show passengers by sex (recent seaborn versions require x to be passed by keyword)
sex_graph = sns.catplot(x='Sex', data=train_data, kind='count')

In [None]:
# Now let's separate the sexes by class using the 'hue' parameter
class_graph = sns.catplot(x='Pclass', data=train_data, hue='Sex', kind='count')

It seems there were more males in 3rd class than in 1st or 2nd. But how many children were there?

In [None]:
# Create a new column 'Person' in which every passenger under 16 is labeled 'child'

train_data['Person'] = train_data.Sex
train_data.loc[train_data['Age'] < 16, 'Person'] = 'child'

In [None]:
children_graph = sns.catplot(x='Pclass', data=train_data, hue='Person', kind='count')

Now we can see that most of the children onboard the Titanic were in 3rd class.
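
The counts behind a chart like this can also be read as a table. The sketch below repeats the 'Person' labeling on a hypothetical six-row table and then counts the combinations with `pd.crosstab`; the rows are made up, so the numbers say nothing about the real passengers.

```python
import pandas as pd

# Hypothetical passengers; the last one has no recorded age
demo = pd.DataFrame({
    'Pclass': [1, 2, 3, 3, 3, 1],
    'Sex':    ['male', 'female', 'male', 'female', 'male', 'female'],
    'Age':    [40.0, 8.0, 4.0, 30.0, 10.0, None],
})

# Start from the sex label, then overwrite it for everyone under 16
demo['Person'] = demo['Sex']
demo.loc[demo['Age'] < 16, 'Person'] = 'child'

# Count passengers for each combination of class and person type
counts = pd.crosstab(demo['Pclass'], demo['Person'])
print(counts)
```

A passenger with a missing age keeps the original sex label, because a comparison with a missing value is never true.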

Another way to understand the ages of the passengers onboard the Titanic is to use a histogram.

In [None]:
age_hist = train_data.Age.hist(bins=80)
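
A histogram splits the range of values into equal-width bins and counts how many values fall into each one. The sketch below does that counting directly with `np.histogram` on a hypothetical list of ages, using 3 bins instead of 80 so the result is easy to read. The missing age is dropped explicitly before counting.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value
ages = pd.Series([2, 4, 18, 22, 25, 31, 38, 45, 60, None], name='Age')

# Drop the missing age, then count how many values fall into each of 3 equal-width bins
counts, edges = np.histogram(ages.dropna(), bins=3)
print(edges)    # the 4 boundaries of the 3 bins
print(counts)   # the number of ages in each bin
```

More bins give a more detailed picture, but with too many bins each bar holds only a few passengers and the overall shape becomes harder to see.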