#### This tutorial introduces how to create basic visualizations of a large data set using Python. 
#### Note: You do not need to understand the Python code or be able to write code to complete this tutorial and pass the Challenge.
#### Remember to hit Shift+Enter in all the code cells.

<div class="alert alert-block alert-info">A cell like this indicates a question you need to answer for this Challenge on the U4I platform. Please answer the question <b>before</b> continuing through the notebook.</div>

# Introduction

The World Happiness Report is a survey of global happiness. It contains articles and rankings of happiness based on participants' ratings of their own lives.
Happiness is based on a survey in which nationally representative samples of participants are asked to think of a ladder, with the best possible life for them being a 10, and the worst possible life being a 0. They are then asked to rate their own current lives on that 0 to 10 scale. The report correlates the results with various life factors.

In this tutorial, we will try to answer some questions by creating visualizations of the data:
- Which countries and regions of the world are happiest?
- What factors contribute to a country's/region's happiness ranking?
- How did happiness change from 2016 to 2019?

<b>We will work through a few different visualizations (listed below). You can follow these links to jump directly to a particular section/visualization.</b>
* [Variables](#1)
* [Import Libraries](#2)
* [Load & Preview Data](#3)
* [Bar Graph](#4)
* [Violin Plot](#5)
* [Box Plot](#6)
* [Scatter Plot](#7)
* [Pair Plot](#8) 
* [Heat Map](#9)
* [Interactive Bubble Plot](#10)
* [Sources](#11)

<a id=1></a>
## Variables

Although the original report and data sets contain more variables, we will focus on the following variables in this tutorial:
* **Country**: Name of the country
* **Region**: Region the country belongs to
* **Happiness_Score**: A metric measured by asking the sampled people the question: "How would you rate your happiness on a scale of 0 to 10, where 10 is the happiest?"
* **Economy**: GDP per capita
* **Family**: Social support
* **Health**: Healthy life expectancy
* **Freedom**: Freedom to make life choices
* **Trust**: Perceptions of corruption
* **Generosity**: Perceptions of generosity

<div class="alert alert-block alert-info">Pause! Answer Q1 on the U4I platform.</div>

<a id=2></a>
## Import Libraries

In [None]:
# import Pandas library and call it 'pd' for analyzing & visualizing data
import pandas as pd
from pandas.plotting import autocorrelation_plot

# import matplotlib.pyplot and call it 'plt' for plotting data
import matplotlib.pyplot as plt
%matplotlib inline

#import numpy and call it 'np' for scientific computing
import numpy as np

#import seaborn and call it 'sns' for visualizations
import seaborn as sns 

#import libraries for interactive visualizations
from IPython.display import HTML
from bubbly.bubbly import bubbleplot
import plotly.offline as py
import plotly.graph_objs as go

import warnings            
warnings.filterwarnings("ignore") 

<a id=3></a>
## Load & Preview Data

In [None]:
#Load data from csv files and assign to "data_2016", "data_2017", "data_2018", "data_2019"
data_2016=pd.read_csv("2016.csv")
data_2017=pd.read_csv("2017.csv")
data_2018=pd.read_csv("2018.csv")
data_2019=pd.read_csv("2019.csv")

In [None]:
#Show the first 5 rows of data_2016
data_2016.head()

In [None]:
#Show the first 5 rows of data_2017
data_2017.head()

In [None]:
#Show the first 5 rows of data_2018
data_2018.head()

In [None]:
#Show the first 5 rows of data_2019
data_2019.head()

Even after previewing the first few rows, it's hard to draw any conclusions from the raw numbers alone.\
This is exactly why data visualizations are so powerful. Let's get started!

<a id=4></a>
## Bar Graph

#### Let's start with an easy question: <b>What were the happiest countries in 2016?</b>

In [None]:
#Set figure size
plt.figure(figsize=(25,10))

#Create bar graph
sns.barplot(x=data_2016['Country'], y=data_2016['Happiness_Score'], palette="BuPu")

#Set axes and title
plt.xticks(rotation= 90)
plt.xlabel('Country', size = 15)
plt.ylabel('Happiness Score', size = 15)
plt.title('Happiness Score by Country in 2016', size = 18)

plt.show()

We can see that the happiest country in 2016 was Denmark, and the least happy country was Burundi.\
\
But this graph has a lot of information, so let's try to condense it by grouping countries by region and creating a new variable "region_happiness_ratio" (the sum of happiness_score for all the countries in a region divided by the number of countries in the region).
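The "region_happiness_ratio" described above is simply the arithmetic mean of `Happiness_Score` within each region. As a minimal sketch, assuming a small made-up data frame (the `toy` rows and region names below are illustrative values, not taken from the report), the same quantity could also be computed with the pandas `groupby` method:

```python
import pandas as pd

# Hypothetical toy data: illustrative values only, not from the World Happiness Report
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Region": ["North", "North", "South", "South"],
    "Happiness_Score": [7.0, 6.0, 4.0, 5.0],
})

# Mean score per region (sum of scores divided by number of countries),
# sorted from the happiest region to the least happy one
region_mean = (
    toy.groupby("Region")["Happiness_Score"]
    .mean()
    .sort_values(ascending=False)
)
print(region_mean)
```

This is only an alternative way to express the same calculation; the loop used in this tutorial produces the same averages.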

In [None]:
#Create new data frame "sorted_data_2016" with region and region_happiness_ratio_2016
region_lists=list(data_2016['Region'].unique())
region_happiness_ratio_2016=[]
for each in region_lists:
    region=data_2016[data_2016['Region']==each]
    region_happiness_rate_2016=sum(region.Happiness_Score)/len(region)
    region_happiness_ratio_2016.append(region_happiness_rate_2016)
data=pd.DataFrame({'region':region_lists,'region_happiness_ratio_2016':region_happiness_ratio_2016})
new_index=(data['region_happiness_ratio_2016'].sort_values(ascending=False)).index.values
sorted_data_2016 = data.reindex(new_index)

#Show new data frame
sorted_data_2016

#### Let's look at our condensed data: <b>What were the happiest regions of the world in 2016?</b>

In [None]:
#Set figure size
plt.figure(figsize=(10,8))

#Create bar graph
sns.barplot(x=sorted_data_2016['region'], y=sorted_data_2016['region_happiness_ratio_2016'], palette="BuPu")

#Set axes and title
plt.xticks(rotation= 90)
plt.xlabel('Region', size = 15)
plt.ylabel('Region Happiness Ratio', size = 15)
plt.title('Happiest Regions in the World in 2016', size = 18)

plt.show()

<div class="alert alert-block alert-info">Pause!
Create another bar graph for the region happiness ratio in 2019 (use the code cell below) and answer Q2 on the U4I platform. Remember to first create a new data frame called "sorted_data_2019".</div>

In [None]:
# your code here

#### Now that we know which countries and regions are the happiest, let's learn more about our data: <b>Which factors affect happiness the most?</b>

In [None]:
region_lists=list(data_2016['Region'].unique())
share_economy=[]
share_family=[]
share_health=[]
share_freedom=[]
share_trust=[]
for each in region_lists:
    region=data_2016[data_2016['Region']==each]
    share_economy.append(sum(region.Economy)/len(region))
    share_family.append(sum(region.Family)/len(region))
    share_health.append(sum(region.Health)/len(region))
    share_freedom.append(sum(region.Freedom)/len(region))
    share_trust.append(sum(region.Trust)/len(region))

#Create horizontal bar plot for factors affecting happiness in 2016
f,ax = plt.subplots(figsize = (20,5))

#Set colors for each variable
sns.barplot(x=share_economy,y=region_lists,color='pink',label="Economy")
sns.barplot(x=share_family,y=region_lists,color='orange',label="Family")
sns.barplot(x=share_health,y=region_lists,color='red',label="Health")
sns.barplot(x=share_freedom,y=region_lists,color='lightgreen',label="Freedom")
sns.barplot(x=share_trust,y=region_lists,color='purple',label="Trust")

#Set legend
ax.legend(loc="lower right",frameon = True)

#Set axes and title
ax.set(xlabel='Percentage of Region', ylabel='Region',title = "Factors Affecting Happiness in 2016")

plt.show()

<a id=5></a>
## Violin Plot
A violin plot shows the distribution of values within the variable(s).

In [None]:
# Create new data frame with Region, Economy, Health, Family, Freedom, Trust
dataframe2=pd.pivot_table(data_2016, index = 'Region', values=["Economy", "Health", "Family", "Freedom", "Trust"])

#Set figure size
f,ax=plt.subplots(figsize=(20,8))

#Set colors
my_pal = {"Economy": "pink", "Family": "orange", "Freedom":"lightgreen", "Health":"red", "Trust":"purple"}

#Create violin plot
sns.violinplot(data=dataframe2, inner="points", palette=my_pal)
plt.xlabel('Factor', size = 15)
plt.ylabel('Percentage of Region', size = 15)
plt.title('Distribution of Factors Affecting Happiness in 2016', size = 18)

plt.show()
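A note on what the violins above summarize: `pd.pivot_table` aggregates with the mean by default, so the new data frame holds one row per region, and each violin shows how the regional averages of a factor are spread out. A minimal sketch, assuming a made-up data frame (the `toy` values below are illustrative, not from the report):

```python
import pandas as pd

# Hypothetical toy data: illustrative values only
toy = pd.DataFrame({
    "Region": ["North", "North", "South", "South"],
    "Economy": [1.4, 1.2, 0.6, 0.8],
    "Health": [0.9, 0.7, 0.5, 0.3],
})

# With no aggfunc given, pivot_table averages the values within each region
pivoted = pd.pivot_table(toy, index="Region", values=["Economy", "Health"])
print(pivoted)
```

The four toy countries collapse into two rows, one per region, which is why the points inside each violin correspond to regions rather than individual countries.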

<div class="alert alert-block alert-info">Pause! Create another violin plot (use the code cell below) for data_2019 and answer Q3 on the U4I platform.</div>

In [None]:
# your code here

<a id=6></a>
## Box Plot
A box plot also displays the distribution of a data set based on 5 points: the minimum, first quartile, median, third quartile, and maximum. Box plots also show outliers in the data set.
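To make the five points concrete, here is a minimal sketch using NumPy on made-up numbers (the `scores` array is hypothetical). It also applies the common 1.5 × IQR rule for flagging outliers, which, as far as we are aware, is the default convention seaborn uses when drawing whiskers:

```python
import numpy as np

# Hypothetical scores: illustrative values only
scores = np.array([2.0, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5])

# Five-number summary
minimum = scores.min()
q1, median, q3 = np.percentile(scores, [25, 50, 75])
maximum = scores.max()

# Interquartile range and the 1.5 * IQR fences
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are drawn as individual outlier points
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]
print(minimum, q1, median, q3, maximum, outliers)
```

Under this rule, the whiskers of the box stop at the most extreme values that are still inside the fences, so the plotted "minimum" and "maximum" can differ from the raw minimum and maximum when outliers are present.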

Since we already know the distribution of factors that affect happiness for every region from the violin plot above, let's look at the `Happiness_Score` across years.

#### <b>Did regional `Happiness_Score` change from 2016 to 2019?</b>

In [None]:
#combine all 4 data sets into one "data_concat" and add variable "Year"
data_2016['Year']=2016
data_2017['Year']=2017
data_2018['Year']=2018
data_2019['Year']=2019
data_concat=pd.concat([data_2016,data_2017,data_2018,data_2019],axis=0,sort = False)

#Create box plot
f,ax = plt.subplots(figsize =(20,10))
sns.boxplot(x="Year" , y="Happiness_Score", hue="Region",data=data_concat,ax=ax)
plt.xlabel('Year', size = 15)
plt.ylabel('Happiness_Score', size = 15)
plt.title('Happiness Score 2016-2019', size = 18)

#Format legend location
plt.legend(bbox_to_anchor=(1, 0.4, 0.2, 0.2), loc='center')

plt.show()

<div class="alert alert-block alert-info">Pause! Answer Q4 on the U4I platform.</div>

<a id=7></a>
## Scatter Plot

Now that we know more about our variables, let's look at the relationships between factors.

A scatter plot can be used to visualize whether there is a correlation (relationship) between two variables, i.e., whether an increase or decrease in one variable goes along with an increase or decrease in the other variable.\
In the horizontal bar graph above, we learned that `Economy` is the factor that affects happiness the most. 

#### <b>What was the relationship between `Economy` and `Happiness_Score` in 2019?</b>

In [None]:
#Create scatter plot for economy and happiness_score for data_2019
f,ax = plt.subplots(figsize = (15,8))
sns.scatterplot(x=data_2019["Happiness_Score"], y=data_2019["Economy"])

#Format title and axes
plt.title("Relationship Between Economy and Happiness Score in 2019", size=18)
plt.xlabel('Happiness_Score', size = 15)
plt.ylabel('Economy', size = 15)

plt.show()

We can see a clear positive relationship here. 
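A relationship like this can also be summarized as a single number, the Pearson correlation coefficient, which ranges from -1 (perfect negative relationship) to +1 (perfect positive relationship). A minimal sketch with NumPy, assuming made-up values (the `economy` and `happiness` arrays are hypothetical and not taken from data_2019):

```python
import numpy as np

# Hypothetical values: illustrative only
economy = np.array([0.2, 0.5, 0.9, 1.2, 1.5])
happiness = np.array([3.5, 4.4, 5.0, 6.1, 7.2])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the
# correlation between the two variables
r = np.corrcoef(economy, happiness)[0, 1]
print(round(r, 2))
```

A value close to +1 matches what a tight, upward-sloping cloud of points looks like in a scatter plot.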

<div class="alert alert-block alert-info">Pause! Create another scatter plot for any two variables in the data set "data_2019" (use the code cell below), take a screenshot, and answer Q5 on the U4I platform.</div>

In [None]:
# your code here

<a id=8></a>
## Pair Plot
A pair plot allows us to see the distribution of single variables and the relationships between variables for all the variables in a data set.\
Pair plots are useful for identifying trends to follow up on in large data sets with several variables.

#### <b>What are the relationships between 4 factors: `economy`, `family`, `health`, and `freedom`?</b>

In [None]:
#drop columns from data sets to only include variables of interest
data_2016_reduced = data_2016.drop(['Generosity', 'Trust', 'Year'], axis=1)
data_2017_reduced = data_2017.drop(['Generosity', 'Trust', 'Year'], axis=1)
data_2018_reduced = data_2018.drop(['Generosity', 'Trust', 'Year'], axis=1)
data_2019_reduced = data_2019.drop(['Generosity', 'Trust', 'Year'], axis=1)

In [None]:
#Create a pair plot for data_2019_reduced
sns.pairplot(data_2019_reduced, hue="Region")
plt.show()

<div class="alert alert-block alert-info">Pause! Answer Q6 on the U4I platform.</div>

<a id=9></a>
## Heat Map
We can also visualize correlations between variables with a heat map.\
A heat map shows the magnitude of a relationship as color.

In [None]:
#Remove column Year from data frame
del data_2019['Year']

#Create heat map for data_2019 (correlations are computed on the numeric columns only)
f,ax=plt.subplots(figsize=(10,8))
sns.heatmap(data_2019.select_dtypes(include='number').corr(),annot=True, cmap="BuPu")
plt.show()

In this heat map, lighter colors represent a lower correlation (a weaker relationship) and darker colors represent a higher correlation (a stronger relationship).\
The diagonal represents variables correlated with themselves.\
Keep in mind that a correlation is simply an association between two variables, whether it be positive or negative, and does not indicate causality.
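The numbers printed in each cell of the heat map come from a correlation matrix. As a minimal sketch, assuming a made-up data frame (the `toy` values and the `Corruption_Index` column are hypothetical), the matrix can be inspected directly to see both the sign and the strength of each relationship:

```python
import pandas as pd

# Hypothetical toy data: illustrative values only
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Happiness_Score": [7.0, 6.0, 5.0, 4.0],
    "Economy": [1.4, 1.2, 1.0, 0.8],
    "Corruption_Index": [1.0, 2.0, 3.0, 4.0],  # hypothetical column
})

# Keep only numeric columns, then compute pairwise Pearson correlations
corr = toy.select_dtypes(include="number").corr()
print(corr.round(2))
```

In this toy example, `Economy` is strongly positively correlated with `Happiness_Score`, while `Corruption_Index` is strongly negatively correlated with it. With a sequential color map such as "BuPu", strongly negative values appear as the lightest cells, so it is worth reading the annotated numbers as well as the colors.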

<div class="alert alert-block alert-info">Pause! Create another heat map for "data_2016_reduced" (use the code cell below) and answer Q7 on the U4I platform. Remember to first delete the column "Year" from data_2016.</div>

In [None]:
# your code here

<a id=10></a>
## Interactive Bubble Plot

A bubble plot is a scatterplot with a third dimension which is represented by the size of the dots.

#### <b>What is the relationship between `Happiness_Score`, `Trust`, and `Economy`?</b>

Once you create the interactive plot, you can explore the visualization by adding/removing regions and hovering over the bubbles.

In [None]:
figure = bubbleplot(dataset = data_2019, x_column = 'Happiness_Score', y_column = 'Trust', 
    bubble_column = 'Country', size_column = 'Economy', color_column = 'Region', 
    x_title = "Happiness Score", y_title = "Trust", title = 'Happiness Score, Trust, and Economy by Region',
    x_logscale = False, scale_bubble = 1, height = 650)

py.iplot(figure, config={'scrollzoom': True})

### Well done! You have completed this tutorial. Remember to submit the exercise on the U4I platform.

<a id=11></a>
## Sources

https://www.kaggle.com/saduman/eda-and-data-visualization-with-seaborn \
https://www.kaggle.com/roshansharma/world-happiness-report \
https://www.kaggle.com/unsdsn/world-happiness \
https://en.wikipedia.org/wiki/World_Happiness_Report