# Data Visualization and Decision Trees
In this tutorial, you will learn how to use Python as a powerful tool for data visualization and for building decision trees.
## Requirements
Before starting, make sure that you have installed the following libraries:
- Pandas
- Seaborn
- Numpy
- Matplotlib
- Scipy
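- Sklearn (the package is called “scikit-learn” when installing, but it is imported as “sklearn”)
- Subprocess (part of the Python standard library, no installation needed)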
- Graphviz

If you have installed “Anaconda”, most of these libraries are normally installed on your system along with it; the “Graphviz” tool usually has to be installed separately. You can check the availability of a library with the following command:
- python -c "import {name of library}"

For example, to check whether “Numpy” is installed, you can use the following command:
- python -c "import numpy"

If a library is missing, you can install it with the following command:
- pip install {name of library}
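
The same check can be done for all libraries at once from Python. The following is only a minimal sketch: the list of import names is an assumption based on the requirements above, and the search for the “dot” executable assumes that Graphviz puts it on your PATH.

```python
import importlib.util
import shutil

# Import names assumed from the requirements list (scikit-learn is imported as "sklearn")
modules = ["pandas", "seaborn", "numpy", "matplotlib", "scipy", "sklearn"]

# find_spec returns None when a top-level module cannot be found
missing = [name for name in modules if importlib.util.find_spec(name) is None]
print("Missing libraries:", missing if missing else "none")

# Graphviz is used through its "dot" command line tool, so look for it on the PATH
dot_path = shutil.which("dot")
print("Graphviz 'dot' tool:", dot_path if dot_path else "not found on PATH")
```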
    
If you have not installed the “Graphviz” package, you can find the right package for your operating system at the following link:
- https://www.graphviz.org/download/

# Data Visualization 
Data visualization is a fundamental topic in data science. It refers to the graphical representation of information and data using visual elements such as charts, graphs, and maps. In this part, we will discuss the following topics:
1.	Basic statistical analysis
2.	Simple plots by Numpy/Pandas
3.	Generating sample data
4.	Box plots
5.	Distributions
    - Plotting univariate distributions
    - Plotting bivariate distributions
    - Pair plot

## Basic Statistical Analysis
In this section, you will learn how to obtain some basic statistical values such as the mean, standard deviation, and variance.

In [None]:
import pandas as pd
import seaborn as sns 
import numpy as np

#Providing data set
print(sns.get_dataset_names())
tips = sns.load_dataset("tips")
tips_10 = tips[:10]
print(tips_10)

#Some basic statistics
print(np.mean(tips_10['total_bill'])) #mean for the 'total_bill' of the first 10 rows
print(np.std(tips_10['total_bill'])) #standard deviation for the 'total_bill' of the first 10 rows 
print(np.var(tips_10['total_bill'])) #variance for the 'total_bill' of the first 10 rows 
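
The median can be obtained in the same way. The sketch below uses a small made-up array (not the tips data) to show the median next to the mean, and the effect of the “ddof” argument: Numpy divides by n by default (population variance), while the “std” and “var” methods of Pandas divide by n-1 by default (sample variance).

```python
import numpy as np

# Made-up values, chosen so that the results are easy to verify by hand
bills = np.array([10.0, 12.0, 14.0, 16.0, 48.0])

print(np.mean(bills))         # the mean is pulled up by the extreme value 48
print(np.median(bills))       # the median is the middle value of the sorted data
print(np.var(bills))          # population variance (divides by n)
print(np.var(bills, ddof=1))  # sample variance (divides by n - 1)
print(np.std(bills, ddof=1))  # sample standard deviation
```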

## Simple Plots by Numpy/Pandas
Pandas objects have a simple “plot” method (built on Matplotlib) which can be used to draw simple plots such as bar plots and stacked plots.

## Bar Plot
A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent. It can show the relationship between a numerical variable and a categorical variable. 

In [None]:
#Using the Pandas plot method for drawing simple plots
tips['total_bill'][:10].plot(kind='bar', figsize=(10, 10))

## Stacked Plot
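A stacked bar graph (or stacked bar chart) is a chart that uses bars to show comparisons between categories of data, where each bar is divided into segments stacked on top of each other, one segment per sub-category.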

In [None]:
#generating data set
df = pd.DataFrame(columns=["Language","Scripting", "Cross Platform","Fast",
                           "Data Science","Easy"], 
                  data=[["Python",1,1,1,1,1],
                        ["Java",0,1,1,1,0],
                        ["PHP",1,1,0,0,1],
                        ["Perl",1,1,1,0,1],
                        ["C#",0,0,0,0,0]])
#drawing stacked bar plot
df.set_index('Language').plot(kind='bar', stacked=True, figsize=(10, 10))

A more advanced bar plot can be drawn with **seaborn** (the y axis shows a statistical value of each category, computed by a function called the estimator).

In [None]:
import pandas as pd
import seaborn as sns 
import numpy as np

#Providing data set for testing
tips = sns.load_dataset("tips")
#Using seaborn for drawing more complicated bar plot (default estimator is mean)
sns.barplot(x="day", y="total_bill", data=tips)

In [None]:
#changing estimator to median
sns.barplot(x="day", y="total_bill", data=tips, estimator=np.median)
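
To see what the estimator does, the bar heights can be reproduced by grouping the data by category and aggregating each group. This is only a sketch on a small made-up table (the column names follow the tips data, the values are invented).

```python
import pandas as pd

# Made-up bills for two days
sample = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Fri"],
    "total_bill": [10.0, 20.0, 12.0, 15.0, 30.0],
})

# One row per day: the values a bar plot would show with estimator = mean or median
summary = sample.groupby("day")["total_bill"].agg(["mean", "median"])
print(summary)
```

For “Fri” the mean (19.0) is larger than the median (15.0) because of the single large bill, which is why changing the estimator can change the picture.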

## Generating Sample Data Via Random (Numpy)
Numpy can be used to generate sample data with different distributions. 

In [None]:
import matplotlib.pyplot as plt
import numpy as np

sample1 = np.random.rand(50) * 100 #Generate 50 random values in [0,100)
print("Sample1:\n")
print(sample1)
sample2 = np.ones(25) * 50  #Generate an array of size 25, all values are 50
print("Sample2:\n")
print(sample2)
sample3 = np.random.rand(10) * 100 + 100 #Generate 10 random values in [100,200)
sample4 = np.random.rand(10) * -100 #Generate 10 random values in (-100,0]
#Join all samples into one array
data = np.concatenate((sample1, sample2, sample3, sample4), 0)
data
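
The random values above change on every run. If you need the same sample every time (for example to compare plots), one option is a seeded generator. The sketch below uses “np.random.default_rng”; the seed value 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)                   # seeded random generator
uniform_sample = rng.random(50) * 100                  # 50 uniform values in [0, 100)
normal_sample = rng.normal(loc=10, scale=1, size=100)  # 100 normal values, mean 10, std 1

# A second generator with the same seed reproduces exactly the same values
repeated = np.random.default_rng(seed=42).random(50) * 100
print(np.array_equal(uniform_sample, repeated))
```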

## Box Plots
A box plot or boxplot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles. Points beyond the whiskers are drawn individually as outliers.

In [None]:
plt.boxplot(data)
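
The numbers behind a box plot can also be computed by hand. The sketch below assumes the common 1.5 x IQR rule for the whiskers (the default “whis” value of Matplotlib) and uses made-up values with one extreme point.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])  # made-up data with one extreme value

# Quartiles: the box spans from q1 to q3, the line inside the box is the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1                                        # interquartile range (height of the box)

# Points outside these limits are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)
print(outliers)
```

Here only the value 50 lies above the upper limit, so it is the only point that would be marked as an outlier.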

In [None]:
#notched plot, displays a confidence interval around the median, normally computed as median +/- 1.57 x IQR/sqrt(n)
plt.boxplot(data, notch=True)

In [None]:
#Color and shape of outliers ('gd' = green diamonds)
plt.boxplot(data, notch=False, sym='gd')

In [None]:
#Change the orientation (vertical, horizontal); 'rs' = red squares for outliers
plt.boxplot(data, notch=False, sym='rs', vert=False)

In [None]:
#Multiple box plots together
data = [data, data[:50],sample1]
plt.boxplot(data)

## Your turn
<ol>
  <li>Load the "flights" data set with seaborn.</li>
  <li>Calculate the mean, median, and standard deviation of "passengers" for the first 100 rows.</li>
  <li>Show the average number of passengers per month (bar plot) for the whole data set.</li>
  <li>Explore the outliers of "passengers" (box plot) in the whole data set. Are there any outliers?</li>
</ol>

In [None]:
#Your answer
import pandas as pd
import seaborn as sns 
import numpy as np
import matplotlib.pyplot as plt

flights = sns.load_dataset("flights")
flights_100 = flights[:100]
print(np.mean(flights_100['passengers']))
print(np.median(flights_100['passengers']))
print(np.std(flights_100['passengers']))

plt.figure(figsize=(15,10))
sns.barplot(x="month", y="passengers", data=flights)

plt.figure()
plt.boxplot(flights['passengers'])



# Distributions
The distribution of a statistical data set (or a population) is a listing or function showing all the possible values (or intervals) of the data and how often they occur.
## Plotting Univariate Distributions
In a univariate analysis, the distribution of just one variable is explored.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
 
sns.set(color_codes=True)
 
x = np.random.normal(10,1,size=100)
#default distribution with histogram and kernel density
#(note: distplot is deprecated since seaborn 0.11 and shows a warning in newer versions)
sns.distplot(x);

In [None]:
#Without kernel density, with rug plot
sns.distplot(x, kde=False, rug=True);

In [None]:
#Identifying number of bins
sns.distplot(x, bins=20, kde=False, rug=True);

In [None]:
#Without histogram, with rug plot
sns.distplot(x, hist=False, rug=True);

In [None]:
#Additional features for rug, kde, and hist
sns.distplot(x, rug=True, 
             rug_kws={"color": "r"}, 
             kde_kws={"color": "k", "lw": 3, "label": "KDE"}, 
             hist_kws={"histtype": "step", "linewidth": 3, "alpha": 1, "color": "g"})

## Plotting Bivariate Distributions
In this section, we will explore the joint distribution of two variables.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

mean = [0, 1] 
cov = [(1, .5), (.5, 1)]
#Generate 200 random normal data points based on the predefined mean and covariance
data = np.random.multivariate_normal(mean, cov, 200)


#convert Numpy to Dataframe with specific names for columns 
df = pd.DataFrame(data, columns=["x", "y"])
#print(df.corr())

sns.jointplot(x="x", y="y", data=df, kind="kde");  #kind= scatter, hex, reg,kde 
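
The covariance matrix used above implies a correlation of 0.5 between the two variables (0.5 divided by the product of the two standard deviations, which are both 1). A quick numerical check is sketched below; it uses a seeded generator and a larger sample of 5000 points (both are assumptions made here so that the estimate is stable).

```python
import numpy as np

rng = np.random.default_rng(seed=0)
mean = [0, 1]
cov = [(1, .5), (.5, 1)]

# Larger sample than in the plot, so the estimate is close to the true value
points = rng.multivariate_normal(mean, cov, 5000)

# Sample correlation between the first and the second column
corr = np.corrcoef(points[:, 0], points[:, 1])[0, 1]
print(corr)
print(points.mean(axis=0))  # should be close to the predefined mean [0, 1]
```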



In [None]:
#Changing type to Scatter
scatter = sns.jointplot(x="x", y="y", data=df, kind="scatter");

In [None]:
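#Adding colors and labels
#Each point is an (x, y) pair and gets a single color: here the points are red and the marginal histograms are blue
scatter = sns.jointplot(x="x", y="y", data=df, kind="scatter", joint_kws={"color": "red"}, marginal_kws={"color": "blue"}); 
scatter.ax_joint.set_xlabel('x', fontweight='bold')
scatter.ax_joint.set_ylabel('y', fontweight='bold')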

In [None]:
#Changing type to Hexagons
sns.jointplot(x="x", y="y", data=df, kind="hex");  

In [None]:
#Changing type to Regression
sns.jointplot(x="x", y="y", data=df, kind="reg");  

## Pair Plot
With a pair plot, we create a grid of Axes such that each variable in the data is shared on the y-axis across a single row and on the x-axis across a single column. The diagonal Axes are treated differently: they show the univariate distribution of the variable in that column.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")
iris[:10]

In [None]:
#By default all numeric variables are used
sns.pairplot(iris); 

In [None]:
#Specify specific variables
sns.pairplot(iris, vars = ['petal_length','sepal_length']);

In [None]:
#Adding Color
sns.pairplot(iris, hue = 'species');

In [None]:
#Adding markers
sns.pairplot(iris, hue = 'species', markers=["o", "s", "D"]);

## Your Turn :)
<ol>
  <li>Load the "mpg" data set with seaborn.</li>
  <li>Show the distribution of "horsepower" and "acceleration" together (with a scatter plot, using custom colors for the points and the marginal histograms). Interpret the correlation between "horsepower" and "acceleration".</li>
  <li>Represent the correlation of "horsepower", "weight", and "acceleration" (with a pair plot). Use "origin" for coloring.</li>
</ol>

In [None]:
#Your answer
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

mpg = sns.load_dataset("mpg")
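#Points in red, marginal histograms in blue
sns.jointplot(x="horsepower", y="acceleration", data=mpg, kind="scatter", joint_kws={"color": "red"}, marginal_kws={"color": "blue"}); 
#The correlation is negative: cars with more horsepower tend to have lower "acceleration" values
print(mpg[['horsepower', 'acceleration']].corr())
sns.pairplot(mpg, vars = ['horsepower','weight','acceleration'], hue = 'origin');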

# Decision Tree
A decision tree is a tree where each node represents a feature (attribute), each link (branch) represents a decision (rule), and each leaf represents an outcome (a categorical or continuous value). Below you can find the criteria you have learned during the lecture: Information Gain (IG), Gain Ratio (GR), and Gini. 


$
H(t) = -\sum_{i=1}^{l} \left( P(t=i) \times \log_2 P(t=i) \right)
$

$
IG_{Entropy}(a) = Entropy_{First} - Entropy_{splitting}
$

$
GR(d,D) = \frac{IG(d,D)}{-\sum_{l \in levels(d)} \left( P(d=l) \times \log_2 P(d=l) \right)}
$

$
Gini(t,D) = 1 - \sum_{l \in levels(t)} P(t=l)^2
$

$
IG_{Gini}(a) = Gini_{First} - Gini_{splitting}
$
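
To make these criteria concrete, the sketch below computes them by hand for a tiny made-up data set with one descriptive feature and one target. The helper names (“entropy”, “gini”, “remainder”) are hypothetical and chosen here for illustration; “remainder” is the weighted impurity after splitting, i.e. the “splitting” term in the formulas.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H = -sum(p * log2(p)) over the levels that occur in labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini = 1 - sum(p^2) over the levels that occur in labels."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def remainder(feature, target, impurity):
    """Weighted impurity of the partitions created by splitting on feature."""
    n = len(target)
    total = 0.0
    for level in set(feature):
        part = [t for f, t in zip(feature, target) if f == level]
        total += len(part) / n * impurity(part)
    return total

# Made-up data: 8 rows, feature with levels "a"/"b", target with levels "M"/"W"
feature = ["a", "a", "a", "a", "b", "b", "b", "b"]
target = ["M", "M", "M", "W", "W", "W", "W", "M"]

ig_entropy = entropy(target) - remainder(feature, target, entropy)
gain_ratio = ig_entropy / entropy(feature)  # IG divided by the entropy of the feature itself
ig_gini = gini(target) - remainder(feature, target, gini)

print(round(ig_entropy, 4), round(gain_ratio, 4), round(ig_gini, 4))
```

In this example the target is perfectly balanced (entropy 1, Gini 0.5) and each partition contains three rows of one class and one of the other, which gives an information gain of about 0.1887 with entropy and 0.125 with Gini. The gain ratio equals the information gain here because the feature itself has entropy 1.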

This part includes the following topics:
- Loading data
- Identifying descriptive and target features
- Identifying parameters to make the desired tree (algorithm, pruning, etc)
- Using “Graphviz” to visualize the resulting tree


## Loading Data
You can simply load “csv” or “excel” data with the corresponding methods of Pandas (“read_csv” and “read_excel”, respectively). Before executing the following code, make sure that you have uploaded the file “ManWoman.xlsx” to the home directory of your Jupyter. Reading “xlsx” files with recent versions of Pandas also requires the “openpyxl” package.

In [None]:
import pandas as pd

DataFrame = pd.read_excel('ManWoman.xlsx')
DataFrame
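
Loading “csv” data works in the same way. The sketch below reads a few made-up rows from a string instead of a file, so it runs without any upload; the column names follow the ones used in this tutorial, but the values are invented and do not come from “ManWoman.xlsx”.

```python
import io
import pandas as pd

# Made-up rows in csv format (in practice you would pass a file name to read_csv)
csv_text = """height,weight,Class
180,80,Man
165,55,Woman
175,72,Man
"""

people = pd.read_csv(io.StringIO(csv_text))
print(people.shape)   # number of rows and columns
print(people.dtypes)  # data type of each column
```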


## Identifying Descriptive and Target Features
As you know from the concepts of decision trees, the descriptive features and the target feature should be specified. Descriptive features are used to make a decision about the target feature. In the code, the descriptive features are stored in “X” and the target feature is stored in “Y”. 

In [None]:
#descriptive features
X = DataFrame[['height','weight']] 
#target feature
Y = DataFrame[["Class"]]

## Identifying Parameters to Make The Desired Tree (Algorithm and Pruning)
The “DecisionTreeClassifier” class of “sklearn” is used to build a tree classifier. You can set its parameters based on what you need. Some of the most important parameters are listed below (parameter names are case-sensitive and written in lowercase):
- Main parameters to specify the algorithm
    - criterion: The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain. (Default = "gini")
    - splitter: The strategy used to choose the split at each node. Supported strategies are "best" to choose the best split and "random" to choose the best random split. (Default = "best")
- Parameters to control growth of the tree (Pruning)
    - min_samples_split: The minimum number of samples required to split an internal node. (Default = 2)
    - min_samples_leaf: The minimum number of samples required to be at a leaf node. (Default = 1)
    - max_depth: The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than “min_samples_split” samples. (Default = None)
    - max_leaf_nodes: Grow a tree with “max_leaf_nodes” in best-first fashion. Best nodes are defined as relative reduction in impurity. If None, then the number of leaf nodes is unlimited. (Default = None)
    - min_impurity_decrease: A node will be split if this split induces a decrease of the impurity greater than or equal to this value. (Default = 0.)
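
As an illustration of the pruning parameters, the sketch below trains a tree limited to depth 2 with at least 2 samples per leaf. The data is made up (it is not the content of “ManWoman.xlsx”), and the names “people” and “pruned” are chosen only for this example.

```python
import pandas as pd
from sklearn import tree

# Made-up training data with the same column names as in this tutorial
people = pd.DataFrame({
    "height": [180, 175, 185, 170, 160, 165, 158, 172],
    "weight": [80, 72, 90, 68, 52, 58, 50, 64],
    "Class": ["Man", "Man", "Man", "Man", "Woman", "Woman", "Woman", "Woman"],
})
features = people[["height", "weight"]]
labels = people["Class"]

# Entropy as split criterion, growth limited by max_depth and min_samples_leaf
pruned = tree.DecisionTreeClassifier(criterion="entropy", max_depth=2,
                                     min_samples_leaf=2, random_state=0)
pruned.fit(features, labels)

print(pruned.get_depth(), pruned.get_n_leaves())  # size of the resulting tree
print(pruned.predict(features))                   # predictions for the training rows
```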



In [None]:
from sklearn import tree

job_classifier = tree.DecisionTreeClassifier(criterion="entropy")   
job_classifier.fit(X, Y)

## Using “Graphviz” to Visualize The Resulting Tree
In the final step, we will use the “dot” tool of “Graphviz” to convert the resulting dot file (a description of the nodes and edges) to a visual decision tree.

In [None]:
from subprocess import check_output

#names of the descriptive features, in the same order as the columns of X
column_names = list(X.columns)
dot_file = "Classification.dot"
pdf_file = "Classification.pdf"
#write the description of the tree (nodes and edges) to the dot file
#class_names must be given in the sorted order of the class labels (see job_classifier.classes_)
with open(dot_file, "w") as f:
    tree.export_graphviz(job_classifier, out_file=f, 
                         feature_names=column_names, 
                         class_names=["Man", "Woman"], 
                         filled=True, rounded=True)
try:
    #run the "dot" tool of Graphviz to convert the dot file to a pdf file
    check_output(["dot", "-Tpdf", dot_file, "-o", pdf_file])
except Exception:
    print("Make sure that you have installed Graphviz, otherwise you can not see the visual tree. But you can find descriptions in a dot file")
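
If “Graphviz” is not available, the rules of a tree can still be inspected as plain text. The sketch below uses “export_text” from “sklearn.tree” (available in scikit-learn 0.21 or newer) on a small made-up data set; the variable names are chosen only for this example.

```python
import pandas as pd
from sklearn import tree

# Made-up training data with the same column names as in this tutorial
people = pd.DataFrame({
    "height": [180, 175, 185, 170, 160, 165, 158, 172],
    "weight": [80, 72, 90, 68, 52, 58, 50, 64],
    "Class": ["Man", "Man", "Man", "Man", "Woman", "Woman", "Woman", "Woman"],
})
features = people[["height", "weight"]]
labels = people["Class"]

text_classifier = tree.DecisionTreeClassifier(criterion="entropy", random_state=0)
text_classifier.fit(features, labels)

# One line per node: the split conditions and the predicted class in each leaf
rules = tree.export_text(text_classifier, feature_names=list(features.columns))
print(rules)
```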


After executing all the code cells above step by step, you can find the results as "Classification.dot" (description) and "Classification.pdf" (visual tree) in the home directory of your Jupyter.