# Data Visualization and Decision Tree
In this tutorial, you will learn how to use Python as a powerful tool for data visualization and for building decision trees. You can skip the Requirements section if you have already completed the environment setup. You need to add the "Seaborn_Datasets" and "DT_Datasets" folders to the home page of your Jupyter notebook.
## Requirements
Before starting, make sure that you have installed the following libraries:
- pandas
- seaborn
- numpy
- matplotlib
- scipy
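- sklearn (installed with pip under the name "scikit-learn")
- subprocess (part of the Python standard library, so no installation is needed)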
- graphviz
- p_decision_tree

If you have installed “Anaconda”, all of these libraries except "p_decision_tree" should already be installed on your system. You can check whether a library is available with the following command:
- python -c "import {name of library}"

For example, to check whether “NumPy” is installed, you can use the following command:
- python -c "import numpy"

If a library is missing, you can install it with the following command:
- pip install {name of library}
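
The check above can also be run for all libraries at once from Python. The following is a minimal sketch, assuming the import names from the requirements list; the helper name `missing_libraries` is hypothetical and introduced here only for illustration.

```python
import importlib.util

# Import names taken from the requirements list above
required = ["pandas", "seaborn", "numpy", "matplotlib", "scipy",
            "sklearn", "subprocess", "graphviz", "p_decision_tree"]

def missing_libraries(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

print("Missing libraries:", missing_libraries(required))
```

An empty list means that every library can be imported. Note that the name used with `pip install` can differ from the import name (for example, "sklearn" is installed as "scikit-learn").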
    
If you have not installed the “Graphviz” package, you can find the right package for your operating system at the following link:
- https://www.graphviz.org/download/

After installing Graphviz, you need to add the directory that contains the “dot” executable to your system PATH environment variable (the default location on Windows is “C:\Program Files (x86)\Graphviz2.38\bin”).
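
To confirm that the “dot” executable is visible on the PATH, a short check with the standard library may help. This is only a sketch; it reports where "dot" was found, if anywhere.

```python
import shutil

# shutil.which searches the PATH the same way the shell would
dot_path = shutil.which("dot")

if dot_path is None:
    print('"dot" was not found on the PATH; check the Graphviz installation.')
else:
    print('"dot" found at:', dot_path)
```

If you changed the PATH while Jupyter was running, you may need to restart Jupyter before the change becomes visible.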


# Data Visualization 
Data visualization is usually the starting point for data analysis. It refers to the graphical representation of data using visual elements such as charts, graphs, and maps. In the data visualization part of this tutorial, we will discuss the following topics:
1.	Basic statistical analysis
2.	Simple plots
    - Bar plot
    - Stacked plot
    - Box plot
3.	Distributions
    - Plotting univariate distributions
    - Plotting bivariate distributions
    - Pair plot

## Basic Statistical Analysis
In this section, you will learn how to obtain some basic statistical values such as the mean, median, and standard deviation.

### Providing Datasets
In Python, some libraries ship with simple datasets for practice. Seaborn is one of them; it provides several datasets that can be loaded and used.

In [None]:
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')

print(sns.get_dataset_names())


In [None]:
tips = sns.load_dataset("tips")
tips_10 = tips[:10]
print(tips_10)

Depending on the version of Python that you have installed, you may have access to slightly different datasets. To avoid inconsistent results during this tutorial, we use the downloaded datasets in the "Seaborn_Datasets" folder.

In [None]:
import pandas as pd
import seaborn as sns 
import numpy as np

def read_dataset(dataset):
    folder = "Seaborn_Datasets/"
    data = pd.read_csv(folder + dataset)
    return data

tips = read_dataset("tips.csv")

#Some basic statistics
print(np.mean(tips['total_bill'])) #mean for the 'total_bill'
print(np.std(tips['total_bill'])) #standard deviation for the 'total_bill' 
print(np.var(tips['total_bill'])) #variance for the 'total_bill'
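
One detail is worth knowing when comparing results between libraries: NumPy's `np.std` divides by n by default (population standard deviation), while the pandas `std` method divides by n - 1 (sample standard deviation). The sketch below uses a few hypothetical values rather than the "tips" data, and also shows the median mentioned at the start of this section.

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only for illustration
bills = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0])
values = bills.to_numpy()

population_std = np.std(values)       # divides by n (ddof=0, the NumPy default)
sample_std = np.std(values, ddof=1)   # divides by n - 1
pandas_std = bills.std()              # pandas uses ddof=1 by default

print(np.median(values))
print(population_std, sample_std, pandas_std)
```

The same `ddof` argument is accepted by `np.var`, so the two conventions can be matched whenever needed.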

### Your turn
<ol>
  <li>Load "tips.csv" data set from the datasets folder (Seaborn_Datasets). </li>
  <li>Calculate the mean of the "tip"s for dinner and lunch.</li>
</ol>

In [None]:
#Your answer


## Simple Plots
In this section, you will learn how to draw some simple plots to start data analysis.

### Bar Plot
A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent. It can show the relationship between a numerical variable and a categorical variable. 

In [None]:
import pandas as pd
import seaborn as sns 
import numpy as np

def read_dataset(dataset):
    folder = "Seaborn_Datasets/"
    data = pd.read_csv(folder + dataset)
    return data

tips = read_dataset("tips.csv")

sns.barplot(x="day", y="tip", data=tips)

In [None]:
#changing estimator to median
sns.barplot(x="day", y="total_bill", data=tips, estimator=np.median)
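
The bar heights drawn above are one aggregated value per category: by default the mean, or the median when the estimator is changed. As a rough sketch, the same numbers can be computed directly with pandas. The small DataFrame below is hypothetical and only mimics the "day" and "tip" columns.

```python
import pandas as pd

# Hypothetical rows that mimic the "day" and "tip" columns
df = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Fri"],
    "tip": [2.0, 4.0, 1.0, 2.0, 6.0],
})

mean_tip = df.groupby("day")["tip"].mean()      # corresponds to the default estimator
median_tip = df.groupby("day")["tip"].median()  # corresponds to estimator=np.median

print(mean_tip)
print(median_tip)
```

Comparing the two results shows how a single large value (the tip of 6.0 on "Fri") pulls the mean up while leaving the median unchanged.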

### Stacked Plot
A stacked bar graph (or stacked bar chart) is a chart that uses bars to show comparisons between categories of data, with each bar divided into segments that show how the parts contribute to the total.

In [None]:
import pandas as pd

#generating data set
df = pd.DataFrame(columns=["Language","Scripting", "Cross Platform","Fast",
                           "Data Science","Easy"], 
                  data=[["Python",1,1,1,1,1],
                        ["Java",0,1,1,1,0],
                        ["PHP",1,1,0,0,1],
                        ["Perl",1,1,1,0,1],
                        ["C#",0,0,0,0,0]])
#drawing stacked bar plot
df.set_index('Language').plot(kind='bar', stacked=True, figsize=(10, 10))

### Box Plot
A box plot (or boxplot) is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending from the boxes (whiskers) indicating variability outside the upper and lower quartiles.
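
To make the quartiles and whiskers concrete, the following sketch computes them by hand with NumPy. It assumes the common 1.5 × IQR rule, which is also the default whisker setting of matplotlib's `boxplot`; the sample values are hypothetical.

```python
import numpy as np

# Hypothetical values with one obvious outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, q3 = np.percentile(values, [25, 75])  # lower and upper quartiles
iqr = q3 - q1                             # interquartile range (height of the box)

# Points beyond these fences are drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, q3, iqr)
print(outliers)
```

The whiskers themselves end at the most extreme data points that still lie inside the fences, not at the fences.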

### Generating Sample Data Via Random (Numpy)
Numpy can be used to generate sample data with different distributions. 

In [None]:
import matplotlib.pyplot as plt
import numpy as np

sample1 = np.random.rand(50) * 100 #Generates 50 random values in [0, 100)
print("Sample1:\n")
print(sample1)
sample2 = np.ones(25) * 50  #Generates an array of size 25 in which all values are 50
print("\nSample2:\n")
print(sample2)
sample3 = np.random.rand(10) * 100 + 100 #10 values in [100, 200)
sample4 = np.random.rand(10) * -100      #10 values in (-100, 0]
data = np.concatenate((sample1, sample2, sample3, sample4), 0) #Concatenates along axis 0 into one 1-D array
data

In [None]:
plt.boxplot(data)

In [None]:
#Color and shape of outliers
plt.boxplot(data, notch=False, sym='gd') #'gd' = green diamonds

In [None]:
#Change the orientation (vertical, horizontal)
plt.boxplot(data, notch=False, sym='rs', vert=False) #'rs' = red squares, horizontal boxes

In [None]:
#Multiple box plots together
data = [data, data[:50],sample1]
plt.boxplot(data)

### Your turn
<ol>
  <li>Load "flights.csv" data set from the datasets folder (Seaborn_Datasets). </li>
  <li>Calculate the mean, median, and standard deviation of "passengers" for the first 100 rows.</li>
  <li>Show the average number of passengers per month (bar plot) for the whole data set.</li>
  <li>Explore the outliers of "passengers" (box plot) in the whole data set. Are there any outliers?</li>
</ol>

In [None]:
#Your answer


## Distributions
The distribution of a statistical data set (or a population) is a listing or function showing all the possible values (or intervals) of the data and how often they occur.
### Plotting Univariate Distributions
In the univariate case, the distribution of a single variable is explored.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
 
sns.set(color_codes=True)
 
x = np.random.normal(10,1,size=100)
sns.distplot(x);   #default distribution with histogram and kernel density

In [None]:
#Without kernel density, with rug plot (small vertical lines show the observations in each bin)
sns.distplot(x, kde=False, rug=True);

In [None]:
#Specifying the number of bins
sns.distplot(x, bins=20, kde=False, rug=True); 

In [None]:
#Without histogram, with rug plot
sns.distplot(x, hist=False, rug=True);

In [None]:
#Additional options for rug, kde, and hist
sns.distplot(x, rug=True, 
             rug_kws={"color": "r"}, 
             kde_kws={"color": "k", "lw": 3, "label": "KDE"}, 
             hist_kws={"histtype": "step", "linewidth": 3, "alpha": 1, "color": "g"})

### Plotting Bivariate Distributions
In this section, we will explore distribution involving two variables.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

mean = [10, 20] 
cov = [(1, .5), (.5, 1)]
#Generate 100 random normal samples based on the predefined mean and covariance
data = np.random.multivariate_normal(mean, cov, 100)


#convert Numpy to Dataframe with specific names for columns 
df = pd.DataFrame(data, columns=["x", "y"])
#print(df.corr())

sns.jointplot(x="x", y="y", data=df, kind="kde");  #kind= scatter, hex, reg,kde 



In [None]:
#Changing type to Scatter
scatter = sns.jointplot(x="x", y="y", data=df, kind="scatter");

In [None]:
#Changing type to Hexagons
sns.jointplot(x="x", y="y", data=df, kind="hex");  

In [None]:
#Changing type to Regression
sns.jointplot(x="x", y="y", data=df, kind="reg");  
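
The shape seen in the joint plots above follows from the covariance matrix: the correlation between the two variables is the covariance divided by the product of the standard deviations. The sketch below checks this numerically. The larger sample size and the fixed seed are arbitrary choices made here so that the estimate is stable.

```python
import numpy as np

mean = [10, 20]
cov = [(1, .5), (.5, 1)]

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice for reproducibility
samples = rng.multivariate_normal(mean, cov, 10000)

# Correlation implied by the covariance matrix: cov_xy / (std_x * std_y)
expected_corr = cov[0][1] / np.sqrt(cov[0][0] * cov[1][1])
# Correlation estimated from the generated samples
sample_corr = np.corrcoef(samples[:, 0], samples[:, 1])[0, 1]

print(expected_corr, sample_corr)
```

With only 100 samples, as in the cell above, the estimated correlation fluctuates noticeably more around the expected value.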

### Pair Plot
A pair plot creates a grid of axes such that each variable in the data is shared on the y-axis across a single row and on the x-axis across a single column. The diagonal axes show the univariate distribution of the corresponding variable.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def read_dataset(dataset):
    folder = "Seaborn_Datasets/"
    data = pd.read_csv(folder + dataset)
    return data

iris = read_dataset("iris.csv")

In [None]:
#By default all numeric variables are used
sns.pairplot(iris); 

In [None]:
#Selecting specific variables
sns.pairplot(iris, vars = ['petal_length','sepal_length']);

In [None]:
#Adding Color
sns.pairplot(iris, hue = 'species');

In [None]:
#Adding markers
sns.pairplot(iris, hue = 'species', markers=["o", "s", "D"]);

### Your Turn :)
<ol>
  <li>Load "mpg" data set by seaborn. </li>
  <li>Show distribution of "horsepower" and "acceleration" together (by a joint plot). Interpret the correlation between "horsepower" and "acceleration".</li>
  <li>Compare the correlation between "horsepower", "weight", and "acceleration" for the cars produced by different continents ("origin"). </li>
</ol>

In [None]:
#Your answer


# Decision Tree
In this part, we will use the "p_decision_tree" library to make a decision tree based on categorical descriptive attributes and the "scikit-learn" library to make a decision tree based on numerical descriptive attributes.

## Decision Tree (Categorical Descriptive Attributes)
We use the "p_decision_tree" library to make a decision tree based on categorical descriptive attributes (make sure that you have installed the "p_decision_tree" library). This library cannot build a decision tree from numerical descriptive attributes, so you have to convert numerical descriptive attributes to categorical ones first.
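
One common way to convert a numerical attribute to a categorical one is binning. The following is a minimal sketch using `pd.cut`; the attribute, the bin edges, and the labels are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical numerical attribute
heights = pd.Series([150, 160, 170, 190], name="height")

# Bins are right-inclusive: (0, 160], (160, 180], (180, 250]
height_category = pd.cut(heights, bins=[0, 160, 180, 250],
                         labels=["short", "medium", "tall"])

# Convert to plain strings, since the tree works on categorical values
print(height_category.astype(str).tolist())
```

The choice of bin edges affects the resulting tree, so it is worth trying a few reasonable alternatives.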

Note that in order to see a visual tree, you need to install the Graphviz package. [Here](https://www.graphviz.org/download/) you can find the right package for your operating system. 
### Features
The main algorithm used by the library is ID3 with the following features:

* Information gain based on [entropy](https://en.wikipedia.org/wiki/Decision_tree_learning)
* Information gain based on [gini](https://en.wikipedia.org/wiki/Decision_tree_learning)
* Some pruning capabilities like:
	* Minimum number of samples
	* Minimum information gain
* The resulting tree is not binary
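
To illustrate the quantities named in the list above, here is a sketch of the textbook definitions of entropy, Gini impurity, and information gain in plain Python. It is not the library's own code, and its internals may differ. The function names and the small data set are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

def information_gain(values, labels, impurity=entropy):
    """Impurity of the labels minus the weighted impurity after splitting on values."""
    total = len(labels)
    remainder = 0.0
    for value in set(values):
        subset = [label for v, label in zip(values, labels) if v == value]
        remainder += len(subset) / total * impurity(subset)
    return impurity(labels) - remainder

# Hypothetical data: one attribute with three values and a yes/no target
outlook = ["sunny"] * 5 + ["overcast"] * 4 + ["rain"] * 5
play = (["yes"] * 2 + ["no"] * 3) + ["yes"] * 4 + (["yes"] * 3 + ["no"] * 2)

print(round(entropy(play), 3))
print(round(gini(play), 3))
print(round(information_gain(outlook, play), 3))
```

ID3 evaluates this gain for every descriptive attribute and splits on the one with the highest value. Because an attribute with three values produces three branches, the resulting tree is not binary.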



### Loading Dataset
As mentioned earlier, you can load “csv” or “excel” data with the corresponding Pandas functions (“read_csv” and “read_excel”, respectively). Make sure that you have uploaded the "DT_Datasets" folder to the home page of your Jupyter notebook.

In [None]:
from p_decision_tree.DecisionTree import DecisionTree
import pandas as pd

def read_dataset(dataset):
    folder = "DT_Datasets/"
    data = pd.read_csv(folder + dataset)
    return data

data = read_dataset('playtennis.csv')
data

### Identifying Descriptive and Target Attributes (Features)
As you know from the concepts of decision trees, the descriptive features and the target feature should be specified. The descriptive features are used to predict the target feature.

In [None]:
columns = data.columns

#All columns except the last one are descriptive by default
descriptive_features = columns[:-1]
#The last column is considered as label
label = columns[-1]

#Converting all the columns to string
for column in columns:
    data[column]= data[column].astype(str)

data_descriptive = data[descriptive_features].values
data_label = data[label].values

print("descriptive features:")
print(descriptive_features)
print("\ntarget feature:\n" + label)

### Making the Tree
The "id3" method is used to make the decision tree. You can pass the minimum gain and the minimum number of samples to this method to prune the tree.

In [None]:
#Calling DecisionTree constructor (the last parameter is criterion which can also be "gini")
decisionTree = DecisionTree(data_descriptive.tolist(), descriptive_features.tolist(), data_label.tolist(), "entropy")

#Here you can pass pruning features (gain_threshold and minimum_samples)
decisionTree.id3(0,0)

#Visualizing decision tree by Graphviz
dot = decisionTree.print_visualTree( render=True )

#print(dot)

print("System entropy: ", format(decisionTree.entropy))
print("System gini: ", format(decisionTree.gini))

## Decision Tree (Numerical Descriptive Attributes)
The "scikit-learn" library is used to make a decision tree based on numerical descriptive attributes. Note that "scikit-learn", the main library for data science in Python, cannot build a decision tree from categorical descriptive attributes directly, so you have to convert the categorical attributes to numerical ones before passing them to the classifier. Also, the decision tree produced by this library is a binary tree.
In the following, you can find sample code that makes a decision tree based on numerical descriptive attributes using the "scikit-learn" library.
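
One possible way to convert categorical attributes to numerical ones is one-hot encoding. The sketch below uses `pd.get_dummies`; the "outlook" column is hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical categorical attribute
df = pd.DataFrame({"outlook": ["sunny", "overcast", "rain", "sunny"]})

# Each category becomes its own indicator column
encoded = pd.get_dummies(df, columns=["outlook"])

print(encoded)
```

Each row has exactly one active indicator, so a binary split on one of these columns corresponds to a question such as "is outlook equal to sunny?".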

The “DecisionTreeClassifier” class of “sklearn” is used to build the tree classifier. You can set its parameters based on what you need. Some of the most important parameters are:
- Main parameters to specify the algorithm
    - criterion: The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain. (Default = "gini")
    - splitter: The strategy used to choose the split at each node. Supported strategies are "best" to choose the best split and "random" to choose the best random split. (Default = "best")
- Parameters to control growth of the tree (Pruning)
    - min_samples_split: The minimum number of samples required to split an internal node. (Default = 2)
    - min_samples_leaf: The minimum number of samples required to be at a leaf node. (Default = 1)
    - max_depth: The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than “min_samples_split” samples. (Default = None)
    - max_leaf_nodes: Grow a tree with “max_leaf_nodes” in best-first fashion. Best nodes are defined as relative reduction in impurity. If None, then the number of leaf nodes is unlimited. (Default = None)
    - min_impurity_decrease: A node will be split if this split induces a decrease of the impurity greater than or equal to this value. (Default = 0.)
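
As a brief illustration of how these parameters are passed, the following sketch builds a small classifier with explicit settings. The rows and labels are hypothetical stand-ins for height and weight data, and `random_state=0` is an arbitrary choice that makes the result repeatable.

```python
from sklearn import tree

# Hypothetical [height, weight] rows and labels, for illustration only
X = [[150, 50], [155, 55], [160, 58], [180, 80], [185, 85], [190, 90]]
y = ["Woman", "Woman", "Woman", "Man", "Man", "Man"]

classifier = tree.DecisionTreeClassifier(
    criterion="entropy",     # information gain instead of the default "gini"
    splitter="best",
    min_samples_split=2,
    min_samples_leaf=1,
    max_depth=2,             # limits growth of the tree
    random_state=0,          # fixes tie-breaking between equally good splits
)
classifier.fit(X, y)

print(classifier.get_depth())
print(classifier.predict([[152, 52], [188, 88]]))
```

Raising `min_samples_leaf` or lowering `max_depth` generally gives a smaller tree that is easier to read but may fit the training data less closely.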

In [None]:
import pandas as pd
from sklearn import tree
from subprocess import check_output

#loading dataset
def read_dataset(dataset):
    folder = "DT_Datasets/"
    data = pd.read_csv(folder + dataset)
    return data

data = read_dataset('ManWoman.csv')

#descriptive features
X = data[['height','weight']] 
#target feature
Y = data[["Class"]]


job_classifier = tree.DecisionTreeClassifier(criterion="entropy")   
job_classifier.fit(X, Y)


column_names = list(data.columns.values)
del column_names[-1]
dot_file = "Classification.dot"
pdf_file = "Classification.pdf"
with open(dot_file, "w") as f:
    #export_graphviz writes directly to the open file and returns None
    tree.export_graphviz(job_classifier, out_file=f,
                         feature_names=column_names,
                         class_names=["Man", "Woman"],
                         filled=True, rounded=True)
try:
    #Calls the Graphviz "dot" executable to convert the dot file to a PDF
    check_output("dot -Tpdf "+ dot_file + " -o " + pdf_file , shell=True)
    print("Find Classification.dot (description) and Classification.pdf (visual tree) in the home page of your Jupyter.")
except Exception:
    print("Make sure that you have installed Graphviz; otherwise you cannot see the visual tree. You can still find the description in the dot file.")
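
If Graphviz is not available, the tree can still be inspected as plain text. The sketch below uses `export_text` from scikit-learn (available in version 0.21 and later) on hypothetical height and weight rows rather than the "ManWoman.csv" data.

```python
from sklearn import tree

# Hypothetical [height, weight] rows and labels, for illustration only
X = [[150, 50], [155, 55], [160, 58], [180, 80], [185, 85], [190, 90]]
y = ["Woman", "Woman", "Woman", "Man", "Man", "Man"]

classifier = tree.DecisionTreeClassifier(criterion="entropy", random_state=0)
classifier.fit(X, y)

# Text rendering of the fitted tree; no external tools are needed
text_tree = tree.export_text(classifier, feature_names=["height", "weight"])
print(text_tree)
```

Each indented line shows a threshold test on one feature, and the leaves show the predicted class.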
