# Gini-Decision Tree and Linear Regression

# 1) Gini-Decision Tree

    A decision tree is a popular machine learning algorithm used for both classification and regression tasks. One of the most commonly used criteria for constructing decision trees is the Gini index, which measures the impurity of a set of data.

    The Gini index used in decision trees (also called Gini impurity) shares its name with, but is distinct from, the Gini coefficient that economists use to measure the distribution of wealth in a society. In the context of decision trees, the Gini index measures how mixed the class labels in a set of data are: for class proportions p1, p2, ..., pK it equals 1 - (p1² + p2² + ... + pK²). A value of 0 indicates a completely pure dataset (all examples belong to the same class). The maximum value, 1 - 1/K, is reached when all K classes are equally represented, so it is 0.5 for two classes and approaches 1 only as the number of classes grows.
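
    To make the impurity calculation concrete, here is a minimal sketch assuming the standard definition of one minus the sum of squared class proportions. The function name `gini_impurity` and the toy label lists are illustrative, not part of the notebook.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_k ** 2) over the class proportions p_k in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention chosen here: an empty set counts as pure
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Illustrative label sets (hypothetical data)
pure = gini_impurity(["a", "a", "a", "a"])          # one class only
balanced_two = gini_impurity(["a", "a", "b", "b"])  # two classes, 50/50
balanced_three = gini_impurity(["a", "b", "c"])     # three classes, equal shares

print(pure, balanced_two, balanced_three)
```

    The pure set scores 0, the balanced two-class set scores 0.5, and the balanced three-class set scores 2/3, the largest value possible with three classes.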

    To construct a decision tree using the Gini index, the algorithm starts by calculating the Gini index for the entire dataset. It then considers candidate splits, each defined by a feature and a threshold that divides the data into two subsets, and evaluates the Gini index of each subset. The two values are averaged, weighted by subset size, and the split with the lowest weighted Gini index (i.e., the purest subsets) is selected as the splitting criterion. This process is repeated recursively for each subset until a stopping criterion is met, such as reaching a maximum tree depth or a minimum number of examples in a leaf node.
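
    The split search described above can be sketched for a single numeric feature. This is a simplified illustration, not scikit-learn's implementation: it assumes candidate thresholds are midpoints between consecutive distinct values and scores each split by the size-weighted Gini index of the two subsets. The helper names and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini(left, right):
    """Average of the two subset impurities, weighted by subset size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best split on one feature."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate identical values
        t = (lo + hi) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        score = weighted_gini(left, right)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical toy feature that separates the two classes cleanly
feature = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
target = [0, 0, 0, 1, 1, 1]

threshold, score = best_threshold(feature, target)
print(threshold, score)
```

    A full tree builder would repeat this search over every feature at every node and recurse on the two subsets until a stopping criterion applies.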

    One of the advantages of using the Gini index for constructing decision trees is its simplicity and speed. The Gini index can be calculated quickly, even for datasets with a large number of features and examples. Additionally, decision trees constructed using the Gini index are often easy to interpret, making them useful for explaining the decision-making process to non-technical stakeholders.

    However, a potential disadvantage of using the Gini index is that it may not always result in the most accurate decision tree. In some cases, other splitting criteria, such as information gain or gain ratio, may result in a more accurate model. Therefore, it is important to experiment with different splitting criteria and evaluate the performance of the resulting decision tree using appropriate evaluation metrics.
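
    One way to run such an experiment is to compare criteria under cross-validation. The sketch below assumes scikit-learn is installed and uses its `criterion="entropy"` option, which corresponds to information gain. The choice of 5 folds and `max_depth=3` is an arbitrary illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Mean 5-fold accuracy for each splitting criterion, all else held equal
scores = {}
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=0)
    scores[criterion] = cross_val_score(clf, X, y, cv=5).mean()

for criterion, mean_acc in scores.items():
    print(f"{criterion}: mean 5-fold accuracy = {mean_acc:.3f}")
```

    On a small, well-separated dataset such as iris the two criteria may score almost identically, so differences are more likely to show up on larger or noisier data.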

In [4]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [7]:
iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 42)
X_train.shape, X_test.shape

((105, 4), (45, 4))

In [8]:
clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
clf_gini.fit(X_train, y_train)
y_pred_gini = clf_gini.predict(X_test)

In [9]:
print(accuracy_score(y_test, y_pred_gini))

1.0


In [10]:
y_pred_train_gini = clf_gini.predict(X_train)
y_pred_train_gini

array([1, 1, 2, 1, 2, 1, 2, 1, 0, 2, 1, 0, 0, 0, 1, 2, 0, 0, 0, 1, 0, 1,
       2, 0, 1, 2, 0, 2, 2, 1, 1, 2, 1, 0, 1, 2, 0, 0, 1, 1, 0, 2, 0, 0,
       2, 1, 2, 1, 1, 2, 1, 0, 0, 1, 2, 0, 0, 0, 1, 2, 0, 2, 2, 0, 1, 1,
       2, 1, 2, 0, 2, 1, 2, 1, 1, 1, 0, 1, 1, 0, 1, 2, 2, 0, 1, 2, 1, 0,
       2, 0, 1, 2, 2, 1, 2, 1, 1, 2, 2, 0, 1, 2, 0, 1, 2])

In [11]:
print(accuracy_score(y_train, y_pred_train_gini))

0.9523809523809523


# 2) Linear Regression

    Linear regression is a commonly used statistical technique for modeling the relationship between a dependent variable and one or more independent variables. The goal of linear regression is to find the linear relationship that best fits the data, typically by minimizing the sum of squared differences between observed and predicted values. The fitted relationship can then be used to predict the values of the dependent variable for new observations.

    In simple linear regression, there is only one independent variable, and the relationship between the independent variable and the dependent variable is modeled using a straight line. The equation of the line is given by:

    y = β0 + β1x + ε

    where y is the dependent variable, x is the independent variable, β0 and β1 are the intercept and slope of the line, respectively, and ε is the error term.

    The intercept β0 represents the value of y when x is equal to zero, and the slope β1 represents the change in y for a one-unit change in x. The error term ε represents the random variation in the data that is not accounted for by the model.
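
    As an illustration of how β0 and β1 can be estimated, the sketch below assumes ordinary least squares, where the slope is the covariance of x and y divided by the variance of x. The function name `fit_simple_ols` and the data are hypothetical.

```python
def fit_simple_ols(x, y):
    """Estimate (beta0, beta1) for y = beta0 + beta1 * x by least squares."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: sum of cross-deviations over sum of squared x-deviations
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    beta1 = sxy / sxx
    # Intercept: forces the fitted line through the point of means
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1


# Hypothetical noise-free data generated from y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

beta0, beta1 = fit_simple_ols(x, y)
print(beta0, beta1)
```

    Because the data contain no noise, the estimates recover the intercept 1 and slope 2 used to generate them.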

    In multiple linear regression, there are multiple independent variables, and the relationship between the independent variables and the dependent variable is modeled using a plane or hyperplane. The equation of the plane or hyperplane is given by:

    y = β0 + β1x1 + β2x2 + ... + βpxp + ε

    where p is the number of independent variables.

    The coefficients β1, β2, ..., βp represent the change in y for a one-unit change in the corresponding independent variable, holding all other independent variables constant. The intercept β0 represents the value of y when all independent variables are equal to zero. The error term ε represents the random variation in the data that is not accounted for by the model.
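
    The multiple-regression equation can be fitted the same way once a column of ones is added for the intercept. The sketch below assumes NumPy and solves the least-squares problem with `np.linalg.lstsq`. The synthetic data, the "true" coefficients, and the noise level are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 observations, p = 3 independent variables
X = rng.normal(size=(200, 3))
true_beta0 = 4.0
true_beta = np.array([1.5, -2.0, 0.5])
y = true_beta0 + X @ true_beta + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the first coefficient acts as the intercept
X_design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

beta0_hat, beta_hat = coef[0], coef[1:]
print("intercept:", beta0_hat)
print("coefficients:", beta_hat)
```

    With this little noise the estimates should land close to the values used to generate the data, and each entry of `beta_hat` is read as the change in y per unit change in that variable with the others held fixed.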

    Linear regression is often used in machine learning and data analysis applications, such as predictive modeling, forecasting, and trend analysis. The performance of a linear regression model can be evaluated using metrics such as the mean squared error (MSE) or the coefficient of determination (R-squared).

In [20]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

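# The source file stores each record on two physical lines, so rows are re-joined below.
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Even rows hold the first 11 features; odd rows hold the last 2 features and the target.
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

# Fit ordinary least squares on all 13 features.
model = LinearRegression().fit(data, target)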

r_sq = model.score(data, target)
print()
print(f"coefficient of determination: {r_sq}", "\n")

print(f"intercept: {model.intercept_}", "\n")
print(f"slope: {model.coef_}", "\n")

y_pred = model.predict(data)
print(f"predicted response:\n{y_pred}")



coefficient of determination: 0.7406426641094095 

intercept: 36.459488385089884 

slope: [-1.08011358e-01  4.64204584e-02  2.05586264e-02  2.68673382e+00
 -1.77666112e+01  3.80986521e+00  6.92224640e-04 -1.47556685e+00
  3.06049479e-01 -1.23345939e-02 -9.52747232e-01  9.31168327e-03
 -5.24758378e-01] 

predicted response:
[30.00384338 25.02556238 30.56759672 28.60703649 27.94352423 25.25628446
 23.00180827 19.53598843 11.52363685 18.92026211 18.99949651 21.58679568
 20.90652153 19.55290281 19.28348205 19.29748321 20.52750979 16.91140135
 16.17801106 18.40613603 12.52385753 17.67103669 15.83288129 13.80628535
 15.67833832 13.38668561 15.46397655 14.70847428 19.54737285 20.8764282
 11.45511759 18.05923295  8.81105736 14.28275814 13.70675891 23.81463526
 22.34193708 23.10891142 22.91502612 31.35762569 34.21510225 28.02056414
 25.20386628 24.60979273 22.94149176 22.09669817 20.42320032 18.03655088
  9.10655377 17.20607751 21.28152535 23.97222285 27.6558508  24.04901809
 15.3618477  31.15

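    This code prints the coefficient of determination (R-squared), which represents the proportion of the variance in the dependent variable that is explained by the independent variables. The closer the R-squared value is to 1, the better the model fits the data. Here the score is computed on the same data the model was fitted to, so the value of about 0.74 describes in-sample fit rather than performance on unseen observations.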

    You can customize the print statements to include additional information about the accuracy of the model, such as the mean squared error (MSE) or the root mean squared error (RMSE). The MSE is the average squared difference between the predicted values and the actual values, and the RMSE is its square root, expressed in the same units as the dependent variable. Both can be useful for evaluating the performance of the model.
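
    These metrics can be computed directly from their definitions. The sketch below uses plain Python with hypothetical values. `sklearn.metrics` offers `mean_squared_error` and `r2_score` for the same purpose.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (mse, rmse, r2) for paired actual and predicted values."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    y_mean = sum(y_true) / n
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)             # total sum of squares
    mse = ss_res / n
    rmse = math.sqrt(mse)
    r2 = 1.0 - ss_res / ss_tot
    return mse, rmse, r2


# Hypothetical actual and predicted values
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]

mse, rmse, r2 = regression_metrics(actual, predicted)
print(f"MSE: {mse:.3f}  RMSE: {rmse:.3f}  R-squared: {r2:.3f}")
```

    For the regression model fitted earlier, the same function could be called with `target` and `y_pred` in place of the toy lists.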