## XGBoost Regressor
This example walks through a regression workflow with XGBoost's `XGBRegressor`. The dataset is loaded and rows containing null values are dropped. A heatmap of the correlation matrix shows how the features relate to one another and to the target variable. An Extra Trees Regressor is used to rank features by importance, so the features that contribute most to the output variable can be identified. An XGBoost regressor is then fitted and scored with the coefficient of determination (R^2) on both the training and test sets. Hyperparameters are tuned with `RandomizedSearchCV`, and the tuned model is evaluated with Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). The final tuned model is saved for future use.

In [None]:
# Importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Importing dataset
df=pd.read_csv('Data/Real-Data/Real_Combine.csv')

In [None]:
# Checking shape of dataset
df.shape

In [None]:
# Checking for null values
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

In [None]:
# Dropping null values
df=df.dropna()

In [None]:
X=df.iloc[:,:-1] # Independent features
y=df.iloc[:,-1] # Dependent feature (target)

In [None]:
# Checking null values
X.isnull()

In [None]:
y.isnull()

In [None]:
# Plotting pairplot
sns.pairplot(df)

In [None]:
# Computing the correlation matrix
df.corr()

### Correlation Matrix with Heatmap
Correlation indicates the degree of relationship between features or between a feature and the target variable. A positive correlation suggests that an increase in one feature corresponds to an increase in the target variable, while a negative correlation implies that an increase in one feature corresponds to a decrease in the target variable. To visually assess these relationships, heatmaps are employed, particularly through libraries like seaborn. Heatmaps provide a clear visualization of the correlations between different features and the target variable, aiding in the identification of the most influential features in relation to the target variable.
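
As a small illustration of the sign convention described above, the following sketch builds a toy DataFrame and computes Pearson correlations with `DataFrame.corr()`. The column names and values are hypothetical and are not taken from the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: `up` rises with `target`, `down` falls as `target` rises
toy = pd.DataFrame({
    "up": [1, 2, 3, 4, 5],
    "down": [10, 8, 6, 4, 2],
    "target": [2, 4, 6, 8, 10],
})

# Pearson correlation of every column with the target column
corr_with_target = toy.corr()["target"]
print(corr_with_target)
```

Here `up` is positively correlated with `target` and `down` is negatively correlated, which is the pattern a heatmap encodes with color.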

In [None]:
import seaborn as sns
# Get correlations of each features in dataset
corrmat = df.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20,20))
# Plotting heat map
g=sns.heatmap(df[top_corr_features].corr(),annot=True,cmap="RdYlGn")

In [None]:
corrmat.index

### Feature Importance
Many tree-based models expose a feature importance property that assigns a score to each feature; the higher the score, the more the feature contributes to predicting the output variable. Tree-based regressors such as scikit-learn's `ExtraTreesRegressor` provide this through the built-in `feature_importances_` attribute, which is available once the model has been fitted. The scores can then be ranked to pick out the features that contribute most to the output variable.
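
To see what these scores look like on data where the answer is known, here is a minimal sketch on synthetic data. The column names (`signal`, `noise_a`, `noise_b`) and the data-generating process are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Hypothetical synthetic data: only `signal` drives the target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})
y_demo = 3.0 * X_demo["signal"] + rng.normal(scale=0.1, size=200)

demo_model = ExtraTreesRegressor(n_estimators=50, random_state=0)
demo_model.fit(X_demo, y_demo)

# feature_importances_ is aligned with the columns of X_demo; the scores sum to 1
importances = pd.Series(demo_model.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

The informative column should receive by far the largest score, while the two noise columns stay close to zero.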

In [None]:
from sklearn.ensemble import ExtraTreesRegressor
import matplotlib.pyplot as plt
model = ExtraTreesRegressor()
model.fit(X,y)

In [None]:
X.head()

In [None]:
print(model.feature_importances_)

In [None]:
# Plotting graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(5).plot(kind='barh')
plt.show()

### Target Distribution

In [None]:
# histplot with kde=True replaces the deprecated distplot
sns.histplot(y, kde=True)

### Train Test split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [None]:
import xgboost as xgb
# conda install -c anaconda py-xgboost

In [None]:
regressor=xgb.XGBRegressor()
regressor.fit(X_train,y_train)

In [None]:
print("Coefficient of determination R^2 <-- on train set: {}".format(regressor.score(X_train, y_train)))

In [None]:
print("Coefficient of determination R^2 <-- on test set: {}".format(regressor.score(X_test, y_test)))
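
For regressors that follow the scikit-learn estimator interface, `score` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below computes it by hand on hypothetical values (not results from this dataset) to show what the number measures.

```python
import numpy as np

# Hypothetical true values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2 = 1.0 - ss_res / ss_tot
print(r2)
```

A value near 1 means the predictions explain most of the variance in the target. A large gap between the train and test values usually points to overfitting.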

In [None]:
from sklearn.model_selection import cross_val_score
score=cross_val_score(regressor,X,y,cv=5)

In [None]:
score.mean()

#### Model Evaluation

In [None]:
prediction=regressor.predict(X_test)

In [None]:
# Distribution of residuals (histplot replaces the deprecated distplot)
sns.histplot(y_test-prediction, kde=True)

In [None]:
plt.scatter(y_test,prediction)

### Hyperparameter Tuning

In [None]:
xgb.XGBRegressor()

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1200, num = 12)]
print(n_estimators)

In [None]:
# Randomized Search CV
# Number of boosting rounds (trees)
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1200, num = 12)]
# Candidate learning rates (floats, not strings)
learning_rate = [0.05, 0.1, 0.2, 0.3, 0.5, 0.6]
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 30, num = 6)]
# max_depth.append(None)
# Subsample parameter values
subsample=[0.7,0.6,0.8]
# Minimum child weight parameters
min_child_weight=[3,4,5,6,7]

In [None]:
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'learning_rate': learning_rate,
               'max_depth': max_depth,
               'subsample': subsample,
               'min_child_weight': min_child_weight}

print(random_grid)
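
To get a feel for how much of this search space a randomized search covers, the sketch below counts the distinct combinations in a grid that mirrors the one above (with the learning rates written as floats). The variable names `grid` and `n_combinations` are introduced here for illustration.

```python
import math
import numpy as np

# Mirror of the search grid defined above
grid = {
    "n_estimators": [int(x) for x in np.linspace(start=100, stop=1200, num=12)],
    "learning_rate": [0.05, 0.1, 0.2, 0.3, 0.5, 0.6],
    "max_depth": [int(x) for x in np.linspace(5, 30, num=6)],
    "subsample": [0.7, 0.6, 0.8],
    "min_child_weight": [3, 4, 5, 6, 7],
}

# Total number of distinct parameter combinations in the grid
n_combinations = math.prod(len(values) for values in grid.values())
print(n_combinations)
```

With 6480 possible combinations, a search limited to 100 iterations evaluates roughly 1.5% of the grid, which is the trade-off that makes randomized search cheaper than an exhaustive grid search.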

In [None]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
regressor=xgb.XGBRegressor()

In [None]:
# Random search of parameters, using 5-fold cross-validation,
# sampling 100 different combinations from the grid
xg_random = RandomizedSearchCV(estimator=regressor, param_distributions=random_grid,
                               scoring='neg_mean_squared_error', n_iter=100, cv=5,
                               verbose=2, random_state=42, n_jobs=1)

In [None]:
xg_random.fit(X_train,y_train)

In [None]:
xg_random.best_params_

In [None]:
xg_random.best_score_
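
Because the search uses `scoring='neg_mean_squared_error'`, `best_score_` is the negated mean cross-validated MSE, so values closer to zero are better. A minimal sketch of converting it into more readable numbers follows; the value `-1600.0` is a hypothetical stand-in, not a result from this dataset.

```python
import numpy as np

# Hypothetical value standing in for xg_random.best_score_ (a negated MSE)
best_score = -1600.0

cv_mse = -best_score       # flip the sign to recover the mean cross-validated MSE
cv_rmse = np.sqrt(cv_mse)  # square root brings it back to the units of the target
print(cv_mse, cv_rmse)
```

Note that the square root of the mean MSE is not exactly the same as the mean of the per-fold RMSE values, so this is best treated as a rough summary.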

In [None]:
predictions=xg_random.predict(X_test)

In [None]:
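# Distribution of residuals for the tuned model (histplot replaces the deprecated distplot)
sns.histplot(y_test-predictions, kde=True)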

In [None]:
plt.scatter(y_test,predictions)

In [None]:
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

### Regression Evaluation Metrics


Here are three common evaluation metrics for regression problems:

**Mean Absolute Error** (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

**Mean Squared Error** (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

**Root Mean Squared Error** (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

Comparing these metrics:

- **MAE** is the easiest to interpret, because it is the average absolute error.
- **MSE** penalizes large errors more heavily than MAE because each error is squared, which is useful when large errors are especially costly.
- **RMSE** keeps that sensitivity to large errors while being expressed in the same units as "y", which makes it easier to interpret than MSE.

All of these are **loss functions**, because we want to minimize them.
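
The three formulas above can be worked through by hand with NumPy. This is a minimal sketch on hypothetical values chosen so that one prediction is far off, which shows how squaring amplifies a single large error.

```python
import numpy as np

# Hypothetical true values and predictions; the last prediction is off by 10
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 30.0, 50.0])

errors = y_true - y_hat          # [-2, 2, 0, -10]
mae = np.mean(np.abs(errors))    # (2 + 2 + 0 + 10) / 4
mse = np.mean(errors ** 2)       # (4 + 4 + 0 + 100) / 4
rmse = np.sqrt(mse)              # square root of the MSE
print(mae, mse, rmse)
```

The MAE is 3.5 and the MSE is 27.0. The single error of 10 contributes 100 of the 108 squared-error units, so it dominates MSE and RMSE far more than it dominates MAE.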

In [None]:
from sklearn import metrics

In [None]:
# Metrics for the untuned baseline model (`prediction` from the Model Evaluation step)
print('MAE:', metrics.mean_absolute_error(y_test, prediction))
print('MSE:', metrics.mean_squared_error(y_test, prediction))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, prediction)))

In [None]:
import pickle 

In [None]:
# Open a file where you want to store the model
with open('xgboost_regression_model.pkl', 'wb') as file:
    # Dump the tuned search object (including its best estimator) to that file
    pickle.dump(xg_random, file)
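
A saved model is only useful if it can be read back. The sketch below shows the pickle round trip with a plain dictionary standing in for the fitted model; the stand-in object, the temporary directory, and the file name `demo_model.pkl` are all assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model; any picklable Python object behaves the same way
stand_in_model = {"learning_rate": 0.1, "max_depth": 5}

# Hypothetical file name inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")

# Write the object to disk
with open(path, "wb") as f:
    pickle.dump(stand_in_model, f)

# Read it back; the restored object is equal to the original
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in_model)
```

With a real fitted model, the restored object can be used for `predict` in the same way as the original, provided compatible library versions are installed when it is loaded. Pickle files should only be loaded from trusted sources.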