
In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline
#sns.set_style("whitegrid")

In [None]:
#df = pd.read_csv("heart_disease.csv")
url = "https://raw.githubusercontent.com/nallagondu/datatrained-training-ml-Files/main/heart_disease.csv"
# Read the CSV file from the URL
df = pd.read_csv(url)

# Display the first few rows of the DataFrame
df.head()

In [None]:
df.drop('Unnamed: 0', axis=1,inplace=True)

In [None]:
df.isna().sum()

In [None]:
df.describe()

In [None]:
# Correlation of each feature with the target, using corrwith
df.drop('target',axis = 1).corrwith(df.target)

**Visualize the correlation**

In [None]:
df.drop('target', axis=1).corrwith(df.target).plot(kind='bar',grid=True, figsize=(8,5),title='Correlation with target')

Let's work on feature selection and see whether it helps us build a better model.

#Model building using SelectPercentile features


SelectPercentile is a feature selection technique that keeps the top-scoring percentile of features (for example, percentile=80 keeps the highest-scoring 80%), where the scores come from a specified statistical test (e.g., chi-squared test, ANOVA F-test). Here's how you can build a model using SelectPercentile features:

**Data Preparation:**

- Prepare your dataset by encoding categorical variables, handling missing values, and scaling numerical features if necessary.

**Feature Selection:**

- Import SelectPercentile from sklearn.feature_selection.
- Initialize SelectPercentile with the desired statistical test (e.g., chi2 or f_classif for classification tasks, f_regression for regression tasks) and the desired percentile of features to keep.
- Fit SelectPercentile to your training data to compute the scores and select the top percentile of features.

**Model Building:**

- Import the necessary model class (e.g., LogisticRegression, RandomForestClassifier, GradientBoostingRegressor) from sklearn.
- Initialize the model with any desired hyperparameters.
- Build the training dataset from the selected features and the corresponding target variable (if applicable).
- Split the data into training and testing sets using train_test_split from sklearn.model_selection. Ideally, do this split before fitting SelectPercentile, so that the test set does not influence which features are selected.
- Fit the model to the training data and evaluate its performance on the testing data.
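The steps above can be sketched as a single scikit-learn Pipeline. This is only an illustration under stated assumptions: it uses synthetic data from make_classification rather than the heart disease dataset, f_classif as the score function, and percentile=50; the names X_demo, y_demo, and pipe are made up for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption; not the heart disease dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

# Split first so the selector and the model only ever see training data
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Chain feature selection and the classifier into one estimator
pipe = Pipeline([
    ("select", SelectPercentile(score_func=f_classif, percentile=50)),
    ("model", GradientBoostingClassifier(random_state=0)),
])
pipe.fit(X_tr, y_tr)

# Count the features the selector kept and score on the held-out set
n_kept = int(pipe.named_steps["select"].get_support().sum())
test_accuracy = pipe.score(X_te, y_te)
print(f"features kept: {n_kept} of {X_demo.shape[1]}")
print(f"test accuracy: {test_accuracy:.2f}")
```

Wrapping the selector in a Pipeline means the feature scores are computed from the training split only, which is one way to avoid leaking test information into feature selection.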



#chi2 (Chi-squared)

The chi-squared test is a statistical test used to determine whether two categorical variables in a dataset are independent. In the context of feature selection, it measures the dependency between each feature and the target variable in a classification problem.

**Here's how chi-squared feature selection works:**

- For each feature, the chi-squared test computes the chi-squared statistic and corresponding p-value between the feature and the target variable.
- The chi-squared statistic measures the extent of the relationship between the feature and the target. A higher chi-squared statistic indicates a stronger association between the feature and the target.
- The p-value is the probability of observing a chi-squared statistic at least this large under the null hypothesis that the feature and the target are independent. A lower p-value suggests that the feature is unlikely to be independent of the target.
- Features with high chi-squared statistics and low p-values are considered to be more informative and are selected for inclusion in the model.

In scikit-learn, the chi-squared test is commonly used for feature selection in classification tasks through the SelectKBest or SelectPercentile feature selection methods. The chi2 function from the sklearn.feature_selection module computes the chi-squared statistics and p-values; it requires non-negative feature values (such as counts or frequencies).
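To make the statistic and p-value concrete, here is a small sketch that calls chi2 directly on toy data. The arrays X_toy and y_toy are invented for illustration: feature 0 is much larger for class 1, while feature 1 is identical for every sample.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Toy data (an assumption for illustration): 6 samples, 2 non-negative features
X_toy = np.array([[5, 1], [4, 1], [6, 1], [0, 1], [1, 1], [0, 1]])
y_toy = np.array([1, 1, 1, 0, 0, 0])

# chi2 returns one score and one p-value per feature
scores, p_values = chi2(X_toy, y_toy)

# Feature 0: per-class sums are 1 and 15, expected 8 each,
# so the score is (1 - 8)**2 / 8 + (15 - 8)**2 / 8 = 12.25
# Feature 1: per-class sums are 3 and 3, expected 3 each, so the score is 0
for name, score, p in zip(["feature 0", "feature 1"], scores, p_values):
    print(f"{name}: chi2 = {score:.2f}, p-value = {p:.4f}")
```

A feature like feature 0 (high score, low p-value) would be kept by SelectPercentile, while feature 1 would be among the first to be dropped.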

In [None]:
from sklearn.feature_selection import SelectPercentile, chi2

In [None]:
# Instantiate SelectPercentile and fit it on the features and the label
X = df.drop(['target'], axis=1)
y = df.target

SPercentile = SelectPercentile(score_func=chi2, percentile=80)

# Fit and transform the feature matrix X
X_selected = SPercentile.fit_transform(X, y)

In [None]:
# Separate the selected features so we can check their p-values
cols = SPercentile.get_support(indices=True)  # return index numbers instead of a boolean mask
print('Feature indices = ', cols)

features = X.columns[cols]
print('features = ', list(features))

In [None]:
df_scores = pd.DataFrame({'features': X.columns, 'Chi2Score': SPercentile.scores_, 'pValue': SPercentile.pvalues_})
df_scores.sort_values(by='Chi2Score', ascending=False)

In [None]:
#Create subset of selected features
X = df[features]
y = df.target

In [None]:
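# Import libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first, then fit the scaler on the training set only, so that
# test-set statistics do not leak into training
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=43)

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)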

**GradientBoosting Classifier**



In [None]:
from sklearn.ensemble import GradientBoostingClassifier  # for regression, use GradientBoostingRegressor
from sklearn.metrics import classification_report, accuracy_score


In [None]:
def metric_score(clf, x_train, x_test, y_train, y_test, train=True):
    """Print train accuracy, or test accuracy plus a classification report."""
    if train:
        y_pred = clf.predict(x_train)
        print("\n  _____Train result____")
        print(f"Accuracy Score: {accuracy_score(y_train, y_pred) * 100:.2f}%")
    else:
        pred = clf.predict(x_test)
        print("\n  _____Test result____")
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")

        print('\n \n  test classification report \n', classification_report(y_test, pred, digits=2))


In [None]:
# Instantiate the GradientBoosting classifier
gbdt_clf = GradientBoostingClassifier()
gbdt_clf.fit(x_train,y_train)

In [None]:
metric_score(gbdt_clf,x_train,x_test,y_train,y_test,train=True)
metric_score(gbdt_clf,x_train,x_test,y_train,y_test,train=False)

**Let's see whether we can improve the performance of our model with hyperparameter tuning**

#Hyperparameter tuning ...

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
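grid_param = {
    'max_depth': range(4, 8),
    'min_samples_split': range(2, 8, 2),
    'n_estimators': range(20, 200, 10),
    # np.arange(0.1, 0.3) uses a step of 1 and yields only [0.1]; list the candidates explicitly
    'learning_rate': [0.1, 0.15, 0.2, 0.25]
}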

In [None]:
grid = RandomizedSearchCV(GradientBoostingClassifier(),cv=5,param_distributions=grid_param)
grid.fit(x_train,y_train)

In [None]:
grid.best_params_

In [None]:
gbdt_clf = GradientBoostingClassifier(
    max_depth=7, min_samples_split=6, n_estimators=170, learning_rate=0.1)

gbdt_clf.fit(x_train, y_train)

In [None]:
metric_score(gbdt_clf,x_train,x_test,y_train,y_test,train=True)
metric_score(gbdt_clf,x_train,x_test,y_train,y_test,train=False)

# There is still room to tune the parameters over different ranges and try to improve the score
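One way to explore different ranges is to give RandomizedSearchCV continuous and integer distributions from scipy.stats instead of fixed lists. The sketch below is an assumption-laden illustration, not a tuned recipe: it runs on synthetic data, the ranges are arbitrary, and n_iter and cv are kept small so it finishes quickly.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption; not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Distributions to sample from; the ranges here are illustrative guesses
param_distributions = {
    "max_depth": randint(2, 6),            # integers 2..5
    "min_samples_split": randint(2, 10),   # integers 2..9
    "n_estimators": randint(20, 100),      # integers 20..99
    "learning_rate": uniform(0.01, 0.29),  # floats in [0.01, 0.30]
    "subsample": uniform(0.6, 0.4),        # floats in [0.6, 1.0]
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,         # number of sampled parameter settings
    cv=3,             # 3-fold cross-validation
    random_state=0,   # makes the sampled settings reproducible
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
print("best params:", best_params)
print(f"best CV accuracy: {search.best_score_:.3f}")
```

Setting random_state on the search makes the sampled candidates repeatable, so a later run with wider or narrower ranges can be compared against this one.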


GBDT end

In [None]:
# If most of the features are categorical, consider CatBoost ... learn and research
# What do "Cat" and "Boost" stand for?

# CatBoost is used for another model ----

Continued in:
 XGB_used_car_get_dummies_18022024.ipynb