### Importing Libraries

In [33]:
import time

### Machine Learning Models to Master

In [None]:
#Machine Learning Algorithms to Master  
1. Linear and Multiple Linear Regression
2. Logistic Regression
3. Decision Trees
4. Naive Bayes
5. K-Nearest Neighbors
6. Support Vector Machines
7. Random Forests
8. Neural Networks
    1. Convolutional Neural Network (CNN)
    2. Recurrent Neural Network (RNN)
    3. Long Short-Term Memory (LSTM)
    4. Generative Adversarial Network (GAN)
    5. Deep Belief Network (DBN)
    6. Deep Boltzmann Machine (DBM)
    7. Autoencoders
    8. Restricted Boltzmann Machines (RBM)
    9. Hopfield Networks
    10. Self-Organizing Maps (SOM)
9. Gradient Boosting
    1. XGBoost
    2. LightGBM
    3. CatBoost
    4. Gradient Boosting Machines (GBM)
    5. Stochastic Gradient Boosting (SGB)
    6. AdaBoost
    7. Gradient Boosted Decision Trees (GBDT)
    8. DeepBoost
    9. Neural Network Boosting (NNBoost)
    10. Gradient Boosted Regression Trees (GBRT)
10. Reinforcement Learning
11. Dimensionality Reduction Algorithms
    1. Principal Component Analysis (PCA)
    2. Linear Discriminant Analysis (LDA)
    3. Independent Component Analysis (ICA)
    4. Non-Negative Matrix Factorization (NMF)
    5. Factor Analysis
    6. Singular Value Decomposition (SVD)
    7. t-Distributed Stochastic Neighbor Embedding (t-SNE)
    8. Uniform Manifold Approximation and Projection (UMAP)
    9. Autoencoders
    10. Random Projection
    11. Feature Selection
    12. Locally Linear Embedding (LLE)
12. Clustering Algorithms
    1. K-Means Clustering
    2. Hierarchical Clustering
    3. Expectation-Maximization (EM) Clustering
    4. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
    5. Mean-Shift Clustering
    6. Gaussian Mixture Model (GMM) Clustering
    7. Spectral Clustering
    8. Affinity Propagation Clustering
    9. BIRCH Clustering
    10. OPTICS Clustering
13. Autoencoders
14. Transfer Learning
15. Generative Adversarial Networks (GANs)


Data Preprocessing:
    importing the required libraries
    importing the dataset
    handling missing data
    encoding the categorical data
    feature engineering
    splitting the dataset into test set and training set
    feature scaling 
    *webscraping with beautifulsoup

Developing the Model:
    model selection
    model evaluation
    model persistence
    ensemble methods
    feature extraction
    feature selection
    feature engineering
    hyperparameter tuning
    model ensembling
    model stacking
    model blending
    model bagging
    model boosting
    model averaging
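
The pre-processing steps listed above (handling missing data, encoding categorical data, splitting, feature scaling) can be chained with scikit-learn. The following is a minimal sketch, not a prescribed workflow: the toy dataset, its column names (`age`, `income`, `city`, `target`), and the choice of `LogisticRegression` are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy dataset: two numeric columns, one categorical column, binary target
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["A", "B", "C"], n),
})
toy["target"] = (toy["age"] + rng.normal(0, 5, n) > 40).astype(int)
toy.loc[rng.choice(n, 20, replace=False), "age"] = np.nan  # inject missing values

X = toy.drop(columns="target")
y = toy["target"]

# Split first, so imputation and scaling statistics are learned from the training set only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Numeric columns: median imputation, then standardization
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: most-frequent imputation, then one-hot encoding
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["city"]),
])

model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression(max_iter=1000))])
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Hold-out accuracy: {accuracy:.3f}")
```

Wrapping the steps in a `Pipeline` means the same fitted transformations are applied to the test set, which avoids leaking test-set statistics into training.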

### Data Pre-Processing in Detail

In [4]:
"""
1. Data Cleaning:
    a. Missing values:
        Removing the training example
        Filling in the missing value manually
        Using a standard value to replace the missing value
        Using central tendency (mean, median, mode) of the attribute to replace the missing value
        Using central tendency (mean, median, mode) of the attribute within the same class to replace the missing value
        Using the most probable value to fill in the missing value

    b. Noisy Data and Outliers: 
        Binning: The sorted values are divided into bins, and each value is smoothed using the values around it 
            in the same bin (for example, by replacing it with the bin mean). 
        Regression: Linear regression and multiple linear regression can be used to smooth the data by fitting 
            the values to a function.
        Outlier analysis: Approaches such as clustering can be used to detect outliers and deal with them.

    c. Remove Unwanted Data: Unwanted data is duplicate or irrelevant data. 
    
2. Data Integration:
    Data consolidation: The data is physically brought together to one data store. This usually involves Data Warehousing.
    Data propagation: Data is copied from one location to another by applications.
    Data virtualization: An interface is used to provide a real-time and unified view of data from multiple sources. 

3. Data Reduction:
    Missing values ratio: Attributes that have more missing values than a threshold are removed.
    Low variance filter: Normalized attributes whose variance is below a threshold are also removed, 
        because little change in the data means little information.
    High correlation filter: Normalized attributes whose correlation coefficient exceeds a threshold are removed, 
        because similar trends mean similar information is carried. The correlation is usually measured with 
        Pearson's correlation coefficient for numeric attributes or the chi-square test for nominal attributes.
    Principal component analysis: Principal component analysis, or PCA, is a statistical method that reduces the number 
        of attributes by projecting the data onto a few uncorrelated components that combine highly correlated attributes.

4. Data Transformation:
    Smoothing: Eliminating noise in the data to see more data patterns.
    Attribute/feature construction: New attributes are constructed from the given set of attributes.
    Aggregation: Summary and aggregation operations are applied on the given set of attributes to come up with new attributes.
    Normalization: The data in each attribute is scaled to a smaller range, for example, 0 to 1 or -1 to 1.
    Discretization: Raw values of the numeric attributes are replaced by discrete or conceptual intervals, 
        which can be further organized into higher-level intervals. 
    Concept hierarchy generation for nominal data: Values for nominal data are generalized to higher-order concepts.


"""

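Two of the cleaning techniques described above, class-wise imputation with a central tendency and smoothing by binning, can be sketched with pandas. This is only an illustration under assumptions: the toy frame, its column names (`label`, `value`), and the `prices` series are made up, and bin means are just one of several possible smoothing rules.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric attribute with gaps and a class label
df = pd.DataFrame({
    "label": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, np.nan, 3.0, 10.0, 14.0, np.nan],
})

# Missing values: fill each gap with the mean of the attribute within the same class
class_mean = df.groupby("label")["value"].transform("mean")
df["value_filled"] = df["value"].fillna(class_mean)

# Noisy data: smoothing by bin means. The sorted values are cut into
# equal-frequency bins and every value is replaced by the mean of its bin.
prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)
bins = pd.qcut(prices, q=3, labels=False)
smoothed = prices.groupby(bins).transform("mean")

print(df)
print(smoothed.tolist())
```

Here the gap in class `a` is filled with 2.0 and the gap in class `b` with 12.0, while the nine prices collapse to the three bin means 9.0, 22.0 and 29.0.
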
### Basic ML notes

In [None]:
# #Cost Function
# A cost function, also known as a loss function or objective function, 
# is a mathematical function that measures the difference between predicted and actual values in machine learning. 
# The purpose of a cost function is to guide the learning algorithm towards finding the optimal model parameters that minimize 
# the difference between the predicted and actual values.

# The choice of cost function depends on the type of problem and the learning algorithm used. 
# Here are some common examples of cost functions and their equations:

# 1. Mean Squared Error (MSE): This cost function is used for regression problems where the goal is to predict a continuous 
#     variable. It measures the average squared difference between the predicted and actual values. The equation for MSE is:

#         MSE = 1/n * ∑(y - y_pred)^2
#         where n is the number of samples, y is the actual value, and y_pred is the predicted value.

# 2. Binary Cross-Entropy: This cost function is used for binary classification problems where the output is either 0 or 1. 
#     It measures the difference between the predicted probability and the actual label. 
#     The equation for binary cross-entropy is:

#         Binary cross-entropy = -1/n * ∑(y * log(y_pred) + (1-y) * log(1-y_pred))
#         where n is the number of samples, y is the actual label (0 or 1), and y_pred is the predicted probability.

# 3. Categorical Cross-Entropy: This cost function is used for multi-class classification problems where the output 
#     can be one of several classes. It measures the difference between the predicted probability distribution and the actual 
#     label. The equation for categorical cross-entropy is:

#         Categorical cross-entropy = -1/n * ∑∑(y_ij * log(y_pred_ij))
#         where n is the number of samples, y_ij is the actual probability for class j in sample i, and y_pred_ij is the predicted probability for class j in sample i.
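
The three cost functions above translate almost directly into NumPy. The sketch below is one possible rendering of those equations; the function names and the `eps` clipping constant (used to avoid `log(0)`) are assumptions for illustration and not part of the original notes.

```python
import numpy as np

def mse(y, y_pred):
    """Mean squared error: 1/n * sum((y - y_pred)^2)."""
    y, y_pred = np.asarray(y, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((y - y_pred) ** 2)

def binary_cross_entropy(y, y_pred, eps=1e-12):
    """-1/n * sum(y*log(y_pred) + (1-y)*log(1-y_pred)) for labels in {0, 1}."""
    y = np.asarray(y, dtype=float)
    # Clip probabilities away from 0 and 1 so the logarithms stay finite
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

def categorical_cross_entropy(y, y_pred, eps=1e-12):
    """-1/n * sum_i sum_j y_ij*log(y_pred_ij) for one-hot (or soft) labels."""
    y = np.asarray(y, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0)
    return -np.mean(np.sum(y * np.log(y_pred), axis=1))

# Small worked examples
mse_value = mse([1, 2, 3], [1, 2, 5])                       # (0 + 0 + 4) / 3
bce_value = binary_cross_entropy([1, 0], [0.5, 0.5])        # log(2)
cce_value = categorical_cross_entropy(
    [[1, 0, 0], [0, 1, 0]],
    [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]],
)                                                           # log(2)
print(mse_value, bce_value, cce_value)
```

With one-hot labels, only the predicted probability of the true class contributes to the categorical cross-entropy, which is why both worked examples reduce to `log(2)`.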

### Generic Pre-processing

In [1]:
## Importing required libraries
import pandas as pd ## For DataFrame operation
import numpy as np ## Numerical python for matrix operations
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler ## Preprocessing function
import pandas_profiling ## For easy profiling of pandas DataFrame (newer releases are published as ydata_profiling)
import missingno as msno ## Missing value co-occurrence analysis

####### Data Exploration ############

def print_dim(df):
    '''
    Function to print the dimensions of a given python dataframe
    Required Input -
        - df = Pandas DataFrame
    Expected Output -
        - Data size
    '''
    print("Data size: Rows-{0} Columns-{1}".format(df.shape[0],df.shape[1]))


def print_dataunique(df):
    '''
    Function to print unique information for each column in a python dataframe
    Required Input - 
        - df = Pandas DataFrame
    Expected Output -
        - Column name
        - Data type of that column
        - Number of unique values in that column
        - 5 unique values from that column
    '''
    for counter, i in enumerate(df.columns):
        x = df[i].unique()
        # df[i].dtype works for any index; df.loc[0, i] fails when the index has no label 0
        print(counter, i, df[i].dtype, len(x), x[0:5])
        
def do_data_profiling(df, filename):
    '''
    Function to do basic data profiling
    Required Input - 
        - df = Pandas DataFrame
        - filename = Path for output file with a .html extension
    Expected Output -
        - HTML file with data profiling summary
    '''
    profile = pandas_profiling.ProfileReport(df)
    profile.to_file(output_file = filename)
    print("Data profiling done")

def view_datatypes_in_perspective(df):
    '''
    Function to group dataframe columns into three common dtypes and visualize the columns
    Required Input - 
        - df = Pandas DataFrame
    Expected Output -
        - three unique datatypes (float, object, others(for the rest))
    '''
    # Counter names avoid shadowing the built-ins float and object
    n_float = 0
    float_col = []
    n_object = 0
    object_col = []
    n_others = 0
    others_col = []
    for col in df.columns:
        if df[col].dtype == "float":
            n_float += 1
            float_col.append(col)
        elif df[col].dtype == "object":
            n_object += 1
            object_col.append(col)
        else:
            n_others += 1
            # Record the column name followed by its dtype
            others_col.append(col)
            others_col.append(df[col].dtype)
    print (f" float = {n_float} \t{float_col}, \n \nobject = {n_object} \t{object_col}, \n\nothers = {n_others} \t{others_col} ")

def missing_value_analysis(df):
    '''
    Function to do basic missing value analysis
    Required Input - 
        - df = Pandas DataFrame
    Expected Output -
        - Chart of Missing value co-occurrence
        - Chart of Missing value heatmap
    '''
    msno.matrix(df)
    msno.heatmap(df)

def view_NaN(df):
    """
    Prints the name of any column in a Pandas DataFrame that contains NaN values.

    Parameters:
        - df: Pandas DataFrame

    Returns:
        - None
    """
    for col in df.columns:
        if df[col].isnull().any():
            print("there is NaN present in column:", col)
        else:
            print("No NaN present in column:", col)

def convert_timestamp(ts):
    """
    Converts a Unix timestamp to a pandas Timestamp (UTC, timezone-naive).

    Args:
        ts (int): The Unix timestamp to convert, in seconds.

    Returns:
        pd.Timestamp: The corresponding date and time, displayed as 'YYYY-MM-DD HH:MM:SS'.
    """
    # pd.to_datetime reads the integer as seconds since the Unix epoch (UTC). This avoids
    # the datetime module, which this cell never imports, and the deprecated
    # infer_datetime_format argument.
    return pd.to_datetime(ts, unit='s')

####### Basic helper function ############

def join_df(left, right, left_on, right_on=None, method='left'):
    '''
    Function to join two pandas dataframes (left join by default)
    Required Input - 
        - left = Pandas DataFrame 1
        - right = Pandas DataFrame 2
        - left_on = Fields in DataFrame 1 to merge on
        - right_on = Fields in DataFrame 2 to merge with left_on fields of Dataframe 1 (defaults to left_on)
        - method = Type of join ('left', 'right', 'inner', 'outer')
    Expected Output -
        - Merged Pandas dataframe; overlapping column names from DataFrame 2 get the suffix "_y"
    '''
    if right_on is None:
        right_on = left_on
    return left.merge(right, 
                      how=method, 
                      left_on=left_on, 
                      right_on=right_on, 
                      suffixes=("","_y"))
    
####### Pre-processing ############    

def drop_allsame(df):
    '''
    Function to remove any columns which have same value all across
    Required Input - 
        - df = Pandas DataFrame
    Expected Output -
        - Pandas dataframe with dropped no variation columns
    '''
    to_drop = list()
    for i in df.columns:
        if len(df.loc[:,i].unique()) == 1:
            to_drop.append(i)
    return df.drop(to_drop,axis =1)

def treat_missing_numeric(df,columns,how = 'mean', value = None):
    '''
    Function to treat missing values in numeric columns
    Required Input - 
        - df = Pandas DataFrame
        - columns = List input of all the columns need to be imputed
        - how = valid values are 'mean', 'mode', 'median', 'ffill', 'digit'
        - value = numeric fill value, used only when how = 'digit'
    Expected Output -
        - Pandas dataframe with imputed missing value in mentioned columns
    '''
    if how == 'mean':
        for i in columns:
            print("Filling missing values with mean for columns - {0}".format(i))
            df[i] = df[i].fillna(df[i].mean())
            
    elif how == 'mode':
        for i in columns:
            print("Filling missing values with mode for columns - {0}".format(i))
            # mode() returns a Series (there can be ties), so take its first entry
            df[i] = df[i].fillna(df[i].mode()[0])
    
    elif how == 'median':
        for i in columns:
            print("Filling missing values with median for columns - {0}".format(i))
            df[i] = df[i].fillna(df[i].median())
    
    elif how == 'ffill':
        for i in columns:
            print("Filling missing values with forward fill for columns - {0}".format(i))
            df[i] = df[i].ffill()
    
    elif how == 'digit':
        for i in columns:
            print("Filling missing values with {0} for columns - {1}".format(value, i))
            # Keep the fill value numeric; str(value) would turn the column into mixed types
            df[i] = df[i].fillna(value)
      
    else:
        print("Missing value fill cannot be completed")
    return df

# Example: fill NaN values in the cloudCover column with a fixed value.
# smart_home is a DataFrame that must be loaded before this call is run.
# treat_missing_numeric(smart_home, ["cloudCover"], how="digit", value = 0.1)


def treat_missing_categorical(df, columns, how='mode', value = None):
    '''
    Function to treat missing values in categorical columns
    Required Input - 
        - df = Pandas DataFrame
        - columns = List input of all the columns need to be imputed
        - how = 'mode', 'digit' (fill with the string form of value), or any other string to use as the fill value
        - value = fill value, used only when how = 'digit'
    Expected Output -
        - Pandas dataframe with imputed missing value in mentioned columns
    '''
    if how == 'mode':
        for col in columns:
            print("Filling missing values with mode for column - {0}".format(col))
            df[col] = df[col].fillna(df[col].mode()[0])
            
    # 'digit' must be checked before the generic string branch, otherwise it is never reached
    elif how == 'digit':
        for col in columns:
            print("Filling missing values with {0} for column - {1}".format(value, col))
            df[col] = df[col].fillna(str(value))
            
    elif isinstance(how, str):
        for col in columns:
            print("Filling missing values with '{0}' for column - {1}".format(how, col))
            df[col] = df[col].fillna(how)
            
    else:
        print("Missing value fill cannot be completed")
    return df


def min_max_scaler(df,columns):
    '''
    Function to do Min-Max scaling
    Required Input - 
        - df = Pandas DataFrame
        - columns = List input of all the columns which need to be min-max scaled
    Expected Output -
        - data = Pandas DataFrame with Min-Max scaled attributes
        - scaler = Fitted MinMaxScaler, which holds the scaling rules
    '''
    scaler = MinMaxScaler()
    data = pd.DataFrame(scaler.fit_transform(df.loc[:,columns]))
    data.index = df.index
    data.columns = columns
    return data, scaler

def replace_non_numeric(df: pd.DataFrame, columns):
    """
    Converts the specified columns of a Pandas dataframe to numeric and drops the rows whose
    values are missing or cannot be parsed as numbers. The dataframe is modified in place.

    Parameters:
        df (pd.DataFrame): The dataframe to process.
        columns (list): A list of column names to convert.

    Returns:
        pd.DataFrame: The updated dataframe without the rows that held non-numeric values.
    """
    for col in columns:
        # Values that cannot be parsed become NaN and are dropped together with existing NaNs
        df[col] = pd.to_numeric(df[col], errors='coerce')
        df.dropna(subset=[col], inplace=True)
    return df

def z_scaler(df,columns):
    '''
    Function to standardize features by removing the mean and scaling to unit variance
    Required Input - 
        - df = Pandas DataFrame
        - columns = List input of all the columns which need to be standardized
    Expected Output -
        - data = Pandas DataFrame with standardized (zero mean, unit variance) attributes
        - scaler = Fitted StandardScaler, which holds the scaling rules
    '''
    scaler = StandardScaler()
    data = pd.DataFrame(scaler.fit_transform(df.loc[:,columns]))
    data.index = df.index
    data.columns = columns
    return data, scaler
    
def label_encoder(df,columns):
    '''
    Function to label encode
    Required Input - 
        - df = Pandas DataFrame
        - columns = List input of all the columns which needs to be label encoded
    Expected Output -
        - df = Pandas DataFrame with label encoded columns
        - le_dict = Dictionary of all the column and their label encoders
    '''
    le_dict = {}
    for c in columns:
        print("Label encoding column - {0}".format(c))
        lbl = LabelEncoder()
        lbl.fit(list(df[c].values.astype('str')))
        df[c] = lbl.transform(list(df[c].values.astype('str')))
        le_dict[c] = lbl
    return df, le_dict

def one_hot_encoder(df, columns):
    '''
    Function to one-hot encode columns
    Required Input - 
        - df = Pandas DataFrame
        - columns = List input of all the columns which needs to be one-hot encoded
    Expected Output -
        - df = Pandas DataFrame with one-hot encoded columns
    '''
    for each in columns:
        print("One-Hot encoding column - {0}".format(each))
        dummies = pd.get_dummies(df[each], prefix=each, drop_first=False)
        df = pd.concat([df, dummies], axis=1)
    return df.drop(columns,axis = 1)

####### Feature Engineering ############
def create_date_features(df,column, date_format = None, more_features = False, time_features = False):
    '''
    Function to extract date features
    Required Input - 
        - df = Pandas DataFrame
        - column = Name of the column containing the date field
        - date_format = Date parsing format
        - more_features = To get more feature extracted
        - time_features = To extract hour from datetime field
    Expected Output -
        - df = Pandas DataFrame with additional extracted date features
    '''
    # Assign with df[column] so the column is replaced and takes the datetime dtype
    if date_format is None:
        df[column] = pd.to_datetime(df[column])
    else:
        df[column] = pd.to_datetime(df[column], format = date_format)
    df.loc[:,column+'_Year'] = df.loc[:,column].dt.year
    df.loc[:,column+'_Month'] = df.loc[:,column].dt.month.astype('uint8')
    # Series.dt.week was removed in pandas 2.0; isocalendar().week is its replacement
    df.loc[:,column+'_Week'] = df.loc[:,column].dt.isocalendar().week.astype('uint8')
    df.loc[:,column+'_Day'] = df.loc[:,column].dt.day.astype('uint8')
    
    if more_features:
        df.loc[:,column+'_Quarter'] = df.loc[:,column].dt.quarter.astype('uint8')
        df.loc[:,column+'_DayOfWeek'] = df.loc[:,column].dt.dayofweek.astype('uint8')
        df.loc[:,column+'_DayOfYear'] = df.loc[:,column].dt.dayofyear
        
    if time_features:
        df.loc[:,column+'_Hour'] = df.loc[:,column].dt.hour.astype('uint8')
    return df

def target_encoder(train_df, col_name, target_name, test_df = None, how='mean'):
    '''
    Function to do target encoding
    Required Input - 
        - train_df = Training Pandas Dataframe
        - test_df = Testing Pandas Dataframe
        - col_name = Name of the columns of the source variable
        - target_name = Name of the columns of target variable
        - how = 'mean' default but can also be 'count'
    Expected Output - 
        - train_df = Training dataframe with added encoded features
        - test_df = Testing dataframe with added encoded features
    '''
    aggregate_data = train_df.groupby(col_name)[target_name] \
                    .agg([how]) \
                    .reset_index() \
                    .rename(columns={how: col_name+'_'+target_name+'_'+how})
    if test_df is None:
        return join_df(train_df,aggregate_data,left_on = col_name)
    else:
        return join_df(train_df,aggregate_data,left_on = col_name), join_df(test_df,aggregate_data,left_on = col_name)
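
The target encoding step above (group by the source column, aggregate the target, left-join the result back) can be illustrated on a tiny example. This is a self-contained sketch that mirrors the logic of `target_encoder` with plain pandas calls; the toy frames, the column names `city` and `price`, and the global-mean fallback for unseen categories are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical toy data: "C" appears in the test set only
train = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B"],
    "price": [10.0, 20.0, 30.0, 40.0, 50.0],
})
test = pd.DataFrame({"city": ["A", "B", "C"]})

# Mean of the target per category, computed on the training data only
encoding = (
    train.groupby("city")["price"]
    .agg(["mean"])
    .reset_index()
    .rename(columns={"mean": "city_price_mean"})
)

# Left joins keep every original row, as in join_df with method='left'
train_enc = train.merge(encoding, how="left", on="city")
test_enc = test.merge(encoding, how="left", on="city")

# A category never seen in training has no encoding and comes out as NaN.
# Falling back to the global training mean is one common choice (an assumption here).
n_unseen = int(test_enc["city_price_mean"].isna().sum())
global_mean = train["price"].mean()
test_enc["city_price_mean"] = test_enc["city_price_mean"].fillna(global_mean)

print(train_enc)
print(test_enc)
```

Because each training row's encoding includes its own target value, this plain form can leak target information; computing the encoding out-of-fold is a frequently used mitigation.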

### Scikit-Learn

In [64]:
#Scikit-Learn Sub-modules

# Scikit-Learn library is organized into several sub-modules, each of which contains a set of related functions and classes. 
# Here are the main sub-modules in scikit-learn:

#from sklearn."sub-module" import "model"
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression 


# sklearn.datasets: This sub-module provides a set of standard datasets for machine learning, including iris, 
#     digits, and breast cancer.
iris_data = load_iris()
iris_features = iris_data.data 
iris_target = iris_data.target

# Convert the data to a DataFrame (pd is pandas, imported in the pre-processing cell above)
df = pd.DataFrame(iris_features, columns=iris_data.feature_names)

# Add the target variable to the DataFrame
df['target'] = iris_target 

# print(iris_data.DESCR) - Describes the data 
# iris_data.data: An array containing the feature values for each instance of the dataset.
# iris_data.target: An array containing the class labels (i.e., 0, 1, or 2) for each instance of the dataset.
# iris_data.target_names: An array containing the names of the three classes 
# iris_data.feature_names: An array containing the names of the attributes 

# sklearn.model_selection: This sub-module contains functions for model selection, such as splitting data into 
#     training and test sets, cross-validation, and grid search.

# sklearn.preprocessing: This sub-module provides functions for preprocessing data, such as scaling, normalization, 
#     and encoding categorical variables.

# sklearn.feature_extraction: This sub-module contains functions for feature extraction from raw data, 
#     such as text data, including Bag of Words, CountVectorizer, and TfidfVectorizer.

# sklearn.metrics: This sub-module provides functions for evaluating the performance of machine learning models, 
#     such as accuracy, precision, recall, and F1 score.

# sklearn.pipeline: This sub-module provides tools for building machine learning pipelines, 
#     which allows you to chain together multiple steps, such as feature extraction, preprocessing, and model selection.

# sklearn.decomposition: This sub-module provides classes for matrix factorization and decomposition, 
#     such as Principal Component Analysis (PCA), Non-negative Matrix Factorization (NMF), 
#     and Latent Dirichlet Allocation (LDA).

# sklearn.discriminant_analysis: This sub-module provides classes for linear and quadratic discriminant analysis, 
#     which are used for supervised classification tasks.

# sklearn.covariance: This sub-module provides classes for covariance estimation, such as EmpiricalCovariance and 
#     ShrunkCovariance.

# sklearn.exceptions: This sub-module contains the custom exceptions and warnings raised by scikit-learn, such as 
#     NotFittedError and ConvergenceWarning.


#Models: 


# sklearn.linear_model: This sub-module contains classes for linear models, such as linear regression, 
#     logistic regression, and ridge regression.

# sklearn.tree: This sub-module provides classes for decision trees, such as DecisionTreeClassifier and 
#     DecisionTreeRegressor.

# sklearn.ensemble: This sub-module contains classes for ensemble models, such as random forests, AdaBoost, 
#     and Gradient Boosting.

# sklearn.cluster: This sub-module provides classes for clustering, such as KMeans and AgglomerativeClustering 
#     (hierarchical clustering).

# sklearn.neural_network: This sub-module contains classes for simple neural networks, such as the Multi-Layer 
#     Perceptron (MLPClassifier, MLPRegressor) and BernoulliRBM. It does not include Convolutional Neural Networks 
#     (CNNs); those require a deep learning library such as TensorFlow or PyTorch.

# sklearn.svm: This sub-module contains classes for Support Vector Machines (SVMs), such as SVC for classification 
#     and SVR for regression.

# sklearn.manifold: This sub-module provides classes for manifold learning, such as t-SNE and Isomap.

# sklearn.naive_bayes: This sub-module provides classes for Naive Bayes models, such as Gaussian Naive Bayes and 
#     Multinomial Naive Bayes.

# sklearn.neighbors: This sub-module provides classes for k-Nearest Neighbors (k-NN) models, 
#     such as KNeighborsClassifier and KNeighborsRegressor.
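
Several of the sub-modules described above are typically used together. The sketch below combines `datasets`, `model_selection`, `preprocessing`, `pipeline`, `linear_model` and `metrics` on the iris data; the specific estimator, split ratio and fold count are illustrative assumptions rather than recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# sklearn.datasets: load features and labels as NumPy arrays
X, y = load_iris(return_X_y=True)

# sklearn.model_selection: hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# sklearn.pipeline + sklearn.preprocessing + sklearn.linear_model:
# scaling and the classifier are fitted together as one estimator
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

# sklearn.model_selection: 5-fold cross-validation on the training portion
cv_scores = cross_val_score(clf, X_train, y_train, cv=5)

# sklearn.metrics: accuracy on the held-out test set
clf.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, clf.predict(X_test))

print("CV accuracy per fold:", cv_scores)
print("Test accuracy:", test_accuracy)
```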


### Machine Learning Regression

In [None]:
## Importing required libraries
import pandas as pd ## For DataFrame operation
import numpy as np ## Numerical python for matrix operations
from sklearn.model_selection import KFold, train_test_split ## Creating cross validation sets
from sklearn import metrics ## For loss functions
import matplotlib.pyplot as plt

## Libraries for Regression algorithms
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
import xgboost as xgb
import lightgbm as lgb 
from sklearn.ensemble import ExtraTreesRegressor,RandomForestRegressor
import lime
import lime.lime_tabular

########### Cross Validation ###########
### 1) Train test split
def holdout_cv(X,y,size = 0.3, seed = 1):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = size, random_state = seed)
    # drop expects a boolean; drop=True discards the old index instead of adding it as a column
    X_train = X_train.reset_index(drop=True)
    X_test = X_test.reset_index(drop=True)
    return X_train, X_test, y_train, y_test

### 2) Cross-Validation (K-Fold)
def kfold_cv(X,n_folds = 5, seed = 1):
    cv = KFold(n_splits = n_folds, random_state = seed, shuffle = True)
    return cv.split(X)

########### Model Explanation ###########
## Variable Importance plot
def feature_importance(model,X):
    feature_importance = model.feature_importances_
    feature_importance = 100.0 * (feature_importance / feature_importance.max())
    sorted_idx = np.argsort(feature_importance)
    pos = np.arange(sorted_idx.shape[0]) + .5
    plt.figure(figsize=(15, 15))
    plt.subplot(1, 2, 2)
    plt.barh(pos, feature_importance[sorted_idx], align='center')
    plt.yticks(pos, X.columns[sorted_idx])
    plt.xlabel('Relative Importance')
    plt.title('Variable Importance')
    plt.show()

########### Functions for explanation using Lime ###########

## Make a prediction function
def make_prediction_function(model, type = None):
    if type == 'xgb':
        predict_fn = lambda x: model.predict(xgb.DMatrix(x)).astype(float)
    else:
        predict_fn = lambda x: model.predict(x).astype(float)
    return predict_fn

## Make a lime explainer
def make_lime_explainer(df, c_names = [], verbose_val = True):
    explainer = lime.lime_tabular.LimeTabularExplainer(df.values,
                                                       class_names=c_names,
                                                       feature_names = list(df.columns),
                                                       kernel_width=3, 
                                                       verbose=verbose_val,
                                                       mode='regression'
                                                    )
    return explainer

## Lime explain function
def lime_explain(explainer,predict_fn, df, index = 0, num_features = None,
                 show_in_notebook = True, filename = None):
    if num_features is not None:
        exp = explainer.explain_instance(df.values[index], predict_fn, num_features=num_features)
    else:
        exp = explainer.explain_instance(df.values[index], predict_fn, num_features=df.shape[1])
    
    if show_in_notebook:
        exp.show_in_notebook(show_all=False)
    
    if filename is not None:
        exp.save_to_file(filename)
        
########### Algorithms For Regression ###########

### Running Xgboost
def runXGB(train_X, train_y, test_X, test_y=None, test_X2=None, seed_val=0, 
           rounds=500, dep=8, eta=0.05,sub_sample=0.7,col_sample=0.7,
           min_child_weight_val=1, silent_val = 1):
    params = {}
    params["objective"] = "reg:squarederror"  # "reg:linear" is the deprecated name of this objective
    params['eval_metric'] = 'rmse'
    params["eta"] = eta
    params["subsample"] = sub_sample
    params["min_child_weight"] = min_child_weight_val
    params["colsample_bytree"] = col_sample
    params["max_depth"] = dep
    params["verbosity"] = 0 if silent_val else 1  # "silent" was replaced by "verbosity"
    params["seed"] = seed_val
    #params["max_delta_step"] = 2
    #params["gamma"] = 0.5
    num_rounds = rounds

    plst = list(params.items())
    xgtrain = xgb.DMatrix(train_X, label=train_y)

    if test_y is not None:
        xgtest = xgb.DMatrix(test_X, label=test_y)
        watchlist = [ (xgtrain,'train'), (xgtest, 'test') ]
        # Pass the watchlist by keyword; newer XGBoost releases no longer accept it positionally
        model = xgb.train(plst, xgtrain, num_rounds, evals=watchlist, early_stopping_rounds=100, verbose_eval=20)
    else:
        xgtest = xgb.DMatrix(test_X)
        model = xgb.train(plst, xgtrain, num_rounds)
    
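    # ntree_limit was deprecated and later removed in newer XGBoost releases; iteration_range replaces it.
    # best_iteration may be unset when early stopping did not run, so fall back to all trees, i.e. (0, 0).
    best_iter = getattr(model, "best_iteration", None)
    iter_range = (0, best_iter + 1) if best_iter is not None else (0, 0)
    pred_test_y = model.predict(xgtest, iteration_range=iter_range)
    
    pred_test_y2 = 0
    if test_X2 is not None:
        pred_test_y2 = model.predict(xgb.DMatrix(test_X2), iteration_range=iter_range)
    
    loss = 0
    if test_y is not None:
        loss = metrics.mean_squared_error(test_y, pred_test_y)
    return pred_test_y, loss, pred_test_y2, model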
        
### Running LightGBM
def runLGB(train_X, train_y, test_X, test_y=None, test_X2=None, feature_names=None, 
           seed_val=0, rounds=500, dep=8, eta=0.05,sub_sample=0.7,
           col_sample=0.7,silent_val = 1,min_data_in_leaf_val = 20, bagging_freq = 5):
    params = {}
    params["objective"] = "regression"
    params['metric'] = 'rmse'
    params["max_depth"] = dep
    params["min_data_in_leaf"] = min_data_in_leaf_val
    params["learning_rate"] = eta
    params["bagging_fraction"] = sub_sample
    params["feature_fraction"] = col_sample
    params["bagging_freq"] = bagging_freq
    params["bagging_seed"] = seed_val
    params["verbosity"] = silent_val
    num_rounds = rounds
    
    lgtrain = lgb.Dataset(train_X, label=train_y)
    
    if test_y is not None:
        lgtest = lgb.Dataset(test_X, label=test_y)
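        # LightGBM 4.0 removed the early_stopping_rounds and verbose_eval arguments of lgb.train;
        # the equivalent behaviour is configured through callbacks
        model = lgb.train(params, lgtrain, num_rounds, valid_sets=[lgtest],
                          callbacks=[lgb.early_stopping(stopping_rounds=100), lgb.log_evaluation(period=20)])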
    else:
        lgtest = lgb.Dataset(test_X)
        model = lgb.train(params, lgtrain, num_rounds)
        
    pred_test_y = model.predict(test_X, num_iteration=model.best_iteration)
    
    pred_test_y2 = 0
    if test_X2 is not None:
        pred_test_y2 = model.predict(test_X2, num_iteration=model.best_iteration)
    
    loss = 0
    if test_y is not None:
        loss = metrics.mean_squared_error(test_y, pred_test_y)
        print(loss)
    return pred_test_y, loss, pred_test_y2, model
        
### Running Extra Trees  
def runET(train_X, train_y, test_X, test_y=None, test_X2=None, rounds=100, depth=20,
          leaf=10, feat=0.2, min_data_split_val=2,seed_val=0,job = -1):
    model = ExtraTreesRegressor(
                                n_estimators = rounds,
                                max_depth = depth,
                                min_samples_split = min_data_split_val,
                                min_samples_leaf = leaf,
                                max_features =  feat,
                                n_jobs = job,
                                random_state = seed_val)
    model.fit(train_X, train_y)
    train_preds = model.predict(train_X)
    test_preds = model.predict(test_X)
    
    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict(test_X2)
    
    test_loss = 0
    if test_y is not None:
        train_loss = metrics.mean_squared_error(train_y, train_preds)
        test_loss = metrics.mean_squared_error(test_y, test_preds)
        print("Depth, leaf, feat : ", depth, leaf, feat)
        print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model
 
### Running Random Forest
def runRF(train_X, train_y, test_X, test_y=None, test_X2=None, rounds=100, depth=20, leaf=10,
          feat=0.2,min_data_split_val=2,seed_val=0,job = -1):
    model = RandomForestRegressor(
                                n_estimators = rounds,
                                max_depth = depth,
                                min_samples_split = min_data_split_val,
                                min_samples_leaf = leaf,
                                max_features =  feat,
                                n_jobs = job,
                                random_state = seed_val)
    model.fit(train_X, train_y)
    train_preds = model.predict(train_X)
    test_preds = model.predict(test_X)
    
    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict(test_X2)
    
    test_loss = 0
    
    if test_y is not None:  # test_y is optional, so only score when labels are given
        train_loss = metrics.mean_squared_error(train_y, train_preds)
        test_loss = metrics.mean_squared_error(test_y, test_preds)
        print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model

### Running Linear regression
def runLR(train_X, train_y, test_X, test_y=None, test_X2=None):
    model = LinearRegression()
    model.fit(train_X, train_y)
    train_preds = model.predict(train_X)
    test_preds = model.predict(test_X)

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict(test_X2)
    test_loss = 0
    
    if test_y is not None:  # test_y is optional, so only score when labels are given
        train_loss = metrics.mean_squared_error(train_y, train_preds)
        test_loss = metrics.mean_squared_error(test_y, test_preds)
        print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model

### Running Decision Tree
def runDT(train_X, train_y, test_X, test_y=None, test_X2=None, criterion='squared_error',  # 'mse' was renamed to 'squared_error' in scikit-learn 1.0
          depth=None, min_split=2, min_leaf=1):
    model = DecisionTreeRegressor(
                                criterion = criterion, 
                                max_depth = depth, 
                                min_samples_split = min_split, 
                                min_samples_leaf=min_leaf)
    model.fit(train_X, train_y)
    train_preds = model.predict(train_X)
    test_preds = model.predict(test_X)

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict(test_X2)
    
    test_loss = 0
    
    if test_y is not None:  # test_y is optional, so only score when labels are given
        train_loss = metrics.mean_squared_error(train_y, train_preds)
        test_loss = metrics.mean_squared_error(test_y, test_preds)
        print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model
    
### Running K-Nearest Neighbour
def runKNN(train_X, train_y, test_X, test_y=None, test_X2=None, 
           neighbors=5, job = -1):
    model = KNeighborsRegressor(
                                n_neighbors=neighbors, 
                                n_jobs=job)
    model.fit(train_X, train_y)
    train_preds = model.predict(train_X)
    test_preds = model.predict(test_X)

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict(test_X2)
    
    test_loss = 0
    
    if test_y is not None:  # test_y is optional, so only score when labels are given
        train_loss = metrics.mean_squared_error(train_y, train_preds)
        test_loss = metrics.mean_squared_error(test_y, test_preds)
        print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model

### Running SVM (support vector regression, SVR)
def runSVC(train_X, train_y, test_X, test_y=None, test_X2=None, C=1.0, 
           eps=0.1, kernel_choice = 'rbf'):
    model = SVR(
                C=C, 
                kernel=kernel_choice,  
                epsilon=eps)
    model.fit(train_X, train_y)
    train_preds = model.predict(train_X)
    test_preds = model.predict(test_X)

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict(test_X2)
    
    test_loss = 0
    
    if test_y is not None:  # test_y is optional, so only score when labels are given
        train_loss = metrics.mean_squared_error(train_y, train_preds)
        test_loss = metrics.mean_squared_error(test_y, test_preds)
        print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model
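
The regression helpers above train LightGBM with the `rmse` metric but report `metrics.mean_squared_error`, so the printed loss is on a squared scale compared with the training log. A minimal sketch of how the two relate, in plain Python with made-up numbers; `mse_score` and `rmse_score` are hypothetical helper names, not scikit-learn functions:

```python
import math


def mse_score(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse_score(y_true, y_pred):
    """Square root of the MSE, expressed in the same units as the target."""
    return math.sqrt(mse_score(y_true, y_pred))


# Made-up targets and predictions, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 9.0, 9.0]

mse = mse_score(y_true, y_pred)    # errors are 1, 0, -2, 0
rmse = rmse_score(y_true, y_pred)
print("MSE:", mse, "RMSE:", rmse)
```

Taking the square root of a reported MSE is enough to compare it with an RMSE value logged during training.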

### Machine Learning Classification

In [None]:
## Importing required libraries
import pandas as pd  ## For DataFrame operation
import numpy as np  ## Numerical python for matrix operations
from sklearn.model_selection import (
    KFold,
    train_test_split,
)  ## Creating cross validation sets
from sklearn import metrics  ## For loss functions
import matplotlib.pyplot as plt
import itertools

## For evaluation
from sklearn.metrics import (
    roc_curve,
    auc,
    roc_auc_score,
    confusion_matrix,
    precision_recall_curve,
    average_precision_score,
)
from inspect import signature

## Libraries for Classification algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
import lime
import lime.lime_tabular


def split_data(X, y, test_size=0.2, val_size=0.2, random_state=42): #split data into train, test, and validation
    """
    Split the data into train, validation, and test sets. The test set is held out
    first; the validation set is then taken from the remaining training data.
    
    X : pandas DataFrame
        The input features.
    y : pandas Series
        The target values.
    test_size : float, optional (default=0.2)
        The proportion of the data to be used for testing.
    val_size : float, optional (default=0.2)
        The proportion of the remaining training data to be used for validation
        (with the defaults, roughly 64% train, 16% validation, 20% test).
    random_state : int, optional (default=42)
        The seed used by the random number generator.
    
    Returns
    -------
    X_train : pandas DataFrame
        The training input data.
    y_train : pandas Series
        The training target data.
    X_valid : pandas DataFrame
        The validation input data.
    y_valid : pandas Series
        The validation target data.
    X_test : pandas DataFrame
        The test input data.
    y_test : pandas Series
        The test target data.
    """ 
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    
    X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=val_size, random_state=random_state)
    
    return X_train, y_train, X_valid, y_valid, X_test, y_test

########### Cross Validation ###########
### 1) Train test split
def holdout_cv(X, y, size=0.3, seed=1):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=size, random_state=seed
    )
    X_train = X_train.reset_index(drop=True)  # drop expects a boolean
    X_test = X_test.reset_index(drop=True)
    return X_train, X_test, y_train, y_test


### 2) Cross-Validation (K-Fold)
def kfold_cv(X, n_folds=5, seed=1):
    cv = KFold(n_splits=n_folds, random_state=seed, shuffle=True)
    return cv.split(X)


########### Model Explanation ###########
## Plotting AUC ROC curve
def plot_roc(y_actual, y_pred):
    """
    Function to plot AUC-ROC curve
    """
    fpr, tpr, thresholds = roc_curve(y_actual, y_pred)
    plt.plot(
        fpr,
        tpr,
        color="b",
        label=r"Model (AUC = %0.2f)" % (roc_auc_score(y_actual, y_pred)),
        lw=2,
        alpha=0.8,
    )
    plt.plot(
        [0, 1],
        [0, 1],
        linestyle="--",
        lw=2,
        color="r",
        label="Luck (AUC = 0.5)",
        alpha=0.8,
    )
    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("Receiver operating characteristic example")
    plt.legend(loc="lower right")
    plt.show()


def plot_precisionrecall(y_actual, y_pred):
    """
    Function to plot the precision-recall curve
    """
    average_precision = average_precision_score(y_actual, y_pred)
    precision, recall, _ = precision_recall_curve(y_actual, y_pred)
    # In matplotlib < 1.5, plt.fill_between does not have a 'step' argument
    step_kwargs = (
        {"step": "post"} if "step" in signature(plt.fill_between).parameters else {}
    )

    plt.figure(figsize=(9, 6))
    plt.step(recall, precision, color="b", alpha=0.2, where="post")
    plt.fill_between(recall, precision, alpha=0.2, color="b", **step_kwargs)

    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.ylim([0.0, 1.05])
    plt.xlim([0.0, 1.0])
    plt.title("Precision-Recall curve: AP={0:0.2f}".format(average_precision))


## Plotting confusion matrix
def plot_confusion_matrix(
    y_true,
    y_pred,
    classes,
    normalize=False,
    title="Confusion matrix",
    cmap=plt.cm.Blues,
):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    cm = metrics.confusion_matrix(y_true, y_pred)
    if normalize:
        cm = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print("Confusion matrix, without normalization")

    print(cm)

    plt.imshow(cm, interpolation="nearest", cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = ".2f" if normalize else "d"
    thresh = cm.max() / 2.0
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(
            j,
            i,
            format(cm[i, j], fmt),
            horizontalalignment="center",
            color="white" if cm[i, j] > thresh else "black",
        )

    plt.tight_layout()
    plt.ylabel("True label")
    plt.xlabel("Predicted label")


## Variable Importance plot
def feature_importance(model, X):
    feature_importance = model.feature_importances_
    feature_importance = 100.0 * (feature_importance / feature_importance.max())
    sorted_idx = np.argsort(feature_importance)
    pos = np.arange(sorted_idx.shape[0]) + 0.5
    plt.figure(figsize=(15, 15))
    plt.subplot(1, 2, 2)
    plt.barh(pos, feature_importance[sorted_idx], align="center")
    plt.yticks(pos, X.columns[sorted_idx])
    plt.xlabel("Relative Importance")
    plt.title("Variable Importance")
    plt.show()


## Functions for explanation using LIME
def make_prediction_function(model):
    predict_fn = lambda x: model.predict_proba(x).astype(float)
    return predict_fn


def make_lime_explainer(df, c_names=None, k_width=3, verbose_val=True):
    explainer = lime.lime_tabular.LimeTabularExplainer(
        df.values,
        class_names=c_names,
        feature_names=list(df.columns),
        kernel_width=k_width,
        verbose=verbose_val,
    )
    return explainer


def lime_explain(
    explainer,
    predict_fn,
    df,
    index=0,
    num_features=None,
    show_in_notebook=True,
    filename=None,
):
    if num_features is not None:
        exp = explainer.explain_instance(
            df.values[index], predict_fn, num_features=num_features
        )
    else:
        exp = explainer.explain_instance(
            df.values[index], predict_fn, num_features=df.shape[1]
        )

    if show_in_notebook:
        exp.show_in_notebook(show_all=False)

    if filename is not None:
        exp.save_to_file(filename)


########### Algorithms For Binary classification ###########

### Running Xgboost
def runXGB(
    train_X,
    train_y,
    test_X,
    test_y=None,
    test_X2=None,
    seed_val=0,
    rounds=500,
    dep=8,
    eta=0.05,
    sub_sample=0.7,
    col_sample=0.7,
    min_child_weight_val=1,
    silent_val=1,
):
    params = {}
    params["objective"] = "binary:logistic"
    params["eval_metric"] = "auc"
    params["eta"] = eta
    params["subsample"] = sub_sample
    params["min_child_weight"] = min_child_weight_val
    params["colsample_bytree"] = col_sample
    params["max_depth"] = dep
    params["verbosity"] = 0 if silent_val else 1  # newer XGBoost releases use "verbosity" in place of "silent"
    params["seed"] = seed_val
    # params["max_delta_step"] = 2
    # params["gamma"] = 0.5
    num_rounds = rounds

    plst = list(params.items())
    xgtrain = xgb.DMatrix(train_X, label=train_y)

    if test_y is not None:
        xgtest = xgb.DMatrix(test_X, label=test_y)
        watchlist = [(xgtrain, "train"), (xgtest, "test")]
        model = xgb.train(
            plst,
            xgtrain,
            num_rounds,
            watchlist,
            early_stopping_rounds=100,
            verbose_eval=20,
        )
    else:
        xgtest = xgb.DMatrix(test_X)
        model = xgb.train(plst, xgtrain, num_rounds)

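    # Predict with the trees up to the best iteration when early stopping ran;
    # iteration_range=(0, 0) uses every tree.
    best_iter = getattr(model, "best_iteration", None)
    iter_range = (0, best_iter + 1) if best_iter is not None else (0, 0)
    pred_test_y = model.predict(xgtest, iteration_range=iter_range)

    pred_test_y2 = 0
    if test_X2 is not None:
        pred_test_y2 = model.predict(xgb.DMatrix(test_X2), iteration_range=iter_range)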

    loss = 0
    if test_y is not None:
        loss = metrics.roc_auc_score(test_y, pred_test_y)
        return pred_test_y, loss, pred_test_y2, model
    else:
        return pred_test_y, loss, pred_test_y2, model


### Running Xgboost classifier for model explanation
def runXGBC(
    train_X,
    train_y,
    test_X,
    test_y=None,
    test_X2=None,
    seed_val=0,
    rounds=500,
    dep=8,
    eta=0.05,
    sub_sample=0.7,
    col_sample=0.7,
    min_child_weight_val=1,
    silent_val=1,
):
    model = xgb.XGBClassifier(
        objective="binary:logistic",
        learning_rate=eta,
        subsample=sub_sample,
        min_child_weight=min_child_weight_val,
        colsample_bytree=col_sample,
        max_depth=dep,
        verbosity=0 if silent_val else 1,
        random_state=seed_val,
        n_estimators=rounds,
    )

    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]

    test_loss = 0
    if test_y is not None:
        train_loss = metrics.roc_auc_score(train_y, train_preds)
        test_loss = metrics.roc_auc_score(test_y, test_preds)
        print("Train and Test AUC : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model


### Running LightGBM
def runLGB(
    train_X,
    train_y,
    test_X,
    test_y=None,
    test_X2=None,
    feature_names=None,
    seed_val=0,
    rounds=500,
    dep=8,
    eta=0.05,
    sub_sample=0.7,
    col_sample=0.7,
    silent_val=1,
    min_data_in_leaf_val=20,
    bagging_freq=5,
    n_thread=20,
    metric="auc",
):
    params = {}
    params["objective"] = "binary"
    params["metric"] = metric
    params["max_depth"] = dep
    params["min_data_in_leaf"] = min_data_in_leaf_val
    params["learning_rate"] = eta
    params["bagging_fraction"] = sub_sample
    params["feature_fraction"] = col_sample
    params["bagging_freq"] = bagging_freq
    params["bagging_seed"] = seed_val
    params["verbosity"] = silent_val
    params["num_threads"] = n_thread
    num_rounds = rounds

    lgtrain = lgb.Dataset(train_X, label=train_y)

    if test_y is not None:
        lgtest = lgb.Dataset(test_X, label=test_y)
        model = lgb.train(
            params,
            lgtrain,
            num_rounds,
            valid_sets=[lgtrain, lgtest],
            # LightGBM 4.x takes early stopping and logging as callbacks
            callbacks=[lgb.early_stopping(100), lgb.log_evaluation(20)],
        )
    else:
        lgtest = lgb.Dataset(test_X)
        model = lgb.train(params, lgtrain, num_rounds)

    pred_test_y = model.predict(test_X, num_iteration=model.best_iteration)

    pred_test_y2 = 0
    if test_X2 is not None:
        pred_test_y2 = model.predict(test_X2, num_iteration=model.best_iteration)

    loss = 0
    if test_y is not None:
        loss = roc_auc_score(test_y, pred_test_y)
        print(loss)
        return pred_test_y, loss, pred_test_y2, model
    else:
        return pred_test_y, loss, pred_test_y2, model


### Running LightGBM classifier for model explanation
def runLGBC(
    train_X,
    train_y,
    test_X,
    test_y=None,
    test_X2=None,
    seed_val=0,
    rounds=500,
    dep=8,
    eta=0.05,
    sub_sample=0.7,
    col_sample=0.7,
    silent_val=1,
    min_data_in_leaf_val=20,
    bagging_freq=5,
    n_thread=20,
    metric="auc",
):
    model = lgb.LGBMClassifier(
        max_depth=dep,
        learning_rate=eta,
        min_data_in_leaf=min_data_in_leaf_val,
        bagging_fraction=sub_sample,
        feature_fraction=col_sample,
        bagging_freq=bagging_freq,
        bagging_seed=seed_val,
        verbosity=silent_val,
        num_threads=n_thread,
        n_estimators=rounds,
        metric=metric,
    )

    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]

    test_loss = 0
    if test_y is not None:
        train_loss = roc_auc_score(train_y, train_preds)
        test_loss = roc_auc_score(test_y, test_preds)
        print("Train and Test AUC : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model


### Running Extra Trees
def runET(
    train_X,
    train_y,
    test_X,
    test_y=None,
    test_X2=None,
    rounds=100,
    depth=20,
    leaf=10,
    feat=0.2,
    min_data_split_val=2,
    seed_val=0,
    job=-1,
):
    model = ExtraTreesClassifier(
        n_estimators=rounds,
        max_depth=depth,
        min_samples_split=min_data_split_val,
        min_samples_leaf=leaf,
        max_features=feat,
        n_jobs=job,
        random_state=seed_val,
    )
    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]

    test_loss = 0
    if test_y is not None:
        train_loss = metrics.roc_auc_score(train_y, train_preds)
        test_loss = metrics.roc_auc_score(test_y, test_preds)
        print("Depth, leaf, feat : ", depth, leaf, feat)
        print("Train and Test AUC : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model


### Running Random Forest
def runRF(
    train_X,
    train_y,
    test_X,
    test_y=None,
    test_X2=None,
    rounds=100,
    depth=20,
    leaf=10,
    feat=0.2,
    min_data_split_val=2,
    seed_val=0,
    job=-1,
):
    model = RandomForestClassifier(
        n_estimators=rounds,
        max_depth=depth,
        min_samples_split=min_data_split_val,
        min_samples_leaf=leaf,
        max_features=feat,
        n_jobs=job,
        random_state=seed_val,
    )
    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]

    test_loss = 0
    if test_y is not None:
        train_loss = metrics.roc_auc_score(train_y, train_preds)
        test_loss = metrics.roc_auc_score(test_y, test_preds)
        print("Train and Test AUC : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model


### Running Logistic Regression
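def runLR(train_X, train_y, test_X, test_y=None, test_X2=None, C=1.0, penalty="l1", solver="liblinear"):
    # The default lbfgs solver rejects the l1 penalty; liblinear supports both l1 and l2
    model = LogisticRegression(C=C, penalty=penalty, solver=solver)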
    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]
    test_loss = 0

    if test_y is not None:  # test_y is optional, so only score when labels are given
        train_loss = metrics.roc_auc_score(train_y, train_preds)
        test_loss = metrics.roc_auc_score(test_y, test_preds)
        print("Train and Test AUC : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model


### Running Decision Tree
def runDT(
    train_X,
    train_y,
    test_X,
    test_y=None,
    test_X2=None,
    criterion="gini",
    depth=None,
    min_split=2,
    min_leaf=1,
):
    model = DecisionTreeClassifier(
        criterion=criterion,
        max_depth=depth,
        min_samples_split=min_split,
        min_samples_leaf=min_leaf,
    )
    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]

    test_loss = 0

    if test_y is not None:  # test_y is optional, so only score when labels are given
        train_loss = metrics.roc_auc_score(train_y, train_preds)
        test_loss = metrics.roc_auc_score(test_y, test_preds)
        print("Train and Test AUC : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model


### Running K-Nearest Neighbour
def runKNN(train_X, train_y, test_X, test_y=None, test_X2=None, neighbors=5, job=-1):
    model = KNeighborsClassifier(n_neighbors=neighbors, n_jobs=job)
    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]

    test_loss = 0

    if test_y is not None:  # test_y is optional, so only score when labels are given
        train_loss = metrics.roc_auc_score(train_y, train_preds)
        test_loss = metrics.roc_auc_score(test_y, test_preds)
        print("Train and Test AUC : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model


### Running SVM
def runSVC(
    train_X, train_y, test_X, test_y=None, test_X2=None, C=1.0, kernel_choice="rbf"
):
    model = SVC(C=C, kernel=kernel_choice, probability=True)
    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]

    test_loss = 0

    if test_y is not None:  # test_y is optional, so only score when labels are given
        train_loss = metrics.roc_auc_score(train_y, train_preds)
        test_loss = metrics.roc_auc_score(test_y, test_preds)
        print("Train and Test AUC : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model
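
The classification helpers above return `roc_auc_score` in the `test_loss` slot. Unlike a loss, AUC is a score where higher is better: it can be read as the share of (positive, negative) pairs in which the positive example gets the higher prediction. A small sketch of that reading, in plain Python with made-up labels and scores; `pairwise_auc` is a hypothetical helper, not a scikit-learn function, and its quadratic loop is meant for illustration rather than large datasets:

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Made-up binary labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc_value = pairwise_auc(y_true, y_score)
print("AUC:", auc_value)
```

Three of the four positive/negative pairs are ordered correctly here, so a value near 0.5 means the ranking is close to random and a value near 1.0 means positives are almost always ranked first.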


### Natural Language Processing (NLP)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import re
import string
import nltk
from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

eng_stop = set(stopwords.words('english'))


def word_grams(text, min=1, max=4):
    '''
    Function to create N-grams from text
    Required Input -
        - text = list of tokens (for example text.split()) for which N-grams are created
        - min = minimum number of N
        - max = maximum number of N
    Expected Output -
        - s = list of N-grams 
    '''
    s = []
    for n in range(min, max+1):
        for ngram in ngrams(text, n):
            s.append(' '.join(str(i) for i in ngram))
    return s


def generate_bigrams_df(df, column_names):
    """
    Generate bigrams from specified columns in a pandas DataFrame.

    Parameters:
    df (pd.DataFrame): DataFrame to generate bigrams from.
    column_names (list of str): List of column names to generate bigrams from.

    Returns:
    pd.DataFrame: DataFrame with bigrams appended as new columns.
    """
    bigram_columns = []
    for col in column_names:
        bigram_col = f"{col}_bigrams"
        bigram_columns.append(bigram_col)
        # Build word bigrams with word_grams (defined above); min=2 and max=2 keep bigrams only
        df[bigram_col] = df[col].apply(lambda x: word_grams(str(x).split(), 2, 2))
    return df[bigram_columns]

def make_wordcloud(df,column, bg_color='white', w=1200, h=1000, font_size_max=50, n_words=40,g_min=1,g_max=1):
    '''
    Function to make wordcloud from a text corpus
    Required Input -
        - df = Pandas DataFrame
        - column = name of column containing text
        - bg_color = Background color
        - w = width
        - h = height
        - font_size_max = maximum font size allowed
        - n_words = maximum words allowed
        - g_min = minimum n-grams
        - g_max = maximum n-grams
    Expected Output -
        - Word cloud image
    '''
    text = ""
    for ind, row in df.iterrows(): 
        text += row[column] + " "
    text = text.strip().split(' ') 
    text = word_grams(text,g_min,g_max)
    
    text = list(pd.Series(word_grams(text,1,2)).apply(lambda x: x.replace(' ','_')))
    
    s = ""
    for i in range(len(text)):
        s += text[i] + " "

    wordcloud = WordCloud(background_color=bg_color, \
                          width=w, \
                          height=h, \
                          max_font_size=font_size_max, \
                          max_words=n_words).generate(s)
    wordcloud.recolor(random_state=1)
    plt.rcParams['figure.figsize'] = (20.0, 10.0)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
    
def generate_wordcloud(df, column_names):
    """
    Generates a wordcloud from a pandas DataFrame

    Parameters:
    df (pd.DataFrame): DataFrame containing the data
    column_names (list): List of column names to generate the wordcloud from; each cell must hold a list of tokens (for example the output of tokenize_df)

    Returns:
    None
    """
    all_words = ' '.join([' '.join(text) for col in column_names for text in df[col]])
    wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_words)

    plt.figure(figsize=(10, 7))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis('off')
    plt.show()
    
    
def get_tokens(text):
    '''
    Function to tokenize the text
    Required Input - 
        - text - text string which needs to be tokenized
    Expected Output -
        - text - tokenized list output
    '''
    return word_tokenize(text)

def tokenize_columns(dataframe, columns):
    """
    Tokenize the values in specified columns of a pandas DataFrame.

    Parameters:
        dataframe (pandas.DataFrame): The DataFrame to tokenize.
        columns (list): A list of column names to tokenize.

    Returns:
        pandas.DataFrame: A new DataFrame with tokenized values in the specified columns.
    """
    # Download necessary NLTK resources if they haven't been downloaded yet
    nltk.download('punkt')

    # Create a new DataFrame to hold the tokenized values
    tokenized_df = pd.DataFrame()

    # Tokenize the values in each specified column
    for col in columns:
        # Tokenize the values in the current column using NLTK's word_tokenize function
        tokenized_values = dataframe[col].apply(nltk.word_tokenize)

        # Add the tokenized values to the new DataFrame
        tokenized_df[col] = tokenized_values

    # Return the new DataFrame with tokenized values
    return tokenized_df

# Another way: a plain separator-based tokenizer that does not need NLTK
def tokenize(text, sep=' ', preserve_case=False):
    """
    Tokenize a string into a list of tokens.

    Parameters:
    text (str): String to be tokenized
    sep (str, optional): Separator to use for tokenization. Defaults to ' '.
    preserve_case (bool, optional): Whether to preserve the case of the text. Defaults to False.

    Returns:
    list: List of tokens
    """
    if not preserve_case:
        text = text.lower()
    tokens = text.split(sep)
    return tokens

def tokenize_df(df, column_names, sep=' ', preserve_case=False):
    """
    Tokenize a pandas dataframe with multiple columns.

    Parameters:
    df (pd.DataFrame): Dataframe to be tokenized
    column_names (list of str): List of column names to be tokenized (the columns are overwritten in place)
    sep (str, optional): Separator to use for tokenization. Defaults to ' '.
    preserve_case (bool, optional): Whether to preserve the case of the text. Defaults to False.

    Returns:
    pd.DataFrame: Tokenized dataframe
    """
    for col in column_names:
        df[col] = df[col].apply(lambda x: tokenize(x, sep, preserve_case))
    return df

# Example usage; assumes a DataFrame named carbon_google1 with a "title" column already exists
carbon_google1 = tokenize_df(carbon_google1, column_names=["title"], sep=' ', preserve_case=False)

def bag_of_words_features(df, text_columns, target_columns):
    """
    Build a bag-of-words representation of one or more text columns.

    Parameters:
    df (pandas DataFrame): The DataFrame to extract features from.
    text_columns (list of str): Columns whose text is joined row-wise and vectorized.
    target_columns (str or list of str): Target column(s). Rows with a missing target are
        dropped before vectorizing. Pass None to return the features only.

    Returns:
    pandas DataFrame: The bag-of-words features, or a (features, target) tuple when
        target_columns is given.
    """
    if target_columns:
        # Drop rows with a missing target first so features and target stay aligned
        subset = [target_columns] if isinstance(target_columns, str) else list(target_columns)
        df = df.dropna(subset=subset)

    text_data = df[text_columns].apply(lambda x: " ".join([str(i) for i in x]), axis=1)

    text_data = text_data.str.lower()
    vectorizer = CountVectorizer(max_df=0.90, min_df=4, max_features=1000, stop_words=None)
    X_bow = vectorizer.fit_transform(text_data)
    feature_names = vectorizer.get_feature_names_out()

    # Reuse the filtered index so the rows of X_bow line up with y
    X_bow = pd.DataFrame(X_bow.toarray(), columns=feature_names, index=df.index)
    
    if target_columns:        
        y = df[target_columns]
        return X_bow, y
    
    return X_bow

def convert_lowercase(text):
    '''
    Function to lowercase the text
    Required Input - 
        - text - text string which needs to be lowercased
    Expected Output -
        - text - lower cased text string output
    '''
    return text.lower()

def remove_unwanted_characters(df, columns):
    """
    Remove unwanted characters (the symbols $ # & * @ % plus smileys and emojis) from specified columns in a pandas DataFrame.

    Parameters:
    df (pd.DataFrame): The input DataFrame.
    columns (list): A list of column names to clean.

    Returns:
    pd.DataFrame: The cleaned DataFrame.
    """
    import re 
    unwanted_chars = '[$#&*@%]'
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\u2764\ufe0f" # heart emoji
                           "]+", flags=re.UNICODE)
    for col in columns:
        if col in df.columns:
            df[col] = df[col].apply(lambda x: emoji_pattern.sub(r'', x))
            df[col] = df[col].str.replace(unwanted_chars, '', regex=True)  # the pattern is a regex character class
        else:
            print(f"Column '{col}' does not exist in the DataFrame.")
    return df

def remove_punctuations(text):
    '''
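    Function to remove punctuation from the text
    Required Input - 
        - text - text string 
    Expected Output -
        - text - text string with punctuation removed
    '''
    # Python 3 form of str.translate: map every punctuation character to None
    return text.translate(str.maketrans('', '', string.punctuation))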

def remove_stopwords(text):
    '''
    Function to remove English stopwords from the text
    Required Input - 
        - text - text string to be filtered
    Expected Output -
        - text - list output with stopwords removed
    '''
    return [word for word in text.split() if word not in eng_stop]

def remove_short_words(df, column_names, min_length=3):
    """Remove short words from columns in a pandas DataFrame.

    Parameters:
    df (pandas.DataFrame): The DataFrame to modify.
    column_names (List[str]): A list of column names to modify.
    min_length (int, optional): The minimum length of words to keep. Default is 3.

    Returns:
    pandas.DataFrame: The modified DataFrame with short words removed from specified columns.
    """
    for column_name in column_names:
        df[column_name] = df[column_name].apply(
            lambda x: ' '.join([word for word in x.split() if len(word) >= min_length])
        )
    return df

def convert_stemmer(word):
    '''
    Function to stem a word
    Required Input - 
        - word - word which needs to be stemmed
    Expected Output -
        - word - word output after stemming
    '''
    porter_stemmer = PorterStemmer()
    return porter_stemmer.stem(word)

def stem_df(df, column_names):
    """
    Perform stemming on a pandas dataframe with multiple columns.

    Parameters:
    df (pd.DataFrame): Dataframe to be stemmed; each cell is expected to hold a list of tokens
    column_names (list of str): List of column names to be stemmed

    Returns:
    pd.DataFrame: Stemmed dataframe
    """
    stemmer = PorterStemmer()
    for col in column_names:
        df[col] = df[col].apply(lambda x: [stemmer.stem(i) for i in x])
    return df

def convert_lemmatizer(word):
    '''
    Function to lemmatize a word
    Required Input - 
        - word - word which needs to be lemmatized
    Expected Output -
        - word - word output after lemmatizing
    '''
    wordnet_lemmatizer = WordNetLemmatizer()
    return wordnet_lemmatizer.lemmatize(word)
    
def create_tf_idf(df, column, train_df = None, test_df = None,n_features = None):
    '''
    Function to do tf-idf on a pandas dataframe
    Required Input -
        - df = Pandas DataFrame
        - column = name of column containing text
        - train_df(optional) = Train DataFrame
        - test_df(optional) = Test DataFrame
        - n_features(optional) = Maximum number of features needed
    Expected Output -
        - train_tfidf = train tf-idf sparse matrix output
        - test_tfidf = test tf-idf sparse matrix output
        - tfidf_obj = tf-idf model
    '''
    tfidf_obj = TfidfVectorizer(ngram_range=(1,1), stop_words='english', 
                                analyzer='word', max_features = n_features)
    # DataFrame.ix was removed in pandas 1.0; .loc is the label-based replacement
    tfidf_text = tfidf_obj.fit_transform(df.loc[:,column].values)
    
    if train_df is not None:
        train_tfidf = tfidf_obj.transform(train_df.loc[:,column].values)
    else:
        train_tfidf = tfidf_text

    test_tfidf = None
    if test_df is not None:
        test_tfidf = tfidf_obj.transform(test_df.loc[:,column].values)

    return train_tfidf, test_tfidf, tfidf_obj
    
def create_countvector(df, column, train_df = None, test_df = None,n_features = None):
    '''
    Function to do count vectorizer on a pandas dataframe
    Required Input -
        - df = Pandas DataFrame
        - column = name of column containing text
        - train_df(optional) = Train DataFrame
        - test_df(optional) = Test DataFrame
        - n_features(optional) = Maximum number of features needed
    Expected Output -
        - train_cvect = train count vectorized sparse matrix output
        - test_cvect = test count vectorized sparse matrix output
        - cvect_obj = count vectorized model
    '''
    cvect_obj = CountVectorizer(ngram_range=(1,1), stop_words='english', 
                                analyzer='word', max_features = n_features)
    # DataFrame.ix was removed in pandas 1.0; .loc is the label-based replacement
    cvect_text = cvect_obj.fit_transform(df.loc[:,column].values)
    
    if train_df is not None:
        train_cvect = cvect_obj.transform(train_df.loc[:,column].values)
    else:
        train_cvect = cvect_text
        
    test_cvect = None
    if test_df is not None:
        test_cvect = cvect_obj.transform(test_df.loc[:,column].values)

    return train_cvect, test_cvect, cvect_obj
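
The cleaning helpers in this cell (lowercasing, punctuation removal, stopword filtering, short-word removal) can be chained into a single pass over a string. The following is a minimal sketch using only the standard library. `clean_text`, `DEMO_STOPWORDS` and `PUNCT_TABLE` are hypothetical names introduced for illustration, and the tiny stopword set is an assumption that stands in for the notebook's own `eng_stop`.

```python
import string

# Hypothetical, deliberately tiny stopword set for illustration only;
# the notebook's own `eng_stop` is defined elsewhere and is not reproduced here.
DEMO_STOPWORDS = {"the", "is", "a", "of", "and", "to", "in"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text, stopwords=DEMO_STOPWORDS, min_length=3):
    """Lowercase, strip punctuation, then drop stopwords and short words."""
    text = text.lower().translate(PUNCT_TABLE)
    tokens = [w for w in text.split() if w not in stopwords]
    return [w for w in tokens if len(w) >= min_length]


tokens = clean_text("The price of the new phone is $999, and it is GREAT!")
print(tokens)  # ['price', 'new', 'phone', '999', 'great']
```

The order of the steps matters: stopwords are matched after lowercasing and punctuation removal, so "The" and "is," are both caught by the lowercase stopword set.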

### Recommendation Systems (Recsys)

In [None]:
import pandas as pd
import numpy as np
from scipy import sparse
from lightfm import LightFM
from sklearn.metrics.pairwise import cosine_similarity

def create_interaction_matrix(df,user_col, item_col, rating_col, norm= False, threshold = None):
    '''
    Function to create an interaction matrix dataframe from transactional type interactions
    Required Input -
        - df = Pandas DataFrame containing user-item interactions
        - user_col = column name containing user's identifier
        - item_col = column name containing item's identifier
        - rating_col = column name containing user feedback on interaction with a given item
        - norm (optional) = True if a normalization of ratings is needed
        - threshold (required if norm = True) = value above which the rating is favorable
    Expected output - 
        - Pandas dataframe with user-item interactions ready to be fed in a recommendation algorithm
    '''
    interactions = df.groupby([user_col, item_col])[rating_col] \
            .sum().unstack().reset_index(). \
            fillna(0).set_index(user_col)
    if norm:
        # Vectorised equivalent of the element-wise lambda; DataFrame.applymap is deprecated in pandas 2.1+
        interactions = (interactions > threshold).astype(int)
    return interactions

def create_user_dict(interactions):
    '''
    Function to create a user dictionary mapping each user ID to its row position in the interaction dataset
    Required Input - 
        interactions - dataset created by create_interaction_matrix
    Expected Output -
        user_dict - Dictionary type output containing user_id as key and interaction_index as value
    '''
    user_id = list(interactions.index)
    user_dict = {}
    counter = 0 
    for i in user_id:
        user_dict[i] = counter
        counter += 1
    return user_dict
    
def create_item_dict(df,id_col,name_col):
    '''
    Function to create an item dictionary based on their item_id and item name
    Required Input - 
        - df = Pandas dataframe with Item information
        - id_col = Column name containing unique identifier for an item
        - name_col = Column name containing name of the item
    Expected Output -
        item_dict = Dictionary type output containing item_id as key and item_name as value
    '''
    # zip avoids label-based lookups, so this also works when df does not have a default 0..n-1 index
    item_dict = dict(zip(df[id_col], df[name_col]))
    return item_dict

def runMF(interactions, n_components=30, loss='warp', k=15, epoch=30,n_jobs = 4):
    '''
    Function to run matrix-factorization algorithm
    Required Input -
        - interactions = dataset created by create_interaction_matrix
        - n_components = number of embedding dimensions used to represent each item and user
        - loss = loss function ('warp' by default); other options are 'logistic', 'bpr' and 'warp-kos'
        - k = parameter of LightFM's k-OS ('warp-kos') training; passed through to LightFM
        - epoch = number of epochs to run 
        - n_jobs = number of cores used for execution 
    Expected Output  -
        Model - Trained model
    '''
    x = sparse.csr_matrix(interactions.values)
    model = LightFM(no_components= n_components, loss=loss,k=k)
    model.fit(x,epochs=epoch,num_threads = n_jobs)
    return model

def sample_recommendation_user(model, interactions, user_id, user_dict, 
                               item_dict,threshold = 0,nrec_items = 10, show = True):
    '''
    Function to produce user recommendations
    Required Input - 
        - model = Trained matrix factorization model
        - interactions = dataset used for training the model
        - user_id = user ID for which we need to generate recommendation
        - user_dict = Dictionary type input containing user_id as key and interaction_index as value
        - item_dict = Dictionary type input containing item_id as key and item_name as value
        - threshold = value above which the rating is favorable in new interaction matrix
        - nrec_items = Number of output recommendation needed
        - show = if True, print the known and recommended items
    Expected Output - 
        - Prints list of items the given user has already bought
        - Prints list of N recommended items which the user will hopefully be interested in
        - Returns the list of N recommended item IDs
    '''
    n_users, n_items = interactions.shape
    user_x = user_dict[user_id]
    scores = pd.Series(model.predict(user_x,np.arange(n_items)))
    scores.index = interactions.columns
    scores = list(pd.Series(scores.sort_values(ascending=False).index))
    
    user_ratings = interactions.loc[user_id,:]
    known_items = list(pd.Series(user_ratings[user_ratings > threshold].index) \
                       .sort_values(ascending=False))
    
    scores = [x for x in scores if x not in known_items]
    return_score_list = scores[0:nrec_items]
    known_items = list(pd.Series(known_items).apply(lambda x: item_dict[x]))
    scores = list(pd.Series(return_score_list).apply(lambda x: item_dict[x]))
    if show:
        print("Known Likes:")
        for counter, name in enumerate(known_items, start=1):
            # str() keeps the print working when item names are not strings
            print(str(counter) + '- ' + str(name))

        print("\n Recommended Items:")
        for counter, name in enumerate(scores, start=1):
            print(str(counter) + '- ' + str(name))
    return return_score_list
    

def sample_recommendation_item(model,interactions,item_id,user_dict,item_dict,number_of_user):
    '''
    Function to produce a list of top N interested users for a given item
    Required Input -
        - model = Trained matrix factorization model
        - interactions = dataset used for training the model
        - item_id = item ID for which we need to generate recommended users
        - user_dict = Dictionary type input containing user_id as key and interaction_index as value
        - item_dict = Dictionary type input containing item_id as key and item_name as value
        - number_of_user = Number of users needed as an output
    Expected Output -
        - user_list = List of recommended users 
    '''
    n_users, n_items = interactions.shape
    # Column position of the item; unlike searchsorted, get_loc does not require sorted columns
    item_x = interactions.columns.get_loc(item_id)
    scores = pd.Series(model.predict(np.arange(n_users), np.repeat(item_x,n_users)))
    user_list = list(interactions.index[scores.sort_values(ascending=False).head(number_of_user).index])
    return user_list 


def create_item_emdedding_distance_matrix(model,interactions):
    '''
    Function to create item-item cosine similarity matrix from the item embeddings
    Required Input -
        - model = Trained matrix factorization model
        - interactions = dataset used for training the model
    Expected Output -
        - item_emdedding_distance_matrix = Pandas dataframe containing cosine similarity matrix b/w items
    '''
    df_item_norm_sparse = sparse.csr_matrix(model.item_embeddings)
    similarities = cosine_similarity(df_item_norm_sparse)
    item_emdedding_distance_matrix = pd.DataFrame(similarities)
    item_emdedding_distance_matrix.columns = interactions.columns
    item_emdedding_distance_matrix.index = interactions.columns
    return item_emdedding_distance_matrix

def item_item_recommendation(item_emdedding_distance_matrix, item_id, 
                             item_dict, n_items = 10, show = True):
    '''
    Function to create item-item recommendation
    Required Input - 
        - item_emdedding_distance_matrix = Pandas dataframe containing cosine similarity matrix b/w items
        - item_id  = item ID for which we need to generate recommended items
        - item_dict = Dictionary type input containing item_id as key and item_name as value
        - n_items = Number of items needed as an output
        - show = if True, print the item of interest and its recommendations
    Expected Output -
        - recommended_items = List of recommended items
    '''
    recommended_items = list(pd.Series(item_emdedding_distance_matrix.loc[item_id,:]. \
                                  sort_values(ascending = False).head(n_items+1). \
                                  index[1:n_items+1]))
    if show:
        print("Item of interest :{0}".format(item_dict[item_id]))
        print("Item similar to the above item:")
        for counter, i in enumerate(recommended_items, start=1):
            print(str(counter) + '- ' +  item_dict[i])
    return recommended_items
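
The data shapes these helpers pass around can be illustrated without training a model. The sketch below rebuilds the user-item pivot and the user dictionary on a small made-up transaction log, then computes item-item cosine similarity from a hand-written embedding array. Everything in it is an assumption for illustration: the column names, the ratings, and `item_embeddings`, which merely stands in for the `model.item_embeddings` attribute of a trained LightFM model.

```python
import numpy as np
import pandas as pd

# Hypothetical transaction log; column names and values are illustrative only.
ratings = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u3"],
    "item": ["i1", "i2", "i1", "i3", "i2"],
    "rating": [5, 2, 4, 5, 1],
})

# Same pivot idea as create_interaction_matrix: users as rows, items as columns.
interactions = (
    ratings.groupby(["user", "item"])["rating"]
    .sum()
    .unstack()
    .fillna(0)
)

# Binarise: 1 where the summed rating exceeds the threshold, else 0.
threshold = 3
binary = (interactions > threshold).astype(int)

# Map each user ID to its row position, as create_user_dict does.
user_dict = {user_id: pos for pos, user_id in enumerate(binary.index)}

# Made-up stand-in for model.item_embeddings: one 2-d vector per item.
item_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Cosine similarity is the dot product of L2-normalised rows.
unit = item_embeddings / np.linalg.norm(item_embeddings, axis=1, keepdims=True)
similarity = pd.DataFrame(unit @ unit.T, index=binary.columns, columns=binary.columns)

# Items most similar to "i1", excluding "i1" itself.
neighbours = similarity.loc["i1"].drop("i1").sort_values(ascending=False)
print(binary)
print(neighbours.index.tolist())  # ['i3', 'i2']
```

Higher values in `similarity` mean more similar items, which is why the neighbours are sorted in descending order, the same ordering `item_item_recommendation` applies.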