### Importing Libraries

In [None]:
import time

### Machine Learning Models to Master

In [None]:
#Machine Learning Algorithms to Master  
1. Linear and Multiple Linear Regression
2. Logistic Regression
3. Decision Trees
4. Naive Bayes
5. K-Nearest Neighbors
6. Support Vector Machines
7. Random Forests
8. Neural Networks
    1. Convolutional Neural Network (CNN)
    2. Recurrent Neural Network (RNN)
    3. Long Short-Term Memory (LSTM)
    4. Generative Adversarial Network (GAN)
    5. Deep Belief Network (DBN)
    6. Deep Boltzmann Machine (DBM)
    7. Autoencoders
    8. Restricted Boltzmann Machines (RBM)
    9. Hopfield Networks
    10. Self-Organizing Maps (SOM)
9. Gradient Boosting
    1. XGBoost
    2. LightGBM
    3. CatBoost
    4. Gradient Boosting Machines (GBM)
    5. Stochastic Gradient Boosting (SGB)
    6. AdaBoost
    7. Gradient Boosted Decision Trees (GBDT)
    8. DeepBoost
    9. Neural Network Boosting (NNBoost)
    10. Gradient Boosted Regression Trees (GBRT)
10. Reinforcement Learning
11. Dimensionality Reduction Algorithms
    1. Principal Component Analysis (PCA)
    2. Linear Discriminant Analysis (LDA)
    3. Independent Component Analysis (ICA)
    4. Non-Negative Matrix Factorization (NMF)
    5. Factor Analysis
    6. Singular Value Decomposition (SVD)
    7. t-Distributed Stochastic Neighbor Embedding (t-SNE)
    8. Uniform Manifold Approximation and Projection (UMAP)
    9. Autoencoders
    10. Random Projection
    11. Feature Selection
    12. Locally Linear Embedding (LLE)
12. Clustering Algorithms
    1. K-Means Clustering
    2. Hierarchical Clustering
    3. Expectation-Maximization (EM) Clustering
    4. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
    5. Mean-Shift Clustering
    6. Gaussian Mixture Model (GMM) Clustering
    7. Spectral Clustering
    8. Affinity Propagation Clustering
    9. BIRCH Clustering
    10. OPTICS Clustering
13. Autoencoders
14. Transfer Learning
15. Generative Adversarial Networks (GANs)


Data Preprocessing:
    importing the required libraries
    importing the dataset
    handling missing data
    encoding the categorical data
    feature engineering
    splitting the dataset into training set and test set
    feature scaling
    *web scraping with BeautifulSoup
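
The preprocessing steps listed above can be sketched end to end as follows. This is a minimal illustration, not a fixed recipe: the toy DataFrame, its column names (`age`, `city`, `bought`) and the choice of median imputation are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy dataset: one numeric feature, one categorical feature, one target.
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0, 38.0, 29.0, np.nan, 50.0],
    "city": ["Lagos", "Abuja", "Lagos", "Kano", "Kano", "Abuja", "Kano", "Lagos"],
    "bought": [0, 1, 0, 1, 1, 0, 0, 1],
})
X, y = df[["age", "city"]], df["bought"]

# Split first, so imputation and scaling statistics are learned from the training set only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Numeric columns: fill missing values with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Apply the numeric steps to "age" and one-hot encode "city".
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)
print(X_train_prepared.shape, X_test_prepared.shape)
```

Fitting the transformer on the training split and only calling `transform` on the test split keeps test-set information out of the learned statistics.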

Developing the Model:
    model selection
    model evaluation
    model persistence
    ensemble methods
    feature extraction
    feature selection
    feature engineering
    hyperparameter tuning
    model ensembling
    model stacking
    model blending
    model bagging
    model boosting
    model averaging

### Data Pre-Processing in Detail

In [4]:
"""
1. Data Cleaning:
    a. Missing values:
        Removing the training example.
        Filling in the missing value manually.
        Using a standard (constant) value to replace the missing value.
        Using a measure of central tendency (mean, median, mode) of the attribute to replace the missing value.
        Using a measure of central tendency (mean, median, mode) of the attribute, computed over examples of the same class, to replace the missing value.
        Using the most probable value (for example, one predicted by a model) to fill in the missing value.

    b. Noisy Data and Outliers: 
        Binning: Binning smooths a sorted value by consulting the values around it. The sorted values are first divided 
            into bins, and each value is then replaced by a bin summary such as the bin mean or median.
        Regression:  Linear regression and multiple linear regression can be used to smooth the data, where the values 
            are conformed to a function.
        Outlier analysis: Approaches such as clustering can be used to detect outliers and deal with them.

    c. Remove Unwanted Data: Unwanted data is duplicate or irrelevant data. 
    
2. Data Integration:
    Data consolidation: The data is physically brought together to one data store. This usually involves Data Warehousing.
    Data propagation: Applications copy the data from one location to another.
    Data virtualization: An interface is used to provide a real-time and unified view of data from multiple sources. 

3. Data Reduction:
    Missing values ratio: Attributes that have more missing values than a threshold are removed.
    Low variance filter: Normalized attributes whose variance (spread) is below a threshold are also removed 
        because little change in the data means little information.
    High correlation filter: Normalized attributes whose pairwise correlation coefficient exceeds a threshold are reduced 
        to one representative, because similar trends mean similar information is carried. The correlation is usually 
        measured with Pearson's correlation coefficient for numeric attributes and with the chi-square test for nominal ones.
    Principal component analysis: Principal component analysis, or PCA, is a statistical method that reduces the number 
        of attributes by projecting the data onto a few uncorrelated components that capture most of the variance.

4. Data Transformation:
    Smoothing: Eliminating noise in the data to see more data patterns.
    Attribute/feature construction: New attributes are constructed from the given set of attributes.
    Aggregation: Summary and aggregation operations are applied to the given set of attributes to produce new attributes.
    Normalization: The data in each attribute is scaled to a smaller range, for example, 0 to 1 or -1 to 1.
    Discretization: Raw values of the numeric attributes are replaced by discrete or conceptual intervals, 
        which can be further organized into higher-level intervals. 
    Concept hierarchy generation for nominal data: Values for nominal data are generalized to higher-order concepts.


"""

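The three filter-style data reduction steps described in the notes above (missing values ratio, low variance, high correlation) can be sketched with pandas. This is an illustrative sketch only: the function name `reduce_columns`, the threshold defaults and the toy data are assumptions, and min-max scaling is used as the normalization step.

```python
import numpy as np
import pandas as pd

def reduce_columns(df, max_missing=0.5, min_variance=0.01, max_corr=0.9):
    """Apply missing-ratio, low-variance and high-correlation filters in turn."""
    # 1. Missing values ratio: drop columns with too many missing entries.
    df = df[df.columns[df.isna().mean() <= max_missing]]

    # 2. Low variance filter, computed on min-max normalized numeric columns.
    num = df.select_dtypes(include="number")
    span = (num.max() - num.min()).replace(0, 1)  # constant columns: avoid dividing by zero
    variances = ((num - num.min()) / span).var()
    df = df.drop(columns=variances[variances < min_variance].index)

    # 3. High correlation filter: for each highly correlated pair, drop the later column.
    corr = df.select_dtypes(include="number").corr(method="pearson").abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > max_corr).any()]
    return df.drop(columns=to_drop)

# Hypothetical data chosen so that each filter removes exactly one column.
data = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],             # perfectly correlated with "a"
    "c": [7.0, 7.0, 7.0, 7.0, 7.0],              # constant, zero variance
    "d": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% missing
    "e": [5.0, 1.0, 4.0, 2.0, 3.0],
})
reduced = reduce_columns(data)
print(list(reduced.columns))  # ['a', 'e']
```

Suitable thresholds depend on the dataset, so the defaults here are placeholders rather than recommendations.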
### Basic ML notes

In [None]:
# #Cost Function
# A cost function, also known as a loss function or objective function, 
# is a mathematical function that measures the difference between predicted and actual values in machine learning. 
# The purpose of a cost function is to guide the learning algorithm towards finding the optimal model parameters that minimize 
# the difference between the predicted and actual values.

# The choice of cost function depends on the type of problem and the learning algorithm used. 
# Here are some common examples of cost functions and their equations:

# 1. Mean Squared Error (MSE): This cost function is used for regression problems where the goal is to predict a continuous 
#     variable. It measures the average squared difference between the predicted and actual values. The equation for MSE is:

#         MSE = 1/n * ∑(y - y_pred)^2
#         where n is the number of samples, y is the actual value, and y_pred is the predicted value.

# 2. Binary Cross-Entropy: This cost function is used for binary classification problems where the output is either 0 or 1. 
#     It measures the difference between the predicted probability and the actual label. 
#     The equation for binary cross-entropy is:

#         Binary cross-entropy = -1/n * ∑(y * log(y_pred) + (1-y) * log(1-y_pred))
#         where n is the number of samples, y is the actual label (0 or 1), and y_pred is the predicted probability.

# 3. Categorical Cross-Entropy: This cost function is used for multi-class classification problems where the output 
#     can be one of several classes. It measures the difference between the predicted probability distribution and the actual 
#     label. The equation for categorical cross-entropy is:

#         Categorical cross-entropy = -1/n * ∑∑(y_ij * log(y_pred_ij))
#         where n is the number of samples, y_ij is the actual probability for class j in sample i, and y_pred_ij is the predicted probability for class j in sample i.
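
The three formulas above translate almost directly into NumPy. The sketch below is for illustration; the function names and the `eps` clipping (added so that `log(0)` is never evaluated) are assumptions and not part of the formulas themselves.

```python
import numpy as np

def mse(y, y_pred):
    """Mean squared error: 1/n * sum((y - y_pred)^2)."""
    y, y_pred = np.asarray(y, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((y - y_pred) ** 2)

def binary_cross_entropy(y, y_pred, eps=1e-12):
    """-1/n * sum(y * log(y_pred) + (1 - y) * log(1 - y_pred))."""
    y = np.asarray(y, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

def categorical_cross_entropy(y, y_pred, eps=1e-12):
    """-1/n * sum_i sum_j y_ij * log(y_pred_ij), with y given as one-hot rows."""
    y = np.asarray(y, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0)
    return -np.mean(np.sum(y * np.log(y_pred), axis=1))

print(mse([3, 5, 2], [2, 5, 4]))                   # (1 + 0 + 4) / 3
print(binary_cross_entropy([1, 0], [0.5, 0.5]))    # log(2)
print(categorical_cross_entropy(
    [[1, 0, 0], [0, 1, 0]],
    [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]],
))                                                 # log(2)
```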

### Generic Pre-processing

>> Categorical Encoding

In [None]:
#categorical encoding
There are five common techniques to encode (convert) categorical features into numbers:

Mapping Method
Ordinal Encoding
Label Encoding
Pandas Dummies
OneHot Encoding

The choice of categorical encoding method depends on several factors such as the type and nature of the data, 
the number of unique categories in the variable, the type of machine learning algorithm being used, and the performance 
of the encoding method on the dataset. Here are some general guidelines on when to use each method:

One-Hot Encoding
One-hot encoding is a useful technique for handling categorical variables with a small number of unique categories. 
It is particularly useful when the categories are nominal (unordered) or when there is no inherent order or hierarchy 
among the categories. One-hot encoding can be applied to both linear and tree-based machine learning models. 
However, one limitation of one-hot encoding is that it can lead to a high-dimensional feature space, which can be 
computationally expensive and may lead to the curse of dimensionality.

Label Encoding
Label encoding is a useful technique for handling categorical variables with a large number of unique categories. 
It is particularly useful when the categories are ordinal (ordered) or when there is an inherent order or hierarchy 
among the categories. Label encoding can be applied to both linear and tree-based machine learning models. 
However, label encoding assigns integer codes that imply an order. For nominal categories that order is arbitrary, 
which can mislead models that treat the codes as magnitudes (linear models, for example).

In general, it is recommended to use one-hot encoding when dealing with nominal categorical variables and label encoding 
when dealing with ordinal categorical variables. However, it is important to consider the nature of the data and the 
performance of the encoding method on the specific dataset before making a decision on which method to use. 
Additionally, it is often useful to try both encoding methods and compare their performance on the dataset to determine 
the optimal encoding method.
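
The five techniques named above can be compared side by side on a small example. The DataFrame and its columns (`size`, `colour`) are hypothetical. Note that scikit-learn documents `LabelEncoder` as intended for target labels, so it appears here only to show the codes it produces.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],  # ordinal: small < medium < large
    "colour": ["red", "green", "blue", "green"],    # nominal: no natural order
})

# Mapping method: an explicit dictionary controls the order.
size_mapped = df["size"].map({"small": 0, "medium": 1, "large": 2})

# Ordinal encoding: the same idea, with the category order passed to the encoder.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_ordinal = ordinal.fit_transform(df[["size"]]).ravel()

# Label encoding: codes follow the sorted category names, not a meaningful order.
colour_label = LabelEncoder().fit_transform(df["colour"])

# Pandas dummies: one indicator column per category.
colour_dummies = pd.get_dummies(df["colour"], prefix="colour")

# One-hot encoding: same result as dummies, but reusable on new data after fitting.
onehot = OneHotEncoder()
colour_onehot = onehot.fit_transform(df[["colour"]]).toarray()

print(size_mapped.tolist())            # [0, 2, 1, 0]
print(colour_label.tolist())           # [2, 1, 0, 1]
print(list(colour_dummies.columns))    # ['colour_blue', 'colour_green', 'colour_red']
```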

>> Variable Transformation

In [None]:
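import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import FunctionTransformer, PowerTransformer

#Logarithmic (only defined for positive numbers) - log(X)
#Square root / power (square root requires non-negative values) - sqrt(X), X**lambda
#Exponential - exp(X)
#Reciprocal (not defined for zero) - 1/X
#Box-Cox (defined only for positive values X>0)
#Yeo-Johnson (an adaptation of Box-Cox that also handles zero and negative values)

#NB: if data is positively skewed (right skewed), try a logarithmic, reciprocal, or square root transformation
    #if data is negatively skewed (left skewed), try a power (e.g. square) or exponential transformation
    #Box-Cox and Yeo-Johnson estimate their power parameter from the data, so they can be tried in either case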

#check if dataset is normally distributed or not.
def diagnostic_plots(df, variable):

    # function to plot a histogram and a Q-Q plot
    # side by side, for a certain variable

    plt.figure(figsize=(15, 6))

    # histogram
    plt.subplot(1, 2, 1)
    df[variable].hist(bins=30)
    plt.title(f"Histogram of {variable}")

    # q-q plot
    plt.subplot(1, 2, 2)
    stats.probplot(df[variable], dist="norm", plot=plt)
    plt.title(f"Q-Q plot of {variable}")

    # check for skewness
    skewness = df[variable].skew()
    if skewness > 0:
        skew_type = "positively skewed"
    elif skewness < 0:
        skew_type = "negatively skewed"
    else:
        skew_type = "approximately symmetric"
        
    # print message indicating skewness type
    print(f"The variable {variable} is {skew_type} (skewness = {skewness:.2f})")
    
    plt.show()


#log transform 
def log_transform(df, columns):
    """
    Transforms specified columns of a pandas DataFrame using log(1 + x) (np.log1p).

    Parameters:
    -----------
    df : pandas DataFrame
        The DataFrame to transform.
    columns : list
        A list of column names to transform.

    Returns:
    --------
    pandas DataFrame
        The transformed DataFrame.
    """
    transformer = FunctionTransformer(np.log1p, validate=True)
    X = df.values.copy()
    X[:, df.columns.isin(columns)] = transformer.transform(X[:, df.columns.isin(columns)])
    X_log = pd.DataFrame(X, index=df.index, columns=df.columns)
    return X_log

#reciprocal transformation
def reciprocal_transform(df, columns):
    """
    Transforms specified columns of a pandas DataFrame using the reciprocal transformation.

    Parameters:
    -----------
    df : pandas DataFrame
        The DataFrame to transform.
    columns : list
        A list of column names to transform.

    Returns:
    --------
    pandas DataFrame
        The transformed DataFrame.
    """
    transformer = FunctionTransformer(lambda x: 1/x, validate=True)
    X = df.values.copy()
    X[:, df.columns.isin(columns)] = transformer.transform(X[:, df.columns.isin(columns)])
    X_recip = pd.DataFrame(X, index=df.index, columns=df.columns)
    return X_recip

#square root transformation
def sqrt_transform(df, columns):
    """
    Transforms specified columns of a pandas DataFrame using the square root function.

    Parameters:
    -----------
    df : pandas DataFrame
        The DataFrame to transform.
    columns : list
        A list of column names to transform.

    Returns:
    --------
    pandas DataFrame
        The transformed DataFrame.
    """
    transformer = FunctionTransformer(np.sqrt, validate=True)
    X = df.values.copy()
    X[:, df.columns.isin(columns)] = transformer.transform(X[:, df.columns.isin(columns)])
    X_sqrt = pd.DataFrame(X, index=df.index, columns=df.columns)
    return X_sqrt

#exponential transformation
def exp_transform(df, columns):
    """
    Transforms specified columns of a pandas DataFrame using the exponential function.

    Parameters:
    -----------
    df : pandas DataFrame
        The DataFrame to transform.
    columns : list
        A list of column names to transform.

    Returns:
    --------
    pandas DataFrame
        The transformed DataFrame.
    """
    transformer = FunctionTransformer(np.exp, validate=True)
    X = df.values.copy()
    X[:, df.columns.isin(columns)] = transformer.transform(X[:, df.columns.isin(columns)])
    X_exp = pd.DataFrame(X, index=df.index, columns=df.columns)
    return X_exp

#box-cox transformation
def boxcox_transform(df, columns):
    """
    Transforms specified columns of a pandas DataFrame using the Box-Cox transformation.

    Parameters:
    -----------
    df : pandas DataFrame
        The DataFrame to transform.
    columns : list
        A list of column names to transform.

    Returns:
    --------
    pandas DataFrame
        The transformed DataFrame.
    """
    transformer = PowerTransformer(method='box-cox', standardize=False)
    X = df.copy()
    X[columns] = transformer.fit_transform(X[columns])
    return X


#Yeo-Johnson
def yeo_johnson_transform(df, columns):
    """
    Transforms specified columns of a pandas DataFrame using the Yeo-Johnson transformation.

    Parameters:
    -----------
    df : pandas DataFrame
        The DataFrame to transform.
    columns : list
        A list of column names to transform.

    Returns:
    --------
    pandas DataFrame
        The transformed DataFrame.
    """
    transformer = PowerTransformer(method='yeo-johnson', standardize=False)
    X = df.copy()
    X[columns] = transformer.fit_transform(X[columns])
    return X

"""
A normal distribution is characterized by a bell-shaped curve that is symmetric around the mean. 
The mean, median, and mode of a normal distribution are all equal, and approximately 68% of the data falls within one 
standard deviation of the mean, 95% falls within two standard deviations, and 99.7% falls within three 
standard deviations.
"""

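The 68%, 95% and 99.7% figures quoted above can be checked from the standard normal distribution, where the probability of falling within k standard deviations of the mean equals erf(k / sqrt(2)). The helper name `prob_within` is an assumption made for this sketch.

```python
import math

def prob_within(k):
    """P(|Z| <= k) for a standard normal variable Z, via the error function."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```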
>> Discretization

In [None]:
# Discretization in machine learning is the process of transforming continuous variables into discrete or 
# categorical variables. This process involves dividing the range of a continuous variable into a finite number of 
# intervals or bins, and then assigning each observation to a particular bin based on the value of the continuous 
# variable. 

#Discretization approaches: equal width, equal frequency, K means, Decision Trees
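
Three of the approaches listed above can be sketched on a small, deliberately skewed sample: equal width with `pd.cut`, equal frequency with `pd.qcut`, and K-means binning with scikit-learn's `KBinsDiscretizer`. The sample values and bin counts are assumptions chosen so the difference between the methods is easy to see. Decision-tree discretization is not shown.

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical sample with one large outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

# Equal width: three bins of the same width, so the outlier sits alone in the last bin.
equal_width = pd.cut(values, bins=3, labels=False)

# Equal frequency: five bins holding the same number of observations each.
equal_freq = pd.qcut(values, q=5, labels=False)

# K-means: bin edges are placed between one-dimensional cluster centres.
kmeans = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="kmeans")
kmeans_bins = kmeans.fit_transform(values.to_frame()).ravel()

print(equal_width.tolist())   # [0, 0, 0, 0, 0, 0, 0, 0, 0, 2]
print(equal_freq.tolist())    # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
print(kmeans_bins.tolist())
```

With skewed data, equal-width bins can end up nearly empty, while equal-frequency bins hide how far apart the values in one bin are.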


>> Pipeline

In [3]:
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()

# Define the feature union (for feature selection or extraction)
feature_union = FeatureUnion(
    transformer_list=[
        ('pca', PCA(n_components=2)),
        ('univariate', SelectKBest(chi2, k=1))
    ]
)

# Define the pipeline
pipe = Pipeline([
    ('features', feature_union),
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

# Fit and predict using the pipeline
pipe.fit(iris.data, iris.target)
preds = pipe.predict(iris.data)





>> Feature Selection and Extraction

In [None]:
# feature selection is a technique of selecting relevant features from the original feature set, while feature 
# extraction is a technique of creating new features from the original feature set. Feature selection is typically 
# used when the original features are already informative, but there are some irrelevant or redundant features that 
# need to be removed to improve the model performance. 

# Feature extraction, on the other hand, is used when the original features are not informative enough, and 
# new features need to be created to capture the underlying patterns in the data.

# There are two ways to mitigate the curse of dimensionality (dimensionality reduction):
#Feature selection
    # Removing features with low variance (VarianceThreshold)
    # Univariate Feature Selection (SelectKBest, SelectPercentile, GenericUnivariateSelect)
    # Recursive Feature Elimination (RFE, RFECV)        RFECV - RFE with cross-validation
    # Feature selection using SelectFromModel (SelectFromModel) - use penalized linear models (Lasso and ElasticNet are L1-based; Ridge is L2-based) or tree-based models
    # Sequential Feature Selection (SequentialFeatureSelector) - SFS can be either forward or backward


#Feature extraction
    # Principal Component Analysis (PCA)
    # Independent Component Analysis (ICA)
    # t-Distributed Stochastic Neighbor Embedding (t-SNE)




# Variance threshold: - # Removing features with low variance
# This technique removes all features whose variance is below a certain threshold. 
# This is done using the VarianceThreshold function from scikit-learn library.
from sklearn.feature_selection import VarianceThreshold

def variance_threshold(X, threshold=0.0):
    selector = VarianceThreshold(threshold=threshold)
    X_new = selector.fit_transform(X)
    return X_new


# SelectKBest:
# This technique selects the K best features based on univariate statistical tests. 
# This is done using the SelectKBest function from scikit-learn library
from sklearn.feature_selection import SelectKBest, f_classif

def select_k_best(X, y, k=10):
    selector = SelectKBest(score_func = f_classif, k=k)
    X_new = selector.fit_transform(X, y)
    return X_new


# Principal Component Analysis (PCA):
# This technique reduces the dimensionality of the data by projecting it onto a lower dimensional space. 
# This is done using the PCA function from scikit-learn library.
from sklearn.decomposition import PCA

def pca(X, n_components=2):
    pca = PCA(n_components=n_components)
    X_new = pca.fit_transform(X)
    return X_new 
                    # # Get the loadings of the original variables in each component
                    # loadings = pca.components_
                    # # Print the names of the columns that were extracted
                    # print("Columns extracted:")
                    # for i in range(loadings.shape[0]):
                    #     max_loading_index = loadings[i].argmax()
                    #     column_name = data.columns[max_loading_index]
                    #     print(f"Component {i+1}: {column_name}")

# Independent Component Analysis (ICA):
# This technique extracts independent sources from the data by maximizing their statistical independence. 
# This is done using the FastICA function from scikit-learn library.
from sklearn.decomposition import FastICA

def ica(X, n_components=2):
    ica = FastICA(n_components=n_components)
    X_new = ica.fit_transform(X)
    return X_new 


# t-distributed Stochastic Neighbor Embedding (t-SNE):
# This technique is used for visualizing high-dimensional data in a low-dimensional space. 
# This is done using the TSNE function from scikit-learn library.
from sklearn.manifold import TSNE

def tsne(X, n_components=2, perplexity=30):
    tsne = TSNE(n_components=n_components, perplexity=perplexity)
    X_new = tsne.fit_transform(X)
    return X_new


#Feature Selection 
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
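import numpy as np
from sklearn.datasets import load_iris

# Example data (an assumption for illustration; substitute your own X and y)
X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)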


# Univariate Feature Selection
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

# Recursive Feature Elimination
estimator = RandomForestClassifier()
selector = RFE(estimator, n_features_to_select=2, step=1)
selector = selector.fit(X, y)
X_new = selector.transform(X)

# Principal Component Analysis (PCA)
pca = PCA(n_components=2)
X_new = pca.fit_transform(X)

# L1-based feature selection with Lasso (you can use any model of your choice)
lasso = Lasso(alpha=0.1)
lasso.fit(X_scaled, y)
model = SelectFromModel(lasso, prefit=True) 
X_new = model.transform(X_scaled)

# Ridge Regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_scaled, y)
model = SelectFromModel(ridge, prefit=True)
X_new = model.transform(X_scaled)

# Elastic Net
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_scaled, y)
model = SelectFromModel(elastic, prefit=True)
X_new = model.transform(X_scaled)

# Tree-based Feature Selection
rf = RandomForestClassifier()
rf.fit(X, y)
model = SelectFromModel(rf, prefit=True)
X_new = model.transform(X)

# Mutual Information Feature Selection
X_new = SelectKBest(score_func=mutual_info_classif, k=2).fit_transform(X, y)

# Sequential Feature Selection
estimator = RandomForestClassifier()
selector = SequentialFeatureSelector(estimator, n_features_to_select=2)
selector = selector.fit(X, y)
X_new = selector.transform(X)




>> ML evaluation

In [None]:
#confusion matrix 
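import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# digits, y_test and y_pred are assumed to come from an earlier train/test cell
classes = digits.target_names #or df['target'].unique()
accuracy = accuracy_score(y_test, y_pred)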

def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function plots a confusion matrix.
    """
    cm = confusion_matrix(y_true, y_pred)
    
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    
    for i, j in np.ndindex(cm.shape):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

plot_confusion_matrix(y_test, y_pred, classes=classes,
                      title='Confusion matrix, Accuracy = {:.2f}'.format(accuracy))
# A confusion matrix is a table that is often used to evaluate the performance of a machine learning algorithm. 
# It shows the number of true positives, false positives, true negatives, and false negatives for a given 
# classification task.

# A confusion matrix has two axes: one for the predicted values and one for the actual values. Each axis has two 
#     categories: positive and negative. Therefore, a confusion matrix for a binary classification task will have 
#     four cells:
# True Positive (TP): the actual value was positive, and the predicted value was also positive.
# False Positive (FP): the actual value was negative, but the predicted value was positive.
# True Negative (TN): the actual value was negative, and the predicted value was also negative.
# False Negative (FN): the actual value was positive, but the predicted value was negative.
#                   Predicted Positive    Predicted Negative
# Actual Positive         TP                   FN
# Actual Negative         FP                   TN

#Recall
# the proportion of true positives among the total number of actual positives. It is calculated as TP / (TP + FN).

#Accuracy
# the proportion of true results (both true positives and true negatives) among the total number of cases examined. 
# It is calculated as (TP + TN) / (TP + FP + TN + FN).

#Precision
# The proportion of true positives among the total number of positive predictions. It is calculated as TP / (TP + FP).

#F1-score
# the harmonic mean of precision and recall. It is calculated as 2 * (precision * recall) / (precision + recall).
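
The four formulas above can be computed directly from the confusion-matrix counts. This is a minimal sketch: the function name `classification_metrics` and the example counts are hypothetical, and returning 0.0 for an empty denominator is a convention chosen for the example.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, chosen only for illustration.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.850
# precision: 0.800
# recall: 0.889
# f1: 0.842
```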

>> Others

In [1]:
## Importing required libraries
import datetime ## For timestamp conversion
import matplotlib.pyplot as plt ## For plotting
import pandas as pd ## For DataFrame operation
import numpy as np ## Numerical python for matrix operations
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler, OrdinalEncoder, OneHotEncoder ## Preprocessing function
import pandas_profiling ## For easy profiling of pandas DataFrame (recent releases of this package are named ydata_profiling)
import missingno as msno ## Missing value co-occurrence analysis

####### Data Exploration ############

def print_dim(df):
    '''
    Function to print the dimensions of a given python dataframe
    Required Input -
        - df = Pandas DataFrame
    Expected Output -
        - Data size
    '''
    print("Data size: Rows-{0} Columns-{1}".format(df.shape[0],df.shape[1]))


def print_dataunique(df):
    '''
    Function to print unique information for each column in a python dataframe
    Required Input - 
        - df = Pandas DataFrame
    Expected Output -
        - Column name
        - Data type of that column
        - Number of unique values in that column
        - 5 unique values from that column
    '''
    counter = 0
    for i in df.columns:
        x = df.loc[:,i].unique()
        print(counter, i, df[i].dtype, len(x), x[0:5])  # dtype works even when the index has no label 0
        counter +=1
        
def do_data_profiling(df, filename):
    '''
    Function to do basic data profiling
    Required Input - 
        - df = Pandas DataFrame
        - filename = Path for output file with a .html extension
    Expected Output -
        - HTML file with data profiling summary
    '''
    profile = pandas_profiling.ProfileReport(df)
    profile.to_file(output_file = filename)
    print("Data profiling done")

def view_datatypes_in_perspective(df):
    '''
    Function to group dataframe columns into three common dtypes and visualize the columns
    Required Input - 
        - df = Pandas DataFrame
    Expected Output -
        - three unique datatypes (float, object, others(for the rest))
    '''
    # Counters and column lists are named so they do not shadow the built-ins float and object
    float_count = 0
    float_cols = []
    object_count = 0
    object_cols = []
    others_count = 0
    others_cols = []
    for col in df.columns:
        if df[col].dtype == "float":
            float_count += 1
            float_cols.append(col)
        elif df[col].dtype == "object":
            object_count += 1
            object_cols.append(col)
        else:
            others_count += 1
            others_cols.append(col)
            others_cols.append(df[col].dtype)
    print (f" float = {float_count} \t{float_cols}, \n \nobject = {object_count} \t{object_cols}, \n\nothers = {others_count} \t{others_cols} ")

def missing_value_analysis(df):
    '''
    Function to do basic missing value analysis
    Required Input - 
        - df = Pandas DataFrame
    Expected Output -
        - Chart of Missing value co-occurrence
        - Chart of Missing value heatmap
    '''
    msno.matrix(df)
    msno.heatmap(df)

def view_NaN(df):
    """
    Prints, for every column in a Pandas DataFrame, whether it contains NaN values and how many.

    Parameters:
        - df: Pandas DataFrame

    Returns:
        - None
    """
    for col in df.columns:
        if df[col].isnull().any():
            print(f"there are {df[col].isnull().sum()} NaN values present in column:", col)
        else:
            print("No NaN present in column:", col)

def convert_timestamp(ts):
    """
    Converts a Unix timestamp to a pandas Timestamp (UTC, second precision).

    Args:
        ts (int): The Unix timestamp to convert.

    Returns:
        pandas.Timestamp: The date and time, parsed from a 'YYYY-MM-DD HH:MM:SS' string.
    """
    # fromtimestamp with an explicit UTC timezone replaces the deprecated utcfromtimestamp
    utc_datetime = datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc)
    formatted_datetime = utc_datetime.strftime('%Y-%m-%d %H:%M:%S')
    formatted_datetime = pd.to_datetime(formatted_datetime)
    return formatted_datetime

def visualize_outlier (df: pd.DataFrame):
    # Select only numeric columns
    numeric_cols = df.select_dtypes(include=["float64", "int64"])
    # Set figure size and create boxplot
    fig, ax = plt.subplots(figsize=(12, 6))
    numeric_cols.boxplot(ax=ax, rot=90)
    # Set x-axis label
    ax.set_xlabel("Numeric Columns")
    # Adjust subplot spacing to prevent x-axis labels from being cut off
    plt.subplots_adjust(bottom=0.4) 
    # Increase the size of the plot
    fig.set_size_inches(10, 6)
    # Show the plot
    plt.show()


####### Basic helper function ############

def join_df(left, right, left_on, right_on=None, method='left'):
    '''
    Function to join (merge) two pandas dataframes; a left join by default
    Required Input - 
        - left = Pandas DataFrame 1
        - right = Pandas DataFrame 2
        - left_on = Fields in DataFrame 1 to merge on
        - right_on = Fields in DataFrame 2 to merge with left_on fields of Dataframe 1
        - method = Type of join ('left', 'right', 'inner', 'outer')
    Expected Output -
        - Merged pandas dataframe
    '''
    if right_on is None:
        right_on = left_on
    return left.merge(right, 
                      how=method, 
                      left_on=left_on, 
                      right_on=right_on, 
                      suffixes=("","_y"))
    
####### Pre-processing ############    

def drop_allsame(df):
    '''
    Function to remove any columns which have same value all across
    Required Input - 
        - df = Pandas DataFrame
    Expected Output -
        - Pandas dataframe with dropped no variation columns
    '''
    to_drop = list()
    for i in df.columns:
        if len(df.loc[:,i].unique()) == 1:
            to_drop.append(i)
    return df.drop(to_drop,axis =1)

# ------------------------------------------------------------------------------------------------------
#Handling Missing Values
# ----------------------------------------------------
#fill Nan Values in the cloudCover column
def treat_missing_numeric(df,columns,how = 'mean', value = None):
    '''
    Function to treat missing values in numeric columns
    Required Input - 
        - df = Pandas DataFrame
        - columns = List input of all the columns need to be imputed
        - how = valid values are 'mean', 'mode', 'median', 'ffill', 'digit' (fill with the number given in value)
    Expected Output -
        - Pandas dataframe with imputed missing value in mentioned columns
    '''
    if how == 'mean':
        for i in columns:
            print("Filling missing values with mean for columns - {0}".format(i))
            df[i] = df[i].fillna(df[i].mean())
            
    elif how == 'mode':
        for i in columns:
            print("Filling missing values with mode for columns - {0}".format(i))
            df[i] = df[i].fillna(df[i].mode()[0])  # mode() returns a Series; use the first mode
    
    elif how == 'median':
        for i in columns:
            print("Filling missing values with median for columns - {0}".format(i))
            df[i] = df[i].fillna(df[i].median())
    
    elif how == 'ffill':
        for i in columns:
            print("Filling missing values with forward fill for columns - {0}".format(i))
            df[i] = df[i].ffill()
    
    elif how == 'digit':
        for i in columns:
            print("Filling missing values with {0} for columns - {1}".format(value, i))
            df[i] = df[i].fillna(value)  # keep the fill value numeric so the column stays numeric
      
    else:
        print("Missing value fill cannot be completed")
    return df
treat_missing_numeric(smart_home, ["cloudCover"], how="digit", value = 0.1)  


def treat_missing_categorical(df, columns, how='mode', value=None):
    '''
    Function to treat missing values in categorical columns
    Required Input - 
        - df = Pandas DataFrame
        - columns = List input of all the columns need to be imputed
        - how = 'mode', 'digit' (fill with str(value)), or any other string, which is used as the fill value
        - value = fill value, used only when how = 'digit'
    Expected Output -
        - Pandas dataframe with imputed missing value in mentioned columns
    '''
    if how == 'mode':
        for col in columns:
            print("Filling missing values with mode for column - {0}".format(col))
            df[col] = df[col].fillna(df[col].mode()[0])
            
    # 'digit' must be checked before the generic string branch, otherwise it is never reached
    elif how == 'digit':
        for col in columns:
            print("Filling missing values with {0} for column - {1}".format(value, col))
            df[col] = df[col].fillna(str(value))
            
    elif isinstance(how, str):
        for col in columns:
            print("Filling missing values with '{0}' for column - {1}".format(how, col))
            df[col] = df[col].fillna(how)
            
    else:
        print("Missing value fill cannot be completed")
    return df


#SimpleImputer: This function replaces missing values with a specified strategy.
from sklearn.impute import SimpleImputer

def impute_missing_values(X, strategy='mean', fill_value=None):
    # strategy can be 'mean', 'median', 'most_frequent' or 'constant';
    # fill_value is only used with strategy='constant', e.g. (strategy='constant', fill_value=-1)
    imputer = SimpleImputer(strategy=strategy, fill_value=fill_value)
    X_imputed = imputer.fit_transform(X)
    X_imputed = pd.DataFrame(X_imputed, 
                            columns=X.columns, index=X.index )
    return X_imputed

#MissingIndicator: This function creates a binary indicator for each feature indicating whether the value is missing or not.
from sklearn.impute import MissingIndicator

def create_missing_indicator(X):
    indicator = MissingIndicator()
    X_missing_indicator = indicator.fit_transform(X)
    return X_missing_indicator
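
# A minimal sketch of what SimpleImputer and MissingIndicator return, assuming a small
# made-up table (X_demo and its column names are hypothetical, not part of this notebook's data).

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator, SimpleImputer

# Hypothetical toy data with one missing value in each column
X_demo = pd.DataFrame({"temp": [20.0, np.nan, 24.0],
                       "humidity": [0.5, 0.7, np.nan]})

# Mean imputation: each NaN is replaced by the mean of its own column
mean_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(X_demo),
    columns=X_demo.columns, index=X_demo.index)

# Boolean mask with one column per feature that contained missing values at fit time
missing_mask = MissingIndicator().fit_transform(X_demo)

print(mean_imputed)
print(missing_mask)
```

# The mask can be concatenated to the imputed features so a model can still see
# which values were originally missing.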

#KNNImputer: The missing values are estimated as the average value from the closest K neighbours
    # multivariate imputation
from sklearn.impute import KNNImputer
def knn_impute(X, k):
    imputer = KNNImputer(n_neighbors=k, # the number of neighbours K
                        weights='distance', # the weighting factor
                        metric='nan_euclidean', # the metric to find the neighbours
                        add_indicator=False, # whether to add a missing indicator
                        )
    imputed_X = imputer.fit_transform(X)
    return imputed_X
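
# A small worked example of KNN imputation, assuming a made-up 4x2 array (X_demo is
# hypothetical). It uses weights='uniform' rather than the 'distance' weighting above
# so the result is a plain average of the two nearest rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: the first feature of the third row is missing
X_demo = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [np.nan, 6.0],
                   [8.0, 8.0]])

# Distances are computed on the observed feature only (nan_euclidean metric).
# The two rows closest to the third row are the second and the fourth,
# so the missing value becomes the average of 3.0 and 8.0.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_filled = imputer.fit_transform(X_demo)

print(X_filled)
```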

#IterativeImputer: This function estimates missing values using a predictive model.
from sklearn.experimental import enable_iterative_imputer  # must be imported before IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import BayesianRidge

def impute_missing_values_iteratively(X): #or (X, Columns)
    imputer = IterativeImputer(
        # estimator = RandomForestRegressor() 
        estimator=BayesianRidge(), # the estimator to predict the NA
        initial_strategy='mean', # how will NA be imputed in step 1
        max_iter=10, # number of cycles
        imputation_order='ascending', # the order in which to impute the variables
        n_nearest_features=None, # whether to limit the number of predictors
        skip_complete=True, # whether to ignore variables without NA
        random_state=0,)
        
    # select only the columns with missing values to be imputed
    # X_cols = X[columns]
    X_imputed = imputer.fit_transform(X) #or X_cols
    return X_imputed

#other predictive models include
imputer = IterativeImputer(estimator=BayesianRidge()) #from sklearn.linear_model import BayesianRidge
imputer = IterativeImputer(estimator=LinearRegression()) #from sklearn.linear_model import LinearRegression
imputer = IterativeImputer(estimator=DecisionTreeRegressor()) #from sklearn.tree import DecisionTreeRegressor
imputer = IterativeImputer(estimator=RandomForestRegressor()) #from sklearn.ensemble import RandomForestRegressor
imputer = IterativeImputer(estimator=KNeighborsRegressor()) #from sklearn.neighbors import KNeighborsRegressor
imputer = IterativeImputer(estimator=MLPRegressor()) #from sklearn.neural_network import MLPRegressor

# IterativeImputer in SKlearn is a class that can estimate missing values in a dataset by modeling each feature with 
# missing values as a function of the other features. It does this by taking a predictive model and using it to 
# fill in the missing values iteratively. 
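
# A rough sketch of the idea described above, assuming synthetic data in which the second
# feature is roughly twice the first (x1, x2 and X_missing are hypothetical names). Because
# the features are strongly related, the model-based estimates should land close to the hidden values.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Synthetic data: x2 depends almost linearly on x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.05, size=200)
X_demo = np.column_stack([x1, x2])

# Hide the first 20 values of the second feature
X_missing = X_demo.copy()
X_missing[:20, 1] = np.nan

# Each feature with missing values is regressed on the other features
imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X_missing)

# Compare the imputed values with the values that were hidden
mean_abs_error = np.abs(X_filled[:20, 1] - X_demo[:20, 1]).mean()
print(mean_abs_error)
```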
# ----------------------------------------------------------------------------------

def min_max_scaler(df,columns):
    '''
    Function to do Min-Max scaling
    Required Input - 
        - df = Pandas DataFrame
        - columns = List input of all the columns which needs to be min-max scaled
    Expected Output -
        - df = Pandas DataFrame with Min-Max scaled attributes
        - scaler = Fitted MinMaxScaler object which contains the scaling rules
    '''
    scaler = MinMaxScaler()
    data = pd.DataFrame(scaler.fit_transform(df.loc[:,columns]))
    data.index = df.index
    data.columns = columns
    return data, scaler

def replace_non_numeric(df: pd.DataFrame, columns):
    """
    Coerces the specified columns of a Pandas dataframe to numeric and drops the rows that fail.

    Values that cannot be parsed as numbers become NaN, and every row with NaN in one of
    the specified columns (including rows that were already missing) is dropped in place.

    Parameters:
        df (pd.DataFrame): The dataframe to process.
        columns (list): A list of column names to convert to numeric.

    Returns:
        pd.DataFrame: The updated dataframe with numeric values and no NaN in the specified columns.
    """
    for col in columns:
        # errors='coerce' turns unparseable values into NaN instead of raising
        df[col] = pd.to_numeric(df[col], errors='coerce')
        df.dropna(subset=[col], inplace=True)
    return df

def z_scaler(df,columns):
    '''
    Function to standardize features by removing the mean and scaling to unit variance
    Required Input - 
        - df = Pandas DataFrame
        - columns = List input of all the columns which needs to be standardized
    Expected Output -
        - df = Pandas DataFrame with standardized (zero mean, unit variance) attributes
        - scaler = Fitted StandardScaler object which contains the scaling rules
    '''
    scaler = StandardScaler()
    data = pd.DataFrame(scaler.fit_transform(df.loc[:,columns]))
    data.index = df.index
    data.columns = columns
    return data, scaler
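
# To make the two scaling rules concrete, here is a short check on made-up numbers
# (the values array is hypothetical): min-max scaling computes (x - min) / (max - min),
# and standardization computes (x - mean) / std using the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature data
values = np.array([[10.0], [20.0], [40.0]])

min_max = MinMaxScaler().fit_transform(values)
z_scores = StandardScaler().fit_transform(values)

# The same results computed by hand
manual_min_max = (values - values.min()) / (values.max() - values.min())
manual_z = (values - values.mean()) / values.std()  # np.std defaults to ddof=0

print(min_max.ravel())
print(z_scores.ravel())
```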

mapping_dict = {        #an example
    'First':0,
    'Second': 1,
    'Third': 2 
}
def map_encoding(data, feature_name, mapping_dict):
    """
    Encodes a categorical feature using mapping method.

    Args:
        data (pandas.DataFrame): The DataFrame containing the categorical feature to encode.
        feature_name (str): The name of the categorical feature to encode.
        mapping_dict (dict): A dictionary containing the mapping of category values to integers.

    Returns:
        pandas.DataFrame: The DataFrame with the encoded categorical feature.
    """

    # Create a copy of the original DataFrame
    encoded_data = data.copy()

    # Replace the category values with their corresponding integers
    encoded_data[feature_name] = encoded_data[feature_name].map(mapping_dict)

    return encoded_data
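
# A quick illustration of mapping-based encoding, assuming a made-up 'rank' column (demo and
# rank_mapping are hypothetical). Note that a category missing from the dictionary becomes NaN,
# so it is worth checking for unmapped values after encoding.

```python
import pandas as pd

# Hypothetical data: 'Fourth' has no entry in the mapping
demo = pd.DataFrame({"rank": ["First", "Third", "Second", "Fourth"]})
rank_mapping = {"First": 0, "Second": 1, "Third": 2}

encoded = demo["rank"].map(rank_mapping)
unmapped = demo.loc[encoded.isna(), "rank"].tolist()

print(encoded)
print("Unmapped categories:", unmapped)
```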

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

def ordinal_encoding_sklearn(data, feature_name, categories):
    """
    Encodes a categorical feature using ordinal encoding method with scikit-learn's OrdinalEncoder class.

    Args:
        data (pandas.DataFrame): The DataFrame containing the categorical feature to encode.
        feature_name (str): The name of the categorical feature to encode.
        categories (list): The list of categories in the order of their numerical encoding.

    Returns:
        pandas.DataFrame: The DataFrame with the encoded categorical feature.
    """

    # Create a copy of the original DataFrame
    encoded_data = data.copy()

    # Perform ordinal encoding using the OrdinalEncoder class
    ordinal_encoder = OrdinalEncoder(categories=[categories])
    # fit_transform returns a 2-D array of shape (n_rows, 1); flatten it to fill a single column
    encoded_data[feature_name] = ordinal_encoder.fit_transform(encoded_data[[feature_name]]).ravel()
    
    return encoded_data


def one_hot_encoding_sklearn(data, feature_name):
    """
    Encodes a categorical feature using one-hot encoding method with scikit-learn's OneHotEncoder class.

    Args:
        data (pandas.DataFrame): The DataFrame containing the categorical feature to encode.
        feature_name (str): The name of the categorical feature to encode.

    Returns:
        pandas.DataFrame: The DataFrame with the encoded categorical feature.
    """

    # Perform one-hot encoding using the OneHotEncoder class
    one_hot_encoder = OneHotEncoder()
    encoded_array = one_hot_encoder.fit_transform(data[[feature_name]]).toarray()
    feature_names_out = [f"{feature_name}_{category}" for category in one_hot_encoder.categories_[0]]
    encoded_columns = pd.DataFrame(encoded_array, columns=feature_names_out, index=data.index)

    # Replace the original column with its one-hot encoded columns (the input DataFrame is not modified)
    encoded_data = pd.concat([data.drop(feature_name, axis=1), encoded_columns], axis=1)
    
    return encoded_data

from sklearn.preprocessing import LabelEncoder

def label_encoding_sklearn(data, feature_name):
    """
    Encodes a categorical feature using label encoding method with scikit-learn's LabelEncoder class.

    Args:
        data (pandas.DataFrame): The DataFrame containing the categorical feature to encode.
        feature_name (str): The name of the categorical feature to encode.

    Returns:
        pandas.DataFrame: The DataFrame with the encoded categorical feature.
    """

    # Create a copy of the original DataFrame
    encoded_data = data.copy()

    # Perform label encoding using the LabelEncoder class
    label_encoder = LabelEncoder()
    encoded_data[feature_name] = label_encoder.fit_transform(encoded_data[feature_name])
    
    return encoded_data

    
def label_encoder(df,columns):
    '''
    Function to label encode
    Required Input - 
        - df = Pandas DataFrame
        - columns = List input of all the columns which needs to be label encoded
    Expected Output -
        - df = Pandas DataFrame with label encoded columns
        - le_dict = Dictionary of all the column and their label encoders
    '''
    le_dict = {}
    for c in columns:
        print("Label encoding column - {0}".format(c))
        lbl = LabelEncoder()
        lbl.fit(list(df[c].values.astype('str')))
        df[c] = lbl.transform(list(df[c].values.astype('str')))
        le_dict[c] = lbl
    return df, le_dict

def one_hot_encoder(df, columns):
    '''
    Function to do one-hot encoded
    Required Input - 
        - df = Pandas DataFrame
        - columns = List input of all the columns which needs to be one-hot encoded
    Expected Output -
        - df = Pandas DataFrame with one-hot encoded columns
    '''
    for each in columns:
        print("One-Hot encoding column - {0}".format(each))
        dummies = pd.get_dummies(df[each], prefix=each, drop_first=False)
        df = pd.concat([df, dummies], axis=1)
    return df.drop(columns,axis = 1)
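
# A small example of the get_dummies approach used above, assuming a made-up table
# (demo and its columns are hypothetical). Each category becomes its own indicator column,
# and the original column is dropped.

```python
import pandas as pd

# Hypothetical data with one categorical and one numeric column
demo = pd.DataFrame({"color": ["red", "green", "red"], "size": [1, 2, 3]})

# One indicator column per category, named <prefix>_<category>
dummies = pd.get_dummies(demo["color"], prefix="color")
encoded = pd.concat([demo.drop(columns="color"), dummies], axis=1)

print(encoded)
```

# Passing drop_first=True would drop one indicator per feature, which some linear models
# prefer because the full set of indicators is perfectly collinear.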

####### Feature Engineering ############ 
def create_date_features(df,column, date_format = None, more_features = False, time_features = False): 
    '''
    Function to extract date features
    Required Input - 
        - df = Pandas DataFrame
        - date_format = Date parsing format
        - column = Name of the column containing the date field
        - more_features = To get more feature extracted
        - time_features = To extract hour from datetime field
    Expected Output -
        - df = Pandas DataFrame with additional extracted date features
    '''
    if date_format is None:
        df[column] = pd.to_datetime(df[column])
    else:
        df[column] = pd.to_datetime(df[column], format=date_format)
    df[column+'_Year'] = df[column].dt.year
    df[column+'_Month'] = df[column].dt.month.astype('uint8')
    # Series.dt.week was removed in pandas 2.0; isocalendar().week gives the ISO week number
    df[column+'_Week'] = df[column].dt.isocalendar().week.astype('uint8')
    df[column+'_Day'] = df[column].dt.day.astype('uint8')
    
    if more_features:
        df.loc[:,column+'_Quarter'] = df.loc[:,column].dt.quarter.astype('uint8')
        df.loc[:,column+'_DayOfWeek'] = df.loc[:,column].dt.dayofweek.astype('uint8')
        df.loc[:,column+'_DayOfYear'] = df.loc[:,column].dt.dayofyear
        
    if time_features:
        df.loc[:,column+'_Hour'] = df.loc[:,column].dt.hour.astype('uint8')
    return df
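
# A short sketch of the same date-part extraction on two made-up timestamps (demo and the
# 'ts' column are hypothetical), useful for checking what each accessor returns.

```python
import pandas as pd

# Hypothetical timestamps: a Monday in ISO week 1 and a Saturday in ISO week 52
demo = pd.DataFrame({"ts": ["2016-01-04 13:30:00", "2016-12-31 23:00:00"]})
demo["ts"] = pd.to_datetime(demo["ts"])

demo["ts_Year"] = demo["ts"].dt.year
demo["ts_Month"] = demo["ts"].dt.month
demo["ts_Week"] = demo["ts"].dt.isocalendar().week.astype("uint8")
demo["ts_DayOfWeek"] = demo["ts"].dt.dayofweek  # Monday = 0, Sunday = 6
demo["ts_Hour"] = demo["ts"].dt.hour

print(demo)
```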

def target_encoder(train_df, col_name, target_name, test_df = None, how='mean'):
    '''
    Function to do target encoding
    Required Input - 
        - train_df = Training Pandas Dataframe
        - test_df = Testing Pandas Dataframe
        - col_name = Name of the columns of the source variable
        - target_name = Name of the columns of target variable
        - how = 'mean' default but can also be 'count'
    Expected Output - 
        - train_df = Training dataframe with added encoded features
        - test_df = Testing dataframe with added encoded features
    '''
    aggregate_data = train_df.groupby(col_name)[target_name] \
                    .agg([how]) \
                    .reset_index() \
                    .rename(columns={how: col_name+'_'+target_name+'_'+how})
    if test_df is None:
        return join_df(train_df,aggregate_data,left_on = col_name)
    else:
        return join_df(train_df,aggregate_data,left_on = col_name), join_df(test_df,aggregate_data,left_on = col_name)
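
# A minimal sketch of mean target encoding on made-up data, written with plain pandas
# instead of the join_df helper (train, test and the column names are hypothetical).
# The statistics are computed on the training set only and then joined onto both sets.

```python
import pandas as pd

# Hypothetical training and test data; city "C" never appears in training
train = pd.DataFrame({"city": ["A", "A", "B", "B", "B"],
                      "price": [10.0, 20.0, 30.0, 60.0, 90.0]})
test = pd.DataFrame({"city": ["A", "B", "C"]})

# Mean of the target per category, learned from the training data
city_means = (train.groupby("city")["price"]
                   .mean()
                   .rename("city_price_mean")
                   .reset_index())

train_encoded = train.merge(city_means, how="left", on="city")
test_encoded = test.merge(city_means, how="left", on="city")

print(train_encoded)
print(test_encoded)
```

# Categories unseen in training end up as NaN in the test set and need a fallback,
# for example the overall training mean.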

### Data Visualization

In [None]:
def visualize_df(df):       #best for time series analysis when your index is in datetime 
    """
    Creates an interactive plot of the dataframe.

    Args:
        df: A Pandas DataFrame containing energy consumption data for the PJM Interconnection region.

    Returns:
        None. Displays an interactive plot of the energy consumption data using Plotly.

    Example:
        >>> visualize_df(my_df)
    """
    import plotly.graph_objects as go

    fig = go.Figure(layout=go.Layout(
        height=500,
        width=800,
    ))

    for col in df.columns:
        fig.add_trace(go.Scatter(x=df.index, y=df[col], name=col))

    fig.update_layout(
        title={
            'text': 'PJM Energy Consumption',
            'font': {'size': 25, 'family': 'Arial', 'color': 'black'}
        },
        xaxis_title='Date',
        yaxis_title='Energy Consumption (MW)'
    )

    return fig.show(renderer='svg')



from typing import List

import matplotlib.pyplot as plt
import seaborn as sns
from pandas import DataFrame


def visualize_subplots_boxplots(df: DataFrame, columns: List[str], nrows: int, ncols: int) -> None:
    """
    Creates a grid of subplots containing boxplots of energy consumption per hour of the day.

    Args:
        df: A Pandas DataFrame containing energy consumption data and an 'Hour' column.
        columns: A list of column names to include in the boxplots.
        nrows: The number of rows in the subplot grid.
        ncols: The number of columns in the subplot grid.

    Returns:
        None. Displays and saves a grid of subplots containing one boxplot per column.

    Example:
        >>> visualize_subplots_boxplots(my_df, ['Consumption', 'Generation'], 3, 4)
    """
    # squeeze=False keeps axes two-dimensional even for a single row or column
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(16, 12), squeeze=False)
    fig.suptitle('Hourly Average Energy Consumption', weight='bold', fontsize=25)
    axes = axes.flatten()

    # Remove the axes left over when the grid has more cells than columns to plot
    for unused_ax in axes[len(columns):]:
        fig.delaxes(unused_ax)

    for i, col in enumerate(columns):
        sns.boxplot(data=df, x='Hour', y=col, ax=axes[i], color='#cc444b')

    plt.tight_layout()
    fig.savefig("Images/xxx.png", dpi=300, bbox_inches='tight')
    plt.show()
visualize_subplots_boxplots(df=result_2, columns=['AEP_MW', 'COMED_MW', 'DAYTON_MW', 'DEOK_MW', 'DOM_MW', 'DUQ_MW',
        'EKPC_MW', 'FE_MW', 'NI_MW', 'PJME_MW', 'PJMW_MW'], nrows=6, ncols=2)



from typing import List

import matplotlib.pyplot as plt
import seaborn as sns
from pandas import DataFrame


def visualize_subplots_barcharts(df: DataFrame, columns: List[str], nrows: int, ncols: int) -> None:
    """
    Creates a grid of subplots containing bar charts of energy consumption.

    Args:
        df: A Pandas DataFrame containing energy consumption data and an 'Hour' column.
        columns: A list of column names to include in the bar charts.
        nrows: The number of rows in the subplot grid.
        ncols: The number of columns in the subplot grid.

    Returns:
        None. Displays a grid of subplots containing bar charts of energy consumption.

    Example:
        >>> visualize_subplots_barcharts(my_df, ['Consumption', 'Generation'], 3, 4)
    """
    # squeeze=False keeps axes two-dimensional even for a single row or column
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(16, 12), squeeze=False)
    fig.suptitle('Energy Consumption Bar Charts', weight='bold', fontsize=25)
    axes = axes.flatten()

    # Remove the axes left over when the grid has more cells than columns to plot
    for unused_ax in axes[len(columns):]:
        fig.delaxes(unused_ax)

    for i, col in enumerate(columns):
        sns.barplot(data=df, x='Hour', y=col, ax=axes[i], color='#cc444b')

    plt.tight_layout()
    plt.show()



def moving_average(data: pd.DataFrame, window: int) -> None:
    """
    Calculates and visualizes the moving average of a time series data.

    Args:
        data: A Pandas DataFrame containing the time series data in a 'DAYTON_MW' column.
            A 'Moving Average' column is added to it in place.
        window: An integer representing the window size for calculating the moving average.

    Returns:
        None. Visualizes the actual data and the moving average.

    Example:
        >>> moving_average(my_data, 5)
    """
    # calculate the moving average
    data['Moving Average'] = data['DAYTON_MW'].rolling(window).mean()
    actual = data['DAYTON_MW'][-(window+30):]
    ma = data['Moving Average'][-(window+30):]

    # plot the actual data and moving average
    plt.figure(figsize=(20,8))
    actual.plot(label='Actual', lw=4)
    ma.plot(label='MA-{}'.format(str(window)), ls='--', lw=2)
    plt.title('{}-Days Moving Average'.format(str(window)), weight='bold', fontsize=25, loc= "center", pad=20)
    plt.legend()
    plt.show()


def plot_pivot_line(pivot: pd.DataFrame, xlabel: str, ylabel: str, title: str, savepath: str) -> None:
    """
    Plots a line chart from a pivot table and saves it to a file.

    Args:
        pivot: A Pandas DataFrame containing the data to plot in a pivot table format.
        xlabel: A string representing the label for the X-axis.
        ylabel: A string representing the label for the Y-axis.
        title: A string representing the title of the plot.
        savepath: A string representing the file path to save the plot image.

    Returns:
        None. Displays and saves a line chart from the input pivot table.

    Example:
        >>> plot_pivot_line(my_pivot, "Year", "Deaths", "Deaths by Cardiovascular diseases", "/path/to/image.png")
    """
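    # A single Axes is enough: pivot.plot draws one line per column of the pivot table
    fig, ax = plt.subplots(figsize=(12, 6), dpi=100)
    pivot.plot(kind='line', ax=ax)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(title, weight='bold', fontsize=25, loc="center", pad=20)
    # Save before showing, since some backends release the figure once it has been shown
    fig.savefig(savepath, dpi=300, bbox_inches='tight')
    plt.show()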



def plot_boxplot(focus_countries, save_path):
    """
    Plots a boxplot of deaths caused by Cardiovascular diseases for 70+ years, for different entities.

    Args:
        focus_countries: A Pandas DataFrame containing the relevant data.
        save_path: A string containing the path to save the plot image.

    Returns:
        None. Displays and saves the plot.

    Example:
        >>> plot_boxplot(my_df, "C:/myDrive/xx_image.png")
    """
    ax = sns.boxplot(x="Year", y="Deaths 70+ years", hue='Entity', data=focus_countries, palette='mako')
    ax.set_title('Deaths Caused by Cardiovascular Diseases (70+ Years)', weight='bold', fontsize=25, loc="center", pad=20)
    # The categorical x-axis places each year at positions 0..n-1, so tick positions are
    # indices into the sorted years (this assumes 'Year' is numeric). Label every fifth
    # year to keep the axis readable.
    years = sorted(focus_countries["Year"].unique())
    ax.set_xticks(range(0, len(years), 5))
    ax.set_xticklabels(years[::5])
    plt.savefig(save_path, dpi=300, bbox_inches='tight')
    plt.show()



def plot_data_splitting(train, test):
    """
    Plots the training and test sets of a time series.

    Args:
    train (pandas.DataFrame): DataFrame containing the training set with a DatetimeIndex and a 'PJME_MW' column.
    test (pandas.DataFrame): DataFrame containing the test set with a DatetimeIndex and a 'PJME_MW' column.

    Returns:
    None
    """
    plt.figure(figsize=(20,8))

    plt.plot(train.index, train['PJME_MW'], label='Training Set')
    plt.plot(test.index, test['PJME_MW'], label='Test Set')

    plt.title('Data Splitting', weight='bold', fontsize=25, loc= "center", pad=20)
    # Mark the split date; a Timestamp keeps the line on the datetime axis
    plt.axvline(pd.Timestamp('2015-09-01'), color='black', ls='--', lw=3)
    plt.legend()
    plt.show()

def plot_mean_energy_per_month(df):     #you can change the month column to plot for hour, day, year etc. 
    """
    Plots the mean energy consumption per month for each column in the given DataFrame.

    Args:
        df: A Pandas DataFrame containing energy consumption data.

    Returns:
        None. Displays a plot of the mean energy consumption per month for each column in the DataFrame.

    Example:
        >>> plot_mean_energy_per_month(my_df)
    """
    mean_month = df.groupby('month').agg({i: 'mean' for i in df.columns[:-5].tolist()})
    mean_month[mean_month.columns[0:13].tolist()].plot(subplots=True, layout=(-1, 3), figsize=(15, 10),
                                                        grid=True, rot=45, xlabel=None, marker='o')
    plt.savefig("Images/mean_energy_per_month.png")
    plt.show()


import calendar

import plotly.express as px


def generate_energy_plots():
    """
    Generates energy consumption plots for each month of 2016.

    Args:
        None.

    Returns:
        None. Saves energy consumption plots for each month to individual PNG files and displays the plots.

    Example:
        >>> generate_energy_plots()
    """
    # Load energy consumption data
    energy_data = pd.read_csv('energy_data.csv', parse_dates=['Date/Time'])
    energy_data = energy_data.set_index('Date/Time')
    
    # Resample to daily frequency
    energy_per_day = energy_data.resample('D').sum()
    
    # Define columns to include in plots
    cols_energy = energy_per_day.columns[:-5].tolist()
    
    for month in range(1, 13):
        # Filter the energy and weather data for the current month
        start_date = f'2016-{month:02}-01'
        end_date = f'2016-{month:02}-' + str(calendar.monthrange(2016, month)[1])
        energy_per_day_month = energy_per_day.loc[start_date:end_date].filter(items=cols_energy)

        # Generate the plots for the current month
        fig_energy = px.line(data_frame=energy_per_day_month, line_dash_sequence=['solid']*15, width=900, height=600, title=f'Energy Consumption - {calendar.month_name[month]}')

        # Save the plots to files
        fig_energy.write_image(f'Images/energy_{month:02}.png')

        # Show the plots
        fig_energy.show()

def plot_predictions(y_pred, y_test):
    """
    Plots the predicted and actual values on separate scatter plots.
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
    
    # Plot the actual values
    ax1.scatter(range(len(y_test)), y_test, label='Actual Values')
    ax1.set_xlabel('Index')
    ax1.set_ylabel('Actual Values')
    ax1.set_title('Scatter plot of Actual Values')
    ax1.legend()
    
    # Plot the predicted values
    ax2.scatter(range(len(y_pred)), y_pred, label='Predicted Values')
    ax2.set_xlabel('Index')
    ax2.set_ylabel('Predicted Values')
    ax2.set_title('Scatter plot of Predicted Values')
    ax2.legend()
    
    # Show the plots
    plt.show()


### Scikit-Learn

In [64]:
#Scikit-Learn Sub-modules

# Scikit-Learn library is organized into several sub-modules, each of which contains a set of related functions and classes. 
# Here are the main sub-modules in scikit-learn:

#from sklearn."sub-module" import "model"
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression 


# sklearn.datasets: This sub-module provides a set of standard datasets for machine learning, including iris, 
#     digits, and breast cancer.
from sklearn.datasets import load_iris
iris_data = load_iris()
iris_features = iris_data.data 
iris_target = iris_data.target

# Convert the data to a DataFrame
df = pd.DataFrame(iris_features, columns=iris_data.feature_names)

# Add the target variable to the DataFrame
df['target'] = iris_target 

# print(iris_data.DESCR) - Describes the data 
# iris_data.data: An array containing the feature values for each instance of the dataset.
# iris_data.target: An array containing the class labels (i.e., 0, 1, or 2) for each instance of the dataset.
# iris_data.target_names: An array containing the names of the three classes 
# iris_data.feature_names: A list containing the names of the attributes 
# or
X, y = load_iris(return_X_y=True, as_frame=True) 

from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True, as_frame=True)
X.shape, y.shape
# Plot the first 10 digits (each row holds an 8x8 image flattened to 64 pixel values)
fig, axes = plt.subplots(nrows=2, ncols=5, figsize=(10, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(X.iloc[i].values.reshape(8, 8), cmap='gray')
    ax.set_title(f"Digit {y.iloc[i]}")
plt.tight_layout()
plt.show()

        
# sklearn.model_selection: This sub-module contains functions for model selection, such as splitting data into 
#     training and test sets, cross-validation, and grid search.

# sklearn.preprocessing: This sub-module provides functions for preprocessing data, such as scaling, normalization, 
#     and encoding categorical variables.

# sklearn.feature_extraction: This sub-module contains functions for feature extraction from raw data, 
#     such as text data, including Bag of Words, CountVectorizer, and TfidfVectorizer.

# sklearn.metrics: This sub-module provides functions for evaluating the performance of machine learning models, 
#     such as accuracy, precision, recall, and F1 score.

# sklearn.pipeline: This sub-module provides tools for building machine learning pipelines, 
#     which allows you to chain together multiple steps, such as feature extraction, preprocessing, and model selection.

# sklearn.decomposition: This sub-module provides classes for matrix factorization and decomposition, 
#     such as Principal Component Analysis (PCA), Non-negative Matrix Factorization (NMF), 
#     and Latent Dirichlet Allocation (LDA).

# sklearn.discriminant_analysis: This sub-module provides classes for linear and quadratic discriminant analysis, 
#     which are used for supervised classification tasks.

# sklearn.covariance: This sub-module provides classes for covariance estimation, such as Empirical Covariance and 
#     Shrunk Covariance.

# sklearn.exceptions: This sub-module contains the custom exceptions and warnings raised by scikit-learn, 
#     such as NotFittedError and ConvergenceWarning.


#Models: 


# sklearn.linear_model: This sub-module contains classes for linear models, such as linear regression, 
#     logistic regression, and ridge regression.

# sklearn.tree: This sub-module provides classes for decision trees, such as DecisionTreeClassifier and 
#     DecisionTreeRegressor.

# sklearn.ensemble: This sub-module contains classes for ensemble models, such as random forests, AdaBoost, 
#     and Gradient Boosting.

# sklearn.cluster: This sub-module provides classes for clustering, such as KMeans and Hierarchical Clustering.

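# sklearn.neural_network: This sub-module contains classes for simple neural networks, such as the Multi-Layer 
#     Perceptron (MLPClassifier, MLPRegressor) and BernoulliRBM. It does not include convolutional or recurrent 
#     networks; those need a deep learning library.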

# sklearn.svm: This sub-module contains classes for Support Vector Machines (SVMs), such as SVC for classification and SVR for regression.

# sklearn.manifold: This sub-module provides classes for manifold learning, such as t-SNE and Isomap.

# sklearn.naive_bayes: This sub-module provides classes for Naive Bayes models, such as Gaussian Naive Bayes and 
#     Multinomial Naive Bayes.

# sklearn.neighbors: This sub-module provides classes for k-Nearest Neighbors (k-NN) models, 
#     such as KNeighborsClassifier and KNeighborsRegressor.


### Ensembles

In [None]:


#Decision Tree                          (one-hot encode categorical features before fitting)
#steps for Decision tree (classification)
    #Start with all examples at the root node
    #calculate information gain for all possible features, and pick the one with the highest information gain
    #Split the dataset according to selected feature, and create left and right branches of the tree
    #Keep repeating splitting process until stopping criteria is met:
        #when a node is 100% one class
        #when splitting a node will result in the tree exceeding a maximum depth
        #information gain from additional splits is less than threshold
        #when number of examples in a node is below threshold


#Bagging (Bootstrapping + Aggregating (or voting)) i.e., randomly creating samples (subsets) of the dataset with replacement, 
    # then builds models on the random subsets. The multiple models are combined by taking a majority vote or 
    # averaging their predictions to make the final prediction or decision
from sklearn.ensemble import BaggingClassifier, VotingClassifier, RandomForestClassifier
bagging_classifier = BaggingClassifier(estimator=RandomForestClassifier(), n_estimators=15, max_samples=200, max_features=X_train.shape[1])
# voting='hard' takes a majority vote over the predicted class labels; voting='soft' averages the
# predicted class probabilities, and the weights parameter turns either one into a weighted vote
vot_classifier = VotingClassifier(
    estimators=[('log_reg', log_classifier),
                ('svc', sv_classifier),
                ('sgd', sgd_classifier)],
    voting='hard')
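
# A self-contained sketch of bagging and hard voting, assuming the iris dataset and default
# base models (the variable names and the choice of base models are illustrative and not
# the classifiers defined elsewhere in this notebook).

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Bagging: 15 decision trees, each trained on a bootstrap sample of the training set
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=15, random_state=0)
bagging.fit(X_train, y_train)
bagging_accuracy = bagging.score(X_test, y_test)

# Hard voting: three different models, the majority class label wins
voting = VotingClassifier(
    estimators=[("log_reg", LogisticRegression(max_iter=1000)),
                ("svc", SVC()),
                ("tree", DecisionTreeClassifier(random_state=0))],
    voting="hard")
voting.fit(X_train, y_train)
voting_accuracy = voting.score(X_test, y_test)

print(bagging_accuracy, voting_accuracy)
```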
#Boosting
    #Boosting is a machine learning ensemble technique that combines multiple base models to create a stronger overall model. 
    # Unlike bagging, which trains its base models independently on random subsets of the training data, boosting 
    # trains a series of base models sequentially, where each subsequent base model focuses on correcting the errors 
    # made by the previous base models
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

    #AdaBoost (Adaptive Boosting): 
    # AdaBoost is a widely used boosting algorithm that assigns higher weights to misclassified samples and adjusts the 
    # weights of base models based on their accuracy. It places more emphasis on samples that are misclassified by the 
    # current ensemble and updates the weights of samples and base models accordingly.
    
    #Gradient Boosting: 
    # Gradient Boosting is a generalization of AdaBoost that uses gradient descent optimization to minimize the loss 
    # function of the ensemble model. It sequentially fits the base models to the residuals (i.e., the differences between 
    # the true labels and the predictions) of the previous base models, resulting in a more accurate and robust model.
    
    #XGBoost (Extreme Gradient Boosting): XGBoost is a popular implementation of gradient boosting that incorporates 
    # additional optimizations for improved performance, such as parallelization, regularization, and handling of missing 
    # values.
    from xgboost import XGBClassifier
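# The residual-fitting idea behind gradient boosting can be sketched by hand for squared-error regression.
# This is a simplified illustration on synthetic data, not how GradientBoostingClassifier or XGBoost are
# implemented; the learning rate, tree depth, and number of rounds are arbitrary assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())      # stage 0: a constant model
mse_history = [np.mean((y - prediction) ** 2)]
trees = []

for _ in range(50):
    residuals = y - prediction               # negative gradient of squared error (up to a constant)
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # add a shrunken correction
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))
```

# Each shallow tree corrects part of what the ensemble so far gets wrong, so the training error in mse_history shrinks.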

#Stacking and Blending 
from sklearn.ensemble import StackingClassifier, StackingRegressor
    #Stacking refers to training a learning algorithm to combine the predictions of several other algorithms. The 
    #predictions of the base algorithms are used as input to train the stacking algorithm. This helps to create a 
    #meta-model that can leverage the predictions of the base models
    
    #Blending here refers to combining the predictions of multiple base models with a simple or weighted average
    #to get the final prediction. (The term is also used for a stacking variant whose meta-model is trained on
    #a hold-out validation set.)

stacking_classifier = StackingClassifier(estimators=[('random forest', RandomForestClassifier()), 
                                                     ('decision trees', DecisionTreeClassifier()),
                                                     ('logistic regression', LogisticRegression())], stack_method='predict',
                                         final_estimator=RandomForestClassifier())    
        #final_estimator is the meta-model
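# Blending by weighted average, as described above, needs no special estimator. A minimal sketch, assuming
# three already-fitted base models whose predicted probabilities are given; the arrays and weights below are
# made-up illustration values.

```python
import numpy as np

# Hypothetical predicted probabilities from three base models for four samples
preds_a = np.array([0.9, 0.2, 0.6, 0.4])
preds_b = np.array([0.8, 0.3, 0.4, 0.5])
preds_c = np.array([0.7, 0.1, 0.7, 0.3])

weights = np.array([0.5, 0.3, 0.2])  # assumed weights (e.g. chosen from validation scores), summing to 1

stacked = np.vstack([preds_a, preds_b, preds_c])         # shape: (3 models, 4 samples)
blended = np.average(stacked, axis=0, weights=weights)   # weighted mean per sample
labels = (blended >= 0.5).astype(int)                    # threshold to get class labels
```

# In practice the weights would be tuned on a validation set rather than fixed by hand.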

### Tips

In [None]:
# Missing Data:
#     Some machine learning models, such as decision trees and random forests, can handle missing data directly, 
#     while others may require imputation or removal of missing data. For example, models like K-nearest neighbors (KNN) 
#     and Support Vector Machines (SVM) may be sensitive to missing data and may require imputation or removal of 
#     missing values before training the model.
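# One common way to impute missing values is scikit-learn's SimpleImputer. A minimal sketch on a made-up
# array; the median strategy is an arbitrary choice here, and the statistics are learned from the training
# data only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 10.0],
                    [np.nan, 20.0],
                    [3.0, np.nan],
                    [5.0, 30.0]])
X_test = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)   # learn column medians on the training set
X_test_imp = imputer.transform(X_test)         # reuse the same medians on the test set
```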

# Data Imbalance:
#     Resampling techniques such as oversampling, undersampling, or synthetic oversampling with SMOTE 
#     (Synthetic Minority Over-sampling Technique) can be used to address data imbalance. Some machine learning models, 
#     such as decision trees and random forests, tend to cope with moderate imbalance reasonably well, while others may 
#     require handling of imbalanced data as a preprocessing step, such as using oversampling or undersampling techniques. 
#     For example, models like logistic regression, naive Bayes, KNN, and support vector machines (SVM) may require handling 
#     of imbalanced data.
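# Random oversampling of the minority class can be sketched with NumPy alone. This is an illustration on
# made-up data, not SMOTE (which synthesizes new points); resampling should be applied to the training split
# only.

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.arange(20).reshape(10, 2)       # 10 samples, 2 features (toy data)
y = np.array([0] * 8 + [1] * 2)        # 8 majority vs 2 minority samples

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Draw minority rows with replacement until both classes have the same count
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

keep = np.concatenate([majority_idx, minority_idx, extra_idx])
X_bal, y_bal = X[keep], y[keep]
```

# Many scikit-learn classifiers also accept class_weight="balanced" as an alternative to resampling.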

# Feature Scaling:
#     Some machine learning models, such as k-nearest neighbors (KNN) and support vector machines (SVM), are sensitive 
#     to the scale of features and may require feature scaling.  On the other hand, decision trees and random forests 
#     are not sensitive to feature scaling and do not require this preprocessing step.
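# The effect of scaling on a distance-based model can be sketched as follows. The data are synthetic (one
# informative feature and one large-scale noise feature), so the accuracies are illustrative only; putting
# the scaler in a pipeline keeps it fitted on the training data alone.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
informative = rng.normal(size=n)                 # small-scale feature that determines the label
noise = rng.normal(scale=1000.0, size=n)         # large-scale feature unrelated to the label
X = np.column_stack([informative, noise])
y = (informative > 0).astype(int)

X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

# Without scaling, Euclidean distances are dominated by the noise feature
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# With scaling, both features contribute on a comparable scale
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_train, y_train)

raw_acc = raw_knn.score(X_test, y_test)
scaled_acc = scaled_knn.score(X_test, y_test)
```

# On data like this the scaled pipeline typically scores higher, but the exact numbers depend on the random draw.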

# Categorical Data:
#     Some models, like decision trees and random forests, can in principle handle categorical data without encoding 
#     (scikit-learn's implementations still expect numeric input), while others, like logistic regression and 
#     support vector machines (SVM), require encoding of categorical data.
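# One-hot encoding is a common way to make a categorical column numeric. A minimal sketch with pandas on a
# made-up DataFrame; the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"],
                   "size": [1, 2, 3, 2]})

# Replace "color" with one 0/1 indicator column per category
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
```

# For train/test workflows, sklearn.preprocessing.OneHotEncoder(handle_unknown="ignore") keeps the columns
# consistent between splits.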

# Outliers:
#     Some machine learning models, such as decision trees and random forests, are less sensitive to outliers, while 
#     others, such as linear regression, SVM, and k-nearest neighbors (KNN), can be affected by outliers and may require 
#     handling of outliers as a preprocessing step.
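# One simple way to handle outliers is to clip values outside the interquartile-range (IQR) fences. A
# minimal sketch on made-up values; the 1.5 multiplier is the conventional choice, not a requirement.

```python
import numpy as np

values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0])  # 100.0 is an obvious outlier

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values beyond the fences are pulled back to the nearest fence
clipped = np.clip(values, lower, upper)
```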
    
# Dimensionality:
    #  Dimensionality refers to the number of features or variables in the dataset. High-dimensional data can lead to increased 
    #  complexity, increased computation time, and reduced model performance. 
    #  Some machine learning models, such as decision trees and random forests, are less sensitive to 
    #  high-dimensional data, while others, such as logistic regression and support vector machines (SVM), 
    #  may require handling of high-dimensional data as a preprocessing step
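# PCA is one way to reduce dimensionality before fitting a model. A minimal sketch on synthetic data that
# has 10 observed features but only 2 underlying directions of variation; the 95% variance threshold is an
# arbitrary assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))                 # 2 true underlying factors
mixing = rng.normal(size=(2, 10))                  # spread them across 10 observed features
X = latent @ mixing + rng.normal(scale=0.01, size=(100, 10))

pca = PCA(n_components=0.95)    # keep enough components to explain 95% of the variance
X_reduced = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()
```

# As with scalers, PCA should be fitted on the training set and then applied to the test set with transform.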

#reduce memory usage of the dataset
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
       

### Machine Learning Regression

In [None]:
## Importing required libraries
import pandas as pd ## For DataFrame operation
import numpy as np ## Numerical python for matrix operations
from sklearn.model_selection import KFold, train_test_split ## Creating cross validation sets
from sklearn import metrics ## For loss functions
import matplotlib.pyplot as plt

## Libraries for Regression algorithms
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
import xgboost as xgb
import lightgbm as lgb 
from sklearn.ensemble import ExtraTreesRegressor,RandomForestRegressor
import lime 
import lime.lime_tabular


# model.get_params()  # lists an estimator's hyperparameters (useful before tuning); requires an existing `model`

########### Cross Validation ###########
### 1) Train test split
def holdout_cv(X,y,size = 0.3, seed = 1):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = size, random_state = seed)
    X_train = X_train.reset_index(drop=True)  # drop expects a boolean
    X_test = X_test.reset_index(drop=True)
    return X_train, X_test, y_train, y_test

### 2) Cross-Validation (K-Fold)
def kfold_cv(X,n_folds = 5, seed = 1):
    cv = KFold(n_splits = n_folds, random_state = seed, shuffle = True)
    return cv.split(X)
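
## Using K-Fold splits: a minimal sketch of consuming the (train_idx, valid_idx) pairs that KFold.split
## yields. The data are synthetic and LinearRegression is only a stand-in model for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

cv = KFold(n_splits=5, shuffle=True, random_state=1)
fold_mse = []
for train_idx, valid_idx in cv.split(X):
    # Fit on four folds, score on the held-out fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_mse.append(mean_squared_error(y[valid_idx], model.predict(X[valid_idx])))

mean_mse = float(np.mean(fold_mse))   # average validation error across folds
```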

########### Model Explanation ###########
## Variable Importance plot
def feature_importance(model,X):
    feature_importance = model.feature_importances_
    feature_importance = 100.0 * (feature_importance / feature_importance.max())
    sorted_idx = np.argsort(feature_importance)
    pos = np.arange(sorted_idx.shape[0]) + .5
    plt.figure(figsize=(15, 15))
    plt.subplot(1, 2, 2)
    plt.barh(pos, feature_importance[sorted_idx], align='center')
    plt.yticks(pos, X.columns[sorted_idx])
    plt.xlabel('Relative Importance')
    plt.title('Variable Importance')
    plt.show()

def zscore_normalize_features(X):
    """
    computes X, z-score normalized by column
    
    Args:
      X (ndarray (m,n))     : input data, m examples, n features
      
    Returns:
      X_norm (ndarray (m,n)): input normalized by column
      mu (ndarray (n,))     : mean of each feature
      sigma (ndarray (n,))  : standard deviation of each feature
    """
    # find the mean of each column/feature
    mu     = np.mean(X, axis=0)                 # mu will have shape (n,)
    # find the standard deviation of each column/feature
    sigma  = np.std(X, axis=0)                  # sigma will have shape (n,)
    # element-wise, subtract mu for that column from each example, divide by std for that column
    X_norm = (X - mu) / sigma      

    return (X_norm, mu, sigma)

def standardize_data(X_train, X_test):
    """
    Standardizes the training and testing data using the mean and standard deviation
    learned from the training set.
    
    Args:
    - X_train: numpy array or pandas dataframe, training data
    - X_test: numpy array or pandas dataframe, testing data
    
    Returns:
    - X_train_scaled: numpy array or pandas dataframe, standardized training data
    - X_test_scaled: numpy array or pandas dataframe, standardized testing data
    """
    from sklearn.preprocessing import StandardScaler 
    # Set up the scaler
    scaler = StandardScaler()
    
    # Fit the scaler to the training set
    scaler.fit(X_train) 
    
    # Transform the training and testing sets
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    return X_train_scaled, X_test_scaled

def mean_normalize(X_train, X_test):
    """
    Perform mean normalization on both the training and testing sets.

    Parameters:
    -----------
    X_train: numpy.ndarray
        The training set features as a 2D array.

    X_test: numpy.ndarray
        The testing set features as a 2D array.

    Returns:
    --------
    X_train_norm: numpy.ndarray
        The mean-normalized training set features as a 2D array.

    X_test_norm: numpy.ndarray
        The mean-normalized testing set features as a 2D array.
    """
    from sklearn.preprocessing import StandardScaler, RobustScaler
    scaler_mean = StandardScaler(with_mean=True, with_std=False) # centers each feature on its training mean
    scaler_minmax = RobustScaler(with_centering = False, with_scaling = True,   # quantile_range=(0, 100) divides by (max - min)
                                 quantile_range = (0,100))
    
    scaler_mean.fit(X_train) # fit the scaler to the train set, it will learn the parameters
    scaler_minmax.fit(X_train) #fit the scaler to the train set, it will learn the parameters
    
    X_train_norm = scaler_minmax.transform(scaler_mean.transform(X_train)) # transform train set
    X_test_norm = scaler_minmax.transform(scaler_mean.transform(X_test)) # transform test set
    return X_train_norm, X_test_norm


def scale_min_max(X_train, X_test):
    """
    Scales the features in X_train and X_test to the range [0, 1] using MinMaxScaler.
    
    Parameters:
    -----------
    X_train: numpy array
        Training data features
        
    X_test: numpy array
        Test data features
        
    Returns:
    --------
    X_train_scaled: numpy array
        Scaled training data features
        
    X_test_scaled: numpy array
        Scaled test data features
    """
    from sklearn.preprocessing import MinMaxScaler 
    # set up the scaler
    scaler = MinMaxScaler()
    
    # fit the scaler to the train set, it will learn the parameters
    scaler.fit(X_train)
    
    # transform train and test sets
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    return X_train_scaled, X_test_scaled


########### Functions for explanation using Lime ###########

## Make a prediction function
def make_prediction_function(model, type = None):
    if type == 'xgb':
        predict_fn = lambda x: model.predict(xgb.DMatrix(x)).astype(float)
    else:
        predict_fn = lambda x: model.predict(x).astype(float)
    return predict_fn

## Make a lime explainer
def make_lime_explainer(df, c_names = [], verbose_val = True):
    explainer = lime.lime_tabular.LimeTabularExplainer(df.values,
                                                       class_names=c_names,
                                                       feature_names = list(df.columns),
                                                       kernel_width=3, 
                                                       verbose=verbose_val,
                                                       mode='regression'
                                                    )
    return explainer

## Lime explain function
def lime_explain(explainer,predict_fn, df, index = 0, num_features = None,
                 show_in_notebook = True, filename = None):
    if num_features is not None:
        exp = explainer.explain_instance(df.values[index], predict_fn, num_features=num_features)
    else:
        exp = explainer.explain_instance(df.values[index], predict_fn, num_features=df.shape[1])
    
    if show_in_notebook:
        exp.show_in_notebook(show_all=False)
    
    if filename is not None:
        exp.save_to_file(filename)
        
########### Algorithms For Regression ###########

### Running Xgboost
def runXGB(train_X, train_y, test_X, test_y=None, test_X2=None, seed_val=0, 
           rounds=500, dep=8, eta=0.05,sub_sample=0.7,col_sample=0.7,
           min_child_weight_val=1, silent_val = 1):
    params = {}
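    params["objective"] = "reg:squarederror"  # "reg:linear" is the deprecated name for this objective
    params['eval_metric'] = 'rmse'
    params["eta"] = eta
    params["subsample"] = sub_sample
    params["min_child_weight"] = min_child_weight_val
    params["colsample_bytree"] = col_sample
    params["max_depth"] = dep
    params["verbosity"] = 0 if silent_val else 1  # "silent" was replaced by "verbosity" in newer XGBoost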
    params["seed"] = seed_val
    #params["max_delta_step"] = 2
    #params["gamma"] = 0.5
    num_rounds = rounds

    plst = list(params.items())
    xgtrain = xgb.DMatrix(train_X, label=train_y)

    if test_y is not None:
        xgtest = xgb.DMatrix(test_X, label=test_y)
        watchlist = [ (xgtrain,'train'), (xgtest, 'test') ]
        model = xgb.train(plst, xgtrain, num_rounds, watchlist, early_stopping_rounds=100, verbose_eval=20)
    else:
        xgtest = xgb.DMatrix(test_X)
        model = xgb.train(plst, xgtrain, num_rounds)
    
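    # best_iteration is only reliable when early stopping ran; iteration_range replaces the removed ntree_limit
    best_it = getattr(model, "best_iteration", None)
    it_range = (0, best_it + 1) if best_it is not None else (0, 0)  # (0, 0) means "use all trees"
    pred_test_y = model.predict(xgtest, iteration_range=it_range)
    
    pred_test_y2 = 0
    if test_X2 is not None:
        pred_test_y2 = model.predict(xgb.DMatrix(test_X2), iteration_range=it_range)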
    
    loss = 0
    if test_y is not None:
        loss = metrics.mean_squared_error(test_y, pred_test_y)
        return pred_test_y, loss, pred_test_y2, model
    else:
        return pred_test_y, loss, pred_test_y2, model
        
### Running LightGBM
def runLGB(train_X, train_y, test_X, test_y=None, test_X2=None, feature_names=None, 
           seed_val=0, rounds=500, dep=8, eta=0.05,sub_sample=0.7,
           col_sample=0.7,silent_val = 1,min_data_in_leaf_val = 20, bagging_freq = 5):
    params = {}
    params["objective"] = "regression"
    params['metric'] = 'rmse'
    params["max_depth"] = dep
    params["min_data_in_leaf"] = min_data_in_leaf_val
    params["learning_rate"] = eta
    params["bagging_fraction"] = sub_sample
    params["feature_fraction"] = col_sample
    params["bagging_freq"] = bagging_freq
    params["bagging_seed"] = seed_val
    params["verbosity"] = silent_val
    num_rounds = rounds
    
    lgtrain = lgb.Dataset(train_X, label=train_y)
    
    if test_y is not None:
        lgtest = lgb.Dataset(test_X, label=test_y)
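        # LightGBM >= 4 takes early stopping and logging as callbacks instead of early_stopping_rounds/verbose_eval
        model = lgb.train(params, lgtrain, num_rounds, valid_sets=[lgtest],
                          callbacks=[lgb.early_stopping(100), lgb.log_evaluation(20)])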
    else:
        lgtest = lgb.Dataset(test_X)
        model = lgb.train(params, lgtrain, num_rounds)
        
    pred_test_y = model.predict(test_X, num_iteration=model.best_iteration)
    
    pred_test_y2 = 0
    if test_X2 is not None:
        pred_test_y2 = model.predict(test_X2, num_iteration=model.best_iteration)
    
    loss = 0
    if test_y is not None:
        loss = metrics.mean_squared_error(test_y, pred_test_y)
        print(loss)
        return pred_test_y, loss, pred_test_y2, model
    else:
        return pred_test_y, loss, pred_test_y2, model
        
### Running Extra Trees  
def runET(train_X, train_y, test_X, test_y=None, test_X2=None, rounds=100, depth=20,
          leaf=10, feat=0.2, min_data_split_val=2,seed_val=0,job = -1):
    model = ExtraTreesRegressor(
                                n_estimators = rounds,
                                max_depth = depth,
                                min_samples_split = min_data_split_val,
                                min_samples_leaf = leaf,
                                max_features =  feat,
                                n_jobs = job,
                                random_state = seed_val)
    model.fit(train_X, train_y)
    train_preds = model.predict(train_X)
    test_preds = model.predict(test_X)
    
    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict(test_X2)
    
    test_loss = 0
    if test_y is not None:
        train_loss = metrics.mean_squared_error(train_y, train_preds)
        test_loss = metrics.mean_squared_error(test_y, test_preds)
        print("Depth, leaf, feat : ", depth, leaf, feat)
        print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model
 
### Running Random Forest
def runRF(train_X, train_y, test_X, test_y=None, test_X2=None, rounds=100, depth=20, leaf=10,
          feat=0.2,min_data_split_val=2,seed_val=0,job = -1):
    model = RandomForestRegressor(
                                n_estimators = rounds,
                                max_depth = depth,
                                min_samples_split = min_data_split_val,
                                min_samples_leaf = leaf,
                                max_features =  feat,
                                n_jobs = job,
                                random_state = seed_val)
    model.fit(train_X, train_y)
    train_preds = model.predict(train_X)
    test_preds = model.predict(test_X)
    
    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict(test_X2)
    
    test_loss = 0
    if test_y is not None:  # test_y is optional, so only score when labels are provided
        train_loss = metrics.mean_squared_error(train_y, train_preds)
        test_loss = metrics.mean_squared_error(test_y, test_preds)
        print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model

### Running Linear regression
def runLR(train_X, train_y, test_X, test_y=None, test_X2=None):
    model = LinearRegression()
    model.fit(train_X, train_y)
    train_preds = model.predict(train_X)
    test_preds = model.predict(test_X)

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict(test_X2)
    test_loss = 0
    if test_y is not None:  # test_y is optional, so only score when labels are provided
        train_loss = metrics.mean_squared_error(train_y, train_preds)
        test_loss = metrics.mean_squared_error(test_y, test_preds)
        print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model

### Running Decision Tree
def runDT(train_X, train_y, test_X, test_y=None, test_X2=None, criterion='squared_error',  # 'mse' was renamed in scikit-learn 1.0
          depth=None, min_split=2, min_leaf=1):
    model = DecisionTreeRegressor(
                                criterion = criterion, 
                                max_depth = depth, 
                                min_samples_split = min_split, 
                                min_samples_leaf=min_leaf)
    model.fit(train_X, train_y)
    train_preds = model.predict(train_X)
    test_preds = model.predict(test_X)

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict(test_X2)
    
    test_loss = 0
    if test_y is not None:  # test_y is optional, so only score when labels are provided
        train_loss = metrics.mean_squared_error(train_y, train_preds)
        test_loss = metrics.mean_squared_error(test_y, test_preds)
        print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model
    
### Running K-Nearest Neighbour
def runKNN(train_X, train_y, test_X, test_y=None, test_X2=None, 
           neighbors=5, job = -1):
    model = KNeighborsRegressor(
                                n_neighbors=neighbors, 
                                n_jobs=job)
    model.fit(train_X, train_y)
    train_preds = model.predict(train_X)
    test_preds = model.predict(test_X)

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict(test_X2)
    
    test_loss = 0
    if test_y is not None:  # test_y is optional, so only score when labels are provided
        train_loss = metrics.mean_squared_error(train_y, train_preds)
        test_loss = metrics.mean_squared_error(test_y, test_preds)
        print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model

### Running SVM
def runSVC(train_X, train_y, test_X, test_y=None, test_X2=None, C=1.0, 
           eps=0.1, kernel_choice = 'rbf'):
    model = SVR(
                C=C, 
                kernel=kernel_choice,  
                epsilon=eps)
    model.fit(train_X, train_y)
    train_preds = model.predict(train_X)
    test_preds = model.predict(test_X)

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict(test_X2)
    
    test_loss = 0
    if test_y is not None:  # test_y is optional, so only score when labels are provided
        train_loss = metrics.mean_squared_error(train_y, train_preds)
        test_loss = metrics.mean_squared_error(test_y, test_preds)
        print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model

### Machine Learning Classification

In [None]:
## Importing required libraries
import pandas as pd  ## For DataFrame operation
import numpy as np  ## Numerical python for matrix operations
from sklearn.model_selection import (
    KFold,
    train_test_split,
)  ## Creating cross validation sets
from sklearn import metrics  ## For loss functions
import matplotlib.pyplot as plt
import itertools

## For evaluation
from sklearn.metrics import (
    roc_curve,
    auc,
    roc_auc_score,
    confusion_matrix,
    precision_recall_curve,
    average_precision_score,
)
from inspect import signature

## Libraries for Classification algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
import lime
import lime.lime_tabular

# model.get_params()  # lists an estimator's hyperparameters (useful before tuning); requires an existing `model`



def split_data(X, y, test_size=0.2, val_size=0.2, random_state=42): #split data into train, test, and validation
    """
    This function splits the data into train and test sets, and further splits the train set into training and validation sets.
    
    X : pandas DataFrame
        The input features.
    y : pandas Series
        The target values.
    test_size : float, optional (default=0.2)
        The proportion of the data to be used for testing.
    val_size : float, optional (default=0.2)
        The proportion of the remaining training data to be used for validation.
    random_state : int, optional (default=42)
        The seed used by the random number generator.
    
    Returns
    -------
    X_train : pandas DataFrame
        The training input data.
    y_train : pandas Series
        The training target data.
    X_valid : pandas DataFrame
        The validation input data.
    y_valid : pandas Series
        The validation target data.
    X_test : pandas DataFrame
        The test input data.
    y_test : pandas Series
        The test target data.
    """ 
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    
    X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=val_size, random_state=random_state)
    
    return X_train, y_train, X_valid, y_valid, X_test, y_test

########### Cross Validation ###########
### 1) Train test split
def holdout_cv(X, y, size=0.3, seed=1):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=size, random_state=seed
    )
    X_train = X_train.reset_index(drop=True)  # drop expects a boolean
    X_test = X_test.reset_index(drop=True)
    return X_train, X_test, y_train, y_test


### 2) Cross-Validation (K-Fold)
def kfold_cv(X, n_folds=5, seed=1):
    cv = KFold(n_splits=n_folds, random_state=seed, shuffle=True)
    return cv.split(X)


########### Model Explanation ###########
## Plotting AUC ROC curve
def plot_roc(y_actual, y_pred):
    """
    Function to plot AUC-ROC curve
    """
    fpr, tpr, thresholds = roc_curve(y_actual, y_pred)
    plt.plot(
        fpr,
        tpr,
        color="b",
        label=r"Model (AUC = %0.2f)" % (roc_auc_score(y_actual, y_pred)),
        lw=2,
        alpha=0.8,
    )
    plt.plot(
        [0, 1],
        [0, 1],
        linestyle="--",
        lw=2,
        color="r",
        label="Luck (AUC = 0.5)",
        alpha=0.8,
    )
    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("Receiver operating characteristic example")
    plt.legend(loc="lower right")
    plt.show()


def plot_precisionrecall(y_actual, y_pred):
    """
    Function to plot the precision-recall curve
    """
    average_precision = average_precision_score(y_actual, y_pred)
    precision, recall, _ = precision_recall_curve(y_actual, y_pred)
    # In matplotlib < 1.5, plt.fill_between does not have a 'step' argument
    step_kwargs = (
        {"step": "post"} if "step" in signature(plt.fill_between).parameters else {}
    )

    plt.figure(figsize=(9, 6))
    plt.step(recall, precision, color="b", alpha=0.2, where="post")
    plt.fill_between(recall, precision, alpha=0.2, color="b", **step_kwargs)

    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.ylim([0.0, 1.05])
    plt.xlim([0.0, 1.0])
    plt.title("Precision-Recall curve: AP={0:0.2f}".format(average_precision))


## Plotting confusion matrix
def plot_confusion_matrix(
    y_true,
    y_pred,
    classes,
    normalize=False,
    title="Confusion matrix",
    cmap=plt.cm.Blues,
):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    cm = metrics.confusion_matrix(y_true, y_pred)
    if normalize:
        cm = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print("Confusion matrix, without normalization")

    print(cm)

    plt.imshow(cm, interpolation="nearest", cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = ".2f" if normalize else "d"
    thresh = cm.max() / 2.0
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(
            j,
            i,
            format(cm[i, j], fmt),
            horizontalalignment="center",
            color="white" if cm[i, j] > thresh else "black",
        )

    plt.tight_layout()
    plt.ylabel("True label")
    plt.xlabel("Predicted label")


## Variable Importance plot
def feature_importance(model, X):
    feature_importance = model.feature_importances_
    feature_importance = 100.0 * (feature_importance / feature_importance.max())
    sorted_idx = np.argsort(feature_importance)
    pos = np.arange(sorted_idx.shape[0]) + 0.5
    plt.figure(figsize=(15, 15))
    plt.subplot(1, 2, 2)
    plt.barh(pos, feature_importance[sorted_idx], align="center")
    plt.yticks(pos, X.columns[sorted_idx])
    plt.xlabel("Relative Importance")
    plt.title("Variable Importance")
    plt.show()


## Functions for explanation using Lime
def make_prediction_function(model):
    predict_fn = lambda x: model.predict_proba(x).astype(float)
    return predict_fn


def make_lime_explainer(df, c_names=[], k_width=3, verbose_val=True):
    explainer = lime.lime_tabular.LimeTabularExplainer(
        df.values,
        class_names=c_names,
        feature_names=list(df.columns),
        kernel_width=k_width,  # use the k_width argument instead of a hard-coded value
        verbose=verbose_val,
    )
    return explainer


def lime_explain(
    explainer,
    predict_fn,
    df,
    index=0,
    num_features=None,
    show_in_notebook=True,
    filename=None,
):
    if num_features is not None:
        exp = explainer.explain_instance(
            df.values[index], predict_fn, num_features=num_features
        )
    else:
        exp = explainer.explain_instance(
            df.values[index], predict_fn, num_features=df.shape[1]
        )

    if show_in_notebook:
        exp.show_in_notebook(show_all=False)

    if filename is not None:
        exp.save_to_file(filename)


########### Algorithms For Binary classification ###########

### Running Xgboost
def runXGB(
    train_X,
    train_y,
    test_X,
    test_y=None,
    test_X2=None,
    seed_val=0,
    rounds=500,
    dep=8,
    eta=0.05,
    sub_sample=0.7,
    col_sample=0.7,
    min_child_weight_val=1,
    silent_val=1,
):
    params = {}
    params["objective"] = "binary:logistic"
    params["eval_metric"] = "auc"
    params["eta"] = eta
    params["subsample"] = sub_sample
    params["min_child_weight"] = min_child_weight_val
    params["colsample_bytree"] = col_sample
    params["max_depth"] = dep
    params["verbosity"] = 0 if silent_val else 1  # "silent" was replaced by "verbosity" in newer XGBoost
    params["seed"] = seed_val
    # params["max_delta_step"] = 2
    # params["gamma"] = 0.5
    num_rounds = rounds

    plst = list(params.items())
    xgtrain = xgb.DMatrix(train_X, label=train_y)

    if test_y is not None:
        xgtest = xgb.DMatrix(test_X, label=test_y)
        watchlist = [(xgtrain, "train"), (xgtest, "test")]
        model = xgb.train(
            plst,
            xgtrain,
            num_rounds,
            watchlist,
            early_stopping_rounds=100,
            verbose_eval=20,
        )
    else:
        xgtest = xgb.DMatrix(test_X)
        model = xgb.train(plst, xgtrain, num_rounds)

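    # best_iteration is only reliable when early stopping ran; iteration_range replaces the removed ntree_limit
    best_it = getattr(model, "best_iteration", None)
    it_range = (0, best_it + 1) if best_it is not None else (0, 0)  # (0, 0) means "use all trees"
    pred_test_y = model.predict(xgtest, iteration_range=it_range)

    pred_test_y2 = 0
    if test_X2 is not None:
        pred_test_y2 = model.predict(xgb.DMatrix(test_X2), iteration_range=it_range)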

    loss = 0
    if test_y is not None:
        loss = metrics.roc_auc_score(test_y, pred_test_y)
        return pred_test_y, loss, pred_test_y2, model
    else:
        return pred_test_y, loss, pred_test_y2, model


### Running Xgboost classifier for model explanation
def runXGBC(
    train_X,
    train_y,
    test_X,
    test_y=None,
    test_X2=None,
    seed_val=0,
    rounds=500,
    dep=8,
    eta=0.05,
    sub_sample=0.7,
    col_sample=0.7,
    min_child_weight_val=1,
    silent_val=1,
):
    model = xgb.XGBClassifier(
        objective="binary:logistic",
        learning_rate=eta,
        subsample=sub_sample,
        min_child_weight=min_child_weight_val,
        colsample_bytree=col_sample,
        max_depth=dep,
        verbosity=0 if silent_val else 1,  # "silent" was replaced by "verbosity" in newer XGBoost
        random_state=seed_val,
        n_estimators=rounds,
    )

    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]

    test_loss = 0
    if test_y is not None:
        train_loss = metrics.roc_auc_score(train_y, train_preds)
        test_loss = metrics.roc_auc_score(test_y, test_preds)
        print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model


### Running LightGBM
def runLGB(
    train_X,
    train_y,
    test_X,
    test_y=None,
    test_X2=None,
    feature_names=None,
    seed_val=0,
    rounds=500,
    dep=8,
    eta=0.05,
    sub_sample=0.7,
    col_sample=0.7,
    silent_val=1,
    min_data_in_leaf_val=20,
    bagging_freq=5,
    n_thread=20,
    metric="auc",
):
    params = {}
    params["objective"] = "binary"
    params["metric"] = metric
    params["max_depth"] = dep
    params["min_data_in_leaf"] = min_data_in_leaf_val
    params["learning_rate"] = eta
    params["bagging_fraction"] = sub_sample
    params["feature_fraction"] = col_sample
    params["bagging_freq"] = bagging_freq
    params["bagging_seed"] = seed_val
    params["verbosity"] = silent_val
    params["num_threads"] = n_thread
    num_rounds = rounds

    lgtrain = lgb.Dataset(train_X, label=train_y)

    if test_y is not None:
        lgtest = lgb.Dataset(test_X, label=test_y)
        model = lgb.train(
            params,
            lgtrain,
            num_rounds,
            valid_sets=[lgtrain, lgtest],
            early_stopping_rounds=100,
            verbose_eval=20,
        )
    else:
        lgtest = lgb.Dataset(test_X)
        model = lgb.train(params, lgtrain, num_rounds)

    pred_test_y = model.predict(test_X, num_iteration=model.best_iteration)

    pred_test_y2 = 0
    if test_X2 is not None:
        pred_test_y2 = model.predict(test_X2, num_iteration=model.best_iteration)

    loss = 0
    if test_y is not None:
        loss = roc_auc_score(test_y, pred_test_y)
        print(loss)
        return pred_test_y, loss, pred_test_y2, model
    else:
        return pred_test_y, loss, pred_test_y2, model


### Running LightGBM classifier for model explanation
def runLGBC(
    train_X,
    train_y,
    test_X,
    test_y=None,
    test_X2=None,
    seed_val=0,
    rounds=500,
    dep=8,
    eta=0.05,
    sub_sample=0.7,
    col_sample=0.7,
    silent_val=1,
    min_data_in_leaf_val=20,
    bagging_freq=5,
    n_thread=20,
    metric="auc",
):
    model = lgb.LGBMClassifier(
        max_depth=dep,
        learning_rate=eta,
        min_data_in_leaf=min_data_in_leaf_val,
        bagging_fraction=sub_sample,
        feature_fraction=col_sample,
        bagging_freq=bagging_freq,
        bagging_seed=seed_val,
        verbosity=silent_val,
        num_threads=n_thread,
        n_estimators=rounds,
        metric=metric,
    )

    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]

    test_loss = 0
    if test_y is not None:
        train_loss = roc_auc_score(train_y, train_preds)
        test_loss = roc_auc_score(test_y, test_preds)
        print("Train and Test AUC : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model


### Running Extra Trees
def runET(
    train_X,
    train_y,
    test_X,
    test_y=None,
    test_X2=None,
    rounds=100,
    depth=20,
    leaf=10,
    feat=0.2,
    min_data_split_val=2,
    seed_val=0,
    job=-1,
):
    model = ExtraTreesClassifier(
        n_estimators=rounds,
        max_depth=depth,
        min_samples_split=min_data_split_val,
        min_samples_leaf=leaf,
        max_features=feat,
        n_jobs=job,
        random_state=seed_val,
    )
    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]

    test_loss = 0
    if test_y is not None:
        train_loss = metrics.roc_auc_score(train_y, train_preds)
        test_loss = metrics.roc_auc_score(test_y, test_preds)
        print("Depth, leaf, feat : ", depth, leaf, feat)
        print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model


### Running Random Forest
def runRF(
    train_X,
    train_y,
    test_X,
    test_y=None,
    test_X2=None,
    rounds=100,
    depth=20,
    leaf=10,
    feat=0.2,
    min_data_split_val=2,
    seed_val=0,
    job=-1,
):
    model = RandomForestClassifier(
        n_estimators=rounds,
        max_depth=depth,
        min_samples_split=min_data_split_val,
        min_samples_leaf=leaf,
        max_features=feat,
        n_jobs=job,
        random_state=seed_val,
    )
    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]

    test_loss = 0
    if test_y is not None:
        train_loss = metrics.roc_auc_score(train_y, train_preds)
        test_loss = metrics.roc_auc_score(test_y, test_preds)
        print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model


### Running Logistic Regression
def runLR(train_X, train_y, test_X, test_y=None, test_X2=None, C=1.0, penalty="l1"):
    # liblinear supports both "l1" and "l2"; the default lbfgs solver rejects "l1"
    model = LogisticRegression(C=C, penalty=penalty, solver="liblinear")
    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]

    test_loss = 0
    if test_y is not None:
        train_loss = metrics.roc_auc_score(train_y, train_preds)
        test_loss = metrics.roc_auc_score(test_y, test_preds)
        print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model


### Running Decision Tree
def runDT(
    train_X,
    train_y,
    test_X,
    test_y=None,
    test_X2=None,
    criterion="gini",
    depth=None,
    min_split=2,
    min_leaf=1,
):
    model = DecisionTreeClassifier(
        criterion=criterion,
        max_depth=depth,
        min_samples_split=min_split,
        min_samples_leaf=min_leaf,
    )
    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]

    test_loss = 0
    if test_y is not None:
        train_loss = metrics.roc_auc_score(train_y, train_preds)
        test_loss = metrics.roc_auc_score(test_y, test_preds)
        print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model


### Running K-Nearest Neighbour
def runKNN(train_X, train_y, test_X, test_y=None, test_X2=None, neighbors=5, job=-1):
    model = KNeighborsClassifier(n_neighbors=neighbors, n_jobs=job)
    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]

    test_loss = 0
    if test_y is not None:
        train_loss = metrics.roc_auc_score(train_y, train_preds)
        test_loss = metrics.roc_auc_score(test_y, test_preds)
        print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model


### Running SVM
def runSVC(
    train_X, train_y, test_X, test_y=None, test_X2=None, C=1.0, kernel_choice="rbf"
):
    model = SVC(C=C, kernel=kernel_choice, probability=True)
    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]

    test_loss = 0
    if test_y is not None:
        train_loss = metrics.roc_auc_score(train_y, train_preds)
        test_loss = metrics.roc_auc_score(test_y, test_preds)
        print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model
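
# The run* helpers above return ROC AUC in their `loss` slot, so a higher value is better.
# As a rough illustration of what that number measures, the sketch below computes AUC as
# the share of (positive, negative) pairs that the scores rank correctly, counting ties as
# half. The name `pairwise_auc` and the toy labels are assumptions made for this example;
# the helpers themselves rely on sklearn's roc_auc_score.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive scored above negative
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

# The pairwise form is quadratic in the number of rows, so it is only meant for building
# intuition on small samples.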


### Natural Language Processing (NLP)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import re
import string
import nltk
from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

eng_stop = set(stopwords.words('english'))


def word_grams(text, min=1, max=4):
    '''
    Function to create N-grams from text
    Required Input -
        - text = list of tokens to build N-grams from (a raw string yields character N-grams)
        - min = minimum number of N
        - max = maximum number of N
    Expected Output -
        - s = list of N-grams 
    '''
    s = []
    for n in range(min, max+1):
        for ngram in ngrams(text, n):
            s.append(' '.join(str(i) for i in ngram))
    return s
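
# A minimal pure-Python sketch of the output word_grams builds for a list of tokens,
# assuming whitespace-tokenized input. `sliding_ngrams` is a hypothetical name used only
# here; the helper above relies on nltk.util.ngrams instead.

```python
def sliding_ngrams(tokens, n_min=1, n_max=2):
    """Return space-joined n-grams of every size from n_min to n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        # slide a window of width n across the token list
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


example_grams = sliding_ngrams("the quick brown fox".split(), 1, 2)
print(example_grams)
```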


def generate_bigrams_df(df, column_names):
    """
    Generate bigrams from specified columns in a pandas DataFrame.

    Parameters:
    df (pd.DataFrame): DataFrame to generate bigrams from.
    column_names (list of str): List of column names to generate bigrams from.

    Returns:
    pd.DataFrame: DataFrame with bigrams appended as new columns.
    """
    bigram_columns = []
    for col in column_names:
        bigram_col = f"{col}_bigrams"
        bigram_columns.append(bigram_col)
        # Build word-level bigrams with the word_grams helper defined above
        df[bigram_col] = df[col].apply(lambda x: word_grams(str(x).split(), 2, 2))
    return df[bigram_columns]

def make_wordcloud(df,column, bg_color='white', w=1200, h=1000, font_size_max=50, n_words=40,g_min=1,g_max=1):
    '''
    Function to make wordcloud from a text corpus
    Required Input -
        - df = Pandas DataFrame
        - column = name of column containing text
        - bg_color = Background color
        - w = width
        - h = height
        - font_size_max = maximum font size allowed
        - n_words = maximum words allowed
        - g_min = minimum n-grams
        - g_max = maximum n-grams
    Expected Output -
        - Word cloud image
    '''
    text = ""
    for ind, row in df.iterrows(): 
        text += row[column] + " "
    text = text.strip().split(' ') 
    text = word_grams(text,g_min,g_max)
    
    text = list(pd.Series(word_grams(text,1,2)).apply(lambda x: x.replace(' ','_')))
    
    s = ""
    for i in range(len(text)):
        s += text[i] + " "

    wordcloud = WordCloud(background_color=bg_color, \
                          width=w, \
                          height=h, \
                          max_font_size=font_size_max, \
                          max_words=n_words).generate(s)
    wordcloud.recolor(random_state=1)
    plt.rcParams['figure.figsize'] = (20.0, 10.0)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
    
def generate_wordcloud(df, column_names):
    """
    Generates a wordcloud from a pandas DataFrame

    Parameters:
    df (pd.DataFrame): DataFrame containing the data
    column_names (list): List of column names in the DataFrame to generate the wordcloud from; each cell must hold a list of tokens

    Returns:
    None
    """
    all_words = ' '.join([' '.join(text) for col in column_names for text in df[col]])
    wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_words)

    plt.figure(figsize=(10, 7))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis('off')
    plt.show()
    
    
def get_tokens(text):
    '''
    Function to tokenize the text
    Required Input - 
        - text - text string which needs to be tokenized
    Expected Output -
        - text - tokenized list output
    '''
    return word_tokenize(text)

def tokenize_columns(dataframe, columns):
    """
    Tokenize the values in specified columns of a pandas DataFrame.

    Parameters:
        dataframe (pandas.DataFrame): The DataFrame to tokenize.
        columns (list): A list of column names to tokenize.

    Returns:
        pandas.DataFrame: A new DataFrame with tokenized values in the specified columns.
    """
    # Download necessary NLTK resources if they haven't been downloaded yet
    nltk.download('punkt')

    # Create a new DataFrame to hold the tokenized values
    tokenized_df = pd.DataFrame()

    # Tokenize the values in each specified column
    for col in columns:
        # Tokenize the values in the current column using NLTK's word_tokenize function
        tokenized_values = dataframe[col].apply(nltk.word_tokenize)

        # Add the tokenized values to the new DataFrame
        tokenized_df[col] = tokenized_values

    # Return the new DataFrame with tokenized values
    return tokenized_df

# Another way: plain string splitting instead of NLTK
# --------------------------------------------------------------------------
def tokenize(text, sep=' ', preserve_case=False):
    """
    Tokenize a string into a list of tokens.

    Parameters:
    text (str): String to be tokenized
    sep (str, optional): Separator to use for tokenization. Defaults to ' '.
    preserve_case (bool, optional): Whether to preserve the case of the text. Defaults to False.

    Returns:
    list: List of tokens
    """
    if not preserve_case:
        text = text.lower()
    tokens = text.split(sep)
    return tokens

def tokenize_df(df, column_names, sep=' ', preserve_case=False):
    """
    Tokenize a pandas dataframe with multiple columns.

    Parameters:
    df (pd.DataFrame): Dataframe to be tokenized
    column_names (list of str): List of column names to be tokenized
    sep (str, optional): Separator to use for tokenization. Defaults to ' '.
    preserve_case (bool, optional): Whether to preserve the case of the text. Defaults to False.

    Returns:
    pd.DataFrame: Tokenized dataframe
    """
    for col in column_names:
        df[col] = df[col].apply(lambda x: tokenize(x, sep, preserve_case))
    return df

# Example usage on a DataFrame with a "title" column
carbon_google1 = tokenize_df(carbon_google1, column_names=["title"], sep=' ', preserve_case=False)
# --------------------------------------------------------------------------------------------------------------
def bag_of_words_features(df, text_columns, target_columns=None):
    """
    Build a bag-of-words representation of one or more text columns.

    Parameters:
    df (pandas DataFrame): The DataFrame to extract features from.
    text_columns (list of str): Columns whose text is joined row by row and vectorized.
    target_columns (list of str, optional): Target column(s). Rows with a missing target are dropped.

    Returns:
    pandas DataFrame: The bag-of-words features, or a (features, target) tuple when target_columns is given.
    """
    if target_columns:
        # Drop rows without a target before vectorizing so that X and y stay aligned
        df = df.dropna(subset=target_columns)

    text_data = df[text_columns].apply(lambda x: " ".join([str(i) for i in x]), axis=1)

    text_data = text_data.str.lower()
    vectorizer = CountVectorizer(max_df=0.90, min_df=4, max_features=1000, stop_words=None)
    X_bow = vectorizer.fit_transform(text_data)
    # Use the new function to get the feature names
    feature_names = vectorizer.get_feature_names_out()

    X_bow = pd.DataFrame(X_bow.toarray(), columns=feature_names, index=df.index)
    
    if target_columns:        
        y = df[target_columns]
        return X_bow, y
    
    return X_bow
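
# To make the min_df=4 and max_df=0.90 settings above more concrete, here is a small
# pure-Python sketch of a bag-of-words matrix with document-frequency filtering. It
# assumes whitespace tokenization (CountVectorizer uses its own token pattern), and
# `bow_matrix` plus the toy sentences are made up for illustration.

```python
from collections import Counter


def bow_matrix(texts, min_df=1, max_df=1.0):
    """Count matrix over terms whose document frequency lies within [min_df, max_df * n_docs]."""
    tokenized = [t.lower().split() for t in texts]
    n_docs = len(tokenized)
    # document frequency: in how many documents does each term occur at least once
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    vocab = sorted(
        tok for tok, df_count in doc_freq.items()
        if df_count >= min_df and df_count <= max_df * n_docs
    )
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[tok] for tok in vocab])
    return vocab, rows


bow_vocab, bow_rows = bow_matrix(
    ["red apple", "green apple apple", "red pear"], min_df=2
)
print(bow_vocab)
print(bow_rows)
```

# Terms seen in fewer than min_df documents ("green", "pear") drop out of the vocabulary,
# which is why a high min_df on a tiny corpus can leave very few columns.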

def convert_lowercase(text):
    '''
    Function to lowercase the text
    Required Input - 
        - text - text string which needs to be lowercased
    Expected Output -
        - text - lower cased text string output
    '''
    return text.lower()

def remove_unwanted_characters(df, columns):
    """
    Remove unwanted characters (including smileys and emojis) from specified columns in a pandas DataFrame.
    The symbols $ # & * @ % and the emoji ranges listed below are stripped.

    Parameters:
    df (pd.DataFrame): The input DataFrame.
    columns (list): A list of column names to clean.

    Returns:
    pd.DataFrame: The cleaned DataFrame.
    """
    import re 
    unwanted_chars = '[$#&*@%]'
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\u2764\ufe0f" # heart emoji
                           "]+", flags=re.UNICODE)
    for col in columns:
        if col in df.columns:
            df[col] = df[col].apply(lambda x: emoji_pattern.sub(r'', x))
            df[col] = df[col].str.replace(unwanted_chars, '', regex=True)
        else:
            print(f"Column '{col}' does not exist in the DataFrame.")
    return df

def remove_punctuations(text):
    '''
    Function to remove punctuation from the text
    Required Input - 
        - text - text string 
    Expected Output -
        - text - text string with punctuation removed
    '''
    return text.translate(str.maketrans('', '', string.punctuation))

def remove_stopwords(text):
    '''
    Function to remove stopwords from the text
    Required Input - 
        - text - text string from which stopwords are removed
    Expected Output -
        - text - list output with stopwords removed
    '''
    return [word for word in text.split() if word not in eng_stop]

def remove_short_words(df, column_names, min_length=3):
    """Remove short words from columns in a pandas DataFrame.

    Parameters:
    df (pandas.DataFrame): The DataFrame to modify.
    column_names (List[str]): A list of column names to modify.
    min_length (int, optional): The minimum length of words to keep. Default is 3.

    Returns:
    pandas.DataFrame: The modified DataFrame with short words removed from specified columns.
    """
    for column_name in column_names:
        df[column_name] = df[column_name].apply(
            lambda x: ' '.join([word for word in x.split() if len(word) >= min_length])
        )
    return df

def convert_stemmer(word):
    '''
    Function to stem a word
    Required Input - 
        - word - word which needs to be stemmed
    Expected Output -
        - text - word output after stemming
    '''
    porter_stemmer = PorterStemmer()
    return porter_stemmer.stem(word)

def stem_df(df, column_names):
    """
    Perform stemming on a pandas dataframe with multiple columns.

    Parameters:
    df (pd.DataFrame): Dataframe to be stemmed
    column_names (list of str): List of column names to be stemmed; each cell must hold a list of tokens

    Returns:
    pd.DataFrame: Stemmed dataframe
    """
    stemmer = PorterStemmer()
    for col in column_names:
        df[col] = df[col].apply(lambda x: [stemmer.stem(i) for i in x])
    return df

def convert_lemmatizer(word):
    '''
    Function to lemmatize a word
    Required Input - 
        - word - word which needs to be lemmatized
    Expected Output -
        - word - word output after lemmatizing
    '''
    wordnet_lemmatizer = WordNetLemmatizer()
    return wordnet_lemmatizer.lemmatize(word)
    
def create_tf_idf(df, column, train_df = None, test_df = None,n_features = None):
    '''
    Function to do tf-idf on a pandas dataframe
    Required Input -
        - df = Pandas DataFrame
        - column = name of column containing text
        - train_df(optional) = Train DataFrame
        - test_df(optional) = Test DataFrame
        - n_features(optional) = Maximum number of features needed
    Expected Output -
        - train_tfidf = train tf-idf sparse matrix output
        - test_tfidf = test tf-idf sparse matrix output
        - tfidf_obj = tf-idf model
    '''
    tfidf_obj = TfidfVectorizer(ngram_range=(1,1), stop_words='english', 
                                analyzer='word', max_features = n_features)
    # .ix was removed in pandas 1.0; .loc selects the column by label
    tfidf_text = tfidf_obj.fit_transform(df.loc[:, column].values)
    
    if train_df is not None:        
        train_tfidf = tfidf_obj.transform(train_df.loc[:, column].values)
    else:
        train_tfidf = tfidf_text

    test_tfidf = None
    if test_df is not None:
        test_tfidf = tfidf_obj.transform(test_df.loc[:, column].values)

    return train_tfidf, test_tfidf, tfidf_obj
    
def create_countvector(df, column, train_df = None, test_df = None,n_features = None):
    '''
    Function to do count vectorizer on a pandas dataframe
    Required Input -
        - df = Pandas DataFrame
        - column = name of column containing text
        - train_df(optional) = Train DataFrame
        - test_df(optional) = Test DataFrame
        - n_features(optional) = Maximum number of features needed
    Expected Output -
        - train_cvect = train count vectorized sparse matrix output
        - test_cvect = test count vectorized sparse matrix output
        - cvect_obj = count vectorized model
    '''
    cvect_obj = CountVectorizer(ngram_range=(1,1), stop_words='english', 
                                analyzer='word', max_features = n_features)
    # .ix was removed in pandas 1.0; .loc selects the column by label
    cvect_text = cvect_obj.fit_transform(df.loc[:, column].values)
    
    if train_df is not None:
        train_cvect = cvect_obj.transform(train_df.loc[:, column].values)
    else:
        train_cvect = cvect_text
        
    test_cvect = None
    if test_df is not None:
        test_cvect = cvect_obj.transform(test_df.loc[:, column].values)

    return train_cvect, test_cvect, cvect_obj

### NLP Text Preprocessing Steps for Machine Learning Algorithms

In [None]:
# 1. Tokenization
# 2. Stopword Removal
# 3. Stemming
# 4. Lemmatization
# 5. Part-of-speech (POS) tagging
# 6. Named Entity Recognition (NER)
# 7. Spell Checking and Correction
# 8. Removing HTML tags, punctuation, and special characters
# 9. Converting to Lowercase
# 10. Text Vectorization



# 1. Tokenization
# The process of converting a raw text into a sequence of tokens (words, phrases, symbols, etc.) is called tokenization.

    from nltk.tokenize import word_tokenize
    text = "This is a sample text for tokenization."
    tokens = word_tokenize(text)
    print(tokens)
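
# For intuition about what a tokenizer does with punctuation, here is a rough regex-based
# sketch that needs no downloaded NLTK resources. It is not NLTK's algorithm (word_tokenize
# also handles contractions and similar cases); `simple_tokenize` is a hypothetical helper
# written for this example.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


simple_tokens = simple_tokenize("This is a sample text for tokenization.")
print(simple_tokens)
```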

# 2. Stopword Removal
# Stopwords are commonly used words in a language, such as “the,” “and,” “a,” etc., that do not add much meaning to the 
# text. Removing these words helps to reduce the noise in the text data.

    from nltk.corpus import stopwords
    stop_words = set(stopwords.words("english"))
    # NLTK's stopword list is lowercase, so compare case-insensitively
    tokens = [word for word in tokens if word.lower() not in stop_words]
    print(tokens)

# 3. Stemming
# Stemming is the process of reducing a word to its base or root form. For example, the words “jumping”, “jumps”, and 
# “jumped” would all be reduced to “jump” by a stemming algorithm. The main goal of stemming is to reduce different 
# forms of a word to a common base form, which can help in tasks like text classification, sentiment analysis, and 
# information retrieval.

    from nltk.stem import PorterStemmer
    stemmer = PorterStemmer()
    stemmed_words = [stemmer.stem(word) for word in tokens]
    print(stemmed_words)
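
# The idea of collapsing "jumping", "jumps" and "jumped" to "jump" can be sketched with
# naive suffix stripping. This toy version is far simpler than the Porter algorithm used
# above (it has no rules for doubled consonants, "-ies", and so on); `toy_stem` and its
# suffix list are assumptions chosen for this example.

```python
def toy_stem(word):
    """Strip one common suffix while keeping at least three characters of stem."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


toy_stems = [toy_stem(w) for w in ["jumping", "jumps", "jumped", "jump", "is"]]
print(toy_stems)
```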

# 4. Lemmatization
# Lemmatization is the process of reducing words to their base or dictionary form (known as a lemma) so that they can be 
# analyzed as a single item, rather than multiple different forms. For example, the word “running” can be reduced to 
# its base form “run” through lemmatization.

    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    # lemmatize() treats each word as a noun by default; pass pos="v" to map "running" to "run"
    lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]
    print(lemmatized_words)

# 5. Part-of-speech (POS) tagging
# Part-of-speech (POS) tagging is the process of identifying and labeling the part of speech of each word in a sentence, 
# such as a noun, verb, adjective, adverb, etc. POS tagging is useful in various natural language processing tasks like 
# sentiment analysis, text classification, information extraction, and machine translation.

    import nltk
    # Sample sentence
    sentence = "The quick brown fox jumps over the lazy dog"
    # Tokenize the sentence into words
    words = nltk.word_tokenize(sentence)
    # Perform POS tagging
    pos_tags = nltk.pos_tag(words)
    # Print the POS tags
    print(pos_tags)

# 6. Named Entity Recognition (NER)
# Named Entity Recognition (NER) is a natural language processing technique that is used to identify and extract the 
# named entities from a given text. Named entities can be anything like a person, organization, location, product, etc.

    import spacy
    # Load the English language model
    nlp = spacy.load("en_core_web_sm")
    # Sample text for NER
    text = "Apple is looking at buying U.K. startup for $1 billion"
    # Process the text with the language model
    doc = nlp(text)
    # Extract named entities from the text
    for ent in doc.ents:
        print(ent.text, ent.label_)

# 7. Spell Checking and Correction
# Spell checking and correction is the process of identifying and correcting spelling errors in the text. It is an 
# important step in text preprocessing as it can improve the accuracy of natural language processing algorithms that 
# are applied to text data.

    !pip install pyspellchecker

    from spellchecker import SpellChecker
    # initialize spell checker
    spell = SpellChecker()
    # example sentence with spelling errors
    sentence = "Ths sentnce hs spellng erors that nd to b corcted."
    # tokenize sentence
    tokens = sentence.split()
    # iterate over tokens and correct spelling errors
    for i in range(len(tokens)):
        # look up the most likely correction (may be None when no candidate is found)
        corrected = spell.correction(tokens[i])
        # replace the token only when a different spelling was suggested
        if corrected is not None and corrected != tokens[i]:
            tokens[i] = corrected
    # join corrected tokens back into sentence
    corrected_sentence = ' '.join(tokens)
    print(corrected_sentence)
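
# One common way to pick a correction is to choose the known word with the smallest edit
# distance to the misspelled token. The sketch below shows that idea with a hand-written
# Levenshtein distance and a tiny made-up vocabulary; it is an illustration only and does
# not reproduce how pyspellchecker ranks its candidates.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def closest_word(word, vocabulary):
    """Return the vocabulary entry nearest to `word` (the first one wins on ties)."""
    return min(vocabulary, key=lambda cand: edit_distance(word, cand))


toy_vocab = ["sentence", "spelling", "errors", "corrected"]
toy_fixed = [closest_word(w, toy_vocab) for w in ["sentnce", "spellng", "erors"]]
print(toy_fixed)
```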

# 8. Removing HTML tags, punctuation, and special characters
# Removing HTML tags, punctuation, and special characters is necessary for text preprocessing to clean the text data 
# and make it ready for further processing. HTML tags, punctuation, and special characters do not contribute to the 
# meaning of the text and can cause issues during text analysis.

    import re
    import string

    def remove_html_tags(text):
        clean_text = re.sub('<.*?>', '', text)
        return clean_text

    def remove_punctuation(text):
        clean_text = text.translate(str.maketrans('', '', string.punctuation))
        return clean_text

    def remove_special_characters(text):
        clean_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
        return clean_text

    text = "<p>Hello, world!</p>"
    clean_text = remove_html_tags(text)
    clean_text = remove_punctuation(clean_text)
    clean_text = remove_special_characters(clean_text)
    print(clean_text)

# 9. Converting to Lowercase
# Lowercasing the text is a common preprocessing step in natural language processing (NLP) to make text data 
# consistent and easier to analyze. This step involves converting all the letters in the text to lowercase so 
# that words that differ only by the case are treated as the same word.

    text = "This is a sample TEXT for preprocessing"
    text = text.lower()
    print(text)

# 10. Text Vectorization
# Text vectorization is the process of transforming raw text into a numerical representation that can be used by 
# machine learning algorithms. This is a crucial step in text preprocessing as most machine learning algorithms work 
# with numerical data. There are several ways to vectorize text, including Bag of Words (BoW), Term Frequency-Inverse 
# Document Frequency (TF-IDF), and Word Embeddings.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    # Example text corpus
    corpus = ["This is the first document.", 
    "This document is the second document.", 
    "And this is the third one.", 
    "Is this the first document?"]

    # Vectorize text using BoW representation
    vectorizer = CountVectorizer()
    X_bow = vectorizer.fit_transform(corpus)

    print("BoW representation:")
    print(X_bow.toarray())
    print("Vocabulary:")
    print(vectorizer.get_feature_names_out())

    # Vectorize text using TF-IDF representation
    vectorizer = TfidfVectorizer()
    X_tfidf = vectorizer.fit_transform(corpus)

    print("TF-IDF representation:")
    print(X_tfidf.toarray())
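
# To show where the TF-IDF numbers come from, the sketch below computes them by hand for
# two pre-tokenized documents. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1
# followed by L2 normalization of each row, which matches TfidfVectorizer's default
# settings as far as I know; `manual_tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def manual_tfidf(tokenized_docs):
    """Return (vocabulary, rows) where each row is an L2-normalized TF-IDF vector."""
    n_docs = len(tokenized_docs)
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    # document frequency: number of documents containing each term
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard against empty documents
        rows.append([v / norm for v in raw])
    return vocab, rows


toy_docs = [
    ["this", "is", "the", "first", "document"],
    ["this", "document", "is", "the", "second", "document"],
]
tfidf_vocab, tfidf_rows = manual_tfidf(toy_docs)
print(tfidf_vocab)
print(tfidf_rows)
```

# "first" occurs in only one document, so it ends up with a larger weight than "this",
# which occurs in both.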


### Recommendation Systems (Recsys)

In [None]:
import pandas as pd
import numpy as np
from scipy import sparse
from lightfm import LightFM
from sklearn.metrics.pairwise import cosine_similarity

def create_interaction_matrix(df,user_col, item_col, rating_col, norm= False, threshold = None):
    '''
    Function to create an interaction matrix dataframe from transactional type interactions
    Required Input -
        - df = Pandas DataFrame containing user-item interactions
        - user_col = column name containing user's identifier
        - item_col = column name containing item's identifier
        - rating_col = column name containing user feedback on interaction with a given item
        - norm (optional) = True if a normalization of ratings is needed
        - threshold (required if norm = True) = value above which the rating is favorable
    Expected output - 
        - Pandas dataframe with user-item interactions ready to be fed in a recommendation algorithm
    '''
    interactions = df.groupby([user_col, item_col])[rating_col] \
            .sum().unstack().reset_index(). \
            fillna(0).set_index(user_col)
    if norm:
        interactions = interactions.applymap(lambda x: 1 if x > threshold else 0)
    return interactions
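
# The pivot performed by create_interaction_matrix can be sketched without pandas: sum the
# ratings per (user, item) pair, fill missing pairs with 0, and optionally binarize against
# a threshold. `pivot_interactions` and the toy log below are assumptions made for this
# illustration.

```python
from collections import defaultdict


def pivot_interactions(rows, threshold=None):
    """rows: iterable of (user, item, rating). Returns (users, items, matrix)."""
    totals = defaultdict(float)
    for user, item, rating in rows:
        totals[(user, item)] += rating  # repeated interactions are summed
    users = sorted({u for u, _ in totals})
    items = sorted({i for _, i in totals})
    matrix = []
    for u in users:
        row = [totals.get((u, i), 0.0) for i in items]
        if threshold is not None:
            # 1 marks a favorable interaction, 0 everything else
            row = [1 if v > threshold else 0 for v in row]
        matrix.append(row)
    return users, items, matrix


toy_log = [("u1", "a", 5), ("u1", "b", 1), ("u2", "a", 2), ("u2", "a", 2), ("u2", "c", 4)]
toy_users, toy_items, toy_matrix = pivot_interactions(toy_log, threshold=3)
print(toy_users, toy_items)
print(toy_matrix)
```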

def create_user_dict(interactions):
    '''
    Function to create a user dictionary mapping each user ID to its row position in the interaction dataset
    Required Input - 
        interactions - dataset created by create_interaction_matrix
    Expected Output -
        user_dict - Dictionary type output containing user_id as key and interaction_index as value
    '''
    user_id = list(interactions.index)
    user_dict = {}
    counter = 0 
    for i in user_id:
        user_dict[i] = counter
        counter += 1
    return user_dict
    
def create_item_dict(df,id_col,name_col):
    '''
    Function to create an item dictionary based on their item_id and item name
    Required Input - 
        - df = Pandas dataframe with Item information
        - id_col = Column name containing unique identifier for an item
        - name_col = Column name containing name of the item
    Expected Output -
        item_dict = Dictionary type output containing item_id as key and item_name as value
    '''
    item_dict ={}
    for i in range(df.shape[0]):
        item_dict[(df.loc[i,id_col])] = df.loc[i,name_col]
    return item_dict

def runMF(interactions, n_components=30, loss='warp', k=15, epoch=30,n_jobs = 4):
    '''
    Function to run matrix-factorization algorithm
    Required Input -
        - interactions = dataset created by create_interaction_matrix
        - n_components = number of embeddings you want to create to define Item and user
        - loss = loss function; other options are logistic, bpr, warp-kos
        - k = for k-OS training, the k-th positive example is selected from the positives sampled for each user
        - epoch = number of epochs to run 
        - n_jobs = number of cores used for execution 
    Expected Output  -
        Model - Trained model
    '''
    x = sparse.csr_matrix(interactions.values)
    model = LightFM(no_components= n_components, loss=loss,k=k)
    model.fit(x,epochs=epoch,num_threads = n_jobs)
    return model
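
# As a rough picture of how n_components is used: a factorization model of this kind scores
# a user-item pair as the dot product of their embedding vectors plus bias terms. The
# sketch below uses made-up three-component vectors; `pair_score` is a hypothetical helper
# and is not part of LightFM's API.

```python
def pair_score(user_vec, item_vec, user_bias=0.0, item_bias=0.0):
    """Dot product of two equal-length embeddings plus the two bias terms."""
    if len(user_vec) != len(item_vec):
        raise ValueError("embeddings must have the same number of components")
    return sum(u * i for u, i in zip(user_vec, item_vec)) + user_bias + item_bias


toy_user = [0.5, -1.0, 2.0]
toy_item_vecs = {"item_a": [1.0, 0.0, 1.0], "item_b": [0.0, 1.0, 0.0]}
pair_scores = {name: pair_score(toy_user, vec) for name, vec in toy_item_vecs.items()}
print(pair_scores)
```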

def sample_recommendation_user(model, interactions, user_id, user_dict, 
                               item_dict,threshold = 0,nrec_items = 10, show = True):
    '''
    Function to produce user recommendations
    Required Input - 
        - model = Trained matrix factorization model
        - interactions = dataset used for training the model
        - user_id = user ID for which we need to generate recommendation
        - user_dict = Dictionary type input containing user_id as key and interaction_index as value
        - item_dict = Dictionary type input containing item_id as key and item_name as value
        - threshold = value above which the rating is favorable in new interaction matrix
        - nrec_items = Number of output recommendation needed
    Expected Output - 
        - Prints list of items the given user has already bought
        - Prints list of N recommended items  which user hopefully will be interested in
    '''
    n_users, n_items = interactions.shape
    user_x = user_dict[user_id]
    scores = pd.Series(model.predict(user_x,np.arange(n_items)))
    scores.index = interactions.columns
    scores = list(pd.Series(scores.sort_values(ascending=False).index))
    
    # Items the user has already interacted with (rating above threshold)
    user_row = interactions.loc[user_id, :]
    known_items = list(pd.Series(user_row[user_row > threshold].index)
                       .sort_values(ascending=False))
    
    scores = [x for x in scores if x not in known_items]
    return_score_list = scores[0:nrec_items]
    known_items = list(pd.Series(known_items).apply(lambda x: item_dict[x]))
    scores = list(pd.Series(return_score_list).apply(lambda x: item_dict[x]))
    if show:
        print("Known Likes:")
        for counter, name in enumerate(known_items, start=1):
            # str() keeps the print working when item names are not strings
            print(str(counter) + '- ' + str(name))

        print("\n Recommended Items:")
        for counter, name in enumerate(scores, start=1):
            print(str(counter) + '- ' + str(name))
    return return_score_list
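
# Illustrative sketch (not part of the original helpers).
# The ranking step above can be tried without a trained model: score every
# item, sort the item IDs by score, drop what the user already knows, and keep
# the top N. The name sketch_rank_unseen and the toy scores are hypothetical;
# in the real function the scores come from model.predict.
import pandas as pd


def sketch_rank_unseen(scores, item_ids, known_items, nrec_items=10):
    # Label each score with its item ID, then order the IDs from best to worst
    ranked = pd.Series(scores, index=item_ids).sort_values(ascending=False).index
    known = set(known_items)  # set membership avoids rescanning a list per item
    return [item for item in ranked if item not in known][:nrec_items]


# Example call with made-up scores for four items, where 'b' is already known:
# sketch_rank_unseen([0.1, 0.9, 0.5, 0.7], ['a', 'b', 'c', 'd'], ['b'], 2)
# returns ['d', 'c']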
    

def sample_recommendation_item(model,interactions,item_id,user_dict,item_dict,number_of_user):
    '''
    Function to produce a list of the top N interested users for a given item
    Required Input -
        - model = Trained matrix factorization model
        - interactions = dataset used for training the model
        - item_id = item ID for which we need to generate recommended users
        - user_dict = Dictionary type input containing user_id as key and interaction_index as value
        - item_dict = Dictionary type input containing item_id as key and item_name as value
        - number_of_user = Number of users needed as an output
    Expected Output -
        - user_list = List of recommended users 
    '''
    n_users, n_items = interactions.shape
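    # Look up the item's column position directly; searchsorted would only be
    # correct when interactions.columns is sorted.
    item_x = interactions.columns.get_loc(item_id)
    scores = pd.Series(model.predict(np.arange(n_users), np.repeat(item_x, n_users)))
    user_list = list(interactions.index[scores.sort_values(ascending=False).head(number_of_user).index])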
    return user_list 
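
# Illustrative sketch (not part of the original helpers).
# The model is queried with the item's column position, not its ID. That
# position can be found with Index.get_loc or, when the columns are sorted,
# with searchsorted. The name sketch_item_position and the toy column lists
# are hypothetical, and unique item IDs are assumed.
import numpy as np
import pandas as pd


def sketch_item_position(columns, item_id):
    # get_loc returns the actual position of a unique label
    by_label = pd.Index(columns).get_loc(item_id)
    # searchsorted returns an insertion point and assumes sorted input
    by_search = int(np.array(columns).searchsorted(item_id))
    return by_label, by_search


# With sorted columns both lookups agree:
# sketch_item_position([10, 20, 30], 20) returns (1, 1)
# With unsorted columns only the get_loc result is reliable.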


def create_item_emdedding_distance_matrix(model,interactions):
    '''
    Function to create an item-item similarity matrix from the item embeddings
    Required Input -
        - model = Trained matrix factorization model
        - interactions = dataset used for training the model
    Expected Output -
        - item_emdedding_distance_matrix = Pandas dataframe containing cosine similarity b/w items (higher means more similar, despite "distance" in the name)
    '''
    df_item_norm_sparse = sparse.csr_matrix(model.item_embeddings)
    similarities = cosine_similarity(df_item_norm_sparse)
    item_emdedding_distance_matrix = pd.DataFrame(similarities)
    item_emdedding_distance_matrix.columns = interactions.columns
    item_emdedding_distance_matrix.index = interactions.columns
    return item_emdedding_distance_matrix
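
# Illustrative sketch (not part of the original helpers).
# Cosine similarity between embedding rows is the dot product of the
# L2-normalised rows. The name sketch_cosine_similarity and the toy embeddings
# are hypothetical, and the sketch assumes no all-zero rows.
import numpy as np


def sketch_cosine_similarity(embeddings):
    emb = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)  # one norm per row
    unit = emb / norms
    return unit @ unit.T  # entry (i, j) is the cosine of the angle between rows i and j


# Example: rows [1, 0] and [2, 0] point the same way (similarity 1.0), while
# [1, 0] and [0, 3] are orthogonal (similarity 0.0):
# sketch_cosine_similarity([[1, 0], [2, 0], [0, 3]])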

def item_item_recommendation(item_emdedding_distance_matrix, item_id, 
                             item_dict, n_items = 10, show = True):
    '''
    Function to create item-item recommendation
    Required Input - 
        - item_emdedding_distance_matrix = Pandas dataframe containing cosine similarity b/w items
        - item_id  = item ID for which we need to generate recommended items
        - item_dict = Dictionary type input containing item_id as key and item_name as value
        - n_items = Number of items needed as an output
        - show = if True, print the item of interest and the similar items
    Expected Output -
        - recommended_items = List of recommended items
    '''
    # Sort by similarity, then skip the first entry (the item itself)
    recommended_items = list(item_emdedding_distance_matrix.loc[item_id, :]
                             .sort_values(ascending=False)
                             .head(n_items + 1)
                             .index[1:n_items + 1])
    if show:
        print("Item of interest :{0}".format(item_dict[item_id]))
        print("Item similar to the above item:")
        counter = 1
        for i in recommended_items:
            print(str(counter) + '- ' +  item_dict[i])
            counter+=1
    return recommended_items
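
# Illustrative sketch (not part of the original helpers).
# A variant of the lookup above that removes the query item by label instead of
# assuming it sorts first, which also holds when another item ties with it.
# The name sketch_most_similar and the toy matrix are hypothetical.
import pandas as pd


def sketch_most_similar(similarity_df, item_id, n_items=10):
    # Take the item's row of similarities and drop its own entry
    row = similarity_df.loc[item_id].drop(item_id)
    return list(row.sort_values(ascending=False).head(n_items).index)


# Example with a made-up 3x3 similarity matrix:
# toy = pd.DataFrame([[1.0, 0.2, 0.8], [0.2, 1.0, 0.5], [0.8, 0.5, 1.0]],
#                    index=['a', 'b', 'c'], columns=['a', 'b', 'c'])
# sketch_most_similar(toy, 'a', 2) returns ['c', 'b']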