### Machine Learning Glossary

In [None]:
# https://ml-cheatsheet.readthedocs.io/en/latest/index.html   # explains various ML concepts


# https://developers.google.com/machine-learning/glossary     # click the dropdown and choose "Fundamentals" to filter for the most important terms


### Machine Learning Models to Master

In [None]:
'''
# Machine Learning Algorithms to Master  

1. Linear and Multiple Linear Regression
2. Logistic Regression
3. Decision Trees
4. Naive Bayes
5. K-Nearest Neighbors
6. Support Vector Machines
7. Random Forests
8. Neural Networks
    1. Convolutional Neural Network (CNN)
    2. Recurrent Neural Network (RNN)
    3. Long Short-Term Memory (LSTM)
    4. Generative Adversarial Network (GAN)
    5. Deep Belief Network (DBN)
    6. Deep Boltzmann Machine (DBM)
    7. Autoencoders
    8. Restricted Boltzmann Machines (RBM)
    9. Hopfield Networks
    10. Self-Organizing Maps (SOM)
9. Gradient Boosting
    1. XGBoost
    2. LightGBM
    3. CatBoost
    4. Gradient Boosting Machines (GBM)
    5. Stochastic Gradient Boosting (SGB)
    6. AdaBoost
    7. Gradient Boosted Decision Trees (GBDT)
    8. DeepBoost
    9. Neural Network Boosting (NNBoost)
    10. Gradient Boosted Regression Trees (GBRT)
10. Reinforcement Learning
11. Dimensionality Reduction Algorithms
    1. Principal Component Analysis (PCA)
    2. Linear Discriminant Analysis (LDA)
    3. Independent Component Analysis (ICA)
    4. Non-Negative Matrix Factorization (NMF)
    5. Factor Analysis
    6. Singular Value Decomposition (SVD)
    7. t-Distributed Stochastic Neighbor Embedding (t-SNE)
    8. Uniform Manifold Approximation and Projection (UMAP)
    9. Autoencoders
    10. Random Projection
    11. Feature Selection
    12. Locally Linear Embedding (LLE)
12. Clustering Algorithms
    1. K-Means Clustering
    2. Hierarchical Clustering
    3. Expectation-Maximization (EM) Clustering
    4. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
    5. Mean-Shift Clustering
    6. Gaussian Mixture Model (GMM) Clustering
    7. Spectral Clustering
    8. Affinity Propagation Clustering
    9. BIRCH Clustering
    10. OPTICS Clustering
13. Autoencoders
14. Transfer Learning
15. Generative Adversarial Networks (GANs)


Data Preprocessing:
    importing the required libraries
    importing the dataset
    handling missing data
    encoding the categorical data
    feature engineering
    splitting the dataset into training and test sets
    feature scaling 

Developing the Model:
    model selection
    model evaluation
    model persistence
    ensemble methods
    feature extraction
    feature selection
    feature engineering
    hyperparameter tuning
   
    '''

### Cost Function

In [None]:
# Cost Function
# A cost function, also known as a loss function or objective function, 
# is a mathematical function that measures the difference between predicted and actual values in machine learning. 
# The purpose of a cost function is to guide the learning algorithm towards finding the optimal model parameters that minimize 
# the difference between the predicted and actual values.

# The choice of cost function depends on the type of problem and the learning algorithm used. 
# Here are some common examples of cost functions and their equations:

# 1. Mean Squared Error (MSE): This cost function is used for regression problems where the goal is to predict a continuous 
#     variable. It measures the average squared difference between the predicted and actual values. The equation for MSE is:

#         MSE = 1/n * ∑(y - y_pred)^2
#         where n is the number of samples, y is the actual value, and y_pred is the predicted value.

# 2. Binary Cross-Entropy: This cost function is used for binary classification problems where the output is either 0 or 1. 
#     It measures the difference between the predicted probability and the actual label. 
#     The equation for binary cross-entropy is:

#         Binary cross-entropy = -1/n * ∑(y * log(y_pred) + (1-y) * log(1-y_pred))
#         where n is the number of samples, y is the actual label (0 or 1), and y_pred is the predicted probability.

# 3. Categorical Cross-Entropy: This cost function is used for multi-class classification problems where the output 
#     can be one of several classes. It measures the difference between the predicted probability distribution and the actual 
#     label. The equation for categorical cross-entropy is:

#         Categorical cross-entropy = -1/n * ∑∑(y_ij * log(y_pred_ij))
#         where n is the number of samples, y_ij is the actual probability for class j in sample i, and y_pred_ij is the predicted probability for class j in sample i.
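
The three formulas above can be written directly in NumPy. This is a minimal sketch rather than a reference implementation: the function names and the `eps` clipping constant (used only to avoid `log(0)`) are assumptions and do not come from the notes.

```python
import numpy as np

def mean_squared_error(y, y_pred):
    """MSE = 1/n * sum((y - y_pred)^2)."""
    y, y_pred = np.asarray(y, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((y - y_pred) ** 2)

def binary_cross_entropy(y, y_pred, eps=1e-12):
    """-1/n * sum(y * log(y_pred) + (1 - y) * log(1 - y_pred))."""
    y, y_pred = np.asarray(y, dtype=float), np.asarray(y_pred, dtype=float)
    y_pred = np.clip(y_pred, eps, 1 - eps)  # assumption: clip to keep log() finite
    return -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

def categorical_cross_entropy(y, y_pred, eps=1e-12):
    """-1/n * sum_i sum_j y_ij * log(y_pred_ij); y is one-hot with shape (n, classes)."""
    y, y_pred = np.asarray(y, dtype=float), np.asarray(y_pred, dtype=float)
    y_pred = np.clip(y_pred, eps, 1.0)  # assumption: clip to keep log() finite
    return -np.mean(np.sum(y * np.log(y_pred), axis=1))

# Small hand-checkable inputs
mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])      # (1 + 4) / 2
bce = binary_cross_entropy([1, 0], [0.5, 0.5])        # ln(2)
cce = categorical_cross_entropy(
    [[1, 0, 0], [0, 1, 0]],
    [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]],
)                                                     # ln(2)
print(mse, bce, cce)
```

In practice, `sklearn.metrics.mean_squared_error` and `sklearn.metrics.log_loss` cover the same ground.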

### Time Series

In [None]:
# Resample to daily frequency (the DataFrames must have a DatetimeIndex)

df_train_day = df_train.resample('D').mean() 
df_test_day = df_test.resample('D').mean()

# Extract calendar features from the DatetimeIndex
df['hour'] = df.index.hour 
df['day'] = df.index.day 
df['weekday'] = df.index.day_name() 
df['month'] = df.index.month 
df['year'] = df.index.year 


# Plot the train/test split
import matplotlib.pyplot as plt
def plot_data_splitting(train, test):
    """
    Plots the training and test sets of a time series.

    Args:
    train (pandas.DataFrame): DataFrame containing the training set with a DatetimeIndex and a 'PJME_MW' column.
    test (pandas.DataFrame): DataFrame containing the test set with a DatetimeIndex and a 'PJME_MW' column.

    Returns:
    None
    """
    plt.figure(figsize=(20,8))

    plt.plot(train.index, train['PJME_MW'], label='Training Set')
    plt.plot(test.index, test['PJME_MW'], label='Test Set')

    plt.title('Data Splitting', weight='bold', fontsize=25, loc= "center", pad=20)
    plt.axvline(pd.Timestamp('2015-09-01'), color='black', ls='--', lw=3)  # split date, passed as a Timestamp to match the datetime axis
    plt.legend()
    plt.show()


# Time Series Split

# Using scikit-learn
from sklearn.model_selection import TimeSeriesSplit
# Assume 'X' is your feature matrix and 'y' is your target variable (NumPy arrays, ordered by time)
tscv = TimeSeriesSplit(n_splits=5)
# Each split trains on an expanding window and tests on the block that follows it
for train_index, test_index in tscv.split(X):
    # For pandas objects, use X.iloc[train_index] / y.iloc[train_index] instead
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

#using pandas
import pandas as pd
# Assume 'df' is your time series data stored in a pandas DataFrame
# Set the split point as a percentage of the data
train_size = 0.8
split_point = int(len(df) * train_size)
# Split the data into training and testing sets
train_df = df[:split_point]
test_df = df[split_point:]

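# Using evalml
import evalml
# Assume 'df' is your time series data stored in a pandas DataFrame, ordered by time
# Set the target variable name
target_name = "target_variable_name"
X = df.drop(columns=[target_name])
y = df[target_name]
# split_data takes X and y separately; for time series problem types it keeps the temporal order (no shuffling).
# "time series regression" is one of the time series problem types; check the signature against your installed evalml version.
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(X, y, problem_type="time series regression",
                                                                   test_size=0.2)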
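
The expanding-window behaviour of `TimeSeriesSplit` is easier to see on a toy series. The sketch below uses a made-up daily index and a made-up column name (`feature`), and indexes with `.iloc` because the data are pandas objects.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy series: 8 daily observations
index = pd.date_range("2020-01-01", periods=8, freq="D")
X = pd.DataFrame({"feature": np.arange(8)}, index=index)
y = pd.Series(np.arange(8) * 2, index=index)

tscv = TimeSeriesSplit(n_splits=3)
folds = []
for train_index, test_index in tscv.split(X):
    # Positional indexing keeps the rows in time order
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    folds.append((X_train.index.max(), X_test.index.min(), len(X_train), len(X_test)))
    print(f"train: {X_train.index.min().date()} to {X_train.index.max().date()} | "
          f"test: {X_test.index.min().date()} to {X_test.index.max().date()}")
```

Every training window ends before its test window starts, so no future information leaks into training.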





### Scikit-Learn

In [64]:
#Scikit-Learn Sub-modules

# Scikit-Learn library is organized into several sub-modules, each of which contains a set of related functions and classes. 
# Here are the main sub-modules in scikit-learn:

# Import pattern: from sklearn."sub-module" import "model"
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import fetch_openml, load_digits, load_iris
from sklearn.linear_model import LogisticRegression 


# sklearn.datasets: This sub-module provides a set of standard datasets for machine learning, including iris, 
#     digits, and breast cancer.
iris_data = load_iris()
iris_features = iris_data.data 
iris_target = iris_data.target

# Convert the data to a DataFrame
df = pd.DataFrame(iris_features, columns=iris_data.feature_names)

# Add the target variable to the DataFrame
df['target'] = iris_target 

# print(iris_data.DESCR) - Describes the data 
# iris_data.data: An array containing the feature values for each instance of the dataset.
# iris_data.target: An array containing the class labels (i.e., 0, 1, or 2) for each instance of the dataset.
# iris_data.target_names: An array containing the names of the three classes 
# iris_data.feature_names: An array containing the names of the attributes 

# Alternatively, load features and target directly as pandas objects
X, y = load_iris(return_X_y=True, as_frame=True) 
# fetch_openml downloads the dataset from openml.org (needs network access)
X, y = fetch_openml(name='blood-transfusion-service-center', as_frame=True, return_X_y=True)

X, y = load_digits(return_X_y=True, as_frame=True)
print(X.shape, y.shape)
# Plot the first 10 digits
fig, axes = plt.subplots(nrows=2, ncols=5, figsize=(10, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(X.iloc[i].values.reshape(8, 8), cmap='gray')
    ax.set_title(f"Digit {y.iloc[i]}")
plt.tight_layout()
plt.show()
        
# sklearn.model_selection: This sub-module contains functions for model selection, such as splitting data into 
#     training and test sets, cross-validation, and grid search.

# sklearn.preprocessing: This sub-module provides functions for preprocessing data, such as scaling, normalization, 
#     and encoding categorical variables.

# sklearn.feature_extraction: This sub-module contains tools for extracting features from raw data, 
#     such as text (CountVectorizer for bag-of-words counts, TfidfVectorizer) and dictionaries (DictVectorizer).

# sklearn.metrics: This sub-module provides functions for evaluating the performance of machine learning models, 
#     such as accuracy, precision, recall, and F1 score.

# sklearn.pipeline: This sub-module provides tools for building machine learning pipelines, 
#     which allows you to chain together multiple steps, such as feature extraction, preprocessing, and model selection.

# sklearn.decomposition: This sub-module provides classes for matrix factorization and decomposition, 
#     such as Principal Component Analysis (PCA), Non-negative Matrix Factorization (NMF), 
#     and Latent Dirichlet Allocation (LDA).

# sklearn.discriminant_analysis: This sub-module provides classes for linear and quadratic discriminant analysis, 
#     which are used for supervised classification tasks.

# sklearn.covariance: This sub-module provides classes for covariance estimation, such as Empirical Covariance and 
#     Shrunk Covariance.

# sklearn.exceptions: This sub-module contains custom exceptions raised by scikit-learn, such as NotFittedError and 
#     ConvergenceWarning.


#Models: 


# sklearn.linear_model: This sub-module contains classes for linear models, such as linear regression, 
#     logistic regression, and ridge regression.

# sklearn.tree: This sub-module provides classes for decision trees, such as DecisionTreeClassifier and 
#     DecisionTreeRegressor.

# sklearn.ensemble: This sub-module contains classes for ensemble models, such as random forests, AdaBoost, 
#     and Gradient Boosting.

# sklearn.cluster: This sub-module provides classes for clustering, such as KMeans and AgglomerativeClustering (hierarchical clustering).

# sklearn.neural_network: This sub-module contains classes for simple neural networks, such as the Multi-Layer Perceptron 
#     (MLPClassifier and MLPRegressor). Scikit-learn does not provide Convolutional Neural Networks (CNNs).

# sklearn.svm: This sub-module contains classes for Support Vector Machines (SVMs), such as SVC (classification) and SVR (regression).

# sklearn.manifold: This sub-module provides classes for manifold learning, such as t-SNE and Isomap.

# sklearn.naive_bayes: This sub-module provides classes for Naive Bayes models, such as Gaussian Naive Bayes and 
#     Multinomial Naive Bayes.

# sklearn.neighbors: This sub-module provides classes for k-Nearest Neighbors (k-NN) models, 
#     such as KNeighborsClassifier and KNeighborsRegressor.
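
A short end-to-end example shows how several of these sub-modules fit together. It is only a sketch: the 80/20 split, `random_state=42` and `max_iter=1000` are illustrative choices, not recommendations from the notes.

```python
from sklearn.datasets import load_iris                  # sklearn.datasets
from sklearn.linear_model import LogisticRegression     # sklearn.linear_model
from sklearn.metrics import accuracy_score              # sklearn.metrics
from sklearn.model_selection import train_test_split    # sklearn.model_selection
from sklearn.pipeline import Pipeline                   # sklearn.pipeline
from sklearn.preprocessing import StandardScaler        # sklearn.preprocessing

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Chain preprocessing and the model so both are fitted on the training data only
model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.3f}")
```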


### Decision Tree, Random Forest, and Ensembles

In [None]:
# https://ml-cheatsheet.readthedocs.io/en/latest/index.html   # use this to explain the concepts


# Decision Tree                          (use one-hot encoding for categorical features)
    # A supervised learning model composed of a set of conditions and leaves organized hierarchically. A decision tree works by successively 
    # splitting the dataset into smaller segments until the target values in a segment are all the same or the dataset can no longer be split. 
    # It is a greedy algorithm: it makes the best split available at each step without regard for global optimality.

# Steps for a decision tree (classification)
    # Start with all examples at the root node
    # Calculate the information gain for all possible features, and pick the one with the highest information gain
    # Split the dataset according to the selected feature, and create left and right branches of the tree
    # Keep repeating the splitting process until a stopping criterion is met:
        # when a node is 100% one class
        # when splitting a node would make the tree exceed the maximum depth
        # when the information gain from additional splits is less than a threshold
        # when the number of examples in a node is below a threshold
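
The information-gain criterion in the steps above can be computed by hand. The sketch below assumes entropy as the impurity measure (the notes do not name one; Gini impurity is a common alternative) and uses hypothetical helper names and toy labels.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels, mask):
    """Entropy reduction from splitting `labels` into labels[mask] and labels[~mask]."""
    labels = np.asarray(labels)
    mask = np.asarray(mask, dtype=bool)
    left, right = labels[mask], labels[~mask]
    w_left = left.size / labels.size
    w_right = right.size / labels.size
    return entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))

labels = np.array([1, 1, 0, 0])
gain_perfect = information_gain(labels, [True, True, False, False])   # separates the classes
gain_useless = information_gain(labels, [True, False, True, False])   # each side stays 50/50
print(gain_perfect, gain_useless)
```

A tree builder would evaluate this for every candidate feature (and threshold) and keep the split with the largest gain.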



# Random Forest 
    # An ensemble of decision trees in which each decision tree is trained with a specific random noise, such as bagging. 
    # Random forests are a type of decision forest.
    # Create multiple decision trees, each on a bootstrap sample of the data.
    # Combine the trees by averaging their predictions (regression) or by majority vote (classification) to improve accuracy.



# Bagging (Bootstrapping + Aggregating (or voting)), i.e., randomly creating samples (subsets) of the dataset with replacement, 
    # then building a model on each random subset. The multiple models are combined by taking a majority vote or 
    # averaging their predictions to make the final prediction or decision.
from sklearn.ensemble import BaggingClassifier, VotingClassifier, RandomForestClassifier
# Assumes X_train is already defined; `estimator` was named `base_estimator` before scikit-learn 1.2
bagging_classifier = BaggingClassifier(estimator=RandomForestClassifier(), n_estimators=15, max_samples=200, max_features=X_train.shape[1])
# Assumes log_classifier, sv_classifier and sgd_classifier are estimators defined elsewhere
vot_classifier = VotingClassifier(
    estimators=[('log_reg', log_classifier),
                ('svc', sv_classifier),
                ('sgd', sgd_classifier)],
    voting='hard')  # 'hard' = majority vote; 'soft' = average of predicted probabilities (optionally weighted via `weights`)
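
For a version of bagging and voting that runs on its own, here is a minimal sketch on the Iris dataset. The choice of base estimators, `max_samples=0.8` and the random seeds are assumptions made for illustration; the base estimator is passed positionally so the call does not depend on the `estimator`/`base_estimator` keyword name.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Bagging: 15 trees, each fitted on a bootstrap sample of 80% of the training rows
bagging = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=15, max_samples=0.8,
    bootstrap=True, random_state=0,
)
bagging.fit(X_train, y_train)
bagging_accuracy = bagging.score(X_test, y_test)

# Voting: three different models, combined by majority vote
voting = VotingClassifier(
    estimators=[
        ("log_reg", LogisticRegression(max_iter=1000)),
        ("svc", SVC()),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",
)
voting.fit(X_train, y_train)
voting_accuracy = voting.score(X_test, y_test)

print(f"bagging: {bagging_accuracy:.3f}, voting: {voting_accuracy:.3f}")
```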
# Boosting
    # Boosting is a machine learning ensemble technique that combines multiple base models to create a stronger overall model. 
    # Unlike bagging, which trains base models independently on random subsets of the training data, boosting 
    # trains a series of base models sequentially, where each subsequent base model focuses on correcting the errors 
    # made by the previous base models.
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

    # AdaBoost (Adaptive Boosting): 
    # AdaBoost is a widely used boosting algorithm that assigns higher weights to misclassified samples and weights 
    # each base model according to its accuracy. It places more emphasis on samples that are misclassified by the 
    # current ensemble and updates the weights of samples and base models accordingly.
    
    # Gradient Boosting: 
    # Gradient Boosting generalizes the idea behind AdaBoost by using gradient descent to minimize the loss 
    # function of the ensemble model. It sequentially fits each base model to the residuals (i.e., the differences between 
    # the true values and the current predictions; more generally, the negative gradient of the loss) left by the previous base models.
    
    # XGBoost (Extreme Gradient Boosting): XGBoost is a popular implementation of gradient boosting that incorporates 
    # additional optimizations for improved performance, such as parallelization, regularization, and handling of missing 
    # values.
from xgboost import XGBClassifier  # third-party package, installed separately from scikit-learn
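
The "fit each new model to the residuals" idea behind gradient boosting can be sketched in a few lines for squared-error regression. This is a hand-rolled illustration on synthetic data, not how `GradientBoostingClassifier` or XGBoost are implemented internally; the learning rate, tree depth and number of rounds are arbitrary assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (made up for illustration)
rng = np.random.default_rng(0)
X = np.linspace(0, 6, 80).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

learning_rate = 0.5
prediction = np.full_like(y, y.mean())          # start from a constant model
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(20):
    residuals = y - prediction                  # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # add a damped correction
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"training MSE: {initial_mse:.4f} -> {final_mse:.4f}")
```

For squared error the residuals are exactly the negative gradient of the loss, which is why this loop is a special case of gradient boosting.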

# Stacking and Blending 
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, StackingRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
    # Stacking refers to training a learning algorithm to combine the predictions of several other algorithms. The 
    # predictions of the base algorithms are used as input to train the stacking algorithm. This creates a 
    # meta-model that can leverage the predictions of the base models.
    
    # Blending, in the simple sense used here, combines the predictions of multiple models with a 
    # (weighted) average to get the final prediction. 

stacking_classifier = StackingClassifier(estimators=[('random forest', RandomForestClassifier()), 
                                                     ('decision trees', DecisionTreeClassifier()),
                                                     ('logistic regression', LogisticRegression())],
                                         stack_method='predict',
                                         final_estimator=RandomForestClassifier())    
        # final_estimator is the meta-model

In [None]:
'''

#### Ensemble Methods

- Feature extraction and selection are relatively manual processes. Bagging and boosting are automated or semi-automated approaches to determine which features to include.  
- At a high level, Ensemble Methods is about bringing together multiple models (called weak learners) so that the result is an incredibly powerful and more accurate model (called a strong learner). There are several strategies and tricks involved in this.

##### Motive

- There are two competing variables in finding a well-fitting machine learning model:
  - Bias: When a model has high bias, it doesn't do a good job of bending to the data.
    - Linear regression is an example of an algorithm that usually has high bias. Even with quite different datasets, we still end up fitting a straight line to the data.
  - Variance: When a model has high variance, this means that it changes drastically to meet the needs of every point in our dataset.
    - Decision tree is an example of an algorithm that tends to have high variance and low bias (especially decision trees with no early stopping parameters). A decision tree, as a high variance algorithm, will attempt to split every point into its own branch if possible. This is a trait of high variance, low bias algorithms - they are extremely flexible to fit exactly whatever data they see.
- By combining algorithms, we can often build models that perform better by meeting in the middle in terms of bias and variance. These ideas are based on minimizing bias and variance based on mathematical theories, like the central limit theorem.

##### Methods

- Introducing Randomness to high variance algorithms:
  - The introduction of randomness combats the tendency of these algorithms to overfit.
  - There are two main ways
    - **Bagging (Bootstrap Aggregation)**
      - Generate a group of weak learners that when combined together generate higher accuracy. Each weak learner is trained on a sample of the data (because the data may be huge and training all the data on multiple algorithms is computationally expensive).
      - Sample the data with replacement and fit your algorithm to each sample.
      - The weak learners are combined by voting. Each learner is applied to the data, and the predicted value with the most votes wins.
      - It reduces variance but keeps bias the same.
      - sklearn:
        - `sklearn.ensemble.BaggingClassifier`
        - `sklearn.ensemble.BaggingRegressor`
    - Subset the features
      - In each split of a decision tree or with each algorithm used in an ensemble, only a subset of the total possible features are used.

- **Boosting**
  - Assign strengths to each weak learner
  - Iteratively train learners, focusing on the examples misclassified by previous weak learners.
  - It is used with models that have high bias and accept weights on individual samples.
  - Types:
    - AdaBoost:
      - Discovered by Freund and Schapire in 1996
      - Steps:
        - First model that maximizes accuracy, minimizes errors.
        - Identify misclassified points from the previous step and apply weights to them, so that subsequent classifiers focus more on the misclassified samples. Each misclassified point has its weight scaled by $\frac{\text{correct points}}{\text{incorrect points}}$
        - Calculate the weight for the model as $\text{weight} = ln(\frac{\text{correct points}}{\text{incorrect points}}) = ln(\frac{\text{accuracy}}{\text{1 - accuracy}})$
        - Reiterate the same steps but the new model will try to correctly classify misclassified points in the previous step, with the help of the weights added to each misclassified point.
        - Finally, combine all models: in each region, add a model's weight where it predicts one class and subtract it where it predicts the other class. The sign of the total gives the combined prediction.
        ![AdaBoost-models-weights](ML%20images/adaboost-models-weights.png)
        ![AdaBoost-combine-models](ML%20images/adaboost-combine-models.png)
        ![AdaBoost-combined-model](ML%20images/adaboost-combined-model.png)
      - sklearn:
        - `sklearn.ensemble.AdaBoostClassifier`
          - Hyperparameters:
            - `estimator` (named `base_estimator` before scikit-learn 1.2): The model utilized for the weak learners (Warning: Don't forget to import the model that you decide to use for the weak learner).
            - `n_estimators`: The maximum number of weak learners used.
        - `sklearn.ensemble.AdaBoostRegressor`
        - `sklearn.ensemble.GradientBoostingClassifier`
    - XGBoost library
    
    ''' 
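
The model-weight formula from the AdaBoost steps above, ln(correct / incorrect), can be checked with a few numbers. This sketch follows the simplified description in the notes and is not scikit-learn's exact algorithm; the helper names and the counts are hypothetical, and a learner with zero errors would need special handling because the ratio is undefined.

```python
import math

def model_weight(n_correct, n_incorrect):
    """Weight of one weak learner: ln(correct / incorrect) = ln(accuracy / (1 - accuracy))."""
    return math.log(n_correct / n_incorrect)

def combine(votes, weights):
    """votes are +1 / -1 per learner; the sign of the weighted sum is the ensemble's prediction."""
    score = sum(w * v for v, w in zip(votes, weights))
    return 1 if score >= 0 else -1

weights = [
    model_weight(7, 1),   # accurate learner: large positive weight
    model_weight(4, 4),   # coin-flip learner: weight 0, so it is ignored
    model_weight(2, 6),   # worse than chance: negative weight, so its vote is flipped
]
prediction = combine([+1, -1, -1], weights)
print(weights, prediction)
```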

### Tips

In [None]:
# Missing Data:
#     Some tree-based implementations (e.g., XGBoost, LightGBM, and decision trees and random forests in recent 
#     scikit-learn releases) can handle missing data directly, while other models require imputation or removal of 
#     missing data. For example, scikit-learn's K-nearest neighbors (KNN) and Support Vector Machines (SVM) reject 
#     inputs containing missing values, so those values must be imputed or removed before training the model.

# Data Imbalance:
#     Resampling techniques such as oversampling (e.g., SMOTE, the Synthetic Minority Over-sampling Technique) and 
#     undersampling, as well as ensemble methods, can be used to address data imbalance. Some machine learning models, 
#     such as decision trees and random forests, tend to cope with imbalanced data better, while others may require 
#     handling of imbalanced data as a preprocessing step, such as oversampling or undersampling. 
#     For example, models like logistic regression, naive Bayes, KNN, and support vector machines (SVM) may require handling 
#     of imbalanced data.

# Feature Scaling:
#     Some machine learning models, such as k-nearest neighbors (KNN) and support vector machines (SVM), are sensitive 
#     to the scale of features and may require feature scaling.  On the other hand, decision trees and random forests 
#     are not sensitive to feature scaling and do not require this preprocessing step.

# Categorical Data:
#     In principle, tree-based models such as decision trees and random forests can split on categorical data directly, 
#     but scikit-learn's implementations expect numeric input, so categorical features still need encoding there. 
#     Other models, like logistic regression and support vector machines (SVM), always require encoded categorical data.

# Outliers:
#     Some machine learning models, such as decision trees and random forests, are less sensitive to outliers, while 
#     others, such as linear regression, SVM, and k-nearest neighbors (KNN), can be affected by outliers and may require 
#     handling of outliers as a preprocessing step.
    
# Dimensionality:
#     Refers to the number of features or variables in the dataset. High-dimensional data can lead to increased 
#     complexity, increased computation time, and reduced model performance. 
#     Some machine learning models, such as decision trees and random forests, are less sensitive to 
#     high-dimensional data, while others, such as logistic regression and support vector machines (SVM), 
#     may require handling of high-dimensional data as a preprocessing step.
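
To see why distance-based models are sensitive to feature scale, compare nearest neighbours before and after standardization. The four "customers" below are invented for illustration, and `nearest_to_first` is a hypothetical helper.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers described by (income, age)
X = np.array([
    [50_000.0, 25.0],   # a: the query point
    [50_100.0, 60.0],   # b: similar income, very different age
    [50_500.0, 26.0],   # c: similar income and similar age
    [60_000.0, 40.0],   # d: much higher income
])

def nearest_to_first(points):
    """Index of the row closest (Euclidean distance) to row 0, excluding row 0 itself."""
    distances = np.linalg.norm(points[1:] - points[0], axis=1)
    return int(np.argmin(distances)) + 1

nearest_raw = nearest_to_first(X)                                      # income dominates the distance
nearest_scaled = nearest_to_first(StandardScaler().fit_transform(X))   # both features count
print(nearest_raw, nearest_scaled)
```

On the raw data the nearest neighbour of `a` is `b`, because a 35-year age gap is tiny next to income differences measured in hundreds; after scaling it is `c`.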

# Reduce memory usage of the dataset
import numpy as np
import pandas as pd

def reduce_mem_usage(df):
    """ Iterate through all the columns of a dataframe and downcast the data type
        to reduce memory usage. Modifies df in place and returns it.
        Note: float16 keeps only about 3 significant digits, so downcasting floats can lose precision.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if pd.api.types.is_integer_dtype(col_type) and df[col].notna().all():
            c_min, c_max = df[col].min(), df[col].max()
            # Use the smallest signed integer type whose range covers the column
            for int_type in (np.int8, np.int16, np.int32, np.int64):
                if c_min >= np.iinfo(int_type).min and c_max <= np.iinfo(int_type).max:
                    df[col] = df[col].astype(int_type)
                    break
        elif pd.api.types.is_float_dtype(col_type):
            c_min, c_max = df[col].min(), df[col].max()
            # Use the smallest float type whose range covers the column
            for float_type in (np.float16, np.float32, np.float64):
                if c_min >= np.finfo(float_type).min and c_max <= np.finfo(float_type).max:
                    df[col] = df[col].astype(float_type)
                    break
        elif col_type == object:
            df[col] = df[col].astype('category')
        # Other dtypes (bool, datetime, category, ...) are left unchanged

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
       

### Data Pre-processing

In [None]:
""" 
1. Data Cleaning:
    a. Missing values:
        Removing the training example
        Filling in the missing value manually
        Using a standard value to replace the missing value
        Using a central tendency (mean, median, mode) of the attribute to replace the missing value
        Using a central tendency (mean, median, mode) of the attribute within the same class to replace the missing value
        Using the most probable value to fill in the missing value

    b. Noisy Data and Outliers: 
        Binning: The sorted values are divided into bins, and each value is then smoothed using the other values 
            in its bin (for example, by replacing it with the bin mean or median). 
        Regression:  Linear regression and multiple linear regression can be used to smooth the data, where the values 
            are conformed to a function.
        Outlier analysis: Approaches such as clustering can be used to detect outliers and deal with them.

    c. Remove Unwanted Data: Unwanted data is duplicate or irrelevant data. 
    
2. Data Integration:
    Data consolidation: The data is physically brought together to one data store. This usually involves Data Warehousing.
    Data propagation: Copying data from one location to another using applications is called data propagation
    Data virtualization: An interface is used to provide a real-time and unified view of data from multiple sources. 

3. Data Reduction:
    Missing values ratio: Attributes that have more missing values than a threshold are removed.
    Low variance filter: Normalized attributes that have variance less than a threshold are also removed 
        because little change in the data means little information.
    High correlation filter: Normalized attributes that have correlation coefficients more than a threshold are removed 
        because similar trends mean similar information is carried. The relationship is usually measured with 
        statistics such as Pearson's correlation coefficient (numeric attributes) or the chi-square test (categorical attributes).
    Principal component analysis: Principal component analysis, or PCA, is a statistical method that reduces the number 
        of attributes by projecting the data onto a few uncorrelated components that capture most of the variance.

4. Data Transformation:
    Smoothing: Eliminating noise in the data to see more data patterns.
    Attribute/feature construction: New attributes are constructed from the given set of attributes.
    Aggregation: Summary and aggregation operations are applied on the given set of attributes to come up with new attributes
    Normalization: The data in each attribute is scaled between a smaller range, for example, 0 to 1 or -1 to 1.
    Discretization: Raw values of the numeric attributes are replaced by discrete or conceptual intervals, 
        which can be further organized into higher-level intervals. 
    Concept hierarchy generation for nominal data: Values for nominal data are generalized to higher-order concepts.


"""

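A few of the cleaning, reduction and transformation steps listed above can be illustrated with pandas. The toy DataFrame, its column names and the variance threshold are all made up for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values and one constant column
df = pd.DataFrame({
    "label": ["a", "a", "a", "b", "b", "b"],
    "height": [1.0, np.nan, 3.0, 10.0, 12.0, np.nan],
    "constant": [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
})

# Missing values: replace with the mean of the whole attribute
overall_mean_filled = df["height"].fillna(df["height"].mean())

# Missing values: replace with the mean of the attribute within the same class
class_means = df.groupby("label")["height"].transform("mean")
class_mean_filled = df["height"].fillna(class_means)

# Low variance filter: flag numeric attributes whose variance is below a threshold
numeric = df.select_dtypes(include="number")
low_variance_cols = [col for col in numeric.columns if numeric[col].var() < 1e-8]

# Normalization: min-max scale the imputed attribute to the range 0 to 1
normalized = (class_mean_filled - class_mean_filled.min()) / (
    class_mean_filled.max() - class_mean_filled.min()
)

print(overall_mean_filled.tolist())
print(class_mean_filled.tolist())
print(low_variance_cols)
```

Class-wise imputation keeps the two groups apart, whereas the overall mean pulls both missing values toward the middle.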
>> Categorical Encoding

In [None]:
'''
# Categorical encoding
There are five common techniques to encode or convert categorical features into numbers:

Mapping Method
Ordinal Encoding
Label Encoding
Pandas Dummies
OneHot Encoding

The choice of categorical encoding method depends on several factors such as the type and nature of the data, 
the number of unique categories in the variable, the type of machine learning algorithm being used, and the performance 
of the encoding method on the dataset. Here are some general guidelines on when to use each method:

One-Hot Encoding
One-hot encoding is a useful technique for handling categorical variables with a small number of unique categories. 
It is particularly useful when the categories are nominal (unordered) or when there is no inherent order or hierarchy 
among the categories. One-hot encoding can be applied to both linear and tree-based machine learning models. 
However, one limitation of one-hot encoding is that it can lead to a high-dimensional feature space, which can be 
computationally expensive and may lead to the curse of dimensionality.

Label Encoding
Label encoding is a useful technique for handling categorical variables with a large number of unique categories, 
because it adds no new columns. It is particularly useful when the categories are ordinal (ordered) or when there is 
an inherent order or hierarchy among the categories. Label encoding can be applied to both linear and tree-based 
machine learning models. However, one limitation of label encoding is that it may introduce an arbitrary ordering 
among nominal categories, which can mislead models that treat the codes as magnitudes (e.g., linear models). 
In scikit-learn, LabelEncoder is intended for the target variable; OrdinalEncoder is the equivalent for input features.

In general, it is recommended to use one-hot encoding when dealing with nominal categorical variables and label encoding 
when dealing with ordinal categorical variables. However, it is important to consider the nature of the data and the 
performance of the encoding method on the specific dataset before making a decision on which method to use. 
Additionally, it is often useful to try both encoding methods and compare their performance on the dataset to determine 
the optimal encoding method.

'''
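
The encoding techniques listed above look like this in pandas and scikit-learn. The toy columns (`size`, `color`) and the category order are assumptions made for the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical data: 'size' is ordinal, 'color' is nominal
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "color": ["red", "blue", "red", "green"],
})

# Mapping method: an explicit dictionary defines the order
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_mapped"] = df["size"].map(size_order)

# Ordinal encoding: same idea, with the order passed to the encoder
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_ordinal"] = ordinal.fit_transform(df[["size"]]).ravel()

# Pandas dummies: one indicator column per category
color_dummies = pd.get_dummies(df["color"], prefix="color")

# One-hot encoding with scikit-learn (returns a sparse matrix by default)
onehot = OneHotEncoder(handle_unknown="ignore")
color_onehot = onehot.fit_transform(df[["color"]]).toarray()

print(df)
print(color_dummies)
print(color_onehot)
```

`OneHotEncoder` can be fitted on training data and reused on new data inside a pipeline, whereas `pd.get_dummies` is a one-off convenience.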

>> Mixed Variables

In [None]:
# ColumnTransformer is another transformer class in scikit-learn, but it is used to apply different transformations 
# to different columns of a dataset. It allows you to specify a list of transformers, where each transformer is 
# applied to a specific subset of columns.

# For example, if you have a dataset with both numerical and categorical features, you can use ColumnTransformer to 
# apply different preprocessing steps to each subset of features. You could apply one transformation (e.g., scaling) 
# to the numerical features and another transformation (e.g., one-hot encoding) to the categorical features.

# ColumnTransformer is especially useful when you have a dataset with a mix of different types of features and you want 
# to apply different preprocessing steps to each subset of features. It can also be used in combination with Pipeline 
# to create more complex preprocessing pipelines for your data.


import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris

# Load the Iris dataset as a DataFrame
iris = load_iris(as_frame=True)
X = iris.data.copy()
y = iris.target

# Iris has only four numerical columns, so derive a categorical column purely for illustration
X['petal_size'] = pd.cut(X['petal length (cm)'], bins=3, labels=['short', 'medium', 'long']).astype(str)

# Define the column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), [0, 1, 2, 3]),  # apply StandardScaler to numerical columns
        ('cat', OneHotEncoder(), [4])  # apply OneHotEncoder to categorical column
    ])

# Define the pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor)])

# Fit the pipeline to the data
clf.fit(X, y)




# Another example

# Import necessary libraries
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction import DictVectorizer

# Define the transformers to be used in the pipeline
# Keep only the 'subject' and 'body' columns of the input DataFrame
subject_body_transformer = FunctionTransformer(
    lambda x: x[['subject', 'body']], validate=False)

# ColumnTransformer passes the 'body' column as a 1-D Series, so iterate over it directly
text_stats_transformer = FunctionTransformer(
    lambda x: [{'length': len(text)} for text in x], validate=False)

# Define the pipeline with three steps:
# 1. Extract the 'subject' and 'body' columns
# 2. Combine the 'subject' and 'body' features with a ColumnTransformer
# 3. Apply a LinearSVC classifier to the combined features
pipeline = Pipeline(steps=[
    # Step 1: Extract 'subject' and 'body' columns
    ('subject_body', subject_body_transformer),
    # Step 2: Combine 'subject' and 'body' features with a ColumnTransformer
    ('union', ColumnTransformer([
        # Create a bag-of-words representation for 'subject' column
        ('subject_bow', TfidfVectorizer(min_df=50), 'subject'),
        # Create a bag-of-words representation for 'body' column with decomposition
        ('body_bow', Pipeline([
            ('tfidf', TfidfVectorizer()),
            ('best', TruncatedSVD(n_components=50))]), 'body'),
        # Use a Pipeline to extract text stats from 'body' column and create features
        ('body_stats', Pipeline([
            ('stats', text_stats_transformer),
            ('vect', DictVectorizer())]), 'body')
    ],
    # Weight the different feature extraction methods
    transformer_weights={
        'subject_bow': 0.8,
        'body_bow': 0.5,
        'body_stats': 1.0
    }
    )),
    # Step 3: Apply a LinearSVC classifier to the combined features
    ('svc', LinearSVC(dual=False))
], verbose=True)


>> Variable Transformation

In [None]:
# Logarithmic (defined only for positive values, X > 0) - log(X)
# Power (e.g. square root for X >= 0, or square) - X**0.5, X**2
# Exponential - exp(X)
# Reciprocal (not defined for zero) - 1/X
# Box-Cox (defined only for positive values, X > 0)
# Yeo-Johnson (an adaptation of Box-Cox that also accepts zero and negative values)

# NB: if data is positively skewed (right skewed), try a logarithmic, reciprocal, or square root transformation
    # if data is negatively skewed (left skewed), try a square or exponential transformation
# Box-Cox and Yeo-Johnson estimate the power from the data, so they can be tried for either direction of skew
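
# A small sketch of how a transformation changes skewness, assuming a hypothetical right-skewed sample drawn from a 
# log-normal distribution (the sample, its size, and the seed are illustrative, not taken from any dataset used here).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed, strictly positive sample
x = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

skew_before = x.skew()
skew_log = np.log(x).skew()    # the log of a log-normal sample is normal, so skewness should be near 0
skew_sqrt = np.sqrt(x).skew()  # the square root is a milder correction than the log

print(f"skewness before: {skew_before:.2f}")
print(f"after log:       {skew_log:.2f}")
print(f"after sqrt:      {skew_sqrt:.2f}")
```

# Comparing the skewness before and after is a quick numeric check to use alongside the histogram and Q-Q plot.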

# check whether a variable is approximately normally distributed
import matplotlib.pyplot as plt
import scipy.stats as stats

def diagnostic_plots(df, variable):

    # function to plot a histogram and a Q-Q plot
    # side by side, for a certain variable

    plt.figure(figsize=(15, 6))

    # histogram
    plt.subplot(1, 2, 1)
    df[variable].hist(bins=30)
    plt.title(f"Histogram of {variable}")

    # q-q plot
    plt.subplot(1, 2, 2)
    stats.probplot(df[variable], dist="norm", plot=plt)
    plt.title(f"Q-Q plot of {variable}")

    # check for skewness
    skewness = df[variable].skew()
    if skewness > 0:
        skew_type = "positively skewed"
    elif skewness < 0:
        skew_type = "negatively skewed"
    else:
        skew_type = "approximately symmetric"
        
    # print message indicating skewness type
    print(f"The variable {variable} is {skew_type} (skewness = {skewness:.2f})")
    
    plt.show()


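# imports used by the transformation helpers below
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer, PowerTransformer

# log transform
def log_transform(df, columns):
    """
    Transforms specified columns of a pandas DataFrame using log(1 + x) (np.log1p), which also accepts zeros.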

    Parameters:
    -----------
    df : pandas DataFrame
        The DataFrame to transform.
    columns : list
        A list of column names to transform.

    Returns:
    --------
    pandas DataFrame
        The transformed DataFrame.
    """
    transformer = FunctionTransformer(np.log1p, validate=True)
    X_log = df.copy()
    # transform only the requested columns; assigning by column avoids truncating results in integer data
    X_log[columns] = transformer.fit_transform(X_log[columns])
    return X_log

#reciprocal transformation
def reciprocal_transform(df, columns):
    """
    Transforms specified columns of a pandas DataFrame using the reciprocal transformation.

    Parameters:
    -----------
    df : pandas DataFrame
        The DataFrame to transform.
    columns : list
        A list of column names to transform.

    Returns:
    --------
    pandas DataFrame
        The transformed DataFrame.
    """
    transformer = FunctionTransformer(lambda x: 1/x, validate=True)
    X_recip = df.copy()
    # transform only the requested columns; assigning by column avoids truncating results in integer data
    X_recip[columns] = transformer.fit_transform(X_recip[columns])
    return X_recip

#square root transformation
def sqrt_transform(df, columns):
    """
    Transforms specified columns of a pandas DataFrame using the square root function.

    Parameters:
    -----------
    df : pandas DataFrame
        The DataFrame to transform.
    columns : list
        A list of column names to transform.

    Returns:
    --------
    pandas DataFrame
        The transformed DataFrame.
    """
    transformer = FunctionTransformer(np.sqrt, validate=True)
    X_sqrt = df.copy()
    # transform only the requested columns; assigning by column avoids truncating results in integer data
    X_sqrt[columns] = transformer.fit_transform(X_sqrt[columns])
    return X_sqrt

#exponential transformation
def exp_transform(df, columns):
    """
    Transforms specified columns of a pandas DataFrame using the exponential function.

    Parameters:
    -----------
    df : pandas DataFrame
        The DataFrame to transform.
    columns : list
        A list of column names to transform.

    Returns:
    --------
    pandas DataFrame
        The transformed DataFrame.
    """
    transformer = FunctionTransformer(np.exp, validate=True)
    X_exp = df.copy()
    # transform only the requested columns; assigning by column avoids truncating results in integer data
    X_exp[columns] = transformer.fit_transform(X_exp[columns])
    return X_exp

#box-cox transformation
def boxcox_transform(df, columns):
    """
    Transforms specified columns of a pandas DataFrame using the Box-Cox transformation.

    Parameters:
    -----------
    df : pandas DataFrame
        The DataFrame to transform.
    columns : list
        A list of column names to transform.

    Returns:
    --------
    pandas DataFrame
        The transformed DataFrame.
    """
    transformer = PowerTransformer(method='box-cox', standardize=False)
    X = df.copy()
    X[columns] = transformer.fit_transform(X[columns])
    return X


#Yeo-Johnson
def yeo_johnson_transform(df, columns):
    """
    Transforms specified columns of a pandas DataFrame using the Yeo-Johnson transformation.

    Parameters:
    -----------
    df : pandas DataFrame
        The DataFrame to transform.
    columns : list
        A list of column names to transform.

    Returns:
    --------
    pandas DataFrame
        The transformed DataFrame.
    """
    transformer = PowerTransformer(method='yeo-johnson', standardize=False)
    X = df.copy()
    X[columns] = transformer.fit_transform(X[columns])
    return X
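
# A short sketch of the difference between the two power transformations, assuming hypothetical log-normal data. 
# Box-Cox only accepts strictly positive values, while Yeo-Johnson also handles zero and negative values; the arrays 
# and the seed below are illustrative.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
positive = rng.lognormal(size=(500, 1))  # strictly positive values
shifted = positive - 1.0                 # same shape of distribution, but now with negative values

box_cox = PowerTransformer(method="box-cox").fit(positive)
yeo_johnson = PowerTransformer(method="yeo-johnson").fit(shifted)

# lambdas_ holds the power estimated for each column
print("Box-Cox lambda:    ", box_cox.lambdas_)
print("Yeo-Johnson lambda:", yeo_johnson.lambdas_)

# Box-Cox refuses data that is not strictly positive
box_cox_failed = False
try:
    PowerTransformer(method="box-cox").fit(shifted)
except ValueError as err:
    box_cox_failed = True
    print("Box-Cox rejected the data:", err)
```

# A Box-Cox lambda close to 0 corresponds to a log transformation, which is what log-normal data should produce.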

"""
A normal distribution is characterized by a bell-shaped curve that is symmetric around the mean. 
The mean, median, and mode of a normal distribution are all equal, and approximately 68% of the data falls within one 
standard deviation of the mean, 95% falls within two standard deviations, and 99.7% falls within three 
standard deviations.
"""

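# The 68-95-99.7 figures quoted above can be reproduced from the standard normal CDF. This is a small check using 
# scipy.stats.norm, not part of the transformation helpers.

```python
from scipy.stats import norm

coverage = {}
for k in (1, 2, 3):
    # P(|Z| <= k) for a standard normal variable Z
    coverage[k] = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} standard deviation(s): {coverage[k]:.4f}")
```
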
>> Discretization

In [None]:
# Discretization in machine learning is the process of transforming continuous variables into discrete or 
# categorical variables. This process involves dividing the range of a continuous variable into a finite number of 
# intervals or bins, and then assigning each observation to a particular bin based on the value of the continuous 
# variable. 

#Discretization approaches: equal width, equal frequency, K means, Decision Trees


#use equal width or equal frequency binning method for majority of cases
df['age_bin_equal_width'] = pd.cut(df['age'], bins=3)   # Equal width binning
df['income_bin_equal_freq'] = pd.qcut(df['income'], q=3)    # Equal frequency binning
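
# scikit-learn's KBinsDiscretizer offers equal width ('uniform'), equal frequency ('quantile'), and k-means binning 
# through a single strategy argument. A minimal sketch, assuming a hypothetical array of ages with one extreme value:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical ages, including one extreme value (90)
ages = np.array([18, 21, 22, 25, 30, 31, 35, 40, 42, 90], dtype=float).reshape(-1, 1)

binned = {}
for strategy in ("uniform", "quantile", "kmeans"):
    binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    binned[strategy] = binner.fit_transform(ages).ravel().astype(int)
    # bin_edges_[0] holds the learned bin edges for the first (and only) column
    print(strategy, binned[strategy], np.round(binner.bin_edges_[0], 1))
```

# Because the discretizer is fitted like any other transformer, the bin edges learned on the training data can be 
# reused on new data with transform().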


# if the data contains outliers or unevenly distributed data, then equal width or frequency binning may not be 
# appropriate. In such cases, methods like K-Means clustering, decision tree discretization, and Gaussian mixture 
# modeling may be more suitable.
# K-Means clustering
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
df['age_income_cluster'] = kmeans.fit_predict(df[['age', 'income']])

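# Decision tree discretization
# income is continuous, so a regressor is needed; each leaf of the tree defines one bin of age
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor(max_depth=2, random_state=42)
dt.fit(df[['age']], df['income'])
df['age_bin_decision_tree'] = dt.predict(df[['age']])  # each distinct value (mean income of a leaf) marks one bin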

# Gaussian mixture model
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=3)
gmm.fit(df[['age', 'income']])
df['age_income_gmm'] = gmm.predict(df[['age', 'income']])

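# Entropy-based discretization
# the entropy criterion needs a discrete target, so income is first cut into three classes
from sklearn.tree import DecisionTreeClassifier
income_class = pd.qcut(df['income'], q=3, labels=False)
dt = DecisionTreeClassifier(criterion='entropy', max_leaf_nodes=4, random_state=42)
dt.fit(df[['age']], income_class)
df['age_bin_entropy'] = dt.apply(df[['age']])  # leaf index used as the bin label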

>> Pipeline

In [None]:
# Pipeline and FeatureUnion are classes in scikit-learn that are used to create pipelines for machine learning tasks. 
# Pipeline is used to sequentially apply a list of transformers and an estimator, while FeatureUnion is used to 
# concatenate the output of multiple transformer objects. 

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()

# Combine two feature sets with FeatureUnion. Every branch receives all input columns, and Iris has no
# categorical columns, so the second branch selects features instead of one-hot encoding them
feature_union = FeatureUnion(transformer_list=[
    ('pca_features', Pipeline(steps=[
                                            ('scaler', StandardScaler()),
                                            ('pca', PCA(n_components=2)),
                                        ])),
    ('best_features', SelectKBest(score_func=chi2, k=2)),  # chi2 needs non-negative inputs
])

# Define the pipeline
pipe = Pipeline([
    ('features', feature_union),
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

# Fit and predict using the pipeline
pipe.fit(iris.data, iris.target)
preds = pipe.predict(iris.data)

# -----------------------------------------------------------------------------------------------------
#use this to visualize the pipelines as a diagram 
from sklearn import set_config
set_config(display='diagram')

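# NB: use FunctionTransformer to wrap a custom function as a transformer, and ColumnTransformer to apply different 
# transformers to different columns in a dataset

# Define preprocessing functions
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

log_transformer = FunctionTransformer(func=np.log1p, validate=True)
scale_transformer = StandardScaler()
ohe_transformer = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2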

# Define preprocessing steps for different feature types
preprocessor = ColumnTransformer(transformers=[
    ('log_transform', log_transformer, ['numerical_feature']),
    ('scale_transform', scale_transformer, ['numerical_feature_2']),
    ('ohe_transform', ohe_transformer, ['categorical_feature']),
])

# Define pipeline with preprocessing and model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LogisticRegression())
])

# Fit pipeline to data (X must be a DataFrame that contains the placeholder column names used above)
pipeline.fit(X, y)
# ----------------------------------------------------------------------------------------------------
#An Illustration

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler, MinMaxScaler, RobustScaler
from sklearn.compose import ColumnTransformer 
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split


X,y = load_breast_cancer(return_X_y=True, as_frame=True) 

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=45) 

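def treat_missing_numeric(df,columns,how = 'mean', value = None): 
    # Function to treat missing values in numeric columns 
    # NB: the mean is recomputed on whatever data is passed in, so SimpleImputer is the safer choice when the
    # training-set mean must be reused on test data
    df = df.copy()  # work on a copy so the caller's data is not modified
    if how == 'mean':
        for i in columns:
            print("Filling missing values with mean for columns - {0}".format(i))
            df[i] = df[i].fillna(df[i].mean()) 
      
    else:
        print("Missing value fill cannot be completed")
    return df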

preprocessor = ColumnTransformer(transformers=[
    ('Standard_scalar', StandardScaler(), ['mean smoothness']),
    ('minmax_scalar', MinMaxScaler(), ['mean area', 'mean texture', 'mean perimeter']),
    ('NaN', FunctionTransformer(treat_missing_numeric, kw_args={'columns': ['mean symmetry', 'mean fractal dimension'], 'how': 'mean'}), 
                                       ['mean symmetry', 'mean fractal dimension']),
])

pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LogisticRegression())
    
    ])
pipe.fit(X_train, y_train)
y_pred= pipe.predict(X_test)

y_pred

Mean training error: 0.0356386368576308
Mean validation error: -0.009125646107584522


>> Feature Selection and Extraction

In [None]:
# feature selection is a technique of selecting relevant features from the original feature set, while feature 
# extraction is a technique of creating new features from the original feature set. Feature selection is typically 
# used when the original features are already informative, but there are some irrelevant or redundant features that 
# need to be removed to improve the model performance. 

# Feature extraction, on the other hand, is used when the original features are not informative enough, and 
# new features need to be created to capture the underlying patterns in the data.

# there are two ways to reduce dimensionality and so mitigate the curse of dimensionality
# Feature selection
    # Removing features with low variance (VarianceThreshold)
    # Univariate Feature Selection (SelectKBest, SelectPercentile, GenericUnivariateSelect)
    # Recursive Feature Elimination (RFE, RFECV)        RFECV - RFE with cross-validation
    # Feature selection using SelectFromModel (SelectFromModel) - use L1-penalised (Lasso, ElasticNet) or tree-based models;
    #     Ridge uses an L2 penalty, so it shrinks coefficients but never sets them exactly to zero
    # Sequential Feature Selection (SequentialFeatureSelector) - SFS can be either forward or backward


#Feature extraction
    # Principal Component Analysis (PCA)
    # Independent Component Analysis (ICA)
    # t-Distributed Stochastic Neighbor Embedding (t-SNE)




# Variance threshold: - # Removing features with low variance
# This technique removes all features whose variance is below a certain threshold. 
# This is done using the VarianceThreshold function from scikit-learn library.
from sklearn.feature_selection import VarianceThreshold

def variance_threshold(X, threshold=0.0):
    selector = VarianceThreshold(threshold=threshold)
    X_new = selector.fit_transform(X)
    return X_new


# SelectKBest:
# This technique selects the K best features based on univariate statistical tests. 
# This is done using the SelectKBest function from scikit-learn library
from sklearn.feature_selection import SelectKBest, f_classif

def select_k_best(X, y, k=10):
    selector = SelectKBest(score_func = f_classif, k=k)
    X_new = selector.fit_transform(X, y)
    return X_new


# Principal Component Analysis (PCA):
# This technique reduces the dimensionality of the data by projecting it onto a lower dimensional space. 
# This is done using the PCA function from scikit-learn library.
from sklearn.decomposition import PCA

def pca(X, n_components=2):
    pca = PCA(n_components=n_components)
    X_new = pca.fit_transform(X)
    return X_new 
                    # # Get the loadings of the original variables in each component
                    # loadings = pca.components_
                    # # Print the names of the columns that were extracted
                    # print("Columns extracted:")
                    # for i in range(loadings.shape[0]):
                    #     max_loading_index = loadings[i].argmax()
                    #     column_name = data.columns[max_loading_index]
                    #     print(f"Component {i+1}: {column_name}")

# Independent Component Analysis (ICA):
# This technique extracts independent sources from the data by maximizing their statistical independence. 
# This is done using the FastICA function from scikit-learn library.
from sklearn.decomposition import FastICA

def ica(X, n_components=2):
    ica = FastICA(n_components=n_components)
    X_new = ica.fit_transform(X)
    return X_new 


# t-distributed Stochastic Neighbor Embedding (t-SNE):
# This technique is used for visualizing high-dimensional data in a low-dimensional space. 
# This is done using the TSNE function from scikit-learn library.
from sklearn.manifold import TSNE

def tsne(X, n_components=2, perplexity=30):
    tsne = TSNE(n_components=n_components, perplexity=perplexity)
    X_new = tsne.fit_transform(X)
    return X_new


#Feature Selection 
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
import numpy as np


# Univariate Feature Selection
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

# Recursive Feature Elimination
estimator = RandomForestClassifier()
selector = RFE(estimator, n_features_to_select=2, step=1)
selector = selector.fit(X, y)
X_new = selector.transform(X)

# Principal Component Analysis (PCA)
# PCA is a dimensionality reduction technique that projects the data onto a lower-dimensional space while preserving
# as much variance as possible. This is done using the PCA function from scikit-learn library.
pca = PCA(n_components=2)
X_new = pca.fit_transform(X)

# L1-based selection with Lasso (features whose coefficient is shrunk to zero are dropped)
X_scaled = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1)
lasso.fit(X_scaled, y)
model = SelectFromModel(lasso, prefit=True) 
X_new = model.transform(X_scaled) 
    # selected_features = X.columns[model.get_support()] # to view the selected variables (X must be a DataFrame)
    # np.sum(lasso.coef_ == 0) # to see how many coefficients were shrunk to zero and left unselected

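# Ridge Regression (L2 penalty: no coefficient becomes exactly zero, so SelectFromModel falls back to its default
# threshold and keeps the features whose absolute coefficient is at least the mean absolute coefficient)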
ridge = Ridge(alpha=1.0)
ridge.fit(X_scaled, y)
model = SelectFromModel(ridge, prefit=True)
X_new = model.transform(X_scaled)

# Elastic Net
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_scaled, y)
model = SelectFromModel(elastic, prefit=True)
X_new = model.transform(X_scaled)

# Tree-based Feature Selection
rf = RandomForestClassifier()
rf.fit(X, y)
model = SelectFromModel(rf, prefit=True)
X_new = model.transform(X)

# Mutual Information Feature Selection
X_new = SelectKBest(score_func=mutual_info_classif, k=2).fit_transform(X, y)

# Sequential Feature Selection
estimator = RandomForestClassifier()
selector = SequentialFeatureSelector(estimator, n_features_to_select=2)
selector = selector.fit(X, y)
X_new = selector.transform(X)
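
# RFECV is listed above but not shown. A minimal sketch, assuming the breast cancer dataset and a logistic regression 
# estimator (both chosen only for illustration); get_support() maps the selection back to column names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True, as_frame=True)

# Scaled once up front to keep the sketch short; in practice put the scaler inside a Pipeline to avoid leakage
X_bc_scaled = StandardScaler().fit_transform(X_bc)

# RFECV repeats recursive elimination inside cross-validation and keeps the feature count with the best mean score
rfecv = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5, min_features_to_select=2)
rfecv.fit(X_bc_scaled, y_bc)

selected_columns = X_bc.columns[rfecv.get_support()]
print(f"{rfecv.n_features_} features kept:", list(selected_columns))
```

# The same get_support() call works on SelectKBest, RFE, SelectFromModel, and SequentialFeatureSelector.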




>> ML evaluation

In [None]:
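# confusion matrix
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

classes = digits.target_names  # or sorted(df['target'].unique()); digits comes from sklearn.datasets.load_digits()
accuracy = accuracy_score(y_test, y_pred)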

def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function plots a confusion matrix.
    """
    cm = confusion_matrix(y_true, y_pred)
    
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    
    for i, j in np.ndindex(cm.shape):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

plot_confusion_matrix(y_test, y_pred, classes=classes,
                      title='Confusion matrix, Accuracy = {:.2f}'.format(accuracy))
# A confusion matrix is a table that is often used to evaluate the performance of a machine learning algorithm. 
# It shows the number of true positives, false positives, true negatives, and false negatives for a given 
# classification task.

# A confusion matrix has two axes: one for the predicted values and one for the actual values. Each axis has two 
#     categories: positive and negative. Therefore, a confusion matrix for a binary classification task will have 
#     four cells:
# True Positive (TP): the actual value was positive, and the predicted value was also positive.
# False Positive (FP): the actual value was negative, but the predicted value was positive.
# True Negative (TN): the actual value was negative, and the predicted value was also negative.
# False Negative (FN): the actual value was positive, but the predicted value was negative.
#                 Predicted Positive    Predicted Negative
# Actual Positive         TP                   FN
# Actual Negative         FP                   TN

#Recall
# the proportion of true positives among the total number of actual positives. It is calculated as TP / (TP + FN).

#Accuracy
# the proportion of true results (both true positives and true negatives) among the total number of cases examined. 
# It is calculated as (TP + TN) / (TP + FP + TN + FN).

#Precision
# The proportion of true positives among the total number of positive predictions. It is calculated as TP / (TP + FP).

#F1-score
# the harmonic mean of precision and recall. It is calculated as 2 * (precision * recall) / (precision + recall).
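
# The four formulas above can be checked against scikit-learn on a small example. A sketch, assuming hypothetical 
# binary labels (1 = positive, 0 = negative); the *_demo names are illustrative. Note that scikit-learn orders the 
# matrix by label value, so the negative class comes first, unlike the table above.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true and predicted labels
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# With labels=[0, 1], ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

recall_demo = tp / (tp + fn)
accuracy_demo = (tp + tn) / (tp + fp + tn + fn)
precision_demo = tp / (tp + fp)
f1_demo = 2 * (precision_demo * recall_demo) / (precision_demo + recall_demo)

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"recall={recall_demo:.2f} accuracy={accuracy_demo:.2f} "
      f"precision={precision_demo:.2f} f1={f1_demo:.2f}")
```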

>> Others

In [None]:
## Importing required libraries
import pandas as pd ## For DataFrame operation
import numpy as np ## Numerical python for matrix operations
import matplotlib.pyplot as plt ## For plotting
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler, OrdinalEncoder, OneHotEncoder ## Preprocessing function
import pandas_profiling ## For easy profiling of pandas DataFrame (the package has since been renamed ydata-profiling)
import missingno as msno ## Missing value co-occurrence analysis
####### Data Exploration ############

def print_dim(df):
    '''
    Function to print the dimensions of a given python dataframe
    Required Input -
        - df = Pandas DataFrame
    Expected Output -
        - Data size
    '''
    print("Data size: Rows-{0} Columns-{1}".format(df.shape[0],df.shape[1]))


def print_dataunique(df):
    '''
    Function to print unique information for each column in a python dataframe
    Required Input - 
        - df = Pandas DataFrame
    Expected Output -
        - Column name
        - Data type of that column
        - Number of unique values in that column
        - 5 unique values from that column
    '''
    counter = 0
    for i in df.columns:
        x = df.loc[:,i].unique()
        print(counter, i, df[i].dtype, len(x), x[0:5])  # use the column dtype; df.loc[0, i] fails without an index label 0
        counter +=1
        
def do_data_profiling(df, filename):
    '''
    Function to do basic data profiling
    Required Input - 
        - df = Pandas DataFrame
        - filename = Path for output file with a .html extension
    Expected Output -
        - HTML file with data profiling summary
    '''
    profile = pandas_profiling.ProfileReport(df)
    profile.to_file(output_file = filename)
    print("Data profiling done")

def view_datatypes_in_perspective(df):
    '''
    Function to group dataframe columns into three common dtypes and visualize the columns
    Required Input - 
        - df = Pandas DataFrame
    Expected Output -
        - three unique datatypes (float, object, others(for the rest))
    '''
    # counters are named *_count so they do not shadow the built-ins float and object
    float_count = 0
    float_col = []
    object_count = 0
    object_col = []
    others = 0
    others_col = []
    for col in df.columns:
        if df[col].dtype == "float":
            float_count += 1
            float_col.append(col) 
        elif df[col].dtype == "object":
            object_count += 1
            object_col.append(col)
        else:
            others += 1
            others_col.append(col)
            others_col.append(df[col].dtype)        
    print (f" float = {float_count} \t{float_col}, \n \nobject = {object_count} \t{object_col}, \n\nothers = {others} \t{others_col} ")

def missing_value_analysis(df):
    '''
    Function to do basic missing value analysis
    Required Input - 
        - df = Pandas DataFrame
    Expected Output -
        - Chart of Missing value co-occurance
        - Chart of Missing value heatmap
    '''
    msno.matrix(df)
    msno.heatmap(df)

def view_NaN(df):
    """
    Prints the name of any column in a Pandas DataFrame that contains NaN values.

    Parameters:
        - df: Pandas DataFrame

    Returns:
        - None
    """
    for col in df.columns:
        if df[col].isnull().any():
            print(f"there are {df[col].isnull().sum()} NaN values present in column:", col)
        else:
            print("No NaN present in column:", col)

def convert_timestamp(ts):
    """
    Converts a Unix timestamp to a pandas Timestamp (UTC).

    Args:
        ts (int): The Unix timestamp (in seconds) to convert.

    Returns:
        pandas.Timestamp: The corresponding date and time, displayed as 'YYYY-MM-DD HH:MM:SS'.
    """
    # unit='s' interprets ts as seconds since the Unix epoch
    return pd.to_datetime(ts, unit='s')

def visualize_outlier (df: pd.DataFrame):
    # Select only numeric columns
    numeric_cols = df.select_dtypes(include=["float64", "int64"])
    # Set figure size and create boxplot
    fig, ax = plt.subplots(figsize=(12, 6))
    numeric_cols.boxplot(ax=ax, rot=90)
    # Set x-axis label
    ax.set_xlabel("Numeric Columns")
    # Adjust subplot spacing to prevent x-axis labels from being cut off
    plt.subplots_adjust(bottom=0.4) 
    # Increase the size of the plot
    fig.set_size_inches(10, 6)
    # Show the plot
    plt.show()


####### Basic helper function ############

def join_df(left, right, left_on, right_on=None, method='left'):
    '''
    Function to join two pandas dataframes (left join by default)
    Required Input - 
        - left = Pandas DataFrame 1
        - right = Pandas DataFrame 2
        - left_on = Fields in DataFrame 1 to merge on
        - right_on = Fields in DataFrame 2 to merge with left_on fields of Dataframe 1 (defaults to left_on)
        - method = Type of join ('left', 'right', 'inner', 'outer')
    Expected Output -
        - Merged Pandas dataframe; overlapping column names from DataFrame 2 get the suffix "_y"
    '''
    if right_on is None:
        right_on = left_on
    return left.merge(right, 
                      how=method, 
                      left_on=left_on, 
                      right_on=right_on, 
                      suffixes=("","_y"))
    
####### Pre-processing ############    

def drop_allsame(df):
    '''
    Function to remove any columns which have same value all across
    Required Input - 
        - df = Pandas DataFrame
    Expected Output -
        - Pandas dataframe with dropped no variation columns
    '''
    to_drop = list()
    for i in df.columns:
        if len(df.loc[:,i].unique()) == 1:
            to_drop.append(i)
    return df.drop(to_drop,axis =1)

# ------------------------------------------------------------------------------------------------------
# Handling Missing Values
# ----------------------------------------------------
# fill NaN values in numeric columns (the call after the function fills the cloudCover column)
def treat_missing_numeric(df,columns,how = 'mean', value = None):
    '''
    Function to treat missing values in numeric columns
    Required Input - 
        - df = Pandas DataFrame
        - columns = List input of all the columns need to be imputed
        - how = valid values are 'mean', 'mode', 'median', 'ffill', 'digit'
        - value = numeric fill value, used only when how = 'digit'
    Expected Output -
        - Pandas dataframe with imputed missing value in mentioned columns
    '''
    if how == 'mean':
        for i in columns:
            print("Filling missing values with mean for columns - {0}".format(i))
            df[i] = df[i].fillna(df[i].mean())
            
    elif how == 'mode':
        for i in columns:
            print("Filling missing values with mode for columns - {0}".format(i))
            df[i] = df[i].fillna(df[i].mode()[0])  # mode() returns a Series; take the first mode
    
    elif how == 'median':
        for i in columns:
            print("Filling missing values with median for columns - {0}".format(i))
            df[i] = df[i].fillna(df[i].median())
    
    elif how == 'ffill':
        for i in columns:
            print("Filling missing values with forward fill for columns - {0}".format(i))
            df[i] = df[i].ffill()
    
    elif how == 'digit':
        for i in columns:
            print("Filling missing values with {0} for columns - {1}".format(value, i))
            df[i] = df[i].fillna(value)  # keep the fill value numeric so the column dtype is preserved
      
    else:
        print("Missing value fill cannot be completed")
    return df
treat_missing_numeric(smart_home, ["cloudCover"], how="digit", value = 0.1).head(5)


def treat_missing_categorical(df, columns, how='mode', value = None):
    '''
    Function to treat missing values in categorical columns
    Required Input - 
        - df = Pandas DataFrame
        - columns = List input of all the columns need to be imputed
        - how = valid values are 'mode', 'digit' (fill with value), or any other string, which is used as the fill value
        - value = fill value, used only when how = 'digit'
    Expected Output -
        - Pandas dataframe with imputed missing value in mentioned columns
    '''
    if how == 'mode':
        for col in columns:
            print("Filling missing values with mode for column - {0}".format(col))
            df[col] = df[col].fillna(df[col].mode()[0])
            
    elif how == 'digit':  # checked before the generic string branch, which would otherwise catch it
        for col in columns:
            print("Filling missing values with {0} for column - {1}".format(value, col))
            df[col] = df[col].fillna(str(value)) 
            
    elif isinstance(how, str):
        for col in columns:
            print("Filling missing values with '{0}' for column - {1}".format(how, col))
            df[col] = df[col].fillna(how)
            
    else:
        print("Missing value fill cannot be completed")
    return df


# SimpleImputer: replaces missing values using a chosen strategy.
from sklearn.impute import SimpleImputer

def impute_missing_values(X, strategy='mean', fill_value=None):
    # strategy: 'mean', 'median', 'most_frequent' or 'constant' (e.g. strategy='constant', fill_value=-1)
    imputer = SimpleImputer(strategy=strategy, fill_value=fill_value)
    X_imputed = imputer.fit_transform(X)
    X_imputed = pd.DataFrame(X_imputed, columns=X.columns, index=X.index)
    return X_imputed

#EvalML Time Series Imputer
from evalml.pipelines.components import TimeSeriesImputer

ts_imputer = TimeSeriesImputer(
    categorical_impute_strategy="forwards_fill",
    numeric_impute_strategy="backwards_fill",
    target_impute_strategy="interpolate",
)
X_train, y_train = ts_imputer.fit_transform(X_train, y_train)

#MissingIndicator: This function creates a binary indicator for each feature indicating whether the value is missing or not.
from sklearn.impute import MissingIndicator

def create_missing_indicator(X):
    indicator = MissingIndicator()
    X_missing_indicator = indicator.fit_transform(X)
    return X_missing_indicator

# KNNImputer: multivariate imputation. Each missing value is estimated as the
# (here distance-weighted) average of that feature over the K nearest neighbours.
from sklearn.impute import KNNImputer
def knn_impute(X, k):
    imputer = KNNImputer(n_neighbors=k, # the number of neighbours K
                        weights='distance', # the weighting factor
                        metric='nan_euclidean', # the metric to find the neighbours
                        add_indicator=False, # whether to add a missing indicator
                        )
    imputed_X = imputer.fit_transform(X)
    return imputed_X

#IterativeImputer: This function estimates missing values using a predictive model.
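from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must be imported before IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestRegressor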

def impute_missing_values_iteratively(X): #or (X, Columns)
    imputer = IterativeImputer(
        # estimator = RandomForestRegressor() 
        estimator=BayesianRidge(), # the estimator to predict the NA
        initial_strategy='mean', # how will NA be imputed in step 1
        max_iter=10, # number of cycles
        imputation_order='ascending', # the order in which to impute the variables
        n_nearest_features=None, # whether to limit the number of predictors
        skip_complete=True, # whether to ignore variables without NA
        random_state=0,)
        
    # select only the columns with missing values to be imputed
    # X_cols = X[columns]
    X_imputed = imputer.fit_transform(X) #or X_cols
    return X_imputed

# Other predictive models that can be used as the estimator (each line is an alternative)
from sklearn.linear_model import BayesianRidge, LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

imputer = IterativeImputer(estimator=BayesianRidge())
imputer = IterativeImputer(estimator=LinearRegression())
imputer = IterativeImputer(estimator=DecisionTreeRegressor())
imputer = IterativeImputer(estimator=RandomForestRegressor())
imputer = IterativeImputer(estimator=KNeighborsRegressor())
imputer = IterativeImputer(estimator=MLPRegressor())

# IterativeImputer in scikit-learn estimates missing values by modeling each feature that has
# missing values as a function of the other features. It fits a predictive model per feature and
# uses it to fill in the missing values, repeating the cycle for up to max_iter rounds.
# ----------------------------------------------------------------------------------

from sklearn.preprocessing import MinMaxScaler, StandardScaler

def min_max_scaler(df,columns):
    '''
    Function to do Min-Max scaling
    Required Input - 
        - df = Pandas DataFrame
        - columns = List of all the columns which need to be min-max scaled
    Expected Output -
        - data = Pandas DataFrame with the Min-Max scaled columns
        - scaler = Fitted MinMaxScaler object which contains the scaling rules
    '''
    scaler = MinMaxScaler()
    data = pd.DataFrame(scaler.fit_transform(df.loc[:,columns]))
    data.index = df.index
    data.columns = columns
    return data, scaler

def replace_non_numeric(df: pd.DataFrame, columns):
    """
    Coerces the specified columns of a Pandas dataframe to numeric and drops the rows
    whose values are missing or cannot be parsed as numbers.

    Parameters:
        df (pd.DataFrame): The dataframe to process (modified in place).
        columns (list): A list of column names to clean.

    Returns:
        pd.DataFrame: The updated dataframe with numeric columns and the offending rows removed.
    """
    for col in columns:
        # Non-numeric entries become NaN, then every NaN row for this column is dropped
        df[col] = pd.to_numeric(df[col], errors='coerce')
        df.dropna(subset=[col], inplace=True)
    return df

def z_scaler(df,columns):
    '''
    Function to standardize features by removing the mean and scaling to unit variance
    Required Input - 
        - df = Pandas DataFrame
        - columns = List of all the columns which need to be standardized
    Expected Output -
        - data = Pandas DataFrame with the standardized (z-scored) columns
        - scaler = Fitted StandardScaler object which contains the scaling rules
    '''
    scaler = StandardScaler()
    data = pd.DataFrame(scaler.fit_transform(df.loc[:,columns]))
    data.index = df.index
    data.columns = columns
    return data, scaler

mapping_dict = {        #an example
    'First':0,
    'Second': 1,
    'Third': 2 
}
def map_encoding(data, feature_name, mapping_dict):
    """
    Encodes a categorical feature using mapping method.

    Args:
        data (pandas.DataFrame): The DataFrame containing the categorical feature to encode.
        feature_name (str): The name of the categorical feature to encode.
        mapping_dict (dict): A dictionary containing the mapping of category values to integers.

    Returns:
        pandas.DataFrame: The DataFrame with the encoded categorical feature.
    """

    # Create a copy of the original DataFrame
    encoded_data = data.copy()

    # Replace the category values with their corresponding integers
    encoded_data[feature_name] = encoded_data[feature_name].map(mapping_dict)

    return encoded_data

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

def ordinal_encoding_sklearn(data, feature_name, categories):
    """
    Encodes a categorical feature using ordinal encoding method with scikit-learn's OrdinalEncoder class.

    Args:
        data (pandas.DataFrame): The DataFrame containing the categorical feature to encode.
        feature_name (str): The name of the categorical feature to encode.
        categories (list): The list of categories in the order of their numerical encoding.

    Returns:
        pandas.DataFrame: The DataFrame with the encoded categorical feature.
    """

    # Create a copy of the original DataFrame
    encoded_data = data.copy()

    # Perform ordinal encoding using the OrdinalEncoder class
    ordinal_encoder = OrdinalEncoder(categories=[categories])
    # fit_transform returns an (n_samples, 1) array, so flatten it before assigning to the column
    encoded_data[feature_name] = ordinal_encoder.fit_transform(encoded_data[[feature_name]]).ravel()
    
    return encoded_data


def one_hot_encoding_sklearn(data, feature_name):
    """
    Encodes a categorical feature using one-hot encoding method with scikit-learn's OneHotEncoder class.

    Args:
        data (pandas.DataFrame): The DataFrame containing the categorical feature to encode.
        feature_name (str): The name of the categorical feature to encode.

    Returns:
        pandas.DataFrame: The DataFrame with the encoded categorical feature.
    """

    # Create a copy of the original DataFrame
    encoded_data = data.copy()

    # Perform one-hot encoding using the OneHotEncoder class
    one_hot_encoder = OneHotEncoder()
    one_hot = pd.DataFrame(one_hot_encoder.fit_transform(encoded_data[[feature_name]]).toarray())
    one_hot.columns = [f"{feature_name}_{category}" for category in one_hot_encoder.categories_[0]]
    one_hot.index = data.index
    # Replace the original column with its one-hot columns
    encoded_data = pd.concat([data.drop(feature_name, axis=1), one_hot], axis=1)
    
    return encoded_data

from sklearn.preprocessing import LabelEncoder

def label_encoding_sklearn(data, feature_name):
    """
    Encodes a categorical feature using label encoding method with scikit-learn's LabelEncoder class.

    Args:
        data (pandas.DataFrame): The DataFrame containing the categorical feature to encode.
        feature_name (str): The name of the categorical feature to encode.

    Returns:
        pandas.DataFrame: The DataFrame with the encoded categorical feature.
    """

    # Create a copy of the original DataFrame
    encoded_data = data.copy()

    # Perform label encoding using the LabelEncoder class
    label_encoder = LabelEncoder()
    encoded_data[feature_name] = label_encoder.fit_transform(encoded_data[feature_name])
    
    return encoded_data

    
def label_encoder(df,columns):
    '''
    Function to label encode
    Required Input - 
        - df = Pandas DataFrame
        - columns = List input of all the columns which needs to be label encoded
    Expected Output -
        - df = Pandas DataFrame with label encoded columns
        - le_dict = Dictionary of all the column and their label encoders
    '''
    le_dict = {}
    for c in columns:
        print("Label encoding column - {0}".format(c))
        lbl = LabelEncoder()
        lbl.fit(list(df[c].values.astype('str')))
        df[c] = lbl.transform(list(df[c].values.astype('str')))
        le_dict[c] = lbl
    return df, le_dict

def one_hot_encoder(df, columns):
    '''
    Function to one-hot encode columns
    Required Input - 
        - df = Pandas DataFrame
        - columns = List input of all the columns which needs to be one-hot encoded
    Expected Output -
        - df = Pandas DataFrame with one-hot encoded columns
    '''
    for each in columns:
        print("One-Hot encoding column - {0}".format(each))
        dummies = pd.get_dummies(df[each], prefix=each, drop_first=False)
        df = pd.concat([df, dummies], axis=1)
    return df.drop(columns,axis = 1)

####### Feature Engineering ############ 
def create_date_features(df,column, date_format = None, more_features = False, time_features = False): 
    '''
    Function to extract date features
    Required Input - 
        - df = Pandas DataFrame
        - column = Name of the column containing the date field
        - date_format = Date parsing format
        - more_features = To get more features extracted (quarter, day of week, day of year)
        - time_features = To extract hour from datetime field
    Expected Output -
        - df = Pandas DataFrame with additional extracted date features
    '''
    # Plain column assignment lets the column change dtype to datetime64
    if date_format is None:
        df[column] = pd.to_datetime(df[column])
    else:
        df[column] = pd.to_datetime(df[column], format = date_format)
    df[column+'_Year'] = df[column].dt.year
    df[column+'_Month'] = df[column].dt.month.astype('uint8')
    # Series.dt.week was removed in pandas 2.0; isocalendar().week is the replacement
    df[column+'_Week'] = df[column].dt.isocalendar().week.astype('uint8')
    df[column+'_Day'] = df[column].dt.day.astype('uint8')
    
    if more_features:
        df[column+'_Quarter'] = df[column].dt.quarter.astype('uint8')
        df[column+'_DayOfWeek'] = df[column].dt.dayofweek.astype('uint8')
        df[column+'_DayOfYear'] = df[column].dt.dayofyear
        
    if time_features:
        df[column+'_Hour'] = df[column].dt.hour.astype('uint8')
    return df

def target_encoder(train_df, col_name, target_name, test_df = None, how='mean'):
    '''
    Function to do target encoding
    Required Input - 
        - train_df = Training Pandas Dataframe
        - test_df = Testing Pandas Dataframe
        - col_name = Name of the columns of the source variable
        - target_name = Name of the columns of target variable
        - how = 'mean' default but can also be 'count'
    Expected Output - 
        - train_df = Training dataframe with added encoded features
        - test_df = Testing dataframe with added encoded features
    '''
    aggregate_data = train_df.groupby(col_name)[target_name] \
                    .agg([how]) \
                    .reset_index() \
                    .rename(columns={how: col_name+'_'+target_name+'_'+how})
    if test_df is None:
        return join_df(train_df,aggregate_data,left_on = col_name)
    else:
        return join_df(train_df,aggregate_data,left_on = col_name), join_df(test_df,aggregate_data,left_on = col_name)
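
The target encoding idea in `target_encoder` can also be sketched with plain pandas, without the `join_df` helper. This is a minimal illustration under stated assumptions: the function name `target_encode_sketch` and the toy `city`/`price` columns are hypothetical and not part of the toolkit above.

```python
import pandas as pd

def target_encode_sketch(train_df, col_name, target_name, test_df=None, how="mean"):
    # Aggregate the target per category using the training data only
    new_col = f"{col_name}_{target_name}_{how}"
    agg = train_df.groupby(col_name)[target_name].agg(how).rename(new_col).reset_index()
    # A left merge keeps every original row and attaches the per-category statistic
    train_out = train_df.merge(agg, on=col_name, how="left")
    if test_df is None:
        return train_out
    test_out = test_df.merge(agg, on=col_name, how="left")
    return train_out, test_out

# Toy data (hypothetical) to show the behaviour
train = pd.DataFrame({"city": ["a", "a", "b", "b"], "price": [10.0, 20.0, 30.0, 50.0]})
test = pd.DataFrame({"city": ["a", "b", "c"]})
train_enc, test_enc = target_encode_sketch(train, "city", "price", test_df=test)
print(train_enc)
print(test_enc)
```

Categories that appear only in the test set (here `"c"`) get NaN and still need a fallback such as the global mean. Because the statistic is computed from the same rows it is attached to, the training feature leaks some target information; computing it out-of-fold is a common mitigation.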

### Choosing the ML Model

| Model | Description and How It Works | Key Points | Example Code (Python) |
|-------|------------------------------|------------|-----------------------|
| **Linear Regression** | Predicts a continuous value by finding the best-fitting straight line through the data points, assuming a linear relationship between input and output. | Simple and interpretable, prone to underfitting, assumes linear relationship, sensitive to outliers. | `from sklearn.linear_model import LinearRegression` |
| **Logistic Regression** | Estimates probabilities using a logistic function to model a binary outcome. Despite the name, it's used for classification. | Interpretable, outputs probabilities, best for binary classification, assumes linear decision boundary. | `from sklearn.linear_model import LogisticRegression` |
| **Decision Trees** | Models decisions using a tree-like structure, splitting data on feature values to create pure subsets. | Easy to interpret, handles numerical and categorical data, prone to overfitting, can model non-linear relationships. | `from sklearn.tree import DecisionTreeClassifier` |
| **Random Forest** | An ensemble of Decision Trees, each trained on a bootstrap sample (the "bagging" method) with random feature subsets; their predictions are averaged or voted to improve accuracy and control overfitting. | Reduces the overfitting of single decision trees, works well with large datasets and high dimensions, can remain accurate when some data is missing (depending on the implementation), handles both regression and classification tasks. | `from sklearn.ensemble import RandomForestClassifier` |
| **SVM (Support Vector Machines)** | Finds the maximum-margin hyperplane that separates the classes; kernels implicitly map the data to a higher-dimensional space where such a hyperplane can exist. | Effective in high-dimensional spaces, best suited for datasets with a clear margin of separation, memory efficient because only the support vectors define the model, remains effective when the number of dimensions exceeds the number of samples. | `from sklearn.svm import SVC` |
| **KNN (K-Nearest Neighbors)** | Classifies a point by majority vote among its 'k' closest training examples, found with a distance metric. | Instance-based learning, minimal training, suitable for classification and regression, sensitive to feature scaling and noisy local neighborhoods, depends on the choice of 'k' and the distance metric. | `from sklearn.neighbors import KNeighborsClassifier` |
| **Naive Bayes** | Applies Bayes' Theorem with the "naive" assumption of conditional independence between every pair of features to classify data. | Fast, suitable for large datasets, predominantly used in text classification, handles both continuous and discrete data, effective in multi-class prediction. | `from sklearn.naive_bayes import GaussianNB` |
| **Ridge Regression** | Extends Linear Regression with a penalty term on the size of coefficients to address multicollinearity and prevent overfitting. | Reduces model complexity, particularly useful when dealing with multicollinearity, includes a bias-variance trade-off through regularization. | `from sklearn.linear_model import Ridge` |
| **Lasso Regression** | Similar to Ridge, but uses L1 regularization to penalize the absolute size of coefficients, encouraging sparse solutions. | Facilitates feature selection by shrinking some coefficients to zero, useful for models with high dimensionality, offers sparse solutions. | `from sklearn.linear_model import Lasso` |
| **XGBoost** | A decision-tree-based ensemble that uses a gradient boosting framework for supervised learning tasks: trees are added one at a time, each correcting the errors of the previous ones. | Optimized for speed and performance, handles large datasets effectively, used for both classification and regression, handles missing values natively, includes built-in regularization. | `import xgboost as xgb` |
| **LightGBM** | A gradient boosting framework that uses tree-based learning algorithms, designed for distributed and efficient training. | Fast training speed, lower memory usage, supports large datasets, performs well with categorical data, capable of handling high-dimensional data. | `import lightgbm as lgb` |
| **K-Means Clustering** | Partitions 'n' observations into 'k' clusters, where each observation belongs to the cluster with the nearest mean. | Widely used for unsupervised clustering, efficient with large datasets, sensitive to the selection of 'k', may converge to local minima. | `from sklearn.cluster import KMeans` |
| **PCA (Principal Component Analysis)** | Reduces dimensionality by transforming to a new set of variables (principal components) that summarize the most variance. | Effective for dimensionality reduction, identifies most significant features, can improve model performance and visualization. | `from sklearn.decomposition import PCA` |
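
One hedged way to choose among the models in the table is to compare a few candidates with cross-validation on your own data. The sketch below uses scikit-learn's bundled iris dataset purely as a stand-in; the candidate list, the hyperparameters, and `cv=5` are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale-sensitive models are wrapped in a pipeline so scaling is refit inside each fold
candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in candidates.items():
    cv_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    scores[name] = cv_scores.mean()
    print(f"{name}: mean accuracy = {scores[name]:.3f}")

best_name = max(scores, key=scores.get)
print("Best by mean CV accuracy:", best_name)
```

Mean accuracy is only one criterion; interpretability, training time, and the key points listed in the table may matter more for a given problem.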


### Machine Learning Regression

In [None]:
## Importing required libraries
import pandas as pd ## For DataFrame operation
import numpy as np ## Numerical python for matrix operations
from sklearn.model_selection import KFold, train_test_split ## Creating cross validation sets
from sklearn import metrics ## For loss functions
import matplotlib.pyplot as plt

## Libraries for Regression algorithms
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
import xgboost as xgb
import lightgbm as lgb 
from sklearn.ensemble import ExtraTreesRegressor,RandomForestRegressor
import lime 
import lime.lime_tabular


# model.get_params()  # lists an estimator's hyperparameters, a starting point for tuning (needs an existing `model`)

########### Cross Validation ###########
### 1) Train test split
def holdout_cv(X,y,size = 0.3, seed = 1):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = size, random_state = seed)
    X_train = X_train.reset_index(drop=True)
    X_test = X_test.reset_index(drop=True)
    return X_train, X_test, y_train, y_test

### 2) Cross-Validation (K-Fold)
def kfold_cv(X,n_folds = 5, seed = 1):
    cv = KFold(n_splits = n_folds, random_state = seed, shuffle = True)
    return cv.split(X)

########### Model Explanation ###########
## Variable Importance plot
def feature_importance(model,X):
    feature_importance = model.feature_importances_
    feature_importance = 100.0 * (feature_importance / feature_importance.max())
    sorted_idx = np.argsort(feature_importance)
    pos = np.arange(sorted_idx.shape[0]) + .5
    plt.figure(figsize=(15, 15))
    plt.subplot(1, 2, 2)
    plt.barh(pos, feature_importance[sorted_idx], align='center')
    plt.yticks(pos, X.columns[sorted_idx])
    plt.xlabel('Relative Importance')
    plt.title('Variable Importance')
    plt.show()

def zscore_normalize_features(X):
    """
    computes X, z-score normalized by column
    
    Args:
      X (ndarray (m,n))     : input data, m examples, n features
      
    Returns:
      X_norm (ndarray (m,n)): input normalized by column
      mu (ndarray (n,))     : mean of each feature
      sigma (ndarray (n,))  : standard deviation of each feature
    """
    # find the mean of each column/feature
    mu     = np.mean(X, axis=0)                 # mu will have shape (n,)
    # find the standard deviation of each column/feature
    sigma  = np.std(X, axis=0)                  # sigma will have shape (n,)
    # element-wise, subtract mu for that column from each example, divide by std for that column
    X_norm = (X - mu) / sigma      

    return (X_norm, mu, sigma)

def standardize_data(X_train, X_test):
    """
    Standardizes the training and testing data using the mean and standard deviation
    learned from the training set.
    
    Args:
    - X_train: numpy array or pandas dataframe, training data
    - X_test: numpy array or pandas dataframe, testing data
    
    Returns:
    - X_train_scaled: numpy array or pandas dataframe, standardized training data
    - X_test_scaled: numpy array or pandas dataframe, standardized testing data
    """
    from sklearn.preprocessing import StandardScaler 
    # Set up the scaler
    scaler = StandardScaler()
    
    # Fit the scaler to the training set
    scaler.fit(X_train) 
    
    # Transform the training and testing sets
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    return X_train_scaled, X_test_scaled

def mean_normalize(X_train, X_test):
    """
    Perform mean normalization on both the training and testing sets.

    Parameters:
    -----------
    X_train: numpy.ndarray
        The training set features as a 2D array.

    X_test: numpy.ndarray
        The testing set features as a 2D array.

    Returns:
    --------
    X_train_norm: numpy.ndarray
        The mean-normalized training set features as a 2D array.

    X_test_norm: numpy.ndarray
        The mean-normalized testing set features as a 2D array.
    """
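    from sklearn.preprocessing import StandardScaler, RobustScaler
    # Step 1: center each feature on its training-set mean
    scaler_mean = StandardScaler(with_mean=True, with_std=False)
    # Step 2: divide by the value range (max - min); quantile_range=(0, 100) makes RobustScaler use the full range
    scaler_minmax = RobustScaler(with_centering = False, with_scaling = True,
                                 quantile_range = (0,100))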
    
    scaler_mean.fit(X_train) # fit the scaler to the train set, it will learn the parameters
    scaler_minmax.fit(X_train) #fit the scaler to the train set, it will learn the parameters
    
    X_train_norm = scaler_minmax.transform(scaler_mean.transform(X_train)) # transform train set
    X_test_norm = scaler_minmax.transform(scaler_mean.transform(X_test)) # transform test set
    return X_train_norm, X_test_norm


def scale_min_max(X_train, X_test):
    """
    Scales the features in X_train and X_test to the range [0, 1] using MinMaxScaler.
    
    Parameters:
    -----------
    X_train: numpy array
        Training data features
        
    X_test: numpy array
        Test data features
        
    Returns:
    --------
    X_train_scaled: numpy array
        Scaled training data features
        
    X_test_scaled: numpy array
        Scaled test data features
    """
    from sklearn.preprocessing import MinMaxScaler 
    # set up the scaler
    scaler = MinMaxScaler()
    
    # fit the scaler to the train set, it will learn the parameters
    scaler.fit(X_train)
    
    # transform train and test sets
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    return X_train_scaled, X_test_scaled


########### Functions for explanation using Lime ###########

## Make a prediction function
def make_prediction_function(model, type = None):
    if type == 'xgb':
        predict_fn = lambda x: model.predict(xgb.DMatrix(x)).astype(float)
    else:
        predict_fn = lambda x: model.predict(x).astype(float)
    return predict_fn

## Make a lime explainer
def make_lime_explainer(df, c_names = [], verbose_val = True):
    explainer = lime.lime_tabular.LimeTabularExplainer(df.values,
                                                       class_names=c_names,
                                                       feature_names = list(df.columns),
                                                       kernel_width=3, 
                                                       verbose=verbose_val,
                                                       mode='regression'
                                                    )
    return explainer

## Lime explain function
def lime_explain(explainer,predict_fn, df, index = 0, num_features = None,
                 show_in_notebook = True, filename = None):
    if num_features is not None:
        exp = explainer.explain_instance(df.values[index], predict_fn, num_features=num_features)
    else:
        exp = explainer.explain_instance(df.values[index], predict_fn, num_features=df.shape[1])
    
    if show_in_notebook:
        exp.show_in_notebook(show_all=False)
    
    if filename is not None:
        exp.save_to_file(filename)
        
########### Algorithms For Regression ###########

### Running Xgboost
def runXGB(train_X, train_y, test_X, test_y=None, test_X2=None, seed_val=0, 
           rounds=500, dep=8, eta=0.05,sub_sample=0.7,col_sample=0.7,
           min_child_weight_val=1, silent_val = 1):
    params = {
        "objective": "reg:squarederror",
        "eval_metric": "rmse",
        "eta": eta,
        "subsample": sub_sample,
        "min_child_weight": min_child_weight_val,
        "colsample_bytree": col_sample,
        "max_depth": dep,
        "seed": seed_val,
        "verbosity": 0  
    }
    num_rounds = rounds

    plst = list(params.items())
    xgtrain = xgb.DMatrix(train_X, label=train_y)

    if test_y is not None:
        xgtest = xgb.DMatrix(test_X, label=test_y)
        watchlist = [ (xgtrain,'train'), (xgtest, 'test') ]
        model = xgb.train(plst, xgtrain, num_rounds, evals=watchlist, early_stopping_rounds=100, verbose_eval=20)
        # best_iteration is 0-based, so add 1 to get the number of trees to use
        n_trees = model.best_iteration + 1
    else:
        xgtest = xgb.DMatrix(test_X)
        model = xgb.train(plst, xgtrain, num_rounds)
        n_trees = num_rounds  # no early stopping, so use every boosting round
    
    # iteration_range replaces the ntree_limit argument removed in XGBoost 2.0
    pred_test_y = model.predict(xgtest, iteration_range=(0, n_trees))
    
    pred_test_y2 = 0
    if test_X2 is not None:
        pred_test_y2 = model.predict(xgb.DMatrix(test_X2), iteration_range=(0, n_trees))
    
    loss = 0
    if test_y is not None:
        loss = metrics.mean_squared_error(test_y, pred_test_y)
        r2 = metrics.r2_score(test_y, pred_test_y)
        print(f'r2_score is: {r2}')
    return pred_test_y, loss, pred_test_y2, model
        
### Running LightGBM
def runLGB(train_X, train_y, test_X, test_y=None, test_X2=None, feature_names=None, 
           seed_val=0, rounds=500, dep=8, eta=0.05,sub_sample=0.7,
           col_sample=0.7,silent_val = 1,min_data_in_leaf_val = 20, bagging_freq = 5):
    params = {}
    params["objective"] = "regression"
    params['metric'] = 'rmse'
    params["max_depth"] = dep
    params["min_data_in_leaf"] = min_data_in_leaf_val
    params["learning_rate"] = eta
    params["bagging_fraction"] = sub_sample
    params["feature_fraction"] = col_sample
    params["bagging_freq"] = bagging_freq
    params["bagging_seed"] = seed_val
    params["verbosity"] = silent_val
    num_rounds = rounds
    
    lgtrain = lgb.Dataset(train_X, label=train_y)
    
    if test_y is not None:
        lgtest = lgb.Dataset(test_X, label=test_y, reference=lgtrain)
        # LightGBM 4.0 removed the early_stopping_rounds / verbose_eval arguments in favour of callbacks
        model = lgb.train(params, lgtrain, num_rounds, valid_sets=[lgtest],
                          callbacks=[lgb.early_stopping(stopping_rounds=100), lgb.log_evaluation(period=20)])
    else:
        model = lgb.train(params, lgtrain, num_rounds)
        
    pred_test_y = model.predict(test_X, num_iteration=model.best_iteration)
    
    pred_test_y2 = 0
    if test_X2 is not None:
        pred_test_y2 = model.predict(test_X2, num_iteration=model.best_iteration)
    
    loss = 0
    if test_y is not None:
        loss = metrics.mean_squared_error(test_y, pred_test_y)
        print(loss)
        return pred_test_y, loss, pred_test_y2, model
    else:
        return pred_test_y, loss, pred_test_y2, model
        
### Running Extra Trees  
def runET(train_X, train_y, test_X, test_y=None, test_X2=None, rounds=100, depth=20,
          leaf=10, feat=0.2, min_data_split_val=2,seed_val=0,job = -1):
    model = ExtraTreesRegressor(
                                n_estimators = rounds,
                                max_depth = depth,
                                min_samples_split = min_data_split_val,
                                min_samples_leaf = leaf,
                                max_features =  feat,
                                n_jobs = job,
                                random_state = seed_val)
    model.fit(train_X, train_y)
    train_preds = model.predict(train_X)
    test_preds = model.predict(test_X)
    
    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict(test_X2)
    
    test_loss = 0
    if test_y is not None:
        train_loss = metrics.mean_squared_error(train_y, train_preds)
        test_loss = metrics.mean_squared_error(test_y, test_preds)
        print("Depth, leaf, feat : ", depth, leaf, feat)
        print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model
 
### Running Random Forest
def runRF(train_X, train_y, test_X, test_y=None, test_X2=None, rounds=100, depth=20, leaf=10,
          feat=0.2,min_data_split_val=2,seed_val=0,job = -1):
    model = RandomForestRegressor(
                                n_estimators = rounds,
                                max_depth = depth,
                                min_samples_split = min_data_split_val,
                                min_samples_leaf = leaf,
                                max_features =  feat,
                                n_jobs = job,
                                random_state = seed_val)
    model.fit(train_X, train_y)
    train_preds = model.predict(train_X)
    test_preds = model.predict(test_X)
    
    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict(test_X2)
    
    train_loss = metrics.mean_squared_error(train_y, train_preds)
    test_loss = 0
    if test_y is not None:  # test_y is optional, so only score the test set when labels are given
        test_loss = metrics.mean_squared_error(test_y, test_preds)
    print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model

### Running Linear regression
def runLR(train_X, train_y, test_X, test_y=None, test_X2=None):
    model = LinearRegression()
    model.fit(train_X, train_y)
    train_preds = model.predict(train_X)
    test_preds = model.predict(test_X)

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict(test_X2)
    train_loss = metrics.mean_squared_error(train_y, train_preds)
    test_loss = 0
    if test_y is not None:  # test_y is optional, so only score the test set when labels are given
        test_loss = metrics.mean_squared_error(test_y, test_preds)
    print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model

### Running Decision Tree
def runDT(train_X, train_y, test_X, test_y=None, test_X2=None, criterion='squared_error',  # 'mse' was renamed in scikit-learn 1.0 and removed in 1.2
          depth=None, min_split=2, min_leaf=1):
    model = DecisionTreeRegressor(
                                criterion = criterion, 
                                max_depth = depth, 
                                min_samples_split = min_split, 
                                min_samples_leaf=min_leaf)
    model.fit(train_X, train_y)
    train_preds = model.predict(train_X)
    test_preds = model.predict(test_X)

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict(test_X2)
    
    train_loss = metrics.mean_squared_error(train_y, train_preds)
    test_loss = 0
    if test_y is not None:  # test_y is optional, so only score the test set when labels are given
        test_loss = metrics.mean_squared_error(test_y, test_preds)
    print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model
    
### Running K-Nearest Neighbour
def runKNN(train_X, train_y, test_X, test_y=None, test_X2=None, 
           neighbors=5, job = -1):
    model = KNeighborsRegressor(
                                n_neighbors=neighbors, 
                                n_jobs=job)
    model.fit(train_X, train_y)
    train_preds = model.predict(train_X)
    test_preds = model.predict(test_X)

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict(test_X2)
    
    train_loss = metrics.mean_squared_error(train_y, train_preds)
    test_loss = 0
    if test_y is not None:  # test_y is optional, so only score the test set when labels are given
        test_loss = metrics.mean_squared_error(test_y, test_preds)
    print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model

### Running SVM
def runSVC(train_X, train_y, test_X, test_y=None, test_X2=None, C=1.0, 
           eps=0.1, kernel_choice = 'rbf'):
    model = SVR(
                C=C, 
                kernel=kernel_choice,  
                epsilon=eps)
    model.fit(train_X, train_y)
    train_preds = model.predict(train_X)
    test_preds = model.predict(test_X)

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict(test_X2)
    
    test_loss = 0
    if test_y is not None:  # the test loss needs the true test targets
        train_loss = metrics.mean_squared_error(train_y, train_preds)
        test_loss = metrics.mean_squared_error(test_y, test_preds)
        print("Train and Test loss : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model

### Machine Learning Classification

In [None]:
## Importing required libraries
import pandas as pd  ## For DataFrame operation
import numpy as np  ## Numerical python for matrix operations
from sklearn.model_selection import (
    KFold,
    train_test_split,
)  ## Creating cross validation sets
from sklearn import metrics  ## For loss functions
import matplotlib.pyplot as plt
import itertools

## For evaluation
from sklearn.metrics import (
    roc_curve,
    auc,
    roc_auc_score,
    confusion_matrix,
    precision_recall_curve,
    average_precision_score,
)
from inspect import signature

## Libraries for Classification algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
import lime
import lime.lime_tabular

# model.get_params()  # run on any model object to list its hyperparameters, a starting point for tuning it



def split_data(X, y, test_size=0.2, val_size=0.2, random_state=42): #split data into train, test, and validation
    """
    Split the data into train and test sets, then split the train set again into training and validation sets.

    Parameters
    ----------
    X : pandas DataFrame or array-like
        The input features.
    y : pandas Series or array-like
        The target values.
    test_size : float, optional (default=0.2)
        The proportion of the data to be used for testing.
    val_size : float, optional (default=0.2)
        The proportion of the remaining training data to be used for validation.
    random_state : int, optional (default=42)
        The seed used by the random number generator.

    Returns
    -------
    X_train : pandas DataFrame
        The training input data.
    y_train : pandas Series
        The training target data.
    X_valid : pandas DataFrame
        The validation input data.
    y_valid : pandas Series
        The validation target data.
    X_test : pandas DataFrame
        The test input data.
    y_test : pandas Series
        The test target data.
    """ 
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    
    X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=val_size, random_state=random_state)
    
    return X_train, y_train, X_valid, y_valid, X_test, y_test

########### Cross Validation ###########
### 1) Train test split
def holdout_cv(X, y, size=0.3, seed=1):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=size, random_state=seed
    )
    X_train = X_train.reset_index(drop=True)  # drop expects a boolean
    X_test = X_test.reset_index(drop=True)
    return X_train, X_test, y_train, y_test


### 2) Cross-Validation (K-Fold)
def kfold_cv(X, n_folds=5, seed=1):
    cv = KFold(n_splits=n_folds, random_state=seed, shuffle=True)
    return cv.split(X)


########### Model Explanation ###########
## Plotting AUC ROC curve
def plot_roc(y_actual, y_pred):
    """
    Function to plot AUC-ROC curve
    """
    fpr, tpr, thresholds = roc_curve(y_actual, y_pred)
    plt.plot(
        fpr,
        tpr,
        color="b",
        label=r"Model (AUC = %0.2f)" % (roc_auc_score(y_actual, y_pred)),
        lw=2,
        alpha=0.8,
    )
    plt.plot(
        [0, 1],
        [0, 1],
        linestyle="--",
        lw=2,
        color="r",
        label="Luck (AUC = 0.5)",
        alpha=0.8,
    )
    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("Receiver operating characteristic example")
    plt.legend(loc="lower right")
    plt.show()


def plot_precisionrecall(y_actual, y_pred):
    """
    Function to plot the precision-recall curve
    """
    average_precision = average_precision_score(y_actual, y_pred)
    precision, recall, _ = precision_recall_curve(y_actual, y_pred)
    # In matplotlib < 1.5, plt.fill_between does not have a 'step' argument
    step_kwargs = (
        {"step": "post"} if "step" in signature(plt.fill_between).parameters else {}
    )

    plt.figure(figsize=(9, 6))
    plt.step(recall, precision, color="b", alpha=0.2, where="post")
    plt.fill_between(recall, precision, alpha=0.2, color="b", **step_kwargs)

    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.ylim([0.0, 1.05])
    plt.xlim([0.0, 1.0])
    plt.title("Precision-Recall curve: AP={0:0.2f}".format(average_precision))


## Plotting confusion matrix
def plot_confusion_matrix(
    y_true,
    y_pred,
    classes,
    normalize=False,
    title="Confusion matrix",
    cmap=plt.cm.Blues,
):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    cm = metrics.confusion_matrix(y_true, y_pred)
    if normalize:
        cm = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print("Confusion matrix, without normalization")

    print(cm)

    plt.imshow(cm, interpolation="nearest", cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = ".2f" if normalize else "d"
    thresh = cm.max() / 2.0
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(
            j,
            i,
            format(cm[i, j], fmt),
            horizontalalignment="center",
            color="white" if cm[i, j] > thresh else "black",
        )

    plt.tight_layout()
    plt.ylabel("True label")
    plt.xlabel("Predicted label")


## Variable Importance plot
def feature_importance(model, X):
    feature_importance = model.feature_importances_
    feature_importance = 100.0 * (feature_importance / feature_importance.max())
    sorted_idx = np.argsort(feature_importance)
    pos = np.arange(sorted_idx.shape[0]) + 0.5
    plt.figure(figsize=(15, 15))
    plt.subplot(1, 2, 2)
    plt.barh(pos, feature_importance[sorted_idx], align="center")
    plt.yticks(pos, X.columns[sorted_idx])
    plt.xlabel("Relative Importance")
    plt.title("Variable Importance")
    plt.show()


## Functions for explanation using LIME
def make_prediction_function(model):
    predict_fn = lambda x: model.predict_proba(x).astype(float)
    return predict_fn


def make_lime_explainer(df, c_names=[], k_width=3, verbose_val=True):
    explainer = lime.lime_tabular.LimeTabularExplainer(
        df.values,
        class_names=c_names,
        feature_names=list(df.columns),
        kernel_width=k_width,
        verbose=verbose_val,
    )
    return explainer


def lime_explain(
    explainer,
    predict_fn,
    df,
    index=0,
    num_features=None,
    show_in_notebook=True,
    filename=None,
):
    if num_features is not None:
        exp = explainer.explain_instance(
            df.values[index], predict_fn, num_features=num_features
        )
    else:
        exp = explainer.explain_instance(
            df.values[index], predict_fn, num_features=df.shape[1]
        )

    if show_in_notebook:
        exp.show_in_notebook(show_all=False)

    if filename is not None:
        exp.save_to_file(filename)


########### Algorithms For Binary classification ###########

### Running Xgboost
def runXGB(
    train_X,
    train_y,
    test_X,
    test_y=None,
    test_X2=None,
    seed_val=0,
    rounds=500,
    dep=8,
    eta=0.05,
    sub_sample=0.7,
    col_sample=0.7,
    min_child_weight_val=1,
    silent_val=1,
):
    params = {}
    params["objective"] = "binary:logistic"
    params["eval_metric"] = "auc"
    params["eta"] = eta
    params["subsample"] = sub_sample
    params["min_child_weight"] = min_child_weight_val
    params["colsample_bytree"] = col_sample
    params["max_depth"] = dep
    params["silent"] = silent_val
    params["seed"] = seed_val
    # params["max_delta_step"] = 2
    # params["gamma"] = 0.5
    num_rounds = rounds

    plst = list(params.items())
    xgtrain = xgb.DMatrix(train_X, label=train_y)

    if test_y is not None:
        xgtest = xgb.DMatrix(test_X, label=test_y)
        watchlist = [(xgtrain, "train"), (xgtest, "test")]
        model = xgb.train(
            plst,
            xgtrain,
            num_rounds,
            watchlist,
            early_stopping_rounds=100,
            verbose_eval=20,
        )
    else:
        xgtest = xgb.DMatrix(test_X)
        model = xgb.train(plst, xgtrain, num_rounds)

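    # `ntree_limit` was removed in XGBoost 2.0; `iteration_range` (XGBoost >= 1.4) replaces it.
    # Use the trees up to the best iteration when early stopping ran, otherwise all trees.
    best_iter = getattr(model, "best_iteration", None)
    iter_range = (0, best_iter + 1) if best_iter is not None else (0, 0)
    pred_test_y = model.predict(xgtest, iteration_range=iter_range)

    pred_test_y2 = 0
    if test_X2 is not None:
        pred_test_y2 = model.predict(xgb.DMatrix(test_X2), iteration_range=iter_range)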

    loss = 0
    if test_y is not None:
        loss = metrics.roc_auc_score(test_y, pred_test_y)
    return pred_test_y, loss, pred_test_y2, model


### Running Xgboost classifier for model explanation
def runXGBC(
    train_X,
    train_y,
    test_X,
    test_y=None,
    test_X2=None,
    seed_val=0,
    rounds=500,
    dep=8,
    eta=0.05,
    sub_sample=0.7,
    col_sample=0.7,
    min_child_weight_val=1,
    silent_val=1,
):
    model = xgb.XGBClassifier(
        objective="binary:logistic",
        learning_rate=eta,
        subsample=sub_sample,
        min_child_weight=min_child_weight_val,
        colsample_bytree=col_sample,
        max_depth=dep,
        silent=silent_val,
        seed=seed_val,
        n_estimators=rounds,
    )

    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]

    test_loss = 0
    if test_y is not None:
        train_loss = metrics.roc_auc_score(train_y, train_preds)
        test_loss = metrics.roc_auc_score(test_y, test_preds)
        print("Train and Test AUC : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model


### Running LightGBM
def runLGB(
    train_X,
    train_y,
    test_X,
    test_y=None,
    test_X2=None,
    feature_names=None,
    seed_val=0,
    rounds=500,
    dep=8,
    eta=0.05,
    sub_sample=0.7,
    col_sample=0.7,
    silent_val=1,
    min_data_in_leaf_val=20,
    bagging_freq=5,
    n_thread=20,
    metric="auc",
):
    params = {}
    params["objective"] = "binary"
    params["metric"] = metric
    params["max_depth"] = dep
    params["min_data_in_leaf"] = min_data_in_leaf_val
    params["learning_rate"] = eta
    params["bagging_fraction"] = sub_sample
    params["feature_fraction"] = col_sample
    params["bagging_freq"] = bagging_freq
    params["bagging_seed"] = seed_val
    params["verbosity"] = silent_val
    params["num_threads"] = n_thread
    num_rounds = rounds

    lgtrain = lgb.Dataset(train_X, label=train_y)

    if test_y is not None:
        lgtest = lgb.Dataset(test_X, label=test_y)
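        model = lgb.train(
            params,
            lgtrain,
            num_rounds,
            valid_sets=[lgtrain, lgtest],
            # LightGBM 4.0 removed the `early_stopping_rounds` and `verbose_eval` arguments;
            # the equivalent callbacks are used instead
            callbacks=[lgb.early_stopping(100), lgb.log_evaluation(20)],
        )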
    else:
        lgtest = lgb.Dataset(test_X)
        model = lgb.train(params, lgtrain, num_rounds)

    pred_test_y = model.predict(test_X, num_iteration=model.best_iteration)

    pred_test_y2 = 0
    if test_X2 is not None:
        pred_test_y2 = model.predict(test_X2, num_iteration=model.best_iteration)

    loss = 0
    if test_y is not None:
        loss = roc_auc_score(test_y, pred_test_y)
        print(loss)
    return pred_test_y, loss, pred_test_y2, model


### Running LightGBM classifier for model explanation
def runLGBC(
    train_X,
    train_y,
    test_X,
    test_y=None,
    test_X2=None,
    seed_val=0,
    rounds=500,
    dep=8,
    eta=0.05,
    sub_sample=0.7,
    col_sample=0.7,
    silent_val=1,
    min_data_in_leaf_val=20,
    bagging_freq=5,
    n_thread=20,
    metric="auc",
):
    model = lgb.LGBMClassifier(
        max_depth=dep,
        learning_rate=eta,
        min_data_in_leaf=min_data_in_leaf_val,
        bagging_fraction=sub_sample,
        feature_fraction=col_sample,
        bagging_freq=bagging_freq,
        bagging_seed=seed_val,
        verbosity=silent_val,
        num_threads=n_thread,
        n_estimators=rounds,
        metric=metric,
    )

    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]

    test_loss = 0
    if test_y is not None:
        train_loss = roc_auc_score(train_y, train_preds)
        test_loss = roc_auc_score(test_y, test_preds)
        print("Train and Test AUC : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model


### Running Extra Trees
def runET(
    train_X,
    train_y,
    test_X,
    test_y=None,
    test_X2=None,
    rounds=100,
    depth=20,
    leaf=10,
    feat=0.2,
    min_data_split_val=2,
    seed_val=0,
    job=-1,
):
    model = ExtraTreesClassifier(
        n_estimators=rounds,
        max_depth=depth,
        min_samples_split=min_data_split_val,
        min_samples_leaf=leaf,
        max_features=feat,
        n_jobs=job,
        random_state=seed_val,
    )
    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]

    test_loss = 0
    if test_y is not None:
        train_loss = metrics.roc_auc_score(train_y, train_preds)
        test_loss = metrics.roc_auc_score(test_y, test_preds)
        print("Depth, leaf, feat : ", depth, leaf, feat)
        print("Train and Test AUC : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model


### Running Random Forest
def runRF(
    train_X,
    train_y,
    test_X,
    test_y=None,
    test_X2=None,
    rounds=100,
    depth=20,
    leaf=10,
    feat=0.2,
    min_data_split_val=2,
    seed_val=0,
    job=-1,
):
    model = RandomForestClassifier(
        n_estimators=rounds,
        max_depth=depth,
        min_samples_split=min_data_split_val,
        min_samples_leaf=leaf,
        max_features=feat,
        n_jobs=job,
        random_state=seed_val,
    )
    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]

    test_loss = 0
    if test_y is not None:
        train_loss = metrics.roc_auc_score(train_y, train_preds)
        test_loss = metrics.roc_auc_score(test_y, test_preds)
        print("Train and Test AUC : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model


### Running Logistic Regression
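def runLR(train_X, train_y, test_X, test_y=None, test_X2=None, C=1.0, penalty="l1"):
    # The default lbfgs solver does not support the L1 penalty; liblinear handles both "l1" and "l2"
    model = LogisticRegression(C=C, penalty=penalty, solver="liblinear")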
    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]
    test_loss = 0
    if test_y is not None:  # AUC needs the true test labels
        train_loss = metrics.roc_auc_score(train_y, train_preds)
        test_loss = metrics.roc_auc_score(test_y, test_preds)
        print("Train and Test AUC : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model


### Running Decision Tree
def runDT(
    train_X,
    train_y,
    test_X,
    test_y=None,
    test_X2=None,
    criterion="gini",
    depth=None,
    min_split=2,
    min_leaf=1,
):
    model = DecisionTreeClassifier(
        criterion=criterion,
        max_depth=depth,
        min_samples_split=min_split,
        min_samples_leaf=min_leaf,
    )
    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]

    test_loss = 0
    if test_y is not None:  # AUC needs the true test labels
        train_loss = metrics.roc_auc_score(train_y, train_preds)
        test_loss = metrics.roc_auc_score(test_y, test_preds)
        print("Train and Test AUC : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model


### Running K-Nearest Neighbour
def runKNN(train_X, train_y, test_X, test_y=None, test_X2=None, neighbors=5, job=-1):
    model = KNeighborsClassifier(n_neighbors=neighbors, n_jobs=job)
    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]

    test_loss = 0
    if test_y is not None:  # AUC needs the true test labels
        train_loss = metrics.roc_auc_score(train_y, train_preds)
        test_loss = metrics.roc_auc_score(test_y, test_preds)
        print("Train and Test AUC : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model


### Running SVM
def runSVC(
    train_X, train_y, test_X, test_y=None, test_X2=None, C=1.0, kernel_choice="rbf"
):
    model = SVC(C=C, kernel=kernel_choice, probability=True)
    model.fit(train_X, train_y)
    train_preds = model.predict_proba(train_X)[:, 1]
    test_preds = model.predict_proba(test_X)[:, 1]

    test_preds2 = 0
    if test_X2 is not None:
        test_preds2 = model.predict_proba(test_X2)[:, 1]

    test_loss = 0
    if test_y is not None:  # AUC needs the true test labels
        train_loss = metrics.roc_auc_score(train_y, train_preds)
        test_loss = metrics.roc_auc_score(test_y, test_preds)
        print("Train and Test AUC : ", train_loss, test_loss)
    return test_preds, test_loss, test_preds2, model
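
The K-fold helper and the classifier runners in this section are usually combined into an out-of-fold evaluation loop. The following is a minimal sketch of that pattern, not the notebook's own pipeline: it assumes a synthetic dataset from `make_classification` and a plain `LogisticRegression` standing in for any of the `run*` functions, and the names `oof_preds` and `fold_aucs` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Synthetic binary classification data (assumption: stands in for a real dataset)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=2,
    n_clusters_per_class=1, random_state=0,
)

cv = KFold(n_splits=5, shuffle=True, random_state=1)
oof_preds = np.zeros(len(y))  # out-of-fold predicted probabilities
fold_aucs = []

for train_idx, valid_idx in cv.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # Probability of the positive class for the held-out fold
    oof_preds[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y[valid_idx], oof_preds[valid_idx]))

print("Per-fold AUC:", np.round(fold_aucs, 3))
print("Out-of-fold AUC:", round(roc_auc_score(y, oof_preds), 3))
```

Every row is scored exactly once by a model that never saw it during training, so the out-of-fold AUC tends to be a steadier estimate than a single hold-out split.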


### Unsupervised Learning

> Clustering

In [None]:
# KMeans Clustering
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


# Build the Model 
cluster = KMeans(n_clusters=9, init='k-means++', max_iter=300, n_init=10, random_state=0)
df['cluster'] = cluster.fit_predict(df)  # fit_predict returns one cluster label per row


# Elbow Method to Optimize the Number of Clusters
def optimize_kmeans(X, n_clusters_range=(2, 15)):
    """
    Optimize K-Means clustering using the Elbow Method and Silhouette Score.

    Args:
    X: array-like, feature dataset.
    n_clusters_range: tuple, range of n_clusters to try (min, max).

    Returns:
    dict: A dictionary containing the KMeans models, their silhouette scores, and WCSS.
    """

    kmeans_results = {}
    wcss = []

    for n_clusters in range(n_clusters_range[0], n_clusters_range[1]+1):
        kmeans = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=300, n_init=10, random_state=0).fit(X)
        
        # Silhouette Score
        score = silhouette_score(X, kmeans.labels_)
        kmeans_results[n_clusters] = {'model': kmeans, 'silhouette_score': score, 'wcss': kmeans.inertia_}
        
        # WCSS
        wcss.append(kmeans.inertia_)

    # Plot WCSS - Elbow Method
    plt.figure(figsize=(10, 6))
    plt.plot(range(n_clusters_range[0], n_clusters_range[1]+1), wcss)
    plt.title('Elbow Method')
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS')
    plt.show()

    return kmeans_results

# Example usage
# Replace 'df' with your actual DataFrame
kmeans_optimization_results = optimize_kmeans(df.iloc[:,:-1], n_clusters_range=(2, 15))
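
The elbow plot is read by eye, whereas silhouette scores can be compared directly. The sketch below shows one way to pick the number of clusters from such scores. It is an illustration under assumptions: the data are synthetic blobs from `make_blobs`, the loop is a trimmed-down version of the optimisation function (no plotting), and `best_k` is a hypothetical name.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data drawn around 4 centres (assumption, for illustration only)
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

results = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    results[k] = {
        "silhouette_score": silhouette_score(X, km.labels_),
        "wcss": km.inertia_,  # within-cluster sum of squares
    }

# A higher silhouette score is better; WCSS tends to keep shrinking as k grows,
# which is why it needs the "elbow" heuristic instead of a simple minimum
best_k = max(results, key=lambda k: results[k]["silhouette_score"])
print("k with the highest silhouette score:", best_k)
```

The silhouette maximum and the elbow do not always agree, so it is worth inspecting both before fixing `n_clusters`.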

> Anomaly Detection

In [None]:
# Anomaly Detection
from sklearn.ensemble import IsolationForest

# Instantiate the Isolation Forest model
iso_forest = IsolationForest(n_estimators=100, contamination=0.03, random_state=42)

# Fit the model
iso_forest.fit(X)

# Predictions: -1 for anomalies and 1 for normal points
y_pred = iso_forest.predict(X)
anomaly_indices = np.where(y_pred == -1)[0]


# Visualize
plt.figure(figsize=(10, 7))

# Plot all samples, then overlay the points flagged as anomalies (assumes X is a 2-D NumPy array)
plt.title("Isolation Forest Anomaly Detection")
plt.scatter(X[:, 0], X[:, 1], color='k', s=30, label='Data Points')
plt.scatter(X[anomaly_indices, 0], X[anomaly_indices, 1], color='r', s=50, 
            label='Anomalies')
plt.legend()
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

In [None]:
# Anomaly Detection using DBSCAN

from sklearn.cluster import DBSCAN

# Apply DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
dbscan.fit(X)

# Labels for each point
labels = dbscan.labels_
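
DBSCAN gives the label `-1` to points that do not belong to any dense region, and those noise points can be treated as anomalies. A minimal sketch of that step, assuming synthetic two-dimensional data built here only for illustration (the cluster positions and the three outliers are made up):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two tight clusters plus a few far-away points (synthetic, for illustration only)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(50, 2))
cluster_b = rng.normal(loc=[3.0, 3.0], scale=0.1, size=(50, 2))
outliers = np.array([[10.0, 10.0], [-10.0, 5.0], [8.0, -9.0]])
X = np.vstack([cluster_a, cluster_b, outliers])

labels = DBSCAN(eps=0.3, min_samples=5).fit(X).labels_

# Noise points carry the label -1 and serve as the anomalies here
anomaly_indices = np.where(labels == -1)[0]
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("anomaly indices:", anomaly_indices)
```

The result is sensitive to `eps` and `min_samples`, and because `eps` is a distance, features on different scales are usually standardised first.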

In [None]:
# Anomaly Detection - Gaussian Anomaly Detection

from sklearn.mixture import GaussianMixture
from sklearn.metrics import precision_recall_fscore_support
import numpy as np

class SklearnGaussianAnomalyDetection:
    """Anomaly detection using scikit-learn's GaussianMixture"""

    def __init__(self, n_components=1):
        """Initialize the GaussianMixture model."""
        self.model = GaussianMixture(n_components=n_components)

    def fit(self, X):
        """Fit the model and set the anomaly threshold from the training data."""
        self.model.fit(X)
        # score_samples returns per-sample log-likelihoods, not probabilities
        self.threshold = np.percentile(self.model.score_samples(X), 5)  # 5th percentile

    def predict(self, X):
        """Predict anomalies: True where the log-likelihood is below the fitted threshold."""
        return self.model.score_samples(X) < self.threshold

    def evaluate(self, X, y_true):
        """Evaluate the model."""
        y_pred = self.predict(X)
        precision, recall, f_score, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')
        return precision, recall, f_score

# Example usage
# gad = SklearnGaussianAnomalyDetection()
# gad.fit(X_train)  # Assuming X_train is your training data
# precision, recall, f_score = gad.evaluate(X_test, y_test)  # Assuming X_test is your test data and y_test is the true labels


In [None]:
# Anomaly Detection using the Gaussian (Normal) Distribution Method

# Select Features: Choose features you think indicate anomalies.
# Calculate Parameters: Find the mean and variance of each feature j over the m training examples.
r'''
    Mean \( \mu_j \):
    \[ \mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)} \]

    Variance \( \sigma_j^2 \):
    \[ \sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} (x_j^{(i)} - \mu_j)^2 \]
'''
# Probability Model: For a new example x, calculate the probability density of each feature under its fitted Gaussian.
r'''
    The probability density for feature \( j \) is:
    \[ p(x_j; \mu_j, \sigma_j^2) = \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right) \]

    The combined probability of a new data point \( x \) is the product of its individual feature probabilities,
    assuming feature independence:
    \[ p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right) \]
'''
# Identify Anomalies: If p(x) is below a certain threshold, the example is marked as an anomaly.
    # the threshold (epsilon) is a hyperparameter you choose based on the desired sensitivity of the anomaly detection algorithm
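
The steps above translate almost line for line into NumPy. The sketch below is one possible implementation under stated assumptions: the training data are synthetic, the helper names `fit_gaussian` and `density` are illustrative, and the value of `epsilon` is a placeholder that would normally be tuned on labelled validation data.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate the per-feature mean and variance (1/m normalisation)."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # np.var divides by m by default
    return mu, sigma2

def density(X, mu, sigma2):
    """p(x) as the product of independent univariate Gaussian densities."""
    coef = 1.0 / np.sqrt(2.0 * np.pi * sigma2)
    expo = np.exp(-((X - mu) ** 2) / (2.0 * sigma2))
    return np.prod(coef * expo, axis=1)

rng = np.random.default_rng(0)
# Synthetic "normal" behaviour: two independent features (assumption)
X_train = rng.normal(loc=[5.0, 10.0], scale=[1.0, 2.0], size=(500, 2))
mu, sigma2 = fit_gaussian(X_train)

X_new = np.array([
    [5.0, 10.0],   # close to the mean of both features
    [12.0, 25.0],  # far from the mean of both features
])
p = density(X_new, mu, sigma2)

epsilon = 1e-4  # hypothetical threshold, chosen only for this example
is_anomaly = p < epsilon
print("densities:", p)
print("anomaly flags:", is_anomaly)
```

With many features the product of densities can underflow, so summing log-densities and comparing against `log(epsilon)` is a common variation.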




In [None]:
# Recommendation System

import re
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances, manhattan_distances

class AdvancedRecommender:
    def __init__(self, df, feature_matrix, similarity_metric='cosine'):
        self.df = df
        self.feature_matrix = feature_matrix
        self.similarity_metric = similarity_metric

    def calculate_similarity(self):
        if self.similarity_metric == 'cosine':
            return cosine_similarity(self.feature_matrix)
        # Distances grow as items become less alike, so they are mapped to (0, 1] similarities
        elif self.similarity_metric == 'euclidean':
            return 1 / (1 + euclidean_distances(self.feature_matrix))
        elif self.similarity_metric == 'manhattan':
            return 1 / (1 + manhattan_distances(self.feature_matrix))
        else:
            raise ValueError("Unsupported similarity metric")

    def recommend(self, title, total_result=5, threshold=0.5):
        idx = self.find_id(title)
        if idx == -1:
            return "Title not found in dataset", []

        similarity_scores = self.calculate_similarity()[idx]

        # Filter based on threshold, leaving out the queried item itself
        filtered_scores = [
            (index, score)
            for index, score in enumerate(similarity_scores)
            if index != idx and score >= threshold
        ]

        # Sort by score (highest first) and keep the top results
        sorted_scores = sorted(filtered_scores, key=lambda x: x[1], reverse=True)[:total_result]

        recommendations = [self.df.iloc[i]['title'] for i, _ in sorted_scores]

        return recommendations

    def find_id(self, name):
        for index, title in enumerate(self.df['title']):
            if re.search(name, title, re.IGNORECASE):
                return index
        return -1
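
The recommender expects a ready-made `feature_matrix`. One common way to build it for text data is TF-IDF, sketched below together with the core ranking step. This is an illustration under assumptions, not the notebook's own data: the titles and descriptions are invented, and `top_similar` is a hypothetical helper that mirrors the cosine branch of the class.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalogue (hypothetical titles and descriptions)
titles = ["Space Wars", "Galaxy Battles", "Cooking Basics", "Baking Bread"]
descriptions = [
    "space battle with starships and aliens",
    "starships fight a galaxy battle in space",
    "learn cooking with simple kitchen recipes",
    "kitchen recipes for baking bread at home",
]

feature_matrix = TfidfVectorizer().fit_transform(descriptions)
similarity = cosine_similarity(feature_matrix)  # shape (n_items, n_items)

def top_similar(idx, n=2):
    """Return the n titles most similar to item idx, excluding idx itself."""
    order = np.argsort(similarity[idx])[::-1]  # most similar first
    order = [i for i in order if i != idx][:n]
    return [titles[i] for i in order]

print(top_similar(0, n=1))
print(top_similar(3, n=1))
```

Computing the similarity matrix once and reusing it avoids recomputing all pairwise scores on every recommendation request.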


### Natural Language Processing (NLP)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import re
import string
import nltk
from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

eng_stop = set(stopwords.words('english'))


def word_grams(text, min=1, max=4):
    '''
    Function to create N-grams from text
    Required Input -
        - text = text string for which N-gram needs to be created
        - min = minimum number of N
        - max = maximum number of N
    Expected Output -
        - s = list of N-grams 
    '''
    s = []
    for n in range(min, max+1):
        for ngram in ngrams(text, n):
            s.append(' '.join(str(i) for i in ngram))
    return s


def generate_bigrams_df(df, column_names):
    """
    Generate bigrams from specified columns in a pandas DataFrame.

    Parameters:
    df (pd.DataFrame): DataFrame to generate bigrams from.
    column_names (list of str): List of column names to generate bigrams from.

    Returns:
    pd.DataFrame: DataFrame with bigrams appended as new columns.
    """
    bigram_columns = []
    for col in column_names:
        bigram_col = f"{col}_bigrams"
        bigram_columns.append(bigram_col)
        df[bigram_col] = df[col].apply(lambda x: generate_bigrams([x]))
    return df[bigram_columns]

def make_wordcloud(df,column, bg_color='white', w=1200, h=1000, font_size_max=50, n_words=40,g_min=1,g_max=1):
    '''
    Function to make wordcloud from a text corpus
    Required Input -
        - df = Pandas DataFrame
        - column = name of column containing text
        - bg_color = Background color
        - w = width
        - h = height
        - font_size_max = maximum font size allowed
        - n_words = maximum words allowed
        - g_min = minimum n-gram size
        - g_max = maximum n-gram size
    Expected Output -
        - Word cloud image
    '''
    # Concatenate all rows of the column into one list of words
    text = " ".join(df[column].astype(str)).strip().split(' ')

    # Build the requested n-grams once, joining the words of each n-gram with
    # underscores so that WordCloud treats every n-gram as a single token
    text = [gram.replace(' ', '_') for gram in word_grams(text, g_min, g_max)]

    s = " ".join(text)

    wordcloud = WordCloud(background_color=bg_color, \
                          width=w, \
                          height=h, \
                          max_font_size=font_size_max, \
                          max_words=n_words).generate(s)
    wordcloud.recolor(random_state=1)
    plt.rcParams['figure.figsize'] = (20.0, 10.0)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
    
def generate_wordcloud(df, column_names):
    """
    Generates a wordcloud from a pandas DataFrame

    Parameters:
    df (pd.DataFrame): DataFrame containing the data
    column_names (list): List of column names to use; each value in these columns must be a list of tokens (for example, the output of tokenize_columns)

    Returns:
    None
    """
    all_words = ' '.join([' '.join(text) for col in column_names for text in df[col]])
    wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_words)

    plt.figure(figsize=(10, 7))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis('off')
    plt.show()
    
    
def get_tokens(text):
    '''
    Function to tokenize the text
    Required Input - 
        - text - text string which needs to be tokenized
    Expected Output -
        - text - tokenized list output
    '''
    return word_tokenize(text)

def tokenize_columns(dataframe, columns):
    """
    Tokenize the values in specified columns of a pandas DataFrame.

    Parameters:
        dataframe (pandas.DataFrame): The DataFrame to tokenize.
        columns (list): A list of column names to tokenize.

    Returns:
        pandas.DataFrame: A new DataFrame with tokenized values in the specified columns.
    """
    # Download necessary NLTK resources if they haven't been downloaded yet
    nltk.download('punkt')

    # Create a new DataFrame to hold the tokenized values
    tokenized_df = pd.DataFrame()

    # Tokenize the values in each specified column
    for col in columns:
        # Tokenize the values in the current column using NLTK's word_tokenize function
        tokenized_values = dataframe[col].apply(nltk.word_tokenize)

        # Add the tokenized values to the new DataFrame
        tokenized_df[col] = tokenized_values

    # Return the new DataFrame with tokenized values
    return tokenized_df

# Another way: a simple separator-based tokenizer that does not need NLTK
# --------------------------------------------------------------------------
def tokenize(text, sep=' ', preserve_case=False):
    """
    Tokenize a string into a list of tokens.

    Parameters:
    text (str): String to be tokenized
    sep (str, optional): Separator to use for tokenization. Defaults to ' '.
    preserve_case (bool, optional): Whether to preserve the case of the text. Defaults to False.

    Returns:
    list: List of tokens
    """
    if not preserve_case:
        text = text.lower()
    tokens = text.split(sep)
    return tokens

def tokenize_df(df, column_names, sep=' ', preserve_case=False):
    """
    Tokenize a pandas dataframe with multiple columns.

    Parameters:
    df (pd.DataFrame): Dataframe to be tokenized
    column_names (list of str): List of column names to be tokenized
    sep (str, optional): Separator to use for tokenization. Defaults to ' '.
    preserve_case (bool, optional): Whether to preserve the case of the text. Defaults to False.

    Returns:
    pd.DataFrame: Tokenized dataframe
    """
    for col in column_names:
        df[col] = df[col].apply(lambda x: tokenize(x, sep, preserve_case))
    return df

# Example usage (assumes an existing DataFrame `carbon_google1` with a "title" column)
carbon_google1 = tokenize_df(carbon_google1, column_names=["title"], sep=' ', preserve_case=False)
# --------------------------------------------------------------------------------------------------------------
def bag_of_words_features(df, text_columns, target_columns=None):
    """
    Build a bag-of-words representation of one or more text columns and return it as a DataFrame.

    Parameters:
    df (pandas DataFrame): The DataFrame to extract features from.
    text_columns (list of str): Names of the text columns to combine and vectorize.
    target_columns (str or list of str, optional): Name(s) of the target column(s). Rows with a
        missing target are dropped before vectorizing so that features and targets stay aligned.

    Returns:
    pandas DataFrame: The bag-of-words features. If target_columns is given, a tuple (X_bow, y).
    """
    from sklearn.feature_extraction.text import CountVectorizer

    if target_columns:
        # Drop rows with a missing target first, so X_bow and y describe the same rows
        subset = [target_columns] if isinstance(target_columns, str) else list(target_columns)
        df = df.dropna(subset=subset)

    text_data = df[text_columns].apply(lambda x: " ".join([str(i) for i in x]), axis=1)
    text_data = text_data.str.lower()
    vectorizer = CountVectorizer(max_df=0.90, min_df=4, max_features=1000, stop_words=None)
    X_bow = vectorizer.fit_transform(text_data)
    # get_feature_names_out() replaces the removed get_feature_names()
    feature_names = vectorizer.get_feature_names_out()
    X_bow = pd.DataFrame(X_bow.toarray(), columns=feature_names, index=df.index)

    if target_columns:
        y = df[target_columns]
        return X_bow, y

    return X_bow

def convert_lowercase(text):
    '''
    Function to convert the text to lowercase
    Required Input - 
        - text - text string which needs to be lowercased
    Expected Output -
        - text - lower cased text string output
    '''
    return text.lower()

def remove_unwanted_characters(df, columns):        #clean text A
    """
    Remove unwanted characters (including smileys and emojis) from specified columns in a pandas DataFrame.

    The characters $ # & * @ % and the emoji ranges listed below are hard-coded inside the function.

    Parameters:
    df (pd.DataFrame): The input DataFrame.
    columns (list): A list of column names to clean.

    Returns:
    pd.DataFrame: The cleaned DataFrame.
    """
    import re
    unwanted_chars = '[$#&*@%]'
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\u2764\ufe0f" # heart emoji
                           "]+", flags=re.UNICODE)
    for col in columns:
        if col in df.columns:
            df[col] = df[col].apply(lambda x: emoji_pattern.sub(r'', x))
            # regex=True is required: unwanted_chars is a character class, not a literal string
            df[col] = df[col].str.replace(unwanted_chars, '', regex=True)
        else:
            print(f"Column '{col}' does not exist in the DataFrame.")
    return df

def remove_punctuations(text):
    '''
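    Function to remove punctuation from the text
    Required Input - 
        - text - text string 
    Expected Output -
        - text - text string with punctuation removed
    '''
    # Python 3: build a deletion table; str.translate(None, chars) only worked in Python 2
    return text.translate(str.maketrans('', '', string.punctuation))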

def remove_stopwords(text):
    '''
    Function to remove stopwords from the text
    Required Input - 
        - text - text string from which stopwords are removed
    Expected Output -
        - text - list output with stopwords removed
    '''
    # eng_stop must be defined beforehand, e.g. eng_stop = set(stopwords.words('english'))
    return [word for word in text.split() if word not in eng_stop]

def remove_short_words(df, column_names, min_length=3):
    """Remove short words from columns in a pandas DataFrame.

    Parameters:
    df (pandas.DataFrame): The DataFrame to modify.
    column_names (List[str]): A list of column names to modify.
    min_length (int, optional): The minimum length of words to keep. Default is 3.

    Returns:
    pandas.DataFrame: The modified DataFrame with short words removed from specified columns.
    """
    for column_name in column_names:
        df[column_name] = df[column_name].apply(
            lambda x: ' '.join([word for word in x.split() if len(word) >= min_length])
        )
    return df

def convert_stemmer(word):
    '''
    Function to stem a single word
    Required Input - 
        - word - word which needs to be stemmed
    Expected Output -
        - text - word output after stemming
    '''
    porter_stemmer = PorterStemmer()
    return porter_stemmer.stem(word)

def stem_df(df, column_names):
    """
    Perform stemming on a pandas dataframe with multiple columns.

    Parameters:
    df (pd.DataFrame): Dataframe to be stemmed
    column_names (list of str): List of column names to be stemmed; each cell must hold a list of tokens

    Returns:
    pd.DataFrame: Stemmed dataframe
    """
    stemmer = PorterStemmer()
    for col in column_names:
        df[col] = df[col].apply(lambda x: [stemmer.stem(i) for i in x])
    return df

def convert_lemmatizer(word):
    '''
    Function to lemmatize a single word
    Required Input - 
        - word - word which needs to be lemmatized
    Expected Output -
        - word - word output after lemmatizing
    '''
    wordnet_lemmatizer = WordNetLemmatizer()
    return wordnet_lemmatizer.lemmatize(word)
    
def create_tf_idf(df, column, train_df = None, test_df = None,n_features = None):
    '''
    Function to do tf-idf on a pandas dataframe
    Required Input -
        - df = Pandas DataFrame
        - column = name of column containing text
        - train_df(optional) = Train DataFrame
        - test_df(optional) = Test DataFrame
        - n_features(optional) = Maximum number of features needed
    Expected Output -
        - train_tfidf = train tf-idf sparse matrix output
        - test_tfidf = test tf-idf sparse matrix output
        - tfidf_obj = tf-idf model
    '''
    tfidf_obj = TfidfVectorizer(ngram_range=(1,1), stop_words='english', 
                                analyzer='word', max_features = n_features)
    # DataFrame.ix was removed in pandas 1.0; select the column by label instead
    tfidf_text = tfidf_obj.fit_transform(df[column].values)
    
    if train_df is not None:        
        train_tfidf = tfidf_obj.transform(train_df[column].values)
    else:
        train_tfidf = tfidf_text

    test_tfidf = None
    if test_df is not None:
        test_tfidf = tfidf_obj.transform(test_df[column].values)

    return train_tfidf, test_tfidf, tfidf_obj
    
def create_countvector(df, column, train_df = None, test_df = None,n_features = None):
    '''
    Function to do count vectorizer on a pandas dataframe
    Required Input -
        - df = Pandas DataFrame
        - column = name of column containing text
        - train_df(optional) = Train DataFrame
        - test_df(optional) = Test DataFrame
        - n_features(optional) = Maximum number of features needed
    Expected Output -
        - train_cvect = train count vectorized sparse matrix output
        - test_cvect = test count vectorized sparse matrix output
        - cvect_obj = count vectorized model
    '''
    cvect_obj = CountVectorizer(ngram_range=(1,1), stop_words='english', 
                                analyzer='word', max_features = n_features)
    # DataFrame.ix was removed in pandas 1.0; select the column by label instead
    cvect_text = cvect_obj.fit_transform(df[column].values)
    
    if train_df is not None:
        train_cvect = cvect_obj.transform(train_df[column].values)
    else:
        train_cvect = cvect_text
        
    test_cvect = None
    if test_df is not None:
        test_cvect = cvect_obj.transform(test_df[column].values)

    return train_cvect, test_cvect, cvect_obj

#Remove Punctuation
import string 

table = str.maketrans(dict.fromkeys(string.punctuation))
"string. With. Punctuation?".translate(table) 

#
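The cleaning helpers above are usually chained in a fixed order: lowercase, strip punctuation, drop stopwords, drop short words. A minimal, dependency-free sketch of that chain follows. The `STOP_WORDS` set and the `clean_pipeline` name are illustrative assumptions; a real project would more likely use NLTK's English stopword list.

```python
import string

# Tiny illustrative stopword set (an assumption; NLTK's list is far larger)
STOP_WORDS = {"the", "a", "an", "and", "is", "of", "to", "in"}

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_pipeline(text, min_length=3):
    """Lowercase, remove punctuation, then drop stopwords and short words."""
    text = text.lower().translate(PUNCT_TABLE)
    return [w for w in text.split() if w not in STOP_WORDS and len(w) >= min_length]


print(clean_pipeline("The price of Carbon is rising, and FAST!"))
```

Lowercasing happens before the stopword check because the stopword set is lowercase; reversing the order would let "The" slip through.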



>> NLP Text Preprocessing Steps for Machine Learning Algorithms

In [None]:
# 1. Removing HTML tags, punctuation, and special characters
# 2. Converting to Lowercase
# 3. Handling contractions and acronyms 
# 4. Tokenization
# 5. Spell Checking and Correction
# 6. Stopword Removal
# 7. Part-of-speech (POS) tagging
# 8. Named Entity Recognition (NER)
# 9. Stemming or Lemmatization
# 10. Text Vectorization



# 1. Tokenization
# The process of converting a raw text into a sequence of tokens (words, phrases, symbols, etc.) is called tokenization.

from nltk.tokenize import word_tokenize  # needs the NLTK tokenizer data, e.g. nltk.download('punkt')
text = "This is a sample text for tokenization."
tokens = word_tokenize(text)
print(tokens)

# 2. Stopword Removal
# Stopwords are commonly used words in a language, such as “the,” “and,” “a,” etc., that do not add much meaning to the 
# text. Removing these words helps to reduce the noise in the text data.

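from nltk.corpus import stopwords  # needs nltk.download('stopwords')
stop_words = set(stopwords.words("english"))
# Compare in lowercase: NLTK's stopword list is all lowercase, so "This" would otherwise be kept
tokens = [word for word in tokens if word.lower() not in stop_words]
print(tokens)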

# 3. Stemming
# Stemming is the process of reducing a word to its base or root form. For example, the words “jumping”, “jumps”, and 
# “jumped” would all be reduced to “jump” by a stemming algorithm. The main goal of stemming is to reduce different 
# forms of a word to a common base form, which can help in tasks like text classification, sentiment analysis, and 
# information retrieval.

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in tokens]
print(stemmed_words)

# 4. Lemmatization
# Lemmatization is the process of reducing words to their base or dictionary form (known as a lemma) so that they can be 
# analyzed as a single item, rather than multiple different forms. For example, the word “running” can be reduced to 
# its base form “run” through lemmatization.

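from nltk.stem import WordNetLemmatizer  # needs nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
# lemmatize() assumes a noun unless told otherwise; pass pos="v" to reduce verbs such as "running" to "run"
lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]
print(lemmatized_words)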

# 5. Part-of-speech (POS) tagging
# Part-of-speech (POS) tagging is the process of identifying and labeling the part of speech of each word in a sentence, 
# such as a noun, verb, adjective, adverb, etc. POS tagging is useful in various natural language processing tasks like 
# sentiment analysis, text classification, information extraction, and machine translation.

import nltk
# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog"
# Tokenize the sentence into words
words = nltk.word_tokenize(sentence)
# Perform POS tagging
pos_tags = nltk.pos_tag(words)
# Print the POS tags
print(pos_tags)

# 6. Named Entity Recognition (NER)
# Named Entity Recognition (NER) is a natural language processing technique that is used to identify and extract the 
# named entities from a given text. Named entities can be anything like a person, organization, location, product, etc.

import spacy
# Load the English language model (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
# Sample text for NER
text = "Apple is looking at buying U.K. startup for $1 billion"
# Process the text with the language model
doc = nlp(text)
# Extract named entities from the text
for ent in doc.ents:
    print(ent.text, ent.label_)

# 7. Spell Checking and Correction
# Spell checking and correction is the process of identifying and correcting spelling errors in the text. It is an 
# important step in text preprocessing as it can improve the accuracy of natural language processing algorithms that 
# are applied to text data.

!pip install pyspellchecker

from spellchecker import SpellChecker
# initialize spell checker
spell = SpellChecker()
# example sentence with spelling errors
sentence = "Ths sentnce hs spellng erors that nd to b corcted."
# tokenize sentence
tokens = sentence.split()
# iterate over tokens and correct spelling errors
for i in range(len(tokens)):
    # correction() can return None in recent pyspellchecker versions when no candidate is found;
    # in that case the original token is kept
    corrected = spell.correction(tokens[i])
    if corrected is not None and corrected != tokens[i]:
        # replace misspelled token with corrected spelling
        tokens[i] = corrected
# join corrected tokens back into sentence
corrected_sentence = ' '.join(tokens)
print(corrected_sentence)

# 8. Removing HTML tags, punctuation, and special characters
# Removing HTML tags, punctuation, and special characters is necessary for text preprocessing to clean the text data 
# and make it ready for further processing. HTML tags, punctuation, and special characters do not contribute to the 
# meaning of the text and can cause issues during text analysis.

import re
import string

def remove_html_tags(text):
    clean_text = re.sub(r'<.*?>', '', text)
    return clean_text

def remove_punctuation(text):
    clean_text = text.translate(str.maketrans('', '', string.punctuation))
    return clean_text

def remove_special_characters(text):
    # Plain hyphen in 0-9: an en dash is not a range operator, so digits 1 to 8 would be stripped
    clean_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return clean_text

text = "<p>Hello, world!</p>"
clean_text = remove_html_tags(text)
clean_text = remove_punctuation(clean_text)
clean_text = remove_special_characters(clean_text)
print(clean_text)

# 9. Converting to Lowercase
# Lowercasing the text is a common preprocessing step in natural language processing (NLP) to make text data 
# consistent and easier to analyze. This step involves converting all the letters in the text to lowercase so 
# that words that differ only by the case are treated as the same word.

text = "This is a sample TEXT for preprocessing"
text = text.lower()
print(text)

# 10. Text Vectorization
# Text vectorization is the process of transforming raw text into a numerical representation that can be used by 
# machine learning algorithms. This is a crucial step in text preprocessing as most machine learning algorithms work 
# with numerical data. There are several ways to vectorize text, including Bag of Words (BoW), Term Frequency-Inverse 
# Document Frequency (TF-IDF), and Word Embeddings.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Example text corpus
corpus = ["This is the first document.",
          "This document is the second document.",
          "And this is the third one.",
          "Is this the first document?"]

# Vectorize text using BoW representation
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(corpus)

print("BoW representation:")
print(X_bow.toarray())
print("Vocabulary:")
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

# Vectorize text using TF-IDF representation
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(corpus)

print("TF-IDF representation:")
print(X_tfidf.toarray())
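To make the difference between raw counts and TF-IDF weights concrete, the same four-sentence corpus can be vectorized by hand. The sketch below is an illustration only, not scikit-learn's implementation: it assumes the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` and the L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults, and a simplified tokenizer that keeps tokens of two or more word characters.

```python
import math
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


docs = [Counter(tokenize(text)) for text in corpus]
vocab = sorted(set().union(*docs))
n_docs = len(docs)

# Document frequency: in how many documents each term occurs
df = {term: sum(term in doc for doc in docs) for term in vocab}
# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def tfidf_row(doc):
    # Weight each count by its IDF, then scale the row to unit L2 length
    weights = [doc[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


bow = [[doc[term] for term in vocab] for doc in docs]
tfidf = [tfidf_row(doc) for doc in docs]

print(vocab)
print(bow[1])
print([round(w, 3) for w in tfidf[1]])
```

Terms that appear in every document, such as "the", receive the minimum IDF of 1, while rarer terms such as "second" are weighted up.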



#Clean text B

import re
import copy
import string

def clean_text(text, full_clean=False, punctuation=False, numbers=False, lower=False, extra_spaces=False,
               control_characters=False, tokenize_whitespace=False, remove_characters=''):
    r"""
    Clean text using various techniques.

    I took inspiration from the cleantext library `https://github.com/prasanthg3/cleantext`. I did not like the whole
    implementation so I made my own changes.

    Note:
        As in the original cleantext library I will add: stop words removal, stemming and
        negative-positive words removal.

    Arguments:

        text (:obj:`str`):
            String that needs cleaning.

        full_clean (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Remove: punctuation, numbers, extra space, control characters and lower case. This argument is optional and
            it has a default value attributed inside the function.

        punctuation (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Remove punctuation from text. This argument is optional and it has a default value attributed inside
            the function.

        numbers (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Remove digits from text. This argument is optional and it has a default value attributed inside
            the function.

        lower (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Lower case all text. This argument is optional and it has a default value attributed inside the function.

        extra_spaces (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Remove extra spaces - everything beyond one space. This argument is optional and it has a default value
            attributed inside the function.

        control_characters (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Remove characters like `\n`, `\t` etc. This argument is optional and it has a default value attributed
            inside the function.

        tokenize_whitespace (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Return a list of tokens split on whitespace. This argument is optional and it has a default value
            attributed inside the function.

        remove_characters (:obj:`str`, `optional`, defaults to :obj:`''`):
            Remove defined characters from text. This argument is optional and it has a default value attributed
            inside the function.

    Returns:

        :obj:`str`: Clean string, or a :obj:`list` of tokens if `tokenize_whitespace` is set.

    Raises:

        ValueError: If `text` is not of type string.

        ValueError: If `remove_characters` is not of type string.

    """

    if not isinstance(text, str):
        # `text` is not type of string
        raise ValueError("`text` is not of type str!")

    if not isinstance(remove_characters, str):
        # remove characters need to be a string
        raise ValueError("`remove_characters` needs to be a string!")

    # all control characters like `\t` `\n` `\r` etc.
    # Stack Overflow: https://stackoverflow.com/a/8115378/11281368
    control_characters_list = ''.join([chr(char) for char in range(1, 32)])

    # define control characters table
    table_control_characters = str.maketrans(dict.fromkeys(control_characters_list))

    # remove punctuation table
    table_punctuation = str.maketrans(dict.fromkeys(string.punctuation))

    # remove numbers table
    table_digits = str.maketrans(dict.fromkeys('0123456789'))

    # remove certain characters table
    table_remove_characters = str.maketrans(dict.fromkeys(remove_characters))

    # make a copy of text to make sure it doesn't affect original text
    cleaned = copy.deepcopy(text)

    if full_clean or punctuation:
        # remove punctuation
        cleaned = cleaned.translate(table_punctuation)

    if full_clean or numbers:
        # remove numbers
        cleaned = cleaned.translate(table_digits)

    if full_clean or extra_spaces:
        # remove extra spaces - also removes control characters
        # Stack Overflow https://stackoverflow.com/a/2077906/11281368
        cleaned = re.sub(r'\s+', ' ', cleaned).strip()

    if full_clean or lower:
        # lowercase
        cleaned = cleaned.lower()

    if control_characters:
        # remove control characters
        cleaned = cleaned.translate(table_control_characters)

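    if remove_characters:
        # remove these characters from text (done before tokenizing, while `cleaned` is still a string)
        cleaned = cleaned.translate(table_remove_characters)

    if tokenize_whitespace:
        # tokenizes text on whitespace; from here on `cleaned` is a list of tokens
        cleaned = re.split(r'\s+', cleaned)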

    return cleaned
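`clean_text` relies on translation tables built with `str.maketrans(dict.fromkeys(...))`: every listed character maps to `None`, so `str.translate` deletes it. A minimal standalone sketch of that mechanism follows; the `strip_chars` name and the sample string are illustrative assumptions.

```python
import re
import string


def strip_chars(text, chars):
    """Delete every character in `chars` from `text` using a translation table."""
    table = str.maketrans(dict.fromkeys(chars))  # each character maps to None (deleted)
    return text.translate(table)


raw = "Hello,\tWorld!  It's 2023.\n"
no_punct = strip_chars(raw, string.punctuation)
no_digits = strip_chars(no_punct, string.digits)
# Collapse runs of whitespace (including tabs and newlines) into single spaces
collapsed = re.sub(r"\s+", " ", no_digits).strip().lower()

print(repr(collapsed))
```

Note that deleting punctuation also removes apostrophes, so "It's" becomes "its"; contractions need separate handling if that matters for the task.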


>> Sentiment Analysis of Tweets (Unsupervised)

In [None]:
import nltk 
nltk.download('vader_lexicon')
import numpy as np
import pandas as pd
import re
import string
from sklearn.cluster import KMeans
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer 

# Load data from file or API
tweets = pd.read_csv(r'C:\Users\Cornel\Downloads\tweets.csv')

# tweets = tweets['text'] 
# Preprocess text data
def preprocess_text(text):
    text = re.sub(r"http\S+", "", text) # remove URLs
    text = re.sub(r'@[^\s]+', '', text) # remove usernames
    text = re.sub(r'#', '', text) # remove the '#' symbol but keep the hashtag word
    text = re.sub(r'\d+', '', text) # remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation)) # remove punctuation
    text = text.lower() # convert to lowercase
    return text

tweets['text_clean'] = tweets['text'].apply(preprocess_text)

# Extract features
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(tweets['text_clean'])

# Cluster tweets using K-means
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(X)

# Analyze sentiment of each cluster using a lexicon-based approach
sia = SentimentIntensityAnalyzer()
cluster_sentiment = []
for i in range(5):
    cluster_text = tweets[kmeans.labels_ == i]['text_clean']
    sentiment_scores = [sia.polarity_scores(text)['compound'] for text in cluster_text]
    cluster_sentiment.append(np.mean(sentiment_scores))

# Visualize results
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 5))
ax = fig.add_axes([0,0,1,1])
clusters = ['Cluster 1', 'Cluster 2', 'Cluster 3', 'Cluster 4', 'Cluster 5']
ax.bar(clusters, cluster_sentiment)
plt.title('Sentiment Analysis of Tweets')
plt.xlabel('Cluster')
plt.ylabel('Sentiment Score')
plt.show() 
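The bar chart above plots raw VADER `compound` averages, which lie between -1 and 1. A common convention, suggested by the VADER authors, treats scores of at least 0.05 as positive and at most -0.05 as negative. A minimal sketch of turning per-cluster scores into labels under that convention follows; the function names and the sample scores are hypothetical and are not taken from the tweet data.

```python
def label_from_compound(score, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


def mean_or_none(scores):
    """Average a list of scores; return None for an empty cluster."""
    return sum(scores) / len(scores) if scores else None


# Hypothetical per-cluster compound scores (not real tweet data)
cluster_scores = {0: [0.6, 0.2], 1: [-0.4, -0.1], 2: [0.02, -0.03], 3: []}

labels = {}
for cluster, scores in cluster_scores.items():
    mean = mean_or_none(scores)
    labels[cluster] = "empty" if mean is None else label_from_compound(mean)

print(labels)
```

Handling the empty cluster explicitly avoids averaging an empty list, which would otherwise produce a NaN bar in the plot.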


>> NLP Text Classification

In [None]:
import sys 
import os 
import pickle 
import pandas as pd 
import numpy as np 

from argparse import ArgumentParser
from gensim.models import KeyedVectors
from gensim.models.doc2vec import TaggedDocument, Doc2Vec

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import classification_report

from nltk import sent_tokenize
from nltk import pos_tag
from nltk import map_tag
from nltk import word_tokenize
from nltk.corpus import stopwords

# Load NLTK's English stop-words list (kept as a list: recent scikit-learn versions reject a set for stop_words)
stop_words = stopwords.words('english')


#
# embeddings vector representations
#

def tag_pos(x):
    sentences = sent_tokenize(x)  # pandas already yields str in Python 3, so no decode is needed
    sents = []
    for s in sentences:
        text = word_tokenize(s)
        pos_tagged = pos_tag(text)
        simplified_tags = [
            (word, map_tag('en-ptb', 'universal', tag)) for word, tag in pos_tagged]
        sents.append(simplified_tags)
    return sents


def post_tag_documents(data_df):
    x_data = []
    y_data = []
    # DataFrame.as_matrix() was removed from pandas; to_numpy() is the replacement
    plots = data_df['plot'].to_numpy().tolist()
    total = len(plots)
    genres = data_df.drop(['plot', 'title', 'plot_lang'], axis=1).to_numpy()
    for i in range(len(plots)):
        sents = tag_pos(plots[i])
        x_data.append(sents)
        y_data.append(genres[i])
        if (i + 1) % 5000 == 0:
            print(i + 1, "/", total)

    return x_data, y_data


def word2vec(x_data, pos_filter):

    print("Loading GoogleNews-vectors-negative300.bin")
    google_vecs = KeyedVectors.load_word2vec_format(
        'GoogleNews-vectors-negative300.bin', binary=True, limit=200000)

    print("Considering only", pos_filter)
    print("Averaging Word Embeddings...")
    x_data_embeddings = []
    total = len(x_data)
    processed = 0
    for tagged_plot in x_data:
        count = 0
        doc_vector = np.zeros(300)
        for sentence in tagged_plot:
            for tagged_word in sentence:
                if tagged_word[1] in pos_filter:
                    try:
                        doc_vector += google_vecs[tagged_word[0]]
                        count += 1
                    except KeyError:
                        continue

        # Average only when at least one word was found. Documents with no match keep a zero
        # vector instead of being skipped, so the rows stay aligned with y_data.
        if count > 0:
            doc_vector /= count

        x_data_embeddings.append(doc_vector)

        processed += 1
        if processed % 10000 == 0:
            print(processed, "/", total)

    return np.array(x_data_embeddings)


def doc2vec(data_df):
    data = []
    print("Building TaggedDocuments")
    rows = data_df[['title', 'plot']].to_numpy().tolist()
    total = len(rows)
    processed = 0
    for x in rows:
        label = ["_".join(x[0].split())]
        words = []
        sentences = sent_tokenize(x[1])
        for s in sentences:
            words.extend([w.lower() for w in word_tokenize(s)])
        doc = TaggedDocument(words, label)
        data.append(doc)

        processed += 1
        if processed % 10000 == 0:
            print(processed, "/", total)

    # gensim 4.x API: `size` became `vector_size`, and train() needs total_examples and epochs
    model = Doc2Vec(min_count=1, window=10, vector_size=300, sample=1e-4, negative=5, workers=2)
    print("Building Vocabulary")
    model.build_vocab(data)

    # Train for 20 epochs in one call and let gensim decay the learning rate itself;
    # the manual loop subtracted 0.002 per epoch, which drives the default alpha of 0.025 negative
    print("Training for 20 epochs")
    model.train(data, total_examples=model.corpus_count, epochs=20)

    # Build doc2vec vectors
    x_data = []
    y_data = []
    genres = data_df.drop(['title', 'plot', 'plot_lang'], axis=1).to_numpy()
    names = data_df[['title']].to_numpy().tolist()
    for i in range(len(names)):
        name = names[i][0]
        label = "_".join(name.split())
        # model.dv replaces the older model.docvecs
        x_data.append(model.dv[label])
        y_data.append(genres[i])

    return np.array(x_data), np.array(y_data)


#
# train classifiers and argument handling
#

def train_test_svm(x_data, y_data, genres):

    stratified_split = StratifiedShuffleSplit(n_splits=2, test_size=0.33)
    for train_index, test_index in stratified_split.split(x_data, y_data):
        x_train, x_test = x_data[train_index], x_data[test_index]
        y_train, y_test = y_data[train_index], y_data[test_index]

    """
    print("LinearSVC")
    pipeline = Pipeline([
        ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
    ])
    parameters = {
        "clf__estimator__C": [0.01, 0.1, 1],
        "clf__estimator__class_weight": ['balanced', None],
    }
    grid_search(x_train, y_train, x_test, y_test, genres, parameters, pipeline)

    print("LogisticRegression")
    pipeline = Pipeline([
        ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=1)),
    ])
    parameters = {
        "clf__estimator__C": [0.01, 0.1, 1],
        "clf__estimator__class_weight": ['balanced', None],
    }
    grid_search(x_train, y_train, x_test, y_test, genres, parameters, pipeline)
    """

    print("SVC")  # the active pipeline uses SVC with rbf/poly kernels, not LinearSVC
    pipeline = Pipeline([
        ('clf', OneVsRestClassifier(SVC(), n_jobs=1)),
    ])
    """
    parameters = {
        "clf__estimator__C": [0.01, 0.1, 1],
        "clf__estimator__class_weight": ['balanced', None],
    }
    """
    parameters = [

        {'clf__estimator__kernel': ['rbf'],
         'clf__estimator__gamma': [1e-3, 1e-4],
         'clf__estimator__C': [1, 10]
        },

        {'clf__estimator__kernel': ['poly'],
         'clf__estimator__C': [1, 10]
        }
         ]

    grid_search(x_train, y_train, x_test, y_test, genres, parameters, pipeline)


def grid_search(train_x, train_y, test_x, test_y, genres, parameters, pipeline):
    grid_search_tune = GridSearchCV(pipeline, parameters, cv=2, n_jobs=3, verbose=10)
    grid_search_tune.fit(train_x, train_y)

    print()
    print("Best parameters set:")
    print(grid_search_tune.best_estimator_.steps)
    print()

    # measuring performance on test set
    print("Applying best classifier on test data:")
    best_clf = grid_search_tune.best_estimator_
    predictions = best_clf.predict(test_x)

    print(classification_report(test_y, predictions, target_names=genres))


def parse_arguments():
    arg_parser = ArgumentParser()

    arg_parser.add_argument(
        '--clf', dest='classifier', choices=['nb', 'linearSVC', 'logit'])

    arg_parser.add_argument(
        '--vectors', dest='vectors', type=str, choices=['tfidf', 'word2vec', 'doc2vec'])

    return arg_parser, arg_parser.parse_args()


def main():
    args_parser, args = parse_arguments()

    if len(sys.argv) == 1:
        args_parser.print_help()
        sys.exit(1)

    # load pre-processed data
    print("Loading already processed training data")
    data_df = pd.read_csv("movies_genres_en.csv", delimiter='\t')
    # all the list of genres to be used by the classification report
    genres = list(data_df.drop(['title', 'plot', 'plot_lang'], axis=1).columns.values)

    if args.vectors == 'tfidf':

        # split the data, leave 1/3 out for testing
        # DataFrame.as_matrix() was removed from pandas; to_numpy() is the replacement
        data_x = data_df[['plot']].to_numpy()
        data_y = data_df.drop(['title', 'plot', 'plot_lang'], axis=1).to_numpy()
        stratified_split = StratifiedShuffleSplit(n_splits=2, test_size=0.33)
        for train_index, test_index in stratified_split.split(data_x, data_y):
            x_train, x_test = data_x[train_index], data_x[test_index]
            y_train, y_test = data_y[train_index], data_y[test_index]

        # transform matrix of plots into lists to pass to a TfidfVectorizer
        train_x = [x[0].strip() for x in x_train.tolist()]
        test_x = [x[0].strip() for x in x_test.tolist()]

        if args.classifier == 'nb':
            # MultinomialNB: Multi-Class OneVsRestClassifier
            pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(stop_words=stop_words)),
                ('clf', OneVsRestClassifier(MultinomialNB(
                    fit_prior=True, class_prior=None))),
            ])
            parameters = {
                'tfidf__max_df': (0.25, 0.5, 0.75),
                'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
                'clf__estimator__alpha': (1e-2, 1e-3)
            }
            grid_search(train_x, y_train, test_x, y_test, genres, parameters, pipeline)
            exit(-1)

        if args.classifier == 'linearSVC':
            # LinearSVC
            pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(stop_words=stop_words)),
                ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
            ])
            parameters = {
                'tfidf__max_df': (0.25, 0.5, 0.75),
                'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
                "clf__estimator__C": [0.01, 0.1, 1],
                "clf__estimator__class_weight": ['balanced', None],
            }
            grid_search(train_x, y_train, test_x, y_test, genres, parameters, pipeline)
            exit(-1)

        if args.classifier == 'logit':
            # LogisticRegression
            pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(stop_words=stop_words)),
                ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=1)),
            ])
            parameters = {
                'tfidf__max_df': (0.25, 0.5, 0.75),
                'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
                "clf__estimator__C": [0.01, 0.1, 1],
                "clf__estimator__class_weight": ['balanced', None],
            }
            grid_search(train_x, y_train, test_x, y_test, genres, parameters, pipeline)
            exit(-1)

    if args.vectors == 'word2vec':
        if os.path.exists("pos_tagged_data.dat"):
            print("Loading Part-of-Speech tagged data...")
            with open('pos_tagged_data.dat', 'rb') as f:
                data = pickle.load(f)
                x_data, y_data = data[0], data[1]
        else:
            print("Part-of-Speech tagging...")
            x_data, y_data = post_tag_documents(data_df)
            # pickle writes bytes, so the file must be opened in binary mode
            with open('pos_tagged_data.dat', 'wb') as f:
                pickle.dump((x_data, y_data), f)

        pos_filter = ['NOUN', 'ADJ']

        # get embeddings for train and test data
        x_embeddings = word2vec(x_data, pos_filter)

        # need to transform back into numpy array to apply StratifiedShuffleSplit
        y_data = np.array(y_data)

        train_test_svm(x_embeddings, y_data, genres)
        exit(-1)

    if args.vectors == 'doc2vec':
        if os.path.exists("doc2vec_data.dat"):
            print("Loading Doc2Vec vectors")
            with open('doc2vec_data.dat', 'rb') as f:
                data = pickle.load(f)
                x_data, y_data = data[0], data[1]
        else:
            print("Generating Doc2Vec vectors")
            x_data, y_data = doc2vec(data_df)
            # pickle writes bytes, so the file must be opened in binary mode
            with open('doc2vec_data.dat', 'wb') as f:
                pickle.dump((x_data, y_data), f)

        train_test_svm(x_data, y_data, genres)
        exit(-1)


if __name__ == "__main__":
    main()
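
The word2vec and doc2vec branches above share one caching pattern: load the pickled result if the file exists, otherwise compute it and write it out. A minimal sketch of that pattern follows. The helper name `load_or_compute`, the stand-in `expensive_step`, and the temporary file name are hypothetical and not part of the original script. Note that pickle files are opened in binary mode for both reading and writing.

```python
import os
import pickle
import tempfile


def load_or_compute(cache_path, compute):
    """Return the cached result if cache_path exists; otherwise compute and cache it."""
    if os.path.exists(cache_path):
        # Pickle data is binary, so read with 'rb'.
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute()
    # Write with 'wb' for the same reason; text mode fails on Python 3.
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def expensive_step():
    # Stand-in for post_tag_documents(data_df) or doc2vec(data_df).
    calls.append(1)
    return (["doc one", "doc two"], [0, 1])


cache_file = os.path.join(tempfile.mkdtemp(), "demo_cache.dat")
first = load_or_compute(cache_file, expensive_step)   # computes and writes the cache
second = load_or_compute(cache_file, expensive_step)  # reads the cache, no recompute
x_data, y_data = second
```

Passing the computation as a callable keeps the expensive step from running when the cache is already present.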

### Recommendation Systems (Recsys)

In [None]:
import pandas as pd
import numpy as np
from scipy import sparse
from lightfm import LightFM
from sklearn.metrics.pairwise import cosine_similarity

def create_interaction_matrix(df,user_col, item_col, rating_col, norm= False, threshold = None):
    '''
    Function to create an interaction matrix dataframe from transactional type interactions
    Required Input -
        - df = Pandas DataFrame containing user-item interactions
        - user_col = column name containing user's identifier
        - item_col = column name containing item's identifier
        - rating_col = column name containing user feedback on interaction with a given item
        - norm (optional) = True if a normalization of ratings is needed
        - threshold (required if norm = True) = value above which the rating is favorable
    Expected output - 
        - Pandas dataframe with user-item interactions ready to be fed in a recommendation algorithm
    '''
    interactions = df.groupby([user_col, item_col])[rating_col] \
            .sum().unstack().reset_index(). \
            fillna(0).set_index(user_col)
    if norm:
        interactions = interactions.applymap(lambda x: 1 if x > threshold else 0)
    return interactions

def create_user_dict(interactions):
    '''
    Function to create a user dictionary mapping each user ID to its row position in the interaction dataset
    Required Input - 
        interactions - dataset created by create_interaction_matrix
    Expected Output -
        user_dict - Dictionary type output containing user_id as key and interaction_index as value
    '''
    # enumerate yields (row position, user_id) pairs in index order
    user_dict = {user_id: position for position, user_id in enumerate(interactions.index)}
    return user_dict
    
def create_item_dict(df,id_col,name_col):
    '''
    Function to create an item dictionary based on their item_id and item name
    Required Input - 
        - df = Pandas dataframe with Item information
        - id_col = Column name containing unique identifier for an item
        - name_col = Column name containing name of the item
    Expected Output -
        item_dict = Dictionary type output containing item_id as key and item_name as value
    '''
    # zip pairs the two columns positionally, so this works for any DataFrame index
    item_dict = dict(zip(df[id_col], df[name_col]))
    return item_dict

def runMF(interactions, n_components=30, loss='warp', k=15, epoch=30,n_jobs = 4):
    '''
    Function to run matrix-factorization algorithm
    Required Input -
        - interactions = dataset created by create_interaction_matrix
        - n_components = number of embeddings you want to create to define Item and user
        - loss = loss function; the default is warp, other options are logistic, bpr and warp-kos
        - k = for warp-kos training, the k-th positive example is selected from the sampled positives
        - epoch = number of epochs to run 
        - n_jobs = number of cores used for execution 
    Expected Output  -
        Model - Trained model
    '''
    x = sparse.csr_matrix(interactions.values)
    model = LightFM(no_components= n_components, loss=loss,k=k)
    model.fit(x,epochs=epoch,num_threads = n_jobs)
    return model

def sample_recommendation_user(model, interactions, user_id, user_dict, 
                               item_dict,threshold = 0,nrec_items = 10, show = True):
    '''
    Function to produce user recommendations
    Required Input - 
        - model = Trained matrix factorization model
        - interactions = dataset used for training the model
        - user_id = user ID for which we need to generate recommendation
        - user_dict = Dictionary type input containing user_id as key and interaction_index as value
        - item_dict = Dictionary type input containing item_id as key and item_name as value
        - threshold = value above which the rating is favorable in new interaction matrix
        - nrec_items = Number of output recommendation needed
    Expected Output - 
        - Prints list of items the given user has already bought
        - Prints list of N recommended items  which user hopefully will be interested in
    '''
    n_users, n_items = interactions.shape
    user_x = user_dict[user_id]
    scores = pd.Series(model.predict(user_x,np.arange(n_items)))
    scores.index = interactions.columns
    scores = list(pd.Series(scores.sort_values(ascending=False).index))
    
    known_items = list(pd.Series(interactions.loc[user_id,:] \
                                 [interactions.loc[user_id,:] > threshold].index) \
                                 .sort_values(ascending=False))
    
    scores = [x for x in scores if x not in known_items]
    return_score_list = scores[0:nrec_items]
    known_items = list(pd.Series(known_items).apply(lambda x: item_dict[x]))
    scores = list(pd.Series(return_score_list).apply(lambda x: item_dict[x]))
    if show:
        print("Known Likes:")
        counter = 1
        for i in known_items:
            print(str(counter) + '- ' + i)
            counter+=1

        print("\n Recommended Items:")
        counter = 1
        for i in scores:
            print(str(counter) + '- ' + i)
            counter+=1
    return return_score_list
    

def sample_recommendation_item(model,interactions,item_id,user_dict,item_dict,number_of_user):
    '''
    Function to produce a list of top N interested users for a given item
    Required Input -
        - model = Trained matrix factorization model
        - interactions = dataset used for training the model
        - item_id = item ID for which we need to generate recommended users
        - user_dict = Dictionary type input containing user_id as key and interaction_index as value
        - item_dict = Dictionary type input containing item_id as key and item_name as value
        - number_of_user = Number of users needed as an output
    Expected Output -
        - user_list = List of recommended users 
    '''
    n_users, n_items = interactions.shape
    # column position of the item; get_loc does not rely on the columns being sorted
    item_x = interactions.columns.get_loc(item_id)
    scores = pd.Series(model.predict(np.arange(n_users), np.repeat(item_x, n_users)))
    user_list = list(interactions.index[scores.sort_values(ascending=False).head(number_of_user).index])
    return user_list 


def create_item_emdedding_distance_matrix(model,interactions):
    '''
    Function to create item-item embedding similarity matrix
    Required Input -
        - model = Trained matrix factorization model
        - interactions = dataset used for training the model
    Expected Output -
        - item_emdedding_distance_matrix = Pandas dataframe containing cosine similarity matrix b/w items (higher = more similar, despite the name)
    '''
    df_item_norm_sparse = sparse.csr_matrix(model.item_embeddings)
    similarities = cosine_similarity(df_item_norm_sparse)
    item_emdedding_distance_matrix = pd.DataFrame(similarities)
    item_emdedding_distance_matrix.columns = interactions.columns
    item_emdedding_distance_matrix.index = interactions.columns
    return item_emdedding_distance_matrix

def item_item_recommendation(item_emdedding_distance_matrix, item_id, 
                             item_dict, n_items = 10, show = True):
    '''
    Function to create item-item recommendation
    Required Input - 
        - item_emdedding_distance_matrix = Pandas dataframe containing cosine similarity matrix b/w items
        - item_id  = item ID for which we need to generate recommended items
        - item_dict = Dictionary type input containing item_id as key and item_name as value
        - n_items = Number of items needed as an output
    Expected Output -
        - recommended_items = List of recommended items
    '''
    recommended_items = list(pd.Series(item_emdedding_distance_matrix.loc[item_id,:]. \
                                  sort_values(ascending = False).head(n_items+1). \
                                  index[1:n_items+1]))
    if show:
        print("Item of interest :{0}".format(item_dict[item_id]))
        print("Item similar to the above item:")
        counter = 1
        for i in recommended_items:
            print(str(counter) + '- ' +  item_dict[i])
            counter+=1
    return recommended_items
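
The item-item step above ranks items by the cosine similarity of their embeddings, sorts in descending order, and drops the first entry on the assumption that it is the query item itself. The sketch below shows the same idea on hand-made vectors with NumPy only. The function name `top_similar_items` and the toy embeddings are hypothetical; a trained model's `item_embeddings` would take their place. It assumes no embedding is the zero vector.

```python
import numpy as np


def top_similar_items(item_embeddings, item_ids, query_id, n_items=2):
    """Rank items by cosine similarity to query_id, excluding the query itself."""
    emb = np.asarray(item_embeddings, dtype=float)
    # Normalise each row to unit length so a dot product equals cosine similarity.
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    query_pos = item_ids.index(query_id)
    sims = unit @ unit[query_pos]
    # Sort descending: a higher similarity means a closer item.
    order = np.argsort(-sims)
    ranked = [item_ids[i] for i in order if i != query_pos]
    return ranked[:n_items]


# Hypothetical 2-dimensional embeddings for four items.
toy_ids = ["A", "B", "C", "D"]
toy_embeddings = [
    [1.0, 0.0],   # A
    [0.9, 0.1],   # B: nearly parallel to A
    [0.0, 1.0],   # C: orthogonal to A
    [-1.0, 0.0],  # D: opposite to A
]
similar_to_a = top_similar_items(toy_embeddings, toy_ids, "A", n_items=2)
```

Excluding the query by position, rather than slicing off the first result, keeps the ranking correct even if another item ties with the query at similarity 1.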

### OpenAI

> Pandas AI

In [None]:
!pip install -q pandasai
# pip install --user pandasai

import pandas as pd 
import numpy as np 
from pandasai import SmartDataframe 
from pandasai.llm import OpenAI, GooglePalm, GoogleVertexAI, Falcon, AzureOpenAI, HuggingFaceTextGen, Starcoder   #different LLMs 


# Pandas AI is an extension to the pandas library that uses generative AI models (OpenAI's in this example).
# It lets you query a dataframe with a plain-text prompt, which the LLM translates into code that runs against the data.


df = pd.read_csv('data.csv')
llm = OpenAI("add key here")
smart_df = SmartDataframe(df, config={"llm": llm})
smart_df.chat('Which are the countries with GDP greater than 3000000000000?')
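
For comparison, the question in the prompt above corresponds to an ordinary boolean filter in plain pandas. The sketch below is not what Pandas AI produces internally; it only shows the kind of query the prompt describes. The toy data and the column names `country` and `gdp` are assumptions, since the contents of `data.csv` are not shown.

```python
import pandas as pd

# Hypothetical data; the column names "country" and "gdp" are assumptions.
toy_df = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C"],
    "gdp": [4_500_000_000_000, 2_100_000_000_000, 3_200_000_000_000],
})

# Boolean mask selects rows whose GDP exceeds the threshold from the prompt.
gdp_threshold = 3_000_000_000_000
rich_countries = toy_df.loc[toy_df["gdp"] > gdp_threshold, "country"].tolist()
```

Running the explicit filter on a small sample is a cheap way to sanity-check an answer returned by the LLM.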


> OpenAI

In [None]:
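# the cells below use the pre-1.0 interface (openai.Completion, openai.ChatCompletion, openai.Image)
!pip install -q "openai<1"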


# import the openai module 
import openai 
  
# set the API key used to authenticate requests 
openai.api_key = '<API_KEY>'

In [None]:


# function that takes the prompt string and returns a list of completions 
def comp(PROMPT, MaxToken=50, outputs=3): 
    # the Completion endpoint handles free-form text tasks 
    # (this is the pre-1.0 openai interface) 
    response = openai.Completion.create( 
        # text-davinci-003 is a GPT-3 family model; OpenAI has since retired it, 
        # so substitute a completions model that is currently available 
        model="text-davinci-003", 
        # passing the user input  
        prompt=PROMPT, 
        # each generated output can have at most "max_tokens" tokens  
        max_tokens=MaxToken, 
        # number of outputs generated in one call 
        n=outputs 
    ) 
    # collect the stripped text of every generated output 
    output = [k['text'].strip() for k in response['choices']] 
    return output



PROMPT = """Write a story to inspire greatness, with a rabbit as the antagonist and a turtle as the protagonist.  
Let the antagonist and protagonist compete against each other for a common goal.  
The story should have at most 3 lines."""
comp(PROMPT, MaxToken=3000, outputs=3) 
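
The `comp` function above calls `openai.Completion.create`, which belongs to the pre-1.0 `openai` package; from version 1.0 onward the same request goes through a client object. The sketch below shows that shape under stated assumptions: the function name `comp_v1` is hypothetical, the default model name `gpt-3.5-turbo-instruct` is an assumption that should be checked against the currently available models, and a small stand-in client replaces the real one so the example runs without a network call or API key.

```python
from types import SimpleNamespace


def comp_v1(client, prompt, model="gpt-3.5-turbo-instruct", max_tokens=50, outputs=3):
    """Return the stripped text of each completion using an openai>=1.0 style client."""
    response = client.completions.create(
        model=model,
        prompt=prompt,
        max_tokens=max_tokens,
        n=outputs,
    )
    # In the 1.x library the response is an object, so fields are attributes.
    return [choice.text.strip() for choice in response.choices]


# Stand-in client so the sketch runs offline; with the real library you would
# pass openai.OpenAI(api_key="<API_KEY>") instead.
def _fake_create(model, prompt, max_tokens, n):
    choices = [SimpleNamespace(text=f"  draft {i} for: {prompt} ") for i in range(n)]
    return SimpleNamespace(choices=choices)


fake_client = SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
drafts = comp_v1(fake_client, "hello", outputs=2)
```

Taking the client as a parameter keeps the API key out of the function and makes the call easy to exercise with a stub.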

In [None]:
# Chat 


# function that takes a list of message dicts and returns the first generated reply 
def chat(MSGS, MaxToken=50, outputs=3): 
    # We use the Chat Completion endpoint for chat-like inputs 
    response = openai.ChatCompletion.create( 
        # model used here is the one behind ChatGPT 
        # Models listed for this endpoint at the time of writing:  
        # gpt-4, gpt-4-0314, gpt-4-32k, gpt-4-32k-0314,  
        # gpt-3.5-turbo, gpt-3.5-turbo-0301 
        model="gpt-3.5-turbo", 
        messages=MSGS, 
        # maximum number of tokens generated per reply; 
        # "gpt-3.5-turbo" has a 4096-token limit shared by prompt and reply 
        max_tokens=MaxToken, 
        # number of output variations to be generated by the model 
        n=outputs, 
    ) 
    # only the first of the n variations is returned; 
    # iterate over response.choices to get all of them 
    return response.choices[0].message 
  
# Messages must consist of a collection of message objects,  
# each of which includes a role (either "system," "user," or "assistant")  
# and a content field for the message's actual text.  
# Conversations might last only 1 message or span several pages. 
MSGS = [ 
        {"role": "system", "content": "<message generated by system>"}, 
        {"role": "user", "content": "<message generated by user>"}, 
        {"role": "assistant", "content": "<message generated by assistant>"} 
    ]
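
Because each request carries the whole conversation, a multi-turn chat is built by appending every reply to the message list before sending the next user message. A minimal sketch of that bookkeeping follows, using the three roles named above. The helper `add_message` and the sample contents are hypothetical, and the assistant reply is hard-coded here instead of coming from the API.

```python
VALID_ROLES = {"system", "user", "assistant"}


def add_message(history, role, content):
    """Return a new message list with one more message appended."""
    if role not in VALID_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    return history + [{"role": role, "content": content}]


conversation = [{"role": "system", "content": "You are a concise assistant."}]
conversation = add_message(conversation, "user", "Name one clustering algorithm.")
# After the API returns, store the reply so the next call sees the full context.
conversation = add_message(conversation, "assistant", "K-Means.")
conversation = add_message(conversation, "user", "Name another.")
roles = [m["role"] for m in conversation]
```

Returning a new list instead of mutating the old one makes it easy to branch a conversation from any earlier point.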

In [None]:
# Image



# importing other libraries 
import requests 
from PIL import Image 
from io import BytesIO




# function for text-to-image generation  
# using create endpoint of DALL-E API 
# function takes in a string argument 
def generate(text): 
    res = openai.Image.create( 
        # text describing the generated image 
        prompt=text, 
        # number of images to generate  
        n=1, 
        # size of each generated image 
        size="256x256", 
    ) 
    # returning the URL of one image as  
    # we are generating only one image 
    return res["data"][0]["url"]



# prompt describing the desired image 
text = "batman art in red and blue color"
# calling the custom function "generate" 
# saving the output in "url1" 
url1 = generate(text) 
# using requests library to download the image bytes 
response = requests.get(url1) 
# wrapping the bytes in BytesIO so the PIL Image module can open and display the image 
Image.open(BytesIO(response.content))
