<a href="https://www.kaggle.com/code/peremartramanonellas/template-with-sklearn-to-solve-any-class-problem?scriptVersionId=104390089" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Template to solve Regression & Classification Problems with minimum effort using SKLearn

I made this template to get a **fast first solution** for regression & classification problems with .csv or tabular datasets. 

It's mainly based on two functions. 
* **data_transform**: This one is responsible for modifying the dataset and transforming the data to make it usable for the model. 
 * *Null values*: For now, they are only replaced with the most frequent value in the column. 
 * *Non-numeric values*: For now, the function stops processing. In this version it only accepts numeric columns. 
 * *Normalize data*: You can indicate a maximum standard deviation, and the function normalizes the columns with a std bigger than the one indicated. 
 
* **create_model**: I use RandomizedSearchCV to test different hyperparameters, and SKLearn selects the best combination. 
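
The null-value rule in the list above (replace nulls with the most frequent value of the column) can be sketched on its own. This is a minimal illustration on a hypothetical toy dataframe, not the template's full function; the names `toy` and `filled` are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({"a": [1.0, 1.0, np.nan, 3.0],
                    "b": [np.nan, 2.0, 2.0, 5.0]})

# value_counts() sorts by frequency, so index[0] is the most frequent value.
filled = toy.apply(lambda col: col.fillna(col.value_counts().index[0]))

print(filled)
```

Filling with the most frequent value is quick, but it can distort continuous columns; the median or mean are common alternatives to consider.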

Of course, it's impossible to obtain the best solution to every classification & regression problem with this template, but it's a simple place to start, study the results, and continue with more advanced tuning. 

My intention is to improve the data_transform function, so that it not only transforms data but also offers a fast way to obtain information about the datasets and save time. 

Feel free to Copy & Edit, or fork this Notebook and use it at your convenience; just please upvote the notebook if you like it and use it.



## The dataset
I used the Credit Card Fraud Detection dataset. 
https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
It's a highly imbalanced dataset with a small representation of fraud cases. In this kind of dataset, accuracy is a useless metric. I did nothing to solve the imbalance problem. 
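
To see why accuracy misleads on imbalanced data, here is a small sketch with synthetic labels (the 10,000 samples and 20 positives are invented for illustration, they are not the real dataset counts). A "model" that never predicts fraud still gets a very high accuracy while detecting nothing.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, hypothetical labels: 10,000 samples, only 20 positives (0.2%).
y_true = np.zeros(10_000, dtype=int)
y_true[:20] = 1

# A "model" that always predicts the majority class (no fraud).
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"accuracy: {accuracy:.4f}")  # 0.9980
print(f"recall:   {recall:.4f}")    # 0.0000
```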

In future notebooks I will try different datasets. 

# Import libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV

# Support functions. 

In [None]:
def evaluate_regression(y_true, y_preds):
    from sklearn.metrics import r2_score

    # r2_score expects the true values first, then the predictions.
    # The result is stored under a different name to avoid shadowing the function.
    r2 = r2_score(y_true, y_preds)

    metric_dict = {"r2_score": round(r2, 2)}
    print("KPIs-------------------------------------")
    print(f"r2: {r2 * 100:.2f}%")
    print("KPIs-------------------------------------")
    return metric_dict

In [None]:
def evaluate(y_true, y_preds):
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    accuracy = accuracy_score(y_true, y_preds)
    precision = precision_score(y_true, y_preds)
    recall = recall_score(y_true, y_preds)
    f1 = f1_score(y_true, y_preds)
    metric_dict = {"accuracy": round(accuracy, 2), 
                  "precision": round(precision, 2),
                  "recall": round(recall, 2),
                  "f1": round(f1, 2)}
    print(f"KPIs-------------------------------------")
    print(f"Acc: {accuracy * 100:.2f}%")
    print(f"precision: {precision * 100:.2f}%")
    print(f"recall: {recall * 100:.2f}%")
    print(f"f1score: {f1 * 100:.2f}%")
    print(f"KPIs-------------------------------------")
    return metric_dict

In [None]:
def print_confusion_matrix(y_true, y_preds):
    cm = pd.crosstab(y_true, y_preds, rownames=['Actual'], colnames=['Predicted'])
    fig, (ax1) = plt.subplots(ncols=1, figsize=(8,8))
    sns.heatmap(cm, 
            annot=True,ax=ax1,
            linewidths=.2,linecolor="Darkblue", cmap="Oranges")
    plt.title('Confusion Matrix', fontsize=14)
    plt.show()

## data_transform
Transforms the dataset to make it usable in a Machine Learning model. 
* **pd_dataframe**: The dataframe to check / transform. 
* **normalize**: Indicates whether we want to normalize. 
* **stdlimit**: If normalize is True, the columns with a *std* greater than this parameter are normalized. 
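
As a rough sketch of the normalization rule described by these parameters, the snippet below applies min-max scaling only to the columns whose standard deviation exceeds a limit. The toy dataframe and the names `toy`, `scaled` and `std_limit` are hypothetical, chosen only for this example.

```python
import pandas as pd

# Hypothetical toy data: 'amount' has a large spread, 'v1' a small one.
toy = pd.DataFrame({"amount": [0.0, 50.0, 100.0, 200.0],
                    "v1": [-0.5, 0.1, 0.3, 0.4]})

std_limit = 2  # plays the same role as the stdlimit parameter
scaled = toy.copy()

for column in toy.columns:
    if toy[column].std() > std_limit:
        col_min = toy[column].min()
        col_max = toy[column].max()
        # Min-max scaling: maps the column to the [0, 1] range.
        scaled[column] = (toy[column] - col_min) / (col_max - col_min)

print(scaled)
```

Only `amount` is rescaled here; `v1` stays untouched because its standard deviation is below the limit.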

In [None]:
def data_transform(pd_dataframe, normalize=False, stdlimit=2):

    columns_2_transform = []
    # Work on a copy so the original dataframe is never modified.
    newdf = pd_dataframe.copy()

    # Pre-checks.
    # This version doesn't convert categories to columns. TBD.
    if (newdf.dtypes == 'object').any():
        print('Please, only floats or ints')
        return newdf

    # This version has only one treatment for null values:
    # replace them with the most frequent value in the column. TBD.
    if newdf.isnull().sum().max() > 0:
        newdf = newdf.apply(lambda x: x.fillna(x.value_counts().index[0]))

    # Normalize the values. TBD: Add more ways to change the data.
    if normalize:
        for column in newdf.columns:
            if newdf[column].std() > stdlimit:
                columns_2_transform.append(column)
                col_min = newdf[column].min()
                col_max = newdf[column].max()
                print('min:', col_min)
                print('max:', col_max)
                # Min-max scaling to the [0, 1] range.
                newdf[column] = (newdf[column] - col_min) / (col_max - col_min)
        print("Columns to normalize: ", columns_2_transform)

    return newdf

## create_model
I use a dictionary of candidate values to define the hyperparameter combinations, and RandomizedSearchCV to try some of them. I use a small number of cross-validation folds to reduce the training time, but it can be increased if training time is not a concern.

* **data**: The dataframe used. 
* **target**: The name of the Target Column. 
* **iterations**: How many combinations we want to try. 
* **alg**: 0 for RandomForestClassifier, 1 for RandomForestRegressor. 
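
The search described above can be tried in isolation before running it on a real dataset. The following is a minimal sketch on synthetic data from `make_classification`; the reduced grid, the sample size and the `random_state` values are assumptions chosen to keep it fast, not the template's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data, only for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=50)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=50)

# A reduced, hypothetical grid so the example runs in a few seconds.
param_grid = {"n_estimators": [10, 50],
              "max_depth": [None, 5, 10],
              "min_samples_split": [2, 4]}

search = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=50),
                            param_distributions=param_grid,
                            n_iter=4,  # number of sampled combinations
                            cv=2,      # few folds to keep training short
                            random_state=50)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"test accuracy: {search.score(X_test, y_test):.2f}")
```

Unlike an exhaustive grid search, only `n_iter` combinations are sampled, so the best combination found may change with the random seed.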



In [None]:
def create_model(data, target, iterations, alg=0):
    np.random.seed(50)
    
    pd_dataframe = data
    X = pd_dataframe.drop(target, axis=1)
    y = pd_dataframe[target]
    
    #split data 
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
 
    #configure values hyperparameters
    gscgrid = {'n_estimators': [10, 50, 100, 120, 150, 400], 
       'max_depth': [None, 5, 10, 20, 100], 
       'min_samples_split': [2, 4, 6, 12], 
       'min_samples_leaf': [1, 2, 4, 6, 12]}
    if alg == 0:
        model = RandomForestClassifier(n_jobs = 1)
    else: 
        model = RandomForestRegressor(n_jobs = 1)
    
    #using RandomizedSearchCV to try different hyperparameters
    #cv is crossvalidation, the default value is 5. 
    #verbose indicates the level of trace desired. 
    #https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
    gscmodel = RandomizedSearchCV(estimator=model, 
                           param_distributions=gscgrid, 
                           n_iter=iterations, 
                           cv=2, 
                           verbose=2)
    gscmodel.fit(X_train, y_train)
    y_preds = gscmodel.predict(X_test)
    
    if alg == 0: 
        evaluate (y_test, y_preds)
    else: 
        evaluate_regression(y_test, y_preds)
    
    return gscmodel, X_train, X_test, y_train, y_test, y_preds

# USE THE TEMPLATE AND SOLVE THE PROBLEM

In [None]:
dataframe = pd.read_csv('/kaggle/input/creditcardfraud/creditcard.csv')
dataframe.head()

## Call the data_transform
I'm indicating a *std* limit of 5: any column with a *std* bigger than five will be transformed.

In [None]:
pd_dataframe = data_transform(dataframe, normalize=True, stdlimit=5)

In [None]:
#There are only 492 frauds. 
pd_dataframe['Class'].value_counts()

The function has transformed two columns: *Time* and *Amount*. We can see in the result of the *head* function that the values in *Time* and *Amount* are now totally different and lie between 0 and 1. 

In [None]:
pd_dataframe.head()

## Call the create_model

Here you can change the parameters to adapt them to your problem. 
* data: your .csv, previously treated, without empty values and with only numbers. 
* target: the dependent variable. 
* iterations: the more iterations, the more hyperparameter configurations will be tried.

In [None]:
#Algorithms that can be used by the create_model function. 
CLASSIFICATION=0
REGRESSION=1

#create_model returns a tuple of values; with these constants you can access 
#the data inside the collection. 
MODEL = 0
XTRAIN = 1 
XTEST =2
YTRAIN =3
YTEST =4
YPREDS = 5

model = create_model(data=pd_dataframe, 
                     target = 'Class', 
                     iterations=15, 
                    alg = CLASSIFICATION)
model[MODEL].best_params_

In [None]:
#In our test data we have 88 frauds. 
model[YTEST].value_counts()

In [None]:
print_confusion_matrix(model[YTEST], model[YPREDS])

As you can see in the Confusion Matrix, we are able to detect 67 of 88 frauds. Is that good?... Who knows. But our system has an xxx accuracy, really amazing! However, as I said at the beginning, accuracy doesn't work with imbalanced data. We could miss every fraud and still obtain a fantastic accuracy. 

**Always use a confusion matrix to check the results on imbalanced datasets**
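
If you prefer the raw counts over a plot, scikit-learn's `confusion_matrix` returns them directly. The sketch below uses small made-up label arrays (not the notebook's results) to show how to read the four cells and derive recall and precision from them.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels, only for illustration: 1 = fraud, 0 = no fraud.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields the cells in this order: tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)      # share of real frauds that were detected
precision = tp / (tp + fp)   # share of fraud alerts that were real frauds

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"recall: {recall:.2f}, precision: {precision:.2f}")
```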