In [4]:
import pandas as pd
import numpy as np
from raimitigations.dataprocessing import (
    CorrelatedFeatures,
    create_dummy_dataset
)

# Checking Correlations Between Variables - A Comprehensive Guide

This notebook shows several use cases for the **CorrelatedFeatures** class. This class computes several correlation metrics for different types of correlations:
* Numerical x Numerical;
* Numerical x Categorical;
* Categorical x Categorical.

To compute these correlations, it is necessary to call the .fit() method first. This method computes the different correlation metrics, which depend on the data type of each variable. After all computations (which may take a while depending on the correlation metrics chosen and the dataset size), three JSON files containing different summaries are saved. The user can then open these JSON files and inspect all correlation metrics, using their domain knowledge to decide which variables to remove.

## 1 - Toy Dataset

First of all, we need a dataset with several correlated variables. Ideally, the dataset should contain correlated numerical x numerical, numerical x categorical, and categorical x categorical pairs, so that we can test everything this class has to offer. Since we don't have such a dataset at our disposal, we will create an artificial one with the desired characteristics. We will create a dataset with **samples** data points and **n_features** base features, where:
* **n_num_num** new numerical features are created, where each new feature copies one of the **n_features** existing numerical features and then adds noise to the original values. The standard deviation used for generating the noise is a random value between **num_num_noise[0]** and **num_num_noise[1]**. The degree of correlation between the two features depends on the standard deviation used for generating the noise;
* **n_cat_num** new categorical features are created, where these new features are correlated to the existing numerical features in the dataset. The ith new categorical feature created will be correlated to the ith existing numerical feature of the dataset df. To force this correlation, the numerical feature is categorized by creating bins, where the number of bins varies from 2 to 10. Each bin is associated with a categorical value. After that, we change a fraction **p** of the bins by swapping their categorical values. Here, **p** is a value in the range [0,1] that is chosen to be between **pct_change[0]** and **pct_change[1]**. If the numerical feature selected is already correlated to another numerical feature, the new categorical feature could also be correlated to this second numerical feature;
* **n_cat_cat** new categorical features are created in a way similar to the previous **n_cat_num** features. The ith new categorical feature here is created following the same logic explained above, and it is also correlated to the ith numerical feature of the dataset. This way, the ith categorical feature created among the **n_cat_cat** features will be correlated to the ith categorical feature created among the **n_cat_num** features, since both are correlated to the same numerical feature.

There is an inherent randomness associated with the creation process of this dataset. We can control the strength of these correlations by tuning the **num_num_noise** and **pct_change** parameters: lower values for both of these parameters result in variables with a higher correlation.
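To make the role of the noise and of **p** more concrete, the following is a minimal NumPy sketch of the idea. It is not the implementation of `create_dummy_dataset`: the quantile-based bins, the fixed number of bins, and the way a fraction **p** of rows is relabelled are simplifying assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for one of the n_features base numerical columns.
base = rng.normal(loc=0.0, scale=2.0, size=3000)

# Numerical x numerical: copy the column and add Gaussian noise whose
# standard deviation is drawn from [num_num_noise[0], num_num_noise[1]].
noise_std = rng.uniform(0.01, 0.05)
num_copy = base + rng.normal(0.0, noise_std, size=base.size)

# Numerical x categorical: bin the column (quantile bins are an assumption),
# then overwrite a fraction p of the rows with a random label.
n_bins = 4
edges = np.quantile(base, np.linspace(0, 1, n_bins + 1)[1:-1])
codes = np.digitize(base, edges)  # bin index in 0 .. n_bins - 1
p = rng.uniform(0.05, 0.1)        # drawn from the pct_change range
flip = rng.random(base.size) < p
codes[flip] = rng.integers(0, n_bins, size=int(flip.sum()))
cat_feature = np.array([f"val0_{c}" for c in codes])

pearson = np.corrcoef(base, num_copy)[0, 1]
print(f"noise std = {noise_std:.3f} -> Pearson r = {pearson:.4f}")
```

Increasing the upper bound of the noise range or of **p** in this sketch weakens the relationship between the derived column and the base column, which mirrors the effect described above.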

In [5]:
df = create_dummy_dataset(
                    samples=3000, 
                    n_features=6, 
                    n_num_num=2, 
                    n_cat_num=2,
                    n_cat_cat=2,
                    num_num_noise=[0.01, 0.05],
                    pct_change=[0.05, 0.1]
                )
label_col = "label"
df

Unnamed: 0,num_0,num_1,num_2,num_3,num_4,num_5,label,num_c0_num_0,num_c1_num_1,CN_0_num_0,CN_1_num_1,CC_0_num_0,CC_1_num_1
0,-3.633724,2.402746,0.860549,4.033981,-3.005298,-3.279323,0,-3.614031,2.414210,val0_1,val1_2,val0_1,val1_4
1,4.070874,-2.146126,0.580270,-2.836100,-2.924647,2.463193,1,4.058100,-2.148135,val0_3,val1_1,val0_3,val1_1
2,3.045077,-0.783001,2.363379,-4.038650,-3.980719,1.706057,1,2.977632,-0.772284,val0_2,val1_1,val0_3,val1_2
3,2.529509,-2.821301,2.684528,-2.816390,-2.884799,2.691671,1,2.551148,-2.817238,val0_0,val1_0,val0_3,val1_1
4,-2.088423,1.341175,-0.928002,2.481124,-1.034721,-0.633088,0,-2.070140,1.340329,val0_3,val1_2,val0_1,val1_3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,1.135839,-1.622574,4.121300,-1.993871,-0.507498,3.319100,1,1.114095,-1.604495,val0_2,val1_1,val0_2,val1_2
2996,3.303470,-2.597209,3.760176,-4.244150,-0.823886,2.335958,1,3.273267,-2.601232,val0_3,val1_1,val0_3,val1_1
2997,-3.998412,1.247457,-0.784179,4.423204,-2.921416,-0.574877,0,-3.992604,1.265720,val0_1,val1_2,val0_1,val1_3
2998,-3.016525,2.105135,-3.338568,-0.411485,-2.962806,-1.573175,0,-3.026858,2.118895,val0_1,val1_2,val0_1,val1_4


## 2 - A Beginner User's View of the CorrelatedFeatures class

Let's consider the following scenario: a beginner data scientist wants to check for correlations within a dataset, but doesn't really understand how these correlations are measured or how to interpret them. In fact, this data scientist doesn't even care about these details: their goal is simply to remove features that are already explained well enough by other features. In other words, they just want to remove one feature from each pair of correlated features and, if possible, reduce the overall number of features used in their training pipeline. This means that the data scientist isn't interested in analyzing the JSON summary generated by the **CorrelatedFeatures** class.

The **CorrelatedFeatures** class was designed to perform all of these steps automatically. For users who don't know what each parameter means, the parameters have default values that should deliver reasonable results in the average case (although this depends on the dataset used). The user only needs to instantiate the class (leaving all parameters at their default values), call the fit method using their dataset, and then use the transform method to remove the unwanted columns of a second dataset (which could be the same one used for the fit method or another dataset with the same structure). The following code executes all of these steps.

In [6]:
cor_feat = CorrelatedFeatures(save_json=False) # We set this single parameter just to avoid creating JSON files for now
cor_feat.fit(df=df, label_col=label_col)
new_df = cor_feat.transform(df)
new_df

Unnamed: 0,num_0,num_2,num_3,num_4,num_5,CN_0_num_0,CN_1_num_1,CC_1_num_1,label
0,-3.633724,0.860549,4.033981,-3.005298,-3.279323,val0_1,val1_2,val1_4,0
1,4.070874,0.580270,-2.836100,-2.924647,2.463193,val0_3,val1_1,val1_1,1
2,3.045077,2.363379,-4.038650,-3.980719,1.706057,val0_2,val1_1,val1_2,1
3,2.529509,2.684528,-2.816390,-2.884799,2.691671,val0_0,val1_0,val1_1,1
4,-2.088423,-0.928002,2.481124,-1.034721,-0.633088,val0_3,val1_2,val1_3,0
...,...,...,...,...,...,...,...,...,...
2995,1.135839,4.121300,-1.993871,-0.507498,3.319100,val0_2,val1_1,val1_2,1
2996,3.303470,3.760176,-4.244150,-0.823886,2.335958,val0_3,val1_1,val1_1,1
2997,-3.998412,-0.784179,4.423204,-2.921416,-0.574877,val0_1,val1_2,val1_3,0
2998,-3.016525,-3.338568,-0.411485,-2.962806,-1.573175,val0_1,val1_2,val1_4,0


As we can see, we easily instantiated the class, called the fit method, and finally transformed our data set using the transform method. Consider now that we managed to train a model using the selected features and created an API to run the model in a production pipeline. Now, each week a new dataset similar to the one we used is fetched and must be fed to the model. But the dataset arrives with the original columns. First, we need to remove the unwanted columns, similar to what we did previously. To do that, we just call the transform method over the new dataset.

## 3 - An Intermediate User's View of the CorrelatedFeatures class

Let us now consider the scenario where an intermediate data scientist wants to use the **CorrelatedFeatures** class. This user knows something about correlation metrics, so they will be interested in adjusting some parameters. They also have domain knowledge of the dataset, which they can apply when using the **CorrelatedFeatures** class.

First of all, we need to instantiate the class. Since the user is more experienced, they decided that they want to use:
* Pearson, Spearman, and Kendall-tau to measure the correlation between numerical features, with a threshold of 0.8 for all three metrics and a p-value threshold of 0.01. That is, only correlations with a p-value smaller than 0.01 and a correlation coefficient greater than 0.8 (for all metrics: Pearson, Spearman, and Kendall) are considered high;
* Cramer's V to measure the correlation between categorical variables, with a correlation threshold of 0.9 and a p-value threshold of 0.05. That is, only correlations greater than 0.9 with a p-value smaller than 0.05 are considered high;
* No correlation checks between numerical and categorical features: through their domain knowledge of the dataset, the user knows that such checks make no sense here, so they disable them by setting **method_num_cat = None**;
* A restricted set of variables: also using their domain knowledge, the user wants to test the correlation only between the variables [num_0, num_4, CN_0_num_0, CC_1_num_1]. The other variables shouldn't be checked for correlations.

The following cell shows how this user could perform the tasks mentioned above using the **CorrelatedFeatures** class:

In [7]:
cor_feat = CorrelatedFeatures(
					cor_features=["num_0", "num_4", "CN_0_num_0", "CC_1_num_1"],
					method_num_num=["spearman", "pearson", "kendall"],
					num_corr_th=0.8,
					num_pvalue_th=0.01,
					method_num_cat=None,
					cat_corr_th=0.9,
					cat_pvalue_th=0.05,
					save_json=False
				)
cor_feat.fit(df=df, label_col=label_col)
new_df = cor_feat.transform(df)
new_df

No correlations detected. Nothing to be done here.


Unnamed: 0,num_0,num_1,num_2,num_3,num_4,num_5,num_c0_num_0,num_c1_num_1,CN_0_num_0,CN_1_num_1,CC_0_num_0,CC_1_num_1,label
0,-3.633724,2.402746,0.860549,4.033981,-3.005298,-3.279323,-3.614031,2.414210,val0_1,val1_2,val0_1,val1_4,0
1,4.070874,-2.146126,0.580270,-2.836100,-2.924647,2.463193,4.058100,-2.148135,val0_3,val1_1,val0_3,val1_1,1
2,3.045077,-0.783001,2.363379,-4.038650,-3.980719,1.706057,2.977632,-0.772284,val0_2,val1_1,val0_3,val1_2,1
3,2.529509,-2.821301,2.684528,-2.816390,-2.884799,2.691671,2.551148,-2.817238,val0_0,val1_0,val0_3,val1_1,1
4,-2.088423,1.341175,-0.928002,2.481124,-1.034721,-0.633088,-2.070140,1.340329,val0_3,val1_2,val0_1,val1_3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,1.135839,-1.622574,4.121300,-1.993871,-0.507498,3.319100,1.114095,-1.604495,val0_2,val1_1,val0_2,val1_2,1
2996,3.303470,-2.597209,3.760176,-4.244150,-0.823886,2.335958,3.273267,-2.601232,val0_3,val1_1,val0_3,val1_1,1
2997,-3.998412,1.247457,-0.784179,4.423204,-2.921416,-0.574877,-3.992604,1.265720,val0_1,val1_2,val0_1,val1_3,0
2998,-3.016525,2.105135,-3.338568,-0.411485,-2.962806,-1.573175,-3.026858,2.118895,val0_1,val1_2,val0_1,val1_4,0


We can retrieve the selected features by calling the **get_selected_features** method:

In [8]:
cor_feat.get_selected_features()

['num_0',
 'num_1',
 'num_2',
 'num_3',
 'num_4',
 'num_5',
 'num_c0_num_0',
 'num_c1_num_1',
 'CN_0_num_0',
 'CN_1_num_1',
 'CC_0_num_0',
 'CC_1_num_1']

## 4 - A Pro User's View of the CorrelatedFeatures class

Let us now consider the scenario where a pro data scientist wants to use the **CorrelatedFeatures** class to compute different correlation metrics. In contrast to the previous scenarios, this user doesn't want to rely on the class to automatically select the best features. Instead, the user wants to access all correlation metrics computed, even those for pairs of variables defined as not correlated. To this end, the user can inspect the three JSON summaries saved by the fit method (the **get_summary** method returns the same three summaries):

* **Full Summary**: when calling the fit method, all correlations will be computed according to the many parameters detailed previously. After computing all these data, everything is saved in a JSON file, which can then be accessed and analyzed carefully. We recommend using a JSON viewing tool for this. This JSON is saved in the file specified by the parameter **json_summary**;

* **Correlated JSON**: similar to the full summary, but stores only information of pairs of correlated variables (with no repetitions). This JSON is saved in the file specified by the parameter **json_corr**;

* **Uncorrelated JSON**: similar to the full summary, but stores only information of pairs of uncorrelated variables (with no repetitions). This JSON is saved in the file specified by the parameter **json_uncorr**;

Also, the user will use the "model" method for computing the correlations between numerical and categorical variables. For each pair of numerical and categorical variables, this approach trains a simple decision tree that uses the numerical variable to predict the categorical variable. Both variables are first divided into a training and a test set (70% and 30% of the original data, respectively). The training set is used to train the decision tree, where the only feature used by the model is the numerical variable and the labels to be predicted are the values of the categorical variable. Once trained, the model predicts the values of the test set, and a set of metrics is computed to assess the performance of the model (the metrics computed are defined by the **model_metrics** parameter). If all metrics computed are above the threshold defined by the **metric_th** parameter, then both variables are considered to be correlated. Here, we will use F1 as the evaluation metric, with a threshold of 0.9.

Finally, the user wants to save the resulting JSON files in the folder "./corr_json_examples". To do this, we use the *json_summary*, *json_corr*, and *json_uncorr* parameters to specify the path to the json files.

In [9]:
cor_feat = CorrelatedFeatures(
					method_num_num=["spearman", "pearson", "kendall"],				# Used for Numerical x Numerical correlations
					num_corr_th=0.8,												# Used for Numerical x Numerical correlations
					num_pvalue_th=0.01,												# Used for Numerical x Numerical correlations
					method_num_cat="model",											# Used for Numerical x Categorical correlations
					model_metrics=["f1"],											# Used for Numerical x Categorical correlations
					metric_th=0.9,													# Used for Numerical x Categorical correlations
					cat_corr_th=0.8,												# Used for Categorical x Categorical correlations
					cat_pvalue_th=0.01,												# Used for Categorical x Categorical correlations
					json_summary="./corr_json_examples/1_summary_model.json",
					json_corr="./corr_json_examples/1_corr_model.json",
					json_uncorr="./corr_json_examples/1_uncorr_model.json"
				)
cor_feat.fit(df=df, label_col=label_col)

<raimitigations.dataprocessing.feat_selection.correlated_features.CorrelatedFeatures at 0x7fefe89991f0>

We can now go on and check the resulting JSON files. We recommend using a JSON file viewing tool to check the results. We used the JSON Viewer tool from Visual Studio. 

### Full Summary
First, let's take a look at the full summary JSON file. When we open this file with the JSON Viewer, we get the following:

![sum1](./imgs/summary1.PNG)

This shows one primary key for each variable in the dataset. When we click on one of these keys, we see a list of secondary keys made up of all variables in the dataset except the variable being analyzed (the primary key), as the following image shows:

![sum2](./imgs/summary2.PNG)

Each secondary key holds a summary of the correlation metrics computed for the pair A X B, where A is the primary key and B is the secondary key. If we click on one of these secondary keys, we can check the type of correlation (numerical x numerical, numerical x categorical, or categorical x categorical), the correlation metrics (depends on the type of the correlation and the methods used to compute this type of correlation), the thresholds used, the number of exact matches between the two variables, and if they are correlated or not. In the following image, we can see an example for a numerical x numerical correlation type, where we used all three numerical correlation metrics: Spearman, Pearson, and Kendall. Therefore, we have a third-level key for each of these correlations, each one providing the info the user needs.

![sum3](./imgs/summary3.PNG)

If we select a categorical variable for our secondary key, we will notice that the type is different, as well as the correlation results. Since we used the "model" approach for numerical x categorical correlations, we have here the results of the model trained using the numerical variable that predicts the categorical one.

![sum4](./imgs/summary4.PNG)

Finally, let's open a categorical variable as our primary and secondary keys, as depicted in the following image:

![sum5](./imgs/summary5.PNG)

Here we can see the results for the Cramer's V test used for this type of correlation.
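For readers unfamiliar with Cramer's V, the sketch below computes the plain (uncorrected) statistic from a contingency table using only NumPy. The function name `cramers_v` is hypothetical, and the sketch is not taken from the **CorrelatedFeatures** source: the class may apply a bias correction or compute the p-value differently, so the numbers are not guaranteed to match the JSON files.

```python
import numpy as np

def cramers_v(a, b):
    """Plain (uncorrected) Cramer's V between two categorical arrays."""
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    # Contingency table with one row per category of a, one column per category of b.
    table = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(table, (a_idx, b_idx), 1)
    n = table.sum()
    # Expected counts under independence, then the chi-squared statistic.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1  # assumes both variables have at least 2 categories
    return float(np.sqrt(chi2 / (n * k)))

x = np.array(["a", "b", "c"] * 100)
rng = np.random.default_rng(0)

same = cramers_v(x, x)                       # identical columns: perfect association
shuffled = cramers_v(x, rng.permutation(x))  # association destroyed by shuffling
print(f"identical: {same:.3f}, shuffled: {shuffled:.3f}")
```

The statistic ranges from 0 (no association) to 1 (one variable fully determines the other), which is why it can be compared against a threshold such as **cat_corr_th**.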

### Correlated Summary

Now we will take a look at the correlated summary, which shows only the results for the correlated variables (that is, the pairs of variables that passed the correlation thresholds defined by the different parameters). The following image shows this JSON file:

![cor1](./imgs/cor1.PNG)

The main difference here is that the primary keys are variable pairs instead of a single variable. Since we have only 4 primary keys, there are only 4 correlated pairs of variables in our dataset. If we click on one of these pairs, say **A x B**, we will notice that its contents are exactly the same as those found in the full summary under the primary key **A** and secondary key **B**. Therefore, this summary doesn't add any new information in comparison to the previous JSON file. Its advantage is that it only shows the information for correlated features.

![cor2](./imgs/cor2.PNG)
![cor3](./imgs/cor3.PNG)

### Uncorrelated Summary

The final JSON file is very similar to the correlated summary, but instead of showing only the correlated pairs, it shows only the pairs considered NOT correlated. As we can see in the following image, this file is larger than the previous one (for this dataset), because there are many more pairs of uncorrelated variables than pairs of correlated variables.

![uncor1](./imgs/uncor1.PNG)
![uncor2](./imgs/uncor2.PNG)
![uncor3](./imgs/uncor3.PNG)
![uncor4](./imgs/uncor4.PNG)


### Setting the selected features manually

After looking over all of these different results, the user will decide by themselves which variables are the best ones to keep. They can then call the **set_selected_features** method and provide it a list with the selected columns. This will override the list of selected features defined automatically by the class. After that, the user can simply call the transform method normally to remove the unwanted features.

In [10]:
features_manual = ["num_0", "num_1", "num_2", "num_3", "num_4", "CN_0_num_0", "CC_1_num_1"]
cor_feat.set_selected_features(features_manual)
new_df = cor_feat.transform(df)
new_df

Unnamed: 0,num_0,num_1,num_2,num_3,num_4,CN_0_num_0,CC_1_num_1,label
0,-3.633724,2.402746,0.860549,4.033981,-3.005298,val0_1,val1_4,0
1,4.070874,-2.146126,0.580270,-2.836100,-2.924647,val0_3,val1_1,1
2,3.045077,-0.783001,2.363379,-4.038650,-3.980719,val0_2,val1_2,1
3,2.529509,-2.821301,2.684528,-2.816390,-2.884799,val0_0,val1_1,1
4,-2.088423,1.341175,-0.928002,2.481124,-1.034721,val0_3,val1_3,0
...,...,...,...,...,...,...,...,...
2995,1.135839,-1.622574,4.121300,-1.993871,-0.507498,val0_2,val1_2,1
2996,3.303470,-2.597209,3.760176,-4.244150,-0.823886,val0_3,val1_1,1
2997,-3.998412,1.247457,-0.784179,4.423204,-2.921416,val0_1,val1_3,0
2998,-3.016525,2.105135,-3.338568,-0.411485,-2.962806,val0_1,val1_4,0


### Using the Jensen-Shannon approach for numerical x categorical correlations

The following cell shows how to use the Jensen-Shannon approach for numerical x categorical correlations. For a given pair A, B, where A is a numerical feature and B is a categorical feature, this approach does the following: first we group the numerical values of A according to their respective values of the categorical variable B. We then compute the probability density function of the numerical variable for each group (we approximate the PDF with a histogram using **jensen_n_bins** bins). The next step is to compute the Jensen-Shannon distance between the distributions of each pair of groups. This distance varies from 0 to 1, where values closer to 0 mean that both distributions are similar and values closer to 1 mean that the distributions are different. If all pairs of distributions tested are considered different (a Jensen-Shannon distance above **jensen_th** for all pairs tested), then both variables are considered to be correlated.
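The procedure above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the helper names `js_distance` and `jensen_correlated` are hypothetical, the histogram edges are shared across groups, and the logarithm is taken in base 2 so that the distance lies in [0, 1]. The class may differ in any of these details.

```python
import numpy as np
from itertools import combinations

def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two histograms of counts."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(np.sqrt(max(0.5 * kl(p, m) + 0.5 * kl(q, m), 0.0)))

def jensen_correlated(num, cat, n_bins=20, th=0.8):
    """True if every pair of per-category histograms is farther apart than th."""
    edges = np.histogram_bin_edges(num, bins=n_bins)  # same bins for all groups
    hists = {
        c: np.histogram(num[cat == c], bins=edges)[0].astype(float)
        for c in np.unique(cat)
    }
    dists = {(a, b): js_distance(hists[a], hists[b]) for a, b in combinations(hists, 2)}
    return all(d > th for d in dists.values()), dists

rng = np.random.default_rng(0)
cat = np.repeat(["a", "b"], 1500)
separated = np.concatenate([rng.normal(-5, 0.5, 1500), rng.normal(5, 0.5, 1500)])
unrelated = rng.normal(0, 1, 3000)

sep_flag, sep_d = jensen_correlated(separated, cat)
unrel_flag, unrel_d = jensen_correlated(unrelated, cat)
print(sep_flag, unrel_flag)
```

When the per-category histograms do not overlap at all, the distance reaches its maximum of 1, which matches the values close to 1.0 that appear for strongly related pairs.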

Here we also present another parameter: **tie_method**. Whenever a pair of variables is considered correlated, we must choose one of them to remove from the dataset. The first priority is to remove the variables with the largest number of correlations. After computing all correlations, we build a graph G = (V, E) with a set of vertices V and a set of edges E. We create one vertex for each variable and one edge for each pair of correlated variables. We then compute the degree of every vertex. The next step is to analyze each edge: for each edge, we remove the vertex with the highest degree. If both vertices have the same degree, we use a tie method, which can be one of the following:
* "missing": chooses the variable with the fewest missing values;
* "var": chooses the variable with the largest data dispersion (std / (V - v), where std is the standard deviation of the variable, and V and v are the maximum and minimum values observed in the variable, respectively). Works only for numerical x numerical analysis; otherwise, the cardinality approach is used internally;
* "cardinality": chooses the variable with the largest number of distinct values.

In all three cases, if both variables are tied (same dispersion, same number of missing values, or same cardinality), the variable to be removed is selected randomly. When we remove a variable, we remove its associated vertex and all of its edges. We then recompute the degree of all remaining variables and repeat this process until no edges are left.
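The graph-based removal described above can be sketched in plain Python. The function `select_to_remove`, the example edges, and the cardinality values are hypothetical. The sketch also assumes that the variable "chosen" by the tie method is the one kept, and it resolves ties deterministically instead of randomly, so it should be read as an illustration of the idea and not as the class's exact behavior.

```python
def select_to_remove(edges, tie_break):
    """Greedy removal on the correlation graph; returns the dropped variables."""
    remaining = sorted(tuple(sorted(e)) for e in edges)
    removed = []
    while remaining:
        # Recompute the degree of every vertex still present in the graph.
        degree = {}
        for a, b in remaining:
            degree[a] = degree.get(a, 0) + 1
            degree[b] = degree.get(b, 0) + 1
        a, b = remaining[0]
        if degree[a] != degree[b]:
            drop = a if degree[a] > degree[b] else b  # most-connected vertex goes
        else:
            drop = tie_break(a, b)                    # same degree: use the tie method
        removed.append(drop)
        # Removing a vertex also removes all of its edges.
        remaining = [e for e in remaining if drop not in e]
    return removed

# Hypothetical correlated pairs and cardinalities, for illustration only.
edges = [
    ("num_0", "num_c0_num_0"),
    ("num_0", "CC_0_num_0"),
    ("CN_1_num_1", "CC_1_num_1"),
]
cardinality = {"CN_1_num_1": 3, "CC_1_num_1": 5}

# Assumed "cardinality" tie rule: keep the higher-cardinality variable, drop the other.
removed = select_to_remove(edges, lambda a, b: a if cardinality[a] < cardinality[b] else b)
print(removed)
```

In this example `num_0` is dropped first because it takes part in two correlated pairs, which resolves both of its edges at once. The remaining pair has equal degrees and is settled by the tie rule.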

Finally, we also call the **get_summary** method, which returns the full, correlated, and uncorrelated JSONs as a dictionary. This method will also print out all of the information contained in the full summary dictionary. If the user doesn't want this method to print anything, they can just set the **print_summary** parameter to False (it is set to True by default).

In [11]:
cor_feat = CorrelatedFeatures(
					method_num_num=["kendall"],
					num_corr_th=0.8,
					num_pvalue_th=0.01,
					method_num_cat="jensen",
					jensen_n_bins=None, # use the Freedman Diaconis rule to compute this for each numerical variable
					jensen_th=0.8,
					tie_method="var",
					json_summary="./corr_json_examples/2_summary_jensen.json",
					json_corr="./corr_json_examples/2_corr_jensen.json",
					json_uncorr="./corr_json_examples/2_uncorr_jensen.json"
				)
cor_feat.fit(df=df, label_col=label_col)
full_sum, cor_sum, uncor_sum = cor_feat.get_summary()
cor_feat.get_selected_features()


CORRELATION SUMMARY

1 - num_0 x num_c0_num_0:
	* kendall correlation = 0.9911254862732022 with a p-value of 0.0
2 - num_0 x CC_0_num_0:
	Jensen-Shannon results:
	jensen val0_1 x val0_3 = 0.9887453820753916
	jensen val0_1 x val0_2 = 0.980515389122213
	jensen val0_1 x val0_4 = 0.9829295554361851
	jensen val0_1 x val0_0 = 0.9753239770250397
	jensen val0_3 x val0_2 = 0.9535848025138043
	jensen val0_3 x val0_4 = 0.9679167353417252
	jensen val0_3 x val0_0 = 1.0
	jensen val0_2 x val0_4 = 0.9999999999999998
	jensen val0_2 x val0_0 = 0.9784703332107834
	jensen val0_4 x val0_0 = 0.9999999999999999
3 - num_1 x num_c1_num_1:
	* kendall correlation = 0.9958648438368345 with a p-value of 0.0
4 - num_1 x CN_1_num_1:
	Jensen-Shannon results:
	jensen val1_2 x val1_1 = 0.9522839497552043
	jensen val1_2 x val1_0 = 0.9406072818586725
	jensen val1_1 x val1_0 = 0.9423753807964595
5 - num_1 x CC_1_num_1:
	Jensen-Shannon results:
	jensen val1_4 x val1_1 = 0.9921670954590563
	jensen val1_4 x val1_2 = 0.98063

['num_2',
 'num_3',
 'num_4',
 'num_5',
 'num_c0_num_0',
 'CN_0_num_0',
 'CN_1_num_1',
 'CC_1_num_1']

### Using the ANOVA test approach for numerical x categorical correlations

The other approach implemented for detecting numerical x categorical correlations is the ANOVA test, and the following cell shows how to use it. First we use the Levene test to check whether the numerical variable has a similar variance across the different values of the categorical variable (homoscedastic data). If the test passes (that is, if the p-value of the Levene test is greater than the value set for the **levene_pvalue** parameter), we perform the ANOVA test, in which we compute the F-statistic and its associated p-value to see if there is a correlation between the numerical and categorical variables. We also compute the omega-squared metric. If the p-value is less than the parameter **anova_pvalue** and the omega-squared is greater than the parameter **omega_th**, then both variables are considered to be correlated.

In [12]:
cor_feat = CorrelatedFeatures(
					method_num_num=["kendall"],
					num_corr_th=0.8,
					num_pvalue_th=0.01,
					method_num_cat="anova",
					levene_pvalue=0.01,
					anova_pvalue=0.05,
					omega_th=0.75,
					tie_method="cardinality",
					json_summary="./corr_json_examples/3_summary_anova.json",
					json_corr="./corr_json_examples/3_corr_anova.json",
					json_uncorr="./corr_json_examples/3_uncorr_anova.json"
				)
cor_feat.fit(df=df, label_col=label_col)
_ = cor_feat.get_summary()
cor_feat.get_selected_features()


CORRELATION SUMMARY

1 - num_0 x num_c0_num_0:
	* kendall correlation = 0.9911254862732022 with a p-value of 0.0
2 - num_1 x num_c1_num_1:
	* kendall correlation = 0.9958648438368345 with a p-value of 0.0
3 - CN_0_num_0 x CC_0_num_0:
	* Cramer's V = 0.8796 with a p-value of 0.0

NOT CORRELATED VARIABLES SUMMARY

num_0 x num_1:
	* kendall correlation = -0.4134747137934867 with a p-value of 1.1947780382366597e-252
num_0 x num_2:
	* kendall correlation = 0.2370381238190508 with a p-value of 2.247652578766561e-84
num_0 x num_3:
	* kendall correlation = -0.6197656996776704 with a p-value of 0.0
num_0 x num_4:
	* kendall correlation = 0.17180260086695567 with a p-value of 3.4490184419693836e-45
num_0 x num_5:
	* kendall correlation = 0.3512722018450595 with a p-value of 6.1787944029726694e-183
num_0 x num_c1_num_1:
	* kendall correlation = -0.4132830943647883 with a p-value of 2.0390357962426037e-252
num_0 x CN_0_num_0:
	ANOVA results:
	P-Value for the Levene's Test of Homoscedasticity = 9.

['num_0',
 'num_1',
 'num_2',
 'num_3',
 'num_4',
 'num_5',
 'CN_0_num_0',
 'CN_1_num_1',
 'CC_1_num_1']
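To make the omega-squared criterion of the ANOVA approach more tangible, the sketch below computes the one-way ANOVA F-statistic and the omega-squared effect size with NumPy. The function name `anova_omega_squared` is hypothetical, and the sketch deliberately leaves out the Levene test and the p-values, which require an F distribution (available, for instance, in a statistics library). It is not the implementation used by the class.

```python
import numpy as np

def anova_omega_squared(num, cat):
    """One-way ANOVA F-statistic and omega-squared effect size."""
    groups = [num[cat == c] for c in np.unique(cat)]
    k, n = len(groups), num.size
    grand_mean = num.mean()
    # Variation explained by the categories vs. variation left inside them.
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    ms_within = ss_within / (n - k)
    f_stat = (ss_between / (k - 1)) / ms_within
    ss_total = ss_between + ss_within
    omega_sq = (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)
    return float(f_stat), float(omega_sq)

rng = np.random.default_rng(0)
cat = np.repeat(["a", "b", "c"], 200)

# Group means far apart relative to the within-group spread.
shifted = np.repeat([-3.0, 0.0, 3.0], 200) + rng.normal(0, 1, 600)
# Numerical values unrelated to the categories.
noise = rng.normal(0, 1, 600)

f_shift, omega_shift = anova_omega_squared(shifted, cat)
f_noise, omega_noise = anova_omega_squared(noise, cat)
print(f"shifted: omega^2 = {omega_shift:.3f}, noise: omega^2 = {omega_noise:.3f}")
```

Omega-squared estimates the share of the numerical variable's variance explained by the categorical variable, so a threshold such as **omega_th=0.75** only accepts pairs where the categories account for most of that variance.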

### Checking only for certain types of correlation

The following cell shows an example of how to test only for correlations between categorical features. To do this, simply set the method for checking correlations between numerical variables to None (**method_num_num=None**) and the method for checking correlations between numerical and categorical variables to None (**method_num_cat=None**).

In [13]:
cor_feat = CorrelatedFeatures(
					method_num_num=None,
					method_num_cat=None,
					json_summary="./corr_json_examples/4_summary_cat.json",
					json_corr="./corr_json_examples/4_corr_cat.json",
					json_uncorr="./corr_json_examples/4_uncorr_cat.json"
				)
cor_feat.fit(df=df, label_col=label_col)
_ = cor_feat.get_summary()
cor_feat.get_selected_features()


CORRELATION SUMMARY

1 - CN_0_num_0 x CC_0_num_0:
	* Cramer's V = 0.8796 with a p-value of 0.0

NOT CORRELATED VARIABLES SUMMARY

CN_0_num_0 x CN_1_num_1:
	* Cramer's V = 0.502 with a p-value of 0.0
CN_0_num_0 x CC_1_num_1:
	* Cramer's V = 0.3641 with a p-value of 0.0
CN_1_num_1 x CC_0_num_0:
	* Cramer's V = 0.5359 with a p-value of 0.0
CN_1_num_1 x CC_1_num_1:
	* Cramer's V = 0.7698 with a p-value of 0.0
CC_0_num_0 x CC_1_num_1:
	* Cramer's V = 0.3926 with a p-value of 0.0


['num_0',
 'num_1',
 'num_2',
 'num_3',
 'num_4',
 'num_5',
 'num_c0_num_0',
 'num_c1_num_1',
 'CN_0_num_0',
 'CN_1_num_1',
 'CC_1_num_1']

### Update the thresholds used without computing all of the correlations again

The fit() method executes three different steps:
1. **Correlation metrics:** compute the correlation metrics for all pairs of variables;
2. **Correlation thresholds:** identify which pairs of variables are correlated by comparing the pair's metrics against the thresholds used (each type of correlation and each method of correlation uses different thresholds);
3. **Remove variables:** based on all pairs of correlated variables, choose the variables to be removed using the graph approach in conjunction with the **tie_method** parameter (explained previously).

The most computationally expensive of these three steps is the **correlation metrics** computation. After these metrics are computed and saved internally (and in the JSON file), we can still execute steps (2) and (3) without having to recompute all the correlation metrics. To do this, we can use the **update_selected_features()** method, which allows the user to pass different threshold values and re-select the best features. This is faster than running the fit() method, so we advise using it instead of fitting the same object several times with different thresholds. Note that **update_selected_features** only accepts threshold-related parameters. If the user wishes, for example, to change the correlation method used for numerical x categorical features, it is necessary to run the fit() method again. But to change the threshold used for the Cramer's V correlation metric, for example, the user can use **update_selected_features** instead.
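The separation between step (1) and step (2) can be illustrated with a small, self-contained sketch. The dictionary layout and the function `correlated_pairs` are hypothetical and unrelated to the internal data structures of the class. Comparing the absolute value of the coefficient against the threshold is also an assumption. The metric values are rounded from the Kendall results printed earlier in this notebook.

```python
# Metrics cached by an earlier, expensive pass (step 1).
cached = {
    ("num_0", "num_c0_num_0"): {"corr": 0.9911, "pvalue": 0.0},
    ("num_0", "num_1"): {"corr": -0.4135, "pvalue": 1.19e-252},
    ("num_0", "num_3"): {"corr": -0.6198, "pvalue": 0.0},
}

def correlated_pairs(metrics, corr_th, pvalue_th):
    """Step 2 only: compare the cached metrics against the thresholds."""
    return [
        pair
        for pair, m in metrics.items()
        if abs(m["corr"]) > corr_th and m["pvalue"] < pvalue_th
    ]

# Changing the thresholds only repeats the cheap comparison, not the metrics.
strict = correlated_pairs(cached, corr_th=0.8, pvalue_th=0.01)
loose = correlated_pairs(cached, corr_th=0.5, pvalue_th=0.001)
print(strict)
print(loose)
```

Since the comparison only reads stored numbers, trying several thresholds costs almost nothing compared with recomputing the metrics for every pair of variables.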

In the following cells we show a use case for this method. First, let's create a new object of the **CorrelatedFeatures** class and fit it for our dataset using a set of thresholds. We'll set the thresholds high for the numerical x numerical correlations and for the numerical x categorical correlations, but we'll set it low for the categorical x categorical correlations. We'll also save the JSON files so we can compare the results later.

In [14]:
cor_feat = CorrelatedFeatures(
					method_num_num=["kendall", "pearson"],
					num_corr_th=0.99999,
					num_pvalue_th=0.01,
					method_num_cat="model",
					model_metrics=["auc", "f1"],
					metric_th=0.97,
					cat_corr_th=0.5,
					tie_method="cardinality",
					json_summary="./corr_json_examples/5_summary_1.json",
					json_corr="./corr_json_examples/5_corr_1.json",
					json_uncorr="./corr_json_examples/5_uncorr_1.json"
				)
cor_feat.fit(df=df, label_col=label_col)
cor_feat.get_selected_features()

['num_0',
 'num_1',
 'num_2',
 'num_3',
 'num_4',
 'num_5',
 'num_c0_num_0',
 'num_c1_num_1',
 'CN_0_num_0']

As we can see, only categorical features were removed (three of them), due to correlations between two or more categorical features. No correlations were detected for numerical x numerical and numerical x categorical features, as expected. Let's now reverse the situation and set a high threshold for categorical x categorical correlations and low thresholds for numerical x numerical and numerical x categorical correlations.

In [15]:
cor_feat.update_selected_features(
					num_corr_th=0.5,
					num_pvalue_th=0.001,
					model_metrics=["accuracy", "precision"],
					metric_th=0.7,
					cat_corr_th=0.9,
					json_summary="./corr_json_examples/5_summary_2.json",
					json_corr="./corr_json_examples/5_corr_2.json",
					json_uncorr="./corr_json_examples/5_uncorr_2.json"
				)
cor_feat.get_selected_features()

['num_2',
 'num_3',
 'num_4',
 'num_5',
 'CN_0_num_0',
 'CN_1_num_1',
 'CC_0_num_0',
 'CC_1_num_1']

We can see here that we managed to change the thresholds without recomputing all of the correlation metrics. We can also see that the selected features are quite different from those in the previous cell. We encourage the user to compare the JSON files generated in each case to better understand the difference between these two runs.

## Using a dataset without headers

In [16]:
df_new = df.copy()
df_new.columns = list(range(df_new.shape[1]))  # replace the headers with integer positions
df_new

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,-3.633724,2.402746,0.860549,4.033981,-3.005298,-3.279323,0,-3.614031,2.414210,val0_1,val1_2,val0_1,val1_4
1,4.070874,-2.146126,0.580270,-2.836100,-2.924647,2.463193,1,4.058100,-2.148135,val0_3,val1_1,val0_3,val1_1
2,3.045077,-0.783001,2.363379,-4.038650,-3.980719,1.706057,1,2.977632,-0.772284,val0_2,val1_1,val0_3,val1_2
3,2.529509,-2.821301,2.684528,-2.816390,-2.884799,2.691671,1,2.551148,-2.817238,val0_0,val1_0,val0_3,val1_1
4,-2.088423,1.341175,-0.928002,2.481124,-1.034721,-0.633088,0,-2.070140,1.340329,val0_3,val1_2,val0_1,val1_3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,1.135839,-1.622574,4.121300,-1.993871,-0.507498,3.319100,1,1.114095,-1.604495,val0_2,val1_1,val0_2,val1_2
2996,3.303470,-2.597209,3.760176,-4.244150,-0.823886,2.335958,1,3.273267,-2.601232,val0_3,val1_1,val0_3,val1_1
2997,-3.998412,1.247457,-0.784179,4.423204,-2.921416,-0.574877,0,-3.992604,1.265720,val0_1,val1_2,val0_1,val1_3
2998,-3.016525,2.105135,-3.338568,-0.411485,-2.962806,-1.573175,0,-3.026858,2.118895,val0_1,val1_2,val0_1,val1_4


In [17]:
cor_feat = CorrelatedFeatures(
					json_summary="./corr_json_examples/6_summary.json",
					json_corr="./corr_json_examples/6_corr.json",
					json_uncorr="./corr_json_examples/6_uncorr.json"
				)
cor_feat.fit(df=df_new, label_col=6)
cor_feat.get_selected_features()

['0', '2', '3', '4', '5', '9', '10', '12']