# **Feature Selection**
Here we explore methodologies for identifying which features are useful and give the model higher predictive power. Given a dataset, a model trained on it can depend on the original features directly or on derived features. How do we tell which features are the most useful? Multiple approaches exist, ranging from simple univariate analysis to more complex multivariate analysis. In univariate analysis we look at how a single feature contributes to the model. Although useful, it has pitfalls, as some features only perform well together. In multivariate analysis we can tell which features perform well and, more importantly, which perform well together. The various techniques are differentiated by how information is extracted. When the data contains labels, as is the case here, we use supervised techniques; unsupervised techniques can be used for unlabelled data.
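
To make the univariate pitfall concrete, here is a minimal sketch on synthetic data (the arrays `x1`, `x2` and `y` are made up for illustration and are not part of the LGD dataset). The label is the XOR of two binary features, so each feature looks useless on its own while the pair determines the label exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent binary features; the label is their XOR (synthetic, illustrative only)
x1 = rng.integers(0, 2, size=2000)
x2 = rng.integers(0, 2, size=2000)
y = x1 ^ x2

# Univariate view: correlation of each single feature with the label
r1 = np.corrcoef(x1, y)[0, 1]
r2 = np.corrcoef(x2, y)[0, 1]

# Multivariate view: a feature built from both inputs matches the label
joint = (x1 != x2).astype(int)
r_joint = np.corrcoef(joint, y)[0, 1]

print(f"x1 alone: {r1:.3f}, x2 alone: {r2:.3f}, jointly: {r_joint:.3f}")
```

A univariate filter would likely discard both `x1` and `x2` here, which is the kind of interaction multivariate methods are meant to catch.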

As noted in the [source page](https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/), these techniques can be classified as follows:
- **Filter methods:** Based on feature properties highlighted via univariate analysis.
- **Wrapper methods:** Using a specific learning algorithm, these methods perform a greedy search for the best features by fitting models on candidate subsets of features and assessing each subset by training and evaluating a classifier on it.
- **Embedded methods:** These aim to combine the strengths of both filter and wrapper methods while maintaining a reasonable computational cost.
- **Hybrid methods:** These select features via a global transformation that reduces the data to a desired number of dimensions. The new features can bear little or no resemblance to the initial features.
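
The first three families can be sketched with scikit-learn selectors. This is only an illustration on synthetic data (`X_demo` and `y_demo` are hypothetical), and the particular selectors chosen here (an ANOVA F-test, recursive feature elimination, and an L1-penalised logistic regression) are one possible instance of each family, not the only one.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration, not the notebook's dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Filter: score each feature independently with a univariate ANOVA F-test
filter_sel = SelectKBest(score_func=f_classif, k=3).fit(X_demo, y_demo)

# Wrapper: repeatedly refit a model and drop the weakest feature each round
wrapper_sel = RFE(
    LogisticRegression(max_iter=1000), n_features_to_select=3
).fit(X_demo, y_demo)

# Embedded: selection happens during training through an L1 penalty
embedded_sel = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
).fit(X_demo, y_demo)

for name, sel in [("filter", filter_sel), ("wrapper", wrapper_sel), ("embedded", embedded_sel)]:
    print(name, sel.get_support(indices=True))
```

The three selectors need not agree on which columns to keep. Comparing their outputs is a quick way to see which features are robustly useful.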



In [18]:
import pandas as pd
import numpy as np
import saspy
import pickle
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

## 6 Hybrid methods
As mentioned, these are methods that transform the data into a completely different vector space. The transformed features bear little or no resemblance to the original data, yet carry much of the same information. They are commonly referred to as dimensionality reduction methods and are viewed as a form of __feature engineering__.

In [None]:
%run ../src/data_utils.py

In [None]:
sess = saspy.SASsession(
        cfgfile=mk.saspy_file_path,
        cfgname=mk.saspy_cfgname
    )

In [None]:
dataset2 = mk.dataset2

In [None]:
sess.saslib(dataset2['lib_name'], path=dataset2['path'])
lgd_data = sess.sd2df(dataset2['table_name'], libref=dataset2['lib_name'], method="CSV")

In [None]:
lgd_data.head(2)

In [None]:
lgd_data.shape

In [None]:
lgd_data.columns = [col.lower() for col in lgd_data.columns]
lgd_data.head(2)

In [None]:
import category_encoders as ce

In [None]:
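# Column names were lower-cased above, so the target is 'lgd_bad_ind'
categorical_columns = lgd_data.select_dtypes(include=['object', 'category']).columns
encoder = ce.TargetEncoder()
lgd_data_cat = encoder.fit_transform(lgd_data[categorical_columns], lgd_data['lgd_bad_ind'])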

In [None]:
processed_data = lgd_data.copy()
processed_data[categorical_columns] = lgd_data_cat

In [None]:
# with open('../data/lgd_data.pkl', 'wb') as f:
#     pickle.dump(lgd_data, f)

In [17]:
import pickle
with open('../data/lgd_data.pkl', 'rb') as f:
    lgd_data = pickle.load(f)

In [19]:
pd.options.display.max_columns = None
lgd_data.columns = [column.lower() for column in lgd_data.columns]
lgd_data.head()

Unnamed: 0,cf_utilization,province,naics_industry_cd,naics_num,lgd_entcode,cf_tot_pdue_over_500_dols,cf_tot_priors_bal_amt,cf_tot_pdue_amt,cf_agmtcur_limit_amt,cf_agmtcur_orig_amt,cf_aprvl_process_en,cf_chattel_security_amt,cf_real_prpty_security_amt,cf_quota_security_amt,cf_fcc_priors_bal_amt,cf_non_fcc_priors_bal_amt,cf_security_type_en,cf_strtch_debt_type_en_desc,cf_unexpired_undisbd_funds_amt,b1_busn_maturity_type_en,b1_connected_expsr_amt,b1_scndry_industry_class_en,b1_times_nsf_in_24_months_cnt,cf_purp,bus_div_cd,cust_exp_5yr,subnaicsdescription,districtname,total_prior_charges,totsecurity,secured_to,undersecured_to,unsecured_to,cltrl_agmnt_typ_en_desc,aprvl_process_en_desc,cr_fac_security_type_en_desc,cr_fac_sts_en_desc,strtch_debt_type_en_desc,quota_security_amt,real_prpty_security_amt,chattel_security_amt,agmtcur_cr_fac_avail_cr_amt,agmtcur_cr_fac_limit_amt,brp_scor,beacon_scor,fcc_priors,other_priors,chattel_sec,land_sec,quota_sec,tot_priors_per_sec,cred_fclty_to_sec_ratio,leverage_ratio,bal_sheet_type,re_qt_per,tot_sec,tms_nsf,cds_score,ers_score,adscr,der,cred_fclty_to_equity,tot_equity,customer_location,score,lev_ratio,agp1_agv0,lev_ratio_1,lgd_bad_ind
0,0.279005,-4,3119900,3119900.0,Value-added Prcs,0.0,0.0,0.0,308111.47,1000000.0,Detailed - credit management,0.0,0.0,0.0,0.0,0.0,Real Property Security,Unknown,0.0,Change,283769.7,No Value,0.0,,2,60.0,Food and Beverage Manufacturing,Saint-Hyacinthe,0.0,2460689.0,279005.36,0.0,0.0,,Detailed - credit management,Real Property Security,Active,Unknown,0.0,0.0,0.0,24639.35,308111.47,0.0,9999.0,92223.36,0.0,496636.48,163800.49,0.0,92223.36,0.422456,1.758359,1.0,0.125969,660436.97,0.0,0.0,9999.0,1.408312,3.213342,0.128514,2171008.0,QC,0.0,10001.758359,0.0,20001.758359,1.0
1,0.279005,-4,3119900,3119900.0,Value-added Prcs,0.0,0.0,0.0,308111.47,1000000.0,Detailed - credit management,0.0,0.0,0.0,0.0,0.0,Real Property Security,Unknown,0.0,Change,283769.7,No Value,0.0,,2,60.0,Food and Beverage Manufacturing,Saint-Hyacinthe,0.0,2460689.0,279005.36,0.0,0.0,,Detailed - credit management,Real Property Security,Active,Unknown,0.0,0.0,0.0,24639.35,308111.47,0.0,9999.0,92223.36,0.0,496636.48,163800.49,0.0,92223.36,0.422456,1.758359,1.0,0.125969,660436.97,0.0,0.0,9999.0,1.408312,3.213342,0.128514,2171008.0,QC,0.0,10001.758359,0.0,20001.758359,1.0
2,0.279005,-4,3119900,3119900.0,Value-added Prcs,0.0,0.0,0.0,308111.47,1000000.0,Detailed - credit management,0.0,0.0,0.0,0.0,0.0,Real Property Security,Unknown,0.0,Change,283769.7,No Value,0.0,,2,60.0,Food and Beverage Manufacturing,Saint-Hyacinthe,0.0,2460689.0,279005.36,0.0,0.0,,Detailed - credit management,Real Property Security,Active,Unknown,0.0,0.0,0.0,24639.35,308111.47,0.0,9999.0,92223.36,0.0,496636.48,163800.49,0.0,92223.36,0.422456,1.758359,1.0,0.125969,660436.97,0.0,0.0,9999.0,1.408312,3.213342,0.128514,2171008.0,QC,0.0,10001.758359,0.0,20001.758359,1.0
3,0.279005,-4,3119900,3119900.0,Value-added Prcs,0.0,0.0,0.0,308111.47,1000000.0,Detailed - credit management,0.0,0.0,0.0,0.0,0.0,Real Property Security,Unknown,0.0,Change,283769.7,No Value,0.0,,2,60.0,Food and Beverage Manufacturing,Saint-Hyacinthe,0.0,2460689.0,279005.36,0.0,0.0,,Detailed - credit management,Real Property Security,Active,Unknown,0.0,0.0,0.0,24639.35,308111.47,0.0,9999.0,92223.36,0.0,496636.48,163800.49,0.0,92223.36,0.422456,1.758359,1.0,0.125969,660436.97,0.0,0.0,9999.0,1.408312,3.213342,0.128514,2171008.0,QC,0.0,10001.758359,0.0,20001.758359,1.0
4,0.279005,-4,3119900,3119900.0,Value-added Prcs,0.0,0.0,0.0,308111.47,1000000.0,Detailed - credit management,0.0,0.0,0.0,0.0,0.0,Real Property Security,Unknown,0.0,Change,283769.7,No Value,0.0,,2,60.0,Food and Beverage Manufacturing,Saint-Hyacinthe,0.0,2460689.0,279005.36,0.0,0.0,,Detailed - credit management,Real Property Security,Active,Unknown,0.0,0.0,0.0,24639.35,308111.47,0.0,9999.0,92223.36,0.0,496636.48,163800.49,0.0,92223.36,0.422456,1.758359,1.0,0.125969,660436.97,0.0,0.0,9999.0,1.408312,3.213342,0.128514,2171008.0,QC,0.0,10001.758359,0.0,20001.758359,1.0


Split the dataset into training and test sets

In [20]:
X_train, X_test, y_train, y_test = train_test_split(
    lgd_data.drop('lgd_bad_ind', axis=1), lgd_data.lgd_bad_ind, test_size=0.2, random_state=42, stratify=lgd_data.lgd_bad_ind)

In [None]:
# X_train.columns = [column.lower() for column in X_train.columns]
# X_test.columns = [column.lower() for column in X_train.columns]

In [None]:
# Split column names by dtype, using the training set created above
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.values
numerical_features = X_train.select_dtypes(include=np.number).columns.values

### 6.1 Principal Component Analysis

Create features matrix

In [None]:
# Feature matrix and class label
cols_to_drop = ['naics_industry_cd']
X, y = X_train.drop(cols_to_drop, axis = 1), y_train

Transformation pipeline
1. Impute missing values

In [None]:
# Import custom classes
%run ../src/data_utils.py
%run ../src/imputer.py
%run ../src/transforms.py

In [None]:
# Instantiate the classes
transfxn = TransformationPipeline()
imputer = DataFrameImputer()

In [None]:
# Fit transform the training set
X_imputed = imputer.fit_transform(X)

2. Pre-processing

In [None]:
# Transform and scale data
X_scaled, _, feat_nm = transfxn.preprocessing(X_imputed, X_imputed)

In [None]:
print('Data size after pre-processing:', X_scaled.shape)

PCA plot

In [None]:
pcs_data = transfxn.pca_plot_labeled(X_scaled, y, palette = ['b', 'r'])

In [None]:
import matplotlib.pyplot as plt

# Cumulative share of variance explained as components are added
plt.plot(np.cumsum(pcs_data[0].explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
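
The cumulative curve is typically read off to decide how many components to keep. A minimal sketch of that step, assuming a 95% variance target and synthetic data (`X_demo` is hypothetical and stands in for the scaled feature matrix):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 5 latent factors mixed into 20 noisy observed features
latent = rng.normal(size=(300, 5))
X_demo = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(300, 20))

pca = PCA().fit(StandardScaler().fit_transform(X_demo))
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%
n_keep = int(np.searchsorted(cum_var, 0.95) + 1)
print(n_keep, cum_var[n_keep - 1])
```

scikit-learn's `PCA` also accepts a fraction such as `n_components=0.95`, which performs the same selection internally.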

K-Means clustering

In [None]:
from sklearn.cluster import KMeans

In [None]:
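# pcs_data[0] is assumed to be the fitted PCA object (as used above), so project the data first
Xx = pcs_data[0].transform(X_scaled)
labels = KMeans(n_clusters=6, random_state=0).fit_predict(Xx)
# Plot the first two principal components, coloured by cluster
plt.scatter(Xx[:, 0], Xx[:, 1], c=labels,
            s=50, cmap='viridis')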

### 6.2 Singular Value Decomposition
This is also a form of feature engineering. SVD is commonly used when data is sparse: it projects the data from a high-dimensional space onto a handful of dimensions. When categorical features are one-hot encoded, the resulting matrix contains many zeros, which makes SVD an appropriate choice.
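
As a minimal sketch of the sparse case, the snippet below one-hot encodes two made-up categorical columns and reduces the resulting sparse matrix with `TruncatedSVD`. The category values and the choice of three components are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Hypothetical categorical columns (not taken from the LGD data)
cats = np.column_stack([
    rng.choice(["QC", "ON", "BC", "AB"], size=200),
    rng.choice(["dairy", "grain", "poultry", "beef", "other"], size=200),
])

# One-hot encoding returns a SciPy sparse matrix with one non-zero per column per row
onehot = OneHotEncoder().fit_transform(cats)

# TruncatedSVD works on sparse input directly, without densifying or centring it
svd = TruncatedSVD(n_components=3, random_state=0)
reduced = svd.fit_transform(onehot)

print(onehot.shape, "->", reduced.shape)
```

Unlike PCA, `TruncatedSVD` does not centre the data first, which is what lets it operate on sparse matrices efficiently.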

In [28]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
import category_encoders as ce

In [None]:
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.values
numerical_features = X_train.select_dtypes(include=np.number).columns.values

In [None]:
#working with numerical data
X = lgd_data.drop('lgd_bad_ind', axis=1)
Y = lgd_data.lgd_bad_ind
numerical_columns = X.select_dtypes(include=np.number).columns.values
categorical_columns = X.select_dtypes(include=['object', 'category']).columns.values

In [None]:
encoder = ce.TargetEncoder()
X_cat = encoder.fit_transform(X[categorical_columns], Y)
X[categorical_columns] = X_cat

In [None]:

# define the pipeline
steps = [('svd', TruncatedSVD(n_components=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X.fillna(0), Y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))

### 6.3 Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) seeks to best separate (or discriminate) the samples in the training dataset by their class value. Because it uses the class labels, it applies only to supervised learning.
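
One practical consequence of using class labels is that LDA can produce at most `min(n_features, n_classes - 1)` components. A minimal sketch on synthetic three-class data (`X_demo` and `y_demo` are hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic three-class data (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# LDA yields at most min(n_features, n_classes - 1) discriminant axes
max_components = min(X_demo.shape[1], len(set(y_demo)) - 1)
lda = LinearDiscriminantAnalysis(n_components=max_components)
X_lda = lda.fit_transform(X_demo, y_demo)

print(max_components, X_lda.shape)
```

For a binary target this bound is 1, so requesting more components than that raises an error in scikit-learn.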

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

In [None]:

# define the pipeline
# LDA allows at most n_classes - 1 components; assuming lgd_bad_ind is binary, that is 1
steps = [('lda', LinearDiscriminantAnalysis(n_components=1)), ('m', GaussianNB())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X.fillna(0), Y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))

In [51]:
# Data pre-processing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import PowerTransformer
from sklearn.impute import SimpleImputer
import category_encoders as ce

In [70]:
numerical_columns = list(X_train.select_dtypes(include=np.number).columns.values)
categorical_columns = list(X_train.select_dtypes(
            include=["object", "category"]
        ).columns.values)

# Create pipelines
num_pipeline = Pipeline(
    [
        ("num_impute", SimpleImputer(strategy="median")),
        ("p_transf", PowerTransformer(standardize=False)),
        ("std_scaler", StandardScaler()),
    ]
)

cat_pipeline = Pipeline(
    [
        ("cat_impute", SimpleImputer(strategy="most_frequent")),
        ("cat_t", ce.TargetEncoder())
        ]
)

transformer = ColumnTransformer(transformers=
    [
        ("categorical_features", cat_pipeline, categorical_columns),
        ("numerical_features", num_pipeline, numerical_columns),
    ],
    remainder="passthrough",
)

pipeline = Pipeline(
    steps=[
        ('tranforms', transformer)
    ]
)

In [63]:
t.shape

(100, 68)

In [72]:
transformer_m = pipeline.fit(X_train, y_train)


In [73]:
tt = transformer_m.transform(X_train.head(100))

In [77]:
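import pandas as pd
# ColumnTransformer emits columns in transformer order: categorical first, then numerical
ttt = pd.DataFrame(tt, columns=categorical_columns + numerical_columns)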

In [78]:
ttt

Unnamed: 0,cf_utilization,province,naics_industry_cd,naics_num,lgd_entcode,cf_tot_pdue_over_500_dols,cf_tot_priors_bal_amt,cf_tot_pdue_amt,cf_agmtcur_limit_amt,cf_agmtcur_orig_amt,cf_aprvl_process_en,cf_chattel_security_amt,cf_real_prpty_security_amt,cf_quota_security_amt,cf_fcc_priors_bal_amt,cf_non_fcc_priors_bal_amt,cf_security_type_en,cf_strtch_debt_type_en_desc,cf_unexpired_undisbd_funds_amt,b1_busn_maturity_type_en,b1_connected_expsr_amt,b1_scndry_industry_class_en,b1_times_nsf_in_24_months_cnt,cf_purp,bus_div_cd,cust_exp_5yr,subnaicsdescription,districtname,total_prior_charges,totsecurity,secured_to,undersecured_to,unsecured_to,cltrl_agmnt_typ_en_desc,aprvl_process_en_desc,cr_fac_security_type_en_desc,cr_fac_sts_en_desc,strtch_debt_type_en_desc,quota_security_amt,real_prpty_security_amt,chattel_security_amt,agmtcur_cr_fac_avail_cr_amt,agmtcur_cr_fac_limit_amt,brp_scor,beacon_scor,fcc_priors,other_priors,chattel_sec,land_sec,quota_sec,tot_priors_per_sec,cred_fclty_to_sec_ratio,leverage_ratio,bal_sheet_type,re_qt_per,tot_sec,tms_nsf,cds_score,ers_score,adscr,der,cred_fclty_to_equity,tot_equity,customer_location,score,lev_ratio,agp1_agv0,lev_ratio_1
0,0.026818,0.027739,0.027492,0.026633,0.026496,0.026666,0.026114,0.027262,0.026617,0.026396,0.027492,0.027382,0.026477,0.026633,0.026496,0.026477,0.026666,0.028497,0.677893,-1.110223e-16,-0.361927,1.972736,-0.199979,1.322941,1.311484,1.712304,1.119357,5.508186,2.075050,-0.180688,-0.04582,-0.130230,-0.649931,0.613644,1.571856,0.914812,1.354320,-0.139732,-0.097494,5.508186,1.119357,1.712304,0.071106,1.322941,-0.278746,-0.192134,0.033288,-0.143911,1.346280,0.702905,4.390667,-0.075857,0.058586,0.053053,0.29713,0.440217,1.248582,-0.505151,0.779789,-0.199079,0.242383,0.088956,0.090475,-0.040402,0.772954,0.393161,0.241128,0.392937
1,0.027266,0.024730,0.026211,0.026536,0.026496,0.026666,0.026649,0.025827,0.026617,0.026396,0.026344,0.026490,0.026477,0.026536,0.026496,0.026477,0.026666,0.026690,-0.145120,-1.665335e-16,-0.361927,-0.507464,-0.199979,-0.688829,-0.902459,-0.597388,0.984662,-0.182575,-0.482036,-0.180688,-0.04582,-0.492739,1.150948,0.613644,-0.634224,-0.559936,-0.590738,-0.139732,-0.097494,-0.182575,0.984662,-0.597388,-0.134582,-0.688829,-0.278746,-0.192134,-0.099842,-0.143911,-0.784127,0.345739,-0.229303,-0.075858,0.058580,0.053821,0.29713,0.896696,-0.331362,1.817093,0.779789,-0.508443,0.242470,0.088997,0.090473,-0.201332,0.400065,0.393179,0.241128,0.392946
2,0.026694,0.026513,0.026211,0.026536,0.023099,0.026139,0.026649,0.025827,0.024817,0.026396,0.026344,0.025690,0.026477,0.026536,0.023099,0.026477,0.026139,0.026620,1.032796,-1.665335e-16,-0.361927,-0.507464,-0.199979,0.084820,0.091342,-0.597388,-0.864772,-0.182575,-0.482036,-0.180688,-0.04582,0.938246,-0.649931,-2.056553,-0.634224,-2.690965,-1.975886,-0.139732,5.329171,-0.182575,-0.864772,-0.597388,-0.190978,0.084820,-0.278746,-0.192134,-0.099842,-0.143911,-0.784127,-1.333541,-0.229303,-0.075858,0.058565,0.052040,0.29713,-1.239459,-2.361807,-0.505151,0.779789,-0.286069,0.242382,0.088936,0.090472,0.180745,0.663255,0.393136,0.241128,0.392925
3,0.027266,0.026064,0.026064,0.026536,0.026496,0.026139,0.026649,0.025827,0.026617,0.026396,0.026021,0.026490,0.026477,0.026536,0.026496,0.026477,0.026139,0.026690,0.955042,-2.220446e-16,-0.361927,1.975908,-0.199979,-0.363330,-0.493013,-0.597388,1.115830,-0.182575,2.076740,5.578623,-0.04582,-0.360030,-0.649931,0.613644,1.582161,0.035818,-0.219607,-0.139732,-0.097494,-0.182575,1.115830,-0.597388,-0.190978,-0.363330,-0.278746,-0.192134,0.061635,6.960874,-0.784127,0.692451,-0.229303,-0.075852,0.058604,0.053121,0.29713,0.896696,0.295345,-0.505151,0.779789,-0.339055,-1.793253,0.088940,0.090472,-0.132310,0.598327,0.393163,0.241128,0.392938
4,0.026818,0.026513,0.026211,0.026536,0.026496,0.026666,0.026649,0.025827,0.026617,0.026396,0.026344,0.026490,0.026477,0.026536,0.026496,0.026477,0.026666,0.026690,0.266357,-1.665335e-16,-0.361927,-0.507464,-0.199979,-0.078937,0.034663,-0.597388,-0.864772,-0.182575,-0.482036,-0.180688,-0.04582,-0.239292,-0.649931,0.613644,-0.634224,-0.051815,-0.107786,-0.139732,-0.097494,-0.182575,-0.864772,-0.597388,-0.133035,-0.078937,-0.278746,-0.192134,-0.099842,-0.143911,-0.784127,0.376504,-0.229303,-0.075858,0.058594,0.053836,0.29713,0.650328,-0.046947,-0.505151,0.612546,-0.221885,0.242383,0.088952,0.090472,-0.104253,0.421869,0.393161,0.241128,0.392937
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0.026818,0.026513,0.026211,0.026536,0.026524,0.026666,0.026649,0.027262,0.026617,0.026396,0.026344,0.026797,0.026477,0.026536,0.026524,0.026477,0.026666,0.026690,-0.212289,-1.665335e-16,-0.361927,-0.507464,-0.199979,-1.230774,-1.995835,1.645963,-0.864772,-0.182575,-0.482036,-0.180688,-0.04582,-0.276804,-0.649931,0.613644,-0.634224,-1.202171,-1.003906,-0.139732,-0.097494,-0.182575,-0.864772,1.645963,-0.162640,-1.230774,-0.278746,-0.192134,-0.099842,-0.143911,1.158592,-1.333541,-0.229303,-0.075858,0.058595,0.056513,0.29713,-1.239459,-0.985184,-0.505151,0.779789,-0.270377,0.242381,0.088947,0.090472,-0.112407,0.682756,0.393236,0.241128,0.392974
96,0.027266,0.026513,0.026211,0.026536,0.026496,0.026139,0.026649,0.027262,0.026617,0.026396,0.026344,0.027052,0.026477,0.026536,0.026496,0.026477,0.026139,0.026690,-1.222530,-1.665335e-16,-0.361927,-0.507464,-0.199979,0.220264,0.315175,-0.597388,1.115630,-0.182575,-0.482036,-0.180688,-0.04582,0.148731,-0.649931,0.613644,-0.634224,0.034763,-0.170253,-0.139732,-0.097494,-0.182575,1.115630,-0.597388,0.327920,0.220264,-0.278746,-0.192134,-0.099842,-0.143911,-0.784127,0.691860,-0.229303,-0.075858,0.058574,0.051093,0.29713,0.896696,0.294220,-0.505151,-0.074870,0.087448,0.242407,0.088968,0.090472,0.562660,0.110103,0.393111,0.241128,0.392912
97,0.026818,0.027739,0.027492,0.026633,0.026496,0.026666,0.026649,0.025827,0.026617,0.026396,0.027492,0.027382,0.026477,0.026633,0.026496,0.026477,0.026666,0.027040,-0.248734,-1.110223e-16,-0.361927,-0.507464,-0.199979,0.945046,1.222729,-0.597388,-0.864772,-0.182575,-0.482036,-0.180688,-0.04582,0.303237,-0.649931,0.613644,-0.634224,1.209722,0.966793,-0.139732,-0.097494,-0.182575,-0.864772,-0.597388,-0.040719,0.945046,3.587494,2.773099,0.002789,-0.143911,-0.784127,0.951005,4.390667,0.556771,0.058574,0.063127,0.29713,0.896696,1.399606,-0.505151,0.621485,2.735862,0.242383,0.088945,0.090474,-0.042755,0.965206,0.393352,0.241128,0.393031
98,0.026818,0.024730,0.026211,0.026633,0.026496,0.026666,0.026114,0.025827,0.027037,0.026396,0.026344,0.026490,0.026477,0.026633,0.026496,0.026477,0.026666,0.026690,1.019898,-1.665335e-16,-0.361927,-0.507464,-0.199979,1.310653,1.297535,-0.597388,-0.864772,-0.182575,-0.482036,-0.180688,-0.04582,-0.059819,-0.649931,-1.879580,-0.634224,0.862824,1.424134,-0.139732,-0.097494,-0.182575,-0.864772,-0.597388,-0.190978,1.310653,-0.278746,2.773099,-0.099842,-0.143911,-0.784127,-1.333541,-0.229303,-0.075858,0.058594,0.054112,0.29713,-1.239459,-2.361807,-0.505151,-1.669275,2.735862,0.242418,0.089011,0.090475,-0.014308,-1.691715,0.393186,0.241128,0.392949


### How to Interpret Pearson’s Correlation Coefficients
Pearson’s correlation coefficient is represented by the Greek letter rho (ρ) for the population parameter and r for a sample statistic. This correlation coefficient is a single number that measures both the strength and direction of the linear relationship between two continuous variables. Values can range from -1 to +1.

- Strength: The greater the absolute value of the correlation coefficient, the stronger the relationship. 
   - The extreme values of -1 and 1 indicate a perfectly linear relationship where a change in one variable is accompanied by a perfectly consistent change in the other. For these relationships, all of the data points fall on a line. In practice, you won’t see either type of perfect relationship.
   - A coefficient of zero represents no linear relationship. As one variable increases, there is no tendency in the other variable to either increase or decrease.
   - When the value is in-between 0 and +1/-1, there is a relationship, but the points don’t all fall on a line. As r approaches -1 or 1, the strength of the relationship increases and the data points tend to fall closer to a line.
- Direction: The sign of the correlation coefficient represents the direction of the relationship.
  - Positive coefficients indicate that when the value of one variable increases, the value of the other variable also tends to increase. Positive relationships produce an upward slope on a scatterplot.
  - Negative coefficients represent cases when the value of one variable increases, the value of the other variable tends to decrease. Negative relationships produce a downward slope.
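
The points above can be checked numerically. The helper below is a minimal sketch (the function name `pearson_r` and the toy arrays are made up for illustration) that computes the sample coefficient as the covariance divided by the product of the standard deviations.

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation: covariance over the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r_up = pearson_r(x, 2 * x + 1)       # perfectly linear, increasing
r_down = pearson_r(x, -3 * x)        # perfectly linear, decreasing
r_curve = pearson_r(x, (x - 3) ** 2) # symmetric curve with no linear trend

print(r_up, r_down, r_curve)
```

The last case shows why a coefficient near zero only rules out a linear relationship: the two variables are clearly related, just not along a line.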

In [1]:
import pandas as pd

In [3]:
t = pd.read_csv("../data/lgd_data.csv")


In [4]:
t.head()

Unnamed: 0.1,Unnamed: 0,cf_utilization,province,naics_industry_cd,naics_num,lgd_entcode,cf_tot_pdue_over_500_dols,cf_tot_priors_bal_amt,cf_tot_pdue_amt,cf_agmtcur_limit_amt,...,adscr,der,cred_fclty_to_equity,tot_equity,customer_location,score,lev_ratio,agp1_agv0,lev_ratio_1,lgd_bad_ind
0,0,0.279005,-4,3119900.0,3119900.0,Value-added Prcs,0.0,0.0,0.0,308111.47,...,1.408312,3.213342,0.128514,2171008.0,QC,0.0,10001.758359,0.0,20001.758359,1.0
1,1,0.279005,-4,3119900.0,3119900.0,Value-added Prcs,0.0,0.0,0.0,308111.47,...,1.408312,3.213342,0.128514,2171008.0,QC,0.0,10001.758359,0.0,20001.758359,1.0
2,2,0.279005,-4,3119900.0,3119900.0,Value-added Prcs,0.0,0.0,0.0,308111.47,...,1.408312,3.213342,0.128514,2171008.0,QC,0.0,10001.758359,0.0,20001.758359,1.0
3,3,0.279005,-4,3119900.0,3119900.0,Value-added Prcs,0.0,0.0,0.0,308111.47,...,1.408312,3.213342,0.128514,2171008.0,QC,0.0,10001.758359,0.0,20001.758359,1.0
4,4,0.279005,-4,3119900.0,3119900.0,Value-added Prcs,0.0,0.0,0.0,308111.47,...,1.408312,3.213342,0.128514,2171008.0,QC,0.0,10001.758359,0.0,20001.758359,1.0


In [5]:
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/'
    'machine-learning-databases'
    '/breast-cancer-wisconsin/wdbc.data',
    header=None
    )

In [6]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [7]:
from sklearn.preprocessing import LabelEncoder

In [None]:
# Features are columns 2 onward; column 1 holds the diagnosis label (M/B)
X = df.loc[:, 2:].values
y = df.loc[:, 1].values
# Encode the string labels as integers
le = LabelEncoder()
y = le.fit_transform(y)
le.classes_