
**Install clogistic**

In [None]:
pip install clogistic

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting clogistic
  Downloading clogistic-0.2.0-py3-none-any.whl (11 kB)
Installing collected packages: clogistic
Successfully installed clogistic-0.2.0


### *Dataset available at*

https://github.com/pararawendy/constrained-logistic-regression/blob/main/telco_churn_clean.csv

**Load Dataset**

In [None]:
import pandas as pd

# load the original/raw data
df = pd.read_csv('./IBM.csv')  # data for customer churn
df.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,PaperlessBilling,MonthlyCharges,Churn
0,0,0,1,0,1,0,1,29.85,0
1,1,0,0,0,34,1,0,56.95,0
2,1,0,0,0,2,1,1,53.85,1
3,1,0,0,0,45,0,0,42.3,0
4,0,0,0,0,2,1,1,70.7,1


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7032 non-null   int64  
 1   SeniorCitizen     7032 non-null   int64  
 2   Partner           7032 non-null   int64  
 3   Dependents        7032 non-null   int64  
 4   tenure            7032 non-null   int64  
 5   PhoneService      7032 non-null   int64  
 6   PaperlessBilling  7032 non-null   int64  
 7   MonthlyCharges    7032 non-null   float64
 8   Churn             7032 non-null   int64  
dtypes: float64(1), int64(8)
memory usage: 494.6 KB


In [None]:
# split data for training and testing
from sklearn.model_selection import train_test_split

X = df.drop(columns='Churn').to_numpy()
y = df[['Churn']].to_numpy()
y = y.reshape(len(y),) # sklearn's y shape requirement

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

### Choosing and defining constraints for coefficients (features):

We want all the coefficients to be non-negative and to keep their order, i.e. a1 <= a2 <= a3 ..., so we manually set lower and upper bounds on the value that each coefficient (and the intercept) can take during the optimization process. This makes the fit a constrained optimization. The upper bound of each feature is the same as the lower bound of the next feature, so the intervals never overlap and the ordering holds for any solution that respects the bounds. We are using a random number generator to choose these bounds.

In [None]:
# define constraints as a dataframe such that the individual lower and upper bounds are monotonically increasing
import numpy as np
constraint_df = pd.DataFrame(data=[
                                   ['gender',0,0.1],
                                   ['SeniorCitizen',0.1,0.25],
                                   ['Partner',0.25, 0.4],
                                   ['Dependents',0.4,0.55],
                                   ['tenure',0.55,0.7],
                                   ['PhoneService',0.7,0.85],
                                   ['PaperlessBilling',0.85,1],
                                   ['MonthlyCharges',1,1.15],
                                   ['intercept',1.15,1.30]],
                             columns=['feature','lower_bound','upper_bound'])

constraint_df

Unnamed: 0,feature,lower_bound,upper_bound
0,gender,0.0,0.1
1,SeniorCitizen,0.1,0.25
2,Partner,0.25,0.4
3,Dependents,0.4,0.55
4,tenure,0.55,0.7
5,PhoneService,0.7,0.85
6,PaperlessBilling,0.85,1.0
7,MonthlyCharges,1.0,1.15
8,intercept,1.15,1.3
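
The claim that these bounds enforce both the ordering and non-negativity can be checked directly. The sketch below copies the values from the table above into plain NumPy arrays (the names `lower` and `upper` are chosen here for illustration) and tests three conditions: every interval is non-empty, no interval starts before the previous one ends, and the smallest lower bound is not negative.

```python
import numpy as np

# Bounds copied from constraint_df above, in row order (intercept last).
lower = np.array([0.0, 0.10, 0.25, 0.40, 0.55, 0.70, 0.85, 1.00, 1.15])
upper = np.array([0.10, 0.25, 0.40, 0.55, 0.70, 0.85, 1.00, 1.15, 1.30])

# Each interval must be non-empty: lower_i <= upper_i.
boxes_valid = bool(np.all(lower <= upper))

# If upper_i <= lower_{i+1}, then a_i <= a_{i+1} for every feasible point.
order_guaranteed = bool(np.all(upper[:-1] <= lower[1:]))

# Non-negativity follows from the smallest lower bound being >= 0.
non_negative = bool(lower.min() >= 0)

print(boxes_valid, order_guaranteed, non_negative)
```

Note that this kind of box constraint is stricter than the ordering alone: it also pins each coefficient to a narrow, pre-chosen range.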


**Remarks**
Logistic regression is a classification method used to predict the value of a categorical dependent variable from its relationship to one or more independent variables assumed to have a logistic distribution. If the dependent variable has only two possible values (success/failure), then the logistic regression is binary. If the dependent variable has more than two possible values (for example, blood type given diagnostic test results), then the logistic regression is multinomial.

The optimization technique used for LogisticRegressionBinaryClassifier is the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm. Both L-BFGS and regular BFGS are quasi-Newton methods: they approximate the computationally expensive Hessian matrix that Newton's method uses to calculate steps. The L-BFGS approximation uses only a limited amount of memory to compute the next step direction, which makes it especially suited to problems with a large number of variables. The memory_size parameter specifies the number of past positions and gradients to store for use in the computation of the next step.
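
To make the combination of L-BFGS and coefficient bounds concrete, here is a minimal sketch that fits a bounded logistic regression with SciPy's `minimize` and its `L-BFGS-B` method. It runs on synthetic data generated inside the block, and all names (`X_toy`, `y_toy`, `neg_log_likelihood`, `gradient`) are illustrative assumptions; it is not meant to reproduce what clogistic does internally. SciPy's `maxcor` option plays the role of the memory size described above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid

# Synthetic data (illustrative only): two features, known true weights.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
true_w = np.array([0.5, 2.0])
y_toy = (rng.random(200) < expit(X_toy @ true_w)).astype(float)

def neg_log_likelihood(w):
    """Logistic loss summed over samples, without an intercept."""
    z = X_toy @ w
    # log(1 + exp(z)) computed stably via logaddexp
    return np.sum(np.logaddexp(0.0, z) - y_toy * z)

def gradient(w):
    """Gradient of the loss above with respect to w."""
    return X_toy.T @ (expit(X_toy @ w) - y_toy)

w0 = np.array([0.5, 0.5])
box = [(0.0, 1.0), (0.0, 1.0)]  # each weight restricted to [0, 1]

result = minimize(
    neg_log_likelihood,
    x0=w0,
    jac=gradient,
    method="L-BFGS-B",
    bounds=box,
    options={"maxcor": 10},  # number of stored correction pairs
)
print(result.x, result.fun)
```

When the unconstrained optimum lies outside the box, the solver returns a point on the boundary, which is worth keeping in mind when reading bounded coefficients.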

This learner can use elastic net regularization: a linear combination of L1 (lasso) and L2 (ridge) regularizations. Regularization is a method that can render an ill-posed problem more tractable by imposing constraints that provide information to supplement the data and that prevent overfitting by penalizing models with extreme coefficient values. This can improve the generalization of the learned model by selecting the optimal complexity in the bias-variance tradeoff. Regularization works by adding a penalty associated with the coefficient values to the error of the hypothesis. An accurate model with extreme coefficient values is penalized more, while a less accurate model with more conservative values is penalized less. L1 and L2 regularization have different effects and uses that are complementary in certain respects.

l1_weight: can be applied to sparse models, when working with high-dimensional data. It pulls the small weights associated with relatively unimportant features towards 0.

l2_weight: is preferable for data that is not sparse. It pulls large weights towards zero.
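
As a rough illustration of how the two weights combine, the sketch below computes an elastic net penalty for two hypothetical coefficient vectors. The function name `elastic_net_penalty` and the exact scaling (no factor of 1/2 on the L2 term) are assumptions made for this example; libraries differ in how they scale each term.

```python
import numpy as np

def elastic_net_penalty(weights, l1_weight, l2_weight):
    """Penalty added to the loss: l1 * sum|w| + l2 * sum(w^2)."""
    weights = np.asarray(weights, dtype=float)
    l1_term = l1_weight * np.sum(np.abs(weights))
    l2_term = l2_weight * np.sum(weights ** 2)
    return l1_term + l2_term

# Two hypothetical coefficient vectors.
conservative = [0.5, -0.5]
extreme = [5.0, -0.01]

penalty_conservative = elastic_net_penalty(conservative, l1_weight=1.0, l2_weight=1.0)
penalty_extreme = elastic_net_penalty(extreme, l1_weight=1.0, l2_weight=1.0)

print(penalty_conservative, penalty_extreme)
```

The vector with one extreme coefficient receives a much larger penalty, mostly from the squared L2 term, while its tiny second coefficient contributes almost nothing to L2 and is affected mainly by the L1 term.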

In [None]:
# train using clogistic
from scipy.optimize import Bounds
from clogistic import LogisticRegression as clLogisticRegression

# define bounds by pulling the limits from the constraints dataframe (features first, intercept last)
lower_bounds = constraint_df['lower_bound'].to_numpy()
upper_bounds = constraint_df['upper_bound'].to_numpy()
bounds = Bounds(lower_bounds, upper_bounds)

cl_logreg = clLogisticRegression(solver="lbfgs", penalty="l2")
cl_logreg.fit(X_train, y_train, bounds=bounds)

LogisticRegression(solver='lbfgs')

In [None]:
# coefficients as dataframe
cl_coef = pd.DataFrame({
    'feature': df.drop(columns='Churn').columns.tolist() + ['intercept'],
    'coefficient': list(cl_logreg.coef_[0]) + [cl_logreg.intercept_[0]]
})

cl_coef

Unnamed: 0,feature,coefficient
0,gender,0.0
1,SeniorCitizen,0.1
2,Partner,0.25
3,Dependents,0.4
4,tenure,0.55
5,PhoneService,0.7
6,PaperlessBilling,0.85
7,MonthlyCharges,1.0
8,intercept,1.15
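
Comparing this table with the constraints defined earlier shows that every fitted value coincides with its lower bound, which suggests that the bounds, rather than the data, determine the solution. A small check along these lines, using the displayed (possibly rounded) values copied into arrays named here for illustration:

```python
import numpy as np

# Values copied from the constraint and coefficient tables above.
lower = np.array([0.0, 0.10, 0.25, 0.40, 0.55, 0.70, 0.85, 1.00, 1.15])
upper = np.array([0.10, 0.25, 0.40, 0.55, 0.70, 0.85, 1.00, 1.15, 1.30])
coef = np.array([0.0, 0.10, 0.25, 0.40, 0.55, 0.70, 0.85, 1.00, 1.15])

# A bound is "active" when the fitted value sits on it.
at_lower = np.isclose(coef, lower)
at_upper = np.isclose(coef, upper)

print("at lower bound:", int(at_lower.sum()), "of", len(coef))
print("at upper bound:", int(at_upper.sum()), "of", len(coef))
```

When all bounds are active like this, the model behaves like a fixed scoring rule, so the F1 scores reported below should be read with that in mind.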


## *Performance on Training set*

In [None]:
from sklearn.metrics import f1_score
y_cl_logreg = cl_logreg.predict(X_train)

print(f'F1 score on train set for cl_logreg model is {f1_score(y_train, y_cl_logreg):.4f}')

F1 score on train set for cl_logreg model is 0.4214


## *Performance on Test set*

In [None]:
y_cl_logreg = cl_logreg.predict(X_test)

print(f'F1 score on test set for cl_logreg model is {f1_score(y_test, y_cl_logreg):.4f}')

F1 score on test set for cl_logreg model is 0.4165
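
For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it by hand on a small set of made-up labels (not taken from the churn data), which can help when interpreting scores like the ones above.

```python
import numpy as np

# Made-up labels for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f}")
```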
