Check the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) of the ``LogisticRegression`` class in ``sklearn.linear_model`` for details.

In [None]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score  

# 1. Data Pre-processing

In [None]:
churn = pd.read_csv('churn.csv', sep=' ')     # modify your data path if needed

display(churn.head(), churn.dtypes)           # some variables are object (string or mixed) 

## 1.1 Convert Target Variable to Numbers (Optional)

**scikit-learn** accepts a non-numeric target variable: the classifier automatically assigns an integer to each class in sorted label order (i.e., `'LEAVE'` = `0`, `'STAY'` = `1`) before model training. If we keep the string target:

- The positive class is `STAY`, so the log-odds from the linear equation and the resulting probability are those of ``STAY``.
- In the array of class probabilities, ``LEAVE`` is in the 1st column and `STAY` is in the 2nd column, as we saw in week 3. 
- The parameter values will have the opposite sign (e.g., negative instead of positive) from those we obtain here. 
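
As a minimal sketch of the automatic encoding described above, the toy example below (hypothetical data, not the churn set) fits a classifier on string labels. The `classes_` attribute holds the labels in sorted order, and the columns of `predict_proba` follow that order.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, string labels
X_toy = np.array([[0.0], [0.2], [0.4], [0.6], [0.8], [1.0]])
y_toy = np.array(['STAY', 'STAY', 'STAY', 'LEAVE', 'LEAVE', 'LEAVE'])

clf = LogisticRegression().fit(X_toy, y_toy)

print(clf.classes_)                    # labels in sorted order: 'LEAVE' first, 'STAY' second
print(clf.predict_proba(X_toy[:2]))    # column 0 = P(LEAVE), column 1 = P(STAY)
print(clf.coef_)                       # sign refers to the second class, 'STAY'
```

In this toy set larger feature values go with `LEAVE`, so the coefficient (which describes `STAY`) comes out negative.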

Here we convert the target to numbers manually, because we want ``'LEAVE'`` as the positive class (``1``) and ``'STAY'`` as the negative class (``0``). 

- Here we use the `pandas.Series.replace` method (check the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.replace.html)) to replace the target values with numbers. After conversion, the target's data type changes from `object` (string or mixed data type) to `int64` automatically.



In [None]:
churn['LEAVE'] = churn['LEAVE'].replace({'LEAVE':1, 'STAY':0})    # if pandas warns about downcasting, please don't opt in to the future behavior

churn.head()    # now LEAVE is considered as positive (1)

**Notes** 

If we use conditional selection instead, the target's data type remains `object`. This ambiguous data type confuses the logistic regression algorithm, which expects an `integer` or `category` data type for a numeric target. To address this, we need to manually change the target's data type to either `int64` or `category`. See the code below.
  
```python
churn.loc[churn['LEAVE']=='LEAVE','LEAVE'] = 1        # conditional selection
churn.loc[churn['LEAVE']=='STAY','LEAVE'] = 0
churn = churn.astype({'LEAVE': 'int64'})              # or churn = churn.astype({'LEAVE': 'category'}) 
```
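
The data type behaviour described in this note can be checked on a toy column. The sketch below uses hypothetical data and builds the column with `dtype=object` explicitly, to mirror the `object` column reported for the churn data. It also shows `Series.map` as an alternative that is not used in this notebook.

```python
import pandas as pd

# Hypothetical toy target column, stored as object like the original data
toy = pd.DataFrame({'LEAVE': pd.Series(['LEAVE', 'STAY', 'STAY', 'LEAVE'], dtype=object)})

by_selection = toy.copy()
by_selection.loc[by_selection['LEAVE'] == 'LEAVE', 'LEAVE'] = 1
by_selection.loc[by_selection['LEAVE'] == 'STAY', 'LEAVE'] = 0
dtype_before = by_selection['LEAVE'].dtype
print(dtype_before)                    # still object after the assignments

by_selection = by_selection.astype({'LEAVE': 'int64'})
print(by_selection['LEAVE'].dtype)     # int64 after the explicit cast

# Alternative (an assumption, not the notebook's approach): map builds a new integer column
by_map = toy['LEAVE'].map({'LEAVE': 1, 'STAY': 0})
print(by_map.dtype)
```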


## 1.2 Convert Non-Numeric Features to Numbers (Required)

As **scikit-learn** can only take numeric predictors, we need to convert all string features to numbers. 

In [None]:
# Convert COLLEGE to numbers 

churn['COLLEGE'] = churn['COLLEGE'].replace({'zero':0, 'one':1})  # alternatively, use conditional selection

Here we use `pandas.Categorical` to convert the string data into an ordered categorical variable and obtain its numeric codes.

- Check [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html) for details.
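
As a small illustration of how the codes are assigned, the sketch below uses hypothetical toy values. Each code is the position of the value in the `categories` list, and a value that is not in the list (here the made-up `'unknown'`) gets the code `-1`, so it is worth checking the unique values first.

```python
import pandas as pd

# Hypothetical toy values; 'unknown' is deliberately not in the category list
values = ['avg', 'very_unsat', 'very_sat', 'unknown']

cat = pd.Categorical(values=values,
                     categories=['very_unsat', 'unsat', 'avg', 'sat', 'very_sat'],
                     ordered=True)

print(cat.codes)    # position of each value in `categories`; -1 marks a value outside the list
```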

In [None]:
# check unique values for the three features

for col in ['REPORTED_SATISFACTION','REPORTED_USAGE_LEVEL','CONSIDERING_CHANGE_OF_PLAN']:
    print(col, churn[col].unique())      

In [None]:
# Convert REPORTED_SATISFACTION to numeric (very_unsat = 0, unsat = 1, ..., very_sat = 4)

# step 1 - convert the variable as a categorical variable (a pandas 1D array)
cat1 = pd.Categorical(values = churn['REPORTED_SATISFACTION'], 
                      categories = ["very_unsat", "unsat", "avg", "sat","very_sat"],  # specify the order
                      ordered = True)                                                # treat categories as ordered

# step 2 - obtain the numeric codes and update original feature
churn['REPORTED_SATISFACTION'] = cat1.codes           

In [None]:
# Convert REPORTED_USAGE_LEVEL as numeric

# step 1
cat2 = pd.Categorical(values = churn['REPORTED_USAGE_LEVEL'], 
                      categories = ["very_little", "little", "avg", "high","very_high"],   
                      ordered = True)

# step 2
churn['REPORTED_USAGE_LEVEL'] = cat2.codes     

In [None]:
# Convert CONSIDERING_CHANGE_OF_PLAN as numeric

# step 1
cat3 = pd.Categorical(values = churn['CONSIDERING_CHANGE_OF_PLAN'], 
                      categories=["never_thought", "no", "perhaps", "considering","actively_looking_into_it"], 
                      ordered= True)

# step 2
churn['CONSIDERING_CHANGE_OF_PLAN'] = cat3.codes

In [None]:
churn.head(5)     # check first 5 rows  

Check the data types again to make sure that the numeric target is either `int64` or `category`.  
- If we use the original string target, the `object` data type is fine. 

In [None]:
churn.dtypes    # check data type: target is now int64 (discrete)

## 1.3 Split into Train and Test 

In [None]:
X = churn.drop(columns = 'LEAVE')

y = churn['LEAVE']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  

display(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

## 1.4 Scale Data

Fit the scaler on the training set, then apply the same scaler to transform both the training set and the test set.
**Do NOT** fit the scaler on the test data: referencing the test data can lead to data leakage. 

- Remember to use scaled data for model training and prediction!
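
To make the rule above concrete, here is a minimal sketch on hypothetical toy numbers. The scaler learns the minimum and maximum from the training data only, so a test value outside the training range is mapped outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy feature, split into train and test
x_train = np.array([[10.0], [20.0], [30.0]])
x_test = np.array([[15.0], [40.0]])              # 40 lies outside the training range

scaler_toy = MinMaxScaler().fit(x_train)         # learns min and max from train only

# Min-max scaling by hand, using the training min and max
manual = (x_test - x_train.min(axis=0)) / (x_train.max(axis=0) - x_train.min(axis=0))

print(scaler_toy.transform(x_test).ravel())      # matches `manual`; the second value exceeds 1
print(manual.ravel())
```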

In [None]:
scaler = MinMaxScaler()                          

X_train_scaled = scaler.fit_transform(X_train)   # fit and transform in one step

X_test_scaled  = scaler.transform(X_test)        # apply the scaler on test

After scaling, the transformed data is a numpy array without column names. We add the names back by converting each array to a dataframe with column names.
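
One detail worth checking: a dataframe built from a plain array gets a fresh 0, 1, 2, ... index, while the original split keeps its row labels. The sketch below, on hypothetical toy data, shows the difference and an optional `index=` argument that keeps the labels. This is an alternative, not what the notebook does.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frame whose index is not 0..n-1 (as after a random split)
X_toy = pd.DataFrame({'INCOME': [20000, 50000, 80000]}, index=[7, 2, 5])

scaled = MinMaxScaler().fit_transform(X_toy)     # plain numpy array, no labels

fresh_index = pd.DataFrame(scaled, columns=X_toy.columns)                     # new default index
kept_index = pd.DataFrame(scaled, columns=X_toy.columns, index=X_toy.index)   # original labels

print(fresh_index.index.tolist(), kept_index.index.tolist())
```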

In [None]:
X_train_scaled = pd.DataFrame(X_train_scaled, columns = X_train.columns)  

X_test_scaled = pd.DataFrame(X_test_scaled, columns = X_train.columns)

display(X_train_scaled.head(), X_test_scaled.head())

# 2. Train m1 with only two Features  

##  2.1 Model Training 

In [None]:
X_train_sub = X_train_scaled[['COLLEGE', 'INCOME']]    # 2D features

m1 = LogisticRegression().fit(X_train_sub, y_train)   

display(m1.intercept_, m1.coef_, m1.feature_names_in_)  # note: intercept_ is a 1D array, coef_ is a 2D array

## 2.2  Predict and Evaluate on Train Set

**Predict class labels** 

- When making predictions, the default cut-off point for class determination is 50%. 
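
A minimal sketch of what the default cut-off means, using hypothetical toy data with a 0/1 target: the labels from `predict` match a manual comparison of the positive-class probability with 0.5. The 0.7 cut-off is only an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data with a 0/1 target
X_toy = np.array([[0.0], [0.1], [0.3], [0.5], [0.7], [0.9], [1.0]])
y_toy = np.array([0, 0, 0, 1, 0, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)

p_pos = clf.predict_proba(X_toy)[:, 1]     # probability of class 1
by_default = clf.predict(X_toy)            # labels from the default cut-off
by_hand = (p_pos > 0.5).astype(int)        # same rule applied manually
stricter = (p_pos > 0.7).astype(int)       # a hypothetical higher cut-off

print(by_default, by_hand, stricter)
```

Raising the cut-off can only keep or reduce the number of instances labelled as the positive class.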

In [None]:
train_pred1 = m1.predict(X_train_sub)  

train_pred1

**Estimate class probabilities**  

- Note that the column order is 0 (`STAY`), then 1 (`LEAVE`). The probability of 1 (i.e., `LEAVE`) is in the 2nd column. 

In [None]:
train_prob1 = m1.predict_proba(X_train_sub)

train_prob1     # probability of 0 (STAY), 1 (LEAVE)

**Check model accuracy**

In [None]:
accuracy_score(y_train, train_pred1)     # (y_true, y_pred); same as m1.score(X_train_sub, y_train)

## 2.3  The log_odds and Probabilities of LEAVE

Below are the formulas you may need to use.


- Estimate the ``log_odds`` of LEAVE (i.e., ``f(x)``) with the linear function. 

$$
f(x) = w_0 + w_1 \cdot \text{COLLEGE} + w_2 \cdot \text{INCOME}
$$


- Estimate  the ``probability`` of LEAVE with the logistic function. 

$$
P(Y=1 | x) = \frac{1}{1 + e^{-f(x)}} \quad \text{or} \quad P(Y=1 |x) = \frac{e^{f(x)}}{1 + e^{f(x)}}
$$
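
As a quick numeric check of the formulas above, the sketch below uses hypothetical log-odds values (not taken from any fitted model). It confirms that the two forms of the logistic function agree and that taking the log of the odds recovers `f(x)`.

```python
import numpy as np

f = np.array([-2.0, 0.0, 0.5, 3.0])               # hypothetical log-odds values

p_form1 = 1 / (1 + np.exp(-f))                    # first form of the logistic function
p_form2 = np.exp(f) / (1 + np.exp(f))             # second form

log_odds_back = np.log(p_form1 / (1 - p_form1))   # inverting the probability recovers f

print(p_form1)                                    # f = 0 maps to a probability of 0.5
print(np.allclose(p_form1, p_form2), np.allclose(log_odds_back, f))
```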

 

<font color=red>***Exercise 1: Your Codes Here***</font>  

Please complete the following two tasks:

- Step 1: Calculate the log-odds of `LEAVE(1)` for the 1st instance in the training set.

- Step 2: Calculate the probability of `LEAVE(1)` for the 1st instance as well. You may want to use the ``numpy.exp`` function to apply the natural exponential function (base e) to a value.

Note that (1) the intercept is in a 1D array and the coefficients are in a 2D array (1 × 2), and (2) you may use **.loc** or **.iloc** to select the first customer's **COLLEGE** and **INCOME** values. 
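
The array shapes and the two selection methods mentioned in the note can be explored on a toy model first. The sketch below uses hypothetical data with made-up feature names `A` and `B`; it only shows how to pull out scalars and a row, and leaves the exercise calculation to you.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data with two features and a non-default index
X_toy = pd.DataFrame({'A': [0, 1, 0, 1, 1, 0], 'B': [0.1, 0.9, 0.3, 0.7, 0.8, 0.2]},
                     index=[10, 11, 12, 13, 14, 15])
y_toy = np.array([0, 1, 0, 1, 1, 0])

clf = LogisticRegression().fit(X_toy, y_toy)

print(clf.intercept_.shape, clf.coef_.shape)   # 1D intercept, 2D coefficients (1 row, 2 columns)
w0 = clf.intercept_[0]                         # scalar intercept
w1, w2 = clf.coef_[0]                          # unpack the single row of coefficients

first_by_position = X_toy.iloc[0]              # first row by position
first_by_label = X_toy.loc[10]                 # same row, selected by its index label
```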

**Calculate the log_odds and probability of LEAVE(1) for all instances**

The ``log_odds`` values for all instances are returned by the ``decision_function`` method.  

-  ``log_odds`` is proportional to an instance's signed perpendicular distance to the decision hyperplane, and is also called the ``confidence score``.
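
The link between log-odds and distance can be sketched on hypothetical toy data: dividing the score by the length of the weight vector gives the signed perpendicular distance, and moving each point back along the normal by that distance lands on the decision boundary, where the log-odds are zero.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data with two features
X_toy = np.array([[0.0, 0.2], [0.1, 0.9], [0.4, 0.1], [0.6, 0.8], [0.9, 0.3], [1.0, 1.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)

w = clf.coef_[0]                               # weight vector (normal to the hyperplane)
scores = clf.decision_function(X_toy)          # log-odds, one per instance
distance = scores / np.linalg.norm(w)          # signed perpendicular distance

# Moving each point back along the unit normal by its distance lands on the boundary
on_boundary = X_toy - np.outer(distance, w / np.linalg.norm(w))
print(clf.decision_function(on_boundary))      # approximately zero for every row
```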

In [None]:
log_odds2 = m1.decision_function(X_train_sub)   

log_odds2             # the first value is the log-odds for the 1st instance

With the ``log_odds`` for all instances, we can calculate the ``leaving probability`` for all customers very efficiently. 

- The estimated leave probabilities should be the same as those returned by ``m1.predict_proba(X_train_sub)``, which are in the 2nd column of  **train_prob1**. 

In [None]:
prob2 = 1 /(1 + np.exp(-log_odds2))    # element-wise computation applies   

prob2                                  # same as train_prob1[:,1]    

## 2.4 Predict and Evaluate on Test Set

When making class predictions, the default threshold is 0.5. 

In [None]:
X_test_sub = X_test_scaled[['COLLEGE', 'INCOME']]     # Two features for m1

test_pred1 = m1.predict(X_test_sub)

accuracy_score(y_test, test_pred1)                   # (y_true, y_pred); same as m1.score(X_test_sub, y_test)

#  3.  Train m2 with all features 

<font color=red>***Exercise 2: Your Codes Here***</font>  

Please complete the following two steps: 
- Step 1: Train **m2** with all 11 features (i.e.,  **X_train_scaled**). Check the parameter values.
- Step 2: Evaluate **m2**'s performance on both the train and test sets. With more features used, is the model performance better?  
