See the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) of the ``LogisticRegression`` class in ``sklearn.linear_model`` for details.

In [299]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score  

# 1. Data Pre-processing

In [300]:
churn = pd.read_csv('churn.csv', sep=' ')     # modify your data path if needed

display(churn.head(), churn.dtypes)           # some variables are object (string or mixed) 

Unnamed: 0,COLLEGE,INCOME,OVERAGE,LEFTOVER,HOUSE,HANDSET_PRICE,OVER_15MINS_CALLS_PER_MONTH,AVERAGE_CALL_DURATION,REPORTED_SATISFACTION,REPORTED_USAGE_LEVEL,CONSIDERING_CHANGE_OF_PLAN,LEAVE
1,zero,31953,0,6,313378,161,0,4,unsat,little,no,STAY
2,one,36147,0,13,800586,244,0,6,unsat,little,considering,STAY
3,one,27273,230,0,305049,201,16,15,unsat,very_little,perhaps,STAY
4,zero,120070,38,33,788235,780,3,2,unsat,very_high,considering,LEAVE
5,one,29215,208,85,224784,241,21,1,very_unsat,little,never_thought,STAY


COLLEGE                        object
INCOME                          int64
OVERAGE                         int64
LEFTOVER                        int64
HOUSE                           int64
HANDSET_PRICE                   int64
OVER_15MINS_CALLS_PER_MONTH     int64
AVERAGE_CALL_DURATION           int64
REPORTED_SATISFACTION          object
REPORTED_USAGE_LEVEL           object
CONSIDERING_CHANGE_OF_PLAN     object
LEAVE                          object
dtype: object

In [301]:

print("==================================================")
print(churn.info())
print("==================================================")
print("Shape of churn: ",churn.shape)
print("==================================================")
print(churn.head().to_string())

print("==================================================")
churn_copy = churn.copy()    # .copy() makes an independent copy; plain assignment would only create an alias
print(churn_copy.head().to_string())

<class 'pandas.core.frame.DataFrame'>
Index: 20000 entries, 1 to 20000
Data columns (total 12 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   COLLEGE                      20000 non-null  object
 1   INCOME                       20000 non-null  int64 
 2   OVERAGE                      20000 non-null  int64 
 3   LEFTOVER                     20000 non-null  int64 
 4   HOUSE                        20000 non-null  int64 
 5   HANDSET_PRICE                20000 non-null  int64 
 6   OVER_15MINS_CALLS_PER_MONTH  20000 non-null  int64 
 7   AVERAGE_CALL_DURATION        20000 non-null  int64 
 8   REPORTED_SATISFACTION        20000 non-null  object
 9   REPORTED_USAGE_LEVEL         20000 non-null  object
 10  CONSIDERING_CHANGE_OF_PLAN   20000 non-null  object
 11  LEAVE                        20000 non-null  object
dtypes: int64(7), object(5)
memory usage: 2.0+ MB
None
Shape of churn:  (20000, 12)
  COLLEGE  INC

## 1.1 Convert Target Variable as Numbers (Optional)

**scikit-learn** accepts a non-numeric target variable. The classifier sorts the class labels and automatically assigns an integer to each class (i.e., `'LEAVE'` = `0`, `'STAY'` = `1`) before model training. If we keep the string target:

- The positive class is `STAY`, so the log-odds and probability returned by the linear equation are the log-odds and probability of ``STAY``.
- In the array of class probabilities, ``LEAVE`` is in the 1st column and `STAY` is in the 2nd column, as we saw in week 3.
- The parameter values will have the opposite sign (e.g., negative instead of positive) from what we have here.
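
The class ordering described above can be checked on a small example. This is a minimal sketch on made-up toy data (`X_toy`, `y_toy`, and `clf` are illustrative names, not part of the churn workflow); it only shows that the fitted classifier stores string labels in sorted order.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): one feature, string labels
X_toy = np.array([[0.1], [0.2], [0.8], [0.9]])
y_toy = np.array(["STAY", "STAY", "LEAVE", "LEAVE"])

clf = LogisticRegression().fit(X_toy, y_toy)

# Labels are stored in sorted order, so 'LEAVE' comes first
# and 'STAY' (the second label) is treated as the positive class.
print(clf.classes_)

# The columns of predict_proba follow clf.classes_:
# column 0 = P(LEAVE), column 1 = P(STAY)
print(clf.predict_proba(X_toy))

# Large feature values go with LEAVE here, so the coefficient
# (which describes the positive class STAY) is negative.
print(clf.coef_)
```

Checking `classes_` after fitting is a quick way to confirm which class the coefficients and the 2nd probability column refer to.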

Here we convert the target to numbers manually, as we want ``'LEAVE'`` to be the positive class (``1``) and ``'STAY'`` the negative class (``0``).

- We use the `pandas.Series.replace` function (see the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.replace.html)) to replace the target values with numbers. After conversion, the target's data type changes from `object` (string or mixed data type) to `int64`.



In [302]:
print("Original table:")
print("--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------")
print(churn.head().to_string())
print("--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------")


# Replace the values in column LEAVE: STAY -> 0, LEAVE -> 1.
# Assign the result back to the column: the chained form
# churn["LEAVE"].replace(..., inplace=True) raises a FutureWarning and stops working in pandas 3.0.
churn["LEAVE"] = churn["LEAVE"].replace({"STAY": 0, "LEAVE": 1}).astype("int64")
# churn.head()    # now LEAVE is considered as positive (1)
print("Updated table:")
print("--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------")
print(churn.head().to_string())
print("--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------")


Original table:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  COLLEGE  INCOME  OVERAGE  LEFTOVER   HOUSE  HANDSET_PRICE  OVER_15MINS_CALLS_PER_MONTH  AVERAGE_CALL_DURATION REPORTED_SATISFACTION REPORTED_USAGE_LEVEL CONSIDERING_CHANGE_OF_PLAN  LEAVE
1    zero   31953        0         6  313378            161                            0                      4                 unsat               little                         no   STAY
2     one   36147        0        13  800586            244                            0                      6                 unsat               little                considering   STAY
3     one   27273      230         0  305049            201                           16                     15                 unsat          very_little                    perhaps   STAY
4    zero  120070    



**Notes** 

If we use conditional selection instead, the target's data type remains `object`. scikit-learn cannot infer the label type from an `object` column that holds integers and typically raises an `Unknown label type` error, because it expects an `integer` or `category` data type for a numeric target. To address this, we need to cast the target to either `int64` or `category` ourselves. See the code below.
  
```python
churn.loc[churn['LEAVE']=='LEAVE','LEAVE'] = 1        # conditional selection
churn.loc[churn['LEAVE']=='STAY','LEAVE'] = 0
churn = churn.astype({'LEAVE': 'int64'})              # or churn = churn.astype({'LEAVE': 'category'}) 
```


## 1.2 Convert Non-Numeric Features as Numbers (Required)

As **scikit-learn** estimators only accept numeric predictors, we need to convert all string features to numbers.

String-valued features:
 
- COLLEGE  

- REPORTED_SATISFACTION 

- REPORTED_USAGE_LEVEL

- CONSIDERING_CHANGE_OF_PLAN 

In [303]:
# Convert COLLEGE as numbers 

# churn['COLLEGE'] = churn['COLLEGE'].replace({'zero':0, 'one':1})  # alternatively, use conditional selection

print("Original table:")
print("--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------")
print(churn.head().to_string())
print("--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------")

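# Assign back instead of using a chained inplace replace, which stops working in pandas 3.0
churn['COLLEGE'] = churn['COLLEGE'].replace({'zero': 0, 'one': 1}).astype('int64')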

print("Updated table:")
print("--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------")
print(churn.head().to_string())
print("--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------")


Original table:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  COLLEGE  INCOME  OVERAGE  LEFTOVER   HOUSE  HANDSET_PRICE  OVER_15MINS_CALLS_PER_MONTH  AVERAGE_CALL_DURATION REPORTED_SATISFACTION REPORTED_USAGE_LEVEL CONSIDERING_CHANGE_OF_PLAN  LEAVE
1    zero   31953        0         6  313378            161                            0                      4                 unsat               little                         no      0
2     one   36147        0        13  800586            244                            0                      6                 unsat               little                considering      0
3     one   27273      230         0  305049            201                           16                     15                 unsat          very_little                    perhaps      0
4    zero  120070    



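Here we use the `pandas.Categorical` class to convert the string data into ordered categorical variables and obtain their numeric codes.

- Check the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html) for details.
- `pd.Categorical()` converts a column of strings into a categorical variable.
- Its main parameters are `values`, `categories`, and `ordered`.
- `values`: the values to convert to categorical.
- `categories`: the list of allowed categories, which defines the conversion rule. It is usually the set of unique values in `values`, arranged in the order you want. Values that are not listed become `NaN`, categories that never occur in the data can still be declared, and the `categories` list always takes precedence.
- `ordered`: whether the categories are treated as ordered.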
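
To make the role of `categories` concrete, here is a minimal sketch on a made-up list of values (`raw` and its `'unknown'` entry are hypothetical, added only to show what happens to a value outside the declared categories).

```python
import pandas as pd

levels = ["very_unsat", "unsat", "avg", "sat", "very_sat"]   # order defined by hand
raw = ["unsat", "very_sat", "avg", "unknown"]                # 'unknown' is not in levels

cat = pd.Categorical(values=raw, categories=levels, ordered=True)

# .codes gives the position of each value in `levels`;
# a value outside `levels` is treated as missing and gets code -1.
print(cat.codes)
print(cat.isna())
```

A code of `-1` after conversion is therefore a sign that the `categories` list contains a typo or misses a level.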

In [304]:
# check unique values for the three features

# for col in ['REPORTED_SATISFACTION','REPORTED_USAGE_LEVEL','CONSIDERING_CHANGE_OF_PLAN']:
#     print(col, churn[col].unique())      

# Find the unique values first; they are needed later for the `categories` argument
# for i in churn['REPORTED_SATISFACTION']:
#     print(i)


print(churn['REPORTED_SATISFACTION'].unique())
print(churn['REPORTED_USAGE_LEVEL'].unique())
print(churn['CONSIDERING_CHANGE_OF_PLAN'].unique())

uni_satisfaction = churn['REPORTED_SATISFACTION'].unique()
uni_usage = churn['REPORTED_USAGE_LEVEL'].unique()
uni_considering = churn['CONSIDERING_CHANGE_OF_PLAN'].unique()

# print(churn[''].unique())



['unsat' 'very_unsat' 'very_sat' 'avg' 'sat']
['little' 'very_little' 'very_high' 'high' 'avg']
['no' 'considering' 'perhaps' 'never_thought' 'actively_looking_into_it']


In [305]:
# # Convert REPORTED_SATISFACTION as numeric (very_unsat = 0, unsat=1... very_sat =4)

# # .codes: an attribute of the Categorical object that returns an array of integers

# # step 1 - convert the variable as a categorical variable (a pandas 1D array)
# cat1 = pd.Categorical(values = churn['REPORTED_SATISFACTION'], 
#                       categories = ["very_unsat", "unsat", "avg", "sat","very_sat"],  # specify the order
#                       ordered = True)                                                # treat categories as ordered

# # step 2 - obtain the numeric codes and update original feature
# churn['REPORTED_SATISFACTION'] = cat1.codes           

In [306]:
################################################################################
# Step to convert:
# step 1 - convert the variable as a categorical variable (a pandas 1D array)
# step 2 - obtain the numeric codes and update original feature
################################################################################

cat1 = pd.Categorical(values= churn['REPORTED_SATISFACTION'],
                      # categories=uni_satisfaction,  # <- the order must be defined by hand; unique() returns values in order of appearance, so the ranking would be wrong
                      categories= ['very_unsat' , 'unsat' , 'avg' , 'sat' ,'very_sat'],
                      ordered=True
                      )
# cat1 holds the converted values of churn['REPORTED_SATISFACTION'].
# The original column is still unchanged at this point, so we have to assign the codes back manually.
# print(churn['REPORTED_SATISFACTION'].unique())  --> output: ['unsat' 'very_unsat' 'very_sat' 'avg' 'sat']
print("Output of cat1:")
print('------------------------------------------------------')
print(cat1,'\n')
print("Explanation:\npd.Categorical alone only returns the values together with the category definition; it does not convert them to numbers.\nTo get the numeric encoding, use the .codes attribute, which returns the integer code of each value.")

print()
print()

print("cat1.codes:")
print('------------------------------------------------------')
print(cat1.codes)
churn['REPORTED_SATISFACTION'] = cat1.codes



Output of cat1:
------------------------------------------------------
['unsat', 'unsat', 'unsat', 'unsat', 'very_unsat', ..., 'very_sat', 'very_sat', 'unsat', 'very_unsat', 'unsat']
Length: 20000
Categories (5, object): ['very_unsat' < 'unsat' < 'avg' < 'sat' < 'very_sat'] 

Explanation:
pd.Categorical alone only returns the values together with the category definition; it does not convert them to numbers.
To get the numeric encoding, use the .codes attribute, which returns the integer code of each value.


cat1.codes:
------------------------------------------------------
[1 1 1 ... 1 0 1]


In [307]:
# Convert REPORTED_USAGE_LEVEL as numeric

# step 1
cat2 = pd.Categorical(values = churn['REPORTED_USAGE_LEVEL'], 
                      categories = ["very_little", "little", "avg", "high","very_high"],   
                      ordered = True)

# step 2
churn['REPORTED_USAGE_LEVEL'] = cat2.codes     

In [308]:
# Convert CONSIDERING_CHANGE_OF_PLAN as numeric

# step 1
cat3 = pd.Categorical(values = churn['CONSIDERING_CHANGE_OF_PLAN'], 
                      categories=["never_thought", "no", "perhaps", "considering","actively_looking_into_it"], 
                      ordered= True)

# step 2
churn['CONSIDERING_CHANGE_OF_PLAN'] = cat3.codes

In [309]:
churn.head(5)     # check first 5 rows  

Unnamed: 0,COLLEGE,INCOME,OVERAGE,LEFTOVER,HOUSE,HANDSET_PRICE,OVER_15MINS_CALLS_PER_MONTH,AVERAGE_CALL_DURATION,REPORTED_SATISFACTION,REPORTED_USAGE_LEVEL,CONSIDERING_CHANGE_OF_PLAN,LEAVE
1,0,31953,0,6,313378,161,0,4,1,1,1,0
2,1,36147,0,13,800586,244,0,6,1,1,3,0
3,1,27273,230,0,305049,201,16,15,1,0,2,0
4,0,120070,38,33,788235,780,3,2,1,4,3,1
5,1,29215,208,85,224784,241,21,1,0,1,0,0


Check the data types again and make sure that a numeric target is either `int64` or `category`.
- If we use the original string target, then the `object` data type is fine.
- The end goal is to turn every column into numbers.
- Once all columns are numeric, the data can be used for training and we can move on to the training stage.
- So first check with `dtypes` that every column has become numeric.
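
The dtype check can also be automated. The sketch below uses a small made-up frame (`df` is hypothetical, with one column deliberately left unconverted) and `pandas.api.types.is_numeric_dtype` to list any column that is still non-numeric.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical mini-frame: COLLEGE was not converted yet
df = pd.DataFrame({"INCOME": [31953, 36147], "COLLEGE": ["zero", "one"]})

# Collect the names of all columns that are not numeric
non_numeric = [col for col in df.columns if not is_numeric_dtype(df[col])]
print(non_numeric)
```

An empty list means every column is ready for scikit-learn.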

In [310]:
churn.dtypes    # check data type: target is now int64 (discrete)

COLLEGE                        int64
INCOME                         int64
OVERAGE                        int64
LEFTOVER                       int64
HOUSE                          int64
HANDSET_PRICE                  int64
OVER_15MINS_CALLS_PER_MONTH    int64
AVERAGE_CALL_DURATION          int64
REPORTED_SATISFACTION           int8
REPORTED_USAGE_LEVEL            int8
CONSIDERING_CHANGE_OF_PLAN      int8
LEAVE                          int64
dtype: object

## 1.3 Split into Train and Test 
- To split the data, first identify the target variable, i.e. `y`.
- `X` is every column except the target, so we drop the target column.
- Then call `train_test_split` with `X` and `y`, and set the share of test data with `test_size` (e.g., `test_size=0.2`).
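
The steps above can be sketched on a tiny made-up frame. The column names and the `stratify` argument are assumptions for illustration; `stratify` is optional and is not used in the cell that follows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame: 10 rows, 2 features, balanced binary target
df = pd.DataFrame({
    "f1": np.arange(10),
    "f2": np.arange(10) * 2,
    "target": [0, 1] * 5,
})

X_demo = df.drop(columns="target")   # every column except the target
y_demo = df["target"]                # the target column

# stratify=y_demo keeps the class ratio the same in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
```

With 10 rows and `test_size=0.2`, 8 rows go to training and 2 to testing.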

In [311]:
X = churn.drop(columns = 'LEAVE')

y = churn['LEAVE']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  

display(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(16000, 11)

(4000, 11)

(16000,)

(4000,)

## 1.4 Scale Data

Fit the scaler with the training set, then apply the same scaler to transform both the train and the test set.
**Do NOT** fit the scaler with the test data: referencing the test data can lead to data leakage. 

- Remember to use scaled data for model training and prediction!
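
A minimal sketch of what "fit on train only" means, using made-up one-column arrays (`train`, `test`, and `sc` are illustrative names). Because the minimum and maximum come from the training data, test values outside that range are mapped outside [0, 1], which is expected.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [40.0]])          # both values lie outside the training range

sc = MinMaxScaler().fit(train)            # learns min=10 and max=30 from train only

train_scaled = sc.transform(train)        # falls inside [0, 1]
test_scaled = sc.transform(test)          # may fall outside [0, 1]
print(train_scaled.ravel(), test_scaled.ravel())
```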

In [312]:
scaler = MinMaxScaler()                          

X_train_scaled = scaler.fit_transform(X_train)   # fit on train and transform it in one step

X_test_scaled  = scaler.transform(X_test)        # apply the scaler on test

After scaling, the transformed data is a NumPy array without column names. We add the names back by converting the arrays to dataframes with column names.

In [313]:
X_train_scaled = pd.DataFrame(X_train_scaled, columns = X_train.columns)  

X_test_scaled = pd.DataFrame(X_test_scaled, columns = X_train.columns)

display(X_train_scaled.head(), X_test_scaled.head())

Unnamed: 0,COLLEGE,INCOME,OVERAGE,LEFTOVER,HOUSE,HANDSET_PRICE,OVER_15MINS_CALLS_PER_MONTH,AVERAGE_CALL_DURATION,REPORTED_SATISFACTION,REPORTED_USAGE_LEVEL,CONSIDERING_CHANGE_OF_PLAN
0,1.0,0.560291,0.198813,0.0,0.090536,0.343303,0.172414,0.714286,0.0,1.0,0.75
1,1.0,0.827085,0.005935,0.0,0.267077,0.526658,1.0,0.928571,0.0,0.5,0.75
2,0.0,0.924286,0.635015,0.707865,0.57041,0.556567,0.689655,0.0,1.0,0.25,0.0
3,1.0,0.109198,0.620178,0.247191,0.850993,0.093628,0.758621,0.357143,0.5,0.75,0.75
4,1.0,0.452864,0.11276,0.0,0.034099,0.188557,0.0,0.785714,0.0,1.0,0.25


Unnamed: 0,COLLEGE,INCOME,OVERAGE,LEFTOVER,HOUSE,HANDSET_PRICE,OVER_15MINS_CALLS_PER_MONTH,AVERAGE_CALL_DURATION,REPORTED_SATISFACTION,REPORTED_USAGE_LEVEL,CONSIDERING_CHANGE_OF_PLAN
0,1.0,0.033484,0.560831,0.325843,0.103968,0.074122,0.413793,0.0,0.0,0.75,0.25
1,1.0,0.191894,0.551929,0.0,0.128988,0.042913,0.758621,0.714286,0.25,0.25,0.75
2,1.0,0.260613,0.005935,0.595506,0.685019,0.313394,0.034483,0.0,0.25,0.0,1.0
3,0.0,0.77435,0.133531,0.101124,0.173329,0.495449,0.172414,0.357143,0.0,0.25,0.75
4,1.0,0.613997,0.005935,0.292135,0.104067,0.907672,0.034483,0.0,0.0,1.0,1.0


# 2. Train m1 with only two Features  

##  2.1 Model Training 

In [314]:
X_train_sub = X_train_scaled[['COLLEGE', 'INCOME']]    # 2D features

m1 = LogisticRegression().fit(X_train_sub, y_train)   

display(m1.intercept_, m1.coef_, m1.feature_names_in_)  # Note: intercept_ is a 1D array, coef_ is a 2D array

array([-0.31123949])

array([[0.05295155, 0.60633535]])

array(['COLLEGE', 'INCOME'], dtype=object)

## 2.2  Predict and Evaluate on Train Set

**Predict class labels** 

- When making predictions, the default cut-off point for class determination is 50%. 
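
To illustrate the cut-off, the sketch below fits a classifier on randomly generated toy data (all names and the 0.7 threshold are hypothetical) and compares `predict` with a manual threshold on the predicted probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: the label depends on the first feature plus noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression().fit(X_toy, y_toy)
p_pos = clf.predict_proba(X_toy)[:, 1]      # probability of class 1

default_pred = clf.predict(X_toy)           # uses the default 50% cut-off
manual_pred = (p_pos > 0.5).astype(int)     # same rule written out by hand
strict_pred = (p_pos > 0.7).astype(int)     # a stricter, hypothetical cut-off

print((default_pred == manual_pred).all(), manual_pred.sum(), strict_pred.sum())
```

Raising the cut-off can only reduce the number of instances labelled as the positive class.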

In [315]:
train_pred1 = m1.predict(X_train_sub)  

train_pred1

array([1, 1, 1, ..., 0, 0, 1], shape=(16000,))

**Estimate class probabilities**  

- The columns follow the class order 0 (`STAY`), 1 (`LEAVE`), so the probability of 1 (i.e., `LEAVE`) is in the 2nd column. 

In [316]:
train_prob1 = m1.predict_proba(X_train_sub)

train_prob1     # probability of 0 (STAY), 1 (LEAVE)

array([[0.47965222, 0.52034778],
       [0.43949726, 0.56050274],
       [0.43802348, 0.56197652],
       ...,
       [0.57495841, 0.42504159],
       [0.54782924, 0.45217076],
       [0.47444788, 0.52555212]], shape=(16000, 2))

**Check model accuracy**

In [317]:
accuracy_score(y_train, train_pred1)     # signature is (y_true, y_pred); same as m1.score(X_train_sub, y_train)

0.54325
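
For reference, accuracy is simply the share of predictions that match the true labels. A minimal sketch with made-up label arrays (`y_true` and `y_pred` are hypothetical):

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])        # 3 of the 5 predictions are correct

acc = accuracy_score(y_true, y_pred)      # arguments are (y_true, y_pred)
manual_acc = np.mean(y_true == y_pred)    # fraction of matching entries
print(acc, manual_acc)
```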

## 2.3  The log_odds and Probabilities of LEAVE

Below are the formulas you may need to use.


- Estimate the ``log_odds`` of LEAVE (i.e., ``f(x)``) with the linear function. 

$$
f(x) = w_0 + w_1 \cdot \text{COLLEGE} + w_2 \cdot \text{INCOME}
$$


- Estimate  the ``probability`` of LEAVE with the logistic function. 

$$
P(Y=1 | x) = \frac{1}{1 + e^{-f(x)}} \quad \text{or} \quad P(Y=1 |x) = \frac{e^{f(x)}}{1 + e^{f(x)}}
$$
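
As a quick numeric check of the two formulas, the sketch below uses hypothetical weights and feature values (they are not the fitted parameters of **m1**) and shows that both forms of the logistic function give the same probability, and that the log of the odds recovers `f(x)`.

```python
import numpy as np

# Hypothetical parameters and feature values, for illustration only
w0, w1, w2 = -0.3, 0.05, 0.6
college, income = 1.0, 0.5

f_x = w0 + w1 * college + w2 * income        # log-odds from the linear function
p_a = 1 / (1 + np.exp(-f_x))                 # first form of the logistic function
p_b = np.exp(f_x) / (1 + np.exp(f_x))        # second form, algebraically identical
log_odds_back = np.log(p_a / (1 - p_a))      # log of the odds recovers f(x)

print(f_x, p_a, p_b, log_odds_back)
```

A positive log-odds value always corresponds to a probability above 0.5, and a negative one to a probability below 0.5.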

 

<font color=red>***Exercise 1: Your Codes Here***</font>  

Please complete the following two tasks:

- Step 1: Calculate the log-odds of `LEAVE(1)` for the 1st instance in the training set.

- Step 2: Calculate the probability of `LEAVE(1)` for the 1st instance as well. You may want to use the ``numpy.exp`` function, which computes the natural exponential (base e) of a value.

Note that (1) the intercept is stored in a 1D array and the coefficients in a 2D array (of shape 1×2), and (2) you may use **.loc** or **.iloc** to select the first customer's **COLLEGE** and **INCOME** values. 

**Calculate the log_odds and probability of LEAVE(1) for all instances**

The ``log_odds`` values for all instances are returned by the ``decision_function`` method.  

-  The ``log_odds`` is proportional to an instance's signed perpendicular distance to the decision hyperplane; it is also called the ``confidence score``.
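
The relationship between ``decision_function``, the linear equation, and the distance to the hyperplane can be sketched on randomly generated toy data (all names below are hypothetical and independent of **m1**).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: two features in [0, 1], label determined by the second feature
rng = np.random.default_rng(1)
X_toy = rng.uniform(size=(100, 2))
y_toy = (X_toy[:, 1] > 0.5).astype(int)

clf = LogisticRegression().fit(X_toy, y_toy)

# Linear equation written out by hand: w0 + w1*x1 + w2*x2 for every row
manual = X_toy @ clf.coef_.ravel() + clf.intercept_[0]
builtin = clf.decision_function(X_toy)

# Dividing by the norm of the weight vector gives the signed distance to the hyperplane
distance = builtin / np.linalg.norm(clf.coef_)

print(np.allclose(manual, builtin), distance[:3])
```

Instances with a positive score lie on the positive side of the hyperplane and are predicted as class 1.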

In [318]:
log_odds2 = m1.decision_function(X_train_sub)   

log_odds2             # the first value is the log-odds for the 1st instance

array([ 0.0814361 ,  0.24320265,  0.24918755, ..., -0.30211069,
       -0.19190374,  0.1022976 ], shape=(16000,))

With the ``log_odds`` for all instances, we can calculate the ``leaving probability`` for all customers very efficiently. 

- The estimated leave probabilities should be the same as those returned by ``m1.predict_proba(X_train_sub)``, which are in the 2nd column of **train_prob1**. 

In [319]:
prob2 = 1 /(1 + np.exp(-log_odds2))    # element-wise computation applies   

prob2                                  # same as train_prob1[:,1]    

array([0.52034778, 0.56050274, 0.56197652, ..., 0.42504159, 0.45217076,
       0.52555212], shape=(16000,))

## 2.4 Predict and Evaluate on Test Set

When making class predictions, the default threshold is 0.5. 

In [None]:
X_test_sub = X_test_scaled[['COLLEGE', 'INCOME']]     # Two features for m1

test_pred1 = m1.predict(X_test_sub)

accuracy_score(y_test, test_pred1)                   # signature is (y_true, y_pred); same as m1.score(X_test_sub, y_test)

0.5545

#  3.  Train m2 with all features 

<font color=red>***Exercise 2: Your Codes Here***</font>  

Please complete the following two steps: 
- Step 1: Train **m2** with all 11 features (i.e., **X_train_scaled**). Check the parameter values.
- Step 2: Evaluate **m2**'s performance on both the train and test sets. With more features used, is the model performance better?  
