# Exercise I: Logistic Regression vs. Linear Regression
1. Train a _logistic regression_ model on the `attrition_past` dataset using the following features:
    1. `activity_per_employee` $\Rightarrow$ **new feature** $\Rightarrow$ counts as _novelty_,
    2. `lastmonth_activity`,
    3. `lastyear_activity`, and
    4. `number_of_employees`
2. Count the rows that `exited` and the rows that did not.  
3. Create the new feature: `activity_per_employee`.
4. Split the dataset into a _train set_ and a _test set_ in an 80%/20% ratio.     
Use `random_state=212`.
5. Train the logistic regression model on the train set and predict on both the train set and the test set $\Rightarrow$ compute accuracy, precision, recall, and $F_1$ score on the _train set_ and the _test set_.
6. Compare the results on the _train set_ and the _test set_ when the dataset is instead fitted with a _linear regression_ model.    
You can compare     
      a. accuracy,   
      b. precision,    
      c. recall, and    
      d. $F_1$ score.
7. Which model is better?

## Answer:

1. Count the rows that `exited` and the rows that did not.  

In [18]:
import pandas as pd
df_attr = pd.read_csv("attrition_past.csv")
df_attr.head()

Unnamed: 0,corporation,lastmonth_activity,lastyear_activity,number_of_employees,exited
0,abcd,78,1024,12,1
1,asdf,14,2145,20,0
2,xyzz,182,3891,35,0
3,acme,101,10983,2,1
4,qwer,0,118,42,1


In [22]:
exit_counts = df_attr['exited'].value_counts().rename(index={0:'not_exited',1:'exited'})
print(exit_counts)

exited
exited        15
not_exited    11
Name: count, dtype: int64


2. Create the new feature: `activity_per_employee`.

In [25]:
df_attr['activity_per_employee'] = (df_attr['lastmonth_activity'] + df_attr['lastyear_activity']) / df_attr['number_of_employees']
df_attr[['activity_per_employee','lastmonth_activity','lastyear_activity','number_of_employees']].head()

Unnamed: 0,activity_per_employee,lastmonth_activity,lastyear_activity,number_of_employees
0,91.833333,78,1024,12
1,107.95,14,2145,20
2,116.371429,182,3891,35
3,5542.0,101,10983,2
4,2.809524,0,118,42
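
The ratio above divides by `number_of_employees`, so a row with zero employees would yield `inf` (or `NaN` if the numerator is also zero). None of the rows shown here has that problem. As a defensive sketch only, assuming a made-up frame `df_demo` with one hypothetical zero-headcount row, the zero can be turned into a missing value before dividing:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; the third row is a hypothetical zero-headcount case.
df_demo = pd.DataFrame({
    "lastmonth_activity": [78, 14, 5],
    "lastyear_activity": [1024, 2145, 60],
    "number_of_employees": [12, 20, 0],
})

# Same formula as the cell above: total activity divided by headcount.
total_activity = df_demo["lastmonth_activity"] + df_demo["lastyear_activity"]

# Replace a zero headcount with NaN so the division yields NaN instead of inf.
headcount = df_demo["number_of_employees"].replace(0, np.nan)
df_demo["activity_per_employee"] = total_activity / headcount

print(df_demo)
```

Rows with a missing ratio would then have to be dropped or imputed before fitting, since scikit-learn's `LogisticRegression` and `LinearRegression` do not accept `NaN` inputs.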


3. Split the dataset into a _train set_ and a _test set_ in an 80%/20% ratio.     
Use `random_state=212`.

In [26]:
from sklearn.model_selection import train_test_split
features = ['activity_per_employee','lastmonth_activity','lastyear_activity','number_of_employees']
X = df_attr[features]
y = df_attr['exited']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=212, stratify=y)
X_train.shape, X_test.shape

((20, 4), (6, 4))
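
The split above passes `stratify=y`, which asks `train_test_split` to keep the class proportions roughly equal in both parts. A minimal sketch of that effect, assuming a stand-in label series `y_demo` with the same 15/11 class counts printed earlier (the feature frame `X_demo` is a placeholder, not the real data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in labels with the class counts reported above: 15 exited, 11 not exited.
y_demo = pd.Series([1] * 15 + [0] * 11, name="exited")
# Placeholder feature frame; only the row count matters for this check.
X_demo = pd.DataFrame({"row_id": range(len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=212, stratify=y_demo
)

# Share of positives in the full data and in each part of the split.
print("full :", round(y_demo.mean(), 3))
print("train:", round(y_tr.mean(), 3))
print("test :", round(y_te.mean(), 3))
```

With only 26 rows the proportions cannot match exactly, but stratification keeps either part from ending up with only one class.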

4. Train the logistic regression model on the train set and predict on both the train set and the test set $\Rightarrow$ compute accuracy, precision, recall, and $F_1$ score on the _train set_ and the _test set_.

In [28]:
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

y_pred_train_log = log_reg.predict(X_train)
y_pred_test_log = log_reg.predict(X_test)

print("Train metrics:")
print("accuracy\t:", accuracy_score(y_train, y_pred_train_log))
print("precision\t:", precision_score(y_train, y_pred_train_log))
print("recall\t:", recall_score(y_train, y_pred_train_log))
print("f1 \t:", f1_score(y_train, y_pred_train_log))

print("\nTest metrics:")
print("accuracy\t:", accuracy_score(y_test, y_pred_test_log))
print("precision\t:", precision_score(y_test, y_pred_test_log))
print("recall\t:", recall_score(y_test, y_pred_test_log))
print("f1 \t:", f1_score(y_test, y_pred_test_log))

Train metrics:
accuracy	: 0.65
precision	: 0.7272727272727273
recall	: 0.6666666666666666
f1 	: 0.6956521739130435

Test metrics:
accuracy	: 0.8333333333333334
precision	: 1.0
recall	: 0.6666666666666666
f1 	: 0.8
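
The same four `print` calls recur for every model and split in this notebook. One possible way to cut the repetition is a small helper, sketched below under the hypothetical name `report_metrics`. The toy labels are an assumption: they are one arrangement (two true positives, one false negative, three true negatives) that is consistent with the test metrics printed above.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


def report_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels as a dict."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # zero_division=0 avoids a warning when no positives are predicted.
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }


# Toy labels (illustrative only, not the real test rows).
y_true_demo = [1, 1, 1, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 0, 0]

for name, value in report_metrics(y_true_demo, y_pred_demo).items():
    print(f"{name:<10}: {value:.4f}")
```

In the notebook itself the helper could be called as `report_metrics(y_test, y_pred_test_log)`, reusing the variables defined in the cell above.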


In [29]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

train_scores = lin_reg.predict(X_train)
test_scores = lin_reg.predict(X_test)

# LinearRegression returns continuous scores, so threshold at 0.5 to get 0/1 class labels
y_pred_train_lin = (train_scores >= 0.5).astype(int)
y_pred_test_lin = (test_scores >= 0.5).astype(int)
print("Train metrics:")
print("accuracy\t:", accuracy_score(y_train, y_pred_train_lin))
print("precision\t:", precision_score(y_train, y_pred_train_lin))
print("recall\t:", recall_score(y_train, y_pred_train_lin))
print("f1 \t:", f1_score(y_train, y_pred_train_lin))

print("\nTest metrics:")
print("accuracy\t:", accuracy_score(y_test, y_pred_test_lin))
print("precision\t:", precision_score(y_test, y_pred_test_lin))
print("recall\t:", recall_score(y_test, y_pred_test_lin))
print("f1 \t:", f1_score(y_test, y_pred_test_lin))


Train metrics:
accuracy	: 0.7
precision	: 0.75
recall	: 0.75
f1 	: 0.75

Test metrics:
accuracy	: 0.8333333333333334
precision	: 1.0
recall	: 0.6666666666666666
f1 	: 0.8


5. Compare the results on the _train set_ and the _test set_ when the dataset is instead fitted with a _linear regression_ model.    
You can compare     
      a. accuracy,   
      b. precision,    
      c. recall, and    
      d. $F_1$ score.

### Logistic Regression Results
**Train**
- Accuracy: 0.65  
- Precision: 0.7273  
- Recall: 0.6667  
- F1: 0.6957  

**Test**
- Accuracy: 0.8333  
- Precision: 1.0  
- Recall: 0.6667  
- F1: 0.8  


### Linear Regression Results
**Train**
- Accuracy: 0.70  
- Precision: 0.75  
- Recall: 0.75  
- F1: 0.75  

**Test**
- Accuracy: 0.8333  
- Precision: 1.0  
- Recall: 0.6667  
- F1: 0.8  


### Conclusion
- Train set: Linear Regression scores higher (F1 0.75 vs. 0.6957).  
- Test set: Tie (all metrics are identical).

6. Which model is better?

- In principle, Logistic Regression is the more appropriate model for binary classification, but
- on this dataset:
  - **Train**: Linear Regression is slightly higher (F1 0.75 vs. 0.696).
  - **Test**: Both models are identical on every metric.
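
One reason logistic regression is usually preferred for binary targets, even when the thresholded metrics tie, is that its outputs are probabilities bounded in $[0, 1]$, while linear regression scores are unbounded. A minimal sketch on synthetic one-feature data (the arrays `X_demo`, `y_demo`, and `X_new` are assumptions for illustration, not taken from `attrition_past`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic, non-separable binary data with a single feature.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0]])
y_demo = np.array([0, 0, 0, 1, 0, 1, 1, 1])

lin = LinearRegression().fit(X_demo, y_demo)
log = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Query points well outside and inside the training range.
X_new = np.array([[-10.0], [3.5], [20.0]])
lin_scores = lin.predict(X_new)             # unbounded real-valued scores
log_probs = log.predict_proba(X_new)[:, 1]  # probability of class 1

print("linear scores  :", np.round(lin_scores, 3))
print("logistic probs :", np.round(log_probs, 3))
```

The linear scores at the extreme query points fall below 0 and above 1, so they cannot be read as probabilities; the logistic outputs stay inside $[0, 1]$ by construction.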

# Exercise II: Predicting Crash Victim Survival
- The dataset `crash.csv` comes from an accident-survivors data portal for the USA (crash data for individual states can be searched), hosted by data.gov. 
- The dataset contains the passengers’ (not necessarily the driver’s) age, the speed of the vehicle (mph) at the time of impact, and the fate of the passengers after the crash (1 means survived, 0 means did not survive). 
- **Your tasks**:
    - Count how many passengers `survived` and how many did not.  
    - Split the dataset into a _train set_ and a _test set_ in an 80%/20% ratio.      
    Use `random_state=212`.
    - Now, use logistic regression to decide whether **age and speed can predict the survivability of the passengers** $\Rightarrow$ compute accuracy, precision, recall, and $F_1$ score on the _train set_ and the _test set_.
    - Compare the results on the _train set_ and the _test set_ when the dataset is instead fitted with a _linear regression_ model.    
    You can compare     
      a. accuracy,   
      b. precision,    
      c. recall, and    
      d. $F_1$ score.     
    - Which model is better?

## Answer:

Count how many passengers `survived` and how many did not.  

In [35]:
# Load crash dataset
df_crash = pd.read_csv("crash.csv")
df_crash.head()

Unnamed: 0,PassengerId,Age,Speed,Survived
0,1,22,65,0
1,2,38,50,1
2,3,26,45,1
3,4,35,55,1
4,5,35,85,0


In [37]:
surv_counts = df_crash['Survived'].value_counts().rename(index={0:'not_survived',1:'survived'})
print(surv_counts)

Survived
not_survived    10
survived        10
Name: count, dtype: int64


Split the dataset into a _train set_ and a _test set_ in an 80%/20% ratio.      
Use `random_state=212`.

In [38]:
from sklearn.model_selection import train_test_split
Xc = df_crash[['Age','Speed']]
yc = df_crash['Survived']
Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc, test_size=0.2, random_state=212, stratify=yc)
Xc_train.shape, Xc_test.shape

((16, 2), (4, 2))

Now, use logistic regression to decide whether **age and speed can predict the survivability of the passengers** $\Rightarrow$ compute accuracy, precision, recall, and $F_1$ score on the _train set_ and the _test set_.

In [39]:
log_c = LogisticRegression(max_iter=1000)
log_c.fit(Xc_train, yc_train)

yc_pred_train_log = log_c.predict(Xc_train)
yc_pred_test_log = log_c.predict(Xc_test)
print("Train metrics:")
print("accuracy\t:", accuracy_score(yc_train, yc_pred_train_log))
print("precision\t:", precision_score(yc_train, yc_pred_train_log))
print("recall\t:", recall_score(yc_train, yc_pred_train_log))
print("f1 \t:", f1_score(yc_train, yc_pred_train_log))

print("\nTest metrics:")
print("accuracy\t:", accuracy_score(yc_test, yc_pred_test_log))
print("precision\t:", precision_score(yc_test, yc_pred_test_log))
print("recall\t:", recall_score(yc_test, yc_pred_test_log))
print("f1 \t:", f1_score(yc_test, yc_pred_test_log))


Train metrics:
accuracy	: 0.875
precision	: 0.875
recall	: 0.875
f1 	: 0.875

Test metrics:
accuracy	: 0.75
precision	: 0.6666666666666666
recall	: 1.0
f1 	: 0.8


Compare the results on the _train set_ and the _test set_ when the dataset is instead fitted with a _linear regression_ model.    
You can compare     
      a. accuracy,   
      b. precision,    
      c. recall, and    
      d. $F_1$ score.     

### Linear Regression Results
**Train**
- Accuracy: 0.875  
- Precision: 0.875  
- Recall: 0.875  
- F1: 0.875  

**Test**
- Accuracy: 0.75  
- Precision: 0.6667  
- Recall: 1.0  
- F1: 0.8  
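
The cell that produced these linear regression numbers is not shown. A minimal sketch of how such a cell might look, assuming the same 0.5 threshold used for linear regression in the first exercise, follows. The frame `df_demo` is a stand-in for `crash.csv`: its first five rows mirror the `head()` output above and the rest are invented, so the printed metrics will not reproduce the values listed here.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Stand-in data with the same columns as crash.csv (mostly invented rows).
df_demo = pd.DataFrame({
    "Age":      [22, 38, 26, 35, 35, 54, 2, 27, 14, 4,
                 58, 20, 39, 14, 55, 31, 45, 19, 60, 28],
    "Speed":    [65, 50, 45, 55, 85, 90, 40, 35, 80, 30,
                 95, 75, 45, 85, 40, 70, 50, 90, 60, 35],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 1, 0, 1,
                 0, 0, 1, 0, 1, 0, 1, 0, 0, 1],
})

Xd = df_demo[["Age", "Speed"]]
yd = df_demo["Survived"]
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    Xd, yd, test_size=0.2, random_state=212, stratify=yd
)

lin_d = LinearRegression().fit(Xd_train, yd_train)

# Threshold the continuous scores at 0.5 to obtain 0/1 class labels.
pred_train = (lin_d.predict(Xd_train) >= 0.5).astype(int)
pred_test = (lin_d.predict(Xd_test) >= 0.5).astype(int)

for split, y_true, y_pred in [("Train", yd_train, pred_train),
                              ("Test", yd_test, pred_test)]:
    print(f"{split} metrics:")
    print("accuracy\t:", accuracy_score(y_true, y_pred))
    print("precision\t:", precision_score(y_true, y_pred, zero_division=0))
    print("recall\t:", recall_score(y_true, y_pred, zero_division=0))
    print("f1 \t:", f1_score(y_true, y_pred, zero_division=0))
```

In the notebook, the stand-in frame would be dropped and the existing `Xc_train`, `Xc_test`, `yc_train`, and `yc_test` variables used instead.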



### Logistic Regression Results
**Train**
- Accuracy: 0.875  
- Precision: 0.875  
- Recall: 0.875  
- F1: 0.875  

**Test**
- Accuracy: 0.75  
- Precision: 0.6667  
- Recall: 1.0  
- F1: 0.8  



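### Conclusion
- Train set: Tie (all metrics are identical).  
- Test set: Tie (all metrics are identical).

Which model is better?

Logistic regression. The metrics above are identical for both models, so the preference does not come from the scores on this dataset. It rests on logistic regression being designed for binary outcomes, with predicted probabilities bounded between 0 and 1.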

# Exercise III: Predicting Post Quality
- An automated answer-rating site marks each post in a community forum website as “good” or “bad” based on the quality of the post. 
- The CSV file (`quality.csv`) contains the various types of quality as measured by the tool. 
- The dataset contains the following quality measures:
    - `num_words`: number of words in the post
    - `num_characters`: number of characters in the post
    - `num_misspelled`: number of misspelled words
    - `bin_end_qmark`: if the post ends with a question mark
    - `num_interrogative`: number of interrogative words in the post
    - `bin_start_small`: if the answer starts with a lowercase letter ("1" means yes, otherwise no)
    - `num_sentences`: number of sentences per post
    - `num_punctuations`: number of punctuation symbols in the post
    - `label`: the label of the post ("G" for good and "B" for bad) as determined by the tool.
   
- Count the posts labeled `good` and the posts labeled `bad`. 
- Split the dataset into a _train set_ and a _test set_ in an 80%/20% ratio.      
    Use `random_state=212`.
- Create a logistic regression model to **predict the class label from the first eight attributes of the question set**. Evaluate the accuracy, precision, recall, and $F_1$ score of your model on the _train set_ and the _test set_.
- Compare the results on the _train set_ and the _test set_ when the dataset is instead fitted with a _linear regression_ model.    
- Which model is better?

## Answer:

Count the posts labeled `good` and the posts labeled `bad`. 

In [41]:
df_qual = pd.read_csv("quality.csv")
df_qual.head()

Unnamed: 0,S.No.,num_words,num_characters,num_misspelled,bin_end_qmark,num_interrogative,bin_start_small,num_sentences,num_punctuations,label
0,1,10,48,2,0,0,0,2,4,B
1,2,8,25,0,0,0,1,1,0,B
2,3,20,81,0,1,19,0,1,1,B
3,4,9,34,1,0,1,0,1,2,B
4,5,18,69,3,0,1,0,1,0,B


In [42]:
label_counts = df_qual['label'].value_counts().rename(index={'G':'good','B':'bad'})
print(label_counts)

label
bad     14
good    14
Name: count, dtype: int64

Split the dataset into a _train set_ and a _test set_ in an 80%/20% ratio.      
Use `random_state=212`.

In [43]:
feat_cols = ['num_words','num_characters','num_misspelled','bin_end_qmark','num_interrogative','bin_start_small','num_sentences','num_punctuations']
Xq = df_qual[feat_cols]


yq = (df_qual['label'] == 'G').astype(int)
Xq_train, Xq_test, yq_train, yq_test = train_test_split(Xq, yq, test_size=0.2, random_state=212, stratify=yq)
Xq_train.shape, Xq_test.shape

((22, 8), (6, 8))

Create a logistic regression model to **predict the class label from the first eight attributes of the question set**.     
Evaluate the accuracy, precision, recall, and $F_1$ score of your model on the _train set_ and the _test set_.

In [44]:
log_q = LogisticRegression(max_iter=1000)
log_q.fit(Xq_train, yq_train)

yq_pred_train_log = log_q.predict(Xq_train)
yq_pred_test_log = log_q.predict(Xq_test)
print("Train metrics:")
print("accuracy\t:", accuracy_score(yq_train, yq_pred_train_log))
print("precision\t:", precision_score(yq_train, yq_pred_train_log))
print("recall\t:", recall_score(yq_train, yq_pred_train_log))
print("f1 \t:", f1_score(yq_train, yq_pred_train_log))

print("\nTest metrics:")
print("accuracy\t:", accuracy_score(yq_test, yq_pred_test_log))
print("precision\t:", precision_score(yq_test, yq_pred_test_log))
print("recall\t:", recall_score(yq_test, yq_pred_test_log))
print("f1 \t:", f1_score(yq_test, yq_pred_test_log))


Train metrics:
accuracy	: 0.9545454545454546
precision	: 1.0
recall	: 0.9090909090909091
f1 	: 0.9523809523809523

Test metrics:
accuracy	: 0.5
precision	: 0.5
recall	: 0.6666666666666666
f1 	: 0.5714285714285714


Compare the results on the _train set_ and the _test set_ when the dataset is instead fitted with a _linear regression_ model.   
You can compare     
      a. accuracy,   
      b. precision,    
      c. recall, and    
      d. $F_1$ score.     


### Linear Regression Results
**Train**
- Accuracy: 0.8636  
- Precision: 0.8333  
- Recall: 0.9091  
- F1: 0.8696  

**Test**
- Accuracy: 0.5  
- Precision: 0.5  
- Recall: 0.6667  
- F1: 0.5714  



### Logistic Regression Results
**Train**
- Accuracy: 0.9545  
- Precision: 1.0  
- Recall: 0.9091  
- F1: 0.9524  

**Test**
- Accuracy: 0.5  
- Precision: 0.5  
- Recall: 0.6667  
- F1: 0.5714  



### Conclusion
- Train set: Logistic Regression is higher (F1 0.9524 vs. 0.8696).  
- Test set: Tie (all metrics are identical).

# Final Question
Across the 3 case studies you have worked through, which is better for binary classification problems: _logistic regression_ or _linear regression_?

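## Answer:

- For binary classification, **Logistic Regression** should in principle be better than Linear Regression.
- In the three exercises above, however, the two models perform **on par** on every test set, and on the train sets each one comes out ahead once (Linear Regression in the first exercise, Logistic Regression in the third).
- Logistic Regression is therefore still the better choice: it matches Linear Regression on the test metrics while being the model designed for binary outcomes.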
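
The comparison behind this answer can be laid out in one table. The sketch below copies the $F_1$ scores reported in the three exercises into a DataFrame (the frame name `f1_scores` and the short dataset labels are assumptions for illustration) and computes the difference per split:

```python
import pandas as pd

# F1 scores copied from the results reported in the three exercises.
f1_scores = pd.DataFrame(
    {
        "dataset": ["attrition", "attrition", "crash", "crash", "quality", "quality"],
        "split": ["train", "test"] * 3,
        "logistic": [0.6957, 0.8, 0.875, 0.8, 0.9524, 0.5714],
        "linear": [0.75, 0.8, 0.875, 0.8, 0.8696, 0.5714],
    }
)

# Positive values favor logistic regression, negative values favor linear regression.
f1_scores["logistic_minus_linear"] = (
    f1_scores["logistic"] - f1_scores["linear"]
).round(4)

print(f1_scores.to_string(index=False))
```

Every test row shows a difference of zero; the only nonzero differences are on the train splits, one in each direction.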

<center>
    <h1>The End</h1>
</center>    