# Logistic Regression for a Binary Classification Task

## Basic Formulas
The fundamental difference between simple linear regression and multiple linear regression is:
- In simple linear regression, $x$ is a single feature
- In multiple linear regression, $X$ is a set of multiple features

### Simple Linear Regression
#### $y = \alpha + \beta x$
#### $g(x) = \alpha + \beta x$
### Multiple Linear Regression
#### $y = \alpha + \beta_{1}x_{1} + \beta_{2}x_{2} + \dots + \beta_{n}x_{n}$
#### $g(X) = \alpha + \beta X$
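
As a minimal sketch of the multiple linear regression formula, $\beta X$ can be computed as a dot product between a coefficient vector and a feature vector. The values of `alpha`, `beta`, and `x` below are made up purely for illustration and do not come from any fitted model.

```python
import numpy as np

# Hypothetical parameters and features, chosen only for illustration
alpha = 1.0                        # intercept
beta = np.array([2.0, -1.0, 0.5])  # one coefficient per feature
x = np.array([3.0, 4.0, 2.0])      # one sample with three features

# g(X) = alpha + beta_1*x_1 + ... + beta_n*x_n, written as a dot product
g_dot = alpha + np.dot(beta, x)

# The same value computed term by term, to mirror the expanded formula
g_sum = alpha + sum(b * v for b, v in zip(beta, x))

print(g_dot, g_sum)
```

With a single feature, the same expression reduces to the simple linear regression formula.
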
### Logistic Regression
#### $g(X) = \mathrm{sigmoid}(\alpha + \beta X)$
#### $\mathrm{sigmoid}(x) = \frac{1}{1 + \exp(-x)}$

Important notes:
- The sigmoid curve is S-shaped
- There is no fixed range for values on the x-axis: the input can be any real number
- Values on the y-axis always lie between 0 and 1
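
The notes above can be checked numerically with a small sketch. The function name `sigmoid`, the input grid `z`, and the parameters `alpha`, `beta`, and `x` are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Any real number is a valid input; this grid is an arbitrary choice
z = np.linspace(-10, 10, 201)
s = sigmoid(z)

print(s.min(), s.max())   # stays strictly between 0 and 1 on this grid
print(sigmoid(0.0))       # the midpoint of the S-curve

# Logistic regression: g(X) = sigmoid(alpha + beta X), hypothetical values
alpha = -1.0
beta = np.array([0.8, -0.4])
x = np.array([2.0, 1.0])
p = sigmoid(alpha + np.dot(beta, x))
print(p)
```

Because the output lies between 0 and 1, `p` can be read as the estimated probability of the positive class. A common convention is to predict the positive class when `p` is at least 0.5.
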

## Dataset: SMS Spam Collection Data Set
- The dataset below is an example of an imbalanced dataset, because the number of ham messages is far greater than the number of spam messages

In [2]:
import pandas as pd

df = pd.read_csv(
    './SMSSpamCollection',
    sep='\t',                # fields are tab-separated
    header=None,             # the file has no header row
    names=['label', 'sms'],  # so column names are assigned here
)
df.head()

|   | label | sms |
|---|-------|-----|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
df['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64
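
The imbalance can be quantified directly from the counts shown above. This sketch rebuilds the counts by hand as a `pandas` Series (the variable names are illustrative) rather than reloading the file.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({'ham': 4825, 'spam': 747})

# Share of each class in the whole dataset
proportions = counts / counts.sum()
print(proportions.round(3))
```

Roughly 87% of the messages are ham, so a model that always predicts ham would already be right about 87% of the time. Accuracy alone can therefore look better than the model really is on this data. With `df` loaded, `df['label'].value_counts(normalize=True)` should give the same proportions.
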

## Training and Testing Datasets
Important notes:
- LabelBinarizer is used to convert the label values (ham and spam) into 0 or 1
- The .ravel() function is used to flatten a multi-dimensional array into a 1-dimensional array
- A test_size of 0.25 means that 25% of the data is used as the testing set, while the remaining 75% is used as the training set

In [4]:
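from sklearn.preprocessing import LabelBinarizer

X = df['sms'].values    # message texts (features)
y = df['label'].values  # 'ham' / 'spam' labels (targets)

lb = LabelBinarizer()
y = lb.fit_transform(y).ravel()  # encode labels as 0/1, then flatten to 1-D
lb.classes_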

array(['ham', 'spam'], dtype='<U4')
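
To make the effect of `LabelBinarizer` and `.ravel()` concrete, here is a small sketch on a toy label array. The array `toy_labels` and the name `lb_toy` are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Toy labels, invented for this example
toy_labels = np.array(['ham', 'spam', 'ham', 'ham', 'spam'])

lb_toy = LabelBinarizer()
encoded = lb_toy.fit_transform(toy_labels)  # 2-D column of 0/1 values
flat = encoded.ravel()                      # flattened to 1-D

print(lb_toy.classes_)   # classes are sorted, so index 0 is 'ham'
print(encoded.shape, flat.shape)
print(flat)
```

Because `classes_` is ordered as `['ham', 'spam']`, ham is encoded as 0 and spam as 1.
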

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,  # 25% of the rows go to the testing set
    random_state=0,  # fixed seed so the split is reproducible
)
print(X_train, '\n')
print(y_train)

['Its going good...no problem..but still need little experience to understand american customer voice...'
 'U have a secret admirer. REVEAL who thinks U R So special. Call 09065174042. To opt out Reply REVEAL STOP. 1.50 per msg recd. Cust care 07821230901'
 'Ok...' ...
 "For ur chance to win a £250 cash every wk TXT: ACTION to 80608. T's&C's www.movietrivia.tv custcare 08712405022, 1x150p/wk"
 'R U &SAM P IN EACHOTHER. IF WE MEET WE CAN GO 2 MY HOUSE'
 'Mm feeling sleepy. today itself i shall get that dear'] 

[0 1 0 ... 1 0 0]
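
The effect of `test_size` and `random_state` can be seen on a small synthetic array. The arrays `X_toy` and `y_toy` below are placeholders, not the SMS data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 20 samples with alternating 0/1 labels
X_toy = np.arange(20)
y_toy = np.array([0, 1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)
print(len(X_tr), len(X_te))   # 75% / 25% of the 20 samples

# Repeating the call with the same random_state gives the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)
print(np.array_equal(X_te, X_te2))
```

Applied to the 5,572 messages counted earlier, the same proportion should give 4,179 training messages and 1,393 testing messages.
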
