# Logistic Regression for a Binary Classification Task

## Basic Formulas
The fundamental difference between simple linear regression and multiple linear regression is the number of features:
- In simple linear regression, x is a single feature
- In multiple linear regression, X is a set of several features

### Simple Linear Regression
#### $y = \alpha + \beta x$
#### $g(x) = \alpha + \beta x$
### Multiple Linear Regression
#### $y = \alpha + \beta_{1}x_{1} + \beta_{2}x_{2} + \dots + \beta_{n}x_{n}$
#### $g(X) = \alpha + \beta X$
### Logistic Regression
#### $g(X) = \mathrm{sigmoid}(\alpha + \beta X)$
#### $\mathrm{sigmoid}(x) = \frac{1}{1+\exp(-x)}$

Important notes:
- When plotted, the sigmoid curve has the shape of the letter S
- The x-axis has no fixed range: the input can be any real number
- Values on the y-axis always lie between 0 and 1
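
The notes above can be checked numerically. The following is a minimal sketch using only the Python standard library. The coefficients `alpha` and `beta` and the feature values are hypothetical, chosen for illustration only, and are not fitted to any data.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def g(features, alpha, beta):
    """Logistic regression output: sigmoid(alpha + beta . X)."""
    linear = alpha + sum(b * x for b, x in zip(beta, features))
    return sigmoid(linear)


# Any real number is a valid input; the output stays between 0 and 1.
for x in (-6, -2, 0, 2, 6):
    print(f"sigmoid({x:>2}) = {sigmoid(x):.4f}")

# Hypothetical coefficients and features, for illustration only.
alpha = -1.0
beta = [0.8, -0.5]
print(g([2.0, 1.0], alpha, beta))
```

The output rises monotonically with the input and equals exactly 0.5 at an input of 0, which is what gives the curve its S shape.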

## SMS Spam Collection Data Set
- The dataset below is an example of an imbalanced dataset: it contains far more ham messages than spam messages

In [2]:
import pandas as pd

# The file is tab-separated and has no header row, so column names are set here
df = pd.read_csv('./SMSSpamCollection',
                 sep='\t',
                 header=None,
                 names=['label', 'sms'])
df.head()

|   | label | sms |
|---|-------|-----|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
df['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64
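
To quantify the imbalance, the counts above can be converted into proportions. This is a small sketch in plain Python; the "always ham" baseline is added here for illustration and is not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {"ham": 4825, "spam": 747}
total = sum(counts.values())

# Share of each class in the whole dataset.
for label, n in counts.items():
    print(f"{label}: {n / total:.1%}")

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(counts.values()) / total
print(f"always-ham accuracy: {majority_baseline:.1%}")
```

Because a model that always predicts ham would already be right on roughly 87% of the messages, accuracy alone is a weak measure of quality on this dataset.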

## Training and Testing Datasets
Important notes:
- LabelBinarizer is used to convert the label values into 0 or 1 (here ham becomes 0 and spam becomes 1)
- The .ravel() function is used to flatten a multi-dimensional array into a 1-dimensional array
- A test_size of 0.25 means that 25% of the data is used as the testing set, while the remaining 75% is used as the training set
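
The first two notes can be illustrated on a tiny example. This is a minimal sketch with made-up labels, assuming scikit-learn is installed; the variable names `column` and `flat` are introduced here for illustration.

```python
from sklearn.preprocessing import LabelBinarizer

# Toy labels, invented for illustration.
labels = ["ham", "spam", "ham", "ham"]

lb = LabelBinarizer()
column = lb.fit_transform(labels)  # 2-D column array, one row per label
flat = column.ravel()              # flattened into a 1-D array

# Classes are stored in sorted order: the first maps to 0, the second to 1.
print(lb.classes_)
print(column.shape, flat.shape)
print(flat)
```

For a two-class problem, `fit_transform` returns a single column, so `.ravel()` is what turns it into the flat target vector that estimators expect.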

In [4]:
from sklearn.preprocessing import LabelBinarizer
X = df['sms'].values
y = df['label'].values

lb = LabelBinarizer()
y = lb.fit_transform(y).ravel()
lb.classes_

array(['ham', 'spam'], dtype='<U4')

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                   y,
                                                   test_size=0.25,
                                                   random_state=0)
print(X_train,'\n')
print(y_train)

['Its going good...no problem..but still need little experience to understand american customer voice...'
 'U have a secret admirer. REVEAL who thinks U R So special. Call 09065174042. To opt out Reply REVEAL STOP. 1.50 per msg recd. Cust care 07821230901'
 'Ok...' ...
 "For ur chance to win a £250 cash every wk TXT: ACTION to 80608. T's&C's www.movietrivia.tv custcare 08712405022, 1x150p/wk"
 'R U &SAM P IN EACHOTHER. IF WE MEET WE CAN GO 2 MY HOUSE'
 'Mm feeling sleepy. today itself i shall get that dear'] 

[0 1 0 ... 1 0 0]
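
The effect of `test_size` and `random_state` is easier to see on a small example. The sketch below uses eight made-up samples (an assumption for illustration, not data from the notebook).

```python
from sklearn.model_selection import train_test_split

# Eight toy samples with alternating labels, invented for illustration.
X_toy = [f"message {i}" for i in range(8)]
y_toy = [0, 1, 0, 1, 0, 1, 0, 1]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

# 25% of 8 samples go to the test set, the remaining 75% to the training set.
print(len(X_tr), len(X_te))

# Using the same random_state again reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)
print(X_te == X_te2)
```

Fixing `random_state` makes the shuffle reproducible, which is why the split shown above can be obtained again on a later run.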


## Feature Extraction with TF-IDF

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Learn the vocabulary and IDF weights from the training set only,
# then reuse them to transform the test set
vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(X_train_tfidf)

  (0, 6903)	0.3591386422223876
  (0, 2006)	0.2898082580285881
  (0, 900)	0.4114867709157148
  (0, 6739)	0.3546359942830148
  (0, 2554)	0.3825278811525034
  (0, 3926)	0.3126721340000456
  (0, 4453)	0.2297719954323795
  (0, 5123)	0.308974289326673
  (0, 3007)	0.21421364306658514
  (0, 2997)	0.23173982975834367
  (1, 36)	0.28902673040368515
  (1, 1548)	0.18167737976542422
  (1, 2003)	0.2711077935907125
  (1, 5301)	0.2711077935907125
  (1, 4358)	0.17341410292348694
  (1, 532)	0.20186022353306565
  (1, 6131)	0.16142609035094446
  (1, 5394)	0.16464655071448758
  (1, 4677)	0.24039776602646504
  (1, 216)	0.28902673040368515
  (1, 6013)	0.20089911182610476
  (1, 6472)	0.24039776602646504
  (1, 5441)	0.5009783758205715
  (1, 799)	0.25048918791028574
  (1, 5642)	0.24344998442301355
  :	:
  (4176, 343)	0.2811068572055718
  (4176, 107)	0.29968668460649284
  (4176, 2004)	0.25589560236817055
  (4176, 4350)	0.29968668460649284
  (4176, 637)	0.29968668460649284
  (4176, 7114)	0.4512018097459442
  (4176
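
Each line of the sparse output above has the form `(row, column)  weight`: the row is the index of the message, the column is the index of a term in the learned vocabulary, and the value is that term's TF-IDF weight in that message. The following minimal sketch shows the same structure on a two-sentence toy corpus (invented for illustration), small enough to inspect by hand.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration.
docs = [
    "win the free prize",
    "the meeting moved to monday",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# vocabulary_ maps each kept term to its column index.
print(vectorizer.vocabulary_)

# One row per document, one column per vocabulary term.
print(tfidf.shape)

# Non-zero entries only, printed as "(row, column)  weight".
print(tfidf)

# With the default norm="l2", each non-empty row has unit length.
row_norms = np.sqrt(tfidf.multiply(tfidf).sum(axis=1))
print(row_norms)
```

Stop words such as "the" are dropped before weighting, so they never receive a column in the matrix.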

## Binary Classification with Logistic Regression

In [10]:
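from sklearn.linear_model import LogisticRegression

# Fit a logistic regression model with default hyperparameters
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)
y_pred = model.predict(X_test_tfidf)

# Show the first five predictions next to their messages (0 = ham, 1 = spam)
for pred, sms in zip(y_pred[:5], X_test[:5]):
    print(f'PRED: {pred} - SMS: {sms}\n')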

PRED: 0 - SMS: Storming msg: Wen u lift d phne, u say "HELLO" Do u knw wt is d real meaning of HELLO?? . . . It's d name of a girl..! . . . Yes.. And u knw who is dat girl?? "Margaret Hello" She is d girlfrnd f Grahmbell who invnted telphone... . . . . Moral:One can 4get d name of a person, bt not his girlfrnd... G o o d n i g h t . . .@

PRED: 0 - SMS: <Forwarded from 448712404000>Please CALL 08712404000 immediately as there is an urgent message waiting for you.

PRED: 0 - SMS: And also I've sorta blown him off a couple times recently so id rather not text him out of the blue looking for weed

PRED: 0 - SMS: Sir Goodmorning, Once free call me.

PRED: 0 - SMS: All will come alive.better correct any good looking figure there itself..
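
The predicted labels can be tied back to the formula $g(X) = \mathrm{sigmoid}(\alpha + \beta X)$ from the first section. The sketch below fits a model on a tiny one-feature toy dataset (invented for illustration) and checks that applying the sigmoid to the linear score reproduces the probabilities reported by scikit-learn for a binary problem.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature data, invented for illustration.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

clf = LogisticRegression()
clf.fit(X_toy, y_toy)

X_new = np.array([[0.5], [1.4], [2.5]])

# decision_function returns the linear part: intercept_ + coef_ . x
linear = clf.decision_function(X_new)

# Applying the sigmoid gives the predicted probability of class 1.
prob_manual = 1.0 / (1.0 + np.exp(-linear))
prob_sklearn = clf.predict_proba(X_new)[:, 1]
print(np.allclose(prob_manual, prob_sklearn))

# predict() reports class 1 when that probability exceeds 0.5.
print(clf.predict(X_new), (prob_sklearn > 0.5).astype(int))
```

In the same way, `model.predict_proba(X_test_tfidf)[:, 1]` would give the estimated spam probability for each test message instead of only the 0/1 label.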

