# Logistic Regression

Gives us a number between 0 and 1
- a linear model (like the one in OLS regression) + the **logistic** (sigmoid) function, which is the inverse of the **logit**
- the **logistic** function squashes any real number into a value between 0 and 1
- the output is the estimated probability of an observation being in the positive class
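- Pros
    - fast to train, very fast to predict
    - outputs probabilities of being in the positive class
    - more interpretable than many other classification models (each coefficient shows a predictor's direction and strength)
- Cons
    - less interpretable than some other classification models (coefficients are in log-odds, which are harder to explain than, say, a decision tree's rules)
    - assumes the x predictors are independent of one another (little multicollinearity)
    - multi-class classification is more complicated (e.g. **one-vs-rest**)
- Great baseline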
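
The link between the linear model and the 0-to-1 output can be sketched in a few lines. This is a minimal illustration, not how sklearn implements it; the function names `sigmoid` and `log_odds` and the sample inputs are chosen here for demonstration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

def log_odds(p):
    """Logit function: the inverse of the logistic function."""
    return np.log(p / (1 - p))

z = np.array([-5.0, 0.0, 5.0])   # hypothetical outputs of the linear part
p = sigmoid(z)

print(p)             # large negative -> near 0, zero -> 0.5, large positive -> near 1
print(log_odds(p))   # recovers the original z values (up to rounding)
```

The model learns the coefficients of the linear part; the logistic function only converts that linear score into a probability.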

## Mini Exercise
1. Load the Titanic dataset that you've put together from previous lessons.
2. Split your data into training and test sets.
3. Fit a logistic regression model on your training data using sklearn's
   `linear_model.LogisticRegression` class. Use `fare` and `pclass` as the
   predictors.
4. Use the model's `.predict` method. What is the output?
5. Use the model's `.predict_proba` method. What is the output? Why do you
   think it is shaped like this?
6. Evaluate your model's predictions on the test data set. How accurate
   is the model? How does changing the threshold affect this?

In [30]:
import pandas as pd
import numpy as np

import prepare
import acquire
from sklearn.linear_model import LogisticRegression



In [31]:
df = acquire.get_titanic_data()

In [32]:
train, test = prepare.prep_titanic(df)

train.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,class,embark_town,alone,embarked C,embarked Q,embarked S,age_scaled,fare_scaled
329,329,1,1,female,16.0,0,1,57.9792,First,Cherbourg,0,1.0,0.0,0.0,0.195778,0.113168
749,749,0,3,male,31.0,0,0,7.75,Third,Queenstown,1,0.0,1.0,0.0,0.384267,0.015127
203,203,0,3,male,45.5,0,0,7.225,Third,Cherbourg,1,1.0,0.0,0.0,0.566474,0.014102
421,421,0,3,male,21.0,0,0,7.7333,Third,Queenstown,1,0.0,1.0,0.0,0.258608,0.015094
97,97,1,1,male,23.0,0,1,63.3583,First,Cherbourg,0,1.0,0.0,0.0,0.28374,0.123667


In [42]:
logit = LogisticRegression(random_state=123)

In [43]:
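# predictors as a 2-D DataFrame, target as a 1-D Series
# (a one-column DataFrame target makes sklearn warn about the shape of y)
X_train = train[['pclass', 'fare']]
y_train = train['survived']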

In [44]:
logit.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=123, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [45]:
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-0.63843624  0.00473522]]
Intercept: 
 [0.8489633]
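
To see how these numbers turn into a prediction, the coefficients and intercept printed above can be plugged into the logistic function by hand. A small sketch, assuming the coefficient order matches the column order of `X_train` (`pclass`, then `fare`) and using the first two passengers from `train.head()`; the variable names are illustrative.

```python
import numpy as np

# values copied from the fitted model's output above
coef = np.array([-0.63843624, 0.00473522])   # [pclass, fare]
intercept = 0.8489633

# [pclass, fare] for the first two rows of train.head()
passengers = np.array([
    [1, 57.9792],   # first class, more expensive ticket
    [3, 7.75],      # third class, cheaper ticket
])

log_odds = intercept + passengers @ coef      # linear part of the model
p_survived = 1 / (1 + np.exp(-log_odds))      # logistic function

print(p_survived.round(3))
```

The results come out at roughly 0.619 and 0.263, in line with the positive-class probabilities that `.predict_proba` reports for the same two rows. The negative `pclass` coefficient means a higher class number (third class) lowers the predicted chance of survival, while the small positive `fare` coefficient nudges it up.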


In [46]:
y_pred = logit.predict(X_train)
y_pred

array([1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1,

In [47]:
y_pred_proba = logit.predict_proba(X_train)
# one row per observation: [P(survived = 0), P(survived = 1)]
y_pred_proba

array([[0.38105535, 0.61894465],
       [0.73684755, 0.26315245],
       [0.7373293 , 0.2626707 ],
       ...,
       [0.73668683, 0.26331317],
       [0.73730638, 0.26269362],
       [0.73684755, 0.26315245]])
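
The two-column shape follows from there being two classes: each row holds one probability per class, in the order given by the model's `classes_` attribute, and the pair sums to 1. A short sketch using the first three rows displayed above (the array is typed in by hand here rather than taken from the fitted model):

```python
import numpy as np

# first three rows of the predict_proba output shown above
proba = np.array([
    [0.38105535, 0.61894465],
    [0.73684755, 0.26315245],
    [0.7373293,  0.2626707],
])

p_died = proba[:, 0]       # column 0 -> class 0 (did not survive)
p_survived = proba[:, 1]   # column 1 -> class 1 (survived)

print(proba.sum(axis=1))      # each row sums to 1
print(proba.argmax(axis=1))   # most likely class per row
```

Taking the most likely class per row gives 1, 0, 0, the same as the first three values returned by `.predict`. For two classes this is equivalent to checking whether the positive-class probability is above 0.5.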

In [48]:
print('Accuracy of Logistic Regression classifier on training set: {:.3f}'
     .format(logit.score(X_train, y_train)))

Accuracy of Logistic Regression classifier on training set: 0.673


In [51]:
# same predictors and target as in training, taken from the test split
X_test = test[['pclass', 'fare']]
y_test = test['survived']

In [52]:
print('Accuracy of Logistic Regression classifier on test set: {:.3f}'
     .format(logit.score(X_test, y_test)))

Accuracy of Logistic Regression classifier on test set: 0.704


In [54]:
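# how to change our threshold:
# .predict uses a 0.5 cutoff, so to use a different one, compare the
# positive-class column of .predict_proba to the threshold we want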
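
The threshold cell above has no working code. One way to apply a custom threshold can be sketched as follows; this is a minimal illustration, where the helper name `predict_with_threshold` is hypothetical and the probabilities and labels are made up rather than taken from the Titanic data.

```python
import numpy as np

def predict_with_threshold(p_positive, threshold=0.5):
    """Label an observation 1 when its positive-class probability meets the threshold."""
    return (np.asarray(p_positive) >= threshold).astype(int)

# made-up probabilities and labels, for illustration only
p_positive = np.array([0.62, 0.26, 0.26, 0.45, 0.55, 0.35])
y_true = np.array([1, 0, 0, 1, 0, 1])

for threshold in (0.3, 0.5, 0.7):
    y_pred = predict_with_threshold(p_positive, threshold)
    accuracy = (y_pred == y_true).mean()   # share of correct predictions
    print(f'threshold={threshold}: predictions={y_pred}, accuracy={accuracy:.3f}')
```

With the notebook's model, the same helper could be called as `predict_with_threshold(logit.predict_proba(X_test)[:, 1], 0.3)`. Lowering the threshold labels more passengers as survivors, which catches more true survivors at the cost of more false alarms; raising it does the opposite. Accuracy can move in either direction, so it is worth checking several thresholds.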