## Mini-Exercise:

1. Load the titanic dataset that you've put together from previous lessons.
2. Split your data into training and test.
3. Fit a logistic regression model on your training data using sklearn's
   linear_model.LogisticRegression class. Use fare and pclass as the
   predictors.
4. Use the model's .predict method. What is the output?
5. Use the model's .predict_proba method. What is the output? Why do you
   think it is shaped like this?
6. Evaluate your model's predictions on the test data set. How accurate
   is the model? How does changing the threshold affect this?

#### Load the libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

import env
import acquire

#### Load the titanic dataset

In [2]:
df = acquire.get_titanic_data()
df.head()

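|   | passenger_id | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | deck | embark_town | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third |  | Southampton | 0 |
| 1 | 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | C | Cherbourg | 0 |
| 2 | 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third |  | Southampton | 1 |
| 3 | 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | C | Southampton | 0 |
| 4 | 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third |  | Southampton | 1 |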


#### Split my data into train and test

In [3]:
from sklearn.model_selection import train_test_split

#### Worth noting: between here and the errors below, my problem was that I picked a continuous variable for my y. The exercise says to use 'pclass' and 'fare' as the predictors, but I used fare as the y because I thought a money amount could be treated as categorical (fixed). It cannot: LogisticRegression is a classifier and needs a categorical target. Casting with '.astype(int)' did not help either, since X_train and y_train had already been split off before the cast. The y needs to be 'survived' (categorical: a passenger either survived or did not).

#### I am keeping the failed cells below (commented out) as a reminder; the corrected version follows them.
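
The failure described here can be reproduced on a handful of rows. This is a minimal sketch, assuming a reasonably recent scikit-learn; it reuses the five rows shown in the `df.head()` output, and the variable names (`fare_kind`, `fit_failed`, and so on) are only for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.utils.multiclass import type_of_target

# The five rows from the df.head() output above
pclass = [[3], [1], [3], [1], [3]]
fare = [7.25, 71.2833, 7.925, 53.1, 8.05]
survived = [0, 1, 1, 1, 0]

# scikit-learn inspects the target to decide what kind of problem it is
fare_kind = type_of_target(fare)          # non-integer floats: continuous
survived_kind = type_of_target(survived)  # two distinct labels: binary
print(fare_kind, survived_kind)

# A classifier refuses to fit a continuous target
try:
    LogisticRegression().fit(pclass, fare)
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print("fit on fare failed:", err)

# With a categorical target the same call works
toy_model = LogisticRegression().fit(pclass, survived)
print(toy_model.classes_)
```

Casting `fare` to integers would not be a real fix even if it reached the training data: every distinct fare would then be treated as its own class, which is not the question the exercise asks.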

In [4]:
# X = df[["pclass"]]

# y = df[["fare"]]

# X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.75, random_state=123)

In [5]:
# Verify it's training 75% and testing 25%

In [6]:
# X_train.shape

In [7]:
# X_test.shape

In [8]:
# y_train.shape

In [9]:
# y_test.shape

#### Fit a logistic regression model on the training data using sklearn's linear_model.LogisticRegression class. Use fare and pclass as the predictors.

In [10]:
# #1 Create the Logistic Regression object:

# logit = LogisticRegression(C=1, class_weight={1:2}, random_state=123, solver="saga")
# logit

In [11]:
#2 Fit the model to the training data

# logit.fit(X_train, y_train)

In [12]:
# Uh-oh.  The Googs says I have one of the variables saved as a float 
# and need to convert it to an int for this thing to run

# X.info()

In [13]:
# y.info()

In [14]:
# The float is the why.  We've gotta convert it to int:

# df['fare'] = df['fare'].astype(int)

In [15]:
# Now that it's converted, let's try...

# logit.fit(X_train, y_train)

In [16]:
# fuggit. 

#### Below is the corrected stuff, making the y a categorical ('survived')

In [24]:
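X = df[["fare", "pclass"]]

y = df[["survived"]]

# 75/25 split of the predictors and target; the shapes are checked in the next cells
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.75, random_state=123)

# 80/20 split of the whole dataframe; this is the split used for modeling below
train, test = train_test_split(df, random_state=123, train_size=.8)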

In [25]:
X_train.shape

(668, 2)

In [26]:
X_test.shape

(223, 2)

In [27]:
y_train.shape

(668, 1)

In [28]:
y_test.shape

(223, 1)

In [29]:
X = train[["fare", "pclass"]]
y = train.survived

#### Now to assign my model object:

In [30]:
model = LogisticRegression(random_state=123).fit(X, y)



In [31]:
model

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=123, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

#### Get the unique values from the y variable

In [32]:
model.classes_

array([0, 1])

In [33]:
pd.DataFrame(model.predict_proba(X), columns=model.classes_)

|     | 0 | 1 |
|---|---|---|
| 0 | 0.381055 | 0.618945 |
| 1 | 0.736848 | 0.263152 |
| 2 | 0.737329 | 0.262671 |
| 3 | 0.736863 | 0.263137 |
| 4 | 0.375066 | 0.624934 |
| ... | ... | ... |
| 707 | 0.579080 | 0.420920 |
| 708 | 0.591321 | 0.408679 |
| 709 | 0.736687 | 0.263313 |
| 710 | 0.737306 | 0.262694 |
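
On the shape of this output: `predict_proba` returns one row per passenger and one column per entry in `model.classes_`, so each row holds P(not survived) and P(survived) and sums to 1. The sketch below checks that on a toy model fit to the five `df.head()` rows; the names `X_toy`, `y_toy`, and `toy_model` are made up for this illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The five rows from df.head(); columns are fare, pclass
X_toy = np.array([[7.25, 3], [71.2833, 1], [7.925, 3], [53.1, 1], [8.05, 3]])
y_toy = np.array([0, 1, 1, 1, 0])

toy_model = LogisticRegression().fit(X_toy, y_toy)
proba = toy_model.predict_proba(X_toy)

# One row per passenger, one column per class in toy_model.classes_
print(proba.shape)

# The two columns are complementary, so every row sums to 1
row_sums = proba.sum(axis=1)

# .predict returns the class whose column holds the larger probability
by_argmax = toy_model.classes_[proba.argmax(axis=1)]
print(by_argmax, toy_model.predict(X_toy))
```

This is why `[:, 1]` is enough to get the probability of survival: the other column carries no extra information.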


#### Adding the y-hat column to the training dataframe:

In [38]:
# Copy first so the new columns are added to an independent DataFrame
# (assigning to the split result directly triggers pandas' SettingWithCopyWarning)
train = train.copy()

train["yhat"] = model.predict(X)

train["probability_of_survival"] = model.predict_proba(X)[:, 1]


#### Check accuracy_score, precision_score, and recall_score

- This needs one more import, from sklearn.metrics.

In [39]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

In [40]:
model.score(X, y)

0.672752808988764

In [41]:
accuracy_score(train.survived, train.yhat)

0.672752808988764

In [42]:
precision_score(train.survived, train.yhat)

0.6325301204819277

In [43]:
recall_score(train.survived, train.yhat)

0.37906137184115524

In [44]:
# Recall is poor: the model catches only about 38% of the passengers who actually survived.
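
To see where the three scores come from, here they are rebuilt from confusion counts. The counts are an assumption: the notebook never prints them, and they are back-calculated as the only whole-number counts consistent with the three scores above for 712 training rows (80% of 891).

```python
# Confusion counts back-calculated from the reported scores
# (an assumption: they are not printed anywhere in this notebook)
tp = 105  # predicted survived, did survive
fp = 61   # predicted survived, did not survive
fn = 172  # predicted not survived, did survive
tn = 374  # predicted not survived, did not survive

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are right
precision = tp / (tp + fp)                  # share of predicted survivors who survived
recall = tp / (tp + fn)                     # share of actual survivors who were found

print(accuracy, precision, recall)
```

Recall is low because the 172 missed survivors outnumber the 105 that were found.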

#### Set the threshold (t) and see what happens:

In [46]:
t = .25

# Predict survival whenever the predicted probability exceeds the threshold;
# .assign returns a new DataFrame, which avoids the SettingWithCopyWarning
train = train.assign(yhat=(train.probability_of_survival > t).astype(int))

accuracy_score(train.survived, train.yhat)


0.3890449438202247

In [47]:
precision_score(train.survived, train.yhat)

0.3890449438202247

In [48]:
recall_score(train.survived, train.yhat)

1.0

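#### Well, look at that: the lower the threshold (t), the higher the recall score. At t = .25 every passenger is predicted to survive, so recall reaches 1.0, while accuracy and precision both drop to 0.389, which is simply the share of survivors in the training data. Recall can only stay level or fall as the threshold rises, but that is a trade-off against precision rather than a strict inverse proportion.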
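
The trade-off is easier to see by sweeping several thresholds at once. This is a small sketch in plain Python; `scores_at` is a hypothetical helper, and the probabilities and outcomes are made up for illustration, not taken from the titanic data.

```python
def scores_at(threshold, probabilities, actual):
    """Return (accuracy, precision, recall) when predicting 1 for p > threshold."""
    predicted = [int(p > threshold) for p in probabilities]
    pairs = list(zip(predicted, actual))
    tp = sum(1 for yhat, y in pairs if yhat == 1 and y == 1)
    fp = sum(1 for yhat, y in pairs if yhat == 1 and y == 0)
    fn = sum(1 for yhat, y in pairs if yhat == 0 and y == 1)
    tn = sum(1 for yhat, y in pairs if yhat == 0 and y == 0)
    accuracy = (tp + tn) / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0  # no positive predictions
    recall = tp / (tp + fn) if tp + fn else 0.0     # no actual positives
    return accuracy, precision, recall


# Made-up probabilities of survival and true outcomes, for illustration only
probabilities = [0.26, 0.27, 0.30, 0.42, 0.45, 0.55, 0.62, 0.80]
actual = [0, 1, 0, 0, 1, 0, 1, 1]

results = {t: scores_at(t, probabilities, actual) for t in (0.25, 0.5, 0.75)}
for t, (accuracy, precision, recall) in results.items():
    print(f"t={t}: accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

In this toy data, raising the threshold lowers recall and raises precision, while accuracy depends on where the threshold sits relative to the two groups.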