# Intro to Modeling

Plan ➔ Acquire ➔ Prepare ➔ Explore ➔ **Model** ➔ Deliver

Before modeling:

0. Split your data
1. Data preprocessing

The modeling "loop"

1. Create a model
    - algorithm + hyperparams
    - training data
1. Evaluate the model
1. Repeat

After a set amount of time or number of repetitions:

1. Compare models
1. Evaluate on test
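
The create, evaluate, repeat loop above can be sketched roughly as follows. This is only an illustration: the data is synthetic (randomly generated, not the Titanic data), and the `candidates` list and `results` structure are hypothetical names chosen for the example.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic stand-in data (an assumption for this sketch).
rng = np.random.default_rng(123)
X_train, y_train = rng.normal(size=(300, 3)), rng.integers(0, 2, size=300)
X_validate, y_validate = rng.normal(size=(100, 3)), rng.integers(0, 2, size=100)

# Each entry is one candidate: the same algorithm with different hyperparameters.
candidates = [
    {'strategy': 'most_frequent'},
    {'strategy': 'stratified', 'random_state': 123},
    {'strategy': 'uniform', 'random_state': 123},
]

results = []
for params in candidates:
    model = DummyClassifier(**params)   # 1. create a model (algorithm + hyperparams)
    model.fit(X_train, y_train)         #    fit it on the training data
    results.append({                    # 2. evaluate the model
        'params': params,
        'train_accuracy': model.score(X_train, y_train),
        'validate_accuracy': model.score(X_validate, y_validate),
    })                                  # 3. repeat with the next candidate

for result in results:
    print(result)
```

Note that the test split never appears inside the loop; it is held back until a single model has been chosen.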

In [4]:
# We'll use sklearn's DummyClassifier as a stand-in for other classification algorithms.
# It exposes the same interface, so we use it the same way we'll use the "real" models.
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report
import acquire
import prepare

## Data Split

In [7]:
train, validate, test = prepare.prep_titanic_data(acquire.get_titanic_data())
train.shape, validate.shape, test.shape

((498, 9), (214, 9), (179, 9))
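
The `prepare` module is not shown in this notebook, so how it splits the data is unknown. As a sketch only, a three-way split like this is often built from two calls to `train_test_split`. The 20% and 30% proportions below are assumptions; they would give 498/214/179 rows for an 891-row dataset, but that does not confirm this is what `prepare` does. The dataframe here is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a prepared dataframe (not the Titanic data).
rng = np.random.default_rng(123)
df = pd.DataFrame({
    'feature': rng.normal(size=1000),
    'target': rng.integers(0, 2, size=1000),
})

# First carve off the test split, then divide the remainder into train and validate.
# Stratifying on the target keeps class proportions similar across splits.
train_validate, test = train_test_split(
    df, test_size=0.2, random_state=123, stratify=df.target
)
train, validate = train_test_split(
    train_validate, test_size=0.3, random_state=123, stratify=train_validate.target
)

print(train.shape, validate.shape, test.shape)
```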

In [35]:
X_train, y_train = train.drop(columns='survived'), train.survived
X_validate, y_validate = validate.drop(columns='survived'), validate.survived
X_test, y_test = test.drop(columns='survived'), test.survived

## Create our First Model

### Aside: Working with sklearn ML objects

1. Create the object
1. Fit the object on training data
1. Use the object (.score, .predict, .transform)
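
The same create, fit, use pattern applies to preprocessing objects, where the "use" step is `.transform`. The sketch below uses `SimpleImputer` on a tiny made-up `embark_town` column; the data and variable names are assumptions for illustration, not taken from the `prepare` module.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny made-up dataframes with a missing value in each.
train = pd.DataFrame({'embark_town': ['Southampton', 'Southampton', np.nan, 'Cherbourg']})
validate = pd.DataFrame({'embark_town': [np.nan, 'Queenstown']})

# 1. Create the object
imputer = SimpleImputer(strategy='most_frequent')
# 2. Fit the object on training data only
imputer.fit(train[['embark_town']])
# 3. Use the object on every split
train_filled = imputer.transform(train[['embark_town']])
validate_filled = imputer.transform(validate[['embark_town']])

print(validate_filled)
```

Fitting on the training split alone means the fill value is learned without looking at validate or test.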

In [36]:
# 1. Create the object
model = DummyClassifier(strategy='constant', constant=1)
# 2. Fit the object
model.fit(X_train, y_train)

DummyClassifier(constant=1, strategy='constant')

Ways we use sklearn classification models:

- `.score` gives us accuracy
- `.predict` lets us make predictions given a set of independent variables
- `.predict_proba` gives us the probability that each observation falls into each label
- some specific model types have additional properties
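
As a small illustration of these three methods, here is a sketch on a five-row toy dataset. It uses the `'prior'` strategy of `DummyClassifier`, which is not used elsewhere in this notebook; it is chosen here only because its `predict_proba` output shows the class proportions clearly.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data: three observations of class 0, two of class 1.
X = np.zeros((5, 1))
y = np.array([0, 0, 0, 1, 1])

model = DummyClassifier(strategy='prior')
model.fit(X, y)

print(model.classes_)           # the column order used by predict_proba
print(model.predict(X))         # one predicted label per observation
print(model.predict_proba(X))   # one row per observation, one column per label
print(model.score(X, y))        # accuracy of the predictions against y
```

Each row of the `predict_proba` output sums to 1, and its columns line up with `model.classes_`.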

In [37]:
print('Training accuracy: %.4f' % model.score(X_train, y_train))

Training accuracy: 0.3835


In [38]:
# TODO: view the accuracy on the validate split
model.score(X_validate, y_validate)

0.38317757009345793

In [39]:
model.predict(X_validate)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [40]:
# TODO: create a new column on the train dataframe that contains the model's predictions.
train['prediction'] = model.predict(X_train)
train

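|     | survived | pclass | sibsp | parch | fare | alone | sex_male | embark_town_Queenstown | embark_town_Southampton | prediction |
|-----|----------|--------|-------|-------|------|-------|----------|------------------------|-------------------------|------------|
| 474 | 0 | 3 | 0 | 0 | 9.8375 | 1 | 0 | 0 | 1 | 1 |
| 370 | 1 | 1 | 1 | 0 | 55.4417 | 0 | 1 | 0 | 0 | 1 |
| 573 | 1 | 3 | 0 | 0 | 7.7500 | 1 | 0 | 1 | 0 | 1 |
| 110 | 0 | 1 | 0 | 0 | 52.0000 | 1 | 1 | 0 | 1 | 1 |
| 167 | 0 | 3 | 1 | 4 | 27.9000 | 0 | 0 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 735 | 0 | 3 | 0 | 0 | 16.1000 | 1 | 1 | 0 | 1 | 1 |
| 163 | 0 | 3 | 0 | 0 | 8.6625 | 1 | 1 | 0 | 1 | 1 |
| 770 | 0 | 3 | 0 | 0 | 9.5000 | 1 | 1 | 0 | 1 | 1 |
| 196 | 0 | 3 | 0 | 0 | 7.7500 | 1 | 1 | 1 | 0 | 1 |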


In [26]:
# use the column you just created and the actual values in the survived column
# to generate a classification report
print(classification_report(train.survived, train.prediction))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       307
           1       0.38      1.00      0.55       191

    accuracy                           0.38       498
   macro avg       0.19      0.50      0.28       498
weighted avg       0.15      0.38      0.21       498
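
The numbers for class 1 in the report above can be reproduced by hand. The sketch below rebuilds label arrays with the same class counts as the `support` column (307 zeros, 191 ones) and computes precision, recall, and F1 from the raw counts; the arrays are reconstructed for illustration and are not the actual `train` rows.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Same class counts as the report: 307 actual zeros, 191 actual ones.
y_true = np.array([0] * 307 + [1] * 191)
y_pred = np.ones_like(y_true)   # the constant model predicts 1 for everyone

tp = ((y_true == 1) & (y_pred == 1)).sum()   # predicted 1, actually 1
fp = ((y_true == 0) & (y_pred == 1)).sum()   # predicted 1, actually 0
fn = ((y_true == 1) & (y_pred == 0)).sum()   # predicted 0, actually 1

precision = tp / (tp + fp)   # 191 / 498
recall = tp / (tp + fn)      # 191 / 191
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))
```

Because the model never predicts 0, precision for class 0 is a 0/0 division, which `classification_report` displays as 0.00.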



## More models

Now we'll make more models. Each model is a unique combination of:

- algorithm
- hyperparameters
- training data

In [41]:
model1 = DummyClassifier(strategy='constant', constant=0)
# TODO: fit the model on the training data
model1.fit(X_train, y_train)
# TODO: see how this model performs on train and validate
print(f'Train model is {model1.score(X_train, y_train)}')
print(f'Validate model is {model1.score(X_validate, y_validate)}')

Train model is 0.6164658634538153
Validate model is 0.616822429906542


In [51]:
model2 = DummyClassifier(strategy='uniform', random_state=123)
# TODO: fit the model on the training data
model2.fit(X_train, y_train)
# TODO: see how this model performs on train and validate
print(f'Train model is {model2.score(X_train, y_train)}')
print(f'Validate model is {model2.score(X_validate, y_validate)}')

Train model is 0.4799196787148594
Validate model is 0.5467289719626168


In [52]:
# Following the pattern above, create 2 more models that vary in either hyperparameters or training data
model3 = DummyClassifier(strategy='stratified', random_state=123)
# fit the models and view their performance
model3.fit(X_train, y_train)
print(f'Train model is {model3.score(X_train, y_train)}')
print(f'Validate model is {model3.score(X_validate, y_validate)}')

Train model is 0.5381526104417671
Validate model is 0.5327102803738317


In [53]:
# random_state has no effect here: 'most_frequent' always predicts the majority class
model4 = DummyClassifier(strategy='most_frequent', random_state=123)
model4.fit(X_train, y_train)
print(f'Train model is {model4.score(X_train, y_train)}')
print(f'Validate model is {model4.score(X_validate, y_validate)}')

Train model is 0.6164658634538153
Validate model is 0.616822429906542


## Compare and Finalize

In [None]:
# TODO: compare the performance of your models on the validate split

In [None]:
# TODO: find the performance of your best model on the test split
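
One possible way to approach these last two steps is sketched below. It is not the notebook's solution: the data is synthetic so the example runs on its own, and the `models` dictionary and `comparison` dataframe are hypothetical names. The strategies mirror the ones tried above.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(123)

def make_split(n):
    """Return synthetic features and labels (roughly 60% zeros, 40% ones)."""
    return rng.normal(size=(n, 3)), rng.choice([0, 1], size=n, p=[0.6, 0.4])

X_train, y_train = make_split(500)
X_validate, y_validate = make_split(200)
X_test, y_test = make_split(200)

models = {
    'constant_1': DummyClassifier(strategy='constant', constant=1),
    'constant_0': DummyClassifier(strategy='constant', constant=0),
    'uniform': DummyClassifier(strategy='uniform', random_state=123),
    'stratified': DummyClassifier(strategy='stratified', random_state=123),
    'most_frequent': DummyClassifier(strategy='most_frequent'),
}

# Fit every model on train and record accuracy on train and validate.
rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    rows.append({
        'model': name,
        'train': model.score(X_train, y_train),
        'validate': model.score(X_validate, y_validate),
    })

# Compare on validate: best validate accuracy first.
comparison = pd.DataFrame(rows).set_index('model').sort_values('validate', ascending=False)
print(comparison)

# Only the single chosen model is evaluated on test, and only once.
best_name = comparison.index[0]
test_accuracy = models[best_name].score(X_test, y_test)
print(f'{best_name} test accuracy: {test_accuracy:.4f}')
```

Choosing on validate and scoring on test only once keeps the test accuracy an honest estimate of performance on unseen data.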