<a href="https://colab.research.google.com/github/gptix/DS-Unit-2-Kaggle-Challenge/blob/master/Jud_Taylor_Sprint_Challenge_6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science, Unit 2_
 
# Sprint Challenge: Predict Steph Curry's shots 🏀

For your Sprint Challenge, you'll use a dataset with all Steph Curry's NBA field goal attempts. (Regular season and playoff games, from October 28, 2009, through June 5, 2019.) 

You'll predict whether each shot was made, using information about the shot and the game. This is hard to predict! Try to get above 60% accuracy. The dataset was collected with the [nba_api](https://github.com/swar/nba_api) Python library.

In [0]:
%%capture
import sys

if 'google.colab' in sys.modules:
    # Install packages in Colab
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*

In [0]:
# Read data
import pandas as pd
url = 'https://drive.google.com/uc?export=download&id=1fL7KPyxgGYfQDsuJoBWHIWwCAf-HTFpX'
df = pd.read_csv(url)

# Check data shape
assert df.shape == (13958, 20)

To demonstrate mastery on your Sprint Challenge, do all the required, numbered instructions in this notebook.

To earn a score of "3", also do all the stretch goals.

You are permitted and encouraged to do as much data exploration as you want.

**1. Begin with baselines for classification.** Your target to predict is `shot_made_flag`. What is your baseline accuracy, if you guessed the majority class for every prediction?



In [3]:
target = 'shot_made_flag'
baseline = df[target].value_counts(normalize=True)[1]
print(f'Baseline accuracy: {baseline}')

Baseline accuracy: 0.4729187562688064


**2. Hold out your test set.** Use the 2018-19 season to test. NBA seasons begin in October and end in June. You'll know you've split the data correctly when your test set has 1,709 observations.



In [4]:
correct_test_observation_count = 1709

df['game_date'].head()

test_set_begin_date = '2018-10-01'
test_set_end_date  = '2019-06-30'

test = df[ (df['game_date'] >= test_set_begin_date)  &  
               (df['game_date'] <= test_set_end_date) ]

train = df[ (df['game_date'] < test_set_begin_date)  |
               (df['game_date'] > test_set_end_date) ]

len(test) == correct_test_observation_count
test.shape
train.shape

(12249, 20)

**3. Engineer new feature.** Engineer at least **1** new feature, from this list, or your own idea.

In [0]:
def wrangle(frame):
  frame['homecourt'] = train['htm'] == 'GSW'
  return frame

In [6]:
train = wrangle(train)
test = wrangle(test)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


**4. Decide how to validate** your model. Choose one of the following options. Any of these options are good. You are not graded on which you choose.
- **Train/validate/test split: train on the 2009-10 season through 2016-17 season, validate with the 2017-18 season.** You'll know you've split the data correctly when your train set has 11,081 observations, and your validation set has 1,168 observations.
- **Train/validate/test split: random 80/20%** train/validate split.
- **Cross-validation** with independent test set. You may use any scikit-learn cross-validation method.
---

**I choose cross-validation**


**5.** Use a scikit-learn **pipeline** to **encode categoricals** and fit a **Decision Tree** or **Random Forest** model.

### First, I need to set up the X matrix and y vector.

In [0]:
# target is defined above.

X_train = train.drop(columns=target)
y_train = train[target]

X_test = test.drop(columns=target)
y_test = test[target]

### Make pipeline.

In [0]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
import category_encoders as ce

pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(), 
    RandomForestClassifier(random_state=42)
)

### Fit.

In [0]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

param_distributions = {
    'randomforestclassifier__n_estimators': range(50, 100), 
    'randomforestclassifier__max_depth': [1, 5, 10, 15, None], 
    'randomforestclassifier__max_features': uniform(0.1, 0.9), 
}

search = RandomizedSearchCV(
    pipeline, 
    param_distributions=param_distributions, 
    random_state=1984,
    n_jobs=-1, 
    n_iter=30, 
    cv=10, 
    scoring='accuracy', 
    # verbose=10, 
    return_train_score=True
)


**6.** Get your model's **validation accuracy.** (Multiple times if you try multiple iterations.) 


In [10]:
search.fit(X_train, y_train)
print('Validation accuracy', search.best_score_)

Validation accuracy 0.6609519144419953


**Validation accuracy 0.6609519144419953**


**7.** Get your model's **test accuracy.** (One time, at the end.)

In [11]:
print('Test Accuracy', search.score(X_test, y_test))

Test Accuracy 0.620245757753072


## 8. Given a confusion matrix, calculate accuracy, precision, and recall.

Imagine this is the confusion matrix for a binary classification model. Use the confusion matrix to calculate the model's accuracy, precision, and recall.

<table>
  <tr>
    <td colspan="2" rowspan="2"></td>
    <td colspan="2">Predicted</td>
  </tr>
  <tr>
    <td>Negative</td>
    <td>Positive</td>
  </tr>
  <tr>
    <td rowspan="2">Actual</td>
    <td>Negative</td>
    <td style="border: solid">85</td>
    <td style="border: solid">58</td>
  </tr>
  <tr>
    <td>Positive</td>
    <td style="border: solid">8</td>
    <td style="border: solid"> 36</td>
  </tr>
</table>

In [12]:
# import pandas as pd

d = {'Predicted Negative': [85, 8], 'Predicted Positive': [58, 36]}
sample_df = pd.DataFrame(data=d)
sample_df

Unnamed: 0,Predicted Negative,Predicted Positive
0,85,58
1,8,36


In [0]:
def confusion_matrix_metrics (matrix_df):

  TN = matrix_df.iloc[0,0] # true negative
  TP = matrix_df.iloc[1,1]  # true_positive 
  FN = matrix_df.iloc[0,1] # false_negative 
  FP = matrix_df.iloc[1,0] # false_positive 

  correct_predictions = TP + TN
  erroneous_predictions = FP + FN
  total_predictions = correct_predictions + erroneous_predictions
  total_positive_predictions = TP + FP # correct and erroneous
  actual_positives = TP + FN 

  accuracy = correct_predictions / total_predictions
  recall = TP / total_positive_predictions
  precision = TP / actual_positives

  return {'accuracy' : accuracy, 'precision' : precision, 'recall' : recall}

In [14]:
cf_mx_metrics = confusion_matrix_metrics (sample_df)
cf_mx_metrics

{'accuracy': 0.6470588235294118,
 'precision': 0.3829787234042553,
 'recall': 0.8181818181818182}

### Calculate accuracy 

In [15]:
print(f"Accuracy: {cf_mx_metrics['accuracy']}")

Accuracy: 0.6470588235294118


### Calculate precision

In [16]:
print(f"Precision: {cf_mx_metrics['precision']}")

Precision: 0.3829787234042553


### Calculate recall

In [17]:
print(f"Recall: {cf_mx_metrics['recall']}")

Recall: 0.8181818181818182



### Stretch Goals
- Engineer 4+ new features total, either from the list above, or your own ideas.

- **Opponent**: Who is the other team playing the Golden State Warriors?
- **Seconds remaining in the period**: Combine minutes remaining with seconds remaining, to get the total number of seconds remaining in the period.
- **Seconds remaining in the game**: Combine period, and seconds remaining in the period, to get the total number of seconds remaining in the game. A basketball game has 4 periods, each 12 minutes long.
- **Made previous shot**: Was Steph Curry's previous shot successful?

In [21]:
train['vtm']

0        HOU
1        HOU
2        HOU
3        HOU
4        HOU
        ... 
12244    GSW
12245    GSW
12246    GSW
12247    GSW
12248    GSW
Name: vtm, Length: 12249, dtype: object

In [29]:
import numpy as np
train.columns
train['opponent'] = train['htm']
# np.where(train['vtm'] != "GSW", train['vtm'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


NameError: ignored

### Make 2+ visualizations to explore relationships between features and target.
- Optimize 3+ hyperparameters by trying 10+ "candidates" (possible combinations of hyperparameters). You can use `RandomizedSearchCV` or do it manually.
- Get and plot your model's feature importances.

In [19]:
train['player_name'].unique()

array(['Stephen Curry'], dtype=object)