<a href="https://colab.research.google.com/github/josenomberto/UTEC-CDIAV3-MISTI/blob/main/day2_kobe_feature_engineering_exercises.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**If you haven't already, please select:**

`File` -> `Save a Copy in Drive`

**to copy this notebook to your Google drive, and work on a copy. If you don't do this, your changes won't be saved!**


# Feature Engineering with the Kobe Bryant Dataset

In [None]:
# Import Packages

# data manipulation
import pandas as pd
import numpy as np
import scipy.stats as st

# plots
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pylab as pl
import seaborn as sns

# scaling
from sklearn.preprocessing import StandardScaler

# classification algorithms
from sklearn.linear_model import LinearRegression, LogisticRegression

# dimension reduction
from sklearn.decomposition import PCA

# cross-validation
from sklearn.model_selection import train_test_split

# model evaluation
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

# text mining
import re
from nltk import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

import warnings
warnings.filterwarnings("ignore")

## EXERCISE: Explore the Kobe Bryant Dataset

Explore the Kobe Bryant Dataset to understand what we may be able to learn from it.

Tasks:
1. Get general information for the dataset using the methods `.head()`, `.tail()` and `.info()`.
2. Identify if there are any missing values using the method `.isnull()`.
3. Get summary statistics for the dataset using the method `.describe()`.
4. Identify potential target variables in this dataset.
5. Identify potential input features that could be used to predict your target variable.
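
As a rough illustration of the methods named in the tasks, the sketch below runs them on a tiny made-up DataFrame. The column names and values are invented for illustration and are not taken from `KobeData.csv`; the same calls should carry over to `KobeDataset`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the Kobe dataset
toy = pd.DataFrame({
    "distance": [2.0, 15.0, np.nan, 24.0],
    "zone": ["Paint", "Mid-Range", "Mid-Range", "Three"],
    "made": [1.0, 0.0, 1.0, np.nan],
})

print(toy.head(2))   # first two rows
print(toy.tail(2))   # last two rows
toy.info()           # column names, non-null counts, and dtypes

# .isnull() returns a True/False table; summing it counts missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# .describe() summarizes only the numeric columns by default
summary = toy.describe()
print(summary)
```

Chaining `.sum()` after `.isnull()` is a common way to turn the boolean table into one count per column, which is usually easier to read than the full table.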

In [None]:
# load the data as a pandas dataframe
KobeDataset = pd.read_csv('KobeData.csv')
print("Data dimensions:" + str(KobeDataset.shape))

### TASK 1: Get General Information of Dataset

In [None]:
# TASK 1 EXERCISE

# Preview the first few rows
print("\n--- Head of the DataFrame ---")
''' ADD YOUR CODE HERE '''

# Preview the last few rows
print("\n--- Tail of the DataFrame ---")
''' ADD YOUR CODE HERE '''

# Check all columns and data types
print("\n--- DataFrame Info ---")
''' ADD YOUR CODE HERE '''

# Add any additional methods of interest
''' ADD YOUR CODE HERE '''

### TASK 2: Identify any missing data

In [None]:
# TASK 2 EXERCISE

# Check for missing values
print("\n--- Missing Values ---")
''' ADD YOUR CODE HERE '''

### TASK 3: Generate Summary Statistics

In [None]:
# TASK 3 EXERCISE

# Get summary statistics for numeric columns
print("\n--- Summary Statistics (Numeric) ---")
''' ADD YOUR CODE HERE '''

## Explore Data Types

In [None]:
# we will narrow our focus to only the 15 columns listed below
KobeDataset = KobeDataset.filter(["action_type", "loc_x", "loc_y","shot_distance", "shot_zone_basic", "shot_zone_area", "shot_type",
                                  "period", "minutes_remaining", "seconds_remaining", "playoffs", "season", "game_date","matchup", "shot_made_flag"])

print("Filtered Data dimensions:" + str(KobeDataset.shape))

# display the first 10 lines
display(KobeDataset.head(10))

In [None]:
# We can use .iloc to see single rows at a time, and .unique() to see the unique values in a feature

display(KobeDataset['action_type'].unique())
print()
KobeDataset.iloc[[2]]

**DISCUSSION QUESTION:** From the Kobe Bryant dataset, what column could be used as the output label, and what columns could be used as input features?

*YOUR ANSWER HERE:*

### EXERCISE: Identify Feature Data Types

Identify the Python data type of each of the following features, and its data type category, using the `.dtype` attribute described on the related concept page. Remember to print out examples using the `.sample(~)` method to help you identify the data types in more detail.

Tasks:
1. Identify the datatype of the feature `loc_x`. Print some samples from the dataframe.
2. Identify the datatype of the feature `action_type`. Print some samples from the dataframe.
3. Identify the datatype of the feature `playoffs`. Print some samples from the dataframe.
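
The sketch below shows the `.dtype` attribute and `.sample(~)` on a small invented DataFrame (the column names are hypothetical, chosen only to resemble the kinds of columns in this dataset). It is meant to show the pattern, not the answers for the Kobe columns.

```python
import pandas as pd

# Hypothetical toy data, NOT the Kobe dataset
toy = pd.DataFrame({
    "x_coord": [-120, 0, 87, 15],                                # whole numbers
    "shot_label": ["Jump Shot", "Layup", "Dunk", "Jump Shot"],   # text
    "is_playoff": [0, 1, 0, 0],                                  # 0/1 flag stored as integers
})

# .dtype is an attribute of a single column (no parentheses)
x_dtype = toy["x_coord"].dtype
label_dtype = toy["shot_label"].dtype
flag_dtype = toy["is_playoff"].dtype
print(x_dtype, label_dtype, flag_dtype)

# .sample(n) draws n random rows; random_state makes the draw repeatable
print(toy["shot_label"].sample(2, random_state=0))
```

Note that the Python data type alone does not settle the data type category: the 0/1 flag above is stored as an integer, yet conceptually it encodes a yes/no category.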

#### TASK 1: Identify the Data Type of `loc_x`

In [None]:
# TASK 1 EXERCISE

# Identify the Data Type
datatype_loc_x = <REPLACE ME AND MY ARROWS>
print("Data type of loc_x: ", datatype_loc_x)

# Print 5 values of the column
print("\n--- Sample 'loc_x' values ---")
print(<REPLACE ME AND MY ARROWS>)

#### TASK 2: Identify the Data Type of `action_type`

In [None]:
# TASK 2 EXERCISE

# Identify the Data Type
datatype_action_type = <REPLACE ME AND MY ARROWS>
print("Data type of action_type: ", datatype_action_type)

# Print 5 values of the column
print("\n--- Sample 'action_type' values ---")
print(<REPLACE ME AND MY ARROWS>)

#### TASK 3: Identify the Data Type of `playoffs`

In [None]:
# TASK 3 EXERCISE

# Identify the Data Type
datatype_playoffs = <REPLACE ME AND MY ARROWS>
print("Data type of playoffs: ", datatype_playoffs)

# Print 5 values of the column
print("\n--- Sample 'playoffs' values ---")
print(KobeDataset['playoffs'].sample(5))

**DISCUSSION QUESTION:** What are the different Python data types and more specifically, what are the data type categories for loc_x, action_type, and playoffs?

*YOUR ANSWER HERE:*

## Data Pre-Processing

### EXERCISE: Data Pre-Processing with Typecasting and Removing Missing Data

Now we will practice data pre-processing techniques essential for preparing our dataset for analysis, including handling missing values and converting data types to appropriate formats.

Tasks:
1. First, we will practice typecasting some of the features. Here, we focus on `game_date`, a date column that is stored as strings. Convert this column to dates using `pd.to_datetime(~).dt.date`. The `.dt.date` accessor retains only the date part of each value, which is all we need here because the feature has no time portion. Has the minimum date value changed? Has the Python data type shown by the `.dtypes` attribute changed?
2. Address the missing data points in this dataset. First, remove all of the rows where there is missing data in our target `shot_made_flag`. Next, replace the `NaN` values in `shot_distance` column with the mean of the column.
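
To see why the minimum can change after typecasting, here is a minimal sketch on invented date strings. The column name `when` and the month/day/year format are assumptions made for this illustration; the format used in `game_date` may differ.

```python
import pandas as pd

# Hypothetical toy data: dates stored as month/day/year strings
toy = pd.DataFrame({"when": ["10/31/2000", "02/14/2001", "12/25/1999"]})

# As strings, min() compares character by character, not chronologically
min_as_text = toy["when"].min()

# Parse the strings, then keep only the calendar date
toy["when"] = pd.to_datetime(toy["when"], format="%m/%d/%Y").dt.date
min_as_date = toy["when"].min()

print("minimum as text:", min_as_text)
print("minimum as date:", min_as_date)

# The reported dtype is typically still `object`, but the values are now
# datetime.date objects rather than strings
print(toy["when"].dtype, type(toy["when"].iloc[0]))
```

In this toy case the text minimum is simply the string that sorts first alphabetically, while the date minimum is the earliest calendar day.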

#### TASK 1: Perform Typecasting

In [None]:
# Prior to typecasting, the following fields with dates look like this:
print("minimum game_date before typecasting is: ", KobeDataset['game_date'].min(axis=0))
KobeDataset.dtypes['game_date']

In [None]:
# TASK 1 EXERCISE

# Convert the column to datetime format
KobeDataset['game_date'] = <REPLACE ME AND MY ARROWS>

# Check what the minimum date is after updating the date format
print("minimum game_date after typecasting is: ", KobeDataset['game_date'].min(axis=0), "\n")

# Notice that the python data type has not changed visibly, but a lot has changed under the hood
KobeDataset.dtypes['game_date']

**DISCUSSION QUESTION:** Why is typecasting to a date type necessary for the columns which have date information?

*YOUR ANSWER HERE:*

#### TASK 2: Address Missing Data

Next, let's check for missing data. Missing data in this dataset is represented by `NaN` values (as opposed to blanks, "?", etc.), which is convenient for data stored in a pandas dataframe.

If you look at the output of `KobeDataset.info()`, you will see that we have exactly 5000 nulls in our dependent variable.

As a data scientist, whenever you see such a round number for anything, you should be suspicious! *Why would the dataset be missing EXACTLY 5000 values?*

In this case, these values were removed for the purposes of competition on Kaggle.com, to be evaluated as the competitors' test set. Since this is our dependent variable, we don't have much choice but to remove the 5000 records. Remove the records with a missing `shot_made_flag` by using `.loc[~]`, filtering for all the rows with non-null values. Let's check that exactly 5000 were removed.

Finally, replace the `NaN` values in the `shot_distance` column with the column mean using `.fillna(~)`, substituting "~" with the mean value. You can pass `inplace = True`, but in recent pandas versions calling it on a single selected column may not update the dataframe, so assigning the result back to the column is the safer pattern.
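
The two steps described above can be sketched on a small invented DataFrame. The column names `distance` and `label` are hypothetical stand-ins, and computing the mean after dropping rows is a choice made here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the Kobe dataset
toy = pd.DataFrame({
    "distance": [5.0, np.nan, 20.0, 11.0, 8.0],
    "label": [1.0, 0.0, np.nan, 1.0, np.nan],
})
n_before = len(toy)

# Step 1: keep only rows whose label is present.
# .copy() makes the result an independent DataFrame rather than a view.
toy = toy.loc[toy["label"].notnull()].copy()
n_removed = n_before - len(toy)
print("Removed", n_removed, "rows with a missing label")

# Step 2: fill remaining gaps in the feature with the column mean,
# assigning the result back to the column
toy["distance"] = toy["distance"].fillna(toy["distance"].mean())
print(toy)
```

Here the mean is computed from the rows that survive step 1, so the fill value reflects only the data that will actually be used for modeling.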

In [None]:
# TASK 2 EXERCISE

# Look at Data
KobeDataset.info()
print('\n')
FullLength = len(KobeDataset)

# Remove the Missing Rows and then Save This Updated Dataframe to KobeDataset
KobeDataset = <REPLACE ME AND MY ARROWS>
print("Removed", FullLength - len(KobeDataset), "records with null shot_made_flag")

# Replace the Nans in `shot_distance` with the mean of the column
'''ADD YOUR CODE HERE'''

## Visualize Data

In [None]:
# Frequency table for how many datapoints have a specific category and label
pd.crosstab(KobeDataset["shot_zone_basic"], KobeDataset["shot_made_flag"])

In [None]:
# a plot of the table above
pd.crosstab(KobeDataset['shot_zone_basic'], KobeDataset['shot_made_flag']).plot(kind="bar");

For numerical features, we don't have categories that we can use as the x-axis. But we can split the datapoints into bins, such as 0.0 to 0.1 and look at the number of 0 and 1 labels for that bin.

In [None]:
# For numerical variables, we can bin the amounts and check frequency of labels within each bin
pd.crosstab(pd.cut(KobeDataset["seconds_remaining"], bins= 5), KobeDataset['shot_made_flag']).plot(kind= "bar")

### EXERCISE: Visualize some of our data

Let's now visualize some of our features. Make additional plots of additional features as time allows.

Tasks:
1. Make a plot to visualize the feature `shot_distance` in relation to the target variable `shot_made_flag`. First, define the bin bounds as the minimum value, 6, 10, 15, 20, 25, and the maximum value. Next, use `pd.cut(~)` to split the data into those bins. Then use `pd.crosstab(~)` to compute a simple cross tabulation. A cross tabulation (often called a crosstab) is a table that shows the relationship between two or more categorical variables by counting how many observations fall into each combination of categories. Finally, plot this crosstab using `crosstab_result.plot(~)`.
2. Make a plot to visualize the feature `shot_type` in relation to the target variable `shot_made_flag`. This should be much easier than making a plot for `shot_distance`, because `shot_type` is already categorical and needs no binning. You can use `pd.crosstab(~)` or a similar plotting function.
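
As a hedged sketch of the binning-then-counting pattern from Task 1, the example below uses invented values and invented bin edges (not the edges requested in the exercise). The `include_lowest=True` argument is an addition for this illustration: without it, `pd.cut` leaves the left edge of the first bin open, so the minimum value would end up in no bin.

```python
import pandas as pd

# Hypothetical toy data, NOT the Kobe dataset
toy = pd.DataFrame({
    "value": [1, 4, 7, 12, 18, 3, 9, 15],
    "outcome": [1, 1, 0, 0, 0, 1, 1, 0],
})

# Explicit bin edges: start at the minimum, end at the maximum
edges = [toy["value"].min(), 5, 10, toy["value"].max()]

# Assign each value to a bin; include_lowest=True keeps the minimum in the first bin
groups = pd.cut(toy["value"], bins=edges, include_lowest=True)

# Rows = bins, columns = outcome, cells = number of observations
table = pd.crosstab(groups, toy["outcome"])
print(table)

# table.plot(kind="bar") would draw these counts as grouped bars
```

Once a numeric column has been binned, it behaves like a categorical column, which is why the same `pd.crosstab(~)` call works for both kinds of feature.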

#### TASK 1: Plot `shot_distance`

In [None]:
# TASK 1 EXERCISE

# Define bins for shot distance
distance_bins = [
    ''' ADD YOUR CODE HERE '''
]

# Group shots into distance bins
distance_groups = <REMOVE ME AND MY BRACKETS>

# Create a crosstab comparing distance range vs. shot_made_flag
crosstab_result = <REMOVE ME AND MY BRACKETS>

# Plot the crosstab as a bar chart
crosstab_result.plot(<REMOVE ME AND MY BRACKETS>, figsize=(8, 4))
plt.title("Shot Distance Range vs. Shot Outcome")
plt.xlabel("Distance Range (feet)")
plt.ylabel("Count of Shots")
sns.despine()
plt.show()

#### TASK 2: Plot `shot_type`

In [None]:
# TASK 2 EXERCISE

# Plot 'shot_type' versus the target variable 'shot_made_flag'
''' ADD YOUR CODE HERE '''

## What is Logistic Regression?

### ANALYSIS: What is logistic regression?

Tasks:
1. Run the logistic regression code, analyze the code and the output.

#### TASK 1: Run and Analyze the Logistic Regression Code

In [None]:
# TASK 1 EXERCISE AND SOLUTION

# Selecting one feature (shot_distance) for logistic regression
X = KobeDataset[['shot_distance']]  # Only using shot distance as the feature
y = KobeDataset['shot_made_flag']

# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Logistic Regression Model (without standardization)
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Predicting and evaluating the model
y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.5f}")

# Visualizing the sigmoid function and the threshold
def plot_sigmoid(X, y, model):
    # Name of the single feature column the model was trained on
    feature = X.columns[0]

    # Create a range of shot distances for smooth plotting
    # (a DataFrame with the same column name used during training)
    X_range = pd.DataFrame({feature: np.linspace(X[feature].min(), X[feature].max(), 300)})

    # Get predicted probabilities for this range
    y_proba = model.predict_proba(X_range)[:, 1]

    # Plotting the sigmoid curve
    plt.plot(X_range[feature], y_proba, label="Predicted Probability (Sigmoid)", color="blue")
    plt.scatter(X[feature], y, label="Training Data", alpha=0.5, color="red")
    plt.axhline(0.5, color='green', linestyle='--', label="Threshold (0.5)")
    plt.title("Logistic Regression: Sigmoid Function for Shot Distance")
    plt.xlabel("Shot Distance")
    plt.ylabel("Predicted Probability of a Made Shot")
    plt.legend(loc = 'lower left')
    plt.show()

# Plot the sigmoid curve against the training data
plot_sigmoid(X_train, y_train, log_reg)

**DISCUSSION QUESTION:** For our initial logistic regression fit, what are the chosen input and output?

*YOUR ANSWER HERE:*

**DISCUSSION QUESTION:** Use the generated plot to answer the following question: what is fit in a Logistic Regression?

*FOR ANSWER REFER TO NOTION!*

**DISCUSSION QUESTION:** Use the generated plot to answer the following question: how are predictions made in Logistic Regression?

*FOR ANSWER REFER TO NOTION!*
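
As a complement to the Notion answers, the following sketch checks numerically what a fitted binary `LogisticRegression()` does with a single feature. The synthetic data and the variable names are assumptions made for this illustration, not part of the Kobe dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data: success becomes less likely as x grows
rng = np.random.default_rng(0)
x = rng.uniform(0, 30, size=200)
p_true = 1 / (1 + np.exp(-(1.5 - 0.1 * x)))
y = (rng.uniform(size=200) < p_true).astype(int)

model = LogisticRegression()
model.fit(x.reshape(-1, 1), y)

# What is fit: one weight per feature plus an intercept
w = model.coef_[0, 0]
b = model.intercept_[0]

# How predictions are made: pass the linear score w*x + b through the sigmoid ...
grid = np.array([0.0, 10.0, 20.0, 30.0]).reshape(-1, 1)
manual_proba = 1 / (1 + np.exp(-(w * grid[:, 0] + b)))

# ... and predict class 1 wherever that probability exceeds 0.5
manual_pred = (manual_proba > 0.5).astype(int)

print("probabilities match:", np.allclose(manual_proba, model.predict_proba(grid)[:, 1]))
print("class labels match:", np.array_equal(manual_pred, model.predict(grid)))
```

The hand-computed values are compared against `predict_proba` and `predict` so that the link between the fitted parameters, the sigmoid curve, and the 0.5 threshold can be verified rather than taken on faith.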

## EXERCISE: Train a Baseline Model

Tasks:
1. Fit a `LogisticRegression()` model with your own set of features. Make sure to look at the accuracy of your model and compare it to the simple case we showed above.

#### TASK 1: Fit your own `LogisticRegression()` with your own set of Features

In [None]:
# TASK 1 EXERCISE

''' ADD YOUR CODE HERE '''

**DISCUSSION QUESTION:** Review the concepts of one-hot and multi-hot encoding through the concept pages on Notion. A single feature of a dataset is usually stored as a column vector (5 rows x 1 column, that is, 5 data points x 1 feature). Given the following feature with 5 samples, what would its one-hot representation be?

*YOUR ANSWER HERE:*

## Feature Engineering

### EXERCISE: Encode Categorical Data

Tasks:
1. Encode the `shot_zone_basic` feature as one-hot vectors using `pd.get_dummies(~, prefix = "shot_zone")`. Then, identify the shot zone for the 979th datapoint using the `.iloc[~]` indexer.
2. Train a `LogisticRegression()` model with this new feature. You can add the new feature using the `df.join(~)` method. How many extra columns were created for this new variable? Did the accuracy improve?
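
The encode-inspect-join pattern from these tasks can be sketched on a small invented DataFrame. The column names and categories below are hypothetical and only loosely resemble the Kobe data.

```python
import pandas as pd

# Hypothetical toy data, NOT the Kobe dataset
toy = pd.DataFrame({
    "zone": ["Paint", "Mid-Range", "Paint", "Three"],
    "made": [1, 0, 1, 0],
})

# One new column per distinct category, named "<prefix>_<category>"
zone_dummies = pd.get_dummies(toy["zone"], prefix="zone")
print(list(zone_dummies.columns))

# Inspect a single row by position: exactly one column is active
print(zone_dummies.iloc[[3]])

# .join aligns on the index and appends the new columns to the original frame
toy_encoded = toy.join(zone_dummies)
print(toy_encoded.shape)
```

The number of added columns equals the number of distinct categories, which is worth keeping in mind for features with many unique values.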

#### TASK 1: Generate a One-Hot Vector Encoding of `shot_zone_basic`

In [None]:
# TASK 1 EXERCISE

# Display the frequency of each category in the 'shot_zone_basic' column
print(KobeDataset["shot_zone_basic"].value_counts())
print('\n')

# Generate one-hot encoded columns for each category in 'shot_zone_basic'
# and use 'shot_zone' as the prefix for the new columns
shot_zones = <REPLACE ME AND MY ARROWS>

# Show the first five rows of the resulting one-hot encoded DataFrame
display(shot_zones.head(5))

# Check the shot_zone for datapoint 979
''' ADD YOUR CODE HERE '''

#### TASK 2: Train your own `LogisticRegression()` with this new Categorical Variable

In [None]:
# TASK 2 EXERCISE

# Add this New Feature to your Dataset
print("Shape before join:", KobeDataset.shape)
KobeDataset = <REPLACE ME AND MY BRACKETS>
print("Shape after join:", KobeDataset.shape)

# Look at your New Columns
print(list(KobeDataset.columns))

# TRAIN A NEW LOGISTIC REGRESSION ----------------------------------------------

''' ADD YOUR CODE HERE '''

**DISCUSSION QUESTION:** What potential issues might arise from using one-hot encoding on a feature with many unique categories?

*YOUR ANSWER HERE:*

**DISCUSSION QUESTION:** What is the purpose of the sigmoid function in logistic regression?

*YOUR ANSWER HERE:*

**DISCUSSION QUESTION:** What is the difference between numerical discrete and categorical nominal data types? Give an example of each from the Kobe Bryant dataset.

*YOUR ANSWER HERE:*

**DISCUSSION QUESTION:** In the context of the Kobe Bryant shot dataset, what are some examples of categorical features that might benefit from one-hot encoding?

*YOUR ANSWER HERE:*

**DISCUSSION QUESTION:** How does one-hot encoding work, and when should it be used?

*YOUR ANSWER HERE:*

**DISCUSSION QUESTION:** What is feature engineering and why is it important in machine learning?

*YOUR ANSWER HERE:*