# Module 03: EDA

In [None]:
# packages
import numpy as np 
import matplotlib.pyplot as plt
from matplotlib.pyplot import subplots
from sklearn.model_selection import train_test_split 
from ISLP import load_data

# set seed
seed = 2323

### We'll use the _Hitters_ data from ISLP for this activity. The metadata for _Hitters_ can be found [here](https://intro-stat-learning.github.io/ISLP/datasets/Hitters.html).

In [None]:
# Load the data
Hitters = load_data('Hitters')

### Determine the number of rows and columns in the dataset by returning its "shape" attribute

In [None]:
Hitters.shape

### Determine whether each feature is numeric or categorical by returning the "dtype" attribute for each column

In [None]:
for col in Hitters.columns:
    print(col,Hitters[col].dtype)
    

### Before doing any other analyses, let's create training and test sets.

In [None]:
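Train, Test = train_test_split(Hitters, 
                               random_state=seed, 
                               test_size=0.40, 
                               shuffle=True) 

# Work on explicit copies so that adding columns later does not trigger SettingWithCopyWarning
Train, Test = Train.copy(), Test.copy()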

### Based on the metadata, what is the difference between the 6 columns starting with 'C' and the 6 related columns that don't?

The columns starting with 'C' represent cumulative career statistics for the player up until the end of 1986. The corresponding columns without the 'C' represent only the player's performance during the 1986 season. Thus, the 'C' variables measure long-term experience and production, whereas the others measure recent single-season output.

### On the training set, create pairwise scatterplots for each of these 6 columns with the 'Salary' variable.

In [None]:

cols = ['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks']


fig, axes = subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for i, col in enumerate(cols):
    axes[i].scatter(Train[col], Train['Salary'], alpha=0.5)
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Salary')
    axes[i].set_title(f'{col} vs Salary')

plt.tight_layout()
plt.show()

### Use the "describe" method to determine the mean, standard deviation, and 5 number summary of all numeric variables in the training subset of _Hitters_.

In [None]:
Train.describe()

### It looks like the mean and median of 'AtBat' are nearly equal. This _might_ suggest that this variable is normally distributed. Create a histogram of 'AtBat' to check this hypothesis.

In [None]:
plt.figure(figsize=(8, 5))
plt.hist(Train['AtBat'], bins=20, color='skyblue', edgecolor='black')
plt.title('Distribution of AtBat (Training Set)')
plt.xlabel('AtBat')
plt.ylabel('Frequency')
plt.show()
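
A nearly equal mean and median indicates symmetry, not normality. The sketch below uses two synthetic samples (not the Hitters data, and the seed and parameters are arbitrary assumptions) to show that a uniform sample also has mean close to median, while a shape statistic such as excess kurtosis separates it from a normal sample.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Two synthetic samples (illustrative values, not taken from Hitters)
normal_sample = rng.normal(loc=400, scale=100, size=10_000)
uniform_sample = rng.uniform(low=200, high=600, size=10_000)

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (close to 0 for normal data)."""
    z = (x - x.mean()) / x.std()
    return (z ** 4).mean() - 3

for name, x in [("normal", normal_sample), ("uniform", uniform_sample)]:
    gap = abs(x.mean() - np.median(x))
    print(f"{name:8s} |mean - median| = {gap:6.2f}, "
          f"excess kurtosis = {excess_kurtosis(x):5.2f}")
```

Both samples have a small gap between mean and median, so the histogram (or a shape statistic) is still needed to judge normality.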

### Let's standardize the AtBat feature (i.e., normalize by z-scores). We'll create a new column in the training data called 'AtBat_st' to represent this.

In [None]:

Train['AtBat_st'] = (Train['AtBat'] - Train['AtBat'].mean()) / Train['AtBat'].std()


Train[['AtBat', 'AtBat_st']].head()

### How many rows have an 'AtBat' value within one standard deviation of the mean?

Hint: the built-in 'len' function returns the number of rows of a DataFrame.

In [None]:

within_one_std = Train[(Train['AtBat_st'] >= -1) & (Train['AtBat_st'] <= 1)]


len(within_one_std)
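
For context, roughly 68% of a normal distribution lies within one standard deviation of the mean, so the share of rows in that range is another rough normality check. A minimal sketch on synthetic data (the `demo` frame and its parameters are assumptions, not the Hitters data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed

# Synthetic stand-in for an 'AtBat' column drawn from a normal distribution
demo = pd.DataFrame({"AtBat": rng.normal(400, 150, size=5_000)})

# Same z-score standardization as above
demo["AtBat_st"] = (demo["AtBat"] - demo["AtBat"].mean()) / demo["AtBat"].std()

# between() is inclusive on both ends, matching the >= -1 and <= 1 filter
share_within_one = demo["AtBat_st"].between(-1, 1).mean()
print(f"Share within one standard deviation: {share_within_one:.1%}")
```

On the training set, dividing `len(within_one_std)` by `len(Train)` gives the comparable share.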

### Going back to the results of the 'describe' method, how can you tell that the 'Salary' variable has missing values?

You can tell by comparing the count value of the Salary variable to the count values of the other features in the describe() output. If Salary has a lower count than the other columns (or lower than the total number of rows in the training set), it shows that some entries are missing (NaN).
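
The comparison described above can be sketched on a small made-up table (the values are invented for illustration). The count row of describe() excludes NaN, so the difference from the row count equals the number of missing entries.

```python
import numpy as np
import pandas as pd

# Toy data with two missing salaries (values are made up)
toy = pd.DataFrame({
    "AtBat": [300, 450, 120, 510, 275],
    "Salary": [475.0, np.nan, 90.0, np.nan, 600.0],
})

summary = toy.describe()
n_rows = len(toy)

# describe() counts only non-missing values
missing_from_count = n_rows - summary.loc["count", "Salary"]

# Direct check that does not rely on describe()
missing_direct = toy["Salary"].isna().sum()

print(missing_from_count, missing_direct)
```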

### Describe a situation where a variable could have missing values but this would not be reflected in the results of the 'describe' method.

This occurs when missing data is encoded with placeholder values (like 0, -1, or 999) or strings like "Unknown" or "N/A" instead of actual NaN values. describe() treats numeric placeholders as valid data points, so the count appears full even though the information is missing. Placeholder strings, empty strings, and whitespace also count as non-null, and they turn the column into an object column, which describe() omits by default.
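
A small illustration of the placeholder problem, assuming a hypothetical convention in which -1 codes an unknown salary (the data are made up):

```python
import numpy as np
import pandas as pd

# -1 is a hypothetical sentinel for "salary unknown"
toy = pd.DataFrame({"Salary": [475.0, -1.0, 90.0, -1.0, 600.0]})

# The sentinel is counted as valid data and drags the mean down
before = toy["Salary"].describe()

# Converting the sentinel to NaN exposes the missing values
cleaned = toy["Salary"].replace(-1.0, np.nan)
after = cleaned.describe()

print("count before:", before["count"], "count after:", after["count"])
print("mean before:", before["mean"], "mean after:", after["mean"])
```

Checking the metadata for documented sentinel codes before trusting the count row is the safer habit.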

### On the training data, create separate boxplots of the 'AtBat' variable for when 'Salary' is populated or missing.

In [None]:

NoSalary = Train[Train['Salary'].isna()]
HasSalary = Train[Train['Salary'].notna()]

Train['Salary_Missing'] = Train['Salary'].isna()

Train.boxplot(column='AtBat', by='Salary_Missing', figsize=(8, 6))

plt.title('AtBat Distribution by Salary Availability')
plt.suptitle('') 
plt.xlabel('Is Salary Missing?')
plt.ylabel('AtBat')
plt.xticks([1, 2], ['Populated', 'Missing'])
plt.show()

### Create a correlation matrix for all numeric features in the training set

In [None]:

corr_matrix = Train.select_dtypes(include=[np.number]).corr()


corr_matrix
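
A full correlation matrix can be hard to scan. One option is to pull out a single column and rank the other features by absolute correlation with it. The sketch below does this on synthetic data (the `toy` frame and its coefficients are assumptions chosen so that Hits drives Salary).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)  # arbitrary seed
n = 300

# Synthetic data: Salary depends on Hits, Walks is unrelated noise
hits = rng.normal(100, 30, size=n)
toy = pd.DataFrame({
    "Hits": hits,
    "Walks": rng.normal(40, 15, size=n),
    "Salary": 5 * hits + rng.normal(0, 60, size=n),
})

# Correlation of every other feature with Salary, strongest first
salary_corr = (
    toy.corr()["Salary"]
    .drop("Salary")
    .sort_values(key=abs, ascending=False)
)
print(salary_corr)
```

The same pattern applied to `corr_matrix['Salary']` would rank the training-set features.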

### Propose two different ways of imputing the missing values of Salary while taking advantage of the information given in the boxplots or the correlation matrix.

1. Conditional Mean/Median Imputation: Based on the boxplots, if players with missing salaries have lower AtBat counts, you can impute the missing values using the median salary of similar subgroups (e.g., players with similar experience or performance levels) rather than a single global average.

2. Regression Imputation: Utilizing the correlation matrix, you can identify features strongly linked to Salary (like Hits or CRBI) and use a linear regression model to predict and fill in the missing values based on those specific performance metrics.
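
The two proposals above can be sketched as follows. This is a minimal illustration on synthetic data, assuming a single predictor (Hits) and hypothetical column names for the imputed results. It is not a recommended final pipeline.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)  # arbitrary seed
n = 200

# Synthetic data: Salary roughly linear in Hits, with 30 values removed
hits = rng.integers(20, 200, size=n).astype(float)
salary = 100 + 3.0 * hits + rng.normal(0, 50, size=n)
salary[rng.choice(n, size=30, replace=False)] = np.nan
toy = pd.DataFrame({"Hits": hits, "Salary": salary})

# 1. Conditional median: group players into Hits quartiles and
#    fill each gap with the median salary of its own group
toy["Hits_group"] = pd.qcut(toy["Hits"], q=4, labels=False, duplicates="drop")
group_median = toy.groupby("Hits_group")["Salary"].transform("median")
toy["Salary_median_imp"] = toy["Salary"].fillna(group_median)

# 2. Regression: fit Salary ~ Hits on the rows where Salary is known,
#    then predict the missing rows from the fitted line
known = toy["Salary"].notna()
slope, intercept = np.polyfit(toy.loc[known, "Hits"],
                              toy.loc[known, "Salary"], deg=1)
toy["Salary_reg_imp"] = toy["Salary"].fillna(intercept + slope * toy["Hits"])

print(toy[["Salary", "Salary_median_imp", "Salary_reg_imp"]].isna().sum())
```

In practice, any medians or regression coefficients should be estimated on the training set only and then reused on the test set.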

### For our last exercise, we'll explore Hits and Walks relative to AtBat totals. 
- Use the sum function to calculate the totals of each of these three variables for the 1986 season (on the training set). 
- Create a pie chart which shows total hits, total walks, and remaining total (neither) as percents of the At Bats total (on the training set). 

In [None]:
TotHits = Train['Hits'].sum()
TotWalks = Train['Walks'].sum()
TotAtBat = Train['AtBat'].sum()

Labels = ['Hits', 'Walks', 'Neither']
Totals = [TotHits, TotWalks, TotAtBat - TotHits - TotWalks]

In [None]:
plt.figure(figsize=(7, 7))
plt.pie(Totals, labels=Labels, autopct='%1.1f%%', startangle=140, colors=['skyblue', 'lightgreen', 'lightcoral'])
plt.title('Proportion of Hits and Walks relative to Total At Bats (1986 Season)')
plt.show()


### The previous two cells gave us totals across all players. For each player in the training set, calculate Hits as a proportion of AtBat (the batting average) and store it in a new variable called 'AVG'

In [None]:

Train['AVG'] = Train['Hits'] / Train['AtBat']

Train[['Hits', 'AtBat', 'AVG']].head()

### Using 0.25 and 0.31 as the split points, create a new variable with three bins: high, medium, and low. 

In [None]:
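# Start everyone at 'medium', then overwrite the tails with .loc
# (chained indexing like Train['AVG_bin'][mask] = ... may not modify Train)
Train['AVG_bin'] = 'medium'
Train.loc[Train['AVG'] < 0.25, 'AVG_bin'] = 'low'
Train.loc[Train['AVG'] > 0.31, 'AVG_bin'] = 'high'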

### Create a bar chart that displays the number of players in each of the low, medium, and high categories (for the training data).

In [None]:

Train['AVG_bin'].value_counts().plot(kind='bar', color=['skyblue', 'salmon', 'lightgreen'])


plt.title('Number of Players by Batting Average Category')
plt.xlabel('Batting Average Category')
plt.ylabel('Count')
plt.xticks(rotation=0) 

Notice that value_counts sorts by frequency, so the bars appear in count order (medium, low, high) rather than the natural low, medium, high. That's counterintuitive. We can reorder these quickly. 

In [None]:
indexMap = ['low', 'medium', 'high']
# Compute the counts once, then pull them out in the desired order
reordered_list = Train['AVG_bin'].value_counts().reindex(indexMap).tolist()

In [None]:
plt.bar(indexMap, reordered_list)

plt.title("1986 AVG (Training Set)")
plt.ylabel("Number of Players")

plt.xticks(ticks=range(len(indexMap)), labels=indexMap)

plt.show()
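
An alternative to reordering by hand is to store the bins as an ordered categorical, so that any later count or plot follows low, medium, high automatically. A minimal sketch with a made-up series of labels (the `avg_bin` values are illustrative):

```python
import pandas as pd

# Made-up bin labels standing in for Train['AVG_bin']
avg_bin = pd.Series(["medium", "low", "medium", "high", "low", "medium"])

# Declare the natural order once
ordered = pd.Categorical(avg_bin,
                         categories=["low", "medium", "high"],
                         ordered=True)

# sort_index() now follows the category order, not alphabetical or count order
counts = pd.Series(ordered).value_counts().sort_index()
print(counts)
```

Calling `counts.plot(kind='bar')` on such a result would draw the bars in the declared order.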

### Did we use the depth method or width method for creating these bins? Explain.

We used the width method because the bins were defined by specific numerical thresholds (0.25 and 0.31) rather than the number of players. This approach prioritizes fixed ranges of performance, meaning the bins can have different "depths" (counts). A depth method would have used quantiles to ensure an equal number of players in each category.
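
To make the contrast concrete, the sketch below bins a synthetic batting-average series three ways: equal-width ranges with `pd.cut`, equal-depth quantiles with `pd.qcut`, and the fixed 0.25 and 0.31 thresholds. The data and seed are assumptions. Note that `pd.cut` uses right-inclusive intervals, so a value of exactly 0.25 would land in 'low' here, unlike in the manual assignment above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)  # arbitrary seed
avg = pd.Series(rng.normal(0.26, 0.03, size=300))  # synthetic averages

labels = ["low", "medium", "high"]

# Width: three ranges of equal length between min and max
width_bins = pd.cut(avg, bins=3, labels=labels)

# Depth: three bins holding (nearly) equal numbers of players
depth_bins = pd.qcut(avg, q=3, labels=labels)

# Fixed thresholds, as in this activity
threshold_bins = pd.cut(avg, bins=[-np.inf, 0.25, 0.31, np.inf], labels=labels)

print(pd.DataFrame({
    "width": width_bins.value_counts().sort_index(),
    "depth": depth_bins.value_counts().sort_index(),
    "threshold": threshold_bins.value_counts().sort_index(),
}))
```

Only the depth column has equal counts. The width and threshold columns let the counts vary because their boundaries are set on the value scale.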