In [1]:
import pandas as pd

DATA = '/kaggle/input/negro-league-and-mlb-player-ratings/negro-leagues-player-ratings.csv'
df = pd.read_csv(filepath_or_buffer=DATA)
df.head()


Unnamed: 0,playerID,commonName,league,hof,startYear,endYear,totalGames,positionWar,averageHit,patience,...,shortWar,positionCat,position,careerStarts,strikeOuts,control,fip,whip,era,fact
0,culbech01,Charlie Culberson,MLB,0,2012,2020,428,-0.62,41.791451,13.776205,...,-0.234673,Outfielder,Batter,,,,,,,
1,gosseph01,Phil Gosselin,MLB,0,2013,2020,359,0.895,72.992105,28.641438,...,0.403872,Middle IF,Batter,,,,,,,
2,herrmch01,Chris Herrmann,MLB,0,2012,2019,370,-1.15,3.648244,70.10618,...,-0.503514,Catcher,Batter,,,,,,,
3,kratzer01,Erik Kratz,MLB,0,2010,2020,335,1.715,21.236047,19.112442,...,0.829343,Catcher,Batter,,,,,,,
4,pireljo01,Jose Pirela,MLB,0,2014,2019,302,0.545,67.57419,18.976314,...,0.292351,Middle IF,Batter,,,,,,,


How much data do we have?

In [2]:
df.shape

(1117, 25)

We have position players and pitchers, and the two groups have different stats. We expect a t-SNE scatter plot to separate them into two clusters. Let's find out.
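Before embedding, it can help to confirm which stats are missing for which kind of player. The following is a minimal sketch on a hypothetical toy frame: the helper name `missing_share_by_group`, the toy values, and the `'Pitcher'` label are assumptions for illustration, not taken from the dataset.

```python
import numpy as np
import pandas as pd

def missing_share_by_group(frame, group_col, stat_cols):
    """Return the fraction of missing values in each stat column, per group."""
    return frame[stat_cols].isna().groupby(frame[group_col]).mean()

# Hypothetical toy rows; the values and the 'Pitcher' label are assumptions.
toy = pd.DataFrame({
    'position': ['Batter', 'Batter', 'Pitcher'],
    'positionWar': [-0.62, 0.895, np.nan],
    'era': [np.nan, np.nan, 55.0],
})

missing = missing_share_by_group(toy, 'position', ['positionWar', 'era'])
print(missing)
```

On the notebook's data the same helper could be called as `missing_share_by_group(df, 'position', ['positionWar', 'era'])`; a share near 1 in a column means that stat is essentially absent for that group.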

In [3]:
from sklearn.manifold import TSNE

RANDOM_STATE = 2025
reducer = TSNE(random_state=RANDOM_STATE)
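# Features for both player types; stats that do not apply to a player are NaN.
FEATURES = ['positionWar', 'averageHit', 'patience', 'power', 'speed', 'defense',
            'careerStarts', 'strikeOuts', 'control', 'fip', 'whip', 'era']
# t-SNE cannot handle NaN, so fill missing stats with zero before embedding.
embedding = reducer.fit_transform(X=df[FEATURES].fillna(value=0))
plot_df = pd.DataFrame(data=embedding, columns=['x', 'y'])
# Carry over the columns used for hover text and coloring.
plot_df['commonName'] = df['commonName'].tolist()
plot_df['hof'] = (df['hof'] == 1).tolist()  # boolean, so plotly colors it as a category
plot_df['position'] = df['position'].tolist()
plot_df['league'] = df['league'].tolist()
plot_df['startYear'] = df['startYear'].tolist()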

t-SNE cannot accept missing values, so we need to fill them in, and zero seems like the most reasonable fill value. As a result, of course, our plot splits neatly in two: pitchers on one side and position players on the other.
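To see why zero-filling pushes the two groups apart, here is a small sketch on hypothetical toy data (the values and the `'Pitcher'` label are assumptions). After the fill, each group is zero exactly where the other group has real values, so distances between groups are much larger than distances within a group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: batters lack pitching stats and vice versa.
toy = pd.DataFrame({
    'position': ['Batter', 'Batter', 'Pitcher', 'Pitcher'],
    'averageHit': [40.0, 70.0, np.nan, np.nan],
    'power': [55.0, 30.0, np.nan, np.nan],
    'era': [np.nan, np.nan, 60.0, 45.0],
    'whip': [np.nan, np.nan, 50.0, 65.0],
})

# Same fill strategy as the notebook: missing stats become zero.
filled = toy.drop(columns='position').fillna(value=0).to_numpy()

# Pairwise Euclidean distances between all rows.
diffs = filled[:, None, :] - filled[None, :, :]
dist = np.sqrt((diffs ** 2).sum(axis=-1))

# Masks for same-group pairs (excluding self-pairs) and cross-group pairs.
positions = toy['position'].to_numpy()
same = positions[:, None] == positions[None, :]
off_diagonal = ~np.eye(len(toy), dtype=bool)

within = dist[same & off_diagonal].mean()
between = dist[~same].mean()
print(f'mean within-group distance:  {within:.1f}')
print(f'mean between-group distance: {between:.1f}')
```

This only illustrates the geometry of the zero fill; it says nothing about whether zero is the best choice for other analyses.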

In [4]:
from plotly import express
from plotly import io

io.renderers.default = 'iframe'
express.scatter(data_frame=plot_df, x='x', y='y', hover_name='commonName', color='position')

Now let's look at the same scatter plot and see where the Hall of Famers are.

In [5]:
express.scatter(data_frame=plot_df, x='x', y='y', hover_name='commonName', color='hof')

What do we see? For the most part the Hall of Famers cluster statistically, whether they are pitchers or position players. We do see some exceptions, who in our plot have more neighbors of the opposite type than of their own type. There are multiple reasons why this might be true: the obvious reason is that the Hall of Fame is not determined by statistics; another is that there are some active players in our dataset, and active players are not eligible for the Hall.
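The "more neighbors of the opposite type" observation can be made concrete by computing, for each point, the share of its nearest neighbors that carry the same label. The sketch below uses plain NumPy on hypothetical toy coordinates; the helper name `same_label_neighbor_share` and the choice of `k` are assumptions, not part of the notebook.

```python
import numpy as np

def same_label_neighbor_share(points, labels, k=3):
    """For each point, return the fraction of its k nearest neighbors with the same label."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    return (labels[nearest] == labels[:, None]).mean(axis=1)

# Hypothetical 2-D embedding: two tight groups, with one odd label in the second.
xy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
hof = [True, True, True, False, False, True]

share = same_label_neighbor_share(xy, hof, k=2)
print(share)
```

Applied to the notebook's frame, the call would look like `same_label_neighbor_share(plot_df[['x', 'y']], plot_df['hof'], k=10)`; points with a share below 0.5 are the exceptions described above. Because t-SNE distorts distances, this is a description of the plot rather than of the underlying stats.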

Now let's color by league.

In [6]:
express.scatter(data_frame=plot_df, x='x', y='y', hover_name='commonName', color='league')

What do we see? For the most part, players tend to have neighbors from their own league, but this holds more strongly for pitchers than for position players.

Let's plot our players by start year.

In [7]:
express.scatter(data_frame=plot_df, x='x', y='y', hover_name='commonName', color='startYear')

Our start year data looks bimodal. How bimodal is it?

In [8]:
express.histogram(data_frame=df, x='startYear', nbins=120)

We've plotted our start years in roughly annual bins, and the results look strongly bimodal. This matters, because professional baseball in 2018 is very different from professional baseball in 1920.
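One rough way to check the bimodality without a plot is to count players per decade and look for a run of empty or nearly empty decades between the two peaks. This is a sketch on hypothetical start years; the values below are made up for illustration.

```python
import pandas as pd

# Hypothetical start years: one group in the 1920s-30s, one in the 2010s.
start_years = pd.Series(
    [1920, 1923, 1925, 1931, 1938, 2010, 2012, 2012, 2014, 2018],
    name='startYear',
)

# Map each year to the first year of its decade, then count per decade.
decade = (start_years // 10) * 10
per_decade = decade.value_counts().sort_index()

# Reindex over every decade in the range so empty decades show up as zeros.
all_decades = range(int(decade.min()), int(decade.max()) + 10, 10)
full = per_decade.reindex(all_decades, fill_value=0)
print(full)
```

With the notebook's data, `start_years` would be `df['startYear']`. A long stretch of zeros between two busy periods supports reading the distribution as two separate eras.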