# New York Yankees Data 1913-2016

Data obtained from: https://www.kaggle.com/datasets/timschutzyang/dataset1?resource=download

**Goal:** The goal of this project is to identify trends within the tracked New York Yankees statistics from the dataset provided.

## Setting up my data

The first thing I want to do is import my data and get it ready to use. I import my libraries, read in my CSV file, and check that it loaded properly.

In [1]:
# import libraries
import pandas as pd
import numpy as np

In [2]:
# import csv
df = pd.read_csv('baseballdata.csv')

# check data frame
df.head()

,Unnamed: 0,Rk,Year,Tm,Lg,G,W,L,Ties,W.L.,...,R,RA,Attendance,BatAge,PAge,X.Bat,X.P,Top.Player,Managers,current
0,1,1,2016,Arizona Diamondbacks,NL West,162,69,93,0,0.426,...,752,890,2036216,26.7,26.4,50,29,J.Segura (5.7),C.Hale (69-93),Arizona Diamondbacks
1,2,2,2015,Arizona Diamondbacks,NL West,162,79,83,0,0.488,...,720,713,2080145,26.6,27.1,50,27,P.Goldschmidt (8.8),C.Hale (79-83),Arizona Diamondbacks
2,3,3,2014,Arizona Diamondbacks,NL West,162,64,98,0,0.395,...,615,742,2073730,27.6,28.0,52,25,P.Goldschmidt (4.5),K.Gibson (63-96) and A.Trammell (1-2),Arizona Diamondbacks
3,4,4,2013,Arizona Diamondbacks,NL West,162,81,81,0,0.5,...,685,695,2134895,28.1,27.6,44,23,P.Goldschmidt (7.1),K.Gibson (81-81),Arizona Diamondbacks
4,5,5,2012,Arizona Diamondbacks,NL West,162,81,81,0,0.5,...,734,688,2177617,28.3,27.4,48,23,A.Hill (5.0),K.Gibson (81-81),Arizona Diamondbacks


In [3]:
# verify type
type(df)

pandas.core.frame.DataFrame

Now that I know my data is correctly stored in a data frame, I want to learn more about the data I will be working with. I don't care about the size of the entire data frame, since I only want to work with data for the New York Yankees. Here is what I want to know:

- The names of each column
- The data types of each column

In [4]:
# get the name of each column
df.columns

Index(['Unnamed: 0', 'Rk', 'Year', 'Tm', 'Lg', 'G', 'W', 'L', 'Ties', 'W.L.',
       'pythW.L.', 'Finish', 'GB', 'Playoffs', 'R', 'RA', 'Attendance',
       'BatAge', 'PAge', 'X.Bat', 'X.P', 'Top.Player', 'Managers', 'current'],
      dtype='object')

In [5]:
# get data types of each column
df.dtypes

Unnamed: 0      int64
Rk              int64
Year            int64
Tm             object
Lg             object
G               int64
W               int64
L               int64
Ties            int64
W.L.          float64
pythW.L.      float64
Finish         object
GB             object
Playoffs       object
R               int64
RA              int64
Attendance     object
BatAge        float64
PAge          float64
X.Bat           int64
X.P             int64
Top.Player     object
Managers       object
current        object
dtype: object

## Cleaning the data

After familiarizing myself with the data, there are a few changes that should be made before I continue. The `Unnamed: 0`, `Rk`, `pythW.L.`, and `current` columns are not going to be needed, so I can remove them from the frame.

In [6]:
# drop the desired columns
df = df.drop(columns=['Rk', 'current', 'Unnamed: 0', 'pythW.L.'])

# test changes
df.head()

,Year,Tm,Lg,G,W,L,Ties,W.L.,Finish,GB,Playoffs,R,RA,Attendance,BatAge,PAge,X.Bat,X.P,Top.Player,Managers
0,2016,Arizona Diamondbacks,NL West,162,69,93,0,0.426,4th of 5,22.0,,752,890,2036216,26.7,26.4,50,29,J.Segura (5.7),C.Hale (69-93)
1,2015,Arizona Diamondbacks,NL West,162,79,83,0,0.488,3rd of 5,13.0,,720,713,2080145,26.6,27.1,50,27,P.Goldschmidt (8.8),C.Hale (79-83)
2,2014,Arizona Diamondbacks,NL West,162,64,98,0,0.395,5th of 5,30.0,,615,742,2073730,27.6,28.0,52,25,P.Goldschmidt (4.5),K.Gibson (63-96) and A.Trammell (1-2)
3,2013,Arizona Diamondbacks,NL West,162,81,81,0,0.5,2nd of 5,11.0,,685,695,2134895,28.1,27.6,44,23,P.Goldschmidt (7.1),K.Gibson (81-81)
4,2012,Arizona Diamondbacks,NL West,162,81,81,0,0.5,3rd of 5,13.0,,734,688,2177617,28.3,27.4,48,23,A.Hill (5.0),K.Gibson (81-81)
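
As a side note on where a column like `Unnamed: 0` tends to come from: it usually appears when a CSV was saved with its index and then read back without naming that index. The sketch below uses a small made-up CSV string (not the Kaggle file) to show that passing `index_col=0` to `pd.read_csv` is one way to avoid the extra column up front. Whether this applies cleanly to `baseballdata.csv` is an assumption.

```python
import io
import pandas as pd

# Made-up CSV text whose first column is an unnamed saved index.
csv_text = ",Year,Tm\n0,2016,New York Yankees\n1,2015,New York Yankees\n"

# Default read: the unnamed index column shows up as 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))

# Treating the first column as the index avoids the extra column.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

raw_columns = list(raw.columns)
indexed_columns = list(indexed.columns)
print(raw_columns)
print(indexed_columns)
```

Dropping the column afterwards, as done above, gives the same result. The `index_col` route just saves a step.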


`Attendance` is the only column that needs its data type changed at the moment. I want to change it from `object` to `int`. Before I can do that, I need to make sure this column does not contain any `NaN` values.

In [7]:
# check for columns that contain null values
df.isnull().sum()

Year             0
Tm               0
Lg               0
G                0
W                0
L                0
Ties             0
W.L.             0
Finish           0
GB               0
Playoffs      2163
R                0
RA               0
Attendance      74
BatAge           0
PAge             0
X.Bat            0
X.P              0
Top.Player       0
Managers         0
dtype: int64

Since my `Attendance` column does have `NaN` values, I will need to replace those with a `0` placeholder value in order to change the data type.

In [8]:
# replace null values with 0
df['Attendance'] = df['Attendance'].fillna(0)

In [9]:
# check again for null values
df.isnull().sum()

Year             0
Tm               0
Lg               0
G                0
W                0
L                0
Ties             0
W.L.             0
Finish           0
GB               0
Playoffs      2163
R                0
RA               0
Attendance       0
BatAge           0
PAge             0
X.Bat            0
X.P              0
Top.Player       0
Managers         0
dtype: int64

`Attendance` is now free of any null data. One more thing I need to do is remove the commas from the numbers. If I try to change the type to `int` or `float` before removing the commas I will receive an error. 

In [10]:
# remove commas
df['Attendance'] = df['Attendance'].str.replace("[,]","", regex=True)

# verify commas have been removed
df['Attendance']

0       2036216
1       2080145
2       2073730
3       2134895
4       2177617
         ...   
2589    1246863
2590    1142145
2591    1290963
2592    1424683
2593    1212608
Name: Attendance, Length: 2594, dtype: object

In [11]:
# change data type
df['Attendance'] = df['Attendance'].astype('Int64')

# verify type change
df['Attendance'].dtype

Int64Dtype()
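
The fill, strip, and cast steps above can also be sketched without the `0` placeholder. The example below is a minimal sketch on made-up values (the Series and its contents are hypothetical, not taken from the dataset): it strips the commas, converts with `pd.to_numeric`, and casts to the nullable `Int64` type, so missing seasons stay `<NA>` and are not mistaken for a real attendance of zero.

```python
import pandas as pd

# Made-up attendance values: comma-formatted strings plus one missing entry.
attendance = pd.Series(["2,036,216", "3,063,405", None], dtype="object")

# Strip the commas first; the missing entry simply stays missing.
stripped = attendance.str.replace(",", "", regex=False)

# Convert to numbers, then to the nullable integer type so that
# missing seasons stay <NA> instead of becoming a misleading 0.
cleaned = pd.to_numeric(stripped, errors="coerce").astype("Int64")

missing_count = int(cleaned.isna().sum())
print(cleaned)
print(missing_count)
```

The trade-off is that later calculations have to tolerate `<NA>`, but averages and totals are no longer pulled down by placeholder zeros.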

I want to rename the columns to make them a bit easier to understand. 

In [12]:
# create the list of columns to rename
col_to_rename = {'Tm': 'Team', 'Lg': 'League','G': 'Games', 'W': 'Wins', 'L': 'Losses', 'W.L.': 'WinPercentage', 
                 'GB': 'GamesBack', 'R': 'RunsScored', 'RA': 'RunsAllowed', 'PAge': 'PitAge', 'X.Bat': 'NumBattersUsed', 
                 'X.P': 'NumPitchersUsed', 'Top.Player': 'TopPlayer'}

# load in new names
df = df.rename(columns=col_to_rename)

# test new column names
df.columns

Index(['Year', 'Team', 'League', 'Games', 'Wins', 'Losses', 'Ties',
       'WinPercentage', 'Finish', 'GamesBack', 'Playoffs', 'RunsScored',
       'RunsAllowed', 'Attendance', 'BatAge', 'PitAge', 'NumBattersUsed',
       'NumPitchersUsed', 'TopPlayer', 'Managers'],
      dtype='object')

## Diving into the Yankees

Since I only want to analyze data related to the Yankees, I should create a new data frame that includes only their rows.

In [13]:
# create a new data frame that takes data for only the Yankees 
yankees = df.loc[df.Team == 'New York Yankees'].reset_index(drop=True)
yankees.head()

,Year,Team,League,Games,Wins,Losses,Ties,WinPercentage,Finish,GamesBack,Playoffs,RunsScored,RunsAllowed,Attendance,BatAge,PitAge,NumBattersUsed,NumPitchersUsed,TopPlayer,Managers
0,2016,New York Yankees,AL East,162,84,78,0,0.519,4th of 5,9.0,,680,702,3063405,30.0,27.9,42,29,M.Tanaka (5.4),J.Girardi (84-78)
1,2015,New York Yankees,AL East,162,87,75,0,0.537,2nd of 5,6.0,Lost ALWC (1-0),764,698,3193795,31.2,27.4,56,33,M.Teixeira (3.8),J.Girardi (87-75)
2,2014,New York Yankees,AL East,162,84,78,0,0.519,2nd of 5,12.0,,633,664,3401624,32.5,29.3,58,33,B.Gardner (4.0),J.Girardi (84-78)
3,2013,New York Yankees,AL East,162,85,77,0,0.525,3rd of 5,12.0,,650,671,3279589,31.8,31.8,56,24,R.Cano (7.8),J.Girardi (85-77)
4,2012,New York Yankees,AL East,162,95,67,0,0.586,1st of 5,--,Lost ALCS (4-0),804,668,3542406,32.7,30.3,45,23,R.Cano (8.4),J.Girardi (95-67)
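
To make the filtering step easier to follow, here is a small sketch on a made-up three-row frame (the `teams` frame and its values are illustrative only). It separates the boolean mask from the selection and shows what `reset_index(drop=True)` does to the row labels.

```python
import pandas as pd

# Made-up rows standing in for the full league table.
teams = pd.DataFrame({
    "Team": ["Arizona Diamondbacks", "New York Yankees", "New York Yankees"],
    "Year": [2016, 2016, 2015],
    "Wins": [69, 84, 87],
})

# Boolean mask: True where the row belongs to the Yankees.
mask = teams["Team"] == "New York Yankees"

# Keep matching rows and renumber them from 0, discarding the old index.
subset = teams.loc[mask].reset_index(drop=True)
print(subset)
```

Without `reset_index(drop=True)`, the subset would keep the original row labels (1 and 2 here) instead of starting again at 0.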


In [14]:
# familiarize myself with the data
yankees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Year             104 non-null    int64  
 1   Team             104 non-null    object 
 2   League           104 non-null    object 
 3   Games            104 non-null    int64  
 4   Wins             104 non-null    int64  
 5   Losses           104 non-null    int64  
 6   Ties             104 non-null    int64  
 7   WinPercentage    104 non-null    float64
 8   Finish           104 non-null    object 
 9   GamesBack        104 non-null    object 
 10  Playoffs         52 non-null     object 
 11  RunsScored       104 non-null    int64  
 12  RunsAllowed      104 non-null    int64  
 13  Attendance       104 non-null    Int64  
 14  BatAge           104 non-null    float64
 15  PitAge           104 non-null    float64
 16  NumBattersUsed   104 non-null    int64  
 17  NumPitchersUsed 

I want to change the `Finish` column from `object` to `category`. Doing so will allow me to define a logical order for the end-of-season finishes, which I can use for analysis later on.

In [15]:
# change data type
#yankees.Finish = yankees.Finish.astype('category')

In [17]:
yankees.Finish.unique()

array(['4th of 5', '2nd of 5', '3rd of 5', '1st of 5', '2nd of 7',
       '4th of 7', '5th of 7', '7th of 7', '3rd of 7', '1st of 7',
       '1st of 6', '3rd of 6', '2nd of 6', '4th of 6', '5th of 6',
       '5th of 10', '9th of 10', '10th of 10', '6th of 10', '1st of 10',
       '1st of 8', '3rd of 8', '2nd of 8', '4th of 8', '7th of 8',
       '6th of 8', '5th of 8'], dtype=object)
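
One possible way to get a logical order out of these labels is sketched below. It is only a sketch under assumptions: it uses a handful of sample strings in the same "Nth of M" format, and it orders by place first and division size second, which is one choice among several. The `Place` and `DivisionSize` names are made up for the example.

```python
import pandas as pd

# A few finish strings in the same "Nth of M" format as the column.
finish = pd.Series(["4th of 5", "1st of 10", "10th of 10", "2nd of 8"])

# Pull out the place and the division size as integers.
parts = finish.str.extract(r"^(\d+)\D+(\d+)$").astype(int)
parts.columns = ["Place", "DivisionSize"]

# Order the distinct labels by place, then by division size.
ordered_labels = (
    parts.assign(Finish=finish)
    .sort_values(["Place", "DivisionSize"])["Finish"]
    .drop_duplicates()
    .tolist()
)

# An ordered categorical sorts by this logical order, not alphabetically.
finish_cat = pd.Categorical(finish, categories=ordered_labels, ordered=True)
sorted_finish = pd.Series(finish_cat).sort_values().tolist()
print(sorted_finish)
```

A plain `astype('category')` would order the labels alphabetically, which puts `10th of 10` ahead of `1st of 10`. Supplying the categories explicitly avoids that.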