# MLB Baseball Analytics

In the second part of this project, we use classification models to predict whether a player will be voted into the Hall of Fame, based on the player's career statistics and awards.

The question we seek to answer: can a machine learning model accurately predict whether an MLB player will be voted into the Hall of Fame?

### Hall of Fame Requirements

A baseball player can be elected to the Hall of Fame if they meet the following criteria:

    - The player must have competed in at least ten seasons;
    - The player must have been retired for at least five seasons;
    - A screening committee must approve the player’s worthiness to be included on the ballot; most players who played regularly for ten or more years are deemed worthy;
    - The player must not be on the ineligible list (that is, the player must not be banned from baseball);
    - A player is considered elected if they receive at least 75% of the vote in the election; and
    - A player stays on the ballot the following year if they receive at least 5% of the vote, and can appear on ballots for a maximum of 10 years.
    
A player who is not elected through this vote can still be added to the Hall of Fame by the Veterans Committee or by a Special Committee appointed by the Commissioner of MLB.
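
The eligibility rules and the two vote thresholds above can be sketched as a pair of small functions. This is only an illustration of the rules as listed here; the function and parameter names (`is_eligible`, `ballot_outcome`, `vote_share`, and so on) are our own assumptions and do not come from the Lahman data or from any library. We also assume that `years_on_ballot` counts the current year, and we leave out the screening committee and the Veterans Committee route.

```python
def is_eligible(seasons_played, seasons_retired, banned):
    """Rough check of the basic ballot requirements listed above."""
    return seasons_played >= 10 and seasons_retired >= 5 and not banned


def ballot_outcome(vote_share, years_on_ballot, max_years=10):
    """Classify one year's ballot result.

    vote_share      -- fraction of the vote received (0.0 to 1.0)
    years_on_ballot -- number of ballots so far, including this one (assumption)
    """
    if vote_share >= 0.75:
        return 'elected'
    if vote_share >= 0.05 and years_on_ballot < max_years:
        return 'stays on ballot'
    return 'dropped from ballot'


# Hypothetical inputs, for illustration only
print(is_eligible(seasons_played=12, seasons_retired=6, banned=False))
print(ballot_outcome(vote_share=0.40, years_on_ballot=3))
```

Under these assumptions, a player with 40% of the vote in their third year stays on the ballot, while the same share in the tenth year would end their ballot run.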

### Importing Data

The data on players' career statistics was compiled by Sean Lahman and can be found [here](http://www.seanlahman.com/baseball-archive/statistics/).

The data comes as a set of CSV files:

    - Master.csv contains the player names, dates of birth (DOB), and other biographical info
    - Fielding.csv contains the fielding statistics
    - Batting.csv contains the batting statistics
    - AwardsPlayers.csv has data on the awards won by baseball players
    - AllstarFull.csv lists all the All-Star appearances
    - HallOfFame.csv contains the Hall of Fame voting data
    - Appearances.csv contains details on the positions at which a player appeared

In [1]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Read in the CSV files
master_df = pd.read_csv('baseballdatabank/Master.csv', usecols=['playerID', 'nameFirst', 'nameLast', 'bats', 'throws', 'debut', 'finalGame'])
fielding_df = pd.read_csv('baseballdatabank/Fielding.csv', usecols=['playerID', 'yearID', 'stint', 'teamID', 'lgID', 'POS', 'G', 'GS', 'InnOuts', 'PO', 'A', 'E', 'DP'])
batting_df = pd.read_csv('baseballdatabank/Batting.csv')
awards_df = pd.read_csv('baseballdatabank/AwardsPlayers.csv', usecols=['playerID', 'awardID', 'yearID'])
allstar_df = pd.read_csv('baseballdatabank/AllstarFull.csv', usecols=['playerID', 'yearID'])
hof_df = pd.read_csv('baseballdatabank/HallOfFame.csv', usecols=['playerID', 'yearid', 'votedBy', 'needed_note', 'inducted', 'category'])
appearances_df = pd.read_csv('baseballdatabank/Appearances.csv')
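
The `usecols` argument used above matches column names exactly, including case, which matters because the Hall of Fame file is read with `yearid` while the other files use `yearID`. The sketch below shows this behaviour on a made-up, two-row stand-in for the file, so it runs without the real data; the rows, the `exampl01`-style IDs, and the variable names are placeholders of our own.

```python
import io
import pandas as pd

# Made-up stand-in for HallOfFame.csv; only the header layout matters here
sample_csv = io.StringIO(
    "playerID,yearid,votedBy,inducted,category\n"
    "exampl01,1950,BBWAA,N,Player\n"
    "exampl02,1951,BBWAA,Y,Player\n"
)

# usecols keeps only the requested columns
sample_df = pd.read_csv(sample_csv, usecols=['playerID', 'yearid', 'inducted'])
print(sample_df.columns.tolist())

# A name that does not match the header exactly (here 'yearID') raises a ValueError
sample_csv.seek(0)
try:
    pd.read_csv(sample_csv, usecols=['playerID', 'yearID'])
    usecols_error = None
except ValueError as err:
    usecols_error = str(err)
print(usecols_error)
```

If one of the reads above fails with a similar error, comparing the requested names against the file's header row is a reasonable first check.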