# PostgreSQL Database Creation

This notebook uses the gathered data to build a PostgreSQL database covering the 1998-2016 seasons. It should include the following tables (this list may expand with time):

* People (from Lahman DB)
* Batting (from Lahman DB)
* Pitching (from Lahman DB)
* Salary (from Lahman DB)
* Appearances (from Lahman DB)
* Team data (from Lahman DB)
* Payrolls (scraped from The Baseball Cube)
* Team WAR by Position (scraped from Fangraphs)
* Free Agent Data (scraped from Baseball Reference)

**A Clarification on Date:**

A year label (e.g. "1998") refers to the season that just ended, so the free agents carrying that label are the ones who hit the market immediately AFTER that season. For instance, the "1998" data should be:

* Player's stats for the 1998 season
* Player's salary for the 1998 season
* Team payroll data from the 1998 season
* Team WAR by position for the 1998 season
* Free agent information for the 1998 season

We're then predicting where free agents will sign for the following season; so the above data would be used to predict free agent destinations for the 1999 season. Thus, the most recent season would be labeled "2016" and would be used to predict where free agents ended up for the 2017 season. 

**Data Limitations:**

We're going to use 1998-2016 and leave out every other year, for 2 reasons:

* We can't use 2017 because it's not complete yet. We could certainly predict the destinations for every 2017 free agent (and maybe that'll be the final product), but right now we can't use them for testing, so they're not in the test/train set.

* We shouldn't use pre-1998 seasons because not all of the teams were around until 1998, which greatly affects the environment. Furthermore, even though all the teams were in place for 1998, there was an expansion draft that badly distorted things. So nothing pre-1998.

That said, we'll capture stats and everything else INCLUDING 2017 so that we can predict for that year. There's no point in ignoring those data; we'll just have to screen them out when we pull training data from the database later.
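The labeling convention and the year limits above can be captured in a couple of small helpers. This is only a sketch: the names `FIRST_SEASON`, `LAST_TRAIN_SEASON`, `signing_season` and `split_seasons` are made up for illustration and aren't used elsewhere in the project.

```python
# Season labels used for training/testing (per the limitations above)
FIRST_SEASON = 1998
LAST_TRAIN_SEASON = 2016


def signing_season(label_year):
    """Season in which free agents labeled `label_year` play for their new team."""
    return label_year + 1


def split_seasons(years):
    """Separate stored seasons into train/test labels and predict-only labels.

    Seasons before FIRST_SEASON are dropped entirely.
    """
    usable = sorted(y for y in set(years) if y >= FIRST_SEASON)
    train_test = [y for y in usable if y <= LAST_TRAIN_SEASON]
    predict_only = [y for y in usable if y > LAST_TRAIN_SEASON]
    return train_test, predict_only


train_test, predict_only = split_seasons(range(1995, 2018))
print(signing_season(2016))            # 2017
print(train_test[0], train_test[-1])   # 1998 2016
print(predict_only)                    # [2017]
```

Keeping 2017 in its own "predict-only" bucket matches the plan of storing those rows in the database while excluding them from the test/train set.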

## Part 1: Establish the PostgreSQL connection and create the database

For now, let's create the database locally; we can change this later if necessary.

In [1]:
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database
import psycopg2
import pandas as pd

# Set postgres username/password, and connection specifics
username = 'postgres'
password = 'S@ndw1ches'     # change this
host     = 'localhost'
port     = '5432'            # default port that postgres listens on
db_name  = 'mlb_fa_db'

engine = create_engine( 'postgresql+psycopg2://{}:{}@{}:{}/{}'.format(username, password, host, port, db_name) )
print(engine.url)

postgresql+psycopg2://postgres:S%40ndw1ches@localhost:5432/mlb_fa_db
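Note that the printed URL shows the `@` in the password as `%40`. Characters like `@` or `/` in a password can be ambiguous inside a connection URL, so it may be safer to percent-encode the credentials explicitly before formatting the string. The sketch below does that with the standard library. `build_pg_url` is a hypothetical helper, `MLB_DB_PASSWORD` is a hypothetical environment variable name, and the fallback password is a placeholder.

```python
import os
from urllib.parse import quote_plus


def build_pg_url(username, password, host, port, db_name):
    """Build a postgresql+psycopg2 URL, percent-encoding the credentials."""
    return 'postgresql+psycopg2://{}:{}@{}:{}/{}'.format(
        quote_plus(username), quote_plus(password), host, port, db_name)


# MLB_DB_PASSWORD is a hypothetical environment variable name;
# the fallback value is a placeholder, not a real credential
password = os.environ.get('MLB_DB_PASSWORD', 'p@ss/word')

url = build_pg_url('postgres', password, 'localhost', '5432', 'mlb_fa_db')
print(url)
```

The resulting string could be passed to `create_engine` in place of the hand-formatted one. Reading the password from the environment also keeps it out of the notebook itself. Recent SQLAlchemy versions also offer `sqlalchemy.engine.URL.create`, which takes the pieces as separate arguments and may be the tidier option.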


In [2]:
## create a database (if it doesn't exist)
if not database_exists(engine.url):
    create_database(engine.url)
print(database_exists(engine.url))

True


We now have a local database called "mlb_fa_db" for storing the data.

## Part 2: Load all our data sources (no postseason)

We've got our list of 9 data sources; truth be told, there are more than that if we use postseason stats, but we'll go by category for now. The non-Lahman data were all saved as CSV files by the other notebook; the Lahman data was trimmed there too, but that's a simple operation we can repeat here.

### 1-6 Load the Lahman Data

First, we'll load everything, then we'll trim to just 1998 forward.

In [3]:
# Load the CSV files
all_batting = pd.read_csv("/home/matt/Github/baseballdatabank/core/Batting.csv")
all_pitching = pd.read_csv("/home/matt/Github/baseballdatabank/core/Pitching.csv")
all_salary = pd.read_csv("/home/matt/Github/baseballdatabank/core/Salaries.csv")
all_people = pd.read_csv("/home/matt/Github/baseballdatabank/core/People.csv")
all_appearances = pd.read_csv("/home/matt/Github/baseballdatabank/core/Appearances.csv")
all_teams = pd.read_csv("/home/matt/Github/baseballdatabank/core/Teams.csv")

# Keep only 1998 onward for the tables that have a yearID column
# (People has no yearID, so all_people is kept whole)
batting_1998, pitching_1998, salary_1998, teams_1998, appearances_1998 = [
    df[df.yearID >= 1998]
    for df in [all_batting, all_pitching, all_salary, all_teams, all_appearances]]

### 7-9 Load the scraped CSV files

We've got 4 CSV files (2 for Team WAR) that we'll just load here. They should be pretty much ready to use.

In [4]:
position_war = pd.read_csv('/home/matt/Github/MLB_FA_Predictor/position_war.csv')
pitcher_war = pd.read_csv('/home/matt/Github/MLB_FA_Predictor/pitcher_war.csv')
team_payrolls = pd.read_csv('/home/matt/Github/MLB_FA_Predictor/team_payrolls.csv')
free_agents = pd.read_csv('/home/matt/Github/MLB_FA_Predictor/free_agents.csv')

## Part 3: Write each data frame to SQL

This is straight from the Dev_Setup stuff (and pretty slow; about 30-45 seconds locally):

In [5]:
# Insert all tables into SQL (each call replaces the table if it already exists)

# 1-6 (Lahman tables)
batting_1998.to_sql('batting', engine, if_exists='replace')
pitching_1998.to_sql('pitching', engine, if_exists='replace')
salary_1998.to_sql('salary', engine, if_exists='replace')
all_people.to_sql('people', engine, if_exists='replace')
appearances_1998.to_sql('appearances', engine, if_exists='replace')
teams_1998.to_sql('teams', engine, if_exists='replace')

# 7-9 The other 4 tables
position_war.to_sql('position_team_war', engine, if_exists='replace')
pitcher_war.to_sql('pitcher_team_war', engine, if_exists='replace')
team_payrolls.to_sql('payrolls', engine, if_exists='replace')
free_agents.to_sql('free_agents', engine, if_exists='replace')
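One side effect worth knowing about: `to_sql` writes the DataFrame index as an extra `index` column by default, and a CSV that was saved with its index comes back from `read_csv` with an `Unnamed: 0` column. Both show up in the `free_agents` query output later in this notebook. A minimal sketch of how to avoid them, assuming those columns aren't wanted, follows. It uses an in-memory SQLite connection as a stand-in for the PostgreSQL engine so it runs anywhere; the tiny CSV is illustrative and uses only two of the free agents' names and their year.

```python
import io
import sqlite3

import pandas as pd

# A CSV that was saved with its index (illustrative two-row sample)
csv_text = ",Full_Name,Year\n0,Jay Bruce,2017\n1,Addison Reed,2017\n"

# index_col=0 reads the first column back as the index instead of "Unnamed: 0"
free_agents_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)

# In-memory SQLite stands in for the PostgreSQL engine (assumption, for portability)
demo_con = sqlite3.connect(":memory:")

# index=False keeps the DataFrame index out of the SQL table
free_agents_demo.to_sql("free_agents", demo_con, if_exists="replace", index=False)

# List the column names that actually ended up in the table
columns = [row[1] for row in demo_con.execute("PRAGMA table_info(free_agents)")]
print(columns)
```

With the real data, the same two keyword arguments (`index_col=0` on `read_csv`, `index=False` on `to_sql`) should carry over unchanged to the PostgreSQL engine.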

In [6]:
batting_1998.head()

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
76643,abbotje01,1998,1,CHA,AL,89,244,33,68,14,...,41.0,3.0,3.0,9,28.0,1.0,0.0,2.0,5.0,2.0
76644,abbotji01,1998,1,CHA,AL,5,0,0,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0
76645,abbotku01,1998,1,OAK,AL,35,123,17,33,7,...,9.0,2.0,1.0,10,34.0,0.0,1.0,1.0,1.0,3.0
76646,abbotku01,1998,2,COL,NL,42,71,9,18,6,...,15.0,0.0,0.0,2,19.0,0.0,1.0,0.0,2.0,2.0
76647,abbotpa01,1998,1,SEA,AL,4,0,0,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
# Experimental stuff
fa_year = 2017
# just select free agents for that year
query = 'SELECT * FROM free_agents WHERE "Year" = {}'.format(fa_year)
print(query)

SELECT * FROM free_agents WHERE "Year" = 2017
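Formatting the year straight into the SQL string works for a hard-coded integer, but passing it as a query parameter is generally safer and avoids quoting mistakes. Below is a minimal sketch of that idea. It uses an in-memory SQLite table as a stand-in for the PostgreSQL database so that it is runnable on its own. The two 2017 rows use names from the query output in this notebook, and the 2016 row is made up.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL database (assumption)
con_demo = sqlite3.connect(':memory:')
con_demo.execute('CREATE TABLE free_agents ("Full_Name" TEXT, "Year" INTEGER)')
con_demo.executemany(
    'INSERT INTO free_agents VALUES (?, ?)',
    [('Jay Bruce', 2017),
     ('Addison Reed', 2017),
     ('Example Player', 2016)])   # hypothetical row for a different year

fa_year = 2017

# The driver substitutes the value; sqlite3 uses "?" placeholders,
# while psycopg2 uses "%s" instead
param_query = ('SELECT "Full_Name" FROM free_agents '
               'WHERE "Year" = ? ORDER BY "Full_Name"')
names = [row[0] for row in con_demo.execute(param_query, (fa_year,))]
print(names)   # ['Addison Reed', 'Jay Bruce']
```

Against the real database, the equivalent call would presumably look like `pd.read_sql_query('SELECT * FROM free_agents WHERE "Year" = %s', con, params=(fa_year,))`, where `con` is a psycopg2 connection to `mlb_fa_db`.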


In [8]:
# Set postgres username/password, and connection specifics
username = 'postgres'
password = 'S@ndw1ches'     # change this
host     = 'localhost'
port     = '5432'            # default port that postgres listens on
db_name  = 'mlb_fa_db'

engine = create_engine( 'postgresql://{}:{}@{}:{}/{}'.format(username, password, host, port, db_name) )

# Open a direct psycopg2 connection (pass the port explicitly as well)
con = psycopg2.connect(database=db_name,
                       user=username,
                       password=password,
                       host=host,
                       port=port)


In [9]:
query_results = pd.read_sql_query(query, con)
print(query_results)

    index  Unnamed: 0  Age                    Destination           Full_Name  \
0    1736        1736   31               Seattle Mariners      Gordon Beckham   
1    1737        1737   31                  New York Mets           Jay Bruce   
2    1738        1738   29                Minnesota Twins        Addison Reed   
3    1739        1739   34              Chicago White Sox     Miguel Gonzalez   
4    1740        1740   30              Milwaukee Brewers         J.J. Hoover   
5    1741        1741   33              Milwaukee Brewers         Boone Logan   
6    1742        1742   29              Oakland Athletics   Steve Lombardozzi   
7    1743        1743   36                 Detroit Tigers         Brayan Pena   
8    1744        1744   34  Los Angeles Angels of Anaheim         Rene Rivera   
9    1745        1745   30                Cincinnati Reds        Vance Worley   
10   1746        1746   34               San Diego Padres       Craig Stammen   
11   1747        1747   32  

In [11]:
'${:,.2f}'.format(1234569789.0)

'$1,234,569,789.00'