### Joe Leonard (ymd3tv) - DS 2002 (1:00PM MWF)
# Midterm Project - Player Pitch Performance Analysis
This notebook performs the ETL work on my 'pitch_analysis' database for my Midterm Project. It imports the libraries needed along the way, connects to both the MySQL and MongoDB databases I created, and extracts and transforms the data to build an effective ETL Pipeline into what will be the final Data Warehouse. Hope you enjoy looking through the process I performed!

The imports mirror the libraries from Lab 4 as a base for the project. Libraries added along the way sit at the bottom of the import cell, each under its own comment.

##### Importing Necessary Libraries

In [18]:
import os
import json
import numpy
import datetime
import pandas as pd

import pymongo
import sqlalchemy as db

# Imports below are not from Lab 4

# certifi: supplies the CA bundle passed to MongoClient (added after an error in Code Block 4)
import certifi

# pymysql: added after an error came up when loading data
import pymysql

# MySQLdb: added after an error came up when loading data
import MySQLdb

##### Declaring Variables to Connect to MongoDB and MySQL
Note: to run this notebook yourself, you will likely need to change some of these values to match your own setup: 'pwd' in 'mysql_args', and 'user_name', 'password', and 'cluster_subnet' in 'mongodb_args'.

In [2]:
mysql_args = {
    "uid" : "root",
    "pwd" : "000025Jl##",
    "hostname" : "localhost",
    "dbname" : "pitch_analysis"
}

mongodb_args = {
    "user_name" : "josephleonard725",
    "password" : "NfG0ZmmMDICsZw5B",
    "cluster_name" : "pitch_analysis",
    "cluster_subnet" : "xg9sh3f",
    "cluster_location" : "atlas",
    "db_name" : "pitch_performance"
}
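
One way to avoid editing credentials directly in the cell above is to read them from environment variables. The sketch below is an assumption and not part of the original notebook: the helper `with_env_overrides` and the variable name `PITCH_MYSQL_PWD` are hypothetical, and the values shown are placeholders.

```python
import os

def with_env_overrides(base_args, overrides, environ=None):
    """Return a copy of base_args where selected keys come from environment variables.

    overrides maps a key in base_args to an environment variable name.
    Keys whose variable is unset keep their original value.
    """
    environ = os.environ if environ is None else environ
    merged = dict(base_args)
    for key, env_name in overrides.items():
        if env_name in environ:
            merged[key] = environ[env_name]
    return merged

# Hypothetical usage with placeholder values (not real credentials)
example_args = {"uid": "root", "pwd": "change-me", "hostname": "localhost", "dbname": "pitch_analysis"}
fake_env = {"PITCH_MYSQL_PWD": "secret-from-env"}  # stands in for os.environ
resolved = with_env_overrides(example_args, {"pwd": "PITCH_MYSQL_PWD"}, environ=fake_env)
print(resolved["pwd"])
```

The same helper could be applied to 'mongodb_args' with its own mapping, leaving the dictionaries in the notebook as defaults only.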

##### Defining Functions to Pull and Load into the Databases

In [51]:
def get_sql_dataframe(sql_query, **args):
    '''Query the MySQL database and return the result as a Pandas DataFrame.'''
    # Create a connection to the MySQL database from the keyword arguments passed in
    engine = db.create_engine("mysql://{}:{}@{}/{}".format(
        args["uid"], args["pwd"], args["hostname"], args["dbname"]
    ))
    connection = engine.connect()

    # Invoke the pd.read_sql() function to query the database, and fill a Pandas DataFrame
    dframe = pd.read_sql(sql_query, connection)
    connection.close()

    return dframe


def get_mongo_dataframe(mongo_client, db_name, collection, query):
    '''Query MongoDB, and fill a python list with documents to create a DataFrame'''
    db = mongo_client[db_name]
    dframe = pd.DataFrame(list(db[collection].find(query)))
    dframe.drop(['_id'], axis=1, inplace=True)
    mongo_client.close()
    
    return dframe

def get_mongo_client(**args):
    '''Return a MongoClient for either an Atlas cluster or a local MongoDB instance.'''
    # Validate proper input
    if args["cluster_location"] not in ['atlas', 'local']:
        raise Exception("You must specify either 'atlas' or 'local' for the cluster_location parameter.")

    if args["cluster_location"] == "atlas":
        connect_str = f"mongodb+srv://{args['user_name']}:{args['password']}@"
        connect_str += f"{args['cluster_name']}.{args['cluster_subnet']}.mongodb.net"
        client = pymongo.MongoClient(connect_str, tlsCAFile=certifi.where())
    else:
        client = pymongo.MongoClient("mongodb://localhost:27017/")

    return client

def set_mongo_collections(mongo_client, db_name, data_directory, json_files):
    '''Drop and reload each collection from its JSON file (json_files maps collection name to file name).'''
    db = mongo_client[db_name]

    for collection_name, file_name in json_files.items():
        db.drop_collection(collection_name)
        json_path = os.path.join(data_directory, file_name)
        with open(json_path, 'r') as openfile:
            json_object = json.load(openfile)
        db[collection_name].insert_many(json_object)

    mongo_client.close()


def set_dataframe(df, table_name, pk_column, db_operation, **mysql_args):
    '''Write a DataFrame to a MySQL table: "insert" replaces the table and sets its primary key, "update" appends rows.'''
    engine = db.create_engine("mysql://{}:{}@{}/{}".format(
        mysql_args["uid"], mysql_args["pwd"], mysql_args["hostname"], mysql_args["dbname"]
    ))
    connection = engine.connect()

    if db_operation == "insert":
        df.to_sql(table_name, con=connection, index=False, if_exists='replace')
        connection.execute(db.text(f"ALTER TABLE {table_name} MODIFY COLUMN {pk_column} INT PRIMARY KEY;"))
    elif db_operation == "update":
        df.to_sql(table_name, con=connection, index=False, if_exists='append')
    connection.close()
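
Both MySQL helpers format the user name and password straight into the connection URL. Characters such as `@`, `/`, or `#` in a password can be misread when the URL is parsed, so a common precaution is to percent-encode those fields first. A minimal sketch using only the standard library follows; `build_mysql_url` is a hypothetical helper and the password is a placeholder.

```python
from urllib.parse import quote_plus

def build_mysql_url(uid, pwd, hostname, dbname, driver="mysql"):
    """Build a SQLAlchemy-style URL with the user name and password percent-encoded."""
    return f"{driver}://{quote_plus(uid)}:{quote_plus(pwd)}@{hostname}/{dbname}"

# Placeholder credentials chosen to include characters that need escaping
url = build_mysql_url("root", "p@ss/word#1", "localhost", "pitch_analysis")
print(url)  # mysql://root:p%40ss%2Fword%231@localhost/pitch_analysis
```

The resulting string could be passed to `db.create_engine(...)` in place of the `.format(...)` call.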

##### Populate MongoDB with Source Data


In [39]:
client = get_mongo_client(**mongodb_args)

data_dir = os.path.join(os.getcwd(), 'data')

json_files = {"atbats" : 'pitch_analysis_atbats.json',
                "games" : 'pitch_analysis_games.json',
                "pitches" : 'pitch_analysis_pitches.json',
                "players" : 'pitch_analysis_players.json'
             }

set_mongo_collections(client, mongodb_args["db_name"], data_dir, json_files) 
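
`insert_many` expects a non-empty list of documents, so each JSON file needs to hold an array of objects. A quick pre-flight check can catch a malformed file before any collection is dropped. The sketch below uses only the standard library; `load_documents` is a hypothetical helper, and the temporary file stands in for one of the files in the 'data' directory.

```python
import json
import os
import tempfile

def load_documents(json_path):
    """Load a JSON file and confirm it holds a non-empty list of objects."""
    with open(json_path, "r") as openfile:
        documents = json.load(openfile)
    if not isinstance(documents, list) or not documents:
        raise ValueError(f"{json_path} must contain a non-empty JSON array")
    if not all(isinstance(doc, dict) for doc in documents):
        raise ValueError(f"{json_path} must contain only JSON objects")
    return documents

# Demonstrate with a temporary file shaped like the players data
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_players.json")
    with open(sample_path, "w") as handle:
        json.dump([{"id": 452657, "first_name": "Jon", "last_name": "Lester"}], handle)
    docs = load_documents(sample_path)

print(len(docs), docs[0]["last_name"])
```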

##### Creating Dataframes from MongoDB

Extracting Data from MongoDB Collections into DataFrames

In [6]:
client = get_mongo_client(**mongodb_args)

# At-Bats DF
query = {}
collection = "atbats"

df_atbats = get_mongo_dataframe(client, mongodb_args["db_name"], collection, query)
df_atbats.head(2)

Unnamed: 0,ab_id,batter_id,event,g_id,inning,o,p_score,p_throws,pitcher_id,stand,top
0,2015000001,572761,Groundout,201500001,1,1,0,L,452657,L,True
1,2015000002,518792,Double,201500001,1,1,0,L,452657,L,True


In [7]:
client = get_mongo_client(**mongodb_args)

# Games DF
query = {}
collection = "games"

df_games = get_mongo_dataframe(client, mongodb_args["db_name"], collection, query)
df_games.head(2)

Unnamed: 0,attendance,away_final_score,away_team,date,elapsed_time,g_id,home_final_score,home_team,start_time,umpire_1B,umpire_2B,umpire_3B,umpire_HP,venue_name,weather,wind,delay
0,35055,3,sln,2015-04-05,184,201500001,0,chn,7:17 PM,Mark Wegner,Marty Foster,Mike Muchlinski,Mike Winters,Wrigley Field,"44 degrees, clear","7 mph, In from CF",0
1,45909,1,ana,2015-04-06,153,201500002,4,sea,1:12 PM,Ron Kulpa,Brian Knight,Vic Carapazza,Larry Vanover,Safeco Field,"54 degrees, cloudy","1 mph, Varies",0


In [40]:
client = get_mongo_client(**mongodb_args)

# Pitches DF
query = {}
collection = "pitches"

df_pitches = get_mongo_dataframe(client, mongodb_args["db_name"], collection, query)
df_pitches.head(2)

Unnamed: 0,px,pz,start_speed,end_speed,spin_rate,spin_dir,break_angle,break_length,break_y,ax,...,event_num,b_score,ab_id,b_count,s_count,outs,pitch_num,on_1b,on_2b,on_3b
0,0.416,2.963,92.9,84.1,2305.052,159.235,-25.0,3.2,23.7,7.665,...,3,0.0,2015000001.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,-0.191,2.347,92.8,84.1,2689.935,151.402,-40.7,3.4,23.7,12.043,...,4,0.0,2015000001.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0


In [9]:
client = get_mongo_client(**mongodb_args)

# Players DF
query = {}
collection = "players"

df_players = get_mongo_dataframe(client, mongodb_args["db_name"], collection, query)
df_players.head(2)

Unnamed: 0,id,first_name,last_name
0,452657,Jon,Lester
1,425794,Adam,Wainwright


Performing Necessary Transformations to the DataFrames

In [10]:
# At-Bats DF
df_atbats.rename(columns = {"ab_id" : "atbat_id",
                            "batter_id" : "player_id_b",
                            "event" : "pitch_result",
                            "g_id" : "game_id",
                            "o" : "outs",
                            "p_score" : "fielding_score",
                            "pitcher_id" : "player_id_p"
                            }, inplace = True)
df_atbats.drop(['p_throws', 'stand', 'top'], axis = "columns", inplace = True)
df_atbats.head(2)

Unnamed: 0,atbat_id,player_id_b,pitch_result,game_id,inning,outs,fielding_score,player_id_p
0,2015000001,572761,Groundout,201500001,1,1,0,452657
1,2015000002,518792,Double,201500001,1,1,0,452657


In [11]:
# Games DF
df_games.rename(columns = {"g_id" : "game_id"}, inplace = True)
df_games.drop(['attendance', 'away_team', 'date', 'elapsed_time', 'home_team', 'start_time', 'umpire_1B', 'umpire_2B', 
               'umpire_3B', 'umpire_HP', 'venue_name', 'weather', 'wind', 'delay'], axis = "columns", inplace = True)
df_games.head(2)

Unnamed: 0,away_final_score,game_id,home_final_score
0,3,201500001,0
1,1,201500002,4


In [41]:
# Pitches DF
df_pitches.rename(columns = {"px" : "pitch_x",
                            "pz" : "pitch_z",
                            "start_speed" : "pitch_speed",
                            "code" : "pitch_result",
                            "b_score" : "hitting_score",
                            "ab_id" : "atbat_id",
                            "b_count" : "balls",
                            "s_count" : "strikes",
                            }, inplace = True)
df_pitches.drop(['end_speed', 'spin_rate', 'spin_dir', 'break_angle', 'break_length', 'break_y', 'ax', 'ay', 'az', 'sz_bot', 'sz_top',
                 'type_confidence', 'vx0', 'vy0', 'vz0', 'x', 'x0', 'y', 'y0', 'z0', 'pfx_x', 'pfx_z', 'nasty', 'zone', 'type', 
                 'event_num'], axis = "columns", inplace = True)
df_pitches["pitch_id"] = df_pitches.index
df_pitches.head(2)

Unnamed: 0,pitch_x,pitch_z,pitch_speed,pitch_result,pitch_type,hitting_score,atbat_id,balls,strikes,outs,pitch_num,on_1b,on_2b,on_3b,pitch_id
0,0.416,2.963,92.9,C,FF,0.0,2015000001.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0
1,-0.191,2.347,92.8,S,FF,0.0,2015000001.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,1


In [13]:
# Players DF
df_players.rename(columns = {"id" : "player_id"}, inplace = True)
df_players.head(2)

Unnamed: 0,player_id,first_name,last_name
0,452657,Jon,Lester
1,425794,Adam,Wainwright
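
Because `set_dataframe` promotes one column of each table to an `INT PRIMARY KEY`, the load fails if that column holds missing or duplicated values. A small check could run on each key column before loading. This is a sketch under assumptions: `key_problems` is a hypothetical helper, written in plain Python so it accepts a list or a DataFrame column such as `df_players["player_id"]`.

```python
from collections import Counter

def key_problems(values):
    """Return (missing_count, duplicated_values) for a candidate primary-key column."""
    values = list(values)
    # None and NaN both count as missing; NaN is the only value not equal to itself
    present = [v for v in values if v is not None and v == v]
    missing_count = len(values) - len(present)
    counts = Counter(present)
    duplicated = sorted(v for v, n in counts.items() if n > 1)
    return missing_count, duplicated

# Made-up IDs: one duplicate, one None, one NaN
print(key_problems([452657, 425794, 425794, None, float("nan")]))  # (2, [425794])
```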


Loading the Transformed DataFrames into the Data Warehouse

In [42]:
# set_dataframe() opens and closes its own connection, so no separate engine is created here.

# At-Bats DF
dataframe = df_atbats
table_name = 'dim_atbats'
primary_key = 'atbat_id'
db_operation = "insert"

set_dataframe(dataframe, table_name, primary_key, db_operation, **mysql_args)

# Games DF
dataframe = df_games
table_name = 'dim_games'
primary_key = 'game_id'
db_operation = "insert"

set_dataframe(dataframe, table_name, primary_key, db_operation, **mysql_args)

# Pitches
dataframe = df_pitches
table_name = 'dim_pitches'
primary_key = 'pitch_id'
db_operation = "insert"

set_dataframe(dataframe, table_name, primary_key, db_operation, **mysql_args)

# Players
dataframe = df_players
table_name = 'dim_players'
primary_key = 'player_id'
db_operation = "insert"

set_dataframe(dataframe, table_name, primary_key, db_operation, **mysql_args)

Validating the Tables Were Created

In [53]:
sql_atbats = "SELECT * FROM pitch_analysis.dim_atbats;"
df_dim_atbats = get_sql_dataframe(sql_atbats, **mysql_args)
df_dim_atbats.head(2)

Unnamed: 0,atbat_id,player_id_b,pitch_result,game_id,inning,outs,fielding_score,player_id_p
0,2015000001,572761,Groundout,201500001,1,1,0,452657
1,2015000002,518792,Double,201500001,1,1,0,452657


In [45]:
sql_games = "SELECT * FROM pitch_analysis.dim_games;"
df_dim_games = get_sql_dataframe(sql_games, **mysql_args)
df_dim_games.head(2)

Unnamed: 0,away_final_score,game_id,home_final_score
0,3,201500001,0
1,1,201500002,4


In [46]:

sql_pitches = "SELECT * FROM pitch_analysis.dim_pitches;"
df_dim_pitches = get_sql_dataframe(sql_pitches, **mysql_args)
df_dim_pitches.head(2)

Unnamed: 0,pitch_x,pitch_z,pitch_speed,pitch_result,pitch_type,hitting_score,atbat_id,balls,strikes,outs,pitch_num,on_1b,on_2b,on_3b,pitch_id
0,0.416,2.963,92.9,C,FF,0.0,2015000001.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0
1,-0.191,2.347,92.8,S,FF,0.0,2015000001.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,1


In [58]:
sql_players = "SELECT * FROM pitch_analysis.dim_players;"
df_dim_players = get_sql_dataframe(sql_players, **mysql_args)
df_dim_players.head(2)

Unnamed: 0,player_id,first_name,last_name
0,112526,Bartolo,Colon
1,115629,LaTroy,Hawkins


##### Authoring a SQL Statement That:
- Calculates the Average Pitch Speed for Each Pitch Type
- Retrieves Player Names Along with Their Total Number of At Bats Thrown
- Calculates the Total Number of At Bats Thrown by a Player in a Specific Game


In [61]:
sql_avg_pitch_speed = "SELECT pitch_type, ROUND(AVG(pitch_speed), 2) AS average_pitch_speed FROM dim_pitches GROUP BY pitch_type;"

df_avg_pitch_speed = get_sql_dataframe(sql_avg_pitch_speed, **mysql_args)
df_avg_pitch_speed

Unnamed: 0,pitch_type,average_pitch_speed
0,FF,91.89
1,CU,76.92
2,FC,87.36
3,SI,90.57
4,CH,83.13
5,FT,91.25
6,IN,72.27
7,SL,83.66
8,,0.0
9,KC,79.86
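
The result above includes one row with no `pitch_type` and an average of 0.0. If that row is unwanted and the missing type is stored as NULL, a `WHERE pitch_type IS NOT NULL` filter removes it. The sketch below only illustrates the query shape: it uses an in-memory SQLite database with made-up rows as a stand-in for the MySQL warehouse.

```python
import sqlite3

# In-memory SQLite stand-in for dim_pitches, with made-up rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_pitches (pitch_id INTEGER PRIMARY KEY, pitch_type TEXT, pitch_speed REAL)")
conn.executemany(
    "INSERT INTO dim_pitches VALUES (?, ?, ?)",
    [(0, "FF", 92.0), (1, "FF", 93.0), (2, "CU", 77.0), (3, None, 0.0)],
)

# Same aggregate as the notebook query, minus rows without a pitch type
query = """
    SELECT pitch_type, ROUND(AVG(pitch_speed), 2) AS average_pitch_speed
    FROM dim_pitches
    WHERE pitch_type IS NOT NULL
    GROUP BY pitch_type
    ORDER BY pitch_type;
"""
rows = conn.execute(query).fetchall()
conn.close()
print(rows)  # [('CU', 77.0), ('FF', 92.5)]
```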


In [63]:
# Note: this query counts at-bats (rows in dim_atbats), not individual pitches
sql_total_atbats = "SELECT CONCAT(p.first_name, ' ', p.last_name) AS player_name, COUNT(*) AS at_bats_thrown FROM dim_atbats a JOIN dim_players p ON a.player_id_p = p.player_id GROUP BY player_name ORDER BY at_bats_thrown DESC;"

df_total_atbats = get_sql_dataframe(sql_total_atbats, **mysql_args)
df_total_atbats

Unnamed: 0,player_name,at_bats_thrown
0,Jeff Samardzija,116
1,Max Scherzer,113
2,Corey Kluber,112
3,Phil Hughes,112
4,Rick Porcello,111
...,...,...
424,Ike Davis,3
425,Rob Wooten,3
426,Cesar Jimenez,2
427,Jon Edwards,2


In [65]:
sql_spec_pitcher_game = "SELECT CONCAT(p.first_name, ' ', p.last_name) AS player_name, COUNT(*) AS at_bats_thrown FROM dim_atbats a JOIN dim_players p ON a.player_id_p = p.player_id WHERE p.player_id = 282332 AND a.game_id = 201500046 GROUP BY player_name;"

df_spec_pitcher_game = get_sql_dataframe(sql_spec_pitcher_game, **mysql_args)
df_spec_pitcher_game

Unnamed: 0,player_name,at_bats_thrown
0,CC Sabathia,24
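
The player and game IDs in the query above are written directly into the SQL string. To reuse the query for other pitchers or games, the IDs could be passed as bound parameters instead. The sketch below is an illustration under assumptions: it runs against an in-memory SQLite stand-in with made-up rows, uses `||` for concatenation because older SQLite versions lack `CONCAT`, and uses SQLite's `?` placeholders (a MySQL driver has its own placeholder style). `at_bats_thrown` is a hypothetical helper.

```python
import sqlite3

# In-memory SQLite stand-in for the warehouse tables, with made-up rows
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_players (player_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE dim_atbats (atbat_id INTEGER PRIMARY KEY, game_id INTEGER, player_id_p INTEGER);
    INSERT INTO dim_players VALUES (1, 'Ann', 'Example'), (2, 'Bob', 'Sample');
    INSERT INTO dim_atbats VALUES (10, 100, 1), (11, 100, 1), (12, 100, 2), (13, 101, 1);
""")

def at_bats_thrown(connection, player_id, game_id):
    """Count the at-bats a given pitcher threw in a given game."""
    query = """
        SELECT p.first_name || ' ' || p.last_name AS player_name, COUNT(*) AS at_bats_thrown
        FROM dim_atbats a
        JOIN dim_players p ON a.player_id_p = p.player_id
        WHERE p.player_id = ? AND a.game_id = ?
        GROUP BY p.player_id;
    """
    return connection.execute(query, (player_id, game_id)).fetchall()

result = at_bats_thrown(conn, 1, 100)
conn.close()
print(result)  # [('Ann Example', 2)]
```

Grouping by `player_id` rather than by the concatenated name also keeps two players who share a name from being merged into one row.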


## Conclusion
#### Overall, this notebook demonstrates the ETL Pipeline I created to Extract, Transform, and Load my data as desired.
#### The SQL statements I authored show a few of the ways the tables can be queried to produce different findings.
#### Hope you enjoyed.