<h1 id="tocheading">Table of Contents</h1>
<br />
<div id="toc"><ul class="toc"><li><a href="#1.-Getting-to-Know-the-Data">1. Getting to Know the Data</a><a class="anchor-link" href="#1.-Getting-to-Know-the-Data">¶</a></li><ul class="toc"><li><a href="#1.1.-Define-data-types-of-main-CSV-file,-game_log.csv">1.1. Define data types of main CSV file, <code>game_log.csv</code></a><a class="anchor-link" href="#1.1.-Define-data-types-of-main-CSV-file,-game_log.csv">¶</a></li><li><a href="#1.2.-Quick-glance-at-each-CSV-file">1.2. Quick glance at each CSV file</a><a class="anchor-link" href="#1.2.-Quick-glance-at-each-CSV-file">¶</a></li><ul class="toc"><li><a href="#1.2.1.-Note-on-players'-positions-in-game_log">1.2.1. Note on players' positions in <code>game_log</code></a><a class="anchor-link" href="#1.2.1.-Note-on-players'-positions-in-game_log">¶</a></li><li><a href="#1.2.2.-Notes-on-types-of-leagues">1.2.2. Notes on types of leagues</a><a class="anchor-link" href="#1.2.2.-Notes-on-types-of-leagues">¶</a></li><ul class="toc"><li><a href="#1.2.2.1.-No-league">1.2.2.1. No league</a><a class="anchor-link" href="#1.2.2.1.-No-league">¶</a></li><li><a href="#1.2.2.2.-Currently-existing-leagues">1.2.2.2. Currently existing leagues</a><a class="anchor-link" href="#1.2.2.2.-Currently-existing-leagues">¶</a></li><li><a href="#1.2.2.3.-Currently-defunct-leagues">1.2.2.3. Currently defunct leagues</a><a class="anchor-link" href="#1.2.2.3.-Currently-defunct-leagues">¶</a></li></ul></ul></ul><li><a href="#2.-Importing-Data-into-SQLite">2. Importing Data into SQLite</a><a class="anchor-link" href="#2.-Importing-Data-into-SQLite">¶</a></li><ul class="toc"><li><a href="#2.1.-Create-unique-identifier-for-each-row-in-game_log-table">2.1. Create unique identifier for each row in <code>game_log</code> table</a><a class="anchor-link" href="#2.1.-Create-unique-identifier-for-each-row-in-game_log-table">¶</a></li></ul><li><a href="#3.-Plan-for-database-normalisation">3. 
Plan for database normalisation</a><a class="anchor-link" href="#3.-Plan-for-database-normalisation">¶</a></li><ul class="toc"><li><a href="#3.1.-Reduce-number-of-columns-(improve-normal-form-into-1NF)">3.1. Reduce number of columns (improve normal form into 1NF)</a><a class="anchor-link" href="#3.1.-Reduce-number-of-columns-(improve-normal-form-into-1NF)">¶</a></li><li><a href="#3.2.-Improve-normal-form-(into-Boyce–Codd-normal-form-(BCNF))">3.2. Improve normal form (into Boyce–Codd normal form (BCNF))</a><a class="anchor-link" href="#3.2.-Improve-normal-form-(into-Boyce–Codd-normal-form-(BCNF))">¶</a></li><li><a href="#3.3.-Normalized-Schema">3.3. Normalized Schema</a><a class="anchor-link" href="#3.3.-Normalized-Schema">¶</a></li></ul><li><a href="#4.-Execute-database-normalisation">4. Execute database normalisation</a><a class="anchor-link" href="#4.-Execute-database-normalisation">¶</a></li><ul class="toc"><li><a href="#4.1.-Creating-Tables-Without-Foreign-Key-Relations">4.1. Creating Tables Without Foreign Key Relations</a><a class="anchor-link" href="#4.1.-Creating-Tables-Without-Foreign-Key-Relations">¶</a></li><li><a href="#4.2.-Adding-The-Team-and-Game-Tables">4.2. Adding The Team and Game Tables</a><a class="anchor-link" href="#4.2.-Adding-The-Team-and-Game-Tables">¶</a></li><li><a href="#4.3.-Adding-the-Team-Appearance-Table">4.3. Adding the Team Appearance Table</a><a class="anchor-link" href="#4.3.-Adding-the-Team-Appearance-Table">¶</a></li><li><a href="#4.4.-Adding-the-Person-Appearance-Table">4.4. Adding the Person Appearance Table</a><a class="anchor-link" href="#4.4.-Adding-the-Person-Appearance-Table">¶</a></li><li><a href="#4.5.-Removing-the-Original-Tables">4.5. Removing the Original Tables</a><a class="anchor-link" href="#4.5.-Removing-the-Original-Tables">¶</a></li></ul><li><a href="#5.-Tasks-for-later">5. Tasks for later</a><a class="anchor-link" href="#5.-Tasks-for-later">¶</a></li></ul></div>

Project guide: https://www.dataquest.io/m/193/guided-project%3A-designing-and-creating-a-database

Solution by DataQuest: https://github.com/dataquestio/solutions/blob/master/Mission193Solutions.ipynb

The database used in this project is described in the following excerpt from DataQuest.
___
We will be working with a file of [Major League Baseball](https://en.wikipedia.org/wiki/Major_League_Baseball) games from [Retrosheet](http://www.retrosheet.org/). Retrosheet compiles detailed statistics on baseball games from the 1800s through to today. The main file we will be working from `game_log.csv`, has been produced by combining 127 separate CSV files from retrosheet, and has been pre-cleaned to remove some inconsistencies. The game log has hundreds of data points on each game which we will normalize this data into several separate tables using SQL, providing a robust database of game-level statistics.

In addition to the main file, we have also included three 'helper' files, also sourced from Retrosheet:

*   `park_codes.csv`
*   `person_codes.csv`
*   `team_codes.csv`

These three helper files in some cases contain extra data, but will also make things easier as they will form the basis for three of our normalized tables.
___

# 1. Getting to Know the Data


## 1.1. Define data types of main CSV file, `game_log.csv`

I will first define the data type of every column in the biggest CSV file, `game_log.csv`.

This CSV file is massive, and specifying the types up front will reduce memory usage while reading it into a [pandas](https://pandas.pydata.org/) data frame.

In [1]:
# data types in game_log.csv (https://stackoverflow.com/a/27232309)
game_dtype = {'1b_umpire_id': 'O',
 '1b_umpire_name': 'O',
 '2b_umpire_id': 'O',
 '2b_umpire_name': 'O',
 '3b_umpire_id': 'O',
 '3b_umpire_name': 'O',
 'acquisition_info': 'O',
 'additional_info': 'O',
 'attendance': 'float64',
 'completion': 'O',
 'date': 'int64',
 'day_night': 'O',
 'day_of_week': 'O',
 'forefeit': 'O',
 'h_assists': 'float64',
 'h_at_bats': 'float64',
 'h_balks': 'float64',
 'h_caught_stealing': 'float64',
 'h_double_plays': 'float64',
 'h_doubles': 'float64',
 'h_errors': 'float64',
 'h_first_catcher_interference': 'float64',
 'h_game_number': 'int64',
 'h_grounded_into_double': 'float64',
 'h_hit_by_pitch': 'float64',
 'h_hits': 'float64',
 'h_homeruns': 'float64',
 'h_individual_earned_runs': 'float64',
 'h_intentional_walks': 'float64',
 'h_league': 'O',
 'h_left_on_base': 'float64',
 'h_line_score': 'O',
 'h_manager_id': 'O',
 'h_manager_name': 'O',
 'h_name': 'O',
 'h_passed_balls': 'float64',
 'h_pitchers_used': 'float64',
 'h_player_1_def_pos': 'float64',
 'h_player_1_id': 'O',
 'h_player_1_name': 'O',
 'h_player_2_def_pos': 'float64',
 'h_player_2_id': 'O',
 'h_player_2_name': 'O',
 'h_player_3_def_pos': 'float64',
 'h_player_3_id': 'O',
 'h_player_3_name': 'O',
 'h_player_4_def_pos': 'float64',
 'h_player_4_id': 'O',
 'h_player_4_name': 'O',
 'h_player_5_def_pos': 'float64',
 'h_player_5_id': 'O',
 'h_player_5_name': 'O',
 'h_player_6_def_pos': 'float64',
 'h_player_6_id': 'O',
 'h_player_6_name': 'O',
 'h_player_7_def_pos': 'float64',
 'h_player_7_id': 'O',
 'h_player_7_name': 'O',
 'h_player_8_def_pos': 'float64',
 'h_player_8_id': 'O',
 'h_player_8_name': 'O',
 'h_player_9_def_pos': 'float64',
 'h_player_9_id': 'O',
 'h_player_9_name': 'O',
 'h_putouts': 'float64',
 'h_rbi': 'float64',
 'h_sacrifice_flies': 'float64',
 'h_sacrifice_hits': 'float64',
 'h_score': 'int64',
 'h_starting_pitcher_id': 'O',
 'h_starting_pitcher_name': 'O',
 'h_stolen_bases': 'float64',
 'h_strikeouts': 'float64',
 'h_team_earned_runs': 'float64',
 'h_triple_plays': 'float64',
 'h_triples': 'float64',
 'h_walks': 'float64',
 'h_wild_pitches': 'float64',
 'hp_umpire_id': 'O',
 'hp_umpire_name': 'O',
 'length_minutes': 'float64',
 'length_outs': 'float64',
 'lf_umpire_id': 'O',
 'lf_umpire_name': 'O',
 'losing_pitcher_id': 'O',
 'losing_pitcher_name': 'O',
 'number_of_game': 'int64',
 'park_id': 'O',
 'protest': 'O',
 'rf_umpire_id': 'O',
 'rf_umpire_name': 'O',
 'saving_pitcher_id': 'O',
 'saving_pitcher_name': 'O',
 'v_assists': 'float64',
 'v_at_bats': 'float64',
 'v_balks': 'float64',
 'v_caught_stealing': 'float64',
 'v_double_plays': 'float64',
 'v_doubles': 'float64',
 'v_errors': 'float64',
 'v_first_catcher_interference': 'float64',
 'v_game_number': 'int64',
 'v_grounded_into_double': 'float64',
 'v_hit_by_pitch': 'float64',
 'v_hits': 'float64',
 'v_homeruns': 'float64',
 'v_individual_earned_runs': 'float64',
 'v_intentional_walks': 'float64',
 'v_league': 'O',
 'v_left_on_base': 'float64',
 'v_line_score': 'O',
 'v_manager_id': 'O',
 'v_manager_name': 'O',
 'v_name': 'O',
 'v_passed_balls': 'float64',
 'v_pitchers_used': 'float64',
 'v_player_1_def_pos': 'float64',
 'v_player_1_id': 'O',
 'v_player_1_name': 'O',
 'v_player_2_def_pos': 'float64',
 'v_player_2_id': 'O',
 'v_player_2_name': 'O',
 'v_player_3_def_pos': 'float64',
 'v_player_3_id': 'O',
 'v_player_3_name': 'O',
 'v_player_4_def_pos': 'float64',
 'v_player_4_id': 'O',
 'v_player_4_name': 'O',
 'v_player_5_def_pos': 'float64',
 'v_player_5_id': 'O',
 'v_player_5_name': 'O',
 'v_player_6_def_pos': 'float64',
 'v_player_6_id': 'O',
 'v_player_6_name': 'O',
 'v_player_7_def_pos': 'float64',
 'v_player_7_id': 'O',
 'v_player_7_name': 'O',
 'v_player_8_def_pos': 'float64',
 'v_player_8_id': 'O',
 'v_player_8_name': 'O',
 'v_player_9_def_pos': 'float64',
 'v_player_9_id': 'O',
 'v_player_9_name': 'O',
 'v_putouts': 'float64',
 'v_rbi': 'float64',
 'v_sacrifice_flies': 'float64',
 'v_sacrifice_hits': 'float64',
 'v_score': 'int64',
 'v_starting_pitcher_id': 'O',
 'v_starting_pitcher_name': 'O',
 'v_stolen_bases': 'float64',
 'v_strikeouts': 'float64',
 'v_team_earned_runs': 'float64',
 'v_triple_plays': 'float64',
 'v_triples': 'float64',
 'v_walks': 'float64',
 'v_wild_pitches': 'float64',
 'winning_pitcher_id': 'O',
 'winning_pitcher_name': 'O',
 'winning_rbi_batter_id': 'O',
 'winning_rbi_batter_id_name': 'O'}

## 1.2. Quick glance at each CSV file

Each CSV will be loaded as a pandas data frame. The names of the data frames will be the same as those of the CSV files.

|File name|Data frame|
|---|---|
|game_log.csv|game_log|
|park_codes.csv|park_codes|
|person_codes.csv|person_codes|
|team_codes.csv|team_codes|

The first 5 rows of each data frame will be displayed.

In [2]:
from IPython.display import display

import pandas as pd
import re

def read_display_csv(csv, dtype):
    """
    csv: CSV filename without extension.
    dtype: data types in CSV file
    
    Read in CSV file as Pandas data frame,
    display data dimension and
    first five rows.
    
    Then, return data frame.    
    """
    
    # read in data
    csv_fname = csv + ".csv"
    df = pd.read_csv(csv_fname, dtype=dtype)

    # display data dimension
    dim = df.shape

    print("{}".format(csv_fname))
    print("Number of rows:", dim[0])
    print("Number of columns:", dim[1])

    # display first 5 rows
    display(df.head(5))
    
    print("\n\n")
    
    return df


# ensure rows or values are not truncated
pd.set_option("display.max_columns", 180)
pd.set_option("display.max_rows", 200000)
pd.set_option("display.max_colwidth", 5000)

# set CSV file names
game_n = "game_log"
park_n = "park_codes"
person_n = "person_codes"
team_n = "team_codes"

csv_list = [game_n, park_n, person_n, team_n]
df_list = [game_n, park_n, person_n, team_n]

# read in CSV files and display some info
for i, csv in enumerate(csv_list):
    dtype = game_dtype if csv == game_n else None
    df_list[i] = read_display_csv(csv, dtype)

# remember data loaded from CSV file
game = df_list[0]
park = df_list[1]
person = df_list[2]
team = df_list[3]

game_log.csv
Number of rows: 171907
Number of columns: 161


Unnamed: 0,date,number_of_game,day_of_week,v_name,v_league,v_game_number,h_name,h_league,h_game_number,v_score,h_score,length_outs,day_night,completion,forefeit,protest,park_id,attendance,length_minutes,v_line_score,h_line_score,v_at_bats,v_hits,v_doubles,v_triples,v_homeruns,v_rbi,v_sacrifice_hits,v_sacrifice_flies,v_hit_by_pitch,v_walks,v_intentional_walks,v_strikeouts,v_stolen_bases,v_caught_stealing,v_grounded_into_double,v_first_catcher_interference,v_left_on_base,v_pitchers_used,v_individual_earned_runs,v_team_earned_runs,v_wild_pitches,v_balks,v_putouts,v_assists,v_errors,v_passed_balls,v_double_plays,v_triple_plays,h_at_bats,h_hits,h_doubles,h_triples,h_homeruns,h_rbi,h_sacrifice_hits,h_sacrifice_flies,h_hit_by_pitch,h_walks,h_intentional_walks,h_strikeouts,h_stolen_bases,h_caught_stealing,h_grounded_into_double,h_first_catcher_interference,h_left_on_base,h_pitchers_used,h_individual_earned_runs,h_team_earned_runs,h_wild_pitches,h_balks,h_putouts,h_assists,h_errors,h_passed_balls,h_double_plays,h_triple_plays,hp_umpire_id,hp_umpire_name,1b_umpire_id,1b_umpire_name,2b_umpire_id,2b_umpire_name,3b_umpire_id,3b_umpire_name,lf_umpire_id,lf_umpire_name,rf_umpire_id,rf_umpire_name,v_manager_id,v_manager_name,h_manager_id,h_manager_name,winning_pitcher_id,winning_pitcher_name,losing_pitcher_id,losing_pitcher_name,saving_pitcher_id,saving_pitcher_name,winning_rbi_batter_id,winning_rbi_batter_id_name,v_starting_pitcher_id,v_starting_pitcher_name,h_starting_pitcher_id,h_starting_pitcher_name,v_player_1_id,v_player_1_name,v_player_1_def_pos,v_player_2_id,v_player_2_name,v_player_2_def_pos,v_player_3_id,v_player_3_name,v_player_3_def_pos,v_player_4_id,v_player_4_name,v_player_4_def_pos,v_player_5_id,v_player_5_name,v_player_5_def_pos,v_player_6_id,v_player_6_name,v_player_6_def_pos,v_player_7_id,v_player_7_name,v_player_7_def_pos,v_player_8_id,v_player_8_name,v_player_8_def_pos,v_player_9_id,v_player_9_name,v_player_9_def_pos,h_player_1_id,h_player_1_name,h_player_1_def_pos,h_player_2_id,h_player_2_name,h_player_2_def_pos,h_player_3_id,h_player_3_name,h_player_3_def_pos,h_player_4_id,h_player_4_name,h_player_4_def_pos,h_player_5_id,h_player_5_name,h_player_5_def_pos,h_player_6_id,h_player_6_name,h_player_6_def_pos,h_player_7_id,h_player_7_name,h_player_7_def_pos,h_player_8_id,h_player_8_name,h_player_8_def_pos,h_player_9_id,h_player_9_name,h_player_9_def_pos,additional_info,acquisition_info
0,18710504,0,Thu,CL1,,1,FW1,,1,0,2,54.0,D,,,,FOR01,200.0,120.0,0,10010000,30.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,,6.0,1.0,,-1.0,,4.0,1.0,1.0,1.0,0.0,0.0,27.0,9.0,0.0,3.0,0.0,0.0,31.0,4.0,1.0,0.0,0.0,2.0,0.0,0.0,0.0,1.0,,0.0,0.0,,-1.0,,3.0,1.0,0.0,0.0,0.0,0.0,27.0,3.0,3.0,1.0,1.0,0.0,boakj901,John Boake,,,,,,,,,,,paboc101,Charlie Pabor,lennb101,Bill Lennon,mathb101,Bobby Mathews,prata101,Al Pratt,,,,,prata101,Al Pratt,mathb101,Bobby Mathews,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0,paboc101,Charlie Pabor,7.0,allia101,Art Allison,8.0,white104,Elmer White,9.0,prata101,Al Pratt,1.0,sutte101,Ezra Sutton,5.0,carlj102,Jim Carleton,3.0,bassj101,John Bass,6.0,selmf101,Frank Sellman,5.0,mathb101,Bobby Mathews,1.0,foraj101,Jim Foran,3.0,goldw101,Wally Goldsmith,6.0,lennb101,Bill Lennon,2.0,caret101,Tom Carey,4.0,mince101,Ed Mincher,7.0,mcdej101,James McDermott,8.0,kellb105,Bill Kelly,9.0,,Y
1,18710505,0,Fri,BS1,,1,WS3,,1,20,18,54.0,D,,,,WAS01,5000.0,145.0,107000435,640113030,41.0,13.0,1.0,2.0,0.0,13.0,0.0,0.0,0.0,18.0,,5.0,3.0,,-1.0,,12.0,1.0,6.0,6.0,1.0,0.0,27.0,13.0,10.0,1.0,2.0,0.0,49.0,14.0,2.0,0.0,0.0,11.0,0.0,0.0,0.0,10.0,,2.0,1.0,,-1.0,,14.0,1.0,7.0,7.0,0.0,0.0,27.0,20.0,10.0,2.0,3.0,0.0,dobsh901,Henry Dobson,,,,,,,,,,,wrigh101,Harry Wright,younn801,Nick Young,spala101,Al Spalding,braia102,Asa Brainard,,,,,spala101,Al Spalding,braia102,Asa Brainard,wrigg101,George Wright,6.0,barnr102,Ross Barnes,4.0,birdd102,Dave Birdsall,9.0,mcvec101,Cal McVey,2.0,wrigh101,Harry Wright,8.0,goulc101,Charlie Gould,3.0,schah101,Harry Schafer,5.0,conef101,Fred Cone,7.0,spala101,Al Spalding,1.0,watef102,Fred Waterman,5.0,forcd101,Davy Force,6.0,mille105,Everett Mills,3.0,allid101,Doug Allison,2.0,hallg101,George Hall,7.0,leona101,Andy Leonard,4.0,braia102,Asa Brainard,1.0,burrh101,Henry Burroughs,9.0,berth101,Henry Berthrong,8.0,HTBF,Y
2,18710506,0,Sat,CL1,,2,RC1,,1,12,4,54.0,D,,,,RCK01,1000.0,140.0,610020003,10020100,49.0,11.0,1.0,1.0,0.0,8.0,0.0,0.0,0.0,0.0,,1.0,0.0,,-1.0,,10.0,1.0,0.0,0.0,2.0,0.0,27.0,12.0,8.0,5.0,0.0,0.0,36.0,7.0,2.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,,3.0,5.0,,-1.0,,5.0,1.0,3.0,3.0,1.0,0.0,27.0,12.0,13.0,3.0,0.0,0.0,mawnj901,J.H. Manny,,,,,,,,,,,paboc101,Charlie Pabor,hasts101,Scott Hastings,prata101,Al Pratt,fishc102,Cherokee Fisher,,,,,prata101,Al Pratt,fishc102,Cherokee Fisher,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0,paboc101,Charlie Pabor,7.0,allia101,Art Allison,8.0,white104,Elmer White,9.0,prata101,Al Pratt,1.0,sutte101,Ezra Sutton,5.0,carlj102,Jim Carleton,3.0,bassj101,John Bass,6.0,mackd101,Denny Mack,3.0,addyb101,Bob Addy,4.0,fishc102,Cherokee Fisher,1.0,hasts101,Scott Hastings,8.0,ham-r101,Ralph Ham,5.0,ansoc101,Cap Anson,2.0,sagep101,Pony Sager,6.0,birdg101,George Bird,7.0,stirg101,Gat Stires,9.0,,Y
3,18710508,0,Mon,CL1,,3,CH1,,1,12,14,54.0,D,,,,CHI01,5000.0,150.0,101403111,77000000,46.0,15.0,2.0,1.0,2.0,10.0,0.0,0.0,0.0,0.0,,1.0,0.0,,-1.0,,7.0,1.0,6.0,6.0,0.0,0.0,27.0,15.0,11.0,6.0,0.0,0.0,43.0,11.0,2.0,0.0,0.0,8.0,0.0,0.0,0.0,4.0,,2.0,1.0,,-1.0,,6.0,1.0,4.0,4.0,0.0,0.0,27.0,14.0,7.0,2.0,0.0,0.0,willg901,Gardner Willard,,,,,,,,,,,paboc101,Charlie Pabor,woodj106,Jimmy Wood,zettg101,George Zettlein,prata101,Al Pratt,,,,,prata101,Al Pratt,zettg101,George Zettlein,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0,paboc101,Charlie Pabor,7.0,allia101,Art Allison,8.0,white104,Elmer White,9.0,prata101,Al Pratt,1.0,sutte101,Ezra Sutton,5.0,carlj102,Jim Carleton,3.0,bassj101,John Bass,6.0,mcatb101,Bub McAtee,3.0,kingm101,Marshall King,8.0,hodec101,Charlie Hodes,2.0,woodj106,Jimmy Wood,4.0,simmj101,Joe Simmons,9.0,folet101,Tom Foley,7.0,duffe101,Ed Duffy,6.0,pinke101,Ed Pinkham,5.0,zettg101,George Zettlein,1.0,,Y
4,18710509,0,Tue,BS1,,2,TRO,,1,9,5,54.0,D,,,,TRO01,3250.0,145.0,2232,101003000,46.0,17.0,4.0,1.0,0.0,6.0,0.0,0.0,0.0,2.0,,0.0,1.0,,-1.0,,12.0,1.0,2.0,2.0,0.0,0.0,27.0,12.0,5.0,0.0,1.0,0.0,36.0,9.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,3.0,,0.0,2.0,,-1.0,,7.0,1.0,3.0,3.0,1.0,0.0,27.0,11.0,7.0,3.0,0.0,0.0,leroi901,Isaac Leroy,,,,,,,,,,,wrigh101,Harry Wright,pikel101,Lip Pike,spala101,Al Spalding,mcmuj101,John McMullin,,,,,spala101,Al Spalding,mcmuj101,John McMullin,wrigg101,George Wright,6.0,barnr102,Ross Barnes,4.0,birdd102,Dave Birdsall,9.0,mcvec101,Cal McVey,2.0,wrigh101,Harry Wright,8.0,goulc101,Charlie Gould,3.0,schah101,Harry Schafer,5.0,conef101,Fred Cone,7.0,spala101,Al Spalding,1.0,flync101,Clipper Flynn,9.0,mcgem101,Mike McGeary,2.0,yorkt101,Tom York,8.0,mcmuj101,John McMullin,1.0,kings101,Steve King,7.0,beave101,Edward Beavens,4.0,bells101,Steve Bellan,5.0,pikel101,Lip Pike,3.0,cravb101,Bill Craver,6.0,HTBF,Y





park_codes.csv
Number of rows: 252
Number of columns: 9


Unnamed: 0,park_id,name,aka,city,state,start,end,league,notes
0,ALB01,Riverside Park,,Albany,NY,09/11/1880,05/30/1882,NL,TRN:9/11/80;6/15&9/10/1881;5/16-5/18&5/30/1882
1,ALT01,Columbia Park,,Altoona,PA,04/30/1884,05/31/1884,UA,
2,ANA01,Angel Stadium of Anaheim,Edison Field; Anaheim Stadium,Anaheim,CA,04/19/1966,,AL,
3,ARL01,Arlington Stadium,,Arlington,TX,04/21/1972,10/03/1993,AL,
4,ARL02,Rangers Ballpark in Arlington,The Ballpark in Arlington; Ameriquest Fl,Arlington,TX,04/11/1994,,AL,





person_codes.csv
Number of rows: 20494
Number of columns: 7


Unnamed: 0,id,last,first,player_debut,mgr_debut,coach_debut,ump_debut
0,aardd001,Aardsma,David,04/06/2004,,,
1,aaroh101,Aaron,Hank,04/13/1954,,,
2,aarot101,Aaron,Tommie,04/10/1962,,04/06/1979,
3,aased001,Aase,Don,07/26/1977,,,
4,abada001,Abad,Andy,09/10/2001,,,





team_codes.csv
Number of rows: 150
Number of columns: 8


Unnamed: 0,team_id,league,start,end,city,nickname,franch_id,seq
0,ALT,UA,1884,1884,Altoona,Mountain Cities,ALT,1
1,ARI,NL,1998,0,Arizona,Diamondbacks,ARI,1
2,BFN,NL,1879,1885,Buffalo,Bisons,BFN,1
3,BFP,PL,1890,1890,Buffalo,Bisons,BFP,1
4,BL1,,1872,1874,Baltimore,Canaries,BL1,1







The links between **`game_log`** and the other files seem to be as follows.

* **`park_codes`** and **`game_log`** appear to be linked via the `park_id` column, which appears in both files.

* **`person_codes`**'s `id` column seems to correspond with **`game_log`**'s `..._player_..._id`, `..._umpire_id`, `..._pitcher_id`, `..._manager_id` and `winning_rbi_batter_id`.

* **`team_codes`**'s `team_id` column appears to be linked with **`game_log`**'s `h_name` and `v_name` (IDs of home and visiting teams).
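
These suspected links can be checked before relying on them. The sketch below is only an illustration: it uses a tiny in-memory SQLite database with invented rows (the table and column names follow the CSV files, the values do not come from them) and lists `park_id` values in `game_log` that have no match in `park_codes`.

```python
import sqlite3

# Toy data only: a few invented rows standing in for the real CSV files.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE park_codes (park_id TEXT, name TEXT);
    CREATE TABLE game_log (date INTEGER, h_name TEXT, park_id TEXT);
    INSERT INTO park_codes VALUES ('FOR01', 'Park A'), ('WAS01', 'Park B');
    INSERT INTO game_log VALUES
        (18710504, 'FW1', 'FOR01'),
        (18710505, 'WS3', 'WAS01'),
        (18710506, 'RC1', 'XXX99');  -- invented ID with no matching park
""")

# A LEFT JOIN keeps every game; games without a matching park
# have NULL in the park_codes columns.
unmatched = [row[0] for row in conn.execute("""
    SELECT DISTINCT g.park_id
    FROM game_log AS g
    LEFT JOIN park_codes AS p ON p.park_id = g.park_id
    WHERE p.park_id IS NULL
""")]
conn.close()

print(unmatched)  # ['XXX99']
```

An empty result would support the assumption that every `park_id` in `game_log` refers to a row in `park_codes`. The same pattern would apply to the person and team IDs.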


### 1.2.1. Note on players' positions in `game_log`

There are 9 defensive positions in baseball. Each player's position is shown in `v_player_..._def_pos` and `h_player_..._def_pos` columns where `v` means "visiting team", `h` means "home team" and `def_pos` means "defensive position". `...` is a number ranging from 1 to 9, showing [batting order](https://en.wikipedia.org/wiki/Batting_order_(baseball%29).

The values in these columns are [defensive position numbers](https://en.wikipedia.org/wiki/Baseball_positions), which normally range from 1 to 9. In `game_log`, however, the value 10 also appears, and which values occur depends on the league that the home team belongs to. When the home team is in the National League, the values range from 1 to 9. When it is in the American League, they range from 2 to 10. This matches the designated hitter rule, which the American League adopted in 1973: the pitcher (1) does not bat, so the batting order contains a designated hitter, recorded as 10, instead. The table below shows the number recorded for each position.



<table>
<thead>
<tr class="header">
<th></th>
<th colspan="2">League which home team belongs to</th>
</tr>
<tr class="header">
<th></th>
<th>National League</th>
<th>American League</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>pitcher</td>
<td>1</td>
<td>not in batting order</td>
</tr>
<tr class="even">
<td>catcher</td>
<td>2</td>
<td>2</td>
</tr>
<tr class="odd">
<td>first baseman</td>
<td>3</td>
<td>3</td>
</tr>
<tr class="even">
<td>second baseman</td>
<td>4</td>
<td>4</td>
</tr>
<tr class="odd">
<td>third baseman</td>
<td>5</td>
<td>5</td>
</tr>
<tr class="even">
<td>shortstop</td>
<td>6</td>
<td>6</td>
</tr>
<tr class="odd">
<td>left fielder</td>
<td>7</td>
<td>7</td>
</tr>
<tr class="even">
<td>center fielder</td>
<td>8</td>
<td>8</td>
</tr>
<tr class="odd">
<td>right fielder</td>
<td>9</td>
<td>9</td>
</tr>
<tr class="even">
<td>designated hitter</td>
<td>not used</td>
<td>10</td>
</tr>
</tbody>
</table>

The ranges are the same for both teams in most cases. For example, if the home team's position numbers range from 2-10, so do the visiting team's. However, exceptions were found in two games in which the home team belonged to the American League, but only one team's position numbers ranged from 2-10.

I will not convert the 2-10 range to 1-9. The numbering is applied consistently, and the value 10 records the designated hitter rather than an error.

The following cell will show an example from all games on 2 October 2016. This is the most recent date in the data in which position number 10 appeared.

In [3]:
# get column names
all_cols = []
pos_cols = []

for i in game.columns:

    all_cols.append(i)
    
    # columns for defensive position numbers
    if "_pos" in i:
        pos_cols.append(i)

# get latest date on which position number 10 appears
max_date = None
for p in pos_cols:
    max_date_temp = game[game[p] == 10]["date"].max()
    if max_date is None:
        max_date = max_date_temp
    elif max_date < max_date_temp:
        max_date = max_date_temp


# get columns which uniquely divide _def_pos columns
# into 1-9 and 2-10 groups
cols_decisive = []
for c in all_cols:
    col = game[game["date"] == max_date][c]
    col_1 = col[col.index < 171899]
    col_2 = col[col.index >= 171899]
    
    col_1_u = col_1.unique()
    col_2_u = col_2.unique()
    
    if (col_1_u.shape[0] == 1) and (col_2_u.shape[0] == 1):
        if col_1_u != col_2_u:
            if pd.notnull(col_1_u) and pd.notnull(col_2_u):
                cols_decisive.append(c)

# display all games on the latest date showing position number 10.
# As can be seen, position numbers are 2-10 for American League
# and 1-9 for National League
extra_cols = ["date", "v_name", "h_name"]
game[cols_decisive + extra_cols + pos_cols][game["date"] == max_date]

Unnamed: 0,h_league,date,v_name,h_name,v_player_1_def_pos,v_player_2_def_pos,v_player_3_def_pos,v_player_4_def_pos,v_player_5_def_pos,v_player_6_def_pos,v_player_7_def_pos,v_player_8_def_pos,v_player_9_def_pos,h_player_1_def_pos,h_player_2_def_pos,h_player_3_def_pos,h_player_4_def_pos,h_player_5_def_pos,h_player_6_def_pos,h_player_7_def_pos,h_player_8_def_pos,h_player_9_def_pos
171892,AL,20161002,HOU,ANA,9.0,7.0,10.0,4.0,6.0,3.0,2.0,5.0,8.0,5.0,9.0,10.0,3.0,7.0,6.0,8.0,2.0,4.0
171893,AL,20161002,TOR,BOS,4.0,5.0,3.0,10.0,2.0,6.0,7.0,8.0,9.0,4.0,5.0,9.0,10.0,3.0,6.0,8.0,2.0,7.0
171894,AL,20161002,MIN,CHA,8.0,6.0,10.0,5.0,3.0,9.0,4.0,2.0,7.0,9.0,6.0,7.0,3.0,10.0,5.0,2.0,4.0,8.0
171895,AL,20161002,CLE,KCA,3.0,4.0,6.0,10.0,5.0,9.0,7.0,8.0,2.0,8.0,4.0,3.0,10.0,9.0,7.0,6.0,5.0,2.0
171896,AL,20161002,BAL,NYA,9.0,8.0,5.0,10.0,2.0,3.0,4.0,7.0,6.0,7.0,8.0,2.0,10.0,3.0,6.0,9.0,5.0,4.0
171897,AL,20161002,OAK,SEA,6.0,4.0,10.0,5.0,3.0,2.0,8.0,9.0,7.0,8.0,6.0,10.0,9.0,5.0,3.0,7.0,2.0,4.0
171898,AL,20161002,TBA,TEX,4.0,8.0,5.0,7.0,3.0,6.0,10.0,2.0,9.0,8.0,9.0,10.0,5.0,7.0,2.0,6.0,3.0,4.0
171899,NL,20161002,SDN,ARI,8.0,3.0,5.0,7.0,9.0,4.0,2.0,6.0,1.0,4.0,5.0,3.0,7.0,9.0,6.0,8.0,2.0,1.0
171900,NL,20161002,DET,ATL,4.0,8.0,3.0,9.0,7.0,5.0,2.0,6.0,1.0,8.0,5.0,3.0,7.0,9.0,2.0,4.0,6.0,1.0
171901,NL,20161002,CHN,CIN,8.0,5.0,3.0,4.0,6.0,9.0,7.0,2.0,1.0,6.0,4.0,3.0,7.0,8.0,5.0,2.0,9.0,1.0
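
The pattern in this output can be summarised by collecting the position numbers seen for each home league. The following is a minimal sketch in plain Python on three lineups copied from the rows above; it does not use the notebook's data frame, and the variable names are illustrative.

```python
from collections import defaultdict

# (h_league, position numbers of one lineup), copied from three rows above
lineups = [
    ("AL", [9, 7, 10, 4, 6, 3, 2, 5, 8]),   # HOU at ANA, visiting lineup
    ("AL", [4, 5, 3, 10, 2, 6, 7, 8, 9]),   # TOR at BOS, visiting lineup
    ("NL", [8, 3, 5, 7, 9, 4, 2, 6, 1]),    # SDN at ARI, visiting lineup
]

# collect every position number seen for each home league
seen = defaultdict(set)
for league, positions in lineups:
    seen[league].update(positions)

# report the observed (minimum, maximum) per league
ranges = {league: (min(p), max(p)) for league, p in seen.items()}
print(ranges)  # {'AL': (2, 10), 'NL': (1, 9)}
```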


### 1.2.2. Notes on types of leagues

The following are the leagues which appear in **`game_log`**'s `v_league` and `h_league` columns. These indicate the leagues that the teams belonged to when the games were played. The meaning of each abbreviation is shown below.

#### 1.2.2.1. No league
* NaN: This refers to games played before the National League started. ["The first game in National League history was played on April 22, 1876 (...)"](https://en.wikipedia.org/wiki/National_League#Foundation).

#### 1.2.2.2. Currently existing leagues
* NL: [National League](https://en.wikipedia.org/wiki/National_League)
* AL: [American League](https://en.wikipedia.org/wiki/American_League)

#### 1.2.2.3. Currently defunct leagues

Refer to [List of defunct and relocated Major League Baseball teams](https://en.wikipedia.org/wiki/List_of_defunct_and_relocated_Major_League_Baseball_teams)

* AA: [American Association](https://en.wikipedia.org/wiki/American_Association_%2819th_century%29)
* UA: [Union Association](https://en.wikipedia.org/wiki/Union_Association)
* PL: [Players' League](https://en.wikipedia.org/wiki/Players%27_League)
* FL: [Federal League](https://en.wikipedia.org/wiki/Federal_League)

# 2. Importing Data into SQLite

The four data frames will now be imported into a SQLite database, and a summary of each table's columns will be displayed.

In [4]:
import glob, os, sqlite3

# The functions are from my other project guided by Dataquest (http://bit.ly/2sQGZb8)
# run_query function has been slightly modified.
def run_query(query, display_output=False):
    """
    query: "SELECT" statement
    db: database
    
    Run query on db.
    Then return result.
    """
    
    # connect to database
    with sqlite3.connect(db) as conn:
        
        # get output
        output = pd.read_sql(query, conn)
        
        # display output if requested
        if display_output:
            display(output)
    
        # query database and return result
        return output

def run_command(command):
    """
    Command: Statement which does not return results
    (e.g. CREATE, INSERT)
    db: database
    
    Run command on db using sqlite3
    """
    
    # connect to database
    with sqlite3.connect(db) as conn:
        
        # enable foreign key constraints
        conn.execute("PRAGMA foreign_keys = ON")
        
        try:
            conn.execute(command)

        # report SQL errors instead of raising them
        except sqlite3.Error as e:
            print(e)

                
def run_command_2(command, params=(), executemany=False):
    """
    Command: Statement or list of statements which do not return results
    (e.g. CREATE, INSERT)
    db: database
    
    Run command on db using sqlite3
    """
    
    # connect to database
    with sqlite3.connect(db) as conn:
        
        # enable foreign key constraints
        conn.execute("PRAGMA foreign_keys = ON")
        
        # create cursor
        cursor = conn.cursor()
        
        # execute one same command with each parameter
        if executemany:
            cursor.executemany(command, params)
        
        # execute one or several commands with same parameters
        else:
            # accept a single statement as well as a list of statements
            commands = command if isinstance(command, list) else [command]
            try:
                for c in commands:
                    cursor.execute(c, params)
            # roll back the open transaction if any statement fails
            except sqlite3.Error as e:
                conn.rollback()
                print(e)


# set database name
db = "major_league.db"

# remove database if already exists
if glob.glob(db):
    os.remove(db)

# import data into SQLite database
with sqlite3.connect(db) as conn:
    for idx, df in enumerate(df_list):
        df.to_sql(csv_list[idx], conn, index=False, if_exists="replace")
        
# show table summaries
for table in csv_list:
    print(table)
    run_query("PRAGMA table_info({})".format(table), display_output=True)
    print("\n\n")

game_log


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,date,INTEGER,0,,0
1,1,number_of_game,INTEGER,0,,0
2,2,day_of_week,TEXT,0,,0
3,3,v_name,TEXT,0,,0
4,4,v_league,TEXT,0,,0
5,5,v_game_number,INTEGER,0,,0
6,6,h_name,TEXT,0,,0
7,7,h_league,TEXT,0,,0
8,8,h_game_number,INTEGER,0,,0
9,9,v_score,INTEGER,0,,0





park_codes


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,park_id,TEXT,0,,0
1,1,name,TEXT,0,,0
2,2,aka,TEXT,0,,0
3,3,city,TEXT,0,,0
4,4,state,TEXT,0,,0
5,5,start,TEXT,0,,0
6,6,end,TEXT,0,,0
7,7,league,TEXT,0,,0
8,8,notes,TEXT,0,,0





person_codes


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,id,TEXT,0,,0
1,1,last,TEXT,0,,0
2,2,first,TEXT,0,,0
3,3,player_debut,TEXT,0,,0
4,4,mgr_debut,TEXT,0,,0
5,5,coach_debut,TEXT,0,,0
6,6,ump_debut,TEXT,0,,0





team_codes


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,team_id,TEXT,0,,0
1,1,league,TEXT,0,,0
2,2,start,INTEGER,0,,0
3,3,end,INTEGER,0,,0
4,4,city,TEXT,0,,0
5,5,nickname,TEXT,0,,0
6,6,franch_id,TEXT,0,,0
7,7,seq,INTEGER,0,,0
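
The helper functions above issue `PRAGMA foreign_keys = ON` on every new connection because, in a default build, SQLite does not enforce foreign keys unless the setting is enabled for that connection. The sketch below illustrates the effect with two hypothetical tables (`parent` and `child`) in an in-memory database; the function name `try_orphan_insert` is made up for this example.

```python
import sqlite3

def try_orphan_insert(enforce):
    """Insert a child row whose parent does not exist.

    Return True if SQLite accepted the row, False if it was rejected.
    """
    conn = sqlite3.connect(":memory:")
    if enforce:
        # the setting applies to this connection only
        conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE parent (id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE child (id INTEGER PRIMARY KEY, "
        "parent_id TEXT REFERENCES parent(id))"
    )
    try:
        conn.execute("INSERT INTO child (parent_id) VALUES ('missing')")
        accepted = True
    except sqlite3.IntegrityError:
        accepted = False
    finally:
        conn.close()
    return accepted

print("accepted without enforcement:", try_orphan_insert(enforce=False))
print("accepted with enforcement:", try_orphan_insert(enforce=True))
```

With enforcement switched on, the insert is rejected with an `IntegrityError`, which is the behaviour the normalised tables will rely on later.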







## 2.1. Create unique identifier for each row in `game_log` table

Before normalising the data, it will be useful to create a column that can uniquely identify each row in the main table `game_log`.

First, I will check if such a column already exists.

In [5]:
unique_identifiers = []

num_rows = game.shape[0]

for c in all_cols:
    game_c = game[c]
    if num_rows == game_c.unique().shape[0]:
        unique_identifiers.append(c)

print(unique_identifiers)

[]


No such column exists. Following the description at http://www.retrosheet.org/eventfile.htm, I will create an ID column called `game_id` to serve as the PRIMARY KEY. Its values will look like `ATL198304080`, which combines the following parts.

* `ATL`: Home team's abbreviation
* `19830408`: Date of game
* `0`: Number of game
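
The composition can be written as a small helper. This is only an illustration of the format described above; the function name `make_game_id` is hypothetical and is not used elsewhere in this notebook.

```python
def make_game_id(home_team, date, number_of_game):
    """Concatenate home team code, date (YYYYMMDD) and game number into one string."""
    return "{}{}{}".format(home_team, date, number_of_game)

print(make_game_id("ATL", 19830408, 0))  # ATL198304080
```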

In [6]:
# create id column
id_col = "game_id"
id_part_1 = "h_name"
id_part_2 = "date"
id_part_3 = "number_of_game"


command = """
ALTER TABLE {}
    ADD COLUMN
        {} TEXT
""".format(game_n, id_col)

run_command(command)


# insert ids into the id column
command = """
UPDATE {}
SET {} = {} || {} || {}
""".format(game_n, id_col, id_part_1, id_part_2, id_part_3)

run_command(command)

# show id column and its component columns
run_query("SELECT {}, {}, {}, {} FROM {} LIMIT 5".format(
          id_col, id_part_1, id_part_2, id_part_3, game_n),
          display_output=True);

Unnamed: 0,game_id,h_name,date,number_of_game
0,FW1187105040,FW1,18710504,0
1,WS3187105050,WS3,18710505,0
2,RC1187105060,RC1,18710506,0
3,CH1187105080,CH1,18710508,0
4,TRO187105090,TRO,18710509,0
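
Whether the new column really is unique can be checked with a `GROUP BY ... HAVING` query. The sketch below runs against a small in-memory table with invented rows (including one deliberate duplicate) rather than the project database; against `major_league.db`, the same `SELECT` could be passed to `run_query`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE game_log "
    "(h_name TEXT, date INTEGER, number_of_game INTEGER, game_id TEXT)"
)
conn.executemany(
    "INSERT INTO game_log (h_name, date, number_of_game) VALUES (?, ?, ?)",
    [
        ("FW1", 18710504, 0),
        ("WS3", 18710505, 0),
        ("WS3", 18710505, 0),  # deliberate duplicate for the demonstration
    ],
)

# same concatenation as in the UPDATE statement above
conn.execute("UPDATE game_log SET game_id = h_name || date || number_of_game")

# any game_id that occurs more than once is not a valid unique identifier
duplicates = conn.execute("""
    SELECT game_id, COUNT(*) AS n
    FROM game_log
    GROUP BY game_id
    HAVING COUNT(*) > 1
""").fetchall()
conn.close()

print(duplicates)  # [('WS3187105050', 2)]
```

An empty result on the real table would confirm that `game_id` can serve as a key.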


# 3. Plan for database normalisation

In this section, I will consider a few ways to normalise the database. There are three subsections. The first two each suggest a plan for normalisation and are independent of each other (i.e. subsection 2 does not build on subsection 1). Subsection 3 proposes a normalised schema which builds on the first two.

**<font color="red">Note</font>**: The schema proposed in this section will be discarded. The actual Database normalisation which starts in the next section will be done according to DataQuest's guide. This is due to my currently limited skills and time constraint.


## 3.1. Reduce number of columns (improve normal form into 1NF)

The types of persons appearing in `game` are player, umpire, pitcher, manager and winning [RBI](https://en.wikipedia.org/wiki/Run_batted_in) batter. They appear in separate sets of name and ID columns. The full list of such columns is shown in the following cell.

In [7]:
# types of persons
person_core = ["player", "umpire", "pitcher", "manager", "winning_rbi_batter"]

# gather names of columns for persons' ids and names
person = set()
for c in game.columns:
    for core in person_core:
        if re.search(core, c):
            if c.endswith("name") or c.endswith("id"):
                person.add(c)

# check that every collected column is an id or a name column
# and that there are as many id columns as name columns
person_id = {i for i in person if i.endswith("_id")}
person_name = {i for i in person if i.endswith("_name")}
if person == person_id.union(person_name) and len(person_id) == len(person_name):
    print("Both id and name columns exist for each person in the record.")

# display row where all "person" columns are filled
person = list(person)
person_notnull = game[person].dropna().index[0]
display(pd.DataFrame(game.loc[person_notnull][person]).sort_index())

Both id and name columns exist for each person in the record.


Unnamed: 0,152469
1b_umpire_id,cousd901
1b_umpire_name,Derryl Cousins
2b_umpire_id,cedeg901
2b_umpire_name,Gary Cederstrom
3b_umpire_id,mealj901
3b_umpire_name,Jerry Meals
h_manager_id,guilo001
h_manager_name,Ozzie Guillen
h_player_1_id,cabro001
h_player_1_name,Orlando Cabrera


These name and ID columns can be removed because they also appear in the **`person_codes`** table.

Another thing to note is that the column names themselves contain extra information. For example, `3b_umpire_id` indicates that the person is an `umpire` positioned at `third base`. These will need to be added as separate columns in `person`, as shown in the following example.

id|role|position
---|---|---
mealj901|umpire|3b
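
As a rough sketch of how the role and position could be pulled out of a column name, the snippet below uses a regular expression. The pattern and the `split_person_column` helper are assumptions made for illustration; they only cover the six umpire columns, and the other roles would need patterns of their own.

```python
import re

# assumed naming pattern: <position>_<role>_id, umpire columns only
PERSON_COL = re.compile(r"^(?P<position>hp|1b|2b|3b|lf|rf)_(?P<role>umpire)_id$")

def split_person_column(col, person_id):
    """Turn a game_log column name and its value into an id/role/position row."""
    match = PERSON_COL.match(col)
    if match is None:
        return None  # column name not covered by this sketch
    return {"id": person_id,
            "role": match.group("role"),
            "position": match.group("position")}

row = split_person_column("3b_umpire_id", "mealj901")
print(row)
```

For `3b_umpire_id` this yields the same id, role and position as the example row above.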

## 3.2. Improve normal form (into Boyce–Codd normal form (BCNF))

First, note that, as mentioned above, this does not build on the above subsection.

I will follow the example steps given at http://holowczak.com/database-normalization/12/. Following is a modified excerpt.

At each step of the process, we did the following:

1. Write out the relation
2. (optionally) Write out some example data.
3. Write out all of the functional dependencies
4. Starting with first normal form (1NF), go through each normal form and state why the relation is in the given normal form.
___

`A → B` means A determines B.

A table called "TABLE" with columns "column1" and "column2" will be written as `TABLE (column1, column2)`

Keys are in bold font.

* `park`
<br /><br />
    * Relation
    
    PARK (park_id, name, aka, city, state, start, end, league, notes)
    <br /><br />
    * Functional dependencies
    
    park_id → name, aka, city, state, start, end, league, notes
    
    city → state
    <br /><br />
    * Normal form
    
    Second normal form (2NF)
    <br /><br />
    * Possible rearrangement (into BCNF)
    <br /><br />
    PARK (**park_id**, name, aka, city, start, end, league, notes)
    <br /><br />
    CITY (**city**, state)
<br /><br />
* `person`
<br /><br />
    * Relation
    
    PERSON (id, last, first, player_debut, mgr_debut, coach_debut, ump_debut)
    <br /><br />
    * Functional dependencies
    
    id → last, first, player_debut, mgr_debut, coach_debut, ump_debut
    <br /><br />
    * Normal form
    
    BCNF
<br /><br />
* `team`
<br /><br />
    * Relation
    
    TEAM (team_id, league, start, end, city, nickname, franch_id, seq)
    <br /><br />
    * Functional dependencies
    
    team_id → league, start, end, city, nickname, franch_id, seq
    <br /><br />
    * Normal form
    
    BCNF
<br /><br />
* `game`
<br /><br />
    * Relation
    
    GAME (date, number_of_game, day_of_week, v_name, v_league, v_game_number, h_name, h_league, h_game_number, v_score, h_score, length_outs, day_night, completion, forefeit, protest, park_id, attendance, length_minutes, v_line_score, h_line_score, v_at_bats, v_hits, v_doubles, v_triples, v_homeruns, v_rbi, v_sacrifice_hits, v_sacrifice_flies, v_hit_by_pitch, v_walks, v_intentional_walks, v_strikeouts, v_stolen_bases, v_caught_stealing, v_grounded_into_double, v_first_catcher_interference, v_left_on_base, v_pitchers_used, v_individual_earned_runs, v_team_earned_runs, v_wild_pitches, v_balks, v_putouts, v_assists, v_errors, v_passed_balls, v_double_plays, v_triple_plays, h_at_bats, h_hits, h_doubles, h_triples, h_homeruns, h_rbi, h_sacrifice_hits, h_sacrifice_flies, h_hit_by_pitch, h_walks, h_intentional_walks, h_strikeouts, h_stolen_bases, h_caught_stealing, h_grounded_into_double, h_first_catcher_interference, h_left_on_base, h_pitchers_used, h_individual_earned_runs, h_team_earned_runs, h_wild_pitches, h_balks, h_putouts, h_assists, h_errors, h_passed_balls, h_double_plays, h_triple_plays, hp_umpire_id, hp_umpire_name, 1b_umpire_id, 1b_umpire_name, 2b_umpire_id, 2b_umpire_name, 3b_umpire_id, 3b_umpire_name, lf_umpire_id, lf_umpire_name, rf_umpire_id, rf_umpire_name, v_manager_id, v_manager_name, h_manager_id, h_manager_name, winning_pitcher_id, winning_pitcher_name, losing_pitcher_id, losing_pitcher_name, saving_pitcher_id, saving_pitcher_name, winning_rbi_batter_id, winning_rbi_batter_id_name, v_starting_pitcher_id, v_starting_pitcher_name, h_starting_pitcher_id, h_starting_pitcher_name, v_player_1_id, v_player_1_name, v_player_1_def_pos, v_player_2_id, v_player_2_name, v_player_2_def_pos, v_player_3_id, v_player_3_name, v_player_3_def_pos, v_player_4_id, v_player_4_name, v_player_4_def_pos, v_player_5_id, v_player_5_name, v_player_5_def_pos, v_player_6_id, v_player_6_name, v_player_6_def_pos, v_player_7_id, v_player_7_name, v_player_7_def_pos, v_player_8_id, 
v_player_8_name, v_player_8_def_pos, v_player_9_id, v_player_9_name, v_player_9_def_pos, h_player_1_id, h_player_1_name, h_player_1_def_pos, h_player_2_id, h_player_2_name, h_player_2_def_pos, h_player_3_id, h_player_3_name, h_player_3_def_pos, h_player_4_id, h_player_4_name, h_player_4_def_pos, h_player_5_id, h_player_5_name, h_player_5_def_pos, h_player_6_id, h_player_6_name, h_player_6_def_pos, h_player_7_id, h_player_7_name, h_player_7_def_pos, h_player_8_id, h_player_8_name, h_player_8_def_pos, h_player_9_id, h_player_9_name, h_player_9_def_pos, additional_info, acquisition_info)
    <br /><br />
    * Functional dependencies
    <br /><br />
```
    date → day_of_week
    
    date, number_of_game, park_id →
    v_?, h_?, day_of_week, length_outs, day_night, completion, forefeit, protest, attendance, length_minutes, hp_umpire_id, hp_umpire_name, 1b_umpire_id, 1b_umpire_name, 2b_umpire_id, 2b_umpire_name, 3b_umpire_id, 3b_umpire_name, lf_umpire_id, lf_umpire_name, rf_umpire_id, rf_umpire_name, winning_pitcher_id, winning_pitcher_name, losing_pitcher_id, losing_pitcher_name, saving_pitcher_id, saving_pitcher_name, winning_rbi_batter_id, winning_rbi_batter_id_name, additional_info, acquisition_info
```
    Note: v\_? and h\_? refer to all column names starting with v\_ or h\_.
    <br /><br />
    <span style=color:red>There seem to be more functional dependencies in this table, but I omit them due to (1) time constraint and (2) my currently limited skills.</span>
    <br /><br />
    * Normal form

    1NF (`date → day_of_week` is a partial dependency on the key, so the relation is not in 2NF)
    <br /><br />
    * Possible rearrangement (into BCNF)
    <br /><br />
    GAME (**date**, **number_of_game**, **park_id**, v\_?, h\_?, length_outs, day_night, completion, forefeit, protest, attendance, length_minutes, hp_umpire_id, hp_umpire_name, 1b_umpire_id, 1b_umpire_name, 2b_umpire_id, 2b_umpire_name, 3b_umpire_id, 3b_umpire_name, lf_umpire_id, lf_umpire_name, rf_umpire_id, rf_umpire_name, winning_pitcher_id, winning_pitcher_name, losing_pitcher_id, losing_pitcher_name, saving_pitcher_id, saving_pitcher_name, winning_rbi_batter_id, winning_rbi_batter_id_name, additional_info, acquisition_info)
    <br /><br />
    DATE (**date**, day_of_week)
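
The functional dependencies listed above were written down by inspection. A minimal sketch of how one could test a candidate dependency against the data with pandas is shown below. The `holds` helper is hypothetical, and the small data frame only reuses a few `park` rows for illustration. Passing the test on a sample does not prove the dependency in general, and missing values are ignored by this simple version.

```python
import pandas as pd

def holds(df, lhs, rhs):
    """True if each combination of `lhs` values maps to at most one `rhs` value."""
    return bool((df.groupby(lhs)[rhs].nunique() <= 1).all())

# a few park rows, used only as toy data
park = pd.DataFrame({
    "park_id": ["ALB01", "ARL01", "ARL02", "ATL01"],
    "city": ["Albany", "Arlington", "Arlington", "Atlanta"],
    "state": ["NY", "TX", "TX", "GA"],
})

city_determines_state = holds(park, ["city"], "state")
city_determines_park = holds(park, ["city"], "park_id")
print(city_determines_state, city_determines_park)
```

Here `city → state` holds on the sample, while `city → park_id` does not because Arlington has two parks.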
    




## 3.3. Normalized Schema

In this section, I've built on the plans suggested in the previous two subsections. The suggested schema is shown below.

![major_league_normalised_schema](major_league_normalised_schema.png)

Note that, in this schema, **`game_log`**'s `game_id` column is an integer starting from 1, whereas it was defined differently in [section 2.1](#2.1.-Create-unique-identifier-for-each-row-in-game_log-table).

# 4. Execute database normalisation

As a reminder, database normalisation will be carried out using DataQuest's schema instead of the one in the above section.

## 4.1. Creating Tables Without Foreign Key Relations

Following is the schema suggested by DataQuest which the actual database normalisation will be based on.

![mlb_schema](mlb_schema.svg?sanitize=true)

Tables will be linked using foreign keys. As the tables providing foreign keys should be created first, the order of creation will be as follows.

|Table|Creation priority|Description|Tables to draw key from|
|---|---|---|---|
|`person`|0|Each person's names||
|`park`|0|Info on each park||
|`league`|0|League name||
|`appearance_type`|0|Type of role a person can play in a game||
|`game`|1|Info on each game|`park`|
|`team`|1|Info on each team|`league`|
|`team_appearance`|2|Game-specific record on each team|`team`, `league` and `game`|
|`person_appearance`|2|Game-specific record on each person|`person`, `team`, `game` and `appearance_type`|
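
The creation order in the table above can also be derived mechanically from the "Tables to draw key from" column. The sketch below does this with `graphlib.TopologicalSorter` from the standard library (Python 3.9 or later). The `depends_on` dictionary is simply my transcription of the table and is not read from the database.

```python
from graphlib import TopologicalSorter

# table -> tables it draws foreign keys from (transcribed from the table above)
depends_on = {
    "person": set(),
    "park": set(),
    "league": set(),
    "appearance_type": set(),
    "game": {"park"},
    "team": {"league"},
    "team_appearance": {"team", "league", "game"},
    "person_appearance": {"person", "team", "game", "appearance_type"},
}

# static_order() yields every table after all the tables it depends on
creation_order = list(TopologicalSorter(depends_on).static_order())
print(creation_order)
```

Any order produced this way places the priority 0 tables before `game` and `team`, and those before the two appearance tables.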


I will first create four tables without foreign keys so that other tables can be properly linked to these. Below are DataQuest's notes for these tables.
___
*   `person`
    *   Each of the 'debut' columns have been omitted, as the data will be able to be found from other tables.
    *   Since the game log file has no data on coaches, we made the decision to not include this data.
*   `park`
    *   The start, end, and league columns contain data that is found in the main game log and can be removed.
*   `league`
    *   Because some of the older leagues are not well known, we will create a table to store league names.
*   `appearance_type`
    *   Our appearance table will include data on players with positions, umpires, managers, and awards (like winning pitcher). This table will store information on what different types of appearances are available.
___

The first six rows will be displayed in the following cell.

In [8]:
# 1. person
# create table
table = "person"

command = """
CREATE TABLE IF NOT EXISTS {}
    (person_id TEXT PRIMARY KEY,
    first_name TEXT,
    last_name TEXT)
""".format(table)
run_command(command)

# fill table (column order must match the table: id, first name, last name)
command = """
INSERT OR IGNORE INTO {}
    SELECT id, first, last
    FROM person_codes
""".format(table)
run_command(command)

# display table
print(table)
query = "SELECT * FROM {} LIMIT 6".format(table)
display(run_query(query))
print("\n\n")



# 2. park
# create table
table = "park"

command = """
CREATE TABLE IF NOT EXISTS {}
    (park_id TEXT PRIMARY KEY,
    name TEXT,
    nickname TEXT,
    city TEXT,
    state TEXT,
    notes TEXT)
""".format(table)

run_command(command)

# fill table
command = """
INSERT OR IGNORE INTO {}
    SELECT park_id, name, aka, city, state, notes
    FROM park_codes
""".format(table)

run_command(command)

# display table
print(table)
query = "SELECT * FROM {} LIMIT 6".format(table)
display(run_query(query))
print("\n\n")




# 3. league
# create table
table = "league"

command = """
CREATE TABLE IF NOT EXISTS {}
    (league_id TEXT PRIMARY KEY,
    name TEXT)
""".format(table)

run_command(command)

# fill table
params = [('UA', 'Union Association'),
          ('NL', 'National League'),
          ('PL', "Players' League"),
          ('AA', 'American Association'),
          ('AL', 'American League'),
          ('FL', 'Federal League')]
command = """
INSERT OR IGNORE INTO {}
    VALUES (?, ?)
""".format(table)

run_command_2(command, params=params, executemany=True)

# display table
print(table)
query = "SELECT * FROM {} LIMIT 6".format(table)
display(run_query(query))
print("\n\n")




# 4. appearance_type
table = "appearance_type"
# import appearance type data
app_type = pd.read_csv(table + ".csv")

# convert appearance type data frame to SQLite table
with sqlite3.connect(db) as conn:
    app_type.to_sql(table, conn, index=False, if_exists="replace")

# set appearance_type_id as PRIMARY KEY
commands = ["BEGIN TRANSACTION",
            """
            ALTER TABLE {}
                RENAME TO {}_old;
            """.format(table, table),
            """
            CREATE TABLE IF NOT EXISTS {} (
                appearance_type_id TEXT PRIMARY KEY,
                name TEXT,
                category TEXT
            );
            """.format(table),
            """
            INSERT OR IGNORE INTO {}
                SELECT * FROM {}_old;
            """.format(table, table),
            "DROP TABLE {}_old".format(table),
            "COMMIT"]

run_command_2(commands)

# display table
print(table)

query = "SELECT * FROM {} LIMIT 6".format(table)
display(run_query(query))

person


Unnamed: 0,person_id,first_name,last_name
0,aardd001,David,Aardsma
1,aaroh101,Hank,Aaron
2,aarot101,Tommie,Aaron
3,aased001,Don,Aase
4,abada001,Andy,Abad
5,abadf001,Fernando,Abad





park


Unnamed: 0,park_id,name,nickname,city,state,notes
0,ALB01,Riverside Park,,Albany,NY,TRN:9/11/80;6/15&9/10/1881;5/16-5/18&5/30/1882
1,ALT01,Columbia Park,,Altoona,PA,
2,ANA01,Angel Stadium of Anaheim,Edison Field; Anaheim Stadium,Anaheim,CA,
3,ARL01,Arlington Stadium,,Arlington,TX,
4,ARL02,Rangers Ballpark in Arlington,The Ballpark in Arlington; Ameriquest Fl,Arlington,TX,
5,ATL01,Atlanta-Fulton County Stadium,,Atlanta,GA,





league


Unnamed: 0,league_id,name
0,UA,Union Association
1,NL,National League
2,PL,Players' League
3,AA,American Association
4,AL,American League
5,FL,Federal League





appearance_type


Unnamed: 0,appearance_type_id,name,category
0,O1,Batter 1,offense
1,O2,Batter 2,offense
2,O3,Batter 3,offense
3,O4,Batter 4,offense
4,O5,Batter 5,offense
5,O6,Batter 6,offense


## 4.2. Adding The Team and Game Tables

In this section, I will create `team` and `game` tables which are linked to `league` and `park` tables created in the previous section.

Here's a schema created by DataQuest.

![mlb_schema_2](mlb_schema_2.svg?sanitize=true)

Below are DataQuest's notes on each table.
___
*   `team`
    *   The start, end, and sequence columns can be derived from the game level data.
*   `game`
    *   We have chosen to include all columns for the game log that don't refer to one specific team or player, instead putting those in two appearance tables.
    *   We have removed the column with the day of the week, as this can be derived from the date.
    *   We have changed the `day_night` column to `day`, with the intention of making this a boolean column. Even though SQLite doesn't support the `BOOLEAN` type, we can use this when creating our table and SQLite will manage the underlying types behind the scenes (for more on how this works [refer to the SQLite documentation](https://www.sqlite.org/datatype3.html)). This means that anyone querying the schema of our database in the future understands how that column is intended to be used.
___
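
To illustrate the note on boolean columns, here is a small sketch run against a throwaway in-memory database. The `demo` table is hypothetical and exists only for this example. A column declared as `BOOLEAN` receives NUMERIC affinity in SQLite, which is the same affinity as the `day NUMERIC` declaration used in the cell below.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# "BOOLEAN" is accepted as a declared type and mapped to NUMERIC affinity
conn.execute("CREATE TABLE demo (day BOOLEAN)")
conn.executemany("INSERT INTO demo VALUES (?)", [(1,), (0,), ("1",), (None,)])

# typeof() reports the storage class actually used for each value
stored = conn.execute("SELECT day, typeof(day) FROM demo ORDER BY rowid").fetchall()
print(stored)
```

The text value `"1"` is converted to the integer `1` on insertion, so the column behaves like a 0/1 flag even though no dedicated boolean storage class exists.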

Again, the first six rows of new tables will be displayed.

In [9]:
# 1. game
# create table
table = "game"

command = """
CREATE TABLE IF NOT EXISTS {} (
    game_id TEXT PRIMARY KEY,
    date INTEGER,
    number_of_game INTEGER,
    park_id TEXT,
    length_outs REAL,
    day NUMERIC,
    completion TEXT,
    forefeit TEXT,
    protest TEXT,
    attendance REAL,
    length_minutes REAL,
    additional_info TEXT,
    acquisition_info TEXT,
    FOREIGN KEY (park_id) REFERENCES park(park_id)
);
""".format(table)

run_command(command)

# fill table
command = """
INSERT OR IGNORE INTO {}
    SELECT
        game_id,
        date,
        number_of_game,
        park_id,
        length_outs,
        CASE
            WHEN day_night = 'D' THEN 1
            WHEN day_night = 'N' THEN 0
            END day,
        completion,
        forefeit,
        protest,
        attendance,
        length_minutes,
        additional_info,
        acquisition_info
    FROM
        {}_log;
""".format(table, table)

run_command(command)

# display table
print(table)
query = "SELECT * FROM {} LIMIT 6".format(table)
display(run_query(query))
print("\n\n")



# 2. team
# create table
table = "team"

command = """
CREATE TABLE IF NOT EXISTS {} (
    team_id TEXT PRIMARY KEY,
    league_id TEXT,
    city TEXT,
    nickname TEXT,
    franch_id TEXT,
    FOREIGN KEY (league_id) REFERENCES league(league_id)
);
""".format(table)

run_command(command)

# fill table
command = """
INSERT OR IGNORE INTO {}
    SELECT
        team_id,
        league,
        city,
        nickname,
        franch_id
    FROM
        {}_codes;
""".format(table, table)

run_command(command)

# display table
print(table)
query = "SELECT * FROM {} LIMIT 6".format(table)
display(run_query(query))
print("\n\n")

game


Unnamed: 0,game_id,date,number_of_game,park_id,length_outs,day,completion,forefeit,protest,attendance,length_minutes,additional_info,acquisition_info
0,FW1187105040,18710504,0,FOR01,54.0,1,,,,200.0,120.0,,Y
1,WS3187105050,18710505,0,WAS01,54.0,1,,,,5000.0,145.0,HTBF,Y
2,RC1187105060,18710506,0,RCK01,54.0,1,,,,1000.0,140.0,,Y
3,CH1187105080,18710508,0,CHI01,54.0,1,,,,5000.0,150.0,,Y
4,TRO187105090,18710509,0,TRO01,54.0,1,,,,3250.0,145.0,HTBF,Y
5,CL1187105110,18710511,0,CLE01,48.0,1,,V,,2500.0,120.0,,Y





team


Unnamed: 0,team_id,league_id,city,nickname,franch_id
0,ALT,UA,Altoona,Mountain Cities,ALT
1,ARI,NL,Arizona,Diamondbacks,ARI
2,BFN,NL,Buffalo,Bisons,BFN
3,BFP,PL,Buffalo,Bisons,BFP
4,BL1,,Baltimore,Canaries,BL1
5,BL2,AA,Baltimore,Orioles,BL2
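
One caveat about the `FOREIGN KEY` clauses used above: SQLite only enforces them on connections where `PRAGMA foreign_keys = ON` has been issued. Whether the `run_command` helper in this notebook does so is not shown here, so the following is a stand-alone sketch on an in-memory database with cut-down versions of `league` and `team`. The team `XXX` and league `ZZ` are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default

conn.execute("CREATE TABLE league (league_id TEXT PRIMARY KEY, name TEXT)")
conn.execute("""
CREATE TABLE team (
    team_id TEXT PRIMARY KEY,
    league_id TEXT,
    FOREIGN KEY (league_id) REFERENCES league(league_id)
)
""")

conn.execute("INSERT INTO league VALUES ('NL', 'National League')")
conn.execute("INSERT INTO team VALUES ('ARI', 'NL')")  # parent row exists
conn.execute("INSERT INTO team VALUES ('BL1', NULL)")  # NULL keys are allowed

try:
    conn.execute("INSERT INTO team VALUES ('XXX', 'ZZ')")  # no such league
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)
```

A team without a league, such as `BL1` in the output above, is still accepted because a NULL foreign key is not checked against the parent table.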







## 4.3. Adding the Team Appearance Table

Now, the `team_appearance` table will be created and filled. Below is DataQuest's schema of the table, together with the other tables from which it draws foreign keys.

![mlb_schema_3](mlb_schema_3.svg?sanitize=true)

**<span style=color:red><font size=5>Note: The SQL query used here is not my own, but DataQuest's.</font></span>**

In [10]:
# create table
table = "team_appearance"

command = """
CREATE TABLE IF NOT EXISTS {} (
    team_id TEXT,
    game_id TEXT,
    home NUMERIC,
    league_id TEXT,
    score INTEGER,
    line_score TEXT,
    at_bats REAL,
    hits REAL,
    doubles REAL,
    triples REAL,
    homeruns REAL,
    rbi REAL,
    sacrifice_hits REAL,
    sacrifice_flies REAL,
    hit_by_pitch REAL,
    walks REAL,
    intentional_walks REAL,
    strikeouts REAL,
    stolen_bases REAL,
    caught_stealing REAL,
    grounded_into_double REAL,
    first_catcher_interference REAL,
    left_on_base REAL,
    pitchers_used REAL,
    individual_earned_runs REAL,
    team_earned_runs REAL,
    wild_pitches REAL,
    balks REAL,
    putouts REAL,
    assists REAL,
    errors REAL,
    passed_balls REAL,
    double_plays REAL,
    triple_plays REAL,
    PRIMARY KEY (game_id, team_id),
    FOREIGN KEY (team_id) REFERENCES team(team_id),
    FOREIGN KEY (game_id) REFERENCES game(game_id),
    FOREIGN KEY (league_id) REFERENCES league(league_id)
);
""".format(table)

run_command(command)

# fill table
v_record = """
v_name,
game_id,
0 AS home,
v_league,
v_score,
v_line_score,
v_at_bats,
v_hits,
v_doubles,
v_triples,
v_homeruns,
v_rbi,
v_sacrifice_hits,
v_sacrifice_flies,
v_hit_by_pitch,
v_walks,
v_intentional_walks,
v_strikeouts,
v_stolen_bases,
v_caught_stealing,
v_grounded_into_double,
v_first_catcher_interference,
v_left_on_base,
v_pitchers_used,
v_individual_earned_runs,
v_team_earned_runs,
v_wild_pitches,
v_balks,
v_putouts,
v_assists,
v_errors,
v_passed_balls,
v_double_plays,
v_triple_plays
"""

h_record = v_record.replace("v_", "h_")\
                   .replace("0", "1")

command = """
INSERT OR IGNORE INTO {}
    SELECT
        {}
    FROM
        game_log
UNION
    SELECT
        {}
    FROM
        game_log
""".format(table, v_record, h_record)

run_command(command)

# display table
print(table)
query = "SELECT * FROM {} LIMIT 6".format(table)
display(run_query(query))
print("\n\n")

team_appearance


Unnamed: 0,team_id,game_id,home,league_id,score,line_score,at_bats,hits,doubles,triples,homeruns,rbi,sacrifice_hits,sacrifice_flies,hit_by_pitch,walks,intentional_walks,strikeouts,stolen_bases,caught_stealing,grounded_into_double,first_catcher_interference,left_on_base,pitchers_used,individual_earned_runs,team_earned_runs,wild_pitches,balks,putouts,assists,errors,passed_balls,double_plays,triple_plays
0,ALT,ALT188404300,1,UA,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,ALT,ALT188405020,1,UA,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,ALT,ALT188405030,1,UA,5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,ALT,ALT188405050,1,UA,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,ALT,ALT188405100,1,UA,9,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
5,ALT,ALT188405120,1,UA,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
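
Because every game should contribute exactly one home row and one visiting row, a quick sanity check on `team_appearance` is possible. The sketch below uses a cut-down, in-memory version of the table with toy rows. The visiting score is made up, and one visiting row is deliberately left out so that the check has something to report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE team_appearance (
    team_id TEXT,
    game_id TEXT,
    home NUMERIC,
    score INTEGER,
    PRIMARY KEY (game_id, team_id)
)
""")

rows = [
    ("ALT", "ALT188404300", 1, 2),
    ("SLU", "ALT188404300", 0, 15),  # visiting score is made up
    ("ALT", "ALT188405020", 1, 3),   # visiting row deliberately left out
]
conn.executemany("INSERT INTO team_appearance VALUES (?, ?, ?, ?)", rows)

# games that do not have exactly one home row and one visiting row
incomplete = conn.execute("""
SELECT game_id
FROM team_appearance
GROUP BY game_id
HAVING COUNT(*) != 2 OR SUM(home) != 1
""").fetchall()

print(incomplete)
```

On the real table, an empty result would indicate that the `UNION` of visiting and home records produced a complete pair for every game.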







## 4.4. Adding the Person Appearance Table

Below is the schema for the last table to create, `person_appearance`. It will receive foreign keys from `person`, `team`, `game` and `appearance_type` tables. As DataQuest suggests, an integer PRIMARY KEY will be created to make queries simpler.

![mlb_schema_4](mlb_schema_4.svg?sanitize=true)

**<span style=color:red><font size=5>Note: The SQL query and most of the Python code used here are not my own, but DataQuest's.</font></span>**

In [11]:
# create table
table = "person_appearance"

command = """
CREATE TABLE IF NOT EXISTS {} (
    appearance_id INTEGER PRIMARY KEY,
    person_id TEXT,
    team_id TEXT,
    game_id TEXT,
    appearance_type_id TEXT,
    FOREIGN KEY (person_id) REFERENCES person(person_id),
    FOREIGN KEY (team_id) REFERENCES team(team_id),
    FOREIGN KEY (game_id) REFERENCES game(game_id),
    FOREIGN KEY (appearance_type_id) REFERENCES appearance_type(appearance_type_id)
);
""".format(table)

run_command(command)

# fill table

# 1. command start
command_start = """
INSERT INTO {} (
    game_id,
    team_id,
    person_id,
    appearance_type_id
)
""".format(table)

# 2. umpires

ump_part = """
    SELECT
        game_id,
        NULL,
        {game_log_col},
        "{pa_col}"
    FROM game_log
    WHERE {game_log_col} IS NOT NULL

UNION
"""

ump = ""

ump_dic = {"UHP": "hp_umpire_id",
            "U1B": "[1b_umpire_id]",
            "U2B": "[2b_umpire_id]",
            "U3B": "[3b_umpire_id]",
            "ULF": "lf_umpire_id",
            "URF": "rf_umpire_id"}

for pa_col in ["UHP", "U1B", "U2B", "U3B", "ULF", "URF"]:
    
    dic = {"game_log_col": ump_dic[pa_col],
          "pa_col": pa_col}

    ump += ump_part.format(**dic)
    



manager = """
    SELECT
        game_id,
        v_name,
        v_manager_id,
        "MM"
    FROM game_log
    WHERE v_manager_id IS NOT NULL
UNION
    SELECT
        game_id,
        h_name,
        h_manager_id,
        "MM"
    FROM game_log
    WHERE h_manager_id IS NOT NULL

UNION
"""

pitcher_wls = """
    SELECT
        game_id,
        CASE
            WHEN h_score > v_score THEN h_name
            ELSE v_name
            END,
        winning_pitcher_id,
        "AWP"
    FROM game_log
    WHERE winning_pitcher_id IS NOT NULL
UNION
    SELECT
        game_id,
        CASE
            WHEN h_score < v_score THEN h_name
            ELSE v_name
            END,
        losing_pitcher_id,
        "ALP"
    FROM game_log
    WHERE losing_pitcher_id IS NOT NULL
UNION
    SELECT
        game_id,
        CASE
            WHEN h_score > v_score THEN h_name
            ELSE v_name
            END,
        saving_pitcher_id,
        "ASP"
    FROM game_log
    WHERE saving_pitcher_id IS NOT NULL
UNION
"""

batter_w = """
    SELECT
        game_id,
        CASE
            WHEN h_score > v_score THEN h_name
            ELSE v_name
            END,
        winning_rbi_batter_id,
        "AWB"
    FROM game_log
    WHERE winning_rbi_batter_id IS NOT NULL
UNION
"""

pitcher_s = """
    SELECT
        game_id,
        v_name,
        v_starting_pitcher_id,
        "PSP"
    FROM game_log
    WHERE v_starting_pitcher_id IS NOT NULL
UNION
    SELECT
        game_id,
        h_name,
        h_starting_pitcher_id,
        "PSP"
    FROM game_log
    WHERE h_starting_pitcher_id IS NOT NULL
UNION
"""






player = """
    SELECT
        game_id,
        {hv}_name,
        {hv}_player_{num}_id,
        "O{num}"
    FROM game_log
    WHERE {hv}_player_{num}_id IS NOT NULL
UNION
    SELECT
        game_id,
        {hv}_name,
        {hv}_player_{num}_id,
        "D" || CAST({hv}_player_{num}_def_pos AS INT)
    FROM game_log
    WHERE {hv}_player_{num}_id IS NOT NULL
UNION
"""

command = command_start + ump + manager + pitcher_wls + batter_w + pitcher_s

for hv in ["h","v"]:
    for num in range(1,10):
        query_vars = {
            "hv": hv,
            "num": num
        }
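        # append the two SELECTs (batting order and defensive position)
        # for this team and lineup slot to the command
        command = command + player.format(**query_vars)

# drop the trailing UNION (str.rstrip("UNION\n") would strip a set of
# characters rather than the suffix, so the suffix is removed explicitly)
command = command.rstrip()
if command.endswith("UNION"):
    command = command[:-len("UNION")]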

run_command(command)

# display table
print(table)
query = "SELECT * FROM {} LIMIT 6".format(table)
display(run_query(query))
print("\n\n")

person_appearance


Unnamed: 0,appearance_id,person_id,team_id,game_id,appearance_type_id
0,1,maplb901,,ALT188404300,UHP
1,2,curte801,ALT,ALT188404300,MM
2,3,murpj104,ALT,ALT188404300,PSP
3,4,hodnc101,SLU,ALT188404300,PSP
4,5,sullt101,SLU,ALT188404300,MM
5,6,hoopm101,,ALT188405020,UHP
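
When reading `person_appearance` back, note that umpires have no `team_id`, as the first row above shows. A plain `JOIN` on `team` would therefore silently drop them. The sketch below illustrates the difference on an in-memory copy of three of the rows shown above, with a cut-down `team` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE team (team_id TEXT PRIMARY KEY, nickname TEXT)")
conn.execute("""
CREATE TABLE person_appearance (
    appearance_id INTEGER PRIMARY KEY,
    person_id TEXT,
    team_id TEXT,
    game_id TEXT,
    appearance_type_id TEXT
)
""")
conn.execute("INSERT INTO team VALUES ('ALT', 'Mountain Cities')")
conn.executemany(
    "INSERT INTO person_appearance VALUES (?, ?, ?, ?, ?)",
    [(1, "maplb901", None, "ALT188404300", "UHP"),
     (2, "curte801", "ALT", "ALT188404300", "MM"),
     (3, "murpj104", "ALT", "ALT188404300", "PSP")],
)

query = """
SELECT pa.person_id, pa.appearance_type_id, t.nickname
FROM person_appearance pa
{join} team t ON t.team_id = pa.team_id
WHERE pa.game_id = ?
ORDER BY pa.appearance_id
"""
with_umpires = conn.execute(query.format(join="LEFT JOIN"), ("ALT188404300",)).fetchall()
without_umpires = conn.execute(query.format(join="JOIN"), ("ALT188404300",)).fetchall()

print(len(with_umpires), len(without_umpires))
```

The `LEFT JOIN` keeps the umpire row with an empty nickname, whereas the inner join returns only the two rows that belong to a team.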







## 4.5. Removing the Original Tables

Finally, the original tables will be removed and the remaining tables listed.

In [12]:
# remove original tables
for table in csv_list:
    query = "DROP TABLE IF EXISTS {}".format(table)
    run_command(query)

# list remaining tables
query = "SELECT name FROM sqlite_master WHERE type == 'table'"
display(run_query(query))

Unnamed: 0,name
0,person
1,park
2,league
3,appearance_type
4,game
5,team
6,team_appearance
7,person_appearance


# 5. Tasks for later

Following are further tasks suggested by DataQuest.

*   Transform the dates into a [SQLite compatible format](https://www.sqlite.org/lang_datefunc.html).
*   Extract the line scores into innings level data in a new table.
*   Create views to make querying stats easier, e.g.:
    *   Season level stats.
    *   All time records.
*   Supplement the database using new data, for instance:
    *   Add data from retrosheet game logs for years after 2016.
    *   Source and add missing pitcher information.
    *   Add player level per-game stats.
    *   Source and include base coach data.
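
As a starting point for the first task, the integer dates such as `18710504` can be rewritten as ISO 8601 text (`YYYY-MM-DD`), which SQLite's date functions understand. The following is only a sketch on an in-memory table with a single row. It does not modify the database built above, and the `iso_expr` name is my own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE game (game_id TEXT PRIMARY KEY, date INTEGER)")
conn.execute("INSERT INTO game VALUES ('FW1187105040', 18710504)")

# slice the 8-digit date into year, month and day and join them with dashes
iso_expr = "substr(date, 1, 4) || '-' || substr(date, 5, 2) || '-' || substr(date, 7, 2)"

row = conn.execute(
    "SELECT {0} AS iso_date, strftime('%Y', {0}) AS year FROM game".format(iso_expr)
).fetchone()

print(row)
```

Once the dates are stored in this form, functions such as `strftime` can be used directly, for example to group games by season.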