## Designing and Creating a Database

In this project, we will:

- Import data into SQLite
- Design a normalized database schema
- Create tables for our schema
- Insert data into our schema

We will be working with a file of [Major League Baseball] games from [Retrosheet]. Retrosheet compiles detailed statistics on baseball games from the 1800s through to today. The main file we will be working from, `game_log.csv`, was produced by combining 127 separate CSV files from Retrosheet and has been pre-cleaned to remove some inconsistencies. The game log has hundreds of data points on each game, which we will normalize into several separate tables using SQL, providing a robust database of game-level statistics.

In addition to the main file, we have also included three 'helper' files, also sourced from Retrosheet:

- `park_codes.csv`
- `person_codes.csv`
- `team_codes.csv`

These helper files sometimes contain extra data, and they make things easier because they will form the basis for three of our normalized tables.

An important first step when working with any new data is to perform exploratory data analysis (EDA). EDA gets us familiar with the data and gives us background knowledge that will help us throughout the project. The methods you use when performing EDA depend on what you plan to do with the data. In our case, we want to create a normalized database, so our focus should be:

- Becoming familiar, at a high level, with the meaning of each column in each file.
- Thinking about the relationships between columns within each file.
- Thinking about the relationships between columns across different files.

Retrosheet also provides a `game_log_fields.txt` file that explains the fields in our main file, which will assist our EDA.


[//]: # (Reference Links)

[Major League Baseball]: https://en.wikipedia.org/wiki/Major_League_Baseball
[Retrosheet]: http://www.retrosheet.org/

## Getting to Know the Data

In [1]:
# import libraries
import pandas as pd
import sqlite3
import csv

In [2]:
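# set pandas options to ensure the DataFrame output isn't truncated
# (full option names avoid ambiguous-pattern errors in newer pandas versions)

pd.set_option('display.max_columns', 180)
pd.set_option('display.max_rows', 200000)
pd.set_option('display.max_colwidth', 5000)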

In [3]:
# read files

game_log = pd.read_csv('game_log.csv', low_memory=False)
park = pd.read_csv('park_codes.csv')
person = pd.read_csv('person_codes.csv')
team = pd.read_csv('team_codes.csv')

In [4]:
dfs = {
    'game_log': game_log,
    'park': park,
    'person': person,
    'team': team,
}

In [5]:
for name, df in dfs.items():
    print(name)
    print(f"Size is {df.shape}")
    print('Columns:', df.columns)
    print()

game_log
Size is (171907, 161)
Columns: Index(['date', 'number_of_game', 'day_of_week', 'v_name', 'v_league',
       'v_game_number', 'h_name', 'h_league', 'h_game_number', 'v_score',
       ...
       'h_player_7_name', 'h_player_7_def_pos', 'h_player_8_id',
       'h_player_8_name', 'h_player_8_def_pos', 'h_player_9_id',
       'h_player_9_name', 'h_player_9_def_pos', 'additional_info',
       'acquisition_info'],
      dtype='object', length=161)

park
Size is (252, 9)
Columns: Index(['park_id', 'name', 'aka', 'city', 'state', 'start', 'end', 'league',
       'notes'],
      dtype='object')

person
Size is (20494, 7)
Columns: Index(['id', 'last', 'first', 'player_debut', 'mgr_debut', 'coach_debut',
       'ump_debut'],
      dtype='object')

team
Size is (150, 8)
Columns: Index(['team_id', 'league', 'start', 'end', 'city', 'nickname', 'franch_id',
       'seq'],
      dtype='object')



### Game Log

In [6]:
game_log.shape

(171907, 161)

In [7]:
game_log.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171907 entries, 0 to 171906
Columns: 161 entries, date to acquisition_info
dtypes: float64(77), int64(6), object(78)
memory usage: 211.2+ MB


In [8]:
game_log.describe()

Unnamed: 0,date,number_of_game,v_game_number,h_game_number,v_score,h_score,length_outs,attendance,length_minutes,v_at_bats,v_hits,v_doubles,v_triples,v_homeruns,v_rbi,v_sacrifice_hits,v_sacrifice_flies,v_hit_by_pitch,v_walks,v_intentional_walks,v_strikeouts,v_stolen_bases,v_caught_stealing,v_grounded_into_double,v_first_catcher_interference,v_left_on_base,v_pitchers_used,v_individual_earned_runs,v_team_earned_runs,v_wild_pitches,v_balks,v_putouts,v_assists,v_errors,v_passed_balls,v_double_plays,v_triple_plays,h_at_bats,h_hits,h_doubles,h_triples,h_homeruns,h_rbi,h_sacrifice_hits,h_sacrifice_flies,h_hit_by_pitch,h_walks,h_intentional_walks,h_strikeouts,h_stolen_bases,h_caught_stealing,h_grounded_into_double,h_first_catcher_interference,h_left_on_base,h_pitchers_used,h_individual_earned_runs,h_team_earned_runs,h_wild_pitches,h_balks,h_putouts,h_assists,h_errors,h_passed_balls,h_double_plays,h_triple_plays,v_player_1_def_pos,v_player_2_def_pos,v_player_3_def_pos,v_player_4_def_pos,v_player_5_def_pos,v_player_6_def_pos,v_player_7_def_pos,v_player_8_def_pos,v_player_9_def_pos,h_player_1_def_pos,h_player_2_def_pos,h_player_3_def_pos,h_player_4_def_pos,h_player_5_def_pos,h_player_6_def_pos,h_player_7_def_pos,h_player_8_def_pos,h_player_9_def_pos
count,171907.0,171907.0,171907.0,171907.0,171907.0,171907.0,140841.0,118877.0,136701.0,140838.0,140838.0,140772.0,140835.0,140800.0,139488.0,140838.0,135885.0,140838.0,140728.0,113541.0,140746.0,140775.0,127317.0,131222.0,134357.0,140838.0,140838.0,138135.0,140668.0,140624.0,140838.0,140829.0,140837.0,140807.0,140801.0,140825.0,140838.0,140838.0,140838.0,140772.0,140835.0,140800.0,139515.0,140838.0,135885.0,140838.0,140730.0,113541.0,140746.0,140775.0,127317.0,131222.0,134357.0,140838.0,140838.0,138117.0,140683.0,140618.0,140838.0,140829.0,140837.0,140808.0,140800.0,140824.0,140838.0,140838.0,140838.0,140838.0,140838.0,140838.0,140838.0,140838.0,140838.0,140835.0,140838.0,140838.0,140838.0,140838.0,140838.0,140838.0,140838.0,140838.0,140838.0
mean,19534620.0,0.260897,76.929887,76.953806,4.420582,4.701461,53.619976,20184.247188,150.903329,34.914398,8.999318,1.563592,0.276039,0.729119,4.041366,0.558741,0.169746,0.236428,3.153409,0.246757,5.290381,0.583342,0.299002,0.602529,0.00352,7.106697,2.64352,3.978463,3.266223,0.254921,0.033876,26.061699,10.768584,0.955649,0.084453,0.900628,0.001349,33.364049,8.915875,1.559742,0.311173,0.737699,4.19704,0.574398,0.173735,0.241405,3.294969,0.288284,4.776569,0.591838,0.268252,0.564745,0.003416,7.052791,2.691518,3.799489,3.133243,0.245829,0.032853,27.553593,11.326377,0.98635,0.082294,0.951088,0.001669,6.462304,5.929309,6.407511,5.90864,5.795481,5.505645,4.963391,4.0848,1.894778,6.46256,5.914767,6.43577,5.922982,5.786144,5.497628,4.956184,4.080078,1.895873
std,414932.6,0.605667,45.178029,45.162564,3.278489,3.355605,5.571512,14257.381902,34.74816,4.633761,3.599728,1.34674,0.560388,0.959742,3.021488,0.862961,0.437917,0.508436,2.128045,0.544284,3.01227,0.934292,0.562095,0.987551,0.060226,2.663555,1.419432,2.887496,3.051028,0.537386,0.193502,3.108039,3.51741,1.178337,0.319276,0.922813,0.036705,4.549779,3.401375,1.327939,0.596195,0.955998,2.970659,0.861761,0.440446,0.511111,2.161982,0.601741,2.878336,0.954736,0.530002,0.958857,0.058984,2.680605,1.488688,2.922254,3.051355,0.526316,0.190425,2.633844,3.629679,1.200212,0.306645,0.947029,0.040814,1.808287,1.943843,2.305505,2.531669,2.529486,2.448485,2.269825,2.166868,1.939072,1.813946,1.9461,2.306364,2.529624,2.521525,2.446669,2.264262,2.167098,1.936534
min,18710500.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,24.0,12.0,0.0,0.0,-1.0,-1.0,0.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,-1.0,1.0,0.0,-1.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,0.0,12.0,0.0,0.0,-1.0,-1.0,-1.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,-1.0,1.0,0.0,-1.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,19180520.0,0.0,38.0,38.0,2.0,2.0,51.0,7962.0,125.0,32.0,6.0,1.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,0.0,3.0,0.0,0.0,0.0,0.0,5.0,1.0,2.0,0.0,0.0,0.0,24.0,8.0,0.0,0.0,0.0,0.0,31.0,7.0,1.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,0.0,3.0,0.0,0.0,0.0,0.0,5.0,1.0,2.0,0.0,0.0,0.0,27.0,9.0,0.0,0.0,0.0,0.0,5.0,4.0,4.0,3.0,3.0,3.0,3.0,2.0,1.0,5.0,4.0,4.0,3.0,3.0,3.0,3.0,2.0,1.0
50%,19530530.0,0.0,76.0,76.0,4.0,4.0,54.0,18639.0,150.0,34.0,9.0,1.0,0.0,0.0,3.0,0.0,0.0,0.0,3.0,0.0,5.0,0.0,0.0,0.0,0.0,7.0,2.0,3.0,3.0,0.0,0.0,27.0,11.0,1.0,0.0,1.0,0.0,33.0,9.0,1.0,0.0,0.0,4.0,0.0,0.0,0.0,3.0,0.0,4.0,0.0,0.0,0.0,0.0,7.0,2.0,3.0,3.0,0.0,0.0,27.0,11.0,1.0,0.0,1.0,0.0,7.0,6.0,7.0,6.0,6.0,5.0,5.0,4.0,1.0,7.0,6.0,7.0,7.0,6.0,5.0,5.0,4.0,1.0
75%,19890510.0,0.0,115.0,115.0,6.0,6.0,54.0,31242.0,173.0,37.0,11.0,2.0,0.0,1.0,6.0,1.0,0.0,0.0,4.0,0.0,7.0,1.0,1.0,1.0,0.0,9.0,4.0,6.0,5.0,0.0,0.0,27.0,13.0,1.0,0.0,1.0,0.0,35.0,11.0,2.0,1.0,1.0,6.0,1.0,0.0,0.0,5.0,0.0,7.0,1.0,0.0,1.0,0.0,9.0,4.0,5.0,5.0,0.0,0.0,27.0,14.0,2.0,0.0,1.0,0.0,8.0,8.0,8.0,8.0,8.0,8.0,6.0,6.0,1.0,8.0,8.0,8.0,8.0,8.0,8.0,6.0,6.0,1.0
max,20161000.0,3.0,165.0,165.0,49.0,38.0,156.0,99027.0,1245.0,85.0,42.0,12.0,6.0,8.0,31.0,9.0,5.0,6.0,18.0,7.0,26.0,11.0,6.0,7.0,2.0,25.0,13.0,26.0,26.0,8.0,5.0,78.0,37.0,30.0,11.0,7.0,1.0,95.0,36.0,13.0,8.0,10.0,29.0,8.0,5.0,5.0,18.0,7.0,24.0,15.0,5.0,6.0,2.0,24.0,12.0,31.0,31.0,6.0,6.0,78.0,41.0,28.0,9.0,7.0,1.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0


In [9]:
game_log.head()

Unnamed: 0,date,number_of_game,day_of_week,v_name,v_league,v_game_number,h_name,h_league,h_game_number,v_score,h_score,length_outs,day_night,completion,forefeit,protest,park_id,attendance,length_minutes,v_line_score,h_line_score,v_at_bats,v_hits,v_doubles,v_triples,v_homeruns,v_rbi,v_sacrifice_hits,v_sacrifice_flies,v_hit_by_pitch,v_walks,v_intentional_walks,v_strikeouts,v_stolen_bases,v_caught_stealing,v_grounded_into_double,v_first_catcher_interference,v_left_on_base,v_pitchers_used,v_individual_earned_runs,v_team_earned_runs,v_wild_pitches,v_balks,v_putouts,v_assists,v_errors,v_passed_balls,v_double_plays,v_triple_plays,h_at_bats,h_hits,h_doubles,h_triples,h_homeruns,h_rbi,h_sacrifice_hits,h_sacrifice_flies,h_hit_by_pitch,h_walks,h_intentional_walks,h_strikeouts,h_stolen_bases,h_caught_stealing,h_grounded_into_double,h_first_catcher_interference,h_left_on_base,h_pitchers_used,h_individual_earned_runs,h_team_earned_runs,h_wild_pitches,h_balks,h_putouts,h_assists,h_errors,h_passed_balls,h_double_plays,h_triple_plays,hp_umpire_id,hp_umpire_name,1b_umpire_id,1b_umpire_name,2b_umpire_id,2b_umpire_name,3b_umpire_id,3b_umpire_name,lf_umpire_id,lf_umpire_name,rf_umpire_id,rf_umpire_name,v_manager_id,v_manager_name,h_manager_id,h_manager_name,winning_pitcher_id,winning_pitcher_name,losing_pitcher_id,losing_pitcher_name,saving_pitcher_id,saving_pitcher_name,winning_rbi_batter_id,winning_rbi_batter_id_name,v_starting_pitcher_id,v_starting_pitcher_name,h_starting_pitcher_id,h_starting_pitcher_name,v_player_1_id,v_player_1_name,v_player_1_def_pos,v_player_2_id,v_player_2_name,v_player_2_def_pos,v_player_3_id,v_player_3_name,v_player_3_def_pos,v_player_4_id,v_player_4_name,v_player_4_def_pos,v_player_5_id,v_player_5_name,v_player_5_def_pos,v_player_6_id,v_player_6_name,v_player_6_def_pos,v_player_7_id,v_player_7_name,v_player_7_def_pos,v_player_8_id,v_player_8_name,v_player_8_def_pos,v_player_9_id,v_player_9_name,v_player_9_def_pos,h_player_1_id,h_player_1_name,h_player_1_def_
pos,h_player_2_id,h_player_2_name,h_player_2_def_pos,h_player_3_id,h_player_3_name,h_player_3_def_pos,h_player_4_id,h_player_4_name,h_player_4_def_pos,h_player_5_id,h_player_5_name,h_player_5_def_pos,h_player_6_id,h_player_6_name,h_player_6_def_pos,h_player_7_id,h_player_7_name,h_player_7_def_pos,h_player_8_id,h_player_8_name,h_player_8_def_pos,h_player_9_id,h_player_9_name,h_player_9_def_pos,additional_info,acquisition_info
0,18710504,0,Thu,CL1,,1,FW1,,1,0,2,54.0,D,,,,FOR01,200.0,120.0,0,10010000,30.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,,6.0,1.0,,-1.0,,4.0,1.0,1.0,1.0,0.0,0.0,27.0,9.0,0.0,3.0,0.0,0.0,31.0,4.0,1.0,0.0,0.0,2.0,0.0,0.0,0.0,1.0,,0.0,0.0,,-1.0,,3.0,1.0,0.0,0.0,0.0,0.0,27.0,3.0,3.0,1.0,1.0,0.0,boakj901,John Boake,,,,,,,,,,,paboc101,Charlie Pabor,lennb101,Bill Lennon,mathb101,Bobby Mathews,prata101,Al Pratt,,,,,prata101,Al Pratt,mathb101,Bobby Mathews,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0,paboc101,Charlie Pabor,7.0,allia101,Art Allison,8.0,white104,Elmer White,9.0,prata101,Al Pratt,1.0,sutte101,Ezra Sutton,5.0,carlj102,Jim Carleton,3.0,bassj101,John Bass,6.0,selmf101,Frank Sellman,5.0,mathb101,Bobby Mathews,1.0,foraj101,Jim Foran,3.0,goldw101,Wally Goldsmith,6.0,lennb101,Bill Lennon,2.0,caret101,Tom Carey,4.0,mince101,Ed Mincher,7.0,mcdej101,James McDermott,8.0,kellb105,Bill Kelly,9.0,,Y
1,18710505,0,Fri,BS1,,1,WS3,,1,20,18,54.0,D,,,,WAS01,5000.0,145.0,107000435,640113030,41.0,13.0,1.0,2.0,0.0,13.0,0.0,0.0,0.0,18.0,,5.0,3.0,,-1.0,,12.0,1.0,6.0,6.0,1.0,0.0,27.0,13.0,10.0,1.0,2.0,0.0,49.0,14.0,2.0,0.0,0.0,11.0,0.0,0.0,0.0,10.0,,2.0,1.0,,-1.0,,14.0,1.0,7.0,7.0,0.0,0.0,27.0,20.0,10.0,2.0,3.0,0.0,dobsh901,Henry Dobson,,,,,,,,,,,wrigh101,Harry Wright,younn801,Nick Young,spala101,Al Spalding,braia102,Asa Brainard,,,,,spala101,Al Spalding,braia102,Asa Brainard,wrigg101,George Wright,6.0,barnr102,Ross Barnes,4.0,birdd102,Dave Birdsall,9.0,mcvec101,Cal McVey,2.0,wrigh101,Harry Wright,8.0,goulc101,Charlie Gould,3.0,schah101,Harry Schafer,5.0,conef101,Fred Cone,7.0,spala101,Al Spalding,1.0,watef102,Fred Waterman,5.0,forcd101,Davy Force,6.0,mille105,Everett Mills,3.0,allid101,Doug Allison,2.0,hallg101,George Hall,7.0,leona101,Andy Leonard,4.0,braia102,Asa Brainard,1.0,burrh101,Henry Burroughs,9.0,berth101,Henry Berthrong,8.0,HTBF,Y
2,18710506,0,Sat,CL1,,2,RC1,,1,12,4,54.0,D,,,,RCK01,1000.0,140.0,610020003,10020100,49.0,11.0,1.0,1.0,0.0,8.0,0.0,0.0,0.0,0.0,,1.0,0.0,,-1.0,,10.0,1.0,0.0,0.0,2.0,0.0,27.0,12.0,8.0,5.0,0.0,0.0,36.0,7.0,2.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,,3.0,5.0,,-1.0,,5.0,1.0,3.0,3.0,1.0,0.0,27.0,12.0,13.0,3.0,0.0,0.0,mawnj901,J.H. Manny,,,,,,,,,,,paboc101,Charlie Pabor,hasts101,Scott Hastings,prata101,Al Pratt,fishc102,Cherokee Fisher,,,,,prata101,Al Pratt,fishc102,Cherokee Fisher,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0,paboc101,Charlie Pabor,7.0,allia101,Art Allison,8.0,white104,Elmer White,9.0,prata101,Al Pratt,1.0,sutte101,Ezra Sutton,5.0,carlj102,Jim Carleton,3.0,bassj101,John Bass,6.0,mackd101,Denny Mack,3.0,addyb101,Bob Addy,4.0,fishc102,Cherokee Fisher,1.0,hasts101,Scott Hastings,8.0,ham-r101,Ralph Ham,5.0,ansoc101,Cap Anson,2.0,sagep101,Pony Sager,6.0,birdg101,George Bird,7.0,stirg101,Gat Stires,9.0,,Y
3,18710508,0,Mon,CL1,,3,CH1,,1,12,14,54.0,D,,,,CHI01,5000.0,150.0,101403111,77000000,46.0,15.0,2.0,1.0,2.0,10.0,0.0,0.0,0.0,0.0,,1.0,0.0,,-1.0,,7.0,1.0,6.0,6.0,0.0,0.0,27.0,15.0,11.0,6.0,0.0,0.0,43.0,11.0,2.0,0.0,0.0,8.0,0.0,0.0,0.0,4.0,,2.0,1.0,,-1.0,,6.0,1.0,4.0,4.0,0.0,0.0,27.0,14.0,7.0,2.0,0.0,0.0,willg901,Gardner Willard,,,,,,,,,,,paboc101,Charlie Pabor,woodj106,Jimmy Wood,zettg101,George Zettlein,prata101,Al Pratt,,,,,prata101,Al Pratt,zettg101,George Zettlein,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0,paboc101,Charlie Pabor,7.0,allia101,Art Allison,8.0,white104,Elmer White,9.0,prata101,Al Pratt,1.0,sutte101,Ezra Sutton,5.0,carlj102,Jim Carleton,3.0,bassj101,John Bass,6.0,mcatb101,Bub McAtee,3.0,kingm101,Marshall King,8.0,hodec101,Charlie Hodes,2.0,woodj106,Jimmy Wood,4.0,simmj101,Joe Simmons,9.0,folet101,Tom Foley,7.0,duffe101,Ed Duffy,6.0,pinke101,Ed Pinkham,5.0,zettg101,George Zettlein,1.0,,Y
4,18710509,0,Tue,BS1,,2,TRO,,1,9,5,54.0,D,,,,TRO01,3250.0,145.0,2232,101003000,46.0,17.0,4.0,1.0,0.0,6.0,0.0,0.0,0.0,2.0,,0.0,1.0,,-1.0,,12.0,1.0,2.0,2.0,0.0,0.0,27.0,12.0,5.0,0.0,1.0,0.0,36.0,9.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,3.0,,0.0,2.0,,-1.0,,7.0,1.0,3.0,3.0,1.0,0.0,27.0,11.0,7.0,3.0,0.0,0.0,leroi901,Isaac Leroy,,,,,,,,,,,wrigh101,Harry Wright,pikel101,Lip Pike,spala101,Al Spalding,mcmuj101,John McMullin,,,,,spala101,Al Spalding,mcmuj101,John McMullin,wrigg101,George Wright,6.0,barnr102,Ross Barnes,4.0,birdd102,Dave Birdsall,9.0,mcvec101,Cal McVey,2.0,wrigh101,Harry Wright,8.0,goulc101,Charlie Gould,3.0,schah101,Harry Schafer,5.0,conef101,Fred Cone,7.0,spala101,Al Spalding,1.0,flync101,Clipper Flynn,9.0,mcgem101,Mike McGeary,2.0,yorkt101,Tom York,8.0,mcmuj101,John McMullin,1.0,kings101,Steve King,7.0,beave101,Edward Beavens,4.0,bells101,Steve Bellan,5.0,pikel101,Lip Pike,3.0,cravb101,Bill Craver,6.0,HTBF,Y


In [10]:
game_log.tail()

Unnamed: 0,date,number_of_game,day_of_week,v_name,v_league,v_game_number,h_name,h_league,h_game_number,v_score,h_score,length_outs,day_night,completion,forefeit,protest,park_id,attendance,length_minutes,v_line_score,h_line_score,v_at_bats,v_hits,v_doubles,v_triples,v_homeruns,v_rbi,v_sacrifice_hits,v_sacrifice_flies,v_hit_by_pitch,v_walks,v_intentional_walks,v_strikeouts,v_stolen_bases,v_caught_stealing,v_grounded_into_double,v_first_catcher_interference,v_left_on_base,v_pitchers_used,v_individual_earned_runs,v_team_earned_runs,v_wild_pitches,v_balks,v_putouts,v_assists,v_errors,v_passed_balls,v_double_plays,v_triple_plays,h_at_bats,h_hits,h_doubles,h_triples,h_homeruns,h_rbi,h_sacrifice_hits,h_sacrifice_flies,h_hit_by_pitch,h_walks,h_intentional_walks,h_strikeouts,h_stolen_bases,h_caught_stealing,h_grounded_into_double,h_first_catcher_interference,h_left_on_base,h_pitchers_used,h_individual_earned_runs,h_team_earned_runs,h_wild_pitches,h_balks,h_putouts,h_assists,h_errors,h_passed_balls,h_double_plays,h_triple_plays,hp_umpire_id,hp_umpire_name,1b_umpire_id,1b_umpire_name,2b_umpire_id,2b_umpire_name,3b_umpire_id,3b_umpire_name,lf_umpire_id,lf_umpire_name,rf_umpire_id,rf_umpire_name,v_manager_id,v_manager_name,h_manager_id,h_manager_name,winning_pitcher_id,winning_pitcher_name,losing_pitcher_id,losing_pitcher_name,saving_pitcher_id,saving_pitcher_name,winning_rbi_batter_id,winning_rbi_batter_id_name,v_starting_pitcher_id,v_starting_pitcher_name,h_starting_pitcher_id,h_starting_pitcher_name,v_player_1_id,v_player_1_name,v_player_1_def_pos,v_player_2_id,v_player_2_name,v_player_2_def_pos,v_player_3_id,v_player_3_name,v_player_3_def_pos,v_player_4_id,v_player_4_name,v_player_4_def_pos,v_player_5_id,v_player_5_name,v_player_5_def_pos,v_player_6_id,v_player_6_name,v_player_6_def_pos,v_player_7_id,v_player_7_name,v_player_7_def_pos,v_player_8_id,v_player_8_name,v_player_8_def_pos,v_player_9_id,v_player_9_name,v_player_9_def_pos,h_player_1_id,h_player_1_name,h_player_1_def_
pos,h_player_2_id,h_player_2_name,h_player_2_def_pos,h_player_3_id,h_player_3_name,h_player_3_def_pos,h_player_4_id,h_player_4_name,h_player_4_def_pos,h_player_5_id,h_player_5_name,h_player_5_def_pos,h_player_6_id,h_player_6_name,h_player_6_def_pos,h_player_7_id,h_player_7_name,h_player_7_def_pos,h_player_8_id,h_player_8_name,h_player_8_def_pos,h_player_9_id,h_player_9_name,h_player_9_def_pos,additional_info,acquisition_info
171902,20161002,0,Sun,MIL,NL,162,COL,NL,162,6,4,60.0,D,,,,DEN02,27762.0,203.0,200000202,1100100010,39.0,10.0,4.0,1.0,2.0,6.0,0.0,0.0,1.0,4.0,0.0,12.0,2.0,1.0,0.0,0.0,8.0,7.0,4.0,4.0,1.0,0.0,30.0,12.0,1.0,0.0,0.0,0.0,41.0,13.0,4.0,0.0,1.0,4.0,1.0,0.0,1.0,3.0,0.0,11.0,0.0,1.0,0.0,0.0,12.0,5.0,6.0,6.0,0.0,0.0,30.0,13.0,0.0,0.0,0.0,0.0,barrs901,Scott Barry,woodt901,Tom Woodring,randt901,Tony Randazzo,ortir901,Roberto Ortiz,,,,,counc001,Craig Counsell,weisw001,Walt Weiss,thort001,Tyler Thornburg,rusic001,Chris Rusin,knebc001,Corey Knebel,susaa001,Andrew Susac,cravt001,Tyler Cravy,marqg001,German Marquez,villj001,Jonathan Villar,5.0,genns001,Scooter Gennett,4.0,cartc002,Chris Carter,3.0,santd002,Domingo Santana,9.0,pereh001,Hernan Perez,8.0,arcio002,Orlando Arcia,6.0,susaa001,Andrew Susac,2.0,elmoj001,Jake Elmore,7.0,cravt001,Tyler Cravy,1.0,blacc001,Charlie Blackmon,8.0,dahld001,David Dahl,7.0,arenn001,Nolan Arenado,5.0,gonzc001,Carlos Gonzalez,9.0,murpt002,Tom Murphy,2.0,pattj005,Jordan Patterson,3.0,valap001,Pat Valaika,4.0,adamc001,Cristhian Adames,6.0,marqg001,German Marquez,1.0,,Y
171903,20161002,0,Sun,NYN,NL,162,PHI,NL,162,2,5,51.0,D,,,,PHI13,36935.0,159.0,1100,00100031x,33.0,8.0,3.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,0.0,9.0,1.0,1.0,1.0,0.0,6.0,6.0,3.0,3.0,0.0,0.0,24.0,12.0,3.0,1.0,2.0,0.0,33.0,10.0,1.0,0.0,0.0,3.0,0.0,1.0,0.0,2.0,0.0,3.0,0.0,0.0,2.0,0.0,7.0,5.0,2.0,2.0,0.0,0.0,27.0,7.0,0.0,0.0,1.0,0.0,barkl901,Lance Barksdale,herna901,Angel Hernandez,barrt901,Ted Barrett,littw901,Will Little,,,,,collt801,Terry Collins,mackp101,Pete Mackanin,murrc002,Colton Murray,goede001,Erik Goeddel,nerih001,Hector Neris,hernc005,Cesar Hernandez,ynoag001,Gabriel Ynoa,eickj001,Jerad Eickhoff,granc001,Curtis Granderson,8.0,cabra002,Asdrubal Cabrera,6.0,brucj001,Jay Bruce,9.0,dudal001,Lucas Duda,3.0,johnk003,Kelly Johnson,4.0,confm001,Michael Conforto,7.0,campe001,Eric Campbell,5.0,plawk001,Kevin Plawecki,2.0,ynoag001,Gabriel Ynoa,1.0,hernc005,Cesar Hernandez,4.0,parej002,Jimmy Paredes,7.0,herro001,Odubel Herrera,8.0,franm004,Maikel Franco,5.0,howar001,Ryan Howard,3.0,ruppc001,Cameron Rupp,2.0,blana001,Andres Blanco,6.0,altha001,Aaron Altherr,9.0,eickj001,Jerad Eickhoff,1.0,,Y
171904,20161002,0,Sun,LAN,NL,162,SFN,NL,162,1,7,51.0,D,,,,SFO03,41445.0,184.0,100000,23000002x,30.0,4.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0,7.0,0.0,0.0,1.0,0.0,4.0,7.0,7.0,7.0,0.0,0.0,24.0,5.0,1.0,0.0,0.0,0.0,39.0,16.0,3.0,1.0,0.0,7.0,0.0,0.0,0.0,4.0,1.0,11.0,2.0,1.0,0.0,0.0,12.0,2.0,1.0,1.0,0.0,0.0,27.0,7.0,0.0,0.0,1.0,0.0,knigb901,Brian Knight,westj901,Joe West,fleta901,Andy Fletcher,danlk901,Kerwin Danley,,,,,robed001,Dave Roberts,bochb002,Bruce Bochy,moorm003,Matt Moore,maedk001,Kenta Maeda,,,poseb001,Buster Posey,maedk001,Kenta Maeda,moorm003,Matt Moore,kendh001,Howie Kendrick,7.0,turnj001,Justin Turner,5.0,seagc001,Corey Seager,6.0,puigy001,Yasiel Puig,9.0,gonza003,Adrian Gonzalez,3.0,grany001,Yasmani Grandal,2.0,pedej001,Joc Pederson,8.0,utlec001,Chase Utley,4.0,maedk001,Kenta Maeda,1.0,spand001,Denard Span,8.0,beltb001,Brandon Belt,3.0,poseb001,Buster Posey,2.0,pench001,Hunter Pence,9.0,crawb001,Brandon Crawford,6.0,pagaa001,Angel Pagan,7.0,panij002,Joe Panik,4.0,gillc001,Conor Gillaspie,5.0,moorm003,Matt Moore,1.0,,Y
171905,20161002,0,Sun,PIT,NL,162,SLN,NL,162,4,10,51.0,D,,,,STL10,44615.0,192.0,20200,00100360x,35.0,9.0,0.0,0.0,1.0,4.0,0.0,0.0,0.0,4.0,0.0,11.0,0.0,1.0,0.0,0.0,8.0,6.0,8.0,8.0,0.0,0.0,24.0,2.0,2.0,0.0,0.0,0.0,36.0,12.0,2.0,0.0,1.0,10.0,0.0,2.0,0.0,4.0,0.0,5.0,0.0,0.0,0.0,0.0,8.0,3.0,4.0,4.0,0.0,0.0,27.0,7.0,0.0,0.0,1.0,0.0,cuzzp901,Phil Cuzzi,ticht901,Todd Tichenor,vanol901,Larry Vanover,marqa901,Alfonso Marquez,,,,,hurdc001,Clint Hurdle,mathm001,Mike Matheny,broxj001,Jonathan Broxton,nicaj001,Juan Nicasio,,,piscs001,Stephen Piscotty,voger001,Ryan Vogelsong,waina001,Adam Wainwright,jasoj001,John Jaso,3.0,polag001,Gregory Polanco,9.0,mccua001,Andrew McCutchen,8.0,kangj001,Jung Ho Kang,5.0,joycm001,Matt Joyce,7.0,hansa001,Alen Hanson,4.0,fryee001,Eric Fryer,2.0,florp001,Pedro Florimon,6.0,voger001,Ryan Vogelsong,1.0,carpm002,Matt Carpenter,3.0,diaza003,Aledmys Diaz,6.0,moliy001,Yadier Molina,2.0,piscs001,Stephen Piscotty,9.0,peraj001,Jhonny Peralta,5.0,mossb001,Brandon Moss,7.0,gyorj001,Jedd Gyorko,4.0,gricr001,Randal Grichuk,8.0,waina001,Adam Wainwright,1.0,,Y
171906,20161002,0,Sun,MIA,NL,161,WAS,NL,162,7,10,51.0,D,,,,WAS11,28730.0,216.0,230020,03023002x,38.0,14.0,1.0,1.0,2.0,7.0,1.0,0.0,0.0,3.0,2.0,10.0,1.0,1.0,1.0,0.0,8.0,7.0,10.0,10.0,1.0,0.0,24.0,11.0,0.0,0.0,1.0,0.0,30.0,10.0,2.0,0.0,1.0,10.0,1.0,1.0,1.0,8.0,0.0,3.0,2.0,0.0,1.0,0.0,7.0,6.0,7.0,7.0,1.0,0.0,27.0,11.0,0.0,0.0,1.0,0.0,tumpj901,John Tumpane,porta901,Alan Porter,onorb901,Brian O'Nora,kellj901,Jeff Kellogg,,,,,mattd001,Don Mattingly,baked002,Dusty Baker,schem001,Max Scherzer,brica001,Austin Brice,melam001,Mark Melancon,difow001,Wilmer Difo,koeht001,Tom Koehler,schem001,Max Scherzer,gordd002,Dee Gordon,4.0,telit001,Tomas Telis,2.0,pradm001,Martin Prado,5.0,yelic001,Christian Yelich,8.0,bourj002,Justin Bour,3.0,scrux001,Xavier Scruggs,7.0,hoodd001,Destin Hood,9.0,hecha001,Adeiny Hechavarria,6.0,koeht001,Tom Koehler,1.0,turnt001,Trea Turner,8.0,reveb001,Ben Revere,7.0,harpb003,Bryce Harper,9.0,zimmr001,Ryan Zimmerman,3.0,drews001,Stephen Drew,5.0,difow001,Wilmer Difo,4.0,espid001,Danny Espinosa,6.0,lobaj001,Jose Lobaton,2.0,schem001,Max Scherzer,1.0,,Y


In [11]:
game_log.columns

Index(['date', 'number_of_game', 'day_of_week', 'v_name', 'v_league',
       'v_game_number', 'h_name', 'h_league', 'h_game_number', 'v_score',
       ...
       'h_player_7_name', 'h_player_7_def_pos', 'h_player_8_id',
       'h_player_8_name', 'h_player_8_def_pos', 'h_player_9_id',
       'h_player_9_name', 'h_player_9_def_pos', 'additional_info',
       'acquisition_info'],
      dtype='object', length=161)

In [12]:
!ls

appearance_type.csv
game_log.csv
game_log_fields.txt
park_codes.csv
person_codes.csv
project13_designing_and_creating_a_database.ipynb
team_codes.csv


In [13]:
!cat game_log_fields.txt

Field(s)  Meaning
    1     Date in the form "yyyymmdd"
    2     Number of game:
             "0" -- a single game
             "1" -- the first game of a double (or triple) header
                    including seperate admission doubleheaders
             "2" -- the second game of a double (or triple) header
                    including seperate admission doubleheaders
             "3" -- the third game of a triple-header
             "A" -- the first game of a double-header involving 3 teams
             "B" -- the second game of a double-header involving 3 teams
    3     Day of week  ("Sun","Mon","Tue","Wed","Thu","Fri","Sat")
  4-5     Visiting team and league
    6     Visiting team game number
          For this and the home team game number, ties are counted as
          games and suspended games are counted from the starting
          rather than the ending date.
  7-8     Home team and league
    9     Home team game number
10-11     Visiting and home tea
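
The field list says the date is stored in the form "yyyymmdd", and pandas read it as an integer. As a minimal sketch (the `sample` frame and the `date_parsed` column are hypothetical names, built from dates visible in the `head()` and `tail()` output), the values could be converted to real dates like this:

```python
import pandas as pd

# Hypothetical three-row sample using yyyymmdd integers seen in game_log['date'].
sample = pd.DataFrame({"date": [18710504, 19530530, 20161002]})

# Cast to string first so the format string is applied to the digits.
sample["date_parsed"] = pd.to_datetime(sample["date"].astype(str), format="%Y%m%d")

print(sample["date_parsed"].dt.year.tolist())
```

The same call should work on `game_log['date']`, assuming every value in that column is a valid calendar date.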

### Park Codes

In [14]:
park.shape

(252, 9)

In [15]:
park.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 9 columns):
park_id    252 non-null object
name       252 non-null object
aka        58 non-null object
city       252 non-null object
state      252 non-null object
start      252 non-null object
end        222 non-null object
league     186 non-null object
notes      128 non-null object
dtypes: object(9)
memory usage: 17.8+ KB


In [16]:
park.describe()

Unnamed: 0,park_id,name,aka,city,state,start,end,league,notes
count,252,252,58,252,252,252,222,186,128
unique,252,241,56,85,36,215,205,6,123
top,NYC03,Athletic Park,Federal League Park,Philadelphia,NY,05/01/1883,10/03/1915,NL,PIT
freq,1,4,2,14,40,5,3,88,2


In [17]:
park.head()

Unnamed: 0,park_id,name,aka,city,state,start,end,league,notes
0,ALB01,Riverside Park,,Albany,NY,09/11/1880,05/30/1882,NL,TRN:9/11/80;6/15&9/10/1881;5/16-5/18&5/30/1882
1,ALT01,Columbia Park,,Altoona,PA,04/30/1884,05/31/1884,UA,
2,ANA01,Angel Stadium of Anaheim,Edison Field; Anaheim Stadium,Anaheim,CA,04/19/1966,,AL,
3,ARL01,Arlington Stadium,,Arlington,TX,04/21/1972,10/03/1993,AL,
4,ARL02,Rangers Ballpark in Arlington,The Ballpark in Arlington; Ameriquest Fl,Arlington,TX,04/11/1994,,AL,


In [18]:
park.tail()

Unnamed: 0,park_id,name,aka,city,state,start,end,league,notes
247,WIL02,BB&T Ballpark at Bowman Field,,Wiliamsport,PA,08/20/2017,08/20/2017,NL,PIT
248,WNY01,West New York Field Club Grounds,,West New York,NJ,09/11/1898,09/17/1899,NL,"BRO:9/18&10/2/1898; NY1:9/11/98, 6/4&7/16&8/13&9/17/99"
249,WOR01,Agricultural County Fair Grounds I,,Worcester,MA,05/01/1880,09/29/1882,NL,
250,WOR02,Agricultural County Fair Grounds II,,Worcester,MA,08/17/1887,08/17/1887,NL,1 BSN game
251,WOR03,Worcester Driving Park Grounds,,Worcester,MA,10/30/1874,10/30/1874,,1 BS1 game


In [19]:
park.columns

Index(['park_id', 'name', 'aka', 'city', 'state', 'start', 'end', 'league',
       'notes'],
      dtype='object')

### Person Codes

In [20]:
person.shape

(20494, 7)

In [21]:
person.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20494 entries, 0 to 20493
Data columns (total 7 columns):
id              20494 non-null object
last            20494 non-null object
first           20433 non-null object
player_debut    19183 non-null object
mgr_debut       701 non-null object
coach_debut     1566 non-null object
ump_debut       1546 non-null object
dtypes: object(7)
memory usage: 1.1+ MB


In [22]:
person.describe()

Unnamed: 0,id,last,first,player_debut,mgr_debut,coach_debut,ump_debut
count,20494,20494,20433,19183,701,1566,1546
unique,20494,10397,2484,10278,537,332,1130
top,martm101,Smith,Bill,05/01/1884,05/01/1884,04/01/2013,08/25/1978
freq,1,168,579,36,10,29,52


In [23]:
person.head()

Unnamed: 0,id,last,first,player_debut,mgr_debut,coach_debut,ump_debut
0,aardd001,Aardsma,David,04/06/2004,,,
1,aaroh101,Aaron,Hank,04/13/1954,,,
2,aarot101,Aaron,Tommie,04/10/1962,,04/06/1979,
3,aased001,Aase,Don,07/26/1977,,,
4,abada001,Abad,Andy,09/10/2001,,,


In [24]:
person.tail()

Unnamed: 0,id,last,first,player_debut,mgr_debut,coach_debut,ump_debut
20489,zuvep001,Zuvella,Paul,09/04/1982,,04/02/1996,
20490,zuveg101,Zuverink,George,04/21/1951,,,
20491,zwild101,Zwilling,Dutch,08/14/1910,,04/15/1941,
20492,zycht001,Zych,Tony,09/04/2015,,,
20493,thoma102,Thompson,,,,,


In [25]:
person.columns

Index(['id', 'last', 'first', 'player_debut', 'mgr_debut', 'coach_debut',
       'ump_debut'],
      dtype='object')

### Team Codes

In [26]:
team.shape

(150, 8)

In [27]:
team.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 8 columns):
team_id      150 non-null object
league       124 non-null object
start        150 non-null int64
end          150 non-null int64
city         150 non-null object
nickname     150 non-null object
franch_id    150 non-null object
seq          150 non-null int64
dtypes: int64(3), object(5)
memory usage: 9.5+ KB


In [28]:
team.describe()

Unnamed: 0,start,end,seq
count,150.0,150.0,150.0
mean,1902.733333,1517.44,1.293333
std,37.326002,761.772866,0.585584
min,1871.0,0.0,1.0
25%,1879.0,1872.0,1.0
50%,1887.0,1884.0,1.0
75%,1911.25,1891.0,1.0
max,2012.0,2011.0,4.0


In [29]:
team.head()

Unnamed: 0,team_id,league,start,end,city,nickname,franch_id,seq
0,ALT,UA,1884,1884,Altoona,Mountain Cities,ALT,1
1,ARI,NL,1998,0,Arizona,Diamondbacks,ARI,1
2,BFN,NL,1879,1885,Buffalo,Bisons,BFN,1
3,BFP,PL,1890,1890,Buffalo,Bisons,BFP,1
4,BL1,,1872,1874,Baltimore,Canaries,BL1,1


In [30]:
team.tail()

Unnamed: 0,team_id,league,start,end,city,nickname,franch_id,seq
145,WS8,NL,1886,1889,Washington,Senators,WS8,1
146,WS9,AA,1891,1891,Washington,Senators,WS9,1
147,WSN,NL,1892,1899,Washington,Senators,WS9,2
148,WSU,UA,1884,1884,Washington,Nationals,WSU,1
149,MIA,NL,2012,0,Miami,Marlins,FLO,2


In [31]:
team.columns

Index(['team_id', 'league', 'start', 'end', 'city', 'nickname', 'franch_id',
       'seq'],
      dtype='object')
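
In the `head()` and `tail()` output above, ARI and MIA have an `end` of 0, which pulls the mean of `end` down to about 1517. We are assuming here that 0 marks a team that is still active; that reading is not confirmed by the files themselves. Under that assumption, a minimal sketch of treating 0 as missing (the `sample` frame and `end_year` are hypothetical names) could look like this:

```python
import pandas as pd

# Three rows copied from the team head()/tail() output above.
sample = pd.DataFrame({
    "team_id": ["ALT", "ARI", "MIA"],
    "start": [1884, 1998, 2012],
    "end": [1884, 0, 0],
})

# Assumption: an end year of 0 means "no end year yet", so replace it with NaN.
end_year = sample["end"].where(sample["end"] != 0)

print(end_year.describe())
```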

**Observations**

Game Log contains information about baseball games played from 1871 through 2016. This includes play statistics for both teams and the conditions of each game, such as location and attendance.

Park Codes contains information about the baseball stadiums: park ID, name, location, league, and the dates each park was first and last used. Its `park_id` seems to correspond to `park_id` in Game Log.

Person Codes contains information about individual players, managers, coaches, and umpires. It includes an ID, first and last name, and a debut date for each role (player, manager, coach, and umpire). Some people were players and later became managers or coaches. The `id` corresponds to several columns in Game Log (`1b_umpire_id`, `v_manager_id`, `winning_pitcher_id`, etc.).

Team Codes contains information about the teams: their ID, the league they belonged to, the years the team started and ended, the city they were based in, their nickname, a franchise ID, and `seq`. It is not obvious what the last column (`seq`) is for; rows such as WS9 (`seq` 1) and WSN (`seq` 2), which share the franchise ID WS9, suggest it orders the team IDs within a franchise. The `team_id` seems to correspond to `v_name`/`h_name` in Game Log.
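
The observations above say that IDs in the helper files "seem to" correspond to columns in Game Log. One way to check this is to list the values in a Game Log column that never appear in the matching helper column. The sketch below uses a hypothetical helper, `unmatched_values`, and tiny made-up frames (`games`, `parks`) in place of the real data:

```python
import pandas as pd

def unmatched_values(child: pd.Series, parent: pd.Series) -> set:
    """Return the non-null values in `child` that never appear in `parent`."""
    child = child.dropna()
    return set(child[~child.isin(parent)])

# Tiny hypothetical frames standing in for game_log and park.
games = pd.DataFrame({"park_id": ["FOR01", "WAS01", None, "XXX99"]})
parks = pd.DataFrame({"park_id": ["FOR01", "WAS01", "CHI01"]})

print(unmatched_values(games["park_id"], parks["park_id"]))
```

On the real data the same helper could be called as `unmatched_values(game_log['park_id'], park['park_id'])`, and likewise for `v_name`/`h_name` against `team['team_id']` or any `_id` column against `person['id']`. An empty set would support the idea that the column can become a foreign key.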

**Research**

_What does each defensive position number represent?_

- 1: pitcher
- 2: catcher
- 3: first baseman
- 4: second baseman
- 5: third baseman
- 6: shortstop
- 7: left fielder
- 8: center fielder
- 9: right fielder
- 10: designated hitter (Retrosheet's code for a batter who does not play in the field)
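
To make the position codes easier to read during EDA, they can be mapped to names. This is a minimal sketch: `POSITION_NAMES` and `label_positions` are hypothetical names, and code 10 is treated as the designated hitter, which is how Retrosheet uses it.

```python
import pandas as pd

# Lookup table for the standard position numbers; 10 is Retrosheet's DH code.
POSITION_NAMES = {
    1: "pitcher",
    2: "catcher",
    3: "first baseman",
    4: "second baseman",
    5: "third baseman",
    6: "shortstop",
    7: "left fielder",
    8: "center fielder",
    9: "right fielder",
    10: "designated hitter",
}

def label_positions(def_pos: pd.Series) -> pd.Series:
    """Map numeric defensive-position codes to names; missing codes stay missing."""
    return def_pos.map(POSITION_NAMES)

# The *_def_pos columns are floats with NaN, so the sample mimics that.
sample = pd.Series([2.0, 10.0, None, 6.0])
labelled = label_positions(sample)
print(labelled.tolist())
```

A lookup like this is also a natural candidate for a small table of its own once we start normalizing.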

_What do the values in the various league fields represent?_

- NL: National League
- AL: American League
- AA: American Association (1882–1891)
- UA: Union Association (1884)
- FL: Federal League (1914–1915)
- PL: Players' League (1890)
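
Some teams have no league code at all. BL1, for example, played from 1872 to 1874, before the National League was founded in 1876. Grouping by league and looking at start years may show whether the missing codes cluster in those early seasons. The sketch below uses a hypothetical helper, `league_year_span`, on four rows copied from the `team` output above:

```python
import pandas as pd

def league_year_span(team_df: pd.DataFrame) -> pd.DataFrame:
    """Earliest and latest start year per league, with NaN shown as 'missing'."""
    labelled = team_df.assign(league=team_df["league"].fillna("missing"))
    return labelled.groupby("league")["start"].agg(["min", "max"])

# Four rows copied from the team head()/tail() output.
sample = pd.DataFrame({
    "team_id": ["BL1", "ALT", "WSU", "ARI"],
    "league": [None, "UA", "UA", "NL"],
    "start": [1872, 1884, 1884, 1998],
})

span = league_year_span(sample)
print(span)
```

Running `league_year_span(team)` on the full file would show the real ranges; we make no claim here about what they are.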

In [33]:
# unique leagues
team['league'].value_counts(dropna=False)

NL     45
NaN    26
AL     25
AA     24
UA     13
FL      9
PL      8
Name: league, dtype: int64

In [34]:
# unique defensive positions
game_log['h_player_6_def_pos'].value_counts(dropna=False)

NaN      31069
 5.0     24202
 3.0     21125
 2.0     18973
 9.0     18766
 7.0     17520
 8.0     13198
 4.0     12401
 6.0     10192
 10.0     4401
 1.0        60
Name: h_player_6_def_pos, dtype: int64

In [35]:
# unique defensive positions
game_log['h_player_7_def_pos'].value_counts(dropna=False)

NaN      31069
 2.0     30064
 6.0     23739
 5.0     23049
 4.0     19043
 3.0     12200
 9.0     10342
 8.0     10193
 7.0      9610
 10.0     2550
 1.0        48
Name: h_player_7_def_pos, dtype: int64
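
The two cells above inspect one column at a time. Since Game Log has eighteen `*_def_pos` columns, it may be simpler to count the codes across all of them in one pass. A minimal sketch, using a hypothetical helper `position_counts` and a small made-up frame:

```python
import pandas as pd

def position_counts(df: pd.DataFrame) -> pd.Series:
    """Count defensive-position codes across every column ending in '_def_pos'."""
    pos_cols = [c for c in df.columns if c.endswith("_def_pos")]
    # melt stacks the selected columns into one 'value' column;
    # value_counts ignores NaN by default.
    return df[pos_cols].melt()["value"].value_counts().sort_index()

# Hypothetical sample with two position columns and one unrelated column.
sample = pd.DataFrame({
    "h_player_6_def_pos": [5.0, 3.0, None],
    "h_player_7_def_pos": [2.0, 5.0, 10.0],
    "h_score": [1, 2, 3],
})

counts = position_counts(sample)
print(counts)
```

Calling `position_counts(game_log)` would give the full set of codes in use, which is a quick way to confirm that only 1 through 10 appear.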

---

## Importing Data into SQLite

---

## Looking for Normalization Opportunities

---

## Planning a Normalized Schema

---

## Creating Tables Without Foreign Key Relations

---

## Adding the Team and Game Tables

---

## Adding the Team Appearance Table

---

## Adding the Person Appearance Table

---

## Removing the Original Tables

---

## Next Steps