# Challenge Set 9
## Part II: Baseball Data

*Introductory - Intermediate level SQL*

--

Please complete this exercise using SQLAlchemy in a Jupyter notebook.

We will be working with the Lahman baseball data we uploaded to your AWS instance in class. 


1. What was the total spent on salaries by each team, each year?

2. What is the first and last year played for each player? *Hint:* Create a new table from 'Fielding.csv'.

3. Who has played the most all star games?

4. Which school has generated the most distinct players? *Hint:* Create new table from 'CollegePlaying.csv'.

5. Which players have the longest career? Assume that the `debut` and `finalGame` columns comprise the start and end, respectively, of a player's career. *Hint:* Create a new table from 'Master.csv'. Also note that strings can be converted to dates using the [`DATE`](https://wiki.postgresql.org/wiki/Working_with_Dates_and_Times_in_PostgreSQL#WORKING_with_DATETIME.2C_DATE.2C_and_INTERVAL_VALUES) function and can then be subtracted from each other yielding their difference in days.

6. What is the distribution of debut months? *Hint:* Look at the `DATE` and [`EXTRACT`](https://www.postgresql.org/docs/current/static/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT) functions.

7. What is the effect of table join order on mean salary for the players listed in the main (master) table? *Hint:* Perform two different queries, one that joins on playerID in the salary table and another that joins on the same column in the master table. You will have to use left joins for each since right joins are not currently supported with SQLAlchemy.


In [None]:
'''SQL
SELECT 
    teamID,
    yearID,
    SUM(salary) AS TotalSalarySpend
FROM 
    salaries
GROUP BY 
    teamID,
    yearID
ORDER BY 
    TotalSalarySpend DESC;
'''


'''OUTPUT (per-team totals over all years, from grouping by teamID only)
       1460955253 | TOR
       1372976857 | HOU
       1326557848 | CIN
       1322864229 | CLE
       1198507139 | COL
       1186474624 | MIN
       1114146053 | OAK
       1112113728 | ARI
       1102872893 | KCA
       1101814262 | SDN
       1062683436 | LAA
        946944053 | MIL
        917369218 | PIT
        695421775 | TBA
        675016005 | FLO
        583376341 | WAS
        468091973 | ANA
        408203761 | MON
        271978930 | CAL
        233645804 | ML4
        151679900 | MIA
'''
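
Question 1 asks for a total per team and per year, which means grouping by both columns. The following is a minimal sketch on invented rows, using Python's built-in `sqlite3` as a stand-in for the class PostgreSQL instance; the table contents and the `team_year_totals` name are assumptions for illustration only.

```python
import sqlite3

# In-memory stand-in for the real Salaries table; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE salaries (yearID INT, teamID TEXT, playerID TEXT, salary INT)"
)
conn.executemany(
    "INSERT INTO salaries VALUES (?, ?, ?, ?)",
    [
        (2000, "TOR", "p1", 100),
        (2000, "TOR", "p2", 200),
        (2001, "TOR", "p1", 150),
        (2000, "HOU", "p3", 300),
    ],
)

# Grouping by both teamID and yearID gives one total per team per season.
team_year_totals = conn.execute(
    """
    SELECT teamID, yearID, SUM(salary) AS TotalSalarySpend
    FROM salaries
    GROUP BY teamID, yearID
    ORDER BY teamID, yearID
    """
).fetchall()

for row in team_year_totals:
    print(row)
```

Grouping by `teamID` alone would collapse the two TOR seasons into a single row, which is the shape of the output listed above.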

2. What is the first and last year played for each player? Hint: Create a new table from 'Fielding.csv'.

In [None]:
'''SQL
SELECT 
    playerID, 
    MIN(yearID) AS FirstYear, 
    MAX(yearID) AS LastYear
FROM 
    Fielding  -- table loaded from 'Fielding.csv' (table name assumed)
GROUP BY 
    playerID;
'''

3. Who has played the most all star games?

In [None]:
'''SQL
SELECT 
    playerID, 
    COUNT(DISTINCT gameID) AS NumGames
FROM 
    AllstarFull
GROUP BY 
    playerID
ORDER BY 
    NumGames DESC
LIMIT 
    1;
'''

# Output: aaronha01

4. Which school has generated the most distinct players? Hint: Create new table from 'CollegePlaying.csv'.

In [None]:
# Create table
'''SQL
CREATE TABLE IF NOT EXISTS SchoolPlayers (
    schoolID varchar(15) NOT NULL,
    playerID varchar(15) NOT NULL,
    yearMin int DEFAULT NULL,
    yearMax int DEFAULT NULL,
    PRIMARY KEY (schoolID,playerID)
);
'''
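# Populate data (the column list assumes the CSV header order is
# playerID, schoolID, yearMin, yearMax; adjust it to match the file)
'''SQL
COPY SchoolPlayers (playerID, schoolID, yearMin, yearMax) FROM '/home/gretta/baseballdata/SchoolsPlayers.csv' DELIMITER ',' CSV HEADER;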
'''
# Run Query
'''SQL
SELECT 
    schoolID, 
    COUNT(DISTINCT playerID) AS NumPlayers
FROM 
    SchoolPlayers
GROUP BY 
    schoolID
ORDER BY 
    NumPlayers DESC;
'''

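# Output: whelato01 (this looks like a playerID rather than a schoolID, so the CSV columns were probably loaded in swapped order)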
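
The `DISTINCT` in question 4 matters most when the source file has several rows per player, for example one row per player per school year. That layout is an assumption here, as are the table name `college_playing` and all rows below. A small sketch with Python's built-in `sqlite3` (not the PostgreSQL instance) shows how `COUNT(*)` and `COUNT(DISTINCT playerID)` can disagree:

```python
import sqlite3

# Invented rows: p1 attended schoolA for three seasons.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE college_playing (playerID TEXT, schoolID TEXT, yearID INT)"
)
conn.executemany(
    "INSERT INTO college_playing VALUES (?, ?, ?)",
    [
        ("p1", "schoolA", 1990),
        ("p1", "schoolA", 1991),
        ("p1", "schoolA", 1992),
        ("p2", "schoolB", 1990),
        ("p3", "schoolB", 1991),
    ],
)

# NumRows counts seasons played; NumPlayers counts each player once.
school_counts = conn.execute(
    """
    SELECT schoolID,
           COUNT(*) AS NumRows,
           COUNT(DISTINCT playerID) AS NumPlayers
    FROM college_playing
    GROUP BY schoolID
    ORDER BY NumPlayers DESC, schoolID
    """
).fetchall()

print(school_counts)
```

Ranking by `NumRows` would put schoolA first, while ranking by `NumPlayers` puts schoolB first.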

5. Which players have the longest career? Assume that the debut and finalGame columns comprise the start and end, respectively, of a player's career. Hint: Create a new table from 'Master.csv'. Also note that strings can be converted to dates using the DATE function and can then be subtracted from each other yielding their difference in days.

In [None]:
# Create table
'''SQL
CREATE TABLE IF NOT EXISTS Master4 (
    playerID varchar(15) NOT NULL,
    birthYear int DEFAULT NULL,
    birthMonth int DEFAULT NULL,
    birthDay int DEFAULT NULL,
    birthCountry text DEFAULT NULL,
    birthState text DEFAULT NULL,
    birthCity text DEFAULT NULL,
    deathYear int DEFAULT NULL,
    deathMonth int DEFAULT NULL,
    deathDay int DEFAULT NULL,
    deathCountry text DEFAULT NULL,
    deathState text DEFAULT NULL,
    deathCity text DEFAULT NULL,
    nameFirst varchar(50) DEFAULT NULL,
    nameLast varchar(50) DEFAULT NULL,
    nameGiven varchar(50) DEFAULT NULL,
    weight int DEFAULT NULL,
    height int DEFAULT NULL,
    bats text DEFAULT NULL,
    throws text DEFAULT NULL,
    debut text DEFAULT NULL,
    finalGame text DEFAULT NULL,
    retroID varchar(15) DEFAULT NULL,
    bbrefID varchar(15) DEFAULT NULL,
    PRIMARY KEY (playerID)
);
'''
# Populate data
'''SQL
COPY Master4 FROM '/home/gretta/baseballdata/Master.csv' DELIMITER ',' CSV HEADER;
'''
# Run Query
'''SQL
SELECT 
    playerID, 
    nameFirst, 
    nameLast, 
    (DATE(finalGame) - DATE(debut)) AS CareerLength  -- days from debut to final game
FROM 
    Master4
ORDER BY 
    CareerLength DESC NULLS LAST;
'''
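
The order of the subtraction determines the sign of the career length. As a rough illustration, the sketch below uses Python's `datetime.date` rather than PostgreSQL's `DATE` function, with two invented dates standing in for `debut` and `finalGame`:

```python
from datetime import date

# Invented dates for illustration; real values come from the debut and
# finalGame columns of the master table.
debut = date.fromisoformat("2000-04-01")
final_game = date.fromisoformat("2010-04-01")

# End minus start gives a positive number of days.
career_days = (final_game - debut).days

# Start minus end gives the same magnitude with a negative sign, which
# would push the longest careers to the bottom of a DESC sort.
reversed_days = (debut - final_game).days

print(career_days, reversed_days)
```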

6. What is the distribution of debut months? Hint: Look at the DATE and EXTRACT functions.

In [None]:
'''SQL
SELECT 
    COUNT(DISTINCT playerID) AS NumPlayers, 
    EXTRACT(MONTH FROM DATE(debut)) AS DebutMonth
FROM 
    Master4
GROUP BY 
    DebutMonth
ORDER BY 
    DebutMonth;
'''
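
The same month-extraction idea can be tried locally on a handful of rows. This sketch swaps in SQLite's `strftime('%m', ...)` for PostgreSQL's `EXTRACT(MONTH FROM ...)`, since Python's built-in `sqlite3` has no `EXTRACT`; the table name `master` and its rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE master (playerID TEXT PRIMARY KEY, debut TEXT)")
conn.executemany(
    "INSERT INTO master VALUES (?, ?)",
    [
        ("p1", "2001-04-03"),
        ("p2", "2005-04-15"),
        ("p3", "1999-09-01"),
        ("p4", None),  # player with no recorded debut
    ],
)

# strftime('%m', ...) returns the month as text ('04'), so cast it to an
# integer before grouping; rows without a debut date are skipped.
debut_months = conn.execute(
    """
    SELECT CAST(strftime('%m', debut) AS INTEGER) AS DebutMonth,
           COUNT(*) AS NumPlayers
    FROM master
    WHERE debut IS NOT NULL
    GROUP BY DebutMonth
    ORDER BY DebutMonth
    """
).fetchall()

print(debut_months)
```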

7. What is the effect of table join order on mean salary for the players listed in the main (master) table? Hint: Perform two different queries, one that joins on playerID in the salary table and another that joins on the same column in the master table. You will have to use left joins for each since right joins are not currently supported with SQLAlchemy.

In [None]:
'''SQL
SELECT
    Salaries.playerID, 
    AVG(Salaries.salary) AS AvgSalary
FROM
    Salaries
  LEFT JOIN
    Master4
  ON 
    Salaries.playerID = Master4.playerID
GROUP BY 
    Salaries.playerID
ORDER BY 
    AvgSalary DESC;
'''    


'''SQL
SELECT
    Master4.playerID, 
    AVG(Salaries.salary) AS AvgSalary
FROM
    Master4
  LEFT JOIN
    Salaries
  ON 
    Salaries.playerID = Master4.playerID
GROUP BY 
    Master4.playerID
ORDER BY 
    AvgSalary DESC NULLS LAST;
'''
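
To see why the left-hand table matters, it can help to run both join orders on a tiny data set where the two tables do not cover the same players. The sketch below uses Python's built-in `sqlite3` instead of the PostgreSQL instance, and every table name and row in it is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE master (playerID TEXT PRIMARY KEY);
    CREATE TABLE salaries (playerID TEXT, yearID INT, salary INT);
    INSERT INTO master VALUES ('p1'), ('p2'), ('p3');
    -- p3 has no salary rows; px has a salary row but no master row.
    INSERT INTO salaries VALUES
        ('p1', 2000, 100), ('p1', 2001, 300),
        ('p2', 2000, 200), ('px', 2000, 1000);
    """
)

# Salaries on the left: every salary row survives, including px.
mean_salaries_left = conn.execute(
    """
    SELECT AVG(s.salary)
    FROM salaries AS s
    LEFT JOIN master AS m ON s.playerID = m.playerID
    """
).fetchone()[0]

# Master on the left: px is dropped, and p3 appears with a NULL salary,
# which AVG ignores.
mean_master_left = conn.execute(
    """
    SELECT AVG(s.salary)
    FROM master AS m
    LEFT JOIN salaries AS s ON s.playerID = m.playerID
    """
).fetchone()[0]

# Per-player means with master on the left keep p3, with no salary.
per_player_master_left = conn.execute(
    """
    SELECT m.playerID, AVG(s.salary) AS AvgSalary
    FROM master AS m
    LEFT JOIN salaries AS s ON s.playerID = m.playerID
    GROUP BY m.playerID
    ORDER BY m.playerID
    """
).fetchall()

print(mean_salaries_left, mean_master_left, per_player_master_left)
```

In this toy data the overall mean differs between the two orders only because of the salary row whose player is missing from the master table; if every salaried player appears in the master table, the two overall means should agree.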