```markdown
# SQL Methods in Python

This tutorial covers the essential methods for working with SQL databases in Python: connecting to a database, executing queries, and fetching results.

---

## 1. Connecting to a Database

To connect to a database in Python, you can use the built-in `sqlite3` module (or another database library if you use a different database system). The entry point is `sqlite3.connect(filename)`; you will also see it written as `sql.connect(filename)` when the module is imported with `import sqlite3 as sql`:

```python
import sqlite3

# Connect to a local SQLite database file
# If the file does not exist, it will be created
connection = sqlite3.connect("example.db")

# Alternatively, use an in-memory database for testing:
# connection = sqlite3.connect(":memory:")
```

This establishes a connection object (often named `connection` or `db`) that you will use to interact with the database.
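
A connection keeps the database file open until it is closed. The sketch below is one way to guarantee that `close()` runs even if a query fails; it assumes an in-memory database (`":memory:"`) as a stand-in for `example.db`, and the variable names are illustrative only.

```python
import sqlite3
from contextlib import closing

# Assumption: ":memory:" stands in for a real file such as "example.db".
with closing(sqlite3.connect(":memory:")) as connection:
    # A trivial query confirms that the connection works.
    version_row = connection.execute("SELECT sqlite_version()").fetchone()
    print("SQLite version:", version_row[0])

# Leaving the block closed the connection; using it again raises ProgrammingError.
try:
    connection.execute("SELECT 1")
    closed = False
except sqlite3.ProgrammingError:
    closed = True
print("Connection closed:", closed)
```

`contextlib.closing` is used here because `with connection:` on its own manages transactions, not the lifetime of the connection.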

---

## 2. Executing SQL Queries

Once you have a connection, you can create a cursor and run SQL commands with `cursor.execute(query)`. The connection also offers a shortcut, `connection.execute(query)` (written `db.execute(query)` when the connection is named `db`), which creates a cursor for you and returns it:

```python
# Create a cursor from the connection
cursor = connection.cursor()

# Create a sample table (if it doesn't already exist)
cursor.execute("""
    CREATE TABLE IF NOT EXISTS users (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT NOT NULL,
        email TEXT
    )
""")

# Insert some data
cursor.execute("""
    INSERT INTO users (name, email)
    VALUES ('Alice', 'alice@example.com')
""")

# Don't forget to commit your changes after INSERT/UPDATE/DELETE operations
connection.commit()
```
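
When the values come from variables rather than literals, placeholders are the safer route. The following is a minimal sketch, assuming a throwaway in-memory database and hypothetical sample rows, of inserting several rows with `executemany` and `?` placeholders.

```python
import sqlite3

# Assumption: an in-memory database so the example leaves no file behind.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT NOT NULL,
        email TEXT
    )
""")

# Hypothetical sample rows; each tuple fills the two "?" placeholders in order.
new_users = [
    ("Bob", "bob@example.com"),
    ("Carol", None),  # None is stored as SQL NULL
]
cursor.executemany("INSERT INTO users (name, email) VALUES (?, ?)", new_users)
connection.commit()

# Count the rows to confirm that both inserts were saved.
user_count = cursor.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print("Rows inserted:", user_count)
connection.close()
```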

---

## 3. Fetching Results

After executing a `SELECT` query, you can retrieve results using these methods:

- **`cur.fetchall()`**: Fetches **all** rows as a list of tuples.
- **`cur.fetchone()`**: Fetches the **next** single row.
- **`cur.fetchmany(n)`**: Fetches the **next n** rows.

Example:

```python
# Select all rows
cursor.execute("SELECT * FROM users")

# Fetch all rows as a list of tuples
all_rows = cursor.fetchall()
print("All rows:", all_rows)

# Select rows again
cursor.execute("SELECT * FROM users")

# Fetch one row at a time
first_row = cursor.fetchone()
print("First row:", first_row)

# fetchone() returns None once no rows remain (for example, when the table holds only one row)
second_row = cursor.fetchone()
print("Second row:", second_row)

# Fetch many rows (for example, 2 rows)
cursor.execute("SELECT * FROM users")
some_rows = cursor.fetchmany(2)
print("First 2 rows:", some_rows)
```

In real-world code, `fetchmany(n)` lets you process a large result set in fixed-size chunks instead of loading every row into memory at once, while `fetchone()` suits queries that return a single row.
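
Chunked processing with `fetchmany` can be sketched as a loop that stops when an empty list comes back. The table `numbers` and the chunk size of 4 are assumptions chosen only for illustration.

```python
import sqlite3

# Assumption: a small in-memory table of ten integers stands in for a large dataset.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE numbers (n INTEGER)")
cursor.executemany("INSERT INTO numbers (n) VALUES (?)", [(i,) for i in range(10)])
connection.commit()

cursor.execute("SELECT n FROM numbers ORDER BY n")
chunk_sizes = []
total = 0
while True:
    chunk = cursor.fetchmany(4)  # at most 4 rows per call
    if not chunk:                # an empty list means the result set is exhausted
        break
    chunk_sizes.append(len(chunk))
    total += sum(row[0] for row in chunk)

print("Chunk sizes:", chunk_sizes)
print("Sum of all values:", total)
connection.close()
```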

---

## Summary

- **`sqlite3.connect(filename)`** (or **`sql.connect(filename)`** after `import sqlite3 as sql`): Establish a connection to an SQLite database.
- **`cursor.execute(query)`** or the connection shortcut **`db.execute(query)`**: Send SQL commands (like `CREATE TABLE`, `INSERT`, `SELECT`, etc.).
- **`cur.fetchall()`**: Retrieve all query results at once.
- **`cur.fetchone()`**: Retrieve the next single row of the current query result.
- **`cur.fetchmany(n)`**: Retrieve the next `n` rows from the current query result.

These methods form the foundation for basic SQL operations in Python. By mastering them, you’ll be ready to build and manage relational database applications efficiently.

```


```markdown

---

## 4. Setting Row Factory

By default, rows returned from a SQLite cursor are tuples. If you prefer dictionary-like access where columns can be accessed by name, you can set a row factory on the database connection:

```python
import sqlite3

# Connect to the database
db = sqlite3.connect("example.db")

# Setting the row factory to sqlite3.Row allows us to access columns by name
db.row_factory = sqlite3.Row

cursor = db.cursor()
cursor.execute("SELECT * FROM users")

# Now each row behaves like a dictionary
for row in cursor:
    print(row["name"], row["email"])
```

When `db.row_factory = sqlite3.Row` is used, each row supports both index-based and key-based access to its columns.
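
A short sketch of both access styles follows, assuming an in-memory database and one hypothetical row. It also shows `keys()` and the conversion of a row to a plain `dict`.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
db.row_factory = sqlite3.Row      # set before creating the cursor
cursor = db.cursor()
cursor.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
cursor.execute(
    "INSERT INTO users (name, email) VALUES (?, ?)",
    ("Alice", "alice@example.com"),  # hypothetical sample row
)

row = cursor.execute("SELECT id, name, email FROM users").fetchone()
by_index = row[1]       # positional access, as with a tuple
by_name = row["name"]   # access by column name
columns = row.keys()    # list of column names
as_dict = dict(row)     # copy into a regular dictionary

print(by_index, by_name, columns, as_dict)
db.close()
```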

---

## 5. Reading SQL Queries with Pandas

If you have the **pandas** library installed, you can directly read SQL query results into a Pandas DataFrame. This can be more convenient for data analysis and manipulation:

```python
import pandas as pd
import sqlite3

# Connect to the database
db = sqlite3.connect("example.db")

# Example query
query = "SELECT * FROM users"

# Read the query result into a Pandas DataFrame
df = pd.read_sql(query, db)

print(df.head())  # Prints the first few rows of the DataFrame
```

Reading results into Pandas makes it easy to use DataFrame operations, plotting, and other data analysis features.

---

## Summary

1. **Setting Row Factory**:  
   ```python
   db.row_factory = sqlite3.Row
   ```
   This allows dictionary-like access to rows returned from SQL queries.

2. **Reading SQL Queries with Pandas**:  
   ```python
   df = pd.read_sql(query, db)
   ```
   This reads query results into a DataFrame, ideal for data analysis tasks.

Combine these techniques with the methods from the main tutorial to manage and analyze your data efficiently in Python.
```


```markdown
# Python + SQL: A Concise Tutorial with Comments

This tutorial demonstrates how to use common SQL commands in Python (with **SQLite**) while including **comments** to explain each step. We’ll show **SQL snippets** alongside **Python code** so you can see how to execute and retrieve results in your code.

---

## Setup

Below we connect to an SQLite database named `example.db`. We’ll also create a cursor object and optionally enable dictionary-like row access by setting the `row_factory`.

```python
import sqlite3

# Connect to (or create) an SQLite database file named "example.db"
db = sqlite3.connect("example.db")

# This optional setting makes returned rows behave like dictionaries,
# allowing access by column name instead of numeric indexes.
# Set it before creating the cursor: a cursor copies the connection's
# row_factory at the moment it is created.
db.row_factory = sqlite3.Row

# Cursor is used to run SQL commands via Python
cursor = db.cursor()
```

For demonstration, assume we have a table named `users`. Here’s how you might create it in **raw SQL** and then in **Python**:

```sql
-- SQL (run independently in an SQL environment)
CREATE TABLE IF NOT EXISTS users (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    email TEXT
);
```

```python
# Equivalent command in Python using sqlite3
cursor.execute("""
CREATE TABLE IF NOT EXISTS users (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    email TEXT
)
""")
db.commit()  # Commit to save changes
```

---

## 1. SELECT

**Goal**: Retrieve data from a table.

**SQL**:
```sql
-- Select all columns (id, name, email) from all rows in "users"
SELECT * FROM users;
```

**Python**:
```python
# Execute a SELECT query
cursor.execute("SELECT * FROM users")

# Fetch all rows returned by the query
rows = cursor.fetchall()

# Iterate and print each row
for row in rows:
    # If row_factory = sqlite3.Row, we can access by column name
    print(row["id"], row["name"], row["email"])
```

---

## 2. LIMIT

**Goal**: Restrict the number of rows returned.

**SQL**:
```sql
-- Select the first 5 rows
SELECT * FROM users
LIMIT 5;
```

**Python**:
```python
# Limit the result set to 5 rows
cursor.execute("SELECT * FROM users LIMIT 5")

# Fetch all rows (up to 5, because of the LIMIT)
limited_rows = cursor.fetchall()

# Print the rows to see only 5 entries
print("First 5 rows:", limited_rows)
```

---

## 3. DISTINCT

**Goal**: Return only unique values in a column.

**SQL**:
```sql
-- Return unique (distinct) name values from "users"
SELECT DISTINCT name
FROM users;
```

**Python**:
```python
cursor.execute("SELECT DISTINCT name FROM users")
unique_names = cursor.fetchall()

# Print unique names (each row contains one column: 'name')
print("Distinct names:", unique_names)
```

---

## 4. ORDER BY

**Goal**: Sort rows by a column (ascending by default, or use `DESC` for descending).

**SQL**:
```sql
-- Sort all rows by the "name" column in ascending order
SELECT *
FROM users
ORDER BY name ASC;
```

**Python**:
```python
cursor.execute("SELECT * FROM users ORDER BY name ASC")
ordered_rows = cursor.fetchall()

print("Rows ordered by name:", ordered_rows)
```

---

## 5. WHERE

**Goal**: Filter rows based on a condition.

**SQL**:
```sql
-- Retrieve rows where "email" is NOT NULL
SELECT *
FROM users
WHERE email IS NOT NULL;
```

**Python**:
```python
cursor.execute("SELECT * FROM users WHERE email IS NOT NULL")
filtered_rows = cursor.fetchall()

print("Users with non-null email:", filtered_rows)
```

---

## 6. IN

**Goal**: Filter rows by checking if a column’s value is in a given list.

**SQL**:
```sql
-- Return rows where "name" is either 'Alice' or 'Bob'
SELECT *
FROM users
WHERE name IN ('Alice', 'Bob');
```

**Python**:
```python
# Use parameter substitution to avoid SQL injection
cursor.execute("SELECT * FROM users WHERE name IN (?, ?)", ("Alice", "Bob"))
in_rows = cursor.fetchall()

print("Users named Alice or Bob:", in_rows)
```
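
When the list of values is not known in advance, the number of `?` placeholders has to match its length. One possible sketch, assuming an in-memory table and a hypothetical list `wanted_names`, builds the placeholder string and still passes the values as parameters.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO users (name) VALUES (?)", [("Alice",), ("Bob",), ("Carol",)])

wanted_names = ["Alice", "Carol"]  # hypothetical input of any length

# Only the "?" markers are formatted into the SQL text; the values are bound separately.
placeholders = ", ".join("?" for _ in wanted_names)
query = f"SELECT name FROM users WHERE name IN ({placeholders}) ORDER BY name"
matched = [row[0] for row in db.execute(query, wanted_names)]

print("Matched users:", matched)
db.close()
```

An empty `wanted_names` would produce `IN ()`, so callers may want to handle that case before running the query.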

---

## 7. LIKE

**Goal**: Pattern matching with wildcards (`%` for multiple characters, `_` for a single character).

**SQL**:
```sql
-- Retrieve rows where "email" ends with 'example.com'
SELECT *
FROM users
WHERE email LIKE '%example.com';
```

**Python**:
```python
# Each "?" placeholder is bound to the matching value in the parameter tuple at runtime
cursor.execute("SELECT * FROM users WHERE email LIKE ?", ("%example.com",))
like_rows = cursor.fetchall()

print("Users whose email ends with 'example.com':", like_rows)
```
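
Because `%` and `_` are wildcards, user-supplied text that contains them can match more than intended. The sketch below assumes a hypothetical helper `escape_like` and a backslash as the escape character, declared to SQLite with the `ESCAPE` clause.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
db.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("a_b@example.com",), ("axb@example.com",)],  # hypothetical sample rows
)

def escape_like(text, escape_char="\\"):
    """Hypothetical helper: prefix the escape character and both wildcards."""
    for ch in (escape_char, "%", "_"):
        text = text.replace(ch, escape_char + ch)
    return text

# Match emails that start with the literal text "a_b".
pattern = escape_like("a_b") + "%"
rows = db.execute(
    "SELECT email FROM users WHERE email LIKE ? ESCAPE '\\'",
    (pattern,),
).fetchall()

print("Literal match:", rows)
db.close()
```

Without the escaping step, the pattern `a_b%` would also match `axb@example.com`, since `_` stands for any single character.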

---

## 8. BETWEEN

**Goal**: Filter rows within a range of values (inclusive).

**SQL**:
```sql
-- Get rows where the "id" is between 1 and 5 (inclusive)
SELECT *
FROM users
WHERE id BETWEEN 1 AND 5;
```

**Python**:
```python
cursor.execute("SELECT * FROM users WHERE id BETWEEN ? AND ?", (1, 5))
between_rows = cursor.fetchall()

print("Users with IDs 1 to 5:", between_rows)
```

---

## 9. AS (Aliasing)

**Goal**: Rename columns (or tables) in the query result for readability.

**SQL**:
```sql
-- Select "name" but rename it to "username" in the result
SELECT name AS username, email AS user_email
FROM users;
```

**Python**:
```python
cursor.execute("SELECT name AS username, email AS user_email FROM users")
aliased_rows = cursor.fetchall()

# If row_factory = sqlite3.Row, we can access aliased columns by their new names
for row in aliased_rows:
    print(row["username"], row["user_email"])
```

---

## 10. GROUP BY / HAVING

**Goal**: Aggregate rows by a column and optionally filter the aggregated results with `HAVING`.

**SQL**:
```sql
-- Group rows by "email", counting how many times each appears,
-- and only show email groups that appear more than once
SELECT email, COUNT(*) AS cnt
FROM users
GROUP BY email
HAVING cnt > 1;
```

**Python**:
```python
cursor.execute("""
SELECT email, COUNT(*) AS cnt
FROM users
GROUP BY email
HAVING cnt > 1
""")
grouped_rows = cursor.fetchall()

for row in grouped_rows:
    print("Email:", row["email"], "Count:", row["cnt"])
```
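
To see the effect of `HAVING` on concrete data, here is a self-contained sketch. It assumes an in-memory `users` table with three hypothetical rows, two of which share an email address.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
db.row_factory = sqlite3.Row
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT)")
db.executemany(
    "INSERT INTO users (name, email) VALUES (?, ?)",
    [
        ("Alice", "shared@example.com"),
        ("Bob", "shared@example.com"),
        ("Carol", "carol@example.com"),
    ],
)

# WHERE filters rows before grouping; HAVING filters the groups afterwards.
rows = db.execute("""
    SELECT email, COUNT(*) AS cnt
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1
""").fetchall()
duplicates = {row["email"]: row["cnt"] for row in rows}

print("Emails used more than once:", duplicates)
db.close()
```

Writing `HAVING COUNT(*) > 1` instead of `HAVING cnt > 1` avoids depending on alias support in `HAVING`, which differs between database systems.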

---

## 11. JOINs

- **Inner JOIN**: Returns rows only when there’s a matching record in both tables.
- **LEFT JOIN**: Returns all rows from the left table, and the matching rows from the right table.

Assume we have a second table `orders`:

```sql
-- SQL to create "orders" table
CREATE TABLE IF NOT EXISTS orders (
    order_id INTEGER PRIMARY KEY,
    user_id INTEGER,
    total_amount REAL,
    FOREIGN KEY(user_id) REFERENCES users(id)
);
```

```python
# Python equivalent
cursor.execute("""
CREATE TABLE IF NOT EXISTS orders (
    order_id INTEGER PRIMARY KEY,
    user_id INTEGER,
    total_amount REAL,
    FOREIGN KEY(user_id) REFERENCES users(id)
)
""")
db.commit()
```

### Inner JOIN Example

**SQL**:
```sql
-- Retrieve rows that have a matching user in "users" and an order in "orders"
SELECT u.name, o.order_id
FROM users AS u
JOIN orders AS o ON u.id = o.user_id;
```

**Python**:
```python
cursor.execute("""
SELECT u.name, o.order_id
FROM users AS u
JOIN orders AS o
ON u.id = o.user_id
""")
join_rows = cursor.fetchall()

print("Inner JOIN result:", join_rows)
```

### LEFT JOIN Example

**SQL**:
```sql
-- Return all rows from "users" and matching ones from "orders"
SELECT u.name, o.order_id
FROM users AS u
LEFT JOIN orders AS o ON u.id = o.user_id;
```

**Python**:
```python
cursor.execute("""
SELECT u.name, o.order_id
FROM users AS u
LEFT JOIN orders AS o
ON u.id = o.user_id
""")
left_join_rows = cursor.fetchall()

print("LEFT JOIN result:", left_join_rows)
```

> **Note**: SQLite versions before 3.39.0 do **not** support `RIGHT JOIN` or `FULL JOIN`. PostgreSQL supports both; MySQL supports `RIGHT JOIN` but not `FULL JOIN`.
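
The difference between the two join types is easiest to see with a user who has no orders. This sketch assumes simplified in-memory `users` and `orders` tables and hypothetical rows.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
db.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, total_amount REAL)"
)
db.executemany("INSERT INTO users (id, name) VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
# Only Alice has an order; Bob has none.
db.execute("INSERT INTO orders (order_id, user_id, total_amount) VALUES (10, 1, 25.0)")

inner_rows = db.execute("""
    SELECT u.name, o.order_id
    FROM users AS u
    JOIN orders AS o ON u.id = o.user_id
    ORDER BY u.id
""").fetchall()

left_rows = db.execute("""
    SELECT u.name, o.order_id
    FROM users AS u
    LEFT JOIN orders AS o ON u.id = o.user_id
    ORDER BY u.id
""").fetchall()

print("Inner JOIN:", inner_rows)  # Bob is dropped
print("LEFT JOIN:", left_rows)    # Bob is kept with order_id None
db.close()
```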

---

## 12. Subqueries

**Goal**: Use one query’s result inside another query.

**SQL**:
```sql
-- Select from "users" only if their "id" appears in a subquery
-- that checks orders over 100.
SELECT *
FROM users
WHERE id IN (
    SELECT user_id
    FROM orders
    WHERE total_amount > 100
);
```

**Python**:
```python
cursor.execute("""
SELECT *
FROM users
WHERE id IN (
    SELECT user_id
    FROM orders
    WHERE total_amount > 100
)
""")
subquery_rows = cursor.fetchall()

print("Subquery result:", subquery_rows)
```

---

## 13. CREATE TABLE

**Goal**: Define a new table. (Shown earlier for `users` and `orders`.)

**SQL**:
```sql
-- Example for an "orders" table
CREATE TABLE IF NOT EXISTS orders (
    order_id INTEGER PRIMARY KEY,
    user_id INTEGER,
    total_amount REAL,
    FOREIGN KEY (user_id) REFERENCES users(id)
);
```

**Python**:
```python
cursor.execute("""
CREATE TABLE IF NOT EXISTS orders (
    order_id INTEGER PRIMARY KEY,
    user_id INTEGER,
    total_amount REAL,
    FOREIGN KEY (user_id) REFERENCES users(id)
)
""")
db.commit()
```

---

## 14. UPDATE

**Goal**: Modify existing rows in a table.

**SQL**:
```sql
-- Change "email" for rows where "name" is "Alice"
UPDATE users
SET email = 'updated@example.com'
WHERE name = 'Alice';
```

**Python**:
```python
cursor.execute("UPDATE users SET email = ? WHERE name = ?", ("updated@example.com", "Alice"))
db.commit()  # Always commit changes
```

---

## 15. DELETE

**Goal**: Remove rows from a table.

**SQL**:
```sql
-- Delete rows where the "name" is "Alice"
DELETE FROM users
WHERE name = 'Alice';
```

**Python**:
```python
cursor.execute("DELETE FROM users WHERE name = ?", ("Alice",))
db.commit()  # Commit the deletion
```
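
After an `UPDATE` or `DELETE`, `cursor.rowcount` reports how many rows the statement changed, which helps catch a `WHERE` clause that matched nothing. A minimal sketch, assuming an in-memory table and hypothetical rows:

```python
import sqlite3

db = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
cursor = db.cursor()
cursor.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT)")
cursor.executemany(
    "INSERT INTO users (name, email) VALUES (?, ?)",
    [("Alice", "a@example.com"), ("Alice", "a2@example.com"), ("Bob", "b@example.com")],
)

# Two rows are named Alice, so two rows are updated.
cursor.execute("UPDATE users SET email = ? WHERE name = ?", ("updated@example.com", "Alice"))
updated = cursor.rowcount

# Nobody matches this name, so nothing is deleted.
cursor.execute("DELETE FROM users WHERE name = ?", ("Nobody",))
deleted = cursor.rowcount

db.commit()
print("Rows updated:", updated, "Rows deleted:", deleted)
db.close()
```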

---

## 16. DROP TABLE

**Goal**: Permanently remove a table definition and all its data.

**SQL**:
```sql
-- Delete the "orders" table
DROP TABLE orders;
```

**Python**:
```python
cursor.execute("DROP TABLE orders")
db.commit()
```

---

## 17. COMMIT

- **Purpose**: Finalize changes in the database.  
- In many DBMSes, you run `COMMIT;` as an SQL statement. In Python’s `sqlite3`, call `db.commit()` on the connection.

**SQL**:
```sql
-- In some SQL environments, you'd explicitly do:
COMMIT;
```

**Python**:
```python
# In Python with sqlite3, just call commit on the connection
db.commit()
```
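
As an alternative to calling `db.commit()` by hand, a `sqlite3` connection can be used as a context manager: the block commits on success and rolls back if an exception escapes. The sketch below assumes an in-memory table and uses a `NOT NULL` violation to trigger the rollback.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
db.commit()

# Success path: the insert is committed when the block exits.
with db:
    db.execute("INSERT INTO users (name) VALUES (?)", ("Alice",))

# Failure path: the second insert violates NOT NULL, so the whole block is rolled back.
try:
    with db:
        db.execute("INSERT INTO users (name) VALUES (?)", ("Bob",))
        db.execute("INSERT INTO users (name) VALUES (?)", (None,))
except sqlite3.IntegrityError:
    pass

names = [row[0] for row in db.execute("SELECT name FROM users ORDER BY id")]
print("Committed names:", names)
db.close()
```

Note that `with db:` does not close the connection; `db.close()` is still needed.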

---

## Wrap-Up & Tips

1. **Always commit** after `INSERT`, `UPDATE`, or `DELETE` queries to save changes.
2. **Use parameter substitution** (e.g., `cursor.execute("... WHERE x=?", (value,))`) to avoid SQL injection.
3. For **dictionary-like row access**, set `db.row_factory = sqlite3.Row`.
4. SQLite is an easy starting point, but the same SQL statements apply broadly to other DBMSes (e.g., PostgreSQL, MySQL) with slight syntax differences.

With these commands and usage patterns, you can **create** and **manage** database structures, **insert/update/delete** data, **query** tables with powerful filtering, grouping, and joins, and **commit** changes to keep everything consistent.
```


```markdown
# Example Python Code to Answer Questions Using Lahman’s Baseball Database

Below is an example of how you can use **pandas** (`read_sql`) and **sqlite3** to answer the following questions from the Lahman baseball database (`../lahmansbaseballdb.sqlite`):

1. **Report the ten players that had the most home runs in their entire career.**  
2. **Report the top three schools that produced the most players who ended up managing.**  
3. **How many players went to UCD?**  

> **Note**: The Lahman database contains multiple tables, including:
> - `People` (biographical info on players)
> - `Batting` (batting stats, including home runs)
> - `Managers` (records of managers, referencing `playerID` for those who also played)
> - `CollegePlaying` (links players to schools)
> - `Schools` (school details, including `name_full`, `city`, etc.)

## Full Code

```python
import sqlite3 as sql
from pandas import read_sql

# 1. Connect to the Lahman database
db = sql.connect("../lahmansbaseballdb.sqlite")

# 2. Q1: Ten players with the most career home runs
query_1 = """
SELECT 
    p.playerID,
    p.nameFirst,
    p.nameLast,
    SUM(b.HR) AS careerHR
FROM People p
JOIN Batting b USING (playerID)
GROUP BY p.playerID
ORDER BY careerHR DESC
LIMIT 10;
"""
df_q1 = read_sql(query_1, db)
print("Top 10 players by career HR:")
print(df_q1)
print()

# Explanation:
# - We join People and Batting on playerID to get each player's HR stats.
# - SUM(b.HR) computes total home runs across all seasons.
# - GROUP BY ensures we sum per player.
# - ORDER BY careerHR DESC sorts by descending order of total HR.
# - LIMIT 10 restricts the output to the top ten.

# 3. Q2: Top three schools that produced the most players who ended up managers
query_2 = """
SELECT 
    s.name_full AS schoolName,
    COUNT(DISTINCT p.playerID) AS managerCount
FROM People p
JOIN Managers m ON p.playerID = m.playerID
JOIN CollegePlaying cp ON p.playerID = cp.playerID
JOIN Schools s ON cp.schoolID = s.schoolID
GROUP BY s.name_full
ORDER BY managerCount DESC
LIMIT 3;
"""
df_q2 = read_sql(query_2, db)
print("Top 3 schools producing the most future managers:")
print(df_q2)
print()

# Explanation:
# - Managers table references playerID if the manager was also a player.
# - CollegePlaying links a playerID to the school(s) they attended.
# - Schools table has the full school name (name_full).
# - We group by the full school name (s.name_full) to count how many distinct playerIDs
#   from that school ended up managing.
# - We take the top 3 by ordering descending and limiting to 3.

# 4. Q3: How many players went to UCD?
query_3 = """
SELECT 
    COUNT(DISTINCT cp.playerID) AS UCD_Players
FROM CollegePlaying cp
JOIN Schools s ON cp.schoolID = s.schoolID
WHERE s.name_full LIKE '%Davis%';
"""
df_q3 = read_sql(query_3, db)
print("Number of players who attended UCD (University of California, Davis):")
print(df_q3)
print()

# Explanation:
# - We join CollegePlaying and Schools on schoolID to link players to school info.
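# - We filter for schools whose full name (name_full) includes "Davis". This pattern can also
#   match other schools with "Davis" in the name, so inspect the matching rows if an exact count matters.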
# - COUNT(DISTINCT cp.playerID) ensures we count unique players who attended UCD.

# (Optional) Close the database connection when done
db.close()
```

## How It Works

1. **Database Connection**: We establish a connection to the Lahman SQLite file (`../lahmansbaseballdb.sqlite`) with `sql.connect(...)`.
2. **read_sql Function**: We use `pandas.read_sql(query, db)` to run SQL queries and return results as DataFrames, which are easy to inspect and print.
3. **Aggregations & Joins**: 
   - **JOIN** statements allow us to combine data from multiple tables.  
   - **GROUP BY** and **aggregate functions** (like `SUM()`, `COUNT()`) let us compute totals or counts per group (in this case, per player or per school).  
4. **Filtering / Searching**:
   - We use `LIMIT` to restrict the top results (e.g., top 10, top 3).
   - We use `LIKE '%Davis%'` to find any school name that contains “Davis.”
   - We use `COUNT(DISTINCT ...)` to ensure we don’t count the same player multiple times.
5. **Closing Connection**: It is good practice to close the SQLite database connection with `db.close()` once you’re done.

Use or modify these query examples to explore other aspects of the Lahman baseball data!
```


In [1]:
# Import the sqlite3 library for database operations and pandas for handling SQL query results
import sqlite3 as sql  
from pandas import read_sql  

# 1. Connect to the Lahman database
# - Establishes a connection to the SQLite database file located in the given directory.
# - This database contains baseball statistics, including player and team data.
db = sql.connect("/Users/lingyoupang/Downloads/sta141b/Discussion/lahmansbaseballdb.sqlite")  

# 2. Q1: Ten players with the most career home runs
# - Retrieves the top 10 baseball players based on total career home runs.
query_1 = """
SELECT 
    People.playerID,          -- Selects the unique player ID
    People.nameFirst,         -- Selects the player's first name
    People.nameLast,          -- Selects the player's last name
    SUM(Batting.HR) AS careerHR -- Sums the total home runs (HR) per player over their career
FROM People
JOIN Batting USING (playerID)  -- Joins batting statistics with player information using playerID
GROUP BY People.playerID              -- Groups by playerID to calculate total home runs per player
ORDER BY careerHR DESC            -- Sorts players in descending order of career home runs
LIMIT 10;                         -- Limits the output to the top 10 players
"""

# Execute the SQL query and store the results in a Pandas DataFrame
df_q1 = read_sql(query_1, db)  

# Print the results of the top 10 home run hitters
print("Top 10 players by career HR:")  
print(df_q1)  
print()  

# 3. Q2: Top three schools that produced the most players who later became managers
# - Identifies the three schools that produced the most players who later became MLB managers.
query_2 = """
SELECT 
    Schools.name_full AS schoolName,        -- Selects the full school name
    COUNT(DISTINCT People.playerID) AS managerCount  -- Counts the number of unique players who became managers
FROM People
JOIN Managers ON People.playerID = Managers.playerID   -- Joins the Managers table to link players to manager records
JOIN CollegePlaying ON People.playerID = CollegePlaying.playerID  -- Links players to the colleges they attended
JOIN Schools ON CollegePlaying.schoolID = Schools.schoolID  -- Links colleges to their full names
GROUP BY Schools.name_full              -- Groups results by school name to aggregate manager counts
ORDER BY managerCount DESC         -- Sorts schools in descending order by manager count
LIMIT 3;                           -- Retrieves only the top 3 schools
"""

# Execute the query and store the results in a Pandas DataFrame
df_q2 = read_sql(query_2, db)  

# Print the results of the top 3 schools that produced the most MLB managers
print("Top 3 schools producing the most future managers:")  
print(df_q2)  
print()  

# Explanation:
# - The `Managers` table references `playerID` if the manager was also a player.
# - The `CollegePlaying` table links each playerID to the school(s) they attended.
# - The `Schools` table contains the full school names (`name_full`).
# - We group by `name_full` to count how many distinct playerIDs from each school became managers.
# - We then order the results in descending order and limit the output to the top 3 schools.

# 4. Q3: How many players went to UC Davis?
# - Counts the number of unique baseball players who attended the University of California, Davis.
query_3 = """
SELECT 
    COUNT(DISTINCT CollegePlaying.playerID) AS UCD_Players  -- Counts the number of unique players who attended UCD
FROM CollegePlaying
JOIN Schools ON CollegePlaying.schoolID = Schools.schoolID  -- Joins the CollegePlaying and Schools tables on schoolID
WHERE Schools.name_full LIKE '%Davis%';  -- Filters for schools whose name contains 'Davis'
"""

# Execute the query and store the results in a Pandas DataFrame
df_q3 = read_sql(query_3, db)  

# Print the number of players who attended UC Davis
print("Number of players who attended UCD (University of California, Davis):")  
print(df_q3)  
print()  

# Explanation:
# - The `CollegePlaying` table links each player to their college.
# - The `Schools` table contains the full names of schools.
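# - We use `LIKE '%Davis%'` to find schools with "Davis" in their name. This can also match
#   schools other than UC Davis, so check the matching school names if an exact count matters.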
# - `COUNT(DISTINCT CollegePlaying.playerID)` ensures that each player is only counted once.

# (Optional) Close the database connection when done
db.close()  


Top 10 players by career HR:
    playerID nameFirst   nameLast  careerHR
0  bondsba01     Barry      Bonds       762
1  aaronha01      Hank      Aaron       755
2   ruthba01      Babe       Ruth       714
3  rodrial01      Alex  Rodriguez       696
4   mayswi01    Willie       Mays       660
5  pujolal01    Albert     Pujols       656
6  griffke02       Ken    Griffey       630
7  thomeji01       Jim      Thome       612
8   sosasa01     Sammy       Sosa       609
9  robinfr02     Frank   Robinson       586

Top 3 schools producing the most future managers:
                      schoolName  managerCount
0  University of Texas at Austin             4
1         University of Michigan             4
2           Villanova University             3

Number of players who attended UCD (University of California, Davis):
   UCD_Players
0            9



In [None]:
# Step 1: Connect to the SQLite database
# - We establish a connection to 'lahmansbaseballdb.sqlite'.
# - This allows us to execute SQL queries on the database.
db = sql.connect("../data/lahmansbaseballdb.sqlite")


# Step 2: Compute total payroll for each team per year (Subquery 1)
# - We extract salary data from the 'salaries' table.
# - We sum the salaries for each team in each year (1990-1999).
# - This results in a table where each row represents:
#   * A specific team
#   * The year they played
#   * The total payroll they had for that year
#
# Example output from this step:
# | yearid | teamid | payroll_sum  |
# |--------|--------|--------------|
# | 1990   | NYY    | 23,000,000   |
# | 1990   | BOS    | 21,500,000   |
# | 1990   | LAD    | 22,000,000   |
# | 1991   | NYY    | 36,000,000   |
# | 1991   | BOS    | 34,000,000   |
#
query_1 = """
    SELECT SUM(salary) AS payroll_sum, yearid, teamid
    FROM salaries
    WHERE yearid BETWEEN 1990 AND 1999
    GROUP BY yearid, teamid
"""


# Step 3: Find the maximum payroll per year (Subquery 2)
# - We take the result from Step 2 and determine the highest payroll per year.
# - This gives us one row per year, showing the highest payroll but NOT the team name yet.
# - We use MAX(payroll_sum) to find the highest payroll in each year.
#
# Example output from this step:
# | year  | payroll     |
# |-------|------------|
# | 1990  | 23,000,000 |
# | 1991  | 36,000,000 |
#
# The f-string inlines the text of query_1 as a subquery.
query_2 = f"""
    SELECT yearid AS year, MAX(payroll_sum) AS payroll
    FROM ({query_1}) AS salary_summary
    GROUP BY yearid
"""


# Step 4: Retrieve the team name associated with the highest payroll (Final query)
# - Now, we need to **match the correct team** that had this payroll.
# - We join the previous result with the 'teams' table to get the team names.
# - We make sure to match the correct year and the correct maximum payroll.
#
# - To do this, we:
#   * Join the salary summary with the teams table (matching teamid and yearid)
#   * Ensure that we only keep rows where the payroll is equal to the max payroll per year
#
# Example output from this step:
# | year  | payroll     | team                |
# |-------|------------|---------------------|
# | 1990  | 23,000,000 | New York Yankees    |
# | 1991  | 36,000,000 | Oakland Athletics   |
#
final_query = """
    -- yearid exists in both salary_summary and teams, so it must be qualified
    SELECT salary_summary.yearid AS year, payroll_sum AS payroll, name AS team
    FROM (
        -- Compute total payroll per team per year
        SELECT SUM(salary) AS payroll_sum, yearid, teamid
        FROM salaries
        WHERE yearid BETWEEN 1990 AND 1999
        GROUP BY yearid, teamid
    ) AS salary_summary
    JOIN teams
    ON salary_summary.teamid = teams.teamid AND salary_summary.yearid = teams.yearid
    WHERE payroll_sum = (
        -- Find the max payroll per year
        SELECT MAX(payroll_sum)
        FROM (
            SELECT SUM(salary) AS payroll_sum, yearid, teamid
            FROM salaries
            WHERE yearid BETWEEN 1990 AND 1999
            GROUP BY yearid, teamid
        ) AS max_salary_summary
        WHERE max_salary_summary.yearid = salary_summary.yearid
    )
    ORDER BY year;
"""


# Summary of Execution Order:
# 1️⃣ Compute **total payroll per team per year** (SUM salaries, GROUP BY team & year).
# 2️⃣ Determine **highest payroll per year** (MAX payroll from the previous result).
# 3️⃣ Retrieve **team name that matches the highest payroll** using a JOIN with teams.




# Execute an SQL query to find the team with the highest payroll for each year from 1990 to 1999
df = read_sql('''
    -- Select the year, maximum payroll, and corresponding team name
    SELECT yearid AS year, MAX(payroll_sum) AS payroll, name AS team
    FROM (
        -- First, calculate the total payroll for each team per year
        SELECT SUM(salary) AS payroll_sum, yearid, teamid
        FROM salaries
        WHERE yearid BETWEEN 1990 AND 1999 -- Filter only years between 1990 and 1999
        GROUP BY yearid, teamid -- Group the data by year and team to calculate total payroll per team per year
    ) AS salary_summary
    LEFT JOIN (
        -- Retrieve the team names for the corresponding team IDs and seasons
        SELECT teamid, yearid AS team_year, name
        FROM teams
        WHERE yearid BETWEEN 1990 AND 1999 -- Use the same 1990-1999 window as the payroll data
    ) AS team_names
    ON salary_summary.teamid = team_names.teamid
       AND salary_summary.yearid = team_names.team_year -- Match team names by team and season
    GROUP BY yearid -- One row per year; with MAX(), SQLite takes the bare column "name" from the max-payroll row
''', db)

# Display the DataFrame
print(df)


In [None]:
# Step 1: Connect to the SQLite database
# - Establish a connection to 'lahmansbaseballdb.sqlite'.
# - This allows us to execute SQL queries on the database.
db = sql.connect("../data/lahmansbaseballdb.sqlite")


# Step 2: Retrieve salary data for the years 1990-1999
# - We extract all salary records from the 'salaries' table.
# - We filter the data to include only salaries between 1990 and 1999.
# - We sort the salaries in descending order so the highest-paid players appear first.
#
# Example output from this step:
# | yearid | playerID | salary     |
# |--------|---------|------------|
# | 1990   | p001    | 3,200,000  |
# | 1990   | p002    | 3,000,000  |
# | 1990   | p003    | 2,500,000  |
# | 1991   | p004    | 3,800,000  |
# | 1991   | p005    | 3,500,000  |
#
query_1 = """
    SELECT * 
    FROM salaries
    WHERE yearid BETWEEN 1990 AND 1999
    ORDER BY salary DESC
"""


# Step 3: Identify the highest-paid player per year
# - We take the result from Step 2 and group it by year.
# - A plain `GROUP BY yearid` does not guarantee which row SQLite keeps, so sorting alone is not enough.
# - Selecting MAX(salary) makes SQLite take the bare column playerID from the row that holds
#   the maximum (this is SQLite-specific behavior).
# - This ensures we only keep **one row per year**.
#
# Example output from this step:
# | yearid | playerID | salary     |
# |--------|---------|------------|
# | 1990   | p001    | 3,200,000  |
# | 1991   | p004    | 3,800,000  |
# | 1992   | p006    | 6,100,000  |
#
# The f-string inlines the text of query_1 as a subquery.
query_2 = f"""
    SELECT playerID, yearid AS year, MAX(salary) AS salary
    FROM ({query_1})
    GROUP BY yearid  -- Keep only the highest-paid player per year
"""


# Step 4: Retrieve player names using a LEFT JOIN with 'people' table
# - The salaries table only contains `playerID`, so we need to get their actual name.
# - We do a `LEFT JOIN` between the filtered salary table (query_2) and the 'people' table.
# - The 'people' table contains player details (like first and last names).
#
# Example output from this step:
# | year  | name        | salary     |
# |-------|------------|------------|
# | 1990  | Yount      | 3,200,000  |
# | 1991  | Strawberry | 3,800,000  |
# | 1992  | Bonilla    | 6,100,000  |
#
# The f-string inlines the text of query_2 as a subquery.
final_query = f"""
    SELECT year, nameLast AS name, salary
    FROM ({query_2}) AS sal
    LEFT JOIN people
    ON sal.playerid = people.playerid
"""


# Step 5: Execute the final SQL query and store the results in a DataFrame
data0 = read_sql(final_query, db)

# Step 6: Display the DataFrame to verify results
print(data0)


In [None]:
# Step 1: Connect to the SQLite database
# - We establish a connection to 'lahmansbaseballdb.sqlite'
# - This allows us to execute SQL queries on the database.
db = sql.connect("../data/lahmansbaseballdb.sqlite")


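# Step 2: Collect salary records for players who topped a year (1990-1999)
# - The inner subquery tries to pick the **highest-paid** player in each year from 1990 to 1999:
#   * Retrieve all salary records from the 'salaries' table for those years
#   * Sort the salaries in descending order (ORDER BY salary DESC)
#   * Use `GROUP BY yearid` to keep one player per year (SQLite does not guarantee that
#     this is the top row of the sort; selecting MAX(salary) would make it reliable)
# - The outer query then returns every salary record for those players, in any year,
#   so the result can hold more than one row per year.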
#
# Example output from this step:
# | yearid | playerid | salary     |
# |--------|---------|------------|
# | 1990   | p001    | 3,200,000  |
# | 1991   | p002    | 3,800,000  |
# | 1992   | p003    | 6,100,000  |
# | 1993   | p003    | 6,200,000  |
#
query_1 = """
    SELECT playerid, yearid AS year, salary
    FROM salaries
    WHERE playerid IN (
        -- Step 2: Find the player with the highest salary for each year
        SELECT DISTINCT playerid  -- Ensure we get only one unique player per year
        FROM (
            -- Get all salary records for years 1990-1999 and order them by salary
            SELECT * 
            FROM salaries
            WHERE yearid BETWEEN 1990 AND 1999
            ORDER BY salary DESC
        )
        GROUP BY yearid  -- Keep only the highest-paid player per year
    )
"""


# Step 3: Retrieve player names using a LEFT JOIN with 'people' table
# - The salaries table only contains `playerid`, so we need to get their actual name.
# - We do a `LEFT JOIN` between the filtered salary table (query_1) and the 'people' table.
# - The 'people' table contains player details (like first and last names).
#
# Example output from this step:
# | year  | name      | salary     |
# |-------|----------|------------|
# | 1990  | Yount    | 3,200,000  |
# | 1991  | Strawberry | 3,800,000  |
# | 1992  | Bonilla  | 6,100,000  |
# | 1993  | Bonilla  | 6,200,000  |
#
# The f-string inlines the text of query_1 as a subquery.
final_query = f"""
    SELECT year, nameLast AS name, salary
    FROM ({query_1}) AS sal
    LEFT JOIN people
    ON sal.playerid = people.playerid
"""


# Step 4: Execute the final SQL query and store the results in a DataFrame
data1 = read_sql(final_query, db)

# Step 5: Print the number of rows in the dataset
# - This counts every salary record, across all years, that belongs to a player
#   selected by the subquery, so it is larger than the number of years.
print(data1.shape[0])  # Example output: 89
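
The highest-paid-player-per-year question can also be answered without relying on which row `GROUP BY` keeps. The sketch below is one possible approach, not the notebook's own: it joins `salaries` to a subquery holding each year's `MAX(salary)`. The in-memory table and its four rows are hypothetical stand-ins for the Lahman data.

```python
import sqlite3

# Assumption: a tiny in-memory "salaries" table with made-up player IDs and amounts.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE salaries (yearid INTEGER, playerid TEXT, salary INTEGER)")
db.executemany(
    "INSERT INTO salaries VALUES (?, ?, ?)",
    [
        (1990, "p001", 3200000),
        (1990, "p002", 3000000),
        (1991, "p002", 3500000),
        (1991, "p004", 3800000),
    ],
)

# Step 1 (subquery m): the maximum salary in each year.
# Step 2 (join): keep the salary rows that equal their year's maximum.
top_paid = db.execute("""
    SELECT s.yearid, s.playerid, s.salary
    FROM salaries AS s
    JOIN (
        SELECT yearid, MAX(salary) AS max_salary
        FROM salaries
        WHERE yearid BETWEEN 1990 AND 1999
        GROUP BY yearid
    ) AS m
    ON s.yearid = m.yearid AND s.salary = m.max_salary
    ORDER BY s.yearid
""").fetchall()

print(top_paid)
db.close()
```

If two players tie for the top salary in a year, this form returns both rows, whereas a `GROUP BY yearid` query returns only one of them.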
