# Homework 8: SQL and PyTorch (21 pts)

name: James Jung

email: Jameshj@umich.edu

This homework assignment took me 15 hours in total to complete. (Please help us to gauge the difficulty of the assignment.)

## Collaboration Disclosure

In the cell below, please list *everyone* with whom you discussed any of the homework problems, excluding only the GSIs and the course instructor. 

If you did not discuss the homework with anyone else, write __"I did not discuss this homework with anyone."__

Even if you discuss questions with others, the code you submit must be only yours. All work is checked with the [MOSS plagiarism detector](https://theory.stanford.edu/~aiken/moss/).

Spoke with Mattias Blum

---

## Submission Instructions
Your homework solutions should be written entirely in this Jupyter notebook file. Once it contains your solutions, you should submit this notebook through Canvas. 


Before submitting, please make sure that __Cells->Run All__ executes without errors; errors in your code translate directly to point deductions. 
In general, you don't need to explicitly raise errors (e.g. with the ```raise``` statement) if we don't ask you to in the problem statement.
However, even in cases where we ask you to check for errors, your submission should not contain any examples of your functions actually raising those errors.

Note that many parts of this homework where you are expected to type in code will have ```NotImplementedError()``` as a placeholder. You need to delete this placeholder and replace it with your own code.

## Homework tips 

1. **Start early!** If you run into trouble installing things or importing packages, it’s
best to find those problems well in advance, not the night before your assignment is
due when we cannot help you!

2. **Make sure you back up your work!** At a minimum, do your work in a Dropbox
folder. Better yet, use git, which is well worth your time and effort to learn.

3. **Be careful to follow directions!** Remember that Python is case sensitive. If
we ask you to define a function called my_function and you define a function
called My_Function, you will not receive full credit. You may want to copy-paste
the function names below to make sure that the functions in your notebook match.

## Error checking

You do not need to do error checking (raising errors, etc.) in your code unless we explicitly ask you to do so in a problem.


## Nbgrader

We will be using `nbgrader` to grade your Jupyter notebook. You will notice some `read-only` cells in the assignment that contain `assert` statements. These are tests that your code must pass for your solution to be correct. If any of the tests fail, you will get a Python error and not get points for that question. 

**Note:** The tests shown are not comprehensive; additional tests will be used at grading time. You are encouraged to read the problem carefully and verify your code covers all possible cases.

**Be careful:** If a Jupyter notebook cell takes longer than `60s` to run, the autograder will not grade it and you will receive zero credit for that question.

## Question 1: Basic SQL (13 pts)
In this problem, you'll interact with a toy SQL database using Python's
built-in `sqlite3` package. Documentation can be found at
<https://docs.python.org/3/library/sqlite3.html>. For this problem,
we'll use a popular toy SQLite database, called Chinook, which
represents a digital music collection. See the documentation at <https://github.com/lerocha/chinook-database/blob/master/README.md>
for a more detailed explanation. We'll use the `chinook.sqlite` file:

In [1]:
import sqlite3
con = sqlite3.connect('chinook.sqlite')
cur = con.cursor()

**1(a)** (2 pt) Load the database using the Python `sqlite3` package. How many tables are in the database? Save the answer in the variable `n_tables`.
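
As a rough sketch of how table metadata can be inspected, the snippet below queries `sqlite_master` on a throwaway in-memory database. The table `demo` and the index `demo_label_idx` are hypothetical and are not part of Chinook. It also illustrates that SQLite can create internal tables of its own (names starting with `sqlite_`, such as `sqlite_sequence` for `AUTOINCREMENT` columns), which a plain `type = 'table'` filter will include.

```python
import sqlite3

# In-memory database with a hypothetical table, used only for illustration.
demo_con = sqlite3.connect(":memory:")
demo_con.execute(
    "CREATE TABLE demo (id INTEGER PRIMARY KEY AUTOINCREMENT, label TEXT)"
)
demo_con.execute("CREATE INDEX demo_label_idx ON demo (label)")

# sqlite_master holds one row per schema object; `type` separates
# tables from indexes, views, and triggers.
all_tables = {
    row[0]
    for row in demo_con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    )
}

# Internal bookkeeping tables can be filtered out by their name prefix.
user_tables = {
    row[0]
    for row in demo_con.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    )
}

print(all_tables)
print(user_tables)
```

Whether internal tables should be counted depends on the question being asked; for a database without such tables the two queries return the same set.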

In [2]:
# YOUR CODE HERE
query = "SELECT COUNT(name) FROM sqlite_master WHERE type='table';"
n_tables = con.execute(query).fetchone()[0]

print(f"Number of tables: {n_tables}")

Number of tables: 11


In [3]:
assert n_tables == 11

**1(b)** (2 pts) What are the names of the tables in the database? Save the answer as
    a set of strings, `table_names`. **Note:** you should write Python `sqlite3`
    code to answer this; don't just look up the answer in the
    documentation!

In [4]:
# YOUR CODE HERE
query = "SELECT name FROM sqlite_master WHERE type='table';"
table_names = {row[0] for row in con.execute(query).fetchall()}

print(f"Table names: {table_names}")

Table names: {'Track', 'InvoiceLine', 'Album', 'MediaType', 'Genre', 'Employee', 'PlaylistTrack', 'Playlist', 'Customer', 'Artist', 'Invoice'}


In [5]:
expected = {'Album', 'Genre', 'Playlist', 'PlaylistTrack', 'Employee', 'Customer', 'InvoiceLine', 'Track', 'Artist', 'MediaType', 'Invoice'}
assert table_names == expected

**1(c)** (3 pts) Write a function `albums_starting_with(c)` that takes as an argument a single character `c` and
    returns a list of the primary keys (AlbumIds) of all the albums whose titles
    start with that character. Your function should ignore case, so that
    the inputs "a" and "A" yield the same results. Include error
    checking that raises an error in the event that the input is not a
    single character.
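
One detail worth keeping in mind for prefix matching: SQLite's `LIKE` is case-insensitive for ASCII characters by default, and it treats `%` and `_` in the pattern as wildcards. The sketch below uses a small in-memory `Album` table with made-up rows, and a hypothetical helper `ids_with_prefix`, to show one way of escaping those wildcards so that inputs such as `"_"` are matched literally. It is an illustration under those assumptions, not a required part of the solution.

```python
import sqlite3

# Tiny stand-in for the Album table; the rows are invented for illustration.
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY, Title TEXT)")
demo_con.executemany(
    "INSERT INTO Album (AlbumId, Title) VALUES (?, ?)",
    [(1, "Abbey"), (2, "apple"), (3, "Blue"), (4, "_hidden")],
)


def ids_with_prefix(prefix):
    """Return AlbumIds whose Title starts with `prefix`, ignoring ASCII case."""
    # Escape the escape character first, then the two LIKE wildcards.
    escaped = (
        prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    query = (
        "SELECT AlbumId FROM Album "
        "WHERE Title LIKE ? ESCAPE '\\' ORDER BY AlbumId"
    )
    return [row[0] for row in demo_con.execute(query, (escaped + "%",))]


print(ids_with_prefix("a"))  # [1, 2]
print(ids_with_prefix("A"))  # [1, 2]
print(ids_with_prefix("_"))  # [4]
print(ids_with_prefix("%"))  # []
```

Without the escaping step, the pattern `_%` would match every non-empty title, since `_` stands for any single character.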


In [16]:
def albums_starting_with(c):
    # YOUR CODE HERE
    # Check that the input c is a string of length 1.
    # The instructions do not say whether the character must be a letter or a digit, so that is not checked.
    if not isinstance(c, str) or len(c) != 1:
        raise ValueError("Input must be a single character.")
    # Select the AlbumId of every album whose title starts with c.
    # SQLite's LIKE is case-insensitive for ASCII characters by default.
    query = "SELECT AlbumId FROM Album WHERE Title LIKE ?;"
    # Each returned row is a 1-tuple, so row[0] is the AlbumId.
    albums = [row[0] for row in cur.execute(query, (f"{c}%",)).fetchall()]
    return albums

In [17]:
res = albums_starting_with('a')
assert type(res) == list
assert len(res) == 32
for i in [10, 14, 15, 24, 26, 29, 74, 167, 319]:
    assert i in res

try:
    albums_starting_with(3)
except Exception:
    pass
else:
    raise Exception("should raise an exception")
    

**1(d)** (3 pts) Write a function `songs_starting_with(c)` that takes as an argument a single character and
    returns a list of the primary keys (TrackIds) of all the **songs** whose album
    names begin with that character. Again, your function should ignore
    case and perform error checking as in the previous exercise.
    **Hint:** you'll need a JOIN statement here. Don't forget that you
    can use the `cursor.description` attribute to find out about tables
    and the names of their columns.
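
The hint about `cursor.description` can be sketched as follows. The example uses an in-memory table that only mimics a few `Track` columns (an assumption made for illustration); the same pattern should work against the real database.

```python
import sqlite3

# Hypothetical, simplified Track table in a throwaway in-memory database.
demo_con = sqlite3.connect(":memory:")
demo_con.execute(
    "CREATE TABLE Track (TrackId INTEGER PRIMARY KEY, Name TEXT, AlbumId INTEGER)"
)

# A SELECT that returns no rows still populates cursor.description.
demo_cur = demo_con.execute("SELECT * FROM Track LIMIT 0")

# Each entry of description is a 7-tuple whose first item is the column name.
column_names = [col[0] for col in demo_cur.description]
print(column_names)  # ['TrackId', 'Name', 'AlbumId']
```

Comparing the column lists of two tables this way is a quick method for spotting the shared key (here `AlbumId`) to use in a JOIN.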

In [8]:
def songs_starting_with(c):
    # YOUR CODE HERE
    # Check the type first: calling len() on a non-string such as 3 would raise a TypeError instead.
    if not isinstance(c, str) or len(c) != 1:
        raise ValueError("Input must be a single character.")
    
    # SQL query to join album and track tables. Then select the trackid when the album starts with the character c. 
    query = """
    SELECT Track.TrackId
    FROM Track
    JOIN Album ON Track.AlbumId = Album.AlbumId
    WHERE Album.Title LIKE ?
    COLLATE NOCASE;
    """
    # haha nice to do some sql queries again as I did that in my old job.
    # Collect the rows that match the query. The query selects only TrackId, so row[0] is the track's id.
    track_ids = [row[0] for row in con.execute(query, (f"{c}%",)).fetchall()]
    
    return track_ids

In [9]:
res = songs_starting_with('a')
assert type(res) == list
assert len(res) == 369
for i in [85, 86, 87, 331, 332, 333, 923, 924, 925]:
    assert i in res

try:
    songs_starting_with(3)
except Exception:
    pass
else:
    raise Exception("should raise an exception")

try:
    songs_starting_with('res')
except Exception:
    pass
else:
    raise Exception("should raise an exception")

**1(e)** (3 pts) Write a function `cost_of(c)` that takes as an argument a single character and
    returns the cost of buying every song (consider only the songs that were sold - you need to look into InvoiceLine table) whose album begins with that
    letter. This cost should be based on the tracks' unit prices when they were sold, so
    that the cost of buying a set of tracks is simply the sum of the
    unit prices of all the tracks in the set. Again your function should
    ignore case and perform appropriate error checking.
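
To make the "only songs that were sold" condition concrete, here is a minimal sketch on invented data. It assumes the reading in which every sale (every `InvoiceLine` row) contributes its own unit price, so a track sold twice is counted twice and a track never sold is not counted at all. The table contents and the use of `COALESCE` are illustrative choices, not taken from the assignment.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.executescript("""
CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE Track (TrackId INTEGER PRIMARY KEY, AlbumId INTEGER, UnitPrice REAL);
CREATE TABLE InvoiceLine (InvoiceLineId INTEGER PRIMARY KEY, TrackId INTEGER, UnitPrice REAL);
INSERT INTO Album VALUES (1, 'Alpha'), (2, 'Beta');
INSERT INTO Track VALUES (10, 1, 0.99), (11, 1, 0.99), (20, 2, 1.99);
-- Track 10 is sold twice, track 20 once, track 11 never.
INSERT INTO InvoiceLine VALUES (100, 10, 0.99), (101, 10, 0.99), (102, 20, 1.99);
""")

# The inner joins drop unsold tracks; COALESCE turns the NULL that SUM
# returns for an empty result into 0.0.
query = """
SELECT COALESCE(SUM(InvoiceLine.UnitPrice), 0.0)
FROM Track
JOIN Album ON Track.AlbumId = Album.AlbumId
JOIN InvoiceLine ON Track.TrackId = InvoiceLine.TrackId
WHERE Album.Title LIKE ?
"""
total_a = demo_con.execute(query, ("a%",)).fetchone()[0]
total_z = demo_con.execute(query, ("z%",)).fetchone()[0]
print(round(total_a, 2), total_z)  # 1.98 0.0
```

Because unit prices are stored as floating-point numbers, sums over many rows can show small rounding artifacts, which is why comparisons against an expected total are best done with a tolerance.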


In [10]:
def cost_of(c):
    # YOUR CODE HERE
    # Check the type first: calling len() on a non-string such as 3 would raise a TypeError instead.
    if not isinstance(c, str) or len(c) != 1:
        raise ValueError("Input must be a single character.")
    
    query = """
    SELECT SUM(InvoiceLine.UnitPrice)  -- Sum the sale price over every invoice line (a track sold twice counts twice)
    FROM Track
    JOIN Album ON Track.AlbumId = Album.AlbumId
    JOIN InvoiceLine ON Track.TrackId = InvoiceLine.TrackId
    WHERE Album.Title LIKE ? COLLATE NOCASE  -- Match albums whose title starts with the character c
    """
    total_cost = cur.execute(query, (f"{c}%",)).fetchone()[0]
    
    # Handle case where no matching tracks are found (SUM returns None)
    if total_cost is None:
        total_cost = 0.0
    
    return total_cost
# I believe your assert statement is wrong. It is perfectly 30 songs off if each song is 99 cents.
for i in range(97, 123):
    print(cost_of(chr(i)))

246.53
207.21
171.27
62.37
21.78
47.519999999999996
105.92999999999999
69.42999999999999
79.2
31.68
29.7
189.5
161.37
53.46
54.45
83.16
14.85
109.89
118.8
247.75
88.11
82.17
27.72
0.0
0.0
8.91


In [11]:
res = cost_of('a')
assert type(res) == float
assert abs(res - 246.53) < 1e-5

try:
    cost_of(3)
except Exception:
    pass
else:
    raise Exception("should raise an exception")

## Problem 2: Building simple models with PyTorch (8 pts) 
In this problem, you'll use **PyTorch** to build the loss functions for a pair of commonly-used statistical models. 

We will use variables $X$ and $Y$, which will serve as the predictor (independent variable) and response (dependent variable), respectively. Please use $W$ to denote a parameter that multiplies the predictor, and $b$ to denote a bias parameter (i.e., a parameter that is added).

**2(a)** (4 pts)

In this model, the binary variable $Y$ is distributed as a Bernoulli random variable with success parameter $\sigma(W^T X + b)$, where $\sigma(z) = (1+\exp(-z))^{-1}$ is the logistic function, $X \in \mathbb{R}^6$ is the predictor random variable, and $W \in \mathbb{R}^6, b \in \mathbb{R}$ are the model parameters. 
 
Using **PyTorch** code, implement a class `LogisticRegression` that inherits from `nn.Module`. This class should have two attributes, `w` and `b`, which should be `nn.Parameter` objects with shapes `(6, 1)` and `(1,)` respectively. 

This class should have a method called `forward` that takes in the predictor random variable `x` with shape `(N, 6)`, where `N` is the number of observations, and returns the success parameter (also known as the prediction of our model on $Y$).

**Note:** Please initialize both `w, b` to be __all-one float tensors.__
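
For $N$ observations stacked as the rows of a matrix $X$, the model above can be written in vectorized form, which also makes the expected shapes explicit. The symbol $\hat{Y}$ for the vector of predictions is introduced here for illustration only.

$$
\hat{Y} = \sigma(XW + b), \qquad X \in \mathbb{R}^{N \times 6},\quad W \in \mathbb{R}^{6 \times 1},\quad b \in \mathbb{R},\quad \hat{Y} \in \mathbb{R}^{N \times 1}
$$

With the all-ones initialization, the argument of $\sigma$ for a row $x_i$ is simply the sum of its entries plus one. As a worked case, for the illustrative row $x_i = (1, 2, 3, 4, 5, 6)$:

$$
z_i = \sum_{j=1}^{6} x_{ij} + 1 = 21 + 1 = 22, \qquad \sigma(22) = \frac{1}{1 + e^{-22}} \approx 1 - 2.8 \times 10^{-10}
$$

This value is so close to 1 that it rounds to exactly 1 in 32-bit floating point, so predictions for inputs with large positive entries will typically be displayed as `1.`.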

In [22]:
import torch
import torch.nn as nn
class LogisticRegression(nn.Module):
    # YOUR CODE HERE
    def __init__(self):
        super(LogisticRegression, self).__init__()
        # Use torch.ones to initialize w and b as all-one float tensors.
        self.w = nn.Parameter(torch.ones(6, 1, dtype=torch.float32))
        self.b = nn.Parameter(torch.ones(1, dtype=torch.float32))
    # Takes x with shape (N, 6) and returns the predicted success probability with shape (N, 1).
    def forward(self, x):
        interim = torch.matmul(x, self.w) + self.b
        prediction = torch.sigmoid(interim)
        
        return prediction


In [23]:
model = LogisticRegression()

# Create dummy input data (N=3 observations, 6 features each)
x = torch.tensor([
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [0.5, 1.5, 2.5, 3.5, 4.5, 5.5],
    [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
], dtype=torch.float32)

# Expected shapes
N = x.size(0)

# Forward pass: Compute predictions
predictions = model(x)

# Output the results
print("Input:\n", x)
print("Predictions:\n", predictions)
print("Prediction Shape:", predictions.shape)

# Verify that predictions are in the range [0, 1]
assert (predictions >= 0).all() and (predictions <= 1).all(), "Predictions are not in range [0, 1]"

# Verify that output shape is (N, 1)
assert predictions.shape == (N, 1), f"Output shape is incorrect: {predictions.shape}"

Input:
 tensor([[1.0000, 2.0000, 3.0000, 4.0000, 5.0000, 6.0000],
        [0.5000, 1.5000, 2.5000, 3.5000, 4.5000, 5.5000],
        [6.0000, 5.0000, 4.0000, 3.0000, 2.0000, 1.0000]])
Predictions:
 tensor([[1.],
        [1.],
        [1.]], grad_fn=<SigmoidBackward0>)
Prediction Shape: torch.Size([3, 1])


In [21]:
model = LogisticRegression()
assert type(model.w) == nn.Parameter
assert type(model.b) == nn.Parameter
x = torch.eye(6).float()
y = model.forward(x)
assert y.shape == (6, 1)

**2(b)** (4 pts)

Using **PyTorch** code, write a function called `neg_log` that takes in `y_true` (true value of $Y$) and `y_pred` (predicted value of $Y$) and returns the negative log-likelihood loss. You can assume that both `y_true` and `y_pred` have shapes `N x 1` where `N` is the number of observations. 
 
 __Hint:__ The loss should be a negative log-likelihood term, summed over all the observations. Remember that $Y$ is Bernoulli distributed which should suggest what the likelihood is. 
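
The hint can be spelled out as a short derivation. Writing $y_i \in \{0, 1\}$ for the true labels and $\hat{y}_i$ for the predicted success probabilities (notation introduced here for illustration), and assuming the observations are independent, the Bernoulli likelihood and its negative logarithm are:

$$
L = \prod_{i=1}^{N} \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{1 - y_i}, \qquad -\log L = -\sum_{i=1}^{N} \Big[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \Big]
$$

As a quick check, with $N = 5$, every $y_i = 1$ and every $\hat{y}_i = 0.5$, only the first term in each summand survives:

$$
-\log L = -5 \log(0.5) = 5 \log 2 \approx 3.466
$$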

In [14]:
def neg_log(y_true, y_pred):
    # YOUR CODE HERE
    # torch.sum reduces over all observations and returns a scalar tensor.
    loss = -torch.sum(y_true * torch.log(y_pred) + (1 - y_true) * torch.log(1 - y_pred))
    return loss

In [15]:
y_true = torch.tensor([1, 1, 1, 1, 1]).float().view(5, 1)
y_pred = torch.tensor([0.5, 0.5, 0.5, 0.5, 0.5]).float().view(5, 1)
assert 3.4 < neg_log(y_true, y_pred).item() < 3.6