Welcome to lesson 8 of the Noisebridge Python class! ([Noisebridge Wiki](https://www.noisebridge.net/wiki/PyClass) | [Github](https://github.com/audiodude/PythonClass))

In this lesson, we will discuss SQL, which is **Structured Query Language**. SQL is used to store, query and update data in a database. We will perform the following tasks using Python:

1. Creating a database from a schema file
1. Loading data into the database from a CSV file
1. Basic creating, reading, updating, and deleting (CRUD) in Python

Then we will talk about Pandas, a Python library for data analysis, and we will discuss:

1. Loading a CSV file into a Pandas **Dataframe**
1. Inspecting the data, including getting summary statistics
1. Filtering the data using **Boolean Masks**
1. Assigning new derived columns to the Dataframe
1. Using aggregate functions

Let's get started!

# SQL

SQL is not considered to be a general purpose programming language (though I'm sure some weirdo somewhere has been able to use it to write programs at some point). Instead, it is a standard way of adding data to, and getting data from, a database. Different database systems, such as **SQLite**, **MySQL**, **MariaDB**, and **PostgreSQL** (to name a few), can all be used with SQL. The slight customizations and variations of the language that each database uses are called that database's **dialect** of SQL.

We will be using **SQLite** for our examples, because it comes built in to Python and doesn't require running any additional software. The other database systems mentioned above all operate as servers, where you must run an ongoing process somewhere that manages and serves the data, and that you connect to over the network (whether the internet or the local network).

Luckily for us, Python defines a [Python Database API](https://peps.python.org/pep-0249/) which is a standard way of interacting with databases in the language. Almost all database interface libraries in Python implement database access using this **API** (Application Programming Interface, in this case a standard way of interacting with similarly specced resources). All this means that if your code is designed to interact with SQLite, you can switch your database to one of the server-based options and keep most if not all of your code as is.

*Note: we won't really be covering SQL itself in depth here. If you've never used it before, keep in mind that we're focusing on a light overview with an emphasis on how to run the examples in Python. For more information on SQL, check out this [list of resources on Coursera](https://medium.com/@steverramos/10-best-sql-courses-on-coursera-25faf19b2ec3).*

SQLite databases are stored in individual files, one database per file. Inside the database, there can be multiple **tables**. Each table has a specific set of named **columns** which contain data. A database **row** is a single record: one value for each of the table's columns. You can think of database rows as rows in a spreadsheet, and database columns as the spreadsheet's columns.

For this lesson, we are considering the database for a fictional Reddit-clone called *Radish*. Think about what the table for the *links* on the site looks like. If it were a spreadsheet, it might look like this:

![radish.png](radish.png)

Here there is a row for each link in the database. This could be thought of as the "links" table. Each column holds a particular piece of data for a particular link. For example, the first link, in row 2, has an id of `1`, a name of `Google`, and a url of `https://google.com`, among other things.

The definitions of all the tables and columns in a database are called its **schema**. Generally, we define the schema when we create the database. The schema is defined using SQL statements. For the table above, the schema would look like:

```sql
CREATE TABLE IF NOT EXISTS links (
  id INTEGER NOT NULL PRIMARY KEY,
  name VARCHAR(255),
  url TEXT,
  created_at TIMESTAMP,
  upvotes INTEGER DEFAULT 0,
  downvotes INTEGER DEFAULT 0
);
```

Here, the name of the table is `links`, which we will refer to when inserting or querying data. Each column is defined starting with its name, then its **datatype**. Just like in Python where there are `int`s, `str`s, etc., the columns of a database have a certain type of data that they can hold. This is used when inserting or querying, and also used for sorting. We can also use the `DEFAULT` specifier to indicate that if no value is given, the column should have a value of 0 (or whatever we put after `DEFAULT`). This is useful because otherwise the value would be `NULL` (which is the rough equivalent of Python's `None`), and we might have bugs later trying to compare numbers to `NULL` values.
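
To see `DEFAULT` in action, here is a minimal sketch. It uses a throwaway in-memory database (the special name `':memory:'`) so that it doesn't touch any file, a trimmed-down version of the `links` table, and a made-up row. The name `demo_db` is only for illustration.

```python
import sqlite3

# A throwaway in-memory database, separate from any file on disk
demo_db = sqlite3.connect(':memory:')

# Connection.execute() is an sqlite3 shortcut that creates a cursor for us
demo_db.execute('''
  CREATE TABLE links (
    id INTEGER NOT NULL PRIMARY KEY,
    name VARCHAR(255),
    url TEXT,
    upvotes INTEGER DEFAULT 0
  )
''')

# Only id and name are given: upvotes falls back to its DEFAULT,
# while url has no DEFAULT and so becomes NULL (None in Python).
demo_db.execute("INSERT INTO links (id, name) VALUES (1, 'Example')")
default_row = demo_db.execute('SELECT name, url, upvotes FROM links').fetchone()
print(default_row)
demo_db.close()
```

The printed row has `None` for `url` but `0` for `upvotes`, which is the difference the `DEFAULT` specifier makes.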

Along with this lesson, there is a file called `schema.sql`, which contains the above create table statement as well as the following:

```sql
CREATE TABLE IF NOT EXISTS users (
  id INTEGER NOT NULL PRIMARY KEY,
  email TEXT,
  hashed_password VARCHAR(255),
  profile TEXT
);
```

The schema can also sometimes include data that needs to be present when your program or application starts up, such as an admin user that already has permissions. The `IF NOT EXISTS` clause means that we can load our schema multiple times and it won't throw an error because the table is already there.

## Connecting to the database

Let's connect to our database using the `sqlite3` module.

In [None]:
import sqlite3

def connect():
    # This is the file that will store our database, we can
    # name it anything we want, but it's nice to have a .sqlite
    # or .sqlite3 extension.
    return sqlite3.connect('radish.sqlite')

db = connect()

A SQLite database is stored in a single file, with all the tables included. Here we instruct `sqlite3` to connect to the database `'radish.sqlite'`, which is simply a file path. When we run this code, the file `radish.sqlite` is automatically created as an empty database alongside this notebook.

SQLite also supports "in-memory" databases, where no file exists and all of the data is stored in the Python process's memory. This means that it's not persisted, and when the Python process exits, all the data is lost. What is the benefit then? In-memory databases avoid reading and writing the disk, so they can be considerably faster than those based on disk. Maybe you just need to store some data temporarily, while your program is running, to perform a calculation. Additionally, in some scenarios you might want the database to be "emptied" every time the code is run, such as when you're running a test and loading a database with test data. It would be error-prone to manually clear that data after every run, so it's best if it is just discarded.
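
As a quick sketch of the in-memory idea: passing the special name `':memory:'` to `sqlite3.connect` gives us a database with no file behind it. The `scratch` table and the variable names here are made up for illustration.

```python
import sqlite3

# ':memory:' asks sqlite3 for a database that lives only in this
# process's memory; no file is created.
mem_db = sqlite3.connect(':memory:')
mem_db.execute('CREATE TABLE scratch (n INTEGER)')
mem_db.executemany('INSERT INTO scratch (n) VALUES (?)', [[1], [2], [3]])
total = mem_db.execute('SELECT SUM(n) FROM scratch').fetchone()[0]
print(total)

# Closing the connection discards the data
mem_db.close()

# A fresh in-memory connection starts out empty again: it has no tables
fresh_db = sqlite3.connect(':memory:')
tables = fresh_db.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(tables)
fresh_db.close()
```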

Once we've connected, we get a **cursor** for the database. The cursor is the object that you use to execute all queries, and has methods for executing SQL as well as getting back metadata about inserted or queried data.

Similar to the way that we use a context manager for opening and reading/writing files, we can use one for the database cursor. The cursor is a resource: it is a bit of memory reserved in the database. It needs to be `.close()`'d when we are done with it. A **context manager** helps us make sure this happens, even if we accidentally return early or an error occurs. The `closing()` context manager provides the cursor and calls its `.close()` method when the context is exited.

*Note: the db object that we connected to above is valid for the entire Jupyter notebook (this file), so we don't need to keep connecting to it.*

In [None]:
from contextlib import closing

with closing(db.cursor()) as cursor:
  pass # Here we would do a database operation
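
To convince ourselves that `closing()` really does call `.close()` even when something goes wrong, here is a small sketch. `FakeCursor` is a hypothetical stand-in written just for this demonstration, not part of `sqlite3`.

```python
from contextlib import closing

class FakeCursor:
    """A hypothetical stand-in for a database cursor, for illustration only."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

fake = FakeCursor()
try:
    with closing(fake) as cursor:
        # Simulate an error in the middle of our database work
        raise RuntimeError('something went wrong mid-query')
except RuntimeError:
    pass

# close() was still called when the with block was exited
print(fake.closed)
```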

We can use the `executescript` method on a cursor to execute all of the SQL statements in a file at once. This is mostly useful for setting up the schema, which we do here.

In [None]:
from contextlib import closing

db = connect()

with closing(db.cursor()) as cursor:
    with open('schema.sql', 'r') as f:
        # Execute all of the SQL statements in the schema file
        cursor.executescript(f.read())


Now we can start inserting our data. I've stored the spreadsheet above as a csv file alongside this lesson, `data_links.csv`. We can read it using the Python `csv` module, which will automatically parse the data (we don't have to worry about splitting by commas or columns that have escaped values).

In [None]:
import csv

with closing(db.cursor()) as cursor:
    # The csv module recommends opening files with newline=''
    with open('data_links.csv', 'r', newline='') as f:
        reader = csv.reader(f)
        for idx, row in enumerate(reader):
            # Skip the first row, because it contains the column names ('id,name,url...')
            if idx == 0:
                continue
            # The syntax with three ' or " indicates a '''multi-line string'''.
            # These can be used anywhere a string is used and allow for strings to
            # more easily contain newlines/line breaks. As for SQL, any amount of
            # empty space (whitespace) is allowed before, after and in the middle
            # of a statement. We use the multi-line string to make our query easier
            # to read.
            cursor.execute('''
              INSERT INTO links
                (id, name, url, created_at, upvotes, downvotes)
              VALUES
                (?, ?, ?, ?, ?, ?)
            ''', row)

# Changes are only saved to the database file once we commit them
db.commit()

In [None]:
def alternate():
    # The multi-line string is used only for convenience and readability.
    # We could have easily done the following:
    cursor.execute('INSERT INTO links (id, name, url, created_at, upvotes, downvotes) VALUES (?, ?, ?, ?, ?, ?)', row)
    # Or even an alternate version using "implicit string concatenation"
    cursor.execute('INSERT INTO links '
                   '  (id, name, url, created_at, upvotes, downvotes) '
                   'VALUES '
                   '  (?, ?, ?, ?, ?, ?)', row)
                   
    # The downside of the first example is that the line with the code
    # is long and hard to read.

    # The downside of the second is that we have to remember to put
    # trailing spaces at the end of our lines

So what's going on here? We create a `csv.reader` from our csv file, which we can iterate over using a for loop to get all of the rows in the csv. We use `enumerate` to get the index of each row (assigned to `idx`) along with the row data. This is useful so that we can skip the first row, which contains the names of the columns. Finally, for each row we use the `INSERT INTO` SQL statement, which takes a table name (`links`), a list of columns to insert for (`(id, name, url, created_at, upvotes, downvotes)` -- any non-specified columns are skipped and given their default value, which is usually `NULL` or 0), and finally a list of values to insert.

Note that we didn't specify the values directly. Instead, we used the placeholder `?` and then passed a list of values to populate the query with. **If there is ANYTHING you learn from today's lesson, it's that you should always, ALWAYS use placeholders to insert data into a SQL database!** The alternative would be to use string concatenation, or Python format strings directly:

```python
cursor.execute('INSERT INTO links (id, name, url, created_at, upvotes, downvotes) '
               'VALUES' + row[0] + ', ' + row[1] ... )
```

**This will cause your code to be subject to SQL injection**. You can read more [here](https://learn.microsoft.com/en-us/sql/relational-databases/security/sql-injection?view=sql-server-ver16), [here](https://en.wikipedia.org/wiki/SQL_injection) and [here](https://www.vice.com/en/article/aekzez/the-history-of-sql-injection-the-hack-that-will-never-go-away).
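
Here is a small, self-contained sketch of what can go wrong. It uses a throwaway in-memory database with a made-up `users` table and made-up email addresses, and the "attacker" input is just a string we typed ourselves.

```python
import sqlite3

demo_db = sqlite3.connect(':memory:')
demo_db.execute('CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)')
demo_db.executemany('INSERT INTO users (email) VALUES (?)',
                    [['alice@example.com'], ['bob@example.com']])

# Pretend this string came from a form field filled in by an attacker
user_input = "nobody@example.com' OR '1'='1"

# UNSAFE: the input is pasted into the SQL, so its quote characters
# change the meaning of the query and every row matches.
unsafe_sql = "SELECT email FROM users WHERE email = '" + user_input + "'"
unsafe_rows = demo_db.execute(unsafe_sql).fetchall()

# SAFE: with a placeholder, the input is only ever treated as a value
safe_rows = demo_db.execute(
    'SELECT email FROM users WHERE email = ?', [user_input]
).fetchall()

print(len(unsafe_rows), len(safe_rows))
demo_db.close()
```

The concatenated query returns every user, even though nobody has that email address. The placeholder version correctly returns nothing.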

Now that we have data in the database, we can query it in various ways. By running a SQL `SELECT` statement, we can get all of the URLs of our links:

In [None]:
with closing(db.cursor()) as cursor:
    cursor.execute('SELECT url FROM links')
    data = cursor.fetchall()
    print(data)

Notice that the `fetchall()` method returns a list, and each item in that list is a tuple of values. This is always the case, no matter if you select one value (as we did here, `url`) or many. *How would you access the url of the second item from the `data` variable?*

---

*Side note:* You may remember **tuples** from a past lesson. They are like lists, but immutable:

In [None]:
t = (1,2,3)
for t1 in t:
    print(t1)

Tuples are delimited by parentheses. However, this causes a problem, because parentheses are primarily used in Python to group arithmetic or logical expressions:

In [None]:
n = (1 + 5) / (100 - 10)

Because of this, a single Python expression wrapped in parentheses acts as if the parentheses simply aren't there:

In [None]:
a = ((1) + (2))
print(a)
# This raises a TypeError, because a is the int 3, not a tuple consisting of one item, 3:
for a1 in a:
    print(a1)

To create a single item tuple, we must put a single comma between the expression and the trailing parenthesis. Lists don't require this, because they use square brackets (`[]`) as delimiters (but trailing commas are allowed).

In [None]:
b = (2,)
b_as_list = [2]
another_b_as_list = [2,]
c = ((1) + (2),)
c_as_list = [(1) + (2)]
for b1 in b:
    print(b1)
print(b, b_as_list, another_b_as_list, c, c_as_list)

---

In [None]:
with closing(db.cursor()) as cursor:
    cursor.execute('SELECT name, url FROM links')
    data = cursor.fetchall()
    print(data)
# Use a list comprehension to construct a dictionary mapping
# link names to their URLs. The syntax dict([(key1, value1), (key2, value2), ...])
# creates a dictionary from an iterable of pairs, where the first item of each
# pair is the key and the second item is the value.
print(dict([(row[0], row[1]) for row in data]))

## WHERE clauses and filtering

We can use a `WHERE` clause in a query to restrict the results to rows that match certain criteria. Next, we will filter rows so that the only ones returned are those that have an `upvotes` value greater than 100. We then use the `csv.writer` class to create a writer object in order to write our results to a csv named `output.csv`, which will be created in the same directory as this notebook.

In [None]:
import csv

with closing(db.cursor()) as cursor:
    cursor.execute('SELECT name, upvotes FROM links WHERE upvotes > 100')
    data = cursor.fetchall()
    with open('output.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(('name', 'upvotes'))
        for row in data:
            writer.writerow(row)
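
The threshold of 100 above is hardcoded into the SQL. If the threshold came from a variable instead, placeholders work in `WHERE` clauses just like they do in `INSERT` statements. Below is a minimal sketch using a throwaway in-memory database with made-up rows. The function name `names_with_upvotes_over` is hypothetical.

```python
import sqlite3

demo_db = sqlite3.connect(':memory:')
demo_db.execute('CREATE TABLE links (name TEXT, upvotes INTEGER)')
demo_db.executemany('INSERT INTO links (name, upvotes) VALUES (?, ?)',
                    [('A', 50), ('B', 150), ('C', 500)])

def names_with_upvotes_over(threshold):
    # The parameters are always passed as a sequence, even when there is
    # only one. (threshold,) is a single-item tuple: note the trailing comma.
    rows = demo_db.execute(
        'SELECT name FROM links WHERE upvotes > ? ORDER BY name', (threshold,)
    ).fetchall()
    return [row[0] for row in rows]

popular = names_with_upvotes_over(100)
very_popular = names_with_upvotes_over(400)
print(popular, very_popular)
demo_db.close()
```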

Let's say someone made an error when inserting some data into the database.

In [None]:
with closing(db.cursor()) as cursor:
  cursor.execute('''
    INSERT INTO links
      (id, name, url, created_at)
    VALUES
      (6, "Google", "https://en.wikipedia.org", "2023-07-01 04:04:04")''')

*Note that in this scenario, it's okay to "hardcode" the values we are inserting. This is very different from concatenating strings from variables, and doesn't carry the risk of SQL injection. Do you see why?*

## UPDATE statement

So now we have an entry in our database for https://en.wikipedia.org, but the name is Google! *What would happen if we ran our dictionary mapping code again?* Either way, we need to update that row.

In [None]:
with closing(db.cursor()) as cursor:
    cursor.execute('UPDATE links SET name = "Wikipedia" WHERE id = 6')

This SQL says to `UPDATE` the `links` table, and for any row `WHERE` the `id = 6`, `SET` the `name = "Wikipedia"`. Note that, unlike Python, SQL uses a single equal sign (`=`) for both assignment (`name =`) and comparison (`id = 6`). In Python, we would use `==` to test whether `id == 6`. What will we see when we run the dictionary mapping code above now?

In SQL, you can use any condition you like in the `WHERE` clause to select the row or rows to be updated. However, it is usually best to use the `id`: since it is the primary key, it is **indexed** and will allow for near immediate retrieval of the row required. Otherwise, for non-indexed columns (everything else in the table), SQLite will have to do a [**table scan**](https://en.wikipedia.org/wiki/Full_table_scan) and examine every single row in the table in order to find those that match the criteria. This is fine for our 6 row table, but imagine if you had several million rows!
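
If you're curious, SQLite can tell us which strategy it plans to use for a query via `EXPLAIN QUERY PLAN`. The sketch below uses a throwaway in-memory copy of a simplified `links` table. The helper `query_plan` is a hypothetical name, and the exact wording of the plan text varies between SQLite versions, so treat the output as a rough guide.

```python
import sqlite3

demo_db = sqlite3.connect(':memory:')
demo_db.execute(
    'CREATE TABLE links (id INTEGER NOT NULL PRIMARY KEY, name TEXT, url TEXT)'
)

def query_plan(sql):
    # sql here is a hardcoded statement written by us, never user input.
    # Each row of EXPLAIN QUERY PLAN output ends with a readable description.
    rows = demo_db.execute('EXPLAIN QUERY PLAN ' + sql).fetchall()
    return ' / '.join(row[-1] for row in rows)

by_id_plan = query_plan('SELECT name FROM links WHERE id = 6')
by_url_plan = query_plan("SELECT name FROM links WHERE url = 'https://google.com'")
print(by_id_plan)
print(by_url_plan)
demo_db.close()
```

The first plan should mention a `SEARCH` using the primary key, while the second should mention a `SCAN` of the whole table.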

In [None]:
with closing(db.cursor()) as cursor:
    cursor.execute('UPDATE links SET name = "Wikipedia" WHERE url LIKE "%wikipedia.org%"')


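Here, we use a **wildcard pattern** with the `LIKE` keyword to match any rows that have `wikipedia.org` in their URLs. In a `LIKE` pattern, `%` matches any sequence of characters (including none at all). This is simpler than, and not the same as, a regular expression.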
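
The pattern characters that `LIKE` understands can be tried out safely in a throwaway in-memory database. The URLs below are made up for illustration, and `urls_like` is a hypothetical helper name.

```python
import sqlite3

demo_db = sqlite3.connect(':memory:')
demo_db.execute('CREATE TABLE links (url TEXT)')
demo_db.executemany('INSERT INTO links (url) VALUES (?)', [
    ['https://en.wikipedia.org'],
    ['https://de.wikipedia.org/wiki/Python'],
    ['https://google.com'],
])

def urls_like(pattern):
    rows = demo_db.execute(
        'SELECT url FROM links WHERE url LIKE ? ORDER BY url', [pattern]
    ).fetchall()
    return [row[0] for row in rows]

# % matches any run of characters, including none
contains = urls_like('%wikipedia.org%')
# Without a trailing %, the value must end right after 'wikipedia.org'
ends_with = urls_like('%wikipedia.org')
# _ matches exactly one character, so __ matches a two letter language code
two_letter_lang = urls_like('https://__.wikipedia.org%')

print(contains)
print(ends_with)
print(two_letter_lang)
demo_db.close()
```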

## DELETE statement

Finally, we might decide that we don't want the Wikipedia row in our database. We can use an SQL `DELETE` statement to remove it.

In [None]:
with closing(db.cursor()) as cursor:
  cursor.execute('DELETE FROM links WHERE id = 6')

Note that we can run this code as many times as we like, but it only does the delete the first time (after which, the row doesn't exist). In most cases in SQL, besides obvious syntax errors or errors formatting parameters, statements that have no effect *do not produce errors*. This can be a source of bugs in your code, because you might run the above SQL and report that 'Item id=6 was deleted!' when in fact it didn't even exist. In this case, we can check the `rowcount` on the cursor to see how many rows were affected by the operation.

In [None]:
with closing(db.cursor()) as cursor:
    cursor.execute('DELETE FROM links WHERE id = 6')
    was_deleted = cursor.rowcount > 0
    print(cursor.rowcount, was_deleted)

---

Let's try the following exercises:

1. Insert a new row in the database for your favorite site
2. Update your row so that your favorite site has 1000 upvotes
3. Update your row to add one upvote (this wasn't necessarily covered in the lesson)
4. Write a CSV file with the URL of every link and the date it was created

In [None]:
# Your code here!

---

# Pandas

Now let's start exploring data analysis using the popular [**Pandas**](https://pandas.pydata.org/) library. Pandas isn't really a database, and it doesn't necessarily store data itself. It's more of a library for manipulating and inspecting/analyzing data. You usually load data into Pandas from an "external" source, like a CSV file or a SQL database server.

In [None]:
import pandas as pd

First we must import the pandas library. It's common practice to import pandas `as pd`. If you remember all the way back to week 1, this syntax allows us to refer to pandas using the shortened name `pd`. This syntax is also useful if you have multiple libraries whose names would otherwise conflict.

Next, we read a CSV file into a Pandas **Dataframe**. A Dataframe is like a sheet in a spreadsheet, or a table in an SQL database. It is two dimensional, with a row for each data item and a column for each piece of data relating to that item.

We will be using a CSV that contains data on links submitted to the [Hacker News](https://news.ycombinator.com/) link aggregation service from July 2023.

In [None]:
df = pd.read_csv('links.csv')

The CSV contains a header row with all of the column names. This is automatically used as the **index** of the columns in the Dataframe, which will provide labels for them.

We can get an idea of how many rows there are in the table, how many columns there are, how populated or sparse they are (the number of rows that contain non-null data), and the datatypes associated with each column. Pandas is flexible enough to automatically assign a datatype to a column based on the data that it finds there.

In [None]:
df.info()

We can also get the number of rows and columns of the Dataframe with the `shape` attribute:

In [None]:
df.shape

Pandas Dataframes act in some ways like 2D arrays, or lists of lists. Imagine you had the following Python list:

In [None]:
data = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

You could access the individual items in the `data` 2D array by specifying a row index and a column index:

In [None]:
data[1][1]

In a similar way, we can access a specific row and column in the Dataframe:

In [None]:
df.iloc[5, 7]

The `iloc` indexer refers to data by its integer "coordinates" (row position and column position) in the Dataframe. We can also use the `loc` indexer to refer to data by its labels, such as the column name, which is usually more convenient:

In [None]:
df.loc[5, 'url']

We can use slice notation to select a range of rows, with a specific column, and use the `head()` method to see the first few rows:

In [None]:
five_through_ten_url = df.loc[5:10, 'url']
five_through_ten_url.head()
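
One detail worth knowing: a slice with `loc` includes both of its endpoints, unlike ordinary Python slices and `iloc` slices, which stop just before the end. Here is a small sketch using a made-up Dataframe (`demo_df` and its contents are for illustration only):

```python
import pandas as pd

# A made-up Dataframe whose row labels are the default 0, 1, 2, ...
demo_df = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd', 'e'],
    'url': ['u0', 'u1', 'u2', 'u3', 'u4'],
})

by_label = demo_df.loc[1:3, 'url']   # rows labeled 1, 2 and 3
by_position = demo_df.iloc[1:3, 1]   # rows at positions 1 and 2 only

print(list(by_label))
print(list(by_position))
```

So `df.loc[5:10, 'url']` above selects six rows (5 through 10 inclusive), not five.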

We can pass in a list of column names to select as well, and use `:` for all rows (similar to the Python code `fruits[:]` which selects all elements of a list and serves to make a copy).

In [None]:
fruits = ['apple', 'banana', 'orange', 'pear']
my_fruits = fruits[:]
my_fruits.append('cherry')
print(fruits)
print(my_fruits)

In [None]:
all_rows_url_title = df.loc[:, ['url', 'title']]
all_rows_url_title.head()

Because we will often be selecting entire columns, Pandas provides a shortcut notation for that:

In [None]:
all_scores = df['score']
all_scores.head()

Note that this returned a Pandas **Series** object, which is a separate data container that contains only 1 column. We can calculate various basic statistics on the series:

In [None]:
all_scores.describe()

We can also use methods directly on the series:

In [None]:
all_scores.max()

In Pandas, `NULL` values (Python `None`) are referred to as "NA". Due to quirks in Python and NumPy (which Pandas is built on), the presence of an `NA` in an integer column by default causes the column to be converted to float (see the decimal points in the output above), with `NaN` (Not a Number) used as the `NA` value.
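
We can watch this conversion happen with two tiny, made-up Series, one complete and one with a missing value:

```python
import pandas as pd

complete = pd.Series([10, 20, 30])
with_gap = pd.Series([10, None, 30])

# The complete Series keeps an integer dtype
print(complete.dtype)
# The missing value becomes NaN, which forces the whole Series to float
print(with_gap.dtype)
print(with_gap.tolist())
# isna() marks the missing entries, so we can count them
missing_count = int(with_gap.isna().sum())
print(missing_count)
```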

Now let's look at a few more operations on Dataframes, using a new tiny Dataframe with fruit prices.

In [29]:
fruits_df = pd.DataFrame({'name': ['apple', 'banana', 'orange'], 'price': [1.29, .89, 2.29]})

We can add 10 cents to each price with one operation.

In [None]:
# Fruit price goes up by 10 cents
fruits_df['price'] += .1
fruits_df.head()

Note that this doesn't work with ordinary Python lists. The syntax is valid, but adding a number to a list raises a `TypeError`:

In [None]:
numbers = [10, 20, 30, 40]
numbers + 100

Pandas "overloads" the addition operator in its Dataframe and Series classes to allow for special operations like this. All of the operators you'd expect, like `+`, `-`, `/`, `*`, `%` and of course the shortcuts like `+=` and `*=`, work for Pandas Dataframes and Series.
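
For comparison, here is roughly what the same operation looks like in plain Python versus Pandas. The prices are made up and stored in whole cents to keep the arithmetic exact.

```python
import pandas as pd

prices_in_cents = [129, 89, 229]

# Plain Python: build a new list one element at a time
raised_list = [price + 10 for price in prices_in_cents]

# Pandas: the + operator is applied to every element of the Series
raised_series = pd.Series(prices_in_cents) + 10

print(raised_list)
print(raised_series.tolist())
```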

We can also assign to individual values, or entire (potentially new) columns in our Dataframe.

In [None]:
fruits_df.shape

In [30]:
fruits_df.loc[1, 'price'] = 0.69
# To add a row, we must construct a new Dataframe and concatenate the two together.
# Note that the pd.concat() function returns a new Dataframe, it does not modify
# the original.
#
# We use ignore_index=True to reset the index of the new Dataframe.
fruits_df = pd.concat([fruits_df, pd.DataFrame({'name': ['grape'], 'price': [0.1]})], ignore_index=True)
fruits_df['on_sale'] = [False, True, False, False]
fruits_df.head()

| | name | price | on_sale |
|-|-|-|-|
| 0 | apple | 1.29 | False |
| 1 | banana | 0.69 | True |
| 2 | orange | 2.29 | False |
| 3 | grape | 0.1 | False |


---

## Boolean masks

We can filter rows in our Dataframe using **Boolean Masks**. A Boolean Mask is a Dataframe or Series that contains only boolean values. It is not a separate data type.

In [None]:
fruits_df.head()

In [32]:
mask = fruits_df['price'] > 1
mask.head()

0     True
1    False
2     True
3    False
Name: price, dtype: bool

The mask contains one column, and the values of every row are either `True` or `False`. When we index the `fruits_df` Dataframe using the mask, it only returns the corresponding rows for which the mask is `True`. So in this example, it will skip rows 1 and 3, where the mask is `False` and return only rows 0 and 2. Note that all corresponding columns for the row are returned by default.

In [33]:
fruits_over_1 = fruits_df[mask]
fruits_over_1.head()

| | name | price | on_sale |
|-|-|-|-|
| 0 | apple | 1.29 | False |
| 2 | orange | 2.29 | False |


What if we want to combine conditions, like we do with normal boolean values? What if we want all of the fruits that have a price over 1 and don't start with 'a'? First, we have to use the special syntax `.str.startswith('a')` to use the `str` method `startswith`. This is because Pandas can't overload any operator to indicate the `startswith` method, so this syntax specifies "Apply the `str` method `startswith` to every row of the Series and create a new Series with the corresponding boolean value".

In [35]:
'apple'.startswith('a')

True

In [37]:
price_mask = fruits_df['price'] > 1
starts_with_a_mask = fruits_df['name'].str.startswith('a')
print(price_mask.head())
print(starts_with_a_mask.head())

0     True
1    False
2     True
3    False
Name: price, dtype: bool
0     True
1    False
2    False
3    False
Name: name, dtype: bool


Now we can use [bitwise operators](https://wiki.python.org/moin/BitwiseOperators) to emulate Python's boolean operators.

In [None]:
fruits_df[mask & ~starts_with_a_mask].head()

Here is a mapping of bitwise operators and their Python equivalent, when dealing with Boolean Masks:

| Python | Pandas Boolean Mask |
| ------ | ------------------- |
| and    | & |
| or     | \| |
| not    | ~ |

First, Pandas computed a final boolean mask by performing all the operations on the boolean masks we provided.

In [38]:
(mask & ~starts_with_a_mask).head()

0    False
1    False
2     True
3    False
dtype: bool

Then that boolean mask was applied to the fruits Dataframe as we've seen before.
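
One gotcha when combining masks: `&`, `|` and `~` bind more tightly than comparison operators like `>` and `<`. When the comparisons are written inline, rather than stored in variables first as we did above, each one needs its own parentheses. A minimal sketch with a made-up Dataframe:

```python
import pandas as pd

demo_df = pd.DataFrame({
    'name': ['apple', 'banana', 'orange', 'grape'],
    'price': [1.29, 0.69, 2.29, 0.10],
})

# Each comparison gets its own parentheses, then the masks are combined
cheap_or_pricey = demo_df[(demo_df['price'] < 0.5) | (demo_df['price'] > 2)]
print(cheap_or_pricey['name'].tolist())
```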

---

## Answering questions about the data

Now that we've learned some basics, let's try to answer some questions about our dataset. How many links have a score over 100?

In [39]:
df[df['score'] > 100].shape[0]

201

What about the number of links with titles that start with 'A'?

In [40]:
df[df['title'].str.startswith('A', na=False)].shape[0]

124

Can we combine these masks to find all rows with a score over 100 and a title that starts with 'A'? (Note that we use `na=False` to instruct Pandas that if it finds an `NA` value in the Series, it should put `False` instead of an `NA` in the output. If there were an `NA` in our boolean mask, it wouldn't operate properly.)

In [41]:
a_mask = df['title'].str.startswith('A', na=False)
score_mask = df['score'] > 100

# Write the code to retrieve the rows where the title starts with
# 'A' and the score is greater than 100 from the dataframe.

We can use the `sample()` method to get a random sample of some of our data:

In [42]:
df['time'].sample(10)

766    1688898121
858    1688910452
153    1688678282
446    1688848541
927    1688915784
620    1688878996
307    1688791624
120    1688653381
104    1688646609
700    1688890000
Name: time, dtype: int64

These `time` values are stored as [UNIX timestamps](https://en.wikipedia.org/wiki/Unix_time), the number of seconds since January 1, 1970 at midnight in the UTC timezone. We can convert them to Python `datetime` objects and create a human readable string.

In [None]:
import datetime

t = 1688889269
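# Convert a UNIX timestamp to a Python datetime object in UTC.
# Without the tz argument, fromtimestamp() uses the computer's local timezone.
dt = datetime.datetime.fromtimestamp(t, tz=datetime.timezone.utc)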
# Format the datetime as human readable
dt.strftime('%Y-%m-%d %H:%M:%S')

What if we wanted to calculate some value for all links posted in a given day, month or year? It would be useful to have this information as separate columns on our Dataframe. We can do that by first converting the timestamps using the Pandas `to_datetime` function, and then creating new columns from each of the components.

In [None]:
# Create a temporary Series that stores each timestamp as a datetime
df_dt = pd.to_datetime(df['time'], unit='s')
print(df_dt.head())

# Create new columns ('year', 'month' and 'day') for the components
# of the datetime in the df_dt Series.
df['year'] = df_dt.apply(lambda dt: dt.year)
df['month'] = df_dt.apply(lambda dt: dt.month)
df['day'] = df_dt.apply(lambda dt: dt.day)

The `apply()` method of a Series runs the given function on each value in the Series and returns a new Series of the same length, where each value is the result of the function. (Dataframes have an `apply()` method too, which works on a whole row or column at a time.) So for example:

| df_dt value | dt.year | dt.month | dt.day |
|-|-|-|-|
|datetime(2023, 7, 9)|2023|7|9|
|datetime(2023, 7, 9)|2023|7|9|
|datetime(2023, 7, 5)|2023|7|5|

The `lambda` keyword lets us define ultra simple, one line anonymous functions. The code:

```python
df_dt.apply(lambda dt: dt.year)
```

Is equivalent to:

```python
def get_year(dt):
    return dt.year

df_dt.apply(get_year)
```

*Special note: if you use the second syntax, there are no parentheses after get_year when we pass it to the `apply()` method. That's because we don't want to call get_year and return the result to `apply`, but rather we want to pass the entire function as an argument to `apply`.*
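
As an aside, for datetime Series specifically, Pandas also offers the `.dt` accessor, which gets at the same components without writing a function at all. The sketch below uses two made-up timestamps and shows that both approaches agree:

```python
import pandas as pd

timestamps = pd.Series([1688889269, 1688560000])
demo_dt = pd.to_datetime(timestamps, unit='s')

# The apply() approach from above
years_via_apply = demo_dt.apply(lambda dt: dt.year)

# The .dt accessor exposes year, month, day, etc. for the whole Series
years_via_accessor = demo_dt.dt.year
days_via_accessor = demo_dt.dt.day

print(years_via_apply.tolist())
print(years_via_accessor.tolist())
print(days_via_accessor.tolist())
```

`apply()` remains the more general tool, since it works with any function you can write.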

We can see that our new columns have been added.

In [None]:
df.head()

Now let's try to figure out the mean scores for each day in our dataset. This is a simple one-liner where we use the `groupby()` method to segregate the table based on the value of one column, then provide a function to apply to all of the values in each group, keeping them grouped.

In [None]:
df[['score', 'day']].groupby('day').mean()
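
To make it easier to see what `groupby()` is doing, here is the same idea on a tiny, made-up Dataframe where we can check the answers by hand. The `agg()` call at the end is one way to compute several aggregates at once.

```python
import pandas as pd

demo_df = pd.DataFrame({
    'day': [5, 5, 9, 9, 9],
    'score': [10, 30, 1, 2, 3],
})

# Day 5 has scores 10 and 30; day 9 has scores 1, 2 and 3
mean_by_day = demo_df.groupby('day')['score'].mean()
print(mean_by_day)

# Several aggregate functions at once, one column per function
summary = demo_df.groupby('day')['score'].agg(['count', 'mean', 'max'])
print(summary)
```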

While it seems odd that the scores decrease day after day, it does make some sense. Links that have been posted earlier in the week have had more time to accumulate score. Let's double check the max score for items on day 9.

In [None]:
df[df['day'] == 9]['score'].max()

That's it for this lesson! There was a *lot* of material, I know!

SQL is a powerful and ubiquitous language that is used in almost every web application, as well as in most companies for storing and querying some type of data. If you're interested, you should definitely follow up with some online SQL tutorials and resources to learn more. You can practice in your Jupyter notebook with SQLite.

Hopefully you've also learned a bit about Pandas dataframes. Data analysts like using Pandas because it is easy to load and work with the data, and many questions about the data can be answered in a single Python line. Additionally, many use Pandas right inside a Jupyter notebook like this one because it allows them to easily run single lines of code without reloading all of the data by running an entire Python script each time.