Welcome to Week 6 of the [Noisebridge Python Class](https://github.com/audiodude/PythonClass)!

In this lesson, we will explore doing data analysis using the popular [**Pandas**](https://pandas.pydata.org/) library.

You will learn:

* Loading a CSV file in Pandas into a **Dataframe**.
* Inspecting the data, including getting summary statistics.
* Filtering the data using **Boolean Masks**.
* Assigning new derived columns to the Dataframe.
* Using aggregate functions.

Let's get started!

In [None]:
import pandas as pd

First we must import the pandas library. It's common practice to import pandas `as pd`. If you remember all the way back to week 1, this syntax allows us to refer to pandas using the shortened name `pd`. This syntax is also useful if you have multiple libraries whose names would otherwise conflict.

Next, we read a CSV file into a Pandas **Dataframe**. A Dataframe is like a sheet in a spreadsheet, or a table in an SQL database. It is two-dimensional, with a row for each data item and a column for each piece of data relating to that item.

In [None]:
df = pd.read_csv('links.csv')

We will be using a CSV that contains recent (July 2023) data on links submitted to the [Hacker News](https://news.ycombinator.com/) link aggregation service.

The CSV contains a header row with all of the column names. This is automatically used as the **index** of the columns in the Dataframe, which will provide labels for them.

The `info()` method gives us an idea of how many rows there are in the table, how many columns there are, how populated or sparse they are (the number of rows that contain non-null data), and the datatype associated with each column. Pandas automatically assigns a datatype to each column based on the data that it finds there.

In [None]:
df.info()

We can also get the number of rows and columns of the Dataframe with the `shape` attribute:

In [None]:
df.shape
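
To see what these report without needing `links.csv`, here is a small sketch on a made-up Dataframe (the `tiny_df` name, column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical data, not taken from links.csv
tiny_df = pd.DataFrame({
    'title': ['First post', 'Second post', None],
    'score': [10, 250, 42],
})

# shape is a (rows, columns) tuple
print(tiny_df.shape)

# count() gives the number of non-null values in a column,
# the same figure info() lists under "Non-Null Count"
print(tiny_df['title'].count())
```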

Pandas Dataframes act in some ways like 2D arrays, or a list of lists. Imagine you had the following Python list:

In [None]:
data = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

You could access the individual items in the `data` 2D array by specifying a row index and a column index:

In [None]:
data[1][1]

In a similar way, we can access a specific row and column in the Dataframe:

In [None]:
df.iloc[5, 7]

The `iloc` indexer refers to data by its integer "coordinates" in the Dataframe. We can also use the `loc` indexer to refer to data by its labels, such as the column name, which is usually more convenient:

In [None]:
df.loc[5, 'url']

We can use slice notation to select a range of rows, with a specific column, and use the `head()` method to see the first few rows:

In [None]:
five_through_ten_url = df.loc[5:10, 'url']
five_through_ten_url.head()

Note that label-based `loc` slices *include* the row at the trailing index, unlike Python slices:

In [None]:
fruits = ['apple', 'banana', 'orange', 'pear']
fruits[1:3]
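
For comparison, position-based slices made with `iloc` follow the ordinary Python rule and leave the end out. A small sketch with made-up data (the `letters` Dataframe is invented for illustration):

```python
import pandas as pd

# Hypothetical data for illustration
letters = pd.DataFrame({'letter': ['a', 'b', 'c', 'd', 'e']})

# Label-based slice with loc: rows 1, 2 and 3 (end label included)
by_label = letters.loc[1:3, 'letter']

# Position-based slice with iloc: rows 1 and 2 (end position excluded)
by_position = letters.iloc[1:3, 0]

print(list(by_label))     # ['b', 'c', 'd']
print(list(by_position))  # ['b', 'c']
```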

We can pass in a list of column names to select as well, and use `:` for all rows (similar to the Python code `fruits[:]` which selects all elements of a list and serves to make a copy).

In [None]:
my_fruits = fruits[:]
my_fruits.append('cherry')
print(fruits)
print(my_fruits)

In [None]:
all_rows_url_title = df.loc[:, ['url', 'title']]
all_rows_url_title.head()

Because we will often be selecting entire columns, Pandas provides a shortcut notation for that:

In [None]:
all_scores = df['score']
all_scores.head()

Note that this returned a Pandas **Series** object, which is a separate data container that contains only 1 column. We can calculate various basic statistics on the series:

In [None]:
all_scores.describe()

We can also use methods directly on the series:

In [None]:
all_scores.max()

In Pandas, `NULL` values (Python `None`) are referred to as "NA". Due to quirks in Python and NumPy (which Pandas is based on), the presence of an `NA` in an integer column automatically causes the column to be converted to float (see the decimal points), with `NaN` (Not a Number) used as the `NA` value.
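
A minimal sketch of that conversion, using two made-up Series (the values are invented; only the datatypes matter here):

```python
import pandas as pd

# Hypothetical scores; the None stands in for a missing value
complete = pd.Series([10, 20, 30])
with_gap = pd.Series([10, None, 30])

print(complete.dtype)   # int64
print(with_gap.dtype)   # float64

# isna() marks the missing entries, and statistics skip them by default
print(with_gap.isna().sum())
print(with_gap.max())
```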

Now let's look at a few more operations on Dataframes, using a new tiny Dataframe with fruit prices.

In [None]:
fruits_df = pd.DataFrame({'name': ['apple', 'banana', 'orange'], 'price': [1.29, .89, 2.29]})

We can add 10 cents to each price with one operation.

In [None]:
# Fruit price goes up by 10 cents
fruits_df['price'] += .1
fruits_df.head()

Note that this does not work in plain Python. You can't add a scalar to a list; doing so raises a `TypeError`.

In [None]:
numbers = [10, 20, 30, 40]
# Raises TypeError: can only concatenate list (not "int") to list
numbers + 100

Pandas "overloads" the addition operator in its Dataframe class to allow for special operations like this. All of the operations you'd expect, like `+`, `-`, `/`, `*`, `%` and of course the shortcuts like `+=` and `*=`, work for Pandas Dataframes.
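
As a rough comparison, here is the same "add 100 to everything" operation written for a plain list and for a Pandas Series. The variable names are invented for illustration:

```python
import pandas as pd

values = [10, 20, 30, 40]

# Plain Python: build a new list one element at a time
plus_100_list = [n + 100 for n in values]

# Pandas: the operator is applied to every element at once
plus_100_series = pd.Series(values) + 100

print(plus_100_list)
print(plus_100_series.tolist())
```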

We can also assign to individual values, or entire (potentially new) columns in our Dataframe.

In [None]:
fruits_df.shape

In [None]:
fruits_df.loc[1, 'price'] = 0.69
# To add a row, we construct a new Dataframe and concatenate the two.
# Note that pd.concat() returns a new Dataframe; it does not modify
# the original. ignore_index=True renumbers the rows 0 through 3.
fruits_df = pd.concat([fruits_df, pd.DataFrame({'name': ['grape'], 'price': [0.1]})], ignore_index=True)
fruits_df['on_sale'] = [False, True, False, False]
fruits_df.head()
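
A new column can also be computed from existing columns rather than typed in by hand. A small sketch, assuming a made-up `basket` Dataframe with invented prices and quantities:

```python
import pandas as pd

# Hypothetical prices and quantities for illustration
basket = pd.DataFrame({
    'name': ['apple', 'banana', 'orange'],
    'price': [1.5, 0.5, 2.0],
    'quantity': [4, 6, 1],
})

# A new column computed row by row from two existing columns
basket['total'] = basket['price'] * basket['quantity']
print(basket)
```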

---

We can filter rows in our Dataframe using **Boolean Masks**. A Boolean Mask is a Dataframe or Series that contains only boolean values. It is not a separate data type.

In [None]:
fruits_df.head()

In [None]:
mask = fruits_df['price'] > 2
mask.head()

The mask contains one column, and the value in every row is either `True` or `False`. When we index the `fruits_df` Dataframe using the mask, it returns only the rows for which the mask is `True`. So in this example, it skips the apple, banana and grape rows, where the mask is `False`, and returns only the orange row (index 2). Note that all columns for the row are returned by default.

In [None]:
fruits_over_2 = fruits_df[mask]
fruits_over_2.head()

What if we want to combine conditions, like we do with normal boolean values? What if we want all of the fruits that have a price over 2 and don't start with 'a'? First, we have to use the special syntax `.str.startswith('a')` to use the `str` method `startswith`. This is because Pandas can't overload any operator to indicate the startswith method, so this syntax specifies "Apply the `str` method `startswith` to every row of the Series and create a new Series with the corresponding boolean value".

In [None]:
'apple'.startswith('a')

In [None]:
mask = fruits_df['price'] > 2
starts_with_a_mask = fruits_df['name'].str.startswith('a')
print(mask.head())
print(starts_with_a_mask.head())

Now we can use [bitwise operators](https://wiki.python.org/moin/BitwiseOperators) to emulate Python's boolean operators.

In [None]:
fruits_df[mask & ~starts_with_a_mask].head()

Here is a mapping of bitwise operators and their Python equivalents, when dealing with Boolean Masks:

| Python | Pandas Boolean Mask |
| ------ | ------------------- |
| and    | & |
| or     | \| |
| not    | ~ |

First, Pandas computed a final boolean mask by performing all the operations on the boolean masks we provided.

In [None]:
(mask & ~starts_with_a_mask).head()

Then that boolean mask was applied to the fruits Dataframe as we've seen before.
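
When the conditions are written inline instead of being stored in variables first, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `>` or `==`. A small sketch with a made-up `produce` Dataframe:

```python
import pandas as pd

# Hypothetical data for illustration
produce = pd.DataFrame({
    'name': ['apple', 'avocado', 'orange'],
    'price': [1.0, 3.0, 2.5],
})

# Parentheses around the comparison: & binds more tightly than >
pricey_a = produce[(produce['price'] > 2) & produce['name'].str.startswith('a')]

# | keeps rows where either condition holds
cheap_or_orange = produce[(produce['price'] < 2) | (produce['name'] == 'orange')]

print(pricey_a['name'].tolist())         # ['avocado']
print(cheap_or_orange['name'].tolist())  # ['apple', 'orange']
```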

---

Now that we've learned some basics, let's try to answer some questions about our dataset. How many links have a score over 100?

In [None]:
df[df['score'] > 100].shape[0]

What about the number of links with titles that start with 'A'?

In [None]:
df[df['title'].str.startswith('A', na=False)].shape[0]

Can we combine these masks to find all rows with a score over 100 and a title that starts with 'A'? (Note: we use `na=False` to tell Pandas that if it finds an `NA` value in the Series, it should put `False` in the output instead of an `NA`. A boolean mask that contains an `NA` can raise an error when it is used to filter a Dataframe.)

In [None]:
a_mask = df['title'].str.startswith('A', na=False)
score_mask = df['score'] > 100

a_and_over_100 = df[a_mask & score_mask]
print(a_and_over_100.shape[0])
a_and_over_100.head()
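
To see the effect of `na=False` in isolation, here is a small sketch on a made-up Series of titles, one of which is missing (the titles are invented for illustration):

```python
import pandas as pd

# Hypothetical titles; None stands in for a missing title
titles = pd.Series(['Ask HN: Example', None, 'Show HN: Example'])

# With na=False, the missing title counts as "does not start with A"
starts_with_a = titles.str.startswith('A', na=False)
print(starts_with_a.tolist())  # [True, False, False]

# The mask is now all True/False, so it is safe to filter with
print(titles[starts_with_a].tolist())  # ['Ask HN: Example']
```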

We can use the `sample()` method to get a random sample of some of our data:

In [None]:
df['time'].sample(10)

These `time` values are stored as [UNIX timestamps](https://en.wikipedia.org/wiki/Unix_time), the integer number of seconds since midnight UTC on January 1, 1970. We can convert them to Python `datetime` objects and create a human-readable string.

In [None]:
import datetime

t = 1688889269
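# Convert a UNIX timestamp to a Python datetime object. Passing tz=UTC
# keeps the result in UTC instead of the computer's local timezone.
dt = datetime.datetime.fromtimestamp(t, tz=datetime.timezone.utc)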
# Format the datetime as human readable
dt.strftime('%Y-%m-%d %H:%M:%S')

What if we wanted to calculate some value for all links posted in a given day, month or year? It would be useful to have this information as separate columns on our Dataframe. We can do that by first converting the timestamps using the Pandas `to_datetime` function, and then creating new columns from each of the components.

In [None]:
# Create a temporary Series that stores each timestamp as a datetime
df_dt = pd.to_datetime(df['time'], unit='s')
print(df_dt.head())

# Create new columns ('year', 'month' and 'day') for the components
# of the datetime in the df_dt Series.
df['year'] = df_dt.apply(lambda dt: dt.year)
df['month'] = df_dt.apply(lambda dt: dt.month)
df['day'] = df_dt.apply(lambda dt: dt.day)

The `apply()` method runs the given function on each value in a Series and returns a new Series of the same length, where each value is the result of the function. (On a Dataframe, `apply()` instead passes each whole column, or each row with `axis=1`, to the function.) So for example:

| df_dt value | dt.year | dt.month | dt.day |
|-|-|-|-|
|datetime(2023, 7, 9)|2023|7|9|
|datetime(2023, 7, 9)|2023|7|9|
|datetime(2023, 7, 5)|2023|7|5|

The `lambda` keyword lets us define ultra-simple, one-line anonymous functions. The code:

```
df_dt.apply(lambda dt: dt.year)
```

Is equivalent to:

```
def get_year(dt):
  return dt.year
  
df_dt.apply(get_year)
```

(Special note: if you use the second syntax, there are no parentheses after `get_year` when we pass it to the `apply()` method. That's because we don't want to call `get_year` and pass the result to `apply`, but rather we want to pass the function itself as an argument to `apply`.)
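
For datetime components specifically, Pandas also offers the `.dt` accessor, which gives the same values without writing a function. A small sketch on two made-up timestamps (the `stamps` Series is invented for illustration):

```python
import pandas as pd

# Hypothetical UNIX timestamps (seconds), for illustration only
stamps = pd.Series([1688889269, 1688515200])
as_dt = pd.to_datetime(stamps, unit='s')

# The .dt accessor exposes datetime components for the whole Series
years = as_dt.dt.year
days = as_dt.dt.day

# Same result as apply() with a lambda
same = years.tolist() == as_dt.apply(lambda dt: dt.year).tolist()
print(years.tolist(), days.tolist(), same)
```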

We can see that our new columns have been added.

In [None]:
df.head()

Now let's try to figure out the mean scores for each day in our dataset. This is a simple one-liner where we use the `groupby()` method to segregate the table based on the value of one column, then provide a function to apply to all of the values in each group, keeping them grouped.

In [None]:
df[['score', 'day']].groupby('day').mean()

While it seems odd that the scores decrease day after day, it does make some sense. Links that have been posted earlier in the week have had more time to accumulate score. Let's double check the max score for items on day 9.

In [None]:
df[df['day'] == 9]['score'].max()
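
The grouping idea is easier to check by hand on a tiny table. The sketch below uses a made-up `links` Dataframe (values invented for illustration) and also shows `agg()`, which computes several aggregates per group in one call:

```python
import pandas as pd

# Hypothetical scores for two days, for illustration only
links = pd.DataFrame({
    'day': [8, 8, 9, 9, 9],
    'score': [100, 200, 10, 20, 60],
})

# Mean score per day
mean_by_day = links.groupby('day')['score'].mean()
print(mean_by_day)

# Several aggregates at once with agg()
summary = links.groupby('day')['score'].agg(['count', 'mean', 'max'])
print(summary)
```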

That's it for this lesson!

Hopefully you've learned the basics of working with CSV data in a Pandas Dataframe. Data analysts like using Pandas because it is easy to load and work with the data, and many questions about the data can be answered in a single line of Python. Additionally, many use Pandas right inside a Jupyter notebook like this one because it allows them to run single lines of code without reloading all of the data, as they would have to when running an entire Python script each time.