# Introduction to Jupyter

This is a Jupyter notebook. Notebooks are an environment for exploring code and documenting your thought process as you go.

Fundamentally, a notebook is a collection of cells.

Cells can contain code or text. The text isn't just plain text: it is written in Markdown, which lets you add basic formatting. For example, to add a heading to a Markdown cell, use the `#` character, like this:

`# This is a heading` produces:

# This is a heading

This is not.


---

Now let's see some code. Jupyter notebooks aren't specific to Python and can be used for other programming languages, but for now we'll assume that "code" means Python.

We can run the code below by clicking the Run button or pressing `Ctrl+Enter` (or `Shift+Enter` which then moves to the cell below).

In [None]:
print("Hello world!")

---

# Introduction to `pandas`

`pandas` is one of the most important data science libraries in Python. It is used for:

- reading in and joining data from multiple sources
- exploring a dataset
- manipulating and reshaping data
- summary descriptive statistics
- visualisation (with the help of other libraries)

One of the most important things `pandas` introduces is new data types to work with. These are the `DataFrame` and the `Series`.

Before we can look at them, let's read in some data.

In [None]:
import pandas as pd

In [None]:
loans = pd.read_csv("./data/loans.csv")

When you read in data, it's a `DataFrame`. Think of it as a 2-D table.

In [None]:
type(loans)

We can check the dimensions of our data:

In [None]:
loans.shape # (rows, columns) tuple

Or just get the length (number of rows):

In [None]:
len(loans)

We can also inspect the column names:

In [None]:
loans.columns

We can look at our data here in the Jupyter notebook:

In [None]:
loans.head()

`pandas` automatically assigns row numbers to our data, called the index:

In [None]:
loans.index

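The index of a `DataFrame` doesn't have to be row numbers: it can hold other labels such as strings or dates. `pandas` does not require the values to be unique, but unique labels make it unambiguous which row you are selecting.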
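
To see how `pandas` treats index labels in practice, here is a minimal sketch. The table is made up for illustration (it is not the loans data) and deliberately repeats the label `"a"`:

```python
import pandas as pd

# Hypothetical toy data, not the loans dataset
toy = pd.DataFrame(
    {"loan_amnt": [1000, 2500, 4000]},
    index=["a", "b", "a"],  # the label "a" appears twice
)

# pandas accepts the repeated label, and can report whether labels are unique
has_unique_index = toy.index.is_unique

# Selecting a repeated label returns every row that carries it
rows_a = toy.loc["a"]

print(has_unique_index)
print(rows_a)
```

Selecting `"a"` returns two rows here, which is why unique labels are usually easier to work with.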

Looks like our columns are a mix of numeric and non-numeric types. We can verify this by looking at the data types of the columns:

In [None]:
loans.dtypes

Anything that isn't a number, date, or boolean is typically read in as an `object`. Since version 1.0, `pandas` has also had a dedicated `string` type for text, and categorical data has its own `category` type (a more targeted data type, even when the values are text).
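
As a rough illustration of these types, the sketch below converts two columns of a small made-up table (the values are invented, although the column names mirror the loans data) with `astype`:

```python
import pandas as pd

# Hypothetical toy data, not the loans dataset
toy = pd.DataFrame({
    "emp_title": ["Teacher", "Manager", "Teacher"],
    "purpose": ["car", "house", "car"],
})

# Opt in to the dedicated text type and the categorical type
as_text = toy["emp_title"].astype("string")
as_category = toy["purpose"].astype("category")

print(as_text.dtype)
print(as_category.dtype)

# A categorical column stores each distinct value once
categories = list(as_category.cat.categories)
print(categories)  # ['car', 'house']
```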

Before we start changing our columns' data types, let's see what a single column looks like.

The syntax for accessing a column in a `DataFrame` is similar to accessing an item in a dictionary:

In [None]:
loans["loan_amnt"]

A single column (and a single row) in `pandas` has the `Series` data type:

In [None]:
type(loans["loan_amnt"])

You can select multiple columns by providing a list of column names:

In [None]:
loans[["id", "loan_amnt"]]

In [None]:
type(loans[["id", "loan_amnt"]])

Remember: single columns/rows are `Series`, anything that's 2-D is a `DataFrame`.

Technically you can select a single column as a `DataFrame` by providing it in a list. This is useful for machine learning models that *need* a `DataFrame` as an input, even if it's a single column.

In [None]:
loans[["loan_amnt"]]

In [None]:
type(loans[["loan_amnt"]])

One way to change a column's data type is to use `astype`.

In [None]:
loans["loan_amnt"].astype(int)

If we look at the column again...

In [None]:
loans["loan_amnt"].dtype

It hasn't changed!

**Important**: in `pandas` most methods return *copies* of the data and do not change it.

You need to be *explicit* when you want your source data to change:

In [None]:
loans["loan_amnt"] = loans["loan_amnt"].astype(int)

loans["loan_amnt"].dtype

## Column transformations

A brief aside - why use `pandas` for data analysis?

One big reason is speed. `pandas` encourages **vectorised** operations, which means performing an operation on *all values in a column of data at once*.

Let's compare two approaches:

In [None]:
%%timeit

a = list(range(100_000))

new_list = []

for num in a:
    new_list.append(num**2)

In [None]:
%%timeit

numbers = pd.Series(range(100_000))

# apply the operation to the entire Series at once
squares = numbers ** 2

The biggest change when moving from plain Python to `pandas` is getting rid of loops and working in a vectorised way.
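
The timing cells above compare speed. As a small sanity check (a sketch on ten made-up numbers rather than 100,000), we can confirm that the loop and the vectorised expression give the same result:

```python
import pandas as pd

values = list(range(10))

# Loop version: build the result one element at a time
squares_loop = []
for num in values:
    squares_loop.append(num ** 2)

# Vectorised version: one expression over the whole Series
squares_vectorised = pd.Series(values) ** 2

# Both approaches produce the same numbers
same = squares_vectorised.tolist() == squares_loop
print(same)  # True
```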

### Column operations with `pandas`

Let's calculate each loan's installment as a fraction of the total loan amount (multiply by 100 if you want a percentage):

In [None]:
loans["installment"] / loans["loan_amnt"]

Column operations return a `Series`, which we can assign to new column names to create new columns:

In [None]:
loans["installment_pct"] = loans["installment"] / loans["loan_amnt"]

loans.head()

You can also delete a column you don't like:

In [None]:
loans.drop(columns=["installment_pct"])

In [None]:
loans.head()

Uh oh, it's still there!

Again, that's because `.drop()` doesn't modify the source data.

We can just overwrite our `DataFrame`:

In [None]:
loans = loans.drop(columns=["installment_pct"])
loans.head()

When a column has the right type, we can perform type-specific operations on it:

In [None]:
loans["emp_title"] = loans["emp_title"].astype("string")

In [None]:
loans["emp_title"].str.lower()

In [None]:
loans["emp_title"].str.lower().str.replace("manager", "mgr")

We can also slice strings in a `pandas` column the same way as Python strings:

(relevant `pandas` documentation: https://pandas.pydata.org/docs/user_guide/text.html)

In [None]:
loans["emp_title"].str[:5]

For categorical data, such as the purpose of a loan, we can look at the unique values in a column:

In [None]:
loans["purpose"].unique()

And we can count the number of unique values (either by counting the values returned above, or by using `.nunique`):

In [None]:
loans["purpose"].nunique()

If you ever want to export the altered version of your data, you can do that!

In [None]:
# set index=False if you don't want to export the index as a separate column
loans.to_csv("loans_new.csv", index=False)
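
To see what `index=False` changes, here is a small sketch on a made-up two-row table. It relies on the fact that `to_csv` returns the CSV text as a string when no file path is given, so nothing is written to disk:

```python
import io

import pandas as pd

# Hypothetical toy data, not the loans dataset
toy = pd.DataFrame({"id": [101, 102], "loan_amnt": [1000, 2500]})

with_index = toy.to_csv()                # index is written as an unnamed first column
without_index = toy.to_csv(index=False)  # only the real columns are written

header_with = with_index.splitlines()[0]
header_without = without_index.splitlines()[0]
print(header_with)     # ,id,loan_amnt
print(header_without)  # id,loan_amnt

# Reading the index-free text back gives the same table
round_trip = pd.read_csv(io.StringIO(without_index))
```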

## Missing data

Typically we look at the number of missing records for each column:

In [None]:
loans.isnull().sum()

You can also get that as a proportion of all rows (multiply by 100 for a percentage):

In [None]:
loans.isnull().sum() / len(loans)

We have a couple of options for handling missing values. We can:

- fill in the missing values with a placeholder like "unknown"
- drop rows with missing values

In [None]:
# .fillna() also doesn't change the underlying data!
loans["emp_title"] = loans["emp_title"].fillna("Unknown")

We can specify which columns to take into account when dropping data, and whether *all* of those columns need to be missing for us to drop a row, or *any* of them.

In [None]:
loans.dropna(subset=["emp_title", "emp_length"], how="all")
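
The difference between `how="all"` and `how="any"` is easiest to see on a tiny example. The sketch below uses a made-up three-row table (the column names mirror the loans data, the values are invented):

```python
import pandas as pd

# Hypothetical toy data: row 0 is complete, row 1 lacks a title, row 2 lacks both
toy = pd.DataFrame({
    "emp_title": ["Teacher", None, None],
    "emp_length": ["5 years", "2 years", None],
})

# Share of missing values per column (same as isnull().sum() / len(toy))
missing_share = toy.isnull().mean()

# how="all": drop a row only if every listed column is missing
dropped_all = toy.dropna(subset=["emp_title", "emp_length"], how="all")

# how="any": drop a row if at least one listed column is missing
dropped_any = toy.dropna(subset=["emp_title", "emp_length"], how="any")

print(len(dropped_all))  # 2
print(len(dropped_any))  # 1
```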

<h1 style="color: #fcd805">Exercise: Exploratory Data Analysis with pandas</h1>

For the `pandas` exercises, you will gradually explore a new dataset of Kickstarter projects.

Kickstarter is a site that lets you crowdfund your project ideas. The dataset shows information about such projects including whether they succeeded or failed.

1. Read the file `kickstarter.csv.gz` from the `data` folder into a `pandas` `DataFrame` and inspect the data with the `.head` method.

Note: the `.gz` ending indicates this is a gzip-compressed CSV file. Compression greatly reduces the file size without losing any data, and the file can be read in exactly like a normal CSV file (`pandas` infers the compression from the file extension, so there is nothing extra to do).

2. How many rows and columns are there?

3. Check the data type of each column. Do any of them look incorrect?

4. Are there any missing values? If so, what should be done about them?

5. Create a new column to calculate the percentage of the goal that was achieved. This should be the amount pledged as a percentage of the goal.

6. Drop the `usd pledged` column as it has some incorrect values in it.

7. Convert the `name` column to the `string` type.

8. How many main categories are there in the data, and what are they?

# Filtering

In `pandas` we can filter a dataset similarly to slicing a list, using square brackets `[]`. Within the square brackets should be a logical condition.

In `pandas`, this should be a `Series` of boolean values:

In [None]:
loan_filter = loans["loan_amnt"] > 30_000

loan_filter

Which we can then apply to the `DataFrame`:

In [None]:
loans[loan_filter].head()

Or you can do it in one go:

In [None]:
loans[loans["loan_amnt"] > 30_000].head()

Or you can use `.query()` for a more SQL-like filtering syntax:

In [None]:
loans.query("loan_amnt > 30000").head()

More documentation on query and its options and limitations: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
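
Two `.query()` features are worth knowing: referring to a Python variable with `@`, and wrapping a column name that contains a space in backticks. A minimal sketch, assuming a made-up table whose column names are chosen only for illustration:

```python
import pandas as pd

# Hypothetical toy data; "usd pledged" is included only to show a name with a space
toy = pd.DataFrame({
    "loan_amnt": [35_000, 12_000, 40_000],
    "usd pledged": [10.0, 250.0, 90.0],
})

threshold = 30_000

# @threshold refers to the Python variable defined above
large = toy.query("loan_amnt > @threshold")

# Backticks let us reference a column whose name contains a space
well_funded = toy.query("`usd pledged` > 100")

print(len(large))        # 2
print(len(well_funded))  # 1
```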

We can combine logical conditions too:

- unlike in plain Python, we don't use `and` and `or`; we use `&` and `|`
- each condition should be wrapped in its own parentheses

In [None]:
loans[(loans["loan_amnt"] > 30_000) & (loans["int_rate"] > 15)].head()

In [None]:
loans[(loans["loan_amnt"] > 30_000) | (loans["int_rate"] > 15)].head()
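
The parentheses matter because `&` and `|` bind more tightly than comparisons such as `>` in Python. One way to keep combined filters readable is to name each condition first. This is a sketch on a made-up table, not the loans data:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "loan_amnt": [35_000, 40_000, 10_000, 5_000],
    "int_rate": [20.0, 5.0, 18.0, 3.0],
})

# Each condition is its own boolean Series
big = toy["loan_amnt"] > 30_000
pricey = toy["int_rate"] > 15

both = toy[big & pricey]    # rows where both conditions hold
either = toy[big | pricey]  # rows where at least one condition holds

print(len(both))    # 1
print(len(either))  # 3
```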

We can also do type-specific filtering:

In [None]:
loans["emp_title"].str.lower().str.contains("teacher")

In [None]:
teachers = loans[loans["emp_title"].str.lower().str.contains("teacher")]
teachers.head(10)

Any logical filters work, for example finding missing data:

In [None]:
loans[loans["emp_length"].isnull()].head()

### `.loc` and `.iloc`

You can also do 2-D filtering (filtering rows and columns at the same time) using `.loc` or `.iloc`.

`.iloc` uses *integer indices*:

In [None]:
# first row
loans.iloc[0]

We can use slice notation to get multiple rows:

In [None]:
loans.iloc[0:5]

And get 1 or more columns too:

In [None]:
loans.iloc[0:5, 0]

In contrast, `.loc` filters on the *actual* index (whether it's an integer or not).

So `.iloc[0]` always means the first row by position, whereas `.loc[0]` means the row whose index label is 0, wherever that row sits in the `DataFrame`.

In [None]:
loans.loc[0]

Column filtering with `.loc` is done by *name*:

In [None]:
loans.loc[0:5, "loan_amnt"]

And we can slice column names!

In [None]:
loans.loc[0:5, "loan_amnt":"int_rate"]

Finally, `.loc` allows boolean filtering:

In [None]:
loans.loc[loans["int_rate"] < 5.5, "int_rate"]
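
With the default index, `.loc` and `.iloc` can look interchangeable. The sketch below uses a made-up table whose labels run in reverse to show where they differ, including the fact that a `.loc` slice includes its end label while an `.iloc` slice excludes its end position:

```python
import pandas as pd

# Hypothetical toy data with index labels in reverse order
toy = pd.DataFrame(
    {"loan_amnt": [1000, 2000, 3000, 4000]},
    index=[3, 2, 1, 0],
)

first_by_position = toy.iloc[0, 0]          # first row, first column
labelled_zero = toy.loc[0, "loan_amnt"]     # row with index label 0 (the last row here)

print(first_by_position)  # 1000
print(labelled_zero)      # 4000

# Slicing: same-looking slices return different numbers of rows
default = toy.reset_index(drop=True)  # back to labels 0, 1, 2, 3
by_position = default.iloc[0:2]       # positions 0 and 1
by_label = default.loc[0:2]           # labels 0, 1 and 2

print(len(by_position))  # 2
print(len(by_label))     # 3
```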

## Descriptive statistics

`pandas` has built-in methods for getting some statistical summaries of our data:

In [None]:
loans.describe()

By default, only numeric values are shown but we can change that:

In [None]:
loans.describe(include="all")

Individual columns can also be summarised:

In [None]:
loans["loan_amnt"].mean()

In [None]:
loans["loan_amnt"].median()

You can also use `.agg` to request multiple summaries:

In [None]:
loans["loan_amnt"].agg(["min", "max"])

<h1 style="color: #fcd805">Exercise: Filtering and Descriptive Statistics</h1>

We're going to continue working on the Kickstarter data from the previous exercise.

1. How many projects are in the Music category?

2. How many projects in the Music category *succeeded*?

3. How many projects in the Music category contain the word "song"?

4. How many projects are in the Music and Film & Video categories in total?

5. What are the smallest and biggest goals in the dataset?

6. What is the average number of backers a project received?

7. What is the average number of backers that *successful* projects received?

_Hint: Think about the order of operations of how to answer this. What do you need to do first?_

# Sorting and aggregating

We can sort our data:

In [None]:
loans.sort_values("grade").head()

We can sort by multiple values:

In [None]:
loans.sort_values(["grade", "sub_grade"]).head()

We can change the direction of sorting either for a single column, or different sort orders for multiple columns:

In [None]:
loans.sort_values("grade", ascending=False).head()

In [None]:
loans.sort_values(["grade", "sub_grade"], ascending=[True, False]).head()

For categorical data, we can look at the frequencies of the values using `value_counts`:

In [None]:
loans["home_ownership"].value_counts()

Or as a proportion of the total (multiply by 100 for a percentage):

In [None]:
loans["home_ownership"].value_counts(normalize=True)

The default for this table is to sort values in descending order, but we can sort by the labels (which is the index):

In [None]:
loans["home_ownership"].value_counts().sort_index()

### Groupby

Like in SQL and other data-specific languages, we can use `groupby` to create subsets of our data and aggregate them separately:

In [None]:
loans.groupby("home_ownership")

That doesn't do anything yet because we haven't chosen how to summarise each group!

In [None]:
loans.groupby("home_ownership")["loan_amnt"].median()

The default here is to order by label (index) but we can sort on the values:

In [None]:
loans.groupby("home_ownership")["loan_amnt"].median().sort_values(ascending=False)

We can provide multiple columns to summarise:

In [None]:
loans.groupby("home_ownership")[["loan_amnt", "int_rate"]].median()

Or multiple aggregations on a single column:

In [None]:
loans.groupby("home_ownership")["loan_amnt"].agg(["min", "median", "max"])

Or multiple aggregations for multiple columns!

In [None]:
loans.groupby("home_ownership")[["loan_amnt", "int_rate"]].agg(["min", "median", "max"])
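
To check your understanding of `groupby`, it helps to run it on data small enough to verify by hand. A minimal sketch with a made-up five-row table (the column names mirror the loans data, the values are invented):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "home_ownership": ["RENT", "OWN", "RENT", "OWN", "RENT"],
    "loan_amnt": [1000, 5000, 3000, 7000, 2000],
})

# One row per group, one column per aggregation
summary = toy.groupby("home_ownership")["loan_amnt"].agg(["min", "median", "max"])
print(summary)

# RENT loans are 1000, 3000 and 2000, so the median is 2000
rent_median = summary.loc["RENT", "median"]
print(rent_median)
```

The group labels become the index of the result, which is why `summary.loc["RENT", "median"]` works.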

<h1 style="color: #fcd805">Exercise: Sorting and Aggregating</h1>

Back to the Kickstarter data.

1. What is the **total** amount pledged *by category*?

You want to end up with a dataset of one line per category, showing the total pledged amount for each category.

2. What is the breakdown of the state of projects? That is, how many have failed, succeeded etc.? Calculate the answer both as absolute numbers and percentages.

3. Which category has the highest *average* pledged amount?

4. Find the most expensive (i.e. highest goal) project in the Photography category.

5. Find the project in the Food category with the highest number of backers.

6. **BONUS** Find the project with the longest name.

_Hint: figure out how to calculate the length of the names first!_

# Combining data

`pandas` supports much more than just CSV files. It can connect to many different data sources.

Once data is read into `pandas` it is always a `DataFrame` regardless of where it came from. This is one of the strengths of `pandas` and it means you can combine data *from different sources*.

Let's see how we could connect to a SQL database.

In [None]:
import sqlite3

conn = sqlite3.connect("./data/movies.sqlite")

type(conn)

There are 3 tables in this database:

- IMDB (a list of films)
- earning (the amount of money grossed by each film)
- genre (the genre of each film)

We can use `pandas` to directly run a SQL query on our database and save the data as a `DataFrame`.

In [None]:
films = pd.read_sql("""
SELECT
    *
FROM
    IMDB
""", conn)

type(films)

In [None]:
films.head()

Let's also select the genre data:

In [None]:
genres = pd.read_sql("""
SELECT
    *
FROM
    genre
""", conn)

genres.head()

We could join these tables directly in the database:

In [None]:
pd.read_sql("""
SELECT
    *
FROM
    IMDB
    JOIN genre ON IMDB.Movie_id = genre.Movie_id
""", conn).head(10)

Or we could join them in `pandas`.

We need to choose:

- the datasets to join (or "merge" as it's called in `pandas`)
- the column(s) to join on (can be different for each table)
- the type of join (inner, left, right, etc.)

In [None]:
films_merged = films.merge(genres, on="Movie_id", how="inner")

films_merged.head()

Uh oh, looks like our data is duplicated!

In [None]:
print(len(films), len(genres))

That's exactly 3 genre records per film.

Our options include:

- deduplicating at source (in the SQL query or even the database)
- leaving the data as-is (but we'd have to remember there are 3 rows per film)
- deduplicating *after* the join

Let's see this third option in action:

In [None]:
films_deduped = films_merged.drop_duplicates(subset=["Movie_id"], keep="first")

films_deduped.head()
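
The same pattern can be tried without the database. The sketch below uses two made-up tables and also shows `left_on`/`right_on`, which is how you join when the key column has a different name in each table (the name `film_ref` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data standing in for the films and genre tables
toy_films = pd.DataFrame({"Movie_id": [1, 2], "Title": ["Film A", "Film B"]})
toy_genres = pd.DataFrame({
    "film_ref": [1, 1, 2, 2],  # same key as Movie_id, under a different name
    "genre": ["Drama", "Crime", "Comedy", "Romance"],
})

# Each film matches two genre rows, so the merged table has 4 rows
merged = toy_films.merge(
    toy_genres, left_on="Movie_id", right_on="film_ref", how="inner"
)

# Keep one row per film; the other genre rows for that film are discarded
deduped = merged.drop_duplicates(subset=["Movie_id"], keep="first")

print(len(merged))   # 4
print(len(deduped))  # 2
```

Note that deduplicating this way keeps only one genre per film, so it is only appropriate when a single genre per film is enough for the analysis.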

<h1 style="color: #fcd805">Exercise: combining data</h1>

1. Select all the rows from the `earning` table in the movies database into a `pandas` `DataFrame`.

2. Now join the earnings data onto the deduplicated film+genre data (`films_deduped`).

You should now have a `DataFrame` with one row per film and with genre and earnings data added on at the end.

Verify that this is the case before moving on.

3. Which film earned the least **domestically**?

4. Which film earned the most **worldwide**?

5. How many films have a MetaCritic score of less than 75?

_Note: to answer this question you'll have to fix the data type of the column first and you may need to deal with some non-numeric values!_

6. Which genre has the highest total domestic earnings?

7. Convert the `Runtime` column to numeric.

_Hint: You'll have to perform some string manipulation on it before you can do this._

8. Now find the genre with the highest **median** runtime.

<h1 style="color: #fcd805">Exercise: pub names</h1>

Let's do some open-ended data analysis with `pandas`!

We're going to find out what the most common pub name is in the UK.

1. Read in the file `open_pubs.csv` from the `data` folder into a `pandas` `DataFrame` (data originally from https://www.getthedata.com/open-pubs).

2. Looks like there are no column headers!

Here is the data dictionary for the dataset:

|Field|Data type|Comments|
|---|---|---|
|fsa_id|int|Food Standards Agency's ID for this pub.|
|name|string|Name of the pub.|
|address|string|Address fields separated by commas.|
|postcode|string|Postcode of the pub.|
|easting|int| |
|northing|int| |
|latitude|decimal| |
|longitude|decimal| |
|local_authority|string|Local authority this pub falls under.|

Read the documentation for the `read_csv` method and figure out how to add column names to the data when you read it in.

3. Check for any missing data. Drop any row with no name, since we need values from that column.

4. Convert the `name` column (or whatever you called it) to the correct `string` type.

5. Now convert the values in the `name` column to lowercase so that names like "The King's Arms" and "The king's arms" are treated as the same name.

6. Use the `.str.strip()` method to remove any leading or trailing whitespace from the `name` column.

7. Now use `.str.replace` to remove the word "the" from the pub names, so that a pub called "The King's Head" will be treated as having the same name as one that's simply called "King's Head".

*Tip: take care not to replace words that **contain** the word `the` like "theatre"*

8. Use your `name` column to find the most common pub name in the UK.

BONUS: which local authority has the most of these pubs (i.e. the most pubs that have the most common name you found in question 8)?

BONUS: how many unique pub names are there in the data? That is, pub names that appear exactly once.

# `pandas` help

- pandastutor visualises what different operations do: https://pandastutor.com/
- `pandas` cheat sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- Python for Data Analysis book (free online) by the creator of `pandas`: https://wesmckinney.com/book/