# ~~First~~ Fast Python Notebook

<div style="max-width: 640px">

An accelerated guide to analyzing data with the [Python](https://www.python.org/) programming language and a [Jupyter](https://jupyter.org/) notebook

</div>
    
By [Ben Welsh](https://palewi.re/who-is-ben-welsh/)

<div style="max-width: 640px">

First developed in 2016, ["First Python Notebook"](https://palewi.re/docs/first-python-notebook/) is a tutorial that guides students through a data-driven investigation of money in California politics. It is most commonly taught as a six-hour, in-person class. This document is an abbreviated spinoff intended to be taught online in two to three hours.

You will learn just enough of the Python computer programming language to work with the [pandas](https://pandas.pydata.org/) library, a popular open-source tool for analyzing data. The course will teach you how to read, filter, join, group, aggregate and rank structured data by recreating a helicopter accident analysis [published by the Los Angeles Times](https://github.com/datadesk/helicopter-accident-analysis).
    
</div>

## What is a Jupyter notebook?

<div style="max-width: 640px">

![jupyter](https://palewi.re/docs/first-python-notebook/_static/img/labpreview.webp)

A [Jupyter](https://jupyter.org/) notebook is a browser-based interface where you can write, run, remix and republish code. It is free software that anyone can install and run.

[Scientists](https://nbviewer.jupyter.org/github/robertodealmeida/notebooks/blob/master/earth_day_data_challenge/Analyzing%20whale%20tracks.ipynb), [scholars](https://nbviewer.jupyter.org/github/nealcaren/workshop_2014/blob/master/notebooks/5_Times_API.ipynb), [investors](https://github.com/rsvp/fecon235/blob/master/nb/fred-debt-pop.ipynb) and [corporations](https://netflixtechblog.com/notebook-innovation-591ee3221233) use Jupyter to create and share their research. It is also used by journalists to develop stories and show their work. Examples include:

* [“The Tennis Racket”](https://github.com/BuzzFeedNews/2016-01-tennis-betting-analysis/blob/master/notebooks/tennis-analysis.ipynb) by BuzzFeed and the BBC
* [“Machine bias”](https://github.com/propublica/compas-analysis/blob/master/Compas%20Analysis.ipynb) by ProPublica
* [“As Opioid Crisis Ramped Up, Pills Flowed Into Vermont by the Millions”](https://github.com/asuozzo/arcos-opioid-analysis-vt) by Seven Days
* [More than 35 different notebooks](https://github.com/datadesk/notebooks) published by the Los Angeles Times

There are numerous ways to install and configure Jupyter notebooks. This class is taught using [JupyterLite](https://jupyterlite.readthedocs.io/), a lightweight distribution that runs entirely in your web browser. For instructions on how to install a more powerful version on your computer, consult [the full edition of "First Python Notebook."](https://palewi.re/docs/first-python-notebook/jupyter_desktop.html)

Once you have this notebook up and running, you're ready to write Python in a code cell. Do not stress. There is nothing too fancy about it. You can start by just doing a little simple math.

Select the box below, then hit the play button in the toolbar above the notebook or hit `SHIFT+ENTER` on your keyboard.
    
</div>

In [1]:
2+2

4

<div style="max-width:630px">

There. You have just run your first Python code. You have entered two integers and added them together using the plus sign operator.

Not so bad, right? Now try writing in your own math problem in the next cell. Maybe `2+3` or `2+200`. Whatever strikes your fancy. After you've typed it in, hit the play button or `SHIFT+ENTER`.

</div>

<div style="max-width: 640px;">

This to-and-fro of writing Python code in a cell and then running it is the rhythm of working in a notebook. If you get an error after you run a cell, look carefully at your code and see that it exactly matches what’s been written in the example.
    
Here's an example of an error that I've added intentionally:
    
</div>

In [4]:
2+2+

SyntaxError: invalid syntax (4150814810.py, line 1)

<div style="max-width: 640px;">

Don’t worry. Code crashes are a normal part of life for computer programmers. They’re usually caused by small typos that can be quickly corrected.
    
</div>

In [3]:
2+2+2

6

<div style="max-width: 640px;">

Over time you will gradually stack cells to organize an analysis that runs from top to bottom. The cells can contain variables, functions and other Python tools.

> Note: If you’ve never written code before, we recommend ["An Informal Introduction to Python"](https://docs.python.org/3/tutorial/introduction.html) and subsequent sections of python.org’s tutorial.

A simple example would be storing your number in a variable in one cell:
    
</div>

In [4]:
number = 2

Then adding it to another number in the next:

In [5]:
number + 3

5

Change the `number` value to 3 and run both cells again. Instead of 5, it should now output 6.

In [5]:
number = 3

In [6]:
number + 3

6

Now try defining your own numeric variable and doing some math with it. You can name it whatever you want. Want to try some other math operations? The `-` sign does subtraction. Multiplication is `*`. Division is `/`.
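
As a quick sketch of those operators, the cell below uses a made-up variable named `apples`, chosen only for illustration. Your own names and numbers will work just as well.

```python
# A made-up variable for illustration; name yours whatever you like
apples = 7

print(apples - 2)  # subtraction
print(apples * 3)  # multiplication
print(apples / 2)  # division returns a decimal number, even for whole inputs
```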

Once you’ve got the hang of making the notebook run, you’re ready to introduce pandas, a powerful Python analysis library that can do a whole lot more than add a few numbers together.

<div style="max-width: 640px">

## What is pandas?    
    
![pandas on the Python Package Index](https://palewi.re/docs/first-python-notebook/_static/img/pandas-pypi.png)

Lucky for us, Python is filled with functions to do almost anything you’d want to do with a programming language: [navigate the web](http://docs.python-requests.org/), [parse data](https://docs.python.org/2/library/csv.html), [interact with a database](http://www.sqlalchemy.org/), [run fancy statistics](https://www.scipy.org/), [build a pretty website](https://www.djangoproject.com/) and [so](https://www.crummy.com/software/BeautifulSoup/) [much](http://www.nltk.org/) [more](https://pillow.readthedocs.io/en/stable/).

Creative people have put these tools to work to get a [wide range of things](https://www.python.org/about/success/) done in the academy, the laboratory and even in outer space.

Some of those tools are included in a toolbox that comes with the language, known as the standard library. Others have been built by members of Python’s developer community and need to be separately downloaded and installed. One third-party tool that’s important for this class is called [pandas](https://pandas.pydata.org/). Invented by programmers at a [financial investment firm](https://www.aqr.com/), it has become a leading open-source library for accessing and analyzing data.

Here’s how to use pandas yourself. Run the following:
    
</div>

In [8]:
import pandas

<div style="max-width: 640px">

If nothing happens, that’s good. It means you have it installed and ready to use.

> Note: Since pandas is created by a third party independent from the core Python developers, it may not be available by default if you manually installed Python and Jupyter. It’s available here because we’re using JupyterLite, whose developers have curated a list of common utilities to include with their distribution. Consult our [advanced installation guide](https://palewi.re/docs/first-python-notebook/appendix/index.html) if the cell above threw an error.

Now let's run the same code again, but with a small addition.
    
</div>

In [10]:
import pandas as pd

<div style="max-width: 640px">

This will alias the pandas library under the shorter name `pd`. This is standard practice in the pandas community. You will frequently see examples of pandas code online using `pd` as shorthand. It’s not required, but it’s good to get in the habit so that your code will be better understood by other computer programmers.

Those two little letters contain dozens of data analysis tools that we’ll use in future lessons. They can import massive data files, compute advanced statistics, filter, sort, rank and do just about anything else you’d want to do.

We’ll get to all of that soon enough, but let’s start out with something simple. Let's run some simple stats.

## Calculating descriptive statistics

Start by making a list of numbers in a new notebook cell. To keep things simple, we'll start with all of the even numbers between zero and ten. Note the variable name I've assigned. Then press play.
    
</div>

In [15]:
my_list = [2, 4, 6, 8]

<div style="max-width: 640px">

If you’re a skilled Python programmer, you can do some cool stuff with any list, including run statistics. But if you hand it over to pandas instead, you’ll be impressed by how easily you can analyze the data without much computer code.


In this case, it’s as simple as converting that plain Python list into what pandas calls a [`Series`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html). Here’s how to make it happen:
    
</div>

In [16]:
my_series = pd.Series(my_list)

<div style="max-width: 640px">

Once the data becomes a `Series`, you can immediately run a wide range of [descriptive statistics](https://en.wikipedia.org/wiki/Descriptive_statistics). Let’s try a few.

First, let’s sum all the numbers.
    
</div>

In [17]:
my_series.sum()

20

Then find the maximum value.

In [18]:
my_series.max()

8

The minimum value.

In [19]:
my_series.min()

2

How about the average, which is also known as the mean?

In [20]:
my_series.mean()

5.0

The median?

In [21]:
my_series.median()

5.0

The standard deviation?

In [22]:
my_series.std()

2.581988897471611

Finally, all of the above, plus a little more about the distribution, in one simple command.

In [18]:
my_series.describe()

count    4.000000
mean     5.000000
std      2.581989
min      2.000000
25%      3.500000
50%      5.000000
75%      6.500000
max      8.000000
dtype: float64

<div style="max-width: 640px">

Before you move on, go back to the `my_list` variable and change the list. Maybe add a few more values. Or switch to odds. Then rerun all the cells above. You'll see all the statistics update to reflect the different dataset.

Substitute in a series of 10 million records and your notebook would calculate all the same statistics without you needing to write any more code. Once your data, however large or complex, is imported into pandas, simple statistics become a snap.
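
To see that claim in action without downloading anything, here is a minimal sketch that swaps in one million made-up values rather than 10 million, just to keep the cell quick. The `big_series` name is an assumption for this example.

```python
import pandas as pd

# A larger, made-up dataset: every whole number from 1 to 1,000,000
big_series = pd.Series(range(1, 1_000_001))

# The same methods used on the four-item Series work unchanged
print(big_series.sum())
print(big_series.mean())
print(big_series.describe())
```

The code that calculates the statistics is identical to what we ran on the short list. Only the data changed.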

</div>

<div style="max-width:640px">

## Introducing dataframes
    
Now it’s time to get our hands on some real data. The [News Homepages project](https://homepages.news/) is an open-source archive that gathers, saves and shares front pages from sites around the world. It [publishes daily exports of data](https://palewi.re/docs/news-homepages/extracts.html) gathered by the system, which can be used to analyze news coverage. We'll use its roster of sites to demonstrate the power of pandas.

The data are structured in rows of comma-separated values. This is known as a [CSV file](https://en.wikipedia.org/wiki/Comma-separated_values). It is the most common way you will find data published online.

The pandas library is able to read in files from a variety of formats, including CSV. In our next cell, we'll use pandas' `read_csv` method to read in `sites.csv` from the archive.
    
</div>

In [19]:
pd.read_csv("https://raw.githubusercontent.com/palewire/news-homepages/main/extracts/csv/sites.csv")

Unnamed: 0,handle,name,url,location,timezone
0,100Reporters,100Reporters,http://100r.org/,Washington,America/New_York
1,11AliveNews,11Alive News,https://www.11alive.com,Atlanta,America/New_York
2,12NewsNow,12 News Now,https://www.12newsnow.com/,Beaumont,America/Chicago
3,13wmaznews,13WMAZ News,https://www.13wmaz.com,Macon,America/New_York
4,14eastmag,14 East,http://fourteeneastmag.com/,Chicago,America/Chicago
...,...,...,...,...,...
556,wttw,WTTW,https://www.wttw.com/,Chicago,America/Chicago
557,WTVM,WTVM News Leader 9,https://www.wtvm.com,Columbus,America/New_York
558,YahooNews,Yahoo! News,https://news.yahoo.com/,New York City,America/New_York
559,zerohedge,ZeroHedge,https://www.zerohedge.com/,New York City,America/New_York


You should see a big table like the one above. It is a DataFrame where pandas has structured the CSV data into rows and columns, just like Excel or other spreadsheet software might.
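
If you'd like to experiment with `read_csv` without relying on a network connection, here is a small sketch that reads a CSV held in memory. The rows and the `example_sites` name are invented for illustration and are not part of the News Homepages data.

```python
import io

import pandas as pd

# A tiny, made-up CSV held in memory instead of downloaded from the web
csv_text = """handle,name,location
example_one,Example One,Springfield
example_two,Example Two,Shelbyville
"""

# read_csv accepts file-like objects as well as URLs and file paths
example_sites = pd.read_csv(io.StringIO(csv_text))
print(example_sites)
```

The first line of the text becomes the column headers, and each line after that becomes a row.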

> Note: If you're using [JupyterLite](https://jupyterlite.readthedocs.io/) instead of Jupyter Desktop, you'll need to use a workaround with the pyodide library to fetch the URL. Run something like this instead:
> ```
> from pyodide.http import open_url
> pd.read_csv(open_url("https://raw.githubusercontent.com/palewire/news-homepages/main/extracts/csv/sites.csv"))
> ```

A major advantage of Jupyter over spreadsheets is that, rather than manipulating the data through a haphazard series of clicks and keypunches, we will be gradually grinding it down using a computer programming script that is 100% transparent and reproducible.

In order to do more with your DataFrame, we need to store it so it can be reused in subsequent cells. We can do this by saving it in a variable, which is a fancy computer programming word for a named shortcut where we save our work as we go.

In [21]:
site_list = pd.read_csv("https://raw.githubusercontent.com/palewire/news-homepages/main/extracts/csv/sites.csv")

After you run it, you shouldn’t see anything. That’s a good thing. It means our DataFrame has been saved under the name `site_list`, which we can now begin interacting with in the cells that follow.

We can do this by calling [“methods”](https://en.wikipedia.org/wiki/Method_(computer_programming)) that pandas makes available to all DataFrames. You may not have known it at the time, but `read_csv` is a close cousin of these methods, a function offered by the pandas library itself. There are dozens more that can do all sorts of interesting things. Let’s start with some easy ones that analysts use all the time.

To preview the first few rows of the dataset, try the [`head`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) method.

In [22]:
site_list.head()

Unnamed: 0,handle,name,url,location,timezone
0,100Reporters,100Reporters,http://100r.org/,Washington,America/New_York
1,11AliveNews,11Alive News,https://www.11alive.com,Atlanta,America/New_York
2,12NewsNow,12 News Now,https://www.12newsnow.com/,Beaumont,America/Chicago
3,13wmaznews,13WMAZ News,https://www.13wmaz.com,Macon,America/New_York
4,14eastmag,14 East,http://fourteeneastmag.com/,Chicago,America/Chicago


It shows the first five rows by default. If you want a different number, submit it as an input.

In [23]:
site_list.head(1)

Unnamed: 0,handle,name,url,location,timezone
0,100Reporters,100Reporters,http://100r.org/,Washington,America/New_York


To get a look at all of the columns and what type of data they store, try the [`info`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html) method.

In [24]:
site_list.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 561 entries, 0 to 560
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   handle    561 non-null    object
 1   name      561 non-null    object
 2   url       561 non-null    object
 3   location  561 non-null    object
 4   timezone  561 non-null    object
dtypes: object(5)
memory usage: 22.0+ KB


Look carefully at the results and you'll see we have more than 550 sites.

## Columns

To see the contents of a column separate from the rest of the DataFrame, add the column’s name to the DataFrame’s variable following a period. We’ll begin with the `location` column where the hometown of the news organization is stored.

In [25]:
site_list.location

0         Washington
1            Atlanta
2           Beaumont
3              Macon
4            Chicago
           ...      
556          Chicago
557         Columbus
558    New York City
559    New York City
560             Kiev
Name: location, Length: 561, dtype: object

That will list the column out as a Series, just like the ones we created from scratch earlier. Just as we did then, you can now start tacking on additional methods that will analyze the contents of the column.

> Note: You can also access columns a second way, like this: `site_list['location']`. This method isn’t as pretty, but it’s required if your column has a space in its name, which would break the simpler dot-based method.
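
Here is a small sketch of the two notations side by side. The `example` DataFrame and its `home page` column are invented for illustration; the real roster has no such column.

```python
import pandas as pd

# A made-up DataFrame with one column name that contains a space
example = pd.DataFrame({
    "location": ["Chicago", "Atlanta"],
    "home page": ["https://example.com/a", "https://example.com/b"],
})

# Both notations return the same Series for a simple column name
print(example.location.equals(example["location"]))

# Only bracket notation works when the name contains a space
print(example["home page"])
```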

In this case, the column is filled with characters. So we don’t want to calculate statistics like the median and average, as we did before.

There’s another built-in pandas tool that will total up the frequency of values in a column. The method is called `value_counts` and it’s just as easy to use as sum, min or max. All you need to do is add a period after the column name and chain it on the tail end of your cell.

Run the code and you should see the locations ranked by their number of sites.

In [26]:
site_list.location.value_counts()

New York         62
Washington       47
Los Angeles      20
Chicago          16
New York City    16
                 ..
Evanston          1
Stamford          1
Bakersfield       1
Bogota            1
Wisconsin         1
Name: location, Length: 199, dtype: int64

You may notice that even though the result has two columns, pandas did not return a clean-looking table in the same way as `head` did for our DataFrame. That’s because our column, a Series, acts a little bit differently than the DataFrame created by `read_csv`.

In most instances, if you have an ugly Series generated by a method like `value_counts` and you want to convert it into a pretty DataFrame, you can do so by tacking the `reset_index` method on the end.

In [27]:
site_list.location.value_counts().reset_index()

Unnamed: 0,index,location
0,New York,62
1,Washington,47
2,Los Angeles,20
3,Chicago,16
4,New York City,16
...,...,...
194,Evanston,1
195,Stamford,1
196,Bakersfield,1
197,Bogota,1
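
Depending on your version of pandas, the column labels you get back may differ from the ones shown above. Starting with pandas 2.0, the same code labels the columns `location` and `count`. If you want predictable headers either way, one option is to name them yourself. This is a sketch using a made-up list of locations, with `sites` as an assumed label for the counts.

```python
import pandas as pd

# A made-up column of locations standing in for site_list.location
locations = pd.Series(
    ["Chicago", "Chicago", "Chicago", "Atlanta", "Atlanta", "Macon"],
    name="location",
)

counts_table = (
    locations.value_counts()
    .rename_axis("location")    # label the index, which holds the place names
    .reset_index(name="sites")  # move it into a column and name the counts
)
print(counts_table)
```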


Why do Series and DataFrames behave differently? Why does `reset_index` have such a weird name?

Like so much in computer programming, the answer is simply, “because the people who created the library said so.” It’s important to learn that all open-source programming tools are made by humans, and humans have their quirks. Over time you’ll see pandas has more than a few.

As a beginner, you should just accept the oddities and keep moving. As you get more advanced, if there’s something about the system you think could be improved you should consider [contributing](https://pandas.pydata.org/pandas-docs/stable/development/contributing.html) to the Python code that operates the library.

## Filtering

The most common way to filter a DataFrame is to pass an expression as an “index” that can be used to decide which records should be kept and which discarded. You write the expression by combining a column on your DataFrame with an [“operator”](https://en.wikipedia.org/wiki/Operator_(computer_programming)) like == or > or < and a value to compare against each row.

> Note: If you are familiar with writing [SQL](https://en.wikipedia.org/wiki/SQL) to manipulate databases, pandas’ filtering system is somewhat similar to a WHERE query. The [official pandas documentation](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html#where) offers direct translations between the two.

Let's try filtering against the `location` field. Save one of the values listed above into a variable. This will allow us to reuse it later.

In [28]:
my_location = "Cedar Rapids"

In the next cell we will ask pandas to narrow down our list of sites to just those that list the location we’re interested in. We will create a filter expression and place it between two square brackets following the DataFrame we wish to filter.

In [29]:
site_list[site_list.location == my_location]

Unnamed: 0,handle,name,url,location,timezone
192,gazettedotcom,Cedar Rapids Gazette,https://www.thegazette.com/,Cedar Rapids,America/Chicago
246,KCRG,KCRG,https://www.kcrg.com/,Cedar Rapids,America/Chicago


Run it and it outputs the filtered dataset, just those sites located in Cedar Rapids.
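
It can help to look at the filter expression on its own. The sketch below uses a made-up three-row roster, named `example_sites` here, to show that the comparison produces a Series of `True` and `False` values, which pandas then uses to decide which rows to keep.

```python
import pandas as pd

# A made-up roster standing in for site_list
example_sites = pd.DataFrame({
    "name": ["Site A", "Site B", "Site C"],
    "location": ["Cedar Rapids", "Chicago", "Cedar Rapids"],
})

# On its own, the expression is a Series of True and False, one per row
is_match = example_sites.location == "Cedar Rapids"
print(is_match)

# Passing it between square brackets keeps only the True rows
print(example_sites[is_match])
```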

Now we should save the results of that filter into a new variable separate from the full list we imported from the CSV file. Since it includes only the sites for the location we’re interested in, let’s call it `my_sites`.

In [30]:
my_sites = site_list[site_list.location == my_location]

To check our work and find out how many sites are left after the filter, let’s run the DataFrame inspection commands we learned earlier.

First `head`.

In [31]:
my_sites.head()

Unnamed: 0,handle,name,url,location,timezone
192,gazettedotcom,Cedar Rapids Gazette,https://www.thegazette.com/,Cedar Rapids,America/Chicago
246,KCRG,KCRG,https://www.kcrg.com/,Cedar Rapids,America/Chicago


Then `info`.

In [32]:
my_sites.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 192 to 246
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   handle    2 non-null      object
 1   name      2 non-null      object
 2   url       2 non-null      object
 3   location  2 non-null      object
 4   timezone  2 non-null      object
dtypes: object(5)
memory usage: 96.0+ bytes


## Merging

Next we'll cover how to merge two DataFrames together into a combined table. Before we can do that, we need to read in a second file. Let's use the [`screenshot-files.csv`](https://palewi.re/docs/news-homepages/extracts.html#screenshot-files-csv) extract published by News Homepages.

We can read it in the same way as the site roster, with `read_csv`.

In [33]:
screenshot_list = pd.read_csv("https://raw.githubusercontent.com/palewire/news-homepages/main/extracts/csv/screenshot-files.csv")

When joining two tables together, the first step is to look carefully at the columns in each table to find a common column that can be joined. We can do that with the `info` command we learned earlier.

In [34]:
site_list.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 561 entries, 0 to 560
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   handle    561 non-null    object
 1   name      561 non-null    object
 2   url       561 non-null    object
 3   location  561 non-null    object
 4   timezone  561 non-null    object
dtypes: object(5)
memory usage: 22.0+ KB


In [35]:
screenshot_list.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43211 entries, 0 to 43210
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   identifier  43211 non-null  object
 1   handle      43211 non-null  object
 2   file_name   43211 non-null  object
 3   url         43211 non-null  object
 4   mtime       43211 non-null  object
 5   size        43211 non-null  int64 
 6   md5         43211 non-null  object
 7   sha1        43211 non-null  object
dtypes: int64(1), object(7)
memory usage: 2.6+ MB


You can see that each table contains the `handle` column, which the documentation tells us acts as the unique identifier for a site. We can therefore join the two files using the pandas [`merge`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html) method.

> Note: If you are familiar with traditional databases, you may recognize that the merge method in pandas is similar to SQL’s JOIN statement. If you dig into [merge’s documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html) you will see it has many of the same options.

Merging two DataFrames is as simple as passing both to pandas' built-in `merge` method and specifying which field we’d like to use to connect them together. We will save the result into another new variable, which I'm going to call `merged_list`.

In [36]:
merged_list = pd.merge(site_list, screenshot_list, on="handle")

That new DataFrame can be inspected like any other.

In [37]:
merged_list.head()

Unnamed: 0,handle,name,url_x,location,timezone,identifier,file_name,url_y,mtime,size,md5,sha1
0,100Reporters,100Reporters,http://100r.org/,Washington,America/New_York,100reporters-2022,100reporters-2022-07-08T23:55:17.494439-04:00.jpg,https://archive.org/download/100reporters-2022...,2022-07-09 03:55:23,309491,df3b7a37cdf271db5168168699c97c08,787c2345f89bf7057b50316a3b00a52066d5a870
1,100Reporters,100Reporters,http://100r.org/,Washington,America/New_York,100reporters-2022,100reporters-2022-07-09T11:38:11.645272-04:00.jpg,https://archive.org/download/100reporters-2022...,2022-07-09 15:38:13,308475,3f1ed73406110916ebee38ce759451b4,1024fdad5527ab2b718eeef75dd044ed70f195ca
2,100Reporters,100Reporters,http://100r.org/,Washington,America/New_York,100reporters-2022,100reporters-2022-07-10T00:02:09.536240-04:00.jpg,https://archive.org/download/100reporters-2022...,2022-07-10 04:02:11,309197,be1353896dfc4f3206eba7ae2a3dd57e,b6db35a3b20dd91c73d62349ea4c525ec9de52ba
3,100Reporters,100Reporters,http://100r.org/,Washington,America/New_York,100reporters-2022,100reporters-2022-07-10T11:38:45.973753-04:00.jpg,https://archive.org/download/100reporters-2022...,2022-07-10 15:38:47,308719,f127ef0f9dc638b40c512103e5f88cb5,c0c84cced1b6b1d67b52cd30ab1229d5c0af359d
4,100Reporters,100Reporters,http://100r.org/,Washington,America/New_York,100reporters-2022,100reporters-2022-07-11T00:03:32.353709-04:00.jpg,https://archive.org/download/100reporters-2022...,2022-07-11 04:03:34,308478,06573ed3defc4901105a895d9bd3a7fc,8450782e88d80dc0bd9198598fc16d91e773f20a


By running `info`, you can check how many rows survived the merge.

In [38]:
merged_list.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43211 entries, 0 to 43210
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   handle      43211 non-null  object
 1   name        43211 non-null  object
 2   url_x       43211 non-null  object
 3   location    43211 non-null  object
 4   timezone    43211 non-null  object
 5   identifier  43211 non-null  object
 6   file_name   43211 non-null  object
 7   url_y       43211 non-null  object
 8   mtime       43211 non-null  object
 9   size        43211 non-null  int64 
 10  md5         43211 non-null  object
 11  sha1        43211 non-null  object
dtypes: int64(1), object(11)
memory usage: 4.3+ MB


You can also see that the DataFrame now contains the same number of records as the `screenshot_list`, as well as all of the columns in both tables. Columns with the same name have had a suffix of `_x` or `_y` automatically appended to indicate whether they came from the first or second DataFrame submitted to the merge.
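
By default, `merge` keeps only the rows whose `handle` appears in both tables. The sketch below, built on two tiny made-up tables, shows two optional inputs you may find handy: `suffixes`, which swaps the `_x` and `_y` labels for names of your choosing, and `how="left"`, which keeps every row from the first table even when it has no match. The table contents and suffix names are assumptions for illustration.

```python
import pandas as pd

# Two made-up tables that share a "handle" column and both have a "url" column
sites = pd.DataFrame({
    "handle": ["alpha", "beta", "gamma"],
    "url": ["https://example.com/alpha", "https://example.com/beta", "https://example.com/gamma"],
})
shots = pd.DataFrame({
    "handle": ["alpha", "alpha", "beta"],
    "url": ["https://example.com/1.jpg", "https://example.com/2.jpg", "https://example.com/3.jpg"],
})

# suffixes replaces the default _x and _y labels on overlapping columns
inner = pd.merge(sites, shots, on="handle", suffixes=("_site", "_screenshot"))
print(inner.columns.tolist())

# how="left" keeps every row of the first table, even without a match
left = pd.merge(sites, shots, on="handle", how="left", suffixes=("_site", "_screenshot"))
print(len(inner), len(left))
```

Here the `gamma` site has no screenshots, so it drops out of the default merge but stays in the left merge with empty screenshot fields.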

## Totals

Using only tricks we learned so far, we can now start to ask and answer questions.

For instance, what is the average size of a screenshot file?

In [39]:
merged_list['size'].mean()

290189.6200273079

> Note: Since `size` is also the name of a built-in pandas attribute, I used the alternative "bracket notation" to access the field.
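
To see why the notation matters here, consider this sketch with a made-up two-row table. Dot notation reaches the built-in `size` attribute, which reports the number of cells in the DataFrame, while bracket notation reaches the column.

```python
import pandas as pd

# A made-up table with a column named "size"
example = pd.DataFrame({"handle": ["alpha", "beta"], "size": [100, 300]})

# Dot notation reaches the built-in attribute: the number of cells (2 rows x 2 columns)
print(example.size)

# Bracket notation reaches the column, so statistics work as expected
print(example["size"].mean())
```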

What's the biggest screenshot file?

In [40]:
merged_list['size'].max()

687226

What site has the most screenshots?

In [41]:
merged_list.name.value_counts()

New York Times        811
CNN                   607
Los Angeles Times     465
MSNBC                 463
Fox News              463
                     ... 
Freethink               3
Athletic                3
Humans of New York      3
Calmatters              2
Mississippi Today       1
Name: name, Length: 559, dtype: int64

What is the total number of screenshots from sites in Cedar Rapids?

In [42]:
merged_list[merged_list.location == my_location].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 361 entries, 13168 to 17171
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   handle      361 non-null    object
 1   name        361 non-null    object
 2   url_x       361 non-null    object
 3   location    361 non-null    object
 4   timezone    361 non-null    object
 5   identifier  361 non-null    object
 6   file_name   361 non-null    object
 7   url_y       361 non-null    object
 8   mtime       361 non-null    object
 9   size        361 non-null    int64 
 10  md5         361 non-null    object
 11  sha1        361 non-null    object
dtypes: int64(1), object(11)
memory usage: 36.7+ KB
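
If all you want is the count, Python's built-in `len` function offers a shorter route than reading it off the `info` report. This is a sketch on a made-up table standing in for `merged_list`.

```python
import pandas as pd

# A made-up merged table standing in for merged_list
example = pd.DataFrame({
    "name": ["Site A", "Site A", "Site B", "Site C"],
    "location": ["Cedar Rapids", "Cedar Rapids", "Chicago", "Cedar Rapids"],
})

# len counts the rows that survive the filter
matches = example[example.location == "Cedar Rapids"]
print(len(matches))
```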


## Sorting

Another simple but common technique for analyzing data is sorting. This can be useful for ranking the DataFrame to show the highest and lowest members of the group according to a particular column. The [`sort_values`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) method is how pandas does it.

In [43]:
merged_list.sort_values("size")

Unnamed: 0,handle,name,url_x,location,timezone,identifier,file_name,url_y,mtime,size,md5,sha1
19552,laist,LAist,https://laist.com/,Los Angeles,America/Los_Angeles,laist-2022,laist-2022-06-24T18:37:43.658736-07:00.jpg,https://archive.org/download/laist-2022/laist-...,2022-06-25 01:37:45,13138,ea987d5a23910acc5fc3afce720baed4,c82e951aeb3af7b182407e2bb14a05efe6928b4d
39067,ukrinform,Ukrinform,https://www.ukrinform.ua/,Kiev,Europe/Kiev,ukrinform-2022,ukrinform-2022-03-24T15:38:42.807929+02:00.jpg,https://archive.org/download/ukrinform-2022/uk...,2022-03-24 13:38:43,14755,74383537eb3be7d4007c86ac3d6402a7,5bcb8e0fce59e4973b64bb12c0d7ce8e7b259eea
25567,News3LV,KSNV News 3,https://news3lv.com/,Las Vegas,America/Los_Angeles,news3lv-2022,news3lv-2022-07-08T19:08:35.222676-07:00.jpg,https://archive.org/download/news3lv-2022/news...,2022-07-09 02:08:37,16092,0594aafe3e1d3df37d5ff4ca86821be3,e07d1aa839b245ab410fba97015a441995e2fe52
23173,mercnews,Mercury News,https://www.mercurynews.com/,San Jose,America/Los_Angeles,mercnews-2022,mercnews-2022-05-02T06:45:41.641250-07:00.jpg,https://archive.org/download/mercnews-2022/mer...,2022-05-02 13:45:43,16104,fb6547e22027b350c61b8f3ff961ea71,8adcfd939075ff57d0e958379944f0305dadf92a
18996,KyivPost,KyivPost,https://www.kyivpost.com/,Kiev,Europe/Kiev,kyivpost-2022,kyivpost-2022-04-21T16:42:48.004499+03:00.jpg,https://archive.org/download/kyivpost-2022/kyi...,2022-04-21 13:42:49,18029,c17661c54118426112255a15f2fe111a,4b6a6e78edd14d48da2da1f86b974cd7e78db8a7
...,...,...,...,...,...,...,...,...,...,...,...,...
42519,wsj,Wall Street Journal,https://www.wsj.com/,New York,America/New_York,wsj-2022,wsj-2022-05-17T08:07:02.756004-04:00.jpg,https://archive.org/download/wsj-2022/wsj-2022...,2022-05-17 12:07:07,569445,533fc96be915658c123ade2c2707f4f7,2f3d3012c0d3c7b4106c88cb2f046c58ad0b7919
10451,fastcompany,Fast Company,https://www.fastcompany.com/,New York,America/New_York,fastcompany-2022,fastcompany-2022-07-19T22:16:19.926420-04:00.jpg,https://archive.org/download/fastcompany-2022/...,2022-07-20 02:16:21,584409,9c32beefe1fd8b2ca109dd42ec3614ef,7af1fe7285af0cdcfd59a258b51ded54d1f49fd3
10448,fastcompany,Fast Company,https://www.fastcompany.com/,New York,America/New_York,fastcompany-2022,fastcompany-2022-07-18T13:44:59.403155-04:00.jpg,https://archive.org/download/fastcompany-2022/...,2022-07-18 17:45:02,680645,84d2132c066f50aeadda10f71edcdd33,7ac5a6a7293cc5323114ae219dc94e68e3952038
10449,fastcompany,Fast Company,https://www.fastcompany.com/,New York,America/New_York,fastcompany-2022,fastcompany-2022-07-18T22:22:49.761458-04:00.jpg,https://archive.org/download/fastcompany-2022/...,2022-07-19 02:22:50,682426,44f890d84a0d92ff4ae28993dde0e627,463a6c808e34785bca0a082f332cf0aced853340


Note that it returns the DataFrame sorted in ascending order, from lowest to highest. That is pandas' default way of sorting. Here's how you reverse it to show the largest values first.

In [44]:
merged_list.sort_values("size", ascending=False)

Unnamed: 0,handle,name,url_x,location,timezone,identifier,file_name,url_y,mtime,size,md5,sha1
10455,fastcompany,Fast Company,https://www.fastcompany.com/,New York,America/New_York,fastcompany-2022,fastcompany-2022-07-21T22:19:44.329803-04:00.jpg,https://archive.org/download/fastcompany-2022/...,2022-07-22 02:19:46,687226,b4d4f21df49c026b9c3a0aa7d0f48844,a13c74ecb785a5f09f08ac6803457afc1bf9c162
10449,fastcompany,Fast Company,https://www.fastcompany.com/,New York,America/New_York,fastcompany-2022,fastcompany-2022-07-18T22:22:49.761458-04:00.jpg,https://archive.org/download/fastcompany-2022/...,2022-07-19 02:22:50,682426,44f890d84a0d92ff4ae28993dde0e627,463a6c808e34785bca0a082f332cf0aced853340
10448,fastcompany,Fast Company,https://www.fastcompany.com/,New York,America/New_York,fastcompany-2022,fastcompany-2022-07-18T13:44:59.403155-04:00.jpg,https://archive.org/download/fastcompany-2022/...,2022-07-18 17:45:02,680645,84d2132c066f50aeadda10f71edcdd33,7ac5a6a7293cc5323114ae219dc94e68e3952038
10451,fastcompany,Fast Company,https://www.fastcompany.com/,New York,America/New_York,fastcompany-2022,fastcompany-2022-07-19T22:16:19.926420-04:00.jpg,https://archive.org/download/fastcompany-2022/...,2022-07-20 02:16:21,584409,9c32beefe1fd8b2ca109dd42ec3614ef,7af1fe7285af0cdcfd59a258b51ded54d1f49fd3
42519,wsj,Wall Street Journal,https://www.wsj.com/,New York,America/New_York,wsj-2022,wsj-2022-05-17T08:07:02.756004-04:00.jpg,https://archive.org/download/wsj-2022/wsj-2022...,2022-05-17 12:07:07,569445,533fc96be915658c123ade2c2707f4f7,2f3d3012c0d3c7b4106c88cb2f046c58ad0b7919
...,...,...,...,...,...,...,...,...,...,...,...,...
18996,KyivPost,KyivPost,https://www.kyivpost.com/,Kiev,Europe/Kiev,kyivpost-2022,kyivpost-2022-04-21T16:42:48.004499+03:00.jpg,https://archive.org/download/kyivpost-2022/kyi...,2022-04-21 13:42:49,18029,c17661c54118426112255a15f2fe111a,4b6a6e78edd14d48da2da1f86b974cd7e78db8a7
23173,mercnews,Mercury News,https://www.mercurynews.com/,San Jose,America/Los_Angeles,mercnews-2022,mercnews-2022-05-02T06:45:41.641250-07:00.jpg,https://archive.org/download/mercnews-2022/mer...,2022-05-02 13:45:43,16104,fb6547e22027b350c61b8f3ff961ea71,8adcfd939075ff57d0e958379944f0305dadf92a
25567,News3LV,KSNV News 3,https://news3lv.com/,Las Vegas,America/Los_Angeles,news3lv-2022,news3lv-2022-07-08T19:08:35.222676-07:00.jpg,https://archive.org/download/news3lv-2022/news...,2022-07-09 02:08:37,16092,0594aafe3e1d3df37d5ff4ca86821be3,e07d1aa839b245ab410fba97015a441995e2fe52
39067,ukrinform,Ukrinform,https://www.ukrinform.ua/,Kiev,Europe/Kiev,ukrinform-2022,ukrinform-2022-03-24T15:38:42.807929+02:00.jpg,https://archive.org/download/ukrinform-2022/uk...,2022-03-24 13:38:43,14755,74383537eb3be7d4007c86ac3d6402a7,5bcb8e0fce59e4973b64bb12c0d7ce8e7b259eea
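
To see the two sort directions side by side, here is a minimal sketch on a tiny, made-up DataFrame. The name `toy` and its values are assumptions for illustration, not part of the News Homepages data.

```python
import pandas as pd

# Hypothetical stand-in for merged_list; the handles and sizes are made up.
toy = pd.DataFrame(
    {
        "handle": ["alpha", "bravo", "charlie", "delta"],
        "size": [300, 100, 400, 200],
    }
)

# Default behavior: ascending order, smallest values first.
smallest_first = toy.sort_values("size")

# Pass ascending=False to put the largest values on top.
largest_first = toy.sort_values("size", ascending=False)

print(smallest_first)
print(largest_first)
```

Note that `sort_values` returns a new DataFrame each time. The original `toy` keeps its row order unless you save the result back to a variable.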


You can limit the result to the top five by chaining the `head` method at the end. By default, it returns the first five rows.

In [45]:
merged_list.sort_values("size", ascending=False).head()

Unnamed: 0,handle,name,url_x,location,timezone,identifier,file_name,url_y,mtime,size,md5,sha1
10455,fastcompany,Fast Company,https://www.fastcompany.com/,New York,America/New_York,fastcompany-2022,fastcompany-2022-07-21T22:19:44.329803-04:00.jpg,https://archive.org/download/fastcompany-2022/...,2022-07-22 02:19:46,687226,b4d4f21df49c026b9c3a0aa7d0f48844,a13c74ecb785a5f09f08ac6803457afc1bf9c162
10449,fastcompany,Fast Company,https://www.fastcompany.com/,New York,America/New_York,fastcompany-2022,fastcompany-2022-07-18T22:22:49.761458-04:00.jpg,https://archive.org/download/fastcompany-2022/...,2022-07-19 02:22:50,682426,44f890d84a0d92ff4ae28993dde0e627,463a6c808e34785bca0a082f332cf0aced853340
10448,fastcompany,Fast Company,https://www.fastcompany.com/,New York,America/New_York,fastcompany-2022,fastcompany-2022-07-18T13:44:59.403155-04:00.jpg,https://archive.org/download/fastcompany-2022/...,2022-07-18 17:45:02,680645,84d2132c066f50aeadda10f71edcdd33,7ac5a6a7293cc5323114ae219dc94e68e3952038
10451,fastcompany,Fast Company,https://www.fastcompany.com/,New York,America/New_York,fastcompany-2022,fastcompany-2022-07-19T22:16:19.926420-04:00.jpg,https://archive.org/download/fastcompany-2022/...,2022-07-20 02:16:21,584409,9c32beefe1fd8b2ca109dd42ec3614ef,7af1fe7285af0cdcfd59a258b51ded54d1f49fd3
42519,wsj,Wall Street Journal,https://www.wsj.com/,New York,America/New_York,wsj-2022,wsj-2022-05-17T08:07:02.756004-04:00.jpg,https://archive.org/download/wsj-2022/wsj-2022...,2022-05-17 12:07:07,569445,533fc96be915658c123ade2c2707f4f7,2f3d3012c0d3c7b4106c88cb2f046c58ad0b7919
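
If you only ever want the top few rows, pandas also offers `nlargest`, which should give the same answer as sorting and then calling `head`. This is a sketch of that alternative, not the approach used in this tutorial, and the `toy` DataFrame is made up for the example.

```python
import pandas as pd

# Hypothetical sample data; the handles and sizes are made up.
toy = pd.DataFrame(
    {
        "handle": ["a", "b", "c", "d", "e", "f"],
        "size": [300, 100, 400, 200, 500, 50],
    }
)

# The approach from this section: sort descending, then keep the first rows.
top_sorted = toy.sort_values("size", ascending=False).head(3)

# The shortcut: ask directly for the three rows with the largest sizes.
top_nlargest = toy.nlargest(3, "size")

print(top_sorted)
print(top_nlargest)
```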


## Grouping

The [`groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) method allows you to group a DataFrame by a column and then calculate a sum, or any other statistic, for each unique value. This functions much like the ["pivot table"](https://en.wikipedia.org/wiki/Pivot_table) feature found in most spreadsheets.

Let's use it to analyze the `timezone` field, which shows where the organizations indexed by News Homepages are based. You start by passing its name to the method.

In [46]:
merged_list.groupby("timezone")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc43836a670>

A nice start, but you’ll notice you don’t get much back. The data’s been grouped, but we haven’t chosen what to do with it yet. If we wanted the total file size for each timezone, we would sum the `size` field the same way we did earlier for the entire DataFrame.

In [47]:
merged_list.groupby("timezone")['size'].sum()

timezone
America/Bogota            20359830
America/Buenos_Aires      24953179
America/Chicago         2284188701
America/Denver           329645851
America/Edmonton          19357058
America/Los_Angeles     2402403425
America/Mexico_City      350211916
America/Montevideo         7379875
America/New_York        4664240116
America/Phoenix           58965447
America/Vancouver         37972418
Asia/Beirut               25311438
Asia/Qatar                21678804
Asia/Tbilisi              28888169
Asia/Tokyo               102516510
Europe/Berlin              3371896
Europe/Dublin             18099916
Europe/Kiev              353249376
Europe/London            809543441
Europe/Madrid             28401302
Europe/Moscow             61413650
Europe/Oslo                9954086
Europe/Paris             580587143
Europe/Riga               76488230
US/Eastern               215213124
US/Hawaii                  4988770
Name: size, dtype: int64

Again, our data has come back as an ugly Series. To reformat it as a pretty DataFrame, use the `reset_index` method again.

In [48]:
merged_list.groupby("timezone")['size'].sum().reset_index()

Unnamed: 0,timezone,size
0,America/Bogota,20359830
1,America/Buenos_Aires,24953179
2,America/Chicago,2284188701
3,America/Denver,329645851
4,America/Edmonton,19357058
5,America/Los_Angeles,2402403425
6,America/Mexico_City,350211916
7,America/Montevideo,7379875
8,America/New_York,4664240116
9,America/Phoenix,58965447
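
Here is the same group-and-sum pattern on a small, made-up DataFrame, so you can check the arithmetic by eye. It also sketches an alternative to `reset_index`: passing `as_index=False` to `groupby`. The `toy` data and variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up sample data standing in for merged_list.
toy = pd.DataFrame(
    {
        "timezone": [
            "America/New_York",
            "Europe/London",
            "America/New_York",
            "Europe/London",
            "Asia/Tokyo",
        ],
        "size": [100, 200, 300, 400, 500],
    }
)

# The approach used above: sum into a Series, then convert to a DataFrame.
via_reset = toy.groupby("timezone")["size"].sum().reset_index()

# An alternative: ask groupby to keep the group labels as a regular column.
via_flag = toy.groupby("timezone", as_index=False)["size"].sum()

print(via_reset)
print(via_flag)
```

Both routes should produce a two-column DataFrame with one row per timezone. Which one you use is mostly a matter of taste.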


Next, re-sort the totals from highest to lowest. Remember the `sort_values` trick we learned earlier? That’ll do it.

In [49]:
merged_list.groupby("timezone")['size'].sum().reset_index().sort_values("size", ascending=False)

Unnamed: 0,timezone,size
8,America/New_York,4664240116
5,America/Los_Angeles,2402403425
2,America/Chicago,2284188701
18,Europe/London,809543441
22,Europe/Paris,580587143
17,Europe/Kiev,353249376
6,America/Mexico_City,350211916
3,America/Denver,329645851
24,US/Eastern,215213124
14,Asia/Tokyo,102516510
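
Long chains like the one above can get hard to read on a single line. One common Python convention is to wrap the chain in parentheses and put each step on its own line. This is a sketch on made-up data; the `toy` DataFrame and the `ranked` variable are hypothetical names.

```python
import pandas as pd

# Made-up sample data standing in for merged_list.
toy = pd.DataFrame(
    {
        "timezone": [
            "America/New_York",
            "Europe/London",
            "America/New_York",
            "Europe/London",
            "Asia/Tokyo",
        ],
        "size": [100, 200, 300, 400, 500],
    }
)

# The parentheses let Python treat the indented lines as one expression.
ranked = (
    toy.groupby("timezone")["size"]           # group the size column by timezone
    .sum()                                    # total the sizes in each group
    .reset_index()                            # turn the Series into a DataFrame
    .sort_values("size", ascending=False)     # rank from largest to smallest
)

print(ranked)
```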


## Computing

Here's how you can create a new column based on the data in other columns, a process sometimes known as “computing.”

Let's say you want to take an extra step beyond the last chapter and figure out the screenshot totals by month. One approach would be to create a column and fill it with the month of each screenshot, then group and count on that field.

One problem: before we can extract the month from the `mtime` field, we need to convert it into a datetime. Run `info` again and you'll see that pandas read it in as a string, which pandas prefers to call an `object`.

In [50]:
merged_list.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43211 entries, 0 to 43210
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   handle      43211 non-null  object
 1   name        43211 non-null  object
 2   url_x       43211 non-null  object
 3   location    43211 non-null  object
 4   timezone    43211 non-null  object
 5   identifier  43211 non-null  object
 6   file_name   43211 non-null  object
 7   url_y       43211 non-null  object
 8   mtime       43211 non-null  object
 9   size        43211 non-null  int64 
 10  md5         43211 non-null  object
 11  sha1        43211 non-null  object
dtypes: int64(1), object(11)
memory usage: 4.3+ MB


Thankfully, pandas comes with a utility that is highly skilled at translating strings into timestamps. It's called `pd.to_datetime` and it accepts a Series as an input. Using it to convert `mtime` looks something like this:

In [51]:
pd.to_datetime(merged_list.mtime)

0       2022-07-09 03:55:23
1       2022-07-09 15:38:13
2       2022-07-10 04:02:11
3       2022-07-10 15:38:47
4       2022-07-11 04:03:34
                ...        
43206   2022-07-20 04:25:48
43207   2022-07-20 16:51:07
43208   2022-07-21 04:34:37
43209   2022-07-21 17:01:36
43210   2022-07-22 04:38:18
Name: mtime, Length: 43211, dtype: datetime64[ns]

The result of that conversion can be attached to the DataFrame by assigning it to a new column. We could call it anything. Let's go with `datetime`.

In [52]:
merged_list['datetime'] = pd.to_datetime(merged_list.mtime)

If you run `info` again you'll see the new column and the preferred data type.

In [53]:
merged_list.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43211 entries, 0 to 43210
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   handle      43211 non-null  object        
 1   name        43211 non-null  object        
 2   url_x       43211 non-null  object        
 3   location    43211 non-null  object        
 4   timezone    43211 non-null  object        
 5   identifier  43211 non-null  object        
 6   file_name   43211 non-null  object        
 7   url_y       43211 non-null  object        
 8   mtime       43211 non-null  object        
 9   size        43211 non-null  int64         
 10  md5         43211 non-null  object        
 11  sha1        43211 non-null  object        
 12  datetime    43211 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(11)
memory usage: 4.6+ MB
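
As a rough sketch of what the conversion does, here is `pd.to_datetime` applied to a short, made-up Series of strings. The `errors="coerce"` option is not used in this tutorial; it is included on the assumption that you may someday meet values that cannot be parsed, which it turns into missing values (`NaT`) instead of raising an error.

```python
import pandas as pd

# Made-up strings in the same layout as the mtime column, plus one bad value.
stamps = pd.Series(
    ["2022-07-09 03:55:23", "2022-07-09 15:38:13", "not a date"]
)

# errors="coerce" replaces anything unparseable with NaT instead of failing.
parsed = pd.to_datetime(stamps, errors="coerce")

print(parsed)

# Count how many values could not be converted.
print(parsed.isna().sum())
```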


Now that pandas is treating the column as a datetime, we can extract the month from each record in the Series by accessing the `dt` attribute put there by pandas.

In [54]:
merged_list.datetime.dt.month

0        7
1        7
2        7
3        7
4        7
        ..
43206    7
43207    7
43208    7
43209    7
43210    7
Name: datetime, Length: 43211, dtype: int64

Those values can be saved into their own new computed column.

In [55]:
merged_list['month'] = merged_list.datetime.dt.month

Which will now be found by `info`.

In [56]:
merged_list.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43211 entries, 0 to 43210
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   handle      43211 non-null  object        
 1   name        43211 non-null  object        
 2   url_x       43211 non-null  object        
 3   location    43211 non-null  object        
 4   timezone    43211 non-null  object        
 5   identifier  43211 non-null  object        
 6   file_name   43211 non-null  object        
 7   url_y       43211 non-null  object        
 8   mtime       43211 non-null  object        
 9   size        43211 non-null  int64         
 10  md5         43211 non-null  object        
 11  sha1        43211 non-null  object        
 12  datetime    43211 non-null  datetime64[ns]
 13  month       43211 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(11)
memory usage: 4.9+ MB
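
The `dt` accessor exposes other parts of a timestamp as well, not just the month. Here is a minimal sketch using two made-up timestamps; the variable names are hypothetical.

```python
import pandas as pd

# Two made-up timestamps, converted from strings.
stamps = pd.to_datetime(
    pd.Series(["2022-03-24 13:38:43", "2022-07-22 02:19:46"])
)

# Each attribute returns a new Series with one value per row.
months = stamps.dt.month
years = stamps.dt.year
hours = stamps.dt.hour

# day_name is a method, so it needs parentheses.
weekdays = stamps.dt.day_name()

print(months.tolist())
print(years.tolist())
print(hours.tolist())
print(weekdays.tolist())
```

Any of these could be saved into a computed column and passed to `groupby`, the same way `month` is used here.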


Finally, counting the number of screenshots by month will require going back to `groupby`. We will introduce the `size` method, which counts the number of records in each group. Again, `reset_index` translates the result into a DataFrame.

In [57]:
merged_list.groupby("month").size().reset_index()

Unnamed: 0,month,0
0,3,1080
1,4,4510
2,5,4623
3,6,12567
4,7,20431
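
You may have noticed that the count column above is labeled `0`, which isn't very descriptive. One way to tidy that up, sketched here on made-up data, is to pass a `name` to `reset_index`. The column name `count` is an arbitrary choice for this example.

```python
import pandas as pd

# Made-up data: one row per screenshot, with the month it was taken.
toy = pd.DataFrame({"month": [3, 4, 4, 7, 7, 7]})

# size() counts the rows in each group; name= labels the resulting column.
counts = toy.groupby("month").size().reset_index(name="count")

print(counts)
```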


## Further reading

With that, we've covered the basic skills necessary to access and analyze data with pandas. Or at least everything we could fit in two quick hours. If you'd like to learn more, consult [the full edition of "First Python Notebook"](https://palewi.re/docs/first-python-notebook/).