# Doctora Who



In [None]:
# Don't change this cell; just run it.
import numpy as np  # The array library.

import pandas as pd
# Safe setting for Pandas.  Needs Pandas version >= 1.5.
pd.set_option('mode.copy_on_write', True)

# The OKpy testing system.
from client.api.notebook import Notebook
ok = Notebook('process-02.ok')

As everybody who is anybody knows, Doctor Who is an alien who *regenerates*,
and in doing so, takes different human-like forms.

Coincidentally, this makes casting easier, because you can swap out the main
actor when you lose interest in them, they get a better contract, or they start
[forgetting their lines](https://en.wikipedia.org/wiki/William_Hartnell).

The general idea makes it perfectly possible for the Doctor to be a man or a
woman, and indeed, as anyone who is anybody knows, Jodie Whittaker played the
Doctor from 2017 through 2022.

In English, "Doctor Who" could be a male or a female, but in Spanish, a female
Doctor is more of a problem, because a male Doctor would be "Doctor Who", but
a female Doctor would be "Doctora Who", or, abbreviated, "Dra Who".

Hence the name of this page.

Incidentally, as you may have noticed, if you are talking about your General
Practitioner, you are a [little bit more
likely](https://www.statista.com/statistics/698260/registered-doctors-united-kingdom-uk-by-gender-and-specialty/)
to be talking about a doctora than a doctor.


##  Has Doctor Who got better, or worse?

Jodie Whittaker's tenure has attracted [some
criticism](https://en.wikipedia.org/wiki/Thirteenth_Doctor#Critical_reception),
not primarily because of her acting, but because of the scripts.  One
accusation that has been made is that the scripts are too 'woke'.  We won't
dive into that can of worms here, but let's start to look at how you'd assess
whether Jodie Whittaker's episodes are popular, in terms of ratings, and number
of viewers, compared to other Doctors.

The most obvious comparison would be to Peter Capaldi, the previous Doctor.

In this exercise, we look at the data to see if there is any good evidence that
Jodie Whittaker's Doctora was less popular than Peter Capaldi's Doctor.


## The data

You are about to read the processed data that we web-scraped from
<https://guide.doctorwhonews.net>.

See [the dataset page](https://github.com/odsti/datasets/tree/main/doctor_who)
for more information.

Here we read the CSV file as a Pandas DataFrame.

In [None]:
# Run this cell
df = pd.read_csv('./data/doctor_who_stats.csv')
df.head()

As usual we need to know what the column values mean:

*   `Episode Title`
*   `Weekday`: day of week of first broadcast.
*   `Length`: run time.
*   `Share`: audience share relative to other programmes broadcast at same time.
*   `AI`: [Audience Appreciation Index](https://tardis.fandom.com/wiki/Appreciation_Index)
*   `Chart`: Ranking in terms of number of viewers (see below) compared to all
    other programmes broadcast that week.
*   `Broadcast datetime`: Date and time of first broadcast.
*   `viewers_in_millions`: Viewers in millions.  These appear to be viewers
    within 7 days of the original broadcast, initially on TV only, and later
    including tablets and PCs, and later still, tablets, PCs and smartphones.
    See the notebook above for more discussion.

Notice the `Broadcast datetime` column.  It has the time and date of the first
broadcast of each episode.  Here is the column, extracted as a Pandas Series.

In [None]:
df['Broadcast datetime']

Notice the data type `object`.  Here is the first value in the column:

In [None]:
# Extract the first value.
df['Broadcast datetime'].iloc[0]

There are quotes around the displayed value — the value is a string:

In [None]:
type(df['Broadcast datetime'].iloc[0])

To make this column more useful we need to convert the strings into something
Pandas recognizes as dates and times.  To do this, we pass the column of values
to the `pd.to_datetime` function, to get a column of datetime values that
Pandas recognizes as recording date and time.

In [None]:
broadcast_dts = pd.to_datetime(df['Broadcast datetime'])
broadcast_dts
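
To see the conversion in isolation, here is a minimal sketch on a small made-up
Series.  The date strings and their format are assumptions for illustration;
they are not taken from the dataset.

```python
import pandas as pd

# Made-up date-and-time strings (hypothetical, not from the dataset).
toy_strings = pd.Series(['2020-01-01 18:30:00', '2021-06-15 19:00:00'])

# Convert the strings to datetime values.
toy_dts = pd.to_datetime(toy_strings)

# The converted Series has a datetime data type ...
print(pd.api.types.is_datetime64_any_dtype(toy_dts))
# ... so Pandas can pull out parts of each date, such as the year.
print(toy_dts.dt.year.tolist())
```

Once the values are datetimes, Pandas can compare them and plot them in time
order, which is what we need below.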

Use *direct indexing with column labels* (DICL) to replace the current
`Broadcast datetime` column with the new values stored in `broadcast_dts`.

In [None]:
df... = ...
# Show the first five rows of the result
df.head()

In [None]:
_ = ok.grade('q_01_bcdt')

We are interested in a couple of columns for popularity of each episode.

The first is `AI` — audience appreciation index.  From the link above:

> A 21st century AI score is calculated using a small but representative group
> of viewers. This sample will watch a program and then rate the program on a
> scale of one to ten. The scores are then averaged and multiplied by ten.
> Hence, an AI of 67 means that 6.7 was the simple mean of all responses.
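
As a sketch of that arithmetic, here is the calculation for ten hypothetical
panel ratings.  The ratings are made up, and rounding the result to a whole
number is an assumption rather than something the quote states.

```python
# Ten hypothetical ratings, each on a scale of one to ten.
ratings = [7, 7, 7, 7, 7, 7, 7, 6, 6, 6]

# Simple mean of the ratings.
mean_rating = sum(ratings) / len(ratings)

# Multiply by ten to put the score on a 0 to 100 scale.
# Rounding to a whole number is an assumption for this sketch.
ai_score = round(mean_rating * 10)
print(mean_rating, ai_score)
```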

Notice that the measurement of this score has changed several times, so scores
from — say — 1970 are not comparable with those from 2020.

To give an idea of what the change in scores looks like, use the
`.plot.scatter` method of the `df` DataFrame to plot `Broadcast datetime` on
the x-axis against `AI` on the y-axis.

In [None]:
df....

It looks as though scores before the year 2000 were consistently lower than
scores after 2000.

The BBC had not broadcast Doctor Who for 15 years, when they [relaunched the
programme](https://en.wikipedia.org/wiki/History_of_Doctor_Who#Ninth_Doctor) on
March 26 2005.

Here is how we make a value to represent that relaunch date, minus one day for
safety (we are going to look for broadcasts after that date).

In [None]:
just_before_relaunch = pd.to_datetime('2005-03-25')
just_before_relaunch

You can use comparisons on these datetime values, as you would for other
values.  Greater than corresponds to after, and less than corresponds to
before:

In [None]:
# New Year's day of the new millennium.
new_millenium = pd.to_datetime('2000-01-01')
new_millenium

In [None]:
new_millenium > just_before_relaunch

In [None]:
new_millenium < just_before_relaunch
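
The same comparisons work on a whole Series of datetime values at once.  Here
is a minimal sketch with made-up dates; the names `toy_dates` and `cutoff` are
hypothetical.

```python
import pandas as pd

# Three made-up dates (hypothetical, not from the dataset).
toy_dates = pd.Series(pd.to_datetime(['1999-12-31', '2000-06-01', '2010-01-01']))
cutoff = pd.to_datetime('2000-01-01')

# The comparison is applied to each element, giving one True / False per row.
are_after = toy_dates > cutoff
print(are_after.tolist())  # [False, True, True]
```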

Create a Boolean Series `are_after_relaunch` that has True for rows in the
DataFrame where `Broadcast datetime` was later than `just_before_relaunch` and
False otherwise.

In [None]:
are_after_relaunch = ...
# Show the result
are_after_relaunch

In [None]:
_ = ok.grade('q_02_after_relaunch')

We are going to restrict the rest of our analyses to the broadcasts after the
relaunch.   Use *Direct Indexing with Boolean Series* (DIBS) to select the rows
in `df` after the relaunch date.  Call the resulting DataFrame
`relaunched_doctor`.
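
As a reminder of how DIBS behaves, here is a minimal sketch on a made-up
DataFrame.  The `score` column and the threshold of 80 are assumptions for
illustration only.

```python
import pandas as pd

# A made-up DataFrame (hypothetical values).
toy = pd.DataFrame({'score': [70, 85, 90]})

# Boolean Series: True where the score is above 80.
is_high = toy['score'] > 80

# DIBS keeps only the rows where the Boolean Series has True.
high = toy[is_high]
print(high)
```

Notice that the selected rows keep their original row labels.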

In [None]:
relaunched_doctor = ...
relaunched_doctor

In [None]:
_ = ok.grade('q_03_relaunched_doctor')

The `AI` ratings are all high; nearly all are above 80 / 100.

We are interested in *comparing* between AI ratings, so we are more interested
in differences than absolute scores.

To make that comparison easier, use DICL to make a new Series that contains the
`AI` scores, minus the mean of the `AI` scores, and insert that Series into the
`relaunched_doctor` DataFrame with the column name `AI deviation`.
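
To show what subtracting the mean does, here is a minimal sketch on a made-up
DataFrame.  The column names `score` and `score deviation` are hypothetical.

```python
import pandas as pd

# A made-up DataFrame (hypothetical values).
toy = pd.DataFrame({'score': [80, 85, 90]})

# Subtract the mean of the column from every value in the column.
# The result is a new Series, which we insert as a new column.
toy['score deviation'] = toy['score'] - toy['score'].mean()
print(toy)
```

The deviations have a mean of zero, so positive values are above-average
scores and negative values are below-average scores.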


In [None]:
relaunched_doctor... = ...
# Show the first five rows of the result.
relaunched_doctor.head()

In [None]:
_ = ok.grade('q_04_ai_deviation')

Now use the `.plot.scatter` method to do a plot of `AI deviation` scores (y-axis)
*as a function of* `Broadcast datetime` (x-axis), for the `relaunched_doctor`
DataFrame.

In [None]:
relaunched_doctor...

Do the matching plot for `Broadcast datetime` and `Chart`.  Remember, low
numbers are good for chart positions.

In [None]:
relaunched_doctor...

**For reflection** - have a look at these plots.  What trends do you see?   How
is this going to affect our interpretation of the scores for Peter Capaldi and
Jodie Whittaker?


## Selecting episodes

Now we are down to the stage where we want to select episodes corresponding to
Jodie Whittaker and to Peter Capaldi.

To make that a little easier, make a new DataFrame that replaces the current,
rather useless numerical row labels with the values from the `Episode Title`
column.

Call this new DataFrame `by_name`.
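
As a reminder, here is a minimal sketch of replacing row labels with the values
from a column, assuming the `set_index` method is the tool you have in mind.
The DataFrame and its `title` column are made up.

```python
import pandas as pd

# A made-up DataFrame with the default numerical row labels.
toy = pd.DataFrame({'title': ['Alpha', 'Beta', 'Gamma'],
                    'score': [81, 84, 79]})

# set_index returns a new DataFrame with the column values as row labels.
# The original `toy` DataFrame keeps its numerical labels.
toy_by_title = toy.set_index('title')
print(toy_by_title)
```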

In [None]:
by_name = ...
# Show the result
by_name

In [None]:
_ = ok.grade('q_05_by_name')

The titles of Jodie Whittaker's [first and last episodes as the
Doctor](https://en.wikipedia.org/wiki/List_of_actors_who_have_played_the_Doctor)
were:

In [None]:
jws_first_episode = "The Woman Who Fell to Earth"
jws_last_episode = "The Power of the Doctor"

With these values in hand, use *indirect indexing by label* (`.loc` indexing)
on the `by_name` DataFrame, to make a new DataFrame, `jws_doctor`, that only
contains Jodie Whittaker's episodes.
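
One thing to remember about `.loc` is that slicing by label includes both the
start label and the stop label.  Here is a minimal sketch on a made-up
DataFrame with hypothetical row labels.

```python
import pandas as pd

# A made-up DataFrame with text row labels (hypothetical values).
toy = pd.DataFrame({'score': [81, 84, 79, 88]},
                   index=['Alpha', 'Beta', 'Gamma', 'Delta'])

# Label slices with .loc include BOTH the start and the stop label.
middle = toy.loc['Beta':'Gamma']
print(middle.index.tolist())  # ['Beta', 'Gamma']
```

This differs from slicing by position, where the stop position is excluded.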

In [None]:
jws_doctor = ...
# Show the result
jws_doctor

In [None]:
_ = ok.grade('q_06_jws_doctor')

We are particularly interested in the `AI deviation` scores.

Use the DataFrame plotting methods to do a bar plot of Jodie Whittaker's `AI
deviation` scores.

In [None]:
jws_doctor...

While you're at it — do a bar plot of the `Chart` positions for Jodie
Whittaker's episodes:

In [None]:
jws_doctor...

Now let's shift to Peter Capaldi.  These are his first and last episodes as the
Doctor:

In [None]:
pc_first_episode = "Deep Breath"
pc_last_episode = "Twice Upon A Time"

As you did for Jodie Whittaker, make a new DataFrame called `pcs_doctor` that
contains the Peter Capaldi episodes.

In [None]:
pcs_doctor = ...
# Show the result
pcs_doctor

In [None]:
_ = ok.grade('q_07_pcs_doctor')

As before, do a bar plot of Peter Capaldi's `AI deviation` scores:

In [None]:
pcs_doctor...

Do a bar plot of Peter Capaldi's `Chart` positions:

In [None]:
pcs_doctor...

## What do you think?

Now you have the overall trends, and the values for Jodie Whittaker and Peter
Capaldi, what do you think?  Is there evidence here that Jodie Whittaker was
particularly unpopular with viewers?  Or was she popular?

Prepare your arguments!  We'll discuss.


## Done.

Congratulations, you're done with the assignment!  Be sure to:

- **run all the tests** (the next cell has a shortcut for that).
- **Save and Checkpoint** from the `File` menu.

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]