# An Introduction to Time Series Visualizations with Python Using Bird Species Data from Marymoor Park
## (Or as I like to think of it - what bird when?)

Why time series visualizations? Why birds? To explain the former, I need to start with the latter. I like birds. I mean, I REALLY like birds. I also happen to have some cool data about bird species for one of my favorite places on Earth (Marymoor Park, WA), thanks to a dedicated group of birdwatchers led by Michael Hobbs. Since 1983, Friends of Marymoor Park has hosted a weekly bird walk in the park documenting the presence of any bird species observed. Hobbs has also drawn from online sources such as Tweeters (an email listserv for birding in the Pacific Northwest) and eBird (a database of bird sightings and the world's largest citizen science project) to build as complete a picture as possible about bird species at Marymoor. In short, I have a set of time series data that I'm really excited to explore. And now that I've been in the Seattle Flatiron Data Science program for several weeks, I have some newly acquired data scientist superpowers to grow! (How serendipitous...) I also have a program requirement to blog about a data visualization technique. So to *figuratively* kill two birds with one stone (Yes, I can joke about it. Where did that phrase come from anyway?), I decided to blog about time series visualizations using my super-cool, totally awesome, AMAZING data on bird species! (How convenient...)

Let's talk about time data - any information including a variable of time (seconds, minutes, hours, days, years, . Time series are ubiquitous. They tell what is happening when, and, using that information, forecast what will happen then, i.e., predict the future! And frankly, humans can use all the help we can get. I've started to read "The Signal and the Noise: Why So Many Predictions Fail - But Some Don't" by Nate Silver. One of my biggest takeaways so far is that people are pretty bad at making accurate predictions. And from just a personal foray into forecasting, I need to better understand time series predictions before talking about them, so for this post I'd like to focus on the "what is happening when" and how to picture it. The goal of this post is to share how one can understand the narrative of a time series by using visualizations effectively. 

There are many ways to visualize time series. They include:
* polar area diagrams
* line graphs 
* heat maps
* stream graphs 
* Gantt charts 
* bar charts
* stacked area charts<br>
After that - or even before, - it's semantics. Each type of visualization has its own perks and drawbacks.

The narrative of a time series may include trends, seasonality, cyclicity, and unique events. A trend is any long-term change, such as the winter range expansion of Anna's hummingbirds of over 700 kilometers in the past two decades. Seasonality is variation following the change of seasons, like the annual migration of tree swallows north in the springtime and south in the winter. Cyclicity refers to data rising and falling within a fixed period, such as the irruptions of snowy owls every 4-5 years - when snowy owls coming flooding down from the north. And unique events are significant peaks or valleys outside of any pattern, like the mass mortality events in the north pacific marine environment documented by the Coastal Observation and Seabird Survey Team (COASST). To isolate trends, seasonality, cyclicity, and unique events from noise, I find line graphs useful.

So let's get started!<br>
First, I import all the libraries that I will need.

In [1]:
import pandas as pd
import matplotlib as pyplot
import seaborn as sns

Then I take an initial look at my data. 

In [2]:
df = pd.read_csv("mbird_data.csv")

In [3]:
df.head().T

Unnamed: 0,0,1,2,3,4
Sort Order,1307,23,41,48,53
Species ID,LEFL,CANG,GADW,MALL,GWTE
Common Name,Least Flycatcher,Canada Goose,Gadwall,Mallard,Green-winged Teal
Scientific Name,Empidonax minimus,Branta canadensis,Mareca strepera,Anas platyrhynchos,Anas crecca
Date,05-Jun-83,18-Apr-90,18-Apr-90,18-Apr-90,18-Apr-90
Number,1,1,0,4,10
Male,True,False,False,False,True
Female,False,False,False,False,True
Pair,False,False,False,True,False
Adult,False,True,True,True,True


In [4]:
df.shape

(80085, 17)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80085 entries, 0 to 80084
Data columns (total 17 columns):
Sort Order         80085 non-null float64
Species ID         80085 non-null object
Common Name        80085 non-null object
Scientific Name    80085 non-null object
Date               80085 non-null object
Number             80041 non-null float64
Male               80085 non-null bool
Female             80085 non-null bool
Pair               80085 non-null bool
Adult              80085 non-null bool
Immature           80085 non-null bool
Nesting Codes      10308 non-null object
Heard Only         80085 non-null bool
Uncertain ID       80085 non-null bool
Uncountable        80085 non-null bool
Hybrid             80085 non-null bool
Notes              14255 non-null object
dtypes: bool(9), float64(2), object(6)
memory usage: 5.6+ MB


In [6]:
df['Uncertain ID'].value_counts()

False    78996
True      1089
Name: Uncertain ID, dtype: int64

Good to keep in mind, 1089 of the entries have uncertain IDs. Will keep in for now. Need to convert Date from object to timestamp.

In [6]:
df['Date'] = pd.to_datetime(df['Date'])

In [7]:
df['Year'] = df['Date'].dt.year

In [8]:
df['Week'] = df['Date'].dt.week

In [9]:
df['Week'][0], df['Date'][0]

(22, Timestamp('1983-06-05 00:00:00'))

Next I want to check all species and limit to just birds. Can't think of a better way than to go through list of unique common names visually.

In [10]:
df['Common Name'].unique()

array(['Least Flycatcher', 'Canada Goose', 'Gadwall', 'Mallard',
       'Green-winged Teal', 'Common Merganser', 'Rock Pigeon',
       'American Coot', 'Killdeer', 'Western Sandpiper',
       'Great Blue Heron', 'Red-tailed Hawk', 'American Crow',
       'Violet-green Swallow', 'Northern Rough-winged Swallow',
       'Barn Swallow', 'Chestnut-backed Chickadee', 'Bushtit',
       'Marsh Wren', 'American Robin', 'European Starling',
       'American Goldfinch', 'Savannah Sparrow', 'Song Sparrow',
       'Red-winged Blackbird', 'Common Yellowthroat', 'Cinnamon Teal',
       'Mourning Dove', 'Glaucous-winged Gull', 'Northern Harrier',
       'Belted Kingfisher', 'House Wren', 'Black-capped Chickadee',
       'Western Meadowlark', 'Pied-billed Grebe', 'American Kestrel',
       'Spotted Towhee', 'White-crowned Sparrow',
       'Golden-crowned Sparrow', 'Yellow-rumped Warbler',
       "Wilson's Warbler", 'Black-headed Grosbeak', 'California Quail',
       'Bald Eagle', "Swainson's Thrush", '

In [11]:
non_bird = ["Eastern Gray Squirrel", "Roof Rat", "Raccoon", "Muskrat", "Long-tailed Weasel", "Northwestern Garter Snake", "Bullfrog", "Red-eared Slider",
           "Coyote", "American Beaver", "Western Toad", "River Otter", "Painted Turtle", "Virginia Opossum", "Pacific Tree Frog", "Mule Deer", 
           "Big Brown Bat", "Douglas Squirrel", "North American Deer Mouse", "Bobcat", "Eastern Cottontail", "White-tailed Deer", "Townsend's Mole",
           "Northern Flying Squirrel", "Townsend's Chipmonk", "Black-tailed Jack Rabbit", "Mink", "Long-tailed Vole", "Mountain Beaver",
           "Long-toed Salamander", "Coypu", "Black Bear", "Northwestern Salamander", "Little Brown Myotis", "Water Vole", "Northern Leopard Frog"]

Fix typo in Townsend's Chipmunk.

In [12]:
df['Common Name'].replace( "Townsend's Chipmonk", "Townsend's Chipmunk", inplace=True)
df.loc[df['Common Name'] == "Townsend's Chipmunk"]

Unnamed: 0,Sort Order,Species ID,Common Name,Scientific Name,Date,Number,Male,Female,Pair,Adult,Immature,Nesting Codes,Heard Only,Uncertain ID,Uncountable,Hybrid,Notes,Year,Week
21658,10022.0,TOCH,Townsend's Chipmunk,Tamias townsendii,2003-06-02,1.0,False,False,False,False,False,,False,False,False,False,In the ivy area just south of the first footbr...,2003,23


In [13]:
bird = []
for val in df["Common Name"].unique():
    if val not in non_bird:
        bird.append(val)

In [14]:
bird[:5]

['Least Flycatcher', 'Canada Goose', 'Gadwall', 'Mallard', 'Green-winged Teal']

It works! Now I make a dataframe of just birds, and since I'm only interested in time and name identifiers (and might want to drop uncertain IDs), limit it to the following columns: species ID, common name, scientific name, date, number, year, week in year, uncountable, and uncertain IDs.

In [15]:
birds_df = df[df['Common Name'].isin(bird)]
columns_keep = ['Common Name', 'Scientific Name', 'Date', 'Number', 'Uncertain ID', 'Uncountable', 'Year', 'Week']
bt_df = birds_df[columns_keep].copy()
bt_df.head()

Unnamed: 0,Common Name,Scientific Name,Date,Number,Uncertain ID,Uncountable,Year,Week
0,Least Flycatcher,Empidonax minimus,1983-06-05,1.0,False,False,1983,22
1,Canada Goose,Branta canadensis,1990-04-18,1.0,False,False,1990,16
2,Gadwall,Mareca strepera,1990-04-18,0.0,False,False,1990,16
3,Mallard,Anas platyrhynchos,1990-04-18,4.0,False,False,1990,16
4,Green-winged Teal,Anas crecca,1990-04-18,10.0,False,False,1990,16


In [17]:
bt_df.set_index('Date', inplace=True)
bt_df.head()

Unnamed: 0_level_0,Common Name,Scientific Name,Number,Uncertain ID,Uncountable,Year,Week
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1983-06-05,Least Flycatcher,Empidonax minimus,1.0,False,False,1983,22
1990-04-18,Canada Goose,Branta canadensis,1.0,False,False,1990,16
1990-04-18,Gadwall,Mareca strepera,0.0,False,False,1990,16
1990-04-18,Mallard,Anas platyrhynchos,4.0,False,False,1990,16
1990-04-18,Green-winged Teal,Anas crecca,10.0,False,False,1990,16


In [18]:
dft = bt_df[['Common Name']].resample('W', base=1).nunique()

In [19]:
dft['spec_count'] = dft.index.week

In [20]:
dft.head()

Unnamed: 0_level_0,Common Name,spec_count
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
1983-06-05,1,22
1983-06-12,0,23
1983-06-19,0,24
1983-06-26,0,25
1983-07-03,0,26


In [21]:
df_spec_count = bt_df[['Common Name']].resample('W', base=1)

In [22]:
df_spec_count

DatetimeIndexResampler [freq=<Week: weekday=6>, axis=0, closed=right, label=right, convention=start, base=1]

In [23]:
print(bt_df.shape)
print(bt_df.info())
bt_df["Uncertain ID"].value_counts()

(77002, 7)
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 77002 entries, 1983-06-05 to 2019-03-09
Data columns (total 7 columns):
Common Name        77002 non-null object
Scientific Name    77002 non-null object
Number             76964 non-null float64
Uncertain ID       77002 non-null bool
Uncountable        77002 non-null bool
Year               77002 non-null int64
Week               77002 non-null int64
dtypes: bool(2), float64(1), int64(2), object(2)
memory usage: 6.2+ MB
None


False    75977
True      1025
Name: Uncertain ID, dtype: int64

Investigate null numbers.

In [24]:
bt_df_nulls = bt_df[bt_df.isna().any(axis=1)]
bt_df_nulls

Unnamed: 0_level_0,Common Name,Scientific Name,Number,Uncertain ID,Uncountable,Year,Week
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1996-06-27,Pied-billed Grebe,Podilymbus podiceps,,False,False,1996,26
1999-12-16,House Finch,Haemorhous mexicanus,,False,False,1999,50
2000-05-24,Double-crested Cormorant,Phalacrocorax auritus,,False,False,2000,21
2000-11-15,Hooded Merganser,Lophodytes cucullatus,,False,False,2000,46
2001-10-31,Yellow-rumped Warbler,Setophaga coronata,,True,False,2001,44
2001-11-28,Spotted Towhee,Pipilo maculatus,,False,False,2001,48
2001-12-15,Pied-billed Grebe,Podilymbus podiceps,,False,False,2001,50
2001-12-15,Western Grebe,Aechmophorus occidentalis,,False,False,2001,50
2001-12-15,Double-crested Cormorant,Phalacrocorax auritus,,False,False,2001,50
2003-10-17,Hooded Merganser,Lophodytes cucullatus,,False,False,2003,42


Decide to export and ask data source about nulls. Will drop from dataset for now.

In [None]:
bt_df_nulls.to_excel('null_counts.xlsx')

In [25]:
bt_df.fillna(1, inplace=True)

In [26]:
bt_df['Number'].isna().any()

False

Next, add in columns for total species count per week (tspec_count_wk), species count per week (spec_count_wk), and presence/absence of species per week (pres_spec_wk). 

In [None]:
cg_df = bt_df[bt_df['Common Name'] == 'Canada Goose']
cg_df.head()

In [None]:
cg_df = cg_df.resample('W', ).sum()
cg_df.head()

In [None]:
all_birds = pd.DataFrame()
start = pd.DataFrame(data=[0], index=[pd.Timestamp('1983-06-04')])
for bird in bt_df['Common Name'].unique()[:2]:
    df = bt_df[bt_df['Common Name'] == bird].copy()
    df = df.append(start)
    df.resample('W').sum()
    all_birds = all_birds.append(df)
    
all_birds.head()

In [None]:
wk_yr_spec_count = bt_df.groupby(['Year', 'Week'])['Common Name'].nunique().reset_index()

In [None]:
wk_yr_spec_count.head(10)

In [None]:
wk_yr_spec_count['Week'][0].astype(str) + wk_yr_spec_count['Year'][0].astype(str)

In [None]:
spec_count = bt_df.groupby(['Date', 'Year', 'Week', 'Common Name'])['Number'].sum().reset_index()
spec_count.head()

In [None]:
bt_df.head()

In [None]:
bt_df.groupby(['Year', 'Week', 'Common Name'])['Number'].head()

frequency = number of obs (species) per unit time, week

Line Graphs
* seasonal plot, month, season, circular, 4 plots (subseries)
* autocorrelation, lag plot
* ACF plot peaks with seasonal or cyclic, no auto correlation are white noise, just random, autocorrelation equal to zero, 95% of acf spikes to be +- 2/sqrt(T) where T is length of time series, common to plot bounds on acf, 1 or more large spikes outside bout, or if more than 5% of spikes outside bounds, probably not white noise, tests if an individual lag autocorrelation is different than zero
** trends make positive correlations in early lags, strong trends make recent obs closer together
** seasonality make peak at seasonal lags, strongest correlation at values at same time of year
** cyclicity makes peaks at average cycle length
* Ljung-Box test, tests whether any of a group of autocorrelations of a time series are different from zero, overall randomness from a number of lags, if small p value, probably not white noise, 
* line graphs, time imposes additional structure on data, inherent order, in scatterplots, dots are placed evenly along axis with order, can connect with a line, and often omit dots, the denser a time series, less important are dots, can fill in area under line with solid color to emphasize trends, only valid if y-axis starts at zero, 
** can do multiple time series, label lines directly to limit cognitive load
** time series of 2 or more response variables, separate line graphs or plot together in connected scatterplot (easier to confuse order and direction, less likely to report correlation, higher engagement)
* smoothing for trends, LOESS curve or spline, careful, may have lots of different interpretations!
* detrending to find deviations
* beside simple detrending can...