# `datetime` and `matplotlib` intro

This lesson rounds out the introductory pandas work and introduces our basic plotting library `matplotlib`.  

**OBJECTIVES**

- Understand and use `datetime` objects in pandas DataFrames
- Use `matplotlib` to produce basic plots from data
- Understand when to use histograms, boxplots, line plots, and scatterplots with data


In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## `datetime`

pandas has a dedicated datatype for values that represent dates and times.  We can convert a column to this type with `pd.to_datetime`, which gives us access to date-specific attributes and methods through the `.dt` accessor, much like those of the standard `datetime` module.

In [19]:
# read in the sales data
url = 'data/product_sales.csv'

#read_csv
aapl = pd.read_csv(url)
aapl.head(3)

Unnamed: 0,product_title,product_vendor,product_type,product_price,day,week,month,quarter,year,net_quantity,gross_sales,discounts,returns,net_sales,taxes,total_sales,total_cost
0,Yerba,,,3.0,2023-03-21,2023-W12,2023-03,2023-01,2023,10,30.0,-0.6,0.0,29.4,2.36,31.76,0.0
1,Yerba,,,3.0,2023-05-24,2023-W21,2023-05,2023-04,2023,8,24.0,0.0,0.0,24.0,1.92,25.92,0.0
2,Yerba,,,3.0,2023-02-26,2023-W08,2023-02,2023-01,2023,8,24.0,0.0,0.0,24.0,1.92,25.92,0.0


In [20]:
aapl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4108 entries, 0 to 4107
Data columns (total 17 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   product_title   4108 non-null   object 
 1   product_vendor  3305 non-null   object 
 2   product_type    249 non-null    object 
 3   product_price   4108 non-null   float64
 4   day             4108 non-null   object 
 5   week            4108 non-null   object 
 6   month           4108 non-null   object 
 7   quarter         4108 non-null   object 
 8   year            4108 non-null   int64  
 9   net_quantity    4108 non-null   int64  
 10  gross_sales     4108 non-null   float64
 11  discounts       4108 non-null   float64
 12  returns         4108 non-null   float64
 13  net_sales       4108 non-null   float64
 14  taxes           4108 non-null   float64
 15  total_sales     4108 non-null   float64
 16  total_cost      4108 non-null   float64
dtypes: float64(8), int64(2), object(7)

In [21]:
# convert to datetime
aapl['Date'] = pd.to_datetime(aapl['day'])
aapl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4108 entries, 0 to 4107
Data columns (total 18 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   product_title   4108 non-null   object        
 1   product_vendor  3305 non-null   object        
 2   product_type    249 non-null    object        
 3   product_price   4108 non-null   float64       
 4   day             4108 non-null   object        
 5   week            4108 non-null   object        
 6   month           4108 non-null   object        
 7   quarter         4108 non-null   object        
 8   year            4108 non-null   int64         
 9   net_quantity    4108 non-null   int64         
 10  gross_sales     4108 non-null   float64       
 11  discounts       4108 non-null   float64       
 12  returns         4108 non-null   float64       
 13  net_sales       4108 non-null   float64       
 14  taxes           4108 non-null   float64       
 15  tota

In [None]:
# extract the month


In [None]:
# extract the day


In [None]:
# set date to be index of data


In [None]:
# sort the index


In [None]:
# select 2023
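
The steps named in the cells above can be sketched on a small stand-in table. This is a minimal sketch, not the lesson's solution: the `sales` frame and its values are hypothetical, and only the `day` and `net_quantity` column names are borrowed from the sales data.

```python
import pandas as pd

# Hypothetical stand-in for the sales data: a string "day" column
sales = pd.DataFrame({
    "day": ["2023-03-21", "2022-12-30", "2023-02-26"],
    "net_quantity": [10, 4, 8],
})

# Convert the strings to datetime64 values
sales["Date"] = pd.to_datetime(sales["day"])

# The .dt accessor exposes date components for a whole column
months = sales["Date"].dt.month
days = sales["Date"].dt.day

# Use the dates as a sorted index
sales = sales.set_index("Date").sort_index()

# Partial string indexing: every row whose date falls in 2023
sales_2023 = sales.loc["2023"]
print(sales_2023)
```

Sorting the index first keeps date-based selection predictable, since slices such as `sales.loc["2023-01":"2023-02"]` assume the dates are in order.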


In [24]:
from datetime import datetime

In [25]:
# what time is it?
then = datetime.now()
then

datetime.datetime(2025, 2, 12, 15, 39, 19, 815216)

In [26]:
# how much time has passed?
datetime.now() - then

datetime.timedelta(seconds=1, microseconds=364005)

### More with timestamps

- Date times: A specific date and time with timezone support. Similar to `datetime.datetime` from the standard library.

- Time deltas: An absolute time duration. Similar to `datetime.timedelta` from the standard library.


In [None]:
# create a pd.Timedelta
delta = pd.Timedelta('1W')

In [None]:
# shift the current time by the delta (one week)
datetime.now() + delta
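
A `Timedelta` is a fixed duration, so it cannot express "three months" exactly, because months differ in length. One way to shift by calendar months is `pd.DateOffset`; the sketch below assumes a hypothetical start date chosen only for illustration.

```python
import pandas as pd

start = pd.Timestamp("2023-01-15")  # hypothetical example date

# A Timedelta is a fixed duration: exactly 7 days here
one_week_later = start + pd.Timedelta("1W")

# A DateOffset follows the calendar, so month lengths are respected
three_months_later = start + pd.DateOffset(months=3)

# Subtracting two timestamps gives back a Timedelta
elapsed = three_months_later - start
print(one_week_later, three_months_later, elapsed.days)
```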

#### Problems

In [None]:
ufo_url = 'https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa24/refs/heads/main/data/ufo.csv'

1. Return to the ufo data and convert the Time column to a datetime object.

2. Set the Time column as the index column of the data.

3. Sort the data by its index.

4. Create a new dataframe with ufo sightings since January 1, 1999

### Grouping with Dates

The `resample` method works much like `groupby` for DataFrames whose index holds datetimes.  Instead of grouping on the values of a column, it groups rows into time periods such as week, month, quarter, or year.

In [None]:
dow = sns.load_dataset('dowjones')

In [None]:
#check the info
dow.info()

In [None]:
#handle the index
dow.set_index('Date', inplace = True)

In [None]:
#check that things changed
dow.info()

In [None]:
dow.head()

In [None]:
#average monthly price ('M' = month end; pandas 2.2+ prefers 'ME')
dow.resample('M').mean()

In [None]:
#quarterly maximum price ('Q' = quarter end; pandas 2.2+ prefers 'QE')
dow.resample('Q').max()
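
To see what the time-period groups look like, here is a minimal sketch on a hypothetical two-week daily series (the `daily` frame and its values are made up). It uses the weekly alias `'W'`, and also shows a plain `groupby` on a date component for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: 14 days starting Monday 2023-01-02
idx = pd.date_range("2023-01-02", periods=14, freq="D")
daily = pd.DataFrame({"Price": np.arange(14, dtype=float)}, index=idx)

# Weekly mean: each group is one week, labeled by the Sunday that ends it
weekly_mean = daily.resample("W").mean()

# Weekly max over the same groups
weekly_max = daily.resample("W").max()

# A similar idea with groupby, keyed on the year of each index entry
yearly_mean = daily.groupby(daily.index.year).mean()
print(weekly_mean)
```

Each row of the result is labeled with the end of its period, which is why the weekly rows carry Sunday dates.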

### Exploratory Data Analysis

> In statistics, exploratory data analysis (EDA) is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell beyond the formal modeling and thereby contrasts with traditional hypothesis testing, in which a model is supposed to be selected before the data is seen. Exploratory data analysis has been promoted by John Tukey since 1970 to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. --[Wikipedia](https://en.wikipedia.org/wiki/Exploratory_data_analysis)

#### Example: Tips Dataset

Food servers’ tips in restaurants may be influenced by many factors, including the nature of the restaurant, size of the party, and table locations in the restaurant. Restaurant managers need to know which factors matter when they assign tables to food servers. For the sake of staff morale, they usually want to avoid either the substance or the appearance of unfair treatment of the servers, for whom tips (at least in restaurants in the United States) are a major component of pay. 

In one restaurant, a food server recorded the following data on all customers they served during an interval of two and a half months in early 1990. The restaurant, located in a suburban shopping mall, was part of a national chain and served a varied menu. In observance of local law the restaurant offered seating in a non-smoking section to patrons who requested it. Each record includes a day and time, and taken together, they show the server’s work schedule.

In [None]:
tips = sns.load_dataset('tips')

In [None]:
tips.head()

In [None]:
tips.info()

## Introduction to `matplotlib`

Now, let us turn our attention to plotting data.  We begin with basic plots, and later explore some customization and additional plots.  For these exercises, we will use the Dow Jones price data and the tips dataset, both loaded from the `seaborn` library.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

### Line Plots with Matplotlib

To begin, select the `total_bill` column of the data.  

In [None]:
#create a new column based on percent tipped
tips['tip_pct'] = tips['tip'] / tips['total_bill'] * 100

In [None]:
### plt.plot
plt.plot(tips['total_bill'])

In [None]:
### use the series
tips['total_bill'].plot()

In [None]:
#plot dow jones Price with matplotlib
plt.plot(dow)

In [None]:
#plot dow jones data from the DataFrame
dow.plot()

#### Choosing A Plot

Below, plots are shown first for a single quantitative variable, then for a single categorical variable.  Next come plots for two variables: two continuous variables, one continuous vs. one categorical, and any mix of continuous and categorical.

#### Histogram

A histogram *is an approximate representation of the distribution of numerical data*.  This is a plot we use for any single continuous feature to better understand the shape of the data.  

In [None]:
### tip percentage histogram
plt.hist(tips['tip_pct'])

In [None]:
### as a method with the series
tips['tip_pct'].hist()

In [None]:
### adjusting the bin number
plt.hist(tips['tip_pct'], bins = 100);

In [None]:
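### adding a title, labels, edgecolor, and alpha
plt.hist(tips['tip_pct'], 
         edgecolor = 'black', 
         color = 'red', 
         alpha = 0.3)
plt.title('Distribution of Tip Percentage')
plt.xlabel('Tip (% of total bill)')
plt.ylabel('Count');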

In [None]:
tips.hist();
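
The effect of the `bins` argument is easier to see in the numbers behind the picture. The sketch below uses `np.histogram`, which computes the same counts and bin edges that `plt.hist` draws; the sample values are hypothetical and randomly generated.

```python
import numpy as np

# Hypothetical sample standing in for a continuous feature
rng = np.random.default_rng(0)
values = rng.normal(loc=15, scale=5, size=200)

# np.histogram returns the bar heights and the bin edges
counts_10, edges_10 = np.histogram(values, bins=10)
counts_50, edges_50 = np.histogram(values, bins=50)

# Every observation lands in exactly one bin, whatever the bin count
print(counts_10.sum(), counts_50.sum())

# There is always one more edge than there are bins
print(len(edges_10), len(edges_50))
```

More bins show finer detail but leave fewer observations per bar, so the shape gets noisier; fewer bins smooth the shape but can hide features such as a second peak.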

#### Boxplot

Like a histogram, a boxplot summarizes a single quantitative feature.  It marks the median and the quartiles, and draws unusually large or small values as individual points.

In [None]:
### boxplot of tip percentage
plt.boxplot(tips['tip_pct']);

In [None]:
### Make a horizontal version of the plot
plt.boxplot(tips['tip_pct'], vert = False);
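
The pieces of a boxplot can be computed directly. This is a minimal sketch on hypothetical values, assuming matplotlib's default rule that points more than 1.5 times the interquartile range beyond the box are drawn as outliers.

```python
import pandas as pd

# Hypothetical tip percentages, including one unusually large value
pcts = pd.Series([10, 12, 14, 15, 16, 17, 18, 20, 22, 45], dtype=float)

# The box spans the first to third quartile, with a line at the median
q1, median, q3 = pcts.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = pcts[(pcts < lower_fence) | (pcts > upper_fence)]
print(q1, median, q3, outliers.tolist())
```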

#### Bar Plot

A bar plot can be used to summarize a single categorical variable.  For example, if you want the counts of each unique category in a categorical feature. 

In [None]:
### counts of each party size
tips['size'].value_counts()

In [None]:
### barplot of counts
tips['size'].value_counts().plot(kind = 'bar')
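
By default `value_counts` orders categories from most to least frequent. When the categories have a natural order, as party sizes do, sorting by the index first may give a more readable bar plot. A small sketch with hypothetical party sizes:

```python
import pandas as pd

# Hypothetical party sizes standing in for a categorical column
sizes = pd.Series([2, 2, 3, 4, 2, 3, 2, 1, 4, 2])

# value_counts orders by frequency; sort_index restores the natural order
counts = sizes.value_counts().sort_index()
print(counts)

# One bar per category, with bar height equal to the count
ax = counts.plot(kind="bar")
ax.set_xlabel("party size")
ax.set_ylabel("count")
```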

#### Two Variable Plots

In [None]:
tips.head()

#### Scatterplot

Two continuous features can be compared using scatterplots.  Typically, one is interested in whether a relationship between the features exists and, if so, in its strength and direction.

In [None]:
### total bill vs. tip pct

In [None]:
### scatterplot of x vs. y
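
As a rough illustration of reading direction and strength from a scatterplot, the sketch below draws one and then computes the correlation coefficient. The data are hypothetical and simulated, not the tips dataset: tips are generated as about 15% of the bill plus noise.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical bills and tips: tips rise with the bill, plus noise
rng = np.random.default_rng(1)
bills = rng.uniform(10, 50, size=100)
tip_amounts = 0.15 * bills + rng.normal(0, 1, size=100)
df = pd.DataFrame({"total_bill": bills, "tip": tip_amounts})

# One point per row: x is the bill, y is the tip
fig, ax = plt.subplots()
ax.scatter(df["total_bill"], df["tip"], alpha=0.6)
ax.set_xlabel("total bill")
ax.set_ylabel("tip")

# The correlation coefficient summarizes direction (sign) and strength (size)
r = df["total_bill"].corr(df["tip"])
print(round(r, 2))
```

A value of `r` near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates little or none.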


#### `pandas.plotting`

There is no quick, easy plot in `matplotlib` that compares all numeric features in a dataset.  Instead, `pandas.plotting` has a `scatter_matrix` function that serves this purpose.

In [None]:
from pandas.plotting import scatter_matrix

In [None]:
### scatter matrix of tips data
scatter_matrix(tips);

In [None]:
### adding arguments and changing size
scatter_matrix(tips, diagonal = 'kde', figsize = (10, 10));

#### Subplots and Axes

![](https://matplotlib.org/stable/_images/users-explain-axes-index-1.2x.png)

In [None]:
### create a 1 row 2 column plot
fig, ax = plt.subplots(1, 2)

In [None]:
### add a plot to each axis
fig, ax = plt.subplots(1, 2)
ax[0].hist(tips['total_bill'])
ax[1].boxplot(tips['tip']);


In [None]:
### create a 2 x 2 grid of plots
### add histogram to bottom right plot
fig, ax = plt.subplots(2, 2, figsize = (10, 8))
ax[1, 1].hist(tips['total_bill']);


#### Summary

Great job!  We will get practice plotting in this week's homework and examine some other libraries and approaches during class next week.  For now, make sure you are familiar with the basic plots above -- histogram, boxplot, bar plot, scatterplot -- and when to use each.  