# 03 - Summarizing Voter List Data: Dealing with DateTime data types
<p class="lead">
Michelle Brown Notes v 1.7

In this notebook we show how to take a variable such as a birthdate, which may be stored with the wrong data type, and convert it to a proper date format for analysis. In notebook 02, the voter list dataset stored the birth year as an integer, so calculating age was simple subtraction. In this notebook we use a mock voter list that stores the birthdate in the more common day-month-year format. We review how to convert it and then calculate age as of a specified date (e.g., election day).

# Outline

<!-- MarkdownTOC autolink=true autoanchor=true bracket=round -->

- [Import Libraries](#imp)
- [Read in the file](#read1)
- [Handling dates and time](#date)
- [Calculating age](#age)
- [Another trick when reading in the data](#magic)

<!-- /MarkdownTOC -->

<a name="imp"></a>
# Importing libraries

Again we import the analysis module pandas under the alias 'pd' so we can use its associated methods. We also import some other libraries that we'll use later to make inline plots.

In [None]:
import pandas as pd
%matplotlib inline
from matplotlib import pyplot as plt
import numpy

<a name="read1"></a>
# Read in the file

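We read in the csv file, store it as a dataframe called df, and look at its shape (rows and columns). Although the file has a .csv extension, its fields are separated by tabs, which is why we pass delimiter="\t".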

In [None]:
df = pd.read_csv('data/mock_vldata_ddmmyyy_v1.csv', delimiter="\t")
df.shape

Look at the column names

In [None]:
df.columns

In [None]:
df.head()

<a name="date"></a>
# Handling dates and times

Let's take a look at how each of the variables is stored. This will give us their data types:

In [None]:
df.dtypes

The data type of the 'birthdate' variable is object (that is, text), even though the values are dates written as day month year.

If we try to create a new variable called Age by subtracting birthdate from 2017, we'll get an error:

In [None]:
df['Age'] = 2017 - df['birthdate']
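To see why the subtraction fails, here is a minimal sketch on a small made-up Series rather than the real voter list. The sample values and their slash-separated layout are assumptions for illustration only. Because the values are text, pandas cannot subtract them from a number and raises a TypeError.

```python
import pandas as pd

# Hypothetical sample values; the real 'birthdate' column may be laid out differently.
birthdate = pd.Series(['01/02/1980', '25/12/1975'], dtype=object)

try:
    age = 2017 - birthdate      # subtracting text from an integer
    failed = False
except TypeError as exc:
    failed = True
    print("Subtraction failed:", exc)
```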

<b>Converting to date (time) formats</b>
Before we can subtract dates to get an age, we have to convert our birthdate variable from the 'object' data type to a date format. Let's do that and store the result in a new variable called 'dob_formatted'. Note that adding errors='coerce' sets any invalid values to NaT ("not a time", the datetime equivalent of a missing value).
<br>The same method accepts extra settings (e.g., dayfirst or yearfirst) that tell pandas how to interpret the order of day, month, and year in text dates. Read more about the "to_datetime" method here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html

In [None]:
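# dayfirst=True because the dates are written day-month-year; errors='coerce' turns invalid values into NaT
df['dob_formatted'] = pd.to_datetime(df['birthdate'], dayfirst=True, errors='coerce')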

In [None]:
df['dob_formatted']
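The conversion can be tried on a few made-up values to check how ambiguous and invalid entries are handled. This is only a sketch: the sample strings and the `%d/%m/%Y` layout are assumptions, and the real file may use a different separator or order. Passing an explicit `format` is one way to remove any doubt about which number is the day.

```python
import pandas as pd

# Hypothetical day/month/year strings plus one invalid entry.
raw = pd.Series(['01/02/1980', '25/12/1975', 'unknown'], dtype=object)

# An explicit format removes the day/month ambiguity in '01/02/1980';
# errors='coerce' converts anything that does not match into NaT.
parsed = pd.to_datetime(raw, format='%d/%m/%Y', errors='coerce')

print(parsed)
print("Unparseable values:", parsed.isna().sum())
```

Counting the NaT values after converting is a quick way to see how many birthdates could not be read.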

Now if we look at the data types again, we'll see that our new variable 'dob_formatted' has a proper datetime data type:

In [None]:
df.dtypes

<a name="age"></a>
# Calculating Age
<b>Create the (election) date for calculating age</b> 
<br>Now we are ready to make an age variable. Below we'll use January 20th, 2017 as the pretend election date. But first we have to import the date class from Python's datetime module to help.

In [None]:
from datetime import date

In [None]:
d = date(2017, 1, 20)
print(d)

Create a new variable called Age that is the designated date minus the correctly formatted date of birth, expressed in whole years:

In [None]:
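# Timestamp(d) subtracts cleanly from a datetime column; days // 365.25 gives approximate whole years
df['Age'] = (pd.Timestamp(d) - df['dob_formatted']).dt.days // 365.25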

In [None]:
df['Age']
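Dividing elapsed time by an average year length is an approximation and can be off by one for people whose birthday falls near the reference date. A minimal sketch of a calendar-based alternative follows: subtract the years, then take one off if the birthday has not yet occurred. The sample dates and the variable names `ref`, `dob` and `had_birthday` are made up for illustration.

```python
import pandas as pd

ref = pd.Timestamp('2017-01-20')   # the pretend election date

# Hypothetical birthdates: one on the reference day, one a day later, one invalid.
dob = pd.to_datetime(
    pd.Series(['1980-01-20', '1980-01-21', 'not a date']),
    errors='coerce',
)

# True where the birthday has already happened in the reference year.
had_birthday = (dob.dt.month < ref.month) | (
    (dob.dt.month == ref.month) & (dob.dt.day <= ref.day)
)

# Year difference, minus one if the birthday is still to come; NaT rows stay missing.
age = ref.year - dob.dt.year - (~had_birthday).astype(int)
print(age)
```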

We are now in a position to compute some summary statistics and/or make some histograms of the Age variable, as we did in Notebook #02.
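As a rough sketch of what such a summary might look like, the example below uses a handful of made-up ages. The age bands are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical ages, including one missing value (e.g., from an invalid birthdate).
age = pd.Series([37.0, 36.0, 52.0, 19.0, None, 64.0], name='Age')

# describe() ignores missing values when computing count, mean, etc.
summary = age.describe()
print(summary)

# Group ages into bands: a text version of a histogram.
bands = pd.cut(age, bins=[17, 29, 44, 64, 120],
               labels=['18-29', '30-44', '45-64', '65+'])
counts = bands.value_counts().sort_index()
print(counts)
```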

<a name="magic"></a>
# Another trick: converting automatically when reading in the data
<br>There is a setting you can use when you read in the csv file to have pandas try to convert the data in the column you specify. Note that the "5" below refers to the position of the column that holds the date.

In [None]:
# dayfirst=True tells the parser the dates are day-month-year
dfcsv = pd.read_csv('data/mock_vldata_ddmmyyy_v1.csv', delimiter="\t", parse_dates=[5], dayfirst=True)

Look at the data types and see if the 6th column (index number 5, because indexes start at 0) was properly categorized as datetime:

In [None]:
dfcsv.dtypes

Excellent, pandas appears to have converted the column correctly, but let's take a quick look at the data to be sure:

In [None]:
dfcsv["birthdate"]
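The same idea can be tested without the data file by reading a small in-memory table. In this sketch the column names, the values and the slash-separated layout are all assumptions. Here `parse_dates` is given a column name instead of a position, and `dayfirst=True` makes the day-month order explicit.

```python
import io
import pandas as pd

# Hypothetical tab-separated data; the real file's columns may differ.
mock_tsv = (
    "voter_id\tbirthdate\n"
    "1\t01/02/1980\n"
    "2\t25/12/1975\n"
)

demo = pd.read_csv(io.StringIO(mock_tsv), delimiter="\t",
                   parse_dates=["birthdate"], dayfirst=True)

print(demo.dtypes)
# Checking the months is a quick test that day and month were not swapped.
print(demo["birthdate"].dt.month.tolist())
```

An ambiguous value such as 01/02/1980 is the one to look at: it should come out as February, not January.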

The variable is properly categorized. Now you could run the age calculations from above. <br>That's the end of this notebook. 