# Pandas II - Datetime Series

Pandas has proven very successful as a tool for working with time series data, especially in the financial data analysis space.

In working with time series data, we will frequently seek to:
- generate sequences of fixed-frequency dates and time spans
- conform or convert time series to a particular frequency
- compute “relative” dates based on various non-standard time increments (e.g. 5 business days before the last business day of the year), or “roll” dates forward or backward
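
The three tasks above can be sketched in a few lines. This is only an illustration: the dates, the `2D` target frequency, and the variable names are assumptions chosen for the example, and `BDay` counts weekdays only (it does not know about holidays).

```python
import pandas as pd
from pandas.tseries.offsets import BDay, BYearEnd

# 1. Generate a fixed-frequency sequence: six consecutive calendar days.
days = pd.date_range("2015-01-01", periods=6, freq="D")

# 2. Convert a daily series to a lower frequency (2-day bins, summed).
daily = pd.Series(range(6), index=days)
two_day = daily.resample("2D").sum()

# 3. Relative dates: 5 business days before the last business day of 2015.
last_bday = pd.Timestamp("2015-01-01") + BYearEnd()
five_before = last_bday - 5 * BDay()

print(two_day)
print(last_bday.date(), five_before.date())
```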

For details, see the pandas time series documentation: http://pandas.pydata.org/pandas-docs/stable/timeseries.html

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# use the fully qualified option name; the short form can be ambiguous in newer pandas
pd.set_option('display.max_columns', 50)

We'll be using the [MovieLens](http://www.grouplens.org/node/73) dataset in many examples going forward. The dataset contains 100,000 ratings made by 943 users on 1,682 movies.

In [2]:
# pass in column names for each CSV
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
df_users = pd.read_csv('data/MovieLens-100k/u.user', sep='|', names=u_cols)

r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
df_ratings = pd.read_csv('data/MovieLens-100k/u.data', sep='\t', names=r_cols)

m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
# only load the first five columns
df_movies = pd.read_csv('data/MovieLens-100k/u.item', sep='|', names=m_cols, usecols=range(5))

## Summary

1. [Inspect](#1.-Inspect)<br>
    a) .dtypes<br>
    b) .describe()<br>
    c) .head(), .tail(), [i:j]
2. [Select](#2.-Select)<br>
    a) Column Selection<br>
    b) Row Selection<br>
3. [Sort](#3.-Sort)<br>
    a) .sort() for DataFrames<br>
    b) .order() for Series<br>
4. [Bins](#4.-Bins)
5. [Merge](#5.-Merge)<br>
    a) Inner Join (default)<br>
    b) Left Outer Join<br>
    c) Right Outer Join<br>
    d) Full Outer Join<br>
6. [Concatenate](#6.-Concatenate)
7. [Split-Apply-Combine](#7.-Split-Apply-Combine)

## 1. Inspect

Pandas has a variety of functions for getting basic information about your DataFrame.<br>
The most basic is **calling your DataFrame by name**. For `df_movies`, the output tells us a few things about the DataFrame.

1. It's an instance of a DataFrame.
2. Each row is assigned an index of 0 to N-1, where N is the number of rows in the DataFrame. (The index can also be set to arbitrary values.)
3. There are 1,682 rows (every row must have an index).
4. Our dataset has five total columns, one of which isn't populated at all (video_release_date) and two that are missing some values (release_date and imdb_url).
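
Recent pandas versions no longer print this summary when a DataFrame is displayed by name, but `.info()`, `.shape`, and `.count()` expose the same facts. The sketch below uses a hypothetical three-row stand-in for `df_movies` (the titles and dates are made up) so that it runs without the data files.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_movies; values are illustrative only.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "release_date": ["01-Jan-1995", None, "01-Jan-1996"],
    "video_release_date": [np.nan, np.nan, np.nan],
})

movies.info()                  # prints index range, dtypes, and non-null counts
n_rows, n_cols = movies.shape  # number of rows and columns
non_null = movies.count()      # non-null values per column
print(non_null)
```

A column whose non-null count is 0 is entirely unpopulated, and one whose count is below the row count is missing some values.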

### a)  `.dtypes`
Use the `.dtypes` attribute to get the datatype for each column.

In [4]:
# print() as a function, so the cell also runs on Python 3
print(df_movies.dtypes, '\n')

print(df_users.dtypes, '\n')

print(df_ratings.dtypes, '\n')

movie_id                int64
title                  object
release_date           object
video_release_date    float64
imdb_url               object
dtype: object 

user_id        int64
age            int64
sex           object
occupation    object
zip_code      object
dtype: object 

user_id           int64
movie_id          int64
rating            int64
unix_timestamp    int64
dtype: object 
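
Note that `unix_timestamp` is stored as `int64` and `release_date` as `object` (plain strings), so neither is yet a datetime column. A minimal sketch of converting both with `pd.to_datetime` follows. The two-row frames are hypothetical stand-ins for `df_ratings` and `df_movies`, and the `%d-%b-%Y` format is an assumption about how the release dates are written.

```python
import pandas as pd

# Hypothetical stand-ins for df_ratings and df_movies; values are illustrative.
ratings = pd.DataFrame({"unix_timestamp": [881250949, 881337349]})
movies = pd.DataFrame({"release_date": ["01-Jan-1995", "14-Feb-1996"]})

# Integer seconds since the Unix epoch -> datetime values.
ratings["rated_at"] = pd.to_datetime(ratings["unix_timestamp"], unit="s")

# Day-month-year strings -> datetime values (format is an assumption).
movies["release_date"] = pd.to_datetime(movies["release_date"], format="%d-%b-%Y")

print(ratings.dtypes)
print(movies.dtypes)
```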



### b) `.describe()`
Use the `.describe()` method to see the basic statistics about the DataFrame's **numeric columns**. Be careful though, since this will return information on **all** columns of a numeric datatype.

In [5]:
df_users.describe()

          user_id         age
count  943.000000  943.000000
mean   472.000000   34.051962
std    272.364951   12.192740
min      1.000000    7.000000
25%    236.500000   25.000000
50%    472.000000   31.000000
75%    707.500000   43.000000
max    943.000000   73.000000


Notice `user_id` was included since it's numeric. Since this is an ID value, the stats for it don't really matter.

We can quickly see the average age of our users is just above 34 years old, with the youngest being 7 and the oldest being 73. The median age is 31, with the youngest quartile of users being 25 or younger, and the oldest quartile being at least 43.
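
One way to keep ID columns out of the summary is to drop them before calling `.describe()`. The sketch below is an illustration, not part of the original notebook: `users` is a hypothetical five-row stand-in for `df_users`, and calling `.describe()` on a text column is shown as an optional extra.

```python
import pandas as pd

# Hypothetical five-row stand-in for df_users.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "age": [24, 53, 23, 24, 33],
    "occupation": ["technician", "other", "writer", "technician", "other"],
})

# Drop the ID column so only meaningful numeric columns are summarized.
age_stats = users.drop(columns="user_id").describe()

# A text column reports count, unique, top, and freq instead.
occupation_stats = users["occupation"].describe()

print(age_stats)
print(occupation_stats)
```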

### c) `.head()`, `.tail()`, `[i:j]`
By default, **`.head()`** displays the first five records of the DataFrame, while **`.tail()`** displays the last five.<br>
Alternatively, Python's regular slicing **`[i:j]`** syntax works as well.

In [8]:
print(df_users.head())

   user_id  age sex  occupation zip_code
0        1   24   M  technician    85711
1        2   53   F       other    94043
2        3   23   M      writer    32067
3        4   24   M  technician    43537
4        5   33   F       other    15213
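
All three approaches accept a row count or range. A short sketch, using a hypothetical ten-row frame (the ages are made up) in place of `df_users`:

```python
import pandas as pd

# Hypothetical ten-row stand-in for df_users.
users = pd.DataFrame({
    "user_id": range(1, 11),
    "age": [24, 53, 23, 24, 33, 42, 57, 36, 29, 53],
})

first_three = users.head(3)  # first 3 rows instead of the default 5
last_two = users.tail(2)     # last 2 rows
middle = users[2:5]          # rows at positions 2, 3, and 4 (end is exclusive)

print(first_three)
print(last_two)
print(middle)
```

As with Python lists, the slice `[i:j]` is positional and excludes row `j`.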
