# Introduction to Pandas

- It is a python [3rd party library](https://pandas.pydata.org)
- Used for data analysis and visualization
- Part of Anaconda python distribution
- Best used with Jupyter notebook, can be used with regular python programs
- Main feature is the Data Frame

In [1]:
# Load the pandas library to let python know you will use it
import pandas as pd

# What is a Data Frame?
- Its a data structure, like lists and dictionaries
- Consists of rows and columns, similar to SQL tables and excel spreadsheets
  - Columns are attributes or variables
  - Rows are records or single observations
- Operations are typically performed on columns

# Loading data into a data frame

- Data is usually loaded from an external source, like a csv or excel file.
- Download the weather data set from [vega-dataset](https://raw.githubusercontent.com/vega/vega-datasets/gh-pages/data/weather.csv) (**right click and save as**)
- Place it in the same directory as the jupyter notbook you are working on

In [5]:
# load the data using pandas library
# do you remember what was pd?
pd.read_csv("weather.csv")

# Jupter notebook tip:
# type: pd.
# then hit tab, see what happens
# try also: pd.read_ (then hit tab)

Unnamed: 0,location,date,precipitation,temp_max,temp_min,wind,weather
0,Seattle,2012-01-01 00:00,0.0,12.8,5.0,4.7,drizzle
1,Seattle,2012-01-02 00:00,10.9,10.6,2.8,4.5,rain
2,Seattle,2012-01-03 00:00,0.8,11.7,7.2,2.3,rain
3,Seattle,2012-01-04 00:00,20.3,12.2,5.6,4.7,rain
4,Seattle,2012-01-05 00:00,1.3,8.9,2.8,6.1,rain
5,Seattle,2012-01-06 00:00,2.5,4.4,2.2,2.2,rain
6,Seattle,2012-01-07 00:00,0.0,7.2,2.8,2.3,rain
7,Seattle,2012-01-08 00:00,0.0,10.0,2.8,2.0,sun
8,Seattle,2012-01-09 00:00,4.3,9.4,5.0,3.4,rain
9,Seattle,2012-01-10 00:00,1.0,6.1,0.6,3.4,rain


# How to work with the data?

- You must place it in a variable so you can refer to it
- The current data was displayed and not assigned to a variable, so you cannot use it
- Assign it to a variable named **my_df**

In [6]:
my_df = pd.read_csv("weather.csv")

# Let us discover how the data looks like

In [9]:
my_df.head()

Unnamed: 0,location,date,precipitation,temp_max,temp_min,wind,weather
0,Seattle,2012-01-01 00:00,0.0,12.8,5.0,4.7,drizzle
1,Seattle,2012-01-02 00:00,10.9,10.6,2.8,4.5,rain
2,Seattle,2012-01-03 00:00,0.8,11.7,7.2,2.3,rain
3,Seattle,2012-01-04 00:00,20.3,12.2,5.6,4.7,rain
4,Seattle,2012-01-05 00:00,1.3,8.9,2.8,6.1,rain


In [None]:
# You can pass a number in the head() method to show more data
# show 10 items


In [13]:
# To know which columns are available use the columns attribute
my_df.columns

Index(['location', 'date', 'precipitation', 'temp_max', 'temp_min', 'wind',
       'weather'],
      dtype='object')

# Data types

- Each **column** will have its own data type
- Remember, variables will be in columns
- Observations in rows
- Use dtypes attribute of to discover columns and datatypes
  - **OOP**: What is the difference between a *function*, a *method*, an *attribute*, and a *variable*?

In [18]:
my_df.dtypes

location          object
date              object
precipitation    float64
temp_max         float64
temp_min         float64
wind             float64
weather           object
dtype: object

In [17]:
# Pandas uses data types provided by numpy
# load numpy
import numpy as np

# convert the column to datetime
my_df.date.astype(np.datetime64)

0      2011-12-31 21:00:00
1      2012-01-01 21:00:00
2      2012-01-02 21:00:00
3      2012-01-03 21:00:00
4      2012-01-04 21:00:00
5      2012-01-05 21:00:00
6      2012-01-06 21:00:00
7      2012-01-07 21:00:00
8      2012-01-08 21:00:00
9      2012-01-09 21:00:00
10     2012-01-10 21:00:00
11     2012-01-11 21:00:00
12     2012-01-12 21:00:00
13     2012-01-13 21:00:00
14     2012-01-14 21:00:00
15     2012-01-15 21:00:00
16     2012-01-16 21:00:00
17     2012-01-17 21:00:00
18     2012-01-18 21:00:00
19     2012-01-19 21:00:00
20     2012-01-20 21:00:00
21     2012-01-21 21:00:00
22     2012-01-22 21:00:00
23     2012-01-23 21:00:00
24     2012-01-24 21:00:00
25     2012-01-25 21:00:00
26     2012-01-26 21:00:00
27     2012-01-27 21:00:00
28     2012-01-28 21:00:00
29     2012-01-29 21:00:00
               ...        
2892   2015-12-01 21:00:00
2893   2015-12-02 21:00:00
2894   2015-12-03 21:00:00
2895   2015-12-04 21:00:00
2896   2015-12-05 21:00:00
2897   2015-12-06 21:00:00
2

In [19]:
# just like the data frame, the command creates a copy
# but does not store it
# We need to replace the old date column with the new one
my_df.date = my_df.date.astype(np.datetime64)

In [21]:
# check the types
my_df.dtypes

location                 object
date             datetime64[ns]
precipitation           float64
temp_max                float64
temp_min                float64
wind                    float64
weather                  object
dtype: object

# Why convert an object column into a date column?
- As you will find out later, pandas can do more fancy things if it knows the column is a date
- For example:
 - Sort
 - Filter based on date range
 - Date arethmatic
- Always make sure date/time columns have the correct data type

In [None]:
# quick discovery methods here like describe and plot