<p style="font-size:3.75em; font-weight:bold; text-align:center"><br>Pandas!</p>

**Pandas** is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python.

**Pandas** builds upon **numpy** and **scipy**, providing easy-to-use data structures and data manipulation functions with integrated indexing.

The main data structures **pandas** provides are **Series** and **DataFrames**. After a brief introduction to these two data structures and to data ingestion, this notebook covers the following key features of **pandas**:

* Generating descriptive statistics on data
* Data cleaning using built-in pandas functions
* Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data
* Merging multiple datasets using DataFrames
* Working with timestamps and series data

**Additional Recommended Resources:**

* **Pandas** documentation: http://pandas.pydata.org/pandas-docs/stable/
* Python for Data Analysis by Wes McKinney
* Python Data Science Handbook by Jake VanderPlas




# Import Libraries

In [1]:
import pandas as pd

# Introduction to pandas Data Structures

**Pandas** has two main data structures: **Series** and **DataFrames**.

# Pandas Series

A **pandas** Series is a one-dimensional labeled array.

In [42]:
ser = pd.Series(data=[100, 200, 300, 400, 500], index=['tom', 'bob', 'nancy', 'dan', 'eric'])
# The data= and index= keywords are optional; the same values can be
# passed positionally (data first, then index)

In [6]:
ser
# The data (the numbers we put in) is the column on the right
# The index we defined gives a name to each row of the data

tom      100
bob      200
nancy    300
dan      400
eric     500
dtype: int64

In [7]:
ser.index

Index(['tom', 'bob', 'nancy', 'dan', 'eric'], dtype='object')
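A Series can be built in a few equivalent ways. The sketch below, which assumes only that `pandas` is installed, compares keyword arguments, positional arguments, and a dictionary whose keys become the index labels. The variable names are illustrative and are not part of the original notebook.

```python
import pandas as pd

# Keyword arguments, as in the cell above
keyword = pd.Series(data=[100, 200, 300], index=['tom', 'bob', 'nancy'])

# Positional arguments: data first, then index
positional = pd.Series([100, 200, 300], ['tom', 'bob', 'nancy'])

# A dictionary: keys become index labels, values become the data
from_mapping = pd.Series({'tom': 100, 'bob': 200, 'nancy': 300})

# All three hold the same values under the same labels
print(keyword.equals(positional))    # True
print(keyword.equals(from_mapping))  # True
```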

In [40]:
# if no index is explicitly given, integers will be used,
# starting at 0
no_index = pd.Series([100, 200, 300, 400, 500])
no_index

0    100
1    200
2    300
3    400
4    500
dtype: int64

In [13]:
no_index.index
# RangeIndex is used instead

RangeIndex(start=0, stop=5, step=1)

In [38]:
ser = pd.Series(data=[100, 'foo', 300, 'foo', 500], index=['tom', 'bob', 'nancy', 'dan', 'eric'])
# The values in a Series do not all have to be the same type
ser
# note that the dtype changes from int64 in the previous example
# to object

tom      100
bob      foo
nancy    300
dan      foo
eric     500
dtype: object
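The dtype shown at the bottom of a Series is inferred from its values. As a small illustration (the variable names here are made up for this sketch), all-integer data gets an integer dtype, a mix of integers and floats is promoted to `float64`, and a mix of numbers and strings falls back to the generic `object` dtype.

```python
import pandas as pd

ints = pd.Series([100, 200, 300])      # all integers
floats = pd.Series([100, 2.5, 300])    # integers and floats
mixed = pd.Series([100, 'foo', 300])   # numbers and strings

print(ints.dtype)    # an integer dtype, typically int64
print(floats.dtype)  # float64
print(mixed.dtype)   # object
```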

In [19]:
# Data can be accessed like a dictionary
print(ser['nancy'])
print(ser['tom'])

300
100


In [22]:
# ser.loc[label] will return the same data
print(ser.loc['nancy'])

300


In [30]:
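# multiple positions can be accessed by passing a list of integer
# positions to .iloc (relying on plain ser[[4, 3, 1]] for positions
# is deprecated in recent pandas versions when the index holds labels)
print(ser.iloc[[4, 3, 1]])
# Series positions also start at 0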

eric    500
dan     foo
bob     foo
dtype: object


In [31]:
# index names also work
print(ser[['nancy','eric','tom']])

nancy    300
eric     500
tom      100
dtype: object
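The two access styles above can be made explicit with `.loc` (labels) and `.iloc` (integer positions). The following is a minimal sketch using the all-integer Series from earlier; it also shows one difference worth knowing: a label slice includes its end point, while a positional slice does not.

```python
import pandas as pd

ser = pd.Series([100, 200, 300, 400, 500],
                index=['tom', 'bob', 'nancy', 'dan', 'eric'])

# The same two rows, selected by label and by position
by_label = ser.loc[['nancy', 'eric']]
by_position = ser.iloc[[2, 4]]
print(by_label.equals(by_position))  # True

# Label slices include the end label; positional slices exclude the end
label_slice = ser.loc['tom':'nancy'].tolist()
position_slice = ser.iloc[0:2].tolist()
print(label_slice)     # [100, 200, 300]
print(position_slice)  # [100, 200]
```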


In [36]:
# Python operations work on Series! The in operator checks the index labels
'bob' in ser

True
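Note that `in` on a Series tests the index labels, not the stored values, just as `in` on a dictionary tests its keys. A short sketch of the difference, with illustrative variable names:

```python
import pandas as pd

ser = pd.Series([100, 200, 300], index=['tom', 'bob', 'nancy'])

label_found = 'bob' in ser           # 'bob' is an index label
value_as_label = 200 in ser          # 200 is a value, not a label
value_found = 200 in ser.values      # search the underlying values
value_found_isin = ser.isin([200]).any()  # the same check using pandas

print(label_found, value_as_label, value_found, value_found_isin)
# True False True True
```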

In [37]:
ser * 2
# example with only numbers (ser as first defined, all integers)

tom       200
bob       400
nancy     600
dan       800
eric     1000
dtype: int64

In [39]:
ser * 2
# example with strings and numbers: numbers are doubled,
# strings are repeated

tom         200
bob      foofoo
nancy       600
dan      foofoo
eric       1000
dtype: object
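If a Series like this was meant to be numeric, one option is `pd.to_numeric` with `errors='coerce'`, which turns values that cannot be parsed as numbers into NaN. This is a sketch of one possible cleanup step, not something the original notebook does; the variable names are illustrative.

```python
import pandas as pd

mixed = pd.Series([100, 'foo', 300, 'foo', 500],
                  index=['tom', 'bob', 'nancy', 'dan', 'eric'])

# Values that cannot be converted to numbers become NaN
numeric = pd.to_numeric(mixed, errors='coerce')

print(numeric.dtype)  # float64
print(numeric * 2)    # arithmetic is now purely numeric; NaN stays NaN
```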

In [45]:
# values in a Series can be changed through their index label, like a dict
ser['bob'] = 200
ser['dan'] = 400
ser ** 2

tom       10000
bob       40000
nancy     90000
dan      160000
eric     250000
dtype: int64

In [46]:
# can run operations on specific index labels as well
ser[['nancy', 'eric']] ** 2

nancy     90000
eric     250000
dtype: int64

# pandas DataFrame

A **pandas DataFrame** is a 2-dimensional labeled data structure.

## Create DataFrame from a dictionary of pandas Series

In [65]:
# Pandas DataFrames can be created by using a dictionary of Series
d = {'one': pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two': pd.Series([111., 222., 333., 444.], index=['apple', 'ball', 'cerill', 'dancy'])}
# If an index is given, its length must match the number of
# values in that Series

# Two columns will be generated, one per dictionary entry

In [64]:
df = pd.DataFrame(d)
df
# In a Jupyter notebook, ending a cell with the variable name
# (instead of using print) displays the DataFrame as a nicely
# formatted table

# Note below that when data is missing for a row-column pair,
# it is NaN
# NaN ("not a number") is neither 0 nor None; it marks a missing value

          one    two
apple   100.0  111.0
ball    200.0  222.0
cerill    NaN  333.0
clock   300.0    NaN
dancy     NaN  444.0
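Because NaN is not equal to anything, including itself, missing values cannot be found with `==`. A minimal sketch of how they can be detected instead, using `isna()` on the same data as above (the use of `numpy` here is only to show the comparison behaviour):

```python
import numpy as np
import pandas as pd

d = {'one': pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two': pd.Series([111., 222., 333., 444.],
                      index=['apple', 'ball', 'cerill', 'dancy'])}
df = pd.DataFrame(d)

# NaN never compares equal, not even to itself
print(np.nan == np.nan)  # False

# isna() marks missing cells; summing counts them per column
missing_per_column = df.isna().sum()
print(missing_per_column)
```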


In [67]:
df.index
# DataFrame.index returns all row labels of the
# DataFrame as a single Index object

Index(['apple', 'ball', 'cerill', 'clock', 'dancy'], dtype='object')

In [70]:
df.columns
# Same for column headers

Index(['one', 'two'], dtype='object')

In [73]:
# Passing index= to the constructor keeps only the listed row labels
# The order doesn't need to match the originally created order
pd.DataFrame(d, index=['dancy', 'ball', 'apple'])

         one    two
dancy    NaN  444.0
ball   200.0  222.0
apple  100.0  111.0


In [76]:
# Columns can be selected the same way with columns=
# Column names that don't exist in the data can also be requested
pd.DataFrame(d, index=['dancy', 'ball', 'apple'],
             columns=['two', 'five'])

# If the data doesn't exist, the values will be NaN (missing)

         two  five
dancy  444.0   NaN
ball   222.0   NaN
apple  111.0   NaN
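The cells above select rows and columns while constructing a new DataFrame from the dictionary `d`. When a DataFrame already exists, `reindex` gives a comparable result. This is offered as an alternative sketch rather than what the notebook itself uses:

```python
import pandas as pd

d = {'one': pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two': pd.Series([111., 222., 333., 444.],
                      index=['apple', 'ball', 'cerill', 'dancy'])}
df = pd.DataFrame(d)

# Keep three rows in a new order and request one unknown column
subset = df.reindex(index=['dancy', 'ball', 'apple'],
                    columns=['two', 'five'])
print(subset)  # column 'five' has no data, so it is filled with NaN
```

`reindex` returns a new object, so `df` itself keeps all of its rows and columns.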


## Create DataFrame from a list of Python dictionaries

In [77]:
data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]

In [79]:
from_dict = pd.DataFrame(data)
from_dict

   alex  alice  dora  ema  joe
0   1.0    NaN   NaN  NaN  2.0
1   NaN   20.0  10.0  5.0  NaN


In [83]:
# Row labels can be supplied with index= instead of the default 0, 1, ...
from_dict = pd.DataFrame(data, index=['orange', 'red'])
from_dict

        alex  alice  dora  ema  joe
orange   1.0    NaN   NaN  NaN  2.0
red      NaN   20.0  10.0  5.0  NaN
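The columns in the output above appear in alphabetical order. Depending on the pandas version, a DataFrame built from a list of dictionaries may instead keep the keys in the order they first appear. If alphabetical columns are wanted regardless of version, one way (a sketch, with an illustrative variable name) is to sort the column labels explicitly:

```python
import pandas as pd

data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]
from_dict = pd.DataFrame(data, index=['orange', 'red'])

# Sort along axis=1 (the columns) so the order does not depend on
# how the constructor arranged the keys
sorted_columns = from_dict.sort_index(axis=1)
print(list(sorted_columns.columns))
# ['alex', 'alice', 'dora', 'ema', 'joe']
```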


# Basic DataFrame Operations

In [85]:
# Using the DataFrame created earlier: df
df

          one    two
apple   100.0  111.0
ball    200.0  222.0
cerill    NaN  333.0
clock   300.0    NaN
dancy     NaN  444.0


In [86]:
# View individual columns
df['one']

apple     100.0
ball      200.0
cerill      NaN
clock     300.0
dancy       NaN
Name: one, dtype: float64

In [97]:
# It is possible to create and add new columns
# This works much like adding a new key to a dictionary
df['three'] = df['one'] * df['two']
df
# Note that if either value in a row is NaN,
# then column 'three' will also be NaN for that row

          one    two    three
apple   100.0  111.0  11100.0
ball    200.0  222.0  44400.0
cerill    NaN  333.0      NaN
clock   300.0    NaN      NaN
dancy     NaN  444.0      NaN
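When NaN propagation is not the desired behaviour, the method form of the operator accepts a `fill_value` that stands in for a missing operand. The sketch below rebuilds the same two columns directly (using `numpy` for the NaN values) and uses `1` as the stand-in, which is an arbitrary choice made for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [100., 200., np.nan, 300., np.nan],
                   'two': [111., 222., 333., np.nan, 444.]},
                  index=['apple', 'ball', 'cerill', 'clock', 'dancy'])

# The * operator propagates NaN
product = df['one'] * df['two']

# .mul with fill_value treats a missing operand as 1
filled_product = df['one'].mul(df['two'], fill_value=1)

print(product.isna().sum())     # 3
print(filled_product.tolist())  # [11100.0, 44400.0, 333.0, 300.0, 444.0]
```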


In [98]:
# A column of Boolean values can be created from a comparison
# (comparisons involving NaN evaluate to False)
df['flag'] = df['one'] > 250
df

          one    two    three   flag
apple   100.0  111.0  11100.0  False
ball    200.0  222.0  44400.0  False
cerill    NaN  333.0      NaN  False
clock   300.0    NaN      NaN   True
dancy     NaN  444.0      NaN  False
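The same Boolean Series can be used as a mask to filter rows instead of being stored as a column. A minimal sketch, rebuilding column `one` directly so the example is self-contained:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [100., 200., np.nan, 300., np.nan]},
                  index=['apple', 'ball', 'cerill', 'clock', 'dancy'])

# True where 'one' exceeds 250; NaN rows compare as False
mask = df['one'] > 250

# Indexing with the mask keeps only the rows where it is True
large = df[mask]
print(large)
```

Only the `clock` row survives, since it is the only one whose value is greater than 250.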


In [99]:
# You can remove a column using .pop(column), which returns it as a Series
three = df.pop('three')

In [101]:
print(three)
print()
df
# Note that df no longer has column 'three'

apple     11100.0
ball      44400.0
cerill        NaN
clock         NaN
dancy         NaN
Name: three, dtype: float64



          one    two   flag
apple   100.0  111.0  False
ball    200.0  222.0  False
cerill    NaN  333.0  False
clock   300.0    NaN   True
dancy     NaN  444.0  False


In [102]:
# You can also use del to delete a column
del df['two']

In [103]:
df

          one   flag
apple   100.0  False
ball    200.0  False
cerill    NaN  False
clock   300.0   True
dancy     NaN  False
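Both `pop` and `del` change the DataFrame in place. When the original should stay untouched, `drop` is a non-mutating alternative that returns a new DataFrame. The small frame below is a cut-down stand-in for `df`, used only for this sketch:

```python
import pandas as pd

df = pd.DataFrame({'one': [100., 200.],
                   'two': [111., 222.],
                   'flag': [False, False]},
                  index=['apple', 'ball'])

# drop returns a new DataFrame without the listed columns
trimmed = df.drop(columns=['two'])

print(list(trimmed.columns))  # ['one', 'flag']
print(list(df.columns))       # ['one', 'two', 'flag'], df is unchanged
```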


In [107]:
# You can insert new columns using .insert(position, col_name, data)
df.insert(2, 'copy_of_one', df['one'])

In [108]:
df

          one   flag  copy_of_one
apple   100.0  False        100.0
ball    200.0  False        200.0
cerill    NaN  False          NaN
clock   300.0   True        300.0
dancy     NaN  False          NaN


In [109]:
# Create a new column named 'one_upper_half'
# Its data is the first two rows (positions 0 and 1) of column 'one';
# assignment aligns on the index, so the remaining rows become NaN
df['one_upper_half'] = df['one'][:2]

In [110]:
df

          one   flag  copy_of_one  one_upper_half
apple   100.0  False        100.0           100.0
ball    200.0  False        200.0           200.0
cerill    NaN  False          NaN             NaN
clock   300.0   True        300.0             NaN
dancy     NaN  False          NaN             NaN
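The NaN values in `one_upper_half` come from index alignment: when a Series is assigned to a column, pandas matches values by label, not by position, and fills unmatched rows with NaN. The sketch below contrasts this with assigning a plain list, which is positional and must match the number of rows. The column and variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'one': [100., 200., 300.]},
                  index=['apple', 'ball', 'clock'])

# A Series is aligned by label: order does not matter, and
# 'ball' has no matching label so it becomes NaN
shuffled = pd.Series([3., 1.], index=['clock', 'apple'])
df['from_series'] = shuffled

# A list is assigned by position and must have one value per row
df['from_list'] = [1., 2., 3.]

print(df)
```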
