# Pandas Basics

A high-level overview of the [Pandas](https://pandas.pydata.org) library.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

plt.style.use('fivethirtyeight')
sns.set_context("notebook")

## Reading in DataFrames from Files

Pandas has a number of very useful file-reading tools. You can see them enumerated by typing `pd.re` and pressing Tab. We'll be using `read_csv` today.

In [None]:
elections = pd.read_csv("https://busan302.mycourses.work/data/elections.csv")
elections # if we end a cell with an expression or variable name, the result will print

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
...,...,...,...,...,...,...
173,2016,Donald Trump,Republican,62984828,win,46.407862
174,2016,Evan McMullin,Independent,732273,loss,0.539546
175,2016,Gary Johnson,Libertarian,4489235,loss,3.307714
176,2016,Hillary Clinton,Democratic,65853514,loss,48.521539
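
As a small offline illustration of `read_csv`, the sketch below parses CSV text held in memory instead of fetching a URL. The names `csv_text` and `toy` are hypothetical, and the three rows are copied from the table above.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk or a URL.
csv_text = """Year,Candidate,Result
1824,Andrew Jackson,loss
1824,John Quincy Adams,win
1828,Andrew Jackson,win
"""

# read_csv accepts a path, a URL, or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)    # (number of rows, number of columns)
print(toy.dtypes)   # column types inferred from the text
```

The header row becomes the column names, and a default integer index is created for the rows.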


We can use the `head` method to return only the first few rows of a DataFrame.

In [None]:
elections.head(10)

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
5,1832,Henry Clay,National Republican,484205,loss,37.603628
6,1832,William Wirt,Anti-Masonic,100715,loss,7.821583
7,1836,Hugh Lawson White,Whig,146109,loss,10.005985
8,1836,Martin Van Buren,Democratic,763291,win,52.272472
9,1836,William Henry Harrison,Whig,550816,loss,37.721543


There is also a `tail` method, which returns the last few rows.

In [None]:
elections.tail(7)

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
171,2012,Mitt Romney,Republican,60933504,loss,47.384076
172,2016,Darrell Castle,Constitution,203091,loss,0.14964
173,2016,Donald Trump,Republican,62984828,win,46.407862
174,2016,Evan McMullin,Independent,732273,loss,0.539546
175,2016,Gary Johnson,Libertarian,4489235,loss,3.307714
176,2016,Hillary Clinton,Democratic,65853514,loss,48.521539
177,2016,Jill Stein,Green,1457226,loss,1.073699
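
A minimal sketch of `head` and `tail` on a toy frame (the frame `toy` and its column `n` are made up for illustration). When no argument is given, both methods return five rows.

```python
import pandas as pd

# Toy frame with 20 rows; the values are just 0..19.
toy = pd.DataFrame({"n": range(20)})

first_rows = toy.head()    # no argument: the first 5 rows
last_rows = toy.tail(3)    # the last 3 rows

print(first_rows)
print(last_rows)
```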


## The [] Operator

In [None]:
elections["Candidate"].head(6)

0       Andrew Jackson
1    John Quincy Adams
2       Andrew Jackson
3    John Quincy Adams
4       Andrew Jackson
5           Henry Clay
Name: Candidate, dtype: object

The `[]` operator also accepts a list of strings. In this case, you get back a DataFrame containing the requested columns.

In [None]:
elections[["Candidate", "Party"]].head()

Unnamed: 0,Candidate,Party
0,Andrew Jackson,Democratic-Republican
1,John Quincy Adams,Democratic-Republican
2,Andrew Jackson,Democratic
3,John Quincy Adams,National Republican
4,Andrew Jackson,Democratic


The [] operator also accepts numerical slices as arguments. In this case, we are indexing by row, not column!

In [None]:
elections[0:3]

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
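
The three behaviors of `[]` can be compared side by side. This is a sketch on a toy frame whose rows are copied from the table above; the variable names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [1824, 1824, 1828],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson"],
    "Party": ["Democratic-Republican", "Democratic-Republican", "Democratic"],
})

one_column = toy["Candidate"]               # single string -> Series
two_columns = toy[["Candidate", "Party"]]   # list of strings -> DataFrame
first_rows = toy[0:2]                       # numeric slice -> rows, not columns

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(first_rows.shape)
```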


## Boolean Array Selection

The `[]` operator also supports an array of booleans as input. In this case, the array must have exactly as many entries as the DataFrame has rows. The result is a filtered version of the DataFrame, in which only the rows corresponding to `True` appear.

In [None]:
iswin = elections['Result'] == 'win'
iswin

0      False
1       True
2       True
3      False
4       True
       ...  
173     True
174    False
175    False
176    False
177    False
Name: Result, Length: 178, dtype: bool

The output of the logical operator applied to the Series is another Series with the same name and index, but of datatype boolean. The entry at row #i represents the result of the application of that operator to the entry of the original Series at row #i.

Such a boolean Series can be used as an argument to the `[]` operator. For example, the following code creates a DataFrame of all election winners.

In [None]:
elections[iswin]

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
8,1836,Martin Van Buren,Democratic,763291,win,52.272472
11,1840,William Henry Harrison,Whig,1275583,win,53.051213
13,1844,James Polk,Democratic,1339570,win,50.749477
16,1848,Zachary Taylor,Whig,1360235,win,47.309296
17,1852,Franklin Pierce,Democratic,1605943,win,51.013168
20,1856,James Buchanan,Democratic,1835140,win,45.30608
23,1860,Abraham Lincoln,Republican,1855993,win,39.699408
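
Boolean Series can also be combined element-wise with `&` (and), `|` (or), and `~` (not) before being passed to `[]`. The sketch below restricts winners to elections from 1980 onward. It uses a small hand-built frame, and the names `toy`, `is_win`, and `since_1980` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [1976, 1976, 1980, 1980, 1984],
    "Candidate": ["Jimmy Carter", "Gerald Ford", "Ronald Reagan",
                  "Jimmy Carter", "Ronald Reagan"],
    "Result": ["win", "loss", "win", "loss", "win"],
})

is_win = toy["Result"] == "win"    # boolean Series, one entry per row
since_1980 = toy["Year"] >= 1980   # another boolean Series

# Parenthesize each comparison if you write the conditions inline,
# because & binds more tightly than == and >=.
recent_winners = toy[is_win & since_1980]
print(recent_winners)
```

Note that the filtered frame keeps the original row labels, so its index is no longer consecutive.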


## Label-based access with `loc`

In [None]:
elections.loc[[0, 1, 2, 3, 4], ['Candidate','Party', 'Year']]

Unnamed: 0,Candidate,Party,Year
0,Andrew Jackson,Democratic-Republican,1824
1,John Quincy Adams,Democratic-Republican,1824
2,Andrew Jackson,Democratic,1828
3,John Quincy Adams,National Republican,1828
4,Andrew Jackson,Democratic,1832


Loc also supports slicing (for all types, including numeric and string labels!). Note that the slicing for loc is **inclusive**, even for numeric slices.

In [None]:
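# Column slices follow the DataFrame's column order, so the start label
# must appear before the stop label ('Year' comes first in this dataset).
elections.loc[0:4, 'Year':'Party']

Unnamed: 0,Year,Candidate,Party
0,1824,Andrew Jackson,Democratic-Republican
1,1824,John Quincy Adams,Democratic-Republican
2,1828,Andrew Jackson,Democratic
3,1828,John Quincy Adams,National Republican
4,1832,Andrew Jackson,Democratic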


If we provide only a single label for the column argument, we get back a Series.

In [None]:
elections.loc[0:4, 'Candidate']

0       Andrew Jackson
1    John Quincy Adams
2       Andrew Jackson
3    John Quincy Adams
4       Andrew Jackson
Name: Candidate, dtype: object

If we want a DataFrame instead and don't want to use `to_frame`, we can provide a list containing the column name.

In [None]:
elections.loc[0:4, ['Candidate']]

Unnamed: 0,Candidate
0,Andrew Jackson
1,John Quincy Adams
2,Andrew Jackson
3,John Quincy Adams
4,Andrew Jackson


## Positional access with `iloc`

In [None]:
elections.iloc[:3, 2:]

Unnamed: 0,Party,Popular vote,Result,%
0,Democratic-Republican,151271,loss,57.210122
1,Democratic-Republican,113142,win,42.789878
2,Democratic,642806,win,56.203927


We will use both `loc` and `iloc` in the course. `loc` is generally preferred for a number of reasons, for example:

1. It is harder to make mistakes since you have to literally write out what you want to get.
2. Code is easier to read, because the reader doesn't have to know e.g. what column #31 represents.
3. It is robust against permutations of the data, e.g. if the data provider (say, the Social Security Administration) switches the order of two columns.

However, iloc is sometimes more convenient. We'll provide examples of when iloc is the superior choice.
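
The difference between the two accessors can be sketched on a toy frame. The rows are copied from the table above, and `toy` and `shuffled` are hypothetical names. The example shows both the inclusive versus exclusive slicing rule and what happens when the column order changes.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [1824, 1824, 1828],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson"],
    "Party": ["Democratic-Republican", "Democratic-Republican", "Democratic"],
})

# Label slices include both endpoints; positional slices exclude the stop.
by_label = toy.loc[0:1, "Candidate"]   # rows labeled 0 and 1
by_position = toy.iloc[0:1, 1]         # the row at position 0 only

# Reorder the columns, as a data provider might.
shuffled = toy[["Party", "Year", "Candidate"]]

print(shuffled.loc[0, "Candidate"])   # still the candidate name
print(shuffled.iloc[0, 1])            # now the year, not the candidate
```

The `loc` lookup is unaffected by the reordering, while the `iloc` lookup silently returns a different column.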

## Sampling

Pandas DataFrames also make it easy to get a sample. We simply use the `sample` method and provide the number of samples that we'd like as the argument. Sampling is done without replacement by default. Set `replace=True` if you want replacement.

In [None]:
elections.sample(10)

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
175,2016,Gary Johnson,Libertarian,4489235,loss,3.307714
69,1912,William Taft,Republican,3486242,loss,23.218466
142,1992,George H. W. Bush,Republican,39104550,loss,37.544784
160,2004,Michael Peroutka,Constitution,143630,loss,0.117542
149,1996,Ralph Nader,Green,685297,loss,0.712721
31,1872,Horace Greeley,Liberal Republican,2834761,loss,44.071406
135,1988,George H. W. Bush,Republican,48886597,win,53.518845
29,1868,Horatio Seymour,Democratic,2708744,loss,47.334695
110,1956,T. Coleman Andrews,States' Rights,107929,loss,0.174883
168,2012,Barack Obama,Democratic,65915795,win,51.258484


In [None]:
elections.query("Year < 1992").sample(50, replace=True)

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
132,1984,David Bergland,Libertarian,228111,loss,0.247245
102,1948,Norman Thomas,Socialist,139569,loss,0.286312
11,1840,William Henry Harrison,Whig,1275583,win,53.051213
105,1952,Adlai Stevenson,Democratic,27375090,loss,44.446312
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
27,1864,Abraham Lincoln,National Union,2211317,win,54.951512
55,1900,William Jennings Bryan,Democratic,6370932,loss,46.13054
88,1932,Norman Thomas,Socialist,884885,loss,2.236211
9,1836,William Henry Harrison,Whig,550816,loss,37.721543
116,1968,Hubert Humphrey,Democratic,31271839,loss,42.863537
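
The effect of `replace` is easiest to see on a frame that is smaller than the requested sample. The following is a sketch on made-up data; the `random_state` argument is used here only to make the draws repeatable.

```python
import pandas as pd

toy = pd.DataFrame({"n": range(5)})

# Without replacement: every row appears at most once.
without = toy.sample(3, random_state=0)

# With replacement: rows may repeat, so the sample can exceed the frame's size.
with_repl = toy.sample(8, replace=True, random_state=0)

# Asking for more rows than exist without replacement is an error.
try:
    toy.sample(8)
except ValueError as err:
    print("sample failed:", err)

print(without)
print(with_repl)
```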


## Handy Properties and Utility Functions for Series and DataFrames

#### Python Operations on Numerical DataFrames and Series

We can perform various Python operations (including NumPy operations) on DataFrames and Series.

In [None]:
np.mean(elections['%'])

27.52808988765169

We can also do more complicated operations like computing the mean squared error, i.e. the average L2 loss. (This will mean more in the next few weeks.)

In [None]:
c = 50.38
mse = np.mean((c - elections['%'])**2)
mse

1045.2784200615483

In [None]:
c2 = 50.35
mse2 = np.mean((c2 - elections['%'])**2)
mse2

1043.9082054548073
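
The same computation can be wrapped in a small helper to compare several candidate constants at once. This is a sketch on a made-up three-value Series; the function name `mse` and the data are hypothetical. Subtracting a Series from a scalar is applied element by element, which is why no loop is needed.

```python
import numpy as np
import pandas as pd


def mse(c, data):
    """Mean squared error of predicting the constant c for every entry."""
    return np.mean((c - data) ** 2)


values = pd.Series([1.0, 2.0, 6.0])

# Try a few candidate constants, including the mean of the data.
for c in [2.0, values.mean(), 4.0]:
    print(c, mse(c, values))
```

Among the candidates tried here, the mean of the data gives the smallest error.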

Also commonly used is the `unique` method, which returns all unique values as a numpy array.
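
A minimal sketch of `unique` on a toy Series of years (the data is made up to mirror the `Year` column). The related `nunique` method returns just the count.

```python
import pandas as pd

years = pd.Series([1824, 1824, 1828, 1828, 1832])

distinct_years = years.unique()   # values in order of first appearance
print(distinct_years)
print(years.nunique())            # number of distinct values
```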