# Introduction to Pandas 1

Click [here](https://pandas.pydata.org/) to access a high-level overview of the Pandas library.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

plt.style.use('fivethirtyeight')
sns.set_context("notebook")

## 1. Reading in DataFrames from Files
Pandas has a number of very useful file reading tools. You can see them enumerated by typing `pd.re` and pressing tab. We'll be using `read_csv` in this notebook.

In [None]:
elections = pd.read_csv("elections.csv")
elections

**Example 1.1.** We can use the `.head` method to return only the first few rows of a DataFrame.

In [None]:
elections.head(10)

**Example 1.2.** There is also a `.tail` method, which returns the last few rows.

In [None]:
elections.tail(7)

**Example 1.3.** The `pd.read_csv` function lets us specify a column to use as the index. For example, we could have used `Year` as the index.

In [None]:
elections_year_index = pd.read_csv("elections.csv", index_col = "Year")
elections_year_index.head(5)

**Example 1.4.** Alternatively, we could have used the `.set_index` method.

In [None]:
elections_party_index = elections.set_index("Party")
elections_party_index.head(5)
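
As a quick check that the two routes agree, here is a minimal sketch that reads a small in-memory CSV both ways. The CSV text and its values are made up for illustration and only stand in for `elections.csv`.

```python
import io
import pandas as pd

# Made-up CSV text standing in for elections.csv (illustrative values only).
csv_text = "Candidate,Party,Year\nA,Red,2000\nB,Blue,2000\nC,Red,2004\n"

# Route 1: choose the index while reading the file.
via_index_col = pd.read_csv(io.StringIO(csv_text), index_col="Year")

# Route 2: read first, then move the column into the index.
via_set_index = pd.read_csv(io.StringIO(csv_text)).set_index("Year")

# Compare the two results (values, column labels, and index).
print(via_index_col.equals(via_set_index))
```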

**Example 1.5.** The `.set_index` method (like most other DataFrame methods) returns a new DataFrame rather than modifying the one it is called on. That is, the original `elections` is untouched.

**Note:** Many of these methods accept an `inplace` flag; setting `inplace=True` modifies the calling DataFrame instead.

In [None]:
elections.head()
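
The same behavior can be seen on a toy frame. This is only a sketch: the frame `df` and its values are made up and are not taken from `elections.csv`.

```python
import pandas as pd

# Toy frame with made-up values, used only to illustrate the behavior.
df = pd.DataFrame({"Candidate": ["A", "B"], "Party": ["Red", "Blue"]})

indexed = df.set_index("Party")  # returns a new DataFrame

print(list(df.columns))       # "Party" is still an ordinary column in df
print(list(indexed.columns))  # in the new frame it has moved into the index
```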

**Example 1.6.** Index labels need not be unique, as the repeated values in the `Party` index above show. By contrast, column names are ideally unique. For example, if we try to read in a file for which column names are not unique, Pandas will automatically rename any duplicates.

In [None]:
dups = pd.read_csv("duplicate_columns.csv")
dups
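
To see the renaming without needing `duplicate_columns.csv`, the sketch below reads a made-up CSV string whose header repeats a column name. The column names and values are assumptions chosen for illustration.

```python
import io
import pandas as pd

# Made-up CSV text whose header repeats the column name "name".
csv_text = "name,name,flavor\nx,y,vanilla\n"

dups_demo = pd.read_csv(io.StringIO(csv_text))

# The second "name" column receives a numeric suffix.
print(list(dups_demo.columns))
```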

## 2. The [] Operator

**Example 2.1.** The DataFrame class has an indexing operator **[]** that lets you do a variety of different things. If you provide a string to the **[]** operator, you get back a Series corresponding to the requested label.

In [None]:
elections_year_index.head(6)

In [None]:
elections_year_index["Candidate"].head(6)

**Example 2.2.** The **[]** operator also accepts a list of strings. In this case, you get back a DataFrame corresponding to the requested strings.

In [None]:
elections_year_index[["Candidate", "Party"]].head()

**Example 2.3.** A list of one label also returns a DataFrame. This can be handy if you want your results as a DataFrame, not a Series.

In [None]:
elections_year_index[["Candidate"]].head()

**Example 2.4.** Note that we can also use the `.to_frame` method to turn a Series into a DataFrame.

In [None]:
elections_year_index["Candidate"].to_frame().head()
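
The return types from Examples 2.1 through 2.4 can be checked directly. The sketch below uses a made-up two-row frame rather than the elections data.

```python
import pandas as pd

# Toy frame with made-up values.
df = pd.DataFrame({"Candidate": ["A", "B"], "Party": ["Red", "Blue"]})

as_series = df["Candidate"]              # string label   -> Series
as_frame = df[["Candidate"]]             # list of labels -> DataFrame
converted = df["Candidate"].to_frame()   # Series -> one-column DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
print(as_frame.equals(converted))        # both routes give the same DataFrame
```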

**Example 2.5.** The **[]** operator also accepts numerical slices as arguments. In this case, we are indexing by row, not column.

In [None]:
elections_year_index[0:3]

**Example 2.6.** If you provide a single scalar argument to the **[]** operator, it tries to use it as a column name. This is true even if the argument passed to **[]** is an integer.

In [None]:
# This raises a KeyError because no column is named 0; run it to see the failure
elections_year_index[0]
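
A minimal sketch of the contrast between an integer and a slice, using a made-up frame whose columns all have string names:

```python
import pandas as pd

# Toy frame with made-up values; no column is named 0.
df = pd.DataFrame({"Candidate": ["A", "B", "C"], "Party": ["Red", "Blue", "Red"]})

raised = False
try:
    df[0]                 # looks for a column *named* 0
except KeyError:
    raised = True
print(raised)

first_two = df[0:2]       # a slice, by contrast, selects rows by position
print(len(first_two))
```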

**Example 2.7.** The following cells allow you to test your understanding.

In [None]:
weird = pd.DataFrame({
      1:["topdog", "botdog"],
    "1":["topcat", "botcat"]
})
weird

**Example 2.7.1.** Try to predict the output.

In [None]:
# Try to predict the output
weird[1] 

**Example 2.7.2.** Try to predict the output.

In [None]:
# Try to predict the output
weird["1"]

**Example 2.7.3.** Try to predict the output.

In [None]:
# Try to predict the output
weird[1:]

## 3. Boolean Array Selection

**Example 3.1.** The **[]** operator also supports an array of Booleans as input. In this case, the array must have exactly one entry per row. The result is a filtered version of the DataFrame in which only the rows corresponding to True appear.

In [None]:
elections_year_index.head()

In [None]:
elections_year_index[[False, False, False, False, False, 
          False, False, True, False, False,
          True, False, False, False, True,
          False, False, False, False, False,
          False, False, True]]

**Example 3.2.** One very common task in data science is filtering. Boolean array selection is one way to achieve this in Pandas. We start by observing that comparison operators such as the equality operator can be applied to a Pandas Series to generate a Boolean array. For example, we can compare the `Result` column to the string `'win'`:

In [None]:
elections_year_index.head(5)

In [None]:
iswin = elections_year_index['Result'] == 'win'
iswin

The output of the logical operator applied to the Series is another Series with the same name and index, but of datatype boolean. The entry at row #i represents the result of the application of that operator to the entry of the original Series at row #i.
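
These properties can be verified on a small example. The frame below is made up and only mirrors the shape of `elections_year_index`.

```python
import pandas as pd

# Toy frame with made-up values, indexed by a "Year" index.
df = pd.DataFrame(
    {"Candidate": ["A", "B", "C"], "Result": ["win", "loss", "win"]},
    index=pd.Index([2000, 2000, 2004], name="Year"),
)

mask = df["Result"] == "win"

print(mask.dtype)                   # Boolean datatype
print(mask.name)                    # inherits the name "Result"
print(mask.index.equals(df.index))  # same index as the original Series
print(df[mask])                     # keeps only the rows where mask is True
```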

**Example 3.3.** Such a boolean Series can be used as an argument to the **[]** operator. For example, the following code creates a DataFrame of all election winners since 1980.

In [None]:
elections_year_index[iswin]

**Example 3.4.** Above, we've assigned the result of the logical operator to a new variable called `iswin`. This is uncommon. Usually, the series is created and used on the same line. Such code is a little tricky to read at first, but you'll get used to it quickly.

In [None]:
elections_year_index[elections_year_index['Result'] == 'win']

**Example 3.5.** We can select multiple criteria by creating multiple boolean Series and combining them using the **&** operator.

In [None]:
win_under50 = (elections_year_index['Result'] == 'win') & (elections_year_index['%'] < 50)

In [None]:
win_under50.head()

In [None]:
elections_year_index[(elections_year_index['Result'] == 'win')
          & (elections_year_index['%'] < 50)]

# Note for Python experts: we use the & symbol rather than the word "and" because Pandas overloads "&"
# through the __and__ method, whereas Python does not allow the "and" operator to be overloaded.
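
The difference between `&` and `and` can be seen on two small Boolean Series (made up for illustration). Note also that `&` binds more tightly than `==`, which is why each comparison above is wrapped in parentheses.

```python
import pandas as pd

a = pd.Series([True, True, False])
b = pd.Series([True, False, False])

combined = a & b            # elementwise "and"
print(combined.tolist())

try:
    a and b                 # asks for a single truth value of a whole Series
except ValueError as err:
    print("ValueError:", err)
```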

**Example 3.6.** The **|** operator is the symbol for or.

In [None]:
elections_year_index[(elections_year_index['Party'] == 'Republican')
          | (elections_year_index['Party'] == "Democratic")]

**Example 3.7.** If we have multiple conditions (say Republican or Democratic), we can use the `.isin` method to simplify our code.

In [None]:
elections_year_index['Party'].isin(["Republican", "Democratic"])

In [None]:
elections_year_index[elections_year_index['Party'].isin(["Republican", "Democratic"])]
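
As a sanity check, `.isin` and a chain of **|** comparisons should produce the same mask. The party names below are placeholders, not values from the dataset.

```python
import pandas as pd

# Made-up party labels for illustration.
party = pd.Series(["Red", "Blue", "Green", "Red"], name="Party")

with_or = (party == "Red") | (party == "Blue")
with_isin = party.isin(["Red", "Blue"])

print(with_or.equals(with_isin))
```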

**Example 3.8.** An alternative, simpler way to get back a specific set of rows is to use the `.query` method.

In [None]:
elections_year_index.query?

In [None]:
elections_year_index.query("Result == 'win' and Year < 2000")
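
The query string above refers to `Year` even though it is the index rather than a column; `.query` accepts index names as well as column names. The sketch below, on a made-up frame, compares a query with the equivalent Boolean mask.

```python
import pandas as pd

# Toy frame with made-up values, indexed by "Year".
df = pd.DataFrame(
    {"Candidate": ["A", "B", "C", "D"], "Result": ["win", "loss", "win", "loss"]},
    index=pd.Index([1996, 1996, 2000, 2000], name="Year"),
)

via_query = df.query("Result == 'win' and Year < 2000")
via_mask = df[(df["Result"] == "win") & (df.index < 2000)]

print(via_query.equals(via_mask))
```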

## 4. Label-based Access with `.loc`

In [None]:
elections.head()

**Example 4.1.** Using `.loc`.

In [None]:
elections.loc[[0, 1, 2, 3, 4], ['Candidate','Party', 'Year']]

**Note:** `.loc` won't accept row numbers such as 0 through 4 if we're using the `elections_year_index` DataFrame, because its index labels are years rather than row numbers.

In [None]:
# Raises a KeyError: 0 through 4 are not labels in the Year index
elections_year_index.loc[[0, 1, 2, 3, 4], ['Candidate','Party']]

In [None]:
elections_year_index.loc[[1980, 1984], ['Candidate','Party']]

**Example 4.2.** `.loc` also supports slicing (for all types, including numeric and string labels). Note that the slicing for `.loc` is inclusive, even for numeric slices.

In [None]:
elections.loc[0:4, 'Candidate':'Year']

In [None]:
elections_year_index.loc[1980:1984, 'Candidate':'Party']
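
A short sketch of the inclusive behavior, set against ordinary Python list slicing. The frame and its values are made up.

```python
import pandas as pd

# Toy frame with made-up values and the default 0..3 index.
df = pd.DataFrame({
    "Candidate": ["A", "B", "C", "D"],
    "Party": ["Red", "Blue", "Red", "Blue"],
    "Year": [2000, 2000, 2004, 2004],
})

by_label = df.loc[0:2, "Candidate":"Party"]
print(by_label.shape)   # rows 0, 1, and 2, plus both endpoint columns

plain_list = [10, 20, 30, 40][0:2]
print(len(plain_list))  # ordinary Python slicing stops before index 2
```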

**Example 4.3.** If we provide only a single label for the column argument, we get back a Series.

In [None]:
elections.loc[0:4, 'Candidate']

**Example 4.4.** If we want a DataFrame instead and don't want to use `to_frame`, we can provide a list containing the column name.

In [None]:
elections.loc[0:4, ['Candidate']]

**Example 4.5.** If we give only one row but many column labels, we'll get back a Series corresponding to a row of the table. This new Series has a neat index, where each entry is the name of the column that the data came from.

In [None]:
elections.head(1)

In [None]:
elections.loc[0, 'Candidate':'Year']

In [None]:
elections.loc[[0], 'Candidate':'Year']

**Example 4.6.** If we omit the column argument altogether, the default behavior is to retrieve all columns.

In [None]:
elections.loc[[2, 4, 5]]

**Example 4.7.** `.loc` also supports Boolean array inputs instead of labels. Each Boolean array must have the same length as the axis it selects from (the number of rows or the number of columns, respectively). Versions of Pandas prior to 0.25 allowed size mismatches and assumed the missing values were all False; [this was changed in 2019](https://github.com/pandas-dev/pandas/pull/26911).

In [None]:
elections.loc[[True, False, False, True, False, False, True, True, True, False, False, True, 
               True, True, False, True, True, False, False, False, True, False, False], # Row mask
              [True, False, False, True, True]                                          # Column mask
             ]

In [None]:
elections.loc[[0, 3], ['Candidate', 'Year']]
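
The sketch below repeats the idea on a made-up 3-by-3 frame: a pair of masks selects the same cells as the matching labels, and a mask of the wrong length is rejected. The exact exception type is an assumption that may vary between Pandas versions, so both likely types are caught.

```python
import pandas as pd

# Toy frame with made-up values.
df = pd.DataFrame({
    "Candidate": ["A", "B", "C"],
    "Party": ["Red", "Blue", "Red"],
    "Year": [2000, 2000, 2004],
})

by_mask = df.loc[[True, False, True], [True, False, True]]
by_label = df.loc[[0, 2], ["Candidate", "Year"]]
print(by_mask.equals(by_label))

rejected = False
try:
    df.loc[[True, False]]            # only 2 flags for 3 rows
except (IndexError, ValueError) as err:
    rejected = True
    print(type(err).__name__, err)
```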

**Example 4.8.** We can use boolean array arguments for one axis of the data, and labels for the other.

In [None]:
elections.loc[[True, False, False, True, False, False, True, True, True, False, False, True, 
               True, True, False, True, True, False, False, False, True, False, False], # Row mask
              
              'Candidate':'%'                                                           # Column label slice
             ]

**Example 4.9.** A student asks what happens if you give scalar arguments for the requested rows **and** columns. The answer is that you get back just a single value.

In [None]:
elections.loc[0, 'Candidate']
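
To summarize Examples 4.3 through 4.9, the type that `.loc` returns depends on whether each argument is a scalar or a collection. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Toy frame with made-up values.
df = pd.DataFrame({
    "Candidate": ["A", "B"],
    "Party": ["Red", "Blue"],
    "Year": [2000, 2004],
})

cell = df.loc[0, "Candidate"]             # scalar row, scalar column -> single value
row = df.loc[0, "Candidate":"Year"]       # scalar row, several columns -> Series
frame = df.loc[[0], "Candidate":"Year"]   # list of rows, several columns -> DataFrame

print(type(cell).__name__, type(row).__name__, type(frame).__name__)
print(list(row.index))  # the row Series is indexed by the column names
```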

## 5. Positional Access with `.iloc`

**Example 5.1.** `.loc`'s cousin `.iloc` is very similar, but is used to access based on numerical position instead of label. For example, to access the first 3 rows and first 3 columns of a table, we can use `[0:3, 0:3]`. `.iloc` slicing is exclusive, just like standard Python slicing of numerical values.

In [None]:
elections.head()

In [None]:
elections.iloc[0:3, 0:3]

In [None]:
elections.iloc[:3, 2:]
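
The sketch below contrasts exclusive `.iloc` slices with inclusive `.loc` slices on a made-up frame indexed by year, where positions and labels differ.

```python
import pandas as pd

# Toy frame with made-up values; the index holds years, not row numbers.
df = pd.DataFrame(
    {"Candidate": ["A", "B", "C"], "Party": ["Red", "Blue", "Red"]},
    index=pd.Index([2000, 2004, 2008], name="Year"),
)

# Positions 0 and 1, first column only; the stop values are excluded.
by_position = df.iloc[0:2, 0:1]

# Labels 2000 through 2004, column "Candidate"; both endpoints are included.
by_label = df.loc[2000:2004, "Candidate":"Candidate"]

print(by_position.shape)
print(by_position.equals(by_label))
```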

**Note:** We will use both `.loc` and `.iloc` in the course. `.loc` is generally preferred for a number of reasons, for example:

1. It is harder to make mistakes, since you have to literally write out what you want to get.
2. Code is easier to read, because the reader doesn't have to know, e.g., what column #31 represents.
3. It is robust against permutations of the data, e.g. the Social Security Administration switching the order of two columns.

However, `.iloc` is sometimes more convenient. We'll provide examples of when `.iloc` is the superior choice.
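
The third point in the list above can be illustrated with a small sketch. The frame is made up, and the column swap simulates a data provider reordering its file.

```python
import pandas as pd

# Toy frame with made-up values.
df = pd.DataFrame({
    "Candidate": ["A", "B"],
    "Party": ["Red", "Blue"],
    "Year": [2000, 2004],
})

# Simulate a data provider swapping the first two columns.
shuffled = df[["Party", "Candidate", "Year"]]

same_by_label = df.loc[:, "Candidate"].equals(shuffled.loc[:, "Candidate"])
same_by_position = df.iloc[:, 0].equals(shuffled.iloc[:, 0])

print(same_by_label)     # label lookup is unaffected by the swap
print(same_by_position)  # position 0 now holds "Party" instead of "Candidate"
```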

## 6. Quick Challenge

**Example 6.1.** Which of the following expressions return a DataFrame containing the Candidate and Year of the first 3 candidates that won with more than 50% of the vote?

In [None]:
elections.head(10)

In [None]:
elections.iloc[[0, 3, 5], [0, 3]]

In [None]:
elections.loc[[0, 3, 5], "Candidate":"Year"]

In [None]:
elections.loc[elections["%"] > 50, ["Candidate", "Year"]].head(3)

In [None]:
elections.loc[elections["%"] > 50, ["Candidate", "Year"]].iloc[0:2, :]

## 7. Sampling

**Example 7.1.** Pandas DataFrames also make it easy to get a sample. We simply use the `sample` method and provide the number of samples that we'd like as the argument. Sampling is done without replacement by default. Set `replace = True` if you want replacement.

In [None]:
elections.sample(10)

In [None]:
elections.query("Year < 1992").sample(50, replace = True)
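
Replacement matters as soon as the requested sample is larger than the table. A minimal sketch on a made-up three-row frame (`random_state` is passed only to make the draw repeatable):

```python
import pandas as pd

# Toy frame with made-up values.
df = pd.DataFrame({"Candidate": ["A", "B", "C"]})

try:
    df.sample(5)   # more rows than the frame has, without replacement
except ValueError as err:
    print("ValueError:", err)

resampled = df.sample(5, replace=True, random_state=0)
print(len(resampled))  # rows may repeat, so 5 draws are possible
```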

## 8. Handy Properties and Utility Functions for Series and DataFrames

### 8.1. Python Operations on Numerical DataFrames and Series

**Example 8.1.1.** Consider a series of only the vote percentages of election winners.

In [None]:
winners = elections.query("Result == 'win'")["%"]
winners

**Example 8.1.2.** We can perform various Python operations (including numpy operations) on DataFrames and Series.

In [None]:
max(winners)

In [None]:
np.mean(winners)

**Example 8.1.3.** We can also do more complicated operations like computing the mean squared error (i.e. the average L2 loss).

**Note:** This will mean more in the next few weeks.

In [None]:
c = 50.38
mse = np.mean((c - winners)**2)
mse

In [None]:
c2 = 50.35
mse2 = np.mean((c2 - winners)**2)
mse2
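
The two cells above compare the mean squared error for two constant guesses. As a hedged preview of why such comparisons are interesting, the sketch below uses three made-up numbers and a hypothetical helper `mse` to show that the guess equal to the mean gives a smaller error than a nearby constant.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 6.0])  # made-up numbers; their mean is 3.0


def mse(c, data):
    """Mean squared error of the constant guess c (hypothetical helper)."""
    return np.mean((c - data) ** 2)


at_mean = mse(values.mean(), values)  # guess equal to the mean
at_other = mse(2.5, values)           # a nearby constant guess

print(at_mean, at_other)
```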

**Example 8.1.4.** We can also apply mathematical operations to a DataFrame so long as it has only numerical data.

In [None]:
(elections[["%", "Year"]] + 3).head(5)

### 8.2. Handy Utility Methods

The `head` and `describe` methods and the `shape` and `size` attributes can be used to quickly get a good sense of the data we're working with. For example:

In [None]:
mottos = pd.read_csv("mottos.csv", index_col = "State")

In [None]:
mottos.head()

In [None]:
mottos.size

**Example 8.2.1.** The fact that the size is 200 means our data file is relatively small, with only 200 total entries.

In [None]:
mottos.shape

**Example 8.2.2.** Since we're looking at data for states, and we see the number 50, it looks like we've most likely got a complete dataset that omits Washington D.C. and U.S. territories like Guam and Puerto Rico.
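
The two attributes are related: `size` is the product of the two numbers in `shape`. A minimal sketch on a made-up frame (the state and motto values are placeholders):

```python
import pandas as pd

# Toy frame with placeholder values; "State" becomes the index, as in mottos.
toy = pd.DataFrame({
    "State": ["S1", "S2", "S3"],
    "Motto": ["m1", "m2", "m3"],
    "Language": ["Latin", "English", "Latin"],
}).set_index("State")

print(toy.shape)  # (rows, columns); the index is not counted as a column
print(toy.size)   # rows * columns
```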

In [None]:
mottos.describe()

**Note:** Above, we see a quick summary of all the data. For example, the most common language for mottos is Latin, which covers 23 different states. Does anything else seem surprising?

**Example 8.2.3.** We can get a direct reference to the index using `.index`.

In [None]:
mottos.index

**Example 8.2.4.** We can also access individual properties of the index, for example, `mottos.index.name`.

In [None]:
mottos.index.name

**Note:** This reflects the fact that in our data frame, the index **is** the state name.

In [None]:
mottos.head(2)

**Example 8.2.5.** It turns out the columns also have an Index. We can access this index by using `.columns`.

In [None]:
mottos.head(2)

In [None]:
mottos.columns

**Example 8.2.6.** There are also a ton of useful utility methods we can use with Data Frames and Series. For example, we can create a copy of a data frame sorted by a specific column using `.sort_values`.

In [None]:
elections.sort_values('%')

**Note:** As mentioned before, most DataFrame methods return a copy and do not modify the original data structure unless you set `inplace` to `True`.

In [None]:
elections.head()

**Example 8.2.7.** If we want to sort in reverse order, we can set `ascending = False`.

In [None]:
elections.sort_values('%', ascending = False)

**Example 8.2.8.** We can also use `.sort_values` on Series objects.

In [None]:
mottos['Language'].sort_values(ascending=False).head(10)

**Example 8.2.9.** For Series, the `.value_counts` method is often quite handy.

In [None]:
elections['Party'].value_counts()

In [None]:
mottos['Language'].value_counts()

**Example 8.2.10.** Also commonly used is the `.unique` method, which returns all unique values as a numpy array.

In [None]:
mottos['Language'].unique()
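
A small sketch of both methods on a made-up Series, showing what each one returns:

```python
import pandas as pd

# Made-up values for illustration.
language = pd.Series(["Latin", "English", "Latin", "French"], name="Language")

counts = language.value_counts()  # Series of counts, most frequent value first
distinct = language.unique()      # distinct values in order of first appearance

print(counts.index[0], counts.iloc[0])
print(list(distinct))
```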

## 9. Baby Names Data

**Example 9.1.** Now let's play around a bit with a large baby names dataset that is publicly available. We'll start by loading that dataset from the Social Security Administration's website.

To keep the data small enough to avoid crashing datahub, we're going to look at only North Carolina rather than looking at the national dataset.

In [None]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

nc_name = 'NC.TXT'  # name of the North Carolina file inside the zip archive
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(nc_name) as fh:
    babynames = pd.read_csv(fh, header = None, names = field_names)

babynames.head()
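
The call above passes `header = None` together with `names`, which suggests the file has no header row. The sketch below shows what those two arguments do, using two made-up lines in the same five-field layout (the names and counts are illustrative, not real data).

```python
import io
import pandas as pd

# Made-up lines in the five-field layout; there is no header row.
raw = "NC,F,1910,Mary,10\nNC,F,1910,Annie,7\n"
field_names = ["State", "Sex", "Year", "Name", "Count"]

with_names = pd.read_csv(io.StringIO(raw), header=None, names=field_names)
without = pd.read_csv(io.StringIO(raw))  # first data line is consumed as the header

print(list(with_names.columns), len(with_names))
print(len(without))
```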

**Question 9.2.** Find the most popular baby name in North Carolina in 2018.

In [None]:
# Question 9.2.

**Question 9.3.** Find baby names that start with "J". This is hard to do with today's tools.

In [None]:
# Question 9.3.

**Question 9.4.** Find the name whose popularity has changed the most. This is also tough.

In [None]:
# Question 9.4.