## Introduction to pandas

[This is what it's all about](https://www.youtube.com/watch?v=6UvbhER4RmM)

An early criticism of Python among data scientists was its lack of useful data structures for data analysis. The criticism stemmed in part from comparisons with R, whose built-in *DataFrame* object greatly simplifies many data analysis tasks. Wes McKinney addressed this deficiency in 2008 by creating [pandas][1] (the name was originally an abbreviation of "panel data"), and the module continues to be improved. To quote the pandas documentation:

>Python has long been great for data munging and preparation, but less
>so for data analysis and modeling. pandas helps fill this gap, enabling
>you to carry out your entire data analysis workflow in Python without
>having to switch to a more domain specific language like R.

The pandas module introduces new data structures, chiefly the `Series` and the `DataFrame`, that build on top of existing tools like `NumPy` to speed up data analysis tasks. (Older versions of pandas also included a three-dimensional `Panel`, which has since been removed.) Note that the `NumPy` module provides support for numerical operations, including the generation of random data, which we will use in this notebook. The pandas module also provides efficient mechanisms for moving data between in-memory representations and different data formats, including comma-separated values (.csv) and other text files, JSON files, SQL databases, HDF5 format files, and even Excel spreadsheets. Finally, the pandas module provides support for dealing with missing or incomplete data and for aggregating or grouping data.

-----
[1]: http://pandas.pydata.org

### Importing pandas

We import pandas using the `pd` alias.

In [None]:
import pandas as pd

### Data Types

There are two main data types in pandas: `Series` and `DataFrame`. They have similar functionality. A `DataFrame` contains one or more `Series` objects. Each data column (which is a `Series`) in a `DataFrame` has a single data type (e.g., int32, float64, object). Older versions of pandas also had the types `TimeSeries` and `Panel`, but both have since been removed; the types we will use are `DataFrame` and `Series`.
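
As a small illustration of the one-type-per-column idea, the sketch below builds a tiny `DataFrame` from made-up data and inspects its columns. The variable name `toy` and the column names and values are assumptions for illustration only; they are not part of this notebook's datasets.

```python
import pandas as pd

# Hypothetical toy data: three columns, each holding one kind of value
toy = pd.DataFrame({
    "count": [1, 2, 3],
    "price": [0.5, 1.25, 2.0],
    "label": ["a", "b", "c"],
})

# A column pulled out of a DataFrame is a Series
print(type(toy["price"]))

# Every column reports exactly one dtype
print(toy.dtypes)
```

The exact dtype names printed can vary with your pandas version and platform, but there is always one dtype per column.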

### `Series`

A `Series` is useful to hold data that can be accessed by using a specific label. Let's create a few different `Series` objects.

In [None]:
# Create a Series by passing it a list
s1 = pd.Series(["a", "b", "c", "d"])
print(s1)

Notice that two columns were printed. The first column is the *index* and the second column is the data from the `list` that we passed in at creation. The index was generated automatically when we created `s1`. You can specify the index either at creation time or after the `Series` has been created. We can get both the index and the values from a `Series` through its `.index` and `.values` attributes.

In [None]:
# What type is s1?
print(type(s1))

In [None]:
# Get the values from s1
s1.values

In [None]:
# What type is returned when calling values?
print(type(s1.values))

In [None]:
# Get the index of s1
s1.index

Because we did not give it an index at creation, it created and used a `RangeIndex`.

In [None]:
# How big is the index?
print(f"s1.size = {s1.size}")

# what is the type of the index
print(type(s1.index))

In [None]:
# Change the index
s1.index = ["one", "two", "three", "four"]
s1
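
The index can also be supplied when the `Series` is created rather than assigned afterwards. Here is a minimal sketch using the `index` argument of `pd.Series`; the variable name `s2` is made up for this illustration.

```python
import pandas as pd

# Supply the labels up front instead of assigning to .index afterwards
# (s2 is a new, illustrative variable)
s2 = pd.Series(["a", "b", "c", "d"], index=["one", "two", "three", "four"])
print(s2)

# The labels can be used right away to look up values
print(s2["three"])
```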

You can name the series if you want.

In [None]:
# Name s1 series
s1.name = "Silly Series"

In [None]:
print(s1.name)

In [None]:
# Now when we print out the series, we should see its name also
s1

-----

### Operations with `Series`

Let's create a larger `Series` and try a few different basic operations.

In [None]:
# Create ice cream flavors list
iceCream = ["chocolate", "strawberry", "vanilla", "rum raisin", "chocolate", "vanilla", 
           "vanilla", "strawberry", "rum raisin", "chocolate", "strawberry", "cotton candy", 
           "chocolate", "vanilla", "rum raisin", "vanilla", "vanilla", "strawberry", 
           "chocolate", "vanilla", "chocolate", "vanilla", "strawberry", "vanilla", 
           "chocolate", "chocolate", "purple cow", "chocolate", "rum raisin", "vanilla", 
           "chocolate", "bubble gum", "vanilla"]

In [None]:
# Create a new series
flavors = pd.Series(iceCream)
flavors

In [None]:
# Look at just the top using .head()
flavors.head()

In [None]:
# By default head gives top 5
# You can specify n - the number to show
flavors.head(3)

In [None]:
# You can also see the bottom using .tail()
flavors.tail(3)

In [None]:
# Sometimes we want to "sample" from the series 
flavors.sample(5)

#### Counting Categorical Series

Suppose our ice cream flavors were the results of a survey we gave to our fellow students. It would be nice to know which flavors were the most popular, least popular, etc. We could tally the responses for each flavor using a `for` loop. Fortunately, pandas provides a much easier way to arrive at our summary: the `value_counts()` method. Let's try it.

In [None]:
# Find out the popularity of each flavor
flavors.value_counts()
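
For comparison, here is roughly what the manual `for` loop approach might look like next to `value_counts()`. This is only a sketch on a short, made-up list (`answers` and `tally` are illustrative names, not part of the notebook).

```python
import pandas as pd

# A short, made-up list of survey answers (not the full list above)
answers = ["vanilla", "chocolate", "vanilla", "strawberry", "vanilla", "chocolate"]

# Manual tally with a for loop and a dictionary
tally = {}
for flavor in answers:
    tally[flavor] = tally.get(flavor, 0) + 1
print(tally)

# The same tally in a single call, sorted from most to least common
counts = pd.Series(answers).value_counts()
print(counts)
```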

In [None]:
# What type does value_counts() return?
print(type(flavors.value_counts()))

Because the returned object is a `Series`, you can use `.index` and `.values` like on any other `Series` object. 

There is also a way to easily find only the unique values for a categorical `Series` like we have: use the `unique()` method.

In [None]:
# Find only the unique flavors
flavors.unique()

-----

### Creating Random Data

I want to create some random numerical data to see how a numerical `Series` differs from the categorical one we had above. We'll use the `numpy` package to generate the random numbers.

In [None]:
import numpy as np

In [None]:
# Set the seed so that I can replicate the random numbers
np.random.seed(42)

# List comprehension to generate random floating point numbers with one digit
temps = [float(f"{np.random.randint(45, 67) + np.random.random():.1f}") for i in range(50)]
temps

In [None]:
# Create a series from float list temps
tempF = pd.Series(temps)
tempF

-----

### Accessing `Series` Data

Data rows can be accessed either by their "label" in the index column or by their position in the data column. The `.loc` indexer finds data rows based on their label. The `.iloc` indexer finds data rows based on their position; that is, the sequence in which the rows are found in the `Series`. When we created our `tempF` `Series`, we did not specify an index. Therefore, it has a `RangeIndex` that starts at 0 and increments by 1 for each subsequent row. When a `Series` is created this way, each label equals its position, so `.loc` and `.iloc` return the same row for the same integer on our `tempF` object.

In [None]:
# Get the element using the label (index) of 0
tempF.loc[0]

In [None]:
# Get the element using the position of 0
tempF.iloc[0]

But now let's change the indices to something else.

In [None]:
# Change index to start at 50
tempF.index = range(50, 100, 1)
tempF.index

The statement below will result in an error (a `KeyError`) since our index labels have changed: no row is labeled `0` anymore.

In [None]:
tempF.loc[0]

To get the first element of the `Series`, we need to use the newly labeled index of 50.

Using `.iloc[0]`, on the other hand, will still give us the first data row in the `Series`.

In [None]:
tempF.loc[50]

In [None]:
tempF.iloc[0]
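
The difference between label and position can be seen in isolation with a tiny example. The sketch below uses a made-up three-element `Series` named `demo` (an assumption for illustration) and catches the `KeyError` so the cell keeps running.

```python
import pandas as pd

# Hypothetical three-element Series whose labels start at 50
demo = pd.Series([61.2, 58.7, 49.9], index=[50, 51, 52])

# Position 0 always exists in a non-empty Series
print(demo.iloc[0])

# Label 50 exists, but label 0 does not
print(demo.loc[50])
try:
    demo.loc[0]
except KeyError as err:
    print(f"KeyError: no row is labeled {err}")
```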

-----

### Series Methods for Numerical Data

When we have a numerical data type in a `Series`, we will often want some summary statistics to get an idea of the data we are dealing with. Let's try a few of the methods we have available to us.

In [None]:
# We add up the values with sum()
print(f"tempF.sum():     {tempF.sum()}")

# We can find the average with mean()
print(f"tempF.mean():    {tempF.mean()}")

# We can find the median with median()
print(f"tempF.median():  {tempF.median()}")

# We can find the product with product()
print(f"tempF.product(): {tempF.product()}")

Those options are all nice, but there is a method we can use on a numerical `Series` that provides the most common summary statistics in one call: `describe()`.

In [None]:
# Get the summary statistics
tempF.describe()

#### Sorting

We will also want to sort a `Series` based on the numerical values. As expected, you can sort either ascending or descending.

In [None]:
# First look at the head()
tempF.head()

In [None]:
# Try sorting to see what happens
tempF.sort_values()
tempF.head()

Well, that did not work the way we had hoped. What happened? The `sort_values()` method returns a new `Series` object and leaves the original untouched. We can store the result in a new variable and see if that behaves the way we had hoped.

In [None]:
sortedTemps = tempF.sort_values()
sortedTemps.head()

What if we want to keep it in the same variable? We can sort "inplace".

In [None]:
# See original
tempF.head()

In [None]:
# Sort in place
tempF.sort_values(inplace=True)
tempF.head()

Now, if we want to sort in descending order, we have to add the argument `ascending=False`.

In [None]:
tempF.sort_values(inplace=True, ascending=False)
tempF.head()

What if you want to get back to the original sort order? Well, if you created the `Series` object with a `RangeIndex`, you can use that to get back to the original order by calling `sort_index()`.

In [None]:
# Re-sort using the index
tempF.sort_index(inplace=True)
tempF.head()

----

### Appending One Series to Another

Older versions of pandas offered an `.append()` method for adding one `Series` to another, but it was removed in pandas 2.0. The `pd.concat()` function does the same job and works in both older and newer versions, so we use it here. As we saw with sorting, the combined `Series` object is not permanent unless you save it back to the same variable or to a new one.

In [None]:
# Look at original size and index
print(f"tempF.size:  {tempF.size}")
print(f"tempF.index: {tempF.index}")

# Create a new single element Series
tempSeries = pd.Series([0.0])

# Concatenate it onto tempF and see if it "stuck"
pd.concat([tempF, tempSeries])

print("AFTER APPENDING:")
print(f"tempF.size:  {tempF.size}")
print(f"tempF.index: {tempF.index}")

This confirms that it did not save it back to the original `Series` object. Let's try again.

In [None]:
# Concatenate it onto tempF and save the result back to tempF
tempF = pd.concat([tempF, tempSeries])

print("AFTER APPENDING:")
print(f"tempF.size:  {tempF.size}")
print(f"tempF.index: {tempF.index}")

Now, we have our combined `Series`. Notice that the index it gave the new element was `0`. If we want to get rid of that new value, then we can use the `.drop()` method. The `.drop()` method deletes a row in a `Series` based on the row label (i.e., index). Let's try it.

In [None]:
tempF.drop(labels=0, inplace=True)
print("AFTER DROPPING:")
print(f"tempF.size:  {tempF.size}")
print(f"tempF.index: {tempF.index}")
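
When `Series` objects are combined with `pd.concat`, each piece keeps its own labels by default, which is why the new element arrived with the label `0`. If the old labels do not matter, the `ignore_index=True` option of `pd.concat` renumbers the result. The sketch below uses small made-up `Series` (`first` and `second` are illustrative names).

```python
import pandas as pd

# Hypothetical data: two labeled values plus one new value
first = pd.Series([61.2, 58.7], index=[50, 51])
second = pd.Series([0.0])

# Default: each piece keeps its own labels, so the new row is labeled 0
kept = pd.concat([first, second])
print(kept.index.tolist())

# ignore_index=True discards the old labels and numbers the rows 0, 1, 2, ...
renumbered = pd.concat([first, second], ignore_index=True)
print(renumbered.index.tolist())
```

Keeping the labels is what makes the `.drop(labels=0)` call above possible; renumbering would be the better choice if you simply want a clean index.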

----

### Plotting a Series

Pictures (visualizations) are valuable to help understand your data. The `pandas` package provides very helpful methods to easily plot our data. Let's try it.

In [None]:
# Plot the flavors of ice cream and how many votes each received
# We want to use a BAR chart. Each flavor will be its own bar.
flavors.value_counts().plot(kind="bar")

That was super easy. What if we want to plot our `tempF` `Series`? We have numerical instead of categorical data. Let's try just calling `.plot()` and see what happens. What is your guess of the type of chart it will create?

In [None]:
# Plot using default kind for numerical data.
tempF.plot()

So, we got a line chart. This would be appropriate **ONLY** if the $x$-axis were *time*. If we are interested in looking at the *distribution* or *shape* of the values, we should use a histogram or boxplot instead.

In [None]:
# Try creating a histogram
tempF.plot(kind="hist")

By default it made 10 "bins" or "buckets". You can change the number of bins.

In [None]:
# Use 5 bins
tempF.plot(kind="hist", bins=5)

In [None]:
# Let's try a boxplot
tempF.plot(kind="box")

----

<font color='red' size = '5'> Student Exercise </font>

In the **Code** cell below, you have been given a `list` of the US presidents' heights in centimeters. Create a `Series` object named `prezSeries` from the given list.

1. Print out the type of `prezSeries` to verify that you created the `Series` object correctly.
2. Print out the summary statistics for the `Series`.
3. Create a histogram of the heights.
4. Create a boxplot of the heights.


-----

In [None]:
# list of data to convert to a Series
prezHeights = [193, 192, 191, 189, 188, 188, 188, 188, 188, 187, 
               185, 185, 185, 183, 183, 183, 183, 183, 183, 182, 
               182, 182, 182, 182, 180, 180, 179, 178, 178, 178, 
               178, 177, 175, 175, 174, 173, 173, 173, 173, 171, 
               170, 170, 168, 168, 163]

### YOUR CODE HERE
# 1. Create Series object and print out its type


# 2. Print out summary statistics


# 3. and 4. Histogram and Box Plot
# Leave these lines of code
import matplotlib.pyplot as plt
# set up multiple subplots in same figure
fig, axes = plt.subplots(nrows=2)

# In your plot command you need to tell it where to place each chart
# Do this with ax=axes[0] will put the chart in the first row
# and ax=axes[1] will put the chart in the second row


-----

## Using `DataFrame`s

A `DataFrame` is simply a two-dimensional object whose columns have the type `Series`. Thus, all of the `Series` properties and methods we used earlier can be applied to `DataFrame` columns.

### Reading Files with `pandas`

Commonly, the data we are interested in resides in external files. While there are "base" ways to read data from files with "built-in" Python functions, `pandas` provides an efficient and easy alternative that reads the data from files directly into a `DataFrame`. If your data is in a text file, such as a .csv file, you can use the `pd.read_csv()` function. The nice thing about .csv files is that they are plain text and can easily be transferred and read on any operating system. If, instead, your data is stored in Microsoft Excel files, there are `pandas` functions to read data from that format too. Let's start with .csv files.

In [None]:
# Read the data from the "iris.csv" file into a variable called df
df = pd.read_csv("iris.csv")
print(f"Type of df variable: {type(df)}")
df.head()

In [None]:
# We can look at the data types for the columns
df.dtypes
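
If you do not have the `iris.csv` file at hand, you can still experiment with `pd.read_csv()`, because it accepts a file-like object as well as a path. The sketch below feeds it a few made-up rows through `io.StringIO`; the text in `csv_text` is an assumption that only imitates the layout of the iris file.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk
csv_text = """Id,SepalLengthCm,Species
1,5.1,Iris-setosa
2,7.0,Iris-versicolor
3,6.3,Iris-virginica
"""

# read_csv accepts a path or any file-like object
small = pd.read_csv(io.StringIO(csv_text))
print(small.shape)
print(small.dtypes)
```

The first line of the text becomes the column names, and pandas infers a data type for each column from the remaining lines.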

### Accessing Elements of a `DataFrame`

To get the entire column, we use `df["ColumnName"]`. 

In [None]:
# Pull out the "Species" column
df["Species"]

In [None]:
# What is the type of the column? You should already know this
type(df["Species"])

In [None]:
# If you want a specific row of a DataFrame, you can use .loc or .iloc
# Get first row
df.loc[0]

In [None]:
# Now with iloc
df.iloc[0]

### Methods and Properties of a `DataFrame`

Many of the methods and properties we've seen with `Series` objects also apply to a `DataFrame`. Let's look at a few of them.

In [None]:
# New one - shape: returns a tuple (rows, columns)
df.shape

In [None]:
# Already saw data types
df.dtypes

In [None]:
# Can get the index
df.index

In [None]:
# Can get the values
df.values

In [None]:
# The values are of type numpy.ndarray
type(df.values)

In [None]:
# If you want to convert it to a list
print(f"Type of df.values.tolist(): {type(df.values.tolist())}")
df.values.tolist()

In [None]:
# Can get the columns
df.columns

In [None]:
# if you only want to make the column names a list
df.columns.values.tolist()

In [None]:
# Our very powerful friend describe() is REALLY useful for DataFrames
df.describe()

### Sorting a `DataFrame`

In [None]:
# Let's sort by SepalWidthCm
df.sort_values(by="SepalWidthCm", inplace=True)
df

In [None]:
# Remember we can get back to the original order using the index (assuming we didn't change it)
df.sort_index(inplace=True)
df

In [None]:
# What if you want to sort by multiple columns
# That's easy
df.sort_values(by=["SepalLengthCm", "SepalWidthCm"], inplace=True)
df

### Creating New Columns in a `DataFrame`

There are many times that you will want to create a new column within an existing `DataFrame`, often based on the current columns. In machine learning parlance, this is often referred to as "feature engineering". To keep things very, very simple here we will simply convert one of the columns from centimeters to inches. Hopefully it is obvious that this would not really help us if we were trying to improve some of our machine learning models since it is a simple conversion rather than exploring a relationship among different variables. 

Let's convert the `PetalLengthCm` column to inches and name it `PetalLengthIn`. 

The statement below is, at first glance, perhaps confusing. It seems we are dividing a `DataFrame` column, which is a `Series`, by a constant. Will this work even though we appear to be mixing a `Series` with a `float`? Indeed it will. The division is interpreted element-wise, so each element of the `Series` is divided by our conversion constant of 2.54.

In [None]:
# conversion constant
cmPerInch = 2.54
df["PetalLengthIn"] = df["PetalLengthCm"] / cmPerInch
df
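
To see the element-wise behavior in isolation, the sketch below divides a small made-up `Series` by the same constant and compares the result with an explicit loop. The names `lengths_cm`, `lengths_in`, and `looped` are illustrative assumptions.

```python
import pandas as pd

cm_per_inch = 2.54

# Hypothetical lengths in centimeters
lengths_cm = pd.Series([2.54, 5.08, 12.7])

# Vectorized: the scalar is applied to every element of the Series
lengths_in = lengths_cm / cm_per_inch

# Equivalent explicit loop over the same values
looped = [value / cm_per_inch for value in lengths_cm]

print(lengths_in.tolist())
print(looped)
```

Both approaches give the same numbers; the vectorized form is shorter and typically much faster on large columns.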

### Grouping

As expert users of Excel, you understand PivotTables and really want to use them in Python. We'll look at this in more detail later, but here is a small taste of creating one. We are interested in looking at the average values for each numerical column for each species. That is, we want to "group by" species and see the average for all the columns.

In [None]:
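# Create a pivot-table-style summary: the mean of each numeric column per species
# numeric_only=True skips any text columns, so this cell can be re-run safely later
df.groupby("Species").mean(numeric_only=True)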
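
To make clear what "group by" computes, the sketch below compares `groupby(...).mean()` with the mean of one group obtained by filtering the rows first. The miniature `toy` data set is made up for illustration.

```python
import pandas as pd

# Hypothetical miniature data set with two groups
toy = pd.DataFrame({
    "Species": ["setosa", "setosa", "virginica", "virginica"],
    "PetalLengthCm": [1.0, 2.0, 5.0, 6.0],
})

# One mean per group, computed in a single call
grouped = toy.groupby("Species")["PetalLengthCm"].mean()
print(grouped)

# The same number for one group, computed by filtering the rows first
setosa_mean = toy.loc[toy["Species"] == "setosa", "PetalLengthCm"].mean()
print(setosa_mean)
```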

Wow! Super easy! Let's try using this information to help us "predict" the species of any observations we might encounter. We'll define a function that looks at our newly created column `PetalLengthIn` and returns one of the three species. We will then use that user-defined function to create a new column in our `DataFrame`.

In [None]:
# Define a predict function
def predict(row):
    if row["PetalLengthIn"] <= 1.0:
        return "Iris-setosa"
    elif row["PetalLengthIn"] <= 1.9:
        return "Iris-versicolor"
    else:
        return "Iris-virginica"

Now, we will use the `.apply` method of the `DataFrame`. This method does the following:

- Each row of the `DataFrame` is passed, one-by-one, to the function given as the first argument. In this case the function is our user-defined `predict`.
- The return values from `predict` are collected into a new `Series` that shares the `DataFrame`'s index. We assign that `Series` to the new column `Predict`, so each prediction lands in the same row that gave rise to it.
- The `axis` argument determines whether the `DataFrame` data is sent to `predict` by rows or by columns.

This last point is very confusing for beginners. Specifying `axis="columns"` means that one "set" of column values is sent to the function in each pass. That means, the way I think of it, that the data are sent row by row. Specifying `axis="index"` implies the converse: each column, in its entirety and labeled by the index, is sent to the function one at a time.

In [None]:
df["Predict"] = df.apply(predict, axis="columns")
df
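
The effect of the `axis` argument is easier to see on a tiny example. The sketch below applies a function to a made-up two-column `DataFrame` (`toy`, with columns `a` and `b`, is an assumption for illustration), once row by row and once column by column.

```python
import pandas as pd

# Hypothetical data with two numeric columns
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis="columns": the function receives one row at a time
row_sums = toy.apply(lambda row: row["a"] + row["b"], axis="columns")
print(row_sums.tolist())

# axis="index": the function receives one whole column at a time
col_sums = toy.apply(lambda col: col.sum(), axis="index")
print(col_sums.to_dict())
```

With `axis="columns"` the result has one value per row; with `axis="index"` it has one value per column.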

How good were our predictions? We can compare the `Species` column with the `Predict` column and add up the ones that were the same. 

In [None]:
# How many correct did we have
(df["Species"] == df["Predict"]).sum()

In [None]:
# What proportion is that? (multiply by 100 for a percentage)
(df["Species"] == df["Predict"]).sum() / len(df)

----

<font color='red' size = '5'> Student Exercise </font>

You have been given an Excel file with the list of US presidents and their heights. In the **Code** cell below, read the data from the Excel file into a `DataFrame` called `presidents`.

1. Print out the type of `presidents` to verify that you created the `DataFrame` object correctly.
2. Sample `presidents` to see what the data looks like.
3. Print out the summary statistics for `presidents`.
4. Create a new column that converts `HeightCm` into inches.
5. Create a histogram of your new column. How does it compare to the one you created in the previous exercise?


-----

In [None]:
### YOUR CODE HERE


**&copy; 2021 - Present: Matthew D. Dean, Ph.D.   
Clinical Associate Professor of Business Analytics at William \& Mary.**