# STA 141B Data & Web Technologies for Data Analysis

### Lecture 5, 10/17/24, Pandas


### Announcements

- Homework 2 is due this Sunday, 11:59 PM. 
- Midterm next week on Thursday, Oct 17. Sample exam is online. 

### Today's topics

 - Pandas
 
### Data Sets

 - `dogs_full.csv`
 - `fluidmilk.xlsx`

### References

 - Python for Data Analysis, Ch. 5, 10
 - [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/), Ch. 3

## Pandas

Pandas is a Python package that provides tools for manipulating tabular data. The name "pandas" is short for "**PAN**el **DA**ta", an econometrics term. Since we're using Anaconda, Pandas is already installed.

Pandas is documented [here](http://pandas.pydata.org/pandas-docs/stable/).

In [1]:
import pandas as pd

### Series

A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). It is a generalization of a NumPy array.

In addition to elements, every series includes an <b>index</b>.

In [2]:
x = pd.Series([1, 2, 3])
type(x)

pandas.core.series.Series

In [3]:
x

0    1
1    2
2    3
dtype: int64

In [4]:
x = pd.Series([1,2,3], index = ["a", "b", "c"])
x

a    1
b    2
c    3
dtype: int64

A series can be indexed in all of the same ways as a NumPy array, and also by index values. This means a series can also be used like an ordered dictionary (although its keys need not be unique).

In [6]:
x[0]

1

In [9]:
x[:1]

a    1
dtype: int64

In [5]:
x["a"]

1

In [12]:
x[0:2]

a    1
b    2
dtype: int64

In [18]:
x["b":"a"]

Series([], dtype: int64)

In [19]:
x = pd.Series([1,2,3], index = ["c", "b", "a"])
x

c    1
b    2
a    3
dtype: int64

In [20]:
x["a":"b"]

Series([], dtype: int64)

In [21]:
x["c":"b"]

c    1
b    2
dtype: int64

In [27]:
x = pd.Series([1,2,3], index = ["c", "a", "b"])
x is x["c":"b"]

False

In [37]:
x = pd.Series([1,2,3], index = ["c", "a", "b"])
x["a"]

2
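As a small sketch of the "keys need not be unique" remark above, the toy series below (not part of the lecture data) repeats the label `"a"`. Looking up a repeated label returns every match as a series, while a label that occurs once returns a scalar.

```python
import pandas as pd

# A toy series whose index contains the label "a" twice.
s = pd.Series([1, 2, 3], index=["a", "a", "b"])

dup = s["a"]     # two matches, so a Series is returned
single = s["b"]  # one match, so a scalar is returned

print(dup)
print(single)
print(s.index.is_unique)
```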

A series may have integer indices as well. 

In [40]:
x = pd.Series([1, 2, 3], index = [2, 0, 1])
x 

2    1
0    2
1    3
dtype: int64

In [43]:
x[x > 2]

1    3
dtype: int64

For indexing a series (and as we'll see later, also data frames):

* `[ ]` is by position, name, or condition. **Exception:** for an integer index it's by name or condition only.
* `.iloc[ ]` is by position
* `.loc[ ]` is by name or condition
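The exception for integer indexes is easy to miss, so here is a minimal sketch using a toy series with a shuffled integer index. With a scalar key, `[ ]` and `.loc[ ]` look up the label, while `.iloc[ ]` looks up the position.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s[0]          # integer index: [ ] looks up the label 0
by_position = s.iloc[0]  # first element, regardless of its label
by_loc = s.loc[0]        # same lookup as s[0] here

print(by_label, by_position, by_loc)
```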

In [44]:
y = pd.Series([1,2,3])
z = y[1:3]
z

1    2
2    3
dtype: int64

In [49]:
y[1]

2

In [50]:
y.iloc[0]

1

In [51]:
y.loc[1]

2

In [52]:
z = pd.Series([1, 2, 3, 4], index = [3j, 2j, 1j, -1j])
z

 0.0+3.0j    1
 0.0+2.0j    2
 0.0+1.0j    3
-0.0-1.0j    4
dtype: int64

In [53]:
z[3j]

1

In [56]:
# z.iloc[-1j]  # would raise a TypeError: .iloc only accepts integer positions

In [61]:
z.index

Index([3j, 2j, 1j, (-0-1j)], dtype='complex128')

### Data Frames

A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.

It represents tabular data as a collection of Series.

In [62]:
df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
df

Unnamed: 0,x,y
0,1,a
1,2,b
2,3,c


Data frames support similar indexing methods to series. However, for indexing with `[ ]`,

* Scalar values get columns by name
* Conditions or slices get rows

In [63]:
df

Unnamed: 0,x,y
0,1,a
1,2,b
2,3,c


In [64]:
df.iloc[0,0]

1

In [81]:
df[df['x']>2]

Unnamed: 0,x,y
2,3,c


In [83]:
df["y"]

0    a
1    b
2    c
Name: y, dtype: object

In [67]:
df['x']

0    1
1    2
2    3
Name: x, dtype: int64

In [69]:
df.iloc[0]

x    1
y    a
Name: 0, dtype: object

In [72]:
df[0:1]

Unnamed: 0,x,y
0,1,a


In [None]:
df 

In [73]:
df.loc[:,"x"] 

0    1
1    2
2    3
Name: x, dtype: int64

In [74]:
df[df["x"] > 2] # more convenient

Unnamed: 0,x,y
2,3,c


In [77]:
df.loc[df['x'] > 2,:]  # more principled (?)

Unnamed: 0,x,y
2,3,c


Since we are subsetting a DataFrame, a DataFrame is returned. 
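To make the return types concrete, the sketch below rebuilds the small `df` from earlier and checks what each kind of `[ ]` key returns. The list-of-names form `df[["x"]]` is not used in the cells above; it is included here as a way to keep a single column as a data frame.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

col = df["x"]           # scalar key: a single column, returned as a Series
sub = df[["x"]]         # list key: a one-column DataFrame
rows = df[df["x"] > 2]  # condition: matching rows, returned as a DataFrame

print(type(col).__name__, type(sub).__name__, type(rows).__name__)
```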

In [84]:
df.shape

(3, 2)

In [85]:
df.size

6

In [86]:
type(df)

pandas.core.frame.DataFrame

In [101]:
df.dtypes

x     int64
y    object
dtype: object

In [None]:
x = pd.Series(['1', '2'])

In [99]:
x.dtype

dtype('O')

More on indexing and selecting data with pandas can be found [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-slicing-with-labels).

### Missing Data

Pandas represents missing data with `NaN` and `None`, but these values do not exclusively mean missing data. For instance, `NaN` stands for "Not a Number" and is also the result of undefined computations. Pay attention to your data and code to determine whether values are missing or have some other meaning.
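As a small illustration of `NaN` arising from an undefined computation rather than from missing input, the sketch below divides a toy series by itself. It also shows why a dedicated missing-value check is needed: `NaN` never compares equal to anything, including itself.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0, 2.0])
ratio = s / s  # 0.0 / 0.0 is undefined, so the first element is NaN

detected = ratio.isna().tolist()         # dedicated check finds the NaN
compared = (ratio == np.nan).tolist()    # equality comparison never matches NaN

print(detected)
print(compared)
```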

You can create `NaN` values with NumPy.

In [102]:
import numpy as np 
np.nan

nan

In [103]:
df = pd.DataFrame({"x": [1, np.nan, 2], "y": np.arange(3), "z": ["x", "y", None]})
df

Unnamed: 0,x,y,z
0,1.0,0,x
1,,1,y
2,2.0,2,


Use the `.isna()` or `.isnull()` methods to detect missing values.

In [104]:
df.isna()

Unnamed: 0,x,y,z
0,False,False,False
1,True,False,False
2,False,False,True


In [109]:
df.isnull()

Unnamed: 0,x,y,z
0,False,False,False
1,True,False,False
2,False,False,True


In [110]:
df

Unnamed: 0,x,y,z
0,1.0,0,x
1,,1,y
2,2.0,2,


In [111]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x       2 non-null      float64
 1   y       3 non-null      int64  
 2   z       2 non-null      object 
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes


In [None]:
df 

In [113]:
(~df.isna()).sum().sum()

7

In [115]:
df.isna().sum().sum() # how many are missing

2

In [118]:
df

Unnamed: 0,x,y,z
0,1.0,0,x
1,,1,y
2,2.0,2,


In [120]:
df.sum(numeric_only=True)

x    3.0
y    3.0
dtype: float64

Let's deal with this warning.

In [None]:
df.sum(numeric_only=True).sum()

### Data Alignment

Pandas supports vectorized operations, but elements are <b>automatically</b> aligned by index. **Beware!!** This is a major difference compared to R.

In [125]:
x = pd.Series([1, 2, 3], index = ["a", "b", "c"])
y = pd.Series([1, 2, 3])
x

a    1
b    2
c    3
dtype: int64

In [126]:
y

0    1
1    2
2    3
dtype: int64

In [127]:
x + y

a   NaN
b   NaN
c   NaN
0   NaN
1   NaN
2   NaN
dtype: float64

You can use the `.reset_index()` method to reset the indexes on a series or data frame. Watch out: the method returns a new object and does not modify the original.

In [161]:
x.reset_index().set_index('index')[0]

index
a    1
b    2
c    3
Name: 0, dtype: int64

In [142]:
y

0    1
1    2
2    3
dtype: int64

In [148]:
x.reset_index(drop = True) # throw away the old index

0    1
1    2
2    3
dtype: int64

In [145]:
x

a    1
b    2
c    3
dtype: int64
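The sketch below contrasts the two behaviors on toy series. When the labels overlap, addition matches elements by label even if the order differs. To add by position instead, one option (among others) is to drop the labels first with `.reset_index(drop = True)`. The series `w` is introduced here for illustration only.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=["a", "b", "c"])
w = pd.Series([10, 20, 30], index=["c", "b", "a"])  # same labels, different order
y = pd.Series([1, 2, 3])                            # default integer index

by_label = x + w  # matched by label: a -> 1 + 30, b -> 2 + 20, c -> 3 + 10
by_position = x.reset_index(drop=True) + y  # both are now indexed 0, 1, 2

print(by_label)
print(by_position)
```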

### Reading Data

Pandas provides functions for reading (and writing) a variety of common formats. Most of their names begin with `read_`. For instance, we can read the dogs data from a CSV file:

In [162]:
dogs = pd.read_csv("../data/dogs_full.csv")

In [163]:
dogs.head()

Unnamed: 0,breed,group,datadog,popularity_all,popularity,lifetime_cost,intelligence_rank,longevity,ailments,price,food_cost,grooming,kids,megarank_kids,megarank,size,weight,height
0,Border Collie,herding,3.64,45,39.0,20143.0,1.0,12.52,2.0,623.0,324.0,weekly,low,1.0,29.0,medium,,20.0
1,Border Terrier,terrier,3.61,80,61.0,22638.0,30.0,14.0,0.0,833.0,324.0,weekly,high,2.0,1.0,small,13.5,
2,Brittany,sporting,3.54,30,30.0,22589.0,19.0,12.92,0.0,618.0,466.0,weekly,medium,3.0,11.0,medium,35.0,19.0
3,Cairn Terrier,terrier,3.53,59,48.0,21992.0,35.0,13.84,2.0,435.0,324.0,weekly,high,4.0,2.0,small,14.0,10.0
4,Welsh Springer Spaniel,sporting,3.34,130,81.0,20224.0,31.0,12.49,1.0,750.0,324.0,weekly,high,5.0,4.0,medium,,18.0
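The same functions also accept file-like objects, which makes it easy to experiment without a file on disk. The sketch below reads a tiny made-up CSV string (a stand-in, not the real dogs data) and writes it back out with `.to_csv()`. Note how the empty field is read as `NaN`.

```python
import io
import pandas as pd

# Made-up toy data standing in for a file on disk.
csv_text = "breed,weight\nBorder Collie,\nBrittany,35.0\n"

toy = pd.read_csv(io.StringIO(csv_text))  # the empty field is read as NaN
out = toy.to_csv(index=False)             # write the frame back to CSV text

print(toy)
print(out)
```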


### Inspecting Data

Series and data frames provide many of the same methods and attributes as NumPy arrays.

For a data frame, the `.dtypes` attribute gives the column types.

The type "object" means some non-numeric Python object, often a string.

In [164]:
dogs.dtypes

breed                 object
group                 object
datadog              float64
popularity_all         int64
popularity           float64
lifetime_cost        float64
intelligence_rank    float64
longevity            float64
ailments             float64
price                float64
food_cost            float64
grooming              object
kids                  object
megarank_kids        float64
megarank             float64
size                  object
weight               float64
height               float64
dtype: object

There are also several methods for quickly summarizing data.

In [165]:
dogs.describe()

Unnamed: 0,datadog,popularity_all,popularity,lifetime_cost,intelligence_rank,longevity,ailments,price,food_cost,megarank_kids,megarank,weight,height
count,87.0,172.0,87.0,91.0,132.0,135.0,148.0,146.0,87.0,87.0,87.0,86.0,159.0
mean,2.603678,87.122093,44.0,19819.538462,40.924242,10.956741,1.216216,876.815068,489.597701,43.954023,43.942529,44.97093,19.089623
std,0.570288,50.205335,25.258662,3102.475382,19.60356,1.995742,1.54981,461.172524,204.266894,25.288065,25.278153,35.52707,6.0124
min,0.99,1.0,1.0,12653.0,1.0,6.29,0.0,283.0,270.0,1.0,1.0,5.0,5.0
25%,2.185,43.75,22.5,17816.5,27.0,9.7,0.0,587.25,324.0,22.5,22.5,17.5,14.0
50%,2.71,87.5,44.0,20087.0,42.0,11.29,1.0,795.0,466.0,44.0,44.0,35.0,19.0
75%,3.035,130.25,65.5,21798.0,54.25,12.365,2.0,1042.25,466.0,65.5,65.5,62.5,24.125
max,3.64,173.0,87.0,26686.0,80.0,16.5,9.0,3460.0,1349.0,87.0,87.0,175.0,32.0


First, get the string columns (`object`), then describe them.

In [168]:
dogs.select_dtypes(include = ["object"]).describe()

Unnamed: 0,breed,group,grooming,kids,size
count,172,172,112,112,172
unique,172,7,3,3,3
top,Border Collie,terrier,weekly,high,medium
freq,1,28,88,67,60


In [169]:
dogs.select_dtypes(include = ["int64"]).describe()

Unnamed: 0,popularity_all
count,172.0
mean,87.122093
std,50.205335
min,1.0
25%,43.75
50%,87.5
75%,130.25
max,173.0


### Aggregation

Pandas also provides several methods for aggregating data, such as `.mean()`, `.median()`, `.std()`, and `.value_counts()`. They ignore missing values by default.

In [None]:
dogs.median(numeric_only=True)

In [None]:
dogs["price"].median()

In [None]:
dogs["group"].value_counts() # like R's table() with 1 arg
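To see what "ignore missing values by default" means in practice, here is a minimal sketch on toy series. The `skipna` and `dropna` parameters switch the default behavior off.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
default_mean = s.mean()             # NaN is skipped: (1 + 3) / 2
strict_mean = s.mean(skipna=False)  # NaN propagates into the result

g = pd.Series(["a", "b", None, "a"])
counts = g.value_counts()                  # the missing value is not counted
counts_all = g.value_counts(dropna=False)  # the missing value gets its own row

print(default_mean, strict_mean)
print(counts_all)
```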

For counting one group against another (crosstabulating), use `pd.crosstab()`.

In [None]:
pd.crosstab(dogs["group"], dogs["kids"]) # like R's table() with 2+ arg
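For a result that is small enough to check by hand, the sketch below crosstabulates a made-up four-row frame that reuses the `group` and `kids` column names. Combinations that never occur are filled with 0.

```python
import pandas as pd

# Made-up toy rows, not taken from dogs_full.csv.
toy = pd.DataFrame({
    "group": ["herding", "terrier", "terrier", "sporting"],
    "kids": ["low", "high", "high", "medium"],
})

table = pd.crosstab(toy["group"], toy["kids"])  # rows: group, columns: kids
print(table)
```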

### Applying Functions

You can also use Pandas to apply your own aggregation functions to columns or rows.

* `.apply()` applies a function column-by-column or row-by-row.
* `.applymap()` applies a function element-by-element. In Pandas 2.1 and later it is deprecated in favor of `DataFrame.map()`.

This is another way to vectorize code. Note that `.applymap()` is only available for data frames.


In [None]:
def spread(x):
    '''Returns spread. Input is a single column (or row)'''
    return x.max() - x.min()
    
dogs.select_dtypes(include = ["float64", "int64"]).apply(spread)
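The sketch below applies the same `spread` function to a toy frame both by column and by row (`axis = 1`). For the element-by-element case it uses `Series.map()` inside `.apply()`, which is one version-independent option and an assumption of this example rather than the method used in lecture.

```python
import pandas as pd

def spread(x):
    """Return the range (max minus min) of a single column or row."""
    return x.max() - x.min()

toy = pd.DataFrame({"a": [1, 5, 3], "b": [10, 20, 60]})

by_column = toy.apply(spread)       # one value per column
by_row = toy.apply(spread, axis=1)  # one value per row
doubled = toy.apply(lambda col: col.map(lambda v: 2 * v))  # element by element

print(by_column)
print(by_row)
print(doubled)
```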

### Grouping

Use the `.groupby()` method to group data before computing aggregate statistics.

In [None]:
dogs.head()

In [None]:
dogs.groupby("group").mean(numeric_only=True).reset_index()

By default, the groups become the index. You can keep them as regular columns by setting `as_index = False` when grouping.

In [None]:
dogs.groupby("group", as_index = False).mean(numeric_only=True)

You can group by multiple columns.

In [None]:
dogs.groupby(["group", "kids"]).mean(numeric_only=True).head()

On groups, the `.apply()` method computes group-by-group. It is the most general form of two other methods:

* `.agg()`, which applies a function to each group to compute summary statistics
* `.transform()`, which applies a function to each group to compute transformations (such as standardization)
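The difference between the two is mostly about the shape of the result, which the toy example below tries to show. `.agg()` returns one row per group, while `.transform()` returns one value per original row. The prices are made up, and centering on the group mean stands in for a fuller standardization.

```python
import pandas as pd

# Made-up toy prices, not taken from dogs_full.csv.
toy = pd.DataFrame({
    "group": ["herding", "herding", "terrier", "terrier"],
    "price": [600.0, 800.0, 400.0, 500.0],
})
grouped = toy.groupby("group")["price"]

summary = grouped.agg(["mean", "max"])                # one row per group
centered = grouped.transform(lambda p: p - p.mean())  # one value per original row

print(summary)
print(centered)
```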

## Tidying a Dataset

Do Americans prefer low fat milk over whole milk?

The USDA publishes data about dairy production. We can answer the question with the [Milk Sales Dataset](https://www.ers.usda.gov/webdocs/DataFiles/48685/fluidmilk.xlsx?v=5010.6).

Many of Python's visualization packages expect [tidy data](https://vita.had.co.nz/papers/tidy-data.pdf), which means:

1. Each feature must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
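As a quick illustration of these rules, the sketch below reshapes a made-up "wide" table, with one column per kind of milk, into a tidy one with one row per (year, kind) observation. It uses `.melt()`, which is one of several ways to do this. The numbers and the `Kind` and `Sales` names are placeholders chosen for the example.

```python
import pandas as pd

# Made-up wide table: one column per kind of milk.
wide = pd.DataFrame({
    "Year": [2000, 2001],
    "Whole": [10, 9],
    "Low": [5, 6],
})

# One row per (Year, Kind) pair, with the measurement in its own column.
tidy = wide.melt(id_vars="Year", var_name="Kind", value_name="Sales")
print(tidy)
```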

Let's tidy up the Milk Sales Dataset so we can make a line plot that shows how milk sales have changed over time.

In [None]:
import numpy as np
import pandas as pd

In [None]:
milk = pd.read_excel("../data/fluidmilk.xlsx")
milk.head()

In [None]:
milk = pd.read_excel("../data/fluidmilk.xlsx", skiprows = 1)
milk.head()

In [None]:
milk.columns

In [None]:
milk.columns = milk.columns.str.replace('\n', '')
milk.head()

In [None]:
milk = milk.rename(columns=lambda x: x.strip(' 012'))
milk.head()

In [None]:
milk = milk.rename(columns = {'Unnamed:': 'Year'})
milk.head()

In [None]:
milk.columns.values[[2,3,5,6]] = np.array(['Reduced', 'Low', 
                                            'Flavored Whole', 'Flavored Other'])

In [None]:
milk.head()

In [None]:
milk.dtypes

In [None]:
milk = milk.set_index('Year')  

In [None]:
milk.head()

In [None]:
milk = pd.read_excel("../data/fluidmilk.xlsx", skiprows = 1)  # skip the first row of the sheet
milk.columns = milk.columns.str.replace('\n', '')  # remove line breaks from the column names
milk = milk.rename(columns=lambda name: name.strip(' 12'))  # trim spaces and the digits 1, 2 from both ends
# overwrite selected column names by position
milk.columns.values[[0,2,3,5,6]] = np.array(['Year', 'Reduced', 'Low', 
                                             'Flavored Whole', 'Flavored Other'])
milk = milk[:-4] # get rid of the last four rows
milk = milk.set_index("Year") 
milk.head()

In [None]:
milk = milk.stack() 
milk

In [None]:
milk.index

In [None]:
milk = milk.reset_index()
milk

In [None]:
milk.columns.values[[False, True, True]] = np.array(["Kind", "Sales"])

In [None]:
milk.head()

In [None]:
milk.tail()