# Tabular data analysis

A lot of data in the social sciences comes in tabular form, so analyzing tabular data is a key component of any language used for social science. In this exercise, we will learn to load tabular data into Python and perform basic analyses. We will use data from San Francisco parking garages.

This file is a Jupyter "notebook", which can contain code, text, images, and output. You can run each code cell separately. One pitfall is that code cells can be (and, during development, often are) run out of order. This can create errors if a code cell depends on operations done in a later cell. It can also create non-reproducible results - for instance, if one cell multiplies all values in a column by two and that cell is run twice, the values will be multiplied by four. To help avoid these issues, I always select Kernel -> Restart Kernel and Run All Cells before finalizing results. This will restart Python, remove any leftover variables, and run the entire notebook top to bottom. If this produces errors or the results change, your code depended on the order in which you originally ran the cells, and should be investigated.

## Importing libraries

Since Python is a general-purpose language, most data analysis tasks will require a separate library. In this exercise, we'll be using the `pandas` library for tabular data access. To use a library, we must "import" it. In this case, we have added `as pd` on the end of our import statement so that the library will be imported with the name `pd` rather than `pandas`. This will save typing when we have to refer to the library later in our code, and is a near-universal convention in scientific Python programming.

Run the cell below by clicking in it and pressing the "play" button at the top of the screen, or by pressing Shift-Enter. A `[*]` will appear next to the cell to indicate that it is executing, and then this will be replaced by `[1]` when it is done to indicate the order the cells were run in.

In [None]:
import pandas as pd

## Reading data

Next, we need to read our data. Like many datasets, the San Francisco parking garage data is distributed as a CSV file. We use the `read_csv` function from pandas to read it. Unlike some other languages, Python requires you to spell out the library name to indicate you want to use a function from it; this is why we imported pandas as `pd`, so we don't have to keep typing out `pandas`.

We assign the result of the operation to a variable `data`. We can refer to the dataset by this name later. There's nothing special about the name `data`; we could have used any name consisting of letters, digits, and underscores, as long as it doesn't start with a digit.

Variables can hold any type of data - numbers, strings, or complex data structures such as tables. Unlike in some languages (e.g. Stata), variables don't refer to columns within a loaded dataset; the `data` variable contains the entire dataset.

We want to read the file "sfpark.csv" in the "data" directory. Since this notebook is in the "notebooks" directory, we would refer to the local file as "../data/sfpark.csv" - ".." means the directory above this one, "/data" is the data directory within that, and "/sfpark.csv" is the CSV file within that directory. `read_csv` also accepts a URL in place of a local path, and the cell below reads the copy of the file hosted on GitHub.

In [None]:
data = pd.read_csv("https://raw.githubusercontent.com/mattwigway/odum-intro-python/main/data/sfpark.csv")
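
As a minimal sketch of what `read_csv` does, the cell below parses a tiny CSV held in a string instead of a file. The rows are made up for illustration and are not taken from the real dataset; `io.StringIO` simply makes the string behave like a file.

```python
import io

import pandas as pd

# Hypothetical CSV content with made-up garages and counts.
csv_text = """facility,date,entries
Garage A,1/1/2012,5
Garage B,1/1/2012,10
"""

# read_csv accepts a local path, a URL, or a file-like object such as StringIO.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy.shape)  # (number of rows, number of columns)
```

The first line of the CSV becomes the column names, and each remaining line becomes one row.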

## Including outputs in the notebook

If a cell produces output, JupyterLab will display it inline. In the next cell, we output the _data frame_ `data` that we just loaded to see a preview of it.

In [None]:
data

## Basic data exploration

Pandas provides many basic analytical functions for columns, which can be accessed using `data.column.function()`. `data` refers to the data frame, `column` is the column name, and `function` is the function we want to execute (e.g. mean). The parentheses indicate to Python that `function` is executable code, and you want to run it.

In [None]:
data.facility.unique()

In [None]:
data.entries.mean()

### Exercise

- Compute the mean of the `exits` column
- Since the mean may be skewed by outliers, compute the median of the entries and exits columns.
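
To see why an outlier pulls the mean but not the median, here is a small sketch with made-up numbers (not values from the garage data):

```python
import pandas as pd

# Hypothetical entry counts: three typical records and one very large one.
toy_entries = pd.Series([10, 12, 11, 1000])

toy_mean = toy_entries.mean()      # pulled upward by the single large value
toy_median = toy_entries.median()  # middle of the sorted values
print(toy_mean, toy_median)
```

A large gap between the two numbers is a hint that a few extreme records dominate the mean.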

## Accessing columns with special characters

Columns can only be accessed using the `data.column` syntax if the column name doesn't contain any special characters and doesn't start with a number. We can use an alternate syntax to access columns with special characters. Here we also see the use of the `value_counts` function, which provides a count of the number of times each unique value occurs.

In [None]:
data["Usage Type"].value_counts()

## Renaming columns

I prefer to use the `data.column` syntax for accessing columns. We can rename the columns using the `data.rename()` function.

This code introduces a few new concepts. `pandas` provides not only functions based on columns, but also functions based on the entire dataset. Functions can take _arguments_, which are values provided when we call the function. In this case, we specify the column names, and that we want the operation to occur "in-place" - i.e. by modifying `data`, rather than creating a new data frame with the column name changes.

Lastly, the argument to `columns=` is a _dict_, short for dictionary. A dict contains keys (on the left side of the `:`) and values (on the right side). Values can be looked up based on the corresponding key.

In [None]:
data.rename(columns={"Usage Type": "usage_type"}, inplace=True)
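
The following sketch shows a dict on its own and what `rename` does when `inplace=True` is left out. It uses a hypothetical toy frame named `toy` with made-up values so that it does not touch `data`.

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
toy = pd.DataFrame({"Usage Type": ["Monthly", "Transient"], "entries": [10, 25]})

# A dict maps keys (left of the colon) to values (right of the colon).
new_names = {"Usage Type": "usage_type"}
print(new_names["Usage Type"])  # look up the value stored under a key

# Without inplace=True, rename returns a new data frame and leaves toy untouched.
renamed = toy.rename(columns=new_names)
print(list(toy.columns))
print(list(renamed.columns))
```

Assigning the result to a new variable, as here, is an alternative to `inplace=True` when you want to keep the original column names around.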

## Grouped data analysis, also known as split-apply-combine

Grouped data analysis is a very common pattern - rather than a mean over the entire dataset, we may want a mean by groups. For instance, the median being so different from the mean suggests outliers - perhaps one very large garage. Let's look at the mean entries by garage.

We do this by grouping the dataset by the unique values of the `facility` column, and then taking the mean of the entries column. We will get a single mean for each group. We can see that there are significant outliers, with several averages above 1000.

In [None]:
data.groupby("facility").entries.mean()

## Understanding the dataset

We have taken the mean number of entries across all the records for each garage. Is this the mean daily entries in each garage? Let's look at the dataset again. Does each row represent a full day of usage for a garage?

In [None]:
data

## Creating a daily dataset

We can group by garage and day, and sum entries. To group by multiple columns, we enclose the column names in `[]`, which creates a _list_ or array.

You can have multiple lines of code in a single code cell. In this case, only the output of the last line will be displayed. Here, we perform grouping and then display the result.

In [None]:
by_date = data.groupby(["facility", "date"]).entries.sum()
by_date

## Pandas data types

We have already worked with data frames, which contain tabular data. You may notice that the printout above looks different from the data frame printouts we've seen earlier. That is because the group by sum has returned a _series_. A series is just a vector of values of the same type (integers in this case), with an _index_. An index is a set of labels for the items. Tables also have an index. Our data table just has row numbers as its index. However, any result from groupby is indexed by the column(s) used to group. If there are multiple columns, the data is indexed first by the one specified first in `groupby`, and so on.

We can easily access rows by index value using the `.loc` attribute of a series or data frame. We enclose the two values we want to look up in `()`, which creates a _tuple_. This is basically the same as a list, except that it cannot be modified once created.

In [None]:
by_date.loc[("16th and Hoff Garage", "1/1/2012")]
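
Here is a small sketch of the same idea on made-up data, showing that a two-column `groupby` returns a series whose index has one level per grouping column. The garage names and counts are hypothetical.

```python
import pandas as pd

# Hypothetical records; garage "A" has two records on the first day.
toy = pd.DataFrame({
    "facility": ["A", "A", "A", "B"],
    "date": ["1/1/2012", "1/1/2012", "1/2/2012", "1/1/2012"],
    "entries": [5, 7, 3, 10],
})

toy_by_date = toy.groupby(["facility", "date"]).entries.sum()
print(type(toy_by_date).__name__)     # the result is a Series, not a DataFrame
print(list(toy_by_date.index.names))  # one index level per grouping column

# A tuple supplies one label per index level, in the same order as in groupby.
key = ("A", "1/1/2012")
print(toy_by_date.loc[key])
```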

## Converting indices back to regular columns

We can use the `reset_index` function to convert the series back to a regular data frame, with columns and an index based on row numbers, and assign the result back to the variable `by_date`, overwriting the original series.

In [None]:
by_date = by_date.reset_index()
by_date

## Exercise: per garage averages

Now that we have a single record for each garage and each day, compute the average daily entries by garage.

In [None]:
by_date.groupby("facility").entries.mean()
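
To illustrate why summing by day first changes the answer, the sketch below compares the two approaches on made-up data, assuming (as in the garage data) that a garage can have more than one record per day.

```python
import pandas as pd

# Hypothetical records: garage "A" reports twice on the first day.
toy = pd.DataFrame({
    "facility": ["A", "A", "A", "B"],
    "date": ["1/1/2012", "1/1/2012", "1/2/2012", "1/1/2012"],
    "entries": [5, 7, 3, 10],
})

# Mean over individual records: for A this is (5 + 7 + 3) / 3.
per_record = toy.groupby("facility").entries.mean()

# Mean over daily totals: for A this is (12 + 3) / 2.
daily = toy.groupby(["facility", "date"]).entries.sum().reset_index()
per_day = daily.groupby("facility").entries.mean()

print(per_record)
print(per_day)
```

The two results agree only for garages with exactly one record per day, such as "B" here.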

## Filtering data

Sometimes we don't want to work with the entire dataset. Let's extract just the data from garages in the Mission district. We can do this using _boolean indexing_, where we create an array of boolean (true/false) values used to select data, and use the `.loc` accessor to extract the values.

In [None]:
mission_data = data.loc[data.district == "Mission"]
mission_data

## Digging into filtering a bit more

What the above code does is create an array of booleans, one for each row, with a true value if that row is from a garage in the Mission. We can look at this array directly.

In [None]:
mission_bool = data.district == "Mission"
mission_bool

When that array is used in `.loc`, it selects all rows where the corresponding value is `True`.

In [None]:
data.loc[mission_bool]

## Exercise

Compute daily per-garage averages using just the data from the Mission District.

In [None]:
mission_by_date = mission_data.groupby(["facility", "date"]).entries.sum().reset_index()
mission_by_date.groupby("facility").entries.mean()

## Data types

Every column in a pandas data frame has a data type. These could be integers (usually represented as `int64` for a 64-bit integer), floating-point numbers (generally `float64`), objects (any Python object, but usually used for strings), dates, etc. We can see the data types in the `.dtypes` field of a data frame.

In [None]:
data.dtypes

## Categorical data

Oftentimes, string/object columns will have just a few unique values. In these cases, it is often valuable to convert these columns to a categorical data type. From a user perspective these behave almost the same as string columns, but can be stored and manipulated more efficiently. With large data sets, this translates to more efficient data analysis and lower memory usage. In this dataset, `usage_type`, `facility`, and `district` can all be considered categorical.

We can convert the relevant columns with `.astype("category")`.

In [None]:
data["usage_type"] = data.usage_type.astype("category")
data.dtypes
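
A rough sketch of the memory savings, using a made-up column of 100,000 repeated labels rather than the garage data. The exact byte counts depend on the pandas version and platform, so none are shown here.

```python
import pandas as pd

# Hypothetical column with only two distinct values, repeated many times.
labels = pd.Series(["Monthly", "Transient"] * 50_000)
as_category = labels.astype("category")

# deep=True also counts the memory used by the strings themselves.
string_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(string_bytes, category_bytes)
```

The categorical version stores each distinct label once plus a small integer code per row, which is why it takes less space when there are few unique values.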

### Exercise: convert the facility and district columns as well

In [None]:
data["facility"] = data.facility.astype("category")
data["district"] = data.district.astype("category")

### Categorical data accessors

There are a number of functions for working with categorical data available within the "category accessor", `data.column.cat`. For example, we can display the categories in a categorical column:

In [None]:
data.usage_type.cat.categories

## Working with dates

We see that the date is represented as an `object`, not a date. The date has been read as a string rather than parsed as a date. We can use the `pd.to_datetime` function to parse dates. Specifying a date format is optional, but I always like to do it to ensure that the dates are parsed correctly. Date formats are specified using [the codes found here](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes). I like to run `pd.to_datetime` once to make sure it parsed correctly before assigning it to a column in the data frame.

In [None]:
pd.to_datetime(data.date, format="%m/%d/%Y")

Now, we can assign it to a column in the data frame. Creating new columns or overwriting old columns requires using the `dataframe["column name"]` syntax.

In [None]:
data["date"] = pd.to_datetime(data.date, format="%m/%d/%Y")

Now, we can ensure that data types are correct. Notice that the data type for the `date` column is no longer `object`, but `datetime64[ns]`.

In [None]:
data.dtypes
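
As a small illustration of why an explicit format helps, the sketch below parses one made-up date string that is valid both month-first and day-first. The two formats silently give different dates.

```python
import pandas as pd

# Hypothetical date string that is valid in both month-first and day-first formats.
raw = pd.Series(["1/2/2012"])

month_first = pd.to_datetime(raw, format="%m/%d/%Y")  # read as January 2, 2012
day_first = pd.to_datetime(raw, format="%d/%m/%Y")    # read as February 1, 2012
print(month_first.iloc[0], day_first.iloc[0])
```

Neither call raises an error, so checking the parsed output against a few dates you know is the only way to catch a wrong format.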

### Filtering by date

Standard mathematical operations work with date types; for instance, we can extract all records between January 15 and February 15, 2012.

Note that we are filtering by two conditions here, and putting them together with an `&`, meaning and. In Python, unlike many other languages, you need to put parentheses around each condition due to order of operations. Otherwise, the "and" operation will take place before the comparison operations, leading to errors or incorrect results.

In [None]:
data_jf_2012 = data.loc[
    (data.date <= pd.to_datetime("2/15/2012", format="%m/%d/%Y")) &
    (data.date >= pd.to_datetime("1/15/2012", format="%m/%d/%Y"))
]
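
The effect of the parentheses can be sketched on a small made-up series of integers. The name `s` and its values are hypothetical; the same precedence rule applies to the date comparisons above.

```python
import pandas as pd

# Hypothetical values standing in for a column.
s = pd.Series([1, 3, 5, 7])

# Parentheses make each comparison run first; & then combines the boolean results.
in_range = (s >= 3) & (s <= 5)
print(in_range.tolist())

# Without parentheses, & binds more tightly than the comparisons,
# so Python evaluates 3 & s first and the expression fails.
try:
    s >= 3 & s <= 5
    failed = False
except (ValueError, TypeError):
    failed = True
print(failed)

# Series.between is an alternative for inclusive range checks.
same = s.between(3, 5)
```

Here `same` holds the same booleans as `in_range`, since `between` includes both endpoints by default.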

#### Exercise: compute mean entries per day by garage in January/February 2012

In [None]:
by_date_jf_2012 = data_jf_2012.groupby(["facility", "date"]).entries.sum().reset_index()
by_date_jf_2012.groupby("facility").entries.mean()

### Operations with dates

There are a number of date functions available with the `.dt` DateTime accessor. For instance, we can filter to only weekend days using the `day_name` function.

In [None]:
weekends = data.loc[data.date.dt.day_name().isin(["Saturday", "Sunday"])]
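
As a sketch of another way to express the same filter, the `.dt` accessor also provides `dayofweek`, which numbers the days from Monday = 0 to Sunday = 6. The three dates below are made up for illustration.

```python
import pandas as pd

# Hypothetical dates: a Sunday, a Monday, and a Saturday in January 2012.
dates = pd.to_datetime(
    pd.Series(["1/1/2012", "1/2/2012", "1/7/2012"]), format="%m/%d/%Y"
)

# Weekend test by day name, as in the cell above.
by_name = dates.dt.day_name().isin(["Saturday", "Sunday"])

# Weekend test by day number: 5 is Saturday and 6 is Sunday.
by_number = dates.dt.dayofweek >= 5

print(by_name.tolist())
print(by_number.tolist())
```

The numeric form avoids any risk of misspelling a day name, at the cost of being a little less readable.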

#### Exercise: compute mean daily _weekday_ entries, by garage

In [None]:
weekdays = data.loc[data.date.dt.day_name().isin(["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"])]
weekdays_by_date = weekdays.groupby(["facility", "date"]).entries.sum().reset_index()
weekdays_by_date.groupby("facility").entries.mean()

## Checking our work

Python has an `assert` statement, which will produce an error if whatever condition comes after it is not true. For instance, we can confirm that our weekend dataset contains Saturday and Sunday records. This would error if we had, for example, misspelled Saturday above and only included Sunday records. I like to use assert statements liberally throughout my code to check that operations completed successfully.

In [None]:
# Compare as sets so the check does not depend on which day appears first in the data
assert set(weekends.date.dt.day_name().unique()) == {"Saturday", "Sunday"}
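
A minimal sketch of how `assert` behaves, using a made-up list of totals. An optional message after a comma is attached to the error, which makes a failure easier to diagnose.

```python
# Hypothetical daily totals, used only to demonstrate assert.
totals = [12, 3, 10]

# Passes silently because the condition is true.
assert all(t >= 0 for t in totals), "entries should never be negative"

# Fails because the condition is false; the message is attached to the error.
try:
    assert len(totals) == 4, "expected four daily totals"
    message = None
except AssertionError as err:
    message = str(err)
print(message)
```

In a notebook you would normally leave out the `try`/`except` so that a failed assertion stops execution at the cell where the problem occurred.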

### Exercise

Create two assert statements to make sure that our `data_jf_2012` dataset contains only dates from January 15, 2012 to February 15, 2012.

In [None]:
assert (data_jf_2012.date >= pd.to_datetime("1/15/2012", format="%m/%d/%Y")).all()
assert (data_jf_2012.date <= pd.to_datetime("2/15/2012", format="%m/%d/%Y")).all()