# Columns and Variables

Recall that the columns of a tabular data set represent variables (or fields). They are the measurements that we make on each observation.

As an example, let's consider the variables in the OKCupid data set.

In [None]:
import pandas as pd

In [None]:
df_okcupid = pd.read_csv("https://raw.githubusercontent.com/kevindavisross/data301/main/data/okcupid.csv")
df_okcupid

## Types of Variables

There is a fundamental difference between variables like `age` and `height`, which are measured on a numeric scale, and variables like `religion` and `pets`, which are not.

Variables that can be measured on a numeric scale are called **quantitative variables** (or numerical variables).

Variables like `religion` or `pets` that place the observational units into a relatively limited set of categories are called **categorical variables**. We call each possible value of a categorical variable a "level". Levels are usually non-numeric.

Just because a variable happens to contain numbers does not necessarily make it "quantitative". For example, we could code the variable *Section* for the students in DATA 301 this quarter as 5 or 7, but Section is still a categorical variable; we could easily replace 5 with "first" or "early" and 7 with "second" or "late". To see if a variable is quantitative, ask: "Would it make sense to take an average?" It doesn't really make sense to take an average of Section. On the other hand, if the variable is *number of classes enrolled in*, it does make sense to take an average (for example, "CP students take on average 4.2 classes per quarter"), even though this variable only takes discrete values like 0, 1, 2, 3, ...
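
The "would an average make sense?" test can be sketched in code. The section numbers below are made up for illustration and are not part of the OKCupid data.

```python
import pandas as pd

# Hypothetical data: section numbers for six students.
sections = pd.Series([5, 7, 5, 5, 7, 7], name="section")

# pandas will compute a mean because the values are stored as integers,
# but no student is actually in "section 6".
section_mean = sections.mean()

# Counting the students in each level is the more meaningful summary
# for a categorical variable.
section_counts = sections.value_counts()

print(section_mean)
print(section_counts)
```

The mean runs without error, which is exactly the point: the software cannot tell that the numbers are only labels, so the analyst has to decide.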

Some variables do not fit neatly into the categorical/quantitative classification. For example, the variable `essay1` contains users' answers to the prompt "What I’m doing with my life". This variable is obviously not quantitative, but it is not categorical either because every user has a unique answer. In other words, this variable does not place each observation into a reasonable set of categories. We will group such variables into an "other" category.
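
One rough way to spot such variables is to count the distinct values in each column. The sketch below uses a small made-up `DataFrame` (the column names and values are hypothetical); a count close to the number of rows suggests free text rather than categories, although this is only a heuristic.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and one free-text column.
toy = pd.DataFrame({
    "pets": ["likes dogs", "likes cats", "likes dogs", "likes dogs"],
    "essay": [
        "working on my thesis",
        "travelling a lot",
        "teaching kids to swim",
        "writing a novel",
    ],
})

# Number of distinct values per column.
n_unique = toy.nunique()
print(n_unique)
```

Here `pets` has only two levels, while every `essay` value is different.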

Every variable can be classified into one of these three main **types**:
- quantitative,
- categorical,
- other.

The type of the variable dictates how we analyze that variable.

## Selecting Variables

Suppose we want to select the `age` column from the `DataFrame` above. There are three ways to do this.

1\. Access the column as you would a key in a `dict`.

In [None]:
df_okcupid["age"]

2\. Use `.loc`, specifying both the rows and columns. (A bare colon `:` is Python slice notation for "everything"; here it selects all rows.)

In [None]:
df_okcupid.loc[:, "age"]

3\. Access the column as an attribute of the `DataFrame`.

In [None]:
df_okcupid.age

Method 3 (attribute access) is the most concise. However, it does not work if the variable name contains spaces or special characters, begins with a number, or matches an existing attribute of `DataFrame`. For example, if `df_okcupid` had a column called `info`, `df_okcupid.info` would not return the column because `info` is already the name of a `DataFrame` method (one that prints some basic information about the data frame).
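
The limitations of attribute access can be illustrated with a small made-up `DataFrame`. The column names `info` and `body type` are chosen only to trigger the two problems described above.

```python
import pandas as pd

# Hypothetical toy data with two awkward column names.
toy = pd.DataFrame({
    "age": [22, 35],
    "info": ["a", "b"],
    "body type": ["fit", "average"],
})

# Attribute access works for a simple name like "age".
age_col = toy.age

# toy.info is the DataFrame method, not the column.
info_attr = toy.info

# Bracket access always returns the column.
info_col = toy["info"]

# A name with a space cannot be written as an attribute at all,
# so brackets are the only option.
body_col = toy["body type"]

print(callable(info_attr))
print(type(info_col).__name__)
```

Because bracket access works for every column name, it is the safer default in code that has to handle arbitrary data.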

Notice that when you select a single column, the result is no longer a `DataFrame`. It is instead a `Series`.

In [None]:
type(df_okcupid["age"])

Here, a `Series` stores a single variable/column (across multiple observations/rows). In the previous section, we saw that a `Series` can also be used to store a single observation/row (across multiple variables/columns). To summarize:

- A `Series` stores one-dimensional data (i.e., a single observation/row or variable/column).
- A `DataFrame` stores two-dimensional data (i.e., both observations/rows and variables/columns).
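
As a quick check of this summary, the sketch below builds a small made-up `DataFrame` and compares the type and shape of one column, one row, and the whole table.

```python
import pandas as pd

# Hypothetical toy data: three observations, two variables.
toy = pd.DataFrame({
    "age": [22, 35, 38],
    "religion": ["agnosticism", "atheism", "catholicism"],
})

age_col = toy["age"]    # one variable across all rows
first_row = toy.loc[0]  # one observation across all columns

print(type(age_col).__name__, age_col.shape)      # Series (3,)
print(type(first_row).__name__, first_row.shape)  # Series (2,)
print(type(toy).__name__, toy.shape)              # DataFrame (3, 2)
```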

To select multiple columns, you would pass in a _list_ of variable names, instead of a single variable name. For example, to select both `age` and `religion`, either of the two methods below would work (and produce the same result):

In [None]:
df_okcupid[["age", "religion"]]

In [None]:
df_okcupid.loc[:, ["age", "religion"]]
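
To confirm that the two approaches agree, the results can be compared with `.equals()`. The sketch below uses a small made-up `DataFrame` rather than the OKCupid data. It also shows that passing a list with a single name returns a one-column `DataFrame`, not a `Series`.

```python
import pandas as pd

# Hypothetical toy data with three variables.
toy = pd.DataFrame({
    "age": [22, 35, 38],
    "religion": ["agnosticism", "atheism", "catholicism"],
    "pets": ["likes dogs", "likes cats", "has cats"],
})

by_brackets = toy[["age", "religion"]]
by_loc = toy.loc[:, ["age", "religion"]]

# True when both results have the same labels, values, and dtypes.
same_result = by_brackets.equals(by_loc)
print(same_result)

# A list containing one name still gives a DataFrame.
single_col_frame = toy[["age"]]
print(type(single_col_frame).__name__, single_col_frame.shape)
```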

## Type Inference and Casting


Pandas tries to infer the type of each variable automatically. If every value in a column (except for missing values) is a number, then Pandas will treat that variable as quantitative. Otherwise, the variable is treated as categorical.

To determine the type that Pandas inferred, select that variable using the methods above and look at the `dtype` reported at the bottom of the output. A `dtype` of `float64` or `int64` indicates that the variable is quantitative. For example, the `age` variable has a `dtype` of `int64`, so it is quantitative.

In [None]:
df_okcupid["age"]

On the other hand, the `religion` variable has a `dtype` of `object`, so `pandas` will treat it as categorical.

In [None]:
df_okcupid["religion"]
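
The inference rule can be seen on a tiny, made-up CSV. The sketch below reads the text from memory with `io.StringIO` and prints the `dtypes` attribute, which lists the inferred type of every column at once. The exact name reported for text columns is an assumption about your pandas version: it is `object` in the versions this section describes, but some newer versions may report a dedicated string dtype.

```python
import io

import pandas as pd

# Hypothetical CSV text: "height" has a missing value,
# and "income" contains one non-numeric entry.
csv_text = """age,height,income,religion
22,75,unknown,agnosticism
35,,80000,atheism
38,68,20000,
"""

toy = pd.read_csv(io.StringIO(csv_text))

# One dtype per column.
print(toy.dtypes)
```

In this sketch `age` is read as an integer column, `height` becomes a float column because of the missing value, and a single non-numeric entry is enough to make `income` non-numeric.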

Sometimes it is necessary to convert quantitative variables to categorical variables and vice versa. This can be achieved using the `.astype()` method of a `Series`. For example, to convert `age` to a categorical variable, we simply "cast" its values to strings. (More on casting later.)

In [None]:
df_okcupid["age"].astype(str)

To save this as a column in the `DataFrame`, we assign it to a column called `age_cat`. Note that this column does not exist yet! It will be created at the time of assignment.

In [None]:
df_okcupid["age_cat"] = df_okcupid["age"].astype(str)

# Check that age_cat is a column in this DataFrame; see the last column
df_okcupid

In [None]:
# Check that the dtype of age_cat is object
df_okcupid["age_cat"]
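
Casting also works in the other direction. The sketch below uses made-up values: `.astype(int)` converts strings that all look like integers, and `pd.to_numeric` with `errors="coerce"` is one option when some entries are not numbers (they become missing values instead of raising an error).

```python
import pandas as pd

# Hypothetical ages stored as text.
ages_as_text = pd.Series(["22", "35", "38"])

# Every value parses as an integer, so astype(int) succeeds.
ages_as_numbers = ages_as_text.astype(int)
print(ages_as_numbers.sum())

# With a non-numeric entry, astype(int) would raise an error;
# to_numeric with errors="coerce" replaces such entries with NaN.
messy = pd.Series(["22", "unknown", "38"])
cleaned = pd.to_numeric(messy, errors="coerce")
print(cleaned)
```

Once the values are numeric again, summaries such as the mean are available; whether they are meaningful still depends on the type of the variable.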