# Columns and Variables

Recall that the columns of a tabular data set represent variables (or fields). They are the measurements that we make on each observation.

As an example, let's consider the variables in the OKCupid data set.

In [1]:
import pandas as pd

In [2]:
df_okcupid = pd.read_csv("https://raw.githubusercontent.com/kevindavisross/data301/main/data/okcupid.csv")
df_okcupid

| | age | body_type | diet | drinks | drugs | education | essay0 | essay1 | essay2 | essay3 | ... | location | offspring | orientation | pets | religion | sex | sign | smokes | height | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31 | NaN | mostly vegetarian | socially | sometimes | graduated from college/university | 75% nice, 45% shy, 80% stubborn, 100% charming... | i'm a new nurse. it rules. | multiple-choice questions, dancing. | it depends on the people. | ... | san francisco, california | might want kids | gay | likes cats | buddhism | f | taurus and it&rsquo;s fun to think about | no | 67.0 | single |
| 1 | 25 | average | NaN | socially | NaN | working on college/university | i like trees, spending long periods of time co... | studying landscape horticulture, beekeeping, g... | wasting time, making breakfast, nesting | i have a lot of freckles | ... | oakland, california | NaN | gay | NaN | NaN | m | sagittarius and it&rsquo;s fun to think about | no | 66.0 | single |
| 2 | 43 | curvy | NaN | rarely | never | graduated from masters program | NaN | NaN | NaN | NaN | ... | san francisco, california | has a kid | straight | likes dogs and has cats | other and laughing about it | f | leo and it&rsquo;s fun to think about | trying to quit | 65.0 | single |
| 3 | 31 | average | NaN | socially | never | NaN | i am a seeker of laughs ,music ,magick good pe... | i strive to live life to the fullest and to tr... | i am good at my magic and weaving a world of i... | i am guessing y'all would notice my jewelry an... | ... | san francisco, california | doesn&rsquo;t want kids | gay | NaN | other and very serious about it | m | capricorn and it&rsquo;s fun to think about | trying to quit | 70.0 | single |
| 4 | 34 | NaN | NaN | socially | NaN | graduated from ph.d program | i've just moved here from london after finishi... | i'm doing a postdoc in psychology at stanford | NaN | NaN | ... | san francisco, california | NaN | gay | NaN | NaN | m | cancer but it doesn&rsquo;t matter | NaN | 71.0 | single |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2995 | 24 | athletic | mostly anything | socially | sometimes | graduated from college/university | recent relocatee to san francisco. i'm writing... | cranking out two years of private equity so i ... | writing. criticizing. partying partying yeah..... | i've been described as 'all-american.' not sur... | ... | san francisco, california | NaN | straight | NaN | catholicism | m | NaN | no | 70.0 | single |
| 2996 | 50 | fit | NaN | rarely | never | graduated from college/university | i'm generally happy and typically spend my tim... | i was raised with left-wing politics, pbs, bal... | i'm great with kids, dogs, cats and recycling.... | gee, i don't know. . . that i'm smiling, that ... | ... | oakland, california | NaN | straight | has dogs | agnosticism | f | scorpio but it doesn&rsquo;t matter | no | 63.0 | single |
| 2997 | 31 | thin | vegetarian | socially | sometimes | NaN | NaN | i like to move around, therefore my life entai... | having fun and not taking life too seriously, ... | my smile and hair. | ... | san francisco, california | NaN | straight | NaN | NaN | f | NaN | no | 64.0 | single |
| 2998 | 31 | athletic | mostly vegetarian | socially | sometimes | graduated from college/university | i work with seniors and i love it, so believe ... | going down in flames. | being a charming first date. | i dress like an adorable idiot. | ... | walnut creek, california | NaN | straight | likes dogs and has cats | catholicism and laughing about it | f | aries and it&rsquo;s fun to think about | when drinking | 62.0 | single |


## Types of Variables

There is a fundamental difference between variables like `age` and `height`, which are measured on a numeric scale, and variables like `religion` and `pets`, which are not.

Variables that can be measured on a numeric scale are called **quantitative variables** (or numerical variables).

Variables like `religion` or `pets` that place the observational units into a relatively limited set of categories are called **categorical variables**. We call each possible value of a categorical variable a "level". Levels are usually non-numeric.

Just because a variable happens to contain numbers does not necessarily make it "quantitative". For example, we could code the variable *Section* for the students in DATA 301 this quarter as 5 or 7, but Section is still a categorical variable; we could easily replace 5 with "first" or "early" and 7 with "second" or "late". To see if a variable is quantitative, ask, "Would it make sense to take an average?" It doesn't really make sense to take an average of Section. On the other hand, if the variable is *number of classes enrolled in*, it does make sense to take an average (for example, "CP students take on average 4.2 classes per quarter"), even though this variable only takes discrete values like 0, 1, 2, 3, ...
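The "would an average make sense?" test can be sketched on a tiny example. The `DataFrame` below is made up for illustration (the names `df_students`, `section`, and `num_classes` are assumptions, not part of the OKCupid data).

```python
import pandas as pd

# Hypothetical data for illustration only; not taken from the OKCupid data set.
df_students = pd.DataFrame({
    "section": [5, 7, 5, 7],       # numeric codes for a categorical variable
    "num_classes": [4, 3, 5, 4],   # a genuinely quantitative variable
})

# pandas will compute both means, but only one of them is meaningful.
mean_section = df_students["section"].mean()      # 6.0, yet there is no "Section 6"
mean_classes = df_students["num_classes"].mean()  # 4.0 classes on average

# Relabeling the codes loses no information, a sign the variable is categorical.
section_labels = df_students["section"].map({5: "early", 7: "late"})
print(mean_section, mean_classes)
print(section_labels.tolist())
```

The software cannot tell the difference between the two columns; deciding which average is meaningful is up to the analyst.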

Some variables do not fit neatly into the categorical/quantitative classification. For example, the variable `essay1` contains users' answers to the prompt "What I’m doing with my life". This variable is obviously not quantitative, but it is not categorical either, because essentially every user has a unique answer. In other words, this variable does not place each observation into a reasonable set of categories. We will group such variables into an "other" category.

Every variable can be classified into one of these three main **types**:
- quantitative,
- categorical,
- other.

The type of the variable dictates how we analyze that variable.

## Selecting Variables

Suppose we want to select the `age` column from the `DataFrame` above. There are three ways to do this.

1\. Access the column as you would a key in a `dict`.

In [3]:
df_okcupid["age"]

0       31
1       25
2       43
3       31
4       34
        ..
2995    24
2996    50
2997    31
2998    31
2999    60
Name: age, Length: 3000, dtype: int64

2\. Use `.loc`, specifying both the rows and columns. (The colon `:` is Python slice shorthand for "all"; here it selects every row.)

In [4]:
df_okcupid.loc[:, "age"]

0       31
1       25
2       43
3       31
4       34
        ..
2995    24
2996    50
2997    31
2998    31
2999    60
Name: age, Length: 3000, dtype: int64

3\. Access the column as an attribute of the `DataFrame`.

In [5]:
df_okcupid.age

0       31
1       25
2       43
3       31
4       34
        ..
2995    24
2996    50
2997    31
2998    31
2999    60
Name: age, Length: 3000, dtype: int64

Method 3 (attribute access) is the most concise. However, it does not work if the variable name contains spaces or special characters, begins with a number, or matches an existing attribute of `DataFrame`. For example, if `df_okcupid` had a column called `info`, `df_okcupid.info` would not return the column, because `info` is already the name of a `DataFrame` method (`df_okcupid.info()` prints some basic information about the data frame).
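A minimal sketch of this pitfall, using a small made-up `DataFrame` (the name `df_demo` and its columns are assumptions chosen to trigger the problem):

```python
import pandas as pd

# Hypothetical DataFrame whose column names cause trouble for attribute access.
df_demo = pd.DataFrame({
    "info": ["a", "b"],            # clashes with the DataFrame.info method
    "body type": ["fit", "thin"],  # contains a space
})

# Attribute access finds the existing method, not the column.
attr_result = df_demo.info
is_method = callable(attr_result)

# Bracket access returns the column regardless of its name.
col_info = df_demo["info"]
col_body = df_demo["body type"]
print(is_method, type(col_info).__name__, col_body.tolist())
```

Because bracket access (and `.loc`) work for any column name, they are the safer default when column names are not known in advance.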

Notice that when you select a single column, the result is no longer a `DataFrame`. It is instead a `Series`.

In [6]:
type(df_okcupid["age"])

pandas.core.series.Series

Notice that a `Series` is used here to store a single variable/column (across multiple observations/rows). In the previous section, we saw that a `Series` can also be used to store a single observation/row (across multiple variables/columns). To summarize,

- A `Series` stores one-dimensional data (i.e., a single observation/row or variable/column).
- A `DataFrame` stores two-dimensional data (i.e., both observations/rows and variables/columns).
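The summary above can be checked on a small made-up `DataFrame`. In this sketch, `df_demo` is hypothetical, and selecting a row with `.loc[0]` is assumed to match how rows were selected in the previous section.

```python
import pandas as pd

# Hypothetical three-row DataFrame for illustration.
df_demo = pd.DataFrame({"age": [31, 25, 43], "sex": ["f", "m", "f"]})

column = df_demo["age"]  # one variable across all observations
row = df_demo.loc[0]     # one observation across all variables

# Both selections are one-dimensional Series; the DataFrame is two-dimensional.
print(type(column).__name__, column.shape)
print(type(row).__name__, row.shape)
print(type(df_demo).__name__, df_demo.shape)
```

The column `Series` is indexed by the row labels, while the row `Series` is indexed by the column names.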

To select multiple columns, you would pass in a _list_ of variable names, instead of a single variable name. For example, to select both `age` and `religion`, either of the two methods below would work (and produce the same result):

In [7]:
df_okcupid[["age", "religion"]]

| | age | religion |
|---|---|---|
| 0 | 31 | buddhism |
| 1 | 25 | NaN |
| 2 | 43 | other and laughing about it |
| 3 | 31 | other and very serious about it |
| 4 | 34 | NaN |
| ... | ... | ... |
| 2995 | 24 | catholicism |
| 2996 | 50 | agnosticism |
| 2997 | 31 | NaN |
| 2998 | 31 | catholicism and laughing about it |


In [8]:
df_okcupid.loc[:, ["age", "religion"]]

| | age | religion |
|---|---|---|
| 0 | 31 | buddhism |
| 1 | 25 | NaN |
| 2 | 43 | other and laughing about it |
| 3 | 31 | other and very serious about it |
| 4 | 34 | NaN |
| ... | ... | ... |
| 2995 | 24 | catholicism |
| 2996 | 50 | agnosticism |
| 2997 | 31 | NaN |
| 2998 | 31 | catholicism and laughing about it |
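One consequence of passing a list is worth a quick sketch: a list containing a single name still returns a `DataFrame`, whereas a bare string returns a `Series`. The `df_demo` object below is a made-up stand-in for `df_okcupid`.

```python
import pandas as pd

# Hypothetical two-row stand-in for df_okcupid.
df_demo = pd.DataFrame({"age": [31, 25], "religion": ["buddhism", None]})

as_series = df_demo["age"]   # a string selects one column as a Series
as_frame = df_demo[["age"]]  # a one-element list selects a one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The same distinction holds for `.loc`: `df_demo.loc[:, "age"]` gives a `Series`, and `df_demo.loc[:, ["age"]]` gives a `DataFrame`.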


## Type Inference and Casting


Pandas tries to infer the type of each variable automatically. If every value in a column (except for missing values) is a number, then pandas stores the column with a numeric `dtype`, and the variable is treated as quantitative. Otherwise, the variable is treated as categorical.

To determine the type that pandas inferred, select that variable using the methods above and look at its `dtype`. A `dtype` of `float64` or `int64` indicates that the variable is quantitative. For example, the `age` variable has a `dtype` of `int64`, so it is quantitative.

In [9]:
df_okcupid["age"]

0       31
1       25
2       43
3       31
4       34
        ..
2995    24
2996    50
2997    31
2998    31
2999    60
Name: age, Length: 3000, dtype: int64

On the other hand, the `religion` variable has a `dtype` of `object`, so pandas will treat it as categorical.

In [10]:
df_okcupid["religion"]

0                                buddhism
1                                     NaN
2             other and laughing about it
3         other and very serious about it
4                                     NaN
                      ...                
2995                          catholicism
2996                          agnosticism
2997                                  NaN
2998    catholicism and laughing about it
2999                                other
Name: religion, Length: 3000, dtype: object
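Rather than inspecting one column at a time, the inferred types of all columns can be listed at once with the `dtypes` attribute. The sketch below uses a small made-up `DataFrame` (`df_demo` is an assumption, not the OKCupid data); it also illustrates that a numeric column with a missing value is still inferred as numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names mirror a few OKCupid columns.
df_demo = pd.DataFrame({
    "age": [31, 25, 43],
    "height": [67.0, np.nan, 65.0],           # missing value, still numeric
    "religion": ["buddhism", None, "other"],  # text, so not numeric
})

# One dtype per column. The text column shows as object in the output above;
# newer pandas versions may report a dedicated string dtype instead.
all_dtypes = df_demo.dtypes
print(all_dtypes)

# Names of the columns that pandas inferred to be numeric.
numeric_cols = df_demo.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

Here `select_dtypes(include="number")` keeps only the columns with a numeric `dtype`, which is a convenient way to list the variables pandas regards as quantitative.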

Sometimes it is necessary to convert quantitative variables to categorical variables and vice versa. This can be achieved using the `.astype()` method of a `Series`. For example, to convert `age` to a categorical variable, we simply "cast" its values to strings. (More on casting later.)

In [11]:
df_okcupid["age"].astype(str)

0       31
1       25
2       43
3       31
4       34
        ..
2995    24
2996    50
2997    31
2998    31
2999    60
Name: age, Length: 3000, dtype: object

To save this as a column in the `DataFrame`, we assign it to a column called `age_cat`. Note that this column does not exist yet! It will be created at the time of assignment.

In [12]:
df_okcupid["age_cat"] = df_okcupid["age"].astype(str)

# Check that age_cat is a column in this DataFrame; see the last column
df_okcupid

| | age | body_type | diet | drinks | drugs | education | essay0 | essay1 | essay2 | essay3 | ... | offspring | orientation | pets | religion | sex | sign | smokes | height | status | age_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31 | NaN | mostly vegetarian | socially | sometimes | graduated from college/university | 75% nice, 45% shy, 80% stubborn, 100% charming... | i'm a new nurse. it rules. | multiple-choice questions, dancing. | it depends on the people. | ... | might want kids | gay | likes cats | buddhism | f | taurus and it&rsquo;s fun to think about | no | 67.0 | single | 31 |
| 1 | 25 | average | NaN | socially | NaN | working on college/university | i like trees, spending long periods of time co... | studying landscape horticulture, beekeeping, g... | wasting time, making breakfast, nesting | i have a lot of freckles | ... | NaN | gay | NaN | NaN | m | sagittarius and it&rsquo;s fun to think about | no | 66.0 | single | 25 |
| 2 | 43 | curvy | NaN | rarely | never | graduated from masters program | NaN | NaN | NaN | NaN | ... | has a kid | straight | likes dogs and has cats | other and laughing about it | f | leo and it&rsquo;s fun to think about | trying to quit | 65.0 | single | 43 |
| 3 | 31 | average | NaN | socially | never | NaN | i am a seeker of laughs ,music ,magick good pe... | i strive to live life to the fullest and to tr... | i am good at my magic and weaving a world of i... | i am guessing y'all would notice my jewelry an... | ... | doesn&rsquo;t want kids | gay | NaN | other and very serious about it | m | capricorn and it&rsquo;s fun to think about | trying to quit | 70.0 | single | 31 |
| 4 | 34 | NaN | NaN | socially | NaN | graduated from ph.d program | i've just moved here from london after finishi... | i'm doing a postdoc in psychology at stanford | NaN | NaN | ... | NaN | gay | NaN | NaN | m | cancer but it doesn&rsquo;t matter | NaN | 71.0 | single | 34 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2995 | 24 | athletic | mostly anything | socially | sometimes | graduated from college/university | recent relocatee to san francisco. i'm writing... | cranking out two years of private equity so i ... | writing. criticizing. partying partying yeah..... | i've been described as 'all-american.' not sur... | ... | NaN | straight | NaN | catholicism | m | NaN | no | 70.0 | single | 24 |
| 2996 | 50 | fit | NaN | rarely | never | graduated from college/university | i'm generally happy and typically spend my tim... | i was raised with left-wing politics, pbs, bal... | i'm great with kids, dogs, cats and recycling.... | gee, i don't know. . . that i'm smiling, that ... | ... | NaN | straight | has dogs | agnosticism | f | scorpio but it doesn&rsquo;t matter | no | 63.0 | single | 50 |
| 2997 | 31 | thin | vegetarian | socially | sometimes | NaN | NaN | i like to move around, therefore my life entai... | having fun and not taking life too seriously, ... | my smile and hair. | ... | NaN | straight | NaN | NaN | f | NaN | no | 64.0 | single | 31 |
| 2998 | 31 | athletic | mostly vegetarian | socially | sometimes | graduated from college/university | i work with seniors and i love it, so believe ... | going down in flames. | being a charming first date. | i dress like an adorable idiot. | ... | NaN | straight | likes dogs and has cats | catholicism and laughing about it | f | aries and it&rsquo;s fun to think about | when drinking | 62.0 | single | 31 |


In [13]:
# Check that the dtype of age_cat is object
df_okcupid["age_cat"]

0       31
1       25
2       43
3       31
4       34
        ..
2995    24
2996    50
2997    31
2998    31
2999    60
Name: age_cat, Length: 3000, dtype: object
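The reverse conversion, from categorical back to quantitative, can be sketched the same way with `.astype()`. The example below uses a small stand-alone `Series` (`ages` is made up for illustration) and assumes every string is a valid integer; `.astype(int)` raises an error otherwise.

```python
import pandas as pd

# Hypothetical Series standing in for df_okcupid["age"].
ages = pd.Series([31, 25, 43], name="age")

age_cat = ages.astype(str)     # quantitative -> categorical (values become strings)
age_num = age_cat.astype(int)  # categorical -> quantitative (strings parsed as integers)

# The values look the same when printed, but their Python types differ.
first_cat = age_cat.iloc[0]    # the string "31"
first_num = age_num.iloc[0]    # the integer 31

# Arithmetic summaries are meaningful again after casting back.
mean_age = age_num.mean()
print(repr(first_cat), repr(int(first_num)), mean_age)
```

Casting does not modify the original `Series`; each call to `.astype()` returns a new one, which is why the result has to be assigned to a variable or a column to be kept.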