# Exploring Relationships
**Learning Objectives:**
- Learn to subset observations
- Learn to compare variables
- Learn to group and summarise information


## Levels of Measurement

We have seen that there are two main types of data: Discrete and Continuous.

- **Discrete** data can only take a countable set of separate values.
    - eg. The number of students in a class.

- **Continuous** data can take any value within a range, so the number of possible values is infinite.
    - eg. The height of a student.

### We can further classify variables into four levels of measurement:

- **Nominal:** Differences of kind. There is no mathematical relationship between the values.
    - eg. Political parties.

- **Ordinal:** Differences of degree. There is a mathematical relationship among the values. Symbols like <, ≤, =, ≥, and > have meaning, but the distance between two elements is not constant.
    - eg. Levels of education.

- **Interval:** There is a mathematical relationship among the elements and the distance between them is constant, but they do not have a meaningful zero value.
    - eg. Feelings thermometer (0 to 100).

- **Ratio:** Similar to interval variables, but they have a meaningful zero value, so ratios between values can be interpreted.
    - eg. Age in years.

|          | Continuous | Discrete |
| -:       | :-:        | :-:      |
| Nominal  |            | x        |
| Ordinal  |            | x        |
| Interval | x          | x        |
| Ratio    | x          | x        |
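The difference between nominal and ordinal variables can be sketched with pandas categoricals. This is only an illustration: the party and education labels below are made up and do not come from the ANES data.

```python
import pandas as pd

# Nominal: categories with no inherent order (hypothetical party labels)
party = pd.Series(
    pd.Categorical(["Democrat", "Republican", "Independent", "Democrat"])
)

# Ordinal: categories with an order, but no constant distance between levels
levels = ["High school", "Bachelor", "Master", "PhD"]
education = pd.Series(
    pd.Categorical(
        ["Bachelor", "High school", "PhD", "Master"],
        categories=levels,
        ordered=True,
    )
)

# Counting works for both kinds of variables
party_counts = party.value_counts()

# Comparisons and min/max only make sense for the ordered (ordinal) variable
above_bachelor = education > "Bachelor"
highest = education.max()

print(party_counts)
print(above_bachelor.tolist())
print(highest)
```

Asking for `party.max()` would raise an error, because pandas has no way to rank unordered categories.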

## Data Acquisition



In [None]:
# Load Pandas
import pandas as pd

# Import Data
data_url = "https://raw.githubusercontent.com/datamisc/ts-2020/main/data.csv"
anes_data = pd.read_csv(data_url, compression='gzip')


In [None]:
# Subsetting & Renaming Variables
my_vars = [
    "V201032",  # intend to vote
    "V201033",  # intend to vote for
    "V201507x",  # age
    "V201200",  # liberal-conservative self-placement
]

# Copy the subset so that renaming does not touch the original data
df = anes_data[my_vars].copy()
df.columns = ["vote", "vote_int", "age", "ideology"]


## Continuous & Continuous

## Discrete & Continuous

## Discrete & Discrete

In [None]:
# The Filter Verb ##############################################################
# Keep only respondents who cast a vote in 2012
# (pandas equivalent of dplyr's filter(); `dataset` is assumed to be a DataFrame)
dataset[dataset["V161005"] == 1]

# Keep respondents below 25 years old
...

# Keep respondents below 25 years who are not missing age and who cast a vote
...

# Select our previous 5 variables, remove respondents that didn't provide their
# age and store the new dataset into a new object `tmp_data`
...


# Select 5 variables, filter non-missing young voters and arrange obs. by age
...

# Store the output in a new object called young_voters
...
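As a rough guide for the exercises above, here is a minimal sketch of row filtering in pandas on a toy DataFrame. The column names, the values, and the coding (1 = voted) are all assumptions made for illustration and are not taken from the ANES codebook.

```python
import numpy as np
import pandas as pd

# Toy data standing in for survey responses (all values are made up)
toy = pd.DataFrame(
    {
        "voted": [1, 2, 1, 1, 2],
        "age": [22, 19, np.nan, 47, 24],
    }
)

# Keep respondents who voted (code 1 is an assumed coding)
toy_voters = toy[toy["voted"] == 1]

# Keep respondents below 25; NaN < 25 is False, so missing ages drop out
toy_young = toy[toy["age"] < 25]

# Combine conditions with & and wrap each condition in parentheses
toy_young_voters = toy[(toy["age"] < 25) & (toy["voted"] == 1)]

# Arrange the remaining observations by age
toy_young_sorted = toy_young.sort_values("age")

print(toy_young_voters)
print(toy_young_sorted)
```

Each condition produces a boolean Series (a mask), and indexing the DataFrame with that mask keeps only the rows where it is `True`.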


## Data Exploration - Observations

Now that you have your data, the next step is to get familiar with it. 

Most of the time you are interested in some specific concepts. 
- You need a way to only select the variables related to your concepts.


### Filtering Observation (Rows)

Let's say you want to explore the vote of US citizens (V201033).

- We can use square brackets on a DataFrame object to select a single column!
- We can also use a list of strings containing column names to select multiple columns!

![](https://pandas.pydata.org/docs/_images/03_subset_rows.svg)


In [None]:
# The `columns` attribute allows you to get the column names of a dataframe
anes_data.columns

In [None]:
# Selecting the voting intent variable
anes_data["V201033"]


In [None]:
# We can also save it in a new object and check its type
vote_int = anes_data["V201033"]
type(vote_int)


Let's say you now also want to learn who people intend to vote for as a function of their age and ideology. In this case you need to select multiple variables.

In [None]:
# Selecting multiple columns
my_vars = [
    "V201032",  # intend to vote
    "V201033",  # intend to vote for
    "V201507x",  # age
    "V201200",  # liberal-conservative self-placement
]

anes_data[my_vars]

In [None]:
# Save this smaller subset of variables into my_df
my_df = anes_data[my_vars]
print(type(my_df))
print(my_df.columns)
my_df.head()

To avoid always having to check the codebook let's clean our data a bit by making the column names more explicit.
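One way to do this is the `rename()` method. The sketch below uses a toy DataFrame with made-up values; the short names (`vote`, `vote_int`, `age`, `ideology`) follow the ones used elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for the ANES subset (values are made up)
toy = pd.DataFrame(
    {
        "V201032": [1, 2],
        "V201033": [1, 2],
        "V201507x": [34, 61],
        "V201200": [3, 5],
    }
)

# Map each codebook name to a more explicit one
new_names = {
    "V201032": "vote",
    "V201033": "vote_int",
    "V201507x": "age",
    "V201200": "ideology",
}

# rename() returns a new DataFrame and leaves the original untouched
renamed = toy.rename(columns=new_names)
print(renamed.columns.tolist())
```

Using a dictionary keeps the mapping explicit, so the result does not depend on the order of the columns.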

### Hack-Time

In [None]:
# What is the average age of the respondents in the ANES dataset?


In [None]:
# What is the average ideology of the respondents in the ANES dataset?


In [None]:
# What is the proportion of people who intend to vote for D. Trump?


## Data Visualisation

Once you have found the information you need, it is usually a good idea to plot your results. A visualisation will sometimes help you better understand the problems in your data!

Most of the time you will use bar plots and histograms to visualise a single variable, depending on its type.

### Types of data

We have seen that there are different types of data in Python (strings, integers, floats, booleans, ...). When doing research we can group data in two broad families:

**Continuous** data can take any value within a range, so the number of possible values is infinite.
- The height of a student (eg. 182.5 cm)
- For these variables you will usually use **histograms**.

**Discrete** data can only take a countable set of separate values.
- The number of students in a class (eg. 22)
- For these variables you will usually use **bar plots**.
- **Don't forget to summarise your data before plotting!**
    - Otherwise your computer won't be happy...
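Summarising before plotting can be sketched as follows. The answers below are made up; the point is only that a bar plot needs one number per category, not one row per respondent.

```python
import pandas as pd

# Made-up answers to a yes/no question, one row per respondent
answers = pd.Series(["yes", "no", "yes", "yes", "no", "yes"])

# Summarise first: one number per category instead of one per respondent
counts = answers.value_counts()
shares = answers.value_counts(normalize=True)

print(counts)
print(shares)

# counts.plot(kind="bar") would then draw one bar per category
```

Calling `.plot(kind="bar")` on the raw answers instead would try to draw one bar per respondent, which gets slow and unreadable on a full survey.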

### Pandas Plotting API

You can use pandas to plot your results using the `.plot()` method on a DataFrame or Series object.

For more information click [**here**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html).


In [None]:
# Let's plot the distribution of the age variable.
my_df["age"].plot(kind="hist")



In [None]:
# Digging Deeper
my_df["age"].plot(kind="hist", bins=40)



In [None]:
# Let's look at participation (count each answer before plotting)
my_results = my_df["vote"].value_counts()
my_results.plot(kind="bar")


### Hack-Time

In [None]:
# Are US citizens polarized? 


In [None]:
# Who would win the popular vote according to the ANES 2020?


### Going further

There are many options to play with and improve a figure.
If you know the right terminology, it is much easier to find help when you want to change something in a figure!

#### Anatomy of a Figure
![Anatomy of a Figure](https://matplotlib.org/3.1.1/_images/anatomy.png)

Let's try to improve our voting intentions plot a tiny bit.

In [None]:
# Filtering Observations (for next time)
mask = my_df['vote_int'].between(1,4)

# Summarizing the data
tmp_data = my_df.loc[mask,"vote_int"].replace(
    {1:"Biden", 2:"Trump", 3:"Jorgensen", 4:"Hawkins"}
).value_counts(
    normalize=True
)

# Making a plot/graphic/figure
tmp_data.plot(
    kind="bar",
    title="Voting Intentions", 
    ylabel="Proportion",
    rot=0,
);


In [None]:
tmp_data

In [None]:
# Is there a relationship between Age and Ideology? 
my_vars = ['age', 'ideology']
mask = my_df['ideology'].between(1,7)
my_df.loc[mask, my_vars].boxplot(by='ideology')
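The same comparison can also be sketched numerically by grouping and summarising. The toy values below are made up; with the real data you would apply the same `groupby` call to the filtered `age` and `ideology` columns.

```python
import pandas as pd

# Toy data: ideology on an assumed 1-7 scale and age in years (made up)
toy = pd.DataFrame(
    {
        "ideology": [1, 1, 4, 4, 7, 7],
        "age": [25, 35, 40, 50, 60, 70],
    }
)

# Group respondents by ideology and summarise age within each group
age_by_ideology = toy.groupby("ideology")["age"].agg(["count", "mean", "median"])
print(age_by_ideology)
```

Each row of the result describes one ideology group, which makes it easy to check the numbers behind each box in the boxplot.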
