# Exploring Relationships
**Learning Objectives:**
- Learn to subset observations
- Learn to compare variables
- Learn to group and summarise information


## Data Exploration - Observations

Now that you have your data, the next step is to get familiar with it. 

Most of the time you are interested in a few specific concepts.
- You need a way to select only the variables (columns) and observations (rows) related to those concepts.


### Filtering Observations (Rows)

Let's say you want to explore how US citizens intend to vote (V201033).

- We can use square brackets on a DataFrame object to select a single column!
- We can also use a list of strings containing column names to select multiple columns!
- To keep only some rows, we can pass a boolean condition (a mask) inside the brackets or to `.loc`!

![](https://pandas.pydata.org/docs/_images/03_subset_rows.svg)
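As a minimal sketch of these selection patterns, the example below uses a small made-up DataFrame rather than the ANES file. The name `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy data standing in for the survey; the values are made up.
toy = pd.DataFrame(
    {
        "vote_int": [1, 2, 1, 3],
        "age": [46, 37, 40, 72],
        "ideology": [6, 4, 2, 5],
    }
)

# A single column name returns a Series.
one_column = toy["age"]

# A list of column names returns a DataFrame with those columns.
two_columns = toy[["age", "ideology"]]

# A boolean mask keeps only the rows where the condition is True.
mask = toy["age"] > 40
older = toy[mask]
```

Here `older` keeps the two rows whose age is above 40 and all three columns.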


In [1]:
# Load Pandas
import pandas as pd

# Import Data
data_url = "https://raw.githubusercontent.com/datamisc/ts-2020/main/data.csv"
anes_data = pd.read_csv(data_url, compression='gzip')




In [6]:
# Subsetting & Renaming Variables
my_vars = [
    "V201032",  # intend to vote
    "V201033",  # intend to vote for
    "V201507x",  # age
    "V201200",  # liberal-conservative self-placement
]

# Work on a copy so that renaming the columns does not touch anes_data
df = anes_data[my_vars].copy()
df.columns = ["vote", "vote_int", "age", "ideology"]

df.head()

|    | vote | vote_int | age | ideology |
| -: | -:   | -:       | -:  | -:       |
| 0  | 1    | 2        | 46  | 6        |
| 1  | 1    | 3        | 37  | 4        |
| 2  | 1    | 1        | 40  | 2        |
| 3  | 1    | 1        | 41  | 3        |
| 4  | 1    | 2        | 72  | 5        |


In [None]:
# Are US citizens polarized? 


In [None]:
# Who would win the popular vote according to the ANES 2020?


In [11]:
df.describe()

|       | vote     | vote_int | age       | ideology  |
| -:    | -:       | -:       | -:        | -:        |
| count | 8280.0   | 8280.0   | 8280.0    | 8280.0    |
| mean  | 0.942029 | 1.165097 | 49.038889 | 17.821498 |
| std   | 0.721704 | 1.937772 | 20.771267 | 33.481452 |
| min   | -9.0     | -9.0     | -9.0      | -9.0      |
| 25%   | 1.0      | 1.0      | 35.0      | 3.0       |
| 50%   | 1.0      | 1.0      | 51.0      | 4.0       |
| 75%   | 1.0      | 2.0      | 65.0      | 6.0       |
| max   | 2.0      | 12.0     | 80.0      | 99.0      |
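The minimum of -9 and the ideology maximum of 99 in this summary look like survey codes rather than real measurements, which would distort the means. The sketch below assumes that negative values and values outside the 1 to 7 ideology scale are non-substantive codes (check the codebook before relying on this) and blanks them out on a made-up DataFrame before summarising.

```python
import pandas as pd

# Toy data with made-up values, including codes like -9 and 99.
toy = pd.DataFrame({"age": [46, -9, 40, 72], "ideology": [6, 4, 99, -9]})

cleaned = toy.copy()

# Assumption: negative ages are missing-value codes.
cleaned["age"] = cleaned["age"].where(cleaned["age"] >= 0)

# Assumption: only 1 to 7 are valid ideology placements.
cleaned["ideology"] = cleaned["ideology"].where(cleaned["ideology"].between(1, 7))

# describe() skips the missing values produced by where().
summary = cleaned.describe()
```

`Series.where` keeps values where the condition holds and replaces the rest with `NaN`, so the counts in `summary` reflect only valid answers.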


## Continuous & Continuous



In [None]:
# Is there a relationship between Age and Ideology? 
my_vars = ['age', 'ideology']
mask = df['ideology'].between(1, 7)  # keep valid ideology placements only
df.loc[mask, my_vars].boxplot(by='ideology')
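When both variables are treated as continuous, a correlation coefficient (and a scatter plot) is a common first look. The sketch below uses made-up data; the `thermometer` column is a hypothetical second continuous variable, not one selected earlier in this notebook.

```python
import pandas as pd

# Toy data; `thermometer` is a hypothetical 0-100 score.
toy = pd.DataFrame(
    {
        "age": [20, 30, 40, 50, 60],
        "thermometer": [30, 45, 50, 70, 85],
    }
)

# Pearson correlation (the default method) between the two columns.
r = toy["age"].corr(toy["thermometer"])

# A scatter plot shows the same relationship visually (requires matplotlib):
# toy.plot(kind="scatter", x="age", y="thermometer")
```

A value of `r` close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little or none.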


## Discrete & Continuous



In [None]:
# Is there a relationship between Age and Ideology? 
my_vars = ['age', 'ideology']
mask = df['ideology'].between(1, 7)  # keep valid ideology placements only
df.loc[mask, my_vars].boxplot(by='ideology')
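A numeric companion to the boxplot is to group by the discrete variable and summarise the continuous one. This is a minimal sketch on made-up data, not a result from the ANES file.

```python
import pandas as pd

# Toy data with made-up values.
toy = pd.DataFrame(
    {
        "ideology": [1, 1, 4, 4, 7, 7],
        "age": [25, 35, 40, 50, 60, 70],
    }
)

# One row per ideology value, with summary statistics of age in each group.
age_by_ideology = toy.groupby("ideology")["age"].agg(["mean", "median", "count"])
```

Each row of `age_by_ideology` describes the age distribution inside one ideology group, which is the same information the boxplot shows graphically.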


## Discrete & Discrete

In [None]:
# Is there a relationship between Age and Ideology? 
my_vars = ['age', 'ideology']
mask = df['ideology'].between(1, 7)  # keep valid ideology placements only
df.loc[mask, my_vars].boxplot(by='ideology')
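For two discrete variables, a cross-tabulation is a common way to compare them. The sketch below uses made-up labels; the category names are illustrative assumptions, not the coding of the ANES variables.

```python
import pandas as pd

# Toy data with made-up labels.
toy = pd.DataFrame(
    {
        "vote_int": ["Biden", "Trump", "Biden", "Trump", "Biden"],
        "ideology": ["liberal", "conservative", "liberal", "conservative", "moderate"],
    }
)

# Counts of each (ideology, vote_int) combination.
counts = pd.crosstab(toy["ideology"], toy["vote_int"])

# Shares within each ideology group (each row sums to 1).
row_shares = pd.crosstab(toy["ideology"], toy["vote_int"], normalize="index")
```

Normalising by row makes groups of different sizes comparable.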


## Levels of Measurement

We have seen that there are two main types of data: Discrete and Continuous.

- **Discrete** data can only take distinct, countable values.
    - e.g. The number of students in a class.

- **Continuous** data can take any value within a range.
    - e.g. The height of a student.



### We can further classify variables into four levels of measurement:

- **Nominal:** Differences of kind. There is no mathematical relationship between the values.
    - e.g. Political parties.

- **Ordinal:** Differences of degree. The values can be ranked, so symbols like <, ≤, =, ≥, and > have meaning, but the distance between two values is not constant.
    - e.g. Levels of education.

- **Interval:** The values can be ranked and the distance between them is constant, but there is no meaningful zero value.
    - e.g. Feelings thermometer (0 to 100).

- **Ratio:** Similar to interval variables, but with a meaningful zero value, so statements like "twice as much" make sense.
    - e.g. Age in years.

|          | Continuous | Discrete |
| -:       | :-:        | :-:      |
| Nominal  |            | x        |
| Ordinal  |            | x        |
| Interval | x          | x        |
| Ratio    | x          | x        |
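In pandas, the nominal/ordinal distinction can be represented with unordered and ordered categoricals. The sketch below is one way to do this; the category labels are made up for illustration.

```python
import pandas as pd

# Nominal: categories with no order.
party = pd.Series(pd.Categorical(["Democrat", "Republican", "Independent"]))

# Ordinal: categories with an explicit order (lowest to highest).
education = pd.Series(
    pd.Categorical(
        ["High school", "Graduate", "College"],
        categories=["High school", "College", "Graduate"],
        ordered=True,
    )
)

# Comparisons and max() are meaningful only for the ordered variable.
above_hs = education > "High school"
highest = education.max()
```

Trying the same comparison on `party` raises a `TypeError`, which reflects the fact that nominal values cannot be ranked.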

In [None]:
# The Filter Verb ##############################################################
# Keep only respondents who cast a vote in 2012
# (pandas equivalent of the dplyr filter() verb)
dataset[dataset["V161005"] == 1]

# Keep respondents below 25 years old
...

# Keep respondents below 25 years who are not missing age and who cast a vote
...

# Select our previous 5 variables, remove respondents that didn't provide their
# age and store the new dataset into a new object `tmp_data`
...


# Select 5 variables, filter non-missing young voters and arrange obs. by age
...

# Store the output in a new object called young_voters
...
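The kind of chained filtering described in these comments can be sketched in pandas by combining boolean conditions with `&`. The example uses made-up data, and the column names `voted`, `age`, and `ideology`, the object name `young_voters_toy`, and the rule that negative ages are missing are all assumptions for illustration.

```python
import pandas as pd

# Toy data with made-up values; -9 stands in for a missing age.
toy = pd.DataFrame(
    {
        "voted": [1, 2, 1, 1, 1],
        "age": [22, 19, -9, 24, 51],
        "ideology": [3, 4, 5, 2, 6],
    }
)

# Each condition needs its own parentheses when combined with &.
keep = (toy["voted"] == 1) & (toy["age"] > 0) & (toy["age"] < 25)

# Filter rows, select columns, and sort by age in one expression.
young_voters_toy = toy.loc[keep, ["age", "ideology"]].sort_values("age")
```

Storing the mask in its own variable keeps the final expression short and makes each condition easy to check separately.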


### Hack-Time

In [None]:
# What is the average age of the respondents in the ANES dataset?


In [None]:
# What is the average ideology of the respondents in the ANES dataset?


In [None]:
# What is the proportion of people who intend to vote for D. Trump?


In [None]:
# Digging Deeper
df["age"].plot(kind="hist", bins=40)



In [None]:
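# Let's look at participation
# Assumption: `my_results` is not defined above; here it is taken to be the
# number of respondents giving each answer to the "intend to vote" question.
my_results = df["vote"].value_counts()
my_results.plot(kind="bar")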


### Hack-Time

Let's try to improve our voting intentions plot a tiny bit.

In [None]:
# Filtering Observations (for next time)
mask = df['vote_int'].between(1, 4)

# Summarizing the data: share of respondents per candidate
tmp_data = df.loc[mask, "vote_int"].replace(
    {1: "Biden", 2: "Trump", 3: "Jorgensen", 4: "Hawkins"}
).value_counts(
    normalize=True
)

# Making a plot/graphic/figure
# value_counts(normalize=True) returns proportions between 0 and 1
tmp_data.plot(
    kind="bar",
    title="Voting Intentions",
    ylabel="Proportion",
    rot=0,
);
