# Exploring Relationships
**Learning Objectives:**
- Learn to subset observations
- Learn to compare relationships 
- Learn to summarise relationships


### Filtering Observations (Rows)

Data is messy. Most of the time you will need to filter some observations (rows) out of your dataset, typically for one of two reasons:
- You are interested in a particular subgroup of your dataset (e.g. young voters).
- Some observations are irrelevant and need to be removed to avoid drawing the wrong conclusions (e.g. people who refused to answer).

Therefore, you need a way to filter the observations in your dataset. 

- Relational operators provide a way to subset observations
- There are also useful methods that can help you with this.
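
As a minimal sketch of both ideas, the toy data frame below is filtered first with a relational operator and then with the `between()` and `isin()` methods. The values and column names are invented for illustration and are not taken from the ANES.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [17, 22, 35, 64, -9],
    "party": ["A", "B", "A", "C", "B"],
})

# A relational operator returns a boolean Series (a "mask")
adults_mask = toy["age"] >= 18

# Indexing with the mask keeps only the rows where it is True
adults = toy[adults_mask]

# Methods can express common filters more compactly
young_adults = toy[toy["age"].between(18, 29)]  # inclusive on both ends
a_or_c = toy[toy["party"].isin(["A", "C"])]

print(adults)
print(young_adults)
print(a_or_c)
```

The mask has one `True`/`False` entry per row, so it can be inspected, combined with other masks, or negated with `~` before it is used to subset the data.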

![](https://pandas.pydata.org/docs/_images/03_subset_rows.svg)

We have seen that the ANES data contains values such as -8, -9, and 99 that may or may not be useful, depending on the problem we want to tackle.

How can we deal with them?

Let's start with this question to illustrate:

- Are Young Voters more Liberal or Conservative?


In [None]:
# Load Pandas
import pandas as pd

# Import Data
data_url = "https://raw.githubusercontent.com/datamisc/ts-2020/main/data.csv"
anes_data  = pd.read_csv(data_url, compression='gzip')


In [None]:
# Subsetting & Renaming Variables
my_vars = [
    "V201032",  # intend to vote
    "V201033",  # intend to vote for
    "V201507x", # age
    "V201200",  # liberal-conservative self-placement
]

df = anes_data[my_vars]
df.columns = ["vote", "vote_int", "age", "ideology"]

df.head()

In [None]:
# How is ideology distributed?
df.value_counts('ideology').sort_index().plot(kind='bar')


In [None]:
# How is age distributed?
df['age'].plot(kind='hist', bins=30)


In [None]:
df.describe()

We need to clean these variables a bit!

In [None]:
# Cleaning the age variable
mask = df['age'] >= 18
mask


In [None]:
# Age seems about right now! But we lost some observations!
df[mask].describe()


In [None]:
# Saving the cleaned data frame with the subsetted data
df = df[mask]

In [None]:
# Cleaning Ideology
mask = (df['ideology'] >= 1) & (df['ideology'] <= 7)
mask

In [None]:
df = df[mask]
df.describe()
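
Dropping rows is not the only option. As an alternative sketch, assuming that out-of-range codes mark missing answers (the toy values below are invented), invalid values can be replaced with `NaN` using `Series.where()`. The rows stay in the data frame, but methods such as `mean()` skip the missing entries.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [25, -9, 40, 80],
    "ideology": [3, 4, 99, -8],
})

# Same validity rules as the masks used for filtering
valid_age = toy["age"] >= 18
valid_ideology = (toy["ideology"] >= 1) & (toy["ideology"] <= 7)

# where() keeps values where the mask is True and puts NaN elsewhere
cleaned = toy.copy()
cleaned["age"] = toy["age"].where(valid_age)
cleaned["ideology"] = toy["ideology"].where(valid_ideology)

print(cleaned)
print(cleaned["ideology"].mean())  # NaN values are skipped
```

Which approach fits better depends on the analysis: filtering shrinks the data frame once and for all, while `NaN` replacement keeps a respondent available for the variables they did answer.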

In [None]:
# How is age distributed?
df['age'].plot(kind='hist', bins=30)


In [None]:
# How is ideology distributed now?
df.value_counts('ideology').sort_index().plot(kind='bar')


### Hack-Time



In [None]:
# Clean the `vote` variable


In [None]:
# Clean the `vote_int` variable


## Who are young voters?

In [None]:
# Defining Young Voters (the cutoff of 30 is an illustrative choice; adjust it to your own definition)
mask = df['age'] <= 30


In [None]:
# Which group is more liberal? Which group is more conservative?
print(df[mask]['ideology'].mean())
print(df[~mask]['ideology'].mean())
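
Splitting respondents into only two groups hides how ideology changes across the age range. A possible extension, sketched here on invented toy data with hypothetical bin edges and labels, is to cut age into several bands with `pd.cut()` and compute the mean ideology of each band with `groupby()`.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [19, 24, 31, 45, 58, 67, 72, 80],
    "ideology": [2, 3, 3, 4, 5, 4, 6, 5],
})

# Assumed bin edges; each interval excludes its left edge and includes its right edge
bins = [17, 29, 44, 64, 80]
labels = ["18-29", "30-44", "45-64", "65+"]
toy["age_group"] = pd.cut(toy["age"], bins=bins, labels=labels)

# Mean ideology within each age band
group_means = toy.groupby("age_group", observed=True)["ideology"].mean()
print(group_means)
```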


In [None]:
# Did we forget something?


Could we visualise this relationship in some other way?


## Types of Data & Levels of Measurement

We have seen that there are two main types of data: discrete and continuous.

- **Discrete** data can only take separate, countable values.
    - e.g. The number of political parties in a country.

- **Continuous** data can take any value within a range.
    - e.g. The age of someone.



### We can further classify data into four levels of measurement:

- **Nominal:** Differences of kind. There is no mathematical relationship between the values.
    - e.g. Political party affiliation.

- **Ordinal:** Differences of degree. The values can be ranked, so symbols like <, ≤, =, ≥, and > have meaning, but the distance between two values is not constant.
    - e.g. Levels of education.

- **Interval:** The values can be ranked and the distance between them is constant, but there is no meaningful zero value.
    - e.g. Liberal-conservative 7-point scale.

- **Ratio:** Similar to interval variables, but with a meaningful zero value.
    - e.g. Age.

|          | Continuous | Discrete |
| -:       | :-:        | :-:      |
| Nominal  |            | x        |
| Ordinal  |            | x        |
| Interval | x          | x        |
| Ratio    | x          | x        |
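
The difference between nominal and ordinal data can be sketched with pandas categoricals. The category names below are invented for illustration; the point is only that an ordered categorical supports comparisons and `min()`/`max()`, while an unordered one does not.

```python
import pandas as pd

# Nominal: categories have no order (party names are made up)
party = pd.Categorical(["Green", "Blue", "Green", "Red"])

# Ordinal: categories are ranked, but the distances between them are undefined
education = pd.Categorical(
    ["secondary", "primary", "tertiary", "secondary"],
    categories=["primary", "secondary", "tertiary"],
    ordered=True,
)

print(party.ordered)
print(education.ordered)

# Ranking operations are only meaningful for the ordered categorical
print(education.min(), education.max())
print(education > "primary")
```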


## Continuous & Continuous



In [None]:
# What is the relationship between Age and Ideology? 
df.plot(kind='scatter', x='ideology', y='age', figsize=(10,10))


In [None]:
# We can sample some data
df.sample(100).plot(kind='scatter', x='ideology', y='age', figsize=(10,10))


In [None]:
# We can try to add transparency
df.plot(kind='scatter', x='ideology', y='age', alpha=0.02, figsize=(10,10))


In [None]:
# We can combine sampling and transparency
df.sample(1000).plot(kind='scatter', x='ideology', y='age', alpha=0.2, figsize=(10,10))


In [None]:
# Using hexbin plots
df.plot(kind='hexbin', x='ideology', y='age')

In [None]:
# Using fewer, larger hexagons (gridsize sets the number of hexagons along the x-axis)
df.plot(kind='hexbin', x='ideology', y='age', gridsize=7)
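
Plots can be complemented with a single-number summary. As a sketch on invented toy data, `Series.corr()` computes Pearson's correlation by default. Because a 7-point scale is arguably closer to ordinal than to truly continuous data, the example also computes Spearman's rank correlation, obtained here as the Pearson correlation of the ranks.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60, 70],
    "ideology": [2, 3, 3, 4, 5, 6],
})

# Pearson correlation: strength of the linear relationship
pearson = toy["age"].corr(toy["ideology"])

# Spearman correlation: Pearson correlation computed on the ranks
spearman = toy["age"].rank().corr(toy["ideology"].rank())

print(round(pearson, 3), round(spearman, 3))
```

Both coefficients range from -1 to 1. A value near 0 suggests little monotonic association, which is worth checking against the plots rather than reading in isolation.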


## Discrete & Continuous

![](https://miro.medium.com/max/1400/1*2c21SkzJMf3frPXPAR_gZA.png)



In [None]:
# Is there a relationship between Age and Ideology? 
my_vars = ['age', 'ideology']
df[my_vars].boxplot(by='ideology')
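
A boxplot draws the quartiles of each group. The same numbers can be obtained as a table with `groupby()` and `describe()`, sketched here on invented toy data.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "ideology": [1, 1, 1, 4, 4, 4, 7, 7, 7],
    "age": [20, 30, 40, 35, 45, 55, 50, 60, 70],
})

# One row per ideology value, with count, mean, std, min, quartiles and max of age
summary = toy.groupby("ideology")["age"].describe()

# The 25%, 50% and 75% columns are the box edges and the median line
print(summary[["25%", "50%", "75%"]])
```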


## Discrete & Discrete

### Cross tabulations

You can use the `pd.crosstab()` function to compute simple cross tabulation of two (or more) variables. 

![](https://www.dataindependent.com/wp-content/uploads/2020/08/Screen-Shot-2020-08-17-at-7.43.21-AM-1024x466.png)


In [None]:
# Cross tabulations
pd.crosstab(df['ideology'], df['vote_int'])


In [None]:
# Similar to ...
df.groupby('ideology')['vote_int'].value_counts().unstack().fillna(0).astype(int)


In [None]:
# Absolute counts don't mean much; normalize=True converts them to shares of the grand total
pd.crosstab(df['ideology'], df['vote_int'], normalize=True)


In [None]:
# Barplots to the rescue!
pd.crosstab(df['ideology'], df['vote_int'], normalize=True).plot(kind='bar', stacked=True)
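
With `normalize=True`, every cell is divided by the grand total, so the stacked bars still differ in height. `pd.crosstab()` also accepts `normalize='index'` and `normalize='columns'`, which compute shares within each row or each column instead. The sketch below uses invented toy data to compare the three options.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "ideology": [1, 1, 1, 1, 7, 7],
    "vote_int": ["A", "A", "A", "B", "B", "B"],
})

# Shares of the grand total: all cells together sum to 1
overall = pd.crosstab(toy["ideology"], toy["vote_int"], normalize=True)

# Shares within each ideology value: every row sums to 1
by_row = pd.crosstab(toy["ideology"], toy["vote_int"], normalize="index")

# Shares within each vote intention: every column sums to 1
by_col = pd.crosstab(toy["ideology"], toy["vote_int"], normalize="columns")

print(overall)
print(by_row)
print(by_col)
```

Row-wise normalisation is often the more natural choice for a stacked bar chart, because every bar then has the same height and only the composition varies.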
