# Data Management - Existing Values

**Learning Objectives:**
- Learn to recode and create variables
- Learn about the `.replace()` method and the `pd.cut()` function
- Learn to compare variables


In [None]:
# Load Pandas
import pandas as pd

# Import Data
data_url = "https://raw.githubusercontent.com/datamisc/ts-2020/main/data.csv"
anes_data = pd.read_csv(data_url, compression='gzip')


In [None]:
# Subsetting & Renaming Variables
my_vars = [
    "V201033",   # vote_int
    "V201507x",  # age
    "V201200",   # liberal-conservative self-placement
    "V201151",   # biden thrm
    "V201152",   # trump thrm
    "V201144x",  # covid
]

df = anes_data[my_vars].copy()  # .copy() gives us an independent dataframe to work on

df.columns = ["vote_int", "age", "ideology", "biden_thrm", "trump_thrm", "covid"]
df.head()


## Recoding Variables Using a Mask/Filter

We can use relational operators (`<`, `>`, `<=`, `>=`, `==`, `!=`) to build a mask that identifies the observations meeting certain criteria, and then change only those observations.

For instance, we could recode the voting intention variable:

- [V201033](https://sda.berkeley.edu/sdaweb/docs/nes2020/DOC/hcbk0003.htm#V201033)
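
Before applying this to the survey data, here is a minimal sketch of what a mask actually is. The `toy` dataframe and its values are made up for illustration and are not taken from the ANES file.

```python
import pandas as pd

# Toy data (made up for illustration, not ANES values)
toy = pd.DataFrame({"vote_int": [1, 2, 3, -9, 1]})

# A relational operator returns one True/False value per row
mask = toy["vote_int"] == 1
print(mask.tolist())  # [True, False, False, False, True]

# Conditions are combined with & (and) or | (or); each one needs parentheses
valid = (toy["vote_int"] > 0) & (toy["vote_int"] <= 5)
print(valid.sum())  # 4
```

Passing such a mask to `.loc[mask, "column"]` selects only the rows where the mask is `True`, which is what the recoding cell does.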



In [None]:
# Recoding values with a mask
new_df = df[(df['vote_int'] > 0) & (df['vote_int'] <= 5)].copy()  # Creating a new dataframe

# Cast to object so the column can hold text labels next to the numeric codes
new_df['vote_int'] = new_df['vote_int'].astype(object)

# The recoding starts here
mask = new_df['vote_int'] == 1
new_df.loc[mask, "vote_int"] = "J.Biden"
mask = new_df['vote_int'] == 2
new_df.loc[mask, "vote_int"] = "D.Trump"


In [None]:
new_df['vote_int'].value_counts()


### Hack-Time



In [None]:
# Finish recoding the `vote_int` variable


## Recoding Variables Using `replace()`

One can also recode variables using the `replace()` method!

Let's recode the `covid` variable this time, using the `.replace()` method with a list of old values and a list of new labels (a dictionary works too)!

- [V201144x](https://sda.berkeley.edu/sdaweb/docs/nes2020/DOC/hcbk0005.htm#V201144x)
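
As a small sketch of the dictionary form, assuming made-up codes and a hypothetical `toy_labels` mapping (not the official codebook labels): each key is an old value and each value is its replacement. Values missing from the dictionary are left untouched.

```python
import pandas as pd

# Toy codes (made up for illustration)
toy_covid = pd.Series([1, 4, -2, 3])

# Hypothetical mapping: old code -> new label (3 is left out on purpose)
toy_labels = {-2: "Don't Know", 1: "Approve", 4: "Disapprove"}

relabelled = toy_covid.replace(toy_labels)
print(relabelled.tolist())  # ['Approve', 'Disapprove', "Don't Know", 3]
```

The two-list form used in this section is equivalent, as long as both lists have the same length and the same order.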

In [None]:
# What does the covid variable look like?
new_df['covid'].value_counts()

In [None]:
# Creating two lists with the old and the new labels
old_labels = [-2, 1, 2, 3, 4]
new_labels = ["Don't Know", "1.Approve strongly", "2.Approve not strongly", "3.Disapprove not strongly", "4.Disapprove strongly"]


In [None]:
new_df['covid'].replace(old_labels, new_labels)


In [None]:
new_df['covid'].replace(old_labels, new_labels).value_counts()

In [None]:
new_df['covid'] = new_df['covid'].replace(old_labels, new_labels)
new_df['covid'].value_counts(normalize=True)

In [None]:
new_df['covid'].value_counts(normalize=True).sort_index().plot(kind='bar')


### Hack-Time

In [None]:
# Are citizens who approve of the actions taken by the government to
# handle covid more likely to be liberal or conservative?
# TIP: use pd.crosstab!


## Filtering or Recoding?

Until now, we have been filtering out observations that we don't need without thinking about the consequences this can have on our results.

Every filter you apply can bias your dataset, and stacking several filters can also cost you a large part of your data!
- **With less data we have less evidence to draw conclusions!**
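
To make the difference concrete, here is a toy sketch with five made-up respondents. It uses `.map()` plus `.fillna()` as a compact stand-in for the mask-based recoding used in this notebook; the numbers are illustrative only.

```python
import pandas as pd

# Toy vote intentions (made up): 1 = Biden, 2 = Trump, 3 = another candidate
toy_votes = pd.Series([1, 1, 2, 2, 3])

# Filtering: drop everything except codes 1 and 2, then compute shares
filtered_share = toy_votes[toy_votes.between(1, 2)].value_counts(normalize=True)

# Recoding: keep every row, but group the remaining codes under "Other"
recoded = toy_votes.map({1: "J.Biden", 2: "D.Trump"}).fillna("Other")
recoded_share = recoded.value_counts(normalize=True)

print(filtered_share.loc[1])          # 0.5 (2 of the 4 rows that survive the filter)
print(recoded_share.loc["J.Biden"])   # 0.4 (2 of all 5 rows)
```

The same two respondents produce a different share depending on whether the third-party respondent stays in the denominator.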

Let's try to predict the election outcome using pre-election data!

In [None]:
# Filtering Out ALL Observations that are not Biden or Trump
mask = df['vote_int'].between(1,2)
filter_df = df[mask].copy()

# Change Labels
vote_int_labels = {
    1: "J.Biden",
    2: "D.Trump",
}

filtered_output = filter_df['vote_int'].replace(vote_int_labels).value_counts(normalize=True)
filtered_output

In [None]:
# Recoding Observations
mask = df["vote_int"] > 0
recoded_df = df[mask].copy()

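# Keeping all other vote choice intentions coded as "Other"
mask = recoded_df["vote_int"] > 2
# Cast to object first so the numeric column can hold text labels
recoded_df["vote_int"] = recoded_df["vote_int"].astype(object)
recoded_df.loc[mask, "vote_int"] = "Other"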

# Recoding the remaining labels using the previously created dictionary
recoded_df["vote_int"] = recoded_df["vote_int"].replace(vote_int_labels)

recoded_output = recoded_df["vote_int"].value_counts(normalize=True)
recoded_output


In [None]:
print("The Filtered Output")
print(filtered_output)
print("==============================")
print("The Recoded Output")
print(recoded_output)

## Creating New Variables (~ Adding New Columns)

When you recode a variable, you might want to store the result in a new column so that the original version of the variable stays in the dataset.

![](https://pandas.pydata.org/docs/_images/05_newcolumn_1.svg)
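
A minimal sketch of the idea, using a made-up `toy` dataframe and a hypothetical column name `vote_label`: assigning to a column name that does not exist yet creates the column, and the original one is left as it was.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({"vote_int": [1, 2, 1]})

# Store the recoded values in a NEW column instead of overwriting vote_int
toy["vote_label"] = toy["vote_int"].replace({1: "J.Biden", 2: "D.Trump"})

print(list(toy.columns))         # ['vote_int', 'vote_label']
print(toy["vote_int"].tolist())  # [1, 2, 1] (original codes are still there)
```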




In [None]:
recoded_df

In [None]:
recoded_df['my_new_var'] = 0
recoded_df

In [None]:
# Let's remove this variable.
# Note: .drop() returns a new dataframe, so recoded_df itself is unchanged here
recoded_df.drop('my_new_var', axis=1)
recoded_df

In [None]:
# Assign the result back to recoded_df to keep the change
recoded_df = recoded_df.drop('my_new_var', axis=1)

### Hack-Time

In [None]:
# Add a new binary variable that takes the value 1 when
# the respondent intends to vote for Trump. 
# Name this variable `vote_trump`


Let's now add a categorical age variable to our dataset!


In [None]:
# But how?
mask = recoded_df['age'] >= 18
recoded_df.loc[mask, 'age_cat'] = "18-35"
mask = recoded_df['age'] >= 36
recoded_df.loc[mask, 'age_cat'] = "36-50"
mask = recoded_df['age'] >= 51
recoded_df.loc[mask, 'age_cat'] = "51-65"
...
...
...
# This is very tedious...

## The cut function
The `pd.cut()` function allows us to convert a continuous variable into a discrete (categorical) variable by sorting its values into bins!
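
Here is a small sketch of how the bin edges behave, using made-up ages. By default each bin excludes its left edge and includes its right edge, and values outside every bin become missing (`NaN`). The `-9` below is only a stand-in for a negative missing-value code.

```python
import pandas as pd

# Toy ages (made up); -9 stands in for a missing-value code
toy_age = pd.Series([18, 35, 36, 80, -9])

# Bins are right-inclusive by default: (17, 35], (35, 50], (50, 65], (65, 80]
toy_cat = pd.cut(
    toy_age,
    bins=[17, 35, 50, 65, 80],
    labels=["18-35", "36-50", "51-65", "66+"],
)

print(toy_cat.tolist())
# 18 and 35 fall in "18-35", 36 in "36-50", 80 in "66+", and -9 becomes NaN
```

This is why the first edge is 17 rather than 18: with an edge at 18, respondents aged exactly 18 would fall outside the first bin.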

In [None]:
recoded_df["age_cat"] = pd.cut(recoded_df["age"], bins=[17,35,50,65,80], labels=["18-35", "36-50", "51-65", "66+"])
recoded_df


In [None]:
recoded_df['age_cat'].value_counts().sort_index().plot(kind='bar')


In [None]:
pd.crosstab(recoded_df['vote_int'], recoded_df['age_cat'], normalize=True).plot(kind='bar', subplots=True, figsize=(10,10), layout=(2,2));

### Hack-Time

In [None]:
# Add a cleaned version of the covid variable named `clean_covid` 


In [None]:
# Which age group agrees more with how the government is handling covid?
