In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
%matplotlib inline

# PS 88 Week 14: Practicalities

This week we will develop some practical skills for the third part of the project, and any potential data work you do in the future. We will cover a lot of ground, and provide some links for further reading along the way.

As a running example, we will seek to replicate the first graphs I showed you in class, which plotted gun ownership and gun deaths across US states and different countries.

## Part 1. Finding data

There is no magical formula for finding data, though something close may be "use google."

A good formula if you are looking for one variable is to search for "[description of variable] by [level of observation] data". You can also try using "table" or "csv" instead of or in addition to "data". 

If you are interested in a particular time period/country/etc. you can add that to the search as well.

For example, when I search for "gun deaths by us state" <a href="https://www.cdc.gov/nchs/pressroom/sosmap/firearm_mortality/firearm.htm">this</a> is the first source that comes up.

Follow the link and poke around a bit. Note it produces some maps, which is nice, but we want the raw data. Fortunately there is a "Download Data (csv)" button. Click that, which will download the file for you.

If you go to the folder where downloads go, and double click on the file, it will probably try to open it in Excel or a text editor. 

Next we will put this on datahub so we can import it in a notebook. Before that, some more general tips:
- We want to be sure that our data come from reliable sources. There are no absolute rules here, but a government site like the CDC is probably pretty trustworthy. If this were a serious research project we would want to consult multiple sources and check they are consistent, if possible. 
- A good source for political science data is <a href="https://www.icpsr.umich.edu/web/pages/">ICPSR</a>.
- If you are struggling to find data you are looking for, reach out to your professors or GSIs: we will often have a good sense of where to search or who to go ask.

## Part 2. Getting data on the datahub.

You may have noticed that whenever we are working on notebooks there is a "jupyterhub" icon in the top left corner. This is a super valuable resource provided for members of the Berkeley community, where we can use Python and other programming languages through our browsers. It also makes it easy for me to share notebooks with you!

As you may know, you have a "personal" datahub site that includes all of the files we have worked on in this class, and potentially some others from other classes you have taken. 

Open a new browser tab/window and go to datahub.berkeley.edu. You should see a folder called "PS-88-FA23". If you click this, it will take you to a directory with all of the work we have done so far in class. Pause for a moment to be proud of all the work you have done!

Now click on the "week14" folder. That folder should contain a file called Class14.ipynb: that is this file right here. There is also a folder called "data". If there isn't you can create one by clicking the "New" button in the top right and then going to "folder". Click the box next to "Untitled Folder" and then the "Rename" button.

Click on the data folder, which has a few files we will use later. 

Now we want to upload the data file. There is a button near the top right called Upload. Click that, and then find the data file you just downloaded. This creates a little dialog box with the current name of the file, "data-table.csv" (maybe with a number in parentheses after "table"). This is pretty uninformative, so go into the box and rename it "gundeaths.csv". Then click the blue upload button. Now it is up!

Any file you upload to datahub can be accessed by a notebook you are running on datahub, though it is good practice to keep the data and the notebook that uses it in the same directory, or to put the data files in a subdirectory.



## Part 3. Loading the data

Now we are ready to run some code that will look familiar. We will use the `pd.read_csv` function. The only difference is that this time you were the one to put the data file where it needs to be! 

If you put the .csv file in the same directory as your notebook, you could load it up with `pd.read_csv("gundeaths.csv")`. That is, by default this function will "look" for the data file in the same directory as the notebook. Since we put it in a folder called "data", we instead load it up with:

In [None]:
gd = pd.read_csv("data/gundeaths.csv")
gd

Notice there are 500 rows here, corresponding to 10 years of data for 50 states (no DC here, alas).

The `DEATHS` column is a raw count of gun deaths, while `RATE` is the gun deaths per 100,000 residents (making some adjustments for differences in age, which we don't need to worry about). It will generally be better to work with these rates: if we don't adjust for population we may just find that there are more gun deaths in states with more guns just because both are correlated with population.
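
To make the count-versus-rate distinction concrete, here is a minimal sketch with two made-up states. The numbers and the column names (`deaths`, `population`) are hypothetical, and this crude rate skips the age adjustment that the CDC applies.

```python
import pandas as pd

# Hypothetical numbers for illustration only (not CDC data)
toy = pd.DataFrame({
    "state": ["Big", "Small"],
    "deaths": [3000, 300],
    "population": [30_000_000, 1_500_000],
})

# Crude rate: deaths per 100,000 residents (no age adjustment)
toy["rate"] = toy["deaths"] / toy["population"] * 100_000
print(toy)
```

Here "Big" has ten times as many deaths as "Small" but only half the rate, which is why the raw counts would be misleading.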

We can do some general exploratory analysis on the trends of gun deaths by year:

In [None]:
sns.lineplot(x='YEAR', y='RATE', data=gd, ci=None)

The reason the line is straight for a while is that there is a "gap" in the data from 2005 to 2014:

In [None]:
gd['YEAR'].value_counts()

**Plot the trends for some individual states by subsetting the data by state, i.e., change the `data=` argument to `data=gd[gd.STATE==X]` where X is a two letter postal code. If you make multiple plots, you can make a legend by adding a `label=X` argument to all lineplot calls, and then add `plt.legend()` at the end.**

## Part 4: Merging on one variable

Now we have one of our variables, and for multiple years. Let's see what we can find on gun ownership. Some quick googling turns up <a href="https://worldpopulationreview.com/state-rankings/gun-ownership-by-state">this</a> site. Since I first found the data they added an annoying hurdle where you need to sign up for their newsletter to get the data. To spare you that, I put the data up on datahub for you, named "gunowner2023.csv". 

**Load up this data file using `pd.read_csv`, and name the dataframe `go`**

**Make a histogram of gun ownership rates**

Now we want to combine the gun ownership data with the gun deaths data. There are a few challenges here:
- The gun ownership data is for 2023 and the gun deaths data has multiple years, ending in 2022. We will address this more later, but to keep things simple let's first try to get one dataframe that has the gun ownership data from 2023 and the gun deaths data from 2022. As we saw above, these tend to not dramatically change over time, so this will probably be reasonable to just get a sense of what the bivariate relationship looks like today.
- A more practical problem is that these two data sources name the states in different ways: one uses two letter postal codes ("CA"), while the other has the full state name ("California"). Humans who know the US postal codes can figure out which corresponds to which, but we would rather have a computer do this for us, because it is faster and less error prone.

To solve the second problem, we are going to look for a third data set which provides a "translation" or "crosswalk" between the full state name and the two letter code. Googling "state name abbreviations" turns up <a href="https://worldpopulationreview.com/states/state-abbreviations">this</a> site, which has a csv file that contains both the full name and the abbreviation, as we want. Again, so you don't need to sign up for an annoying newsletter, I've put this up in the data folder for you, with the name "crosswalk.csv".

Now we can load it up:

In [None]:
crosswalk = pd.read_csv("data/crosswalk.csv")
crosswalk

To keep things tidy, let's drop the "abbrev" column which has an abbreviation which doesn't show up in either dataframe we are using. 

In [None]:
crosswalk = crosswalk.drop("abbrev", axis=1)
crosswalk.head()

Now we are ready for our first merge, which will bring the two letter postal code into the gun ownership data (we could also start by merging the full name into the gun deaths data). 

Let's remind ourselves what the `go` dataframe looks like:

In [None]:
go.head()

The simplest kinds of merges, like this one, correspond to cases where we have two dataframes where the rows correspond to the "same thing", here a state. Merging requires one or more variables to serve as a "key" which will tell us which rows correspond to the same case. To see what our key is, let's remind ourselves what the data frames look like:

In [None]:
crosswalk.head()

In merging our `go` dataframe with the `crosswalk` dataframe, our key will be the `state` variable, which holds the full state name in both dataframes. Conveniently, it has the same column name in both dataframes; we will learn how to deal with the (very common) issue of different names later on. We will use the `pd.merge` function to combine the data from both dataframes, which in this case we can think of as adding the postal code to the gun ownership data. In general we will use four arguments when calling this function.
- The first two arguments are the names of the two dataframes to merge. We will refer to the first one as the "left" dataframe and the second as the "right" dataframe.
- Next we use an `on=` argument to say what column(s) we can use to indicate which rows belong together (the "key(s)").
- Finally, we use a `how=` argument which tells us what "kind" of merge to do. We will see more examples later, but first we will do a "left" merge, which will typically treat the first dataframe we entered as the "base" data frame, and then add the data from the right dataframe when available. Overall, here is the syntax for our first merge:

In [None]:
mergel = pd.merge(go, crosswalk, on="state", how="left")
mergel.head()

Comparing across the `state` and `code` variables, this seems to have worked! 
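
For intuition about what the key does, here is a stripped-down sketch of the same kind of call on made-up data. The ownership values are invented; only the column names mirror the ones in this notebook. Note that the rows of the right dataframe are deliberately in a different order.

```python
import pandas as pd

# Made-up ownership values; the right dataframe lists states in a different order
left = pd.DataFrame({"state": ["Alabama", "Alaska"],
                     "gunOwnership": [0.5, 0.6]})
right = pd.DataFrame({"state": ["Alaska", "Alabama"],
                      "code": ["AK", "AL"]})

# Rows are matched on the value of the key column, not on row position
toy_merge = pd.merge(left, right, on="state", how="left")
print(toy_merge)
```

Each state picks up its own code even though the two inputs were sorted differently, because the match is made on the key.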

A good thing to check whenever doing a merge is the total number of rows/columns in the original and merged data files. We can do this with the `.shape` attribute (note there are no parentheses after it).

In [None]:
go.shape

**Find the shape of the `crosswalk` data frame**

This tells us that the gun ownership data file had 50 observations, while the crosswalk had 51. If you go and look at the crosswalk we can see why (if you can't already guess): the crosswalk also includes DC, while the gun ownership data does not. How did our "left" merge deal with this? 

**Check the shape of the merged dataframe**

Like the gun ownership ("left") dataframe, we only have 50 observations. This is because a "left" merge keeps every row of the left dataframe and effectively says "you can drop cases that only appear in the right dataframe", which in this case means we don't create a row for DC because we have no gun ownership data for it. 

**What happens if we do a "right" merge by changing our "how" argument to `how="right"`? Save this as `merger` and then check the shape.**

Now we have 51 rows! Let's see what this looks like:

In [None]:
merger

Note this includes a row for DC, which includes the name and postal code -- the "data" from the crosswalk, or the right dataframe -- but no gun ownership data. Instead, we get a NaN, meaning there is no data. Sometimes it will be useful to have the "larger" dataframe with no data for some rows, but for our purposes we would just end up ignoring DC anyways since we don't have data on one of our key variables.

There are other types of merges that are useful for other purposes ("inner" and "outer"), but left and right will be enough for us. It won't matter for what we do next, but let's stick with `mergel`, which doesn't add the DC row.
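
As a quick side-by-side of the four options, here is a sketch on two tiny made-up dataframes. The ownership values and the extra rows (Puerto Rico, Guam) are hypothetical and are only there to create unmatched cases on each side.

```python
import pandas as pd

# Made-up data: two states appear in both, one row is only in `own`,
# and two rows are only in `xwalk`
own = pd.DataFrame({"state": ["Alabama", "Alaska", "Puerto Rico"],
                    "gunOwnership": [0.5, 0.6, 0.1]})
xwalk = pd.DataFrame({"state": ["Alabama", "Alaska", "District of Columbia", "Guam"],
                      "code": ["AL", "AK", "DC", "GU"]})

row_counts = {}
for how in ["left", "right", "inner", "outer"]:
    merged = pd.merge(own, xwalk, on="state", how=how)
    row_counts[how] = len(merged)

print(row_counts)  # {'left': 3, 'right': 4, 'inner': 2, 'outer': 5}
```

"left" and "right" keep every row of one input, "inner" keeps only keys found in both, and "outer" keeps every key found in either.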

Next, we want to bring in the gun death data. Let's remind ourselves what this looks like:

In [None]:
gd.head()

A few things to note:
- This dataframe also contains multiple years for each state. In the next line we will subset to the "closest" year to our gun ownership data, which is 2022. 
- The column names here are in all caps
- There are a couple columns we don't need: `URL` which contains the source, and `DEATHS`, which is the raw count of gun deaths, while we want to look at the rate.

We are going to take a few steps to clean this up before merging. First, let's create a new dataframe with just the 2022 data.

In [None]:
gd2022 = gd[gd.YEAR == 2022]
gd2022.shape

Let's also drop the URL and also the year (the dataframe name already reminds us this is just for 2022). We can do this with the `.drop` method. Put the names of the variables to drop in a list (`["Var1", "Var2"]`), and then also add an `axis=1` argument to indicate we are dropping columns.

In [None]:
gd2022 = gd2022.drop(["URL", "YEAR"], axis=1)

In [None]:
gd2022.head()

Now we want to merge this with our `mergel` dataframe. Let's remind ourselves what this looks like:

In [None]:
mergel.head()

The "key" we can use to identify each row within the two datafiles is the two letter postal code. But note this time it is named `STATE` in the gun death data and `code` in the gun ownership data. We can do merges on keys that have different column names, but I think it is tidier to first rename one of the columns to match the other. In this case, let's rename `STATE` to `code` in the gun death data (since there is already a `state` variable in the gun ownership data).

We can do this with the `rename` method applied to the dataframe. To just rename one variable, we use `df.rename({'OLDNAME':'NEWNAME'}, axis=1)`. Think of the argument in curly brackets as the "dictionary" telling us the mapping from old to new names. The `axis=1` argument clarifies that we are renaming columns and not rows. Let's first check that this works:

In [None]:
gd2022.rename({'STATE':"code"}, axis=1)

In [None]:
gd2022=gd2022.rename({'STATE':"code"}, axis=1)

I find it useful when doing things like renaming variables to first just run a line of code like `df.rename` or `df.drop`, and see what the dataframe it produces looks like. When it returns what I want, then I edit the line of code to either create a new dataframe (`df2 = df.rename...`) or overwrite the first one (`df = df.rename...`).
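
For comparison, here is a sketch of the alternative mentioned above: merging without renaming, using the `left_on=` and `right_on=` arguments of `pd.merge`. The rates and ownership values are made up; only the column names follow this notebook.

```python
import pandas as pd

# Made-up values; the key is called "code" on one side and "STATE" on the other
owners = pd.DataFrame({"code": ["AL", "AK"], "gunOwnership": [0.5, 0.6]})
deaths = pd.DataFrame({"STATE": ["AL", "AK"], "RATE": [20.0, 25.0]})

# left_on/right_on name the key column in each dataframe separately
no_rename = pd.merge(owners, deaths, left_on="code", right_on="STATE", how="left")
print(no_rename)
```

The result keeps both `code` and `STATE`, which hold the same information. That duplicate column is one reason renaming first tends to be tidier.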

Now we are ready to merge the gun deaths and gun ownership data on the key "code". 

**Merge the `mergel` and `gd2022` dataframes on the key "code". After checking the result appears to work,  save it as `own_and_death`**

We still have 50 observations, which is good. And we now have both gun ownership and gun deaths (`RATE`).

In [None]:
own_and_death.head()

Finally, let's reward our hard work by making a scatterplot with gun ownership on the x axis and gun deaths on the y axis:

In [None]:
sns.scatterplot(x='gunOwnership',y="RATE", data=own_and_death)

Looks a lot like the graph I showed in the first lecture!

## Part 5. Merging on two variables

There are at least two unsatisfying things about the analysis above. First, we only have one year of data for each state. Second, they aren't even from the same year.

To rectify this, and learn some more about merging along the way, we will import some data on gun ownership for multiple years created by the Rand Corporation. (To tie together another theme of class, Rand was also an early hotbed of research on Game Theory in the mid 20th century.)

The data come from <a href="https://www.rand.org/research/gun-policy/gun-ownership.html">here</a> but are in an annoying format, so I cleaned it up and uploaded it to my Box drive and then made a publicly shareable link, which is a useful method for sharing.

In [None]:
rand_gunown = pd.read_csv("https://berkeley.box.com/shared/static/ptgqciox16mibfkcbs2af8ny85okmyyo.csv")
rand_gunown.shape

There are lots of rows here and just three columns. Let's take a closer look:

In [None]:
rand_gunown

The "HFR" is an estimate of gun ownership, using some convoluted methods which we won't delve into. Note we have data for multiple years. Let's see which:

In [None]:
rand_gunown['Year'].value_counts()

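These are not sorted by year because `value_counts()` orders its output by count rather than by value (adding `.sort_index()` would sort by year), but we have 1980-2016. 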

**Let's remind ourselves about the gun death range by applying the `value_counts()` function to the `YEAR` column of `gd`.**

The overlap here is far from perfect, but it does give us four years with both gun ownership and gun death data. Let's combine these.

Like before, our gun ownership and gun deaths data have different ways of identifying states, even though the column names match!

In [None]:
rand_gunown.head()

In [None]:
gd.head()

Let's start with the gun deaths and rename the `STATE` column to `code` to match our crosswalk. We also will get rid of the all caps for others.

In [None]:
gd_clean = gd.rename({"YEAR":"Year", "STATE":"code", "RATE":"Deathrate"}, axis=1)
gd_clean

**To tidy things up a bit, drop the "DEATHS" and "URL" columns from `gd_clean`**

Let's remember what the crosswalk looks like:

In [None]:
crosswalk.head()

Now we can do a merge between our cleaned gun death dataframe and the crosswalk, using `code` as a key. 

Note an interesting difference between this merge and the ones above. The `gd_clean` dataframe has 500 rows because there are multiple years for each state, while `crosswalk` has just 50 (well, 51, but DC is going to get dropped). But if we input `gd_clean` as our first dataframe and use `how="left"` and `on="code"`, pandas treats each row of `gd_clean` as belonging to a state, where we happen to have multiple rows (years) per state. So for each row, we just pull the full state name (`state`) from the crosswalk.
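
Here is a small sketch of this kind of "many rows to one row" merge on made-up data. The `validate="m:1"` argument is an optional extra not used elsewhere in this notebook: it asks `pd.merge` to raise an error if the key is duplicated in the right dataframe, which is a cheap safety check.

```python
import pandas as pd

# Made-up death rates: two years for each of two states
long_df = pd.DataFrame({"code": ["AL", "AL", "AK", "AK"],
                        "Year": [2021, 2022, 2021, 2022],
                        "Deathrate": [1.0, 2.0, 3.0, 4.0]})
xwalk = pd.DataFrame({"code": ["AL", "AK"],
                      "state": ["Alabama", "Alaska"]})

# Many rows on the left share one row on the right;
# validate="m:1" errors out if `code` is not unique in xwalk
with_name = pd.merge(long_df, xwalk, on="code", how="left", validate="m:1")
print(with_name)
```

The number of rows is unchanged, and the state name is simply repeated for every year of the same state.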

In [None]:
gd_withcode = pd.merge(gd_clean, crosswalk, on="code", how="left")
gd_withcode.shape

In [None]:
gd_withcode

Now we just need to clean up the `rand_gunown` a bit. In particular, let's rename `STATE` to `state` so it matches the crosswalk column.

In [None]:
rand_gunown = rand_gunown.rename({"STATE":"state"}, axis=1)

To combine these two dataframes, we need to use *two* keys, since we want to make sure we match the gun ownership/deaths by year and state. So our two keys will be the state name (`state`) and the year (`Year`). We can do this by passing a list with both column names to the `on` argument. Let's do this as a left merge with `gd_withcode` as our left dataframe, which we can think of as "adding" the gun ownership to this data.

In [None]:
rand_gunown

In [None]:
sy_left = pd.merge(gd_withcode, rand_gunown, on=["state", "Year"], how="left")

Let's compare the shape of our inputs and outputs:

In [None]:
gd_withcode.shape, rand_gunown.shape, sy_left.shape

Note the shape matches our left dataframe. In this case, we can think of this as only including the years where we have gun deaths data. 
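
To see why both keys matter, here is a sketch on made-up data comparing a merge on `state` and `Year` with a merge on `state` alone. All values are invented.

```python
import pandas as pd

# Made-up values: one state, two years in each dataframe
deaths = pd.DataFrame({"state": ["Alabama", "Alabama"],
                       "Year": [2015, 2016],
                       "Deathrate": [1.0, 2.0]})
owners = pd.DataFrame({"state": ["Alabama", "Alabama"],
                       "Year": [2015, 2016],
                       "HFR": [0.5, 0.6]})

both_keys = pd.merge(deaths, owners, on=["state", "Year"], how="left")
state_only = pd.merge(deaths, owners, on="state", how="left")

print(both_keys.shape)   # (2, 4)
print(state_only.shape)  # (4, 5)
```

With only `state` as the key, every Alabama row on the left matches every Alabama row on the right, so we get four rows (and separate `Year_x` and `Year_y` columns) pairing up mismatched years.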

**Do a right merge on the same data frames, which will keep all cases where we have gun ownership data, whether or not we have gun deaths data. Check the shape.**

Just to see one more example, if we use `how="inner"` the resulting dataframe will only include state-years where we have data on both gun ownership and gun deaths:

In [None]:
sy_inner = pd.merge(gd_withcode, rand_gunown, on=["state", "Year"], how="inner")
sy_inner.shape

Which makes sense because there are four years of overlap, and 50 states per year.

Another useful diagnostic is to check how many NaN entries there are in our "new" data. For the left merge, this is the HFR variable.

In [None]:
sy_left.head()

The `isna()` function returns True for NaN and False otherwise. So we can sum up `sy_left.HFR.isna()` to  count how many NaNs are in the column.

In [None]:
np.sum(sy_left.HFR.isna())

This counts rows (state-years); dividing by the 50 states per year, 6 years don't have gun ownership data.
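
Another diagnostic, sketched here on made-up data, is the `indicator=True` argument of `pd.merge`. It adds a `_merge` column saying whether each row was found in both dataframes or only one, which separates "no match" from "matched, but the value was already missing".

```python
import pandas as pd

# Made-up values: three years of deaths, one year of ownership
deaths = pd.DataFrame({"code": ["AL", "AL", "AL"],
                       "Year": [2016, 2017, 2018],
                       "Deathrate": [1.0, 2.0, 3.0]})
owners = pd.DataFrame({"code": ["AL"], "Year": [2016], "HFR": [0.5]})

checked = pd.merge(deaths, owners, on=["code", "Year"], how="left", indicator=True)

# Count rows with no ownership value, then tabulate the match status
n_missing = int(checked["HFR"].isna().sum())
print(n_missing)
print(checked["_merge"].value_counts())
```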

We can also count NaN's by year with the `pd.crosstab` function:

In [None]:
pd.crosstab(sy_left.Year, sy_left.HFR.isna())

We can now do a scatterplot with multiple years of data, with actual overlap in the year:

In [None]:
sns.scatterplot(x='HFR', y='Deathrate', data=sy_left)

And run a regression:

In [None]:
smf.ols("Deathrate ~ HFR", data=sy_left).fit().summary()

**Show that we get the same regression results using the `sy_inner` dataframe since `smf.ols` will drop cases with missing data (NaN) on any of the variables.**

**Run an analogous regression but add state fixed effects**

This means that, even keeping fixed the fact that some states are generally more violent than others, higher gun ownership is still associated with more gun deaths. This could still be partly driven by reverse causation, but is stronger evidence than just looking one year at a time that there is something causal going on.

## Part 6. Your turn

Now let's do the same exercise but by country (and just for one year).

You can get data on gun ownership by country <a href="https://worldpopulationreview.com/country-rankings/gun-ownership-by-country">here</a> and gun homicides <a href="https://worldpopulationreview.com/country-rankings/gun-deaths-by-country">here</a>.

So you don't have to sign up for their emails, here is a link to the <a href="https://berkeley.box.com/s/8fdflcgvlkrskj0fgi9fn3z51j1wl18d">ownership</a> csv, and the <a href="https://berkeley.box.com/s/bh0qg7sgtg9oy0ixsw0098kjsoyegcxg">homicides</a> csv. 

See if you can:
- Download both in .csv format
- Upload the .csv files to your datahub folder
- Merge the two data files, using `.shape` to check that the merge worked and how many countries have data from both files
- Create a scatterplot with gun ownership on the x axis and gun deaths on the y axis.

If you have time, create a new variable that indicates whether a country is in South America (either by merging in a new dataset or using the `isin()` function), and then make a graph that plots these countries in a different color. You can also see how excluding this region changes the results of a linear regression predicting gun deaths from gun ownership.
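
As a hint for the `isin()` route, here is a minimal sketch. The dataframe and its `country` column name are assumptions (check what the column is called in the files you download), and the list of South American countries is deliberately partial.

```python
import pandas as pd

# Hypothetical dataframe; your merged data will have more columns and rows
countries = pd.DataFrame({"country": ["Brazil", "Canada", "Chile", "Japan"]})

# Partial list for illustration; extend it to cover the whole region
south_america = ["Argentina", "Brazil", "Chile", "Colombia"]

# True if the country name appears in the list, False otherwise
countries["south_america"] = countries["country"].isin(south_america)
print(countries)
```

A True/False column like this can be passed to the `hue=` argument of `sns.scatterplot` to color the two groups differently.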

## Part 7. Reshaping [OPTIONAL]

A lot of the time the data we find online does not come in the right "shape". For example, we want our data here to be in a format where each row corresponds to a country/year (or state/year), with columns telling us which country and year the row corresponds to, and then the relevant variables for our analysis. We often call this "long" format, because with lots of combinations of countries/years the dataframes tend to get long. 

However, it is common for data to be in "wide" format, where each row corresponds to a state, and then there are different columns telling us the data values for each year. While the data we found so far was not in this format, we can simulate what this looks like by first making a pivot table (using the state level gun death data).

In [None]:
gd_clean.head()

In [None]:
gd_wide = gd_clean.pivot(index='code', columns='Year', values='Deathrate')
gd_wide

Another potential wide format would have years as rows and different states as columns.

**Make a wide version where rows are years and columns are states**

This is kind of useful because we can make plots by state easily (not that it's super hard in long format, where we just subset to the state we want).

In [None]:
sns.lineplot(x="Year", y="CO", data=gd_wide2)

Making the data as a pivot table creates a dataframe with some weird properties. If we had downloaded data as a .csv in wide format and then imported it we would get something more like this (don't worry about the details here).

In [None]:
gd_widedf = pd.DataFrame(gd_wide.to_records())
gd_widedf

To translate something like this back into "long" format, we can use the `pd.melt` function. This takes at least three arguments:
- the wide dataframe
- the column in the wide dataframe that identifies cases, as `id_vars=`. Here this will be "code".
- a list of the wide dataframe column names that hold the data, as `value_vars=`. Here these are the years.

In [None]:
gd_longagain = pd.melt(gd_widedf, id_vars="code", value_vars=['2005', '2014', '2015', '2016', '2017', '2018', '2019'])
gd_longagain

A trick to make this a bit more concise is to create an array with the relevant columns:

In [None]:
gd_widedf.columns

In [None]:
# Every column after "code" is a year column, so take them all
datacols = gd_widedf.columns[1:]
datacols

In [None]:
gd_longagain = pd.melt(gd_widedf, id_vars="code", value_vars=datacols)
gd_longagain

Note to really get back to our original data frame we would need to rename the `variable` and `value` columns. We could do this with a `.rename`, or by redoing our `.melt` with `var_name` and `value_name` arguments added:

In [None]:
gd_longagain = pd.melt(gd_widedf, id_vars="code", value_vars=datacols, 
                       var_name = "Year", value_name = "Deathrate")
gd_longagain
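
To check that the whole round trip works, here is a self-contained sketch on made-up data. It uses `reset_index()` rather than the `to_records()` step above, which is an alternative way to turn the pivot table into an ordinary dataframe. One detail to watch in either version: after melting, the year labels may no longer be stored as integers (with `to_records()` they come back as strings), so we convert them before comparing or merging on `Year`.

```python
import pandas as pd

# Made-up death rates in long format
long_df = pd.DataFrame({"code": ["AL", "AL", "AK", "AK"],
                        "Year": [2005, 2014, 2005, 2014],
                        "Deathrate": [1.0, 2.0, 3.0, 4.0]})

# Long -> wide: one row per state, one column per year
wide = long_df.pivot(index="code", columns="Year", values="Deathrate").reset_index()

# Wide -> long: every column other than "code" is treated as a year
back = pd.melt(wide, id_vars="code", var_name="Year", value_name="Deathrate")

# Make Year an integer again and sort so the rows line up
back = back.astype({"Year": "int64"}).sort_values(["code", "Year"]).reset_index(drop=True)
print(back)
```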

**Recreate the long dataframe by melting the `gd_widedf2` dataframe.**

## Part 8. Lags [OPTIONAL]

Another common thing we want to do when we have multiple years of data is create "lagged" versions of variables. For example, recall that one theory about gun ownership is that people tend to want to buy guns for safety. A good way to measure whether people feel the need to buy a gun is how dangerous one's country/state was in the recent past. A way we can measure this is with the gun death rate in the previous year. 

Let's first do this on our `sy_left` data.

In [None]:
sy_left.head()

As a first step, let's create a new variable to indicate the previous year:

In [None]:
sy_left["lYear"] = sy_left["Year"] - 1
sy_left.head()

Now we are going to create a copy of the data frame to match with this. We are only going to want the Year/code to match and the Deathrate/HFR to create the lagged data.

In [None]:
sy_lag = sy_left.drop(['lYear', "state"], axis=1)
sy_lag

Now we rename the year/Deathrate/HFR to indicate these are going to be the lagged data.

In [None]:
sy_lag= sy_lag.rename({"Year":"lYear", "Deathrate":"lDeathrate","HFR":"lHFR"}, axis=1)
sy_lag.head()

Here is a way to think about the current dataframe. If we are in Alabama, and "last year" is 2019, then the "last year" gun death rate was 22.2 out of 100,000. So if we do a merge between this and the original dataframe on `code` and `lYear`, we will add the last year death rate (and gun ownership) into the original.

Practically, by merging this into the original data file on the keys `lYear` and `code`, we will have the previous year data. 

**Do a left merge with `sy_left`, `sy_lag`, on `lYear` and `code`, and save it as `sy_withlag`**

Let's see what years we get our lagged death rate data

In [None]:
pd.crosstab(sy_withlag.Year, sy_withlag.lDeathrate.isna())

This makes sense: in 2014 the previous year is 2013, and we don't have data for that. In 2005, the previous year is 2004, and again we don't have data for that year. In general we always lose our "first" year when getting lags, and sometimes more.
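
The gap between 2005 and 2014 is also why we build lags with a merge here. The sketch below, on made-up data, contrasts the merge approach with pandas' `shift()` method, which is a common shortcut but simply takes the previous row, whatever year it belongs to.

```python
import pandas as pd

# Made-up death rates for one state, with a gap between 2005 and 2014
panel = pd.DataFrame({"code": ["AL"] * 3,
                      "Year": [2005, 2014, 2015],
                      "Deathrate": [1.0, 2.0, 3.0]})

# Merge-based lag: look up the row whose Year equals this row's Year - 1
panel["lYear"] = panel["Year"] - 1
lagged = panel[["code", "Year", "Deathrate"]].rename(
    {"Year": "lYear", "Deathrate": "lDeathrate"}, axis=1)
with_lag = pd.merge(panel, lagged, on=["code", "lYear"], how="left")

# Shift-based lag: previous row within each state, ignoring the gap
panel["shifted"] = panel.groupby("code")["Deathrate"].shift(1)

print(with_lag[["Year", "lDeathrate"]])
print(panel[["Year", "shifted"]])
```

The merge correctly leaves 2014 without a lag, while `shift()` would treat the 2005 value as if it were from 2013.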

Now do the same for the lagged gun ownership (HFR)

In [None]:
pd.crosstab(sy_withlag.Year, sy_withlag.lHFR.isna())

One test of the theory that gun ownership will be higher in dangerous places is to use gun ownership as the DV and lagged gun death rates as the IV:

In [None]:
smf.ols("HFR ~ lDeathrate", data=sy_withlag).fit().summary()

However, this isn't a super informative test: the current year death rate and previous year are pretty highly correlated:

In [None]:
sns.scatterplot(x=sy_withlag['lDeathrate'], y=sy_withlag['Deathrate'])

**Show that the current year ownership and previous year ownership are also highly correlated.**

As a result it's a bit tricky to know what is causing what here. It could be the case that current year ownership causes current year gun deaths (not vice versa), and we'd still see a positive correlation between previous year gun deaths and current year gun ownership.

A more precise way to test this is by looking at the change in gun ownership. We can compute this for years where we have both the current and lagged data.

In [None]:
sy_withlag['dHFR'] = sy_withlag['HFR'] - sy_withlag['lHFR']
sy_withlag['dDeathrate'] = sy_withlag['Deathrate'] - sy_withlag['lDeathrate']

In [None]:
smf.ols("dHFR ~ lDeathrate", data=sy_withlag).fit().summary()

There is no magic bullet, but it doesn't seem that gun ownership tends to go up in places that had more deaths in the previous year. 