In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Working with Data: Pandas

![](https://images.csmonitor.com/csm/2015/10/944693_1_1029%20panda%20diplomacy_standard.jpg?alias=standard_900x600nc)

**NOT THAT TYPE OF PANDAS**

![Pandas logo](https://pandas.pydata.org/_static/pandas_logo.png)

*"pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language"*

https://pandas.pydata.org


How to get started?

In [2]:
import pandas as pd

Make sure you have the file "firearms-combined.csv" in the current directory. If not, copy it here or change to the directory that contains it!

In [3]:
ls

Lecture 9.ipynb        firearms-combined.csv  lecture9.pdf


Now let's read the data from the CSV file into a dataframe:

In [5]:
# CSV (comma-separated values) is a common text format for data. Fixed-width files,
# where every field takes up a fixed number of characters, are another, but CSV is easier to read.
df = pd.read_csv("firearms-combined.csv")

### What is a dataframe?

A dataframe is like an Excel spreadsheet within Python:

In [6]:
df

Unnamed: 0,STATE,RATE-2005,RATE-2014,Total Laws 2014
0,AL,16.0,16.9,10
1,AK,17.5,19.2,3
2,AZ,16.1,13.5,8
3,AR,15.7,16.6,11
4,CA,9.5,7.4,100
5,CO,11.6,12.2,30
6,CT,5.3,5.0,85
7,DE,8.8,11.1,38
8,FL,10.0,11.5,21
9,GA,12.1,13.7,6


It is a two-dimensional set of data, where the rows and columns can have labels. We can retrieve the data using these labels:

In [8]:
df['STATE']

0     AL
1     AK
2     AZ
3     AR
4     CA
5     CO
6     CT
7     DE
8     FL
9     GA
10    HI
11    ID
12    IL
13    IN
14    IA
15    KS
16    KY
17    LA
18    ME
19    MD
20    MA
21    MI
22    MN
23    MS
24    MO
25    MT
26    NE
27    NV
28    NH
29    NJ
30    NM
31    NY
32    NC
33    ND
34    OH
35    OK
36    OR
37    PA
38    RI
39    SC
40    SD
41    TN
42    TX
43    UT
44    VT
45    VA
46    WA
47    WV
48    WI
49    WY
Name: STATE, dtype: object

Note that a column of a dataframe is returned as a Pandas series:

In [9]:
type(df['STATE'])

pandas.core.series.Series

A Pandas series is a one-dimensional data object with row labels.

When you import from a CSV file, the column labels are imported, but the row labels are just the numbers of the data rows:

In [10]:
df.loc[8]

STATE                FL
RATE-2005            10
RATE-2014          11.5
Total Laws 2014      21
Name: 8, dtype: object

It is often convenient to use the values in one of the columns as the labels of the rows. We call these the *index* for the rows:

In [11]:
df.set_index('STATE')

STATE,RATE-2005,RATE-2014,Total Laws 2014
AL,16.0,16.9,10
AK,17.5,19.2,3
AZ,16.1,13.5,8
AR,15.7,16.6,11
CA,9.5,7.4,100
CO,11.6,12.2,30
CT,5.3,5.0,85
DE,8.8,11.1,38
FL,10.0,11.5,21
GA,12.1,13.7,6


Note that this actually returns a new dataframe; the original dataframe is unchanged:

In [12]:
df

Unnamed: 0,STATE,RATE-2005,RATE-2014,Total Laws 2014
0,AL,16.0,16.9,10
1,AK,17.5,19.2,3
2,AZ,16.1,13.5,8
3,AR,15.7,16.6,11
4,CA,9.5,7.4,100
5,CO,11.6,12.2,30
6,CT,5.3,5.0,85
7,DE,8.8,11.1,38
8,FL,10.0,11.5,21
9,GA,12.1,13.7,6


If we want `df` itself to use the new index, we have to assign the result back to `df`:

In [15]:
# Run this cell only once: a second run raises a KeyError because STATE is no longer a column
df = df.set_index("STATE")
df

This makes indexing much easier, because rows can now be looked up by state abbreviation instead of by row number.
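As a minimal sketch of label-based lookup, the cell here rebuilds a few rows of the table by hand so that it runs without the CSV file. The name `sample` is only for illustration and is not part of the lecture's data.

```python
import pandas as pd

# A few rows copied by hand from the table shown earlier (illustration only)
sample = pd.DataFrame({
    "STATE": ["AL", "AK", "CA", "FL"],
    "RATE-2005": [16.0, 17.5, 9.5, 10.0],
    "RATE-2014": [16.9, 19.2, 7.4, 11.5],
    "Total Laws 2014": [10, 3, 100, 21],
}).set_index("STATE")

# Look up a whole row by its state abbreviation
fl_row = sample.loc["FL"]

# Look up a single value by row label and column label
fl_2014 = sample.loc["FL", "RATE-2014"]

print(fl_row)
print(fl_2014)
```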

Note that the row labels carry over to the Pandas series that is returned by indexing a particular column of the dataframe:
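A small sketch of this behavior, again using hand-typed rows (the name `sample` is an assumption made for illustration):

```python
import pandas as pd

# Three rows typed in by hand, with STATE already used as the index
sample = pd.DataFrame(
    {"RATE-2005": [16.0, 17.5, 9.5], "RATE-2014": [16.9, 19.2, 7.4]},
    index=pd.Index(["AL", "AK", "CA"], name="STATE"),
)

rates = sample["RATE-2014"]  # a Series that keeps the STATE labels
print(rates)
print(rates["AK"])           # so the Series can be indexed by state as well
```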

If all we want is the numerical values in the data series, we can convert it to a list:
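One way to do this, sketched on a small hand-typed series (the values are from the table shown earlier; the variable names are illustrative):

```python
import pandas as pd

rates = pd.Series([16.9, 19.2, 7.4], index=["AL", "AK", "CA"], name="RATE-2014")

values = list(rates)          # the values only; the state labels are dropped
also_values = rates.tolist()  # the equivalent Series method
print(also_values)
```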

Now that we know some basics of working with Pandas dataframes and series, we can begin to work with real data:

## Effect of 2004 Assault Weapons Ban

In 2004, the United States Congress passed a ban on a variety of semiautomatic rifles that are sometimes referred to as "assault weapons". The ban was in effect for 10 years, from 2004 to 2014.

It might be guessed that the goal of any gun ban is to reduce gun violence. Thus, it is natural to assess whether the "assault weapon" ban had any effect on gun violence.

Fortunately, the Centers for Disease Control and Prevention's National Center for Health Statistics tracks firearm mortality at the state level. Visualizations of firearm mortality by state, along with links to download the data, are available here:

https://www.cdc.gov/nchs/pressroom/sosmap/firearm_mortality/firearm.htm

Although this page does not have data prior to 2005, the data for 2005 should be similar to that before the ban because the ban was only on the **sale** of certain firearms. It would take many years for this ban to actually affect the availability of firearms.

Thus, we can use two sets of data on that page to measure the effect of the "assault weapons" ban:

* The 2005 data set represents firearm mortality before the ban
* The 2014 data set represents firearm mortality after the ban has been in effect for a decade

Let's begin by plotting this data:

In [4]:
# This step is not necessary but will simplify our work later:
rate2005 = list(df["RATE-2005"])
rate2014 = list(df["RATE-2014"])

Generally, one of the first visualizations to undertake with any data set is a scatter plot. Since we have two one-dimensional data sets, we can plot each of them on the y-axis and use the x-axis to indicate which data set each point comes from:
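A sketch of such a plot, assuming only the first ten states from the table (typed in by hand so the cell runs without the CSV file). The x positions 1 and 2 are an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# First ten states from the table, typed in by hand
rate2005 = [16.0, 17.5, 16.1, 15.7, 9.5, 11.6, 5.3, 8.8, 10.0, 12.1]
rate2014 = [16.9, 19.2, 13.5, 16.6, 7.4, 12.2, 5.0, 11.1, 11.5, 13.7]

fig, ax = plt.subplots()
# x = 1 marks the 2005 data and x = 2 marks the 2014 data
ax.scatter(np.ones(len(rate2005)), rate2005)
ax.scatter(2 * np.ones(len(rate2014)), rate2014)
ax.set_xticks([1, 2])
ax.set_xticklabels(["2005", "2014"])
ax.set_xlim(0, 3)
ax.set_ylabel("Firearm mortality rate")
```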

What inferences might you make from this plot?

We could also scatter plot them versus the index of the state they came from:
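A sketch using the same ten hand-typed values, where the x-axis is simply the row number of each state:

```python
import numpy as np
import matplotlib.pyplot as plt

# First ten states from the table, typed in by hand
rate2005 = [16.0, 17.5, 16.1, 15.7, 9.5, 11.6, 5.3, 8.8, 10.0, 12.1]
rate2014 = [16.9, 19.2, 13.5, 16.6, 7.4, 12.2, 5.0, 11.1, 11.5, 13.7]

state_index = np.arange(len(rate2005))  # 0, 1, 2, ... one position per state

fig, ax = plt.subplots()
ax.scatter(state_index, rate2005, label="2005")
ax.scatter(state_index, rate2014, label="2014")
ax.set_xlabel("State index")
ax.set_ylabel("Firearm mortality rate")
ax.legend()
```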

This is a jumbled mess. We could try to provide some order by sorting the lists. However, with two lists, it does not make sense to sort them separately, because that would break the pairing between each state's 2005 and 2014 values. Instead, we can get the list of indices that would sort one list and then use that to order both lists.

To do that, we need to first understand a NumPy capability called *fancy indexing*.

Fancy indexing allows you to retrieve members of an array in any order by just indexing with a list of the indices:

In [None]:
squares=np.array([0,1,4,9,16,25,36,49])

To retrieve the squares of the prime numbers, I can make a list of primes and then use that as an index:

In [None]:
primes=[2,3,5,7]
squares[primes]


The index list does not have to be in increasing order:

In [None]:
rprimes=primes[::-1]
rprimes
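A self-contained sketch that puts these pieces together: the result follows the order of the index list, whatever that order is.

```python
import numpy as np

squares = np.array([0, 1, 4, 9, 16, 25, 36, 49])
primes = [2, 3, 5, 7]
rprimes = primes[::-1]    # the same indices in decreasing order

print(squares[primes])    # squares of the primes, smallest first
print(squares[rprimes])   # the same squares, in the order given by rprimes
```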

```numpy.argsort``` returns an array of the indices that would sort an array:

In [None]:
rate2005=np.array(rate2005)
rate2014=np.array(rate2014)
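A sketch of how `argsort` and fancy indexing combine, using the first ten states typed in by hand. Sorting by the 2005 rates is an assumption made here for illustration; sorting by the 2014 rates would work the same way.

```python
import numpy as np
import matplotlib.pyplot as plt

# First ten states from the table, typed in by hand
rate2005 = np.array([16.0, 17.5, 16.1, 15.7, 9.5, 11.6, 5.3, 8.8, 10.0, 12.1])
rate2014 = np.array([16.9, 19.2, 13.5, 16.6, 7.4, 12.2, 5.0, 11.1, 11.5, 13.7])

order = np.argsort(rate2005)   # indices that put the 2005 rates in increasing order
sorted2005 = rate2005[order]   # fancy indexing applies that order...
sorted2014 = rate2014[order]   # ...to both arrays, so each state's values stay paired

fig, ax = plt.subplots()
positions = np.arange(len(order))
ax.scatter(positions, sorted2005, label="2005")
ax.scatter(positions, sorted2014, label="2014")
ax.set_xlabel("States, ordered by 2005 rate")
ax.set_ylabel("Firearm mortality rate")
ax.legend()
```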

What inferences might you make from this plot?

Another common visualization is to look at a histogram of the data. Unlike the histograms we previously generated, this data takes on real values, not just integers. Fortunately, Matplotlib has functions to do the hard work of making histograms for us:
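A minimal sketch with Matplotlib's `hist`, again assuming only the first ten states typed in by hand:

```python
import matplotlib.pyplot as plt

# First ten states from the table, typed in by hand
rate2005 = [16.0, 17.5, 16.1, 15.7, 9.5, 11.6, 5.3, 8.8, 10.0, 12.1]
rate2014 = [16.9, 19.2, 13.5, 16.6, 7.4, 12.2, 5.0, 11.1, 11.5, 13.7]

fig, ax = plt.subplots()
ax.hist(rate2005)  # by default, the data range is split into 10 equal-width bins
ax.hist(rate2014)  # drawn on the same axes, so the bars overlap
```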

Some styling will help make this more legible:
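One possible set of styling choices, offered as a sketch rather than the only option: partial transparency so overlapping bars stay visible, plus axis labels and a legend.

```python
import matplotlib.pyplot as plt

# First ten states from the table, typed in by hand
rate2005 = [16.0, 17.5, 16.1, 15.7, 9.5, 11.6, 5.3, 8.8, 10.0, 12.1]
rate2014 = [16.9, 19.2, 13.5, 16.6, 7.4, 12.2, 5.0, 11.1, 11.5, 13.7]

fig, ax = plt.subplots()
ax.hist(rate2005, alpha=0.5, label="2005")  # alpha < 1 makes the bars see-through
ax.hist(rate2014, alpha=0.5, label="2014")
ax.set_xlabel("Firearm mortality rate")
ax.set_ylabel("Number of states")
ax.legend()
```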

Each bar of the histogram represents a "bin" of data values. In fact, the counts and bin edges are returned by the `hist` function.
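A short sketch that captures those return values (ten hand-typed values stand in for the full data set):

```python
import matplotlib.pyplot as plt

# First ten states from the table, typed in by hand
rate2005 = [16.0, 17.5, 16.1, 15.7, 9.5, 11.6, 5.3, 8.8, 10.0, 12.1]

fig, ax = plt.subplots()
counts, edges, bars = ax.hist(rate2005)

print(counts)  # how many values fell into each bin
print(edges)   # bin boundaries; there is always one more edge than there are bins
```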

We can easily change the number of bins to provide more resolution:
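A sketch using the `bins` argument of `hist`. The value 20 is an arbitrary choice, and only ten hand-typed values are used here, so at least half of the bins are necessarily empty.

```python
import matplotlib.pyplot as plt

# First ten states from the table, typed in by hand
rate2005 = [16.0, 17.5, 16.1, 15.7, 9.5, 11.6, 5.3, 8.8, 10.0, 12.1]

fig, ax = plt.subplots()
counts, edges, bars = ax.hist(rate2005, bins=20)  # 20 equal-width bins instead of 10
print(counts)
```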

However, it does not make sense to make the number of bins very large compared to the data size, because most bins would then hold zero or one value.

What inferences might you make from this plot?

# Summary Statistics

Summary statistics are values calculated from sample data that measure some characteristic of the data.

The most commonly used summary statistic is ?

Both Pandas and Numpy provide methods to calculate the sample mean:
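A minimal sketch of both approaches, computed on the first ten 2005 rates typed in by hand (so the result is the mean of those ten states only, not of all fifty):

```python
import numpy as np
import pandas as pd

# First ten 2005 rates from the table, typed in by hand
rates = pd.Series([16.0, 17.5, 16.1, 15.7, 9.5, 11.6, 5.3, 8.8, 10.0, 12.1],
                  name="RATE-2005")

pandas_mean = rates.mean()              # method on the Pandas series
numpy_mean = np.mean(rates.to_numpy())  # NumPy function on the underlying array

print(pandas_mean, numpy_mean)
```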