```{index} Pandas
```

# Intro to Pandas

In [1]:
## Add image like ![](https://images.csmonitor.com/csm/2015/10/944693_1_1029%20panda%20diplomacy_standard.jpg?alias=standard_900x600nc)

**NOT THAT TYPE OF PANDAS**

![Pandas logo](https://pandas.pydata.org/_static/pandas_logo.png)

*"pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language"*

[https://pandas.pydata.org](https://pandas.pydata.org)


Rather than spending time explaining what Pandas does, let's just import it and start working with it. We will import it into the `pd` namespace:

In [2]:
import pandas as pd


```{index} see: dataframe; Pandas->dataframe
```

Pandas works with data in *dataframes*. A dataframe is similar to a spreadsheet or a database table. It is a two-dimensional structure in which each row corresponds to a single data entry: a set of information about a particular entity or observation. Each column stores one particular piece of information about these entries. Like tables in spreadsheets or databases, the columns can be labeled, and the rows can be indexed by consecutive integers or by the data in one of the columns. 
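
As a minimal sketch of this structure, the following builds a tiny dataframe by hand from a Python dictionary. The name `small_df` is illustrative only, and the three rows are copied from the data set used later in this section.

```python
import pandas as pd

# Each dictionary key becomes a column label;
# each list holds that column's values, one per row.
small_df = pd.DataFrame(
    {
        "state": ["Alabama", "Alaska", "Arizona"],
        "cases": [7068, 353, 7648],
        "population": [4903185, 731545, 7278717],
    }
)

# Two-dimensional: (number of rows, number of columns)
print(small_df.shape)  # (3, 3)

# Column labels
print(list(small_df.columns))  # ['state', 'cases', 'population']

# With no index specified, rows are labeled by consecutive integers
print(list(small_df.index))  # [0, 1, 2]
```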

```{index} Covid; data
```

We will work with the following data:

* **Cases** and **deaths** from COVID-19 were retrieved from The New York Times "Coronavirus (Covid-19) Data in the United States" GitHub repository at [https://github.com/nytimes/covid-19-data/blob/master/us-states.csv](https://github.com/nytimes/covid-19-data/blob/master/us-states.csv). Retrieved April 26, 2021.

* **Population** data is from "Annual Estimates of the Resident Population for the United States, Regions, States, and Puerto Rico: April 1, 2010 to July 1, 2019" 2010-2019 Population Estimates. United States Census Bureau, Population Division. December 30, 2019. Link: [https://www2.census.gov/programs-surveys/popest/tables/2010-2019/state/totals/nst-est2019-01.xlsx](https://www2.census.gov/programs-surveys/popest/tables/2010-2019/state/totals/nst-est2019-01.xlsx). Retrieved April 26, 2021.

* **Gross Domestic Product (GDP)** data is from the US Department of Commerce Bureau of Economic Analysis, Regional Data: GDP and Personal Income, Current-dollar Gross Domestic Product (GDP) (Millions of current dollars) 2019:Q4. GDP is a measure of total economic output of a state. Link: [https://apps.bea.gov/itable/iTable.cfm?ReqID=70&step=1&acrdn=1](https://apps.bea.gov/itable/iTable.cfm?ReqID=70&step=1&acrdn=1). Retrieved April 27, 2021.

* Data on **percentage of population living in urban areas** is from the United States Census Bureau, List of Population, Land Area, and Percent Urban and Rural in 2010 and Changes from 2000 to 2010, "Percent Urban and Rural in 2010 by State". We will refer to the percentage of a state's population living in urban areas as the *urban index*. Link: [https://www2.census.gov/geo/docs/reference/ua/PctUrbanRural_State.xls](https://www2.census.gov/geo/docs/reference/ua/PctUrbanRural_State.xls). Retrieved April 27, 2021.

To make it easier to start working with these data sets, I have merged the data into a single CSV file that is available at 

[https://raw.githubusercontent.com/jmshea/Foundations-of-Data-Science-with-Python/main/03-first-data/covid-merged.csv](https://raw.githubusercontent.com/jmshea/Foundations-of-Data-Science-with-Python/main/03-first-data/covid-merged.csv). 

Pandas provides a function, `read_csv`, for reading CSV files, and it can read them directly from a URL:

In [3]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/jmshea/"
    + "Foundations-of-Data-Science-with-Python/main/"
    + "03-first-data/covid-merged.csv"
)

Note that this CSV file can be reconstructed from the raw data by running the commands in the notebook "Download Covid Data.ipynb", which is in the Chapter 3 folder on the book's GitHub site at

[https://github.com/jmshea/Foundations-of-Data-Science-with-Python/blob/main/03-first-data/Download%20Covid%20Data.ipynb](https://github.com/jmshea/Foundations-of-Data-Science-with-Python/blob/main/03-first-data/Download%20Covid%20Data.ipynb)
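
`read_csv` also accepts a file-like object, which makes it easy to experiment without a network connection. The sketch below reads a short CSV from an in-memory string. The text in `csv_text` is a hand-typed stand-in, not the actual file contents, and it assumes the header row is `state,cases,population,gdp,urban`.

```python
import io

import pandas as pd

# Hand-typed stand-in for the first lines of the merged file (assumed layout)
csv_text = """state,cases,population,gdp,urban
Alabama,7068,4903185,230750.1,59.04
Alaska,353,731545,54674.7,66.02
"""

# StringIO wraps the string so that it behaves like an open file
sample_df = pd.read_csv(io.StringIO(csv_text))

# The first line of the CSV supplies the column labels
print(sample_df.columns.tolist())

# Pandas infers a data type for each column from its contents
print(sample_df.dtypes)
```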



```{index} Pandas; dataframe
```

## Working with Dataframes

Let's start by looking at the dataframe. When run inside of Jupyter, Pandas will pretty-print a dataframe if the dataframe variable is evaluated as the last line of a code cell. Note that the column labels are imported from the first line of the CSV, and the unlabeled column on the left-hand side is the index, which acts like labels for the rows and which defaults to consecutive integers, starting at row 0.

In [4]:
df

|   | state | cases | population | gdp | urban |
|---|---|---|---|---|---|
| 0 | Alabama | 7068 | 4903185 | 230750.1 | 59.04 |
| 1 | Alaska | 353 | 731545 | 54674.7 | 66.02 |
| 2 | Arizona | 7648 | 7278717 | 379018.8 | 89.81 |
| 3 | Arkansas | 3281 | 3017804 | 132596.4 | 56.16 |
| 4 | California | 50470 | 39512223 | 3205000.1 | 94.95 |
| 5 | Colorado | 15207 | 5758736 | 400863.4 | 86.15 |
| 6 | Connecticut | 27700 | 3565287 | 290703.0 | 87.99 |
| 7 | Delaware | 4734 | 973764 | 77879.4 | 83.3 |
| 8 | Florida | 33683 | 21477737 | 1126510.3 | 91.16 |
| 9 | Georgia | 25431 | 10617423 | 634137.5 | 75.07 |


In Pandas, particular columns can be retrieved by putting the name of the column in square brackets after the name of the dataframe. For instance, here is how to retrieve the content of the 'state' column:

In [5]:
df["state"]

0            Alabama
1             Alaska
2            Arizona
3           Arkansas
4         California
5           Colorado
6        Connecticut
7           Delaware
8            Florida
9            Georgia
10            Hawaii
11             Idaho
12          Illinois
13           Indiana
14              Iowa
15            Kansas
16          Kentucky
17         Louisiana
18             Maine
19          Maryland
20     Massachusetts
21          Michigan
22         Minnesota
23       Mississippi
24          Missouri
25           Montana
26          Nebraska
27            Nevada
28     New Hampshire
29        New Jersey
30        New Mexico
31          New York
32    North Carolina
33      North Dakota
34              Ohio
35          Oklahoma
36            Oregon
37      Pennsylvania
38      Rhode Island
39    South Carolina
40      South Dakota
41         Tennessee
42             Texas
43              Utah
44           Vermont
45          Virginia
46        Washington
47     West V

A column of a dataframe is returned as a Pandas `Series` object, which is a one-dimensional object (like a list) that also has a row index.

In [6]:
type(df["state"])

pandas.core.series.Series
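
To see what the row index adds compared with a plain list, the sketch below builds a small series by hand (the name `states` and its three entries are illustrative) and inspects its two parts separately.

```python
import pandas as pd

# A hand-built series with the default integer index
states = pd.Series(["Alabama", "Alaska", "Arizona"], name="state")

# The row labels
print(list(states.index))  # [0, 1, 2]

# The data stored under those labels
print(list(states.values))  # ['Alabama', 'Alaska', 'Arizona']

# Look up one entry by its row label
print(states.loc[1])  # Alaska
```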

We can retrieve multiple columns simultaneously by including them in a list within the square brackets. This results in double square brackets: the outer pair tells Pandas that we are selecting columns, and the inner pair forms the list of columns to be selected:

In [7]:
df[["state", "cases"]]

|   | state | cases |
|---|---|---|
| 0 | Alabama | 7068 |
| 1 | Alaska | 353 |
| 2 | Arizona | 7648 |
| 3 | Arkansas | 3281 |
| 4 | California | 50470 |
| 5 | Colorado | 15207 |
| 6 | Connecticut | 27700 |
| 7 | Delaware | 4734 |
| 8 | Florida | 33683 |
| 9 | Georgia | 25431 |


When we select multiple columns, the data is no longer one dimensional, and thus the returned data is a new dataframe, not a series:

In [8]:
type(df[["state", "cases"]])

pandas.core.frame.DataFrame
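
What decides the return type is whether the key is a list, not how many columns are named. The following sketch, using a hand-built two-row dataframe called `demo` (an illustrative name), shows that a one-element list still produces a dataframe.

```python
import pandas as pd

demo = pd.DataFrame({"state": ["Alabama", "Alaska"], "cases": [7068, 353]})

# A plain string key returns a one-dimensional Series
as_series = demo["state"]

# A one-element list returns a DataFrame with a single column
as_frame = demo[["state"]]

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (2, 1)
```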

We can also retrieve the contents of a row of a dataframe. The general approach is the same as retrieving a column, but we must index into the `.loc` attribute of the dataframe to retrieve a row:

In [9]:
df.loc[1]

state          Alaska
cases             353
population     731545
gdp           54674.7
urban           66.02
Name: 1, dtype: object

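We can also retrieve multiple rows by providing a range of index labels, written as a slice. Unlike ordinary Python slices, a slice passed to `.loc` includes both endpoints: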

In [10]:
df.loc[1:5]

|   | state | cases | population | gdp | urban |
|---|---|---|---|---|---|
| 1 | Alaska | 353 | 731545 | 54674.7 | 66.02 |
| 2 | Arizona | 7648 | 7278717 | 379018.8 | 89.81 |
| 3 | Arkansas | 3281 | 3017804 | 132596.4 | 56.16 |
| 4 | California | 50470 | 39512223 | 3205000.1 | 94.95 |
| 5 | Colorado | 15207 | 5758736 | 400863.4 | 86.15 |
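
Rows that are not adjacent can be selected by passing `.loc` a list of labels. The sketch below uses a hand-built six-row dataframe (`demo` is an illustrative name) to contrast a label slice, a label list, and the position-based `.iloc` indexer, which follows the usual Python convention of excluding the end of a slice.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "state": ["Alabama", "Alaska", "Arizona",
                  "Arkansas", "California", "Colorado"],
        "cases": [7068, 353, 7648, 3281, 50470, 15207],
    }
)

# Label slice: rows labeled 1 through 3, including 3
by_slice = demo.loc[1:3]
print(by_slice["state"].tolist())  # ['Alaska', 'Arizona', 'Arkansas']

# Label list: any rows, in the order listed
by_list = demo.loc[[0, 2, 5]]
print(by_list["state"].tolist())  # ['Alabama', 'Arizona', 'Colorado']

# Position slice: positions 1 and 2 only; the end position is excluded
by_position = demo.iloc[1:3]
print(by_position["state"].tolist())  # ['Alaska', 'Arizona']
```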


We can combine both row and column selection by using `.loc` and specifying the desired columns after a comma inside the square brackets:

In [11]:
df.loc[1:5, ["state", "cases"]]

|   | state | cases |
|---|---|---|
| 1 | Alaska | 353 |
| 2 | Arizona | 7648 |
| 3 | Arkansas | 3281 |
| 4 | California | 50470 |
| 5 | Colorado | 15207 |
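
The same row-comma-column pattern handles a few other common cases. This sketch, again on a small hand-built dataframe named `demo` for illustration, pulls out a single value, a single column for selected rows, and all rows for selected columns.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "state": ["Alabama", "Alaska", "Arizona"],
        "cases": [7068, 353, 7648],
        "population": [4903185, 731545, 7278717],
    }
)

# One row label and one column label give a single value
one_value = demo.loc[1, "cases"]
print(one_value)  # 353

# A list of row labels and one column label give a Series
some_cases = demo.loc[[0, 2], "cases"]
print(some_cases.tolist())  # [7068, 7648]

# A bare colon selects every row
all_rows = demo.loc[:, ["state", "cases"]]
print(all_rows.shape)  # (3, 2)
```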


It is often convenient to use the values in one of the columns to provide a more meaningful index for the rows. Pandas does not require that the entries in the index be unique, but using non-unique entries can significantly impact performance and limit the ability to retrieve data by the index values. To set the index, use the dataframe's `set_index` method:

In [12]:
df.set_index("state")

| state | cases | population | gdp | urban |
|---|---|---|---|---|
| Alabama | 7068 | 4903185 | 230750.1 | 59.04 |
| Alaska | 353 | 731545 | 54674.7 | 66.02 |
| Arizona | 7648 | 7278717 | 379018.8 | 89.81 |
| Arkansas | 3281 | 3017804 | 132596.4 | 56.16 |
| California | 50470 | 39512223 | 3205000.1 | 94.95 |
| Colorado | 15207 | 5758736 | 400863.4 | 86.15 |
| Connecticut | 27700 | 3565287 | 290703.0 | 87.99 |
| Delaware | 4734 | 973764 | 77879.4 | 83.3 |
| Florida | 33683 | 21477737 | 1126510.3 | 91.16 |
| Georgia | 25431 | 10617423 | 634137.5 | 75.07 |
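
To illustrate why a non-unique index limits retrieval by label, the sketch below indexes a toy dataframe in which one state appears twice. The name `reports` and the numbers in it are made up for this example and are not taken from the Covid data set.

```python
import pandas as pd

# Made-up toy data: "Florida" appears in two rows
reports = pd.DataFrame(
    {"state": ["Florida", "Georgia", "Florida"], "cases": [10, 20, 30]}
).set_index("state")

# Pandas can report whether the index labels are all distinct
print(reports.index.is_unique)  # False

# A repeated label returns every matching row, not a single row
florida = reports.loc["Florida"]
print(type(florida).__name__)     # DataFrame
print(florida["cases"].tolist())  # [10, 30]
```

In the state data used here, each state appears exactly once, so the `state` column is a safe choice of index.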


Note that by default the `set_index` method returns a new dataframe and the original dataframe is unchanged:

In [13]:
df

|   | state | cases | population | gdp | urban |
|---|---|---|---|---|---|
| 0 | Alabama | 7068 | 4903185 | 230750.1 | 59.04 |
| 1 | Alaska | 353 | 731545 | 54674.7 | 66.02 |
| 2 | Arizona | 7648 | 7278717 | 379018.8 | 89.81 |
| 3 | Arkansas | 3281 | 3017804 | 132596.4 | 56.16 |
| 4 | California | 50470 | 39512223 | 3205000.1 | 94.95 |
| 5 | Colorado | 15207 | 5758736 | 400863.4 | 86.15 |
| 6 | Connecticut | 27700 | 3565287 | 290703.0 | 87.99 |
| 7 | Delaware | 4734 | 973764 | 77879.4 | 83.3 |
| 8 | Florida | 33683 | 21477737 | 1126510.3 | 91.16 |
| 9 | Georgia | 25431 | 10617423 | 634137.5 | 75.07 |
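
One way to keep the re-indexed result without touching the original is to assign it to a new name. The sketch below does this on a hand-built two-row dataframe (the names `demo`, `indexed`, and `restored` are illustrative) and also shows `reset_index`, which moves the index back into an ordinary column.

```python
import pandas as pd

demo = pd.DataFrame({"state": ["Alabama", "Alaska"], "cases": [7068, 353]})

# set_index returns a new dataframe; keep it under a new name
indexed = demo.set_index("state")

# The original still has the default integer index
print(demo.index.tolist())  # [0, 1]

# The new dataframe is indexed by state
print(indexed.index.tolist())  # ['Alabama', 'Alaska']

# reset_index undoes the change, restoring "state" as a column
restored = indexed.reset_index()
print(restored.columns.tolist())  # ['state', 'cases']
```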


If we wish to change the original dataframe instead, we can tell Pandas to set the index *in place*:

In [14]:
df.set_index("state", inplace=True)

This makes finding the data by state much easier:

In [15]:
df.loc["Florida"]

cases            33683.00
population    21477737.00
gdp            1126510.30
urban               91.16
Name: Florida, dtype: float64
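
The case count above is displayed as `33683.00` because a row is returned as a single series, and a series has one data type: when integer and floating-point columns are combined, the integers are converted to floats. The sketch below reproduces this on a one-row dataframe built by hand from the Florida values (`demo` is an illustrative name) and shows that selecting the row and column together keeps the integer.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "state": ["Florida"],
        "cases": [33683],
        "population": [21477737],
        "gdp": [1126510.3],
        "urban": [91.16],
    }
).set_index("state")

# A whole row mixes integer and float columns, so it is stored as floats
row = demo.loc["Florida"]
print(row.dtype)  # float64

# Selecting the row and the column together returns the stored integer
cases = demo.loc["Florida", "cases"]
print(cases)  # 33683
```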

Note that the index (i.e., the set of row labels) carries over to the Pandas series that is returned by indexing a particular column of the dataframe:

In [16]:
df["cases"]

state
Alabama             7068
Alaska               353
Arizona             7648
Arkansas            3281
California         50470
Colorado           15207
Connecticut        27700
Delaware            4734
Florida            33683
Georgia            25431
Hawaii               609
Idaho               2016
Illinois           52918
Indiana            18099
Iowa                7145
Kansas              4305
Kentucky            4708
Louisiana          28044
Maine               1095
Maryland           21825
Massachusetts      62205
Michigan           41348
Minnesota           5136
Mississippi         6815
Missouri            7563
Montana              452
Nebraska            4332
Nevada              5053
New Hampshire       2146
New Jersey        118652
New Mexico          3411
New York          309696
North Carolina     10507
North Dakota        1067
Ohio               18027
Oklahoma            3618
Oregon              2510
Pennsylvania       48224
Rhode Island        8621
South Carolina     

If all we want is the numerical values in the data series, we can convert it to a list or a NumPy array:

In [19]:
print(list(df["cases"]))

[7068, 353, 7648, 3281, 50470, 15207, 27700, 4734, 33683, 25431, 609, 2016, 52918, 18099, 7145, 4305, 4708, 28044, 1095, 21825, 62205, 41348, 5136, 6815, 7563, 452, 4332, 5053, 2146, 118652, 3411, 309696, 10507, 1067, 18027, 3618, 2510, 48224, 8621, 6095, 2450, 10506, 29072, 4672, 866, 15846, 14814, 1126, 6973, 559]


In [18]:
import numpy as np

np.array(df["cases"])

array([  7068,    353,   7648,   3281,  50470,  15207,  27700,   4734,
        33683,  25431,    609,   2016,  52918,  18099,   7145,   4305,
         4708,  28044,   1095,  21825,  62205,  41348,   5136,   6815,
         7563,    452,   4332,   5053,   2146, 118652,   3411, 309696,
        10507,   1067,  18027,   3618,   2510,  48224,   8621,   6095,
         2450,  10506,  29072,   4672,    866,  15846,  14814,   1126,
         6973,    559])
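
A series also provides its own conversion methods, `tolist` and `to_numpy`, which give the same results as the calls above. The sketch below applies them to a three-entry series built by hand from the first rows of the data (`cases` is an illustrative name).

```python
import pandas as pd

cases = pd.Series(
    [7068, 353, 7648],
    index=["Alabama", "Alaska", "Arizona"],
    name="cases",
)

# tolist() returns a plain Python list; the index labels are dropped
as_list = cases.tolist()
print(as_list)  # [7068, 353, 7648]

# to_numpy() returns a one-dimensional NumPy array of the values
as_array = cases.to_numpy()
print(as_array.shape)  # (3,)
```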