# A note about accessing the course files through GitHub

All of the lecture notes will be posted on the [class GitHub repo](https://github.com/iamwfx/4680_5680_intro_uds). 

We are (probably) not going over git and GitHub during class. If you're familiar with git/GitHub, feel free to clone the repo to get the new lecture materials for each class. Otherwise, I recommend you do the following: 
- Create a class folder and name it `Intro_UDS`
- For each week, create a folder called `Week1`, `Week2`, etc. 
- To download the materials (Jupyter notebook, data files, etc.), navigate to the file you want to download and select `Raw` above the view of the file. 
- Save this "raw" file in your class folder and **make sure it is in the proper file format**. For instance, if the file is `.ipynb` make sure you save the downloaded file as `.ipynb` (your computer might try to default to `.txt`.)

# Learning goals
After this week's lesson you should be able to:
- Explain what a Pandas Series is and how to select, filter, and replace values in the series 
- Read and explore tabular data in Python using a Pandas DataFrame
- Read and write data to a .csv text file

This week's lessons are adapted from:
- [Geo-Python Lesson 5](https://geo-python-site.readthedocs.io/en/latest/lessons/L5/overview.html)
- [Practical Data Science on Pandas Series](https://www.practicaldatascience.org/html/pandas_series.html)
- [Practical Data Science on Pandas DataFrames](https://www.practicaldatascience.org/html/30_pandas_dataframes.html)

# Grading
Each exercise will be graded based on the following rubric:
- 5 points. Completed all the tasks, and the code was well documented and explained.
- 4 points. Completed all the tasks with a minor mismatch with the expected results (less than 10%).
- 3 points. Completed all the tasks with some mismatch with the expected results (more than 10% but less than 50%).
- 2 points. Completed all the tasks with a major mismatch with the expected results (over 50%).
- 1 point. Made an attempt but didn’t finish any of the exercises.
- 0 points. Did not complete the exercise.

# 0. What is Pandas? 

[Pandas](http://pandas.pydata.org/) is a widely used Python library for data analysis. 

### Easy-to-use data structures
In pandas, the data is typically stored in a data structure called a
DataFrame that looks like a typical table with rows and columns (+
indices and column names), where columns can contain data of different
data types. Thus, it is similar in some sense to how data is stored in
Excel or in R, which also uses a concept of a dataframe. In fact, Wes
McKinney first [developed pandas as an alternative for
R](https://blog.quantopian.com/meet-quantopians-newest-advisor-wes-mckinney/)
to deal with different complex data structures.

### Summary of the pandas data structure
Below is a picture of how data is structured in a DataFrame. You can see it is a tabular structure with rows, columns, and a (row) index, which is in many ways similar to what you might be familiar with from Excel or Google Sheets. Using this structure, we can apply operations like arithmetic, selecting columns and rows, adding columns and rows, etc.

![Alt text](img/creating_dataframe1.png)

### Combines functionalities from many Python modules

pandas takes advantage of the [NumPy](http://www.numpy.org/) module
under the hood, which is mostly written in C. This makes it a fast and
powerful library that can efficiently handle even very large datasets.
pandas offers an easier and more intuitive syntax to do data analysis
and manipulation using either NumPy functionalities in the background or dedicated functionalities written explicitly for pandas. However, pandas is much more than an easier-to-use NumPy, as it also builds on other Python libraries such as [matplotlib (plotting)](https://matplotlib.org/) and [scipy (mathematics, science, engineering)](https://www.scipy.org/). Thus, you can use some features of those packages through pandas itself (for example, plotting with `.plot()`) without importing them yourself, although they still need to be installed.

### Supports data read/write from multiple formats
One of the most useful features of pandas is its ability to read data
from numerous different data formats directly. For example, pandas
supports reading and writing data from/to:

-   CSV
-   JSON
-   HTML
-   MS Excel
-   HDF5
-   Stata
-   SAS
-   Python Pickle format
-   SQL (Postgresql, MySQL, Oracle, MariaDB, etc.)

You can view the full list of supported data formats from the [pandas
docs](https://pandas.pydata.org/docs/user_guide/io.html).


# 1. Pandas Series
The building blocks of Pandas DataFrames are Pandas **Series**. A Series is an ordered list of values, generally all of the same type. If you're familiar with NumPy arrays, a Series is essentially a one-dimensional NumPy array with an index and some added features. 

There are lots of ways to create Series, but the easiest is to just pass a list or an array to the pd.Series constructor.

Let's create a series for the [top 5 metropolitan statistical areas in the U.S. by population according to 2021 estimates](https://en.wikipedia.org/wiki/Metropolitan_statistical_area). 

1. New York-Newark-Jersey City, NY-NJ-CT-PA MSA (19,768,458)
2. Los Angeles-Long Beach-Anaheim, CA MSA (12,997,353)
3. Chicago-Naperville-Elgin, IL-IN-WI MSA (9,509,934)
4. Dallas-Fort Worth-Arlington, TX MSA (7,759,615)
5. Houston-The Woodlands-Sugar Land, TX MSA (7,206,841)

In [1]:
### Let's first import the pandas library
### We will use the alias pd for pandas to make it easier to type. 
import pandas as pd 

In [None]:
### Let's create a pandas series from a list
population = pd.Series([19768458, 12997353, 9509934, 7759615, 7206841])
population


## 1.1 Indices
Somewhat different from a standard list or array, we can include an index value associated with each row from this series. 

In [None]:
### Let's create a pandas series from a list that has an index
population = pd.Series([19768458, 12997353, 9509934, 7759615, 7206841],    
            index=['NYC', 'LA', 'Chicago', 'Dallas', 'Houston'])
population


Now, you can see that there are index values associated with each population count. You can access a Series’ index with the `.index` property: 

In [None]:
population.index

Unlike tabular data in spreadsheet software, the indices stay with each row. So, say you wanted to sort your values from smallest to largest, like so; the row indices will be reordered as well so that they stay with their row values. 

In [None]:
population.sort_values()

## 1.2 Subsetting a Series
Very often, we will want to subset or filter our series based on certain criteria like the position or range of values, or [logicals](https://www.geeksforgeeks.org/python-logical-operators-with-examples-improvement-needed/) such as whether values are bigger/smaller than a certain number. 

### 1.2.1 `.iloc`
In order to select the row(s) based on their position in the table, we use the `.iloc` indexer. For instance, if I want to select the first value in my `population` Series, I can type: 

Note: In Python, we start our count at 0. So 0 = first, 1 = second, etc.

In [None]:
population.iloc[0]

If I wanted to select the first three row values, I can use the `:` to select a range of values `0:3`. 

Be careful, using **integer ranges excludes the last value in the range**. So, `0:3` gives us the first **three** values even though you might think you're selecting up to the fourth row value. 

In [None]:
population.iloc[0:3]

### 1.2.2 `.loc`
You can also select a range of rows by the index values using the function `.loc`. 

In [None]:
population.loc['NYC']

And in the same way, we can also select ranges. Using index label ranges, this **includes** the last value of the range: 


In [None]:
population.loc['NYC':'Chicago']
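
To make the difference between the two range rules concrete, here is a minimal sketch that rebuilds the same `population` Series and selects the first three cities both ways. The variable names `by_position` and `by_label` are just illustrative.

```python
import pandas as pd

population = pd.Series(
    [19768458, 12997353, 9509934, 7759615, 7206841],
    index=['NYC', 'LA', 'Chicago', 'Dallas', 'Houston'],
)

# Integer positions: the end of the range (position 3) is excluded
by_position = population.iloc[0:3]

# Index labels: the end of the range ('Chicago') is included
by_label = population.loc['NYC':'Chicago']

print(by_position)
print(by_label)
print(by_position.equals(by_label))  # True: both hold NYC, LA, Chicago
```

So `0:3` and `'NYC':'Chicago'` pick out the same three rows here, even though one range stops before its end point and the other stops on it.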

### 1.2.3 Subsetting based on logicals
Another criteria we can use is a logical condition for which each row value will return a Boolean, either `True` or `False`. 

Say I wanted to find all the cities in my population series that have populations larger than 8 million:

In [None]:
larger_than_8m = population > 8000000
larger_than_8m


The top three MSAs have all returned `True`, while the Dallas and Houston MSAs, which have populations under 8 million, return `False`.

We can apply this condition to our series to get the subset of the series that has a `True` value for the logical we've described: 

In [None]:
population.loc[larger_than_8m]

### 1.2.4 The Single Square Brackets `([])` 
There is yet another way to subset and filter data: simply add square brackets after your Series (or DataFrame). It is pandas' way of simplifying subsetting and filtering.

If you want to select based on the index value, the following will work: 

In [None]:
population['Chicago']

You can also pass a logical: 

In [None]:
population[population>8000000]    

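However, if your index is not integer-based, passing an integer to the square brackets has historically worked like `.iloc`. Recent versions of pandas deprecate this behavior and show a warning, so prefer `population.iloc[2]` when you mean a position: 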

In [None]:
population[2]

### 1.2.5 Types of Series
Each series can hold one of several data types:
- Numbers, either integers (`int64` or `int32`) or floats (`float64` or `float32`)
- Strings (`str`), i.e. text
- Objects (`O`), which is a flexible category that can hold either numbers, strings, or a mix. 

Note: the 32 or 64 after `int` and `float` refers to how many bits of memory are allocated for each value. An `int64` value uses 64 bits, so it can hold whole numbers up to 2^63 - 1 (about 9.2 quintillion), while an `int32` value tops out at 2^31 - 1 (about 2.1 billion). 

So if we type the following, we can see that our Series is an `int64`:

In [None]:
population.dtype

Let's check the following: 

In [None]:
s = pd.Series([1, 2, 3.14])
s.dtype

In [None]:
s = pd.Series([1, 2, "a string"])
s.dtype
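
As a quick check on what the bit widths mean in practice, the sketch below uses NumPy's `np.iinfo` to look up the largest value each integer type can hold and confirms that our population counts would also fit in `int32`. The names `int64_max`, `int32_max`, and `small` are only for illustration.

```python
import numpy as np
import pandas as pd

# Largest whole number each integer type can store
int64_max = np.iinfo(np.int64).max
int32_max = np.iinfo(np.int32).max
print(int64_max)  # 9223372036854775807
print(int32_max)  # 2147483647

population = pd.Series([19768458, 12997353, 9509934, 7759615, 7206841])

# Every population count is below the int32 limit, so a smaller type would work too
fits_in_int32 = bool(population.max() <= int32_max)
small = population.astype('int32')
print(fits_in_int32, small.dtype)
```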

#### 1.2.5.1 Converting data types
Every once in a while, we'll have to change datatypes. You can do this using the `.astype()` method: 

In [None]:
s = pd.Series([1, 2, 3])
s = s.astype('float64')
s

In [None]:
s.dtype

But if you try to convert an “object” Series to a “numeric” Series and there are values that can’t be converted, pandas will throw an error: 

In [None]:
s = pd.Series([1, 2, "a string"])
s.astype('float64')
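
If you would rather keep going than stop on an error, one option (not covered in the lesson itself) is `pd.to_numeric` with `errors='coerce'`, which turns anything it cannot convert into a missing value (`NaN`). A minimal sketch:

```python
import pandas as pd

s = pd.Series([1, 2, "a string"])

# Values that cannot be parsed as numbers become NaN instead of raising an error
as_numbers = pd.to_numeric(s, errors='coerce')
print(as_numbers)
print(as_numbers.dtype)  # float64, because NaN is a floating-point value
```

Whether silently replacing bad values with `NaN` is appropriate depends on your data, so it is worth checking how many values were lost, for example with `as_numbers.isna().sum()`.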

## 1.3 Series Arithmetic
There are three forms of Series arithmetic:

- A Series combined with a single value, which is applied to every element (for example, multiplying by 2).
- A Series modified by a function (for example, `.sum()`).
- Two Series with the same number of elements. When working with two Series, elements are matched based on index values, not row numbers.

In [None]:
population*2

In [None]:
population.sum()

In [None]:
s = pd.Series([10000,2000020,30003000,40040000,50000005],
            index=['NYC', 'LA', 'Chicago', 'Dallas', 'Houston'])

population+s
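
The point about matching on index values is easiest to see when the two Series do not line up. The sketch below uses two small made-up Series (`a` and `b` are hypothetical, not part of the lesson data) whose labels are in a different order and only partly overlap.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['NYC', 'LA', 'Chicago'])
b = pd.Series([10, 20, 30], index=['Chicago', 'NYC', 'Boston'])

# Elements are paired by label: NYC with NYC, Chicago with Chicago.
# Labels present in only one Series (LA, Boston) produce NaN.
total = a + b
print(total)

# .add() with fill_value treats a missing label as 0 instead
total_filled = a.add(b, fill_value=0)
print(total_filled)
```

Here `total['NYC']` is 21 (1 + 20) and `total['Chicago']` is 13 (3 + 10), regardless of the row order in either Series.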

## 1.4 Modifying series elements
Essentially, in the same way that we can select row elements we can also update them using the same logic. 

Say we needed to update the LA metro area in our `population` series (I'll just make up a fake population update for now) from `12997353` to `15000000`. 

There are a few ways to do this: 

In [None]:
### Filter for population value
population[population==12997353] = 15000000
population

In [None]:
### Using the .loc method
population.loc['LA'] = 15000000
population

In [None]:
### Using the .iloc method
population.iloc[1] = 15000000

In [None]:
### Using the square bracket method
### (Let's reset the population series to the original values for demo purposes)
population = pd.Series([19768458, 12997353, 9509934, 7759615, 7206841],    
            index=['NYC', 'LA', 'Chicago', 'Dallas', 'Houston'])

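population['LA'] = 15000000
## or, by position (recent pandas versions warn about integer keys
## on a non-integer index, so population.iloc[1] is the safer spelling)
population[1] = 15000000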
population

# 2. DataFrames
Pandas DataFrames are tabular data consisting of a collection of Series in which each column is a series. It is the central data structure used in most analysis using the Pandas library. 

In [None]:
# msa_by_pop = pd.read_html("https://en.wikipedia.org/wiki/Metropolitan_statistical_area",encoding="latin-1",)[1]
# msa_by_pop.columns = ['Rank', 'MSA', 'population_2021_est', 'population_2020',
#        'perc_change', 'Encompassing combined statistical area']
# msa_by_pop['perc_change'].replace('â','',regex=True,inplace=True)
# msa_by_pop[['Rank','MSA','population_2021_est','population_2020','perc_change']].to_csv('msa_by_pop.csv',index=False)


First, let's read in our data using the pandas function `pd.read_csv()`. 

One of the nifty things about reading data in pandas is that it's designed to read many different types of data sources, from files, to SQL databases, and URLs. Here are a few to know about: 
- `pd.read_csv`: Read in a comma-separated-value file
- `pd.read_excel`: Read in an Excel (`.xls` and `.xlsx`) spreadsheet
- `pd.read_stata`: Read Stata (`.dta`) datasets
- `pd.read_hdf`: Read HDF (`.hdf`) datasets
- `pd.read_sql`: Read from a SQL database
- `pd.read_html`: Read tables from the `<table>` tags of an HTML page

Similarly, you can write a dataframe to many formats: (`df` here is the name of a dataframe)
- `df.to_csv`: Write to a comma-separated-value file
- `df.to_excel`: Write to an Excel (`.xlsx`) spreadsheet
- `df.to_stata`: Write to a Stata (`.dta`) dataset
- `df.to_hdf`: Write to an HDF (.hdf) dataset
- `df.to_sql`: Write to a SQL database
- `df.to_html`: Write to an HTML table. 


Check out the [Pandas documentation on input/output](https://pandas.pydata.org/docs/reference/io.html) to see all the possible functions for reading data. 
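
As a small illustration of how the read and write functions pair up, the sketch below writes a two-row DataFrame to CSV text and reads it straight back. It uses an in-memory `io.StringIO` buffer in place of a file on disk so that it runs anywhere; with a real file you would pass a path such as `'my_file.csv'` (a hypothetical name) instead.

```python
import io
import pandas as pd

df = pd.DataFrame({
    'MSA': ['Dallas-Fort Worth-Arlington, TX MSA',
            'Houston-The Woodlands-Sugar Land, TX MSA'],
    'population_2021_est': [7759615, 7206841],
})

# Write the DataFrame as CSV text; index=False leaves out the row index
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue())

# Rewind the buffer and read the CSV text back into a new DataFrame
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip)
```

Notice in the printed CSV text that the MSA names are wrapped in quotes: they contain commas, and the quotes keep those commas from being read as column separators.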

## 2.1 Reading a file
Download the `msa_by_pop.csv` file in this week's folder. We are first going to read this CSV into pandas as a DataFrame.

The function `.read_csv()` takes a file path as a string (read: text). If you have saved `msa_by_pop.csv` in the same folder as this notebook, then all you will need to input as your path is `msa_by_pop.csv`. 

(If you had saved your CSV within a sub-directory called `Data`, then to access this data file you'd need to input `Data/msa_by_pop.csv`)

In [None]:
msa_by_pop = pd.read_csv('msa_by_pop.csv')
msa_by_pop

## 2.2 Exploring our data
You can see here that (in addition to the formatting of tabular data in Jupyter) the main difference between this and a Series is that we have multiple columns with column labels. So the DataFrame structure consists of: 
- An index, with index labels (here the labels are just `0`,`1`,...,`383`)
- Columns, with column labels (here `Rank`, `MSA`, `population_2021_est`, `population_2020`, `perc_change`)
- And the data, which are the values in each row. 

Now, in addition to `.index` we can also see all the columns in our DataFrame: 

In [None]:
msa_by_pop.index

In [None]:
msa_by_pop.columns

Now, to see the datatypes for each column, we use `.dtypes`

In [None]:
msa_by_pop.dtypes

One common function used to explore the data is called `.head()` that reveals the first 5 rows of the Dataframe. 

In [None]:
msa_by_pop.head()

If you start to type `msa_by_pop.head(` without finishing the parens you can see the inputs required of the function: 


<img src="img/func_arg.png" alt="drawing" width="400" style="display: block; margin: 0 auto"/>


This shows us that `.head()` shows 5 rows by default, but you can optionally adjust this by specifying another integer. For instance: 

In [None]:
## This gives us the first 10 rows
msa_by_pop.head(10)

`.sample()` gives us a random selection of rows. 

In [None]:
msa_by_pop.sample(10)

The built-in Python function `len()` measures the length of a selection. Applying this to the DataFrame gives you the number of rows: 

In [None]:
len(msa_by_pop)

Applying it to the columns gives us the number of columns we have

In [None]:
len(msa_by_pop.columns)

The attribute `.shape` (no parentheses) gives us the number of `(rows, columns)` in our dataframe: 

In [None]:
msa_by_pop.shape

`.describe()` provides some basic descriptive statistics for our dataframe: 

In [None]:
msa_by_pop.describe()

`.sort_values()` sorts your DataFrame by a certain column. If your column is numeric, it will sort the values from smallest to largest. If your column is a string, it will sort alphabetically.

The index values will be sorted along with the rows. 

In [None]:
# This sorts the dataframe by the population_2021_est column, from smallest to largest
# Note that the original dataframe is not changed 
# and that each row keeps its original index label, so the index now appears out of order. 
msa_by_pop.sort_values('population_2021_est')

Adding the `ascending=False` input in your function will sort from largest to smallest or end of alphabet to beginning, depending on your column data type. 

In [None]:
msa_by_pop.sort_values('population_2021_est',ascending=False)

## 2.3 Subsetting and filtering Dataframes
Selecting data from our Dataframes is very similar to a Pandas Series, except now we have two dimensions (rows and columns) for which we have to specify conditions. 

For `.iloc` we can select rows and columns now by their position.

In [None]:
# Remember integer ranges are exclusive of the last value!
msa_by_pop.iloc[0:5,0:3]

Similarly, using `.loc` we can select rows and columns by their names. 

In [None]:
# Since our rows are indexed by integers, .loc looks similar to .iloc here, 
# but .loc treats 0:5 as labels and includes the last one, so this returns 6 rows (labels 0 through 5)
msa_by_pop.loc[0:5,['Rank','MSA','population_2021_est']]
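
Because an integer index makes `.loc` and `.iloc` easy to confuse, here is a minimal sketch on a made-up 10-row DataFrame (the `df` and `value` column are hypothetical) showing the two places where they differ: the end of a range, and what happens after sorting.

```python
import pandas as pd

df = pd.DataFrame({'Rank': range(1, 11), 'value': range(10, 110, 10)})

# Position ranges exclude the end point; label ranges include it
n_iloc = len(df.iloc[0:5])  # 5 rows: positions 0-4
n_loc = len(df.loc[0:5])    # 6 rows: labels 0-5
print(n_iloc, n_loc)

# After sorting, positions and labels no longer agree
sorted_df = df.sort_values('value', ascending=False)
first_by_position = sorted_df.iloc[0]['Rank']  # first row of the sorted table
first_by_label = sorted_df.loc[0]['Rank']      # the row whose label is 0
print(first_by_position, first_by_label)       # 10 1
```

After a sort, `.iloc[0]` means "whatever row is now on top", while `.loc[0]` still means "the row that was originally labeled 0".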

You can always leave out the column argument in either `.loc` or `.iloc` in order to select all columns. 

In [None]:
msa_by_pop.iloc[0:5]

In [None]:
msa_by_pop.loc[0:5]

But to do the same with columns you'll first have to specify that you want all the rows with `:` . 

In [None]:
msa_by_pop.loc[:,['Rank']]

In [None]:
msa_by_pop.iloc[:,[0]]

With the square brackets `[]` you can provide a list of columns. 

In [None]:
msa_by_pop[['Rank']]

In [None]:
msa_by_pop[['Rank','population_2020']]

We can again filter by logicals, but you'll need to specify the column names now. 

In [None]:
msa_by_pop.loc[msa_by_pop['Rank']>300]
# Yes, msa_by_pop['Rank'] does also select the column, 
# but notice that it gives a Series instead of a dataframe with one column.

In order to filter by more than one condition, you must: 
1. Put each condition in its own `()`
2. Separate the conditions by: 

    a. `|` if an `OR` condition     
    b. `&` if an `AND` condition

In [None]:
msa_by_pop[(msa_by_pop['Rank']>300) & (msa_by_pop['MSA'].str.contains('NY'))]

In [None]:
msa_by_pop[(msa_by_pop['Rank']>300) | (msa_by_pop['MSA'].str.contains('NY'))]
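
One way to keep multi-condition filters readable is to give each condition a name first. The sketch below does this on a tiny made-up table (the ranks are toy values, not real data) and also shows `~`, which negates a condition. The parentheses matter because `&` and `|` bind more tightly than comparisons like `>`, so `df['Rank'] > 2 & in_ny` would not mean what it appears to.

```python
import pandas as pd

df = pd.DataFrame({
    'Rank': [1, 2, 3, 4],
    'MSA': ['Albany, NY MSA', 'Akron, OH MSA', 'Ithaca, NY MSA', 'Enid, OK MSA'],
})

# Each condition is a Boolean Series with one True/False per row
is_low_rank = df['Rank'] > 2
in_ny = df['MSA'].str.contains('NY')

both = df[is_low_rank & in_ny]         # AND: only Ithaca
either = df[is_low_rank | in_ny]       # OR: Albany, Ithaca, Enid
neither = df[~(is_low_rank | in_ny)]   # NOT: only Akron

print(both['MSA'].tolist())
print(either['MSA'].tolist())
print(neither['MSA'].tolist())
```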

# 3. Good Coding Practices

The following are some good practices for writing more legible Jupyter notebooks. Often, we don't necessarily realize that code we write isn't immediately interpretable to readers. To make code more easily interpreted, we will often explain through markdown text or comments what we are doing. 

## 3.1 Markdown and explanatory cells
As you can see in this notebook, there are many "Markdown" cells surrounding our actual code. The markdown I have here describes the purpose of each code cell and what I wanted to do with it. 

I often organize my notebooks by header size `#`, `##` etc, and by numbering the different sections. This can be helpful if you're writing an especially long notebook. 

Here's a [guide on how to write in Markdown](https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax).

## 3.2 Formatting code
Sometimes, code can be very long. For instance, your function might have a bunch of inputs or you have many conditionals.

Depending on how wide your browser window is, this code may not fit in the window: 

In [None]:
msa_by_pop[(msa_by_pop['Rank']>300) & (msa_by_pop['MSA'].str.contains('NY')) & (msa_by_pop['population_2020']>100000)]

Instead, we can use `\` to continue the statement on a new line. The below is also clearer because each condition is on its own line. 

In [None]:
msa_by_pop[(msa_by_pop['Rank']>300) &\
        (msa_by_pop['MSA'].str.contains('NY')) &\
        (msa_by_pop['population_2020']>100000)]

You can also use the `#` in a code cell to comment. You can see that sometimes I have small notes in there that maybe didn't need to go into the markdown cell. 

In [None]:
# This is a code comment
msa_by_pop[(msa_by_pop['Rank']>300) &\
        (msa_by_pop['MSA'].str.contains('NY')) &\
        (msa_by_pop['population_2020']>100000)]
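
An alternative to `\` that many Python style guides prefer is to wrap the whole expression in parentheses: Python lets an expression continue across lines as long as a bracket is still open. The sketch below applies this to the same three-condition filter on a small made-up table (`df` and its values are toy data for illustration).

```python
import pandas as pd

df = pd.DataFrame({
    'Rank': [1, 350, 360],
    'MSA': ['Big City, NY MSA', 'Small Town, NY MSA', 'Other Town, OK MSA'],
    'population_2020': [5000000, 150000, 60000],
})

# The outer parentheses let the expression span several lines without backslashes
mask = (
    (df['Rank'] > 300)
    & df['MSA'].str.contains('NY')
    & (df['population_2020'] > 100000)
)

subset = df[mask]
print(subset)
```

Storing the condition in a variable such as `mask` also gives you a natural place to put a comment explaining what the filter is for.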

## 3.3 Selecting variable names
There are two aspects of selecting variable names we're going to go over today: 
1. What to name your variables or function
2. How to name a multi-word variable or function

### 3.3.1 What to name your variables
A good variable name should: 
- Be clear and concise.
- Be written in English. A general coding practice is to write code with variable names in English, as that is the most likely common language between programmers. Thus, variable names such as `muuttuja` (which is also not a good name on other levels) should be avoided.
- Not contain special characters. Python supports use of special characters by way of various encoding options that can be given in a program. That said, it is better to avoid variables such as `lämpötila` because encoding issues can arise in some cases. Better to stick to the standard printable ASCII character set to be safe.
- Not conflict with any Python keywords, such as `for`, `True`, `False`, `and`, `if`, or `else`. These are reserved for special operations in Python and cannot be used as variable names.

In [None]:
# Do not do this: 
finnishmeteorlogicalinstituteobservationstationidentificationnumber = "101533"


In [None]:
# Or this: 
f = "101533"

In [None]:
# Something that is as short as possible while still being descriptive is best
sid = "101533"

### 3.3.2 Snake case and camel case
There are two general ways of connecting variable and function names that contain more than one word. 

You can see from the above exercise that I've named my DataFrame `msa_by_pop`. Connecting words by `_`  is called "Snake Case". 

Another convention, called "Camel Case", is to use capital letters to start new words. `msaByPop` could be another way to name our variable. 

# 4. In-Class Exercises
Each week, we will have in-class exercises that give you some practice on the concepts learned in class.


## 4.1. Exercise 1 (5 points)
For this exercise, we are going to use the table from the earlier examples. Instead of reading the table from a CSV, let's read the table directly from the [Wikipedia page on Metropolitan Statistical Areas](https://en.wikipedia.org/wiki/Metropolitan_statistical_area).

In [2]:
import pandas as pd
msa_by_pop = pd.read_html("https://en.wikipedia.org/wiki/Metropolitan_statistical_area")[1]
msa_by_pop.columns =['Rank', 'MSA', 'population_2021_est', 'population_2020',
       'perc_change', 'Encompassing_combined_statistical_area']

Which is the MSA with the 10th largest population as estimated for 2021? 

In [8]:
## Insert your code here

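##Sort 2021 pop column from largest to smallest. Select the 10th row by position with .iloc
##(.loc[9] would look up the index label 9, which is the 10th row only if the table was already ranked)
msa_by_pop.sort_values('population_2021_est',ascending=False).iloc[9][['Rank', 'MSA', 'population_2021_est']]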

Rank                                              10
MSA                    Phoenix-Mesa-Chandler, AZ MSA
population_2021_est                          4946145
Name: 9, dtype: object

Select the smallest 10 MSAs by 2020 Census population

In [9]:
## Insert your code here

##Sort again but this time from smallest to largest. Grab the first 10 rows
msa_by_pop.sort_values('population_2020').head(10)

| | Rank | MSA | population_2021_est | population_2020 | perc_change | Encompassing_combined_statistical_area |
|---|---|---|---|---|---|---|
| 383 | 384 | Carson City, NV MSA | 58993 | 58639 | +0.60% | Reno-Carson City-Fernley, NV CSA |
| 381 | 382 | Walla Walla, WA MSA | 62682 | 62584 | +0.16% | Kennewick-Richland-Walla Walla, WA CSA |
| 382 | 383 | Enid, OK MSA | 61926 | 62846 | −1.46% | |
| 380 | 381 | Lewiston, ID-WA MSA | 64851 | 64375 | +0.74% | |
| 379 | 380 | Danville, IL MSA | 73095 | 74188 | −1.47% | |
| 378 | 379 | Grand Island, NE MSA | 76175 | 77038 | −1.12% | |
| 377 | 378 | Casper, WY MSA | 79555 | 79955 | −0.50% | |
| 375 | 376 | Hinesville, GA MSA | 82863 | 81424 | +1.77% | Savannah-Hinesville-Statesboro, GA CSA |
| 376 | 377 | Columbus, IN MSA | 82475 | 82208 | +0.32% | Indianapolis-Carmel-Muncie, IN CSA |
| 374 | 375 | Bloomsburg-Berwick, PA MSA | 82959 | 82863 | +0.12% | Bloomsburg-Berwick-Sunbury, PA CSA |


Find the total 2020 Census population of these 384 MSAs. 

In [23]:
## Insert your code here

##Sum up the values in the 2020 population column
msa_by_pop['population_2020'].sum()

286104556

Based on the population growth between 2020 and 2021, let's estimate the population in 2021 using the following function: 

$$
\text{pop}_{\text{year2}} = \text{pop}_{\text{year1}} \times (1 + \%\,\text{change})^{t}
$$

where $t=1$ here and $\%\,\text{change}$ is the estimated percentage change in population from 2020, expressed as a decimal.
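
To see how the formula behaves before applying it to a whole column, here is a minimal worked example with made-up numbers: a hypothetical area of 100,000 people growing by 2% per year (`pop_year1` and `change` are illustrative values, not data from the table).

```python
pop_year1 = 100_000   # starting population (made-up)
change = 0.02         # 2% growth, written as a decimal

# One year of growth (t = 1): multiply by (1 + change) once
pop_after_1 = pop_year1 * (1 + change) ** 1
print(round(pop_after_1))  # 102000

# Three years of growth (t = 3): the growth compounds each year
pop_after_3 = pop_year1 * (1 + change) ** 3
print(round(pop_after_3))  # 106121
```

Note that the starting population is multiplied by the growth factor, not added to it. With $t=1$ the exponent has no effect, but it matters as soon as you project more than one year ahead.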

First, calculate the percentage change between 2020 and 2021 and create a new column with these values called `year_change`. Do not multiply by 100, just keep the values in their original decimal form.

In [11]:
## Insert your code here

##Subtract pop 2020 column from pop 2021 column and divide by pop 2020 column. 
msa_by_pop['year_change'] = (msa_by_pop['population_2021_est'] - msa_by_pop['population_2020'])/msa_by_pop['population_2020']


Does `year_change` line up with the `% change` column that we already have? Select the two columns together to do an eyeball estimate. 

In [12]:
## Insert your code here

#Index the two relevant columns to compare
msa_by_pop[['year_change','perc_change']]

| | year_change | perc_change |
|---|---|---|
| 0 | -0.018471 | −1.85% |
| 1 | -0.015426 | −1.54% |
| 2 | -0.011287 | −1.13% |
| 3 | 0.016004 | +1.60% |
| 4 | 0.011878 | +1.19% |
| ... | ... | ... |
| 379 | -0.014733 | −1.47% |
| 380 | 0.007394 | +0.74% |
| 381 | 0.001566 | +0.16% |
| 382 | -0.014639 | −1.46% |


Now, using your new `year_change` column, create another column called `population_2021_estimate` and estimate the 2021 population counts using the formula above. 

In [27]:
## Insert your code here

##Using the formula given above:
##(the 2020 population is multiplied by the growth factor, not added to it)
msa_by_pop['population_2021_estimate'] = msa_by_pop['population_2020'] * (1 + msa_by_pop['year_change'])**1

##Print the two estimated pop columns for 2021 to compare:
print(msa_by_pop[['population_2021_est', 'population_2021_estimate']])

Because `year_change` was computed from these same two population columns, the two printed columns should agree row by row (up to floating-point rounding).


## 4.2 Exercise 2 (5 points)
In this exercise, we are going to explore urban population change over the previous decade. 

Here, we are going to read the Excel file on "Annual Estimates of the Resident Population: April 1, 2010 to July 1, 2019" directly from the [U.S. Census Bureau's website](https://www.census.gov/data/tables/time-series/demo/popest/2010s-total-metro-and-micro-statistical-areas.html).

Below, I'm going to do some data cleaning and re-formatting so the table is easier to work with.

Each column with the year is the Census estimated MSA population for that year.

In [14]:
# .read_excel() reads an Excel spreadsheet from a local file or from a URL. 
# Most .read_*() methods can read from both local files and URLs.
# The default sheet is the first one, but you can specify a different sheet by name or index.

# skiprows=3 skips the first 3 rows, header=0 uses the first row as column names
msa_pop_2010_2019 = pd.read_excel('https://www2.census.gov/programs-surveys/popest/tables/2010-2019/metro/totals/cbsa-met-est2019-annres.xlsx',
                                skiprows=3,header=0) 

# Because this is an Excel file with multiple header-type rows, 
# we also drop the first 2 rows that remain (.iloc[2:] keeps everything from position 2 onward)
msa_pop_2010_2019 = msa_pop_2010_2019.iloc[2:]

# I'm going to rename the columns to make them easier to work with
msa_pop_2010_2019.columns = ['MSA','apr_1_2010','Estimates Base','2010','2011','2012','2013','2014','2015','2016','2017','2018','2019']

# And I'm selecting only the columns we need for this exercise
msa_pop_2010_2019 = msa_pop_2010_2019[['MSA','2010','2011','2012','2013','2014','2015','2016','2017','2018','2019']]

# And I'm going to remove the rows that are not Metropolitan Statistical Areas 
# by filtering on the string 'Metro Area'
msa_pop_2010_2019 = msa_pop_2010_2019[msa_pop_2010_2019['MSA'].str.contains('Metro Area')]

# There's strangely a period at the beginning of each MSA name, so I'm going to remove it
# regex=False treats '.' as a literal period rather than a regular-expression wildcard
msa_pop_2010_2019['MSA'] = msa_pop_2010_2019['MSA'].str.replace('.','',regex=False)

# And I'm going to remove the word 'Metro Area' from the end of each MSA name
msa_pop_2010_2019['MSA'] = msa_pop_2010_2019['MSA'].str.replace(' Metro Area','')



Let's take a look at the first 5 rows of the data.

In [15]:
## Insert your code here

##Use head command to grab the first 5 rows
msa_pop_2010_2019.head()

| | MSA | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Abilene, TX | 165585.0 | 166634.0 | 167442.0 | 167473.0 | 168342.0 | 169688.0 | 170017.0 | 170429.0 | 171150.0 | 172060.0 |
| 3 | Akron, OH | 703031.0 | 703200.0 | 702109.0 | 703621.0 | 704908.0 | 704382.0 | 703524.0 | 703987.0 | 703855.0 | 703479.0 |
| 4 | Albany, GA | 154145.0 | 154545.0 | 153976.0 | 152667.0 | 151949.0 | 150387.0 | 149137.0 | 148090.0 | 147840.0 | 146726.0 |
| 5 | Albany-Lebanon, OR | 116891.0 | 118164.0 | 118273.0 | 118405.0 | 119042.0 | 120236.0 | 122769.0 | 125035.0 | 127451.0 | 129749.0 |
| 6 | Albany-Schenectady-Troy, NY | 871082.0 | 872778.0 | 874698.0 | 877065.0 | 878113.0 | 879085.0 | 879792.0 | 882158.0 | 882263.0 | 880381.0 |


Now, create a new column that represents the percentage change in population between 2010 and 2019 and call this column `perc_pop_change`. 

In [16]:
## Insert your code here

##Take the difference of the 2010 and 2019 pop columns and divide the difference by the 2010 pop column
msa_pop_2010_2019['perc_pop_change'] = (msa_pop_2010_2019['2019'] - msa_pop_2010_2019['2010'])/msa_pop_2010_2019['2010']

Print the average population change for the 10 **largest** MSAs by population in 2010. 

In [20]:
# Insert your code here

##First index the 10 largest MSAs by sorting the 2010 pop column and selecting the % change values
MSA_top_10 = msa_pop_2010_2019.sort_values('2010', ascending = False).head(10)['perc_pop_change']

## find the average by adding up the values found above and dividing by the length of the series
change_top_10 = sum(MSA_top_10)/len(MSA_top_10)


# The print function is used to print a string to the screen
print('The average population change for the 10 largest MSAs between 2010 and 2019 is', 
        round(change_top_10*100,4),'%')

The average population change for the 10 largest MSAs between 2010 and 2019 is 8.5142 %


Now print the average population change for the 10 **smallest** MSAs by population in 2010. 

In [22]:
# Insert your code here

## Similar process to code cell above but sort by smallest MSAs first. 
MSA_bottom_10 = msa_pop_2010_2019.sort_values('2010').head(10)['perc_pop_change']
change_bottom_10 = sum(MSA_bottom_10)/len(MSA_bottom_10)

print('The average population change for the 10 smallest MSAs between 2010 and 2019 is', 
    round(change_bottom_10*100,4),'%')

The average population change for the 10 smallest MSAs between 2010 and 2019 is 2.4318 %
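
The sort-then-`head` pattern used above can also be written with the DataFrame methods `.nlargest()` and `.nsmallest()`, and the average with the Series method `.mean()`. The sketch below shows this on a tiny made-up table (five hypothetical areas, picking the top and bottom 2 rather than 10), so the numbers are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'MSA': ['A', 'B', 'C', 'D', 'E'],
    '2010': [500, 400, 300, 200, 100],
    '2019': [550, 420, 300, 190, 110],
})

# Fractional change between 2010 and 2019 for each row
df['perc_pop_change'] = (df['2019'] - df['2010']) / df['2010']

# Pick the 2 largest / smallest areas by 2010 population, then average their change
top_2_mean = df.nlargest(2, '2010')['perc_pop_change'].mean()
bottom_2_mean = df.nsmallest(2, '2010')['perc_pop_change'].mean()

print('Largest 2:', round(top_2_mean * 100, 4), '%')
print('Smallest 2:', round(bottom_2_mean * 100, 4), '%')
```

In this toy table the two largest areas grew by 10% and 5% (average 7.5%), and the two smallest changed by +10% and -5% (average 2.5%).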
