## US States Data

Merge and join operations come up most often when combining data from different sources.
Here we will consider an example using some data about US states and their populations.
The data files can be found at http://github.com/jakevdp/data-USstates/:

In [1]:
import numpy as np
import pandas as pd

In [2]:
pop = pd.read_csv('data/pd-state-population.csv')
areas = pd.read_csv('data/pd-state-areas.csv')
abbrevs = pd.read_csv('data/pd-state-abbrevs.csv')

We'll start with a many-to-one merge that will give us the full state name within the population ``DataFrame``.
We want to merge based on the ``state/region`` column of ``pop`` and the ``abbreviation`` column of ``abbrevs``.
We'll use ``how='outer'`` to make sure no data is thrown away due to mismatched labels.

In [3]:
merged = pd.merge(pop, abbrevs, how='outer',
                  left_on='state/region', right_on='abbreviation')
merged = merged.drop(columns='abbreviation')  # drop the redundant column
merged.head()


|   | state/region | ages | year | population | state |
|---|---|---|---|---|---|
| 0 | AL | under18 | 2012 | 1117489.0 | Alabama |
| 1 | AL | total | 2012 | 4817528.0 | Alabama |
| 2 | AL | under18 | 2010 | 1130966.0 | Alabama |
| 3 | AL | total | 2010 | 4785570.0 | Alabama |
| 4 | AL | under18 | 2011 | 1125763.0 | Alabama |
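
To see what ``how='outer'`` protects against, here is a minimal sketch on two tiny frames. The names ``pop_small`` and ``abbrevs_small`` and all values in them are made up for illustration; they are not taken from the data files. The ``indicator=True`` option adds a ``_merge`` column recording which side each row came from.

```python
import pandas as pd

# Toy stand-ins for the population and abbreviation tables (made-up values).
pop_small = pd.DataFrame({
    'state/region': ['AL', 'AL', 'PR'],
    'population': [100.0, 200.0, 300.0],
})
abbrevs_small = pd.DataFrame({
    'state': ['Alabama', 'Alaska'],
    'abbreviation': ['AL', 'AK'],
})

# Many-to-one: both 'AL' population rows match the single 'AL' abbreviation row.
# The outer join keeps 'PR' (left only) and 'AK' (right only) as well.
outer = pd.merge(pop_small, abbrevs_small, how='outer',
                 left_on='state/region', right_on='abbreviation',
                 indicator=True)

# The inner join silently drops every row whose key has no partner.
inner = pd.merge(pop_small, abbrevs_small, how='inner',
                 left_on='state/region', right_on='abbreviation')

print(outer)
print(len(inner), len(outer))
```

With an inner join the ``PR`` row would simply disappear, which is the kind of silent loss the outer join makes visible as null entries.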


In [4]:
# Check for mismatches by looking for columns that contain null values
merged.isnull().any()
# Some of the population values are null; look at which rows those are
merged[merged['population'].isnull()].head()

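|   | state/region | ages | year | population | state |
|---|---|---|---|---|---|
| 2448 | PR | under18 | 1990 | NaN | NaN |
| 2449 | PR | total | 1990 | NaN | NaN |
| 2450 | PR | total | 1991 | NaN | NaN |
| 2451 | PR | under18 | 1991 | NaN | NaN |
| 2452 | PR | total | 1993 | NaN | NaN |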
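
The two-step null check can be sketched on a small frame: first ask which columns contain any null, then use a Boolean mask to pull out the affected rows. The frame ``df`` below is a made-up example, not the real population data.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing population entries.
df = pd.DataFrame({
    'state/region': ['AL', 'PR', 'PR'],
    'year': [2012, 1990, 1991],
    'population': [4817528.0, np.nan, np.nan],
})

# Step 1: one Boolean per column, True if the column holds at least one null.
has_null = df.isnull().any()

# Step 2: select the rows where a specific column is null.
missing_rows = df[df['population'].isnull()]

print(has_null)
print(missing_rows)
```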


It appears that all the null population values are from Puerto Rico prior to the year 2000; this is likely because the data was not available from the original source.

More importantly, some of the new ``state`` entries are also null, which means that there was no corresponding entry in the ``abbrevs`` key!
Let's figure out which regions lack this match:

In [5]:
merged.loc[merged['state'].isnull(), 'state/region'].unique()

array(['PR', 'USA'], dtype=object)
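
A complementary way to find unmatched labels, without merging first, is to compare the key columns directly with ``isin()``. This is only a sketch of that alternative; the two series below are hypothetical stand-ins for ``pop['state/region']`` and ``abbrevs['abbreviation']``.

```python
import pandas as pd

# Hypothetical key columns: the data side and the lookup side.
keys_in_data = pd.Series(['AL', 'AK', 'PR', 'USA', 'AL'])
keys_in_lookup = pd.Series(['AL', 'AK'])

# Keep the data keys that have no partner in the lookup, then deduplicate.
unmatched = keys_in_data[~keys_in_data.isin(keys_in_lookup)].unique()

print(unmatched)
```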

We can quickly infer the issue: our population data includes entries for Puerto Rico (PR) and the United States as a whole (USA), while these entries do not appear in the state abbreviation key.
We can fix these quickly by filling in appropriate entries:

In [6]:
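# Fill in the state names that the abbreviation key did not provide
merged.loc[merged['state/region'] == 'PR', 'state'] = 'Puerto Rico'
merged.loc[merged['state/region'] == 'USA', 'state'] = 'United States'
# Confirm that the state column no longer contains nulls
merged.isnull().any()
# Merge in the area data; both frames share the 'state' column
final = pd.merge(merged, areas, on='state', how='left')
final.head()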

|   | state/region | ages | year | population | state | area (sq. mi) |
|---|---|---|---|---|---|---|
| 0 | AL | under18 | 2012 | 1117489.0 | Alabama | 52423.0 |
| 1 | AL | total | 2012 | 4817528.0 | Alabama | 52423.0 |
| 2 | AL | under18 | 2010 | 1130966.0 | Alabama | 52423.0 |
| 3 | AL | total | 2010 | 4785570.0 | Alabama | 52423.0 |
| 4 | AL | under18 | 2011 | 1125763.0 | Alabama | 52423.0 |
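
The fill-then-merge step can be sketched in isolation. The frames ``merged_small`` and ``areas_small`` are illustrative assumptions with only a few rows, and the Puerto Rico area used here is a placeholder value rather than a figure from the source. Because the merge uses ``how='left'``, every row of the left frame survives, and rows with no area entry get a null.

```python
import numpy as np
import pandas as pd

# Illustrative frames; the values are placeholders, not the real data.
merged_small = pd.DataFrame({
    'state/region': ['AL', 'PR', 'USA'],
    'state': ['Alabama', np.nan, np.nan],
})
areas_small = pd.DataFrame({
    'state': ['Alabama', 'Puerto Rico'],
    'area (sq. mi)': [52423, 3515],
})

# Patch the missing names with label-based assignment through a Boolean mask.
merged_small.loc[merged_small['state/region'] == 'PR', 'state'] = 'Puerto Rico'
merged_small.loc[merged_small['state/region'] == 'USA', 'state'] = 'United States'

# Left merge on the shared 'state' column: all three left rows are kept.
final_small = pd.merge(merged_small, areas_small, on='state', how='left')

print(final_small)
```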


There are nulls in the ``area`` column; the first cell below checks which regions were ignored.
It shows that our ``areas`` ``DataFrame`` does not contain the area of the United States as a whole.
We could insert the appropriate value (using the sum of all state areas, for instance), but in this case we'll just drop the null values, because the population density of the entire United States is not relevant to our current discussion:

In [7]:
final['state'][final['area (sq. mi)'].isnull()].unique()

array(['United States'], dtype=object)

In [8]:
final.dropna(inplace=True)
final.head()

|   | state/region | ages | year | population | state | area (sq. mi) |
|---|---|---|---|---|---|---|
| 0 | AL | under18 | 2012 | 1117489.0 | Alabama | 52423.0 |
| 1 | AL | total | 2012 | 4817528.0 | Alabama | 52423.0 |
| 2 | AL | under18 | 2010 | 1130966.0 | Alabama | 52423.0 |
| 3 | AL | total | 2010 | 4785570.0 | Alabama | 52423.0 |
| 4 | AL | under18 | 2011 | 1125763.0 | Alabama | 52423.0 |
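
For comparison, the alternative mentioned in the text (filling in the national area from the sum of the state areas) might look like the sketch below. The frames and their values are made up, and whether a plain sum is the right national figure depends on which regions the real ``areas`` table includes, so treat this as an assumption rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Made-up areas table; it has no 'United States' row, as in the text.
areas_small = pd.DataFrame({
    'state': ['A', 'B', 'C'],
    'area (sq. mi)': [10.0, 20.0, 30.0],
})
final_small = pd.DataFrame({
    'state': ['A', 'B', 'C', 'United States'],
    'area (sq. mi)': [10.0, 20.0, 30.0, np.nan],
})

# Option 1 (the approach taken in the text): drop rows containing any null.
dropped = final_small.dropna()

# Option 2: fill the missing national area with the sum of the listed areas.
total_area = areas_small['area (sq. mi)'].sum()
filled = final_small.copy()
filled.loc[filled['state'] == 'United States', 'area (sq. mi)'] = total_area

print(dropped)
print(filled)
```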


Now we have all the data we need. To answer the question of interest, ranking the regions by population density, let's first select the portion of the data corresponding to the year 2010 and the total population.
We'll use the ``query()`` method to do this quickly (depending on your pandas version, this may require the ``numexpr`` package to be installed; see [High-Performance Pandas: ``eval()`` and ``query()``](03.12-Performance-Eval-and-Query.ipynb)):

In [9]:
data2010 = final.query("year == 2010 & ages == 'total'")
data2010.head()

|   | state/region | ages | year | population | state | area (sq. mi) |
|---|---|---|---|---|---|---|
| 3 | AL | total | 2010 | 4785570.0 | Alabama | 52423.0 |
| 91 | AK | total | 2010 | 713868.0 | Alaska | 656425.0 |
| 101 | AZ | total | 2010 | 6408790.0 | Arizona | 114006.0 |
| 189 | AR | total | 2010 | 2922280.0 | Arkansas | 53182.0 |
| 197 | CA | total | 2010 | 37333601.0 | California | 163707.0 |
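
The ``query()`` string is shorthand for an ordinary Boolean mask. As a sketch on a made-up frame (``df`` and its values are illustrative), the two selections below pick out the same rows:

```python
import pandas as pd

# Made-up frame with the same column names as the merged data.
df = pd.DataFrame({
    'state': ['Alabama', 'Alabama', 'Alaska', 'Alaska'],
    'ages': ['under18', 'total', 'total', 'total'],
    'year': [2010, 2010, 2010, 2011],
    'population': [1.0, 2.0, 3.0, 4.0],
})

# Selection written as a query string...
via_query = df.query("year == 2010 & ages == 'total'")

# ...and the equivalent explicit mask. The parentheses are required here,
# because in plain Python `&` binds more tightly than `==`.
via_mask = df[(df['year'] == 2010) & (df['ages'] == 'total')]

print(via_query)
```

Both forms keep the original index labels, which is why the filtered output above shows index values such as 3 and 91 rather than 0, 1, 2.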


Now let's compute the population density and display it in order.
We'll start by re-indexing our data on the state, and then compute the result:

In [10]:
data2010 = data2010.set_index('state')
density = data2010['population'] / data2010['area (sq. mi)']
density = density.sort_values(ascending=False)
density.head()

state
District of Columbia    8898.897059
Puerto Rico             1058.665149
New Jersey              1009.253268
Rhode Island             681.339159
Connecticut              645.600649
dtype: float64
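
The density step relies on index alignment: once ``state`` is the index, dividing one column by another produces a ``Series`` labelled by state, which can then be sorted. A minimal sketch with made-up numbers (the frame ``df`` is hypothetical):

```python
import pandas as pd

# Made-up populations and areas, indexed by a state label.
df = pd.DataFrame({
    'state': ['A', 'B', 'C'],
    'population': [100.0, 900.0, 50.0],
    'area (sq. mi)': [10.0, 30.0, 50.0],
}).set_index('state')

# Element-wise division keeps the index, so each density stays tied to its state.
density = (df['population'] / df['area (sq. mi)']).sort_values(ascending=False)

print(density)
print(density.idxmax(), density.idxmin())
```

``idxmax()`` and ``idxmin()`` return the labels of the densest and sparsest entries directly, which can be handy when only the extremes are needed.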

The result is a ranking of US states, plus Washington, DC, and Puerto Rico, in order of their 2010 population density, in residents per square mile.
By far the densest region in this dataset is Washington, DC (i.e., the District of Columbia); among states, the densest is New Jersey.

In [11]:
density.tail()  # the least dense state is Alaska, averaging slightly over one resident per square mile

state
South Dakota    10.583512
North Dakota     9.537565
Montana          6.736171
Wyoming          5.768079
Alaska           1.087509
dtype: float64