## Pandas

The Python Data Analysis Library (Pandas) is an open source library providing high-performance, easy-to-use data structures and data analysis tools. Pandas is particularly suited to the analysis of _tabular_ data.

### Import Pandas

Import pandas using the common shortcut `pd`:

`import pandas as pd`

In [60]:
import pandas as pd

### Pandas Data Structures

Pandas provides two main data structures: `Series` and `DataFrame`.

### Series

A Series represents a one-dimensional array of data.

`prices = [1.19, 3.25, 2.99]`

`s = pd.Series(prices)`

`s`

In [61]:
prices = [1.19, 3.25, 2.99]
s = pd.Series(prices)
s

0    1.19
1    3.25
2    2.99
dtype: float64

Use the `index` argument to set the Series index:


`items = ["Apples", "Bread", "Butter"]`

`s = pd.Series(prices, index=items)`

In [62]:
items = ["Apples", "Bread", "Butter"]
s = pd.Series(prices, index=items)
s

Apples    1.19
Bread     3.25
Butter    2.99
dtype: float64
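
Once a Series has a labeled index, values can be looked up by label rather than by position. The following is a minimal sketch that reuses the `prices` and `items` lists from above; the names `bread_price`, `subset`, and `doubled` are introduced here only for illustration.

```python
import pandas as pd

prices = [1.19, 3.25, 2.99]
items = ["Apples", "Bread", "Butter"]
s = pd.Series(prices, index=items)

# Look up a single value by its index label
bread_price = s["Bread"]

# Select several labels at once; the result is another Series
subset = s[["Apples", "Butter"]]

# Arithmetic is applied element-wise and the index is kept
doubled = s * 2
print(bread_price)
print(doubled)
```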

### DataFrames

DataFrames are two-dimensional data structures with labeled axes (rows and columns). Data are aligned in a tabular fashion with three principal components: data, rows, and columns.
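
To make the three components concrete, the sketch below builds a small DataFrame by hand from a dictionary. The three rows are taken from the 2010 county figures used later in this section; using the county name as the row index is an assumption made only for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Population": [1510271, 1175, 38091],
        "Housing units": [582549, 1760, 18032],
    },
    index=["Alameda County", "Alpine County", "Amador County"],
)

rows = df.index        # row labels
columns = df.columns   # column labels
values = df.to_numpy() # the underlying data as a 2-D array

print(rows)
print(columns)
print(values)
```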

### Creating a DataFrame

Read a csv file into a DataFrame using the `read_csv()` function:

`pd.read_csv("../data/counties2010.csv")`

In [63]:
pd.read_csv("../data/counties2010.csv")

Unnamed: 0,GEO.id,GEO.id2,GEO.display-label,GCT_STUB.target-geo-id,GCT_STUB.target-geo-id2,GCT_STUB.display-label,GCT_STUB.display-label.1,HD01,HD02,SUBHD0301,SUBHD0302,SUBHD0303,SUBHD0401,SUBHD0402
0,Id,Id2,Geography,Target Geo Id,Target Geo Id2,Geographic area,County,Population,Housing units,Area in square miles - Total area,Area in square miles - Water area,Area in square miles - Land area,Density per square mile of land area - Population,Density per square mile of land area - Housing...
1,0400000US06,6,California,0400000US06,6,California,California,37253956,13680081,163694.74,7915.52,155779.22,239.1,87.8
2,0400000US06,6,California,0500000US06001,6001,California - Alameda County,Alameda County,1510271,582549,821.33,82.31,739.02,2043.6,788.3
3,0400000US06,6,California,0500000US06003,6003,California - Alpine County,Alpine County,1175,1760,743.18,4.85,738.33,1.6,2.4
4,0400000US06,6,California,0500000US06005,6005,California - Amador County,Amador County,38091,18032,605.96,11.37,594.58,64.1,30.3
5,0400000US06,6,California,0500000US06007,6007,California - Butte County,Butte County,220000,95835,1677.13,40.67,1636.46,134.4,58.6
6,0400000US06,6,California,0500000US06009,6009,California - Calaveras County,Calaveras County,45578,27925,1036.93,16.92,1020.01,44.7,27.4
7,0400000US06,6,California,0500000US06011,6011,California - Colusa County,Colusa County,21419,7883,1156.36,5.63,1150.73,18.6,6.9
8,0400000US06,6,California,0500000US06013,6013,California - Contra Costa County,Contra Costa County,1049025,400263,803.77,87.83,715.94,1465.2,559.1
9,0400000US06,6,California,0500000US06015,6015,California - Del Norte County,Del Norte County,28610,11186,1229.74,223.37,1006.37,28.4,11.1


### Arguments

Notice in this output that the original file has 2 header rows and the first entry is for the state of California. A numeric index (0-59) was also created because one was not specified.

Specify the following arguments in `read_csv`:

`header=1`

`skiprows=[2]` 

`index_col="Target Geo Id"`

Assign the DataFrame to a variable and view the first 5 rows using `head()` (the cell below uses `tail()` to show the last 5 instead):

`counties2010 = pd.read_csv("../data/counties2010.csv", header=1, skiprows=[2], index_col="Target Geo Id")`

`counties2010.head()`

In [64]:
counties2010 = pd.read_csv("../data/counties2010.csv", header=1, skiprows=[2], index_col="Target Geo Id")
counties2010.tail()

Target Geo Id,Id,Id2,Geography,Target Geo Id2,Geographic area,County,Population,Housing units,Area in square miles - Total area,Area in square miles - Water area,Area in square miles - Land area,Density per square mile of land area - Population,Density per square mile of land area - Housing units
0500000US06107,0400000US06,6,California,6107,California - Tulare County,Tulare County,442179,141696,4838.65,14.44,4824.21,91.7,29.4
0500000US06109,0400000US06,6,California,6109,California - Tuolumne County,Tuolumne County,55365,31244,2274.43,53.55,2220.88,24.9,14.1
0500000US06111,0400000US06,6,California,6111,California - Ventura County,Ventura County,823318,281695,2208.38,365.25,1843.13,446.7,152.8
0500000US06113,0400000US06,6,California,6113,California - Yolo County,Yolo County,200849,75054,1023.56,8.87,1014.69,197.9,74.0
0500000US06115,0400000US06,6,California,6115,California - Yuba County,Yuba County,72155,27635,643.81,11.97,631.84,114.2,43.7
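
How the three arguments work together can be checked on a tiny in-memory file. The sketch below assumes a shortened four-column layout (the real file has more columns) built from rows shown above, and uses `io.StringIO` in place of a file on disk.

```python
import io
import pandas as pd

# Line 0: machine-readable header, line 1: human-readable header,
# line 2: the state-level row, lines 3+: county rows
raw = """GEO.id,GCT_STUB.target-geo-id,GCT_STUB.display-label.1,HD01
Id,Target Geo Id,County,Population
0400000US06,0400000US06,California,37253956
0400000US06,0500000US06001,Alameda County,1510271
0400000US06,0500000US06003,Alpine County,1175
"""

df = pd.read_csv(
    io.StringIO(raw),
    header=1,                   # use the second line as column names
    skiprows=[2],               # drop the state-level row
    index_col="Target Geo Id",  # use this column as the row index
)
print(df)
```

Only the two county rows remain, and they are indexed by their geographic identifier instead of a generated numeric index.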


### Inspecting the DataFrame

`counties2010.shape`

`counties2010.dtypes`

`counties2010.describe()`

`counties2010.info()`

In [65]:
counties2010.describe()

Unnamed: 0,Id2,Target Geo Id2,Population,Housing units,Area in square miles - Total area,Area in square miles - Water area,Area in square miles - Land area,Density per square mile of land area - Population,Density per square mile of land area - Housing units
count,58.0,58.0,58.0,58.0,58.0,58.0,58.0,58.0,58.0
mean,6.0,6058.0,642309.6,235863.5,2822.323793,136.475172,2685.848103,663.253448,275.484483
std,0.0,33.773757,1416933.0,500003.6,3116.781522,192.544225,3102.319488,2314.767149,1067.751558
min,6.0,6001.0,1175.0,1760.0,231.89,1.76,46.87,1.6,0.9
25%,6.0,6029.5,48000.75,24679.25,978.855,16.0725,959.4875,25.925,11.85
50%,6.0,6058.0,179140.5,76183.5,1595.91,56.965,1535.34,104.65,39.0
75%,6.0,6086.5,642592.8,226459.2,3832.835,183.5,3454.3975,334.925,127.375
max,6.0,6115.0,9818605.0,3445076.0,20104.83,1053.99,20056.94,17179.2,8041.8
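
The inspection attributes answer different questions: `shape` gives the number of rows and columns, `dtypes` gives the type of each column, and `describe()` summarizes numeric columns. A minimal sketch on a three-row sample taken from the 2010 data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "County": ["Alameda County", "Alpine County", "Amador County"],
        "Population": [1510271, 1175, 38091],
        "Area in square miles - Land area": [739.02, 738.33, 594.58],
    }
)

n_rows, n_cols = df.shape   # (rows, columns)
col_types = df.dtypes       # one dtype per column
summary = df.describe()     # by default only numeric columns are summarized

print(n_rows, n_cols)
print(col_types)
print(summary)
```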


### Removing Columns

Remove columns from the DataFrame using the `drop` method:


`counties2010 = counties2010.drop(["Id2", "Id", "Geographic area"], axis=1)`

`counties2010.head()`

In [66]:
counties2010 = pd.read_csv("../data/counties2010.csv", header=1, index_col="Target Geo Id", skiprows=[2])
counties2010 = counties2010.drop(["Id2", "Id", "Geographic area"], axis=1)
counties2010.head()

Target Geo Id,Geography,Target Geo Id2,County,Population,Housing units,Area in square miles - Total area,Area in square miles - Water area,Area in square miles - Land area,Density per square mile of land area - Population,Density per square mile of land area - Housing units
0500000US06001,California,6001,Alameda County,1510271,582549,821.33,82.31,739.02,2043.6,788.3
0500000US06003,California,6003,Alpine County,1175,1760,743.18,4.85,738.33,1.6,2.4
0500000US06005,California,6005,Amador County,38091,18032,605.96,11.37,594.58,64.1,30.3
0500000US06007,California,6007,Butte County,220000,95835,1677.13,40.67,1636.46,134.4,58.6
0500000US06009,California,6009,Calaveras County,45578,27925,1036.93,16.92,1020.01,44.7,27.4
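
`drop` returns a new DataFrame and leaves the original untouched, which is why the result is assigned back to the variable above. The sketch below illustrates this on a small sample and shows the equivalent `columns=` keyword; the column name `"Not a column"` is deliberately made up to demonstrate `errors="ignore"`.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Id": ["0400000US06", "0400000US06"],
        "Id2": [6, 6],
        "County": ["Alameda County", "Alpine County"],
        "Population": [1510271, 1175],
    }
)

# columns=[...] is equivalent to passing the labels with axis=1
trimmed = df.drop(columns=["Id", "Id2"])

# Labels that do not exist raise a KeyError unless errors="ignore" is given
safe = df.drop(columns=["Id", "Not a column"], errors="ignore")

print(list(df.columns))       # the original still has all four columns
print(list(trimmed.columns))
```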


### Renaming Columns

`counties2010 = counties2010.rename(columns = {"Target Geo Id2":"FIPS"})`

In [67]:
counties2010 = pd.read_csv("../data/counties2010.csv", header=1, index_col="Target Geo Id", skiprows=[2])
counties2010 = counties2010.drop(["Id2", "Id"], axis=1)

#Rename "Target Geo Id2" to "FIPS"
counties2010 = counties2010.rename(columns = {"Target Geo Id2":"FIPS"})
counties2010.head()

Target Geo Id,Geography,FIPS,Geographic area,County,Population,Housing units,Area in square miles - Total area,Area in square miles - Water area,Area in square miles - Land area,Density per square mile of land area - Population,Density per square mile of land area - Housing units
0500000US06001,California,6001,California - Alameda County,Alameda County,1510271,582549,821.33,82.31,739.02,2043.6,788.3
0500000US06003,California,6003,California - Alpine County,Alpine County,1175,1760,743.18,4.85,738.33,1.6,2.4
0500000US06005,California,6005,California - Amador County,Amador County,38091,18032,605.96,11.37,594.58,64.1,30.3
0500000US06007,California,6007,California - Butte County,Butte County,220000,95835,1677.13,40.67,1636.46,134.4,58.6
0500000US06009,California,6009,California - Calaveras County,Calaveras County,45578,27925,1036.93,16.92,1020.01,44.7,27.4


### Selecting Data

Select one or more columns by name:

`counties2010["County"]`

`counties2010[["County","Population"]]`

In [68]:
counties2010[["County","Population"]].tail()

Target Geo Id,County,Population
0500000US06107,Tulare County,442179
0500000US06109,Tuolumne County,55365
0500000US06111,Ventura County,823318
0500000US06113,Yolo County,200849
0500000US06115,Yuba County,72155


### Slicing

Use slicing syntax to extract rows:

`counties2010[0:12]`

`counties2010[4:-4]`

`counties2010["Population"][4:40]`

### loc and iloc

Select one or more rows by dataframe index using `loc`:

`counties2010.loc["0500000US06001":]`

`counties2010.loc["0500000US06007": "0500000US06025"]`


Select rows or columns by integer position using `iloc`:

`counties2010.iloc[0:3]` *`#first 3 rows`*

`counties2010.iloc[:,2]` *`#all rows from the third column`*

`counties2010.iloc[[5, -1]]`  *`#only the 6th row and the last row`*

`counties2010.iloc[10:20, 4:7]` *`#rows 11-20, columns 5-7`*

In [69]:
counties2010.iloc[0:4,]

Target Geo Id,Geography,FIPS,Geographic area,County,Population,Housing units,Area in square miles - Total area,Area in square miles - Water area,Area in square miles - Land area,Density per square mile of land area - Population,Density per square mile of land area - Housing units
0500000US06001,California,6001,California - Alameda County,Alameda County,1510271,582549,821.33,82.31,739.02,2043.6,788.3
0500000US06003,California,6003,California - Alpine County,Alpine County,1175,1760,743.18,4.85,738.33,1.6,2.4
0500000US06005,California,6005,California - Amador County,Amador County,38091,18032,605.96,11.37,594.58,64.1,30.3
0500000US06007,California,6007,California - Butte County,Butte County,220000,95835,1677.13,40.67,1636.46,134.4,58.6
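
One difference between the two indexers is easy to miss: a `loc` slice includes its end label, while an `iloc` slice excludes its end position, as ordinary Python slices do. A minimal sketch on four rows of the 2010 data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "County": ["Alameda County", "Alpine County", "Amador County", "Butte County"],
        "Population": [1510271, 1175, 38091, 220000],
    },
    index=["0500000US06001", "0500000US06003", "0500000US06005", "0500000US06007"],
)

# Label slice: both endpoints are included (3 rows)
by_label = df.loc["0500000US06001":"0500000US06005"]

# Position slice: the end position is excluded (2 rows)
by_position = df.iloc[0:2]

# A single cell: second row, second column
cell = df.iloc[1, 1]

# A list of positions: first and last row
first_and_last = df.iloc[[0, -1]]

print(len(by_label), len(by_position), cell)
```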


Find all rows with a population greater than 1,000,000:

In [70]:
counties2010[counties2010["Population"] > 1000000]

Target Geo Id,Geography,FIPS,Geographic area,County,Population,Housing units,Area in square miles - Total area,Area in square miles - Water area,Area in square miles - Land area,Density per square mile of land area - Population,Density per square mile of land area - Housing units
0500000US06001,California,6001,California - Alameda County,Alameda County,1510271,582549,821.33,82.31,739.02,2043.6,788.3
0500000US06013,California,6013,California - Contra Costa County,Contra Costa County,1049025,400263,803.77,87.83,715.94,1465.2,559.1
0500000US06037,California,6037,California - Los Angeles County,Los Angeles County,9818605,3445076,4750.94,693.06,4057.88,2419.6,849.0
0500000US06059,California,6059,California - Orange County,Orange County,3010232,1048907,948.07,157.5,790.57,3807.7,1326.8
0500000US06065,California,6065,California - Riverside County,Riverside County,2189641,800707,7303.42,96.94,7206.48,303.8,111.1
0500000US06067,California,6067,California - Sacramento County,Sacramento County,1418788,555932,994.02,29.37,964.64,1470.8,576.3
0500000US06071,California,6071,California - San Bernardino County,San Bernardino County,2035210,699637,20104.83,47.89,20056.94,101.5,34.9
0500000US06073,California,6073,California - San Diego County,San Diego County,3095313,1164786,4525.68,319.05,4206.63,735.8,276.9
0500000US06085,California,6085,California - Santa Clara County,Santa Clara County,1781642,631920,1304.07,13.97,1290.1,1381.0,489.8


Find all rows in which a column contains a given string:

In [71]:
my_string = "San"
counties2010[counties2010["County"].str.contains(my_string)]

Target Geo Id,Geography,FIPS,Geographic area,County,Population,Housing units,Area in square miles - Total area,Area in square miles - Water area,Area in square miles - Land area,Density per square mile of land area - Population,Density per square mile of land area - Housing units
0500000US06069,California,6069,California - San Benito County,San Benito County,55269,17870,1390.47,1.76,1388.71,39.8,12.9
0500000US06071,California,6071,California - San Bernardino County,San Bernardino County,2035210,699637,20104.83,47.89,20056.94,101.5,34.9
0500000US06073,California,6073,California - San Diego County,San Diego County,3095313,1164786,4525.68,319.05,4206.63,735.8,276.9
0500000US06075,California,6075,California - San Francisco County,San Francisco County,805235,376942,231.89,185.02,46.87,17179.2,8041.8
0500000US06077,California,6077,California - San Joaquin County,San Joaquin County,685306,233755,1426.5,35.18,1391.32,492.6,168.0
0500000US06079,California,6079,California - San Luis Obispo County,San Luis Obispo County,269637,117315,3615.55,316.98,3298.57,81.7,35.6
0500000US06081,California,6081,California - San Mateo County,San Mateo County,718451,271031,740.96,292.55,448.41,1602.2,604.4
0500000US06083,California,6083,California - Santa Barbara County,Santa Barbara County,423895,152834,3789.08,1053.99,2735.09,155.0,55.9
0500000US06085,California,6085,California - Santa Clara County,Santa Clara County,1781642,631920,1304.07,13.97,1290.1,1381.0,489.8
0500000US06087,California,6087,California - Santa Cruz County,Santa Cruz County,262382,104476,607.17,162.0,445.17,589.4,234.7


### Splitting Strings

`CountyName = counties2010["Geographic area"].str.split('- ', n=1, expand=True)`

`counties2010["Geographic area"] = CountyName[1]`

`counties2010["Geographic area"]`

In [72]:
counties2010 = pd.read_csv("../data/counties2010.csv", header=1, index_col="Target Geo Id", skiprows=[2])
counties2010 = counties2010.drop(["Id2", "Id"], axis=1)
counties2010 = counties2010.rename(columns = {"Target Geo Id2":"FIPS"})

CountyName = counties2010["Geographic area"].str.split('- ', n=1, expand=True)
counties2010["Geographic area"] = CountyName[1]
counties2010["Geographic area"]                                                      

Target Geo Id
0500000US06001            Alameda County
0500000US06003             Alpine County
0500000US06005             Amador County
0500000US06007              Butte County
0500000US06009          Calaveras County
0500000US06011             Colusa County
0500000US06013       Contra Costa County
0500000US06015          Del Norte County
0500000US06017          El Dorado County
0500000US06019             Fresno County
0500000US06021              Glenn County
0500000US06023           Humboldt County
0500000US06025           Imperial County
0500000US06027               Inyo County
0500000US06029               Kern County
0500000US06031              Kings County
0500000US06033               Lake County
0500000US06035             Lassen County
0500000US06037        Los Angeles County
0500000US06039             Madera County
0500000US06041              Marin County
0500000US06043           Mariposa County
0500000US06045          Mendocino County
0500000US06047             Merced County
05
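
With `expand=True`, `str.split` returns a DataFrame with one column per piece, which is why `CountyName[1]` above selects the county name. A minimal sketch on two values from the "Geographic area" column; the variable names `area`, `parts`, and `clean_parts` are only illustrative.

```python
import pandas as pd

area = pd.Series(["California - Alameda County", "California - Alpine County"])

# Split at most once (n=1) and spread the pieces over columns 0 and 1
parts = area.str.split("- ", n=1, expand=True)

# Splitting on "- " leaves a trailing space on the first piece;
# splitting on " - " removes it
clean_parts = area.str.split(" - ", n=1, expand=True)

print(parts[1].tolist())
print(repr(parts[0][0]), repr(clean_parts[0][0]))
```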

### Getting Statistics

Use `max()`, `min()`, and `mean()`:

`counties2010.min()`

`counties2010["Population"].max()`

In [73]:
counties2010.min()

Geography                                                   California
FIPS                                                              6001
Geographic area                                         Alameda County
County                                                  Alameda County
Population                                                        1175
Housing units                                                     1760
Area in square miles - Total area                               231.89
Area in square miles - Water area                                 1.76
Area in square miles - Land area                                 46.87
Density per square mile of land area - Population                  1.6
Density per square mile of land area - Housing units               0.9
dtype: object

In [74]:
counties2010.max()

Geography                                                California
FIPS                                                           6115
Geographic area                                         Yuba County
County                                                  Yuba County
Population                                                  9818605
Housing units                                               3445076
Area in square miles - Total area                           20104.8
Area in square miles - Water area                           1053.99
Area in square miles - Land area                            20056.9
Density per square mile of land area - Population           17179.2
Density per square mile of land area - Housing units         8041.8
dtype: object

Find the row with the highest population

In [75]:
max_pop = counties2010["Population"].max()
counties2010[counties2010["Population"] == max_pop]

Target Geo Id,Geography,FIPS,Geographic area,County,Population,Housing units,Area in square miles - Total area,Area in square miles - Water area,Area in square miles - Land area,Density per square mile of land area - Population,Density per square mile of land area - Housing units
0500000US06037,California,6037,Los Angeles County,Los Angeles County,9818605,3445076,4750.94,693.06,4057.88,2419.6,849.0
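
An alternative to comparing against the maximum value is to ask for the index label of the maximum directly. The sketch below, on a three-row sample of the 2010 data, uses `idxmax()` for the single largest row and `nlargest()` for the top few; it is one possible approach, not the only one.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "County": ["Alameda County", "Los Angeles County", "Alpine County"],
        "Population": [1510271, 9818605, 1175],
    },
    index=["0500000US06001", "0500000US06037", "0500000US06003"],
)

# Index label of the row with the largest population
top_label = df["Population"].idxmax()
top_row = df.loc[top_label]

# The two most populous rows, largest first
top_two = df.nlargest(2, "Population")

print(top_label, top_row["County"])
```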


### Perform the Same Operations on Another File

In [76]:
counties2000 = pd.read_csv("../data/counties2000.csv", header=1, index_col="Target Geo Id", skiprows=[2])
counties2000 = counties2000.drop(["Id2", "Id"], axis=1)
counties2000 = counties2000.rename(columns = {"Target Geo Id2":"FIPS"})
counties2000

Target Geo Id,Geography,FIPS,Geographic area,County,Population,Housing units,Area in square miles - Total area,Area in square miles - Water area,Area in square miles - Land area,Density per square mile of land area - Population,Density per square mile of land area - Housing units
0500000US06001,California,6001,California - Alameda County,Alameda County,1443741,540183,821.15,83.57,737.57,1957.4,732.4
0500000US06003,California,6003,California - Alpine County,Alpine County,1208,1514,743.19,4.57,738.62,1.6,2.0
0500000US06005,California,6005,California - Amador County,Amador County,35100,15035,604.69,11.73,592.97,59.2,25.4
0500000US06007,California,6007,California - Butte County,Butte County,203171,85523,1677.11,37.62,1639.49,123.9,52.2
0500000US06009,California,6009,California - Calaveras County,Calaveras County,40554,22946,1036.84,16.81,1020.04,39.8,22.5
0500000US06011,California,6011,California - Colusa County,Colusa County,18804,6774,1156.22,5.54,1150.68,16.3,5.9
0500000US06013,California,6013,California - Contra Costa County,Contra Costa County,948816,354577,802.15,82.2,719.95,1317.9,492.5
0500000US06015,California,6015,California - Del Norte County,Del Norte County,27507,10434,1229.75,221.94,1007.81,27.3,10.4
0500000US06017,California,6017,California - El Dorado County,El Dorado County,156299,71278,1788.1,77.25,1710.85,91.4,41.7
0500000US06019,California,6019,California - Fresno County,Fresno County,799407,270767,6017.42,54.7,5962.73,134.1,45.4


Add a new column to `counties2010` containing the change in population between 2000 and 2010:

`counties2010["Population Change 2000-2010"] = counties2010["Population"] - counties2000["Population"]`

In [77]:
counties2010["Population Change 2000-2010"] = counties2010["Population"] - counties2000["Population"]
counties2010

Target Geo Id,Geography,FIPS,Geographic area,County,Population,Housing units,Area in square miles - Total area,Area in square miles - Water area,Area in square miles - Land area,Density per square mile of land area - Population,Density per square mile of land area - Housing units,Population Change 2000-2010
0500000US06001,California,6001,Alameda County,Alameda County,1510271,582549,821.33,82.31,739.02,2043.6,788.3,66530
0500000US06003,California,6003,Alpine County,Alpine County,1175,1760,743.18,4.85,738.33,1.6,2.4,-33
0500000US06005,California,6005,Amador County,Amador County,38091,18032,605.96,11.37,594.58,64.1,30.3,2991
0500000US06007,California,6007,Butte County,Butte County,220000,95835,1677.13,40.67,1636.46,134.4,58.6,16829
0500000US06009,California,6009,Calaveras County,Calaveras County,45578,27925,1036.93,16.92,1020.01,44.7,27.4,5024
0500000US06011,California,6011,Colusa County,Colusa County,21419,7883,1156.36,5.63,1150.73,18.6,6.9,2615
0500000US06013,California,6013,Contra Costa County,Contra Costa County,1049025,400263,803.77,87.83,715.94,1465.2,559.1,100209
0500000US06015,California,6015,Del Norte County,Del Norte County,28610,11186,1229.74,223.37,1006.37,28.4,11.1,1103
0500000US06017,California,6017,El Dorado County,El Dorado County,181058,88159,1786.36,78.47,1707.88,106.0,51.6,24759
0500000US06019,California,6019,Fresno County,Fresno County,930450,315531,6011.2,53.21,5957.99,156.2,53.0,131043
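
The subtraction above works because pandas aligns the two columns on their index labels, not on row order. The sketch below demonstrates this with two small Series whose rows are in different orders; Butte County is deliberately left out of the 2000 sample to show that a label missing on one side yields `NaN`.

```python
import pandas as pd

pop2010 = pd.Series(
    {"Alameda County": 1510271, "Alpine County": 1175,
     "Amador County": 38091, "Butte County": 220000}
)
# Different row order, and no entry for Butte County
pop2000 = pd.Series(
    {"Amador County": 35100, "Alameda County": 1443741, "Alpine County": 1208}
)

# Values are matched by label before subtracting
change = pop2010 - pop2000
print(change)
```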


Find all counties with a population decrease:

`counties2010[counties2010["Population Change 2000-2010"] < 0]`

In [78]:
counties2010[counties2010["Population Change 2000-2010"] < 0]

Target Geo Id,Geography,FIPS,Geographic area,County,Population,Housing units,Area in square miles - Total area,Area in square miles - Water area,Area in square miles - Land area,Density per square mile of land area - Population,Density per square mile of land area - Housing units,Population Change 2000-2010
0500000US06003,California,6003,Alpine County,Alpine County,1175,1760,743.18,4.85,738.33,1.6,2.4,-33
0500000US06063,California,6063,Plumas County,Plumas County,20007,15566,2613.43,60.38,2553.04,7.8,6.1,-817
0500000US06091,California,6091,Sierra County,Sierra County,3240,2328,962.21,9.0,953.21,3.4,2.4,-315


### Merging DataFrames

Use the `merge` method to combine two datasets into one, aligning the rows on a common index:

`result = counties2010.merge(counties2000, on="Target Geo Id")`

`result.head()`

In [79]:
result = counties2010.merge(counties2000, on="Target Geo Id")
result.head()

Target Geo Id,Geography_x,FIPS_x,Geographic area_x,County_x,Population_x,Housing units,Area in square miles - Total area_x,Area in square miles - Water area_x,Area in square miles - Land area_x,Density per square mile of land area - Population_x,...,FIPS_y,Geographic area_y,County_y,Population_y,Housing units,Area in square miles - Total area_y,Area in square miles - Water area_y,Area in square miles - Land area_y,Density per square mile of land area - Population_y,Density per square mile of land area - Housing units
0500000US06001,California,6001,Alameda County,Alameda County,1510271,582549,821.33,82.31,739.02,2043.6,...,6001,California - Alameda County,Alameda County,1443741,540183,821.15,83.57,737.57,1957.4,732.4
0500000US06003,California,6003,Alpine County,Alpine County,1175,1760,743.18,4.85,738.33,1.6,...,6003,California - Alpine County,Alpine County,1208,1514,743.19,4.57,738.62,1.6,2.0
0500000US06005,California,6005,Amador County,Amador County,38091,18032,605.96,11.37,594.58,64.1,...,6005,California - Amador County,Amador County,35100,15035,604.69,11.73,592.97,59.2,25.4
0500000US06007,California,6007,Butte County,Butte County,220000,95835,1677.13,40.67,1636.46,134.4,...,6007,California - Butte County,Butte County,203171,85523,1677.11,37.62,1639.49,123.9,52.2
0500000US06009,California,6009,Calaveras County,Calaveras County,45578,27925,1036.93,16.92,1020.01,44.7,...,6009,California - Calaveras County,Calaveras County,40554,22946,1036.84,16.81,1020.04,39.8,22.5


Specify columns to merge and supply suffix arguments to use in place of `_x` and `_y`:


`result = counties2010[["County","Population","Population Change 2000-2010"]].merge(counties2000[["Population"]], on="Target Geo Id", suffixes=(" 2010", " 2000"))`

In [80]:
result = counties2010[["County","Population","Population Change 2000-2010"]].merge(counties2000[["Population"]], on="Target Geo Id", suffixes=(" 2010", " 2000"))
result.head()

Target Geo Id,County,Population 2010,Population Change 2000-2010,Population 2000
0500000US06001,Alameda County,1510271,66530,1443741
0500000US06003,Alpine County,1175,-33,1208
0500000US06005,Amador County,38091,2991,35100
0500000US06007,Butte County,220000,16829,203171
0500000US06009,Calaveras County,45578,5024,40554
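
The same kind of merge can be written by telling pandas explicitly to match on the index of both DataFrames. The sketch below assumes two small samples of the 2010 and 2000 data and uses `left_index=True` and `right_index=True`; the suffixes are applied only to column names that occur on both sides.

```python
import pandas as pd

geo_ids = pd.Index(["0500000US06001", "0500000US06003"], name="Target Geo Id")
c2010 = pd.DataFrame(
    {"County": ["Alameda County", "Alpine County"], "Population": [1510271, 1175]},
    index=geo_ids,
)
c2000 = pd.DataFrame({"Population": [1443741, 1208]}, index=geo_ids)

# Match rows by index label; rename the overlapping "Population" columns
result = c2010.merge(
    c2000, left_index=True, right_index=True, suffixes=(" 2010", " 2000")
)
print(result)
```

"County" appears only in the left DataFrame, so it keeps its name unchanged.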


### Viewing Variables

In [81]:
%whos

Variable       Type         Data/Info
-------------------------------------
CountyName     DataFrame                             <...>              Yuba County
all_counties   DataFrame                  Id  Id2   G<...>\n[232 rows x 16 columns]
all_files      list         n=4
build_df       function     <function build_df at 0x11cf88ae8>
c_pop          DataFrame                            P<...>                  72155  
combineFIles   function     <function combineFIles at 0x11ce2e8c8>
combineFiles   function     <function combineFiles at 0x11ce2ea60>
counties2000   DataFrame                     Geograph<...>               35.9      
counties2010   DataFrame                     Geograph<...>                  11936  
df             DataFrame                             <...>                43.7     
f              str          ../data/counties2010.csv
file_out       str          ../data/all_counties.csv
glob           module       <module 'glob' from '/ana<...>3/lib/python3.6/glob.py'>
i

### Performing the Same Actions on Multiple Files

Using a `for` loop, read each file into a DataFrame and extract a column from it into a new DataFrame.

Create a list of files using `glob`

In [6]:
import glob
raw_files = glob.glob('../*data/counties*.csv')
print (raw_files)

['../data/counties1980.csv', '../data/counties1990.csv', '../data/counties2000.csv', '../data/counties2010.csv']


Loop over the list of files and read each csv into a dataframe `df`. Then, extract the year from the filename:

`for f in raw_files:`

&nbsp;&nbsp;&nbsp;&nbsp;`df = pd.read_csv(f)`

&nbsp;&nbsp;&nbsp;&nbsp;`year = f[-8:-4]`

In [5]:
for f in raw_files:
    df = pd.read_csv(f)
    year = f[-8:-4]
    print (f, year)

../data/counties1980.csv 1980
../data/counties1990.csv 1990
../data/counties2000.csv 2000
../data/counties2010.csv 2010


Create an empty dataframe

`all_counties = pd.DataFrame()`

Define a function to extract the column data into the new dataframe

`def build_df(df):`

&nbsp;&nbsp;`all_counties["Population " + year] = df["Population"]`

In [84]:

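raw_files = sorted(glob.glob("../*data/counties*.csv"))

# Collect the "Population" column of each file under a year-specific name
c_pop = pd.DataFrame()
for f in raw_files:
    df = pd.read_csv(f, header=1, skiprows=[2], index_col="County")
    year = f[-8:-4]  # e.g. "2010" from "counties2010.csv"
    c_pop["Population " + year] = df["Population"]

c_pop.head()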

County,Population 1980,Population 1990,Population 2000,Population 2010
Alameda County,1073184,1105379,1443741,1510271
Alpine County,484,1097,1208,1175
Amador County,11821,19314,35100,38091
Butte County,101969,143851,203171,220000
Calaveras County,13585,20710,40554,45578
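
The whole pattern can be tried without the original data files. The sketch below writes two tiny sample files to a temporary directory (an assumption made so the example is self-contained; the samples have a single header row), then loops over them. Instead of adding columns to an empty DataFrame one at a time, it collects them in a dictionary and builds the DataFrame once, which is a common alternative.

```python
import glob
import os
import tempfile

import pandas as pd

# Write two small sample files named like the originals
tmp_dir = tempfile.mkdtemp()
samples = {"2000": [1443741, 1208], "2010": [1510271, 1175]}
for sample_year, pops in samples.items():
    sample = pd.DataFrame(
        {"County": ["Alameda County", "Alpine County"], "Population": pops}
    )
    sample.to_csv(os.path.join(tmp_dir, "counties" + sample_year + ".csv"), index=False)

# Read each file and keep its "Population" column under a per-year name
columns = {}
for f in sorted(glob.glob(os.path.join(tmp_dir, "counties*.csv"))):
    df = pd.read_csv(f, index_col="County")
    year = f[-8:-4]  # the four characters before ".csv"
    columns["Population " + year] = df["Population"]

c_pop = pd.DataFrame(columns)
print(c_pop)
```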


### Writing a DataFrame to a File

Use `to_csv(filename)`

In [85]:
c_pop.to_csv("../data/ca_population.csv")
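
By default `to_csv` writes the index as the first column, so it can be restored with `index_col` when the file is read back. A minimal round-trip sketch, using a temporary directory and a two-row sample rather than the real output file:

```python
import os
import tempfile

import pandas as pd

c_pop = pd.DataFrame(
    {"Population 2000": [1443741, 1208], "Population 2010": [1510271, 1175]},
    index=pd.Index(["Alameda County", "Alpine County"], name="County"),
)

out_path = os.path.join(tempfile.mkdtemp(), "ca_population.csv")
c_pop.to_csv(out_path)  # the "County" index is written as the first column

# Read the file back, turning the first column into the index again
restored = pd.read_csv(out_path, index_col="County")
print(restored)
```

Passing `index=False` to `to_csv` omits the index column, which is appropriate when the index is only a generated row number.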

### Concatenating Files

Use `pd.concat` to stack the rows of several files into one DataFrame and write the result to a single csv:

`def produceOneCSV(list_of_files):`

&nbsp;&nbsp;&nbsp;&nbsp;`result_obj = pd.concat([pd.read_csv(file, header=1, skiprows=[2]) for file in list_of_files], sort=False)`

&nbsp;&nbsp;&nbsp;&nbsp;`result_obj.to_csv(file_out, index=False, encoding="utf-8")`


`all_files = glob.glob("../*data/counties*.csv")`

`print(all_files)`

`file_out = "../data/all_counties.csv"`

`produceOneCSV(all_files)`

In [87]:
all_counties = pd.read_csv("../data/all_counties.csv")
all_counties

Unnamed: 0,Id,Id2,Geography,Target Geo Id,Target Geo Id2,Geographic area,County,Population,Housing units,Area in square miles - Total area,Area in square miles - Water area,Area in square miles - Land area,Density per square mile of land area - Population,Density per square mile of land area - Housing units,Housing units.1,Density per square mile of land area - Housing units.1
0,0400000US06,6,California,0500000US06001,6001,California - Alameda County,Alameda County,1073184,540183.0,821.15,83.57,737.57,1957.4,732.4,,
1,0400000US06,6,California,0500000US06003,6003,California - Alpine County,Alpine County,484,1514.0,743.19,4.57,738.62,1.6,2.0,,
2,0400000US06,6,California,0500000US06005,6005,California - Amador County,Amador County,11821,15035.0,604.69,11.73,592.97,59.2,25.4,,
3,0400000US06,6,California,0500000US06007,6007,California - Butte County,Butte County,101969,85523.0,1677.11,37.62,1639.49,123.9,52.2,,
4,0400000US06,6,California,0500000US06009,6009,California - Calaveras County,Calaveras County,13585,22946.0,1036.84,16.81,1020.04,39.8,22.5,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
227,0400000US06,6,California,0500000US06107,6107,California - Tulare County,Tulare County,442179,,4838.65,14.44,4824.21,91.7,,141696.0,29.4
228,0400000US06,6,California,0500000US06109,6109,California - Tuolumne County,Tuolumne County,55365,,2274.43,53.55,2220.88,24.9,,31244.0,14.1
229,0400000US06,6,California,0500000US06111,6111,California - Ventura County,Ventura County,823318,,2208.38,365.25,1843.13,446.7,,281695.0,152.8
230,0400000US06,6,California,0500000US06113,6113,California - Yolo County,Yolo County,200849,,1023.56,8.87,1014.69,197.9,,75054.0,74.0
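
The `NaN` values in the output above appear because `pd.concat` keeps the union of all column names and fills in missing values wherever a file lacks a column. The sketch below reproduces the effect with two one-row DataFrames whose column names differ; the frames are built by hand from the Alameda County figures for 2000 and 2010.

```python
import pandas as pd

a = pd.DataFrame(
    {"County": ["Alameda County"], "Population": [1443741], "Housing units": [540183]}
)
b = pd.DataFrame(
    {"County": ["Alameda County"], "Population": [1510271], "Housing units.1": [582549]}
)

# Rows are stacked; columns missing from one input are filled with NaN
combined = pd.concat([a, b], ignore_index=True, sort=False)
print(combined)
```

Renaming the columns to a common name before concatenating (for example with `rename`) would avoid the extra column.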
