# Concatenating data frames

Concatenating data is a common GIS operation. It simply means combining multiple data sources, which usually share a common structure (i.e. the same attributes and geometry type). For example, suppose you are collecting data in the field with a mobile GPS device, and each day's data is downloaded as a shapefile. After a week you have five shapefiles with similar data, and they must be combined into a single master file that contains all of the data.
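
As a rough sketch of that field-collection scenario, the snippet below stacks five small made-up pandas data frames. The frames and the column names `site_id` and `survey_day` are hypothetical stand-ins for the daily shapefiles; with real data each frame would be read from disk instead.

```python
import pandas as pd

# Hypothetical stand-ins for five daily downloads; in practice each frame
# would come from something like gpd.read_file(path) on that day's shapefile.
daily_frames = [
    pd.DataFrame({"site_id": [day * 10 + 1, day * 10 + 2],
                  "survey_day": [day, day]})
    for day in range(1, 6)
]

# Stack the daily frames into one master table with a fresh 0..n-1 index.
master = pd.concat(daily_frames, ignore_index=True)
print(len(master))  # 5 days x 2 records = 10 rows
```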

Let's load the BUOWL data first.

In [1]:
%matplotlib inline
import geopandas as gpd

buowl = gpd.read_file("data/BUOWL_Habitat.shp")
buowl.head()

Unnamed: 0,postgis_fi,habitat,hist_occup,recentstat,habitat_id,active2017,geometry
0,15.0,Ground squirrel-mixed Vegetation,Undetermined,NO NESTING ACTIVITY OBSERVED,15,False,"POLYGON ((-104.61687 40.16775, -104.61676 40.1..."
1,41.0,Ground squirrel-mixed Vegetation; removed 3/26/14,Undetermined,REMOVED,41,False,"POLYGON ((-104.65030 40.14220, -104.65014 40.1..."
2,42.0,Ground squirrel-mixed Vegetation; removed 3/26/14,Undetermined,REMOVED,42,False,"POLYGON ((-104.59917 40.11202, -104.59902 40.1..."
3,43.0,Ground squirrel-mixed Vegetation; removed 3/26/14,Undetermined,REMOVED,43,False,"POLYGON ((-104.69383 40.17870, -104.69360 40.1..."
4,54.0,Active Prarie Dog Colony,Undetermined,NO NESTING ACTIVITY OBSERVED,54,False,"POLYGON ((-104.68393 40.19921, -104.68402 40.1..."


Now let's separate this dataframe into two dataframes. One will contain the historically occupied buowl habitat and the other the undetermined buowl habitat.

In [2]:
buowl_ho = buowl[buowl['hist_occup'] == 'Yes']
buowl_ho.head()

Unnamed: 0,postgis_fi,habitat,hist_occup,recentstat,habitat_id,active2017,geometry
11,128.0,Active Prarie Dog Colony,Yes,NO NESTING ACTIVITY OBSERVED,128,False,"POLYGON ((-104.89222 40.11220, -104.89223 40.1..."
12,129.0,Active Prarie Dog Colony,Yes,NO NESTING ACTIVITY OBSERVED,129,False,"POLYGON ((-104.85957 40.16792, -104.85992 40.1..."
13,130.0,Active Prarie Dog Colony,Yes,NO NESTING ACTIVITY OBSERVED,130,False,"POLYGON ((-104.64250 40.21463, -104.64321 40.2..."
15,131.0,Active Prarie Dog Colony,Yes,NO NESTING ACTIVITY OBSERVED,131,False,"POLYGON ((-104.87795 40.00594, -104.87787 40.0..."
16,132.0,Active Prarie Dog Colony,Yes,UNDETERMINED,132,False,"POLYGON ((-104.88152 40.07898, -104.88082 40.0..."


In [3]:
buowl_und = buowl[buowl['hist_occup'] == 'Undetermined']
buowl_und.head()

Unnamed: 0,postgis_fi,habitat,hist_occup,recentstat,habitat_id,active2017,geometry
0,15.0,Ground squirrel-mixed Vegetation,Undetermined,NO NESTING ACTIVITY OBSERVED,15,False,"POLYGON ((-104.61687 40.16775, -104.61676 40.1..."
1,41.0,Ground squirrel-mixed Vegetation; removed 3/26/14,Undetermined,REMOVED,41,False,"POLYGON ((-104.65030 40.14220, -104.65014 40.1..."
2,42.0,Ground squirrel-mixed Vegetation; removed 3/26/14,Undetermined,REMOVED,42,False,"POLYGON ((-104.59917 40.11202, -104.59902 40.1..."
3,43.0,Ground squirrel-mixed Vegetation; removed 3/26/14,Undetermined,REMOVED,43,False,"POLYGON ((-104.69383 40.17870, -104.69360 40.1..."
4,54.0,Active Prarie Dog Colony,Undetermined,NO NESTING ACTIVITY OBSERVED,54,False,"POLYGON ((-104.68393 40.19921, -104.68402 40.1..."


In [4]:
buowl_und.count()

postgis_fi    359
habitat       348
hist_occup    359
recentstat    359
habitat_id    359
active2017    359
geometry      359
dtype: int64
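
When a table is split by filtering on category values, it can be worth checking whether the filters account for every row. The sketch below uses a made-up `hist_occup` column (not the real habitat data) to show how `value_counts(dropna=False)` exposes rows that neither filter would pick up.

```python
import pandas as pd

# Made-up stand-in for the hist_occup column of the habitat table.
demo = pd.DataFrame({"hist_occup": ["Yes", "Undetermined", None, "Yes"]})

# dropna=False also counts missing values, which neither filter would select.
counts = demo["hist_occup"].value_counts(dropna=False)
print(counts)

n_yes = (demo["hist_occup"] == "Yes").sum()
n_und = (demo["hist_occup"] == "Undetermined").sum()

# Rows that fall into neither subset.
n_unassigned = len(demo) - n_yes - n_und
print(n_unassigned)  # 1
```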

We know that these two dataframes have identical structures because we created them from the same dataframe. In this case it is very easy to concatenate them back into a single dataframe using the pandas concat function. The concat function has many options, but in its simplest form we can just pass it a list of dataframes to concatenate.

In [5]:
import pandas as pd

buowl_all = pd.concat([buowl_ho, buowl_und])
buowl_all.count()

postgis_fi    469
habitat       455
hist_occup    469
recentstat    469
habitat_id    469
active2017    469
geometry      469
dtype: int64
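
One detail worth knowing is that concat keeps the original row labels by default. The following sketch, built on a small made-up table rather than the habitat data, shows the labels that result and how `ignore_index=True` renumbers them.

```python
import pandas as pd

# Toy frame standing in for the habitat table.
demo = pd.DataFrame({"habitat_id": [15, 41, 128, 129],
                     "hist_occup": ["Undetermined", "Undetermined", "Yes", "Yes"]})
demo_ho = demo[demo["hist_occup"] == "Yes"]
demo_und = demo[demo["hist_occup"] == "Undetermined"]

# By default the original row labels are kept, in the order the frames were passed.
kept = pd.concat([demo_ho, demo_und])
print(kept.index.tolist())        # [2, 3, 0, 1]

# ignore_index=True discards them and numbers the rows 0..n-1.
renumbered = pd.concat([demo_ho, demo_und], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]

# Either way, the row count is the sum of the inputs.
print(len(kept) == len(demo_ho) + len(demo_und))  # True
```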

But what if the data structures are not identical?

Let's read in a few data frames that were created as the intersections of environmental constraints and project buffers.

In [6]:
raptor_buffer = gpd.read_file("data/intersections.gpkg", layer = 'raptor_buffer')
buowl_buffer = gpd.read_file("data/intersections.gpkg", layer = 'buowl_buffer')

In [7]:
raptor_buffer.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 847 entries, 0 to 846
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   Nest_ID     847 non-null    int64   
 1   recentstat  847 non-null    object  
 2   recentspec  847 non-null    object  
 3   Project     847 non-null    int64   
 4   type        847 non-null    object  
 5   length_m    847 non-null    float64 
 6   area_ha     847 non-null    float64 
 7   geometry    847 non-null    geometry
dtypes: float64(2), geometry(1), int64(2), object(3)
memory usage: 53.1+ KB


In [8]:
buowl_buffer.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 413 entries, 0 to 412
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   habitat_id  413 non-null    int64   
 1   hist_occup  411 non-null    object  
 2   Project     413 non-null    int64   
 3   type        413 non-null    object  
 4   length_m    413 non-null    float64 
 5   area_ha     413 non-null    float64 
 6   geometry    413 non-null    geometry
dtypes: float64(2), geometry(1), int64(2), object(2)
memory usage: 22.7+ KB
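
The two `info()` listings can also be compared programmatically instead of by eye. This is a minimal sketch using empty toy frames that only mimic some of the column names listed above; the variable names `raptor_demo` and `buowl_demo` are made up for illustration.

```python
import pandas as pd

# Empty toy frames that only mimic the column names listed above
# (geometry and the measurement columns are left out for brevity).
raptor_demo = pd.DataFrame(columns=["Nest_ID", "recentstat", "recentspec", "Project", "type"])
buowl_demo = pd.DataFrame(columns=["habitat_id", "hist_occup", "Project", "type"])

# Column labels are an Index, which supports set-style operations.
shared = raptor_demo.columns.intersection(buowl_demo.columns)
only_raptor = raptor_demo.columns.difference(buowl_demo.columns)
only_buowl = buowl_demo.columns.difference(raptor_demo.columns)

print(sorted(shared))       # ['Project', 'type']
print(sorted(only_raptor))  # ['Nest_ID', 'recentspec', 'recentstat']
print(sorted(only_buowl))   # ['habitat_id', 'hist_occup']
```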


Notice that in this case some of the column names differ (although they hold similar types of information). Let's do a quick, simple concatenation and see what happens.

In [9]:
ec = pd.concat([raptor_buffer, buowl_buffer])
ec

Unnamed: 0,Nest_ID,recentstat,recentspec,Project,type,length_m,area_ha,geometry,habitat_id,hist_occup
0,361.0,INACTIVE NEST,Swainsons Hawk,1003,Pipeline,1359.173136,5.265382,"POLYGON ((517254.228 4460633.114, 517244.804 4...",,
1,219.0,INACTIVE NEST,Red-tail Hawk,1003,Pipeline,1359.173136,8.067820,"POLYGON ((517140.043 4460868.609, 517128.114 4...",,
2,362.0,INACTIVE NEST,Swainsons Hawk,977,Flowline,272.268013,0.555339,"POLYGON ((518257.982 4452433.583, 518262.939 4...",,
3,511.0,ACTIVE NEST,Swainsons Hawk,977,Flowline,272.268013,0.143751,"POLYGON ((518407.545 4452393.335, 518404.320 4...",,
4,3.0,ACTIVE NEST,Swainsons Hawk,87,Pipeline,14108.354396,3.205486,"POLYGON ((522030.885 4448245.106, 522030.278 4...",,
...,...,...,...,...,...,...,...,...,...,...
408,,,,599,Access Road - Confirmed,391.400697,1.690594,"POLYGON ((493333.526 4434586.059, 493331.487 4...",393.0,Undetermined
409,,,,277,Flowline,1126.003756,3.934133,"POLYGON ((494034.646 4442424.241, 494053.959 4...",396.0,Undetermined
410,,,,981,Access Road - Confirmed,98.753176,0.520475,"POLYGON ((501663.476 4448031.217, 501665.436 4...",400.0,Undetermined
411,,,,464,Access Road - Confirmed,1398.298546,1.958287,"POLYGON ((521956.288 4434698.030, 521962.939 4...",404.0,Undetermined


Notice that identically named columns are combined automatically into a single column, but there are now two new columns, *habitat_id* and *hist_occup*, that reflect the columns in buowl_buffer that are not found in raptor_buffer. Rows that have no value for a column are filled with NaN.
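
The same behaviour can be reproduced on a tiny scale. The sketch below uses two made-up frames to show the default outer join on columns, why an integer column such as *Nest_ID* turns into floats (361.0 in the output above), and the `join="inner"` option, which keeps only the columns that every frame shares.

```python
import pandas as pd

# Toy frames with one shared column and one unshared column each.
left = pd.DataFrame({"Nest_ID": [361, 219], "Project": [1003, 1003]})
right = pd.DataFrame({"habitat_id": [393, 396], "Project": [599, 277]})

# Default join="outer": union of the columns, gaps filled with NaN.
outer = pd.concat([left, right])
print(outer.columns.tolist())  # ['Nest_ID', 'Project', 'habitat_id']
print(outer["Nest_ID"].dtype)  # float64, because NaN cannot be stored in int64

# join="inner": only the columns that every frame has.
inner = pd.concat([left, right], join="inner")
print(inner.columns.tolist())  # ['Project']
```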

Now, we have a few issues to resolve. One is that the buowl_buffer dataframe has no species column because all of the buowl habitat reflects a single species. But if we are going to combine these results with the raptor intersections then we should have a species column containing the text BUOWL so that we can differentiate them from the raptor nests. This is easy enough to do.

In [10]:
buowl_buffer['recentspec'] = 'BUOWL'
buowl_buffer.head()

Unnamed: 0,habitat_id,hist_occup,Project,type,length_m,area_ha,geometry,recentspec
0,15,Undetermined,797,Pipeline,572.739025,6.511527,"POLYGON ((532634.811 4445796.290, 532634.613 4...",BUOWL
1,153,Undetermined,797,Pipeline,572.739025,0.375981,"POLYGON ((532605.864 4445805.297, 532606.674 4...",BUOWL
2,43,Undetermined,87,Pipeline,14108.354396,4.788766,"POLYGON ((525798.474 4447770.931, 525801.285 4...",BUOWL
3,54,Undetermined,87,Pipeline,14108.354396,11.667813,"POLYGON ((526607.041 4450520.402, 526624.190 4...",BUOWL
4,429,Yes,87,Pipeline,14108.354396,0.039634,"POLYGON ((526568.823 4448606.149, 526572.230 4...",BUOWL


We see that we now have a recentspec column containing the text BUOWL for all the buowl_buffer records.

Another issue is that the buowl_buffer dataframe has column names *habitat_id* and *hist_occup* that contain similar types of information to the *Nest_ID* and *recentstat* columns in the raptor_buffer dataframe.  We want those columns to be combined when we concatenate so they have to have the same field names.

I will have an entire lecture on various ways to rename columns, but for now just know that we can simply assign a new list of column names to the columns attribute of the data frame. The list must contain one name for every column, in the existing column order. This is probably the easiest approach, although if you have a lot of columns there are more efficient ways.

In [11]:
buowl_buffer.columns = ['Nest_ID', 'recentstat', 'Project', 'type', 'length_m', 'area_ha', 'geometry', 'recentspec']
buowl_buffer.head()

Unnamed: 0,Nest_ID,recentstat,Project,type,length_m,area_ha,geometry,recentspec
0,15,Undetermined,797,Pipeline,572.739025,6.511527,"POLYGON ((532634.811 4445796.290, 532634.613 4...",BUOWL
1,153,Undetermined,797,Pipeline,572.739025,0.375981,"POLYGON ((532605.864 4445805.297, 532606.674 4...",BUOWL
2,43,Undetermined,87,Pipeline,14108.354396,4.788766,"POLYGON ((525798.474 4447770.931, 525801.285 4...",BUOWL
3,54,Undetermined,87,Pipeline,14108.354396,11.667813,"POLYGON ((526607.041 4450520.402, 526624.190 4...",BUOWL
4,429,Yes,87,Pipeline,14108.354396,0.039634,"POLYGON ((526568.823 4448606.149, 526572.230 4...",BUOWL
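
As one example of another way to rename columns, `DataFrame.rename` takes a mapping from old names to new names, so only the columns that change need to be listed and their position does not matter. This is a sketch on a made-up three-column frame, not the buffer data itself.

```python
import pandas as pd

# Toy frame with the two column names that need to change.
demo = pd.DataFrame({"habitat_id": [15, 153],
                     "hist_occup": ["Undetermined", "Undetermined"],
                     "Project": [797, 797]})

# rename() returns a new frame; columns not named in the mapping are untouched.
renamed = demo.rename(columns={"habitat_id": "Nest_ID", "hist_occup": "recentstat"})
print(renamed.columns.tolist())  # ['Nest_ID', 'recentstat', 'Project']
```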


Now that we have the same column names for both dataframes, let's combine them again.

In [12]:
ec = pd.concat([raptor_buffer, buowl_buffer])
ec

Unnamed: 0,Nest_ID,recentstat,recentspec,Project,type,length_m,area_ha,geometry
0,361,INACTIVE NEST,Swainsons Hawk,1003,Pipeline,1359.173136,5.265382,"POLYGON ((517254.228 4460633.114, 517244.804 4..."
1,219,INACTIVE NEST,Red-tail Hawk,1003,Pipeline,1359.173136,8.067820,"POLYGON ((517140.043 4460868.609, 517128.114 4..."
2,362,INACTIVE NEST,Swainsons Hawk,977,Flowline,272.268013,0.555339,"POLYGON ((518257.982 4452433.583, 518262.939 4..."
3,511,ACTIVE NEST,Swainsons Hawk,977,Flowline,272.268013,0.143751,"POLYGON ((518407.545 4452393.335, 518404.320 4..."
4,3,ACTIVE NEST,Swainsons Hawk,87,Pipeline,14108.354396,3.205486,"POLYGON ((522030.885 4448245.106, 522030.278 4..."
...,...,...,...,...,...,...,...,...
408,393,Undetermined,BUOWL,599,Access Road - Confirmed,391.400697,1.690594,"POLYGON ((493333.526 4434586.059, 493331.487 4..."
409,396,Undetermined,BUOWL,277,Flowline,1126.003756,3.934133,"POLYGON ((494034.646 4442424.241, 494053.959 4..."
410,400,Undetermined,BUOWL,981,Access Road - Confirmed,98.753176,0.520475,"POLYGON ((501663.476 4448031.217, 501665.436 4..."
411,404,Undetermined,BUOWL,464,Access Road - Confirmed,1398.298546,1.958287,"POLYGON ((521956.288 4434698.030, 521962.939 4..."


Notice that there are now no new columns and that the data line up properly, even though the columns are not in the same order.  The important thing is the column name.
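
A small made-up example makes the alignment by name explicit. The two frames below hold the same columns in a different order, and the values still end up in the right place.

```python
import pandas as pd

# Same column names, different column order.
a = pd.DataFrame({"Nest_ID": [361], "recentspec": ["Swainsons Hawk"], "area_ha": [5.27]})
b = pd.DataFrame({"area_ha": [1.69], "Nest_ID": [393], "recentspec": ["BUOWL"]})

combined = pd.concat([a, b], ignore_index=True)

# The result follows the column order of the first frame.
print(combined.columns.tolist())        # ['Nest_ID', 'recentspec', 'area_ha']
print(combined["Nest_ID"].tolist())     # [361, 393]
print(combined["recentspec"].tolist())  # ['Swainsons Hawk', 'BUOWL']
```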

And now we can create data summaries that integrate both raptor nests and buowl habitat.

In [13]:
pd.pivot_table(ec, index=['Project', 'recentspec', 'recentstat'], values='area_ha', aggfunc=['sum', 'count'])

Project,recentspec,recentstat,sum (area_ha),count (area_ha)
2,Swainsons Hawk,ACTIVE NEST,0.004884,1
3,BUOWL,Undetermined,1.103370,1
3,Red-tail Hawk,ACTIVE NEST,2.028860,1
3,Red-tail Hawk,INACTIVE NEST,1.661129,1
3,Swainsons Hawk,FLEDGED NEST,1.671810,1
...,...,...,...,...
1106,Swainsons Hawk,FLEDGED NEST,4.361714,2
1107,Red-tail Hawk,ACTIVE NEST,27.121996,2
1107,Red-tail Hawk,INACTIVE NEST,30.790307,3
1107,Swainsons Hawk,ACTIVE NEST,2.540957,1
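
The same kind of summary can also be written with `groupby` and `agg`. The sketch below runs on a handful of made-up rows rather than the combined `ec` table, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up rows using the same column names as the combined table.
demo = pd.DataFrame({
    "Project": [3, 3, 3, 1107],
    "recentspec": ["BUOWL", "Red-tail Hawk", "Red-tail Hawk", "Red-tail Hawk"],
    "recentstat": ["Undetermined", "ACTIVE NEST", "ACTIVE NEST", "ACTIVE NEST"],
    "area_ha": [1.0, 2.0, 0.5, 4.0],
})

# Total area and number of records per project, species and status.
summary = (demo.groupby(["Project", "recentspec", "recentstat"])["area_ha"]
               .agg(["sum", "count"]))
print(summary)
```

Here the result has plain `sum` and `count` columns instead of the two-level column header that `pivot_table` produces when it is given a list of aggregation functions.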


Pandas has a rich set of functionality for doing these kinds of data manipulations. As is sometimes the case with open source projects, there may be several different ways to achieve the same result. I chose the simplest method for concatenating data that I could, and the one most similar to the methods available in desktop GIS, but the same result could be achieved in other ways.

Take a look at the documentation for the concat function for more information on other ways it can be used.

In [14]:
help(pd.concat)

Help on function concat in module pandas.core.reshape.concat:

concat(objs: 'Iterable[NDFrame] | Mapping[HashableT, NDFrame]', *, axis: 'Axis' = 0, join: 'str' = 'outer', ignore_index: 'bool' = False, keys=None, levels=None, names=None, verify_integrity: 'bool' = False, sort: 'bool' = False, copy: 'bool | None' = None) -> 'DataFrame | Series'
    Concatenate pandas objects along a particular axis.
    
    Allows optional set logic along the other axes.
    
    Can also add a layer of hierarchical indexing on the concatenation axis,
    which may be useful if the labels are the same (or overlapping) on
    the passed axis number.
    
    Parameters
    ----------
    objs : a sequence or mapping of Series or DataFrame objects
        If a mapping is passed, the sorted keys will be used as the `keys`
        argument, unless it is passed, in which case the values will be
        selected (see below). Any None objects will be dropped silently unless
        they are all None in which c
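
The hierarchical indexing mentioned in the help text is controlled by the `keys` parameter. As a small sketch on made-up frames (the labels "raptor" and "buowl" are arbitrary), it can be used to record which input each row came from.

```python
import pandas as pd

raptor_demo = pd.DataFrame({"Nest_ID": [361, 219]})
buowl_demo = pd.DataFrame({"Nest_ID": [15, 153]})

# keys adds an outer index level recording which input each row came from.
labelled = pd.concat([raptor_demo, buowl_demo], keys=["raptor", "buowl"])
print(labelled.index.tolist())
# [('raptor', 0), ('raptor', 1), ('buowl', 0), ('buowl', 1)]

# xs() selects the rows belonging to one key.
print(labelled.xs("buowl")["Nest_ID"].tolist())  # [15, 153]
```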

More information on this topic can be found in the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html). This page provides a good overview of the many methods that are available.