# Pair Distances

It is common practice to combine data sets from different sources for further analysis. For example, the files `stations1.csv` and `stations2.csv` list different sets of stations along with their latitude and longitude. 
Assume that the stations in `stations1.csv` measure air quality, while the stations in `stations2.csv` measure weather. 

One might assume that the air quality measured at a station in the first data set correlates strongly with the weather conditions recorded at the nearest weather station. To combine the two data sets, we therefore need to find, for each station in `stations1.csv`, the closest station in `stations2.csv`.

In [1]:
# Import functions
import sys
sys.path.append("../")
import pandas as pd

In [2]:
df1 = pd.read_csv("stations1.csv")
df2 = pd.read_csv("stations2.csv")

In [3]:
df1.head(10)

| | station_id | latitude | longitude |
|---|---|---|---|
| 0 | dongsi | 39.929 | 116.417 |
| 1 | tiantan | 39.886 | 116.407 |
| 2 | guanyuan | 39.929 | 116.339 |
| 3 | wanshouxigong | 39.878 | 116.352 |
| 4 | aotizhongxin | 39.982 | 116.397 |
| 5 | nongzhanguan | 39.937 | 116.461 |
| 6 | wanliu | 39.987 | 116.287 |
| 7 | beibuxinqu | 40.09 | 116.174 |
| 8 | zhiwuyuan | 40.002 | 116.207 |
| 9 | fengtaihuayuan | 39.863 | 116.279 |


In [4]:
df2.head(10)

| | station_id | latitude | longitude |
|---|---|---|---|
| 0 | shunyi | 40.126667 | 116.615278 |
| 1 | hadian | 39.986944 | 116.290556 |
| 2 | yanqing | 40.449444 | 115.968889 |
| 3 | miyun | 40.3775 | 116.864167 |
| 4 | huairou | 40.357778 | 116.626944 |
| 5 | shangdianzi | 40.658889 | 117.111667 |
| 6 | pinggu | 40.169444 | 117.117778 |
| 7 | tongzhou | 39.8475 | 116.756667 |
| 8 | chaoyang | 39.9525 | 116.500833 |
| 9 | pingchang | 40.223333 | 116.211667 |


Before looking for the closest stations, it helps to understand how the station sets of the two data sets overlap. The **sets_grps** function from [preprocess.py](https://github.com/JQGoh/jqlearning/blob/master/script/preprocess.py) is designed for this purpose.

In [5]:
from script.preprocess import sets_grps

In [6]:
sets_grps(df1.station_id, df2.station_id)

Common elements in both sets:
{'miyun', 'pingchang', 'huairou', 'tongzhou', 'shunyi', 'mentougou', 'daxing', 'pinggu', 'fangshan'} 9

Elements of set1 not in set2:
{'dongsihuan', 'yufa', 'guanyuan', 'miyunshuiku', 'yungang', 'beibuxinqu', 'donggaocun', 'qianmen', 'zhiwuyuan', 'nansanhuan', 'badaling', 'gucheng', 'wanshouxigong', 'yongledian', 'yongdingmennei', 'liulihe', 'dongsi', 'yanqin', 'aotizhongxin', 'tiantan', 'nongzhanguan', 'xizhimenbei', 'fengtaihuayuan', 'wanliu', 'yizhuang', 'dingling'} 26

Elements of set2 not in set1:
{'zhaitang', 'hadian', 'fengtai', 'beijing', 'shijingshan', 'chaoyang', 'shangdianzi', 'xiayunling', 'yanqing'} 9


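The summary above shows that the two data sets have 9 station names in common. The comparison is by exact name, so similarly spelled entries such as `yanqin` and `yanqing` are counted as different stations.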
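The same kind of summary can be approximated with Python's built-in set operations. The snippet below is only a minimal sketch and not the actual implementation of **sets_grps**; the helper name `compare_sets` and the short lists of station names are made up for illustration.

```python
def compare_sets(first, second):
    """Hypothetical helper: split two collections of labels into three groups."""
    set1, set2 = set(first), set(second)
    return {
        "common": set1 & set2,        # labels present in both collections
        "only_first": set1 - set2,    # labels found only in the first one
        "only_second": set2 - set1,   # labels found only in the second one
    }


# Illustrative subsets of the station names, not the full data sets
air_ids = ["dongsi", "shunyi", "miyun", "wanliu"]
weather_ids = ["shunyi", "hadian", "miyun", "chaoyang"]

groups = compare_sets(air_ids, weather_ids)
for name, members in groups.items():
    print(name, sorted(members), len(members))
```

Because the comparison relies on exact string equality, it can be worth normalizing the labels (for example stripping whitespace or lower-casing) before comparing them.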

To evaluate the distances between selected features of the two data sets, the **pair_dist** function from [preprocess.py](https://github.com/JQGoh/jqlearning/blob/master/script/preprocess.py) comes in handy.

It takes the two dataframes of interest and, for each, the selected features as a key-value pair (Python dict). The key specifies the group label (`station_id` in this example), and the value lists the features (`latitude` and `longitude`) used for the distance evaluation. The distance calculation is based on the [cdist](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html) function from the SciPy package.

The **pair_dist** function expects the two key-value pairs to list the same number of features, because the distance calculation must refer to a consistent set of features. The returned dataframe uses the group labels of the first data set as its index and the group labels of the second data set as its columns. 
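To make the idea concrete, here is a minimal sketch of how such a labelled distance table could be built directly with `cdist`. It is not the code of **pair_dist** itself; the helper name `pairwise_distances` and its signature are assumptions for illustration, and it uses the default Euclidean metric of `cdist`, so the distances are expressed in degrees rather than kilometres.

```python
import pandas as pd
from scipy.spatial.distance import cdist


def pairwise_distances(left, right, label, features):
    """Hypothetical helper: Euclidean distances between the rows of two dataframes."""
    # cdist returns a (len(left), len(right)) array; its default metric is Euclidean
    distances = cdist(left[features].to_numpy(), right[features].to_numpy())
    return pd.DataFrame(
        distances,
        index=left[label].to_numpy(),     # labels of the first data set as rows
        columns=right[label].to_numpy(),  # labels of the second data set as columns
    )


# A few coordinates taken from the two station tables
air = pd.DataFrame({
    "station_id": ["dongsi", "wanliu"],
    "latitude": [39.929, 39.987],
    "longitude": [116.417, 116.287],
})
weather = pd.DataFrame({
    "station_id": ["shunyi", "hadian", "chaoyang"],
    "latitude": [40.126667, 39.986944, 39.9525],
    "longitude": [40.126667 * 0 + 116.615278, 116.290556, 116.500833],
})

dist = pairwise_distances(air, weather, "station_id", ["latitude", "longitude"])
print(dist.round(6))
```

For nearby stations, Euclidean distance in degrees is usually adequate for ranking neighbours, but it is not a physical distance; a great-circle metric would be needed if actual distances matter.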

In [7]:
from script.preprocess import pair_dist

In [8]:
pair_dist(df1, 
          df2, 
          {"station_id": ["latitude", "longitude"]}, 
          {"station_id": ["latitude", "longitude"]})

| | shunyi | hadian | yanqing | miyun | huairou | shangdianzi | pinggu | tongzhou | chaoyang | pingchang | zhaitang | mentougou | beijing | shijingshan | fengtai | daxing | fangshan | xiayunling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dongsi | 0.279975 | 0.139089 | 0.686779 | 0.633333 | 0.477417 | 1.007621 | 0.74088 | 0.349307 | 0.087065 | 0.358879 | 0.726167 | 0.263851 | 0.133612 | 0.212152 | 0.181485 | 0.219492 | 0.27198 | 0.705502 |
| tiantan | 0.318277 | 0.154107 | 0.71373 | 0.671248 | 0.520528 | 1.045903 | 0.76521 | 0.35178 | 0.115008 | 0.389806 | 0.720161 | 0.250617 | 0.101398 | 0.209485 | 0.162485 | 0.175446 | 0.240945 | 0.684777 |
| guanyuan | 0.339708 | 0.075528 | 0.638627 | 0.690617 | 0.51649 | 1.062898 | 0.815051 | 0.425544 | 0.163531 | 0.320696 | 0.648334 | 0.187206 | 0.179213 | 0.134402 | 0.110599 | 0.210955 | 0.212827 | 0.631103 |
| wanshouxigong | 0.362147 | 0.125077 | 0.687985 | 0.715412 | 0.552975 | 1.089441 | 0.819363 | 0.405814 | 0.166438 | 0.372758 | 0.666709 | 0.195855 | 0.1377 | 0.160274 | 0.107001 | 0.159408 | 0.189538 | 0.629429 |
| aotizhongxin | 0.261866 | 0.106559 | 0.633864 | 0.612099 | 0.440549 | 0.984341 | 0.744752 | 0.383993 | 0.107943 | 0.304286 | 0.704824 | 0.258402 | 0.190224 | 0.195749 | 0.188418 | 0.266805 | 0.291203 | 0.703651 |
| nongzhanguan | 0.244489 | 0.177611 | 0.710474 | 0.597146 | 0.452318 | 0.971849 | 0.696698 | 0.308916 | 0.042743 | 0.379676 | 0.769662 | 0.308562 | 0.131161 | 0.255781 | 0.225805 | 0.242998 | 0.313174 | 0.749977 |
| wanliu | 0.356754 | 0.003556 | 0.561293 | 0.696858 | 0.503029 | 1.063724 | 0.850575 | 0.489946 | 0.216599 | 0.24805 | 0.594922 | 0.164025 | 0.256918 | 0.093053 | 0.123955 | 0.276733 | 0.233217 | 0.604455 |
| beibuxinqu | 0.442799 | 0.155582 | 0.413849 | 0.747654 | 0.526178 | 1.096747 | 0.947116 | 0.631115 | 0.354579 | 0.138552 | 0.495572 | 0.202988 | 0.409732 | 0.15078 | 0.230994 | 0.412904 | 0.317585 | 0.564337 |
| zhiwuyuan | 0.426887 | 0.084901 | 0.506856 | 0.756881 | 0.550392 | 1.118 | 0.926042 | 0.570967 | 0.297974 | 0.221383 | 0.515545 | 0.124933 | 0.32749 | 0.059525 | 0.137171 | 0.319451 | 0.229304 | 0.540659 |
| fengtaihuayuan | 0.427321 | 0.124482 | 0.66339 | 0.779186 | 0.604872 | 1.151856 | 0.893004 | 0.477918 | 0.239208 | 0.36657 | 0.597164 | 0.12509 | 0.19876 | 0.108421 | 0.034499 | 0.162911 | 0.123639 | 0.554962 |


Finding the closest stations is then straightforward with the [idxmin](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.idxmin.html) function, which, when called with `axis=1`, returns for each row the column label of the smallest value.

This Jupyter notebook is available on my GitHub page: [Pair-Distances.ipynb](https://github.com/JQGoh/jqlearning/blob/master/posts/Pair-Distances.ipynb), and it is part of the repository [jqlearning](https://github.com/JQGoh/jqlearning).

In [9]:
station_pairs_df = pair_dist(df1, 
                             df2, 
                             {"station_id": ["latitude", "longitude"]}, 
                             {"station_id": ["latitude", "longitude"]})
station_pairs_df.idxmin(axis=1)

dongsi               chaoyang
tiantan               beijing
guanyuan               hadian
wanshouxigong         fengtai
aotizhongxin           hadian
nongzhanguan         chaoyang
wanliu                 hadian
beibuxinqu          pingchang
zhiwuyuan         shijingshan
fengtaihuayuan        fengtai
yungang             mentougou
gucheng           shijingshan
fangshan             fangshan
daxing                 daxing
yizhuang              beijing
tongzhou             tongzhou
shunyi                 shunyi
pingchang           pingchang
mentougou           mentougou
pinggu                 pinggu
huairou               huairou
miyun                   miyun
yanqin                yanqing
dingling            pingchang
badaling              yanqing
miyunshuiku             miyun
donggaocun             pinggu
yongledian           tongzhou
yufa                   daxing
liulihe              fangshan
qianmen              chaoyang
yongdingmennei        beijing
xizhimenbei            hadian
nansanhuan
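
When combining the data sets, it can be useful to keep the distance to the nearest station next to its label, for instance to discard matches that are too far apart. The following is a minimal sketch under that assumption; the small distance table reuses four values from the output of **pair_dist** shown earlier, and the column names `nearest_station` and `distance` are chosen here for illustration.

```python
import pandas as pd

# Toy distance table: rows are air quality stations, columns are weather stations
dist = pd.DataFrame(
    {"hadian": [0.139089, 0.003556], "chaoyang": [0.087065, 0.216599]},
    index=["dongsi", "wanliu"],
)

# idxmin gives the label of the closest column, min gives the distance itself
nearest = pd.DataFrame({
    "nearest_station": dist.idxmin(axis=1),
    "distance": dist.min(axis=1),
})
nearest.index.name = "station_id"
print(nearest)
```

A lookup table of this shape can then serve as the join key when merging the air quality records with the weather records.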