# Ranking schools by similarity

This notebook takes the similarity matrix and, for each school, finds the 15 most similar schools, then saves the results to disk.

In [1]:
import pandas as pd

Load the similarity matrix:

In [2]:
cosim = pd.read_csv('data/similarity_index.csv', index_col='UNITID')
cosim.head()

UNITID,INSTNM,ZIP,100654,100663,100690,100706,100724,100751,100812,100830,...,45891904,45891905,45891906,45891907,45896401,45896402,45897301,45897302,45897303,45897304
100654,Alabama A & M University,35762,1.0,0.751817,0.649907,0.721509,0.928033,0.622077,0.61894,0.849199,...,0.481003,0.481003,0.481003,0.481003,0.481003,0.481003,0.481003,0.481003,0.481003,0.481003
100663,University of Alabama at Birmingham,35294-0110,0.751817,1.0,0.790216,0.991159,0.663715,0.84265,0.811736,0.878556,...,0.586366,0.586366,0.586366,0.586366,0.586366,0.586366,0.586366,0.586366,0.586366,0.586366
100690,Amridge University,36117-3553,0.649907,0.790216,1.0,0.764079,0.589172,0.693059,0.706366,0.76968,...,0.605232,0.605232,0.605232,0.605232,0.605232,0.605232,0.605232,0.605232,0.605232,0.605232
100706,University of Alabama in Huntsville,35899,0.721509,0.991159,0.764079,1.0,0.629289,0.837913,0.815738,0.866514,...,0.588891,0.588891,0.588891,0.588891,0.588891,0.588891,0.588891,0.588891,0.588891,0.588891
100724,Alabama State University,36104-0271,0.928033,0.663715,0.589172,0.629289,1.0,0.522988,0.539998,0.785523,...,0.449587,0.449587,0.449587,0.449587,0.449587,0.449587,0.449587,0.449587,0.449587,0.449587
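
One detail of this load step matters later: `index_col='UNITID'` makes the row labels integers, while the column headers read from the CSV stay strings. The sketch below illustrates this on a tiny made-up CSV (the IDs, names, and values are placeholders, not taken from the real file):

```python
import io
import pandas as pd

# Toy stand-in for data/similarity_index.csv (IDs, names, and values are made up).
toy_csv = io.StringIO(
    "UNITID,INSTNM,1,2,3\n"
    "1,School A,1.0,0.9,0.2\n"
    "2,School B,0.9,1.0,0.4\n"
    "3,School C,0.2,0.4,1.0\n"
)
toy = pd.read_csv(toy_csv, index_col="UNITID")

# Row labels are parsed as integers; column headers remain strings.
print(toy.index.tolist())    # [1, 2, 3]
print(toy.columns.tolist())  # ['INSTNM', '1', '2', '3']

# So looking up a school's column by its integer ID needs str(...).
col = toy[str(2)]
```

This is why any lookup of a school's column by its integer `UNITID` has to convert the ID with `str(...)` first.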


Set up a dataframe for the rankings:

In [3]:
rankings = pd.DataFrame(index=cosim.index, columns=['Similar School {}'.format(i) for i in range(1,16)])
rankings.head()

UNITID,Similar School 1,Similar School 2,Similar School 3,Similar School 4,Similar School 5,Similar School 6,Similar School 7,Similar School 8,Similar School 9,Similar School 10,Similar School 11,Similar School 12,Similar School 13,Similar School 14,Similar School 15
100654,,,,,,,,,,,,,,,
100663,,,,,,,,,,,,,,,
100690,,,,,,,,,,,,,,,
100706,,,,,,,,,,,,,,,
100724,,,,,,,,,,,,,,,


For each school (row) in the rankings index, select that school's column from the similarity matrix, sort it from highest to lowest similarity, and store the IDs of the 15 schools that follow the first entry (the school itself, with similarity 1.0) in that row:

In [4]:
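# Column labels in cosim are strings, while the index holds integer UNITIDs.
for unitid in rankings.index:
    # Position 0 is assumed to be the school itself (similarity 1.0), so take positions 1-15.
    rankings.loc[unitid] = cosim[str(unitid)].sort_values(ascending=False).iloc[1:16].index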

In [5]:
rankings.head()

UNITID,Similar School 1,Similar School 2,Similar School 3,Similar School 4,Similar School 5,Similar School 6,Similar School 7,Similar School 8,Similar School 9,Similar School 10,Similar School 11,Similar School 12,Similar School 13,Similar School 14,Similar School 15
100654,199157,232937,100724,160621,227526,199999,10236803,10236801,213598,133650,100830,175856,139366,219602,140960
100663,100706,221759,234030,104151,178396,126818,230764,209551,218663,134097,139959,240444,134130,200332,170976
100690,158477,133492,175980,455770,153144,236328,169080,210401,233374,166124,100937,130590,199607,214175,114840
100706,100663,221759,234030,126818,104151,178396,230764,218663,209551,134097,139959,240444,200332,134130,170976
100724,160621,199157,232937,100654,140960,213598,10236803,10236801,139366,133650,227526,198543,219602,175856,138789
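
The ranking loop relies on each school sorting first in its own column. If two schools had identical feature vectors (similarity exactly 1.0), that might not hold, and a school could end up in its own list. A minimal sketch of a variant that removes the school by label instead is shown below. The helper `top_k_similar` and the toy matrix are hypothetical, and the helper assumes a purely numeric, square frame whose row and column labels match, so the real `cosim` would first need its `INSTNM` and `ZIP` columns dropped and its column labels converted to integers.

```python
import pandas as pd

def top_k_similar(sim: pd.DataFrame, k: int) -> pd.DataFrame:
    """Return the k most similar row IDs for each column of a square similarity frame.

    Hypothetical helper: assumes sim has identical row and column labels.
    """
    rows = {}
    for school in sim.columns:
        scores = sim[school].drop(school)  # remove the school itself by label
        # A stable sort makes tie-breaking deterministic.
        ranked = scores.sort_values(ascending=False, kind="mergesort")
        rows[school] = ranked.index[:k].tolist()
    cols = ["Similar School {}".format(i) for i in range(1, k + 1)]
    return pd.DataFrame.from_dict(rows, orient="index", columns=cols)

# Toy matrix: A and B are identical (similarity 1.0), C is different.
ids = ["A", "B", "C"]
toy_sim = pd.DataFrame(
    [[1.0, 1.0, 0.3],
     [1.0, 1.0, 0.3],
     [0.3, 0.3, 1.0]],
    index=ids, columns=ids,
)
toy_rankings = top_k_similar(toy_sim, k=2)
```

Dropping by label costs one extra operation per school, but the result no longer depends on how the sort orders tied scores.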


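Sanity check with Harvard University (UNITID 166027): the top of its sorted similarity column should match its row in the rankings: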

In [6]:
cosim[['INSTNM', '166027']].sort_values('166027', ascending=False).head(16)

UNITID,INSTNM,166027
166027,Harvard University,1.0
130794,Yale University,0.991084
198419,Duke University,0.983294
217156,Brown University,0.980329
166683,Massachusetts Institute of Technology,0.979921
195030,University of Rochester,0.955522
110404,California Institute of Technology,0.949982
243744,Stanford University,0.947793
182670,Dartmouth College,0.944073
190150,Columbia University in the City of New York,0.942211


In [7]:
rankings.loc[166027]

Similar School 1     130794
Similar School 2     198419
Similar School 3     217156
Similar School 4     166683
Similar School 5     195030
Similar School 6     110404
Similar School 7     243744
Similar School 8     182670
Similar School 9     190150
Similar School 10    215062
Similar School 11    131496
Similar School 12    144050
Similar School 13    186131
Similar School 14    147767
Similar School 15    190415
Name: 166027, dtype: int64
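
The same comparison can be done programmatically rather than by eye. The sketch below uses small made-up stand-ins for `cosim` and `rankings` (the IDs, names, and scores are placeholders) and two hypothetical helpers: one maps a row of ranked IDs to institution names, the other checks that the stored order follows non-increasing similarity.

```python
import pandas as pd

# Toy stand-ins for cosim and rankings (IDs, names, and scores are made up).
toy_cosim = pd.DataFrame(
    {"INSTNM": ["School A", "School B", "School C"],
     "1": [1.0, 0.9, 0.2],
     "2": [0.9, 1.0, 0.4],
     "3": [0.2, 0.4, 1.0]},
    index=pd.Index([1, 2, 3], name="UNITID"),
)
toy_rankings = pd.DataFrame(
    {"Similar School 1": [2, 1, 2], "Similar School 2": [3, 3, 1]},
    index=toy_cosim.index,
)

def ranked_names(school_id):
    """Map one row of ranked IDs to institution names (hypothetical helper)."""
    ids = toy_rankings.loc[school_id].tolist()
    return toy_cosim.loc[ids, "INSTNM"].tolist()

def scores_descend(school_id):
    """Check that the stored ranking follows non-increasing similarity."""
    ids = toy_rankings.loc[school_id].tolist()
    scores = toy_cosim.loc[ids, str(school_id)]
    return bool(scores.is_monotonic_decreasing)

print(ranked_names(3))  # ['School B', 'School A']
```

Running `scores_descend` over every ID in the index would extend the spot check from one school to the whole table.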

The rankings match the sorted similarity column, so save them to disk:

In [8]:
rankings.to_csv('data/similarity_rankings.csv')
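
To confirm that a saved file can be read back with the same index, a round trip along these lines could be used. This is a sketch on a made-up two-row frame written to a temporary directory, not the real `data/similarity_rankings.csv`. Dtype checking is relaxed because the in-memory frame and the reloaded one may not share dtypes.

```python
import os
import tempfile
import pandas as pd

# Toy rankings frame (IDs are made up).
toy_rankings = pd.DataFrame(
    {"Similar School 1": [2, 1], "Similar School 2": [3, 3]},
    index=pd.Index([1, 2], name="UNITID"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "similarity_rankings.csv")
    toy_rankings.to_csv(path)  # the index is written as the UNITID column
    reloaded = pd.read_csv(path, index_col="UNITID")

# Raises AssertionError if labels or values differ after the round trip.
pd.testing.assert_frame_equal(reloaded, toy_rankings, check_dtype=False)
```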