# Ranking schools by similarity

This notebook takes the similarity matrix and, for each school, finds the 15 most similar schools, then saves the results to disk.

In [1]:
import pandas as pd

Load the similarity matrix:

In [2]:
cosim = pd.read_csv('data/similarity_index.csv', index_col='UNITID')
cosim.head()

UNITID,INSTNM,ZIP,100654,100663,100690,100706,100724,100751,100812,100830,...,45891904,45891905,45891906,45891907,45896401,45896402,45897301,45897302,45897303,45897304
100654,Alabama A & M University,35762,1.0,0.645446,0.559181,0.617786,0.852245,0.535816,0.52081,0.718865,...,0.397937,0.482052,0.482052,0.482052,0.482052,0.482052,0.480283,0.480283,0.482052,0.482052
100663,University of Alabama at Birmingham,35294-0110,0.645446,1.0,0.683236,0.92712,0.555639,0.736936,0.700375,0.82531,...,0.57222,0.568836,0.568836,0.568836,0.568836,0.568836,0.568492,0.568492,0.568836,0.568836
100690,Amridge University,36117-3553,0.559181,0.683236,1.0,0.659358,0.489259,0.600211,0.603639,0.663921,...,0.501795,0.4982,0.4982,0.4982,0.4982,0.4982,0.497286,0.497286,0.4982,0.4982
100706,University of Alabama in Huntsville,35899,0.617786,0.92712,0.659358,1.0,0.525419,0.730869,0.701689,0.885037,...,0.49675,0.493836,0.493836,0.493836,0.493836,0.493836,0.49343,0.49343,0.493836,0.493836
100724,Alabama State University,36104-0271,0.852245,0.555639,0.489259,0.525419,1.0,0.438288,0.450393,0.648715,...,0.353488,0.452585,0.452585,0.452585,0.452585,0.452585,0.450919,0.450919,0.452585,0.452585


Set up a dataframe for the rankings:

In [3]:
rankings = pd.DataFrame(index=cosim.index, columns=['Similar School {}'.format(i) for i in range(1,16)])
rankings.head()

UNITID,Similar School 1,Similar School 2,Similar School 3,Similar School 4,Similar School 5,Similar School 6,Similar School 7,Similar School 8,Similar School 9,Similar School 10,Similar School 11,Similar School 12,Similar School 13,Similar School 14,Similar School 15
100654,,,,,,,,,,,,,,,
100663,,,,,,,,,,,,,,,
100690,,,,,,,,,,,,,,,
100706,,,,,,,,,,,,,,,
100724,,,,,,,,,,,,,,,


For each school (row) in the rankings dataframe, select that school's column from the similarity matrix, rank the other schools from most to least similar, and write the IDs of the top 15 into that row:

In [4]:
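for unitid in rankings.index:
    # Drop the school itself before ranking; with ties at 1.0 it need not sort first.
    scores = cosim[str(unitid)].drop(unitid)
    rankings.loc[unitid] = scores.sort_values(ascending=False).head(15).index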
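
The selection step can be tried on a small made-up matrix. The sketch below is illustrative only: the toy IDs and scores and the helper name `top_k_similar` are assumptions, not part of the notebook. It removes each school's own entry before sorting, so the result does not depend on the school sorting first when several scores tie at 1.0.

```python
import pandas as pd

# Toy 4x4 similarity matrix; IDs and scores are invented for illustration.
ids = [101, 102, 103, 104]
toy = pd.DataFrame(
    [[1.0, 0.9, 0.2, 0.5],
     [0.9, 1.0, 0.3, 0.4],
     [0.2, 0.3, 1.0, 0.8],
     [0.5, 0.4, 0.8, 1.0]],
    index=pd.Index(ids, name='UNITID'),
    columns=[str(i) for i in ids],  # column labels are strings, as when read from CSV
)

def top_k_similar(sim, unitid, k):
    """Return the k school IDs most similar to `unitid`, excluding itself."""
    scores = sim[str(unitid)].drop(unitid)
    return scores.sort_values(ascending=False).head(k).index.tolist()

# One row per school, one column per rank.
toy_rankings = pd.DataFrame(
    {unitid: top_k_similar(toy, unitid, k=2) for unitid in toy.index},
    index=['Similar School 1', 'Similar School 2'],
).T
toy_rankings.index.name = 'UNITID'
print(toy_rankings)
```

Building the frame from a dictionary of per-school lists avoids pre-allocating an empty dataframe, but the row-by-row assignment used in the notebook is just as valid at this scale.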

In [5]:
rankings.head()

UNITID,Similar School 1,Similar School 2,Similar School 3,Similar School 4,Similar School 5,Similar School 6,Similar School 7,Similar School 8,Similar School 9,Similar School 10,Similar School 11,Similar School 12,Similar School 13,Similar School 14,Similar School 15
100654,199157,232937,160621,100724,227526,199999,213598,133650,175856,130934,234155,140960,218733,163338,139719
100663,230764,100706,234030,104151,134097,139940,209551,132903,228787,137351,227216,225511,102094,142115,166708
100690,113704,230825,388520,439446,475273,198747,155636,420246,443049,444990,206154,133492,175980,236328,199272
100706,100663,230764,226833,229179,138789,100830,219602,221759,178396,240444,126818,134097,228787,218663,234030
100724,160621,199157,232937,140960,227526,100654,175856,199999,138716,159009,199102,163453,213598,133650,106412


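Sanity check against Harvard University (UNITID 166027): list its most similar schools directly from the similarity matrix, then compare with its row in the rankings.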

In [6]:
cosim[['INSTNM', '166027']].sort_values('166027', ascending=False).head(16)

UNITID,INSTNM,166027
166027,Harvard University,1.0
130794,Yale University,0.991754
198419,Duke University,0.984593
166683,Massachusetts Institute of Technology,0.981545
243744,Stanford University,0.95191
190150,Columbia University in the City of New York,0.946907
215062,University of Pennsylvania,0.946676
131496,Georgetown University,0.943269
144050,University of Chicago,0.943169
147767,Northwestern University,0.942208


In [7]:
rankings.loc[166027]

Similar School 1     130794
Similar School 2     198419
Similar School 3     166683
Similar School 4     243744
Similar School 5     190150
Similar School 6     215062
Similar School 7     131496
Similar School 8     144050
Similar School 9     147767
Similar School 10    190415
Similar School 11    217156
Similar School 12    221999
Similar School 13    179867
Similar School 14    139658
Similar School 15    162928
Name: 166027, dtype: int64
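
The rankings store only UNITID values, so reading a row means looking each ID up against the institution names. A minimal sketch of that lookup follows, using names copied from the sanity-check output above. The variables `names` and `ranked_ids` are hypothetical stand-ins; in the notebook the name table would presumably be `cosim['INSTNM']`.

```python
import pandas as pd

# Names copied from the sanity-check output; a stand-in for cosim['INSTNM'].
names = pd.Series({
    166027: 'Harvard University',
    130794: 'Yale University',
    198419: 'Duke University',
    166683: 'Massachusetts Institute of Technology',
})

# First three ranked IDs for Harvard, as in rankings.loc[166027].
ranked_ids = pd.Series(
    [130794, 198419, 166683],
    index=['Similar School 1', 'Similar School 2', 'Similar School 3'],
)

# Series.map looks each ID up in `names`; an ID missing from `names` becomes NaN.
ranked_names = ranked_ids.map(names)
print(ranked_names)
```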

Everything looks good! The ranked IDs match the direct lookup above. Save to disk:

In [8]:
rankings.to_csv('data/similarity_rankings.csv')
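
To confirm that a saved rankings file reads back unchanged, a round trip can be checked on a small frame. This is a sketch under assumptions: it writes to an in-memory buffer rather than `data/similarity_rankings.csv`, and the two-row frame `toy` only reuses a few IDs from the rows shown earlier.

```python
import io
import pandas as pd

# Small rankings-style frame; the IDs are taken from the first rows shown earlier.
toy = pd.DataFrame(
    {'Similar School 1': [199157, 230764], 'Similar School 2': [232937, 100706]},
    index=pd.Index([100654, 100663], name='UNITID'),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Restore UNITID as the index when reading back.
restored = pd.read_csv(buffer, index_col='UNITID')
print(restored.equals(toy))
```

Passing `index_col='UNITID'` matters here: without it, UNITID comes back as an ordinary column and the frame no longer matches the original.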