# Analyze Tournament Site Location Data

## Import data and packages

In [2]:
# import python packages
import pandas as pd

# import school distances dataset
data = pd.read_csv('../data/cleaned/distances-schools.csv')
data.head()

Unnamed: 0,seed,school_common_name,site,year,id,school_full_name,team,city,state,type,conference,address,lng,lat,geometry,distance
0,1.0,Duke,"Columbia, SC",2019.0,20190,Duke University,Blue Devils,Durham,North Carolina,Private/Non-sectarian,Atlantic Coast Conference,Duke University Durham North Carolina,-78.94423,36.000156,POINT (-78.94422972195878 36.00015569999999),179.765685
1,1.0,Gonzaga,"Salt Lake City, UT",2019.0,20191,Gonzaga University,Bulldogs,Spokane,Washington,Private/Catholic,West Coast Conference,Gonzaga University Spokane Washington,-117.403044,47.666739,POINT (-117.4030438539681 47.66673855000001),549.380264
2,1.0,North Carolina,"Columbus, OH",2019.0,20192,University of North Carolina at Chapel Hill,Tar Heels,Chapel Hill,North Carolina,State,Atlantic Coast Conference,University of North Carolina at Chapel Hill Ch...,-79.047753,35.905035,POINT (-79.04775326525106 35.90503535),352.052893
3,1.0,Virginia,"Columbia, SC",2019.0,20193,University of Virginia,Cavaliers,Charlottesville,Virginia,State,Atlantic Coast Conference,University of Virginia Charlottesville Virginia,-78.5055,38.041058,POINT (-78.50549960183569 38.0410576),296.237023
4,2.0,Michigan State,"Des Moines, IA",2019.0,20194,Michigan State University,Spartans,East Lansing,Michigan,State,Big Ten Conference,Michigan State University East Lansing Michigan,-84.477916,42.718568,POINT (-84.47791570930522 42.71856800000001),571.484627


## Find weighted distance

In theory, higher-seeded teams should play at sites closer to campus, and geographic preference should diminish further down the list of the top 16 teams. To compare higher and lower seeds on the same scale, each distance is multiplied by a seed-based weight: 1 seeds are weighted 1, 2 seeds 0.75, 3 seeds 0.5, and 4 seeds 0.25.

In [4]:
# dictionary of weights - seeds are keys, weights are values
weights = {1: 1, 2: 0.75, 3: 0.5, 4: 0.25}

# loop through distances and apply weights based on the associated seed value 
weightedDistance = [dist * weights[data.seed[i]] for i, dist in enumerate(data.distance)]

# add weighted distance column to dataframe
data['weightedDist'] = weightedDistance
data.tail()

Unnamed: 0,seed,school_common_name,site,year,id,school_full_name,team,city,state,type,conference,address,lng,lat,geometry,distance,weightedDist
555,3.0,NC State,"Albuquerque, NM",1985.0,1985555,North Carolina State University,Wolfpack,Raleigh,North Carolina,State,Atlantic Coast Conference,North Carolina State University Raleigh North ...,-78.674087,35.77185,POINT (-78.67408695452633 35.77184965),1739.300819,869.65041
556,4.0,Loyola–Chicago,"Hartford, CT",1985.0,1985556,Loyola University Chicago,Ramblers,Chicago,Illinois,Private/Catholic,Missouri Valley Conference,Loyola University Chicago Chicago Illinois,-87.668422,41.944842,POINT (-87.66842176669064 41.94484179999999),930.929259,232.732315
557,4.0,Ohio State,"Tulsa, OK",1985.0,1985557,The Ohio State University,Buckeyes,Columbus,Ohio,State,Big Ten Conference,The Ohio State University Columbus Ohio,-83.028663,40.005709,POINT (-83.02866259769122 40.00570905),840.512615,210.128154
558,4.0,LSU,"Dayton, OH",1985.0,1985558,Louisiana State University,Tigers,Baton Rouge,Louisiana,State,Southeastern Conference,Louisiana State University Baton Rouge Louisiana,-91.185968,30.405709,POINT (-91.18596767189877 30.40570885),725.868192,181.467048
559,4.0,UNLV,"Salt Lake City, UT",1985.0,1985559,"University of Nevada, Las Vegas",Rebels,Paradise,Nevada,State,Mountain West Conference,UNLV Paradise Nevada,-115.141832,36.107155,POINT (-115.1418318610852 36.1071554),352.89541,88.223853
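
As a cross-check on the weighting step, here is a minimal vectorized sketch using `Series.map`. The `toy` dataframe is illustrative only: its four distances are copied from rows shown above, and the cast to `int` is an assumption made because the `seed` column is stored as floats.

```python
import pandas as pd

# Seed-to-weight lookup, as defined in the notebook
weights = {1: 1, 2: 0.75, 3: 0.5, 4: 0.25}

# Four (seed, distance) pairs copied from rows shown above (toy subset)
toy = pd.DataFrame({
    'seed': [1.0, 2.0, 3.0, 4.0],
    'distance': [179.765685, 571.484627, 1739.300819, 930.929259],
})

# Seeds are stored as floats, so cast to int before the dictionary lookup,
# then scale each distance by its seed's weight
toy['weightedDist'] = toy['distance'] * toy['seed'].astype(int).map(weights)
print(toy)
```

The third and fourth rows should agree with the 869.65041 and 232.732315 values in the `weightedDist` column above.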


In [3]:
# data.loc[data.school_common_name == 'Kentucky']

## Aggregate distances by school

The distance dataframe is useful as is, but aggregating it at different levels is more revealing, starting with the school level. The pandas `describe` method quickly calculates the count, mean, standard deviation, minimum, maximum, and quartiles for each school.

In [4]:
# group dataframe by schools and apply describe method
schoolsWeighted = data.groupby('school_common_name').describe()

# preserve the weighted distance aggregation only
schoolsWtDistAgg = schoolsWeighted.weightedDist
schoolsWtDistAgg.head()

school_common_name,count,mean,std,min,25%,50%,75%,max
Alabama,3.0,112.809325,120.297037,37.254071,43.44792,49.641768,150.586951,251.532134
Arizona,20.0,452.091798,252.612485,0.387824,339.977205,399.902145,620.471038,828.782889
Arkansas,7.0,277.658557,168.427111,105.722508,194.128868,216.471203,304.079252,624.999944
Auburn,2.0,469.969893,31.19733,447.910049,458.939971,469.969893,480.999815,492.029737
BYU,2.0,321.585878,161.017824,207.729083,264.65748,321.585878,378.514275,435.442673


In [5]:
# preserve the distance aggregation only
schoolsDistAgg = schoolsWeighted.distance
schoolsDistAgg.head()

school_common_name,count,mean,std,min,25%,50%,75%,max
Alabama,3.0,194.538449,142.89464,49.672095,124.119584,198.567073,266.971626,335.376179
Arizona,20.0,723.99961,442.918415,0.387824,463.82781,648.429761,859.110202,1655.073496
Arkansas,7.0,548.434711,215.707104,210.878249,422.890031,581.981962,667.258947,865.884814
Auburn,2.0,1208.014498,1074.95002,447.910049,827.962274,1208.014498,1588.066722,1968.118947
BYU,2.0,1078.614428,937.844581,415.458165,747.036297,1078.614428,1410.19256,1741.770691
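
To make the shape of the grouped `describe` result concrete, here is a small sketch on made-up data (the schools `A`, `B`, `C` and their distances are hypothetical, not taken from the dataset). It shows that the result has two-level columns, so selecting `distance` returns the eight statistics for that column, and that a school with a single appearance gets `NaN` for `std`.

```python
import pandas as pd

# Hypothetical toy data: three schools with 3, 2, and 1 appearances
toy = pd.DataFrame({
    'school_common_name': ['A', 'A', 'A', 'B', 'B', 'C'],
    'distance': [100.0, 200.0, 300.0, 50.0, 150.0, 80.0],
    'weightedDist': [100.0, 150.0, 150.0, 25.0, 37.5, 80.0],
})

# describe() on a groupby yields columns keyed by (source column, statistic)
stats = toy.groupby('school_common_name').describe()

# Selecting the top-level key leaves one column per statistic
dist_stats = stats['distance']
print(dist_stats)
```

The sample standard deviation is undefined for one observation, which is likely why single-appearance schools show an empty `std` in the tables.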


## Add tags to column names to differentiate between weighted and unweighted statistics

Because the weighted and unweighted aggregations share the same `describe()` column names, a tag is appended to each column name to tell the two apart.

In [6]:
# concatenate `_wtDist` to each column and apply to weighted distance dataframe
schoolsWtDistAgg.columns = [col + '_wtDist' for col in schoolsWtDistAgg.columns]
display(schoolsWtDistAgg.head())

# concatenate `_dist` to each column and apply to UNweighted distance dataframe
schoolsDistAgg.columns = [col + '_dist' for col in schoolsDistAgg.columns]
schoolsDistAgg.head()

school_common_name,count_wtDist,mean_wtDist,std_wtDist,min_wtDist,25%_wtDist,50%_wtDist,75%_wtDist,max_wtDist
Alabama,3.0,112.809325,120.297037,37.254071,43.44792,49.641768,150.586951,251.532134
Arizona,20.0,452.091798,252.612485,0.387824,339.977205,399.902145,620.471038,828.782889
Arkansas,7.0,277.658557,168.427111,105.722508,194.128868,216.471203,304.079252,624.999944
Auburn,2.0,469.969893,31.19733,447.910049,458.939971,469.969893,480.999815,492.029737
BYU,2.0,321.585878,161.017824,207.729083,264.65748,321.585878,378.514275,435.442673


school_common_name,count_dist,mean_dist,std_dist,min_dist,25%_dist,50%_dist,75%_dist,max_dist
Alabama,3.0,194.538449,142.89464,49.672095,124.119584,198.567073,266.971626,335.376179
Arizona,20.0,723.99961,442.918415,0.387824,463.82781,648.429761,859.110202,1655.073496
Arkansas,7.0,548.434711,215.707104,210.878249,422.890031,581.981962,667.258947,865.884814
Auburn,2.0,1208.014498,1074.95002,447.910049,827.962274,1208.014498,1588.066722,1968.118947
BYU,2.0,1078.614428,937.844581,415.458165,747.036297,1078.614428,1410.19256,1741.770691
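
An alternative way to tag the columns, sketched here under the assumption that a new dataframe is acceptable, is the built-in `DataFrame.add_suffix`. The two-row `stats` frame below is a toy subset using the Alabama and Arizona values shown above.

```python
import pandas as pd

# Toy subset of the unweighted aggregation (values copied from the output above)
stats = pd.DataFrame(
    {'count': [3.0, 20.0], 'mean': [194.538449, 723.99961]},
    index=pd.Index(['Alabama', 'Arizona'], name='school_common_name'),
)

# add_suffix returns a renamed copy and leaves the original untouched
tagged = stats.add_suffix('_dist')
print(tagged.columns.tolist())
```

Because `add_suffix` returns a copy, the original column names remain available for later steps.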


## Append distance and weighted distance dataframes

Because both aggregations will be merged with the school information from `data`, they are first combined into a single dataframe by merging on their matching index keys (the school names).

In [7]:
# merge the two dataframes into one, matching rows on the school name index
aggs = pd.merge(schoolsDistAgg, schoolsWtDistAgg, left_on=schoolsDistAgg.index.get_level_values('school_common_name'), right_on=schoolsWtDistAgg.index.get_level_values('school_common_name'))
aggs.head()

Unnamed: 0,key_0,count_dist,mean_dist,std_dist,min_dist,25%_dist,50%_dist,75%_dist,max_dist,count_wtDist,mean_wtDist,std_wtDist,min_wtDist,25%_wtDist,50%_wtDist,75%_wtDist,max_wtDist
0,Alabama,3.0,194.538449,142.89464,49.672095,124.119584,198.567073,266.971626,335.376179,3.0,112.809325,120.297037,37.254071,43.44792,49.641768,150.586951,251.532134
1,Arizona,20.0,723.99961,442.918415,0.387824,463.82781,648.429761,859.110202,1655.073496,20.0,452.091798,252.612485,0.387824,339.977205,399.902145,620.471038,828.782889
2,Arkansas,7.0,548.434711,215.707104,210.878249,422.890031,581.981962,667.258947,865.884814,7.0,277.658557,168.427111,105.722508,194.128868,216.471203,304.079252,624.999944
3,Auburn,2.0,1208.014498,1074.95002,447.910049,827.962274,1208.014498,1588.066722,1968.118947,2.0,469.969893,31.19733,447.910049,458.939971,469.969893,480.999815,492.029737
4,BYU,2.0,1078.614428,937.844581,415.458165,747.036297,1078.614428,1410.19256,1741.770691,2.0,321.585878,161.017824,207.729083,264.65748,321.585878,378.514275,435.442673
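
For comparison, an index-to-index combination can also be sketched with `DataFrame.join`, which matches on the index by default. This is an alternative, not the notebook's method: the key column comes out named `school_common_name` rather than `key_0`, so later cells that refer to `key_0` would need adjusting. The frames below are toy subsets of the values shown above.

```python
import pandas as pd

idx = pd.Index(['Alabama', 'Arizona'], name='school_common_name')

# Toy subsets of the two aggregations (values copied from the output above)
dist = pd.DataFrame({'mean_dist': [194.538449, 723.99961]}, index=idx)
wt = pd.DataFrame({'mean_wtDist': [112.809325, 452.091798]}, index=idx)

# join aligns rows on the shared index; reset_index turns it into a column
combined = dist.join(wt).reset_index()
print(combined)
```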


## Prep original `data` dataframe for merging

The weighted and unweighted distance aggregations need to be merged with the main school dataframe. Because the aggregations are at the school level, only the generic school information columns are needed: the `year`, `seed`, `site`, `id`, `geometry`, `distance`, and `weightedDist` columns can be dropped from `data`. Furthermore, only the first row for each school needs to be kept.

In [8]:
# drop all duplicate school names, drop columns that change depending on the year
schoolsTrimmed = data.drop_duplicates(['school_common_name']).drop(['year', 'seed', 'site', 'id', 'geometry', 'distance', 'weightedDist'], axis=1)
schoolsTrimmed.head()

Unnamed: 0,school_common_name,school_full_name,team,city,state,type,conference,address,lng,lat
0,Duke,Duke University,Blue Devils,Durham,North Carolina,Private/Non-sectarian,Atlantic Coast Conference,Duke University Durham North Carolina,-78.94423,36.000156
1,Gonzaga,Gonzaga University,Bulldogs,Spokane,Washington,Private/Catholic,West Coast Conference,Gonzaga University Spokane Washington,-117.403044,47.666739
2,North Carolina,University of North Carolina at Chapel Hill,Tar Heels,Chapel Hill,North Carolina,State,Atlantic Coast Conference,University of North Carolina at Chapel Hill Ch...,-79.047753,35.905035
3,Virginia,University of Virginia,Cavaliers,Charlottesville,Virginia,State,Atlantic Coast Conference,University of Virginia Charlottesville Virginia,-78.5055,38.041058
4,Michigan State,Michigan State University,Spartans,East Lansing,Michigan,State,Big Ten Conference,Michigan State University East Lansing Michigan,-84.477916,42.718568
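
The trimming step relies on `drop_duplicates` keeping the first row it encounters for each key. A minimal sketch on hypothetical rows (the schools, years, and cities below are made up for illustration):

```python
import pandas as pd

# Hypothetical rows: 'School A' appears in two different years
rows = pd.DataFrame({
    'school_common_name': ['School A', 'School B', 'School A'],
    'year': [2019.0, 2019.0, 2018.0],
    'city': ['City A', 'City B', 'City A'],
})

# Keep the first row per school, then drop the column that varies by year
unique = rows.drop_duplicates(['school_common_name']).drop(['year'], axis=1)
print(unique)
```

The surviving rows keep their original index labels, which is why the trimmed dataframe's index can have gaps.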


In [9]:
# merge school information with the weighted and unweighted distance aggregations
schools = pd.merge(schoolsTrimmed, aggs, how='left', left_on='school_common_name', right_on='key_0')
display(schools.head())

# sort by mean weighted distance so that higher means (and thus bigger proportional circles) are plotted first
schoolsSorted = schools.sort_values(by=['mean_wtDist'], ascending=False)
schoolsSorted.head()

Unnamed: 0,school_common_name,school_full_name,team,city,state,type,conference,address,lng,lat,...,75%_dist,max_dist,count_wtDist,mean_wtDist,std_wtDist,min_wtDist,25%_wtDist,50%_wtDist,75%_wtDist,max_wtDist
0,Duke,Duke University,Blue Devils,Durham,North Carolina,Private/Non-sectarian,Atlantic Coast Conference,Duke University Durham North Carolina,-78.94423,36.000156,...,367.019606,2068.651965,31.0,208.909988,231.93662,6.594143,52.879953,127.42266,240.599142,1034.325982
1,Gonzaga,Gonzaga University,Bulldogs,Spokane,Washington,Private/Catholic,West Coast Conference,Gonzaga University Spokane Washington,-117.403044,47.666739,...,549.380264,1039.667928,9.0,339.566741,203.415695,65.642776,229.680863,274.690132,549.380264,549.380264
2,North Carolina,University of North Carolina at Chapel Hill,Tar Heels,Chapel Hill,North Carolina,State,Atlantic Coast Conference,University of North Carolina at Chapel Hill Ch...,-79.047753,35.905035,...,476.946658,2786.543099,26.0,331.367343,407.281304,26.55306,78.860877,159.503568,410.998203,1547.355379
3,Virginia,University of Virginia,Cavaliers,Charlottesville,Virginia,State,Atlantic Coast Conference,University of Virginia Charlottesville Virginia,-78.5055,38.041058,...,279.018491,369.170067,6.0,177.984456,72.930227,92.292517,140.746063,155.634117,213.152715,296.237023
4,Michigan State,Michigan State University,Spartans,East Lansing,Michigan,State,Big Ten Conference,Michigan State University East Lansing Michigan,-84.477916,42.718568,...,621.912538,2070.326087,12.0,294.073988,176.723598,38.663768,187.515849,271.86996,423.238462,584.228041


Unnamed: 0,school_common_name,school_full_name,team,city,state,type,conference,address,lng,lat,...,75%_dist,max_dist,count_wtDist,mean_wtDist,std_wtDist,min_wtDist,25%_wtDist,50%_wtDist,75%_wtDist,max_wtDist
73,St. John's,St. John's University,Red Storm,Jamaica,New York,Private/Catholic,Big East Conference,St. John's University Jamaica New York,-73.990073,40.729944,...,2358.641641,2780.99965,5.0,1501.481643,1157.579618,159.223749,442.906099,1768.981231,2355.297484,2780.99965
85,VCU,Virginia Commonwealth University,Rams,Richmond,Virginia,State,Atlantic 10 Conference,Virginia Commonwealth University Richmond Virg...,-77.453064,37.548215,...,1821.113346,1821.113346,1.0,1365.83501,,1365.83501,1365.83501,1365.83501,1365.83501,1365.83501
55,Stanford,Stanford University,Cardinal,Palo Alto,California,Private/Non-Sectarian,Pac-12 Conference,Stanford University Palo Alto California,-122.169365,37.431314,...,2176.374,2635.125228,8.0,811.707855,696.429024,173.194292,364.331846,553.266875,1141.144455,2211.465583
70,USC,University of Southern California,Trojans,Los Angeles,California,Private/Non-Sectarian,Pac-12 Conference,University of Southern California Los Angeles ...,-118.285867,34.021883,...,1562.989869,1968.481379,2.0,781.494935,982.689063,86.628835,434.061885,781.494935,1128.927985,1476.361035
79,Seton Hall,Seton Hall University,Pirates,South Orange,New Jersey,Private/Catholic,Big East Conference,Seton Hall University South Orange New Jersey,-74.246858,40.743372,...,2340.323155,2343.279641,4.0,778.182929,505.062064,112.649327,522.242636,914.221284,1170.161577,1171.63982


In [11]:
# Find overall mean for mapping purposes
print(schoolsSorted['mean_wtDist'].mean())
schoolsSorted['mean_dist'].mean()

345.88513858848495


724.6294584016067

## Write to CSV

In [62]:
schoolsSorted.to_csv('../data/cleaned/schools-wtAvg.csv', index=False)

## Group data by school and seed

To provide a view of each school's seeding history, the data is grouped by both school name and seed. The `describe` method then calculates summary statistics for each school-seed pair, and the weighted and unweighted distances are joined together in one dataset.

In [12]:
# group by school and seed, calculate averages with describe method
seeds = data.groupby(['school_common_name', 'seed']).describe()
seedsDist = seeds.distance
seedsWtDist = seeds.weightedDist

display(seedsDist.head())
display(seedsWtDist.head())

school_common_name,seed,count,mean,std,min,25%,50%,75%,max
Alabama,2.0,2.0,192.524137,202.023295,49.672095,121.098116,192.524137,263.950158,335.376179
Alabama,4.0,1.0,198.567073,,198.567073,198.567073,198.567073,198.567073,198.567073
Arizona,1.0,6.0,487.863183,286.797591,0.387824,405.640533,499.402953,706.043212,778.675678
Arizona,2.0,7.0,718.174996,279.259933,458.502104,533.20286,533.20286,932.035218,1105.043853
Arizona,3.0,4.0,712.841103,689.426122,95.472823,265.475016,550.409046,997.775133,1655.073496


school_common_name,seed,count,mean,std,min,25%,50%,75%,max
Alabama,2.0,2.0,144.393103,151.517471,37.254071,90.823587,144.393103,197.962618,251.532134
Alabama,4.0,1.0,49.641768,,49.641768,49.641768,49.641768,49.641768,49.641768
Arizona,1.0,6.0,487.863183,286.797591,0.387824,405.640533,499.402953,706.043212,778.675678
Arizona,2.0,7.0,538.631247,209.44495,343.876578,399.902145,399.902145,699.026413,828.782889
Arizona,3.0,4.0,356.420551,344.713061,47.736411,132.737508,275.204523,498.887566,827.536748


### Join weighted and unweighted dataframes together

The weighted and unweighted distance dataframes should be joined together. Because the dataframes need to be joined by two matching columns (`school_common_name` and `seed`), it's easier to simply create new columns in the `seedsDist` dataframe with the columns from the `seedsWtDist` dataframe.

In [13]:
# silence pandas' SettingWithCopyWarning for the column assignments below
pd.options.mode.chained_assignment = None  # default='warn'

seedsDist['mean_wtDist'] = seedsWtDist['mean']
seedsDist['std_wtDist'] = seedsWtDist['std']
seedsDist['min_wtDist'] = seedsWtDist['min']
seedsDist['25%_wtDist'] = seedsWtDist['25%']
seedsDist['50%_wtDist'] = seedsWtDist['50%']
seedsDist['75%_wtDist'] = seedsWtDist['75%']
seedsDist['max_wtDist'] = seedsWtDist['max']
seedsDist

school_common_name,seed,count,mean,std,min,25%,50%,75%,max,mean_wtDist,std_wtDist,min_wtDist,25%_wtDist,50%_wtDist,75%_wtDist,max_wtDist
Alabama,2.0,2.0,192.524137,202.023295,49.672095,121.098116,192.524137,263.950158,335.376179,144.393103,151.517471,37.254071,90.823587,144.393103,197.962618,251.532134
Alabama,4.0,1.0,198.567073,,198.567073,198.567073,198.567073,198.567073,198.567073,49.641768,,49.641768,49.641768,49.641768,49.641768,49.641768
Arizona,1.0,6.0,487.863183,286.797591,0.387824,405.640533,499.402953,706.043212,778.675678,487.863183,286.797591,0.387824,405.640533,499.402953,706.043212,778.675678
Arizona,2.0,7.0,718.174996,279.259933,458.502104,533.202860,533.202860,932.035218,1105.043853,538.631247,209.444950,343.876578,399.902145,399.902145,699.026413,828.782889
Arizona,3.0,4.0,712.841103,689.426122,95.472823,265.475016,550.409046,997.775133,1655.073496,356.420551,344.713061,47.736411,132.737508,275.204523,498.887566,827.536748
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wisconsin,4.0,3.0,1202.786541,287.506888,926.712988,1053.928069,1181.143151,1340.823317,1500.503484,300.696635,71.876722,231.678247,263.482017,295.285788,335.205829,375.125871
Xavier,1.0,1.0,234.536763,,234.536763,234.536763,234.536763,234.536763,234.536763,234.536763,,234.536763,234.536763,234.536763,234.536763,234.536763
Xavier,2.0,1.0,359.960356,,359.960356,359.960356,359.960356,359.960356,359.960356,269.970267,,269.970267,269.970267,269.970267,269.970267,269.970267
Xavier,3.0,2.0,348.427564,161.065915,234.536763,291.482163,348.427564,405.372964,462.318364,174.213782,80.532957,117.268382,145.741082,174.213782,202.686482,231.159182
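
The column-by-column assignment works because both frames share the same two-level index. As a hedged alternative sketch, the same alignment can be done in one step with `add_suffix` and `join`. The frames below are toy subsets holding only the Alabama `mean` and `max` values shown above.

```python
import pandas as pd

# Two-level index of (school, seed), matching the grouped output above
idx = pd.MultiIndex.from_tuples(
    [('Alabama', 2.0), ('Alabama', 4.0)],
    names=['school_common_name', 'seed'],
)

# Toy subsets of the unweighted and weighted statistics
dist = pd.DataFrame({'mean': [192.524137, 198.567073],
                     'max': [335.376179, 198.567073]}, index=idx)
wt = pd.DataFrame({'mean': [144.393103, 49.641768],
                   'max': [251.532134, 49.641768]}, index=idx)

# Suffix the weighted columns, then align on the shared (school, seed) index
joined = dist.join(wt.add_suffix('_wtDist'))
print(joined)
```

Unlike in-place assignment, this returns a new dataframe, so no chained-assignment warning needs to be silenced.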


## Calculate overall seed averages

As a point of reference, the weighted and unweighted statistics for the overall seeds (independent of schools) should be calculated.

In [14]:
# group data by seed
seedsAll = data.groupby('seed').describe()

# merge weighted and unweighted averages on 'seed', add custom suffixes to differentiate between the two
seedsMerged = pd.merge(seedsAll.distance, seedsAll.weightedDist, 
               left_on=seedsAll.distance.index.get_level_values('seed'), 
               right_on=seedsAll.weightedDist.index.get_level_values('seed'),
               suffixes=('_dist', '_wtDist'))

# copy the `key_0` column into a new `seed` column for clarity
seedsMerged['seed'] = seedsMerged['key_0']
seedsMerged

Unnamed: 0,key_0,count_dist,mean_dist,std_dist,min_dist,25%_dist,50%_dist,75%_dist,max_dist,count_wtDist,mean_wtDist,std_wtDist,min_wtDist,25%_wtDist,50%_wtDist,75%_wtDist,max_wtDist,seed
0,1.0,140.0,376.468191,450.363472,0.387824,127.42266,243.041596,451.63311,2780.99965,140.0,376.468191,450.363472,0.387824,127.42266,243.041596,451.63311,2780.99965,1.0
1,2.0,140.0,513.247102,536.864699,1.068547,149.279603,336.85344,616.545553,2464.06052,140.0,384.935326,402.648525,0.80141,111.959702,252.64008,462.409165,1848.04539,2.0
2,3.0,140.0,742.758208,628.182745,1.725003,344.374113,520.265933,901.080894,2786.543099,140.0,371.379104,314.091373,0.862502,172.187056,260.132966,450.540447,1393.27155,3.0
3,4.0,140.0,922.672701,769.356046,38.373249,351.300392,660.338677,1453.80607,2865.82939,140.0,230.668175,192.339012,9.593312,87.825098,165.084669,363.451518,716.457347,4.0
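
Because every team with a given seed shares one weight, each seed's weighted mean should equal its unweighted mean times that weight. A quick sanity check using the `mean_dist` and `mean_wtDist` values from the table above (the tolerance is an assumption to absorb display rounding):

```python
import math

weights = {1: 1, 2: 0.75, 3: 0.5, 4: 0.25}

# Means copied from the seedsMerged output above
mean_dist = {1: 376.468191, 2: 513.247102, 3: 742.758208, 4: 922.672701}
mean_wt = {1: 376.468191, 2: 384.935326, 3: 371.379104, 4: 230.668175}

# Scale each unweighted mean by its seed weight and compare to the table
scaled = {seed: mean_dist[seed] * w for seed, w in weights.items()}
matches = {seed: math.isclose(scaled[seed], mean_wt[seed], abs_tol=1e-5)
           for seed in weights}
print(matches)
```

The same scaling holds for the other statistics in each row, such as `std` and the quartiles, since multiplying by a positive constant scales them all alike.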


## Write to CSV

In [107]:
# leave in indexes to preserve school name and seed
# this is pulled into QGIS and saved as geojson and converted to .json for proper formatting
seedsDist.to_csv('../data/cleaned/seeds-by-school.csv')
seedsMerged.to_csv('../data/cleaned/seeds-overall.csv', index=False)

## Aggregate distances by conference

In [11]:
# calculate statistics at conference level
conf = data.groupby('conference').describe()

# merge weighted distance and unweighted distance in single dataframe
confAll = pd.merge(conf.distance, conf.weightedDist, 
             left_on=conf.distance.index.get_level_values('conference'), 
             right_on=conf.weightedDist.index.get_level_values('conference'),
             suffixes=('_dist', '_wtDist'))

confAll.head()

Unnamed: 0,key_0,count_dist,mean_dist,std_dist,min_dist,25%_dist,50%_dist,75%_dist,max_dist,count_wtDist,mean_wtDist,std_wtDist,min_wtDist,25%_wtDist,50%_wtDist,75%_wtDist,max_wtDist
0,American Athletic Conference,35.0,544.058284,584.52552,27.304459,214.57164,348.141829,630.326868,2464.06052,35.0,390.199946,475.859494,10.968326,162.115023,261.106372,430.366276,2038.307527
1,Atlantic 10 Conference,11.0,969.521764,958.868496,45.877546,132.428889,288.715757,1895.191486,2278.013061,11.0,431.138401,464.632374,22.938773,68.0805,288.715757,550.229958,1365.83501
2,Atlantic Coast Conference,139.0,600.842046,703.710799,1.068547,127.42266,345.194626,658.086142,2865.82939,139.0,302.426192,313.246211,0.80141,87.444525,193.69554,401.404772,1547.355379
3,Big 12 Conference,78.0,544.761848,402.366177,17.777579,219.233497,454.134239,773.462956,1631.984883,78.0,299.547673,225.203496,10.635634,149.931525,226.33828,406.101337,1054.014359
4,Big East Conference,44.0,700.308612,803.60833,12.131191,234.536763,375.350575,765.596346,2863.66665,44.0,418.092775,584.856203,6.065595,114.869779,221.48139,397.323866,2780.99965


## Write to CSV

In [14]:
confAll.to_csv('../data/cleaned/conference-agg.csv')

# Playground - Work In Progress Below

In [7]:
tmp = data.groupby('conference').describe()
dist = tmp.distance
wtDist = tmp.weightedDist
wtDist

conference,count,mean,std,min,25%,50%,75%,max
American Athletic Conference,35.0,390.199946,475.859494,10.968326,162.115023,261.106372,430.366276,2038.307527
Atlantic 10 Conference,11.0,431.138401,464.632374,22.938773,68.0805,288.715757,550.229958,1365.83501
Atlantic Coast Conference,139.0,302.426192,313.246211,0.80141,87.444525,193.69554,401.404772,1547.355379
Big 12 Conference,78.0,299.547673,225.203496,10.635634,149.931525,226.33828,406.101337,1054.014359
Big East Conference,44.0,418.092775,584.856203,6.065595,114.869779,221.48139,397.323866,2780.99965
Big Ten Conference,94.0,332.949736,342.701517,8.24911,101.621833,213.241358,445.067223,1806.430803
Missouri Valley Conference,2.0,167.684644,91.991297,102.636974,135.160809,167.684644,200.208479,232.732315
Mountain West Conference,13.0,293.173881,212.392758,57.787579,117.833624,285.453367,352.89541,844.102336
Pac-12 Conference,59.0,459.938865,427.517174,0.387824,174.19145,389.337839,587.074272,2211.465583
Southeastern Conference,74.0,302.783505,277.098408,22.679792,123.202285,212.875135,413.5866,1710.25413


In [8]:
m = pd.merge(tmp.distance, tmp.weightedDist, 
               left_on=tmp.distance.index.get_level_values('conference'), 
               right_on=tmp.weightedDist.index.get_level_values('conference'),
               suffixes=('_dist', '_wtDist'))
m

Unnamed: 0,key_0,count_dist,mean_dist,std_dist,min_dist,25%_dist,50%_dist,75%_dist,max_dist,count_wtDist,mean_wtDist,std_wtDist,min_wtDist,25%_wtDist,50%_wtDist,75%_wtDist,max_wtDist
0,American Athletic Conference,35.0,544.058284,584.52552,27.304459,214.57164,348.141829,630.326868,2464.06052,35.0,390.199946,475.859494,10.968326,162.115023,261.106372,430.366276,2038.307527
1,Atlantic 10 Conference,11.0,969.521764,958.868496,45.877546,132.428889,288.715757,1895.191486,2278.013061,11.0,431.138401,464.632374,22.938773,68.0805,288.715757,550.229958,1365.83501
2,Atlantic Coast Conference,139.0,600.842046,703.710799,1.068547,127.42266,345.194626,658.086142,2865.82939,139.0,302.426192,313.246211,0.80141,87.444525,193.69554,401.404772,1547.355379
3,Big 12 Conference,78.0,544.761848,402.366177,17.777579,219.233497,454.134239,773.462956,1631.984883,78.0,299.547673,225.203496,10.635634,149.931525,226.33828,406.101337,1054.014359
4,Big East Conference,44.0,700.308612,803.60833,12.131191,234.536763,375.350575,765.596346,2863.66665,44.0,418.092775,584.856203,6.065595,114.869779,221.48139,397.323866,2780.99965
5,Big Ten Conference,94.0,697.795391,704.472325,8.24911,151.786305,423.744431,915.353168,2768.812156,94.0,332.949736,342.701517,8.24911,101.621833,213.241358,445.067223,1806.430803
6,Missouri Valley Conference,2.0,670.738577,367.96519,410.547896,540.643237,670.738577,800.833918,930.929259,2.0,167.684644,91.991297,102.636974,135.160809,167.684644,200.208479,232.732315
7,Mountain West Conference,13.0,628.90154,466.985908,231.150318,352.89541,380.60449,925.244249,1688.204671,13.0,293.173881,212.392758,57.787579,117.833624,285.453367,352.89541,844.102336
8,Pac-12 Conference,59.0,811.178417,692.094786,0.387824,369.559704,533.457641,965.115478,2687.468752,59.0,459.938865,427.517174,0.387824,174.19145,389.337839,587.074272,2211.465583
9,Southeastern Conference,74.0,562.148034,481.868779,49.672095,233.860625,476.600342,645.48809,2181.977569,74.0,302.783505,277.098408,22.679792,123.202285,212.875135,413.5866,1710.25413


In [18]:
display(data.weightedDist.describe())
display(data.loc[data.weightedDist > 2700])

count     560.000000
mean      340.862699
std       358.548307
min         0.387824
25%       114.888579
50%       228.521879
75%       435.534435
max      2780.999650
Name: weightedDist, dtype: float64

Unnamed: 0,seed,school_common_name,site,year,id,school_full_name,team,city,state,type,conference,address,lng,lat,geometry,distance,weightedDist
531,1.0,St. John's,"Long Beach, California",1986.0,1986531,St. John's University,Red Storm,Jamaica,New York,Private/Catholic,Big East Conference,St. John's University Jamaica New York,-73.990073,40.729944,POINT (-73.99007259999999 40.72994420000001),2780.99965,2780.99965


In [19]:
display(data.distance.describe())
# display(data.loc[data.distance < 1])
display(data.loc[data.distance > 2865])

count     560.000000
mean      638.786551
std       641.469182
min         0.387824
25%       211.343978
50%       413.003031
75%       801.004004
max      2865.829390
Name: distance, dtype: float64

Unnamed: 0,seed,school_common_name,site,year,id,school_full_name,team,city,state,type,conference,address,lng,lat,geometry,distance,weightedDist
108,4.0,Syracuse,"San Jose, CA",2013.0,2013108,Syracuse University,Orange,Syracuse,New York,Private/Methodist,Atlantic Coast Conference,Syracuse University Syracuse New York,-76.133309,43.038306,POINT (-76.13330882751831 43.03830645),2865.82939,716.457347


In [19]:
# data.distance.plot()
data.weightedDist.describe()

count     560.000000
mean      340.862699
std       358.548307
min         0.387824
25%       114.888579
50%       228.521879
75%       435.534435
max      2780.999650
Name: weightedDist, dtype: float64

In [50]:
import matplotlib
%matplotlib inline

seed = data.groupby('seed').describe()
seed.distance
data.groupby('seed').distance.median()

seed
1.0    243.041596
2.0    336.853440
3.0    520.265933
4.0    665.624166
Name: distance, dtype: float64

In [160]:
school = data.groupby('school_common_name').describe()
school.distance.head()

# print (df.drop_duplicates(['Cat']))
d = data.drop_duplicates(['school_common_name']).drop(['distance', 'seed', 'site', 'id', 'geometry'], axis=1)
d

m = pd.merge(d, school.distance, how='left', left_on='school_common_name', right_on=school.distance.index.get_level_values('school_common_name'))
m

# df.sort_values(by=['col1'])
m.sort_values(by=['mean'])

# x = school.distance
# x['name'] = school.distance.index.get_level_values('school_common_name')
# # school.loc[school.distance, 'name'] = school.distance.index.get_level_values('school_common_name')
# x

# s = pd.merge(x, data, how='right', left_on='name', right_on='school_common_name')
# s
# # m = pd.merge(x, data, how='right', left_on='name', right_on='school_common_name')
# # # m = pd.merge(data, school.distance, how='left', left_on='school_common_name', right_on=school.distance.index.get_level_values('school_common_name'))
# # # m = m.drop(['distance'], axis=1)
# # m
# # # print(m)

# # m.to_json('../data/cleaned/school-mean.json', orient='records')
# # school.distance[school.distance.index.get_level_values('school_common_name') == 'Kentucky']

Unnamed: 0,school_common_name,year,school_full_name,team,city,state,type,conference,address,lng,lat,count,mean,std,min,25%,50%,75%,max
83,DePaul,1987.0,DePaul University,Blue Demons,Chicago,Illinois,Private/Catholic,Big East Conference,DePaul University Chicago Illinois,-87.654726,41.924020,1.0,14.406784,,14.406784,14.406784,14.406784,14.406784,14.406784
82,La Salle,1990.0,La Salle University,Explorers,Philadelphia,Pennsylvania,Private/Catholic,Atlantic 10 Conference,La Salle University Philadelphia Pennsylvania,-75.154018,40.037470,1.0,186.969348,,186.969348,186.969348,186.969348,186.969348,186.969348
68,Alabama,2002.0,University of Alabama,Crimson Tide,Tuscaloosa,Alabama,State,Southeastern Conference,University of Alabama Tuscaloosa Alabama,-87.539674,33.212082,3.0,194.538449,142.894640,49.672095,124.119584,198.567073,266.971626,335.376179
27,Butler,2017.0,Butler University,Bulldogs,Indianapolis,Indiana,Private/Non-Sectarian,Big East Conference,Butler University Indianapolis Indiana,-86.173749,39.840719,1.0,226.329078,,226.329078,226.329078,226.329078,226.329078,226.329078
3,Virginia,2019.0,University of Virginia,Cavaliers,Charlottesville,Virginia,State,Atlantic Coast Conference,University of Virginia Charlottesville Virginia,-78.505500,38.041058,6.0,233.604168,89.021514,140.746063,162.400271,227.362896,279.018491,369.170067
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52,Vanderbilt,2010.0,Vanderbilt University,Commodores,Nashville,Tennessee,Private/Non-Sectarian,Southeastern Conference,Vanderbilt University Nashville Tennessee,-86.802819,36.143801,4.0,1631.407918,757.446810,576.522270,1332.996267,1883.565917,2181.977569,2181.977569
85,VCU,1985.0,Virginia Commonwealth University,Rams,Richmond,Virginia,State,Atlantic 10 Conference,Virginia Commonwealth University Richmond Virg...,-77.453064,37.548215,1.0,1821.113346,,1821.113346,1821.113346,1821.113346,1821.113346,1821.113346
46,Saint Louis,2013.0,Saint Louis University,Billikens,St. Louis,Missouri,Private/Catholic,Atlantic 10 Conference,Saint Louis University St. Louis Missouri,-90.231677,38.635284,1.0,1969.269626,,1969.269626,1969.269626,1969.269626,1969.269626,1969.269626
67,Dayton,2003.0,University of Dayton,Flyers,Dayton,Ohio,Private/Catholic,Atlantic 10 Conference,University of Dayton Dayton Ohio,-84.179195,39.738460,1.0,2123.826607,,2123.826607,2123.826607,2123.826607,2123.826607,2123.826607


In [159]:
m.to_json('../data/cleaned/schools-default.json')
m.to_csv('../data/cleaned/schools-default.csv', index=False)


# with open('../data/cleaned/schools-default.json', 'w') as f:
#     f.write(m.to_json())

In [21]:
conf = data.groupby('conference').describe()
conf.distance

conference,count,mean,std,min,25%,50%,75%,max
American Athletic Conference,35.0,544.024159,584.435745,27.304459,214.57164,348.141829,630.326868,2464.06052
Atlantic 10 Conference,11.0,969.521764,958.868496,45.877546,132.428889,288.715757,1895.191486,2278.013061
Atlantic Coast Conference,139.0,600.824508,703.691371,1.068547,127.42266,345.194626,658.086142,2865.82939
Big 12 Conference,78.0,544.735037,402.353783,17.777579,219.233497,454.134239,773.462956,1631.984883
Big East Conference,44.0,667.392697,743.839563,12.131191,234.536763,375.350575,765.596346,2863.66665
Big Ten Conference,94.0,694.419396,697.89175,8.24911,151.786305,423.744431,915.353168,2768.812156
Missouri Valley Conference,2.0,670.738577,367.96519,410.547896,540.643237,670.738577,800.833918,930.929259
Mountain West Conference,13.0,909.1767,1001.203972,231.150318,352.89541,471.334497,959.302768,3882.415905
Pac-12 Conference,59.0,865.894042,784.995125,0.387824,369.559704,533.457641,1079.386584,3685.275858
Southeastern Conference,74.0,562.131312,481.863616,49.672095,233.860625,476.600342,645.48809,2181.977569


In [287]:
def wavg(group, seed_name, distance_name):
    """Seed-weighted average distance for one school.

    Each appearance's distance is scaled by its seed weight (1 seed: 1,
    2 seed: 0.75, 3 seed: 0.5, 4 seed: 0.25). The total is divided by the
    number of appearances, as in the manual Duke check
    ((14*142*1) + (11*270*0.75) + ...) / 31.
    Adapted from http://stackoverflow.com/questions/10951341/pandas-dataframe-aggregate-function-using-multiple-columns
    """
    seeds = group[seed_name]
    distances = group[distance_name]

    # Seeds are stored as floats (1.0, 2.0, ...), so the keys are floats too.
    weights = {1.0: 1.0, 2.0: 0.75, 3.0: 0.5, 4.0: 0.25}

    # weights[seeds] raised "TypeError: 'Series' objects are mutable, thus they
    # cannot be hashed": a whole Series cannot be a dict key. map() looks up
    # the weight for each row instead.
    return (seeds.map(weights) * distances).sum() / len(group)

data.groupby('school_common_name').apply(wavg, 'seed', 'distance')
# tmp = data.groupby(['school_common_name', 'seed']).describe()

#In [12]: df.iloc[df.index.get_level_values('A') == 1]
# display(tmp.distance[tmp.distance.index.get_level_values('school_common_name') == 'Kentucky'])
# school.distance[school.distance.index.get_level_values('school_common_name') == 'Kentucky']
# tmp.distance
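Indexing a plain dict with a whole Series fails because a Series is not hashable; `Series.map` performs the lookup row by row instead. The following is a minimal sketch of the seed-weighted distance on made-up rows. The school labels and distances are hypothetical, and dividing by the number of appearances is an assumption taken from the manual checks in this notebook.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from distances-schools.csv.
toy = pd.DataFrame({
    'school_common_name': ['A', 'A', 'A', 'B', 'B'],
    'seed': [1.0, 2.0, 4.0, 3.0, 3.0],
    'distance': [100.0, 200.0, 400.0, 300.0, 500.0],
})

# Seed weights described in the "Find weighted distance" section.
SEED_WEIGHTS = {1.0: 1.0, 2.0: 0.75, 3.0: 0.5, 4.0: 0.25}

# Per-row lookup: map each seed to its weight, then scale the distance.
toy['weighted'] = toy['seed'].map(SEED_WEIGHTS) * toy['distance']

# Sum of weighted distances divided by the number of appearances per school.
weighted_distance = toy.groupby('school_common_name')['weighted'].mean()
print(weighted_distance)
```

Because the weights are applied before grouping, the aggregation is a plain `mean()` and no custom `apply` function is needed.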

In [268]:
# ((450*10*1) + (381 * 7 * 0.75) + (628*2*0.5) + (535*3*0.25)) / 22

# [tmp.distance[tmp.distance.index.get_level_values('school_common_name') == i] for i in tmp.distance.index.get_level_values('school_common_name')]
seed = tmp.distance.index.get_level_values('seed')
school = tmp.distance.index.get_level_values('school_common_name')
# for i in school:
#      if 1 not in seed[school == i]:
#         s3 = pd.Series([4, 5, 6], index=[3, 4, 5])
#         df = pd.Series([0,0,0,0,0,0,0,0], index=[i, 1])
#         print(df)

#             dfObj.loc['k'] = ['Smriti', 26, 'Bangalore', 'India']
#         tmp.distance.loc['1'] = [0,0,0,0,0,0,0,0]
#         df = pd.DataFrame([0], columns=list('count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max')) 
#         df = pd.DataFrame({'count': 0, 'mean': 0, 'std': 0, 'min': 0, '25%': 0, '50%': 0, '75%': 0, 'max': 0})
# modDfObj = dfObj.append({'Name' : 'Sahil' , 'Age' : 22} , ignore_index=True)
#         tmp.distance.append({'count': 0, 'mean': 0, 'std': 0, 'min': 0, '25%': 0, '50%': 0, '75%': 0, 'max': 0}, ignore_index=False)
#         print(seed[school == i])
# tmp.distance

# Work on an explicit copy: assigning new columns to tmp.distance directly
# writes to a slice of tmp and triggers pandas' SettingWithCopyWarning.
x = tmp.distance.copy()
x['school'] = x.index.get_level_values('school_common_name')
x['seed'] = x.index.get_level_values('seed')
x.loc[x.school == 'Alabama']


school_common_name,seed,count,mean,std,min,25%,50%,75%,max,school,seed
Alabama,2.0,2.0,192.524137,202.023295,49.672095,121.098116,192.524137,263.950158,335.376179,Alabama,2.0
Alabama,4.0,1.0,198.567073,,198.567073,198.567073,198.567073,198.567073,198.567073,Alabama,4.0


In [63]:
display(tmp.distance[tmp.distance.index.get_level_values('school_common_name') == 'Duke'])
school.distance[school.distance.index.get_level_values('school_common_name') == 'Duke']

school_common_name,seed,count,mean,std,min,25%,50%,75%,max
Duke,1.0,14.0,142.136302,114.9146,23.370403,52.879953,127.42266,166.679929,390.499009
Duke,2.0,11.0,270.183865,305.620735,8.79219,52.879953,215.463324,311.765414,1050.483423
Duke,3.0,5.0,843.743833,789.577391,23.370403,366.243031,668.50975,1091.944017,2068.651965
Duke,4.0,1.0,591.699762,,591.699762,591.699762,591.699762,591.699762,591.699762


school_common_name,count,mean,std,min,25%,50%,75%,max
Duke,31.0,315.237086,428.730315,8.79219,52.879953,127.42266,367.019606,2068.651965


In [64]:
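# Manual check of Duke's weighted distance: sum(count * mean * seed weight)
# over seeds 1-4, divided by the 31 total appearances (means truncated to integers).
((14*142*1)+(11*270*0.75)+(5*843*.5)+(1*591*0.25)) / 31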

208.73387096774192
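The same check can be repeated with the unrounded per-seed means, which shifts the result slightly. This is a sketch only: the counts and means are copied from the Duke `describe()` output above, and `weighted_distance` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
# Per-seed (appearance count, mean distance) for Duke, copied from the
# describe() output in this notebook.
duke = {
    1: (14, 142.136302),
    2: (11, 270.183865),
    3: (5, 843.743833),
    4: (1, 591.699762),
}

# Seed weights from the "Find weighted distance" section.
weights = {1: 1.0, 2: 0.75, 3: 0.5, 4: 0.25}


def weighted_distance(per_seed, weights):
    """Return sum(count * mean * weight) divided by the total count."""
    total = sum(count for count, _ in per_seed.values())
    weighted_sum = sum(
        count * mean * weights[seed] for seed, (count, mean) in per_seed.items()
    )
    return weighted_sum / total


duke_weighted = weighted_distance(duke, weights)
print(round(duke_weighted, 2))
```

With the full-precision means the value comes out near 208.91, compared with 208.73 from the truncated figures.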

In [66]:
tmp.distance[tmp.distance.index.get_level_values('school_common_name') == 'Kansas']

school_common_name,seed,count,mean,std,min,25%,50%,75%,max
Kansas,1.0,14.0,335.734603,222.52172,42.542538,180.221004,258.46982,480.582796,688.944254
Kansas,2.0,7.0,457.954089,351.355571,149.292054,203.880937,311.741554,621.976607,1092.929925
Kansas,3.0,3.0,478.032214,190.179024,258.46982,421.386069,584.302318,587.813411,591.324504
Kansas,4.0,5.0,644.803337,367.563468,42.542538,670.559068,688.944254,781.803429,1040.167396


In [36]:
tmp.distance[tmp.distance.index.get_level_values('school_common_name') == 'North Carolina']

school_common_name,seed,count,mean,std,min,25%,50%,75%,max
North Carolina,1.0,14.0,173.584449,169.329781,26.55306,54.391445,97.301448,227.74676,537.320513
North Carolina,2.0,7.0,640.371667,670.759162,119.273414,238.662342,470.350619,676.256223,2063.140505
North Carolina,3.0,3.0,1752.966507,1225.273577,399.467443,1236.178211,2072.888978,2429.716039,2786.543099
North Carolina,4.0,2.0,387.935215,7.823261,382.403335,385.169275,387.935215,390.701156,393.467096


In [37]:
tmp.distance[tmp.distance.index.get_level_values('school_common_name') == 'Michigan State']

school_common_name,seed,count,mean,std,min,25%,50%,75%,max
Michigan State,1.0,5.0,320.757808,176.167537,188.773963,194.356748,214.983495,421.446793,584.228041
Michigan State,2.0,2.0,504.913264,94.146125,438.341901,471.627582,504.913264,538.198945,571.484627
Michigan State,3.0,3.0,310.937264,391.858249,77.327536,84.73826,92.148984,427.742128,763.335272
Michigan State,4.0,2.0,1402.646059,944.242151,734.966031,1068.806045,1402.646059,1736.486073,2070.326087
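
The per-school lookups above repeat the same boolean mask on the index. One possible shortcut is a cross-section with `DataFrame.xs`, sketched here on made-up rows. `seed_summary` is a hypothetical helper and the toy distances are not from the dataset; only the column names mirror the notebook's.

```python
import pandas as pd

# Hypothetical data for illustration; column names mirror the notebook's.
toy = pd.DataFrame({
    'school_common_name': ['Duke', 'Duke', 'Duke', 'Kansas', 'Kansas'],
    'seed': [1.0, 1.0, 2.0, 1.0, 4.0],
    'distance': [100.0, 200.0, 300.0, 400.0, 500.0],
})

# Per-school, per-seed summary statistics of distance.
summary = toy.groupby(['school_common_name', 'seed'])['distance'].describe()


def seed_summary(summary, school_name):
    """Return the per-seed rows of `summary` for one school (hypothetical helper)."""
    # drop_level=False keeps the school name in the index, as in the outputs above.
    return summary.xs(school_name, level='school_common_name', drop_level=False)


duke_rows = seed_summary(summary, 'Duke')
print(duke_rows)
```

Calling `seed_summary(summary, 'Kansas')` selects the other school the same way, without rebuilding the mask each time.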
