# Clustering and Segmenting Colleges in the US

This project was created for the final Coursera capstone. It uses location data from the Foursquare API together with a college dataset obtained from an online source to segment and cluster US colleges according to the kind of neighborhood they are located in, so that an applicant can get an idea of the environment they will be stepping into.

## 1. Creating the dataset and sourcing the data

The dataset, which contains information such as the name and address of each college, was sourced from a website called opendatasoft. It is a publicly available and reliable dataset that already contains a column with the latitude and longitude of each college, so a geocoding library such as geopy is not required.

In [1]:
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_b12c776265884bde9497c3315299a37c = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='Qqbp_2nistk49XwRj9-rFzTZ9HU63tz39QStUN_13hbN',
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_b12c776265884bde9497c3315299a37c.get_object(Bucket='machinelearningfinalproject-donotdelete-pr-dpqc6p5dfdrflv',Key='hc.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_data_1 = pd.read_csv(body,sep=';') # the file is semicolon-separated, hence the sep parameter
df_data_1.head()
# the dataset df_data_1 now contains the raw dataset as imported from the aforementioned website

Unnamed: 0,Geo Point,Geo Shape,FID,OBJECTID,IPEDSID,NAME,ADDRESS,ADDRESS2,CITY,STATE,...,ALIAS,SIZE_SET,INST_SIZE,PT_ENROLL,FT_ENROLL,TOT_ENROLL,HOUSING,DORM_CAP,TOT_EMP,SHELTER_ID
0,"27.191599818,-80.249507429","{""type"": ""Point"", ""coordinates"": [-80.24950742...",5544,5943,445744,FORTIS INSTITUTE-PORT SAINT LUCIE,9022 SOUTH US HIGHWAY 1,,PORT SAINT LUCIE,FL,...,,-3,1,-999,489,489,2,-999,63,
1,"46.251436412,-119.118516839","{""type"": ""Point"", ""coordinates"": [-119.1185168...",5746,4146,234979,COLUMBIA BASIN COLLEGE,2600 N 20TH AVE,,PASCO,WA,...,CBC,3,3,2820,3565,6385,2,-999,500,
2,"38.34799259,-81.634181682","{""type"": ""Point"", ""coordinates"": [-81.63418168...",5848,4248,237987,WEST VIRGINIA JUNIOR COLLEGE-CHARLESTON,1000 VIRGINIA ST E,,CHARLESTON,WV,...,,1,1,21,141,162,2,-999,38,
3,"42.079353804,-104.190823112","{""type"": ""Point"", ""coordinates"": [-104.1908231...",5926,4326,240596,EASTERN WYOMING COLLEGE,3200 WEST C ST,,TORRINGTON,WY,...,,2,2,1044,660,1704,1,165,186,
4,"18.395409773,-66.159471941","{""type"": ""Point"", ""coordinates"": [-66.15947194...",5940,4340,240985,EDUCATIONAL TECHNICAL COLLEGE-RECINTO DE BAYAMON,1685 CARR #2 KL 11.2,,BAYAMON,PR,...,,-3,1,-999,497,497,2,-999,71,
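
For readers without access to the cloud storage bucket, the same kind of file can be read from a local copy. The following is only a sketch: the two rows and the reduced set of columns are made up for illustration, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Two made-up rows in a semicolon-separated layout (illustrative only)
raw = (
    "NAME;CITY;STATE;Geo Point\n"
    "TOY COLLEGE;PASCO;WA;46.25,-119.11\n"
    "SAMPLE INSTITUTE;BAYAMON;PR;18.39,-66.15\n"
)

# sep=';' keeps the comma inside 'Geo Point' from being treated as a delimiter
sample = pd.read_csv(io.StringIO(raw), sep=";")
print(sample.columns.tolist())
print(sample["Geo Point"].iloc[0])
```

Because the separator is a semicolon, the "lat,lon" pair stays in one string column.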


The dataset above has a lot of information that is not relevant to the present context. We therefore create a new data frame called df that holds only the relevant columns.

In [2]:
df=pd.DataFrame()
df['zip-code']=df_data_1['ZIP']
df['Name']=df_data_1['NAME']
df['City']=df_data_1['CITY']
df['State']=df_data_1['STATE']
df['coordinates']=df_data_1['Geo Point']
df.head()

Unnamed: 0,zip-code,Name,City,State,coordinates
0,34952.0,FORTIS INSTITUTE-PORT SAINT LUCIE,PORT SAINT LUCIE,FL,"27.191599818,-80.249507429"
1,99301.0,COLUMBIA BASIN COLLEGE,PASCO,WA,"46.251436412,-119.118516839"
2,25301.0,WEST VIRGINIA JUNIOR COLLEGE-CHARLESTON,CHARLESTON,WV,"38.34799259,-81.634181682"
3,82240.0,EASTERN WYOMING COLLEGE,TORRINGTON,WY,"42.079353804,-104.190823112"
4,,EDUCATIONAL TECHNICAL COLLEGE-RECINTO DE BAYAMON,BAYAMON,PR,"18.395409773,-66.159471941"


As the output above shows, some colleges have a null zip code. To clean the data, we remove every row that has a missing or null value.

In [3]:
df=df.dropna()
df=df.reset_index(drop=True)
print(df.shape)
df.head()

(7168, 5)


Unnamed: 0,zip-code,Name,City,State,coordinates
0,34952.0,FORTIS INSTITUTE-PORT SAINT LUCIE,PORT SAINT LUCIE,FL,"27.191599818,-80.249507429"
1,99301.0,COLUMBIA BASIN COLLEGE,PASCO,WA,"46.251436412,-119.118516839"
2,25301.0,WEST VIRGINIA JUNIOR COLLEGE-CHARLESTON,CHARLESTON,WV,"38.34799259,-81.634181682"
3,82240.0,EASTERN WYOMING COLLEGE,TORRINGTON,WY,"42.079353804,-104.190823112"
4,614.0,UNIVERSITY OF PUERTO RICO-ARECIBO,ARECIBO,PR,"18.469199,-66.74114"


From the above output we can see that the dataset is now free of null values and has 7168 rows. Two problems remain to be tackled:
1. Converting the zip code from float to integer values
2. Splitting the coordinates column into separate latitude and longitude columns

In [4]:
# 1. Converting the floating point zip code values to integer values
df['zip-code']=df['zip-code'].astype(int)
df.head()

Unnamed: 0,zip-code,Name,City,State,coordinates
0,34952,FORTIS INSTITUTE-PORT SAINT LUCIE,PORT SAINT LUCIE,FL,"27.191599818,-80.249507429"
1,99301,COLUMBIA BASIN COLLEGE,PASCO,WA,"46.251436412,-119.118516839"
2,25301,WEST VIRGINIA JUNIOR COLLEGE-CHARLESTON,CHARLESTON,WV,"38.34799259,-81.634181682"
3,82240,EASTERN WYOMING COLLEGE,TORRINGTON,WY,"42.079353804,-104.190823112"
4,614,UNIVERSITY OF PUERTO RICO-ARECIBO,ARECIBO,PR,"18.469199,-66.74114"
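
US zip codes have five digits, so a value such as 614 in the output above has most likely lost its leading zeros in the numeric conversion. If the zip code is needed as an identifier rather than a number, one option is to store it as a zero-padded string. The sketch below uses a small made-up series, not the project's data frame.

```python
import pandas as pd

# Toy stand-in for the 'zip-code' column after dropna(); values are illustrative
zips = pd.Series([34952.0, 99301.0, 614.0])

# Cast to int to drop the decimal part, then to str and left-pad to 5 digits
zip_str = zips.astype(int).astype(str).str.zfill(5)
print(zip_str.tolist())
```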


In [5]:
# 2. Converting the coordinates column to separate latitude and longitude columns
# Each entry is a "lat,lon" string, so split it once on the comma
coords=df["coordinates"].str.split(",", expand=True)
df["Latitude"]=coords[0]
df["Longitude"]=coords[1]
del df["coordinates"]
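
The Latitude and Longitude columns hold strings at this point, which is why the map cell later casts them with astype(float). As a sketch on a toy frame (the values are illustrative), the conversion could instead be done once with `pd.to_numeric`:

```python
import pandas as pd

# Toy frame with coordinates stored as strings, as they are after the split
toy = pd.DataFrame({
    "Latitude": ["27.191599818", "18.469199"],
    "Longitude": ["-80.249507429", "-66.74114"],
})

# Convert both columns to floats in one step
cols = ["Latitude", "Longitude"]
toy[cols] = toy[cols].apply(pd.to_numeric)
print(toy.dtypes)
```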

The dataset now has the columns we need. To keep the computation manageable, we take only the first 100 college entries for clustering.

In [6]:
# keep an independent copy of the first 100 rows
df_final=df.iloc[:100].copy()
print(df_final.shape)
df_final.head()

(100, 6)


Unnamed: 0,zip-code,Name,City,State,Latitude,Longitude
0,34952,FORTIS INSTITUTE-PORT SAINT LUCIE,PORT SAINT LUCIE,FL,27.191599818,-80.249507429
1,99301,COLUMBIA BASIN COLLEGE,PASCO,WA,46.251436412,-119.118516839
2,25301,WEST VIRGINIA JUNIOR COLLEGE-CHARLESTON,CHARLESTON,WV,38.34799259,-81.634181682
3,82240,EASTERN WYOMING COLLEGE,TORRINGTON,WY,42.079353804,-104.190823112
4,614,UNIVERSITY OF PUERTO RICO-ARECIBO,ARECIBO,PR,18.469199,-66.74114


## 2. Importing the required modules and libraries

In this section we import the libraries required for the rest of the program. We also define the Foursquare access credentials.

In [7]:
import json # library to handle JSON files
import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
#!conda install -c conda-forge folium=0.5.0 --yes 
#import folium # map rendering library
print('Libraries imported.')

Libraries imported.


Now we shall define and print the Foursquare API access credentials.

In [8]:
# Defining the client credentials:
import requests
CLIENT_ID = 'GWNSHV2HCXRVX0UW3Y4F5V4WXOE3EFZP2ZUPOBQEUOE3JZC5' # your Foursquare ID
CLIENT_SECRET = 'RD13AZQCILYQI4DG2EA4ODFTQ51YPZXE0OUIN2UDRGPXPK4N' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: GWNSHV2HCXRVX0UW3Y4F5V4WXOE3EFZP2ZUPOBQEUOE3JZC5
CLIENT_SECRET:RD13AZQCILYQI4DG2EA4ODFTQ51YPZXE0OUIN2UDRGPXPK4N


## 3. Accessing and contacting the Foursquare API to obtain venue information

In this section, we first create a function that queries the Foursquare API and stores the relevant information in a dataframe. We then call this function and print out the resulting dataframe.

In [9]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=100):
    # radius is the search radius sent to the API; limit caps the venues returned per college
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)    
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # return only relevant information for each nearby venue
        venues_list.append([(
                name, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name']) for v in results])
     
                
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    
    nearby_venues.columns = ['College Name', 
                  'College Latitude', 
                  'College Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    return(nearby_venues)
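
The function above indexes `['groups'][0]` and `['categories'][0]` directly, so it raises an error if a response has no groups or a venue has no category. The following is a sketch of a more defensive parsing step. The helper name `parse_explore_items`, the "Unknown" placeholder and the hand-written payload are all assumptions made for illustration; the payload shape simply mirrors the keys that getNearbyVenues reads.

```python
def parse_explore_items(payload, name, lat, lng):
    """Flatten one explore-style response into venue rows for one college.

    Hypothetical helper: the payload shape (response -> groups[0] -> items)
    mirrors the keys read by getNearbyVenues.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Fall back to a placeholder when a venue has no category
        category = categories[0]["name"] if categories else "Unknown"
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows

# Hand-written payload for illustration; not real API output
fake = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe A", "location": {"lat": 1.0, "lng": 2.0},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Shop B", "location": {"lat": 1.1, "lng": 2.1},
               "categories": []}},
]}]}}

rows = parse_explore_items(fake, "TOY COLLEGE", 1.0, 2.0)
empty = parse_explore_items({"response": {}}, "TOY COLLEGE", 1.0, 2.0)
print(rows)
print(empty)
```

Keeping the parsing separate from the HTTP call also makes it possible to check the logic without contacting the API.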

We now call the function to collect the venue information into a dataframe.

In [10]:
venue_df=getNearbyVenues(df_final['Name'],df_final['Latitude'],df_final['Longitude'])

In [11]:
venue_df.head()

Unnamed: 0,College Name,College Latitude,College Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,FORTIS INSTITUTE-PORT SAINT LUCIE,27.191599818,-80.249507429,Flanigan's,27.189679,-80.251054,American Restaurant
1,FORTIS INSTITUTE-PORT SAINT LUCIE,27.191599818,-80.249507429,Wawa,27.190136,-80.249503,Breakfast Spot
2,FORTIS INSTITUTE-PORT SAINT LUCIE,27.191599818,-80.249507429,Terra Fermata,27.194554,-80.252194,Beer Garden
3,FORTIS INSTITUTE-PORT SAINT LUCIE,27.191599818,-80.249507429,Fruits & Roots,27.193018,-80.253279,Vegetarian / Vegan Restaurant
4,FORTIS INSTITUTE-PORT SAINT LUCIE,27.191599818,-80.249507429,Lola's Seafood Eatery,27.19141,-80.254541,Seafood Restaurant


We now have a dataframe listing the venues in and around each college location.

## 4. Rearranging the obtained information and clustering the data

Here we create a new dataframe called onehot that holds the venue categories from the venue_df table above in one-hot encoded format. The get_dummies() function of the pandas library is used for this.

In [12]:
# one hot encoding
onehot = pd.get_dummies(venue_df[['Venue Category']], prefix="", prefix_sep="")
# add the college name column back to the dataframe
onehot['College Name'] = venue_df['College Name'] 
# move the college name column to the first position
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]
onehot.head()

Unnamed: 0,College Name,ATM,Accessories Store,African Restaurant,American Restaurant,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Video Store,Vietnamese Restaurant,Wagashi Place,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,FORTIS INSTITUTE-PORT SAINT LUCIE,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,FORTIS INSTITUTE-PORT SAINT LUCIE,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,FORTIS INSTITUTE-PORT SAINT LUCIE,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,FORTIS INSTITUTE-PORT SAINT LUCIE,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,FORTIS INSTITUTE-PORT SAINT LUCIE,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Grouping the data by college name we get:

In [13]:
grouped = onehot.groupby('College Name').mean().reset_index()
grouped.head()

Unnamed: 0,College Name,ATM,Accessories Store,African Restaurant,American Restaurant,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Video Store,Vietnamese Restaurant,Wagashi Place,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,ACADEMY OF ART UNIVERSITY,0.0,0.0,0.0,0.0,0.0,0.03,0.04,0.0,0.0,...,0.0,0.01,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.01
1,ACAYDIA SCHOOL OF AESTHETICS,0.0,0.0,0.0,0.022727,0.0,0.0,0.0,0.0,0.045455,...,0.0,0.022727,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,ALASKA BIBLE COLLEGE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,ALLEN SCHOOL-BROOKLYN,0.0,0.0,0.0,0.010526,0.0,0.0,0.0,0.0,0.021053,...,0.0,0.010526,0.0,0.0,0.0,0.0,0.031579,0.0,0.0,0.031579
4,AMERICAN BUSINESS AND TECHNOLOGY UNIVERSITY,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
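
To see why the grouped table contains fractions, it helps to run the same two steps on a tiny example. The sketch below uses made-up colleges "A" and "B" and passes `dtype=int` to get_dummies so that the indicator columns are numeric regardless of the pandas version.

```python
import pandas as pd

# Toy venue list: college A has four venues, college B has two
toy_venues = pd.DataFrame({
    "College Name": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Cafe", "Cafe", "Bar", "Park", "Bar", "Bar"],
})

# One indicator column per category
toy_onehot = pd.get_dummies(toy_venues[["Venue Category"]],
                            prefix="", prefix_sep="", dtype=int)
toy_onehot["College Name"] = toy_venues["College Name"]

# The mean of 0/1 indicators is the share of venues in each category
toy_grouped = toy_onehot.groupby("College Name").mean().reset_index()
print(toy_grouped)
```

Each row sums to 1, so colleges with very different numbers of nearby venues remain comparable.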


Next we define a function that goes through the grouped table above, finds the 20 most common venue categories for each college, and stores them together with the college name in a new data frame called venues_sorted.

In [14]:
import numpy as np
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
num_top_venues = 20

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['College Name']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['College Name'] =grouped['College Name']

for ind in np.arange(grouped.shape[0]):
    venues_sorted.iloc[ind, 1:] = return_most_common_venues(grouped.iloc[ind, :], num_top_venues)

venues_sorted.head()

Unnamed: 0,College Name,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,...,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
0,ACADEMY OF ART UNIVERSITY,Coffee Shop,Gym / Fitness Center,Café,Art Museum,Hotel,Sandwich Place,Salad Place,Art Gallery,Boutique,...,Tea Room,Dim Sum Restaurant,New American Restaurant,Theater,Bar,Spa,Museum,Food Truck,Burger Joint,Garden
1,ACAYDIA SCHOOL OF AESTHETICS,Mexican Restaurant,Asian Restaurant,Chinese Restaurant,Sandwich Place,Snack Place,Bank,Bakery,Rock Club,Indian Restaurant,...,Breakfast Spot,Thai Restaurant,Gas Station,Music Venue,Ice Cream Shop,Gift Shop,Burger Joint,Salad Place,Café,Church
2,ALASKA BIBLE COLLEGE,Pizza Place,Café,Shipping Store,Ice Cream Shop,Coffee Shop,Mediterranean Restaurant,Bookstore,Museum,Sandwich Place,...,Clothing Store,Tourist Information Center,Bar,Mexican Restaurant,Hotel,Diner,Disc Golf,Cycle Studio,Cuban Restaurant,Falafel Restaurant
3,ALLEN SCHOOL-BROOKLYN,Pizza Place,Deli / Bodega,Grocery Store,Gym,Bagel Shop,Yoga Studio,Wine Shop,Bakery,Pet Store,...,Middle Eastern Restaurant,Gym / Fitness Center,Diner,Mediterranean Restaurant,Coffee Shop,History Museum,Park,Bubble Tea Shop,Plaza,Bar
4,AMERICAN BUSINESS AND TECHNOLOGY UNIVERSITY,Art Gallery,History Museum,Lawyer,Health & Beauty Service,Yoga Studio,Event Space,Dry Cleaner,Dumpling Restaurant,Electronics Store,...,Ethiopian Restaurant,Event Service,Exhibit,Fabric Shop,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Financial or Legal Service,Drugstore
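
The column labels above are correct for 20 venues, but the list-lookup approach would produce labels such as "21th" if num_top_venues were raised beyond 20. A small helper, sketched here under the hypothetical name `ordinal`, handles the general case:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # Numbers ending in 11, 12 or 13 always take 'th'
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

# Build the same kind of column labels for 23 venues
columns = ["College Name"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 24)]
print(columns[1], "|", columns[11], "|", columns[21])
```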


We now apply the k-means clustering algorithm to the grouped frequency table to obtain the segmented list of colleges.

In [15]:
# set number of clusters
kclusters = 4

grouped_clustering = grouped.drop(columns='College Name')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_ 


array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 2, 1,
       1, 1, 1], dtype=int32)
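
The labels above are dominated by cluster 1, with only a handful of colleges in the other three clusters. A quick way to quantify this kind of imbalance is to count the labels per cluster. The sketch below uses a short made-up label array; with the real model, `kmeans.labels_` would take its place.

```python
import numpy as np

# Made-up labels for illustration; replace with kmeans.labels_ in the notebook
toy_labels = np.array([1, 1, 0, 1, 3, 1, 2, 1, 1, 0])

# bincount returns the number of members per cluster label
sizes = np.bincount(toy_labels, minlength=4)
for cluster, size in enumerate(sizes):
    print(f"cluster {cluster}: {size} colleges ({size / len(toy_labels):.0%})")
```

A very uneven split can be a hint to revisit the number of clusters or the features used.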

We now create a table called merged that joins the college data with the top 20 venues around each college, keeping only the colleges for which venue data was collected. The cluster label assigned to each college during segmentation is attached as well.

In [16]:
# add clustering labels
venues_sorted['Cluster Labels']=kmeans.labels_

merged = df

# join the venue rankings and cluster labels onto the college data by college name
merged =merged.join(venues_sorted.set_index('College Name'), on='Name')

merged.dropna(axis=0,how='any',inplace=True)
merged.reset_index(inplace=True,drop=True)
merged.head()

Unnamed: 0,zip-code,Name,City,State,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,...,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue,Cluster Labels
0,34952,FORTIS INSTITUTE-PORT SAINT LUCIE,PORT SAINT LUCIE,FL,27.191599818,-80.249507429,Hotel,Sandwich Place,Convenience Store,American Restaurant,...,Breakfast Spot,Yoga Studio,Ethiopian Restaurant,Electronics Store,English Restaurant,Fabric Shop,Event Service,Event Space,Exhibit,1.0
1,99301,COLUMBIA BASIN COLLEGE,PASCO,WA,46.251436412,-119.118516839,Hotel,Planetarium,American Restaurant,Golf Course,...,English Restaurant,Event Space,Event Service,Drugstore,Exhibit,Fabric Shop,Falafel Restaurant,Farmers Market,Fast Food Restaurant,1.0
2,25301,WEST VIRGINIA JUNIOR COLLEGE-CHARLESTON,CHARLESTON,WV,38.34799259,-81.634181682,Bar,Sports Bar,Bank,Pizza Place,...,Breakfast Spot,Ice Cream Shop,Sandwich Place,Middle Eastern Restaurant,Miscellaneous Shop,Rock Club,Dive Bar,Other Great Outdoors,Shipping Store,1.0
3,82240,EASTERN WYOMING COLLEGE,TORRINGTON,WY,42.079353804,-104.190823112,Theater,Yoga Studio,Event Space,Dumpling Restaurant,...,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Financial or Legal Service,Dry Cleaner,Donut Shop,Flea Market,Deli / Bodega,3.0
4,614,UNIVERSITY OF PUERTO RICO-ARECIBO,ARECIBO,PR,18.469199,-66.74114,Pizza Place,Yoga Studio,Event Space,Dry Cleaner,...,Fabric Shop,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Financial or Legal Service,Drugstore,Dive Bar,Hotpot Restaurant,0.0
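
The cluster labels appear as 1.0, 3.0 and so on because the join first introduces missing values for colleges without venue data, which turns the integer column into floats. The sketch below reproduces this on made-up frames and casts the labels back to integers after the incomplete rows are dropped.

```python
import pandas as pd

# Made-up frames: college "B" has no venue data
colleges = pd.DataFrame({"Name": ["A", "B", "C"], "State": ["FL", "WA", "WV"]})
ranked = pd.DataFrame({"College Name": ["A", "C"], "Cluster Labels": [1, 0]})

toy_merged = colleges.join(ranked.set_index("College Name"), on="Name")
# "B" has no match, so its label is NaN and the whole column becomes float
toy_merged = toy_merged.dropna().reset_index(drop=True)
# With the NaN rows gone, the labels can be integers again
toy_merged["Cluster Labels"] = toy_merged["Cluster Labels"].astype(int)
print(toy_merged)
```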


We now use the folium library to visualize the clustered colleges on a map of the US.

In [17]:
#!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library
# create map
latitude=39.381266
longitude=-97.922211
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=4) # a low zoom level keeps the whole country in view

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(7)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(merged['Latitude'].astype(float),merged['Longitude'].astype(float), merged['Name'], merged['Cluster Labels']):
        label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=rainbow[int(cluster-1)],
            fill=True,
            fill_color=rainbow[int(cluster-1)],
            fill_opacity=0.7).add_to(map_clusters)
    
map_clusters
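
The colour setup above generates seven colours and indexes them with cluster-1. A more direct alternative, sketched here as an assumption rather than as the notebook's method, is to create exactly one colour per cluster and look it up by label:

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 4

# One evenly spaced rainbow colour per cluster, converted to hex strings
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# Map each cluster label directly to its colour
cluster_color = {label: palette[label] for label in range(kclusters)}
print(cluster_color)
```

Inside the marker loop, `cluster_color[int(cluster)]` could then be passed as the color and fill_color arguments.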


Finally, we create a rudimentary user interface: the user enters the name of a college, and the program prints the most common kinds of venues around it along with the top 5 colleges similar to it.

In [36]:
neigh= input("Enter the name of the college you are currently interested in...")
ans=neigh
loca=list(merged['Name'])
label=list(merged['Cluster Labels'])
# columns 6 to 11 hold the six most common venue categories
venues=merged.iloc[:,6:12].values.tolist()
top_venues=None
clustno=None
# find the venue list and cluster label of the requested college
for name,labe,row in zip(loca,label,venues):
    if name==ans:
        top_venues=row
        clustno=labe
if top_venues is None:
    print("College not found in the clustered data.")
else:
    # colleges in the same cluster, excluding the requested one and duplicates
    similar=[name.strip() for name,labe in zip(loca,label) if name!=ans and labe==clustno]
    similar=list(dict.fromkeys(similar))
    print("The top colleges similar to yours are...")
    for num,name in enumerate(similar[:5]):
        print(num+1,". ",name)
        print("\n")
    print("The kind of venues you will find around this institution most popularly are:\n")
    for i,venue in enumerate(top_venues):
        print(i+1,". ",venue)
        print("\n")

Enter the name of the college you are currently interested in...WEST VIRGINIA JUNIOR COLLEGE-CHARLESTON
The top colleges similar to yours are...
1 .  FORTIS INSTITUTE-PORT SAINT LUCIE


2 .  COLUMBIA BASIN COLLEGE


3 .  YORK TECHNICAL COLLEGE


4 .  DYERSBURG STATE COMMUNITY COLLEGE


5 .  SEWANEE-THE UNIVERSITY OF THE SOUTH


The kind of venues you will find around this institution most popularly are:

1 .  Bar


2 .  Sports Bar


3 .  Bank


4 .  Pizza Place


5 .  American Restaurant


6 .  Café
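
The lookup above requires the name to be typed exactly as it is stored. As a sketch of a more forgiving variant, the hypothetical helper below ignores case and surrounding whitespace and returns an empty list for unknown names. The toy names and labels are made up for illustration.

```python
def similar_colleges(names, labels, query, n=5):
    """Return up to n other colleges that share the query's cluster label."""
    wanted = query.strip().upper()
    cleaned = [name.strip().upper() for name in names]
    if wanted not in cleaned:
        return []
    cluster = labels[cleaned.index(wanted)]
    matches = [name.strip() for name, clean, label in zip(names, cleaned, labels)
               if label == cluster and clean != wanted]
    # dict.fromkeys drops duplicates while keeping the original order
    return list(dict.fromkeys(matches))[:n]

toy_names = ["ALPHA COLLEGE", "BETA INSTITUTE", "GAMMA UNIVERSITY", "DELTA COLLEGE"]
toy_labels = [1, 1, 0, 1]
print(similar_colleges(toy_names, toy_labels, "alpha college"))
print(similar_colleges(toy_names, toy_labels, "unknown"))
```

With the notebook's data, `list(merged['Name'])` and `list(merged['Cluster Labels'])` would be passed in place of the toy lists.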


