# Determining internet user clusters using KNN 

In [1]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt  # pyplot is the plotting interface conventionally aliased as plt
import sklearn

### Motivations: 
For this project we are going to apply KNN to internet usage data to create internet user clusters. More than 200 countries are represented in the data. Rather than assuming that the clusters follow geographic regions, we will use KNN to determine the trends in internet access, usability, and cellular device usage.

### Usability:
As companies grow and focus on their expansion efforts, it is important to determine where to allocate resources and investments so that they are most likely to succeed. For IoT products, insight into internet accessibility is one of the factors that shape marketing and investment efforts.

### Data:
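The columns in the data are `Entity`, `Code`, `Year`, `Cellular Subscription`, `Internet Users(%)`, `No. of Internet Users`, and `Broadband Subscription`.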




In [13]:
df = pd.read_csv('Final.csv', index_col=None)
df

,Unnamed: 0,Entity,Code,Year,Cellular Subscription,Internet Users(%),No. of Internet Users,Broadband Subscription
0,0,Afghanistan,AFG,1980,0.000000,0.000000,0,0.000000
1,1,Afghanistan,AFG,1981,0.000000,0.000000,0,0.000000
2,2,Afghanistan,AFG,1982,0.000000,0.000000,0,0.000000
3,3,Afghanistan,AFG,1983,0.000000,0.000000,0,0.000000
4,4,Afghanistan,AFG,1984,0.000000,0.000000,0,0.000000
...,...,...,...,...,...,...,...,...
8862,8862,Zimbabwe,ZWE,2016,91.793457,23.119989,3341464,1.217633
8863,8863,Zimbabwe,ZWE,2017,98.985077,24.400000,3599269,1.315694
8864,8864,Zimbabwe,ZWE,2018,89.404869,25.000000,3763048,1.406322
8865,8865,Zimbabwe,ZWE,2019,90.102287,25.100000,3854006,1.395818


In [14]:
df.drop(columns=["Unnamed: 0"],inplace=True)
df

,Entity,Code,Year,Cellular Subscription,Internet Users(%),No. of Internet Users,Broadband Subscription
0,Afghanistan,AFG,1980,0.000000,0.000000,0,0.000000
1,Afghanistan,AFG,1981,0.000000,0.000000,0,0.000000
2,Afghanistan,AFG,1982,0.000000,0.000000,0,0.000000
3,Afghanistan,AFG,1983,0.000000,0.000000,0,0.000000
4,Afghanistan,AFG,1984,0.000000,0.000000,0,0.000000
...,...,...,...,...,...,...,...
8862,Zimbabwe,ZWE,2016,91.793457,23.119989,3341464,1.217633
8863,Zimbabwe,ZWE,2017,98.985077,24.400000,3599269,1.315694
8864,Zimbabwe,ZWE,2018,89.404869,25.000000,3763048,1.406322
8865,Zimbabwe,ZWE,2019,90.102287,25.100000,3854006,1.395818


In [91]:
print('Total number of unique countries represented in the dataset: ', df['Entity'].nunique())

Total number of unique countries represented in the dataset:  229


In [26]:
df.columns

Index(['Entity', 'Code', 'Year', 'Cellular Subscription', 'Internet Users(%)',
       'No. of Internet Users', 'Broadband Subscription'],
      dtype='object')

In [89]:
af_country = df[df['Entity'] == 'Zimbabwe']
af_country

,Entity,Code,Year,Cellular Subscription,Internet Users(%),No. of Internet Users,Broadband Subscription
8826,Zimbabwe,ZWE,1980,0.0,0.0,0,0.0
8827,Zimbabwe,ZWE,1981,0.0,0.0,0,0.0
8828,Zimbabwe,ZWE,1982,0.0,0.0,0,0.0
8829,Zimbabwe,ZWE,1983,0.0,0.0,0,0.0
8830,Zimbabwe,ZWE,1984,0.0,0.0,0,0.0
8831,Zimbabwe,ZWE,1985,0.0,0.0,0,0.0
8832,Zimbabwe,ZWE,1986,0.0,0.0,0,0.0
8833,Zimbabwe,ZWE,1987,0.0,0.0,0,0.0
8834,Zimbabwe,ZWE,1988,0.0,0.0,0,0.0
8835,Zimbabwe,ZWE,1989,0.0,0.0,0,0.0


In [69]:
df.describe()

,Year,Cellular Subscription,Internet Users(%),No. of Internet Users,Broadband Subscription
count,8867.0,8867.0,8867.0,8867.0,8867.0
mean,2000.151799,39.989614,17.043606,10891380.0,4.440695
std,11.812151,51.98141,26.883498,124884100.0,9.755705
min,1980.0,0.0,0.0,0.0,0.0
25%,1990.0,0.0,0.0,0.0,0.0
50%,2000.0,5.501357,0.855662,10047.0,0.0
75%,2010.0,82.231594,25.449939,866419.5,2.007603
max,2020.0,436.103027,100.0,4699886000.0,78.524361


We now face the question of what to do with the 'Year' column. Is the historical data relevant to our goal? Should we perform a time series analysis, and if so, what would we use it for?

What I am interested in creating is a set of clusters that separates countries or regions into high, intermediate, and low internet usage. If a year-by-year analysis does not serve this purpose, should we focus only on the most recent year available in the data? Let's first find out whether every country even has an entry for that year.

In [98]:
df_2020 = df[df['Year']==2020]
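
One way to answer the coverage question is to compare the countries that have a 2020 row against all countries in the data. The full-table preview earlier ends with Zimbabwe's 2019 row, which hints that not every country reaches 2020. The snippet below is only a minimal sketch on a tiny invented frame: the rows in `toy` and the helper name `coverage_for_year` are illustrative assumptions, not part of the notebook. On the real data you would pass `df` instead of `toy`.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; these rows are invented for illustration.
toy = pd.DataFrame({
    "Entity": ["A", "A", "B", "B", "C"],
    "Year": [2019, 2020, 2019, 2020, 2019],
    "Internet Users(%)": [10.0, 12.0, 50.0, 55.0, 25.1],
})

def coverage_for_year(frame, year):
    """Return (entities with a row for `year`, entities without one)."""
    all_entities = set(frame["Entity"].unique())
    present = set(frame.loc[frame["Year"] == year, "Entity"].unique())
    return sorted(present), sorted(all_entities - present)

present, missing = coverage_for_year(toy, 2020)
print(f"{len(present)} of {toy['Entity'].nunique()} entities have a 2020 row")
print("Entities without a 2020 row:", missing)

# Possible fallback (an assumption, not the notebook's choice):
# keep each entity's most recent available year instead of a fixed year.
latest = toy.loc[toy.groupby("Entity")["Year"].idxmax()]
print(latest)
```

If some countries turn out to have no 2020 row, the `latest` frame is one possible snapshot to cluster on, at the cost of mixing slightly different years.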

In [101]:
df_2020.isnull()

,Entity,Code,Year,Cellular Subscription,Internet Users(%),No. of Internet Users,Broadband Subscription
40,False,False,False,False,False,False,False
81,False,False,False,False,False,False,False
122,False,False,False,False,False,False,False
185,False,False,False,False,False,False,False
226,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...
8702,False,False,False,False,False,False,False
8743,False,False,False,False,False,False,False
8784,False,False,False,False,False,False,False
8825,False,False,False,False,False,False,False
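
A full boolean frame like this is hard to scan by eye. A more compact check, sketched here on invented toy rows, counts the nulls per column. It also counts exact zeros, on the assumption (not confirmed by the data source) that a zero in this dataset may sometimes stand for "nothing reported" rather than a true zero. The name `toy_2020` and its values are hypothetical; on the real data you would use `df_2020`.

```python
import pandas as pd

# Toy stand-in for `df_2020`; values are invented for illustration only.
toy_2020 = pd.DataFrame({
    "Entity": ["A", "B", "C"],
    "Internet Users(%)": [12.0, 0.0, None],
    "Broadband Subscription": [1.5, 0.0, 3.2],
})

# Nulls per column: one number per column instead of a full boolean table.
null_counts = toy_2020.isnull().sum()
print(null_counts)

# Exact zeros per numeric column (assumption: zeros may be placeholders).
numeric = toy_2020.select_dtypes(include="number")
zero_counts = (numeric == 0).sum()
print(zero_counts)
```

A column with no nulls but many zeros would deserve a closer look before clustering on it.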
