# World Internet Status

### This is an exploratory project that compares the internet speeds of various world economies. 

Economic data is GDP per capita from the [World Bank](http://databank.worldbank.org/data/reports.aspx?Code=NY.GDP.PCAP.CD&id=af3ce82b&report_name=Popular_indicators&populartype=series&ispopular=y#). Note that the World Bank data has gone through quite a lot of cleaning, including removing rows whose values are non-numeric (other than NaN) or otherwise don't play nicely with Python.
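As a rough illustration of the kind of cleaning described above, the sketch below coerces non-numeric entries to NaN so that every year column ends up numeric. The two sample rows reuse values shown later in this notebook; the `..` missing-value marker and the names `raw`, `sample`, and `cleaned` are assumptions made for the example, not a record of the exact steps used on the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample; ".." is an assumed missing-value marker.
raw = io.StringIO(
    "Country Name,Country Code,2000 [YR2000],2001 [YR2001]\n"
    "Afghanistan,AFG,..,119.899037\n"
    "Albania,ALB,1175.788981,1326.970339\n"
)

sample = pd.read_csv(raw).set_index(["Country Name", "Country Code"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
cleaned = sample.apply(pd.to_numeric, errors="coerce")
print(cleaned)
```

After this step the year columns are floating point, and the unparseable entry appears as NaN instead of a string.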

The internet speeds and users datasets are downloaded from [Akamai](https://www.akamai.com/us/en/our-thinking/state-of-the-internet-report/).

### <center> 1. Import Python Libraries </center>

In [3]:
%matplotlib inline
import pandas as pd
import numpy as np

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.gridspec as grd
import matplotlib.ticker as tkr
import matplotlib.font_manager as font_manager

from matplotlib.ticker import AutoMinorLocator
from matplotlib.ticker import FuncFormatter
from matplotlib import rcParams

# Plot defaults: minor-tick locators, tick label size, and frame line width
minorLocatorx   = AutoMinorLocator(10)
minorLocatory   = AutoMinorLocator(4)
matplotlib.rc('xtick', labelsize=16)
matplotlib.rc('ytick', labelsize=16)
# matplotlib.rcParams and plt.rcParams are the same object, so set the width once
plt.rcParams['axes.linewidth'] = 4
plt.rc('font', family='serif')
plt.rc('font', serif='Times New Roman') 
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 15
fig_size[1] = 9
plt.rcParams["figure.figsize"] = fig_size

In [4]:
# Try to make nice displays of Pandas tables (table and font styles).
from IPython.display import HTML
css = ''
for path in ('/Users/gmsardane/nikola-blog/stories/style-table.css',
             '/Users/gmsardane/nikola-blog/stories/style-notebook.css'):
    # Read each stylesheet and close the file handle afterwards
    with open(path) as f:
        css += f.read()
HTML('<style>{}</style>'.format(css));

### <center> 2. Load in GDP per capita data. </center>

In [5]:
## World GDP per capita
GDP = pd.read_csv('/Users/gmsardane/datascience_project/PhilippineInternetUsers/GDP_percapita.csv')
#GDP = GDP.drop(GDP.columns[[0,1,3, -1]], axis=1)
GDP = GDP.set_index(['Country Name','Country Code']).dropna()
# Coerce the remaining columns to numeric; unparseable entries become NaN
GDP = GDP.apply(pd.to_numeric, errors='coerce')
GDP = GDP.reset_index()
# First token of each column name, e.g. '2000 [YR2000]' -> '2000'
t = list(GDP.keys())
year = [word.split()[0] for word in t]
GDP.head()

| | Country Name | Country Code | 2000 [YR2000] | 2001 [YR2001] | 2002 [YR2002] | 2003 [YR2003] | 2004 [YR2004] | 2005 [YR2005] | 2006 [YR2006] | 2007 [YR2007] | 2008 [YR2008] | 2009 [YR2009] | 2010 [YR2010] | 2011 [YR2011] | 2012 [YR2012] | 2013 [YR2013] | 2014 [YR2014] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | AFG | NaN | 119.899037 | 192.153528 | 203.651041 | 224.914712 | 257.175795 | 280.245644 | 380.400955 | 384.131681 | 458.955782 | 569.940729 | 622.379654 | 690.842629 | 666.795051 | 633.569247 |
| 1 | Albania | ALB | 1175.788981 | 1326.970339 | 1453.642777 | 1890.681557 | 2416.588235 | 2709.142931 | 3005.012903 | 3603.013685 | 4370.539647 | 4114.136545 | 4094.358832 | 4437.811999 | 4247.485437 | 4411.258241 | 4564.390339 |
| 2 | Algeria | DZA | 1757.011974 | 1732.958517 | 1774.292021 | 2094.893302 | 2600.00652 | 3102.037384 | 3467.54474 | 3939.559939 | 4912.251941 | 3875.822095 | 4473.486446 | 5447.403976 | 5583.61616 | 5491.614414 | 5484.066806 |
| 3 | American Samoa | ASM | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | Andorra | ADO | 21432.96007 | 21897.66294 | 24175.37275 | 31742.99258 | 37235.45003 | 39990.33041 | 42417.22915 | 47253.5298 | 46735.99957 | 42701.44714 | 39639.38602 | 41630.05258 | 39666.36921 | 42806.52255 | NaN |
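The `year` list built in the cell above keeps only the first token of each column name. One possible use for such tokens, sketched here under assumptions, is to shorten the year labels while leaving the country columns alone. The names `gdp_sample`, `short_year`, and `gdp_short` are hypothetical, and the two rows are copied from the table above.

```python
import pandas as pd

# Two rows and two year columns taken from the GDP table above.
gdp_sample = pd.DataFrame({
    "Country Name": ["Albania", "Algeria"],
    "Country Code": ["ALB", "DZA"],
    "2000 [YR2000]": [1175.788981, 1757.011974],
    "2001 [YR2001]": [1326.970339, 1732.958517],
})

def short_year(col):
    """Return '2000' for '2000 [YR2000]'; leave non-year labels unchanged."""
    head = col.split()[0]
    return head if head.isdigit() else col

gdp_short = gdp_sample.rename(columns=short_year)
print(list(gdp_short.columns))
```

Guarding on `isdigit()` matters because splitting `Country Name` and `Country Code` would otherwise collapse both labels to `Country`.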


### <center> Also load in the GDP ranking data. </center>

In [8]:
## GDP Rank
GDP_rank = pd.read_csv('GDP.csv')
#GDP_rank = GDP_rank.set_index(['Country'])
GDP_rank = GDP_rank.sort_values(['Country'])
GDP_rank.head(5)

| | Country Code | Rank | Country | USD_GDP |
|---|---|---|---|---|
| 107 | AFG | 108 | Afghanistan | 20038 |
| 126 | ALB | 127 | Albania | 13212 |
| 48 | DZA | 49 | Algeria | 213518 |
| 161 | ADO | 162 | Andorra | 3249 |
| 57 | AGO | 58 | Angola | 138357 |


### <center> 3. Loading the Internet user and speed data from various nations. </center>

In [9]:
## Internet speed
speed = pd.read_csv('WorldInternetSpeedQ42014.txt')
#speed = speed.set_index('Region')
#speed=speed.sort_index()
speed.head()

| | Region | Unique IPv4 Addresses | Average Connection Speed(Mbps) | Average Peak Connection Speed (Mbps) | Pct Above 4Mbps | Pct Above 10 Mbps | Pct Above 15 Mbps |
|---|---|---|---|---|---|---|---|
| 0 | Argentina | 8199701 | 4.7 | 28.5 | 46.0 | 4.4 | 0.7 |
| 1 | Bolivia | 279921 | 2.0 | 13.2 | 3.3 | 0.3 | 0.1 |
| 2 | Brazil | 47913625 | 4.1 | 30.3 | 39.0 | 2.9 | 0.8 |
| 3 | Canada | 14924241 | 13.1 | 54.9 | 88.0 | 49.0 | 27.0 |
| 4 | Chile | 4750333 | 6.1 | 44.7 | 67.0 | 10.0 | 2.7 |
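The column index printed after the merge further down lists `' Pct Above 10 Mbps'` and `' Pct Above 15 Mbps'` with a leading space, which makes those columns awkward to select by name. A minimal sketch of one way to normalize the labels is shown below. The header layout of the in-memory sample is an assumption about the real file, and `speed_sample` is a hypothetical name. The two data rows come from the table above.

```python
import io
import pandas as pd

# Assumed header layout: a space follows some of the commas.
raw = io.StringIO(
    "Region,Pct Above 4Mbps, Pct Above 10 Mbps, Pct Above 15 Mbps\n"
    "Argentina,46.0,4.4,0.7\n"
    "Bolivia,3.3,0.3,0.1\n"
)

speed_sample = pd.read_csv(raw)

# Strip leading and trailing whitespace from every column label.
speed_sample.columns = speed_sample.columns.str.strip()
print(list(speed_sample.columns))
```

Stripping right after loading means later cells can refer to `Pct Above 10 Mbps` without having to remember the stray space.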


In [10]:
## Internet users
users = pd.read_csv('WorldInternetUsers2014.txt')
users = users.drop(users.columns[[0]], axis=1)
#users = users.set_index('Country')
#users = users.sort_index()
users.head(5)

| | Country | Users | Penetration_percent | Population | Non-Users | Change_from_prev_yr_percent | Change_from_prev_yr_num | Population_Change_percent |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2020998 | 6.4 | 31627506 | 29606508 | 11.6 | 210730 | 3.08 |
| 1 | Albania | 1736695 | 60.1 | 2889676 | 1152981 | 5.3 | 87459 | 0.22 |
| 2 | Algeria | 7043221 | 18.1 | 38934334 | 31891113 | 11.8 | 742509 | 1.96 |
| 3 | Andorra | 69802 | 95.9 | 72786 | 2984 | -2.2 | -1546 | -4.11 |
| 4 | Angola | 5150772 | 21.3 | 24227524 | 19076752 | 15.0 | 672165 | 3.32 |
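Several columns in this table appear to be derived from one another: `Penetration_percent` looks like `Users` divided by `Population`, and `Non-Users` looks like `Population` minus `Users`. That reading is an assumption about how the dataset was built, and the sketch below checks it against the five rows shown above. The names `users_head`, `penetration_ok`, and `non_users_ok` are hypothetical.

```python
import numpy as np
import pandas as pd

# The five rows shown by users.head(5) above.
users_head = pd.DataFrame({
    "Country": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
    "Users": [2020998, 1736695, 7043221, 69802, 5150772],
    "Penetration_percent": [6.4, 60.1, 18.1, 95.9, 21.3],
    "Population": [31627506, 2889676, 38934334, 72786, 24227524],
    "Non-Users": [29606508, 1152981, 31891113, 2984, 19076752],
})

# Recompute penetration; the published figure is rounded to one decimal,
# so allow an absolute tolerance of 0.05 percentage points.
penetration = 100.0 * users_head["Users"] / users_head["Population"]
penetration_ok = np.isclose(penetration, users_head["Penetration_percent"], atol=0.05)

# Non-users should equal population minus users exactly.
non_users_ok = (users_head["Population"] - users_head["Users"]) == users_head["Non-Users"]

print(penetration_ok.all(), non_users_ok.all())
```

Running the same check on the full table would flag any row where the derived columns disagree with the raw counts.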


### <center> 4. Merging the economic and tech data. </center>

In [12]:
## Merge the user and speed tables on country name (the inner join keeps only matches)
dfMerged = pd.merge(users, speed, right_on=['Region'], left_on=['Country'], how='inner')
dfMerged.drop('Region', axis=1, inplace=True)
print(dfMerged.keys())
print("There are {} countries having both economic and tech data available.".format(len(dfMerged)))

Index([u'Country', u'Users', u'Penetration_percent', u'Population',
       u'Non-Users', u'Change_from_prev_yr_percent',
       u'Change_from_prev_yr_num', u'Population_Change_percent',
       u'Unique IPv4 Addresses', u'Average Connection Speed(Mbps)',
       u'Average Peak Connection Speed (Mbps)', u'Pct Above 4Mbps',
       u' Pct Above 10 Mbps', u' Pct Above 15 Mbps'],
      dtype='object')
There are 54 countries having both economic and tech data available.
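An inner join silently drops every country that appears in only one table, including countries whose names are merely spelled differently in the two sources. One way to see what was dropped, sketched here on small hypothetical samples, is an outer join with `indicator=True`, which adds a `_merge` column marking each row as `both`, `left_only`, or `right_only`. The names `users_sample`, `speed_sample`, `check`, and `unmatched` are illustrative. The country names and speeds are taken from the tables above.

```python
import pandas as pd

# Key columns only; names are taken from the tables shown earlier.
users_sample = pd.DataFrame(
    {"Country": ["Afghanistan", "Albania", "Argentina", "Bolivia"]}
)
speed_sample = pd.DataFrame({
    "Region": ["Argentina", "Bolivia", "Brazil"],
    "Average Connection Speed(Mbps)": [4.7, 2.0, 4.1],
})

# The outer join keeps every row; _merge records which side each came from.
check = pd.merge(
    users_sample, speed_sample,
    left_on="Country", right_on="Region",
    how="outer", indicator=True,
)

# Rows present in only one of the two tables.
unmatched = check[check["_merge"] != "both"]
print(check["_merge"].value_counts())
print(unmatched[["Country", "Region", "_merge"]])
```

Inspecting `unmatched` on the real tables would show whether the 54 matched countries reflect genuine gaps in coverage or just naming differences between the two sources.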
