# Internet Usage Analysis

## by Justin Sierchio

This Jupyter Notebook examines internet usage worldwide and its association with income per person and urbanization rate.

The data is a .csv file from Kaggle, released under a Public Domain License. It can be downloaded from: https://www.kaggle.com/sansuthi/gapminder-internet/download. Additional related information can be found at: https://www.kaggle.com/sansuthi/gapminder-internet.

## Notebook Initialization

In [1]:
# Import Relevant Libraries
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt

print('Initial libraries loaded into workspace!')

Initial libraries loaded into workspace!


In [2]:
# Upload Datasets for Study
df_INTERNET = pd.read_csv("internet_gapminder.csv")

print('Datasets uploaded!')

Datasets uploaded!


In [3]:
# Open Internet Data Usage dataset and display 1st 5 rows
df_INTERNET.head(5)

       country  incomeperperson  internetuserate  urbanrate
0  Afghanistan              NaN      3.654121623      24.04
1      Albania      1914.996551      44.98994696      46.72
2      Algeria      2231.993335      12.50007331      65.22
3      Andorra       21943.3399             81.0      88.92
4       Angola      1381.004268      9.999953883       56.7


The dataset has four columns: (1) country, (2) income per person, (3) internet use rate, and (4) urbanization rate.

## Data Cleaning

Let's begin by checking whether the dataset needs any cleaning.

In [4]:
df_INTERNET.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213 entries, 0 to 212
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   country          213 non-null    object 
 1   incomeperperson  190 non-null    float64
 2   internetuserate  193 non-null    object 
 3   urbanrate        203 non-null    float64
dtypes: float64(2), object(2)
memory usage: 6.8+ KB


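We see that 'internetuserate' is stored with dtype object rather than float64, which usually means at least one entry could not be parsed as a number when the file was read. We need to convert this column to float64 before doing any numeric analysis.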

In [5]:
# Convert 'internetuserate' to float64; entries that cannot be parsed become NaN
df_INTERNET['internetuserate'] = pd.to_numeric(df_INTERNET['internetuserate'], errors='coerce')
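
Because `errors='coerce'` silently turns unparseable entries into NaN, it can be worth checking which entries are affected. The following is a minimal sketch on a small hypothetical Series (the values, including the `"--"` placeholder, are illustrative assumptions and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical values for illustration only; "--" stands in for whatever
# non-numeric entry caused pandas to read the real column as object.
raw = pd.Series(["3.654121623", "44.98994696", "--", None, "81.0"], dtype="object")

# errors="coerce" replaces anything that cannot be parsed with NaN
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but failed to parse
unparseable = raw[converted.isna() & raw.notna()]

print(converted.dtype)
print(unparseable)
```

Applying the same mask to the raw 'internetuserate' column before overwriting it would show which countries lose a value in the conversion.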

Now let's check for 'NaN' or 'null' values.

In [6]:
# Check dataset for 'NaN' or 'null' values
df_INTERNET.isnull().sum()

country             0
incomeperperson    23
internetuserate    21
urbanrate          10
dtype: int64

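Note that 'internetuserate' now has 21 missing values rather than the 20 implied by the earlier info() output (213 - 193), because the conversion coerced one non-numeric entry to NaN. We'll drop every country that is missing a value in any of the three numeric columns.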

In [7]:
# Remove rows containing any null values from the internet usage dataset
df_INTERNET2 = df_INTERNET.dropna()

# Confirm that no null values remain
df_INTERNET2.isnull().sum()

country            0
incomeperperson    0
internetuserate    0
urbanrate          0
dtype: int64
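
Since `dropna()` removes a row as soon as any one column is missing, the number of dropped rows is not simply the sum of the per-column null counts. As a rough sketch, the five rows shown by `head(5)` earlier can be used to see which rows are removed and how many (the names `sample`, `incomplete`, and `cleaned` are illustrative and not part of the notebook):

```python
import numpy as np
import pandas as pd

# First five rows as displayed by df_INTERNET.head(5)
sample = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
    "incomeperperson": [np.nan, 1914.996551, 2231.993335, 21943.3399, 1381.004268],
    "internetuserate": [3.654121623, 44.98994696, 12.50007331, 81.0, 9.999953883],
    "urbanrate": [24.04, 46.72, 65.22, 88.92, 56.7],
})

# Rows with at least one missing value are exactly the rows dropna() removes
incomplete = sample[sample.isnull().any(axis=1)]
cleaned = sample.dropna()

print(f"Dropped {len(sample) - len(cleaned)} of {len(sample)} rows")
print(incomplete["country"].tolist())
```

Here only Afghanistan is removed, because its income per person is missing.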

In [8]:
df_INTERNET2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 182 entries, 1 to 212
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   country          182 non-null    object 
 1   incomeperperson  182 non-null    float64
 2   internetuserate  182 non-null    float64
 3   urbanrate        182 non-null    float64
dtypes: float64(3), object(1)
memory usage: 7.1+ KB


With 182 of the original 213 countries remaining and all three numeric columns stored as float64, the data is clean enough for further analysis.

## Exploratory Data Analysis

Let's begin by looking at some basic characteristics of the countries in the cleaned dataset.

In [9]:
# Sort Countries by Internet Usage Rate
df_INTERNET_explore1 = df_INTERNET2[['country','internetuserate']]
df_INTERNET_explore1_sortUse = df_INTERNET_explore1.sort_values('internetuserate', ascending=False)
print('Top 20 Countries (w/ complete data) for Internet Usage Rates:\n')
print(df_INTERNET_explore1_sortUse.head(20))

Top 20 Countries (w/ complete data) for Internet Usage Rates:

                  country  internetuserate
85                Iceland        95.638113
144                Norway        93.277508
136           Netherlands        90.703555
111            Luxembourg        90.079527
184                Sweden        90.016190
50                Denmark        88.770254
63                Finland        86.898845
202        United Kingdom        84.731705
20                Bermuda        84.654514
139           New Zealand        83.002584
69                Germany        82.526898
100           Korea, Rep.        82.515928
185           Switzerland        82.166660
156                 Qatar        81.590397
32                 Canada        81.338393
3                 Andorra        81.000000
5     Antigua and Barbuda        80.645455
109         Liechtenstein        80.000000
174       Slovak Republic        79.889777
201  United Arab Emirates        77.996781


As we can see, 13 of the top 20 countries for internet usage are in Europe, with the remainder spread across North America, the Caribbean, the Middle East, East Asia, and Oceania.
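
The sort-and-slice pattern in the ranking cell above can also be written with `DataFrame.nlargest`, which returns the top rows already sorted in descending order. Below is a minimal sketch; `top_n` is a hypothetical helper name, and the sample rows are the four complete rows from the `head(5)` output rather than the full dataset:

```python
import pandas as pd

def top_n(df, column, n=20):
    """Return the n rows with the largest values in `column` (hypothetical helper)."""
    return df[["country", column]].nlargest(n, column)

# Complete rows taken from the head(5) output shown earlier
sample = pd.DataFrame({
    "country": ["Albania", "Algeria", "Andorra", "Angola"],
    "incomeperperson": [1914.996551, 2231.993335, 21943.3399, 1381.004268],
    "internetuserate": [44.98994696, 12.50007331, 81.0, 9.999953883],
    "urbanrate": [46.72, 65.22, 88.92, 56.7],
})

top3 = top_n(sample, "internetuserate", n=3)
print(top3)
```

The same helper could be called with 'incomeperperson' or 'urbanrate' instead of repeating the cell for each column.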

Let's look at income per person and urbanization rates.

In [10]:
# Sort Countries by Income per Person
df_INTERNET_explore2 = df_INTERNET2[['country','incomeperperson']]
df_INTERNET_explore2_sortUse = df_INTERNET_explore2.sort_values('incomeperperson', ascending=False)
print('Top 20 Countries (w/ complete data) for Income per Person:\n')
print(df_INTERNET_explore2_sortUse.head(20))

Top 20 Countries (w/ complete data) for Income per Person:

              country  incomeperperson
109     Liechtenstein      81647.10003
20            Bermuda      62682.14701
111        Luxembourg      52301.58718
144            Norway      39972.35277
94              Japan      39309.47886
185       Switzerland      37662.75125
203     United States      37491.17952
83   Hong Kong, China      35536.07247
85            Iceland      33945.31442
156             Qatar      33931.83208
112      Macao, China      33923.31387
173         Singapore      32535.83251
184            Sweden      32292.48298
50            Denmark      30532.27704
202    United Kingdom      28033.48928
90            Ireland      27595.09135
63            Finland      27110.73159
10            Austria      26692.98411
136       Netherlands      26551.84424
32             Canada      25575.35262


In [11]:
# Sort Countries by Urbanization Rate
df_INTERNET_explore3 = df_INTERNET2[['country','urbanrate']]
df_INTERNET_explore3_sortUse = df_INTERNET_explore3.sort_values('urbanrate', ascending=False)
print('Top 20 Countries (w/ complete data) for Urbanization Rate:\n')
print(df_INTERNET_explore3_sortUse.head(20))

Top 20 Countries (w/ complete data) for Urbanization Rate:

              country  urbanrate
173         Singapore     100.00
83   Hong Kong, China     100.00
112      Macao, China     100.00
20            Bermuda     100.00
155       Puerto Rico      98.32
17            Belgium      97.36
156             Qatar      95.64
119             Malta      94.26
207         Venezuela      93.32
204           Uruguay      92.30
85            Iceland      92.26
6           Argentina      92.00
91             Israel      91.66
202    United Kingdom      89.94
3             Andorra      88.92
9           Australia      88.74
13            Bahrain      88.52
37              Chile      88.44
51           Djibouti      87.30
105           Lebanon      86.96


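The urbanization ranking is far more geographically diverse, with only 5 of the top 20 countries in Europe. The income ranking is still mostly European (12 of 20), though it also includes several Asian and North American economies.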
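
Rankings show which countries lead on each measure, but not how the measures move together. One possible next step toward the association named in the introduction is a pairwise correlation matrix. The sketch below only illustrates the call: it uses the four complete rows from the `head(5)` output, which are far too few to draw conclusions from, and the names `sample`, `cols`, and `corr` are illustrative assumptions:

```python
import pandas as pd

# Complete rows taken from the head(5) output shown earlier;
# too small a sample to interpret, used only to demonstrate the call
sample = pd.DataFrame({
    "country": ["Albania", "Algeria", "Andorra", "Angola"],
    "incomeperperson": [1914.996551, 2231.993335, 21943.3399, 1381.004268],
    "internetuserate": [44.98994696, 12.50007331, 81.0, 9.999953883],
    "urbanrate": [46.72, 65.22, 88.92, 56.7],
})

cols = ["incomeperperson", "internetuserate", "urbanrate"]

# Pearson correlation (the pandas default) between each pair of numeric columns
corr = sample[cols].corr()
print(corr)
```

Running the same two lines on `df_INTERNET2[cols]` would give the matrix for all 182 countries in the cleaned dataset.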