# Assignment 01 | Question 05

**Question**
For the following question you need to submit a link to a recorded video, YouTube link is preferable (can be unlisted). We intend to link to these videos from our public course webpage.

Go through the video at https://www.youtube.com/watch?v=hVimVzgtD6w. There are a number of libraries for creating such visualizations: one example is the GapMinder animation, another is Plotly. Choose any dataset from any of the following websites:
* https://www.gapminder.org/data/
* http://www.healthdata.org/data-visualization/gbd-compare or http://ghdx.healthdata.org/gbd-2017 (under Select Articles there are folders with data).
* https://niti.gov.in/state-statistics.

Take any two parameters, and either a number of Indian states, or a number of countries including India.
Then create such a visualization. We rely on you to choose two parameters that make a somewhat
interesting story, as Hans Rosling does. If you want to use datasets about the pandemic, that is also fine:
either come up with suggestions, or reach out to us.
Note that you sometimes have to be careful about missing data, data formatting, etc.; these are all part of
the problem. Document what problems you faced and what you did to handle them.

---
# Dataset Used

The datasets used are:
1. The number of internet users between the years **1990-2019**
2. The population dataset between the years **1850-2100** (estimated values for the coming years)

> Datasets taken from [GapMinder](https://www.gapminder.org/data/)

---

## Import Libraries

In [147]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
my_dpi = 96

In [48]:
# DO NOT RUN THIS CELL IF PLOTLY ALREADY INSTALLED
!pip install plotly_express

Collecting plotly_express
  Downloading plotly_express-0.4.1-py2.py3-none-any.whl (2.9 kB)
Collecting plotly>=4.1.0
  Downloading plotly-4.14.3-py2.py3-none-any.whl (13.2 MB)
     |████████████████████████████████| 13.2 MB 268 kB/s eta 0:00:01
Collecting retrying>=1.3.3
  Downloading retrying-1.3.3.tar.gz (10 kB)
Building wheels for collected packages: retrying
  Building wheel for retrying (setup.py) ... done
  Created wheel for retrying: filename=retrying-1.3.3-py3-none-any.whl size=11430 sha256=7c69ce9159db5ed9b037417f7a19ed727a65f2b73deb0e9289d7d2f1f4b6fdf8
  Stored in directory: /home/pranshu/.cache/pip/wheels/f9/8d/8d/f6af3f7f9eea3553bc2fe6d53e4b287dad18b06a861ac56ddf
Successfully built retrying
Installing collected packages: retrying, plotly, plotly-express
Successfully installed plotly-4.14.3 plotly-express-0.4.1 retrying-1.3.3


In [49]:
import plotly_express as px

## Import Dataset

In [131]:
pop_data = pd.read_csv("population_total.csv")
num_inter_data = pd.read_csv("net_users_num.csv")
print(pop_data.shape, num_inter_data.shape)

(195, 302) (194, 31)


## Data Formatting

### Removing the *NaN* values 

In the *Number of internet users dataset*, I replaced the *NaN* entries with 0. Any entry with a *NaN* value is therefore treated as if there were no internet users in that country that year.
There were no *NaN* values in the *Total Population Dataset*, so it is left unchanged.

In [None]:
# Replace NaN values in the internet users dataset with 0
# (the population dataset has no NaN values, so its fillna stays commented out)
# pop_data = pop_data.fillna(0)
num_inter_data = num_inter_data.fillna(0)
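
Since replacing *NaN* with 0 treats "no data" as "no users", it can help to count the missing entries before filling them. A minimal sketch on a toy frame, assuming the same wide layout (one row per country, one column per year); the `toy` frame and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the internet users table (values are made up)
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "1990": [np.nan, 10.0, np.nan],
    "1991": [5.0, 20.0, np.nan],
})

# Count missing entries per year column before touching them
missing_per_year = toy.drop(columns="country").isna().sum()

# Same replacement as in the cell above: NaN -> 0
filled = toy.fillna(0)
```

Looking at `missing_per_year` first shows how much of the "zero users" signal in the early years comes from missing data and not from measured values.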

### Removing years
The *Total Population Dataset* covers the years *1850-2100*, while the *Number of internet users dataset* covers only *1990-2019*. I kept only the year columns from *1990-2019* so that both datasets span the same years.

In [133]:
years = []
not_years = []
for year in pop_data.iloc[:,1:]:
    if year in num_inter_data.iloc[:,1:].columns:
        years.append(year)
    else:
        not_years.append(year)
        
years = pd.Series(years)

# Drop years from pop_data
pop_data = pop_data.drop(not_years,axis=1)
print(pop_data.shape, num_inter_data.shape)

(195, 31) (194, 31)
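
The same column selection can also be written without an explicit loop. A minimal sketch on toy frames, assuming both tables share a `country` column and use year strings as column names (the frames and values here are made up):

```python
import pandas as pd

# Toy frames: population covers more years than internet users (made-up values)
pop = pd.DataFrame({"country": ["A"], "1989": [1], "1990": [2], "1991": [3], "1992": [4]})
users = pd.DataFrame({"country": ["A"], "1990": [0], "1991": [1]})

# Boolean mask over pop's columns: True where the column also exists in users
shared = pop.columns.isin(users.columns)

# Keep only the shared columns, in pop's original order
pop_trimmed = pop.loc[:, shared]
```

Here `pop_trimmed` keeps `country`, `1990` and `1991`, mirroring what the loop above does for the real data.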


### Dropping Countries
In the following lines of code, I check which countries are present in the *Total Population Dataset* but not in the *Number of internet users dataset*.

I remove those countries from the *Total Population Dataset*, so that both datasets contain the same countries.

In [134]:
# Drop countries that are not in the internet users dataset
drop_countries = []
for country in pop_data['country']:
    if country not in list(num_inter_data['country']):
        drop_countries.append(country)
for c in drop_countries:
    pop_data = pop_data.drop(pop_data[pop_data['country'] == c].index).reset_index(drop=True)
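
A vectorised version of this filter is possible with `Series.isin`. A minimal sketch on toy frames (names and values are made up), assuming the `country` column holds the matching key in both tables:

```python
import pandas as pd

# Toy frames: country "B" is missing from the internet users table
pop = pd.DataFrame({"country": ["A", "B", "C"], "1990": [1, 2, 3]})
users = pd.DataFrame({"country": ["A", "C"], "1990": [0, 5]})

# True for rows whose country also appears in the internet users table
keep = pop["country"].isin(users["country"])

# Record what gets dropped, then keep the rest
dropped = pop.loc[~keep, "country"].tolist()
pop_common = pop[keep].reset_index(drop=True)
```

Keeping the `dropped` list around makes it easy to document which countries were lost at this step.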

### Narrowing Down the number of countries

> **This step can be discarded or ignored as per user's choice**

In the following lines of code, I remove all the countries that have **fewer than 25M internet users in the year 2019**. Because missing values were replaced with 0 earlier, countries with no 2019 figure are dropped here as well.

I did this to make the final plot clearer. With 194 countries under consideration, the plot was crowded with scatter points that were both unnecessary and messy.

In [146]:
# Drop countries that have less than 25M users in the year 2019
drop_countries = []
for idx, row in num_inter_data.iterrows():
    if row['2019'] < 25000000:
        drop_countries.append(row['country'])

In [136]:
for c in drop_countries:
#     print(num_inter_data.head())
    pop_data = pop_data.drop(pop_data[pop_data['country'] == c].index).reset_index(drop=True)
    num_inter_data = num_inter_data.drop(num_inter_data[num_inter_data['country'] == c].index).reset_index(drop=True)
print(pop_data.shape, num_inter_data.shape)

(32, 31) (32, 31)
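
The threshold filter can also be expressed as a boolean mask applied to both tables. A minimal sketch on toy data (frames and values are made up), using the same 25M cut-off:

```python
import pandas as pd

THRESHOLD = 25_000_000  # same cut-off as in the cells above

# Toy frames with a single year column (made-up values)
users = pd.DataFrame({"country": ["A", "B", "C"],
                      "2019": [30_000_000, 1_000_000, 25_000_000]})
pop = pd.DataFrame({"country": ["A", "B", "C"],
                    "2019": [50_000_000, 4_000_000, 90_000_000]})

# Countries at or above the threshold in 2019 (rows strictly below it are dropped)
big = users.loc[users["2019"] >= THRESHOLD, "country"]

# Apply the same country selection to both tables
users_big = users[users["country"].isin(big)].reset_index(drop=True)
pop_big = pop[pop["country"].isin(big)].reset_index(drop=True)
```

Selecting by country name in both tables keeps them consistent even if their rows are not in the same order.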


## Create New Dataset

In the following lines of code, I create the final dataset, which is used later for plotting.
The dataset has the following columns:
1. **country** - The column containing all the countries
2. **year** - The years from 1990-2019 (For each country)
3. **pop** - The population of a country in a particular year
4. **num_users** - The number of internet users of a country in a particular year

In [121]:
pop_data.iloc[0:1,:].T.iloc[1:,:].reset_index(drop=True)
pop_data.T
print(pop_data.shape, num_inter_data.shape)

(32, 31) (32, 31)


In [137]:
final_df = pd.DataFrame({
    'country':[],
    'year':[],
    'pop':[],
    'num_users':[]
})
# print(final_df)
for i in range(num_inter_data.shape[0]):
    pop_df = pop_data.iloc[i:i+1,:].T.iloc[1:,:].reset_index(drop=True)   
    num_int_users = num_inter_data.iloc[i:i+1,:].T.iloc[1:,:].reset_index(drop=True)
    countries = pd.Series(np.full((pop_data.shape[1]-1,), pop_data['country'][i]))
    temp_df = pd.concat([countries, years, pop_df, num_int_users], axis=1)
    temp_df.columns = ['country','year','pop', 'num_users']
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    final_df = pd.concat([final_df, temp_df])
final_df = final_df.reset_index(drop=True)
final_df

| | country | year | pop | num_users |
|---|---|---|---|---|
| 0 | Algeria | 1990 | 25800000 | 0 |
| 1 | Algeria | 1991 | 26400000 | 0 |
| 2 | Algeria | 1992 | 27000000 | 0 |
| 3 | Algeria | 1993 | 27600000 | 0 |
| 4 | Algeria | 1994 | 28200000 | 102 |
| ... | ... | ... | ... | ... |
| 955 | Vietnam | 2015 | 92700000 | 4.17e+07 |
| 956 | Vietnam | 2016 | 93600000 | 4.96e+07 |
| 957 | Vietnam | 2017 | 94600000 | 5.5e+07 |
| 958 | Vietnam | 2018 | 95500000 | 6.72e+07 |
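
The loop above pairs the two tables by row position, so it relies on both having the same countries in the same order. A possible alternative is to reshape each table with `melt` and join them on `country` and `year`. This is a minimal sketch on toy data (frame names and values are made up), not a description of what the loop does internally:

```python
import pandas as pd

# Toy wide tables; the rows are deliberately in a different order
pop = pd.DataFrame({"country": ["A", "B"], "1990": [10, 20], "1991": [11, 21]})
users = pd.DataFrame({"country": ["B", "A"], "1990": [2.0, 0.0], "1991": [3.0, 1.0]})

# Wide -> long: one row per (country, year)
pop_long = pop.melt(id_vars="country", var_name="year", value_name="pop")
users_long = users.melt(id_vars="country", var_name="year", value_name="num_users")

# Join on the keys, so row order in the inputs does not matter
long_df = (
    pop_long.merge(users_long, on=["country", "year"], how="inner")
    .sort_values(["country", "year"])
    .reset_index(drop=True)
)
```

With `how="inner"`, countries or years missing from either table are dropped automatically, which would fold the earlier year and country filtering steps into the join.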


In [138]:
# Save to a csv file
# NOT NECESSARY
final_df.to_csv("final.csv")

## Plotting
For plotting, I use the **plotly express** library. 

In [139]:
final_df['country'] = pd.Categorical(final_df['country'])

In [140]:
print(final_df['pop'].min(),final_df['pop'].max())

16200000 1430000000


In [142]:
print(final_df['num_users'].min(),final_df['num_users'].max())

0.0 779000000.0


In [143]:
size = np.array(final_df['pop']/20000, dtype="float64")
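
As far as I understand Plotly Express, it rescales the `size` values so that the largest marker matches `size_max`, with marker area proportional to the value. If that holds, dividing the population by a constant should not change how the bubbles look. The NumPy sketch below only mimics that idea; `marker_diameters` is a hypothetical helper, not plotly's actual code:

```python
import numpy as np

def marker_diameters(size_values, size_max):
    """Hypothetical helper: marker diameters when area is proportional to the
    value and the largest marker is size_max pixels wide."""
    size_values = np.asarray(size_values, dtype="float64")
    sizeref = size_values.max() / size_max**2  # scale factor tied to the largest value
    return np.sqrt(size_values / sizeref)

# Population range taken from the min/max printed above, plus one value in between
pop = np.array([16_200_000, 100_000_000, 1_430_000_000])

raw = marker_diameters(pop, size_max=50)
scaled = marker_diameters(pop / 20000, size_max=50)

# Under this model, both inputs give the same diameters
same = np.allclose(raw, scaled)
```

Under this assumption, the division by 20000 is harmless but optional, and `size_max` is the parameter that actually controls how large the bubbles get.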

In [144]:
fig = px.scatter(
    final_df, x="pop", y="num_users",
    animation_frame="year", animation_group="country",
    size=size, color="country", hover_name="country", text="country",
    size_max=50, log_x=True, log_y=False, height=600, width=1000,
    range_x=[16000000, 1440000000], range_y=[0, 779000000],
    labels=dict(pop="Total Population", num_users="Total number of internet users"),
)

In [145]:
# Only `plot` is needed here; the other offline helpers were unused
from plotly.offline import plot

# Writes the animation to temp-plot.html and opens it in the browser
plot(fig)

'temp-plot.html'