## Que5 - Report

For this problem, I created an animation with the Plotly Express library showing the percentage of internet users in each country (India included) versus its income per person (GDP per capita), year by year from $1988$ to $2017$. To see the animation generated by this code, open the file *temp-plot.html*; a video of it has been uploaded here - *https://www.youtube.com/watch?v=WbqUjpbGWXw*

### Problems faced and how I handled them - 

#### *1. Data set collection*

To make the animation I needed these attributes for each country and each year - continent, population in that year, income per person (GDP per capita) in that year, and percentage of internet users in that year relative to the population in that year. This data was not all available in a single plain CSV file. To handle this, I downloaded several files from https://www.gapminder.org/data/ that together make up the complete data - 

- *que5_country_name_with_continent.csv*, which contains the continent of each country.
- *que5_population.csv*, which contains the population of each country, year by year.
- *que5_income_per_person.csv*, which contains the income per person of each country, year by year.
- *que5_internet_users.csv*, which contains the percentage of internet users in each country (out of that country's total population in that year), year by year.

#### *2. Making data consistent*

The number of *unique* countries is 193 in *que5_income_per_person.csv*, 194 in *que5_internet_users.csv*, and 195 in *que5_population.csv*, but only 142 in *que5_country_name_with_continent.csv*. In other words, some countries are missing from the continent data, and some countries have no income or internet users data for any year, so the files do not all cover the same set of countries. 

To handle this, I first chose a year range from $1988$ to $2017$ (because most of the data is available for these years). Then, for each file, I extracted the values for these years for each country and saved them in nested dictionaries named *population_data_country_and_year_wise*, *internet_users_data_country_and_year_wise*, and *income_per_person_data_country_and_year_wise*. 
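As a side note, the same country-to-year nested dictionary can be built without an explicit index counter. The sketch below assumes the wide Gapminder layout (one `country` column plus one column per year); the toy values and the name `population_by_country` are made up for illustration and are not part of the original code.

```python
import pandas as pd

# Shortened year range, just for this example
years = [str(year) for year in range(1988, 1991)]

# Hypothetical toy frame in the wide layout: one row per country, one column per year
data_population = pd.DataFrame({
    "country": ["India", "Chile"],
    "1987": [1, 2],
    "1988": [10, 20],
    "1989": [11, 21],
    "1990": [12, 22],
})

# Keep only the chosen years, then convert to {country: {year: value}}
population_by_country = (
    data_population.set_index("country")[years].to_dict(orient="index")
)

print(population_by_country["India"]["1989"])
```

Selecting `[years]` before the conversion drops columns outside the chosen range, so `"1987"` does not appear in the inner dictionaries.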

Next, to handle the problem of missing countries, I kept only those countries that appear in all the files, i.e. I took the intersection of the country sets of the four files, and found that only $133$ countries are present in all of them.


Note - The missing data here is not the case where a country lacks values for some years (that is handled in part 3). It is the case where a country has no data for **all the years**, i.e. the entire country is missing from the file.
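The country intersection described above can also be expressed with Python sets. This is a minimal sketch on made-up country lists (the four list names are hypothetical stand-ins for the `country` column of each file), not the code used to produce the result of $133$.

```python
# Hypothetical country lists, one per input file
population_countries = ["India", "Nepal", "Chile", "Kosovo"]
internet_countries = ["India", "Nepal", "Chile"]
income_countries = ["India", "Chile", "Kosovo"]
continent_countries = ["India", "Chile"]

# A country is kept only if it appears in every file
common = (
    set(population_countries)
    & set(internet_countries)
    & set(income_countries)
    & set(continent_countries)
)

# Sort so that the order is reproducible from run to run
countries = sorted(common)
print(countries)  # ['Chile', 'India']
```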

#### *3. Missing/NULL values*

Even after keeping only the countries present in all the files, the values for some years were NULL / missing. These cannot simply be replaced with 0, since that would suggest the true value is 0.


To handle this, for each year and each attribute I calculated the mean over all the chosen countries whose value is not NULL, and assigned this mean to the countries whose value is NULL, i.e. a missing value for a country is set to the average of the available countries in that year. 
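A small worked example of this mean imputation for a single year and a single attribute is sketched below. The helper name `fill_missing_with_mean` and the values are hypothetical. The guard for a year with no known values is an addition of this sketch, since dividing by a count of zero would otherwise raise an error.

```python
import math


def fill_missing_with_mean(values):
    """Return a copy of `values` (country -> number) with NaN entries
    replaced by the mean of the non-NaN entries."""
    known = [v for v in values.values() if not math.isnan(v)]
    if not known:
        # Nothing to average over: leave the values unchanged
        return dict(values)
    mean = sum(known) / len(known)
    return {c: (mean if math.isnan(v) else v) for c, v in values.items()}


# Hypothetical internet-user percentages for one year; country B is missing
internet_users_one_year = {"A": 0.5, "B": float("nan"), "C": 1.5}

filled = fill_missing_with_mean(internet_users_one_year)
print(filled)  # {'A': 0.5, 'B': 1.0, 'C': 1.5}
```

Here the mean of the known values is $(0.5 + 1.5) / 2 = 1.0$, so country B receives 1.0 while A and C keep their original values.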

## Code - 

After preprocessing, the data is saved in the file *que5_cleaned_dataset.csv*. The code for generating this file and the animation is below - 

In [99]:
# Import libraries
import pandas as pd
import math
import plotly.graph_objs as go
import csv
from plotly.offline import init_notebook_mode, plot, iplot, download_plotlyjs
import matplotlib.pyplot as plt
import plotly_express as px

# Importing data
data_population = pd.read_csv('que5_population.csv')
data_internet_users = pd.read_csv('que5_internet_users.csv')
data_income_per_person = pd.read_csv('que5_income_per_person.csv')
data_country_name_with_continent = pd.read_csv('que5_country_name_with_continent.csv')

# Selecting a year range (1988 to 2017 inclusive)
years = [str(year) for year in range(1988, 2018)]

# Dictionary to store continent name against a country
country_with_continent_name = dict(zip(data_country_name_with_continent['country'], data_country_name_with_continent['continent']))

print("No. of unique countries in que5_income_per_person.csv -  "+str(len(data_income_per_person)))
print("No. of unique countries in que5_internet_users.csv -  "+str(len(data_internet_users)))
print("No. of unique countries in que5_population.csv -  "+str(len(data_population)))
print("No. of unique countries in que5_country_name_with_continent.csv -  "+str(len(country_with_continent_name)))

No. of unique countries in que5_income_per_person.csv -  193
No. of unique countries in que5_internet_users.csv -  194
No. of unique countries in que5_population.csv -  195
No. of unique countries in que5_country_name_with_continent.csv -  142


In [102]:
# Dictionary to store population data country and year wise
population_data_country_and_year_wise = {}
index = 0
for country in data_population['country']:    
    # Making nested dictionary to store data yearwise for this country
    population_data_country_and_year_wise[country] = {}    
    for year in years:        
        population_data_country_and_year_wise[country][year] = data_population[year][index]        
    index+=1  
    
# Dictionary to store internet users data country and year wise
internet_users_data_country_and_year_wise = {}
index = 0
for country in data_internet_users['country']:    
    # Making nested dictionary to store data yearwise for this country
    internet_users_data_country_and_year_wise[country] = {}    
    for year in years:        
        internet_users_data_country_and_year_wise[country][year] = data_internet_users[year][index]        
    index+=1  
    
# Dictionary to store income per person data country and year wise
income_per_person_data_country_and_year_wise = {}
index = 0
for country in data_income_per_person['country']:   
    # Making nested dictionary to store data yearwise for this country
    income_per_person_data_country_and_year_wise[country] = {}    
    for year in years:        
        income_per_person_data_country_and_year_wise[country][year] = data_income_per_person[year][index]        
    index+=1  
    
# Choosing only those countries which are intersection of all the data available
countries = []
for country in  population_data_country_and_year_wise:    
    if (country in internet_users_data_country_and_year_wise) and (country in income_per_person_data_country_and_year_wise) and (country in country_with_continent_name):
        countries.append(country)    
        
print("Only "+str(len(countries))+" countries are present in all the files.")    

# Handling NULL values - for each year--> calculate mean value by taking into consideration all the countries where value is not NULL, now assign this value to the countries that have NULL value
for year in years:
    
    # first index denotes sum (numerator), second index denotes count (denominator)
    mean_data = {'population':[0, 0], 'internet_users':[0, 0],'income_per_person':[0, 0]}
    
    # find mean_data
    for country in countries:
        if not math.isnan(population_data_country_and_year_wise[country][year]):                        
            mean_data['population'][0]+=population_data_country_and_year_wise[country][year]
            mean_data['population'][1]+=1
        if not math.isnan(internet_users_data_country_and_year_wise[country][year]):
            mean_data['internet_users'][0]+=internet_users_data_country_and_year_wise[country][year]
            mean_data['internet_users'][1]+=1
        if not math.isnan(income_per_person_data_country_and_year_wise[country][year]):
            mean_data['income_per_person'][0]+=income_per_person_data_country_and_year_wise[country][year]
            mean_data['income_per_person'][1]+=1       
    
    # replace the values that were NULL with the mean for this year
    for country in countries:
        if math.isnan(population_data_country_and_year_wise[country][year]):                        
            population_data_country_and_year_wise[country][year] = mean_data['population'][0]/mean_data['population'][1]            
        if math.isnan(internet_users_data_country_and_year_wise[country][year]):
            internet_users_data_country_and_year_wise[country][year] = mean_data['internet_users'][0]/mean_data['internet_users'][1]
        if math.isnan(income_per_person_data_country_and_year_wise[country][year]):
            income_per_person_data_country_and_year_wise[country][year] = mean_data['income_per_person'][0]/mean_data['income_per_person'][1]
        
        
# Save this clean data set as a csv file
with open('que5_cleaned_dataset.csv','w', newline='') as cleaned_dataset:
    writer = csv.writer(cleaned_dataset)
    writer.writerow(['country','year','population','continent','income_per_person','internet_users'])
    for year in years:
        for country in countries:            
            writer.writerow([country,year,population_data_country_and_year_wise[country][year],country_with_continent_name[country],income_per_person_data_country_and_year_wise[country][year],internet_users_data_country_and_year_wise[country][year]])    
    
    
# open this file
data = pd.read_csv("que5_cleaned_dataset.csv")

# plot animations - animations will be formed in an html file named - temp-plot.html
fig=px.scatter(data, x="internet_users", y="income_per_person",size="population",color="continent",animation_frame="year",hover_name="country",animation_group="country",size_max=70)
plot(fig)

Only 133 countries are present in all the files.


'temp-plot.html'