## Data from World Happiness Report

The World Happiness Report is an annual publication of the United Nations Sustainable Development Solutions Network. It contains articles and rankings of national happiness based on respondents' ratings of their own lives, which the report also correlates with various life factors.

In this notebook we will explore the happiness of different countries and the features associated with it.
The datasets that we will use are available in *Data*: **happiness2020.csv** and **countries_info.csv**.

Although the features are self-explanatory, here is a summary:

**happiness2020.csv**
* country: *Name of the country*
* happiness_score: *Happiness score*
* social_support: *Social support (mitigation of the effects of inequality)*
* healthy_life_expectancy: *Healthy Life Expectancy*
* freedom_of_choices: *Freedom to make life choices*
* generosity: *Generosity (charity, volunteers)*
* perception_of_corruption: *Corruption Perception*
* world_region: *Area of the world of the country*

**countries_info.csv**
* country_name: *Name of the country*
* area: *Area in sq mi*
* population: *Number of people*
* literacy: *Literacy percentage*

In [2]:
DATA_FOLDER = 'Data/'

HAPPINESS_DATASET = DATA_FOLDER+"happiness2020.csv"
COUNTRIES_DATASET = DATA_FOLDER+"countries_info.csv"

## Task 1: Load the data

Load the 2 datasets in Pandas dataframes (called *happiness* and *countries*), and show the first rows.

In [3]:
# Write your code here
import numpy as np
import pandas as pd

# Read both files from the paths defined above instead of a machine-specific absolute path
happiness = pd.read_csv(HAPPINESS_DATASET)
countries = pd.read_csv(COUNTRIES_DATASET)

# print() each head so that both dataframes are shown, not only the last expression
print(happiness.head())
print("----------------")
print(countries.head())
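
The literacy values seem to be written with a decimal comma (for example `'100,0'`). If that is the case, one option is to parse them as numbers at load time with the `decimal` parameter of `pd.read_csv`. The sketch below uses a made-up two-row CSV (`sample_csv`, with invented country names and values) and assumes the real file quotes such fields; it is an illustration, not the actual dataset.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the layout of countries_info.csv (values are invented)
sample_csv = io.StringIO(
    'country_name,area,population,literacy\n'
    'aland,1000,50000,"100,0"\n'
    'bland,2000,750000,"47,5"\n'
)

# decimal=',' tells pandas to read "47,5" as the float 47.5
sample = pd.read_csv(sample_csv, decimal=',')
print(sample.dtypes)
print(sample)
```

With the column parsed as a float, comparisons such as `sample['literacy'] < 50` work without any string handling.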

## Task 2: Let's merge the data

Create a dataframe called *country_features* by merging *happiness* and *countries*. A row of this dataframe must describe all the features that we have about a country.



In [4]:
# Write your code here
# Lower-case the happiness country names so they match country_name in countries
happiness['country'] = happiness['country'].str.lower()
# Inner merge: one row per country present in both datasets, with all of its features
country_features = countries.merge(happiness, left_on="country_name", right_on="country")
country_features.head()
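
An inner merge silently drops every country whose name does not match on both sides. A possible way to check for such losses is an outer merge with `indicator=True`, sketched here on invented toy frames (`toy_countries` and `toy_happiness` are hypothetical and only stand in for the real data).

```python
import pandas as pd

# Invented toy frames; the real data comes from the two dataset files
toy_countries = pd.DataFrame({'country_name': ['aland', 'bland', 'cland'],
                              'population': [50000, 750000, 3000000]})
toy_happiness = pd.DataFrame({'country': ['Aland', 'Bland', 'Dland'],
                              'happiness_score': [7.1, 5.4, 4.2]})

# Normalise the key the same way on both sides before merging
toy_happiness['country'] = toy_happiness['country'].str.lower()

# An outer merge with indicator=True adds a _merge column telling where each row came from
check = toy_countries.merge(toy_happiness, left_on='country_name',
                            right_on='country', how='outer', indicator=True)

# Rows that are not 'both' exist in only one of the two frames
unmatched = check[check['_merge'] != 'both']
print(unmatched[['country_name', 'country', '_merge']])
```

If `unmatched` is empty on the real data, the inner merge loses nothing; otherwise the listed names are the ones to reconcile.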

## Task 3: Where are people happier?

Print the top 10 countries based on their happiness score (high is better).

In [5]:
# Write your code here
# Top 10 countries by happiness score, highest first
ref_df = happiness.sort_values(by='happiness_score', ascending=False)
print(ref_df[['country', 'happiness_score']].head(10))

We want to know in which world region people are happier.

Create and print a dataframe with (1) the average happiness score and (2) the number of countries for each world region.
Sort the result to show the happiness ranking.

In [6]:
# Write your code here
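# Average happiness score and number of countries per world region, best region first
ref3_frame = happiness.groupby('world_region')['happiness_score'].agg(happiness_score='mean', n_countries='count')
ref31 = ref3_frame.sort_values(by='happiness_score', ascending=False)
ref31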

The first region has only a few countries! Which are they and what are their scores?

In [7]:
# Write your code here
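# Number of rows (countries) per world region
ref4_frame = happiness.groupby('world_region').count()
# Region with the highest average happiness score, and the countries it contains
top_region = happiness.groupby('world_region')['happiness_score'].mean().idxmax()
happiness.loc[happiness['world_region'] == top_region, ['country', 'happiness_score']]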


## Task 4: How literate is the world?

Print the names of the countries with a literacy level of 100%.

For each country, print the name and the world region with the format: *{region name} - {country name} ({happiness score})*

In [8]:
# Write your code here
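# literacy appears to be stored as text with a decimal comma (e.g. '100,0'), so convert it to float first
literacy = country_features['literacy'].astype(str).str.replace(',', '.').astype(float)
fully_literate = country_features[literacy == 100.0]
# Required format: {region name} - {country name} ({happiness score})
for _, row in fully_literate.iterrows():
    print(f"{row['world_region']} - {row['country_name']} ({row['happiness_score']})")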

What is the global average?

In [9]:
# Write your code here
# Convert the literacy column (assumed to use a decimal comma, e.g. '100,0') to float
literacy = countries['literacy'].astype(str).str.replace(',', '.').astype(float)
# Unweighted mean over countries; missing values are skipped
literacy.mean()

Calculate the proportion of countries with a literacy level below 50%. Print the value in percentage, formatted with 2 decimals.

In [10]:
# Write your code here
# Convert the literacy column (assumed to use a decimal comma, e.g. '100,0') to float
literacy = countries['literacy'].astype(str).str.replace(',', '.').astype(float)

# Share of countries whose literacy level is below 50%
percentage = (literacy < 50.0).sum() / len(countries) * 100
print(f"{percentage:.2f}%")

Print the raw number and the percentage of world population that is illiterate.

In [11]:
# Write your code here
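# Illiterate people per country = population * (1 - literacy / 100), with literacy converted to float
literacy = countries['literacy'].astype(str).str.replace(',', '.').astype(float)
known = literacy.notna()  # ignore countries without a literacy value
illiterate = (countries.loc[known, 'population'] * (1 - literacy[known] / 100)).sum()
percentage = illiterate / countries.loc[known, 'population'].sum() * 100
print(f"Illiterate people: {illiterate:.0f} ({percentage:.2f}% of the population)")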
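
The share of illiterate people is a population-weighted quantity, so it generally differs from what the unweighted country average suggests. A small worked example on invented numbers (the `toy` frame is hypothetical) makes the difference visible.

```python
import pandas as pd

# Invented toy data: a small, highly literate country and a large, less literate one
toy = pd.DataFrame({'country_name': ['aland', 'bland'],
                    'population': [1_000_000, 9_000_000],
                    'literacy': [90.0, 50.0]})

# Unweighted mean over countries: every country counts the same
simple_mean = toy['literacy'].mean()

# Population-weighted view: count illiterate people, then divide by the total population
illiterate_people = (toy['population'] * (1 - toy['literacy'] / 100)).sum()
illiterate_share = illiterate_people / toy['population'].sum() * 100

print(f"Mean literacy over countries: {simple_mean:.2f}%")
print(f"Illiterate people: {illiterate_people:.0f} ({illiterate_share:.2f}% of the population)")
```

Here the mean literacy over countries is 70%, which might suggest 30% illiteracy, while 46% of the people are actually illiterate because the larger country dominates.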


## Task 5: Population density

Add to the dataframe a new field called *population_density*.

In [12]:
# Write your code here
# Population density = inhabitants per square mile (area is given in sq mi)
country_features['population_density'] = country_features['population'] / country_features['area']
country_features[['country_name', 'population_density']].head()

What is the happiness score of the 3 countries with lowest population density?

In [13]:
# Write your code here
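# Recompute the density here so that this cell does not depend on the previous one
density = country_features['population'] / country_features['area']
country_features.loc[density.nsmallest(3).index, ['country_name', 'happiness_score']]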

## Task 6: Health and happiness?

Show happiness score (x) vs. healthy life expectancy (y) in a proper plot.

In [14]:
# Write your code here
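import matplotlib.pyplot as plt
plt.scatter(country_features['happiness_score'], country_features['healthy_life_expectancy'], alpha=0.5)
# Label the axes so the plot can be read on its own
plt.xlabel('Happiness score')
plt.ylabel('Healthy life expectancy')
plt.show()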

## Task 7: Healthy-Happy Hypothesis?

What hypothesis can you come up with by looking at the plot?

What kind of analysis can we use to understand how credible our hypothesis is?

Use scikit-learn to do the analysis and describe briefly how well one can argue that the hypothesis holds.

In [15]:
# Write your code here 
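from sklearn.linear_model import LinearRegression
# The scatter plot suggests a strong positive correlation between the two columns.
# Hypothesis: countries with a higher healthy life expectancy have a higher happiness score.
# Null hypothesis (H0): the two variables are not linearly related. Alternative (H1): they are positively related.
data = country_features[['healthy_life_expectancy', 'happiness_score']].dropna()
X, y = data[['healthy_life_expectancy']], data['happiness_score']
model = LinearRegression().fit(X, y)
# A positive slope with an R^2 close to 1 supports the hypothesis; an R^2 near 0 does not
print(f"slope = {model.coef_[0]:.3f}, R^2 = {model.score(X, y):.3f}")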
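
The null hypothesis of "no linear relation" can also be tested directly with a Pearson correlation test. The sketch below uses `scipy.stats.pearsonr` (SciPy rather than scikit-learn, which is a substitution on my part) on synthetic data; the arrays `life_expectancy` and `happiness_score` are invented stand-ins, not the notebook's columns.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-in data (invented): happiness rises with life expectancy plus noise
rng = np.random.default_rng(0)
life_expectancy = rng.uniform(50, 80, size=100)
happiness_score = 0.1 * life_expectancy + rng.normal(0, 0.5, size=100)

# pearsonr returns the correlation coefficient and the two-sided p-value
# for the null hypothesis that the two variables are uncorrelated
r, p_value = pearsonr(life_expectancy, happiness_score)
print(f"r = {r:.3f}, p = {p_value:.3g}")
```

A small p-value lets us reject the null hypothesis of no correlation; it says nothing about causation.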

## Task 8: Region and happiness?

Show happiness vs. the region of the countries in a proper plot.

What plot would you use and why?

You might need to use the seaborn package!

In [16]:
# Write your code here
# Bar plot of the average happiness score per world region, highest first
ax = ref31['happiness_score'].sort_values(ascending=False).plot.bar()
ax.set_ylabel('Average happiness score')
ax
# A bar plot fits here because the x-axis holds distinct categories (world regions), while the y-axis (happiness score) is continuous.
# A box plot per region (for example with seaborn) would additionally show the spread within each region.

## Task 9: Region-Happiness category Hypothesis?

Now let's categorize countries based on whether they are happy countries or not.

Use the mean value of the happiness score to categorize countries as happy/unhappy, and add a column to the dataset expressing whether the country is happy or not.

Now we want to test the hypothesis that the region of a country correlates with whether its people are happy or unhappy. Design a proper test to examine this hypothesis and analyze your result.

In [17]:
# Write your code here
# The test should examine whether the region of a country correlates with its happy/unhappy category.
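
One common choice for testing whether two categorical variables are related is a chi-square test of independence on their contingency table. The following is a minimal sketch on invented data, assuming SciPy is available; the `toy` frame, its region names and its scores are hypothetical, and the `happy` column is built from the mean as the task describes.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented toy data with two regions; the real analysis would use the merged dataframe
toy = pd.DataFrame({
    'world_region': ['North', 'North', 'North', 'North',
                     'South', 'South', 'South', 'South'],
    'happiness_score': [7.2, 6.9, 7.5, 6.8, 4.1, 4.6, 3.9, 5.0],
})

# A country is "happy" when its score is above the mean score
toy['happy'] = toy['happiness_score'] > toy['happiness_score'].mean()

# Contingency table: regions in rows, happy/unhappy counts in columns
contingency = pd.crosstab(toy['world_region'], toy['happy'])

# H0: region and happiness category are independent
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(contingency)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}, dof = {dof}")
```

A p-value below the chosen significance level (0.05 is a common choice) would lead us to reject independence. With very small expected counts, as in this toy table, the chi-square approximation is unreliable, so the result should be read with care.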

## Task 10: Happiness vs. region and Health?

We now want to see the influence of health on happiness in different regions of the world.

Use your visualization skills to show the relation between happiness and health in different parts of the world.

In [18]:
# Write your code here
import seaborn as sns
import matplotlib.pyplot as plt

# One panel per world region, each with a regression line of happiness on healthy life expectancy
sns.lmplot(data=country_features, x='healthy_life_expectancy', y='happiness_score',
           col='world_region', col_wrap=4, height=3)
plt.show()


## Task 11: Population category?

Now let's group countries by population. Categorize countries based on the following criteria and add a column to the dataset containing their group.

Group 0 -> countries with population less than 1 million 

Group 1 -> countries with population between 1 million and 10 million 

Group 2 -> countries with population between 10 million and 100 million 

Group 3 -> countries with population between 100 million and 1000 million

In [19]:
# Write your code here
import pandas as pd

# Bin edges at 1M, 10M, 100M and 1000M; right=False makes each bin [lower, upper),
# so a country with exactly 1 million people falls into Group 1, not Group 0
bins = [0, 1e6, 1e7, 1e8, 1e9]
labels = ['Group 0', 'Group 1', 'Group 2', 'Group 3']
countries['group'] = pd.cut(countries['population'], bins=bins, labels=labels, right=False)
# Countries with 1000 million people or more match no group and get NaN
print(countries)
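
The group boundaries are easy to get wrong at the edges ("less than 1 million" excludes exactly 1 million). One way to express the groups is `pd.cut` with half-open bins, sketched here on a few invented population values chosen to sit on or near the boundaries.

```python
import pandas as pd

# Invented population values chosen to sit on or near the group boundaries
populations = pd.Series([999_999, 1_000_000, 10_000_000, 250_000_000, 1_400_000_000])

# right=False makes every bin closed on the left and open on the right: [lower, upper)
groups = pd.cut(populations, bins=[0, 1e6, 1e7, 1e8, 1e9],
                labels=['Group 0', 'Group 1', 'Group 2', 'Group 3'], right=False)

print(pd.DataFrame({'population': populations, 'group': groups}))
```

The value just below 1 million lands in Group 0, exactly 1 million in Group 1, and the value above 1000 million gets no group (NaN), which matches the criteria as stated.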

## Task 12: Happiness vs. population category and world region?

Visualize the happiness score of countries based on their population category and world region.

What kind of plot would you use? Why?

In [20]:
# Write your code here
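import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Population groups as defined in Task 11, computed here on the merged dataframe
groups = pd.cut(country_features['population'], bins=[0, 1e6, 1e7, 1e8, 1e9],
                labels=['Group 0', 'Group 1', 'Group 2', 'Group 3'], right=False)
# Grouped box plot: it shows the distribution of a continuous score across two categorical variables
sns.boxplot(data=country_features.assign(group=groups), x='happiness_score', y='world_region', hue='group')
plt.show()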



# Quiz
### Question 1: What visualizations are suitable for getting insights about the distribution of a single continuous variable?

- a) Barplot and histogram
- b) Boxplot and histogram (Correct)
- c) Scatterplot and boxplot
- d) Barplot, boxplot, and histogram


### Question 2: Complex ML models...
 - a) tend to have high bias and low variance
 - b) are always interpretable
 - c) are prone to overfitting (Correct)
 - d) are prone to underfitting


### Question 3: Which of the following classification models fulfills all three characteristics: i) it is the quickest to train, ii) it is able to handle complex decision boundaries, and iii) it doesn't require additional retraining to make predictions that take into account freshly obtained data points?

 - a) logistic regression
 - b) k nearest neighbors (Correct)
 - c) random forest
 - d) deep neural network
 
 

### Question 4: Which of the following real-world ML applications is not unsupervised learning?
1. Netflix matrix factorization pipeline to discover users with similar interests
2. Speaker recognition (recognition of the identity of who is talking) in phones and smart assistant devices (Correct)
3. LDA topic modeling on Twitter content to discover customers' opinions about a product
4. K-means clustering of Web domains 


# Please note down here the time you took to solve the quiz:

In [None]:
(2 hours-10 minutes)