# <center>Exploratory Data Analysis for beginners

<center>
This notebook is a basic introduction to <strong>Exploratory Data Analysis (EDA)</strong>, the foundation of any Data Science project. Because it's meant for beginners, as I myself was when writing it, I will ask some very basic questions to understand why things are done the way they are in this industry.<br/><br/>   
Although modelling is the most highlighted part of the job, experienced Data Scientists say that preparing the data before they can start training models on it takes most of their time. And when they speak about it, you often hear them say "this is not pretty" or "not glamorous". I actually find this detective work pretty beautiful. I hope you will like it too.<br/><br/>  
Data preparation contains <strong>multiple steps</strong>. When I first started training myself in this field, I began reading about EDA in <a href="https://www.kdnuggets.com/2019/06/7-steps-mastering-data-preparation-python.html">7 Steps to Mastering Data Preparation for Machine Learning with Python</a> on KDnuggets.<br/>
In this tutorial I will only discuss <strong>Exploratory Data Analysis.</strong><br/><br/>
I will work on the data from the World Happiness Report from 2020.<br/><br/>
<i>I like to sprinkle my writings with fun facts from different domains</i>

## Table of contents

1. [Why EDA ?](#1.-Why-EDA-?)
2. [Pandas, Numpy, Matplotlib, Seaborn](#2.-Pandas,-Numpy,-Matplotlib,-Seaborn)
3. [Data types](#3.-Data-types)
4. [Exploring categorical features](#4.-Exploring-categorical-features)
5. [Exploring numerical features](#5.-Exploring-numerical-features)
6. [Bivariate analysis](#6.-Bivariate-analysis)
7. [Outliers](#7.-Outliers)

## 1. Why EDA ?

Because in order to start working with our data, we need to know what kind of data we are dealing with. And this detective work got itself the dry name of exploratory data analysis (which I don't think does justice to it). 

These are only some of the questions that we ask ourselves. Depending on the answer, we have to proceed with different processing steps before we can use any algorithms on our data:
- Do we have 1000 or 1 million entries in our data ?
- Are we dealing with text or numbers ?
- Do we have dates ? What format do these dates have ?
- Do we have outliers ? (Data points that are extremely different from all the other ones)
- Do we have missing data ? That is, is any of the cells in our dataset empty ?
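
Here is a minimal sketch of how pandas can answer these questions. It uses a tiny made-up table, not the World Happiness Report, and the column names `country`, `score` and `measured_on` are my own invention for the example.

```python
import pandas as pd

# Made-up toy data, only for illustration
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "score": [7.1, 5.4, None, 3.9],
    "measured_on": ["2020-01-15", "2020-02-01", "2020-02-20", "2020-03-05"],
})

# How many entries (rows) and features (columns) do we have?
n_rows, n_cols = toy.shape

# Are we dealing with text or numbers?
column_types = toy.dtypes

# Do we have missing data? Count the empty cells in each column
missing_per_column = toy.isna().sum()

# Do we have dates? Parse them with an explicit format
dates = pd.to_datetime(toy["measured_on"], format="%Y-%m-%d")

print(n_rows, n_cols)
print(column_types)
print(missing_per_column)
print(dates.dt.month.tolist())
```

Outliers are the one question from the list that this sketch skips. They need a bit more work, which is why they get their own section at the end.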

If I just open my data, the csv file, in a spreadsheet application and look at it with the naked eye, I won't be able to tell much.<br/>
<img src="https://mihaelagrigore.info/wp-content/uploads/2020/10/Happiness-CSV.png"></img>

I will open the csv file and read all my data.

In [None]:
import pandas as pd
df = pd.read_csv('../input/world-happiness-report/2020.csv')

## 2. Pandas, Numpy, Matplotlib, Seaborn

![Red pandas](https://mihaelagrigore.info/wp-content/uploads/2020/10/red-panda-970798_640.jpg)
These are red pandas. We are mostly used to the black & white ones. This image by <a href="https://pixabay.com/users/1443435-1443435/?utm_source=link-attribution&amp;utm_medium=referral&amp;utm_campaign=image&amp;utm_content=970798">1443435</a> from <a href="https://pixabay.com/?utm_source=link-attribution&amp;utm_medium=referral&amp;utm_campaign=image&amp;utm_content=970798">Pixabay</a> is a tribute to diversity.

### 2.1 Pandas
In the code right above (before the cute furry animals), I just imported the pandas library and used **read_csv** to read my csv data into a **Pandas DataFrame.**  


Pandas is a software library created for data manipulation and analysis. Using pandas we can read various file formats easily into data structures specifically created for data manipulation procedures.  

The most commonly used data structures in pandas are [Series](https://pandas.pydata.org/pandas-docs/stable/reference/series.html) and [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html). **Series** stores one-dimensional data (like a table with only one column) and **DataFrame** stores 2-dimensional data (tables with multiple columns).
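
To make the difference concrete, here is a small sketch with made-up values (the names `scores` and `table` are just for this example). It also shows that selecting one column of a DataFrame gives back a Series.

```python
import pandas as pd

# A Series: one-dimensional, like a single column
scores = pd.Series([7.8, 7.6, 7.5], name="score")

# A DataFrame: two-dimensional, a table with several columns
table = pd.DataFrame({
    "country": ["X", "Y", "Z"],
    "score": [7.8, 7.6, 7.5],
})

# Picking one column out of a DataFrame returns a Series
one_column = table["score"]

print(scores.ndim, table.ndim)
print(type(one_column).__name__)
print(table.shape)
```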

The best place to learn pandas is the official documentation. If during or after reading this you feel like you need a more thorough work session with pandas, have a look at this [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) tutorial. Mind that it only takes 10 min if you're some species from another planet. For humans, most likely, it takes way more than that.

### 2.2 Numpy
Numpy is a library mainly used for the mathematical functions it implements. This way we don't have to write the functions ourselves all the time.
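
As a quick illustration (with made-up numbers), this sketch applies a few numpy functions to a whole array at once, without writing any loop ourselves.

```python
import numpy as np

# Made-up values, only for illustration
values = np.array([1.0, 4.0, 9.0])

roots = np.sqrt(values)     # square root of every element
average = np.mean(values)   # arithmetic mean of the array
largest = np.max(values)    # maximum value

print(roots, average, largest)
```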

### 2.3 Matplotlib
Matplotlib brings us data visualizations.

### 2.4 Seaborn
Seaborn takes visualisations to the next level: more powerful and more beautiful. You'll see.

In [None]:
#let's set the precision to 2 decimal places
pd.set_option("display.precision", 2)

#the first 3 rows of our pandas DataFrame object
#if we run df.head(), it will display the first 5 rows by default
df.head(3)

Pandas makes it very easy to handle tabular data.   

Tabular data means that our data fits or belongs in a table. Other types of data can be visual (that is, images, for which it doesn't really make sense to be stored as csv files).

The standard way to store tabular data is that:  
- **each row** represents a different **observation**. Observation is a fancy Statistics term, but it just means a new data point, a new measurement we did. 
If our data is about happiness in various countries, each row contains data for a new country.  
- **each column** is a different **feature** (or attribute) of our observations. For the World Happiness Report dataset, examples of features can be the Country name, the Regional indicator or the Social Support score.  

Let's use the numpy library to see the maximum value of the **feature** *Ladder score* across **all observations** in our dataset (all countries). 

In [None]:
#Let's import the numpy library
import numpy as np

#and use a numpy function to see what's the maximum value for our Ladder score feature
np.max(df["Ladder score"])

And since we're here, I'll do a quick demo of how convenient it is to use pandas DataFrame structure.  
We found the maximum value of the "Ladder score" feature. What is the row number of the entry with the max Ladder score ?

In [None]:
df['Ladder score'].argmax()

It only took one line of code to find the row number. Let's see this observation's features, to convince ourselves we got the right entry. Mind that when displaying one single entry from the DataFrame, the feature values won't appear on a row anymore, but will be displayed as a column (I find this switch a bit confusing).

In [None]:
df.iloc[df['Ladder score'].argmax()]
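
One detail worth knowing: `argmax()` returns a position (0, 1, 2, ...), which is why it pairs with `iloc`. Its sibling `idxmax()` returns the index label, which pairs with `loc`. In our dataset the two coincide, because the index simply counts the rows. The sketch below uses a made-up table with a deliberately different index so the distinction becomes visible.

```python
import pandas as pd

# Made-up data with index labels 10, 20, 30 instead of 0, 1, 2
toy = pd.DataFrame(
    {"country": ["A", "B", "C"], "score": [5.0, 7.5, 6.1]},
    index=[10, 20, 30],
)

position = toy["score"].argmax()   # integer position of the maximum
label = toy["score"].idxmax()      # index label of the maximum

row_by_position = toy.iloc[position]   # iloc selects by position
row_by_label = toy.loc[label]          # loc selects by label

# A single row comes back as a Series, which is why it is displayed as a column
print(position, label)
print(row_by_position)
```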

## 3. Data types

We have some idea about our feature types just by looking at the CSV file. But a better method is the one below.

In [None]:
#DataFrame has this very handy method.
df.info()

What I see in the output above:
- my data is a DataFrame, with 153 entries (from 0 to 152)
- I have 20 columns (from 0 to 19)
- all my columns have 153 non-null values (I don't have "missing" data in any of these columns)
- my column types are: object (2 of them) and float64* (18 of them)

*float64 means they can store fractional numbers and each number takes 64 bits

The 'object' type I see above most likely refers to a string. I'll use [DataFrame indexing / selection](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) to look at one particular value to verify my assumption.

In [None]:
print(df['Country name'][0])
print(df['Regional indicator'][0])

Ok, so, in this case, 'object' means String. 
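
Looking at a single value is a spot check. A slightly more thorough sketch, on a made-up table, tests every value in a text column and lists the numerical columns separately. I check the Python type of the values rather than the column dtype, since how pandas labels text columns may depend on the pandas version.

```python
import pandas as pd

# Made-up toy data, only for illustration
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "score": [5.0, 7.5, 6.1],
})

# Is every value in the text column really a Python string?
all_strings = bool(toy["country"].map(lambda value: isinstance(value, str)).all())

# Which columns hold numbers?
numeric_columns = toy.select_dtypes(include="number").columns.tolist()

print(all_strings)
print(numeric_columns)
```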

## 4. Exploring categorical features

We have 2 features which contain text:
- Country
- Region

### Country

Our intuition is that each country is unique in our dataset (one country per row). This is what we would expect from a study of happiness levels in different countries across the world. We can verify this assumption, to make sure we don't have errors in our data. For example, the social scientist running this study could have accidentally entered the same observation twice because she was working late to finish her data analysis.

In [None]:
#how many entries we have for each country
#shown in descending order (highest value first)
df["Country name"].value_counts().sort_values(ascending = False)

In [None]:
#Uncomment the line below to see what data type we used. This is a nice way to explore the functioning of pandas.
#print("\nThe code above returns data of type: ", type(df['Country name'].value_counts()))
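
Reading through a long list of counts works, but pandas can also answer the question "is there any duplicate ?" directly. Here is a sketch on a made-up table in which country "B" was entered twice on purpose.

```python
import pandas as pd

# Made-up data: country "B" appears twice
toy = pd.DataFrame({
    "country": ["A", "B", "B", "C"],
    "score": [5.0, 7.5, 7.5, 6.1],
})

# True only if no value is repeated in the column
is_unique = toy["country"].is_unique

# keep=False marks every copy of a repeated value, not just the later ones
repeated = toy[toy["country"].duplicated(keep=False)]

print(is_unique)
print(repeated)
```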

### Region

Let's have a look at the **regions** now. It would be interesting to see what different regions we have. This would open the door for questions like: 'Are people happier in Western Europe than in Eastern Europe ?'. We don't know yet what questions we can ask, and exploring our data informs our next steps.

By the way, since we are dealing with long column names, it's worth mentioning that I don't have to type the whole column name. I just input the first 3 letters and press Tab for autocomplete.

We see in the output below that:
- Europe is split into 2: 'Western Europe' and 'Central and Eastern Europe'
- The Americas are divided into 2: 'Latin America and Caribbean' and 'North America and ANZ' (which is North America, Australia and New Zealand)
- Africa is split into 2: 'Sub-Saharan Africa' and 'Middle East and North Africa'
- Asia is divided into 3: 'Southeast Asia', 'South Asia' and 'East Asia'
- There is a group of post-Soviet republics in Eurasia making up the 'Commonwealth of Independent States'

In [None]:
#here's each individual region and its corresponding frequency (the statistical term 
#for the number of times this region appears in our dataset)
df['Regional indicator'].value_counts()

In [None]:
#we have 10 regions and pandas DataFrame has a method to find this out
print(f"The number of regions in our dataset is: {df['Regional indicator'].nunique()}")

I just used Python's fancy formatting in the line of code above. If you like it and want to read more, know that it's called Literal String Interpolation (but the popular name is f-string). You can read more [here](https://www.programiz.com/python-programming/string-interpolation).
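
A small sketch of what else f-strings can do: besides inserting a value, they can format it. The numbers below are the ones from this dataset (10 regions, 39 Sub-Saharan countries out of 153 entries); the variable names are mine.

```python
n_regions = 10
share = 39 / 153

# {share:.1%} formats the fraction as a percentage with one decimal place
message = f"{n_regions} regions; the largest one holds {share:.1%} of the rows"

print(message)  # 10 regions; the largest one holds 25.5% of the rows
```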

### Visualisation for categorical features

Since the frequencies of our regions (the number of times they appear in our dataset) are greater than one, it makes sense to look at them in a more intuitive way than the text displayed above.  

It is generally much better for the audience to present any data in visual form, whenever possible. For countries, nothing else made sense since each country appeared once in our data. But for regions, we can use a **bar chart**.

The bar chart below shows the same information as the table we've seen earlier.   
But in visual form it's so much easier to gain insights like "Sub-Saharan Africa is present in our dataset approximately twice as much as the next region in line, Western Europe". 

In [None]:
df['Regional indicator'].value_counts().plot(kind='bar', title='Absolute frequency distribution of Regional indicator')

In the code above I've used **Pandas built-in capabilities for data visualization**. I didn't feel a need to turn to matplotlib or seaborn for basic visualisation that can be provided by pandas.  
If you feel like you want to read more about Pandas visualisation, see the [official documentation.](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html)

Another observation about the plot above is that those numbers are absolute frequencies. That is, the bar chart shows the number of times each region is present in our dataset. Sometimes it's enough to know that we have 39 countries from Sub-Saharan Africa. But there are times when we're wondering how much this represents in terms of percentage.  

In [None]:
(df['Regional indicator'].value_counts()/df.shape[0]).plot(kind='bar', title='Relative frequency of Regional indicators')

Now we know that Sub-Saharan Africa represents 25% of our data. For this dataset this is not unusual. But imagine you're trying to see how happy people are in a single country, you broadcast a digital survey that people can take and during data analysis you realize that 25% of the people who filled in the survey are from the same city in this country.  
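
As a side note, `value_counts` can compute relative frequencies by itself through its `normalize` parameter. The sketch below, on a made-up Series of region names, checks that this gives the same numbers as dividing by the number of rows.

```python
import pandas as pd

# Made-up region labels, only for illustration
regions = pd.Series(["North", "North", "South", "East"], name="region")

# Relative frequencies computed by pandas
relative = regions.value_counts(normalize=True)

# The same thing done by hand, as in the cell above
manual = regions.value_counts() / len(regions)

print(relative)
```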

## 5. Exploring numerical features

Pandas has a nice built-in method that performs descriptive statistics on a DataFrame.  
It shows us: 
- the number of values for each feature (again, an opportunity to see if we have missing values for any feature)
- the mean value
- the standard deviation (std)
- the min and max value
- the median of our data (50%)
- the lower and upper quartile (25% and 75%)

In [None]:
df.describe()

Insights from the descriptive statistics above:
- Ladder score actually goes from 2.5 to 7.8. There's no 0 or 10. 
- Healthy life expectancy has a minimum of 45 and a maximum of 76. This is a large range. There are countries in our dataset where life expectancy is 45 years !
- Generosity can be negative. It's the only feature that has negative values.  
- Other features are more difficult to interpret from the descriptive stats above.
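
To demystify the output of `describe()`, here is a sketch on a short made-up Series that compares a few of its rows with the individual pandas methods computing the same thing. Note that the `std` row is the sample standard deviation.

```python
import pandas as pd

# Made-up values, only for illustration
scores = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

summary = scores.describe()

# Each row of the summary can also be computed on its own
same_mean = summary["mean"] == scores.mean()
same_median = summary["50%"] == scores.median()
same_lower_quartile = summary["25%"] == scores.quantile(0.25)
std_difference = abs(summary["std"] - scores.std())

print(summary)
print(same_mean, same_median, same_lower_quartile, std_difference)
```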

Numerical data is best viewed as histograms. We will use both matplotlib and seaborn for this.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

columns = ['Logged GDP per capita', 'Social support', 'Healthy life expectancy',
           'Freedom to make life choices', 'Generosity', 'Perceptions of corruption']

#arrange the six plots on a grid with 3 rows and 2 columns
nrows = len(columns) // 2
ncols = 2
fig, axes = plt.subplots(nrows, ncols, figsize=(10,6))

for i, col in enumerate(columns):
    #fill the grid column by column: the first 3 features go in the left column
    ax_row = i % nrows
    ax_col = i // nrows

    #histplot replaces distplot, which is deprecated since seaborn 0.11
    sns.histplot(df[col], kde=True, ax=axes[ax_row, ax_col])
    axes[ax_row, ax_col].set_title('Frequency distribution '+ col, fontsize=12)
    axes[ax_row, ax_col].set_xlabel(col, fontsize=8)
    axes[ax_row, ax_col].set_ylabel('Count', fontsize=8)
fig.tight_layout()
plt.show()

Insights from the visual exploration of our numerical data:
- the distributions of GDP, social support, healthy life expectancy, freedom and corruption are all [left skewed](http://www.cvgs.k12.va.us/DIGSTATS/main/descriptv/d_skewd.html) (or negative skew). That is to say, most of our values do not happen to be in the middle of the min-max range, but are pushed towards the upper end of our range. For all but Perception of corruption this is good news. 
- generosity, though, is right skewed. The majority of the countries are in the bottom half of the generosity scale (unfortunately)

If you feel the need to read more about why we might want to look at the distribution of our data, [here is a very quick overview](http://www.cvgs.k12.va.us/DIGSTATS/main/descriptv/).
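
Skewness can also be measured instead of eyeballed. The sketch below uses two made-up Series (not our dataset) and the pandas `skew()` method: a negative value indicates left skew, a positive value right skew. For left-skewed data the mean typically sits below the median, as it does here.

```python
import pandas as pd

# Made-up values: most of them sit at the upper end of the range
left_skewed = pd.Series([1, 6, 7, 8, 8, 9, 9, 9])

# Made-up values: most of them sit at the lower end of the range
right_skewed = pd.Series([1, 1, 1, 2, 2, 3, 4, 9])

left_skewness = left_skewed.skew()     # negative for left skew
right_skewness = right_skewed.skew()   # positive for right skew

print(left_skewness, right_skewness)
print(left_skewed.mean(), left_skewed.median())
```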

## 6. Bivariate analysis

All the explorations above belong to univariate analysis (that is, we looked at each variable individually).
We can also perform bivariate analysis - we can look at pairs of two variables to explore a possible relation between them.

When Data Scientists perform a bivariate analysis, they look at scatterplots like the ones below and they search for clouds of dots that arrange themselves into straight diagonal lines. This is a visual representation of two variables that correlate.  

Here's how to read the plots below:  
Let's look at the **second plot on the first row**. On the **far left** of the image we see "Logged GDP per capita". All plots on the first row have the Logged GDP per capita on the y axis (the vertical axis). Now look at the bottom of the plots, all the way down: under the second column we have "Social support" as the name of the x axis. All plots in the second column have the Social support on the x axis (the horizontal axis).  

Armed with this information, let's look at the contents of the second plot, first row. As the 'Social support' increases, so does 'Logged GDP per capita'. What does this mean ? Nothing more than the fact that the two features seem to be correlated (correlation, not causation). Most likely (intuition dictates) as the country gets richer it can afford to offer more social support to its inhabitants.

Now look at the fourth subplot on the same row. The datapoints are all over the place and there seems to be no correlation between 'Logged GDP per capita' and 'Freedom to make life choices'.   

Correlation is not assessed only by looking at a scatterplot, but this is a good start.

Take a few moments to explore the plots below. Look on the diagonal, from upper left to lower right. Do you recognize them from the univariate analysis section ? These are the histograms we've seen earlier.

In [None]:
#This will take slightly longer than other plots, don't worry if the plots don't show up immediately.
columns = ['Logged GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity',\
           'Perceptions of corruption']
sns.pairplot(df[columns])

Seaborn allows us to add a 'hue' to our plots. 
We will set our scatterplots to assign **different colors to datapoints that belong to different global regions.**

You can read about [Seaborn pairplot here](https://seaborn.pydata.org/generated/seaborn.pairplot.html)

This helps us gain insight like: Sub-Saharan African countries (the purple dots, according to the legend on the right) have the lowest GDP and the lowest Healthy life expectancy, but they are not less generous than more fortunate countries.

In [None]:
#This will take slightly longer than other plots, don't worry if the plots don't show up immediately.
columns = ['Regional indicator','Logged GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity',\
           'Perceptions of corruption']
sns.pairplot(df[columns], hue="Regional indicator", palette="Paired")

Correlation is not assessed only by looking at a scatterplot, but the mono-coloured pairplot above was a good start.  
Another useful tool in the EDA toolset is the **correlation matrix.**

In [None]:
meaningful_columns = ['Ladder score','Logged GDP per capita', 'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 'Ladder score in Dystopia']

plt.figure(figsize=(8,6))
#only numerical columns can be correlated, so we use meaningful_columns
#(the columns list from the pairplot also contains the text column 'Regional indicator')
sns.heatmap(df[meaningful_columns].corr(), annot = True, fmt='.1g', cmap= 'coolwarm')
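
To get a feeling for the numbers in a correlation matrix, here is a sketch on a made-up table. By default `corr()` computes the Pearson correlation coefficient, which ranges from -1 (one variable goes down in a straight line as the other goes up) to 1 (both go up together in a straight line).

```python
import pandas as pd

# Made-up data: "up" grows with x, "down" shrinks as x grows
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "up": [2, 4, 6, 8, 10],
    "down": [10, 8, 6, 4, 2],
})

matrix = toy.corr()

x_vs_up = matrix.loc["x", "up"]       # perfect positive correlation
x_vs_down = matrix.loc["x", "down"]   # perfect negative correlation

print(matrix)
```

The diagonal of the matrix is always 1, since every variable is perfectly correlated with itself.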

## 7. Outliers

A nice way to spot outliers is a Box and Whiskers plot.

In [None]:
small = ['Social support', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption']
medium = ['Ladder score', 'Logged GDP per capita']
large = ['Healthy life expectancy']

f, axs = plt.subplots(1,3,figsize=(15,5))

#one subplot per group of features with a similar scale
#rot=90 rotates the feature names so they don't overlap
df.boxplot(column=small, ax = axs[0], rot=90)

df.boxplot(column=medium, ax = axs[1])

df.boxplot(column=large, ax = axs[2])

The classical interpretation in Statistics is that whatever falls outside the 'whiskers' represents an outlier.  

If you'd like to read more about box plots and what the box, the line that splits the box and the whiskers represent, [this resource](https://publiclab.org/notes/mimiss/06-18-2019/creating-a-boxplot-to-identify-outliers-using-codap) seemed to have nice visuals. 

In practice, deciding what to do with outliers depends on many factors (whether you think they can be a mistake in data collection, for example). 
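
For the curious, here is a sketch of the rule commonly used to place the whiskers: they reach at most 1.5 times the interquartile range (IQR) beyond the box. As far as I know this is also the default in matplotlib box plots, but treat that as my assumption. The values below are made up.

```python
import pandas as pd

# Made-up values with one suspiciously large entry
values = pd.Series([10, 12, 12, 13, 14, 15, 16, 40])

q1 = values.quantile(0.25)   # lower quartile, the bottom of the box
q3 = values.quantile(0.75)   # upper quartile, the top of the box
iqr = q3 - q1                # interquartile range, the height of the box

# Anything beyond these limits is drawn as an individual point
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = values[(values < lower_limit) | (values > upper_limit)]

print(lower_limit, upper_limit)
print(outliers.tolist())
```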

Let's examine the case of Perceptions of corruption.

In [None]:
f, axs = plt.subplots(1,2,figsize=(12,4))

#left: the histogram, right: the box plot of the same feature
#histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(df['Perceptions of corruption'], kde=True, ax=axs[0])

df.boxplot(column=['Perceptions of corruption'], ax = axs[1])

Because the 'Perceptions of corruption' feature is left skewed, the countries with the lowest perception of corruption are flagged as outliers in the boxplot. 

But just because they are technically outliers does not necessarily mean we should do something about them. The next question is: is the data correct ? Let's see who these outliers are.

In [None]:
(df[df['Perceptions of corruption'] < 0.4])[['Country name', 'Perceptions of corruption']].sort_values(by = 'Perceptions of corruption', axis=0, ascending=True)

It's no surprise to find almost all these countries at the low end of the Perceptions of corruption ranking. I admit I did not know about the low corruption in Rwanda !  

*If you find the line of code above confusing, I did too, in the beginning. When I found lines like this in someone else's code, I used to dissect them to examine the output and the data types. Maybe this tip helps.*

In [None]:
#Uncomment the code below, line by line, if you want to dissect the previous line of code.
#I find it useful to first make a hypothesis about what I expect the line of code to do before running it. 

#print(f'df has {len(df)} entries')

#df['Perceptions of corruption'] < 0.4

#df[df['Perceptions of corruption'] < 0.4]

#print(f"our selection has {len(df[df['Perceptions of corruption'] < 0.4])} entries")

#(df[df['Perceptions of corruption'] < 0.4])[['Country name', 'Perceptions of corruption']]
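
The same dissection can be tried on a tiny made-up table, where every intermediate result is small enough to check by eye. The column name `corruption` and the values are invented for this sketch.

```python
import pandas as pd

# Made-up toy data, only for illustration
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "corruption": [0.9, 0.3, 0.75, 0.2],
    "other": [1, 2, 3, 4],
})

# Step 1: the comparison gives one True/False value per row
mask = toy["corruption"] < 0.4

# Step 2: indexing with the mask keeps only the rows marked True
selection = toy[mask]

# Step 3: a list of column names keeps only those columns
two_columns = selection[["country", "corruption"]]

# Step 4: sort the remaining rows by the corruption value
result = two_columns.sort_values(by="corruption", ascending=True)

print(mask.tolist())
print(result)
```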

That's it for EDA for this rather simple tabular dataset.