# **Exploratory Data Analysis**

**Content**:
1. What is Exploratory Data Analysis?
2. Why is exploratory data analysis important in data science?
3. A Statistical Approach
4. Hands on!


### **What is Exploratory Data Analysis?**

Exploratory Data Analysis refers to the critical process of performing **initial investigations on data** so as to discover **patterns**, to **spot anomalies**, to test hypotheses and to **check assumptions** with the help of summary statistics and graphical representations.

Exploratory Data Analysis (**EDA**) is used by data scientists to **analyze** and **investigate** data sets and **summarize their main characteristics**.

> ##### **EDA** is all about making sense of the data in hand before getting your hands dirty with it.

### **Why is exploratory data analysis important in data science?**

The main purpose of EDA is to help look at data before making any assumptions. 
It can help identify obvious errors, better understand patterns within the data, detect outliers or anomalous events, 
and find interesting relations among the variables.

- Data scientists can use exploratory analysis to ensure the results they produce are valid and applicable to any desired business outcomes and goals. 
- EDA also helps stakeholders by confirming they are asking the right questions. 

EDA can help answer questions about standard deviations, categorical variables, and confidence intervals.

 Once EDA is complete and insights are drawn, its features can then be used for more sophisticated data analysis or modeling, including machine learning.

> **"Data combined with practical methods can answer questions and guide decisions under uncertainty"** 

### **A Statistical Approach**


When doing EDA, we might want **evidence** that is more persuasive and an answer that is more reliable.

To overcome the limitations of **anecdotal evidence** and other unreliable sources of data, we will use some helpful statistical tools:
1. Data Collection
2. Descriptive statistics
3. Exploratory Data Analysis
4. Estimation
5. Hypothesis testing

> **Anecdotal evidence**: Evidence, often personal, that is collected casually rather than by a well-designed study. <br><br>
> Learn more about how anecdotal evidence can mislead you: https://statisticsbyjim.com/basics/anecdotal-evidence/

Anecdotal evidence usually fails, because: 

- **Small number of observations**  
- **Selection bias:** Individuals or groups in a study differ systematically from the population of interest leading to a systematic error in an association or outcome.
- **Confirmation bias:** People who believe the claim might be more likely to contribute examples that confirm it. People who doubt the claim are more likely to cite counterexamples.
- **Inaccuracy:** Anecdotes are often personal stories, and are often misremembered, misrepresented, repeated inaccurately, etc.

More types of bias: https://catalogofbias.org/biases/
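
The first point, a small number of observations, can be illustrated with a short simulation. This is only a sketch under stated assumptions: the population is synthetic (standard normal draws), and the helper `spread_of_sample_means` is a hypothetical name introduced for illustration. It compares how much the sample mean fluctuates when each sample has 5 observations versus 500.

```julia
using Random, Statistics

# Hypothetical population: 100_000 synthetic values from a standard normal distribution.
rng = MersenneTwister(42)
population = randn(rng, 100_000)

# Draw `reps` random samples of size `n` (with replacement) and
# return the standard deviation of their means.
function spread_of_sample_means(rng, population, n; reps = 1_000)
    means = [mean(rand(rng, population, n)) for _ in 1:reps]
    return std(means)
end

small = spread_of_sample_means(rng, population, 5)
large = spread_of_sample_means(rng, population, 500)

println("Spread of the mean with n = 5:   ", round(small, digits = 3))
println("Spread of the mean with n = 500: ", round(large, digits = 3))
```

Means computed from a handful of observations vary far more from sample to sample, which is one reason a few anecdotes can point in almost any direction.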

### **Hands On**

We are going to work with data on COVID-19 (coronavirus) from **Our World in Data**.

Download the complete [COVID-19 dataset](https://covid.ourworldindata.org/data/owid-covid-data.csv).

See this repository for [more details](https://github.com/owid/covid-19-data/tree/master/public/data).

Import the necessary libraries to perform our analysis:

In [3]:
using Pkg

Pkg.add("CSV")
Pkg.add("DataFrames")

[32m[1m    Updating[22m[39m registry at `~/.julia/registries/General.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.7/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.7/Manifest.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.7/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.7/Manifest.toml`


In [4]:
using CSV, DataFrames

In [5]:
df = DataFrame(CSV.File("../data/owid-covid-data.csv"))
df

| Row | iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | String15 | String15? | String | Date | Float64? | Float64? | Float64? |
| 1 | AFG | Asia | Afghanistan | 2020-02-24 | 5.0 | 5.0 | missing |
| 2 | AFG | Asia | Afghanistan | 2020-02-25 | 5.0 | 0.0 | missing |
| 3 | AFG | Asia | Afghanistan | 2020-02-26 | 5.0 | 0.0 | missing |
| 4 | AFG | Asia | Afghanistan | 2020-02-27 | 5.0 | 0.0 | missing |
| 5 | AFG | Asia | Afghanistan | 2020-02-28 | 5.0 | 0.0 | missing |
| 6 | AFG | Asia | Afghanistan | 2020-02-29 | 5.0 | 0.0 | 0.714 |
| 7 | AFG | Asia | Afghanistan | 2020-03-01 | 5.0 | 0.0 | 0.714 |
| 8 | AFG | Asia | Afghanistan | 2020-03-02 | 5.0 | 0.0 | 0.0 |
| 9 | AFG | Asia | Afghanistan | 2020-03-03 | 5.0 | 0.0 | 0.0 |
| 10 | AFG | Asia | Afghanistan | 2020-03-04 | 5.0 | 0.0 | 0.0 |


In [6]:
size(df)

(217278, 67)

This dataset contains 217,278 observations and 67 columns.

In [7]:
names(df)

67-element Vector{String}:
 "iso_code"
 "continent"
 "location"
 "date"
 "total_cases"
 "new_cases"
 "new_cases_smoothed"
 "total_deaths"
 "new_deaths"
 "new_deaths_smoothed"
 "total_cases_per_million"
 "new_cases_per_million"
 "new_cases_smoothed_per_million"
 ⋮
 "cardiovasc_death_rate"
 "diabetes_prevalence"
 "female_smokers"
 "male_smokers"
 "handwashing_facilities"
 "hospital_beds_per_thousand"
 "life_expectancy"
 "human_development_index"
 "excess_mortality_cumulative_absolute"
 "excess_mortality_cumulative"
 "excess_mortality"
 "excess_mortality_cumulative_per_million"

Missing values are represented in Julia by the value `missing`, which has type `Missing`.

Julia provides support for representing missing values in the statistical sense, that is for situations where no value is available for a variable in an observation, but a valid value theoretically exists.
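
A minimal sketch of how `missing` behaves, using a small made-up vector (the values are illustrative and not taken from the dataset):

```julia
using Statistics

# Hypothetical column with two unavailable observations.
new_cases = [5.0, missing, 0.0, 12.0, missing]

# Arithmetic that touches `missing` propagates it.
total_with_missing = sum(new_cases)

# `ismissing` tests a single value; `count` tallies it over a collection.
n_missing = count(ismissing, new_cases)

# `skipmissing` iterates over the available values only.
mean_available = mean(skipmissing(new_cases))

# `coalesce` replaces `missing` with a fallback (0.0 is an arbitrary choice here).
filled = coalesce.(new_cases, 0.0)
```

Whether to skip or fill missing values depends on the question being asked; filling with zero, for instance, changes the mean.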

The `describe` function returns summary statistics for each column; requesting only `:nmissing` gives the number of missing values per column:

In [8]:
missing_values = describe(df, :nmissing)
missing_values

| Row | variable | nmissing |
| --- | --- | --- |
| | Symbol | Int64 |
| 1 | iso_code | 0 |
| 2 | continent | 12504 |
| 3 | location | 0 |
| 4 | date | 0 |
| 5 | total_cases | 8904 |
| 6 | new_cases | 9194 |
| 7 | new_cases_smoothed | 10386 |
| 8 | total_deaths | 27956 |
| 9 | new_deaths | 28030 |
| 10 | new_deaths_smoothed | 29210 |


In [9]:
total_missing = sum(missing_values[:, :nmissing])
total_missing

6545978

The total number of missing values in the df is: 6,545,978
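
To put that total in context: the data frame has 217,278 × 67 = 14,557,626 cells, so roughly 45% of them are missing. The same calculation can be sketched on a small made-up data frame (`toy` and its values are assumptions for illustration), counting missing entries column by column without going through `describe`:

```julia
using DataFrames

# Hypothetical data frame with three missing cells out of nine.
toy = DataFrame(location = ["A", "A", "B"],
                new_cases = [1.0, missing, 3.0],
                new_deaths = [missing, missing, 0.0])

# Number of missing entries in each column.
per_column = [count(ismissing, col) for col in eachcol(toy)]

total_missing_toy = sum(per_column)
total_cells = nrow(toy) * ncol(toy)

# Share of all cells that are missing.
missing_share = total_missing_toy / total_cells
```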



In [10]:
missing_values[:, [:variable, :nmissing]]

| Row | variable | nmissing |
| --- | --- | --- |
| | Symbol | Int64 |
| 1 | iso_code | 0 |
| 2 | continent | 12504 |
| 3 | location | 0 |
| 4 | date | 0 |
| 5 | total_cases | 8904 |
| 6 | new_cases | 9194 |
| 7 | new_cases_smoothed | 10386 |
| 8 | total_deaths | 27956 |
| 9 | new_deaths | 28030 |
| 10 | new_deaths_smoothed | 29210 |


**Vectorized Operators**

Every binary arithmetic operator has a vectorized ("dot") version, written with a dot in front of the operator, that applies the operation element by element to vectors or arrays. The vectorized versions of these operators are: `.+ .- .* ./ .\ .% .^`
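
A short sketch of the dot operators on two small made-up vectors. The same dot syntax also works for function calls, as in the last line:

```julia
a = [1, 2, 3]
b = [10, 20, 30]

sum_ab  = a .+ b          # element-wise addition
prod_ab = a .* b          # element-wise multiplication
squares = a .^ 2          # the scalar 2 is applied to every element
ratio   = b ./ a          # element-wise division, returns Float64 values
dist    = abs.(a .- 2)    # dot call: `abs` applied to each element
```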

The `transform()` function creates a new data frame that contains the columns of `df` plus the columns specified by its arguments, and returns it. The result is guaranteed to have the same number of rows as `df`.
- `transform!` is an in-place version of transform.
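
As a minimal sketch, assuming a small made-up data frame `toy` (not part of the dataset), the following shows a column-wise transformation, a row-wise one with `ByRow`, and the fact that `transform` leaves its input untouched:

```julia
using DataFrames

toy = DataFrame(variable = ["a", "b", "c"], nmissing = [0, 3, 1])

# Column-wise: the function receives the whole `nmissing` vector at once.
with_share = transform(toy, :nmissing => (v -> v ./ sum(v)) => :share)

# Row-wise: `ByRow` applies the function to one value at a time.
with_flag = transform(toy, :nmissing => ByRow(n -> n > 0) => :has_missing)

# `transform` returned new data frames, so `toy` still has two columns.
# `transform!` would have added the new column to `toy` itself.
n_columns_in_toy = ncol(toy)
```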


In [30]:
transform(missing_values, :nmissing => (v -> v / total_missing) => :percentaje)

| Row | variable | nmissing | percentaje |
| --- | --- | --- | --- |
| | Symbol | Int64 | Float64 |
| 1 | iso_code | 0 | 0.0 |
| 2 | continent | 12504 | 0.00191018 |
| 3 | location | 0 | 0.0 |
| 4 | date | 0 | 0.0 |
| 5 | total_cases | 8904 | 0.00136022 |
| 6 | new_cases | 9194 | 0.00140453 |
| 7 | new_cases_smoothed | 10386 | 0.00158662 |
| 8 | total_deaths | 27956 | 0.00427071 |
| 9 | new_deaths | 28030 | 0.00428202 |
| 10 | new_deaths_smoothed | 29210 | 0.00446228 |


- The `Dict` function creates a dictionary: a collection of key-value pairs in which each value can be accessed through its key. 
    - The keys and values need not share a data type, which means a `String` key can hold a value of any type, such as `Int`, `String`, or `Float64`. 
    - Keys must be unique; two entries can never share a key. This doesn't apply to values, which can repeat as needed. 
    - Dictionaries are, by default, an unordered collection of data.

- `eltype`: Determine the type of the elements generated by iterating a collection of the given type. 

- `eachcol`: Return a `DataFrameColumns` object, a vector-like object that allows iterating an `AbstractDataFrame` column by column.
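
The three pieces can be sketched together on a small made-up data frame (`toy` and `counts` are hypothetical names used only for illustration):

```julia
using DataFrames

toy = DataFrame(location = ["A", "B"], new_cases = [1.0, missing])

# Broadcasting `=>` pairs each column name with the element type of that column.
col_types = Dict(names(toy) .=> eltype.(eachcol(toy)))

location_type = col_types["location"]      # element type of the `location` column
new_cases_type = col_types["new_cases"]    # allows `missing` alongside Float64

# Values may repeat, keys may not: assigning to an existing key overwrites it.
counts = Dict("a" => 1, "b" => 1)
counts["a"] = 5
```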



In [12]:
Dict(names(df) .=> eltype.(eachcol(df)))

Dict{String, Type} with 67 entries:
  "aged_65_older"                           => Union{Missing, Float64}
  "people_vaccinated_per_hundred"           => Union{Missing, Float64}
  "new_deaths_smoothed_per_million"         => Union{Missing, Float64}
  "hosp_patients_per_million"               => Union{Missing, Float64}
  "life_expectancy"                         => Union{Missing, Float64}
  "hospital_beds_per_thousand"              => Union{Missing, Float64}
  "tests_per_case"                          => Union{Missing, Float64}
  "new_cases_per_million"                   => Union{Missing, Float64}
  "iso_code"                                => String15
  "new_deaths"                              => Union{Missing, Float64}
  "total_tests_per_thousand"                => Union{Missing, Float64}
  "weekly_icu_admissions_per_million"       => Union{Missing, Float64}
  "female_smokers"                          => Union{Missing, Float64}
  "new_tests_smoothed"                      => Union{Mis

For practicality, we are going to work with a subset of the entire dataset. This subset contains the variables related to **confirmed cases** and **confirmed deaths**, and we will later restrict it to the Americas. `dropmissing!` then removes every row that has a missing value in any of the selected columns.
- Recall the `!` notation: by convention, functions whose name ends in `!` modify their argument in place. This is similar to the `inplace` argument of some methods in the Python pandas library.
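
A minimal sketch of the `!` convention, first with `sort`/`sort!` from Base and then with `dropmissing`/`dropmissing!` on a made-up data frame (`xs` and `toy` are hypothetical names):

```julia
using DataFrames

xs = [3, 1, 2]
sorted_copy = sort(xs)        # returns a new vector, `xs` keeps its order
xs_after_sort = copy(xs)      # snapshot showing `xs` is unchanged
sort!(xs)                     # reorders `xs` itself

toy = DataFrame(x = [1, missing, 3])
cleaned = dropmissing(toy)    # new data frame without the incomplete row
rows_before = nrow(toy)       # `toy` still has all of its rows
dropmissing!(toy)             # now `toy` itself loses the incomplete row
rows_after = nrow(toy)
```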

In [13]:
columns = [:iso_code, :continent, :location, :date, :total_cases, :new_cases, :total_cases_per_million, :gdp_per_capita, :new_cases_per_million, :total_deaths, :new_deaths, 
 :total_deaths_per_million, :new_deaths_per_million, :people_vaccinated, :people_fully_vaccinated]
cases_df = df[:, columns]

dropmissing!(cases_df)
cases_df

| Row | iso_code | continent | location | date | total_cases | new_cases | total_cases_per_million |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | String15 | String15 | String | Date | Float64 | Float64 | Float64 |
| 1 | AFG | Asia | Afghanistan | 2021-05-11 | 62403.0 | 340.0 | 1556.2 |
| 2 | AFG | Asia | Afghanistan | 2021-05-20 | 64575.0 | 453.0 | 1610.37 |
| 3 | AFG | Asia | Afghanistan | 2021-05-24 | 66275.0 | 547.0 | 1652.77 |
| 4 | AFG | Asia | Afghanistan | 2021-05-26 | 67743.0 | 840.0 | 1689.37 |
| 5 | AFG | Asia | Afghanistan | 2021-05-27 | 68366.0 | 623.0 | 1704.91 |
| 6 | AFG | Asia | Afghanistan | 2021-05-30 | 70761.0 | 650.0 | 1764.64 |
| 7 | AFG | Asia | Afghanistan | 2021-06-02 | 74026.0 | 1049.0 | 1846.06 |
| 8 | AFG | Asia | Afghanistan | 2021-06-03 | 75119.0 | 1093.0 | 1873.32 |
| 9 | AFG | Asia | Afghanistan | 2021-06-08 | 82326.0 | 1485.0 | 2053.05 |
| 10 | AFG | Asia | Afghanistan | 2021-06-14 | 91458.0 | 1597.0 | 2280.78 |


The `describe()` function generates descriptive statistics.

In [14]:
describe(cases_df)

| Row | variable | mean | min | median | max | nmissing |
| --- | --- | --- | --- | --- | --- | --- |
| | Symbol | Union… | Any | Any | Any | Int64 |
| 1 | iso_code | | ABW | | ZWE | 0 |
| 2 | continent | | Africa | | South America | 0 |
| 3 | location | | Afghanistan | | Zimbabwe | 0 |
| 4 | date | | 2020-12-13 | 2021-10-29 | 2022-09-17 | 0 |
| 5 | total_cases | 3575760.0 | 4.0 | 663813.0 | 9.53532e7 | 0 |
| 6 | new_cases | 10199.5 | 0.0 | 891.0 | 1.35536e6 | 0 |
| 7 | total_cases_per_million | 104667.0 | 12.534 | 67482.4 | 6.40758e5 | 0 |
| 8 | gdp_per_capita | 26573.5 | 661.24 | 24055.6 | 1.16936e5 | 0 |
| 9 | new_cases_per_million | 351.145 | 0.0 | 89.812 | 35195.4 | 0 |
| 10 | total_deaths | 53682.6 | 1.0 | 8580.0 | 1.05138e6 | 0 |


In [15]:
using Statistics

The `combine` function lets us apply transformations to a grouped dataframe.
- The pair operator `=>` constructs a `Pair` object with type `Pair{typeof(x), typeof(y)}`. The elements are stored in the fields `first` and `second`. They can also be accessed via iteration. 
    - In this case, the `=>` operator maps a column to the function we want to apply to it.
    - `.=>` is its vectorized form.
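
A small sketch of how pairs and `.=>` feed into `combine`, assuming a made-up data frame `toy` with two locations (all values are illustrative):

```julia
using DataFrames, Statistics

# A single pair maps one column to one function.
p = :new_cases => mean
source_column = p.first
applied_function = p.second

# Broadcasting `=>` over a vector of columns builds one pair per column.
pairs_vec = [:new_cases, :new_deaths] .=> mean

toy = DataFrame(location = ["A", "A", "B"],
                new_cases = [1.0, 3.0, 10.0],
                new_deaths = [0.0, 2.0, 4.0])

# One output row per group; output columns are named `<column>_<function>`.
by_location = combine(groupby(toy, :location), pairs_vec)
```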

In [35]:
df_gp = groupby(cases_df, "location")
numeric_columns = [:total_cases, :new_cases, :total_cases_per_million, :gdp_per_capita, :new_cases_per_million, :total_deaths, :new_deaths, 
    :total_deaths_per_million, :new_deaths_per_million, :people_vaccinated, :people_fully_vaccinated]

combine(df_gp, numeric_columns .=> mean)

| Row | location | total_cases_mean | new_cases_mean | total_cases_per_million_mean |
| --- | --- | --- | --- | --- |
| | String | Float64 | Float64 | Float64 |
| 1 | Afghanistan | 1.52939e5 | 415.541 | 3814.0 |
| 2 | Albania | 1.76725e5 | 359.276 | 61906.4 |
| 3 | Algeria | 2.22632e5 | 179.391 | 5039.44 |
| 4 | Angola | 69851.6 | 180.846 | 2024.46 |
| 5 | Antigua and Barbuda | 2889.68 | 24.475 | 30998.4 |
| 6 | Argentina | 5.97984e6 | 13034.2 | 1.32073e5 |
| 7 | Armenia | 3.14121e5 | 562.282 | 1.12549e5 |
| 8 | Aruba | 24079.2 | 65.1867 | 2.26019e5 |
| 9 | Australia | 2.56336e6 | 18107.7 | 98890.8 |
| 10 | Austria | 7.47331e5 | 3754.08 | 83761.9 |


In [17]:
combine(df_gp, numeric_columns .=> std)

| Row | location | total_cases_std | new_cases_std | total_cases_per_million_std |
| --- | --- | --- | --- | --- |
| | String | Float64 | Float64 | Float64 |
| 1 | Afghanistan | 40967.5 | 473.064 | 1021.65 |
| 2 | Albania | 53761.7 | 390.59 | 18832.6 |
| 3 | Algeria | 26490.0 | 168.441 | 599.621 |
| 4 | Angola | 25910.6 | 283.207 | 750.949 |
| 5 | Antigua and Barbuda | 2049.94 | 68.8101 | 21990.3 |
| 6 | Argentina | 2.74436e6 | 21581.6 | 60613.0 |
| 7 | Armenia | 69704.4 | 688.071 | 24974.9 |
| 8 | Aruba | 12230.5 | 150.505 | 1.14802e5 |
| 9 | Australia | 3.38666e6 | 24098.4 | 1.30653e5 |
| 10 | Austria | 3.06559e5 | 5927.84 | 34359.6 |


In [18]:
combine(df_gp, numeric_columns .=> maximum)

| Row | location | total_cases_maximum | new_cases_maximum | total_cases_per_million_maximum |
| --- | --- | --- | --- | --- |
| | String | Float64 | Float64 | Float64 |
| 1 | Afghanistan | 196182.0 | 1847.0 | 4892.39 |
| 2 | Albania | 330193.0 | 2177.0 | 115666.0 |
| 3 | Algeria | 270443.0 | 610.0 | 6121.67 |
| 4 | Angola | 102636.0 | 1396.0 | 2974.63 |
| 5 | Antigua and Barbuda | 8555.0 | 419.0 | 91772.2 |
| 6 | Argentina | 9.69776e6 | 139853.0 | 2.14188e5 |
| 7 | Armenia | 422917.0 | 2467.0 | 1.5153e5 |
| 8 | Aruba | 42970.0 | 1162.0 | 4.03338e5 |
| 9 | Australia | 1.01506e7 | 175271.0 | 3.91598e5 |
| 10 | Austria | 1.8356e6 | 38141.0 | 205737.0 |


In [37]:
combine(df_gp, numeric_columns .=> minimum)

| Row | location | total_cases_minimum | new_cases_minimum | total_cases_per_million_minimum |
| --- | --- | --- | --- | --- |
| | String | Float64 | Float64 | Float64 |
| 1 | Afghanistan | 62403.0 | 0.0 | 1556.2 |
| 2 | Albania | 79934.0 | 0.0 | 28000.7 |
| 3 | Algeria | 190656.0 | 2.0 | 4315.64 |
| 4 | Angola | 29405.0 | 0.0 | 852.226 |
| 5 | Antigua and Barbuda | 1251.0 | 0.0 | 13419.9 |
| 6 | Argentina | 1.60216e6 | 0.0 | 35386.0 |
| 7 | Armenia | 224227.0 | 0.0 | 80340.1 |
| 8 | Aruba | 9294.0 | 0.0 | 87238.1 |
| 9 | Australia | 28939.0 | 2.0 | 1116.43 |
| 10 | Austria | 365898.0 | 66.0 | 41010.4 |


### **Dive into the American continent Data**


Let's go deeper and analyze data only from the Americas.

`filter`: Returns a data frame containing only the rows of `df` for which the function `fun` returns `true`.
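
As a hedged sketch on a made-up data frame (`toy` and `americas` are hypothetical names), the predicate can be written either against whole rows or against a single column. The column form avoids building a row object for every observation:

```julia
using DataFrames

toy = DataFrame(continent = ["Asia", "North America", "South America", "Europe"],
                location  = ["Japan", "Canada", "Peru", "Spain"])

americas = ("North America", "South America")

# Row-based predicate: the function receives one row at a time.
by_row = filter(row -> row.continent in americas, toy)

# Column-based predicate: `in(americas)` is a function that tests membership.
by_col = filter(:continent => in(americas), toy)
```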

In [20]:
america_df = filter(row -> (row[:continent] == "North America") || (row[:continent] == "South America"), cases_df)

| Row | iso_code | continent | location | date | total_cases | new_cases |
| --- | --- | --- | --- | --- | --- | --- |
| | String15 | String15 | String | Date | Float64 | Float64 |
| 1 | ATG | North America | Antigua and Barbuda | 2021-05-17 | 1251.0 | 10.0 |
| 2 | ATG | North America | Antigua and Barbuda | 2021-05-21 | 1255.0 | 0.0 |
| 3 | ATG | North America | Antigua and Barbuda | 2021-05-25 | 1258.0 | 0.0 |
| 4 | ATG | North America | Antigua and Barbuda | 2021-05-26 | 1258.0 | 0.0 |
| 5 | ATG | North America | Antigua and Barbuda | 2021-05-27 | 1258.0 | 0.0 |
| 6 | ATG | North America | Antigua and Barbuda | 2021-05-29 | 1259.0 | 0.0 |
| 7 | ATG | North America | Antigua and Barbuda | 2021-05-31 | 1260.0 | 1.0 |
| 8 | ATG | North America | Antigua and Barbuda | 2021-06-03 | 1262.0 | 0.0 |
| 9 | ATG | North America | Antigua and Barbuda | 2021-06-06 | 1263.0 | 0.0 |
| 10 | ATG | North America | Antigua and Barbuda | 2021-06-07 | 1263.0 | 0.0 |


In [21]:
max_cases = groupby(america_df, :location)
combine(max_cases, "new_cases" => maximum)

| Row | location | new_cases_maximum |
| --- | --- | --- |
| | String | Float64 |
| 1 | Antigua and Barbuda | 419.0 |
| 2 | Argentina | 139853.0 |
| 3 | Aruba | 1162.0 |
| 4 | Bahamas | 315.0 |
| 5 | Barbados | 1329.0 |
| 6 | Belize | 794.0 |
| 7 | Bermuda | 661.0 |
| 8 | Bolivia | 23611.0 |
| 9 | Brazil | 287149.0 |
| 10 | Canada | 63678.0 |


In [39]:
min_cases = groupby(america_df, :location)
combine(min_cases, :new_cases => minimum)

| Row | location | new_cases_minimum |
| --- | --- | --- |
| | String | Float64 |
| 1 | Antigua and Barbuda | 0.0 |
| 2 | Argentina | 0.0 |
| 3 | Aruba | 0.0 |
| 4 | Bahamas | 0.0 |
| 5 | Barbados | 0.0 |
| 6 | Belize | 0.0 |
| 7 | Bermuda | 0.0 |
| 8 | Bolivia | 0.0 |
| 9 | Brazil | 0.0 |
| 10 | Canada | 0.0 |


In [23]:
max_deaths = groupby(america_df, :location)
combine(max_deaths, "total_deaths" => maximum => "total_deaths_per_country")

| Row | location | total_deaths_per_country |
| --- | --- | --- |
| | String | Float64 |
| 1 | Antigua and Barbuda | 140.0 |
| 2 | Argentina | 129830.0 |
| 3 | Aruba | 271.0 |
| 4 | Bahamas | 823.0 |
| 5 | Barbados | 559.0 |
| 6 | Belize | 680.0 |
| 7 | Bermuda | 140.0 |
| 8 | Bolivia | 22177.0 |
| 9 | Brazil | 685203.0 |
| 10 | Canada | 44662.0 |


In [24]:
new_cases_per_million = groupby(america_df, :location)
combine(new_cases_per_million, "new_cases_per_million" => median)

| Row | location | new_cases_per_million_median |
| --- | --- | --- |
| | String | Float64 |
| 1 | Antigua and Barbuda | 0.0 |
| 2 | Argentina | 132.529 |
| 3 | Aruba | 150.184 |
| 4 | Bahamas | 34.322 |
| 5 | Barbados | 522.76 |
| 6 | Belize | 153.738 |
| 7 | Bermuda | 0.0 |
| 8 | Bolivia | 55.714 |
| 9 | Brazil | 156.43 |
| 10 | Canada | 95.636 |


In [40]:
gdp_analysis = combine(groupby(america_df, :location), "total_deaths_per_million" => mean, "gdp_per_capita" => mean)

| Row | location | total_deaths_per_million_mean | gdp_per_capita_mean |
| --- | --- | --- | --- |
| | String | Float64 | Float64 |
| 1 | Antigua and Barbuda | 769.282 | 21490.9 |
| 2 | Argentina | 2275.03 | 18933.9 |
| 3 | Aruba | 1596.83 | 35973.8 |
| 4 | Bahamas | 1353.69 | 27717.8 |
| 5 | Barbados | 876.544 | 16978.1 |
| 6 | Belize | 1165.61 | 7824.36 |
| 7 | Bermuda | 1004.07 | 50669.3 |
| 8 | Bolivia | 1549.42 | 6885.83 |
| 9 | Brazil | 2592.49 | 14103.5 |
| 10 | Canada | 775.443 | 44017.6 |


Is there a linear relationship between the **GDP per capita** and the **total deaths per million**?

Let's compute some statistics and take a look at the correlation matrix.

> Correlation is a statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate).

- `Matrix`: Converts a dataframe into a matrix object.
- `cor`: Returns the Pearson correlation matrix between the columns of the given data matrix.
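
To make the definition concrete, here is a minimal sketch that computes the Pearson coefficient by hand (covariance divided by the product of the standard deviations) and compares it with `cor`. The vectors `x` and `y` are made-up values, not columns from the dataset:

```julia
using Statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

# Pearson correlation from its definition.
r_manual = cov(x, y) / (std(x) * std(y))

# The built-in function on two vectors returns a single coefficient.
r_builtin = cor(x, y)

# On a matrix, `cor` correlates every pair of columns,
# so the diagonal is 1 and the matrix is symmetric.
C = cor(hcat(x, y))
```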

In [26]:
gdp_analysis_matrix = gdp_analysis[:, [:total_deaths_per_million_mean, :gdp_per_capita_mean]]
cor(Matrix(gdp_analysis_matrix))

2×2 Matrix{Float64}:
 1.0        0.0312768
 0.0312768  1.0

Note that there is no evidence of a linear relationship between `gdp_per_capita` and `total_deaths_per_million`: the correlation coefficient (about 0.03) is close to zero.
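
A correlation near zero only speaks about linear association. As a brief sketch with synthetic values (not from the dataset), two variables can be perfectly dependent and still have a correlation of essentially zero:

```julia
using Statistics

x = collect(-5.0:1.0:5.0)
y = x .^ 2    # y is fully determined by x, but the relationship is not linear

# The symmetric parabola has no linear trend, so the coefficient is (numerically) zero.
r = cor(x, y)
```

For that reason it is usually worth plotting the two variables before concluding that they are unrelated.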

Are there linear relationships among the numeric variables of our dataframe?

Compute the correlation matrix of the numeric columns and see:

In [27]:
america_df_matrix = america_df[:, numeric_columns]
cor(Matrix(america_df_matrix))

11×11 Matrix{Float64}:
  1.0         0.547456    0.369614   …  -0.00802082   0.930433    0.922082
  0.547456    1.0         0.164437       0.0732822    0.512536    0.504659
  0.369614    0.164437    1.0           -0.0195805    0.320807    0.333145
  0.529093    0.360233    0.339168       0.00751575   0.47609     0.475993
  0.0401668   0.265761    0.269526       0.184045     0.0134125   0.0181569
  0.930491    0.528857    0.289463   …   0.00462507   0.957009    0.932798
  0.47292     0.622901    0.0531288      0.203894     0.40829     0.367467
  0.254573    0.108984    0.361777       0.0222548    0.307137    0.307339
 -0.00802082  0.0732822  -0.0195805      1.0         -0.0420967  -0.0502304
  0.930433    0.512536    0.320807      -0.0420967    1.0         0.991881
  0.922082    0.504659    0.333145   …  -0.0502304    0.991881    1.0