### Init

Import data


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from src.linear_regression import read_data_set

# Load the dataset into a DataFrame
df = read_data_set()

### Dataset characteristics:

| Feature Name                        | Definition                                                          | Unit                         |
| ----------------------------------- | ------------------------------------------------------------------- | ---------------------------- |
| **Country**                         | Name of the country (`193` unique countries)                        | N/A                          |
| **Year**                            | Year of the observation `[2000:2015]`                               | N/A                          |
| **Status**                          | Development status of the country (Developed/Developing)            | N/A                          |
| **Life expectancy**                 | Average number of years a newborn is expected to live               | Years                        |
| **Adult Mortality**                 | Number of adult deaths per 1,000 population aged 15-60              | Deaths per 1,000 adults      |
| **Infant deaths**                   | Number of infant deaths per 1,000 live births                       | Deaths per 1,000 live births |
| **Alcohol**                         | Per capita alcohol consumption (aged 15+) in liters per year        | Liters per capita            |
| **Percentage expenditure**          | Government health expenditure as a percentage of total expenditure  | Percentage                   |
| **Hepatitis B**                     | Percentage of 1-year-old children immunized against Hepatitis B     | Percentage                   |
| **Measles**                         | Number of reported measles cases per 1,000 population               | Cases per 1,000 population   |
| **BMI**                             | Average Body Mass Index of the population                           | kg/m²                        |
| **Under-five deaths**               | Number of deaths of children under 5 years per 1,000 live births    | Deaths per 1,000 live births |
| **Polio**                           | Percentage of 1-year-old children immunized against Polio           | Percentage                   |
| **Total expenditure**               | Total health expenditure as a percentage of GDP                     | Percentage of GDP            |
| **Diphtheria**                      | Percentage of 1-year-old children immunized against Diphtheria      | Percentage                   |
| **HIV/AIDS**                        | Deaths per 1,000 live births due to HIV/AIDS among children under 5 | Deaths per 1,000 live births |
| **GDP**                             | Gross Domestic Product per capita                                   | USD                          |
| **Population**                      | Total population                                                    | Number of people             |
| **Thinness 1-19 years**             | Percentage of the population aged 1-19 with BMI < 18.5              | Percentage                   |
| **Thinness 5-9 years**              | Percentage of the population aged 5-9 with BMI < 18.5               | Percentage                   |
| **Income composition of resources** | Human Development Index in terms of income composition              | Index (0 to 1)               |
| **Schooling**                       | Average number of years of schooling                                | Years                        |


### Unreal data

Assign `NaN` to values that are implausible or fall outside valid bounds.


In [49]:
df.describe(exclude=["object"]).T

| Feature                | count  | mean     | std      | min    | 25%      | 50%      | 75%      | max      |
| ---------------------- | ------ | -------- | -------- | ------ | -------- | -------- | -------- | -------- |
| Year                   | 2938.0 | 2007.519 | 4.613841 | 2000.0 | 2004.0   | 2008.0   | 2012.0   | 2015.0   |
| Life expectancy        | 2928.0 | 69.22493 | 9.523867 | 36.3   | 63.1     | 72.1     | 75.7     | 89.0     |
| Adult Mortality        | 2928.0 | 164.7964 | 124.2921 | 1.0    | 74.0     | 144.0    | 228.0    | 723.0    |
| infant deaths          | 2938.0 | 30.30395 | 117.9265 | 0.0    | 0.0      | 3.0      | 22.0     | 1800.0   |
| Alcohol                | 2744.0 | 4.602861 | 4.052413 | 0.01   | 0.8775   | 3.755    | 7.7025   | 17.87    |
| percentage expenditure | 2938.0 | 738.2513 | 1987.915 | 0.0    | 4.685343 | 64.91291 | 441.5341 | 19479.91 |
| Hepatitis B            | 2385.0 | 80.94046 | 25.07002 | 1.0    | 77.0     | 92.0     | 97.0     | 99.0     |
| Measles                | 2938.0 | 2419.592 | 11467.27 | 0.0    | 0.0      | 17.0     | 360.25   | 212183.0 |
| BMI                    | 2904.0 | 38.32125 | 20.04403 | 1.0    | 19.3     | 43.5     | 56.2     | 87.3     |
| under-five deaths      | 2938.0 | 42.03574 | 160.4455 | 0.0    | 0.0      | 4.0      | 28.0     | 2500.0   |


Most values look reasonable at a glance, but several columns need closer scrutiny or cleaning based on domain knowledge:

- Percentage expenditure (0.0 - 19,479.9): A percentage cannot exceed 100, so values above 100 are not valid under the stated definition.
- Measles (0 - 212,183): The column is defined as cases per 1,000 population, so a rate cannot exceed 1,000. Values this high look like raw case counts, which may be accurate for countries with large outbreaks but do not match the stated unit.
- BMI (1.0 - 87.3): A population-average BMI of 1.0 is not biologically plausible, and 87.3 is also highly unlikely.
- GDP (298.0 - 119,172.0): The maximum of 119,172 USD per capita may be an anomaly, or it may reflect a few countries with unusual economic characteristics.
- Thinness (0.1 - 63.5): Percentages as high as 63.5% within an age group may reflect areas with extreme malnutrition, so they are not necessarily errors.
- Population (34.0 - 1.3e9): A population of 34 is too small for any country, even the smallest island nations or territories, and most likely indicates a recording error.
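The checks listed above can be turned into a masking step. The following is a minimal sketch, not the notebook's actual cleaning code: the helper `mask_out_of_range`, the `BOUNDS` dictionary, and the specific limits are assumptions chosen for illustration, and the toy frame stands in for `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical plausibility bounds (inclusive). These limits are assumptions
# for illustration, not values taken from the dataset documentation.
BOUNDS = {
    "percentage expenditure": (0.0, 100.0),  # a percentage cannot exceed 100
    "BMI": (10.0, 60.0),                     # rough range for a population average
}


def mask_out_of_range(frame, bounds):
    """Return a copy of `frame` with values outside [low, high] set to NaN."""
    cleaned = frame.copy()
    for column, (low, high) in bounds.items():
        if column not in cleaned.columns:
            continue  # skip bounds for columns that are absent from this frame
        in_range = cleaned[column].between(low, high)
        # where() keeps values where the condition is True and writes NaN elsewhere
        cleaned[column] = cleaned[column].where(in_range)
    return cleaned


# Toy frame standing in for `df`; the rows are made up for the demo.
toy = pd.DataFrame(
    {
        "percentage expenditure": [4.7, 64.9, 19479.9],
        "BMI": [1.0, 43.5, 87.3],
        "Year": [2000, 2008, 2015],
    }
)
cleaned = mask_out_of_range(toy, BOUNDS)

# Count how many cells were replaced, to see how much data the bounds discard.
n_masked = int(cleaned.isna().sum().sum())
```

Working on a copy leaves the original frame intact, so the number of masked cells can be reviewed before committing to a set of bounds. Columns without an entry in `BOUNDS` (here `Year`) pass through unchanged.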
