- In this study, we perform Exploratory Data Analysis (EDA) on the Life Expectancy dataset.
- The study aims to be beginner friendly and to explain each step along the way as fully as possible.
- The dataset has 3253 rows, each with a country name, a year, and a life expectancy value.
- It covers life expectancy from 1800 to 2016.

- First things first: let's import the libraries we need for the analysis.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt 
import seaborn as sns 
import matplotlib as mpl

import plotly.io as pio
pio.renderers.default = 'iframe'


import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

In [2]:
df = pd.read_csv('Life expectancy.csv')
df.head()

Unnamed: 0,Entity,Year,Life expectancy
0,Australia,1802,34.049999
1,Australia,1803,34.049999
2,Australia,1804,34.049999
3,Australia,1805,34.049999
4,Australia,1806,34.049999


In [3]:
df.shape

(3253, 3)

In [4]:
df.isnull().sum()

Entity             0
Year               0
Life expectancy    0
dtype: int64

In [5]:
df['Year'].unique()

array([1802, 1803, 1804, 1805, 1806, 1807, 1808, 1809, 1810, 1811, 1812,
       1813, 1814, 1815, 1816, 1817, 1818, 1819, 1820, 1821, 1822, 1823,
       1824, 1825, 1826, 1827, 1828, 1829, 1830, 1831, 1832, 1833, 1834,
       1835, 1836, 1837, 1838, 1839, 1840, 1841, 1842, 1843, 1844, 1845,
       1846, 1847, 1848, 1849, 1850, 1851, 1852, 1853, 1854, 1855, 1856,
       1857, 1858, 1859, 1860, 1861, 1862, 1863, 1864, 1865, 1866, 1867,
       1868, 1869, 1870, 1871, 1872, 1873, 1874, 1875, 1876, 1877, 1878,
       1879, 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889,
       1890, 1891, 1892, 1893, 1894, 1895, 1896, 1897, 1898, 1899, 1900,
       1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911,
       1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921, 1922,
       1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933,
       1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944,
       1945, 1946, 1947, 1948, 1949, 1950, 1951, 19

- We have data from 1800 to 2016.
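`unique()` lists years in order of first appearance, so the printed array is not the quickest way to confirm the range or to spot gaps. A minimal sketch of a more direct check, run here on a few made-up year values (`sample_years` is a hypothetical stand-in for `df['Year']`, not the real column):

```python
import pandas as pd

# Hypothetical stand-in for df['Year']; these values are illustrative only.
sample_years = pd.Series([1802, 1803, 1805, 1800], name="Year")

first_year = int(sample_years.min())
last_year = int(sample_years.max())

# Every year we would expect to see if the range had no gaps
expected_years = set(range(first_year, last_year + 1))
missing_years = sorted(expected_years - set(sample_years.tolist()))

print(first_year, last_year, missing_years)
```

In the notebook itself, the same lines would work with `df['Year']` in place of `sample_years`.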

In [8]:
df['Entity'].unique()

array(['Australia', 'Brazil', 'Canada', 'China', 'France', 'Germany',
       'India', 'Italy', 'Japan', 'Mexico', 'Russia', 'Spain',
       'Switzerland', 'United Kingdom', 'United States'], dtype=object)

- We also have 15 countries in the dataset.
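The first rows of the table show Australia starting in 1802, so it may be worth checking whether every country covers the same years. A sketch of one way to do that, on a small stand-in frame (the rows reuse values quoted in this notebook; `sample` and `coverage` are names chosen for illustration):

```python
import pandas as pd

# Stand-in for df: a few rows quoted in this notebook, not the full dataset.
sample = pd.DataFrame({
    "Entity": ["Australia", "Australia", "Canada"],
    "Year": [1802, 1803, 1800],
    "Life expectancy": [34.05, 34.05, 39.0],
})

# Number of distinct countries
n_countries = sample["Entity"].nunique()

# First year, last year, and number of records per country
coverage = sample.groupby("Entity")["Year"].agg(["min", "max", "count"])

print(n_countries)
print(coverage)
```

Run on the real `df`, the `count` column would show at a glance which countries have fewer records than the others.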

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3253 entries, 0 to 3252
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Entity           3253 non-null   object 
 1   Year             3253 non-null   int64  
 2   Life expectancy  3253 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 76.4+ KB


- Everything looks fine: there are no missing values and each column has a sensible dtype.

- Let's move on to the analysis part.

### Analysis Part

#### **Life Expectancy**

In [11]:
df['Life expectancy'].describe()

count    3253.000000
mean       48.680380
std        17.965669
min         8.108836
25%        32.000000
50%        41.880001
75%        66.820000
max        83.940002
Name: Life expectancy, dtype: float64

- This is as expected.
- The data spans 216 years and countries on several continents, so a skewed distribution is not surprising.
- The mean (48.68) is well above the median (41.88), which points to a right-skewed distribution.
- Let's look at it.
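The expectation of right skew can be given a rough number before plotting. The sketch below uses only the `describe()` values printed above and Pearson's second skewness coefficient, which is a rule-of-thumb summary rather than the sample skewness itself:

```python
# Values copied from the describe() output above
mean = 48.680380
median = 41.880001
std = 17.965669

# Pearson's second skewness coefficient: 3 * (mean - median) / std
# A positive value suggests a longer tail on the right.
pearson_skew = 3 * (mean - median) / std

print(round(pearson_skew, 2))
```

With the full data loaded, `df['Life expectancy'].skew()` returns the sample skewness directly, which should be the more reliable figure.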

In [12]:
fig = px.histogram(df, x='Life expectancy', title='Life Expectancy', marginal='box', hover_data=['Entity', 'Year'])
fig.show()

- As expected, even though the median is almost 42 years, the density of the distribution picks up again at around 50 years.
- There are also quite a lot of records far above the median life expectancy.

### Life Expectancy By Year

In [16]:
px.scatter(df, x="Year", y="Life expectancy", color='Entity')

- There is quite a lot of information in this scatter plot.

- After World War 2, we can see a significant increase in all the countries in the dataset.

- Let's take a closer look at the countries by continent.
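The sections that follow build each group with an `isin` filter. As an alternative sketch, the same grouping could be written once as a dictionary and applied with `map`. The `region_of` dictionary is a hypothetical helper, filled in from the country lists used in the Europe, Asia, and America & Australia sections:

```python
import pandas as pd

# Hypothetical helper: country -> group, following the sections of this notebook
region_of = {
    "France": "Europe", "Germany": "Europe", "Italy": "Europe",
    "Spain": "Europe", "Switzerland": "Europe", "United Kingdom": "Europe",
    "China": "Asia", "India": "Asia", "Japan": "Asia", "Russia": "Asia",
    "Australia": "America & Australia", "Brazil": "America & Australia",
    "Canada": "America & Australia", "United States": "America & Australia",
}

# The 15 entities printed by df['Entity'].unique()
entities = pd.Series([
    "Australia", "Brazil", "Canada", "China", "France", "Germany",
    "India", "Italy", "Japan", "Mexico", "Russia", "Spain",
    "Switzerland", "United Kingdom", "United States",
], name="Entity")

regions = entities.map(region_of)             # NaN where a country has no group
unmapped = entities[regions.isna()].tolist()  # countries left out of every group
counts = regions.value_counts()               # number of countries per group

print(unmapped)
print(counts)
```

On the real data, `df['Region'] = df['Entity'].map(region_of)` followed by `df.groupby('Region')['Life expectancy'].describe()` would give the per-group summaries in one table. The `unmapped` check is a cheap way to notice a country that no group picks up.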


### **Europe**

In [46]:
df_Europe = df[df['Entity'].isin(['France', 'Germany','Italy','Spain','Switzerland', 'United Kingdom'])].copy()
df_Europe["Life expectancy"].describe()

count    1302.000000
mean       52.370963
std        17.403136
min        25.620001
25%        38.369999
50%        46.310000
75%        70.629736
max        83.180000
Name: Life expectancy, dtype: float64

#### Life Expectancy By Year in Europe

In [21]:
px.scatter(df_Europe, x="Year", y="Life expectancy", color='Entity')

- It is much clearer now that, in Europe, life expectancy hit its lowest points during World War 1 and World War 2.
- After World War 2, life expectancy increases steadily.
- Let's take a closer look at three of the countries from the plot above.

#### **Germany**

In [25]:
germany = df_Europe[df_Europe['Entity']=='Germany'].copy()

px.scatter(germany, x="Year", y="Life expectancy", color='Entity')

- During World War 1:
  - **Germany**:
    - Life expectancy in
      - 1913 was 50.13,
      - 1914 was 46.17,
      - 1915 was 40.15,
      - 1916 was 39.98,
      - 1917 was 40.08,
      - and in 1918, at the end of World War 1, was **32.93**.

- During World War 2:
  - **Germany**:
    - Life expectancy in
      - 1939 was 61,
      - 1940 was 60.65,
      - 1941 was 59,
      - 1942 was 55,
      - 1943 was 49.7,
      - 1944 was **37**,
      - and in 1945, at the end of World War 2, was **29**.
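Reading the values off the plot works, but the year-to-year change can also be computed. A minimal sketch using only Germany's World War 2 figures quoted above (`germany_ww2` is a hand-built series for illustration, not a slice of `df`):

```python
import pandas as pd

# Germany's World War 2 values as quoted in the bullets above
germany_ww2 = pd.Series(
    [61, 60.65, 59, 55, 49.7, 37, 29],
    index=pd.Index(range(1939, 1946), name="Year"),
    name="Life expectancy",
)

# Change versus the previous year (the first year has no predecessor, so NaN)
yearly_change = germany_ww2.diff()

# Year with the largest single-year drop
worst_year = int(yearly_change.idxmin())

# Overall change between 1939 and 1945
total_change = germany_ww2.loc[1945] - germany_ww2.loc[1939]

print(worst_year, round(yearly_change.loc[worst_year], 2), total_change)
```

The same `diff()` call on the `germany` frame, after `set_index('Year')`, would cover the whole period rather than just the war years.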

#### **France**

In [26]:
france = df_Europe[df_Europe['Entity']=='France'].copy()

px.scatter(france, x="Year", y="Life expectancy", color='Entity')

- During World War 1:
  - **France**:
    - Life expectancy in
      - 1913 was 51.35,
      - 1914 was 37.85,
      - 1915 was **35.63**,
      - 1916 was 39.81,
      - 1917 was 42.06,
      - and in 1918, at the end of World War 1, was **34.34**.

- During World War 2:
  - **France**:
    - Life expectancy in
      - 1939 was 59.62,
      - 1940 was **49.45**,
      - 1941 was 57.67,
      - 1942 was 57.44,
      - 1943 was 53.33,
      - 1944 was **47.19**,
      - and in 1945 was 54.96.
     

#### **Italy**

In [27]:
italy = df_Europe[df_Europe['Entity']=='Italy'].copy()

px.scatter(italy, x="Year", y="Life expectancy", color='Entity')

- During World War 1:
  - **Italy**:
    - Life expectancy in
      - 1913 was 48.46,
      - 1914 was 48.89,
      - 1915 was 42.34,
      - 1916 was 39.27,
      - 1917 was 37.69,
      - and in 1918, at the end of World War 1, was **25.62**.

- During World War 2:
  - **Italy**:
    - Life expectancy in
      - 1939 was 57.64,
      - 1940 was 56.95,
      - 1941 was 54.62,
      - 1942 was 52.45,
      - 1943 was **49.19**,
      - 1944 was 52.34,
      - and in 1945, at the end of World War 2, was 54.78.
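The lowest point for each country can also be found programmatically instead of by hovering over the plots. A sketch using the World War 1 figures quoted above for Germany, France, and Italy (`ww1` is a hand-built frame for illustration only):

```python
import pandas as pd

years = [1913, 1914, 1915, 1916, 1917, 1918]

# World War 1 values as quoted in the bullets above
ww1 = pd.DataFrame({
    "Entity": ["Germany"] * 6 + ["France"] * 6 + ["Italy"] * 6,
    "Year": years * 3,
    "Life expectancy": [
        50.13, 46.17, 40.15, 39.98, 40.08, 32.93,   # Germany
        51.35, 37.85, 35.63, 39.81, 42.06, 34.34,   # France
        48.46, 48.89, 42.34, 39.27, 37.69, 25.62,   # Italy
    ],
})

# Row label of the minimum within each country, then look those rows up
lowest_idx = ww1.groupby("Entity")["Life expectancy"].idxmin()
lowest = ww1.loc[lowest_idx].set_index("Entity")

print(lowest)
```

Applied to `df_Europe`, the same two lines would return the lowest year for all six European countries at once.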
      

### **Asia**

In [47]:
df_Asia = df[df['Entity'].isin(['China', 'India', 'Japan','Russia'])].copy()
df_Asia["Life expectancy"].describe()

count    868.000000
mean      42.021026
std       17.362311
min        8.108836
25%       29.573458
50%       35.971498
75%       56.942999
max       83.940002
Name: Life expectancy, dtype: float64

#### Life Expectancy By Year in Asia

In [35]:
px.scatter(df_Asia, x="Year", y="Life expectancy", color='Entity')

- Until World War 2, life expectancy at times differs significantly among the countries in the Asia sample.
- For example, apart from 1918, World War 1 made no visible change to life expectancy in **Japan**, which continued to rise until World War 2.
- For **Russia**, the picture is more complicated: life expectancy is quite erratic, especially between 1867 and 1946.
- For **India**, on the other hand, World War 2 made only a very slight change compared to other countries.
- **China** has its own story: quite significant changes in life expectancy can be observed in 1850-1866, 1875-1882, 1926-1930, during World War 2, and in 1958-1963.

- Let's see all of them in the plots.

#### **China**

In [36]:
china = df_Asia[df_Asia['Entity']=='China'].copy()

px.scatter(china, x="Year", y="Life expectancy", color='Entity')

- The lowest life expectancy was **22**, in 1918.

#### **India**

In [37]:
india = df_Asia[df_Asia['Entity']=='India'].copy()

px.scatter(india, x="Year", y="Life expectancy", color='Entity')

- The lowest life expectancy was **8**, in 1918.
- That is quite a striking number, and worth checking against other sources.

#### **Japan**

In [38]:
japan = df_Asia[df_Asia['Entity']=='Japan'].copy()

px.scatter(japan, x="Year", y="Life expectancy", color='Entity')

- The lowest life expectancy was **30.5**, in 1945.

#### **Russia**

In [39]:
russia = df_Asia[df_Asia['Entity']=='Russia'].copy()

px.scatter(russia, x="Year", y="Life expectancy", color='Entity')

- The lowest life expectancy was **16**, in 1943.

### **America & Australia**

In [50]:
df_America_Australia = df[df['Entity'].isin(['Australia', 'Brazil', 'Canada', 'United States'])].copy()
df_America_Australia["Life expectancy"].describe()

count    866.000000
mean      51.772170
std       16.803706
min       26.978191
25%       36.654560
50%       46.924160
75%       69.095001
max       82.580002
Name: Life expectancy, dtype: float64

In [41]:
px.scatter(df_America_Australia, x="Year", y="Life expectancy", color='Entity')

- Compared to Europe and Asia, life expectancy follows quite a different pattern here.
- In the **United States**, we can observe significant changes in life expectancy during the Civil War (1861-1865). There is also a one-off dip in 1918, when life expectancy goes down to 47.2.
- **Canada**, quite a beautiful country in every sense, shows a consistent increase in life expectancy. Only in 1918 does it dip, to 47.16.
- **Brazil** shows a consistent increase in life expectancy after 1928. In 1918, life expectancy goes down to 26.97.
- **Australia** is quite similar to Canada: everything goes smoothly, with a single dip in 1918, to 54.8.
- Let's see all of them in the plots.
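Since 1918 keeps coming up as a low point, it can help to line the countries up for that single year. A sketch using only figures quoted in this notebook (some of them are rounded in the text; `quoted` is a hand-built frame, not the real dataset):

```python
import pandas as pd

# (Entity, Year, Life expectancy) as quoted in this notebook; not the full data
quoted = pd.DataFrame(
    [
        ("Germany", 1918, 32.93), ("France", 1918, 34.34), ("Italy", 1918, 25.62),
        ("China", 1918, 22.0), ("India", 1918, 8.0),
        ("United States", 1918, 47.2), ("Canada", 1918, 47.16),
        ("Brazil", 1918, 26.97), ("Australia", 1918, 54.8),
        ("Germany", 1913, 50.13), ("France", 1913, 51.35), ("Italy", 1913, 48.46),
    ],
    columns=["Entity", "Year", "Life expectancy"],
)

# Keep a single year and sort from lowest to highest life expectancy
in_1918 = quoted[quoted["Year"] == 1918].sort_values("Life expectancy")
ranking = in_1918["Entity"].tolist()

print(in_1918.to_string(index=False))
```

On the full data, `df[df['Year'] == 1918].sort_values('Life expectancy')` would give the same cross-section for every country that has a 1918 record.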

#### **United States**

In [42]:
US = df_America_Australia[df_America_Australia['Entity']=='United States'].copy()

px.scatter(US, x="Year", y="Life expectancy", color='Entity')

- The lowest life expectancy was **31**, in 1864.

#### **Canada**

In [43]:
canada = df_America_Australia[df_America_Australia['Entity']=='Canada'].copy()

px.scatter(canada, x="Year", y="Life expectancy", color='Entity')

- The lowest life expectancy was **39**, in 1800.

#### **Brazil**

In [45]:
brazil = df_America_Australia[df_America_Australia['Entity']=='Brazil'].copy()

px.scatter(brazil, x="Year", y="Life expectancy", color='Entity')

- The lowest life expectancy was **26.97**, in 1918.



- Thanks to the dataset contributor for this data. I really enjoyed working on it.

- It was quite a pleasure to share this detailed, beginner-friendly EDA with you. Thanks for your time.

- All the best 