# <b> Deliverable #3: Data Storage and Analysis </b>
#### Minh Le
#### CS 181/DA 210
#### Dr. Amert
#### April 21st, 2023
---


#### Import Packages

In [1]:
import pandas as pd
import os
import os.path
import json
import sqlalchemy as sa
import matplotlib.pyplot as pyplot
import plotly.express as px


#### Set Credentials

In [2]:
def getsqlite_creds(dirname=".",filename="creds.json",source="sqlite"):
    """ Using directory and filename parameters, open a credentials file
        and obtain the three parts needed for a connection string to
        a local provider using the `source` dictionary within
        an outer dictionary.

        Return a scheme, a database directory, and a database name
    """
    assert os.path.isfile(os.path.join(dirname, filename))
    with open(os.path.join(dirname, filename)) as f:
        D = json.load(f)
    sqlite = D[source]
    return sqlite["scheme"], sqlite["dbdir"], sqlite["database"]

In [3]:
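def buildConnectionString(source="sqlite_country"):
    """ Build a SQLite connection string of the form
        scheme:///dbdir/database.db from the credentials entry `source`
    """
    scheme, dbdir, database = getsqlite_creds(source=source)
    template = '{}:///{}/{}.db'
    return template.format(scheme, dbdir, database)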
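
As a rough illustration of what the two functions above expect, the sketch below writes a hypothetical `creds.json` to a temporary directory and formats the same template. The key names (`scheme`, `dbdir`, `database`) come from `getsqlite_creds`; the values are assumptions inferred from the connection string printed later in this notebook, and `build_connection_string_from` is an illustrative helper, not part of the original code.

```python
import json
import os
import tempfile

# Hypothetical credentials layout. The keys mirror those read by
# getsqlite_creds; the values are assumptions based on the printed
# connection string, not the author's actual file.
example_creds = {
    "sqlite_country": {"scheme": "sqlite", "dbdir": ".", "database": "country"}
}

def build_connection_string_from(creds, source="sqlite_country"):
    """Format a SQLite connection string from one entry of a creds dict."""
    entry = creds[source]
    return "{}:///{}/{}.db".format(entry["scheme"], entry["dbdir"], entry["database"])

# Round-trip through a temporary JSON file to mimic reading creds.json.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "creds.json")
    with open(path, "w") as f:
        json.dump(example_creds, f)
    with open(path) as f:
        loaded = json.load(f)

example_cstring = build_connection_string_from(loaded)
print(example_cstring)  # sqlite:///./country.db
```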

---
## <b> Part 1: Database Summary </b>
Our database contains one table named `indicators`. The primary key of this table is `country_and_area`, a single-attribute key. The functional dependency is `country_and_area -> region, gdp_per_capita, life`.
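
One way to check that this dependency holds is to confirm that no `country_and_area` value appears in more than one row. The sketch below does this against an in-memory SQLite table filled with the five sample rows that `df1.head()` displays in Part 2. The column types and the `PRIMARY KEY` declaration are assumptions for illustration; the actual schema of `country.db` is not shown in this notebook.

```python
import sqlite3

# In-memory stand-in for the indicators table (schema is assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE indicators ("
    "country_and_area TEXT PRIMARY KEY, region TEXT, "
    "gdp_per_capita REAL, life REAL)"
)
rows = [
    ("Aruba", "Latin America & Caribbean", 29342.100858, 74.6),
    ("Afghanistan", "South Asia", 368.754614, 62.0),
    ("Angola", "Sub-Saharan Africa", 1953.533757, 61.6),
    ("Albania", "Europe & Central Asia", 6492.872012, 76.5),
    ("Andorra", "Europe & Central Asia", 42137.327271, 80.4),
]
conn.executemany("INSERT INTO indicators VALUES (?, ?, ?, ?)", rows)

# A key value that appears more than once would violate the dependency.
violations = conn.execute(
    "SELECT country_and_area, COUNT(*) "
    "FROM indicators GROUP BY country_and_area HAVING COUNT(*) > 1"
).fetchall()
print("Duplicate keys:", violations)

# The PRIMARY KEY constraint also rejects a second row for the same country.
try:
    conn.execute("INSERT INTO indicators VALUES ('Aruba', 'South Asia', 1.0, 1.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print("Duplicate insert rejected:", duplicate_rejected)
conn.close()
```

An empty result for the first query means every country determines exactly one `region`, `gdp_per_capita`, and `life` value in the sample.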

---
## <b> Part 2: Data Analysis </b>

Now that our table is stored in the database, we analyze the data to answer our central question: _Does GDP per capita have a relationship with the life expectancy of countries in the world?_

In [4]:
cstring = buildConnectionString("sqlite_country")
print("Connection string:", cstring)
engine = sa.create_engine(cstring)
connection = engine.connect()

Connection string: sqlite:///./country.db


We will now query the database to examine the relationship between GDP per capita and life expectancy. First, we create a scatter plot with a regression line to visualize the connection.

In [5]:
# Select every row and column of the indicators table
query1 = """
SELECT *
FROM indicators
"""

df1 = pd.read_sql_query(query1, con=connection)
df1.head()

| | country_and_area | region | gdp_per_capita | life |
|---|---|---|---|---|
| 0 | Aruba | Latin America & Caribbean | 29342.100858 | 74.6 |
| 1 | Afghanistan | South Asia | 368.754614 | 62.0 |
| 2 | Angola | Sub-Saharan Africa | 1953.533757 | 61.6 |
| 3 | Albania | Europe & Central Asia | 6492.872012 | 76.5 |
| 4 | Andorra | Europe & Central Asia | 42137.327271 | 80.4 |


In [6]:
def create_graph(data):
    """
    Creates a plot showing the relationship between
    GDP Per Capita and Life Expectancy
    """
    fig = px.scatter(data, x="gdp_per_capita", y="life",
                     title="Relationship between GDP Per Capita and Life Expectancy",
                     hover_name="country_and_area", size_max=20,
                     trendline="ols", trendline_color_override="red")
    fig.show()

In [7]:
create_graph(df1)

**Analysis:** Based on the graph and the regression line, there appears to be a positive relationship between GDP per capita and life expectancy across countries. The points generally trend upward with the regression line, which suggests that as a country's GDP per capita increases, its citizens tend to live longer on average.
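
To put a number on the direction of this relationship, one option is the Pearson correlation coefficient. The sketch below computes it by hand for only the five sample rows displayed by `df1.head()` above, so the value describes those rows rather than the full table. The helper `pearson_r` is written here for illustration and is not part of the notebook.

```python
import math

# Five (gdp_per_capita, life) pairs copied from the df1.head() output.
sample = [
    (29342.100858, 74.6),
    (368.754614, 62.0),
    (1953.533757, 61.6),
    (6492.872012, 76.5),
    (42137.327271, 80.4),
]

def pearson_r(pairs):
    """Return the Pearson correlation coefficient of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)

r = pearson_r(sample)
print(f"Pearson r on the five sample rows: {r:.3f}")
```

A value above zero indicates that the two variables move in the same direction in the sample, which matches the upward slope of the regression line.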

We will explore this trend further by examining life expectancy in each region. Europe, North America, and Asia-Pacific are generally regarded as the regions with the highest levels of economic development, while Africa and parts of Asia and South America have lower levels. Therefore, we compare the average life expectancy of each region.

In [8]:
query2 = """
SELECT region, AVG(life) AS avg_life
FROM indicators
WHERE life IS NOT NULL
GROUP BY region
ORDER BY avg_life ASC
"""

df2 = pd.read_sql_query(query2, con=connection)
df2

| | region | avg_life |
|---|---|---|
| 0 | Sub-Saharan Africa | 61.506522 |
| 1 | South Asia | 70.525 |
| 2 | Latin America & Caribbean | 72.907317 |
| 3 | East Asia & Pacific | 73.537838 |
| 4 | Middle East & North Africa | 74.455 |
| 5 | Europe & Central Asia | 77.392857 |
| 6 | North America | 79.733333 |


In [9]:
fig = px.bar(df2, x="region", y="avg_life", 
             hover_data=['region', 'avg_life'], color='avg_life',
             labels={'region': 'Regions', 'avg_life':'Life Expectancy'},
             title="Average Life Expectancy by Region",height=600)
fig.show()

**Analysis:** The bar chart is consistent with the expectation that more economically developed regions have a longer average life expectancy than less developed ones. Indeed, North America and Europe & Central Asia are the two regions with the highest average life expectancy (79.7 and 77.4 years), while Sub-Saharan Africa has the lowest (61.5 years).

Next, we group countries by income level (above or below the average GDP per capita) to see whether the pattern differs from the regional grouping.

In [10]:
query3 = """
SELECT low_high_group, AVG(life) AS avg_life
FROM (SELECT life, country_and_area,
             gdp_per_capita > (SELECT AVG(gdp_per_capita) AS avg
                               FROM indicators
                               WHERE gdp_per_capita IS NOT NULL) AS low_high_group
      FROM indicators)
WHERE low_high_group IS NOT NULL AND life IS NOT NULL
GROUP BY low_high_group
"""

df3 = pd.read_sql_query(query3, con=connection)
df3.head()

| | low_high_group | avg_life |
|---|---|---|
| 0 | 0 | 68.562791 |
| 1 | 1 | 80.441667 |


**Analysis:** The table above shows the average life expectancy of two groups: countries below (`low_high_group = 0`) and above (`low_high_group = 1`) the average GDP per capita. The average life expectancy for the low group is 68.6 years, while for the high group it is 80.4 years. Hence, citizens of countries with above-average GDP per capita live about 12 years longer on average.
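
The grouping in `query3` relies on SQLite evaluating a comparison such as `gdp_per_capita > (...)` to the integer 1 or 0, which then serves as the group key. The sketch below reproduces that idea on an in-memory table holding only the five sample rows from `df1.head()`, so its averages differ from those in the table above. The table definition is an assumption for illustration.

```python
import sqlite3

# In-memory stand-in for the indicators table (schema is assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE indicators "
    "(country_and_area TEXT, gdp_per_capita REAL, life REAL)"
)
conn.executemany(
    "INSERT INTO indicators VALUES (?, ?, ?)",
    [
        ("Aruba", 29342.100858, 74.6),
        ("Afghanistan", 368.754614, 62.0),
        ("Angola", 1953.533757, 61.6),
        ("Albania", 6492.872012, 76.5),
        ("Andorra", 42137.327271, 80.4),
    ],
)

# The comparison yields 1 for countries above the mean GDP per capita
# and 0 otherwise; that flag becomes the GROUP BY key.
grouped = conn.execute(
    """
    SELECT low_high_group, AVG(life) AS avg_life
    FROM (SELECT life,
                 gdp_per_capita > (SELECT AVG(gdp_per_capita)
                                   FROM indicators) AS low_high_group
          FROM indicators)
    WHERE low_high_group IS NOT NULL AND life IS NOT NULL
    GROUP BY low_high_group
    ORDER BY low_high_group
    """
).fetchall()
conn.close()

for flag, avg_life in grouped:
    label = "above average" if flag == 1 else "below average"
    print(f"{label}: {avg_life:.1f} years")
```

Rows with a missing `gdp_per_capita` produce a NULL flag rather than 0 or 1, which is why the outer `WHERE` clause filters on `low_high_group IS NOT NULL`.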

In [11]:
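# Close the connection and release the engine's pooled resources
try:
    connection.close()
except Exception:
    # Ignore errors from a connection that is already closed or invalid
    pass
engine.dispose()
del engine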

## <b> Part 3: Conclusion </b>
In conclusion, the scatter plot, the bar graph, and the table above all indicate a positive relationship between GDP per capita and life expectancy. Economic prosperity appears to affect human well-being in a positive way: on average, citizens of countries with a high GDP per capita live longer than citizens of countries with a lower GDP per capita.

---

## <b> References: </b>

- GDP Per Capita dataset: https://data.worldbank.org/indicator/NY.GDP.PCAP.CD?end=2019&start=1960
- Life Expectancy dataset: https://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy#World_Health_Organization_(2019)
- Countries dataset: http://datasystems.denison.edu/data.html
