## This code builds a compilation of relevant, high-quality, and internationally comparable statistics about global development and the fight against poverty. The resulting database can be applied in the field of World Development Indicators (WDI), for example as a measure of poverty.
### Data exploration:
##### We retrieved the raw data (although they are already processed and ready to use for some applications) from the World Bank website and a Kaggle project. The sources include a .sqlite database (covering 6 CSV files, some with more than 5.5M rows of raw data), scraped CSV files, and tables from web pages.
### Data munging: the imported data had to be cleaned and then transformed into the desired format in a MongoDB database.
#### Data cleaning: 
##### We dropped some unnecessary columns, excluded some of the imported CSV files from the sqlite database, and renamed some of the column headers.
#### Data transformation: 
##### We exported the sqlite database into three Pandas DataFrames, which made the tables easy to merge, and then exported the merged DataFrame to MongoDB.

# Importing required libraries

In [1]:
from flask import Flask, render_template, jsonify, redirect
from flask_pymongo import PyMongo
from pymongo import MongoClient
import numpy as np
import pandas as pd
import datetime as dt

# Reflect Tables into SQLAlchemy ORM

In [2]:
# Python SQL toolkit and Object Relational Mapper
import sqlalchemy
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session
from sqlalchemy import create_engine, func, inspect
import sqlite3

## Connecting to the relational database 
### Source: sqlite database from Kaggle Website

In [3]:
# Path to sqlite
dbp = "../Data/WDI_Kaggle.sqlite"
engine = create_engine(f"sqlite:///{dbp}")
conn = engine.connect()
for table_name in inspect(engine).get_table_names():
    print(table_name)


Country
CountryNotes
Footnotes
Indicators
Series
SeriesNotes


## Selecting tables and exporting them to Pandas DataFrames

In [4]:
Country_df=pd.read_sql('SELECT CountryCode, Region, IncomeGroup FROM Country',conn)
Indicators_df=pd.read_sql('SELECT * FROM Indicators',conn)
Series_df=pd.read_sql('SELECT SeriesCode, Topic, LongDefinition, AggregationMethod, LimitationsAndExceptions, Source, StatisticalConceptAndMethodology FROM Series',conn)
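
Because some of the source tables hold several million rows, loading one in a single call can be heavy on memory. A minimal sketch, assuming an in-memory SQLite table with made-up rows as a stand-in for `WDI_Kaggle.sqlite`, of reading a table in pieces through the `chunksize` argument of `pd.read_sql`:

```python
import sqlite3
import pandas as pd

# Toy stand-in for the Indicators table; rows are illustrative, not real WDI data.
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute(
    "CREATE TABLE Indicators (CountryCode TEXT, IndicatorCode TEXT, Year INTEGER, Value REAL)"
)
toy_conn.executemany(
    "INSERT INTO Indicators VALUES (?, ?, ?, ?)",
    [("AAA", "X.1", 1960 + i, float(i)) for i in range(10)],
)

# With chunksize set, read_sql returns an iterator of DataFrames instead of one frame.
pieces = list(pd.read_sql("SELECT * FROM Indicators", toy_conn, chunksize=4))
piece_sizes = [len(p) for p in pieces]
print(piece_sizes)
```

Each piece can be processed or exported on its own, so the full table never has to sit in memory as a single DataFrame.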

#### The Indicators table identifies each indicator by IndicatorCode, while the Series table uses SeriesCode. First, we confirm that every IndicatorCode also appears as a SeriesCode (i.e., diff_Ind_Series is an empty list); then we merge the Series and Indicators tables on this common column.

In [5]:
series = set(Series_df.SeriesCode)
diff_Ind_Series = [x for x in Indicators_df.IndicatorCode if x not in series]
diff_Ind_Series

[]
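
The empty list shows that every `IndicatorCode` has a match in `Series`; it does not test the reverse direction. A small sketch, on made-up codes rather than the real tables, of comparing the two code sets in both directions with set differences, which also avoids scanning every row of the large table:

```python
import pandas as pd

# Hypothetical toy frames; the codes are illustrative only.
toy_indicators = pd.DataFrame({"IndicatorCode": ["A.1", "A.1", "B.2"]})
toy_series = pd.DataFrame({"SeriesCode": ["A.1", "B.2", "C.3"]})

indicator_codes = set(toy_indicators["IndicatorCode"].unique())
series_codes = set(toy_series["SeriesCode"].unique())

# Codes used in Indicators but missing from Series (these rows would be lost in an inner merge).
missing_in_series = sorted(indicator_codes - series_codes)
# Codes described in Series but never used in Indicators.
unused_in_indicators = sorted(series_codes - indicator_codes)

print(missing_in_series, unused_in_indicators)
```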

### Now, we merge the three DataFrames

In [6]:
IndCou=Indicators_df.merge(Country_df, left_on='CountryCode', right_on='CountryCode')

In [7]:
IndCouSer=IndCou.merge(Series_df, left_on='IndicatorCode', right_on='SeriesCode')

Other option: Indicators = engine.execute('SELECT * FROM Indicators join Country on Indicators.CountryCode=Country.CountryCode').fetchall()
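
The SQL alternative mentioned above can be sketched as follows. This uses the standard-library `sqlite3` module on an in-memory database with made-up rows, as a stand-in for the Kaggle file. Note that `engine.execute` is no longer available in SQLAlchemy 2.0, where statements are run on a connection obtained from `engine.connect()` instead.

```python
import sqlite3

# Toy database; table contents are illustrative, not real WDI data.
toy = sqlite3.connect(":memory:")
toy.executescript("""
CREATE TABLE Country (CountryCode TEXT, Region TEXT);
CREATE TABLE Indicators (CountryCode TEXT, IndicatorCode TEXT, Year INTEGER, Value REAL);
INSERT INTO Country VALUES ('AAA', 'Region 1'), ('BBB', 'Region 2');
INSERT INTO Indicators VALUES
    ('AAA', 'X.1', 1960, 1.5),
    ('BBB', 'X.1', 1960, 2.5),
    ('ZZZ', 'X.1', 1960, 9.0);
""")

# Inner join: indicator rows without a matching country ('ZZZ') are dropped.
joined = toy.execute(
    "SELECT i.CountryCode, i.IndicatorCode, i.Year, i.Value, c.Region "
    "FROM Indicators AS i JOIN Country AS c ON i.CountryCode = c.CountryCode "
    "ORDER BY i.CountryCode"
).fetchall()
print(joined)
```

Like the default `merge` in pandas, this join keeps only rows whose `CountryCode` exists in both tables.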

In [8]:
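# drop() returns a new DataFrame, so assign the result to keep the change
IndCouSer = IndCouSer.drop(['SeriesCode'], axis=1)
IndCouSer.head(10)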

| | CountryName | CountryCode | IndicatorName | IndicatorCode | Year | Value | Region | IncomeGroup | Topic | LongDefinition | AggregationMethod | LimitationsAndExceptions | Source | StatisticalConceptAndMethodology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Arab World | ARB | Adolescent fertility rate (births per 1,000 wo... | SP.ADO.TFRT | 1960 | 1.335609e+02 | | | Health: Reproductive health | Adolescent fertility rate is the number of bir... | Weighted average | | United Nations Population Division, World Popu... | Reproductive health is a state of physical and... |
| 1 | Arab World | ARB | Adolescent fertility rate (births per 1,000 wo... | SP.ADO.TFRT | 1961 | 1.341644e+02 | | | Health: Reproductive health | Adolescent fertility rate is the number of bir... | Weighted average | | United Nations Population Division, World Popu... | Reproductive health is a state of physical and... |
| 2 | Arab World | ARB | Adolescent fertility rate (births per 1,000 wo... | SP.ADO.TFRT | 1962 | 1.348610e+02 | | | Health: Reproductive health | Adolescent fertility rate is the number of bir... | Weighted average | | United Nations Population Division, World Popu... | Reproductive health is a state of physical and... |
| 3 | Arab World | ARB | Adolescent fertility rate (births per 1,000 wo... | SP.ADO.TFRT | 1963 | 1.345048e+02 | | | Health: Reproductive health | Adolescent fertility rate is the number of bir... | Weighted average | | United Nations Population Division, World Popu... | Reproductive health is a state of physical and... |
| 4 | Arab World | ARB | Adolescent fertility rate (births per 1,000 wo... | SP.ADO.TFRT | 1964 | 1.341035e+02 | | | Health: Reproductive health | Adolescent fertility rate is the number of bir... | Weighted average | | United Nations Population Division, World Popu... | Reproductive health is a state of physical and... |
| 5 | Arab World | ARB | Adolescent fertility rate (births per 1,000 wo... | SP.ADO.TFRT | 1965 | 1.335682e+02 | | | Health: Reproductive health | Adolescent fertility rate is the number of bir... | Weighted average | | United Nations Population Division, World Popu... | Reproductive health is a state of physical and... |
| 6 | Arab World | ARB | Adolescent fertility rate (births per 1,000 wo... | SP.ADO.TFRT | 1966 | 1.326774e+02 | | | Health: Reproductive health | Adolescent fertility rate is the number of bir... | Weighted average | | United Nations Population Division, World Popu... | Reproductive health is a state of physical and... |
| 7 | Arab World | ARB | Adolescent fertility rate (births per 1,000 wo... | SP.ADO.TFRT | 1967 | 1.316725e+02 | | | Health: Reproductive health | Adolescent fertility rate is the number of bir... | Weighted average | | United Nations Population Division, World Popu... | Reproductive health is a state of physical and... |
| 8 | Arab World | ARB | Adolescent fertility rate (births per 1,000 wo... | SP.ADO.TFRT | 1968 | 1.292034e+02 | | | Health: Reproductive health | Adolescent fertility rate is the number of bir... | Weighted average | | United Nations Population Division, World Popu... | Reproductive health is a state of physical and... |
| 9 | Arab World | ARB | Adolescent fertility rate (births per 1,000 wo... | SP.ADO.TFRT | 1969 | 1.267538e+02 | | | Health: Reproductive health | Adolescent fertility rate is the number of bir... | Weighted average | | United Nations Population Division, World Popu... | Reproductive health is a state of physical and... |


#### First, we tried to send the DataFrame directly to MongoDB as a dictionary. However, we ran into a memory issue (MemoryError below), so we decided to split the original Pandas DataFrame into 'ns' chunks and feed them to MongoDB one at a time.

In [41]:
IndCouSer.to_dict()

MemoryError: 
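
Called with no arguments, `to_dict()` builds a nested dictionary of the form `{column: {index: value}}` for the whole frame, which is memory-hungry and is not the list-of-documents shape that `insert_many` expects. A minimal sketch, on a made-up frame, of producing `orient='records'` documents one slice at a time so that only one batch exists as Python dictionaries at any moment (`record_batches` is a hypothetical helper, not part of pandas):

```python
import pandas as pd

def record_batches(df, batch_size):
    """Yield lists of row dictionaries, batch_size rows at a time."""
    for start in range(0, len(df), batch_size):
        yield df.iloc[start:start + batch_size].to_dict(orient="records")

# Toy frame; values are illustrative only.
toy_df = pd.DataFrame({"Year": [1960, 1961, 1962], "Value": [1.0, 2.0, 3.0]})
batches = list(record_batches(toy_df, batch_size=2))
batch_sizes = [len(b) for b in batches]
print(batch_sizes)
```

In real use the generator would be consumed lazily (one batch per loop iteration) rather than collected with `list`, which is done here only to inspect the result.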

# Accessing the non-relational database

In [28]:
#app = Flask(__name__)
#mongo = PyMongo(app, uri="mongodb://localhost:27017/WDI")
client = MongoClient('mongodb://localhost:27017/')
dbmongo = client.World_Development_Indicator

## Since exporting the full-size DataFrame does not work (MemoryError issue), we chunked the main DataFrame into 100 DataFrames. The cells below show the approach on one section (rows fn to ln), split into nc chunks.

In [74]:
fn=0
ln=10000

In [75]:
IndCouSer_section=IndCouSer[fn:ln]
nc=5

def chunk(df,x):
    return [ df[i::x] for i in range(x) ]
 
chunks = chunk(IndCouSer_section, nc)
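
Note that `df[i::x]` takes every `x`-th row starting at row `i`, so each chunk is an interleaved sample of the section rather than a contiguous block. A short sketch on a made-up six-row frame showing which rows land in each chunk:

```python
import pandas as pd

def chunk(df, x):
    # Same helper as above: chunk i holds rows i, i + x, i + 2x, ...
    return [df[i::x] for i in range(x)]

# Toy frame whose single column records the original row position.
toy_df = pd.DataFrame({"row": range(6)})
parts = chunk(toy_df, 3)
rows_per_chunk = [part["row"].tolist() for part in parts]
print(rows_per_chunk)
```

Every row still appears in exactly one chunk, so the interleaving should not matter when the chunks are only being loaded into a collection.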

In [76]:
col=dbmongo['WDI_general']

#b=col.insert_many(chunks[x].to_dict(orient='records') for x in range(nc))
for x in range(nc):
    a=chunks[x].to_dict(orient='records') 
    col.insert_many(a)
    print(a)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
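
The IOPub message comes from the notebook server limiting how much output a cell may send to the browser, and here it is most likely triggered by `print(a)` writing every record to the cell output rather than by `insert_many` itself. A hedged sketch of the same loop that reports only counts; `FakeCollection` and `insert_in_batches` are hypothetical names, and the fake collection is a stand-in so the example runs without a MongoDB server:

```python
class FakeCollection:
    """Hypothetical stand-in for a pymongo collection (illustration only)."""

    def __init__(self):
        self.docs = []

    def insert_many(self, documents):
        self.docs.extend(documents)


def insert_in_batches(collection, records, batch_size):
    """Insert records batch by batch and return the number inserted."""
    inserted = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        collection.insert_many(batch)
        inserted += len(batch)
        # Print a short progress line instead of the records themselves.
        print(f"inserted {inserted} of {len(records)} records")
    return inserted


# Toy records; values are illustrative only.
toy_records = [{"Year": 1960 + i, "Value": float(i)} for i in range(5)]
fake_col = FakeCollection()
total = insert_in_batches(fake_col, toy_records, batch_size=2)
```

With a real pymongo collection in place of `fake_col`, the `insert_many` call would receive the same kind of list of dictionaries.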



# At this point, we concluded that the Jupyter notebook cannot export the very large DataFrame into MongoDB. Instead, we moved the whole code into a standalone Python file (Scrape_WDI.py).