# Project Title
### Data Engineering Capstone Project

#### Project Summary
--describe your project at a high level--

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Do all imports and installs here
import pandas as pd
import matplotlib
import requests
import dask.dataframe as dd
from datetime import datetime, tzinfo
import json
from pathlib import Path
import glob
import os
import boto3
from zipfile import ZipFile

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

The project is aimed to.... TBA!!!
The data from this project will be used to create a PowerBI dashboard. 
One data source covers 278 products cultivated across the globe over the last 60 years. 
Another data source enriches the country records (total area, population, etc.).

#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? 

The data comes from the Food and Agriculture Organization of the United Nations (https://www.fao.org/home/en, CSV format) and the REST Countries project (https://restcountries.com/, API endpoint, JSON format).

TBA... describe each data source, frequency of the updates

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

### unzip data folder to have access to most data sources

In [138]:
from zipfile import ZipFile
  
# loading the data.zip and creating a zip object
with ZipFile("C:\\Users\\lopin\\Desktop\\current\\Udacity\\!DE-project06\\etl_aws_s3\\data\\data.zip", 'r') as zObject:
  
    # Extracting all the members of the zip into a specific location.
    zObject.extractall(
        path="C:\\Users\\lopin\\Desktop\\current\\Udacity\\!DE-project06\\etl_aws_s3\\data")

In [139]:
current_directory = os.getcwd()
print(current_directory)

C:\Users\lopin\Desktop\current\Udacity\!DE-project06\capstone-project-template.ipynb


> #### DATA SOURCE 1: Units.csv

- ##### read file and show first few rows

In [85]:
data_units = pd.read_csv('data/Units.csv')
data_units.head(15)

Unnamed: 0,Unit Name,Description
0,An,Animals
1,ha,Hectares
2,100 g,hundred Grams
3,100 g/An,hundred Grams per animal
4,100 g/ha,hundred Grams per hectare
5,100 g/t,hundred Grams per tonne
6,100 mg/An,hundred Milligrams per animal
7,No,Number
8,No/An,Number per animal
9,0.1 g/An,tenth Grams per animal


- ##### show some info about the dataset

In [86]:
data_units.describe()

Unnamed: 0,Unit Name,Description
count,13,13
unique,13,13
top,An,Animals
freq,1,1


- ##### check for duplicates

In [87]:
tr_units = data_units.count()
ur_units = data_units.drop_duplicates().count()

In [88]:
if tr_units['Unit Name'] == ur_units['Unit Name']:
    print(f"Data is good, there are {tr_units['Unit Name']-ur_units['Unit Name']} duplicates. In total there are {tr_units['Unit Name']} rows, \
and {ur_units['Unit Name']} unique values.")
    
elif tr_units['Unit Name'] > ur_units['Unit Name']:
    print(f"There are {tr_units['Unit Name']-ur_units['Unit Name']} duplicates in the data. \
There are {tr_units['Unit Name']} rows in total, and {ur_units['Unit Name']} unique values")
    
else:
    print(f"Something is wrong with the data. There are more unique values than the rows: \
total rows {tr_units['Unit Name']} and {ur_units['Unit Name']} unique values.")

Data is good, there are 0 duplicates. In total there are 13 rows, and 13 unique values.
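The duplicate check above compares `count()` results, which only count non-null cells. As a minimal sketch of an alternative, the hypothetical helper `duplicate_report` below counts rows directly with `DataFrame.duplicated()`. The sample frame is made up for illustration and only stands in for `data_units`.

```python
import pandas as pd


def duplicate_report(df: pd.DataFrame) -> dict:
    """Return total rows, unique rows and the number of fully duplicated rows."""
    total_rows = len(df)
    # duplicated() marks every repeat of an earlier row as True
    duplicate_rows = int(df.duplicated().sum())
    return {
        "total_rows": total_rows,
        "unique_rows": total_rows - duplicate_rows,
        "duplicate_rows": duplicate_rows,
    }


# hypothetical sample standing in for data_units (one repeated row)
sample = pd.DataFrame({
    "Unit Name": ["An", "ha", "ha"],
    "Description": ["Animals", "Hectares", "Hectares"],
})
report = duplicate_report(sample)
print(report)
```

The same helper could be reused for the other pandas data sources, for example `duplicate_report(data_flags)`.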


> #### DATA SOURCE 2: ItemGroup.csv

- ##### read the file and show first few rows

In [92]:
data_item_group = pd.read_csv('data/ItemGroup.csv')
data_item_group.head()

Unnamed: 0,Item Group Code,Item Group,Item Code,Item,Factor,CPC Code,HS Code,HS07 Code,HS12 Code
0,QC,"Crops, primary",1714,"Crops, primary",1.0,F1714,,,
1,QC,"Crops, primary",1753,Fibre Crops Primary,1.0,F1753,,,
2,QC,"Crops, primary",1730,Oilcrops Primary,1.0,F1730,,,
3,QA,Live Animals,1756,Live Animals,1.0,F1756,,,
4,QL,Livestock primary,1777,"Hides and skins, primary",1.0,F1777,,,


- ##### explore the data

In [93]:
data_item_group.describe()

Unnamed: 0,Item Code,Factor,HS Code
count,748.0,748.0,0.0
mean,618.610963,0.966167,
std,737.034012,0.148149,
min,15.0,0.0312,
25%,270.0,1.0,
50%,522.0,1.0,
75%,891.75,1.0,
max,17530.0,1.0,


> #### DATA SOURCE 3: Flags.csv

- ##### read the file and show first few rows

In [94]:
data_flags = pd.read_csv('data/Flags.csv')
data_flags.head()

Unnamed: 0,Flag,Description
0,A,Official figure
1,E,Estimated value
2,I,Imputed value
3,M,Missing value (data cannot exist; not applicable)
4,T,Unofficial figure


- ##### explore the data

In [95]:
data_flags.describe()

Unnamed: 0,Flag,Description
count,5,5
unique,5,5
top,A,Official figure
freq,1,1


> #### DATA SOURCE 4: Elements.csv

- ##### read the file and show first few rows

In [96]:
data_elements = pd.read_csv('data/Elements.csv')
data_elements.head()

Unnamed: 0,Element Code,Element,Unit,Description
0,5312,Area harvested,ha,Data refer to the area from which a crop is ga...
1,5423,Extraction Rate,hg/mt,
2,5410,Yield,100mg/An,
3,5413,Yield,No/An,
4,5420,Yield,hg/An,


- ##### explore the data

In [97]:
data_elements.describe()

Unnamed: 0,Element Code
count,22.0
mean,5344.090909
std,112.951279
min,5111.0
25%,5315.0
50%,5322.5
75%,5419.75
max,5513.0


In [98]:
data_elements.count()

Element Code    22
Element         22
Unit            22
Description      9
dtype: int64
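The counts above show that `Description` is filled for only 9 of the 22 rows. A small sketch of how missing values per column could be summarised is given below. The helper name `missing_summary` and the sample rows are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column and express them as a share of all rows."""
    missing = df.isna().sum()
    return pd.DataFrame({
        "missing": missing,
        "share": (missing / len(df)).round(2),
    })


# hypothetical three-row sample shaped like data_elements
sample = pd.DataFrame({
    "Element Code": [5312, 5423, 5410],
    "Description": ["placeholder description text", None, None],
})
summary = missing_summary(sample)
print(summary)
```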

> #### DATA SOURCE 5: CountryGroup.csv

- ##### read the file and show first few rows

In [99]:
data_country_group = pd.read_csv('data/CountryGroup.csv')
data_country_group.head()

Unnamed: 0,Country Group Code,Country Group,Country Code,Country,M49 Code,ISO2 Code,ISO3 Code
0,5100,Africa,4,Algeria,12,DZ,DZA
1,5100,Africa,7,Angola,24,AO,AGO
2,5100,Africa,53,Benin,204,BJ,BEN
3,5100,Africa,20,Botswana,72,BW,BWA
4,5100,Africa,233,Burkina Faso,854,BF,BFA


- ##### explore the data

In [100]:
data_country_group.describe()

Unnamed: 0,Country Group Code,Country Code,M49 Code
count,918.0,918.0,918.0
mean,5371.140523,131.269063,435.545752
std,318.9191,75.804004,254.890011
min,5000.0,1.0,4.0
25%,5100.0,64.5,212.5
50%,5301.0,130.5,430.0
75%,5801.0,195.0,662.0
max,5817.0,351.0,894.0


In [102]:
data_country_group['Country'].nunique()

211

> #### DATA SOURCE 6: WorldData.csv (future fact table)

- ##### read the file, check first few rows

In [103]:
file_world_data = "data/WorldData.csv"
world_data = dd.read_csv(file_world_data, encoding="cp1252")
world_data.head(6)

Unnamed: 0,Area Code,Area Code (M49),Area,Item Code,Item Code (CPC),Item,Element Code,Element,Year Code,Year,Unit,Value,Flag
0,2,'004,Afghanistan,221,'01371,"Almonds, in shell",5312,Area harvested,1975,1975,ha,0.0,E
1,2,'004,Afghanistan,221,'01371,"Almonds, in shell",5312,Area harvested,1976,1976,ha,5900.0,E
2,2,'004,Afghanistan,221,'01371,"Almonds, in shell",5312,Area harvested,1977,1977,ha,6000.0,E
3,2,'004,Afghanistan,221,'01371,"Almonds, in shell",5312,Area harvested,1978,1978,ha,6000.0,E
4,2,'004,Afghanistan,221,'01371,"Almonds, in shell",5312,Area harvested,1979,1979,ha,6000.0,E
5,2,'004,Afghanistan,221,'01371,"Almonds, in shell",5312,Area harvested,1980,1980,ha,5800.0,E


- ##### check for duplicates

In [104]:
tr_wd = world_data.count().compute()
ur_wd = world_data.drop_duplicates().count().compute()

In [105]:
if tr_wd['Value'] == ur_wd['Value']:
    print(f"Data is good, there are {tr_wd['Value']-ur_wd['Value']} duplicates. \
In total there are {tr_wd['Value']} rows, and {ur_wd['Value']} unique values.")
    
elif tr_wd['Value'] > ur_wd['Value']:
    print(f"There are {tr_wd['Value']-ur_wd['Value']} duplicates in the data. \
There are {tr_wd['Value']} rows in total, and {ur_wd['Value']} unique values")
    
else:
    print(f"Something is wrong with the data. There are more unique values than the rows: \
total rows {tr_wd['Value']} and {ur_wd['Value']} unique values.")

Data is good, there are 0 duplicates. In total there are 3761168 rows, and 3761168 unique values.


- ##### create a file with unique values for Countries (and their codes)

In [22]:
countries = world_data[['Area', 'Area Code (M49)']].drop_duplicates().compute()
countries['Area Code (M49)'] = countries['Area Code (M49)'].str.lstrip("'")
current_datetime = datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
str_current_datetime = str(current_datetime)
countries.to_csv(f'data/countries_list_{str_current_datetime}.csv')

> #### DATA SOURCE API: restcountries

- ##### get the list of countries codes that will be used as a parameter for API calls

In [106]:
list_of_files = glob.glob('data/countries_list_*') # * means all if need specific format then *.csv
latest_file = max(list_of_files, key=os.path.getctime)
print(latest_file)

data\countries_list_2023-08-18T09-21-01.csv
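The cell above picks the newest file by `os.path.getctime`, which is the creation time on Windows but the last metadata change on Unix. Since the file names already carry a timestamp, a sketch of an alternative is to parse that timestamp instead. The helper `latest_by_name` and the first file name in the sample list are hypothetical.

```python
from datetime import datetime
from pathlib import Path


def latest_by_name(paths, prefix="countries_list_", fmt="%Y-%m-%dT%H-%M-%S"):
    """Return the path whose file name carries the most recent timestamp."""
    def stamp(path):
        # strip the prefix from the stem and parse what is left as a timestamp
        return datetime.strptime(Path(path).stem[len(prefix):], fmt)
    return max(paths, key=stamp)


files = [
    "data/countries_list_2023-08-17T10-00-00.csv",  # hypothetical older file
    "data/countries_list_2023-08-18T09-21-01.csv",
]
latest = latest_by_name(files)
print(latest)
```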


In [59]:
df_countries = pd.read_csv(latest_file)
list_of_countries=[]
for x in df_countries.values:
    list_of_countries.append(f'{x[2]:03d}')

In [60]:
print(len(list_of_countries))

245
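The loop above re-pads each code with `:03d` because `pd.read_csv` parses the M49 codes as integers and drops the leading zeros. As a sketch of another option, the codes can be read as strings so no re-padding is needed. The two-row CSV text below is a made-up stand-in for the saved countries list.

```python
import io

import pandas as pd

# hypothetical two-row stand-in for data/countries_list_*.csv
csv_text = "Area,Area Code (M49)\nAfghanistan,004\nAlbania,008\n"

# dtype=str keeps the codes exactly as written, including leading zeros
df = pd.read_csv(io.StringIO(csv_text), dtype={"Area Code (M49)": str})
codes = df["Area Code (M49)"].tolist()
print(codes)
```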


- ##### make API calls to get information about the countries, save results in json files

In [61]:
result_file_countries = []
for i in list_of_countries:
    url = "https://restcountries.com/v3.1/alpha/"+i+"?fields=ccn3,flags,name,capital,languages,area,population" 
    r = requests.get(url)
    # skip codes the API cannot resolve (any non-200 response)
    if r.status_code != 200:
        continue
    data = r.json()
    result_file_countries.append(data)
    
#result_file_countries
json_object = json.dumps(result_file_countries, indent=4)
 
# Writing to countries_info.json
with open("data/countries_info.json", "w") as outfile:
    outfile.write(json_object)
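The loop above silently skips every code that does not return a successful response, so it is hard to tell afterwards which countries are missing. A minimal sketch that records the failures is shown below. The function `fetch_countries` is hypothetical, and it takes the HTTP getter as a parameter (assumed to behave like `requests.get`) so it can be tried without network access.

```python
def fetch_countries(codes, get, base_url="https://restcountries.com/v3.1/alpha/"):
    """Fetch one record per code; return (records, failed) so skipped codes stay visible."""
    fields = "ccn3,flags,name,capital,languages,area,population"
    records, failed = [], {}
    for code in codes:
        response = get(f"{base_url}{code}?fields={fields}", timeout=10)
        if response.status_code != 200:
            # remember the status instead of silently skipping the code
            failed[code] = response.status_code
            continue
        records.append(response.json())
    return records, failed


# usage in the notebook (needs network access):
# records, failed = fetch_countries(list_of_countries, requests.get)
```

Printing `failed` after the run would show which codes from the FAO list have no match in the API.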

### Step 3: Define the Data Model

#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

- ##### conceptual data model

![conceptual data model](img/wd_conceptual_datamodel.jpg)

- ##### high level architecture

![High level architecture](img/wd_architecture.jpg)

- ##### physical data model

![Physical data model](img/wd_physical_datamodel.jpg)

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

Pre-requisites:
* Prepare S3 bucket and IAM user to upload result files 
* <p style="color:red"> tba </p>

<p style="color:red"> add here!!!! </p>

- Units.csv
    - rename field names (formatting) 
    - save and load the file to prepared S3 bucket 
- Items.csv
    - rename field names (formatting)
    - drop columns 'CPC Code', 'HS Code', 'HS07 Code', 'HS12 Code'
    - save and load the file to prepared S3 bucket
- ItemGroup.csv
    - rename field names (formatting)
    - drop columns 'Factor', 'CPC Code', 'HS Code', 'HS07 Code', 'HS12 Code'
    - insert a column with id of the row
    - save and load the file to prepared S3 bucket
- Flags.csv
    - rename field names (formatting)
    - save and load the file to prepared S3 bucket
- Elements.csv
    - rename field names (formatting)
    - save and load the file to prepared S3 bucket
- CountryGroup.csv
    - rename field names (formatting)
    - drop columns 'Country Group Code', 'Country Code', 'ISO2 Code', 'ISO3 Code'
    - insert a column with id of the row
    - save and load the file to prepared S3 bucket
- WorldData.csv
    - rename field names (formatting)
    - drop columns 'Area Code', 'Area', 'Item Code (CPC)', 'Item', 'Element', 'Year Code'
    - remove the leading apostrophe from the area code
    - save and load the file to prepared S3 bucket
- API RestCountries
    - name: remove 'nativeName' section
    - capital: make a string with a list of values
    - languages: make a string with a list of values
    - save and load the file to prepared S3 bucket
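The rename and drop steps listed above follow the same pattern for every CSV source, so they could be described as data rather than as separate cells. The sketch below is one possible layout, not the notebook's actual implementation. `TRANSFORMS` and `apply_transform` are hypothetical names, and only two sources are filled in.

```python
import pandas as pd

# rename/drop rules taken from the step list above; only two sources shown
TRANSFORMS = {
    "Units.csv": {
        "rename": {"Unit Name": "unit_name", "Description": "description"},
        "drop": [],
    },
    "ItemGroup.csv": {
        "rename": {"Item Group Code": "item_group_code", "Item Group": "item_group",
                   "Item Code": "item_code", "Item": "item"},
        "drop": ["Factor", "CPC Code", "HS Code", "HS07 Code", "HS12 Code"],
    },
}


def apply_transform(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """Drop the unwanted columns, then rename the remaining ones."""
    return df.drop(columns=rule["drop"]).rename(columns=rule["rename"])


# one-row sample shaped like ItemGroup.csv
sample = pd.DataFrame({
    "Item Group Code": ["QC"], "Item Group": ["Crops, primary"],
    "Item Code": [1714], "Item": ["Crops, primary"], "Factor": [1.0],
    "CPC Code": ["F1714"], "HS Code": [None], "HS07 Code": [None], "HS12 Code": [None],
})
result = apply_transform(sample, TRANSFORMS["ItemGroup.csv"])
print(list(result.columns))
```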
    

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

- ##### settings for all data sources

In [107]:
# define helpers to upload files to S3
s3 = boto3.client('s3')

In [109]:
# creating output folder where the result files will be uploaded
current_directory = os.getcwd()

final_directory = os.path.join(current_directory, 'output')
# exist_ok avoids an error when the folder is already there
os.makedirs(final_directory, exist_ok=True)

- ##### Units.csv -> dim_unit.csv

In [110]:
# data transformations
data_units_final = data_units.rename(columns={'Unit Name': 'unit_name', 'Description': 'description'})

In [111]:
# saving result file
data_units_file = data_units_final.to_csv(f'{final_directory}/dim_unit.csv', index=False)

In [140]:
# upload file to S3
s3.upload_file(f'{final_directory}/dim_unit.csv', 'world-data-project', 'dim_unit.csv')

- ##### ItemGroup.csv -> dim_item_group.csv

In [112]:
# data transformations
data_item_group_final = data_item_group\
.rename(columns={'Item Group Code': 'item_group_code', 'Item Group': 'item_group', \
                 'Item Code': 'item_code', 'Item': 'item'})\
.drop(columns=['Factor', 'CPC Code', 'HS Code', 'HS07 Code', 'HS12 Code'])

In [114]:
# saving result file
data_item_group_file = data_item_group_final.to_csv(f'{final_directory}/dim_item_group.csv', index=False)

In [141]:
# upload file to S3
s3.upload_file(f'{final_directory}/dim_item_group.csv', 'world-data-project', 'dim_item_group.csv')

- ##### Flags.csv -> dim_flag.csv

In [115]:
# data transformations
data_flag_final = data_flags\
.rename(columns={'Flag': 'flag', 'Description': 'description'})

In [116]:
# saving result file
data_flag_file = data_flag_final.to_csv(f'{final_directory}/dim_flag.csv', index=False)

In [142]:
# upload file to S3
s3.upload_file(f'{final_directory}/dim_flag.csv', 'world-data-project', 'dim_flag.csv')

- ##### Elements.csv -> dim_element.csv

In [117]:
# data transformations
data_element_final = data_elements\
.rename(columns={'Element Code': 'element_code', 'Element': 'element', 'Unit': 'unit', 'Description': 'description'})

In [118]:
# saving result file
data_element_file = data_element_final.to_csv(f'{final_directory}/dim_element.csv', index=False)

In [143]:
# upload file to S3
s3.upload_file(f'{final_directory}/dim_element.csv', 'world-data-project', 'dim_element.csv')

- ##### CountryGroup.csv -> dim_country_group.csv

In [119]:
# data transformations
data_country_group_final = data_country_group\
.rename(columns={'Country Group': 'country_group', \
                 'Country': 'country', 'M49 Code': 'm49_code'})\
.drop(columns=['Country Group Code', 'Country Code', 'ISO2 Code', 'ISO3 Code'])

In [120]:
data_country_group_final.head()

Unnamed: 0,country_group,country,m49_code
0,Africa,Algeria,12
1,Africa,Angola,24
2,Africa,Benin,204
3,Africa,Botswana,72
4,Africa,Burkina Faso,854


In [121]:
# saving result file
data_country_group_file = data_country_group_final.to_csv(f'{final_directory}/dim_country_group.csv', index=False)

In [144]:
# upload file to S3
s3.upload_file(f'{final_directory}/dim_country_group.csv', 'world-data-project', 'dim_country_group.csv')

- ##### WorldData.csv -> fact_world_data.csv

In [122]:
# data transformations
world_data_final = world_data\
.rename(columns={'Area Code (M49)': 'area_code_m49', 'Item Code': 'item_code', 'Element Code': 'element_code', \
                 'Year': 'year', 'Unit': 'unit_name', 'Value': 'value', 'Flag': 'flag_code'})\
.drop(columns=['Area Code', 'Area', 'Item Code (CPC)', 'Item', 'Element', 'Year Code'])

In [123]:
# remove the leading apostrophe from the area code
world_data_final['area_code_m49'] = world_data_final['area_code_m49'].str.lstrip("'")

In [146]:
world_data_final.head()

Unnamed: 0,area_code_m49,item_code,element_code,year,unit_name,value,flag_code
0,4,221,5312,1975,ha,0.0,E
1,4,221,5312,1976,ha,5900.0,E
2,4,221,5312,1977,ha,6000.0,E
3,4,221,5312,1978,ha,6000.0,E
4,4,221,5312,1979,ha,6000.0,E


In [125]:
# saving result file
world_data_file = world_data_final.to_csv(f'{final_directory}/fact_world_data.csv', single_file=True)

In [145]:
# upload file to S3
s3.upload_file(f'{final_directory}/fact_world_data.csv', 'world-data-project', 'fact_world_data.csv')

- ##### API RestCountries -> dim_country_info.csv

In [134]:
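# data transformations: languages and capital(s) become strings, native names are removed
# assumption: data_ctb holds the API results saved to data/countries_info.json above
with open("data/countries_info.json") as infile:
    data_ctb = json.load(infile)

countries_info_data = []
for item in data_ctb:
    # the isinstance checks keep the cell safe to re-run on already transformed items
    if isinstance(item.get('languages'), dict):
        item['languages'] = ', '.join(str(value) for value in item['languages'].values())

    if isinstance(item.get('capital'), list):
        item['capital'] = ', '.join(item['capital'])

    # drop the native names if they are present
    item.get('name', {}).pop('nativeName', None)

    countries_info_data.append(item)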

In [131]:
countries_info_data_final = pd.json_normalize(countries_info_data)

In [132]:
countries_info_data_final.head()

Unnamed: 0,ccn3,capital,languages,area,population,flags.png,flags.svg,flags.alt,name.common,name.official
0,4,Kabul,"Dari, Pashto, Turkmen",652230.0,40218234,https://upload.wikimedia.org/wikipedia/commons...,https://upload.wikimedia.org/wikipedia/commons...,The flag of the Islamic Emirate of Afghanistan...,Afghanistan,Islamic Republic of Afghanistan
1,8,Tirana,Albanian,28748.0,2837743,https://flagcdn.com/w320/al.png,https://flagcdn.com/al.svg,The flag of Albania features a silhouetted dou...,Albania,Republic of Albania
2,12,Algiers,Arabic,2381741.0,44700000,https://flagcdn.com/w320/dz.png,https://flagcdn.com/dz.svg,The flag of Algeria features two equal vertica...,Algeria,People's Democratic Republic of Algeria
3,24,Luanda,Portuguese,1246700.0,32866268,https://flagcdn.com/w320/ao.png,https://flagcdn.com/ao.svg,The flag of Angola features two equal horizont...,Angola,Republic of Angola
4,28,Saint John's,English,442.0,97928,https://flagcdn.com/w320/ag.png,https://flagcdn.com/ag.svg,The flag of Antigua and Barbuda has a red fiel...,Antigua and Barbuda,Antigua and Barbuda


In [133]:
# saving result file
countries_info_data_file = countries_info_data_final.to_csv(f'{final_directory}/dim_country_info.csv', index=False)

In [147]:
# saving result file in S3 bucket
s3.upload_file(f'{final_directory}/dim_country_info.csv', 'world-data-project', 'dim_country_info.csv')
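The upload cells above repeat the same `upload_file` call once per output file. As a sketch, the uploads could be done in one pass over the output folder. The helper `upload_outputs` is hypothetical. It takes the S3 client as a parameter and relies only on the `upload_file(Filename, Bucket, Key)` method that the notebook already uses.

```python
from pathlib import Path


def upload_outputs(client, directory, bucket, pattern="*.csv"):
    """Upload every file in `directory` matching `pattern`, using the file name as the key."""
    uploaded = []
    for path in sorted(Path(directory).glob(pattern)):
        client.upload_file(str(path), bucket, path.name)
        uploaded.append(path.name)
    return uploaded


# usage with the notebook's client and bucket (needs AWS credentials):
# upload_outputs(s3, final_directory, "world-data-project")
```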

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [None]:
# Perform quality checks here
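A source/count check, as mentioned in the list above, could compare the number of rows in a written file with the number of rows in the frame it came from. The sketch below uses only the standard library. The helper names `count_csv_rows` and `check_row_count` are assumptions, and the in-memory CSV stands in for a file in the output folder.

```python
import csv
import io


def count_csv_rows(handle) -> int:
    """Count data rows in an open CSV file, excluding the header line."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header
    return sum(1 for _ in reader)


def check_row_count(expected: int, handle) -> bool:
    """Return True when the written file holds exactly the expected number of rows."""
    return count_csv_rows(handle) == expected


# in-memory stand-in for output/dim_unit.csv
written = io.StringIO("unit_name,description\nAn,Animals\nha,Hectares\n")
ok = check_row_count(2, written)
print(ok)
```

In the notebook this might be called as `check_row_count(len(data_units_final), open('output/dim_unit.csv', newline=''))`.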

- ##### data quality checks for dim_unit.csv

In [102]:
#check that the file is saved in the right directory
path = 'output/dim_unit.csv'

if os.path.isfile(path):
    print('The file is saved in the correct local folder')
else:
    print('The file is not saved, or is in the wrong folder!')

The file is saved in the correct local folder


In [100]:
# check that the file is not empty
if os.stat('output/dim_unit.csv').st_size != 0:
    print('The file is not empty')
else:
    print('The file is EMPTY!')

The file is not empty


In [148]:
# check that the file is uploaded to S3 bucket
response = s3.list_objects_v2(
    Bucket='world-data-project',
    Prefix='dim_unit.csv',
)

if 'Contents' in response:
    print('Object has fully uploaded to S3')
else:
    print('Object has not fully uploaded to S3')

Object has fully uploaded to S3


- ##### data quality checks for dim_item_group.csv

In [173]:
#check that the file is saved in the right directory
path = 'output/dim_item_group.csv'

if os.path.isfile(path):
    print('The file is saved in the correct local folder')
else:
    print('The file is not saved, or is in the wrong folder!')

The file is saved in the correct local folder


In [174]:
# check that the file is not empty
if os.stat('output/dim_item_group.csv').st_size != 0:
    print('The file is not empty')
else:
    print('The file is EMPTY!')

The file is not empty


In [149]:
# check that the file is uploaded to S3 bucket
response = s3.list_objects_v2(
    Bucket='world-data-project',
    Prefix='dim_item_group.csv',
)

if 'Contents' in response:
    print('Object has fully uploaded to S3')
else:
    print('Object has not fully uploaded to S3')

Object has fully uploaded to S3


- ##### quality checks for dim_flag.csv

In [184]:
#check that the file is saved in the right directory
path = 'output/dim_flag.csv'

if os.path.isfile(path):
    print('The file is saved in the correct local folder')
else:
    print('The file is not saved, or is in the wrong folder!')

The file is saved in the correct local folder


In [185]:
# check that the file is not empty
if os.stat('output/dim_flag.csv').st_size != 0:
    print('The file is not empty')
else:
    print('The file is EMPTY!')

The file is not empty


In [150]:
# check that the file is uploaded to S3 bucket
response = s3.list_objects_v2(
    Bucket='world-data-project',
    Prefix='dim_flag.csv',
)

if 'Contents' in response:
    print('Object has fully uploaded to S3')
else:
    print('Object has not fully uploaded to S3')

Object has fully uploaded to S3


- ##### quality checks for dim_element.csv

In [192]:
#check that the file is saved in the right directory
path = 'output/dim_element.csv'

if os.path.isfile(path):
    print('The file is saved in the correct local folder')
else:
    print('The file is not saved, or is in the wrong folder!')

The file is saved in the correct local folder


In [193]:
# check that the file is not empty
if os.stat('output/dim_element.csv').st_size != 0:
    print('The file is not empty')
else:
    print('The file is EMPTY!')

The file is not empty


In [151]:
# check that the file is uploaded to S3 bucket
response = s3.list_objects_v2(
    Bucket='world-data-project',
    Prefix='dim_element.csv',
)

if 'Contents' in response:
    print('Object has fully uploaded to S3')
else:
    print('Object has not fully uploaded to S3')

Object has fully uploaded to S3


- ##### quality checks for dim_country_group.csv

In [209]:
#check that the file is saved in the right directory
path = 'output/dim_country_group.csv'

if os.path.isfile(path):
    print('The file is saved in the correct local folder')
else:
    print('The file is not saved, or is in the wrong folder!')

The file is saved in the correct local folder


In [210]:
# check that the file is not empty
if os.stat('output/dim_country_group.csv').st_size != 0:
    print('The file is not empty')
else:
    print('The file is EMPTY!')

The file is not empty


In [152]:
# check that the file is uploaded to S3 bucket
response = s3.list_objects_v2(
    Bucket='world-data-project',
    Prefix='dim_country_group.csv',
)

if 'Contents' in response:
    print('Object has fully uploaded to S3')
else:
    print('Object has not fully uploaded to S3')

Object has fully uploaded to S3


- ##### quality checks for fact_world_data.csv

In [392]:
#check that the file is saved in the right directory
path = 'output/fact_world_data.csv'

if os.path.isfile(path):
    print('The file is saved in the correct local folder')
else:
    print('The file is not saved, or is in the wrong folder!')

The file is saved in the correct local folder


In [393]:
# check that the file is not empty
if os.stat('output/fact_world_data.csv').st_size != 0:
    print('The file is not empty')
else:
    print('The file is EMPTY!')

The file is not empty


In [153]:
# check that the file is uploaded to S3 bucket
response = s3.list_objects_v2(
    Bucket='world-data-project',
    Prefix='fact_world_data.csv',
)

if 'Contents' in response:
    print('Object has fully uploaded to S3')
else:
    print('Object has not fully uploaded to S3')

Object has fully uploaded to S3


- ##### quality checks for dim_country_info.csv

In [136]:
#check that the file is saved in the right directory
path = 'output/dim_country_info.csv'

if os.path.isfile(path):
    print('The file is saved in the correct local folder')
else:
    print('The file is not saved, or is in the wrong folder!')

The file is saved in the correct local folder


In [155]:
# check that the file is not empty
if os.stat('output/dim_country_info.csv').st_size != 0:
    print('The file is not empty')
else:
    print('The file is EMPTY!')

The file is not empty


In [157]:
# check that the file is uploaded to S3 bucket
response = s3.list_objects_v2(
    Bucket='world-data-project',
    Prefix='dim_country_info.csv',
)

if 'Contents' in response:
    print('Object has fully uploaded to S3')
else:
    print('Object has not fully uploaded to S3')

Object has fully uploaded to S3
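The checks above repeat the same three steps for every output file. A minimal sketch of a single helper that runs all three is given below. `check_output` is a hypothetical name, and the client is assumed to expose `list_objects_v2` like the boto3 S3 client used in this notebook. Because `Prefix` also matches longer keys that merely start with the file name, the sketch compares the key exactly.

```python
import os


def check_output(path, client=None, bucket=None):
    """Run the three checks used above for one file and return the results as a dict."""
    exists = os.path.isfile(path)
    results = {
        "exists": exists,
        "non_empty": exists and os.path.getsize(path) > 0,
    }
    if client is not None:
        key = os.path.basename(path)
        response = client.list_objects_v2(Bucket=bucket, Prefix=key)
        # Prefix also matches longer keys, so compare the key exactly
        results["in_s3"] = any(obj["Key"] == key for obj in response.get("Contents", []))
    return results


# usage in the notebook (needs AWS credentials):
# check_output('output/dim_unit.csv', s3, 'world-data-project')
```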


#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.
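One way to keep the data dictionary close to the code is to store it as a plain Python structure and render it as a markdown table. The sketch below covers only two of the output tables. The field names come from the output files created above, while the description wording and the names `DATA_DICTIONARY` and `to_markdown` are illustrative assumptions.

```python
# field names come from the output files above; the descriptions are illustrative wording
DATA_DICTIONARY = {
    "dim_unit": {
        "unit_name": ("Unit abbreviation, e.g. 'ha'", "Units.csv"),
        "description": ("Full unit name, e.g. 'Hectares'", "Units.csv"),
    },
    "dim_flag": {
        "flag": ("One-letter flag, e.g. 'E'", "Flags.csv"),
        "description": ("Meaning of the flag, e.g. 'Estimated value'", "Flags.csv"),
    },
}


def to_markdown(dictionary) -> str:
    """Render the dictionary as a markdown table with one row per field."""
    lines = ["| table | field | description | source |", "|---|---|---|---|"]
    for table, fields in dictionary.items():
        for field, (description, source) in fields.items():
            lines.append(f"| {table} | {field} | {description} | {source} |")
    return "\n".join(lines)


table_text = to_markdown(DATA_DICTIONARY)
print(table_text)
```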

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.