# Basics

Import libraries and functions.

Search for the files and create the first data frame, along with an empty list for the headers (h) and a list of the common columns (CC). The data frame is built from the normalized files.

<p>The <code>glob</code> library lets us build a list containing only the CSV files whose names end in <em>(Normalized).csv</em>, which are the most useful for statistical analysis. <br>
To simplify later steps, we also create a list with the relevant columns of each CSV: <em>('Area','Year','Element','Unit','Value')</em>.</p>

In [1]:
import pandas as pd
import numpy as np
import glob
from functools import reduce
folder_path = 'C:/Users/amarchve/Desktop/Data'
file_list = glob.glob(folder_path + "/**(Normalized).csv")
CC=['Area','Year','Element','Unit','Value']
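The file selection above can be checked on a throwaway folder. The sketch below is only an illustration: the file names are hypothetical and are created in a temporary directory, not taken from the real data folder. It relies on the fact that parentheses are literal characters in glob patterns, and that `**` without `recursive=True` behaves like a single `*`.

```python
import glob
import os
import tempfile

# Hypothetical file names, created only to illustrate the pattern.
names = [
    "Trade_E_All_Data_(Normalized).csv",
    "Trade_E_All_Data.csv",
    "Trade_E_Flags.csv",
]

with tempfile.TemporaryDirectory() as demo_folder:
    for name in names:
        open(os.path.join(demo_folder, name), "w").close()

    # Only *, ? and [] are special in glob patterns; ( and ) match themselves.
    # glob.escape protects the folder part in case it contains special characters.
    pattern = os.path.join(glob.escape(demo_folder), "*(Normalized).csv")
    matched = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(matched)
```

Only the file whose name ends in `(Normalized).csv` should appear in `matched`.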

# Data integration

Since we do not initially have enough computing power to process the whole database, we first do a test run to make sure the approach is scalable.


In this first test run of the loop, we load the files and concatenate them into a single data frame. We then apply the pivot_table function, which turns the variables previously stored in 'Element' and 'Unit' into column headers and uses 'Value' for the values of the table.

In [9]:
main_dataframe = pd.DataFrame(pd.read_csv(file_list[0], sep=',', encoding='latin-1'),columns=CC)
for i in range(1,5):
    df = pd.DataFrame(pd.read_csv(file_list[i],sep=',' , encoding='latin-1',low_memory=False), columns=CC)
    main_dataframe = pd.concat([main_dataframe, df])

main_dataframeC=main_dataframe.pivot_table(index=['Area','Year'], columns= ['Element','Unit'], values='Value')
main_dataframeC

,Element,Domestic supply quantity,Export Quantity,Feed,Food supply quantity (tonnes),Import Quantity,Losses,Other uses (non-food),"Per 100,000 farmers",Processing,Production,"Researchers, total",Seed,"Share of Value Added (Agriculture, Forestry and Fishing)","Spending, total",Stock Variation,"Value Local Currency, 2015 prices","Value US$, 2015 prices"
,Unit,tonnes,tonnes,tonnes,tonnes,tonnes,tonnes,tonnes,FTE,tonnes,tonnes,FTE,tonnes,%,million PPP (constant 2011 prices),tonnes,LCU,US$
Area,Year,,,,,,,,,,,,,,,,,
Afghanistan,1961,49168.461538,3819.250,99751.6,4900.0,22.666667,1100.0,7406.4,,41987.5,66960.8,,4262.5,,,0.0,,
Afghanistan,1962,52625.153846,4445.875,100276.0,4965.0,10.666667,1138.0,9284.8,,57303.0,71963.0,,5324.5,,,0.0,,
Afghanistan,1963,53554.846154,4881.125,92627.8,4950.0,12.166667,1136.0,10943.0,,78337.0,73518.9,,5324.5,,,0.0,,
Afghanistan,1964,53641.615385,5224.000,103823.8,4965.0,10.666667,1138.0,8170.2,,59243.0,73906.9,,3908.5,,,0.0,,
Afghanistan,1965,55104.153846,4015.875,106227.2,5100.0,7.500000,1184.0,9368.4,,60740.5,74843.6,,2755.5,,,0.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Zimbabwe,2016,,,,,,,,6.5,,,242.0,,1.39,41.6,,101.074375,101.074375
Zimbabwe,2017,,,,,,,,,,,,,,,,103.039125,103.039125
Zimbabwe,2018,,,,,,,,,,,,,,,,104.098850,104.098850
Zimbabwe,2019,,,,,,,,,,,,,,,,105.339300,105.339300


<p>In the following cell, we concatenate all the files in <code>file_list</code>. They all have the same shape because each data frame is created with the columns restricted to <code>CC</code>. <br>
The concat function stacks the files one on top of another into a single data frame. The final result of this loop is therefore <code>main_dataframe</code>, which is our Normalized Source Data Model.</p>

In [None]:
main_dataframe = pd.DataFrame(pd.read_csv(file_list[0], sep=',', encoding='latin-1'),columns=CC)
for i in range(1,len(file_list)):
    df = pd.DataFrame(pd.read_csv(file_list[i],sep=',' , encoding='latin-1',low_memory=False), columns=CC)
    main_dataframe = pd.concat([main_dataframe, df])
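As a small illustration of why the column restriction gives every file the same shape, the sketch below stacks two in-memory CSVs with different extra columns. The CSV contents and the extra column names (`Flag`, `Item`) are hypothetical. Collecting the frames in a list and calling `pd.concat` once is an alternative to concatenating inside the loop, which copies the accumulated data on every iteration.

```python
import io
import pandas as pd

CC = ['Area', 'Year', 'Element', 'Unit', 'Value']

# Two hypothetical normalized files, held in memory for illustration.
csv_a = "Area,Year,Element,Unit,Value,Flag\nAfghanistan,1961,Production,tonnes,10.0,A\n"
csv_b = "Area,Year,Item,Element,Unit,Value\nAfghanistan,1961,Wheat,Seed,tonnes,2.0\n"

# Restricting each frame to CC drops the extra columns, so all frames align.
frames = [
    pd.DataFrame(pd.read_csv(io.StringIO(text)), columns=CC)
    for text in (csv_a, csv_b)
]

# One concat call over the whole list instead of one per file.
stacked = pd.concat(frames, ignore_index=True)
print(stacked)
```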

<p>Lastly, to convert the Normalized Source Data Model into the Normalized Integrated Data Model, we use the <code>pivot_table</code> function. <br>
It turns the variables previously stored in <em>'Element' & 'Unit'</em> into the column headers, and the <em>'Value'</em> column provides the values of the table.</p>

In [None]:
main_dataframeC=main_dataframe.pivot_table(index=['Area','Year'], columns= ['Element','Unit'], values='Value')
main_dataframeC
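The reshaping can be seen on a toy long-format frame. The rows below are hypothetical. One point worth noting: `pivot_table` aggregates with the mean by default, so when several rows share the same Area, Year, Element and Unit (for example one row per item), the cell holds their average rather than a single reported value.

```python
import pandas as pd

# Hypothetical long-format rows; two 'Production' rows share the same keys.
long_df = pd.DataFrame({
    'Area':    ['Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan'],
    'Year':    [1961, 1961, 1961, 1962],
    'Element': ['Production', 'Production', 'Seed', 'Production'],
    'Unit':    ['tonnes', 'tonnes', 'tonnes', 'tonnes'],
    'Value':   [10.0, 30.0, 2.0, 50.0],
})

# Element and Unit become a two-level column header; Value fills the cells.
wide_df = long_df.pivot_table(
    index=['Area', 'Year'],
    columns=['Element', 'Unit'],
    values='Value',
)

# The two 1961 'Production' rows (10.0 and 30.0) are averaged.
production_1961 = wide_df.loc[('Afghanistan', 1961), ('Production', 'tonnes')]
print(wide_df)
print(production_1961)
```

If a sum or another aggregate is wanted instead, `pivot_table` accepts an `aggfunc` argument.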

# Quality assurance

In the following cell, we make sure that none of the variables of interest from <em>'Element'</em> have been left out, thereby checking that the extraction & integration has been completed.

In [11]:
# Compare the number of distinct Elements with the number of pivoted columns.
if main_dataframe["Element"].nunique() == main_dataframeC.shape[1]:
    print('Data extraction & integration is COMPLETED and CORRECT')
else:
    print('Data extraction & integration is INCOMPLETE')

Data extraction & integration is COMPLETED and CORRECT
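The check above counts distinct Elements, while the pivoted columns are (Element, Unit) pairs. The two counts agree as long as every Element has a single Unit. A slightly stricter variant, sketched here on hypothetical rows, compares the pairs themselves and reports which ones are missing.

```python
import pandas as pd

# Hypothetical long-format rows; 'Spending' appears with two different units.
long_df = pd.DataFrame({
    'Area':    ['Zimbabwe', 'Zimbabwe', 'Zimbabwe'],
    'Year':    [2016, 2016, 2016],
    'Element': ['Production', 'Spending', 'Spending'],
    'Unit':    ['tonnes', 'LCU', 'US$'],
    'Value':   [5.0, 101.0, 99.0],
})
wide_df = long_df.pivot_table(
    index=['Area', 'Year'], columns=['Element', 'Unit'], values='Value'
)

# Every (Element, Unit) pair in the source should be a column after the pivot.
expected_pairs = set(zip(long_df['Element'], long_df['Unit']))
actual_pairs = set(wide_df.columns)
missing_pairs = expected_pairs - actual_pairs

n_elements = long_df['Element'].nunique()
n_columns = wide_df.shape[1]
print('distinct Elements:', n_elements, '| pivoted columns:', n_columns)
print('missing pairs:', missing_pairs)
```

In this toy case the Element count and the column count differ even though nothing is missing, which is the situation where the pair-based comparison is more informative.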


# Merge union

Test loop using merge instead of concat.

In [3]:
main_dataframe = pd.DataFrame(pd.read_csv(file_list[0], sep=',', encoding='latin-1'),columns=CC)
for i in range(1,5):
    df = pd.DataFrame(pd.read_csv(file_list[i],sep=',' , encoding='latin-1',low_memory=False), columns=CC)
    main_dataframe = pd.merge(left=main_dataframe, right=df , how='left', on= ['Area','Year'])
main_dataframe



,Area,Year,Element_x,Unit_x,Value_x,Element_y,Unit_y,Value_y,Element_x.1,Unit_x.1,Value_x.1,Element_y.1,Unit_y.1,Value_y.1,Element,Unit,Value
0,Algeria,2009,"Share of Value Added (Agriculture, Forestry an...",%,0.18,"Researchers, total",FTE,510.3,Import Quantity,tonnes,2771.0,,,69.503120,"Value Local Currency, 2015 prices",LCU,72.7492
1,Algeria,2009,"Share of Value Added (Agriculture, Forestry an...",%,0.18,"Researchers, total",FTE,510.3,Import Quantity,tonnes,2771.0,,,69.503120,"Value US$, 2015 prices",US$,100.8325
2,Algeria,2009,"Share of Value Added (Agriculture, Forestry an...",%,0.18,"Researchers, total",FTE,510.3,Import Quantity,tonnes,2771.0,,,69.503120,"Value Local Currency, 2015 prices",LCU,77.8007
3,Algeria,2009,"Share of Value Added (Agriculture, Forestry an...",%,0.18,"Researchers, total",FTE,510.3,Import Quantity,tonnes,2771.0,,,69.503120,"Value US$, 2015 prices",US$,107.8340
4,Algeria,2009,"Share of Value Added (Agriculture, Forestry an...",%,0.18,"Researchers, total",FTE,510.3,Import Quantity,tonnes,2771.0,,,69.503120,"Value Local Currency, 2015 prices",LCU,71.0064
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99495815,Zimbabwe,2016,"Spending, total",million PPP (constant 2011 prices),41.60,"Per 100,000 farmers",FTE,6.5,,,,,%,-0.899281,"Value US$, 2015 prices",US$,100.5032
99495816,Zimbabwe,2016,"Spending, total",million PPP (constant 2011 prices),41.60,"Per 100,000 farmers",FTE,6.5,,,,,%,-0.899281,"Value Local Currency, 2015 prices",LCU,101.7875
99495817,Zimbabwe,2016,"Spending, total",million PPP (constant 2011 prices),41.60,"Per 100,000 farmers",FTE,6.5,,,,,%,-0.899281,"Value US$, 2015 prices",US$,101.7875
99495818,Zimbabwe,2016,"Spending, total",million PPP (constant 2011 prices),41.60,"Per 100,000 farmers",FTE,6.5,,,,,%,-0.899281,"Value Local Currency, 2015 prices",LCU,99.8458
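A likely reason for the very large row count is that the files are in long format, so each (Area, Year) key occurs many times in every file. A merge on those keys then pairs every matching row on the left with every matching row on the right. The sketch below shows the effect on hypothetical rows: 3 rows merged with 4 rows on the same key give 12 rows, and each further merge multiplies the count again.

```python
import pandas as pd

keys = ['Area', 'Year']

# Hypothetical long-format frames: the same (Area, Year) key repeats in both.
left = pd.DataFrame({
    'Area': ['Algeria'] * 3,
    'Year': [2009] * 3,
    'Element': ['Production', 'Seed', 'Feed'],
    'Value': [1.0, 2.0, 3.0],
})
right = pd.DataFrame({
    'Area': ['Algeria'] * 4,
    'Year': [2009] * 4,
    'Element': ['Import Quantity', 'Export Quantity', 'Losses', 'Processing'],
    'Value': [4.0, 5.0, 6.0, 7.0],
})

# Each of the 3 left rows matches all 4 right rows that share the key.
merged = pd.merge(left=left, right=right, how='left', on=keys)
print(len(left), len(right), len(merged))
print(merged.columns.tolist())
```

The non-key columns also pick up the `_x` and `_y` suffixes seen in the output above.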


In [8]:
for i in range(1,len(file_list)):
    data = pd.read_csv(file_list[i],sep=',' , encoding='latin-1',low_memory=False)
    df = pd.DataFrame(data)
    main_dataframe = pd.merge(left=main_dataframe, right=df , how='left')
   
main_dataframe

ParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.

# Crazy ideas

Test run of a merge driven by reduce and a lambda instead of append.
We build a list of all the data frames and merge them all together.


In [None]:
df_list = [main_dataframe]
for i in range(1, 10):
    df = pd.read_csv(file_list[i], sep=',', encoding='latin-1', low_memory=False)
    df_list.append(df)
# `on` takes a list of column names; 'Area' & 'Year' raises a TypeError.
merged_df = reduce(lambda l, r: pd.merge(l, r, on=['Area', 'Year'], how='inner'), df_list)
merged_df
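To show the reduce pattern in isolation, here is a minimal sketch on hypothetical frames that are already wide, with one row per (Area, Year) and a distinctly named value column each. Under that assumption the merge does not multiply rows, and `how='inner'` keeps only the keys present in every frame.

```python
from functools import reduce

import pandas as pd

keys = ['Area', 'Year']

# Hypothetical wide frames: one row per key, one value column per frame.
production = pd.DataFrame({'Area': ['Algeria', 'Zimbabwe'], 'Year': [2009, 2016],
                           'Production': [1.0, 2.0]})
seed = pd.DataFrame({'Area': ['Algeria', 'Zimbabwe'], 'Year': [2009, 2016],
                     'Seed': [0.1, 0.2]})
spending = pd.DataFrame({'Area': ['Zimbabwe'], 'Year': [2016],
                         'Spending': [41.6]})

frames = [production, seed, spending]

# reduce folds the list pairwise: merge(merge(production, seed), spending).
merged = reduce(lambda l, r: pd.merge(l, r, on=keys, how='inner'), frames)
print(merged)
```

Only the Zimbabwe 2016 row survives, because the Algeria 2009 key is absent from the third frame.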