## DATASET CREATION

Unfortunately, the data I will work with does not come as one ready-made dataset, which is the usual situation in Data Science projects. <br /> 
To experience what a normal Data Science job looks like, I wanted to tackle this problem myself. From experience I can tell that preparing the data takes a ton of the time, almost 80% of it. <br /> 
Since the files mostly share the same countries and the same recording years, I use country and year as the key to connect the data points with each other. <br /> 
However, in the source files the countries are listed in rows along with the year of the recording, whereas I want a final dataset that looks as follows: <br /> 

|Country | Afghanistan | Albania | ... | Zimbabwe | 
| ----- | ----------  | ------ | ----- | ------- | 
|Alcohol consumption [l] | 0.2 | 2.4 | ... | 0.01 |
|Human Development Index (HDI) | 0.1 | 0.15 | ... | 0.1 | 
|... | ... | ... | ... | ... |
|Healthcare Expenditure [$] | 13.322 | 15.211 | ... | 1.039 |

Thus, I have to turn each country into a column and record each year as a separate entry (row) in the dataset. <br /> 
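
As a minimal sketch of that reshaping step, assume one source file is in long format with the columns `Entity`, `Year` and a single value column. The toy values and the column name `Alcohol consumption [l]` below are made up for illustration. Under these assumptions, `DataFrame.pivot` gives one row per year and one column per country:

```python
import pandas as pd

# Toy long-format data: one row per (country, year) pair. Values are invented.
long_df = pd.DataFrame({
    "Entity": ["Albania", "Albania", "Zimbabwe", "Zimbabwe"],
    "Year": [1980, 1981, 1980, 1981],
    "Alcohol consumption [l]": [2.4, 2.5, 0.01, 0.02],
})

value_col = long_df.columns[-1]  # the indicator is assumed to be the last column

# Rows become years, columns become countries.
wide_df = long_df.pivot(index="Year", columns="Entity", values=value_col)

# Label each row as "<indicator> in <year>" to match the target layout.
wide_df.index = [f"{value_col} in {year}" for year in wide_df.index]
print(wide_df)
```

This is not how the notebook builds the table further down (it uses explicit loops); it only illustrates the target shape.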

All the data is [publicly available](https://ourworldindata.org), and this source is trusted by many well-known organisations such as Vox and The New York Times, and even by top universities like MIT, Oxford and Stanford. <br /> 
Hence, I assume that this data is based on actual recordings from the respective countries. <br /> 
Even the United Nations publish their records on this page, and I bet that these folks do some amazing work which we can trust. <br />

I downloaded 76 lists with different indicator variables, ranging from the Human Development Index (HDI) and life expectancy to poultry consumption per capita, each covering several years. Hence there is a lot of data in it, and I am just scratching the surface of these datasets of the UN, WHO and FAO. <br /> 

One more thing to mention with respect to the countries: some of them were only listed together, such as *Serbia and Montenegro*, *Belgium Luxembourg*, ..., and thus sometimes there is no single value available for these countries due to their combined listing. I decided not to assign the combined value to just one of the countries, because their populations differ disproportionately and their cultures may have changed. <br /> 

But now let's not waste too much time on the explanation part and go straight into how I merged the data files into one huge one. <br /> 

In [1]:
import os 
import sys
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline

Next comes the necessary preprocessing: renaming the countries consistently and dropping the entries we are not interested in, i.e. aggregated continents, some islands, countries I have never heard of, etc. <br /> 
After doing that, we save the new csv file again.  <br /> 
Moreover, I also delete all entries from 1979 and earlier to avoid carrying too much historical data. This speeds up the computation a lot; it saved at least 1 day of computation! <br /> 

In [2]:
def rename_Countries_drop_Unnecessary(df, name): 
    print("Before {} shape: {}".format(name, df.shape))
    # Input is only the dataframe with the country names in the column 'Entity'
    right_names = []
    for data in df['Entity']:
        ## RENAMING COUNTRIES FOR DATA CONSISTENCY
        if 'Hong Kong' in data:
            data = 'Hong Kong'
        if 'Taiwan' in data: 
            data = 'Taiwan'
        if 'Macao' in data: 
            data = 'Macao'
        if 'Ethiopia' in data: 
            data = 'Ethiopia'
        if 'Sudan' in data: 
            data = 'Sudan'        
        if 'Czechia' in data: 
            data = 'Czech Republic'
        if 'Syria' in data: 
            data = 'Syria'
        if 'Russia' in data: 
            data = 'Russia'
        if "Ivoire" in data:
            data = "Cote d'Ivoire"

        # America 
        if 'US' in data: 
            data = 'United States'
        if 'USA' in data: 
            data = 'United States'
        if 'U.S.A.' in data: 
            data = 'United States'
        if 'U.S.A' in data: 
            data = 'United States'
        if 'United States of' in data: # gets United States of America
            data = 'United States'
        if data == 'America': 
            data = 'United States'
        right_names.append(data) 
        
    ## Replace the names with the consistent names of them
    right_names = pd.Series(right_names)
    df['Entity'] = right_names

    ## DELETE THE ENTRIES WHICH ARE NOT IN OUR MASTER COUNTRY LIST
    # boolean mask instead of positional drops, so a country's rows need not be contiguous
    df = df[df["Entity"].isin(interested_countries)]
            
    ## DELETE ENTRIES BEFORE 1980
    df = df[df["Year"] > 1979]
    
    ## SAVE THE FILE REDUCED AND CHANGED NAME AGAIN
    print("After: {}".format(df.shape))
    df.to_csv(os.path.join(datapath, name), index=False)
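
The same clean-up can also be written without the explicit loop. The following is only a sketch under assumptions: `clean_entities` and `NAME_MAP` are hypothetical names, the mapping covers just a few of the renames above and uses exact matches rather than substring matches, and the master list is passed in as a parameter instead of being read from a global.

```python
import pandas as pd

# Hypothetical exact-match mapping; extend it with the variants found in the files.
NAME_MAP = {
    "Czechia": "Czech Republic",
    "United States of America": "United States",
    "USA": "United States",
}

def clean_entities(df, keep_countries, first_year=1980):
    """Return a copy with harmonised names, restricted to the master list and years."""
    out = df.copy()
    out["Entity"] = out["Entity"].replace(NAME_MAP)
    mask = out["Entity"].isin(keep_countries) & (out["Year"] >= first_year)
    return out[mask].reset_index(drop=True)

# Toy input with invented values.
raw = pd.DataFrame({
    "Entity": ["Czechia", "USA", "World", "USA"],
    "Year": [1990, 1975, 1990, 1985],
    "Value": [1.0, 2.0, 3.0, 4.0],
})
cleaned = clean_entities(raw, ["Czech Republic", "United States"])
print(cleaned)
```

Exact matching is a deliberate difference from the function above: it avoids accidental hits from short substrings, at the price of having to list every spelling variant explicitly.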

Define the countries we want to have in our Masterlist and thus in the dataset.

In [3]:
interested_countries = ['Sweden', 'Norway', 'Finland', 'Iceland', 'Germany', 'Netherlands', 'Belgium', 'Luxembourg',
                        'England', 'Scotland', 'Wales', 'Ireland', 'United Kingdom', 'Switzerland', 'Austria', 'France',
                        'Italy', 'Spain', 'Portugal', 'Morocco', 'Tunisia', 'Egypt', 'Liechtenstein', 'Cyprus', 'Vatican',
                        'Kosovo', 'Serbia', 'Georgia', 'Greenland', 'Antigua and Barbuda', 'Hungary', 'Monaco', 'Israel',
                        'Albania', 'Iraq', 'Iran', 'Syria', 'Turkey', 'Palestine', 'Montenegro', 'Latvia', 'Jordan',
                        'Croatia', 'New Zealand', 'Eritrea', 'Libya', 'Belarus', 'Slovenia', 'Greece', 'Lithuania',
                        'Liberia', 'Slovakia', 'Estonia', 'Poland', 'Czech Republic', 'Armenia', 'Denmark', 'Bulgaria',
                        
                        'Russia', 'United States', 'Canada', 'Qatar', 'Kuwait', 'Mexico', 'South Africa', 'Fiji', 'Oman',
                        'Japan', 'United Arab Emirates', 'South Korea', 'Macao', 'Hong Kong', 'China', 'Thailand', 'Belize',
                        'Taiwan', 'Vietnam', 'Malaysia', 'Indonesia', 'India', 'Philippines', 'Australia', 'Laos', 'Bhutan',
                        
                        'Kyrgyzstan', 'Kazakhstan', 'Uzbekistan', 'Turkmenistan', 'Tajikistan', 'Pakistan', 'Afghanistan',
                        'Argentina', 'Brazil', 'Chile', 'Venezuela', 'Peru', 'Colombia', 'Guyana', 'Mauritius', 'Barbados', 
                        'Cuba', 'Panama', 'Bahamas', 'Puerto Rico', 'Costa Rica', 'Solomon Islands',  'Marshall Islands',
                        'Ecuador', 'Benin', 'Seychelles', 'Bolivia', 'Madagascar',  'Mauritania', 'Bosnia and Herzegovina', 
                        'Jamaica', 'Lebanon', 'Senegal', 'Malta', 'French Polynesia', 'Bahrain', 'Burundi', 'Swaziland',
                        'Tanzania', 'Central African Republic', 'Malawi', 'Djibouti', 'Mozambique', 'Macedonia', 'Sierra Leone',
                        'Democratic Republic of Congo', 'Namibia', 'Algeria', 'Trinidad and Tobago', "Cote d'Ivoire",
                         
                        'Samoa', 'Bermuda', 'Aruba', 'Myanmar', 'Cape Verde', 'Uganda', 'Togo', 'Guinea', 
                        'San Marino', 'Ukraine', 'North Korea', 'Papua New Guinea', 'Haiti', 'Ghana', 'Sudan',
                        'Faeroe Islands', 'Cambodia', 'Somalia',  'Kiribati', 'Tonga', 'Mongolia', 'Rwanda', 'Bangladesh',
                        'Suriname', 'Nauru', 'Zambia', 'Azerbaijan',  'Sri Lanka', 'Nigeria', 'Kenya', 'Comoros', 'Andorra', 
                        'Tuvalu', 'Zimbabwe', 'Yemen', 'Cameroon', 'El Salvador', 'Angola', 'Curacao', 'Nicaragua',
                        'Saudi Arabia', 'Lesotho', 'Moldova', 'Gabon', 'Grenada', 'Mali', 'Romania', 'Guatemala', 'Dominican Republic', 
                        'Honduras', 'Congo',  'Burkina Faso',  'Saint Lucia', 'Cayman Islands', 'Botswana', 'Ethiopia', 
                        'Chad', 'Uruguay', 'Maldives', 'Gibraltar', 'Paraguay', 'Niger', 'Nepal']

In [4]:
# Define the path where we have our data stored and want to have it stored as well.
datapath = os.path.join(os.path.join(os.getcwd(), 'data'), 'Health')
datapath

'C:\\Users\\Lenny\\Documents\\Studium_Robotics (M.Sc.)\\03_Semester 3 - Oslo ERASMUS\\01_Applied Data Analysis and Machine Learning\\Project 3\\data\\Health'

Call each single file in our data directory and process each one according to the rules we set previously (Renaming and Deleting entries). 

In [21]:
dataFileNames = [f for f in os.listdir(datapath) if os.path.isfile(os.path.join(datapath, f))]
type4Cols = []
type7Cols = []
manualLists = []

# the datasets mostly have the same size of 4 columns and same setup so let's get those first

for file in dataFileNames: 
    try:
        df = pd.read_csv(os.path.join(datapath, str(file) ) , encoding='latin-1')
    except Exception: 
        print("problems with this guy: {}".format(file))
        manualLists.append(file)
        continue  # skip the shape checks, df would be stale or undefined here
    if df.shape[1] == 5 or df.shape[1] == 4: # one type of files (4 columns)
        type4Cols.append(file) 
        rename_Countries_drop_Unnecessary(df, str(file))
    elif df.shape[1] == 7: 
        type7Cols.append(file) 
    else: # anything else needs manual handling
        manualLists.append(file)
manualLists

Before agricultural-area-per-capita.csv shape: (8993, 4)
After: (5218, 4)
Before alcohol-attributable-fraction-of-mortality.csv shape: (173, 4)
After: (173, 4)
Before annual-healthcare-expenditure-per-capita.csv shape: (3460, 4)
After: (3460, 4)
Before average-height-of-men-for-selected-countries.csv shape: (91, 4)
After: (91, 4)
Before beef-consumption-per-dude.csv shape: (5198, 4)
After: (5198, 4)
Before beer-consumption-per-person.csv shape: (5626, 4)
After: (5626, 4)
Before cancer-death-rates.csv shape: (5068, 4)
After: (5068, 4)
Before cardiovascular-disease-death-rates.csv shape: (5068, 4)
After: (5068, 4)
Before child-mortality.csv shape: (6434, 4)
After: (6434, 4)
Before co-emissions-per-capita.csv shape: (6878, 4)
After: (6878, 4)
Before consumption-per-smoker-per-day.csv shape: (5742, 4)
After: (5742, 4)
Before daily-per-capita-fat-supply.csv shape: (5398, 4)
After: (5398, 4)
Before daily-per-capita-protein-supply.csv shape: (5398, 4)
After: (5398, 4)
Before dementia-death-ra

['Merged UN Data 1980+.csv', 'Merged UN Data.csv']
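
Sorting the files by their number of columns does not require loading each file completely. As a hedged sketch (the helper name `column_count` and the two throwaway demo files are assumptions for illustration), `pd.read_csv` with `nrows=0` parses only the header row:

```python
import os
import tempfile
import pandas as pd

def column_count(path):
    """Read only the header row and return the number of columns."""
    return pd.read_csv(path, nrows=0, encoding="latin-1").shape[1]

# Demo with two throwaway files; real use would loop over the data directory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "four.csv"), "w") as fh:
        fh.write("Entity,Code,Year,Value\nSweden,SWE,1980,1.0\n")
    with open(os.path.join(tmp, "seven.csv"), "w") as fh:
        fh.write("a,b,c,d,e,f,g\n1,2,3,4,5,6,7\n")
    counts = {name: column_count(os.path.join(tmp, name))
              for name in sorted(os.listdir(tmp))}

print(counts)
```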

Set up the final dataframe which we are going to use in the Analysis part. <br /> 
Notice that I set it up with 3000 rows, which is more than needed; this is just to ensure that all the data will be safely stored in it. I will delete the empty rows after the dataset is created. <br /> 
So, it's just a placeholder for now and serves the purpose of not running into index errors or a too small row size.

In [22]:
# create dataframe where we want to paste everything inside
# 3000 rows to not run into problems while adding rows - the unused ones are deleted later
final_df = pd.DataFrame(data = np.zeros( (3000, len(interested_countries)) ), 
                        index = np.arange(3000),
                        columns = [ country for country in interested_countries])
# save indices as strings to get meaningful names
final_df.index = final_df.index.map(str)
final_df.shape

(3000, 197)

The magic happens down here. <br /> 
We loop through every preprocessed list; within each list we loop through every country, and within each country we iterate over every year. <br /> 
For each year, I combine the name of the list's value column with the year to create an index (row) name, and write the country's value into that row under the country's column. <br /> 
We do this for all the preprocessed lists, which takes a ton of time. <br /> 
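
For comparison, the nested loops could in principle be replaced by one pivot per list followed by a single `pd.concat`. This is a sketch of an alternative, not the code used in this notebook; `to_wide`, the three-country `master` list and the toy beer/wine values are assumptions for illustration.

```python
import pandas as pd

def to_wide(long_df):
    """Pivot one long-format list into rows '<indicator> in <year>' x countries."""
    value_col = long_df.columns[-1]
    wide = long_df.pivot(index="Year", columns="Entity", values=value_col)
    wide.index = [f"{value_col} in {year}" for year in wide.index]
    return wide

master = ["Sweden", "Norway", "Finland"]  # stand-in for the master country list

# Two toy lists with invented values.
beer = pd.DataFrame({"Entity": ["Sweden", "Norway"], "Year": [1980, 1980],
                     "Beer [l]": [50.0, 45.0]})
wine = pd.DataFrame({"Entity": ["Sweden", "Sweden"], "Year": [1980, 1981],
                     "Wine [l]": [10.0, 11.0]})

# Stack the per-list tables on top of each other.
merged = pd.concat([to_wide(beer), to_wide(wine)])
# Same column order as the master list; missing values become 0.0 as in the loop.
merged = merged.reindex(columns=master).fillna(0.0)
print(merged)
```

Filling gaps with 0.0 mirrors the behaviour of the loop; leaving them as `NaN` would make missing data distinguishable from true zeros.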


In [23]:
%%time
# paste the values into the final_df from each single list
nextListIndex = 0

try: 
    for lists in type4Cols:
        
        ## Read the file and get the all countries along with their reported years
        print("{} list out of {}, Index: {}, Name: {}".format(type4Cols.index(lists), len(type4Cols), nextListIndex, lists))
        df = pd.read_csv(os.path.join(datapath, str(lists) ) , encoding='latin-1')
        
        # get the col names, unique countries and unique years
        columns = list(df.columns)
        countries = list(set(df["Entity"]))
        years = list(set(df['Year']))
        
        # get a list of all the index/row names
        indexNamesArr = final_df.index.values

        ## groupby countries and then years accordingly
        #df.groupby(["Entity", 'Year'])

        firstListRun = False # flag for renaming the indices

        # loop thru every country in the list
        for country in countries:
            
            ## check if country is in our masterlist
            if country not in interested_countries: 
                # skip this item
                print("\tCountry: {} not in list - but we skip it.".format(country))
                continue
            
            # take a dataframe for one country at a time
            country_df = df[df['Entity'] == country]

            # loop thru every year within that country - assuming the years are in the same order for every country
            for year in years:
                
                # rename the indices only if it is the very first run for the country
                if not firstListRun:
                    indexName = str(columns[-1]) + ' in ' + str(year)
                    indexNamesArr[nextListIndex] = indexName
                    nextListIndex += 1
                    if nextListIndex % 20 == 0: 
                        print("\t\t" + str(indexName))
                        
                ## IMPROVEMENT
                # instead of taking 0 when value is not available, take a window of +- 1 entry and take the average of it

                ## get the proper value and fill empty ones, if not available, fill it with 0.000
                # note: .sum() is only having one element anyway, just done to get the value as a float not an array
                value = country_df[country_df["Year"] == year][columns[-1]].sum() if not country_df[country_df["Year"] == year][columns[-1]].empty else 0.000
                # get the name of the row
                idxName = str(columns[-1]) + ' in ' + str(year)
                
                #print("Country: {} found in the dataset at spot: {}".format(country, interested_countries.index(country)))
                
                # assign the value in the merged df with the value 
                final_df.iat[list(final_df.index.values).index(idxName), interested_countries.index(country)] = value # .iat[row, col]

            ## get the proper index for the next list to begin with   
            firstListRun = True
            # print progress
            if countries.index(country) % 60 == 0:
                print("\tWorking on country: {} out of {}".format(countries.index(country), len(countries)))
            
except Exception as e:
    print("Error at index {} of {} rows, list: {}, country: {}".format(nextListIndex, final_df.shape[0], lists, country))
    print(e)
    sys.exit()

0 list out of 76, Index: 0, Name: agricultural-area-per-capita.csv
		Agricultural Area [h/person] in 1999
	Working on country: 0 out of 163
	Working on country: 60 out of 163
	Working on country: 120 out of 163
1 list out of 76, Index: 34, Name: alcohol-attributable-fraction-of-mortality.csv
	Working on country: 0 out of 172
	Working on country: 60 out of 172
	Working on country: 120 out of 172
2 list out of 76, Index: 35, Name: annual-healthcare-expenditure-per-capita.csv
		annual healthcare spending [$] in 1999
	Working on country: 0 out of 174
	Working on country: 60 out of 174
	Working on country: 120 out of 174
3 list out of 76, Index: 55, Name: average-height-of-men-for-selected-countries.csv
	Working on country: 0 out of 91
	Working on country: 60 out of 91
4 list out of 76, Index: 56, Name: beef-consumption-per-dude.csv
		Beef and buffalo (kg) in 1983
		Beef and buffalo (kg) in 2003
	Working on country: 0 out of 163
	Working on country: 60 out of 163
	Working on country: 120 ou

	Working on country: 60 out of 163
	Working on country: 120 out of 163
35 list out of 76, Index: 978, Name: median-age.csv
		Median Age [years] in 2055
		Median Age [years] in 2030
	Working on country: 0 out of 178
	Working on country: 60 out of 178
	Working on country: 120 out of 178
36 list out of 76, Index: 1003, Name: merchandise-exports-gdp-cepii.csv
		Value of global merchandise exports [% of GDP] in 1996
	Working on country: 0 out of 178
	Working on country: 60 out of 178
	Working on country: 120 out of 178
37 list out of 76, Index: 1038, Name: military-expenditure-as-share-of-gdp.csv
		Military expenditure [% of GDP] in 1981
		Military expenditure [% of GDP] in 2001
	Working on country: 0 out of 157
	Working on country: 60 out of 157
	Working on country: 120 out of 157
38 list out of 76, Index: 1076, Name: milk-production-tonnes.csv
		Milk production [t] in 1983
		Milk production [t] in 2003
	Working on country: 0 out of 174
	Working on country: 60 out of 174
	Working on countr

	Working on country: 60 out of 174
	Working on country: 120 out of 174
71 list out of 76, Index: 1824, Name: total-healthcare-expenditure-as-share-of-national-gdp-by-country.csv
		Healthcare of GDP [%] in 2010
	Working on country: 0 out of 174
	Working on country: 60 out of 174
	Working on country: 120 out of 174
72 list out of 76, Index: 1844, Name: total-meat-consumption-per-capita.csv
		Total Meat Consumption per capita [kg] in 1995
	Working on country: 0 out of 163
	Working on country: 60 out of 163
	Working on country: 120 out of 163
73 list out of 76, Index: 1878, Name: total-tax-revenues-gdp.csv
		Total Taxes Revenue [% GDP] in 1981
		Total Taxes Revenue [% GDP] in 2001
	Working on country: 0 out of 174
	Working on country: 60 out of 174
	Working on country: 120 out of 174
74 list out of 76, Index: 1916, Name: vegetable-consumption-per-capita.csv
		Vegetable Consumption [kg/capita/year] in 1983
		Vegetable Consumption [kg/capita/year] in 2003
	Working on country: 0 out of 163
	W

In [25]:
lastFilledItem = list(final_df.index).index(indexName) + 1
final_df.drop(final_df.index[ np.arange(lastFilledItem, final_df.shape[0]) ], inplace=True)
final_df.shape

(1986, 197)

Finally save our sweet dataframe! **HURRRRRAAAAAAAAYY**

In [78]:
final_df.to_csv(os.path.join(datapath, "Merged UN Data 1980+.csv"))

## TESTING, VALIDATION and some more PREPROCESSING

In [72]:
# open my baby again 
# note: the row labels were saved as an unnamed first column, so they come back as 'Unnamed: 0'
df = pd.read_csv(os.path.join(datapath, "Merged UN Data 1980+.csv"))

In [73]:
final_df.describe()

| | Sweden | Norway | Finland | Iceland | Germany | Netherlands | Belgium | Luxembourg | Ireland | United Kingdom | ... | Burkina Faso | Saint Lucia | Botswana | Ethiopia | Chad | Uruguay | Maldives | Paraguay | Niger | Nepal |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| count | 1986.0 | 1986.0 | 1986.0 | 1986.0 | 1986.0 | 1986.0 | 1986.0 | 1986.0 | 1986.0 | 1986.0 | ... | 1986.0 | 1986.0 | 1986.0 | 1986.0 | 1986.0 | 1986.0 | 1986.0 | 1986.0 | 1986.0 | 1986.0 |
| mean | 117302.6 | 81601.76 | 78088.2 | 8985.268 | 914695.6 | 366682.2 | 114787.8 | 13037.92 | 190689.6 | 626939.9 | ... | 8481.39431 | 3425.498426 | 16709.32 | 45674.87 | 5981.087153 | 57611.08 | 7619.734 | 18216.66 | 15155.84 | 30029.63 |
| std | 646312.8 | 486251.1 | 419769.0 | 69857.06 | 4757285.0 | 1907861.0 | 787353.5 | 96093.39 | 1040101.0 | 3490236.0 | ... | 41281.470835 | 30355.671222 | 142673.8 | 295603.4 | 31572.953632 | 303119.5 | 78131.36 | 90501.85 | 91337.55 | 167806.2 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | -0.191 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 5.405 | 5.7 | 5.4 | 5.8 | 5.8 | 5.59391 | 0.9103638 | 0.2584646 | 6.835773 | 4.126333 | ... | 1.634105 | 3.795 | 3.208215 | 0.39 | 0.576348 | 4.844213 | 0.4707185 | 4.142803 | 1.0725 | 1.0625 |
| 50% | 21.39935 | 20.47 | 20.255 | 21.21431 | 20.38601 | 22.465 | 14.69712 | 12.0 | 26.13 | 18.23534 | ... | 9.730886 | 21.056036 | 16.43085 | 6.858542 | 9.07015 | 19.48604 | 9.190904 | 16.97377 | 9.815 | 8.72154 |
| 75% | 85.0175 | 81.85125 | 82.1075 | 91.77775 | 94.4475 | 100.0 | 80.75325 | 76.07125 | 90.66 | 81.177 | ... | 49.535 | 85.499941 | 59.62434 | 43.13613 | 47.2075 | 86.03 | 65.77975 | 71.4255 | 51.95113 | 63.67295 |
| max | 6782000.0 | 5960000.0 | 3275200.0 | 1792000.0 | 35555000.0 | 15828000.0 | 8355000.0 | 1090000.0 | 10100000.0 | 35814000.0 | ... | 334213.0 | 348000.0 | 2101000.0 | 4430840.0 | 291849.0 | 3037000.0 | 1286000.0 | 1308000.0 | 1097833.0 | 1792204.0 |


In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1986 entries, 0 to 1985
Columns: 197 entries, Sweden to Nepal
dtypes: float64(197)
memory usage: 3.0 MB


## VALIDATION OF THE DATA

Since it is too complicated to check every single entry, I decided to pick 4 randomly chosen lists that appear in different locations, and to manually check the values of specific countries and their indicator variables. 
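
Such a manual check can also be expressed as a small helper. The sketch below is an assumption about how one might automate it, not part of the original workflow: `spot_check` is a hypothetical name, and the toy frames stand in for one source list and the merged table.

```python
import pandas as pd

def spot_check(merged, source, country, year):
    """Compare one merged cell with the matching row of a long-format source list."""
    value_col = source.columns[-1]
    row_label = f"{value_col} in {year}"
    rows = source[(source["Entity"] == country) & (source["Year"] == year)]
    # Missing source rows are expected to show up as 0.0 in the merged table.
    expected = rows[value_col].iloc[0] if not rows.empty else 0.0
    return bool(merged.loc[row_label, country] == expected)

# Toy data with invented values; Norway is deliberately wrong in the merged table.
source = pd.DataFrame({"Entity": ["Sweden", "Norway"], "Year": [1990, 1990],
                       "Wine [l]": [12.5, 9.0]})
merged = pd.DataFrame({"Sweden": [12.5], "Norway": [8.0]},
                      index=["Wine [l] in 1990"])

print(spot_check(merged, source, "Sweden", 1990))
print(spot_check(merged, source, "Norway", 1990))
```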

In [None]:
gdp = 'maddison-data-gdp-per-capita-in-2011us.csv'
df = pd.read_csv(os.path.join(datapath, gdp ) , encoding='latin-1')
print(df.shape)
result_df = df[df["Year"] > 1979]
result_df

In [8]:
## VALIDATE BEER AND WINE CONSUMPTION
vege = 'vegetable-consumption-per-capita.csv'
wine = 'wine-consumption-per-person.csv'
beer = 'beer-consumption-per-person.csv'
meat = 'total-meat-consumption-per-capita.csv'


test_df = pd.DataFrame(data = np.zeros( (500, len(interested_countries)) ), 
                        index = np.arange(500),
                        columns = [ country for country in interested_countries])
# save indices as strings to get meaningful names
test_df.index = test_df.index.map(str)


# paste the values into the test_df from each single list
listies = [vege, wine, beer, meat]
nextListIndex = 0

for lists in listies:

    ## Read the file and get the all countries along with their reported years
    df = pd.read_csv(os.path.join(datapath, str(lists) ) , encoding='latin-1')

    # get the col names, unique countries and unique years
    columns = list(df.columns)
    countries = list(set(df["Entity"]))
    years = list(set(df['Year']))

    # get a list of all the index/row names
    indexNamesArr = test_df.index.values

    ## groupby countries and then years accordingly
    #df.groupby(["Entity", 'Year'])

    firstListRun = False # flag for renaming the indices

    # loop thru every country in the list
    for country in countries:

        ## check if country is in our masterlist
        if country not in interested_countries: 
            # skip this item
            print("\tCountry: {} not in list - but we skip it.".format(country))
            continue

        # take a dataframe for one country at a time
        country_df = df[df['Entity'] == country]

        # loop thru every year within that country - assuming the years are in the same order for every country
        for year in years:

            # rename the indices only if it is the very first run for the country
            if not firstListRun:
                indexName = str(columns[-1]) + ' in ' + str(year)
                indexNamesArr[nextListIndex] = indexName
                nextListIndex += 1
                if nextListIndex % 20 == 0: 
                    print("\t\t" + str(indexName))

            ## get the proper value and fill empty ones, if not available, fill it with 0.000
            # note: .sum() is only having one element anyway, just done to get the value as a float not an array
            value = country_df[country_df["Year"] == year][columns[-1]].sum() if not country_df[country_df["Year"] == year][columns[-1]].empty else 0.000
            # get the name of the row
            idxName = str(columns[-1]) + ' in ' + str(year)

            #print("Country: {} found in the dataset at spot: {}".format(country, interested_countries.index(country)))

            # assign the value in the merged df with the value 
            test_df.iat[list(test_df.index.values).index(idxName), interested_countries.index(country)] = value # .iat[row, col]

        ## get the proper index for the next list to begin with   
        firstListRun = True
        # print progress
        if countries.index(country) % 60 == 0:
            print("\tWorking on country: {} out of {}".format(countries.index(country), len(countries)))


		Vegetable consumption per capita [kg] in 1999
	Working on country: 0 out of 161
	Working on country: 60 out of 161
	Working on country: 120 out of 161
		Wine Consumption [l] in 1985
		Wine Consumption [l] in 2005
	Working on country: 0 out of 173
	Working on country: 60 out of 173
	Working on country: 120 out of 173
		Beer Consumption per capita [l] in 1989
		Beer Consumption per capita [l] in 2009
	Working on country: 0 out of 173
	Working on country: 60 out of 173
	Working on country: 120 out of 173
		Total Meat Consumption per capita [kg] in 1993
		Total Meat Consumption per capita [kg] in 2013
	Working on country: 0 out of 163
	Working on country: 60 out of 163
	Working on country: 120 out of 163


In [18]:
test_df["United States"]["Total Meat Consumption per capita [kg] in 2010"]

118.82

Comparing the above dataframe with the real lists from the UN, we see that they do match, although the wine consumption seems a little suspicious to me, to be honest. <br /> 
By [checking the source chart](https://ourworldindata.org/grapher/wine-consumption-per-person) again, it turns out that they only record the amount of pure alcohol consumed as wine. Wine usually contains about 12% alcohol, thus 1 l of wine contains 0.12 l of pure alcohol. Thus 3 l of pure alcohol from wine is the equivalent of approximately 25 l of wine, or roughly 33 standard 0.75 l bottles. <br /> 

Hence, the data is sorted into the right columns and rows, and we can go ahead and further analyse this fresh dataset! 
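
The conversion from pure alcohol to wine volume can be written out as a quick calculation. The 12% alcohol content is the figure used above; the 0.75 l bottle size is an assumption added here for illustration. With these numbers, 3 l of pure alcohol corresponds to 25 l of wine, or roughly 33 bottles.

```python
pure_alcohol_l = 3.0      # litres of pure alcohol from wine (example figure from the text)
alcohol_fraction = 0.12   # assumed alcohol content of wine by volume
bottle_l = 0.75           # assumed standard bottle size in litres

wine_l = pure_alcohol_l / alcohol_fraction   # litres of wine
bottles = wine_l / bottle_l                  # number of bottles
print(f"{wine_l:.1f} l of wine, about {bottles:.0f} bottles")
```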

### Check for Non Zero Values in Each Country

In [76]:
minimum_Entries = 600
dropCountries = df.astype(bool).sum(axis=0).values > minimum_Entries
#final_df.columns.values
idx_to_drop = np.argwhere(dropCountries == False).flatten()
countries_to_drop = [df.columns.values[cntry] for cntry in idx_to_drop]
countries_to_drop, len(countries_to_drop)

(['England',
  'Scotland',
  'Wales',
  'Liechtenstein',
  'Vatican',
  'Kosovo',
  'Greenland',
  'Monaco',
  'Aruba',
  'San Marino',
  'Faeroe Islands',
  'Nauru',
  'Tuvalu',
  'Curacao',
  'Cayman Islands',
  'Gibraltar'],
 16)

In [77]:
# Here you will drop those countries
df.drop(countries_to_drop, axis=1, inplace=True)
df.shape

(1986, 182)
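
The non-zero check above can be wrapped into a reusable helper. This is a minimal sketch under assumptions: `sparse_columns` is a hypothetical name, the toy frame uses invented values, and only numeric columns are considered (so a text column such as `Unnamed: 0` would be ignored rather than counted).

```python
import pandas as pd

def sparse_columns(df, minimum_entries):
    """Return numeric columns with at most `minimum_entries` non-zero values."""
    numeric = df.select_dtypes(include="number")
    nonzero_counts = (numeric != 0).sum(axis=0)
    return list(nonzero_counts[nonzero_counts <= minimum_entries].index)

toy = pd.DataFrame({
    "Sweden": [1.0, 2.0, 3.0, 4.0],
    "Vatican": [0.0, 0.0, 5.0, 0.0],
    "Nauru": [0.0, 0.0, 0.0, 0.0],
})
to_drop = sparse_columns(toy, minimum_entries=2)
print(to_drop)

# Drop the sparse columns without modifying the original frame.
reduced = toy.drop(columns=to_drop)
```

As in the cell above, a column is kept only if it has more than `minimum_entries` non-zero values.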

In [None]:
df.groupby('Entity')['Wine Consumption'].sum().sort_values().tail(5)

In [1]:
df[df['Entity'] == 'Germany']['Wine Consumption'].plot.hist(bins=20)

NameError: name 'df' is not defined