## Analysis of the GDP of Various Countries ##

## Abstract ##

***Introduction***
> In this lab, I will use the World Bank's 2017 GDP data to determine a few things. I want to find the average GDP across all countries, as well as the minimum and maximum GDPs. I'm specifically curious to see whether the U.S. has the highest GDP of any country. I also want to see whether the lowest GDP is shockingly low. Finally, I'm curious whether the average GDP is comparable to other countries' GDPs (such as the minimum and maximum GDP).

## Dataset Preparation ##

***Data Location***

> This data was retrieved from the World Bank's public database: [Data Link](https://data.worldbank.org/indicator/NY.GDP.MKTP.CD)

> The data was not modified from the form stored on the server before being imported into Jupyter Notebook.


***Accessing the Data in Python***

> To access the data, I first had to open the file in Python. To do that, I opened the file with read permissions and stored the data in a buffer variable called 'content.' I used the keyword "with" so that Python closes the file once the code block has executed.

***Storing the Data for Analysis***
> Since the data has over 200 rows and over 30 columns, it did not make sense to store the rows or columns as hand-crafted lists. Instead, I created a list of lists (essentially a 2D array) by iterating through each row, splitting the row into columns, and storing the individual elements. I dropped the final element in each row (as indicated by the "[:-1]"), because it was an empty string.

In [89]:
with open('./API_NY.GDP.MKTP.CD_DS2_en_csv_v2_10080925.csv', 'r') as data:
    # Skip the four title rows at the top of the file
    content = data.read().splitlines()[4:]

# Split each row on commas and drop the trailing empty string
matrix = [x.split(',')[:-1] for x in content]
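
> Splitting on every comma also splits any country name that itself contains a comma, so such a row ends up with an extra field. As a hedged alternative, and not what this notebook ran, Python's standard `csv` module treats a quoted field as a single value. The sample rows and the `parse_rows` helper below are made up for illustration; they are not lines copied from the file.

```python
import csv
import io

# Hypothetical sample in the same shape as the export: every field is
# quoted and each line ends with a trailing comma.
sample = (
    '"Country Name","Country Code","1960","1961",\n'
    '"Example, The","EXM","100.0","200.0",\n'
    '"Sample","SMP","","",\n'
)

def parse_rows(text):
    """Parse CSV text, dropping the empty trailing field on each row."""
    reader = csv.reader(io.StringIO(text))
    return [row[:-1] for row in reader if row]

rows = parse_rows(sample)
naive = [line.split(',')[:-1] for line in sample.splitlines()]

print(rows[1])        # the name stays in one field, quotes already removed
print(len(naive[1]))  # the naive split yields one extra field for this row
```

> With this approach the later `.strip("\"")` calls would no longer be needed, since the reader removes the quotes itself.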

***Cleaning the Data***

> Cleaning the data was fairly straightforward. The only aspect that would cause problems during analysis was that some of the rows were not countries at all; they were groups of countries. To deal with that, I simply deleted those rows from the matrix, so that it contained only countries. Otherwise, the maximum GDP would have been astronomically high. The individual groups of countries are commented in the code.
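
> Deleting rows by position means each deletion shifts the index of every later row, so the result depends on the order of the statements. One alternative, sketched here under the assumption that the first field of each row holds the quoted name, is to filter by name instead. The `AGGREGATES` set (only a few of the group names commented in the code), the `drop_aggregates` helper, and the toy `rows` list are illustrative, not the notebook's data.

```python
# Assumed subset of the aggregate names listed in the notebook's comments.
AGGREGATES = {"World", "High income", "OECD members", "North America"}

def drop_aggregates(rows, names=AGGREGATES):
    """Return only the rows whose first field is not an aggregate name."""
    return [row for row in rows if row[0].strip('"') not in names]

# Toy rows with made-up values, in the same quoted style as the matrix.
rows = [
    ['"Country Name"', '"2017"'],
    ['"World"', '"3.0"'],
    ['"Seychelles"', '"1.0"'],
    ['"North America"', '"2.0"'],
]

countries = drop_aggregates(rows)
print([r[0] for r in countries])
```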

> Afterwards, I wrote a function called "comma," which changes a number like "123456789" to "123,456,789", so that the large GDP numbers are easier to read later on.

In [90]:
# Excluding groups of countries
del matrix[258] # World
del matrix[94]  # High income
del matrix[179] # OECD members
del matrix[195] # Post-demographic dividend
del matrix[101] # IDA & IBRD total
del matrix[137] # Low & middle income
del matrix[152] # Middle income
del matrix[100]  # IBRD only
del matrix[62]  # East Asia & Pacific
del matrix[63]  # Europe & Central Asia
del matrix[239] # Upper middle income
del matrix[162] # North America
del matrix[135] # Late-demographic dividend
del matrix[60]  # East Asia & Pacific (excluding high income)
del matrix[64]  # Euro area
del matrix[68]  # European Union
del matrix[215] # East Asia & Pacific (IDA & IBRD countries)

def comma(x):
    ret=[]
    for i, n in enumerate(list(str(int(x)))[::-1]): # Go through each element, and store index and item in i, n
        if i and not (i % 3):                       # After every third element going through backwards, add comma
            ret.insert(0, ',')
        ret.insert(0, n)
    return ''.join(ret)
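
> For comparison, Python's format specification has a built-in thousands separator that produces the same grouping for non-negative numbers. This short sketch repeats `comma` only so the block runs on its own; the name `comma_builtin` is an assumption for illustration.

```python
def comma(x):
    ret = []
    for i, n in enumerate(list(str(int(x)))[::-1]):
        if i and not (i % 3):
            ret.insert(0, ',')
        ret.insert(0, n)
    return ''.join(ret)

def comma_builtin(x):
    """Group digits using the ',' option of the format specification."""
    return format(int(x), ',')

# Both helpers agree on these non-negative inputs.
for value in (0, 999, 1000, 123456789, 19390604000000.0):
    assert comma(value) == comma_builtin(value)

print(comma_builtin(123456789))  # 123,456,789
```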

## Data Modelling ##

***Finding the Highest GDP***
> To find the maximum GDP, I used two nested for loops to iterate through every cell (skipping the header row and the first 5 columns, as they were just titles or names of countries), and found the maximum GDP. This was done by storing the value in a "top" variable, and storing its row and column indices in "tIndex". If a cell is empty, the loop stops reading that row. At the end, I printed out the maximum number, along with the row header (country name) and the column header (year). I also stripped the double quotes (`"`) from each string, because the CSV file wraps every field in them.

In [91]:
top, tIndex = 0.0, [0, 0]
for i in range(1, len(matrix)):
    for j in range(5, len(matrix[0])):
        try:
            value = float(matrix[i][j].strip("\"")) # Try doing, and if it fails it is because it's an empty string
            if value > top:
                top = value
                tIndex = (i, j)
        except:                                     # If it is an empty string, break
            break

print("Highest GDP: $%s -- %s %s" %(comma(top), matrix[tIndex[0]][0].strip("\""), matrix[0][tIndex[1]].strip("\"")))

Highest GDP: $19,390,604,000,000 -- United States 2017
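
> The same search can be written with a generator and the built-in `max`. This is a sketch on toy data rather than the notebook's code: the helper `numeric_cells` is a hypothetical name, and unlike the loop above it skips blank cells with `continue` instead of stopping at the first one.

```python
def numeric_cells(matrix, first_col=5):
    """Yield (value, row, col) for every cell that parses as a number."""
    for i in range(1, len(matrix)):              # row 0 holds the headers
        for j in range(first_col, len(matrix[i])):
            try:
                value = float(matrix[i][j].strip('"'))
            except ValueError:                   # blank or non-numeric cell
                continue
            yield value, i, j

# Toy matrix with one leading name column instead of five metadata columns.
toy = [
    ['"Country Name"', '"1960"', '"1961"', '"1962"'],
    ['"A"', '"10.0"', '""', '"30.0"'],
    ['"B"', '""', '"5.0"', '"20.0"'],
]

# Tuples compare by their first element, so max() picks the largest value.
value, i, j = max(numeric_cells(toy, first_col=1))
print(value, toy[i][0].strip('"'), toy[0][j].strip('"'))
```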


***Finding the Lowest GDP***
> To find the minimum GDP, I used two nested for loops to iterate through every cell (skipping the header row and the first 5 columns, as they were just titles or names of countries), and found the minimum GDP. This was done by storing the value in a "bottom" variable, and storing its row and column indices in "bIndex". If a cell is empty, the loop stops reading that row. At the end, I printed out the minimum number, along with the row header (country name) and the column header (year). I also stripped the double quotes (`"`) from each string, because the CSV file wraps every field in them.

In [92]:
bottom, bIndex = 1000000000000.0, [0, 0]
for i in range(1, len(matrix)):
    for j in range(5, len(matrix[0])):
        try:
            value = float(matrix[i][j].strip("\"")) # Try doing, and if it fails it is because it's an empty string
            if value < bottom:
                bottom = value
                bIndex = (i, j)
        except:                                     # If it is an empty string, break
            break
            
print("Lowest GDP: $%s -- %s %s" %(comma(bottom), matrix[bIndex[0]][0].strip("\""), matrix[0][bIndex[1]].strip("\"")))

Lowest GDP: $11,592,024 -- Seychelles 1961
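
> Two details of this loop affect the result: the starting value of `bottom` must exceed every value in the data, and `break` stops reading a row at its first blank cell. The sketch below, on made-up numbers, uses `float('inf')` as the starting value and contrasts `break` with `continue`. The function name `lowest` and its `stop_at_blank` flag are assumptions for illustration.

```python
def lowest(row, stop_at_blank):
    """Return the smallest numeric value in a row of quoted strings."""
    bottom = float('inf')            # larger than any finite value
    for cell in row:
        try:
            value = float(cell.strip('"'))
        except ValueError:           # blank cell
            if stop_at_blank:
                break                # ignore the rest of the row
            continue                 # skip only this cell
        bottom = min(bottom, value)
    return bottom

row = ['"40.0"', '""', '"7.0"', '"90.0"']
print(lowest(row, stop_at_blank=True))   # 40.0
print(lowest(row, stop_at_blank=False))  # 7.0
```

> Whether a gap in a country's record should end the scan or just be skipped is a choice about the data; the notebook's cells use the first behavior.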


***Finding the Average GDP***
> To find the average GDP, I used two nested for loops to iterate through every cell (skipping the header row and the first 5 columns, as they were just titles or names of countries), and added each GDP to the "total" variable. At the end, I divided the total by the "length" variable, which counts the recorded GDP values. This yields the average of all recorded GDP values. I also stripped the double quotes (`"`) from each string, because the CSV file wraps every field in them.

In [93]:
total, length = 0.0, 0
for i in range(1, len(matrix)):
    for j in range(5, len(matrix[0])):
        try:
            total += float(matrix[i][j].strip("\"")) # Try doing, and if it fails it is because it's an empty string
            length += 1                              # Doesn't record if it doesn't exist -- fails before adding
        except:                                      # If it is an empty string, break
            break

print("Average GDP: $%s" %comma(str(int(total/length))))

Average GDP: $317,176,540,628
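
> The running total and count amount to an arithmetic mean over the collected values. As a small check on toy data (not the notebook's matrix), the sketch below gathers values with the same stop-at-first-blank rule and compares `total / length` with `statistics.fmean`.

```python
from statistics import fmean

# Toy matrix with made-up values; one leading name column.
toy = [
    ['"Country Name"', '"1960"', '"1961"', '"1962"'],
    ['"A"', '"10.0"', '"20.0"', '""'],
    ['"B"', '"30.0"', '""', '"50.0"'],
]

values = []
for row in toy[1:]:                  # skip the header row
    for cell in row[1:]:             # skip the name column
        try:
            values.append(float(cell.strip('"')))
        except ValueError:           # blank cell: stop reading this row
            break

total, length = sum(values), len(values)
print(total / length)                # 20.0
assert total / length == fmean(values)
```

> Note that the 50.0 after the blank in row B is never counted, which is the same effect `break` has in the cell above.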


## Data Analysis & Conclusion ##

***Conclusion***
> After analyzing the data, I found that the lowest GDP belonged to Seychelles (in 1961), which certainly surprised me. I would have expected a country thought of as more "third-world," but Seychelles makes sense purely because of its size.

> I was unsurprised to find that the U.S. has the highest GDP. Frankly, I didn't know what to make of the average GDP, as I have no benchmark for judging it. It is roughly 27,000 times Seychelles's 1961 GDP, yet only about one-sixtieth of the U.S.'s 2017 GDP. It gives me a sense of the shocking amount of money the U.S. generates.
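
> The size of these gaps can be checked directly from the three figures printed earlier. The snippet below only does the arithmetic; the variable names are chosen for illustration.

```python
highest = 19_390_604_000_000   # United States, 2017
lowest = 11_592_024            # Seychelles, 1961
average = 317_176_540_628      # mean of all recorded values

print(f"highest / average: {highest / average:.1f}")
print(f"average / lowest:  {average / lowest:,.0f}")
print(f"average as a share of highest: {average / highest:.2%}")
```

> The highest value comes out at roughly 61 times the average, and the average at roughly 27,000 times the lowest value.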

## Acknowledgements ##

***No acknowledgements applicable for this lab.***