# Introduction to Big Data Project - Main Python File

In [1]:
# Check that jupyter lab points to your project's environment directory
import sys
sys.executable

'C:\\Users\\Adespotos\\anaconda3\\envs\\bigdata_env\\python.exe'

# Import Libraries

In [2]:
import numpy as np  # Numerical computing library
import pandas as pd  # Data handling
import matplotlib.pyplot as plt  # Plotting library
import seaborn as sns  # Statistical visualization library built on top of matplotlib
import pymysql  # MySQL database connector for Python
import chardet  # Library for automatic character encoding detection
import re  # Regular expressions for string manipulation and pattern matching

import custom_functions  # Custom functions for this project (e.g., CSV cleaning, encoding detection)

pd.options.display.max_columns = 100  # Show up to 100 columns when displaying a DataFrame

# DATA HARVESTING

# Read Datasets to DataFrames

A detailed exploration of the CSV files is necessary to ensure they can be read correctly into DataFrames.

We selected five health indicators and three environmental indicators. Some of these files are too complex to be read directly into DataFrames. The complexity became apparent when we realized that pandas.read_csv() could not handle the files properly, producing unreliable results.

Rather than choosing easier files, we took this as an opportunity to develop a step-by-step solution and learn as much as possible from the process. The Pythonic approach we followed includes the steps below:

1. Ensure the correct file encoding.
2. Fetch the CSV files into Python’s runtime (without using pandas, since it could not read the files correctly).
3. Explore the fetched CSV content to identify the issues.
4. Write scripts to address one problem at a time, progressing toward the final solution.

The indicators that required this special treatment were **"Population, total"** and **"Renewable energy consumption (% of total final energy consumption)"**. For this reason, we separated these two files from the rest, assigning them the notation 2, while the remaining files use the notation 1.

To handle the process described above, we created three custom functions for reading and cleaning the CSV files, each documented with a docstring to make it easier to read and understand.

As the first preprocessing step, we create the lists of file paths and data abbreviations. The pairing is positional: the first element of filepaths_1 corresponds to the first element of abbreviations_1, and the same applies to filepaths_2 and abbreviations_2, so the order of the list elements matters. These lists make automation easier and map each file to its abbreviation.

In [3]:
# File paths for notation 1 ('che', 'wr', 'wu', 'sr', 'su', 'gem')
filepaths_1 = ['UPDATED CSV DATA - Intro to Big Data/Current health expenditure (% of GDP).csv',
               'UPDATED CSV DATA - Intro to Big Data/People using at least basic drinking water services, rural (% of rural population).csv',
               'UPDATED CSV DATA - Intro to Big Data/People using at least basic drinking water services, urban (% of urban population).csv',
               'UPDATED CSV DATA - Intro to Big Data/People using safely managed sanitation services, rural (% of rural population).csv',
               'UPDATED CSV DATA - Intro to Big Data/People using safely managed sanitation services, urban (% of urban population).csv',
               'UPDATED CSV DATA - Intro to Big Data/Total greenhouse gas emissions including LULUCF (Mt CO2e).csv']

# File paths for notation 2 ('pop', 'ren')
filepaths_2 = ['UPDATED CSV DATA - Intro to Big Data/Population, total.csv',
               'UPDATED CSV DATA - Intro to Big Data/Renewable energy consumption (% of total final energy consumption).csv']

# Abbreviations for easy mapping: notation 1
abbreviations_1 = ['che', 'wr', 'wu', 'sr', 'su', 'gem']

# Abbreviations for easy mapping: notation 2
abbreviations_2 = ['pop', 'ren']
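
Because the pairing between the lists is purely positional, a missing or extra element would shift every mapping after it. The sketch below is an optional safeguard, not part of the project code: it checks the lengths and then builds an explicit abbreviation-to-path dictionary. The names `abbreviations`, `filepaths` and `path_by_abbreviation` are chosen here for illustration only.

```python
# Illustrative copies of the notation 2 lists
abbreviations = ['pop', 'ren']
filepaths = [
    'UPDATED CSV DATA - Intro to Big Data/Population, total.csv',
    'UPDATED CSV DATA - Intro to Big Data/Renewable energy consumption (% of total final energy consumption).csv',
]

# zip() stops at the shorter list without raising, so check the lengths first
if len(abbreviations) != len(filepaths):
    raise ValueError("abbreviations and filepaths must have the same length")

# Make the positional pairing explicit as a dictionary
path_by_abbreviation = dict(zip(abbreviations, filepaths))
print(path_by_abbreviation['pop'])
```

With such a dictionary, a file path can be looked up by its abbreviation instead of by its position in a list.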

### Step 1:

We use a custom function built on the chardet library to detect each file's encoding, so that no encoding issues occur when the CSV files are opened in Python. We automated the process by iterating over the abbreviations and filepaths lists in parallel with zip(): when Python processes element 0 of abbreviations, it also processes element 0 of filepaths, so each abbreviation is paired with the correct file path.

**Notation 1 Files:**

In [4]:
# Dictionary to store detected encodings for each CSV file
encodings_1 = {}

# Iterate through abbreviations and filepaths together, detect encoding for each file
for abbreviation, filepath in zip(abbreviations_1, filepaths_1):
    encodings_1[abbreviation] = custom_functions.detect_encoding(filepath)

# Print the results to verify encodings
print("Here are the files' encodings: \n", encodings_1)

Here are the files' encodings: 
 {'che': 'UTF-8-SIG', 'wr': 'UTF-8-SIG', 'wu': 'UTF-8-SIG', 'sr': 'UTF-8-SIG', 'su': 'UTF-8-SIG', 'gem': 'UTF-8-SIG'}


**Notation 2 Files:**

In [5]:
# Dictionary to store detected encodings for each CSV file
encodings_2 = {}

# Iterate through abbreviations and filepaths together, detect encoding for each file
for abbreviation, filepath in zip(abbreviations_2, filepaths_2):
    encodings_2[abbreviation] = custom_functions.detect_encoding(filepath)

# Print the results to verify encodings
print("Here are the files' encodings: \n", encodings_2)

Here are the files' encodings: 
 {'pop': 'UTF-8-SIG', 'ren': 'UTF-8-SIG'}


At this point, all encodings are stored as values in a dictionary. All files currently share the same encoding, but if the encodings ever differ, each one can still be accessed through the corresponding dictionary key. For this reason we retrieve the encodings through the dictionary keys instead of writing the encoding manually for each file.
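
To illustrate why the detected 'UTF-8-SIG' value matters, here is a small self-contained sketch using only the standard library. It writes a made-up one-line file that starts with a UTF-8 byte order mark (BOM) and reads it back with two encodings. The file name and its content are invented for the demonstration.

```python
import codecs
import os
import tempfile

# Write a tiny semicolon-delimited file that starts with a UTF-8 BOM
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + "Country Name;1960\n".encode("utf-8"))

# Plain 'utf-8' keeps the BOM as an invisible first character
with open(path, encoding="utf-8") as f:
    header_plain = f.readline()

# 'utf-8-sig' strips the BOM while decoding
with open(path, encoding="utf-8-sig") as f:
    header_sig = f.readline()

print(repr(header_plain))  # '\ufeffCountry Name;1960\n'
print(repr(header_sig))    # 'Country Name;1960\n'
```

Reading such a file with plain 'utf-8' would leave the BOM attached to the first column name, which is one reason to pass the detected encoding instead of assuming one.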

### Step 2 & 3:

Since pandas cannot read these CSV files directly, we use standard Python to load portions of each CSV file into memory. We automate this as before by iterating through both the filepaths and abbreviations lists and using a custom function that returns a selected line from each CSV file. We also use the repr() function, which is very helpful here: print() renders special characters such as the newline instead of showing them, whereas repr() displays them as visible escape sequences. Seeing the original CSV lines without any changes is essential for correctly identifying the actions needed to clean the files so that pandas can read them successfully.
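
As a rough illustration of this idea, the sketch below defines a hypothetical helper, `read_line`, that returns one raw line from a file. It is not the project's `custom_functions.explore_csv`, whose implementation is not shown here, and it assumes a zero-based line index. The demo file content is made up.

```python
import os
import tempfile

def read_line(filepath, encoding, line_number):
    """Return the raw line at a zero-based position, or None if the file is shorter."""
    # newline="" keeps the original line endings untouched
    with open(filepath, encoding=encoding, newline="") as f:
        for index, line in enumerate(f):
            if index == line_number:
                return line
    return None

# Made-up file: three metadata lines, then the header, then one data row
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    f.write("meta\n\n\nCountry Name;Country Code;1960\nAruba;ABW;\n")

line = read_line(path, "utf-8-sig", 3)
print(line)        # the newline is rendered, so it is easy to overlook
print(repr(line))  # 'Country Name;Country Code;1960\n'
```

The second print makes the trailing newline and the delimiter visible, which is the kind of detail the exploration step relies on.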

**Notation 1 Files:**

In [6]:
# Iterate through abbreviations and filepaths together, print a CSV line for each file
for abbreviation, filepath in zip(abbreviations_1, filepaths_1):
    csv_part = custom_functions.explore_csv(filepath=filepath, encoding=encodings_1[abbreviation], line_number=3)
    print(abbreviation + ":")
    print(repr(csv_part))  # repr() shows special characters (e.g. '\n') as escape sequences; print() would render them

che:
'Country Name;Country Code;Indicator Name;Indicator Code;1960;1961;1962;1963;1964;1965;1966;1967;1968;1969;1970;1971;1972;1973;1974;1975;1976;1977;1978;1979;1980;1981;1982;1983;1984;1985;1986;1987;1988;1989;1990;1991;1992;1993;1994;1995;1996;1997;1998;1999;2000;2001;2002;2003;2004;2005;2006;2007;2008;2009;2010;2011;2012;2013;2014;2015;2016;2017;2018;2019;2020;2021;2022;2023;2024\n'
wr:
'Country Name;Country Code;Indicator Name;Indicator Code;1960;1961;1962;1963;1964;1965;1966;1967;1968;1969;1970;1971;1972;1973;1974;1975;1976;1977;1978;1979;1980;1981;1982;1983;1984;1985;1986;1987;1988;1989;1990;1991;1992;1993;1994;1995;1996;1997;1998;1999;2000;2001;2002;2003;2004;2005;2006;2007;2008;2009;2010;2011;2012;2013;2014;2015;2016;2017;2018;2019;2020;2021;2022;2023;2024\n'
wu:
'Country Name;Country Code;Indicator Name;Indicator Code;1960;1961;1962;1963;1964;1965;1966;1967;1968;1969;1970;1971;1972;1973;1974;1975;1976;1977;1978;1979;1980;1981;1982;1983;1984;1985;1986;1987;1988;1989;1990;1991;

**Notation 2 Files:**

In [7]:
# Iterate through abbreviations and filepaths together, print a CSV line for each file
for abbreviation, filepath in zip(abbreviations_2, filepaths_2):
    csv_part = custom_functions.explore_csv(filepath=filepath, encoding=encodings_2[abbreviation], line_number=4)
    print(abbreviation + ":")
    print(repr(csv_part))  # repr() shows special characters (e.g. '\n') as escape sequences; print() would render them

pop:
'"Country Name,""Country Code"",""Indicator Name"",""Indicator Code"",""1960"",""1961"",""1962"",""1963"",""1964"",""1965"",""1966"",""1967"",""1968"",""1969"",""1970"",""1971"",""1972"",""1973"",""1974"",""1975"",""1976"",""1977"",""1978"",""1979"",""1980"",""1981"",""1982"",""1983"",""1984"",""1985"",""1986"",""1987"",""1988"",""1989"",""1990"",""1991"",""1992"",""1993"",""1994"",""1995"",""1996"",""1997"",""1998"",""1999"",""2000"",""2001"",""2002"",""2003"",""2004"",""2005"",""2006"",""2007"",""2008"",""2009"",""2010"",""2011"",""2012"",""2013"",""2014"",""2015"",""2016"",""2017"",""2018"",""2019"",""2020"",""2021"",""2022"",""2023"",""2024"","\n'
ren:
'"Country Name,""Country Code"",""Indicator Name"",""Indicator Code"",""1960"",""1961"",""1962"",""1963"",""1964"",""1965"",""1966"",""1967"",""1968"",""1969"",""1970"",""1971"",""1972"",""1973"",""1974"",""1975"",""1976"",""1977"",""1978"",""1979"",""1980"",""1981"",""1982"",""1983"",""1984"",""1985"",""1986"",""1987"",""1988""

The above scripts, along with the custom function explore_csv, helped us identify the following:

1. All notation 1 files start at line number 3, whereas notation 2 files start at line number 4. The preceding lines contain metadata in both cases.
2. All notation 1 files use ';' as a separator, whereas notation 2 files use ','.
3. Rows in every file end with a newline character '\n'.
4. There are empty strings that should be manually converted to NaN values. Otherwise, pandas cannot recognize them as missing data.
5. Apart from the delimiter and the newline character at the end, the notation 1 files are clean, while notation 2 files include the following extra issues:  
    a. Each row ends with a comma followed by a quote, before the newline character: ',"\n'.  
    b. Quotes included in the data.  
    c. In 'pop', there is an extra comma in the column name 'Population, total', which causes pandas to treat 'Population' and 'total' as separate columns, even though they belong to the same column.  
    d. Each list element represents a row of the dataset.

The order in which these fixes are applied matters. For instance, removing all quotes before applying rstrip() can lead to unreliable results, since the trailing characters that rstrip() is meant to remove have already changed by then.
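
The effect of the ordering can be sketched on a single line. The two functions below are hypothetical illustrations, not the project's `clean_csv`. The sample row is made up, and it assumes that an empty value is written as `""""` inside a notation 2 row, which may not match the real files.

```python
def clean_line(raw):
    """Strip the newline and the trailing ',"' first, then drop the remaining quotes."""
    line = raw.rstrip('\n').rstrip('"').rstrip(',')
    return line.replace('"', '').split(',')

def clean_line_wrong_order(raw):
    """Drop all quotes first; the later rstrip(',') then also removes empty trailing fields."""
    line = raw.replace('"', '')
    line = line.rstrip('\n').rstrip('"').rstrip(',')
    return line.split(',')

# Made-up notation 2 style row whose last value is empty (assumed to appear as """")
raw = '"Aruba,""ABW"",""54608"","""","\n'

print(clean_line(raw))              # ['Aruba', 'ABW', '54608', '']
print(clean_line_wrong_order(raw))  # ['Aruba', 'ABW', '54608']
```

In the second variant the empty last field is lost, so the row ends up with fewer values than the header has columns.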

### Step 4

We created a custom function to clean all files. The function is dynamic, meaning it can handle both notation 1 and notation 2 datasets. The only point that requires attention is that the function must accept different arguments depending on the file notation. As mentioned before, notation 2 files require special treatment.

We adopted a new approach by creating a third list that maps to the other two lists introduced earlier. This third list contains the names of the DataFrames.

**Reading Notation 1 Files to DataFrames:**

In [8]:
# List of the DataFrames for notation 1
dfnames_1 = ['df_che', 'df_wr', 'df_wu', 'df_sr', 'df_su', 'df_gem']

In [9]:
df_dict_1 = {}  # Initialize an empty dictionary to store the DataFrames

# Iterate through the three lists simultaneously.
# This works because the lists were created with the correct mapping order.
for df_name, abbreviation, filepath in zip(dfnames_1, abbreviations_1, filepaths_1):
    # Use the DataFrame name string as the key in the dictionary
    df_dict_1[df_name] = custom_functions.clean_csv(
        filepath=filepath,  # Path to the CSV file
        encoding=encodings_1[abbreviation],  # Retrieve encoding based on abbreviation
        separator=';',   # Set the column delimiter
        trail1='\n',  # First trailing character to remove
        trail2=None,  # Second trailing character to remove (optional)
        trail3=None,  # Third trailing character to remove (optional)
        to_be_replaced='"',  # Characters to replace (quotes in this case)
        start_row=3  # Row index corresponding to column headers
    )

# Extract the DataFrames from the dictionary and assign to variables
df_che = df_dict_1['df_che']
df_wr  = df_dict_1['df_wr']
df_wu  = df_dict_1['df_wu']
df_sr  = df_dict_1['df_sr']
df_su  = df_dict_1['df_su']
df_gem = df_dict_1['df_gem']

**Reading Notation 2 Files to DataFrames:**

In [10]:
# List of the DataFrames for notation 2
dfnames_2 = ['df_pop', 'df_ren']

In [11]:
df_dict_2 = {}

for df_name, abbreviation, filepath in zip(dfnames_2, abbreviations_2, filepaths_2):
    df_dict_2[df_name] = custom_functions.clean_csv(
        filepath=filepath,
        encoding=encodings_2[abbreviation],
        separator=',',
        trail1='\n',
        trail2='"',
        trail3=',',
        to_be_replaced='"',
        start_row=4
    )
    
df_pop = df_dict_2['df_pop']
df_ren = df_dict_2['df_ren']

Below is a list that includes all DataFrames. This is a convenient way to iterate through them later, automate some processes, and keep the code clean.

In [12]:
# List of DataFrames for automations
all_dfs = [df_che, df_wr, df_wu, df_sr, df_su, df_gem, df_pop, df_ren]

# Quick Exploration on Data Integrity & Observations 

The results appear reliable, as all DataFrames have the same shape and identical column names. A quick inspection suggests that our cleaning process has been effective. All indicators are reported per country, so the 266 rows represent countries, territories and regional aggregates worldwide. The 193 sovereign countries are included, and the additional rows correspond to non-independent territories, overseas dependencies and entire regions, such as Puerto Rico, Hong Kong, Bermuda, Africa Eastern and Southern, Arab World, etc.

The indicators’ data have been collected from 1960 to the present (2024). There are several reasons why 1960 is used as the starting year. By this time, most countries had rebuilt or established functioning statistical offices after World War II, enabling systematic and comparable data collection. Additionally, decolonization and the formation of new countries occurred primarily in the 1950s and 1960s. Many nations became independent around this period, so data prior to 1960 would often be incomplete, inconsistent, or recorded under colonial administrations.

In [13]:
print('che:', df_che.shape)
print('wr:', df_wr.shape)
print('wu:', df_wu.shape)
print('sr:', df_sr.shape)
print('su:', df_su.shape)
print('pop:', df_pop.shape)
print('ren:', df_ren.shape)
print('gem:', df_gem.shape)

che: (266, 69)
wr: (266, 69)
wu: (266, 69)
sr: (266, 69)
su: (266, 69)
pop: (266, 69)
ren: (266, 69)
gem: (266, 69)
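
The same consistency check can also be done programmatically rather than by eye. The sketch below uses two tiny made-up DataFrames in a dictionary named `frames` (a name chosen here for illustration) and lists every DataFrame whose shape or column labels differ from the first one.

```python
import pandas as pd

# Made-up miniature DataFrames standing in for the real indicators
columns = ['Country Name', 'Country Code', '1960']
frames = {
    'che': pd.DataFrame([['Aruba', 'ABW', None]], columns=columns),
    'pop': pd.DataFrame([['Aruba', 'ABW', 54608]], columns=columns),
}

# Use the first DataFrame as the reference for shape and column labels
reference = next(iter(frames.values()))
mismatches = [
    name for name, df in frames.items()
    if df.shape != reference.shape or not df.columns.equals(reference.columns)
]
print(mismatches)  # an empty list means all DataFrames agree
```

An empty result corresponds to the observation above that all DataFrames share one shape and one set of column names.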


In [14]:
print("che Columns:\n", df_che.columns)
print("wr Columns:\n", df_wr.columns)
print("wu Columns:\n", df_wu.columns)
print("sr Columns:\n", df_sr.columns)
print("su Columns:\n", df_su.columns)
print("pop Columns:\n", df_pop.columns)
print("ren Columns:\n", df_ren.columns)
print("gem Columns:\n", df_gem.columns)

che Columns:
 Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022',
       '2023', '2024'],
      dtype='object')
wr Columns:
 Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '19

In [15]:
# Unique country names in the datasets (the trailing semicolon suppresses the cell output)
df_che['Country Name'].unique();

# Handling Missing Values

In [16]:
df_nans = []  # List to hold one single-row DataFrame of NaN counts per indicator
for df in all_dfs:  # Iterate through the DataFrames list
    columns = df.columns  # Reuse the original column names
    data = df.isna().sum().values.reshape(1, -1)  # NaN count per column, as a single row
    nan_counts = pd.DataFrame(data=data, columns=columns)  # One-row DataFrame of NaN counts
    df_nans.append(nan_counts)  # Add the DataFrame to the list
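
Counting with isna() only works if missing values are real NaN values rather than empty strings. The sketch below shows one possible way to get there on a made-up table: `pd.to_numeric` with `errors='coerce'` turns empty strings in the year columns into NaN. Whether the project's `clean_csv` does the conversion this way is not shown here, so this is an assumption for illustration.

```python
import pandas as pd

# Made-up cleaned rows: the first row is the header, empty strings mark missing values
rows = [
    ['Country Name', 'Country Code', '1960', '1961'],
    ['Aruba', 'ABW', '54608', ''],
    ['World', 'WLD', '', ''],
]
df = pd.DataFrame(rows[1:], columns=rows[0])

# Convert the year columns to numbers; anything non-numeric (including '') becomes NaN
year_cols = [c for c in df.columns if c.isdigit()]
df[year_cols] = df[year_cols].apply(pd.to_numeric, errors='coerce')

# Count the missing values per column
nan_counts = df.isna().sum()
print(nan_counts)
```

Here '1960' has one missing value and '1961' has two, while the text columns have none.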

In [17]:
# Assign each NaN-count DataFrame to a named variable (same order as all_dfs)
df_che_nans = df_nans[0]
df_wr_nans = df_nans[1]
df_wu_nans = df_nans[2]
df_sr_nans = df_nans[3]
df_su_nans = df_nans[4]
df_gem_nans = df_nans[5]
df_pop_nans = df_nans[6]
df_ren_nans = df_nans[7]

**Current health expenditure (% of GDP)**

For Current Health Expenditure (% of GDP), all values are missing before 2000, and 232 of the 266 rows report a value in 2000. This is consistent with the history of international health data collection. Tracking health expenditure requires national health accounts (NHAs), and many countries did not have a formal NHA system until the late 1990s or early 2000s. Developing countries, especially in Africa, Asia, and Latin America, started implementing NHAs only around 2000. Additionally, World Bank metadata confirms that systematic collection of health expenditure as % of GDP began in 2000 for most countries.

**People using at least basic drinking water services, rural (% of rural population)**

**People using at least basic drinking water services, urban (% of urban population)**

**People using safely managed sanitation services, rural (% of rural population)**

**People using safely managed sanitation services, urban (% of urban population)**

**Total greenhouse gas emissions including LULUCF (Mt CO2e)**

**Population, total**

**Renewable energy consumption (% of total final energy consumption)**

In [32]:
df_che_nans.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
0,0,0,0,0,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,266,34,34,33,31,31,31,31,31,31,31,30,29,29,28,28,28,28,27,26,26,26,26,27,245,266
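
The NaN counts can be turned into the number of rows that do report a value by subtracting them from the total of 266 rows. The short sketch below does this for a few years, with the counts copied from the output above. The names `TOTAL_ROWS`, `nan_counts` and `reporting` are illustrative.

```python
TOTAL_ROWS = 266  # Number of rows in every DataFrame

# NaN counts for selected years, copied from the df_che_nans output
nan_counts = {'1999': 266, '2000': 34, '2022': 27, '2023': 245, '2024': 266}

# Rows with a reported value = total rows minus missing values
reporting = {year: TOTAL_ROWS - missing for year, missing in nan_counts.items()}
print(reporting)  # {'1999': 0, '2000': 232, '2022': 239, '2023': 21, '2024': 0}
```

This reproduces the figure of 232 reporting rows in 2000 and shows that the two most recent years are still largely or entirely empty.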


In [31]:
df_gem[df_gem['1990'].notna()]

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
259,World,WLD,Total greenhouse gas emissions including LULUC...,EN.GHG.ALL.LU.MT.CE.AR5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3134282431,314360547,3149945781,3189984709,32100025,3407593276,3361033536,3298455005,3433538494,3467915975,367044151,3541374566,3677107088,3859053981,4112584819,410772551,4313924053,4333166492,428314438,4183671677,4392412891,4753557026,4843230915,4820466336,489991287,4921882751,4787550134,4911904844,50984951,512783474,4887142231,5173772293,5183806473,5450070669,
