# Introduction to Big Data Project - Main Python File

In [1]:
# Check that jupyter lab points to your project's environment directory
import sys
sys.executable

'C:\\Users\\Adespotos\\anaconda3\\envs\\bigdata_env\\python.exe'

# Import Libraries

In [2]:
import numpy as np  # Numerical computing library
import pandas as pd  # Data handling
import matplotlib.pyplot as plt  # Plotting library
import seaborn as sns  # Statistical plotting library built on top of matplotlib (for better visualizations)
import pymysql  # MySQL database connector for Python
import chardet  # Library for automatic character encoding detection
import re  # Regular expressions for string manipulation and pattern matching

import custom_functions  # Custom functions for this project (e.g., CSV cleaning, encoding detection)

pd.options.display.max_columns = 100

# Read Dataset to a DataFrame

The first element of each filepaths list (filepaths_1, filepaths_2) corresponds to the first element of the matching abbreviations list (abbreviations_1, abbreviations_2) below, and so on by position. These lists make automation easier and help map files to their abbreviations.

In [3]:
# List of filepaths to our indicators data
filepaths_1 = ['UPDATED CSV DATA - Intro to Big Data/Current health expenditure (% of GDP).csv',
               'UPDATED CSV DATA - Intro to Big Data/People using at least basic drinking water services, rural (% of rural population).csv',
               'UPDATED CSV DATA - Intro to Big Data/People using at least basic drinking water services, urban (% of urban population).csv',
               'UPDATED CSV DATA - Intro to Big Data/People using safely managed sanitation services, rural (% of rural population).csv',
               'UPDATED CSV DATA - Intro to Big Data/People using safely managed sanitation services, urban (% of urban population).csv',
               'UPDATED CSV DATA - Intro to Big Data/Total greenhouse gas emissions including LULUCF (Mt CO2e).csv']

filepaths_2 = ['UPDATED CSV DATA - Intro to Big Data/Population, total.csv',
               'UPDATED CSV DATA - Intro to Big Data/Renewable energy consumption (% of total final energy consumption).csv']

# Custom abbreviation lists for mapping with each dataset
abbreviations_1 = ['che', 'wr', 'wu', 'sr', 'su', 'gem']

abbreviations_2 = ['pop', 'ren']
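
Because the pairing relies purely on position, a missing or extra entry in either list would silently mispair or drop files (zip() stops at the shorter input). The following is a minimal sketch of a guard and is not part of the original notebook; the helper name pair_by_position is hypothetical.

```python
def pair_by_position(abbreviations, filepaths):
    """Map each abbreviation to the filepath at the same index.

    Hypothetical helper: raises instead of letting zip() truncate silently.
    """
    if len(abbreviations) != len(filepaths):
        raise ValueError(
            f"{len(abbreviations)} abbreviations but {len(filepaths)} filepaths"
        )
    return dict(zip(abbreviations, filepaths))


# Usage with the second pair of lists from this notebook
paths_by_abbreviation = pair_by_position(
    ['pop', 'ren'],
    ['UPDATED CSV DATA - Intro to Big Data/Population, total.csv',
     'UPDATED CSV DATA - Intro to Big Data/Renewable energy consumption (% of total final energy consumption).csv'],
)
print(paths_by_abbreviation['pop'])
```

A dictionary keyed by abbreviation also makes later lookups explicit, instead of depending on list order.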

A detailed exploration of the CSV files is necessary to ensure they can be read correctly into DataFrames.

We selected five health indicators and three environmental indicators. Some of these files are too complex to be read directly into DataFrames: when we called pandas.read_csv() on them, it could not parse the files properly and produced unreliable results.

Rather than choosing easier files, we took this as an opportunity to develop a step-by-step solution and learn as much as possible from the process. The Pythonic approach we followed was:

1. Ensure the correct file encoding.
2. Fetch the CSV file into Python's runtime (without using pandas, since it could not read the file correctly).
3. Explore the fetched CSV content to identify the issues.
4. Write scripts to address one problem at a time, progressing toward the final solution.

The indicators that required this special treatment were **"Population, total"** and **"Renewable energy consumption (% of total final energy consumption)"**.

To handle this process, we created three custom functions for reading and cleaning the CSV files. We also included docstrings in our functions to improve their readability and understanding.

### Step 1:

We use a function to detect each file's encoding, so that encoding problems do not surface later when the CSV files are opened in Python. We automated the process by iterating over an abbreviations list and its filepaths list in parallel with zip(). When Python processes element 0 of abbreviations_1, it also processes element 0 of filepaths_1, so each abbreviation is paired with the correct file path.

In [4]:
# Dictionary to store detected encodings for each CSV file
encodings_1 = {}

# Iterate through abbreviations and filepaths together, detect encoding for each file
for abbreviation, filepath in zip(abbreviations_1, filepaths_1):
    encodings_1[abbreviation] = custom_functions.detect_encoding(filepath)

# Print the results to verify encodings
print("Here are the files' encodings: \n", encodings_1)

Here are the files' encodings: 
 {'che': 'UTF-8-SIG', 'wr': 'UTF-8-SIG', 'wu': 'UTF-8-SIG', 'sr': 'UTF-8-SIG', 'su': 'UTF-8-SIG', 'gem': 'UTF-8-SIG'}


In [5]:
# Dictionary to store detected encodings for each CSV file
encodings_2 = {}

# Iterate through abbreviations and filepaths together, detect encoding for each file
for abbreviation, filepath in zip(abbreviations_2, filepaths_2):
    encodings_2[abbreviation] = custom_functions.detect_encoding(filepath)

# Print the results to verify encodings
print("Here are the files' encodings: \n", encodings_2)

Here are the files' encodings: 
 {'pop': 'UTF-8-SIG', 'ren': 'UTF-8-SIG'}


At this point, all encodings are safely stored as values in a dictionary.
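
'UTF-8-SIG' is Python's name for UTF-8 with a leading byte order mark (BOM). As a cross-check that does not depend on custom_functions.detect_encoding (whose implementation is not shown here), the BOM can be tested directly with the standard library. The helper name has_utf8_bom and the temporary demo file are illustrative assumptions, not part of the project.

```python
import codecs
import os
import tempfile


def has_utf8_bom(filepath):
    """Return True if the file starts with the UTF-8 byte order mark."""
    with open(filepath, 'rb') as handle:
        return handle.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8


# Demo on a throwaway file written with a BOM (not one of the project files)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.csv')
    with open(demo_path, 'w', encoding='utf-8-sig') as handle:
        handle.write('Country Name;Country Code\n')

    bom_found = has_utf8_bom(demo_path)

    # Plain 'utf-8' keeps the BOM as a stray '\ufeff' at the start of the
    # first field; 'utf-8-sig' removes it while decoding.
    with open(demo_path, encoding='utf-8') as handle:
        first_line_plain = handle.readline()
    with open(demo_path, encoding='utf-8-sig') as handle:
        first_line_sig = handle.readline()

print(bom_found)
print(repr(first_line_plain))
print(repr(first_line_sig))
```

This is why the detected encoding is passed along to every later step: decoding with plain 'utf-8' would leave the BOM glued to the first column name.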

### Step 2 & 3:

Standard Python can be used to load parts of a CSV file into memory. We can automate this process in the same way as before and display a line from each CSV file directly within a Jupyter notebook.
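
The implementation of custom_functions.explore_csv is not shown in this notebook. A minimal sketch of the same idea, reading a single line without loading the whole file, could look like this. The name read_line is hypothetical, and the 0-based line index is an assumption that may differ from the convention explore_csv uses.

```python
import os
import tempfile
from itertools import islice


def read_line(filepath, encoding, line_number):
    """Return the line at 0-based index line_number, or None if the file is shorter."""
    with open(filepath, encoding=encoding) as handle:
        # islice skips the earlier lines lazily, so only one line is kept in memory
        return next(islice(handle, line_number, line_number + 1), None)


# Demo on a throwaway file: two metadata lines, then the header row
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.csv')
    with open(demo_path, 'w', encoding='utf-8-sig') as handle:
        handle.write('metadata line\n\nCountry Name;Country Code;1960\n')

    header = read_line(demo_path, encoding='utf-8-sig', line_number=2)
    missing = read_line(demo_path, encoding='utf-8-sig', line_number=10)

print(repr(header))
```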

In [6]:
# Iterate through abbreviations and filepaths together, print a CSV line for each file
for abbreviation, filepath in zip(abbreviations_1, filepaths_1):
    csv_part = custom_functions.explore_csv(filepath=filepath, encoding=encodings_1[abbreviation], line_number=3)
    print(abbreviation + ":")
    print(repr(csv_part))  # repr() shows escape sequences such as '\n' literally; print() alone would render them

che:
'Country Name;Country Code;Indicator Name;Indicator Code;1960;1961;1962;1963;1964;1965;1966;1967;1968;1969;1970;1971;1972;1973;1974;1975;1976;1977;1978;1979;1980;1981;1982;1983;1984;1985;1986;1987;1988;1989;1990;1991;1992;1993;1994;1995;1996;1997;1998;1999;2000;2001;2002;2003;2004;2005;2006;2007;2008;2009;2010;2011;2012;2013;2014;2015;2016;2017;2018;2019;2020;2021;2022;2023;2024\n'
wr:
'Country Name;Country Code;Indicator Name;Indicator Code;1960;1961;1962;1963;1964;1965;1966;1967;1968;1969;1970;1971;1972;1973;1974;1975;1976;1977;1978;1979;1980;1981;1982;1983;1984;1985;1986;1987;1988;1989;1990;1991;1992;1993;1994;1995;1996;1997;1998;1999;2000;2001;2002;2003;2004;2005;2006;2007;2008;2009;2010;2011;2012;2013;2014;2015;2016;2017;2018;2019;2020;2021;2022;2023;2024\n'
wu:
'Country Name;Country Code;Indicator Name;Indicator Code;1960;1961;1962;1963;1964;1965;1966;1967;1968;1969;1970;1971;1972;1973;1974;1975;1976;1977;1978;1979;1980;1981;1982;1983;1984;1985;1986;1987;1988;1989;1990;1991;

In [7]:
# Iterate through abbreviations and filepaths together, print a CSV line for each file
for abbreviation, filepath in zip(abbreviations_2, filepaths_2):
    csv_part = custom_functions.explore_csv(filepath=filepath, encoding=encodings_2[abbreviation], line_number=4)
    print(abbreviation + ":")
    print(repr(csv_part))  # repr() shows escape sequences such as '\n' literally; print() alone would render them

pop:
'"Country Name,""Country Code"",""Indicator Name"",""Indicator Code"",""1960"",""1961"",""1962"",""1963"",""1964"",""1965"",""1966"",""1967"",""1968"",""1969"",""1970"",""1971"",""1972"",""1973"",""1974"",""1975"",""1976"",""1977"",""1978"",""1979"",""1980"",""1981"",""1982"",""1983"",""1984"",""1985"",""1986"",""1987"",""1988"",""1989"",""1990"",""1991"",""1992"",""1993"",""1994"",""1995"",""1996"",""1997"",""1998"",""1999"",""2000"",""2001"",""2002"",""2003"",""2004"",""2005"",""2006"",""2007"",""2008"",""2009"",""2010"",""2011"",""2012"",""2013"",""2014"",""2015"",""2016"",""2017"",""2018"",""2019"",""2020"",""2021"",""2022"",""2023"",""2024"","\n'
ren:
'"Country Name,""Country Code"",""Indicator Name"",""Indicator Code"",""1960"",""1961"",""1962"",""1963"",""1964"",""1965"",""1966"",""1967"",""1968"",""1969"",""1970"",""1971"",""1972"",""1973"",""1974"",""1975"",""1976"",""1977"",""1978"",""1979"",""1980"",""1981"",""1982"",""1983"",""1984"",""1985"",""1986"",""1987"",""1988""

The above scripts, along with the custom function explore_csv, helped us identify the following:

1. Files 'che', 'wr', 'wu', 'sr', 'su', and 'gem' start at line number 3, whereas 'pop' and 'ren' start at line number 4. The preceding lines contain metadata.
2. Files 'che', 'wr', 'wu', 'sr', 'su', and 'gem' use ';' as a separator, whereas 'pop' and 'ren' use ','.
3. Rows in every file end with a newline character '\n'.
4. Apart from the delimiter and the newline character at the end, the files 'che', 'wr', 'wu', 'sr', 'su', and 'gem' are clean, while 'pop' and 'ren' include the following extra issues:  
    a. Each row ends with a comma followed by a quote, before the newline character: ',"\n'.  
    b. Quotes included in the data.  
    c. In 'pop', the value 'Population, total' in the 'Indicator Name' column contains an extra comma, which causes pandas to treat 'Population' and 'total' as separate columns, even though they belong to the same column.  
    d. Each list element represents a row of the dataset.

**NOTE: If we remove quotes before applying rstrip to the newline character, the file 'ren' may not be read correctly.**
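
The shape of the 'pop' and 'ren' rows (the whole row wrapped in one pair of quotes, with every inner quote doubled) suggests an alternative way to handle issues 4a to 4c. The sketch below is not how custom_functions.clean_csv works; it swaps in Python's csv module, applied twice. The first pass unwraps the outer quotes, and the second pass splits the inner fields, so a quoted value such as 'Population, total' stays in one piece. The function name parse_wrapped_row and the data row are made up for illustration; the header is a shortened version of the one printed above.

```python
import csv


def parse_wrapped_row(line):
    """Parse a row that is itself wrapped in quotes, with inner quotes doubled."""
    # Pass 1: the whole row is a single quoted field; unwrapping it also
    # turns every doubled quote ("") back into a single quote (").
    outer_fields = next(csv.reader([line]))
    # Pass 2: the unwrapped text is an ordinary comma-separated row.
    fields = next(csv.reader([outer_fields[0]]))
    # The trailing comma before the closing quote yields one empty last field.
    if fields and fields[-1] == '':
        fields.pop()
    return fields


# Shortened header in the same shape as the 'pop' output above
header = '"Country Name,""Country Code"",""Indicator Name"",""1960"","\n'
# Made-up data row (placeholder country and value), for illustration only
row = '"Exampleland,""EXL"",""Population, total"",""1000"","\n'

print(parse_wrapped_row(header))
print(parse_wrapped_row(row))
```

Only one trailing empty field is dropped, so a genuinely empty value in the last year column would still be kept.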

### Step 4:

We created a custom function to clean all files. The function is parameterized, so the same code can clean the datasets 'che', 'wr', 'wu', 'sr', 'su', 'gem', as well as 'pop' and 'ren'.
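
The body of custom_functions.clean_csv is not reproduced in this notebook. Judging only from the arguments passed in the calls below, a single row could be cleaned roughly as follows. This is a sketch under assumptions: the helper name clean_row is hypothetical, and the reading that trail1, trail2 and trail3 are removed from the end of the row in that order, before to_be_replaced is deleted and the row is split on separator, is a guess rather than documented behavior.

```python
def clean_row(line, separator, trail1=None, trail2=None, trail3=None,
              to_be_replaced=None):
    """Strip up to three trailing tokens in order, drop a character, then split."""
    for trail in (trail1, trail2, trail3):
        # None (or an empty string) means "nothing to strip at this step"
        if trail and line.endswith(trail):
            line = line[:-len(trail)]
    if to_be_replaced is not None:
        line = line.replace(to_be_replaced, '')
    return line.split(separator)


# Shortened headers in the two shapes printed earlier in this notebook
semicolon_header = 'Country Name;Country Code;1960;1961\n'
wrapped_header = '"Country Name,""Country Code"",""1960"",""1961"","\n'

print(clean_row(semicolon_header, ';', trail1='\n', to_be_replaced='"'))
print(clean_row(wrapped_header, ',', trail1='\n', trail2='"', trail3=',',
                to_be_replaced='"'))
```

A plain split like this would still break a value that contains the separator, such as 'Population, total', so that case needs separate handling.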

### Reading 'che', 'wr', 'wu', 'sr', 'su', 'gem' to DataFrames

In [8]:
dfnames_1 = ['df_che', 'df_wr', 'df_wu', 'df_sr', 'df_su', 'df_gem']

In [9]:
df_dict_1 = {}

for df_name, abbreviation, filepath in zip(dfnames_1, abbreviations_1, filepaths_1):
    df_dict_1[df_name] = custom_functions.clean_csv(
        filepath=filepath,
        encoding=encodings_1[abbreviation],
        separator=';',
        trail1='\n',
        trail2=None,
        trail3=None,
        to_be_replaced='"',
        start_row=3
    )
    
df_che = df_dict_1['df_che']
df_wr = df_dict_1['df_wr']
df_wu = df_dict_1['df_wu']
df_sr = df_dict_1['df_sr']
df_su = df_dict_1['df_su']
df_gem = df_dict_1['df_gem']

### Reading 'pop' and 'ren' to DataFrames

In [10]:
dfnames_2 = ['df_pop', 'df_ren']

In [11]:
df_dict_2 = {}

for df_name, abbreviation, filepath in zip(dfnames_2, abbreviations_2, filepaths_2):
    df_dict_2[df_name] = custom_functions.clean_csv(
        filepath=filepath,
        encoding=encodings_2[abbreviation],
        separator=',',
        trail1='\n',
        trail2='"',
        trail3=',',
        to_be_replaced='"',
        start_row=4
    )
    
df_pop = df_dict_2['df_pop']
df_ren = df_dict_2['df_ren']

# Quick Exploration 

In [14]:
print('che:', df_che.shape)
print('wr:', df_wr.shape)
print('wu:', df_wu.shape)
print('sr:', df_sr.shape)
print('su:', df_su.shape)
print('pop:', df_pop.shape)
print('ren:', df_ren.shape)
print('gem:', df_gem.shape)

che: (266, 69)
wr: (266, 69)
wu: (266, 69)
sr: (266, 69)
su: (266, 69)
pop: (266, 69)
ren: (266, 69)
gem: (266, 69)
