# Getting and Reading Data

James Connelly

[Return to Table of Contents](../html/TableOfContents.html)

Previous Section: [2-Python Basics](../html/2-PythonBasics.html)

## Step 1: Getting and Understanding a Data Source

Much of the world runs on data. Data is everywhere, and it is a very powerful
tool.

Data powers predictions, points out outliers, supports retrospectives on how to
improve, identifies trends, and overall drives decision making.

Given a task, there will be data associated with it. The more data, the better
and more accurate a prediction can be, given that the correct variables are
being looked at. 

For this lesson, we will be looking at some publicly available data to get a
base understanding of how to break down a given data source, read what is
available, and use it to interpret some outcome.

For this lesson, I pulled a dataset from Kaggle, a popular site for open
datasets. The CSV is included as a zip under
`courses/data/global-earthquake-tsunami-risk-assessment-dataset.zip`.

The link is below if you want to download it yourself; extracting the zip will
give you the CSV file that we will be breaking down.

```https://www.kaggle.com/datasets/ahmeduzaki/global-earthquake-tsunami-risk-assessment-dataset/data```

CSV stands for Comma-Separated Values. In terms of a table, each comma marks the
start of the next column, and each line is a new row.

In most datasets where the values in a column share a meaning - unless it is a
homemade file built for a single purpose - there will be a header line
describing what each column represents.
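
To make the layout concrete, here is a minimal sketch using a made-up three-line CSV string. The column names and values are illustrative only and are not taken from the earthquake file.

```python
# A tiny, made-up CSV: the first line is the header, each later line is one row.
sample = "magnitude,depth,tsunami\n7.0,14,1\n6.9,25,0\n"

rows = sample.strip().split("\n")  # one string per line
header = rows[0].split(",")        # the column names
first_row = rows[1].split(",")     # the values of the first data row

print(header)
print(first_row)
```

Every value comes back as a string, which is why a conversion step is usually needed before doing any math on the data.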

If we open the given file, we see that we do have a header. We can use this
header to know what each column of our data means, and use that to break down
the data and visualize it.

In [55]:
file = open("../data/earthquake_data_tsunami.csv")
line = file.readline().replace(",","\n") # Read in the first line, which contains our headers, and replace commas with newline characters for the printout
print(line)
file.close() # Close the file so that later on we can reopen it and read from the top, header included

magnitude
cdi
mmi
sig
nst
dmin
gap
depth
latitude
longitude
Year
Month
tsunami



In our data source, we have 13 columns. Some of the data that we might find
important is magnitude, latitude, longitude, Year, Month, tsunami. 

With these headers in particular, we can potentially predict which areas are
more at risk, at what times of the year earthquakes are more likely, whether
certain magnitudes have a higher probability of causing a tsunami, and so on.

### Reading in the Data for Easy Use

Now that we know what parts of the data we care about, we can actually parse the data out into individual pieces for quick access and be able to start drawing out information from this data. 

In [56]:
with open("../data/earthquake_data_tsunami.csv") as file: # Open our file fresh; the with block closes it once we are done reading
    lines = file.readlines() # Get the lines separated out to work on them as a list of strings
dataset = {} # Instantiate our dataset dictionary, where the key will be our column header and the value will be a list of the rest of the data in that column

separator = ',' # CSV using comma-separated values

for header in lines[0].split(separator):  #split on our separator for the very first line to get the header
    dataset[header.strip()] = [] # Instantiate that column in the dictionary as empty, we will append to this later

keys = list(dataset) # Get our keys as a list for faster lookups when we go through the data

for line in lines[1:]: # Going through the file line by line, starting with index 1 (second line) to the end
    if not line.strip(): # Skip blank lines, such as an empty line at the end of the file
        continue
    dataArr = line.split(separator) # Split out each column of data to go through it
    for i, raw in enumerate(dataArr):
        datum = raw.strip() # Get the specific data element we are moving in, without surrounding whitespace
        dataset[keys[i]].append(datum) # Append our datum to the list for its column

def print_dataset(dset):
    for key, val in dset.items():
        print(key)
        print(val)
        print()

print_dataset(dataset)

magnitude
['7', '6.9', '7', '7.3', '6.6', '7', '6.8', '6.7', '6.8', '7.6', '6.9', '6.5', '7', '7.6', '6.6', '6.6', '7', '6.5', '7.2', '6.9', '6.8', '6.6', '7', '6.9', '6.7', '7.3', '6.7', '6.6', '6.8', '6.5', '6.5', '6.6', '6.5', '6.5', '6.6', '6.7', '6.7', '6.8', '6.6', '6.6', '7.3', '7.3', '7.5', '6.6', '6.9', '6.9', '7.3', '6.5', '7', '7.1', '6.6', '6.9', '6.9', '7.2', '6.9', '6.9', '8.1', '7.5', '7.1', '8.2', '8.2', '6.7', '6.7', '6.5', '6.5', '7.3', '6.7', '6.7', '6.9', '6.5', '6.5', '7', '6.6', '6.5', '8.1', '7.4', '7.3', '7.1', '7.7', '6.9', '7', '6.7', '6.7', '7', '7.6', '6.9', '6.5', '6.8', '6.5', '6.9', '6.9', '6.8', '6.6', '7.8', '7', '6.6', '7.4', '7.4', '6.6', '6.8', '6.5', '6.6', '6.8', '6.5', '6.5', '7.5', '7', '7.7', '6.7', '6.8', '7.1', '6.5', '6.6', '6.5', '6.6', '6.7', '6.5', '6.6', '6.6', '6.9', '6.8', '6.6', '7.2', '6.6', '6.9', '7.1', '7.3', '7.3', '6.6', '8', '7.6', '7.1', '6.8', '6.5', '7', '7.5', '6.7', '6.7', '6.7', '6.6', '6.6', '6.8', '7', '7.3', '7.1', '6.6

We now have all of the data ingested into Python from a file, and with this we
have a lot of options for how to tackle the data, such as filtering it,
identifying correlations, and so on.
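
For comparison, Python's standard library ships a `csv` module that handles the splitting for us. The sketch below is one possible alternative to the manual loop, not a replacement for it; the helper name `read_columns` is hypothetical, and the sample text is made up so the block runs without the data file.

```python
import csv
import io

def read_columns(handle):
    """Read CSV text from an open file-like object into {header: [values, ...]}."""
    reader = csv.DictReader(handle)  # uses the first line as the header
    columns = {name.strip(): [] for name in reader.fieldnames}
    for row in reader:               # each row is a dict of header -> string value
        for name, value in row.items():
            columns[name.strip()].append(value.strip())
    return columns

# Made-up sample standing in for the real file.
sample = io.StringIO("magnitude,Year,tsunami\n7.0,2022,1\n6.9,2021,0\n")
columns = read_columns(sample)
print(columns)
```

To use it on the real data, you would pass an open file instead of the `io.StringIO` object, for example inside a `with open(...) as file:` block.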

## Finding Correlations

To start processing this data, we should begin by shrinking our scope. We still have data fields that we don't care about, along with some data that isn't relevant, so let's go ahead and filter those out.

In [57]:
keys = ["tsunami", "magnitude", "latitude", "longitude", "Month", "Year"]

filtered = {k: dataset[k] for k in keys if k in dataset}

print_dataset(filtered)

tsunami
['1', '0', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '0', '1', '1', '0', '1', '1', '1', '1', '1', '1', '1', '1', '0', '1', '1', '0', '1', '0', '1', '1', '1', '1', '1', '1', '0', '0', '0', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '0', '1', '1', '1', '1', '1', '1', '1', '1', '0', '0', '0', '1', '0', '0', '1', '1', '0', '1', '1', '1', '1', '1', '1', '1', '0', '1', '0', '1', '1', '0', '1', '0', '0', '0', '0', '1', '1', '1', '0', '1', '1', '1', '1', '0', '1', '0', '0', '0', '1', '1', '1', '0', '1', '1', '0', '1', '1', '1', '1', '1', '1', '1', '0', '1', '1', '1', '0', '0', '0', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '0', '1', '1', '1', '0', '1', '1', '1', '1', '1', '1', '1', '1', '0', '0', '1', '1', '1', '1', '1', '1', '1', '0', '0', '0', '1', '1', '1', '1', '0', '1', '1', '0', '1', '0', '1', '0', '1', '0', '1', '1', '1', '1', '1', '1', '1', '1', '1', '0', '1', '1', '1', '1', '0', '1', '1', '1', '1', '0', '1', '0', '

Now let's convert our data to proper types so we aren't working with strings.

In [58]:
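for key in list(filtered.keys()):
    if key == "tsunami":
        filtered[key] = [v == '1' for v in filtered[key]] # '1' means a tsunami was recorded, so store it as a boolean
    elif key[0].isupper(): # Of the columns we kept, only Year and Month are capitalized, and both hold whole numbers
        filtered[key] = [int(v) for v in filtered[key]]
    else: # Everything else we kept (magnitude, latitude, longitude) is a decimal number
        filtered[key] = [float(v) for v in filtered[key]]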

print_dataset(filtered)

tsunami
[True, False, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, False, True, True, True, True, True, True, True, True, False, True, True, False, True, False, True, True, True, True, True, True, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, False, False, False, True, False, False, True, True, False, True, True, True, True, True, True, True, False, True, False, True, True, False, True, False, False, False, False, True, True, True, False, True, True, True, True, False, True, False, False, False, True, True, True, False, True, True, False, True, True, True, True, True, True, True, False, True, True, True, False, False, False, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, False, True, True, True, True, True, True, True, True, False, False, True, True, True, True, True, True, True,
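
Choosing the type from the capitalization of the column name works for this file, but it would break if a column were renamed. A slightly more explicit option is sketched below, assuming a hand-written mapping from column name to conversion function. The names `converters` and `convert_columns` are hypothetical, and the input is a toy dictionary rather than the real dataset.

```python
# Hypothetical mapping: column name -> function that converts one string value.
converters = {
    "tsunami": lambda v: v == "1",
    "magnitude": float,
    "latitude": float,
    "longitude": float,
    "Month": int,
    "Year": int,
}

def convert_columns(dset, converters):
    """Return a new dict with every listed column converted; others are copied as-is."""
    return {
        key: [converters[key](v) for v in values] if key in converters else list(values)
        for key, values in dset.items()
    }

# Toy input shaped like our dataset dictionary.
raw = {"tsunami": ["1", "0"], "magnitude": ["7", "6.9"], "Year": ["2022", "2021"]}
typed = convert_columns(raw, converters)
print(typed)
```

Because this version returns a new dictionary, the original strings stay available if a conversion needs to be redone.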

## Filtering Subsets

We now have our data as lists of the individual types that we care about and can understand. Next, we can build a filter function that lets us specify what we're filtering on, so we can work with shorter datasets to find correlations.

In [59]:
from enum import Enum

class Compare(Enum):
    EQUAL = 0
    GREATER = 1
    GREATER_EQUAL = 2
    LESS = 3
    LESS_EQUAL = 4
    NOT_EQUAL = 5

def filter(dset: dict, key, value, comparison: Compare) -> dict:
    indices = []
    ans = {}
    col = dset[key]
    for i in range(len(col)):
        match comparison:
            case Compare.EQUAL:
                if col[i] == value:
                    indices.append(i)
            case Compare.NOT_EQUAL:
                if col[i] != value:
                    indices.append(i)
            case Compare.GREATER:
                if col[i] > value:
                    indices.append(i)
            case Compare.GREATER_EQUAL:
                if col[i] >= value:
                    indices.append(i)
            case Compare.LESS:
                if col[i] < value:
                    indices.append(i)
            case Compare.LESS_EQUAL:
                if col[i] <= value:
                    indices.append(i)
            case _:
                continue
    for k, v in dset.items():
        ans[k] = [v[i] for i in indices]
    return ans

high_mag = filter(filtered, "magnitude", 7.5, Compare.GREATER_EQUAL)

print_dataset(high_mag)

tsunami
[True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, True, False, True, True, True, False, True, True, True, True, True, True, True, False, True, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False]

magnitude
[7.6, 7.6, 7.5, 8.1, 7.5, 8.2, 8.2, 8.1, 7.7, 7.6, 7.8, 7.5, 7.7, 8.0, 7.6, 7.5, 7.5, 7.5, 7.9, 8.2, 7.5, 7.9, 7.5, 8.2, 7.7, 7.9, 7.6, 7.9, 7.8, 7.8, 7.7, 7.8, 7.8, 7.6, 7.6, 7.5, 8.3, 7.8, 7.5, 7.8, 7.5, 7.9, 7.5, 7.6, 7.7, 8.2, 7.7, 7.7

Great! We can now filter down on a key for a certain value and comparison.
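
As one possible variation, the comparison can be looked up in a dictionary of functions from the standard `operator` module instead of being matched case by case. This is only a sketch: the names `COMPARATORS` and `filter_rows` are hypothetical, and the data is a toy dictionary.

```python
import operator

# Hypothetical lookup table: comparison symbol -> function from the operator module.
COMPARATORS = {
    "==": operator.eq, "!=": operator.ne,
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
}

def filter_rows(dset, key, value, comparison):
    """Keep only the rows where dset[key] compared against value is true."""
    compare = COMPARATORS[comparison]
    indices = [i for i, item in enumerate(dset[key]) if compare(item, value)]
    return {k: [v[i] for i in indices] for k, v in dset.items()}

# Toy data shaped like our filtered dictionary.
toy = {"magnitude": [7.6, 6.9, 8.1], "tsunami": [True, False, True]}
strong = filter_rows(toy, "magnitude", 7.5, ">=")
print(strong)
```

Naming the function `filter_rows` also avoids hiding Python's built-in `filter`, which a function named `filter` would do for the rest of the notebook.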

Note: There are multiple ways to accomplish this. I will include a file that
encapsulates all of the work from this example into a class, so you only have
to import the file into your project and use the class to get all of these
features in a usable format. It is not a complete or especially intricate
implementation, but it is a first foundation for setting up this kind of object
structure.

In later lessons, we will also look more closely at how to use widely popular
modules that can accomplish a lot of this. Those are the preferred tools, but
for understanding how they work, and for designing in niche cases without
boilerplate, this can be a fast and modular tool to build from.

## Sorting Subsets

All of our data is now filtered down to a smaller subset in our variable `high_mag`, but it's not the easiest to work with. Let's start sorting this data.

You can sort at any time, but sorting after filtering will generally be faster if time or performance is a concern. This touches on the idea of the time complexity of algorithms; a good resource for the basics is https://medium.com/@DevChy/introduction-to-big-o-notation-time-and-space-complexity-f747ea5bca58.

TL;DR: less data means a smaller n, and a smaller n means a faster sort. Python's built-in `sorted` needs on the order of n log n comparisons in the worst case, so shrinking n first pays off.
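
The small sketch below illustrates the point with toy values that are not from the dataset: filtering first and sorting first give the same result, but filtering first hands fewer values to `sorted`.

```python
magnitudes = [7.6, 6.5, 8.1, 6.9, 7.5, 6.6]  # toy values, not from the dataset

# Option A: filter first, so sorted() only sees the values we keep.
kept = [m for m in magnitudes if m >= 7.5]
filter_then_sort = sorted(kept)

# Option B: sort everything, then throw most of it away.
sort_then_filter = [m for m in sorted(magnitudes) if m >= 7.5]

print(len(kept), "values sorted instead of", len(magnitudes))
print(filter_then_sort == sort_then_filter)
```

On six values the difference is negligible; it only starts to matter as the dataset grows.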

In [60]:
def sort_val(dset:dict, key, descending=False) -> dict:
    ans = {}
    indexed_list = list(enumerate(dset[key]))
    sorted_ilist = sorted(indexed_list, key=lambda item: item[1], reverse=descending)
    indices = [idx for idx, _ in sorted_ilist]
    for k, v in dset.items():
        ans[k] = [v[i] for i in indices]
    return ans

sorted_mag = sort_val(high_mag, "magnitude")

print_dataset(sorted_mag)

tsunami
[True, True, True, True, True, False, True, True, False, True, True, True, False, False, False, False, False, False, False, False, False, False, True, True, True, True, True, False, True, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, True, True, True, True, True, False, False, False, False, False, False, False, False, False, False, True, True, True, True, False, True, False, False, False, False, False, False, False, False, False, True, True, True, True, True, False, False, False, False, True, True, False, False, False, False, True, False, False, False, False, False, True, True, True, True, True, False, True, True, False, False, False, False, False, False, False, False]

magnitude
[7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.7, 7.7, 7.7, 7.7

Interestingly, just by sorting this, we can already see that there is no obvious
relationship in this subset between the magnitude of an earthquake and whether a
tsunami was recorded: almost all of the largest recorded earthquakes did not
lead to tsunamis!
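
Eyeballing two long lists is easy to get wrong, so one way to check an impression like this is to compute the share of tsunami rows for each magnitude value. The sketch below assumes a dictionary shaped like `sorted_mag`; the function name `tsunami_rate_by_magnitude` is hypothetical, and the toy values are illustrative only, not results from the dataset.

```python
from collections import defaultdict

def tsunami_rate_by_magnitude(dset):
    """Return {magnitude: fraction of rows at that magnitude flagged as tsunami}."""
    counts = defaultdict(lambda: [0, 0])  # magnitude -> [tsunami rows, total rows]
    for mag, flag in zip(dset["magnitude"], dset["tsunami"]):
        counts[mag][0] += int(flag)       # True counts as 1, False as 0
        counts[mag][1] += 1
    return {mag: hits / total for mag, (hits, total) in sorted(counts.items())}

# Toy data, not taken from the earthquake file.
toy = {"magnitude": [7.5, 7.5, 7.6, 8.1], "tsunami": [True, False, True, False]}
rates = tsunami_rate_by_magnitude(toy)
print(rates)
```

Running the same function on `sorted_mag` would give one number per magnitude, which is easier to compare than the raw lists.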

Next: [4-Visualizing Data](../html/4-VisualizingData.html)