# Structuring Data

Before you can do anything with most data, you must structure it in some manner so that you can begin to see what the data contains (and, sometimes, what it doesn't). Python provides access to a number of organizational structures for data, including stacks, queues, and dictionaries.

You need to consider the structural requirements for the data you use with your algorithms; the better you can see and understand the content through structure formatting, the easier it becomes to perform algorithm-based tasks successfully. Trying to impose form on humans rarely works and generally results in frustration that makes using the algorithm even harder, so structure imposed through data manipulation becomes even more important.

#### When working with multiple data sources, you must:
- Determine whether all datasets contain the required data
- Check all datasets for data type issues
- Ensure that all datasets place the same meaning on data elements
- Verify the data attributes

The more time you spend verifying the compatibility of data from each of the sources you want to use for a dataset, the less likely you are to encounter problems when working with an algorithm.
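
The first two checks in the list above can be sketched with pandas. This is only a minimal illustration, not a complete validation routine: the DataFrames `sales_a` and `sales_b`, their column names, and the `required` set are hypothetical stand-ins for your own sources.

```python
import pandas as pd

# Hypothetical sources: names, columns, and values are made up for illustration.
sales_a = pd.DataFrame({'id': [1, 2, 3], 'qty': [10, 20, 30]})
sales_b = pd.DataFrame({'id': [4, 5], 'qty': [40.0, 50.0]})

required = {'id', 'qty'}

# 1. Does every source contain all of the required columns?
missing = {name: required - set(frame.columns)
           for name, frame in [('sales_a', sales_a), ('sales_b', sales_b)]}

# 2. Do the shared columns use the same data type in both sources?
shared = sales_a.columns.intersection(sales_b.columns)
mismatched = [col for col in shared
              if sales_a[col].dtype != sales_b[col].dtype]

print("Missing columns:", missing)
print("Columns with conflicting types:", mismatched)
```

Checking that the sources place the same meaning on each element (units, codes, categories) usually still requires reading the documentation for each source; code alone can't confirm it.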

## Package Imports

In [1]:
import numpy as np
import pandas as pd

## Considering the Need for Remediation

After you find problems with your dataset, you need to remediate it so that the dataset works properly with the algorithms you use. For example, when working with conflicting data types, you must change the data types of each data source so that they match and then create the single data source used with the algorithm. Data duplication and missing values are two very common data problems, but remediation can be necessary for a host of other reasons (inconsistent data entry, misspellings, out-of-range values, etc.).
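
As a rough sketch of the data type example, the following code converts a text column to numbers before combining two sources. The DataFrames `first` and `second` and their contents are hypothetical; `pd.to_numeric` and `pd.concat` are one reasonable way to do this, not the only one.

```python
import pandas as pd

# Hypothetical sources whose 'qty' columns disagree on data type.
first = pd.DataFrame({'id': [1, 2], 'qty': [10, 20]})        # qty holds integers
second = pd.DataFrame({'id': [3, 4], 'qty': ['30', '40']})   # qty holds text

# Convert the text column so that both sources use the same type...
second['qty'] = pd.to_numeric(second['qty'])

# ...and only then combine them into the single source the algorithm uses.
combined = pd.concat([first, second], ignore_index=True)

print(combined.dtypes, "\n")
print(combined)
```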

Often, you become aware of a problem by running the algorithm and noting that the results are skewed in some way or that the algorithm doesn't work at all (even if it worked on a subset of the data). When in doubt, check your data for potential remediation needs.

### Dealing with Data Duplication

Duplicated data can appear for a variety of reasons, and it can unfairly weight the output of any algorithm that you're using. The pandas package makes it easy to remove duplicated data. In the example below, the `drop_duplicates` method removes the duplicate records found in rows 4 and 6, which are both copies of row 0.

In [2]:
df = pd.DataFrame({'A': [0,0,0,0,0,1,0],
                   'B': [0,2,3,5,0,2,0],
                   'C': [0,3,4,1,0,2,0]})
print(df, "\n")

df = df.drop_duplicates()
print(df)

   A  B  C
0  0  0  0
1  0  2  3
2  0  3  4
3  0  5  1
4  0  0  0
5  1  2  2
6  0  0  0 

   A  B  C
0  0  0  0
1  0  2  3
2  0  3  4
3  0  5  1
5  1  2  2
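
By default, `drop_duplicates` compares entire rows and keeps the first copy. The sketch below reuses the same data to show the `duplicated` method along with the `keep` and `subset` parameters; the variable names `mask`, `last_kept`, and `by_b` are chosen only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 0, 0, 0, 1, 0],
                   'B': [0, 2, 3, 5, 0, 2, 0],
                   'C': [0, 3, 4, 1, 0, 2, 0]})

# Boolean mask: True for every row that repeats an earlier row.
mask = df.duplicated()
print(df[mask], "\n")

# keep='last' retains the final copy of each duplicated row instead of the first.
last_kept = df.drop_duplicates(keep='last')
print(last_kept, "\n")

# subset limits the comparison to the listed columns, here only column 'B'.
by_b = df.drop_duplicates(subset=['B'])
print(by_b)
```

Inspecting the mask before dropping anything is a cheap way to confirm that the rows you are about to remove really are unwanted copies.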


### Dealing with Missing Values

Missing values can also skew the results of an algorithm's output. In fact, they can cause some algorithms to react oddly or even raise an error. The point is that missing values cause problems with your data, so you need to address them.

Options for addressing this issue include:
- Simply setting missing values to a standard value, such as 0 for ints
- Using the mean of all the values instead of some standard value

The `fillna` method enables you to get rid of missing values, whether they're not a number (`NaN`) or simply missing (`None`). You can supply the replacement data in a number of forms; this example relies on a series that contains the mean for each separate column of data.

#### Mean Approach

In [10]:
df = pd.DataFrame({'A': [0,0,1,None],
                   'B': [1,2,3,4],
                   'C': [np.nan,3,4,1]}, dtype=int)
print("Original DataFrame\n", df, "\n")

values = pd.Series(df.mean(), dtype=int) # ensures values is the same datatype as the original DataFrame
print("Mean of Each Column\n", values, "\n")

df = df.fillna(values)
print("Mean-Replacement\n", df)

Original DataFrame
       A  B    C
0     0  1  NaN
1     0  2    3
2     1  3    4
3  None  4    1 

Mean of Each Column
 A    0
B    2
C    2
dtype: int64 

Mean-Replacement
    A  B  C
0  0  1  2
1  0  2  3
2  1  3  4
3  0  4  1
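
Missing values are normally stored as floating-point `NaN`, and how an integer `dtype` interacts with them can vary between pandas versions. The sketch below therefore leaves the columns as floats and shows both options from the list above: a standard value and the column mean. The names `zero_filled` and `mean_filled` are used only for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 1, None],
                   'B': [1, 2, 3, 4],
                   'C': [np.nan, 3, 4, 1]})

# Option 1: replace every missing value with one standard value.
zero_filled = df.fillna(0)

# Option 2: replace each missing value with the mean of its own column.
# df.mean() skips missing values by default.
mean_filled = df.fillna(df.mean())

print(zero_filled, "\n")
print(mean_filled)
```

Unlike the integer version above, the means here are not truncated, so column `A` is filled with one third rather than 0.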


## Stacking and Piling Data in Order

Python provides a number of storage methodologies, and both NumPy and pandas provide storage alternatives that you might consider when working through various data structures. A common problem of data storage isn't just the fact that you need to store the data, but that you must store it in a particular order so that you can access it when necessary. The following sections describe the standard Python methods for ensuring orderly data storage that let you have a specific processing arrangement.

### Ordering in Stacks

A *stack* provides last in/first out (LIFO) data storage. Recursive function calls, for example, are placed on a stack: each new call goes on top of the stack until the base case is reached, and then the calls are *popped* off the stack as they resolve back down to the original function call. Note that NumPy's `stack` function and the pandas `DataFrame.stack` method join arrays and reshape DataFrames, respectively; neither one is a LIFO data structure.

To demonstrate the functionality of a stack, an example implementation is given below. The example ensures that the stack maintains the integrity of the data and works with it in the order you expect.

**Important Tip:** The example below is implemented using a Python list. A list stores *memory pointers* (numbers that provide the memory addresses of the data) rather than the data itself. Accessing an item by index and adding or removing an item at the end of the list are fast, which is why a list works well for a stack. From an algorithm perspective, however, lists often don't perform well in other situations: searching for a value scans the items one at a time, inserting or removing anywhere but the end shifts every later item, and when the data is scattered across your computer's memory, each item must be fetched from its own location individually, which slows access even more.

In [32]:
class Stack:
    stack = None
    stackSize = 10
    
    def __init__(self, size):
        self.stack = []
        self.stackSize = size
    
    def __str__(self):
        result = "Current Stack:\n"
        
        if len(self.stack) > 0:
            for i, item in enumerate(self.stack[::-1]):
                result += (str(i) + ": " + str(item) + "\n")
        else:
            return "Stack is empty!"
        
        return result        
        
    def push(self, value):
        if len(self.stack) < self.stackSize:
            self.stack.append(value)
        else:
            print("Stack is full!")
            
    def pop(self):
        if len(self.stack) > 0:
            print("Popping: ", self.stack.pop())
        else:
            print("Stack is empty!")
            
myStack = Stack(10)
print(myStack, "\n")

print("Push values onto the stack")
myStack.push(2)
myStack.push(6)
myStack.push(4)
myStack.push(10)
myStack.push(5)
print(myStack)

myStack.pop()
print(myStack)

myStack.pop()
myStack.pop()
print(myStack)

myStack.pop()
myStack.pop()
print(myStack)

Stack is empty! 

Push values onto the stack
Current Stack:
0: 5
1: 10
2: 4
3: 6
4: 2

Popping:  5
Current Stack:
0: 10
1: 4
2: 6
3: 2

Popping:  10
Popping:  4
Current Stack:
0: 6
1: 2

Popping:  6
Popping:  2
Stack is empty!
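
When you don't need the size limit or the printed messages, `collections.deque` from the standard library can play the same role with less code. This is a minimal sketch of LIFO ordering, not a replacement for the class above; the names `stack` and `popped` are used only here.

```python
from collections import deque

stack = deque()

# Push: append adds to the right-hand end, the "top" of the stack.
for value in [2, 6, 4, 10, 5]:
    stack.append(value)

# Pop: pop removes from the same end, so the last value in is the first out.
popped = [stack.pop() for _ in range(3)]

print("Popped:", popped)
print("Remaining:", list(stack))
```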


### Ordering in Queues
Unlike stacks, *queues* are first in/first out (FIFO) data structures. Python's standard library provides a queue implementation in the `queue` module, which you access using `import queue`. Rather than implementing our own queue, we'll use `queue.Queue` to demonstrate how queues work.

In [41]:
import queue

myQ = queue.Queue(3)

print("Queue Empty: ", myQ.empty())

print("\nPutting values in the queue: 2, 10, 5")
myQ.put(2)
myQ.put(10)
myQ.put(5)
print("Queue Full: ", myQ.full())

print("\nPopping: ", myQ.get())
print("Queue Full: ", myQ.full())

print("\nPopping: ", myQ.get())
print("Popping: ", myQ.get())
print("Queue Empty: ", myQ.empty())

Queue Empty:  True

Putting values in the queue: 2, 10, 5
Queue Full:  True

Popping:  2
Queue Full:  False

Popping:  10
Popping:  5
Queue Empty:  True
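
One detail the example above avoids: calling `put` on a full `queue.Queue`, or `get` on an empty one, blocks until another thread makes room or adds an item. In single-threaded code, the non-blocking `put_nowait` and `get_nowait` methods raise `queue.Full` and `queue.Empty` instead. The sketch below assumes a queue of size 2, and the flag names are chosen for illustration.

```python
import queue

myQ = queue.Queue(maxsize=2)
myQ.put(2)
myQ.put(10)

# The queue is full, so put_nowait() raises queue.Full rather than waiting.
try:
    myQ.put_nowait(5)
    overflowed = False
except queue.Full:
    overflowed = True

# Items leave in arrival order (FIFO).
drained = [myQ.get(), myQ.get()]

# The queue is now empty, so get_nowait() raises queue.Empty.
try:
    myQ.get_nowait()
    underflowed = False
except queue.Empty:
    underflowed = True

print("Overflowed:", overflowed)
print("Drained:", drained)
print("Underflowed:", underflowed)
```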


### Finding Data Using Dictionaries
Creating a `dictionary` is much like working with a `list`, except that you must now define a key and value pair. The great advantage of this data structure is that dictionaries can quickly provide access to specific data items using the key.

**Key Limitations:**
- The key must be unique - if a duplicate key is used, the value in the second entry overwrites the value in the original entry
- The key must be immutable (more precisely, *hashable*), such as a string, a number, or a tuple of such values
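
Both limitations are easy to see in a short experiment. The dictionaries and values below are made up for illustration.

```python
# A tuple is immutable and hashable, so it can serve as a key.
locations = {(40.7, -74.0): 'New York'}
locations[(51.5, -0.1)] = 'London'

# A list is mutable and unhashable, so Python rejects it as a key.
try:
    locations[[48.9, 2.4]] = 'Paris'
    list_key_accepted = True
except TypeError as err:
    list_key_accepted = False
    print("Rejected:", err)

# Repeating a key keeps only the last value assigned to it.
colors = {'Sarah': 'Yellow', 'Sarah': 'Purple'}
print(colors)
```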

Python dictionaries are the software implementation of a data structure called a *hash table*, which uses a hash of each key to locate the matching value in an underlying array. Dictionaries are a bit like individual tables within a database. You can update, add, and delete records in a dictionary. The `update` method can overwrite existing entries or add new entries to the dictionary. The following example helps demonstrate how dictionaries work.

In [57]:
colors = {'Sam':'Blue',
          'Amy':'Red',
          'Sarah':'Yellow'}

print('Sarah\'s favorite color is:', colors['Sarah'])
print('The keys are:', colors.keys())

print("\nDemonstrating key duplication - colors['Sarah'] = 'Purple':")
colors['Sarah'] = 'Purple'
print('Sarah\'s favorite color is:', colors['Sarah'])


print("\nDemonstrating value updates - colors.update({'Sarah':'Black'}):")
colors.update({'Sarah':'Black'})
print('Sarah\'s favorite color is:', colors['Sarah'])

print("\nDemonstrating adding new values - colors.update({'Mark':'Orange'}):")
colors.update({'Mark':'Orange'})
print('Mark\'s favorite color is:', colors['Mark'])

print('\nValues can be deleted as well - del colors[\'Sarah\']:')
del colors['Sarah']
print(colors)

Sarah's favorite color is: Yellow
The keys are: dict_keys(['Sam', 'Amy', 'Sarah'])

Demonstrating key duplication - colors['Sarah'] = 'Purple':
Sarah's favorite color is: Purple

Demonstrating value updates - colors.update({'Sarah':'Black'}):
Sarah's favorite color is: Black

Demonstrating adding new values - colors.update({'Mark':'Orange'}):
Mark's favorite color is: Orange

Values can be deleted as well - del colors['Sarah']:
{'Sam': 'Blue', 'Amy': 'Red', 'Mark': 'Orange'}
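
After the deletion, `colors['Sarah']` would raise a `KeyError`. A brief sketch of two common ways to look up a key that might be missing, using the dictionary as it stands at the end of the example above:

```python
colors = {'Sam': 'Blue', 'Amy': 'Red', 'Mark': 'Orange'}

# Test for the key before indexing...
if 'Sarah' not in colors:
    print("No entry for Sarah")

# ...or use get(), which returns a default instead of raising KeyError.
favorite = colors.get('Sarah', 'unknown')
print("Sarah's favorite color is:", favorite)

# items() walks the key/value pairs in insertion order.
for name, color in colors.items():
    print(name, '->', color)
```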


## Working with Trees
Using trees helps you organize data quickly and find it in a shorter time than many other data storage techniques allow, such as scanning an unordered list. You commonly find trees used for search and sort routines, but they have many other purposes as well.

### Tree Definitions
- *node* - each item (or data value) which makes up the tree
- *root node* - provides the starting point for the various kinds of processing you perform
- *leaf node* - a node with no children; an end point for the tree
- *links* - how nodes are connected to one another
- *trees* - the combination of nodes and links which forms the data structure
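
These terms map onto code fairly directly. The sketch below is one possible representation, assuming each node keeps a list of links to its children; the `Node` class and the `leaves` helper are hypothetical names invented for this illustration.

```python
class Node:
    """One item in the tree, plus the links to its children."""

    def __init__(self, value):
        self.value = value
        self.children = []      # links to child nodes

    def add(self, child):
        self.children.append(child)
        return child

    def is_leaf(self):
        return not self.children


def leaves(node):
    """Collect the values of every leaf node reachable from node."""
    if node.is_leaf():
        return [node.value]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


root = Node('root')             # the starting point for processing
left = root.add(Node('left'))
right = root.add(Node('right'))
left.add(Node('left-a'))
left.add(Node('left-b'))

print(leaves(root))
```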

### Kinds of Trees
**Balanced Trees:** A kind of tree that maintains a balanced structure through reorganization so that it can provide reduced access times. The left and right sides of each node are kept at nearly the same height. In an AVL tree (a balanced binary search tree), for example, the heights of a node's two subtrees differ by at most one.

**Unbalanced Trees:** A tree that places new data items wherever necessary in the tree without regard to balance. This method of adding items makes building the tree faster but reduces access speed when searching or sorting.
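
To make the cost of imbalance concrete, the sketch below builds a plain binary search tree with no rebalancing and measures its height for two insertion orders. `BSTNode`, `insert`, `height`, and `build` are hypothetical helpers written for this illustration.

```python
class BSTNode:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None


def insert(node, value):
    """Place value wherever the ordering sends it, with no rebalancing."""
    if node is None:
        return BSTNode(value)
    if value < node.value:
        node.left = insert(node.left, value)
    else:
        node.right = insert(node.right, value)
    return node


def height(node):
    """Number of nodes on the longest path from node down to a leaf."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))


def build(values):
    root = None
    for value in values:
        root = insert(root, value)
    return root


sorted_input = build([1, 2, 3, 4, 5, 6, 7])   # every item goes to the right
mixed_input = build([4, 2, 6, 1, 3, 5, 7])    # the same items in another order

print("Height with sorted input:", height(sorted_input))
print("Height with mixed input:", height(mixed_input))
```

With sorted input, the tree degenerates into a chain as long as the data, so a search may have to visit every node. A balanced tree avoids this by reorganizing itself during insertion.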

**Heaps:** A tree that keeps its data partially ordered as you insert items: every parent node compares the same way with its children, so the largest (or smallest) value always sits at the root. This arrangement makes sorting, and repeated retrieval of the extreme value, faster. You can further classify these trees as max heaps and min heaps, depending on whether the tree immediately provides the maximum or the minimum value present in the tree.
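
Python's standard `heapq` module maintains a min heap inside an ordinary list. The sketch below shows it with made-up values; storing negated values is one common workaround for max-heap behavior, used here as an assumption rather than the only option.

```python
import heapq

values = [7, 2, 9, 4, 1]

# Min heap: the smallest value is always at index 0 of the list.
min_heap = []
for value in values:
    heapq.heappush(min_heap, value)

smallest = heapq.heappop(min_heap)

# Max-heap behavior via negation: the most negative entry is the largest value.
max_heap = [-value for value in values]
heapq.heapify(max_heap)
largest = -heapq.heappop(max_heap)

# Popping repeatedly yields the remaining values in ascending order.
in_order = [smallest] + [heapq.heappop(min_heap) for _ in range(len(min_heap))]

print("Smallest:", smallest)
print("Largest:", largest)
print("Sorted:", in_order)
```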

## Representing Relations in Graphs
*Graphs* are another form of common data structure - they are a sort of tree extension; as with trees, you have nodes that connect to each other to create relationships, but a graph node can have more than one or two connections. 

Most developers use dictionaries to build graphs. Using a dictionary makes building the graph easy because each key is a node name and each value lists the connections for that node. The example below implements a graph and a function that finds a path between two nodes.

In [55]:
graph = {'A': ['B', 'F'],
         'B': ['A', 'C'],
         'C': ['B', 'D'],
         'D': ['C', 'E'],
         'E': ['D', 'F'],
         'F': ['E', 'A']}

def find_path(graph, start, end, path=None):
    # Build a new list on every call so that no caller's list is modified
    path = (path or []) + [start]
    
    if start == end:
        print('Ending')
        return path
    
    for node in graph[start]:
        print("Checking Node ", node)
        
        if node not in path:
            print("Path so far ", path)
            newPath = find_path(graph, node, end, path)
            if newPath:
                return newPath

find_path(graph, 'B', 'E')

Checking Node  A
Path so far  ['B']
Checking Node  B
Checking Node  F
Path so far  ['B', 'A']
Checking Node  E
Path so far  ['B', 'A', 'F']
Ending


['B', 'A', 'F', 'E']
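
The `find_path` function searches depth first, so it returns the first path it happens to find, which is not necessarily the shortest. From `'B'` to `'D'`, for example, it would walk the long way around through `'A'`, `'F'`, and `'E'`. A breadth-first search, sketched below with a FIFO queue, explores paths in order of increasing length. The `shortest_path` function is a hypothetical addition for illustration.

```python
from collections import deque

graph = {'A': ['B', 'F'],
         'B': ['A', 'C'],
         'C': ['B', 'D'],
         'D': ['C', 'E'],
         'E': ['D', 'F'],
         'F': ['E', 'A']}


def shortest_path(graph, start, end):
    """Breadth-first search: explore paths in order of increasing length."""
    pending = deque([[start]])  # FIFO queue of partial paths
    visited = {start}

    while pending:
        path = pending.popleft()
        node = path[-1]
        if node == end:
            return path
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                pending.append(path + [neighbor])

    return None                 # no route exists


print(shortest_path(graph, 'B', 'D'))
```

The example ties the earlier sections together: the graph is a dictionary, and the queue's first in/first out ordering is what guarantees that shorter paths are examined before longer ones.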