
# Data Structures in Python
Data structures are the basic building blocks of a program. Each data structure organizes data in a particular way so that it can be accessed efficiently for a given use case. Python comes with an extensive set of data structures in its standard library.

This notebook gives a brief introduction to the available data structures.

One has to distinguish between native data structures and data structures that are defined in packages (`numpy`, `pandas`).



*   List
*   Array
*   Tuple
*   Sets
*   Dictionary
*   Objects
*   DataFrame

Reference: https://realpython.com/



## Lists
The list is part of the core Python language. Despite the name, Python lists are implemented as **dynamic arrays** in the background. This means that the list allows adding or deleting elements, and the list will automatically adjust the required storage for these elements by allocating or releasing memory. 

Python lists can contain any element. Remember: everything is an object in Python, including functions. Therefore, you can mix and match different kinds of data types and store them all in one list. 

This is a powerful feature, but it comes at a cost: because a list must support multiple data types at once, its data is usually not tightly packed. As a result, the entire structure takes up more space.

In [3]:
# setting up a list (combining different data types)
arr = ["one", "two", True , 123.9]
print(arr[0])

# Lists are mutable:
arr[1] = "hello"
print(arr)

# Elements can be removed
del arr[1]
print(arr)

# Lists can hold arbitrary data types, such as functions
def myfunc(x):
    return x * 2

arr.append(myfunc)
print(arr[-1](4))

one
['one', 'hello', True, 123.9]
['one', True, 123.9]
8
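The dynamic-array behaviour described at the start of this section can be observed with `sys.getsizeof`. The following is a minimal sketch, assuming CPython: the exact byte counts are implementation details and differ between versions, and the names `items` and `sizes` are only illustrative.

```python
import sys

# Record the list's own memory footprint after every append.
items = []
sizes = []
for i in range(32):
    items.append(i)
    sizes.append(sys.getsizeof(items))

# The footprint stays flat for several appends and then jumps:
# when the list runs out of room it reallocates with spare capacity.
distinct_sizes = sorted(set(sizes))
print(len(items), "appends produced", len(distinct_sizes), "distinct sizes")
print(distinct_sizes)
```

Because the list reserves spare capacity on each reallocation, most appends do not need to allocate any memory at all.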


#### Slicing Lists
Once you’ve defined a list, you can get any part of it as a new list. This is called slicing the list.

In [14]:
a_list = ['a', 'b', 'KUL', 'z', 'leuven']
print(a_list[1:3])
print(a_list[1:-1])
print(a_list[0:3])
print(a_list[:3])
print(a_list[3:])
print(a_list[:])

['b', 'KUL']
['b', 'KUL', 'z']
['a', 'b', 'KUL']
['a', 'b', 'KUL']
['z', 'leuven']
['a', 'b', 'KUL', 'z', 'leuven']
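That a slice is a new list, and not a view on the original, can be checked with a small sketch like the one below. The variable name `copy_of_list` is only illustrative.

```python
a_list = ['a', 'b', 'KUL', 'z', 'leuven']

# A full slice produces a new list object with the same elements.
copy_of_list = a_list[:]
print(copy_of_list == a_list)   # True  (same contents)
print(copy_of_list is a_list)   # False (different object)

# Changing the slice leaves the original list untouched.
copy_of_list[0] = 'changed'
print(a_list[0])                # a
print(copy_of_list[0])          # changed
```

Note that the copy is shallow: the new list holds references to the same element objects as the original.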


## Tuples
Just like lists, tuples are part of the core Python language. However, unlike lists, Python's tuple objects are **immutable**. This means that elements cannot be added or removed dynamically: all elements in a tuple must be defined at creation time.

Tuples are another data structure that can hold elements of any data type. This flexibility is powerful, but again, it also means that the data is not as tightly packed as in typed arrays.

In [None]:
arr = ("one", "two", "three")
print(arr[0])

# Tuples are immutable: item assignment raises a TypeError
try:
    arr[1] = "hello"
except TypeError as e:
    print(e)



## Arrays
Python's `array` module provides space-efficient storage for basic C-style data types such as bytes, 32-bit integers, and floating-point numbers.

Arrays created using the **array.array** class are mutable and, except for one important difference, behave like lists: they are typed arrays restricted to a **single data type**.

Due to this constraint, array.array objects with many elements are more space efficient than lists and tuples. The elements stored in it are tightly packed, which is useful if you need to store many elements of the same type.

In addition, arrays support many of the same methods as regular lists, so in many cases you can use them as drop-in replacements with little or no change to the application code.

In [None]:
# defining an array with type code "f" (C float)
import array
arr = array.array("f", (1.0, 1.5, 2.0, 2.5))
arr.append(139.0)
print(arr)
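The space saving and the single-type restriction described above can be checked roughly as follows. This is a sketch assuming CPython, where `sys.getsizeof` reports only a container's own footprint, so the float objects referenced by the list are added separately. The type code `"d"` (C double) and the element count are arbitrary choices for illustration.

```python
import array
import sys

values = [float(i) for i in range(1000)]

as_list = values
as_array = array.array("d", values)  # "d" = C double

# A list stores references; every float is a separate Python object.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(v) for v in as_list)
# An array stores the raw C values directly in one buffer.
array_bytes = sys.getsizeof(as_array)

print("list  :", list_bytes, "bytes")
print("array :", array_bytes, "bytes")
print("bytes per array element:", as_array.itemsize)

# The single-type restriction is enforced when elements are added.
try:
    as_array.append("not a number")
except TypeError as error:
    print("TypeError:", error)
```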

## Sets
A set is an unordered “bag” of unique values. A single set can contain values of any hashable data type, which in practice means immutable values such as numbers, strings, and tuples. Once you have two sets, you can do standard set operations like union, intersection, and set difference.

In [8]:
a_set = {1, 2}
print(a_set)

# Creating a set from a list
a_list = ['a', 'b', 'Leuven', True, False, 55]
a_set = set(a_list)
print(a_set)

{1, 2}
{False, 'Leuven', True, 'a', 'b', 55}


There are two different ways to add values to an existing set: the `add()` method, and the `update()` method.

In [9]:
a_set = {1, 2}
a_set.add(4)
a_set.update([10, 20, 30])   
print(a_set)

{1, 2, 4, 10, 20, 30}


There are three ways to remove individual values from a set. The first two, `discard()` and `remove()`, have one subtle difference.
-   The `discard()` method takes a single value as an argument and removes that value from the set. If the value doesn’t exist in the set, it does nothing: no error, just a no-op.
-   The `remove()` method also takes a single value as an argument and removes it from the set. The difference: if the value doesn’t exist in the set, `remove()` raises a `KeyError` exception.
-   The `pop()` method takes no argument; it removes and returns an arbitrary value from the set, and raises a `KeyError` if the set is empty.

In [11]:
a_set = {1, 3, 6, 10, 15, 21, 28, 36, 45}
a_set.discard(10)
print(a_set)
a_set.discard(10)                  
print(a_set)
a_set.remove(21)
print(a_set)

{1, 3, 36, 6, 45, 15, 21, 28}
{1, 3, 36, 6, 45, 15, 21, 28}
{1, 3, 36, 6, 45, 15, 28}
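The cell above never triggers the error case. The sketch below shows what happens when `remove()` is called with a missing value, and also demonstrates `pop()`, which removes and returns an arbitrary element. The values used are arbitrary.

```python
a_set = {1, 3, 6}

# discard() on a missing value is silently ignored.
a_set.discard(99)

# remove() on a missing value raises KeyError.
try:
    a_set.remove(99)
except KeyError as error:
    print("KeyError:", error)

# pop() removes and returns an arbitrary element of the set.
removed = a_set.pop()
print("popped:", removed, "remaining:", a_set)
```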


Methods for combining two sets:
- The `union()` method returns a new set containing all the elements that are in either set.
- The `intersection()` method returns a new set containing all the elements that are in both sets.
- `a_set.difference(b_set)` returns a new set containing all the elements that are in `a_set` but not in `b_set`.
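A short sketch of the three methods listed above, using two small example sets chosen for illustration. `sorted()` is applied only so that the printed order is predictable.

```python
a_set = {1, 2, 3, 4}
b_set = {3, 4, 5, 6}

union_result = a_set.union(b_set)
intersection_result = a_set.intersection(b_set)
difference_result = a_set.difference(b_set)

print(sorted(union_result))         # [1, 2, 3, 4, 5, 6]
print(sorted(intersection_result))  # [3, 4]
print(sorted(difference_result))    # [1, 2]

# The operators |, & and - are shorthand for the same operations.
print(union_result == a_set | b_set)         # True
print(intersection_result == a_set & b_set)  # True
print(difference_result == a_set - b_set)    # True
```

All three methods return new sets; `a_set` and `b_set` themselves are left unchanged.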


## Dictionary
In Python, dictionaries are a central data structure. Dicts store an arbitrary number of objects, each identified by a unique dictionary key.

Dictionaries are also often called maps, hashmaps, lookup tables, or associative arrays. They allow for the efficient lookup, insertion, and deletion of any object associated with a given key.

Phone books make a decent real-world analog for dictionary objects. They allow you to quickly retrieve the information (phone number) associated with a given key (a person’s name). Instead of having to read a phone book front to back to find someone’s number, you can jump more or less directly to a name and look up the associated information.

In [None]:
phonebook = {
    "bob": 7387,
    "alice": 3719,
    "jack": 7052,
}


print(phonebook["alice"])
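Besides lookup, the insertion and deletion mentioned above can be sketched as follows. The extra entry `"eve"` and the new numbers are made up for illustration.

```python
phonebook = {"bob": 7387, "alice": 3719, "jack": 7052}

# Insertion and update use the same assignment syntax.
phonebook["eve"] = 1234   # new key (hypothetical entry)
phonebook["bob"] = 1111   # existing key: the value is overwritten

# Deletion by key.
del phonebook["jack"]

# Indexing a missing key raises KeyError; get() returns a default instead.
print(phonebook.get("jack"))         # None
print(phonebook.get("jack", "n/a"))  # n/a

# Membership tests look at the keys.
print("alice" in phonebook)          # True

# Iterate over key/value pairs.
for name, number in phonebook.items():
    print(name, number)
```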

## Objects (Classes)
Classes allow you to define reusable blueprints for data objects to ensure each object provides the same set of fields.

Using regular Python classes as record data types is feasible, but it also takes manual work to get the convenience features of other implementations. For example, adding new fields to the `__init__`  constructor is verbose and takes time.

Also, the default string representation for objects instantiated from custom classes isn’t very helpful. To fix that, you may have to add your own `__repr__` method, which again is usually quite verbose and must be updated each time you add a new field.

Fields stored on classes are mutable, and new fields can be added freely, which you may or may not like. It’s possible to provide more access control and to create read-only fields using the `@property` decorator, but once again, this requires writing more glue code.

Writing a custom class is a great option whenever you’d like to add business logic and behavior to your record objects using methods. However, this means that these objects are technically no longer plain data objects:

In [None]:
class Car:
    def __init__(self, color, mileage, automatic):
        self.color = color
        self.mileage = mileage
        self.automatic = automatic

car1 = Car("red", 3812.4, True)
car2 = Car("blue", 40231.0, False)

# Get the mileage:
print(car2.mileage)


# Classes are mutable:
car2.mileage = 12
car2.windshield = "broken"
print(car2.windshield)
# String representation is not very useful
# (must add a manually written __repr__ method):
car1
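The `__repr__` method and the `@property` decorator mentioned above could be added to `Car` roughly as follows. This is one possible sketch, not the only way to do it; storing the value in `_mileage` is a naming convention assumed here, not something Python enforces.

```python
class Car:
    def __init__(self, color, mileage, automatic):
        self.color = color
        self._mileage = mileage   # leading underscore: "internal" by convention
        self.automatic = automatic

    @property
    def mileage(self):
        """Read-only access to the mileage (no setter is defined)."""
        return self._mileage

    def __repr__(self):
        # Must be updated by hand whenever a field is added.
        return (f"Car(color={self.color!r}, mileage={self.mileage!r}, "
                f"automatic={self.automatic!r})")


car1 = Car("red", 3812.4, True)
print(car1)   # Car(color='red', mileage=3812.4, automatic=True)

# Assigning to a property without a setter raises AttributeError.
try:
    car1.mileage = 12
except AttributeError as error:
    print("AttributeError:", error)
```

The amount of code needed for just three fields illustrates the "glue code" cost described in the text.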


# NumPy Arrays
The package [NumPy](https://numpy.org/) is the core library for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with these arrays. A NumPy array is a grid of values, **all of the same type**, which distinguishes it from a list.

The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.
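Rank and shape can be inspected through the `ndim` and `shape` attributes, as in this minimal sketch. It assumes NumPy is installed; the array contents are arbitrary.

```python
import numpy as np

vector = np.array([1, 2, 3])
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

# ndim is the rank; shape gives the size along each dimension.
print(vector.ndim, vector.shape)   # 1 (3,)
print(matrix.ndim, matrix.shape)   # 2 (2, 3)

# All elements share one type; mixed input is converted to a common type.
mixed = np.array([1, 2.5])
print(matrix.dtype)                # float64
print(mixed.dtype)                 # float64
```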

Compared to lists, NumPy arrays have advantages in:

- Size: NumPy arrays store their elements in one tightly packed block, so they take up less space.
- Performance: operations on whole arrays run in compiled code and are typically faster than equivalent Python loops over lists.
- Functionality: NumPy and SciPy provide optimized built-in functions, such as linear algebra operations.

In [25]:
import numpy as np
import time

size_of_vec = 1000

def pure_python_version():
    # Time an element-wise addition written as a plain Python loop
    t1 = time.perf_counter()
    X = range(size_of_vec)
    Y = range(size_of_vec)
    Z = [X[i] + Y[i] for i in range(len(X))]
    return time.perf_counter() - t1

def numpy_version():
    # Time the same addition as a single vectorized NumPy operation
    t1 = time.perf_counter()
    X = np.arange(size_of_vec)
    Y = np.arange(size_of_vec)
    Z = X + Y
    return time.perf_counter() - t1


t1 = pure_python_version()
t2 = numpy_version()
print('Core Python:', t1, ' Numpy:', t2)
print("Numpy is in this example " + str(t1/t2) + " times faster!")

Core Python: 0.000385284423828125  Numpy: 4.100799560546875e-05
Numpy is in this example 9.395348837209303 times faster!


# Pandas 

[Pandas](https://pandas.pydata.org/) is a Python library well suited for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

The core data structure in Pandas is the **DataFrame**.
This DataFrame object is well suited to data manipulation. It offers:
- Tools for reading and writing data between in-memory data structures and different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, fancy indexing, and subsetting of large data sets.
- Data structure column insertion and deletion.
- Group by engine allowing split-apply-combine operations on data sets.
- Data set merging and joining.
- Hierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure.
- Time-series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging.
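
A few of the features listed above can be sketched on a tiny DataFrame. The example assumes pandas is installed, and the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: sales per city and year.
df = pd.DataFrame({
    "city": ["Leuven", "Gent", "Leuven", "Gent"],
    "year": [2020, 2020, 2021, 2021],
    "sales": [10, 7, 12, 9],
})

# Column insertion.
df["sales_doubled"] = df["sales"] * 2

# Subsetting rows with a boolean condition.
leuven_rows = df.loc[df["city"] == "Leuven"]
print(leuven_rows)

# Group by: split by city, sum the sales, combine into one result.
total_per_city = df.groupby("city")["sales"].sum()
print(total_per_city)

# Pivoting: one row per year, one column per city.
pivoted = df.pivot(index="year", columns="city", values="sales")
print(pivoted)
```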
