# ASSIGNMENT 1

### Task Breakdown:

1. **loadGraph(edgeFilename)**: Read the edge data from a file and construct an adjacency list representation of the graph.
   
2. **MyQueue class**: Implement a queue data structure with `enqueue`, `dequeue`, and `empty` methods.

3. **BFS(G, s)**: Implement Breadth-First Search algorithm to compute distances from a source vertex s to all other vertices in the graph G.

4. **distanceDistribution(G)**: Compute the distribution of distances between nodes in the graph G.

5. **Testing code**: Write code at the end of `swp.py` to load data, run BFS, compute distance distribution, and print the results.

Let's start with the first step: implementing the `loadGraph` function to read the edge data from a file and construct an adjacency list representation. Could you provide a sample of the edge data format from your file `edges.txt`? This will help in constructing the `loadGraph` function correctly.

In [1]:
def loadGraph(filename: str):
    """
    Reads in edge data and returns an adjacency 
    list that corresponds to the undirected graph 
    of social connections.
    Args:
        filename (str) : name of Facebook edge file
    Returns:
        dict : adjacency list representation
    """
    adjacency_list = {}
    
    with open(filename, 'r') as file:
        for line in file:
            vertex1, vertex2 = map(int, line.split())
            
            # initialize empty list for vertices if not already in the graph
            if vertex1 not in adjacency_list:
                adjacency_list[vertex1] = []
            if vertex2 not in adjacency_list:
                adjacency_list[vertex2] = []
            
            # the two vertices on a line share an edge;
            # record it in both directions so the graph is undirected
            adjacency_list[vertex1].append(vertex2)
            adjacency_list[vertex2].append(vertex1)
    
    return adjacency_list


### Explanation:
- **adjacency_list**: This dictionary will store the adjacency list representation of the graph. Keys are vertex IDs, and values are lists of adjacent vertices.
- **open(filename, 'r')**: Opens the file specified by `filename` in read mode.
- **for line in file**: Iterates over each line in the file.
- **map(int, line.split())**: Splits each line into two parts (vertex IDs) and converts them to integers.
- **if vertex1 not in adjacency_list**: Checks if the vertex is already in the adjacency list; if not, initializes an empty list.
- **adjacency_list[vertex1].append(vertex2)**: Adds `vertex2` to the adjacency list of `vertex1`, and vice versa to ensure the graph is undirected.

This function will create and return the adjacency list representation of the graph from the edge data in the file.
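
As a quick check of the input the loader expects, the sketch below writes a tiny edge file and builds the adjacency list from it. It assumes one whitespace-separated pair of integer vertex IDs per line, which is what the `map(int, line.split())` call implies. The sample data and the helper name `load_graph_sketch` are illustrative assumptions, not part of the assignment files, and unlike `loadGraph` the sketch skips blank or malformed lines instead of raising.

```python
import os
import tempfile
from collections import defaultdict


def load_graph_sketch(filename):
    """Build an undirected adjacency list from 'u v' lines (illustrative variant)."""
    adjacency_list = defaultdict(list)
    with open(filename, "r") as file:
        for line in file:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines instead of raising
            u, v = map(int, parts)
            adjacency_list[u].append(v)
            adjacency_list[v].append(u)
    return dict(adjacency_list)


# Hypothetical sample data: a triangle 0-1-2 plus a pendant vertex 3.
sample = "0 1\n0 2\n1 2\n\n2 3\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

graph = load_graph_sketch(path)
os.remove(path)
print(graph)  # {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
```

Each edge appears twice in the result, once in each endpoint's list, which is what makes the representation undirected.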

Next, we can proceed with implementing the `MyQueue` class for BFS and then the BFS algorithm itself.

In [2]:
class MyQueue:
    """
    Class for queue data structure.
    Attributes:
        queue : queue data structure
    Methods:
        enqueue: adding an object to the tail
        dequeue: removing an object from the head
        empty: true if queue is empty
    """
    def __init__(self):
        self.queue = []
    
    def enqueue(self, item):
        """Add to tail of queue"""
        self.queue.append(item)
    
    def dequeue(self):
        """Remove from head of queue"""
        if not self.empty():
            return self.queue.pop(0)
        else:
            raise IndexError("dequeue from empty queue")
    
    def empty(self):
        """Check empty state of queue"""
        return len(self.queue) == 0
    
    def __str__(self):
        """String representation of queue"""
        return str(self.queue)


### Explanation:
- **`__init__`**: Initializes an empty list `self.queue` to store queue elements.
- **`enqueue(item)`**: Adds `item` to the end of the queue using the list's `append` method.
- **`dequeue()`**: Removes and returns the first element from the queue using the list's `pop(0)` method. Raises an `IndexError` if the queue is empty.
- **`empty()`**: Returns `True` if the queue is empty (i.e., `self.queue` has length 0), otherwise `False`.
- **`__str__()`**: Returns a string representation of the queue, useful for debugging and printing the queue contents.
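
One performance note on `dequeue`: `list.pop(0)` shifts every remaining element, so each call costs time proportional to the queue length. If the assignment allows standard-library containers (an assumption; it may require a hand-built queue), `collections.deque` removes from the head in constant time. The class name `DequeQueue` below is hypothetical and keeps the same three-method interface.

```python
from collections import deque


class DequeQueue:
    """Queue with the same interface as MyQueue, backed by collections.deque."""

    def __init__(self):
        self._items = deque()

    def enqueue(self, item):
        """Add item to the tail."""
        self._items.append(item)

    def dequeue(self):
        """Remove and return the head item; raise IndexError if empty."""
        if self.empty():
            raise IndexError("dequeue from empty queue")
        return self._items.popleft()

    def empty(self):
        """Return True when the queue holds no items."""
        return not self._items


q = DequeQueue()
for item in (3, 1, 2):
    q.enqueue(item)
order = [q.dequeue() for _ in range(3)]
print(order, q.empty())  # [3, 1, 2] True
```

Items come out in the order they went in (first in, first out), which is the property BFS relies on to visit vertices level by level.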

Now that we have our `MyQueue` class ready, we can proceed to implement the BFS algorithm using this queue. The BFS algorithm will compute distances from a given source vertex to all other vertices in the graph.

In [3]:
def BFS(G, s):
    """
    Runs breadth-first search and returns the distance
    from the source vertex, s, to every other vertex in G.
    Args:
        G (dict) : adjacency list (graph)
        s (int) : source vertex index
    Returns:
        dict : key-value pairs where the key is a vertex (v) and
                the value is the distance between s and v
                (float('inf') if v is unreachable from s)
    """
    queue = MyQueue()
    queue.enqueue(s)
    
    distances = {}
    for vertex in G:
        distances[vertex] = float('inf')
    distances[s] = 0
    
    while not queue.empty():
        current = queue.dequeue()
        
        for neighbor in G[current]:
            if distances[neighbor] == float('inf'):
                distances[neighbor] = distances[current] + 1
                queue.enqueue(neighbor)
    
    return distances


### Explanation:
- **`BFS(G, s)`**: Takes a graph `G` represented as an adjacency list and a source vertex `s`.
- **`queue = MyQueue()`**: Initializes an empty queue using our `MyQueue` class.
- **`distances = {}`**: Initializes a dictionary `distances` where keys are vertices and values are distances from the source `s`. Initially, all distances are set to infinity (`float('inf')`), except `distances[s]` which is set to 0.
- **`while not queue.empty():`**: Executes the BFS loop until the queue is empty.
- **`current = queue.dequeue()`**: Dequeues the current vertex from the queue.
- **`for neighbor in G[current]:`**: Iterates over each neighbor of the current vertex.
- **`if distances[neighbor] == float('inf'):`**: Checks if the distance to `neighbor` is still infinity, indicating it hasn't been visited.
- **`distances[neighbor] = distances[current] + 1`**: Updates the distance to `neighbor` to be one more than the distance to `current`.
- **`queue.enqueue(neighbor)`**: Enqueues `neighbor` for further exploration.

This `BFS` function will return a dictionary `distances` where each key is a vertex in the graph and the corresponding value is the distance from the source vertex `s` to that vertex.
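
A small worked example makes the return value concrete. The sketch below reimplements the same BFS idea compactly, using `collections.deque` instead of `MyQueue` purely so the snippet is self-contained, and runs it on a hypothetical five-vertex graph in which vertex 4 is isolated. The function name `bfs_distances` is an illustrative assumption.

```python
from collections import deque


def bfs_distances(G, s):
    """Return {vertex: hop count from s}; unreachable vertices stay at inf."""
    distances = {vertex: float("inf") for vertex in G}
    distances[s] = 0
    queue = deque([s])
    while queue:
        current = queue.popleft()
        for neighbor in G[current]:
            if distances[neighbor] == float("inf"):  # not visited yet
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    return distances


# Hypothetical graph: path 0-1-2-3 plus an isolated vertex 4.
G = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: []}
result = bfs_distances(G, 0)
print(result)  # {0: 0, 1: 1, 2: 2, 3: 3, 4: inf}
```

Vertices that BFS never reaches keep the distance `float('inf')`. On a disconnected graph those infinite values would also show up as a key in any distance count built from this dictionary, so it may be worth filtering them out first.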

In [4]:
def distanceDistribution(G):
    """
    Calculates and returns the distribution of all 
    distance frequencies between the vertices.
    Args:
        G (dict) : adjacency list graph representation
    Returns:
        dict : distribution of distances (key = distance, value = frequency in %)
    """
    distribution = {}
    
    # for every vertex, compute the distance to every other vertex
    for vertex in G:
        distances = BFS(G, vertex)
        for dist in distances.values():
            if dist != 0:  # don't count the zero distance from a vertex to itself
                # increment the counter within distribution dict
                if dist in distribution:
                    distribution[dist] += 1
                else:
                    distribution[dist] = 1
    
    # convert frequencies to percentages
    total_pairs = len(G) * (len(G) - 1)  # number of ordered pairs (u, v) with u != v
    
    for dist in distribution:
        distribution[dist] = (distribution[dist] / total_pairs) * 100
    
    return distribution


### Explanation:
- **`distanceDistribution(G)`**: Takes a graph `G` represented as an adjacency list.
- **`distribution = {}`**: Initializes an empty dictionary to store distance frequencies.
- **BFS for each vertex**: For each vertex in the graph, compute distances to all other vertices using the `BFS` function.
- **Update frequency dictionary**: For each distance found (excluding distance 0, which is the vertex itself), update the `distribution` dictionary to count occurrences.
- **Convert frequencies to percentages**: After counting distances, convert frequencies to percentages of total possible pairs (`total_pairs`).
- **Return `distribution`**: Returns a dictionary where keys are distances and values are percentages indicating the frequency of occurrence of those distances.

This function calculates the distribution of distances in the graph based on BFS results. It gives insight into how nodes are interconnected based on their distances apart.
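
To see the counting and the normalization on numbers small enough to check by hand, here is a minimal sketch on a three-vertex path graph. It uses `collections.Counter` for the tallies and a compact BFS that records only the vertices it reaches; both function names are hypothetical, and the sketch is a variant of the approach above rather than the assignment's required implementation.

```python
from collections import Counter, deque


def bfs_distances(G, s):
    """Hop counts from s to every vertex reachable from s."""
    distances = {s: 0}  # only reached vertices are recorded
    queue = deque([s])
    while queue:
        current = queue.popleft()
        for neighbor in G[current]:
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    return distances


def distance_distribution_sketch(G):
    """Percentage of ordered vertex pairs at each distance."""
    counts = Counter()
    for vertex in G:
        for target, dist in bfs_distances(G, vertex).items():
            if target != vertex:  # skip the zero distance to self
                counts[dist] += 1
    total_pairs = len(G) * (len(G) - 1)
    return {dist: 100 * n / total_pairs for dist, n in sorted(counts.items())}


# Path graph 0-1-2: four ordered pairs at distance 1, two at distance 2.
G = {0: [1], 1: [0, 2], 2: [1]}
dist_pct = distance_distribution_sketch(G)
print(dist_pct)
```

Because BFS runs from every vertex, each unordered pair is counted twice, once from each endpoint. That is why the denominator is `len(G) * (len(G) - 1)` rather than half of it; here the six ordered pairs split into two thirds at distance 1 and one third at distance 2.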

Now that we have implemented the main functions (`loadGraph`, `BFS`, and `distanceDistribution`), we can proceed to write the testing code at the bottom of `swp.py` to load data, run BFS, compute the distance distribution, and print the results.

In [5]:
if __name__ == "__main__":
    # Step 1: Load graph from edge file
    filename = 'edges.txt'
    graph = loadGraph(filename)
    
    # Step 2: Compute distance distribution
    distribution = distanceDistribution(graph)
    
    # Step 3: Print the final distribution dictionary
    print("Distance Distribution:")
    for dist, percent in distribution.items():
        print(f"Distance {dist}: {percent:.2f}%")


Distance Distribution:
Distance 1: 1.08%
Distance 4: 35.94%
Distance 2: 16.65%
Distance 3: 24.41%
Distance 6: 4.15%
Distance 5: 15.73%
Distance 7: 1.93%
Distance 8: 0.10%


This code will load the graph data from a file, run BFS from each vertex to compute distances, compute the distance distribution, and finally print out the distribution dictionary.

### Explanation:
- **`if __name__ == "__main__":`**: Ensures that the code block is executed only when the script is run directly, not when it's imported as a module.
- **Load graph**: Calls `loadGraph` function to read the graph data from `edges.txt` file and store it in `graph`.
- **Compute distance distribution**: Calls `distanceDistribution` function to compute the distribution of distances in the graph `graph`.
- **Print distribution**: Prints each distance and its corresponding percentage from the `distribution` dictionary.

Make sure to place this code at the bottom of your `swp.py` file. When you run the script, it will load the graph data, compute the necessary metrics, and display the distance distribution as specified.

If everything is clear, you can proceed to run this code with your actual data file (`edges.txt`).

In [22]:
distribution

{1: 1.0819963503439287,
 4: 35.93958410205793,
 2: 16.653711013016846,
 3: 24.41433762273995,
 6: 4.152271666261381,
 5: 15.728089954052496,
 7: 1.9342367832405714,
 8: 0.09577250828689715}

In [20]:
thresh = 6
total = 0
for dist, freq in distribution.items():
    if dist <= thresh:
        total += freq
total

97.96999070847254
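
The threshold check above generalizes into two small helpers: the cumulative share of pairs within a given number of hops, and the mean distance implied by the distribution. The sketch below hard-codes the percentages printed earlier so that it runs on its own; the helper names `share_within` and `mean_distance` are hypothetical.

```python
# Percentages copied from the distribution output above.
distribution = {
    1: 1.0819963503439287,
    4: 35.93958410205793,
    2: 16.653711013016846,
    3: 24.41433762273995,
    6: 4.152271666261381,
    5: 15.728089954052496,
    7: 1.9342367832405714,
    8: 0.09577250828689715,
}


def share_within(distribution, thresh):
    """Total percentage of pairs whose distance is at most thresh."""
    return sum(freq for dist, freq in distribution.items() if dist <= thresh)


def mean_distance(distribution):
    """Average distance, weighting each distance by its share of pairs."""
    total = sum(distribution.values())
    return sum(dist * freq for dist, freq in distribution.items()) / total


print(f"within 6 hops: {share_within(distribution, 6):.2f}%")
print(f"within 3 hops: {share_within(distribution, 3):.2f}%")
print(f"mean distance: {mean_distance(distribution):.2f}")
```

For this distribution the mean works out to roughly 3.7 hops, which gives a single-number summary to set beside the cumulative percentages.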

### Small World Phenomenon

The small world phenomenon refers to the observation that in social networks, individuals are typically connected by short paths of acquaintanceship or friendship. This means that even in large networks, most pairs of people can be connected through a relatively small number of intermediate connections. The concept gained prominence through Stanley Milgram's famous "six degrees of separation" experiment, which suggested that chains of roughly six acquaintances are typically enough to link two people (a typical chain length, not a strict upper bound). This phenomenon has since been studied in various contexts, including social networks, the internet, and even biological networks.

To evaluate the small world phenomenon:
- Small average path length: Check if the average distance between nodes is relatively low,
    suggesting short paths exist between nodes.
- High clustering coefficient: Evaluate if nodes tend to cluster together, indicating 
    local connectivity.
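
The clustering criterion can be made concrete with a short sketch. It uses one common definition, the average local clustering coefficient: for each vertex, the fraction of pairs of its neighbors that are themselves connected, averaged over all vertices. The function names and the sample graph are illustrative assumptions, and the code assumes vertex IDs are comparable (for example, integers).

```python
def local_clustering(G, v):
    """Fraction of pairs of v's neighbors that are directly connected."""
    neighbors = set(G[v]) - {v}
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pairs to connect
    # count each neighbor-neighbor edge once by requiring u < w
    links = sum(
        1 for u in neighbors for w in set(G[u]) if w in neighbors and u < w
    )
    return 2 * links / (k * (k - 1))


def average_clustering(G):
    """Mean of the local clustering coefficients over all vertices."""
    return sum(local_clustering(G, v) for v in G) / len(G)


# Hypothetical graph: triangle 0-1-2 with a pendant vertex 3 attached to 2.
G = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
avg = average_clustering(G)
print(f"average clustering: {avg:.3f}")
```

In the sample graph, vertices 0 and 1 score 1.0, vertex 2 scores 1/3, and vertex 3 scores 0, so the average is 7/12. Running the same function on the graph returned by `loadGraph` would supply the clustering half of the small-world check.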

Observations based on distance distribution:
- Distance 1: 1.08% - Only about 1% of vertex pairs are directly connected, so the graph is sparse.
- Distance 2: 16.65% - A substantial share of pairs are two steps apart (friends of friends).
- Distance 3: 24.41% - About a quarter of pairs are three steps apart.
- Distance 4: 35.94% - The most common distance; the distribution peaks here.
- Distances 5 and beyond: Percentages fall off quickly (15.73%, 4.15%, 1.93%, 0.10%), so long paths are rare.

Conclusion:
- The network exhibits characteristics of the small world phenomenon: most pairs of nodes are only a few steps apart, with distance 4 the most common.
    - Nearly 98% of pairs of people are connected by at most 6 acquaintances
    - About 42% of pairs of people are connected by at most 3 acquaintances
- The low share of distance 1 pairs shows that the graph is sparse. Short paths despite that sparsity are consistent with some nodes acting as hubs, although the distance distribution alone cannot confirm this.
- Further analysis of average path length and clustering coefficient would provide a more comprehensive understanding of the network's structure and small world properties.



Since nearly 98% of pairs of people can be connected through 6 or fewer acquaintances, the network given in 'edges.txt' strongly satisfies the small world phenomenon.

Yes, you have the right understanding of the process. Here is a more detailed breakdown of each step:

1. **Calculate the Existing Distribution (via BFS) of Distance Counts**:
   - Perform BFS for each vertex in the graph to calculate the distances between all pairs of vertices.
   - Store these distances in a data structure, such as a dictionary, where the keys are the distances and the values are the counts of how many vertex pairs have that distance.

2. **Store the Counts of the Distances**:
   - Maintain a distribution dictionary that keeps track of the counts of each distance. This will help you avoid recalculating distances for the entire graph when adding a new vertex.

3. **Add New Vertex with New Edge Data/Nodes**:
   - Introduce the new vertex to the graph and connect it with its neighbors by updating the adjacency list.
   - This step modifies the graph structure to include the new vertex and its edges.

4. **Calculate the Distance Distribution via BFS for That New Vertex Only**:
   - Run BFS starting from the new vertex to determine its distances to all other vertices in the graph.
   - This BFS will reveal how the new vertex changes the distance distribution.

5. **Append the Distance Counts to the Appropriate Distances of the Original, Stored Distribution**:
   - Update the original distribution with the distances computed from the new vertex.
   - For each distance found in the new BFS, increase the count in the original distribution dictionary accordingly.

Here's a revised version of the code that implements these steps:

```python
def loadGraph(filename: str):
    """
    Reads in edge data and returns an adjacency 
    list that corresponds to the undirected graph 
    of social connections.
    Args:
        filename (str) : name of Facebook edge file
    Returns:
        dict : adjacency list representation
    """
    adjacency_list = {}
    
    with open(filename, 'r') as file:
        for line in file:
            vertex1, vertex2 = map(int, line.split())
            
            if vertex1 not in adjacency_list:
                adjacency_list[vertex1] = []
            if vertex2 not in adjacency_list:
                adjacency_list[vertex2] = []
            
            adjacency_list[vertex1].append(vertex2)
            adjacency_list[vertex2].append(vertex1)
    
    return adjacency_list

class MyQueue:
    def __init__(self):
        self.queue = []
    
    def enqueue(self, item):
        self.queue.append(item)
    
    def dequeue(self):
        if self.empty():
            raise IndexError("dequeue from empty queue")
        return self.queue.pop(0)
    
    def empty(self):
        return len(self.queue) == 0
    
    def __str__(self):
        return str(self.queue)
    
def BFS(G, s):
    queue = MyQueue()
    queue.enqueue(s)
    
    distances = {}
    for vertex in G:
        distances[vertex] = float('inf')
    distances[s] = 0
    
    while not queue.empty():
        current = queue.dequeue()
        
        for neighbor in G[current]:
            if distances[neighbor] == float('inf'):
                distances[neighbor] = distances[current] + 1
                queue.enqueue(neighbor)
    
    return distances

def distanceDistribution(G):
    distribution = {}
    
    for vertex in G:
        distances = BFS(G, vertex)
        for dist in distances.values():
            if dist != 0:
                if dist in distribution:
                    distribution[dist] += 1
                else:
                    distribution[dist] = 1
    
    total_pairs = len(G) * (len(G) - 1)
    
    for dist in distribution:
        distribution[dist] = (distribution[dist] / total_pairs) * 100
    
    return distribution

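def update_distance_distribution(G, new_vertex, new_neighbors, existing_distribution):
    """
    Approximate update after adding one vertex, using a single BFS from it.
    Assumes existing_distribution holds percentages, as returned by
    distanceDistribution, and that the new vertex does not shorten any
    path between two existing vertices.
    """
    # recover ordered-pair counts from the stored percentages,
    # using the graph size before the new vertex is added
    old_pairs = len(G) * (len(G) - 1)
    counts = {dist: pct * old_pairs / 100 for dist, pct in existing_distribution.items()}
    
    # add the new vertex and its undirected edges
    G[new_vertex] = list(new_neighbors)
    for neighbor in new_neighbors:
        G[neighbor].append(new_vertex)
    
    new_distances = BFS(G, new_vertex)
    for dist in new_distances.values():
        if dist != 0:
            # each new pair counts in both directions, matching distanceDistribution
            counts[dist] = counts.get(dist, 0) + 2
    
    # convert counts back to percentages over the enlarged graph
    total_pairs = len(G) * (len(G) - 1)
    return {dist: (count / total_pairs) * 100 for dist, count in counts.items()}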

if __name__ == "__main__":
    filename = 'edges.txt'
    graph = loadGraph(filename)
    
    initial_distribution = distanceDistribution(graph)
    print("Initial Distance Distribution:")
    for dist, percent in initial_distribution.items():
        print(f"Distance {dist}: {percent:.2f}%")
    
    new_vertex = 1000
    new_neighbors = [1, 2, 3]  # example neighbors of the new vertex
    updated_distribution = update_distance_distribution(graph, new_vertex, new_neighbors, initial_distribution)
    print("Updated Distance Distribution:")
    for dist, percent in updated_distribution.items():
        print(f"Distance {dist}: {percent:.2f}%")
```

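This code updates the distance distribution with one BFS from the new vertex instead of one BFS per vertex, so it is far cheaper than recomputing the full distribution. The result is exact only when the new vertex does not create a shorter path between two existing vertices; when it does, the stored counts for those pairs become stale and a full recomputation is needed.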
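
Whether a single BFS from the new vertex is enough can be tested directly by comparing it with a full recomputation. The sketch below does this with raw ordered-pair counts on a hypothetical four-vertex path, where the new vertex is attached to both ends and so creates a shortcut. The helper names are illustrative assumptions.

```python
from collections import Counter, deque


def bfs_distances(G, s):
    """Hop counts from s to every vertex reachable from s."""
    distances = {s: 0}
    queue = deque([s])
    while queue:
        current = queue.popleft()
        for neighbor in G[current]:
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    return distances


def pair_counts(G):
    """Number of ordered vertex pairs at each distance (full recomputation)."""
    counts = Counter()
    for s in G:
        for t, d in bfs_distances(G, s).items():
            if t != s:
                counts[d] += 1
    return counts


# Path 0-1-2-3: the two ends are 3 hops apart.
G = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
before = pair_counts(G)

# Add hypothetical vertex 4, adjacent to both ends of the path.
G[4] = [0, 3]
G[0].append(4)
G[3].append(4)

# Incremental update: old counts plus two ordered pairs per new distance.
incremental = before.copy()
for t, d in bfs_distances(G, 4).items():
    if t != 4:
        incremental[d] += 2

full = pair_counts(G)
print("incremental:", dict(incremental))  # {1: 10, 2: 8, 3: 2}
print("full:       ", dict(full))         # {1: 10, 2: 10}
```

The two results differ because vertices 0 and 3 are now two hops apart through vertex 4, yet the incremental counts still record them at distance 3. A comparison like this on a sample of insertions is one way to judge how often the shortcut case arises in practice.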

