# Assignment 3: Local Association Matrix 

**Student names**: _Your_names_here_ <br>
**Group number**: _Your_group_here_ <br>
**Date**: _Submission Date_

## Important notes
Please read and follow these rules. Submissions that do not fulfill them may be returned.
1. You may work in groups of at most 2 students.
2. Submit in **.ipynb** format only.
3. The assignment must be typed. Handwritten answers are not accepted.

**Due date**: 12.10.2025 23:59

### What you will do 
- Build a **local association matrix** from the Cranfield collection.
- Compute the **normalized association matrix**.
- Use the normalized matrix to **identify neighborhood terms** for expanding the given queries.


---
## Dataset

You will use the **Cranfield** dataset, provided in this file:

- `cran.all.1400`: The document collection (1400 documents)

**The code to parse the file is provided. Update the Cranfield file path to match your own file location, and use the `docs` variable in your code to access the parsed documents.**


### Load and parse documents (provided)

Run the cell to parse the Cranfield documents. Update the path so it points to your `cran.all.1400` file.

In [None]:
# Read 'cran.all.1400' and parse the documents into a suitable data structure

CRAN_PATH = r"your_path_to/cran.all.1400"  # <-- change this!

def parse_cranfield(path):
    docs = {}
    current_id = None
    current_field = None
    buffers = {"T": [], "A": [], "B": [], "W": []}
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith(".I "):
                if current_id is not None:
                    docs[current_id] = {
                        "id": current_id,
                        "title": " ".join(buffers["T"]).strip(),
                        "abstract": " ".join(buffers["W"]).strip()
                    }
                current_id = int(line.split()[1])
                buffers = {k: [] for k in buffers}
                current_field = None
            elif line.startswith("."):
                tag = line[1:].strip()
                current_field = tag if tag in buffers else None
            else:
                if current_field is not None:
                    buffers[current_field].append(line)
    if current_id is not None:
        docs[current_id] = {
            "id": current_id,
            "title": " ".join(buffers["T"]).strip(),
            "abstract": " ".join(buffers["W"]).strip()
        }
    print(f"Parsed {len(docs)} documents.")
    return docs

docs = parse_cranfield(CRAN_PATH)

## 3.1  Local association matrix

For the Cranfield document collection in `cran.all.1400`, construct a local association matrix to identify association clusters. Use the `docs` variable, which holds the parsed file. Omit the stopwords in the `STOPWORDS` list given below from the vocabulary.


The correlation factors $c_{u,v}$ between any pair of terms $w_u$ and $w_v$ are defined as  
$c_{u,v} = \sum_{d_j \in D} f_{u,j} \cdot f_{v,j}$  

$f_{u,j}$ is the raw term frequency of $w_u$ in document $d_j$.
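
The definition above can be illustrated on a tiny corpus. The following is a minimal sketch, not a reference solution: `toy_docs`, `toy_stopwords`, `tokenize`, and `association_matrix` are hypothetical names, and the regex tokenizer is an assumption (the assignment does not prescribe one). It stores the matrix as nested dictionaries so that `c[u][v]` holds $c_{u,v}$.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy corpus standing in for the parsed Cranfield abstracts.
toy_docs = ["gas pressure in gas flow", "pressure flow", "gas"]
toy_stopwords = {"in"}

def tokenize(text, stopwords):
    """Lowercase, extract word tokens (assumed tokenizer), drop stopwords."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return [t for t in tokens if t not in stopwords]

def association_matrix(texts, stopwords):
    """Return c where c[u][v] = sum over documents of f_{u,j} * f_{v,j}."""
    c = defaultdict(lambda: defaultdict(int))
    for text in texts:
        freqs = Counter(tokenize(text, stopwords))  # raw term frequencies
        for u, fu in freqs.items():
            for v, fv in freqs.items():
                c[u][v] += fu * fv
    return c

c = association_matrix(toy_docs, toy_stopwords)
print(c["gas"]["pressure"])  # 2: "gas" twice and "pressure" once in the first document
print(c["gas"]["gas"])       # 5: 2*2 from the first document plus 1*1 from the third
```

Only term pairs that co-occur in at least one document get an entry, which keeps the structure sparse. For the full collection, a dense or sparse array indexed by vocabulary position may be more practical.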

### Weighting variants: **scalar** and **metric**

Add two alternative weighting schemes for the matrix (only the formula for assigning the matrix cell value changes):

- **Metric weighting**:
Let $w_u(n,j)$ and $w_v(m,j)$ denote the $n$-th and $m$-th occurrences of terms $w_u$ and $w_v$ in document $d_j$.  
Define a distance function $r(w_u(n,j), w_v(m,j))$ between the two occurrences (e.g., $r(i,k) = 1 + |i - k|$, where $i$ and $k$ are the positions of the two occurrences in $d_j$).  
Then:

$$
c_{u,v} = \sum_{d_j \in D} \sum_n \sum_m \frac{1}{r(w_u(n,j), w_v(m,j))}
$$


- **Scalar weighting**:
Let $\vec{s}_u = \langle c_{u,x_1}, c_{u,x_2}, \dots, c_{u,x_n} \rangle$ be the neighborhood vector of term $w_u$, and similarly for $w_v$.  
Then:

$$
c_{u,v} = \frac{\vec{s}_u \cdot \vec{s}_v}{|\vec{s}_u| \cdot |\vec{s}_v|}
$$
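
Both weighting variants can be sketched on a toy example. This is an illustration under stated assumptions, not a reference solution: `toy_tokens`, `metric_matrix`, and `scalar_matrix` are hypothetical names, positions are counted in the token list after stopword removal (one possible reading), and the distance function is the example $r(i,k) = 1 + |i - k|$ from the text. The scalar variant takes an existing association matrix as input; which base matrix to use is left open here.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical pre-tokenized toy documents (stopwords already removed).
toy_tokens = [["gas", "pressure", "gas", "flow"], ["pressure", "flow"]]

def metric_matrix(token_lists):
    """c[u][v] = sum over documents and occurrence pairs of 1 / (1 + |i - k|)."""
    c = defaultdict(lambda: defaultdict(float))
    for tokens in token_lists:
        positions = defaultdict(list)  # term -> positions in this document
        for pos, term in enumerate(tokens):
            positions[term].append(pos)
        for u, pos_u in positions.items():
            for v, pos_v in positions.items():
                c[u][v] += sum(1.0 / (1 + abs(i - k)) for i in pos_u for k in pos_v)
    return c

def scalar_matrix(c):
    """Cosine similarity between the neighborhood vectors (rows) of c."""
    terms = sorted(c)
    norms = {u: sqrt(sum(c[u].get(x, 0.0) ** 2 for x in terms)) for u in terms}
    s = {u: {} for u in terms}
    for u in terms:
        for v in terms:
            dot = sum(c[u].get(x, 0.0) * c[v].get(x, 0.0) for x in terms)
            s[u][v] = dot / (norms[u] * norms[v]) if norms[u] and norms[v] else 0.0
    return s

metric = metric_matrix(toy_tokens)
scalar = scalar_matrix(metric)
print(metric["gas"]["pressure"])  # 1.0: two occurrence pairs, each at distance 2
print(metric["gas"]["flow"])      # 0.75: 1/4 + 1/2
```

The `scalar_matrix` loop is cubic in the vocabulary size, so it is only suitable for toy data as written. On the full vocabulary, a vectorized row-wise cosine computation would be the more realistic choice.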

In [None]:
# TODO: Construct a local association matrix for the Cranfield collection. Use both weighting variants.

STOPWORDS = set("""a about above after again against all am an and any are aren't as at be because been
before being below between both but by can't cannot could couldn't did didn't do does doesn't doing don't down
during each few for from further had hadn't has hasn't have haven't having he he'd he'll he's her here here's hers
herself him himself his how how's i i'd i'll i'm i've if in into is isn't it it's its itself let's me more most
mustn't my myself no nor not of off on once only or other ought our ours ourselves out over own same shan't she
she'd she'll she's should shouldn't so some such than that that's the their theirs them themselves then there there's
these they they'd they'll they're they've this those through to too under until up very was wasn't we we'd we'll we're
we've were weren't what what's when when's where where's which while who who's whom why why's with won't would wouldn't
you you'd you'll you're you've your yours yourself yourselves""".split())

# Your code here


## 3.2 Normalized association matrix

Compute the normalized association matrix from the unnormalized matrix computed above. 

To normalize the matrix, use the following formula: <br>
$c'_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}$  
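
As a worked illustration of the formula, the sketch below normalizes a small hand-written matrix. The names `toy_c` and `normalize` are hypothetical, and returning `0.0` when the denominator is zero is an assumption about how to handle that edge case.

```python
# Hypothetical unnormalized association matrix for three terms.
toy_c = {
    "flow": {"flow": 2, "gas": 2, "pressure": 2},
    "gas": {"flow": 2, "gas": 5, "pressure": 2},
    "pressure": {"flow": 2, "gas": 2, "pressure": 2},
}

def normalize(c):
    """Return c' where c'[u][v] = c[u][v] / (c[u][u] + c[v][v] - c[u][v])."""
    terms = list(c)
    normalized = {u: {} for u in terms}
    for u in terms:
        for v in terms:
            cuv = c[u].get(v, 0)
            denom = c[u].get(u, 0) + c[v].get(v, 0) - cuv
            # Assumed convention: a zero denominator yields 0.0.
            normalized[u][v] = cuv / denom if denom else 0.0
    return normalized

toy_normalized = normalize(toy_c)
print(toy_normalized["gas"]["pressure"])   # 0.4 = 2 / (5 + 2 - 2)
print(toy_normalized["flow"]["pressure"])  # 1.0 = 2 / (2 + 2 - 2)
```

Every diagonal entry with a nonzero $c_{u,u}$ becomes 1, since $c_{u,u} / (c_{u,u} + c_{u,u} - c_{u,u}) = 1$.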


In [None]:
# TODO: Compute the normalized association matrix

# Your code here

## 3.3 Neighborhood terms

Using the normalized local association matrix, identify the neighborhood terms that should be used to expand the following queries (`queries_assignment3`):
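
One possible way to read "neighborhood terms" is to take, for each query term, the `k` terms with the highest normalized association value, excluding the term itself. The sketch below shows this on a hand-written matrix. `toy_normalized`, `neighborhood_terms`, and the cutoff `k` are hypothetical; the assignment does not fix the number of neighbors, and breaking ties alphabetically is an assumption.

```python
# Hypothetical normalized association matrix for three terms.
toy_normalized = {
    "flow": {"flow": 1.0, "gas": 0.4, "pressure": 1.0},
    "gas": {"flow": 0.4, "gas": 1.0, "pressure": 0.4},
    "pressure": {"flow": 1.0, "gas": 0.4, "pressure": 1.0},
}

def neighborhood_terms(query, normalized, k=2):
    """Map each query term to its k most strongly associated other terms."""
    expansions = {}
    for term in query.lower().split():
        row = normalized.get(term)
        if row is None:  # query term does not occur in the vocabulary
            expansions[term] = []
            continue
        # Sort by descending score, then alphabetically to break ties.
        ranked = sorted(
            ((score, other) for other, score in row.items() if other != term),
            key=lambda pair: (-pair[0], pair[1]),
        )
        expansions[term] = [other for _, other in ranked[:k]]
    return expansions

print(neighborhood_terms("gas pressure", toy_normalized, k=1))
# {'gas': ['flow'], 'pressure': ['flow']}
```

Query terms that are missing from the vocabulary (for example, stopwords or words that never occur in the collection) get an empty expansion list rather than raising an error.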


In [None]:
# Do not change this code
queries_assignment3 = [
  "gas pressure",
  "structural aeroelastic flight high speed aircraft",
  "heat conduction composite slabs",
  "boundary layer control",
  "compressible flow nozzle",
  "combustion chamber injection",
  "laminar turbulent transition",
  "fatigue crack growth",
  "wing tip vortices",
  "propulsion efficiency"
]

In [None]:
# TODO: Identify neighborhood terms for queries_assignment3

# Your code here
