# **Extract from raw data**

This code extracts information from PDFs and from a scraped CSV.

## **From PDFs**

1. **Setup**  
   *Creates `../data/processed/` so the final CSV can be saved there.*

2. **Identify questions**  
   `is_question()` returns `True` for any line that  
   - ends with `?`, or  
   - starts with `Q:`, `Q-`, or `Q `.

3. **`extract_faqs(pdf_path)`**  
   *Reads one PDF and returns a list of Q-A pairs.*  
   - Loads every page with **pdfplumber** and concatenates the text.  
   - Scans line-by-line:  
     - If the line is a question, it stores the previous question and its answer (only when answer text was collected) and starts a new pair.  
     - Otherwise, it appends the line to the current answer.  
   - Adds the last pair once the loop finishes.

4. **Batch processing**  
   Loops through every PDF in `../data/raw/`, calling `extract_faqs()` and aggregating all rows.

5. **Build & save CSV**  
   Converts the rows to a Pandas DataFrame, assigns an incremental `id`, and writes `faqs.csv` to `../data/processed/` while printing the total FAQ count.
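
The question rule from step 2 can be checked in isolation on a few sample lines. The sketch below restates the rule with the same two regular expressions; the sample strings are made up for illustration and do not come from the real PDFs.

```python
import re

def is_question(line: str) -> bool:
    # True if the line ends with "?" or starts with "Q:", "Q-", or "Q ".
    return bool(re.match(r".*\?$", line) or re.match(r"^Q[:\- ]", line))

# Made-up sample lines (assumption: not taken from the source documents).
samples = [
    "How do I reset my password?",
    "Q: Where is the library",
    "Q- Opening hours",
    "Quarterly reports are published in March.",
    "Visit the help desk for more details.",
]

results = {s: is_question(s) for s in samples}
for s, flag in results.items():
    print(f"{flag!s:5} {s}")
```

A line that merely begins with a capital Q, such as "Quarterly reports...", is not treated as a question because the `Q` must be followed by `:`, `-`, or a space.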


### Libraries

In [None]:
import pdfplumber, re, pandas as pd, pathlib

In [None]:
# Output directory for the processed CSV (created if missing)
OUTPUT_DIR = pathlib.Path("../data/processed")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Return True for lines that end with "?" or start with "Q:", "Q-", or "Q "
def is_question(line: str) -> bool:
    return bool(re.match(r".*\?$", line) or re.match(r"^Q[:\- ]", line))

# Read one PDF and return a list of question/answer rows
def extract_faqs(pdf_path: pathlib.Path):
    rows, q, a = [], None, []  # rows holds dicts with keys question, answer, source_pdf
    with pdfplumber.open(str(pdf_path)) as pdf:
        # joining pages separated by linebreak
        text = "\n".join(
            page.extract_text(x_tolerance=1, y_tolerance=3) or "" # smooth issues with columns
            for page in pdf.pages
        )
    # Iterate line by line
    for ln in text.splitlines():
        line = ln.strip()  # Remove leading and trailing whitespace
        if not line: # Ignore empty lines
            continue
        if is_question(line):
            if q and a:
                # Save the previous question/answer pair
                rows.append(
                    {
                        "question": q,
                        "answer": " ".join(a).strip(),
                        "source_pdf": pdf_path.name,
                    }
                )
            # Remove only the leading "Q:", "Q-", or "Q " marker and the trailing "?"
            q, a = re.sub(r"^Q[:\- ]\s*", "", line).rstrip("?").strip(), []
        else:
            # Not a question: the line continues the current answer
            a.append(line)
    if q and a:
        rows.append(
            {
                "question": q,
                "answer": " ".join(a).strip(),
                "source_pdf": pdf_path.name,
            }
        )
    return rows

# Process every PDF in a stable order and collect the extracted rows
all_rows = []
for pdf_path in sorted(pathlib.Path("../data/raw").glob("*.pdf")):
    all_rows.extend(extract_faqs(pdf_path))

# Build a DataFrame and assign an incremental id
df = pd.DataFrame(all_rows)
df = df.reset_index().rename(columns={"index": "id"})
df.to_csv(OUTPUT_DIR / "faqs.csv", index=False)
print(f"Extracted {len(df)} FAQs → {OUTPUT_DIR/'faqs.csv'}")


Extracted 28 FAQs → ..\data\processed\faqs.csv
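
The line-scanning logic can also be tried without a PDF. The following is a minimal sketch that applies the same pairing rule to a plain string; `pairs_from_text` and the sample text are hypothetical and exist only for this illustration. The `Q` marker is removed with `re.sub`, which strips only the marker itself.

```python
import re

def is_question(line: str) -> bool:
    return bool(re.match(r".*\?$", line) or re.match(r"^Q[:\- ]", line))

def pairs_from_text(text: str) -> list:
    """Hypothetical helper: group lines into question/answer pairs."""
    rows, q, a = [], None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:  # skip empty lines
            continue
        if is_question(line):
            if q and a:  # store the previous pair only if it has an answer
                rows.append({"question": q, "answer": " ".join(a)})
            q, a = re.sub(r"^Q[:\- ]\s*", "", line).rstrip("?").strip(), []
        else:
            a.append(line)  # continuation of the current answer
    if q and a:  # store the last pair
        rows.append({"question": q, "answer": " ".join(a)})
    return rows

# Made-up text standing in for the concatenated PDF pages.
sample = """
Q: Where is the library
Building B, second floor.
Open on weekdays.

How do I enrol?
Is there a deadline?
Use the online form.
"""

pairs = pairs_from_text(sample)
for p in pairs:
    print(p)
```

Note that "How do I enrol?" does not appear in the result: a question immediately followed by another question has no answer lines, so the `if q and a` check drops it.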


## **From the Scraped CSV**

1. **Add the raw file**  
   Place the unedited CSV generated by the scraper in `data/raw/student_resources_raw.csv`.


2. **Create a cleaning notebook** that  
   - reads the raw CSV,  
   - lowercases the column names (`title`, `link`, `description`),  
   - drops duplicate rows,  
   - trims whitespace in `title` and `description`, and  
   - adds a numeric `id`.


In [19]:

RAW  = pathlib.Path("../data/raw/student_resources_raw.csv")
OUT  = pathlib.Path("../data/processed/student_resources_index.csv")

df = pd.read_csv(RAW)

# cleaning
df = (
    df.rename(columns=str.lower)               # Title → title, Link → link…
      .drop_duplicates()
      .assign(title=lambda d: d['title'].str.strip(),
              description=lambda d: d['description'].fillna('').str.strip())
      .reset_index(drop=True)
      .assign(id=lambda d: d.index)            # id
)

OUT.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(OUT, index=False)
print(f"Saved {len(df)} resources → {OUT}")


Saved 428 resources → ..\data\processed\student_resources_index.csv
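
To see what each cleaning step does, the same chain can be run on a tiny in-memory CSV. This is a sketch under assumptions: the rows, the URLs, and the capitalised column names (`Title`, `Link`, `Description`) are invented for illustration and may not match the real scraper output.

```python
import io
import pandas as pd

# Made-up rows standing in for the scraper output (assumed column names).
raw_csv = io.StringIO(
    "Title,Link,Description\n"
    "  Library  ,https://example.org/library,Opening hours \n"
    "  Library  ,https://example.org/library,Opening hours \n"
    "Careers,https://example.org/careers,\n"
)

df = pd.read_csv(raw_csv)
clean = (
    df.rename(columns=str.lower)          # lowercase the column names
      .drop_duplicates()                  # remove exact duplicate rows
      .assign(title=lambda d: d["title"].str.strip(),
              description=lambda d: d["description"].fillna("").str.strip())
      .reset_index(drop=True)             # renumber rows after dropping
      .assign(id=lambda d: d.index)       # incremental id
)
print(clean)
```

Because `drop_duplicates()` runs before the whitespace is trimmed, rows that differ only in surrounding spaces would both survive; moving the strip step earlier is one possible way to catch those.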
