# I. Dataset & Input Design

### Question 1: Why were only SPOC 2-minute cadence light curves used?

We used 2-minute cadence SPOC light curves because they offer a manageable dataset size (~15k–20k light curves per sector), high-quality preprocessed data, and known TOIs. This makes them well suited to experimentation on a modest system (32 GB RAM, 4-core i3 CPU) and provides a fair testbed without overwhelming the available computational resources.



### Follow-up 1.1: Why not use more stars or the full million-star dataset?

More stars were not required to test our hypothesis. We selected a representative subset with a fair distribution of TOIs, sufficient to evaluate model behavior under resource constraints.



### Follow-up 1.2: Why not use FFI (30-minute cadence) light curves?

FFI light curves are extremely large in volume, making them impractical for this phase of the project on our available hardware. Our goal was not exhaustive coverage, but controlled hypothesis testing.



### Follow-up 1.3: Why did you move from Sector 1 to Sector 2?

The first four URF models were trained and tested on Sector 1. However, URF-3, trained using synthetic features generated from TOIs, overfitted to Sector 1 data. To confirm this, we tested it on Sector 2, which revealed generalization failure. Due to space limitations, only one sector could be held at a time. Sector 2 was retained for follow-up experiments and URF-4 development. We plan to revisit cross-sector validation in CLARA Part 3.



### Follow-up 1.4: Why were TOIs used at all in an unsupervised setting?

Since we work with unlabeled data, we needed a ground-truth proxy to benchmark performance. TOIs are vetted transit candidates (though not all are confirmed planets) and serve as a useful reference for evaluating anomaly detection quality.



### Follow-up 1.5: Were TOIs used for training?

Yes, in URF-3. The synthetic set for URF-3 was constructed from features derived from known TOIs. However, the model overfit to Sector 1 and failed to generalize to Sector 2. This led to the development of URF-4, which relies on synthetic light curves with user-controlled features.



### Follow-up 1.6: Why use synthetic light curves (e.g., with `batman`) in URF-4?

URF-4 was built to test synthetic set design systematically. Using tools like `batman`, we can generate synthetic LCs with controlled input parameters — such as transit count, duration, cadence, and noise — making it possible to study how these factors influence URF model behavior.
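
As a rough illustration of what "controlled input parameters" can look like, the sketch below builds a synthetic light curve with a simple box-shaped transit in plain NumPy. It is a stand-in for illustration only, not the `batman` (Mandel-Agol) model; the function name `make_box_transit_lc` and every default value are hypothetical.

```python
import numpy as np

def make_box_transit_lc(n_points=3000, cadence_min=2.0, period_days=1.5,
                        duration_hr=2.0, depth_ppm=1000.0, noise_ppm=100.0,
                        seed=0):
    """Return (time_days, flux, in_transit) for a flat star with box-shaped dips.

    All parameter names and defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    time = np.arange(n_points) * cadence_min / (60.0 * 24.0)  # days
    flux = np.ones(n_points)

    # Time since the most recent transit centre; the first transit is
    # centred half a period after the start of the curve.
    phase = (time - 0.5 * period_days) % period_days
    half_dur = 0.5 * duration_hr / 24.0
    in_transit = (phase < half_dur) | (phase > period_days - half_dur)

    flux[in_transit] -= depth_ppm * 1e-6                       # box-shaped dip
    flux += rng.normal(0.0, noise_ppm * 1e-6, size=n_points)   # white noise
    return time, flux, in_transit

time, flux, in_transit = make_box_transit_lc()
```

Each keyword argument corresponds to one design variable (transit count via `period_days`, duration, cadence, noise), so a grid of synthetic sets can be produced by looping over these values.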



### Follow-up 1.7: Why not use TOIs for synthetic set design?

While TOIs are useful for evaluation, they lack standardized, controllable input parameters. Synthetic curves, on the other hand, allow precise tuning of design variables like duration, cadence, and noise. This makes them ideal for constructing interpretable and tunable synthetic training sets for URF models.

---

### Question 2: Could the results change if we use FFI-based light curves?

Yes, the results would likely change if FFI-based light curves were used. FFI (Full Frame Image) light curves are not subject to the same target pre-selection as 2-minute cadence light curves, and they represent a broader, less filtered stellar population. As a result, the definition of "normal" behavior in light curves would shift, potentially altering the anomaly scoring and the mapping between synthetic set parameters and model performance. While our hypothesis — that synthetic set design can steer model behavior — is expected to hold, the specific mappings of input features (e.g., number of transits, duration) to performance metrics (e.g., TOI recall) would likely differ.


### Follow-Up 2.1: Why would FFI-based light curves change the anomaly landscape?

2-minute cadence light curves are drawn from pre-selected targets, typically nearby dwarf stars, and tend to exhibit more stable, well-characterized variability. In contrast, FFI data includes a much wider range of stellar types and behaviors, including less stable or more variable stars. This broader variability spectrum would redefine what is considered "normal," leading the URF to assign different anomaly scores. Consequently, the anomaly score thresholds — currently calibrated using TOIs from 2-min data — would not directly apply to FFI data.


### Follow-Up 2.2: Would anomaly score thresholds need to change for FFI-based models?

Yes. Since FFI data would establish a different baseline for what is "normal," the anomaly score distribution would shift. Thresholds based on 2-min cadence TOIs would no longer be valid for performance evaluation. A new evaluation strategy would be needed, possibly using synthetic injections or another benchmark tailored to the FFI population.


### Follow-Up 2.3: If TOIs can't serve as benchmarks for FFI-based URFs, how should model performance be evaluated?

When TOIs are insufficient or non-representative (as in FFI studies), we can use synthetic light curves with known injected features (e.g., transits, flares) as a benchmarking strategy. This allows for controlled testing of model sensitivity and precision by comparing detection outcomes against known injected events.
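
A minimal sketch of such an injection-based benchmark is shown below, assuming each light curve already has an anomaly score and a boolean flag saying whether a feature was injected. The function name `injection_recovery` and the `top_fraction` cut are hypothetical choices, not part of the existing pipeline.

```python
import numpy as np

def injection_recovery(scores, injected, top_fraction=0.1):
    """Recall and precision of the top-scoring curves against injection labels."""
    scores = np.asarray(scores, dtype=float)
    injected = np.asarray(injected, dtype=bool)

    # Flag the highest-scoring fraction of curves as anomalies.
    n_flag = max(1, int(round(top_fraction * scores.size)))
    flagged = np.zeros(scores.size, dtype=bool)
    flagged[np.argsort(scores)[::-1][:n_flag]] = True

    true_pos = int(np.sum(flagged & injected))
    recall = true_pos / max(1, int(injected.sum()))   # injected events recovered
    precision = true_pos / n_flag                     # flagged curves that were injected
    return recall, precision
```

Because the injected events are known exactly, both sensitivity (recall) and precision can be measured without relying on TOI labels.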


### Follow-Up 2.4: Can FFI-based models find anomalies that are missed in 2-min cadence data?

This has not yet been tested. However, if the FFI-based population defines a new normal, then some light curves that were considered normal in 2-min cadence analysis might appear anomalous when compared against FFI variability patterns. Once Part 2 clustering is complete, we can examine whether new anomaly types emerge uniquely in FFI-trained models.

---

### Question 3: How representative are the 16k light curves per sector?
  
The ~16,000 light curves per sector (from SPOC 2-minute cadence targets) may not be fully representative of the entire TESS dataset, particularly compared to FFI-based light curves. However, they are standardized, high-quality, and computationally manageable — making them suitable for hypothesis testing. Their use is also consistent with prior work, such as Crake & Martínez-Galarza (2023), which this study builds upon.



### Follow-Up 3.1: In what sense are these 16k curves possibly *not* representative of full TESS sectors?


They exclude the vast majority of stars observed in TESS sectors — especially those only available through FFIs. The 2-min targets are typically bright, nearby dwarfs selected for known scientific interest, leading to a biased sample with less astrophysical diversity.



### Follow-Up 3.2: How did you account for any imbalance or bias when selecting test subsets from the 16k curves?


We used stratified random sampling to match the TOI-to-total curve ratio of the full 16k set. This method was used for both the 4000-light-curve test subset (used in URF-4 subvariant evaluation) and the 10 subsets (used in α-variant testing). It ensured uniformity and a fair comparison across models.
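
The stratified draw can be sketched as follows, assuming a boolean array that marks which light curves are TOIs. The function name `stratified_subset` and the seed handling are hypothetical; the actual selection code may differ.

```python
import numpy as np

def stratified_subset(is_toi, subset_size, seed=0):
    """Indices of a random subset that preserves the TOI-to-total ratio."""
    is_toi = np.asarray(is_toi, dtype=bool)
    rng = np.random.default_rng(seed)

    toi_idx = np.flatnonzero(is_toi)
    other_idx = np.flatnonzero(~is_toi)

    # Number of TOIs needed to match the ratio in the full set.
    n_toi = int(round(subset_size * toi_idx.size / is_toi.size))
    n_other = subset_size - n_toi

    chosen = np.concatenate([
        rng.choice(toi_idx, size=n_toi, replace=False),
        rng.choice(other_idx, size=n_other, replace=False),
    ])
    rng.shuffle(chosen)  # avoid TOIs sitting at the front of the subset
    return chosen
```

Calling it with different seeds would give independent subsets with the same TOI fraction, which is the property the comparison across models relies on.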



### Follow-Up 3.3: Why is this sample still valid for hypothesis testing?


Because the primary goal was to test how synthetic set design affects URF performance in a controlled setting. The relative behavior of models — not the absolute astrophysical coverage — was the focus. This dataset offers clarity, consistency, and efficiency for such controlled evaluations.


### Follow-Up 3.4: What limitations does this dataset impose on the generalizability of your conclusions?


Being restricted to 2-min cadence targets means the results might not generalize to FFI data, which includes more diverse and noisier light curves. Definitions of normality, anomaly thresholds, and performance metrics like TOI recall or importance AUC may shift significantly with FFI-based studies.



### Follow-Up 3.5: How could future CLARA research address the representativeness limitations of the current 16k light curve subset?


By scaling testing to FFI-based datasets using high-performance computing. This would allow us to evaluate how the performance-to-input feature mapping changes across a broader population and define correction strategies or new models suited for FFI-normalized anomaly detection.

---


### Question 4: Are TOIs evenly distributed in the set? Are there biases?


No, TOIs are not evenly distributed. In Sector 1, there were 175 TOI-associated FITS files out of ~16,000 light curves; in Sector 2, the number was 195. The distribution appears sparse, and there is no evidence of uniform coverage across all TICs in a sector.



### Follow-Up 4.1: Have you tried to quantify TOI biases with respect to sky position or magnitude?
No. We have not yet correlated astrometric or astrophysical properties to the TESS data. For URF feature extraction, we used only flux time series and Lomb-Scargle periodogram power values (vector sizes 3000 and 1000 respectively).



### Follow-Up 4.2: Would this bias in TOI distribution affect the model’s ability to generalize?
This has not been definitively shown, but we believe it does not. The bias is consistent (~175–195 TOIs out of 16k light curves per sector). Since only 1500 curves were used as the real feature set during URF training and hyperparameter search, and the URF-4 α = 0.5 model passed the generalization test, the model appears robust to such distributional bias.



### Follow-Up 4.3: Do we know what selection effects determine which TICs are assigned TOIs in SPOC data?
No. This hasn't been explored in our work. TOIs are used solely as a proxy benchmark for evaluating anomaly detection, not as an unbiased sample of transit candidates.



### Follow-Up 4.4: Do TOIs cluster in anomaly score space, or are they evenly spread?
They tend to cluster near the top of the anomaly score distribution. In our analysis, high-importance TOIs were often found within the top 5–20% of the anomaly score percentiles, especially in balanced and low-α URF-4 variants.
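
One way to quantify this concentration is to measure what fraction of TOIs falls above a given score percentile. The sketch below assumes an array of anomaly scores and a boolean TOI mask; the function name `toi_fraction_in_top` is hypothetical.

```python
import numpy as np

def toi_fraction_in_top(scores, is_toi, top_percent):
    """Fraction of TOIs whose anomaly score lies in the top `top_percent` percent."""
    scores = np.asarray(scores, dtype=float)
    is_toi = np.asarray(is_toi, dtype=bool)
    # Score value that separates the top `top_percent` percent from the rest.
    cutoff = np.percentile(scores, 100.0 - top_percent)
    return float(np.mean(scores[is_toi] >= cutoff))
```

If TOIs were spread evenly through the score distribution, the returned fraction would be close to `top_percent / 100`; values well above that indicate clustering near the top.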



### Follow-Up 4.5: Could uneven representation of astrophysical TOI types affect model performance across anomaly categories?
Possibly. Our synthetic sets used fixed parameter configurations and transit models (box or Mandel-Agol). If TOIs skew heavily toward certain types (e.g., deep, short-period hot Jupiters), then less-represented types may be underperforming. However, generalization tests suggest the model handles diverse anomaly shapes reasonably well.

---

### Question 5: What preprocessing (detrending, smoothing, normalization) was used, and how does it affect anomalies?

For each light curve, we generated a 4000-length feature vector:  
- The first 3000 values were normalized PDCSAP flux points from the start of the light curve.  
- The remaining 1000 values came from the Lomb-Scargle periodogram’s power spectrum.  

This setup captures both the direct flux time series and its dominant periodic features — enabling the URF to learn from both temporal behavior and frequency domain patterns. Normalization was critical to ensure comparability between flux values across stars, and truncation was used for computational efficiency.
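
The feature layout described above can be sketched as follows. This is an illustrative reconstruction, not the pipeline's code: the periodogram is a compact classical Lomb-Scargle written in NumPy as a stand-in for a library routine such as `astropy.timeseries.LombScargle`, the frequency limits `f_min` and `f_max` are assumptions, and computing the periodogram on the truncated series (rather than the full curve) is also an assumption. NaN cadences are assumed to have been removed beforehand.

```python
import numpy as np

N_FLUX, N_POWER = 3000, 1000  # vector sizes stated in the text

def lomb_scargle_power(time, flux, freqs):
    """Classical (unnormalized) Lomb-Scargle power at the given frequencies."""
    y = flux - flux.mean()
    omega = 2.0 * np.pi * freqs[:, None]               # shape (n_freq, 1)
    # Time offset tau that makes the sine and cosine terms orthogonal.
    tau = np.arctan2(np.sin(2 * omega * time).sum(axis=1),
                     np.cos(2 * omega * time).sum(axis=1)) / (2 * omega[:, 0])
    arg = omega * (time[None, :] - tau[:, None])        # shape (n_freq, n_time)
    c, s = np.cos(arg), np.sin(arg)
    return 0.5 * ((c @ y) ** 2 / (c ** 2).sum(axis=1)
                  + (s @ y) ** 2 / (s ** 2).sum(axis=1))

def build_feature_vector(time, flux, f_min=0.05, f_max=20.0):
    """Concatenate 3000 median-normalized flux points with 1000 power values.

    f_min and f_max (cycles per day) are assumed, not taken from the study.
    """
    t = np.asarray(time[:N_FLUX], dtype=float)
    f = np.asarray(flux[:N_FLUX], dtype=float)
    f = f / np.nanmedian(f)                             # normalize to a baseline of 1
    freqs = np.linspace(f_min, f_max, N_POWER)
    power = lomb_scargle_power(t, f, freqs)
    return np.concatenate([f, power])                   # length 4000
```

Keeping the two blocks in fixed positions means every column of the feature matrix has the same meaning for every star, which is what a tree-based model needs in order to split on it.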


### Follow-Up 5.1: Why 3000 flux points and 1000 power values specifically?
  
The decision was inspired by Crake & Martínez-Galarza (2023), who also used flux + periodogram features for anomaly detection.  
Practically, using all ~19,000 flux points is computationally expensive. We assumed that if any periodic anomaly exists, its signature would appear in the first 3000 points.  
This trade-off balances detection capability and computational tractability.


### Follow-up 5.2: How was normalization applied, and why is it important?
 
Normalization was done with the `.normalize()` method of the light curve object (a `lightkurve` method; `lightkurve` builds on `astropy`), applied after creating the light curve from time and flux data.  
It standardizes the scale of flux values across all stars, allowing the URF to compare light curves meaningfully.  
Without normalization, variations due to stellar brightness differences could obscure true anomalies.
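
To make the effect concrete, the sketch below assumes that normalization means dividing by the median flux, which is how the `lightkurve` method named `.normalize()` behaves. The helper `median_normalize` and the example flux values are hypothetical.

```python
import numpy as np

def median_normalize(flux):
    """Divide a flux series by its median so the baseline sits near 1.0."""
    flux = np.asarray(flux, dtype=float)
    return flux / np.nanmedian(flux)

# Two stars with the same 1% dip but very different brightness (made-up values).
shape = np.ones(100)
shape[40:45] = 0.99
bright_star = 50000.0 * shape
faint_star = 800.0 * shape

same_after_norm = np.allclose(median_normalize(bright_star),
                              median_normalize(faint_star))
```

After normalization the two curves are identical, so the forest sees the relative dip rather than the absolute brightness.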


### Follow-up 5.3: Why was truncation to 3000 flux points considered safe?
  
While full light curves have ~19,000 points, analyzing all of them is inefficient and often redundant.  
Since many anomalies (e.g. transits or flares) repeat or occur early, we assumed the first 3000 points, which span roughly 4.2 days at 2-minute cadence, would usually capture relevant behavior.  
This assumption held well during URF training and testing phases.
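
The arithmetic behind this assumption can be checked directly. The sketch below ignores data gaps, and the helper name `guaranteed_transits` is hypothetical.

```python
CADENCE_MIN = 2.0
N_KEPT = 3000
N_FULL = 19000  # approximate full-sector length quoted in the text

window_days = N_KEPT * CADENCE_MIN / (60 * 24)   # time covered by the kept points
full_days = N_FULL * CADENCE_MIN / (60 * 24)     # time covered by the full curve

def guaranteed_transits(period_days, window=window_days):
    """Minimum number of transits of a given period that must fall in the window."""
    return int(window // period_days)
```

Under these assumptions, a signal with a period shorter than the window is certain to appear at least once, while a longer-period signal appears only if one of its events happens to fall early in the sector.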


### Follow-up 5.4: Why was the Lomb-Scargle periodogram used, and how do its power values help in detecting anomalies?
  
Lomb-Scargle is well suited to time series with gaps or uneven sampling, such as TESS light curves.  
Its power spectrum identifies dominant periodicities, which helps the URF learn what constitutes a "periodic anomaly."  
The peak frequency from the power spectrum is later used to phase-fold light curves and study recurring patterns more clearly.
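
The folding step can be sketched as below, assuming the period is taken as the inverse of the peak frequency. The function name `phase_fold` and the `epoch` argument are hypothetical.

```python
import numpy as np

def phase_fold(time, flux, period, epoch=0.0):
    """Fold a light curve on `period`.

    Returns phase in [-0.5, 0.5) and the matching flux, both sorted by phase,
    so that events at `epoch + k * period` line up at phase 0.
    """
    time = np.asarray(time, dtype=float)
    flux = np.asarray(flux, dtype=float)
    phase = ((time - epoch) / period + 0.5) % 1.0 - 0.5
    order = np.argsort(phase)
    return phase[order], flux[order]
```

For a periodogram peak at frequency `f_peak` (cycles per day), the call would be `phase_fold(time, flux, period=1.0 / f_peak)`; recurring dips then stack on top of each other instead of being spread along the time axis.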


### Follow-up 5.5: How might preprocessing choices introduce biases or affect anomaly detection?
  
Normalization and truncation can inadvertently suppress low-amplitude or long-duration anomalies.  
For instance, an overly aggressive normalization may flatten subtle dips, and truncating at 3000 points may miss late-occurring periodic events.  
Thus, preprocessing decisions involve trade-offs between speed, sensitivity, and signal completeness.

---


# II. URF Architecture & Model Logic

### Question 1: Why URF instead of other anomaly detectors (e.g., autoencoders, isolation forests, VAEs)?

Crake and Martínez-Galarza (2023) used unsupervised random forests (URFs) for anomaly detection. Our project began as a replication of that approach, aiming to identify new anomalies using URFs. However, we later discovered that URFs could be tuned to exhibit specific behaviors based on how their synthetic contrast set was designed. This prompted us to explore the broader behavior space of URFs by varying synthetic set parameters and analyzing performance metrics such as TOI recall and feature importance.

### Follow-up 1.1: Were other models (autoencoders, VAEs, isolation forests) tested?
No. This study focused exclusively on URFs to extend the methodology of Crake & MG (2023). While other anomaly detection models could be valuable, our aim was to systematically understand the behavior of URFs under controlled synthetic set designs.

### Follow-up 1.2: Can URFs be controlled to favor different scientific goals?
Yes. From our experiments, we observed that different synthetic set configurations lead to URFs with distinct performance profiles. We defined two key metrics — TOI recall and TOI importance — and demonstrated that URFs could be tuned to favor either by adjusting the input features of the synthetic set. These preferences remained consistent across sectors, demonstrating generalizable model behavior.

### Follow-up 1.3: Are there limitations to what URFs can learn?
Yes. URFs exhibit predictable behavior only within specific ranges of synthetic set design parameters. For example, noise levels between 50 and 300 ppm consistently produced viable models, while values outside this range led to erratic or degenerate behavior. Additionally, conflicting goals like maximizing recall and importance while minimizing anomaly rate could not always be satisfied together.

### Follow-up 1.4: Why is URF considered scientifically interpretable?
URFs allow researchers to define and prioritize their goals through interpretable metric combinations. By using a small labelled set of 4000 light curves with known TOIs, we computed a correlation matrix between metrics and applied a combined scoring formula weighted by user-defined coefficients like α. This provides transparency in model selection and aligns model behavior with scientific intent.

### Follow-up 1.5: Does URF learn anything meaningful about real data, or just statistical artifacts?
Most likely both — but importantly, URFs respond to statistical patterns that correlate with real astrophysical behavior, such as dips and variability. This is supported by the high TOI recall and importance scores observed across sectors, particularly for alpha-tuned variants. CLARA Part 2 will further test this by correlating anomalies with physical properties.

---

### Question 2: What hyperparameters were fixed vs randomized across URF-4 subvariants?
URF-4 subvariants used a fixed hyperparameter search space inspired by Crake & MG (2023). For each synthetic set design, we applied a random search using the following distributions:

- `n_estimators`: 10 evenly spaced values between 50 and 200
- `max_features`: [`sqrt`, `log2`]
- `max_depth`: [100, 300, 500, 700, 900, 1000, None]
- `min_samples_split`: [2, 4, 7, 10]
- `min_samples_leaf`: [1, 2]
- `bootstrap`: [True, False]
- `warm_start`: [True, False]
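
Written out as a Python dictionary, the search space above might look like the sketch below. Rounding the evenly spaced `n_estimators` values to integers is an assumption, as are the helper `sample_config` and the number of draws. A dictionary of this shape is what scikit-learn's `RandomizedSearchCV` accepts as `param_distributions`; here the sampling is done with the standard library to keep the example self-contained.

```python
import random
import numpy as np

param_distributions = {
    # 10 evenly spaced values between 50 and 200, rounded to integers (assumed)
    "n_estimators": [int(v) for v in np.linspace(50, 200, 10).round()],
    "max_features": ["sqrt", "log2"],
    "max_depth": [100, 300, 500, 700, 900, 1000, None],
    "min_samples_split": [2, 4, 7, 10],
    "min_samples_leaf": [1, 2],
    "bootstrap": [True, False],
    "warm_start": [True, False],
}

def sample_config(space, rng):
    """Draw one configuration by picking uniformly from each list."""
    return {name: rng.choice(values) for name, values in space.items()}

# Size of the full grid, to show why random search is used instead.
n_combinations = 1
for values in param_distributions.values():
    n_combinations *= len(values)

rng = random.Random(42)
candidates = [sample_config(param_distributions, rng) for _ in range(5)]
```

The full grid contains 4480 combinations, so drawing a limited number of random configurations per synthetic set keeps the search tractable on modest hardware.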

### Follow-up 2.1: What hyperparameters were fixed for all models?
Parameters not included in the search space were fixed for all subvariants:
`criterion = 'gini'`, `random_state = 42`, `oob_score = False`, `class_weight = None`, `ccp_alpha = 0.0`, `max_samples = None`, `monotonic_cst = None`, `min_impurity_decrease = 0.0`.

### Follow-up 2.2: Why was this specific search space used?
The configuration was directly inspired by Crake & MG (2023), balancing model diversity with computational efficiency. It covers a reasonable range of decision tree depths, ensemble sizes, and split strategies.

### Follow-up 2.3: Was hyperparameter tuning linked to performance?
Yes. After random search, each URF model was evaluated on a labelled validation set of 4000 light curves using metrics like TOI recall and importance. A combined scoring formula with varying α was used to identify top-performing models under different metric priorities.
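
The combined scoring formula is not reproduced in this section, so the sketch below assumes the simplest plausible form: a convex combination in which α weights TOI recall and (1 − α) weights TOI importance. Both the functional form and the direction of the weighting are assumptions, and the metric values are made up purely for illustration.

```python
import numpy as np

def combined_score(recall_auc, importance_auc, alpha):
    """Assumed form: alpha weights recall, (1 - alpha) weights importance."""
    return (alpha * np.asarray(recall_auc, dtype=float)
            + (1.0 - alpha) * np.asarray(importance_auc, dtype=float))

# Hypothetical metric values for three model variants (not real results).
recall = np.array([0.80, 0.60, 0.70])
importance = np.array([0.40, 0.90, 0.70])

# Index of the top-ranked variant for each alpha.
best = {alpha: int(np.argmax(combined_score(recall, importance, alpha)))
        for alpha in (0.3, 0.5, 0.9)}
```

With these made-up numbers, the importance-heavy variant wins at α = 0.3 and 0.5 while the recall-heavy variant wins at α = 0.9, which illustrates how a single coefficient can change which model is selected.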

### Follow-up 2.4: Were any models found to have identical hyperparameters?
Yes. Out of 36 URF-4 models, 32 shared their hyperparameter configuration with at least one other model, forming repeated clusters. This indicates that different synthetic feature sets can result in the same optimal hyperparameters.

### Follow-up 2.5: Was the hyperparameter search space fixed for every synthetic set?
Yes. The same hyperparameter distribution was used for all synthetic set configurations to ensure fair comparison of resulting models.

---

### Question 3: Is there any risk of overfitting to the synthetic contrast class?
Yes — as shown by URF-3, which used synthetic curves derived from real TOI features. This model heavily overfit and performed poorly outside its sector. We therefore returned to uniform synthetic distributions as used in Crake & MG (2023), but explored a range of values for features like noise and cadence.

### Follow-up 3.1: Did URF-4 overfit in any observable way?
No significant overfitting was observed for URF-4 variants. This is likely because the synthetic data remained uniformly distributed and did not mimic TOI structures.

### Follow-up 3.2: Can fine-grained tuning of synthetic features help generalization?
Yes — but only within viable ranges. For example, noise_ppm between 50 and 300 produced stable models. Outside these ranges, URF behavior became unstable or collapsed entirely.

### Follow-up 3.3: Do anomaly scores show selective learning of TOIs?
Yes. Despite a low proportion of TOIs per sector (~200 out of 16,000), URF-4 models achieved TOI recall rates comparable to or exceeding their anomaly rate, suggesting some degree of meaningful signal extraction rather than random flagging.

### Follow-up 3.4: Could real, labelled negative data improve training?
Unlikely. We tested this in URF-3 using confirmed TOIs, and the model overfit. Without a uniform and representative real-negative dataset, URFs generalize poorly when trained on real data distributions.

### Follow-up 3.5: Are anomaly scores tied to astrophysical structure?
That’s the working hypothesis. While URFs respond to statistical noise, the boundaries they learn appear aligned with real structure (e.g. transits, eclipses). CLARA Part 2 will evaluate this by testing anomaly score correlations with astrophysical properties.

---

### Question 4: Why use terminal node population as the scoring heuristic?
This follows the methodology of Baron & Poznanski (2017) and MG21. The URF assigns anomaly scores by:

- Training a classifier to distinguish real from synthetic curves
- For each tree, recording how many real curves land in each terminal node
- Computing a similarity score (S) for each real curve as the average fraction of real data in the same leaf
- Defining anomaly score = 1 − S

A score close to 1 means the object is almost always alone in its terminal node (maximally anomalous), while a score close to 0 means it is almost always grouped with most of the other real curves.
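
The scoring steps above can be sketched as follows, assuming the terminal-node index of every real curve in every tree is already available as an array (for example, from the `apply` method of a fitted scikit-learn `RandomForestClassifier`). The function name `anomaly_scores` is hypothetical, and this is a reading of the description above rather than the pipeline's own code.

```python
import numpy as np

def anomaly_scores(leaf_ids):
    """Anomaly score = 1 - S from terminal-node populations.

    leaf_ids: (n_real, n_trees) array holding the terminal-node index of each
    real curve in each tree.
    """
    leaf_ids = np.asarray(leaf_ids)
    n_real, n_trees = leaf_ids.shape
    similarity = np.zeros(n_real)
    for t in range(n_trees):
        # Population of each leaf in this tree, mapped back to every curve.
        _, inverse, counts = np.unique(leaf_ids[:, t],
                                       return_inverse=True, return_counts=True)
        similarity += counts[inverse] / n_real   # fraction of real data sharing the leaf
    similarity /= n_trees                        # average over trees
    return 1.0 - similarity

# Toy example: 4 real curves, 2 trees; the last curve is alone in both trees.
toy_scores = anomaly_scores([[0, 0], [0, 0], [0, 1], [1, 2]])
```

In this toy case the isolated curve scores 0.75, which is 1 − 1/N for N = 4; with thousands of real curves the same situation gives a score very close to 1.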

### Follow-up 4.1: Why choose this scoring method?
It provides a principled and scalable way to quantify “anomalousness” based purely on how the forest partitions the feature space. It’s consistent across models and interpretable.

### Follow-up 4.2: Is performance stable across sectors?
Yes — as long as the synthetic set remains in the stable regime. Alpha-variant behavior for each metric is reproducible across sectors.

### Follow-up 4.3: Does the method have scientific value?
Yes — while the scoring method is statistical, it produces scientifically meaningful separation across metrics like TOI recall and importance, which hold consistently across sectors.

### Follow-up 4.4: Can scores be reproduced?
Yes — using the same seed (random_state = 42) and same hyperparameter configuration, anomaly scores are reproducible.

### Follow-up 4.5: Can anomaly scores be tied to anomaly types?
Not yet — but this is a core goal of CLARA Part 2, which aims to correlate score distributions with specific astrophysical or astrometric properties.

---

### Question 5: Does the URF learn meaningful separation from synthetic to real data, or just "weirdness"?
We argue it learns both: the “weirdness” that the URF responds to often aligns with real astrophysical structure (transits, dips). This is supported by consistent TOI recall and importance across alpha variants.

### Follow-up 5.1: Has this been validated?
Not fully — but it is planned in Part 2 of CLARA, which will map anomaly scores and clusters to real astrophysical object classes and properties.

### Follow-up 5.2: Could URFs be learning pure noise?
It’s unlikely. The ratio of TOIs flagged vs total anomalies across sectors suggests some structure is being detected beyond noise.

### Follow-up 5.3: Could stronger validation help?
Yes — especially if we map scores and anomalies back to physical object classes using Gaia, SIMBAD, or RV surveys. That would offer direct confirmation.

### Follow-up 5.4: Does URF respond to signal boundaries?
Yes — though it learns statistical thresholds, those appear to correspond to real features in the data, based on recall and importance behavior.

### Follow-up 5.5: Is it either pure noise detection or structure detection?
No — it’s likely a mix. The model reacts to statistical deviation, but that deviation often reflects meaningful, structured signals in the light curves.

---

# III. Synthetic Set Design
Why inject transits into the synthetic set rather than the real set (like traditional recovery testing)?

Why not use completely random synthetic flux curves?

How were parameters like duration, S/N, cadence chosen for each variant?

Are the synthetic light curves realistic enough?

Could injecting into real light curves lead to stronger contrast?

What if the synthetic set is too similar to real ones — would it make scoring noisier?

# IV. Evaluation Metrics
Why choose TOI Recall AUC and Importance AUC as the combined score axes?

Why not include binary recall, PR-AUC, or anomaly precision?

Why 20% for “top N% importance” — is that arbitrary?

Why percentile ranges for score concentration rather than mean ranks or cumulative coverage?

Are scores consistent across random test subsets?

# V. Pipeline Execution & Runtime
How were the light curves processed so quickly — what exactly did you optimize?

Why do your models run 15× faster than MG23 despite same feature length?

Is the runtime still scalable to larger sets (e.g. FFIs, LSST)?

How many cores and threads were used for parallelism?

Could results be replicated on other machines?

# VI. Alpha Variants & Steering Behavior
Why use a combined score to rank model variants — does it reflect real scientific tradeoffs?

Why were only α = 0.3, 0.5, and 0.9 tested?

How does behavior change across the entire α range?

Do lower α models always concentrate TOIs more tightly?

Is this behavioral tuning stable across sectors?

# VII. Limitations & Boundaries
What does this pipeline not do?

Can it detect non-transit anomalies like flares, binaries, blends, systematic artifacts?

Is interpretability generalizable beyond TOIs?

What if the synthetic set is poorly designed — does URF behavior become meaningless?

How do we distinguish a “high-scoring TOI” from a “high-scoring noise artifact”?

# VIII. Scientific Discovery Potential
How many new candidates were found?

Were any flagged curves previously known or listed elsewhere (e.g. rejected TOIs)?

Can this pipeline prioritize candidates for follow-up?

Can you target non-transit science with different synthetic sets?

How much does the anomaly score correlate with astrophysical properties?

# IX. Future Use Cases (for Part 2 Preview)
Could this pipeline be applied to Kepler, LSST, ZTF, or Gaia light curves?

Can clustering methods be layered on top of the anomaly output?

Could a public UI or API allow users to tune α and score curves interactively?

What improvements can be made to scoring resolution?

Can you auto-learn optimal synthetic parameters via meta-optimization?