## Introduction

This notebook is a **continuation** of the previous work titled [*Catalog-Based Positional Matching of Carbon Stars*](https://www.kaggle.com/code/nicolsottero/catalog-based-positional-matching-of-carbon-stars), where we identified the closest infrared sources to five known carbon stars across 48 deep-sky catalogs.

In this second stage, we demonstrate the full analysis pipeline **using carbon star `s1` as a representative example**. The same methodology can later be applied to the other stars (`s0`, `s2`, `s3`, `s4`).

The goal is to:

- Build an **infrared light curve** using `Ks`-band magnitudes
- Perform data cleaning and imputation
- Apply machine learning to predict future brightness values

The full pipeline integrates:

- Renaming and organizing raw magnitude files  
- Merging them with observation metadata from `lista.cat`  
- Filtering for `Ks`-band data only  
- Interpolating and smoothing the light curve  
- Filling gaps with a **Random Forest Regressor**  
- Forecasting with a **Long Short-Term Memory (LSTM)** neural network

This project bridges traditional catalog-based analysis with modern machine learning to study long-term variability in evolved stars like carbon stars.


### Data Recap: Source Files and Metadata

As a reminder, the dataset used in this project was kindly provided by **Dr. David Merlo** and includes the following components:

- **`Merlo2015.pdf`**  
  A scientific article that provides the theoretical and observational context for the study, including previous work on carbon stars and infrared variability.

- **Multiple `.asc` files**  
  Raw catalogs of infrared sources, each corresponding to a different photometric filter (Ks, H, J, Y, Z). These files are sorted by celestial coordinates and contain magnitudes, errors, and source classification labels.

- **`lista.cat`**  
  A metadata file summarizing the content of each `.asc` file, including the observation date, filter used, and file name reference.

- **`fuentes.txt`**  
  A list of **five carbon stars**, labeled `s0` to `s4`, with their precise **J2000 equatorial coordinates**. These stars serve as the reference targets for matching and time-series analysis. Star `s1` corresponds to the coordinates listed under label `s1` in `fuentes.txt`.

- **Additional test files** (e.g., `Ks20s0`, `Ks20s1`, etc.)  
  Used for preliminary inspection and early-stage validation of magnitudes across epochs.

This notebook builds upon these data components to generate and model infrared light curves for the selected stars.


### 🔁 Recap from the Previous Notebook: Generating `_results.csv` Files

Each raw catalog file was originally named using the following pattern:

**vYYYYMMDD_NNNNN_st_tl_cat.asc**



For example:  
`v20100313_00394_st_tl_cat.asc`
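
As an illustration of how this naming pattern can be used later, the sketch below splits a catalog name into its date and sequence number. The helper `parse_catalog_name` is hypothetical and not part of the original notebooks. Reading `YYYYMMDD` as a calendar date is an assumption based only on the pattern above; the authoritative observation dates are the ones recorded in `lista.cat`.

```python
import re
from datetime import date

# Pattern taken from the naming scheme above: vYYYYMMDD_NNNNN_st_tl_cat
CATALOG_NAME = re.compile(r"^v(\d{4})(\d{2})(\d{2})_(\d{5})_st_tl_cat")


def parse_catalog_name(filename):
    """Return (date, sequence_number), or None if the name does not fit the pattern."""
    match = CATALOG_NAME.match(filename)
    if match is None:
        return None
    year, month, day, sequence = match.groups()
    try:
        return date(int(year), int(month), int(day)), sequence
    except ValueError:
        # Digits were present but do not form a valid calendar date
        return None


print(parse_catalog_name("v20100313_00394_st_tl_cat.asc"))
```

Because the pattern is anchored only at the start of the name, the same helper also accepts the derived `_results.csv` names.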

These `.asc` files contain catalogs of thousands of sources detected in infrared observations.  
The objective of the previous notebook ([*Catalog-Based Positional Matching of Carbon Stars*](https://www.kaggle.com/code/nicolsottero/catalog-based-positional-matching-of-carbon-stars)) was to extract only the closest stellar source (type `-1`) to each of the five reference carbon stars (`s0` to `s4`) from each `.asc` file.

**To facilitate future time series construction**, only the filtered brightness measurements of these 5 targets were retained.  
Each result was saved as a new CSV using the same base name plus `_results.csv`.  
For example:

- `v20100313_00394_st_tl_cat.asc` → `v20100313_00394_st_tl_cat_results.csv`

This ensured:
- Direct traceability between raw catalogs and result files.
- Simplified handling for downstream analysis.

---

#### 🧩 Code Excerpt

The following block, shown here as a reference, was used to save the filtered results:

```python
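import os
import pandas as pd
from google.colab import files  # Colab-only helper for browser downloads

# `filename` (the current .asc catalog) and `closest_stars` (the matched rows)
# are defined earlier in the previous notebook.
if closest_stars:
    # Save results to CSV, reusing the catalog's base name
    output_filename = f"{os.path.splitext(filename)[0]}_results.csv"
    df_closest = pd.DataFrame(closest_stars)
    df_closest.to_csv(output_filename, mode='w', header=True, index=False)
    files.download(output_filename)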
```

These `_results.csv` files now serve as the foundation for constructing light curves across time and filters in the current notebook.

## 1. Extracting Magnitudes for Light Curve Construction

The previous section was a general introduction; the pipeline itself starts here.

In this step, we build the file **`final_result_s1.csv`**, which Section 2 merges with the catalog metadata to produce **`merged_result_s1.csv`** for the upcoming light curve analysis.

This file is built by collecting the magnitude values of the **s1** carbon star across all catalogs. To do this, we rely on the individual **`_results.csv`** files generated in the [previous notebook](https://www.kaggle.com/code/nicolsottero/catalog-based-positional-matching-of-carbon-stars), where each file contains only the five closest stellar sources (type = -1) to the carbon stars s0 through s4.

These result files are renamed to match the identifiers listed in the **`lista.cat`** file, which serves as a catalog index.

### What is `lista.cat`?

This file associates each original `.asc` file with key metadata:
- The photometric filter used (Ks, H, J, Y, Z)
- The observation date
- The original filename

By renaming each `_results.csv` file accordingly, we ensure compatibility with `lista.cat`, allowing us to seamlessly merge brightness measurements with their corresponding filters and timestamps. This enables the construction of accurate and well-labeled time series for each carbon star.

### Extracting Magnitudes for s1 Across All Catalogs

The following code loads the `*_results.csv` files generated in the previous notebook.  
Each file contains photometric measurements for five carbon stars extracted from one catalog.

The process involves:
- Uploading these CSVs (renamed to match their original `.asc` catalog filenames).
- Parsing the SkyCoord string column to ensure valid coordinate format.
- Filtering each file to find the closest stellar source to star **s1** based on angular distance.
- Extracting its magnitude and error.

The output is a file named `final_result_s1.csv`, listing the brightness of s1 across all catalogs.  
This data will be used to build its light curve.
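
The matching criterion in the steps above is the angular distance on the sky. The notebook's cell delegates this to astropy's `SkyCoord.separation`; the sketch below only illustrates the underlying computation with the standard library, using the Vincenty form of the great-circle distance. The function name `angular_separation_arcsec` and the two candidate rows are made up for illustration; only the s1 coordinates come from this notebook.

```python
import math


def angular_separation_arcsec(ra1_deg, dec1_deg, ra2_deg, dec2_deg):
    """Great-circle distance between two sky positions, in arcseconds."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1_deg, dec1_deg, ra2_deg, dec2_deg))
    delta_ra = ra2 - ra1
    # Vincenty formula: numerically stable for very small and very large angles
    numerator = math.hypot(
        math.cos(dec2) * math.sin(delta_ra),
        math.cos(dec1) * math.sin(dec2)
        - math.sin(dec1) * math.cos(dec2) * math.cos(delta_ra),
    )
    denominator = (
        math.sin(dec1) * math.sin(dec2)
        + math.cos(dec1) * math.cos(dec2) * math.cos(delta_ra)
    )
    return math.degrees(math.atan2(numerator, denominator)) * 3600.0


# Coordinates of s1 as used in this notebook (degrees, ICRS)
S1_RA, S1_DEC = 183.99373333, -64.34371111

# Made-up candidate sources, for illustration only
candidates = [
    {"Mag": 10.5, "ra": 183.99380, "dec": -64.34370},
    {"Mag": 12.1, "ra": 183.99900, "dec": -64.34000},
]

# Keep the candidate with the smallest separation from s1
closest = min(
    candidates,
    key=lambda row: angular_separation_arcsec(row["ra"], row["dec"], S1_RA, S1_DEC),
)
print(closest["Mag"])
```

Note that the right-ascension difference is implicitly scaled by the cosine of the declination, which matters at a declination near -64°.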


```python
# Execution guard to prevent running this cell accidentally
run_block = False

if run_block:

    from google.colab import files
    import pandas as pd
    import io
    from astropy.coordinates import SkyCoord
    import astropy.units as u

    # Rebuild a SkyCoord from the repr string stored in the 'coord' column
    def parse_skycoord(coord_str):
        try:
            coord_str = coord_str.replace('<SkyCoord (ICRS): (ra, dec) in deg', '').strip()
            coord_str = coord_str.replace('>', '').replace('(', '').replace(')', '').strip()
            ra_dec_list = coord_str.split(',')
            if len(ra_dec_list) == 2:
                ra = float(ra_dec_list[0].strip())
                dec = float(ra_dec_list[1].strip())
                return SkyCoord(ra=ra*u.deg, dec=dec*u.deg, frame='icrs')
            else:
                raise ValueError("Invalid coordinate format")
        except Exception as e:
            print(f"Error parsing coord: {coord_str} -> {e}")
            return None

    def load_catalogs_and_find_star_s1(uploaded_files, star_coords):
        star_data = []
        for file_name, file_content in uploaded_files.items():
            df = pd.read_csv(io.BytesIO(file_content), header=None)
            df.columns = ['RA_h', 'RA_m', 'RA_s', 'Dec_g', 'Dec_m', 'Dec_s',
                          'Mag', 'ErrorMag', 'Fuente', 'coord', 'separation', 'carbon_star_key']
            df = df.drop(0)  # Remove extra header row
            df['coord'] = df['coord'].apply(parse_skycoord)
            df = df.dropna(subset=['coord'])
            if df.empty:
                # No parseable coordinates: skip the file instead of failing in idxmin()
                print(f"Skipping {file_name}: no valid coordinates")
                continue
            # Angular distance from each source to s1, in arcseconds
            df['separation'] = df['coord'].apply(lambda x: x.separation(star_coords).arcsecond)
            closest_star = df.loc[df['separation'].idxmin()]
            star_data.append({
                'file_name': file_name,
                'magnitude': closest_star['Mag'],
                'magnitude_error': closest_star['ErrorMag']
            })
        return star_data

    s1_coords = SkyCoord(ra=183.99373333*u.deg, dec=-64.34371111*u.deg, frame='icrs')

    uploaded_files = files.upload()

    s1_data = load_catalogs_and_find_star_s1(uploaded_files, s1_coords)

    output_df = pd.DataFrame(s1_data)
    print("List of s1 magnitudes across different catalogs:")
    print(output_df[['file_name', 'magnitude', 'magnitude_error']])

    output_df.to_csv('/content/final_result_s1.csv', index=False)
    files.download('/content/final_result_s1.csv')

else:
    print("Execution disabled")
```


## 2. Merging Metadata and Magnitudes for Ks Band

This section focuses on creating a clean, unified table (`merged_result_s1.csv`) that combines:

- The Ks-band metadata for each catalog (from `lista.cat`)
- The measured magnitudes of star **s1** from the `_results.csv` files

The process was carried out in three steps:

---

#### Filtering Ks Band Entries from `lista.cat`

The original `lista.cat` file contains metadata for all catalogs (filter band, observation date, etc.).  
A cleaned version named `CSV_lista.txt` was uploaded and processed to retain only the entries corresponding to the **Ks** filter.

- The file was loaded using pandas
- Rows with band `"Ks"` were selected
- The result was saved as `Ks_catalogs_s1.csv` for future use
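
An exact comparison such as `== 'Ks'` silently drops rows whose band label carries stray whitespace or different capitalization. Whether `CSV_lista.txt` contains such rows is not known; the sketch below only shows a defensive variant of the filter on a made-up table. The `Banda` column name comes from the notebook's code, while the sample rows are invented.

```python
import pandas as pd

# Made-up metadata rows; only the 'Banda' column name comes from the notebook
lista = pd.DataFrame({
    "file_name": ["cat_a.asc", "cat_b.asc", "cat_c.asc"],
    "Banda": ["Ks", " ks ", "H"],
})

# Normalize the band label before comparing
band = lista["Banda"].astype(str).str.strip().str.upper()
ks_only = lista[band == "KS"]

print(ks_only)
```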

---

#### Cleaning `file_name` Columns in Both Tables

To ensure a successful merge, the column `file_name` from both `Ks_catalogs_s1.csv` and `final_result_s1.csv` was sanitized.

- The file extensions `.csv` and `.asc` were stripped from all values
- This standardization allows the two datasets to match on catalog names
- The cleaned files were saved as `Ks_catalogs_s1_modified.csv` and `final_result_s1_modified.csv`
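
As a hedged alternative to plain substring replacement, the sketch below strips `.csv` or `.asc` only when it is the final suffix, so a name that happens to contain one of those strings elsewhere is left untouched. The sample names reuse the example catalog from the recap; this is an illustration, not the code the notebook ran.

```python
import pandas as pd

names = pd.Series([
    "v20100313_00394_st_tl_cat.asc",
    "v20100313_00394_st_tl_cat.csv",
    "v20100313_00394_st_tl_cat",
])

# Remove a trailing .csv or .asc only; '$' anchors the match at the end of the name
cleaned = names.str.replace(r"\.(?:csv|asc)$", "", regex=True)

print(cleaned.tolist())
```

After this step all three spellings collapse to the same key, which is what the merge on `file_name` needs.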

---

#### Merging Metadata and Photometry into a Single Table

The cleaned metadata and photometric measurements were merged based on the `file_name` column using an **outer join**.

- The outer join ensures no data is lost if a match is missing on either side
- The final result, `merged_result_s1.csv`, contains:
  - Observation metadata (e.g. filter, MJD, exposure time)
  - Photometric magnitude and error for star **s1** in each catalog

This consolidated table is now ready for time-series analysis and curve fitting.
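
One consequence of the outer join is that rows without a partner survive with missing values, for example photometry from a catalog that is absent from the Ks metadata. The sketch below, on made-up tables, uses the `indicator=True` option of `pd.merge` to label where each row came from, which makes such rows easy to inspect or drop. Column names other than `file_name`, `Banda`, `magnitude`, and `magnitude_error` do not appear, and all values are invented.

```python
import pandas as pd

# Made-up stand-ins for the cleaned metadata and photometry tables
metadata = pd.DataFrame({
    "file_name": ["cat_a", "cat_b"],
    "Banda": ["Ks", "Ks"],
})
photometry = pd.DataFrame({
    "file_name": ["cat_a", "cat_c"],
    "magnitude": [10.5, 11.2],
    "magnitude_error": [0.02, 0.03],
})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
merged = pd.merge(metadata, photometry, on="file_name", how="outer", indicator=True)
print(merged)

# Rows usable for a light curve need both metadata and a magnitude
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(matched)
```

Inspecting the `left_only` and `right_only` rows before discarding them is a cheap way to catch filename mismatches left over from the cleaning step.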


### 2.1. Filtering Ks Band Entries from `lista.cat`

The file `CSV_lista.txt` is a cleaned version of the original `lista.cat`, which lists metadata for each catalog file, including its filter band, observation date (MJD), and exposure time.

In this step, we filter the entries to keep only those corresponding to the **Ks band**, since this wavelength is of particular interest for the photometric analysis of carbon star **s1**.

The result is saved as `Ks_catalogs_s1.csv` for later merging with photometric measurements.

> Below is the code used to upload and process the file. Execution has been disabled to preserve reproducibility:


```python
# Upload and filter CSV_lista.txt (execution disabled)
run_filter = False

if run_filter:
    from google.colab import files
    import pandas as pd

    # Upload the CSV_lista.txt file
    uploaded = files.upload()

    # Read the CSV file
    csv_lista_path = 'CSV_lista.txt'
    lista_cat_df = pd.read_csv(csv_lista_path)

    # Filter for entries with the Ks band
    Ks_catalogs_s1_df = lista_cat_df[lista_cat_df['Banda'] == 'Ks']

    # Save the result for later use
    Ks_catalogs_s1_df.to_csv('Ks_catalogs_s1.csv', index=False)

    # Download the result
    files.download('Ks_catalogs_s1.csv')
else:
    print("Execution disabled")
```


### 2.2. Cleaning `file_name` Columns

To ensure a proper merge between photometric measurements and catalog metadata, we removed extensions such as `.csv` and `.asc` from the `file_name` columns in both datasets:

- `Ks_catalogs_s1.csv` (catalog metadata with band = Ks)
- `final_result_s1.csv` (magnitudes extracted for carbon star s1)

This normalization avoids mismatches caused by filename formatting and prepares the data for merging.

> Below is the code used to perform the cleaning. Execution is disabled for safety:


```python
# Clean the file_name columns (execution disabled)
run_cleaning = False

if run_cleaning:
    from google.colab import files
    import pandas as pd

    # Upload Ks_catalogs_s1.csv and final_result_s1.csv
    uploaded = files.upload()

    # Locate each uploaded file by its name
    Ks_catalogs_s1_path = [path for path in uploaded.keys() if "Ks_catalogs_s1" in path][0]
    final_result_s1_path = [path for path in uploaded.keys() if "final_result_s1" in path][0]

    # Load the dataframes
    Ks_catalogs_s1_df = pd.read_csv(Ks_catalogs_s1_path)
    final_result_s1_df = pd.read_csv(final_result_s1_path)

    # Remove file extensions from 'file_name' columns
    Ks_catalogs_s1_df['file_name'] = Ks_catalogs_s1_df['file_name'].str.replace('.csv', '', regex=False)
    Ks_catalogs_s1_df['file_name'] = Ks_catalogs_s1_df['file_name'].str.replace('.asc', '', regex=False)
    final_result_s1_df['file_name'] = final_result_s1_df['file_name'].str.replace('.csv', '', regex=False)
    final_result_s1_df['file_name'] = final_result_s1_df['file_name'].str.replace('.asc', '', regex=False)

    # Save and download the cleaned files
    Ks_catalogs_s1_df.to_csv('Ks_catalogs_s1_modified.csv', index=False)
    final_result_s1_df.to_csv('final_result_s1_modified.csv', index=False)
    files.download('Ks_catalogs_s1_modified.csv')
    files.download('final_result_s1_modified.csv')
else:
    print("Execution disabled for reproducibility.")
```


### 2.3. Merging Catalog Metadata with Photometric Results

Finally, both cleaned datasets were merged to produce a unified table:

- `Ks_catalogs_s1_modified.csv`: metadata from `lista.cat` filtered for the Ks band.
- `final_result_s1_modified.csv`: magnitude and error for carbon star s1 across all catalogs.

The merge was performed on the `file_name` column after extension cleanup.

This operation yielded a complete dataset named `merged_result_s1.csv`, ready for time series analysis and visualization.

> Code block used for merging (execution disabled for reproducibility):


```python
# Merge cleaned metadata and magnitude results (execution disabled)
run_merge = False

if run_merge:
    from google.colab import files
    import pandas as pd

    # Upload cleaned metadata and results
    uploaded = files.upload()

    # Extract file paths
    ks_catalogs_s1_filename = [path for path in uploaded.keys() if "Ks_catalogs_s1_modified" in path][0]
    final_result_s1_filename = [path for path in uploaded.keys() if "final_result_s1_modified" in path][0]

    # Load data
    ks_catalogs_s1_df = pd.read_csv(ks_catalogs_s1_filename)
    final_result_s1_df = pd.read_csv(final_result_s1_filename)

    # Merge on 'file_name'
    merged_s1_df = pd.merge(ks_catalogs_s1_df, final_result_s1_df, on='file_name', how='outer')

    # Save merged results
    merged_s1_df.to_csv('merged_result_s1.csv', index=False)
    files.download('merged_result_s1.csv')

    # Display output
    print("Merged DataFrame:")
    print(merged_s1_df)
else:
    print("Execution disabled for reproducibility.")
```


## 3. Conclusion

This notebook successfully prepares the dataset required for photometric analysis and light curve modeling of the carbon star **s1** in the **Ks band**.

Starting from the original `.asc` catalogs and the metadata in `lista.cat`, we:

- Extracted the closest source to s1 in each catalog using positional matching.
- Filtered the catalogs by Ks filter using a cleaned version of `lista.cat`.
- Matched and merged the magnitude data with catalog metadata.

The final output, `merged_result_s1.csv`, contains for each Ks-band observation:

- The corresponding photometric magnitude and error for s1.
- The associated catalog metadata, including date and exposure information.

***This file is now ready for time-series modeling and predictive analysis in the next phase.***
