### Project Description
Background and Motivation

The alluring mysteries of space have drawn human curiosity since the dawn of mankind. Among these mysteries, the search for life that evolved off our planet has been a driving endeavor in the field of astronomy and astrophysics. The first step in the search for evidence of extraterrestrial life has been to identify exoplanets, or planets that exist beyond our solar system. In 2009, NASA launched the Kepler Space Telescope for this exact purpose. Its primary mission was to survey a section of the Milky Way galaxy to discover Earth-size planets orbiting within their stars' habitable zones. Since its launch, the Kepler mission has provided an unprecedented wealth of data on exoplanets.

Our motivation for choosing a project centered on the classification of exoplanets using data from the Kepler Space Telescope stems from a deep-rooted fascination with the universe and the fundamental questions of our existence. This topic lies outside the normal scope of our majors, providing us with a unique opportunity to study a new and interesting subject while also putting into practice the data science skills we have acquired during this course.

Project Objectives

The overall objective of the project is to apply machine learning and data analysis techniques to classify and analyze the vast amounts of data collected by the Kepler mission to contribute to our understanding of the universe.

The primary objective of this project is to determine classification accuracy; that is, how accurately machine learning algorithms can classify Earth-like exoplanet candidates based on their observed characteristics. This includes identifying which data features are most effective for exoplanet classification. A secondary objective is to utilize our data science skills to determine how we can interpret the results of our models in a meaningful way and visualize the relationships within the data. This objective focuses on developing intuitive visualizations and metrics that can convey complex astronomical concepts in an accessible manner.

The benefits of this project include contributing to exoplanet research by enhancing our ability to classify potentially habitable exoplanets, thereby enriching our understanding of the universe and the quest for extraterrestrial life. Additionally, the project offers a chance to improve our machine learning, data preprocessing, and visualization skills through the use of advanced algorithms, handling high-dimensional data, and effectively communicating complex results. This endeavor serves as an opportunity to deepen our knowledge at the confluence of astronomy, data science, and computational methods, paving new paths for research and innovation.

### Data Description

This dataset is a cumulative record of all Kepler Space Telescope "objects of interest" (KOIs) and comprises 9564 entries. It covers a variety of attribute groups: Identification Columns, Exoplanet Archive Information, Project Disposition Columns, Transit Properties, Threshold-Crossing Event (TCE) Information, Stellar Parameters, KIC Parameters, and Pixel-Based KOI Vetting Statistics. Together these describe each astronomical object in detail, including its location, luminosity, and physical characteristics. The data is stored in CSV (comma-separated values) format, a plain-text format for tabular data.

This dataset was published as-is by NASA and was downloaded directly from their website. The link to the dataset is https://exoplanetarchive.ipac.caltech.edu/cgi-bin/TblView/nph-tblView?app=ExoTbls&config=koi.

The data descriptions link is https://exoplanetarchive.ipac.caltech.edu/docs/API_kepcandidate_columns.html
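As a rough illustration of how the downloaded table might be read, the sketch below parses a tiny inline stand-in for the CSV file and counts the entries per disposition. It assumes the download may begin with `#` comment lines and that the disposition is stored in a `koi_disposition` column; the sample rows and the `read_koi_table` helper are made up for illustration.

```python
import csv
import io
from collections import Counter

# Tiny inline stand-in for the downloaded file; these rows are made up.
SAMPLE_CSV = """\
# Hypothetical header comment at the top of the download
# COLUMN koi_disposition: Exoplanet Archive Disposition
kepoi_name,koi_disposition,koi_period
K00001.01,CONFIRMED,2.47
K00002.01,FALSE POSITIVE,19.90
K00003.01,CANDIDATE,1.73
K00004.01,FALSE POSITIVE,0.52
"""


def read_koi_table(handle):
    """Return the table rows as dicts, skipping '#' comment lines."""
    data_lines = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(data_lines))


rows = read_koi_table(io.StringIO(SAMPLE_CSV))
disposition_counts = Counter(row["koi_disposition"] for row in rows)
print(len(rows), dict(disposition_counts))
```

With the real file, the same helper could be called on an open file handle instead of the `io.StringIO` wrapper.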

### Ethical Data Concerns

The data in the Exoplanet Archive is not published explicitly for public use. However, the data can be used for additional research with the following acknowledgement:

"This research has made use of the NASA Exoplanet Archive, which is operated by the California Institute of Technology, under contract with the National Aeronautics and Space Administration under the Exoplanet Exploration Program."

The results of the project could potentially impact astronomers, researchers, academic institutions, and the general public interested in astronomical data. Incorrect analysis or interpretation of data could mislead the public's understanding of space research, potentially affecting research funding and public interest. If the dataset or algorithms used for analysis are biased, the project could reinforce or amplify existing prejudices. Additionally, data could be intentionally manipulated to impact public opinion or policy in adverse ways.

### Methods

This study employs a range of statistical and machine learning methods to analyze exoplanets in the Kepler Space Telescope dataset. We will begin with a data description phase. For the continuous variables in the dataset, we will calculate basic summary statistics: mean, median, standard deviation, and variance. To explore potential correlations between continuous variables, we will use Pearson correlation coefficients to identify which variables are closely associated with the presence of exoplanets.
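The descriptive step can be sketched with the Python standard library alone. The values below are made up and only stand in for two continuous columns (`koi_teq` and `koi_steff`); `describe` and `pearson` are illustrative helper names, and `pearson` assumes that neither input is constant.

```python
import math
import statistics

# Made-up values standing in for two continuous columns.
koi_teq = [793.0, 443.0, 638.0, 1395.0, 1406.0]       # equilibrium temperature (K)
koi_steff = [5455.0, 5455.0, 5853.0, 5805.0, 6031.0]  # stellar effective temperature (K)


def describe(values):
    """Basic summary statistics (sample standard deviation and variance)."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),
        "variance": statistics.variance(values),
    }


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)


print(describe(koi_teq))
print(pearson(koi_teq, koi_steff))
```

A coefficient near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear association.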

Given the data's high-dimensional nature, we will apply dimensionality reduction techniques, such as Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP), to simplify the data and reveal its structure. Through visualization, we aim to gain a more intuitive understanding of the patterns within the data.
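A minimal sketch of the PCA step, written with NumPy on synthetic data rather than the Kepler table, is given below. It standardizes each column, takes a singular value decomposition, and reports how much variance each component explains. The `pca_project` helper and the synthetic matrix are assumptions for illustration, and the sketch assumes that no column is constant. UMAP is left out because it requires a separate third-party package.

```python
import numpy as np


def pca_project(X, n_components):
    """Standardize X, then project it onto its leading principal components."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # zero mean, unit variance per column
    # Rows of Vt are the principal directions of the standardized data.
    _, singular_values, Vt = np.linalg.svd(Z, full_matrices=False)
    explained_ratio = singular_values**2 / np.sum(singular_values**2)
    scores = Z @ Vt[:n_components].T
    return scores, explained_ratio


# Synthetic data: five observed columns driven by two hidden factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.5, -1.0, 2.0, 0.3],
                   [0.2, -1.0, 1.0, 0.5, 1.5]])
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

scores, explained_ratio = pca_project(X, n_components=2)
cumulative = np.cumsum(explained_ratio)
print(scores.shape, cumulative)
```

The cumulative explained-variance ratio is one way to decide how many components to keep, and the `scores` array is what would be passed to a scatter plot.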

For the task of classifying exoplanets, our analysis will use supervised learning models, specifically linear regression models (in practice a classification variant such as logistic regression, since the disposition label is categorical) and random forests. These models will be trained on existing data to predict potential exoplanets. Regression models capture the relationships between continuous variables and the target, while random forests offer a powerful classification mechanism suited to data with a large number of features.
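The random forest step could look roughly like the sketch below, which uses scikit-learn on synthetic data. The feature names, the labeling rule, and the hyperparameters are all hypothetical; with the Kepler table, the feature matrix and labels would come from the selected columns and the disposition column instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_samples = 600
# Hypothetical feature names; only the first two influence the label.
feature_names = ["informative_a", "informative_b", "noise_a", "noise_b"]
X = rng.normal(size=(n_samples, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic binary label

# Hold out a stratified test set so accuracy is measured on unseen rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
importances = dict(zip(feature_names, model.feature_importances_))
print(f"held-out accuracy: {accuracy:.3f}")
print(importances)
```

The `feature_importances_` attribute gives one view of which features drive the classification, which ties in with the objective of identifying the most effective features.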

Furthermore, we will explore unsupervised learning techniques, such as cluster analysis, to uncover natural groupings within the data. This effort will help us reveal similarities and differences between various types of exoplanets, providing valuable insights for further research.
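One common form of cluster analysis is k-means, sketched below with scikit-learn on two synthetic, well-separated groups. The choice of k-means, the number of clusters, and the data are assumptions made for illustration, not a statement of the final method.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Two synthetic, well-separated groups of points in two dimensions.
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(100, 2))
X = np.vstack([blob_a, blob_b])

# Scale features first so no single column dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)
print(np.bincount(labels))
```

On real data the number of clusters is not known in advance, so it would have to be chosen by comparing several values.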

By integrating these methods, we aim to conduct a thorough analysis of the Kepler data, not only to accurately classify potential exoplanets but also to explore their intrinsic connections. This approach offers new perspectives on understanding planetary systems in the universe.

### Preliminary Results

Applying Principal Component Analysis to our dataset, we found that the first six principal components explain 93.17% of the variance in the data.

The first two components (PC1 and PC2) separate the thermal characteristics of exoplanets from those of their host stars. PC1 is closely associated with the equilibrium temperature (koi_teq), indicating that it is strongly influenced by the thermal properties of the exoplanets and their transit features. PC2 is significantly correlated with the stellar effective temperature (koi_steff), so it captures variation in the thermal properties of the stars; it also highlights the inverse relationship between stellar temperature and surface gravity.

The remaining components, PC3 to PC6, capture further structure in the dataset. PC3 has a strong positive correlation with koi_time0bk, so it emphasizes transit timing. PC4 is positively correlated with koi_duration and therefore represents the duration of the transits, associating longer durations with specific stellar characteristics. PC5 has a negative relationship with koi_teq, so it reflects variation toward cooler equilibrium temperatures across the dataset. PC6 stands out for its strong positive correlation with koi_model_snr (signal-to-noise ratio), which makes it important for distinguishing the quality of exoplanet detection signals.
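Statements of this kind, linking a component to a feature, can be checked by correlating each component's scores with each standardized feature. The sketch below does this with NumPy on synthetic data in which three columns share one hidden factor and two columns share another. The feature names and the `component_feature_correlations` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
u = rng.normal(size=n)  # hidden factor behind the "a" features
v = rng.normal(size=n)  # hidden factor behind the "b" features
feature_names = ["a1", "a2", "a3", "b1", "b2"]  # hypothetical names
X = np.column_stack([
    u + 0.1 * rng.normal(size=n),
    u + 0.1 * rng.normal(size=n),
    u + 0.1 * rng.normal(size=n),
    v + 0.1 * rng.normal(size=n),
    v + 0.1 * rng.normal(size=n),
])

# Standardize, then project onto the first two principal directions.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt[:2].T


def component_feature_correlations(scores, Z):
    """Pearson correlation between each component score and each feature."""
    n_components = scores.shape[1]
    corr = np.corrcoef(np.hstack([scores, Z]), rowvar=False)
    return corr[:n_components, n_components:]


corr = component_feature_correlations(scores, Z)
top_features = [feature_names[int(np.argmax(np.abs(row)))] for row in corr]
for i, name in enumerate(top_features):
    print(f"PC{i + 1}: strongest correlation with {name}")
```

The sign of a principal component is arbitrary, so the absolute value of the correlation is what identifies the dominant feature.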

After plotting the first three principal components, the visualizations suggested that PCA is not very effective at capturing variance related to the KOI disposition: the three statuses (False Positive, Candidate, and Confirmed) show only minimal clustering. There is some degree of separation between the categories, so the leading components do carry some disposition-related information, but the Candidate and Confirmed categories overlap, which suggests that their characteristics are similar. This overlap might reflect the continuum from Candidate to Confirmed status, where additional verification and observation are required.

### Peer Feedback

### Completed Milestones

### Methods Milestones

### Summary

In summary, our primary objective is to leverage machine learning and data analysis techniques to classify and analyze exoplanets using data collected by the Kepler Space Telescope. Through statistical analysis and dimensionality reduction methods like Principal Component Analysis (PCA), we've gained insights into the underlying structure of the dataset. Our findings reveal that PCA components capture essential features related to exoplanet thermal characteristics, stellar properties, transit timing, and signal quality. However, visualizations of the first three PCA components suggest limitations in effectively distinguishing between different disposition categories (False Positive, Candidate, and Confirmed), indicating potential challenges in classification accuracy related to these labels. This project represents a significant endeavor at the intersection of astronomy, data science, and computational methods, contributing to our understanding of the universe and the quest for extraterrestrial life.