## Determine class of stars
This notebook provides a script that searches [arXiv](https://arxiv.org/) to determine the class of a star. The function *search_arxiv()* runs a search for the input keyword (*search_term*) and, for each result on the returned page, looks for a class in the title and falls back to the abstract if the title has none. It returns the class that is mentioned most often across all results.

### Example: B335
*search_arxiv(search_term='b335')* will search through [these results](https://arxiv.org/search/?query=b335&searchtype=all&source=header) for the words *Class* and *class* and return the word that most frequently follows them (in this case the result is 0).

**The only thing you need to change is the search_term!**
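
The counting step described for B335 can be tried without any network access. The following is a minimal sketch, assuming a hypothetical helper `word_after_class()` and made-up example sentences (neither is part of the notebook). It uses a regular expression instead of the string splitting in *search_arxiv()*.

```python
import re
from collections import Counter

# Hypothetical helper, not part of the notebook: return the word that
# follows "Class"/"class" in a text, or None if there is no such word.
def word_after_class(text):
    match = re.search(r'\b[Cc]lass\s+(\w+)', text)
    return match.group(1) if match else None

# Made-up example sentences standing in for titles and abstracts.
texts = [
    'Infall in the Class 0 protostar B335',
    'A disk around a class 0 source',
    'Outflows from a Class I object',
    'No classification is mentioned here',
]

# Keep only the texts in which a class was found, then take the most frequent one.
found = [w for w in map(word_after_class, texts) if w is not None]
most_common_class = Counter(found).most_common(1)[0][0]
print(most_common_class)
```

Requiring whitespace after *class* means that words such as *classification* are not mistaken for a class mention.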

### Import necessary modules

In [1]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from collections import Counter

### Define functions

In [2]:
class_exceptions = ['None', 'of', 'trajectory'] # words that have been misidentified as classes

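def search_arxiv(search_term, print_bool=False):
    """Return the class most often mentioned in the arXiv results for search_term.

    Each result's title is checked first, then its abstract. Returns the most
    common class as a string, or an empty list if no class was found.
    Set print_bool=True to print what was detected for each result.
    """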
    # Let requests build and URL-encode the query string (search terms may contain spaces)
    params = {'query': search_term, 'searchtype': 'all', 'source': 'header'}
    page = requests.get('https://arxiv.org/search/', params=params)
    soup = BeautifulSoup(page.content, 'html.parser')

    all_titles = soup.find_all('p', {'class': 'title'}) 
    all_abstracts = soup.find_all('span', {'class': 'abstract-full'}) 
    class_list = []


    for i in range(len(all_titles)):

        detected_class = 'None'
        
        title = str(all_titles[i]).split('\n')[2]
        title = title.replace('<span class="search-hit mathjax">', '')
        title = title.replace('</span>', '')

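        # Match 'Class '/'class ' with a trailing space (as for the abstract) so that words
        # such as 'Classification' are skipped and a title ending in 'class' cannot raise an IndexError
        class_pos = max(title.find('Class '), title.find('class '))
        if class_pos != -1:
            # Strip surrounding punctuation so that e.g. '0,' and '0' are counted as the same class
            detected_class = title[class_pos:].split(' ')[1].strip('.,;:()')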
            s_print = fr'{i}. HERE IS A CLASS FROM TITLE: ' + detected_class

        else:
            abstract = str(all_abstracts[i]).split('\n')[1]
            abstract = abstract.replace('<span class="search-hit mathjax">', '')
            abstract = abstract.replace('</span>', '')
            
            class_pos = max(abstract.find('Class '), abstract.find('class '))
            
            if class_pos != -1:
                # Strip surrounding punctuation so that e.g. '0,' and '0' are counted as the same class
                detected_class = abstract[class_pos:].split(' ')[1].strip('.,;:()')
                s_print = fr'{i}. HERE IS A CLASS FROM ABSTRACT: ' + detected_class

            else:
                s_print = fr'{i}. No class found'

        if detected_class not in class_exceptions:
            class_list += [detected_class]
            
        if print_bool:
            print(s_print)

    final_class = Counter(class_list).most_common()
    
    if len(final_class) != 0:
        final_class = final_class[0][0]
        if print_bool:
            print(fr'Final class: {final_class}')
    else:
        if print_bool:
            print('No class found:(')
            
    return final_class
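
The search URL behind *search_arxiv()* can be previewed without sending a request. The following is a minimal sketch, assuming a hypothetical helper `build_search_url()` (not part of the notebook) that uses the standard library's `urlencode` to encode the search term. This matters for source names that contain spaces, such as 'HL Tau'.

```python
from urllib.parse import urlencode

# Hypothetical helper, not part of the notebook: build the arXiv search
# URL for a term, encoding spaces and other special characters.
def build_search_url(search_term):
    query = urlencode({'query': search_term,
                       'searchtype': 'all',
                       'source': 'header'})
    return 'https://arxiv.org/search/?' + query

print(build_search_url('b335'))
print(build_search_url('HL Tau'))
```

For 'b335' this reproduces the link given in the B335 example. For 'HL Tau' the space is encoded as `+` in the query string.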

### Use function

In [3]:
source_list = ['B335', 
               'HL Tau', 
               'HH212', 
               'Dg Tau B',
               'HH111', 
               'IRAS 15398',
               'TMC1A',
               'HH 30',
               'mms 6', 
               'Orion Source 1',
               'HH 900']

df = pd.DataFrame()
df['Source name'] = source_list

cl_list = []
for s in source_list:
    cl = search_arxiv(search_term=s, print_bool=False)
    # search_arxiv returns an empty list when no class was found; show '-' in that case
    cl_list.append(cl if len(cl) != 0 else '-')
    
df['Class'] = cl_list
df

| | Source name | Class |
|---|---|---|
| 0 | B335 | 0 |
| 1 | HL Tau | I |
| 2 | HH212 | 0 |
| 3 | Dg Tau B | I |
| 4 | HH111 | I |
| 5 | IRAS 15398 | 0 |
| 6 | TMC1A | II |
| 7 | HH 30 | 0 |
| 8 | mms 6 | II |
| 9 | Orion Source 1 | II |
