### ArXiv Metadata Analysis
#### Capstone Project, DSI-911 cohort, Lisa Paul

**Current Notebook:** 01-preprocess-EDA
> Run this notebook second. It reads in the CSV data, encodes the prediction column, and adds feature columns.

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords

import nltk
nltk.download('punkt')  # For word tokenization
nltk.download('stopwords')  # For stopwords removal

[nltk_data] Downloading package punkt to /Users/lisapaul/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/lisapaul/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
data_path = '../data/'

orig_df = pd.read_csv(data_path + "arxiv_meta_aa-single-cat.csv")

In [3]:
orig_df.head(1)


Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed
0,704.0001,Pavel Nadolsky,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",Calculation of prompt diphoton production cros...,"37 pages, 15 figures; published version","Phys.Rev.D76:013009,2007",10.1103/PhysRevD.76.013009,ANL-HEP-PR-07-12,hep-ph,,A fully differential calculation in perturba...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2008-11-26,"[['Balázs', 'C.', ''], ['Berger', 'E. L.', '']..."


In [4]:
#training file has only 32031 records (because of previous cleaning)
orig_df.shape

(32031, 14)

In [5]:
#function to encode the categories (research fields)
def encode_cats(df):
    # Make a copy of the DataFrame to avoid modifying the original data
    encoded_df = df.copy()
    
    # which column to encode? confusingly: "categories"
    categorical_column = encoded_df['categories']

    # Instantiate the encoder
    le = LabelEncoder()

    # Encode 'categories' column
    encoded_df['numeric_categories'] = le.fit_transform(categorical_column)
    
    return encoded_df
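
Since `encode_cats` discards the fitted encoder, the integer codes cannot later be mapped back to category names. The sketch below shows one possible way to keep that mapping. The helper name `encode_cats_with_encoder` and the toy data are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_cats_with_encoder(df, column="categories"):
    """Hypothetical variant of encode_cats that also returns the fitted encoder."""
    encoded_df = df.copy()
    le = LabelEncoder()
    encoded_df["numeric_categories"] = le.fit_transform(encoded_df[column])
    return encoded_df, le


# Toy data standing in for the real arXiv categories
toy_df = pd.DataFrame({"categories": ["hep-ph", "math.CO", "hep-ph", "cs.LG"]})
toy_encoded, toy_le = encode_cats_with_encoder(toy_df)

# LabelEncoder assigns integer codes in sorted order of the class labels
code_to_name = dict(enumerate(toy_le.classes_))

# Recover the original category names from the integer codes
recovered = toy_le.inverse_transform(toy_encoded["numeric_categories"])
```

Keeping the encoder (or the `code_to_name` dictionary) around would make it easier to report predictions as readable category names later on.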


In [6]:
#Call this function exactly once (per df) to create df w/ one extra column
#Cleaner code would be to call this //within// the feature-adding function
encoded_df = encode_cats(orig_df)

encoded_df.columns

Index(['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi',
       'report-no', 'categories', 'license', 'abstract', 'versions',
       'update_date', 'authors_parsed', 'numeric_categories'],
      dtype='object')

In [7]:
encoded_df.shape
#correctly has 1 extra column now

(32031, 15)

In [8]:
print("Unique CategoryNames:\t",
      len(encoded_df['categories'].unique()),
      "\nUnique CategoryNums:\t",
      len(encoded_df['numeric_categories'].unique()))

Unique CategoryNames:	 120 
Unique CategoryNums:	 120


#### Function to drop the columns I don't want.

In [9]:
def drop_cols(df):
    
    cols_to_keep = ['id', 'title', 'abstract', 'categories', 'numeric_categories']
    smaller_df = df[cols_to_keep]
    
    return smaller_df

In [10]:
smaller_df = drop_cols(encoded_df)
smaller_df.shape

(32031, 5)

#### The **add_text_features** function adds new features derived from NLP of the abstract.
> I used ChatGPT to suggest solutions to type errors.


In [11]:

def add_text_features(df):
    
    featured_df = df.copy()
    
    #Create new column containing list of the words in the abstract (for each row)
    abs_tokens = featured_df['abstract'].apply(word_tokenize)
    featured_df['abs_tokens'] = abs_tokens

    # Create new column containing abstract wordlist _without_ stopwords
    # Build the stopword set once: calling stopwords.words() for every token is very slow
    stop_words = set(stopwords.words('english'))
    # apply stopword removal across all rows
    abs_no_sw = abs_tokens.apply(lambda tokens: [token for token in tokens if token.lower() not in stop_words])
    featured_df['abs_tokens_no_sw'] = abs_no_sw
    
    
    # Most additional features are based on previous stopword removal: 
    
    # Create new column for the count of unique words in each abstract
    unique_wc = abs_no_sw.apply(lambda tokens: len(set(tokens)))
    featured_df['unique_wc'] = unique_wc

    # Create new column for the total word count of each abstract
    wc = abs_no_sw.apply(len)
    featured_df['wc'] = wc

    # Create new column for the per-row ratio of unique wc to total wc
    # (NaN when an abstract has no tokens left after stopword removal)
    unique_words_ratio = unique_wc / wc
    featured_df['unique_words_ratio'] = unique_words_ratio


    return featured_df
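
As a quick sanity check of what per-row text features should look like, here is a minimal sketch on two toy abstracts. It uses a plain whitespace split and a tiny hand-written stopword set (`TOY_STOPWORDS`) as stand-ins for NLTK's `word_tokenize` and English stopword list. Both stand-ins are assumptions made so the example runs without any NLTK downloads.

```python
import pandas as pd

# Hypothetical stand-in for the NLTK English stopword list
TOY_STOPWORDS = {"the", "a", "of", "in", "is"}

toy = pd.DataFrame({"abstract": [
    "The mass of the quark is the mass in question",
    "A graph is a set of vertices",
]})

# Whitespace split stands in for word_tokenize
tokens = toy["abstract"].str.split()
tokens_no_sw = tokens.apply(
    lambda ts: [t for t in ts if t.lower() not in TOY_STOPWORDS]
)

# Each feature is computed per row, so every abstract gets its own value
toy["wc"] = tokens_no_sw.apply(len)
toy["unique_wc"] = tokens_no_sw.apply(lambda ts: len(set(ts)))
toy["unique_words_ratio"] = toy["unique_wc"] / toy["wc"]
```

If every row of a feature column holds the same number, the value was probably aggregated over the whole column rather than computed per abstract.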

#### More thoughts about add_text_features():

#### Another, less interpretable approach would be to vectorize the abstracts instead of creating all these features manually.
##### TF-IDF would be better than CountVectorizer because:
> - it considers the whole corpus (every row of the dataframe) and down-weights words that are common across rows
> - it thus largely "ignores" the noisy words that aren't useful for classifying
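
The TF-IDF idea can be sketched with scikit-learn's `TfidfVectorizer`. The three toy abstracts below are made up for illustration, and this is not the vectorizing setup used in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up abstracts; "study" appears in every one of them
toy_abstracts = [
    "we study quark production in proton collisions",
    "we study graph colouring in sparse graphs",
    "we study neural networks for image classification",
]

# stop_words="english" uses scikit-learn's built-in stopword list
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(toy_abstracts)

# Map each term to its inverse document frequency weight
idf_by_term = {
    term: vectorizer.idf_[index]
    for term, index in vectorizer.vocabulary_.items()
}
```

In this toy corpus, `idf_by_term["study"]` comes out lower than `idf_by_term["quark"]`, because "study" occurs in every abstract and so carries less information for telling categories apart.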
        
##### Assuming, however, that I manually created more features, here are a few that could be interesting or useful:
> Create new columns for 1 or 2 parts of speech (e.g., noun, adjective)
>   - nltk.pos_tag() 

> Or, features that require the original (untokenized) abstract:
>   - Create a new column for counting special characters
>   - Add a column for general char_count
>   - Same for readability scores such as Flesch-Kincaid or SMOG
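
The two character-based features could be sketched with pandas string methods alone. What counts as a "special character" is an assumption here (anything that is not an ASCII letter, digit, or whitespace), and the toy abstracts are invented. POS tagging and readability scores are left out because they need extra NLTK data or other packages.

```python
import pandas as pd

toy = pd.DataFrame({"abstract": [
    "We compute alpha_s at NLO (next-to-leading order).",
    "A simple proof",
]})

# Total number of characters in each raw abstract
toy["char_count"] = toy["abstract"].str.len()

# Assumption: "special" means not an ASCII letter, digit, or whitespace
toy["special_char_count"] = toy["abstract"].str.count(r"[^A-Za-z0-9\s]")
```

Both columns work on the raw `abstract` text, so they would need to be computed before or alongside tokenization rather than from the token lists.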



In [12]:
#convert these comments to markdown:
# The original plan was to use 2 separate CSV chunks of the original dataset.
# But processing 2 dataframes was slow, and the extremely large files caused GitHub warnings,
# so I decided to go with the train/test split method after all, which only needs 1 dataframe here.
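
For reference, a train/test split on a single dataframe might look like the sketch below, using scikit-learn's `train_test_split`. The toy data, the 80/20 split, and the `random_state` are assumptions, not settings taken from this project. Note that `stratify` needs at least two rows per class, which may not hold for every one of the 120 categories.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the featured dataframe: 10 rows, 2 balanced classes
toy = pd.DataFrame({
    "abstract": [f"abstract {i}" for i in range(10)],
    "numeric_categories": [0, 1] * 5,
})

X_train, X_test, y_train, y_test = train_test_split(
    toy["abstract"],
    toy["numeric_categories"],
    test_size=0.2,                       # assumed 80/20 split
    random_state=42,                     # fixed seed for reproducibility
    stratify=toy["numeric_categories"],  # keep class proportions in both parts
)
```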

In [None]:
%%time 
#call on only 1 dataframe, because more is out of scope for a "prelim study"
featured_df = add_text_features(smaller_df)


In [None]:
featured_df.shape

In [None]:
featured_df.columns

In [None]:
featured_df['abs_tokens']

In [None]:
#Save the enhanced dataframe to a CSV file:

featured_df.to_csv(data_path + "train_featured.csv", index=False)
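
One thing to watch when saving: list-valued columns such as `abs_tokens` are written to CSV as text, so they come back as strings unless they are parsed on load. The sketch below shows one possible round trip using `ast.literal_eval`, with an in-memory buffer standing in for the real file. The toy dataframe is an assumption for illustration.

```python
import ast
import io

import pandas as pd

toy = pd.DataFrame({"id": [1, 2], "abs_tokens": [["a", "b"], ["c"]]})

# In-memory buffer stands in for the CSV file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Plain read: the token lists come back as strings like "['a', 'b']"
buffer.seek(0)
as_text = pd.read_csv(buffer)

# Parsing the column on load restores real Python lists
buffer.seek(0)
as_lists = pd.read_csv(buffer, converters={"abs_tokens": ast.literal_eval})
```

A binary format such as pickle or Parquet would avoid the parsing step, at the cost of files that are less easy to inspect by eye.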


Next, let's see whether any of these features will be useful for predictive modelling:

In [None]:
#correlations heatmap
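
A possible starting point for this cell is sketched below: a pairwise correlation matrix over the numeric columns, computed with pandas. The toy numbers are invented. Because the category codes are arbitrary labels rather than an ordered scale, their correlations with the other columns should be read with caution.

```python
import pandas as pd

# Invented values standing in for the numeric columns of featured_df
toy = pd.DataFrame({
    "numeric_categories": [0, 0, 1, 1, 2, 2],
    "wc": [50, 55, 80, 85, 120, 110],
    "unique_wc": [40, 42, 60, 66, 90, 85],
})
toy["unique_words_ratio"] = toy["unique_wc"] / toy["wc"]

# Pearson correlation between every pair of numeric columns
corr = toy.corr()
```

The resulting `corr` dataframe could then be passed to a plotting function such as seaborn's `heatmap` to draw the actual heatmap.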

In [None]:
# Note to self: the ~3 sections above may run faster in Colab, or after quitting some other programs
