# Citation Prediction EDA & Model Selection

This notebook analyzes the relationship between paper metadata and citation counts to determine the best features for our prediction model.

## Goals
1. Analyze citation distribution.
2. Measure the correlation between features (author count, abstract length, etc.) and citation count.
3. Train a baseline XGBoost model to estimate feature importance.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ast import literal_eval

%matplotlib inline
sns.set_theme(style="whitegrid")

In [None]:
# Load Data
df = pd.read_csv("../data/eda_papers.csv")
print(f"Loaded {len(df)} papers.")
df.head()

## 1. Feature Engineering
We extract simple features from metadata:
- Title Length
- Abstract Length
- Number of Authors
- Primary Category
- Days since publication (Age)
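
Primary Category is categorical, so it needs an encoding before it can join the numeric features. The sketch below shows one possible approach (integer codes, which tree models can split on). The column name `primary_category` and the toy values are assumptions for illustration; the actual column in `eda_papers.csv` may be named differently.

```python
import pandas as pd

# Toy frame; `primary_category` is an assumed column name, not confirmed by the dataset.
toy = pd.DataFrame({"primary_category": ["cs.LG", "cs.CL", "cs.LG", None]})

# Treat missing categories as their own level so no rows are dropped.
cat = toy["primary_category"].fillna("unknown").astype("category")

# Integer codes follow the sorted order of the category labels.
toy["category_code"] = cat.cat.codes

print(toy)
```

The codes carry no meaningful order, so this encoding suits tree-based models better than linear ones; one-hot encoding via `pd.get_dummies` is the usual alternative.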

In [None]:
# Parse timestamps as UTC so they can be subtracted from a tz-aware "now"
df['published_at'] = pd.to_datetime(df['published_at'], utc=True)
df['age_days'] = (pd.Timestamp.now(tz='UTC') - df['published_at']).dt.days

# Text features (character counts; missing text counts as length 0)
df['title_len'] = df['title'].fillna('').str.len()
df['abstract_len'] = df['abstract'].fillna('').str.len()

# Author count (splitting by pipe)
df['num_authors'] = df['authors'].apply(lambda x: len(str(x).split('|')) if pd.notna(x) else 0)

# Drop papers 30 days old or younger; the API query should already exclude them
df = df[df['age_days'] > 30].copy()

df[['title_len', 'abstract_len', 'num_authors', 'age_days', 'citation_count']].describe()
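
The first goal, analyzing the citation distribution, can be approached with a few summary statistics before any plotting. The sketch below uses a toy series in place of `df['citation_count']`; the values are made up for illustration and say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Toy citation counts standing in for df['citation_count'] (illustrative values only).
citations = pd.Series([0, 0, 1, 2, 3, 5, 8, 40, 250])

summary = {
    "share_uncited": float((citations == 0).mean()),
    "median": float(citations.median()),
    "mean": float(citations.mean()),
    "skew_raw": float(citations.skew()),
    "skew_log1p": float(np.log1p(citations).skew()),
}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A mean far above the median and a large raw skew would suggest a heavy right tail, which is the usual motivation for modeling a log-transformed target.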

## 2. Correlation Analysis

In [None]:
corr_matrix = df[['citation_count', 'title_len', 'abstract_len', 'num_authors', 'age_days']].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()
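
`DataFrame.corr()` defaults to Pearson correlation, which only captures linear relationships and can be dominated by a few highly cited papers. A rank-based (Spearman) correlation is a possible complement. The sketch below computes it as Pearson correlation on ranks, using toy data rather than the real `df`.

```python
import pandas as pd

# Toy data: citations grow monotonically but non-linearly with age (illustrative only).
toy = pd.DataFrame({
    "age_days": [40, 100, 200, 400, 800, 1600],
    "citation_count": [0, 1, 2, 5, 30, 900],
})

pearson = toy.corr().loc["age_days", "citation_count"]
# Spearman correlation equals Pearson correlation computed on ranks.
spearman = toy.rank().corr().loc["age_days", "citation_count"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

On the notebook's data, the equivalent call would be `df[cols].corr(method='spearman')`, where `cols` is the same column list used for the heatmap.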

## 3. XGBoost Feature Importance
Train a regressor to predict log(citations + 1).
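
The `+ 1` keeps the transform defined for papers with zero citations, and `np.expm1` inverts it when predictions need to be read as counts. A small round-trip sketch with made-up counts:

```python
import numpy as np

# Illustrative citation counts, including zero and a large value.
citations = np.array([0, 1, 9, 99, 999])

y = np.log1p(citations)      # log(citations + 1); defined at zero
recovered = np.expm1(y)      # exp(y) - 1 maps log-scale values back to counts

print(np.round(y, 3))
print(np.round(recovered, 3))
```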

In [None]:
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

features = ['title_len', 'abstract_len', 'num_authors', 'age_days']
X = df[features]
y = np.log1p(df['citation_count']) # Log transform target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fix the seed so the fit is reproducible alongside the fixed train/test split
model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)

preds = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print(f"RMSE (log-scale): {rmse:.4f}")

# Feature Importance
plt.figure(figsize=(10, 5))
plt.barh(features, model.feature_importances_)
plt.title("XGBoost Feature Importance")
plt.show()
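
The RMSE printed above is on the log scale, so it is easier to judge against a trivial baseline that always predicts the training mean. The sketch below uses toy arrays in place of `y_train` and `y_test`; the numbers are illustrative, not results from this dataset.

```python
import numpy as np

# Toy log-scale targets standing in for y_train / y_test (illustrative values only).
y_train_toy = np.array([0.0, 0.7, 1.1, 1.6, 2.3, 3.0])
y_test_toy = np.array([0.5, 1.5, 2.5])

# Baseline: always predict the training mean.
baseline_pred = np.full_like(y_test_toy, y_train_toy.mean())
baseline_rmse = np.sqrt(np.mean((y_test_toy - baseline_pred) ** 2))

# A log-scale error of r corresponds to a factor of exp(r) on (citations + 1).
typical_factor = np.exp(baseline_rmse)

print(f"Baseline RMSE (log-scale): {baseline_rmse:.4f}")
print(f"Typical multiplicative error: x{typical_factor:.2f}")
```

If the model's RMSE is not clearly below the baseline's, the four metadata features carry little predictive signal, and the feature importances should be read with that in mind.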