# Exploratory Data Analysis (EDA) on arXiv Scientific Research Papers

This notebook explores a CSV export of arXiv paper metadata. It checks for missing values, then plots the distribution of publication years, the number of papers per category, and the relationship between title length and citation count.

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style (set_theme is the current name; sns.set is an alias for it)
sns.set_theme(style='whitegrid')

In [None]:
# Load the dataset
data_path = '../data/arxiv_papers.csv'
arxiv_data = pd.read_csv(data_path)

# Display the first few rows of the dataset
arxiv_data.head()
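
The cells below read the `year`, `category`, `title`, and `num_citations` columns, so it can help to confirm they exist right after loading. The following is a minimal sketch, not part of the original analysis: `find_missing_columns` is a hypothetical helper, and the sample frame holds made-up rows rather than real arXiv data.

```python
import pandas as pd

# Columns referenced by the later cells of this notebook.
REQUIRED_COLUMNS = ['year', 'category', 'title', 'num_citations']


def find_missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]


# Tiny illustrative frame (made-up rows, not real arXiv data).
sample = pd.DataFrame({
    'title': ['Paper A', 'Paper B'],
    'year': [2019, 2020],
    'category': ['cs.LG', 'math.PR'],
})

missing_columns = find_missing_columns(sample)
print(missing_columns)
```

In the notebook itself, the same check would be `find_missing_columns(arxiv_data)`; an empty list means every later cell has the columns it needs.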

In [None]:
# Check for missing values
missing_values = arxiv_data.isnull().sum()
missing_values[missing_values > 0]
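
Raw counts are easier to interpret next to the share of rows they represent. The sketch below is one possible way to report both; `missing_summary` is a hypothetical helper name and the sample rows are invented for illustration.

```python
import pandas as pd


def missing_summary(df):
    """Count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'percent': (100 * counts / len(df)).round(1),
    })
    # Keep only columns that actually have gaps.
    return summary[summary['missing'] > 0].sort_values('missing', ascending=False)


# Made-up rows with deliberate gaps in 'title' and 'year'.
sample = pd.DataFrame({
    'title': ['Paper A', None, 'Paper C', 'Paper D'],
    'year': [2019, 2020, None, None],
    'category': ['cs.LG', 'cs.LG', 'math.PR', 'hep-th'],
})

report = missing_summary(sample)
print(report)
```

Calling `missing_summary(arxiv_data)` would give the equivalent table for the loaded dataset.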

In [None]:
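# Visualize the distribution of publication years
plt.figure(figsize=(12, 6))
# discrete=True gives each integer year its own bar; a fixed bin count
# can merge an uneven number of years into neighboring bins
sns.histplot(arxiv_data['year'], discrete=True, kde=True)
plt.title('Distribution of Publication Years')
plt.xlabel('Year')
plt.ylabel('Number of Papers')
plt.show()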

In [None]:
# Visualize the number of papers by category
plt.figure(figsize=(12, 6))
category_counts = arxiv_data['category'].value_counts()
sns.barplot(x=category_counts.index, y=category_counts.values)
plt.title('Number of Papers by Category')
plt.xlabel('Category')
plt.ylabel('Number of Papers')
# Right-align the rotated labels so each one ends under its own bar
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
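
If the `category` column has many distinct values, the bar chart above can become hard to read. One option, sketched here under that assumption, is to keep the most frequent categories and fold the rest into a single bucket. `top_categories` and the `'Other'` label are illustrative choices, and the sample values are made up.

```python
import pandas as pd


def top_categories(series, n=10, other_label='Other'):
    """Keep the n most frequent values and sum the remainder into one bucket."""
    counts = series.value_counts()
    top = counts.head(n)
    remainder = counts.iloc[n:].sum()
    if remainder > 0:
        top = pd.concat([top, pd.Series({other_label: remainder})])
    return top


# Made-up category labels: 4, 3, 2, and 1 occurrences respectively.
sample = pd.Series(
    ['cs.LG'] * 4 + ['math.PR'] * 3 + ['hep-th'] * 2 + ['q-bio.NC']
)

grouped = top_categories(sample, n=2)
print(grouped)
```

The result can be passed to `sns.barplot` in the same way as `category_counts`.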

In [None]:
# Analyze the relationship between title length and number of citations
# .str.len() returns NaN for missing titles, whereas .apply(len) raises a TypeError on them
arxiv_data['title_length'] = arxiv_data['title'].str.len()

plt.figure(figsize=(12, 6))
sns.scatterplot(x='title_length', y='num_citations', data=arxiv_data)
plt.title('Title Length vs Number of Citations')
plt.xlabel('Title Length')
plt.ylabel('Number of Citations')
plt.show()
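
A scatter plot shows the shape of the relationship but not its strength. As a rough complement, the sketch below computes a Spearman rank correlation as the Pearson correlation of the ranked values, which avoids assuming a linear relationship and is less sensitive to a few very highly cited papers. `rank_correlation` is a hypothetical helper and the sample numbers are invented, not taken from the dataset.

```python
import pandas as pd


def rank_correlation(df, x_col, y_col):
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    pairs = df[[x_col, y_col]].dropna()  # use only rows where both values exist
    return pairs[x_col].rank().corr(pairs[y_col].rank())


# Made-up values: citations rise with title length, but not linearly.
sample = pd.DataFrame({
    'title_length': [20, 35, 50, 65, 80],
    'num_citations': [3, 10, 12, 40, 500],
})

rho = rank_correlation(sample, 'title_length', 'num_citations')
print(round(rho, 3))
```

For the loaded data the call would be `rank_correlation(arxiv_data, 'title_length', 'num_citations')`. A value near 0 would suggest little monotonic association, and a correlation of any size would not by itself show that title length influences citations.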

## Conclusion

This notebook examined three aspects of the arXiv dataset: the distribution of publication years, the number of papers per category, and the relationship between title length and citation count. Next steps include handling any missing values found in the check above and preparing the data for modeling.