# Exploratory Data Analysis

## Descriptive Statistics 
### Aim 
- Obtain basic statistics for textual lengths (like headline length).
- Count the number of articles per publisher to identify which publishers are most active.
- Analyze the publication dates to see trends over time, such as increased news frequency on particular days or during specific events.
### Technical Approach
- Use a modular implementation wherever possible.
- Clean up the code.
- Calculate basic technical indicators.
- Visualize the data.



In [30]:
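# Import the dataset helpers and required libraries
import os
import pandas as pd

# Switch the working directory to ../scripts so the local utils module can be imported;
# relative paths below are resolved from that directory
os.chdir('../scripts/')
import utils as util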


data_path = "../../data/week1/raw_analyst_ratings.csv"
df = util.read_csv_file(data_path)

In [31]:
df = df.get("data")
df.head()

| Unnamed: 0 | headline | url | publisher | date | stock |
| --- | --- | --- | --- | --- | --- |
| 0 | Stocks That Hit 52-Week Highs On Friday | https://www.benzinga.com/news/20/06/16190091/s... | Benzinga Insights | 2020-06-05 10:30:54-04:00 | A |
| 1 | Stocks That Hit 52-Week Highs On Wednesday | https://www.benzinga.com/news/20/06/16170189/s... | Benzinga Insights | 2020-06-03 10:45:20-04:00 | A |
| 2 | 71 Biggest Movers From Friday | https://www.benzinga.com/news/20/05/16103463/7... | Lisa Levin | 2020-05-26 04:30:07-04:00 | A |
| 3 | 46 Stocks Moving In Friday's Mid-Day Session | https://www.benzinga.com/news/20/05/16095921/4... | Lisa Levin | 2020-05-22 12:45:06-04:00 | A |
| 4 | B of A Securities Maintains Neutral on Agilent... | https://www.benzinga.com/news/20/05/16095304/b... | Vick Meyer | 2020-05-22 11:38:59-04:00 | A |
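
The notebook does not show `utils.read_csv_file`, but the `.get("data")` call suggests it returns a dict-like object rather than a bare DataFrame. The following is only a sketch of what such a helper could look like; the function body and every key other than `"data"` are assumptions, not the author's implementation.

```python
import io

import pandas as pd


def read_csv_file(file_path_or_buffer):
    """Hypothetical loader: return the DataFrame plus basic metadata in a dict."""
    data = pd.read_csv(file_path_or_buffer)
    return {
        "data": data,                   # the key the notebook reads
        "n_rows": len(data),            # assumed extra metadata
        "columns": list(data.columns),  # assumed extra metadata
    }


# Usage with a tiny in-memory CSV instead of the real dataset
sample_csv = io.StringIO(
    "headline,publisher\n"
    "Stocks That Hit 52-Week Highs On Friday,Benzinga Insights\n"
)
result = read_csv_file(sample_csv)
df_sample = result.get("data")
```

Returning a dict keeps room for status or metadata fields, at the cost of the extra `.get("data")` step seen above.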


### Obtain basic statistics for textual lengths (like headline length).

In [32]:
# Calculate the length of each headline
df['headline_length'] = df['headline'].apply(len)

# Summary statistics for headline length; fractional values (such as mean and std) are truncated
# to whole numbers for readability, although the resulting Series keeps its float64 dtype
headline_length_stats = df['headline_length'].describe().apply(lambda x: int(x) if not x.is_integer() else x)

# Display the statistics
headline_length_stats

count    1407328.0
mean          73.0
std           40.0
min            3.0
25%           47.0
50%           64.0
75%           87.0
max          512.0
Name: headline_length, dtype: float64
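
As a possible alternative to `apply(len)`, the vectorized `.str.len()` accessor gives the same lengths and returns `NaN` for missing headlines instead of raising a `TypeError`. A minimal sketch on a made-up three-row sample (the `sample` frame is illustrative, not part of the dataset):

```python
import pandas as pd

# Two headlines from the preview above plus one missing value
sample = pd.DataFrame(
    {
        "headline": [
            "Stocks That Hit 52-Week Highs On Friday",
            "71 Biggest Movers From Friday",
            None,
        ]
    }
)

# .str.len() skips missing values, leaving NaN in their place
sample["headline_length"] = sample["headline"].str.len()

# describe() ignores NaN; round() keeps the summary readable
stats = sample["headline_length"].describe().round()
```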

### Count the number of articles per publisher to identify which publishers are most active.

In [33]:
# Count the number of articles per publisher
articles_per_publisher = df['publisher'].value_counts()

# Display the 20 most active publishers
articles_per_publisher.head(20)


Paul Quintaro        228373
Lisa Levin           186979
Benzinga Newsdesk    150484
Charles Gross         96732
Monica Gerson         82380
Eddie Staley          57254
Hal Lindon            49047
ETF Professor         28489
Juan Lopez            28438
Benzinga Staff        28114
Vick Meyer            24826
webmaster             20313
Benzinga_Newsdesk     19410
Zacks                 19390
Jayson Derrick        19050
Allie Wickman         18317
Shanthi Rexaline      16640
Craig Jones           16221
Wayne Duggan          12897
Nelson Hem            12590
Name: publisher, dtype: int64
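
Raw counts can be easier to interpret next to each publisher's share of all articles. The sketch below shows one way to add share and cumulative share columns; the six-row `publishers` series is invented for illustration and does not reflect the real counts.

```python
import pandas as pd

# Made-up sample of publisher names
publishers = pd.Series(
    [
        "Paul Quintaro",
        "Lisa Levin",
        "Paul Quintaro",
        "Benzinga Newsdesk",
        "Paul Quintaro",
        "Lisa Levin",
    ],
    name="publisher",
)

# value_counts() sorts by frequency, most active publisher first
counts = publishers.value_counts()

summary = pd.DataFrame({"articles": counts, "share": counts / counts.sum()})
summary["cumulative_share"] = summary["share"].cumsum()

# The two most active publishers in the sample
top_publishers = summary.head(2)
```

The cumulative share column indicates how concentrated publishing activity is among the top few publishers.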

### Analyze the publication dates to see trends over time, such as increased news frequency on particular days or during specific events.

In [38]:
# Convert the 'date' column to datetime format with UTC conversion to handle tz-aware values
df['date'] = pd.to_datetime(df['date'], errors='coerce', utc=True)

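# Drop rows whose dates could not be parsed; copy the result so the column
# assignments below do not trigger a possible SettingWithCopyWarning
df = df.dropna(subset=['date']).copy()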

# Convert to local time zone if needed (e.g., 'America/New_York')
df['date'] = df['date'].dt.tz_convert('America/New_York')

# Extract the date part only (without time)
df['date_only'] = df['date'].dt.date

# Extract the day of the week
df['day_of_week'] = df['date'].dt.day_name()

# Count the number of articles per day
articles_per_day = df['date_only'].value_counts().sort_index()

# Count the number of articles per day of the week
# (sort_index orders the day names alphabetically, not in calendar order)
articles_per_day_of_week = df['day_of_week'].value_counts().sort_index()






In [39]:
# Display the number of articles per day
articles_per_day

2009-02-13      1
2009-04-26      2
2009-04-28      1
2009-05-21      1
2009-05-26      6
             ... 
2020-06-07     25
2020-06-08    765
2020-06-09    804
2020-06-10    806
2020-06-11    544
Name: date_only, Length: 3976, dtype: int64
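
To look for days with unusually high news volume, the daily counts can be smoothed and compared against a threshold. This is a rough sketch on invented numbers: the `daily_counts` values, the 3-day window, and the "three times the median" rule are all arbitrary assumptions chosen for illustration.

```python
import pandas as pd

# Made-up daily article counts for one week
daily_counts = pd.Series(
    [20, 22, 19, 21, 95, 20, 18],
    index=pd.date_range("2020-06-01", periods=7, freq="D"),
    name="articles",
)

# Short rolling average to smooth day-to-day noise
rolling_mean = daily_counts.rolling(window=3, min_periods=1).mean()

# Hypothetical spike rule: more than three times the median daily volume
threshold = daily_counts.median() * 3
spike_days = daily_counts[daily_counts > threshold]
```

Flagged days could then be cross-checked against known market events.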

In [40]:
# Display the number of articles per day of the week
articles_per_day_of_week

Friday        16854
Monday       295793
Saturday      16344
Sunday       255278
Thursday     221207
Tuesday      300060
Wednesday    301792
Name: day_of_week, dtype: int64
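
The output above is ordered alphabetically because of `sort_index()`. If calendar order is preferred, one option is to reindex the counts against an explicit weekday list. A small sketch with made-up dates (the `WEEKDAY_ORDER` name is introduced here for illustration):

```python
import pandas as pd

WEEKDAY_ORDER = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

# Made-up publication dates: Mon, Tue, Tue, Fri, Sun
dates = pd.to_datetime(
    ["2020-06-01", "2020-06-02", "2020-06-02", "2020-06-05", "2020-06-07"]
)
day_names = pd.Series(dates).dt.day_name()

# reindex puts the counts in calendar order; days with no articles get 0
per_weekday = day_names.value_counts().reindex(WEEKDAY_ORDER, fill_value=0)
```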

### Text Analysis (Sentiment Analysis & Topic Modeling)

#### Perform sentiment analysis on headlines to gauge the sentiment (positive, negative, neutral) associated with the news.

In [None]:
from textblob import TextBlob

# Classify a headline as positive, neutral, or negative from its TextBlob polarity score
def get_sentiment(headline):
    polarity = TextBlob(headline).sentiment.polarity  # polarity lies in [-1.0, 1.0]
    if polarity > 0:
        return 'positive'
    elif polarity == 0:
        return 'neutral'
    else:
        return 'negative'

# Apply sentiment analysis on headlines
df['sentiment'] = df['headline'].apply(get_sentiment)

# Display the first few rows to check the sentiment column
df[['headline', 'sentiment']].head()
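
Once each headline has a label, the distribution of labels is usually the first thing to inspect. The sketch below applies the same sign-based rule to a handful of made-up polarity scores, so it runs without TextBlob; the scores and the `label_polarity` helper are illustrative assumptions.

```python
import pandas as pd


def label_polarity(polarity: float) -> str:
    """Map a polarity score to a label using the same sign rule as above."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


# Made-up polarity scores standing in for TextBlob output
polarity = pd.Series([0.5, 0.0, -0.25, 0.0, 0.8], name="polarity")
sentiment = polarity.apply(label_polarity)

# Fraction of headlines in each sentiment class
sentiment_share = sentiment.value_counts(normalize=True)
```

Because only an exact score of 0 counts as neutral, headlines containing no words from TextBlob's lexicon will land in that class.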


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Print the top-weighted words for each topic of a fitted model
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")
        # argsort is ascending, so the reversed slice yields the no_top_words largest weights
        print(" ".join(feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]))

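# Build a term-count matrix with CountVectorizer, dropping English stop words,
# terms that appear in more than 95% of headlines, and terms in fewer than 2 headlines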
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
X = vectorizer.fit_transform(df['headline'])

# Apply Latent Dirichlet Allocation (LDA) to extract topics
lda = LatentDirichletAllocation(n_components=10, random_state=42)
lda.fit(X)

# Display the topics
no_top_words = 10
display_topics(lda, vectorizer.get_feature_names_out(), no_top_words)
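
The slice `argsort()[:-no_top_words - 1:-1]` used in `display_topics` can be hard to read at first glance. The sketch below walks through it on a single made-up topic vector; the words and weights are invented for illustration.

```python
import numpy as np

# Made-up vocabulary and weights for one topic
feature_names = np.array(["stock", "price", "target", "earnings", "upgrade"])
topic_weights = np.array([0.1, 2.5, 0.7, 3.2, 1.4])
no_top_words = 3

# argsort returns indices ordered from smallest to largest weight
ascending_idx = topic_weights.argsort()

# Walking backwards from the end picks the indices of the largest weights first
top_idx = ascending_idx[:-no_top_words - 1:-1]
top_words = feature_names[top_idx].tolist()
```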
