## Task 1 Stage 2
#### Text Analysis (Sentiment Analysis & Topic Modeling)


# Step 1: Import Necessary Libraries

In this step, we import the libraries required for text analysis, including:
- **Data Manipulation and Visualization:** `pandas`, `numpy`, `matplotlib`, and `seaborn`.
- **Text Analysis and Sentiment Analysis:** `nltk` and `textblob`.
- **Text Vectorization:** `scikit-learn`.
- **Text Preprocessing and Topic Modeling:** `spacy` (tokenization and stop-word removal) and `gensim` (topic modeling).

Ensure all libraries are installed and the required NLTK data and spaCy model are downloaded. If you install new packages or models, restart the Jupyter kernel before continuing.
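
A quick way to confirm the environment is ready is to check that each package can be located before importing it. The sketch below uses only the standard library; the helper name `find_missing_packages` and the `REQUIRED_PACKAGES` list are illustrative assumptions, not part of the original notebook.

```python
from importlib.util import find_spec

# Import names (not pip names) for the libraries this notebook uses.
# Note: scikit-learn is imported as `sklearn`.
REQUIRED_PACKAGES = [
    "pandas", "numpy", "matplotlib", "seaborn",
    "nltk", "textblob", "sklearn", "spacy", "gensim",
]

def find_missing_packages(packages):
    """Return the top-level packages that cannot be located in this environment."""
    return [name for name in packages if find_spec(name) is None]

missing = find_missing_packages(REQUIRED_PACKAGES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

This only checks that the packages exist; it does not verify that the NLTK data or the spaCy model have been downloaded.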


In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import spacy
import gensim
from gensim import corpora

# Download NLTK data if not already present
nltk.download('vader_lexicon')
nltk.download('stopwords')

# Load the English language model for spaCy
try:
    nlp = spacy.load('en_core_web_sm')
except OSError as err:
    # Fail early instead of leaving `nlp` undefined for later cells
    raise OSError(
        "spaCy model 'en_core_web_sm' not found. "
        "Install it with: python -m spacy download en_core_web_sm"
    ) from err


# Step 2: Prepare the Data

In this step, we will prepare our dataset for sentiment analysis and topic modeling. The process includes:
1. **Loading the Data:** Read the financial news dataset into a pandas DataFrame.
2. **Data Cleaning:** Handle any missing or inconsistent data.
3. **Text Preprocessing:** Tokenize the text, remove stop words, and perform other preprocessing tasks.

Ensure you have followed Step 1 and imported the necessary libraries before proceeding with this step.

## 2.1 Load the Data

We will start by loading the dataset from the `data` folder into a pandas DataFrame.


In [None]:
# Load the dataset
data_path = '../data/raw_analyst_ratings.csv'
df = pd.read_csv(data_path)

# Display the first few rows of the dataset
df.head()


## 2.2 Check for Missing Values

We need to check for any missing values in the dataset. Handling missing values is crucial for accurate analysis. We'll identify any missing values and decide how to handle them (e.g., removing rows, imputing values).


In [None]:
# Check for missing values
df.isnull().sum()
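
Once the counts are known, one common way to handle missing text is to drop rows whose headline is missing or blank, since there is nothing to analyze in them. The sketch below runs on a small made-up DataFrame; the helper name `drop_missing_headlines` and the sample rows are illustrative, and only the `headline` column name comes from the dataset used in this notebook.

```python
import pandas as pd

def drop_missing_headlines(frame, column="headline"):
    """Return a copy without rows where `column` is NaN or blank."""
    cleaned = frame.dropna(subset=[column]).copy()
    # Treat whitespace-only strings as missing as well
    cleaned = cleaned[cleaned[column].astype(str).str.strip() != ""]
    return cleaned.reset_index(drop=True)

# Hypothetical sample data, for illustration only
sample = pd.DataFrame({
    "headline": ["Stocks rally on earnings", None, "   ", "Fed holds rates"],
    "stock": ["AAA", "BBB", "CCC", "DDD"],
})

sample_clean = drop_missing_headlines(sample)
print(len(sample), "->", len(sample_clean))
```

To apply this to the real data, you would pass `df` instead of `sample`. Whether dropping or imputing is appropriate depends on how many rows are affected and which columns they fall in.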


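## 2.3 Text Preprocessing with spaCy

We clean each headline with spaCy: the text is lowercased, then punctuation and stop words are removed.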

In [None]:
import spacy

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

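# Define a function to clean text
def clean_text_spacy(text):
    # Missing headlines are read as NaN (a float), which has no .lower()
    if not isinstance(text, str):
        return ''
    doc = nlp(text.lower())  # Convert to lowercase and process with spaCy
    # Remove punctuation, whitespace tokens, and stop words
    tokens = [token.text for token in doc
              if not (token.is_punct or token.is_space or token.is_stop)]
    return ' '.join(tokens)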

# Apply the function to the headlines
df['cleaned_headline'] = df['headline'].apply(clean_text_spacy)

# Display the first few rows of the cleaned data
df[['headline', 'cleaned_headline']].head()
