# Sentiment Analysis of Financial News Using NLTK

This notebook predicts the sentiment of financial news using NLTK (Natural Language Toolkit).

# About Dataset
https://www.kaggle.com/datasets/notlucasp/financial-news-headlines/code 

This dataset contains 3 CSV files, listed here with their shapes (rows, columns):

cnbc headlines   (3080, 3)

guardian headlines   (17800, 2)

reuters headlines   (32770, 3)


# Columns Provided in the Dataset

cnbc headline df
1. Time
2. Headlines
3. Description

guardian headline df
1. Time
2. Headlines

reuters headline df
1. Time
2. Headlines
3. Description


# What is NLTK?

The Natural Language Toolkit (NLTK) is a platform for building Python programs that work with human language data, with a focus on statistical natural language processing (NLP).

It contains text processing libraries for tokenization, parsing, classification, stemming, tagging and semantic reasoning.



# What is sentiment analysis?

Sentiment analysis is the process of detecting positive, negative, or neutral sentiment in text. Businesses often use it to detect sentiment in social data, gauge brand reputation, and understand customers.



In [None]:
pip install nltk

In [None]:
import warnings

# Ignore all warnings (not recommended in most cases)
warnings.filterwarnings("ignore")

In [None]:
# Import all the required libraries 
import nltk
#import stopwords and text processing libraries
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Download NLTK resources (only required once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('vader_lexicon')


In [None]:
#import machine learning libraries

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Basic EDA on cnbc_headlines dataset

In [None]:
# Read csv file of cnbc headlines using pandas
# Path=  /kaggle/input/financial-news-headlines/cnbc_headlines.csv

import pandas as pd

# Define the path to the CSV file
csv_path = "/kaggle/input/financial-news-headlines/cnbc_headlines.csv"

# Read the CSV file using Pandas
cnbc_df = pd.read_csv(csv_path)


In [None]:
# Display the first few rows of the DataFrame
cnbc_df.head()

In [None]:
# check the shape of cnbc headline dataset
cnbc_df.shape

In [None]:
# Check all the columns in the cnbc headline dataset
cnbc_df.columns

In [None]:
# Check the data type and non-null count of each column
cnbc_df.info()

In [None]:
# Check for missing values in all the columns of cnbc headline dataset
cnbc_df.isnull().sum()

There are 280 missing values in headlines, description and time.

In [None]:
# Drop nan values in cnbc headline dataset
cnbc_df = cnbc_df.dropna()

In [None]:
cnbc_df.shape

In [None]:
# Count the duplicate rows
cnbc_df.duplicated().sum()

In [None]:
# Drop the duplicate rows in the dataset keep the first one
cnbc_df = cnbc_df.drop_duplicates(keep='first')

cnbc_df.head()

In [None]:
# Check the shape of cnbc headline dataset
cnbc_df.shape

# Basic EDA on Guardian headlines dataset

In [None]:
# Read csv file of guardian headlines using pandas
# Path = /kaggle/input/financial-news-headlines/guardian_headlines.csv
# Define the path to the CSV file
csv_path = "/kaggle/input/financial-news-headlines/guardian_headlines.csv"

# Read the CSV file using Pandas
guardian_df = pd.read_csv(csv_path)

In [None]:
# Display the first few rows of the DataFrame
guardian_df.head()

In [None]:
# Check the shape of guardian headline dataset
guardian_df.shape

In [None]:
# Check columns of guardian headline
guardian_df.columns

In [None]:
# Check the data type and non-null count of each column
guardian_df.info()

In [None]:
# Check null values in guardian headlines dataset
guardian_df.isnull().sum()

In [None]:
# Drop duplicate rows and keep the first occurrence
guardian_df = guardian_df.drop_duplicates(keep='first')

# Display the first few rows of the DataFrame after dropping duplicates
guardian_df.head()

# Basic EDA on reuters headlines

In [None]:
# Read csv file of reuters headlines using pandas
# Path= /kaggle/input/financial-news-headlines/reuters_headlines.csv

# Define the path to the CSV file
csv_path = "/kaggle/input/financial-news-headlines/reuters_headlines.csv"

# Read the CSV file using Pandas
reuters_df = pd.read_csv(csv_path)

In [None]:
# Display the first few rows of the DataFrame
reuters_df.head()

In [None]:
# Check the shape of reuters headlines dataset
reuters_df.shape

In [None]:
# Check the columns of reuters headline dataset
reuters_df.columns

In [None]:
# Check the data type and non-null count of each column
reuters_df.info()

In [None]:
# Check for missing values in all the columns of reuters headlines dataset
reuters_df.isnull().sum()

In [None]:
# Drop the duplicate rows in reuters headlines dataset and keep the first one
reuters_df = reuters_df.drop_duplicates(keep='first')
reuters_df

# Helper functions that we will need later

### Preprocessing 

1. **Lowercase** - String comparison is case sensitive, so converting the text to lower case makes "Stock" and "stock" count as the same word.

2. **Remove punctuation** - Punctuation adds little value to this data, and a punctuation mark attached to a word makes it look like a different token from the bare word, so we strip it.

3. **Remove stopwords** - Stopwords such as I, he, she, and, but, was, were, being, and have carry little meaning on their own. Removing them reduces the number of features in our data. They are removed after tokenizing the text.

4. **Stemming** - A technique that reduces a word to its stem by stripping suffixes. The stemmed word might not be a dictionary word, so it does not necessarily carry meaning by itself.

5. **Lemmatizing** - Reduces a word to its dictionary form, called the lemma. It treats words as nouns by default. It is more accurate than stemming because it relies on a vocabulary and morphological analysis rather than suffix stripping, so it is more complex and takes more time. Use it when the output needs to remain real words.
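The first three steps above can be sketched in plain Python. This is only an illustration: `TOY_STOPWORDS` and `toy_preprocess` are hypothetical names, and the tiny stopword set stands in for NLTK's full English list that the notebook uses. Stemming and lemmatizing are left out here.

```python
# Hypothetical mini stopword list for illustration only; the notebook
# relies on NLTK's much larger stopwords.words('english') instead.
TOY_STOPWORDS = {"the", "a", "an", "and", "but", "was", "were", "is", "in", "of", "to"}

def toy_preprocess(text):
    """Lowercase, strip punctuation, and drop stopwords from a string."""
    # 1. Lowercase so "Dollar" and "dollar" become the same token
    text = text.lower()
    # 2. Keep only letters, digits, and whitespace (this drops punctuation)
    text = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    # 3. Split on whitespace and drop stopwords
    words = [w for w in text.split() if w not in TOY_STOPWORDS]
    return " ".join(words)

print(toy_preprocess("Stocks rallied, and the Dollar was weaker!"))
# prints: stocks rallied dollar weaker
```

Note that the order matters: stopwords are matched after lowercasing, so "The" and "the" are both removed.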


In [None]:
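# Create a function for preprocessing
# PorterStemmer and WordNetLemmatizer are needed by the optional branches below
# (the lemmatizer also requires nltk.download('wordnet') once)
from nltk.stem import PorterStemmer, WordNetLemmatizer

def preprocess_text(text, stemming=False, lemmatizing=False):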
    # Convert text to lowercase
    text = text.lower()
    
    # Remove punctuation
    text = ''.join([char for char in text if char.isalnum() or char.isspace()])
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [word for word in text.split() if word not in stop_words]
    
    # Apply stemming if specified
    if stemming:
        stemmer = PorterStemmer()
        words = [stemmer.stem(word) for word in words]
    
    # Apply lemmatizing if specified
    if lemmatizing:
        lemmatizer = WordNetLemmatizer()
        words = [lemmatizer.lemmatize(word) for word in words]
    
    # Join the preprocessed words back into a sentence
    processed_text = ' '.join(words)
    
    return processed_text  

# Let's begin Sentiment Analysis

In [None]:
# Import sentiment intensity analyzer

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')


# Create a SentimentIntensityAnalyzer object
sia = SentimentIntensityAnalyzer()

In [None]:
# Function to label sentiment as positive, negative, or neutral
def get_sentiment_label(text):
    # Reuse the analyzer `sia` created in the previous cell instead of rebuilding it on every call
    sentiment_scores = sia.polarity_scores(text)
    
    # Decide sentiment label based on compound score
    compound_score = sentiment_scores['compound']
    if compound_score >= 0.05:
        return 'Positive'
    elif compound_score <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

# Example usage
headline = "Stock Market Soars Amid Positive Earnings Reports"
sentiment_label = get_sentiment_label(headline)
print("Sentiment Label:", sentiment_label)


# Now working with the description column of the datasets

In [None]:
# Concatenate cnbc headlines dataset and reuters headline dataset
combined_df = pd.concat([cnbc_df, reuters_df], ignore_index=True)

In [None]:
# Check the shape of this new dataset
combined_df.shape

In [None]:
# Make a copy of new dataset 
combined_df_copy = combined_df.copy()

In [None]:
combined_df_copy

In [None]:
# Apply preprocessing function to the 'Description' of new dataset (combined_df_copy)
combined_df_copy['Description'] = combined_df_copy['Description'].apply(preprocess_text)

# Display the first few rows of the DataFrame after preprocessing
combined_df_copy.head()


### Calculate Polarity Score

Polarity score is a metric used in sentiment analysis to quantify the sentiment or emotion expressed in a piece of text. It indicates whether the text expresses a positive, negative, or neutral sentiment. Polarity scores are typically numerical values that range from -1 to 1:

    A polarity score of 1 indicates a highly positive sentiment.
    A polarity score of -1 indicates a highly negative sentiment.
    A polarity score close to 0 indicates a neutral sentiment.

Polarity scores are often calculated using various natural language processing techniques, including lexicon-based methods, machine learning models, and rule-based systems. In the context of sentiment analysis, polarity scores are used to determine the sentiment of a text and categorize it as positive, negative, or neutral based on the calculated score.

In NLTK's SentimentIntensityAnalyzer, the **polarity_scores()** method returns a dictionary with four values for a given text: `neg`, `neu`, `pos`, and `compound`. The compound score is the one usually used for overall sentiment predictions: it sums the valence scores of the words in the text and normalizes the result to a single value between -1 and 1.
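To make the compound score less of a black box, here is a minimal sketch of the normalization step, assuming the formula used in VADER's reference implementation (`x / sqrt(x^2 + alpha)` with `alpha = 15`). The function names and the raw valence sums below are hypothetical, chosen for illustration; real sums come from VADER's lexicon and rules.

```python
import math

def normalize_compound(raw_sum, alpha=15):
    """Squash an unbounded valence sum into the range -1 to 1."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

def label_from_compound(compound, threshold=0.05):
    """Map a compound score to a label using the notebook's +/-0.05 cut-offs."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

# Hypothetical raw valence sums, for illustration only
for raw in (2.0, 0.0, -4.0):
    compound = normalize_compound(raw)
    print(raw, round(compound, 4), label_from_compound(compound))
```

Because of the square root in the denominator, the score approaches 1 or -1 only for very large sums, which is why strongly worded texts saturate near the extremes.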

In [None]:
# Compute the compound polarity score of each description and add it as a new column 'ds_score'
def get_sentiment_score(text):
    # Reuse the analyzer `sia` created earlier; building a new one for every row is slow
    sentiment_scores = sia.polarity_scores(text)
    return sentiment_scores['compound']

combined_df_copy['ds_score'] = combined_df_copy['Description'].apply(get_sentiment_score)

combined_df_copy.head()

In [None]:
# Apply the function  which decides sentiment to  polarity score column

# Create a function to decide sentiment label based on polarity score
def decide_sentiment_label(score):
    if score >= 0.05:
        return 'Positive'
    elif score <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

# Apply the decide_sentiment_label function to the 'ds_score' column
combined_df_copy['sentiment_label'] = combined_df_copy['ds_score'].apply(decide_sentiment_label)



In [None]:
# Display the first few rows of the DataFrame with the new sentiment label column
combined_df_copy.head()

In [None]:
# Count the occurrences of each unique sentiment label
sentiment_label_counts = combined_df_copy['sentiment_label'].value_counts()

print(sentiment_label_counts)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Make Count plot for Sentiment Label

plt.figure(figsize=(5, 3))
sns.countplot(data=combined_df_copy, x='sentiment_label')
plt.title('Sentiment Label Count')
plt.xlabel('Sentiment Label')
plt.ylabel('Count')
plt.show()

In the description column there are approximately:

16000 positive statements

12000 negative statements

6000 neutral statements

In [None]:
# Pie chart on description score column

# Calculate the counts of each sentiment label
sentiment_counts = combined_df_copy['sentiment_label'].value_counts()

# Create a pie chart for the sentiment labels
plt.figure(figsize=(5, 3))
plt.pie(sentiment_counts, labels=sentiment_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Sentiment Label Distribution')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

# Modelling on Description 

In [None]:
# Split the dataset  into test and train 
# 90% train , 10% test and random state 212

from sklearn.model_selection import train_test_split

# Define the features and target variable
X = combined_df_copy['Description']  # Features
y = combined_df_copy['sentiment_label']  # Target variable

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=212)

# Print the shape of the training and testing sets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
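
What `test_size=0.10` and a fixed `random_state` do can be sketched without scikit-learn. This is a simplified stand-in, not the library's implementation: `toy_train_test_split` is a hypothetical helper, and its shuffle order will differ from scikit-learn's because it uses a different random number generator.

```python
import math
import random

def toy_train_test_split(items, test_size=0.10, seed=212):
    """Shuffle a copy of `items` with a fixed seed and hold out a test fraction."""
    shuffled = list(items)
    # A seeded generator makes the shuffle, and so the split, reproducible
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

headlines = [f"headline {i}" for i in range(10)]
train, test = toy_train_test_split(headlines)
print(len(train), len(test))  # 9 1
```

Calling the function twice with the same seed returns the same split, which is what makes the model comparisons in the following cells repeatable.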


## (1) LINEAR SUPPORT VECTOR MACHINE


In [None]:
%%time
# pipeline creation
# 1. TfidfVectorizer
# 2. LinearSVC model

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


# Create a pipeline with TfidfVectorizer and LinearSVC
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('model', LinearSVC())
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Predict on the test dataset
y_pred = pipeline.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("accuracy score: {:.2f}%".format(accuracy * 100))

# Print confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Print classification report
print("Classification Report:\n", classification_report(y_test, y_pred))
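
The TF-IDF step used in these pipelines can be illustrated by hand. The sketch below assumes scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and uses a plain whitespace split instead of the library's tokenizer. The three documents and the name `tfidf_vectors` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return the sorted vocabulary and one L2-normalized TF-IDF row per document."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears
    df = Counter(w for doc in tokenized for w in set(doc))
    # Smoothed inverse document frequency: rarer words get larger weights
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        vectors.append([x / norm for x in raw])
    return vocab, vectors

# Hypothetical toy corpus, for illustration only
docs = ["profits rise", "profits fall", "shares rise sharply"]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print([round(x, 4) for x in vectors[1]])
```

In the second document, "fall" appears in only one document while "profits" appears in two, so "fall" receives the larger weight. In scikit-learn, `TfidfVectorizer` is equivalent to `CountVectorizer` followed by `TfidfTransformer`, so the two-step and three-step pipelines in this notebook build the same features.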


## (2) LOGISTIC REGRESSION


In [None]:
%%time
# pipeline creation 
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. Logistic Regression

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


# Create a pipeline with CountVectorizer, TfidfTransformer, and Logistic Regression
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', LogisticRegression())
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Predict on the test dataset
y_pred = pipeline.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("accuracy score: {:.2f}%".format(accuracy * 100))

# Print confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Print classification report
print("Classification Report:\n", classification_report(y_test, y_pred))


## (3) MULTINOMIAL NAIVE BAYES


In [None]:
%%time
# pipeline creation 
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. MultinomialNB
 
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


# Create a pipeline with CountVectorizer, TfidfTransformer, and Multinomial Naive Bayes
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', MultinomialNB())
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Predict on the test dataset
y_pred = pipeline.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("accuracy score: {:.2f}%".format(accuracy * 100))

# Print confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Print classification report
print("Classification Report:\n", classification_report(y_test, y_pred))


## (4) BERNOULLI NAIVE BAYES


In [None]:
%%time
# pipeline creation 
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. BernoulliNB

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


# Create a pipeline with CountVectorizer, TfidfTransformer, and Bernoulli Naive Bayes
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', BernoulliNB())
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Predict on the test dataset
y_pred = pipeline.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("accuracy score: {:.2f}%".format(accuracy * 100))

# Print confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Print classification report
print("Classification Report:\n", classification_report(y_test, y_pred))


## (5) GRADIENT BOOSTING CLASSIFICATION MODEL


In [None]:
%%time
# pipeline creation 
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. GradientBoostingClassifier

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


# Create a pipeline with CountVectorizer, TfidfTransformer, and Gradient Boosting Classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', GradientBoostingClassifier())
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Predict on the test dataset
y_pred = pipeline.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("accuracy score: {:.2f}%".format(accuracy * 100))

# Print confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Print classification report
print("Classification Report:\n", classification_report(y_test, y_pred))

## (6) XGBOOST CLASSIFICATION MODEL


In [None]:
%%time
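# pipeline creation
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. XGBClassifier

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier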


# Create a label encoder
label_encoder = LabelEncoder()

# Fit the label encoder on the sentiment labels and transform them to numerical values
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Create a pipeline with CountVectorizer, TfidfTransformer, and XGBoost Classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', XGBClassifier())
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train_encoded)

# Predict on the test dataset
y_pred_encoded = pipeline.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test_encoded, y_pred_encoded)
print("accuracy score: {:.2f}%".format(accuracy * 100))

# Print confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test_encoded, y_pred_encoded))

# Print classification report
print("Classification Report:\n", classification_report(y_test_encoded, y_pred_encoded, target_names=label_encoder.classes_))
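
The XGBoost cell encodes the string labels as integers before fitting. As a rough sketch of what `LabelEncoder` does, the classes are sorted and each one is mapped to its position in that order. The variable names and the sample labels below are hypothetical.

```python
# Hypothetical sample of sentiment labels, for illustration only
labels = ["Positive", "Negative", "Neutral", "Positive"]

# Sorted unique classes, mirroring LabelEncoder.classes_
classes = sorted(set(labels))
to_int = {name: index for index, name in enumerate(classes)}

# Equivalent of transform(): strings to integers
encoded = [to_int[name] for name in labels]
# Equivalent of inverse_transform(): integers back to strings
decoded = [classes[index] for index in encoded]

print(classes)   # ['Negative', 'Neutral', 'Positive']
print(encoded)   # [2, 0, 1, 2]
print(decoded == labels)  # True
```

Keeping the encoder around is what lets the notebook turn integer predictions back into readable labels later.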


## (7) DECISION TREE CLASSIFICATION MODEL


In [None]:
%%time
# pipeline creation 
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. Decision tree classifier

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


# Create a pipeline with CountVectorizer, TfidfTransformer, and Decision Tree Classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', DecisionTreeClassifier())
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Predict on the test dataset
y_pred = pipeline.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("accuracy score: {:.2f}%".format(accuracy * 100))

# Print confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Print classification report
print("Classification Report:\n", classification_report(y_test, y_pred))


## (8) K-NEAREST NEIGHBOUR CLASSIFIER MODEL


In [None]:
%%time
# pipeline creation 
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. KNN classifier


from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


# Create a pipeline with CountVectorizer, TfidfTransformer, and K-Nearest Neighbors Classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', KNeighborsClassifier())
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Predict on the test dataset
y_pred = pipeline.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("accuracy score: {:.2f}%".format(accuracy * 100))

# Print confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Print classification report
print("Classification Report:\n", classification_report(y_test, y_pred))

# Comparing all models' metrics

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Create a helper function to compare metrics
def compare_models(models, X_train, X_test, y_train, y_test):
    # Collect one dict of metrics per model
    # (DataFrame.append was removed in pandas 2.0, so rows are gathered in a list)
    rows = []
    
    # Create a label encoder to convert sentiment labels to numerical values
    label_encoder = LabelEncoder()
    y_train_encoded = label_encoder.fit_transform(y_train)
    y_test_encoded = label_encoder.transform(y_test)
    
    for model_name, model in models.items():
        # Create a pipeline with CountVectorizer, TfidfTransformer, and the current model
        pipeline = Pipeline([
            ('vect', CountVectorizer()),
            ('tfidf', TfidfTransformer()),
            ('model', model)
        ])
        
        # Fit the pipeline to the training data
        pipeline.fit(X_train, y_train_encoded)
        
        # Predict on the test dataset
        y_pred_encoded = pipeline.predict(X_test)
        
        # Calculate accuracy score
        accuracy = accuracy_score(y_test_encoded, y_pred_encoded)
        
        # Calculate confusion matrix
        conf_matrix = confusion_matrix(y_test_encoded, y_pred_encoded)
        
        # Calculate classification report
        class_report = classification_report(y_test_encoded, y_pred_encoded, output_dict=True, target_names=label_encoder.classes_)
        
        # Store the metrics for this model
        rows.append({
            'Model': model_name,
            'Accuracy': accuracy,
            'Confusion Matrix': conf_matrix,
            'Classification Report': class_report
        })
    
    # Build the comparison DataFrame once, after all models have been evaluated
    return pd.DataFrame(rows, columns=['Model', 'Accuracy', 'Confusion Matrix', 'Classification Report'])



In [None]:
# Define the models you want to compare
models = {
    'LinearSVC': LinearSVC(),
    'MultinomialNB': MultinomialNB(),
    'BernoulliNB': BernoulliNB(),
    'GradientBoostingClassifier': GradientBoostingClassifier(),
    'XGBClassifier': XGBClassifier(),
    'DecisionTreeClassifier': DecisionTreeClassifier(),
    'KNeighborsClassifier': KNeighborsClassifier()
}

# Call the helper function to compare metrics
metrics_comparison = compare_models(models, X_train, X_test, y_train, y_test)

# Display the metrics comparison DataFrame
print(metrics_comparison)


In [None]:
metrics_comparison

# Working with Test Dataset

In [None]:
# Perform the prediction on the test dataset
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Create a pipeline with CountVectorizer, TfidfTransformer, and LinearSVC
best_model_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', LinearSVC())
])

# Fit the pipeline to the training data
best_model_pipeline.fit(X_train, y_train_encoded)

# Predict on the test dataset
y_pred_encoded = best_model_pipeline.predict(X_test)

# Inverse transform numerical predictions back to original sentiment labels
y_pred = label_encoder.inverse_transform(y_pred_encoded)

# Print the predicted sentiment labels
print("Predicted Sentiment Labels:\n", y_pred)

In [None]:
### Creating a dataframe of predicted results 

# Create a dictionary to store the predicted results
predicted_results = {
    'Description': X_test,  # X_test holds the preprocessed descriptions, not the headlines
    'Predicted Sentiment': y_pred,
    'Actual Sentiment': y_test
}

# Convert the dictionary to a DataFrame
predicted_results_df = pd.DataFrame(predicted_results)

# Print the DataFrame containing predicted results
predicted_results_df

### Finding Test Accuracy!

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

# Calculate classification report
class_report = classification_report(y_test, y_pred, target_names=label_encoder.classes_)
print("Classification Report:\n", class_report)


#### **Awesome! We have 84% Test Accuracy!**
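
To read the accuracy figure and the confusion matrix above, it helps to see how they are computed. The sketch below does this in plain Python on a handful of hypothetical labels; `toy_accuracy` and `toy_confusion` are illustrative helpers, not scikit-learn functions. Keep in mind that the "actual" labels in this notebook were generated by VADER, so the accuracy measures how well the classifier reproduces VADER's labels.

```python
from collections import Counter

def toy_accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def toy_confusion(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels, both in `labels` order."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

# Hypothetical labels, for illustration only
y_true = ["Positive", "Negative", "Neutral", "Positive", "Negative"]
y_pred = ["Positive", "Negative", "Positive", "Positive", "Neutral"]
labels = ["Negative", "Neutral", "Positive"]

print(toy_accuracy(y_true, y_pred))           # 0.6
print(toy_confusion(y_true, y_pred, labels))  # [[1, 1, 0], [0, 0, 1], [0, 0, 2]]
```

The diagonal of the matrix holds the correct predictions, so accuracy is the diagonal sum divided by the total number of samples.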

# Now working with headlines + description

Performing sentiment analysis on both headlines and descriptions can provide a more comprehensive understanding of the sentiment expressed in financial news articles.

Here are a few reasons why analyzing both headlines and descriptions could be valuable:

1.    Richer Context: Headlines provide a concise summary of the article's main theme, while descriptions offer more detailed information. By analyzing both, you can capture the sentiment of the main idea as well as the supporting context.

2.    Nuanced Sentiment: Headlines often focus on attracting attention, which can sometimes result in sensationalism. Descriptions, on the other hand, may contain more nuanced and balanced sentiment.

3.    Detection of Changes: Sentiment can change from the headline to the description, reflecting shifts in the article's tone or focus. Analyzing both can help detect these changes.

4.    Performance Improvement: Combining multiple sources of information (headlines and descriptions) can potentially lead to better sentiment analysis results, as one source might compensate for limitations in the other.

5.    Informed Decision-Making: In financial contexts, understanding sentiment is crucial for making informed decisions. By analyzing both headlines and descriptions, you can gain deeper insights into market perceptions and trends.

6.    Research and Strategy: Researchers and investors may benefit from a more thorough sentiment analysis that considers both headlines and descriptions to guide their research and investment strategies.

7.    Robustness: If sentiment analysis on one source (e.g., headlines) is less accurate due to inherent biases or limitations, using another source (e.g., descriptions) can enhance the robustness of the analysis.

In [None]:

# Merge the 'Headlines' and 'Description' columns and create a new column 'Info'
combined_df_copy['Info'] = combined_df_copy['Headlines'] + ' ' + combined_df_copy['Description']

# Print the updated DataFrame
combined_df_copy

In [None]:
combined_df_copy.columns

In [None]:
# Keep only the 'Info' and 'Time' columns and drop the remaining columns
# .copy() avoids pandas' SettingWithCopyWarning when 'Info' is modified in the next cell
combined_df_copy = combined_df_copy[['Info', 'Time']].copy()

# Print the updated DataFrame
combined_df_copy.head()

In [None]:
 # Apply the preprocessing function to the 'Info' column
combined_df_copy['Info'] = combined_df_copy['Info'].apply(preprocess_text)

# Print the updated DataFrame
combined_df_copy

In [None]:
# Define how to compute the polarity score of each 'Info' value (stored later in a new 'info_score' column)

# Create a SentimentIntensityAnalyzer object
sia = SentimentIntensityAnalyzer()

# Function to calculate polarity scores
def get_polarity_score(text):
    return sia.polarity_scores(text)['compound']


In [None]:
# Apply the polarity score function to the 'Info' column
combined_df_copy['info_score'] = combined_df_copy['Info'].apply(get_polarity_score)

# Print the updated DataFrame
combined_df_copy

In [None]:
# Function to map polarity scores to sentiment labels
def map_to_sentiment(score):
    if score > 0.05:
        return 'Positive'
    elif score < -0.05:
        return 'Negative'
    else:
        return 'Neutral'

# Apply the mapping function to the 'info_score' column
combined_df_copy['info_score'] = combined_df_copy['info_score'].apply(map_to_sentiment)

# Print the updated DataFrame
combined_df_copy

In [None]:
# Perform count plot on info_score column
import seaborn as sns
import matplotlib.pyplot as plt

# Create a count plot for the 'info_score' column
plt.figure(figsize=(4, 3))
sns.countplot(data=combined_df_copy, x='info_score')
plt.title('Count Plot of Info Scores')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()

In [None]:
# Count the occurrences of each type of 'info_score'
score_counts = combined_df_copy['info_score'].value_counts()

# Print the count of each type of 'info_score'
print(score_counts)


In [None]:
# Perform pie chart on info_score column
import matplotlib.pyplot as plt

# Calculate the count of each type of 'info_score'
score_counts = combined_df_copy['info_score'].value_counts()

# Create a pie chart
plt.figure(figsize=(4, 3))
plt.pie(score_counts, labels=score_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Pie Chart of Info Scores')
plt.show()

In the dataset, the Info column contains:

48.4% positive statements

39.8% negative statements

11.3% neutral statements

# Model Building on headlines + description

In [None]:
# Split the dataset  into test and train 
# 90% train , 10% test and random state 212

from sklearn.model_selection import train_test_split

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    combined_df_copy['Info'],  # Features (Info column)
    combined_df_copy['info_score'],  # Target (info_score column)
    test_size=0.1,  # 10% test size
    random_state=212
)

# Print the shapes of the train and test sets
print("Train set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

## (1) LINEAR SUPPORT VECTOR MACHINE


In [None]:

%%time
# pipeline creation
# 1. TfidfVectorizer
# 2. LinearSVC model

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Create a pipeline with TF-IDF vectorization and LinearSVC
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('model', LinearSVC())
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Predict on the test dataset
y_pred = pipeline.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

# Print classification report
class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report)


# (2) LOGISTIC REGRESSION


In [None]:
%%time
# pipeline creation 
# 1. CountVectorization
# 2. TfidfTransformer
# 3. Logistic Regression

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Create a pipeline with CountVectorizer, TfidfTransformer, and Logistic Regression
pipeline_lr = Pipeline([
    ('count_vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', LogisticRegression())
])

# Fit the pipeline to the training data
pipeline_lr.fit(X_train, y_train)

# Predict on the test dataset
y_pred_lr = pipeline_lr.predict(X_test)

# Calculate accuracy
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print("Accuracy:", accuracy_lr)

# Calculate confusion matrix
conf_matrix_lr = confusion_matrix(y_test, y_pred_lr)
print("Confusion Matrix:\n", conf_matrix_lr)

# Print classification report
class_report_lr = classification_report(y_test, y_pred_lr)
print("Classification Report:\n", class_report_lr)


# (3) MULTINOMIAL NAIVE BAYES


In [None]:
%%time
# pipeline creation 
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. MultinomialNB


from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Create a pipeline with CountVectorizer, TfidfTransformer, and Multinomial Naive Bayes
pipeline_nb = Pipeline([
    ('count_vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', MultinomialNB())
])

# Fit the pipeline to the training data
pipeline_nb.fit(X_train, y_train)

# Predict on the test dataset
y_pred_nb = pipeline_nb.predict(X_test)

# Calculate accuracy
accuracy_nb = accuracy_score(y_test, y_pred_nb)
print("Accuracy:", accuracy_nb)

# Calculate confusion matrix
conf_matrix_nb = confusion_matrix(y_test, y_pred_nb)
print("Confusion Matrix:\n", conf_matrix_nb)

# Print classification report
class_report_nb = classification_report(y_test, y_pred_nb)
print("Classification Report:\n", class_report_nb)


# (4) BERNOULLI NAIVE BAYES


In [None]:
%%time
# pipeline creation 
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. BernoulliNB

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Create a pipeline with CountVectorizer, TfidfTransformer, and Bernoulli Naive Bayes
pipeline_bnb = Pipeline([
    ('count_vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', BernoulliNB())
])

# Fit the pipeline to the training data
pipeline_bnb.fit(X_train, y_train)

# Predict on the test dataset
y_pred_bnb = pipeline_bnb.predict(X_test)

# Calculate accuracy
accuracy_bnb = accuracy_score(y_test, y_pred_bnb)
print("Accuracy:", accuracy_bnb)

# Calculate confusion matrix
conf_matrix_bnb = confusion_matrix(y_test, y_pred_bnb)
print("Confusion Matrix:\n", conf_matrix_bnb)

# Print classification report
class_report_bnb = classification_report(y_test, y_pred_bnb)
print("Classification Report:\n", class_report_bnb)


# (5) GRADIENT BOOSTING CLASSIFICATION MODEL


In [None]:
%%time
# pipeline creation 
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. GradientBoostingClassifier

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Create a pipeline with CountVectorizer, TfidfTransformer, and Gradient Boosting Classifier
pipeline_gb = Pipeline([
    ('count_vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', GradientBoostingClassifier())
])

# Fit the pipeline to the training data
pipeline_gb.fit(X_train, y_train)

# Predict on the test dataset
y_pred_gb = pipeline_gb.predict(X_test)

# Calculate accuracy
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print("Accuracy:", accuracy_gb)

# Calculate confusion matrix
conf_matrix_gb = confusion_matrix(y_test, y_pred_gb)
print("Confusion Matrix:\n", conf_matrix_gb)

# Print classification report
class_report_gb = classification_report(y_test, y_pred_gb)
print("Classification Report:\n", class_report_gb)


# (6) XGBOOST CLASSIFICATION MODEL


In [None]:
%%time
# Pipeline steps:
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. Classifier
# NOTE: this cell fits scikit-learn's GradientBoostingClassifier as a stand-in.
# xgboost.XGBClassifier needs integer-encoded labels, which the headlines
# section later in the notebook sets up with LabelEncoder.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Create a pipeline with CountVectorizer, TfidfTransformer, and Gradient Boosting Classifier
pipeline_xgb = Pipeline([
    ('count_vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', GradientBoostingClassifier())
])

# Fit the pipeline to the training data
pipeline_xgb.fit(X_train, y_train)

# Predict on the test dataset
y_pred_xgb = pipeline_xgb.predict(X_test)

# Calculate accuracy
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
print("Accuracy:", accuracy_xgb)

# Calculate confusion matrix
conf_matrix_xgb = confusion_matrix(y_test, y_pred_xgb)
print("Confusion Matrix:\n", conf_matrix_xgb)

# Print classification report
class_report_xgb = classification_report(y_test, y_pred_xgb)
print("Classification Report:\n", class_report_xgb)


# (7) DECISION TREE CLASSIFICATION MODEL


In [None]:
%%time
# pipeline creation 
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. Decision tree classifier


from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Create a pipeline with CountVectorizer, TfidfTransformer, and Decision Tree Classifier
pipeline_dt = Pipeline([
    ('count_vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', DecisionTreeClassifier())
])

# Fit the pipeline to the training data
pipeline_dt.fit(X_train, y_train)

# Predict on the test dataset
y_pred_dt = pipeline_dt.predict(X_test)

# Calculate accuracy
accuracy_dt = accuracy_score(y_test, y_pred_dt)
print("Accuracy:", accuracy_dt)

# Calculate confusion matrix
conf_matrix_dt = confusion_matrix(y_test, y_pred_dt)
print("Confusion Matrix:\n", conf_matrix_dt)

# Print classification report
class_report_dt = classification_report(y_test, y_pred_dt)
print("Classification Report:\n", class_report_dt)


# (8) K-NEAREST NEIGHBOUR CLASSIFIER MODEL


In [None]:
%%time
# pipeline creation 
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. KNN classifier

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Create a pipeline with CountVectorizer, TfidfTransformer, and K-Nearest Neighbors Classifier
pipeline_knn = Pipeline([
    ('count_vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', KNeighborsClassifier())
])

# Fit the pipeline to the training data
pipeline_knn.fit(X_train, y_train)

# Predict on the test dataset
y_pred_knn = pipeline_knn.predict(X_test)

# Calculate accuracy
accuracy_knn = accuracy_score(y_test, y_pred_knn)
print("Accuracy:", accuracy_knn)

# Calculate confusion matrix
conf_matrix_knn = confusion_matrix(y_test, y_pred_knn)
print("Confusion Matrix:\n", conf_matrix_knn)

# Print classification report
class_report_knn = classification_report(y_test, y_pred_knn)
print("Classification Report:\n", class_report_knn)


In [None]:
# Helper function for comparing model metrics

def compare_models(models, model_names, X_test, y_test):
    metrics = []

    for model, name in zip(models, model_names):
        y_pred = model.predict(X_test)
        
        accuracy = accuracy_score(y_test, y_pred)
        conf_matrix = confusion_matrix(y_test, y_pred)
        class_report = classification_report(y_test, y_pred, output_dict=True)
        
        metrics.append({
            'Model': name,
            'Accuracy': accuracy,
            'Confusion Matrix': conf_matrix,
            'Classification Report': class_report
        })

    metrics_df = pd.DataFrame(metrics)
    return metrics_df

In [None]:
# List of model objects (the LinearSVC pipeline from section (1), `pipeline`, is not included here)
models = [pipeline_lr, pipeline_nb, pipeline_bnb, pipeline_gb, pipeline_xgb, pipeline_dt, pipeline_knn]

# List of model names
model_names = ['Logistic Regression', 'Multinomial Naive Bayes', 'Bernoulli Naive Bayes',
               'Gradient Boosting Classifier', 'XGBoost Classifier', 'Decision Tree Classifier',
               'K-Nearest Neighbors']

# Compare models and get metrics dataframe
metrics_dataframe = compare_models(models, model_names, X_test, y_test)

# Print the comparison of models
metrics_dataframe

### **Hence, among the models compared, our best model is Logistic Regression with an accuracy of 0.807714**
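
Accuracy alone can flatter a model when one class is rare, as Neutral is in this dataset. A minimal sketch, using made-up toy labels rather than the notebook's data, of comparing accuracy with macro-averaged F1:

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels (illustrative only): a classifier that always predicts 'Positive'
y_true_toy = ['Positive'] * 8 + ['Neutral'] * 2
y_pred_toy = ['Positive'] * 10

acc = accuracy_score(y_true_toy, y_pred_toy)

# Macro F1 averages the per-class F1 scores, so the missed class counts fully.
# zero_division=0 scores the never-predicted class as 0 without a warning.
macro_f1 = f1_score(y_true_toy, y_pred_toy, average='macro', zero_division=0)

print("Accuracy:", acc)
print("Macro F1:", round(macro_f1, 3))
```

Here the accuracy looks respectable while the macro F1 is much lower, because the Neutral class is never predicted. The `Classification Report` column returned by `compare_models` already contains the per-class figures needed for this kind of check.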

# **Now we will make predictions on our Test Data**

In [None]:
# Predict on the test dataset using the Logistic Regression model
y_pred_lr = pipeline_lr.predict(X_test)

# Print the predicted labels
y_pred_lr


In [None]:
import pandas as pd

# Create a DataFrame comparing actual and predicted labels
predicted_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_lr})

# Display the DataFrame
predicted_df

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Calculate accuracy
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print("Accuracy:", accuracy_lr)

# Calculate confusion matrix
conf_matrix_lr = confusion_matrix(y_test, y_pred_lr)
print("Confusion Matrix:\n", conf_matrix_lr)

# Print classification report
class_report_lr = classification_report(y_test, y_pred_lr)
print("Classification Report:\n", class_report_lr)

# Now we will be working on headlines

In [None]:
combined_df.head()

In [None]:
# Create a new dataframe by dropping the 'Description' column
new_dataframe = combined_df.drop(columns=['Description'])

# Display the new dataframe
new_dataframe.head()

In [None]:
# Rename the "date" column to "time" in the Guardian headlines dataset
guardian_df.rename(columns={'date': 'time'}, inplace=True)

# Display the updated Guardian headlines dataset
guardian_df.head()

In [None]:
import pandas as pd

# Concatenate guardian_df and new_dataframe to get all headlines together
all_headlines = pd.concat([guardian_df, new_dataframe])

# Display the concatenated dataframe
all_headlines.head()

In [None]:
# Check the shape of all headlines dataset
all_headlines.shape

In [None]:
# Apply the preprocessing function defined earlier (preprocess_text) to the Headlines column
all_headlines['Headlines'] = all_headlines['Headlines'].apply(preprocess_text)

# Display the updated dataframe
all_headlines.head()

In [None]:
# Compute the VADER compound polarity score of each headline and store it in a new 'hl_score' column

from nltk.sentiment import SentimentIntensityAnalyzer

# Create a SentimentIntensityAnalyzer object
sia = SentimentIntensityAnalyzer()

# Function to get the polarity score
def get_polarity_score(text):
    sentiment = sia.polarity_scores(text)
    return sentiment['compound']

# Apply the polarity score function to the "Headlines" column
all_headlines['hl_score'] = all_headlines['Headlines'].apply(get_polarity_score)

# Display the updated dataframe
all_headlines.head()


In [None]:
# Map each polarity score to a sentiment label

def get_sentiment_label(score):
    if score > 0.05:
        return 'Positive'
    elif score < -0.05:
        return 'Negative'
    else:
        return 'Neutral'

# Apply the sentiment function to the "hl_score" column
all_headlines['hl_sentiment'] = all_headlines['hl_score'].apply(get_sentiment_label)

# Display the updated dataframe
all_headlines.head()
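
To make the thresholds concrete, here is a small self-contained check of the same labelling rule on a few hand-picked scores (the scores are illustrative, not taken from the dataset). Note that a score of exactly 0.05 or -0.05 falls in the Neutral band:

```python
def get_sentiment_label(score):
    # Same rule as in the cell above: a +/-0.05 band around zero counts as Neutral
    if score > 0.05:
        return 'Positive'
    elif score < -0.05:
        return 'Negative'
    else:
        return 'Neutral'

# Hand-picked compound scores, including both boundary values
sample_scores = [0.6, 0.05, 0.0, -0.05, -0.3]
sample_labels = [get_sentiment_label(s) for s in sample_scores]

for score, label in zip(sample_scores, sample_labels):
    print(score, '->', label)
```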

In [None]:
# Draw a countplot of the headline sentiment column

import seaborn as sns
import matplotlib.pyplot as plt

# Create a countplot for the "hl_sentiment" column
sns.countplot(data=all_headlines, x='hl_sentiment')

# Set labels and title
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.title('Distribution of Headline Sentiments')

# Show the plot
plt.show()

In [None]:
# Make a pie chart of hl_sentiment

# Count the headlines per sentiment label
sentiment_counts = all_headlines['hl_sentiment'].value_counts()

# Create a pie chart
plt.figure(figsize=(4, 4))
plt.pie(sentiment_counts, labels=sentiment_counts.index, autopct='%1.1f%%', startangle=140, colors=['#66b3ff','#99ff99','#ff9999'])

# Set title
plt.title('Distribution of Headline Sentiments')

# Show the pie chart
plt.show()

# Modeling on Headlines dataframe all_headlines

In [None]:
all_headlines.head()

In [None]:
# Remove the "hl_score" column from the dataframe
all_headlines = all_headlines.drop(columns=['hl_score'])

In [None]:
# Split the dataset  into test and train 
# 90% train , 10% test and random state 212

from sklearn.model_selection import train_test_split

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    all_headlines['Headlines'],    # Features (Headlines column)
    all_headlines['hl_sentiment'],  # Target (hl_sentiment column)
    test_size=0.1,                  # 10% test size
    random_state=212
)

# Print the shapes of the train and test sets
print("Train set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

# (1) LINEAR SUPPORT VECTOR MACHINE

In [None]:
%%time
# pipeline creation
# 1. TfidfVectorizer
# 2. linearSVC model

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Create a pipeline
pipeline_svc = Pipeline([
    ('tfidf', TfidfVectorizer()),  # TF-IDF vectorization
    ('svc', LinearSVC())           # LinearSVC model
])

# Fit the pipeline to the training data
pipeline_svc.fit(X_train, y_train)

# Predict on the test dataset
y_pred_svc = pipeline_svc.predict(X_test)

# Calculate accuracy
accuracy_svc = accuracy_score(y_test, y_pred_svc)
print("Accuracy:", accuracy_svc)

# Calculate confusion matrix
conf_matrix_svc = confusion_matrix(y_test, y_pred_svc)
print("Confusion Matrix:\n", conf_matrix_svc)

# Print classification report
class_report_svc = classification_report(y_test, y_pred_svc)
print("Classification Report:\n", class_report_svc)


# (2) LOGISTIC REGRESSION

In [None]:
%%time
# pipeline creation 
# 1. CountVectorization
# 2. TfidfTransformer
# 3. Logistic Regression

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Create a pipeline
pipeline_lr = Pipeline([
    ('count_vect', CountVectorizer()),   # CountVectorization
    ('tfidf', TfidfTransformer()),        # TF-IDF transformation
    ('lr', LogisticRegression())         # Logistic Regression model
])

# Fit the pipeline to the training data
pipeline_lr.fit(X_train, y_train)

# Predict on the test dataset
y_pred_lr = pipeline_lr.predict(X_test)

# Calculate accuracy
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print("Accuracy:", accuracy_lr)

# Calculate confusion matrix
conf_matrix_lr = confusion_matrix(y_test, y_pred_lr)
print("Confusion Matrix:\n", conf_matrix_lr)

# Print classification report
class_report_lr = classification_report(y_test, y_pred_lr)
print("Classification Report:\n", class_report_lr)


# (3) MULTINOMIAL NAIVE BAYES


In [None]:
%%time
# pipeline creation 
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. MultinomialNB

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Create a pipeline
pipeline_mnb = Pipeline([
    ('count_vect', CountVectorizer()),   # CountVectorization
    ('tfidf', TfidfTransformer()),        # TF-IDF transformation
    ('nb', MultinomialNB())              # Multinomial Naive Bayes model
])

# Fit the pipeline to the training data
pipeline_mnb.fit(X_train, y_train)

# Predict on the test dataset
y_pred_mnb = pipeline_mnb.predict(X_test)

# Calculate accuracy
accuracy_mnb = accuracy_score(y_test, y_pred_mnb)
print("Accuracy:", accuracy_mnb)

# Calculate confusion matrix
conf_matrix_mnb = confusion_matrix(y_test, y_pred_mnb)
print("Confusion Matrix:\n", conf_matrix_mnb)

# Print classification report
class_report_mnb = classification_report(y_test, y_pred_mnb)
print("Classification Report:\n", class_report_mnb)


# (4) BERNOULLI NAIVE BAYES


In [None]:
%%time
# pipeline creation 
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. BernoulliNB

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Create a pipeline
pipeline_bnb = Pipeline([
    ('count_vect', CountVectorizer()),   # CountVectorization
    ('tfidf', TfidfTransformer()),        # TF-IDF transformation
    ('bnb', BernoulliNB())               # Bernoulli Naive Bayes model
])

# Fit the pipeline to the training data
pipeline_bnb.fit(X_train, y_train)

# Predict on the test dataset
y_pred_bnb = pipeline_bnb.predict(X_test)

# Calculate accuracy
accuracy_bnb = accuracy_score(y_test, y_pred_bnb)
print("Accuracy:", accuracy_bnb)

# Calculate confusion matrix
conf_matrix_bnb = confusion_matrix(y_test, y_pred_bnb)
print("Confusion Matrix:\n", conf_matrix_bnb)

# Print classification report
class_report_bnb = classification_report(y_test, y_pred_bnb)
print("Classification Report:\n", class_report_bnb)


# (5) GRADIENT BOOSTING CLASSIFICATION MODEL


In [None]:
%%time
# pipeline creation 
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. GradientBoostingClassifier


from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Create a pipeline
pipeline_gb = Pipeline([
    ('count_vect', CountVectorizer()),        # CountVectorization
    ('tfidf', TfidfTransformer()),             # TF-IDF transformation
    ('gb', GradientBoostingClassifier())      # Gradient Boosting Classifier model
])

# Fit the pipeline to the training data
pipeline_gb.fit(X_train, y_train)

# Predict on the test dataset
y_pred_gb = pipeline_gb.predict(X_test)

# Calculate accuracy
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print("Accuracy:", accuracy_gb)

# Calculate confusion matrix
conf_matrix_gb = confusion_matrix(y_test, y_pred_gb)
print("Confusion Matrix:\n", conf_matrix_gb)

# Print classification report
class_report_gb = classification_report(y_test, y_pred_gb)
print("Classification Report:\n", class_report_gb)


# (6) XGBOOST CLASSIFICATION MODEL
XGBoost needs integer class labels, so the target is label-encoded first.
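
As a quick illustration of what the encoding does, the sketch below runs scikit-learn's `LabelEncoder` on a toy list of labels (illustrative values only). Classes are sorted alphabetically before being numbered, and `inverse_transform` recovers the original strings:

```python
from sklearn.preprocessing import LabelEncoder

# Toy target values (illustrative only)
toy_labels = ['Positive', 'Negative', 'Neutral', 'Positive']

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

# classes_ lists the labels in the order of their integer codes
print(list(encoder.classes_))
print(list(encoded))

# Decode the integers back to the original string labels
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

This round trip is why the predictions in the cell below are passed through `inverse_transform` before being compared with `y_test`.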

In [None]:
%%time
# pipeline creation 
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. XGBClassifier

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder

# Create a label encoder
label_encoder = LabelEncoder()

# Fit the encoder on the target variable
label_encoder.fit(y_train)

# Encode the target variable
y_train_encoded = label_encoder.transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Create a pipeline
pipeline_xgb = Pipeline([
    ('count_vect', CountVectorizer()),   # CountVectorization
    ('tfidf', TfidfTransformer()),        # TF-IDF transformation
    ('xgb', XGBClassifier())             # XGBoost Classifier model
])

# Fit the pipeline to the training data
pipeline_xgb.fit(X_train, y_train_encoded)

# Predict on the test dataset
y_pred_xgb_encoded = pipeline_xgb.predict(X_test)

# Inverse transform the predictions to get original labels
y_pred_xgb = label_encoder.inverse_transform(y_pred_xgb_encoded)

# Calculate accuracy
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
print("Accuracy:", accuracy_xgb)

# Calculate confusion matrix
conf_matrix_xgb = confusion_matrix(y_test, y_pred_xgb)
print("Confusion Matrix:\n", conf_matrix_xgb)

# Print classification report
class_report_xgb = classification_report(y_test, y_pred_xgb)
print("Classification Report:\n", class_report_xgb)


# (7) DECISION TREE CLASSIFICATION MODEL


In [None]:
%%time
# pipeline creation 
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. Decision tree classifier

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Create a pipeline
pipeline_dt = Pipeline([
    ('count_vect', CountVectorizer()),       # CountVectorization
    ('tfidf', TfidfTransformer()),            # TF-IDF transformation
    ('dt', DecisionTreeClassifier())         # Decision Tree Classifier model
])

# Fit the pipeline to the training data
pipeline_dt.fit(X_train, y_train)

# Predict on the test dataset
y_pred_dt = pipeline_dt.predict(X_test)

# Calculate accuracy
accuracy_dt = accuracy_score(y_test, y_pred_dt)
print("Accuracy:", accuracy_dt)

# Calculate confusion matrix
conf_matrix_dt = confusion_matrix(y_test, y_pred_dt)
print("Confusion Matrix:\n", conf_matrix_dt)

# Print classification report
class_report_dt = classification_report(y_test, y_pred_dt)
print("Classification Report:\n", class_report_dt)

# (8) K-NEAREST NEIGHBOUR CLASSIFIER MODEL


In [None]:
%%time
# pipeline creation 
# 1. CountVectorizer
# 2. TfidfTransformer
# 3. KNN classifier

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Create a pipeline
pipeline_knn = Pipeline([
    ('count_vect', CountVectorizer()),       # CountVectorization
    ('tfidf', TfidfTransformer()),            # TF-IDF transformation
    ('knn', KNeighborsClassifier())          # K-Nearest Neighbors Classifier model
])

# Fit the pipeline to the training data
pipeline_knn.fit(X_train, y_train)

# Predict on the test dataset
y_pred_knn = pipeline_knn.predict(X_test)

# Calculate accuracy
accuracy_knn = accuracy_score(y_test, y_pred_knn)
print("Accuracy:", accuracy_knn)

# Calculate confusion matrix
conf_matrix_knn = confusion_matrix(y_test, y_pred_knn)
print("Confusion Matrix:\n", conf_matrix_knn)

# Print classification report
class_report_knn = classification_report(y_test, y_pred_knn)
print("Classification Report:\n", class_report_knn)


### Making a DataFrame of all model metrics

Beforehand, let's label-encode the target so that no model throws an error.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Create a label encoder
label_encoder = LabelEncoder()

# Fit the encoder on the target variable in training data
y_train_encoded = label_encoder.fit_transform(y_train)

# Transform the target variable in test data using the same encoder
y_test_encoded = label_encoder.transform(y_test)

# Now you can use y_train_encoded and y_test_encoded in your pipelines and models

In [None]:
# Helper function for comparing model metrics.
# Note: it refits every model on label-encoded targets and decodes the predictions before scoring.

def compare_models(models, model_names, X_train, y_train, X_test, y_test):
    label_encoder = LabelEncoder()
    y_train_encoded = label_encoder.fit_transform(y_train)
    y_test_encoded = label_encoder.transform(y_test)
    
    metrics = []
    
    for model, name in zip(models, model_names):
        model.fit(X_train, y_train_encoded)
        y_pred_encoded = model.predict(X_test)
        y_pred = label_encoder.inverse_transform(y_pred_encoded)
        
        accuracy = accuracy_score(y_test, y_pred)
        conf_matrix = confusion_matrix(y_test, y_pred)
        class_report = classification_report(y_test, y_pred, output_dict=True)
        
        metrics.append({
            'Model': name,
            'Accuracy': accuracy,
            'Confusion Matrix': conf_matrix,
            'Classification Report': class_report
        })
    
    metrics_df = pd.DataFrame(metrics)
    return metrics_df

In [None]:
# List of model objects
models = [pipeline_svc, pipeline_lr, pipeline_mnb, pipeline_bnb, pipeline_gb, pipeline_xgb, pipeline_dt, pipeline_knn]

# List of model names
model_names = ['LinearSVC', 'Logistic Regression', 'Multinomial Naive Bayes', 
               'Bernoulli Naive Bayes', 'Gradient Boosting Classifier', 'XGBoost Classifier', 
               'Decision Tree Classifier', 'K-Nearest Neighbors']

# Create a dataframe with metrics
model_metrics_df = compare_models(models, model_names, X_train, y_train, X_test, y_test)


In [None]:
model_metrics_df
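
The `Classification Report` column holds nested dictionaries, which are hard to compare at a glance. One possible way to summarise them is sketched below on a toy DataFrame shaped like the output of `compare_models`. The model names and numbers are placeholders, not results from this notebook:

```python
import pandas as pd

# Toy rows shaped like the compare_models output; all numbers are placeholders
toy_metrics = pd.DataFrame([
    {'Model': 'Model A', 'Accuracy': 0.90,
     'Classification Report': {'macro avg': {'f1-score': 0.85},
                               'weighted avg': {'f1-score': 0.89}}},
    {'Model': 'Model B', 'Accuracy': 0.88,
     'Classification Report': {'macro avg': {'f1-score': 0.87},
                               'weighted avg': {'f1-score': 0.88}}},
])

# Pull the averaged F1 scores out of each nested report dictionary
summary = toy_metrics[['Model', 'Accuracy']].copy()
summary['Macro F1'] = toy_metrics['Classification Report'].apply(
    lambda report: report['macro avg']['f1-score'])
summary['Weighted F1'] = toy_metrics['Classification Report'].apply(
    lambda report: report['weighted avg']['f1-score'])

# Rank the models by macro F1, best first
summary = summary.sort_values('Macro F1', ascending=False).reset_index(drop=True)
print(summary)
```

The keys `'macro avg'` and `'weighted avg'` are the ones `classification_report(..., output_dict=True)` produces, so the same two `apply` calls should work on `model_metrics_df`.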

# Now working with test data
#### We found that LinearSVC is the best model, with an accuracy of **0.919917**.

In [None]:
# Perform the prediction on the test dataset
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Refit the LinearSVC pipeline on the original string labels
# (compare_models refit every pipeline on integer-encoded labels)
pipeline_svc.fit(X_train, y_train)

# Predict on the test dataset
y_pred_svc = pipeline_svc.predict(X_test)
y_pred_svc


In [None]:
# Calculate accuracy
accuracy_svc = accuracy_score(y_test, y_pred_svc)
print("Accuracy:", accuracy_svc)

# Calculate confusion matrix
conf_matrix_svc = confusion_matrix(y_test, y_pred_svc)
print("Confusion Matrix:\n", conf_matrix_svc)

# Print classification report
class_report_svc = classification_report(y_test, y_pred_svc)
print("Classification Report:\n", class_report_svc)

In [None]:
# Create a dataframe of predicted results
predicted_df = pd.DataFrame({
    'Actual': y_test,
    'Predicted': y_pred_svc
})

# Display the predicted dataframe
predicted_df.head()

# **Making predictions on real-time news data**

We can check the result on real-time news headlines!
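
Several headlines can also be scored in one call. The sketch below is self-contained: it fits a tiny stand-in pipeline on made-up headlines, so its predictions are not meaningful. In the notebook, the fitted `pipeline_svc` would be passed instead. The helper name `predict_headlines` is hypothetical. Because the training text went through `preprocess_text`, applying the same function to new headlines before predicting is probably worthwhile; it is left out here since its definition lies outside this section.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def predict_headlines(model, headlines):
    """Return a DataFrame pairing each headline with the model's predicted label."""
    return pd.DataFrame({'Headline': headlines,
                         'Predicted': model.predict(headlines)})

# Tiny stand-in model trained on made-up examples (illustrative only)
toy_text = ['profits surge to record high', 'shares rally on strong growth',
            'company faces heavy losses', 'market crash wipes out gains']
toy_targets = ['Positive', 'Positive', 'Negative', 'Negative']
toy_model = Pipeline([('tfidf', TfidfVectorizer()), ('svc', LinearSVC())])
toy_model.fit(toy_text, toy_targets)

result = predict_headlines(toy_model, ['profits surge again',
                                       'losses deepen after crash'])
print(result)
```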

In [None]:
sent1 = ['Chandrayaan 3 makes successful soft landing on Lunar South Pole']
y_predict = pipeline_svc.predict(sent1)
print(y_predict)

In [None]:
sent2 = ["Rape survivors sustains fatal injuries in Kolkata"]
y_predict = pipeline_svc.predict(sent2)
print(y_predict)

# Conclusion

In this project we learned about NLTK and sentiment analysis.

We conclude that NLTK makes it easy to classify financial news, and that the more we improve the training data, the more accurate the predictions become.
