# Text analysis from (Stock Market News Data in Portuguese | Kaggle)

### By: Kevin De Mello Santamaria
#### Date: 12/12/2023

## LangChain Model Description

The LangChain model is a comprehensive language processing system tailored for financial phrases in Portuguese. Leveraging state-of-the-art models and libraries, it seamlessly integrates text summarization, keyword extraction, and sentiment analysis to provide a holistic understanding of financial text data.

#### 1. Text Summarization using Facebook BART Large CNN Model:

The model uses the Facebook BART Large CNN model for text summarization, exposed to the agent as a tool. This pre-trained model condenses lengthy financial phrases into concise summaries that capture the essential meaning and context. The checkpoint is fine-tuned on English news articles (CNN/DailyMail), so summaries of Portuguese text are less reliable.

#### 2. Keyword Extraction with KeyBERT Library:

To identify crucial terms and concepts within the financial phrases, I used the KeyBERT library. KeyBERT uses BERT-based embeddings to extract the keywords that best represent the text.
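
The idea behind embedding-based keyword extraction is to embed the document and each candidate word, then rank the candidates by cosine similarity to the document. The sketch below shows only the ranking step with hand-written toy vectors. The vectors and the `rank_keywords` helper are invented for illustration; real embeddings would come from a BERT-style model.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_keywords(doc_vec: Sequence[float],
                  candidates: Dict[str, Sequence[float]],
                  top_n: int = 3) -> List[str]:
    """Return the top_n candidate words most similar to the document vector."""
    ranked = sorted(candidates,
                    key=lambda word: cosine(doc_vec, candidates[word]),
                    reverse=True)
    return ranked[:top_n]


# Toy 3-dimensional "embeddings" (made up, not produced by any model)
doc_vector = [0.9, 0.1, 0.3]
candidate_vectors = {
    "lucro": [0.8, 0.2, 0.3],
    "empresa": [0.5, 0.5, 0.5],
    "ontem": [0.0, 1.0, 0.0],
}

top_keywords = rank_keywords(doc_vector, candidate_vectors, top_n=2)
print(top_keywords)
```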

#### 3. Sentiment Analysis using TextBlob Library:

Understanding the sentiment behind financial statements is essential. I used the TextBlob library for sentiment analysis, labeling each financial phrase as positive, negative, or neutral. These labels help gauge the overall sentiment expressed in the dataset.

#### Dataset Source:

The LangChain model is designed for financial phrases from the "Financial Phrase Bank - Portuguese Translation" dataset available on Kaggle. The dataset contains financial statements and expressions in Portuguese alongside the English originals and a sentiment label, which makes it suitable for evaluating the model.

In [None]:
! python -m venv venv # Create a virtual environment
! source ./venv/bin/activate # Each "!" command runs in its own subshell, so this activation does not persist
! pip install -r requirements.txt # Install all libraries listed in requirements.txt
! python -m pip install jupyter # Install Jupyter Notebook
! python -m ipykernel install --user --name=llm-virtual-venv # Register a kernel for Jupyter Notebook
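
Because each `!` command runs in its own subshell, it can help to confirm from Python which interpreter the kernel is actually using. A minimal sketch using only the standard library; the helper name `in_virtualenv` is hypothetical.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Running inside a virtual environment:", in_virtualenv())
```

If this prints `False`, the notebook kernel is using the base interpreter, and the packages from `requirements.txt` were installed there rather than in `venv`.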

##### If you need an OpenAI key or a Kaggle username and key, contact Kevin
Cell number & WhatsApp: +55 16 99740 8851

#### Insert your OpenAI key and your Kaggle username and key here

In [None]:
OPENAI_API_KEY = 'INSERT_YOUR_OPEN_AI_KEY'
KAGGLE_USERNAME = "INSERT_YOUR_KAGGLE_USERNAME"
KAGGLE_KEY = "INSERT_YOUR_KAGGLE_KEY"
DOWNLOAD_DIR = 'data_financial/'
COMPETITION_NAME = 'mateuspicanco/financial-phrase-bank-portuguese-translation'
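
Hard-coded keys are easy to commit by accident. As an alternative, the same constants could be read from environment variables, falling back to the placeholders. This is only a sketch: the helper `read_secret` is hypothetical, and the environment variable names are assumed to match the constant names.

```python
import os


def read_secret(name: str, placeholder: str) -> str:
    """Return the environment variable `name`, or `placeholder` if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    return value if value else placeholder


OPENAI_API_KEY = read_secret("OPENAI_API_KEY", "INSERT_YOUR_OPEN_AI_KEY")
KAGGLE_USERNAME = read_secret("KAGGLE_USERNAME", "INSERT_YOUR_KAGGLE_USERNAME")
KAGGLE_KEY = read_secret("KAGGLE_KEY", "INSERT_YOUR_KAGGLE_KEY")
```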

#### Libraries

In [None]:
import streamlit as st
import pandas as pd
import json
import openai
import os
import re
import matplotlib.pyplot as plt
import glob
import shutil
from zipfile import ZipFile
from kaggle.api.kaggle_api_extended import KaggleApi

In [None]:
from typing import Type
from pydantic import BaseModel, Field
from textblob import TextBlob
from sklearn.metrics import classification_report
from keybert import KeyBERT
from operator import itemgetter

In [None]:
from langchain.tools import BaseTool
from langchain.agents import initialize_agent, Tool
from langchain.agents.agent_types import AgentType
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain_core.runnables import RunnableLambda
from langchain_core.messages import HumanMessage, AIMessage
from langchain_experimental.agents import create_csv_agent
from transformers import BartTokenizer, BartForConditionalGeneration

In [None]:
def remove_zip_files(directory):
    # Delete every .zip file under the directory, including subfolders
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.lower().endswith('.zip'):
                os.remove(os.path.join(root, file))


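def get_filename():
    # The Kaggle client reads its credentials from the environment
    os.environ['KAGGLE_USERNAME'] = KAGGLE_USERNAME
    os.environ['KAGGLE_KEY'] = KAGGLE_KEY

    api = KaggleApi()
    api.authenticate()

    # Start from an empty download directory
    shutil.rmtree(DOWNLOAD_DIR, ignore_errors=True)
    os.makedirs(DOWNLOAD_DIR, exist_ok=True)

    # COMPETITION_NAME holds a dataset slug ("owner/dataset"), not a competition
    api.dataset_download_files(COMPETITION_NAME, path=DOWNLOAD_DIR)

    # Extract every downloaded archive, then delete the zip files
    for zip_path in glob.glob(os.path.join(DOWNLOAD_DIR, '*.zip')):
        with ZipFile(zip_path, 'r') as archive:
            archive.extractall(path=DOWNLOAD_DIR)
    remove_zip_files(DOWNLOAD_DIR)

    # Return the CSV path without changing the working directory,
    # so the cell can be re-run safely
    csv_files = sorted(glob.glob(os.path.join(DOWNLOAD_DIR, '*.csv')))
    if not csv_files:
        raise FileNotFoundError(f'No CSV file found in {DOWNLOAD_DIR}')
    return csv_files[0]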
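
The unzip-and-locate step can be tried without Kaggle credentials. The sketch below builds a small archive in a temporary directory, extracts it, deletes the zip, and returns the CSV path. The helper name `extract_first_csv` and the sample file contents are made up for illustration.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_first_csv(download_dir: Path) -> Path:
    """Extract every zip in download_dir, delete the archives, return the first CSV."""
    for zip_path in sorted(download_dir.glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(download_dir)
        zip_path.unlink()  # the archive is no longer needed
    csv_files = sorted(download_dir.glob("*.csv"))
    if not csv_files:
        raise FileNotFoundError(f"No CSV file found in {download_dir}")
    return csv_files[0]


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    # Stand-in for the downloaded dataset archive (contents are illustrative only)
    with zipfile.ZipFile(tmp_dir / "dataset.zip", "w") as archive:
        archive.writestr("sample.csv", "y,text,text_pt\nneutral,hello,ola\n")
    csv_path = extract_first_csv(tmp_dir)
    remaining_zips = list(tmp_dir.glob("*.zip"))
    print(csv_path.name, remaining_zips)
```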


In [None]:
DF_PATH = get_filename()

#### Functions to summarize text, extract keywords, run sentiment analysis, and score accuracy

In [None]:
# Get text summary
SUMMARY_MODEL_NAME = 'facebook/bart-large-cnn'
# Load the tokenizer and model once instead of on every call
summary_tokenizer = BartTokenizer.from_pretrained(SUMMARY_MODEL_NAME)
summary_model = BartForConditionalGeneration.from_pretrained(SUMMARY_MODEL_NAME)

def summarize_text(text):
    # Tokenize and encode the text, truncating at 1024 tokens.
    # No "summarize: " prefix is used: that is a T5 convention, not a BART one.
    inputs = summary_tokenizer.encode(text, return_tensors="pt", max_length=1024, truncation=True)

    # Generate the summary with beam search
    summary_ids = summary_model.generate(inputs, max_length=150, min_length=50, length_penalty=2.0, num_beams=4, early_stopping=True)

    # Decode the summary
    summary = summary_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary
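
The tokenizer call truncates the input at 1024 tokens, so anything beyond that is silently dropped. One way to handle longer text is to split it into chunks and summarize each chunk separately. The sketch below splits on whitespace as a rough stand-in for tokens; `chunk_words` and the 400-word default are assumptions, not part of the notebook.

```python
from typing import List


def chunk_words(text: str, max_words: int = 400) -> List[str]:
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


chunks = chunk_words("one two three four five", max_words=2)
print(chunks)  # ['one two', 'three four', 'five']
```

Each chunk could then be passed to `summarize_text` and the partial summaries joined into one result.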

In [None]:
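# Get keywords from text
kw_model = KeyBERT()  # Load the embedding model once instead of on every call

def get_keyword(text):
    # extract_keywords returns (keyword, score) pairs sorted by relevance
    return [keyword for keyword, _ in kw_model.extract_keywords(text, top_n=3)]

# Get sentiment as positive, negative or neutral
def get_sentiment(text):
    blob = TextBlob(text)

    return get_Analysis(blob.sentiment.polarity)

def get_Analysis(score):
    # TextBlob polarity ranges from -1.0 (negative) to 1.0 (positive)
    if score < 0:
        return 'negative'
    elif score == 0:
        return 'neutral'
    else:
        return 'positive'

def get_accuracy(prediction, labeled):
    # classification_report expects (y_true, y_pred), so the gold labels go first
    return classification_report(labeled, prediction, target_names=['negative', 'neutral', 'positive'])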
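
`get_Analysis` treats only a polarity of exactly 0 as neutral, so any slightly non-zero score is pushed to positive or negative. A common variation is to treat a small band around zero as neutral. The sketch below shows that idea; `label_polarity` and the 0.05 threshold are illustrative assumptions, not tuned values.

```python
def label_polarity(score: float, threshold: float = 0.05) -> str:
    """Map a polarity score in [-1, 1] to a label, with a neutral band around zero."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


for score in (-0.6, -0.01, 0.0, 0.03, 0.4):
    print(score, label_polarity(score))
```

Whether a band like this improves agreement with the dataset labels would have to be checked against the `y` column.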

#### Tool declaration for LangChain

In [None]:
class SummarizeText(BaseTool):
    name: str = "Summarize Text"
    description: str = "use this tool when you need to make a summary of a text"

    def _run(self, text: str):
        return summarize_text(text)

    async def _arun(self, text: str):
        raise NotImplementedError("This tool does not support async")

class KeyWordsText(BaseTool):
    name: str = "Key Words Text"
    description: str = "use this tool when you need to extract keywords from a text"

    def _run(self, text: str):
        return get_keyword(text)

    async def _arun(self, text: str):
        raise NotImplementedError("This tool does not support async")


class SentimentAnalysis(BaseTool):
    name: str = "Sentiment Analysis"
    description: str = "use this tool when you need to do a sentiment analysis"

    def _run(self, text: str):
        return get_sentiment(text)

    async def _arun(self, text: str):
        raise NotImplementedError("This tool does not support async")

In [None]:
csv_agent = create_csv_agent(
        ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613", openai_api_key=OPENAI_API_KEY),
        DF_PATH,
        verbose=True,
        agent_type=AgentType.OPENAI_FUNCTIONS,
    )

Pandas_Df_Financial = Tool(name='csv_agent',func=csv_agent.run,description='Useful to read csv file')

In [None]:
tools = [SummarizeText(),KeyWordsText(),SentimentAnalysis(),Pandas_Df_Financial]

#### LLM LangChain Declaration
###### For this model I used a ChatGPT model with conversational memory that reads the dataset through a tool

In [None]:
llm = ChatOpenAI(
        openai_api_key = OPENAI_API_KEY,
        temperature=0,
        model_name='gpt-3.5-turbo'
)

conversational_memory = ConversationBufferWindowMemory(
        memory_key='chat_history',
        k=5,
        return_messages=True
)

agent = initialize_agent(
    agent='chat-conversational-react-description',
    tools=tools,
    llm=llm,
    verbose=False,
    max_iterations=3,
    early_stopping_method='generate',
    memory=conversational_memory
)
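
`ConversationBufferWindowMemory` with `k=5` keeps only the five most recent exchanges in the prompt. The behavior can be pictured with a bounded queue from the standard library. This is an analogy, not LangChain's implementation; `WindowMemory` is a hypothetical class.

```python
from collections import deque
from typing import Deque, List, Tuple


class WindowMemory:
    """Keep only the last k (user, assistant) exchanges."""

    def __init__(self, k: int = 5) -> None:
        self._turns: Deque[Tuple[str, str]] = deque(maxlen=k)

    def save(self, user: str, assistant: str) -> None:
        # When the deque is full, appending drops the oldest exchange
        self._turns.append((user, assistant))

    def history(self) -> List[Tuple[str, str]]:
        return list(self._turns)


memory = WindowMemory(k=5)
for i in range(1, 8):
    memory.save(f"question {i}", f"answer {i}")

# Only exchanges 3 through 7 remain after seven saves
print([user for user, _ in memory.history()])
```

A small window keeps prompts short, at the cost of the agent forgetting anything said more than `k` exchanges ago.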

#### Prompts for test

In [None]:
agent("How many rows there are in the dataset")['output'] # Test reading the dataframe



> Entering new AgentExecutor chain...

Invoking: `python_repl_ast` with `{'query': 'df.shape[0]'}`


4845
The number of rows in the dataframe is 4845.

> Finished chain.


'The number of rows in the dataset is 4845.'

In [None]:
agent("Show me the columns of the dataframe")['output'] # Test manipulation of dataframe



> Entering new AgentExecutor chain...
The columns in the dataframe are "y", "text", and "text_pt".

> Finished chain.


"The columns in the dataframe are 'y', 'text', and 'text_pt'."

In [None]:
agent("Make me a summary of the text column")['output'] # Testing text summary tool

"The summary of the text column is: Use the weekly Newsquiz to test your knowledge of stories you saw on CNN.com and CNN iReport.com for a chance to win a $1,000 gift card to help support CNN's iReport project."

In [None]:
agent("Make me a summary of the text_pt column")['output'] # Testing text summary tool for Portuguese

'The summary of the text_pt column is: text_PT is a text-based search engine. Text-based searches can be used to search for text. text-search.com is a search engine that searches for text in text format.'

In [None]:
agent("Make me a summary of the text column first cell")['output'] # Testing text summary tool

"The summary of the text column's first cell is: Use the weekly Newsquiz to test your knowledge of stories you saw on CNN.com and CNN iReport. You could win a $1,000 gift card to help support CNN's iReport project."

In [None]:
agent("Make me a sentiment analysis of the text column firt item")['output'] # Testing text sentiment analysis

"The sentiment analysis of the text column's first item is positive."

In [None]:
agent("Make me a keyword extraction of the text column firt item")['output'] # Testing Keywords extraction

"The keywords extracted from the text column's first item are 'cnn', 'newsquiz', and 'stories'."

In [None]:
agent("Make me a sentiment analysis from all dataset")['output'] # Testing sentiment analysis from all dataset

'The sentiment analysis of the entire dataset is neutral.'

In [None]:
agent("Make me a summary from all values of text column")['output'] # Testing all dataset summary

"The summary of all values in the text column is: Use the weekly Newsquiz to test your knowledge of stories you saw on CNN.com and CNN iReport. You could win a $1,000 gift card to help support CNN's iReport project."

### Testing sentiment analysis accuracy

In [None]:
df = pd.read_csv(DF_PATH)
df['sentiment'] = df['text'].apply(get_sentiment)

In [None]:
print(get_accuracy(df['sentiment'],df['y']))

              precision    recall  f1-score   support

    negative       0.38      0.30      0.34       761
     neutral       0.54      0.63      0.58      2468
    positive       0.43      0.36      0.39      1616

    accuracy                           0.49      4845
   macro avg       0.45      0.43      0.44      4845
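weighted avg       0.48      0.49      0.48      4845

In this report the predictions were passed to `classification_report` as the first argument (`y_true`), so the support column counts predicted labels, and the precision and recall columns are swapped relative to scikit-learn's usual convention. The overall accuracy of 0.49 is unaffected by the argument order.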
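
To make the columns of the report concrete, the sketch below computes precision and recall for one class directly from two label lists. The helper name `precision_recall` and the toy labels are made up for illustration and are not taken from the dataset.

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[str],
                     y_pred: Sequence[str],
                     label: str) -> Tuple[float, float]:
    """Precision and recall for one class, treating y_true as the gold labels."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)
    actual = sum(1 for t in y_true if t == label)
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


gold = ["positive", "neutral", "neutral", "negative", "positive"]
pred = ["positive", "positive", "neutral", "neutral", "positive"]

correct_order = precision_recall(gold, pred, "positive")
swapped_order = precision_recall(pred, gold, "positive")
print(correct_order)
print(swapped_order)
```

Swapping the two arguments swaps precision and recall, which is why the argument order matters when calling `classification_report`.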

