# Reviews Sentiment

For: Pao Pao

The reviews are given as locations across all dates. Therefore, the two main complexities here are. 

1. We need to link the location to the company. This should be difficult.
2. We will need to accumulate this up to a quarter level. This should be relatively easy.

In the very basic form we basically want the output to be a csv file in the format below. (ideally order by quarter_year then by ticker but doesn't matter). `news_sentiment` should be values between 0 to 1 where the value vaguely represents the probability of a positive sentiment. Or -1 to 1 where -1 is neg and 1 is pos. This depends on you but *make it clear with a markdown at the end.*


| ticker | quarter_year  | reviews_sentiment |
|--------|---------------|-------------------|
| BAC    | Q1 2001       | 0.2               |
| JPM    | Q1 2001       | 0.67              |
| WFC    | Q1 2001       | 0.97              |


Of course some averaging will be needed so to prevent data loss, you could have multiple columns representing upper quartile sentiment, mean sentiment lower quartile sentiment for example. Ideally, you should have 2 output files; 1 for revenue and 1 for CAR.

The main difference between reviews and the other 2 text data is reviews are not finance based so a model like FinBERT is not suitable. Perhaps, GPT 0-shot classification might work better.

Be creative!

In [51]:
import pandas as pd
import re
import numpy as np
from tqdm import tqdm
import math

In [None]:
import torch
torch.backends.mps.is_available()

True

In [56]:
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

In [2]:
reviews_data = pd.read_csv("data/text/reviews/extracted_reviews_24032025.csv")
reviews_data_2 = pd.read_csv("data/text/reviews/detailed_reviews_fix.csv")
reviews_data = pd.concat([reviews_data, reviews_data_2])

In [3]:
def text_preprocessing_reviews(text):
    """Write the text preprocessing function here. This should work through the `df.apply()` function"""
    return text

In [4]:
def sentiment_analysis_reviews(reviews_data: pd.DataFrame):
    """This function should take in the news data and output the final csv file dataframe"""
    output_data = reviews_data.copy()
    return output_data

In [5]:
reviews_data.head()

Unnamed: 0,place_id,place_name,review_id,name,reviewer_profile,rating,review_text,published_at,published_at_date,response_from_owner_text,response_from_owner_ago,response_from_owner_date,review_likes_count,total_number_of_reviews_by_reviewer,total_number_of_photos_by_reviewer,is_local_guide,review_translated_text,response_from_owner_translated_text
0,ChIJdwZ-KRPGt4kRV303Br-IDvM,Bank of America (with Drive-thru ATM),ChZDSUhNMG9nS0VJQ0FnTURRNzRDUWVREAE,Berta Flores,https://www.google.com/maps/contrib/1171003832...,5,They treated me excellently,a week ago,2025-03-16T12:41:47,Thank you for the positive review about your r...,a week ago,2025-03-16T12:41:47,0,2.0,,,Me atendieron excelente,
1,ChIJdwZ-KRPGt4kRV303Br-IDvM,Bank of America (with Drive-thru ATM),ChdDSUhNMG9nS0VJQ0FnTUNBbzlqYnlnRRAB,Rony Tiulcaz,https://www.google.com/maps/contrib/1055339603...,5,,a month ago,2025-02-23T12:41:47,,,,0,1.0,,,,
2,ChIJdwZ-KRPGt4kRV303Br-IDvM,Bank of America (with Drive-thru ATM),ChZDSUhNMG9nS0VJQ0FnSUNfN05IM093EAE,alonzo remodeling,https://www.google.com/maps/contrib/1027286017...,5,Diego good service.,2 months ago,2025-01-23T12:41:47,Hello. Thanks for letting us know how helpful ...,2 months ago,2025-01-23T12:41:47,0,1.0,,,Diego buen cervicio.,
3,ChIJdwZ-KRPGt4kRV303Br-IDvM,Bank of America (with Drive-thru ATM),ChdDSUhNMG9nS0VJQ0FnSURmMWVMN3BnRRAB,Edgar Castro,https://www.google.com/maps/contrib/1089460770...,5,"They gave me a good service, specially Mr Dieg...",2 months ago,2025-01-23T12:41:47,Wonderful! Thanks for letting us know how help...,2 months ago,2025-01-23T12:41:47,0,1.0,,,,
4,ChIJdwZ-KRPGt4kRV303Br-IDvM,Bank of America (with Drive-thru ATM),ChZDSUhNMG9nS0VJQ0FnSURmNVo2dldREAE,Z Z,https://www.google.com/maps/contrib/1110376128...,5,"Mr. Diego Gomez! Great person, great customer ...",2 months ago,2025-01-23T12:41:47,Thank you for letting us know how helpful Dieg...,2 months ago,2025-01-23T12:41:47,0,2.0,,,,


## First Task: Map each row to a company ticker

In [6]:
# Data overview
df_overview = pd.read_csv("data/text/reviews/data_overview.csv")
df_overview_2 = pd.read_csv("data/text/reviews/data_overview.csv")
df_overview = pd.concat([df_overview, df_overview_2])

  df_overview = pd.read_csv("data/text/reviews/data_overview.csv")
  df_overview_2 = pd.read_csv("data/text/reviews/data_overview.csv")


In [7]:
df_overview.head()

Unnamed: 0,place_id,name,description,is_spending_on_ads,reviews,rating,competitors,website,phone,can_claim,...,featured_image,main_category,categories,workday_timing,is_temporarily_closed,closed_on,address,review_keywords,link,query
0,ChIJdwZ-KRPGt4kRV303Br-IDvM,Bank of America (with Drive-thru ATM),"Bank of America is proud to serve Hyattsville,...",,92,2.1,Name: Wells Fargo Bank\nLink: https://www.goog...,https://locators.bankofamerica.com/md/hyattsvi...,+1 301-408-4400,,...,https://lh3.ggpht.com/p/AF1QipOxwo4k--9lWgz3Gd...,thnaakhaar,"thnaakhaar, thiiprueksaadaankaarengin, nakwaan...",9:00-17:00,,wan`aathity,"7950 New Hampshire Ave STE A, Hyattsville, MD ...",,https://www.google.com/maps/place/Bank+of+Amer...,"bank in langley park, usa"
1,ChIJM89ARhPGt4kR4bwPgqhSNqU,Capital One Bank,Your place for all your banking needs. Get Amb...,,39,2.9,Name: Wells Fargo Bank\nLink: https://www.goog...,http://www.capitalonebank.com/?external_id=ENT...,+1 301-439-7900,,...,https://lh3.ggpht.com/p/AF1QipNZFpCzdKKjlj7GR-...,thnaakhaar,"thnaakhaar, tuue`thiie`m, tawaethnsinechuue`, ...",9:00-17:00,,wan`aathity,"1181 University Blvd E, Takoma Park, MD 20912 ...",,https://www.google.com/maps/place/Capital+One+...,"bank in langley park, usa"
2,ChIJO-BixRTGt4kR8P-ihLEKrs0,Citi,"Citi is a financial services company, which of...",,30,2.7,,https://www.citi.com/?utm_source=gmb&utm_mediu...,+1 240-398-3074,,...,https://lh3.ggpht.com/p/AF1QipN3X9JIoQ7-V3s-iT...,thnaakhaar,"thnaakhaar, tuue`thiie`m",10:00-17:00,,wan`aathity,"7633 New Hampshire Ave, Takoma Park, MD 20912 ...",,https://www.google.com/maps/place/Citi/data=!4...,"bank in langley park, usa"
3,ChIJC4llRxPGt4kRtWJze1u2lFk,Wells Fargo Bank,We're here for you in Takoma Park to help supp...,,51,3.1,Name: Bank of America (with Drive-thru ATM)\nL...,https://www.wellsfargo.com/locator/bank/1175__...,+1 301-650-1083,,...,https://lh3.ggpht.com/p/AF1QipPA3kXLZQw7m6kkQX...,thnaakhaar,"thnaakhaar, tuue`thiie`m",9:00-17:00,,wan`aathity,"1175 University Blvd E, Takoma Park, MD 20912 ...",,https://www.google.com/maps/place/Wells+Fargo+...,"bank in langley park, usa"
4,ChIJ-_uKky3Gt4kRcdMbJj1CwOk,Bank of America (with Drive-thru ATM),"Bank of America is proud to serve Hyattsville,...",,45,1.7,Name: Walmart Supercenter\nLink: https://www.g...,https://locators.bankofamerica.com/md/hyattsvi...,+1 301-270-7990,,...,https://lh3.ggpht.com/p/AF1QipNJvoiJeDs74FQEv6...,thnaakhaar,"thnaakhaar, thiiprueksaadaankaarengin, nakwaan...",10:00-16:00,,"wanesaar, wan`aathity","6495 New Hampshire Ave, Hyattsville, MD 20783 ...",,https://www.google.com/maps/place/Bank+of+Amer...,"bank in langley park, usa"


In [8]:
map_place_id_website = df_overview[["place_id", "name", "website"]].drop_duplicates()
map_place_id_website.dropna(subset=["place_id", "website"], inplace=True)
map_place_id_website.reset_index(drop=True, inplace=True)

In [9]:
map_place_id_website

Unnamed: 0,place_id,name,website
0,ChIJdwZ-KRPGt4kRV303Br-IDvM,Bank of America (with Drive-thru ATM),https://locators.bankofamerica.com/md/hyattsvi...
1,ChIJM89ARhPGt4kR4bwPgqhSNqU,Capital One Bank,http://www.capitalonebank.com/?external_id=ENT...
2,ChIJO-BixRTGt4kR8P-ihLEKrs0,Citi,https://www.citi.com/?utm_source=gmb&utm_mediu...
3,ChIJC4llRxPGt4kRtWJze1u2lFk,Wells Fargo Bank,https://www.wellsfargo.com/locator/bank/1175__...
4,ChIJ-_uKky3Gt4kRcdMbJj1CwOk,Bank of America (with Drive-thru ATM),https://locators.bankofamerica.com/md/hyattsvi...
...,...,...,...
72261,ChIJi6zQltqd7ocRSV3YU019enA,GreenState Credit Union,https://www.greenstate.org/
72262,ChIJAVBHeAee7ocRVm_xK6q_HAA,Regions Bank,https://www.regions.com/locator/ia/urbandale/f...
72263,ChIJUfuiU-Wd7ocRbR8Fh7Bym2c,Wells Fargo Bank,https://www.wellsfargo.com/locator/bank/8301__...
72264,ChIJE4xTp_Cd7ocRdDK936ganHE,Midwest Heritage,https://www.midwestheritage.com/


In [10]:
df_fundamentals = pd.read_csv("data/fundamentals/banking_fundamental_drive.csv")

  df_fundamentals = pd.read_csv("data/fundamentals/banking_fundamental_drive.csv")


In [11]:
from companies import small_banks, medium_banks, large_banks

In [12]:
all_banks = small_banks + medium_banks + large_banks

In [13]:
map_ticker_web_url = df_fundamentals[df_fundamentals["tic"].isin(all_banks)][["tic", "weburl"]].drop_duplicates()
map_ticker_web_url.dropna(inplace=True)
map_ticker_web_url.reset_index(drop=True, inplace=True)

In [14]:
def extract_domain(url):
    # Remove protocol and path if present
    cleaned_url = re.sub(r'^https?:\/\/', '', url)
    cleaned_url = cleaned_url.split('/', 1)[0]
    
    # Special case for q4ir.com domains - extract the company name
    q4ir_pattern = r'([a-zA-Z0-9-]+)\.q4ir\.com$'
    q4ir_match = re.search(q4ir_pattern, cleaned_url)
    if q4ir_match:
        return q4ir_match.group(1).split(".")[0]
    
    # General pattern for domain.tld or domain.co.uk style domains
    domain_pattern = r'(?:[\w-]+\.)?([a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)$'
    match = re.search(domain_pattern, cleaned_url)
    
    if match:
        string_ = match.group(1)
        return string_.split(".")[0]
    return 0

In [15]:
map_ticker_web_url["domain"] = map_ticker_web_url["weburl"].apply(extract_domain)
map_ticker_web_url["domain"] = map_ticker_web_url["domain"].str.replace("www.", "")
map_ticker_web_url["domain"] = map_ticker_web_url["domain"].str.replace("https://", "")
map_ticker_web_url["domain"] = map_ticker_web_url["domain"].str.replace("http://", "")
map_ticker_web_url["domain"] = map_ticker_web_url["domain"].str.replace("/", "")
map_ticker_web_url["domain"] = map_ticker_web_url["domain"].str.replace(" ", "")
map_ticker_web_url["domain"] = map_ticker_web_url["domain"].str.lower()

In [16]:
map_ticker_web_url

Unnamed: 0,tic,weburl,domain
0,JPM,www.jpmorganchase.com,jpmorganchase
1,CMA,www.comerica.com,comerica
2,CFR,www.frostbank.com,frostbank
3,RF,www.regions.com,regions
4,TRMK,www.trustmark.com,trustmark
...,...,...,...
110,RIVE,www.riverviewbankpa.com,riverviewbankpa
111,FCB,floridacommunitybank.com/,floridacommunitybank
112,CBF,www.capitalbank-us.com/,capitalbank-us
113,NBHC,www.nationalbankholdings.com,nationalbankholdings


In [17]:
len(map_ticker_web_url["tic"].unique())  # unique tickers

115

In [18]:
len(map_ticker_web_url["domain"].unique())  # unique domains

115

In [19]:
map_place_id_website["domain"] = map_place_id_website["website"].apply(extract_domain)
map_place_id_website["domain"] = map_place_id_website["domain"].str.replace("www.", "")
map_place_id_website["domain"] = map_place_id_website["domain"].str.replace("https://", "")
map_place_id_website["domain"] = map_place_id_website["domain"].str.replace("http://", "")
map_place_id_website["domain"] = map_place_id_website["domain"].str.replace("/", "")
map_place_id_website["domain"] = map_place_id_website["domain"].str.replace(" ", "")
map_place_id_website["domain"] = map_place_id_website["domain"].str.lower()

In [20]:
map_place_id_website.head(20)

Unnamed: 0,place_id,name,website,domain
0,ChIJdwZ-KRPGt4kRV303Br-IDvM,Bank of America (with Drive-thru ATM),https://locators.bankofamerica.com/md/hyattsvi...,bankofamerica
1,ChIJM89ARhPGt4kR4bwPgqhSNqU,Capital One Bank,http://www.capitalonebank.com/?external_id=ENT...,capitalonebank
2,ChIJO-BixRTGt4kR8P-ihLEKrs0,Citi,https://www.citi.com/?utm_source=gmb&utm_mediu...,citi
3,ChIJC4llRxPGt4kRtWJze1u2lFk,Wells Fargo Bank,https://www.wellsfargo.com/locator/bank/1175__...,wellsfargo
4,ChIJ-_uKky3Gt4kRcdMbJj1CwOk,Bank of America (with Drive-thru ATM),https://locators.bankofamerica.com/md/hyattsvi...,bankofamerica
5,ChIJGWfLnNbFt4kRuMn3WNrQNMo,Bank of America (with Drive-thru ATM),https://locators.bankofamerica.com/md/adelphi/...,bankofamerica
6,ChIJC8CUxoLIt4kRSmWzkm6YrNQ,Truist,https://www.truist.com/branch/md/takoma-park/2...,truist
7,ChIJLxx2PrHHt4kRgx5kxlEBEeo,Bank of America ATM (Drive-thru),https://locators.bankofamerica.com/md/hyattsvi...,bankofamerica
8,ChIJS_PE_ADHt4kR4zbhM7KWPIg,Bank of America (with Drive-thru ATM),https://locators.bankofamerica.com/md/hyattsvi...,bankofamerica
9,ChIJ6UYQQRPGt4kRpQRqhIGcecE,Wells Fargo ATM,https://www.wellsfargo.com/locator/bank/1175__...,wellsfargo


In [21]:
domain_ggm = map_place_id_website["domain"].unique()
domain_funda = map_ticker_web_url["domain"].unique()
domain_ggm = set(domain_ggm)
domain_funda = set(domain_funda)

In [22]:
len(domain_ggm.intersection(domain_funda))  # common domains

73

In [23]:
len(domain_ggm.difference(domain_funda))  # common domains

7057

### *Special Cases*

- JPM: www.chase.com
- NEWT: www.newtekbusinessservices.com
- WTNY: www.hancockwhitney.com   # Already got aquired and changed ticker to HWC but HWC does not exist in our bank list
- ZION: www.zionsbank.com
- AROW: www.arrowbank.com
- OFG: www.orientalbank.com
- SUSQ: scb.bank
- MBTF: www.mtb.com
- NBTB: www.nbtbank.com
- PRK: parknationalbank.com
- NCOM: www.nbcbanking.com
- BOKF: www.bokfinancial.com
- NKSH: nbbank.com
- STEL: www.stellar.bank
- ATLO: boonebankiowa.com
- ATLO: rsbiowa.com
- ATLO: bankubt.com
- ATLO: banksbt.com
- ATLO: issbbank.com
- ATLO: fnb247.com.com
- BSF: www.arvest.com
- EMCF: ir.farmersbankgroup.com/
- EVBN: www.evansbank.com
- BRBS: www.mybrb.bank
- HWBK: www.hawthornbank.com
- FISI: www.five-starbank.com
- SHBI: shoreunitedbank.com
- FSGI: www.firstsg.com
- FSGI: www.fsbank.com
- RIVE: www.riverviewbank.com
- NBHC: www.cobnks.com
- NBHC: www.bankmw.com
- NBHC: www.hillcrestbank.com
- NBHC: bankofjacksonhole.com
- NBHC: bojh.com


More fix

- C: www.city.com
- COF: www.capitalonebank.com
- WU: www.westernunion.com

In [24]:
special_case_data = [
    ("JPM", "www.chase.com"),
    ("NEWT", "www.newtekbusinessservices.com"),
    ("WTNY", "www.hancockwhitney.com"),
    ("ZION", "www.zionsbank.com"),
    ("AROW", "www.arrowbank.com"),
    ("OFG", "www.orientalbank.com"),
    ("SUSQ", "scb.bank"),
    ("MBTF", "www.mtb.com"),
    ("NBTB", "www.nbtbank.com"),
    ("PRK", "parknationalbank.com"),
    ("NCOM", "www.nbcbanking.com"),
    ("BOKF", "www.bokfinancial.com"),
    ("NKSH", "nbbank.com"),
    ("STEL", "www.stellar.bank"),
    ("ATLO", "boonebankiowa.com"),
    ("ATLO", "rsbiowa.com"),
    ("ATLO", "bankubt.com"),
    ("ATLO", "banksbt.com"),
    ("ATLO", "issbbank.com"),
    ("ATLO", "fnb247.com.com"),
    ("BSF", "www.arvest.com"),
    ("EMCF", "ir.farmersbankgroup.com/"),
    ("EVBN", "www.evansbank.com"),
    ("BRBS", "www.mybrb.bank"),
    ("HWBK", "www.hawthornbank.com"),
    ("FISI", "www.five-starbank.com"),
    ("SHBI", "shoreunitedbank.com"),
    ("FSGI", "www.firstsg.com"),
    ("FSGI", "www.fsbank.com"),
    ("RIVE", "www.riverviewbank.com"),
    ("NBHC", "www.cobnks.com"),
    ("NBHC", "www.bankmw.com"),
    ("NBHC", "www.hillcrestbank.com"),
    ("NBHC", "bankofjacksonhole.com"),
    ("NBHC", "bojh.com"),
    ("C", "www.city.com"),
    ("COF", "www.capitalonebank.com"),
    ("WU", "www.westernunion.com")
]

df_fix = pd.DataFrame(special_case_data, columns=["tic", "weburl"])


In [25]:
df_fix["domain"] = df_fix["weburl"].apply(extract_domain)
df_fix["domain"] = df_fix["domain"].str.replace("www.", "")
df_fix["domain"] = df_fix["domain"].str.replace("https://", "")
df_fix["domain"] = df_fix["domain"].str.replace("http://", "")
df_fix["domain"] = df_fix["domain"].str.replace("/", "")
df_fix["domain"] = df_fix["domain"].str.replace(" ", "")
df_fix["domain"] = df_fix["domain"].str.lower()

In [26]:
map_ticker_web_url_fix = pd.concat([map_ticker_web_url, df_fix], ignore_index=True)

In [27]:
len(map_place_id_website)

72266

In [28]:
map_ticker_web_url_fix = map_ticker_web_url_fix[["tic", "domain"]]
map_place_id_website = map_place_id_website[["place_id", "domain"]]
map_place_id_website = pd.merge(map_place_id_website, map_ticker_web_url_fix, on="domain", how="left")

In [29]:
map_id_tic = map_place_id_website[~map_place_id_website["tic"].isna()][["place_id", "tic"]].copy()
map_id_tic = map_id_tic.drop_duplicates()
map_id_tic.dropna(subset=["place_id", "tic"], inplace=True)
map_id_tic = map_id_tic[map_id_tic["tic"].isin(all_banks)]
map_id_tic.reset_index(drop=True, inplace=True)

In [30]:
len(map_id_tic["tic"].unique())  # unique tickers

97

In [31]:
# map id to tic dictionary
map_id_tic_dict = map_id_tic.set_index("place_id")["tic"].to_dict()

In [32]:
len(map_id_tic_dict)

30259

### Create The Dataframe for sentiment analysis

In [33]:
df_reviews = reviews_data[["place_id", "review_text", "rating", "published_at_date"]].copy()

In [34]:
df_reviews["tic"] = df_reviews["place_id"].apply(lambda x: map_id_tic_dict.get(x, np.nan))

In [35]:
df_reviews = df_reviews.dropna(subset=["tic", "review_text"])

In [36]:
df_reviews

Unnamed: 0,place_id,review_text,rating,published_at_date,tic
0,ChIJdwZ-KRPGt4kRV303Br-IDvM,They treated me excellently,5,2025-03-16T12:41:47,BAC
2,ChIJdwZ-KRPGt4kRV303Br-IDvM,Diego good service.,5,2025-01-23T12:41:47,BAC
3,ChIJdwZ-KRPGt4kRV303Br-IDvM,"They gave me a good service, specially Mr Dieg...",5,2025-01-23T12:41:47,BAC
4,ChIJdwZ-KRPGt4kRV303Br-IDvM,"Mr. Diego Gomez! Great person, great customer ...",5,2025-01-23T12:41:47,BAC
5,ChIJdwZ-KRPGt4kRV303Br-IDvM,This is the best civilian customer service I'v...,5,2024-10-23T12:41:47,BAC
...,...,...,...,...,...
117130,ChIJ3QxDDOWewoAR8oxteOE1J5Y,Great service,5,2023-03-24T12:19:04,JPM
117133,ChIJ3QxDDOWewoAR8oxteOE1J5Y,Good customer service,5,2020-03-24T12:19:04,JPM
117135,ChIJ3QxDDOWewoAR8oxteOE1J5Y,First off this branch is NOT located inside th...,1,2019-03-24T12:19:04,JPM
117136,ChIJ3QxDDOWewoAR8oxteOE1J5Y,Great experience....exactly what you want a ba...,5,2018-03-24T12:19:04,JPM


## Sentiment using TextBlob

In [38]:
from textblob import TextBlob

def get_blob_sentiment(text):
    blob = TextBlob(text)
    sentiment = blob.sentiment  # returns a namedtuple (polarity, subjectivity)
    return sentiment.polarity  # Range: [-1.0, 1.0]

In [None]:
df_reviews["text_blob_reviews_sentiment"] = df_reviews["review_text"].apply(get_blob_sentiment)

## Sentiment using Vader

In [44]:
# import SentimentIntensityAnalyzer class from vaderSentiment.vaderSentiment module.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

sid_obj = SentimentIntensityAnalyzer()

def get_vader_sentiment(sentence):
    sentiment_dict = sid_obj.polarity_scores(sentence)
    return sentiment_dict['neg'], sentiment_dict['pos']

In [45]:
# Apply the function and create new columns
df_reviews[['vader_reviews_sentiment_neg', 'vader_reviews_sentiment_pos']] = df_reviews['review_text'].apply(lambda x: pd.Series(get_vader_sentiment(x)))

## BERT Sentiment

In [57]:
from transformers import pipeline

# Load sentiment-analysis pipeline with a BERT-based model
sentiment_pipeline = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment", truncation=True, device=0 if device.type == "mps" else -1)

Device set to use mps:0


In [None]:
# THIS CODE TOOK 1.5 HOURS ON MY MAC, DONT RERUN, ASK ME FOR THE FILE IF YOU NEED IT.

# List of texts
texts = df_reviews['review_text'].tolist()

# Define batch size
batch_size = 64
n_batches = math.ceil(len(texts) / batch_size)

bert_results = []
for i in tqdm(range(n_batches), desc="Processing Sentiment"):
    batch = texts[i * batch_size : (i + 1) * batch_size]
    results = sentiment_pipeline(batch)
    bert_results.extend(results)

Processing Sentiment: 100%|██████████| 4447/4447 [1:30:10<00:00,  1.22s/it]  


In [None]:
# Add results back to the dataframe
df_reviews['bert_reviews_label'] = [res['label'] for res in bert_results]
df_reviews['bert_reviews_score'] = [res['score'] for res in bert_results]

In [64]:
df_reviews['bert_reviews_label'] = df_reviews['bert_reviews_label'].apply(lambda x: int(str(x)[0]))

In [66]:
df_reviews.to_csv("reviews_sentiment_ungrouped.csv", index=False)

## Group data by quarter

In [67]:
df_reviews = pd.read_csv("reviews_sentiment_ungrouped.csv")

In [72]:
df_reviews_grouped = df_reviews[["published_at_date", "tic", "rating", "text_blob_reviews_sentiment", "vader_reviews_sentiment_neg", "vader_reviews_sentiment_pos", "bert_reviews_label", "bert_reviews_score"]].copy()

In [73]:
df_reviews_grouped.rename(columns={"rating": "reviews_rating"}, inplace=True)

In [77]:
df_reviews_grouped["published_at_date"] = pd.to_datetime(df_reviews_grouped["published_at_date"])
df_reviews_grouped["datacqtr"] = df_reviews_grouped["published_at_date"].dt.to_period("Q")

In [None]:
df_reviews_grouped = df_reviews_grouped.drop(columns="published_at_date")
df_reviews_grouped = df_reviews_grouped.groupby(by=["tic", "datacqtr"]).mean().reset_index()

In [88]:
df_reviews_grouped["reviews_rating"] = ((df_reviews_grouped["reviews_rating"]) - 1) / 4
df_reviews_grouped["bert_reviews_label"] = ((df_reviews_grouped["bert_reviews_label"]) - 1) / 4

In [89]:
df_reviews_grouped

Unnamed: 0,tic,datacqtr,reviews_rating,text_blob_reviews_sentiment,vader_reviews_sentiment_neg,vader_reviews_sentiment_pos,bert_reviews_label,bert_reviews_score
0,ALRS,2015Q1,0.000000,0.145455,0.114000,0.076000,0.000000,0.511831
1,ALRS,2017Q1,1.000000,0.558333,0.049500,0.563500,1.000000,0.844069
2,ALRS,2018Q1,0.777778,0.454962,0.010111,0.318000,0.861111,0.666347
3,ALRS,2019Q1,0.666667,0.301620,0.030333,0.258333,0.666667,0.650901
4,ALRS,2020Q1,0.636364,0.181895,0.030364,0.189545,0.636364,0.700691
...,...,...,...,...,...,...,...,...
1245,ZION,2024Q1,0.461290,0.122622,0.092819,0.173568,0.456452,0.690260
1246,ZION,2024Q2,0.544872,0.139245,0.049538,0.182333,0.557692,0.733532
1247,ZION,2024Q3,0.213415,0.022032,0.109634,0.132317,0.262195,0.699212
1248,ZION,2024Q4,0.295455,0.085296,0.091394,0.146667,0.280303,0.702155


In [90]:
df_reviews_grouped.to_csv("data/text_results/reviews_sentiment.csv")

In [4]:
import pandas as pd

df_reviews_grouped = pd.read_csv("data/text_results/reviews_sentiment.csv", index_col=0)

In [6]:
df_reviews_grouped.describe()

Unnamed: 0,reviews_rating,text_blob_reviews_sentiment,vader_reviews_sentiment_neg,vader_reviews_sentiment_pos,bert_reviews_label,bert_reviews_score
count,1250.0,1250.0,1250.0,1250.0,1250.0,1250.0
mean,0.606557,0.2197,0.065272,0.248015,0.590013,0.687629
std,0.249097,0.198478,0.057149,0.12875,0.251946,0.098038
min,0.0,-0.695,0.0,0.0,0.0,0.256037
25%,0.46132,0.106081,0.031357,0.17657,0.45,0.654208
50%,0.590971,0.208471,0.059839,0.232642,0.583333,0.690745
75%,0.777083,0.328077,0.087447,0.301083,0.75,0.733813
max,1.0,1.0,0.756,1.0,1.0,0.982982
