<a href="https://colab.research.google.com/github/nlnlvlc/financial-lstm-data/blob/main/financial_cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The following datasets contain financial news from various sources, to be used in sentiment analysis models. The code below prepares the data for two models:

*   an **AT-LSTM** (Attention-based Long Short-Term Memory model)
*   a **Bi-LSTM-AN** (Bidirectional Long Short-Term Memory & Adversarial Neural Network hybrid model)

The full project can be found [here](https://github.com/Alex-Gideon/635Group3Project/tree/main).


In [None]:
import pandas as pd
import re

**Pretty Clean Dataset Cleaning**

In [None]:
#import dataset financial-pretty-clean
pretty_df = pd.read_csv("financial-news-pretty-clean.csv")

pretty_df.head(5)

Unnamed: 0,Date_published,Headline,Synopsis,Full_text,Final Status
0,2022-06-21,"Banks holding on to subsidy share, say payment...",The companies have written to the National Pay...,ReutersPayments companies and banks are at log...,Negative
1,2022-04-19,Digitally ready Bank of Baroda aims to click o...,"At present, 50% of the bank's retail loans are...",AgenciesThe bank presently has 20 million acti...,Positive
2,2022-05-27,Karnataka attracted investment commitment of R...,Karnataka is at the forefront in attracting in...,PTIKarnataka Chief Minister Basavaraj Bommai.K...,Positive
3,2022-04-06,Splitting of provident fund accounts may be de...,The EPFO is likely to split accounts only at t...,Getty ImagesThe budget for FY22 had imposed in...,Negative
4,2022-06-14,Irdai weighs proposal to privatise Insurance I...,"Set up in 2009 as an advisory body, IIB collec...",AgenciesThere is a view in the insurance indus...,Positive


In [None]:
pretty_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Date_published  400 non-null    object
 1   Headline        400 non-null    object
 2   Synopsis        399 non-null    object
 3   Full_text       400 non-null    object
 4   Final Status    400 non-null    object
dtypes: object(5)
memory usage: 15.8+ KB


In [None]:
#drop unnecessary columns
pretty_df = pretty_df.drop(["Date_published", "Headline", "Synopsis"], axis=1)

pretty_df.head(5)

Unnamed: 0,Full_text,Final Status
0,ReutersPayments companies and banks are at log...,Negative
1,AgenciesThe bank presently has 20 million acti...,Positive
2,PTIKarnataka Chief Minister Basavaraj Bommai.K...,Positive
3,Getty ImagesThe budget for FY22 had imposed in...,Negative
4,AgenciesThere is a view in the insurance indus...,Positive


In [None]:
#calculate Score and place into new Column "Score"
pretty_df["Score"] = pretty_df["Final Status"].apply(
    lambda x: "10" if x == "Positive" else "1" if x == "Negative" else "7"
    )

pretty_df.head(5)

Unnamed: 0,Full_text,Final Status,Score
0,ReutersPayments companies and banks are at log...,Negative,1
1,AgenciesThe bank presently has 20 million acti...,Positive,10
2,PTIKarnataka Chief Minister Basavaraj Bommai.K...,Positive,10
3,Getty ImagesThe budget for FY22 had imposed in...,Negative,1
4,AgenciesThere is a view in the insurance indus...,Positive,10
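
The nested conditional in the cell above can also be written as a dictionary lookup, which is easier to extend if more labels appear. This is a minimal sketch, not the notebook's own code: `SCORE_MAP` and `to_score` are hypothetical names, and lower-casing the label is an added assumption so that one table covers both "Positive" and "positive".

```python
# Hypothetical lookup table mirroring the three-way scheme used above.
SCORE_MAP = {"positive": "10", "negative": "1"}

def to_score(status):
    """Return the score string for a sentiment label; anything unmatched gets "7"."""
    # str() guards against non-string cells such as NaN.
    return SCORE_MAP.get(str(status).lower(), "7")

print([to_score(s) for s in ["Positive", "negative", "Neutral"]])
```

It could be applied in the same way as the lambda, for example `pretty_df["Final Status"].apply(to_score)`.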


In [None]:
#check for any null cells
pretty_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Full_text     400 non-null    object
 1   Final Status  400 non-null    object
 2   Score         400 non-null    object
dtypes: object(3)
memory usage: 9.5+ KB


In [None]:
#count unique texts to check for duplicates
syn = set(pretty_df["Full_text"])

print(len(syn))

400


In [None]:
#create sample entry to evaluate
sample = pretty_df['Full_text'][0]

sample = ''.join(filter(lambda x: x.isalpha() or x.isspace(), sample))

#replace erroneous \n and accented letters
sample = sample.replace("\n", " ")
sample = sample.replace("â", "")
sample = sample.split(" ")
#print to identify any erroneous characters that were skipped
print(sample)

['ReutersPayments', 'companies', 'and', 'banks', 'are', 'at', 'loggerheads', 'over', 'the', 'sharing', 'of', 'governmentgranted', 'subsidies', 'for', 'building', 'payment', 'infrastructure', 'said', 'three', 'people', 'with', 'knowledge', 'of', 'the', 'matter', '', '', 'The', 'companies', 'have', 'written', 'to', 'the', 'National', 'Payments', 'Corp', 'of', 'India', 'NPCI', 'complaining', 'that', '', 'crore', 'of', 'the', '', 'crore', 'granted', 'in', 'the', 'budget', 'is', 'being', 'retained', 'by', 'banks', 'they', 'said', 'This', 'has', 'deprived', 'companies', 'connecting', 'up', 'the', 'last', 'mile', 'of', 'statepromised', 'revenues', 'according', 'to', 'them', 'The', 'government', 'granted', 'the', 'subsidies', 'in', 'exchange', 'for', 'waiving', 'Merchant', 'Discount', 'Rate', 'MDR', 'charges', '', 'The', 'government', 'has', 'released', '', 'crore', 'worth', 'of', 'subsidies', 'to', 'banks', 'but', 'they', 'are', 'not', 'sharing', 'it', 'with', 'any', 'payment', 'aggregators',

In [None]:
#store scores as labels and each text entry into word_list
word_list = []
labels = []

#loop through each row and apply transformations used on samples to all text
#append clean text and label to respective lists
for index, row in pretty_df.iterrows():
  label = row['Score']
  line = ''.join(filter(lambda x: x.isalpha() or x.isspace(), row['Full_text']))
  line = line.replace("\n", " ")
  line = line.replace("â", "")
  word_list.append(line)
  labels.append(label)

#check that both lists are the same length
print(len(word_list))
print(len(labels))


400
400
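
The per-row transformations in the loop above could be gathered into one helper so that both datasets are cleaned identically. The sketch below is a variant under stated assumptions, not the notebook's exact behaviour: `clean_text` and its `strip_chars` parameter are hypothetical, and it additionally collapses runs of whitespace, which the loop above does not do.

```python
import re

def clean_text(text, strip_chars="â"):
    """Keep letters and whitespace, drop `strip_chars`, and collapse runs of whitespace."""
    kept = "".join(ch for ch in text if ch.isalpha() or ch.isspace())
    for ch in strip_chars:
        kept = kept.replace(ch, "")
    # \s+ also covers newlines and tabs, so no separate "\n" replacement is needed.
    return re.sub(r"\s+", " ", kept).strip()

print(clean_text("Rs 1,500 crore\nwas granted."))
```

Because digits are filtered out, amounts such as "1,500" disappear entirely, which is why the sample output above contains empty strings next to "crore".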


In [None]:
#dictionary holding both lists (named data to avoid shadowing the built-in dict)
data = {'label': labels, 'text': word_list}

#merge into a new dataframe
cleaned_df = pd.DataFrame(data)

cleaned_df.head(5)

Unnamed: 0,label,text
0,1,ReutersPayments companies and banks are at log...
1,10,AgenciesThe bank presently has million active...
2,10,PTIKarnataka Chief Minister Basavaraj BommaiKa...
3,1,Getty ImagesThe budget for FY had imposed inco...
4,10,AgenciesThere is a view in the insurance indus...


In [None]:
#save cleaned dataset to new file
cleaned_df.to_csv('/experiment-1/financial/datasets/clean_financialpc.csv', index=False)

**PhraseBank Cleaning**

In [None]:
#import dataset for full phrase bank
df = pd.read_csv("financial-phrase-bank-all.csv")

df.head(5)

Unnamed: 0,Status,Text
0,neutral,"According to Gran , the company has no plans t..."
1,neutral,Technopolis plans to develop in stages an area...
2,negative,The international electronic industry company ...
3,positive,With the new production plant the company woul...
4,positive,According to the company 's updated strategy f...


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4846 entries, 0 to 4845
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Status  4846 non-null   object
 1   Text    4846 non-null   object
dtypes: object(2)
memory usage: 75.8+ KB


In [None]:
#calculate score and place into new column 'Score'
df["Score"] = df["Status"].apply(
    lambda x: "10" if x == "positive" else "1" if x == "negative" else "7"
    )

df.head(5)

Unnamed: 0,Status,Text,Score
0,neutral,"According to Gran , the company has no plans t...",7
1,neutral,Technopolis plans to develop in stages an area...,7
2,negative,The international electronic industry company ...,1
3,positive,With the new production plant the company woul...,10
4,positive,According to the company 's updated strategy f...,10


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4846 entries, 0 to 4845
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Status  4846 non-null   object
 1   Text    4846 non-null   object
 2   Score   4846 non-null   object
dtypes: object(3)
memory usage: 113.7+ KB


In [None]:
#count unique texts to check for duplicates
syn = set(df["Text"])

print(len(syn))

4838
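
A set length of 4838 against 4846 rows means eight rows repeat a text seen earlier. To see which texts are repeated before dropping them, the values can be counted. This is a minimal sketch using only the standard library; `find_duplicates` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter

def find_duplicates(texts):
    """Return each text that occurs more than once, mapped to its count."""
    counts = Counter(texts)
    return {text: n for text, n in counts.items() if n > 1}

print(find_duplicates(["a", "b", "a", "c"]))
```

Passing the column directly, as in `find_duplicates(df["Text"])`, should work because iterating over a pandas Series yields its values.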


In [None]:
#drop any duplicate texts
df.drop_duplicates('Text', inplace = True)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4838 entries, 0 to 4845
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Status  4838 non-null   object
 1   Text    4838 non-null   object
 2   Score   4838 non-null   object
dtypes: object(3)
memory usage: 151.2+ KB


In [None]:
#create sample
sample = df['Text'][0]

sample = ''.join(filter(lambda x: x.isalpha() or x.isspace(), sample))

#replace erroneous spaces and accented letters
sample = sample.replace("\n", " ")
sample = sample.replace("â", "")
sample = re.sub(' +', ' ', sample)
print(sample)

According to Gran the company has no plans to move all production to Russia although that is where the company is growing 


In [None]:
#store scores as labels and each text entry into word_list
word_list = []
labels = []

#loop through each row and apply transformations used on samples to all text
#append clean text and label to respective lists
for index, row in df.iterrows():
  label = row['Score']
  line = ''.join(filter(lambda x: x.isalpha() or x.isspace(), row['Text']))
  line = line.replace("\n", " ")
  line = line.replace("â", "")
  #collapse repeated spaces, as in the sample above
  line = re.sub(' +', ' ', line)
  word_list.append(line)
  labels.append(label)

print(len(word_list))
print(len(labels))


4838
4838


In [None]:
#dictionary holding both lists (named data to avoid shadowing the built-in dict)
data = {'label': labels, 'text': word_list}

#merge into a new dataframe
cleaned_df = pd.DataFrame(data)

cleaned_df.head(5)

Unnamed: 0,label,text
0,7,According to Gran the company has no plans to...
1,7,Technopolis plans to develop in stages an area...
2,1,The international electronic industry company ...
3,10,With the new production plant the company woul...
4,10,According to the company s updated strategy fo...


In [None]:
#save cleaned dataset to new file
cleaned_df.to_csv('/experiment-1/financial/datasets/clean_financialfull.csv', index=False)