# Preprocessing of data from Actaware company

## Install packages

In [1]:
!pip install emoji

Collecting emoji
  Downloading emoji-2.11.1-py2.py3-none-any.whl (433 kB)
Installing collected packages: emoji
Successfully installed emoji-2.11.1


## Prepare data

In [2]:
import json
import pandas as pd
import numpy as np
import emoji
import copy

In [3]:
with open('/source_repository/Articles_2023.json') as json_file:
  data = json.load(json_file)

In [4]:
data_df = pd.DataFrame(data)

### Columns that might be useful:
- 'ProcessingSteps' - holds the raw text fields; I use these so that I know exactly which preprocessing steps have been applied
  - 'TitleRaw' - title of the article
  - 'DescriptionRaw' - a short summary of the article, generated automatically: either the `summary` from the `newspaper3k` library or the `description` from the `feedparser` library
  - 'ContentRaw' - the content of the article
- 'Keywords' - keywords extracted with `yake.KeywordExtractor` from the `yake` library
- 'MatchedCompanies' - the names of the companies matched with the articles. Question: is this the output of their NER system? If yes, I can treat it as good enough, as they said during the meeting.
- 'NERs' - all named entities found in the Content
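
To make the nesting concrete, here is a minimal sketch in plain Python that pulls the three raw fields out of a single record. The record is a made-up, shortened stand-in for a real entry, and `raw_fields` is a hypothetical helper, not part of the dataset tooling.

```python
# Hypothetical, shortened record that mimics the layout of one article.
record = {
    "Title": "Example title",
    "MatchedCompanies": ["ExampleCo"],
    "ProcessingSteps": {
        "TitleRaw": "Example title",
        "DescriptionRaw": None,  # raw fields can be missing
        "ContentRaw": "Example content.",
    },
}

RAW_KEYS = ("TitleRaw", "DescriptionRaw", "ContentRaw")


def raw_fields(rec):
    """Return the raw text fields nested under 'ProcessingSteps'."""
    steps = rec.get("ProcessingSteps") or {}
    return {key: steps.get(key) for key in RAW_KEYS}


print(raw_fields(record))
```

A record without 'ProcessingSteps' yields `None` for all three keys, which is the case the later filtering step has to handle.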

In [5]:
data_df.head()

Unnamed: 0,_id,Source,SourceType,Title,Links,Description,Content,PublishedDate,Keywords,ProcessingSteps,NERs,QType,MatchedCompanies,Authors,ImageUrl
0,{'$oid': '65847e46b1f6ca8a8b0b8f75'},Daily mail Latest Stories,rss,BBC is urged to sack UK's Eurovision entrant O...,{'ArticleUrl': 'https://www.dailymail.co.uk/ne...,The BBC is being urged to drop singer Olly Ale...,The BBC is being urged to drop singer Olly Ale...,{'$date': {'$numberLong': '1703164970000'}},"{'FromDescription': ['olly alexander', 'singer...",{'TitleRaw': 'BBC is urged to sack UK's Eurovi...,"{'FromDescription': [{'Position': 4, 'Text': '...",[200],[BBC],[],https://actawarenewsdev.blob.core.windows.net/...
1,{'$oid': '65847e9cb1f6ca8a8b0b90e0'},SkyNews World,rss,UK Eurovision act Olly Alexander criticised fo...,{'ArticleUrl': 'https://news.sky.com/story/uk-...,"Olly Alexander, the UK's new Eurovision act, h...","Olly Alexander, the UK's new Eurovision act, h...",{'$date': {'$numberLong': '1703153940000'}},"{'FromDescription': ['accusing israel', 'olly ...",{'TitleRaw': 'UK Eurovision act Olly Alexander...,"{'FromDescription': [{'Position': 20, 'Text': ...",[],[Eurovision],[],https://actawarenewsdev.blob.core.windows.net/...
2,{'$oid': '6583d30c76e5f558ad9e1f1a'},Daily mail Australia news,rss,Shocking sight at McDonald's as staff complain...,{'ArticleUrl': 'https://www.dailymail.co.uk/ne...,A union representative claims they were barred...,A union representative claims they were barred...,{'$date': {'$numberLong': '1703118302000'}},"{'FromDescription': ['hindley street', 'repres...",{'TitleRaw': 'Shocking sight at McDonald's as ...,"{'FromDescription': [{'Position': 65, 'Text': ...",[500],[McDonald's],[],https://actawarenewsdev.blob.core.windows.net/...
3,{'$oid': '658327bb7d2f9da2e581735a'},Daily mail Latest Stories,rss,Prince Andrew 'invited former Goldman Sachs ba...,{'ArticleUrl': 'https://www.dailymail.co.uk/ne...,The Duke invited a gun smuggler and an alleged...,The Duke invited a gun smuggler and an alleged...,{'$date': {'$numberLong': '1703071650000'}},"{'FromDescription': ['james palace', 'alleged ...",{'TitleRaw': 'Prince Andrew 'invited former Go...,"{'FromDescription': [{'Position': 60, 'Text': ...",[],[Goldman Sachs],[],https://actawarenewsdev.blob.core.windows.net/...
4,{'$oid': '6582d237443658d5c3f41ea0'},Daily mail News,rss,Airbnb set to use AI to help identify unauthor...,{'ArticleUrl': 'https://www.dailymail.co.uk/ne...,Airbnb is set to use artificial intelligence t...,Airbnb is set to use artificial intelligence t...,{'$date': {'$numberLong': '1703052131000'}},"{'FromDescription': ['year eve', 'eve parties'...",{'TitleRaw': 'Airbnb set to use AI to help ide...,"{'FromDescription': [{'Position': 0, 'Text': '...",[600],[Airbnb],[],https://actawarenewsdev.blob.core.windows.net/...


In [6]:
# expand the nested 'ProcessingSteps' dicts into their own columns
df_pr_steps = pd.DataFrame(list(data_df['ProcessingSteps']))
# keep only the raw text fields
df_raw = df_pr_steps[['TitleRaw', 'DescriptionRaw', 'ContentRaw']]
df_pr_steps

Unnamed: 0,TitleRaw,DescriptionRaw,ContentRaw,TitleTokens,QTypeLogprob
0,BBC is urged to sack UK's Eurovision entrant O...,The BBC is being urged to drop singer Olly Ale...,The BBC is being urged to drop singer Olly Ale...,"[bbc, urged, sack, uk, eurovision, entrant, ol...",
1,UK Eurovision act Olly Alexander criticised fo...,"Olly Alexander, the UK's new Eurovision act, h...","Olly Alexander, the UK's new Eurovision act, h...","[uk, eurovision, act, olly, alexander, critici...",
2,Shocking sight at McDonald's as staff complain...,A union representative claims they were barred...,A union representative claims they were barred...,"[shocking, sight, mcdonalds, staff, complain, ...",
3,Prince Andrew 'invited former Goldman Sachs ba...,The Duke invited a gun smuggler and an alleged...,The Duke invited a gun smuggler and an alleged...,"[prince, andrew, invited, former, goldman, sac...",
4,Airbnb set to use AI to help identify unauthor...,Airbnb is set to use artificial intelligence t...,Airbnb is set to use artificial intelligence t...,"[airbnb, set, use, ai, help, identify, unautho...",
...,...,...,...,...,...
11649,United Airlines passengers stranded in America...,"'My daughter is now stranded, United decided t...",Hundreds of travellers missed out on Sydney's ...,"[united, airline, passenger, stranded, america...",
11650,⏪ 2022 rewind: 🤖 Global mobility technology co...,,,"[fast, reverse, button, 2022, rewind, robot, g...",
11651,Twitter sued after Elon Musk failed to pay ren...,Elon Musk's Twitter is being sued by the tech ...,Elon Musk's Twitter is being sued by the tech ...,"[twitter, sued, elon, musk, failed, pay, rent,...",
11652,Prince Harry should be stripped of royal title...,Prince Harry is continuing to lose support amo...,Prince Harry is continuing to lose support amo...,"[prince, harry, stripped, royal, title, netfli...",


In [7]:
# 'Title' is not selected because 'TitleRaw' from df_raw is used instead
data_df_small = data_df[['Keywords', 'MatchedCompanies', 'NERs']]
data_df_small_all = data_df_small.join(df_raw, how='left')
data_df_small_all.head()

Unnamed: 0,Keywords,MatchedCompanies,NERs,TitleRaw,DescriptionRaw,ContentRaw
0,"{'FromDescription': ['olly alexander', 'singer...",[BBC],"{'FromDescription': [{'Position': 4, 'Text': '...",BBC is urged to sack UK's Eurovision entrant O...,The BBC is being urged to drop singer Olly Ale...,The BBC is being urged to drop singer Olly Ale...
1,"{'FromDescription': ['accusing israel', 'olly ...",[Eurovision],"{'FromDescription': [{'Position': 20, 'Text': ...",UK Eurovision act Olly Alexander criticised fo...,"Olly Alexander, the UK's new Eurovision act, h...","Olly Alexander, the UK's new Eurovision act, h..."
2,"{'FromDescription': ['hindley street', 'repres...",[McDonald's],"{'FromDescription': [{'Position': 65, 'Text': ...",Shocking sight at McDonald's as staff complain...,A union representative claims they were barred...,A union representative claims they were barred...
3,"{'FromDescription': ['james palace', 'alleged ...",[Goldman Sachs],"{'FromDescription': [{'Position': 60, 'Text': ...",Prince Andrew 'invited former Goldman Sachs ba...,The Duke invited a gun smuggler and an alleged...,The Duke invited a gun smuggler and an alleged...
4,"{'FromDescription': ['year eve', 'eve parties'...",[Airbnb],"{'FromDescription': [{'Position': 0, 'Text': '...",Airbnb set to use AI to help identify unauthor...,Airbnb is set to use artificial intelligence t...,Airbnb is set to use artificial intelligence t...


### Problems with data:
- remove records which have `None` values in either `TitleRaw` or `ContentRaw`
- remove records which have an empty list in `MatchedCompanies`
- remove emojis
- conduct further preprocessing later:
  - BERTopic: "However, removing stop words as a preprocessing step is not advised as the transformer-based embedding models that we use need the full context in order to create accurate embeddings."
  - https://maartengr.github.io/BERTopic/getting_started/tips_and_tricks/tips_and_tricks.html#pre-compute-embeddings
  - according to various issues I found, preprocessing for BERT and BERTopic might even decrease the performance of the model, so for now there will be no further preprocessing
    - later, after choosing the evaluation metrics, I will first try the suggestions from https://maartengr.github.io/BERTopic/getting_started/tips_and_tricks/tips_and_tricks.html#pre-compute-embeddings, compare the results, and only then consider additional preprocessing steps

In [8]:
# remove records which have `None` values in either TitleRaw or ContentRaw
has_none = data_df_small_all['ContentRaw'].isnull() | data_df_small_all['TitleRaw'].isnull()
data_df_small_no_none = data_df_small_all[~has_none].reset_index(drop=True)

In [9]:
data_df_small_no_none.shape

(10698, 6)

In [10]:
# remove records which have empty list in `MatchedCompanies`
data_df_small_no_empty = data_df_small_no_none[data_df_small_no_none['MatchedCompanies'].apply(len)>0]

In [11]:
data_df_small_no_empty.shape

(10698, 6)

In [12]:
data_df_small_no_empty.head()

Unnamed: 0,Keywords,MatchedCompanies,NERs,TitleRaw,DescriptionRaw,ContentRaw
0,"{'FromDescription': ['olly alexander', 'singer...",[BBC],"{'FromDescription': [{'Position': 4, 'Text': '...",BBC is urged to sack UK's Eurovision entrant O...,The BBC is being urged to drop singer Olly Ale...,The BBC is being urged to drop singer Olly Ale...
1,"{'FromDescription': ['accusing israel', 'olly ...",[Eurovision],"{'FromDescription': [{'Position': 20, 'Text': ...",UK Eurovision act Olly Alexander criticised fo...,"Olly Alexander, the UK's new Eurovision act, h...","Olly Alexander, the UK's new Eurovision act, h..."
2,"{'FromDescription': ['hindley street', 'repres...",[McDonald's],"{'FromDescription': [{'Position': 65, 'Text': ...",Shocking sight at McDonald's as staff complain...,A union representative claims they were barred...,A union representative claims they were barred...
3,"{'FromDescription': ['james palace', 'alleged ...",[Goldman Sachs],"{'FromDescription': [{'Position': 60, 'Text': ...",Prince Andrew 'invited former Goldman Sachs ba...,The Duke invited a gun smuggler and an alleged...,The Duke invited a gun smuggler and an alleged...
4,"{'FromDescription': ['year eve', 'eve parties'...",[Airbnb],"{'FromDescription': [{'Position': 0, 'Text': '...",Airbnb set to use AI to help identify unauthor...,Airbnb is set to use artificial intelligence t...,Airbnb is set to use artificial intelligence t...


In [14]:
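# remove emoji from the titles (trial run on the unfiltered frame)
data_df_small_all_3 = data_df_small_all.copy(deep=True)
# assign the whole column at once instead of chained indexing; skip non-string values
data_df_small_all_3['TitleRaw'] = data_df_small_all_3['TitleRaw'].apply(
  lambda title: emoji.replace_emoji(title, '') if isinstance(title, str) else title)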

In [15]:
data_df_small_all_3['TitleRaw'] # row 11650 no longer contains any emoji, so the removal works properly

0        BBC is urged to sack UK's Eurovision entrant O...
1        UK Eurovision act Olly Alexander criticised fo...
2        Shocking sight at McDonald's as staff complain...
3        Prince Andrew 'invited former Goldman Sachs ba...
4        Airbnb set to use AI to help identify unauthor...
                               ...                        
11649    United Airlines passengers stranded in America...
11650     2022 rewind:  Global mobility technology comp...
11651    Twitter sued after Elon Musk failed to pay ren...
11652    Prince Harry should be stripped of royal title...
11653    Alan Shearer nets VERY handsome Christmas bonu...
Name: TitleRaw, Length: 11654, dtype: object

In [47]:
def clean_text(text):
  # leave missing values untouched
  if not isinstance(text, str):
    return text
  # paragraph breaks become a space, remaining single newlines are removed
  text = text.replace('\n\n', ' ').replace('\n', '')
  return emoji.replace_emoji(text, '')

# reset the index so that row labels are contiguous after filtering
data_df_small_no_emoji = data_df_small_no_empty.reset_index(drop=True)
for column in ['TitleRaw', 'DescriptionRaw', 'ContentRaw']:
  data_df_small_no_emoji[column] = data_df_small_no_emoji[column].apply(clean_text)
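
The order of the two newline replacements matters, so here is a small stand-alone sketch of what they do to a string. `collapse_newlines` is a hypothetical helper name and the sample text is made up; only the two `replace` calls mirror the cell above.

```python
def collapse_newlines(text):
    """Apply the same two replacements as the cleaning cell, in the same order."""
    text = text.replace("\n\n", " ")  # paragraph break -> single space
    return text.replace("\n", "")     # any remaining newline -> removed


sample = "First paragraph.\n\nSecond paragraph,\nwrapped line."
print(repr(collapse_newlines(sample)))
```

Note that a single newline is removed without inserting a space, so the words on either side of it are joined. If that turns out to hurt the embeddings, replacing `'\n'` with `' '` would be the obvious alternative to try.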

In [48]:
list_of_contents=data_df_small_no_emoji['ContentRaw']
list_of_contents[0]

"The BBC is being urged to drop singer Olly Alexander as its entrant for Eurovision after it emerged he signed a letter calling Israel an 'apartheid regime'. The Years And Years frontman, 33, was unveiled as next year's candidate for the UK during the Strictly Come Dancing final, which aired on the BBC on Saturday. But he now faces having that role stripped from him after he signed a letter from LGBT charity Voices4London which described Israel as an 'apartheid regime' which is trying to 'ethnically cleanse' Palestine. The statement, which was published on October 20, almost two weeks after Hamas' October 7 attack, also says that Israel has 'terrorised' Palestinian people and there is now a 'genocide' taking place 'in real time'. The Conservatives have accused the BBC of 'either a massive oversight or sheer brass neck' for selecting Alexander, while a Jewish charity has called for him to be replaced and for the broadcaster to cut ties with him. The BBC is not planning on taking any act

In [49]:
print(f"The percentage of data that had None values in 'TitleRaw' or 'ContentRaw' or no matched companies: {np.round(100-len(list_of_contents)/data_df_small_all.shape[0]*100, 2)}%.")

The percentage of data that had None values in 'TitleRaw' or 'ContentRaw' or no matched companies: 8.2%.
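
As a sanity check, the same figure can be reproduced from the row counts printed earlier in this notebook (11654 rows before filtering, 10698 rows after). The variable names below are only for illustration.

```python
total_rows = 11654  # length of data_df_small_all, as reported for 'TitleRaw'
kept_rows = 10698   # rows left after filtering, from the .shape output

removed_rows = total_rows - kept_rows
removed_pct = round(100 - kept_rows / total_rows * 100, 2)

print(f"{removed_rows} rows removed ({removed_pct}%)")
```

Since the shape is the same before and after the `MatchedCompanies` filter, all removed rows come from the `None` check.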


In [50]:
# save list_of_contents to a text file, one article per line
with open('/source_repository/list_of_contents.txt', 'w', encoding='utf-8') as file:
  for string in list_of_contents:
    file.write(string + '\n')
# save data_df_small_no_emoji to a CSV file
data_df_small_no_emoji.to_csv('/source_repository/data_df_small_no_emoji.csv', index=False)

In [51]:
# how to read the CSV back (list and dict columns come back as strings)
data_df_small_no_emoji_new=pd.read_csv('/source_repository/data_df_small_no_emoji.csv')
data_df_small_no_emoji_new

Unnamed: 0,Keywords,MatchedCompanies,NERs,TitleRaw,DescriptionRaw,ContentRaw
0,"{'FromDescription': ['olly alexander', 'singer...",['BBC'],"{'FromDescription': [{'Position': 4, 'Text': '...",BBC is urged to sack UK's Eurovision entrant O...,The BBC is being urged to drop singer Olly Ale...,The BBC is being urged to drop singer Olly Ale...
1,"{'FromDescription': ['accusing israel', 'olly ...",['Eurovision'],"{'FromDescription': [{'Position': 20, 'Text': ...",UK Eurovision act Olly Alexander criticised fo...,"Olly Alexander, the UK's new Eurovision act, h...","Olly Alexander, the UK's new Eurovision act, h..."
2,"{'FromDescription': ['hindley street', 'repres...","[""McDonald's""]","{'FromDescription': [{'Position': 65, 'Text': ...",Shocking sight at McDonald's as staff complain...,A union representative claims they were barred...,A union representative claims they were barred...
3,"{'FromDescription': ['james palace', 'alleged ...",['Goldman Sachs'],"{'FromDescription': [{'Position': 60, 'Text': ...",Prince Andrew 'invited former Goldman Sachs ba...,The Duke invited a gun smuggler and an alleged...,The Duke invited a gun smuggler and an alleged...
4,"{'FromDescription': ['year eve', 'eve parties'...",['Airbnb'],"{'FromDescription': [{'Position': 0, 'Text': '...",Airbnb set to use AI to help identify unauthor...,Airbnb is set to use artificial intelligence t...,Airbnb is set to use artificial intelligence t...
...,...,...,...,...,...,...
10693,"{'FromDescription': ['ukko-pekka luukkonen', '...",['NHL'],"{'FromDescription': [{'Position': 0, 'Text': '...","NHL roundup: Alex Tuch, Sabres knock off Bruin...",Ukko-Pekka Luukkonen made 37 saves for Buffalo...,"[1/5] Dec 31, 2022; Washington, District of Co..."
10694,"{'FromDescription': ['pago pago', 'pago', 'uni...",['United Airlines'],"{'FromDescription': [{'Position': 30, 'Text': ...",United Airlines passengers stranded in America...,"'My daughter is now stranded, United decided t...",Hundreds of travellers missed out on Sydney's ...
10695,"{'FromDescription': ['san francisco', 'elon mu...","['Twitter, Inc.']","{'FromDescription': [{'Position': 66, 'Text': ...",Twitter sued after Elon Musk failed to pay ren...,Elon Musk's Twitter is being sued by the tech ...,Elon Musk's Twitter is being sued by the tech ...
10696,"{'FromDescription': ['british public', 'prince...",['Netflix'],"{'FromDescription': [{'Position': 97, 'Text': ...",Prince Harry should be stripped of royal title...,Prince Harry is continuing to lose support amo...,Prince Harry is continuing to lose support amo...
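
After the CSV round trip, `MatchedCompanies` holds strings such as `['BBC']` rather than Python lists. One possible way to restore them, sketched here with the standard-library `ast.literal_eval`, is shown below. `parse_list_cell` is a hypothetical helper, and the example strings are copied from the output above.

```python
import ast


def parse_list_cell(cell):
    """Turn a stringified Python list such as "['BBC']" back into a list."""
    if isinstance(cell, list):
        return cell  # already parsed, nothing to do
    return ast.literal_eval(cell)


examples = ["['BBC']", "[\"McDonald's\"]", "['Twitter, Inc.']"]
for cell in examples:
    print(parse_list_cell(cell))
```

On the reloaded frame this would be applied as `data_df_small_no_emoji_new['MatchedCompanies'].apply(parse_list_cell)`. The same call should work for the dict-valued `Keywords` and `NERs` columns as long as every cell is a valid Python literal; missing values would need to be handled separately.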


In [52]:
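# how to read the list of contents back
path_list='/source_repository/list_of_contents.txt'
with open(path_list, 'r', encoding='utf-8') as file:
  # strip the trailing newline that was added to each line when saving
  list_of_contents_new = [line.rstrip('\n') for line in file]
list_of_contents_new[0]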

"The BBC is being urged to drop singer Olly Alexander as its entrant for Eurovision after it emerged he signed a letter calling Israel an 'apartheid regime'. The Years And Years frontman, 33, was unveiled as next year's candidate for the UK during the Strictly Come Dancing final, which aired on the BBC on Saturday. But he now faces having that role stripped from him after he signed a letter from LGBT charity Voices4London which described Israel as an 'apartheid regime' which is trying to 'ethnically cleanse' Palestine. The statement, which was published on October 20, almost two weeks after Hamas' October 7 attack, also says that Israel has 'terrorised' Palestinian people and there is now a 'genocide' taking place 'in real time'. The Conservatives have accused the BBC of 'either a massive oversight or sheer brass neck' for selecting Alexander, while a Jewish charity has called for him to be replaced and for the broadcaster to cut ties with him. The BBC is not planning on taking any act

In [53]:
list_of_contents_new[1]

'Olly Alexander, the UK\'s new Eurovision act, has been criticised for signing a statement accusing Israel of genocide and describing it as an "apartheid state". Alexander was revealed as the UK\'s act last weekend after being chosen by the BBC to perform in Malmo next May. The It\'s A Sin star signed an open letter from LGBTQ+ activist group Voices4London, which called for a ceasefire in Gaza and for Israel to allow aid into the area. The letter says: "We are watching a genocide take place in real time. Death overflows from our phone screens and into our hearts. "And, as a queer community, we cannot sit idly by while the Israeli Government continues to wipe out entire lineages of Palestinian families. We once said, \'silence equals death\'. Now is not the time to be silent." "We cannot untangle these recent tragedies from a violent history of occupation. Current events simply are an escalation of the state of Israel\'s apartheid regime, which acts to ethnically cleanse the land." The 

In [54]:
list_of_contents_new[2]

"A union representative claims they were barred from inspecting a McDonald's after workers complained about seeing rats at the restaurant for six weeks. Employees at the Hindley Street McDonald's, in Adelaide, reported 'almost daily' rat sightings to their union who sent a representative on December 8 to investigate. A new report published by SafeWork SA alleges that the rep was left waiting outside for 'an hour or two' after arriving at 12:15pm and eventually left. Workers at the diner have claimed that faulty wiring, falling ceiling tiles and the 'rat infestation' had created an unsafe work environment. A McDonalds spokesperson said that it is now in direct contact with the franchisee of the Hindley Street store regarding allegations made in the report. The Maccas on Hindley Street in Adelaide's CBD has been infested with rats for weeks, according to employees who have blown the whistle to the SDA (pictured rat in storage room) The workers allege that rats are seen 'almost daily' and
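
A quick way to confirm that the text file round-trips is to write a small list and compare it with what is read back. The sketch below uses a temporary directory and made-up strings instead of the real file, so it can be run without the dataset.

```python
import os
import tempfile

contents = ["first article text", "second article text"]  # made-up stand-ins

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "list_of_contents.txt")
    # write one entry per line, as in the save cell
    with open(path, "w", encoding="utf-8") as file:
        for string in contents:
            file.write(string + "\n")
    # readlines() keeps the trailing newline of every line
    with open(path, "r", encoding="utf-8") as file:
        raw_lines = file.readlines()

restored = [line.rstrip("\n") for line in raw_lines]
print(raw_lines[0].endswith("\n"))
print(restored == contents)
```

The comparison only holds when no article contains a line break of its own. The cleaning step removes `'\n'`, but other line-break characters such as `'\r'` are not handled there and would break the one-article-per-line assumption.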