### Dataset

The dataset was obtained from the GDELT Project, specifically from the `onlinenews` table of the `covid19` dataset in the `gdelt-bq` project on Google BigQuery.
The GDELT Project monitors the world's news media, coding and analyzing news events globally.
The `gdelt-bq.covid19.onlinenews` table contains online news articles related to COVID-19.
It is used for tracking and analyzing media coverage and trends related to the COVID-19 pandemic.

link: https://console.cloud.google.com/bigquery?p=gdelt-bq&d=covid19&t=onlinenews&page=table&pli=1&project=aled7-297920&ws=!1m5!1m4!4m3!1sgdelt-bq!2scovid19!3sonlinenews!1m9!1m8!1m3!1saled7-297920!2sbquxjob_36e9fcb1_18fea093775!3sUS!14m3!1saled7-297920!2sbquxjob_5a2fe995_18fea190b75!3sUS

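My filtering:
- at most 5 articles per day
- dates between '2020-01-01' and '2022-12-31'
- `Topic` or `Title` contains 'covid-19 vaccin' (case-insensitive, so it matches "vaccine", "vaccines", "vaccination", ...)
- duplicates removed (one article per title per day)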



In [None]:
# WITH RankedArticles AS (
#   SELECT *,
#          ROW_NUMBER() OVER (PARTITION BY DATE(DateTime), LOWER(Title) ORDER BY DateTime) AS title_rn
#   FROM `gdelt-bq.covid19.onlinenews`
#   WHERE
#     (LOWER(Topic) LIKE '%covid-19 vaccin%' OR LOWER(Title) LIKE '%covid-19 vaccin%')
#     AND DATE(DateTime) BETWEEN '2020-01-01' AND '2022-12-31'
# ),
# FilteredArticles AS (
#   SELECT *,
#          ROW_NUMBER() OVER (PARTITION BY DATE(DateTime) ORDER BY DateTime) AS rn
#   FROM RankedArticles
#   WHERE title_rn = 1
# )
# SELECT *
# FROM FilteredArticles
# WHERE rn <= 5
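The same filtering can be approximated in pandas, which may help when checking the query against a local extract. This is only a sketch under assumptions: the function name `filter_articles` and its parameters are hypothetical, and the input is assumed to have non-null `Topic`, `DateTime`, and `Title` columns like the table above.

```python
import pandas as pd

def filter_articles(df, start="2020-01-01", end="2022-12-31",
                    keyword="covid-19 vaccin", max_per_day=5):
    """Pandas sketch of the query: keyword and date filter, dedupe, daily cap."""
    out = df.copy()
    out["DateTime"] = pd.to_datetime(out["DateTime"], utc=True)
    out["day"] = out["DateTime"].dt.strftime("%Y-%m-%d")
    out["title_key"] = out["Title"].str.lower()

    # WHERE clause: keyword in Topic or Title, date within the range.
    # ISO date strings compare in chronological order.
    matches = (out["Topic"].str.lower().str.contains(keyword, regex=False)
               | out["title_key"].str.contains(keyword, regex=False))
    in_range = out["day"].between(start, end)
    out = out[matches & in_range].sort_values("DateTime", kind="stable")

    # title_rn = 1: keep the earliest article per (day, lower-cased title)
    out = out.drop_duplicates(subset=["day", "title_key"], keep="first")

    # rn <= 5: keep the earliest articles of each day
    out = out.groupby("day", sort=False).head(max_per_day)
    return out.drop(columns=["day", "title_key"]).reset_index(drop=True)
```

Unlike the SQL version, this sketch does not keep the `title_rn` and `rn` helper columns.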

### 1. Import Necessary Libraries

In [1]:
!pip install pycountry


Collecting pycountry
  Downloading pycountry-24.6.1-py3-none-any.whl.metadata (12 kB)
Downloading pycountry-24.6.1-py3-none-any.whl (6.3 MB)
Installing collected packages: pycountry
Successfully installed pycountry-24.6.1


In [2]:
# Import Necessary Libraries
import numpy as np
import pandas as pd
import re
# import matplotlib.pyplot as plt
# import seaborn as sns
from textblob import TextBlob
# from wordcloud import WordCloud, STOPWORDS
# from geotext import GeoText
# from sklearn.feature_extraction.text import CountVectorizer
# from sklearn.decomposition import LatentDirichletAllocation
import pycountry  # to map country abbreviations to full names

import warnings
warnings.filterwarnings('ignore')

### 2. Load the Dataset

In [27]:
# df = pd.read_csv("/kaggle/input/small-no-sp5000/bquxjob_7eb79e05_18feac2a7a4.csv")
df = pd.read_csv("/kaggle/input/wwwwwwww/bigquery.csv")

print(df.columns)
print(df.shape)
df.info()

df.head()

Index(['Topic', 'DateTime', 'URL', 'Title', 'Context', 'title_rn', 'rn'], dtype='object')
(5236, 7)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5236 entries, 0 to 5235
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Topic     5236 non-null   object
 1   DateTime  5236 non-null   object
 2   URL       5236 non-null   object
 3   Title     5236 non-null   object
 4   Context   5236 non-null   object
 5   title_rn  5236 non-null   int64 
 6   rn        5236 non-null   int64 
dtypes: int64(2), object(5)
memory usage: 286.5+ KB


Unnamed: 0,Topic,DateTime,URL,Title,Context,title_rn,rn
0,Cases,2020-02-12 05:02:03.000000 UTC,https://www.thestar.com.my/news/nation/2020/02...,"Faster route to Covid-19 vaccine possible, say...",KUALA LUMPUR (Bernama): With the number of cas...,1,1
1,Masks,2020-02-12 08:20:01.000000 UTC,https://www.cbs8.com/article/news/health/coron...,San Diego lab discovers COVID-19 vaccine in 3 ...,"As the days go by, Inovio Pharmaceuticals is g...",1,2
2,Testing,2020-02-12 12:04:53.000000 UTC,https://pharmaphorum.com/news/jj-strengthens-r...,J&J strengthens R&D into COVID-19 vaccine with...,The collaborative partnership with BARDA build...,1,3
3,Cases,2020-02-12 23:32:12.000000 UTC,https://www.theage.com.au/world/asia/four-vacc...,Coronavirus outbreak: COVID-19 vaccine candida...,The total death toll in China climbed to 1113....,1,4
4,Cases,2020-02-13 00:18:29.000000 UTC,https://www.watoday.com.au/world/asia/four-vac...,Coronavirus outbreak: COVID-19 vaccine candida...,The total death toll in China climbed to 1113....,1,1


### 3. Data Preprocessing
Remove rows with missing values, format the date, and extract the media outlet and country from each URL. Duplicates were already removed in the query, and the columns of interest are selected in section 5.

In [28]:
df = df.dropna()  # assign the result; dropna() does not modify df in place
df

Unnamed: 0,Topic,DateTime,URL,Title,Context,title_rn,rn
0,Cases,2020-02-12 05:02:03.000000 UTC,https://www.thestar.com.my/news/nation/2020/02...,"Faster route to Covid-19 vaccine possible, say...",KUALA LUMPUR (Bernama): With the number of cas...,1,1
1,Masks,2020-02-12 08:20:01.000000 UTC,https://www.cbs8.com/article/news/health/coron...,San Diego lab discovers COVID-19 vaccine in 3 ...,"As the days go by, Inovio Pharmaceuticals is g...",1,2
2,Testing,2020-02-12 12:04:53.000000 UTC,https://pharmaphorum.com/news/jj-strengthens-r...,J&J strengthens R&D into COVID-19 vaccine with...,The collaborative partnership with BARDA build...,1,3
3,Cases,2020-02-12 23:32:12.000000 UTC,https://www.theage.com.au/world/asia/four-vacc...,Coronavirus outbreak: COVID-19 vaccine candida...,The total death toll in China climbed to 1113....,1,4
4,Cases,2020-02-13 00:18:29.000000 UTC,https://www.watoday.com.au/world/asia/four-vac...,Coronavirus outbreak: COVID-19 vaccine candida...,The total death toll in China climbed to 1113....,1,1
...,...,...,...,...,...,...,...
5231,Covid19,2022-12-31 01:31:35.000000 UTC,https://www.yahoo.com/now/valneva-reports-furt...,Valneva Reports Further Heterologous Booster D...,"VLA2001's manufacturing process, which has alr...",1,1
5232,Cases,2022-12-31 02:32:21.000000 UTC,https://www.medindia.net/patientinfo/covid-19-...,COVID-19 Vaccine: Myths and Facts,MYTH 8 Covid-19 infection can be contracted fr...,1,2
5233,Covid19,2022-12-31 04:17:02.000000 UTC,https://allafrica.com/stories/202212280429.html,South Africa: Health Announces Changes to Covi...,This means the department will publish the COV...,1,3
5234,Cases,2022-12-31 06:02:25.000000 UTC,https://armenpress.am/eng/news/1067043/,From Macron's visit to donation of COVID-19 va...,Another important field is the healthcare. It ...,1,4


In [29]:
# date column in YYYY-MM-DD format (time of day dropped)
df['date'] = pd.to_datetime(df['DateTime']).dt.date
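For a single value, the same conversion can be sketched with the standard library alone. The helper name `parse_gdelt_timestamp` is hypothetical, and the format string is an assumption based on the `DateTime` values shown in the output above.

```python
from datetime import datetime, timezone

def parse_gdelt_timestamp(value):
    """Parse a string such as '2020-02-12 05:02:03.000000 UTC' into an aware datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S.%f %Z")
    # Attach UTC explicitly so the result is timezone-aware
    return parsed.replace(tzinfo=timezone.utc)

stamp = parse_gdelt_timestamp("2020-02-12 05:02:03.000000 UTC")
article_date = stamp.date()  # calendar date only, like the `date` column
```

Note that `.dt.date` in the cell above yields plain Python `date` objects, so the `date` column has object dtype rather than `datetime64`.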

In [30]:
# Function to extract the media outlet (host name) and full country name from a URL
def extract_media_country_full_name(url):
    # Extract the host name, dropping a leading "www."
    media = re.findall(r'://(www\.)?([^/]+)', url)
    media = media[0][1] if media else ''

    # Extract a two-letter country code from the end of the host name only
    # (searching the whole URL could pick up ".xx/" fragments from the path)
    country_code = re.findall(r'\.([a-z]{2})$', media.lower())
    country_code = country_code[0] if country_code else ''

    # Convert country code to full name using pycountry
    country = ''
    if country_code:
        try:
            country = pycountry.countries.get(alpha_2=country_code.upper()).name
        except AttributeError:
            # Suffix is not a known ISO 3166-1 alpha-2 code
            country = ''
    elif media.lower().endswith('.com'):
        # .com domains carry no country information
        country = 'Global'

    return media, country

# Apply the function to the URL column
df[['media', 'country']] = df['URL'].apply(lambda x: pd.Series(extract_media_country_full_name(x)))
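As an alternative to regular expressions, the host can be taken from `urllib.parse.urlsplit`, which separates the host from the path before any suffix check. The sketch below is illustrative only: `split_media_country` and the tiny `COUNTRY_NAMES` table are hypothetical stand-ins (the notebook resolves codes with pycountry), and the example URLs use hosts from the data with their paths omitted.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny lookup used only for this illustration
COUNTRY_NAMES = {"au": "Australia", "ph": "Philippines", "my": "Malaysia"}

def split_media_country(url):
    """Return (host without 'www.', country label) for a URL."""
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    # Only the last label of the host is inspected, never the path
    suffix = host.rsplit(".", 1)[-1] if "." in host else ""
    if len(suffix) == 2 and suffix.isalpha():
        country = COUNTRY_NAMES.get(suffix, "")
    elif suffix == "com":
        country = "Global"
    else:
        country = ""
    return host, country

examples = [
    "https://www.theage.com.au/",
    "https://news.mb.com.ph/",
    "https://www.cbs8.com/",
]
results = [split_media_country(u) for u in examples]
```

Either way, a `.com` host says nothing about where the outlet is based, so the "Global" label is a convention of this notebook rather than a geographic fact.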


### 4. Add companies (sp500 health care sector)

In [31]:
def fetch_sp500_health_care_companies():
    """Retrieves the S&P 500 companies in the Health Care sector from Wikipedia"""
    url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
    sp500_table = pd.read_html(url)[0]
    health_care_companies = sp500_table[sp500_table['GICS Sector'] == 'Health Care']

    return health_care_companies[['Symbol', 'Security']]

# Fetch the list of S&P 500 companies
sp500_companies_df = fetch_sp500_health_care_companies()
sp500_symbols_list= sp500_companies_df['Symbol'].tolist()
sp500_companies_list = sp500_companies_df['Security'].tolist()
company_to_symbol = dict(zip(sp500_companies_list, sp500_symbols_list))

def extract_companies(text):
    """Returns the S&P 500 Health Care companies mentioned in text (whole-word, case-insensitive match)"""
    return [entry for entry in sp500_companies_list if re.search(r'\b' + re.escape(entry) + r'\b', text, re.IGNORECASE)]

# Search Title and Context together; an article may mention several companies
mentioned = [extract_companies(title + ' ' + context) for title, context in zip(df['Title'], df['Context'])]
df['Security'] = [', '.join(names) for names in mentioned]
# Map each company to its symbol separately, so articles that mention
# several companies also get their symbols (empty string if none)
df['Symbol'] = [', '.join(company_to_symbol[name] for name in names) for names in mentioned]
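The matching rule can be tried on a few strings in isolation. The sketch below assumes a hypothetical two-entry mapping in place of the Wikipedia table, and `match_companies` is an illustrative helper, not part of the notebook. It looks up symbols name by name, which is one way to keep names and symbols aligned when a text mentions more than one company.

```python
import re

# Hypothetical two-entry mapping for illustration; the notebook builds
# the real one from the Wikipedia table
COMPANY_TO_SYMBOL = {"Pfizer": "PFE", "Moderna": "MRNA"}

def match_companies(text, company_to_symbol=COMPANY_TO_SYMBOL):
    """Return (comma-joined names, comma-joined symbols) found in text."""
    names = [
        name for name in company_to_symbol
        # \b anchors the match to whole words; re.escape guards names
        # that contain regex metacharacters such as '&' or '.'
        if re.search(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE)
    ]
    symbols = [company_to_symbol[name] for name in names]
    return ", ".join(names), ", ".join(symbols)

both = match_companies("Pfizer and Moderna report vaccine data")
none = match_companies("Local clinic opens new vaccination site")
```

Whole-word matching still misses informal names (for example "J&J" for Johnson & Johnson), so counts based on this rule are a lower bound on company mentions.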



In [32]:
print('number of articles with companies: ', df['Security'].apply(lambda x: x.strip() != '').sum())
print('articles with companies identified: ', df[df['Security'].apply(lambda x: x.strip() != '')])

number of articles with companies:  938
articles with companies identified:             Topic                        DateTime  \
19         Masks  2020-02-19 00:48:20.000000 UTC   
37       Testing  2020-02-25 02:32:25.000000 UTC   
40       Testing  2020-02-25 13:48:33.000000 UTC   
42       Testing  2020-02-26 03:17:22.000000 UTC   
43    Quarantine  2020-02-26 04:17:12.000000 UTC   
...          ...                             ...   
5210     Covid19  2022-12-26 04:32:32.000000 UTC   
5217  Quarantine  2022-12-28 01:31:20.000000 UTC   
5222  Quarantine  2022-12-29 02:16:44.000000 UTC   
5224     Covid19  2022-12-29 05:46:35.000000 UTC   
5225     Covid19  2022-12-29 06:32:00.000000 UTC   

                                                    URL  \
19    http://www.cidrap.umn.edu/news-perspective/202...   
37    https://www.fool.com/investing/2020/02/24/mode...   
40    https://www.benzinga.com/general/biotech/20/02...   
42    https://tribune.net.ph/index.php/2020/02/26/co...   
43 

### 5. Filter columns and save

In [66]:
# .copy() so that columns added later do not write to a slice of df
df_covid = df[['Topic', 'Title', 'Context', 'date', 'media', 'country', 'Security', 'Symbol']].copy()
df_covid[1:20]

Unnamed: 0,Topic,Title,Context,date,media,country,Security,Symbol
1,Masks,San Diego lab discovers COVID-19 vaccine in 3 ...,"As the days go by, Inovio Pharmaceuticals is g...",2020-02-12,cbs8.com,Global,,
2,Testing,J&J strengthens R&D into COVID-19 vaccine with...,The collaborative partnership with BARDA build...,2020-02-12,pharmaphorum.com,Global,,
3,Cases,Coronavirus outbreak: COVID-19 vaccine candida...,The total death toll in China climbed to 1113....,2020-02-12,theage.com.au,Australia,,
4,Cases,Coronavirus outbreak: COVID-19 vaccine candida...,The total death toll in China climbed to 1113....,2020-02-13,watoday.com.au,Australia,,
5,Masks,San Diego lab discovers COVID-19 vaccine in 3 ...,"As the days go by, Inovio Pharmaceuticals is g...",2020-02-13,wtol.com,Global,,
6,Covid19,WHO chief says first Covid-19 vaccine could be...,Tedros reiterated the need to use available re...,2020-02-14,freemalaysiatoday.com,Global,,
7,SocialDistancing,As scientists race to produce Covid-19 vaccine...,By Analou De Vera An official of the World Hea...,2020-02-14,news.mb.com.ph,Philippines,,
8,Masks,San Diego lab discovers COVID-19 vaccine in 3 ...,"As the days go by, Inovio Pharmaceuticals is g...",2020-02-14,13newsnow.com,Global,,
9,Quarantine,China Hobbles Efforts Toward COVID-19 Vaccine,As former CDC chief Tom Friedman noted recentl...,2020-02-14,theepochtimes.com,Global,,
10,Masks,Covid-19 vaccine in late February: White House...,A US biotech firm Gilead announced earlier thi...,2020-02-16,dawn.com,Global,,


In [34]:
# Save the modified DataFrame to a new CSV file
output_file_path = '/kaggle/working/df_covid.csv'
df_covid.to_csv(output_file_path, index=False)

output_file_path

'/kaggle/working/df_covid.csv'

### 6. Sentiment Analysis using TextBlob
Analyze the sentiment of each article (title and context) using TextBlob.
- polarity: emotional tone of the text, from -1 (negative) through 0 (neutral) to 1 (positive).
- subjectivity: whether the text is more fact-based (0) or opinion-based (1).

In [67]:
# Sentiment Analysis using TextBlob
def get_sentiment(text):
    blob = TextBlob(text)
    sentiment = blob.sentiment.polarity
    return sentiment

# def get_subjectivity(text):
#     analysis = TextBlob(text)
#     return analysis.sentiment.subjectivity

# Apply the sentiment function to the Title and Context columns
df_covid['title_sentiment'] = df_covid['Title'].apply(get_sentiment)
df_covid['context_sentiment'] = df_covid['Context'].apply(get_sentiment)
df_covid['sentiment'] = df_covid[['title_sentiment', 'context_sentiment']].mean(axis=1)

# df_covid['title_subjectivity'] = df_covid['Title'].apply(get_subjectivity)
# df_covid['context_subjectivity'] = df_covid['Context'].apply(get_subjectivity)
# df_covid['combined_subjectivity'] = df_covid[['title_subjectivity', 'context_subjectivity']].mean(axis=1)



In [68]:
# Calculate the average sentiment for each day
mean_sentiment_per_day = df_covid.groupby('date')['sentiment'].mean().reset_index()
mean_sentiment_per_day.rename(columns={'sentiment': 'day_sentiment'}, inplace=True)

# Merge to dataframe
df_covid = df_covid.merge(mean_sentiment_per_day, on='date', how='left')


In [69]:
def categorize_sentiment(score):
    if score > 0:
        return 'positive'
    elif score < 0:
        return 'negative'
    else:
        return 'neutral'

df_covid['sentiment_cat'] = df_covid['sentiment'].apply(categorize_sentiment)
df_covid['day_sent_cat'] = df_covid['day_sentiment'].apply(categorize_sentiment)
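A small worked example shows how the article-level and day-level labels can disagree. The four scores are the per-article `sentiment` values for 2020-02-14 from the final table of this notebook. The `tol` parameter is a hypothetical addition, not part of the notebook's function; it illustrates how a dead band around zero would change the labels of near-zero scores.

```python
from statistics import mean

def categorize_sentiment(score, tol=0.0):
    """Label a polarity score; tol=0.0 reproduces the strict rule above."""
    if score > tol:
        return 'positive'
    elif score < -tol:
        return 'negative'
    else:
        return 'neutral'

# Per-article sentiment for 2020-02-14 (mean of title and context polarity)
scores = [0.209722, 0.103704, -0.1, 0.04]
day_mean = mean(scores)  # about 0.0634, the day_sentiment for that date

article_labels = [categorize_sentiment(s) for s in scores]
day_label = categorize_sentiment(day_mean)

# With the strict rule, a score of -0.008153 counts as negative;
# a (hypothetical) tolerance of 0.01 would treat it as neutral
strict = categorize_sentiment(-0.008153)
banded = categorize_sentiment(-0.008153, tol=0.01)
```

Here one of the four articles is labeled negative while the day as a whole is labeled positive, so `sentiment_cat` and `day_sent_cat` should be read as answers to different questions.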


### 7. Add day before and day after

In [70]:
# Add daybefore and dayafter columns

df_covid['daybefore'] = df_covid['date'] - pd.Timedelta(days=1)
df_covid['dayafter'] = df_covid['date'] + pd.Timedelta(days=1)


In [77]:
# Add the average sentiment of the previous day and of the next day.
# Note: despite the `_sent_cat` suffix, these columns hold numeric mean scores, not categories.

# Shift each day's score one day forward, so that after the merge
# a row dated D receives the score of D-1
day_before_sent_cat = mean_sentiment_per_day[['date', 'day_sentiment']].copy()
day_before_sent_cat['date'] = day_before_sent_cat['date'] + pd.Timedelta(days=1)
day_before_sent_cat.rename(columns={'day_sentiment': 'dayBefore_sent_cat'}, inplace=True)

# Shift each day's score one day back, so that after the merge
# a row dated D receives the score of D+1
day_after_sent_cat = mean_sentiment_per_day[['date', 'day_sentiment']].copy()
day_after_sent_cat['date'] = day_after_sent_cat['date'] - pd.Timedelta(days=1)
day_after_sent_cat.rename(columns={'day_sentiment': 'dayAfter_sent_cat'}, inplace=True)

# Merge the previous-day and next-day mean sentiment back into the main dataframe;
# rows whose adjacent day has no articles get NaN
df_covid = df_covid.merge(day_before_sent_cat, on='date', how='left')
df_covid = df_covid.merge(day_after_sent_cat, on='date', how='left')
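Because the direction of the date shift is easy to get backwards, a direct lookup can serve as a cross-check. This is a sketch under assumptions: `add_adjacent_day_sentiment` and its output column names are hypothetical, and the input mirrors the shape of `mean_sentiment_per_day` with a handful of daily values taken from the final table of this notebook.

```python
import datetime as dt
import pandas as pd

def add_adjacent_day_sentiment(daily):
    """Add previous-day and next-day mean sentiment by direct date lookup."""
    one_day = dt.timedelta(days=1)
    lookup = dict(zip(daily["date"], daily["day_sentiment"]))
    nan = float("nan")

    out = daily.copy()
    # For a row dated D, look up D-1 and D+1; missing days become NaN
    out["prev_day_sentiment"] = out["date"].map(lambda d: lookup.get(d - one_day, nan))
    out["next_day_sentiment"] = out["date"].map(lambda d: lookup.get(d + one_day, nan))
    return out

daily = pd.DataFrame({
    "date": [dt.date(2020, 2, 12), dt.date(2020, 2, 13),
             dt.date(2020, 2, 14), dt.date(2020, 2, 16)],
    "day_sentiment": [-0.008153, -0.004432, 0.063356, 0.071458],
})
checked = add_adjacent_day_sentiment(daily)
```

In this sample 2020-02-15 is absent, so 2020-02-14 has no next-day value and 2020-02-16 has no previous-day value, which matches the empty cells in the final table.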



In [64]:
# Save the modified DataFrame to a new CSV file
output_file_path = '/kaggle/working/df_covid.csv'
df_covid.to_csv(output_file_path, index=False)

output_file_path

'/kaggle/working/df_covid.csv'

### Remove

In [None]:
# columns_to_remove = ['dayAfter_sent_cat']  # Replace with actual column names to be removed
# df_covid.drop(columns=columns_to_remove, inplace=True)


In [None]:
# df_covid =  pd.read_csv("/kaggle/input/qqqqqq/df_covid (2).csv")
# df_covid['Symbol'].fillna('', inplace=True)
# df_covid['Security'].fillna('', inplace=True)
# df_covid['date'] = pd.to_datetime(df_covid['date']).dt.date

In [78]:
df_covid[1:20]

Unnamed: 0,Topic,Title,Context,date,media,country,Security,Symbol,title_sentiment,context_sentiment,sentiment,day_sentiment,sentiment_cat,day_sent_cat,daybefore,dayafter,dayBefore_sent_cat,dayAfter_sent_cat
1,Masks,San Diego lab discovers COVID-19 vaccine in 3 ...,"As the days go by, Inovio Pharmaceuticals is g...",2020-02-12,cbs8.com,Global,,,0.0,-0.2,-0.1,-0.008153,negative,negative,2020-02-11,2020-02-13,,-0.004432
2,Testing,J&J strengthens R&D into COVID-19 vaccine with...,The collaborative partnership with BARDA build...,2020-02-12,pharmaphorum.com,Global,,,0.0,0.115,0.0575,-0.008153,positive,negative,2020-02-11,2020-02-13,,-0.004432
3,Cases,Coronavirus outbreak: COVID-19 vaccine candida...,The total death toll in China climbed to 1113....,2020-02-12,theage.com.au,Australia,,,0.0,0.182273,0.091136,-0.008153,positive,negative,2020-02-11,2020-02-13,,-0.004432
4,Cases,Coronavirus outbreak: COVID-19 vaccine candida...,The total death toll in China climbed to 1113....,2020-02-13,watoday.com.au,Australia,,,0.0,0.182273,0.091136,-0.004432,positive,negative,2020-02-12,2020-02-14,-0.008153,0.063356
5,Masks,San Diego lab discovers COVID-19 vaccine in 3 ...,"As the days go by, Inovio Pharmaceuticals is g...",2020-02-13,wtol.com,Global,,,0.0,-0.2,-0.1,-0.004432,negative,negative,2020-02-12,2020-02-14,-0.008153,0.063356
6,Covid19,WHO chief says first Covid-19 vaccine could be...,Tedros reiterated the need to use available re...,2020-02-14,freemalaysiatoday.com,Global,,,0.225,0.194444,0.209722,0.063356,positive,positive,2020-02-13,2020-02-15,-0.004432,
7,SocialDistancing,As scientists race to produce Covid-19 vaccine...,By Analou De Vera An official of the World Hea...,2020-02-14,news.mb.com.ph,Philippines,,,0.1,0.107407,0.103704,0.063356,positive,positive,2020-02-13,2020-02-15,-0.004432,
8,Masks,San Diego lab discovers COVID-19 vaccine in 3 ...,"As the days go by, Inovio Pharmaceuticals is g...",2020-02-14,13newsnow.com,Global,,,0.0,-0.2,-0.1,0.063356,negative,positive,2020-02-13,2020-02-15,-0.004432,
9,Quarantine,China Hobbles Efforts Toward COVID-19 Vaccine,As former CDC chief Tom Friedman noted recentl...,2020-02-14,theepochtimes.com,Global,,,0.0,0.08,0.04,0.063356,positive,positive,2020-02-13,2020-02-15,-0.004432,
10,Masks,Covid-19 vaccine in late February: White House...,A US biotech firm Gilead announced earlier thi...,2020-02-16,dawn.com,Global,,,-0.15,-0.06,-0.105,0.071458,negative,positive,2020-02-15,2020-02-17,,0.032361
