# TalkingVac: Answering Metro Manila’s COVID-19 Vaccine Questions

## Executive Summary

2020 will go down in our modern history as the year of the COVID-19 pandemic. This infectious respiratory disease has disrupted the daily lives of billions of people around the world.

The first case of COVID-19 in the Philippines was identified on January 30, 2020. The virus eventually spread throughout the country, prompting the government to impose lockdowns, or community quarantines, to limit its spread. For our daily lives to return to normal, we have to reach herd immunity.


### But what is herd immunity?

Herd immunity occurs when a large portion of a community (the herd) becomes immune to a disease, making the spread from person to person unlikely. It is vital because herd immunity protects the whole community, and not just those who are immune to the disease. Moreover, the more contagious a disease is, the greater the proportion of the population that needs to be immune to it to stop its spread. For example, measles is a highly contagious illness. It's estimated that 94% of the population must be immune to interrupt the chain of transmission.
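
The link between contagiousness and the required immune share can be made concrete with the textbook approximation `1 - 1/R0`, where `R0` is the basic reproduction number. The sketch below is illustrative only: the function name and the `R0` values are assumptions rather than figures from this project, and the formula ignores imperfect vaccines and uneven mixing within a population.

```python
def herd_immunity_threshold(r0: float) -> float:
    """Share of the population that must be immune under the simple 1 - 1/R0 model."""
    if r0 <= 1:
        # With R0 <= 1 an outbreak shrinks on its own, so no threshold applies.
        return 0.0
    return 1.0 - 1.0 / r0


# Illustrative R0 values (assumptions chosen for the example, not measured data).
for r0 in (2.0, 5.0, 16.0):
    print(f"R0 = {r0:g}: about {herd_immunity_threshold(r0):.0%} must be immune")
```

Under this simplified model, an `R0` near 16 gives a threshold of roughly 94%, which is consistent with the measles estimate quoted above.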


### How do we achieve herd immunity for COVID-19?

There are two paths: vaccines and natural infection. Vaccines are the ideal approach because they create immunity without causing illness or resulting complications. Using the concept of herd immunity, vaccines have successfully controlled deadly contagious diseases such as smallpox, polio, and many others. As for natural infection, even though the majority of COVID-19 cases are asymptomatic or mild, some individuals develop complications and severe symptoms that can be deadly.

### In this project we explore:
1. "What are the sentiments of the people on COVID-19 vaccination?" through *sentiment analysis*
2. "What are the main concerns on COVID-19 vaccination?" through *topic modelling*
3. "How do we address people’s concerns/ questions on COVID-19 vaccination?" through building a "Question Answering Bot"

## Data Information

### 1. Data Scraping

<img src="images/Data Extraction.png" style="width: 800px;"/>
 <figcaption><center><b>Figure 1.</b> Diagram of data scraping</center></figcaption>

We chose Twitter as our data source and scraped it using snscrape. We limited the scraping to Metro Manila because the tweets are translated later on and, among Philippine languages, Google Translate currently supports only Filipino and Cebuano. We scraped tweets posted from December 2020 to March 2021 using these keywords: *vaccine, vaccination, vax, covidvaccine, covidvaccination, covidvax, covidphvaccine, coronavirusvaccine, coronavax, vaccineph, covid19vaccine, getvaccinated, sinovac, astazeneca, az, coronavac, novovax, sputnikv, pfizer, moderna, coronavirusvaccination*.
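
As a rough illustration of this step, the sketch below builds a Twitter search string and passes it to snscrape's `TwitterSearchScraper`. The helper names, the 10 km radius, the example city, and the shortened keyword list are assumptions made for the example; the project's actual query is in the notebook linked below.

```python
from typing import Iterable, Iterator


def build_query(keywords: Iterable[str], city: str, since: str, until: str,
                radius_km: int = 10) -> str:
    """Compose a Twitter search string with keyword, location, and date filters."""
    terms = " OR ".join(keywords)
    return f'({terms}) near:"{city}" within:{radius_km}km since:{since} until:{until}'


def scrape(query: str, city: str, limit: int = 100) -> Iterator[dict]:
    """Yield tweets shaped like Table 1 (needs snscrape and network access)."""
    import snscrape.modules.twitter as sntwitter  # imported lazily: optional dependency

    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        if i >= limit:
            break
        # `content` is the text attribute in 2021-era snscrape releases (assumption).
        yield {"ID": tweet.id, "Date": tweet.date, "Content": tweet.content, "City": city}


# `until` is exclusive, so 2021-04-01 covers tweets up to the end of March 2021.
query = build_query(["vaccine", "vax", "sinovac"], "Makati", "2020-12-01", "2021-04-01")
print(query)
```

Tagging each row with the city used in the query is one way to obtain the `City` column described in Table 1.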

We collected almost 30 thousand tweets, of which a little more than 3 thousand were unique (distinct tweet IDs). We treated the duplicates as retweets.
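
A minimal sketch of this deduplication, assuming each scraped row is a dictionary with the fields in Table 1. The function name and the added `Retweets` field are hypothetical and are not taken from the project notebooks.

```python
from collections import Counter


def collapse_duplicates(rows):
    """Keep the first row per tweet ID and record how many duplicates it had."""
    counts = Counter(row["ID"] for row in rows)
    seen = set()
    unique = []
    for row in rows:
        if row["ID"] in seen:
            continue  # a repeated ID is treated as a retweet of the first occurrence
        seen.add(row["ID"])
        unique.append({**row, "Retweets": counts[row["ID"]] - 1})
    return unique


rows = [
    {"ID": 1, "Content": "first tweet"},
    {"ID": 1, "Content": "first tweet"},
    {"ID": 2, "Content": "second tweet"},
]
for row in collapse_duplicates(rows):
    print(row["ID"], row["Retweets"])
```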

For a detailed process of data gathering, you may explore the [Data Gathering and Preprocessing (Tweets)](Data%20Gathering%20and%20Preprocessing%20(Tweets).ipynb) notebook.

#### Table 1. Data dictionary (Tweets)

| Data | Data Type | Description |
| --- | --- | --- |
| ID | Integer | Tweet ID |
| Date | Datetime | Date of the tweet |
| Content | String | Tweet |
| City | String | City name, tagged while scraping |

### 2. Data Pre-Processing

<img src="images/Data Preprocessing.png" style="width: 800px;"/>
<figcaption><center><b>Figure 2.</b> Diagram of Data Pre-Processing</center></figcaption>

To ensure the quality of the insights from our models, we meticulously checked the tweets and removed unrelated ones, such as those:
- mentioning a different kind of vaccine, like flu or anti-rabies,
- using ‘az’ in place of the word ‘as’,
- written in a dialect that Google Translate cannot translate,
- mentioning a vaccine brand name that actually refers to something entirely different, like a fashion store named Moderna.
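
The checking described above was done by hand. As a sketch only, a keyword pre-filter like the one below could flag candidates for that manual review. The patterns, reason strings, and function name are assumptions made for illustration, and such rules cannot replace reading the tweets.

```python
import re

# Hypothetical patterns; a real list would be tuned against the scraped tweets.
OTHER_VACCINE = re.compile(r"\b(flu|anti-?rabies|rabies)\b", re.IGNORECASE)
VACCINE_CONTEXT = re.compile(r"\b(covid\w*|corona\w*|vaccin\w*|vax\w*|dose|jab)\b",
                             re.IGNORECASE)
AMBIGUOUS_KEYWORD = re.compile(r"\b(moderna|az)\b", re.IGNORECASE)


def review_reasons(text: str) -> list:
    """Return the reasons a tweet should be checked manually (empty if none)."""
    reasons = []
    if OTHER_VACCINE.search(text):
        reasons.append("mentions another vaccine type")
    if AMBIGUOUS_KEYWORD.search(text) and not VACCINE_CONTEXT.search(text):
        reasons.append("ambiguous keyword without vaccine context")
    return reasons


for tweet in ["Got my flu shot today", "New dresses at Moderna this weekend",
              "Moderna vaccine trial results are out"]:
    print(tweet, "->", review_reasons(tweet))
```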


After that, we translated the tweets using the Google Translate API. Some tweets could not be translated automatically because of how certain words were spelled, so we translated those manually.

We then transformed the dataset into the format that the models require. First, we converted the tweets to lowercase and removed punctuation and numbers. Next came tokenization, which splits each tweet into individual words, followed by the application of bigrams and trigrams. Bigrams and trigrams join two or three words that frequently appear together and lose their meaning when used alone, such as "high school". After that, we removed the stop words, which are very common words such as “the”, “is”, “in”, “for”, “where”, “when”, “to”, “at”, and many more. Lastly, we lemmatized the tweets, converting each word to its meaningful base form, called a lemma.
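
The steps above can be sketched with the standard library alone. This is a simplified stand-in, not the project's pipeline: the stop-word list is a tiny sample, bigrams are detected by raw co-occurrence counts (the `min_count` threshold is an assumption), trigrams would need a second merging pass, and lemmatization is omitted because it requires an NLP library.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "is", "in", "for", "where", "when", "to", "at", "a", "of", "and"}


def tokenize(text):
    """Lowercase, replace punctuation and digits with spaces, then split into words."""
    return re.sub(r"[^a-z\s]", " ", text.lower()).split()


def learn_bigrams(token_lists, min_count=2):
    """Collect adjacent non-stop-word pairs seen at least `min_count` times."""
    counts = Counter(
        pair for tokens in token_lists for pair in zip(tokens, tokens[1:])
        if pair[0] not in STOP_WORDS and pair[1] not in STOP_WORDS
    )
    return {pair for pair, n in counts.items() if n >= min_count}


def merge_bigrams(tokens, bigrams):
    """Join known pairs into single tokens, e.g. high + school -> high_school."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def preprocess(tweets, min_count=2):
    """Tokenize, merge frequent bigrams, then drop stop words."""
    token_lists = [tokenize(t) for t in tweets]
    bigrams = learn_bigrams(token_lists, min_count)
    return [[w for w in merge_bigrams(tokens, bigrams) if w not in STOP_WORDS]
            for tokens in token_lists]


print(preprocess(["The high school is open for vaccination!",
                  "Vaccination at the high school starts in 2021."]))
```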

For a detailed process of data pre-processing, you may explore the [Data Gathering and Preprocessing (Tweets)](Data%20Gathering%20and%20Preprocessing%20(Tweets).ipynb) notebook.

## What are the sentiments of the people on COVID-19 vaccination?

<img src="images/Sentiment Analysis.png" style="width: 800px;"/>
<figcaption><center><b>Figure 3.</b> Process diagram of sentiment analysis</center></figcaption>

We answered the question “What are the sentiments of the people on COVID-19 vaccination?” through sentiment analysis. Using the cleaned and formatted data, we assigned each tweet a weight to adjust its impact (1 for an original tweet, 0.5 for a retweet). We then generated polarity and subjectivity scores using TextBlob. For a more detailed walkthrough of how we did sentiment analysis, head over to the [Sentiment Analysis](Sentiment%20Analysis.ipynb) notebook.
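
A minimal sketch of the weighting and scoring follows. It relies on two assumptions that the text does not spell out: each retweet adds 0.5 to the original tweet's weight of 1, and a tweet is labelled by the sign of its polarity score. The function names are hypothetical; `TextBlob(text).sentiment` is the library call that returns the polarity and subjectivity scores.

```python
from collections import defaultdict


def score(text):
    """Return (polarity, subjectivity) from TextBlob (needs the textblob package)."""
    from textblob import TextBlob  # imported lazily: optional dependency

    sentiment = TextBlob(text).sentiment
    return sentiment.polarity, sentiment.subjectivity


def tweet_weight(retweets):
    """Weight of 1 for the original tweet plus 0.5 per retweet (assumed reading)."""
    return 1.0 + 0.5 * retweets


def label(polarity):
    """Map a polarity score in [-1, 1] to a sentiment label (assumed thresholds)."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"


def weighted_distribution(rows):
    """rows: iterable of (polarity, retweet_count); returns the share per label."""
    totals = defaultdict(float)
    for polarity, retweets in rows:
        totals[label(polarity)] += tweet_weight(retweets)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {name: value / grand_total for name, value in totals.items()}


# Toy polarity scores, not project data.
print(weighted_distribution([(0.5, 2), (-0.2, 0), (0.0, 0)]))
```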

 <img src="images/Pie Sentiments.png" style="width: 550px;"/>
 <figcaption><center><b>Figure 4.</b> Percentage Distribution of Sentiments</center></figcaption>

As Figure 4 shows, almost half of the sentiments on Twitter were positive. The table below shows example tweets tagged as Positive, Negative, and Neutral.

#### Table 2. Example tweets tagged as Positive, Negative, and Neutral

| Sentiment | Tweet |
| --- | --- |
| Positive | Not a covid free Christmas but atleast vaccines are already around the corner. |
| Negative | More COVID-19 scams: Criminal networks expected to pitch fake vaccines |
| Neutral | We deserve to be choosy because we need to find the right vaccine for us based on efficacy and safety |

## What are the main concerns on COVID-19 vaccination?

<img src="images/Process Topic Modelling.png" style="width: 800px;"/>
<figcaption><center><b>Figure 5.</b> Process diagram of topic modelling</center></figcaption>

Having an idea of how Metro Manila Twitter users feel about vaccines and vaccination, we now want to be more specific. This leads to our second question: *What are the main concerns around COVID-19 vaccination?* We answer it with topic modelling, which, in a nutshell, uses statistical models to group similar tweets together.

For this project, we explored different models: LSI, LDA, LDA Mallet, and HDP. We generated our final topics with LDA Mallet because, even though HDP had the highest coherence score, we could not find enough literature to help us extract the optimum number of topics from that model. For a more comprehensive discussion of the models we explored and how they performed, visit the [Topic Modelling](Topic%20Modelling.ipynb) notebook.
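
To show how coherence can guide the choice of the number of topics, here is a sketch that uses gensim's built-in `LdaModel` as a stand-in for LDA Mallet, which needs a separate Java installation. The function names, the `c_v` coherence measure, and the fixed random seed are assumptions; the notebook documents what was actually run.

```python
from typing import Dict, List, Sequence


def coherence_by_num_topics(texts: List[List[str]],
                            candidates: Sequence[int]) -> Dict[int, float]:
    """Train one LDA model per candidate topic count and score its coherence."""
    # Imported lazily so the helper below works without gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(tokens) for tokens in texts]
    scores = {}
    for num_topics in candidates:
        model = LdaModel(corpus=corpus, id2word=dictionary,
                         num_topics=num_topics, random_state=0)
        coherence = CoherenceModel(model=model, texts=texts,
                                   dictionary=dictionary, coherence="c_v")
        scores[num_topics] = coherence.get_coherence()
    return scores


def best_num_topics(scores: Dict[int, float]) -> int:
    """Return the topic count with the highest coherence score."""
    return max(scores, key=scores.get)
```

Here `texts` would be the list of token lists produced by pre-processing, and `candidates` a range of topic counts to compare.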

To identify the topics generated, we looked at two things: words associated with each group, and the tweets that are most representative of each group.
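
Both checks, along with the share of tweets per topic, can be derived from each tweet's topic distribution. The sketch below assumes that every tweet comes with a list of topic probabilities, as most topic models provide; the function names and toy numbers are hypothetical.

```python
from collections import Counter


def dominant_topic(distribution):
    """Return (topic_index, probability) for the most likely topic of one tweet."""
    topic = max(range(len(distribution)), key=distribution.__getitem__)
    return topic, distribution[topic]


def most_representative(tweets, distributions):
    """For each topic, pick the tweet whose dominant-topic probability is highest."""
    best = {}
    for tweet, distribution in zip(tweets, distributions):
        topic, probability = dominant_topic(distribution)
        if topic not in best or probability > best[topic][1]:
            best[topic] = (tweet, probability)
    return {topic: tweet for topic, (tweet, _) in best.items()}


def topic_shares(distributions):
    """Fraction of tweets assigned to each topic by their dominant topic."""
    counts = Counter(dominant_topic(d)[0] for d in distributions)
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}


# Toy data: three tweets and two topics.
tweets = ["tweet a", "tweet b", "tweet c"]
distributions = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
print(most_representative(tweets, distributions))
print(topic_shares(distributions))
```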

#### Table 3. List of topics and corresponding keywords

| Topic Name | Keywords |
| --- | --- |
| FDA approval of vaccines | dose, approval, approve, president, local, deal, supply, arrival, march, register |
| Start of vaccines rollout | health, start, receive, expensive, today, feel, early, brand, hospital, end |
| Unregistered vaccines | time, public, work, medical, protect, launch, official, unregistered, risk, law |
| Being hopeful and hopeless | give, good, hope, issue, safe, news, pandemic, care, move, stupid |
| Vaccine efficacy and safety | efficacy, effective, case, day, high, safety, low, datum, show, result |
| Vaccine injection | inject, choose, question, base, leader, priority, price, choice, cost, die |
| Distrust on gov’t and China-made vaccine | government, make, china, national, company, order, procure, step, push, budget |
| Vaccine arrival | buy, arrive, smuggle, money, free, long, city, sec, donate, big |
| Call for a concrete vaccine rollout plan | people, wait, plan, world, trust, expert, live, life, lot, doctor |
| Travel ban against COVID-19 variants | country, program, mayor, czar, lack, focus, variant, surge, procurement, lead   |

 <img src="images/Topic Modelling.png" style="width: 800px;"/>
 <br></br> <br></br>
 <figcaption><center><b>Figure 6.</b> Percentage Distribution of Topics among Tweets</center></figcaption>

In Figure 6, we can see that the topics are largely about the logistics, the procurement, and the efficacy of the vaccine. For example, 12% of the tweets talk about the FDA approval of vaccines, while another 11% talk about the start of the vaccine rollout. 

## How do we address people’s concerns/ questions on COVID-19 vaccination?

<img src="images/QABot.png" style="width: 800px;"/>
<figcaption><center><b>Figure 7.</b> Process diagram of the question answering bot</center></figcaption>

Our final question is: 'How do we address people's main questions or concerns on COVID-19 vaccination?' To address the concerns identified through topic modelling, we decided to create a chatbot, a concrete, useful, and highly relevant product.

We have created a question answering bot by training it on various question and answer pairs on general COVID-19 vaccine queries. Once the user inputs a question, the bot then matches the input question with the questions in the training dataset. It then gets the answer to the question closest to the input, which it returns as an output.
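
The matching idea can be illustrated with a plain bag-of-words cosine similarity. This is a deliberately simple substitute for the matching done inside Chatterbot or BERT; the function names, the similarity threshold, and the placeholder answers are assumptions made for the example.

```python
import math
import re
from collections import Counter

FALLBACK = "Sorry, I don't have an answer for that yet."


def bag_of_words(text):
    """Count lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def answer(question, qa_pairs, min_similarity=0.3):
    """Return the answer paired with the stored question closest to the input."""
    query = bag_of_words(question)
    scored = [(cosine(query, bag_of_words(q)), a) for q, a in qa_pairs]
    best_score, best_answer = max(scored, key=lambda item: item[0])
    return best_answer if best_score >= min_similarity else FALLBACK


# Placeholder answers; real entries would hold the text taken from the sources.
qa_pairs = [
    ("Is the COVID-19 vaccine free?", "ANSWER_ABOUT_COST"),
    ("What are the side effects of the vaccine?", "ANSWER_ABOUT_SIDE_EFFECTS"),
]
print(answer("is the vaccine free", qa_pairs))
```

The threshold keeps the bot from returning an unrelated answer when no stored question resembles the input.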

<img src="images/TalkingVac logo 2.png" style="width: 300px;"/>
<figcaption><center><b>Figure 8.</b> TalkingVac logo</center></figcaption>

TalkingVac is our COVID-19 vaccine chatbot, best used for quick access to verified general information about COVID-19 and COVID-19 vaccination. It draws on a wide variety of frequently asked questions and their answers from reputable web sources such as the Philippine Department of Health (DOH), the Philippine Society of Microbiology and Infectious Diseases (PSMID), the World Health Organization (WHO), and the United States Centers for Disease Control and Prevention (CDC).

We created two versions of the chatbot, one using Chatterbot and the other using BERT, so that we could compare the accuracy of the two approaches and leverage both.

Head over to the [QABot (Chatterbot) notebook](QABot%20(Chatterbot).ipynb) and [QABot (BERT) notebook](QABot%20(BERT).ipynb) to explore the question answering bots.

### Current capability of TalkingVac
- Answers frequently asked questions about COVID-19 vaccines and vaccination in the Philippines
- Provides general guidelines on vaccination (e.g., how to get vaccinated, things to do before and after vaccination)

### Current limitations of TalkingVac
- Cannot give real-time COVID-19 statistics
- Does not have specific, personalized, and/or localized information
- Cannot answer questions in Filipino

#### Table 4. Data dictionary (QABot)

| Data | Data Type | Description |
| --- | --- | --- |
| Topic | String | General category that the question falls under |
| Source | String | Name of the organization whose website the QA pair was taken from |
| Link | String | URL of the webpage where the QA pair was taken from  |
| Last Updated | Date | Date when the information was accessed |
| Question | String | Question related to COVID-19 or COVID-19 vaccines |
| Answer | String | Corresponding answer to the question  |