# Creating a "Tweets" dataset using Twint

So, have you ever played **Plague, Inc**?
* Yes!
    * So, you probably remember those (annoying, maybe?) popups while you were playing, with news related to your disease. I'm importing that idea to our visualization page but instead of news I'll be using tweets from news portals Twitter Accounts.
* No :(
    * (TLDR) In `Plague, Inc`, news of your ficticious world pop up during the game. I consider it a cool idea and want to do it using real tweets.
    * Then I'll have to explain the idea to you. When you are playing `Plague, Inc`, you have in your power a bacteria, a virus, a fungus, or something that can cause an infectious disease. As long as your playing, you are able to evolve your disease by giving it new/stronger ways to infect, new/stronger synthoms, and new/stronger resistance to medicine. The days are passing by before you can get new points to acquire or evolve your disease's "skills", and while the time passes, news from that world start popping up in your screen. An example would be "Tokyo 2020 Olympics postponed to 2021 due to *name of your disease*". What we want to do is to create real popups in our visualization, something like "Tokyo Olympics postponed to 2021 due to coronavirus" and a link to that tweet from `The Guardian`, with that headline; in the tweet, you would probably see a link to that [news story](https://www.theguardian.com/sport/2020/mar/24/tokyo-olympics-to-be-postponed-to-2021-due-to-coronavirus-pandemic). Cool, huh?

From now on, we will be mining those tweets using [Twint](https://github.com/twint-project/twint), an open-source crawler that can scrape every tweet from a given date until the time of the query is called.

## Importing dependencies and libraries

In [1]:
import numpy as np
import pandas as pd
import os, sys, json
import twint
import nest_asyncio

I am using `news_asyncio` because `twint` is one of many libraries that does not work that nicely in a jupyter notebook, and can cause problems with asynchronous calls.

In [2]:
nest_asyncio.apply()

datapath = '../data/twitter_data'

In [3]:
def tweets2df(tweets, country):
    def tweet2json(tweet):
        return {
            'id': tweet.id,
            'datestamp': tweet.datestamp,
            'timestamp': tweet.timestamp,
            'username': tweet.username,
            'tweet': tweet.tweet,
            'replies_count': tweet.replies_count,
            'retweets_count': tweet.retweets_count,
            'likes_count': tweet.likes_count,
            'url': tweet.link,
            'country': country
        }
    
    jsons = [tweet2json(tweet) for tweet in tweets]
    df = pd.DataFrame(jsons)
    return df

In [4]:
def config_twint(username, since='2020-01-22'):
    twint.output.tweets_list = []
    tw = twint.Config()
    tw.Username = newspaper
    tw.Since = '2020-01-22'
    tw.Store_object = True
    tw.Stats = True
    tw.Hide_output = True
    return tw

In [9]:
def create(country, newspaper):
    """
    """
    print(f'Creating {newspaper} of {country}')
    dfname = f'{country}_{newspaper}.tsv'
    tw = config_twint(newspaper)

    twint.run.Search(tw)

    df = tweets2df(twint.output.tweets_list, country)
    if (dfname == 'China_chinaorgcn.tsv'):
        df.to_json(f'{datapath}/{dfname}')
    else:
        df.to_csv(f'{datapath}/{dfname}', sep='\t', index=False)

def update(country, newspaper):
    """
    """
    print(f'Updating {newspaper} of {country}')
    dfname = f'{country}_{newspaper}.tsv'
    dfname2 = f'{country}_{newspaper}_updated.tsv'
    if (dfname == 'China_chinaorgcn.tsv'):
        past = pd.read_json(f'{datapath}/{dfname}')
    else:
        past = pd.read_csv(f'{datapath}/{dfname}', sep='\t', engine='python')
    past.sort_values(['datestamp'], inplace=True, ascending=False)
    most_recent = past['datestamp'][0] + " " + past['timestamp'][0]
    print(f'Querying from {most_recent}. The current dataset has {len(past)} samples.')
    
    tw = config_twint(newspaper, most_recent)
    twint.run.Search(tw)

    df = tweets2df(twint.output.tweets_list, country)
    print(f'The updated dataset has {len(df)} samples')
    if (dfname == 'China_chinaorgcn.tsv'):
        df.to_json(f'{datapath}/{dfname2}')
    else:
        df.to_csv(f'{datapath}/{dfname2}', sep='\t', index=False)

In [6]:
countries = {
    'US': ['cnn', 'nytimes', 'huffpost', 'FoxNews'],
    'Brazil': ['g1', 'folha', 'Estadao', 'JornalOGlobo', 'JornaldoBrasil'],
    'Italy': ['repubblica', 'Corriere', 'Libero_official', 'virgilio_it'],
    'UK': ['BBCNews', 'guardian', 'MailOnline', 'Telegraph'],
    'China': ['ChinaDaily', 'PDChina', 'shanghaidaily', 'chinaorgcn']
}

In [7]:
fields = ['id', 'datestamp, timestamp', 'username', 'tweet', 'replies_count', 'retweets_count', 'likes_count', 'url']
formatted = '"' + "\t".join(['{' + field + '}' for field in fields]) + '"'

In [None]:
for country in countries.keys():
    newspapers = countries[country]
    for newspaper in newspapers:
        dfname = f'{country}_{newspaper}.tsv'
        if dfname in os.listdir(datapath):
            update(country, newspaper)
        else:
            create(country, newspaper)
    
print('done')

Updating cnn of US
Querying from 2020-03-29 03:01:06. The current dataset has 10560 samples.
The updated dataset has 11619 samples
Updating nytimes of US
Querying from 2020-03-29 03:01:06. The current dataset has 16732 samples.


CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:Tweets_known_error:Expecting value: line 1 column 1 (char 0)


Expecting value: line 1 column 1 (char 0) [x] run.Feed
[!] if get this error but you know for sure that more tweets exist, please open an issue and we will investigate it!
The updated dataset has 3660 samples
Updating huffpost of US
Querying from 2020-03-29 03:01:06. The current dataset has 20849 samples.


CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:Tweets_known_error:Expecting value: line 1 column 1 (char 0)


Expecting value: line 1 column 1 (char 0) [x] run.Feed
[!] if get this error but you know for sure that more tweets exist, please open an issue and we will investigate it!
The updated dataset has 0 samples
Updating FoxNews of US
Querying from 2020-03-29 03:01:06. The current dataset has 20919 samples.


CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:Tweets_known_error:Expecting value: line 1 column 1 (char 0)


Expecting value: line 1 column 1 (char 0) [x] run.Feed
[!] if get this error but you know for sure that more tweets exist, please open an issue and we will investigate it!
The updated dataset has 0 samples
Updating g1 of Brazil
Querying from 2020-03-29 03:01:06. The current dataset has 22348 samples.


CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:Tweets_known_error:Expecting value: line 1 column 1 (char 0)


Expecting value: line 1 column 1 (char 0) [x] run.Feed
[!] if get this error but you know for sure that more tweets exist, please open an issue and we will investigate it!
The updated dataset has 0 samples
Updating folha of Brazil
Querying from 2020-03-29 03:01:06. The current dataset has 34135 samples.


Dropping duplicate entries

In [None]:
df = tweets2df(twint.output.tweets_list, country)
df.to_json(f'{datapath}/{dfname}')