In [2]:
import pandas as pd
import warnings
from pprint import pprint

from src import wiki, news, twitter
warnings.filterwarnings("ignore")

# Wikipedia

Retrieve Wikipedia pages that have titles matching a particular topic of interest, and then extract the relevant sections from those pages.

Next, examine all the articles referenced within these pages. If a referenced article is relevant to the topic of interest and contains specific keywords, retrieve the pages of those articles as well.
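
The keyword filter on referenced articles can be sketched roughly as follows. This is a minimal illustration, not the code in `src/wiki.py`: the helper name `filter_linked_titles`, the rule that a title must contain every keyword (case-insensitive), and the sample link titles are all assumptions.

```python
def filter_linked_titles(linked_titles, keywords):
    """Keep linked titles that contain every keyword (case-insensitive).

    Hypothetical helper: the matching rule is an assumption, not the
    actual logic of wiki.get_wiki_pages.
    """
    lowered = [k.lower() for k in keywords]
    return [t for t in linked_titles if all(k in t.lower() for k in lowered)]


# Illustrative link titles, as they might appear on the seed page.
links = [
    "2022 FIFA World Cup Group E",
    "2018 FIFA World Cup",
    "Qatar",
]
kept = filter_linked_titles(links, ["2022", "fifa", "world", "cup"])
print(kept)
```

Requiring every keyword is a strict rule that keeps the crawl close to the seed topic. A looser rule (any keyword) would pull in many more pages.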


In [5]:
# Extract sections from the Wikipedia articles for the listed titles,
# then recursively extract all linked articles, filtered by the given keywords.

df_wiki = wiki.get_wiki_pages(titles = ["2022 FIFA World Cup"], 
                         keywords = ["2022", "fifa", "world", "cup"])

Number of wiki pages: 59
Number of wiki sections: 754


Examples of titles/sections extracted:

In [3]:
df_wiki.head(10)

Unnamed: 0,source,title,heading,text,words
0,wiki,2022 FIFA World Cup,Summary,The 2022 FIFA World Cup was an international f...,354
1,wiki,2022 FIFA World Cup,Overview,The FIFA World Cup is a professional football ...,129
2,wiki,2022 FIFA World Cup,Schedule,"Unlike previous FIFA World Cups, which are typ...",276
3,wiki,2022 FIFA World Cup,Prize money,"In April 2022, FIFA announced the prizes for a...",54
4,wiki,2022 FIFA World Cup,Rule changes,The tournament featured new substitution rules...,99
5,wiki,2022 FIFA World Cup,Host selection,The bidding procedure to host the 2018 and 202...,191
6,wiki,2022 FIFA World Cup,Host selection criticism,There have been allegations of bribery and cor...,668
7,wiki,2022 FIFA World Cup,Cost of hosting the tournament,"At an estimated cost of over $220 billion, it ...",58
8,wiki,2022 FIFA World Cup,Venues,The first five proposed venues for the World C...,459
9,wiki,2022 FIFA World Cup,Stadiums,Team base camps \nBase camps were used by the ...,102


Example of one section:

In [4]:
idx = 3
print("title: ", df_wiki.title.iloc[idx])
print("heading: ", df_wiki.heading.iloc[idx])
print("text: ", df_wiki.text.iloc[idx])

title:  2022 FIFA World Cup
heading:  Prize money
text:  In April 2022, FIFA announced the prizes for all participating nations. Each qualified team received $1.5 million before the competition to cover preparation costs with each team receiving at least $9 million in prize money. This edition's total prize pool was $440 million, $40 million greater than the prize pool of the previous tournament.
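
The `words` column appears to hold a whitespace-delimited word count. This is an inference from the data, not something confirmed by the `src` modules. A quick check on the "Prize money" section reproduces the value 54 shown in the table:

```python
# Text of the "Prize money" section, copied from the output above.
text = (
    "In April 2022, FIFA announced the prizes for all participating nations. "
    "Each qualified team received $1.5 million before the competition to cover "
    "preparation costs with each team receiving at least $9 million in prize money. "
    "This edition's total prize pool was $440 million, $40 million greater than "
    "the prize pool of the previous tournament."
)

# Count tokens separated by whitespace (assumed definition of `words`).
word_count = len(text.split())
print(word_count)
```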


In [5]:
print("Examples of extracted wiki page titles:")
print(*("- " + df_wiki.title.unique()[:10]), sep='\n')


Examples of extracted wiki page titles:
- 2022 FIFA World Cup
- 2022 FIFA World Cup qualification (OFC)
- Canada v Mexico (2022 FIFA World Cup qualification)
- 2022 FIFA World Cup Group E
- 2022 FIFA World Cup qualification – AFC second round
- 2022 FIFA World Cup qualification – AFC third round
- 2022 FIFA World Cup qualification – AFC fourth round
- Human rights issues involving the 2022 FIFA World Cup
- 2022 FIFA World Cup qualification (CAF)
- 2022 FIFA World Cup broadcasting rights


# News API

Use the News REST API to retrieve the list of news articles published on a given topic.

The free subscription only allows searching articles that are up to a month old.

The News API returns the unformatted content of each article, where available, truncated to 200 characters. We can then use the `newspaper` Python library to extract the full content from the article's URL.
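
The two steps can be sketched as below. This is not the code in `src/news.py`: the helper names `build_news_url` and `fetch_full_text` are hypothetical, and `"YOUR_KEY"` is a placeholder. The request targets the News API `everything` endpoint, and the full text comes from `newspaper`'s `Article` class, which is imported lazily so the URL helper works without the library installed.

```python
from urllib.parse import urlencode

NEWS_API_URL = "https://newsapi.org/v2/everything"


def build_news_url(api_key, query, sort_by="popularity", search_in="title",
                   page_size=20, page=1):
    """Build the request URL for the News API 'everything' endpoint."""
    params = {
        "q": query,
        "sortBy": sort_by,
        "searchIn": search_in,
        "pageSize": page_size,
        "page": page,
        "apiKey": api_key,
    }
    return f"{NEWS_API_URL}?{urlencode(params)}"


def fetch_full_text(url):
    """Download an article and return its full text using newspaper."""
    from newspaper import Article  # lazy import; needs the newspaper package

    article = Article(url)
    article.download()
    article.parse()
    return article.text


url = build_news_url("YOUR_KEY", "World Cup 2022")
print(url)
```

Calling `fetch_full_text` on each `url` field returned by the API would replace the 200-character snippet with the complete article text, where the page can be parsed.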

In [6]:
# News API: fetch news articles on a given topic.
# The free API tier is limited to the last 30 days.

NEWS_API_KEY = ""

query = "World Cup 2022"
from_date = ""
to_date = ""

sort_by = "popularity" # popularity, relevancy, publishedAt
searchIn = "title" # match query in title, description, content (can be multiple)
page_size = 20

df_news = news.get_news_pages(NEWS_API_KEY, query, from_date, to_date, 
                            sort_by, searchIn, page_size)


Getting page 1...
Getting page 2...
Number of articles: 17


In [4]:
df_news.head()

Unnamed: 0,source,title,author,description,media,url,publishedAt,text,words
0,news,"US Soccer announces plans to honor Grant Wahl,...",Austin Nivison,Wahl spent decades covering soccer in the Unit...,CBS Sports,https://www.cbssports.com/soccer/news/us-socce...,2023-01-26T21:08:01Z,U.S. Soccer has announced that it will be hono...,254
1,news,FIFA Club World Cup 2022: Al Ahly vs Real Madr...,James Evans,Real Madrid face Egyptian side Al Ahly in the ...,Daily Mail,https://www.dailymail.co.uk/sport/football/art...,2023-02-08T10:14:23Z,Real Madrid are set to face Al Ahly in the Clu...,649
2,news,Exploring the FIFA World Cup 2022 using networ...,Ingrid Fadelli,"Network science is the study of physical, biol...",Tech Xplore,https://techxplore.com/news/2023-02-exploring-...,2023-02-02T16:40:01Z,Credit: Milan Janosov and Patrik Szigeti.\nNet...,880
3,news,"Club World Cup 2022 - How to watch, when is it...",James Evans,Sportsmail breaks down everything you need to ...,Daily Mail,https://www.dailymail.co.uk/sport/football/art...,2023-02-03T16:34:56Z,Carlo Ancelotti's Real Madrid head to Morocco ...,1070
4,news,"Alpine skiing TV, live stream schedule for 202...",OlympicTalk,NBC Sports and Peacock air live coverage of th...,NBCSports.com,https://olympics.nbcsports.com/2023/01/25/alpi...,2023-01-25T19:20:48Z,Click to email a link to a friend (Opens in ne...,1485


In [5]:
print("Examples of extracted news article titles:")
print(*("- " + df_news.title.unique()[:10]), sep='\n')

Examples of extracted news article titles:
- US Soccer announces plans to honor Grant Wahl, journalist who died at 2022 World Cup in Qatar
- FIFA Club World Cup 2022: Al Ahly vs Real Madrid - start time, how to watch in UK and team news
- Exploring the FIFA World Cup 2022 using network science
- Club World Cup 2022 - How to watch, when is it, how do you qualify and who are the players to watch
- Alpine skiing TV, live stream schedule for 2022-23 World Cup season
- Al Ahly vs Auckland City: TV Channel, how and where to watch or live stream online free 2022 Club World Cup... - Bolavip
- Africa: 2022 World Cup Stars Shining At the TotalEnergies CHAN in Algeria
- Real vô địch FIFA Club World Cup 2022
- ONCF Joins FIFA in Success of Club World Cup Morocco 2022 (Statement)
- Rugby: New Black Ferns director of rugby Allan Bunting looking to keep ball rolling after 2022 World Cup success


Not all titles are relevant to our topic: the event happened about three months ago, while the free API is limited to the last month.
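
One possible way to drop off-topic articles is a simple title filter. The sketch below is an illustration only, not part of the pipeline: the include pattern (`fifa|qatar`) and exclude pattern (`club world cup`) are assumptions chosen by looking at the titles above.

```python
import pandas as pd

# A few titles copied from the output above.
df_demo = pd.DataFrame({"title": [
    "US Soccer announces plans to honor Grant Wahl, journalist who died at 2022 World Cup in Qatar",
    "FIFA Club World Cup 2022: Al Ahly vs Real Madrid - start time, how to watch in UK and team news",
    "Exploring the FIFA World Cup 2022 using network science",
    "Alpine skiing TV, live stream schedule for 2022-23 World Cup season",
]})

# Keep titles that mention FIFA or Qatar, but not the Club World Cup.
include = df_demo["title"].str.contains("fifa|qatar", case=False)
exclude = df_demo["title"].str.contains("club world cup", case=False)
relevant = df_demo[include & ~exclude]

print(relevant["title"].tolist())
```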

In [6]:
idx = 2
print("title: ", df_news.title.iloc[idx])
print("author: ", df_news.author.iloc[idx])
print("description: ", df_news.description.iloc[idx])
print("url: ", df_news.url.iloc[idx])
print("text: ", df_news.text.iloc[idx])

title:  Exploring the FIFA World Cup 2022 using network science
author:  Ingrid Fadelli
description:  Network science is the study of physical, biological, social and other phenomena through the creation of network representations. These representations can sometimes offer very valuable insight, unveiling interesting patterns in data and relationships between…
url:  https://techxplore.com/news/2023-02-exploring-fifa-world-cup-network.html
text:  Credit: Milan Janosov and Patrik Szigeti.
Network science is the study of physical, biological, social and other phenomena through the creation of network representations. These representations can sometimes offer very valuable insight, unveiling interesting patterns in data and relationships between connected entities.
Milán Janosov and Patrik Szigeti, two data scientists working at Central European University, Baoba Inc. and Revolut recently used network science to examine the FIFA World Cup 2022. The network representations they created, out

# Twitter

In [7]:
# The Twitter API is limited to the last 7 days.

TWITTER_API_KEY = ""
TWITTER_API_SECRET_KEY = ""
TWITTER_ACCESS_TOKEN = ""
TWITTER_ACCESS_TOKEN_SECRET = ""

query = "Fifa World Cup"
language = "en"
num_tweets = 1000
include_RTs = False

auth_api = twitter.twitter_api_auth(TWITTER_API_KEY, TWITTER_API_SECRET_KEY, TWITTER_ACCESS_TOKEN, TWITTER_ACCESS_TOKEN_SECRET)
df_twitter = twitter.get_tweets(query, language, num_tweets, auth_api, include_RTs)


Fifa World Cup -filter:retweets
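
The printed line suggests that `twitter.get_tweets` appends the `-filter:retweets` search operator when `include_RTs` is `False`. A minimal sketch of that step, assuming this behavior (the helper name `build_search_query` is hypothetical):

```python
def build_search_query(query, include_rts):
    """Append the -filter:retweets operator when retweets are excluded."""
    return query if include_rts else f"{query} -filter:retweets"


search_query = build_search_query("Fifa World Cup", include_rts=False)
print(search_query)
```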


# All data

Concatenate the content from all sources into one dataframe.

In [74]:
df = pd.concat([df_wiki, df_news, df_twitter],join='inner', ignore_index=True)
df.head()


Unnamed: 0,source,text,words
0,wiki,The 2022 FIFA World Cup was an international f...,354
1,wiki,The FIFA World Cup is a professional football ...,129
2,wiki,"Unlike previous FIFA World Cups, which are typ...",276
3,wiki,"In April 2022, FIFA announced the prizes for a...",54
4,wiki,The tournament featured new substitution rules...,99
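
With `join='inner'`, `pd.concat` keeps only the columns present in every input frame, which is why source-specific columns such as `title`, `heading`, or `url` are gone from the result. A small sketch with toy frames (the column sets here are illustrative, not the exact schemas of the real dataframes):

```python
import pandas as pd

wiki_df = pd.DataFrame({"source": ["wiki"], "title": ["A"], "heading": ["Summary"],
                        "text": ["some text"], "words": [2]})
news_df = pd.DataFrame({"source": ["news"], "title": ["B"], "url": ["https://example.com"],
                        "text": ["other text here"], "words": [3]})
tweets_df = pd.DataFrame({"source": ["twitter"], "text": ["a tweet"], "words": [2]})

# Only columns shared by all three frames survive an inner join.
combined = pd.concat([wiki_df, news_df, tweets_df], join="inner", ignore_index=True)
print(list(combined.columns))
```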


In [75]:
df['source'].value_counts()

twitter    1000
wiki        754
news         17
Name: source, dtype: int64

In [76]:
sources = [
    "wiki",
    #"news", 
    #"twitter"
]

df = df.loc[df["source"].isin(sources)]


Number of sections in the data:

In [77]:
len(df)

754

In [78]:
df.to_csv("data/fifa_data.csv")
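
Note that `to_csv` writes the index by default, and reading such a file back without `index_col` yields an `Unnamed: 0` column like the one in the tables above. Whether that is the origin of the column here is an assumption. The sketch below shows the difference, using an in-memory buffer instead of a file:

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"source": ["wiki"], "text": ["hello world"], "words": [2]})

# Default behavior: the index is written as an unnamed first column.
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False omits the index, so the round trip preserves the columns.
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```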