# Obtaining sources

Today we are going to talk about how to obtain sources for digital history projects. Generally, this can be done in two ways: either via existing Internet archives, or on your own, using web scraping and platform APIs (application programming interfaces).

## Internet archives

As discussed during the last meeting, it is possible to use open-access Internet archives (links on Ilias). Such data are usually delivered without any 'methodological' description of how they were obtained and without any source code. Note that there are certain standards (e.g. Dublin Core) for 'describing' databases using specific metadata. The data themselves, however, come in different formats depending on the source type (HTML, JSON, PNG etc.).
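
To give an idea of what such a metadata description can look like, here is a minimal sketch of a Dublin Core style record written as a Python dictionary. The keys are element names from the Dublin Core Metadata Element Set; all values are hypothetical and only serve as an illustration.

```python
import json

# Hypothetical record: the values are invented for illustration only.
# The keys are element names from the Dublin Core Metadata Element Set.
record = {
    "title": "Python (programming language)",
    "creator": "Wikipedia contributors",
    "date": "2022-05-18",  # hypothetical retrieval date
    "format": "text/html",
    "language": "en",
    "source": "https://en.wikipedia.org/wiki/Python_(programming_language)",
    "type": "Text",
}

# Serialize the record so it can be stored next to the archived source.
metadata_json = json.dumps(record, indent=2, ensure_ascii=False)
print(metadata_json)
```

Keeping such a record alongside every archived file makes it easier to reconstruct later where the source came from and when it was collected.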

## Web scraping

Web scraping is a type of 'data scraping'. In other words, it is a technique that extracts data from another program (in our case, a platform or website).

## Automation of data archiving

We are going to use two methods of automated web scraping: APIs (recommended by platform owners) and automated browser scraping.

### API

An application programming interface (API) allows communication between computers or between computer programs (in contrast to a user interface, which connects a computer to a person). It is not intended to be used directly by a person (the end user) other than a computer programmer who is incorporating it into software. The form of an API is typically standardized and described in the API documentation (specification).
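
To make this more concrete: a request to a web API is often just a URL with parameters. The sketch below builds (but does not send) such a URL for the MediaWiki Action API, the interface behind Wikipedia. The function name `build_extract_url` is our own; the parameters are the ones we assume are needed to ask for the plain text of an article.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title):
    """Build a request URL asking for the plain-text extract of one article."""
    params = {
        "action": "query",    # we want to query page data
        "prop": "extracts",   # ...specifically the text extract
        "explaintext": 1,     # plain text instead of HTML
        "titles": title,
        "format": "json",     # the response should be JSON
    }
    return API_ENDPOINT + "?" + urlencode(params)


url = build_extract_url("Python (programming language)")
print(url)
```

Libraries such as the one used in this notebook hide this step: they build the request, send it and turn the response into Python objects.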

In our example we use the Wikipedia API to scrape some articles from Wikipedia, via the [Wikipedia-API](https://pypi.org/project/Wikipedia-API/) Python module. To install the module, open a terminal and run:

```bash
pip install wikipedia-api
```

Now you are able to use this module in a notebook:

In [None]:
import wikipediaapi

First you need to initialize a `Wikipedia` object (this is the 'thing' making API requests for us), which takes the language as an argument:

In [None]:
wiki = wikipediaapi.Wikipedia(language="en")

Let's get the first article, whose title is `"Python (programming language)"`:

In [None]:
page = wiki.page("Python (programming language)")

Now all the data are stored in the variable `page`. Let's extract the URL of the article to compare it with the original website:

In [None]:
print(page.fullurl)

You can open it (watch out for all the characters in the link). To get the full text of the article, use the attribute `text` (the function `print` is used here for more human-friendly output):

In [None]:
print(page.text)

You can also extract parts of the text, for example the summary of the article (attribute: `summary`):

In [None]:
print(page.summary)

or the list of sections (attribute: `sections`):

In [None]:
page.sections

or a section by title (method `section_by_title`, taking the title as an argument):

In [None]:
page.section_by_title("History")
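
Sections can contain subsections, so a small recursive helper is handy for printing the outline of an article. The sketch below assumes only that each section object has a `title` and a list of `sections`; the `FakeSection` class is a hypothetical stand-in so that the example runs without a network connection.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSection:
    """Hypothetical stand-in mimicking the `title` and `sections` attributes."""
    title: str
    sections: list = field(default_factory=list)


def outline(sections, level=0):
    """Return indented section titles, descending into subsections."""
    lines = []
    for section in sections:
        lines.append("  " * level + section.title)
        lines.extend(outline(section.sections, level + 1))
    return lines


demo = [FakeSection("History"), FakeSection("Design", [FakeSection("Syntax")])]
print("\n".join(outline(demo)))
```

With a real article you would call `outline(page.sections)` instead of passing the demo list.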

Moreover, you can obtain additional information about the article, such as the list of pages in other languages (attribute: `langlinks`):

In [None]:
page.langlinks

For example, to get the article in German:

In [None]:
print(page.langlinks["de"].text)

To get the list of categories, use the attribute `categories`:

In [None]:
page.categories

You can also extract all pages referred to by links in the article, using `links`:

In [None]:
page.links

and the pages linking back to the article, using `backlinks`:

In [None]:
page.backlinks

### Parsing websites

While different APIs have their limitations (e.g. time or query limits), it is possible to work around them. One approach is to download the source of a website and parse its content to extract the data of interest. In the simplest case, a website consists of just plain HTML. However, nowadays many websites are dynamic, which means that their content changes depending on the viewer, the time of day, the time zone, the viewer's native language, and other factors. To obtain the desired data you usually have to send a specific request and preprocess huge dictionaries containing a massive amount of noise.
At a higher level, automated browsers (e.g. Selenium) can be used to simplify interacting with a webpage. They imitate user behaviour such as opening links or filling in data, which can be programmed and repeated over and over to collect the necessary amount of data.
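
For the simplest case (plain HTML), the parsing step can be sketched with Python's built-in `html.parser` module. The example below collects all link targets from a small, made-up HTML snippet; in a real project the HTML would be the downloaded source of a page, and the class name `LinkCollector` is our own.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors without a link target
                self.links.append(href)


# Made-up page source used instead of a real download.
SAMPLE_HTML = """
<html><body>
  <h1>Sources</h1>
  <a href="https://example.org/a">First</a>
  <p>Some text <a href="/b">Second</a></p>
  <a name="anchor-without-href">Third</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)
```

Dynamic websites usually need more than this, because the interesting data are often loaded by separate requests after the page itself.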

#### YouTube scraping - using raw requests

To scrape YouTube we will use our dedicated module written in Python. To install the module, run the following command:

In [1]:
pip install git+https://gitlab.com/digital-history1/youtube-scrapper.git@main


*The package is still under development, so we appreciate any feedback and issue reports: https://gitlab.com/digital-history1/youtube-scrapper/-/issues .*

Currently the package consists of two main objects: `Video` and `Channel`. The former allows you to collect the metadata of a video by its ID (the unique part of the URL that identifies the video). Using the latter, you can scrape all the videos from a given channel. We begin by importing the module:

In [2]:
import youtube_scrapper

Consider the following URL: `https://www.youtube.com/watch?v=wmgyXK84TR0`. The string appearing after `watch?v=` is the video ID. To obtain the metadata of the video, run:

In [None]:
v = youtube_scrapper.Video("wmgyXK84TR0")
v.get_metadata()
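
If you start from full URLs rather than IDs, the ID can be extracted with the standard library. This is a minimal sketch that only handles URLs of the `watch?v=` form shown above; the helper name `video_id_from_url` is our own and not part of the `youtube_scrapper` package.

```python
from urllib.parse import urlsplit, parse_qs


def video_id_from_url(url):
    """Return the value of the `v` query parameter of a YouTube watch URL."""
    query = parse_qs(urlsplit(url).query)
    values = query.get("v")
    if not values:
        raise ValueError(f"no 'v' parameter in {url!r}")
    return values[0]


print(video_id_from_url("https://www.youtube.com/watch?v=wmgyXK84TR0"))
```

Parsing the query string this way also works when the URL carries additional parameters after the ID.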

Now you have access to several self-explanatory attributes, including:

In [None]:
v.title

In [None]:
v.author

In [None]:
v.channel_id

In [None]:
v.description

In [None]:
v.keywords

In [None]:
v.thumbnail  # URL to thumbnail of the video

In [None]:
v.duration  # in seconds

In [None]:
str(v.published_at.date())  # v.published_at is a datetime object, so you need this trick for pretty printing

In [None]:
v.view_count

In [None]:
v.like_count

We obtained some interesting information about the video; however, repeating this for multiple videos would be tedious. That is where the `Channel` object comes in handy. Use the following snippet to download the metadata of all videos from a given channel and save it to the `python_simplified_metadata.json` file (more about file formats later):

In [None]:
c = youtube_scrapper.Channel.get_by_title("PythonSimplified")
c.dump_all_videos_to_json("python_simplified_metadata.json")

Now you can check the output by opening the file `python_simplified_metadata.json`.

#### Twitter scraping - using Selenium

Although Twitter has its own API (https://developer.twitter.com/en/docs/twitter-api), it allows only 100 000 requests per 24 hours and scraping tweets from the last 7 days. With Tweepy, one of the well-built Python libraries for Twitter, you can get up to the 3 200 most recent tweets. Moreover, you have to register a developer account on Twitter to receive the authentication keys your app needs in order to communicate with the API.

Twitter also offers a newer option, an 'academic account' with most of the limits lifted, which you can apply for; for now, however, this solution is not as popular as web scraping with Selenium.

That's when automated browser scraping with Selenium comes in handy.

https://betterprogramming.pub/twitter-scrapers-are-all-broken-what-should-we-do-62a7349bfca6



##### Selenium installation guide

Let's start with installing the Selenium library for Python:

https://selenium-python.readthedocs.io/installation.html

##### Scweet installation guide

We are going to use the Scweet library for Python: https://github.com/Altimis/Scweet

In general, it allows us to bypass the aforementioned limitations. Bear in mind that such automation might be banned by the platform if too many requests are sent. That is why most scripts use time delays between actions, which limit the number of requests made (e.g. 1-second breaks).
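
The idea of delaying consecutive actions can be sketched in a few lines. The generator below pauses between items; the account names are hypothetical and the `print` call stands in for a real scraping call.

```python
import time


def throttled(items, delay_seconds=1.0):
    """Yield items one by one, pausing between consecutive items."""
    for index, item in enumerate(items):
        if index > 0:  # no need to wait before the very first item
            time.sleep(delay_seconds)
        yield item


# Hypothetical account names; the print stands in for a real scraping call.
started = time.monotonic()
for account in throttled(["account_a", "account_b", "account_c"], delay_seconds=0.1):
    print("would scrape", account)
elapsed = time.monotonic() - started
```

In practice the delay would be closer to a second or more, depending on how strict the platform is.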


Note: You must have Chrome installed on your system. Run:

In [None]:
pip install Scweet==1.8

As simple as that ;)

We need to restart our kernel to go on.

Open a new notebook and let's start coding:

In [None]:
from Scweet.scweet import scrape
from Scweet.user import get_user_information, get_users_following, get_users_followers

OK, we have installed Scweet and imported all the necessary functions.

Let's analyze the README.md file.

Scweet output is stored in a CSV (comma-separated values) file and contains the following attributes:

>    'UserScreenName' :  
>    'UserName' : UserName  
>    'Timestamp' : timestamp of the tweet  
>    'Text' : tweet text  
>    'Embedded_text' : embedded text written above the tweet. This can be an image, a video or even another tweet if the tweet in question is a reply  
>    'Emojis' : emojis in the tweet  
>    'Comments' : number of comments  
>    'Likes' : number of likes  
>    'Retweets' : number of retweets  
>    'Image link' : link of the image in the tweet  
>    'Tweet URL' : tweet URL  

Note that Twitter loves changing its front end, so the output data might differ in the future and the code will require updating (you can always check the 'issues' in a library's repository, on either GitLab or GitHub, for problems and solutions).
For now we have found the following bugs:

1. 'text' and 'embedded_text' are mixed up: the text field contains only the user name, while the embedded text contains everything that is needed (plus stats from the metadata).
1. the 'likes' and 'retweets' columns are swapped.
1. unfortunately, tweets with video are duplicated (see: https://github.com/Altimis/Scweet/issues/126)
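
The second and third problem can be worked around after scraping. The sketch below uses the column names listed above on a few hypothetical rows; it swaps 'Likes' and 'Retweets' back and drops repeated rows, assuming that duplicates share the same 'Tweet URL'. The helper `clean_rows` is our own and not part of Scweet.

```python
import csv
import io

# Hypothetical sample rows using the column names listed above.
RAW_CSV = """UserName,Likes,Retweets,Tweet URL
@alice,3,10,https://twitter.com/alice/status/1
@alice,3,10,https://twitter.com/alice/status/1
@bob,0,7,https://twitter.com/bob/status/2
"""


def clean_rows(rows):
    """Swap 'Likes'/'Retweets' and drop rows whose 'Tweet URL' was already seen."""
    seen_urls = set()
    cleaned = []
    for row in rows:
        url = row["Tweet URL"]
        if url in seen_urls:
            continue  # assumption: duplicates share the same tweet URL
        seen_urls.add(url)
        fixed = dict(row)
        fixed["Likes"], fixed["Retweets"] = row["Retweets"], row["Likes"]
        cleaned.append(fixed)
    return cleaned


rows = clean_rows(csv.DictReader(io.StringIO(RAW_CSV)))
for row in rows:
    print(row["UserName"], row["Likes"], row["Retweets"])
```

Before applying such a fix to your own data, compare a few rows with the actual tweets, since the bugs may be resolved in other versions of the library.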

Start with a very basic example:

In [7]:
data = scrape(
    since="2022-05-09",
    until="2022-05-18",
    from_account="donaldtusk",
    interval=1,
    headless=True,
    display_type="Top",
    save_images=False,
    proxy = None,
    save_dir = "outputs",
    resume=False,
    filter_replies=True,
    proximity=False
)


As you can see, this function takes a number of arguments (values entered by the user). In general, try editing the ones that interest you (dates, usernames...). Here is the whole list of optional arguments:
```
  -h, --help            show this help message and exit
  --words WORDS         Words to search for. they should be separated by "//" : Cat//Dog.
  --from_account FROM_ACCOUNT
                        Tweets posted by "from_account" account.
  --to_account TO_ACCOUNT
                        Tweets posted in response to "to_account" account.
  --mention_account MENTION_ACCOUNT
                        Tweets that mention "mention_account" account.         
  --hashtag HASHTAG
                        Tweets containing #hashtag
  --until UNTIL         End date for search query. example : %Y-%m-%d.
  --since SINCE
                        Start date for search query. example : %Y-%m-%d.
  --interval INTERVAL   Interval days between each start date and end date for
                        search queries. example : 5.
  --lang LANG           Tweets language. Example : "en" for english and "fr"
                        for french.
  --headless HEADLESS   Headless webdrives or not. True or False
  --limit LIMIT         Limit tweets to be scraped.
  --display_type DISPLAY_TYPE
                        Display type of Twitter page : Latest or Top tweets
  --resume RESUME       Resume the last scraping. specify the csv file path.
  --proxy PROXY         Proxy server
  --proximity PROXIMITY Proximity
  --geocode GEOCODE     Geographical location coordinates to center the
                        search (), radius. No compatible with proximity
  --minreplies MINREPLIES
                        Min. number of replies to the tweet
  --minlikes MINLIKES   Min. number of likes to the tweet
  --minretweets MINRETWEETS
                        Min. number of retweets to the tweet
```
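
The help text describes `interval` as the number of days between each start and end date of the search queries. One way to picture this is splitting the whole date range into consecutive windows. The sketch below only illustrates the idea under that assumption; it is not Scweet's actual code, and `date_windows` is a hypothetical helper.

```python
from datetime import date, timedelta


def date_windows(since, until, interval_days):
    """Split [since, until) into consecutive windows of `interval_days` days."""
    if interval_days < 1:
        raise ValueError("interval_days must be at least 1")
    windows = []
    start = since
    step = timedelta(days=interval_days)
    while start < until:
        end = min(start + step, until)  # the last window may be shorter
        windows.append((start.isoformat(), end.isoformat()))
        start = end
    return windows


for window in date_windows(date(2022, 5, 9), date(2022, 5, 18), interval_days=3):
    print(window)
```

A smaller interval means more, narrower queries, which usually takes longer but may catch more tweets per period.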

This one scrapes by hashtag (`hashtag` argument) with `proximity` enabled:

In [None]:
data = scrape(
    hashtag="covid19",
    since="2020-04-01",
    until="2020-04-15",
    from_account = None,
    interval=1,
    headless=True,
    display_type="Top",
    save_images=False,
    proxy = None,
    save_dir = 'outputs',
    resume=False,
    filter_replies=True,
    proximity=True
)

If interested, you may try scraping by different words, but bear in mind that this is too general and broad an approach, so you might receive a lot of noise.

In this example, taken from the repository, the code searches for tweets within 200 km of Alicante (Spain) with the words `bitcoin` and `ethereum`.

In [None]:
data = scrape(
    words=["bitcoin", "ethereum"],
    since="2021-10-01",
    until="2021-10-05",
    from_account = None,
    interval=1,
    headless=False,
    display_type="Top",
    save_images=False,
    lang="en",
    resume=False,
    filter_replies=False,
    proximity=False,
    geocode="38.3452,-0.481006,200km"
)

You will find your results in the 'outputs' directory, in a file named after the scraped account.

## Data storage

The next step (after scraping and data preprocessing) is storing the data. Without proper data storage we risk inefficient access to the data or even losing them (e.g. as a result of hardware failure or a ransomware attack). Besides choosing the right format, one should consider performing regular backups of the data.

The easiest way (sufficient for the beginning) is to store data in files. Other, more advanced solutions, such as databases, require dedicated software and specialized knowledge, which is beyond the scope of this course.

### File formats

#### TXT

Storing data in plain text files is the easiest way, although the data cannot be given any structure. This solution can be used for storing long texts (for example a Wikipedia article).
Plain text files do not contain any information about text formatting (in contrast to e.g. DOCX files), so they are space-efficient, platform-independent and easy to process further. Example:

`plain_text_file.txt`:
```
This is the plain text file containing text, some text and even more text. The next lines are just generic content.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Sit amet mauris commodo quis imperdiet. Nisi est sit amet facilisis magna etiam. Nec ultrices dui sapien eget mi proin sed. Erat imperdiet sed euismod nisi porta. Sollicitudin nibh sit amet commodo nulla facilisi nullam vehicula ipsum. Nam aliquam sem et tortor consequat id porta. Et malesuada fames ac turpis egestas integer eget. Tincidunt eget nullam non nisi est sit amet. Pharetra pharetra massa massa ultricies mi quis hendrerit dolor magna. Sed arcu non odio euismod. Pretium quam vulputate dignissim suspendisse in est. Ullamcorper velit sed ullamcorper morbi tincidunt ornare massa. Nunc faucibus a pellentesque sit amet porttitor.

```

To save text to a plain text file, the following code may be used:

In [16]:
data = "This is the plain text file containing text, some text and even more text. The next lines are just generic content. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Sit amet mauris commodo quis imperdiet. Nisi est sit amet facilisis magna etiam. Nec ultrices dui sapien eget mi proin sed. Erat imperdiet sed euismod nisi porta. Sollicitudin nibh sit amet commodo nulla facilisi nullam vehicula ipsum. Nam aliquam sem et tortor consequat id porta. Et malesuada fames ac turpis egestas integer eget. Tincidunt eget nullam non nisi est sit amet. Pharetra pharetra massa massa ultricies mi quis hendrerit dolor magna. Sed arcu non odio euismod. Pretium quam vulputate dignissim suspendisse in est. Ullamcorper velit sed ullamcorper morbi tincidunt ornare massa. Nunc faucibus a pellentesque sit amet porttitor."

# an explicit encoding keeps non-ASCII characters intact across platforms
with open("plain_text_file.txt", "w", encoding="utf-8") as file:
    file.write(data)
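
Reading the text back works the same way, with mode `"r"` instead of `"w"`. The sketch below writes a short sample text to a temporary directory (so that it does not touch your working files) and reads it back; the sample text is made up.

```python
import tempfile
from pathlib import Path

text = "First line.\nSecond line with non-ASCII characters: äöü."

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "plain_text_file.txt"
    # write the text to the file
    with open(path, "w", encoding="utf-8") as file:
        file.write(text)
    # read the whole file back into one string
    with open(path, "r", encoding="utf-8") as file:
        restored = file.read()

print(restored == text)
```

Using the same encoding for writing and reading is what guarantees that the restored text is identical to the original.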

#### CSV

Comma-separated values (CSV) is another text file format, which uses commas to separate values. Each line of the file is a data record, and each record consists of one or more fields separated by commas. The format closely resembles an Excel sheet and is suitable when the stored data can be arranged in a flat table. Example:

`comma_separated_data.csv`:
```csv
name,gender,height,shoe_size,birthday
Alice,f,160,36,May 12
Bob,m,190,47,January 2
Chris,m,173,40,July 5
```

To operate on CSV files, the pandas module can be used (pandas DataFrames work well with CSV files):

In [22]:
import pandas

# preparing example data
raw_data = {
    "name": ["Alice", "Bob", "Chris"],
    "gender": ["f", "m", "m"],
    "height": [160, 190, 173],
    "shoe_size": [36, 47, 40],
    "birthday": ["May 12", "January 2", "July 5"]
}
data = pandas.DataFrame(raw_data)

# actual saving to file (index=False omits the row-number column, matching the example above)
data.to_csv("comma_separated_data2.csv", index=False)
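
For reading, pandas offers `pandas.read_csv`. As a dependency-free alternative, the sketch below reads the same example data with Python's built-in `csv` module. Note that this module returns every value as a string, so numbers have to be converted explicitly.

```python
import csv
import io

# The example table from above, kept in memory instead of a file.
CSV_TEXT = """name,gender,height,shoe_size,birthday
Alice,f,160,36,May 12
Bob,m,190,47,January 2
Chris,m,173,40,July 5
"""

# Each record becomes a dictionary keyed by the column names of the header.
records = list(csv.DictReader(io.StringIO(CSV_TEXT)))
heights = [int(record["height"]) for record in records]
tallest = max(records, key=lambda record: int(record["height"]))
print(tallest["name"], sum(heights) / len(heights))
```

To read from an actual file, pass an opened file object to `csv.DictReader` instead of the `io.StringIO` wrapper.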

#### JSON

JavaScript Object Notation (JSON) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute-value pairs and arrays (or other serializable values). It allows storing more complex data structures and is often used to store the configuration of webpages and applications. All strings (keys or values) must be surrounded by double quotation marks. Objects inside curly brackets `{}` are dictionaries (they contain key-value pairs joined by a colon `:`, with pairs separated by commas `,`) and objects inside square brackets `[]` are lists (series of single values, separated by commas `,`). Example:

`video_metadata.json`:
```json
  {
    "id": "ZR7_D1V3zD0",
    "title": "Network Chuck",
    "author": "NetworkChuck",
    "description": "This channel is dedicated to all things networking with a side of servers. Subscribe to see tutorials on Cisco Switches, ASAs, Routers, Voice, CUCM....etc. Basically, a lot of Cisco stuff.",
    "keywords": [
      "Cisco Systems Inc. (Business Operation)",
      "cisco",
      "cisco asa",
      "cisco switch",
      "nexus",
      "Computer Network (Industry)",
      "cisco nexus",
      "cisco router",
      "routers",
      "router",
      "catalyst",
      "cisco catalyst",
      "call manager",
      "cucm",
      "cucm 9",
      "cisco unified communications",
      "voice",
      "ccna",
      "ccna voice",
      "CCNA (Field Of Study)",
      "network engineer",
      "network admin",
      "Network Administrator (Profession)",
      "voice admin",
      "voice engineer"
    ],
    "thumbnail": "https://i.ytimg.com/vi/ZR7_D1V3zD0/maxresdefault.jpg",
    "published_at": "2014-10-04",
    "channel_id": "UC9x0AN7BWHpCDHSm9NiJFJQ",
    "duration": 158,
    "view_count": 68767,
    "like_count": 3175
  }
```

To save data to a JSON file, the json module can be used:

In [25]:
import json

data =   {
    "id": "ZR7_D1V3zD0",
    "title": "Network Chuck",
    "author": "NetworkChuck",
    "description": "This channel is dedicated to all things networking with a side of servers. Subscribe to see tutorials on Cisco Switches, ASAs, Routers, Voice, CUCM....etc. Basically, a lot of Cisco stuff.",
    "keywords": [
      "Cisco Systems Inc. (Business Operation)",
      "cisco",
      "cisco asa",
      "cisco switch",
      "nexus",
      "Computer Network (Industry)",
      "cisco nexus",
      "cisco router",
      "routers",
      "router",
      "catalyst",
      "cisco catalyst",
      "call manager",
      "cucm",
      "cucm 9",
      "cisco unified communications",
      "voice",
      "ccna",
      "ccna voice",
      "CCNA (Field Of Study)",
      "network engineer",
      "network admin",
      "Network Administrator (Profession)",
      "voice admin",
      "voice engineer"
    ],
    "thumbnail": "https://i.ytimg.com/vi/ZR7_D1V3zD0/maxresdefault.jpg",
    "published_at": "2014-10-04",
    "channel_id": "UC9x0AN7BWHpCDHSm9NiJFJQ",
    "duration": 158,
    "view_count": 68767,
    "like_count": 3175
  }
with open("video_metadata.json", 'w') as file:
    json.dump(data, file)
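
Loading the data back is symmetrical: `json.load` turns the file content into Python dictionaries and lists again. The sketch below uses a shortened version of the metadata above and a temporary directory; `indent=2` is an optional choice that makes the file easier to read for humans.

```python
import json
import tempfile
from pathlib import Path

# Shortened version of the video metadata shown above.
metadata = {"id": "ZR7_D1V3zD0", "keywords": ["cisco", "ccna"], "duration": 158}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "video_metadata.json"
    # save with indentation for readability
    with open(path, "w", encoding="utf-8") as file:
        json.dump(metadata, file, indent=2, ensure_ascii=False)
    # load the file back into Python objects
    with open(path, "r", encoding="utf-8") as file:
        loaded = json.load(file)

print(loaded["keywords"], loaded["duration"] // 60)
```

Numbers and lists keep their types after the round trip, which is one advantage of JSON over CSV.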

## Exercises

1. For each article of:
   * `Angela_Merkel`
   * `Cicero`
   * `Battle_of_Thermopylae`
   * `Battle_of_Waterloo`
   
   download the whole text and save in separate plain text files.
1. For each channel of:
   * `KingsandGenerals`
   * `BeyondScience`
   * `HistoriaCivilis`
   * `HistoryBuffsLondon`
   
   scrape metadata of all videos and save in separate json files.
1. For each Twitter user of:
   * `BorisJohnson`
   * `vonderleyen`
   * `elonmusk`
   * `Pontifex`
   
   scrape tweets from 1.04.2022 to 30.04.2022 and save in separate csv files.
