<a href="https://colab.research.google.com/github/gordeli/NLP_EDHEC2023/blob/main/colab/02_Data_Collection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Initial Setup

- **Run "Setup" below first.**

    - This will load libraries and download some resources that we'll use throughout the tutorial.

    - You will see a message reading "Done with setup!" when this process completes.



In [None]:
#@title Setup (click the "run" button to the left) {display-mode: "form"}

## Setup ##

# imports

# built-in Python libraries
# -------------------------

# For processing the incoming Twitter data
import json
# os-level utils
import os
# For downloading web data
import requests
# For compressing files
import zipfile

# 3rd party libraries
# -------------------

# beautiful soup for html parsing
!pip install beautifulsoup4
import bs4

# tweepy for using the Twitter API
# !pip install tweepy
# import tweepy

# snscrape for scraping Twitter
# !pip3 install snscrape


# allows downloading of files from colab to your computer
from google.colab import files

# get sample reddit data
if not os.path.exists("reddit_2019_05_5K.json"):
    !wget https://raw.githubusercontent.com/gordeli/textanalysis/master/data/reddit_2019_05_5K.json


print()
print("Done with setup!")
print("If you'd like, you can click the (X) button to the left to clear this output.")

---

## Data Collection

- Here we'll cover a few different sources of user-generated content and provide some examples of how to gather data.

### Web Scraping and HTML parsing

- Lots of text data is available directly from web pages.
- Have a look at the following website: [Quotes to Scrape](http://quotes.toscrape.com/page/1/)
- With the Beautiful Soup library, it's very easy to take some html and extract only the text:

In [None]:
html_content = requests.get("http://quotes.toscrape.com/page/1/").content
soup = bs4.BeautifulSoup(html_content,"html.parser")
print(soup.text)

- If you want to extract data in a more targeted way, you can navigate the [html document object model](https://www.w3schools.com/whatis/whatis_htmldom.asp) using [Beautiful Soup functions](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), but we won't dive deeply into this for now.
- **Important: You should not use this kind of code to just go collect data from any website!**
    - Web scraping tools should always check a site's [`robots.txt` file](https://www.robotstxt.org/robotstxt.html), which describes how crawlers, scrapers, indexers, etc., should use the site.
        - For example, see [github's robots.txt](https://github.com/robots.txt)
    - You should be able to find any site's robots.txt (if there is one) at http://\<domain\>/robots.txt for any web \<domain\>.
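
A minimal sketch of how such a check could look, using Python's built-in `urllib.robotparser`. The helper name `is_allowed`, the user agent `my-course-bot`, and the sample `robots.txt` content are all hypothetical and only serve to keep the example runnable offline:

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() expects an iterable of lines rather than one big string
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content (not taken from any real site)
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# A page outside the disallowed path, and one inside it
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-course-bot", "http://example.com/page/1/"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-course-bot", "http://example.com/private/data"))
```

To check a live site instead, `RobotFileParser` also provides `set_url()` and `read()`, which fetch and parse the site's own `robots.txt` before you call `can_fetch()`.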

### Reddit Corpus

- Reddit is a great source of publicly available user-generated content.
- We could scrape Reddit ourselves, but why do that if someone has already (generously) done the heavy lifting?
- Most of Reddit is available for researchers to download.
- [Updated in 2023](https://www.reddit.com/r/pushshift/)
- [Alternative hosted by Cornell up to Oct 2018](https://convokit.cornell.edu/documentation/subreddit.html)
- [Alternative on Github up to Mar 2023](https://github.com/ArthurHeitmann/arctic_shift)
- Let's explore a small subset of the data from May 2019:

In [None]:
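# read the data that was downloaded during setup
# this is the same format as the full corpus, just truncated to the first 5000 lines
# (each line of the file holds one post as a JSON object)
with open("reddit_2019_05_5K.json", 'r', encoding="utf-8") as reddit_file:
    sample_reddit_posts_raw = reddit_file.readlines()
print("Loaded",len(sample_reddit_posts_raw),"reddit posts.")
# parse each line into a Python dict
reddit_json = [json.loads(post) for post in sample_reddit_posts_raw]
# pretty-print one post to see which fields are available
print(json.dumps(reddit_json[50], sort_keys=True, indent=4))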

- Since the posts are in json format, we used the Python json library to process them.
    - This library returns Python dict objects, so we can access them just like we would any other dictionary.
- Let's view some of the text content from these posts:

In [None]:
for post in reddit_json[:100]:
    if post['selftext'].strip() and post['selftext'] not in ["[removed]","[deleted]"]:
        print("Subreddit:",post['subreddit'],"\nTitle:",post['title'],"\nContent:",
              post['selftext'],"\n")

- Note that we filtered out posts with no text content.
    - Many posts have a non-null "media" field, which could contain images, links to youtube, videos, etc.
        - These could be worth exploring more, using computer vision to process images/videos and NLP to process linked websites.
- That covers the basics of getting Reddit data.
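
The filtering idea above can be wrapped into small reusable helpers. The following is only a sketch: the function names `parse_json_lines`, `has_text`, and `has_media` are hypothetical, and the three sample posts are made up so the example runs without the downloaded file. It assumes the same one-JSON-object-per-line format and the `selftext` and `media` fields seen above:

```python
import json

# Placeholder values Reddit uses when a post body is no longer available
REMOVED_MARKERS = {"[removed]", "[deleted]"}


def parse_json_lines(lines):
    """Parse an iterable of JSON strings (one post per line), skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def has_text(post: dict) -> bool:
    """True if the post has a non-empty body that was not removed or deleted."""
    text = (post.get("selftext") or "").strip()
    return bool(text) and text not in REMOVED_MARKERS


def has_media(post: dict) -> bool:
    """True if the post has a non-null "media" field."""
    return post.get("media") is not None


# Made-up sample posts in the same line-per-post layout
sample_lines = [
    '{"subreddit": "askscience", "title": "Why is the sky blue?", "selftext": "I have always wondered.", "media": null}',
    '{"subreddit": "pics", "title": "My cat", "selftext": "", "media": {"type": "example"}}',
    '{"subreddit": "news", "title": "Old post", "selftext": "[removed]", "media": null}',
]

posts = parse_json_lines(sample_lines)
text_posts = [post for post in posts if has_text(post)]
media_posts = [post for post in posts if has_media(post)]

print(len(posts), "posts parsed;", len(text_posts), "with usable text;",
      len(media_posts), "with media")
```

With the real data, `parse_json_lines` could be given the open file object directly, since iterating over a file yields one line at a time.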

- [-> Next: Corpus Level Processing](https://colab.research.google.com/github/gordeli/NLP_EDHEC2024/blob/main/colab/03_Corpus_Level_Processing.ipynb)