# 🔖 TidyTuesday for Python
## Metadata of TidyTuesday Projects in 2023-2024
This dataset compiles the TidyTuesday datasets from 2023-2024, aiming to make resources from the R community more accessible to Python users. TidyTuesday, a project rooted in the R community, provides weekly datasets for data visualization and wrangling. The datasets are well formatted (in .csv, .json, and other common file types), cleaned, and pre-wrangled by experts, but Python learners rarely encounter them as practice material. This initiative therefore attempts to bridge the resource gap between the R and Python communities, fostering shared learning and open-source collaboration. The collection includes metadata such as date posted, project name, source, description, data dictionaries, data download URLs, and project post repo URLs. The language used is mainly English.

***Note: As a pilot project, the datasets from Tidy Tuesday will be added retrospectively. The baseline is to add all information starting from Jan 2023. Depending on time and availability, data dating back further will also be considered.***

## **Web Scraping**

The first step is to gather all the post information and build a dataframe of descriptive variables for each weekly post. I used the `requests` and `BeautifulSoup` libraries to scrape the repositories. This section includes several key steps:

- Scrape all the post dates and urls in selected years
- Test scrape one specific post to gather intended variable values
- Scrape all posts in 2023 and construct 2023 dataframe
- Scrape all posts in 2023-2024 and construct final 2023-2024 dataframe
- Scrape all posts in 2023-2024 and construct final 2023-2024 **JSON** file

In [6]:
# Load libraries
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
import numpy as np
from datetime import datetime
from tqdm.notebook import tqdm_notebook
import time
from typing import List, Tuple, Dict
from logging import raiseExceptions
from sklearn.model_selection import train_test_split

### **Scrape post date and url**

To scrape the weekly post content iteratively, I first built a function called `get_all_posts()`, which takes a list of years (as strings) and returns a dictionary of {post_date: post_url} pairs for those years. This step lays the foundation for the upcoming post-level scraping, where we loop through all the post URLs, make requests, and retrieve post contents.

The scraped website at this stage is: https://github.com/rfordatascience/tidytuesday/tree/a9e277dd77331e9091e151bb5adb584742064b3e/data  

In [7]:
# Define the base URL or pattern for TidyTuesday posts
base_url = 'https://github.com/rfordatascience/tidytuesday/tree/a9e277dd77331e9091e151bb5adb584742064b3e/data'
root_url = "https://github.com/rfordatascience/tidytuesday/blob/master/"

# Define a filter function to remove non-post folders or files
def is_valid_date(date_string):
  try:
      datetime.strptime(date_string, '%Y-%m-%d')
      return True
  except ValueError:
      return False
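
As a quick check of the filter, the sketch below applies the same `strptime`-based test to a handful of folder names. The names in `sample_names` are made up for illustration; only date-shaped names should survive.

```python
from datetime import datetime

def is_valid_date(date_string: str) -> bool:
    """Return True when the string parses as a YYYY-MM-DD date."""
    try:
        datetime.strptime(date_string, '%Y-%m-%d')
        return True
    except ValueError:
        return False

# Hypothetical folder listing: two weekly posts plus non-post entries
sample_names = ['2023-01-03', '2023-01-10', 'readme.md', '2023-13-01', 'images']

# Keep only the names that look like weekly post folders
post_names = [name for name in sample_names if is_valid_date(name)]
print(post_names)
```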

In [8]:
# Define a function to retrieve posts from selected years
def get_all_posts(years: List[str]) -> Dict[str, str]:

  # Define a dictionary of posts
  all_posts = {}

  # Loop through each year
  for year in years:
    # Define year_url
    year_url = base_url + '/' + year
    # Send an HTTP request to the post summary page URL
    response_summary = requests.get(year_url)
    # Check if the request was successful
    if response_summary.status_code == 200:
      soup_summary = BeautifulSoup(response_summary.content, 'html.parser')
    else:
      raise Exception("Sorry, response status failed.")
    # Retrieve date and url
    all_folders = json.loads(soup_summary.get_text())['payload']['tree']['items']
    for post_folder in all_folders:
      post_date = post_folder['name']
      post_url = root_url + post_folder['path']
      # Exclude folders that are non-post
      if is_valid_date(post_date):
        all_posts[post_date] = post_url

  return all_posts
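
To see what `get_all_posts()` does with a response without touching the network, here is a minimal offline sketch. The `sample_payload` dictionary is a hand-written stand-in that only mirrors the `payload -> tree -> items` nesting the function reads; the real response carries many more fields, and its layout may change. The helper name `posts_from_payload` is hypothetical.

```python
from datetime import datetime
from typing import Dict

root_url = "https://github.com/rfordatascience/tidytuesday/blob/master/"

# Hand-written stand-in for the parsed JSON (assumption: only these keys matter)
sample_payload = {
    "payload": {"tree": {"items": [
        {"name": "2023-01-03", "path": "data/2023/2023-01-03"},
        {"name": "2023-01-10", "path": "data/2023/2023-01-10"},
        {"name": "readme.md", "path": "data/2023/readme.md"},
    ]}}
}

def is_valid_date(date_string: str) -> bool:
    try:
        datetime.strptime(date_string, '%Y-%m-%d')
        return True
    except ValueError:
        return False

def posts_from_payload(payload: dict) -> Dict[str, str]:
    """Map each date-named folder to its post URL, skipping everything else."""
    posts = {}
    for item in payload["payload"]["tree"]["items"]:
        if is_valid_date(item["name"]):
            posts[item["name"]] = root_url + item["path"]
    return posts

posts = posts_from_payload(sample_payload)
print(posts)
```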

In [9]:
# sample call for 2023 data
all_posts_2023 = get_all_posts(['2023'])
all_posts_2023

{'2023-01-03': 'https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-01-03',
 '2023-01-10': 'https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-01-10',
 '2023-01-17': 'https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-01-17',
 '2023-01-24': 'https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-01-24',
 '2023-01-31': 'https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-01-31',
 '2023-02-07': 'https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-02-07',
 '2023-02-14': 'https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-02-14',
 '2023-02-21': 'https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-02-21',
 '2023-02-28': 'https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-02-28',
 '2023-03-07': 'https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-03-07',
 '2023-03-14': 'http

### **Test: Scrape one post**

Having successfully grabbed every `post_date` and `post_url` in the previous step, the next challenge is to go inside each `post_url` and scrape the individual post repositories. I did a test run on one single post (https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-01-10) to get a sense of the HTML structure of TidyTuesday's posts.


Finding the right HTML elements and nesting levels was truly a headache. After a lot of source code inspection and request testing, I finally managed to scrape all the information I intended for the following variables:

- `date_posted` (str) : Date when the weekly project was posted (YYYY-MM-DD format).
- `project_name` (str) : Name of the TidyTuesday post.
- `project_source` (List[str]) : A list of URL(s) of the sources.
- `description` (str) : Excerpt of the project and dataset descriptions.
- `data_source_url` (str) : URL to the TidyTuesday post.
- `data_dictionary` (List[Dict[str, List[str]]]) : A list of dictionaries, one per dataset, each holding parallel lists of variable names, types, and descriptions.
- `data` (Dict[str, str]) : A dictionary of dataset names and links to view and download.
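
The variables above can be summarized as a typed record. The sketch below is only an illustration of the shape: the `PostRecord` class name is hypothetical, and `example_record` is an abbreviated, hand-assembled excerpt of the 2023-11-28 post, not the full scraped row.

```python
import json
from typing import Dict, List, TypedDict

class PostRecord(TypedDict):
    date_posted: str
    project_name: str
    project_source: List[str]
    description: str
    data_source_url: str
    data_dictionary: List[Dict[str, List[str]]]
    data: Dict[str, str]

# Abbreviated example: one source, one sentence of description, one variable
example_record: PostRecord = {
    "date_posted": "2023-11-28",
    "project_name": "Doctor Who Episodes",
    "project_source": ["https://github.com/KittJonathan/datardis"],
    "description": "Doctor Who is an extremely long-running British television program.",
    "data_source_url": "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-11-28",
    "data_dictionary": [
        {"variable": ["story_number"], "class": ["character"], "description": ["Story number"]}
    ],
    "data": {
        "drwho_episodes.csv": "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-11-28/drwho_episodes.csv"
    },
}

# Every value is a plain str, list, or dict, so the record serializes directly
print(json.dumps(example_record, indent=2))
```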


In [None]:
post_url = 'https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-01-10'

In [None]:
# Send an HTTP request to the post URL
response = requests.get(post_url)

# Check if the request was successful
if response.status_code == 200:
  # Parse the HTML content of the page
  soup = BeautifulSoup(response.content, 'html.parser')

In [None]:
# date_posted - use main page scraping
list(all_posts_2023.keys())[1]

# project_name
soup.find_all('h1')[0].text

# project_source
all_p = soup.find('article').find_all('p')
if all_p[0].get_text() == 'Please add alt text (alternative text) to all of your posted graphics for #TidyTuesday.':
  all_p_clean = all_p[12:]
else:
  all_p_clean = all_p

project_source = []
for html_url in all_p_clean:
  all_a = html_url.find_all('a', href=True)
  for a in all_a:
    url_string = a['href']
    unescaped_url = json.loads(f'"{url_string}"').strip('"')
    if "readme.md" in unescaped_url:
      unescaped_url = 'https://github.com' + unescaped_url
    project_source.append(unescaped_url)
# De-duplicate, but keep a list to match the documented List[str] type
project_source = list(set(project_source))

# description
description = []
for p in all_p_clean:
  individual_description = p.get_text(strip=True, separator='').replace("\\n", " ").replace("\\", "")
  description.append(individual_description)
' '.join(map(str, description))

# data_source_url
post_url

# data_dictionary
tables = soup.find('article').find_all("table")
data_dictionary = []

for table in tables:
  table_vars = table.find_all("tr")
  table_var_len = len(table_vars)
  data_structure = {"variable": [],
                    "class": [],
                    "description": []}
  for i in range(1, table_var_len):
    var = table_vars[i].find_all("td")
    data_structure["variable"].append(var[0].text)
    data_structure["class"].append(var[1].text)
    data_structure["description"].append(var[2].text)
  data_dictionary.append(data_structure)

# data
files = json.loads(soup.get_text())['payload']['tree']['items']
data_csv = {}
root_url = "https://github.com/rfordatascience/tidytuesday/blob/master/"
for file in files:
    if ".csv" in file['name']:
        new_url = root_url + file['path']
        data_csv[file['name']] = new_url
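
The `"readme.md"` check above prefixes only some relative links with `https://github.com`. A more general option is sketched below with `urllib.parse.urljoin`, which resolves any root-relative href and leaves absolute URLs alone. This is an alternative, not the approach used in this notebook; the helper name `clean_href` and the sample hrefs are hypothetical, and the escaped-quote form of the second sample is an assumption about how hrefs appear in the embedded JSON.

```python
import json
from urllib.parse import urljoin

GITHUB_BASE = "https://github.com"

def clean_href(href: str) -> str:
    """Unescape an href and turn root-relative links into absolute ones."""
    # Same unescaping step as in the cell above
    unescaped = json.loads(f'"{href}"').strip('"')
    # urljoin keeps absolute URLs as they are and resolves '/...' against the base
    return urljoin(GITHUB_BASE, unescaped)

# Illustrative inputs: a root-relative link and an absolute link in escaped quotes
print(clean_href('/rfordatascience/tidytuesday'))
print(clean_href('\\"https://www.ala.org.au\\"'))
```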

### **Scrape all posts in 2023 (CSV)**

With all the preparations above, we can now define a function `get_all_data()`, which combines `get_all_posts()` with the post-level retrievals. The function first takes a list of years and retrieves all post dates and URLs in those years, then goes inside each post URL to gather the variable information. The final output is a dataframe with the 7 variables listed above. A test run on the 2023 data revealed some inconsistencies in HTML structure between posts, which I handled by adding condition checks along the way to keep the final formatting consistent.

***Note: The first post in 2023 (`2023-01-03`) is a bring-your-own-data project. There is no data or source for that specific week.***
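
The condition checks mentioned above boil down to skipping a fixed number of boilerplate paragraphs depending on how the first paragraph reads. A minimal sketch of that idea as a lookup table follows. It works on plain paragraph strings rather than BeautifulSoup tags, and the helper name `drop_boilerplate` and the filler paragraphs are hypothetical; the two marker strings and the offsets 12 and 8 are the ones used in the function below.

```python
from typing import List

ALT_TEXT_NOTE = 'Please add alt text (alternative text) to all of your posted graphics for #TidyTuesday.'
ALT_TEXT_NOTE_WRAPPED = 'Please add alt text (alternative text) to all of your posted graphics\\nfor #TidyTuesday.'

# Number of leading boilerplate paragraphs to skip for each known page layout
SKIP_BY_FIRST_PARAGRAPH = {ALT_TEXT_NOTE: 12, ALT_TEXT_NOTE_WRAPPED: 8}

def drop_boilerplate(paragraphs: List[str]) -> List[str]:
    """Return the paragraphs that follow the repository-wide boilerplate."""
    if not paragraphs:
        return paragraphs
    # Unknown layouts fall back to skipping nothing
    skip = SKIP_BY_FIRST_PARAGRAPH.get(paragraphs[0], 0)
    return paragraphs[skip:]

# Made-up page: the alt-text note, 11 filler paragraphs, then the real content
paragraphs = [ALT_TEXT_NOTE] + ["(boilerplate)"] * 11 + ["Project description starts here."]
print(drop_boilerplate(paragraphs))
```

Keeping the offsets in one dictionary means a newly discovered layout needs only one extra entry.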

In [10]:
# Define a function to retrieve all posts information from selected years
def get_all_data(years: List[str]) -> pd.DataFrame:
  # access all posts date and urls in selected years
  all_posts = get_all_posts(years)
  # create a dataframe to store the output
  final_dataframe = pd.DataFrame(columns = ["date_posted", "project_name", "project_source", "description",
                                            "data_source_url", "data_dictionary", "data"])

  # Iteratively scrape through each post_url
  for post_date, post_url in tqdm_notebook(all_posts.items()):
    # request contents
    response = requests.get(post_url)
    if response.status_code == 200:
      soup = BeautifulSoup(response.content, 'html.parser')
    else:
      raise Exception("Sorry, response status failed.")

    # date_posted
    date_posted = post_date

    # project_name
    project_name = soup.find_all('h1')[0].text

    # project_source
    all_p = soup.find('article').find_all('p')
    ## enable scraping even when the repo organization has slightly changed
    if all_p[0].get_text() == 'Please add alt text (alternative text) to all of your posted graphics for #TidyTuesday.':
      all_p_clean = all_p[12:]
    elif all_p[0].get_text() == 'Please add alt text (alternative text) to all of your posted graphics\\nfor #TidyTuesday.':
      all_p_clean = all_p[8:]
    else:
      all_p_clean = all_p

    project_source = []
    for html_url in all_p_clean:
      all_a = html_url.find_all('a', href=True)
      for a in all_a:
        url_string = a['href']
        unescaped_url = json.loads(f'"{url_string}"').strip('"')
        if "readme.md" in unescaped_url:
          unescaped_url = 'https://github.com' + unescaped_url
        project_source.append(unescaped_url)
    project_source = list(set(project_source))

    # description
    description_lines = []
    for p in all_p_clean:
      individual_description = p.get_text(strip=True, separator='').replace("\\n", " ").replace("\\", "")
      description_lines.append(individual_description)
    description = ' '.join(map(str, description_lines))

    # data_source_url
    data_source_url = post_url

    # data_dictionary
    tables = soup.find('article').find_all("table")
    data_dictionary = []
    for table in tables:
      table_vars = table.find_all("tr")
      table_var_len = len(table_vars)
      data_structure = {"variable": [],
                        "class": [],
                        "description": []}
      for i in range(1, table_var_len):
        var = table_vars[i].find_all("td")
        data_structure["variable"].append(var[0].text)
        data_structure["class"].append(var[1].text)
        data_structure["description"].append(var[2].text)
      data_dictionary.append(data_structure)

    # data
    files = json.loads(soup.get_text())['payload']['tree']['items']
    data_csv = {}
    root_url = "https://github.com/rfordatascience/tidytuesday/blob/master/"
    for file in files:
        if ".csv" in file['name']:
            new_url = root_url + file['path']
            data_csv[file['name']] = new_url
    data = data_csv

    # Add to dataframe
    final_dataframe.loc[len(final_dataframe.index)] = {
        "date_posted": date_posted,
        "project_name": project_name,
        "project_source": project_source,
        "description": description,
        "data_source_url": data_source_url,
        "data_dictionary": data_dictionary,
        "data": data
        }
  return final_dataframe

In [None]:
# Call the function above
tidytuesday_2023 = get_all_data(['2023'])
# Fix week 1 post structure: no dataset for this week
tidytuesday_2023.at[0, 'data_dictionary'] = []
# Check the 2023 final dataset
tidytuesday_2023

  0%|          | 0/52 [00:00<?, ?it/s]

| | date_posted | project_name | project_source | description | data_source_url | data_dictionary | data |
|---|---|---|---|---|---|---|---|
| 0 | 2023-01-03 | Week 1 | [] | This was really just a bring your own dataset ... | https://github.com/rfordatascience/tidytuesday... | [] | {} |
| 1 | 2023-01-10 | Project FeederWatch | [https://feederwatch.org/explore/raw-dataset-r... | The data this week comes from theProject Feede... | https://github.com/rfordatascience/tidytuesday... | [{'variable': ['loc_id', 'latitude', 'longitud... | {'PFW_2021_public.csv': 'https://github.com/rf... |
| 2 | 2023-01-17 | Art History | [https://github.com/saralemus7/arthistory, htt... | The data this week comes from thearthistory da... | https://github.com/rfordatascience/tidytuesday... | [{'variable': ['artist_name', 'edition_number'... | {'artists.csv': 'https://github.com/rfordatasc... |
| 3 | 2023-01-24 | Alone | [https://www.history.com/shows/alone, https://... | The data this week comes from theAlone data pa... | https://github.com/rfordatascience/tidytuesday... | [{'variable': ['season', 'name', 'age', 'gende... | {'episodes.csv': 'https://github.com/rfordatas... |
| 4 | 2023-01-31 | Pet Cats UK | [https://www.datarepository.movebank.org/handl... | The data this week comes from theMovebank for ... | https://github.com/rfordatascience/tidytuesday... | [{'variable': ['tag_id', 'event_id', 'visible'... | {'cats_uk.csv': 'https://github.com/rfordatasc... |
| 5 | 2023-02-07 | Big Tech Stock Prices | [https://www.kaggle.com/datasets/evangower/big... | The data this week comes from Yahoo Finance vi... | https://github.com/rfordatascience/tidytuesday... | [{'variable': ['stock_symbol', 'date', 'open',... | {'big_tech_companies.csv': 'https://github.com... |
| 6 | 2023-02-14 | Hollywood Age Gaps | [https://hollywoodagegap.com/, https://www.dat... | The data this week comes fromHollywood Age Gap... | https://github.com/rfordatascience/tidytuesday... | [{'variable': ['movie_name', 'release_year', '... | {'age_gaps.csv': 'https://github.com/rfordatas... |
| 7 | 2023-02-21 | Bob Ross Paintings | [https://github.com/jwilber/Bob_Ross_Paintings... | The data this week comes from Jared Wilber's d... | https://github.com/rfordatascience/tidytuesday... | [{'variable': ['painting_index', 'img_src', 'p... | {'bob_ross.csv': 'https://github.com/rfordatas... |
| 8 | 2023-02-28 | African Language Sentiment | [https://arxiv.org/pdf/2302.08956.pdf, https:/... | The data this week comes fromAfriSenti: Sentim... | https://github.com/rfordatascience/tidytuesday... | [{'variable': ['language_iso_code', 'tweet', '... | {'afrisenti.csv': 'https://github.com/rfordata... |
| 9 | 2023-03-07 | Numbats in Australia | [https://www.ala.org.au, /rfordatascience/tidy... | The data this week comes from theAtlas of Livi... | https://github.com/rfordatascience/tidytuesday... | [{'variable': ['decimalLatitude', 'decimalLong... | {'numbats.csv': 'https://github.com/rfordatasc... |


#### **Scrape all posts in 2023-2024**

Finally, with all functions and scraping pipelines defined, we can construct the final dataframe `tidytuesday_2023_2024`, spanning 2023 to 2024. The dataframe has shape (59, 7), with each row representing one post in the specified years. Even though the number of rows looks small, the columns together cover most of the information in each weekly post. This layout was chosen to give Python users a cleaner, more organized data structure and easier retrieval as JSON objects. For example, the `data_dictionary` variable can be serialized into a JSON structure using `json.dumps()` in Python (though this is not necessarily convenient for R users). The actual datasets can be quickly viewed through the URLs stored in the `data` variable, or **loaded/downloaded** through the **raw dataset** URLs stored in the `data_load` variable.

I experimented with saving the dataframe to both `csv` and `json` below. See the examples at the bottom for a check of the data structure!
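
One practical difference between the two formats is what happens to nested values. The sketch below round-trips a tiny record through JSON and through CSV. It uses the standard `csv` module as a stand-in for `DataFrame.to_csv`, on the assumption that both write the string form of a nested object; the `record` is an abbreviated excerpt used only for illustration.

```python
import ast
import csv
import io
import json

record = {
    "date_posted": "2023-11-28",
    "data": {
        "drwho_episodes.csv": "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-11-28/drwho_episodes.csv"
    },
}

# JSON round trip: the nested dict comes back as a dict
from_json = json.loads(json.dumps(record))

# CSV round trip: the nested dict is written as text and read back as text
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

# The CSV cell has to be parsed explicitly to recover the dict
restored = ast.literal_eval(from_csv["data"])

print(type(from_json["data"]).__name__, type(from_csv["data"]).__name__)
```

This is the main reason the JSON export is the more convenient of the two for the nested columns.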

In [11]:
tidytuesday_2023_2024 = get_all_data(['2023', '2024'])

  0%|          | 0/59 [00:00<?, ?it/s]

In [12]:
# Fix week 1 post structure: no dataset for this week
tidytuesday_2023_2024.at[0, 'data_dictionary'] = []
# Check the final dataset
tidytuesday_2023_2024

| | date_posted | project_name | project_source | description | data_source_url | data_dictionary | data |
|---|---|---|---|---|---|---|---|
| 0 | 2023-01-03 | Week 1 | [] | This was really just a bring your own dataset ... | https://github.com/rfordatascience/tidytuesday... | [] | {} |
| 1 | 2023-01-10 | Project FeederWatch | [https://www.frontiersin.org/articles/10.3389/... | The data this week comes from theProject Feede... | https://github.com/rfordatascience/tidytuesday... | [{'variable': ['loc_id', 'latitude', 'longitud... | {'PFW_2021_public.csv': 'https://github.com/rf... |
| 2 | 2023-01-17 | Art History | [https://research.repository.duke.edu/concern/... | The data this week comes from thearthistory da... | https://github.com/rfordatascience/tidytuesday... | [{'variable': ['artist_name', 'edition_number'... | {'artists.csv': 'https://github.com/rfordatasc... |
| 3 | 2023-01-24 | Alone | [https://gradientdescending.com/alone-r-packag... | The data this week comes from theAlone data pa... | https://github.com/rfordatascience/tidytuesday... | [{'variable': ['season', 'name', 'age', 'gende... | {'episodes.csv': 'https://github.com/rfordatas... |
| 4 | 2023-01-31 | Pet Cats UK | [http://dx.doi.org/10.1111/acv.12563, https://... | The data this week comes from theMovebank for ... | https://github.com/rfordatascience/tidytuesday... | [{'variable': ['tag_id', 'event_id', 'visible'... | {'cats_uk.csv': 'https://github.com/rfordatasc... |
| 5 | 2023-02-07 | Big Tech Stock Prices | [https://github.com/rfordatascience/tidytuesda... | The data this week comes from Yahoo Finance vi... | https://github.com/rfordatascience/tidytuesday... | [{'variable': ['stock_symbol', 'date', 'open',... | {'big_tech_companies.csv': 'https://github.com... |
| 6 | 2023-02-14 | Hollywood Age Gaps | [https://www.data-is-plural.com/archive/2018-0... | The data this week comes fromHollywood Age Gap... | https://github.com/rfordatascience/tidytuesday... | [{'variable': ['movie_name', 'release_year', '... | {'age_gaps.csv': 'https://github.com/rfordatas... |
| 7 | 2023-02-21 | Bob Ross Paintings | [https://www.twoinchbrush.com/all-paintings, h... | The data this week comes from Jared Wilber's d... | https://github.com/rfordatascience/tidytuesday... | [{'variable': ['painting_index', 'img_src', 'p... | {'bob_ross.csv': 'https://github.com/rfordatas... |
| 8 | 2023-02-28 | African Language Sentiment | [https://r4ds.io/join, https://arxiv.org/pdf/2... | The data this week comes fromAfriSenti: Sentim... | https://github.com/rfordatascience/tidytuesday... | [{'variable': ['language_iso_code', 'tweet', '... | {'afrisenti.csv': 'https://github.com/rfordata... |
| 9 | 2023-03-07 | Numbats in Australia | [/rfordatascience/tidytuesday/blob/master/data... | The data this week comes from theAtlas of Livi... | https://github.com/rfordatascience/tidytuesday... | [{'variable': ['decimalLatitude', 'decimalLong... | {'numbats.csv': 'https://github.com/rfordatasc... |


In [13]:
# Functionality check
## overview
tidytuesday_2023_2024.iloc[47]

date_posted                                               2023-11-28
project_name                                     Doctor Who Episodes
project_source     [https://en.wikipedia.org/wiki/List_of_Doctor_...
description        Doctor Who is an extremely long-running Britis...
data_source_url    https://github.com/rfordatascience/tidytuesday...
data_dictionary    [{'variable': ['era', 'season_number', 'serial...
data               {'drwho_directors.csv': 'https://github.com/rf...
Name: 47, dtype: object

In [14]:
## check project source completeness
tidytuesday_2023_2024.iloc[47]['project_source']

['https://en.wikipedia.org/wiki/List_of_Doctor_Who_episodes_(2005%E2%80%93present)',
 'https://github.com/KittJonathan/datardis/tree/main/misc',
 'https://cran.r-project.org/package=datardis',
 'https://github.com/KittJonathan/datardis']

In [15]:
## check description joining
tidytuesday_2023_2024.iloc[47]['description']

'Doctor Who is an extremely long-running British television program. The show was revived in 2005, and has proven very popular since then. To celebrate this year\'s 60th anniversary of Doctor Who, we have three datasets. The data this week comes from Wikipedia\'s [List of Doctor Who episodes](https://en.wikipedia.org/wiki/List_of_Doctor_Who_episodes_(2005%E2%80%93present)via the{datardis} packagebyJonathan Kitt. Thank you to Jonathan for compiling and sharing this data! As of 2023-11-24, the data only includes episodes from the "revived" era. For an added challenge, consider submitting a pull request to the {datardis} package to update thedata-extraction scriptsto also fetch the "classic" era data! Clean data from the{datardis} package.'

In [11]:
## check data_dictionary
dict_check = tidytuesday_2023_2024.iloc[47]['data_dictionary']
### Convert to JSON string with indentation for pretty printing
dict_json = json.dumps(dict_check, indent=4)
### Print the JSON data
print(dict_json)

[
    {
        "variable": [
            "era",
            "season_number",
            "serial_title",
            "story_number",
            "episode_number",
            "episode_title",
            "type",
            "first_aired",
            "production_code",
            "uk_viewers",
            "rating",
            "duration"
        ],
        "class": [
            "character",
            "double",
            "character",
            "character",
            "double",
            "character",
            "character",
            "double",
            "character",
            "double",
            "double",
            "double"
        ],
        "description": [
            "Whether the episode is in the \\\"classic\\\" or \\\"revived\\\" era. All data in this dataset is within the \\\"revived\\\" era.",
            "The season number within the era. Note that some episodes are outside of a season.",
            "Serial title if available",
            "Story number",

In [16]:
## check data
tidytuesday_2023_2024.iloc[47]['data']

{'drwho_directors.csv': 'https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-11-28/drwho_directors.csv',
 'drwho_episodes.csv': 'https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-11-28/drwho_episodes.csv',
 'drwho_writers.csv': 'https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-11-28/drwho_writers.csv'}
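
Each entry of `data_dictionary` (printed a few cells above) is column-oriented: three parallel lists for variable, class, and description. If one row per variable is easier to work with, the lists can be zipped back together. The sketch below uses a three-variable excerpt of the Doctor Who dictionary; the name `rows` is hypothetical.

```python
# Excerpt of one data_dictionary entry, in the column-oriented layout
data_structure = {
    "variable": ["season_number", "serial_title", "story_number"],
    "class": ["double", "character", "character"],
    "description": [
        "The season number within the era. Note that some episodes are outside of a season.",
        "Serial title if available",
        "Story number",
    ],
}

# Zip the parallel lists into one dict per variable
rows = [
    {"variable": variable, "class": var_class, "description": description}
    for variable, var_class, description in zip(
        data_structure["variable"],
        data_structure["class"],
        data_structure["description"],
    )
]

for row in rows:
    print(row["variable"], "|", row["class"], "|", row["description"])
```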

In [19]:
# Add a new column for easily loading the actual datasets
data_load = []
for i in range(tidytuesday_2023_2024.shape[0]):
  actual_data_dict = {}
  actual_data = tidytuesday_2023_2024.iloc[i]['data']
  for dataset_name, dataset_url in actual_data.items():
    # Turn the GitHub "blob" page URL into a raw-content URL
    load_url = dataset_url.replace("github.com", "raw.githubusercontent.com").replace("/blob", "")
    actual_data_dict[dataset_name] = load_url
  data_load.append(actual_data_dict)
# Assign the whole column at once: item-wise assignment fails while 'data_load' does not exist yet
tidytuesday_2023_2024['data_load'] = data_load

In [20]:
tidytuesday_2023_2024['data_load'].head()

0                                                   {}
1    {'PFW_2021_public.csv': 'https://raw.githubuse...
2    {'artists.csv': 'https://raw.githubusercontent...
3    {'episodes.csv': 'https://raw.githubuserconten...
4    {'cats_uk.csv': 'https://raw.githubusercontent...
Name: data_load, dtype: object
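
The URL rewrite in the cell above can also be written as a small helper that touches only the host and the first `/blob/` segment, so a file or folder whose name happens to contain `/blob` is left alone. This is a sketch under that assumption; `to_raw_url` is a hypothetical name, not part of the notebook.

```python
def to_raw_url(blob_url: str) -> str:
    """Convert a GitHub 'blob' page URL into its raw-content equivalent."""
    # Swap the host once, at the start of the URL
    raw = blob_url.replace("https://github.com/", "https://raw.githubusercontent.com/", 1)
    # Drop only the first '/blob/' path segment
    return raw.replace("/blob/", "/", 1)

blob = "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-11-28/drwho_episodes.csv"
print(to_raw_url(blob))
```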

In [21]:
## You can directly read in the actual data (csv) through the links
drwho_episodes_url = tidytuesday_2023_2024.iloc[47]['data_load']['drwho_episodes.csv']
drwho_episodes = pd.read_csv(drwho_episodes_url)
drwho_episodes.head()

| | era | season_number | serial_title | story_number | episode_number | episode_title | type | first_aired | production_code | uk_viewers | rating | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | revived | 1.0 | | 157 | 1.0 | Rose | episode | 2005-03-26 | 1.1 | 10.81 | 76 | 45 |
| 1 | revived | 1.0 | | 158 | 2.0 | The End of the World | episode | 2005-04-02 | 1.2 | 7.97 | 76 | 44 |
| 2 | revived | 1.0 | | 159 | 3.0 | The Unquiet Dead | episode | 2005-04-09 | 1.3 | 8.86 | 80 | 44 |
| 3 | revived | 1.0 | | 160a | 4.0 | Aliens of London | episode | 2005-04-16 | 1.4 | 7.63 | 82 | 45 |
| 4 | revived | 1.0 | | 160b | 5.0 | World War Three | episode | 2005-04-23 | 1.5 | 7.98 | 81 | 42 |


### **Scrape all posts in 2023-2024 (JSON) 🏆**

Building on the successful CSV scrapes, we finally move on to scraping and saving in a nested JSON format. At first glance, it might seem fine to load the CSV into a dataframe and reshape it into a hierarchical JSON structure. However, it is more intuitive for Python users to produce the correct JSON structure directly. I therefore put considerably more effort into producing the JSON file, including formatting, handling nested objects, and testing. The final output files to upload are:



*   `tidytuesday_json`: the full set
*   `tidytuesday_json_train`: the train set
*   `tidytuesday_json_val`: the validation set

All three files above along with the previous CSV files are stored at https://github.com/hollyyfc/tidytuesday-for-python.git

In [176]:
# Define a function to retrieve all posts information from selected years
def get_all_data_json(years: List[str]):
  # access all posts date and urls in selected years
  all_posts = get_all_posts(years)
  # create a list to store the output (one dict per post)
  final_json = []

  # Iteratively scrape through each post_url
  for post_date, post_url in tqdm_notebook(all_posts.items()):
    # request contents
    response = requests.get(post_url)
    if response.status_code == 200:
      soup = BeautifulSoup(response.content, 'html.parser')
    else:
      raise Exception("Sorry, response status failed.")

    # date_posted
    date_posted = post_date

    # project_name
    project_name = soup.find_all('h1')[0].text

    # project_source
    all_p = soup.find('article').find_all('p')
    ## enable scraping even when the repo organization has slightly changed
    if all_p[0].get_text() == 'Please add alt text (alternative text) to all of your posted graphics for #TidyTuesday.':
      all_p_clean = all_p[12:]
    elif all_p[0].get_text() == 'Please add alt text (alternative text) to all of your posted graphics\\nfor #TidyTuesday.':
      all_p_clean = all_p[8:]
    else:
      all_p_clean = all_p

    project_source = []
    for html_url in all_p_clean:
      all_a = html_url.find_all('a', href=True)
      for a in all_a:
        url_string = a['href']
        unescaped_url = json.loads(f'"{url_string}"').strip('"')
        if "readme.md" in unescaped_url:
          unescaped_url = 'https://github.com' + unescaped_url
        project_source.append(unescaped_url)
    project_source = list(set(project_source))

    # description
    description_lines = []
    for p in all_p_clean:
      individual_description = p.get_text(strip=True, separator='').replace("\\n", " ").replace("\\", "")
      description_lines.append(individual_description)
    description = ' '.join(map(str, description_lines))

    # data_source_url
    data_source_url = post_url

    # data_dictionary
    tables = soup.find('article').find_all("table")
    data_dictionary = []
    for table in tables:
      table_vars = table.find_all("tr")
      table_var_len = len(table_vars)
      data_structure = {"variable": [],
                        "class": [],
                        "description": []}
      for i in range(1, table_var_len):
        var = table_vars[i].find_all("td")
        data_structure["variable"].append(var[0].text)
        data_structure["class"].append(var[1].text)
        data_structure["description"].append(var[2].text)
      data_dictionary.append(data_structure)

    # data & data_load
    files = json.loads(soup.get_text())['payload']['tree']['items']
    data_csv = {"file_name": [],
                "file_url": []}
    data_loader = {"file_name": [],
                   "file_url": []}
    root_url = "https://github.com/rfordatascience/tidytuesday/blob/master/"
    for file in files:
        if ".csv" in file['name']:
            # add data
            new_url = root_url + file['path']
            data_csv["file_name"].append(file['name'])
            data_csv["file_url"].append(new_url)
            # add data_load
            load_url = new_url.replace("github.com", "raw.githubusercontent.com").replace("/blob", "")
            data_loader["file_name"].append(file['name'])
            data_loader["file_url"].append(load_url)
    data = data_csv
    data_load = data_loader

    # Add to the output list
    final_json.append({
        "date_posted": date_posted,
        "project_name": project_name,
        "project_source": project_source,
        "description": description,
        "data_source_url": data_source_url,
        "data_dictionary": data_dictionary,
        "data": data,
        "data_load": data_load
        })
  return final_json
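
In the JSON variant, `data` and `data_load` hold two parallel lists (`file_name` and `file_url`) instead of the name-to-URL dictionary used in the dataframe. If the dictionary form is preferred, it can be rebuilt with `zip`. The sketch below assumes a `data_load` entry shaped like the one this function produces for the 2023-11-28 post, with raw URLs derived from the blob URLs shown earlier.

```python
base = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-11-28/"

# One data_load entry in the JSON layout: parallel lists of names and URLs
data_load = {
    "file_name": ["drwho_directors.csv", "drwho_episodes.csv", "drwho_writers.csv"],
    "file_url": [
        base + "drwho_directors.csv",
        base + "drwho_episodes.csv",
        base + "drwho_writers.csv",
    ],
}

# Pair each file name with its URL to recover a {name: url} mapping
name_to_url = dict(zip(data_load["file_name"], data_load["file_url"]))
print(name_to_url["drwho_episodes.csv"])
```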

In [177]:
tidytuesday_json = get_all_data_json(['2023', '2024']) #### AYYYYYY LOOK AT THIS!

  0%|          | 0/59 [00:00<?, ?it/s]

In [179]:
# Save the final list of records to json
file_path = '/content/drive/MyDrive/STA 663 Colab/tidytuesday_json.json'
with open(file_path, 'w', encoding='utf8') as f:
  json.dump(tidytuesday_json, f, ensure_ascii=False, indent=4)

# Test read - it works! (read with the same encoding used for writing)
with open(file_path, 'r', encoding='utf8') as j:
  ab = json.load(j)

In [180]:
# Train test split
tidytuesday_json_train, tidytuesday_json_val = train_test_split(tidytuesday_json, test_size=0.3)

# Save split files
file_path = '/content/drive/MyDrive/STA 663 Colab/tidytuesday_json_train.json'
with open(file_path, 'w', encoding='utf8') as f:
  json.dump(tidytuesday_json_train, f, ensure_ascii=False, indent=4)

file_path = '/content/drive/MyDrive/STA 663 Colab/tidytuesday_json_val.json'
with open(file_path, 'w', encoding='utf8') as f:
  json.dump(tidytuesday_json_val, f, ensure_ascii=False, indent=4)
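
The split above is made without a fixed seed, so each run produces different train and validation files. Passing `random_state=...` to `train_test_split` is the direct way to make it repeatable. The sketch below shows the same idea using only the standard library, so it is a stand-in rather than the method used in this notebook; `split_records`, the seed value, and the placeholder records are all hypothetical.

```python
import math
import random
from typing import List, Tuple

def split_records(records: List[dict], val_fraction: float = 0.3,
                  seed: int = 42) -> Tuple[List[dict], List[dict]]:
    """Shuffle a copy of the records with a fixed seed and split off a validation set."""
    shuffled = records[:]                   # copy so the input order is left untouched
    random.Random(seed).shuffle(shuffled)   # seeded shuffle gives the same order every run
    n_val = math.ceil(len(shuffled) * val_fraction)  # validation size, rounded up
    return shuffled[n_val:], shuffled[:n_val]

# Placeholder records standing in for the 59 scraped posts
records = [{"id": i} for i in range(59)]
train, val = split_records(records)
print(len(train), len(val))
```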

In [23]:
# Save the dataframe to csv
tidytuesday_2023_2024.to_csv("/content/drive/MyDrive/STA 663 Colab/tidytuesday_2023_2024.csv", index=False)

In [108]:
# train-validation split
tidytuesday_train, tidytuesday_val = train_test_split(tidytuesday_2023_2024, test_size=0.3)
# save csv files
tidytuesday_train.to_csv("/content/drive/MyDrive/STA 663 Colab/tidytuesday_train.csv", index=False)
tidytuesday_val.to_csv("/content/drive/MyDrive/STA 663 Colab/tidytuesday_val.csv", index=False)

## **Submit to HuggingFace**

In [54]:
# Installing huggingface datasets library
!pip install datasets -q

In [4]:
# Change to working directory
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/STA 663 Colab/Project1

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/STA 663 Colab/Project1


Git LFS initialized.


In [6]:
# Log into huggingface
!huggingface-cli login


    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [122]:
!huggingface-cli repo create tidytuesday_for_python --type dataset

git version 2.34.1
git-lfs/3.0.2 (GitHub; linux amd64; go 1.18.1)

You are about to create datasets/hollyyfc/tidytuesday_for_python
Proceed? [Y/n] Y

Your repo now lives at:
  https://huggingface.co/datasets/hollyyfc/tidytuesday_for_python

You can clone it locally with the command below, and commit/push as usual.

  git clone https://huggingface.co/datasets/hollyyfc/tidytuesday_for_python



In [123]:
!git clone https://huggingface.co/datasets/hollyyfc/tidytuesday_for_python

Cloning into 'tidytuesday_for_python'...
remote: Enumerating objects: 3, done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 3
Unpacking objects: 100% (3/3), 1.15 KiB | 43.00 KiB/s, done.


In [7]:
%cd tidytuesday_for_python

/content/drive/MyDrive/STA 663 Colab/Project1/tidytuesday_for_python


In [19]:
# Push to huggingface
!git add -A

In [20]:
!git commit -m "Finish readme"

[main 7ea918e] Finish readme
 1 file changed, 148 insertions(+), 151 deletions(-)
 rewrite README.md (89%)


In [21]:
!git push

Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 4.73 KiB | 537.00 KiB/s, done.
Total 3 (delta 1), reused 0 (delta 0), pack-reused 0
To https://huggingface.co/datasets/hollyyfc/tidytuesday_for_python
   66b64b6..7ea918e  main -> main


In [18]:
!git pull

Already up to date.


## **Load Final Dataset from HuggingFace ✅**

In [1]:
# Test load
!pip install datasets
from datasets import load_dataset

dataset = load_dataset("hollyyfc/tidytuesday_for_python")

Collecting datasets
  Downloading datasets-2.17.1-py3-none-any.whl (536 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Installing collected packages: dill, multiprocess, datasets
Successfully installed datasets-2.17.1 dill-0.3.8 multiprocess-0.70.16


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Downloading builder script:   0%|          | 0.00/5.04k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/70.4k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/53.0k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/19.0k [00:00<?, ?B/s]

Generating full split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

In [3]:
next(iter(dataset['train']))

{'date_posted': '2023-03-14',
 'project_name': 'European Drug Development',
 'project_source': ['https://www.ema.europa.eu/en/medicines/download-medicine-data',
  'https://github.com/MiqG/EMA-Data-Scratching-with-RSelenium',
  'https://towardsdatascience.com/dissecting-28-years-of-european-pharmaceutical-development-3affd8f87dc0',
  'https://www.ema.europa.eu/sites/default/files/Medicines_output_european_public_assessment_reports.xlsx'],
 'description': "The data this week comes from theEuropean Medicines AgencyviaMiquel Anglada Girotto on GitHub. We used thesource table of all EPARs for human and veterinary medicines, rather than Miquel's scraped data. Miquelwrote abouthis exploration of the data.",
 'data_source_url': 'https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-03-14',
 'data_dictionary': {'variable': ["['category', 'medicine_name', 'therapeutic_area', 'common_name', 'active_substance', 'product_number', 'patient_safety', 'authorisation_status', 'atc_co
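
In the record above, the `variable` field of `data_dictionary` appears to hold a list whose element is itself the string form of a Python list. Assuming that reading is right, a sketch like the following could turn such strings back into real lists with `ast.literal_eval`. The helper `parse_listlike` is hypothetical, and `raw_variable` is a shortened stand-in for the truncated output rather than the actual field value.

```python
import ast


def parse_listlike(value):
    """Return a real list for a string such as "['a', 'b']"; leave other values untouched."""
    if not isinstance(value, str):
        return value
    try:
        parsed = ast.literal_eval(value)  # evaluates literals only, never arbitrary code
    except (ValueError, SyntaxError):
        return value                      # not a Python literal: keep the original string
    return parsed if isinstance(parsed, list) else value


# Hypothetical, shortened stand-in for data_dictionary['variable']
raw_variable = ["['category', 'medicine_name', 'therapeutic_area']"]

variables = [parse_listlike(v) for v in raw_variable]
print(variables[0])
```

`ast.literal_eval` is preferred over `eval` here because the strings come from scraped web pages and should not be executed as code.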

In [17]:
dataset['full'][6]['data_load']

{'file_name': ['age_gaps.csv'],
 'file_url': ['https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-02-14/age_gaps.csv']}
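
The `data_load` field stores file names and download URLs as two parallel lists. A minimal sketch for pairing them, assuming both lists always have the same length and order, is given below. The helper `files_by_name` is hypothetical; the record is copied from the output above.

```python
def files_by_name(data_load):
    """Map each file name in a data_load record to its download URL."""
    names = data_load["file_name"]
    urls = data_load["file_url"]
    if len(names) != len(urls):
        raise ValueError("file_name and file_url have different lengths")
    return dict(zip(names, urls))


# Record copied from dataset['full'][6]['data_load'] above
data_load = {
    "file_name": ["age_gaps.csv"],
    "file_url": [
        "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-02-14/age_gaps.csv"
    ],
}

files = files_by_name(data_load)
for name, url in files.items():
    print(name, "->", url)
```

Each URL for a `.csv` file can then be passed to `pandas.read_csv`, which accepts URLs directly, to load that week's data into a dataframe.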