This notebook identifies composers whose works might have entered the public domain in the European Union. It uses Wikipedia page views as a rough popularity metric to rank the composers and surface the most relevant entries.

1. **Fetching a list of composers**: The first step involves fetching a list of composers from the Wikipedia page [List of 20th-century classical composers](https://en.wikipedia.org/wiki/List_of_20th-century_classical_composers).

The associated code cell performs the following tasks:

   1. **Get HTML content**: The `get_wiki_soup(url)` function fetches the HTML content of the provided URL using the `requests` library. If the request fails or the server returns an HTTP error status, it prints an error message and returns `None`. Otherwise, the HTML content is parsed using BeautifulSoup.
   
   2. **Parse the HTML table**: The `parse_table(soup)` function finds the first table with the class "wikitable sortable" in the parsed HTML. If it can't find the table, it prints an error message and returns `None`. Otherwise, it goes through each row of the table (skipping the header row), retrieves the text content of each cell, and collects the values in a list. It also takes the first link in the row's first cell, prefixes it with the Wikipedia base URL, and appends the result to the list. Each list of cell values is then converted into a dictionary (using the table headers as keys), and all these dictionaries are gathered into a list.
   
   3. **Create a DataFrame**: The URL of the Wikipedia page is defined, and the above functions are used to fetch and parse the page. If no errors occur, the list of dictionaries is converted into a pandas DataFrame and displayed.


In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_wiki_soup(url):
    """
    Fetches and parses a webpage into a BeautifulSoup object.

    Args:
    url (str): URL of the webpage to fetch.

    Returns:
    BeautifulSoup: Parsed webpage or None if an error occurs.
    """
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raises an exception for HTTP errors
    except requests.RequestException as e:
        # Instead of printing, we could log this error or handle it differently
        print(f"An error occurred while fetching the page: {e}")
        return None

    return BeautifulSoup(response.text, "html.parser")

def parse_table(soup):
    """
    Parses the 'wikitable sortable' table from a BeautifulSoup object.

    Args:
    soup (BeautifulSoup): BeautifulSoup object containing HTML of the page.

    Returns:
    list: List of dictionaries containing the table data or None if table is not found.
    """
    # Define the base URL for complete links
    BASE_URL = "https://en.wikipedia.org"

    table = soup.find("table", {"class": "wikitable sortable"})
    if table is None:
        print("Could not find the table on the page")
        return None

    rows = table.find_all("tr")
    headers = [header.get_text(strip=True) for header in rows[0].find_all("th")]
    headers.append('URL')

    data = []
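    for row in rows[1:]:
        cells = row.find_all("td")
        if not cells:
            continue  # skip rows without data cells
        values = [cell.get_text(strip=True) for cell in cells]
        # The first link in the first cell is assumed to be the composer's page.
        # Note: red links and external links do not yield a usable article URL.
        link = cells[0].find("a")
        url = f"{BASE_URL}{link.get('href')}" if link else None
        values.append(url)
        data.append(dict(zip(headers, values)))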

    return data


url = "https://en.wikipedia.org/wiki/List_of_20th-century_classical_composers"
soup = get_wiki_soup(url)
if soup is not None:
    data = parse_table(soup)
    if data is not None:
        df = pd.DataFrame(data)
        display(df)


Unnamed: 0,Name,Year of birth,Year of death,Nationality,Notable 20th-century works,Remarks,URL
0,Charles Dancla,1817,1907,French,"Solo de concours no. 7, Op. 224",Romanticism,https://en.wikipedia.org/wiki/Charles_Dancla
1,Luigi Arditi,1822,1903,Italian,,,https://en.wikipedia.org/wiki/Luigi_Arditi
2,Theodor Kirchner,1823,1903,German,,,https://en.wikipedia.org/wiki/Theodor_Kirchner
3,Carl Reinecke,1824,1910,German,"Trio for piano, clarinet and horn in B♭, Op. 2...",Romanticism,https://en.wikipedia.org/wiki/Carl_Reinecke
4,Richard Hol,1825,1904,Dutch,Organ music,Romanticism,https://en.wikipedia.org/wiki/Richard_Hol
...,...,...,...,...,...,...,...
3358,Daniel Hensel,1978,,German,,,https://en.wikipedia.org/wiki/Daniel_Hensel
3359,Jimmy López,1978,,Peruvian,Arco de luz,,https://en.wikipedia.org/wiki/Jimmy_L%C3%B3pez
3360,Tarik O'Regan,1978,,British-American,,,https://en.wikipedia.org/wiki/Tarik_O%27Regan
3361,Mehdi Hosseini,1979,,Iranian,,,https://en.wikipedia.org/wiki/Mehdi_Hosseini
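
`parse_table` pairs headers and cell values with `zip`, which silently stops at the shorter of the two sequences. If a table row has fewer cells than the header row, the appended URL ends up under the wrong key. The sketch below reproduces the effect and shows one possible way to keep the `URL` column aligned by padding short rows. The helper name `row_to_record` and the sample row are hypothetical and not part of the notebook.

```python
def row_to_record(headers, values, url):
    """Pair cell values with headers, padding short rows so 'URL' stays aligned."""
    n_cols = len(headers) - 1  # the last header is the appended 'URL' column
    padded = (values + [""] * n_cols)[:n_cols]  # pad with empty strings, then trim
    return dict(zip(headers, padded + [url]))

# Hypothetical header and row, for illustration only
headers = ["Name", "Year of birth", "Year of death", "Nationality", "Remarks", "URL"]
short_row = ["Example Composer", "1965", "", "Canadian"]  # the 'Remarks' cell is missing
url = "https://en.wikipedia.org/wiki/Example_Composer"

naive = dict(zip(headers, short_row + [url]))   # the notebook's approach
fixed = row_to_record(headers, short_row, url)  # padded variant

print(naive)  # the URL is stored under 'Remarks' and there is no 'URL' key
print(fixed)  # 'Remarks' is empty and the URL is stored under 'URL'
```

Padding assumes that the missing cells are the trailing ones. Rows that use `rowspan` or `colspan` would need a more careful treatment.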


2. **Parsing the Wikipedia page views via the Wikimedia API**

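This step involves fetching the page views for each composer's Wikipedia page. A request can fail if the page has no data in the chosen date range (for example, if it was deleted, or created after your defined "end_date"), or if the list entry does not link to a regular article (red links to missing pages, external links). These composers are still included in the DataFrame at this stage, but their page views are set to NaN.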

The associated code cell performs the following tasks:

   1. **Define the `get_pageviews` function**: This function takes a URL and a date range as input, extracts the title of the Wikipedia page from the URL, constructs the URL of the Wikimedia pageviews API, sends a GET request to that URL, and sums the monthly views in the response. If the URL is NaN, the function returns NaN. If the request fails for any reason, the function prints an error message and also returns NaN. After each successful request, the function waits briefly to respect the API's rate limits.
   
   2. **Apply the function to the DataFrame**: The `get_pageviews` function is applied to the 'URL' column of the DataFrame, with the start and end dates set to define the date range for the pageviews. The results are stored in a new 'Pageviews' column in the DataFrame.
   
   3. **Sort the DataFrame**: The DataFrame is sorted by the 'Pageviews' column in descending order, so the composers with the most pageviews are at the top.
   
   4. **Display the DataFrame**: The sorted DataFrame is displayed to visualize the results.

Before running the cell, replace `"YourActualUserAgent/1.0 (https://github.com/yourusername/yourrepository; mailto:youremail@example.com)"` with your actual user agent to properly use the Wikimedia API. The user agent should contain your application name, version, URL, and your email address.

In [4]:
from tqdm import tqdm
import requests
import pandas as pd
import numpy as np
import json
import time

def get_pageviews(url, start_date, end_date):
    """
    Fetches the total pageviews for a specific Wikipedia page within a date range.

    Args:
    url (str): URL of the Wikipedia page.
    start_date (str): Start date in YYYYMMDD format.
    end_date (str): End date in YYYYMMDD format.

    Returns:
    int or float: Total pageviews or NaN if an error occurs.
    """
    if pd.isnull(url):
        return np.nan

    title = url.split('/wiki/')[-1]
    base_url = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia.org/all-access/all-agents/"
    full_url = f"{base_url}{title}/monthly/{start_date}/{end_date}"

    headers = {
        "User-Agent": "YourActualUserAgent/1.0 (https://github.com/yourusername/yourrepository; mailto:youremail@example.com)"
    }

    try:
        response = requests.get(full_url, headers=headers)
        response.raise_for_status()
        data = response.json()
    except (requests.RequestException, json.JSONDecodeError) as e:
        print(f"An error occurred while fetching pageviews for URL: {url}")
        print(f"Error: {e}")
        return np.nan

    pageviews = sum(item['views'] for item in data.get('items', []))
    time.sleep(0.1)  # Delay to respect rate limits
    return pageviews

tqdm.pandas(desc="Fetching pageviews")
start_date = "20230101"
end_date = "20231231"
df['Pageviews'] = df['URL'].progress_apply(get_pageviews, start_date=start_date, end_date=end_date)
df_sorted = df.sort_values(by='Pageviews', ascending=False)
display(df_sorted)


Fetching pageviews:  75%|███████████████████████████████████████████▌              | 2527/3363 [21:29<06:18,  2.21it/s]

An error occurred while fetching pageviews for URL: https://en.wikipedia.org/w/index.php?title=Dexter_Morrill&action=edit&redlink=1
Error: 404 Client Error: Not Found for url: https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia.org/all-access/all-agents/https://en.wikipedia.org/w/index.php?title=Dexter_Morrill&action=edit&redlink=1/monthly/20230101/20231231


Fetching pageviews:  96%|███████████████████████████████████████████████████████▍  | 3213/3363 [27:12<01:08,  2.18it/s]

An error occurred while fetching pageviews for URL: https://en.wikipedia.orghttps://hc.sk/en/hudba/osobnost-detail/26-peter-zagar
Error: 404 Client Error: Not Found for url: https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia.org/all-access/all-agents/https://en.wikipedia.orghttps://hc.sk/en/hudba/osobnost-detail/26-peter-zagar/monthly/20230101/20231231


Fetching pageviews:  96%|███████████████████████████████████████████████████████▋  | 3230/3363 [27:20<00:59,  2.23it/s]

An error occurred while fetching pageviews for URL: https://en.wikipedia.org/w/index.php?title=Michel_Bosc&action=edit&redlink=1
Error: 404 Client Error: Not Found for url: https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia.org/all-access/all-agents/https://en.wikipedia.org/w/index.php?title=Michel_Bosc&action=edit&redlink=1/monthly/20230101/20231231


Fetching pageviews: 100%|██████████████████████████████████████████████████████████| 3363/3363 [28:25<00:00,  1.97it/s]


Unnamed: 0,Name,Year of birth,Year of death,Nationality,Notable 20th-century works,Remarks,URL,Pageviews
2641,Paul McCartney,1942,,English,Liverpool Oratorio;Standing Stone;Ecce Cor Meum,member ofThe BeatlesandWings,https://en.wikipedia.org/wiki/Paul_McCartney,5226998.0
1727,Leonard Bernstein,1918,1990,American,West Side Story;Chichester Psalms;Candide;Sere...,jazz-influenced and pop-influence; also a comp...,https://en.wikipedia.org/wiki/Leonard_Bernstein,4822792.0
2598,Frank Zappa,1940,1993,American,The Perfect Stranger;The Yellow Shark,,https://en.wikipedia.org/wiki/Frank_Zappa,2106989.0
2298,John Williams,1932,,American,"also composer of film scores,The Five Sacred T...",,https://en.wikipedia.org/wiki/John_Williams,2061340.0
2962,Ryuichi Sakamoto,1952,,Japanese,,,https://en.wikipedia.org/wiki/Ryuichi_Sakamoto,1335266.0
...,...,...,...,...,...,...,...,...
3211,Peter Zagar,1961,,Slovak,,,https://en.wikipedia.orghttps://hc.sk/en/hudba...,
3228,Michel Bosc[fr],1963,,French,,,https://en.wikipedia.org/w/index.php?title=Mic...,
3266,Paul Steenhuisen,1965,,Canadian,,https://en.wikipedia.org/wiki/Paul_Steenhuisen,,
3267,Mats Wendt,1965,,Eddan – The Invincible Sword of the Elf-Smith,,https://en.wikipedia.org/wiki/Mats_Wendt,,
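
The 404 errors in the log above come from URLs that contain no `/wiki/` path: `url.split('/wiki/')[-1]` then returns the whole URL, which is inserted into the API request as the title. A minimal sketch of a stricter title extraction follows. The helper names `extract_title` and `build_pageviews_url` are hypothetical; the endpoint prefix is the one used in `get_pageviews`.

```python
from urllib.parse import urlparse

# Same endpoint prefix as in get_pageviews
API_BASE = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
    "en.wikipedia.org/all-access/all-agents/"
)

def extract_title(url):
    """Return the article title from an en.wikipedia.org/wiki/ URL, or None."""
    parsed = urlparse(url)
    if parsed.netloc != "en.wikipedia.org" or not parsed.path.startswith("/wiki/"):
        return None  # red link, external link, or malformed URL
    return parsed.path[len("/wiki/"):] or None

def build_pageviews_url(title, start_date, end_date):
    """Assemble the monthly per-article pageviews request URL."""
    return f"{API_BASE}{title}/monthly/{start_date}/{end_date}"

# URLs taken from the notebook output
examples = [
    "https://en.wikipedia.org/wiki/Charles_Dancla",
    "https://en.wikipedia.org/w/index.php?title=Dexter_Morrill&action=edit&redlink=1",
    "https://en.wikipedia.orghttps://hc.sk/en/hudba/osobnost-detail/26-peter-zagar",
]
for example_url in examples:
    title = extract_title(example_url)
    if title is None:
        print("skipped:", example_url)
    else:
        print(build_pageviews_url(title, "20230101", "20231231"))
```

With a check like this, `get_pageviews` could return NaN for unusable URLs without sending a request at all.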


3. **Data Cleaning and Type Conversion**

This step involves cleaning the DataFrame and converting certain columns to the appropriate data types.

The associated code cell performs the following tasks:

   1. **Remove rows with NaN values**: The DataFrame is filtered to remove rows where either the 'URL' or 'Pageviews' columns contain NaN values.
   
   2. **Convert columns to numeric**: The 'Year of birth' and 'Year of death' columns are converted to numeric values using `pd.to_numeric`. With `errors='coerce'`, values that cannot be parsed as numbers (e.g., empty strings) become NaN instead of raising an error.
   
   3. **Keep the year columns as float**: Both columns are explicitly cast to float, so missing years remain NaN and the two columns share the same data type. No placeholder values are needed.
   
   4. **Display the DataFrame**: The cleaned and transformed DataFrame is displayed to visualize the results.


In [5]:
# Clean the parsed db, remove the rows where either 'URL' or 'Pageviews' is NaN
df_filtered = df_sorted.dropna(subset=['URL', 'Pageviews']).copy()

# Convert the 'Year of birth' and 'Year of death' columns to numeric
df_filtered['Year of birth'] = pd.to_numeric(df_filtered['Year of birth'], errors='coerce')
df_filtered['Year of death'] = pd.to_numeric(df_filtered['Year of death'], errors='coerce')

# Simplified handling of NaN without placeholders, ensuring data type consistency
df_filtered['Year of birth'] = df_filtered['Year of birth'].astype(float)
df_filtered['Year of death'] = df_filtered['Year of death'].astype(float)

# Display the cleaned and processed DataFrame
df_filtered


Unnamed: 0,Name,Year of birth,Year of death,Nationality,Notable 20th-century works,Remarks,URL,Pageviews
2641,Paul McCartney,1942.0,,English,Liverpool Oratorio;Standing Stone;Ecce Cor Meum,member ofThe BeatlesandWings,https://en.wikipedia.org/wiki/Paul_McCartney,5226998.0
1727,Leonard Bernstein,1918.0,1990.0,American,West Side Story;Chichester Psalms;Candide;Sere...,jazz-influenced and pop-influence; also a comp...,https://en.wikipedia.org/wiki/Leonard_Bernstein,4822792.0
2598,Frank Zappa,1940.0,1993.0,American,The Perfect Stranger;The Yellow Shark,,https://en.wikipedia.org/wiki/Frank_Zappa,2106989.0
2298,John Williams,1932.0,,American,"also composer of film scores,The Five Sacred T...",,https://en.wikipedia.org/wiki/John_Williams,2061340.0
2962,Ryuichi Sakamoto,1952.0,,Japanese,,,https://en.wikipedia.org/wiki/Ryuichi_Sakamoto,1335266.0
...,...,...,...,...,...,...,...,...
2925,Craig Russell,1951.0,,American,,,https://en.wikipedia.org/wiki/Craig_Russell_(c...,163.0
2745,Johnterryl Plumeri,1945.0,2016.0,American,,,https://en.wikipedia.org/wiki/Johnterryl_Plumeri,153.0
1357,Ludovit Rajter,1906.0,2000.0,Slovak,,,https://en.wikipedia.org/wiki/Ludovit_Rajter,125.0
2752,Thomas Wells,1945.0,,American,,,https://en.wikipedia.org/wiki/Thomas_Henry_Wells,108.0
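
Because the year columns are stored as float, years are displayed as `1942.0`. If whole-number years are preferred, pandas' nullable integer dtype `Int64` is one alternative that still allows missing values. The sketch below uses a small made-up Series rather than the notebook's DataFrame.

```python
import pandas as pd

# Made-up sample: two valid years, an empty string, and an unparseable value
years = pd.Series(["1918", "1990", "", "c. 1900"])

as_float = pd.to_numeric(years, errors="coerce")  # float64, NaN for bad values
as_int = as_float.astype("Int64")                 # nullable integer, <NA> for bad values

print(as_float.dtype, as_float.tolist())
print(as_int.dtype, as_int.tolist())
```

Comparisons such as `as_int >= 1945` still work with `Int64`, although rows with `<NA>` then produce `<NA>` in the resulting boolean mask.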


4. **Filtering Composers by Year of Death**

This step involves filtering the composers by their year of death. We keep only those who passed away between 1945 and 1955.

The associated code cell performs the following tasks:

   1. **Apply filter conditions**: The DataFrame is filtered to include only those composers whose 'Year of death' is between 1945 and 1955 (inclusive).
   
   2. **Display the DataFrame**: The filtered DataFrame is displayed to visualize the results.


In [8]:
# Constants for year range
START_YEAR = 1945
END_YEAR = 1955

def filter_by_year_of_death(df, start_year, end_year):
    """
    Filters DataFrame based on 'Year of death' within a specified range.

    Args:
    df (DataFrame): DataFrame to filter.
    start_year (int): Starting year of the filter range.
    end_year (int): Ending year of the filter range.

    Returns:
    DataFrame: Filtered DataFrame.
    """
    return df[(df['Year of death'] >= start_year) & (df['Year of death'] <= end_year)]

# Applying the filter
df_public_domain = filter_by_year_of_death(df_filtered, START_YEAR, END_YEAR)
display(df_public_domain)


Unnamed: 0,Name,Year of birth,Year of death,Nationality,Notable 20th-century works,Remarks,URL,Pageviews
904,Sergei Prokofiev,1891.0,1953.0,Russian,Romeo and Juliet;The Love for Three Oranges;Cl...,"neoclassicism,Romanticism",https://en.wikipedia.org/wiki/Sergei_Prokofiev,406029.0
350,Richard Strauss,1864.0,1949.0,German,An Alpine Symphony;Der Rosenkavalier;Oboe Conc...,Romanticism,https://en.wikipedia.org/wiki/Richard_Strauss,383278.0
525,Arnold Schoenberg,1874.0,1951.0,Austrian,Gurre-Lieder;Verklärte Nacht;Variations for Or...,"Romanticism, laterexpressionismandserialism; f...",https://en.wikipedia.org/wiki/Arnold_Schoenberg,351625.0
657,Béla Bartók,1881.0,1945.0,Hungarian,Concerto for Orchestra;six string quartets;The...,"neoclassicism, folk-influenced",https://en.wikipedia.org/wiki/B%C3%A9la_Bart%C...,295975.0
959,Ivor Novello,1893.0,1951.0,Welsh,,,https://en.wikipedia.org/wiki/Ivor_Novello,194439.0
...,...,...,...,...,...,...,...,...
1134,Niels Clemmensen,1900.0,1950.0,Danish,,,https://en.wikipedia.org/wiki/Niels_Clemmensen,553.0
210,George Strong,1856.0,1948.0,American,,,https://en.wikipedia.org/wiki/George_Strong_(c...,515.0
371,John Clare Billing,1866.0,1955.0,English,,,https://en.wikipedia.org/wiki/John_Clare_Billing,498.0
895,Thomas Griselle,1891.0,1955.0,American,,,https://en.wikipedia.org/wiki/Thomas_Griselle,413.0
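
To relate the year range to public-domain status, the following sketch assumes the general EU term of protection: 70 years after the author's death, counted from 1 January of the year following the death. The function names are hypothetical, and the calculation deliberately ignores special cases such as co-authored works, posthumous publications, and national exceptions, so it is only a first approximation.

```python
from datetime import date

PROTECTION_YEARS = 70  # assumption: general EU term of life + 70 years

def public_domain_year(year_of_death, term=PROTECTION_YEARS):
    """First calendar year in which the works are no longer protected."""
    # Protection runs until the end of the 70th year after the year of death.
    return year_of_death + term + 1

def is_public_domain(year_of_death, today=None, term=PROTECTION_YEARS):
    """Check whether the term has expired on the given date (default: today)."""
    today = today or date.today()
    return today.year >= public_domain_year(year_of_death, term)

for name, died in [("Béla Bartók", 1945), ("Sergei Prokofiev", 1953), ("Thomas Griselle", 1955)]:
    print(f"{name}: public domain from {public_domain_year(died)}")
```

Under this assumption, a composer who died in 1953 enters the public domain on 1 January 2024, and one who died in 1955 on 1 January 2026.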


5. **Parsing IMSLP for Corresponding Pages**

This step involves creating the corresponding IMSLP (International Music Score Library Project) URL for each composer and checking if a page exists at that URL.

The associated code cell performs the following tasks:

   1. **Define the `create_imslp_link` function**: This function takes a composer's name, splits it at spaces, treats the last word as the surname, and builds the IMSLP category URL in the form `Category:Surname,_Given_names`.
   
   2. **Define the `check_page_exists` function**: This function sends a GET request to a given URL and returns `True` when the response carries no HTTP error status (4xx or 5xx) and `False` when it does. It includes a delay to prevent overloading the server.
   
   3. **Create a copy of the DataFrame**: The DataFrame is copied to avoid modifying the original data.
   
   4. **Create the IMSLP URLs**: The `create_imslp_link` function is applied to the 'Name' column of the DataFrame, and the resulting IMSLP URLs are stored in a new 'IMSLP_URL' column.
   
   5. **Check if the IMSLP pages exist**: The `check_page_exists` function is applied to the 'IMSLP_URL' column of the DataFrame, and the results are stored in a new 'IMSLP_Exists' column. The `tqdm.pandas()` function is used to display a progress bar for the operation.
   
   6. **Display the DataFrame**: The DataFrame, now including the IMSLP URLs and whether a page exists at each URL, is displayed.


In [9]:
import pandas as pd
import requests
import time
from tqdm import tqdm
from urllib.parse import quote

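def create_imslp_link(composer):
    """Build the IMSLP category URL, treating the last word of the name as the surname."""
    name_parts = composer.split(' ')
    imslp_name = "{},_{}".format(name_parts[-1], '_'.join(name_parts[:-1])) if len(name_parts) > 1 else name_parts[0]
    return f"https://imslp.org/wiki/Category:{quote(imslp_name)}"

def check_page_exists(url):
    """Return True if the URL responds without an HTTP error, False otherwise."""
    try:
        time.sleep(0.1)  # pause to prevent overloading server
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return True
    except requests.RequestException:
        # Also covers timeouts and connection errors, not only 4xx/5xx responses
        return False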

# Assuming df_public_domain is loaded
df_imslp = df_public_domain.copy()
df_imslp['IMSLP_URL'] = df_imslp['Name'].apply(create_imslp_link)

tqdm.pandas(desc="Checking IMSLP pages")
df_imslp['IMSLP_Exists'] = df_imslp['IMSLP_URL'].progress_apply(check_page_exists)

display(df_imslp)


Checking IMSLP pages: 100%|██████████████████████████████████████████████████████████| 207/207 [02:31<00:00,  1.37it/s]


Unnamed: 0,Name,Year of birth,Year of death,Nationality,Notable 20th-century works,Remarks,URL,Pageviews,IMSLP_URL,IMSLP_Exists
904,Sergei Prokofiev,1891.0,1953.0,Russian,Romeo and Juliet;The Love for Three Oranges;Cl...,"neoclassicism,Romanticism",https://en.wikipedia.org/wiki/Sergei_Prokofiev,406029.0,https://imslp.org/wiki/Category:Prokofiev%2C_S...,True
350,Richard Strauss,1864.0,1949.0,German,An Alpine Symphony;Der Rosenkavalier;Oboe Conc...,Romanticism,https://en.wikipedia.org/wiki/Richard_Strauss,383278.0,https://imslp.org/wiki/Category:Strauss%2C_Ric...,True
525,Arnold Schoenberg,1874.0,1951.0,Austrian,Gurre-Lieder;Verklärte Nacht;Variations for Or...,"Romanticism, laterexpressionismandserialism; f...",https://en.wikipedia.org/wiki/Arnold_Schoenberg,351625.0,https://imslp.org/wiki/Category:Schoenberg%2C_...,True
657,Béla Bartók,1881.0,1945.0,Hungarian,Concerto for Orchestra;six string quartets;The...,"neoclassicism, folk-influenced",https://en.wikipedia.org/wiki/B%C3%A9la_Bart%C...,295975.0,https://imslp.org/wiki/Category:Bart%C3%B3k%2C...,True
959,Ivor Novello,1893.0,1951.0,Welsh,,,https://en.wikipedia.org/wiki/Ivor_Novello,194439.0,https://imslp.org/wiki/Category:Novello%2C_Ivor,True
...,...,...,...,...,...,...,...,...,...,...
1134,Niels Clemmensen,1900.0,1950.0,Danish,,,https://en.wikipedia.org/wiki/Niels_Clemmensen,553.0,https://imslp.org/wiki/Category:Clemmensen%2C_...,False
210,George Strong,1856.0,1948.0,American,,,https://en.wikipedia.org/wiki/George_Strong_(c...,515.0,https://imslp.org/wiki/Category:Strong%2C_George,False
371,John Clare Billing,1866.0,1955.0,English,,,https://en.wikipedia.org/wiki/John_Clare_Billing,498.0,https://imslp.org/wiki/Category:Billing%2C_Joh...,False
895,Thomas Griselle,1891.0,1955.0,American,,,https://en.wikipedia.org/wiki/Thomas_Griselle,413.0,https://imslp.org/wiki/Category:Griselle%2C_Th...,True
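
The name reordering in `create_imslp_link` can be checked without any network access. The sketch below repeats the same logic in a hypothetical helper, `imslp_category_name`, and prints the percent-encoded result for a few names. The last example has a two-word surname and shows where the "last word is the surname" assumption breaks down; whether a matching category exists on IMSLP is not checked here.

```python
from urllib.parse import quote

def imslp_category_name(composer):
    """Reorder 'Given Names Surname' into 'Surname,_Given_Names'."""
    parts = composer.split(' ')
    if len(parts) == 1:
        return parts[0]
    return f"{parts[-1]},_{'_'.join(parts[:-1])}"

names = ["Richard Strauss", "Béla Bartók", "John Clare Billing", "Ralph Vaughan Williams"]
for name in names:
    # quote() percent-encodes the comma and any non-ASCII characters
    print(f"{name} -> {quote(imslp_category_name(name))}")
```

For "Ralph Vaughan Williams" the result is `Williams%2C_Ralph_Vaughan`, so composers with multi-word surnames may be reported as missing even if IMSLP lists them under a different spelling.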


6. **Filtering Rows Where IMSLP Page Exists and Displaying with Clickable Links**

This step involves keeping only the rows where the IMSLP page exists (i.e., 'IMSLP_Exists' is True) and displaying the DataFrame as HTML with clickable links.

The associated code cell performs the following tasks:

   1. **Filter the DataFrame**: The DataFrame is filtered to include only those rows where 'IMSLP_Exists' is True. The filtered DataFrame is copied to avoid modifying the original data.
   
   2. **Create clickable links**: The 'IMSLP_URL' and 'URL' columns are transformed into clickable HTML links using a dedicated function, `format_hyperlink`, so the link markup is defined in one place.
   
   3. **Display the DataFrame as HTML**: The DataFrame is converted to an HTML table with `to_html`, and this HTML is displayed. The `escape=False` argument is used to render the HTML links correctly. Only the first 100 rows of the DataFrame are displayed.

In [10]:
def format_hyperlink(url):
    return f'<a href="{url}">{url}</a>'

df_imslp_exists = df_imslp[df_imslp['IMSLP_Exists']].copy()  # Simplified conditional check

df_imslp_exists['IMSLP_URL'] = df_imslp_exists['IMSLP_URL'].apply(format_hyperlink)
df_imslp_exists['URL'] = df_imslp_exists['URL'].apply(format_hyperlink)

from IPython.display import HTML

HTML(df_imslp_exists.head(100).to_html(escape=False))


Unnamed: 0,Name,Year of birth,Year of death,Nationality,Notable 20th-century works,Remarks,URL,Pageviews,IMSLP_URL,IMSLP_Exists
904,Sergei Prokofiev,1891.0,1953.0,Russian,Romeo and Juliet;The Love for Three Oranges;Classical Symphony;Symphony No. 5andNo. 6;Violin Concerto No. 1andNo. 2;Piano Concerto No. 2andNo. 3;Alexander Nevsky;Lieutenant Kijé;Peter and the Wolf,"neoclassicism,Romanticism",https://en.wikipedia.org/wiki/Sergei_Prokofiev,406029.0,https://imslp.org/wiki/Category:Prokofiev%2C_Sergei,True
350,Richard Strauss,1864.0,1949.0,German,An Alpine Symphony;Der Rosenkavalier;Oboe Concerto;Salome;Metamorphosen,Romanticism,https://en.wikipedia.org/wiki/Richard_Strauss,383278.0,https://imslp.org/wiki/Category:Strauss%2C_Richard,True
525,Arnold Schoenberg,1874.0,1951.0,Austrian,Gurre-Lieder;Verklärte Nacht;Variations for Orchestra;Pierrot Lunaire;Five Pieces for Orchestra;Violin Concerto,"Romanticism, laterexpressionismandserialism; founder of theSecond Viennese School",https://en.wikipedia.org/wiki/Arnold_Schoenberg,351625.0,https://imslp.org/wiki/Category:Schoenberg%2C_Arnold,True
657,Béla Bartók,1881.0,1945.0,Hungarian,"Concerto for Orchestra;six string quartets;The Miraculous Mandarin;Piano Concerto No. 1;Piano Concerto No. 2;Piano Concerto No. 3;Music for Strings, Percussion and Celesta;Sonata for Two Pianos and Percussion","neoclassicism, folk-influenced",https://en.wikipedia.org/wiki/B%C3%A9la_Bart%C3%B3k,295975.0,https://imslp.org/wiki/Category:Bart%C3%B3k%2C_B%C3%A9la,True
959,Ivor Novello,1893.0,1951.0,Welsh,,,https://en.wikipedia.org/wiki/Ivor_Novello,194439.0,https://imslp.org/wiki/Category:Novello%2C_Ivor,True
1162,Kurt Weill,1900.0,1950.0,German/American,"2 symphonies; String Quartets; ""Die Dreigroschenoper"";Aufstieg und Fall der Stadt Mahagonny; ""Mahagonny-Songspiel""; several Broadway musicals",neoclassical,https://en.wikipedia.org/wiki/Kurt_Weill,146224.0,https://imslp.org/wiki/Category:Weill%2C_Kurt,True
821,Florence Price,1887.0,1953.0,American,,,https://en.wikipedia.org/wiki/Florence_Price,138967.0,https://imslp.org/wiki/Category:Price%2C_Florence,True
520,Charles Ives,1874.0,1954.0,American,The Unanswered Question;Central Park in the Dark;Variations on America;Three Places in New England;Concord Sonata,"avant-garde, folk-influenced",https://en.wikipedia.org/wiki/Charles_Ives,114102.0,https://imslp.org/wiki/Category:Ives%2C_Charles,True
736,Anton Webern,1883.0,1945.0,Austrian,Symphony; Six Pieces for Orchestra,"twelve-tone technique,serialism,polystylism; pupil ofArnold Schoenberg",https://en.wikipedia.org/wiki/Anton_Webern,106623.0,https://imslp.org/wiki/Category:Webern%2C_Anton,True
795,Wilhelm Furtwängler,1886.0,1954.0,German,"Symphony No. 1 in B minor, Symphony No. 2 in E minor, Symphony No. 3 in C♯ minor, Piano Quintet, Religöser Hymnus, Te Deum for Choir and Orchestra","neoclassicism, folk-influenced, twelve-tone technique",https://en.wikipedia.org/wiki/Wilhelm_Furtw%C3%A4ngler,96058.0,https://imslp.org/wiki/Category:Furtw%C3%A4ngler%2C_Wilhelm,True
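
Because `to_html(escape=False)` inserts cell contents into the page unchanged, any URL containing characters such as `&` or `"` goes into the markup as-is. A minimal sketch of a more defensive formatter using the standard library's `html.escape` is shown below. The function name `format_hyperlink_safe` and its optional `label` parameter are hypothetical additions.

```python
from html import escape

def format_hyperlink_safe(url, label=None):
    """Return an HTML anchor with the URL and link text escaped."""
    href = escape(url, quote=True)  # escapes &, <, > and quotes in the attribute
    text = escape(label if label is not None else url)
    return f'<a href="{href}">{text}</a>'

print(format_hyperlink_safe("https://imslp.org/wiki/Category:Weill%2C_Kurt"))
print(format_hyperlink_safe(
    "https://en.wikipedia.org/w/index.php?title=Michel_Bosc&action=edit",
    label="Michel Bosc",
))
```

URLs that are already percent-encoded, like the IMSLP links in this notebook, pass through unchanged.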


7. **Saving the DataFrame as a TSV File**

This step involves saving your desired DataFrame as a TSV (Tab Separated Values) file for future reference or use.

The associated code cell performs the following tasks:

   1. **Define the DataFrame to be saved**: The DataFrame to be saved is defined. In this case, it is `df_sorted`, which is the DataFrame obtained after sorting by page views. Throughout the notebook, we worked with several DataFrames: `df` (initial DataFrame from the Wikipedia page), `df_sorted` (sorted by page views), `df_filtered` (cleaned and with numeric columns), `df_public_domain` (filtered by year of death), `df_imslp` (with IMSLP URLs and existence checks), and `df_imslp_exists` (filtered to include only rows where an IMSLP page exists). You can change `df_sorted` to any of these DataFrames depending on which one you want to save.
   
   2. **Save the DataFrame as a TSV file**: The DataFrame is saved as a TSV file using `to_csv` with `sep='\t'`. The `index=False` argument is used to exclude the DataFrame's index from the file. If the DataFrame is saved successfully, a success message with the full file path is logged via the `logging` module. If an error occurs during the saving process, an error message is logged instead.


In [11]:
import os
import logging

# Setup basic configuration for logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Define the DataFrame to be saved
df_to_save = df_sorted

# Save the DataFrame as a TSV file
filename = 'composers.tsv'
filepath = os.path.join(os.getcwd(), filename)  # OS-independent path handling

try:
    df_to_save.to_csv(filepath, sep='\t', index=False)
    logging.info(f"Data saved successfully to {filepath}.")
except Exception as e:
    logging.error(f"Error: data could not be saved due to {e}.")


2024-04-17 23:02:22,997 - INFO - Data saved successfully to C:\Users\egorp\Nextcloud\code\public_repos\PublicDomainSheetMusicFinder\composers.tsv.
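
To reuse the saved data later, the TSV file can be loaded with `pd.read_csv` and the same separator. The sketch below demonstrates the round trip with a small sample DataFrame (values taken from the tables above) and an in-memory buffer instead of a file on disk, so it runs without touching the file system.

```python
import io
import pandas as pd

# Small sample with values from the notebook output
df_example = pd.DataFrame({
    "Name": ["Béla Bartók", "Kurt Weill"],
    "Year of death": [1945.0, 1950.0],
    "Pageviews": [295975.0, 146224.0],
})

buffer = io.StringIO()
df_example.to_csv(buffer, sep="\t", index=False)  # same arguments as in the cell above
buffer.seek(0)

df_loaded = pd.read_csv(buffer, sep="\t")
print(df_loaded)
```

For the file written by the notebook, the equivalent call would be `pd.read_csv('composers.tsv', sep='\t')`.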
