# Data Analysis and Visualisation &mdash; Lab 01

## Problem 1: How has Linux adoption as a desktop operating system changed up to 2025?

### 1. Objective

The objective of this study is to examine how Linux adoption as a desktop operating system has changed up to 2025. Using data from public sources (StatCounter, the Steam Hardware Survey, and DistroWatch), we aim to determine whether Linux usage among desktop users is increasing or remaining stable over time.

### 2. Variables of Interest

| Variable | Description | Type | Source |
| --- | --- | --- | --- |
| `date` | Month and year of observation | Datetime | All sources |
| `region` | Geographic region or country | Categorical | StatCounter |
| `linux_share` | % of desktop users using Linux | Numeric | StatCounter |
| `win_share` | % of desktop users using Windows | Numeric | StatCounter |
| `mac_share` | % of desktop users using macOS | Numeric | StatCounter |
| `steam_linux_share` | % of Steam users on Linux | Numeric | Steam Hardware Survey |
| `distro_hits` | Page hits per day for each Linux distro (popularity indicator) | Numeric | DistroWatch |

In [31]:
import urllib.parse
import pandas as pd
import pycountry

def get_statcounter_os_share(start="2023-01", end="2025-10", country="ww"):
    """Download monthly desktop OS market share from StatCounter as a DataFrame.

    `start` and `end` are "YYYY-MM" strings; `country` is an ISO 3166-1
    alpha-2 code, or "ww" for worldwide.
    """
    # The query carries both the full region name and the two-letter code.
    region = pycountry.countries.get(alpha_2=country).name if country != "ww" else "Worldwide"
    region_enc = urllib.parse.quote(region)

    csv_url = (
        "https://gs.statcounter.com/os-market-share/desktop/chart.php"
        "?device=Desktop"
        "&device_hidden=desktop"
        "&statType=Operating%20System"
        "&statType_hidden=os_combined"
        f"&region={region_enc}"
        f"&region_hidden={country}"
        "&granularity=monthly"
        f"&fromInt={start.replace('-', '')}"
        f"&fromMonthYear={start}"
        f"&toInt={end.replace('-', '')}"
        f"&toMonthYear={end}"
        "&csv=1"
    )

    df = pd.read_csv(csv_url)
    # Tag every row with its region so frames for several regions can be combined.
    df["Region"] = region
    return df

statcounter_ww = get_statcounter_os_share()
statcounter_vn = get_statcounter_os_share(country="VN")

statcounter_vn.head()

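| | Date | Windows | Unknown | OS X | Linux | macOS | Chrome OS | Other | Region |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2023-01 | 55.07 | 34.96 | 7.16 | 2.74 | 0.0 | 0.06 | 0.0 | Viet Nam |
| 1 | 2023-02 | 54.95 | 35.37 | 6.74 | 2.89 | 0.0 | 0.05 | 0.0 | Viet Nam |
| 2 | 2023-03 | 52.23 | 38.97 | 6.23 | 2.53 | 0.0 | 0.05 | 0.0 | Viet Nam |
| 3 | 2023-04 | 37.54 | 55.95 | 4.42 | 2.04 | 0.0 | 0.05 | 0.0 | Viet Nam |
| 4 | 2023-05 | 38.57 | 54.5 | 4.85 | 2.04 | 0.0 | 0.04 | 0.0 | Viet Nam |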
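
The raw StatCounter columns do not yet match the variable names listed under Variables of Interest. A minimal sketch of one possible mapping follows. It assumes that the `OS X` and `macOS` columns should be summed into `mac_share` (an assumption, not something the source states), and `tidy_statcounter` is a hypothetical helper name. The sample frame reproduces the first two rows of the output above so that the block runs without network access.

```python
import pandas as pd


def tidy_statcounter(df: pd.DataFrame) -> pd.DataFrame:
    """Map raw StatCounter columns onto the variables of interest (hypothetical helper)."""
    return pd.DataFrame({
        # "YYYY-MM" strings become timestamps on the first day of the month.
        "date": pd.to_datetime(df["Date"], format="%Y-%m"),
        "region": df["Region"],
        "linux_share": df["Linux"],
        "win_share": df["Windows"],
        # Assumption: Apple desktops are split across both labels, so add them.
        "mac_share": df["OS X"] + df["macOS"],
    })


# First two rows of the Viet Nam output shown above.
sample = pd.DataFrame({
    "Date": ["2023-01", "2023-02"],
    "Windows": [55.07, 54.95],
    "Unknown": [34.96, 35.37],
    "OS X": [7.16, 6.74],
    "Linux": [2.74, 2.89],
    "macOS": [0.0, 0.0],
    "Chrome OS": [0.06, 0.05],
    "Other": [0.0, 0.0],
    "Region": ["Viet Nam", "Viet Nam"],
})

tidy = tidy_statcounter(sample)
print(tidy)
```

The same helper could be applied to `statcounter_ww` and `statcounter_vn` before concatenating them, since each frame already carries its own `Region` column.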


In [29]:
# CSV archive of the Steam Hardware Survey hosted on GitHub.
steam_hw_survey = pd.read_csv("https://raw.githubusercontent.com/jdegene/steamHWsurvey/refs/heads/master/shs_platform.csv")
# Dates are read as ISO-formatted strings, so plain string comparison orders them correctly.
steam_hw_survey = steam_hw_survey[(steam_hw_survey["date"] >= "2023-01-01") & (steam_hw_survey["date"] <= "2025-10-01")]

# Keep only the per-platform OS version breakdowns.
platform_versions = steam_hw_survey[steam_hw_survey["category"].isin(["Windows Version", "OSX Version", "Linux Version"])]

# Split by platform; Windows rows are selected with the "pc" label.
shs_windows = platform_versions[platform_versions["platform"] == "pc"]
shs_macos = platform_versions[platform_versions["platform"] == "mac"]
shs_linux = platform_versions[platform_versions["platform"] == "linux"]

shs_linux.head()

| | date | platform | category | name | change | percentage |
| --- | --- | --- | --- | --- | --- | --- |
| 217926 | 2023-03-01 | linux | Linux Version | "Arch Linux" 64 bit | 0.0019 | 0.1036 |
| 217927 | 2023-03-01 | linux | Linux Version | "Manjaro Linux" 64 bit | 0.0045 | 0.0695 |
| 217928 | 2023-03-01 | linux | Linux Version | "SteamOS Holo" 64 bit | 0.0015 | 0.212 |
| 217929 | 2023-03-01 | linux | Linux Version | Freedesktop.org SDK 22.08 (Flatpak runtime) 64... | -0.0017 | 0.071 |
| 217930 | 2023-03-01 | linux | Linux Version | Other | 0.0833 | 0.4479 |
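
The long-format rows above can be reshaped so that each Linux version becomes its own column, which is convenient when plotting per-distribution trends. The following is a rough sketch using `pandas.DataFrame.pivot_table`; `sample` stands in for `shs_linux` and holds only rows copied from the output above (the Freedesktop row is left out because its full name is truncated there). With a single survey date the result has one row; on the full data it would have one row per month.

```python
import pandas as pd

# Rows copied from the shs_linux output above (truncated row omitted).
sample = pd.DataFrame({
    "date": ["2023-03-01"] * 4,
    "platform": ["linux"] * 4,
    "category": ["Linux Version"] * 4,
    "name": [
        '"Arch Linux" 64 bit',
        '"Manjaro Linux" 64 bit',
        '"SteamOS Holo" 64 bit',
        "Other",
    ],
    "percentage": [0.1036, 0.0695, 0.212, 0.4479],
})

# Parse dates so the index sorts and plots as a real time axis.
sample["date"] = pd.to_datetime(sample["date"])

# One row per survey date, one column per Linux version.
wide = sample.pivot_table(
    index="date", columns="name", values="percentage", aggfunc="sum"
)
print(wide)
```

Whether `percentage` is a share of all Steam users or only of Linux users is not stated in this section, so the values should be checked against the dataset's documentation before they are used as `steam_linux_share`.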


In [43]:
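def get_distrowatch_phr():
    """Scrape the DistroWatch page-hit rankings, one table per time window."""
    url = "https://distrowatch.com/dwres.php?resource=popularity"
    frames = []

    for table in pd.read_html(url):
        cols = table.columns.astype(str)

        # Ranking tables are identified by a "Last ..." header.
        if any("Last" in col for col in cols):
            # The second word of the first header is the window length.
            month_range = cols[0].split(" ")[1]

            table.columns = ["ranking", "distro", "hits_per_day"]
            table["month_range"] = month_range
            frames.append(table)

    # Concatenate once at the end; return an empty frame if nothing matched.
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()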

distrowatch_phr = get_distrowatch_phr()
distrowatch_phr.head()

| | ranking | distro | hits_per_day | month_range |
| --- | --- | --- | --- | --- |
| 0 | 1 | CachyOS | 3220 | 12 |
| 1 | 2 | Mint | 2777 | 12 |
| 2 | 3 | MX Linux | 1914 | 12 |
| 3 | 4 | Debian | 1563 | 12 |
| 4 | 5 | EndeavourOS | 1558 | 12 |
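
Because `month_range` is extracted from a column header, it is stored as a string. The sketch below shows one way to turn it into a number and pull out the `distro_hits` variable for a single window. It assumes every `month_range` value is a numeric string (a non-numeric label would make `astype(int)` raise), and `sample` simply reproduces the five rows shown above.

```python
import pandas as pd

# The five rows shown in the distrowatch_phr output above.
sample = pd.DataFrame({
    "ranking": [1, 2, 3, 4, 5],
    "distro": ["CachyOS", "Mint", "MX Linux", "Debian", "EndeavourOS"],
    "hits_per_day": [3220, 2777, 1914, 1563, 1558],
    "month_range": ["12"] * 5,
})

# Assumption: every label is numeric, so it can be cast directly.
sample["month_range"] = sample["month_range"].astype(int)

# Select one time window and index the hit counts by distribution name.
window = sample[sample["month_range"] == 12]
distro_hits = window.set_index("distro")["hits_per_day"]

# Three most-visited distributions within that window.
top3 = distro_hits.nlargest(3)
print(top3)
```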


## Problem 2: Does having access to e-books anywhere affect the usage and sales of physical books?

### 1. Objective

The objective of this study is to examine how the growing availability and convenience of e-books affects the usage and sales of physical books, and in particular whether digital reading is reducing physical book purchases or merely expanding overall reading habits.

### 2. Variables of Interest

| Variable | Description | Type | Source |
| --- | --- | --- | --- |
| `year` | Year of record | Numeric | |
| `region` | Geographic region or country | Categorical | |
| `ebook_sales` | Revenue or sale units of e-books | Numeric | |
| `print_sales` | Revenue or sale units of physical books | Numeric | |
| `total_sales` | Total revenue or sale units | Numeric | |
| `ebook_share` | Proportion of e-book sales over total sales | Numeric | |
| `avg_ebook_price` | Average approximated e-book price | Numeric | |
| `avg_print_price` | Average approximated physical book price | Numeric | |
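
Two of the variables above, `total_sales` and `ebook_share`, can be derived from the others rather than collected. A minimal sketch of that derivation follows. It assumes `total_sales` is simply the sum of e-book and print sales (other formats such as audiobooks are ignored), and the figures in `books` are placeholder values for illustration only, not real sales data.

```python
import pandas as pd

# Placeholder figures for illustration only; these are NOT real sales data.
books = pd.DataFrame({
    "year": [2020, 2021],
    "region": ["Example", "Example"],
    "ebook_sales": [20.0, 30.0],
    "print_sales": [80.0, 70.0],
})

# Assumption: the total covers only the two formats under study.
books["total_sales"] = books["ebook_sales"] + books["print_sales"]

# Proportion of e-book sales over total sales, as defined in the table.
books["ebook_share"] = books["ebook_sales"] / books["total_sales"]
print(books)
```

Both sales columns must use the same unit (revenue or units sold) for the resulting share to be meaningful.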

| | ASIN | GROUP | FORMAT | TITLE | AUTHOR | PUBLISHER |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1250150183 | book | hardcover | The Swamp: Washington's Murky Pool of Corrupti... | Eric Bolling | St. Martin's Press |
| 1 | 778319997 | book | hardcover | Rise and Shine, Benedict Stone: A Novel | Phaedra Patrick | Park Row Books |
| 2 | 1608322564 | book | hardcover | Sell or Be Sold: How to Get Your Way in Busine... | Grant Cardone | Greenleaf Book Group Press |
| 3 | 310325331 | book | hardcover | Christian Apologetics: An Anthology of Primary... | Khaldoun A. Sweis, Chad V. Meister | Zondervan |
| 4 | 312616295 | book | hardcover | Gravity: How the Weakest Force in the Universe... | Brian Clegg | St. Martin's Press |
