INSTRUCTIONS

Debian uses .deb packages to deploy and upgrade software. The packages are stored in repositories, and each repository contains a so-called "Contents index". The format of that file is described here: https://wiki.debian.org/RepositoryFormat#A.22Contents.22_indices

Your task is to develop a Python command-line tool that takes the architecture (amd64, arm64, mips, etc.) as an argument and downloads the compressed Contents file associated with it from a Debian mirror. The program should parse the file and output the statistics of the top 10 packages that have the most files associated with them. An example output could be:

./package_statistics.py amd64

<package name 1>         <number of files>
<package name 2>         <number of files>
......
<package name 10>         <number of files>

You can use the following Debian mirror: http://ftp.uk.debian.org/debian/dists/stable/main/. 

Please try to follow modern Python best practices in your solution (write your solution at the kind of standard you would yourself like to maintain and see from your colleagues). Hint: there are tools that can help you verify your code is compliant. In-line comments are appreciated.

Please do your work in a local Git repository. Your repo should contain a README that explains your thought process and approach to the problem, and roughly how much time you spent on the exercise. When you are finished, create a tar.gz of your repo and submit it to the link included in this email. Please do not make the repository publicly available.

Note: We are interested not only in quality code, but also in seeing your approach to the problem and how you organise your work.
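
As a rough illustration of the command-line interface the task describes, here is a minimal sketch using the standard `argparse` module. The function name `parse_args` and the `--top` option are assumptions made for this example; the task itself only requires the architecture argument.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argv=None means 'use sys.argv[1:]'."""
    parser = argparse.ArgumentParser(
        description="Show the packages with the most files for an architecture."
    )
    # The one argument the task asks for.
    parser.add_argument("arch", help="Debian architecture, e.g. amd64, arm64, mips")
    # Hypothetical convenience option, not required by the task.
    parser.add_argument(
        "-n", "--top", type=int, default=10, help="number of packages to list"
    )
    return parser.parse_args(argv)


# Passing an explicit list makes the parser easy to try out and to test.
args = parse_args(["amd64"])
print(args.arch, args.top)
```

Accepting `argv` as a parameter keeps the parser testable without touching `sys.argv`.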

In [52]:
filenames

0                                         bin/abpoa
1                                     bin/abpoa.avx
2                                    bin/abpoa.avx2
3                                 bin/abpoa.generic
4                                    bin/abpoa.sse3
                             ...                   
1641380        var/spool/hylafax/config/zyxel-1496e
1641381      var/spool/hylafax/config/zyxel-1496e-1
1641382    var/spool/hylafax/config/zyxel-1496e-2.0
1641383         var/spool/hylafax/config/zyxel-2864
1641384                           var/yp/securenets
Length: 1641385, dtype: object

In [58]:
# Standard library
import gzip
import io

# Third party
import requests
from pandas import Index, Series

URL_BASE = "http://ftp.uk.debian.org/debian/dists/stable/main/Contents-{arc}.gz"

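def get_content_from_arc(arc: str, save_txt: bool = False) -> str:
    """Download and decompress the Contents index for architecture `arc`."""
    url = URL_BASE.format(arc=arc)
    # The whole body is read below, so streaming the response gains nothing.
    resp = requests.get(url, timeout=30)
    # Fail early on HTTP errors (e.g. 404 for an unknown architecture)
    # instead of trying to gunzip an error page.
    resp.raise_for_status()
    with gzip.GzipFile(fileobj=io.BytesIO(resp.content)) as file_gz:
        # Bytes that are not valid UTF-8 are silently dropped.
        content = file_gz.read().decode("utf-8", errors="ignore")

    # Optionally keep a plain-text copy of the index on disk.
    if save_txt and content:
        filename = f"Contents-{arc}.txt"
        with open(filename, "w", encoding="utf-8") as file_txt:
            file_txt.write(content)

    return content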

content = get_content_from_arc("amd64")
lines = Series(content.split("\n"))
lines = lines.loc[lines != ""]
# Each line is "<file path> <package list>". The path may contain spaces,
# so split only once, from the right, on whitespace.
parts = lines.str.rsplit(n=1)

files = Index(parts.str[0], name="filename")
# The last column is a comma-separated list of qualified package names.
packages = parts.str[-1].str.split(",")

packages = Series(packages.values, index=files, name="package")

# One row per (filename, package) pair, then collect the files of each package.
packages = packages.explode().reset_index()
packages = packages.groupby("package")["filename"].apply(list)
packages

package
admin/0install                     [usr/lib/0install.net/gui_gtk.cmxs, usr/share/...
admin/0install-core                [usr/bin/0alias, usr/bin/0desktop, usr/bin/0in...
admin/9mount                       [usr/bin/9bind, usr/bin/9mount, usr/bin/9umoun...
admin/abootimg                     [usr/bin/abootimg, usr/bin/abootimg-pack-initr...
admin/accountsservice              [lib/systemd/system/accounts-daemon.service, u...
                                                         ...                        
zope/python3-zope.hookable         [usr/lib/python3/dist-packages/zope.hookable-5...
zope/python3-zope.i18nmessageid    [usr/lib/python3/dist-packages/zope.i18nmessag...
zope/python3-zope.interface        [usr/lib/python3/dist-packages/zope.interface-...
zope/python3-zope.proxy            [usr/include/python3.11m/zope.proxy/proxy.h, u...
zope/python3-zope.security         [usr/lib/python3/dist-packages/zope.security-5...
Name: filename, Length: 32250, dtype: object
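
The grouping above collects the file list of every package, but the task only asks for the ten packages with the most files. A minimal sketch of that last step, using only the standard library, might look like the following. The function name `top_packages` and the sample lines are made up for illustration; real input would be the lines of the downloaded Contents file.

```python
from collections import Counter


def top_packages(lines, n=10):
    """Return the n packages owning the most files as (name, count) pairs."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # The package list is the last whitespace-separated column;
        # the file path before it may itself contain spaces.
        parts = line.rsplit(maxsplit=1)
        if len(parts) != 2:
            continue  # not a "<path> <packages>" line
        for package in parts[1].split(","):
            counts[package] += 1
    return counts.most_common(n)


# Made-up sample lines in the Contents layout, for illustration only.
sample = [
    "bin/tool-a                 utils/pkg-a",
    "usr/share/doc/shared file  utils/pkg-a,admin/pkg-b",
    "usr/bin/tool-b             admin/pkg-b",
    "usr/lib/libexample.so      utils/pkg-a",
]
for name, count in top_packages(sample):
    print(f"{name:<40}{count}")
```

Counting directly avoids holding a list of file names per package in memory. On the `packages` Series built above, `packages.str.len().nlargest(10)` should give the same ranking, assuming each value is a list of file names.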