# Questions per Linux Distribution
This notebook shows how many questions carry the tag for each Linux distribution on serverfault.com.

# Step 1: get a list of Linux distributions
We will use the Wikipedia page for the [list of Linux distributions](https://en.wikipedia.org/wiki/List_of_Linux_distributions) as our primary source.

In [1]:
WIKIPEDIA_LIST_OF_LINUX_DISTRIBUTIONS = 'https://en.wikipedia.org/wiki/List_of_Linux_distributions'

import requests
from bs4 import BeautifulSoup

with requests.get(WIKIPEDIA_LIST_OF_LINUX_DISTRIBUTIONS) as r:
    distro_list_soup = BeautifulSoup(r.text, 'html.parser')

Next, we scrape the names of the Linux distributions.

In [2]:
import re

table_rows = distro_list_soup.find_all('tr')

linux_distros = []

for tr in table_rows:
    # the distribution name is the first link in the first cell of each row
    td = tr.find('td')
    if td is not None:
        a = td.find('a')
        if a is not None and a.text != '':
            linux_distros.append(a.text)

# normalize each name to serverfault's tag style: trimmed, hyphenated, lowercase
linux_distros = [name.strip().replace(' ', '-').lower() for name in linux_distros]

# manually replace names that serverfault tags differently
# (guarded so a change to the Wikipedia list does not raise ValueError)
if 'red-hat-linux' in linux_distros:
    linux_distros[linux_distros.index('red-hat-linux')] = 'redhat'
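
The normalization steps above can be tried without any network access. The following is a minimal sketch, not the notebook's own code: the helper name `to_serverfault_tag` and the `TAG_OVERRIDES` dictionary are hypothetical, and the dictionary holds only the single manual fix used above.

```python
# Hypothetical helper that mirrors the strip / hyphenate / lowercase steps above.
TAG_OVERRIDES = {'red-hat-linux': 'redhat'}  # the one manual fix from the notebook


def to_serverfault_tag(name, overrides=TAG_OVERRIDES):
    """Turn a distribution name from Wikipedia into a candidate serverfault tag."""
    tag = name.strip().replace(' ', '-').lower()
    # dict.get falls back to the normalized tag when no override exists,
    # so an absent name never raises the way list.index would
    return overrides.get(tag, tag)


names = ['Red Hat Linux', ' CentOS ', 'Scientific Linux']
tags = [to_serverfault_tag(n) for n in names]
print(tags)  # ['redhat', 'centos', 'scientific-linux']
```

Keeping the exceptions in a dictionary would make it easy to add further tags that serverfault spells differently, should more turn up.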

# Step 2: query serverfault for the number of questions having each tag

In [3]:
import time

SERVERFAULT_TAG_SEARCH_URL_BASE = 'https://serverfault.com/questions/tagged/'

# simple class to hold data
class Distro:
    tag = ''
    num_questions = 0
    description = ''
    
    def __init__(self, tag, num_questions, description):
        self.tag = tag
        self.num_questions = num_questions
        self.description = description.strip()
        
    def __str__(self):
        return '{:20} {:10d} {:60.60}...'.format(self.tag, self.num_questions, self.description)
        
# regular expression for extracting the question count, including thousands separators
re_num_questions = re.compile(r'\d[\d,]*')

linux_distro_stats = []

for ld in linux_distros:
    # Stack Exchange throttles any IP address that sends more than 30 requests per second.
    # We therefore wait 100 milliseconds between requests to stay well within this limit.
    
    time.sleep(0.1)
    
    with requests.get(SERVERFAULT_TAG_SEARCH_URL_BASE + ld) as r:
        if r.ok:
            ### create soup from webpage
            serverfault_soup = BeautifulSoup(r.text, 'html.parser')

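            ### find div with number of questions
            questions_text = serverfault_soup.find('div', {'class': 'fs-body3 grid--cell fl1 mr12 sm:mr0 sm:mb12'})
            if questions_text is None:
                # the class names above depend on serverfault's page layout
                print('Could not find the question count for tag \'', ld, '\'', sep='')
                continue
            num_questions_match = re_num_questions.search(questions_text.text)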
            num_questions_string = num_questions_match.group(0).replace(',', '')
            num_questions = int(num_questions_string)
            #print(num_questions)

            ### find description
            description_container = serverfault_soup.find_all('div', {'class': 'mb24'})
            # We assume the description is the *second* div with class mb24.
            # THIS MAY CHANGE without notice; we are subject to the whims of serverfault.
            description_container = description_container[1]
            description = description_container.find('p').text
            #print(description)

            ### put it all into a Distro object that gets added to our list
            linux_distro_stats.append(Distro(ld, num_questions, description))
            print('Successfully stored info for questions on tag \'', ld, '\'', sep='')
        else:
            print('Got status code ', r.status_code, ' while searching for questions with tag \'',
                  ld, '\'', sep='')

Successfully stored info for questions on tag 'redhat'
Successfully stored info for questions on tag 'centos'
Successfully stored info for questions on tag 'fedora'
Successfully stored info for questions on tag 'opensuse'
Got status code 404 while searching for questions with tag 'mandrake-linux'
Got status code 404 while searching for questions with tag 'asianux'
Successfully stored info for questions on tag 'clearos'
Got status code 404 while searching for questions with tag 'fermi-linux-lts'
Got status code 404 while searching for questions with tag 'miracle-linux'
Successfully stored info for questions on tag 'oracle-linux'
Got status code 404 while searching for questions with tag 'red-flag-linux'
Got status code 404 while searching for questions with tag 'rocks-cluster-distribution'
Successfully stored info for questions on tag 'scientific-linux'
Successfully stored info for questions on tag 'amazon-linux-2'
Got status code 404 while searching for questions with tag 'berry-linux'

Got status code 404 while searching for questions with tag 'deepin'
Successfully stored info for questions on tag 'devuan'
Got status code 404 while searching for questions with tag 'dreamlinux'
Got status code 404 while searching for questions with tag 'emdebian-grip'
Got status code 404 while searching for questions with tag 'finnix'
Got status code 404 while searching for questions with tag 'gnewsense'
Got status code 404 while searching for questions with tag 'grml'
Got status code 404 while searching for questions with tag 'handylinux'
Got status code 404 while searching for questions with tag 'kanotix'
Successfully stored info for questions on tag 'knoppix'
Got status code 404 while searching for questions with tag 'kurumin'
Got status code 404 while searching for questions with tag 'leaf-project'
Got status code 404 while searching for questions with tag 'limux'
Successfully stored info for questions on tag 'maemo'
Got status code 404 while searching for questions with tag 'mepi

Got status code 404 while searching for questions with tag 'ps2-linux'
Got status code 404 while searching for questions with tag 'puppy-linux'
Got status code 404 while searching for questions with tag 'replicant'
Got status code 404 while searching for questions with tag 'rpath'
Got status code 404 while searching for questions with tag 'sailfish-os'
Got status code 404 while searching for questions with tag 'slitaz'
Got status code 404 while searching for questions with tag 'smallfoot'
Successfully stored info for questions on tag 'smoothwall'
Got status code 404 while searching for questions with tag 'softlanding-linux-system'
Got status code 404 while searching for questions with tag 'solus'
Got status code 404 while searching for questions with tag 'source-mage'
Got status code 429 while searching for questions with tag 'thinstation'
Got status code 429 while searching for questions with tag 'tinfoil-hat-linux'
Got status code 429 while searching for questions with tag 'tiny-core
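
The 429 responses at the end of the output above are HTTP "Too Many Requests" errors, so the fixed 100 millisecond pause was evidently not enough for the whole run. One possible remedy is to retry with exponential backoff. The sketch below is an assumption, not part of the original notebook: the function `fetch_with_backoff` and its parameters are hypothetical, and the demo uses a fake response object so that it runs offline.

```python
import time


def fetch_with_backoff(fetch, url, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff while the status is 429.

    fetch is any callable returning an object with a status_code attribute,
    for example requests.get. All names here are illustrative.
    """
    response = fetch(url)
    for attempt in range(max_retries):
        if response.status_code != 429:
            break
        # wait base_delay, then twice that, then four times that, and so on
        sleep(base_delay * 2 ** attempt)
        response = fetch(url)
    # may still be a 429 response if every retry was throttled
    return response


class FakeResponse:
    """Stand-in for a requests.Response, used only for this offline demo."""

    def __init__(self, status_code):
        self.status_code = status_code


statuses = iter([429, 429, 200])  # two throttled attempts, then success
delays = []                       # records the waits instead of sleeping
result = fetch_with_backoff(
    lambda url: FakeResponse(next(statuses)),
    'https://serverfault.com/questions/tagged/ubuntu',
    sleep=delays.append,
)
print(result.status_code, delays)  # 200 [1.0, 2.0]
```

In the loop above, a call such as `fetch_with_backoff(requests.get, SERVERFAULT_TAG_SEARCH_URL_BASE + ld)` could stand in for the plain `requests.get` call. The starting delay and retry count are guesses and would need tuning against the limits serverfault actually enforces.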

# Step 3: Store data for later use
In case Stack Exchange implements more aggressive rate limiting or dramatically changes its website, it will be handy to have this data saved to disk. Below we will use pickle to serialize and store the data we've retrieved.
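
As an alternative to pickle, the same data could be written to a CSV file, an option the notebook itself raises in a code comment. The following is a minimal sketch under stated assumptions: the function names, the `FIELDS` list, and the file name `linux_distro_stats.csv` are hypothetical, and the rows are plain dictionaries rather than `Distro` objects. The sample values are taken from the output shown in this notebook, with descriptions truncated as they appear there.

```python
import csv
import os
import tempfile

FIELDS = ['tag', 'num_questions', 'description']  # assumed column names


def save_stats_csv(rows, path):
    """Write a list of dicts with the keys in FIELDS to a CSV file."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def load_stats_csv(path):
    """Read the CSV back, restoring num_questions to an int."""
    with open(path, newline='', encoding='utf-8') as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        row['num_questions'] = int(row['num_questions'])
    return rows


sample = [
    {'tag': 'redhat', 'num_questions': 2997,
     'description': 'Red Hat is an open source technology solutions provider with'},
    {'tag': 'centos', 'num_questions': 10113,
     'description': 'CentOS is a free (as in beer and speech) GNU/Linux distribut'},
]

# a temporary directory keeps the demo from touching the real data directory
path = os.path.join(tempfile.mkdtemp(), 'linux_distro_stats.csv')
save_stats_csv(sample, path)
restored = load_stats_csv(path)
print(restored == sample)  # True
```

A CSV file stays readable in a text editor and loads directly into tools such as pandas, whereas a pickle file can only be read back by Python code that has the `Distro` class available.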

In [6]:
DATA_DIRECTORY = 'data'

import pickle

for ld in linux_distro_stats:
    print(str(ld))

#with open(DATA_DIRECTORY + '/linux_distro_stats' +)
#pickle.dump(linux_distro_stats)

# wouldn't it just make more sense to store this as a CSV? then we can use 
# the powerful tools offered by pandas and such

redhat           2997 Red Hat is an open source technology solutions provider with...
centos          10113 CentOS is a free (as in beer and speech) GNU/Linux distribut...
fedora           1065 Fedora is a fast, stable, powerful, free RPM-based GNU/Linux...
opensuse          332 OpenSuSE (formerly known as SuSE Professional) is the "free"...
clearos            30 ClearOS (formerly named ClarkConnect) is a Linux distributio...
oracle-linux         40 The oracle-linux tag has no usage guidance.                 ...
scientific-linux         51 Scientific Linux ("SL") is an Enterprise Linux distro based ...
amazon-linux-2         26 The amazon-linux-2 tag has no usage guidance.               ...
ubuntu          14644 Ubuntu Linux is a Debian derivative that aims to bring Linux...
kubuntu            21 Kubuntu is a popular Linux distribution based on Ubuntu and ...
ubuntu-server      14644 Ubuntu Linux is a Debian derivative that aims to bring Linux...
xubuntu            21 Xubuntu is a Linu