# Week 4. Day 1. Exercises from Chapter 10 of FSStDS. 
## Fundamentals of Social Data Science. MT 2022

Within your study pod discuss the following questions. Please submit an individual assignment by 12:30pm Tuesday, November 1st, 2022 on Canvas. 

# Refactoring code 

Chapter 10 gave the example of the Movie Stack Exchange as a file with posts that could be cleaned and imported into Python. The steps taken are sequential and result in a final DataFrame which was pickled. 

Below, we want you to proceed in steps, refactoring or rewriting that code until we end up with a script that can take the zipped archive and produce the cleaned, pickled DataFrame.

# Step 1. 
**Be able to get from the 7z file to the preferred XML file**

In the first step (for which I've provided starter code), you should be able to open a downloaded 7z file representing the archive and extract it to a folder under the data directory.

## Challenge 1. 
Can you do this with data from the web instead of downloading it first?  
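
One possible approach to Challenge 1 is to keep the downloaded bytes in memory and hand them to `py7zr` as a file-like object, so the archive itself is never written to disk. The following is a minimal sketch, assuming the archive fits in RAM and that `py7zr` and `requests` are installed; `archive_url` and `extract_from_web` are hypothetical names, and the URL pattern is the one used in the starter code in this notebook.

```python
import io
from pathlib import Path


def archive_url(stack):
    """Build the Internet Archive URL (same pattern as the starter code)."""
    return f"https://archive.org/download/stackexchange/{stack}.stackexchange.com.7z"


def extract_from_web(stack, data_dir="../data"):
    """Download a 7z archive into memory and extract it without saving the archive."""
    # Imported here so the URL helper works even without these packages installed
    import py7zr
    import requests

    res = requests.get(archive_url(stack), timeout=60)
    res.raise_for_status()  # raise on 404 and other HTTP errors

    target = Path(data_dir) / stack
    target.mkdir(parents=True, exist_ok=True)

    # SevenZipFile accepts a binary file-like object, so wrap the bytes in BytesIO
    with py7zr.SevenZipFile(io.BytesIO(res.content), mode="r") as archive:
        archive.extractall(path=target)
    return target
```

For large exchanges the in-memory approach may be impractical, in which case streaming to a temporary file is likely the safer choice.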

In [1]:
# Answer step 1 below here

# Note you will likely need to install py7zr through pip 
# or use an alternate approach to unpacking such as 
# pyunpack or libarchive (both of which I found fussy)

In [2]:
from pathlib import Path
import py7zr
import requests as req
import os
import pandas as pd
from lxml.etree import XMLParser, parse
import xmltodict as xd
import bs4
import shutil
import numpy as np

# Allow lxml to parse very large XML files
xmlparser = XMLParser(huge_tree=True)

In [3]:
def extract_stack(stack):

    endpoint = f"https://archive.org/download/stackexchange/{stack}.stackexchange.com.7z"
    res = req.get(endpoint, stream=True)
    path = f"../data/{stack}"
    filename = path + f"/{stack}.7z"

    if res.status_code == 200:

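        # Create the folder, including ../data itself, if it is missing
        Path(path).mkdir(parents=True, exist_ok=True)

        # Write the archive in chunks so stream=True is actually used
        with open(filename, 'wb') as out:
            for chunk in res.iter_content(chunk_size=1024 * 1024):
                out.write(chunk)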

        with py7zr.SevenZipFile(filename, mode='r') as z:
            z.extractall(path)

        os.remove(filename)

    else:
        print(f"Request Failed: {res.status_code}")


# Step 2. 

Refactor the code from Chapter 10 of the book into a function that works for `Posts.xml`. That function should take in the base data and then:
1. Convert the `int` variables (except where they start or end in `Id`) into integer values. 
2. Convert the `datetime` variables. 
3. Convert tags data from `str` into a `list`.
4. Create a separate column for `CleanBody` which is the `Body` without HTML. 
5. Extract the URLs from the HTML into a column of lists called `ListURL`.
6. Pickle that DataFrame with a coherent name, such as `f"{stack}_Posts_cleaned.pkl"`.

Notes: 
> I say 'in a function', but you might want to have a main function and then helper functions for subprocesses. 
> You can make this more abstract, but that's coming anyway. Read the exercises below, and then think about this plan.
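
Items 3 to 5 above can each become a small helper that works on a single value. The sketch below uses only the standard library as a stand-in for BeautifulSoup, so it is not the book's method. It assumes tags are stored in the `<tag1><tag2>` form and reads `ListURL` as the list of link targets in the body; `tags_to_list` and `clean_body` are hypothetical names.

```python
import re
from html.parser import HTMLParser


def tags_to_list(tags):
    """Turn '<comedy><plot>' into ['comedy', 'plot']; missing tags give []."""
    if not tags:
        return []
    return re.findall(r"<([^<>]+)>", tags)


class _BodyParser(HTMLParser):
    """Collect visible text and link targets from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Record the href of every anchor tag that has one
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)

    def handle_data(self, data):
        self.text_parts.append(data)


def clean_body(body):
    """Return (CleanBody, ListURL) for one post body."""
    parser = _BodyParser()
    parser.feed(body or "")
    parser.close()
    # Collapse newlines and repeated spaces into single spaces
    text = " ".join("".join(parser.text_parts).split())
    return text, parser.urls
```

Helpers like these can be applied column-wise, for example with `Series.apply`, once the XML rows are in a DataFrame.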

In [4]:
# Answer Step 2 below here 

parse_dict = {
    "Id": "string",
    "CreationDate": "datetime",
    "Body": "text",
    "Title": "string",
    "DisplayName": "string",
    "AccountId": "string"
}

def clean_row(row):

    cleaned_row = {}

    for key, val in row.items():
        
        if key in parse_dict.keys():
            var_type = parse_dict[key]
            if var_type == "string": cleaned_row[key] = val
            elif var_type == "datetime": cleaned_row[key] = pd.to_datetime(val)
            elif var_type == "text": cleaned_row[key] = bs4.BeautifulSoup(val, "lxml").text.replace("\n"," ")

    return cleaned_row

def xml_to_df(tree):

    # Iterate over the root directly; getchildren() is deprecated in lxml
    row_list = []
    for row in tree.getroot():
        cleaned = clean_row(dict(row.attrib))
        row_list.append(cleaned)
    return pd.DataFrame(row_list)


def extract_data(stack):

    path = f"../data/{stack}"
    
    post_tree = parse(path + '/Posts.xml', parser=xmlparser)
    post_df = xml_to_df(post_tree)

    user_tree = parse(path + '/Users.xml', parser=xmlparser)
    user_df = xml_to_df(user_tree)

    shutil.rmtree(path)
    Path(path).mkdir()

    post_df.to_pickle(f"../data/{stack}/post_df.pkl")
    user_df.to_pickle(f"../data/{stack}/user_df.pkl")


# Step 3. 

Parameterise the function. Depending on how you created the function above, you might have hard coded the names of the columns from the `Posts.xml` data. This time, make a parameter for the specific schema that you are going to use to convert the data. The schema should be stored as JSON and have the type of XML file as a key, with the value being a dictionary for the column names and the conversion, such as the following: 

~~~ Python
{"Posts": {"Id":None,
           "PostTypeId":None,...,
           "Tags":["str","list"],
           "AnswerCount":"int",
           "Body":["str","CleanHTML","listHTML"]},
 "Users": {"Id":None,
           "Reputation":"int", ...}}
~~~

So, now this time, the main function should read in the Schema from file, select the right table type (such as "Posts") and then return (or export to pickle) a DataFrame with the same naming conventions as above.   

Full schema available here: https://i.stack.imgur.com/AyIkW.png

> Note that this schema or question does not make assumptions about what you will do with None values, but you are encouraged to consider whether to convert to `None`, `""`, `pd.NA`, or `np.nan`, depending on context. 
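
To illustrate the idea of a schema-driven conversion, here is a minimal sketch that works on one row at a time. It assumes each column maps to a single converter name or to `null`; list-valued entries such as the one for `Tags` would need their own converters. `CONVERTERS`, `load_schema`, and `convert_row` are hypothetical names, and for brevity the schema is passed as JSON text instead of being read from a file.

```python
import json
from datetime import datetime

# Hypothetical converter names; the schema decides which one each column gets
CONVERTERS = {
    "int": int,
    "datetime": datetime.fromisoformat,
    "str": str,
}


def load_schema(schema_json, table):
    """Parse the JSON schema text and return the column mapping for one table."""
    return json.loads(schema_json)[table]


def convert_row(row, columns):
    """Keep only the schema's columns and apply the named conversion to each value."""
    cleaned = {}
    for name, kind in columns.items():
        value = row.get(name)
        if value is None or kind is None:
            cleaned[name] = value  # leave Ids and missing values untouched
        else:
            cleaned[name] = CONVERTERS[kind](value)
    return cleaned


schema_text = '{"Posts": {"Id": null, "AnswerCount": "int", "CreationDate": "datetime"}}'
columns = load_schema(schema_text, "Posts")
row = {"Id": "1", "AnswerCount": "3", "CreationDate": "2011-11-30T19:15:54.070", "Score": "5"}
cleaned = convert_row(row, columns)
```

Columns that are absent from the schema (here `Score`) are dropped, and missing values stay as `None`; whether that is the right default depends on the choice discussed in the note above.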

### Answer Step 3 below here 

I did this in the original

# Step 4. A wide parameter. 

Create a means (either with a parameter in the original file or using a separate file) in order to create a long table from one of the wide tables. That is, it should take the column that is used for the long data (which we assume would be a list of values) and then `explode()` the table. By default, it should only explode the selected column. You should be able to pass it a list of additional columns that will also appear in the exploded data.  
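
As a rough illustration of this step, the sketch below wraps `DataFrame.explode()` in a function that keeps only the exploded column unless extra columns are requested. The name `to_long`, its parameters, and the toy `posts` table are assumptions made for the example, as is the choice to drop rows whose list was empty.

```python
import pandas as pd


def to_long(df, explode_col, keep_cols=None):
    """Return a long table with one row per element of `explode_col`."""
    # By default only the exploded column is kept; extra columns are opt-in
    columns = list(keep_cols or []) + [explode_col]
    long_df = df[columns].explode(explode_col)
    # Rows whose list was empty explode to NaN, so drop them
    return long_df.dropna(subset=[explode_col]).reset_index(drop=True)


# Toy data standing in for a cleaned Posts table
posts = pd.DataFrame({
    "Id": ["1", "2", "3"],
    "Tags": [["comedy", "plot"], ["horror"], []],
    "Score": [5, 2, 0],
})
long_tags = to_long(posts, "Tags", keep_cols=["Id"])
```

Keeping `Id` alongside the exploded column makes it possible to merge the long table back onto the wide one later.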

In [5]:
# Answer Step 4 below here



# Challenge #2. 
_(example code will not be provided)_

Recall that we downloaded the data from the Internet Archive. That main URL has a list of all of the Stack Exchanges and their sizes. 

Can we use this data in order to present a list of Stack Exchanges and then have the user select which one to first download instead of linking directly? 

Explore this as well as some packages for providing progress bars for the download process. Package all the code up in a script that allows the user to select which Exchange and then receive the resulting preferred tables as .pkl.

There are 194 stacks
['3dprinting' 'Sites' '__ia_thumb' 'academia' 'ai']
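
A listing like the one above mixes real exchanges with other files in the archive (for example `Sites` and `__ia_thumb`). One way to narrow it down is to keep only file names that match the `{stack}.stackexchange.com.7z` pattern used by `extract_stack`. The sketch below assumes the Internet Archive metadata endpoint returns JSON with a `files` list whose entries carry `name` and `size`; the function names are hypothetical, and progress bars are left out.

```python
import json
from urllib.request import urlopen

SUFFIX = ".stackexchange.com.7z"
META_SUFFIX = ".meta.stackexchange.com.7z"


def list_stacks(metadata):
    """Return {stack_name: size_in_bytes} for archives matching the URL pattern."""
    stacks = {}
    for entry in metadata.get("files", []):
        name = entry.get("name", "")
        # Skip meta sites and anything that is not a main-site archive
        if name.endswith(SUFFIX) and not name.endswith(META_SUFFIX):
            stacks[name[: -len(SUFFIX)]] = int(entry.get("size", 0))
    return stacks


def fetch_metadata(item="stackexchange"):
    """Fetch the item's file listing (assumed endpoint: archive.org/metadata/<item>)."""
    with urlopen(f"https://archive.org/metadata/{item}", timeout=60) as res:
        return json.load(res)


def choose_stack(stacks, answer):
    """Validate the user's choice; return the stack name or None."""
    answer = answer.strip().lower()
    return answer if answer in stacks else None
```

A script could print the result of `list_stacks(fetch_metadata())`, pass the user's `input()` to `choose_stack`, and only then call the download and cleaning functions. Sites whose archives are named differently would need their own URL pattern.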


In [8]:
# Try Challenge 2 below here 

stack_list = [
    'movies',
    'philosophy',
    'pets'
]

def get_stack_data(stacks):

    for stack in stacks:
        try:
            extract_stack(stack)
            extract_data(stack)
        except Exception:
            print(f'something failed for {stack}')

get_stack_data(['math'])  



Request Failed: 404
something failed for Sites
Request Failed: 404
something failed for __ia_thumb
Request Failed: 404
something failed for askubuntu
something failed for codegolf
something failed for codereview
something failed for drupal
Request Failed: 404
something failed for es
Request Failed: 404
something failed for ja
Request Failed: 404
something failed for license
Request Failed: 404
something failed for logo-stackoverflow
Request Failed: 404
something failed for logo-stackoverflow_thumb
something failed for math
Request Failed: 404
something failed for mathoverflow
Request Failed: 404
something failed for pt
Request Failed: 404
something failed for readme
Request Failed: 404
something failed for ru
Request Failed: 404
something failed for se-icon
Request Failed: 404
something failed for se-logo
Request Failed: 404
something failed for serverfault
Request Failed: 404
something failed for stackapps
Request Failed: 404
something failed for stackexchange_archive
Request Failed: 404
something failed for stackexchange_files
Request Failed: 404
something failed for stackexchange_meta
Request Failed: 404
something failed for stackexchange_reviews
Request Failed: 404
something failed for stackoverflow
Request Failed: 404
something failed for superuser
Request Failed: 404
something failed for {0}