# Where we left off
In Part 01 we registered for an API key and wrote a custom class for fetching questions from Stack Overflow. Now we're going to pull data and begin processing it.

# Fetching questions
We'll start by importing our code and fetching questions.

In [6]:
from is_question.api import StackOverflow


so = StackOverflow()
response = so.fetch_questions()
items = response.get("items")
zeroth_question = items[0]
print(zeroth_question)

{'tags': ['python', 'makefile', 'visual-studio-2019', 'setuptools', 'nmake'], 'owner': {'reputation': 9718, 'user_id': 1624552, 'user_type': 'registered', 'accept_rate': 60, 'profile_image': 'https://i.stack.imgur.com/FH8SK.jpg?s=256&g=1', 'display_name': 'Willy', 'link': 'https://stackoverflow.com/users/1624552/willy'}, 'is_answered': False, 'view_count': 2, 'answer_count': 0, 'score': 0, 'last_activity_date': 1688391269, 'creation_date': 1688391269, 'question_id': 76605250, 'content_license': 'CC BY-SA 4.0', 'link': 'https://stackoverflow.com/questions/76605250/pass-argument-to-py-setup-py-bdist-wheel-and-then-read-it-from-setup-py', 'title': 'pass argument to py setup.py bdist_wheel and then read it from setup.py'}


Let's walk through the keys in our question item:

In [7]:
print(*zeroth_question.keys(), sep="\n")

tags
owner
is_answered
view_count
answer_count
score
last_activity_date
creation_date
question_id
content_license
link
title


- `tags` is a list of tags a question is tagged with
- `owner` is a dictionary of question-author attributes
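- `is_answered` is a boolean stating whether the question counts as answered (as far as I can tell from the API docs, that means it has an accepted answer or at least one upvoted answer)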
- `view_count` tells how many views the question has
- `answer_count` tells how many answers have been posted to the question
- `score` is the net score (upvotes minus downvotes)
- `last_activity_date` is the time of the most recent activity on the question, as a Unix timestamp (seconds since the epoch)
- `creation_date` is the original publish date, also as a Unix timestamp
- `question_id` is the unique identifier for the question
- `content_license` is the license the question's content is published under
- `link` is a link to the question
- `title` is the question title
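
The two date fields are integers rather than readable dates because the Stack Exchange API reports dates as Unix timestamps. Here is a minimal sketch of converting one with the standard library, using the `creation_date` value from the output above. The helper name `to_utc` is my own and is not part of our package.

```python
from datetime import datetime, timezone


def to_utc(epoch_seconds: int) -> datetime:
    """Convert a Unix timestamp (in seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# creation_date copied from the zeroth question printed earlier
created = to_utc(1688391269)
print(created.isoformat())
```

Passing `tz=timezone.utc` keeps the result independent of the machine's local time zone.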

You may have noticed that we're missing the body of the question. We can get it by adjusting the `filter` argument of our `fetch_questions` method. As an example, I'll limit the request to our `zeroth_question` and return its body.

In [9]:
qid = zeroth_question.get("question_id")
zeroth_response = so.fetch_questions(question_ids=[qid], filter="withbody")
zeroth_items = zeroth_response.get("items")
zeroth_question = zeroth_items[0]
print(*zeroth_question.keys(), sep="\n")

tags
owner
is_answered
view_count
answer_count
score
last_activity_date
creation_date
last_edit_date
question_id
content_license
link
title
body


Now we have a `body`. What does it look like?

In [10]:
zeroth_body = zeroth_question.get("body")
print(zeroth_body)

<p>In Visual Studio 2019, I have a makefile project in VC++ and using nmake.</p>
<p>In the make.mak file I have below (only showing here a piece):</p>
<pre><code># release build
release: clean
    CD..
    PY -3.8-32 setup.py build_ext--OutputDirectory=$(OutDirectory)
    PY -3.8-32 setup.py bdist_wheel
</code></pre>
<p>In the project properties, in the build command line, I have the following:</p>
<pre><code>nmake /F &quot;$(ProjectDir)..\make.mak&quot; release OutDirectory=$(OutDir)
</code></pre>
<p>I am using the visual studio macro $(OutDir)
In the make.mak file I have the following setup:</p>
<pre><code>setup(
    package_dir={'': 'src'},
    packages=['mypackage'],
    ext_modules=[myextmodule],
    cmdclass={
        'build_ext': BuildExtCommand,
    },
    # While not using a pyproject.toml, support setuptools_scm setup.cfg usage,
    # see https://github.com/pypa/setuptools_scm/#setupcfg-usage
    use_scm_version={
        'write_to': 'src/mypackage/__version__.py',
        #T

The body is HTML, and the code snippets in it are wrapped in `<code>` tags. We'll want to strip the markup. And while the code may be important for *answering* the question, we're not interested in it here, so we'll remove the `<code>` elements together with their contents. If you're familiar with `scrapy`, you may have used the `w3lib` package. Its `html` module provides two functions, `remove_tags` and `remove_tags_with_content`, that make removing tags a lot easier.

> NOTE: if you don't have `w3lib` installed, you'll need to install it first:
> `pip install w3lib`

# Removing tags

In [11]:
from w3lib import html


no_code = html.remove_tags_with_content(text=zeroth_body, which_ones=["code"])

The above code does exactly what its name says: `remove_tags_with_content` takes a body of `text` and removes every tag listed in `which_ones`, together with everything between its opening and closing tags. Here's the body before and after:

In [13]:
# before removal of code
print(zeroth_body)

<p>In Visual Studio 2019, I have a makefile project in VC++ and using nmake.</p>
<p>In the make.mak file I have below (only showing here a piece):</p>
<pre><code># release build
release: clean
    CD..
    PY -3.8-32 setup.py build_ext--OutputDirectory=$(OutDirectory)
    PY -3.8-32 setup.py bdist_wheel
</code></pre>
<p>In the project properties, in the build command line, I have the following:</p>
<pre><code>nmake /F &quot;$(ProjectDir)..\make.mak&quot; release OutDirectory=$(OutDir)
</code></pre>
<p>I am using the visual studio macro $(OutDir)
In the make.mak file I have the following setup:</p>
<pre><code>setup(
    package_dir={'': 'src'},
    packages=['mypackage'],
    ext_modules=[myextmodule],
    cmdclass={
        'build_ext': BuildExtCommand,
    },
    # While not using a pyproject.toml, support setuptools_scm setup.cfg usage,
    # see https://github.com/pypa/setuptools_scm/#setupcfg-usage
    use_scm_version={
        'write_to': 'src/mypackage/__version__.py',
        #T

In [14]:
# after removal of code
print(no_code)

<p>In Visual Studio 2019, I have a makefile project in VC++ and using nmake.</p>
<p>In the make.mak file I have below (only showing here a piece):</p>
<pre></pre>
<p>In the project properties, in the build command line, I have the following:</p>
<pre></pre>
<p>I am using the visual studio macro $(OutDir)
In the make.mak file I have the following setup:</p>
<pre></pre>
<p>Then in the setup.py file i have defined the class BuildExtCommand:</p>
<pre></pre>
<p>This class contains the methods initialize_options, finalize_options and the run. The run method looks like below:</p>
<pre></pre>
<p>Finally I initialize the user_options variable:</p>
<pre></pre>
<p>Finally with the class I also have a custom method which I get the value from OutputDirectory. This is working for first command in the make.mak file:</p>
<pre></pre>
<p>but not working for:</p>
<pre></pre>
<p>so how can I pass the argument --OutputDirectory the same I do for setup.py build_ext and then read it from my custom method wit

All the `<code>` tags and their content have been removed, leaving empty `<pre></pre>` pairs behind. Now let's remove all the remaining tags but keep their content.

In [15]:
no_tags = html.remove_tags(text=no_code)
print(no_tags)

In Visual Studio 2019, I have a makefile project in VC++ and using nmake.
In the make.mak file I have below (only showing here a piece):

In the project properties, in the build command line, I have the following:

I am using the visual studio macro $(OutDir)
In the make.mak file I have the following setup:

Then in the setup.py file i have defined the class BuildExtCommand:

This class contains the methods initialize_options, finalize_options and the run. The run method looks like below:

Finally I initialize the user_options variable:

Finally with the class I also have a custom method which I get the value from OutputDirectory. This is working for first command in the make.mak file:

but not working for:

so how can I pass the argument --OutputDirectory the same I do for setup.py build_ext and then read it from my custom method within the class?



Let's write a function that does both of these steps and apply it over all of our question bodies.

In [16]:
def clean_text(text: str) -> str:
    """Remove code and HTML tags."""
    no_code = html.remove_tags_with_content(text=text, which_ones=["code"])
    no_tags = html.remove_tags(text=no_code)
    return no_tags

In [17]:
zeroth_clean = clean_text(text=zeroth_body)
print(zeroth_clean)

In Visual Studio 2019, I have a makefile project in VC++ and using nmake.
In the make.mak file I have below (only showing here a piece):

In the project properties, in the build command line, I have the following:

I am using the visual studio macro $(OutDir)
In the make.mak file I have the following setup:

Then in the setup.py file i have defined the class BuildExtCommand:

This class contains the methods initialize_options, finalize_options and the run. The run method looks like below:

Finally I initialize the user_options variable:

Finally with the class I also have a custom method which I get the value from OutputDirectory. This is working for first command in the make.mak file:

but not working for:

so how can I pass the argument --OutputDirectory the same I do for setup.py build_ext and then read it from my custom method within the class?
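
The cleaned text still has blank lines where the code blocks used to be. If that matters downstream, one option is to collapse runs of whitespace. This is a sketch using only the standard library. `collapse_whitespace` is a hypothetical helper, not something from `w3lib` or our package, and the sample string is a shortened stand-in for the output above.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text).strip()


# shortened stand-in for the cleaned body printed above
sample = "In the make.mak file I have below:\n\nIn the project properties:\n\n\n"
print(collapse_whitespace(sample))
```

Whether to flatten the text like this depends on what the later steps need. Paragraph breaks are lost, so it may be better to keep `clean_text` as it is and apply this separately.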



To keep our workspace clean, let's create a new module called "utils.py" in our `is_question` package and store our function there.

In [18]:
"""Contents of utils.py"""
from w3lib import html


def clean_text(text: str) -> str:
    """Remove code and HTML tags."""
    no_code = html.remove_tags_with_content(text=text, which_ones=["code"])
    no_tags = html.remove_tags(text=no_code)
    return no_tags

# Storing data
We have a way to fetch data and a way to do some minor preprocessing. Now we should save the data somewhere. I suggest saving both the *raw* data (what came from the API) and the clean data. The response is JSON and `items` is a list of records, so JSONL (one JSON object per line) is a natural fit. I'll make a new directory called "data" and store the "raw.jsonl" and "clean.jsonl" files in there.
> NOTE: the data folder is at the same level as the `is_question` package.

In [62]:
import json


response = so.fetch_questions(filter="withbody")
items = response.get("items")

In [63]:
with open("../data/raw.jsonl", mode="w") as f:
    for item in items:
        json_ = json.dumps(item)
        f.write(f"{json_}\n")
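
Since every line of a JSONL file is a complete JSON document, reading it back is just a matter of parsing line by line. Here is a small round-trip sketch. It uses stand-in records and a temporary directory (both are assumptions for illustration) so it runs without touching "../data".

```python
import json
import tempfile
from pathlib import Path

# stand-in records; the real ones come from so.fetch_questions
sample_items = [
    {"question_id": 1, "title": "first"},
    {"question_id": 2, "title": "second"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "raw.jsonl"

    # write one JSON object per line, as in the cell above
    with open(path, mode="w") as f:
        for item in sample_items:
            f.write(f"{json.dumps(item)}\n")

    # read it back: each non-empty line is parsed on its own
    with open(path) as f:
        loaded = [json.loads(line) for line in f if line.strip()]

# True when the round trip preserved every record
print(loaded == sample_items)
```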

Before we clean up the bodies and save them, we need to take a `deepcopy` of the `items` variable. Why? If we don't, cleaning will overwrite the bodies in our existing `items`. While we aren't going to use them after this, it's good practice to copy objects explicitly so we don't modify them by accident.
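
To see why a plain copy of the list isn't enough, here is a toy sketch with a single stand-in item (not real API data). A shallow copy creates a new list that still points at the same dictionaries, while `deepcopy` copies the dictionaries too.

```python
from copy import copy, deepcopy

original = [{"question_id": 1, "body": "<p>raw</p>"}]

shallow = copy(original)    # new list, but the same dict objects inside
deep = deepcopy(original)   # new list and new dicts

shallow[0]["body"] = "changed via shallow copy"
print(original[0]["body"])  # the original was modified too

deep[0]["body"] = "changed via deep copy"
print(original[0]["body"])  # unaffected by the change to the deep copy
```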

In [73]:
from copy import deepcopy


clean_items = deepcopy(items)
# we enumerate to replace the body of
# a specific item in clean_items
for idx, item in enumerate(clean_items):
    raw_body = item.get("body")
    clean_body = clean_text(text=raw_body)
    clean_items[idx]["body"] = clean_body

TypeError: to_unicode must receive bytes or str, got NoneType

Uh oh! We got a `TypeError`. What caused this?

In [74]:
item

{'tags': ['c#', 'kubernetes', 'exec', 'kubectl'],
 'owner': {'reputation': 1163,
  'user_id': 19544859,
  'user_type': 'registered',
  'profile_image': 'https://i.stack.imgur.com/cjZT6.jpg?s=256&g=1',
  'display_name': 'mikyll98',
  'link': 'https://stackoverflow.com/users/19544859/mikyll98'},
 'is_answered': False,
 'view_count': 26,
 'answer_count': 1,
 'score': 0,
 'last_activity_date': 1688393579,
 'creation_date': 1688383889,
 'last_edit_date': 1688393579,
 'question_id': 76604324,
 'content_license': 'CC BY-SA 4.0',
 'link': 'https://stackoverflow.com/questions/76604324/kubernetes-kubectl-exec-command-different-behaviour-from-c-program',
 'title': 'Kubernetes: kubectl exec command different behaviour from C# program'}

It looks like we have an `item` without a `body`. I'm not sure how that happened. Let's see how many `items` are missing a body.

In [79]:
with_body = sum("body" in item for item in items)
total = len(items)
print(f"items with body: {with_body}")
print(f"total number of items: {total}")
print(f"proportion with body vs not: {with_body / total}, {1 - with_body / total}")

items with body: 498
total number of items: 500
proportion with body vs not: 0.996, 0.0040000000000000036


Two items are missing bodies. We can do a couple of things here:
1. Remove the items without a body (and possibly replace them with different questions).
2. Keep the items and write a `try/except` in our for-loop that handles the `TypeError`.
3. Check if `body` exists and clean it if it does, otherwise continue with the loop.
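
For comparison, option 1 could be sketched as a list comprehension. The records below are stand-ins shaped like the API items, and the variable names are my own.

```python
# stand-in data shaped like the API items; only the keys we need
sample_items = [
    {"question_id": 1, "body": "<p>has a body</p>"},
    {"question_id": 2},  # no body, like the two items above
    {"question_id": 3, "body": "<p>also fine</p>"},
]

# option 1: keep only the items that actually carry a body
with_body_items = [item for item in sample_items if item.get("body") is not None]

# handy for inspecting (or re-fetching) the questions that were dropped
missing_ids = [item["question_id"] for item in sample_items if item.get("body") is None]

print(len(with_body_items), missing_ids)
```

The drawback is that the raw and clean files would no longer contain the same set of questions.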

Let's go with option 3.

In [80]:
clean_items = deepcopy(items)
# we enumerate to replace the body of
# a specific item in clean_items
for idx, item in enumerate(clean_items):
    raw_body = item.get("body")
    if raw_body is None:
        # this will continue to the next item in the loop
        # skipping all code after the conditional
        continue
    clean_body = clean_text(text=raw_body)
    clean_items[idx]["body"] = clean_body

Much better. The loop now runs without errors, so we can write the clean JSONL file.

In [84]:
with open("../data/clean.jsonl", mode="w") as f:
    for item in clean_items:
        json_ = json.dumps(item)
        f.write(f"{json_}\n")
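
We've now written the same loop twice, so it could be moved into "utils.py" next to `clean_text`. Below is one possible shape for such a helper. `write_jsonl` is a hypothetical name, and the explicit `encoding` and the directory creation are my additions rather than something the cells above do.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable


def write_jsonl(items: Iterable[dict], path: Path) -> int:
    """Write one JSON object per line and return the number of lines written."""
    # create the target directory if it doesn't exist yet
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(path, mode="w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item) + "\n")
            count += 1
    return count


# demo in a temporary directory so nothing real gets overwritten
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "clean.jsonl"
    written = write_jsonl([{"question_id": 1}, {"question_id": 2}], target)
    lines = target.read_text(encoding="utf-8").splitlines()

print(written, len(lines))
```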