# Data Preparation Stage 2: Model-Ready Dataset

This notebook prepares the processed Stack Overflow dataset for model training.
It performs NLP cleaning, selects top frequent tags, filters records, and
exports a modeling-ready dataset.

- Load the processed dataset
- Clean the text
- Select the most frequent tags
- Filter out low-frequency labels
- Prepare the modeling columns
- Export the final parquet dataset

**Input**
- Loading_data.parquet

**Output**
- model_ready.parquet


## Load Processed Dataset

Load the cleaned dataset generated in the previous pipeline stage.
This dataset already contains combined text fields and parsed tag lists.


In [3]:
import pandas as pd

df = pd.read_parquet('Loading_data.parquet')

In [7]:
print(df.shape)
df.head()

(60000, 3)


,id,text,tag_list
0,12421444,How to format a number 0..9 to display with 2 ...,"[java, number-formatting]"
1,12468823,Python datetime - setting fixed hour and minut...,"[python, date, datetime, time, date-manipulation]"
2,12553160,Getting visitors country from their IP <p>I wa...,"[php, geolocation, ip, country-codes]"
3,12583638,When is the @JsonProperty property used and wh...,"[java, ajax, jackson]"
4,12567578,What does the LayoutInflater attachToRoot para...,"[android, android-layout, android-view, layout..."


## Text Cleaning

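The cleaning step:
- strips HTML tags, keeping the text inside them
- removes URLs (any run of non-whitespace characters beginning with http)
- lowercases the text

This keeps the text consistent before feature extraction.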


In [8]:
from bs4 import BeautifulSoup
import re

In [9]:
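def clean_text(text):
    # Strip HTML markup and keep only the text content (needs the lxml parser installed).
    text = BeautifulSoup(text, "lxml").text
    # Remove URLs: "http" followed by any non-whitespace characters.
    text = re.sub(r'http\S+', "", text)
    # Lowercase so the same word is treated identically regardless of case.
    return text.lower()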

df["clean_text"] = df["text"].apply(clean_text)
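
To make the effect of each step concrete, the sketch below runs the same three operations (strip markup, drop URLs, lowercase) on a single sample string. It is only an illustration: it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra dependencies, and the names `_TextExtractor` and `clean_text_stdlib`, as well as the sample text, are hypothetical and not part of the notebook.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_text_stdlib(text):
    # Step 1: strip HTML tags, keeping the text between them.
    extractor = _TextExtractor()
    extractor.feed(text)
    extractor.close()
    text = "".join(extractor.parts)
    # Step 2: remove URLs with the same pattern the notebook uses.
    text = re.sub(r"http\S+", "", text)
    # Step 3: lowercase.
    return text.lower()


# Made-up sample in the style of the dataset's text column.
sample = "Getting visitors country from their IP <p>See https://example.com/geo for details</p>"
cleaned = clean_text_stdlib(sample)
print(cleaned)
```

Removing a URL leaves the whitespace around it in place, so the cleaned text can contain double spaces. Whether to collapse them is a choice the notebook does not make.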

## Identify Most Frequent Tags

To reduce label sparsity, we select the most frequently occurring tags
and use them as the target label space for the classification model.


In [11]:
from collections import Counter

tag_counter = Counter()

for tags in df["tag_list"]:
    tag_counter.update(tags)

top_tags = {tag for tag, _ in tag_counter.most_common(50)}

len(top_tags)


50
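
A cutoff of 50 trades label coverage against label sparsity. One way to sanity-check a cutoff is to measure what share of all tag occurrences the k most common tags account for. The sketch below does this on a small made-up list of tag lists; `tag_lists`, `coverage`, and the toy data are assumptions for illustration, not values from the dataset.

```python
from collections import Counter

# Toy stand-in for df["tag_list"]; the real column has 60,000 rows.
tag_lists = [
    ["java", "number-formatting"],
    ["python", "datetime"],
    ["php", "geolocation"],
    ["java", "jackson"],
    ["python", "pandas"],
    ["java"],
]

tag_counter = Counter()
for tags in tag_lists:
    tag_counter.update(tags)


def coverage(counter, k):
    """Share of all tag occurrences covered by the k most common tags."""
    total = sum(counter.values())
    covered = sum(count for _, count in counter.most_common(k))
    return covered / total


for k in (1, 2, 3):
    print(k, round(coverage(tag_counter, k), 2))
```

Running the same function on the real `tag_counter` for several values of k would show how quickly coverage flattens out, which is one way to judge whether 50 is a reasonable choice.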

## Filter Dataset to Selected Tags

Each record is filtered to retain only tags that belong to the selected
frequent-tag vocabulary. Rows with no remaining tags are removed.


In [12]:
df["filtered_tags"] = df["tag_list"].apply(
    lambda tags: [tag for tag in tags if tag in top_tags]
)

In [13]:
# Remove rows that have no remaining tags:

df = df[df["filtered_tags"].map(len) > 0]
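
The filter-then-drop logic can be checked on a few rows before running it on the full frame. The following is a plain-Python sketch with a list of dicts standing in for the DataFrame; `rows` and the three-tag `top_tags` are made-up example values.

```python
# Hypothetical vocabulary; the notebook uses the 50 most common tags.
top_tags = {"java", "python", "php"}

rows = [
    {"id": 1, "tag_list": ["java", "number-formatting"]},
    {"id": 2, "tag_list": ["python", "date", "datetime"]},
    {"id": 3, "tag_list": ["haskell", "monads"]},  # no tag in the vocabulary
]

# Keep only tags in the vocabulary, preserving their original order.
for row in rows:
    row["filtered_tags"] = [tag for tag in row["tag_list"] if tag in top_tags]

# Drop rows whose filtered tag list is empty.
kept = [row for row in rows if len(row["filtered_tags"]) > 0]

print([(row["id"], row["filtered_tags"]) for row in kept])
print(f"dropped {len(rows) - len(kept)} of {len(rows)} rows")
```

Reporting how many rows are dropped is worth doing on the real data as well, since those questions disappear from the training set entirely.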

## Create Modeling Dataset

Only the required modeling columns are retained:
- id
- cleaned text
- filtered tag labels


In [15]:
df_model = df[["id", "clean_text", "filtered_tags"]]
df_model.head()

,id,clean_text,filtered_tags
0,12421444,how to format a number 0..9 to display with 2 ...,[java]
1,12468823,python datetime - setting fixed hour and minut...,[python]
2,12553160,getting visitors country from their ip i want ...,[php]
3,12583638,when is the @jsonproperty property used and wh...,[java]
4,12567578,what does the layoutinflater attachtoroot para...,[android]


## Export Processed Dataset

The processed dataset is saved in parquet format to reduce storage size
and speed up loading in the downstream ML pipeline.


In [16]:
df_model.to_parquet('model_ready.parquet')

## Summary

- Input dataset: Loading_data.parquet
- Output dataset: model_ready.parquet
- Text cleaned (HTML tags and URLs removed) and lowercased
- Labels restricted to the 50 most frequent tags
- Dataset ready for vectorization and model training
