# Data Processing Stage 1: Loading & Cleaning

This notebook loads the filtered Stack Overflow dataset, performs basic cleaning,
and exports the processed data as a Parquet file for downstream notebooks.

**Input**
- Stackoverflow_filtered.csv

**Output**
- Loading_data.parquet


## Pipeline Step Overview
1. Import libraries
2. Load dataset
3. Handle missing values
4. Convert tag string to list
5. Create combined text column
6. Select required columns
7. Save parquet file


#### Loading Libraries

In [14]:
import pandas as pd
import os

#### Load dataset

In [2]:
df = pd.read_csv("Stackoverflow_filtered.csv")
df.head()

,id,title,body,tag_string
0,12421444,How to format a number 0..9 to display with 2 ...,<p>I'd like to always show a number under 100 ...,java|number-formatting
1,12468823,Python datetime - setting fixed hour and minut...,<p>I've successfully converted something of <c...,python|date|datetime|time|date-manipulation
2,12553160,Getting visitors country from their IP,<p>I want to get visitors country via their IP...,php|geolocation|ip|country-codes
3,12583638,When is the @JsonProperty property used and wh...,<p>This bean 'State' :</p>\n\n<pre><code>publi...,java|ajax|jackson
4,12567578,What does the LayoutInflater attachToRoot para...,"<p>The <a href=""http://developer.android.com/r...",android|android-layout|android-view|layout-inf...


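**Data Quality Check**

We inspect the dataset's shape, missing values, and column types before cleaning.
Note that the cells in this section were executed out of order (`In [13]` ran after
`In [7]` and `In [12]`), so the `df.info()` output already includes the `tag_list`
column and reflects the data after missing values were filled.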
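
As a complement to `df.info()`, the same checks can be collected into one small table of dtypes and missing-value counts. The following is a minimal sketch on a toy frame; the `quality_report` helper and the sample rows are hypothetical and not part of the original notebook.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and missing-value count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
    })


# Toy stand-in for the real dataset (hypothetical rows, same column names).
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["a", None, "c"],
    "body": ["<p>x</p>", "<p>y</p>", None],
    "tag_string": ["java|ajax", "python", None],
})

report = quality_report(toy)
print(toy.shape)
print(report)
```

Running a report like this on the freshly loaded frame, before any `fillna` call, shows how many values the cleaning step actually replaces.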


#### Handling missing values

In [13]:
df.info()
df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          60000 non-null  int64 
 1   title       60000 non-null  object
 2   body        60000 non-null  object
 3   tag_string  60000 non-null  object
 4   tag_list    60000 non-null  object
dtypes: int64(1), object(4)
memory usage: 2.3+ MB


(60000, 5)

In [7]:
df["title"] = df["title"].fillna("")
df["body"] = df["body"].fillna("")
df["tag_string"] = df["tag_string"].fillna("")

#### Convert tag string to list

In [12]:
df["tag_list"] = df["tag_string"].str.split("|")
df['tag_list'].iloc[0]

['java', 'number-formatting']
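
One side effect of filling missing tags with `""` before splitting is that an empty string splits into `['']` rather than an empty list. Whether any such rows exist in this dataset is not shown in the notebook, so the following is only a sketch of how the edge case could be handled; `split_tags` is a hypothetical helper.

```python
import pandas as pd


def split_tags(tag_string: str) -> list:
    """Split a pipe-delimited tag string, dropping empty fragments."""
    return [tag for tag in tag_string.split("|") if tag]


tags = pd.Series(["java|number-formatting", "", "python"])

naive = tags.str.split("|")      # same call as in the notebook
cleaned = tags.apply(split_tags)  # variant that drops empty fragments

print(naive.tolist())    # [['java', 'number-formatting'], [''], ['python']]
print(cleaned.tolist())  # [['java', 'number-formatting'], [], ['python']]
```

An empty list is usually easier to handle downstream than a list holding one empty tag, for example when counting tags per question.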

#### Create combined text column

In [15]:
df["text"] = df["title"] + " " + df["body"]

#### Select required columns

In [16]:
df = df[["id", "text", "tag_list"]]
df.head()

,id,text,tag_list
0,12421444,How to format a number 0..9 to display with 2 ...,"[java, number-formatting]"
1,12468823,Python datetime - setting fixed hour and minut...,"[python, date, datetime, time, date-manipulation]"
2,12553160,Getting visitors country from their IP <p>I wa...,"[php, geolocation, ip, country-codes]"
3,12583638,When is the @JsonProperty property used and wh...,"[java, ajax, jackson]"
4,12567578,What does the LayoutInflater attachToRoot para...,"[android, android-layout, android-view, layout..."


**Post-Processing Validation**

Verify that:
- No null values remain in text columns
- Tag list conversion is correct
- Final dataset has expected columns
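
The three checks above could be written as assertions. This is a minimal sketch on a toy frame with the same final columns; `validate_processed`, `EXPECTED_COLUMNS`, and the sample row are hypothetical and not part of the original notebook.

```python
import pandas as pd

EXPECTED_COLUMNS = ["id", "text", "tag_list"]


def validate_processed(frame: pd.DataFrame) -> None:
    """Raise AssertionError if the processed frame violates an expectation."""
    # The final dataset has exactly the expected columns, in order.
    assert list(frame.columns) == EXPECTED_COLUMNS, "unexpected columns"
    # No null values remain in the text column.
    assert frame["text"].notna().all(), "null values in text"
    # Every tag_list entry is a list of strings.
    assert frame["tag_list"].apply(
        lambda tags: isinstance(tags, list)
        and all(isinstance(tag, str) for tag in tags)
    ).all(), "tag_list is not a list of strings"


# Toy stand-in for the processed dataset (hypothetical row).
toy = pd.DataFrame({
    "id": [1],
    "text": ["Some title <p>Some body</p>"],
    "tag_list": [["java", "number-formatting"]],
})

validate_processed(toy)  # passes silently when all checks hold
```

In the notebook itself, the call would be `validate_processed(df)` just before the export cell.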


#### Save parquet file

**Export Processed Dataset**

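The processed dataset is saved in Parquet format. Parquet is a compressed, columnar
format, so the file is typically smaller than the equivalent CSV and faster to load
in the downstream ML pipeline. `to_parquet` requires a Parquet engine such as
`pyarrow` or `fastparquet` to be installed.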
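
A quick round trip helps confirm what downstream notebooks will receive when they read the file back. The sketch below uses a toy frame and a temporary file rather than the real dataset, and it assumes a Parquet engine is available. With `pyarrow`, list-valued columns typically come back as NumPy arrays rather than Python lists, so the sketch converts them back explicitly.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the processed dataset (hypothetical rows).
toy = pd.DataFrame({
    "id": [1, 2],
    "text": ["first title <p>first body</p>", "second title <p>second body</p>"],
    "tag_list": [["java", "number-formatting"], ["python"]],
})

restored = None
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.parquet")
    try:
        toy.to_parquet(path)
        restored = pd.read_parquet(path)
    except ImportError:
        # pandas raises ImportError when neither pyarrow nor fastparquet is installed.
        print("No Parquet engine found; install pyarrow or fastparquet.")

if restored is not None:
    # Inspect what the engine returned for the list column.
    print(type(restored["tag_list"].iloc[0]))
    # Normalize back to plain Python lists for downstream code.
    restored["tag_list"] = restored["tag_list"].apply(list)
    print(restored.dtypes)
```

Downstream notebooks that rely on `tag_list` holding Python lists may want the same `apply(list)` step after `pd.read_parquet("Loading_data.parquet")`.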


In [19]:
df.to_parquet("Loading_data.parquet")

In [20]:
os.listdir()

['.config',
 'Stackoverflow_filtered.csv',
 'Loading_data.parquet',
 '.ipynb_checkpoints',
 'sample_data']

## Summary
- Original rows: 60,000
- Final rows: 60,000
- Columns retained: id, text, tag_list
- Output file: Loading_data.parquet
