# Imports

In [None]:
# imports
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# additional imports for the Textual features
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# NLP: Bag of Words & Text Classification Tasks

## The Data  

We will use the **Women’s Clothing E-Commerce dataset**, which consists of reviews written by customers.


* **Review Text:** String variable for the review body.

* **Recommended:** Binary variable stating whether the customer recommends the product: 1 is recommended, 0 is not recommended.


## Task - "EDA"
1. Load the dataset `Womens_Clothing_E-Commerce_Reviews.csv` into a pandas DataFrame.
   * You can use any other public dataset instead.
2. Drop any unnecessary columns.
3. Print the number of rows and columns in the dataset.
4. For each column, calculate:
   - The number of **unique values**
   - The number of **missing values**
5. Display the result in a summary table for quick inspection.

 This task helps you understand the dataset structure, spot missing values, and plan preprocessing accordingly.
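
The steps above could be sketched as follows. To stay runnable on its own, the sketch uses a tiny in-memory DataFrame in place of the CSV; the toy rows and the `Unnamed: 0` column are assumptions made for illustration, not values taken from the real file.

```python
import pandas as pd

# Stand-in for: df = pd.read_csv("Womens_Clothing_E-Commerce_Reviews.csv")
# The rows and the 'Unnamed: 0' column below are made up for illustration.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "Review Text": ["Love it", None, "Too small", "Love it"],
    "Recommended": [1, 1, 0, 1],
})

# Drop columns that are not needed (assumed here: a leftover index column).
df = df.drop(columns=["Unnamed: 0"])

# Number of rows and columns.
n_rows, n_cols = df.shape
print(f"rows: {n_rows}, columns: {n_cols}")

# One summary row per column of the dataset.
summary = pd.DataFrame({
    "unique values": df.nunique(),       # NaN is not counted as a value
    "missing values": df.isna().sum(),
})
print(summary)
```

Building the summary from `nunique()` and `isna().sum()` keeps both counts aligned on the column names, so the table can be read row by row.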


## Task - Split Train-Test

1. Split your dataset into **training** and **test** sets (80% train / 20% test).
2. Extract the textual data from the column `'Review Text'` into two variables:
   - `x_train_textual`
   - `x_test_textual`

3. Create two DataFrames:
   - `train_text_df` — will hold both the raw and preprocessed review texts for the train set
   - `test_text_df` — same for the test set

Each DataFrame should have two columns:
- `'raw text'`: the original review
- `'preprocessed text'`: the cleaned review (to be filled in the next task)
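
One possible way to do the split is sketched below, on a toy DataFrame so that it runs without the CSV. The label column name `'Recommended'`, the variables `y_train` / `y_test`, the helper `make_text_df`, and `random_state=42` are assumptions for illustration; adapt them to your own DataFrame.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the loaded dataset (made-up rows).
df = pd.DataFrame({
    "Review Text": [f"review number {i}" for i in range(10)],
    "Recommended": [1, 0] * 5,
})

# 80% train / 20% test; a fixed random_state makes the split reproducible.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Textual data and labels for each split.
x_train_textual = train_df["Review Text"]
x_test_textual = test_df["Review Text"]
y_train = train_df["Recommended"]
y_test = test_df["Recommended"]


def make_text_df(raw):
    """Hypothetical helper: hold the raw text next to a placeholder column."""
    text_df = pd.DataFrame({"raw text": raw.reset_index(drop=True)})
    text_df["preprocessed text"] = ""  # filled in by the preprocessing task
    return text_df


train_text_df = make_text_df(x_train_textual)
test_text_df = make_text_df(x_test_textual)
print(train_text_df.shape, test_text_df.shape)
```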


## Task - Text Preprocessing

Now preprocess the reviews in both `x_train_textual` and `x_test_textual`.

Your preprocessing pipeline should include:

- Lowercasing
- Tokenization (by splitting on spaces)
- Stopword removal
- **Stemming** using NLTK’s `PorterStemmer`
- Join the tokens back into a single string

Additional instructions:

- Use NLTK’s stopword list.
- Exclude the words `"no"` and `"not"` from the stopwords list (to preserve negation).
- Apply the pipeline separately for the train and test sets.
- Store the results in the appropriate `'preprocessed text'` column in `train_text_df` and `test_text_df`.

Feel free to use the code from the slides.

## Task - Features Extraction

In this task, you’ll convert the preprocessed text into numeric features using `BoW` (or `TF-IDF`).

Your steps:

1. Extract the `'preprocessed text'` column from both `train_text_df` and `test_text_df`, and store them in:
   - `processed_train`
   - `processed_test`

2. Initialize a `CountVectorizer` (called `cv` in the steps below).

3. Fit the vectorizer only on the training set (`processed_train`).

4. Transform both the training and test sets using the fitted vectorizer:
   - `x_train_textual = cv.transform(processed_train)`
   - `x_test_textual = cv.transform(processed_test)`

5. Convert the results to dense numpy arrays using `.toarray()` for later compatibility with classical models.

6. Print the shape and a sample of the resulting feature vectors to validate your work.



## Task - The Model

Now that you’ve preprocessed the text and converted it into numerical features, it's time to **train a classification model**.

Your goal is to build a model that can **predict the target label** using the feature matrix `x_train_textual`.

Instructions:

1. Select and train a classification model of your choice (e.g. `LogisticRegression`).

2. Fit the model on the **training set**.

3. Use it to predict on the **test set**.

4. Evaluate your model using