# Discussion 07



# Regex and NLP

In [1]:
import os
import numpy as np
import pandas as pd
import requests
import time
import re

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
# from discussion import *

## Regular Expressions

### Resources

**Online Simulators**

 - https://pythex.org/

 - https://regex101.com/
 
**Cheat sheets**

 - https://dsc80.com/resources/other/berkeley-regex-reference.pdf

 - https://www.debuggex.com/cheatsheet/regex/python

 - https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf

In [5]:
def duplicate_words(s):
    """
    Provide a list of all words that are duplicates in an input sentence.
    Assume that the sentences are lower case.

    :Example:
    >>> duplicate_words('let us plan for a horror movie movie this weekend')
    ['movie']
    >>> duplicate_words('I like surfing')
    []
    >>> duplicate_words('the class class is good but the tests tests tests are hard')
    ['class', 'tests']
    """
    # BEGIN SOLUTION
    return re.findall(r'(\b\w+\b)\s+\b\1\b',s)
    # END SOLUTION

### Explanation

**duplicate_words regex**

- `\b` matches a **word boundary**, ensuring we start and end on whole words.
- `\w+` matches one or more word characters (letters, digits, or underscore).
- The entire `\b\w+\b` is wrapped in a **capturing group** `(...)` so we can refer back to the word.
- `\s+` matches one or more whitespace characters between the repeated words.
- `\b\1\b` is a **back‑reference** to the captured word, requiring the **exact same word** to appear again.
- Overall, the pattern captures words that appear **twice in a row** (e.g., `'movie movie'`).
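
To see the back-reference at work, here is a small sketch on one of the docstring's own example sentences. The `re.sub` call is an illustrative extension and not part of the exercise.

```python
import re

# Same pattern as in duplicate_words: capture a word, then require it again.
pattern = r'(\b\w+\b)\s+\b\1\b'
sentence = 'the class class is good but the tests tests tests are hard'

# With one capturing group, findall returns the captured word, not the full match.
duplicates = re.findall(pattern, sentence)
print(duplicates)

# Illustrative extension: replace each matched pair with a single copy.
# Matches do not overlap, so a triple repeat is only shortened by one word.
collapsed = re.sub(pattern, r'\1', sentence)
print(collapsed)
```

Because matches are non-overlapping, `'tests tests tests'` contributes only one entry to the result.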

### Tips & Tricks

- Use **back‑references** (`\1`, `\2`, …) to detect repeated text.
- Always anchor with `\b` if you only want **whole‑word** matches.
- In Python, write regex strings as *raw* strings: `r'...'` to avoid double escaping.
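
The raw-string tip matters most for `\b`. As a minimal sketch on a made-up sentence: without the `r` prefix, Python itself interprets `\b` as a backspace character before the regex engine ever sees it.

```python
import re

text = 'the cat sat'

# Without the r prefix, Python turns '\b' into a backspace character (\x08),
# so the pattern looks for backspaces around 'cat' and finds nothing.
plain = re.findall('\bcat\b', text)

# With a raw string, the regex engine receives \b and treats it as a word boundary.
raw = re.findall(r'\bcat\b', text)

print(plain, raw)
```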

**Question 2**: Extract laptop specifications

Given a DataFrame of product descriptions, return a DataFrame with four added columns: `processor` (e.g., i3, i5), `generation` (e.g., 9th Gen, 10th Gen), `storage` (e.g., 512 GB SSD, 1 TB HDD), and `display_inch` (e.g., 15.6 inch, 14 inch). The image below gives the column names and the exact patterns.

If there is no specific information present, keep a null (`NaN`) value.

**Hint:** You can pass regex patterns to the pandas `.str.extract()` method. Note that it returns one column per capture group in the pattern.

<img src='imgs/laptop_specs.PNG'>
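
The hint about capture groups can be sketched as follows. The three description strings are hypothetical and not taken from `laptop_details.csv`; the point is only how the number of groups changes the shape of the result.

```python
import pandas as pd

# Hypothetical descriptions, made up for illustration.
desc = pd.Series(['512 GB SSD', '1 TB HDD', 'no storage listed'])

# Three capturing groups give a DataFrame with three columns.
multi = desc.str.extract(r'([0-9]{1,3}\s+(G|T)B\s+(SSD|HDD))')

# Non-capturing groups (?:...) leave a single column; rows without a match are NaN.
single = desc.str.extract(r'([0-9]{1,3}\s+(?:G|T)B\s+(?:SSD|HDD))')

print(multi.shape, single.shape)
print(single[0].tolist())
```

Selecting column `0` of the multi-column result is an equivalent way to keep only the outer group.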

In [10]:
def laptop_details(df):
    """
    Given a df with product description - Return df with added columns of 
    processor (i3, i5), generation (9th Gen, 10th Gen), 
    storage (512 GB SSD, 1 TB HDD), display_inch (15.6 inch, 14 inch)

    :Example:
    >>> df = pd.read_csv('data/laptop_details.csv')
    >>> new_df = laptop_details(df)
    >>> new_df.shape
    (21, 5)
    >>> new_df['processor'].nunique()
    3
    """
    # BEGIN SOLUTION
    df_copy = df.copy()
    df_copy['processor'] = df_copy['laptop_description'].str.extract(r'(\bi[0-9]\b)')
    df_copy['generation'] = df_copy['laptop_description'].str.extract(r'([0-9]{1,2}\w{2}\s+Gen)')
    df_copy['storage'] = df_copy['laptop_description'].str.extract(r'([0-9]{,3}\s+(G|T)B\s+(SSD|HDD))')[0]
    df_copy['display_inch'] = df_copy['laptop_description'].str.extract(r'([0-9]{2}(\.[0-9]{1,2})?\s+inch)')[0]

    return df_copy
    # END SOLUTION

### Explanation

**processor regex** `r'(\bi[0-9]\b)'`

- `\b` anchors the pattern to word boundaries, so only a standalone token such as `i5` matches, not an `i` plus digit buried inside a longer word.
- `i[0-9]` looks for the letter **i** followed by a **single digit** (e.g., `i5`).

**generation regex** `r'([0-9]{1,2}\w{2}\s+Gen)'`

- `[0-9]{1,2}` captures **1–2 digits** (e.g., `11`).
- `\w{2}` grabs the ordinal suffix (`th`, `st`).
- `\s+Gen` requires one or more whitespace characters followed by the word **Gen**.

**storage regex** `r'([0-9]{,3}\s+(G|T)B\s+(SSD|HDD))'`

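- `[0-9]{,3}` permits **zero to 3 digits** (e.g., `512`); because the lower bound is omitted, the digits are technically optional, and `{1,3}` would require at least one.
- `(G|T)B` matches the unit **GB** or **TB**.
- `(SSD|HDD)` matches the storage type.
- Because the inner groups also capture, `.str.extract` returns three columns; `[0]` keeps the outer group with the full match.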

**display size regex** `r'([0-9]{2}(\.[0-9]{1,2})?\s+inch)'`

- `[0-9]{2}` captures two leading digits (e.g., `15`).
- `(\.[0-9]{1,2})?` optionally matches a **decimal fraction** (e.g., `.6`).
- `inch` is matched literally.
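
As a quick check of the four patterns, the sketch below runs them with `re.search` on a single hypothetical description string (made up for illustration, not a row from the dataset).

```python
import re

# Hypothetical description string, made up for illustration.
description = 'Lenovo Thinkpad i5 10th Gen 512 GB SSD 15.6 inch'

patterns = {
    'processor': r'(\bi[0-9]\b)',
    'generation': r'([0-9]{1,2}\w{2}\s+Gen)',
    'storage': r'([0-9]{,3}\s+(G|T)B\s+(SSD|HDD))',
    'display_inch': r'([0-9]{2}(\.[0-9]{1,2})?\s+inch)',
}

specs = {}
for name, pattern in patterns.items():
    match = re.search(pattern, description)
    # group(1) is the outer capture group, i.e. column 0 of .str.extract.
    specs[name] = match.group(1) if match else None

print(specs)
```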

In [11]:
# Use 'pd_column.str.extract(r'pattern')' to extract the required pattern
df = pd.read_csv('data/laptop_details.csv')

In [12]:
# don't change this cell -- it is needed for the tests to work
out = laptop_details(pd.read_csv('data/laptop_details.csv'))

## Natural Language Processing - Dealing with Text Data

- Unstructured data is everywhere: everything you read, see, and listen to
- Quantifying text data and extracting features from it is important to generate insights and build models


- Text representation is a huge area of study: representing a piece of text as a vector of numbers (BoW, TF-IDF, semantic embeddings, etc.)

- In this section, we will focus on Bag-of-Words representations using uni-grams and bi-grams

Let's use the musical instruments reviews dataset, which contains reviews and ratings.

In [18]:
review_df = pd.read_csv('data/musical_instruments_reviews.csv')
review_df

Unnamed: 0,reviewerID,reviewText,overall,summary
0,A2IBPI20UZIR0U,"Not much to write about here, but it does exac...",5,good
1,A14VAT5EAX3D9S,The product does exactly as it should and is q...,5,Jake
2,A195EZSQDW3E21,The primary job of this device is to block the...,5,It Does The Job Well
3,A2C00NNG1ZQQG2,Nice windscreen protects my MXL mic and preven...,5,GOOD WINDSCREEN FOR THE MONEY
4,A94QU4C90B1AX,This pop filter is great. It looks and perform...,5,No more pops when I record my vocals.
...,...,...,...,...
195,A2DKLC2FJTY9OI,Good all around mike. If you are looking for a...,5,Best in price range
196,A1MI9FDCNB3CMR,"Seriously? The Shure SM57 sets the standard, ...",5,Industry Standard
197,A37AQI4AU3JWSR,If it's good enough to track Tom Petty's vox o...,5,Classic. the last of Shures good mics
198,A37U8NH2CD9EDX,There's a reason every mic cabinet has at leas...,5,"It's an SM57, what's there to say?"


### N-grams in text

 - uni-gram consists of a single word from a text sequence
 - Extending this, an n-gram consists of consecutive 'n' words from a text sequence
 - E.g., for `text = 'i love data science'`, the uni-grams are `['i', 'love', 'data', 'science']` and the bi-grams are `['i love', 'love data', 'data science']`
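
The definition generalizes to any `n` with a sliding window over the word list. The helper name `ngrams` below is hypothetical, introduced only for this sketch.

```python
def ngrams(text, n):
    """Return the n-grams of `text` as space-joined strings (hypothetical helper)."""
    words = text.split()
    # Slide a window of n consecutive words across the list.
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

text = 'i love data science'
unigrams_demo = ngrams(text, 1)
bigrams_demo = ngrams(text, 2)
trigrams_demo = ngrams(text, 3)
print(unigrams_demo)
print(bigrams_demo)
print(trigrams_demo)
```

A text with fewer than `n` words yields an empty list.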

### Getting the uni-grams and their counts

In [19]:
# First normalize the reviews: convert to lower case and remove all punctuation
reviews = review_df['reviewText'].str.lower().str.replace(r'[^\w\s]', '', regex=True)
reviews = reviews.tolist()
# reviews

In [20]:
# Getting all the unigrams from all the reviews
unigrams = []
for review in reviews:
    words = review.split()
    unigrams.extend(words)
unigrams[:10]

['not', 'much', 'to', 'write', 'about', 'here', 'but', 'it', 'does', 'exactly']

In [21]:
# Getting unigram counts
pd.Series(unigrams).value_counts()

the             539
a               364
i               360
and             346
to              342
               ... 
canon             1
pc                1
forget            1
recordersbut      1
living            1
Length: 2201, dtype: int64

- Does this make sense? Both the values and their counts?
- What are the advantages and drawbacks of using a unigram bag-of-words for text representation?

### How does 'reviewText' differ from 'summary'?

In [22]:
reviews = review_df['summary'].str.lower().str.replace(r'[^\w\s]', '', regex=True)
reviews = reviews.tolist()

# Getting all the unigrams from all the summaries
unigrams = []
for review in reviews:
    words = review.split()
    unigrams.extend(words)

pd.Series(unigrams).value_counts()

good         45
the          38
cable        38
great        28
for          25
             ..
easily        1
breaking      1
complaint     1
which         1
by            1
Length: 328, dtype: int64

**Question 3**: Create bi-gram counts of the whole reviews text corpus.

Given a DataFrame like `review_df` and a column string (either `reviewText` or `summary`),
return a Series of bi-gram counts for that column, sorted in descending order. Each index entry should be a bi-gram tuple, and each value should be the number of times that bi-gram appears in the whole corpus.

Perform the same text normalization (lower-casing and removing all punctuation) as in the uni-gram case before creating the bi-gram counts.

**Hint:** Use splitting and zipping to create bi-gram combinations

In [23]:
def bigram_counts(review_df, column='reviewText'):
    """
    Given a DataFrame like `review_df`, return a Series with bi-gram counts sorted in descending order. 
    The index of the series should be a tuple of bi-grams 
    and the value should indicate the count of times that bi-gram appears in the whole corpus.

    :Example:
    >>> out_bigrams_text = bigram_counts(pd.read_csv('data/musical_instruments_reviews.csv'), 'reviewText')
    >>> isinstance(out_bigrams_text, pd.Series)
    True
    >>> out_bigrams_text.shape == (8470,)
    True
    >>> out_bigrams_text.index[0] == ('for', 'the')
    True
    """
    # BEGIN SOLUTION
    reviews = review_df[column].str.lower().str.replace(r'[^\w\s]', '', regex=True)
    reviews = reviews.tolist()

    bigrams = []

    for r in reviews:
        words = r.split()
        for b in zip(words[:-1], words[1:]):
            bigrams.append(b)

    bigrams = pd.Series(bigrams).value_counts()

    return bigrams
    # END SOLUTION

### Explanation

**punctuation removal regex** `r'[^\w\s]'`

- `[^...]` is a **negated character class** – it matches anything **not** listed inside.
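- `\w` is any word character (letter, digit, or underscore); `\s` is any whitespace. The negated class therefore matches everything else, mainly punctuation and symbols, and replacing those matches with `''` deletes them.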
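
The same normalization can be tried on a single string with `re.sub`. The first input is one of the summaries shown in the dataset preview; the `snake_case!` string is a made-up example to show that underscores survive.

```python
import re

summary = "It's an SM57, what's there to say?"

# Lower-case first, then delete every character that is neither a word
# character nor whitespace.
normalized = re.sub(r'[^\w\s]', '', summary.lower())
print(normalized)

# The underscore counts as a word character, so it is kept.
kept = re.sub(r'[^\w\s]', '', 'snake_case!')
print(kept)
```

Note that apostrophes are deleted rather than replaced by a space, so `it's` becomes `its`.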

In [24]:
review_df.head()

Unnamed: 0,reviewerID,reviewText,overall,summary
0,A2IBPI20UZIR0U,"Not much to write about here, but it does exac...",5,good
1,A14VAT5EAX3D9S,The product does exactly as it should and is q...,5,Jake
2,A195EZSQDW3E21,The primary job of this device is to block the...,5,It Does The Job Well
3,A2C00NNG1ZQQG2,Nice windscreen protects my MXL mic and preven...,5,GOOD WINDSCREEN FOR THE MONEY
4,A94QU4C90B1AX,This pop filter is great. It looks and perform...,5,No more pops when I record my vocals.


In [25]:
# don't change this cell -- it is needed for the tests to work
out_bigrams_text = bigram_counts(pd.read_csv('data/musical_instruments_reviews.csv'), 'reviewText')
out_bigrams_summary = bigram_counts(pd.read_csv('data/musical_instruments_reviews.csv'), 'summary')

### Bag-of-Words

- The bag of words model represents texts (e.g. review, summary) as vectors of word counts.
- It is called 'bag of words' because it doesn't consider order.
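
A minimal sketch of what "doesn't consider order" means, using two made-up sentences. It uses `collections.Counter` from the standard library instead of `value_counts`, purely to keep the example dependency-free; `bigram_bag` is a hypothetical helper.

```python
from collections import Counter

# Two hypothetical sentences with the same words in a different order.
a = Counter('dog bites man'.split())
b = Counter('man bites dog'.split())

# The unigram bags are identical, so the model cannot tell the sentences apart.
same_unigrams = (a == b)

def bigram_bag(text):
    """Count adjacent word pairs (hypothetical helper)."""
    words = text.split()
    return Counter(zip(words, words[1:]))

# Bigram bags keep some local order, so these two differ.
same_bigrams = bigram_bag('dog bites man') == bigram_bag('man bites dog')

print(same_unigrams, same_bigrams)
```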

### Creating the Bag-of-Words Count Matrix

Let's create a BoW count matrix of 'summary' using 'bi-grams'

In [36]:
out_bigrams_summary = bigram_counts(pd.read_csv('data/musical_instruments_reviews.csv'), 'summary')
out_bigrams_summary

(for, the)          8
(guitar, cable)     7
(the, best)         6
(it, works)         6
(good, quality)     5
                   ..
(as, advertised)    1
(low, cost)         1
(its, purpose)      1
(serves, its)       1
(stand, by)         1
Length: 559, dtype: int64

In [37]:
reviews = review_df['summary'].str.lower().str.replace(r'[^\w\s]', '', regex=True)
# reviews = reviews.tolist()

# Keep only the k most frequent bigrams as columns; a smaller k gives a less sparse matrix.
# Here k exceeds the 559 distinct bigrams, so all of them are kept.
k = 1000

counts_dict = {}
for bigram in out_bigrams_summary.index[:k]:
    bigram = ' '.join(bigram)
    regex_pattern = fr'\b{bigram}\b'
    counts_dict[bigram] = reviews.str.count(regex_pattern).astype(int).tolist()
    
counts_df = pd.DataFrame(counts_dict)
counts_df

Unnamed: 0,for the,guitar cable,the best,it works,good quality,the job,good for,the price,so far,works great,...,well built,quality guitar,nice high,for practice,perfect for,as advertised,low cost,its purpose,serves its,stand by
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
196,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
197,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
198,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [38]:
counts_df = pd.concat([reviews.to_frame(), counts_df], axis=1).set_index('summary')
counts_df

summary,for the,guitar cable,the best,it works,good quality,the job,good for,the price,so far,works great,...,well built,quality guitar,nice high,for practice,perfect for,as advertised,low cost,its purpose,serves its,stand by
good,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
jake,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
it does the job well,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
good windscreen for the money,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
no more pops when i record my vocals,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
best in price range,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
industry standard,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
classic the last of shures good mics,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
its an sm57 whats there to say,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### TF-IDF (Term Frequency - Inverse Document Frequency)

- Addresses the BoW drawback of giving high weight to common words
- TF-IDF tries to **give high weight to words that are frequent in a particular document but rare in the rest of the corpus**
- For comparison, BoW is simply TF


- TF-IDF = Term Frequency * Inverse Document Frequency
    - TF depends on the term and the individual document
    - IDF depends on the term and the whole corpus
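
The product above can be sketched numerically. The notebook does not fix a formula, so this sketch assumes one common variant: TF is the term count divided by the document length, and IDF is `log(N / df)` with `N` documents and `df` documents containing the term. The three documents and the helper `tfidf` are hypothetical; other variants (for example smoothed IDF) give different numbers.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, made up for illustration.
docs = ['good cable', 'good mic', 'good sturdy stand']
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for words in tokenized for word in set(words))

def tfidf(word, words):
    """TF-IDF under the assumed definitions: (count / length) * log(N / df)."""
    tf = words.count(word) / len(words)
    idf = math.log(n_docs / doc_freq[word])
    return tf * idf

scores = [{w: round(tfidf(w, words), 3) for w in set(words)} for words in tokenized]
print(scores)
```

Under these assumptions, `good` appears in every document, so its IDF is `log(1) = 0` and its score vanishes, while `cable` keeps a positive weight in the first document.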

## Additional "good to know" Regex patterns

- **Word boundaries** `\b` let you match whole words without grabbing prefixes/suffixes.
- **Raw strings** `r''` save you from double‑escaping backslashes in Python.
- Use **quantifiers**: `?` (0 or 1), `*` (0 or more), `+` (1 or more), `{m,n}` (between m and n).
- **Character classes** `[abc]` match a, b, or c; `[^abc]` matches anything except those.
- **Capturing groups** `(...)` create back‑references; use `(?:...)` for non‑capturing groups.
- **Look‑arounds** `(?=...)` (look‑ahead) and `(?<!...)` (negative look‑behind) match without consuming.
- Test patterns quickly on <https://regex101.com/> and use the generated explanation.
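
A few of the patterns above, sketched on made-up strings. The inputs are hypothetical and chosen only to contrast capturing groups, non-capturing groups, and look-arounds.

```python
import re

specs = '512 GB SSD, 1 TB HDD'

# A capturing group makes findall return only the group contents.
captured = re.findall(r'\d+ (GB|TB)', specs)

# A non-capturing group makes findall return the whole match.
whole = re.findall(r'\d+ (?:GB|TB)', specs)

# Look-ahead: digits that are followed by ' GB', without consuming ' GB'.
sizes_gb = re.findall(r'\d+(?= GB)', specs)

# Negative look-behind: whole numbers that are not preceded by a dollar sign.
not_prices = re.findall(r'(?<!\$)\b\d+\b', 'cost $20 for 3 items')

print(captured, whole, sizes_gb, not_prices)
```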