# Textual Data

A textual data set consists of multiple texts. Each text is called a **document**. The collection of texts is called a **corpus**.

The documents of a corpus are usually stored in separate files.

In this notebook we will use the books of Dr. Seuss.

**seuss_dir** -> Stores the URL of the web directory where the Dr. Seuss books are located.

**seuss_files** -> A list containing the filenames of the Dr. Seuss books in that directory.

In [1]:
seuss_dir = "http://dlsun.github.io/pods/data/drseuss/"
seuss_files = [
    "green_eggs_and_ham.txt", "cat_in_the_hat.txt",
    "fox_in_socks.txt", "how_the_grinch_stole_christmas.txt",
    "hop_on_pop.txt", "horton_hears_a_who.txt",
    "oh_the_places_youll_go.txt", "one_fish_two_fish.txt"]

In [5]:
import requests

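# Download each book and store its text, keyed by filename.
docs = {}
for filename in seuss_files:
    response = requests.get(seuss_dir + filename)
    response.raise_for_status()  # stop early if a download failed
    docs[filename] = response.text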


docs

{'green_eggs_and_ham.txt': 'I am Sam\n\nI am Sam\nSam I am\n\nThat Sam-I-am\nThat Sam-I-am!\nI do not like\nthat Sam-I-am\n\nDo you like\ngreen eggs and ham\n\nI do not like them,\nSam-I-am.\nI do not like\ngreen eggs and ham.\n\nWould you like them\nHere or there?\n\nI would not like them\nhere or there.\nI would not like them\nanywhere.\nI do not like\ngreen eggs and ham.\nI do not like them,\nSam-I-am\n\nWould you like them\nin a house?\nWould you like them\nwith a mouse?\n\nI do not like them\nin a house.\nI do not like them\nwith a mouse.\nI do not like them\nhere or there.\nI do not like them\nanywhere.\nI do not like green eggs and ham.\nI do not like them, Sam-I-am.\n\n\nWould you eat them\nin a box?\nWould you eat them\nwith a fox?\n\nNot in a box.\nNot with a fox.\nNot in a house.\nNot with a mouse.\nI would not eat them here or there.\nI would not eat them anywhere.\nI would not eat green eggs and ham.\nI do not like them, Sam-I-am.\n\nWould you? Could you?\nin a car?\nEat t

In [6]:
print(docs["hop_on_pop.txt"])

UP PUP Pup is up.
CUP PUP Pup in cup.
PUP CUP Cup on pup.
MOUSE HOUSE Mouse on house.
HOUSE MOUSE House on mouse.
ALL TALL We all are tall.
ALL SMALL We all are small.
ALL BALL We all play ball.
BALL WALL Up on a wall.
ALL FALL Fall off the wall.
DAY PLAY We play all day.
NIGHT FIGHT We fight all night.HE ME He is after me.
HIM JIM Jim is after him.
SEE BEE We see a bee.
SEE BEE THREE Now we see three.
THREE TREE Three fish in a tree.
Fish in a tree? How can that be?
RED RED They call me Red.
RED BED I am in bed.
RED NED TED and ED in BED
PAT PAT they call him Pat.
PAT SAT Pat sat on hat.
PAT CAT Pat sat on cat.
PAT BAT Pat sat on bat.
NO PAT NO Don’t sit on that.
SAD DAD BAD HAD Dad is sad.
Very, very sad.
He had a bad day. What a day Dad had!
THING THING What is that thing?
THING SING That thing can sing!
SONG LONG A long, long song.
Good-by, Thing. You sing too long.
WALK WALK We like to walk.
WALK TALK We like to talk.
HOP POP We like to hop.
We like to hop on top of Pop.
STOP You 

## Bag-of-Words Model

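In the **bag-of-words model**, each document is represented by its word counts, ignoring word order. In the resulting table, each row is a document, each column represents a word, and the values in the column are the word counts.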

First, we need to count the words in each document.

In [7]:
from collections import Counter

Counter(docs["hop_on_pop.txt"].split())

Counter({'is': 10,
         'on': 10,
         'We': 10,
         'a': 9,
         'He': 6,
         'PAT': 6,
         'Brown': 6,
         'in': 5,
         'all': 5,
         'like': 5,
         'Pup': 4,
         'ALL': 4,
         'RED': 4,
         'and': 4,
         'to': 4,
         'Mr.': 4,
         'PUP': 3,
         'the': 3,
         'can': 3,
         'Pat': 3,
         'sat': 3,
         'What': 3,
         'THING': 3,
         'WALK': 3,
         'of': 3,
         'down.': 3,
         'went': 3,
         'up.': 2,
         'CUP': 2,
         'MOUSE': 2,
         'HOUSE': 2,
         'are': 2,
         'BALL': 2,
         'play': 2,
         'wall.': 2,
         'day.': 2,
         'after': 2,
         'SEE': 2,
         'BEE': 2,
         'see': 2,
         'THREE': 2,
         'that': 2,
         'They': 2,
         'call': 2,
         'me': 2,
         'BED': 2,
         'I': 2,
         'him': 2,
         'NO': 2,
         'Dad': 2,
         'sad.': 2,
         'That

Now we need to stack these counts into a `DataFrame`.

The thing we are creating is called the **term-frequency matrix**.

In [10]:
import pandas as pd

pd.DataFrame(
    [Counter(doc.split()) for doc in docs.values()],
    index = docs.keys()
).fillna(0)

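| | I | am | Sam | That | Sam-I-am | Sam-I-am! | do | not | like | that | ... | Gack | park | home, | Clark. | grow | sleep | Zeep. | gone. | Tomorrow | one. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| green_eggs_and_ham.txt | 71 | 3.0 | 3.0 | 2.0 | 4.0 | 2.0 | 34.0 | 46 | 44.0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| cat_in_the_hat.txt | 48 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 13.0 | 27 | 13.0 | 16 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| fox_in_socks.txt | 9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 1 | 1.0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| how_the_grinch_stole_christmas.txt | 6 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 2.0 | 1 | 2.0 | 11 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| hop_on_pop.txt | 2 | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 2 | 5.0 | 2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| horton_hears_a_who.txt | 18 | 1.0 | 0.0 | 7.0 | 0.0 | 0.0 | 0.0 | 3 | 0.0 | 24 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| oh_the_places_youll_go.txt | 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 6 | 1.0 | 11 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| one_fish_two_fish.txt | 48 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 9 | 21.0 | 1 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 |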
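
To make the structure of a term-frequency matrix concrete, here is a minimal sketch that builds one by hand with plain Python. The two documents in `toy_docs` are made up for illustration and are not part of the Dr. Seuss corpus.

```python
from collections import Counter

# Two made-up documents, used only for illustration.
toy_docs = {
    "doc_a": "sam likes green eggs",
    "doc_b": "sam likes ham and ham",
}

# Count the whitespace-separated words in each document.
counts = {name: Counter(text.split()) for name, text in toy_docs.items()}

# The vocabulary is the sorted set of all words seen in the corpus.
vocabulary = sorted(set().union(*counts.values()))

# One row per document, one column per vocabulary word.
# A Counter returns 0 for words that do not occur in a document.
tf_matrix = [[counts[name][word] for word in vocabulary] for name in toy_docs]

print(vocabulary)
for name, row in zip(toy_docs, tf_matrix):
    print(name, row)
```

Each row has one entry per vocabulary word, which is why most entries are zero once the vocabulary gets large.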


Alternatively, we can use `CountVectorizer` in **Scikit-Learn** to produce a term-frequency matrix.

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(docs.values())
vec.transform(docs.values())

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 2308 stored elements and shape (8, 1344)>

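`CountVectorizer` creates one column for each **unique token** that actually appears in the corpus, so the matrix has 1344 columns. The 2308 stored elements are the nonzero counts: of the 8 × 1344 = 10752 entries in the matrix, only about 21% are nonzero, which is why the result is returned as a sparse matrix.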
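
The relationship between the shape and the number of stored elements can be checked on a small example. This is a sketch on three made-up documents (`toy_docs` is an assumption for illustration); `shape` and `nnz` are standard attributes of SciPy sparse matrices.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up documents, used only for illustration.
toy_docs = ["the cat sat", "the cat sat on the hat", "green eggs and ham"]

vec = CountVectorizer()
X = vec.fit_transform(toy_docs)  # fit and transform in one step

n_docs, n_tokens = X.shape       # rows = documents, columns = unique tokens
density = X.nnz / (n_docs * n_tokens)  # fraction of entries that are nonzero

print(X.shape, X.nnz, round(density, 3))
print(X.toarray())               # dense view; only sensible for small matrices
```

Only the nonzero counts are stored, so memory use depends on `nnz` rather than on the full rows × columns size.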

The set of words across a corpus is called the **vocabulary**. We can view the vocabulary in a fitted `CountVectorizer` as follows:

In [14]:
vec.vocabulary_

{'am': 23,
 'sam': 935,
 'that': 1138,
 'do': 287,
 'not': 767,
 'like': 644,
 'you': 1336,
 'green': 471,
 'eggs': 326,
 'and': 26,
 'ham': 495,
 'them': 1141,
 'would': 1316,
 'here': 526,
 'or': 786,
 'there': 1143,
 'anywhere': 32,
 'in': 576,
 'house': 558,
 'with': 1303,
 'mouse': 722,
 'eat': 323,
 'box': 132,
 'fox': 419,
 'could': 242,
 'car': 179,
 'they': 1145,
 'are': 35,
 'may': 688,
 'will': 1292,
 'see': 953,
 'tree': 1204,
 'let': 635,
 'me': 691,
 'be': 62,
 'mot': 718,
 'train': 1202,
 'on': 778,
 'say': 944,
 'the': 1139,
 'dark': 265,
 'rain': 884,
 'goat': 453,
 'boat': 118,
 'so': 1035,
 'try': 1213,
 'if': 575,
 'good': 459,
 'thank': 1136,
 'sun': 1107,
 'did': 279,
 'shine': 972,
 'it': 586,
 'was': 1255,
 'too': 1188,
 'wet': 1268,
 'to': 1178,
 'play': 836,
 'we': 1261,
 'sat': 940,
 'all': 16,
 'cold': 231,
 'day': 270,
 'sally': 934,
 'two': 1220,
 'said': 932,
 'how': 560,
 'wish': 1302,
 'had': 488,
 'something': 1042,
 'go': 452,
 'out': 789,
 'ball': 50

The number here represents the column index in the matrix!

(So column 23 contains the counts for "am", etc.)

## Text Normalization

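Recall that the raw counts for *Hop on Pop* treated `UP`, `up.`, `PUP`, and `Pup` as different words:

`Counter({'UP': 1, 'PUP': 3, 'Pup': 4, 'is': 10, 'up.': 2, ...})`

It's usually good to **normalize** for punctuation and capitalization. Normalization options are specified when you initialize the `CountVectorizer`. By default, Scikit-Learn strips punctuation and converts all characters to lowercase. Its default token pattern also drops single-character tokens such as "I" and "a".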

If you don't want Scikit-Learn to normalize for punctuation and capitalization, you can do the following:

- **Pass the parameter `lowercase=False`** -> Does not convert words to lowercase, so the original casing of the text is kept.
- **Pass the parameter `token_pattern=r"[\S]+"`** -> Treats anything separated by whitespace as a token (word). The `r` prefix marks a raw string. Punctuation, numbers, and symbols are all kept as part of the tokens.

In [15]:
vec = CountVectorizer(lowercase=False, token_pattern=r"[\S]+")
vec.fit(docs.values())
vec.transform(docs.values())

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 3679 stored elements and shape (8, 2562)>

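With normalization turned off, the vocabulary grows from 1344 to 2562 tokens, because words that differ only in capitalization or attached punctuation are now counted separately.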
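
The effect is easier to see on a single line of text. The sketch below fits both configurations on the first line of *Hop on Pop* and compares the resulting vocabularies; the variable names are chosen here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# First line of hop_on_pop.txt, wrapped in a list because fit() expects documents.
line = ["UP PUP Pup is up."]

# Default settings: lowercase everything, strip punctuation.
default_vec = CountVectorizer().fit(line)

# No normalization: keep casing, split on whitespace only.
raw_vec = CountVectorizer(lowercase=False, token_pattern=r"[\S]+").fit(line)

print(sorted(default_vec.vocabulary_))
print(sorted(raw_vec.vocabulary_))
```

With the defaults, the five words collapse into three tokens; without normalization, all five stay distinct.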


Bag-of-words is easy to understand and easy to implement, but it has **disadvantages**.

1. **Loses word order**

   It ignores grammar and sequence, so "dog bites man" and "man bites dog" get the same representation.

2. **Loses context and meaning**

   It doesn't understand synonyms or context. "good" and "excellent" are treated as unrelated words.

3. **High dimensionality**

   Every unique word becomes a feature, producing huge, sparse vectors.

4. **Doesn't handle unseen words well**

   Words in test data that are not in the fitted vocabulary have no column, so `CountVectorizer` silently ignores them.

5. **No understanding of word importance**

   Common words can dominate without additional techniques like TF-IDF.

6. **Sensitive to vocabulary noise**

   Misspellings, punctuation, and capitalization create separate tokens unless preprocessing is strong.

## N-grams

An **n-gram** is a sequence of n consecutive words. N-grams allow us to capture more of the meaning.

For example, if we count **bigrams (2-grams)** instead of words, we can distinguish the following two documents, which contain exactly the same words (ignoring capitalization):

1. "The dog bit her owner."
2. "Her dog bit the owner."
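
A minimal sketch of this comparison in plain Python follows. The helpers `tokenize` and `ngrams` are written here for illustration and are not part of any library.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep only runs of word characters."""
    return re.findall(r"\w+", text.lower())

def ngrams(tokens, n):
    """Return all sequences of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

doc1 = "The dog bit her owner."
doc2 = "Her dog bit the owner."

# Unigram (single-word) counts are identical for the two documents.
unigrams1, unigrams2 = Counter(tokenize(doc1)), Counter(tokenize(doc2))

# Bigram counts differ, so the documents can now be told apart.
bigrams1 = Counter(ngrams(tokenize(doc1), 2))
bigrams2 = Counter(ngrams(tokenize(doc2), 2))

print(unigrams1 == unigrams2)
print(bigrams1 == bigrams2)
print(set(bigrams1) & set(bigrams2))  # bigrams shared by both documents
```

The only bigram the two documents share is `("dog", "bit")`; the other bigrams depend on who owns the dog.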

**Google Books Ngram Viewer:**

A tool that shows how often words or phrases appear in books over time, helping you see how language and trends change across the years.

- Uses Google Books data
- Shows word/phrase frequency over time
- Helps analyze language trends
- Allows historical comparisons
- Displays results as a time-series graph

In Scikit-Learn, we can create n-grams with `CountVectorizer` by passing the `ngram_range` parameter, a `(min_n, max_n)` tuple.

So to get bigrams, we set the range (2, 2):

In [16]:
vec = CountVectorizer(ngram_range=(2, 2))

vec.fit(docs.values())
vec.transform(docs.values())

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 6459 stored elements and shape (8, 5846)>

Setting `ngram_range` to (1, 2) gives us both unigrams (words) and bigrams.

In [17]:
vec = CountVectorizer(ngram_range=(1, 2))

vec.fit(docs.values())
vec.transform(docs.values())

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 8767 stored elements and shape (8, 7190)>

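But as we can see, the number of columns keeps growing: from 1344 for unigrams to 5846 for bigrams and 7190 for both combined, so the matrix takes more and more memory.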
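
The same growth shows up even on a tiny corpus. The sketch below, using the two dog-and-owner sentences as an assumed toy corpus, compares vocabulary sizes for the three `ngram_range` settings used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, used only for illustration.
toy_docs = ["The dog bit her owner.", "Her dog bit the owner."]

vocab_sizes = {}
for ngram_range in [(1, 1), (2, 2), (1, 2)]:
    vec = CountVectorizer(ngram_range=ngram_range)
    vec.fit(toy_docs)
    # The number of columns equals the number of distinct n-grams.
    vocab_sizes[ngram_range] = len(vec.vocabulary_)

print(vocab_sizes)
```

With `ngram_range=(1, 2)`, the vocabulary is the union of the unigram and bigram vocabularies, so its size is the sum of the two.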