# NLP Modeling

How do we quantify a document?

- [Setup](#setup)
- [Data Representation](#data-representation)
    - [Bag of Words](#bag-of-words)
    - [TF-IDF](#tf-idf)
    - [Bag Of Ngrams](#bag-of-ngrams)
- [Modeling](#modeling)
    - [Modeling Results](#modeling-results)
- [Next Steps](#next-steps)

## Setup

In [2]:
from pprint import pprint
import pandas as pd
import nltk
import re
import acquire
import prepare

def clean(text: str) -> list:
    'A simple function to clean up text data'
    wnl = nltk.stem.WordNetLemmatizer()
    stopwords = set(nltk.corpus.stopwords.words('english'))
    text = (text.encode('ascii', 'ignore')
             .decode('utf-8', 'ignore')
             .lower())
    words = re.sub(r'[^\w\s]', '', text).split() # tokenization
    return [wnl.lemmatize(word) for word in words if word not in stopwords]

In [3]:
blogs = acquire.get_blog_articles()
news = acquire.get_news_articles()



In [4]:
blogs = prepare.clean_df(blogs, ['content'])
news = prepare.clean_df(news, ['content'])

In [5]:
blogs.to_csv('blogs.csv')
news.to_csv('news.csv')

## Data Representation

Simple data for demonstration.

In [6]:
data = [
    'Python is pretty cool',
    'Python is a nice programming language with nice syntax',
    'I think SQL is cool too',
]

In [7]:
pprint(data)

['Python is pretty cool',
 'Python is a nice programming language with nice syntax',
 'I think SQL is cool too']


### Bag of Words

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
bag_of_words = cv.fit_transform(data)

In [9]:
bag_of_words

<3x12 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>

In [10]:
bag_of_words.todense()

matrix([[1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
        [0, 1, 1, 2, 0, 1, 1, 0, 1, 0, 0, 1],
        [1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0]])

Here `bag_of_words` is a **sparse matrix**. Usually you should keep it as such,
but for demonstration we'll view the data within.

In [12]:
cv.get_feature_names_out()

array(['cool', 'is', 'language', 'nice', 'pretty', 'programming',
       'python', 'sql', 'syntax', 'think', 'too', 'with'], dtype=object)

In [13]:
cv.vocabulary_

{'python': 6,
 'is': 1,
 'pretty': 4,
 'cool': 0,
 'nice': 3,
 'programming': 5,
 'language': 2,
 'with': 11,
 'syntax': 8,
 'think': 9,
 'sql': 7,
 'too': 10}
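
To make the transformation less of a black box, here is a minimal standard-library sketch that reproduces the counts above. It assumes the vectorizer's default behavior is to lowercase the text and keep only tokens of two or more word characters, which would explain why `a` and `I` are missing from the vocabulary. The `tokenize` helper and the regular expression are approximations written for this example, not the library's internals.

```python
import re
from collections import Counter

data = [
    'Python is pretty cool',
    'Python is a nice programming language with nice syntax',
    'I think SQL is cool too',
]

# Assumption: lowercase, then keep tokens made of 2+ word characters.
TOKEN_PATTERN = re.compile(r'\b\w\w+\b')

def tokenize(doc):
    return TOKEN_PATTERN.findall(doc.lower())

# The vocabulary is the sorted set of all tokens in the corpus.
vocabulary = sorted({token for doc in data for token in tokenize(doc)})

# One row per document, one column per vocabulary word.
counts = [Counter(tokenize(doc)) for doc in data]
matrix = [[doc_counts[term] for term in vocabulary] for doc_counts in counts]

print(vocabulary)
for row in matrix:
    print(row)
```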

In [18]:
# Take a look at the bag of words transformation for education and diagnostics.
# In practice this is not necessary, and the resulting data might be too big to be reasonably helpful.
bow = pd.DataFrame(bag_of_words.todense(),
                   columns = cv.get_feature_names_out())

In [19]:
bow

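| | cool | is | language | nice | pretty | programming | python | sql | syntax | think | too | with |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 | 2 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |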


In [20]:
pprint(data)

['Python is pretty cool',
 'Python is a nice programming language with nice syntax',
 'I think SQL is cool too']


In [21]:
bow.apply(lambda row: row/row.sum(), axis=1)

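| | cool | is | language | nice | pretty | programming | python | sql | syntax | think | too | with |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.25 | 0.25 | 0.0 | 0.0 | 0.25 | 0.0 | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.125 | 0.125 | 0.25 | 0.0 | 0.125 | 0.125 | 0.0 | 0.125 | 0.0 | 0.0 | 0.125 |
| 2 | 0.2 | 0.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.2 | 0.0 | 0.2 | 0.2 | 0.0 |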


### TF-IDF

- term frequency - inverse document frequency
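- conceptually, $\text{tf} \times \text{idf} \approx \frac{\text{tf}}{\text{df}}$; in practice the idf term is log-scaled (and often smoothed) rather than a plain reciprocal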
- a measure that helps identify how important a word is in a document
- combination of how often a word appears in a document (**tf**) and how unique the word
  is among documents (**idf**)
- used by search engines
- naturally helps filter out stopwords
- tf is for a single document, idf is for a corpus

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
bag_of_words = tfidf.fit_transform(data)
pprint(data)
pd.DataFrame(bag_of_words.todense(), columns=tfidf.get_feature_names_out())

['Python is pretty cool',
 'Python is a nice programming language with nice syntax',
 'I think SQL is cool too']


| | cool | is | language | nice | pretty | programming | python | sql | syntax | think | too | with |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.480458 | 0.373119 | 0.0 | 0.0 | 0.631745 | 0.0 | 0.480458 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.197673 | 0.334689 | 0.669378 | 0.0 | 0.334689 | 0.25454 | 0.0 | 0.334689 | 0.0 | 0.0 | 0.334689 |
| 2 | 0.38377 | 0.298032 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.504611 | 0.0 | 0.504611 | 0.504611 | 0.0 |


To get the idf score for each word (these aren't terribly useful by themselves):

In [29]:
pd.Series(
    dict(
        zip(
            tfidf.get_feature_names_out(),
            tfidf.idf_
        )
    )
)

cool           1.287682
is             1.000000
language       1.693147
nice           1.693147
pretty         1.693147
programming    1.693147
python         1.287682
sql            1.693147
syntax         1.693147
think          1.693147
too            1.693147
with           1.693147
dtype: float64
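
The idf values above, and the weights in the TF-IDF table, are consistent with a smoothed idf of $\ln\frac{1 + n}{1 + \text{df}} + 1$ (where $n$ is the number of documents) followed by scaling each row to unit length. The sketch below recomputes them by hand with the standard library under that assumption. The `tokenize` helper is an approximation of the default tokenization written for this example, and `tfidf_row` is a hypothetical helper name.

```python
import math
import re

data = [
    'Python is pretty cool',
    'Python is a nice programming language with nice syntax',
    'I think SQL is cool too',
]

def tokenize(doc):
    # Assumption: lowercase, keep tokens of 2+ word characters.
    return re.findall(r'\b\w\w+\b', doc.lower())

tokens = [tokenize(doc) for doc in data]
n_docs = len(tokens)
vocabulary = sorted({term for doc in tokens for term in doc})

# Document frequency: in how many documents does each term appear?
df = {term: sum(term in doc for doc in tokens) for term in vocabulary}

# Smoothed idf (assumed to match the values printed above).
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_row(doc_tokens):
    # Raw weight is term count times idf, then the row is L2-normalized.
    raw = {term: doc_tokens.count(term) * idf[term] for term in set(doc_tokens)}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

first_row = tfidf_row(tokens[0])
print({term: round(value, 6) for term, value in idf.items()})
print({term: round(value, 6) for term, value in first_row.items()})
```

A word that appears in every document (`is`) gets the minimum idf of 1, while a word that appears in only one document gets the maximum.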

### Bag Of Ngrams

For either `CountVectorizer` or `TfidfVectorizer`, you can set the `ngram_range`
parameter.

In [31]:
cv = CountVectorizer(ngram_range=(2,2))
bag_of_grams = cv.fit_transform(data)

In [32]:
pd.DataFrame(bag_of_grams.todense(), columns=cv.get_feature_names_out())

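| | cool too | is cool | is nice | is pretty | language with | nice programming | nice syntax | pretty cool | programming language | python is | sql is | think sql | with nice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
| 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |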
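
For intuition about what an n-gram is, the following standard-library sketch builds bigrams from an already tokenized sentence by sliding a window of size `n` over the tokens. The `ngrams` helper is written for this example and is not part of scikit-learn.

```python
def ngrams(tokens, n):
    # Join every run of n consecutive tokens into a single feature name.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'python is pretty cool'.split()

bigrams = ngrams(tokens, 2)
print(bigrams)

# Roughly what ngram_range=(1, 2) would produce: single words plus bigrams.
unigrams_and_bigrams = ngrams(tokens, 1) + ngrams(tokens, 2)
print(unigrams_and_bigrams)
```

The three bigrams for this sentence are the same three columns marked `1` in row 0 of the table above.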


## Modeling

In [40]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = news.copy()
df.head()

| | title | content | category | content_clean |
|---|---|---|---|---|
| 0 | India's first Billiards Premier League | The Billiards and Snooker Association of Mahar... | india | billiards snooker association maharashtrabsam ... |
| 1 | Oldest woman in India passes away | Kunjannam, a 112-yr-old woman from Parannur (K... | india | kunjannam 112yrold woman parannur kerala wa de... |
| 2 | AAP drops Rajouri Garden candidate, a week bef... | Only a week before Delhi Assembly polls, Aam A... | india | week delhi assembly poll aam aadmi party tuesd... |
| 3 | Samsung launches Galaxy Star 2 Plus at Rs.7,335 | Samsung has unveiled the Galaxy start 2 Plus s... | india | samsung ha unveiled galaxy start 2 plus smartp... |
| 4 | Bharti Airtel rakes in 61% profit | Bharti Airtel, India's top telecommunications ... | india | bharti airtel india ' top telecommunication co... |


In [41]:
X = df.content_clean
y = df.category
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=123)

In [42]:
X_train.head()

238    air pollution particle found developing lung b...
270    honda motor lg energy solution tuesday said bu...
53     spain said sending riot police qatar help safe...
192    hypersocial ceo braden wallake gone viral cry ...
195    audience kushstock festival u ' california wa ...
Name: content_clean, dtype: object

In [43]:
y_train.head()

238          science
270       automobile
53            sports
192    miscellaneous
195    miscellaneous
Name: category, dtype: object

Iterate:

- try out the bag of ngrams
- try out different ways of text prep (stem vs lemmatize)
- etc...

In [65]:
# Whatever transformations we apply to X_train need to be applied to X_test
cv = CountVectorizer()
X_bow = cv.fit_transform(X_train)
tree = DecisionTreeClassifier(max_depth=5)
tree.fit(X_bow, y_train)
tree.score(X_bow, y_train)

0.35526315789473684

In [66]:
X_test_bow = cv.transform(X_test)
# tree.transform(X_test_bow, y_test)
tree.score(X_test_bow, y_test)

0.24561403508771928

In [72]:
tv = TfidfVectorizer()
X_bow = tv.fit_transform(X_train)
tree = DecisionTreeClassifier(max_depth=5)
tree.fit(X_bow, y_train)
tree.score(X_bow, y_train)

0.3508771929824561

In [75]:
X_test_bow = tv.transform(X_test)
tree.score(X_test_bow, y_test)

0.24561403508771928

### Modeling Results

A super-useful feature of decision trees and linear models is that they do some
built-in feature selection, through feature importances and coefficients
respectively:

In [71]:
pd.Series(
    dict(
        zip(cv.get_feature_names_out(),
            tree.feature_importances_))).sort_values(ascending=False)

cup           0.249580
vehicle       0.194098
leader        0.173087
startup       0.169490
researcher    0.156474
                ...   
enough        0.000000
enriching     0.000000
ensure        0.000000
enter         0.000000
zulfiqar      0.000000
Length: 3619, dtype: float64

## Next Steps

- Try other model types

    [Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)
    ([`sklearn`
    docs](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html))
    is a very popular classifier for NLP tasks.

- Look at other metrics: is accuracy the best choice here?

- Try ngrams instead of single words

- Try a combination of ngrams and words (`ngram_range=(1, 2)` for words and
  bigrams)

- Try using tf-idf instead of bag of words

- Combine the top `n` performing words with the other features that you have
  engineered (the `CountVectorizer` and `TfidfVectorizer` have a `vocabulary`
  argument you can use to restrict the words used)
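
As a starting point for several of these ideas at once, here is a minimal sketch that chains a tf-idf vectorizer over words and bigrams with a Naive Bayes classifier and prints per-class precision and recall instead of accuracy alone. The corpus is a hypothetical toy stand-in for `X_train`/`X_test`, and wrapping the steps in a pipeline is one option among several, chosen here because it applies the fitted vectorizer to the test data automatically.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the cleaned news articles.
train_docs = [
    'team win world cup final',
    'striker score goal cup match',
    'startup raise funding investor',
    'company launch startup product',
]
train_labels = ['sports', 'sports', 'business', 'business']
test_docs = ['cup final goal', 'investor back startup']
test_labels = ['sports', 'business']

# The vectorizer is fit on the training text only, then reused for prediction.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # words and bigrams
    MultinomialNB(),
)
model.fit(train_docs, train_labels)

predictions = model.predict(test_docs)
print(classification_report(test_labels, predictions, zero_division=0))
```

With the real data, swapping `train_docs` and `test_docs` for `X_train` and `X_test` should be enough to compare this against the decision tree scores above.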