# Introduction to NLP

### Introduction

In this lesson, we'll begin our journey with NLP. Our first step is to understand some of the terms that will allow us to identify the components of an NLP problem. Let's get started.

### Loading our Data

We'll begin by loading the training subset of the 20 newsgroups dataset from sklearn.

In [8]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

X = newsgroups_train['data']
y = newsgroups_train['target']

The data consists of different posts to an Internet message board.

In [13]:
X[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

The task is to classify these posts into different categories, where each number corresponds to a different category.

In [15]:
y[:3]

array([7, 4, 4])

> The categories range from sports, to cars, to religion.

In [24]:
newsgroups_train['target_names'][:5]

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware']
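
To read a numeric target as a category name, we can index into the list of target names. The following is a minimal sketch that runs without downloading the dataset: `target_names` copies the five names shown above, while `toy_targets` is a hypothetical list of labels and not the real `y`.

```python
# The first five category names, copied from the output above.
target_names = [
    "alt.atheism",
    "comp.graphics",
    "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware",
    "comp.sys.mac.hardware",
]

# Hypothetical labels for illustration only (not the real y).
toy_targets = [4, 1, 4]

# Each number is an index into the list of category names.
toy_labels = [target_names[i] for i in toy_targets]
print(toy_labels)
```

With the real data, the same lookup would be `newsgroups_train['target_names'][y[0]]`.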

### Defining Terms

Now in Natural Language Processing, we have different terms to refer to the components of our dataset.

* **Document** - Each row in our dataset is a different document.

Below is our first document.

In [28]:
X[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

Now each unique word in a document is called a **term**.

In [27]:
len(set(X[0]))

60

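> Note that `set(X[0])` iterates over the string one character at a time, so the 60 above is the number of distinct characters in the first document, not its number of terms. To count terms, we would first split the document into words, for example with `X[0].split()`.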
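
Counting terms requires splitting a document into words before taking the set. Here is a minimal sketch, using the first sentence of the post above as a stand-in document. The regular-expression tokenizer and the lowercasing are assumptions made for illustration, not the lesson's own method.

```python
import re

# Stand-in document: the first sentence of the post shown above.
sentence = (
    "I was wondering if anyone out there could enlighten me "
    "on this car I saw the other day."
)

# Assumed tokenizer: lowercase, then keep runs of letters, digits, and apostrophes.
tokens = re.findall(r"[a-z0-9']+", sentence.lower())

# Terms are the unique words in the document.
terms = set(tokens)

print(len(tokens))  # total words in the sentence
print(len(terms))   # unique words (terms); "i" appears twice
```

Calling `set` directly on the string would instead collect its distinct characters.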

Finally, the collection of all of the documents in our dataset is called a **corpus**. The set of unique terms across the entire corpus is called the vocabulary.
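
To see how documents, terms, and the corpus fit together, here is a small sketch on a hypothetical three-document corpus. The name `toy_corpus` and its sentences are made up for illustration, and whitespace splitting stands in for a real tokenizer. In the usual terminology, the corpus is the collection of documents, and the set of unique terms across it is often called the vocabulary.

```python
# Hypothetical corpus: each string is one document.
toy_corpus = [
    "the car is fast",
    "the engine is loud",
    "the car is loud",
]

# Split every document into words (simple whitespace tokenization).
tokenized = [document.split() for document in toy_corpus]

# The terms of each document are its unique words.
terms_per_document = [set(tokens) for tokens in tokenized]

# The unique terms across the whole corpus.
vocabulary = sorted(set().union(*terms_per_document))

print(len(toy_corpus))  # number of documents
print(vocabulary)
```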

### Summary

In this lesson, we loaded the newsgroups dataset and defined the terms document, term, and corpus. These terms let us describe TF-IDF: Term Frequency Inverse Document Frequency simply multiplies those two components.

Term frequency is the number of times a given word occurs in a document, divided by the length of the document.  
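
The definition of term frequency above can be sketched in a few lines. The function name `term_frequency` and the example sentence are hypothetical, and the document length is taken to be its number of whitespace-separated words.

```python
def term_frequency(term, document_tokens):
    """Occurrences of `term` divided by the number of words in the document."""
    return document_tokens.count(term) / len(document_tokens)

# Hypothetical document, split on whitespace.
tokens = "the car is a fast car".split()

# "car" occurs 2 times out of 6 words.
tf_car = term_frequency("car", tokens)
print(tf_car)
```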

And inverse document frequency captures how rare a word is throughout the corpus.  We can think of this in two steps: 

1. Calculate the frequency of a term throughout the entire corpus, $\frac{num\_t\_in\_C}{C}$, which gives a larger number the more frequent a word is.

2. Invert this number $\frac{C}{num\_t\_in\_C}$
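
The two steps above, and the final multiplication, can be sketched as follows. This is one reading of the formulas, under stated assumptions: $num\_t\_in\_C$ is taken to be the number of documents that contain the term, and $C$ the number of documents in the corpus. The function names and the toy corpus are hypothetical. Many libraries also apply a logarithm and smoothing to the inverted ratio, which this sketch leaves out.

```python
# Hypothetical corpus: each document is already split into words.
corpus = [
    "the car is fast".split(),
    "the engine is loud".split(),
    "the game was close".split(),
]

def term_frequency(term, document):
    """Occurrences of `term` divided by the length of the document."""
    return document.count(term) / len(document)

def inverse_document_frequency(term, corpus):
    """Number of documents divided by the number of documents containing `term`.

    Assumes `term` appears in at least one document (otherwise this divides by zero).
    """
    num_docs_with_term = sum(1 for document in corpus if term in document)
    return len(corpus) / num_docs_with_term

def tf_idf(term, document, corpus):
    """Multiply the two components."""
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)

# "the" is in every document, so it scores low; "car" is rare, so it scores higher.
score_the = tf_idf("the", corpus[0], corpus)
score_car = tf_idf("car", corpus[0], corpus)
print(score_the, score_car)
```

Both words have the same term frequency in the first document, so the difference in score comes entirely from how rare each word is across the corpus.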

### Resources

[Sklearn text data](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)

[ML Mastery Multiclass classification](https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/)