# Text Classification

![nlp](https://wrm5sysfkg-flywheel.netdna-ssl.com/wp-content/uploads/2019/01/NLP-Technology-in-Healthcare.jpg)

The goals of this section are to understand the basics of machine learning and what classification metrics are. We will then look at text feature extraction and use Python with scikit-learn to perform classification on some real data sets.

#### Q: What are we doing here?
A: To conceptualise what we're doing here, we need to understand that it is what is known as _"supervised learning"_. A useful additional text in this area of study is [An Introduction to Statistical Learning](http://faculty.marshall.usc.edu/gareth-james/ISL/), from which the student can get a better grasp of the mathematics involved in machine learning, largely linear algebra and calculus.

#### Q: So, what is machine learning?
A: Essentially it's a method of data analysis that automates analytical model building by using algorithms that iteratively learn from the data they process. Machine learning lets computers identify hidden insights in data without being explicitly programmed to do so.

#### Q: OK, what is it used for?
A: OK, here goes... Fraud detection, web search results, real-time ads on web pages, credit scoring and next best offer, prediction of equipment failure, pricing models, network intrusion detection, recommender engines, customer segmentation, sentiment analysis, churn prediction, pattern and image recognition, spam filtering... Once upon a time they used to say JavaScript was eating the world; now, in late 2020, it's ML that's eating the world.

#### Q: So what is supervised learning?
A: Supervised learning algorithms are trained on what we call 'labeled data', meaning data where the desired output is known. Typically the data in such circumstances is historical and has already been classified, and we work to get the machine to arrive at predictions that we can cross-reference with the real output. In automated terms, the machine can adjust the weights in its algorithm to tweak its processing wherever it sees that the prediction differs from the known output. Supervised learning is commonly used in applications where historical data predicts likely future events.

#### Q: So what does the supervised learning process look like?
A: It is a multi-step process.
1. Data acquisition
2. Data cleaning (formatting, vectorization, feature extraction, null cleaning)
3. Split the data into training data and test data, because you don't want to test your model on data it has already seen and processed. The split ratio can be anything between 65/35 and 85/15. Loosely speaking, the more training data you have, the _better_ the result you're likely to get.
4. Fit your model on the training data and evaluate it on the test data.
5. Repeat steps 3 & 4 until we have a workable result we want to deploy, or abandon the model as not good enough and iterate on corrective steps, starting at step 1 again and improving every step in pursuit of a better result.

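Some common terms that may be encountered in the data collection, cleansing and processing steps are:
- `Ham Vs Spam` - good Vs bad, etc. In spam filtering, `ham` is a legitimate message and `spam` is an unwanted one.
- `x_train & x_test, y_train & y_test` - The data is split into `x` and `y`, like the axes of a graph: `x` is the data (the features) and `y` is the label. Each of the two is then split again into a training part and a test part.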
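
The naming convention above can be sketched with scikit-learn's `train_test_split`. This is a minimal illustration, not a prescribed setup: the eight messages, their labels, the 25% test share and the `random_state` value are all invented for the example.

```python
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: x holds the raw messages, y the labels.
x = [
    "Win a free prize now",
    "Are we still on for lunch?",
    "Cheap loans, apply today",
    "See you at the meeting",
    "Claim your free voucher",
    "Can you send me the report?",
    "You have won a lottery",
    "Happy birthday!",
]
y = ["spam", "ham", "spam", "ham", "spam", "ham", "spam", "ham"]

# Hold back 25% of the rows for testing.
# random_state (an arbitrary choice here) makes the shuffle repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42
)

print(len(x_train), len(x_test))  # 6 training messages, 2 test messages
```

The model only ever sees `x_train` and `y_train` while it is being fitted; `x_test` and `y_test` are kept aside for evaluation.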


# 5.1.0 - Classification Metrics

After our process is complete we use performance metrics to evaluate how our model performed. The key classification metrics we will come to understand are:
- Accuracy
- Recall
- Precision
- F1-Score (combination of recall and precision)

Typically, a model's prediction can only have one of two outcomes: it was correct, or it was wrong. This also holds in situations where you have multiple classes. To keep things concrete, imagine a **binary classification** situation where we only have two classes available.

Keep in mind there are a few steps involved in converting the raw text data into a format that an ML model can understand. Vectorization is the process through which we pass the raw text from `x_test` so that it becomes a vectorized (numerical) version of `x_test`.

#### Q: How do we set up vectorization?
A: We set up this vectorization process in the pipeline, and there are many ways of transforming the raw text into numerical information.
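
One of those ways is a simple word count. The sketch below uses scikit-learn's `CountVectorizer` as an assumed choice (the text above does not name a specific vectorizer), and the three short messages are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented example messages.
x_train = ["free money now", "call me later"]
x_test = ["free call win"]

vectorizer = CountVectorizer()

# Learn the vocabulary from the training text only, then count words.
x_train_vec = vectorizer.fit_transform(x_train)

# Re-use the same vocabulary for the test text; unseen words are ignored.
x_test_vec = vectorizer.transform(x_test)

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)            # ['call', 'free', 'later', 'me', 'money', 'now']
print(x_test_vec.toarray())  # [[1 1 0 0 0 0]]
```

Each column corresponds to one vocabulary word, so the test message becomes a row of counts. The word `win` never appeared in the training text, so it has no column and is dropped.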

Once complete, we run the model on the test data and derive our classifications. At the end we have a count of both correct and incorrect matches. The key is to remember that in the real world **_not all incorrect or correct matches hold equal value_**. This is why there are often several classification metrics: a single metric won't tell the real, or whole, story. So let's look at each and what it is for.

#### Accuracy
Accuracy is the number of correct predictions divided by the total number of predictions. Accuracy is useful when you have a well-balanced data set, where the actual classes occur in roughly equal numbers. Accuracy is a poor choice when the data is heavily weighted towards one class. In our `ham vs spam` example, a set of 95 ham vs 5 spam would make accuracy a bad choice because the set is not balanced: a model that labels everything as ham would still score 95%. A 50/50 split would be the most balanced and would therefore make accuracy a good metric.
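
To see why the 95/5 case is misleading, here is a small plain-Python sketch. The "model" is a hypothetical one that predicts `ham` for every message; it is only there to make the arithmetic visible.

```python
# 95 ham and 5 spam messages, as in the example above.
y_true = ["ham"] * 95 + ["spam"] * 5

# A hypothetical "model" that always answers ham.
y_pred = ["ham"] * 100

# Accuracy: correct predictions divided by all predictions.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# How many of the spam messages did it actually catch?
spam_found = sum(t == "spam" and p == "spam" for t, p in zip(y_true, y_pred))

print(accuracy)    # 0.95
print(spam_found)  # 0
```

The accuracy looks excellent even though not a single spam message was detected.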

Where balance is compromised, making accuracy a poor choice, `recall` and `precision` make useful additions to our metrics to complete the story.

#### Recall
The ability of a model to find **_all_** the relevant cases in a dataset. The precise definition of recall is _"number of true positives divided by the number of true positives plus the number of false negatives"_.

#### Precision
The ability of a classification model to identify **_only_** the relevant data points. Precision is defined as _"the number of true positives divided by the number of true positives plus the number of false positives"_.

**Note:** often you have a trade-off between recall and precision. Recall expresses the ability to find all relevant instances in a dataset. Precision expresses the proportion of the data points our model says was relevant which were _actually_ relevant.
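
The two definitions translate directly into code. In this sketch the counts (8 true positives, 2 false positives, 4 false negatives) are made up, and returning `0.0` when the denominator is zero is an assumed convention rather than part of the definitions.

```python
def precision(tp, fp):
    """True positives divided by everything the model flagged as positive."""
    total_flagged = tp + fp
    return tp / total_flagged if total_flagged else 0.0


def recall(tp, fn):
    """True positives divided by everything that really was positive."""
    total_relevant = tp + fn
    return tp / total_relevant if total_relevant else 0.0


# Made-up counts for illustration.
tp, fp, fn = 8, 2, 4

print(precision(tp, fp))         # 0.8
print(round(recall(tp, fn), 3))  # 0.667
```

Here the model is right 80% of the time when it flags something, but it only finds two thirds of the cases it should have found.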

#### F1-Score
The F1-score combines recall and precision into a single number, often treated as the _optimal blend_ of the two.

$F_1 = 2 * \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$

The F1-score is the **harmonic mean** of precision and recall, so it takes both metrics into account. We use it instead of a simple average because it punishes extreme values. A classifier with a precision of 1.0 and a recall of 0.0 has a simple average of 0.5 but an F1-score of 0.0:

**Step 1:** $F_1 = 2 * \frac{1 \cdot 0}{1 + 0}$ 

**Step 2:** $F_1 = 2 * \frac{0}{1}$ 

**Step 3:** $F_1 = 2 * 0 = 0$
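
The same calculation as a small Python function, compared with the simple average. The guard for the case where both inputs are zero is an assumption added so the sketch never divides by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention: avoid dividing by zero
    return 2 * (precision * recall) / (precision + recall)


def simple_average(precision, recall):
    return (precision + recall) / 2


print(simple_average(1.0, 0.0))  # 0.5
print(f1_score(1.0, 0.0))        # 0.0
print(f1_score(0.5, 0.5))        # 0.5
```

When precision and recall are equal, the F1-score matches the simple average; the further apart they drift, the more the F1-score is pulled towards the lower of the two.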


# 5.2.0 - The Confusion Matrix

Within a classification problem, every data point has two conditions:
- True condition
- Predicted condition

That means at the end of the testing phase there are 4 possible groups:
- Output was `True` and classified as `True` (true positive)
- Output was `True` and classified as `False` (false negative)
- Output was `False` and classified as `True` (false positive)
- Output was `False` and classified as `False` (true negative)
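
As a rough sketch, the four groups can be counted by comparing the true labels with the predictions pair by pair. The two label lists are invented, with `1` standing for `True` and `0` for `False`.

```python
# Invented labels: 1 means True, 0 means False.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))

tp = sum(t == 1 and p == 1 for t, p in pairs)  # True, classified as True
fn = sum(t == 1 and p == 0 for t, p in pairs)  # True, classified as False
fp = sum(t == 0 and p == 1 for t, p in pairs)  # False, classified as True
tn = sum(t == 0 and p == 0 for t, p in pairs)  # False, classified as False

print(tp, fn, fp, tn)  # 3 1 1 3
```

The four counts always add up to the number of predictions, and every metric discussed earlier can be computed from them.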

**Note:** It is important to remember that there are many fancy stats that can be obtained from a confusion matrix but they are all fundamentally ways of comparing predicted Vs true values. What is determined as `good metrics` will often depend on the specific situation to be evaluated.

[Wikipedia for Confusion Matrix](https://en.wikipedia.org/wiki/Confusion_matrix)

![](https://www.dataschool.io/content/images/2015/01/confusion_matrix2.png)

![](https://i2.wp.com/softwareengineeringdaily.com/wp-content/uploads/2016/09/scikit-learn-logo.png)

# 5.3.0 - Scikit-learn Introduction and overview

- Every algorithm is exposed in scikit-learn via an Estimator. 
- Estimator is just another word for Model. 
- All Estimators accept parameters, and generally have suitable defaults in place.
- Fitting a model is the same thing as 'training' a model.
- It is critical to split the data between train and test sets.
- scikit-learn, or 'sklearn', comes with a `train_test_split` function.
- The test set size in `train_test_split` is given as a fraction between 0 and 1: if you want 25% of your data held out for testing, the `test_size` parameter should be set to 0.25. (An integer is instead read as an absolute number of samples.)
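
A minimal sketch of the points above, under some assumptions: the iris data set bundled with scikit-learn and the `LogisticRegression` estimator are stand-ins chosen for brevity, and `max_iter` and `random_state` are arbitrary values. Any other estimator follows the same `fit` / `predict` pattern.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A small numeric data set that ships with scikit-learn (150 rows).
X, y = load_iris(return_X_y=True)

# test_size=0.25 holds back 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# The estimator (model), mostly left on its defaults.
model = LogisticRegression(max_iter=1000)

# Fitting is training: the model only sees the training split.
model.fit(X_train, y_train)

# Predict on data the model has not seen, then score those predictions.
predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)

print(len(X_train), len(X_test))  # 112 38
print(accuracy)
```

For a classifier, `score` reports accuracy on the data passed to it.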

# 5.4.0 - Text Feature Extraction