# NLP Notes

## Experience at Principal

**Project 1: question direction system**

I worked on a team with two economists and two strategists, with access to bond and stock experts.  Principal had 5,000 financial advisors (FAs) who could email the team with questions.  I led the creation of an NLP email classification system that forwarded each email to an economist, a strategist, or a bond or stock expert. Principal had recently acquired the Wells Fargo network, so the number of FAs grew 4x.

The system was written in Python:
1. Data cleaning - some manual (such as Wells vs. Principal), then removed punctuation and stop words and lemmatized.
2. Used TF-IDF (sklearn.feature_extraction.text.TfidfVectorizer). Almost all of the emails shared some jargon, such as "market", which TF-IDF down-weights. It also gave better results than bag of words, bag of n-grams, and word2vec.
3. Trained using pipelines with stratification on a mostly balanced dataset. Used GridSearchCV across KNN, Naive Bayes, and RandomForest, and picked RandomForest.
4. Note: in this case a false positive and a false negative had the same cost (the email was sent to the wrong person), so the confusion matrix was not helpful.
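
The steps above can be sketched roughly as follows. This is a minimal illustration, not the production system: the emails, the four routing labels, and the parameter grid are all made up for the example, and lemmatization is left out to keep it short.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy emails; the real data is not part of these notes.
emails = [
    "What is your outlook for inflation and GDP growth next year?",
    "Will the Fed raise rates again after the latest jobs report?",
    "How does the unemployment data change your recession forecast?",
    "Is consumer spending slowing according to the economic data?",
    "How should I position client portfolios between equities and fixed income?",
    "What asset allocation do you recommend for a retiree this year?",
    "Should clients rebalance portfolios toward international allocation?",
    "What is the recommended portfolio allocation for a cautious client?",
    "Are treasury yields going to keep rising on the ten year bond?",
    "What duration do you suggest for a municipal bond ladder?",
    "How do credit spreads look for high yield corporate bonds?",
    "Should my client buy treasury bonds or corporate bonds now?",
    "Do you like large cap tech stocks after the earnings season?",
    "Which dividend stocks look attractive at current valuations?",
    "Are small cap stocks cheap relative to earnings right now?",
    "What do you think of bank stocks after the earnings reports?",
]
labels = ["economist"] * 4 + ["strategist"] * 4 + ["bond"] * 4 + ["stock"] * 4

# Stratified split keeps the class mix the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.25, stratify=labels, random_state=0
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Illustrative grid only; the grid actually used is not recorded in these notes.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__n_estimators": [50, 100],
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

predictions = search.predict(X_test)
print(search.best_params_)
print(list(zip(predictions, y_test)))
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so the held-out fold never leaks into the vocabulary or IDF weights.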

Libraries: scikit-learn, Gensim, spaCy


**Project 2: sentiment ratio score**

Investor sentiment index - it is commonly thought that small investor sentiment is a good signal for regime change (bull to bear market, etc).  Data came from another team:

1. Sampled tweets / stock twits / a few blogs -> small investor bucket
2. Text from market experts -> institutional investor bucket

Goal was a ratio: (small investor sentiment) / (institutional investor sentiment).

Started with a dictionary approach, labeling the positive/negative word ratio for each bucket.  Ended with TF-IDF and Logistic/Ridge regression for each bucket, saving the top words for each category for word clouds and time series.
    

Libraries: scikit-learn, Gensim, spaCy





## Jargon

- Corpus - all of the docs.
- Document - the text you are using for analysis.
- Vocabulary - all of the tokens from your corpus.
- Semantic - relating to the meaning of words and sentences.
- Syntactic - relating to the rules for forming sentences (grammar).

Preprocessing: 

- Regular Expressions
- Stemming (NLTK) - preprocessing step that strips affixes (mostly suffixes) by rule; the result is not always a real word.
- Lemmatization - preprocessing step that looks up the base word for each token.  (can add new lemmas)
- Stop words - common words that don't add much meaning to the sentence.
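
The preprocessing steps above can be illustrated with the standard library alone. The stop word list and the `crude_stem` helper are toy stand-ins invented for this sketch; in practice NLTK or spaCy would supply the stemmer, lemmatizer, and stop word list.

```python
import re

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "on", "for"}


def crude_stem(token):
    """Strip a few common suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    text = text.lower()
    # Regular expression: replace anything that is not a letter or whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]


tokens = preprocess("The markets are rallying, and bonds dropped on Friday!")
print(tokens)  # ['market', 'rally', 'bond', 'dropp', 'friday']
```

The output "dropp" shows why stemming is cruder than lemmatization: rules chop characters without checking that the result is a word.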

Token information:

- PoS - Part of speech (noun, verb)
- NER - Named entity recognition

Word Embeddings / Vector space model:

- One Hot Encoding - dummy variables for words.
- Bag of words - per-document counts of each token in your vocabulary.
- Bag of n-grams - per-document counts of n-grams; a range of (1,2) means single words and pairs of consecutive words.
- TF-IDF - term frequency times inverse document frequency. IDF = log(Total # of Documents / # of DOCUMENTS the term appears in), so with 4 documents a term that appears in all of them gets log(4/4) = 0, and a term in one document gets log(4/1) ≈ 1.39.  A higher score means the term is more distinctive for that document.
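
To make the log(4/4) and log(4/1) cases concrete, here is a small hand-rolled calculation on four invented documents. It uses the plain textbook formula; scikit-learn's TfidfVectorizer applies a smoothed IDF and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

# Four hypothetical documents; "market" appears in every one.
docs = [
    "market rally lifts stocks",
    "bond market yields fall",
    "market volatility hits stocks",
    "fed holds rates as market waits",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents each term appears in.
df = Counter()
for tokens in tokenized:
    df.update(set(tokens))


def idf(term):
    return math.log(n_docs / df[term])


def tf_idf(term, tokens):
    tf = tokens.count(term) / len(tokens)
    return tf * idf(term)


print(idf("market"))  # log(4/4) = 0.0
print(idf("stocks"))  # log(4/2)
print(idf("bond"))    # log(4/1)
print(tf_idf("bond", tokenized[1]))
```

A term like "market" that occurs everywhere scores zero, which is why TF-IDF handles shared jargon better than raw counts.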

Encoded: 

- AutoEncoder - an unsupervised neural network trained to reconstruct its input through a narrow hidden layer; the hidden-layer values are a compressed representation of the input.
- Word2Vec - self-supervised, autoencoder-like shallow neural net trained with either continuous bag of words (predict a word from its context) or skip-gram (predict the context from a word).  The word2vec vector for a word is its learned hidden-layer weights.


Visualize Embeddings:
- (PCA or SVD) - plot the first two components to see whether the classes separate.
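
A sketch of the two-component projection, using TruncatedSVD because it works directly on the sparse TF-IDF matrix. The documents and labels are invented, and the plotting call itself is omitted; the coordinates would be passed to a scatter plot colored by label.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents from two classes.
docs = [
    "treasury bond yields rise",
    "corporate bond spreads widen",
    "municipal bond ladder duration",
    "tech stocks rally on earnings",
    "dividend stocks look cheap",
    "bank stocks fall after earnings",
]
labels = ["bond", "bond", "bond", "stock", "stock", "stock"]

X = TfidfVectorizer().fit_transform(docs)

svd = TruncatedSVD(n_components=2, random_state=0)
coords = svd.fit_transform(X)  # one (x, y) point per document

for label, (x, y) in zip(labels, coords):
    print(f"{label:>5}  {x: .3f}  {y: .3f}")
```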

Classification algo:

- KNN
- Naive Bayes
- Logistic regression (for binomial choice)
- Random forest classifier (ensemble)

- CV - Cross validation with n-folds.
- GridSearchCV - exhaustive search over a grid of hyperparameter values, scoring each combination with n-fold cross validation.
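
One way to compare the classifiers listed above is to score each with the same stratified folds. This is a sketch on invented two-class data with default-ish settings, not a record of any real comparison.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: six documents per class.
texts = [
    "treasury bond yields rise",
    "corporate bond spreads widen",
    "municipal bond ladder duration",
    "high yield bond credit risk",
    "bond prices fall as yields climb",
    "short duration bond fund",
    "tech stocks rally on earnings",
    "dividend stocks look cheap",
    "bank stocks fall after earnings",
    "small cap stocks outperform",
    "stocks hit record on earnings beat",
    "growth stocks valuation concerns",
]
labels = ["bond"] * 6 + ["stock"] * 6

candidates = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

scores = {}
for name, clf in candidates.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)
    fold_scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")
    scores[name] = float(fold_scores.mean())

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```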

Evaluation:
- Look at top words for sanity check.
    - LIME - with Word2Vec features the most influential words are not obvious; LIME perturbs the document (removing words) and fits a simple local model to find the words most influential towards each class.
- Confusion Matrix - counts of: predicted vs true.
- Accuracy - share of all predictions that are correct.
- Precision - true positives / (all predicted positives), i.e. TP / (TP + FP). Precision is the measure to watch when the cost of a false positive is high. In email spam detection, a false positive means a non-spam email (actual negative) is flagged as spam (predicted positive), so the user might lose important emails if precision is low.
- Recall - true positives / (all actual positives), i.e. TP / (TP + FN). Recall is the measure to use for model selection when the cost of a false negative is high, for instance in fraud detection or sick patient detection. If a fraudulent transaction (actual positive) is predicted as non-fraudulent (predicted negative), the consequence can be very bad for the bank.
- F1 - harmonic mean of precision and recall, 2 * P * R / (P + R). F1 can be a better measure than accuracy when you need a balance between precision and recall and the class distribution is uneven (a large number of actual negatives).
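
A worked example of these metrics from raw counts, using the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The label vectors are made up.

```python
# Hypothetical ground truth and predictions (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # 2
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # 1
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # 2
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # 5

accuracy = (tp + tn) / len(pairs)                   # 7 / 10
precision = tp / (tp + fp)                          # 2 / 3
recall = tp / (tp + fn)                             # 2 / 4
f1 = 2 * precision * recall / (precision + recall)  # 4 / 7

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Accuracy looks fine at 0.70 even though half of the actual positives were missed, which is the case F1 is meant to expose.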

Random:
- Imbalanced data - more of one class than another; use under-sampling or over-sampling.
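
Random over-sampling can be sketched with the standard library. The helper name `random_oversample` and the data are invented for this example; libraries such as imbalanced-learn offer more complete versions.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate random minority-class examples until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)

    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Hypothetical imbalanced data: four "stock" examples, two "bond" examples.
texts = ["s1", "s2", "s3", "s4", "b1", "b2"]
labels = ["stock", "stock", "stock", "stock", "bond", "bond"]

balanced_texts, balanced_labels = random_oversample(texts, labels)
print(Counter(balanced_labels))
```

Resampling should be applied to the training split only, so duplicated examples never end up in the test set.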

I haven't used:

- CNN - convolutional neural network
- RNN - Recurrent neural network
- LSTM - Long short-term memory
- BERT - Bidirectional Encoder Representations from Transformers
- Transformers - attention-based architecture, originally for sequence-to-sequence tasks.
- LSA - Latent Semantic Analysis - analyzing docs to find the underlying meaning or concepts in the doc. Uses SVD to reduce features.

 
## Python Libraries and Tools

Pandas

Numpy

Scikit-learn:
- train_test_split
- Pipeline
- classification_report
- TfidfVectorizer
- KNeighborsClassifier
- confusion_matrix
- naive_bayes.MultinomialNB
- RandomForestClassifier
- CountVectorizer
- PCA, TruncatedSVD
- LogisticRegression

spaCy
- nlp (the loaded pipeline object, e.g. from spacy.load; processes text into a Doc)

NLTK
- Stemming

Gensim
- Word2Vec

keras 
- preprocessing.text
- to_categorical

re

Lime
- LimeTextExplainer







https://www.interviewbit.com/nlp-interview-questions/
