# Best practices

* Remove punctuation: Punctuation such as commas, periods, and quotation marks is not always useful for NLP tasks and can add noise to the data.

* Remove HTML tags: If working with web data, it's important to remove any HTML tags before analyzing the text.

* Remove special characters: Special characters like @, #, or $ can also add noise to the data and can be removed when they carry no meaning for the task.

* Remove stop words: Stop words are common words like "the," "and," and "a" that don't add much meaning to the text and can be removed to reduce noise.

* Remove numbers: If numbers are not important to the analysis, they should be removed.

* Correct spelling mistakes: Misspelled words can be corrected using techniques like spell-checking or pattern matching.

* Handle capitalization: Depending on the task, capitalization may or may not be important. It's important to consider this and decide whether to convert all text to lowercase or preserve the original capitalization.

* Handle abbreviations: Abbreviations should be expanded to their full form to ensure their meaning is captured.

* Handle slang and informal language: Slang and informal language can add complexity to the analysis and should be handled appropriately, depending on the task.

* Tokenize and lemmatize: Text should be tokenized into smaller chunks like words or phrases, and words should be lemmatized to their base form to ensure consistency and reduce noise.
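Several of the steps above can be chained into one small cleaning function. The following is only a minimal sketch using the standard library: the `clean_text` name, its parameters, and the tiny `STOP_WORDS` set are assumptions made for illustration, and spelling correction, abbreviation expansion, and lemmatization are left out.

```python
import re
import string

# Hypothetical, deliberately tiny stop word list (illustration only)
STOP_WORDS = {"the", "and", "a", "is", "this", "of", "to"}

# Map every punctuation / special character (including @, # and $) to a space
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text, lowercase=True, drop_numbers=True):
    """Apply a few of the cleaning steps listed above, in order."""
    # 1. strip HTML tags (a simple regex; messy real-world HTML may need a parser)
    text = re.sub(r"<[^>]+>", " ", text)
    # 2. optionally normalize capitalization
    if lowercase:
        text = text.lower()
    # 3. optionally drop digits
    if drop_numbers:
        text = re.sub(r"\d+", " ", text)
    # 4. remove punctuation and special characters
    text = text.translate(PUNCT_TABLE)
    # 5. tokenize on whitespace and remove stop words (case-insensitively)
    return [tok for tok in text.split() if tok.lower() not in STOP_WORDS]


tokens = clean_text("<p>This is the BEST movie of 2023, @everyone!</p>")
print(tokens)
```

Each step is behind its own line so that it can be switched off when the task needs that information, for example keeping numbers with `drop_numbers=False`.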

# Stopwords

In [3]:
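import nltk
from nltk.corpus import stopwords

# download the stop word list and the tokenizer model used by word_tokenize
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('stopwords')
nltk.download('punkt')

text = "This is an example sentence that includes some stop words."
words = nltk.word_tokenize(text)

# build the stop word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

# Remove stop words (punctuation tokens such as '.' are kept)
filtered_words = [word for word in words if word.lower() not in stop_words]

print(filtered_words)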


['example', 'sentence', 'includes', 'stop', 'words', '.']




# Vectorization
Vectorization in NLP refers to the process of converting text data into numerical vectors that can be used as input for machine learning models. This process involves encoding each word or phrase in a text corpus as a vector, where each dimension of the vector represents a particular feature or attribute of the word or phrase.

* Bag of Words (BoW): This approach represents each document by the counts of the words it contains, ignoring word order. It is useful when we have a large corpus of text and the focus is on keyword-based analysis. BoW can be used for tasks such as document classification, spam detection, and sentiment analysis.

* TF-IDF: This approach is similar to BoW, but it gives more importance to rare words that are discriminative for a specific document. TF-IDF is useful when we want to focus on important words that are not very common in the corpus. It can be used for tasks such as document classification, search engines, and recommender systems.

* Word embeddings: This approach is useful when we want to capture the semantic relationships between words. Word embeddings are low-dimensional dense vectors that represent the meaning of words. They can be used for tasks such as language modeling, named entity recognition, and sentiment analysis.

* BERT: This approach is useful when we want to capture the context-dependent meaning of words. BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model that can be fine-tuned for various NLP tasks such as sentiment analysis, question answering, and text classification.

* GloVe: GloVe (Global Vectors for Word Representation) is a specific word-embedding method rather than a separate family. Its vectors are trained on the global co-occurrence statistics of words in a corpus, and they can be used for the same tasks as other word embeddings.

## Bag of Words

BoW is a good choice for tasks where word frequency matters more than word order or context, such as spam detection, sentiment analysis, or topic modeling.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

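# define the dataset (1 = positive, 0 = negative)
X = ['this is a good movie', 'this is a bad movie', 'I like this movie', 'I hate this movie']
y = [1, 0, 1, 0]

# create the BoW representation
# (by default CountVectorizer lowercases text and drops one-letter tokens such as 'a' and 'I')
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(X)

# train a classifier
clf = MultinomialNB()
clf.fit(X_bow, y)

# predict the sentiment of a new text
# note: 'great' never appears in the training data, so transform() ignores it;
# the remaining words ('this', 'is', 'movie') occur equally often in both classes,
# so the prediction is effectively a tie rather than a learned signal
new_text = ['this is a great movie']
new_text_bow = vectorizer.transform(new_text)
predicted_sentiment = clf.predict(new_text_bow)[0]

print(predicted_sentiment)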


0
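To see why the prediction above can come out as `0` for a sentence containing "great", it helps to build the count vector by hand. This is a simplified sketch that only roughly mirrors the default `CountVectorizer` behavior (lowercasing, tokens of at least two characters); the `tokenize` and `bow_vector` helpers are hypothetical names introduced here.

```python
from collections import Counter

docs = ['this is a good movie', 'this is a bad movie', 'I like this movie', 'I hate this movie']


def tokenize(text):
    # lowercase and keep tokens of two or more characters
    return [tok for tok in text.lower().split() if len(tok) >= 2]


# vocabulary: every distinct training token, sorted alphabetically
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})


def bow_vector(text):
    counts = Counter(tokenize(text))
    # words outside the vocabulary (here: 'great') are silently dropped
    return [counts[word] for word in vocabulary]


print(vocabulary)
print(bow_vector('this is a great movie'))
```

The new sentence is reduced to counts for "is", "movie", and "this" only, and none of those words distinguishes the positive examples from the negative ones.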


## TF-IDF

TF-IDF is a good choice when we want to weight a word by how often it occurs in a document, discounted by how common it is across the corpus. It is useful in applications such as document classification, information retrieval, or keyword extraction.

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# define the dataset
X = ['this is a good movie', 'this is a bad movie', 'I like this movie', 'I hate this movie']
y = ['positive', 'negative', 'positive', 'negative']

# create the TF-IDF representation
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(X)

# train a classifier
clf = SVC()
clf.fit(X_tfidf, y)

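# predict the class of a new text
# note: 'great' is not in the training vocabulary, so the prediction rests only on
# 'this', 'is' and 'movie', which appear in both classes
new_text = ['this is a great movie']
new_text_tfidf = vectorizer.transform(new_text)
predicted_class = clf.predict(new_text_tfidf)[0]

print(predicted_class)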

negative
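The weighting itself can be sketched by hand. The version below uses the plain textbook formula, term frequency times `log(N / df)`, as an assumption for illustration. `TfidfVectorizer` applies a smoothed variant and L2-normalizes each row by default, so its numbers will differ; the `tfidf` helper is a hypothetical name.

```python
import math
from collections import Counter

docs = ['this is a good movie', 'this is a bad movie', 'I like this movie', 'I hate this movie']

tokenized = [doc.lower().split() for doc in docs]
n_docs = len(tokenized)

# document frequency: number of documents that contain each word
df = Counter(word for tokens in tokenized for word in set(tokens))


def tfidf(tokens):
    """Textbook TF-IDF for one training document (every word must have df > 0)."""
    tf = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }


weights = tfidf(tokenized[0])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:>6}  {weight:.3f}")
```

Words that occur in every document, such as "this" and "movie", get a weight of zero, while "good", which appears in a single document, gets the highest weight.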


## Word Embeddings

Word embeddings are useful when we want to capture the meaning of words and the context in which they are used. The pre-trained model used here maps each input sentence to a single 50-dimensional vector.

In [10]:
import tensorflow as tf
import tensorflow_hub as hub

# define the dataset
X = ['this is a good movie', 'this is a bad movie', 'I like this movie', 'I hate this movie']
y = [1, 0, 1, 0]

# load the pre-trained embedding model
embed = hub.load("https://tfhub.dev/google/nnlm-en-dim50/2")

# convert each sentence to a 50-dimensional embedding (result shape: 4 x 50)
X_embeddings = embed(X)



In [11]:
X_embeddings

<tf.Tensor: shape=(4, 50), dtype=float32, numpy=
array([[ 0.20583047,  0.18753016,  0.1268343 ,  0.09016597, -0.1543585 ,
        -0.10795088,  0.16558027,  0.06119017, -0.20971456,  0.07139651,
        -0.03833621, -0.08680907,  0.04552191, -0.02194868,  0.13046873,
        -0.12728618, -0.1495209 ,  0.18820915,  0.03948668, -0.29305086,
        -0.02014771, -0.22412075,  0.12644166,  0.0870514 , -0.08078168,
        -0.00701902, -0.415201  , -0.06377582,  0.21221496,  0.04958345,
        -0.17487982,  0.12846486, -0.03652546, -0.12215738, -0.0015185 ,
         0.05498244,  0.25573802, -0.09744763, -0.04855936, -0.2361716 ,
         0.08039324, -0.02337173, -0.06055538,  0.03737477, -0.28236404,
        -0.27627096, -0.04884144, -0.04169737,  0.05408333,  0.01726649],
       [ 0.22665966,  0.12342206,  0.00879236,  0.06335424, -0.12636712,
        -0.19964664,  0.18133612,  0.03515789, -0.13501574,  0.09100834,
         0.02877173, -0.00904876,  0.03736711, -0.08183728,  0.01896096,
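Dense vectors like these are usually compared with cosine similarity. The sketch below uses three hypothetical 3-dimensional vectors invented for illustration, not values from the model above. The rows of `X_embeddings` could presumably be compared the same way after converting the tensor with `.numpy()`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# hypothetical toy embeddings (real models use 50 or more dimensions)
good = [0.9, 0.1, 0.3]
great = [0.8, 0.2, 0.4]
bad = [-0.7, 0.1, 0.3]

sim_good_great = cosine_similarity(good, great)
sim_good_bad = cosine_similarity(good, bad)

print(f"good vs great: {sim_good_great:.3f}")
print(f"good vs bad:   {sim_good_bad:.3f}")
```

Values close to 1 indicate vectors pointing in nearly the same direction, which is how embeddings express that two words or sentences are semantically close.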
 

# Feature Scaling

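Neural network models for NLP can benefit from batch normalization. Dense numeric features can also be rescaled before training, for example with scikit-learn's scalers:

* MinMaxScaler: rescales each feature to a fixed range, [0, 1] by default.

* StandardScaler: standardizes each feature to zero mean and unit variance.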
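The two scalers named above correspond to two simple formulas. This is a minimal standard-library sketch of the arithmetic on a single hypothetical feature column, not the scikit-learn implementation; `min_max_scale` and `standardize` are names assumed for illustration.

```python
import statistics


def min_max_scale(values):
    """Rescale to [0, 1]: (x - min) / (max - min). Assumes max > min."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def standardize(values):
    """Rescale to zero mean and unit variance: (x - mean) / std. Assumes std > 0."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


# hypothetical feature column, e.g. token counts per document
feature = [2, 4, 6, 8]

scaled = min_max_scale(feature)
standardized = standardize(feature)

print(scaled)
print([round(v, 3) for v in standardized])
```

Min-max scaling keeps the values inside a bounded range, while standardization centers them around zero. Which one fits better depends on the model and on how sensitive it is to outliers.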