# Preprocessing

**Stopwords**

In [1]:
tweet = """I'm amazed how often in practice, not only does a @huggingface NLP model solve your problem, but one of their public fine tuned checkpoints, is good enough for the job. 

Both impressed, and a little disappointed how rarely I get to actually train a model that matters :)"""
tweet

"I'm amazed how often in practice, not only does a @huggingface NLP model solve your problem, but one of their public fine tuned checkpoints, is good enough for the job. \n\nBoth impressed, and a little disappointed how rarely I get to actually train a model that matters :)"

In [3]:
# Requires the stopword corpus; if it is missing, run nltk.download('stopwords') once
from nltk.corpus import stopwords

In [5]:
stop_words = stopwords.words('english')
stop_words[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [6]:
# Convert the list to a set so membership checks are fast
stop_words = set(stop_words)

In [18]:
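# Lowercase and split on whitespace; `tweet` is now a list of tokens,
# and punctuation stays attached to the words (e.g. 'practice,')
tweet = tweet.lower().split()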

In [19]:
tweet_no_stopwords = [word for word in tweet if word not in stop_words]

In [21]:
print(' '.join(tweet))

i'm amazed how often in practice, not only does a @huggingface nlp model solve your problem, but one of their public fine tuned checkpoints, is good enough for the job. both impressed, and a little disappointed how rarely i get to actually train a model that matters :)


In [22]:
print(' '.join(tweet_no_stopwords))

i'm amazed often practice, @huggingface nlp model solve problem, one public fine tuned checkpoints, good enough job. impressed, little disappointed rarely get actually train model matters :)
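
The filtered output above still contains tokens such as `practice,` and `job.`, because `str.split()` only splits on whitespace and leaves punctuation attached. Below is a minimal sketch of one way to handle this, assuming a simple regular-expression tokenizer is good enough. The `tokenize` helper and the tiny `demo_stop_words` set are hypothetical stand-ins (not NLTK's list), chosen so the block runs without any downloads.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only;
# in the notebook you would use the NLTK-based `stop_words` set instead.
demo_stop_words = {"i", "a", "the", "is", "for", "to", "and", "how", "in", "not"}

def tokenize(text):
    """Lowercase the text and keep @mentions and words (with inner apostrophes)."""
    return re.findall(r"@?\w+(?:'\w+)*", text.lower())

sample = "I'm amazed how often, in practice, a @huggingface model is good enough for the job."
tokens = tokenize(sample)

# Same list-comprehension filter as in the notebook, applied to cleaner tokens
filtered = [tok for tok in tokens if tok not in demo_stop_words]

print(tokens)
print(filtered)
```

Note that `i'm` survives here as well, since contractions only disappear if the stopword list contains them in that exact form.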


**Stemming**

Stemming reduces a word to a base form by stripping its affixes, much like cutting the branches of a tree back to the stem. For example, the words eating, eats, and eaten share the stem eat, although a rule-based stemmer may not reduce irregular forms such as eaten. ... This is why a search engine can store only the stems rather than every form of a word.
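
To make the search-engine idea concrete, here is a minimal sketch of an inverted index keyed by stems. It assumes a toy suffix-stripping rule; `toy_stem`, `SUFFIXES`, `documents`, and `search` are all hypothetical names invented for this illustration, and the rule is far cruder than the Porter or Lancaster algorithms.

```python
from collections import defaultdict

# Hypothetical suffix list for a toy stemmer; real stemmers such as Porter
# apply many more rules and conditions than this.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Made-up documents for the demonstration
documents = {
    1: "she eats apples",
    2: "eating apples daily",
    3: "they walked home",
}

# Inverted index keyed by stem: stem -> set of document ids
index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.split():
        index[toy_stem(word)].add(doc_id)

def search(query):
    """Return ids of documents containing any word that shares the query's stem."""
    return sorted(index.get(toy_stem(query), set()))

print(search("eat"))
print(search("walking"))
```

Because both the indexed words and the query pass through the same stemmer, a query for one form of a word can match documents that contain a different form.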

In [29]:
text = """I am amazed by how amazingly amazing you are"""

In [31]:
words_to_stem = ['happy', 'happier', 'happiest', 'cactus', 'cactii', 'elephant', 'elephants', 'amazed', 'amazing', 'amazingly']

In [32]:
from nltk.stem import PorterStemmer, LancasterStemmer

In [34]:
porter = PorterStemmer()
lancaster = LancasterStemmer()

In [40]:
# Pair each word with its Porter stem and its Lancaster stem
stemmed = [(word, porter.stem(word), lancaster.stem(word)) for word in words_to_stem]

In [38]:
stemmed

[('happy', 'happi', 'happy'),
 ('happier', 'happier', 'happy'),
 ('happiest', 'happiest', 'happiest'),
 ('cactus', 'cactu', 'cact'),
 ('cactii', 'cactii', 'cacti'),
 ('elephant', 'eleph', 'eleph'),
 ('elephants', 'eleph', 'eleph'),
 ('amazed', 'amaz', 'amaz'),
 ('amazing', 'amaz', 'amaz'),
 ('amazingly', 'amazingli', 'amaz')]

**Lemmatization**

In [42]:
words = ['amaze', 'amazed', 'amazing']

In [43]:
import nltk

nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [46]:
from nltk.stem import WordNetLemmatizer

In [47]:
lemmatizer = WordNetLemmatizer()

In [48]:
# Without a part-of-speech argument, each word is treated as a noun
[lemmatizer.lemmatize(word) for word in words]

['amaze', 'amazed', 'amazing']

In [49]:
from nltk.corpus import wordnet

In [50]:
# Passing wordnet.VERB tells the lemmatizer to treat each word as a verb
[lemmatizer.lemmatize(word, wordnet.VERB) for word in words]

['amaze', 'amaze', 'amaze']
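
The two outputs above differ because `lemmatize` treats a word as a noun unless it is given a part of speech. When the part of speech comes from a tagger that emits Penn Treebank tags (for example `nltk.pos_tag`), the tags have to be translated into WordNet's single-letter codes first. Below is a minimal sketch of such a mapping. The helper name `penn_to_wordnet` is hypothetical, and the letters are written as string literals (they correspond to `wordnet.ADJ`, `wordnet.VERB`, `wordnet.ADV`, and `wordnet.NOUN`) so the block runs without NLTK installed. Matching on the first letter of the tag is a common simplification, not an exact correspondence.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS letter.

    Falls back to noun ('n'), which is also the lemmatizer's default.
    """
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun and everything else

for tag in ["VBD", "VBG", "JJ", "RB", "NN", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

With NLTK available, the result can be passed as the second argument, for example `lemmatizer.lemmatize('amazed', penn_to_wordnet('VBD'))`, which is equivalent to the `wordnet.VERB` call above.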