# Twitter Text Processing

### Introduction
This document shows how each text processing step changes a given sentence or text. It is intended both for learning text processing and as a reference for preprocessing document text.


## Text Processing

### Removing Special Characters
Special characters add noise to tweets, so they are removed before further processing.

In [13]:
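import re

def remove_special_character(text, remove_digits=False):
    # Keep letters and whitespace; keep digits too unless remove_digits is True.
    # The range must be A-Z: A-z would also match [ \ ] ^ _ and the backtick.
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text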

#this is the tweet that we will remove special characters from
doc_sample = 'The cats are hanging their feet while playing $#?@!!'

print('Before removing special characters: ')
print(doc_sample)

print('\nAfter special characters removed: ')
doc_sample = remove_special_character(doc_sample, remove_digits=True)
print(doc_sample)


Before removing special characters: 
The cats are hanging their feet while playing $#?@!!

After special characters removed: 
The cats are hanging their feet while playing 
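
The character class deserves a closer look, because the ranges `A-Z` and `A-z` behave differently. `A-z` runs from ASCII 65 to 122 and therefore also covers `[`, `\`, `]`, `^`, `_` and the backtick, so a pattern built on it silently keeps those symbols. The sketch below compares the two; the pattern names and the sample string are made up for illustration.

```python
import re

# Two candidate patterns; the names are illustrative only.
STRICT = r'[^a-zA-Z0-9\s]'   # only letters, digits, and whitespace survive
LOOSE = r'[^a-zA-z0-9\s]'    # A-z also lets [ \ ] ^ _ ` survive

sample = 'snake_case [x]^2'

strict_result = re.sub(STRICT, '', sample)
loose_result = re.sub(LOOSE, '', sample)

print(strict_result)  # snakecase x2
print(loose_result)   # snake_case [x]^2
```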


### Removing Stop Words
Stop words are words that are filtered out before or after natural language processing. They are the most common words in a language, such as "a", "an", and "the", and they carry little meaning on their own.

In [14]:
!pip install -U nltk

Requirement already up-to-date: nltk in c:\users\arcit\anaconda3\lib\site-packages (3.5)


In [15]:
# Download the stop word corpus (does nothing if it is already present)
import nltk
nltk.download('stopwords', quiet=True)

stopword_list = nltk.corpus.stopwords.words('english')
print(stopword_list)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

### Tokenization
Tokenization is chopping up sentences into pieces or words called tokens. 

In [16]:
# Tokenize a single tweet for demonstration
from nltk.tokenize import word_tokenize

tokens = word_tokenize(doc_sample)

print('Before Tokenization: ')
print(doc_sample)

print('\nAfter Tokenization: ')
print(tokens)

Before Tokenization: 
The cats are hanging their feet while playing 

After Tokenization: 
['The', 'cats', 'are', 'hanging', 'their', 'feet', 'while', 'playing']
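
To make the idea concrete without any extra downloads, here is a minimal regex-based tokenizer. It is a simplified stand-in, not the algorithm behind `word_tokenize`: it treats runs of word characters as tokens and every other non-space character as its own token. The function name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

sample_tokens = simple_tokenize('The cats are hanging their feet while playing ')
print(sample_tokens)
# ['The', 'cats', 'are', 'hanging', 'their', 'feet', 'while', 'playing']

punct_tokens = simple_tokenize('Stop, look!')
print(punct_tokens)
# ['Stop', ',', 'look', '!']
```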


### Tokenization and Removing Stop Words
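Using the tokenized sentence, we now remove the stop words. The sentence or document must be tokenized first so that each token can be compared against the stop word list. The comparison is case-sensitive and the NLTK list is lowercase, which is why "The" survives in the output.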

In [17]:
# ToktokTokenizer is an alternative tokenizer; remove_stopwords uses word_tokenize instead
from nltk.tokenize.toktok import ToktokTokenizer
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')

In [19]:
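# Tokenize and remove stop words
def remove_stopwords(text):
    tokens = word_tokenize(text)
    tokens = [token.strip() for token in tokens]
    # Case-sensitive membership test against the lowercase NLTK list
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    # Join with ', ' so the remaining tokens are easy to tell apart when printed
    filtered_text = ', '.join(filtered_tokens)
    return filtered_text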

print('Before removing stopwords: ')
print(tokens)

print('\nAfter removing stopwords: ')
processed_sample = remove_stopwords(doc_sample)
print(processed_sample)

Before removing stopwords: 
['The', 'cats', 'are', 'hanging', 'their', 'feet', 'while', 'playing']

After removing stopwords: 
The, cats, hanging, feet, playing
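
The output keeps "The" because the membership test is case-sensitive while the stop word list is lowercase. A minimal sketch of a case-insensitive variant follows. It assumes a small hand-written stop word set (`SMALL_STOPWORDS`, hypothetical) in place of the NLTK list so that it runs on its own, and it returns a list of tokens rather than a joined string.

```python
# Tiny illustrative stop word set; in the notebook this would be
# set(nltk.corpus.stopwords.words('english')).
SMALL_STOPWORDS = {'the', 'are', 'their', 'while', 'a', 'an'}

def remove_stopwords_ci(tokens, stopwords=SMALL_STOPWORDS):
    # Compare the lowercased token, but keep the original spelling in the result.
    return [tok for tok in tokens if tok.lower() not in stopwords]

tokens_in = ['The', 'cats', 'are', 'hanging', 'their', 'feet', 'while', 'playing']
filtered = remove_stopwords_ci(tokens_in)
print(filtered)  # ['cats', 'hanging', 'feet', 'playing']
```

Using a set rather than a list also makes each lookup constant time on average, which matters once the function runs over a whole dataset.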


### Stemming
Stemming is the process of reducing inflected words to their root form. A group of related words is mapped to the same stem, even if the stem itself is not a valid word in the language (here, English).

Inflection is the modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood. An inflection expresses one or more grammatical categories with a prefix, suffix, or infix, or with another internal modification such as a vowel change.

##### Stems are created by removing suffixes or prefixes used with a word.
##### Stemming a word or sentence may produce words that are not actual words or that have no meaning.


In [20]:
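from nltk.stem import PorterStemmer

porter = PorterStemmer()

def text_stemmer(text):
    # Split on whitespace and stem each piece with the shared PorterStemmer.
    # Punctuation attached to a word (e.g. 'cats,') is passed to the stemmer as part of the word.
    text = ' '.join([porter.stem(word) for word in text.split()])
    return text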
    
print('Before stemming: ')
print(processed_sample)

print('\nAfter stemming: ')
stemmed_doc = text_stemmer(processed_sample)
print(stemmed_doc)

Before stemming: 
The, cats, hanging, feet, playing

After stemming: 
the, cats, hanging, feet, play
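
Only "playing" changes in the output above because `text.split()` leaves the commas attached, so the stemmer receives "cats," and "hanging," rather than "cats" and "hanging". One possible way around this is to strip punctuation from each piece before stemming. In the sketch below, `stem_clean_tokens` is a hypothetical helper that accepts any stemming function; `str.lower` stands in for the stemmer so the block runs without NLTK.

```python
import string

def stem_clean_tokens(text, stem):
    # Split on whitespace, trim leading/trailing punctuation, drop empty pieces,
    # then apply the supplied stemming function to each remaining token.
    pieces = (piece.strip(string.punctuation) for piece in text.split())
    return [stem(piece) for piece in pieces if piece]

# str.lower is only a placeholder for a real stemmer here.
cleaned = stem_clean_tokens('The, cats, hanging, feet, playing', str.lower)
print(cleaned)  # ['the', 'cats', 'hanging', 'feet', 'playing']
```

In the notebook, passing `porter.stem` as the second argument would hand the stemmer comma-free tokens.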


### Lemmatization
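Unlike stemming, lemmatization reduces inflected words to a root word (lemma) that is a valid word in the language. `WordNetLemmatizer.lemmatize` treats each word as a noun unless a part of speech is passed, which is why "hanging" and "playing" are unchanged in the output.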

In [21]:
import nltk
from nltk.stem import WordNetLemmatizer

# Requires the WordNet corpus; run nltk.download('wordnet') once if it is missing
lemmatizer = WordNetLemmatizer()

In [22]:
def text_lemmatizer(text):
    word_list = nltk.word_tokenize(text)
    lemmatized_text = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
    return lemmatized_text

print('Before lemmatization: ')
print(processed_sample)

print('\nAfter lemmatization: ')
lemmatized_doc = text_lemmatizer(processed_sample)
print(lemmatized_doc)

Before lemmatization: 
The, cats, hanging, feet, playing

After lemmatization: 
The , cat , hanging , foot , playing
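
The difference from stemming shows up on irregular forms such as "feet". Suffix stripping cannot reach "foot", while a dictionary lookup can. The sketch below illustrates this with a deliberately tiny lookup table (`TOY_LEMMAS`) and a naive suffix stripper (`naive_stem`); both are hypothetical stand-ins, not WordNet and not the Porter algorithm.

```python
# Toy lemma dictionary: a stand-in for a real lexical database.
TOY_LEMMAS = {'feet': 'foot', 'cats': 'cat', 'mice': 'mouse'}

def naive_stem(word):
    # Strip a trailing 's' only; real stemmers apply many more rules.
    return word[:-1] if word.endswith('s') and len(word) > 3 else word

def toy_lemmatize(word):
    # Look the word up; fall back to the word itself when it is unknown.
    return TOY_LEMMAS.get(word, word)

words = ['cats', 'feet', 'mice']
stems = [naive_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)   # ['cat', 'feet', 'mice']
print(lemmas)  # ['cat', 'foot', 'mouse']
```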


### Apply Text Preprocessing to the tweets
This is where we apply all of the preprocessing steps (special character removal, stop word removal, stemming, and lemmatization) to every tweet in the dataset.

Note:
* In the single-text preprocessing above (one tweet at a time), we stemmed and lemmatized the tweet separately to see the changes. Comparing the stemmed and lemmatized outputs, some words changed and some did not.
* To keep things simple, all of the preprocessing steps above are combined into a single function.

In [23]:
def prepare_text(text):
    # Run each step on the output of the previous one
    no_special = remove_special_character(text)
    no_stopwords = remove_stopwords(no_special)
    stemmed = text_stemmer(no_stopwords)
    lemmatized = text_lemmatizer(stemmed)
    return lemmatized

sample = 'The bat is hanging its feet for #2193@'

print(prepare_text(sample))

the , bat , hanging , foot , 2193
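
Because each step takes a string and returns a string, the chain can also be written as a list of functions applied in order, which makes it easy to reorder or drop a step. The sketch below shows the pattern with two simple stand-in steps (`strip_symbols` and `lowercase`, both hypothetical) rather than the notebook's functions.

```python
import re
from functools import reduce

def strip_symbols(text):
    # Remove everything except ASCII letters, digits, and whitespace.
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

def lowercase(text):
    return text.lower()

def run_pipeline(text, steps):
    # Feed the output of each step into the next one, left to right.
    return reduce(lambda acc, step: step(acc), steps, text)

result = run_pipeline('The bat is hanging its feet for #2193@', [strip_symbols, lowercase])
print(result)  # the bat is hanging its feet for 2193
```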


In [24]:
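text_data = []
with open('tweets-clean.csv', errors='ignore') as f:
    # Note: the header row ('tweet') is processed like any other line
    for i, line in enumerate(f):
        processed = prepare_text(line)
        # Print only the first 10 processed lines as a preview
        if i < 10:
            print(processed)
        text_data.append(processed)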

tweet
awww , thats , bummer , shoulda , got , david , carr , third , day
upset , cant , update , facebook , texting , might , cry , result , school , today , also , blah
dived , many , time , ball , managed , save , 50 , rest , go , bound
whole , body , feel , itchy , like , fire
behaving , im , mad , cant , see
whole , crew
need , hug
hey , long , time , see , yes , rain , bit , bit , lol , im , fine , thanks , how
nope , didnt
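
The first line of the output above is `tweet`, the CSV header, which was processed like any other row. A minimal sketch of skipping it with the standard `csv` module follows. It assumes the file has a single column named `tweet` (inferred from that header line) and uses an in-memory sample with made-up rows in place of `tweets-clean.csv`.

```python
import csv
import io

# In-memory stand-in for the CSV file; the rows are made up.
sample_csv = io.StringIO('tweet\nneed a hug\nnope they didnt\n')

reader = csv.DictReader(sample_csv)   # consumes the header row automatically
rows = [row['tweet'] for row in reader]
print(rows)  # ['need a hug', 'nope they didnt']
```

With the real file, `open('tweets-clean.csv', newline='', errors='ignore')` would take the place of `sample_csv`, and each value could then be passed to `prepare_text`.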
