https://machinelearningknowledge.ai/11-techniques-of-text-preprocessing-using-nltk-in-python/

In [None]:
!pip install pandas
!pip install numpy



In [None]:
import pandas as pd
df = pd.read_csv('all_annotated.tsv', sep='\t')
df_text = df[['Tweet']].copy()  # explicit copy, so df_text is independent of df
df_text.head()

Unnamed: 0,Tweet
0,Bugün bulusmami lazimdiii
1,Volkan konak adami tribe sokar yemin ederim :D
2,Bed
3,I felt my first flash of violence at some fool...
4,Ladies drink and get in free till 10:30


i) Lowercasing

In [None]:
df_text['Tweet']=df_text['Tweet'].str.lower()
df_text.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_text['Tweet']=df_text['Tweet'].str.lower()


Unnamed: 0,Tweet
0,bugün bulusmami lazimdiii
1,volkan konak adami tribe sokar yemin ederim :d
2,bed
3,i felt my first flash of violence at some fool...
4,ladies drink and get in free till 10:30
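
The `SettingWithCopyWarning` text in the output above appears because pandas cannot tell whether a frame obtained by selecting columns from `df` is a view or a copy. A minimal sketch of one common way to avoid it is to take an explicit copy when selecting. The small frame and its `Label` column are made up for illustration and stand in for `all_annotated.tsv`.

```python
import pandas as pd

# Hypothetical stand-in for all_annotated.tsv, used only for illustration.
df = pd.DataFrame({
    "Tweet": ["Bed", "Ladies drink and get in free till 10:30"],
    "Label": [0, 1],
})

# .copy() makes df_text an independent DataFrame, so assigning to its
# column is unambiguous and leaves df untouched.
df_text = df[["Tweet"]].copy()
df_text["Tweet"] = df_text["Tweet"].str.lower()

print(df_text["Tweet"].tolist())
print(df["Tweet"].tolist())  # original casing is still present in df
```

The same effect can be had with `df_text.loc[:, 'Tweet'] = ...`, as the warning suggests, but an explicit copy at selection time keeps every later cell unchanged.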


ii) Remove Extra Whitespaces

In [None]:
def remove_whitespace(text):
    return  " ".join(text.split())

# Test
text = " We will going to win this match"
remove_whitespace(text)

'We will going to win this match'
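
Because `str.split()` with no arguments splits on any run of whitespace, the same helper also collapses tabs, newlines and repeated spaces, not only the leading space in the test above. A small sketch with a made-up sample string:

```python
def remove_whitespace(text):
    # split() without arguments treats any run of spaces, tabs or
    # newlines as a single separator and drops leading/trailing runs.
    return " ".join(text.split())

sample = "  We  are\tgoing \n to win   this match "
cleaned = remove_whitespace(sample)
print(repr(cleaned))
```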

In [None]:
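# Note: this reads from df['Tweet'], the original column, so the lowercasing from
# step i is overwritten (see the output below). Apply it to df_text['Tweet'] to keep it.
df_text['Tweet']=df['Tweet'].apply(remove_whitespace)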
df_text.head()



Unnamed: 0,Tweet
0,Bugün bulusmami lazimdiii
1,Volkan konak adami tribe sokar yemin ederim :D
2,Bed
3,I felt my first flash of violence at some fool...
4,Ladies drink and get in free till 10:30


iii) Tokenization


In [None]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
from nltk import word_tokenize
df_text['Tweet']=df_text['Tweet'].apply(word_tokenize)
df_text.head()



Unnamed: 0,Tweet
0,"[Bugün, bulusmami, lazimdiii]"
1,"[Volkan, konak, adami, tribe, sokar, yemin, ed..."
2,[Bed]
3,"[I, felt, my, first, flash, of, violence, at, ..."
4,"[Ladies, drink, and, get, in, free, till, 10:30]"


iv) Spelling Correction

In [None]:
!pip install pyspellchecker



In [None]:
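from spellchecker import SpellChecker

# Build the checker once; creating it inside the function reloads the dictionary on every call.
spell = SpellChecker()

def spell_check(text):

    result = []
    for word in text:
        # correction() may return None when no candidate is found
        # (seen in recent pyspellchecker releases); keep the original word then.
        correct_word = spell.correction(word) or word
        result.append(correct_word)

    return result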

#Test
text = "confuson matrx".split()
spell_check(text)

['confusion', 'matrix']

In [None]:
df_text['Tweet'] = df_text['Tweet'].apply(spell_check)
df_text.head()

v) Removing Stopwords

In [None]:
nltk.download('stopwords')
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
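# A set gives constant-time membership tests; a list is scanned in full for every token.
en_stopwords = set(stopwords.words('english'))

def remove_stopwords(text):
    # The stopword list is lowercase, so capitalised tokens such as 'I' are kept.
    return [token for token in text if token not in en_stopwords]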

#Test
text = "this is the only solution of that question".split()
remove_stopwords(text)

['solution', 'question']

In [None]:
df_text['Tweet'] = df_text['Tweet'].apply(remove_stopwords)
df_text.head()



Unnamed: 0,Tweet
0,"[Bugün, bulusmami, lazimdiii]"
1,"[Volkan, konak, adami, tribe, sokar, yemin, ed..."
2,[Bed]
3,"[I, felt, first, flash, violence, fool, bumped..."
4,"[Ladies, drink, get, free, till, 10:30]"
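
In the output above, `I` survives in row 3 because the stopword list is lowercase while the tokens at this point still carry their original capitalisation. A minimal sketch of a case-insensitive variant follows. The function name `remove_stopwords_ci` is hypothetical, and the tiny hand-written set stands in for `stopwords.words('english')`.

```python
# Small illustrative subset; the notebook uses stopwords.words('english').
EN_STOPWORDS = {"i", "my", "of", "at", "some", "the", "and", "in"}

def remove_stopwords_ci(tokens):
    # Compare the lowercased token, but keep the original casing
    # of the tokens that are retained.
    return [tok for tok in tokens if tok.lower() not in EN_STOPWORDS]

kept = remove_stopwords_ci(["I", "felt", "my", "first", "flash", "of", "violence"])
print(kept)
```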


vi) Removing Punctuation

In [None]:
from nltk.tokenize import RegexpTokenizer

def remove_punct(text):

    tokenizer = RegexpTokenizer(r"\w+")
    lst=tokenizer.tokenize(' '.join(text))
    return lst

#Test
text=df_text['Tweet'][0]
print(text)
remove_punct(text)

['Bugün', 'bulusmami', 'lazimdiii']


['Bugün', 'bulusmami', 'lazimdiii']

In [None]:
df_text['Tweet'] = df_text['Tweet'].apply(remove_punct)
df_text.head()



Unnamed: 0,Tweet
0,"[Bugün, bulusmami, lazimdiii]"
1,"[Volkan, konak, adami, tribe, sokar, yemin, ed..."
2,[Bed]
3,"[I, felt, first, flash, violence, fool, bumped..."
4,"[Ladies, drink, get, free, till, 10, 30]"


vii) Removing Frequent Words


In [None]:
from nltk import FreqDist

def frequent_words(df):

    lst=[]
    for text in df.values:
        lst+=text[0]
    fdist=FreqDist(lst)
    return fdist.most_common(10)
frequent_words(df_text)

[('t', 4705),
 ('co', 4370),
 ('https', 3090),
 ('I', 1993),
 ('http', 1255),
 ('m', 1031),
 ('n', 957),
 ('de', 653),
 ('que', 460),
 ('s', 450)]
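
`most_common` returns `(word, count)` pairs, so the word is the first element of each pair and the count the second. A sketch of the same idea with `collections.Counter` from the standard library (which `FreqDist` subclasses), run on made-up token lists:

```python
from collections import Counter

# Made-up tokenized tweets, for illustration only.
token_lists = [["t", "co", "bed"], ["t", "co", "drink"], ["t", "free"]]

counts = Counter(tok for tokens in token_lists for tok in tokens)
top = counts.most_common(2)            # list of (word, count) pairs
frequent = {word for word, _ in top}   # keep the words, drop the counts

filtered = [[tok for tok in tokens if tok not in frequent]
            for tokens in token_lists]
print(top)
print(filtered)
```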

In [None]:
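freq_words = frequent_words(df_text)

# most_common() yields (word, count) pairs; keep the words, not the counts.
# Comparing tokens against the counts would never match, so nothing would be removed.
freq_word_set = {word for word, count in freq_words}

def remove_freq_words(text):
    # Drop every token that is one of the most frequent words
    return [item for item in text if item not in freq_word_set]

df_text['Tweet']=df_text['Tweet'].apply(remove_freq_words)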



viii) Lemmatization

In [None]:
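from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize,pos_tag
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# Penn Treebank tag prefix -> WordNet POS. Adjectives are tagged 'J*',
# but WordNet expects 'a', so they need an explicit mapping.
PENN_TO_WORDNET = {'j': 'a', 'r': 'r', 'n': 'n', 'v': 'v'}
lemmatizer = WordNetLemmatizer()

def lemmatization(text):

    result=[]
    for token,tag in pos_tag(text):
        # Fall back to noun for tags that have no WordNet category
        pos = PENN_TO_WORDNET.get(tag[0].lower(), 'n')
        result.append(lemmatizer.lemmatize(token,pos))

    return result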

#Test
text = ['running','ran','runs']
lemmatization(text)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


['run', 'ran', 'run']

In [None]:
df_text['Tweet']=df_text['Tweet'].apply(lemmatization)
df_text.head()



Unnamed: 0,Tweet
0,"[Bugün, bulusmami, lazimdiii]"
1,"[Volkan, konak, adami, tribe, sokar, yemin, ed..."
2,[Bed]
3,"[I, felt, first, flash, violence, fool, bump, ..."
4,"[Ladies, drink, get, free, till, 10, 30]"


ix) Stemming

In [None]:
from nltk.stem import PorterStemmer

def stemming(text):
    porter = PorterStemmer()

    result=[]
    for word in text:
        result.append(porter.stem(word))
    return result

#Test
text=['Connects','Connecting','Connections','Connected','Connection','Connectings','Connect']
stemming(text)

['connect', 'connect', 'connect', 'connect', 'connect', 'connect', 'connect']

In [None]:
df_text['Tweet']=df_text['Tweet'].apply(stemming)
df_text.head()



Unnamed: 0,Tweet
0,"[bugün, bulusmami, lazimdiii]"
1,"[volkan, konak, adami, tribe, sokar, yemin, ed..."
2,[bed]
3,"[i, felt, first, flash, violenc, fool, bump, i..."
4,"[ladi, drink, get, free, till, 10, 30]"


x) Removal of Tags

In [None]:
import re
def remove_tag(text):

    text=' '.join(text)
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r'', text)

#Test
text = "<HEAD> this is head tag </HEAD>"
remove_tag(text.split())

' this is head tag '

In [None]:
df_text['Tweet'] = df_text['Tweet'].apply(remove_tag)
df_text.head()



Unnamed: 0,Tweet
0,bugün bulusmami lazimdiii
1,volkan konak adami tribe sokar yemin ederim d
2,bed
3,i felt first flash violenc fool bump i piti fool
4,ladi drink get free till 10 30


xi) Removal of URLs

In [None]:
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

#Test
text = "Machine learning knowledge is an awesome site. Here is the link for it https://machinelearningknowledge.ai/"
remove_urls(text)

'Machine learning knowledge is an awesome site. Here is the link for it '

In [None]:
df_text['Tweet'] = df_text['Tweet'].apply(remove_urls)
df_text.head()



Unnamed: 0,Tweet
0,bugün bulusmami lazimdiii
1,volkan konak adami tribe sokar yemin ederim d
2,bed
3,i felt first flash violenc fool bump i piti fool
4,ladi drink get free till 10 30
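
The `t`, `co`, `https` and `http` entries in the frequent-word list of step vii suggest that URLs had already been split into fragments by the punctuation step, which would leave nothing for `remove_urls` to match at this point. A sketch of applying the URL and tag patterns to the raw string first, before tokenization, using only the `re` module. The sample tweet and the helper name `strip_markup` are made up for illustration.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
TAG_PATTERN = re.compile(r"<.*?>")

def strip_markup(raw):
    # Remove URLs and HTML-like tags while the string is still intact,
    # then collapse the whitespace they leave behind.
    without_urls = URL_PATTERN.sub("", raw)
    without_tags = TAG_PATTERN.sub("", without_urls)
    return " ".join(without_tags.split())

sample = "<b>Ladies</b> drink free till 10:30 https://t.co/abc123"
cleaned = strip_markup(sample)
print(cleaned)
```

Running such a step on the raw `Tweet` column ahead of tokenization and punctuation removal keeps URL fragments out of the token lists in the first place.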


In [74]:
# Save the DataFrame as a TSV file
df_text.to_csv('processed_tweets.tsv', sep='\t', index=False)