# Goal

In this module, you will learn how to clean text data:

1. Removing punctuation
2. Lowercasing text
3. Removing URLs
4. Tokenization (Chinese)
5. Removing stop words
6. Stemming
7. Replacing emoji with text

# Remove punctuation

In [7]:
import string

In [8]:
string.punctuation   # the punctuation characters defined in the string module

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

## Example

In [27]:
sentence = "The cat~ says meow..!!"

In [46]:
punctuation_free = []
for character in sentence:
    if character not in string.punctuation:
        punctuation_free.append(character)

# join once, after the loop has collected every character we keep
no_pun_sentence = "".join(punctuation_free)

print(no_pun_sentence)

The cat says meow


## Alternatively, you can write this in one line!

In [52]:
no_pun_sentence = "".join([character for character in sentence if character not in string.punctuation]) 
       
print(no_pun_sentence)

The cat says meow
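A third option, shown here as a minimal sketch, is `str.translate` with a table built by `str.maketrans`. The variable name `remove_punct_table` is only illustrative.

```python
import string

sentence = "The cat~ says meow..!!"

# Build a translation table that deletes every character in string.punctuation
remove_punct_table = str.maketrans("", "", string.punctuation)

no_pun_sentence = sentence.translate(remove_punct_table)
print(no_pun_sentence)  # The cat says meow
```

Note that `string.punctuation` only contains ASCII punctuation, so full-width marks such as "，" are left untouched by all three approaches.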


## Task 1: Remove punctuation in the tweets

Hint:

Access the "Tweet" value of each entry in the dictionary and keep only the characters that are not in string.punctuation.

Join the remaining characters using "".join(...).

In [None]:
tweet_dict = {
  1: {"userid": "000234",
  "Tweet": "That strange moment when someone reminds the teacher about the homework."},
  2: {"userid": "002214",
  "Tweet": "Hey, I get really nervous when I see others studying so much before the test.!!"},     
}
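If nested dictionaries are new to you, the sketch below shows one way to reach the inner "Tweet" values. It uses a hypothetical `sample_dict` with the same shape as `tweet_dict`, and it only reads the text; the cleaning itself is left for the task.

```python
# Hypothetical sample data with the same nested shape as tweet_dict
sample_dict = {
    1: {"userid": "a01", "Tweet": "Good morning!!"},
    2: {"userid": "a02", "Tweet": "Is it Friday yet?"},
}

# .items() yields (key, inner dictionary) pairs
for key, record in sample_dict.items():
    print(key, record["userid"], record["Tweet"])

# Collect only the tweet texts into a list
tweets = [record["Tweet"] for record in sample_dict.values()]
print(tweets)
```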

# Lowering text

In [64]:
sentence = "The cat~ says meow..!!"
sentence.lower()

'the cat~ says meow..!!'

## Task 2: Lowercase the tweets (dictionary structure)

Use tweet_dict as an example.


# Remove URLs


Regular expressions let you check whether a string matches a given pattern.

Here's a quick start: https://www.rexegg.com/regex-quickstart.html

In [68]:
import re 

In [72]:
sentence_url = "The course materials are on https://lushichen.com/"

In [73]:
re.sub(r"http\S+", "", sentence_url)  # replace "http" followed by one or more non-whitespace characters with ""

'The course materials are on '
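The pattern above only catches links that start with "http". As a rough sketch, the variant below also removes links that start with "www." and tidies the leftover spaces. The example sentence and the name `url_pattern` are made up for illustration, and real-world URLs may need a stricter pattern.

```python
import re

sentence_url = "Slides: https://lushichen.com/ and www.rexegg.com are both useful"

# https?://\S+ matches http or https links; www\.\S+ matches links without a scheme
url_pattern = re.compile(r"https?://\S+|www\.\S+")
no_url = url_pattern.sub("", sentence_url)

# Collapse the repeated spaces left behind by the removed links
no_url = re.sub(r"\s+", " ", no_url).strip()
print(no_url)  # Slides: and are both useful
```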

# Tokenization

We already covered English tokenization in the previous module (Task 7: split words in a sentence).

In some languages, words are not separated by spaces, so we need a package to help with tokenization.
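A quick check illustrates the problem: splitting on whitespace works for an English sentence but returns a Chinese sentence as a single piece. The two example strings are chosen only for illustration.

```python
english = "The cat says meow"
chinese = "他喜欢川菜"

english_tokens = english.split()
chinese_tokens = chinese.split()

print(english_tokens)  # ['The', 'cat', 'says', 'meow']
print(chinese_tokens)  # ['他喜欢川菜']
```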

## Tokenizing Chinese

jieba ("Jieba" is Chinese for "to stutter") is a Chinese text segmentation package.

pip install jieba

In [121]:
import jieba.posseg as pseg
import jieba

In [119]:
jieba.add_word('于吉', freq=None, tag='nr')

In [129]:
text = "赵小刚曾经采访过一位意大利厨师，他十几年前来过中国，去了不少地方，他喜欢川菜，对粤菜也颇为上瘾，他好奇中餐这么好吃，高级的饭馆为什么这么少"
words = pseg.cut(text)
for w in words:
    print('%s %s' % (w.word, w.flag))

赵小刚 nr
曾经 d
采访 v
过 ug
一位 m
意大利 ns
厨师 n
， x
他 r
十几年 m
前来 n
过 ug
中国 ns
， x
去 v
了 ul
不少 d
地方 n
， x
他 r
喜欢 v
川菜 n
， x
对 p
粤菜 n
也 d
颇为 v
上瘾 v
， x
他 r
好奇 a
中餐 n
这么 r
好吃 v
， x
高级 b
的 uj
饭馆 n
为什么 r
这么 r
少 a


## Customize word boundaries

In [130]:
jieba.suggest_freq('前来', tune=True)  # returns the frequency jieba suggests for keeping '前来' as a single word

3001

In [131]:
jieba.add_word('前来', freq=10, tag='n')  # set a low frequency so that jieba no longer keeps '前来' together

In [132]:
text = "他十几年前来过中国"
words = pseg.cut(text)
for w in words:
    print('%s %s' % (w.word, w.flag))

他 r
十几年 m
前 f
来过 n
中国 ns


# Remove stop words

Install nltk first: pip install nltk

In [134]:
import nltk

In [136]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/luciachen/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [148]:
# Stop words included in the library
stopwords = nltk.corpus.stopwords.words('english')
stopwords[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [138]:
sentence = "Hey, I get really nervous when I see others studying so much before the test"

#### The stop word list is lowercase, so we need to lowercase the sentence before checking its words against the list

In [143]:
output = [word for word in sentence.lower().split(' ') if word not in stopwords]
print(output)

['hey,', 'get', 'really', 'nervous', 'see', 'others', 'studying', 'much', 'test']
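Notice that 'hey,' still carries its comma, because the sentence was split before punctuation was removed. The sketch below strips punctuation first. To stay self-contained it uses a tiny hand-written set, `demo_stopwords`, which is an assumption for illustration and not the NLTK list.

```python
import string

# Tiny hand-written stop word set, for illustration only (not the NLTK list)
demo_stopwords = {"i", "when", "so", "before", "the"}

sentence = "Hey, I get really nervous when I see others studying so much before the test"

# Remove punctuation first so that "Hey," becomes "Hey"
no_punct = sentence.translate(str.maketrans("", "", string.punctuation))

# Lowercase, split on whitespace, and drop the stop words
tokens = [word for word in no_punct.lower().split() if word not in demo_stopwords]
print(tokens)
# ['hey', 'get', 'really', 'nervous', 'see', 'others', 'studying', 'much', 'test']
```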


## Task 2: Customize the stopword list in nltk

We want to remove the word 'hey' from the sentence. Add this word to the nltk stop word list.

# Stemming 

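Stemming reduces a word to its root/base form, for example, "programming" -> "program" and "speaks" -> "speak". The result is not always a dictionary word: "really" becomes "realli" in the output below.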

In [154]:
from nltk.stem.porter import PorterStemmer

In [155]:
#defining the object for stemming
porter_stemmer = PorterStemmer()

In [157]:
stem_text = [porter_stemmer.stem(word) for word in sentence.split(' ')]
print(stem_text)

['hey,', 'i', 'get', 'realli', 'nervou', 'when', 'i', 'see', 'other', 'studi', 'so', 'much', 'befor', 'the', 'test']
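To look at the stemmer on single words, you can stem a short list directly. This is a small sketch; the word list and the `stems` dictionary are chosen only for illustration.

```python
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()

# Map each word to its Porter stem
words = ["programming", "speaks", "studying"]
stems = {word: porter_stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(word, "->", stem)
```

Stems such as "studi" are truncated forms rather than real words, which is usually fine when the goal is to group related word forms together.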


# Replace emoji with text

In [159]:
import emoji

In [160]:
text = "game is on 🔥 🔥"
emoji.demojize(text, delimiters=("", ""))  # 'game is on fire fire'

'game is on fire fire'
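If the emoji package is not available, a rough fallback is the standard library's `unicodedata` module, which knows the official Unicode name of each character. This is a different technique from `emoji.demojize`, and the helper name `replace_symbols_with_names` is hypothetical. It handles only single-character symbols, and the names it returns can differ from the ones the emoji package uses.

```python
import unicodedata


def replace_symbols_with_names(text):
    """Replace each 'Symbol, other' character with its lowercased Unicode name."""
    pieces = []
    for ch in text:
        # Category "So" covers most pictographic symbols, including 🔥
        if unicodedata.category(ch) == "So":
            pieces.append(unicodedata.name(ch, "").lower())
        else:
            pieces.append(ch)
    return "".join(pieces)


result = replace_symbols_with_names("game is on 🔥 🔥")
print(result)  # game is on fire fire
```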

# Task 3: Write a function to preprocess text 

This task joins all the previous steps into one function.

Here's how you define a function in Python:

def my_function(arg1, arg2):

    result = arg1 + arg2  # do something with the arguments

    return result

In [192]:
sentence = "hey, I want to organize a BBQ tonight 🔥 🔥, do you want to join?"


def preprocessing_text(text):
    
    return

# Task 4: Preprocess tweets in dictionary structure

In [207]:
tweet_dict = {
  1: {"userid": "000234",
  "Tweet": "That strange moment when someone reminds the teacher about the homework."},
  2: {"userid": "002214",
  "Tweet": "Hey, I get really nervous when I see others studying so much before the test.🔥 🔥!!"},     
}

In [214]:
def preprocess_tweets(tweets):
    new_tweets = {}  # fill this with the preprocessed tweets

    return new_tweets