# Data Cleaning

This notebook walks through common text-cleaning steps:
* **remove punctuation**: remove all punctuation from a string
* **stop words**: words which are filtered out before or after processing of text
* **stemming**: process of reducing inflected (or sometimes derived) words to their word stem, base or root form
* **lemmatization**: process of grouping together the inflected forms of a word so they can be analysed as a single item
* **tokenization**: process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens



In [None]:
import pandas as pd

In [None]:
tweets = pd.read_csv('twitter.csv')

In [None]:
t0 = tweets.tweet[0]

In [None]:
t0

' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run'

## Remove Punctuation

In [None]:
import string

In [None]:
t0.translate(str.maketrans('', '', string.punctuation))

' user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction   run'

In [None]:
# another way to remove punctuation
''.join([char for char in t0 if char not in string.punctuation])

' user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction   run'

In [None]:
# another way to remove punctuation
"".join(filter(lambda x: x not in string.punctuation, t0))

' user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction   run'
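All three variants above delete every character in `string.punctuation`, including the `@` and `#` that mark mentions and hashtags. The following is a minimal sketch of a variant that keeps selected characters; the helper name `strip_punctuation` and its `keep` parameter are made up for illustration and are not part of the notebook.

```python
import string

# Hypothetical helper: remove punctuation except the characters listed in `keep`.
def strip_punctuation(text, keep="@#"):
    # Build the set of characters to delete: all punctuation minus the kept ones.
    drop = "".join(ch for ch in string.punctuation if ch not in keep)
    return text.translate(str.maketrans("", "", drop))

sample = "@user can't wait... #run!"
print(strip_punctuation(sample))           # @user cant wait #run
print(strip_punctuation(sample, keep=""))  # user cant wait run
```

Whether mentions and hashtags should survive depends on the downstream task, so the default here is only one possible choice.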

## Remove stop words

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
#stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
t0

' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run'

In [None]:
#remove stop words from the tweet

' '.join([word for word in t0.split() if word.lower() not in stopwords.words('english')])

'@user father dysfunctional selfish drags kids dysfunction. #run'
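The comprehension above calls `stopwords.words('english')` for every word and then searches a list. Building a set once and reusing it avoids that repeated work. The sketch below uses a tiny hand-written stop list as a stand-in (an assumption, so that it runs without NLTK data); in the notebook the set would be `set(stopwords.words('english'))`. The function name `remove_stopwords` is hypothetical.

```python
# Stand-in stop list for illustration only; in the notebook this would be
# set(stopwords.words('english')).
STOP_WORDS = {"when", "a", "is", "and", "so", "he", "his", "into"}

def remove_stopwords(text, stop_words=STOP_WORDS):
    # Set membership is a constant-time lookup, and the set is built only once.
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

t0 = " @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run"
print(remove_stopwords(t0))
# @user father dysfunctional selfish drags kids dysfunction. #run
```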

### Exercise
1. Write a function that removes all punctuation and stop words from a given `text`
2. Run the function on all tweets


In [None]:
def remove_stopwords_punctuation(text):
    return text

<details>
    <summary>Click to reveal answer</summary>
    <p class="answer">
    def remove_stopwords_punctuation(text):
        # remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))
        # remove stop words
        return ' '.join([word for word in text.split() if word.lower() not in stopwords.words('english')])
    </p>
</details>
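The order of the two steps can change the result: once punctuation is removed, a contraction such as `don't` becomes `dont` and no longer matches a stop-list entry that contains the apostrophe. The sketch below illustrates this with a small stand-in stop list (an assumption; the notebook uses `stopwords.words('english')`) and two hypothetical helpers.

```python
import string

# Stand-in stop list for illustration; not the NLTK list.
STOP_WORDS = {"i", "they", "don't", "for"}

def drop_stop_words(text):
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

def drop_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

text = "thanks for #lyft credit, they don't offer vans"

# Punctuation first: "don't" turns into "dont" and slips past the stop list.
print(drop_stop_words(drop_punctuation(text)))  # thanks lyft credit dont offer vans

# Stop words first: "don't" is still intact and gets filtered out.
print(drop_punctuation(drop_stop_words(text)))  # thanks lyft credit offer vans
```

Neither order is wrong in itself; the point is to pick one deliberately and check the output.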


Remove stop words and punctuation from all tweets and save the result in a new column called `cleaned`. The first rows should then look like this:

| Unnamed: 0 | id | label | tweet | cleaned |
|---|---|---|---|---|
| 0 | 1 | 0 | @user when a father is dysfunctional and is s... | user father dysfunctional selfish drags kids d... |
| 1 | 2 | 0 | @user @user thanks for #lyft credit i can't us... | user user thanks lyft credit cant use cause do... |
| 2 | 3 | 0 | bihday your majesty | bihday majesty |
| 3 | 4 | 0 | #model i love u take with u all the time in ... | model love u take u time urð± ðððð... |
| 4 | 5 | 0 | factsguide: society now #motivation | factsguide society motivation |


<details>
    <summary>Click to reveal answer</summary>
    <p class="answer">
    tweets['cleaned'] = tweets.tweet.apply(remove_stopwords_punctuation)
    tweets.head()
    </p>
</details>


## Stemming

In [None]:
input1 = "List listed lists listing listings"
words1 = input1.lower().split(' ')
words1

['list', 'listed', 'lists', 'listing', 'listings']

In [None]:
porter = nltk.PorterStemmer()

In [None]:
[porter.stem(t) for t in words1]

['list', 'list', 'list', 'list', 'list']

## Lemmatization

In [None]:
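# the lemmatizer needs the WordNet data; quiet=True suppresses the download log
nltk.download('wordnet', quiet=True)
WNlemma = nltk.WordNetLemmatizer()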

In [None]:
[WNlemma.lemmatize(t) for t in words1]

['list', 'listed', 'list', 'listing', 'listing']

## Stemming vs Lemmatization

* **Stemming**: typically faster, because it simply chops off the end of a word using heuristics, without any understanding of the context in which the word is used.
    * Faster
    * Less accurate
    * More aggressive
    * Strips suffixes by rule (the Porter stemmer used above does not remove prefixes)
    * Result is not always a real word

* **Lemmatization**: typically more accurate, because it looks each word up in a vocabulary and uses morphological analysis, ideally together with the word's part of speech, to return its dictionary form (lemma). NLTK's `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, which is why `listed` was left unchanged above.
    * Slower
    * More accurate
    * Less aggressive
    * Maps inflected forms to their lemma instead of cutting off affixes
    * Result is always a real word
    * Requires a dictionary
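To make the contrast concrete, here is a deliberately simplified sketch: a rule-based suffix stripper next to a dictionary lookup. Both functions and both tables are toy stand-ins invented for illustration. `toy_stem` is not the Porter algorithm, and `LEMMAS` is a hypothetical three-entry dictionary rather than WordNet.

```python
# Toy suffix rules, tried in order (longest first). NOT the Porter algorithm.
SUFFIXES = ("ings", "ing", "ed", "es", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini dictionary standing in for a real lexical resource.
LEMMAS = {"mice": "mouse", "studies": "study", "listings": "listing"}

def toy_lemmatize(word):
    # Unknown words are returned unchanged.
    return LEMMAS.get(word, word)

for w in ["listings", "studies", "mice"]:
    print(w, toy_stem(w), toy_lemmatize(w))
# listings list listing
# studies studi study
# mice mice mouse
```

The rule-based version needs no data but yields the non-word `studi` and cannot handle the irregular plural `mice`; the lookup version handles both but only for words its dictionary contains.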


## Tokenization
* **word_tokenize**: split a string into word and punctuation tokens
* **sent_tokenize**: split a string into sentences

In [None]:
text11 = "Children shouldn't drink a sugary drink before bed."
text11.split(' ')

['Children', "shouldn't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed.']

In [None]:
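# word_tokenize relies on the 'punkt' tokenizer data, which a later cell fetches with nltk.download('punkt')
nltk.word_tokenize(text11)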

['Children',
 'should',
 "n't",
 'drink',
 'a',
 'sugary',
 'drink',
 'before',
 'bed',
 '.']
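The difference between the two outputs comes from how punctuation is treated: `split(' ')` leaves `bed.` glued together, while a tokenizer emits the period as its own token. A rough approximation can be sketched with a regular expression. The function `simple_tokenize` is a hypothetical helper and does not reproduce NLTK's rules; for example, it keeps `shouldn't` as one token instead of splitting it into `should` and `n't`.

```python
import re

def simple_tokenize(text):
    # Match a word (optionally with internal apostrophes),
    # or any single character that is neither a word character nor whitespace.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

text11 = "Children shouldn't drink a sugary drink before bed."
print(simple_tokenize(text11))
# ['Children', "shouldn't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed', '.']
```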

In [None]:
nltk.download('punkt')
text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"
sentences = nltk.sent_tokenize(text12)
len(sentences)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


4

In [None]:
sentences

['This is the first sentence.',
 'A gallon of milk in the U.S. costs $2.99.',
 'Is this the third sentence?',
 'Yes, it is!']
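For comparison, a naive rule such as "split after `.`, `!` or `?` followed by whitespace" gets this text wrong, because the abbreviation `U.S.` looks like the end of a sentence. The sketch below uses only the standard `re` module and is meant to show the failure, not to replace `sent_tokenize`.

```python
import re

text12 = ("This is the first sentence. A gallon of milk in the U.S. costs $2.99. "
          "Is this the third sentence? Yes, it is!")

# Split on whitespace that directly follows a sentence-ending character.
naive_sentences = re.split(r"(?<=[.!?])\s+", text12)

print(len(naive_sentences))  # 5
print(naive_sentences[1])    # A gallon of milk in the U.S.
print(naive_sentences[2])    # costs $2.99.
```

The naive rule produces five pieces instead of four, while `sent_tokenize` keeps the second sentence in one piece, as the output above shows.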