# 13.12 Cleaning/Preprocessing Tweets for Analysis
* **Data cleaning** is one of data scientists' most common tasks 
* Some NLP tasks for normalizing tweets
    * Converting all text to the same case
    * Removing `@`-mentions, duplicates and hashtags (or just the `#` symbol from hashtags)
    * Removing excess whitespace, punctuation, **stop words**, URLs
    * Removing `RT` (retweet) and `FAV` (favorite) 
    * **Stemming** and **lemmatization**
    * **Tokenization**
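
Several of the tasks above can be sketched with only Python's standard library. The following is a minimal illustration, not a production-quality cleaner: the function name `normalize_tweet`, the regular expressions and the tiny `STOP_WORDS` set are assumptions made for this example, and stemming and lemmatization are omitted because they typically require an NLP library.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only
STOP_WORDS = {'a', 'an', 'the', 'is', 'to', 'of', 'and'}

def normalize_tweet(text):
    """Return a list of normalized tokens for a tweet's text."""
    text = text.lower()                         # convert to the same case
    text = re.sub(r'https?://\S+', ' ', text)   # remove URLs (simplified pattern)
    text = re.sub(r'@\w+', ' ', text)           # remove @-mentions
    text = text.replace('#', '')                # drop '#' but keep the hashtag's word
    text = re.sub(r'[^\w\s]', ' ', text)        # remove remaining punctuation
    tokens = text.split()                       # tokenize; also discards excess whitespace
    return [token for token in tokens if token not in STOP_WORDS]

print(normalize_tweet('The  #Mars rover is AMAZING, @nasa!  https://nasa.gov'))
# prints ['mars', 'rover', 'amazing']
```

The order of the steps matters in this sketch: URLs and `@`-mentions are removed before punctuation, because stripping `:`, `/` and `@` first would leave fragments such as `https` and `nasa` behind as ordinary words.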

### [**tweet-preprocessor**](https://github.com/s/preprocessor) Library and TextBlob Utility Functions
* tweet-preprocessor can automatically remove any combination of
	* URLs
	* `@`-mentions (like `@nasa`)
	* hashtags (like `#mars`)
	* Twitter reserved words (like `RT` for retweet and `FAV` for favorite, which is similar to a “like” on other social networks)
	* emojis (all or just smileys) 
	* numbers 

### tweet-preprocessor Constants for Cleaning Options

| Option | Option constant |
| :--- | :--- |
| @-Mentions (e.g., `@nasa`) | `OPT.MENTION` |
| Emoji | `OPT.EMOJI` |
| Hashtag (e.g., `#mars`) | `OPT.HASHTAG` |
| Number | `OPT.NUMBER` |
| Reserved Words (`RT` and `FAV`) | `OPT.RESERVED` |
| Smiley | `OPT.SMILEY` |
| URL | `OPT.URL` |

### Installing tweet-preprocessor
```shell
pip install tweet-preprocessor
```


### Cleaning a Tweet 
* Remove a reserved word (`RT` for "retweet") and a URL from a sample tweet
* The tweet-preprocessor library’s module name is **`preprocessor`**, and its documentation recommends importing it as **`p`**

In [None]:
import preprocessor as p

In [None]:
p.set_options(p.OPT.URL, p.OPT.RESERVED)

In [None]:
tweet_text = 'RT A sample retweet with a URL https://nasa.gov'

In [None]:
p.clean(tweet_text)
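
To make the idea of selecting cleaning options concrete, here is a rough standard-library approximation of option-based cleaning. It is only a sketch: `PATTERNS`, `clean_sketch` and the regular expressions are hypothetical names and simplified patterns chosen for this example, not tweet-preprocessor's actual implementation, and the library's own matching rules may differ.

```python
import re

# Hypothetical, simplified patterns keyed by option name (illustration only)
PATTERNS = {
    'URL': re.compile(r'https?://\S+'),
    'RESERVED': re.compile(r'\b(?:RT|FAV)\b'),
    'MENTION': re.compile(r'@\w+'),
    'HASHTAG': re.compile(r'#\w+'),
}

def clean_sketch(text, *options):
    """Remove every match of the selected patterns, then tidy the whitespace."""
    for option in options:
        text = PATTERNS[option].sub('', text)
    return ' '.join(text.split())  # collapse the gaps left by removed items

tweet_text = 'RT A sample retweet with a URL https://nasa.gov'
print(clean_sketch(tweet_text, 'URL', 'RESERVED'))
# prints A sample retweet with a URL
```

Only the selected options are applied, so the word `URL` inside the sentence and any hashtags or `@`-mentions would be left untouched by this call.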

------
&copy;1992&ndash;2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book [**Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud**](https://amzn.to/2VvdnxE).
