## Tokenization
Before we can classify any posts, we'll need to clean and tokenize the text data. Use what you remember from the last lesson on NLP to implement the function `tokenize`. This function should perform the following steps on the string, `text`, using nltk:

1. Identify any URLs in `text` and replace each one with the word `"urlplaceholder"`.
2. Split `text` into tokens.
3. For each token: lemmatize, normalize case, and strip leading and trailing white space.
4. Return the tokens in a list!

For example, this:
```python
text = 'Barclays CEO stresses the importance of regulatory and cultural reform in financial services at Brussels conference  http://t.co/Ge9Lp7hpyG'

tokenize(text)
```
should return this:
```txt
['barclays', 'ceo', 'stress', 'the', 'importance', 'of', 'regulatory', 'and', 'cultural', 'reform', 'in', 'financial', 'service', 'at', 'brussels', 'conference', 'urlplaceholder']
```

Hint: You'll have to add an import statement to use the `re` package (which supports regular expressions) and two import statements to use the appropriate functions from `nltk`! Add them to this first code cell.

In [2]:
# download necessary NLTK data
import nltk
nltk.download(['punkt', 'wordnet'])

[nltk_data] Downloading package punkt to /Users/jsuk/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/jsuk/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [103]:
# import statements
import pandas as pd
import re

from encodings.aliases import aliases

from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

## Load dataset

### Check for encoding of the file
'corporate_messaging.csv' is not UTF-8 encoded, so we need to find an encoding that can read it.

In [83]:
# The call below raises UnicodeDecodeError (pandas assumes UTF-8 by default)
# pd.read_csv('corporate_messaging.csv')
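
The same kind of failure can be reproduced without the CSV. This is a minimal sketch that assumes a hypothetical byte string in which the pound sign is stored as the single byte 0xA3, as Latin-1 does; the actual bytes in `corporate_messaging.csv` are not shown in this notebook.

```python
# Hypothetical sample: "£5.8bn" encoded as Latin-1, where £ is the single byte 0xA3
raw = "£5.8bn".encode("latin_1")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    # 0xA3 is a continuation byte in UTF-8 and cannot start a character
    utf8_ok = False
    print(f"utf-8 failed: {err.reason}")

# Decoding with the codec that produced the bytes recovers the text
latin1_text = raw.decode("latin_1")
print(utf8_ok, latin1_text)
```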

In [58]:
encoding_list = list(set(aliases.values()))

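def test_encoding(file, trace=False):
    """Return the encodings that read `file` with '_unit_id' as the first column."""
    encoding_success = []

    for encoding in encoding_list:
        try:
            df = pd.read_csv(file, encoding=encoding)
            # Count the encoding only if the header was decoded as expected
            if df.columns[0] == '_unit_id':
                encoding_success.append(encoding)
            if trace:
                print(f'Encoding with {encoding} successful')
        except Exception:
            # Covers decode errors, parser errors, and codecs that are not text encodings
            if trace:
                print(f'Encoding with {encoding} failed')

    return encoding_success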

In [81]:
test_encoding('corporate_messaging.csv')[:10] # latin-1 encoding works!

['latin_1',
 'cp857',
 'cp858',
 'iso8859_15',
 'iso8859_16',
 'cp862',
 'cp775',
 'hp_roman8',
 'mac_cyrillic',
 'mac_turkish']
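
Appearing in this list is a weak signal. Many single-byte codecs accept almost any byte sequence, and the check only looks at the ASCII header `_unit_id`, so a codec can "succeed" and still decode non-ASCII characters differently from another one. The sketch below illustrates this with a hypothetical byte string, not with data from the file.

```python
all_bytes = bytes(range(256))

# Latin-1 maps each of the 256 byte values to a code point, so this cannot fail
decoded = all_bytes.decode("latin_1")
print(len(decoded))

# The same hypothetical bytes read under two "successful" codecs give different text
sample = b"\xa35.8bn"
as_latin1 = sample.decode("latin_1")
as_cp857 = sample.decode("cp857")
print(as_latin1, as_cp857, as_latin1 == as_cp857)
```

For that reason it is worth inspecting a few decoded rows that contain non-ASCII characters before settling on an encoding.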

---
### Check for cleaning

In [86]:
df.head(1)

|  | `_unit_id` | `_golden` | `_unit_state` | `_trusted_judgments` | `_last_judgment_at` | `category` | `category:confidence` | `category_gold` | `id` | `screenname` | `text` |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 662822308 | False | finalized | 3 | 2/18/15 4:31 | Information | 1.0 |  | 4.36528e+17 | Barclays | Barclays CEO stresses the importance of regula... |


Given the text of a post, we want to predict its category. So `text` will be X (the independent variable) and `category` will be y (the response variable).

In [88]:
df.category.value_counts()

Information    2129
Action          724
Dialogue        226
Exclude          39
Name: category, dtype: int64

The 'Exclude' category is a miscellaneous label that carries no meaning for this analysis, so those rows need to be dropped. We will also keep only the rows whose category was assigned with 100% confidence (a value of 1 in the 'category:confidence' column).
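
The two conditions combine into a single boolean mask. Here is a small sketch on a hypothetical toy DataFrame (the rows are made up and only the column names match the dataset):

```python
import pandas as pd

# Hypothetical toy rows, not taken from corporate_messaging.csv
toy = pd.DataFrame({
    "category": ["Information", "Exclude", "Action", "Dialogue"],
    "category:confidence": [1.0, 1.0, 0.6, 1.0],
    "text": ["a", "b", "c", "d"],
})

# Each comparison is parenthesized because & binds tighter than != and ==
mask = (toy["category"] != "Exclude") & (toy["category:confidence"] == 1)
kept = toy[mask]
print(kept["category"].tolist())
```

The 'Exclude' row and the row with confidence below 1 are both dropped.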

In [89]:
def load_data():
    
    df = pd.read_csv('corporate_messaging.csv', encoding='latin-1')
    df = df[(df['category'] != 'Exclude') & (df['category:confidence'] == 1)]
    
    X = df['text']
    y = df['category']
    
    return X, y

In [94]:
# Test the function
X, y = load_data()
X.shape, y.shape

((2403,), (2403,))

#### For step 1, the regular expression to detect a URL is given below

In [97]:
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
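
A quick way to see what this pattern does is to apply it to a sample string. The sketch below uses a hypothetical message with a made-up `example.com` URL. It also shows why substituting with the general pattern, or with `re.escape(url)`, is safer than passing a matched URL back to `re.sub` as a pattern: characters such as `?` are then read as regex syntax.

```python
import re

url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# Hypothetical message with a query string in the URL
text = 'Read more at http://example.com/page?id=1 today'

# Substituting with the general pattern replaces every URL in one pass
replaced = re.sub(url_regex, 'urlplaceholder', text)
print(replaced)

# Using the matched URL itself as a pattern: '?' is then a quantifier, not a literal
url = re.findall(url_regex, text)[0]
unescaped = re.sub(url, 'urlplaceholder', text)
escaped = re.sub(re.escape(url), 'urlplaceholder', text)
print(unescaped == text, escaped == replaced)
```

Short links like the `t.co` ones in this dataset contain no such characters apart from `.`, which is why the difference is easy to miss.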

In [105]:
def tokenize(text):

    # Substitute every URL in the text with 'urlplaceholder'
    text = re.sub(url_regex, 'urlplaceholder', text)

    clean_tokens = []
    lemmatizer = WordNetLemmatizer()  # Instantiate lemmatizer

    # Normalize, tokenize and lemmatize the text
    for token in word_tokenize(text.lower().strip()):
        clean_tokens.append(lemmatizer.lemmatize(token))

    return clean_tokens

In [106]:
# test out function
X, y = load_data()
for message in X[:5]:
    tokens = tokenize(message)
    print(message)
    print(tokens, '\n')

Barclays CEO stresses the importance of regulatory and cultural reform in financial services at Brussels conference  http://t.co/Ge9Lp7hpyG
['barclays', 'ceo', 'stress', 'the', 'importance', 'of', 'regulatory', 'and', 'cultural', 'reform', 'in', 'financial', 'service', 'at', 'brussels', 'conference', 'urlplaceholder'] 

Barclays announces result of Rights Issue http://t.co/LbIqqh3wwG
['barclays', 'announces', 'result', 'of', 'right', 'issue', 'urlplaceholder'] 

Barclays publishes its prospectus for its å£5.8bn Rights Issue: http://t.co/YZk24iE8G6
['barclays', 'publishes', 'it', 'prospectus', 'for', 'it', 'å£5.8bn', 'right', 'issue', ':', 'urlplaceholder'] 

Barclays Group Finance Director Chris Lucas is to step down at the end of the week due to ill health http://t.co/nkuHoAfnSD
['barclays', 'group', 'finance', 'director', 'chris', 'lucas', 'is', 'to', 'step', 'down', 'at', 'the', 'end', 'of', 'the', 'week', 'due', 'to', 'ill', 'health', 'urlplaceholder'] 

Barclays announces that Irene M