<a href="https://colab.research.google.com/github/gstripling00/conferences/blob/main/02_06_adv_nlp_solution_end.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Solution: Setup, Loading, Cleaning

# The Natural Language Toolkit Setup

In [None]:
import nltk

In [None]:

import os

# Specify the directory name
directory_name = "my_new_directory"

# Check if the directory already exists
if not os.path.exists(directory_name):
    # Create the directory
    os.makedirs(directory_name)
    print(f"Directory '{directory_name}' created successfully.")
else:
    print(f"Directory '{directory_name}' already exists.")


In [None]:
# Point NLTK at the new directory. In Colab, right-click the folder in the file browser and copy its path.

PATH = '/content/my_new_directory/nltk_data'
nltk.data.path.append(PATH)


In [None]:
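# Download only the stopwords corpus; calling nltk.download() without a
# package name opens the interactive downloader instead.
nltk.download('stopwords', download_dir=PATH)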

In [None]:
# Did it work?
from nltk.corpus import stopwords

stopwords.words('english')[:25]  # show the first 25 words

# NLP Basics: Reading In Text Data

### Read In Semi-structured Text Data

The raw dataset being used in this course can be found at: http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

In [None]:
# Read in and view the raw data
import pandas as pd

messages = pd.read_csv('/content/spam.csv', encoding='latin-1')
messages.head()

|   | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|----|----|------------|------------|------------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |  |  |  |
| 1 | ham | Ok lar... Joking wif u oni... |  |  |  |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |  |  |  |
| 3 | ham | U dun say so early hor... U c already then say... |  |  |  |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |  |  |  |
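
The `encoding='latin-1'` argument matters when a file contains bytes that are not valid UTF-8. The sketch below illustrates the idea with a made-up byte string (an assumption for illustration, not a line taken from `spam.csv`): the byte `0xA3` is the pound sign in Latin-1 but is not a valid UTF-8 sequence on its own.

```python
# Hypothetical raw bytes, used only for illustration (not from spam.csv).
raw = b"Win \xa3100 cash now"

# Decoding as UTF-8 fails because 0xA3 alone is not a valid UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 maps every byte to a character, so decoding always succeeds.
decoded = raw.decode("latin-1")

print(utf8_ok)
print(decoded)
```

Because Latin-1 never raises a decoding error, it can also hide a wrong guess: if a file is actually in another encoding, the text may load without complaint but contain incorrect characters.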


In [None]:
# Drop unused columns and label columns that will be used
messages = messages.drop(labels = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis = 1)
messages.columns = ["label", "text"]
messages.head()

|   | label | text |
|---|-------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [None]:
# How big is this dataset?
messages.shape

(5572, 2)

In [None]:
# What portion of our text messages are actually spam?
messages['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64
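
To express these counts as proportions, a minimal sketch in plain Python is shown below. The counts are copied from the output above; the variable names are only illustrative. In pandas, `messages['label'].value_counts(normalize=True)` should give the same shares directly.

```python
# Class counts copied from the value_counts() output above.
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

for label, share in proportions.items():
    print(f"{label}: {share:.1%}")
```

Roughly 13% of the messages are spam, so the classes are imbalanced, which is worth keeping in mind when evaluating a classifier on this data.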

In [None]:
# Are we missing any data?
print('Number of nulls in label: {}'.format(messages['label'].isnull().sum()))
print('Number of nulls in text: {}'.format(messages['text'].isnull().sum()))

Number of nulls in label: 0
Number of nulls in text: 0


# NLP Basics: Implementing A Pipeline To Clean Text

### Pre-processing Text Data

Cleaning the text removes noise so that the machine learning model can focus on the features that matter. This lesson covers three pre-processing steps:
1. Remove punctuation
2. Tokenize
3. Remove stopwords
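
The three steps can be sketched end to end on a single string. This is a minimal illustration, not the lesson's exact code: `clean_text` and `SAMPLE_STOPWORDS` are hypothetical names, and the tiny stopword set is an assumption standing in for NLTK's full English list so that the block runs without any downloads.

```python
import re
import string

# Tiny illustrative stopword set (an assumption); the lesson uses NLTK's English list.
SAMPLE_STOPWORDS = {"i", "am", "the", "a", "is", "to"}

def clean_text(text, stopwords=SAMPLE_STOPWORDS):
    # 1. Remove punctuation
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    # 2. Tokenize: lowercase, then split on runs of non-word characters
    tokens = re.split(r"\W+", no_punct.lower())
    # 3. Remove stopwords
    return [tok for tok in tokens if tok not in stopwords]

print(clean_text("I am learning NLP!"))
```

Each step is built as its own function in the cells that follow, which makes it easier to inspect the intermediate columns.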

Note: Update the file path below to read the data from the course GitHub repo.

In [None]:
# Read in raw data and clean up the column names
import pandas as pd
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('/content/spam.csv', encoding='latin-1')
messages = messages.drop(labels = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis = 1)
messages.columns = ["label", "text"]
messages.head()

### Remove Punctuation

In [None]:
# What punctuation is included in the default list?
import string

string.punctuation

In [None]:
# Why is it important to remove punctuation?
# The trailing period makes these two strings unequal, so the comparison is False.

"This message is spam" == "This message is spam."

In [None]:
# Define a function to remove punctuation in our messages
def remove_punct(text):
    # Keep every character that is not in string.punctuation
    return "".join(char for char in text if char not in string.punctuation)

messages['text_clean'] = messages['text'].apply(remove_punct)

messages.head()
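
An alternative way to strip punctuation is `str.translate` with a deletion table, which avoids the per-character Python loop. The sketch below is an optional variant, not the lesson's method; `remove_punct_translate` and `PUNCT_TABLE` are hypothetical names.

```python
import string

# Translation table that deletes every character in string.punctuation.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punct_translate(text):
    return text.translate(PUNCT_TABLE)

print(remove_punct_translate("Ok lar... Joking wif u oni..."))
print(remove_punct_translate("This message is spam.") == "This message is spam")
```

Either approach handles only the ASCII characters in `string.punctuation`; other symbols, such as curly quotes, pass through unchanged.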

### Tokenize

In [None]:
# Define a function to split our sentences into a list of words
import re

def tokenize(text):
    tokens = re.split(r'\W+', text)  # raw string avoids an invalid-escape warning
    return tokens

messages['text_tokenized'] = messages['text_clean'].apply(lambda x: tokenize(x.lower()))

messages.head()
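
One detail of `re.split` is worth knowing: when the text starts or ends with a non-word character, the result contains an empty string. The sketch below compares it with `re.findall`, shown here as a possible alternative rather than the lesson's approach. In this pipeline punctuation is removed first, so the issue would mainly arise from leading or trailing whitespace.

```python
import re

sample = "hello, world!"

# re.split leaves an empty string when the text ends with a non-word character.
split_tokens = re.split(r"\W+", sample)

# re.findall collects only the runs of word characters.
found_tokens = re.findall(r"\w+", sample)

print(split_tokens)
print(found_tokens)
```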

### Remove Stopwords

In [None]:
# What does an example look like?

tokenize("I am learning NLP".lower())

In [None]:
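# Load the list of stopwords built into nltk
nltk.download('stopwords')

# Store the words in a set so membership checks are fast.
# Note: this name shadows the `stopwords` corpus reader imported earlier.
stopwords = set(nltk.corpus.stopwords.words('english'))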

In [None]:
# Define a function to remove all stopwords
def remove_stopwords(tokenized_text):
    # Keep only the tokens that are not stopwords
    return [word for word in tokenized_text if word not in stopwords]

messages['text_nostop'] = messages['text_tokenized'].apply(remove_stopwords)

messages.head()

In [None]:
# Remove stopwords in our example
remove_stopwords(tokenize("I am learning NLP".lower()))