# NLP Basics: Reading in text data & why do we need to clean the text?

### Read in semi-structured text data
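This dataset contains SMS text messages, each labeled as spam or ham (not spam). The file is tab-separated: every line holds a label, a tab, and the message body.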

In [1]:
# Read in the raw text; the context manager closes the file when done
with open("../Support/SMSSpamCollection.tsv") as f:
    rawData = f.read()

# Print the raw data
rawData[0:500]

"ham\tI've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.\nspam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\nham\tNah I don't think he goes to usf, he lives around here though\nham\tEven my brother is not like to speak with me. They treat me like aid"

In [2]:
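# Turn every tab into a newline, then split on newlines,
# so labels and message bodies alternate in one flat list
parsedData = rawData.replace('\t', '\n').split('\n')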
parsedData[0:5]

['ham',
 "I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.",
 'spam',
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
 'ham']

In [3]:
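# Even positions hold the labels, odd positions hold the message bodies
labelList = parsedData[0::2]
textList = parsedData[1::2]
# If the raw text ends with a newline, the split leaves a trailing ''
# that lands in labelList; drop it so both lists have the same length
if len(labelList) != len(textList):
    labelList.remove('')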
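
The replace-and-split approach above works only while labels and bodies strictly alternate. As a minimal sketch of a more defensive alternative, each line can be split on its first tab instead. The `sample` string below is made up for illustration and is not taken from the dataset.

```python
# Hypothetical in-memory sample in the same label<TAB>text layout as the file
sample = "ham\tSee you at 5\nspam\tWin a prize now\nham\tOk\n"

labels, bodies = [], []
for line in sample.split("\n"):
    if not line:
        # Skip the empty string left by the trailing newline
        continue
    # Split on the first tab only, so a tab inside the message stays in the body
    label, body = line.split("\t", 1)
    labels.append(label)
    bodies.append(body)

print(labels)
print(bodies)
```

Because each label is paired with its body line by line, the two lists cannot drift out of step, and no length check is needed afterwards.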

In [4]:
import pandas as pd

fullCorpus = pd.DataFrame({
    'label': labelList,
    'body_list': textList
})

fullCorpus.head()

| | label | body_list |
|---|---|---|
| 0 | ham | I've been searching for the right words to tha... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... |
| 3 | ham | Even my brother is not like to speak with me. ... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |


### Easier way using pandas

In [5]:
dataset = pd.read_csv("../Support/SMSSpamCollection.tsv",
                      sep="\t", header=None)
dataset.columns = ['label', 'body_list']
dataset.head()

| | label | body_list |
|---|---|---|
| 0 | ham | I've been searching for the right words to tha... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... |
| 3 | ham | Even my brother is not like to speak with me. ... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |
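
One thing to keep in mind with `pd.read_csv` on free-form text: by default a double quote at the start of a field opens a quoted field, which can swallow following lines until a closing quote appears. The sketch below uses a made-up `sample` string (an assumption, not data from the file) to show the effect, and how passing `quoting=csv.QUOTE_NONE` treats quotes as ordinary characters. Whether this matters for this particular file is not established here.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: the first message opens a double quote,
# and the second message happens to close it
sample = 'ham\t"hello\nspam\tworld"\nham\tok\n'

# Default quoting: the quoted span is read as one multi-line field
default_df = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                         names=["label", "body_list"])

# QUOTE_NONE: quotes are plain characters, so every line is its own row
raw_df = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                     names=["label", "body_list"], quoting=csv.QUOTE_NONE)

print(len(default_df), len(raw_df))
```

Comparing the row count from `read_csv` against the number of lines in the raw file is a quick way to check whether any rows were merged.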


### Explore the dataset

In [6]:
# shape is (number of rows, number of columns)
n_rows, n_cols = dataset.shape
# Count the messages under each label
spam = len(dataset[dataset['label'] == 'spam'])
ham = len(dataset[dataset['label'] == 'ham'])
print("Input data has {} rows and {} columns".format(n_rows, n_cols))
print("number of spams: {}, number of hams: {}".format(spam, ham))

Input data has 5568 rows and 2 columns
number of spams: 746, number of hams: 4822
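
The per-label counts can also be obtained in one call with `Series.value_counts`, and `normalize=True` returns proportions instead of counts. This is a minimal sketch on a made-up `toy` DataFrame (hypothetical data, not the SMS corpus) that mirrors the column names used above.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the dataset
toy = pd.DataFrame({
    "label": ["ham", "spam", "ham", "ham"],
    "body_list": ["a", "b", "c", "d"],
})

# Number of messages per label
counts = toy["label"].value_counts()
# Fraction of messages per label
shares = toy["label"].value_counts(normalize=True)

print(counts)
print(shares)
```

Applied to the output above, 746 spam messages out of 5568 rows is roughly 13.4%, so the classes are noticeably imbalanced.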
