# Preprocessing Text Fields

This assignment walks through the basics of cleaning text fields with regex and pandas.

### Importing Libraries and Dataset

First, we import the necessary libraries and dataset. The dataset we are using for this assignment is [Kaggle's spam SMS dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset). The only modifications we've made to the dataset are encoding the CSV as UTF-8 so that pandas can read it, changing the ham and spam labels to 0 and 1 respectively, and cleaning up the column labels.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

sns.set()

In [None]:
sms = pd.read_csv("spam.csv")
sms.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis = 1, inplace = True)
sms.rename(columns = {"v1": "label", "v2": "text"}, inplace = True)
sms["label"] = sms["label"].map({"ham": 0, "spam": 1})
sms.head()

### Exploratory Data Analysis

Let's first look at how many spam and ham labels we have.

In [None]:
sms.groupby(by = ["label"]).count()

Since we have a class imbalance, a naive classifier that always outputs ham achieves an accuracy of 87%. To avoid rewarding this, we use the area under the ROC curve (AUROC) as our metric instead of accuracy. A perfect classifier has an AUROC of 1, while a naive classifier has an AUROC of 0.5. We can confirm this by calculating the AUROC of our naive always-ham classifier below.

You do not need to understand how this metric works, as it is not the goal of this assignment.

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(sms["label"], [0] * len(sms.index))
metrics.auc(fpr, tpr)

This shows that the naive classifier indeed has an AUROC of 0.5. Let's see if we can do better than that by processing the text!
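To make the contrast between accuracy and AUROC concrete, here is a minimal sketch on hypothetical toy labels (87 ham, 13 spam, chosen only to mirror the imbalance described above; they are not drawn from the dataset). It uses `metrics.roc_auc_score`, which computes the same quantity as `roc_curve` followed by `auc`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical toy labels: 87 ham (0) and 13 spam (1).
y_true = np.array([0] * 87 + [1] * 13)

# An "always ham" classifier assigns the same score to every text.
always_ham = np.zeros(len(y_true))

accuracy = metrics.accuracy_score(y_true, always_ham)
auroc = metrics.roc_auc_score(y_true, always_ham)

print("Accuracy:", accuracy)  # high, because most texts are ham
print("AUROC:", auroc)        # no better than chance
```

The accuracy looks respectable only because of the imbalance, while the AUROC shows that a constant score cannot rank any spam text above any ham text.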

Next, let's take a look at some of these text entries to see if we can find any differences between spam and ham texts by just reading them. 

In [None]:
print("First 15 Ham Texts:\n")
for i, ham_text in enumerate(sms.loc[sms["label"] == 0]["text"][:15]):
    print("{}. {}\n".format(i + 1, ham_text))

In [None]:
print("First 15 Spam Texts:\n")
for i, spam_text in enumerate(sms.loc[sms["label"] == 1]["text"][:15]):
    print("{}. {}\n".format(i + 1, spam_text))

From this initial inspection, we can see that our dataset isn't perfectly labeled. The sixth text in the ham texts is clearly spam, despite being labeled as ham. Even so, we can still see some differences between the two groups. Spam texts generally contain more numbers (either phone numbers or codes), links, and miscellaneous capitalization/punctuation compared to ham texts.

Let's see if we can use these indicators as features for a classifier!

### Finding Phone Numbers

Our classifier's first feature will be the number of digits of the longest number in the text. For example, a phone number like 09061209465 will have a length of 11. Add this feature as a column called "longestNum" to the `sms` dataframe. 

*Hint: To apply regex to a pandas series, look into series.str.function.*<br>
*Hint: Try to first find all numbers for a single text, and then search for the maximum length one.*
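
The two hints can be sketched on a small toy Series. This is one possible approach rather than the required solution; the texts and the variable names `toy`, `digit_runs`, and `longest` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy texts, not taken from the dataset.
toy = pd.Series([
    "Call 09061209465 now or text 80086",
    "see you at 5",
    "no digits here",
])

# Step 1: find every run of consecutive digits in each text.
digit_runs = toy.str.findall(r"\d+")

# Step 2: take the length of the longest run; texts without digits get 0.
longest = digit_runs.apply(lambda runs: max((len(r) for r in runs), default=0))

print(longest.tolist())  # [11, 1, 0]
```

The `default=0` argument matters because `max` raises an error on an empty sequence, which is what `findall` returns for a text without any digits.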

In [None]:
# BEGIN STUDENT SOLUTION

# END STUDENT SOLUTION

Let's plot the distribution of this new feature for spam and ham texts:

In [None]:
sns.histplot(
    data = sms,
    x = "longestNum",
    hue = "label"
)
plt.title("Distribution of longest number length")
plt.show()

From this plot, we can observe that this feature is pretty helpful in separating our data, as spam texts tend to have longer numbers than ham texts.

### Finding Links

In addition to just having numbers with more digits, spam texts also generally have more links than ham texts. However, one problem with trying to identify links is that they don't all take the same format. Links could have "http", "www", etc. as the prefix and ".com", ".net", etc. as the suffix. To simplify our search, assume that links will always have the following structure: 

| Prefix | Domain | Suffix |
| --- | --- | --- |
| "www" / "http" | text (at least 1 character) | ".com" / ".net" |

Note that the only character the domain text cannot contain is a space. Additionally, although the last part is called a suffix, it does not need to be followed by a space. Add a {0, 1} indicator for whether a text contains a link as a column called "containsLink" to the `sms` dataframe.

*Hint: The pipe (|) operator serves as an or.*<br>
*Hint: Parentheses may be helpful.*
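
As a rough illustration of how the pipe and parentheses combine, the sketch below applies one possible pattern for the simplified link structure to a toy Series. The texts and the names `toy`, `link_pattern`, and `contains_link` are hypothetical. Non-capturing groups `(?:...)` are used because `str.contains` emits a warning when the pattern contains capturing groups.

```python
import pandas as pd

# Hypothetical toy texts, not taken from the dataset.
toy = pd.Series([
    "visit www.example.com now",
    "http://foo.net/bar",
    "www.com",       # no domain text between prefix and suffix
    "hello there",
])

# prefix | at least one non-space character | suffix
link_pattern = r"(?:www|http)\S+(?:\.com|\.net)"

# str.contains returns booleans; cast to a {0, 1} indicator.
contains_link = toy.str.contains(link_pattern).astype(int)

print(contains_link.tolist())  # [1, 1, 0, 0]
```

The third text is rejected because `\S+` requires at least one character between the prefix and the suffix.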

In [None]:
# BEGIN STUDENT SOLUTION

# END STUDENT SOLUTION 

Let's take a look at these counts: 

In [None]:
sms.drop("longestNum", axis = 1).groupby(by=["label", "containsLink"]).count()

We can see that we found only 52 links in the entire dataset. While this is far lower than we might have hoped, it's not entirely unexpected, as we're looking at spam texts rather than spam emails. Despite the low counts, links are much more prevalent in spam texts than in ham texts. We will keep this feature, but it will be far less helpful than our first feature.

### Magic Words

One very naive way to process text is to have some features indicate whether "magic words" that you believe to be good separators appear in a piece of text. For example, "free" might be a good magic word, since many spam texts from our initial selection use that word. We define the following set of magic words: {"txt", "claim", "confirm", "free", "reply", "xxx", "reward", "award"}. Let our magic words be case insensitive, so "Free" and "free" both count towards the indicator.

However, we want to avoid having too many false positives on our indicator. Create two separate features, "magicWordsOnce" and "magicWordsTwice". The first feature indicates whether any of the magic words appear in the text. The second feature indicates whether magic words appear at least twice in the text. The occurrences do not have to be unique, so a text that's just "free free" would also activate this indicator. Add both of these {0, 1} indicators to the `sms` dataframe.

*Hint: Don't overcomplicate the regex for this problem.*<br>
*Hint: str.lower() converts the string into lowercase.*
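
One way to read these hints, sketched on toy data: lowercase the text, count how often any magic word occurs, and threshold the count. This is an illustrative approach, not the required solution; the texts and the names `toy`, `pattern`, `counts`, `once`, and `twice` are hypothetical.

```python
import pandas as pd

magic_words = ["txt", "claim", "confirm", "free", "reply", "xxx", "reward", "award"]

# Joining with the pipe gives a simple "any of these words" pattern.
pattern = "|".join(magic_words)

# Hypothetical toy texts, not taken from the dataset.
toy = pd.Series([
    "Free entry! Reply to CLAIM",
    "free free",
    "one FREE gift",
    "see you at lunch",
])

# Lowercase first so matching is case insensitive, then count matches.
counts = toy.str.lower().str.count(pattern)

once = (counts >= 1).astype(int)
twice = (counts >= 2).astype(int)

print(counts.tolist())  # [3, 2, 1, 0]
print(once.tolist())    # [1, 1, 1, 0]
print(twice.tolist())   # [1, 1, 0, 0]
```

Note that this simple pattern matches substrings, so a word such as "freedom" would also count as a hit for "free".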

In [None]:
# BEGIN STUDENT SOLUTION

# END STUDENT SOLUTION

Let's take a look at these counts:

In [None]:
sms.loc[:,["text", "label", "magicWordsOnce"]].groupby(by=["label", "magicWordsOnce"]).count()

In [None]:
sms.loc[:,["text", "label", "magicWordsTwice"]].groupby(by=["label", "magicWordsTwice"]).count()

We can see that these are much better indicators than using just links, since these indicators for our magic words are much more likely to go off. 

### Evaluating our Features

Now that we've processed our text, let's see how well a classifier performs using just these four features ("longestNum", "containsLink", "magicWordsOnce", and "magicWordsTwice"). We split our data into a train and test set and train a basic logistic regression classifier on the data.

In [None]:
X, y = sms.drop(["label", "text"], axis = 1), sms["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [None]:
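model = LogisticRegression().fit(X_train, y_train)
# Score with the predicted spam probability rather than hard 0/1 labels,
# so that the ROC curve is traced over every threshold.
fpr, tpr, thresholds = metrics.roc_curve(y_train, model.predict_proba(X_train)[:, 1])
train_auc = metrics.auc(fpr, tpr)
fpr, tpr, thresholds = metrics.roc_curve(y_test, model.predict_proba(X_test)[:, 1])
test_auc = metrics.auc(fpr, tpr)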

print("Train AUROC: ", train_auc)
print("Test AUROC: ", test_auc)

If you've added your features correctly, you should see an AUROC above 0.9. This shows that our classifier is fairly capable, despite using a simple model! Hopefully these exercises provide a good foundation for processing text with pandas and regex.

For additional practice, one aspect of regex that this assignment did not take advantage of is capture groups. Our features didn't care what the exact matches were (i.e., they didn't need to know which link was matched or which phone number appeared in a spam text), but there are applications where the match's exact value is vital.
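
As a starting point, here is a minimal sketch of capture groups with pandas' `str.extract`, which returns the text matched by the group rather than a yes/no answer. The toy texts, the patterns, and the names `numbers` and `domains` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy texts, not taken from the dataset.
toy = pd.Series([
    "Call 09061209465 to claim",
    "visit www.example.com today",
    "see you at lunch",
])

# The parentheses form a capture group: extract returns what the group matched.
# With a single group and expand=False, the result is a Series.
numbers = toy.str.extract(r"(\d{5,})", expand=False)

# Capture only the domain text between a "www." prefix and the suffix.
domains = toy.str.extract(r"www\.(\S+?)(?:\.com|\.net)", expand=False)

print(numbers.tolist())
print(domains.tolist())
```

Texts without a match get a missing value, so `numbers` holds "09061209465" only for the first text and `domains` holds "example" only for the second.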