
# Cleaning Data



**In this notebook, we clean the text in the dataset before it is used for training. We remove:**
 1. **Non-alphanumeric characters**
 2. **Numeric text**
 3. **URLs**
 4. **Stopwords (an, the, in, ...), using NLTK's English stopword list**
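
A filter that keeps only letters strips the punctuation out of a URL but can leave fragments such as `http` or `com` behind as tokens. One way to handle this is to drop URL-like substrings before any other cleaning. The sketch below is a minimal illustration, not part of the original notebook: the `URL_PATTERN` regex and the `strip_urls` helper are assumptions and will not catch every possible link format.

```python
import re

# Assumed pattern: matches http(s):// links and bare www. links up to the next whitespace
URL_PATTERN = re.compile(r'(?:https?://|www\.)\S+')

def strip_urls(text):
    """Replace every URL-like substring with a single space."""
    return URL_PATTERN.sub(' ', text)

sample = "Read more at https://example.com/story?id=42 or www.example.org today"
print(strip_urls(sample))
```

If used, a call like this would run on each story before the letter-only substitution, so that no URL fragments survive as words.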
 


**Load Libraries:**


In [6]:

import pandas as pd
import re
from nltk.corpus import stopwords
import scripts
import pickle


**Load Dataset:**


In [2]:

dataset = pd.read_csv("input/master_dataset.csv")
num_data_points = dataset.shape[0]


**Clean text found in news_story column:**


In [3]:
# Build the stopword set once; rebuilding it for every token is very slow
stop_words = set(stopwords.words('english'))

text_result = []
for text in dataset['news_story']:
    # Replace every character that is not a letter (digits, punctuation) with a space
    result = re.sub('[^a-zA-Z]', ' ', text)
    # Convert remaining characters to lowercase and split into tokens
    result = result.lower().split()
    # Remove stopwords from dataset (NLTK)
    result = [i for i in result if i not in stop_words]
    text_result.append(" ".join(result))
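
To see what the cleaning steps do to a single string, the same logic can be wrapped in a small function and run on a sample sentence. This is a sketch under stated assumptions: `clean_text` and `DEMO_STOPWORDS` are hypothetical names, and the tiny stopword set stands in for NLTK's full English list so that the example runs without downloading any corpus.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's full English list instead
DEMO_STOPWORDS = {"a", "an", "the", "in", "of", "on", "is", "and"}

def clean_text(text, stop_words=DEMO_STOPWORDS):
    """Keep letters only, lowercase, split into tokens, and drop stopwords."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)

print(clean_text("The 3 markets fell 2.5% in London on Monday!"))
# prints "markets fell london monday"
```

Digits and punctuation become spaces, so `2.5%` disappears entirely rather than collapsing into a stray token.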



**Check if number of rows in cleaned text matches number of rows in original text:**

In [4]:
scripts.check_size(num_data_points,len(text_result))


True
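
The `scripts` module is local to this project and its source is not shown here. Judging only from the call and its `True` output, `check_size` appears to compare two row counts. The function below is a hypothetical stand-in written under that assumption, not the project's actual implementation.

```python
def check_size(expected_rows, actual_rows):
    """Return True when both row counts match (hypothetical stand-in for scripts.check_size)."""
    return expected_rows == actual_rows

print(check_size(3, 3))  # prints True
print(check_size(3, 2))  # prints False
```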

**Save Cleaned Text as Pickle File:**


In [7]:
with open("input/clean.txt", "wb") as f:
    pickle.dump(text_result, f)
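
Although the file is named `clean.txt`, it holds binary pickle data, so it has to be read back with `pickle.load` in `"rb"` mode rather than opened as plain text. The round trip below is a minimal sketch: the sample list and the temporary directory are assumptions that stand in for `text_result` and the `input/` folder.

```python
import os
import pickle
import tempfile

# Sample data standing in for the notebook's text_result list
cleaned = ["markets fell london monday", "team won final"]

# A temporary directory stands in for the notebook's input/ folder
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "clean.txt")
    with open(path, "wb") as f:
        pickle.dump(cleaned, f)
    # Reload in binary mode; pickle data is not plain text despite the .txt name
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == cleaned)  # prints True
```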

