## Text Preprocessing ##

This tutorial walks through a basic class for preprocessing text. 

Suppose we have a list of texts, for example a list of the top movie quotes.

In [3]:
import pandas as pd

df = pd.read_csv("movie_quotes.csv", encoding='utf-8')

print(df.head())

   #                                              QUOTE               MOVIE  \
0  1             Frankly, my dear, I don't give a damn.  GONE WITH THE WIND   
1  2       I'm gonna make him an offer he can't refuse.       THE GODFATHER   
2  3  You don't understand!  I coulda had class. I c...   ON THE WATERFRONT   
3  4  Toto, I've a feeling we're not in Kansas anymore.    THE WIZARD OF OZ   
4  5                        Here's looking at you, kid.          CASABLANCA   

   YEAR                     Director  
0  1939          Victor Fleming       
1  1972    Francis Ford Coppola       
2  1954              Elia Kazan       
3  1939          Victor Fleming       
4  1942          Michael Curtiz       


We will build a class which cleans up the quotes and generates either a one-hot encoding or a TF-IDF matrix. 

Typically, text data is cleaned by removing uninformative elements. First, words such as <i>i, me, myself, you, the</i> show up in many texts and carry little information. These are called <b>stopwords</b>, and we should remove them. Second, to create a more homogeneous corpus of text, we can lowercase all characters so that the feature <i>Frankly, my dear</i> is identical to <i>frankly, my dear</i>. Lastly, in this tutorial we will <b>stem</b> the words, reducing them to a root form. This means we will transform words like <i>fishing, fishes, fish</i> to the root word <i>fish</i>.
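As a rough illustration of these three steps, here is a minimal sketch in plain Python. The tiny stopword set and the `toy_stem` suffix stripper are hypothetical stand-ins invented for this example; they are far cruder than the nltk stopword list and the Porter stemmer used in the rest of the tutorial.

```python
# Hypothetical mini stopword list (the tutorial itself uses nltk's English list).
TOY_STOPWORDS = {"i", "me", "my", "myself", "you", "the", "a"}


def toy_stem(word):
    """Crude suffix stripping for illustration only; this is not the Porter algorithm."""
    for suffix in ("ing", "es", "s"):
        # Only strip when a root of at least three letters remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_clean(text):
    words = text.lower().split()                          # lowercase, then split on whitespace
    words = [w for w in words if w not in TOY_STOPWORDS]  # drop stopwords
    return [toy_stem(w) for w in words]                   # reduce each word to a rough root


print(toy_clean("I love the fishing"))
print([toy_stem(w) for w in ["fishing", "fishes", "fish"]])
```

All three variants of <i>fish</i> collapse to the same token, which is the point of stemming: they become a single feature instead of three.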

A class in Python is like a cookie cutter. We want a prototype so that every time we use the cookie cutter on a text dataset, the resulting object has the same shape, characteristics and functions. 

Let's start by initializing our class.

In [5]:
from nltk import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk import download

class Preprocess():
	def __init__(self, text, sw=stopwords.words('english'), lower=True, stem = True):

		if not (type(text)==pd.core.series.Series):
			text = pd.Series(text)

		self.text = text
		self.sw = sw
		self.lower = lower
		self.stem = stem

Every class in Python should be <i>initialized</i>. We use the function <b>$__init__$</b> (that is, two underscores preceding and following the word init) to give the object some basic characteristics. In our case, these are the settings for how we want to clean the data.

<b>self</b> is an important element of object-oriented programming. It refers to the instance of the class itself, and it is the variable we attach attributes to. 

<b>text</b> is the list of texts we want to clean up. This argument is mandatory: the user must supply it when calling the Preprocess class we are creating. 

<b>sw</b> is a list of stopwords. You'll notice it is given a default value, stopwords.words('english'). This means that by default the stopwords are imported from the nltk module (a nice natural language toolkit created by the kind people at UPenn, <a>http://www.nltk.org</a>). sw is therefore NOT mandatory: it takes this default if the user doesn't supply a value.

<b>lower</b> is a Boolean variable by default equal to True. We will use this later on to know whether or not to lowercase the texts.

<b>stem</b> is a Boolean variable by default equal to True. We will use this later on to know whether or not to stem the words in the texts.

The last lines, like <i>self.text = text</i>, attach attributes to our object. That is to say, elsewhere within the class we can refer to these attributes as self.text, and from outside the class as <i>class_instance</i>.text. 

(type(text)==pd.core.series.Series) checks whether the text is a Pandas Series. If it is not, we convert it to one. Pandas is a module which makes it easy to manipulate a collection of texts, for example lowercasing every element or returning a dataframe. 
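The ideas above (a mandatory argument, optional arguments with defaults, input coercion, and attributes attached to self) can be seen together in a stripped-down sketch. `MiniPreprocess` is a hypothetical stand-in for the tutorial's class: it wraps the input in a plain list rather than a Pandas Series and uses a made-up two-word default stopword list, so it runs without any third-party module.

```python
class MiniPreprocess:
    # Hypothetical, simplified version of Preprocess for illustration only.
    def __init__(self, text, sw=("the", "a"), lower=True):
        # Coerce a single string into a list, much as Preprocess
        # coerces its input into a Pandas Series.
        if not isinstance(text, list):
            text = [text]
        # Attach the arguments to the instance so other methods can read them.
        self.text = text
        self.sw = list(sw)
        self.lower = lower


defaults = MiniPreprocess("Here's looking at you, kid.")
custom = MiniPreprocess(["one", "two"], sw=[], lower=False)

print(defaults.text)             # the string was wrapped in a list
print(defaults.lower)            # default value, since none was supplied
print(custom.sw, custom.lower)   # values supplied by the caller
```

Both objects come from the same cookie cutter, so they expose the same attributes; only the values stored in them differ.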

Let's show this by example:

In [8]:
docs = Preprocess(df.QUOTE)

print(docs.text.head())
print(docs.lower)
print(docs.sw)

0               Frankly, my dear, I don't give a damn.
1         I'm gonna make him an offer he can't refuse.
2    You don't understand!  I coulda had class. I c...
3    Toto, I've a feeling we're not in Kansas anymore.
4                          Here's looking at you, kid.
Name: QUOTE, dtype: object
True
[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'

First we created an <i>instance</i> of the class, then we accessed its attributes (for text, I printed only the top part, or the head, of the series to save space). 

Now that we have the $__init__$ function, let's define our first <b>method</b>. A method is a function within a class. Its first argument must be self, as this variable carries the attributes we defined in the $__init__$ function. First, we check if the <i>lower</i> attribute is True; if so, we lowercase the text. Then we split each text into a list of words. This way we can cycle through each word and i) stem it if self.stem is True, and ii) remove it if it is a stopword. As with any other Python function, we can define functions within functions to keep the code clean. Lastly, we join each list of words back into a string and initialize a TfidfVectorizer object. TfidfVectorizer lets us create a matrix, stored in sparse form, where each column is a word in our vocabulary and each row corresponds to a document.  

In [16]:
class Preprocess():
	def __init__(self, text, sw=stopwords.words('english'), lower=True, stem = True):

		if not (type(text)==pd.core.series.Series):
			text = pd.Series(text)

		self.text = text
		self.sw = sw
		self.lower = lower
		self.stem = stem


	def clean_text(self):
		def stem(word_list):
			return map(lambda x: PorterStemmer().stem(x), word_list)

		def remove_sw(word_list):
			keep = []
			for word in word_list:
				if not word in self.sw:
					keep.append(word)
			return keep

		if self.lower:
			self.text = self.text.str.lower()

		self.text = self.text.apply(lambda x: x.split())
		
		if self.stem: self.text = self.text.apply(stem)
		if self.sw: self.text = self.text.apply(remove_sw)

		self.text = self.text.apply(lambda x: ' '.join(x))
		self.vectorizer = TfidfVectorizer()
		self.df_dense = self.vectorizer.fit_transform(self.text)


Let's test it out again:

In [17]:
docs = Preprocess(df.QUOTE)
docs.clean_text()

print(docs.text.head())
print(docs.vectorizer)
print(docs.df_dense[0:2])

0                      frankly, dear, don't give damn.
1                   i'm gonna make offer can't refuse.
2    don't understand! coulda class. coulda contend...
3                   toto, i'v feel we'r kansa anymore.
4                                 here' look you, kid.
Name: QUOTE, dtype: object
TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
  (0, 71)	0.451149812367
  (0, 113)	0.491651345454
  (0, 83)	0.330886087089
  (0, 74)	0.451149812367
  (0, 106)	0.491651345454
  (1, 237)	0.500815076903
  (1, 50)	0.407581696307
  (1, 211)	0.500815076903
  (1, 177)	0.407581696307
  (1, 117)	0.40758

Despite its name, the <b>df_dense</b> attribute holds a sparse representation of the TF-IDF matrix (<a>https://en.wikipedia.org/wiki/Tf%E2%80%93idf</a>): only the nonzero entries are stored. (0, 71) corresponds to the first document, "Frankly, my dear, I don't give a damn.", and to the word at index 71 of our vocabulary, which is contained in that quote. 0.451149... is the TF-IDF score.
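To make the score less of a black box, the sketch below recomputes TF-IDF by hand for a made-up three-document corpus. It assumes the TfidfVectorizer settings shown in the printout above (use_idf=True, smooth_idf=True, sublinear_tf=False, norm='l2'), for which scikit-learn documents the smoothed weight idf(t) = ln((1 + n) / (1 + df(t))) + 1. The function name `tfidf_rows` and the corpus are hypothetical, and the sketch simply splits on whitespace instead of using the vectorizer's token pattern.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for words in tokenized for w in words})
    n = len(docs)
    # Document frequency: the number of documents containing each word.
    df = Counter(w for words in tokenized for w in set(words))
    # Smoothed inverse document frequency.
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}

    rows = []
    for words in tokenized:
        counts = Counter(words)
        raw = [counts[w] * idf[w] for w in vocab]   # term frequency times idf
        norm = math.sqrt(sum(v * v for v in raw))   # L2 norm of the row
        rows.append([v / norm if norm else 0.0 for v in raw])
    return vocab, rows


vocab, rows = tfidf_rows(["give damn", "give offer", "look kid"])
for row in rows:
    # Print only the nonzero entries, like the sparse output above.
    print([(w, round(v, 3)) for w, v in zip(vocab, row) if v > 0])
```

In the first document, <i>damn</i> appears in only one document and so receives a larger weight than <i>give</i>, which appears in two. Each row is scaled to unit length, which is why the scores in a row are not simple counts.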

We may want to return this in a more readable form. So let's create a Pandas dataframe whose columns are the words in our vocabulary after cleaning the text, taken from self.vectorizer.get_feature_names() (replaced by get_feature_names_out() in recent scikit-learn releases), and whose values come from self.df_dense.toarray(), which converts the sparse matrix to a dense array. 

<b>onehot</b> is an argument that lets the user choose what the matrix contains: either a 1 wherever the document contains the ith word of the vocabulary, or the TF-IDF score. I do this in a separate function so the user can easily return either just the array or the dataframe. 
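The conversion from TF-IDF scores to one-hot indicators is just a thresholding step. Here is a minimal sketch with plain nested lists; the tutorial's class applies the same idea to a NumPy array, and both `binarize` and the sample scores are hypothetical.

```python
def binarize(matrix):
    """Return a copy with every positive score replaced by 1 and the rest by 0."""
    return [[1 if value > 0 else 0 for value in row] for row in matrix]


scores = [
    [0.0, 0.45, 0.49],
    [0.41, 0.0, 0.0],
]

print(binarize(scores))
print(scores)  # the original scores are left untouched
```

Working on a copy matters here: the stored TF-IDF scores stay available, so the user can still ask for them later.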

In [19]:
#################################
### Author: Paul Soto 		  ###
### 		paul.soto@upf.edu ###
#								#
# This file is a class to #######
# clean a series of text   ######
# with basic preprocessing     ##
#################################
import pandas as pd
from nltk import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk import download

# download('stopwords')

df = pd.read_csv("movie_quotes.csv", encoding='utf-8')

class Preprocess():
	def __init__(self, text, sw=stopwords.words('english'), lower=True, stem = True):

		if not (type(text)==pd.core.series.Series):
			text = pd.Series(text)

		self.text = text
		self.sw = sw
		self.lower = lower
		self.stem = stem


	def clean_text(self):
		def stem(word_list):
			return map(lambda x: PorterStemmer().stem(x), word_list)

		def remove_sw(word_list):
			keep = []
			for word in word_list:
				if not word in self.sw:
					keep.append(word)
			return keep

		if self.lower:
			self.text = self.text.str.lower()

		self.text = self.text.apply(lambda x: x.split())
		
		if self.stem: self.text = self.text.apply(stem)
		if self.sw: self.text = self.text.apply(remove_sw)

		self.text = self.text.apply(lambda x: ' '.join(x))
		self.vectorizer = TfidfVectorizer()
		self.df_dense = self.vectorizer.fit_transform(self.text)

	def array(self, onehot=1):
		array = self.df_dense.toarray().copy()
		if onehot:
			array[array>0] = 1
		return array

	def make_df(self,onehot=1):
		df = pd.DataFrame(columns=self.vectorizer.get_feature_names(),
							data = self.array(onehot))
		df['Text'] = self.text
		df = df[['Text']+list(df.columns[:-1])]
		return df

docs = Preprocess(df.QUOTE)
docs.clean_text()
text_df = docs.make_df(onehot=1)

print(text_df.head())

                                                Text  adrian  again  ahead  \
0                    frankly, dear, don't give damn.       0      0      0   
1                 i'm gonna make offer can't refuse.       0      0      0   
2  don't understand! coulda class. coulda contend...       0      0      0   
3                 toto, i'v feel we'r kansa anymore.       0      0      0   
4                               here' look you, kid.       0      0      0   

   ain  airplanes  alive  all  alone  alway    ...      win  wire  witness  \
0    0          0      0    0      0      0    ...        0     0        0   
1    0          0      0    0      0      0    ...        0     0        0   
2    0          0      0    0      0      0    ...        0     0        0   
3    0          0      0    0      0      0    ...        0     0        0   
4    0          0      0    0      0      0    ...        0     0        0   

   word  world  ya  yet  yo  you  youngster  
0     0      0  

There are many other methods you could add, such as one which lemmatizes the words or one which tags their part of speech. I'll leave it here for now.