# II. Data Processing

To build a model that predicts the label of each document, we first need to pre-process the data.

The pre-processed data will then be used to create a prediction model.

Here are the preprocessing steps we will apply:
- Word tokenization
- Lowercasing
- Punctuation removal
- Stopword removal
- Removal of the most common words
- Removal of the least common words

Finally, we save the modified dataset.

In [1]:
# import libraries
import pandas as pd
from sklearn import preprocessing
import sklearn.model_selection as ms
from sklearn import linear_model
import sklearn.metrics as sklm
import numpy as np
import numpy.random as nr
import seaborn as sns
import scipy.stats as ss
import math

%matplotlib inline


In [2]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

### 1. Import datasets

In [3]:
# Import the first data set (train_values)
train = pd.read_csv('train_values.csv', sep=',')
train.shape

(18830, 2)

In [4]:
train.head(5)

Unnamed: 0,row_id,doc_text
0,0,"For more information, visit http://www.wor..."
1,1,...
2,2,...
3,3,71399\r\n\r\nPr...
4,4,90189\r\n\r\n\r\n\r\...


### 2. Text Processing

**2.1 Word Tokenize**

In [5]:
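# Tokenize the training text into words.
# The result is only displayed here; it is not assigned back to train['doc_text'].
from nltk.tokenize import word_tokenize

train['doc_text'].apply(word_tokenize)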

0        [For, more, information, ,, visit, http, :, //...
1        [www.ifc.org/ThoughtLeadership, Note, 28, |, J...
2        [WPS4830, P, olicy, R, eseaRch, W, oRking, P, ...
3        [71399, Procurement, Monitoring, and, Social, ...
4        [90189, Executive, Board, Meeting, Minutes, of...
5        [90966, International, Comparison, Program, IC...
6        [CENTRAL, ASIA, EARTHQUAKE, RISK, REDUCTION, F...
7        [46966, Samuel, Clark, Blair, Palmer, $, :, TH...
8        [53062, PPI, data, update, note, 26, November,...
9        [63696, Creating, opportunities, for, women, I...
10       [49736, STATUS, OF, PROJECTS, IN, EXECUTION, F...
11       [FIXED, INCOME, In, Focus, INSTRUMENTS, TO, MO...
12       [72668, v1, World, Trade, Indicators, 2009/10,...
13       [62250, Successful, Education, Reform, :, Less...
14       [WPS5990, Policy, Research, Working, Paper, 59...
15       [Report, on, Advisory, Services, Operations, i...
16       [48573, Volume, 7, ,, Number, 4, April, 2009, .
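
The output above shows punctuation marks such as `,` and `:` becoming tokens of their own, which a plain whitespace split (used in the later steps) does not do. The sketch below contrasts the two on a made-up sentence. The helper `rough_tokenize` and its regular expression are assumptions for illustration only; they approximate word-level tokenization and are not NLTK's algorithm.

```python
import re

def rough_tokenize(text):
    """Approximate word-level tokenization: runs of word characters,
    or single punctuation marks. Not NLTK's actual algorithm."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "For more information, visit us."

# Whitespace split keeps punctuation attached to the neighbouring word
whitespace_tokens = sample.split()

# The rough tokenizer separates punctuation into its own tokens
rough_tokens = rough_tokenize(sample)

print(whitespace_tokens)
print(rough_tokens)
```

Because the following cells rely on `x.split()`, punctuation stays attached to words until it is removed explicitly in step 2.3.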

**2.2 Lowercase**

In [6]:
# Text preprocessing (train)
# Lowercase every word; the split/join also collapses runs of whitespace
train['doc_text'] = train['doc_text'].apply(lambda x: " ".join(word.lower() for word in x.split()))
train['doc_text'].head()




0    for more information, visit http://www.worldba...
1    www.ifc.org/thoughtleadership note 28 | januar...
2    wps4830 p olicy r esearch w orking p aper 4830...
3    71399 procurement monitoring and social accoun...
4    90189 executive board meeting minutes of meeti...
Name: doc_text, dtype: object
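
The expression in this cell does two things at once: it lowercases the text, and, because it splits on whitespace and rejoins with single spaces, it also collapses the `\r\n` runs visible in the raw preview. A small sketch on a made-up string (modelled loosely on row 3, not taken from the dataset) shows the difference from plain lowercasing:

```python
# Made-up sample with Windows line breaks and repeated spaces
raw = "71399\r\n\r\nProcurement   Monitoring"

# Same expression as in the cell above, applied to a single string
normalized = " ".join(word.lower() for word in raw.split())

# Plain lowercasing leaves the original whitespace untouched
lowered_only = raw.lower()

print(repr(normalized))
print(repr(lowered_only))
```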

**2.3 Remove punctuation**

In [7]:
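# Remove punctuation (train)
# regex=True is needed on recent pandas versions, where str.replace matches literally by default
train['doc_text'] = train['doc_text'].str.replace(r'[^\w\s]', '', regex=True)
train['doc_text'].head()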

0    for more information visit httpwwwworldbankorg...
1    wwwifcorgthoughtleadership note 28  january 20...
2    wps4830 p olicy r esearch w orking p aper 4830...
3    71399 procurement monitoring and social accoun...
4    90189 executive board meeting minutes of meeti...
Name: doc_text, dtype: object
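
The first two rows of the output show a side effect of deleting punctuation: URLs are fused into one long token. The sketch below reproduces this with the same pattern on a made-up URL, and shows one possible alternative (an assumption, not what the notebook does): replacing punctuation with a space so the pieces stay separate.

```python
import re

text = "visit http://www.example.org/docs today"

# Deleting punctuation, as in the cell above, fuses the URL into a single token
deleted = re.sub(r"[^\w\s]", "", text)

# Alternative: substitute a space, then collapse repeated spaces
spaced = " ".join(re.sub(r"[^\w\s]", " ", text).split())

print(deleted)
print(spaced)
```

Note that `\w` also matches the underscore, so underscores survive this pattern in either variant.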

**2.4 Stopwords removal**

In [8]:
# Stopwords removal (train)
from nltk.corpus import stopwords
# A set gives fast membership tests; the words removed are the same as with a list
stop = set(stopwords.words('english'))
train['doc_text'] = train['doc_text'].apply(lambda x: " ".join(word for word in x.split() if word not in stop))
train['doc_text'].head()

0    information visit httpwwwworldbankorgprospects...
1    wwwifcorgthoughtleadership note 28 january 201...
2    wps4830 p olicy r esearch w orking p aper 4830...
3    71399 procurement monitoring social accountabi...
4    90189 executive board meeting minutes meeting ...
Name: doc_text, dtype: object
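
The filtering logic of this step can be sketched in plain Python. The stopword set below is a tiny hypothetical stand-in for `stopwords.words('english')`, chosen so the example runs without downloading NLTK data; `remove_stopwords` is likewise an illustrative helper, not part of the notebook.

```python
# Hypothetical mini stopword list; the notebook uses stopwords.words('english')
stop_words = {"for", "more", "and", "of", "the"}

def remove_stopwords(text, stop_words):
    """Drop every whitespace-separated token that appears in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)

cleaned = remove_stopwords("for more information visit the site", stop_words)
print(cleaned)
```

Keeping the stopwords in a set rather than a list makes each membership test constant time on average, which adds up over a corpus of this size.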

**2.5 Remove common words**

In [9]:
# View the 10 most common words (train)
freq = pd.Series(' '.join(train['doc_text']).split()).value_counts()[:10]
freq

bank           158238
world          147280
development    121561
business       103123
countries       97972
1               93992
data            74815
2               73725
percent         70188
sector          67257
dtype: int64

In [10]:
# Remove the 10 most common words (train)
freq = list(freq.index)
train['doc_text'] = train['doc_text'].apply(lambda x: " ".join(word for word in x.split() if word not in freq))
train['doc_text'].head()

0    information visit httpwwwworldbankorgprospects...
1    wwwifcorgthoughtleadership note 28 january 201...
2    wps4830 p olicy r esearch w orking p aper 4830...
3    71399 procurement monitoring social accountabi...
4    90189 executive board meeting minutes meeting ...
Name: doc_text, dtype: object
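
The frequency count above first joins the whole corpus into one string. As a minimal sketch of the same idea, the counts can also be accumulated document by document with `collections.Counter`. The toy corpus and the name `top_k` are assumptions for illustration.

```python
from collections import Counter

# Toy corpus standing in for train['doc_text']
docs = [
    "bank world development report",
    "world bank data note",
    "bank sector review",
]

# Count tokens one document at a time
counts = Counter()
for doc in docs:
    counts.update(doc.split())

top_k = 2  # the notebook removes the top 10
most_common = {word for word, _ in counts.most_common(top_k)}

# Drop the most frequent words from every document
filtered = [
    " ".join(word for word in doc.split() if word not in most_common)
    for doc in docs
]
print(filtered)
```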

**2.6 Remove rare words**

In [11]:
# View the 10 least common words (train)
freq = pd.Series(' '.join(train['doc_text']).split()).value_counts()[-10:]
freq

69944              1
dic09              1
threeaddressing    1
ipsr               1
presenc            1
flambee            1
gobek              1
prazos             1
geex19266          1
59829              1
dtype: int64

In [12]:
# Remove the 10 least common words (train)
freq = list(freq.index)
train['doc_text'] = train['doc_text'].apply(lambda x: " ".join(word for word in x.split() if word not in freq))
train['doc_text'].head()

0    information visit httpwwwworldbankorgprospects...
1    wwwifcorgthoughtleadership note 28 january 201...
2    wps4830 p olicy r esearch w orking p aper 4830...
3    71399 procurement monitoring social accountabi...
4    90189 executive board meeting minutes meeting ...
Name: doc_text, dtype: object
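
All ten words listed above occur exactly once. If many more words in the corpus also occur only once (likely for a corpus this size, though the notebook does not show the count), taking the last ten entries removes only a handful of them. One possible alternative, sketched here on a toy corpus, is to remove every word below a count threshold. The name `min_count` and its value are hypothetical.

```python
from collections import Counter

# Toy corpus standing in for train['doc_text']
docs = [
    "procurement monitoring flambee",
    "procurement board gobek",
    "board monitoring meeting meeting",
]

counts = Counter(word for doc in docs for word in doc.split())

min_count = 2  # hypothetical threshold: keep words seen at least twice
rare = {word for word, n in counts.items() if n < min_count}

# Drop every rare word from every document
filtered = [
    " ".join(word for word in doc.split() if word not in rare)
    for doc in docs
]
print(sorted(rare))
print(filtered)
```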

**2.7 Stemming**

In [13]:
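# Stemming (train)
# The stemmed text is only displayed; it is not assigned back to train['doc_text'],
# so the file saved in section 3 contains the unstemmed text.
from nltk.stem import PorterStemmer
st = PorterStemmer()
train['doc_text'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))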

0        inform visit httpwwwworldbankorgprospect 98944...
1        wwwifcorgthoughtleadership note 28 januari 201...
2        wps4830 p olici r esearch w ork p aper 4830 de...
3        71399 procur monitor social account curriculum...
4        90189 execut board meet minut meet octob 14 20...
5        90966 intern comparison program icp revis back...
6        central asia earthquak risk reduct forum forum...
7        46966 samuel clark blair palmer indonesian soc...
8        53062 ppi updat note 26 novemb 2009 privat act...
9        63696 creat opportun women ifc women program f...
10       49736 statu project execut fy08 latin america ...
11       fix incom focu instrument mobil institut inves...
12       72668 v1 trade indic 200910 former yugoslav re...
13       62250 success educ reform lesson poland key me...
14       wps5990 polici research work paper 5990 green ...
15       report advisori servic oper middl east north a...
16       48573 volum 7 number 4 april 2009 best wish ha.
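
Stemming calls the stemmer once per token, and the same words recur across many documents. A minimal sketch of one way to reduce that work is to stem each distinct word only once and reuse the result. The helper `stem_documents` and the stand-in `toy_stem` are hypothetical; in the notebook, `st.stem` would be passed as the `stem` argument instead.

```python
def stem_documents(docs, stem):
    """Stem every token, calling `stem` only once per distinct word."""
    cache = {}
    stemmed_docs = []
    for doc in docs:
        stemmed = []
        for word in doc.split():
            if word not in cache:
                cache[word] = stem(word)
            stemmed.append(cache[word])
        stemmed_docs.append(" ".join(stemmed))
    return stemmed_docs, cache

def toy_stem(word):
    """Stand-in for PorterStemmer().stem: strips a trailing 's' only."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

docs = ["projects reports projects", "reports meetings"]
stemmed_docs, cache = stem_documents(docs, toy_stem)
print(stemmed_docs)
print(len(cache))
```

The cache holds one entry per distinct word, so `stem` runs three times here even though the corpus has five tokens.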

### 3. Save modified training file for model creation

In [15]:
train.to_csv("train_modif2.csv", index=False)