## Data Pre-Processing

### This notebook pre-processes the data and creates features using TfidfVectorizer and CountVectorizer.

In [53]:
import pandas as pd
import nltk as nk
import numpy as np
from sklearn.feature_extraction import stop_words
from collections import Counter
from nltk.stem import PorterStemmer,LancasterStemmer,SnowballStemmer
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

## Important Note: Change the paths of the files below to match your folder.

In [54]:
data1 = pd.read_csv("/Users/Purvank/Desktop/MACHINE LEARNING IIT/CS595_ML/Project/Additional_Files/ride_data1.csv")
data2 = pd.read_csv("/Users/Purvank/Desktop/MACHINE LEARNING IIT/CS595_ML/Project/Additional_Files/ride_data2.csv")
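One way to make the path change a single-line edit is to define the data folder once and build every file path from it. This is a minimal sketch, assuming a hypothetical `DATA_DIR` variable that is not part of the original notebook; the file names are the ones used in this notebook.

```python
from pathlib import Path

# Hypothetical: point this at your own copy of the Additional_Files folder.
DATA_DIR = Path("Additional_Files")

# File names taken from the cells in this notebook.
ride_data1_path = DATA_DIR / "ride_data1.csv"
ride_data2_path = DATA_DIR / "ride_data2.csv"
stanford_sw_path = DATA_DIR / "stopwords_stanford"

print(ride_data1_path.name)
```

`pd.read_csv` accepts `Path` objects, so a call such as `pd.read_csv(ride_data1_path)` could stand in for the hard-coded string.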

In [55]:
data = pd.concat([data1,data2])

In [56]:
data.head()

| Unnamed: 0 | ride_review | ride_rating | sentiment |
|---|---|---|---|
| 0 | I had just completed running the New York Mara... | 1.0 | 0 |
| 1 | My appointment time for auto repairs required ... | 1.0 | 0 |
| 2 | Whether I am using Uber for a ride service or ... | 1.0 | 0 |
| 3 | Why is it so hard for you to understand that i... | 1.0 | 0 |
| 4 | I was in South Beach, FL. I was staying at a m... | 1.0 | 0 |


## Data Preprocessing

### Remove stopwords, repeated words that share the same stem, and non-alphabetic tokens

## Important Note: Change the path of the file below to match your folder.

In [57]:
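# Use a separate name so the imported nltk `stopwords` module is not shadowed.
nltk_sw = stopwords.words('english')
with open("/Users/Purvank/Desktop/MACHINE LEARNING IIT/CS595_ML/Project/Additional_Files/stopwords_stanford") as f:
    stanford_sw = f.readlines()

In [58]:
# Strip the trailing newline from each line of the Stanford list.
stanford_sw = [x.strip() for x in stanford_sw]

In [59]:
# Union of the NLTK, Stanford and scikit-learn stopword lists.
sw = list(set(nltk_sw + stanford_sw + list(stop_words.ENGLISH_STOP_WORDS)))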
sw = [x.lower() for x in sw]
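The combined list can be read as a case-insensitive union of the three sources. Below is a minimal sketch with made-up stand-in lists (the real ones come from NLTK, the Stanford file and scikit-learn). It lowercases before taking the union, so entries that differ only in case collapse into one; `merge_stopwords` is a hypothetical helper, not part of the notebook.

```python
def merge_stopwords(*sources):
    """Return a set holding the lowercased union of several stopword lists."""
    merged = set()
    for source in sources:
        # strip() removes trailing newlines; blank lines are skipped
        merged.update(word.strip().lower() for word in source if word.strip())
    return merged

# Toy stand-ins for the three stopword sources used in this notebook.
toy_nltk = ["the", "a", "is"]
toy_stanford = ["The\n", "of\n", "\n"]
toy_sklearn = ["a", "an"]

toy_sw = merge_stopwords(toy_nltk, toy_stanford, toy_sklearn)
print(sorted(toy_sw))  # ['a', 'an', 'is', 'of', 'the']
```

Keeping the result as a set would also make the `not in sw` membership test in the cleaning step a constant-time lookup rather than a list scan.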

In [61]:
def data_preprocessing(text):
    """Clean each review: lowercase, drop stopwords and non-alphabetic tokens,
    and keep only the first word seen for each stem."""
    sb = SnowballStemmer("english")
    # Work on a copy so pandas does not raise SettingWithCopyWarning.
    text = text.copy()
    cleaned = []
    for review in text['ride_review']:
        tokens = word_tokenize(review)
        tokens = [x.lower() for x in tokens if x.lower() not in sw]
        tokens = [x for x in tokens if x.isalpha()]
        seen = set()
        result = []
        for item in tokens:
            stem = sb.stem(item)
            if stem not in seen:
                seen.add(stem)
                result.append(item)
        cleaned.append(' '.join(result))
    # Assign the whole column at once instead of row by row.
    text['ride_review'] = cleaned
    return text

In [62]:
pro_data = data_preprocessing(data)


In [63]:
pro_data['ride_review'].iloc[0]

'completed new requested let tell started right driver agreed pick drop locations called spoke know actually blocked pickup away race road got notification rider joining spoken shortly passenger went wrong moments continuing destination phone asked going claimed need confused far time clearly played dumb saying understand works middle strange wanting argue lucrative jerk peacefully exited stood private residence shivering cold degree rainy weather dark mile safe seek rest warmth safety sooner scrawling fare piece paper standing outside meet inkling conscientiousness nerve charge partial excited using taxis coming lyft connected minutes long warm dressed clothes weary expecting stranded pace wait main traffic heavy rode closures able realistic eternal happiness arrived delivered nicely day finding way file complaint sent couple hours replied wet blanket apology inconvenience offered credit make worse taken wrote corrected harm previous good reputation earned pay preserve relationships b
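The cleaning function keeps only the first word seen for each stem, which is an order-preserving de-duplication keyed on the stem. Here is a minimal sketch of that idea, using a deliberately crude `toy_stem` function as a stand-in for `SnowballStemmer` (an assumption made so the example has no dependencies; real stems will differ).

```python
def toy_stem(word):
    """Crude stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def dedupe_by_stem(tokens, stem=toy_stem):
    """Keep the first token seen for each stem, preserving order."""
    seen = set()
    kept = []
    for token in tokens:
        key = stem(token)
        if key not in seen:
            seen.add(key)
            kept.append(token)
    return kept

tokens = ["driver", "called", "drivers", "call", "calling", "ride"]
print(dedupe_by_stem(tokens))  # ['driver', 'called', 'ride']
```

A side effect of this design is that every word appears at most once per cleaned review, so term frequencies within a review are lost before vectorization.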

## Create TfIdf features using TfIdf Vectorizer

In [64]:
tf_idf = TfidfVectorizer(ngram_range=(1,3),min_df=5)
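`ngram_range=(1,3)` asks the vectorizer for unigrams, bigrams and trigrams, and `min_df=5` drops any term that appears in fewer than five documents. The following is a rough pure-Python sketch of those two ideas on a toy corpus. It splits on whitespace, which is simpler than scikit-learn's default tokenizer, and uses a threshold of 2 so that the toy data has survivors; the helper names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=3):
    """Yield every contiguous n-gram of length n_min..n_max as a string."""
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])

def vocabulary(docs, min_df=2):
    """Return the sorted n-grams that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        # set(): a term counts once per document, however often it repeats
        doc_freq.update(set(ngrams(doc.split())))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

docs = ["driver late airport", "driver late again", "airport pickup"]
print(vocabulary(docs))  # ['airport', 'driver', 'driver late', 'late']
```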

In [65]:
d1 = tf_idf.fit_transform(pro_data["ride_review"]).toarray()

In [66]:
n_data1 = pd.DataFrame(d1,columns=tf_idf.get_feature_names())
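Each cell of `n_data1` is a TF-IDF weight. Under scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`), the weight is the term count times `ln((1 + n) / (1 + df)) + 1`, after which each row is scaled to unit L2 length. Below is a small sketch of that arithmetic in plain Python, on a toy corpus with unigrams only; `tfidf_rows` is a hypothetical helper and is not how the library is implemented internally.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed-IDF, L2-normalized weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

vocab, rows = tfidf_rows(["late driver", "late pickup", "late late driver"])
for row in rows:
    print([round(w, 3) for w in row])
```

In the first toy document, `driver` receives a larger weight than `late` because `late` occurs in every document and therefore has the lowest IDF.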

In [67]:
sb = SnowballStemmer("english")

In [69]:
stem_tokens = [sb.stem(x) for x in tf_idf.get_feature_names()]
c = Counter()
c.update(stem_tokens)
d = dict(c)
deleted = []
print (type(tf_idf.get_feature_names()))

<class 'list'>


In [70]:
a = list(tf_idf.get_feature_names())

In [71]:
seen = set()
result = []
for item in a:
    if sb.stem(item) not in seen:
        seen.add(sb.stem(item))
        result.append(item)

In [72]:
# Take an explicit copy so that adding a column below does not trigger SettingWithCopyWarning.
n_data2 = n_data1[result].copy()

In [73]:
n_data2.shape

(1344, 1563)

In [74]:
n_data2['sentiment'] = pro_data['sentiment'].values


In [75]:
n_data2.shape

(1344, 1564)

In [76]:
#n_data1['sentiment'] = pro_data['sentiment'].values

## Important Note: Please change the path of the file as per your folder.

In [77]:
n_data2.to_csv("/Users/Purvank/Desktop/MACHINE LEARNING IIT/CS595_ML/Project/Additional_Files/tf_idf.csv",index=False)

## Create Binarized features using Count Vectorizer

In [78]:
cv = CountVectorizer(ngram_range=(1,3),min_df=5)

In [79]:
d2 = cv.fit_transform(pro_data['ride_review']).toarray()

In [80]:
n_data3 = pd.DataFrame(d2,columns=cv.get_feature_names())
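By default `CountVectorizer` stores how many times each term occurs in a document, while binarized features record only presence or absence (scikit-learn exposes this as the `binary=True` option). Because the cleaning step keeps each word at most once per review, the counts in this notebook should in practice already be 0 or 1. Here is a plain-Python sketch of the general difference on one toy review, with a hypothetical three-term vocabulary.

```python
from collections import Counter

vocab = ["driver", "late", "pickup"]  # hypothetical vocabulary
review = "late late driver"           # toy review with a repeated word

counts = Counter(review.split())
count_vector = [counts[term] for term in vocab]
binary_vector = [1 if counts[term] > 0 else 0 for term in vocab]

print(count_vector)   # [1, 2, 0]
print(binary_vector)  # [1, 1, 0]
```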

In [81]:
# Inspect the CountVectorizer vocabulary (not the TF-IDF one) in this section.
stem_tokens = [sb.stem(x) for x in cv.get_feature_names()]
c = Counter()
c.update(stem_tokens)
d = dict(c)
deleted = []
print(type(cv.get_feature_names()))

<class 'list'>


In [82]:
a = list(cv.get_feature_names())

In [83]:
seen = set()
result = []
for item in a:
    if sb.stem(item) not in seen:
        seen.add(sb.stem(item))
        result.append(item)

In [84]:
# Take an explicit copy so that adding a column below does not trigger SettingWithCopyWarning.
n_data4 = n_data3[result].copy()

In [85]:
n_data4['sentiment'] = pro_data['sentiment'].values


In [86]:
n_data4.shape

(1344, 1564)

## Important Note: Please change the path of the file as per your folder.

In [87]:
n_data4.to_csv("/Users/Purvank/Desktop/MACHINE LEARNING IIT/CS595_ML/Project/Additional_Files/count_vectorizer.csv",index=False)

In [88]:
## End of IPython Notebook