# Travel agency's reviews - data filtering

The goal of this assignment is to load the data from '../data/reviews.csv', filter out all non-English rows using the tools from the previous assignment, and store the result in '../data/en_reviews.csv'. Despite its extension, the file is tab-separated. It contains user reviews of [Kiwi.com](http://www.kiwi.com) (an airline ticket aggregator and travel agency) and has two columns:
* **rating** - integer value between 1 and 5.
* **text** - text of the review.

In [1]:
import pandas as pd

reviews = pd.read_csv('../data/reviews.csv', sep='\t', header=None, names=['rating', 'text'])
reviews[35:45]

| | rating | text |
|---|---|---|
| 35 | 4 | I bought the cheapest ticket i could find from... |
| 36 | 5 | Quick and prompt with solving any issues. The ... |
| 37 | 5 | Just bought tickets to go to our honeymoon in ... |
| 38 | 5 | C'est la première fois que j'utilise <http://K... |
| 39 | 1 | I was doing searches from San Diego to Cancun ... |
| 40 | 5 | I bought the cheapest tickets through this ser... |
| 41 | 5 | Such a pleasure to know that you will be prope... |
| 42 | 5 | I always use this website to look for flights ... |
| 43 | 2 | A startup that finds discount flight tickets '... |
| 44 | 5 | Excellent customer service, fast and kind. Wan... |


## English language detector

Use the English language detector code from the previous assignment. Notice that we normalize the input texts first. What happens if we don't normalize?
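
For reference, the scoring used by the detector in the next cell can be written out as follows. This is a restatement of what `getProbability` and `getPerplexity` compute, with $\mathrm{count}(\cdot)$ standing for the entries of `histogram`, $V$ for the number of distinct characters, and $t = c_1 c_2 \dots c_n$ for the normalized text (the symbols themselves are notation introduced here, not part of the original assignment):

$$P(c_{i+1} \mid c_i) = \frac{\mathrm{count}(c_i c_{i+1}) + 1}{\mathrm{count}(c_i) + V}$$

$$\mathrm{PP}(t) = 2^{-\frac{1}{n-1} \sum_{i=1}^{n-1} \log_2 P(c_{i+1} \mid c_i)}$$

A text is accepted as English when $\mathrm{PP}(t)$ does not exceed the threshold (14 by default). A bigram that never occurs in the training corpus has a probability of at most $1/V$, so texts with many unseen bigrams get a high perplexity.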

In [2]:
import numpy as np
from gensim.corpora.wikicorpus import filter_wiki, tokenize

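V = 0  # size of the vocabulary (number of distinct characters)
histogram = {}  # unigram and bigram frequencies

with open('../data/corpora/enlang1.txt') as fin:
    for doc in fin:
        # For a newline-terminated line this range keeps the newline out of
        # every bigram; note that the final character of the line is then
        # not counted as a unigram.
        for i in range(len(doc)-2):
            bigram = doc[i:i+2]
            unigram = doc[i]
            histogram[bigram] = histogram.get(bigram, 0) + 1
            histogram[unigram] = histogram.get(unigram, 0) + 1
    # Single-character keys are the unigrams; their number is the vocabulary size.
    V = len([unigram for unigram in histogram.keys() if len(unigram) == 1])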

#Compute the probability of a bigram using the Laplace smoothing
def getProbability(bigram):
    return 1.0*(histogram.get(bigram, 0) + 1) / \
                (histogram.get(bigram[0], 0) + V)

# Get the perplexity of text.
def getPerplexity(text):
    bigrams = [text[i:i+2] for i in range(len(text) - 1)]
    h = -sum(map(lambda x: np.log2(getProbability(x)), bigrams))
    return np.power(2, h/len(bigrams))

# Return True if 'text' is written in English, False otherwise.
def detectLang(text, threshold=14):
    # Normalize the text first: lowercase, strip wiki markup, split into word tokens.
    words = tokenize(filter_wiki(text.lower()))
    text = " ".join(words)
    # At least two characters are needed to form a bigram.
    if len(text) <= 1:
        return False
    return bool(getPerplexity(text) <= threshold)
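
One way to explore the normalization question is a toy experiment. The sketch below is not the notebook's detector: it trains the same kind of Laplace-smoothed character-bigram model on a single made-up lowercase sentence, and it replaces the gensim normalizer with a hypothetical regex-based stand-in (`toy_normalize`), so that it runs with the standard library only. All `toy_*` names are assumptions introduced for this illustration.

```python
import math
import re
from collections import Counter

# Toy training text, already lowercase and free of punctuation. This is an
# assumption made for the sketch; it is not the corpus used in the notebook.
TOY_CORPUS = "hello world and hello again to the whole wide world"

toy_unigrams = Counter(TOY_CORPUS)
toy_bigrams = Counter(TOY_CORPUS[i:i + 2] for i in range(len(TOY_CORPUS) - 1))
toy_vocab_size = len(toy_unigrams)  # number of distinct characters


def toy_probability(bigram):
    """Laplace-smoothed probability of the second character given the first."""
    return (toy_bigrams[bigram] + 1) / (toy_unigrams[bigram[0]] + toy_vocab_size)


def toy_perplexity(text):
    """Perplexity of `text` under the toy character-bigram model."""
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    entropy = -sum(math.log2(toy_probability(b)) for b in bigrams) / len(bigrams)
    return 2 ** entropy


def toy_normalize(text):
    """Stand-in normalizer: lowercase and keep runs of ASCII letters only."""
    return " ".join(re.findall(r"[a-z]+", text.lower()))


raw = "Hello, World!"
normalized = toy_normalize(raw)

print(normalized)
# Capital letters and punctuation form bigrams the model has never seen,
# so the raw text is compared against its normalized form here.
print(toy_perplexity(raw) > toy_perplexity(normalized))
```

On this toy data the raw string scores a higher perplexity than its normalized form, because every unseen bigram receives only the smoothing mass. Whether the same holds for the real detector depends on how the training corpus was prepared, which is worth checking before relying on a fixed threshold.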

Read the tab-separated file '../data/reviews.csv', filter out all non-English reviews, and store the cleaned data in '../data/en_reviews.csv'.

In [3]:
import csv

INPUT_FILE = '../data/reviews.csv'
OUTPUT_FILE = '../data/en_reviews.csv'

print('Number of lines in the original file: {}.'.format(len(reviews)))
counter = 0
# newline='' lets the csv module handle line endings itself, as its documentation recommends.
with open(INPUT_FILE, newline='') as inFile, \
        open(OUTPUT_FILE, "w", newline='') as outFile:
    fieldnames = ['rating', 'review']
    csvReader = csv.DictReader(inFile, fieldnames=fieldnames, delimiter='\t')
    csvWriter = csv.DictWriter(outFile, fieldnames=fieldnames, delimiter='\t')
    for row in csvReader:
        review = row['review']
        # Keep only the rows that the detector classifies as English.
        if detectLang(review, 14):
            csvWriter.writerow(row)
            counter += 1
print("Number of English reviews written: {}.".format(counter))

Number of lines in the original file: 9424.
Number of English reviews written: 7793.
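
Before running a filter like this on the full file, it can help to exercise the read-filter-write loop on a few in-memory rows. The following is a minimal sketch under stated assumptions: the sample rows are invented, `filter_rows` is a hypothetical helper, and `keep_row` is a trivial ASCII check standing in for `detectLang` (it is not a language detector). It also writes `\n` line endings for readability, unlike the csv module's default.

```python
import csv
import io

# Hypothetical sample in the same layout as reviews.csv: rating, tab, text.
SAMPLE = "5\tGreat service and fast refund\n4\tTrès bon service\n1\tNever again\n"


def keep_row(text):
    """Stand-in predicate for this sketch; use detectLang(text) in practice."""
    return text.isascii()


def filter_rows(in_file, out_file, predicate):
    """Copy tab-separated (rating, text) rows whose text passes `predicate`."""
    reader = csv.reader(in_file, delimiter="\t")
    writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
    kept = 0
    for rating, text in reader:
        if predicate(text):
            writer.writerow([rating, text])
            kept += 1
    return kept


source = io.StringIO(SAMPLE, newline="")
target = io.StringIO(newline="")
kept = filter_rows(source, target, keep_row)

print(kept)
print(target.getvalue())
```

Passing the predicate as an argument keeps the file handling separate from the language model, so the same loop can be reused with a different threshold or detector.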


In [4]:
en_reviews = pd.read_csv('../data/en_reviews.csv', sep='\t', header=None, names=['rating', 'text'])
en_reviews[35:45]

| | rating | text |
|---|---|---|
| 35 | 5 | I bought the cheapest tickets through this ser... |
| 36 | 5 | Such a pleasure to know that you will be prope... |
| 37 | 5 | I always use this website to look for flights ... |
| 38 | 2 | A startup that finds discount flight tickets '... |
| 39 | 5 | Excellent customer service, fast and kind. Wan... |
| 40 | 4 | very good service from Quan Costa to help me w... |
| 41 | 3 | .@Skypickercom Finds Cheap Flights 'Hidden' On... |
| 42 | 5 | I have a problem with my tickets skypicker don... |
| 43 | 4 | Even though it took a bit time untill an agent... |
| 44 | 5 | Today I had a great experience with one of Kiw... |
