## language_filtering.ipynb
### Jenna Liebe, Setu Patel
This code imports the original_data_merged.csv.gz file (combined reviews and offering data with dictionary columns split out), and then isolates the comments (id, title, text) in a new file, review_comments.json.

From there, review_comments.json is read in by Spark, and the language is detected using the langid library for each comment. Only those comments with a language of 'en' (English) or 'null' are kept. A join is then performed with original_data, to recover all the dataset columns, but only keeping the filtered English comments.

Finally, this English-filtered dataset is printed to english_data.csv.gz for further processing.

* Note: Filtering by just 'en' removed too many comments, so those marked 'null' (as in, the filter couldn't confidently determine the language) were also kept. For the most part, these are all English. The few that aren't will be handled by the sentiment analysis model later on.

In [None]:
import pandas as pd
import langid
import re
from pyspark.sql import SparkSession, functions, types
from pyspark.sql.functions import col, when, coalesce

### Generate comments json file

In [3]:
ORIGINAL_DATA_PATH = "./../processed_data/original_data_merged.csv.gz"
REGEX_CLEANING_PATTERN = "[^0-9a-zA-Z\s]+"

In [4]:
original_data = pd.read_csv(ORIGINAL_DATA_PATH, low_memory=False, dtype={'title': str})

In [None]:
# Select Customer review comment title, text and id as unique identifier
all_review_comments_pd = original_data[["title", "text", "id"]]

#all_review_comments_pd

In [None]:
all_review_comments_pd.to_json("./../processed_data/review_comments.json",orient = 'records')

### Use Spark to read in review_comments.json

In [5]:
# set up Spark stuff
spark = SparkSession.builder \
    .appName("Translate Non-English Text") \
    .config('spark.driver.memory', '8g') \
    .config('spark.executor.memory', '8g') \
    .config('spark.network.timeout', '600000') \
    .config('spark.sql.broadcastTimeout', '600000') \
    .getOrCreate()

In [6]:
# Schema for review_comments.json
comments_schema = types.StructType([
    types.StructField('title', types.StringType()),
    types.StructField('text', types.StringType()),
    types.StructField('id', types.StringType())
])

In [7]:
spark_comments = spark.read.load("./../processed_data/review_comments.json", format="json", schema = comments_schema).cache()

### Language filtering

In [8]:
# language detection function
def detect_language(text):
    try:
        language, confidence = langid.classify(text)
        return language if confidence > 0.5 else None
    except: return None

In [9]:
detect_language_udf = functions.udf(detect_language, types.StringType())

In [10]:
english_comment_ids = spark_comments.withColumn('language', detect_language_udf(col('title'))) \
                                 .where((col('language') == 'en') | (col('language') == 'null')) \
                                 .select('id') \
                                 .cache()

### Convert to Pandas DF for printing

In [11]:
pandas_comment_ids = english_comment_ids.toPandas()

In [12]:
# don't need Spark anymore, so may as well stop it
spark.stop()

In [13]:
pandas_comment_ids.rename(columns = {'id': 'comment_id'}, inplace = True)

In [14]:
cols = list(original_data.columns)
cols[6] = 'comment_id'
original_data.columns = cols

In [15]:
pandas_comment_ids['comment_id'] = pandas_comment_ids['comment_id'].astype('int64')

In [16]:
english_data = original_data.merge(pandas_comment_ids, on = 'comment_id', how = 'inner')

In [18]:
english_data.to_csv('./../processed_data/english_data.csv.gz', index = False, compression = 'gzip')

### Final output: english_data.csv.gz 
#### (header + 147,034 rows, ~52MB compressed)