# Filtering User Reviews by Language

In this short notebook, I use the `langdetect` library to detect the language of each review in my data and keep only the English ones. The bulk of my overall project is sentiment analysis of English text, so non-English reviews need to be filtered out first.

Here we go~

In [17]:
from langdetect import detect
import json

## Detecting language in Rotten Tomatoes user reviews

We'll start with the Rotten Tomatoes (RT) user reviews of _The Host_.

In [20]:
# Using a context manager so the file is closed after reading
with open('../data/user_reviews/tomatoes_host_user.json', encoding='UTF-8') as f:
    rt_host = json.load(f)

In [21]:
len(rt_host)

5216

In [22]:
rt_host[10]

{'date': 'Dec 19, 2009',
 'score': 3,
 'text': "Some good CGI, but playing it for laughs didn't really work, not when children are being eaten."}

In [23]:
detect(rt_host[0]['text'])

'en'

In [32]:
# Creating a new list with just the English-language reviews:

rt_host_revs = []

for rev in rt_host:
    try:
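        lang = detect(rev['text'])
        if lang == 'en':
            rt_host_revs.append(rev)
    except Exception:
        # detect() raises when the text has no usable language features;
        # such reviews are reported and left out of the cleaned list
        print('error with', rev)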

error with {'date': 'Aug 20, 2009', 'score': 3.5, 'text': 'http://www.stripes.com/article.asp?section=104&article=25717&archive=true'}
error with {'date': 'Apr 12, 2009', 'score': 4, 'text': 'http://cineptimoarte.blogspot.mx/'}
error with {'date': 'Feb 16, 2008', 'score': 4.5, 'text': 'http://www.lukechu.com/serendipity/index.php?/archives/442-The-Host.html'}


In [35]:
len(rt_host_revs)

4855
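Since the same filtering loop is repeated for every dataset in this notebook, it could be wrapped in one reusable function. The following is only a minimal sketch, not the code used above: the name `keep_language` and its parameters are hypothetical, and the detector is passed in as an argument so the function can be tried with any callable that maps text to a language code.

```python
def keep_language(reviews, detect_fn, target="en"):
    """Split reviews into those detected as `target` and those the detector failed on.

    `detect_fn` is any callable that takes a string and returns a language code
    (for example langdetect's `detect`). Hypothetical helper, for illustration only.
    """
    kept, failed = [], []
    for rev in reviews:
        try:
            lang = detect_fn(rev["text"])
        except Exception:
            # The detector could not process this text; keep it aside for inspection
            failed.append(rev)
            continue
        if lang == target:
            kept.append(rev)
    return kept, failed
```

With `langdetect` imported as above, a call such as `rt_host_revs, rt_host_failed = keep_language(rt_host, detect)` would mirror the loop in the previous cell while also collecting the failed reviews in a list instead of only printing them.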

In [36]:
# Saving the cleaned English review list back to the same JSON file.
# Note: this overwrites the original, unfiltered file at this path.
with open('../data/user_reviews/tomatoes_host_user.json','w', encoding='UTF-8') as out:
    json.dump(rt_host_revs, out)
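The cell above writes the filtered list back to the path it was loaded from, so the unfiltered data is replaced. If keeping the raw scrape matters, one alternative is to write to a sibling file instead. This is a sketch under assumptions: the helper `save_filtered` and the `_en` suffix are hypothetical and are not names used elsewhere in this project.

```python
import json
from pathlib import Path


def save_filtered(reviews, source_path, suffix="_en"):
    """Write `reviews` next to `source_path`, adding `suffix` before the extension.

    Hypothetical helper: e.g. 'tomatoes_host_user.json' -> 'tomatoes_host_user_en.json'.
    Returns the path that was written.
    """
    src = Path(source_path)
    out_path = src.with_name(src.stem + suffix + src.suffix)
    with open(out_path, "w", encoding="utf-8") as out:
        # ensure_ascii=False keeps any non-ASCII characters readable in the file
        json.dump(reviews, out, ensure_ascii=False)
    return out_path
```

The trade-off is that later notebooks would then need to load the suffixed file rather than the original filename.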

Now we'll do the same thing with the RT user reviews of _Parasite_ to take out the non-English reviews.

In [37]:
with open('../data/user_reviews/tomatoes_parasite_user.json', encoding='UTF-8') as f:
    rt_parasite = json.load(f)

In [38]:
len(rt_parasite)

4672

In [39]:
rt_parasite[0]

{'date': 'May 1, 2021',
 'score': 5,
 'text': "Watch it!!!!!!! I don't need to say a word, just trust me."}

In [40]:
detect(rt_parasite[0]['text'])

'en'

In [41]:
# Another new list with just the English-language reviews:

rt_parasite_revs = []

for rev in rt_parasite:
    try:
        lang = detect(rev['text'])
        if lang == 'en':
            rt_parasite_revs.append(rev)
    except Exception:
        # reviews that detect() cannot process are reported and skipped
        print('error with', rev)

error with {'date': 'Apr 13, 2020', 'text': '......................................', 'score': 5}
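The reviews that raised errors so far are either a bare URL or a run of punctuation, presumably because they give the detector no words to work with. A simple pre-check could flag such texts before calling `detect`. This is a rough sketch under that assumption: `has_detectable_text` and the URL pattern are hypothetical and only approximate what the library does internally.

```python
import re

# Rough pattern for web addresses; an assumption for this sketch, not langdetect's own rule
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def has_detectable_text(text):
    """Return True if the text still contains at least one letter once URLs are removed."""
    without_urls = URL_RE.sub(" ", text)
    return any(ch.isalpha() for ch in without_urls)
```

Reviews for which this returns `False` could be skipped up front, which would leave the `except` branch for genuinely unexpected failures.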


In [42]:
len(rt_parasite_revs)

4409

In [43]:
# Saving the cleaned English review list back to the same JSON file (overwrites it):
with open('../data/user_reviews/tomatoes_parasite_user.json','w', encoding='UTF-8') as out:
    out.write(json.dumps(rt_parasite_revs))

## Detecting language in Metacritic user reviews

Now we'll repeat the steps above on the Metacritic user reviews to get a cleaned list of English reviews. We'll start with the reviews of _The Host_ and then go from there!

In [44]:
with open('../data/user_reviews/metacritic_host_user.json', encoding='UTF-8') as f:
    meta_host = json.load(f)

In [45]:
len(meta_host)

73

In [52]:
meta_host[0]

{'date': 'Oct  5, 2009',
 'score': '5',
 'text': 'I guess the movie it self wasnt all that bad....but boy was i let down after hearing of all the "great reviews" this film was given. one review said this was "On Par with Jaws" You kidding me!!!!! I was expecting tense scary moments butI guess the movie it self wasnt all that bad....but boy was i let down after hearing of all the "great reviews" this film was given. one review said this was "On Par with Jaws" You kidding me!!!!! I was expecting tense scary moments but instead all i found my self doing when the monster appeared was casually say "o i wonder if the thing is going to take someone...." in the end if you want an okay monster movie to watch go out and rent it. Jaws i felt terrified to swim in the ocean for awhile The Host... well I think ill still brave tubing down rivers without any fear.… Expand'}

In [49]:
detect(meta_host[0]['text'])

'en'

In [53]:
# Again, creating a new list with just the English-language reviews:

meta_host_revs = []

for rev in meta_host:
    try:
        lang = detect(rev['text'])
        if lang == 'en':
            meta_host_revs.append(rev)
    except Exception:
        # reviews that detect() cannot process are reported and skipped
        print('error with', rev)

In [54]:
len(meta_host_revs)

71

In [55]:
# Saving the cleaned English review list back to the same JSON file (overwrites it):
with open('../data/user_reviews/metacritic_host_user.json','w', encoding='UTF-8') as out:
    out.write(json.dumps(meta_host_revs))

Same thing now with the Metacritic user reviews of _Parasite_.

In [56]:
with open('../data/user_reviews/metacritic_parasite_user.json', encoding='UTF-8') as f:
    meta_parasite = json.load(f)

In [57]:
len(meta_parasite)

326

In [58]:
meta_parasite[0]

{'date': 'Apr 30, 2021',
 'score': '10',
 'text': 'Director {Bong Joon-Ho} has done it again with a captivating and extraordinary film. Leaving me questioning if the real world is actually like this! Not only is the climax risen to a new level but the jaw dropping ending left me shook.'}

In [59]:
detect(meta_parasite[0]['text'])

'en'

In [60]:
# Creating a new list with just the English-language reviews:

meta_parasite_revs = []

for rev in meta_parasite:
    try:
        lang = detect(rev['text'])
        if lang == 'en':
            meta_parasite_revs.append(rev)
    except Exception:
        # reviews that detect() cannot process are reported and skipped
        print('error with', rev)

In [61]:
len(meta_parasite_revs)

290

In [62]:
# Saving the cleaned English review list back to the same JSON file (overwrites it):
with open('../data/user_reviews/metacritic_parasite_user.json','w', encoding='UTF-8') as out:
    out.write(json.dumps(meta_parasite_revs))

## Conclusion

In this notebook, I cleaned all the user reviews from both Metacritic and Rotten Tomatoes. I filtered out non-English reviews because my analysis for this project is limited to the English language. After using the `langdetect` library to do the filtering, I saved the four cleaned JSON files, overwriting the originals: user reviews of _Parasite_ from both sites, and user reviews of _The Host_ from both sites.
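To put the effect of the filtering in one place, the before and after counts reported by the `len()` cells above can be compared directly. This is a small sketch using only those numbers; the dictionary labels are my own, and the "removed" figure includes both reviews detected as non-English and the few that raised errors.

```python
# (before, after) review counts copied from the len() outputs in this notebook
counts = {
    "Rotten Tomatoes / The Host": (5216, 4855),
    "Rotten Tomatoes / Parasite": (4672, 4409),
    "Metacritic / The Host": (73, 71),
    "Metacritic / Parasite": (326, 290),
}

# For each dataset: how many reviews were dropped and what share was kept
summary = {
    name: (before - after, after / before)
    for name, (before, after) in counts.items()
}

for name, (removed, kept_share) in summary.items():
    print(f"{name}: removed {removed}, kept {kept_share:.1%}")

total_removed = sum(removed for removed, _ in summary.values())
print("total removed:", total_removed)
```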

With that, I can move on to the analysis of these movie reviews! Please refer to the notebooks whose titles begin with `analysis` to see the work I completed.

Thank you!