# Ad-Hoc Pipeline Step: Remove Spanish Reviews

This section analyzes how many reviews in our corpus contain words that are, to any degree, written in a language other than English. The step was motivated by several unsupervised learning models that kept producing topics whose driving words were mostly Spanish.
This notebook removes all Spanish reviews and replaces the remaining foreign-language artifacts, such as the word excelente, in reviews that were identified as non-English.

**Note:** This step does not remove any reviews unless spaCy has identified them as Spanish (`es`) or Russian (`ru`).


## Table of Contents
<ol>
    <li>Packages</li>
    <li>Parquet Ingestion</li>
    <li>Language Detection Functionality and Execution</li>
    <li>Analysis and Remarks</li>
    <li>Candidates for Removal</li>
    <li>Replacement and Removal</li>
    <li>Re-tokenization</li>
    <li>Export</li>
</ol>

### Candidates for Removal
<table>
    <tr>
        <th>Language</th>
        <th>ISO</th>
    </tr>
    <tr>
        <td>
            Spanish
        </td>
        <td>
            es
        </td>
    </tr>
    <tr>
        <td>
            Russian
        </td>
        <td>
            ru
        </td>
    </tr>
</table>

### Candidates for Replacement

<table>
    <tr>
        <th>Word</th>
        <th>Replacement</th>
    </tr>
    <tr><td>Excelente</td><td>Excellent</td></tr>
    <tr><td>Producto</td><td>Product</td></tr>
    <tr><td>Recomendado</td><td>Recommend</td></tr>
</table>

### Packages

In [189]:
import pandas as pd 
import textacy
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.en.stop_words import STOP_WORDS as stopwords
import matplotlib.pyplot as plt
from imblearn.over_sampling import RandomOverSampler
from sklearn.decomposition import NMF
from collections import Counter, defaultdict
import warnings #turn off warnings
warnings.filterwarnings("ignore", category=UserWarning) 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from gensim.models import LdaModel
from gensim.corpora import Dictionary
import time
import spacy
from spacy.language import Language
from spacy_langdetect import LanguageDetector
import re 

### Parquet Ingestion

In [2]:
df = pd.read_parquet('prepared_text.parquet.gzip')
df = df.rename({"new_reviews":"text"},axis=1)
df.sample(3)

Unnamed: 0,Rating,text,lemmas,adjs_verbs,nouns,noun_phrases,adj_noun_phrases,entities,tokens
400667,3,screen protector side supposed phone screen cu...,"[screen, protector, side, suppose, phone, scre...","[suppose, strangeneed, right, correct]","[screen, protector, phone, screen, cut, incorr...","[screen_protector, screen_protector_side, prot...","[right_screen, right_screen_protector, front_c...",[],"[screen, protector, side, supposed, phone, scr..."
298095,1,terrible phones even worst carrier provider ma...,"[terrible, phone, even, bad, carrier, provider...","[terrible, bad, well, recommend]","[phone, carrier, provider, phone]",[carrier_provider],"[terrible_phone, bad_carrier, bad_carrier_prov...",[],"[terrible, phones, even, worst, carrier, provi..."
407983,5,quality phone great see used refurbishedits li...,"[quality, phone, great, see, use, refurbishedi...","[great, think, good]","[quality, phone, refurbishedit, newi, buythank]",[quality_phone],[good_buythank],[],"[quality, phone, great, see, used, refurbished..."


### Language Detection Functionality and Execution
Because spaCy does not treat ``LanguageDetector`` (from `spacy_langdetect`) as a built-in pipeline component, we have to wrap it in a factory function before adding it to the pipeline. We also need a helper that takes a text and returns the detected language code.

In [7]:
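def get_lang_detector(nlp, name):
    #factory function: spaCy calls this with (nlp, name) when the component is added
    return LanguageDetector()

def get_text_lang(text):
    #run the pipeline (uses the global nlp) and return the detected language code
    doc = nlp(text)
    return doc._.language['language']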

In [None]:
nlp = spacy.load("en_core_web_sm")
Language.factory("language_detector", func=get_lang_detector)
nlp.add_pipe('language_detector', last=True)

In [None]:
df['language'] = df['text'].map(get_text_lang)

### Analysis and Remarks

#### Distribution of Identified Languages

Oddly enough, spaCy tagged only about 8k reviews as Spanish (`es`), while labels such as Somali (`so`), Catalan (`ca`), and French (`fr`) each have more members:

In [119]:
langs = df.query("language!='en'")['language'].value_counts().sort_values(ascending=False)
print(langs)

so         12541
ca         11002
fr         10364
pt          8723
af          8721
ro          8395
es          8169
sl          4838
cy          3363
sk          3156
it          3139
da          2411
nl          2307
no          2119
pl          1915
et          1580
hr          1137
UNKNOWN     1110
tl           891
sv           801
fi           585
lv           539
sw           493
tr           395
de           367
id           324
cs           298
hu           165
lt           141
sq           129
vi           112
ru             2
Name: language, dtype: int64
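
The two-letter labels in this table are ISO 639-1 codes, and some of them are easy to misread. As a minimal sketch, a small lookup can make the printed labels self-explanatory. The names `ISO_639_1_NAMES` and `describe_language` are hypothetical helpers, not part of the notebook, and only the most frequent codes are listed:

```python
# ISO 639-1 codes for the most frequent labels in the table above.
ISO_639_1_NAMES = {
    "so": "Somali",
    "ca": "Catalan",
    "fr": "French",
    "pt": "Portuguese",
    "af": "Afrikaans",
    "ro": "Romanian",
    "es": "Spanish",
    "sl": "Slovenian",
    "cy": "Welsh",
    "sk": "Slovak",
    "it": "Italian",
}

def describe_language(code):
    """Return 'Name (code)' for known codes, or the raw code otherwise."""
    name = ISO_639_1_NAMES.get(code)
    return f"{name} ({code})" if name else code

print(describe_language("pt"))
print(describe_language("UNKNOWN"))
```

Codes missing from the lookup, including the detector's `UNKNOWN` label, are passed through unchanged.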


#### Observing Spanish Reviews
This lines up with expectations: most of these reviews are clearly in Spanish, even if a few of them are really English or mixed-language (Spanglish):

In [59]:
df[df['language']=='es'].sample(10)

Unnamed: 0,Rating,text,lemmas,adjs_verbs,nouns,noun_phrases,adj_noun_phrases,entities,tokens,language
317616,5,excelente gracias,"[excelente, gracias]",[],"[excelente, gracias]",[],[],[],"[excelente, gracias]",es
235210,5,buenisimo gracias,"[buenisimo, gracias]",[],"[buenisimo, gracias]",[],[],[],"[buenisimo, gracias]",es
94088,2,el equipo salio defectuoso el trackpad funcion...,"[el, equipo, salio, defectuoso, el, trackpad, ...",[siento],"[el, equipo, salio, defectuoso, el, trackpad, ...",[],[],[el_equipo_salio_defectuoso_el/ORG],"[el, equipo, salio, defectuoso, el, trackpad, ...",es
295040,1,came broken lol,"[come, break, lol]","[come, break]",[lol],[],[],[],"[came, broken, lol]",es
262982,4,buen telefono bastantes aplicaciones utiles pu...,"[buen, telefono, bastante, aplicacione, utile,...","[utile, rede, opcione]","[buen, telefono, bastante, aplicacione, puede,...","[bastante_aplicacione, puede_tener, twitter_in...",[],"[bastante/ORG, la_aplicaciones_de/ORG]","[buen, telefono, bastantes, aplicaciones, util...",es
89247,5,bueno,[bueno],[bueno],[],[],[],[],[bueno],es
335663,5,muy bueno,"[muy, bueno]",[],"[muy, bueno]",[],[],[muy/ORG],"[muy, bueno]",es
91010,2,yo soy de venezuela el teléfono llegó casi los...,"[yo, soy, de, venezuela, el, teléfono, llegó, ...",[do],"[yo, soy, de, venezuela, el, teléfono, llegó, ...",[tmovile_por],[],[yo_soy_de_venezuela_el_teléfono_llegó_casi_lo...,"[yo, soy, de, venezuela, el, teléfono, llegó, ...",es
85620,5,equipo nuevo la batería le dura todo el día fu...,"[equipo, nuevo, la, batería, le, dura, todo, e...","[equipo, tiempo]","[nuevo, la, batería, le, dura, todo, el, día, ...",[],[],"[equipo_nuevo_la_batería/PERSON, le_dura_todo/...","[equipo, nuevo, la, batería, le, dura, todo, e...",es
91442,5,excelente equipo lo recomiendo quienes desean ...,"[excelente, equipo, lo, recomiendo, quienes, d...",[],"[excelente, equipo, lo, recomiendo, quienes, d...","[por_mucho, por_mucho_tiempo, mucho_tiempo]",[],"[un/ORG, al_mercado/PERSON]","[excelente, equipo, lo, recomiendo, quienes, d...",es


#### Observing Other Foreign Reviews
A high-level look shows that many of these labels do not actually correspond to Somali or Catalan text, but rather to a "type" of review that is generally one word long. We can also use a small helper to see the most common words in each group:

In [152]:
def combine_tokens(tokens):
    #flatten a column of token lists into one flat list of tokens
    out = []
    for token_list in tokens:
        out.extend(token_list)
    return out

def overview(f):
    #print size, mean token count, five sampled reviews and the ten most common tokens
    #requires the 'token_count' column to exist on f
    print(f'size:{len(f)}')
    print('avg number of tokens per review:')
    print(f['token_count'].mean())
    n = f['tokens'].sample(5,replace=True)
    combined = combine_tokens(f['tokens'])
    print(n)
    print(Counter(combined).most_common(10))
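
As an aside, the flatten-then-count step can also be written with `itertools.chain`, which avoids building the intermediate list. This is only a sketch: `top_tokens` is a hypothetical helper, and the sample input is toy data rather than rows from the corpus.

```python
from collections import Counter
from itertools import chain

def top_tokens(token_lists, n=10):
    """Count tokens across all reviews without building an intermediate list."""
    return Counter(chain.from_iterable(token_lists)).most_common(n)

# Toy input standing in for a 'tokens' column (not real corpus data).
sample = [["good"], ["good", "thanks"], ["far", "good"]]
print(top_tokens(sample, n=1))
```

Passing `df['tokens']` instead of `sample` should work the same way, since the function only needs an iterable of token sequences.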

In [143]:
df['token_count'] = df['tokens'].map(len)
df.head(1)

Unnamed: 0,Rating,text,lemmas,adjs_verbs,nouns,noun_phrases,adj_noun_phrases,entities,tokens,language,token_count
0,5,feel lucky found used phone us used hard phone...,"[feel, lucky, find, use, phone, use, hard, pho...","[feel, lucky, find, hard, upgrade, sell, like,...","[phone, phone, line, son, year, thank, seller,...","[phone_line, thank_seller]","[hard_phone, hard_phone_line, old_one, recomme...",[],"[feel, lucky, found, used, phone, us, used, ha...",en,38


In [153]:
#SOMALI
somali = df[df['language']=='so']
overview(somali)

size:12541
avg number of tokens per review:
1.1800494378438722
178662    [good]
134232    [good]
85957     [good]
7329      [good]
27045     [good]
Name: tokens, dtype: object
[('good', 11623), ('bad', 456), ('thanks', 150), ('buy', 148), ('far', 124), ('thank', 113), ('deal', 85), ('job', 72), ('quality', 67), ('x', 43)]


In [154]:
#CATALAN
catalan = df[df['language']=='ca']
overview(catalan)

size:11002
avg number of tokens per review:
1.4510089074713688
343897      [perfect]
136140    [excellent]
371892      [perfect]
188135    [excellent]
145700    [excellent]
Name: tokens, dtype: object
[('excellent', 6138), ('perfect', 2391), ('exelente', 695), ('great', 364), ('quality', 264), ('good', 247), ('exelent', 232), ('excelent', 184), ('camera', 174), ('value', 167)]


In [155]:
#FRENCH
french = (df[df['language']=='fr'])
overview(french)

size:10364
avg number of tokens per review:
3.9173099189502123
36718                              [excellent, product]
232149                     [dead, 2, days, use, return]
81105     [apples, samsungs, lgs, favorite, phone, far]
219627       [excelent, phone, came, great, conditions]
61121                                [beautiful, phone]
Name: tokens, dtype: object
[('excellent', 5391), ('phone', 4527), ('product', 2331), ('price', 694), ('love', 663), ('recommend', 603), ('good', 574), ('seller', 563), ('excelent', 456), ('great', 447)]
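
The groups above suggest that very short reviews are where the detector is least reliable. One possible guard, sketched here under assumptions, is to accept a detected label only when the review has enough tokens and the detector reports a high score. The sketch assumes the detection result is a dict with `language` and `score` keys; `trusted_language`, `min_tokens`, and `min_score` are hypothetical names with illustrative, untuned thresholds.

```python
def trusted_language(detection, n_tokens, min_tokens=3, min_score=0.9):
    """Return the detected code only when the evidence looks sufficient.

    `detection` is assumed to be a dict with 'language' and 'score' keys;
    min_tokens and min_score are illustrative thresholds, not tuned values.
    """
    if n_tokens < min_tokens or detection["score"] < min_score:
        return "UNCERTAIN"
    return detection["language"]

# Hypothetical detector outputs for a one-word and a longer review.
print(trusted_language({"language": "so", "score": 0.99}, n_tokens=1))
print(trusted_language({"language": "es", "score": 0.99}, n_tokens=12))
```

With a guard like this, one-word reviews such as "good" would be left out of any language-based removal rather than being treated as Somali.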


<div style='color:red'>Portuguese (pt) is another language group that, while incorrectly tagged as Portuguese, seems to contain many words that are actually Spanish:</div>

In [156]:
#PORTUGUESE (ISO code 'pt')
portuguese = df[df['language']=='pt']
overview(portuguese)

size:8723
avg number of tokens per review:
1.2965722801788375
302313     [excelente]
105617    [excellente]
130888     [excelente]
234571     [excelente]
201346      [excelent]
Name: tokens, dtype: object
[('excelente', 5610), ('excelent', 2339), ('recomendado', 262), ('100', 222), ('e', 102), ('item', 94), ('described', 79), ('good', 74), ('sim', 59), ('producto', 45)]


It seems that excelente is the most commonly used word in reviews tagged as Portuguese (`pt`), and most of these reviews are literally just the word excelente. If we modify these reviews to replace excelente with excellent, we can probably avoid removing them.

#### Iterating over all languages

In [157]:
for lang in langs.index:
    if lang != "es":
        print(f"==========| {lang} |============")
        overview(df[df['language']==lang])

size:12541
avg number of tokens per review:
1.1800494378438722
98241             [good]
57404        [far, good]
114511            [good]
105827    [good, thanks]
404637            [good]
Name: tokens, dtype: object
[('good', 11623), ('bad', 456), ('thanks', 150), ('buy', 148), ('far', 124), ('thank', 113), ('deal', 85), ('job', 72), ('quality', 67), ('x', 43)]
size:11002
avg number of tokens per review:
1.4510089074713688
200082     [xcelente]
292367      [perfect]
175776     [exelente]
393770    [excellent]
234960    [excellent]
Name: tokens, dtype: object
[('excellent', 6138), ('perfect', 2391), ('exelente', 695), ('great', 364), ('quality', 264), ('good', 247), ('exelent', 232), ('excelent', 184), ('camera', 174), ('value', 167)]
size:10364
avg number of tokens per review:
3.9173099189502123
176464                                    [dont, buy]
341354                            [complaints, phone]
29924     [phone, came, perfect, condition, problems]
144150              [plug, phon

size:493
avg number of tokens per review:
1.2880324543610548
290415         [0k]
79281     [amazing]
332176          [k]
19886           [k]
367318    [amazing]
Name: tokens, dtype: object
[('amazing', 318), ('much', 67), ('like', 60), ('want', 17), ('k', 15), ('fake', 14), ('okay', 8), ('hi', 6), ('watch', 6), ('weak', 6)]
size:395
avg number of tokens per review:
2.0151898734177216
43012                   [yea]
224340                  [yes]
287635        [little, bulky]
68653     [yes, nice, mobile]
303147                  [yes]
Name: tokens, dtype: object
[('yes', 208), ('buy', 48), ('bulky', 27), ('bad', 23), ('nice', 17), ('mobile', 15), ('güzel', 14), ('ürün', 14), ('ama', 14), ('satıcı', 14)]
size:367
avg number of tokens per review:
1.7438692098092643
209667    [android, 44, sehr, schlecht]
17695                        [glitches]
5125                             [bien]
76379                [much, faster, 5s]
156814                           [bien]
Name: tokens, dtype: object
[(

### Candidates for Removal
<table>
    <tr>
        <th>Language</th>
        <th>ISO</th>
    </tr>
    <tr>
        <td>
            Spanish
        </td>
        <td>
            es
        </td>
    </tr>
    <tr>
        <td>
            Russian
        </td>
        <td>
            ru
        </td>
    </tr>
</table>

### Candidates for Replacement

<table>
    <tr>
        <th>Word</th>
        <th>Replacement</th>
    </tr>
    <tr><td>Excelente</td><td>Excellent</td></tr>
    <tr><td>Producto</td><td>Product</td></tr>
    <tr><td>Recomendado</td><td>Recommend</td></tr>
</table>

### Replacement and Removal

In [174]:
#remove Spanish (es) and Russian (ru) reviews
df = df.drop(df[df['language'].isin(['es', 'ru'])].index)

In [167]:
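#remap words: Spanish artifact -> English replacement
word_remap = {
    'excelente':'excellent',
    'producto':'product',
    'recomendado':'recommend'
}

def word_replace(text):
    #apply every replacement in turn
    #note: re.sub matches substrings, so e.g. 'productos' becomes 'products'
    out = text
    for k,v in word_remap.items():
        out = re.sub(k,v,out)
    return out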
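
If whole-word matching is preferred over substring matching, one option is a single compiled pattern with `\b` word boundaries. This is a sketch of an alternative, not the replacement used in this notebook. `WORD_REMAP` repeats the mapping so the block runs on its own, and `replace_whole_words` is a hypothetical name.

```python
import re

# Same mapping as word_remap, repeated so the sketch runs on its own.
WORD_REMAP = {
    "excelente": "excellent",
    "producto": "product",
    "recomendado": "recommend",
}

# One compiled alternation; \b keeps matches to whole words only.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, WORD_REMAP)) + r")\b")

def replace_whole_words(text):
    """Replace mapped words in a single pass, leaving longer words untouched."""
    return _PATTERN.sub(lambda m: WORD_REMAP[m.group(1)], text)

print(replace_whole_words("excelente producto"))
print(replace_whole_words("productos excelentes"))
```

The trade-off is that plural or inflected forms are left as they are, so which variant is better depends on whether those forms should also be anglicized.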

In [179]:
#replace foreign-language artifacts in every remaining review
df['new_reviews'] = df['text'].map(word_replace)
df.head(1)

Unnamed: 0,Rating,text,lemmas,adjs_verbs,nouns,noun_phrases,adj_noun_phrases,entities,tokens,language,token_count,new_reviews
0,5,feel lucky found used phone us used hard phone...,"[feel, lucky, find, use, phone, use, hard, pho...","[feel, lucky, find, hard, upgrade, sell, like,...","[phone, phone, line, son, year, thank, seller,...","[phone_line, thank_seller]","[hard_phone, hard_phone_line, old_one, recomme...",[],"[feel, lucky, found, used, phone, us, used, ha...",en,38,feel lucky found used phone us used hard phone...


In [194]:
#Sanity check
print(len(df[df['language']=='pt']))
df[df['language']=='pt'].head()

8723


Unnamed: 0,Rating,text,lemmas,adjs_verbs,nouns,noun_phrases,adj_noun_phrases,entities,tokens,language,token_count,new_reviews
40,5,excelente,[excellent],[excellent],[],[],[],[],[excelente],pt,1,excellent
41,5,excelente,[excellent],[excellent],[],[],[],[],[excelente],pt,1,excellent
58,5,excelente,[excellent],[excellent],[],[],[],[],[excelente],pt,1,excellent
60,5,excelente,[excellent],[excellent],[],[],[],[],[excelente],pt,1,excellent
65,5,excelente,[excellent],[excellent],[],[],[],[],[excelente],pt,1,excellent


### Re-tokenization
Because we replaced three different words in a significant portion of our corpus, we need to rerun the spaCy pipeline and regenerate the derived columns for the affected rows. The loop below overwrites those columns **only** for reviews whose text actually changed:

In [186]:
#apply this pipeline to our df to regenerate the derived columns:
def extract_nlp(doc):
    return {
        'lemmas' : extract_lemmas(
            doc,
            exclude_pos = ['PART', 'PUNCT', 'DET', 'PRON', 'SYM', 'SPACE'],
            filter_stops = False),
        'adjs_verbs' : extract_lemmas(doc, include_pos = ['ADJ', 'VERB']),
        'nouns' : extract_lemmas(doc, include_pos = ['NOUN', 'PROPN']),
        'noun_phrases' : extract_noun_phrases(doc, ['NOUN']),
        'adj_noun_phrases': extract_noun_phrases(doc, ['ADJ']),
        'entities' : extract_entities(doc, ['PERSON', 'ORG', 'GPE', 'LOC'])
    }

def extract_lemmas(doc, **kwargs):
    return [t.lemma_ for t in textacy.extract.words(doc, **kwargs)]

def extract_noun_phrases(doc, preceding_pos=['NOUN'], sep='_'):
    patterns = []
    for pos in preceding_pos:
        patterns.append(f"POS:{pos} POS:NOUN:+")
    spans = textacy.extract.matches.token_matches(doc, patterns=patterns)
    return [sep.join([t.lemma_ for t in s]) for s in spans]

def extract_entities(doc, include_types=None, sep='_'):
    ents = textacy.extract.entities(
        doc,
        include_types=include_types,
        exclude_types=None,
        drop_determiners=True,
        min_freq=1)
    return [sep.join([t.lemma_ for t in e])+'/'+e.label_ for e in ents]

In [193]:
import time
import tqdm.notebook as tq
start = time.localtime()
batch_size = 50

for i in tq.tqdm(range(0, len(df), batch_size),position=0,leave=True):
    docs = nlp.pipe(df['new_reviews'].iloc[i:i+batch_size])
    for j, doc in enumerate(docs):
        #Only overwrite the derived columns if the review text was modified
        if df['text'].iloc[i+j] != df['new_reviews'].iloc[i+j]:
            for col, values in extract_nlp(doc).items():
                #positional write to a single cell; avoids chained assignment
                df.iat[i+j, df.columns.get_loc(col)] = values

end = time.localtime()
print(start,end)

  0%|          | 0/8114 [00:00<?, ?it/s]

time.struct_time(tm_year=2022, tm_mon=6, tm_mday=25, tm_hour=15, tm_min=11, tm_sec=58, tm_wday=5, tm_yday=176, tm_isdst=1) time.struct_time(tm_year=2022, tm_mon=6, tm_mday=25, tm_hour=15, tm_min=30, tm_sec=28, tm_wday=5, tm_yday=176, tm_isdst=1)
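
The loop above still sends every review through `nlp.pipe` and only skips the write-back for unchanged rows. A possible way to cut the runtime, sketched here with plain lists, is to collect the positions of changed reviews first and batch only those. `changed_batches` is a hypothetical helper and the inputs are toy data standing in for `df['text']` and `df['new_reviews']`.

```python
def changed_batches(old_texts, new_texts, batch_size=50):
    """Yield lists of positions whose text differs, at most batch_size each."""
    changed = [i for i, (old, new) in enumerate(zip(old_texts, new_texts))
               if old != new]
    for start in range(0, len(changed), batch_size):
        yield changed[start:start + batch_size]

# Toy data standing in for df['text'] and df['new_reviews'].
old = ["excelente", "good phone", "producto bueno", "bad"]
new = ["excellent", "good phone", "product bueno", "bad"]
print(list(changed_batches(old, new, batch_size=1)))
```

Each yielded list of positions could then be used to select the matching `new_reviews` rows for `nlp.pipe` and to write the results back by position.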


### Export

In [199]:
#note: this writes to the same path that was read in Parquet Ingestion, overwriting the input file
df.drop(columns=['text','language']).to_parquet(
    'prepared_text.parquet.gzip',
    compression='gzip',
    index=False)