## <font color='darkblue'>Preface</font>
Here we build a model that extracts car types, models, and accessories from articles and posts.

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'

## <font color='darkblue'>Data Set</font>
We collected and labeled our data from the dumped data [here](https://drive.google.com/file/d/1LTyPKtafRQnQp-DslpjBM5QUWoxAtSTY/view?usp=sharing). Let's load the labeled data for further processing:

In [2]:
df = pd.read_csv("datas.csv")
df.head()

| Unnamed: 0 | kw | sentence |
|---|---|---|
| 0 | Tesla Model 3 | The Tesla\_h Model\_h 3\_h is an American four-door saloon car with rear- or -four-wheel drive, seating for five people at a pinch, and a touchscreen inside. |
| 1 | Tesla Model 3 | We think right now may be the best time to buy a Tesla\_h Model\_h 3\_h. |
| 2 | BMW 228 Gran Coupe | The BMW\_h 228\_h Gran\_h Coupe\_h is a sedan. |
| 3 | 228 Gran Coupe | The 2021 228\_h Gran\_h Coupe\_h starts at $35,700 (MSRP), with a destination charge of $995. |
| 4 | Toyota Avalon | The Toyota\_h Avalon\_h is a sedan. |
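
The labeling convention in the `sentence` column can be turned into per-token labels for training. The sketch below is only an illustration: the function name `parse_labeled_sentence` and the label names `H`, `A`, and `O` are hypothetical, and reading `\_h` as a car make/model tag and `\_a` as an accessory tag is an assumption inferred from the sample rows, not something the data set documents.

```python
def parse_labeled_sentence(sentence):
    # Hypothetical helper: split a labeled sentence into (token, label) pairs.
    # Assumed labels: 'H' for tokens ending in \_h, 'A' for \_a, 'O' for the rest.
    suffix_to_label = {"\\_h": "H", "\\_a": "A"}
    pairs = []
    for raw in sentence.split():
        # Drop trailing punctuation so "3\_h." is still recognised as tagged.
        token = raw.rstrip(",.!?")
        label = "O"
        for suffix, name in suffix_to_label.items():
            if token.endswith(suffix):
                token = token[: -len(suffix)]
                label = name
                break
        pairs.append((token, label))
    return pairs


example = "The Toyota\\_h Avalon\\_h is a sedan."
print(parse_labeled_sentence(example))
```

Unlike the display loop in the next cell, this sketch strips every trailing punctuation character rather than just one, and it does not handle parentheses or possessives.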


In [22]:
word_set = set()  # every labeled token, with its tag suffix removed
for ri, r in df.iterrows():
    words = r.sentence.split()
    for i in range(len(words)):
        # Strip one trailing punctuation mark, or a possessive 's
        if words[i][-1] in ",.!?":
            words[i] = words[i][:-1]
        elif words[i][-2:] == "'s":
            words[i] = words[i][:-2]

        # Strip surrounding parentheses
        if words[i].startswith('('):
            words[i] = words[i][1:]

        if words[i].endswith(')'):
            words[i] = words[i][:-1]

    new_words = []
    for w in words:
        # Tokens tagged with \_h are printed in bold blue
        if w.endswith(r'\_h'):
            w = w[:-3]
            word_set.add(w)
            w = f"{color.BOLD}{color.BLUE}{w}{color.END}"

        # Tokens tagged with \_a are printed in bold red
        if w.endswith(r'\_a'):
            w = w[:-3]
            word_set.add(w)
            w = f"{color.BOLD}{color.RED}{w}{color.END}"

        new_words.append(w)
    print(" ".join(new_words))

The [1m[94mTesla[0m [1m[94mModel[0m [1m[94m3[0m is an American four-door saloon car with rear- or -four-wheel drive seating for five people at a pinch and a touchscreen inside
We think right now may be the best time to buy a [1m[94mTesla[0m [1m[94mModel[0m [1m[94m3[0m
The [1m[94mBMW[0m [1m[94m228[0m [1m[94mGran[0m [1m[94mCoupe[0m is a sedan
The 2021 [1m[94m228[0m [1m[94mGran[0m [1m[94mCoupe[0m starts at $35,700 MSRP with a destination charge of $995
The [1m[94mToyota[0m [1m[94mAvalon[0m is a sedan
The [1m[94mToyota[0m [1m[94mAvalon[0m is a full-size car produced by Toyota
The [1m[94mBMW[0m [1m[94m327[0m is a medium-sized touring coupé produced by the Bavarian firm between 1937 and 1941
In the shark nose of the [1m[94mFerrari[0m [1m[94m812[0m [1m[94mGTS[0m is a 6.5-litre V12 engine free of turbos and able to rev to almost 9,000rpm as a result
Equip your [1m[94mFerrari[0m [1m[94m812[0m with the Novitec Aero Kit
I jus

## <font color='darkblue'>Fasttext - Word to vector</font>
Here we leverage the [**Fasttext**](https://github.com/facebookresearch/fastText) package to translate words into vectors for later ML training. Instead of training a new Fasttext model, we use the existing pretrained model [cc.en.300.bin.gz](https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz) listed [here](https://fasttext.cc/docs/en/crawl-vectors.html). Please click the link and download it:

In [8]:
#!pip install fasttext

In [24]:
#import fasttext.util

#fasttext.util.download_model('en', if_exists='ignore')  # English
#ft = fasttext.load_model('cc.en.300.bin')

After that, the code below loads the downloaded model:

In [25]:
#import fasttext
#import fasttext.util
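
One reason Fasttext is attractive for car names is that it builds a word's vector from its character n-grams, so it can still produce a vector for a token it never saw during training. The sketch below only illustrates how such n-grams are formed, with `<` and `>` as word boundary markers. The function name `char_ngrams` and the default lengths are assumptions for illustration; the real library additionally hashes the n-grams into buckets, which is not shown here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    # Illustrative only: list the character n-grams of a word,
    # using '<' and '>' as boundary markers around it.
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# Trigrams of a model name that a plain word-level vocabulary may not contain.
print(char_ngrams("228i", min_n=3, max_n=3))
```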

## <font color='darkblue'>Gensim - Word to vector</font>
Here we leverage [**Gensim**](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html) to translate words into vectors for later ML training.

In [11]:
import gensim.downloader as api

wv = api.load('word2vec-google-news-300')



In [23]:
unknown_word_set = set()
for w in word_set:
    # KeyedVectors raises KeyError for words missing from its vocabulary
    try:
        wv[w]
    except KeyError:
        unknown_word_set.add(w)
        
print(f"Total {len(unknown_word_set)} unknown word(s) found!")
print(unknown_word_set)

Total 24 unknown word(s) found!
{'jvc', 'entune', '2011', "Toyota's", 'HR-V', 'cr-v', '2019', '2013', '812', 'HondaLink', '228i', '2015', '2021', 'WF-1000xm3', '2017', 'Crosstrek', 'BT-Splitter', '2020', "Civic's", '911', '327', '2014', 'R-link', '228'}
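
Several of the unknown words above look recoverable with light normalization: possessives such as `Toyota's`, case variants such as `jvc`, and hyphenated names such as `HR-V`. The sketch below shows one possible fallback, not a recommended or complete solution. The function name `resolve_token` is hypothetical, and `vocab` is assumed to be anything that supports the `in` operator (a plain `set` here; the `wv` object loaded above should also work).

```python
import re


def resolve_token(token, vocab):
    # Hypothetical helper: return the vocabulary keys that can stand in for token.
    # An empty list means the token is still unknown after all fallbacks.
    base = token[:-2] if token.endswith("'s") else token

    # 1) Try the token itself, then possessive-free and case variants.
    for candidate in (token, base, base.lower(), base.upper(), base.capitalize()):
        if candidate in vocab:
            return [candidate]

    # 2) Split on hyphens or whitespace and keep the parts that are known.
    return [part for part in re.split(r"[-\s]", base) if part in vocab]


# Toy vocabulary standing in for the real word-vector model.
toy_vocab = {"Toyota", "Civic", "JVC", "HR", "V"}
for word in ["Toyota's", "jvc", "HR-V", "228i"]:
    print(word, "->", resolve_token(word, toy_vocab))
```

When several keys are returned, their vectors could be averaged to get a single vector for the token. Tokens that resolve to an empty list, such as model numbers, still need a separate strategy.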


## <font color='darkblue'>Supplement</font>
* [All Car Brands](https://www.carlogos.org/car-brands/)
* [FAQ - How to handle unknown word in Gensim?](https://stackoverflow.com/questions/54709178/how-to-handle-words-that-are-not-in-word2vecs-vocab-optimally)