<a 
 href="https://colab.research.google.com/github/LearnPythonWithRune/MachineLearningWithPython/blob/main/colab/final/14 - Project - Word2Vec.ipynb"
 target="_parent">
<img 
 src="https://colab.research.google.com/assets/colab-badge.svg"
alt="Open In Colab"/>
</a>

# Project: Create a Word2Vec Model

### Step 1: Import libraries

In [1]:
import os
import nltk
from os import system
from nltk.corpus import stopwords

In [2]:
# Create local directories in Google Colab
!mkdir -p files/holmes

In [3]:
# Colab only: download every story from GitHub into files/holmes

REMOTE_DIRECTORY = "https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/jupyter/final/files/holmes/"

FILES = ["bachelor.txt", "clerk.txt", "face.txt", "problem.txt", "twisted.txt", "blaze.txt", "copper.txt", "gloria_scott.txt", "ritual.txt", "bohemia.txt", "coronet.txt", "interpreter.txt", "speckled.txt", "boscombe.txt", "crooked.txt", "league.txt", "squires.txt", "carbuncle.txt", "engineer.txt", "patient.txt", "treaty.txt"]

for filename in FILES:
    full_name = REMOTE_DIRECTORY + filename
    # Save each remote file under the same name in the local directory
    system("curl -o files/holmes/" + filename + " " + full_name)
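
File names typed by hand can pick up stray whitespace, so it can help to normalize them before building URLs. The following is a minimal sketch of that idea; the helper name `build_downloads` is hypothetical and not part of the original notebook.

```python
from urllib.parse import quote

REMOTE_DIRECTORY = (
    "https://raw.githubusercontent.com/LearnPythonWithRune/"
    "MachineLearningWithPython/main/jupyter/final/files/holmes/"
)

def build_downloads(filenames, remote=REMOTE_DIRECTORY, local_dir="files/holmes"):
    """Return (url, local_path) pairs, ignoring stray whitespace in names."""
    pairs = []
    for name in filenames:
        clean = name.strip()
        if not clean:
            continue  # skip entries that were only whitespace
        pairs.append((remote + quote(clean), f"{local_dir}/{clean}"))
    return pairs

pairs = build_downloads(["face.txt ", "league.txt"])
print(pairs[0][1])
```

Each pair could then be handed to `urllib.request.urlretrieve(url, local_path)` instead of shelling out to `curl`, assuming network access is available.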

### Step 2: Download stopwords
- Execute the following cell

In [4]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### Step 3: Read content and split into sentences
- Initialize an empty list called **all_sentences**
- For each filename in **'files/holmes'**:
    - HINT: Use **os.listdir(...)** ([docs](https://docs.python.org/3/library/os.html#os.listdir))
    - Open the file, read the content, convert it to lowercase, and apply **nltk.sent_tokenize** to it.
    - HINT: Use **lower()** on the content.

In [5]:
all_sentences = []

for filename in os.listdir('files/holmes'):
    with open(f'files/holmes/{filename}') as f:
        content = f.read()
        all_sentences += nltk.sent_tokenize(content.lower())
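
The read-lowercase-split loop can also be wrapped in a function, which makes it easy to try on a small directory first. This is a sketch only: `collect_sentences` and `naive_split` are hypothetical names, and `naive_split` is a crude stand-in for **nltk.sent_tokenize** so that the example runs without NLTK data.

```python
import tempfile
from pathlib import Path

def collect_sentences(directory, split_sentences):
    """Read every .txt file, lowercase it, and gather its sentences."""
    sentences = []
    for path in sorted(Path(directory).glob("*.txt")):
        content = path.read_text(encoding="utf-8").lower()
        sentences.extend(split_sentences(content))
    return sentences

def naive_split(text):
    # Stand-in splitter: breaks on periods only (much cruder than NLTK's Punkt)
    return [part.strip() for part in text.split(".") if part.strip()]

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("Holmes smiled. Watson wrote.", encoding="utf-8")
    Path(tmp, "b.txt").write_text("The game is afoot.", encoding="utf-8")
    demo = collect_sentences(tmp, naive_split)

print(demo)
```

In the notebook itself, `nltk.sent_tokenize` would be passed as `split_sentences` and `'files/holmes'` as the directory.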

### Step 4: Tokenize each sentence
- Tokenize every sentence in **all_sentences** with **nltk.word_tokenize** and assign the resulting list of token lists to **all_words**
    - HINT: Use list comprehension

In [6]:
all_words = [nltk.word_tokenize(sent) for sent in all_sentences]

In [7]:
all_words[0][:10]

['the',
 'adventure',
 'of',
 'the',
 'blue',
 'carbuncle',
 'i',
 'had',
 'called',
 'upon']

### Step 5: Remove all stop words
- Use **stopwords.words('english')** to filter all the words in **all_words**
    - HINT: iterate over the length of **all_words**, for each index use list comprehension

In [8]:
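# Build the stop word set once; reloading the list for every word is very slow
stop_words = set(stopwords.words('english'))

for i in range(len(all_words)):
    all_words[i] = [w for w in all_words[i] if w not in stop_words]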
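
To see what the filter does on a small input, here is a sketch with a tiny, made-up stop word set. `STOP_WORDS` and `remove_stop_words` are hypothetical; the real list comes from **stopwords.words('english')**.

```python
# Hypothetical miniature stop word set, for illustration only
STOP_WORDS = frozenset({"the", "of", "i", "had", "upon"})

def remove_stop_words(sentences, stop_words=STOP_WORDS):
    """Drop stop words from every tokenized sentence."""
    return [[w for w in sent if w not in stop_words] for sent in sentences]

sample = [
    ["the", "adventure", "of", "the", "blue", "carbuncle"],
    ["i", "had", "called", "upon"],
]
filtered = remove_stop_words(sample)
print(filtered)  # [['adventure', 'blue', 'carbuncle'], ['called']]
```

Using a set (or frozenset) keeps each membership test at roughly constant time, which matters when the corpus holds many thousands of tokens.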

### Step 6: Remove special characters
- Iterate over the items in **all_words** and remove every token that contains non-alphabetic characters (punctuation, digits)
    - HINT: Use **isalpha()** ([doc](https://docs.python.org/3/library/stdtypes.html#str.isalpha))

In [9]:
for i in range(len(all_words)):
    all_words[i] = [w for w in all_words[i] if w.isalpha()]
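
It is worth checking which tokens **isalpha()** actually rejects, since it drops more than bare punctuation. A small sketch with made-up tokens:

```python
# Sample tokens of the kinds a word tokenizer can produce (chosen for illustration)
tokens = ["holmes", "'s", "n't", "1895", ",", "well-known", "mr."]

kept = [t for t in tokens if t.isalpha()]
dropped = [t for t in tokens if not t.isalpha()]

print(kept)     # ['holmes']
print(dropped)  # ["'s", "n't", '1895', ',', 'well-known', 'mr.']
```

Hyphenated words and abbreviations ending in a period are removed along with punctuation and numbers, which is acceptable for this project but may not be for every corpus.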

### Step 7: Install gensim and python-Levenshtein
- Run the following cells

In [10]:
!pip install gensim

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [11]:
!pip install python-Levenshtein

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting python-Levenshtein
  Downloading python_Levenshtein-0.20.9-py3-none-any.whl (9.4 kB)
Collecting Levenshtein==0.20.9
  Downloading Levenshtein-0.20.9-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (174 kB)
Collecting rapidfuzz<3.0.0,>=2.3.0
  Downloading rapidfuzz-2.13.7-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.2 MB)
Installing collected packages: rapidfuzz, Levenshtein, python-Levenshtein
Successfully installed Levenshtein-0.20.9 python-Levenshtein-0.20.9 rapidfuzz-2.13.7


### Step 8: Import another library
- Run the following cell

In [12]:
from gensim.models import Word2Vec

### Step 9: Create a model
- Use **Word2Vec** on **all_words**
    - Use **min_count=2**, which ignores all words with a total frequency lower than 2.

In [13]:
model = Word2Vec(all_words, min_count=2)
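
The effect of the frequency cutoff can be previewed without training anything. The sketch below only mimics the counting rule described above (keep words seen at least `min_count` times); `surviving_vocabulary` is a hypothetical helper, not a gensim function, and it says nothing about how gensim builds its vocabulary internally.

```python
from collections import Counter
from itertools import chain

def surviving_vocabulary(sentences, min_count=2):
    """Return the words whose total frequency is at least min_count."""
    counts = Counter(chain.from_iterable(sentences))
    return {word for word, count in counts.items() if count >= min_count}

sample = [
    ["holmes", "watson", "pipe"],
    ["holmes", "watson"],
    ["holmes", "violin"],
]
vocab = surviving_vocabulary(sample, min_count=2)
print(sorted(vocab))  # ['holmes', 'watson']
```

Words that appear only once ("pipe", "violin") drop out, so asking the trained model about them would fail because they have no vector.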

### Step 10: Find distances
- Try to run **model.wv.distance('holmes', 'watson')**
- Try to run **model.wv.distance('holmes', 'water')**

In [14]:
model.wv.distance('holmes', 'watson')

0.00013363361358642578

In [15]:
model.wv.distance('holmes', 'water')

0.00038683414459228516
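
To my understanding of the gensim API, **distance** returns the cosine distance, that is, 1 minus the cosine similarity of the two word vectors, so values near 0 mean the vectors point in almost the same direction. A plain-Python sketch of that formula, using made-up 2-dimensional vectors rather than real embeddings:

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])   # close to 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])       # 1.0
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])        # 2.0

print(same_direction, orthogonal, opposite)
```

Both distances printed in the cells above are very close to 0, so on this small corpus the model barely separates 'watson' from 'water' relative to 'holmes'.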

### Step 11: Find the closest words
- Get all the words
    - HINT: **words = model.wv.index2entity** (gensim 4.0 and later renamed this attribute to **index_to_key**)
- Implement a function **closest_words(word)**
    - HINT: **distances = {w: model.wv.distance(word, w) for w in words}**
    - HINT: **sorted(distances, key=lambda w: distances[w])[:15]**

In [16]:
words = model.wv.index2entity  # on gensim >= 4.0 use model.wv.index_to_key

def closest_words(word):
    # Map every vocabulary word to its distance from `word`, then keep the 15 nearest
    distances = {w: model.wv.distance(word, w) for w in words}
    return sorted(distances, key=lambda w: distances[w])[:15]

In [17]:
closest_words('holmes')

['holmes',
 'well',
 'rather',
 'friend',
 'round',
 'may',
 'one',
 'man',
 'way',
 'house',
 'might',
 'upon',
 'little',
 'time',
 'took']
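
The query word itself always comes first in the result, because its distance to itself is 0. The ranking step can be sketched on its own with a toy distance table; the numbers below are invented for illustration and `closest` is a hypothetical helper.

```python
import heapq

def closest(distances, n=3, exclude=None):
    """Return the n words with the smallest distance, optionally skipping one word."""
    candidates = {w: d for w, d in distances.items() if w != exclude}
    return heapq.nsmallest(n, candidates, key=candidates.get)

# Made-up distances from 'holmes' to a few words
toy = {"holmes": 0.0, "watson": 0.12, "friend": 0.2, "water": 0.9, "house": 0.5}

with_query = closest(toy, n=3)
without_query = closest(toy, n=3, exclude="holmes")
print(with_query)     # ['holmes', 'watson', 'friend']
print(without_query)  # ['watson', 'friend', 'house']
```

gensim also provides `model.wv.most_similar('holmes', topn=15)`, which ranks by cosine similarity and leaves the query word out of the result.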