# <font color = 'dodgerblue'>**Tokenization approaches with spaCy - Real Dataset**

# <font color = 'dodgerblue'>**Install/Import Libraries**

In [None]:
get_ipython()

<google.colab._shell.Shell at 0x7d94ee442650>

The function `get_ipython()` returns a reference to the current IPython instance running in the environment. This instance is an IPython shell or an IPython kernel, depending on the context in which the code is executed.
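The environment check used in the cells below can be wrapped in a small helper. This is a minimal sketch: the function name `running_in_colab` is hypothetical (it does not appear in the notebook), and the `NameError` guard is an added assumption so that the check also works in a plain Python script, where `get_ipython` is not defined.

```python
def running_in_colab() -> bool:
    """Return True when the current IPython shell looks like a Google Colab shell."""
    try:
        # get_ipython is injected by IPython; it does not exist in plain Python
        shell = get_ipython()
    except NameError:
        return False
    # Colab's shell object prints as <google.colab._shell.Shell at 0x...>
    return 'google.colab' in str(shell)


print(running_in_colab())
```

The string test is the same one the notebook uses; the helper only adds the fallback for non-IPython runs.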

In [None]:
# install spacy
if 'google.colab' in str(get_ipython()):
    !pip install -U spacy -qq



In [None]:
# Import the Path module from the pathlib library
from pathlib import Path

# Import the pandas library for working with data frames
import pandas as pd

# Import the spacy library for natural language processing
import spacy

# Import the List type from the typing module to use in function annotations
from typing import List




In [None]:
# check spacy version
spacy.__version__


'3.7.2'

# <font color = 'dodgerblue'>**Specify Data Folders**

In [None]:
# Check if the code is running in a Colab environment
if 'google.colab' in str(get_ipython()):  # If the code is running in Colab
    # mount google drive
    from google.colab import drive
    drive.mount('/content/drive')

    # set the base path to a Google Drive folder
    base_path = '/content/drive/MyDrive/data'
else:
    # If the code is not running in Colab, set the base path to a local folder
    base_path = '/home/harpreet/Insync/google_drive_shaannoor/data'


# Convert the base path to a Path object
base_folder = Path(base_path)

# Define the data folder path
data_folder = base_folder/'datasets'


Mounted at /content/drive


Code Explanation:

- **Environment Check**: The code determines whether it's running in a Google Colab environment or locally on a machine. This distinction guides the subsequent steps.
- **Mounting Google Drive (if in Colab)**:
  - **Access to Files**: By mounting Google Drive, the code gains access to files and folders stored in the user's Google Drive account. This is essential for reading and writing data that's stored in the cloud.
  - **Collaboration and Portability**: Mounting Google Drive allows multiple users to work on shared files and ensures that the code can be run from any device with access to the user's Google Drive. It promotes collaboration and makes the code more portable.
  - **Persistent Storage**: Google Colab instances are temporary and reset after a period of inactivity. Mounting Google Drive provides a way to save and access data across different sessions, ensuring persistence.
- **Setting the Base Path**: Depending on the environment, the base path is set to a specific directory in Google Drive (if in Colab) or a local folder (if running locally).
- **Using Path Objects**: The code utilizes `Path` objects for handling file paths, enhancing cross-platform compatibility.
- **Defining the Data Folder Path**: The path to the `datasets` subdirectory is defined relative to the base folder, so every later file path can be built from `data_folder`.

Because the snippet handles both local and Colab environments, the rest of the notebook can build file paths from `data_folder` without caring where it runs, and data stored on Google Drive persists across Colab sessions.
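The way the `/` operator composes paths can be sketched as follows. The sketch assumes the Colab base path from the cell above and uses `PurePosixPath` rather than `Path` (an assumption made only so the printed result is the same on every operating system; the notebook itself uses `Path`).

```python
from pathlib import PurePosixPath

# Assumption: the Colab base path from the cell above
base_folder = PurePosixPath('/content/drive/MyDrive/data')

# The / operator joins path components without manual string concatenation
data_folder = base_folder / 'datasets'
train_file = data_folder / 'aclImdb' / 'train.csv'

print(train_file)         # /content/drive/MyDrive/data/datasets/aclImdb/train.csv
print(train_file.name)    # train.csv
print(train_file.parent)  # /content/drive/MyDrive/data/datasets/aclImdb
```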

# <font color = 'dodgerblue'>**Load csv file**

In [None]:
train_data = pd.read_csv(data_folder / 'aclImdb'/'train.csv', index_col=0)

In [None]:
# Printing shape of dataframe
train_data.shape


(25000, 2)

In [None]:
# display first five rows
train_data.head()


Unnamed: 0,Reviews,Labels
0,Ever wanted to know just how much Hollywood co...,1
1,The movie itself was ok for the kids. But I go...,1
2,You could stage a version of Charles Dickens' ...,1
3,this was a fantastic episode. i saw a clip fro...,1
4,and laugh out loud funny in many scenes.<br />...,1


In [None]:
# Randomly select 2500 observations
sampled_df = train_data.sample(n=2500, random_state=42)

# <font color = 'dodgerblue'>**Import Spacy Model**

In [None]:
# download the small English model (en_core_web_sm)
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
  Attempting uninstall

# <font color = 'dodgerblue'>**Compare tokenization approaches**

In [None]:
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')


## <font color = 'dodgerblue'>**Method 1: Typical approach using spaCy**

In [None]:
def tokenize(text: str) -> List[str]:
    """Tokenize the input text using spaCy.

    Args:
    text: The input text to be tokenized.

    Returns:
    A list of tokens.
    """
    # Apply the spaCy NLP model to the input text
    doc = nlp(text)
    # Extract the tokens from the spaCy doc and return as a list
    tokens = [token.text for token in doc]
    return tokens


In [None]:
%%time
sampled_df['tokens_method1'] = sampled_df['Reviews'].apply(tokenize)


CPU times: user 2min 2s, sys: 1.4 s, total: 2min 3s
Wall time: 2min 3s


In [None]:
sampled_df.head()


Unnamed: 0,Reviews,Labels,tokens_method1
6868,Enjoyed catching this film on very late late l...,1,"[Enjoyed, catching, this, film, on, very, late..."
24016,i checked this one out on DVD for a dollar so ...,0,"[i, checked, this, one, out, on, DVD, for, a, ..."
9668,One of the best films I have seen in the past ...,1,"[One, of, the, best, films, I, have, seen, in,..."
13640,"First of all, I would just like to say to ever...",0,"[First, of, all, ,, I, would, just, like, to, ..."
14018,This is not a good movie. Too preachy in parts...,0,"[This, is, not, a, good, movie, ., Too, preach..."


## <font color = 'dodgerblue'>**Method 2: Using nlp.pipe from Spacy**

In [None]:
import os
os.cpu_count()


8

In [None]:
%%time
# initialize an empty list to store tokens
tokens_method2 = []

# process multiple documents in parallel using the spaCy NLP library
for doc in nlp.pipe(sampled_df.Reviews.values, batch_size=1000, n_process=4):
    # extract text of each token in the document and create a list of tokens
    tokens = [token.text for token in doc]
    # add the list of tokens to the tokens_method2
    tokens_method2.append(tokens)

# add tokens_method2 to the sampled_df dataframe as a new column 'tokens_method2'
sampled_df['tokens_method2'] = tokens_method2


CPU times: user 6.36 s, sys: 788 ms, total: 7.14 s
Wall time: 42.8 s


This code tokenizes `sampled_df.Reviews.values` with the spaCy pipeline (`nlp`).

- The **`nlp.pipe` method processes the documents as a stream, in batches and across several processes**: `batch_size=1000` sets the number of texts per batch and `n_process=4` sets the number of worker processes.

- For each document the pipeline yields, the code creates a list of tokens, represented by the text of the spaCy token objects, using the list comprehension `[token.text for token in doc]`.

- Each list of tokens is appended to `tokens_method2`. Finally, `tokens_method2` is added to the `sampled_df` dataframe as a new column, `'tokens_method2'`.
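The relationship between calling `nlp` once per document and streaming documents through `nlp.pipe` can be sketched on a tiny example. The sketch makes two assumptions that are not in the notebook: it uses a blank English pipeline (`spacy.blank("en")`, tokenizer only) named `nlp_blank` so that it runs without downloading a model, and it uses three made-up sentences in place of the reviews. It also keeps the default single process; `n_process` can be passed to `pipe` in the same way as above.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) stands in for en_core_web_sm
nlp_blank = spacy.blank("en")

# Hypothetical mini-corpus standing in for sampled_df.Reviews.values
texts = [
    "This is not a good movie.",
    "One of the best films I have seen.",
    "i checked this one out on DVD.",
]

# One call per document, as in Method 1
tokens_one_by_one = [[token.text for token in nlp_blank(text)] for text in texts]

# Batched streaming, as in Method 2; docs are yielded in input order
tokens_batched = [
    [token.text for token in doc]
    for doc in nlp_blank.pipe(texts, batch_size=2)
]

print(tokens_batched[0])  # ['This', 'is', 'not', 'a', 'good', 'movie', '.']
print(tokens_one_by_one == tokens_batched)  # True
```

Both approaches produce the same tokens; only how the texts are fed to the pipeline differs.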






In [None]:
sampled_df.head()


Unnamed: 0,Reviews,Labels,tokens_method1,tokens_method2
6868,Enjoyed catching this film on very late late l...,1,"[Enjoyed, catching, this, film, on, very, late...","[Enjoyed, catching, this, film, on, very, late..."
24016,i checked this one out on DVD for a dollar so ...,0,"[i, checked, this, one, out, on, DVD, for, a, ...","[i, checked, this, one, out, on, DVD, for, a, ..."
9668,One of the best films I have seen in the past ...,1,"[One, of, the, best, films, I, have, seen, in,...","[One, of, the, best, films, I, have, seen, in,..."
13640,"First of all, I would just like to say to ever...",0,"[First, of, all, ,, I, would, just, like, to, ...","[First, of, all, ,, I, would, just, like, to, ..."
14018,This is not a good movie. Too preachy in parts...,0,"[This, is, not, a, good, movie, ., Too, preach...","[This, is, not, a, good, movie, ., Too, preach..."


## <font color = 'dodgerblue'>**Method 3: Using nlp.pipe and disabling components that are not required**

In [None]:
%%time

# initialize an empty list to store tokens
token_list_method3 = []

# temporarily disable the named pipes of spaCy NLP processing pipeline
disabled = nlp.select_pipes(
    disable=['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner'])

# process multiple documents in parallel using the spaCy NLP library
for doc in nlp.pipe(sampled_df.Reviews.values, batch_size=500, n_process=3):
    # extract text of each token in the document and create a list of tokens
    tokens = [token.text for token in doc]
    # add the list of tokens to the token_list_method3
    token_list_method3.append(tokens)

# add token_list_method3 to the sampled_df dataframe as a new column 'tokens_method3'
sampled_df['tokens_method3'] = token_list_method3

# restore the named pipes that were disabled
disabled.restore()


CPU times: user 5.51 s, sys: 270 ms, total: 5.78 s
Wall time: 6.81 s
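The object returned by `nlp.select_pipes` can also be used as a context manager, which restores the disabled components automatically instead of requiring an explicit `restore()` call. The sketch below is illustrative only: it assumes a blank English pipeline named `nlp_demo` with a single `sentencizer` component added (neither is part of the notebook), so that it runs without the `en_core_web_sm` model.

```python
import spacy

# Assumption: a small stand-in pipeline with one component besides the tokenizer
nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("sentencizer")

# Inside the with-block the component is disabled; tokenization still runs
with nlp_demo.select_pipes(disable=["sentencizer"]):
    names_inside = list(nlp_demo.pipe_names)
    tokens = [token.text for token in nlp_demo("Too preachy in parts.")]

# On leaving the block the component is active again
names_after = list(nlp_demo.pipe_names)

print(names_inside)  # []
print(names_after)   # ['sentencizer']
print(tokens)        # ['Too', 'preachy', 'in', 'parts', '.']
```

The tokenizer is not a pipeline component, so it keeps working when every named component is disabled, which is why Method 3 still returns tokens.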


In [None]:
train_data.head()


Unnamed: 0,Reviews,Labels
0,Ever wanted to know just how much Hollywood co...,1
1,The movie itself was ok for the kids. But I go...,1
2,You could stage a version of Charles Dickens' ...,1
3,this was a fantastic episode. i saw a clip fro...,1
4,and laugh out loud funny in many scenes.<br />...,1
