# English level service

This notebook takes all the files from the `Sample_subs` folder and labels them using the model saved in the `english_level_model.pkl` file.

## Imports

Here are the imports and functions reused from the previous notebooks. It would be better to move them all into a separate `.py` file so I could import them without copy-pasting the code. Oy vey! No time for this.

In [17]:
# libraries to work with data
import pandas as pd
import numpy as np
import re

In [18]:
# libraries to work with files
import os
import pysrt # https://github.com/byroot/pysrt
import chardet # detect encoding
import codecs # decode files
import pickle

from pathlib import Path

In [6]:
# natural language toolkit
import nltk

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [7]:
# sklearn
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.linear_model import LogisticRegression

In [19]:
# global variables
UTF8_SUBFOLDER = r'/utf-8'

RND_STATE = 1337

PATH_MODEL=r'./english_level_model.pkl'
PATH_SUBS_SAMPLE=r'./Sample_subs'

In [11]:
# regex for text processing
HTML = re.compile(r'<.*?>')
TAG = re.compile(r'{.*?}')
COMMENTS = re.compile(r'[\(\[][A-Z ]+[\)\]]')
LETTERS = re.compile(r'[^a-zA-Z\'.,!? ]')
SPACES = re.compile(r'([ ])\1+')
DOTS = re.compile(r'[\.]+')

ONLY_WORDS = re.compile(r'[.,!?]|(?:\'[a-z]*)') # for BOW

In [12]:
# this function returns encoding of the file
def encoding_detector(file_path):
    # read the first 1000 bytes of the file
    # (the with-block closes the file, so no explicit close() is needed)
    with open(file_path, 'rb') as file:
        raw_data = file.read(1000)
    # detect the encoding of the file
    result = chardet.detect(raw_data)
    encoding = result['encoding']
    return encoding

In [13]:
# this function takes folder, gets all .srt files , encodes them to utf-8
# and puts them into /utf-8 subfolder
def folder_to_utf(folder_path):
    # create a utf-8 subfolder
    os.makedirs(os.path.join(folder_path, 'utf-8'), exist_ok=True)

    # loop through all files in the folder
    for filename in os.listdir(folder_path):
        if filename.endswith('.srt'):
            # define the file paths
            file_path = os.path.join(folder_path, filename)
            new_file_path = os.path.join(folder_path, 'utf-8', filename)

            # open the file and read its contents (the with-block closes it)
            with codecs.open(file_path, 'r', encoding=encoding_detector(file_path), errors='replace') as file:
                contents = file.read()

            # write the contents to a new file with UTF-8 encoding
            with codecs.open(new_file_path, 'w', encoding='UTF-8', errors='replace') as new_file:
                new_file.write(contents)
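
The re-encoding step can be tried on a single file without `chardet`. This is a minimal sketch using only the standard library, and it assumes the source encoding is already known. The helper name `reencode_to_utf8`, the sample subtitle text and the `cp1252` encoding are made up for illustration; they are not part of the notebook.

```python
import tempfile
from pathlib import Path


def reencode_to_utf8(src: Path, dst: Path, src_encoding: str) -> None:
    """Hypothetical helper: read src with a known encoding, write dst as UTF-8."""
    # errors='replace' mirrors the notebook: undecodable bytes become U+FFFD
    text = src.read_text(encoding=src_encoding, errors="replace")
    dst.write_text(text, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "sample.srt"
    dst = Path(tmp) / "utf-8" / "sample.srt"
    dst.parent.mkdir(exist_ok=True)

    # A made-up subtitle entry saved in a legacy single-byte encoding
    src.write_bytes("1\n00:00:01,000 --> 00:00:02,000\nCafé déjà vu\n".encode("cp1252"))

    reencode_to_utf8(src, dst, "cp1252")
    result = dst.read_text(encoding="utf-8")

print(result)
```

In `folder_to_utf` the encoding comes from `encoding_detector` instead of being passed in by hand.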

In [20]:
# re text cleaner
def re_clean_subs(subs):
    txt = re.sub(HTML, ' ', subs) # html to space
    txt = re.sub(TAG, ' ', txt) # tags to space
    txt = re.sub(COMMENTS, ' ', txt) # commentaries to space
    txt = re.sub(LETTERS, ' ', txt) # non-char to space
    txt = re.sub(SPACES, r'\1', txt) # repeated spaces to one space
    txt = re.sub(DOTS, r'.', txt)  # ellipsis to dot
    txt = txt.encode('ascii', 'ignore').decode() # clear non-ascii symbols
    txt = ".".join(txt.lower().split('.')[1:-1]) # drop the text before the first and after the last dot (ads)
    return txt
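
To see what each step of the cleaner does, here is a self-contained sketch that repeats the same regexes and runs them on one made-up string. The sample text is an assumption for illustration and is not taken from the subtitle files. Note the last step: it keeps only what lies between the first and the last dot, so very short inputs can come out empty.

```python
import re

# Same patterns as in the notebook
HTML = re.compile(r'<.*?>')
TAG = re.compile(r'{.*?}')
COMMENTS = re.compile(r'[\(\[][A-Z ]+[\)\]]')
LETTERS = re.compile(r'[^a-zA-Z\'.,!? ]')
SPACES = re.compile(r'([ ])\1+')
DOTS = re.compile(r'[\.]+')


def re_clean_subs(subs):
    txt = HTML.sub(' ', subs)        # html to space
    txt = TAG.sub(' ', txt)          # tags to space
    txt = COMMENTS.sub(' ', txt)     # commentaries like [LAUGHS] to space
    txt = LETTERS.sub(' ', txt)      # anything but letters and basic punctuation to space
    txt = SPACES.sub(r'\1', txt)     # repeated spaces to one space
    txt = DOTS.sub('.', txt)         # ellipsis to dot
    txt = txt.encode('ascii', 'ignore').decode()
    # keep only the text between the first and the last dot
    return ".".join(txt.lower().split('.')[1:-1])


# Made-up sample: an ad, two real lines, another ad
raw = ("Synced by someone. <i>Hello there...</i> [LAUGHS] "
       "How   are you? Fine, thanks. Download more subs")
cleaned = re_clean_subs(raw)
print(repr(cleaned))  # ' hello there. how are you? fine, thanks'
```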

In [28]:
# defining stop words list and creating a lemmatiser object
stop_words = stopwords.words('english')
lemmatizer = WordNetLemmatizer()

In [29]:
# this function executes text preprocessing
def text_preprocess_lem(text):
    text = re_clean_subs(text) # clean text using RE
    tokens = nltk.word_tokenize(text) # tokenisation
    text = [word for word in tokens if word not in stop_words] # stop words removal
    text = [lemmatizer.lemmatize(word) for word in text] # lemmatising tokens
    text = " ".join(text) # making text from the list
    return text

In [26]:
# this function extracts raw text from .srt file
# (returns NaN if the file cannot be read or parsed)
def srt_raw_text(file_path):
    try:
        subs = pysrt.open(file_path)
        return subs.text
    except Exception:
        return np.nan  # np.NaN was removed in NumPy 2.0

## Getting file labels

### Preparing files and DataFrame

Re-encode all `.srt` files in the directory as `utf-8`:

In [21]:
folder_to_utf(PATH_SUBS_SAMPLE)

In [22]:
# saving path to the folder with reencoded .srt
all_subs_path = Path(PATH_SUBS_SAMPLE+UTF8_SUBFOLDER)

In [25]:
# getting df with file names and file paths
# (glob once so both columns are built from the same, sorted list)
all_subs_files = sorted(all_subs_path.glob('*.srt'))
all_subs_list = [p.name for p in all_subs_files]
all_subs_df = pd.DataFrame({'file_name': all_subs_list,
                            'file_path': all_subs_files})
display(all_subs_df)
print(f'Found {all_subs_df.shape[0]} subtitle files')

Unnamed: 0,file_name,file_path
0,A_knights_tale(2001).srt,Sample_subs\utf-8\A_knights_tale(2001).srt
1,Beauty_and_the_beast(2017).srt,Sample_subs\utf-8\Beauty_and_the_beast(2017).srt
2,The_fault_in_our_stars(2014).srt,Sample_subs\utf-8\The_fault_in_our_stars(2014)...
3,The_usual_suspects(1995).srt,Sample_subs\utf-8\The_usual_suspects(1995).srt
4,While_You_Were_Sleeping(1995).srt,Sample_subs\utf-8\While_You_Were_Sleeping(1995...


Found 5 subtitle files


In [27]:
# adding text to dataframe
all_subs_df['raw_text'] = all_subs_df['file_path'].apply(srt_raw_text)
all_subs_df

Unnamed: 0,file_name,file_path,raw_text
0,A_knights_tale(2001).srt,Sample_subs\utf-8\A_knights_tale(2001).srt,Resync: Xenzai[NEF]\nRETAIL\nShould we help hi...
1,Beauty_and_the_beast(2017).srt,Sample_subs\utf-8\Beauty_and_the_beast(2017).srt,"Once upon a time,\nin the hidden heart of Fran..."
2,The_fault_in_our_stars(2014).srt,Sample_subs\utf-8\The_fault_in_our_stars(2014)...,<i>I believe we have a choice in this\nworld a...
3,The_usual_suspects(1995).srt,Sample_subs\utf-8\The_usual_suspects(1995).srt,"How you doing, Keaton?\nI can't feel my legs....."
4,While_You_Were_Sleeping(1995).srt,Sample_subs\utf-8\While_You_Were_Sleeping(1995...,"LUCY: <i>Okay, there are two things that</i>\n..."


In [31]:
all_subs_df['preprocessed_text'] = all_subs_df['raw_text'].apply(text_preprocess_lem)
df1 = all_subs_df.drop(columns=['file_path', 'raw_text'])
df1

Unnamed: 0,file_name,preprocessed_text
0,A_knights_tale(2001).srt,two minute forfeit . lend u . right . left . d...
1,Beauty_and_the_beast(2017).srt,handsome young prince . lived beautiful castle...
2,The_fault_in_our_stars(2014).srt,"one hand , sugarcoat . way movie romance novel..."
3,The_usual_suspects(1995).srt,keyser . ready ? time ? . started back new yor...
4,While_You_Were_Sleeping(1995).srt,"n't remember orange . first , remember dad . w..."


### Vectorising text

In [33]:
# leaving only words
df1['bow_text'] = df1['preprocessed_text'].apply(lambda x: re.sub(ONLY_WORDS, '', x))
df1['bow_text'] = df1['bow_text'].apply(lambda x: re.sub(r'\s+', ' ', x)) # removing multiple spaces
display(df1)

Unnamed: 0,file_name,preprocessed_text,bow_text
0,A_knights_tale(2001).srt,two minute forfeit . lend u . right . left . d...,two minute forfeit lend u right left dead eh t...
1,Beauty_and_the_beast(2017).srt,handsome young prince . lived beautiful castle...,handsome young prince lived beautiful castle p...
2,The_fault_in_our_stars(2014).srt,"one hand , sugarcoat . way movie romance novel...",one hand sugarcoat way movie romance novel bea...
3,The_usual_suspects(1995).srt,keyser . ready ? time ? . started back new yor...,keyser ready time started back new york six we...
4,While_You_Were_Sleeping(1995).srt,"n't remember orange . first , remember dad . w...",n remember orange first remember dad would get...
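
The last row shows a side effect worth knowing about: the tokeniser splits contractions, so `n't` reaches this step as a separate token, and `ONLY_WORDS` strips the `'t` part and leaves a lone `n`. A small sketch of the two substitutions, run on the visible start of that row:

```python
import re

# Same pattern as in the notebook: punctuation, or an apostrophe plus trailing letters
ONLY_WORDS = re.compile(r'[.,!?]|(?:\'[a-z]*)')

sample = "n't remember orange . first , remember dad ."

no_punct = ONLY_WORDS.sub('', sample)   # drop punctuation and "'t"
bow = re.sub(r'\s+', ' ', no_punct)     # collapse whitespace runs

print(repr(bow))  # 'n remember orange first remember dad '
```

The stray `n` is harmless for the next step, because the default `CountVectorizer` token pattern only keeps tokens of two or more characters.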


In [36]:
# getting features
X1 = df1['bow_text'].copy()

vectorizer = CountVectorizer()
vectorizer.fit(X1)

X1 = vectorizer.transform(X1)

display(X1)

<5x4104 sparse matrix of type '<class 'numpy.int64'>'
	with 6810 stored elements in Compressed Sparse Row format>

### Applying the model

In [39]:
# load the model; the with-block closes the file handle afterwards
with open(PATH_MODEL, 'rb') as model_file:
    model = pickle.load(model_file)

In [41]:
labels = model.predict(X1)
labels

ValueError: dimension mismatch
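
A likely cause, assuming the pickled model was trained on bag-of-words features from a different `CountVectorizer`: the vectoriser above is fitted on these five files only, so it builds its own 4104-word vocabulary, while the model expects the columns of the vocabulary it saw during training. One common way around this is to save the fitted vectoriser together with the model, for example as a single scikit-learn pipeline. The sketch below uses toy texts and made-up labels (`A2`, `C1`) purely for illustration; it is not the training code of this project.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data, made up for illustration
train_texts = [
    "simple short words",
    "easy little talk",
    "elaborate sophisticated vocabulary",
    "intricate philosophical discourse",
]
train_labels = ["A2", "A2", "C1", "C1"]

# Vectoriser and classifier are fitted and saved as one object
pipeline = make_pipeline(CountVectorizer(), LogisticRegression(random_state=1337))
pipeline.fit(train_texts, train_labels)

# Round-trip through pickle, as the notebook does with the model file
restored = pickle.loads(pickle.dumps(pipeline))

# New text goes in as raw strings; the stored vocabulary is reused,
# so unseen words are ignored instead of changing the number of columns
new_texts = ["completely unseen words plus simple talk"]
labels = restored.predict(new_texts)

# For comparison: a vectoriser fitted on the new text alone has a different width
n_model_features = restored.named_steps["logisticregression"].coef_.shape[1]
n_fresh_features = len(CountVectorizer().fit(new_texts).vocabulary_)
print(labels, n_model_features, n_fresh_features)
```

With a pipeline like this, the service notebook would call `predict` on the `bow_text` column directly and would not fit a new `CountVectorizer` at all.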

# Conclusion

That's all for this project so far!

I ran into a new problem that I don't yet know how to deal with: `model.predict` fails with `ValueError: dimension mismatch`. I'm out of time, so this is all I have so far.

Experience I gained from this project:
* worked with the file system and learned how to deal with different encodings
* gained some experience working with text processing libraries
* worked with Optuna and CatBoost; didn't get any good results so far