# English level service

This notebook takes all the files from the `Sample_subs` folder and labels each one with an English level, using the pipeline saved in the `english_level_model.pkl` file.

## Imports

Here are the imports and functions reused from previous notebooks. The rest are in the `english_level_functions.py` file.

In [1]:
# libraries to work with data
import pandas as pd
import numpy as np
import re

In [2]:
# libraries to work with files
import joblib

from pathlib import Path

In [3]:
# sklearn
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.linear_model import LogisticRegression

In [4]:
# global variables
UTF8_SUBFOLDER = r'/utf-8'

RND_STATE = 1337

PATH_MODEL=r'./english_level_model.pkl'
PATH_SUBS_SAMPLE=r'./Sample_subs'

In [5]:
# regex for text processing
ONLY_WORDS = re.compile(r'[.,!?]|(?:\'[a-z]*)') # matches punctuation and apostrophe suffixes, which are removed before BOW

### Imports from `english_level_functions.py`

I saved the functions I used in this project in the `english_level_functions.py` file. I import them here.

In [6]:
# my functions
from english_level_functions import encoding_detector, folder_to_utf
from english_level_functions import srt_raw_text, srt_full_subs
from english_level_functions import re_clean_subs, text_preprocess_lem

## Getting file labels

### Preparing files and DataFrame

Re-encode all `.srt` files in the directory to `utf-8`:

In [7]:
folder_to_utf(PATH_SUBS_SAMPLE)
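
`folder_to_utf` lives in `english_level_functions.py` and is not shown in this notebook. As a rough idea of what such a step could look like, here is a minimal sketch using only the standard library. The function name `reencode_folder_sketch`, the list of candidate encodings, and the try-in-order strategy are all my assumptions for illustration; the real helper may work differently (for example, by calling `encoding_detector`).

```python
from pathlib import Path

# Hypothetical candidate encodings, tried in order.
# 'latin-1' never fails to decode, so it acts as a last resort.
CANDIDATE_ENCODINGS = ('utf-8-sig', 'cp1252', 'latin-1')


def reencode_folder_sketch(folder, subfolder='utf-8', encodings=CANDIDATE_ENCODINGS):
    """Copy every .srt file in `folder` into `folder/subfolder`, encoded as UTF-8."""
    src = Path(folder)
    dst = src / subfolder
    dst.mkdir(exist_ok=True)

    written = []
    for srt in sorted(src.glob('*.srt')):
        raw = srt.read_bytes()
        for enc in encodings:
            try:
                text = raw.decode(enc)
                break
            except UnicodeDecodeError:
                continue
        else:
            # reached only if no candidate encoding could decode the bytes
            continue
        out = dst / srt.name
        # write bytes directly so line endings are left untouched
        out.write_bytes(text.encode('utf-8'))
        written.append(out)
    return written
```

Trying encodings in a fixed order is crude: a file can decode without errors under the wrong encoding, so a statistical detector is usually the safer choice for mixed collections.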

In [8]:
# saving path to the folder with reencoded .srt
all_subs_path = Path(PATH_SUBS_SAMPLE+UTF8_SUBFOLDER)

In [9]:
# getting df with file names and file paths
# (glob once and sort, so names and paths are guaranteed to line up)
all_subs_paths = sorted(all_subs_path.glob('*.srt'))
all_subs_df = pd.DataFrame({'file_name': [p.name for p in all_subs_paths],
                            'file_path': all_subs_paths})
display(all_subs_df)
print(f'Found {all_subs_df.shape[0]} subtitles files')

|   | file_name | file_path |
|---|---|---|
| 0 | A_knights_tale(2001).srt | Sample_subs\utf-8\A_knights_tale(2001).srt |
| 1 | Beauty_and_the_beast(2017).srt | Sample_subs\utf-8\Beauty_and_the_beast(2017).srt |
| 2 | The_fault_in_our_stars(2014).srt | Sample_subs\utf-8\The_fault_in_our_stars(2014)... |
| 3 | The_usual_suspects(1995).srt | Sample_subs\utf-8\The_usual_suspects(1995).srt |
| 4 | While_You_Were_Sleeping(1995).srt | Sample_subs\utf-8\While_You_Were_Sleeping(1995... |


Found 5 subtitles files


In [10]:
# adding text to dataframe
all_subs_df['raw_text'] = all_subs_df['file_path'].apply(srt_raw_text)
all_subs_df

|   | file_name | file_path | raw_text |
|---|---|---|---|
| 0 | A_knights_tale(2001).srt | Sample_subs\utf-8\A_knights_tale(2001).srt | Resync: Xenzai[NEF]\nRETAIL\nShould we help hi... |
| 1 | Beauty_and_the_beast(2017).srt | Sample_subs\utf-8\Beauty_and_the_beast(2017).srt | Once upon a time,\nin the hidden heart of Fran... |
| 2 | The_fault_in_our_stars(2014).srt | Sample_subs\utf-8\The_fault_in_our_stars(2014)... | <i>I believe we have a choice in this\nworld a... |
| 3 | The_usual_suspects(1995).srt | Sample_subs\utf-8\The_usual_suspects(1995).srt | How you doing, Keaton?\nI can't feel my legs..... |
| 4 | While_You_Were_Sleeping(1995).srt | Sample_subs\utf-8\While_You_Were_Sleeping(1995... | LUCY: <i>Okay, there are two things that</i>\n... |
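
`srt_raw_text` is defined in `english_level_functions.py`, so its body is not visible here. Judging by the output, it keeps the spoken lines and drops the subtitle counters and timings. Below is a minimal sketch of that idea; the name `srt_text_sketch` is hypothetical, it takes the file content as a string rather than a path, and the timings in the sample are made up for illustration.

```python
import re

# An SRT timing line looks like "00:00:01,000 --> 00:00:04,000"
TIMING_LINE = re.compile(
    r'^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}'
)


def srt_text_sketch(srt_content):
    """Return only the spoken lines of an SRT string, joined by newlines."""
    kept = []
    for line in srt_content.splitlines():
        line = line.strip()
        # skip blank separators, numeric counters and timing lines
        if not line or line.isdigit() or TIMING_LINE.match(line):
            continue
        kept.append(line)
    return '\n'.join(kept)


# made-up timings; the two text lines are taken from the table above
sample = (
    "1\n"
    "00:00:01,000 --> 00:00:04,000\n"
    "How you doing, Keaton?\n"
    "\n"
    "2\n"
    "00:00:05,000 --> 00:00:07,500\n"
    "I can't feel my legs.\n"
)
print(srt_text_sketch(sample))
```

One known limitation of this sketch: a spoken line made up only of digits (a year, for example) would be mistaken for a counter and dropped.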


In [11]:
all_subs_df['preprocessed_text'] = all_subs_df['raw_text'].apply(text_preprocess_lem)
df1 = all_subs_df.drop(columns=['file_path', 'raw_text'])
df1

|   | file_name | preprocessed_text |
|---|---|---|
| 0 | A_knights_tale(2001).srt | two minute forfeit . lend u . right . left . d... |
| 1 | Beauty_and_the_beast(2017).srt | handsome young prince . lived beautiful castle... |
| 2 | The_fault_in_our_stars(2014).srt | one hand , sugarcoat . way movie romance novel... |
| 3 | The_usual_suspects(1995).srt | keyser . ready ? time ? . started back new yor... |
| 4 | While_You_Were_Sleeping(1995).srt | n't remember orange . first , remember dad . w... |


### Vectorising text

In [12]:
# leaving only words
df1['bow_text'] = df1['preprocessed_text'].apply(lambda x: re.sub(ONLY_WORDS, '', x))
df1['bow_text'] = df1['bow_text'].apply(lambda x: re.sub(r'\s+', ' ', x)) # removing multiple spaces
display(df1)

|   | file_name | preprocessed_text | bow_text |
|---|---|---|---|
| 0 | A_knights_tale(2001).srt | two minute forfeit . lend u . right . left . d... | two minute forfeit lend u right left dead eh t... |
| 1 | Beauty_and_the_beast(2017).srt | handsome young prince . lived beautiful castle... | handsome young prince lived beautiful castle p... |
| 2 | The_fault_in_our_stars(2014).srt | one hand , sugarcoat . way movie romance novel... | one hand sugarcoat way movie romance novel bea... |
| 3 | The_usual_suspects(1995).srt | keyser . ready ? time ? . started back new yor... | keyser ready time started back new york six we... |
| 4 | While_You_Were_Sleeping(1995).srt | n't remember orange . first , remember dad . w... | n remember orange first remember dad would get... |
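
To make the two substitutions easier to follow, here is the same cleanup applied to a single short string (the visible start of the last row). The pattern removes punctuation marks and any apostrophe plus the letters after it, which is why `n't` ends up as a bare `n`.

```python
import re

# same pattern as ONLY_WORDS in the notebook
ONLY_WORDS = re.compile(r'[.,!?]|(?:\'[a-z]*)')

sample = "n't remember orange . first , remember dad ."

no_punct = ONLY_WORDS.sub('', sample)   # drop punctuation and apostrophe suffixes
bow = re.sub(r'\s+', ' ', no_punct)     # collapse the leftover runs of spaces

print(repr(bow))  # 'n remember orange first remember dad '
```

The trailing space is left in place here, as in the notebook cell. If the vectorizer in the saved pipeline uses scikit-learn's default `token_pattern`, single-letter leftovers such as `n` or `u` are ignored anyway, but I have not verified the pipeline's settings here.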


In [13]:
# getting features
X1 = df1['bow_text'].copy()

display(X1)

0    two minute forfeit lend u right left dead eh t...
1    handsome young prince lived beautiful castle p...
2    one hand sugarcoat way movie romance novel bea...
3    keyser ready time started back new york six we...
4    n remember orange first remember dad would get...
Name: bow_text, dtype: object

### Applying the model

In [14]:
# loading saved pipeline
english_level_pipeline = joblib.load(PATH_MODEL)

In [15]:
# applying the pipeline
labels = english_level_pipeline.predict(X1)
print('Predicted labels are:', labels)

df1['english_level'] = labels
df_results = df1[['file_name', 'english_level']].copy()

display(df_results)

Predicted labels are: ['B2' 'B2' 'B1' 'B2' 'B1']


|   | file_name | english_level |
|---|---|---|
| 0 | A_knights_tale(2001).srt | B2 |
| 1 | Beauty_and_the_beast(2017).srt | B2 |
| 2 | The_fault_in_our_stars(2014).srt | B1 |
| 3 | The_usual_suspects(1995).srt | B2 |
| 4 | While_You_Were_Sleeping(1995).srt | B1 |
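
The pipeline in `english_level_model.pkl` was built in an earlier notebook, so its exact steps are not shown here. Going by the imports at the top, it may combine `CountVectorizer` with `LogisticRegression`, but that is an assumption. The sketch below shows the general save-and-reload round trip with joblib on a made-up toy corpus; the texts, labels and file name are invented for illustration and say nothing about the real model.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# made-up toy data, for illustration only
toy_texts = [
    'cat sit mat dog run park',
    'boy eat apple girl drink milk',
    'prosecutor allege defendant conspire launder proceeds',
    'committee scrutinize legislation amid mounting controversy',
]
toy_labels = ['B1', 'B1', 'B2', 'B2']

# assumed pipeline shape: bag-of-words counts fed into logistic regression
pipeline = make_pipeline(
    CountVectorizer(),
    LogisticRegression(random_state=1337, max_iter=1000),
)
pipeline.fit(toy_texts, toy_labels)

# round trip through a temporary file, as joblib.dump / joblib.load would do on disk
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / 'toy_model.pkl'
    joblib.dump(pipeline, str(model_path))
    restored = joblib.load(str(model_path))

new_texts = ['dog eat apple park']
print(restored.predict(new_texts))
```

Saving the whole pipeline rather than just the classifier keeps the fitted vocabulary together with the model, so `predict` can be called on plain strings, as in the cell above.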


## Conclusion

That's all for this project so far!

I ran into a new problem that I don't yet know how to deal with, but I'm out of time, so this is everything I have so far.

Experience I gained from this project:
* worked with the file system and learned how to deal with different encodings
* gained some experience with text processing libraries
* worked with Optuna and CatBoost, though without any good results so far
* saved the pipeline to a file using joblib