# English Level (dataset formation)

## Intro

**Yandex.Practicum's English department** is the customer for this project: https://practicum.yandex.ru/english/

One of the most effective ways to study foreign languages (including English) is to watch movies. Learning is considered most effective when the student can understand 50% to 70% of the dialogue, so it is vital that a movie's content matches the student's English level. We will use **CEFR** to define English levels.

A dataset containing the English levels of some movies is provided by Yandex.Practicum experts.

**Objective**: build a model that can evaluate the English level of a movie based on the content of its subtitles.

This project is divided into three notebooks:
* `english_level_dataset.ipynb`: forms a dataset from all the data and saves it to the `text_labels.csv` file
* `english_level_modeling.ipynb`: takes the `text_labels.csv` file, does text processing and modeling, and saves the model to the `english_labels_model.pkl` file
* `english_level_servise.ipynb`: labels a provided `.srt` file using the saved model

### Data provided

* A spreadsheet containing the titles and English levels of some movies
* Sets of labeled subtitles (`.srt` format)

This notebook processes all the data to form a dataset that contains the text from the `.srt` files and the English level labels, and saves it to a `.csv` file for future training.

## Data loading

### Libraries and settings

In [1]:
# libraries to work with data
import pandas as pd
import numpy as np
import re
import difflib

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# libraries to work with files
import os
import pysrt # https://github.com/byroot/pysrt
import chardet # detect encoding
import codecs # decode files

from pathlib import Path

In [3]:
# global variables
PATH_MOVIE_LABELS = r'./English_scores/movies_labels.xlsx' # movie titles and labels
PATH_ALL_SUBS_FOLDER = r'./English_scores/Subtitles_all/Subtitles' # all unlabeled .srt files
PATH_SUBS_A2 = r'./English_scores/Subtitles_all/A2' # .srt files labeled A2
PATH_SUBS_B1 = r'./English_scores/Subtitles_all/B1' # .srt files labeled B1
PATH_SUBS_B2 = r'./English_scores/Subtitles_all/B2' # .srt files labeled B2
PATH_SUBS_C1 = r'./English_scores/Subtitles_all/C1' # .srt files labeled C1
UTF8_SUBFOLDER = r'/utf-8' # subfolder for .srt files re-encoded to utf-8 
PATH_TRAIN_DATA = r'./text_labels.csv' # file for dataset
RND_STATE = 1337 # random state

In [4]:
# regex for text processing
HTML = re.compile(r'<.*?>')
TAG = re.compile(r'{.*?}')
COMMENTS = re.compile(r'[\(\[][A-Z ]+[\)\]]')
LETTERS = re.compile(r'[^a-zA-Z\'.,!? ]')
SPACES = re.compile(r'([ ])\1+')
DOTS = re.compile(r'[\.]+')

ONLY_WORDS = re.compile(r'[.,!?]|(?:\'[a-z]*)') # for BOW
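
These patterns are only defined in this notebook, not applied. As a rough sketch of how they could be chained to clean one subtitle text, the hypothetical `clean_subtitle_text` helper below applies them in one plausible order; both the order and the choice to replace matches with a space are assumptions, not the project's confirmed pipeline.

```python
import re

# same patterns as in the cell above
HTML = re.compile(r'<.*?>')
TAG = re.compile(r'{.*?}')
COMMENTS = re.compile(r'[\(\[][A-Z ]+[\)\]]')
LETTERS = re.compile(r'[^a-zA-Z\'.,!? ]')
SPACES = re.compile(r'([ ])\1+')
DOTS = re.compile(r'[\.]+')

def clean_subtitle_text(text):
    """Hypothetical helper: strip markup and normalize punctuation and spacing."""
    text = HTML.sub(' ', text)      # <i>, <font ...>, </b>, ...
    text = TAG.sub(' ', text)       # {\an8}-style positioning tags
    text = COMMENTS.sub(' ', text)  # [MUSIC PLAYING], (LAUGHS), ...
    text = LETTERS.sub(' ', text)   # anything except letters and basic punctuation
    text = DOTS.sub('.', text)      # "..." -> "."
    text = SPACES.sub(r'\1', text)  # collapse runs of spaces
    return text.strip()

sample = '<i>Oh, I come from a land</i>\n[MUSIC PLAYING] {\\an8}Hey...'
cleaned = clean_subtitle_text(sample)
print(cleaned)
```

Markup has to be removed before `LETTERS` runs, because `LETTERS` would otherwise delete the angle brackets and braces that the other patterns rely on.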

### Loading dataframe

In [5]:
# loading dataset
movies_labels_df = pd.read_excel(PATH_MOVIE_LABELS)
movies_labels_df

Unnamed: 0,id,Movie,Level
0,0,10_Cloverfield_lane(2016),B1
1,1,10_things_I_hate_about_you(1999),B1
2,2,A_knights_tale(2001),B2
3,3,A_star_is_born(2018),B2
4,4,Aladdin(1992),A2/A2+
...,...,...,...
236,236,Matilda(2022),C1
237,237,Bullet train,B1
238,238,Thor: love and thunder,B2
239,239,Lightyear,B2


In [6]:
# clean the index
movies_labels_df = movies_labels_df.drop(columns=['id'])

In [7]:
# rename columns
movies_labels_df.columns = ['movie', 'label']

As we see, the `movie` column contains not only the movie title but also the release year. We will probably need that data, so I will extract it. The `label` column is the target.

In [8]:
# movie names to lowercase
movies_labels_df['movie_title'] = movies_labels_df['movie'].str.casefold()

In [9]:
# extract year (raw strings avoid invalid escape sequence warnings)
movies_labels_df['year'] = movies_labels_df['movie_title'].str.extract(r'((?<=\()\d\d\d\d(?=\)))', expand=False)
movies_labels_df['year'] = pd.to_numeric(movies_labels_df['year'], errors='coerce').astype('Int32')
# remove year
movies_labels_df['movie_title'] = movies_labels_df['movie_title'].apply(lambda x: re.sub(r'\(((19)|(20))\d\d\)', '', x))
# change underscore to space
movies_labels_df['movie_title'] = movies_labels_df['movie_title'].apply(lambda x: re.sub('_', ' ', x))
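
To make the title normalization easier to follow, here is a minimal standard-library sketch of the same idea applied to a single string. The `split_title_year` helper and the `YEAR_IN_PARENS` pattern are hypothetical names. Unlike the cell above, the sketch uses one pattern (years starting with 19 or 20) for both extracting and removing the year.

```python
import re

# a four-digit year starting with 19 or 20, wrapped in parentheses
YEAR_IN_PARENS = re.compile(r'\(((?:19|20)\d\d)\)')

def split_title_year(movie):
    """Hypothetical helper: return (normalized title, year or None)."""
    title = movie.casefold()
    match = YEAR_IN_PARENS.search(title)
    year = int(match.group(1)) if match else None
    # drop the year, turn underscores into spaces
    title = YEAR_IN_PARENS.sub('', title).replace('_', ' ').strip()
    return title, year

print(split_title_year('10_Cloverfield_lane(2016)'))
print(split_title_year('Bullet train'))
```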

In [10]:
movies_labels_df

Unnamed: 0,movie,label,movie_title,year
0,10_Cloverfield_lane(2016),B1,10 cloverfield lane,2016
1,10_things_I_hate_about_you(1999),B1,10 things i hate about you,1999
2,A_knights_tale(2001),B2,a knights tale,2001
3,A_star_is_born(2018),B2,a star is born,2018
4,Aladdin(1992),A2/A2+,aladdin,1992
...,...,...,...,...
236,Matilda(2022),C1,matilda,2022
237,Bullet train,B1,bullet train,
238,Thor: love and thunder,B2,thor: love and thunder,
239,Lightyear,B2,lightyear,


In [11]:
movies_labels_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241 entries, 0 to 240
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   movie        241 non-null    object
 1   label        241 non-null    object
 2   movie_title  241 non-null    object
 3   year         108 non-null    Int32 
dtypes: Int32(1), object(3)
memory usage: 7.0+ KB


Let's check for duplicates

In [12]:
movies_labels_df[movies_labels_df['movie_title'].duplicated(keep=False)].sort_values(by='movie_title')

Unnamed: 0,movie,label,movie_title,year
43,Inside_out(2015),B1,inside out,2015
44,Inside_out(2015),B1,inside out,2015
56,Matilda(1996),B1,matilda,1996
236,Matilda(2022),C1,matilda,2022
38,Powder(1995),B1,powder,1995
68,Powder(1995),B1,powder,1995
75,The_blind_side(2009),B2,the blind side,2009
84,The_blind_side(2009),B1,the blind side,2009
83,The_terminal(2004),B1,the terminal,2004
99,The_terminal(2004),"A2/A2+, B1",the terminal,2004


Some movies are duplicated, and some of the duplicates have different English level ratings. What should I do with them?
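
One possible answer, offered only as a sketch: collapse exact duplicates and, when the same movie has conflicting ratings, keep the lower level, mirroring the lower-level rule this notebook applies to multi-level labels. The `resolve_duplicates` helper and the `CEFR_ORDER` list are hypothetical names, and the code assumes each label has already been reduced to a single level. Keying on the full `movie` value (title plus year) keeps the two different Matilda films apart.

```python
CEFR_ORDER = ['A1', 'A2', 'B1', 'B2', 'C1', 'C2']

def resolve_duplicates(rows):
    """Hypothetical helper: keep the lowest CEFR label per movie."""
    lowest = {}
    for movie, label in rows:
        current = lowest.get(movie)
        if current is None or CEFR_ORDER.index(label) < CEFR_ORDER.index(current):
            lowest[movie] = label
    return lowest

# a few (movie, label) pairs taken from the duplicates table above
rows = [
    ('Inside_out(2015)', 'B1'), ('Inside_out(2015)', 'B1'),
    ('The_blind_side(2009)', 'B2'), ('The_blind_side(2009)', 'B1'),
    ('Matilda(1996)', 'B1'), ('Matilda(2022)', 'C1'),
]
resolved = resolve_duplicates(rows)
print(resolved)
```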

## Forming a dataset

We need to form a dataset that contains text data and labels indicating English level determined by the experts. Let's overview all the data we have so far:
* an `.xlsx` spreadsheet containing movie titles and labels
* a bunch of `.srt` files piled in one folder
* some `.srt` files sorted by English level in separate folders

We will extract text data using the [pysrt](https://github.com/byroot/pysrt) library. This library can extract plain text from `.srt` files without timestamps but has trouble dealing with some encodings. Thus the first thing we do is re-encode all the `.srt` files to `UTF-8` and put them into a `/utf-8` subfolder.

### Encoding all files to UTF-8

In [13]:
# this function returns the detected encoding of the file
def encoding_detector(file_path):
    # read the first 1000 bytes of the file
    with open(file_path, 'rb') as file:
        raw_data = file.read(1000)
    # detect the encoding from that sample (the with block already closed the file)
    result = chardet.detect(raw_data)
    return result['encoding']

In [14]:
# this function takes a folder, gets all its .srt files, re-encodes them to utf-8
# and puts them into the /utf-8 subfolder
def folder_to_utf(folder_path):
    # create a utf-8 subfolder
    os.makedirs(os.path.join(folder_path, 'utf-8'), exist_ok=True)

    # loop through all files in the folder
    for filename in os.listdir(folder_path):
        if filename.endswith('.srt'):
            # define the file paths
            file_path = os.path.join(folder_path, filename)
            new_file_path = os.path.join(folder_path, 'utf-8', filename)

            # open the file and read its contents (the with block closes the file)
            with codecs.open(file_path, 'r', encoding=encoding_detector(file_path), errors='replace') as file:
                contents = file.read()

            # write the contents to a new file with UTF-8 encoding
            with codecs.open(new_file_path, 'w', encoding='UTF-8', errors='replace') as new_file:
                new_file.write(contents)
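
To show what the re-encoding step does at the byte level, here is a minimal standard-library sketch. It skips `chardet` and assumes the source encoding is already known (`latin-1`); the `reencode_to_utf8` helper and the sample file names are hypothetical.

```python
import tempfile
from pathlib import Path

def reencode_to_utf8(src, dst, src_encoding):
    """Hypothetical helper: decode with a known encoding, write back as UTF-8."""
    text = Path(src).read_text(encoding=src_encoding, errors='replace')
    Path(dst).write_text(text, encoding='utf-8')

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'sample.srt'
    dst = Path(tmp) / 'sample_utf8.srt'
    # "é" is the single byte 0xE9 in latin-1
    src.write_bytes('1\n00:00:01,000 --> 00:00:02,000\nCafé\n'.encode('latin-1'))
    reencode_to_utf8(src, dst, 'latin-1')
    source_bytes = src.read_bytes()
    result_bytes = dst.read_bytes()

print(b'\xe9' in source_bytes, b'\xc3\xa9' in result_bytes)
```

The same character becomes the two-byte sequence `0xC3 0xA9` in UTF-8, which is why a reader that assumes UTF-8 would fail on the original file.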

Now we iterate through all the folders we have, encode files to `UTF-8` and put them in `/utf-8` subfolder

In [15]:
for folder in [
    PATH_ALL_SUBS_FOLDER,
    PATH_SUBS_A2,
    PATH_SUBS_B1,
    PATH_SUBS_B2,
    PATH_SUBS_C1
]:
    folder_to_utf(folder)

Now we have a `/utf-8` subfolder with re-encoded `.srt` files

### Adding subs to the spreadsheet

In [16]:
# saving path to the folder with reencoded .srt
all_subs_path = Path(PATH_ALL_SUBS_FOLDER+UTF8_SUBFOLDER)

In [17]:
# getting df with file names and file paths
# glob once so that names and paths are guaranteed to stay aligned
all_subs_paths = list(all_subs_path.glob('*.srt'))
all_subs_list = [p.name for p in all_subs_paths]
all_subs_df = pd.DataFrame({'file_name': all_subs_list,
                            'file_path': all_subs_paths})
display(all_subs_df.head())
print(f'Found {all_subs_df.shape[0]} subtitle files')

Unnamed: 0,file_name,file_path
0,10_Cloverfield_lane(2016).srt,English_scores\Subtitles_all\Subtitles\utf-8\1...
1,10_things_I_hate_about_you(1999).srt,English_scores\Subtitles_all\Subtitles\utf-8\1...
2,Aladdin(1992).srt,English_scores\Subtitles_all\Subtitles\utf-8\A...
3,All_dogs_go_to_heaven(1989).srt,English_scores\Subtitles_all\Subtitles\utf-8\A...
4,An_American_tail(1986).srt,English_scores\Subtitles_all\Subtitles\utf-8\A...


Found 115 subtitle files


Let's add file names and file paths to the spreadsheet

In [18]:
movies_labels_df['file_name'] = movies_labels_df.apply(
    lambda x: difflib.get_close_matches(x['movie'], all_subs_list, n=1, cutoff=0.8), axis=1)

In [19]:
# this function unpacks the list returned by .get_close_matches:
# the first match if there is one, NaN otherwise
def list_extractor(matches):
    if not matches:
        return np.nan
    return matches[0]

In [20]:
movies_labels_df['file_name'] = movies_labels_df['file_name'].apply(list_extractor)
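
The fuzzy matching above can be tried in isolation with a minimal sketch. The `match_file` helper is a hypothetical name that combines the two steps (closest match, then unpacking the list) and returns `None` instead of NaN.

```python
import difflib

def match_file(movie, file_names, cutoff=0.8):
    """Hypothetical helper: best file name for a movie, or None if nothing is close enough."""
    matches = difflib.get_close_matches(movie, file_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

file_names = ['10_Cloverfield_lane(2016).srt', 'Aladdin(1992).srt']
matched = match_file('Aladdin(1992)', file_names)
unmatched = match_file('Lightyear', file_names)
print(matched, unmatched)
```

Because the `.srt` suffix counts against the similarity ratio, very short titles can fall below the cutoff even when the file exists. For example, `Up` against `Up.srt` scores only 0.5.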

In [21]:
display(movies_labels_df.head())

Unnamed: 0,movie,label,movie_title,year,file_name
0,10_Cloverfield_lane(2016),B1,10 cloverfield lane,2016,10_Cloverfield_lane(2016).srt
1,10_things_I_hate_about_you(1999),B1,10 things i hate about you,1999,10_things_I_hate_about_you(1999).srt
2,A_knights_tale(2001),B2,a knights tale,2001,A_knights_tale(2001).srt
3,A_star_is_born(2018),B2,a star is born,2018,A_star_is_born(2018).srt
4,Aladdin(1992),A2/A2+,aladdin,1992,Aladdin(1992).srt


In [22]:
# let's check df for nans
movies_labels_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241 entries, 0 to 240
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   movie        241 non-null    object
 1   label        241 non-null    object
 2   movie_title  241 non-null    object
 3   year         108 non-null    Int32 
 4   file_name    110 non-null    object
dtypes: Int32(1), object(4)
memory usage: 8.8+ KB


We will perform an inner merge on file names

In [23]:
# merging dfs
subs_df = movies_labels_df.merge(all_subs_df, how='inner', left_on='file_name', right_on='file_name')
display(subs_df.head())
print(f'Matched {subs_df.shape[0]} files')

Unnamed: 0,movie,label,movie_title,year,file_name,file_path
0,10_Cloverfield_lane(2016),B1,10 cloverfield lane,2016,10_Cloverfield_lane(2016).srt,English_scores\Subtitles_all\Subtitles\utf-8\1...
1,10_things_I_hate_about_you(1999),B1,10 things i hate about you,1999,10_things_I_hate_about_you(1999).srt,English_scores\Subtitles_all\Subtitles\utf-8\1...
2,A_knights_tale(2001),B2,a knights tale,2001,A_knights_tale(2001).srt,English_scores\Subtitles_all\Subtitles\utf-8\A...
3,A_star_is_born(2018),B2,a star is born,2018,A_star_is_born(2018).srt,English_scores\Subtitles_all\Subtitles\utf-8\A...
4,Aladdin(1992),A2/A2+,aladdin,1992,Aladdin(1992).srt,English_scores\Subtitles_all\Subtitles\utf-8\A...


Matched 110 files


### Adding subs from other folders

We will scan all the other folders for `.srt` files, add their file paths to the list and assign labels based on the folder

In [24]:
subs_df = subs_df[['label', 'file_path']]
subs_df

Unnamed: 0,label,file_path
0,B1,English_scores\Subtitles_all\Subtitles\utf-8\1...
1,B1,English_scores\Subtitles_all\Subtitles\utf-8\1...
2,B2,English_scores\Subtitles_all\Subtitles\utf-8\A...
3,B2,English_scores\Subtitles_all\Subtitles\utf-8\A...
4,A2/A2+,English_scores\Subtitles_all\Subtitles\utf-8\A...
...,...,...
105,B2,English_scores\Subtitles_all\Subtitles\utf-8\V...
106,B1,English_scores\Subtitles_all\Subtitles\utf-8\W...
107,B1,English_scores\Subtitles_all\Subtitles\utf-8\W...
108,B1,English_scores\Subtitles_all\Subtitles\utf-8\W...


In [25]:
# iterate through folders
for folder, label in zip([PATH_SUBS_A2, PATH_SUBS_B1, PATH_SUBS_B2, PATH_SUBS_C1],
                         ['A2', 'B1', 'B2', 'C1']):
    folder_utf8 = folder + UTF8_SUBFOLDER
    temp_df = pd.DataFrame({'file_path': list(Path(folder_utf8).glob('*.srt'))})
    temp_df['label'] = label
    print(f'Adding {temp_df.shape[0]} subs for {label} label...')
    display(temp_df.head(3))
    subs_df = pd.concat([subs_df, temp_df], axis=0, ignore_index=True)

print('All subs dataset')
display(subs_df)

Adding 6 subs for A2 label...


Unnamed: 0,file_path,label
0,English_scores\Subtitles_all\A2\utf-8\The Walk...,A2
1,English_scores\Subtitles_all\A2\utf-8\The Walk...,A2
2,English_scores\Subtitles_all\A2\utf-8\The Walk...,A2


Adding 17 subs for B1 label...


Unnamed: 0,file_path,label
0,English_scores\Subtitles_all\B1\utf-8\American...,B1
1,English_scores\Subtitles_all\B1\utf-8\Angelas....,B1
2,English_scores\Subtitles_all\B1\utf-8\Indiana ...,B1


Adding 107 subs for B2 label...


Unnamed: 0,file_path,label
0,English_scores\Subtitles_all\B2\utf-8\Angela's...,B2
1,English_scores\Subtitles_all\B2\utf-8\Collater...,B2
2,English_scores\Subtitles_all\B2\utf-8\Crazy4TV...,B2


Adding 33 subs for C1 label...


Unnamed: 0,file_path,label
0,English_scores\Subtitles_all\C1\utf-8\Downton ...,C1
1,English_scores\Subtitles_all\C1\utf-8\Downton ...,C1
2,English_scores\Subtitles_all\C1\utf-8\Downton ...,C1


All subs dataset


Unnamed: 0,label,file_path
0,B1,English_scores\Subtitles_all\Subtitles\utf-8\1...
1,B1,English_scores\Subtitles_all\Subtitles\utf-8\1...
2,B2,English_scores\Subtitles_all\Subtitles\utf-8\A...
3,B2,English_scores\Subtitles_all\Subtitles\utf-8\A...
4,A2/A2+,English_scores\Subtitles_all\Subtitles\utf-8\A...
...,...,...
268,C1,English_scores\Subtitles_all\C1\utf-8\Suits.S0...
269,C1,English_scores\Subtitles_all\C1\utf-8\Suits.S0...
270,C1,English_scores\Subtitles_all\C1\utf-8\Suits.S0...
271,C1,English_scores\Subtitles_all\C1\utf-8\Suits.S0...


In [26]:
# checking all the classes
subs_df['label'].value_counts()

B2            136
B1             54
C1             39
A2/A2+         25
B1, B2          8
A2              6
A2/A2+, B1      5
Name: label, dtype: int64

Some objects were classified into multiple classes. Also, there are no **A1** or **C2** labels.
* **A1** level is too basic, and it's not easy to find movies that don't go beyond **A1** in terms of language (except ones deliberately made that way)
* We will re-label **A2/A2+** to just **A2**
* We will re-label **B1, B2** to **B1**. Since the movies are used for language learning, we can choose the lower skill level
* We will re-label **A2/A2+, B1** to **A2** for the same reason
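
The three re-labeling rules are all instances of "keep the lowest level mentioned in the label". A minimal sketch of that general rule, checked against the class counts printed above, could look like this (the `lowest_level` helper, `LEVEL` pattern and `CEFR_ORDER` list are hypothetical names, not part of the notebook):

```python
import re
from collections import Counter

CEFR_ORDER = ['A1', 'A2', 'B1', 'B2', 'C1', 'C2']
LEVEL = re.compile(r'[ABC][12]')

def lowest_level(label):
    """Hypothetical helper: lowest CEFR level mentioned in a label string."""
    levels = LEVEL.findall(label)
    return min(levels, key=CEFR_ORDER.index)

# class counts copied from the value_counts() output above
counts_before = {'B2': 136, 'B1': 54, 'C1': 39, 'A2/A2+': 25,
                 'B1, B2': 8, 'A2': 6, 'A2/A2+, B1': 5}

counts_after = Counter()
for label, n in counts_before.items():
    counts_after[lowest_level(label)] += n
print(dict(counts_after))
```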

In [27]:
# rename classes
subs_df.loc[(subs_df['label'] == 'A2/A2+')|(subs_df['label'] == 'A2/A2+, B1'), 'label'] = 'A2'
subs_df.loc[subs_df['label'] == 'B1, B2', 'label'] = 'B1'
subs_df['label'].value_counts()

B2    136
B1     62
C1     39
A2     36
Name: label, dtype: int64

### Adding text data

Now we add the subtitle text to every row of the DataFrame

In [28]:
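# this function extracts raw text (no indices or timestamps) from an .srt file
def srt_raw_text(file_path):
    try:
        subs = pysrt.open(file_path)
        return subs.text
    except Exception:
        # np.nan instead of np.NaN: the latter alias was removed in NumPy 2.0
        return np.nan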

In [29]:
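# this function reads the full contents of an .srt file (with timestamps)
def srt_full_subs(file_path):
    try:
        # the files were re-encoded to UTF-8, so read them as UTF-8
        # instead of relying on the platform default encoding
        with open(file_path, encoding='utf-8') as file:
            return file.read()
    except Exception:
        return np.nan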
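
To make the difference between `raw_text` and `full_subs` concrete, the sketch below strips the index and timestamp lines from a small `.srt` string using only the standard library. It is an approximation for illustration, not how `pysrt` is implemented; the `srt_text_only` helper is a hypothetical name and the second cue in the sample is made up.

```python
import re

TIMESTAMP = re.compile(r'^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}')

def srt_text_only(full_subs):
    """Hypothetical helper: keep only the dialogue lines of an .srt string."""
    texts = []
    # cues are separated by blank lines
    for block in re.split(r'\n\s*\n', full_subs.strip()):
        lines = block.splitlines()
        # a well-formed cue starts with an index line and a timestamp line
        if len(lines) >= 2 and TIMESTAMP.match(lines[1].strip()):
            lines = lines[2:]
        texts.extend(lines)
    return '\n'.join(texts)

full_subs = (
    "1\n00:01:54,281 --> 00:01:55,698\nHey!\n\n"
    "2\n00:01:56,100 --> 00:01:58,300\nI'll be right\nwith you.\n"
)
text_only = srt_text_only(full_subs)
print(text_only)
```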

In [30]:
# applying text extraction function to df
subs_df['raw_text'] = subs_df['file_path'].apply(srt_raw_text)
subs_df.head()

Unnamed: 0,label,file_path,raw_text
0,B1,English_scores\Subtitles_all\Subtitles\utf-8\1...,"<font color=""#ffff80""><b>Fixed & Synced by boz..."
1,B1,English_scores\Subtitles_all\Subtitles\utf-8\1...,"Hey!\nI'll be right with you.\nSo, Cameron. He..."
2,B2,English_scores\Subtitles_all\Subtitles\utf-8\A...,Resync: Xenzai[NEF]\nRETAIL\nShould we help hi...
3,B2,English_scores\Subtitles_all\Subtitles\utf-8\A...,"- <i><font color=""#ffffff""> Synced and correct..."
4,A2,English_scores\Subtitles_all\Subtitles\utf-8\A...,"<i>Oh, I come from a land\nFrom a faraway plac..."


In [31]:
# add full text to dataset
subs_df['full_subs'] = subs_df['file_path'].apply(srt_full_subs)
subs_df.head()

Unnamed: 0,label,file_path,raw_text,full_subs
0,B1,English_scores\Subtitles_all\Subtitles\utf-8\1...,"<font color=""#ffff80""><b>Fixed & Synced by boz...","1\n00:00:55,279 --> 00:01:07,279\n<font color=..."
1,B1,English_scores\Subtitles_all\Subtitles\utf-8\1...,"Hey!\nI'll be right with you.\nSo, Cameron. He...","1\n00:01:54,281 --> 00:01:55,698\nHey!\n\n2\n0..."
2,B2,English_scores\Subtitles_all\Subtitles\utf-8\A...,Resync: Xenzai[NEF]\nRETAIL\nShould we help hi...,"1\n00:00:15,089 --> 00:00:21,229\nResync: Xenz..."
3,B2,English_scores\Subtitles_all\Subtitles\utf-8\A...,"- <i><font color=""#ffffff""> Synced and correct...","1\n00:00:17,610 --> 00:00:22,610\n- <i><font c..."
4,A2,English_scores\Subtitles_all\Subtitles\utf-8\A...,"<i>Oh, I come from a land\nFrom a faraway plac...","1\n00:00:27,240 --> 00:00:30,879\n<i>Oh, I com..."


In [32]:
subs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 273 entries, 0 to 272
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   label      273 non-null    object
 1   file_path  273 non-null    object
 2   raw_text   273 non-null    object
 3   full_subs  273 non-null    object
dtypes: object(4)
memory usage: 8.7+ KB


### Saving train data to file

In [33]:
# leave only labels and text and save
subs_df[['label', 'raw_text', 'full_subs']].to_csv(path_or_buf=PATH_TRAIN_DATA, index=False)
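
Both text columns contain line breaks, so it is worth confirming that multi-line values survive a CSV round trip. The sketch below shows this with the standard-library `csv` module on an in-memory buffer; the notebook itself uses pandas, so treat this as an illustration of CSV quoting rather than of `to_csv` specifically.

```python
import csv
import io

rows = [{
    'label': 'B1',
    'raw_text': "Hey!\nI'll be right with you.",
    'full_subs': '1\n00:01:54,281 --> 00:01:55,698\nHey!\n',
}]

# write to an in-memory buffer; fields with newlines are quoted automatically
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['label', 'raw_text', 'full_subs'])
writer.writeheader()
writer.writerows(rows)

# read the same buffer back
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored[0]['raw_text'] == rows[0]['raw_text'])
```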

## Conclusion

The main objective here was to gather all the data and form a solid dataset. The files came in different encodings, so we had to re-encode them all to a single encoding (UTF-8). The main problem is that there is not much labeled data: only 273 subtitle files.

We created a dataset that contains:
* raw text for every subtitle file
* full subtitle data with timestamps
* labels for the subtitles

Now we can use this file for training