## Prerequisites

### Install nltk_contrib
<code>
$ git clone https://github.com/nltk/nltk_contrib.git
$ cd nltk_contrib
$ python setup.py install
</code>

Also make sure the <code>Punkt</code> tokenizer is installed, for example by running <code>nltk.download('punkt')</code> once.

## Compute and export readability scores

In [5]:
from nltk_contrib.readability.readabilitytests import ReadabilityTool
import csv
import pickle

In [27]:
# suppress the warnings logged by ReadabilityTool()
import logging
logging.getLogger().setLevel('ERROR')
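
Setting the level on the root logger silences warnings from every library for the rest of the session. If that is too broad, a scoped alternative is sketched below. The helper name `root_log_level` is hypothetical and not part of `nltk_contrib` or the notebook; it only relies on the standard `logging` and `contextlib` modules.

```python
import logging
from contextlib import contextmanager


@contextmanager
def root_log_level(level):
    """Temporarily set the root logger level, restoring the old level on exit."""
    root_logger = logging.getLogger()
    previous_level = root_logger.level
    root_logger.setLevel(level)
    try:
        yield root_logger
    finally:
        # Always restore the previous level, even if the block raised.
        root_logger.setLevel(previous_level)


# Usage: warnings are hidden only while the block is active.
with root_log_level(logging.ERROR):
    logging.warning('hidden while the block is active')
```

Wrapping only the call that creates and queries `ReadabilityTool` in such a block would keep warnings from other code visible.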

In [68]:
def get_readability_scores(path_to_transcription_file):
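    # load the dictionary mapping video ids to transcriptions; close the file afterwards
    with open(path_to_transcription_file, 'rb') as transcription_file:
        transcriptions = pickle.load(transcription_file)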
    
    score_dictionary = {}
    
    for video_id in transcriptions:
        transcription = transcriptions[video_id]
        # if video filename starts with '-', have it start with '?-'
        if video_id[0] == '-':
            video_id = '?%s' % video_id
    
        current_video_scores = {}
        current_video_scores['video_id'] = video_id
        
        if len(transcription) > 0:
            transcription_readability_tool = ReadabilityTool(transcription)

            current_video_scores['ARI'] = transcription_readability_tool.ARI()
            current_video_scores['Flesch Reading Ease'] = \
                transcription_readability_tool.FleschReadingEase()
            current_video_scores['Flesch-Kincaid Grade Level'] = \
                transcription_readability_tool.FleschKincaidGradeLevel()
            current_video_scores['Gunning Fog Index'] = \
                transcription_readability_tool.GunningFogIndex()
            current_video_scores['SMOG Index'] = \
                transcription_readability_tool.SMOGIndex()
            current_video_scores['Coleman Liau Index'] = \
                transcription_readability_tool.ColemanLiauIndex()
            current_video_scores['LIX'] = transcription_readability_tool.LIX()
            current_video_scores['RIX'] = transcription_readability_tool.RIX()
        else:
            print('Video without transcription?? Check video %s' % video_id)
            # set all scores to NaN
            current_video_scores['ARI'] = float('nan')
            current_video_scores['Flesch Reading Ease'] = float('nan')
            current_video_scores['Flesch-Kincaid Grade Level'] = float('nan')
            current_video_scores['Gunning Fog Index'] = float('nan')
            current_video_scores['SMOG Index'] = float('nan')
            current_video_scores['Coleman Liau Index'] = float('nan')
            current_video_scores['LIX'] = float('nan')
            current_video_scores['RIX'] = float('nan')
            
        
        score_dictionary[video_id] = current_video_scores 
        
    return score_dictionary
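
For orientation, three of the scores collected above can be approximated directly from character, word, and sentence counts. The sketch below uses the commonly published formulas for ARI, LIX, and RIX with a deliberately naive regex tokenizer. `simple_readability` is a hypothetical helper, not part of `nltk_contrib`; `ReadabilityTool` may count words and sentences differently, so the values are not expected to match its output exactly.

```python
import re


def simple_readability(text):
    """Approximate ARI, LIX and RIX with a naive tokenizer (assumption: not nltk_contrib's)."""
    # A word is a run of letters/digits, optionally followed by an apostrophe suffix.
    words = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)
    # Treat every run of '.', '!' or '?' as one sentence boundary.
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    if not words or not sentences:
        nan = float('nan')
        return {'ARI': nan, 'LIX': nan, 'RIX': nan}

    n_words = float(len(words))
    n_sentences = float(len(sentences))
    # Count characters without apostrophes; "long" words have more than 6 characters.
    lengths = [len(w.replace("'", '')) for w in words]
    n_chars = sum(lengths)
    n_long_words = sum(1 for length in lengths if length > 6)

    return {
        'ARI': 4.71 * n_chars / n_words + 0.5 * n_words / n_sentences - 21.43,
        'LIX': n_words / n_sentences + 100.0 * n_long_words / n_words,
        'RIX': n_long_words / n_sentences,
    }


scores = simple_readability('Thank you for watching. See you in the next video.')
```

Short sentences made of short words push all three scores down, which is typical for conversational video transcriptions.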

In [None]:
readability_scores = get_readability_scores('/Volumes/Samsung_T3/ChaLearn/val/transcription_validation.pkl')

In [72]:
def export_scores_to_csv(score_dictionary, path_to_csv):
    fieldnames = ['video_id',
                  'ARI',
                  'Flesch Reading Ease',
                  'Flesch-Kincaid Grade Level',
                  'Gunning Fog Index',
                  'SMOG Index',
                  'Coleman Liau Index',
                  'LIX',
                  'RIX']

    # use a context manager so the file is flushed and closed after writing
    with open(path_to_csv, 'w') as csv_file:
        writer = csv.DictWriter(csv_file, delimiter=',', fieldnames=fieldnames)
        writer.writeheader()

        # one row per video; each row already contains its 'video_id'
        for video_id in score_dictionary:
            writer.writerow(score_dictionary[video_id])

In [81]:
export_scores_to_csv(readability_scores, '/Volumes/Samsung_T3/ChaLearn/test/scores_test.csv')
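
To check an exported file, it can be read back with `csv.DictReader`. The sketch below is an assumption about how one might do this, not part of the original notebook: `load_scores_from_csv` is a hypothetical helper that undoes the `?-` prefix added during scoring and converts every score back to `float` (including the string `nan` written for videos without a transcription).

```python
import csv


def load_scores_from_csv(path_to_csv):
    """Read exported scores back into a dict keyed by the original video id."""
    scores_by_video = {}
    with open(path_to_csv, 'r') as csv_file:
        for row in csv.DictReader(csv_file):
            video_id = row.pop('video_id')
            # Undo the '?-' prefix given to ids that start with '-'.
            if video_id.startswith('?-'):
                video_id = video_id[1:]
            # float() also parses the string 'nan' written for missing scores.
            scores_by_video[video_id] = {
                name: float(value) for name, value in row.items()
            }
    return scores_by_video
```

Note that `nan != nan`, so missing scores should be detected with `math.isnan` rather than by equality.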

## Dataset corrections
### Training transcriptions

In [None]:
transcription_training = pickle.load(open('/YOUR/PATH/TO/transcription_training.pkl', 'rb'))
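
One way to find candidates for manual correction is to list the entries whose transcription is empty or nearly empty, which are the ones `get_readability_scores` reports. A minimal sketch, assuming the pickle holds a dictionary from video id to transcription string; `find_suspicious_transcriptions` and `min_words` are hypothetical names, and the entries corrected in this section were not necessarily found this way.

```python
def find_suspicious_transcriptions(transcriptions, min_words=1):
    """Return the sorted ids of videos whose transcription has fewer than min_words words."""
    return sorted(
        video_id
        for video_id, text in transcriptions.items()
        if len(text.split()) < min_words
    )


# Toy data standing in for the unpickled dictionary (video id -> transcription).
example = {'a.000.mp4': u'hello there', 'b.001.mp4': u'', 'c.002.mp4': u'   '}
candidates = find_suspicious_transcriptions(example)
```

Raising `min_words` also surfaces very short transcriptions, which may be truncated rather than missing.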

In [63]:
transcription_training['iYVJt41_q7M.002.mp4'] = u'all set, ok? Thank you for showing so much love across this channel. Hundred subscribers soon, I cannot wait for that, I\'ll give away when that happens as well. Just thank you so much generally from the bottom of my heart for the love you\'re showing on this channel. You guys take care and I will see you in the next video.' 


In [54]:
transcription_training['4LZJvOecyM8.005.mp4'] = u'-ly. Uhh, dududududududu - Interview With The Vampire, Queen of the Damned, tsk tsk tsk tsk tsk tsk tsk uhh uhm ... that\'s all I\'m gonna leave for right now, we\'ll see like the big main ones. Uhh...what did it take you to-'

In [56]:
transcription_training['YC3X1DcnUrk.000.mp4'] = u'but especially probably Princess Mononoke is just... the most beautiful, uhh, movie, uhh, it\'s, yeah, it\'s just one of our favorite films in all categories, not just in the "Animated" sort of category. Uhmm...other Japanese animated films that we love-' 

In [57]:
transcription_training['ztyBhnjtrz0.000.mp4'] = u'like model curves. Hayo Smith asked: "Why are you so generous?". Well, I like to give things away, and it\'s a nice thing to do. Even though I don\'t have a lot of money, but I have enough money to-' 

In [58]:
transcription_training['JTmq4k4uQCY.003.mp4'] = u'build buildings, and you play, and fun, and the...the regular blocks are the best ones, \'cause you can build anything with them. Everyone knows all that about Lego already, and...Minecraft is that, but you get to actually move around and play in this 3D world. I wouldn\'t, like, throw them into survival mode, but I\'d-' 

In [59]:
transcription_training['HhC2cGFFZeY.000.mp4'] = u'being 200 pounds would look monstrous, like, literally, what, uhmm, if you look at Lex Griffen, I think he\'s got similar build to me, like, bone density-wise, and, I think he is about similar height, and, he looks huge and he is only 170 pounds. I\'m about-'

In [60]:
transcription_training['cRDYrvxRJ6U.001.mp4'] = u'then, what relationships they wanna have, they get really clear and really focused, they live a life that\'s very purposeful to them. I absolutely love giving this to my clients. It\'s so, so, so much fun. So that\'s kinda what you can expect, working with a light coach especially'

In [64]:
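# save the corrected transcriptions to a new file, leaving the original untouched
with open('/YOUR/PATH/TO/transcription_training_extended.pkl', 'wb') as output_file:
    pickle.dump(transcription_training, output_file)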

### Test transcriptions

In [None]:
transcription_test = pickle.load(open('/YOUR/PATH/TO/ChaLearn/test/transcription_test.pkl', 'rb'))

In [75]:
transcription_test['JmAQlC-FEV8.000.mp4'] = u'pre'

In [76]:
transcription_test['_plk5k7PBEg.004.mp4'] = u'stuff I know, and a video, uhmm, two videos ago, whatever it was, where I talked about being a normal tech YouTube person enough anymore. I wanna make some things clear: I\'m not saying I\'m stopping all kinds of technology videos. I\'ll still do videos related to technology, and I will still do' 

In [77]:
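# save the corrected transcriptions to a new file, leaving the original untouched
with open('/YOUR/PATH/TO/transcription_test_extended.pkl', 'wb') as output_file:
    pickle.dump(transcription_test, output_file)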
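
Both subsections follow the same pattern: load a pickle, overwrite selected entries, and dump the result under a new name. That pattern could be wrapped in a single helper, sketched below. `apply_corrections` and its parameter names are hypothetical and not part of the original notebook; the sketch assumes the pickle contains a plain dictionary.

```python
import pickle


def apply_corrections(path_in, path_out, corrections):
    """Load a transcription pickle, overwrite the given entries, and save a copy."""
    with open(path_in, 'rb') as pickle_file:
        transcriptions = pickle.load(pickle_file)
    # Replace (or add) the manually corrected transcriptions.
    transcriptions.update(corrections)
    # Write to a separate file so the original pickle stays untouched.
    with open(path_out, 'wb') as pickle_file:
        pickle.dump(transcriptions, pickle_file)
    return transcriptions
```

Keeping the corrections in one dictionary also makes it easy to review them in a single place before they are written out.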