Now that we have our top 50 shounen anime and have downloaded their transcripts, we need to do some cleaning. In 
particular, there is going to be a lot of whitespace, timestamps, and text that is not Japanese. To make our NLP 
analysis easier, we will clean each text file until all that remains is the Japanese.

It should be noted before we move further, however, that real-world data is messy and incomplete. In that vein, I 
must admit that I was unable to find any subtitles for D.Gray Man, Great Teacher Onizuka, Claymore, Katekyo Hitman 
Reborn, Rosario to Vampire, Watashi Ga Motenai no wa Dou Kangaetemo..., and Kaze no Stigma. Further, I was only able 
to find incomplete transcripts for Naruto, Soul Eater, One Piece, My Hero Academia, and Pandora Hearts. 

----------------------------------------------------------------------------------------------------------------------

Though a little less than 86% of the top 50 anime are represented, we still have 4368 text files amassed for 
analysis. Because our ultimate goal is to find the most common words and phrases among the top 50 shounen anime, I 
believe we can still achieve it with this sample set, and I posit that 100% representation would only reinforce the 
results we will get with the current sample.


*NOTE*: It is imperative to share that 3 of the top 50 anime (One Piece, Naruto, and Dragon Ball) represent roughly half of all the sample files. This will come into play later when we analyze our bag-of-words results, as each of these
anime has its own unique slang and world that may affect our results more than we anticipate.

In [1]:
# Import necessary analysis modules
import re
import codecs

In [2]:
'''Our goal for this portion of the project will be to clean each of the 4368 text files and combine them 
all together to create one giant file full of nothing but lines from the top 50 shounen anime. Once we have this text 
file, we will then perform an NLP bag-of-words analysis to get our "ultimate" vocabulary list.

The flow of this workbook is as follows:

1. Open text file from computer
2. Remove unwanted text, numbers, stop words, and whitespace from said file
3. Append cleaned file to a "global" text file that will serve as our bag-of-words analysis document
4. Rinse and repeat

----------------------------------------------------------------------------------------------------------------------

The files we collected for analysis were originally .srt and .ass formatted files. They were converted to .txt for 
convenience. However, the way these two formats look once converted to text is remarkably different. As such, we 
will write separate functions for cleaning .srt and .ass files.

.srt files are easier to deal with, so we'll start with them.

'''

def srt_munging(text_file):
    
    srt_filepath = '/Users/nickburkhalter/Desktop/Data Science/Projects/Top Shounen Anime Vocab List/Subtitle Files By Type/SRT/'
    filename = srt_filepath + text_file
    
    # Open the file and read its full text (the converted .txt files are assumed to be UTF-8)
    with open(filename, 'rt', encoding='utf-8') as newfile:
        text = newfile.read()
    
    lines = text.split()
    
    
    # Regex and string expressions to be called throughout the function
    squiggly = '～'
    squiggly2 = '〜'
    quarter_note =re.compile(r'♪+')
    eighth_note = re.compile(r'♬+')
    large_open_pointer = '〈'
    large_double_open = '《'
    large_closed_pointer = '〉'
    large_double_closed = '》'
    double_open_pointer = '≪'
    double_closed_pointer = '≫'
    large_thin_open = '＜'
    large_thin_closed = '＞'
    fat_arrow = '➡'
    pesky_arrow = '→'
    open_block_quote = '『'
    closed_block_quote = '』'
    double_paren_open = re.compile(r'\(\(')
    double_paren_closed = re.compile(r'\)\)')
    
    
    
    
    # Start working by removing digits
    no_digits = []
    for i in lines:        
        # Remove lines that contain only numbers
        if not i.isdigit():
            no_digits.append(i)
            
    
    # Remove the '-->' symbol that shows span of time
    no_arrows = []          
    for i in no_digits:
        if not i == '-->':
            no_arrows.append(i)
                   
        
    # Remove unicode character at the beginning of the file
    no_unicode = []    
    for i in no_arrows:
        if not i == '\ufeff1' and not i == '\ufeff':
            no_unicode.append(i)
            
    
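    # Regex that matches an SRT timestamp token such as '00:01:23,456'.
    # It is written as a sequence; inside [...] it would be a character class matching any single digit, colon, or comma.
    timestamp = re.compile(r'\d\d:\d\d:\d\d,\d\d\d')
    no_timestamp = [i for i in no_unicode if not timestamp.match(i)]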
    
    
    # Create new empty list to store our newly cleaned data
    no_parentheses = []
    for line in no_timestamp:
        line = re.sub('（.*）','', line)        # Uses *JAPANESE* punctuation marks!!!    
        no_parentheses.append(line)
        
  
    # Get rid of those pesky American parentheses
    no_american = []  
    for line in no_parentheses:       
        line = re.sub(r'\(.*\)', '', line)      # Raw string so that \( and \) reach the regex engine unchanged
        no_american.append(line)
        
    # Remove square brackets and their contents
    no_brackets = []
    for i in no_american:
        i = re.sub(r'\[.*\]', '', i)
        no_brackets.append(i)
            
        
    # Now it's time to clear the decorative punctuations
    
    # Remove squiggly lines (～)
    no_squiggly = []
    for i in no_brackets:
        i = re.sub(squiggly, '', i)
        no_squiggly.append(i)
        
    no_squiggly2 = []
    for i in no_squiggly:
        i = re.sub(squiggly2, '', i)
        no_squiggly2.append(i)
        
        
    #Remove musical notes
    no_quarter = []
    no_eighth = []
    for i in no_squiggly2:
        i = re.sub(quarter_note, '', i)
        no_quarter.append(i)
        
    for i in no_quarter:
        i = re.sub(eighth_note, '', i)
        no_eighth.append(i)
        
        
    # Remove all the arrows
    no_pesky_arrows = []
    no_fat_arrows = []
    for i in no_eighth:
        i = re.sub(pesky_arrow, '', i)
        no_pesky_arrows.append(i)
        
    for i in no_pesky_arrows:
        i = re.sub(fat_arrow, '', i)
        no_fat_arrows.append(i)
        
        
    # Remove all pointers (< & > types), quotes, and remaining parentheses
    no_large_open_pointer = []
    no_large_closed_pointer = []
    no_large_double_open = []
    no_large_double_closed = []
    no_double_open = []
    no_double_closed = []
    no_open_block = []
    no_closed_block = []
    no_thin_open = []
    no_thin_closed = []
    no_double_paren_open = []
    no_double_paren_closed = []
    
    for i in no_fat_arrows:
        i = re.sub(large_open_pointer, '', i)
        no_large_open_pointer.append(i)
        
    for i in no_large_open_pointer:
        i = re.sub(large_closed_pointer, '', i)
        no_large_closed_pointer.append(i)
        
    for i in no_large_closed_pointer:
        i = re.sub(large_double_open, '', i)
        no_large_double_open.append(i)
        
    for i in no_large_double_open:
        i = re.sub(large_double_closed, '', i)
        no_large_double_closed.append(i)
        
    for i in no_large_double_closed:
        i = re.sub(double_open_pointer, '', i)
        no_double_open.append(i)
        
    for i in no_double_open:
        i = re.sub(double_closed_pointer, '', i)
        no_double_closed.append(i)
        
    for i in no_double_closed:
        i = re.sub(open_block_quote, '', i)
        no_open_block.append(i)
        
    for i in no_open_block:
        i = re.sub(closed_block_quote, '', i)
        no_closed_block.append(i)
        
    for i in no_closed_block:
        i = re.sub(large_thin_open, '', i)
        no_thin_open.append(i)
        
    for i in no_thin_open:
        i = re.sub(large_thin_closed, '', i)
        no_thin_closed.append(i)
        
    for i in no_thin_closed:
        i = re.sub(double_paren_open, '', i)
        no_double_paren_open.append(i)
        
    for i in no_double_paren_open:
        i = re.sub(double_paren_closed, '', i)
        no_double_paren_closed.append(i)
        
        
    # Remove English characters and punctuations
    all_japanese = []
    for i in no_double_paren_closed:
        if not i == ')' and not i == '(' and not i == '｡':
            i = re.sub(r'[A-Za-z]', '', i)
            all_japanese.append(i)
    
    
    
    # Exclude empty elements from our list
    # This piece of code should ALWAYS be last
    full_dialogue = []
    for i in all_japanese:
        if not i == '':
            full_dialogue.append(i)
        
        
    return full_dialogue
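
The token-level steps in `srt_munging` can also be condensed. The following is only a minimal sketch, not the function used to build the corpus: the helper name `strip_srt_tokens` and the constants `TIMESTAMP` and `SYMBOLS` are hypothetical, and it assumes timestamps always look like `00:01:23,456`. It matches each timestamp as a whole token and removes the decorative symbols with a single character class instead of one loop per symbol.

```python
import re

# A whole SRT timestamp token, e.g. "00:01:23,456" (a sequence, not a character class)
TIMESTAMP = re.compile(r'\d{2}:\d{2}:\d{2},\d{3}')

# The decorative symbols that srt_munging removes one at a time, gathered into one class
SYMBOLS = re.compile(r'[～〜♪♬〈〉《》≪≫＜＞➡→『』]')

def strip_srt_tokens(tokens):
    """Drop cue numbers, '-->' arrows, and timestamps; strip decorative symbols."""
    cleaned = []
    for token in tokens:
        if token.isdigit() or token == '-->' or TIMESTAMP.fullmatch(token):
            continue
        token = SYMBOLS.sub('', token)
        if token:                      # skip tokens that became empty
            cleaned.append(token)
    return cleaned

sample = "1\n00:00:01,000 --> 00:00:04,000\n3人で行こう～♪\n"
print(strip_srt_tokens(sample.split()))
```

Because the timestamp must match the whole token, a line of dialogue that merely begins with a digit (such as `3人で行こう`) is kept.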
       

In [3]:
'''Next, we will write a function to handle the more difficult .ass files. There is a lot going on in these, to 
include CSS encoding, timestamps, and other kinds of formatting. So this could get messy!'''

# In all honesty, I just wanted to make a function called ass_munging

def ass_munging(text_file):
    
    ass_filepath = '/Users/nickburkhalter/Desktop/Data Science/Projects/Top Shounen Anime Vocab List/Subtitle Files By Type/ASS/'
    filename = ass_filepath + text_file
    
    # Open the file and read its full text (the converted .txt files are assumed to be UTF-8)
    with open(filename, 'rt', encoding='utf-8') as newfile:
        text = newfile.read()
    
    lines = text.splitlines()
    dialogue_box = [i for i in lines if i.startswith('Dialogue')]
    
    
    # Regular & string expressions to be used throughout to help clean the data    
    formatted = re.compile(r'[A-Za-z]')
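    # NOTE: the [...] makes this a character class, so it strips every digit, comma, colon, period, and asterisk
    # rather than matching a whole timestamp; together with `formatted` it clears the Dialogue header fields
    timestamp = re.compile(r'[\d,\d:\d\d:\d\d.\d\d,\d:\d\d:\d\d.\d\d,.*\,,0000,0000,0000,,]')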
    digits = re.compile(r'[0-9]')
    nichibun = '日文'               # Can't read Chinese, so the variable "translations" are probably horribly incorrect
    taikatsu = '对话'
    useless_chinese1 = '制作人员==诸神字幕组==听译：朝颜'
    useless_chinese2 = '警示标语本字幕由诸神字幕组制作仅供学习交流之用若您喜欢本作品请支持正版影像制品'
    useless_chinese3 = '制作，仅供交流学习，禁止用于任何商业用途'
    useless_chinese4 = '日听兰樱&'
    useless_chinese5 = '本字幕由诸神字幕组出品'
    useless_chinese6 = '翻译：'
    useless_chinese7 = '諸神字幕組本字幕由诸神字幕组制作仅供学习交流之用若您喜欢本作品请支持正版影像制品'
    useless_chinese8 = '諸神字幕組{制作仅供学习交流之用若您喜欢本作品请支持正版影像制品'
    useless_chinese9 = '諸神字幕組'
    squiggly = '～'
    dash = re.compile(r'[-]+')
    brackets = re.compile(r'\{.*\}')
    white_noise_apos = re.compile(r'(`)+')
    white_noise_dot = re.compile(r'(·)+')
    white_noise_dot2 = re.compile(r'(・)+')
    pesky_arrow = '→'
    open_pointer = '≪'
    closed_pointer = '≫'
    large_japanese_open_pointer = '《'
    large_japanese_closed_pointer = '》'
    another_open_pointer = '＜'
    equals_sign = re.compile(r'[=]+')
    
    
    
    # Remove that pesky unicode
    no_unicode = []
    for i in dialogue_box:
        if not i == '\ufeff1' and not i == '\ufeff':
            no_unicode.append(i)
            
    no_spaces = []
    for i in no_unicode:
        i = re.sub('\u3000', '', i)
        no_spaces.append(i)
        
    extra_spaces = []
    for i in no_spaces:
        i = re.sub(' ', '', i)
        extra_spaces.append(i)
        
        
    # Remove tab characters
    no_tabs = []
    for i in extra_spaces:
        i = re.sub('\t', '', i)
        no_tabs.append(i)
    
    
    # Strip the timestamp and formatting characters (one pass is enough; a second pass changes nothing)
    no_timestamps = []
    for i in no_tabs:
        i = re.sub(timestamp, '', i)
        no_timestamps.append(i)
                              
            
    # Remove any English-language stragglers
    no_english = []
    for i in no_timestamps:
        i = re.sub(formatted,'', i)
        no_english.append(i)
        
    
    # Remove chinese characters from dialogue
    no_nichibun = []
    for i in no_english:
        i = re.sub(nichibun, '', i)
        no_nichibun.append(i)
        
    no_taikatsu = []
    for i in no_nichibun:
        i = re.sub(taikatsu, '', i)
        no_taikatsu.append(i)
                           
    
    # Remove backslashes from text
    no_backslashes = []
    for i in no_taikatsu:
        i = re.sub(r'\\', '', i)
        no_backslashes.append(i)
        
        
    # Loop through the last list to remove the parenthetical phrases
    no_parentheses = []
    for line in no_backslashes:
        line = re.sub('（.*）','', line)        # Uses *JAPANESE* punctuation marks!!!    
        no_parentheses.append(line)
        
    
    # Get rid of those pesky American parentheses
    no_american = []  
    for line in no_parentheses:        
        line = re.sub(r'\(.*\)', '', line)      # Raw string so that \( and \) reach the regex engine unchanged
        no_american.append(line)
        
        
    # Remove dashes (hyphens) from text
    no_hyphens = []
    for i in no_american:
        i = re.sub(dash, '', i)
        no_hyphens.append(i)
        
        
    # Remove equals signs
    no_equals = []
    for i in no_hyphens:
        i = re.sub(equals_sign, '', i)
        no_equals.append(i)
        
        
    # Remove squiggly (～)
    no_squiggly = []
    for i in no_equals:
        i = re.sub(squiggly, '', i)
        no_squiggly.append(i)
        
        
    # Remove brackets and their content from string element
    no_brackets = []
    for i in no_squiggly:
        i = re.sub(brackets, '', i)
        no_brackets.append(i)
        
    
    # Further remove any straggler brackets and quotation marks from our data (open brackets)
    no_open_brackets = []
    for i in no_brackets:
        if not i == '{' and not i == '“':
            no_open_brackets.append(i)
            
            
    # Remove any random lines of Japanese punctuation, pt.1
    noise_apos = []
    for i in no_open_brackets:
        i = re.sub(white_noise_apos, '', i)
        noise_apos.append(i)
        
        
    # Remove any random lines of Japanese punctuation, pt. 2
    noise_dot = []
    for i in noise_apos:
        i = re.sub(white_noise_dot, '', i)
        noise_dot.append(i)
        
    # Chinese and Japanese dots are different...
    noise_dot2 = []
    for j in noise_dot:
        j = re.sub(white_noise_dot2, '', j)
        noise_dot2.append(j)
        
        
    # Remove arrow symbols
    no_arrows = []
    for i in noise_dot2:
        i = re.sub(pesky_arrow, '', i)
        no_arrows.append(i)
        
        
    # Remove pointers (≪ & ≫)
    no_open_pointers = []
    for i in no_arrows:
        i = re.sub(open_pointer, '', i)
        no_open_pointers.append(i)
        
    no_closed_pointers = []
    for i in no_open_pointers:
        i = re.sub(closed_pointer, '', i)
        no_closed_pointers.append(i)
        
    # Just like dots, Chinese and Japanese pointers are different...
    no_large_open_pointers = []
    for i in no_closed_pointers:
        i = re.sub(large_japanese_open_pointer, '', i)
        no_large_open_pointers.append(i)
        
    no_large_closed_pointers = []
    for i in no_large_open_pointers:
        i = re.sub(large_japanese_closed_pointer, '', i)
        no_large_closed_pointers.append(i)
        
    # But wait, there's one more pointer!
    plz_no_more_pointers = []
    for i in no_large_closed_pointers:
        i = re.sub(another_open_pointer, '', i)
        plz_no_more_pointers.append(i)
        
        
    # Last bit of cleanup concerning random punctuation and Chinese
    language_only = [i for i in plz_no_more_pointers if not '『' in i]
    no_Chinese = [i for i in language_only if not '制作人员' in i and not '标题' in i]
    
    
    # Now let's get rid of that random Chinese text still scattered about our files
    boilerplate = {useless_chinese1, useless_chinese2, useless_chinese3, useless_chinese4, useless_chinese5,
                   useless_chinese6, useless_chinese7, useless_chinese8, useless_chinese9, '－－'}
    clean_text = [i for i in no_Chinese if i not in boilerplate]
        
    no_chuubun = []
    for i in clean_text:
        if not i.startswith('中文'):
            no_chuubun.append(i)
    
    
    # Exclude empty elements from our list
    # This piece of code should ALWAYS be last
    full_dialogue = []
    for i in no_chuubun:
        if not i == '':
            full_dialogue.append(i)
            
            
    
    
    return full_dialogue
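
An alternative to stripping characters from the whole `Dialogue` line is to pull out only the text field. The sketch below is not what `ass_munging` does; it assumes the common ASS event layout of ten comma-separated fields (Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text), and the names `dialogue_text`, `OVERRIDE_TAG`, and `LINE_BREAK` are hypothetical. Files with a different `Format:` line would need a different field count.

```python
import re

OVERRIDE_TAG = re.compile(r'\{.*?\}')   # non-greedy: removes one {...} block at a time
LINE_BREAK = re.compile(r'\\[Nnh]')     # ASS escapes for line breaks and hard spaces

def dialogue_text(line, n_fields=10):
    """Return the Text field of an ASS 'Dialogue:' line, or None for other lines."""
    if not line.startswith('Dialogue:'):
        return None
    # Split on the first n_fields - 1 commas only, so commas inside the text survive
    fields = line[len('Dialogue:'):].split(',', n_fields - 1)
    if len(fields) < n_fields:
        return None
    text = OVERRIDE_TAG.sub('', fields[-1])
    return LINE_BREAK.sub('', text).strip()

sample = r'Dialogue: 0,0:00:01.00,0:00:04.00,Default,,0000,0000,0000,,{\an8}よし, 行こう\N早く'
print(dialogue_text(sample))
```

With this approach, digits and punctuation inside the spoken line are left alone, since only the header fields and override tags are discarded.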
    
    

In [4]:
'''When analyzing the .ass files, I noticed that a few of them used a different character encoding and thus could 
not be processed like the rest of the data. To solve that, we have 2 options:

1. See if we can simply remove the encoded lines from the files directly (there aren't that many)
2. Write some more code that decodes the files, performs the cleaning, then re-encodes the file

We will start with option 1. Ichiban Ushiro no Daimaou, Fullmetal Alchemist: Brotherhood, and a handful of Gintama 
episodes were the culprits. If we were dealing with thousands of encoded files, we would jump right ahead to option 2, 
but since the remaining files total less than 75, it may be easier just to make the individual changes.

I have organized the files into a new sub-directory labeled "Encoded ASS". We will essentially copy the ass_munging 
function above, change the file path, delete a couple lines from the original file (none with Japanese text) and see 
if that works. If so, this section is all but done. If not, we simply move on to option 2.

-----------------------------------------------------------------------------------------------------------------------

UPDATE: The files in question were encoded in UTF-16 (apparently pretty common for Asian-language text). We will move 
to option 2, but the fix is incredibly simple. We read the file as bytes, check for the UTF-16 byte-order mark (BOM), 
and decode accordingly. This will allow us to perform the necessary cleaning like before.


'''

def encoded_munging(text_file):
    
    encoded_filepath = '/Users/nickburkhalter/Desktop/Data Science/Projects/Top Shounen Anime Vocab List/Subtitle Files By Type/Encoded ASS/'
    filename = encoded_filepath + text_file
    
    # Open the file in binary mode and read the raw bytes
    with open(filename, 'rb') as newfile:
        text = newfile.read()
    
    bom = codecs.BOM_UTF16_LE             # See dir(codecs) for the other BOM constants
    if text.startswith(bom):              # UTF-16 (little-endian) file: strip the BOM, then decode
        encoded_text = text[len(bom):]
        decoded_text = encoded_text.decode('utf-16le')
    else:                                 # No UTF-16 LE BOM found: fall back to UTF-8
        decoded_text = text.decode('utf-8')
    
    lines = decoded_text.split()
    
    
    # Regular & string expressions to be used throughout to help clean the data    
    formatted = re.compile(r'[A-Za-z]')
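    # NOTE: the [...] makes this a character class, so it strips every digit, comma, colon, period, and asterisk
    # rather than matching a whole timestamp; together with `formatted` it clears the Dialogue header fields
    timestamp = re.compile(r'[\d,\d:\d\d:\d\d.\d\d,\d:\d\d:\d\d.\d\d,.*\,,0000,0000,0000,,]')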
    digits = re.compile(r'[0-9]')
    nichibun = '日文'         # Can't read Chinese, so the variable "translations" are probably horribly incorrect
    taikatsu = '对话'
    useless_chinese1 = '制作人员==诸神字幕组==听译：朝颜'
    useless_chinese2 = '警示标语本字幕由诸神字幕组制作仅供学习交流之用若您喜欢本作品请支持正版影像制品'
    useless_chinese3 = '制作，仅供交流学习，禁止用于任何商业用途'
    useless_chinese4 = '日听兰樱&'
    useless_chinese5 = '本字幕由诸神字幕组出品'
    useless_chinese6 = '翻译：'
    useless_chinese7 = '諸神字幕組本字幕由诸神字幕组制作仅供学习交流之用若您喜欢本作品请支持正版影像制品'
    useless_chinese8 = '諸神字幕組{制作仅供学习交流之用若您喜欢本作品请支持正版影像制品'
    useless_chinese9 = '諸神字幕組'
    squiggly = '～'
    dash = re.compile(r'[-]+')
    brackets = re.compile(r'\{.*\}')
    white_noise_apos = re.compile(r'(`)+')
    white_noise_dot = re.compile(r'(·)+')
    white_noise_dot2 = re.compile(r'(・)+')
    pesky_arrow = '→'
    open_pointer = '≪'
    closed_pointer = '≫'
    large_japanese_open_pointer = '《'
    large_japanese_closed_pointer = '》'
    another_open_pointer = '＜'
    equals_sign = re.compile(r'[=]+')
    
    
                           
    # Remove that pesky unicode
    no_unicode = []
    for i in lines:
        if not i == '\ufeff1' and not i == '\ufeff':
            no_unicode.append(i)
            
    no_sixteen = []
    for i in no_unicode:
        i = re.sub('\ue4c6', '', i)
        no_sixteen.append(i)
    
    
    # Strip the timestamp and formatting characters (one pass is enough; a second pass changes nothing)
    no_timestamps = []
    for i in no_sixteen:
        i = re.sub(timestamp, '', i)
        no_timestamps.append(i)
                              
            
    # Remove any English-language stragglers
    no_english = []
    for i in no_timestamps:
        i = re.sub(formatted,'', i)
        no_english.append(i)
        
    
    # Remove chinese characters from dialogue
    no_nichibun = []
    for i in no_english:
        i = re.sub(nichibun, '', i)
        no_nichibun.append(i)
        
    no_taikatsu = []
    for i in no_nichibun:
        i = re.sub(taikatsu, '', i)
        no_taikatsu.append(i)
                           
    
    # Remove backslashes from text
    no_backslashes = []
    for i in no_taikatsu:
        i = re.sub(r'\\', '', i)
        no_backslashes.append(i)
        
        
    # Loop through the last list to remove the parenthetical phrases
    no_parentheses = []
    for line in no_backslashes:
        line = re.sub('（.*）','', line)        # Uses *JAPANESE* punctuation marks!!!    
        no_parentheses.append(line)
        
    
    # Get rid of those pesky American parentheses
    no_american = []  
    for line in no_parentheses:        
        line = re.sub(r'\(.*\)', '', line)      # Raw string so that \( and \) reach the regex engine unchanged
        no_american.append(line)
        
        
    # Remove dashes (hyphens) from text
    no_hyphens = []
    for i in no_american:
        i = re.sub(dash, '', i)
        no_hyphens.append(i)
        
        
    # Remove equals signs
    no_equals = []
    for i in no_hyphens:
        i = re.sub(equals_sign, '', i)
        no_equals.append(i)
        
        
    # Remove squiggly (～)
    no_squiggly = []
    for i in no_equals:
        i = re.sub(squiggly, '', i)
        no_squiggly.append(i)
        
        
    # Remove brackets and their content from string element
    no_brackets = []
    for i in no_squiggly:
        i = re.sub(brackets, '', i)
        no_brackets.append(i)
        
    
    # Further remove any straggler brackets and quotation marks from our data (open brackets)
    no_open_brackets = []
    for i in no_brackets:
        if not i == '{' and not i == '“':
            no_open_brackets.append(i)
            
            
    # Remove any random lines of Japanese punctuation, pt.1
    noise_apos = []
    for i in no_open_brackets:
        i = re.sub(white_noise_apos, '', i)
        noise_apos.append(i)
        
        
    # Remove any random lines of Japanese punctuation, pt. 2
    noise_dot = []
    for i in noise_apos:
        i = re.sub(white_noise_dot, '', i)
        noise_dot.append(i)
        
    # Chinese and Japanese dots are different...
    noise_dot2 = []
    for j in noise_dot:
        j = re.sub(white_noise_dot2, '', j)
        noise_dot2.append(j)
        
        
    # Remove arrow symbols
    no_arrows = []
    for i in noise_dot2:
        i = re.sub(pesky_arrow, '', i)
        no_arrows.append(i)
        
        
    # Remove pointers (≪ & ≫)
    no_open_pointers = []
    for i in no_arrows:
        i = re.sub(open_pointer, '', i)
        no_open_pointers.append(i)
        
    no_closed_pointers = []
    for i in no_open_pointers:
        i = re.sub(closed_pointer, '', i)
        no_closed_pointers.append(i)
        
    # Just like dots, Chinese and Japanese pointers are different...
    no_large_open_pointers = []
    for i in no_closed_pointers:
        i = re.sub(large_japanese_open_pointer, '', i)
        no_large_open_pointers.append(i)
        
    no_large_closed_pointers = []
    for i in no_large_open_pointers:
        i = re.sub(large_japanese_closed_pointer, '', i)
        no_large_closed_pointers.append(i)
        
    # But wait, there's one more pointer!
    plz_no_more_pointers = []
    for i in no_large_closed_pointers:
        i = re.sub(another_open_pointer, '', i)
        plz_no_more_pointers.append(i)
        
        
    # Last bit of cleanup concerning random punctuation and Chinese
    language_only = [i for i in plz_no_more_pointers if not '『' in i]
    no_Chinese = [i for i in language_only if not '制作人员' in i and not '标题' in i]
    
    
    # Now let's get rid of that random Chinese text still scattered about our files
    boilerplate = {useless_chinese1, useless_chinese2, useless_chinese3, useless_chinese4, useless_chinese5,
                   useless_chinese6, useless_chinese7, useless_chinese8, useless_chinese9, '－－', '//'}
    clean_text = [i for i in no_Chinese if i not in boilerplate]
        
    no_chuubun = []
    for i in clean_text:
        if not i.startswith('中文'):
            no_chuubun.append(i)
    
    
    # Exclude empty elements from our list
    # This piece of code should ALWAYS be last
    full_dialogue = []
    for i in no_chuubun:
        if not i == '':
            full_dialogue.append(i)
            
            
    
    
    return full_dialogue
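
The BOM check in `encoded_munging` only looks for little-endian UTF-16. As a hedged sketch of how the same idea extends to other byte-order marks, the hypothetical helper `decode_subtitle_bytes` below also handles a UTF-8 BOM and big-endian UTF-16, using constants that exist in the standard `codecs` module. It assumes that any file without a BOM is UTF-8, which may not hold for every subtitle file.

```python
import codecs

def decode_subtitle_bytes(raw):
    """Decode raw bytes using a BOM when one is present, otherwise assume UTF-8."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode('utf-8')
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[len(codecs.BOM_UTF16_LE):].decode('utf-16-le')
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[len(codecs.BOM_UTF16_BE):].decode('utf-16-be')
    return raw.decode('utf-8')          # assumption: no BOM means UTF-8

# Build a little-endian UTF-16 byte string with a BOM and decode it again
raw = codecs.BOM_UTF16_LE + 'こんにちは'.encode('utf-16-le')
print(decode_subtitle_bytes(raw))
```

Stripping the BOM before decoding means the `'\ufeff'` character never reaches the cleaning steps.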

In [6]:
'''Great! Now that the files have been sufficiently munged, we can write all the .srt and .ass files to their own 
master files (corpora), then combine them to create one massive anime corpus!'''

srt_corpus = '/Users/nickburkhalter/Desktop/Data Science/Projects/Top Shounen Anime Vocab List/Corpora/SRT_Corpus.txt'
ass_corpus = '/Users/nickburkhalter/Desktop/Data Science/Projects/Top Shounen Anime Vocab List/Corpora/ASS_Corpus.txt'
encoded_corpus = '/Users/nickburkhalter/Desktop/Data Science/Projects/Top Shounen Anime Vocab List/Corpora/ENCODED_Corpus.txt'


# SRT preparation - list of all SRT-based files
srt_consolidated = '/Users/nickburkhalter/Desktop/Data Science/Projects/Top Shounen Anime Vocab List/consolidated_srt.txt'
srt_list = open(srt_consolidated)
consolidated_list = srt_list.readlines()
srt_list.close()

srt_anime = []
for i in consolidated_list:
    srt_anime.append(i.strip())
    
    
# ASS preparation - list of all ASS_based files
ass_consolidated = '/Users/nickburkhalter/Desktop/Data Science/Projects/Top Shounen Anime Vocab List/consolidated_ass.txt'
ass_list = open(ass_consolidated)
consolidated_ass = ass_list.readlines()
ass_list.close()

ass_anime = []
for i in consolidated_ass:
    ass_anime.append(i.strip())
    
    
# ENCODED preparation - list of all ENCODED files
encoded_consolidated = '/Users/nickburkhalter/Desktop/Data Science/Projects/Top Shounen Anime Vocab List/consolidated_encoded.txt'
encoded_list = open(encoded_consolidated)
consolidated_encoded = encoded_list.readlines()
encoded_list.close()

encoded_anime = []
for i in consolidated_encoded:
    encoded_anime.append(i.strip())


    
# Open each corpus file as UTF-8 and append the cleaned token list of every transcript as one line
with open(srt_corpus, 'a', encoding='utf-8') as srt:
    for i in srt_anime:
        srt.write(str(srt_munging(i)) + '\n')
        
with open(ass_corpus, 'a', encoding='utf-8') as ass:
    for i in ass_anime:
        ass.write(str(ass_munging(i)) + '\n')
        
with open(encoded_corpus, 'a', encoding='utf-8') as enc:
    for i in encoded_anime:
        enc.write(str(encoded_munging(i)) + '\n')
        
        
# To save headaches, since we were dealing with both utf-8 and utf-16 file types for the encoded_munging function,
# we simply save the written-to file as utf-8

# Lastly, we use the command line to merge all 3 files to create our ultimate anime corpus!
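
The merge can also be done without leaving Python. This is a minimal sketch rather than the command that was actually run: `merge_corpora` is a hypothetical helper, the file names in the demonstration are placeholders, and it assumes every corpus file is UTF-8.

```python
import tempfile
from pathlib import Path

def merge_corpora(sources, destination):
    """Concatenate UTF-8 text files into one file, one source after another."""
    with open(destination, 'w', encoding='utf-8') as out:
        for source in sources:
            text = Path(source).read_text(encoding='utf-8')
            out.write(text)
            if text and not text.endswith('\n'):
                out.write('\n')   # keep the last line of one file from fusing with the next

# Small self-contained demonstration using temporary files
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / 'a.txt').write_text('あ\n', encoding='utf-8')
    (tmp / 'b.txt').write_text('い', encoding='utf-8')
    merge_corpora([tmp / 'a.txt', tmp / 'b.txt'], tmp / 'merged.txt')
    merged = (tmp / 'merged.txt').read_text(encoding='utf-8')

print(repr(merged))
```

For the real corpora, the call would take the `srt_corpus`, `ass_corpus`, and `encoded_corpus` paths defined above plus a destination path of your choosing.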