# Project

In this project, you will bring together many of the tools and techniques that you have learned throughout this course. You can choose from many different paths to reach a solution.

### Business scenario

You work for a training organization that recently developed an introductory course about machine learning (ML). The course includes more than 40 videos that cover a broad range of ML topics. You have been asked to create an application that students can use to quickly locate and view video content by searching for topics and key phrases.

You have downloaded all of the videos to an Amazon Simple Storage Service (Amazon S3) bucket. Your assignment is to produce a dashboard that meets your supervisor’s requirements.

## Project steps

To complete this project, you will follow these steps:

1. [Transcribing the videos](#2.-Transcribing-the-videos)
2. [Normalizing the text](#3.-Normalizing-the-text)
3. [Extracting key phrases and topics](#4.-Extracting-key-phrases-and-topics)
4. [Creating the dashboard](#5.-Creating-the-dashboard)

In [2]:
# Required libraries (uncomment to install)
# !pip install SpeechRecognition==3.1.3
# !pip install keybert
# !pip install gensim
# !pip install nltk
# !pip install pyLDAvis
# !pip install Wave
# !pip install moviepy

## 2. Transcribing the videos
 ([Go to top](#Capstone-8:-Bringing-It-All-Together))

Use this section to implement your solution to transcribe the videos. 
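
The transcription cell in this section sends the audio to the recognizer in 60-second chunks. As a minimal sketch of that chunking arithmetic (the helper name `chunk_windows` is hypothetical and not part of the notebook), the offsets and lengths for a recording could be computed like this:

```python
import math

def chunk_windows(duration_seconds, chunk_seconds=60):
    """Return (offset, length) pairs, in seconds, that cover the whole recording."""
    n_chunks = math.ceil(duration_seconds / chunk_seconds)
    windows = []
    for i in range(n_chunks):
        offset = i * chunk_seconds
        # The last chunk may be shorter than chunk_seconds.
        length = min(chunk_seconds, duration_seconds - offset)
        windows.append((offset, length))
    return windows

print(chunk_windows(150))  # [(0, 60), (60, 60), (120, 30)]
```

A 150-second clip therefore needs three recognizer calls, the last one covering only 30 seconds.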

In [32]:
import speech_recognition as sr
import os
import subprocess
import wave, math, contextlib
from moviepy.editor import AudioFileClip


In [33]:

def get_transcript_file(source_file, destination_file):
    """Extract the audio track of a video and transcribe it in 60-second chunks."""
    # Write the audio track to a WAV file next to the transcript.
    audioclip = AudioFileClip(source_file)
    audioclip.write_audiofile(f"{destination_file}.wav")

    # Read the clip length (in seconds) from the WAV header.
    with contextlib.closing(wave.open(f"{destination_file}.wav", "r")) as f:
        frames = f.getnframes()
        rate = f.getframerate()
        duration = frames / float(rate)

    total_duration = math.ceil(duration / 60)  # number of 60-second chunks
    r = sr.Recognizer()

    # Transcribe one chunk at a time and append the text to the transcript file.
    with open(f"{destination_file}.txt", "a") as transcript:
        for i in range(total_duration):
            with sr.WavFile(f"{destination_file}.wav") as source:
                audio = r.record(source, offset=i * 60, duration=60)
            transcript.write(r.recognize_google(audio, language="en-CA"))
            transcript.write(" ")


In [34]:
# Transcribe every video in the source folder
rootdir = './data'
destination_folder = './transcriptions'
os.makedirs(destination_folder, exist_ok=True)
for fname in os.listdir(rootdir):
    print(fname)
    sourcename = os.path.join(rootdir, fname)
    dest_name = os.path.join(destination_folder, os.path.splitext(fname)[0])
    try:
        get_transcript_file(sourcename, dest_name)
    except Exception as err:
        # Report the failure and continue with the next video.
        print(f"Skipping {fname}: {err}")
    


Mod03_Sect03_part1.mp4
MoviePy - Writing audio in ./transcriptions/Mod03_Sect03_part1.wav
MoviePy - Done.
Mod03_Sect04_part2.mp4
MoviePy - Writing audio in ./transcriptions/Mod03_Sect04_part2.wav
MoviePy - Done.
Mod03_Sect05.mp4
MoviePy - Writing audio in ./transcriptions/Mod03_Sect05.wav
MoviePy - Done.
Mod04_Sect01.mp4
MoviePy - Writing audio in ./transcriptions/Mod04_Sect01.wav
MoviePy - Done.
Mod03_Sect01.mp4
MoviePy - Writing audio in ./transcriptions/Mod03_Sect01.wav
MoviePy - Done.
Mod07_Sect01.mp4
MoviePy - Writing audio in ./transcriptions/Mod07_Sect01.wav
MoviePy - Done.
Mod05_WrapUp_ver2.mp4
MoviePy - Writing audio in ./transcriptions/Mod05_WrapUp_ver2.wav
MoviePy - Done.
Mod05_Intro.mp4
MoviePy - Writing audio in ./transcriptions/Mod05_Intro.wav
MoviePy - Done.
Mod03_Sect03_part3.mp4
MoviePy - Writing audio in ./transcriptions/Mod03_Sect03_part3.wav
MoviePy - Done.
Mod06_Intro.mp4
MoviePy - Writing audio in ./transcriptions/Mod06_Intro.wav
MoviePy - Done.
Mod02_Sect04.mp4
MoviePy - Writing audio in ./transcriptions/Mod02_Sect04.wav
MoviePy - Done.
Mod06_Sect01.mp4
MoviePy - Writing audio in ./transcriptions/Mod06_Sect01.wav
MoviePy - Done.
Mod03_Sect06.mp4
MoviePy - Writing audio in ./transcriptions/Mod03_Sect06.wav
MoviePy - Done.
Mod06_Sect02.mp4
MoviePy - Writing audio in ./transcriptions/Mod06_Sect02.wav
MoviePy - Done.
Mod03_Sect04_part1.mp4
MoviePy - Writing audio in ./transcriptions/Mod03_Sect04_part1.wav
MoviePy - Done.
Mod03_Sect07_part1.mp4
MoviePy - Writing audio in ./transcriptions/Mod03_Sect07_part1.wav
MoviePy - Done.
Mod01_Course Overview.mp4
MoviePy - Writing audio in ./transcriptions/Mod01_Course Overview.wav
MoviePy - Done.
Mod03_Sect02_part3.mp4
MoviePy - Writing audio in ./transcriptions/Mod03_Sect02_part3.wav
MoviePy - Done.
Mod03_Intro.mp4
MoviePy - Writing audio in ./transcriptions/Mod03_Intro.wav
MoviePy - Done.
Mod06_WrapUp.mp4
MoviePy - Writing audio in ./transcriptions/Mod06_WrapUp.wav
MoviePy - Done.
Mod02_Sect02.mp4
MoviePy - Writing audio in ./transcriptions/Mod02_Sect02.wav
MoviePy - Done.
Mod02_Sect01.mp4
MoviePy - Writing audio in ./transcriptions/Mod02_Sect01.wav
MoviePy - Done.
Mod03_WrapUp.mp4
MoviePy - Writing audio in ./transcriptions/Mod03_WrapUp.wav
MoviePy - Done.
Mod02_Sect05.mp4
MoviePy - Writing audio in ./transcriptions/Mod02_Sect05.wav
MoviePy - Done.
Mod02_Sect03.mp4
MoviePy - Writing audio in ./transcriptions/Mod02_Sect03.wav
MoviePy - Done.
Mod04_WrapUp.mp4
MoviePy - Writing audio in ./transcriptions/Mod04_WrapUp.wav
MoviePy - Done.
Mod05_Sect03_part3.mp4
MoviePy - Writing audio in ./transcriptions/Mod05_Sect03_part3.wav
MoviePy - Done.
Mod02_WrapUp.mp4
MoviePy - Writing audio in ./transcriptions/Mod02_WrapUp.wav
MoviePy - Done.
Mod03_Sect02_part2.mp4
MoviePy - Writing audio in ./transcriptions/Mod03_Sect02_part2.wav
MoviePy - Done.


## 3. Normalizing the text
([Go to top](#Capstone-8:-Bringing-It-All-Together))

Use this section to perform any text normalization steps that are necessary for your solution.
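
To make the normalization steps concrete, here is a minimal standard-library sketch of the same idea: lowercase the text, strip punctuation, and drop stopwords and short words. It is only an illustration. The `STOPWORDS` set is a tiny made-up sample (the notebook uses NLTK's English list), the `normalize` helper is hypothetical, and no lemmatization is performed.

```python
import re

# Tiny illustrative stopword sample (an assumption); the notebook uses NLTK's list.
STOPWORDS = {"this", "that", "with", "have", "from", "will", "about"}

def normalize(text, min_len=4):
    """Lowercase, remove punctuation, and drop stopwords and short words."""
    text = text.lower()
    text = re.sub(r"[^\w ]+", "", text)  # same pattern as the notebook's clean_text
    return [w for w in text.split() if len(w) >= min_len and w not in STOPWORDS]

print(normalize("Welcome back! This module covers the machine-learning pipeline."))
# ['welcome', 'back', 'module', 'covers', 'machinelearning', 'pipeline']
```

Note that deleting punctuation outright fuses hyphenated words ("machine-learning" becomes "machinelearning"). Replacing punctuation with a space instead would keep the two words separate.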

In [35]:

import re
import nltk
import gensim 
from gensim.models.ldamulticore import LdaMulticore
from gensim import corpora, models
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

In [36]:
# Write your answer/code here
def clean_text(lst):
    """Lowercase, strip punctuation, drop stopwords and short words, and lemmatize."""
    cleaned_text = []
    stopword = set(stopwords.words("english"))  # build the set once, not once per word

    for text in lst:
        text = str(text).lower()
        text = re.sub(r'[^\w ]+', "", text)
        # Keep words longer than 3 characters that are not stopwords; lemmatize as verbs.
        text = " ".join([lemmatizer.lemmatize(word, pos='v') for word in word_tokenize(text)
                         if word not in stopword and len(word) > 3])
        if text != '':
            cleaned_text.append(text)

    norm_text = ' '.join(cleaned_text)
    return norm_text

In [40]:
destination_folder = './transcriptions'

norm_text = ''
for fname in os.listdir(destination_folder):
    if fname.endswith(".txt"):
        path = os.path.join(destination_folder, fname)
        with open(path, "r") as f:
            text = f.read()
        text = text.split()
        text = clean_text(text)
        norm_text = " ".join([norm_text, text])
print(norm_text)

 welcome back academy machine learn module go work entire machine learn pipeline use amazon sage maker module discuss typical process handle machine learn problem machine learn pipeline apply many machine learn problems focus supervise learn process learn module adapt type machine learn well large module well cover material module youll able formulate problem business request obtain secure data machine learn build jupiter notebook use amazon sage maker outline process evaluate data explain need preprocessed open source tool examine preprocess data amazon sagemaker train host machine learn model cross validation test performance machine learn model host model inference finally create amazon sagemaker hyperparameter tune optimize model effectiveness ready start next video welcome back academy machine learn module go work entire machine learn pipeline use amazon sage maker module discuss typical process handle machine learn problem machine learn pipeline apply many machine learn problems 

## 4. Extracting key phrases and topics
([Go to top](#Capstone-8:-Bringing-It-All-Together))

Use this section to extract the key phrases and topics from the videos.
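
Before reaching for an embedding model, a plain n-gram frequency count can serve as a quick sanity check on what phrases dominate the normalized text. The sketch below is a simple baseline, not the KeyBERT method used in this section. The `top_ngrams` helper and the sample token list are hypothetical.

```python
from collections import Counter

def top_ngrams(tokens, n=2, k=3):
    """Return the k most frequent n-grams in a token list as (phrase, count) pairs."""
    grams = zip(*(tokens[i:] for i in range(n)))  # sliding window of width n
    counts = Counter(" ".join(g) for g in grams)
    return counts.most_common(k)

# Made-up sample in the style of the normalized transcript text.
tokens = "machine learn pipeline use machine learn model train machine learn model".split()
print(top_ngrams(tokens, n=2, k=2))  # [('machine learn', 3), ('learn model', 2)]
```

If the most frequent n-grams are generic (as "machine learn" is for this course), that is a hint that per-video extraction, rather than extraction over the concatenated text, would give more useful search terms.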

In [41]:
# Write your answer/code here
from keybert import KeyBERT

def get_keyphrases(full_text):
    kw_model = KeyBERT(model='all-mpnet-base-v2')
    keywords = kw_model.extract_keywords(full_text,
                                         keyphrase_ngram_range=(1, 3),
                                         stop_words='english',
                                         highlight=False,
                                         top_n=50)
    keywords_list = list(dict(keywords).keys())
    return keywords_list



In [42]:
keyphrase = get_keyphrases(norm_text)
print(keyphrase)

['machine learn business', 'youre develop model', 'create systems learn', 'create machine learn', 'solution implement business', 'introduce machine learn', 'machine learn project', 'develop model', 'machine learn development', 'guidance formulate business', 'machine learn need', 'requirement machine', 'develop machine learn', 'module learn business', 'analysis data engineer', 'develop model predict', 'develop solution', 'machine learn engineer', 'solution meet businesss', 'business requirement machine', 'module youll introduction', 'machine learn build', 'engineer machine learn', 'need assess model', 'create sophisticate assistance', 'implement machine learn', 'learn machine learn', 'machine learn model', 'machine learn want', 'learn provide various', 'handle machine learn', 'use machine learn', 'problem data preparation', 'machine learn developer', 'business problem machine', 'data scientists machine', 'business problems data', 'problem predict', 'data opportunities involve', 'learn m

In [43]:
def make_biagram(data,tokens):
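    # Learn frequent word pairs from `data`; a higher threshold yields fewer phrases.
    bigram = gensim.models.Phrases(data, min_count=20, threshold=100)
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    # Merge the detected pairs into single tokens in each document.
    return [bigram_mod[doc] for doc in tokens]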

In [44]:
def topic_modeling(data,num_of_topics):

    data = data.split(' ')
    tokens = []
    for text in data:
        text = word_tokenize(text)
        tokens.append(text)
        
    # Make bigrams
    tokens = make_biagram(data=data,tokens=tokens)

    # Corpora Dictionary
    dictionary = corpora.Dictionary(tokens)

    doc_term_matrix = [dictionary.doc2bow(doc) for doc in tokens]

    lda_model =  gensim.models.LdaModel(doc_term_matrix,  
                                       num_topics = num_of_topics,     
                                       id2word = dictionary,                                    
                                       passes = 2,
                                       chunksize=10,
                                       update_every=1,
                                       alpha='auto',
                                       per_word_topics=True,
                                       random_state=42
                                       )


    for idx, topic in lda_model.print_topics(-1):
        print("Topic: {} \nWords: {}".format(idx, topic ))
        print("\n")
    
    return lda_model,doc_term_matrix,dictionary


In [45]:

num_topics = 5
lda_model,corpus,id2word = topic_modeling(norm_text,num_topics)
print("whole topic list:",lda_model)


Topic: 0 
Words: 0.148*"learn" + 0.102*"look" + 0.028*"column" + 0.026*"frame" + 0.024*"pandas" + 0.021*"load" + 0.017*"file" + 0.015*"domain" + 0.011*"format" + 0.010*"spreadsheet"


Topic: 1 
Words: 0.041*"type" + 0.036*"pandas" + 0.031*"column" + 0.030*"use" + 0.028*"analysis" + 0.026*"part" + 0.024*"describe" + 0.023*"many" + 0.022*"frame" + 0.020*"label"


Topic: 2 
Words: 0.081*"well" + 0.046*"know" + 0.045*"also" + 0.034*"type" + 0.033*"youre" + 0.024*"pandas" + 0.022*"column" + 0.020*"machine" + 0.020*"make" + 0.019*"review"


Topic: 3 
Words: 0.083*"need" + 0.070*"section" + 0.032*"pandas" + 0.031*"column" + 0.027*"thats" + 0.026*"type" + 0.024*"frame" + 0.021*"columns" + 0.020*"load" + 0.016*"correct"


Topic: 4 
Words: 0.402*"data" + 0.044*"example" + 0.025*"type" + 0.020*"case" + 0.017*"pandas" + 0.016*"youll" + 0.015*"column" + 0.013*"identify" + 0.012*"frame" + 0.011*"load"


whole topic list: LdaModel<num_terms=1707, num_topics=5, decay=0.5, chunksize=10>
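
The topic strings printed above can be parsed into (word, weight) pairs for further inspection. The following is a minimal sketch using only the standard library. The helper names `parse_topic` and `shared_words` are hypothetical, and the two strings are copied from the output for Topic 0 and Topic 4.

```python
import re

def parse_topic(topic_string):
    """Turn '0.402*"data" + 0.044*"example"' into [('data', 0.402), ('example', 0.044)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

def shared_words(topic_strings):
    """Return the words that appear in every one of the given topics."""
    word_sets = [{word for word, _ in parse_topic(t)} for t in topic_strings]
    return set.intersection(*word_sets)

topic_0 = ('0.148*"learn" + 0.102*"look" + 0.028*"column" + 0.026*"frame" + 0.024*"pandas" + '
           '0.021*"load" + 0.017*"file" + 0.015*"domain" + 0.011*"format" + 0.010*"spreadsheet"')
topic_4 = ('0.402*"data" + 0.044*"example" + 0.025*"type" + 0.020*"case" + 0.017*"pandas" + '
           '0.016*"youll" + 0.015*"column" + 0.013*"identify" + 0.012*"frame" + 0.011*"load"')

print(parse_topic(topic_4)[:2])                  # [('data', 0.402), ('example', 0.044)]
print(sorted(shared_words([topic_0, topic_4])))  # ['column', 'frame', 'load', 'pandas']
```

Words such as "pandas", "column", and "frame" recur across the topics, which suggests the five topics overlap heavily and are not yet well separated.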


## 5. Creating the dashboard
([Go to top](#Capstone-8:-Bringing-It-All-Together))

Use this section to create the dashboard for your solution.
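
The business scenario asks for a way to locate videos by searching for topics and key phrases. One possible building block for such a dashboard is an inverted index from words to video files. The sketch below is an assumption-laden illustration: the helpers `build_index` and `search` are hypothetical, and the mapping of phrases to files is made up for the example (the notebook extracts key phrases from the combined text, not per video).

```python
from collections import defaultdict

def build_index(phrases_by_video):
    """Map each word of each key phrase to the set of videos it came from."""
    index = defaultdict(set)
    for video, phrases in phrases_by_video.items():
        for phrase in phrases:
            for word in phrase.lower().split():
                index[word].add(video)
    return index

def search(index, query):
    """Return the videos whose key phrases contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

# Hypothetical per-video key phrases, for illustration only.
phrases_by_video = {
    "Mod02_Sect01.mp4": ["machine learn model", "business problem machine"],
    "Mod03_Sect01.mp4": ["problem data preparation"],
}
index = build_index(phrases_by_video)
print(search(index, "machine model"))  # ['Mod02_Sect01.mp4']
print(search(index, "problem"))        # ['Mod02_Sect01.mp4', 'Mod03_Sect01.mp4']
```

Because the key phrases are lemmatized ("learn" rather than "learning"), a real search box would need to normalize the query the same way before the lookup.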

In [46]:
import pyLDAvis.gensim  # newer pyLDAvis releases expose this module as pyLDAvis.gensim_models
import pickle
import pyLDAvis

# Visualize the topics
pyLDAvis.enable_notebook()
LDAvis_data_filepath = './ldavis_prepared_' + str(num_topics)

# Prepare the visualization data and cache it on disk
LDAvis_prepared = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
with open(LDAvis_data_filepath, 'wb') as f:
    pickle.dump(LDAvis_prepared, f)

# Reload the cached data and export a standalone HTML page
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)
pyLDAvis.save_html(LDAvis_prepared, './ldavis_prepared_' + str(num_topics) + '.html')
LDAvis_prepared

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av