# Welcome to part 2/3 of my series on processing text in NLP

## If you haven't viewed the previous part, I highly recommend it. That makes this part easier to follow.

### In this part, we will:<br>- Tokenize words<br>- Remove stop words<br>- Create a DTM<br>- Print the 10 most said words for each comedian<br>- Print the 10 least said words for each comedian<br>- Measure the amount of profanity

#### Libraries needed for this notebook: pandas, sklearn, matplotlib and wordcloud

In [None]:
# Import pickled data using pandas

import pandas as pd
# Show full cell contents instead of truncating long transcripts
pd.set_option('display.max_colwidth', None)

transcripts = pd.read_pickle("clean_corpus.pkl")

transcripts

# DTM - Document Term Matrix

### A DTM has one row per comedian and one column per word. CountVectorizer tokenizes the text (splits it into<br>individual words) and counts how many times each word has been said.<br><br>By passing ---> stop_words='english' <--- to CountVectorizer, we remove<br> common words that carry little meaning, like "are", "we", "is", "the" etc.<br><br>This section covers: tokenization, stop words, DTM.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
cv_data = cv.fit_transform(transcripts.transcript)

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
dtm_data = pd.DataFrame(cv_data.toarray(), columns=cv.get_feature_names_out())
dtm_data.index = transcripts.index
dtm_data

### As seen from this data, the stop-word list removes common words but not necessarily strange words or 'non-words' such as "aaagh". These can be handled manually by supplying your own stop words, either instead of or on top of the built-in list. For this case, it is not really vital.
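
As a rough sketch of what "on top of the built-in list" could look like: the extra words and the two sample sentences below are hypothetical, chosen only to illustrate the idea. `ENGLISH_STOP_WORDS` is the frozenset that scikit-learn uses for stop_words='english'.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical noise words; pick your own after inspecting the DTM
extra_stop_words = ["aaagh", "uh"]

# CountVectorizer expects a list, so convert the combined set
custom_stop_words = list(ENGLISH_STOP_WORDS.union(extra_stop_words))

custom_cv = CountVectorizer(stop_words=custom_stop_words)
custom_cv.fit(["aaagh the crowd laughed", "uh the joke landed"])

# The learned vocabulary no longer contains the extra words
print(sorted(custom_cv.vocabulary_))
```

The same `custom_stop_words` list could then be passed when building the vectorizer for the transcripts.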

## Let's flip the DTM axes for an easier view

In [None]:
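# Plain assignment only creates a second name for the same DataFrame; .copy() makes a real copy
copy_dtm_data = dtm_data.copy()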

In [None]:
copy_dtm_data = copy_dtm_data.transpose()
copy_dtm_data

### Here we will print out the 10 most common words said by each comedian

In [None]:
most_common_words = {}

for comedian in copy_dtm_data.columns:
    top = copy_dtm_data[comedian].sort_values(ascending=False).head(10)
    most_common_words[comedian] = list(zip(top.index, top.values))

most_common_words

### Here we will print out the 10 LEAST common words said by each comedian

In [None]:
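least_common_words = {}
for comedian in copy_dtm_data.columns:
    counts = copy_dtm_data[comedian]
    # Skip words this comedian never said; otherwise the list is all zero counts
    bottom = counts[counts > 0].sort_values(ascending=True).head(10)
    least_common_words[comedian] = list(zip(bottom.index, bottom.values))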

least_common_words

## Let's make a word cloud, representing the most common words
### By plotting "word clouds", it's quite easy to see which words are used the most. This kind of visualization can be very helpful in NLP.

In [None]:
from wordcloud import WordCloud

word_cloud = WordCloud(background_color="white", colormap="Dark2",
               max_font_size=250, random_state=42)

In [None]:
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [24, 8]

comedian_names = ['Carlin', 'Gervais', 'Jefferies']

# Create subplots for each comedian
for index, comedian in enumerate(most_common_words):
    word_cloud.generate(transcripts.transcript[comedian])
    
    plt.subplot(3, 4, index+1)
    plt.imshow(word_cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(comedian_names[index])
    
plt.show()

## Let's measure the amount of profanity

### By simply looking at the most common words, we can see that at least 'fucking' and 'shit' are mentioned.<br>If we index dtm_data with a swear word in brackets, we can see the number of times each comedian has said that<br>particular word. Here is an example with the word 'fucking'

In [None]:
dtm_data['fucking']

### To see multiple words, we can use the pandas function 'concat'. Here we add the counts of two similar words, 'fuck' and 'fucking',<br> together, while also measuring the number of times the word 'shit' has been said.

In [None]:
data_profanity = pd.concat([dtm_data.fucking + dtm_data.fuck, dtm_data.shit], axis=1)
data_profanity.columns = ['f_word', 's_word']
data_profanity
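
Attribute access such as dtm_data.fuck raises an error if a word never appears in the corpus. One possible way to sum a whole group of word forms while tolerating missing ones is sketched below. The helper `count_word_group`, the miniature `toy_dtm` and its numbers are all hypothetical and only illustrate the pattern.

```python
import pandas as pd

# Hypothetical miniature DTM; the real dtm_data has one row per comedian
toy_dtm = pd.DataFrame(
    {"fuck": [3, 0], "fucking": [5, 2], "shit": [1, 4], "joke": [7, 9]},
    index=["comedian_a", "comedian_b"],
)

def count_word_group(dtm, words):
    """Sum the counts of `words` per row, treating missing columns as 0."""
    return dtm.reindex(columns=words, fill_value=0).sum(axis=1)

# "fucked" is not a column in toy_dtm, so it simply contributes 0
toy_profanity = pd.DataFrame({
    "f_word": count_word_group(toy_dtm, ["fuck", "fucking", "fucked"]),
    "s_word": count_word_group(toy_dtm, ["shit"]),
})
print(toy_profanity)
```

Using `reindex` with `fill_value=0` keeps the result the same shape no matter which word forms happen to be present.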

# Results:
### The main question I had was "Who swears the most?"<br>Just by printing the counts for the swear word 'fucking', we can see that Jim Jefferies used that word far more than George Carlin and more than twice as often as Ricky Gervais. So in summary, Jim Jefferies is the comedian who swears the most, at least in these selected transcripts, and it does seem to reflect each comedian's style. Another 'happy' discovery is that the 'f-word' is said at least twice as often as the 's-word' by all comedians.<br>

## This concludes this NLP introduction. We have:<br>- Made a setup<br>- Harvested data from URLs<br>- Cleaned data<br>- Tokenized words and removed stop words<br>- Created a DTM<br>- Displayed the most and least said words from each comedian<br>- Measured the number of times a specific word has been said by each comedian

## Further implementations and tips:
### Instead of counting the 'same word' in different forms separately, a technique called "lemmatization" can reduce each word to its root word. As an example: call, called and calling all come from the root word 'call'.<br><br>Data cleaning can always be revisited, and it's fine to apply some cleaning techniques afterwards.
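
To show what lemmatization would do to the DTM, here is a small sketch. The dictionary `toy_lemmas` is a hand-written, hypothetical stand-in for a real lemmatizer, and `toy_dtm` with its counts is made up; the point is only how columns for different forms of a word collapse into one.

```python
import pandas as pd

# Hypothetical lookup table standing in for a real lemmatizer
toy_lemmas = {"called": "call", "calling": "call"}

# Hypothetical miniature DTM with three forms of the same word
toy_dtm = pd.DataFrame(
    {"call": [1, 0], "called": [2, 1], "calling": [0, 3], "joke": [4, 5]},
    index=["comedian_a", "comedian_b"],
)

# Group the word columns by their root word and add the counts together
lemma_dtm = toy_dtm.T.groupby(lambda word: toy_lemmas.get(word, word)).sum().T
print(lemma_dtm)
```

Words without an entry in the lookup table, like "joke", are kept as they are.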

# This is the end of part 2. Thank you for reading and following this introduction to NLP.<br>I hope it's of good use.