# Profanity-findings in NLP

### Since I have a huge interest in comedy, I decided to harvest some data in that area. Using different tools (presented below and further on in this notebook collection), I will share some insights into how results can be viewed in various ways using NLP (Natural Language Processing).
# This is part 1 out of 2:
## Set-up<br>Harvest data<br>Data cleaning

## - - - - - - Step 1: Set-up - - - - - -
### Before gathering, cleaning or otherwise processing data, it is vital to have a good set-up.<br> Decide on a purpose and on what you are striving to achieve. In this case, I wonder how much three comedians swear.
## Goal: Who swears the most between George Carlin, Ricky Gervais and Jim Jefferies?
### For this, we will:<br>- Measure the size of the content for each comedian<br>- See the top 10 most common words for each comedian<br>- See the top 10 least common words for each comedian<br>- See which comedian used the most profanity<br>- Count how many times each swear word was said by each comedian<br><br> Transcripts from comedians: Ricky Gervais, Jim Jefferies and George Carlin
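As a rough sketch of where these goals lead: counting words comes down to splitting a text into tokens and tallying them. The sample text, the word list `swear_words` and the function name `count_swear_words` are made up for illustration and are not part of the harvested data.

```python
from collections import Counter

# Hypothetical mini-transcript and word list, for illustration only
sample_text = "damn this is a damn good show and a hell of a night"
swear_words = {"damn", "hell"}

def count_swear_words(text, vocabulary):
    """Return a Counter with the number of times each listed word occurs."""
    tokens = text.lower().split()
    return Counter(token for token in tokens if token in vocabulary)

word_counts = Counter(sample_text.split())
swear_counts = count_swear_words(sample_text, swear_words)

print(word_counts.most_common(3))   # the three most frequent words
print(swear_counts)                 # occurrences per swear word
print(sum(swear_counts.values()))   # total amount of profanity
```

A `Counter` is used here because it already offers `most_common()`, which covers the "top 10" style questions without extra code.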

### Notes:
#### I have noticed that there are (basically) two ways of choosing when to import libraries: either at the very top of each notebook or along the way.<br> I will import the libraries at the point where I need them, since I think that makes it clearer when they are used.

### Libraries needed for this notebook: BeautifulSoup (with the lxml parser), pickle, pandas, requests, re

## - - - - - - Step 2: Harvest data - - - - - -

### We will harvest transcripts from a website called 'scrapsfromtheloft.com'.<br> We will then find a suitable way to present the data.

In [None]:
# import libraries for harvesting data from URLs

import requests
from bs4 import BeautifulSoup

# Fetches a URL and returns the text of every <p> tag on the page
def get_url_convert_to_transcript(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early if the page could not be fetched
    soup = BeautifulSoup(response.text, "lxml")
    text = [p.text for p in soup.find_all('p')]
    print(url)
    print(text)
    return text

urls = ['https://scrapsfromtheloft.com/2020/07/08/jim-jefferies-intolerant-transcript/',
        'https://scrapsfromtheloft.com/2019/09/12/george-carlin-dumb-americans-transcript/',
        'https://scrapsfromtheloft.com/2018/03/15/ricky-gervais-humanity-transcript/']
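To see what "collect the text of every p-tag" means without touching the network, here is a minimal offline sketch. It swaps BeautifulSoup for the standard library's `html.parser`, so it is not the method used in this notebook. The class `ParagraphCollector`, the function `extract_paragraphs` and the `sample_html` string are made up for illustration.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collects the text found inside each <p>...</p> pair."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._inside_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._inside_p:
            self.paragraphs.append("".join(self._buffer))
            self._inside_p = False

    def handle_data(self, data):
        # Text of nested tags (such as <b>) is kept as well
        if self._inside_p:
            self._buffer.append(data)

def extract_paragraphs(html):
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs

# Hypothetical page content, for illustration only
sample_html = "<html><body><h1>Title</h1><p>First <b>joke</b>.</p><p>[laughter]</p></body></html>"
paragraphs = extract_paragraphs(sample_html)
print(paragraphs)
```

The heading text is skipped and only the two paragraphs are returned, which is the same kind of list the scraping function produces for a real page.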

In [None]:
# Comedians' last names, useful as keys later

comedians_last_name = ['jefferies', 'carlin', 'gervais']

In [None]:
# Call function and download data
transcripts = [get_url_convert_to_transcript(u) for u in urls]

In [None]:
# Import pickle to save and load data with ease
import pickle as pkl

# Creates a new folder and stores the transcripts there
!mkdir transcripts

# This loop saves each comedian's transcript to a file named after the comedian.
# Note: the files are pickled (binary) even though they use the .txt extension.
for every, comedian in enumerate(comedians_last_name):
    with open('transcripts/' + comedian + ".txt", "wb") as file:
        pkl.dump(transcripts[every], file)

In [None]:
# Load data as dict

data = {}

for every, comedian in enumerate(comedians_last_name):
    with open("transcripts/" + comedian + ".txt", "rb") as file:
        data[comedian] = pkl.load(file)
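The save and load cells above form a round trip: an object is written to disk as bytes and later read back unchanged. A minimal sketch of that idea follows, using a temporary folder so nothing is left behind. The `sample_transcript` list and the file name `sample.pkl` are made up for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for one scraped transcript (a list of paragraphs)
sample_transcript = ["First paragraph.", "[laughter]", "Second paragraph."]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.pkl")

    # "wb" = write binary: pickle stores bytes, not plain text
    with open(path, "wb") as file:
        pickle.dump(sample_transcript, file)

    # "rb" = read binary: load the object back into memory
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == sample_transcript)
```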

In [None]:
# Show the keys; if everything went well, three last names appear

data.keys()

In [None]:
# Print the first 5 objects (paragraphs) from 'jefferies' to make
# sure that the data has been loaded and that the key is correct.

data['jefferies'][:5]

In [None]:
# Why not look at Carlin as well?

data['carlin'][:3]

In [None]:
# And Gervais just for the fun of it? :-)

data['gervais'][:1]

In [None]:
# By using the function below, we can
# ... print out the first (next) key

next(iter(data.keys()))

### At this point, the data has the comedian-name as key but uses<br> the list-format for the corpus. Let's convert that into string-format <br>for easier use in the future.

In [None]:
def reshape_list_to_string(text_list):
    converted_format = ' '.join(text_list)
    return converted_format

In [None]:
# Calls the function above and sets keys and values, keys
# ... being comedian names. This ends up in a dict-format.

data_string = {key: [reshape_list_to_string(value)] for (key, value) in data.items()}
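To make the reshaping step concrete, here is a small sketch on toy data. The dict `toy_data` and its contents are made up for illustration; the real `data` dict holds the scraped transcripts.

```python
# Hypothetical miniature version of the `data` dict
toy_data = {
    "carlin": ["First line.", "Second line."],
    "gervais": ["Only line."],
}

# Join each list into one string and wrap it in a one-element list,
# so that every key maps to exactly one value
toy_string = {name: [" ".join(paragraphs)] for name, paragraphs in toy_data.items()}

print(toy_string)
```

Wrapping each string in a one-element list is what lets a data frame be built from the dict later, since every key then holds a list of the same length.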

### Using data as Python's own dict-format is fine. However, I personally<br> prefer using a "Data frame" from the Pandas-library

In [None]:
import pandas as pd

# Pandas default in width is quite narrow, so I like to
# expand that as much as possible
pd.set_option('max_colwidth', 999999999)

# The transpose function swaps the axes for an easier view,
# ... putting the comedian names on the rows and the transcript in a column
df = pd.DataFrame.from_dict(data_string).transpose()
df.columns = ['transcript']

# Sorts objects by axis, alphabetically
df = df.sort_index()
df

In [None]:
# The .loc feature allows us to retrieve rows by calling
# a key (index label) from the data frame
df.transcript.loc['gervais']

In [None]:
# Take a look at the data type, it should say:
#    transcript   object
#    dtype:       object

df.dtypes

# Pandas has traditionally stored text columns as dtype: object.

In [None]:
# df.shape presents the size of each axis as (rows, columns):
# ... first vertical (|) and then horizontal (-)

df.shape

### The data looks okay and it is easy to select the key(s) we want. It's time to enter the next step.<br>
## - - - - - - Step 3: Cleaning - - - - - -
### Cleaning data, in a nutshell, means getting rid of unnecessary information and removing symbols, characters and/or numbers that we don't want to (or can't) use. It also gives us the opportunity to divide the data into chunks if we want to.
#### By using regex (regular expressions, available in Python through the re module), a powerful tool for going through large amounts of text and applying specific actions, a lot of cleaning becomes quite simple. <br>First off, we will do some things that are good practice no matter what text data is being pre-processed:<br> * Make all text lower case.<br> * Remove symbols as well as numbers that are not useful.<br>Other than that, I explain within the code what's happening.
### Note: re.sub parameter explained --> (replace what, with this, here)
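A minimal sketch of that parameter order, `re.sub(pattern, replacement, text)`, on a made-up sample string:

```python
import re

# Hypothetical sample string, for illustration only
sample = "Hello [applause] world 123"

# re.sub(replace what, with this, here)
without_brackets = re.sub(r"\[.*?\]", "", sample)   # remove "[applause]"
without_digits = re.sub(r"\d+", "#", sample)         # replace runs of digits with "#"

print(without_brackets)
print(without_digits)
```

Note that removing the bracketed word leaves two spaces behind, since the spaces on either side of it are untouched.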

### When looking through the texts, I notice that some words within brackets ( [ ] ) are not words said by the actual comedian but rather transcribed noises such as 'laughter' and 'applause'. These are words I don't want to use in the process later on, so I am going to remove them.<br> I also notice the escape sequence \n, which indicates a new line. Let's remove that too.

In [None]:
# We are going to use re (RegEx) to clean this data

import re

def cleaning_session(corpus):

    # Converting all characters to lower case
    corpus = corpus.lower()

    # Removes anything within brackets (reason: No valid text-data there)
    corpus = re.sub(r'\[.*?\]', '', corpus)

    # Replaces new-line characters (\n) with a space
    corpus = re.sub('\n', ' ', corpus)

    # Replace anything that is NOT a letter with a space, but keep apostrophes (' and ’)
    corpus = re.sub(r"[^a-z'’]", " ", corpus)

    return corpus

# Note: Running this cell only configures the function, we'll run it below.
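To check what the cleaning steps do before running them on full transcripts, here is a small sketch on a made-up sentence. A copy of the function is repeated so the sketch runs on its own; `sample` is a hypothetical string, not taken from the transcripts.

```python
import re

def cleaning_session(corpus):
    corpus = corpus.lower()                     # lower case
    corpus = re.sub(r"\[.*?\]", "", corpus)     # drop bracketed stage notes
    corpus = re.sub("\n", " ", corpus)          # new lines become spaces
    corpus = re.sub(r"[^a-z’]", " ", corpus)    # keep letters and ’ only
    return corpus

# Hypothetical sample, for illustration only
sample = "Hello World!\n[laughter] It’s 2020, isn’t it?"
cleaned = cleaning_session(sample)

# Splitting on whitespace hides the extra spaces left by removed symbols
print(cleaned.split())
```

The removed symbols and digits leave runs of spaces behind, which is harmless once the text is split into words.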

In [None]:
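# Let's keep the original data in its variable and work on a copy.
# ... Good practice when experimenting.
# Note: plain assignment (df_copy = df) would only create a second name for the same data frame.

df_copy = df.copy()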

In [None]:
# Lambda is a shorter way of writing a small function
# Here we simply store the function in a variable

cleaning = lambda x: cleaning_session(x)

# Above function is doing the exact same as the following function:
# def cleaning(x):
    # return cleaning_session(x)

In [None]:
# Activates cleaning session

clean_data = pd.DataFrame(df_copy.transcript.apply(cleaning))
clean_data

### Knowing what to replace/remove can be a bit tricky, so I saved some other useful regex snippets with descriptions here for you to browse.
#### Last updated 2020 with python 3.8.3<br>

<b>Remove quotation marks and ellipses (triple punctuation)</b><br>corpus = re.sub('[‘’“”…]', '', corpus)<br><br><b>Remove dash symbols (–)</b><br>corpus = re.sub('[–]', '', corpus)<br><br><b>Replace new-line characters (\n) with a space</b><br>corpus = re.sub('\n', ' ', corpus)<br><br><b>Remove all punctuation (requires import string)</b><br>corpus = re.sub('[%s]' % re.escape(string.punctuation), '', corpus)<br><br><b>Remove words that contain digits<br>\w = letters, digits and underscore; \d = digits (0-9).<br>Asterisk * = 0 or more times.<br>... If a word contains at least one digit, remove that word.
</b><br>corpus = re.sub(r'\w*\d\w*', '', corpus)<br><br><b>Remove anything within curly brackets</b><br>corpus = re.sub(r'\{.*?\}', '', corpus)<br><br><b>Remove white-space</b> (here, words is a list of strings)<br>without_spaces = [re.sub(r'\s+', '', item) for item in words]
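The snippets above can be tried out on small strings before they are applied to a whole corpus. All sample strings below are made up for illustration.

```python
import re
import string

# Quotation marks and ellipses
no_quotes = re.sub("[‘’“”…]", "", "“Well…” he said")

# All ASCII punctuation (string.punctuation)
no_punct = re.sub("[%s]" % re.escape(string.punctuation), "", "it's 100% done!")

# Any word that contains at least one digit
no_digit_words = re.sub(r"\w*\d\w*", "", "2nd place in 2020 abc123")

# Anything within curly brackets
no_curly = re.sub(r"\{.*?\}", "", "a {note} b")

# White-space inside each item of a list of strings
words = [" a b ", "c\t"]
without_spaces = [re.sub(r"\s+", "", item) for item in words]

for result in (no_quotes, no_punct, no_digit_words, no_curly, without_spaces):
    print(repr(result))
```

Printing with `repr()` makes leftover spaces visible, which helps when deciding whether a follow-up step for collapsing white-space is needed.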


## This whole process might look messy to the human eye, but we need to remember that all of this is about preparing the data for a computer to process.

### Let's save the file! The saved file will end up in the same folder as the notebook.

In [None]:
clean_data.to_pickle("clean_corpus.pkl")

print("Pickling complete!")

# --------------- Next ----------------- 

### This is the end of part 1/2. To sum up, we have:<br>- Set goals<br>- Harvested HTML data<br>- Cleaned text data<br><br>In the following notebook, we will remove so-called stop-words, create a DTM (document-term matrix) and plot words.<br>This is commonly called EDA - Exploratory Data Analysis.

## Thank you for visiting my NLP notebook, I hope you liked it!