# Resources
* https://www.analyticsvidhya.com/blog/2016/07/practical-guide-data-preprocessing-python-scikit-learn/
* https://link.medium.com/sKzSyiJOu8
* https://analyticsindiamag.com/data-pre-processing-in-python/
* https://www.kdnuggets.com/2020/07/easy-guide-data-preprocessing-python.html
* https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908
* https://python-markdown.github.io
* https://docs.python.org/3/library/re.html

## Our goal is to remove the HTML tags and any irrelevant symbols from our scraped data

### Steps
 1. Import the necessary libraries
 2. Display the data scraped using Scrapy
 3. Display all of the posts from the CSV file
 4. Convert the post text from Markdown to HTML, then from HTML to raw text
 5. Remove punctuation marks and whitespace characters from the posts
 6. Update the data scraped using Scrapy to contain the formatted posts
 7. Display all of the updated/formatted posts
 8. Display a comparison of both versions of all of the posts (original vs. formatted)
 9. Save the updated dataset into a CSV file

In [7]:
# Step 1 Import the necessary libraries (some of the libraries are provided below)
import pandas as pd
from string import punctuation, whitespace
from bs4 import BeautifulSoup
import numpy as np
from markdown import markdown
import re

In [8]:
# Step 2 Display the data scraped using Scrapy (Hint: use pandas)
data_main = pd.read_csv("CarTalkCommunityMain.csv")
data_posts = pd.read_csv("CarTalkCommunityPost.csv")
data_main.head()

Unnamed: 0,topic_id,title,created_at,last_posted_at,views,like_count,category_id,total_replies,total_posts,topic_slug,tags
0,167995,2019 Subaru Outback fast idle,2020-07-28T13:04:53.155Z,2020-07-28T20:45:47.786Z,98,4,6,1,3,2019-subaru-outback-fast-idle,"subaru,outback"
1,167772,"Attention Mazda owners! Input, please",2020-07-23T21:46:08.833Z,2020-07-28T17:00:41.532Z,437,6,12,12,23,attention-mazda-owners-input-please,mazda
2,167972,2011 ram 1500 4x4 tire issues,2020-07-28T00:11:44.190Z,2020-07-28T19:49:19.107Z,235,17,6,9,18,2011-ram-1500-4x4-tire-issues,"ram,1500"
3,168003,2019 Nissan Armada BCI malfunction,2020-07-28T16:49:31.550Z,2020-07-28T19:46:08.073Z,103,4,6,2,4,2019-nissan-armada-bci-malfunction,"nissan,armada"
4,167504,Gastation wants exact change due to coin shortage,2020-07-18T14:24:29.213Z,2020-07-28T19:01:26.191Z,1650,47,8,58,102,gastation-wants-exact-change-due-to-coin-shortage,


In [9]:
data_posts.head()

Unnamed: 0,post_id,username,created_at,cooked,post_num,updated_at,reply_count,reply_to_post_num,reads,topic_id,user_id,topic_slug
0,1174109,Td401_169304,2020-07-28T13:35:25.800Z,<p>I have a 2012 Chevy Cruze 1.4 cylinder. Whi...,1,2020-07-28T13:35:25.800Z,1,,31,167996,176954,car-overheating-chevy-cruze-2012
1,1174112,VDCdriver,2020-07-28T13:51:10.165Z,"<aside class=""quote no-group"" data-username=""T...",2,2020-07-28T13:51:41.768Z,1,,30,167996,195,car-overheating-chevy-cruze-2012
2,1174119,Tester,2020-07-28T13:59:17.419Z,<p>Does the coolant look like this?</p>\n<p><i...,3,2020-07-28T13:59:17.419Z,1,,30,167996,96,car-overheating-chevy-cruze-2012
3,1174120,Td401_169304,2020-07-28T14:00:01.389Z,<p>Yes. But black is an exaggeration. It’s mor...,4,2020-07-28T14:00:01.389Z,2,2.0,30,167996,176954,car-overheating-chevy-cruze-2012
4,1174122,Td401_169304,2020-07-28T14:02:09.000Z,<p>It wasn’t that black. The guy at the shop t...,5,2020-07-28T14:02:23.957Z,1,3.0,30,167996,176954,car-overheating-chevy-cruze-2012


In [10]:
# Step 3 Display all of the posts from the CSV file
data_main[['title']].head()

Unnamed: 0,title
0,2019 Subaru Outback fast idle
1,"Attention Mazda owners! Input, please"
2,2011 ram 1500 4x4 tire issues
3,2019 Nissan Armada BCI malfunction
4,Gastation wants exact change due to coin shortage


In [11]:
data_posts[['cooked']].head()

Unnamed: 0,cooked
0,<p>I have a 2012 Chevy Cruze 1.4 cylinder. Whi...
1,"<aside class=""quote no-group"" data-username=""T..."
2,<p>Does the coolant look like this?</p>\n<p><i...
3,<p>Yes. But black is an exaggeration. It’s mor...
4,<p>It wasn’t that black. The guy at the shop t...


In [14]:
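# Step 4 Convert the post text from Markdown to HTML, then from HTML to raw text
# (Hint: Take a look at the 'markdown' and 're' libraries)
def markdown_to_raw_text(markdown_text):
    # Missing posts are read from the CSV as NaN; return None for them.
    if pd.isna(markdown_text):
        return None
    # Render any Markdown to HTML, then keep only the text nodes.
    html = markdown(markdown_text)
    # Name the parser explicitly so the result does not depend on which parsers are installed.
    return "".join(BeautifulSoup(html, "html.parser").find_all(string=True))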

In [15]:
raw_text = data_posts['cooked'].apply(markdown_to_raw_text)
raw_text.to_frame().head()

Unnamed: 0,cooked
0,I have a 2012 Chevy Cruze 1.4 cylinder. While ...
1,\n\n\n Td401_169304:\n\nI checked coolant and ...
2,"Does the coolant look like this?\n\nIf so, air..."
3,Yes. But black is an exaggeration. It’s more o...
4,It wasn’t that black. The guy at the shop told...
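
The `cooked` column already holds rendered HTML, so the essential part of Step 4 is dropping the tags and keeping the text. As a rough illustration of that idea, here is a dependency-free sketch built on the standard library's `html.parser` instead of BeautifulSoup. `TextExtractor` and `html_to_text` are hypothetical names introduced for this example; they are not part of the notebook.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &#39; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def html_to_text(html):
    # Hypothetical helper: feed the fragment and join the collected text.
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = "<p>Does the coolant look like this?</p>\n<p>If so, air it out.</p>"
print(repr(html_to_text(sample)))
```

The newline between the two paragraphs survives, which is why Step 5 still has whitespace to clean up.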


In [16]:
# Step 5 Remove punctuation marks and whitespace characters from the posts
# (Hint: What is a whitespace character? Are there multiple types of whitespace characters?)
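def remove_non_alphanum_chars(raw_text):
    # Skip missing posts (None or NaN).
    if pd.isna(raw_text):
        return None
    # Replace each run of non-alphanumeric characters (punctuation, whitespace, symbols) with one space.
    return re.sub(r'[^A-Za-z0-9]+', ' ', raw_text)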

In [17]:
cleaned_text = raw_text.apply(remove_non_alphanum_chars)
cleaned_text.to_frame().head()

Unnamed: 0,cooked
0,I have a 2012 Chevy Cruze 1 4 cylinder While d...
1,Td401 169304 I checked coolant and it was ful...
2,Does the coolant look like this If so air prob...
3,Yes But black is an exaggeration It s more of ...
4,It wasn t that black The guy at the shop told ...
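
The hint in Step 5 asks about the different kinds of whitespace. The `string` module lists them, and the sketch below compares two ways of removing them along with punctuation: a `str.translate` pass followed by `split`/`join`, and the regular expression used in this notebook. `strip_punct_and_whitespace` is a hypothetical helper written for this comparison only.

```python
import re
from string import punctuation, whitespace

# Six ASCII whitespace characters: space, tab, newline, carriage return,
# vertical tab and form feed.
print(repr(whitespace))


def strip_punct_and_whitespace(text):
    # Hypothetical helper: delete ASCII punctuation, then collapse every
    # run of whitespace into a single space.
    no_punct = text.translate(str.maketrans("", "", punctuation))
    return " ".join(no_punct.split())


sample = "Yes.\n\tBut black is an exaggeration!  It's more of a brown."

translated = strip_punct_and_whitespace(sample)
regexed = re.sub(r"[^A-Za-z0-9]+", " ", sample)

print(repr(translated))
print(repr(regexed))
```

The two approaches differ on contractions: deleting punctuation turns `It's` into `Its`, while the regular expression splits it into `It s` and can leave a trailing space. Note also that `string.punctuation` covers ASCII only, so the curly apostrophes that appear in the posts would pass through the `translate` version unchanged.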


In [18]:
# Step 6 Update the data scraped using Scrapy to contain the formatted posts
def process_post_text(post_df):
    temp = post_df.copy()
    temp['cooked'] = temp['cooked'].apply(markdown_to_raw_text).apply(remove_non_alphanum_chars)
    return temp

In [19]:
data_posts_processed = process_post_text(data_posts)
data_posts_processed.head()

Unnamed: 0,post_id,username,created_at,cooked,post_num,updated_at,reply_count,reply_to_post_num,reads,topic_id,user_id,topic_slug
0,1174109,Td401_169304,2020-07-28T13:35:25.800Z,I have a 2012 Chevy Cruze 1 4 cylinder While d...,1,2020-07-28T13:35:25.800Z,1,,31,167996,176954,car-overheating-chevy-cruze-2012
1,1174112,VDCdriver,2020-07-28T13:51:10.165Z,Td401 169304 I checked coolant and it was ful...,2,2020-07-28T13:51:41.768Z,1,,30,167996,195,car-overheating-chevy-cruze-2012
2,1174119,Tester,2020-07-28T13:59:17.419Z,Does the coolant look like this If so air prob...,3,2020-07-28T13:59:17.419Z,1,,30,167996,96,car-overheating-chevy-cruze-2012
3,1174120,Td401_169304,2020-07-28T14:00:01.389Z,Yes But black is an exaggeration It s more of ...,4,2020-07-28T14:00:01.389Z,2,2.0,30,167996,176954,car-overheating-chevy-cruze-2012
4,1174122,Td401_169304,2020-07-28T14:02:09.000Z,It wasn t that black The guy at the shop told ...,5,2020-07-28T14:02:23.957Z,1,3.0,30,167996,176954,car-overheating-chevy-cruze-2012
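
Empty posts reach the pipeline as NaN floats rather than strings, so each cleaning step has to skip them. Checking for a missing value by identity only works when the value is one particular NaN object; checking by value is more robust. The following is a minimal standard-library sketch of that difference. `is_missing` is a hypothetical helper, not something the notebook defines.

```python
import math


def is_missing(value):
    # Hypothetical helper: treat None and any NaN float as missing.
    return value is None or (isinstance(value, float) and math.isnan(value))


nan_a = float("nan")
nan_b = float("nan")

# Two NaN values are distinct objects and do not even compare equal,
# so neither `is` nor `==` reliably detects a missing value.
print(nan_a is nan_b)
print(nan_a == nan_b)

# A value-based test handles every case.
print(is_missing(nan_a), is_missing(None), is_missing("some post text"))
```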


In [20]:
data_posts.head()

Unnamed: 0,post_id,username,created_at,cooked,post_num,updated_at,reply_count,reply_to_post_num,reads,topic_id,user_id,topic_slug
0,1174109,Td401_169304,2020-07-28T13:35:25.800Z,<p>I have a 2012 Chevy Cruze 1.4 cylinder. Whi...,1,2020-07-28T13:35:25.800Z,1,,31,167996,176954,car-overheating-chevy-cruze-2012
1,1174112,VDCdriver,2020-07-28T13:51:10.165Z,"<aside class=""quote no-group"" data-username=""T...",2,2020-07-28T13:51:41.768Z,1,,30,167996,195,car-overheating-chevy-cruze-2012
2,1174119,Tester,2020-07-28T13:59:17.419Z,<p>Does the coolant look like this?</p>\n<p><i...,3,2020-07-28T13:59:17.419Z,1,,30,167996,96,car-overheating-chevy-cruze-2012
3,1174120,Td401_169304,2020-07-28T14:00:01.389Z,<p>Yes. But black is an exaggeration. It’s mor...,4,2020-07-28T14:00:01.389Z,2,2.0,30,167996,176954,car-overheating-chevy-cruze-2012
4,1174122,Td401_169304,2020-07-28T14:02:09.000Z,<p>It wasn’t that black. The guy at the shop t...,5,2020-07-28T14:02:23.957Z,1,3.0,30,167996,176954,car-overheating-chevy-cruze-2012


In [21]:
# Step 7 Display all of the updated/formatted posts
data_posts_processed.head()

Unnamed: 0,post_id,username,created_at,cooked,post_num,updated_at,reply_count,reply_to_post_num,reads,topic_id,user_id,topic_slug
0,1174109,Td401_169304,2020-07-28T13:35:25.800Z,I have a 2012 Chevy Cruze 1 4 cylinder While d...,1,2020-07-28T13:35:25.800Z,1,,31,167996,176954,car-overheating-chevy-cruze-2012
1,1174112,VDCdriver,2020-07-28T13:51:10.165Z,Td401 169304 I checked coolant and it was ful...,2,2020-07-28T13:51:41.768Z,1,,30,167996,195,car-overheating-chevy-cruze-2012
2,1174119,Tester,2020-07-28T13:59:17.419Z,Does the coolant look like this If so air prob...,3,2020-07-28T13:59:17.419Z,1,,30,167996,96,car-overheating-chevy-cruze-2012
3,1174120,Td401_169304,2020-07-28T14:00:01.389Z,Yes But black is an exaggeration It s more of ...,4,2020-07-28T14:00:01.389Z,2,2.0,30,167996,176954,car-overheating-chevy-cruze-2012
4,1174122,Td401_169304,2020-07-28T14:02:09.000Z,It wasn t that black The guy at the shop told ...,5,2020-07-28T14:02:23.957Z,1,3.0,30,167996,176954,car-overheating-chevy-cruze-2012


In [22]:
# Step 8 Display a comparison of both versions of all of the posts (original vs. formatted)
# (Hint: use pandas)
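# Put the original and the formatted text side by side, aligned on the row index.
comp = pd.DataFrame({
    'original': data_posts['cooked'],
    'formatted': data_posts_processed['cooked'],
})
comp.head()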

In [23]:
# Step 9 Save the updated dataset into a CSV file (Hint: use pandas)
# index=False keeps the row index from being written as an extra unnamed column.
data_posts_processed.to_csv("CarTalkCommunityPostProcessed.csv", index=False)
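
Whether to write the row index is worth deciding explicitly when saving. By default `to_csv` writes the index as an unnamed first column, and `read_csv` then loads it back as `Unnamed: 0`. The sketch below shows the round trip on a small made-up frame, using an in-memory buffer in place of a file; the two sample rows are illustrative, not taken from the dataset.

```python
import io

import pandas as pd

# Made-up sample rows for illustration only.
df = pd.DataFrame({"post_id": [1, 2], "cooked": ["first post", "second post"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

An alternative, if the index column is already in the file, is to pass `index_col=0` to `read_csv` so it is restored as the index instead of appearing as a data column.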