# Resources
* https://www.analyticsvidhya.com/blog/2016/07/practical-guide-data-preprocessing-python-scikit-learn/
* https://link.medium.com/sKzSyiJOu8
* https://analyticsindiamag.com/data-pre-processing-in-python/
* https://www.kdnuggets.com/2020/07/easy-guide-data-preprocessing-python.html
* https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908
* https://python-markdown.github.io
* https://docs.python.org/3/library/re.html

## Our goal is to remove the HTML tags and any irrelevant symbols from our scraped data

### Steps
 1. Import the necessary libraries
 2. Display the data scraped using Scrapy
 3. Display all of the posts from the CSV file
 4. Convert the post text from Markdown to HTML, then strip the HTML to raw text
 5. Remove punctuation marks and whitespace characters from the posts
 6. Update the data scraped using Scrapy to contain the formatted posts
 7. Display all of the updated/formatted posts
 8. Display a comparison of both versions of all of the posts (original vs. formatted)
 9. Save the updated dataset into a CSV file

In [81]:
# Step 1 Import the necessary libraries (some of the libraries are provided below)
import pandas as pd
from string import punctuation, whitespace
from bs4 import BeautifulSoup
import numpy as np
import markdown
import re

In [82]:
# Step 2 Display the data scraped using Scrapy (Hint: use pandas)
data_main = pd.read_csv("CarTalkCommunityMain.csv")
data_posts = pd.read_csv("CarTalkCommunityPost.csv")
data_main.head()

Unnamed: 0,topic_id,title,created_at,last_posted_at,views,like_count,category_id,total_replies,total_posts,topic_slug,tags
0,167995,2019 Subaru Outback fast idle,2020-07-28T13:04:53.155Z,2020-07-28T20:45:47.786Z,98,4,6,1,3,2019-subaru-outback-fast-idle,"subaru,outback"
1,167772,"Attention Mazda owners! Input, please",2020-07-23T21:46:08.833Z,2020-07-28T17:00:41.532Z,437,6,12,12,23,attention-mazda-owners-input-please,mazda
2,167972,2011 ram 1500 4x4 tire issues,2020-07-28T00:11:44.190Z,2020-07-28T19:49:19.107Z,235,17,6,9,18,2011-ram-1500-4x4-tire-issues,"ram,1500"
3,168003,2019 Nissan Armada BCI malfunction,2020-07-28T16:49:31.550Z,2020-07-28T19:46:08.073Z,103,4,6,2,4,2019-nissan-armada-bci-malfunction,"nissan,armada"
4,167504,Gastation wants exact change due to coin shortage,2020-07-18T14:24:29.213Z,2020-07-28T19:01:26.191Z,1650,47,8,58,102,gastation-wants-exact-change-due-to-coin-shortage,


In [83]:
data_posts.head()

Unnamed: 0,post_id,username,created_at,cooked,post_num,updated_at,reply_count,reply_to_post_num,reads,topic_id,user_id,topic_slug
0,1174109,Td401_169304,2020-07-28T13:35:25.800Z,<p>I have a 2012 Chevy Cruze 1.4 cylinder. While driving it started to overheat and I got an alarm on dash saying to turn a/c off due to overheat. The temp gauge shot up and pulled over to let the car cool down. I hooked up a code reader when I got home and got P00B7-engine coolant flow low/performance. I checked coolant and it was full but was back. I took it get a coolant flush. They said it was pretty bad. It caused there machine to clog. Not sure if they even was able to complete it beca...,1,2020-07-28T13:35:25.800Z,1,,31,167996,176954,car-overheating-chevy-cruze-2012
1,1174112,VDCdriver,2020-07-28T13:51:10.165Z,"<aside class=""quote no-group"" data-username=""Td401_169304"" data-post=""1"" data-topic=""167996"">\n<div class=""title"">\n<div class=""quote-controls""></div>\n<img alt="""" width=""20"" height=""20"" src=""https://avatars.discourse.org/v4/letter/t/3da27b/40.png"" class=""avatar""> Td401_169304:</div>\n<blockquote>\n<p>I checked coolant and it was full but was back</p>\n</blockquote>\n</aside>\n<p>Was “back” supposed to read as “black”?<br>\nIf so, then I would surmise that the engine’s head gasket has been b...",2,2020-07-28T13:51:41.768Z,1,,30,167996,195,car-overheating-chevy-cruze-2012
2,1174119,Tester,2020-07-28T13:59:17.419Z,"<p>Does the coolant look like this?</p>\n<p><img src=""https://aws1.discourse-cdn.com/cartalk/original/3X/c/b/cb478da696b0494cbd0a38f234a7fd055cccfc4d.jpeg"" alt=""image"" data-base62-sha1=""t0iebsQbOHc9YBwikw52DchbzAh"" width=""640"" height=""480""></p>\n<p>If so, air probably got into the cooling system and the Deathcool turned acidic eating up the cooling system.</p>\n<p>Tester</p>",3,2020-07-28T13:59:17.419Z,1,,30,167996,96,car-overheating-chevy-cruze-2012
3,1174120,Td401_169304,2020-07-28T14:00:01.389Z,"<p>Yes. But black is an exaggeration. It’s more of a brownish color. So, if it’s discolored it’s definitely the head gasket? Can it be anything else?</p>",4,2020-07-28T14:00:01.389Z,2,2.0,30,167996,176954,car-overheating-chevy-cruze-2012
4,1174122,Td401_169304,2020-07-28T14:02:09.000Z,<p>It wasn’t that black. The guy at the shop told me he think the thermostat was clogged from the “sludge”. But that was just the guy at jiffy lube so I take that with a grain of salt.</p>,5,2020-07-28T14:02:23.957Z,1,3.0,30,167996,176954,car-overheating-chevy-cruze-2012


In [84]:
# Step 3 Display all of the posts from the CSV file
data_main[['title']].head()

Unnamed: 0,title
0,2019 Subaru Outback fast idle
1,"Attention Mazda owners! Input, please"
2,2011 ram 1500 4x4 tire issues
3,2019 Nissan Armada BCI malfunction
4,Gastation wants exact change due to coin shortage


In [85]:
data_posts[['cooked']].head()

Unnamed: 0,cooked
0,<p>I have a 2012 Chevy Cruze 1.4 cylinder. While driving it started to overheat and I got an alarm on dash saying to turn a/c off due to overheat. The temp gauge shot up and pulled over to let the car cool down. I hooked up a code reader when I got home and got P00B7-engine coolant flow low/performance. I checked coolant and it was full but was back. I took it get a coolant flush. They said it was pretty bad. It caused there machine to clog. Not sure if they even was able to complete it beca...
1,"<aside class=""quote no-group"" data-username=""Td401_169304"" data-post=""1"" data-topic=""167996"">\n<div class=""title"">\n<div class=""quote-controls""></div>\n<img alt="""" width=""20"" height=""20"" src=""https://avatars.discourse.org/v4/letter/t/3da27b/40.png"" class=""avatar""> Td401_169304:</div>\n<blockquote>\n<p>I checked coolant and it was full but was back</p>\n</blockquote>\n</aside>\n<p>Was “back” supposed to read as “black”?<br>\nIf so, then I would surmise that the engine’s head gasket has been b..."
2,"<p>Does the coolant look like this?</p>\n<p><img src=""https://aws1.discourse-cdn.com/cartalk/original/3X/c/b/cb478da696b0494cbd0a38f234a7fd055cccfc4d.jpeg"" alt=""image"" data-base62-sha1=""t0iebsQbOHc9YBwikw52DchbzAh"" width=""640"" height=""480""></p>\n<p>If so, air probably got into the cooling system and the Deathcool turned acidic eating up the cooling system.</p>\n<p>Tester</p>"
3,"<p>Yes. But black is an exaggeration. It’s more of a brownish color. So, if it’s discolored it’s definitely the head gasket? Can it be anything else?</p>"
4,<p>It wasn’t that black. The guy at the shop told me he think the thermostat was clogged from the “sludge”. But that was just the guy at jiffy lube so I take that with a grain of salt.</p>


In [86]:
# Step 4 Convert the post text from Markdown to HTML, then strip the HTML to raw text
# (Hint: Take a look at the 'markdown' and 're' libraries)
def markdown_to_raw_text(markdown_text):
  # Missing posts (NaN/None) are passed through as None.
  if pd.isna(markdown_text):
    return None
  # Render any Markdown to HTML (the 'cooked' column is already HTML).
  html = markdown.markdown(markdown_text)
  # Name the parser explicitly and keep only the text content.
  return BeautifulSoup(html, "html.parser").get_text()
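
The `cooked` column shown earlier already contains rendered HTML, so the tag-stripping half of this step does most of the work. As a rough, standard-library-only sketch of what that stripping amounts to (the names `TextExtractor` and `strip_tags` are hypothetical and not part of the notebook, which uses BeautifulSoup instead):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text found between tags.
        self.chunks.append(data)


def strip_tags(html_text):
    """Return the concatenated text content of html_text, dropping all tags."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.chunks)


sample = "<p>Does the coolant look like this?</p>\n<p>Tester</p>"
print(repr(strip_tags(sample)))
```

The newline between the two paragraphs survives as text, which is why the raw-text output still contains `\n` characters that the next step has to deal with.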

In [87]:
raw_text = data_posts['cooked'].apply(markdown_to_raw_text)
raw_text.to_frame().head()

Unnamed: 0,cooked
0,I have a 2012 Chevy Cruze 1.4 cylinder. While driving it started to overheat and I got an alarm on dash saying to turn a/c off due to overheat. The temp gauge shot up and pulled over to let the car cool down. I hooked up a code reader when I got home and got P00B7-engine coolant flow low/performance. I checked coolant and it was full but was back. I took it get a coolant flush. They said it was pretty bad. It caused there machine to clog. Not sure if they even was able to complete it because...
1,"\n\n\n Td401_169304:\n\nI checked coolant and it was full but was back\n\n\nWas “back” supposed to read as “black”?\nIf so, then I would surmise that the engine’s head gasket has been breached, and crankcase oil is getting into the cooling system. I suggest that you have a mechanic determine whether the head gasket has been breached, and if it has been breached, then it is time to assess what other types of engine damage have taken place.\nThe total cost of repairs might not be worthwhile ..."
2,"Does the coolant look like this?\n\nIf so, air probably got into the cooling system and the Deathcool turned acidic eating up the cooling system.\nTester"
3,"Yes. But black is an exaggeration. It’s more of a brownish color. So, if it’s discolored it’s definitely the head gasket? Can it be anything else?"
4,It wasn’t that black. The guy at the shop told me he think the thermostat was clogged from the “sludge”. But that was just the guy at jiffy lube so I take that with a grain of salt.


In [88]:
# Step 5 Remove punctuation marks and whitespace characters from the posts
# (Hint: What is a whitespace character? Are there multiple types of whitespace characters?)
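def remove_non_alphanum_chars(raw_text):
  # Missing posts (NaN/None) are passed through as None.
  if pd.isna(raw_text):
    return None
  # Drop every character that is not an ASCII letter or digit, spaces included.
  return re.sub(r'[^A-Za-z0-9]+', '', raw_text)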
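
On the hint about whitespace: Python's `string.whitespace` lists six ASCII whitespace characters, and the regex class `\s`, when used on `str` patterns, also matches other Unicode whitespace such as the no-break space. A small sketch that makes this visible (the sample string is made up for illustration):

```python
import re
from string import whitespace

# The six ASCII whitespace characters: space, tab, newline,
# carriage return, vertical tab, form feed.
print([repr(ch) for ch in whitespace])

sample = "fast\tidle\r\non a\u00a0Subaru"  # \u00a0 is a no-break space

# \s on a str pattern also matches Unicode whitespace such as \u00a0.
collapsed = re.sub(r"\s+", " ", sample)
print(collapsed)
```

Every run of mixed whitespace in the sample is collapsed to one ordinary space, so a plain `.replace(" ", "")` alone would have missed the tab, the line break, and the no-break space.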

In [89]:
alphanum_removed = raw_text.apply(remove_non_alphanum_chars)
alphanum_removed.to_frame().head()

Unnamed: 0,cooked
0,Ihavea2012ChevyCruze14cylinderWhiledrivingitstartedtooverheatandIgotanalarmondashsayingtoturnacoffduetooverheatThetempgaugeshotupandpulledovertoletthecarcooldownIhookedupacodereaderwhenIgothomeandgotP00B7enginecoolantflowlowperformanceIcheckedcoolantanditwasfullbutwasbackItookitgetacoolantflushTheysaiditwasprettybadItcausedtheremachinetoclogNotsureiftheyevenwasabletocompleteitbecausetheyhaveonlymachineIaskedtheclerkandheassuredmeitwasButtheissuewasntfixedCarisstilloverheatingIjustranthecarfo...
1,Td401169304IcheckedcoolantanditwasfullbutwasbackWasbacksupposedtoreadasblackIfsothenIwouldsurmisethattheenginesheadgaskethasbeenbreachedandcrankcaseoilisgettingintothecoolingsystemIsuggestthatyouhaveamechanicdeterminewhethertheheadgaskethasbeenbreachedandifithasbeenbreachedthenitistimetoassesswhatothertypesofenginedamagehavetakenplaceThetotalcostofrepairsmightnotbeworthwhilewithan8yearoldcarandonlyatrustedcompetentmechanicNoteThatmeansNOTgoingtoPepBoysSearsMidasMeinekeMonroAAMCOoratireshopca...
2,DoesthecoolantlooklikethisIfsoairprobablygotintothecoolingsystemandtheDeathcoolturnedacidiceatingupthecoolingsystemTester
3,YesButblackisanexaggerationItsmoreofabrownishcolorSoifitsdiscoloreditsdefinitelytheheadgasketCanitbeanythingelse
4,ItwasntthatblackTheguyattheshoptoldmehethinkthethermostatwascloggedfromthesludgeButthatwasjusttheguyatjiffylubesoItakethatwithagrainofsalt
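
Because the step removes whitespace along with punctuation, word boundaries are lost in the output above. If later analysis needs separate words, one possible variant (an assumption about downstream use, not part of the original exercise) replaces each run of non-alphanumeric characters with a single space instead. The name `clean_keep_words` is hypothetical:

```python
import re


def clean_keep_words(text):
    """Replace runs of non-alphanumeric characters with one space."""
    if text is None:
        return None
    return re.sub(r"[^A-Za-z0-9]+", " ", text).strip()


sample = "Yes. But black is an exaggeration.\nIt\u2019s more of a brownish color."
print(clean_keep_words(sample))
```

Note that the curly apostrophe is treated like any other symbol, so "It’s" comes out as "It s". Handling contractions would need an extra rule.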


In [90]:
# Step 6 Update the data scraped using Scrapy to contain the formatted posts
def process_post_text(post_df):
  temp = post_df.copy()
  temp['cooked'] = temp['cooked'].apply(markdown_to_raw_text).apply(remove_non_alphanum_chars)
  return temp
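
The helper functions skip posts that are missing. One pitfall worth knowing when writing such guards: NaN never compares equal to itself, and an identity check only matches the exact same object. A minimal sketch of a value-based check using only the standard library (`is_missing` is a hypothetical helper; in pandas code, `pd.isna` plays this role):

```python
import math

nan_a = float("nan")
nan_b = float("nan")

print(nan_a == nan_a)  # False: NaN never equals itself
print(nan_a is nan_b)  # False: two distinct NaN objects


def is_missing(value):
    """True for None or any float NaN, regardless of object identity."""
    return value is None or (isinstance(value, float) and math.isnan(value))


print([is_missing(v) for v in (None, nan_b, "", "<p>Tester</p>")])
```

An empty string is not treated as missing here. Whether empty posts should be dropped is a separate decision for the dataset.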

In [96]:
data_posts_processed = process_post_text(data_posts)
data_posts_processed.head()

Unnamed: 0,post_id,username,created_at,cooked,post_num,updated_at,reply_count,reply_to_post_num,reads,topic_id,user_id,topic_slug
0,1174109,Td401_169304,2020-07-28T13:35:25.800Z,Ihavea2012ChevyCruze14cylinderWhiledrivingitstartedtooverheatandIgotanalarmondashsayingtoturnacoffduetooverheatThetempgaugeshotupandpulledovertoletthecarcooldownIhookedupacodereaderwhenIgothomeandgotP00B7enginecoolantflowlowperformanceIcheckedcoolantanditwasfullbutwasbackItookitgetacoolantflushTheysaiditwasprettybadItcausedtheremachinetoclogNotsureiftheyevenwasabletocompleteitbecausetheyhaveonlymachineIaskedtheclerkandheassuredmeitwasButtheissuewasntfixedCarisstilloverheatingIjustranthecarfo...,1,2020-07-28T13:35:25.800Z,1,,31,167996,176954,car-overheating-chevy-cruze-2012
1,1174112,VDCdriver,2020-07-28T13:51:10.165Z,Td401169304IcheckedcoolantanditwasfullbutwasbackWasbacksupposedtoreadasblackIfsothenIwouldsurmisethattheenginesheadgaskethasbeenbreachedandcrankcaseoilisgettingintothecoolingsystemIsuggestthatyouhaveamechanicdeterminewhethertheheadgaskethasbeenbreachedandifithasbeenbreachedthenitistimetoassesswhatothertypesofenginedamagehavetakenplaceThetotalcostofrepairsmightnotbeworthwhilewithan8yearoldcarandonlyatrustedcompetentmechanicNoteThatmeansNOTgoingtoPepBoysSearsMidasMeinekeMonroAAMCOoratireshopca...,2,2020-07-28T13:51:41.768Z,1,,30,167996,195,car-overheating-chevy-cruze-2012
2,1174119,Tester,2020-07-28T13:59:17.419Z,DoesthecoolantlooklikethisIfsoairprobablygotintothecoolingsystemandtheDeathcoolturnedacidiceatingupthecoolingsystemTester,3,2020-07-28T13:59:17.419Z,1,,30,167996,96,car-overheating-chevy-cruze-2012
3,1174120,Td401_169304,2020-07-28T14:00:01.389Z,YesButblackisanexaggerationItsmoreofabrownishcolorSoifitsdiscoloreditsdefinitelytheheadgasketCanitbeanythingelse,4,2020-07-28T14:00:01.389Z,2,2.0,30,167996,176954,car-overheating-chevy-cruze-2012
4,1174122,Td401_169304,2020-07-28T14:02:09.000Z,ItwasntthatblackTheguyattheshoptoldmehethinkthethermostatwascloggedfromthesludgeButthatwasjusttheguyatjiffylubesoItakethatwithagrainofsalt,5,2020-07-28T14:02:23.957Z,1,3.0,30,167996,176954,car-overheating-chevy-cruze-2012


In [97]:
data_posts.head()

Unnamed: 0,post_id,username,created_at,cooked,post_num,updated_at,reply_count,reply_to_post_num,reads,topic_id,user_id,topic_slug
0,1174109,Td401_169304,2020-07-28T13:35:25.800Z,<p>I have a 2012 Chevy Cruze 1.4 cylinder. While driving it started to overheat and I got an alarm on dash saying to turn a/c off due to overheat. The temp gauge shot up and pulled over to let the car cool down. I hooked up a code reader when I got home and got P00B7-engine coolant flow low/performance. I checked coolant and it was full but was back. I took it get a coolant flush. They said it was pretty bad. It caused there machine to clog. Not sure if they even was able to complete it beca...,1,2020-07-28T13:35:25.800Z,1,,31,167996,176954,car-overheating-chevy-cruze-2012
1,1174112,VDCdriver,2020-07-28T13:51:10.165Z,"<aside class=""quote no-group"" data-username=""Td401_169304"" data-post=""1"" data-topic=""167996"">\n<div class=""title"">\n<div class=""quote-controls""></div>\n<img alt="""" width=""20"" height=""20"" src=""https://avatars.discourse.org/v4/letter/t/3da27b/40.png"" class=""avatar""> Td401_169304:</div>\n<blockquote>\n<p>I checked coolant and it was full but was back</p>\n</blockquote>\n</aside>\n<p>Was “back” supposed to read as “black”?<br>\nIf so, then I would surmise that the engine’s head gasket has been b...",2,2020-07-28T13:51:41.768Z,1,,30,167996,195,car-overheating-chevy-cruze-2012
2,1174119,Tester,2020-07-28T13:59:17.419Z,"<p>Does the coolant look like this?</p>\n<p><img src=""https://aws1.discourse-cdn.com/cartalk/original/3X/c/b/cb478da696b0494cbd0a38f234a7fd055cccfc4d.jpeg"" alt=""image"" data-base62-sha1=""t0iebsQbOHc9YBwikw52DchbzAh"" width=""640"" height=""480""></p>\n<p>If so, air probably got into the cooling system and the Deathcool turned acidic eating up the cooling system.</p>\n<p>Tester</p>",3,2020-07-28T13:59:17.419Z,1,,30,167996,96,car-overheating-chevy-cruze-2012
3,1174120,Td401_169304,2020-07-28T14:00:01.389Z,"<p>Yes. But black is an exaggeration. It’s more of a brownish color. So, if it’s discolored it’s definitely the head gasket? Can it be anything else?</p>",4,2020-07-28T14:00:01.389Z,2,2.0,30,167996,176954,car-overheating-chevy-cruze-2012
4,1174122,Td401_169304,2020-07-28T14:02:09.000Z,<p>It wasn’t that black. The guy at the shop told me he think the thermostat was clogged from the “sludge”. But that was just the guy at jiffy lube so I take that with a grain of salt.</p>,5,2020-07-28T14:02:23.957Z,1,3.0,30,167996,176954,car-overheating-chevy-cruze-2012


In [98]:
# Step 7 Display all of the updated/formatted posts
data_posts_processed.head()

Unnamed: 0,post_id,username,created_at,cooked,post_num,updated_at,reply_count,reply_to_post_num,reads,topic_id,user_id,topic_slug
0,1174109,Td401_169304,2020-07-28T13:35:25.800Z,Ihavea2012ChevyCruze14cylinderWhiledrivingitstartedtooverheatandIgotanalarmondashsayingtoturnacoffduetooverheatThetempgaugeshotupandpulledovertoletthecarcooldownIhookedupacodereaderwhenIgothomeandgotP00B7enginecoolantflowlowperformanceIcheckedcoolantanditwasfullbutwasbackItookitgetacoolantflushTheysaiditwasprettybadItcausedtheremachinetoclogNotsureiftheyevenwasabletocompleteitbecausetheyhaveonlymachineIaskedtheclerkandheassuredmeitwasButtheissuewasntfixedCarisstilloverheatingIjustranthecarfo...,1,2020-07-28T13:35:25.800Z,1,,31,167996,176954,car-overheating-chevy-cruze-2012
1,1174112,VDCdriver,2020-07-28T13:51:10.165Z,Td401169304IcheckedcoolantanditwasfullbutwasbackWasbacksupposedtoreadasblackIfsothenIwouldsurmisethattheenginesheadgaskethasbeenbreachedandcrankcaseoilisgettingintothecoolingsystemIsuggestthatyouhaveamechanicdeterminewhethertheheadgaskethasbeenbreachedandifithasbeenbreachedthenitistimetoassesswhatothertypesofenginedamagehavetakenplaceThetotalcostofrepairsmightnotbeworthwhilewithan8yearoldcarandonlyatrustedcompetentmechanicNoteThatmeansNOTgoingtoPepBoysSearsMidasMeinekeMonroAAMCOoratireshopca...,2,2020-07-28T13:51:41.768Z,1,,30,167996,195,car-overheating-chevy-cruze-2012
2,1174119,Tester,2020-07-28T13:59:17.419Z,DoesthecoolantlooklikethisIfsoairprobablygotintothecoolingsystemandtheDeathcoolturnedacidiceatingupthecoolingsystemTester,3,2020-07-28T13:59:17.419Z,1,,30,167996,96,car-overheating-chevy-cruze-2012
3,1174120,Td401_169304,2020-07-28T14:00:01.389Z,YesButblackisanexaggerationItsmoreofabrownishcolorSoifitsdiscoloreditsdefinitelytheheadgasketCanitbeanythingelse,4,2020-07-28T14:00:01.389Z,2,2.0,30,167996,176954,car-overheating-chevy-cruze-2012
4,1174122,Td401_169304,2020-07-28T14:02:09.000Z,ItwasntthatblackTheguyattheshoptoldmehethinkthethermostatwascloggedfromthesludgeButthatwasjusttheguyatjiffylubesoItakethatwithagrainofsalt,5,2020-07-28T14:02:23.957Z,1,3.0,30,167996,176954,car-overheating-chevy-cruze-2012


In [99]:
# Step 8 Display a comparison of both versions of all of the posts (original vs. formatted)
# (Hint: use pandas)
comp = pd.DataFrame({'original': data_posts['cooked'], 'formatted': data_posts_processed['cooked']})
comp.head()

In [101]:
# Step 9 Save the updated dataset into a CSV file (Hint: use pandas)
# index=False keeps the row index from being written as an extra unnamed column
data_posts_processed.to_csv("CarTalkCommunityPostProcessed.csv", index=False)