# 1. Exploratory Data Analysis





## 1.1 Let's load our data

I have loaded from Google sheets below as the raw CSVs needed a little extra cleaning, so I did this in Sheets vs loading the CSVs and then cleaning in Pandas (just because it was quicker)! :) 

However, please feel free to experiment loading your data in a different way! :) 

In [65]:
import pandas as pd
import gspread
from gspread_dataframe import get_as_dataframe, set_with_dataframe
# from oauth2client.client import GoogleCredentials
# from google.colab import auth

In [66]:
FILES = ['CarTalk', 'Webflow']
df_all = pd.DataFrame()
for file in FILES:
    filename = file + '.csv'
    df = pd.read_csv(filename)
    df_all = pd.concat([df_all, df], sort = False)
df_all.head()

Unnamed: 0,post_id,username,created_at,cooked,post_num,updated_at,reply_count,reply_to_post_num,reads,topic_id,user_id,topic_slug,forum
0,1174109,Td401_169304,2020-07-28T13:35:25.800Z,I have a 2012 Chevy Cruze 1 4 cylinder While d...,1.0,2020-07-28T13:35:25.800Z,1,,31,167996,176954,car-overheating-chevy-cruze-2012,CarTalk
1,1174112,VDCdriver,2020-07-28T13:51:10.165Z,Td401 169304 I checked coolant and it was ful...,2.0,2020-07-28T13:51:41.768Z,1,,30,167996,195,car-overheating-chevy-cruze-2012,CarTalk
2,1174119,Tester,2020-07-28T13:59:17.419Z,Does the coolant look like this If so air prob...,3.0,2020-07-28T13:59:17.419Z,1,,30,167996,96,car-overheating-chevy-cruze-2012,CarTalk
3,1174120,Td401_169304,2020-07-28T14:00:01.389Z,Yes But black is an exaggeration It s more of ...,4.0,2020-07-28T14:00:01.389Z,2,2.0,30,167996,176954,car-overheating-chevy-cruze-2012,CarTalk
4,1174122,Td401_169304,2020-07-28T14:02:09.000Z,It wasn t that black The guy at the shop told ...,5.0,2020-07-28T14:02:23.957Z,1,3.0,30,167996,176954,car-overheating-chevy-cruze-2012,CarTalk


In [67]:
# Shuffle df
df_all = df_all.sample(frac=1).reset_index(drop=True)

## 1.2 Data Cleansing and Prep

Now we've loaded the data, we must remove noise from the dataset. Please explore some techniques in which we could clean the data in order to us to see how well pre-trained BERT works on our dataset. Luckily, due to the way BERT tokenises the data, we don't need to the same extent of data preprocessing as required of previous NLP models. However we still need to - 

1. Filter nulls
2. Filter for duplicates
3. [Optional] Remove post_text which does not have vocab in pre-trained BERT. Later, we will leave this in for finetuning.
  * Hyperlinks 
  * Foreign languages - there are multilingual BERT models
  * Any more you can think of?
4. Encode the labels - map categorical labels to numerical values

* See [here](https://drive.google.com/open?id=1PotbhjemiMobHu0Loy-mDHIumdJh-LxC) for Pandas cleaning tutorials
* See [here](https://www.analyticsvidhya.com/blog/2020/04/beginners-guide-exploratory-data-analysis-text-data/#3) for beginner EDA tutorial for NLP


### 1.2.1 Filter nulls


In [68]:
# Check the different data types of the columns
df_all.dtypes

post_id                int64
username              object
created_at            object
cooked                object
post_num             float64
updated_at            object
reply_count            int64
reply_to_post_num    float64
reads                  int64
topic_id               int64
user_id                int64
topic_slug            object
forum                 object
dtype: object

In [69]:
df_all.shape

(14007, 13)

In [70]:
# Removing this column due to redundancy
df_all.drop('reply_to_post_num', axis=1, inplace=True) 

In [71]:
df_all.shape

(14007, 12)

In [72]:
df_all = df_all.dropna(axis=0)
df_all.isnull().sum()  # No nulls now

post_id        0
username       0
created_at     0
cooked         0
post_num       0
updated_at     0
reply_count    0
reads          0
topic_id       0
user_id        0
topic_slug     0
forum          0
dtype: int64

### 1.2.2 Filter for duplicates


In [73]:
df_all = df_all.drop_duplicates()

In [74]:
df_all.shape

(13883, 12)

### 1.2.3 Encode the labels

We then need to encode the labels.

In [75]:
df_all['forum'] = df_all['forum'].astype('category')
df_all['forum_name_encoded'] = df_all['forum'].cat.codes.astype('int64')
df_all.head(5)

Unnamed: 0,post_id,username,created_at,cooked,post_num,updated_at,reply_count,reads,topic_id,user_id,topic_slug,forum,forum_name_encoded
0,1165782,Renegade,2020-06-25T13:39:04.662Z,Yes just started and its saying acquiring sign...,6.0,2020-06-25T13:39:04.662Z,0,7,166300,132125,2018-ford-edge-hd-radio-no-signal,CarTalk,0
1,1160223,davepsinbox_157004,2020-06-01T16:45:57.446Z,Here s a description of how to clean your vehi...,5.0,2020-06-01T16:45:57.446Z,0,25,165106,166234,1996-toyota-previa-black-exhaust,CarTalk,0
2,1164660,connie01,2020-06-21T14:22:04.000Z,No hill They have come to my house and it star...,4.0,2020-06-21T14:22:10.913Z,1,22,166089,176172,2009-subaru-forester-starting-issues,CarTalk,0
3,769025,mlkirchw,2015-06-12T19:22:21.000Z,You may have already figured this out but this...,4.0,2016-07-06T06:08:25.831Z,1,38,90136,130412,honda-odyssey-issues-dead-battery-then-loud-bu...,CarTalk,0
4,1162883,jtsanders,2020-06-13T14:13:28.922Z,You probably have more than one issue If the b...,5.0,2020-06-13T14:13:28.922Z,1,20,165713,84,why-did-my-car-die-at-a-light-after-i-let-off-...,CarTalk,0


### 1.2.3 [Optional] Filter noise

Remove post_text which does not have vocab in pre-trained BERT. Later, we will leave this in for finetuning.

* Emojis
* Hyperlinks
* Foreign languages
* Any more you can think of?

**How do we filter for these anomalies?** 

Perhaps we try to split strings by space and remove that match markdown hyperlink syntax `![]()`?

We can always see how BERT performs with dirty data and then perform further pre-processing as we move forward such as expanding contractions etc.




In [76]:
# Filter emojis
# !pip install emoji

import emoji

def give_emoji_free_text(text):
    allchars = [str for str in text]
    emoji_list = [c for c in allchars if c in emoji.UNICODE_EMOJI]
    clean_text = ' '.join([str for str in text.split() if not any(i in str for i in emoji_list)])
    return clean_text

text = give_emoji_free_text(df_all['cooked'][0])
print(text)

df_all['cooked'] = df_all['cooked'].apply((lambda x: give_emoji_free_text(x)))
df_all.sample(5)

Yes just started and its saying acquiring signal It was working fine when I bought it My satellite radio will do that every now and again I have to unplug the antenna and replug it


Unnamed: 0,post_id,username,created_at,cooked,post_num,updated_at,reply_count,reads,topic_id,user_id,topic_slug,forum,forum_name_encoded
4416,1169374,Dr_Chrisq,2020-07-09T20:21:06.495Z,When i turn on my air conditioner some times i...,1.0,2020-07-09T20:21:06.495Z,0,23,167056,1750,blower-2005-chevy-colorado,CarTalk,0
6102,1161958,Merlin84123,2020-06-09T18:01:42.000Z,TY I will have it checked,5.0,2020-06-09T18:01:57.452Z,0,25,165534,175894,2004-dodge-stratus-radiator-gurgling-after-shu...,CarTalk,0
3619,1171840,GorehamJ,2020-07-18T23:57:13.832Z,Give us some context on the adjuster thing,2.0,2020-07-18T23:57:13.832Z,0,42,167530,134584,2011-chevrolet-malibu-rusty-axles,CarTalk,0
925,1163007,old_mopar_guy,2020-06-14T01:34:48.589Z,Did Dodge even offer a factory temperature gau...,2.0,2020-06-14T01:34:48.589Z,2,38,165740,156042,want-to-change-temp-gauge,CarTalk,0
2258,1166723,davepsinbox_157004,2020-06-28T20:35:57.882Z,There are a couple of possibilities First chec...,2.0,2020-06-28T20:35:57.882Z,0,15,166515,166234,2012-hyundai-elantra-dome-light-stays-on,CarTalk,0


In [77]:
# Filter hyperlinks
# !pip install html2text

import html2text
h = html2text.HTML2Text()
h.unicode_snob = True
df_all['cooked'] = df_all['cooked'].apply((lambda x: h.handle(x).replace('\n', ' ').replace('  ', ' ')))
df_all.sample(5)

Unnamed: 0,post_id,username,created_at,cooked,post_num,updated_at,reply_count,reads,topic_id,user_id,topic_slug,forum,forum_name_encoded
4260,1172251,Mustangman,2020-07-20T19:53:25.431Z,Is the check engine light on Just how many mil...,2.0,2020-07-20T19:53:25.431Z,0,58,167594,94516,horrible-gas-mileage,CarTalk,0
10143,495959,oblivion,2011-06-09T10:04:09.000Z,Dieseling in reverse unlikely Maybe on a carbu...,16.0,2016-07-06T01:24:21.109Z,0,22,65290,194,car-reversed-while-in-drive,CarTalk,0
2442,610480,EllyEllis,2013-03-08T08:22:36.000Z,Are we talking about Alcohol or ethanol or the...,9.0,2016-07-06T03:22:21.216Z,0,49,77854,19570,gas-mileage-with-ethanol-free-vs-10-ethanol,CarTalk,0
5231,1173680,SteveCBT,2020-07-26T18:04:15.857Z,A little bit of gas makes a lot of stink But i...,3.0,2020-07-26T18:04:15.857Z,0,18,167875,119987,2009-jeep-wrangler-smells-like-gas,CarTalk,0
7952,1170024,Nevada_545,2020-07-12T04:31:14.823Z,If the drain plug is leaking it is easy enough...,6.0,2020-07-12T04:31:14.823Z,0,19,167174,81815,2008-chevrolet-impala-trying-the-egg-thing,CarTalk,0


In [57]:
# # !pip install nltk

# import nltk
# nltk.download('words')
# words = set(nltk.corpus.words.words())

# # Filter non English posts

# def remove_foreign_sentences(s):
#   for w in nltk.wordpunct_tokenize(s):
#     if not w.isalpha():
#       print('Not alpha:', w)
#       print(s.replace(w, ''))
#     if w.lower() not in words:
#       print('Foreign:', w)
#       return None

# test = ['1234charlene', '### asdfsafdsaf ', '*** yaas', 'yesss', 'I don\'t knowwww, woohoo!']
# # for s in df_all['post_text'][:10]:
# for s in test:
#   remove_foreign_sentences(s)
  
# df_all['cooked'] = df_all['cooked'].apply((lambda x: remove_foreign_sentences(x)))
# df_all.sample(5)

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\1hrit\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\words.zip.


Not alpha: 1234charlene

Foreign: 1234charlene
Not alpha: ###
 asdfsafdsaf 
Foreign: ###
Not alpha: ***
 yaas
Foreign: ***
Foreign: yesss
Not alpha: '
I dont knowwww, woohoo!
Foreign: '
Foreign: Mitsubishi
Foreign: Uniroyal
Not alpha: 50
I change mine at  period I wouldn t be sleeping at night at 5 Like Keith said though it s just a computer program It has no idea what the oil is actually like You could spend 30 and send in an oil sample for analysis but at that point might as well just change it 
Foreign: 50
Foreign: changes
Foreign: Yup
Foreign: earlier
Not alpha: 2005
I own a  Ford Explorer with the 4 0L V6 With a new belt pulley and tensioner the air conditioning increases the engine load at idle by approxmiately 15 I had wondered about this for a while and performed some basic and unscientific tests in my driveway in the fall using a wifi based OBDII dongle These values will vary based on vehicle environmental conditions and a variety of other factors which I cannot begin to scrat

Not alpha: 2010
My  Equinox is doing the same exact thing I m curious what the fix was 
Foreign: 2010
Foreign: reasons
Not alpha: db4690
 Robert Gift Why not 10W 30 as thosengines originally used You re living in the past Is this engine made withe same tolerances as the one replaced If so I would use 10W 30 full synthetic 
Foreign: db4690
Foreign: online
Foreign: corroded
Foreign: heard
Not alpha: g28
I assume I let my turbo go behind worth repair rebuild Looking for a new one can i easily upgrade to  probably just a tune or will only tsi parts fit a G25 and lastly can any turbo made for this car model and year fit on with old kit parts as future reference also what should be a fair labor charge for removal and instillation of the turbo since they took 2 weeks just to remove it i cant even guess what it ll cost me now im not doing the rebuild 
Foreign: g28
Not alpha: 0
The dashboard display is telling you you should change your oil between now and  left Don t overthink it It doesn t me

Foreign: symptoms
Foreign: Congratulations
Foreign: KenR
Foreign: bits
Foreign: Dieseling
Not alpha: daneb1
 Approximately how long can one expect a drive train to keep functioning well in that model of Honda No one can predict something like that I ask again has she asked for your advice a 16 year old vehicle is to be evaluated on it s own condition because it may not have the problems other 2004 Elements have had and develop some of it s own 
Foreign: daneb1
Foreign: helps
Foreign: Toyota
Foreign: shanonia
Not alpha: 180K
Mileage is  With regard to which head gasket is bad it would be the one where the faint mist of coolant is coming out near cyl 7 I m confident in ruling out the right hand head gasket No oil in coolant no coolant in oil etc Am I missing something Definitely will check for flatness And I wasn t accusing anyone I was just pointing out these projects can snowball under the as long as I ve go it apart philosophy We ve all been there Okay so who sells hot cams for the 5 

Not alpha: 2017
When I go to start my wife s  Dodge Journey the power comes on and everything seems normal When I hit the push to start button all it does is click It doesn t want to start I tried jumping the vehicle thinking it may be a battery problem but nothing happens I really need help finding out what the problem is 
Foreign: 2017
Foreign: VOLVO
Foreign: posts
Foreign: referring
Foreign: earlier
Not alpha: lion9car
 Barkydog jtsanders Thank you All good information to know Thank you everyone for the informative responses These will be helpful to my friends 
Foreign: lion9car
Foreign: Inline
Not alpha: 2001
Foreign: 2001
Foreign: doesn
Not alpha: Jordan69
 the spider updated new version and it seems that is the only one they carry the new updated version I am right No you re wrong You can still get the old version although I don t know why anybody would want to do that 
Foreign: Jordan69
Foreign: isn
Not alpha: LadyDi17
 If you have not already done so I recommend that you escala

Foreign: Michael
Foreign: wouldn
Foreign: inline
Foreign: MajikImaje
Not alpha: image1112
 2208 633 KB 
Foreign: image1112
Foreign: Autozone
Foreign: reporting
Foreign: Has
Foreign: CEL
Foreign: has
Not alpha: db4690
 very useful pids that a cheapo scanner code reader etc will NOT give you Agree A perfect situation to take the car to a professional equipped with a professional scan tool especially if there is no CEL The cheap tools can sometimes point you in the right direction if you know what you are doing and the problem no so complicated Their live scanning abilities leave a bit to be desired I like strip charts that display several PIDs in at a time That takes a decently big display 8 inch tablet and a fairly powerful tablet to process But that is still no substitute for a pro scanner 
Foreign: db4690
Foreign: jtsanders
Foreign: lines
Foreign: OK
Foreign: Im
Foreign: has
Foreign: charged
Foreign: friends
Foreign: ive
Foreign: BillRussell
Foreign: Mustangman
Foreign: Mustangman
For

Foreign: recommended
Not alpha: 3
Well yes the car was sold  5 yrs ago The VVT was probably fine when I sold it Mileage was over 185K miles when sold I think it is a common Toyota problem Hopefully the instructions will help someone else 
Foreign: 3
Foreign: didn
Foreign: founded
Foreign: madison
Foreign: DIYer
Foreign: inspected
Foreign: Okay
Foreign: seems
Foreign: served
Foreign: changed
Foreign: Sounds
Foreign: davepsinbox
Foreign: Dave
Not alpha: 2021
I would guess a  would have a warranty Oh wait this is probably not a  First step call whoever put the gaskets in 
Foreign: 2021
Foreign: asking
Foreign: Ahh
Foreign: rods
Foreign: Plugs
Foreign: sockets
Foreign: website
Foreign: bled
Not alpha: Rav4
I d have no problem getting the  Toyota has done great for many years with their hybrid reliability I d avoid the Honda because of their poor hybrid track record Here s a comparison the  did better in this test Car and Driver 9 Jun 20 Tested 2020 Honda CR V Hybrid Can t Match the Toyota 

Foreign: responding
Foreign: results
Not alpha: 2016
I have a  Ford Escape Recently it has started to make a rattling sound not painfully loud but a clearly noticable and sounds This only happens when it is in gear Drive or Reverse and does not occur in Park or Neutral It does not appear to get louder with higher RPMs e g as vehicle is moving Not 100 sure on exact location but when I pop the hood the noise stops when I lean on the left side of the hood it stops It only makes the noise when it s down and in either R or D Any ideas on what this is likely to be Really appreciate any support 
Foreign: 2016
Foreign: Tom
Foreign: ve
Not alpha: 1
This  And trying to drive 60mph in second gear can t be good for it Drive it slowly to a good independent shop for a diagnosis 
Foreign: 1
Foreign: replaced
Not alpha: 3
 batteries in 2 and one half years 
Foreign: 3
Foreign: doors
Foreign: spots
Foreign: Mazda
Foreign: bolts
Foreign: repaired
Foreign: VFox
Foreign: SpudMuffin
Foreign: Sounds
Foreign

Foreign: caused
Foreign: Duratec
Foreign: creaking
Foreign: ll
Foreign: decides
Foreign: folks
Foreign: ll
Foreign: VOLVO
Foreign: replies
Not alpha: T40
When I broke off a  I removed the broken piece with a magnet When I worked at the Dodge dealer I kept a spare  and T45 bits in my tool box the Snap On tool dealer only comes by once a week to replace broken tools oldtimer 11 Were they the new Chinese made Craftsman The old USA manufactured Craftsman tools will break working on these engines I wonder if there is any reason to disassemble a 2 7 L Dodge engine before scraping the vehicle 
Foreign: T40
Not alpha: 12V
Be prepared for sticker shock  batteries for Priuses are typically VERY expensive even decent aftermarket batteries highway robbery in my opinion given the size of the actual battery 
Foreign: 12V
Foreign: says
Not alpha: image600
The starter relay may be bad  600 72 8 KB Tester 
Foreign: image600
Not alpha: 2007
I bought  Camery hybird but the ABS MODULE always made noisy an

Purebred Care to expand your statement  I ll even give the OP the opportunity to consider the following questions before he posts his expanded response Does it rumble when the engine is idling or when you are driving at low speeds or when you are driving at high speeds or when you are braking or when you are turning or 
Foreign: 1
Not alpha: 100
See if the  amp starter fuse is blown in the fuse box under the hood Tester 
Foreign: 100
Foreign: weeks
Foreign: seems
Foreign: systems
Foreign: putting
Foreign: online
Foreign: Dilberts
Foreign: fasteners
Foreign: GorehamJ
Foreign: Barkydog
Foreign: KaylaNirvana
Foreign: pein
Foreign: CEL
Foreign: requires
Foreign: has
Foreign: videos
Foreign: OK
Foreign: loses
Not alpha: davies767
 When I turn on the defrost the AC runs which at subzero temps is really counter productive and just plain cold When you re running both heat and the air conditioner the air coming out of the vents should be hot not cold If when you run both the air is still cold e

Foreign: Mazda
Foreign: optispark
Foreign: Eagles
Not alpha: 1300
With labor and the unit it would be  dos anyone know if most mechanics can install this or it has to be a mobility shop 
Foreign: 1300
Foreign: problems
Foreign: info
Foreign: Hmm
Foreign: shouldn
Not alpha: Dave514
 Does anyone know anything about this year and model and if it has had ongoing issues At 14 years old and that mileage the vehicle has to stand on it s own merits It could run for a long time or die tomorrow What the shop inspection finds is all that matters Myself I would not even consider an All Wheel Drive vehicle used unless it was absolutely necessary Normal Front Wheel will work most of the time in snow areas 
Foreign: Dave514
Foreign: splines
Not alpha: 2
I do hear the tick every  3 seconds I have no clue to the source Hope your dealer services department can determine the source Please post back with their explanation 
Foreign: 2
Foreign: existed
Not alpha: 2017
Road noise  Subaru Outback ring failure

Foreign: jammed
Foreign: mini
Foreign: functioning
Foreign: bridges
Foreign: issues
Foreign: Mustangman
Foreign: ends
Foreign: makes
Not alpha: 1
 on all points to greendrag0n s comments 
Foreign: 1
Foreign: speculating
Foreign: noticed
Not alpha: sarah58
 It was 88 octane with 15 ethanol Whatever the shorthand is for that Then I wouldn t be worried about it E15 is not going to cause any problems on a modern car 
Foreign: sarah58
Foreign: happened
Foreign: isn
Not alpha: 1
 year old Vette still under warranty Let the dealer fix it for you under warranty 
Foreign: 1
Not alpha: 1
Nowhere does that say a steady  amp for an hour or more 
Foreign: 1
Foreign: cdaquila
Foreign: ADP
Foreign: oldtimer
Foreign: infotainment
Not alpha: 04
I have a  Silverado and it seems like the same thing I ll be in the rain and ill hit the gas or whatever to go sometimes too much and it ll spin for 1 4 of a sec and thud thud thud and its locked ill start making my turn after i let of the gas and it ll hop and 

Foreign: installs
Not alpha: 2009
 Corolla LE 118k battery and alternator are both less than a year old were installed at the same time Got in it and it wouldn t start No prior symptoms Terminals are clean The clock even lost an hour and a half at some point No changes in driving habits no new accessories beyond running AC AAA guy found the alternator okay and the battery bad Started with a jump pack Drove it to our regular mechanic since he installed the battery The car died pulling in to his lot He says this indicates a bad alternator and I m inclined to agree He replaced the battery and alternator last year I know stuff goes bad but how common is it for a battery or alternator to fail in less than a year Am I just having bad luck or is there something else going on 
Foreign: 2009
Foreign: hoses
Foreign: voltages
Not alpha: Kennedy1963
 Full of fluid if not just a hair over So the fluid level was within the cross hatched area 
Foreign: Kennedy1963
Foreign: idk
Foreign: Years
Not alph

Foreign: hedreamlen
Not alpha: 9
it s a variety pack for  bucks Figured couldn t hurt having various sizes in a container in the garage for both our cars friends etc whoever s car we work on And that way I always have them vs trying to track one down Last time had to go to 2 auto parts store 
Foreign: 9
Foreign: ve
Not alpha: kobowden93
 I thought I d ask on here before I called them They most likely have a web site try that 
Foreign: kobowden93
Foreign: ve
Not alpha: 96
Why don t more people who work on their  vehicles that are having issues immediately consider this Readers that poll data and send it to a laptop or cell phone are cheap and readily available I mean OBDlink starts at like 30 and who knows what other stuff is out there The information they provide is invaluable Is it really not that well known 
Foreign: 96
Foreign: Remembering
Foreign: copies
Foreign: rotates
Not alpha: Fred28
 I can usually put into the tank another 2 to 3 gallons on each fill up The Prius has a bladde

Foreign: ethankeuning
Foreign: guns
Not alpha: al947394
 on my car hunt what questions would be good to ask to ensure I m getting a good car You ask if you can take it to a shop for an inspection about 125 00 to 150 00 to see if there are any problems That will not guarantee trouble free but will put the odds in your favor As a first time buyer I would avoid private sale from Craigslist or any online market place Also look at the Carmax web site if there is one near you because while not the lowest prices they do have a decent warranty and return policy 
Foreign: al947394
Foreign: TwinTurbo
Foreign: Tom
Foreign: heard
Foreign: inserting
Foreign: followed
Foreign: calls
Foreign: onetwothreemike
Foreign: upload
Foreign: SCMSoCal
Foreign: Bugmenot
Not alpha: 72
My  nova was not too bad for heater core no ac though 
Foreign: 72
Foreign: annoyed
Not alpha: 1994
That  is now 26 years old even it is still in running order And the question was asked 10 years ago 
Foreign: 1994
Foreign: afterma

Foreign: DTC
Foreign: info
Foreign: tools
Foreign: Couldn
Foreign: ve
Foreign: VDCdriver
Foreign: brakes
Foreign: https
Foreign: responses
Foreign: vehicles
Foreign: safest
Foreign: QC
Foreign: realized
Foreign: fans
Not alpha: 1000
I think I d at least try for say  on Craigslist or in my local newspaper Cash on the table as is full disclosure of all known problems Someone might want a challenge or a fixer upper or even need such an Equinox for parts But otherwise yeah I d either donate the vehicle or perhaps even take it to a junkyard You re just above the cost benefit value threshold of not worth fooling with At least in my mind Good luck 
Foreign: 1000
Not alpha: 5
 2 20 Thanx for your reply The fuel is fresh Also the tank was cleaned and has fresh fuel The fuel fuse is also ok Later 
Foreign: 5
Foreign: doors
Not alpha: ledhed75
 Can you really tow 5000 lbs with a V6 I d be a little skeptical Properly equipped yes The new Ford Ranger with it s 2 3L 4cyl can tow 7500 lbs 
Foreign: l

Foreign: BMW
Foreign: contaminants
Foreign: oddball
Foreign: friends
Foreign: ones
Not alpha: kurtwm1
 What can most likely cause this type of uneven wear Why not show the tire to a good tire shop and see what they say I will Bat Signal the resident tire expert for you CapriRacer do you want to wade in 
Foreign: kurtwm1
Foreign: Buick
Foreign: guys
Foreign: Has
Foreign: XM
Foreign: Mustangman
Foreign: aftermarket
Foreign: Cbsdkids
Foreign: pads
Not alpha: smg252
 what should I use to clean lube the engine so I don t get too dirty while working on it and make it shiny I prefer to just carefully wipe down everything I can reach with a microfiber cloth sprayed with quick detailer It might not look perfect but that way I m sure I m not getting moisture into any components that aren t meant to get wet 
Foreign: smg252
Foreign: Mr
Foreign: possibilities
Foreign: todays
Not alpha: 2
My  back door latches are broken they won t open 
Foreign: 2
Foreign: legs
Foreign: Matthew
Foreign: Mustangman

Foreign: Zerex
Foreign: threads
Foreign: couldn
Foreign: toyota
Foreign: CVT
Foreign: ve
Foreign: didn
Not alpha: 2
Just my  cents and I could be dead wrong but I suspect some misdiagnosis of the problem which may not even be related to the coils It s stated that the coils are burning out How in the world has that been determined To make matters worse this is stated to be a 4 or 5 time event with 3 different shops involved That muddies the water even more 
Foreign: 2
Foreign: hoses
Foreign: Doesn
Foreign: engineers
Foreign: sparklehoof
Foreign: MajikImaje
Not alpha: 88
It s Me do you mean  octane E85 My area has  Octane E15 I wouldn t worry about it for one tank 
Foreign: 88
Not alpha: 20200716
 1911183024 4032 1 85 MB First time poster long time lurker I think I finally have an issue that I haven t found the answer for though I m happy to be corrected I m hoping someone has experience of this issue I changed the front brake pads on my Mini One D 2011 R56 Since the change the brakes ha

Foreign: cars
Foreign: Joyce
Foreign: starts
Foreign: workie
Foreign: earlier
Not alpha: 0
nope thats a fairy tale a fuse busted in his car and cost his  
Foreign: 0
Foreign: things
Foreign: ll
Foreign: wants
Foreign: Toyota
Foreign: years
Foreign: called
Foreign: formulated
Foreign: tranny
Foreign: has
Foreign: prevents
Foreign: Ok
Foreign: Koltyn
Foreign: parked
Foreign: Missouri
Foreign: lengths
Not alpha: 3
My personal approach is every  years or 0k miles whichever comes first Hopefully the OP s vehicle has had the trans fluid changed 5 times over the 15 years Otherwise 
Foreign: 3
Foreign: Japanese
Foreign: CEL
Foreign: has
Foreign: Tannor
Foreign: mentioned
Foreign: reads
Foreign: wheels
Foreign: bearings
Foreign: sets
Foreign: wasn
Foreign: Google
Not alpha: ledhed75
 I have a teen who s learning how to drive now One of the first lessons when we got her a car was me going over everything under the hood with her watching and listening We check the oil in all our cars every Saturd

Foreign: miles
Foreign: misspoke
Foreign: surprised
Not alpha: ok4450
 Thankfully OK passed a statute last year which states that if a train is blocking the road for more than 10 minutes the police can be called and the RR will be cited and fined 1000 dollars About time Denver established such an ordinance I believe in thearly 1900s My Asian driving students claimed thathe flashing mee mool lights o T o meant stop and proceed when safe same as flashing red traffic signals Years before I nvented a theory thathe alternately flashing lamps mimicked the flagman swinging his red lantern to and fro Some grade crossing signals have had a boulevard STOP sign which turns 90 degrees facing traffic when the mee mool lights activate Turns back to edgewise facing traffic when the o T o deactivate That STOP sign also conveyed stop and proceed as a STOP sign indicates 
Foreign: ok4450
Foreign: caused
Foreign: things
Foreign: installing
Foreign: ftc
Foreign: moved
Foreign: Catholics
Foreign: texases
F

Not alpha: 125
Simple you pay a shop about  00 to inspect the vehicle for you to see if it has any problems That is all you can do for any used vehicle and that does not guarantee you won t have problems 
Foreign: 125
Foreign: seems
Not alpha: 302
My friend put that stuff in a Ford  after doing an oil change that was smoking a little About 50 miles later the oil light started coming on at idle which it NEVER did before I did an oil pressure check and had 5 6 psi at idle and 20 3000 rpm Turned out the oil filter was all clogged up with that stuff Changed the oil filter and everything went back to normal Pressure was 30 psi at idle and 45 psi 3000 rpm We cut the filter open and found all this dark green crap coating the element 
Foreign: 302
Foreign: spirits
Foreign: didn
Not alpha: 96
I had a  Subaru Impreza that leaked for years I had put gallons of power stirring fluid with stop leak in it I put a cap full of break fluid in and it never leaked again Drove it for a few years after and 

Foreign: Mrnicefordguy
Foreign: Eggmode
Not alpha: db4690
 If I mentioned the name of the Bbattery vendor who I believe has a despicable store policy in regards to honoring the battery free replacement period who also happens to sell everything else under the sun it would predictably devolve into an extremely ugly political discussion I suspect many of you can now figure out who that is I m sure that I can but in Sears defense I ordered a part for my Kenmore dishwasher from them the other day Because the machine was manufactured by Whirlpool I went first to Whirlpool s site but then I decided to double check the price for this part from Sears It turned out that Sears was charging 20 less for the same part so of course I ordered it from them To my great amazement the part that I ordered from Sears at around 10 AM Wednesday was delivered to my front door at 11 36 AM on Thursday And it came in a box marked Whirlpool So even though Sears is probably not long for this world they provided th

Not alpha: COROLLAGUY1
 Some pick and pull scrapyard may have one Yes in theory I don t know about the area where the OP lives but in my part of The US Mitsubishis have been virtually absent from the roads for a long time Hopefully the OP can get lucky with finding the desired part in a junkyard but when a particular make becomes as scarce as hen s teeth the chance of finding the desired parts in a junkyard can be slim 
Foreign: COROLLAGUY1
Foreign: beleive
Foreign: specified
Not alpha: 2002
I have a  Intrepid ES with a 3 5L V 6 with 264 000 miles on it I have recently had the timing belt and water pump changed Three weeks later I am driving home and the temperature gauge starts climbing This is the first time this car has ever overheated Well I call my buddy who is a certified mechanic and he tows the car to his shop I have him check the water pump to make sure it hasn t prematurely failed He tears the front of the engine down and inspects the pump He finds nothing wrong with the pump

Foreign: misspoken
Foreign: didn
Foreign: Paperwork
Foreign: ve
Not alpha: 2015
 Crosstrek straining noise Rear axle bearings are a common Subaru problem around 100 000 miles or so I d find a good Subaru independent mechanic and have them take a look Okay but when the OP posted his question in October 2018 he also told us that peterFawlter I only have 11k miles ken2116 I suggest having a reputable mechanic who works on a lot of Subaru s look at it Tell them what you ve observed or demonstrate it on a ride with them From what you ve described it s difficult to pinpoint a problem it could be major or minor but usually the sooner you fix something the less it will cost in the long run It is possible that the OP s vehicle hasn t yet exhausted its 3 yr 30k mile Bumper to Bumper Warranty in which case he shouldn t be taking it to an indy mechanic but in any event I suspect that he has resolved or accepted this noise problem in the 20 months or so since he posted his question 
Foreign: 2015
F

Foreign: replacing
Foreign: has
Not alpha: RAV4
You don t say what year the  is but on a 2017 a bad fuel pressure regulator can cause white smoke with a gas smell image600 600 58 6 KB Tester 
Foreign: RAV4
Foreign: words
Foreign: metals
Not alpha: r9tee
 they replaced the fuel injection What exactly did the replace All the fuel injectors 
Foreign: r9tee
Foreign: Cavell
Foreign: LincolnPains
Foreign: mountainbike
Foreign: Plugs
Foreign: Takes
Foreign: shadowfax
Foreign: Cranks
Foreign: Ok
Foreign: lights
Not alpha: R10
Surely there is a model number tag on the compressor Maybe one could be found that way My understanding is that the  is nothing more than a body change The mechanicals should be the same 
Foreign: R10
Foreign: wires
Not alpha: E15
If it really is  fuel it won t matter at all We have local E20 fuel available and it mixes freely with E10 in a modern car with no issues 
Foreign: E15
Foreign: parts
Foreign: concerns
Foreign: PCV
Not alpha: 2000
Hey all I have a  chevy express

Foreign: alternates
Foreign: washers
Foreign: escalating
Foreign: stands
Foreign: switches
Foreign: Chinese
Foreign: has
Foreign: lights
Foreign: failed
Foreign: called
Foreign: Tom
Foreign: heard
Not alpha: 2002
 Silverado 1500 LS trim heats up when the A C is turn on Outside temperature is around 85 to 90 degrees plus when this happens Turn A C off and it goes back to normal Runs normal other wise Any ideas or suggestions would be appreciated 
Foreign: 2002
Foreign: ve
Foreign: purchased
Foreign: ve
Foreign: changed
Foreign: leaks
Foreign: gets
Not alpha: 2008
Foreign: 2008
Not alpha: 2007
My  Hyundai Elantra transmission may be shot at 70 000 miles The garage man that I went to said that you can expect to get about 5 years from a Hyundai 60 000 What have others found I m wondering if my car will steadily decline and want to plan for the eventuality of having to buy another car because costs outweigh benefits I m not there yet but I m very disappointed to be having such problems at 7

Foreign: tolerances
Foreign: shocks
Foreign: sounds
Foreign: Madison
Foreign: kelfeind
Foreign: Autozone
Foreign: Triedaq
Foreign: eBay
Not alpha: 87
Took my  Dakota in and it failed 4 times High HC and high NO And while the failure points to a bad EGR system I am absolutely certain the EGR system is functioning as it should The whole ignition system is completely new The carburetor is in great shape The timing is retarded 3 degrees from specs The truck has passed with this exact same configuration since 2008 The engine is the 3 9L V6 with 170 000 miles auto trans Drives like a dream and the AC works The only other thing that I can think of besides a bad EGR system is a lean condition that results in lean misfire The only possible solution that I can think of is to put in larger main jets The mechanical fuel pump appears to be working The carb bowl fills when the engine is cranked and the floats shuts off the fuel Float is adjusted properly Other than changing the main jets I am flat o

I m at a loss I m trying to help a friend with a  Jeep Patriot 2 4L The battery is draining overnight The owner threw a new battery at it and is still having the problem and brought it to me to help out Here s what I have done Doing parasitic drain testing by disconnecting negative from battery and hooking multimeter between cable and terminal I measure 1 8A I ve individually pulled all fuses in the fuse box under the hood as far as I can tell this is the only fuse box No changes in current draw doing this I disconnected alternator and starter power cables and signal wires No change in current draw The battery reads 12 6V when charged and the car is off and the alternator has the correct output when running When she brought it to me I did leave it overnight without doing anything and it did drain significantly below 12 6V I can t remember exactly what the reading was What else can I check to try to locate the problem I really thought it would be either a fused circuit or the starter al

Foreign: owned
Foreign: Problems
Foreign: looks
Foreign: articles
Foreign: issues
Foreign: tires
Foreign: runs
Foreign: detailing
Foreign: Op
Not alpha: 2013
I took my  NissanMaxima to the local quick lube to have routine service oil change Never had a problem with the car and I ve had it three years Pull out of the quicklube the car starts shaking and stalling and surging and goes dead Did this same thing about five diff times until I could finally get it onto the shoulder of the road and out of traffic I called the quick lube and the manager immediately came half mile down the road to check it out The housing for the air filter wasn t clipped correctly and upon opening it the filter was wedged in there in a way that had it in a bind and bent up preventing the cover from latching correctly He took it out adjusted it and replaced correctly and closed the clamps and it restarted no problem The next two days I only drive it a mile or two but today I had a two hour drive Now the car wants

Foreign: drags
Foreign: info
Not alpha: 93
And for a  you might also have to deal with worn out suspension parts contributing to the problem Ideally you d have a trusted mechanic who would take a look and tell you what all is needed to keep it safely on the road expect to pay them for this 
Foreign: 93
Foreign: knfenimore
Foreign: dealers
Foreign: has
Foreign: pumps
Foreign: vehicles
Foreign: doesn
Foreign: fans
Foreign: Looks
Foreign: clicking
Foreign: tires
Not alpha: Rich56
 My car overheated because the Thermostat failed You may have that backwards Overheating an engine can damage the thermostat and it may not open when hot The plastic plumbing may have failed first 
Foreign: Rich56
Not alpha: 2004
Mine did the same thing  Mazda Miata I just took out the fuse for the Antenna My radio is fine but just prior to this my CD player stopped working its all One Unit Not sure now what to do replace antenna radio cd player or all my fuses 
Foreign: 2004
Foreign: circuitsmith
Foreign: Steve


Not alpha: 2
First I would swap the  injector with 3 and see if the missfire moves If that didn t work double check the spark plug for cracks it has happened to me where a new spark plug cracks If those are good next thing I would do is replace the coil pack 
Foreign: 2
Foreign: Experts
Not alpha: bcohen2010
 asked a good question You guys slammed him and then he explained why it was a good question 
Foreign: bcohen2010
Not alpha: Kennedy1963
 He charged it with a charger For how long A small battery charger may take 24 hours to recharge the battery 
Foreign: Kennedy1963
Foreign: easyout
Foreign: miles
Foreign: happens
Foreign: George
Foreign: replies
Foreign: installed
Foreign: Annie
Foreign: OP
Foreign: minutes
Not alpha: 18
Its only  months old Take it in and warranty it The last I warranted it only cost me 17 pro rate That was 4 years on a 6 year warranty 
Foreign: 18
Not alpha: COROLLAGUY1
 To tell you the truth your mechanic is ripping you off It just a Civic after all I would no

Foreign: has
Foreign: vacume
Foreign: signing
Foreign: Sonoma
Foreign: cuts
Foreign: leaking
Foreign: Rebooting
Foreign: ve
Foreign: started
Foreign: pipes
Foreign: has
Foreign: VOLVO
Foreign: hoping
Foreign: shanonia
Foreign: sunroof


Unnamed: 0,post_id,username,created_at,cooked,post_num,updated_at,reply_count,reads,topic_id,user_id,topic_slug,forum,forum_name_encoded
12951,1173999,db4690,2020-07-28T00:30:47.704Z,,5.0,2020-07-28T00:30:47.704Z,0,29,167912,107005,check-gas-cap-light-causing-check-engine-light,CarTalk,0
13470,1157592,oldtimer_11,2020-05-21T04:02:54.019Z,,3.0,2020-05-21T04:02:54.019Z,0,34,164623,47918,replacing-brakes-sealer,CarTalk,0
10490,1169720,ledhed75,2020-07-11T11:03:01.061Z,,9.0,2020-07-11T11:03:01.061Z,0,23,167120,2110,could-someone-check-this-carfax,CarTalk,0
11337,1163069,Darrel-G,2020-06-14T12:52:20.593Z,,3.0,2020-06-14T12:52:20.593Z,0,22,150884,171055,2004-ford-ranger-lugging-low-power,CarTalk,0
3899,1157986,jtsanders,2020-05-22T20:58:04.974Z,,4.0,2020-05-22T20:58:04.974Z,0,18,164690,84,1998-mercedes-benz-slk-class-unusual-tap,CarTalk,0


# 2. Forum Classifier with BERT

There are two steps to creating a text classifier - 

1. Train NLP model to transform sentences into meaningful sentence embeddings  
2. Train a classifier to make predictions

There are many models we could use to transform our post text into meaningful sentence embeddings. However, for our project we have chosen the BERT model due to high performance on unseen data as it's been trained on large corpuses of common texts and it's ability to handle dirty data due to its tokenisation architecture. More specifically, we will be using the [DistilBERT model](https://huggingface.co/transformers/model_doc/distilbert.html) from the HuggingFace transformer library.

* DistilBERT processes the sentence and passes along some information it extracted from it on to the next model. DistilBERT is a smaller version of BERT developed and open sourced by the team at HuggingFace. It’s a lighter and faster version of BERT that roughly matches its performance.
* We’ll be using [BertForSequenceClassification](https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#bertforsequenceclassification). This is the normal BERT model with an added single linear layer on top for classification that we will use as a sentence classifier. As we feed input data, the entire pre-trained BERT model and the additional untrained classification layer is trained on our specific task.

The data we pass between the two models is a vector of size 768 (The [CLS] vector). We can think of this of vector as an embedding for the sentence that we can use for classification. 


## 2.1 Sentence Embeddings with BERT
There are many flavours of BERT out there, all trained for a variety of different use cases. However, the model we are using in our project is [DistilBERT](https://huggingface.co/transformers/model_doc/distilbert.html) from the HuggingFace transformers library as it's simple to use.

Please follow the following steps to generate a dataframe of sentence embeddings with their corresponding forum labels.

1. Tokenise the sentences
2. Pad & truncate all sentences to a single constant length for batch processing
3. Explicitly differentiate real tokens from padding tokens with an “attention mask” 
4. Pass tokenised sentences through BERT to generate sentence embedding features


* See this [visual starter notebook](https://colab.research.google.com/drive/1elYlJ_JKupvMJwLtuwizoYIBmwwjgaur?usp=sharing) for further understanding of how to use the BERT model.

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tutorial-sentence-embedding.png" />


Let’s extract the sentences and labels of our training set as numpy ndarrays.

In [78]:
# Get the lists of sentences and their labels to np array.
sentences = df_all.cooked.values
labels = df_all.forum_name_encoded.values

Let’s apply the tokenizer to one sentence just to see the output.

---



When we actually convert all of our sentences, we’ll use the `tokenizer.encode` function to handle both steps, rather than calling `tokenize` and `convert_tokens_to_ids` separately.

In [79]:
from transformers import DistilBertTokenizer

# Load pretrained DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Sample df
df_all[['cooked', 'forum']].sample(5)

Unnamed: 0,cooked,forum
8351,Why did you teplace that stuff Symptoms Seems ...,CarTalk
11999,rockyfeb821 3 days later the brake pedal is st...,CarTalk
11695,I am NEVER getting under a vehicle supported b...,CarTalk
12566,It s going to cost a lot of money and that s j...,CarTalk
2555,Today I tryed to refill my coolant and power s...,CarTalk


In [80]:
# Print the original sentence.
print(' Original: ', sentences[1])

# Print the sentence split into tokens.
print('Tokenized: ', tokenizer.tokenize(sentences[1]))

# Print the sentence mapped to token ids.
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentences[1])))



 Original:  Here s a description of how to clean your vehicle s EGR valve courtesy of the folks at NAPA http knowhow napaonline com clean egr valve If you were talking about the throttle body the folks at Mobil have you covered https www mobil com en lubricants for personal vehicles auto care vehicle maintenance how to do it yourself proper throttle body cleaning guide 
Tokenized:  ['here', 's', 'a', 'description', 'of', 'how', 'to', 'clean', 'your', 'vehicle', 's', 'e', '##gr', 'valve', 'courtesy', 'of', 'the', 'folks', 'at', 'nap', '##a', 'http', 'know', '##how', 'nap', '##ao', '##nl', '##ine', 'com', 'clean', 'e', '##gr', 'valve', 'if', 'you', 'were', 'talking', 'about', 'the', 'throttle', 'body', 'the', 'folks', 'at', 'mob', '##il', 'have', 'you', 'covered', 'https', 'www', 'mob', '##il', 'com', 'en', 'lu', '##bri', '##can', '##ts', 'for', 'personal', 'vehicles', 'auto', 'care', 'vehicle', 'maintenance', 'how', 'to', 'do', 'it', 'yourself', 'proper', 'throttle', 'body', 'cleaning',

The below cell will perform one tokenisation pass of the dataset in order to measure the maximum sentence length.

In [81]:
max_len = 0
# For first 10 sentences - 
for s in sentences[:10]:
    # Tokenize the text and add `[CLS]` and `[SEP]` tokens.
    input_ids = tokenizer.encode(s, add_special_tokens=True)
    # Update the maximum sentence length.
    max_len = max(max_len, len(input_ids))

print('Max sentence length: ', max_len)

Max sentence length:  177




For BERT, all sentences must be padded or truncated to a single, fixed length. The maximum sentence length is 512 tokens so you may have to have to split the post_text. The maximum length does impact training and evaluation speed, however. For example, with a Tesla K80 (Colab GPU):

`MAX_LEN = 128 --> Training epochs take ~5:28 each`

`MAX_LEN = 64 --> Training epochs take ~2:57 each`

Try and encode the the dataset with the DistilBertTokenizer below.

See [docs here](https://huggingface.co/transformers/main_classes/tokenizer.html?highlight=encode_plus#transformers.PreTrainedTokenizer.encode) for `tokenizer.encode` . 


In [83]:
# !pip install nltk
import nltk

nltk.download('punkt')

# Splitting sentences over 128 strings

for s in sentences[0:100]:
  if len(s)>512:
    print(nltk.tokenize.sent_tokenize(s))

['You may have already figured this out but this exact thing happened to my 2008 Odyssey The issue is that the AC Clutch relay is stuck What happens is the clutch for the compressor stays engaged even when the car is off and creates a parasitic draw on your battery which is why it is dead every morning It is also the reason your car was making that noise and smoking after you jumped it the compressor was engaged all the time even when the air was off All you have to do is go to the fuse box under the hood on the left side as you look into the engine passenger side Pop off the top of the box and its a little black rectangle plug in that is farthest left and closest to you It shows a picture of a snowflake for this on the bottom of the fuse box top It s about a 6 dollar replacement at an auto store Hope this helps']
['You probably have more than one issue If the battery is fine the alternator should be too unless an alternator failure just happened Still the battery will run down if the 

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\1hrit\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### 2.1.2 Padding the Sentences and Attention Mask 

- For BERT, all sentences must be padded or truncated to a single, fixed length.
The maximum sentence length is 512 tokens.
- Padding is done with a special [PAD] token, which is at index 0 in the BERT vocabulary. The below illustration demonstrates padding out to a “MAX_LEN” of 8 tokens.




<img src="http://www.mccormickml.com/assets/BERT/padding_and_mask.png" width="500"/>

The “Attention Mask” is simply an array of 1s and 0s indicating which tokens are padding and which aren’t (seems kind of redundant, doesn’t it?!). This mask tells the “Self-Attention” mechanism in BERT not to incorporate these PAD tokens into its interpretation of the sentence.


Therefore the encoding task requires the following - 

1. Split the sentence into tokens.
2. Add the special [CLS] and [SEP] tokens.
3. Pad or truncate all sentences to the same length.
4. Create the attention masks which explicitly differentiate real tokens from [PAD] tokens.
5. Map the tokens to their IDs.


You should try implement the padding and attention masks yourself with matrix multiplication via numpy. It is trained on lower-cased English text. Hence we set the flag **do_lower_case** to true in BertTokenizer.


Otherwise, the first four features are in `tokenizer.encode`, but you can also try use the `tokenizer.encode_plus` to get the fifth item (attention masks). Documentation is [here](https://huggingface.co/transformers/main_classes/tokenizer.html?highlight=encode_plus#transformers.PreTrainedTokenizer.encode_plus).


In [None]:
import torch

# Tokenize all of the sentences and map the tokens to thier word IDs.
token_ids = []
attention_masks = []
T = 128
# For every sentence...
for s in sentences:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start and append the `[SEP]` token to the end.
    #   (3) Pad or truncate the sentence to `max_length`
    #   (4) Create attention masks for [PAD] tokens
    #   (5) Map tokens to their IDs.

    # You can encode_plus as function
    encoded_dict = tokenizer.encode_plus(
                        s,                              # Sentence to encode.
                        add_special_tokens = True,      # Add '[CLS]' and '[SEP]'
                        max_length = T,                 # Pad & truncate all sentences.
                        pad_to_max_length = True,
                        return_attention_mask = True,   # Construct attn masks.
                        return_tensors = 'pt',
                                 # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.
    token_ids.append(encoded_dict['input_ids'])
    
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors.
token_ids = torch.cat(token_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)

# Print sentence 0, now as a list of IDs.
print('Original: ', sentences[0])
print('Token IDs:', token_ids[0])

ValueError: ignored

## 3.1 Train the Classification Model

### 3.1.1 DistilBert For Sequence Classification


For this task, we first want to modify the pre-trained BERT model to give outputs for classification, and then we want to continue training the model on our dataset until that the entire model, end-to-end, is well-suited for our task.

Thankfully, the HuggingFace Pytorch implementation includes a set of interfaces designed for a variety of NLP tasks. Though these interfaces are all built on top of a trained DistilBERT model, each has different top layers and output types designed to accomodate their specific NLP task.


We’ll be using [DistilBertForSequenceClassification](https://huggingface.co/transformers/v2.2.0/model_doc/distilbert.html#distilbertforsequenceclassification). This is the normal DistilBERT model with an added single linear layer on top for classification that we will use as a sentence classifier. As we feed input data, the entire pre-trained BERT model and the additional untrained classification layer is trained on our specific task.

We then pass the sentence embeddings and features through to the linear regression model to evaluate forum predictions.

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-train-test-split-sentence-embedding.png" />

<!-- 
1. Append the classification  layer to the BERT model


<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-train-test-split-sentence-embedding.png" />

### [Optional] Grid Search for Parameters
We can dive into Logistic regression directly with the Scikit Learn default parameters, but sometimes it's worth searching for the best value of the C parameter, which determines regularisation strength. -->

In [None]:
from torch.utils.data import TensorDataset, random_split

# Combine the training inputs into a TensorDataset.
dataset = TensorDataset(token_ids, attention_masks, labels)

# Create a 90-10 train-validation split.

# Calculate the number of samples to include in each set.
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

# Divide the dataset by randomly selecting samples.
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print(f'{train_size:>5,} training samples')
print(f'{val_size:>5,} validation samples')

5,355 training samples
  595 validation samples


In [None]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# The DataLoader needs to know our batch size for training, so we specify it 
# here. For fine-tuning BERT on a specific task, the authors recommend a batch 
# size of 16 or 32.
batch_size = 32

# Create the DataLoaders for our training and validation sets.
# We'll take training samples in random order. 
train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
            val_dataset, # The validation samples.
            sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )

In [None]:
from transformers import DistilBertForSequenceClassification, AdamW, BertConfig

# Load DistilBertForSequenceClassification, the pretrained BERT model with a single 
# linear classification layer on top. 
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", # Use the 12-layer BERT model, with an uncased vocab.
    num_labels = 2, # The number of output labels--2 for binary classification.
                    # You can increase this for multi-class tasks.   
    output_attentions = False, # Whether the model returns attentions weights.
    output_hidden_states = False, # Whether the model returns all hidden-states.
)

# Tell pytorch to run this model on the GPU. Make sure you enable the runtime clicking [Runtime]->[Change Runtime Type]->[Hardware Accelerator]->GPU->[Save]
model.cuda()

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=442.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=267967963.0, style=ProgressStyle(descri…




DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

In [None]:
# Get all of the model's parameters as a list of tuples.
params = list(model.named_parameters())

print('The BERT model has {:} different named parameters.\n'.format(len(params)))

print('==== Embedding Layer ====\n')

for p in params[0:5]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== First Transformer ====\n')

for p in params[5:21]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== Output Layer ====\n')

for p in params[-4:]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

The BERT model has 104 different named parameters.

==== Embedding Layer ====

distilbert.embeddings.word_embeddings.weight            (30522, 768)
distilbert.embeddings.position_embeddings.weight          (512, 768)
distilbert.embeddings.LayerNorm.weight                        (768,)
distilbert.embeddings.LayerNorm.bias                          (768,)
distilbert.transformer.layer.0.attention.q_lin.weight     (768, 768)

==== First Transformer ====

distilbert.transformer.layer.0.attention.q_lin.bias           (768,)
distilbert.transformer.layer.0.attention.k_lin.weight     (768, 768)
distilbert.transformer.layer.0.attention.k_lin.bias           (768,)
distilbert.transformer.layer.0.attention.v_lin.weight     (768, 768)
distilbert.transformer.layer.0.attention.v_lin.bias           (768,)
distilbert.transformer.layer.0.attention.out_lin.weight   (768, 768)
distilbert.transformer.layer.0.attention.out_lin.bias         (768,)
distilbert.transformer.layer.0.sa_layer_norm.weight           (

### 3.1.2 Optimizer & Learning Rate Scheduler

Now that we have our model loaded we need to grab the training hyperparameters from within the stored model.

For the purposes of fine-tuning, the authors recommend choosing from the following values (from Appendix A.3 of the BERT paper):

* Batch size: 16, 32
* Learning rate (Adam): 5e-5, 3e-5, 2e-5
* Number of epochs: 2, 3, 4

We chose:

* Batch size: 32 (set when creating our DataLoaders)
* Learning rate: 2e-5
* Epochs: 4 (we’ll see that this is probably too many…)


The epsilon parameter eps = 1e-8 is “a very small number to prevent any division by zero in the implementation” (from [here](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)).

You can find the creation of the AdamW optimizer in run_glue.py [here](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L109).

In [None]:
# Note: AdamW is a class from the huggingface library (as opposed to pytorch) 
# I believe the 'W' stands for 'Weight Decay fix"
optimizer = AdamW(model.parameters(),
                  lr = 2e-5, # args.learning_rate - default is 5e-5, our notebook had 2e-5
                  eps = 1e-8 # args.adam_epsilon  - default is 1e-8.
                )


In [None]:
from transformers import get_linear_schedule_with_warmup

# Number of training epochs. The BERT authors recommend between 2 and 4. 
# We chose to run for 4, but we'll see later that this may be over-fitting the
# training data.
epochs = 4

# Total number of training steps is [number of batches] x [number of epochs]. 
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)

Below is our training loop. There’s a lot going on, but fundamentally for each pass in our loop we have a trianing phase and a validation phase.

**Training:**

* Unpack our data inputs and labels
* Load data onto the GPU for acceleration
* Clear out the gradients calculated in the previous pass.
* In pytorch the gradients accumulate by default (useful for things like RNNs) unless you explicitly clear them out.
* Forward pass (feed input data through the network)
* Backward pass (backpropagation)
* Tell the network to update parameters with optimizer.step()
* Track variables for monitoring progress

**Evalution:**
* Unpack our data inputs and labels
* Load data onto the GPU for acceleration
* Forward pass (feed input data through the network)
* Compute loss on our validation data and track variables for monitoring progress

Pytorch hides all of the detailed calculations from us, but we’ve commented the code to point out which of the above steps are happening on each line.

Define a helper function for calculating accuracy.

In [None]:
import numpy as np

# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

In [None]:
import time
import datetime

# Helper function for formatting elapsed times as hh:mm:ss
def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

In order for torch to use the GPU, we need to identify and specify the GPU as the device. Later, in our training loop, we will load data onto the device. 

In [None]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla P4


In [None]:
import random

# This training code is based on the `run_glue.py` script here:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128


# Set the seed value all over the place to make this reproducible.
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# Store the average loss after each epoch so we can plot them.
loss_values = []

# For each epoch...
for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================
    
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Measure how long the training epoch takes.
    t0 = time.time()

    # Reset the total loss for this epoch.
    total_loss = 0

    # Put the model into training mode. Don't be mislead--the call to 
    # `train` just changes the *mode*, it doesn't *perform* the training.
    # `dropout` and `batchnorm` layers behave differently during training
    # vs. test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)
    model.train()

    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):

        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            # Calculate elapsed time in minutes.
            elapsed = format_time(time.time() - t0)
            
            # Report progress.
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # Unpack this training batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using the 
        # `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        print(b_input_ids.shape)
        print(b_input_ids.shape)
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].type(torch.LongTensor).to(device)
       
        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because 
        # accumulating the gradients is "convenient while training RNNs". 
        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
        model.zero_grad()        

        # Perform a forward pass (evaluate the model on this training batch).
        # This will return the loss (rather than the model output) because we
        # have provided the `labels`.
        # The documentation for this `model` function is here: 
        # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
        outputs = model(b_input_ids,
                    attention_mask=b_input_mask, 
                    labels=b_labels)
        
        # The call to `model` always returns a tuple, so we need to pull the 
        # loss value out of the tuple.
        loss = outputs[0]

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value 
        # from the tensor.
        total_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

    # Calculate the average loss over the training data.
    avg_train_loss = total_loss / len(train_dataloader)            
    
    # Store the loss value for plotting the learning curve.
    loss_values.append(avg_train_loss)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epcoh took: {:}".format(format_time(time.time() - t0)))
        
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("")
    print("Running Validation...")

    t0 = time.time()

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Tracking variables 
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        
        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        
        # Telling the model not to compute or store gradients, saving memory and
        # speeding up validation
        with torch.no_grad():        

            # Forward pass, calculate logit predictions.
            # This will return the logits rather than the loss because we have
            # not provided labels.
            # token_type_ids is the same as the "segment ids", which 
            # differentiates sentence 1 and 2 in 2-sentence tasks.
            # The documentation for this `model` function is here: 
            # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
            outputs = model(b_input_ids, 
                            attention_mask=b_input_mask)
        
        # Get the "logits" output by the model. The "logits" are the output
        # values prior to applying an activation function like the softmax.
        logits = outputs[0]

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        
        # Calculate the accuracy for this batch of test sentences.
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        
        # Accumulate the total accuracy.
        eval_accuracy += tmp_eval_accuracy

        # Track the number of batches
        nb_eval_steps += 1

    # Report the final accuracy for this validation run.
    print("  Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
    print("  Validation took: {:}".format(format_time(time.time() - t0)))

print("")
print("Training complete!")


Training...


NameError: ignored

Try using Trainer class

In [None]:
import dataclasses
import logging
import os
import sys
from dataclasses import dataclass, field
from typing import Dict, Optional

import numpy as np

from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer, EvalPrediction, GlueDataset
from transformers import GlueDataTrainingArguments as DataTrainingArguments
from transformers import (
    HfArgumentParser,
    Trainer,
    TrainingArguments,
    glue_compute_metrics,
    glue_output_modes,
    glue_tasks_num_labels,
    set_seed,
)

logging.basicConfig(level=logging.INFO)

In [None]:
@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """

    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
    )
    cache_dir: Optional[str] = field(
        default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
    )

In [None]:
model_args = ModelArguments(
    model_name_or_path="distilbert-base-cased",
)
data_args = DataTrainingArguments(task_name="mnli", data_dir="./glue_data/MNLI")
training_args = TrainingArguments(
    output_dir="./models/model_name",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    per_gpu_train_batch_size=32,
    per_gpu_eval_batch_size=128,
    num_train_epochs=1,
    logging_steps=500,
    logging_first_step=True,
    save_steps=1000,
    evaluate_during_training=True,
)

In [None]:
def compute_metrics(p: EvalPrediction) -> Dict:
    preds = np.argmax(p.predictions, axis=1)
    return glue_compute_metrics(data_args.task_name, preds, p.label_ids)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

NameError: ignored

Made by Charlene Leong