## Data Generation

To train a model to classify faults with metrobuses, we first need some data. In this notebook we take a small data set of roughly 100 samples and use [Markovify](https://github.com/jsvine/markovify) to simulate a larger data set based on the original samples.

We obtained the small data set by asking our colleagues to describe issues they currently have, or have had in the past, with their vehicles.

**Note: you do not need to run this notebook in order to execute the remainder of the workshop. This notebook is solely for data generation. We have already generated the data for you, and it can be found in the `/dataset` folder.**

Let's take a look at that data: 

In [1]:
import pandas as pd

In [2]:
pd.set_option('display.max_colwidth', None)
df = pd.read_csv('dataset/response.csv') 
df.sample(10)

Unnamed: 0,Timestamp,What is the main issue you're having,Please pick one of these three symptoms:,Please pretend you're a customer. In the space below tell us in your own words what's going wrong with your car:,Please pick one of these three symptoms:.1,Please pretend you're a customer. In the space below tell us in your own words what's going wrong with your car:.1,Please pretend you're a customer. In the space below tell us in your own words what's going wrong with your car:.2
41,2021/03/04 1:19:59 PM CST,Other,,,,,I'm hearing a noise and feeling a vibration when I turn the wheel.
77,2021/03/04 3:46:43 PM CST,Other,,,,,I hear a rattling noise when i drive it above 60 mph
98,2021/03/05 12:22:23 PM CST,Other,,,,,"My battery keeps dying. This is the third one, and I don't get it. I turn the radio and a/c off, trying to use as little power as possible, and the batteries just keep dying after about a week. Aaarrrggghhh!!!!"
17,2021/03/04 12:10:05 PM CST,Brakes,,,Car makes grinding noise,"It was really cold out and I could not get the emergency brake to disengage. When I tried to drive, the car made horrible grinding noises.",
53,2021/03/04 1:52:00 PM CST,Starter,Car makes an odd noise,It makes a grinding sound for a few seconds and then nothing happens.,,,
103,2021/03/05 12:28:42 PM CST,Brakes,,,Car doesn't stop in timely manner,my car does not stop,
22,2021/03/04 12:34:25 PM CST,Other,,,,,"When it rains, the car slides under braking and specially slips when driving over painted lines on the road"
56,2021/03/04 1:54:16 PM CST,Other,,,,,"When I put the car in reverse, the hatchback latch opens so that the backup camera can engage. That's all good. But sometimes, the latch stays permanently opened while I'm driving and when I stop and park, I cannot open the hatchback due to the latch being stuck in the open position (but not engaged)."
58,2021/03/04 1:56:11 PM CST,Brakes,,,Car makes grinding noise,The brake squeak and any time the slightest pressure is applied to the brake pedal. And the brake pedal vibrates,
71,2021/03/04 2:28:34 PM CST,Other,,,,,It makes clunking noises when I go over a bump.


As you can see, the data contains a range of information, including the time the response was recorded, the type of issue, and a description of the problem. The information we are really interested in is the type of issue and the description. Let's go ahead and process this data, extracting only the info we need!

In [3]:
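# Empty cells are read as NaN; replace them with '' so the text columns can be concatenated.
df = df.fillna('')
# In the sample above each row has only one free-text answer, so joining the columns keeps that answer.
df['response'] = df.iloc[:, 3] + df.iloc[:, 5] + df.iloc[:, 6]
df['issue'] = df.iloc[:, 1]
# Likewise, at most one of the two symptom columns is filled in per row.
df['symptom'] = df.iloc[:, 2] + df.iloc[:, 4]
# Keep only the three columns we just added.
subset = df.iloc[:, -3:]
subset.sample(10)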

Unnamed: 0,response,issue,symptom
77,I hear a rattling noise when i drive it above 60 mph,Other,
33,"Sometimes the car doesn't start and it seems like the battery is dead, but I have replaced the battery three times over the past year, so I think maybe either I'm buying defective batteries or something about the car is wrecking perfectly good batteries that should be lasting years.",Starter,Car doesn't start
61,"Leak of a clear, cloudy, oil like substance in the back of the car.",Other,
49,Car creates whistle sound each time I try to start it.,Starter,Car makes an odd noise
43,There is a lag when I push the gas peddle. There isn't an immediate response.,Other,
105,my lights do not work,Other,
88,I have to press the ignition button at least twice before my car starts.,Starter,Car starts then stops
56,"When I put the car in reverse, the hatchback latch opens so that the backup camera can engage. That's all good. But sometimes, the latch stays permanently opened while I'm driving and when I stop and park, I cannot open the hatchback due to the latch being stuck in the open position (but not engaged).",Other,
62,Front right corner is dented from a collision.,Other,
90,"the privacy glass between my driver and my seating area doesn't work well. sometimes, it won't go up at all!",Other,
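The column-combining step relies on `fillna('')`: once missing cells are empty strings, adding the text columns together simply keeps whichever answer was filled in. Here is a minimal sketch on a toy frame; the column names `brakes_text`, `starter_text` and `other_text` are made up for illustration and are not the real survey headers.

```python
import pandas as pd

# Toy frame mimicking the survey layout: each row has exactly one
# free-text answer, and the other text columns are missing.
toy = pd.DataFrame({
    "issue": ["Brakes", "Starter", "Other"],
    "brakes_text": ["my car does not stop", None, None],
    "starter_text": [None, "It makes a grinding sound", None],
    "other_text": [None, None, "my lights do not work"],
})

# Replace missing values with '' so string concatenation keeps the one real answer.
toy = toy.fillna("")
toy["response"] = toy["brakes_text"] + toy["starter_text"] + toy["other_text"]

print(toy[["issue", "response"]])
```

Without the `fillna('')` call, a missing value in any of the columns would typically make the whole concatenated response missing as well.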


This looks much nicer! Let's explore our data:

In [4]:
subset.issue.unique()

array(['Brakes', 'Starter', 'Other'], dtype=object)

We only have 3 categories of issue - 'Brakes', 'Starter' and 'Other'. 

How much data do we have?

In [5]:
subset.shape

(109, 3)

In [6]:
subset['issue'].value_counts() 

Other      56
Brakes     28
Starter    25
Name: issue, dtype: int64

Only 109 rows. That is unlikely to be enough to train a successful categorization model.

Just over half of the data (56 of 109 rows) falls into the 'Other' category.
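To put numbers on the class balance, the counts above can be turned into shares. This is a small sketch that re-enters the three counts by hand rather than reading the CSV.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"Other": 56, "Brakes": 28, "Starter": 25})

# Share of each category, rounded to three decimals.
shares = (counts / counts.sum()).round(3)
print(shares)
```

On the real data, `subset['issue'].value_counts(normalize=True)` should give the same proportions directly.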

In the rest of this notebook, we use Markovify to generate more data for each class of issue. Markovify is a Markov Chain generator, and we are going to use it to simulate more responses for each type of issue.
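To make the idea of a Markov chain generator concrete, here is a minimal word-level sketch in plain Python. It is not Markovify's implementation: the helper names `build_chain` and `generate`, the `<s>`/`</s>` markers and the three-sentence corpus are all assumptions chosen for illustration. Each state is the last two words, and the next word is drawn from the words that followed that state in the corpus.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence boundary markers

def build_chain(sentences, state_size=2):
    """Map each run of `state_size` words to the words observed after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = [START] * state_size + sentence.split() + [END]
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            chain[state].append(words[i + state_size])
    return chain

def generate(chain, state_size=2, max_words=50, rng=random):
    """Walk the chain from the start state until the end marker is drawn."""
    state = (START,) * state_size
    out = []
    for _ in range(max_words):
        next_word = rng.choice(chain[state])
        if next_word == END:
            break
        out.append(next_word)
        state = state[1:] + (next_word,)
    return " ".join(out)

corpus = [
    "my car does not stop",
    "my car makes a grinding noise",
    "the brake pedal makes a squeaking noise",
]
chain = build_chain(corpus)
print(generate(chain))
```

Because "makes a" is followed by both "grinding" and "squeaking" in the corpus, the walk can produce sentences that never appeared in the input, such as "my car makes a squeaking noise".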

First we install Markovify, which is listed in our `nb-requirements.txt` file:

In [7]:
!pip install -r nb-requirements.txt

Collecting markovify==0.9.3
  Downloading markovify-0.9.3.tar.gz (28 kB)
  Preparing metadata (setup.py) ... done
Collecting tensorflow==2.7.0
  Downloading tensorflow-2.7.0-cp38-cp38-manylinux2010_x86_64.whl (489.6 MB)
     489.6/489.6 MB 258.2 MB/s eta 0:00:00
Collecting unidecode
  Downloading Unidecode-1.3.6-py3-none-any.whl (235 kB)
     235.9/235.9 kB 285.7 MB/s eta 0:00:00
Building wheels for collected packages: markovify
  Building wheel for markovify (setup.py) ... done
  Created wheel for markovify: filename=markovify-0.9.3-py3-none-any.whl size=18596 sha256=948ab65a7dfe4a0977130776657d2bd1702a29d2b9baf6bcde3b5cad356fb9cc
  Stored in directory: /tmp/pip-ephem-wheel-cache-fui8gtgw/wheels/14/aa/cc/2ed670f560efe25f80cbc90ad0220e80e4893ccf72f8099ec7
Successfully built markovify
Install

In [8]:
import markovify
import csv

In [13]:
# Train a Markov chain model on the responses for a single issue type.
def train_markov_type(data, issue):
    return markovify.Text(data[data["issue"] == issue].response, retain_original=False, state_size=2)

# Take one of the 'issue' models and create a randomly generated sentence of up to `length` characters.
def make_sentence(model, length=100):
    return model.make_short_sentence(length, max_overlap_ratio=.7, max_overlap_total=15)

# Build one model per issue type
other_model = train_markov_type(subset, "Other")
brakes_model = train_markov_type(subset, "Brakes")
starter_model = train_markov_type(subset, "Starter")

We can combine these models with relative weights:

In [14]:
import numpy

def generate_cases(models, weights=None):
    if weights is None:
        weights = [1] * len(models)

    choices = []  # List of (cumulative weight share, (model, category)) tuples

    total_weight = float(sum(weights))

    for i in range(len(weights)):
        choices.append((float(sum(weights[0:i+1])) / total_weight, models[i]))

    # Return a (model, category) tuple selected at random according to the given weights.
    def choose_model():
        r = numpy.random.uniform()
        for (model_weight, model) in choices:
            if r <= model_weight:
                return model
        return choices[-1][1]

    while True:
        local_model = choose_model()
        # local_model[0] is the markovify model, local_model[1] is the category
        sentence = make_sentence(local_model[0])
        # make_short_sentence returns None when it cannot build a sentence; skip those draws
        if sentence is not None:
            yield sentence, local_model[1]
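The weighted choice works by turning the weights into cumulative shares and returning the first entry whose share is at least the random draw `r`. The sketch below reproduces that logic for the weights 14, 7 and 7 used in this notebook; the helper `pick` is a hypothetical stand-in for `choose_model`. It also shows NumPy's built-in weighted sampling as an alternative, not as the method this notebook uses.

```python
import numpy as np

weights = [14, 7, 7]
labels = ["other", "brakes", "starter"]

# Cumulative shares act as upper bounds for each category.
thresholds = np.cumsum(weights) / np.sum(weights)
print(thresholds.tolist())  # [0.5, 0.75, 1.0]

def pick(r):
    """Return the first label whose cumulative share is at least r."""
    for threshold, label in zip(thresholds, labels):
        if r <= threshold:
            return label
    return labels[-1]

print(pick(0.1), pick(0.6), pick(0.9))  # other brakes starter

# Alternative: let NumPy draw labels directly with the given probabilities.
rng = np.random.default_rng(0)
draws = rng.choice(labels, size=10_000, p=np.array(weights) / np.sum(weights))
print((draws == "other").mean())
```

With these weights, about half of the draws should be 'other' and about a quarter each 'brakes' and 'starter'.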
   


We can now use this code to generate new sentences:

In [11]:
import numpy as np

generated_cases = generate_cases([(other_model,'other'), (brakes_model,'brakes'), (starter_model,'starter')], [14,7,7])

# Tuples with sentence and category
sentence_tuples = [next(generated_cases) for _ in range(1000)]  # create 1000 sentence/category tuples

In [12]:
sentence_tuples

[('The paint on the fuel line, and apparently it came loose while driving.',
  'other'),
 ('The car pulls to the air bag or seat belt.', 'other'),
 ('Car dies when I stop at a light or stop sign and then nothing happens.',
  'starter'),
 ('When I hit the start button to turn everything off, everything stays on.',
  'other'),
 ("AC doesn't work well.", 'other'),
 ('I have some noise when I open my rear trunk', 'other'),
 ("I can't get up to 88 MPH", 'other'),
 ("The car won't turn on, but it won't go up at all!", 'other'),
 ('Fuel economy has gotten very bad.', 'other'),
 ('Brake pads appear less than ¼ inch thick', 'brakes'),
 ('Parking brake doesn’t return once released', 'brakes'),
 ('Parking brake doesn’t return once released', 'brakes'),
 ('My cars breaks make a squeaking noise whenever I try to stop when I press down',
  'brakes'),
 ('My cars breaks make a terrible sound when stopping.', 'brakes'),
 ("I'm having an issue with fuel injection sensor or pressure.", 'other'),
 ('I thi

In [15]:
len(sentence_tuples)

1000
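Beyond the total count, it can be worth checking how the generated rows are spread across categories and whether any sentences repeat verbatim, as 'Parking brake doesn’t return once released' does in the output above. The sketch below runs on a small stand-in list, `sample_tuples`, built from a few of the sentences shown above; in the notebook you would pass `sentence_tuples` instead.

```python
from collections import Counter

# Stand-in for `sentence_tuples`: a few (sentence, category) pairs from the output above.
sample_tuples = [
    ("AC doesn't work well.", "other"),
    ("Fuel economy has gotten very bad.", "other"),
    ("Parking brake doesn’t return once released", "brakes"),
    ("Parking brake doesn’t return once released", "brakes"),
    ("Car dies when I stop at a light or stop sign and then nothing happens.", "starter"),
]

# How many generated rows fall into each category.
category_counts = Counter(category for _, category in sample_tuples)

# How many rows are exact repeats of an earlier row.
duplicate_rows = len(sample_tuples) - len(set(sample_tuples))

print(category_counts)
print(duplicate_rows)
```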

We can save these new issues and responses to a file, which we will use later to train our model.

In [16]:
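# Write to csv file (no header row); newline='' is what the csv module recommends for files it writes
with open('dataset/testdata1.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file, delimiter=',', lineterminator='\n')
    writer.writerows(sentence_tuples)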
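Note that the file is written without a header row, so whatever reads it later has to supply column names. Here is a minimal round-trip sketch using a temporary file and two example rows; the column names `response` and `issue` are our own choice for illustration and are not stored in the file.

```python
import csv
import os
import tempfile

import pandas as pd

rows = [
    ("AC doesn't work well.", "other"),
    ("Brake pads appear less than ¼ inch thick", "brakes"),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv")

    # Write the tuples the same way as above: one row per tuple, no header.
    with open(path, "w", newline="", encoding="utf-8") as file:
        csv.writer(file, delimiter=",", lineterminator="\n").writerows(rows)

    # header=None tells pandas the first line is data, not column names.
    loaded = pd.read_csv(path, header=None, names=["response", "issue"], encoding="utf-8")

print(loaded)
```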

At this point we have created a new data set, and we can transform this data and train a classification model. 

Let's head to notebook [01-Create-Claims-Classification.ipynb](01-Create-Claims-Classification.ipynb), where we process the data and train a classifier. 